
Image Recognition: Past, Present, and Future

Oliver Zeigermann is a speaker at ODSC West 2022 in San Francisco, USA. Be sure to check out his talk, “Image Recognition with OpenCV and TensorFlow,” there. ODSC West takes place from November 1 to 3, 2022, in San Francisco and virtually.


Let’s say you run a factory producing rings and want to tell the two models you have from each other. Maybe you want to do this for sorting or quality control:

Figure 1: Two different kinds of rings

People are typically pretty good at this, but for whatever reason you may not want them to do it. There are other options, though: if you can inspect those rings in a fairly controlled environment, classic image recognition is good at this as well. You check for a specific low-level feature that tells one category from the other. In this example it could simply be: find the ring and measure its diameter:

Figure 2: Different diameters detected by Hough Transformation

Detecting a circle can be done using the so-called Hough Transformation. Having detected the circles of the rings, we can calculate their diameters, which are quite distinct and allow us to tell one type of ring from the other. Libraries like OpenCV provide implementations of such basic routines, including the Hough Transformation.
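To make the voting idea behind the Hough Transformation a bit more tangible, here is a minimal sketch in plain Python. It is not how OpenCV implements it (in practice you would most likely call `cv2.HoughCircles` on a real photo); the synthetic ring outline, the helper names `circle_points`, `hough_circle`, and `classify_ring`, and the 50-pixel diameter threshold are all assumptions made for illustration.

```python
import math
from collections import Counter

def circle_points(cx, cy, r, n=90):
    """Synthetic edge pixels of a ring outline (stand-in for an edge-detected photo)."""
    return [
        (round(cx + r * math.cos(2 * math.pi * k / n)),
         round(cy + r * math.sin(2 * math.pi * k / n)))
        for k in range(n)
    ]

def hough_circle(edge_points, radii, n_angles=90):
    """Return the (cx, cy, r) candidate that collects the most votes."""
    directions = [
        (math.cos(2 * math.pi * k / n_angles), math.sin(2 * math.pi * k / n_angles))
        for k in range(n_angles)
    ]
    votes = Counter()
    for x, y in edge_points:
        for r in radii:
            # Every centre at distance r from this edge pixel is a candidate.
            candidates = {
                (round(x - r * dx), round(y - r * dy), r) for dx, dy in directions
            }
            votes.update(candidates)  # one vote per candidate per edge pixel
    return votes.most_common(1)[0][0]

def classify_ring(radius, threshold_diameter=50):
    """Hypothetical rule: rings up to threshold_diameter pixels wide are 'small'."""
    return "small" if 2 * radius <= threshold_diameter else "large"

# Detect one synthetic ring and classify it by its diameter.
cx, cy, r = hough_circle(circle_points(50, 40, 20), radii=range(15, 36))
print(f"centre=({cx}, {cy}), diameter={2 * r}, type={classify_ring(r)}")
```

Each edge pixel votes for all circles it could lie on, and the circle supported by the most pixels wins. On real images, the edge pixels would come from an edge detector, and the threshold would have to be calibrated against your actual rings.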

How to spot the fine Austin Squirrel – Classic might not always be enough

Figure 3: Two squirrels camouflaging in their natural environment

Now let’s have a look at those two pals. Lovely, aren’t they? You may notice that they have developed a kind of natural camouflage: their fur looks a lot like the trees they are hiding in.

Good for the squirrel, bad for us when we want to recognize animals like them. Looking for low-level features like certain colors or patterns will hardly be successful. Instead, we will have to use more high-level patterns that must cover the wide range of squirrels occurring in their natural habitat. It turns out that machine learning using a sequence of specific neural network layers is just the thing to do that.

Finding out which architecture best suits our specific image recognition task is something we should leave to the academic world. Instead, it is helpful to know which pre-defined architectures exist, how to choose the right one, and how to train them on our examples. TensorFlow with its Keras APIs offers those architectures pre-defined and even pre-trained, ready for us to train further or to adapt and fine-tune for our specific applications.
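As a rough sketch of what adapting a pre-trained architecture can look like, the following uses MobileNetV2 from `tf.keras.applications` as a frozen feature extractor and adds a small classification head on top. The choice of MobileNetV2, the 160-pixel input size, and the function name `build_classifier` are assumptions for illustration, not a recommendation from the talk.

```python
import tensorflow as tf

def build_classifier(num_classes, weights="imagenet", image_size=160):
    """Pre-trained backbone plus a new, trainable classification head."""
    base = tf.keras.applications.MobileNetV2(
        input_shape=(image_size, image_size, 3),
        include_top=False,  # drop the original ImageNet classifier
        weights=weights,    # "imagenet" downloads pre-trained weights
    )
    base.trainable = False  # keep the pre-trained features fixed at first

    inputs = tf.keras.Input(shape=(image_size, image_size, 3))
    # MobileNetV2 expects pixel values scaled to [-1, 1].
    x = tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1.0)(inputs)
    x = base(x, training=False)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Calling `build_classifier(2)` would give a two-class model (say, squirrel versus no squirrel) that can be trained with `model.fit` on your own labeled images. Once the new head has converged, you could unfreeze some of the backbone layers and continue training with a small learning rate to fine-tune.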

How to make sure we look for the right things?

The literature about image recognition is full of anecdotes about things like tanks being recognized by the snow or the blue sky they were photographed against rather than by the tank itself. The issue described here is called “overfitting”. Overfitting occurs when a machine learning model learns all kinds of features of the examples it is trained on, but does not concentrate on the relevant ones. It is thus not general enough to recognize similar objects that were not in the training set.

Using tools like Alibi Explain, we can segment the image into parts called superpixels. These can then be combined into so-called anchors and sent through the network until the network is confident it sees the same thing as in the complete image. This way we can check what the network considers essential in the image:

Figure 4: Our neural network detects a cat by its face

In the example above this looks reasonable. The anchor is pretty much what we would need to recognize a cat ourselves. In the tank example, such a procedure would highlight the background rather than the object. This way we would know that the network has not been trained properly and will not generalize well to anything it has not seen before.
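The anchor idea can be sketched with a toy example in plain Python. This is not Alibi Explain's API or its actual search procedure, which is more sophisticated; it is a simple greedy search under loud assumptions: a 4x4 grayscale image, four 2x2 superpixels, and a hypothetical classifier `toy_predict` that only looks at the top-left block.

```python
import random

def superpixel_of(y, x, block=2, blocks_per_row=2):
    """Map a pixel to the id of the grid superpixel it belongs to."""
    return (y // block) * blocks_per_row + (x // block)

def toy_predict(image):
    """Hypothetical classifier: says 'cat' when the top-left block is bright."""
    face = image[0][0] + image[0][1] + image[1][0] + image[1][1]
    return "cat" if face > 2 else "other"

def mask_image(image, keep):
    """Black out every superpixel that is not in `keep`."""
    return [
        [pix if superpixel_of(y, x) in keep else 0 for x, pix in enumerate(row)]
        for y, row in enumerate(image)
    ]

def precision(image, anchor, target, n_superpixels, rng, samples=200):
    """Share of random perturbations that keep the prediction when the anchor stays."""
    hits = 0
    for _ in range(samples):
        keep = set(anchor) | {s for s in range(n_superpixels) if rng.random() < 0.5}
        hits += toy_predict(mask_image(image, keep)) == target
    return hits / samples

def find_anchor(image, n_superpixels=4, threshold=0.95, seed=0):
    """Greedily add superpixels until the prediction is stable enough."""
    rng = random.Random(seed)
    target = toy_predict(image)
    anchor, remaining = [], list(range(n_superpixels))
    while remaining:
        best = max(
            remaining,
            key=lambda s: precision(image, anchor + [s], target, n_superpixels, rng),
        )
        anchor.append(best)
        remaining.remove(best)
        if precision(image, anchor, target, n_superpixels, rng) >= threshold:
            break
    return anchor

image = [[1] * 4 for _ in range(4)]
print("anchor superpixels:", find_anchor(image))
```

If the anchor consists of the superpixels that show the object itself, the model is probably looking at the right thing. If it consists of background, as in the tank anecdote, that is a warning sign.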

And the future?

The classic techniques and machine learning described here are both well established and work well in practice. They can be seen as the past and the present of image recognition. But what comes next? There are a couple of techniques that certainly look promising, but they may not yet be readily available or quite as mature.

The most promising approach for the future is the Vision Transformer (ViT). The idea is to phrase image recognition as a language problem using the successful transformer architecture. Images are split into patches, and each patch is transformed into a token that is then passed into a pretty standard transformer architecture.
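The patch splitting step can be sketched in a few lines of plain Python. This is only the first stage of a ViT and makes simplifying assumptions: a grayscale image stored as a nested list, and a hypothetical helper name `image_to_patch_tokens`. In an actual Vision Transformer, each flattened patch would additionally be projected to an embedding vector and combined with a position embedding before entering the transformer.

```python
def image_to_patch_tokens(image, patch_size):
    """Split a 2D image into non-overlapping square patches, each flattened to a list."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            tokens.append([
                image[y][x]
                for y in range(top, top + patch_size)
                for x in range(left, left + patch_size)
            ])
    return tokens

# A 4x4 "image" whose pixel values are simply their running index.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
for token in image_to_patch_tokens(image, patch_size=2):
    print(token)
```

The resulting sequence of tokens plays the same role for the transformer as a sequence of words does in a language model.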

Another approach is to train models on images along with their descriptions. After training, people can write descriptions of images for the model to generate. Sticking with our previous example of squirrels, this is my experiment to let the state-of-the-art model DALL-E 2 generate some very special squirrels for us using the description “A chubby green squirrel on the moon”:

Figure 5: DALL-E 2 generates “A chubby green squirrel on the moon”
