A paper from MIT, to appear in CVPR 2008, tries to push the frontiers of object recognition. One lesson from modern search engines is that very simple algorithms can give remarkable performance when they exploit data at internet scale. However, applying such an approach to object recognition is computationally daunting: even downloading 80 million images, let alone experimenting with such a huge dataset, is a challenge.

This is the motivation for finding short numerical representations, derived from an image, that still provide a useful indication of its content. A short representation would allow real-time solutions to object recognition tasks that are otherwise extremely computationally intensive. As described in this news article, it has been shown that representing images as a compact binary code (as few as 256 bits per image) captures the essential content of an image. With these codes, a database of 12.9 million images takes up less than 600 MB of memory, which easily fits in the RAM of a PC. Such a representation would thus enable real-time searches over millions of images on a single machine. The object recognition results reported in the paper with these short codes are comparable to the results obtained using the full descriptors. Because information is lost in reducing the number of bits representing an image, complex or unusual images are less likely to be matched correctly. But for the most common objects in pictures (people, cars, flowers, buildings) the results are quite impressive.
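To make the search idea concrete, here is a minimal sketch of how one might look up similar images once each image is reduced to a 256-bit code: nearest neighbors are found by Hamming distance (the number of differing bits), which is extremely cheap to compute. The database of random codes and the query below are illustrative stand-ins, not the paper's actual codes or learning method.

```python
import random

def hamming(a: int, b: int) -> int:
    # Hamming distance: count of bit positions where the two codes differ.
    return bin(a ^ b).count("1")

def nearest(query: int, database: list) -> int:
    # Linear scan over all codes; with 256-bit codes this stays fast
    # even for millions of entries, since each comparison is one XOR
    # plus a popcount.
    return min(range(len(database)), key=lambda i: hamming(query, database[i]))

# Hypothetical database of 256-bit codes (the real system stores
# ~12.9 million such codes in under 600 MB).
random.seed(0)
db = [random.getrandbits(256) for _ in range(10_000)]

# A query resembling entry 42, with two bits flipped.
query = db[42] ^ (1 << 7) ^ (1 << 200)
match = nearest(query, db)  # expected to recover index 42
```

Since two unrelated 256-bit codes differ in roughly 128 bits on average, a query only a couple of bits away from a stored code is recovered reliably by this scan.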

Che Guevara

The same project also produced a neat application called the visual dictionary. A list of all nouns in the English language was obtained from WordNet, and images for each word were gathered via Google image search. Each tile in the image above is the average of 140 images; this average captures the dominant visual characteristics of the word. The average can be a recognizable image (as with Che Guevara above) or just a colored blob. The tiles are arranged so that the proximity of two tiles reflects their semantic distance. Thus the poster explores the relationship between visual and semantic similarity.
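The averaging step itself is simple: stack the images for a word and take the per-pixel mean. A minimal sketch, using random arrays in place of the 140 real search-result images (the image size and data below are assumptions for illustration):

```python
import numpy as np

def average_image(images):
    # images: list of (H, W, 3) uint8 arrays, all the same size.
    # Accumulate in float to avoid uint8 overflow, then convert back.
    stack = np.stack([img.astype(np.float64) for img in images])
    return stack.mean(axis=0).astype(np.uint8)

# Toy stand-in: 140 random "images" per word, mirroring the paper's
# choice of averaging 140 search results.
rng = np.random.default_rng(0)
imgs = [rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
        for _ in range(140)]
avg = average_image(imgs)
```

When the 140 source images share a dominant layout (a centered face, say), their mean preserves it; when they do not, the mean washes out to a blob, which is exactly the effect visible in the tiles.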