Teaching computers what objects look like is more difficult than teaching children, because computers need an extra step between ‘point’ and ‘identify.’ 

Ironically, most online searches for images and videos rely on words that you type into search engines such as Google, Yahoo or Bing. But that works only if the image files are tagged or have descriptive names, and most images on the Web do not.

Now what? Hand-tag each and every image and video? With more than 100 billion images on Facebook and tens of hours of video uploaded to YouTube every single minute, good luck with that. The vast majority of images are untagged; they make up what is called the ‘deep Web,’ which is – for all practical purposes – inaccessible. (Think about how difficult it can be to find a particular photo on your hard drive – or the shoebox in your closet, for that matter.)

What if we could teach computers to translate our textual queries into “visual queries” so that words can be matched to visual content? But computers see pictures only as a multitude of small colorful dots standing next to each other. How do you make these colorful pixels have any meaning for a machine?

This is the difficult question that computer vision researchers are asking. The challenge is to link low-level information (the pixels) with high-level concepts such as objects and scenes. It’s a little like teaching a child: Point to an object and identify it. But teaching computers is more difficult because computers need an extra step between ‘point’ and ‘identify.’

Scientists at Xerox’s research labs in Europe mulled over this problem and came to a simple realization. Just as search engines can determine what a document is about by counting how often particular words show up in it, a visual vocabulary can accomplish a similar result. The researchers decided to split images into smaller image patches, then developed algorithms that allow computers to group these patches into “visual words.” The computer then uses these words to predict the presence of an object. A simplified example would be predicting that an image shows a ‘house’ from visual words such as ‘window,’ ‘roof’ or ‘door.’ The paper that describes this process was published by Xerox researchers Csurka et al. 10 years ago, and it remains one of the most cited articles in computer vision research. The vast majority of algorithms proposed since then build on the same seminal “visual vocabulary” idea.
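
For readers who like to see the idea in code, here is a minimal sketch of the “visual vocabulary” pipeline described above. It is a toy illustration, not the researchers’ implementation: it assumes tiny grayscale images stored as lists of pixel rows, uses raw pixel values as the patch description, and groups patches with a bare-bones k-means. The function names (extract_patches, build_vocabulary, word_histogram) are hypothetical and chosen only for this example.

```python
from collections import Counter


def extract_patches(image, size):
    """Split a grayscale image (list of pixel rows) into non-overlapping
    size x size patches, each flattened into a tuple of pixel values."""
    patches = []
    for r in range(0, len(image) - size + 1, size):
        for c in range(0, len(image[0]) - size + 1, size):
            patches.append(tuple(image[r + i][c + j]
                                 for i in range(size) for j in range(size)))
    return patches


def sq_dist(a, b):
    """Squared Euclidean distance between two flattened patches."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(patch, words):
    """Index of the visual word closest to the patch."""
    return min(range(len(words)), key=lambda k: sq_dist(patch, words[k]))


def build_vocabulary(images, k, size, iters=10):
    """Group all patches into at most k clusters with a simple k-means.
    Each cluster center acts as one 'visual word'.
    Assumption: centers start from the first k distinct patches, which
    keeps this toy example deterministic."""
    patches = [p for img in images for p in extract_patches(img, size)]
    words = list(dict.fromkeys(patches))[:k]
    for _ in range(iters):
        groups = [[] for _ in words]
        for p in patches:
            groups[nearest_word(p, words)].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        words = [tuple(sum(col) / len(g) for col in zip(*g)) if g else w
                 for g, w in zip(groups, words)]
    return words


def word_histogram(image, words, size):
    """Describe an image by how often each visual word occurs in it,
    just as a text document is described by its word counts.
    Assumes the image is at least size x size pixels."""
    counts = Counter(nearest_word(p, words) for p in extract_patches(image, size))
    total = sum(counts.values())
    return [counts[k] / total for k in range(len(words))]


# Toy data: one dark image, one bright image, one half dark and half bright.
dark = [[0] * 4 for _ in range(4)]
bright = [[255] * 4 for _ in range(4)]
mixed = [[0, 0, 255, 255] for _ in range(4)]

vocabulary = build_vocabulary([dark, bright], k=2, size=2)
for name, img in (("dark", dark), ("bright", bright), ("mixed", mixed)):
    print(name, word_histogram(img, vocabulary, size=2))
```

The histogram is what a classifier would then use to predict the presence of an object. A real system would describe each patch with something more robust than raw pixels and would learn the vocabulary from many thousands of images.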

This technology has been applied to many problems of high practical value as varied as document routing in scanning workflows, vehicle recognition in surveillance videos, product recognition in retail businesses, or image aesthetic analysis in communication and marketing.

Learn More

More details about visual computing are available at:

A tribute to visual words and how they revolutionized computer vision

Computer Vision research at Xerox

Curated by Gregory Pings from an article written by Diane Larlus, a Xerox researcher.