August 2007 - June 2010
Course: PhD project
Software: C++, with the libcurl, libxml2, tinyxml and cximage
libraries; XAMPP with PHP
Description
This text-based internet search engine finds the newest images
for any particular query, with an emphasis on noteworthy imagery,
e.g. "what are the newest images for 'Star Wars' since my last visit?".
To do this, we keep track of images as they are downloaded, cluster
them by similarity and determine by date which ones are new. The
similarity clustering is done by using descriptive keywords extracted
from the page the image was found on and by using content features
extracted from the image.
Image crawler
One of our first goals is to download as many images as we can find on
the internet, as fast as possible. The fact that the internet is very
large - Google recently visited its 1 trillionth unique URL - and
constantly changing makes this a formidable task. Without large
quantities of storage, memory and processing power available, careful
thought is required in order to make this task manageable.
Over time our design has evolved from an approach that tried to use the most modern techniques to its current form, which uses old-fashioned, proven and robust methods. In the initial design, the image crawler consisted of a manager thread and many worker threads that all concurrently accessed the open source, transaction-based Oracle Berkeley DB to read/write link and image URLs and image data. In theory this worked beautifully, but we soon found out that the approach had several drawbacks. Even though threads share memory space with the host process, so they can easily exchange information, they are notoriously hard to kill when they go haywire and can crash the entire process when they do so. Also, many available open source libraries are not thread-safe, or only allow one thread at a time to use certain functionality. This can of course be avoided by writing each library ourselves, but it should not be hard to imagine that flawlessly replicating and improving existing functionality is difficult and extremely time consuming. As a final problem, the database caused our performance to grind to a halt when it was pummeled by large numbers of simultaneous requests. Even though the database is well known for its robustness and high performance, it is designed to handle large numbers of read requests (e.g. an airline travel database) and not the large numbers of write requests that our crawler generates.
Realizing that communication between manager and workers does not have to take place directly, we arrived at the current design, see Figure 1. At the moment, the manager and the workers run in separate processes and communication between them takes place using simple plain files on the hard disk. The manager prepares files containing unique URLs for the workers to look at, and the workers write files containing any URLs they have discovered. This approach instantly solves the thread-safety issues with existing open source libraries, and also relieves us of the need for a database that is accessible by multiple entities at the same time, because now the manager is the only one who needs access to it.
Figure 1. Communication between crawler manager and workers.
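To make the file-based hand-off concrete, the sketch below shows one way a worker could write the URLs it discovers to its own batch file, which the manager later reads back in. The directory layout, file naming scheme and function names are assumptions made for this illustration, not the project's actual code.

    // Hypothetical sketch of the file-based hand-off between manager and workers.
    #include <fstream>
    #include <string>
    #include <vector>

    // A worker appends every URL it discovers to its own batch file,
    // so no two processes ever write to the same file.
    void write_discovered_urls(int worker_id, int batch_no,
                               const std::vector<std::string>& urls)
    {
        std::string path = "outbox/worker" + std::to_string(worker_id) +
                           "_batch" + std::to_string(batch_no) + ".txt";
        std::ofstream out(path);
        for (const std::string& url : urls)
            out << url << '\n';
    }

    // The manager later reads such a batch file, one URL per line.
    std::vector<std::string> read_url_batch(const std::string& path)
    {
        std::vector<std::string> urls;
        std::ifstream in(path);
        std::string line;
        while (std::getline(in, line))
            if (!line.empty())
                urls.push_back(line);
        return urls;
    }

Because each worker owns its output files and only the manager reads them, no locking or shared database access is needed for this exchange.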
With the billions of links present on the internet, providing a uniqueness check for URLs requires some thought. As the check needs to be fast, all URLs must be kept in memory. However, storing each complete URL in memory and comparing URLs character by character quickly becomes infeasible in terms of both storage and processing. Therefore we represent each URL by hashing it to a 64-bit signature, using an adapted version of MD5 (Message-Digest algorithm 5), a widely used but partially insecure cryptographic hash function. Its strength is that highly similar inputs hash to completely different outputs, so collisions between URL signatures rarely occur. The signatures are stored in a hash tree, a combination of a hash list and a binary tree, allowing for a fast uniqueness check.
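The sketch below only illustrates the idea of the signature-based uniqueness check: for brevity it uses the standard library's std::hash in place of the adapted 64-bit MD5 signature, and an ordinary std::set (a balanced binary tree) in place of the custom hash tree, so it should be read as an assumption-laden illustration rather than the actual implementation.

    #include <cstdint>
    #include <functional>
    #include <set>
    #include <string>

    // Stand-in for the adapted 64-bit MD5 signature described above;
    // std::hash is used here purely for illustration.
    uint64_t url_signature(const std::string& url)
    {
        return static_cast<uint64_t>(std::hash<std::string>{}(url));
    }

    // Signatures seen so far; std::set stands in for the custom hash tree.
    std::set<uint64_t> seen;

    // Returns true if the URL has not been encountered before,
    // and records its signature so later duplicates are rejected.
    bool is_new_url(const std::string& url)
    {
        return seen.insert(url_signature(url)).second;
    }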
Our crawler is a polite crawler: it reads the robots.txt file of each site and follows the rules specified there. In addition, the URLs that the manager hands out to the workers are randomized, to avoid accessing the same web server too many times in a row.
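A minimal sketch of the randomization step might look as follows; the function name and the idea of shuffling each pending batch before it is handed out are illustrative assumptions.

    #include <algorithm>
    #include <random>
    #include <string>
    #include <vector>

    // Shuffle the pending URLs before they are handed out to workers,
    // so that consecutive requests rarely hit the same web server.
    void randomize_handout(std::vector<std::string>& pending_urls)
    {
        static std::mt19937 rng(std::random_device{}());
        std::shuffle(pending_urls.begin(), pending_urls.end(), rng);
    }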
Keywords and copies
The image information gathered by the image collectors is still in a
raw format and needs to be processed before it can be used by the
search engine. The raw information describes the title of the page the
image was found on, the text surrounding the image, and so on. From
this information descriptive keywords are extracted, which allow
users to search for images by typing in these keywords. As a special
feature, we have built in preliminary support for language
detection, so that the keyword extraction can be done more efficiently
and users can choose their preferred language when searching.
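As a rough illustration of the keyword extraction step, the sketch below lowercases the raw text gathered from the page (title, surrounding text), splits it into words and discards very short words and a small stopword list; the real extraction and the language detection are of course more involved, and the stopword list shown is a placeholder.

    #include <cctype>
    #include <set>
    #include <string>
    #include <vector>

    // Tiny illustrative stopword list; the real list would be per-language.
    static const std::set<std::string> kStopwords = {
        "the", "and", "for", "with", "from", "this", "that"
    };

    // Extract candidate keywords from raw page text (title, surrounding text).
    std::vector<std::string> extract_keywords(const std::string& raw_text)
    {
        std::vector<std::string> keywords;
        std::string word;
        for (char c : raw_text + " ") {
            if (std::isalnum(static_cast<unsigned char>(c))) {
                word += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
            } else if (!word.empty()) {
                if (word.size() > 2 && kStopwords.count(word) == 0)
                    keywords.push_back(word);
                word.clear();
            }
        }
        return keywords;
    }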
To ensure diversity in the images presented in response to a search request, we perform (near-)copy detection. For this, we selected the best-performing copy detection algorithm from our own experiments; see the copy detection project page for more information.
Figure 2. Communication between index manager and workers.
To enable fast processing, many concurrently running index workers extract keywords and features from the images. The index manager gathers the keywords and stores them in a database, and also compares features to detect image copies. See Figure 2 for the current design.
Search engine
Because each image is associated with descriptive keywords, users can
easily search for images using text. As mentioned before, we can
present the user with a more diverse set of image results using our
copy detection algorithm. We collapse copies and near-copies of images
into one representative image, thus leaving space for more image
results.
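The collapsing of (near-)copies in the result list can be pictured as follows; the Result structure and the assumption that the copy detector has already assigned a copy-group id to each image are made up for this sketch.

    #include <set>
    #include <vector>

    // Hypothetical search result: an image id plus the copy group it belongs to,
    // as assigned earlier by the copy detection step.
    struct Result {
        int image_id;
        int copy_group;
    };

    // Keep only one representative image per copy group, preserving ranking order,
    // so the result page shows a more diverse set of images.
    std::vector<Result> collapse_copies(const std::vector<Result>& ranked)
    {
        std::set<int> groups_shown;
        std::vector<Result> diverse;
        for (const Result& r : ranked)
            if (groups_shown.insert(r.copy_group).second)
                diverse.push_back(r);
        return diverse;
    }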
Figure 3. Interaction between the search engine and the user.
At the moment, the search engine is available for internal testing only. We will need to incorporate a few extra things to make it fully usable for the general public. First, we need to answer the question of what makes an image ‘noteworthy’. This is a difficult question, which makes it one of the main research aspects of this project. Taken to the extreme, for a movie fanatic anything related to movies is noteworthy and everything else is not; however, because we want to cater to the general public, we should broaden our scope. Many different techniques can be used, but initially we are thinking of using neural networks to learn whether an image is noteworthy or not. Also, we need to keep track of user identities to provide additional services. For instance, each image is tagged with the date it was downloaded, so once we know a user is interested in a particular topic we can show the newest images on that topic that the user hasn’t seen yet. Finally, we want to think about ways to propagate keywords from images that have them to images that do not. One way to achieve this would be to determine similarity between images by analyzing their pictorial content, so that an image with no or few keywords can inherit them from another, highly similar image.
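Once user identities are tracked, the "newest images since my last visit" request from the introduction reduces to a simple filter on the stored download date; the data structures below are assumptions made for illustration.

    #include <ctime>
    #include <vector>

    // Hypothetical indexed image: an id plus the date (Unix time) it was downloaded.
    struct IndexedImage {
        int image_id;
        std::time_t downloaded_at;
    };

    // Return the images on a topic that were downloaded after the user's last visit.
    std::vector<IndexedImage> newest_since(const std::vector<IndexedImage>& topic_images,
                                           std::time_t last_visit)
    {
        std::vector<IndexedImage> fresh;
        for (const IndexedImage& img : topic_images)
            if (img.downloaded_at > last_visit)
                fresh.push_back(img);
        return fresh;
    }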
Publications
The noteworthy search engine will be discussed in the upcoming Eureka
magazine issue.