One of the common myths about web search engines is that the number of hosts or the number of web pages in the index is a good measure of an engine's comprehensiveness. There are many reasons why neither of these is a good indicator; I'll briefly describe a few of them. The number of hosts says nothing about how many pages within each host are actually indexed. The number of pages is inherently flawed because it is currently impossible to determine the number of unique normalized URLs: the same file can often be reached through several different paths, since file systems are not acyclic and may provide multiple links to the same file (the sketch after this paragraph illustrates the problem). Another problem is that CGI programs can generate an unlimited number of unique HTML pages. Yet another difficult, if not unsolvable, problem is that the WWW is dynamic: pages are created and destroyed continuously. Measuring the size of the web might be as difficult as measuring the size of the universe.
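To make the URL-aliasing problem concrete, here is a minimal sketch in Python of the kind of normalization an indexer might attempt. It is purely illustrative, not a description of what any actual search engine does, and the normalization rules and example URLs are assumptions chosen for the demonstration.

    # A minimal, illustrative URL normalizer: lowercase the scheme and
    # host, drop the default HTTP port, collapse "." and ".." path
    # segments, and discard the fragment.
    from urllib.parse import urlsplit, urlunsplit
    import posixpath

    def normalize(url):
        scheme, netloc, path, query, fragment = urlsplit(url)
        scheme = scheme.lower()
        netloc = netloc.lower()
        if scheme == "http" and netloc.endswith(":80"):
            netloc = netloc[:-3]          # drop the default port
        path = posixpath.normpath(path) if path else "/"
        if path == ".":
            path = "/"
        return urlunsplit((scheme, netloc, path, query, ""))

    # All three spellings reduce to http://example.com/a/index.html
    print(normalize("HTTP://Example.com:80/a/./b/../index.html"))
    print(normalize("http://example.com/a/index.html#section2"))
    print(normalize("http://example.com/a/index.html"))

Even with such normalization, two syntactically unrelated URLs that the server's filesystem maps to the same file, or a CGI program that emits a fresh "unique" page for every request, remain invisible to the indexer without fetching and comparing the documents themselves.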
Regarding the comprehensiveness of web search engines: the content of the web is composed of text, images, sound, and video, and it is discouraging to realize that none of the current search engines allow the user to search on anything but text. Thus the vast majority of the WWW is completely ignored. Within the universe of text, most popular search engines allow the user to search over both the web and the newsgroups. Some, such as Infoseek, also index current news wires, which provide up-to-date information on world news.
In summary, no search engine can claim to search the entire WWW, nor can any search engine even know how many unique pages it covers. The numbers they list are popularly believed to be lower bounds, but even this is not necessarily true. A search engine can assert that it covers a certain number of URLs, but it is never certain that those URLs are unique, in part because the local filesystem at the remote page server may provide multiple links to the same file.