Web hosting with free domain names in europeCpanel X unlimited pop3 accounts with linux web servers
dedicated server uptimeGreek Lang | gr domain name registration 
SERVICES
Web Hosting
Cpanel Free Scripts
Dedicated Servers
Servers Stock
Network
Web Design
Domain Parking
Domain Registration
Data Center Tour
FREE WEB TOOLS
Flash Toolbar Generators
Graphic Toolbar Generators
DHTML / CSS Menu Generators
Java Script Menu Generators
MAILING LIST
Sign up to our mailing list
E-mail:
I want to:
SSL
Google Ads

The Anatomy of a Web Search Engine

4.3 Crawling the Web

Running a web crawler is a challenging task. There are tricky performance and reliability issues and even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system.

In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per second using four crawlers. This amounts to roughly 600K per second of data. A major performance stress is DNS lookup. Each crawler maintains a its own DNS cache so it does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to host, sending request, and receiving response. These factors make the crawler a complex component of the system. It uses asynchronous IO to manage events, and a number of queues to move page fetches from state to state.

It turns out that running a crawler which connects to more than half a million servers, and generates tens of millions of log entries generates a fair amount of email and phone calls. Because of the vast number of people coming on line, there are always those who do not know what a crawler is, because this is the first one they have seen. Almost daily, we receive an email something like, "Wow, you looked at a lot of pages from my web site. How did you like it?" There are also some people who do not know about the robots exclusion protocol , and think their page should be protected from indexing by a statement like, "This page is copyrighted and should not be indexed", which needless to say is difficult for web crawlers to understand. Also, because of the huge amount of data involved, unexpected things will happen. For example, our system tried to crawl an online game. This resulted in lots of garbage messages in the middle of their game! It turns out this was an easy problem to fix. But this problem had not come up until we had downloaded tens of millions of pages. Because of the immense variation in web pages and servers, it is virtually impossible to test a crawler without running it on large part of the Internet. Invariably, there are hundreds of obscure problems which may only occur on one page out of the whole web and cause the crawler to crash, or worse, cause unpredictable or incorrect behavior. Systems which access large parts of the Internet need to be designed to be very robust and carefully tested. Since large complex systems such as crawlers will invariably cause problems, there needs to be significant resources devoted to reading the email and solving these problems as they come up.

4.4 Indexing the Web

  • Parsing -- Any parser which is designed to run on the entire Web must handle a huge array of possible errors. These range from typos in HTML tags to kilobytes of zeros in the middle of a tag, non-ASCII characters, HTML tags nested hundreds deep, and a great variety of other errors that challenge anyone's imagination to come up with equally creative ones. For maximum speed, instead of using YACC to generate a CFG parser, we use flex to generate a lexical analyzer which we outfit with its own stack. Developing this parser which runs at a reasonable speed and is very robust involved a fair amount of work.
  • Indexing Documents into Barrels -- After each document is parsed, it is encoded into a number of barrels. Every word is converted into a wordID by using an in-memory hash table -- the lexicon. New additions to the lexicon hash table are logged to a file. Once the words are converted into wordID's, their occurrences in the current document are translated into hit lists and are written into the forward barrels. The main difficulty with parallelization of the indexing phase is that the lexicon needs to be shared. Instead of sharing the lexicon, we took the approach of writing a log of all the extra words that were not in a base lexicon, which we fixed at 14 million words. That way multiple indexers can run in parallel and then the small log file of extra words can be processed by one final indexer.
  • Sorting -- In order to generate the inverted index, the sorter takes each of the forward barrels and sorts it by wordID to produce an inverted barrel for title and anchor hits and a full text inverted barrel. This process happens one barrel at a time, thus requiring little temporary storage. Also, we parallelize the sorting phase to use as many machines as we have simply by running multiple sorters, which can process different buckets at the same time. Since the barrels don't fit into main memory, the sorter further subdivides them into baskets which do fit into memory based on wordID and docID. Then the sorter, loads each basket into memory, sorts it and writes its contents into the short inverted barrel and the full inverted barrel.

4.5 Searching

The goal of searching is to provide quality search results efficiently. Many of the large commercial search engines seemed to have made great progress in terms of efficiency. Therefore, we have focused more on quality of search in our research, although we believe our solutions are scalable to commercial volumes with a bit more effort. The google query evaluation process is show in Figure 4.
  1. Parse the query.
  2. Convert words into wordIDs.
  3. Seek to the start of the doclist in the short barrel for every word.
  4. Scan through the doclists until there is a document that matches all the search terms.
  5. Compute the rank of that document for the query.
  6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
  7. If we are not at the end of any doclist go to step 4.
    Sort the documents that have matched by rank and return the top k.
Figure 4. Google Query Evaluation

To put a limit on response time, once a certain number (currently 40,000) of matching documents are found, the searcher automatically goes to step 8 in Figure 4. This means that it is possible that sub-optimal results would be returned. We are currently investigating other ways to solve this problem. In the past, we sorted the hits according to PageRank, which seemed to improve the situation.

4.5.1 The Ranking System

Google maintains much more information about web documents than typical search engines. Every hitlist includes position, font, and capitalization information. Additionally, we factor in hits from anchor text and the PageRank of the document. Combining all of this information into a rank is difficult. We designed our ranking function so that no particular factor can have too much influence. First, consider the simplest case -- a single word query. In order to rank a document with a single word query, Google looks at that document's hit list for that word. Google considers each hit to be one of several different types (title, anchor, URL, plain text large font, plain text small font, ...), each of which has its own type-weight. The type-weights make up a vector indexed by type. Google counts the number of hits of each type in the hit list. Then every count is converted into a count-weight. Count-weights increase linearly with counts at first but quickly taper off so that more than a certain count will not help. We take the dot product of the vector of count-weights with the vector of type-weights to compute an IR score for the document. Finally, the IR score is combined with PageRank to give a final rank to the document.

For a multi-word search, the situation is more complicated. Now multiple hit lists must be scanned through at once so that hits occurring close together in a document are weighted higher than hits occurring far apart. The hits from the multiple hit lists are matched up so that nearby hits are matched together. For every matched set of hits, a proximity is computed. The proximity is based on how far apart the hits are in the document (or anchor) but is classified into 10 different value "bins" ranging from a phrase match to "not even close". Counts are computed not only for every type of hit but for every type and proximity. Every type and proximity pair has a type-prox-weight. The counts are converted into count-weights and we take the dot product of the count-weights and the type-prox-weights to compute an IR score. All of these numbers and matrices can all be displayed with the search results using a special debug mode. These displays have been very helpful in developing the ranking system.

4.5.2 Feedback

The ranking function has many parameters like the type-weights and the type-prox-weights. Figuring out the right values for these parameters is something of a black art. In order to do this, we have a user feedback mechanism in the search engine. A trusted user may optionally evaluate all of the results that are returned. This feedback is saved. Then when we modify the ranking function, we can see the impact of this change on all previous searches which were ranked. Although far from perfect, this gives us some idea of how a change in the ranking function affects the search results.

5 Results and Performance

Query: bill clinton
http://www.whitehouse.gov/
100.00%  (no date) (0K) 
http://www.whitehouse.gov/ 
Office of the President
        99.67% (Dec 23 1996) (2K)  
        http://www.whitehouse.gov/WH/EOP/OP/html/OP_Home.html
Welcome To The White House
        99.98%  (Nov 09 1997) (5K)
        http://www.whitehouse.gov/WH/Welcome.html  
Send Electronic Mail to the President
        99.86%  (Jul 14 1997) (5K)  
        http://www.whitehouse.gov/WH/Mail/html/Mail_President.html 
mailto:president@whitehouse.gov
99.98% 
mailto:President@whitehouse.gov
        99.27% 
The "Unofficial" Bill Clinton 
94.06% (Nov 11 1997) (14K) 
http://zpub.com/un/un-bc.html 
Bill Clinton Meets The Shrinks 
         86.27%  (Jun 29 1997) (63K)  
         http://zpub.com/un/un-bc9.html 
President Bill Clinton - The Dark Side
97.27%  (Nov 10 1997) (15K) 
http://www.realchange.org/clinton.htm 
$3 Bill Clinton
94.73%  (no date) (4K) http://www.gatewy.net/~tjohnson/clinton1.html  Figure 4. Sample Results from Google
The most important measure of a search engine is the quality of its search results. While a complete user evaluation is beyond the scope of this paper, our own experience with Google has shown it to produce better results than the major commercial search engines for most searches. As an example which illustrates the use of PageRank, anchor text, and proximity, Figure 4 shows Google's results for a search on "bill clinton". These results demonstrates some of Google's features. The results are clustered by server. This helps considerably when sifting through result sets. A number of results are from the whitehouse.gov domain which is what one may reasonably expect from such a search. Currently, most major commercial search engines do not return any results from whitehouse.gov, much less the right ones. Notice that there is no title for the first result. This is because it was not crawled. Instead, Google relied on anchor text to determine this was a good answer to the query. Similarly, the fifth result is an email address which, of course, is not crawlable. It is also a result of anchor text.

All of the results are reasonably high quality pages and, at last check, none were broken links. This is largely because they all have high PageRank. The PageRanks are the percentages in red along with bar graphs. Finally, there are no results about a Bill other than Clinton or about a Clinton other than Bill. This is because we place heavy importance on the proximity of word occurrences. Of course a true test of the quality of a search engine would involve an extensive user study or results analysis which we do not have room for here. Instead, we invite the reader to try Google for themselves at http://google.stanford.edu/ .

5.1 Storage Requirements

Aside from search quality, Google is designed to scale cost effectively to the size of the Web as it grows. One aspect of this is to use storage efficiently. Table 1 has a breakdown of some statistics and storage requirements of Google. Due to compression the total size of the repository is about 53 GB, just over one third of the total data it stores. At current disk prices this makes the repository a relatively cheap source of useful data. More importantly, the total of all the data used by the search engine requires a comparable amount of storage, about 55 GB. Furthermore, most queries can be answered using just the short inverted index. With better encoding and compression of the Document Index, a high quality web search engine may fit onto a 7GB drive of a new PC.
Storage Statistics Total Without Repository 55.2 GB Total With Repository 108.7 GB
Total Size of Fetched Pages 147.8 GB
Compressed Repository 53.5 GB
Short Inverted Index 4.1 GB
Full Inverted Index 37.2 GB
Lexicon 293 MB
Temporary Anchor Data 
(not in total)
6.6 GB
Document Index Incl. 
Variable Width Data
9.7 GB
Links Database 3.9 GB
Web Page Statistics
Number of Web Pages Fetched 24 million
Number of Urls Seen 76.5 million
Number of Email Addresses 1.7 million
Number of 404's 1.6 million
Table 1. Statistics

 5.2 System Performance

It is important for a search engine to crawl and index efficiently. This way information can be kept up to date and major changes to the system can be tested relatively quickly. For Google, the major operations are Crawling, Indexing, and Sorting. It is difficult to measure how long crawling took overall because disks filled up, name servers crashed, or any number of other problems which stopped the system. In total it took roughly 9 days to download the 26 million pages (including errors). However, once the system was running smoothly, it ran much faster, downloading the last 11 million pages in just 63 hours, averaging just over 4 million pages per day or 48.5 pages per second. We ran the indexer and the crawler simultaneously. The indexer ran just faster than the crawlers. This is largely because we spent just enough time optimizing the indexer so that it would not be a bottleneck. These optimizations included bulk updates to the document index and placement of critical data structures on the local disk. The indexer runs at roughly 54 pages per second. The sorters can be run completely in parallel; using four machines, the whole process of sorting takes about 24 hours.
<Back

 

WEBMASTERS
Search Engine Submit Global
Web Hosting FAQ
Web Hosting Glossary
Search engine ranking tips
Download free scripts
Keyword Suggestion Tool
Downloads
Google Page Ranking
Search Engine Analysis
Robots Index
Web Crawlers
Affiliates
WHOIS
SUPPORT
24/7 Help Desk
Cpanel
Contact
WE RECOMMEND
   
Dependable Linux Servers providing cheap web hosting worldwide
INTRO | HOME | WEB HOSTING | DEDICATED SERVERS | DEDICATED SERVERS STOCK | NETWORK DIAGRAMM |WEB DESIGN | DOMAIN PARKING | FREE FLASH MENU GENERATORS | FREE GRAPHICS NAVBARS | DHTML/CSS CODE GENERATORS | JAVA SCRIPT CSS CODE GENERATORS | FREE SEARCH ENGINE SUBMISSION | WEB HOSTING F.A.Q | WEB HOSTING GLOSSARY | WEEKLY SEARCH ENGINE RANKING TIPS | DOWNLOAD FREE SCRIPTS & PROGRAMMS | SEARCH ENGINE ANALYSIS | SEARCH TERM SUGESSTION TOOL | TECH NEWS FEED | DOWNLOAD FREE HTML TOOLS | GOOGLE PAGE RANK TIPS | ROBOTS INDEX | WEB CRAWLERS | CPANEL DOCUMENTATION | TERMS OF USE | CONTACT | FORUMS
© 2002 Hostsun™ All wrignts reserved

Dedicated servers provider in Europe and Greece