Anatomy of a search engine: Infrastructure of Google
Design and layout from a description provided by the founders of Google, Lawrence Page and Sergey Brin.
Dating from 1998, that description gives a simplified view. In particular, the single PageRank unit would now have to be split into several units to take multiple criteria into account, including, for example, user behavior.
Architecture of the search engine
(c) 2010 Scriptol.com/Scriptol.fr - Reuse forbidden in Web pages
The components
Crawlers | Several distributed crawlers parse the pages and extract links and keywords. |
URL Server | Provides the crawlers with a list of URLs to scan. |
Store Server | The crawlers send the collected data to a store server. It compresses the pages and places them in the repository. Each page is stored with an identifier, a docID. |
Repository | Contains a copy of the pages and images, allowing comparisons and caching. |
Indexer | It indexes the pages to be provided to SERPs (results pages). It decompresses documents and converts them into sets of word occurrences called "hits". It distributes the hits among a set of "barrels", which yields a partially sorted index. It also creates a list of the URLs found on each page. A hit contains the following information: the word, its position in the document, font size, capitalization. |
Barrels | These "barrels" are databases that classify documents by docID. They are created by the indexer and used by the sorter. |
Anchors | The bank of anchors created by the indexer contains internal links and text associated with each link. |
URL Resolver | It takes the contents of anchors, converts relative URLs into absolute addresses and finds or creates a docID. It builds an index of documents and a database of links. |
Doc Index | Contains the text related to each URL. |
Links | The database of links associates each link with a docID (and thus with a real document on the Web). |
PageRank | The software uses the database of links to define the PageRank of each page. |
Sorter | It works on the barrels: it takes the documents classified by docID and creates an inverted list sorted by wordID. |
Lexicon | A program called DumpLexicon takes the list provided by the sorter (classified by wordID) together with the lexicon created by the indexer (the sets of keywords in each page), and produces a new lexicon for the searcher. |
Searcher | It runs on a web server in a datacenter, uses the lexicon built by DumpLexicon in combination with the index classified by wordID, taking into account the PageRank, and produces a results page. |
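The path from Indexer to Barrels to Sorter to Searcher described in the table can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the code of the actual engine: the names Hit, index_documents, sort_barrels and search are hypothetical, documents are plain strings, font size is left out because plain text has none, and the rank scores passed to the searcher are made-up numbers.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One word occurrence (hypothetical structure; font size omitted)."""
    word_id: int
    position: int
    capitalized: bool


def index_documents(docs):
    """Indexer: convert each document into hits, grouped by docID."""
    lexicon = {}   # word -> wordID
    forward = {}   # docID -> list of hits (the "barrels", sorted by docID)
    for doc_id, text in docs.items():
        hits = []
        for position, token in enumerate(text.split()):
            word = token.strip(".,;:!?").lower()
            if not word:
                continue
            word_id = lexicon.setdefault(word, len(lexicon))
            hits.append(Hit(word_id, position, token[0].isupper()))
        forward[doc_id] = hits
    return lexicon, forward


def sort_barrels(forward):
    """Sorter: regroup the hits by wordID to obtain an inverted index."""
    inverted = defaultdict(dict)   # wordID -> {docID: [positions]}
    for doc_id in sorted(forward):
        for hit in forward[doc_id]:
            inverted[hit.word_id].setdefault(doc_id, []).append(hit.position)
    return dict(sorted(inverted.items()))


def search(query, lexicon, inverted, rank):
    """Searcher: documents containing every query word, best rank first."""
    doc_sets = []
    for word in query.lower().split():
        word_id = lexicon.get(word)
        if word_id is None:
            return []              # unknown word: no document can match
        doc_sets.append(set(inverted[word_id]))
    if not doc_sets:
        return []
    matches = set.intersection(*doc_sets)
    return sorted(matches, key=lambda d: (-rank.get(d, 0.0), d))


docs = {
    1: "Google indexes the Web",
    2: "The crawler fetches Web pages",
    3: "PageRank orders pages",
}
lexicon, forward = index_documents(docs)
inverted = sort_barrels(forward)
rank = {1: 0.2, 2: 0.5, 3: 0.3}   # made-up scores standing in for PageRank

print(search("web", lexicon, inverted, rank))        # [2, 1]
print(search("web pages", lexicon, inverted, rank))  # [2]
```

The forward index is keyed by docID, as the barrels are when the indexer fills them; the sorter only regroups the same hits by wordID, which is what lets the searcher go directly from a query word to the documents that contain it.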
It goes without saying that this scheme is simplified. The system tends to become increasingly complex over time as features are added.
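As an illustration of the Anchors, URL Resolver, Links and PageRank rows of the table, here is a minimal Python sketch. It rests on several assumptions: the function names resolve_anchors and pagerank, the example.com URLs and the anchor tuples are invented for the example; the score is computed with the commonly used normalized iteration (scores sum to 1) rather than a formula quoted from the paper; and spreading the score of a page without outgoing links evenly over all pages is one possible convention among others.

```python
from urllib.parse import urljoin


def resolve_anchors(anchors):
    """URL Resolver: make anchor URLs absolute and find or create docIDs."""
    doc_ids = {}   # absolute URL -> docID
    links = []     # (source docID, target docID): the database of links

    def doc_id_for(url):
        # Return the existing docID, or create the next one.
        return doc_ids.setdefault(url, len(doc_ids))

    for page_url, href, _anchor_text in anchors:
        source = doc_id_for(page_url)
        target = doc_id_for(urljoin(page_url, href))
        links.append((source, target))
    return doc_ids, links


def pagerank(links, n_docs, damping=0.85, iterations=50):
    """Iteratively compute a normalized PageRank from the links database."""
    out_links = {doc: [] for doc in range(n_docs)}
    for source, target in links:
        out_links[source].append(target)

    ranks = [1.0 / n_docs] * n_docs
    for _ in range(iterations):
        new_ranks = [(1.0 - damping) / n_docs] * n_docs
        for source, targets in out_links.items():
            if targets:
                share = damping * ranks[source] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Assumption: a page with no outgoing link shares its score
                # evenly among all pages.
                share = damping * ranks[source] / n_docs
                for target in range(n_docs):
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical anchors: (page URL, href as written in the page, link text)
anchors = [
    ("http://example.com/a.html", "b.html", "to B"),
    ("http://example.com/a.html", "/c.html", "to C"),
    ("http://example.com/b.html", "c.html", "to C"),
    ("http://example.com/c.html", "a.html", "back to A"),
]
doc_ids, links = resolve_anchors(anchors)
ranks = pagerank(links, len(doc_ids))
for url, doc_id in doc_ids.items():
    print(doc_id, url, round(ranks[doc_id], 3))
```

In this small graph, page c.html receives links from both other pages and ends up with the highest score, while b.html, linked only from a.html, gets the lowest.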
Reference:
The Anatomy of a Large-Scale Hypertextual Web Search Engine, by Larry Page and Sergey Brin.
This document is hosted by Stanford, and probably dates from 1998. It describes in depth the functioning of the search engine.
The present document is a summary and includes a new diagram made for Scriptol.com/Scriptol.fr.
Icons on this page were either created for Scriptol or taken from a free icon store.
References
- Caffeine explained. Colossus, the file system, and BigTable, the database, are directly linked to Caffeine, the backend of the search engine.
- How search works. Comprehensive explanation of the search engine by Google.
This page may be freely printed and distributed for educational purposes. It must not be reproduced on the Web; link to it instead.