PageRank and sandbox
This document summarizes a Google patent dated April 2007. It explains in detail how each page of a website is assigned the score that determines its position in the search engine's results. All the criteria that determine the rank of a page are analyzed, and in the process the causes of the sandbox effect are revealed.
The date of the document
The date matters when assigning a PageRank.
Several methods can be used to determine the date of a document: it can be
the date of indexing, or the date on which a backlink to the page is placed.
If the number of links to a page grows faster than for an older page, the
page gets a better PageRank, but this growth can also signal spamming.
If a document is more recent than the average of the pages in a result set,
it can be given a better PageRank, improving its position to take account
of its novelty.
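As an illustration of the last rule, here is a minimal sketch of comparing a document's date with the average date of a result set. The function name and the use of calendar dates are assumptions made for this example; the patent summary does not say how the average is computed.

```python
from datetime import date

def is_fresher_than_average(doc_date, result_dates):
    """Return True if doc_date is later than the mean date of result_dates.

    Hypothetical helper: dates are averaged through their ordinal
    (day count), which is only one possible way to define the average.
    result_dates must not be empty.
    """
    average_ordinal = sum(d.toordinal() for d in result_dates) / len(result_dates)
    return doc_date.toordinal() > average_ordinal

others = [date(2005, 1, 1), date(2006, 1, 1)]
print(is_fresher_than_average(date(2007, 4, 1), others))
```

A real engine would presumably turn this into a graded bonus rather than a yes/no answer; the boolean only keeps the sketch short.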
Evolution of the contents of the page
The score differs depending on whether the contents of the document change
often or not.
To detect changes, the engine can store the whole document, a signature that
represents it in condensed form, or a part considered essential to the
document.
Depending on these changes, the effect on the score can be positive or negative.
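The signature option mentioned above can be sketched with a hash of the page text. The choice of SHA-256 and the whitespace and case normalization are assumptions for this example, not details taken from the patent.

```python
import hashlib

def signature(text):
    """Return a short fingerprint of a document's normalized text.

    Assumption: case and runs of whitespace are ignored so that
    cosmetic edits do not count as content changes.
    """
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def has_changed(old_text, new_text):
    """Compare two versions of a page through their signatures only."""
    return signature(old_text) != signature(new_text)

print(has_changed("Hello  World", "hello world"))
print(has_changed("hello world", "hello there"))
```

Storing only the signature is cheap, but it says nothing about how much changed; storing the whole document or an essential part would allow a finer comparison.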
Analysis of queries and clicks on results
The way a document is selected among the results of a query can be taken
into account.
When certain terms appear more frequently in users' queries, a document
associated with these terms (containing them, or having backlinks that
contain them) will get a better score.
If a document often answers similar queries, it will obtain a better score.
Some queries remain constant over time while the pages that answer them
change (sports results, for example). The score decreases if the document
no longer answers the query.
In certain domains, such as FAQs, the novelty of a document is important
and improves the score.
However, if users click on the link to an older document and ignore more
recent ones, the older document will get a better score.
A document that appears more often in queries on a topic, but less often
once the topic is narrowed, will get a lower score (for example, the topic
can be a sport, narrowed to one specific club).
If a document appears in the results of unrelated queries, that signals
spam and the score is reduced.
Links to the page
The appearance and disappearance of backlinks are taken into account for
the PageRank.
If new backlinks appear less often over time, the document is becoming
stale and its score is reduced.
Conversely, if their number tends to grow, the document gets a better score.
If the contents of a document are modified but a link it holds to another
page is kept, that adds value to this link and thus increases the score of
the linked page.
The value of links increases if they come from trusted sources, which is
the case for governmental sites, for example.
The rate at which backlinks appear can signal spam. Pages of a given type
are assumed to attract links at a given rate. So when too many backlinks
appear, that suggests an exchange or purchase of links, or pages with free
registration (such as directories), and that is treated as spam.
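The idea of an expected link rate can be sketched as a simple threshold check. The function name, the per-period counts, and the tolerance factor of 3 are all hypothetical; the patent summary gives no numeric values.

```python
def flag_link_spikes(new_backlinks, expected_rate, tolerance=3.0):
    """Return the indexes of periods with suspiciously many new backlinks.

    new_backlinks: count of new backlinks per period (e.g. per week).
    expected_rate: assumed normal count per period for this type of page.
    tolerance: hypothetical multiplier above which growth looks unnatural.
    """
    limit = expected_rate * tolerance
    return [i for i, count in enumerate(new_backlinks) if count > limit]

weekly_new_links = [4, 6, 5, 40, 7]
print(flag_link_spikes(weekly_new_links, expected_rate=5))
```

With these sample numbers only the fourth week (index 3) exceeds the limit of 15 links.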
Text of anchors
A change in the text of anchors means that the document has been updated.
If the text of the document changes and no longer matches the wording of
the anchors, the document has been rebuilt and is no longer relevant to its
anchors, which is not desirable.
From this, one can determine the date on which a domain changed topic, and
links older than that date will be ignored.
If the document undergoes only minor changes, it is better to keep the
wording of the anchors: their age counts toward relevance.
Traffic on the page
If the traffic, in other words the number of times a page is read, decreases
significantly, the document is considered stale. Traffic is compared across
time periods to estimate the decrease.
Traffic coming from advertisements is taken into account. If the
advertisements on a page are for sites with strong traffic, the page will
have a better PageRank than with advertisements for minor sites.
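The period-over-period traffic comparison described in this section can be sketched as follows. The function names and the 50% staleness threshold are assumptions for illustration; the patent summary only speaks of a "significant" decrease.

```python
def traffic_drop(previous_visits, recent_visits):
    """Fractional decrease of average visits between two periods.

    Both lists hold visit counts (e.g. per day) and must not be empty.
    Returns 0.0 when traffic is stable or growing.
    """
    previous_avg = sum(previous_visits) / len(previous_visits)
    recent_avg = sum(recent_visits) / len(recent_visits)
    if previous_avg == 0:
        return 0.0
    return max(0.0, (previous_avg - recent_avg) / previous_avg)

def looks_stale(previous_visits, recent_visits, threshold=0.5):
    """Hypothetical rule: a drop of at least `threshold` marks the page stale."""
    return traffic_drop(previous_visits, recent_visits) >= threshold

print(traffic_drop([100, 100], [50, 50]))
print(looks_stale([100, 100], [90, 95]))
```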
Behavior of visitors
The number of times a page is selected in query results, as well as the
time spent to reach the page, are taken into account.
Depending on whether visitors spend more or less time on a page, it will be
regarded as relevant or stale. If visitors spend less and less time on a
page over time, it will be regarded as stale.
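A minimal sketch of the "less and less time" rule, assuming the engine has an average time on page for each successive period. The strict period-by-period decrease used here is an assumption; a real system would more likely fit a trend that tolerates noise.

```python
def dwell_time_is_declining(avg_seconds_per_period):
    """Return True if every period shows less time on page than the one before.

    avg_seconds_per_period: hypothetical list of average visit durations,
    oldest period first. Fewer than two periods give no trend.
    """
    values = list(avg_seconds_per_period)
    if len(values) < 2:
        return False
    return all(later < earlier for earlier, later in zip(values, values[1:]))

print(dwell_time_is_declining([120, 90, 60]))
print(dwell_time_is_declining([120, 130, 60]))
```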
Information on the domain name
Hosting is taken into account: intranet, Internet, or a network of document
databases.
Recent domains can be used by spammers and are thus regarded as less legitimate.
DNS data, such as the owner of the domain, contacts, and DNS addresses, are
taken into account. Frequent changes are signs of spam. The IP addresses and
other data used by these ephemeral sites are recorded in a database, along
with the associated documents.
A DNS server is better regarded if it serves various domains from different
registrars. It is a bad sign if it hosts porn sites, spam sites, or domains
containing commercial words.
The PageRank of a page depends on the domain and its hosting.
Previous ranks
Previous ranks are taken into account. The number of positions a document
gains in a given time modifies its score. However, if a rank remains high
while positions on the subject tend to change over time, that indicates
a commercial topic and a higher probability of spam.
If the number of selections for a page tends to increase, or if the
selections become more frequent, the page will have a better score.
The engine takes into account spikes in the rank of documents, which
typically indicate spam. To tell the difference, various factors are
considered. A document mentioned in the news, for example, is not spam.
Conversely, a sudden drop in the rank of a document indicates that it is stale.
In conclusion, the evolution of a document's rank influences its score
and its future rank.
Bookmarks
Bookmarks and other data of this type influence the PageRank of a document.
Being added to or removed from this type of list is taken into account.
How often the document is selected from the list also has an influence.
Cache memory, temporary directories, and cookies are taken into account as well.
All of this indicates whether a document is consulted or whether web users
ignore it.
Unique words and anchors
The frequency of a single word or phrase in anchors is taken into account,
in relation to the links to the linked page.
If anchors are suspect, in particular because the same unique words occur
many times in different documents, that will affect the score of these
documents and of those that link to them.
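One way to picture the anchor frequency check is to measure how concentrated the anchor texts pointing to a page are. The function name and the case and whitespace normalization are assumptions for this sketch; the patent summary does not define the measure.

```python
from collections import Counter

def anchor_concentration(anchor_texts):
    """Share of backlinks that use the single most common anchor text.

    anchor_texts: anchor wording of each backlink to one page.
    A value close to 1.0 means nearly all links repeat the same words,
    which this sketch treats as a hint of artificial linking.
    """
    if not anchor_texts:
        return 0.0
    counts = Counter(text.strip().lower() for text in anchor_texts)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(anchor_texts)

anchors = ["Cheap pills", "cheap pills", "cheap pills ", "home"]
print(anchor_concentration(anchors))
```

Here three of the four anchors normalize to the same text, so the share is 0.75.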
Unrelated links
Unrelated backlinks and unrelated outgoing links are an indicator of spam and cause a drop in the PageRank of the page.
Topic of the page
The topic is used to determine the page's PageRank.
The topic of a page is determined from rare words, the URL, the synopsis,
the contents, etc.
If the topic of a set of documents changes, that indicates a new owner or
a different topic for the site, and all information on the page becomes out of date.
Or it means that the page is used for spam.