How does Google rank a web page?

For over a decade Google has dominated the market for internet search engines. Its strength is that it sorts its results intelligently, by relevance. How is this possible? Since its launch in 1998 Google has kept evolving, and most of its improvements remain closely guarded secrets. The main idea, however, has been published: the backbone of its success is a sound piece of mathematical modeling, which we retrace here.

What does a search engine do?

A database has a predefined structure that makes it possible to extract information such as name, street, postal code, phone number, and so on. The internet, by contrast, is unstructured: it is a huge collection of texts of all kinds. (I leave aside here the question of different formats and media, since we restrict ourselves to searching by keywords.) Any attempt at a fixed classification is doomed to fail, all the more so because the web changes rapidly: a multitude of independent authors constantly add new pages and modify existing ones.
 
To find information in this amorphous pile, the user searches by keywords. For this to be effective some preparation is needed: the search engine first copies the web pages one by one into local memory and sorts their words alphabetically. The result is an index of keywords, each associated with the web pages that contain it.
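To make this concrete, here is a minimal sketch of such a keyword index (an inverted index) in Python. The pages, their texts and the function name are hypothetical and only illustrate the principle, not Google's actual implementation.

    # Minimal sketch of a keyword index (inverted index); illustrative only.
    from collections import defaultdict

    # Hypothetical pages with their text content.
    pages = {
        "P1": "tangent lines and circles",
        "P2": "the tangent of an angle",
        "P3": "search engines on the internet",
    }

    def build_index(pages):
        """Map every word to the set of pages containing it."""
        index = defaultdict(set)
        for name, text in pages.items():
            for word in text.lower().split():
                index[word].add(name)
        return index

    index = build_index(pages)
    print(sorted(index["tangent"]))  # ['P1', 'P2']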
 
For a given keyword there are typically thousands of relevant pages (over a million for "tangent", for example). But some pages are more relevant than others. How can we help the user identify the potentially interesting results? This is where Google made its great innovation.

The web is a graph!

Let us exploit the little structure that is available. The internet is not a collection of independent texts but a huge hypertext: its pages cite one another. To analyze this structure we neglect the content of the pages and consider only the links between them. What we obtain is a graph. The following figure shows a miniature example.
 
 
Here the vertices represent web pages and the arrows represent links, that is, citations between web pages. Each arrow points from the citing page to the page cited.
 
In what follows I denote the web pages by $P_1, \dots, P_n$ and write $P_j \to P_i$ when page $P_j$ links to page $P_i$. In our graph we have a link $P_7 \to P_5$, for example, but no link $P_5 \to P_1$. In this first example, however, all pages communicate with one another via paths of one or several steps.
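In code such a link graph is conveniently represented as an adjacency list. The small graph below is hypothetical (it is not the graph of the figure); it merely fixes the representation used in the sketches that follow.

    # A hypothetical miniature link graph as an adjacency list:
    # links[j] is the list of pages that page j points to.
    links = {1: [2, 5], 2: [1], 3: [1, 2], 4: [3], 5: [1, 3]}

    # Here the link 1 -> 5 exists, but there is no link 4 -> 1, for example.
    print(5 in links[1], 1 in links[4])  # True False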

How can we use this graph?

The links on the internet are not placed at random but have been edited with care. What information can we draw from this graph? The basic idea, still to be formalized, is that a link $P_j \to P_i$ is a recommendation by page $P_j$ to go and read page $P_i$, and thus a vote by $P_j$ for the authority of page $P_i$.
 
Let us analyze our example from this point of view. The following presentation of our graph suggests a possible hierarchy, which still has to be justified.
 
 
Among the pages $P_1, P_2, P_3, P_4$ the page $P_1$ is a common reference and seems a good starting point for looking for information. The same holds in the group $P_9, P_{10}, P_{11}, P_{12}$, where $P_9$ serves as the common reference. The structure of the group $P_5, P_6, P_7, P_8$ is similar, with $P_7$ the most cited. Note, however, that the pages $P_1$ and $P_9$, already recognized as important, both refer to page $P_5$. One may well suspect that page $P_5$ contains information that is essential for the whole, and that it is the most relevant. In what follows we will try to formalize this ranking.

First model: naive counting

It is plausible that an important page receives many links. Being a little naive, we might also believe the converse: if a page receives many links, then it is important. Thus we could define the importance $m_i$ of page $P_i$ as the number of links it receives. As a formula this reads:
 
$$m_i := \sum_{j \to i} 1.$$
Here the sign $\sum_{j \to i}$ denotes the sum over all links $P_j \to P_i$ pointing to page $P_i$, and each term of the sum is worth $1$. That is, $m_i$ equals the number of "votes" for page $P_i$, where every vote contributes the same value $1$. This is easy to define and to compute, but it often fails to match the importance felt by the user: in our example it gives $m_1 = 4$ against $m_5 = 3$. Worse, this naive count is far too easy to manipulate by adding pages of no interest that recommend an arbitrary page.
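As a quick illustration, here is how this naive count could be computed on the hypothetical graph introduced above (again, this graph is mine, not the one in the figure).

    from collections import Counter

    # Hypothetical link graph: links[j] lists the pages cited by page j.
    links = {1: [2, 5], 2: [1], 3: [1, 2], 4: [3], 5: [1, 3]}

    # m_i = number of links pointing to page i.
    naive_count = Counter(i for targets in links.values() for i in targets)
    print(naive_count)  # Counter({1: 3, 2: 2, 3: 2, 5: 1})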

Second model: weighted counting

Some pages emit a great many links: these appear less specific, and their votes should weigh less. We therefore split the vote of page $P_j$ into $\ell_j$ equal parts, where $\ell_j$ denotes the number of links it emits. Thus we could define a finer measure:

$$m_i := \sum_{j \to i} \frac{1}{\ell_j}.$$
That is, $m_i$ counts the "weighted votes" for page $P_i$. This is easy to define and to compute, but it still does not match the perceived importance well: in our example it gives $m_1 = m_9 = 2$, ahead of $m_5$ and $m_7$. And, as before, the count is far too easy to fake.
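Along the same lines, here is a sketch of the weighted count on the same hypothetical graph: each page distributes a total vote of $1$ equally among the pages it cites.

    from collections import defaultdict

    # Hypothetical link graph: links[j] lists the pages cited by page j.
    links = {1: [2, 5], 2: [1], 3: [1, 2], 4: [3], 5: [1, 3]}

    # m_i = sum over links j -> i of 1/ell_j, where ell_j = number of links of j.
    weighted = defaultdict(float)
    for j, targets in links.items():
        for i in targets:
            weighted[i] += 1.0 / len(targets)

    print(dict(weighted))  # {2: 1.0, 5: 0.5, 1: 2.0, 3: 1.5}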

Third model: recursive counting

Heuristically, a page $P_i$ seems important if many important pages cite it. This leads us to define the measure $m_i$ recursively as follows:

$$m_i = \sum_{j \to i} \frac{1}{\ell_j} m_j.$$
Here the weight of each vote is proportional to the weight $m_j$ of the emitting page $P_j$. This is easy to state but less obvious to compute; an efficient method will be explained later [3]. To reassure ourselves, we can already check that our example admits the solution
$$\begin{matrix} & P_1 & P_2 & P_3 & P_4 & P_5 & P_6 & P_7 & P_8 & P_9 & P_{10} & P_{11} & P_{12} \\ m = & (2, & 1, & 1, & 1, & 3, & 1, & 2, & 1, & 2, & 1, & 1, & 1). \end{matrix}$$
Unlike the previous models, the page $P_5$ is now identified as the most important. That is a good sign that we are on the right track.
 
Note that this is a system of $n$ linear equations in $n$ unknowns. In our example, where $n = 12$, it is already tedious to solve by hand, but still easy on a computer. For much larger graphs we need specialized methods.
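As an illustration, here is how such a linear system can be solved with NumPy on the small hypothetical graph used in the earlier sketches; the normalization, requiring the values to sum to $1$, is one convenient choice among others.

    import numpy as np

    # Hypothetical link graph: links[j] lists the pages cited by page j.
    links = {1: [2, 5], 2: [1], 3: [1, 2], 4: [3], 5: [1, 3]}
    pages = sorted(links)
    n = len(pages)
    idx = {p: k for k, p in enumerate(pages)}

    # A[i, j] = 1/ell_j if page j links to page i, else 0.
    A = np.zeros((n, n))
    for j, targets in links.items():
        for i in targets:
            A[idx[i], idx[j]] = 1.0 / len(targets)

    # Solve m = A m together with the normalization sum(m) = 1,
    # written as an overdetermined least-squares problem.
    M = np.vstack([A - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    m, *_ = np.linalg.lstsq(M, b, rcond=None)
    print(dict(zip(pages, m.round(3).tolist())))
    # Roughly {1: 0.421, 2: 0.263, 3: 0.105, 4: 0.0, 5: 0.211}: page 4,
    # which nobody cites, gets weight 0 in this model.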

Random walk on the web

Before attempting to solve the equation, let us try to develop some intuition. Imagine a random surfer who wanders across the internet by clicking on links at random. How does his position evolve?
 
For example, suppose that our surfer starts at time $t = 0$ on page $P_7$. Its only link points to $P_5$, so at time $t = 1$ the surfer finds himself there with probability $1$. Three links leave $P_5$, so at time $t = 2$ he is on one of the pages $P_6$, $P_7$, $P_8$, each with probability $\frac{1}{3}$. Here are the subsequent probabilities (rounded to $0.001$):

$$\begin{matrix}
 & P_1 & P_2 & P_3 & P_4 & P_5 & P_6 & P_7 & P_8 & P_9 & P_{10} & P_{11} & P_{12} \\
t=0 & .000 & .000 & .000 & .000 & .000 & .000 & 1.00 & .000 & .000 & .000 & .000 & .000 \\
t=1 & .000 & .000 & .000 & .000 & 1.00 & .000 & .000 & .000 & .000 & .000 & .000 & .000 \\
t=2 & .000 & .000 & .000 & .000 & .000 & .333 & .333 & .333 & .000 & .000 & .000 & .000 \\
t=3 & .167 & .000 & .000 & .000 & .333 & .000 & .333 & .000 & .167 & .000 & .000 & .000 \\
t=4 & .000 & .042 & .042 & .042 & .417 & .111 & .111 & .111 & .000 & .042 & .042 & .042 \\
t=5 & .118 & .021 & .021 & .021 & .111 & .139 & .250 & .139 & .118 & .021 & .021 & .021 \\
\dots \\
t=29 & .117 & .059 & .059 & .059 & .177 & .059 & .117 & .059 & .117 & .059 & .059 & .059 \\
t=30 & .117 & .059 & .059 & .059 & .177 & .059 & .117 & .059 & .117 & .059 & .059 & .059
\end{matrix}$$
We observe that the distribution converges rather quickly to a stationary distribution (to within $0.001$ after about thirty iterations). Let us test this observation with a second example, this time starting from page $P_1$:
$$\begin{matrix}
 & P_1 & P_2 & P_3 & P_4 & P_5 & P_6 & P_7 & P_8 & P_9 & P_{10} & P_{11} & P_{12} \\
t=0 & 1.00 & .000 & .000 & .000 & .000 & .000 & .000 & .000 & .000 & .000 & .000 & .000 \\
t=1 & .000 & .250 & .250 & .250 & .250 & .000 & .000 & .000 & .000 & .000 & .000 & .000 \\
t=2 & .375 & .125 & .125 & .125 & .000 & .083 & .083 & .083 & .000 & .000 & .000 & .000 \\
t=3 & .229 & .156 & .156 & .156 & .177 & .000 & .083 & .000 & .042 & .000 & .000 & .000 \\
t=4 & .234 & .135 & .135 & .135 & .151 & .059 & .059 & .059 & .000 & .010 & .010 & .010 \\
t=5 & .233 & .126 & .126 & .126 & .118 & .050 & .109 & .050 & .045 & .005 & .005 & .005 \\
\dots \\
t=69 & .117 & .059 & .059 & .059 & .177 & .059 & .117 & .059 & .117 & .059 & .059 & .059 \\
t=70 & .117 & .059 & .059 & .059 & .177 & .059 & .117 & .059 & .117 & .059 & .059 & .059
\end{matrix}$$
 
 
Although this diffusion takes longer to stabilize, the stationary measure is the same! It also coincides with our solution $m$ above, here divided by $17$ so that the values sum to $1$. The pages where $m_i$ is large are the most visited, the most "popular". In our quest to rank web pages in order of importance, this is one more argument for using the measure $m$ as an indicator.
 
The model of the random surfer may seem surprising, but in the absence of more precise information, the use of probabilistic considerations is often very helpful!

The transition law

How can we formalize the diffusion observed above? Suppose that at time $t$ our random surfer is on page $P_j$ with probability $p_j$. The probability of starting from $P_j$ and following the link $P_j \to P_i$ is then $\frac{1}{\ell_j} p_j$. The probability of arriving at time $t+1$ on page $P_i$ is therefore
 
$$p'_i := \sum_{j \to i} \frac{1}{\ell_j} p_j.$$

 
Given the distribution $p$ at time $t$, the transition law thus defines the distribution $p'$ at time $t+1$; this is how each row of the tables above is obtained from the previous one. (In probability theory such a process is called a Markov chain.) A stationary measure is characterized by the equilibrium equation $p' = p$, which is precisely our starting equation $m_i = \sum_{j \to i} \frac{1}{\ell_j} m_j$.
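Here is a sketch of this transition law in Python, again on the hypothetical graph used above; iterating the step reproduces the kind of diffusion tables shown earlier. As in the model so far, it assumes that every page emits at least one link.

    # Hypothetical link graph: links[j] lists the pages cited by page j.
    links = {1: [2, 5], 2: [1], 3: [1, 2], 4: [3], 5: [1, 3]}

    def step(p):
        """One transition: p'_i = sum over links j -> i of p_j / ell_j."""
        p_next = {i: 0.0 for i in links}
        for j, targets in links.items():
            for i in targets:
                p_next[i] += p[j] / len(targets)
        return p_next

    # Start the random surfer on page 1 and diffuse for 30 steps.
    p = {i: 0.0 for i in links}
    p[1] = 1.0
    for _ in range(30):
        p = step(p)
    print({i: round(v, 3) for i, v in p.items()})
    # The values approach the stationary measure found in the previous sketch.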

Watch out for black holes

What happens when our graph contains a page (or a group of pages) with no outgoing links? As an illustration, here is our graph augmented by a new page $P_{13}$ that emits no links:
 

The interpretation as a random walk solves the equation without any computation: page $P_{13}$ absorbs all the probability, because sooner or later our random surfer falls onto this page, where he then remains for the rest of his life. The solution is therefore
$$\begin{matrix} & P_1 & P_2 & P_3 & P_4 & P_5 & P_6 & P_7 & P_8 & P_9 & P_{10} & P_{11} & P_{12} & P_{13} \\ m = & (0, & 0, & 0, & 0, & 0, & 0, & 0, & 0, & 0, & 0, & 0, & 0, & 1). \end{matrix}$$
Our model is not yet satisfactory.

PageRank, the model used by Google

To escape the black hole, Google uses a more refined model:
  • With a fixed probability $c$ the surfer abandons his current page $P_j$ and starts again on one of the $n$ web pages, chosen uniformly at random;
  • otherwise, with probability $1-c$, the surfer follows one of the links of page $P_j$, chosen uniformly at random (this is the usual random walk as before).
This "teleportation" trick prevents the surfer from being trapped on a page without outgoing links, and it guarantees that he can reach any page of the graph, whatever its connectivity problems.
 
In this model the transition is given by
 
$$p'_i := \frac{c}{n} + \sum_{j \to i} \frac{1-c}{\ell_j}\, p_j.$$
The first term comes from the teleportation, the second term is the random walk as before. The equilibrium measure therefore satisfies the equation

$$m_i = \frac{c}{n} + \sum_{j \to i} \frac{1-c}{\ell_j}\, m_j.$$
The parameter $c$ still has to be calibrated. For $c = 0$ we recover the previous model.
 
In general one chooses the constant $c$ non-zero but close to zero. For example, the choice $c = 0.15$ corresponds to following about $6$ links on average before restarting elsewhere, which seems a realistic description.
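Putting the pieces together, here is a minimal sketch of the resulting iteration in Python, once more on the hypothetical graph used above, with $c = 0.15$ as suggested. Like the model described here, it assumes that every page emits at least one link.

    # Hypothetical link graph: links[j] lists the pages cited by page j.
    links = {1: [2, 5], 2: [1], 3: [1, 2], 4: [3], 5: [1, 3]}

    def pagerank(links, c=0.15, steps=50):
        """Iterate m_i <- c/n + sum over links j -> i of (1 - c) * m_j / ell_j,
        starting from the uniform distribution."""
        n = len(links)
        m = {i: 1.0 / n for i in links}
        for _ in range(steps):
            m_next = {i: c / n for i in links}
            for j, targets in links.items():
                for i in targets:
                    m_next[i] += (1 - c) * m[j] / len(targets)
            m = m_next
        return m

    print({i: round(v, 3) for i, v in pagerank(links).items()})
    # Thanks to the teleportation term c/n, every page now receives a
    # strictly positive score, including page 4, which nobody cites.

The fixed point theorem below guarantees that this iteration converges and that the error shrinks at least geometrically with the number of steps.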
 
To conclude the analysis of our example, here is the random walk in this model, starting from page $P_1$:
$$\begin{matrix}
 & P_1 & P_2 & P_3 & P_4 & P_5 & P_6 & P_7 & P_8 & P_9 & P_{10} & P_{11} & P_{12} \\
t=0 & 1.00 & .000 & .000 & .000 & .000 & .000 & .000 & .000 & .000 & .000 & .000 & .000 \\
t=1 & .013 & .225 & .225 & .225 & .225 & .013 & .013 & .013 & .013 & .013 & .013 & .013 \\
t=2 & .305 & .111 & .111 & .111 & .028 & .076 & .087 & .076 & .034 & .020 & .020 & .020 \\
t=3 & .186 & .124 & .124 & .124 & .158 & .021 & .085 & .021 & .071 & .028 & .028 & .028 \\
t=4 & .180 & .105 & .105 & .105 & .140 & .057 & .075 & .057 & .057 & .040 & .040 & .040 \\
t=5 & .171 & .095 & .095 & .095 & .126 & .052 & .101 & .052 & .087 & .042 & .042 & .042 \\
\dots \\
t=29 & .120 & .066 & .066 & .066 & .150 & .055 & .102 & .055 & .120 & .066 & .066 & .066 \\
t=30 & .120 & .066 & .066 & .066 & .150 & .055 & .102 & .055 & .120 & .066 & .066 & .066
\end{matrix}$$
The stationary measure is reached quickly, and page $P_5$ comes out on top with $m_5 = 0.15$, ahead of pages $P_1$ and $P_9$ with $m_1 = m_9 = 0.12$.

The fixed point theorem

To develop a promising model we have used heuristic arguments and numerical experiments. Let us now fix this model and place it on a solid theoretical foundation. Our computations converged nicely in our miniature example, but is this always the case? The following beautiful result answers the question in full generality:
 
Fixed point theorem. Consider an arbitrary finite graph and fix a parameter $c$ with $0 < c \le 1$. Then:
  • The equation admits a unique solution $m = (m_1, \dots, m_n)$ with $m_1 + \dots + m_n = 1$. In this solution all the values $m_i$ are strictly positive.
  • For every initial probability distribution $p$ on the graph, the diffusion process converges to the unique stationary measure $m$.
  • The convergence is at least as fast as that of the geometric sequence $(1-c)^t$ towards $0$.
Let us emphasize the importance of each of these three points. The first ensures the existence and uniqueness of a solution to our problem. Better still, a solution not only exists: the second point tells us how to compute it, namely by an iterative algorithm. The independence from the initial distribution also provides a certain numerical stability: in floating-point computations rounding errors are often unavoidable, but fortunately such perturbations do not affect the final result. Finally, the third point guarantees that the rate of convergence is high enough, which is crucial for any large-scale application. For its ranking Google processes several billion web pages. This Herculean task is only feasible with the iterative algorithm, and the theorem guarantees its efficiency whatever the graph.
 
This theorem is as elegant as it is useful. The idea of the proof is surprisingly simple: one shows that the transition law defines a map that is contracting on the set of probability distributions. The result then follows from Banach's fixed point theorem.
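For the curious reader, here is a sketch of that contraction estimate, written with the norm $\|x\|_1 = \sum_i |x_i|$ (this choice of norm is my assumption; the article does not spell it out). Denote by $T$ the transition map of the PageRank model and assume, as above, that every page emits at least one link. For two probability distributions $p$ and $q$,
$$\|T(p) - T(q)\|_1 = (1-c) \sum_i \Bigl|\sum_{j \to i} \frac{p_j - q_j}{\ell_j}\Bigr| \le (1-c) \sum_i \sum_{j \to i} \frac{|p_j - q_j|}{\ell_j} = (1-c) \sum_j |p_j - q_j| = (1-c)\,\|p - q\|_1,$$
since each page $P_j$ appears in exactly $\ell_j$ of the inner sums. The map $T$ is therefore contracting with factor $1 - c < 1$, and Banach's theorem yields existence, uniqueness and geometric convergence all at once.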

Conclusion

To be useful, a search engine must not only list the results of a query but rank them in order of importance. Estimating the relevance of web pages, however, is a profound modeling challenge.
 
As a first approximation Google analyzes the graph formed by the links between web pages. Interpreting a link $P_j \to P_i$ as a "vote" by page $P_j$ for page $P_i$, the PageRank model defines a measure of "popularity".
 
The fixed point theorem ensures that this equation admits a unique solution and justifies the iterative algorithm used to approximate it. The algorithm is easy to implement on a computer and quite efficient even for graphs of enormous size.
 
Armed with these mathematical tools and a clever business strategy, Google earns billions of dollars. Someone just had to think of it!
