The Yahoo Bonus and its Impact on Search Engine Optimization

Many experts in search engine optimization assume that certain websites obtain a special PageRank evaluation from the Google search engine which needs a manual intervention and does not derive from the PageRank algorithm directly. Mostly, the directories Yahoo and Open Directory Project (dmoz.org) are considered to get this special treatment. In the context of search engine optimization, this assumption would have the consequence that an entry into the above mentioned directories had a big impact on a site's PageRank.

An approach that is often mentioned when talking about influencing the PageRank of certain websites is to assign higher starting values to them before the iterative computation of PageRank begins. Such proceeding shall be reviewed by looking at a simple example web consisting of two pages, whereby each of these pages solely links to the other. We assign an initial PageRank of 10 to one page and a PageRank of 1 to the other. The damping factor d is set to 0.1, because the lower d is, the faster the PageRank values converge during the iterations. So, we get the following equations for the computation of the pages' PageRank values:

PR(A) = 0.9 + 0.1 PR(B)
PR(B) = 0.9 + 0.1 PR(A)

During the iterations, we get the following PageRank values:

IterationPR (A)PR (B)
0110
11.91.09
21.0091.0009
31.000091.000009

It is obvious that despite assigning different starting values to the two pages each of the PageRank values converges to 1, just as it would have happened if no initial values were assigned. Hence, starting values have no effect on PageRank if a sufficient number of iterations takes place. Indeed, if the computation is performed with only few iterations, the starting values would influence PageRank. But in this case, we have to consider that in our example the PageRank relation between the two pages reverses after the first iteration. However, it shall be noted that for our computation the actual PageRank values within one iteration have been used and not the ones from the previous iteration. If those values would have been used, the PageRank relation had alternated after each iteration.

Modification of the PageRank Algorithm

If assigning special starting values at the begin of the PageRank calculations has no effect on the results of the computation, this does not mean that it is not possible to influence the PageRank of websites or web pages by an intervention in the PageRank algorithm. Lawrence Page, for instance, describes a method for a special evaluation of web pages in his PageRank patent specifications (United States Patent 6,285,999). The starting point for his consideration is that the random surfer of the Random Surfer Model may get bored and stop following links with a constant probabilty, but when he restarts, he won't take a random jump to any page of the web but will rather jump to certain web pages with a higher probability than to others. This behaviour is closer to the behaviour of a real user, who would more likely use, for instance, directories like Yahoo or ODP as a starting point for surfing.

If a special evaluation of certain web pages shall take place, the original PageRank algorithm has to be modified. With another expected value implemented, the algorithm is given as follows:

PR(A) = E(A) (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Here, (1-d) is the probability for the random surfer no longer following links. E(A) is the probability for the random surfer going to page A after he has stopped following links, weighted by the number of web pages. So, E is another expected value whose average over all pages is 1. In this way, the average of the PageRank values of all pages of the web will continue to converge to 1. Thus, the PageRank values do not vaccilate because of the special evaluation of web pages and the impact of PageRank on the general ranking of web pages remains stable.

In our example, we set the probability for the random surfer going to page A after he has stopped following links to 0.1. The probability for him going to page B is set to 0.9. Since our web consists of two pages E(A) equals 0.2 and E(B) equals 1.8. At a damping factor d of 0.5 we get the following equations for the calculation of the single pages' PageRank values:




PR(A) = 0.2 × 0.5 + 0.5 × PR(B)
PR(B) = 1.8 × 0.5 + 0.5 × PR(A)

If we solve these equations we get the following PageRank values:

PR(A) = 11/15
PR(B) = 19/15

The sum of the PageRank values remains 2. The higher probability for the random surfer jumping to page B is reflected by its higher PageRank. Indeed, the uniform interlinking between both pages prevents our example pages' PageRank values from a more significant impact of our intervention.

So, it is possible to implement the special evaluation of certain web pages into the PageRank algorithm without having to change it fundamentally. It is questionable, indeed, what criteria is used for the evaluation. Lawrence Page suggests explicitly the utilization of real usage data in his PageRank patent specifications. Google, meanwhile, collects usage date by means of the Google Toolbar. And Google would not even need as much data, as if the whole ranking was solely based on usage data. A limited sample would be sufficient to determine the 1,000 or 10,000 most important pages on the web. The PageRank algorithm can then fill the holes in usage data and is thereby able to deliver a more accurate picture of the web.

Of course, all statements regarding the influence of real usage data on PageRank are pure speculation. Even if there is a special evaluation of certain web pages at all will in the end, stay a secret of the people at Google.

Nonetheless, Assigning Starting Values?

Although, assigning special starting values to pages at the begin of PageRank calculations has no effect on PageRank values it can, nonetheless, be reasonable.

We take a look at our example web consisting of the pages A, B and C, whereby page A links to the pages B and C, page B links to page C and page C links to page A. In this case, the damping factor d is set to 0.75. So, we get the following equations for the iterative computation of the single pages' PageRank values:





PR(A) = 0.25 + 0.75 PR(C)
PR(B) = 0.25 + 0.75 (PR(A) / 2)
PR(C) = 0.25 + 0.75 (PR(A) / 2 + PR(B))

Basically, it is not necessary to assign starting values to the single pages before the computation begins. They simply start with a value of 0 and we get the following PageRank values during the iterations:

IterationPR(A)PR(B)PR(C)
0000
10.250.343750.60156
20.701170.512940.89764
30.923230.596211.04337
41.032530.637201.11510
51.086320.657371.15040
61.112800.667301.16777
71.125830.672191.17633
81.132240.674591.18054
91.135400.675781.18261
101.136960.676361.18363
111.137720.676651.18413
121.138100.676791.18438
131.138280.676861.18450
141.138370.676891.18456
151.138420.676911.18459
161.138440.676921.18460
171.138450.676921.18461
181.138460.676921.18461
191.138460.676921.18461
201.138460.676921.18461
211.138460.676921.18461
221.138460.676921.18462

If we assign 1 to each page before the computation starts, we get the following PageRank values during the iterations:

IterationPR(A)PR(B)PR(C)
0111
110.6251.09375
21.070310.651371.13989
31.104920.664341.16260
41.121950.670731.17378
51.130340.673881.17928
61.134460.675421.18199
71.136490.676181.18332
81.137490.676561.18398
91.137980.676741.18430
101.138230.676841.18446
111.138350.676881.18454
121.138400.676901.18458
131.138430.676911.18460
141.138450.676921.18461
151.138450.676921.18461
161.138460.676921.18461
171.138460.676921.18461
181.138460.676921.18461
191.138460.676921.18462

If we now assign a starting value to each page, which is closer to its effective PageRank (1.1 for page A, 0.7 for page B and 1.2 for page C), we get the following results:

IterationPR(A)PR(B)PR(C)
01.10.71.2
11.150.681251.19219
21.144140.679051.18834
31.141260.677971.18645
41.139840.677441.18552
51.139140.677181.18506
61.138790.677051.18483
71.138630.676981.18472
81.138540.676951.18467
91.138500.676941.18464
101.138480.676931.18463
111.138470.676931.18462
121.138470.676921.18462
131.138460.676921.18462

So, the closer the assigned starting values are to the effective results we would get by solving the equations, the faster do the PageRank values converge in the iterative computation. Less iterations are needed, which can be useful for providing more up to date search results, especially regarding the growth rate of the web. Starting point for an accurate presumption of the actual PageRank distribution may be the PageRank values of a former PageRank calculation. All the pages which are new in the index could get an initial PageRank of 1, which will then be a lot closer to the effective PageRank value after the first few iterations.

Additional Factors Influencing PageRank

PageRank and Google are trademarks of Google Inc., Mountain View CA, USA. PageRank is protected by US Patent 6,285,999.

The content of this document may be reproduced on the web provided that a copyright notice is included and that there is a straight HTML hyperlink to the corresponding page at pr.efactory.de in direct context.

Urlaub Kuba

(c)2002/2003 eFactory GmbH & Co. KG Internet-Agentur - written by Markus Sobek

eFactory
GmbH & Co. KG
sobek@eFactoryy.de

Goethestraße 75
40237 Düsseldorf

Tel.: 0211 44 03 97-21
Fax: 0211 44 03 97-40