# Unlocking the Google Panda Algorithm with Bayesian Mathematics

Panda 4.0 was an algorithm update introduced by Google on 20 May 2014.  The update was designed to combat web spam, in response to SEOs using unethical techniques to manipulate their clients' rankings in Google's search results.  Many theories and articles have sought to explain the Panda algorithm, but none of the resulting claims had been supported by evidence at the 99% statistical significance level, until now.

## Challenge

While Google has been commendable enough to allude to Panda's ingredients in a number of articles on its Google Webmaster Central blog, it does not reveal which of those ingredients are the most active, or how to use them optimally in website design and content.

The study gave us an opportunity to explore which of these ingredients had a causal effect on SEO traffic, and the weighting of those ingredients, so that SEOs could explain and respond to Panda 4.0 in a predictable manner.

## Solution

To deconstruct Panda, we segmented the sites in our data set into groups that, as a result of Panda, had:

- An increase in traffic
- A decrease in traffic
- No change
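This segmentation step can be sketched as follows. The 5% threshold, function name, and traffic figures are our own illustrative assumptions, not MathSight's actual criteria:

```python
def segment_site(visits_before, visits_after, threshold=0.05):
    """Classify a site by its relative traffic change across the Panda 4.0 date.

    threshold: minimum relative change treated as a real movement
    (5% is an illustrative assumption, not MathSight's actual cut-off).
    """
    change = (visits_after - visits_before) / visits_before
    if change > threshold:
        return "increase"
    if change < -threshold:
        return "decrease"
    return "no change"

# Example: average daily visits before and after 20 May 2014 (made-up numbers)
print(segment_site(10_000, 12_500))  # → increase
print(segment_site(10_000, 9_900))   # → no change
```

In practice the cut-off would itself be justified statistically, which is where the Bayesian analysis below comes in.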

If there was a change in traffic, we used Bayesian mathematics to calculate the likelihood that the change was in fact due to an event that happened on 20 May 2014, i.e. Panda 4.0 and nothing else.  Once the sites were segmented, we were able to start analysing for possible causal factors.
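One simple way to put a Bayesian likelihood on "the traffic rate really changed at the Panda date" is a conjugate Poisson–Gamma model over daily visit counts either side of 20 May 2014. The model, prior, and sample data below are our own illustrative assumptions, not MathSight's actual method:

```python
import random

def prob_rate_dropped(before_counts, after_counts, n_samples=20_000, seed=0):
    """Posterior probability that the underlying daily traffic rate fell
    after the changepoint (20 May 2014), under a simple Poisson model of
    daily visits with a Gamma(1, 1) prior on each period's rate.
    """
    rng = random.Random(seed)
    # Gamma posterior: shape = prior shape + total visits, rate = prior rate + days
    shape_b, rate_b = 1 + sum(before_counts), 1 + len(before_counts)
    shape_a, rate_a = 1 + sum(after_counts), 1 + len(after_counts)
    drops = 0
    for _ in range(n_samples):
        # gammavariate takes (shape, scale); scale = 1 / rate
        if rng.gammavariate(shape_a, 1 / rate_a) < rng.gammavariate(shape_b, 1 / rate_b):
            drops += 1
    return drops / n_samples

# A week of daily visits either side of the Panda 4.0 date (made-up numbers)
before = [510, 495, 502, 498, 505, 490, 500]
after = [430, 425, 440, 418, 435, 428, 422]
print(prob_rate_dropped(before, after))  # close to 1.0 for a clear drop
```

A posterior probability near 1 (or near 0) marks a site for the "decrease" (or "increase") group; values near 0.5 suggest no real change at the Panda date.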

## Process

MathSight didn't know what Panda was looking for, so we decomposed website architecture and content into a number of unique, MathSight-defined candidate signals, such as the use of rare words, the number of external CSS files, the use of paragraphs, and so on.

The next step was to use mean differences to spot patterns directly associated with the change in traffic following Panda.  Some signals may look like patterns but could be mere coincidence, so we analysed each candidate signal using ANOVA (analysis of variance).

If a signal's ANOVA test returned a p-value of 5% or lower, we treated its association with the change in Google traffic as statistically significant at the 95% (or higher) level, making it a plausible causal factor for a change in traffic as a result of Panda.
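A one-way ANOVA F-statistic for a candidate signal across the three traffic groups can be computed by hand as below. The signal and per-site values are invented for illustration:

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F-statistic for a list of sample groups.

    F = (between-group mean square) / (within-group mean square); a large F,
    and hence a small p-value from the F distribution, means the group means
    are unlikely to differ by chance alone.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares (k - 1 degrees of freedom)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares (n - k degrees of freedom)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Candidate signal (e.g. rare-word count) per site, by Panda traffic group
increased = [42, 45, 44, 47, 43]
decreased = [30, 28, 33, 29, 31]
no_change = [36, 38, 35, 37, 39]
print(one_way_anova_f([increased, decreased, no_change]))
```

The p-value then comes from the F distribution with (k - 1, n - k) degrees of freedom, for example via `scipy.stats.f_oneway`, which returns both the F-statistic and the p-value directly.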

From the dataset, we also found the HTML character distance limit for content.

We analysed over 200,000 visits' worth of data per website, to ensure the findings were repeatable and robust, should a third-party data scientist conduct a similar study on an identical selection of websites.

## Results

The key stand-out signal for MathSight was the proximity of the start of a website's on-page copy to the top of the page.  This was measured as the HTML character distance between the first paragraph of body copy and the top of the source code.
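Measuring that distance is straightforward. A minimal sketch, assuming the first `<p>` tag marks the start of the body copy (the sample HTML is our own):

```python
import re

def copy_distance(html):
    """Character offset of the first <p> tag from the top of the HTML source,
    or -1 if no paragraph is found."""
    # Match "<p" followed by whitespace or ">", so <pre>, <param> etc. don't count
    match = re.search(r"<p[\s>]", html, re.IGNORECASE)
    return match.start() if match else -1

page = "<html><head><title>Demo</title></head><body><p>Body copy here.</p></body></html>"
print(copy_distance(page))  # → 44
```

Real pages put navigation, scripts, and inline styles ahead of the copy, so this offset can run to tens of thousands of characters; the study's finding is that smaller is better.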

Our analysis established that Google is rewarding sites that position their content in an immediately readable position: the further the body copy sits from the top of the source code, the more the site was penalised.