I’ve read seemingly hundreds of forum posts discussing duplicate content, none of which gave the full picture, leaving me with more questions than answers. I decided to spend some time doing research to find out exactly what goes on behind the scenes. Here is what I found.

Most people are under the assumption that duplicate content is looked at on the page level when in fact it is far more complex than that. Simply saying that “by changing 25 percent of the text on a page it is no longer duplicate content” is not a true or accurate statement. Lets examine why that is.

To gain some understanding we need to take a look at the k-shingle algorithm that may or may not be in use by the major search engines (my money is that it is in use). I’ve seen the following used as an example so lets use it here as well.

Let’s suppose that you have a page that contains the following text:

The swift brown fox jumped over the lazy dog.

Before we get to this point the search engine has already stripped all tags and html from the page leaving just this plain text behind for us to take a look at.

The shingling algorithm essentially finds word groups within a body of text in order to determine the uniqueness of the text. The first thing they do is strip out all stop words like and, the, of, to. They also strip out all fill words, leaving us only with action words which are considered the core of the content. Once this is done the following “shingles” are created from the above text. (i’m going to include the stop words for simplicity)

The swift brown fox
swift brown fox jumped
brown fox jumped over
fox jumped over the
jumped over the lazy
over the lazy dog

These are essentially like unique fingerprints that identify this block of text. The search engine can now compare this “fingerprint” to other pages in an attempt to find duplicate content. As duplicates are found a “duplicate content” score is assigned to the page. If too many “fingerprints” match other documents the score becomes high enough that the search engines flag the page as duplicate content thus sending it to supplemental hell or worse deleting it from their index completely.
My old lady swears that she saw the lazy dog jump over the swift brown fox.

The above gives us the following shingles.
my old lady swears
old lady swears that
lady swears that she
swears that she saw
that she saw the

she saw the lazy
saw the lazy dog
the lazy dog jump
lazy dog jump over
dog jump over the
jump over the swift
over the swift brown
the swift brown fox

Comparing these two sets of shingles we can see that only one matches (”the swift brown fox“). Thus it is unlikely that these two documents are duplicates of one another. No one but google knows what the percentage match must be for these two documents to be considered duplicates, but some thorough testing would sure narrow it down ;).

So what can we take away from the above examples? First and foremost we quickly begin to realize that duplicate content is far more difficult than saying “document A and document B are 50 percent similar”. Second we can see that people adding “stop words” and “filler words” to avoid duplicate content are largely wasting their time. It’s the “action” words that should be the focus. Changing action words without altering the meaning of a body of text may very well be enough to get past these algorithms. Then again there may be other mechanisms at work that we can’t yet see rendering that impossible as well. I suggest experimenting and finding what works for you in your situation.

Author: Black Hat Cloaker