Duplicate Context in Link Building and Automation
Ok, so normally on here I only post things that I have confirmed through a test. The following entry is a mixture of common knowledge on duplicate content and a bit of speculation supported by tests, but not normally to the degree I like to do them. However, given the logical sense it makes and the support of the little evidence I had available, I figured I’d do a post on it. I’ll try and remember to update when I get a chance to absolutely confirm this. And by the way, yes I mean “duplicate context” instead of content, as I’m referring to the context of the link, not the total available content on the page.
If this entry is common knowledge, I apologize. It’s something I never really considered the full implications of before (especially in the context of BH), so I’m assuming others are the same way.
Introduction
While I’ve long held that Google’s duplicate content filters are easily defeated, I’ve never truly examined their implications for a large scale link building effort. First, it may help to examine the major issues Google has with link building at the present time.
- Link Spam
- Mass Social Bookmarking/Submissions
- Blog Farms/Autoblogs
- Large scale article syndication
- Spammy Web Directories
Now, accepting that, let’s acknowledge one thing. Most sites operate off of a template. They have static and dynamic sections. For example, within my blog, most of the HTML going into each page is identical. The exceptions are the content between a certain set of div tags (my post) and the title of the article. How easy would it be for Google to isolate the dynamic sections, and only use those for dupe content checks?
Now, imagine they isolate in that fashion, and then break it down to text immediately surrounding the link.
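To make the static vs. dynamic idea concrete, here is a rough sketch of how two pages built from the same template could be compared line by line so that only the differing parts remain. This is purely illustrative and not a claim about how Google does it: the function name `dynamic_lines` and the sample pages are made up, and a real system would presumably work on a parsed DOM rather than raw lines.

```python
import difflib


def dynamic_lines(page_a, page_b):
    """Return the lines that differ between two pages sharing a template.

    Lines common to both pages are treated as the static template;
    whatever is left over is treated as the dynamic section.
    """
    a = page_a.splitlines()
    b = page_b.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    only_a, only_b = [], []
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        if op != "equal":
            only_a.extend(a[a1:a2])
            only_b.extend(b[b1:b2])
    return only_a, only_b


# Two hypothetical pages from the same (made up) blog template.
PAGE_ONE = "\n".join([
    "<html>",
    "<div id='nav'>Home</div>",
    "<h1>Post one</h1>",
    "<div class='post'>First post text</div>",
    "<div id='footer'>Footer</div>",
    "</html>",
])
PAGE_TWO = "\n".join([
    "<html>",
    "<div id='nav'>Home</div>",
    "<h1>Post two</h1>",
    "<div class='post'>Second post text</div>",
    "<div id='footer'>Footer</div>",
    "</html>",
])

if __name__ == "__main__":
    first, second = dynamic_lines(PAGE_ONE, PAGE_TWO)
    print(first)
    print(second)
```

With the sample pages, only the title and post lines survive; the nav and footer drop out as template.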
In the past, I’ve assumed the entire page is checked for dupe content, so the extraneous text in any given site would help to eliminate the duplicate content issue so long as the message/text that I was duping was short.
I’m beginning to think this is not the case.
By isolating the surrounding text in each of the above “forbidden link methods” you could catch 75%+ of link spam/questionable links.
My Current Belief
I’m beginning to think that Google is comparing the context of each link to a site with the context of its other links, much more strongly than before, to determine if the same person is spawning the links.
How Would This Method Catch The “Shady” Tactics?
Of the above forbidden tactics, nearly all could be caught by examining the link text and page titles immediately surrounding the link itself.
- Link Spam – Most people do not macro (shuffle) the text in their link spam much. On average, I’d say people have 10 potential messages for what is, in many cases, 10k+ initial posts. This truly opens it up for contextual duplicate content near their link.
- Blog Spam – Relatively constant class names within WordPress blogs identify comments, making it easy to separate them from truly contextual, in-post links. This is perhaps the hardest to catch of the sketchy tactics, given the amount of other “dynamic” sections on the page.
- Message Board Spam – This is actually the beast that initiated this post. Think about it. Typically the posts will only have a few potential titles. Having titles that are nearly identical on almost all your posts? That’s an issue. Beyond that, the message you send (immediately surrounding the link) is the only content that is different on your spammed page vs. every other page.
So to make this more clear, every single piece of HTML on each thread of each message board is identical to the other threads. The only thing that’s different is the actual post. So by isolating the only section that’s different and comparing it vs. your other links…yeah. It’s easy to see similar context for each post.
- Mass Social Bookmarking - It’s no secret that people sell social bookmarks. Hundreds of them. For dirt cheap. Now, they almost all have a few fields they ask for. Title, tag, url, description. Now, tag and url are negligible in terms of duplicate context. Title normally becomes anchor text, so that obviously has to change sometimes. But the description? Congrats, that’s the same similar context issue that appears above in the “Message Board Spam” section. Effectively a way to discredit individual pages on an otherwise trusted domain.
- Blog Farms/Mass Article Syndication – For the sake of this article, we’re talking about the ghetto kind of blog farm that reads in a single RSS feed owned by the blog farm owner, then syndicates that article to 400+ blogs. Not the advanced kind. But obviously this is duplicate content, and as such the link context is identical as well.
- Spammy Directories – Spammy directories normally fit into one of two situations. They’re either mirrors of each other (which is easily identified by identical link profiles) or they are automatically submitted to. So once again, same problem as with standard link spam: the context of the link stays identical. The descriptions are not varied enough.
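For a sense of how cheap this kind of comparison is, here is a minimal sketch that flags pairs of link contexts sharing most of their wording. It uses word shingles and Jaccard similarity, which is a standard near-duplicate technique and only my stand-in here; nothing in this post confirms what Google actually uses. The function names, the shingle size `k=3`, the `0.5` threshold and the sample texts are all assumptions for illustration.

```python
from itertools import combinations


def shingles(text, k=3):
    """Set of overlapping k-word sequences in the text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Shared shingles divided by total distinct shingles."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(contexts, threshold=0.5, k=3):
    """Index pairs of contexts whose similarity meets the threshold."""
    sets = [shingles(c, k) for c in contexts]
    return [
        (i, j)
        for i, j in combinations(range(len(sets)), 2)
        if jaccard(sets[i], sets[j]) >= threshold
    ]


# Hypothetical text surrounding three links to the same site.
SAMPLE_CONTEXTS = [
    "Great thread, check out my site for more info on this",
    "Great thread, check out my site for more info on this topic",
    "I disagree with the second point about how templates are parsed",
]

if __name__ == "__main__":
    print(near_duplicate_pairs(SAMPLE_CONTEXTS))
```

The first two sample contexts differ by a single trailing word, so they get paired up; the third shares no three-word run with either and is left alone.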
Isn’t This a Problem With Automation?
Absolutely. Something I’ve been exploring lately is semi-automation. That is to say tiny batches of automatically submitted links that are designed to be actually relevant to whatever it is you’re commenting on. But that’s an entry for another day. Anywho, this is a problem with automation that has plagued online sketchballs like myself since the dawn of internet marketing. But it’s not impossible to get around.
So How Do We Get Around This?
There are a few potential methods I’m testing out. Here’s a quick rundown. Whitehats out there, you might want to take this stuff in the context of social media submissions and link buying and whatnot.
- Eliminate the Content Altogether – Posting JUST a link. Let the other content on the page serve as the context. However, this will not even get past gullible webmasters, so it’s limited to sites that are more or less abandoned.
- Random Text Jumble – Perhaps the most efficient of the possibilities, you could always just throw in say 40 keywords from a dictionary file. However, this will increase the possibility that people will find the page via longtail searches and report it. Beyond that, it once again will not even get past gullible webmasters.
- Let’s Talk Macros – As much as I hate to bring up spam email in a post about BH, spam emailers were essentially the founders of the word shuffling macro, and I feel they make a great case study on how to change up the content and context. Especially for things like link spam where dodging statistical filters is a necessity.
Everyone jokes about e-mail spammers not being able to speak coherently. Now back in the day I knew several of these chaps, and let me assure you there are some who are incredibly articulate and intelligent.
But they had to have thousands of potential messages from their template to get past statistical filters. Let’s examine what a potential mortgage spam would look like.
Anything inside of the { } and separated by commas is a word or phrase that may be substituted into that place in the sentence.
{Greetings, Hello, Hi, Howdy Doody, Hey} {Consumer, Home Owner, Respected Customer},
{How are you, How’s it going, How’s life} {on this fine day, today, lately}?
{I’m, We} are {emailing, messaging, contacting} you {today, on this day, now} because {we, I} have a {special, exclusive, unique} {offer, promotion, opportunity} available {in your, only in your, in your current} {area, locale, residential area}.
{Mortgage, Home Mortgage} rates {up to, at least, often} {10%,20%,30%,40%,5%} {better, superior, nicer} than your {current, present} {rate, company offers}{!,!!,!!!} {Sound Good, Seem good, Good enough for you}? URL HERE
You get the idea. But by employing similar care in the creation of any automatic link building method, we can have a truly incredible number of possibilities for the message to shuffle up the context surrounding the link.
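To put a number on “truly incredible,” the count of possible messages is just the product of the number of options in each { } group. Here is a small sketch of that arithmetic. The helper name `count_variants` is made up, it assumes flat (non-nested) groups with comma-separated options like the template above, and the result is an upper bound, since two different picks can occasionally read the same.

```python
import re
from math import prod

# Matches one flat {option, option, ...} group (no nesting).
GROUP = re.compile(r"\{([^{}]*)\}")


def count_variants(template):
    """Upper bound on distinct messages a shuffle template can produce."""
    sizes = []
    for match in GROUP.finditer(template):
        options = {opt.strip() for opt in match.group(1).split(",")}
        sizes.append(len(options))
    return prod(sizes)


# The greeting line from the template above: 5 openers x 3 recipients.
GREETING = (
    "{Greetings, Hello, Hi, Howdy Doody, Hey} "
    "{Consumer, Home Owner, Respected Customer},"
)

if __name__ == "__main__":
    print(count_variants(GREETING))
```

The greeting line alone gives 5 × 3 = 15 combinations, and every additional group multiplies the total again, which is why even a modest template outruns a fixed set of 10 messages so quickly.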
What Evidence Do You Have to Support This?
Well, first off, Google admits to using class names and whatnot to footprint sites and try to find their spammy brethren. This seems like a natural extension of that. But beyond that, in my most recent test (same niche, different template, different content, same link targets, almost identical inbound links, all links dropped within 24 hours and sites indexed within 3 hours of each other), I tried the randomized method of message board posting on one site, and only 2 possible messages on another. Despite being in a pretty easy niche, the site with 2 possible messages managed to not rank. At all. With ~800 backlinks. With the randomized post content, the other site was ranking nearly twice as high.
I’m definitely going to run some more tests before I confirm this, and that may take a bit.
My blackhat network is currently in the testing/re-building phase since I have sustainable income from PPC for the time, and want to truly apply my knowledge to the blackhat sites I’m making. So yeah. That will delay confirmation on this thing.
Conclusion
Given this one hit wonder could clean up a huge percentage of Google’s self-described spammy links, I’d be amazed if it wasn’t currently in practice. Segmenting a page by <div> or <p> tags is simple to do, and has a massive potential payoff for them.
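As a rough illustration of how little code that segmentation takes, here is a sketch using only Python’s standard library `html.parser`. It pulls out the text of the innermost <div> or <p> around each link. The names `LinkContextParser`, `link_contexts` and `BLOCK_TAGS` are hypothetical, the sample markup is made up, and it assumes well-nested markup with explicit closing tags, which real-world pages often don’t have.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "p"}


class LinkContextParser(HTMLParser):
    """Collect the text of the innermost <div>/<p> block around each link."""

    def __init__(self):
        super().__init__()
        self.stack = []     # one (text_parts, hrefs) pair per open block
        self.contexts = []  # resulting (href, surrounding text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append(([], []))
        elif tag == "a" and self.stack:
            href = dict(attrs).get("href")
            if href:
                self.stack[-1][1].append(href)

    def handle_data(self, data):
        # Text is credited to the innermost open block only.
        if self.stack:
            self.stack[-1][0].append(data)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.stack:
            parts, hrefs = self.stack.pop()
            text = " ".join("".join(parts).split())  # normalize whitespace
            for href in hrefs:
                self.contexts.append((href, text))


def link_contexts(html):
    """Return (href, surrounding block text) for each link in the markup."""
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    return parser.contexts


SAMPLE_HTML = (
    '<div class="sidebar">Archive</div>'
    '<div class="post"><p>Visit my '
    '<a href="http://example.com/">great site</a> today!</p></div>'
)

if __name__ == "__main__":
    print(link_contexts(SAMPLE_HTML))
```

The sidebar block contains no link, so it is dropped, and the only context left is the paragraph wrapped around the link.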
The problem with any mass deployment is and always will be coherent/convincing content. It must be hand written or copied from somewhere. This greatly reduces the ability to mass deploy. We can only segment the content in so many ways, and Google has an equal (or greater) ability to do the same. If they do indeed treat pages like this, some serious advancements may be needed soon.


May 27th, 2008 at 10:21 am
Awesome post! Google is very sophisticated at some kinds of analysis, and amazingly oblivious to some techniques.
May 27th, 2008 at 12:28 pm
Great post dude.
May 27th, 2008 at 1:17 pm
Interesting post, thanks.
Perhaps another approach you could take to randomize content is that used by Craigslist spammers: inserting a bunch of random text nearby and changing the font color to make it practically invisible, or insert junk html tags with random text inside and mix those into the post. Like this, replacing ~ with html tags of course:
Make ~rainbows~ sure that ~elephants are big~ you ~enchiladas~ visit my ~harold sawyer~ great ~chilled wine~ site!
This of course assumes that you have the ability to use html tags in the places you’re spamming.
May 27th, 2008 at 5:18 pm
$5 says Google hires you to replace Matt Cutts to fight SPAM
May 27th, 2008 at 5:44 pm
I built the above site (click name) a while ago which randomizes the title and description of bookmarking submissions. I’d like to hear your thoughts on this.
May 28th, 2008 at 12:10 am
Funny you should bring that up. I’m working with a dev on a plugin that does some similar stuff. Tell you more on msn.
May 29th, 2008 at 10:31 am
Very interesting post and it got my wheels turning. Brendan I will also be testing out the PostToaster.
August 20th, 2008 at 12:50 am
Very interesting information, need to experiment, thanks.