How To Beat Proxy Hi-Jackers, and Have Fun While Doing It
What Are Proxy Hijackers?
Proxy hijackers are MySpace/YouTube/Facebook CGI/PHP-based proxies that people use to get around the website blocks their schools put up. They are monetized by getting the content they relay indexed (as if it’s their own) and throwing up AdSense ads and the like on the resulting pages.
Obligatory Disclaimer:
I have never tested this. I have used the proxies to dig around, to the point where I’m reasonably sure it would work. This is for educational purposes, and I’m unsure of the legality. I do not recommend you do it. But perhaps before you start your own obnoxious-ass proxy, you’ll think twice. Beyond that, I’m not 100% sure this works. I’ve seen a few examples of it accidentally occurring in the wild (a concerted effort will have more effect), so I’d say I’m 99% sure… Just use it to get your own mental gears turning. I don’t feel like trouble.
XMCP, Why Do you Hate Proxy Hijackers?
For me, the redeeming factor of blog farms is the creativity that goes into creating them. Spinning unique content, or for that matter modifying it so as not to run into dupe content filters, interests me, and requires a certain skill to do properly (that said, at this time I do not run a blog farm, but have several thousand registered just in case).
CGI-Proxies require NO skill to create. Just a little bit of time to market, and a buttload of bandwidth on cheap hosting. Beyond that, they DESTROY dupe content rules by copying the HTML 100%, making it an exact dupe. Sometimes it appears ahead of the real site in SERPs (a rare feat for most splogs). So: no skill, much more intense penalties for dupe content, and somewhat obnoxious owners (though not in all cases). I see no point in creating something you don’t learn from. And they jack my content (you know who you are, and you have only yourself to blame for this entry). Oh yeah, and often they strip out MY ads, replacing them with theirs.
An awesome article is available over at seofaststart about it. He lists a fix to not get de-listed as a result of these sites. This method is a way to stop them from putting the dupe content out there in the first place, by overworking them. It kind of exceeds the whole de-listing thing.
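For reference, one widely documented webmaster-side defense is to check whether a visitor claiming to be Googlebot really is Googlebot, using a reverse DNS lookup followed by a forward lookup. I can't say whether this is the exact fix that article lists, so treat the following as a minimal sketch under that assumption. The function name and the injectable lookup parameters are made up for illustration.

```python
import socket

# Hostname suffixes Google documents for its crawlers.
GOOGLE_CRAWLER_SUFFIXES = (".googlebot.com", ".google.com")


def is_real_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True only if `ip` reverse-resolves to a Google crawler
    hostname AND that hostname forward-resolves back to `ip`.

    The two lookup parameters are hypothetical hooks so the logic can be
    exercised without touching the network.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip)
        if not host.endswith(GOOGLE_CRAWLER_SUFFIXES):
            return False
        # Forward-confirm: anyone can fake a reverse DNS record.
        return ip in forward_lookup(host)
    except (socket.herror, socket.gaierror):
        return False
```

A request relayed by a proxy arrives from the proxy's IP, so it fails this check even when it carries a Googlebot user agent. What you do with a failed check (serve a noindex page, block it, whatever) is up to you.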
The Weaknesses of CGI-Proxies
They have a few weaknesses that make them fun to play with.
- They Use a Lot of Processor/Bandwidth - Their server limitations are their weakness. They provide a low profit, so the greatest challenge of proxy operation is that they need a lot of bandwidth, and a lot of processor time, to operate. So people normally run just barely under their hosting limits. Sometimes dozens run on one VPS.
- Their Owners are Not Technically Minded (for the most part) - These guys are not going to update their software. They do not foresee the problems with scraping angry webmasters’ content without modifying it enough to pass dupe inspection.
- Duplicate Content - While their greatest strength is dupe content (it’s where their traffic comes from), it is equally their greatest weakness. Within one site, Google does not enjoy seeing dozens of copies of the same site. Even worse if the HTML is identical. Even worse if the IP is the same (in my experience). We’ll get into this later.
- They Update Pages When The Search Engine Crawls Them - Whenever Google checks a page, proxies go back to that page and relay the content, often acting like Googlebot. Yeah. It’s a bandwidth/CPU hog. Heh.
- They Directly Modify ALL Links on a Given Page - If you searched Google through proxyexample1.com (if that existed), it would replace all the URLs in Google’s search results with its own URLs, so that you stay within the proxy. If Google checks on it later, it will find a plethora of “internal” URLs of jacked content.
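To make that last weakness concrete, here is a rough sketch of the kind of link rewriting a web proxy does. It is not taken from any real proxy script. The `browse?u=` URL scheme, `PROXY_BASE`, and `rewrite_links` are all assumptions for illustration, and real proxies often use opaque encoded paths instead.

```python
import re
from urllib.parse import quote, urljoin

# Hypothetical proxy endpoint; real proxies use their own URL schemes.
PROXY_BASE = "http://proxy1.com/browse?u="


def rewrite_links(html, page_url, proxy_base=PROXY_BASE):
    """Point every double-quoted href in `html` back through the proxy."""
    def repl(match):
        # Resolve relative links against the page they came from.
        absolute = urljoin(page_url, match.group(2))
        return match.group(1) + proxy_base + quote(absolute, safe="") + match.group(3)

    return re.sub(r'(href=")([^"]*)(")', repl, html)
```

Every link on the relayed page now lives on the proxy's own domain, which is why a crawler sees it as internal content.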
The Concept
Have you ever wondered what would happen if Google undid its robots.txt and started crawling itself? Especially its own cache? What about when it got around to the cache of that cache? Yeah. It goes on infinitely. That’s what we’re going to do with proxies.
Could You Make This Method Simple?
The method I’m about to describe could be made simpler. It can actually be done with one proxy, but if I did that, it would make it far too easy for the creators to update their software to protect against this. So we’re going to use 2 different proxies for it, and make it unblockable.
The Process
- Step 1: Find your proxies. A search for myspace proxy, youtube proxy, etc. is more than enough to find these sites. Click a link. Look around the page to find out if they link to their own other proxies on that page. This technique is a lot more effective if the proxies are on the same hosting account. Copy both of these URLs down. If you can’t find 2 on one page, just pick another from the search engine. For this example, we’ll use unblock*.tld (I’m avoiding saying the TLD to prevent giving them traffic, and for legal shit) and myspaceproxy*.tld (henceforth referred to as proxy1.com and proxy2.com).
- Step 2: Use one proxy to visit the other. So we have gone to proxy1.com, and viewed proxy2.com with it.
- Step 3: We now use our 2-layer proxy setup to visit our dear friend Google. Go to advanced search, and tell it to return only results indexed within the past week/month (it can be upped or lowered if you’d like). Your search query is (w/o quotes) “site:proxy1.com”. This returns everything indexed within that period. Now, Google truncates the search result URL that is displayed. Copy it if you’d like, and compare it to the same search done without the proxy. The URLs start the same, but end with several dozen more characters. That means they’re indexable. As is the Google cache. As are the various pages of the search results. In some cases, the translation feature is indexable as well.
- Step 4: Post the link somewhere. A WordPress blog you started for this, your own blog (although I’m unsure of how Google looks at this), wherever. Even manually submit it to Google. Just make sure it gets indexed.
- Step 5: Wait.
What Does this Achieve?
Allow me to explain. When Google finally crawls its own search result page through the proxy, one of two things will happen, depending on whether (and for how long) the proxy caches the data.
- If the proxy does NOT cache data (which most don’t), then the infinitely-looping dupe content issue described further down still applies (you’ll get there in a second). In addition, it should go like this:
Proxy1 Loads Proxy2 which loads the search result page.
The moment Google attempts to crawl one of the unique URLs in the results page, it should query proxy1, which queries proxy2, which has to query proxy1 again to get the page, and then THAT will return the real content.
But wait, there’s more.
As this effect compounds on itself, the number of queries per request keeps growing with every layer. After there are 6 layers of unique URLs, for example, it will suddenly be executing 12 CPU/bandwidth-intensive queries to get back to that original page every single time Google crawls them. Effectively hammering that poor shared hosting account that the proxy was using. The effect continues until the proxy caves in on itself, or Google decides to stop crawling.
- If the proxy caches data (or not), it will simply return the search result page of its own content (which will eventually expire and change), and then Google will start indexing duplicate copies of the site’s content. Each time it does that, it will create yet another piece of duplicate content that will discredit the site. This will continue for a bit, until eventually the time period we set (7-30 days) on our search expires. Then, our new pages (AKA Google’s search result pages) will begin to show up everywhere. When it recrawls those, it will go through the FIRST proxy we used. That will change the URL.
For example: If the 1st entry in the search engine is http://proxy1.com/94hhrnabnab08u4hbjka, the second proxy (the one we were browsing through the first proxy) alters that to http://proxy2.com/gjonrae0u409nkjbvna. But having received that, proxy1.com switches the URL back to its own domain, and since it doesn’t recognize the unique URL proxy2.com just created (as a result of the original unique URL that was in the search results), it creates yet another unique URL to index. This continues in an infinite loop.
Extra Points: Eventually they’ll also crawl their own caches, and the cached copies Google keeps of their own result pages.
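As a back-of-the-envelope check on the "6 layers, 12 queries" figure, here is a toy model that never touches the network. It assumes each layer wraps a URL once in each of the two proxies, and that the wrapped URL is carried in a `u=` parameter. Both are simplifications of mine, as are the names `wrap` and `count_proxy_fetches`.

```python
from urllib.parse import quote, unquote

# Hypothetical URL schemes for the two example proxies.
PROXY1 = "http://proxy1.com/browse?u="
PROXY2 = "http://proxy2.com/browse?u="


def wrap(url, layers):
    """Wrap `url` in `layers` rounds of proxy2-inside-proxy1."""
    for _ in range(layers):
        url = PROXY2 + quote(url, safe="")
        url = PROXY1 + quote(url, safe="")
    return url


def count_proxy_fetches(url):
    """Count how many proxy hops are needed to reach the real page."""
    hops = 0
    while True:
        for prefix in (PROXY1, PROXY2):
            if url.startswith(prefix):
                url = unquote(url[len(prefix):])
                hops += 1
                break
        else:
            return hops


print(count_proxy_fetches(wrap("http://example.com/page", 6)))
```

Under these assumptions each layer adds two proxy fetches per crawl, so the per-request cost keeps climbing for as long as new layers get indexed.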
Why Use Multiple Proxies to Achieve This?
If you use just 1 proxy, the proxy author could update the software and teach it to recognize its own domain and reuse an old link. With 2 proxies, they’d need a list of every single proxy on the internet in order to stop this technique.
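From the proxy author's side, the "recognize its own domain" fix might look something like the sketch below. `OWN_HOSTS` and `targets_self` are hypothetical names, not part of any real proxy script.

```python
from urllib.parse import urlsplit

# Hypothetical: the hostnames this proxy itself answers on.
OWN_HOSTS = {"proxy1.com", "www.proxy1.com"}


def targets_self(target_url, own_hosts=OWN_HOSTS):
    """Return True if the requested target is this proxy's own domain."""
    host = (urlsplit(target_url).hostname or "").lower()
    return host in own_hosts
```

A check like this catches http://proxy1.com/... being requested through proxy1.com, but it says nothing about http://proxy2.com/..., because it only knows its own hostnames.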
What Happens if They Just Stop Modifying Google Searches?
Then the fun is not over. Google’s search box is ALL over the web. Find one of those, and do the same thing I described here.
OMGZ YOU SUCK I <3 MY PROXYYYYY
Then learn to code and fix it. As a blackhat, I’d be a hypocrite if I spoke out against annoying people online. But that doesn’t mean I can’t tell you to adapt your shit to deal with this.
December 27th, 2007 at 6:06 pm
Nice “educational” information. As a black hat, I’m sure you know plenty of ways to automate “Step 4.” Educationally speaking, that’s something you’d want to be able to do.
Links are the key - if all you do is fire enough links to get the first page crawled, there won’t be a loop. To really bring a proxy down, you would need to push enough PageRank in to get the links on the proxied URLs spidered.
Not that anyone should do this potentially illegal thing. My comments are intended for entertainment purposes only.
December 27th, 2007 at 6:11 pm
Maybe I have misunderstood this post, but I don’t get the problem or the solution. I believe most proxies don’t actually cache the content, they just fetch it for the user, their pages aren’t cached in Google. Also I believe most proxy scripts don’t allow you to surf via another proxy and have timeout rules to prevent infinite loops.
As far as the ad swapping goes, I’m almost positive this is against Google’s TOS, if not against the law.
However I primarily use PHProxy, not CGIproxy so maybe it’s different.
December 27th, 2007 at 8:32 pm
Some definitely swap. And the problem is when you clone a site 100%, down to its HTML, Google reacts really terribly to the dupe content. Beyond that, it’s obnoxious to have carbon clones of sites floating around out there, and I don’t respect proxies. They require no skill/ingenuity.
It’s a complicated post though, and at times I’m not terribly articulate.
Pretty much, it makes the proxy get locked into a loop with another proxy, each fetching content from each other. It eventually terminates, but puts progressively more strain on the server, until it cracks.
December 27th, 2007 at 9:17 pm
@Dan: heh I suppose you could bring it down to hell with the sheer force of 1200 or so bots descending upon the site via a lot of dropped links.
But that takes a lot of the technical fun and wizardry out of it! For some reason I really enjoy the concept of them devouring each other in a cannibalistic dance of doom.
December 28th, 2007 at 7:50 am
Check my article on this. PHP code included.
December 28th, 2007 at 10:55 am
@Subliminal: …whore
December 28th, 2007 at 11:33 am
I do not think that this will work, simply because these sites use URL encoding. If you noticed, proxy1.com/ajashd398uhdja, when entered directly in the URL box, will not return the full proxified content. It will simply return an error.
(I block my websites from this kind of thing, and I do not scrape any content from any sites. Heck, I am capable of writing a few paragraphs myself, and lastly, I don’t run my proxies on the limit of the VPS :))
December 28th, 2007 at 12:02 pm
I’ve fixed this problem long before you was born … dude
December 28th, 2007 at 12:11 pm
I know there’s a bunch of different ways to do it. Never denied that. This was just my fun way
Also, I’d say your way isn’t working right now, as there’s still thousands out there.
I’ll give you a hint. Not all is said in the post.
Sign on to messenger though. I do have a quick question for ya. Kind of what I need to bring everything together.
December 28th, 2007 at 12:25 pm
@Akshay
sounds like you’re running a very different breed of proxy. But no, there’s no post request involved. The popular proxies today create a static URL for each URL that is browsed, and this is what gets crawled.
So by feeding it ANOTHER proxy’s unique (static) URL, it does not realize it cached before.
If you want more information, drop some instant messenger contact info. I’ll delete it quick as I can before anyone else sees it(hopefully), and I’ll hit you up to explain it further.
December 28th, 2007 at 2:19 pm
Another great post, although I’ve always found he with the most links wins despite duplicate content / proxies.
December 28th, 2007 at 2:32 pm
Well, he’s using the same keywords as before, since he’s crawling himself. So chances are, he won’t pick up any longtail. And after Google begins indexing your evilness, you can drop the backlink you gave him.
Either way, most proxies don’t cache, so they’ll just cave in on themselves. Of course, a little nudging helps.
December 31st, 2007 at 12:01 pm
You got it all wrong. Most of those proxies will not work for this attack because they redirect you to the homepage if your referrer header is not the proxy domain. This is done via htaccess, so even Googlebot will be redirected to the homepage, meaning the page will not get indexed. If there is something I missed, let me know.
December 31st, 2007 at 12:14 pm
On the ones I tested against, you’re allowed to copy/paste the URL that’s given for each page, with no referrer, and it went fine.
Old school proxies were the way you describe, from my understanding. But the new ones bind to the URL so they get indexed. I’ll hit you up on skype later today to explain more.
January 3rd, 2008 at 7:35 am
[…] The whole thing is explained in a bit more detail, in English, here: How To Beat Proxy Hi-Jackers, and Have Fun While Doing It. […]