New ShadyGenV2.0b (Site Generator) Almost Completed

    Ok, so what with exams and all, I haven’t had a lot of time to work lately, which is a shame. Those of you who know me know I stripped apart my software a while back, and am in the process of re-writing it. Good for tech, BAD for profits. So I got out of exams today, and got hackin’ away at my new scraper/site generator. Thought y’all might enjoy the feature list. There are 3 different components right now, and 2 will be added later, for version 3.0.

    No, this is not for sale. This is a post I’m doing because I’m still taking exams, and can’t write a fully-fledged entry. But yeah, it still rocks.

    Anything NOT yet completed is in italics.

    • ShadyGen
      • The code to each and every page is exactly the same, but renders completely differently.
      • Each newly generated site gets its own, custom generated template.
        • All templates are randomized, but stay within the acceptable bounds of the web. Contrast is proper, the number of table/frame columns is normal (1-3), and the data is placed in proper sections.
        • Randomizes where the line breaks within the source code occur, so even the source itself is difficult to fingerprint.
        • Generates random, ever changing colors, with certain controls in place to ensure contrast stays good.
        • Randomly names, renames, relocates, and aligns images. Once again, this retains proper web-design etiquette; for example, it will not set a large image in the menu.
        • Elements WITHIN the html tags are randomly assigned, reducing site footprints.
        • Each template is static within a domain name, but a unique one is created for every newly generated site.
        • Yes kids, there is random+shuffled CSS.
      • Creates a nice site architecture, easily crawlable, and not reminiscent of spammy sites.
      • Automatically expands over time, as if new pages are being introduced.
      • Work in Progress - RSS feed for each cat/subcat, including a proxy scraper/tester for sending out pings.
      • Creates all needed pages/files, verifies MySQL database entries, generates proper h1/h2/h3/title tags, but not in an easily identifiable way.
    • ShadyCloaker
      • Dynamic bot identification, as has been discussed previously in here.
      • Brand new points system, for rating the probability the user is a bot in disguise. Tested variables include
        • Reverse DNS check
        • “Hot” C-Class IP Block check.
        • Standard referrer-level filtering
        • Bot traps hidden in pages.
        • New “compatibility” for LIVSOP/Google referrer spam
        • SERP Rank checking system (seeing if we even rank for whatever the incoming search is) - Only occurs after rating the search term to see how competitive it would be, to save bandwidth.
        • Cross-IP Access Checker (if you access MORE than one of my seemingly random sites, chances are you’re a bot)
      • Updated referrer filtering, including some commonly SPOOFED urls that don’t actually exist.
      • The ability to dynamically redirect based on the specified search terms.
      • Ban list. If you’re blocked once, you’re blocked forever.
      • Updated the IP List to include not only individual IPs, but entire blocks of IPs. I’m considering blacklisting all of Level3 Communications that are not allocated to smaller ISPs. Although that would be kinda like killing an ant with a sledgehammer.
      • Various other features I’m NEVER going to release in here.
      • Gathers together referrer strings, to find unmoderated forums where my post apparently hit.
    • Shady Scraper - The fastest damn Java Program you’ve ever seen
      • Loads keyword list, scrapes a recursive list of relevant sub-keywords from a variety of sources. Allows you to specify the limit of sub-keywords per base-keyword.
      • Searches SERPs for the first 200 results for each given subkeyword, selects a specified number of results randomly, and retrieves those. (25-150 threads! woot woot!)
      • Strips out HTML, Javascript, CSS, then splits the data into 4-8 word blocks.
      • Organizes the data into multiple tables in a MySQL database. Heavily optimized for low CPU usage. Query times are now approximately 1/5th of what they were before.
    • Future Version 3.0 Plans (Not all can be mentioned here. Tharr be lurkers in that thar interweb)
      • Add automatic niche selection, based on a given keyword source (yay wikipedia). About 50% done right now.
      • Automatic domain ban check, including age, and stress on that domain.
      • Ability to auto-register its own domain. Preferably with desirable keywords.
      • Ability to add a subdomain through cpanel, all with curl.
      • Ability to VERIFY that the previous 2 steps succeeded.
      • The ability to add its fresh set of links into the link spam queue to be sent out.
      • Essentially, it will be able to run almost completely without human intervention. 
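
The "random colors with contrast controls" item in the ShadyGen list can be illustrated with a small sketch. This is not the author's code (which is not shown, and is described as Java/PHP); it is a Python illustration that assumes the WCAG 2.x contrast-ratio formula and a 4.5:1 minimum as the "control", both of which are my choice. The names `contrast_ratio`, `random_color_pair`, and `to_hex` are hypothetical.

```python
import random


def _linear_channel(value):
    """Convert one 0-255 sRGB channel to its linear value (WCAG 2.x formula)."""
    c = value / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an (r, g, b) tuple, from 0.0 (black) to 1.0 (white)."""
    r, g, b = (_linear_channel(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(color_a, color_b):
    """WCAG contrast ratio between two colors, from 1.0 up to 21.0."""
    lum_a = relative_luminance(color_a)
    lum_b = relative_luminance(color_b)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


def random_color_pair(rng, min_ratio=4.5, max_tries=1000):
    """Draw random background/text colors until the pair is readable.

    min_ratio=4.5 is an assumed threshold, not something the post specifies.
    Falls back to black on white if no acceptable pair turns up.
    """
    for _ in range(max_tries):
        background = tuple(rng.randrange(256) for _ in range(3))
        text = tuple(rng.randrange(256) for _ in range(3))
        if contrast_ratio(background, text) >= min_ratio:
            return background, text
    return (255, 255, 255), (0, 0, 0)


def to_hex(rgb):
    """Format an (r, g, b) tuple as a CSS hex color."""
    return "#%02x%02x%02x" % rgb


if __name__ == "__main__":
    # Seeding per site keeps the palette stable for that site across runs.
    background, text = random_color_pair(random.Random("example.com"))
    print(to_hex(background), to_hex(text))
```

Seeding the generator with something tied to the site (a domain name here, purely as an example) is one way to get the "static within a domain, unique per site" behaviour the list describes.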

    My biggest thought right now is about dropping links. There’s no question that a new way is needed, and I know there is a way. I just have to find it.


    13 Responses to “New ShadyGenV2.0b (Site Generator) Almost Completed”

    1. Chris says:

      Very smart list mate! Good inspiration to make thy own!

    2. Sucka Hater says:

      I can’t believe that you talk about this stuff in public - you’re not just tipping off Matt Butts, but getting all the suckas to improve their games…

      I don’t need any more competition for my sites — hell, I just discovered a new way to linkspam that gets a 50%+ response rate (is it spam?) and a way to power up a linkspamming technique others are using by a factor of a thousand or so. Why should I share these techniques with affiliate spammers who can’t get a link to save their lives?

      Everywhere I go I find these autogen sites by the suckas. You know, the retards who add spelling errors to Wikipedia and then republish it. They’ve got like four backlinks for the whole site — I gotta give them a few thou just to saturate the links that they scraped that point to my sites.

      I got no patience for autogen garbage sites — yeah, they get a high CTR because the ads are more interesting than the content, but they just pollute the web.

      Meantime, I’m busting my ass to make sites that have real value — autogen or not. I do the whitehat stuff and do it pretty well… It’s great to get talked about in blogs and magazines, to get links from the authority sites, but ten thousand or so links of low to moderate quality never hurts.

      On one hand I’m laughing at the suckas who can only manage three or four backlinks to promote wack affiliate sites, but I’ve been pulled into the BH world ’cause I got competitors that buy links by the 10000… I gotta smoke their ass ’cause I can’t wait for Butts to do the job.

    3. admin says:

      @SuckaHater:
      Cutts already knows this stuff. I guarantee it.
      Yeah, I put out valuable information. However, I dodge around a few things.
      1) I leave out a few “premier” features. This allows me to continue existing, and to keep my edge.
      2) I leave out anything that is neither common knowledge nor something I came to on my own.

      Most people are too afraid to try this stuff. But the few that ARE willing? I welcome them. I’ve gotten an INCREDIBLE number of quality tips from readers of this blog. Probably more than I’ve given out. I’m sure other people will benefit from them as well. It’s not all about competition numbers. More minds breed more ideas, and greater innovation.

      In the coming days, I have a feeling we’re going to see the search engines take a stronger stance against link spam. When that happens, we’ll need all the minds we can get.

    4. Sucka Hater says:

      Well, I think the problem is sites with no content, not “link spam”.

      Some people believe Butts when he says, “If you build it they will come”, but people who try to win the viral lottery encounter the dark side of the long tail — good sites need extensive promotion to get noticed. Whitehat methods can work for some niches, but just won’t for others.

      Frankly I don’t think that Butts cares if a good site goes from #17 to #1 with some darkside techniques. It’s more a problem that the web is getting choked with garbage pages…

      For instance, some sucka made a madlib site that combined town names (ex “Lexington MA”) with things you might be interested in those towns (ex “Flying Lessons”). The page didn’t have a word of content about flying lessons in Lexington — just badly targeted ads from adsense and other networks. Honestly, it would have been cool if it had some paid ads from people who offered flying lessons in that area, but it didn’t… 100% worthless

      That site got zapped from the index last month and I say “hell yeah!”

      Here’s what I think Google will be doing in the next few years…

      The basic idea is that they’ll categorize queries and categorize sites. They’ll train machine learning algorithms to assign scores for things like:

      Query:

      * Is looking for medical information
      * Is looking to buy something
      * Mentions a brand name
      * Is looking for broad information
      * Is looking for specific information
      * Is looking for information about a place

      Sites:

      * Is a blog
      * Is about medical information
      * Is targeted to children
      * Is likely to be autogenerated
      * Is about technology
      * Is selling something directly
      * Is an affiliate site

      I think Google will (if it doesn’t already) apply different “ranking” algorithms for different types of queries. It’s quite likely that different slots in the results will use different ranking algorithms, so that we get a prescribed kind of diversity.

      The current results for “Canon EOS” look like this (by accident or design): you see the official Canon site, a few good photography information sites, and a chance to buy the camera at a reputable online store — something that reflects different information needs… Just like the top pages of a domainer or MFA site offer broad links that select people by interest for something more targeted.

      My hunch is that the “sandbox” behavior of Google already involves some kind of site classification mechanism. Looking at Google’s crawling behavior for new sites, I get the feeling that it’s trying to evaluate whether a site is aiming for long-term value content (gets sandboxed) or aiming for timely information (gets suppressed for competitive terms, enhanced for less competitive terms). A linkbait that makes it big on Digg probably ~should~ rank high for a week or so afterwards, and then fall off in popularity unless the links keep coming.

      If I were you, I’d be concerned with (1) making autogen content that looks legit, and (2) making autogen content that’s actually valuable to users — with an emphasis on (2).

      I did some work a few years ago on duplicate detection algorithms: most kinds of autogen text have terrible statistical abnormalities that stand out like a sore thumb… Algorithms for detecting autogen look a lot like the algorithms for dup detection, but take up more storage and CPU — That’s not a problem for Google, which has made a transition to fifth-generation computing, DMMD’s and such.
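
A rough illustration of the kind of duplicate detection mentioned above, as a sketch only: this is the textbook word-shingling plus Jaccard-similarity approach, not necessarily what the commenter worked on, and certainly not Google's method. It is written in Python; the function names `shingles` and `jaccard` and the shingle size `k` are assumptions for the example.

```python
import re


def shingles(text, k=4):
    """Return the set of overlapping k-word sequences ("shingles") in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Too short to shingle: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(set_a, set_b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog"
    reworded = "the quick brown fox leaps over the lazy dog"
    # One changed word breaks every shingle that contains it.
    print(jaccard(shingles(original, k=3), shingles(reworded, k=3)))
```

A score near 1.0 suggests near-duplicate pages. Swapping a single word only disturbs the shingles that contain it, which is why lightly altered republished text still scores high against its source.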

    5. admin says:

      Oh man sucka hata, you should comment more often; I love this.

      What you mentioned about categorization is definitely on its way; I have a feeling the only reason it’s taken this long is because of patent issues with the Teoma engine, which already does a lot of this. Personally, I look forward to it. The challenge will be incredible.

      I think in the future, writing autogenned content that looks real is something that will definitely be important. But for right now, I’m concerned with getting my software out there, and getting it active; I lose money every minute it’s not.
      I’ve experimented with autogenning coherent content, and found a trick or two, but for what I’d really need, on the proper scale, I figured out I’d need a few resources I just don’t have available right now (an extra dedicated for example).

      And there HAVE been rumors of Google tentatively classifying sites. The rumor says that they have a gauge of how quickly sites in certain niches gather up their links, and compare your link gathering rate to that. Honestly, the evidence I have greatly supports this. My entire piece of software is modular; updating one file re-writes every site to mesh it with the update. That was done with the idea that many of the changes you’ve discussed will become very very evident in the near future, and I need to be ready for it.

    6. TheMadHat says:

      “The rumor says that they have a gauge of how quickly sites in certain niches gather up their links, and compare your link gathering rate to that.”

      Build out a tool that monitors backlink growth in a certain niche for a couple months, then program your link slinger to build links at a slightly higher rate. Piece of cake.

      Site classification is already happening to some extent. You can see it reflected in the SERPs (sometimes, it still has quite a ways to go).

      I like the randomized CSS within acceptable bounds. That’s not easy to do.

    7. Jeffrey Henderson says:

      Where do I sign up to get this software? Have you considered making it an open source project? We’d love to contribute to it.

      Thanks!

    8. admin says:

      @TheMadHat: I’ve thought about that, but it would only apply for new sites, which are hard to pick out. Especially since the link spam would make many appear relevant when they aren’t (they have the keywords, and would show up in the SERPs). Beyond that, I want to exceed the average. You can still do that, but on some scale apparently there is a maximum.

      @Jeffrey Henderson:
      I’ve thought about such a thing, and maybe eventually that will be the case. The problem is that once any software gets too widely used, the search engines begin to look specifically for it. Any software has ways to get around it, and open source would allow them to find that even faster, especially with the increased usage. I may eventually launch a sort of service where someone’s remote server can connect to my own, sending user data (referrer, IP address), and my server will direct them towards what to do with it.
      For the time being though, I’m keeping it internal. Eventually, who knows. I’m working on a method to duplicate wordpress templates automatically, but it’s going to be a bit before I can parse the template, and figure out EXACTLY where title/subtitle/content/links go, without leaving any traces of the original site. Once that is complete though, it has more potential for release.

    9. TheMadHat says:

      Good point on the new sites. It would be easy to monitor older trusted sites. I guess the problem would be identifying the new sites in the first place. Something to ponder.

    10. taky says:

      good project, I’ll be interested to see how it comes along. as for adding the domains, you don’t have to use curl. you can use a simple snoopy fetch or a file_get_contents and fill out the post parameters in the url.

      cheers mate.

    11. thequack says:

      Allllrighty, I’ve swum around this site long enough so I guess it’s time to take the plunge and post a comment. Be forewarned, this might smack of naiveté. Everything sounds interesting, but the cloaking (to me, remember) seems a bit overly complicated. Isn’t there a temporal factor here? Wouldn’t a spider actively crawling your site be making requests at a much higher rate than a human? Couldn’t you have a rule that says, “If I get two requests from the same IP in rapid succession, ding ding ding ding! We have a spider.”? And then start serving up the SEO pages? Another point, (haven’t learned about bot traps yet) couldn’t you simply pepper your pages with links that aren’t displayed to humans? A bit like web bugs? So you have a scenario where something is rapidly pulling your site and accessing links humans shouldn’t be (unless it’s some punk 12 year old viewing your source – but there is still the temporal factor). Conversely, shouldn’t your scheme require more time to deliver your content? If your site is on a shared hosting environment and Google notices (assuming they do) that it takes longer to serve your pages than the others (accounting for the complexity of the content) then aren’t you risking a cloaking footprint? Maybe set a flag for closer scrutiny?

      On a side note, you programmed some of your stuff in Java and the rest in PHP? Why not do the whole thing in Python? You may want to check out “Beautiful Soup”. It’s a scraper in Python, people seem to like it.

      -quack
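
The rate rule quack describes ("two requests from the same IP in rapid succession") can be sketched as a sliding-window counter. This is only an illustration in Python of that one rule, under assumptions: the class name `RateFlagger`, the two-hit threshold, and the one-second window are all hypothetical, and timestamps are passed in by the caller rather than read from a clock.

```python
from collections import defaultdict, deque


class RateFlagger:
    """Flag an IP once it makes max_hits requests within window_seconds."""

    def __init__(self, max_hits=2, window_seconds=1.0):
        self.max_hits = max_hits
        self.window_seconds = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def record(self, ip, now):
        """Record a request at time `now` (seconds); return True if flagged."""
        recent = self.hits[ip]
        recent.append(now)
        # Forget requests that have fallen out of the window.
        while recent and now - recent[0] > self.window_seconds:
            recent.popleft()
        return len(recent) >= self.max_hits


if __name__ == "__main__":
    flagger = RateFlagger(max_hits=2, window_seconds=1.0)
    for ip, at in [("10.0.0.1", 0.0), ("10.0.0.1", 0.4), ("10.0.0.2", 0.5)]:
        print(ip, at, flagger.record(ip, at))
```

A rule like this only sees one IP at a time, so it says nothing about a client whose requests are spread across several addresses.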

    12. admin says:

      @Quack: Ok here, we go, I’ll try and hit every point I need to for this.
      1) Crawlers don’t go straight through your site, nor are they always the same IP. So relying solely on rate limits is a bit unreliable. Also, on that FIRST request they make, a redirect or a 404 can flag us as an unstable site, which hurts rankings, even if you adjust to it afterwards.

      2) Somewhere in here (maybe this entry? Maybe another?) I do mention web bugs, and honestly, yeah they’re useful. However, for some cloaking sites (ones that redirect) it’s impractical since you have to determine if they’re a bot BEFORE you show them the page. If not, they’d be redirected, and you’d never get to see if they hit your trap. Beyond that, I admit I get uneasy about giving anyone who somehow manages to see the real site a way to flag themselves as a spider.

      3) It is possible for them to check how fast sites render, but once again, this is not really evidence of cloaking. Different sites do a variety of things that require different amounts of time. Some webhosts give different levels of service on the same IP. Some (legit) sites need to grab RSS feeds before displaying. These things can increase load time.
      Beyond that, the old version of this software did take forEVER to load. It was pulling its data from a single table, often with more than 14 million lines of data. The new one however, has a more complex database setup that can render a page in about 0.1-0.23 seconds. MUCH faster ;-)

      4) Last point, on the Python. Honestly, I don’t know Python. Beyond that, I’ve been coding threaded Java programs for YEARS, so I can handle a lot of threads very efficiently. With that knowledge, Java becomes too fast to ignore (not to mention I’m most comfortable with it). Eventually, there may be room to integrate them together. I currently have 8 different scrapers I’ve coded over the past few months, and I’m constantly improving them. Eventually, they may have to be redone simply because that allows the whole “system” to come together and function independently of myself a lot easier.
      Time will tell. I just finished the software, and I’m already working on version 3 :-D

      And btw, don’t ever worry about commenting. I very rarely bite. Especially to people with honest questions.

      Glad to have ya aboard!

    13. thequack says:

      Hey, thanks for the answer. Actually, I must confess that I wrote my comment from a somewhat selfish point of view. I wasn’t even thinking of the whole redirect to an affiliate thingy at the time. I’m still sharpening my skills on being a general nuisance to the internet at the moment. I’ll worry about cashing in on my SERP graffiti later.

      Keep the posts coming!

      -quack

    © Slightly Shady SEO, All Rights Reserved. Scrape me, and I will eat your soul.