    Data Mining your Way to Profit

    Alright. So in the past I’ve covered the basics of scraping and content generation. But today we’re going to look at a more whitehat way to use scraped data. While technically this has some copyright issues attached, they’re rarely caught, and the content is most often traceable to a source. The ethical scraper can always cite where his content came from. This really exists on the internet today as a powerful force. In fact, I have reliable information that one or two well-known, borderline-evangelical whitehats make this a consistent practice.
    Before everyone starts complaining at me, remember that back in the day, Wikipedia had a lot of information added by bot.

    The Concept
    We are going to pull LOTS of data from a variety of sources and insert it into a database. Since we can call this information back at will, we can drop it into heavily optimized, effectively organized pages of content on a domain, without fear of banning.
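
To make the "database in, pages out" idea concrete, here is a minimal sketch, assuming an in-memory SQLite database and a single `facts` table. The table layout, the function names `build_db` and `render_page`, and the sample rows are all made up for illustration; they are not from any real site.

```python
import sqlite3

def build_db(rows):
    """Store (topic, fact, source_url) rows in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE facts (topic TEXT, fact TEXT, source_url TEXT)")
    conn.executemany("INSERT INTO facts VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn

def render_page(conn, topic):
    """Build a plain-text page for one topic, citing the source of every fact."""
    cur = conn.execute(
        "SELECT fact, source_url FROM facts WHERE topic = ? ORDER BY rowid",
        (topic,),
    )
    lines = [topic]
    for fact, source_url in cur:
        lines.append(f"- {fact} (source: {source_url})")
    return "\n".join(lines)

# Hypothetical sample data; the URLs are placeholders.
rows = [
    ("Example Band", "Genre: rock", "https://example.com/a"),
    ("Example Band", "Formed: 1999", "https://example.org/b"),
    ("Other Band", "Genre: jazz", "https://example.com/c"),
]
conn = build_db(rows)
page = render_page(conn, "Example Band")
print(page)
```

Keeping the source URL in the same row as the fact means every generated page can cite where its data came from without extra bookkeeping.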

    Why Does this Work So Well?
    Sites with a reasonable number of backlinks and a lot of content WILL make money in almost every case. The catch is always that content creation takes time. Thankfully, we are afforded the lovely ability to cite sources online. So we can find content out there, snag the information we want, and create thousands of pages from it without fear of having our site banned.

    Potential Sources for Your Data Mining (or Scraping if You’re being Picky)

    • White Pages
    • Wikipedia
    • Census Data
    • CIA World Fact Book
    • Amazon.com
    • Any other organized information source that you can find.

    So How Do We Mine This Data?
    There are a couple of ways. One is reliable, but is more traceable as duplicated data (though you should always cite your sources) and yields less information. The other is less reliable and will require manual review, but is virtually untraceable.

    1. The Table Method (Closer to Scraping)
      Used on individual sites with good, readable/predictable layouts. The concept is that you crawl an individual site and scrape your desired data from each page.
      For example, let’s say we’re making a site that’s all about bands, and the different genres of music they make. We would start out with a list of the musicians we want, then we would search for them on Amazon, and extract the “genres” they’re listed under. Then we might go to last.fm, and do the same. We compare the two. Any genre that is not on both sites, we throw away, marking it as the editor being stupid.
    2. The Language-Comprehension Method (Closer to Data Mining)
      This method is tricky and unreliable, but implemented properly can create truly unique content. Like REALLY unique content. Lists of information never seen before.
      For example, let’s say we want information on sweat shops. We would start off with the following query:
      ("owns a sweatshop") OR ("runs sweatshops") OR ("owns sweatshops") OR ("uses sweatshops")
      Anyway, we search for that and look at the results from Google. Load all those pages if needed, and extract the 1-2 words just before each occurrence of the given phrase on the page. For this particular search, the first 2 would render “Bebe” and “Nike”. Have the script show all the scraped results, and you just check the ones you want to appear. Not hard at all. Like a super weak Markov chain.
      After that (to build up more content), you can scrape public business resources for a few more trinkets of information… and magic, you have a new page.
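
Both methods can be sketched in a few lines of Python. This is only a rough illustration, assuming the pages have already been fetched and reduced to plain text. The function names (`agreed_genres`, `build_query`, `words_before`), the genre lists, and the company names are all hypothetical, and the sketch reads "1-2 words" as the words immediately before the phrase.

```python
import re

def agreed_genres(*genre_lists):
    """Table method: keep only genres that every source lists (case-insensitive)."""
    sets = [{g.strip().lower() for g in genres} for genres in genre_lists]
    return sorted(set.intersection(*sets)) if sets else []

def build_query(phrases):
    """Build the quoted OR search query from a list of phrases."""
    return " OR ".join(f'("{p}")' for p in phrases)

def words_before(text, phrases, max_words=2):
    """Language method: collect the words right before each phrase occurrence.

    Returns raw candidates in first-seen order; they still need manual review.
    """
    if not phrases:
        return []
    pattern = re.compile("|".join(re.escape(p) for p in phrases), re.IGNORECASE)
    candidates = []
    for match in pattern.finditer(text):
        preceding = text[:match.start()].split()[-max_words:]
        candidate = " ".join(w.strip(".,;:\"'()") for w in preceding)
        if candidate and candidate not in candidates:
            candidates.append(candidate)
    return candidates

# Hypothetical genre lists standing in for two scraped sources.
source_a = ["Rock", "Alternative", "Pop"]
source_b = ["rock", "alternative", "Grunge"]
print(agreed_genres(source_a, source_b))

phrases = ["owns a sweatshop", "runs sweatshops", "owns sweatshops", "uses sweatshops"]
print(build_query(phrases))

# Made-up page text with made-up company names.
page_text = "Reports claim Acme Corp owns a sweatshop. Meanwhile, Globex Inc uses sweatshops abroad."
print(words_before(page_text, phrases))
```

The second function will happily return junk like "say that" when the sentence is phrased differently, which is exactly why the manual check-the-boxes step matters.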

    Precautions
    A few issues can arise from this practice. First off, if you scrape incorrect or libelous information, you can land in trouble. Beyond that, whenever possible, run information through a dictionary and verify all misspelled words. Nothing sets off red flags like site-specific typos. Also, cite your sources somewhere and chances are all will be well. Unless you scrape IncrediBill, in which case he will hire ninjas with jet packs to swoop in on you in the night.
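
The dictionary pass could look something like this. It is a minimal sketch, assuming you already have a word list loaded into memory; the tiny `dictionary` set here is a stand-in, and `unknown_words` is a made-up helper name.

```python
import re

def unknown_words(text, dictionary):
    """Return words in text that are not in the dictionary, in first-seen order."""
    known = {w.lower() for w in dictionary}
    flagged = []
    for word in re.findall(r"[A-Za-z']+", text):
        lowered = word.lower().strip("'")
        if lowered and lowered not in known and lowered not in flagged:
            flagged.append(lowered)
    return flagged

# Stand-in word list; a real one would be far larger.
dictionary = {"the", "band", "plays", "rock", "music"}
scraped = "The band plays rokc music"
print(unknown_words(scraped, dictionary))
```

Anything it flags is only a candidate typo. Band names and other proper nouns will show up too, so a human still has to make the call.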

    So How Do I Profit From This?
    If you can get your site ranking even moderately well, sites like this traditionally get linked to quite well by the internet community since they’re so informational. At that point, it takes on a life of its own. If you’re lazy, run AdSense. If you’re not, find affiliate programs related to your niche. It’s magical, yes?

    Disclaimer: I’m no ace on copyright law. I know this is common practice, not sure on the legal details.

    -XMCP

    Tune in later this week for some blackhat reputation management!

    7 Responses to “Data Mining your Way to Profit”

    1. blackhat seo says:

      scraping != data mining

    2. admin says:

      @blackhat: I’m well aware. The difference in my mind is that data mining creates a much more structured representation of the data. Hence this.

    3. spostareduro says:

      I am learning a lot more from you than I should I think..lol

    4. johnrobin says:

      @blackhat seo.. I believe that’s right..

      I’ve been visiting lots of websites and using the table method for new content. I see that different sites often end up at one source of information; this has become common these days. I try to use this method, but I have to be careful since I don’t know the legality.

    5. Internet Marketing Joy says:

      Sounds interesting!

    6. Friendly Webmaster says:

      I remember scraping 35,000 products out of Amazon’s clothing store BEFORE it was out of Beta (in 2002) and it quickly made me lots of affiliate money but sincerely it got really old after a few years. I’m just glad I have the experience to scrape a site and convert it to a new platform because I have no patience for copy-pasting! Like the blog.

    7. fleamarket product sourcing says:

      This is the kind of talk I like to hear! Unfortunately, I’m not a programmer, so I don’t know how to automate this. In some ways this is reminiscent of some of the things Eli has talked about at bluehatseo.com. I’ve really got to learn PHP and SQL.

    © Slightly Shady SEO, All Rights Reserved. Scrape me, and I will eat your soul.