    The Basics of Content Generation: Methods, Coherence, and Unique Content

    Alrighty. Few people despise content creation as much as I do. While I do have a trick or two I'll be hoarding for myself, I thought I'd share the basic methods of content generation, and the things you should look into if you want to build an engine that takes it to the next level.

    1. The Text Jumble
      Scrape text from different sources, then recombine it in randomly sized blocks (2-6 words). It's a decent way to create text, so long as no one will ever see it (cloaking), and so long as the search engines don't get any better at recognizing proper language patterns. There's not much to say on this one, so I'll leave it for now. It's the entry level of content generation, and rarely makes a coherent sentence.
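      For the curious, here is a minimal Python sketch of the jumble. The function name `jumble`, its parameters, and the sample sources in the usage note are made up for illustration; only the 2-6 word block size comes from the description above.

```python
import random

def jumble(sources, total_words=40, min_block=2, max_block=6, rng=None):
    """Stitch together randomly sized word blocks cut from random sources."""
    rng = rng or random.Random()
    # Split each source into words; skip sources too short to cut a full block from.
    pools = [s.split() for s in sources]
    pools = [p for p in pools if len(p) >= max_block]
    if not pools:
        raise ValueError("need at least one source with max_block or more words")
    out = []
    while len(out) < total_words:
        words = rng.choice(pools)                     # pick a source at random
        size = rng.randint(min_block, max_block)      # block of 2-6 words by default
        start = rng.randrange(len(words) - size + 1)  # random position inside that source
        out.extend(words[start:start + size])
    # Trim the last block so the result has exactly total_words words.
    return " ".join(out[:total_words])
```

      Calling `jumble(["first scraped text ...", "second scraped text ..."], total_words=20)` returns 20 words of stitched-together fragments. Nothing here looks at grammar, which is exactly why the output rarely reads as a sentence.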
    2. Markov/String Permutations
      I’ve been told this is the concept behind Markov chains, so I’ll call it that. I know the concept better than I know the name ;-)
      This is a common way of generating semi-coherent text, and has the greatest potential for the future, if made more intelligent. Here’s the concept.

      1. Scrape lots and lots of data on a given topic. Arrange it by sentence.
      2. Select a random start word from your sentences.
      3. Search through your data for all the words that come directly AFTER that word, and append one of them at random.
      4. Rinse, lather, repeat.
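
      A minimal Python sketch of the four steps above. The function names (`split_sentences`, `build_followers`, `generate`) are made up for illustration, and two details are assumptions of this sketch rather than part of the steps: the start word is drawn from sentence openers, and generation stops at `max_words` or when a word has no follower.

```python
import random
import re
from collections import defaultdict

def split_sentences(text):
    """Step 1: arrange scraped text by sentence (naive split on ., ! or ?)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def build_followers(sentences):
    """Map each word to the list of words seen directly after it."""
    followers = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        for current, nxt in zip(words, words[1:]):
            # Duplicates are kept on purpose: common followers get picked more often.
            followers[current].append(nxt)
    return followers

def generate(sentences, max_words=15, rng=None):
    rng = rng or random.Random()
    followers = build_followers(sentences)
    # Step 2: random start word (assumption: taken from the sentence openers).
    starts = [s.split()[0] for s in sentences if s.split()]
    word = rng.choice(starts)
    out = [word]
    # Steps 3 and 4: keep appending a random follower of the last word.
    while len(out) < max_words and followers.get(word):
        word = rng.choice(followers[word])
        out.append(word)
    return " ".join(out)
```

      Feed it `split_sentences(scraped_text)` and call `generate` as many times as you need lines. Every pair of neighbouring words in the output occurred somewhere in the source, which is where the "somewhat coherent" feel comes from.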

      The end result is a bunch of text that is somewhat coherent, but still obviously generated. So let’s take a look at how to expand on this, and make it more readable.

      • Fix capitalization. This is an easy one to do.
      • Maintain proper tense. If you do not doing this, then your content had easily looked like this did.
        • Either load up a dictionary of verb forms, or do your best at adding/removing the proper endings. Sometimes searching Google for your attempt can reveal the proper spelling for irregular words.
      • If you’re feeling especially zesty, note the type of each word (verb, adjective, noun, etc.). Note which combinations of these normally make sense, and try to recreate those patterns.
      • Break up by paragraph as well as sentence. That way, you can weight the randomization not only by whether the word follows, but also by whether its source paragraph is similar to the one you’ve already created. If you don’t have enough data for this though, you’ll end up with very, very similar lines.
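
      The capitalization fix from the list above can be sketched in a few lines of Python. The function name `fix_capitalization` is made up for illustration, and the approach (lowercase everything, then re-capitalize sentence starts and the word "I") is one simple assumption about what "fixing" means. It will flatten proper nouns, so treat it as a starting point.

```python
import re

def fix_capitalization(text):
    """Lowercase the text, then capitalize sentence starts and the pronoun 'I'."""
    text = text.lower()
    # A standalone "i" (including "i'd", "i'm") becomes "I".
    text = re.sub(r"\bi\b", "I", text)
    # Capitalize the first letter of the text and any letter following ., ! or ? plus whitespace.
    return re.sub(
        r"(^\s*|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )
```

      For example, `fix_capitalization("blue Widgets ARE great. i think they ARE sold here.")` gives `"Blue widgets are great. I think they are sold here."`.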
    3. The Synonym Switch
      This solution is the one many arrive at first. It’s just looking up synonyms in an online thesaurus, and swapping out the current word for another one.
      HOWEVER, be warned. There are a lot of synonyms that are no longer in use, or rarely used. As a result, your text can come out very footprintable, and sounding as if a cross between a thug and Shakespeare wrote it. A good way to offset this effect is to search for each candidate word on Google (store the result in a database, so you only have to do it once and can space out search times), and record the number of hits. Weight the algorithm deciding which word to swap in according to how many results it got. This will help you get only the more common synonyms.
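
      A rough Python sketch of the weighted swap. Everything named here is hypothetical: `synonyms` stands in for your thesaurus lookups, `hit_counts` stands in for the database of cached result counts, and the numbers in the usage note are invented. Punctuation and capitalization handling are left out to keep it short.

```python
import random

def pick_synonym(word, synonyms, hit_counts, rng=None):
    """Pick a replacement for word, weighted by how common each synonym is."""
    rng = rng or random.Random()
    # Only consider synonyms with at least one recorded hit.
    candidates = [s for s in synonyms.get(word, []) if hit_counts.get(s, 0) > 0]
    if not candidates:
        return word  # nothing usable: keep the original word
    weights = [hit_counts[c] for c in candidates]
    # random.choices draws in proportion to the weights, so rare words rarely win.
    return rng.choices(candidates, weights=weights, k=1)[0]

def swap_synonyms(text, synonyms, hit_counts, rng=None):
    """Run every whitespace-separated word through pick_synonym."""
    rng = rng or random.Random()
    return " ".join(pick_synonym(w, synonyms, hit_counts, rng) for w in text.split())
```

      With `synonyms = {"big": ["large", "brobdingnagian"]}` and `hit_counts = {"large": 999, "brobdingnagian": 1}`, "big" turns into "large" nearly every time, and the Shakespeare-grade option only shows up once in a blue moon.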

    Combining the Processes
    Combining these (errr, #2 and #3) is a pretty decent way to create unique content without the hassle of writing. However, both are quite CPU intensive, so don’t say I didn’t warn you!

    Conclusion
    The largest issue with writing proper text automatically is that it’s hard to scale: there’s never enough source content. To supplement website scrapes, make use of all the free text out there. Project Gutenberg is a good resource (17k books with expired US copyrights, free for download). CliffsNotes. Wikipedia. Random ebooks on emule/bittorrent. RSS. Newspaper articles. Even scrapes of lyrics sites. There’s a lot of organized data out there, and all of it can be used. Give it a go.

    As I grow better at this myself, I’ll be sure to keep everyone updated. I’m exhausted, so I’m going to crash now. Hopefully I can think of something lovely for tomorrow.
    -XMCP


    15 Responses to “The Basics of Content Generation: Methods, Coherence, and Unique Content”

    1. Jacek Becela says:

      What kind of scaling issues are on your mind? Why not enough content (scraping the interwebs is not enough)? What volumes of generated text do you consider a satisfying scale?
      The only problem I see would be generating content for *very* long-tail phrases, but it’s not hard to mix these phrases into neutral (say Gutenberg) content…

    2. Paul says:

      I’ve always found randomly generated text interesting. As technology gets better and better it’s fun to think of the day when there will be a program that can just research a subject and create its own content. They’ll probably start as a mashup of their input, but evolve to something even more interesting.

    3. Webster says:

      I’m doing some jumbling, but it’s less sophisticated. I have a site that has a lot of pages of the form . Let’s say there are 100 locations and 50 “widget colors”.

      Instead of writing 5000 pages, we are writing 50 widget blurbs and 100 location blurbs. Each location blurb gets passed some widget variables, eg: California has many great s. For widget blurbs: is a great place to buy blue widgets. Make sense? Each blurb is about 4 sentences.

      Then the Colorado Blue Widget page pulls the colorado blurb (being passed the blue widget variable) and the blue widget page (being passed the colorado variable). No one page has the same two blurbs, and the content really comes out OK.

      I was wondering what your thoughts on this tactic are for an older site with 3K plus links (none are paid for). Will I escape the filter?

      Thanks, and great post.

    4. Webster says:

      Oops. Used carrots to show variables above. To clarify:

      pages have the form: ‘location name’ ‘widget color’

      Location blurbs use: California has many great ‘widget color variable’

      and for widget blurbs: ‘location variable’ is a great place to buy blue widgets.

    5. admin says:

      If you have a domain like that, I’d steer clear of autogen altogether. Create a clean domain, and test BH on that ;)
      For the tactic itself, it’s decent, but I’d make sure there’s a LOT of difference in the sentences. If you can keep the variation up, it’d probably get past the auto-detection algos. Probably not a human though.

    6. Trophaeum says:

      Don’t forget you can download the entire Wikipedia content as a .xml.bz2! It’s a goldmine :)

    7. Rebecca says:

      I clearly don’t get this concept. My store blog (at xanga.com/dextr) is actual written content, and I can’t imagine when you’d want to have fake content. But it does seem to me that the strategies listed above would be slower and more trouble than actually writing something? Is the object to have it not be real language, for some reason?

    8. Webster says:

      Thanks for the feedback Shady. Love the blog.

    9. TuxPirate says:

      I find it interesting to tickle content generation techniques; this can certainly be a challenging task. I’d like to raise a question: what about planning and documenting your content generation requirements (whatever you like to call that)?

      In my opinion this is the most important step one needs to go through or at least think about somewhat thoroughly.

      Establishing your needs and requirements will allow you to get a better idea of where you are heading to, what you want to aim for and it’ll even make your life easier when wanting to implement some crazy idea to test out.

      Here’s an example: while writing my content generation use cases and requirements, I decided it would make sense to generate posts for backlinking blogs based on content analysis (this is an experiment where the idea is to improve the relevancy of pages linking to my posts).

      You raise solutions for designing content generators. However, they are specific to a page/post/article’s textual content. Don’t get me wrong, I did like the article. Just raising a few related questions I had:

      What about content classification? I’m using wordpress, automating content tagging is easy. But what about generating a site’s categories accurately? And how do you generate your pages titles nicely?

      There are a few potential solutions I don’t particularly like that much; using tags as categories, and relying on some keyword utility or service. Am I just too picky on my category names? ;-)

      Here’s a challenging topic I’d like to hear about as well… generating media (videos, images, sounds).

      Finally, thanks for sharing the good stuff!

    10. admin says:

      @Rebecca: Autogenerated content is for an entirely different business model than a store blog. Normally it’s on a self-piloting/enlarging site, where either one of 2 conditions is true.
      1)The layout is such that most readers will only make it to the adsense portion, and not the content
      or
      2)The content will only be seen by the search engines; all others will be redirected over to an affiliate site.

      While it’s more difficult in the first place to generate content, the ability to create dozens of websites a day with no manual labor whatsoever allows a web promoter to expand extremely rapidly, and have a large income as a result.

    11. Your First Blackhat Setup: Blackhat 101 : Slightly Shady SEO says:

      […] The Basics of Content Generation: Methods, Coherrence, and Unique Content […]

    12. Robin says:

      There are a couple of chatbots out there that use textbooks to ‘learn’ and can generate new sentences themselves. Many of them are open source.
      I’m sure there is some more to learn from them :)

    13. Robin says:

      Oh, one more thing that comes to mind: Googlism.com can also be a nice tool to easily fetch some sentences for your keywords. Mix them with autogenerated content.

    14. Rebecca says:

      Thank you for the explanation. I would never have guessed.

    15. mbreezy says:

      Question for ya…

      I recently wrote a script that spiders through links, and on each link it spiders it grabs all the sentences (complete sentences, gotta love delimiters) and stores them for me. Not only that, for the page each sentence came from, it grabs the keywords, description, title and the URL so I can sort them based on those things later.
      End result… Complete and coherent sentences, all somewhat relevant because they come from the same types of pages.
      Since these are 80-110 character sentences, can the SEs catch on that it’s not original content?

      I wouldn’t think so, because a single article might have sentences from 30 different websites, ya know?

    © Slightly Shady SEO, All Rights Reserved. Scrape me, and I will eat your soul.