The Basics of Content Generation: Methods, Coherence, and Unique Content
Alrighty. Few people despise content creation as much as I do. While I do have a trick or two I will be hoarding for myself, I thought I’d share the basic methods of content generation, and the things you should look into if you want to create an engine that takes it to the next level.
- The Text Jumble
Recombining randomly sized blocks of text (2-6 words) scraped from different sources is a decent way to create text, so long as no one will ever see it (cloaking), and so long as the search engines don’t get any better at recognizing proper language patterns. There’s not much to say on this one, so I’ll leave it for now. It’s the basic entry level of content generation, and rarely makes a coherent sentence.
- Markov/String Permutations
I’ve been told this is the concept behind Markov chains, so I’ll call it that. I know the concept better than I know the name.
This is a common way of generating semi-coherent text, and has the greatest potential for the future, if made more intelligent. Here’s the concept.
- Scrape lots and lots of data on a given topic. Arrange it by sentence.
- Select a random start word from your sentences.
- Search through your sentences for all the words that come AFTER that word, and append a random one of them.
- Rinse, lather, repeat.
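The steps above can be sketched roughly as follows. This is a minimal first-order, word-level version, not a tuned generator. The names `build_followers` and `generate`, the tiny `corpus`, and the `max_words` cap are all made up for illustration; a real run would use far more scraped sentences.

```python
import random
from collections import defaultdict


def build_followers(sentences):
    """Map each word to the list of words seen directly after it."""
    followers = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        for current, nxt in zip(words, words[1:]):
            followers[current].append(nxt)
    return followers


def generate(sentences, max_words=20, rng=random):
    """Pick a random start word, then keep appending a random follower."""
    followers = build_followers(sentences)
    # Random start word taken from a random scraped sentence.
    word = rng.choice(rng.choice(sentences).split())
    output = [word]
    # Stop at the length cap or when the current word never had a follower.
    while len(output) < max_words and followers.get(word):
        word = rng.choice(followers[word])
        output.append(word)
    return " ".join(output)


# Hypothetical stand-in for scraped, sentence-split data.
corpus = [
    "blue widgets are cheap in colorado",
    "cheap widgets are hard to find",
    "colorado is a great place to buy widgets",
]
print(generate(corpus))
```

Because duplicate followers stay in each list, words that follow more often in the source get picked more often, which is what keeps the output loosely on topic.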
The end result is a bunch of text that is somewhat coherent, but still obviously generated. So let’s take a look at how to expand on this, and make it more readable.
- Fix capitalization. This is an easy one to do.
- Maintain proper tense. If you do not doing this, then your content had easily looked like this did.
- Either load a dictionary of these up, or do your best at adding/removing the proper endings. Sometimes searching Google for your attempt can reveal the proper spelling for irregular words.
- If you’re feeling especially zesty, note the type of each word (verb, adjective, noun, etc). Note which combinations of these normally make sense, and try to recreate them.
- Break up by paragraph as well as sentence. That way, you can weight the randomization not only by whether the word is there, but by whether its paragraph is similar to the one you’ve already created. If you don’t have enough data for this though, you’ll end up with very, very similar lines.
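As a rough illustration of the easiest item on that list, here is one way the capitalization fix could look. It is a sketch under simple assumptions: sentences end in `.`, `!` or `?`, and proper nouns are ignored, so "Colorado" would come out lowercased unless you keep a list of names to restore. The function name `fix_capitalization` is invented for this example.

```python
import re


def fix_capitalization(text):
    """Lowercase everything, then capitalize the first letter of each sentence."""
    text = text.strip().lower()
    # Capitalize the first letter at the start and after ., ! or ? plus whitespace.
    text = re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )
    # Restore the standalone pronoun "I" (also catches "i'm", "i'll").
    return re.sub(r"\bi\b", "I", text)


print(fix_capitalization("widgets Are cheap. i Think THEY are great! buy one"))
```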
- The Synonym Switch
This solution is the one many arrive at logically first. It simply means looking up synonyms in an online thesaurus, and swapping out the current word for another one.
HOWEVER, be warned. There are a lot of synonyms that are no longer in use, or rarely used. As a result, your text can come out very footprintable, sounding as if a thug and Shakespeare wrote it together. A good way to offset this effect is to search for each word on Google (store the results in a database, so you only have to do it once and can space out search times), and record the number of hits. Weight the algorithm deciding which word to swap in according to how many results each candidate got. This will help you favor the more common synonyms.
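The weighting idea might look something like this. Everything in the two dictionaries is invented for illustration: in practice `SYNONYMS` would come from your thesaurus lookups and `HIT_COUNTS` from the cached search totals, and the numbers here are not real results.

```python
import random

# Hypothetical data: made-up synonyms and made-up hit counts.
SYNONYMS = {
    "big": ["large", "huge", "capacious"],
}
HIT_COUNTS = {
    "large": 900_000,
    "huge": 400_000,
    "capacious": 2_000,
}


def pick_synonym(word, rng=random):
    """Return a synonym chosen with probability proportional to its hit count."""
    candidates = SYNONYMS.get(word)
    if not candidates:
        return word  # nothing to swap in
    weights = [HIT_COUNTS.get(c, 0) for c in candidates]
    if sum(weights) == 0:
        return word  # no usage data, keep the original word
    return rng.choices(candidates, weights=weights, k=1)[0]


def switch_synonyms(text, rng=random):
    """Swap every word that has known synonyms in a whitespace-separated text."""
    return " ".join(pick_synonym(w, rng) for w in text.split())


print(switch_synonyms("a big blue widget"))
```

With weights like these, a rare word such as "capacious" can still appear, just very seldom. If you would rather never see it, dropping candidates below some hit threshold before choosing is an easy variation.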
Combining the Processes
Combining these (errr, #2 and #3) is a pretty decent way to create unique content without the hassle of writing. However, they are quite CPU intensive, so don’t say I didn’t warn you!
Conclusion
The largest issue with writing proper text automatically is that it’s hard to scale: there’s never enough content. To supplement website scrapes, make use of all the free text out there. Project Gutenberg is a good resource (17k books with expired US copyrights, free for download). CliffsNotes. Wikipedia. Random ebooks on emule/bittorrent. RSS. Newspaper articles. Even scrapes of lyrics sites. There’s a lot of organized data out there, and all of it can be used. Give it a go.
As I grow better at this myself, I’ll be sure to keep everyone updated. I’m exhausted, so I’m going to crash now. Hopefully I can think of something lovely for tomorrow.
-XMCP

January 22nd, 2008 at 7:23 pm
What kind of scaling issues are on your mind? Why is there not enough content (is scraping the interwebs not enough)? What volume of generated text do you consider a satisfying scale?
The only problem I see would be generating content for *very* long-tail phrases, but it’s not hard to mix these phrases into neutral (say Gutenberg) content…
January 22nd, 2008 at 7:29 pm
I’ve always found randomly generated text interesting. As technology gets better and better it’s fun to think of the day when there will be a program that can just research a subject and create its own content. They’ll probably start as a mashup of their input, but evolve to something even more interesting.
January 22nd, 2008 at 10:40 pm
I’m doing some jumbling, but it’s less sophisticated. I have a site that has a lot of pages of the form . Let’s say there are 100 locations and 50 “widget colors”.
Instead of writing 5000 pages, we are writing 50 widget blurbs and 100 location blurbs. Each location blurb gets passed some widget variables, eg: California has many great s. For widget blurbs: is a great place to buy blue widgets. Make sense? Each blurb is about 4 sentences.
Then the Colorado Blue Widget page pulls the colorado blurb (being passed the blue widget variable) and the blue widget page (being passed the colorado variable). No one page has the same two blurbs, and the content really comes out OK.
I was wondering what your thoughts on this tactic are for an older site with 3K plus links (none are paid for). Will I escape the filter?
Thanks, and great post.
January 22nd, 2008 at 10:47 pm
Oops. Used carrots to show variables above. To clarify:
pages have the form: ‘location name’ ‘widget color’
Location blurbs use: California has many great ‘widget color variable’
and for widget blurbs: ‘location variable’ is a great place to buy blue widgets.
January 23rd, 2008 at 12:33 am
If you have a domain like that, I’d steer clear of autogen altogether. Create a clean domain, and test BH on that
For the tactic itself, it’s decent, but I’d make sure there’s a LOT of difference in the sentences. If you can keep the variation up, it’d probably get past the auto-detection algos. Probably not a human though.
January 23rd, 2008 at 9:39 am
Don’t forget you can download the entire Wikipedia content as a .xml.bz2! It’s a goldmine.
January 23rd, 2008 at 3:34 pm
I clearly don’t get this concept. My store blog (at xanga.com/dextr) is actual written content, and I can’t imagine when you’d want to have fake content. But it does seem to me that the strategies listed above would be slower and more trouble than actually writing something? Is the object to have it not be real language, for some reason?
January 23rd, 2008 at 5:34 pm
Thanks for the feedback Shady. Love the blog.
January 23rd, 2008 at 10:54 pm
I find it interesting to tickle content generation techniques; this can certainly be a challenging task. I’d like to raise a question: what about planning, defining, or documenting your content generation requirements (however you like to call that)?
In my opinion this is the most important step one needs to go through or at least think about somewhat thoroughly.
Establishing your needs and requirements will allow you to get a better idea of where you are heading to, what you want to aim for and it’ll even make your life easier when wanting to implement some crazy idea to test out.
Here’s an example: while writing my content generation use cases and requirements, I decided it would make sense to generate posts for backlinking blogs based on content analysis (this is an experiment where the idea is to improve the relevancy of pages linking to my posts).
You raise solutions for designing content generators. However, they are specific to a page/post/article’s textual content. Don’t get me wrong, I did like the article. Just raising a few related questions I had:
What about content classification? I’m using wordpress, automating content tagging is easy. But what about generating a site’s categories accurately? And how do you generate your pages titles nicely?
There are a few potential solutions I don’t particularly like that much; using tags as categories, and relying on some keyword utility or service. Am I just too picky on my category names?
Here’s a challenging topic I’d like to hear about as well… generating media (videos, images, sounds).
Finally, thanks for sharing the good stuff!
January 24th, 2008 at 1:06 pm
@Rebecca: Autogenerated content is for an entirely different business model than a store blog. Normally it’s on a self-piloting/enlarging site, where either one of 2 conditions is true.
1)The layout is such that most readers will only make it to the adsense portion, and not the content
or
2)The content will only be seen by the search engines; all others will be redirected over to an affiliate site.
While it’s more difficult in the first place to generate content, the ability to create dozens of websites a day with no manual labor whatsoever, allows a web promoter to expand extremely rapidly, and have a large income from the result.
January 28th, 2008 at 2:40 pm
[…] The Basics of Content Generation: Methods, Coherrence, and Unique Content […]
January 28th, 2008 at 3:41 pm
There are a couple of chatbots out there that use textbooks to ‘learn’ and can generate new sentences themselves. Many of them are open source.
I’m sure there is some more to learn from them
January 28th, 2008 at 3:43 pm
Oh, one more thing that comes to mind: Googlism.com can also be a nice tool to easily fetch some sentences for your keywords. Mix them with autogenerated content.
January 29th, 2008 at 1:11 pm
Thank you for the explanation. I would never have guessed.
February 24th, 2008 at 5:38 am
Question for ya…
I recently wrote a script that spiders through links, and on each link it spiders it grabs all the sentences (complete sentences, gotta love delimiters) and stores them for me. Not only that, for the page each sentence came from, it grabs the keywords, description, title and the URL so I can sort them based on those things later.
End result… Complete and coherent sentences, all somewhat relevant because they come from the same types of pages.
Since these are 80-110 character sentences, can the SEs catch on that it’s not original content?
I wouldn’t think so, because a single article might have sentences from 30 different websites, ya know?