
Blackhatvolution: Content Generation - Past, Future and N-Grams! : 5ubliminal's TellinYa

Everything below is based on observations, coding experience, common sense and guess-work. When I say Google does, I mean Google might do!
Blackhat Evolution!
No such thing as content generation!

The title is a bit extreme but unfortunately it is largely true. Most content generation methods no longer work, and scraping content without paying attention to the crap you put together is dead and buried. It might still bring you a few bucks, but the chances of it working as it should are gone.

Is all I knew about content generation obsolete?

Yes and no. Content generation is actually a three-step process:

  1. Gather existing content.
  2. Intermediary steps.
  3. Generate new content.

Step #1 and Step #3 still stand, but the steps in between are no longer viable. There is a simple rule in content generation that now applies and will only grow stronger:

Content is either semantically correct or it's ruled out as supplemental crap!

Does Google know semantics? Does Google know grammar? NO! They know something worse than the two put together. They know n-grams. They know the sheet we've been teaching them.

What are the N-Grams?

A while ago I gave you some basic information on Markov chains in content generation. N-Grams are the same idea, except that instead of basing the probability of occurrence on single words, the probability is based on word-packs of N words occurring before or after one word.
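
To make the idea concrete, here is a minimal sketch of word n-grams and the "what comes next" probabilities built from them. The function names (`word_ngrams`, `next_word_probs`) and the sample sentence are made up for illustration, and the tokenizer is just a lowercase whitespace split. It only looks at the words before a given word, not after.

```python
from collections import Counter, defaultdict

def word_ngrams(text, n):
    """Return the list of n-word tuples in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def next_word_probs(text, n):
    """Map each (n-1)-word context to the probability of each following word."""
    counts = defaultdict(Counter)
    for gram in word_ngrams(text, n):
        # gram[:-1] is the context, gram[-1] is the word that follows it
        counts[gram[:-1]][gram[-1]] += 1
    return {
        ctx: {w: c / sum(followers.values()) for w, c in followers.items()}
        for ctx, followers in counts.items()
    }

sample = "the cat sat on the mat and the cat ran"
print(word_ngrams(sample, 2)[:3])
print(next_word_probs(sample, 2)[("the",)])
```

With n=2 this is the plain word-to-word Markov table; raising n makes the context longer, which is where the "word-packs" come in.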

Why is this a problem?

Let's say you generate Markov-based text that is (more or less) readable. And let's also say that Google unleashes its N-Grams on it. What if 20% of your word combinations almost don't exist? Do you think Google will consider that you are reinventing English? Not really. They'll just think you never took the time to learn it and downgrade your site to supple-soup. Unless you are one of the few matches for a search (supplementals), they will always hide the grammar-challenged from public eyes.
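
A rough way to picture that check: count how many of a text's n-grams never show up in a reference collection. This is only a sketch under my own assumptions, not how Google (or anyone) actually does it. `unseen_ratio` is a hypothetical helper, the reference list is a one-sentence toy, and n=3 is an arbitrary choice.

```python
def word_ngrams(text, n):
    """Return the list of n-word tuples in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def unseen_ratio(candidate, reference_texts, n=3):
    """Fraction of the candidate's n-grams that appear nowhere in the reference texts."""
    known = set()
    for ref in reference_texts:
        known.update(word_ngrams(ref, n))
    grams = word_ngrams(candidate, n)
    if not grams:
        return 0.0  # text too short to judge
    unseen = sum(1 for g in grams if g not in known)
    return unseen / len(grams)

# Toy reference corpus; a real one would be huge.
reference = ["the quick brown fox jumps over the lazy dog"]
print(unseen_ratio("the quick brown fox jumps", reference))
print(unseen_ratio("fox the lazy quick jumps", reference))
```

The first text reuses only known word-packs, the second one shuffles the same words into combinations the reference has never seen. A generated page sitting at something like the 20% mark would stand out the same way.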

N-Grams can validate a text chunk quite well, but the more n-grams you have the better. And Google's got a sheetload of n-grams (the numbers on site are mind-blowing), so they can cross-reference your site with them and rate it in terms of readability. A search engine doesn't want to give unreadable text in its results.

What if I use plain basic English?

On the other hand, N-Grams are the holy grail of duplicate content detection. You get all the n-grams from a page, see where else they are found, and then look for patterns (if all n-grams are found on several other sites … it's bad). With a few extra checks you can find duplicate chunks of text easily but … of course … it takes a lot of firepower.
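
The "see where else they are found" step can be sketched as a plain overlap check between sets of n-grams (often called shingles). Everything here is an assumption for illustration: the helper names, the 4-word window, the 0.5 threshold and the tiny in-memory `corpus` dict standing in for an index of other pages.

```python
def shingles(text, n=4):
    """Return the set of n-word tuples in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def containment(page, other, n=4):
    """Fraction of page's n-grams that also appear in other."""
    a, b = shingles(page, n), shingles(other, n)
    return len(a & b) / len(a) if a else 0.0

def likely_sources(page, corpus, n=4, threshold=0.5):
    """Names of corpus pages that contain at least `threshold` of page's n-grams."""
    return [name for name, text in corpus.items()
            if containment(page, text, n) >= threshold]

corpus = {
    "site-a": "one two three four five six seven",
    "site-b": "completely different words appear in this text",
}
print(likely_sources("one two three four five six", corpus))
```

Doing this pairwise against every page on the web is exactly the "lot of firepower" part; a real system would need an index from n-gram to pages rather than a loop.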

But what if I write in 2nd grade English? And all my N-Grams are found on many other sites? If your N-Grams are found on a lot of pages then you're safe, as they pass as common. But if the number of matching pages narrows down, then you might be a duplicate. To go for overkill, a human review would quickly label your entire domain as crap and kill it! So Google could just flag sites with an algo and get a human to pass the death sentence.
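
The common-versus-narrow distinction can be sketched by counting, for each n-gram, how many pages contain it. Again this is guess-work in code form: `narrow_match_ratio` and the `common_min` cutoff are hypothetical, and the four-page corpus is a toy.

```python
from collections import Counter

def shingles(text, n=3):
    """Return the set of n-word tuples in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def document_frequency(corpus, n=3):
    """Count, for every n-gram, how many corpus pages contain it."""
    df = Counter()
    for text in corpus:
        df.update(shingles(text, n))  # a set, so each page counts once per n-gram
    return df

def narrow_match_ratio(page, corpus, n=3, common_min=3):
    """Fraction of page's n-grams found on some pages, but fewer than common_min."""
    df = document_frequency(corpus, n)
    grams = shingles(page, n)
    if not grams:
        return 0.0
    narrow = sum(1 for g in grams if 0 < df[g] < common_min)
    return narrow / len(grams)

corpus = [
    "see the dog run fast",
    "see the dog jump",
    "see the dog sit",
    "my rare phrase appears here only",
]
print(narrow_match_ratio("see the dog", corpus))
print(narrow_match_ratio("my rare phrase appears", corpus))
```

The first page is made of word-packs that are everywhere, so nothing is flagged. The second matches n-grams that live on a single page, which is the pattern that would get a site handed to a human reviewer.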

So how should I approach content generation?
This will be material for the next posts in which we'll also break the myth about longtails and huge sites :) Stick around and we'll talk about blackhat evolution!

12 Comments Posted By Readers :

#1 vjstar from Canada
Posted on Thursday, 24 January, 2008
can't wait for the next post about content gen! i'm building myself a new content gen and look forward to hearing what to do to avoid the 2nd grade content quality :P

(does it have to do with the thesaurus you have posted some time ago?)

hurrrry!

Thanks for the great post man.
#2 5ubliminal web
Posted on Thursday, 24 January, 2008
Got a lot of ideas going on in my mind right now and I'll lay it all down in the next few posts that will be the blackhatvolution.
It does have to do with the thesaurus and much more.
#3 Christoph from Austria web
Posted on Friday, 25 January, 2008
nice to see you started to write about it that quickly :)

keep it up
#4 PR from United States
Posted on Friday, 25 January, 2008
I would love to get my hands on a copy of that "Web 1T 5-gram Version 1" at LDC. There's a *lot* you could do with that. The fact that Google is sponsoring its release should tell you something about how far advanced they are beyond that simple but large corpus.
#5 5ubliminal web
Posted on Friday, 25 January, 2008
Why not build your own N-Grams based on your own topic?
That amount of data is huge and I'd love to see a home PC do queries on it.

But slices would work.
#6 Montreal from Canada web
Posted on Sunday, 27 January, 2008
Personally, I'm not sure Google are using the N-grams. In the main results, perhaps, but at least in Google blogsearch you still get loads of gibberish. That might just be google not caring, or intentionally leaving blackhats the blogsearch traffic so that they can test different stuff in with Google blogsearch.

On the other hand, I found dupe content by copy-pasting a big chunk of text (3 sentences) while doing some research, and the SERPs didn't even show all the results; there might even be tiers to their supplemental index (officially it's been rolled back into the main index, but imho, the way it worked is still there). Bottom line is that Google was avoiding showing the dupes, if that makes sense to you?

p.s. It's Gab from Sphinn; the name is Montreal because I'm doing some linkbuilding.
p.p.s. Your bloody captchas are illegible to my HUMAN eyes!
#7 5ubliminal web
Posted on Sunday, 27 January, 2008
I never use Blogsearch. That's about fresh and not about relevant so using N-Grams and such on it would be overkill and ... for what?

On the other hand if you paid attention to my site you would have noticed I mask (and robots.txt block) all links from commenters. So linkbuilding here is of no use.
Please make sure you keep an eye out for the next post. I'll write something especially for you there. ;)

PS: I made my CAPTCHA difficult to read to make sure those who comment do have something to say!
#8 Gab from Canada web
Posted on Monday, 28 January, 2008
Well for those of us who care about finding the most recent blog posts, to use your paradigm, autogenned spam is really not going to be any more useful than stuff several years old.

I just noticed the booboo on the links. Oh well. I did notice that it was nofollow, but figured hell, a link is a link. Besides, Yahoo doesn't use nofollow, so...

Look forward to the next post, btw :). Also, do you have MSN messenger?

As to the captcha, the damn S looks like a Q! And the P's not great either...
#9 5ubliminal web
Posted on Monday, 28 January, 2008
I'm writing a post right now … it'll be interesting. And I'll have the Messenger info placed next to the email address by tomorrow.

I'll try some new fonts and see which looks better :)
#10 Gab from Canada web
Posted on Monday, 28 January, 2008
The more block-ish font I'm seeing now is a million times better and more legible! What an improvement!
#11 m0nkeymafia from Great Britain
Posted on Tuesday, 05 February, 2008
Very interesting dude, I actually used a basic version of N Grams years ago to make a chat bot, however it sucked, but the idea was there haha.

It is very interesting though, it looks like you can get access to all the data. You could generate yourself some content, then run it through your own N Grams style program altering / updating it where applicable to get it to a decent level of N Gram saturation.
#12 5ubliminal web
Posted on Tuesday, 05 February, 2008
This is what I'm working on right now. Hope to have something done pretty soon.
Still gotta run some tests against the enemies: search engines :)
5ubliminal's TellinYa.com SEM & SEO Blog © 2007 - All rights reserved unless mentioned otherwise .
Tellinya.com is relocating to blog.5ubliminal.com. This blog is no longer maintained and comments are no longer accepted / answered.