Everything below is based on observations, coding experience, common sense and guesswork. When I say Google does, I mean Google might do!
The title is a bit extreme, but unfortunately it is mostly true. Most content generation methods no longer work, and content scraping without paying any attention to the crap you put together is dead and buried. It might still bring you a few bucks, but the chances of it working the way it should are gone.
Yes and no. Content generation is actually a three-step process:
Step #1 and Step #3 still stand, but the steps in between are no longer viable. There is a simple rule in content generation that now applies and will only grow stronger:
Content is either semantically correct or it's ruled out as supplemental crap!
Does Google know semantics? Does Google know grammar? NO! They know something worse than the two put together. They know n-grams. They know the sheet we've been teaching them.
A while ago I gave you some basic information on Markov chains in content generation. N-Grams work the same way, except that instead of the probability of occurrence being based on single words, it is based on packs of N words occurring before or after a given word.
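To make the "word-packs" idea concrete, here is a minimal sketch of pulling word n-grams out of a piece of text. The function name `word_ngrams` and the naive lowercase-and-split tokenizer are my own assumptions for illustration; nobody outside Google knows how they actually tokenize.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive words in text, as tuples."""
    # Naive tokenizer (an assumption): lowercase, split on whitespace.
    words = text.lower().split()
    # Slide a window of n words across the text, one word at a time.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


sample = "the quick brown fox jumps over the lazy dog"
bigrams = word_ngrams(sample, 2)   # packs of 2 words
trigrams = word_ngrams(sample, 3)  # packs of 3 words
```

A nine-word sentence gives eight 2-word packs and seven 3-word packs, so the amount of data grows quickly once you run this over a whole site.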
Let's say you generate Markov-based text that is somehow (more or less) readable. And let's also say that Google unleashes its N-Grams on it. What if 20% of your word combinations barely exist anywhere else? Do you think Google will consider that you are reinventing English? Not really. They'll just think you never took the time to learn it and downgrade your site to supple-soup. Unless you are one of the few matches for a search (supplementals), they will always hide the grammar-challenged from public eyes.
N-Grams can validate a chunk of text quite well, and the more n-grams you have, the better. Google's got a sheetload of n-grams (numbers on site are mind-blowing), so they can cross-reference your site with them and rate it in terms of readability. You don't serve unreadable text in the results.
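Here is a rough sketch of what such a readability check could look like: build a table of known n-grams from reference text, then measure what share of a page's n-grams never show up in it. This is guess-work like the rest of the post, not Google's method. The names `build_reference` and `unseen_ratio` and the two-sentence toy corpus are made up for the example.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return every run of n consecutive words in text, as tuples."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_reference(corpus_texts, n):
    """Count how often each n-gram occurs in the reference texts."""
    counts = Counter()
    for text in corpus_texts:
        counts.update(word_ngrams(text, n))
    return counts


def unseen_ratio(text, reference, n):
    """Share of the text's n-grams that never occur in the reference."""
    grams = word_ngrams(text, n)
    if not grams:
        return 0.0
    unseen = sum(1 for gram in grams if gram not in reference)
    return unseen / len(grams)


# Toy reference corpus (hypothetical); a real one would be enormous.
reference = build_reference(
    ["the cat sat on the mat", "the dog sat on the rug"], 2
)
natural = unseen_ratio("the cat sat on the rug", reference, 2)
garbled = unseen_ratio("mat the on cat sat rug", reference, 2)
```

With this toy corpus the natural sentence has no unseen pairs, while four of the five pairs in the shuffled one are unknown. Where exactly the cut-off sits (the 20% above is just my example) is anybody's guess.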
On the other hand, N-Grams are the holy grail of duplicate content detection. You take all the n-grams from a page, see where else they are found, and then look for patterns (if all the n-grams are found on several other sites … it's bad). With a few extra checks you can find duplicate chunks of text easily, but … of course … it takes a lot of firepower.
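The "see where else they are found" step can be sketched with an inverted index that maps each n-gram to the pages containing it. Everything here (`build_index`, `shared_fraction`, the `.example` page names) is hypothetical and only shows the idea on a tiny scale.

```python
from collections import defaultdict


def word_ngrams(text, n):
    """Return the set of n-word runs in text, as tuples."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_index(pages, n):
    """Map each n-gram to the set of page names that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for gram in word_ngrams(text, n):
            index[gram].add(name)
    return index


def shared_fraction(text, index, n):
    """For each known page, the share of text's n-grams it also contains."""
    grams = word_ngrams(text, n)
    hits = defaultdict(int)
    for gram in grams:
        # .get() avoids creating empty entries in the defaultdict index.
        for name in index.get(gram, ()):
            hits[name] += 1
    return {name: count / len(grams) for name, count in hits.items()}


# Hypothetical pages already in the index.
pages = {
    "a.example": "content generation is dead and buried for good",
    "b.example": "n grams are used to rate readable text",
}
index = build_index(pages, 3)
overlap = shared_fraction(
    "they say content generation is dead and buried", index, 3
)
```

Here four of the candidate's six 3-word packs also appear on `a.example` and none on `b.example`, which is the kind of pattern that would be worth a closer look. Doing this for billions of pages is where the firepower comes in.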
But what if I write in 2nd-grade English and all my N-Grams are found on many other sites? If your N-Grams are found on a lot of pages, then you're safe, as they pass as common. But if the number of matching pages narrows down, then you might be a duplicate. To go for overkill, a human review would quickly label your entire domain as crap and kill it! So you could just flag sites with an algo and get a human to pass the death sentence.
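The common-versus-narrow distinction can be sketched by counting on how many pages each n-gram appears and sorting a candidate's n-grams into buckets. The function names and the `common_min_pages` threshold are assumptions made up for this example; a real system would tune such a value and hand the flagged sites to a human.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the set of n-word runs in text, as tuples."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def document_frequency(pages, n):
    """Count on how many pages each n-gram appears (once per page)."""
    df = Counter()
    for text in pages.values():
        df.update(word_ngrams(text, n))
    return df


def classify_ngrams(text, df, n, common_min_pages):
    """Bucket the text's n-grams by how widely they are found."""
    buckets = {"common": 0, "narrow": 0, "unseen": 0}
    for gram in word_ngrams(text, n):
        pages_with_gram = df[gram]  # Counter returns 0 for unknown keys
        if pages_with_gram >= common_min_pages:
            buckets["common"] += 1   # found everywhere: passes as common
        elif pages_with_gram > 0:
            buckets["narrow"] += 1   # found on few pages: duplicate signal
        else:
            buckets["unseen"] += 1   # found nowhere else
    return buckets


# Hypothetical mini-web of three pages.
pages = {
    "p1": "this is a test of the system",
    "p2": "this is a story about the system",
    "p3": "this is a note on rare purple elephants",
}
df = document_frequency(pages, 2)
result = classify_ngrams("this is a rare purple elephants", df, 2, 3)
```

In this toy run "this is" and "is a" count as common because all three pages use them, while "rare purple" and "purple elephants" only exist on one page and land in the narrow bucket. A page whose n-grams are mostly narrow and all point to the same source is what you would flag for review.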
This will be material for the next posts in which we'll also break the myth about longtails and huge sites :) Stick around and we'll talk about blackhat evolution!