Make sure you read this post too. It's kind of related ;)

Scraping is taking content from existing sources for further personal :) use. This is not the official definition (it might be the lamest one out there), but it will do for understanding what it means. So, you actually write a script that downloads a page, takes the parts you are interested in and uses them (or does whatever with them).
Scraping is blackhats' most important weapon. Blackhats don't write content and, as content is king, queen, and palace employees, blackhats gather it from others, add a twist, label it as new and present it back to search engines.
I spent the last few days working on a better algo for text extraction from websites, and I have achieved it. I'll explain the basic concept so you have a starting point to make your own, or you can just buy mine rather soon.
My previous scraping algo used to get P and DIV tags and take the text from them. It was simple and basic, and the outcome was short and not great. Not everyone uses paragraphs to contain their text slices, so I had to find an alternative to scrape less and get more!
As I was sitting and meditating ;) I split HTML tags into three categories:
Non-Separators are style modifiers or links. They live within HTML just to add a twist, not to split blocks of content apart. Such non-separator tags are A, B, U, S, EM, STRONG, SPAN, BR, FONT and some others! When these tags show up in text, they don't mean the text ends and starts over. They just make the text better looking or more functional.
Discardables are tags you don't need. The IMG tag is a discardable. It can have an alt attribute (hence contain text) and it can sit inside a sentence without breaking its flow, but it is usually found separated from the rest by other tags. So we can just drop them dead upfront.
Separators are tags that, by opening or closing, actually separate blocks of text. Such tags are P, TD, TR, TABLE, DIV, DD, DT, BLOCKQUOTE and all others except the ones from above.
I'm sure you are starting to get the big picture by now. By replacing Non-Separators & Discardables with a space (just removing them could weld words together, so always replace them with a space!) you can get all the chunks of text by splitting on the Separators.
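Here is a minimal Python sketch of that idea, not my actual C++ algo. It assumes non-separators and discardables become a space and that every other tag counts as a split point. The names `extract_chunks`, `TAG_RE` and `SPLIT_MARK` are made up for the example, the tag lists are only the partial ones from this post, and it ignores script/style content and HTML comments.

```python
import re
from html import unescape

# Tag sets taken from the post; the lists are partial ("and some others").
NON_SEPARATORS = {"a", "b", "u", "s", "em", "strong", "span", "br", "font"}
DISCARDABLES = {"img"}

# Matches an opening or closing tag and captures its name.
TAG_RE = re.compile(r"<\s*/?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")
SPLIT_MARK = "\x00"  # hypothetical marker, assumed absent from the page text


def extract_chunks(html):
    """Return the blocks of text found between separator tags."""
    def replace_tag(match):
        name = match.group(1).lower()
        if name in NON_SEPARATORS or name in DISCARDABLES:
            return " "        # a space, so neighbouring words don't weld
        return SPLIT_MARK     # everything else is treated as a separator

    marked = TAG_RE.sub(replace_tag, html)
    chunks = []
    for piece in marked.split(SPLIT_MARK):
        text = " ".join(unescape(piece).split())  # collapse whitespace
        if text:
            chunks.append(text)
    return chunks
```

Feeding it `<div><p>Hello <b>big</b> world</p><p>Second<br>line</p></div>` should give two chunks, `Hello big world` and `Second line`. A real page needs more care than a regex can give (broken markup, scripts, comments), so take it as a starting point only.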
If you study my strip_tags function replacement, you will see almost all of it is done there. Currently I have the algo in C++, but really soon I'll move it over to PHP and I might post it here :)
Once you have the chunks of text … it's your show. Figure out how to split them, get sentences, change them beyond recognition and then spit'em'out as a valid website somewhere on the web!
One more hint: whatever sits between [], () and {} is usually an explanation about what you are reading, and it can be discarded too when you start making something out of the chunks you get after putting the moves I mentioned here on the sites you plan to scrape.
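A rough Python sketch of that bracket hint could look like this. The function name `drop_bracketed` is made up for the example, and it only handles flat brackets: with nested ones, a single pass removes just the innermost pair.

```python
import re

# One alternative per bracket type; nested brackets are not handled.
BRACKETED_RE = re.compile(r"\[[^\[\]]*\]|\([^()]*\)|\{[^{}]*\}")


def drop_bracketed(chunk):
    """Remove [..], (..) and {..} spans, then tidy the whitespace."""
    without = BRACKETED_RE.sub(" ", chunk)  # a space, so words don't weld
    return " ".join(without.split())
```

For example, `drop_bracketed("Scraping (see above) is easy [citation needed] today")` should return `Scraping is easy today`. Punctuation right after a removed bracket ends up with a space in front of it, so you may want one more cleanup pass.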
By using the new method I scrape 10x less than before and get 10x more than before :)