Extracting Content From Websites | Web Text Scraping 101 : 5ubliminal's TellinYa

Extracting Text From Pages - Basics of Web-Scraping
Web-Scraping!

Scraping is taking content from existing sources for further personal :) use. This is not the official definition (it might be the lamest one out there), but it will do for understanding what it means. You write a script that downloads a page, takes the parts you are interested in, and does whatever you want with them.

Scraping is blackhats' most important weapon. Blackhats don't write content and, as content is king, queen, and palace employees, blackhats gather it from others, add a twist, label it as new and present it back to search engines.

Why am I writing this?

I spent the last few days working on a better algo for extracting text from websites, and I got there. I'll explain the basic concept so you have a starting point for making your own, or you can just buy mine rather soon.

How To Scrape Text From HTML?

My previous scraping algo grabbed P and DIV tags and took the text from them. It was simple and basic, and the outcome was short and not great. Not everyone uses paragraphs to hold their slices of text, so I had to find an alternative that scrapes less and gets more!

As I was sitting and meditating ;) I split HTML tags into three categories:

  • Separators
  • Non-Separators
  • Discardables

Non-Separators are style modifiers or links. They live within HTML just to add a twist, not to split blocks of content apart. Such non-separator tags are A, B, U, S, EM, STRONG, SPAN, BR, FONT and some others! When these tags appear in text, they don't mean the text ends and starts over. They just make the text better looking or more functional.

Discardables are tags you don't need. The IMG tag is a discardable. It can have an alt attribute (hence contain text), but it can sit inside a sentence without breaking its flow, and it is usually separated from the rest by other tags. So we can just drop them dead upfront.

Separators are tags that, by opening or closing, actually separate blocks of text. Such tags are P, TD, TR, TABLE, DIV, DD, DT, BLOCKQUOTE and all others except the ones listed above.
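
The three categories can be written down as a small lookup. This is only a sketch in Python (not the C++ code mentioned in this post): the tag sets are copied from the lists above, the post itself says there are "some others", and the name classify_tag is made up for illustration.

```python
# Tag lists taken from the post. The post says "and some others", so
# NON_SEPARATORS is an illustrative subset, not an exhaustive list.
NON_SEPARATORS = {"a", "b", "u", "s", "em", "strong", "span", "br", "font"}
DISCARDABLES = {"img"}


def classify_tag(name):
    """Return the category of an HTML tag name (case-insensitive)."""
    name = name.lower()
    if name in NON_SEPARATORS:
        return "non-separator"
    if name in DISCARDABLES:
        return "discardable"
    # Per the post, every other tag (P, TD, DIV, ...) separates text blocks.
    return "separator"
```

Treating "everything else" as a separator keeps the lists short: only the tags that must not break the text have to be enumerated.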

I'm sure you are starting to get the big picture by now. Replace Non-Separators & Discardables with a space (just removing them could weld words together, so always replace them with a space!) and you get all the chunks of text by splitting on the Separators.
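
A minimal sketch of that idea in Python, under a few assumptions: tags are matched with a simple regular expression instead of a real HTML parser, the non-separator list is the partial one given above, and the names extract_chunks and SPLIT_MARK are hypothetical. Comments and SCRIPT or STYLE bodies are not handled here.

```python
import re

# Partial lists from the post; extend them as needed.
NON_SEPARATORS = {"a", "b", "u", "s", "em", "strong", "span", "br", "font"}
DISCARDABLES = {"img"}

# Matches an opening, closing, or standalone tag and captures its name.
TAG_RE = re.compile(r"<\s*/?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")

# Hypothetical split marker; assumed never to occur in the input.
SPLIT_MARK = "\x00"


def extract_chunks(html):
    """Return the text chunks found between separator tags."""
    def swap(match):
        name = match.group(1).lower()
        if name in NON_SEPARATORS or name in DISCARDABLES:
            return " "          # a space, so neighbouring words do not weld
        return SPLIT_MARK       # any other tag is a wall between chunks

    marked = TAG_RE.sub(swap, html)
    chunks = []
    for piece in marked.split(SPLIT_MARK):
        text = " ".join(piece.split())  # collapse runs of whitespace
        if text:                        # drop empty slices
            chunks.append(text)
    return chunks
```

For example, extract_chunks('<div><p>Hello <b>big</b> world</p><p>Second</p></div>') gives two chunks, "Hello big world" and "Second": the B tags vanish into spaces while the P and DIV tags act as walls.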

Rather simple right?

If you study my strip_tags function replacement, you will see it's almost all done there. Currently I have the algo in C++, but really soon I'll move it over to PHP and I might post it here :)

Once you have the chunks of text … it's your show. Figure out how to split them, get sentences, change them beyond recognition and then spit'em'out as a valid website somewhere on the web!

One more hint: text between [], (), {} is usually an explanation of what you are reading, and it can be discarded too once you start making something out of the chunks you get after putting the moves I mentioned here on the sites you plan to scrape.
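
The bracket hint could look roughly like this in Python. It is a sketch, assuming balanced brackets; the function name drop_bracketed is made up, and unbalanced brackets are simply left in place.

```python
import re

# One alternative per bracket type; each matches an innermost pair only.
BRACKETED_RE = re.compile(r"\[[^\[\]]*\]|\([^()]*\)|\{[^{}]*\}")


def drop_bracketed(chunk):
    """Remove [..], (..) and {..} spans from a chunk of text."""
    previous = None
    # Repeat until nothing changes, so nested pairs such as "(a (b) c)"
    # are peeled from the inside out.
    while previous != chunk:
        previous = chunk
        chunk = BRACKETED_RE.sub(" ", chunk)
    return " ".join(chunk.split())  # tidy the leftover whitespace
```

Run it on each chunk after the tag-splitting step, so a bracket pair can never span two chunks.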

By using the new method I scrape 10x less than before and get 10x more than before :)

10 Comments Posted By Readers :

#1 Jason from United States
Posted on Saturday, 12 January, 2008
Very good point. I have been looking to do this myself after I realized the problems with scraping just RSS feeds. You either don't get enough content, or you get content that is not worth scraping. I figured if I just got it from the source it would be better and I would pull all data from the tags. The only problem I see with your method of getting all , , etc is the fact some sites may have several or within other tags like causing you to have dup content when your script pulls it (same content in the as in the tags)
#2 5ubliminal web
Posted on Saturday, 12 January, 2008
You would not have duplicate content, as you would use Markov or any other method of generating content on the final outcome.
Tell me if there's anything I'm missing here, but do not use actual HTML code in the post as it gets stripped.

PS: I always pass Copyscape ;)
#3 Jason from United States
Posted on Saturday, 12 January, 2008
Ah, well it looks like the blog stripped my HTML in the comment... Take this situation: you're scraping a page where the webmaster has a table, inside this table he has a p tag, and inside the p tag there is a ul list. The way I envision it, your scraper is looking for these tags and parses the content. If you are listing each tag on its own for your output, you would have a listing for the table content, the p tag content, and the ul content on its own. However, since the p tag and ul tag are inside the table, wouldn't this make for "dup content" that YOU are scraping, because your output for the table would contain the same output as your ul tag, since the ul was in the table tag? The way I would want to scrape to handle all situations would be to get my output per tag (since some webmasters might not have a ul in a p tag, but some might). So I need a way to filter out tags within tags, since a p tag could contain a ul tag. I don't want my p tag output duplicating my ul tag output, right? Any way around this?
#4 5ubliminal web
Posted on Saturday, 12 January, 2008
:) Wow! The workaround for nested tags is so easy that I never imagined I needed to actually explain how this works. This is why I love comments that … help me clear up things I never think of mentioning.
Let's take the following: [tag1][tag2][tag3]text[/tag3][/tag2][/tag1]. This is the final form. It actually contains only text slices and separators! I'll try to explain this in the best and easiest way to understand. No matter how many consecutive separators you have … those are all separators.

It's like having this text: ?!.sentence goes here?!.

Once a separator, always a separator. It's no longer treated as an HTML tag. You don't care if there's a concrete or a cardboard wall between two rooms. Both are separators. To overcome this you must combine all consecutive separators into one (or treat them as one). Once you are past the first steps (having dumped the rest) and have only text and separators, you don't care about tag syntax anymore. Those tags become walls and lose all their HTML meaning. Tags (closing, opening or standalone) are just fences between strips of text, and no matter how you look at them they will split; each will be a splitter and none will encapsulate anything anymore. So it's like replacing all HTML tags with | and splitting by | and then removing empty slices.

This is a basic Regexp: $slices = preg_split("/<[^>]+>/is",$text); . Then you just remove blank strips.
In the end you have to remove potential duplicates that might repeat throughout a page.
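
For readers who prefer Python, a rough equivalent of that PHP one-liner plus the two clean-up steps might look like this. It is only a sketch: the function name split_on_tags is made up, and exact-match deduplication is an assumption about what "duplicates" means here.

```python
import re


def split_on_tags(text):
    """Split on every tag, then drop blank slices and exact repeats."""
    # Same pattern as the PHP call above: any <...> run is a wall.
    slices = re.split(r"<[^>]+>", text)
    seen = set()
    result = []
    for piece in slices:
        piece = piece.strip()
        if piece and piece not in seen:  # skip blanks and repeats
            seen.add(piece)
            result.append(piece)
    return result
```

Because nested tags are only walls, a ul inside a p inside a table yields each piece of text once; nothing is reported again for the outer tags.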

PS: I'm not an English speaker and you may find this more/less difficult to understand. I'll have the PHP code added by Monday and it will be pretty self explanatory.
#5 Jason from United States
Posted on Saturday, 12 January, 2008
Ha, you know that's funny, because reading your explanation really helped me. I guess it would be that easy, but for some reason I didn't comprehend it before.

Very true, very true... grab all the real text content (between the HTML separator tags) and then just replace all the separators with something like | and then just remove the | itself. I see now. Cool.

This helps. I was banging my head for a minute wondering how to overcome this, but the answer was simple.

Thanks. I also code PHP (I am more of an SEO guy though), so I am working on my code for this too. Thanks, big help!
#6 5ubliminal web
Posted on Saturday, 12 January, 2008
Glad to have helped :) and you're welcome.
#7 Lars Koudal from Denmark web
Posted on Saturday, 15 March, 2008
Hi

After reading your post, I remembered I once found a simpler way to do that. A German guy has created a patch where all you have to do is run it, reboot, and you are golden!

See it here: http://www.lvllord.de/?lang=en&url=downloads#4226patch
#8 5ubliminal web
Posted on Sunday, 16 March, 2008
@Lars: As I mentioned in the Scraping at the speed of light post I would not encourage people to use EXEs but I won't censor your post.
Yes, there are easier to use patches than hand-jobs on the file itself but .SYS files are just a bit safer:) as an EXE can do a lot more evil, easier.

Cheers.
#9 Lars Koudal from Denmark web
Posted on Sunday, 16 March, 2008
Wow, I actually posted in the wrong window? Damn, I had several tabs open, and my comment slipped into the wrong one.

Although I see your point about using executables, just call me lazy :)
#10 5ubliminal web
Posted on Sunday, 16 March, 2008
@Lars: Don't worry … it's cool.
I'm working on a way of interlinking posts based on content automagically but, for now, references like this between each other work :)

PS: I'm rather lazy too but I got a lot of C++ code that can't slip away :) so I'm really paranoid.
5ubliminal's TellinYa.com SEO & SEM Blog © 2007 - All rights reserved unless mentioned otherwise .
Rendered On : [Sunday, 06 July, 2008 - 04:17:35 GMT]   No Ajax / Flash Used Here