
Extracting Sentences From Content - Web Scraping 101 : 5ubliminal's TellinYa

http://www.tellinya.com/art2/316/
I know I said I'd refrain from posting for a short while but … it's not an easy task. There's some stuff you need to talk about :) and at least here I know there are some who actually understand it.
Going after the sentences!

A short while back I published a tutorial on how to easily find and extract text blocks within HTML. It is time to evolve. I'm sure many have put that to good use and applications are only limited by imagination but I'd like to take this further.

So we have the text blocks from content. Those text chunks are separate slices of content. They are not part of the same phrase or sentence. They are different. So why don't we extract actual text sentences from that text? I bet sentences will be much more useful than those text chunks in terms of blackhat content generation and so on.

Locating sentences. What makes them special?

In order to find a sentence we need to understand the structure of the sentence, the way it looks and feels, the way it moves and the way it smells :) Of course I will not teach you semantics here. Semantics is like freestyle. You have the elements but your creativity in how you combine them is the limit.

There are 5 simple steps to follow in order to break text blocks into sentences:

  • Weld them using one of the obvious separators: .?!|
  • Split them using the obvious separators: .?!|
  • Remove everything between ( … ) { … } [ … ] (brackets usually contain extra comments which are not part of the sentence flow)
  • Validate each chunk against a set of rules I will show below!
  • Weed out the invalid and Keep the hot and pretty ones that make them wobots horny.
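
The first three steps above can be sketched in a few lines of Python. This is only a rough illustration, not the exact code behind this post: the function names `weld`, `strip_brackets` and `split_sentences` are made up for the example, and I'm assuming brackets are not nested. The sketch strips brackets before splitting (a slight reordering of the list) so that a period inside a comment doesn't cut a sentence in half.

```python
import re

def weld(chunks, sep="|"):
    """Join text chunks with a separator so chunk borders become sentence borders."""
    return sep.join(chunk.strip() for chunk in chunks if chunk.strip())

def strip_brackets(text):
    """Drop ( ... ), { ... } and [ ... ] spans. Assumes brackets are not nested."""
    return re.sub(r"\([^()]*\)|\{[^{}]*\}|\[[^\[\]]*\]", " ", text)

def split_sentences(text):
    """Split on . ? ! | and normalize whitespace. Empty pieces are dropped."""
    pieces = re.split(r"[.?!|]+", text)
    return [" ".join(piece.split()) for piece in pieces if piece.strip()]

chunks = ["Hello world (aside). How are you", "Second block here!"]
candidates = split_sentences(strip_brackets(weld(chunks)))
```

Keep in mind that such a naive split also cuts on the dot in "3.5" or "e.g.", so expect some broken candidates. That is what the validation step is for.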

Sentence 101 - Looks and Feels!

I will enunciate the first rule you need to understand about sentences. If you understand this one everything else will flow:

Sentences are like women. There are so many you can afford to be picky about the ones you choose!

After I said that I feel feminine anger. But I don't think girls read this blog so I don't worry about that.

There is one basic rule for a sentence. It starts with a Letter (Capital would be best) and ends with letters too. Last words should have over 3 characters unless you aim for sex. And let me get one thing clear. These two sentence features already ruled out a lot but think about the women. We're gonna be picky here!
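
As a rough sketch, the basic rule could be checked like this in Python. The name `has_sentence_shape` and its parameters are hypothetical, and the check assumes the terminating punctuation was already removed when splitting.

```python
def has_sentence_shape(sentence, min_last_word=4, require_capital=True):
    """Basic shape check: starts with a letter, ends with a letter,
    and the last word has more than 3 characters (min_last_word=4)."""
    s = sentence.strip()
    if not s or not s[0].isalpha() or not s[-1].isalpha():
        return False
    if require_capital and not s[0].isupper():
        return False
    last_word = s.split()[-1]
    return len(last_word) >= min_last_word

candidates = ["This is a valid sentence", "It ends with it", "Posted in 2008"]
shaped = [c for c in candidates if has_sentence_shape(c)]
```

Set `require_capital=False` if your source text is sloppy about capitals and you'd rather keep more candidates.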

Sentence model varies - but there are outlines!

First thing we need to understand is that each topic has a different sentence model. Some are slim (many short words), some are fat (many long words), some are short (not summer short), some are tall, some contain many numbers and some contain a lot of punctuation signs. The sentence model varies based on topic. E.g.: You will find a certain type of rather short and concise sentences in an ecommerce store, and long ones with lots of words in a content site.

But no matter how much the models vary there are rules that no viable sentence can break (many will but those are not viable). I made up these rules by analyzing several sentences that pass my judgment filters in terms of viability. The tests I ran them against were:

  • Upper-case, lower-case and total letter count.
  • Words count (count of words in sentence).
  • Words length (maximum, minimum and average length of words).
  • Numbers count (number words or digits).
  • Consecutive punctuations (a , followed by . followed by , does not make a lot of sense).
  • … and several others I'll let you brain-storm about!
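
To make the tests above concrete, here is a minimal Python sketch that measures them for one sentence. The function name `sentence_stats` and the dictionary keys are my own invention for the example. It only counts digits and all-digit tokens, so number words like "three" are not detected, and a word is assumed to be any run of letters, digits or underscores.

```python
import re
import string

# Character class matching any run of ASCII punctuation signs.
PUNCT_RUN = re.compile("[" + re.escape(string.punctuation) + "]+")

def sentence_stats(sentence):
    """Measure one sentence: letter, word, number and punctuation counts."""
    words = re.findall(r"\w+", sentence)
    letters = [c for c in sentence if c.isalpha()]
    lengths = [len(w) for w in words]
    runs = PUNCT_RUN.findall(sentence)
    return {
        "letters": len(letters),
        "upper": sum(c.isupper() for c in letters),
        "lower": sum(c.islower() for c in letters),
        "digits": sum(c.isdigit() for c in sentence),
        "numbers": sum(w.isdigit() for w in words),
        "punctuation": sum(len(r) for r in runs),
        "max_punct_run": max((len(r) for r in runs), default=0),
        "words": len(words),
        "unique_words": len({w.lower() for w in words}),
        "min_word_len": min(lengths, default=0),
        "max_word_len": max(lengths, default=0),
        "avg_word_len": sum(lengths) / len(words) if words else 0.0,
    }

stats = sentence_stats("Hello World, it is 2008 ,., ok")
```

Run it over a handful of sentences you consider good and a handful you consider junk, then compare the numbers. That is how you find thresholds that fit your topic.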

By checking some sample sentences I came to some rules of common sense-tence (get it? sense-tence :)).

  • Upper-Case characters should not exceed 5% of letter count
  • Digits should not exceed 5% of letter count
  • Punctuation signs should not exceed 5% of letter count
  • Percentage of Unique Words out of Word Count should not be less than 50%
  • Should start with Upper-Case
  • Should have more than 5 words and fewer than … fill this in yourself!
  • etc …

Add your own rules to the above to filter sentences, or change the above to fit your needs. (When searching for cellphone models you need to allow more digits and so on ….) These rules may seem strict but there's so many women sentences out there you can and will be picky!
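
A hedged sketch of how these rules might look as one Python filter. The function `passes_rules` and its parameter names are made up for this example. The thresholds default to the percentages listed above, and `max_words` defaults to `None` (no upper limit) because that number is left for you to fill in.

```python
import re
import string

def passes_rules(sentence, max_upper=0.05, max_digits=0.05, max_punct=0.05,
                 min_unique=0.50, min_words=6, max_words=None):
    """Return True if the sentence respects every rule of thumb listed above."""
    s = sentence.strip()
    letters = [c for c in s if c.isalpha()]
    words = re.findall(r"\w+", s)
    if not letters or not words:
        return False
    if not s[0].isupper():                       # should start with upper-case
        return False
    if len(words) < min_words:                   # more than 5 words
        return False
    if max_words is not None and len(words) > max_words:
        return False
    n = len(letters)                             # all ratios are against letter count
    if sum(c.isupper() for c in letters) / n > max_upper:
        return False
    if sum(c.isdigit() for c in s) / n > max_digits:
        return False
    if sum(c in string.punctuation for c in s) / n > max_punct:
        return False
    unique = len({w.lower() for w in words})
    return unique / len(words) >= min_unique

ok = passes_rules("The quick brown fox jumps over the lazy dog")
phones = passes_rules("Call 555 1234 for the best deals available today", max_digits=0.5)
```

The second call shows the kind of tuning mentioned above: loosen `max_digits` when the topic (cellphone models, prices) is naturally full of numbers.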

Putting it all together!

You extract text chunks. You weld, split, verify each sentence against your own set of rules and be picky about them! Dump duplicates.
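
Glued together, the whole flow might look roughly like the Python sketch below. `extract_sentences` is a hypothetical name, and `is_valid` stands for whatever rule set you settle on (here a trivial word-count check just to make the example run). Duplicates are compared case-insensitively, which is my assumption and not a rule from this post.

```python
import re

def extract_sentences(chunks, is_valid):
    """Weld chunks, strip bracketed asides, split, validate and dump duplicates."""
    welded = "|".join(chunks)
    # Remove non-nested ( ... ), { ... } and [ ... ] spans before splitting.
    welded = re.sub(r"\([^()]*\)|\{[^{}]*\}|\[[^\[\]]*\]", " ", welded)
    seen = set()
    kept = []
    for piece in re.split(r"[.?!|]+", welded):
        sentence = " ".join(piece.split())   # normalize whitespace
        key = sentence.lower()               # case-insensitive duplicate check
        if sentence and key not in seen and is_valid(sentence):
            seen.add(key)
            kept.append(sentence)
    return kept

chunks = [
    "The cat sat on the mat. The cat sat on the mat!",
    "Too short",
    "Dogs bark loudly at night (usually).",
]
sentences = extract_sentences(chunks, is_valid=lambda s: len(s.split()) > 3)
```

Passing the validator in as a function keeps the pipeline the same while you swap rule sets per topic.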

Stay tuned as there's more to come in future text scraping and content generation episodes. We'll talk some word stemming and more juicy stuff. And … don't forget the comment form :)

13 Comments Posted By Readers :

#1 Garcia from Bolivia web
Posted on Saturday, 09 February, 2008
I was just thinking about this a couple of days ago. Not for content generation but for a short description of a post to show in the home page. You've come up with really good ideas... good job.
#2 5ubliminal web
Posted on Saturday, 09 February, 2008
Thanks :) Glad to see you're still alive.
This can be used for yourself to index content on site or extract sentences that contain some keywords and so on.
Like creating snippets for your articles.
#3 Paul D from United States
Posted on Sunday, 10 February, 2008
It would take significantly more time but shouldn't you also create rules on language using...
I don't know... perhaps a database you gave out a few months ago ;) Just make your life that much easier when you start to spin the content using that same database.
#4 5ubliminal web
Posted on Sunday, 10 February, 2008
I'm not really sure I got your comment but this article is one in a longer series. One bit at a time. This is about finding sentences.
These sentences that come out with this algo are 99% quality slices of text. They are material to work on.
What you (I) do with them is a whole different thing … :)

We'll go over parts of speech, keyphrase searching and so on.
#5 Gab from Canada web
Posted on Sunday, 10 February, 2008
So ... looking to rank for text chunks, and word stemming huh? Why not try Eli on for size and try and rank for "spammy text" and "spammy spam spam text" :D? Anyways, nice tutorial and clever point about the brackets.
#6 5ubliminal web
Posted on Monday, 11 February, 2008
Eli kills flies with the hammer. I use needles.
I'm a sniper (I don't nuke like Bush) and you'll notice I'll never encourage anyone to put up 1 million page sites.
250 pages can have the same results with less impact on the search engines' indexes and your webserver.

PS: No! Not looking to rank for chunks of text. Building a corpus(cul) (as it's small) of sentences from same topic which can later be used for a lot of stuff including decent content generation.
#7 TheMadHat from United States web
Posted on Monday, 11 February, 2008
Man, you and XMCP keep stealing all my upcoming posts...I'm running out of things to write about.

Way to go with doing the work for me though...and yours is a little more in depth, so I picked up a couple of pointers. Keep on keepin on.
#8 5ubliminal web
Posted on Monday, 11 February, 2008
:) Don't worry and don't hold back. We all got pieces of the puzzle or different views.
PS: And I bet we don't all share the same readers so enlighten yours too :)
#9 Gab from Canada web
Posted on Monday, 11 February, 2008
No million page sites? Boooooooo! What kind of content generator/longtail targeter are you anyways?? :P ;) I like the hammer analogy btw, though killing flies with needles seems a bit cruel.
#10 5ubliminal web
Posted on Monday, 11 February, 2008
Yeah … but it's less messy :) (Hope PETA don't see this one!)
PS: I can't stop laughing … cruel you say …
#11 nogenius from United States
Posted on Saturday, 01 March, 2008
Thanks for this - really helpful post - before I read this post, I was planning to do some really complicated CSS parsing to determine the location of blocks of content. But now, I can just apply "sentence rules" (after applying strip_tags to the scraped page), and with some modifications, I get hundreds of sentences ready to use. :)
#12 5ubliminal web
Posted on Sunday, 02 March, 2008
X-actly :)
#13 susan eros from United States
Posted on Tuesday, 20 May, 2008
Nice post! I was originally planning to write a basic web scraper for my web app in PHP or RoR, but then I came across Feedity ( http://feedity.com ) which made things a lot easier. Feedity generates custom RSS feeds from webpages, and now I just consume the resulting RSS feed in my application. Simple and straight! Check it out sometime!
5ubliminal's TellinYa.com SEM & SEO Blog © 2007 - All rights reserved unless mentioned otherwise .
Tellinya.com is relocating to blog.5ubliminal.com. This blog is no longer maintained and comments are no longer accepted / answered.