I know I said I'd refrain from posting for a short while, but … it's not an easy task. There's some stuff you just need to talk about :) and at least here I know there are some who actually understand it.
A short while back I published a tutorial on how to easily find and extract text blocks within HTML. It is time to evolve. I'm sure many have put that to good use, and the applications are only limited by imagination, but I'd like to take this further.
So we have the text blocks from content. Those text chunks are separate slices of content. They are not part of the same phrase or sentence. They are different. So why don't we extract actual text sentences from that text? I bet sentences will be much more useful than those text chunks in terms of blackhat content generation and so on.
In order to find a sentence we need to understand the structure of the sentence, the way it looks and feels, the way it moves and the way it smells :) Of course I will not teach you semantics here. Semantics is like freestyle. You have the elements but your creativity in how you combine them is the limit.
There are 4 simple steps to follow in order to break text blocks into sentences:
I will enunciate the first rule you need to understand about sentences. If you understand this one everything else will flow:
Sentences are like women. There are so many you can afford to be picky about the ones you choose!
After I said that I feel feminine anger. But I don't think girls read this blog so I don't worry about that.
There is one basic rule for a sentence. It starts with a letter (a capital would be best) and ends with letters too. The last word should have over 3 characters unless you aim for sex. And let me get one thing clear. These two sentence features already rule out a lot, but think about the women. We're gonna be picky here!
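Here is a minimal sketch of that basic rule in Python. The function name `passes_basic_rule` and its two parameters are made up for illustration, and I'm assuming that closing punctuation (`.`, `!`, `?`) does not count as part of the last word. Treat it as a starting point, not the one true filter.

```python
def passes_basic_rule(sentence: str,
                      min_last_word_len: int = 4,
                      require_capital: bool = True) -> bool:
    """Check the basic rule: starts with a letter, ends with a long-enough word.

    Hypothetical helper: the name and both thresholds are illustrative assumptions.
    """
    text = sentence.strip()
    # Drop closing punctuation so we can inspect the last word itself.
    core = text.rstrip(".!?")
    if not core:
        return False
    first = core[0]
    # "Capital would be best": strict mode demands an uppercase first letter.
    if require_capital and not first.isupper():
        return False
    if not first.isalpha():
        return False
    last_word = core.split()[-1]
    # The sentence must end with letters, and "over 3 characters" means at least 4.
    return last_word.isalpha() and len(last_word) >= min_last_word_len
```

Set `require_capital=False` if your source text is sloppy about capitals and you still want to keep those sentences.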
The first thing we need to understand is that each topic has a different sentence model. Some are slim (many short words), some are fat (many long words), some are short (not summer short), some are tall, some contain many numbers and some contain a lot of punctuation signs. The sentence model varies based on topic. E.g.: you will find rather short and concise sentences in an ecommerce store, and long ones with lots of words on a content site.
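To get a feel for the sentence model of a topic you can measure a few sample sentences. The sketch below is just one way to do it: `sentence_profile` and the four numbers it returns (word count, average word length, digit ratio, punctuation ratio) are my own picks for illustration, so swap in whatever features matter for your niche.

```python
import string


def sentence_profile(sentence: str) -> dict:
    """Return rough shape statistics for one sentence (illustrative feature set)."""
    # Strip punctuation glued to words so "now!" counts as a 3-letter word.
    words = [w.strip(string.punctuation) for w in sentence.split()]
    lengths = [len(w) for w in words if w]
    n_chars = len(sentence) or 1  # avoid division by zero on empty input
    return {
        "word_count": len(lengths),
        "avg_word_len": sum(lengths) / len(lengths) if lengths else 0.0,
        "digit_ratio": sum(c.isdigit() for c in sentence) / n_chars,
        "punct_ratio": sum(c in string.punctuation for c in sentence) / n_chars,
    }
```

Run it over a few dozen sentences you like from the target topic and the typical ranges show up quickly. Those ranges are what you turn into filter thresholds.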
But no matter how much the models vary, there are rules that no viable sentence can break (many will, but those are not viable). I made up these rules by analyzing several sentences that pass my judgment filters in terms of viability. The tests I ran them against were:
By checking some sample sentences I came to some rules of common sense-tence (get it … sensetence :)).
Add your own rules to the above to filter sentences, or change the above to fit your needs. (When searching for cellphone models you need to allow more digits and so on ….) These rules may seem strict but there are so many women sentences out there you can and will be picky!
You extract text chunks. You weld, split, and verify each sentence against your own set of rules, and you stay picky about them! Dump duplicates.
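The whole flow can be sketched in a few lines of Python. Everything named here (`weld`, `split_sentences`, `default_rule`, `extract_sentences`) is hypothetical, and a few things are my assumptions: "weld" is taken to mean collapsing line breaks and stray whitespace inside a chunk, sentences are split on `.`, `!` or `?` followed by whitespace and a capital letter, and duplicates are compared case-insensitively. The regex splitter is naive (abbreviations like "e.g. Something" will fool it), so treat this as a rough draft.

```python
import re
from typing import Callable, Iterable, List

# Naive boundary: end punctuation, then whitespace, then a capital letter.
SPLIT_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def weld(chunk: str) -> str:
    """Collapse line breaks and repeated whitespace into single spaces."""
    return " ".join(chunk.split())


def split_sentences(text: str) -> List[str]:
    """Split welded text into candidate sentences."""
    return [s.strip() for s in SPLIT_RE.split(text) if s.strip()]


def default_rule(sentence: str) -> bool:
    """Basic rule: capital first letter, last word alphabetic and over 3 characters."""
    core = sentence.rstrip(".!?")
    words = core.split()
    return (bool(words) and core[0].isupper()
            and words[-1].isalpha() and len(words[-1]) > 3)


def extract_sentences(
    chunks: Iterable[str],
    rules: Iterable[Callable[[str], bool]] = (default_rule,),
) -> List[str]:
    """Weld, split, verify against every rule, and drop duplicates (order kept)."""
    rules = list(rules)
    seen = set()
    kept = []
    for chunk in chunks:
        for sentence in split_sentences(weld(chunk)):
            key = sentence.lower()  # case-insensitive duplicate check
            if key in seen:
                continue
            if all(rule(sentence) for rule in rules):
                seen.add(key)
                kept.append(sentence)
    return kept
```

Pass your own list of rule functions as `rules` to be as picky as you like. Each rule takes a sentence and returns `True` to keep it.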
Stay tuned as there's more to come in future text scraping and content generation episodes. We'll talk some word stemming and more juicy stuff. And … don't forget the comment form :)