
Remember the Extract Text Zones From Page Tutorial? Good! : 5ubliminal's TellinYa

Extracting text-zones from pages:

I don't have much to write here except to point you to the original post about text-zones extraction, where I explained the concept. This method has served me well and still does, and now I'm sharing it so you get your share of the fun. The PHP script is below: feed it HTML and get the text-zones back as an array. Easy!
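To make the idea concrete, here is a simplified illustration of what a "text zone" is. It is a sketch only, not the script from this post: textZonesSketch() is a made-up name, the attribute stripping is a plain regex, and it does not drop the first and last piece the way the full script does. Inline tags are dissolved into the surrounding text, and every other tag acts as a separator.

```php
<?php
// Simplified illustration only, NOT the full script from this post.
// textZonesSketch() is a hypothetical name chosen for this example.
function textZonesSketch(string $html): array
{
    // Inline tags that should not split a zone (same list the script uses).
    $inline = 'a|b|u|i|strong|em|s|strike|span|font';
    // Drop attributes so every tag is reduced to <name> or </name>.
    $ml = preg_replace('@<(/?[a-z][a-z0-9]*)\b[^>]*>@i', '<$1>', $html);
    // Dissolve inline tags; their text stays part of the surrounding zone.
    $ml = preg_replace('@</?(?:' . $inline . ')>@i', '', $ml);
    // Every remaining tag is a separator: split on it.
    $pieces = preg_split('@<[^>]*>@', $ml);
    // Normalise whitespace and discard empty pieces.
    $zones = [];
    foreach ($pieces as $piece) {
        $text = trim(preg_replace('@\s+@', ' ', $piece));
        if ($text !== '') {
            $zones[] = $text;
        }
    }
    return $zones;
}

$html = '<div><h1>Title</h1><p>Some <b>bold</b> and <a href="#">linked</a> text.</p>'
      . '<ul><li>One</li><li>Two</li></ul></div>';
print_r(textZonesSketch($html));
```

For this sample the sketch returns four zones: "Title", "Some bold and linked text.", "One" and "Two". The bold and linked words stay inside their paragraph because b and a are on the non-separator list.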

The ExtractTextZones Script:

For this code to work it needs my stripTags function. To get the HTML you can use my PHP cURL class, eHttpClient.
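If you don't have the stripTags helper at hand, a rough stand-in can get the script running. This is an assumption-heavy sketch: it assumes the only job of the stripTags($html_code,1,1) call is to strip attributes (as the comment in the script says), the name stripAttributesSketch() is made up, and the meaning of the two extra arguments is unknown, so they are simply ignored. The real function may well do more.

```php
<?php
// Hypothetical stand-in, NOT the original stripTags() from tellinya.com.
// Assumption: the script only needs every tag reduced to its bare name.
function stripAttributesSketch(string $html): string
{
    // Keep the tag name and an optional self-closing slash, drop the rest.
    // Limitation: an attribute value containing ">" will confuse this pattern.
    $out = preg_replace('@<(/?[a-z][a-z0-9]*)\b[^>]*?(/?)>@i', '<$1$2>', $html);
    // preg_replace() returns null on a regex error; fall back to the input.
    return $out ?? $html;
}

// Thin wrapper so the script's call resolves when the real helper is absent.
// The two extra arguments are accepted but ignored (their meaning is unknown).
if (!function_exists('stripTags')) {
    function stripTags($html, $flagA = 1, $flagB = 1)
    {
        return stripAttributesSketch((string) $html);
    }
}

echo stripTags('<p class="intro">Hi <a href="/x" title="y">there</a><br /></p>', 1, 1), "\n";
```

The function_exists() guard means the wrapper steps aside as soon as the real stripTags is loaded before this file.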

<?php
//-----------------------------------------------------------------
//-- Copyright 5ubliminal 2008. (5ubliminal.com)
// http://www.tellinya.com/art2/346/
// You can do anything with it except selling it or claiming it.
// Copyright notice must remain intact and links are appreciated!
//-----------------------------------------------------------------
function extractTextZones($html_code){
    //---------------------------------------------
    //-- these tags are not considered separators
    $non_separators = "a|b|u|i|strong|em|s|strike|span|font";
    //-- these tags are considered irrelevant and removed
    $non_requireds = "img|br|nobr|spacer";
    //-- the rest of tags are considered separators
    //---------------------------------------------
    //-- my striptags function http://www.tellinya.com/art2/61
    //-- I use it on next line just to strip away attributes
    $ml = stripTags($html_code,1,1);
    //-- we play a bit with spaces and spaces between tags ...
    //-- (whitespace between tags, then empty elements, then self-closing tags)
    $ml = preg_replace('@>\s+<@','><',$ml);
    $ml = preg_replace('@<([^\s>]+)[^>]*>\s*</\1>@','',$ml);
    $ml = preg_replace('@<[a-z]+\s*/>@i',' ',$ml);
    //-- here we drop the non required tags
    $ml = preg_replace("@<(".$non_requireds.")>@i"," ",$ml);
    //-- we drop the non separators that open and close
    while(true){
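        //-- single quotes keep \1 a backreference (in double quotes "\1" is an octal escape)
        $newml = preg_replace('@<('.$non_separators.')>(.*?)</\1>@i','$2',$ml);
        //-- if no change occurs (or the regex fails) we break
        if($newml === null || $newml === $ml) break;
        $ml = $newml;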
    }
    //-- we drop the non separators that open OR close
    //-- we try to idiot-proof the code
    $ml = preg_replace("@</?(".$non_separators.")>@si"," ",$ml);
    //-- we now play a bit with spaces and <>
    $ml = preg_replace('@>\s+@','>',$ml);
    $ml = preg_replace('@\s+<@','<',$ml);
    //-- adjacent tags get fused into one, so a run of tags splits only once below
    $ml = preg_replace('@>\s*<@','',$ml);
    //-- we drop all consecutive spaces and keep just one
    $ml = preg_replace('@\s+@',' ',$ml);
    //-- we drop any tags we might have missed on the way
    //-- (some might have appeared after dropping spaces)
    $pieces = preg_split("@<.*?>@", trim($ml));
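    //-- we remove the first and last
    //-- (I can't remember why I did this but I must have had a good reason)
    //-- (probably: when the markup starts and ends with a tag, both are empty strings;
    //--  note that text outside the outermost tags gets dropped too)
    array_splice($pieces,0,1); array_pop($pieces);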
    //-- and we return the array with the slices
    return array_values($pieces);
}
//--
?>

If you've got problems with it … holler.
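One thing worth checking first if the output looks wrong: the patterns only work with their backslashes intact. The short check below is a sketch with a made-up sample string. It shows what happens when `\s+` degrades to `s+`, and why `\1` needs care inside a double-quoted PHP string.

```php
<?php
$sample = '<p>glass   house</p>';

// Without the backslash, "s+" matches runs of the letter "s".
$broken = preg_replace('@s+@', ' ', $sample);
// With it, "\s+" matches runs of whitespace, which is what the script wants.
$fixed  = preg_replace('@\s+@', ' ', $sample);

echo $broken, "\n";   // the letters "s" are gone, the extra spaces are not
echo $fixed, "\n";    // <p>glass house</p>

// In a double-quoted PHP string "\1" is an octal escape (the byte 0x01),
// not a backreference. Single quotes, or "\\1", keep the backslash.
var_dump("\1" === chr(1));    // bool(true)
var_dump("\\1" === '\1');     // bool(true)
```

If your copy of the script eats letters instead of spaces, a missing backslash is the likely cause.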

6 Comments Posted By Readers :

#1 silent from Indonesia web
Posted on Tuesday, 15 April, 2008
err... I believe there's a typo in the title.
#2 5ubliminal web
Posted on Tuesday, 15 April, 2008
Thanks for the heads-up. I posted this in great hurry and didn't pay attention to details.
PS: I'm covered in shame right now … but I'll be OK :)
#3 m0nkeymafia from Great Britain
Posted on Monday, 21 April, 2008
I like your commenting style, might actually start to use that myself if i remember
#4 chatmasta from United States
Posted on Wednesday, 18 June, 2008
5ubliminal,

I put this script on my local windows machine, so that may be what caused these problems, but you were missing a few escapes in your regexes. It wasn't removing spaces, but it WAS removing s's. It was just a matter of escaping these regexes:

CHANGE THIS:
//-- we know play a bit with spaces and <>
$ml = preg_replace("@>s+@",">",$ml);
$ml = preg_replace("@s+<@","<",$ml);
$ml = preg_replace("@s+
#5 5ubliminal web
Posted on Wednesday, 18 June, 2008
My blog breaks the slashes. :) I built this site without paying a lot of attention to details. It is free you know.
And I found that problem after it was pointed out by others.

Didn't have time to update all the broken posts but most people fixed them themselves.
I'll try to look into this these days but I'm really busy right now with the upcoming launch: Check the last three posts in the RSS!

Thanks.
#6 chatmasta from United States
Posted on Wednesday, 18 June, 2008
lol yeah I noticed that...heres the full script if you want to copy it: http://pastebin.com/m5f0314f1

nice script, nice blog
5ubliminal's TellinYa.com SEM & SEO Blog © 2007 - All rights reserved unless mentioned otherwise.
Rendered On : [Tuesday, 07 October, 2008 - 13:03:12 GMT]