Extracting text-zones from pages:
I don't have much to add here beyond pointing you to the original post about text-zones extraction, where I explained the concept. This method has served me well and still does, and now I'm sharing it so you can have your share of the fun. The PHP script is below: feed it HTML and get the text-zones back. Easy!
The ExtractTextZones Script:
For this code to work it needs my stripTags function, and to fetch the HTML you can use my PHP cURL class, eHttpClient.
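<?php
// Requires the stripTags() helper mentioned above.
function extractTextZones($html_code){
    // Inline tags that do not separate text zones
    $non_separators = "a|b|u|i|strong|em|s|strike|span|font";
    // Tags that carry no text of their own
    $non_requireds = "img|br|nobr|spacer";
    $ml = stripTags($html_code,1,1);
    // Remove whitespace between tags, then drop empty elements
    $ml = preg_replace('@>\s+<@','><',$ml);
    $ml = preg_replace('@<([^\s>]+)[^>]*>\s*</\1>@','',$ml);
    // Self-closing and non-required tags become a space
    $ml = preg_replace('@<[a-z]+\s*/>@i',' ',$ml);
    $ml = preg_replace('@<('.$non_requireds.')>@i',' ',$ml);
    // Unwrap inline tags until nothing changes (handles nesting)
    while(true){
        $newml = preg_replace('@<('.$non_separators.')>(.*?)</\1>@i','$2',$ml);
        if($newml === $ml) break;
        $ml = $newml;
    }
    // Any inline tags still left over become a space
    $ml = preg_replace('@</?('.$non_separators.')>@si',' ',$ml);
    // Trim whitespace around tags
    $ml = preg_replace('@>\s+@','>',$ml);
    $ml = preg_replace('@\s+<@','<',$ml);
    // Fuse adjacent tags so a run of tags counts as one separator
    $ml = preg_replace('@>\s*<@','',$ml);
    $ml = preg_replace('@\s+@s',' ',$ml);
    $pieces = preg_split('@<.*?>@', trim($ml));
    // Drop the pieces before the first tag and after the last one
    array_splice($pieces,0,1); array_pop($pieces);
    return array_values($pieces);
}
?>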
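To make the core trick easier to follow, here is a minimal trace of three of the script's steps on a toy fragment. Treat it as a sketch under assumptions, not as part of the script: the input string is made up, the variable names ($inline, $zones) are only for illustration, and it assumes the markup has already been cleaned so that tags carry no attributes, which means it skips the stripTags call entirely.

```php
<?php
// Toy input (made up for illustration); tags carry no attributes.
$ml = '<div><h1>Title</h1><p>Some <b>bold</b> text</p></div>';

// Step 1: unwrap inline tags until nothing changes, so nested ones go too.
$inline = 'a|b|u|i|strong|em|s|strike|span|font';
do {
    $before = $ml;
    $ml = preg_replace('@<(' . $inline . ')>(.*?)</\1>@i', '$2', $ml);
} while ($ml !== $before);
// $ml is now: <div><h1>Title</h1><p>Some bold text</p></div>

// Step 2: fuse adjacent tags so each run of tags acts as one separator.
$ml = preg_replace('@>\s*<@', '', $ml);
// $ml is now: <divh1>Title</h1p>Some bold text</p/div>

// Step 3: split on the fused tags, then drop the empty first and last pieces.
$pieces = preg_split('@<.*?>@', trim($ml));
$zones = array_slice($pieces, 1, -1);

print_r($zones); // Array ( [0] => Title [1] => Some bold text )
```

Fusing the tags in step 2 is what keeps empty strings out of the result: without it, every ">" followed directly by "<" would produce an empty piece when splitting.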
If you run into problems with it … holler.