26 February 2008 - Major Update!
I updated the strip_tags replacement function with a new version that fixes several issues and adds new helper functions for better decoding of HTML characters. Hold on to it and save it in your PHP library, as I'll have some new code up real soon that will need this updated version.
PS: I'm still alive :) Don't worry …
Why the strip_tags php function replacement?
The most annoying thing about strip_tags came up when I once tried to strip two P tags with no space between them. The HTML looked like this:
<p>Paragraph1 Content</p><p>Paragraph2 Content</p>
And the output of strip_tags against this HTML code shocked me!
Paragraph1 ContentParagraph2 Content
The strip_tags function does its job very well but cannot understand that when a P ends and a new one starts it needs to space them out.
So I decided to rewrite this function, as I love to reinvent the wheel every time I have the chance to do it!
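The behaviour described above is easy to reproduce. The snippet below is only a minimal sketch: it runs the built-in strip_tags on the same two paragraphs and then tries one possible quick workaround (padding closing P tags with a space first). The workaround is my illustration for a single tag, not the replacement function presented in this post.

```php
<?php
$html = '<p>Paragraph1 Content</p><p>Paragraph2 Content</p>';

// Built-in behaviour: the two paragraphs are glued together.
$plain = strip_tags($html);
echo $plain, "\n"; // Paragraph1 ContentParagraph2 Content

// Illustrative workaround: add a space after each closing P tag,
// strip the tags, then collapse the extra whitespace.
$spaced = strip_tags(str_replace('</p>', '</p> ', $html));
$spaced = trim(preg_replace('/\s+/', ' ', $spaced));
echo $spaced, "\n"; // Paragraph1 Content Paragraph2 Content
```

This only handles one hard-coded tag, which is why a configurable list of block tags is worth having.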
The goals of the new stripTags function
I made a list of what I need from a strip tags function. This is what I came up with:
- It has to separate blocks of data that are naturally separated in browsers such as: div, p, li, td (the list can be customized in the function).
- It has to dump large tags such as Script and Style, plus Comments and CDATA.
- It has to allow me to just strip the attributes and keep the tags.
- It has to be able to replace IMG tags with their ALT attributes and ACRONYM tags with their Title attributes.
- It has to be able to strip consecutive spaces.
- It has to allow me to strip tags selectively: some of them, none of them, all of them.
- And it has to be fully customizable.
So with these in mind I got to work. You can see the function below. It is somewhat commented, but I will explain it in detail after its body.
The stripLargeTags Helper Function
The stripLargeTags function removes script, style and code blocks, comments, CDATA sections and any other <! ... > or <? ... > declarations. Use it to keep just the HTML of a page.
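<?php
function stripLargeTags($html){
    // Patterns for the bulky parts of a page that carry no readable text.
    $searches = array(
        "/<!\[CDATA\[.*?\]\]>/si",         // CDATA sections (non-greedy, so two sections are not merged)
        "/<script[^>]*>.*?<\/script>/si",  // scripts
        "/<style[^>]*>.*?<\/style>/si",    // style sheets
        "/<code[^>]*>.*?<\/code>/si",      // code listings
        "/<!--.*?-->/s",                   // comments
        "/<!.*>/Us",                       // doctype and other <! declarations
        "/<\?.*>/Us"                       // processing instructions (XML prolog, PHP tags)
    );
    // A single replacement string is applied to every pattern.
    return preg_replace($searches, '', $html);
}
?>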
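To see the idea in isolation, here is a small self-contained sketch. The sample page is made up for illustration, and only three of the patterns are used. It shows that preg_replace accepts an array of patterns together with a single replacement string.

```php
<?php
// Hypothetical sample page, invented for this example.
$html = '<html><head><style>p{color:red}</style>'
      . '<script type="text/javascript">alert("hi");</script></head>'
      . '<body><!-- banner --><p>Hello</p></body></html>';

// Every pattern in the array is replaced by the same empty string.
$patterns = array(
    '/<script[^>]*>.*?<\/script>/si',
    '/<style[^>]*>.*?<\/style>/si',
    '/<!--.*?-->/s',
);
$clean = preg_replace($patterns, '', $html);

echo $clean, "\n"; // <html><head></head><body><p>Hello</p></body></html>
```

The non-greedy `.*?` matters here: a greedy `.*` would swallow everything between the first opening tag and the last closing tag of the same kind.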
The HTML decoding helper functions.
The HTML translation table is an array of characters and their HTML-encoded equivalents. PHP's built-in table is missing many characters, and this new function adds them. I found it on PHP.net and soon I'll compare it with my C++ one and add any missing ones.
<?php
function get_html_translation_table_CP1252() {
    // Start from PHP's own table and add the Windows-1252 characters 130-159.
    $trans = get_html_translation_table(HTML_ENTITIES);
    $trans[chr(130)] = '&sbquo;';
    $trans[chr(131)] = '&fnof;';
    $trans[chr(132)] = '&bdquo;';
    $trans[chr(133)] = '&hellip;';
    $trans[chr(134)] = '&dagger;';
    $trans[chr(135)] = '&Dagger;';
    $trans[chr(136)] = '&circ;';
    $trans[chr(137)] = '&permil;';
    $trans[chr(138)] = '&Scaron;';
    $trans[chr(139)] = '&lsaquo;';
    $trans[chr(140)] = '&OElig;';
    $trans[chr(145)] = '&lsquo;';
    $trans[chr(146)] = '&rsquo;';
    $trans[chr(147)] = '&ldquo;';
    $trans[chr(148)] = '&rdquo;';
    $trans[chr(149)] = '&bull;';
    $trans[chr(150)] = '&ndash;';
    $trans[chr(151)] = '&mdash;';
    $trans[chr(152)] = '&tilde;';
    $trans[chr(153)] = '&trade;';
    $trans[chr(154)] = '&scaron;';
    $trans[chr(155)] = '&rsaquo;';
    $trans[chr(156)] = '&oelig;';
    $trans[chr(159)] = '&Yuml;';
    ksort($trans);
    return $trans;
}
function unhtmlentities($string){
    // Numeric entities first: hexadecimal (&#x41;), then decimal (&#65;).
    // Callbacks are used because the /e modifier no longer exists in PHP 7.
    $string = preg_replace_callback('~&#x([0-9a-f]+);~i', function($m){ return chr(hexdec($m[1])); }, $string);
    $string = preg_replace_callback('~&#([0-9]+);~', function($m){ return chr((int)$m[1]); }, $string);
    // Named entities: flip the table so it maps entity => character.
    $trans_tbl = get_html_translation_table_CP1252();
    $trans_tbl = array_flip($trans_tbl);
    return strtr($string, $trans_tbl);
}
?>
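If your pages are UTF-8 rather than CP1252, PHP's built-in html_entity_decode may already cover most of this. The snippet below is a minimal sketch under that assumption (UTF-8 in and out, PHP 5.4 or later for the ENT_HTML401 flag). It is an alternative to the helpers above, not the same code.

```php
<?php
// Assumption: both the input and the desired output are UTF-8,
// unlike the single-byte CP1252 table built above.
$flags = ENT_QUOTES | ENT_HTML401;

// Named entities, including the typographic ones.
$named = html_entity_decode('Caf&eacute; &hellip; &ldquo;ok&rdquo;', $flags, 'UTF-8');

// Numeric entities, hexadecimal and decimal.
$numeric = html_entity_decode('&#x41;&#66;&#8364;', $flags, 'UTF-8');

echo $named, "\n";
echo $numeric, "\n";
```

Unlike the chr() based helper, this also handles code points above 255, since the result is a multi-byte UTF-8 string.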
The strip_tags replacement function.
This is the new stripTags function. I'm not sure how it performs on malformed HTML (never tried it :)) but it works very well on all the pages I used it on. It's customizable, so try it and see the outcome.
<?php
function stripTags(
    $html,
    $stripAttributes = false,
    $keepTags        = false,
    $stripHtmlChars  = true,
    $stripSpaces     = true,
    $stripImages     = true,
    $stripAcronyms   = true
){
    // Put a space in front of tags that browsers render as separate blocks.
    $tags2Separate = "div|p|td|li|acronym|img";
    $tags2Separate = preg_split("/[^a-z]/i", $tags2Separate);
    // The tag names are grouped so the optional slash applies to all of them,
    // and [^>]* lets tags without attributes (<p>, </p>) match as well.
    $html = preg_replace('/<(\/?(?:'.implode("|", $tags2Separate).'))\b([^>]*)>/i', ' <\\1\\2>', $html);
    if($stripImages){
        // Replace IMG tags with their ALT text.
        $html = preg_replace("/<img[^>]+?alt=\"([^\"]*)\"[^>]*>/si", "\\1", $html);
    }
    if($stripAcronyms){
        // Replace ACRONYM elements with their TITLE text.
        $html = preg_replace("/<acronym.*title=\"([^\"]*)\"[^>]*>.*<\/acronym>/Ui", "\\1", $html);
    }
    $text = stripLargeTags($html);
    if($stripAttributes){
        // Reduce every tag that carries attributes to its bare name.
        if(preg_match_all("@<[^\s>]+\s.+?>@si", $text, $tags)){
            $tags = array_unique($tags[0]);
            foreach($tags as $tag){
                if(!strchr($tag, ' ')) continue;
                $tagml = $tag; $tag = trim(trim($tag, "<>"));
                if(!preg_match("@^([^\s]+)\s.*$@is", $tag, $pcs)) continue;
                $tag_name = $pcs[1];
                if($tag_name[0] == '/'){
                    // Closing tag
                    $tag_name = substr($tag_name, 1);
                    $text = str_replace($tagml, "</".$tag_name.">", $text);
                    continue;
                }
                if($tag_name[strlen($tag_name)-1] == '/'){
                    // Self-closing tag
                    $tag_name = substr($tag_name, 0, strlen($tag_name)-1);
                    $text = str_replace($tagml, "<".$tag_name."/>", $text);
                    continue;
                }
                $text = str_replace($tagml, "<".$tag_name.">", $text);
            }
        }
    }
    if(is_bool($keepTags)){
        if(!$keepTags){
            // Strip every tag.
            $text = preg_replace("@<[^>]+>@si", "", $text);
        }
    }else{
        if(is_string($keepTags)){
            $keepTags = preg_split("/[^a-z]/i", $keepTags);
        }
        if(is_array($keepTags)){
            if(preg_match_all("/<([^\/\s>]+)[^>]*>/si", $text, $tags)){
                $stripTags = array_unique(array_diff($tags[1], $keepTags));
                // With nothing to strip the pattern would be empty and match every tag.
                if($stripTags){
                    $stripTags = array_map('preg_quote', $stripTags);
                    // The lookahead stops "a" from also matching "abbr".
                    $text = preg_replace("/<\/?(?:".implode("|", $stripTags).")(?=[\s\/>])[^>]*>/", "", $text);
                }
            }
        }
    }
    $text = str_replace("&amp;", "&", $text);
    if($stripHtmlChars){
        // Decode known entities, then blank out whatever is left.
        $text = unhtmlentities($text);
        $text = preg_replace("/&[^;]+;/Ui", " ", $text);
    }
    if($stripSpaces){
        $text = str_replace("&nbsp;", " ", $text);
        $text = preg_replace("/\s+/", " ", $text);
    }
    return trim($text);
}
?>
Understanding the strip_tags replacement function.
First I will explain the parameters and how to combine them to get different results.
- $html
This is the HTML code you want to play with. You can retrieve web pages with PHP and cURL using another PHP class I wrote and posted here.
- $stripAttributes
This lets you keep or strip the attributes of the tags that remain in the output. It only makes sense together with $keepTags, explained next.
- $keepTags
This parameter can be a bool or a string/array.
- If it is a bool (true or false) it works like this: true keeps all tags untouched, false strips all HTML tags out.
- If it is an array it is used as is; if it is a string it is split on its separators into an array. All tag names found in it are kept and the rest are stripped off.
- $stripHtmlChars
If this is set to true, the HTML entities the function knows are decoded and whatever is still left between & and ; is replaced with a space. These leftover characters have little value in text.
- $stripSpaces
If set to true it replaces all consecutive spaces with one space. Line breaks (CR/LF) and tabs are collapsed along the way.
- $stripImages
Replaces IMG tags with the values of their ALT attributes.
- $stripAcronyms
Replaces ACRONYM tags with the values of their TITLE attributes.
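The string/array form of $keepTags boils down to "keep the tags on a list, drop the rest". The snippet below is a small self-contained sketch of that idea. The helper name keepOnly() is made up for this example, and it uses a callback per tag instead of the array_diff approach in stripTags, so treat it as an illustration rather than the same code.

```php
<?php
// Hypothetical helper for illustration; keepOnly() is not part of stripTags.
function keepOnly($html, $keep)
{
    // Accept "b|i" or "b,i" style strings as well as arrays.
    if (is_string($keep)) {
        $keep = preg_split('/[^a-z0-9]+/i', $keep, -1, PREG_SPLIT_NO_EMPTY);
    }
    $keep = array_map('strtolower', $keep);

    // Decide tag by tag: keep it if its name is on the list, drop it otherwise.
    return preg_replace_callback(
        '/<\/?([a-z][a-z0-9]*)\b[^>]*>/i',
        function ($m) use ($keep) {
            return in_array(strtolower($m[1]), $keep, true) ? $m[0] : '';
        },
        $html
    );
}

$html = '<div><b>Bold</b> and <i>italic</i> <a href="#">link</a></div>';
echo keepOnly($html, 'b|i'), "\n";      // <b>Bold</b> and <i>italic</i> link
echo keepOnly($html, array('a')), "\n"; // Bold and italic <a href="#">link</a>
```

Only the tags are removed; their text content stays in place.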
Finally, I will explain how two of the parameters combine. If you set $keepTags to true and $stripAttributes to true, you will be left with bare tags, without their attributes. Like this:
<a href="#">link</a>
will become
<a>link</a>
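The same effect can be sketched with a single regular expression. This is a simplified stand-in with a made-up helper name, dropAttributes(), not the loop used inside stripTags, and it assumes attribute values never contain a ">" character.

```php
<?php
// Hypothetical helper for illustration; dropAttributes() is not part of stripTags.
function dropAttributes($html)
{
    // $1 = optional slash plus tag name, $2 = optional self-closing slash.
    return preg_replace('/<(\/?[a-z][a-z0-9]*)\b[^>]*?(\/?)>/i', '<$1$2>', $html);
}

echo dropAttributes('<a href="#">link</a>'), "\n";        // <a>link</a>
echo dropAttributes('<img src="a.png" alt="A" />'), "\n"; // <img/>
```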
The stripTags function can be easily customized!
<?php
$tags2Separate = "div|p|td|li|acronym|img|br";
?>
This line can be changed so that other tags (here br was added) also break naturally and gain some space between them even if there is none in the HTML.
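To show what adding a tag to that list changes, here is a self-contained sketch of the separation step on its own. The helper name separateBlocks() and the sample HTML are made up for this example, and the built-in strip_tags is used for the final stripping just to keep the snippet short.

```php
<?php
// Hypothetical helper for illustration; separateBlocks() is not part of stripTags.
function separateBlocks($html, $tags = 'div|p|td|li|br')
{
    $names = preg_split('/[^a-z0-9]+/i', $tags, -1, PREG_SPLIT_NO_EMPTY);
    // Put a space in front of every opening or closing tag on the list.
    $pattern = '/<(\/?(?:' . implode('|', array_map('preg_quote', $names)) . '))\b([^>]*)>/i';
    return preg_replace($pattern, ' <$1$2>', $html);
}

// Strip the tags and collapse whitespace after the separation step.
function toText($html)
{
    return trim(preg_replace('/\s+/', ' ', strip_tags($html)));
}

$html = 'Line one<br>Line two<p>Para</p>';
echo toText(separateBlocks($html)), "\n";                 // Line one Line two Para
echo toText(separateBlocks($html, 'div|p|td|li')), "\n"; // Line oneLine two Para
```

Without br in the list, the two lines run together exactly like the paragraphs in the strip_tags example at the top of this post.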
Last but not least, any bugs you notice will be quickly addressed, and I'm looking forward to creative input to extend its functionality.