5ubliminal@twitter

strip_tags PHP Function Replacement : 5ubliminal's TellinYa

26 February 2008 - Major Update!

I updated the strip_tags function replacement with a new version that fixes some issues and adds new helper functions for better decoding of HTML characters. Hold on to it and save it in your PHP library, as I'll have some new code up real soon that will need this updated version.

PS: I'm still alive :) Don't worry …
Why the strip_tags php function replacement?

The most annoying thing about strip_tags showed up when I once tried to strip two P tags with no space between them. The HTML looked like this:

<p>Paragraph1 Content</p><p>Paragraph2 Content</p>

And the output of strip_tags against this HTML code shocked me!

Paragraph1 ContentParagraph2 Content

The strip_tags function does its job very well, but it cannot understand that when one P ends and a new one starts it needs to space them out.
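The behavior is easy to reproduce with the built-in function alone. A minimal check, using only the sample HTML from above:

```php
<?php
// The two paragraphs have no whitespace between them in the source.
$html = '<p>Paragraph1 Content</p><p>Paragraph2 Content</p>';

// strip_tags() removes the tags and nothing else, so the last word of the
// first paragraph is glued to the first word of the second one.
$plain = strip_tags($html);

echo $plain, "\n";
// prints: Paragraph1 ContentParagraph2 Content
```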

So I decided to rewrite this function, as I love to reinvent the wheel every time I have the chance to do it!

The goals of the new stripTags function

I made a list of what I need from a strip tags function. This is what I came up with:

  • It has to separate blocks of data that are naturally separated in browsers, such as div, p, li and td (the list can be customized in the function).
  • It has to dump large tags such as Script and Style, plus Comments and CDATA.
  • It has to allow me to just strip the attributes and keep the tags.
  • It has to be able to replace IMG tags with their ALT attributes and ACRONYM tags with their Title attributes.
  • It has to be able to strip consecutive spaces.
  • It has to allow me to strip tags selectively: some of them, none of them, all of them.
  • And it has to be fully customizable.

So with these in mind I got to work. You can see the function below. It is somewhat commented, but I will explain it in detail after its body.

The stripLargeTags Helper Function

stripLargeTags removes script, style and code blocks, plus comments and CDATA sections. Use it to keep just the HTML of the page.

<?php
//-- Large tags not needed in content extraction must be dumped!
function stripLargeTags($html){
    $searches = array (
        "/<!\[CDATA\[.*?\]\]>/si", // Remove CDATA sections one by one (non-greedy)
        "/<script[^>]*>.*?<\/script>/si", // Strip out javascript
        "/<style[^>]*>.*?<\/style>/si", // Strip out styles
        "/<code[^>]*>.*?<\/code>/si", // Strip out code chunks
        "/<!--.*?-->/s", // Strip comments
        "/<!.*>/Us", // Strip remaining !Tags (DOCTYPE and the like)
        "/<\?.*>/Us" // Strip ?Tags (processing instructions)
    );
    // One empty replacement string is applied to every pattern.
    return preg_replace($searches, '', $html);
}
?>
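One limit of a single preg_replace pass is worth knowing: the text is not rescanned after a removal, so a removal can leave a new match behind. The sketch below illustrates this with the script pattern only. stripUntilStable is a hypothetical helper, not part of the original code, and repeating the passes is still not enough to make the patterns safe for untrusted input.

```php
<?php
// Hypothetical helper, not part of the original code: repeat a set of
// removal patterns until the text stops changing (or a pass limit is hit).
function stripUntilStable($html, array $patterns, $maxPasses = 10) {
    for ($pass = 0; $pass < $maxPasses; $pass++) {
        $stripped = preg_replace($patterns, '', $html);
        if ($stripped === null || $stripped === $html) {
            break; // regex error, or nothing left to remove
        }
        $html = $stripped;
    }
    return $html;
}

$patterns = array('/<script[^>]*>.*?<\/script>/si');
$nested   = '<scr<script></script>ipt>alert(1)</script>';

// A single pass removes the inner block and leaves a new script block behind.
$once = preg_replace($patterns, '', $nested);

// Repeated passes remove that one as well.
$stable = stripUntilStable($nested, $patterns);
```

Here $once is `<script>alert(1)</script>` and $stable is the empty string.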

The HTML decoding helper functions.

The HTML translation table is an array of characters and their HTML-encoded equivalents. PHP's built-in table is missing many characters, and this new function adds them. I found it on PHP.net, and soon I'll compare it with my C++ one and add any that are still missing.

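<?php
//-- HTML translation table in PHP does not contain all the below.
//-- Found these on the php.net site.
function get_html_translation_table_CP1252() {
    //-- Flags and charset are passed explicitly: the defaults changed in later PHP
    //-- versions, and the chr() keys added below are single-byte characters.
    $trans = get_html_translation_table(HTML_ENTITIES, ENT_COMPAT, 'ISO-8859-1');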
    $trans[chr(130)] = '&sbquo;';    // Single Low-9 Quotation Mark
    $trans[chr(131)] = '&fnof;';    // Latin Small Letter F With Hook
    $trans[chr(132)] = '&bdquo;';    // Double Low-9 Quotation Mark
    $trans[chr(133)] = '&hellip;';    // Horizontal Ellipsis
    $trans[chr(134)] = '&dagger;';    // Dagger
    $trans[chr(135)] = '&Dagger;';    // Double Dagger
    $trans[chr(136)] = '&circ;';    // Modifier Letter Circumflex Accent
    $trans[chr(137)] = '&permil;';    // Per Mille Sign
    $trans[chr(138)] = '&Scaron;';    // Latin Capital Letter S With Caron
    $trans[chr(139)] = '&lsaquo;';    // Single Left-Pointing Angle Quotation Mark
    $trans[chr(140)] = '&OElig;';    // Latin Capital Ligature OE
    $trans[chr(145)] = '&lsquo;';    // Left Single Quotation Mark
    $trans[chr(146)] = '&rsquo;';    // Right Single Quotation Mark
    $trans[chr(147)] = '&ldquo;';    // Left Double Quotation Mark
    $trans[chr(148)] = '&rdquo;';    // Right Double Quotation Mark
    $trans[chr(149)] = '&bull;';    // Bullet
    $trans[chr(150)] = '&ndash;';    // En Dash
    $trans[chr(151)] = '&mdash;';    // Em Dash
    $trans[chr(152)] = '&tilde;';    // Small Tilde
    $trans[chr(153)] = '&trade;';    // Trade Mark Sign
    $trans[chr(154)] = '&scaron;';    // Latin Small Letter S With Caron
    $trans[chr(155)] = '&rsaquo;';    // Single Right-Pointing Angle Quotation Mark
    $trans[chr(156)] = '&oelig;';    // Latin Small Ligature OE
    $trans[chr(159)] = '&Yuml;';    // Latin Capital Letter Y With Diaeresis
    ksort($trans);
    return $trans;
}
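// -----------------------------------------------------------
//-- Function to convert &*; formats to characters.
function unhtmlentities($string){
    // Replace numeric entities. Callbacks are used because the /e modifier
    // was removed in PHP 7. chr() only covers single-byte values, so code
    // points above 255 are left as they are.
    $decode = function($code, $entity){
        return $code < 256 ? chr((int)$code) : $entity;
    };
    $string = preg_replace_callback('~&#x([0-9a-f]+);~i', function($m) use ($decode){
        return $decode(hexdec($m[1]), $m[0]);
    }, $string);
    $string = preg_replace_callback('~&#([0-9]+);~', function($m) use ($decode){
        return $decode((int)$m[1], $m[0]);
    }, $string);
    // replace literal entities
    $trans_tbl = get_html_translation_table_CP1252();
    $trans_tbl = array_flip($trans_tbl);
    return strtr($string, $trans_tbl);
}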
?>
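For comparison, PHP's built-in html_entity_decode handles both named and numeric entities when it is given an explicit character set. This is a minimal sketch of that alternative, assuming the text can be handled as UTF-8; the helper functions above work with single-byte characters instead, so the two approaches should not be mixed on the same string.

```php
<?php
// The flags and the charset are passed explicitly because their defaults
// differ between PHP versions.
$encoded = 'Tom &amp; Jerry &hellip; 5 &lt; 6 &#8211; &#x201C;ok&#x201D;';
$decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML401, 'UTF-8');

echo $decoded, "\n";
// prints: Tom & Jerry … 5 < 6 – “ok”
```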

The strip_tags replacement function.

This is the new stripTags function. I'm not sure how it performs on malformed HTML (never tried it :)) but it works very well on all the pages I used it on. It's customizable, so try it and see the outcome.

<?php
function stripTags(
    $html, // The Html To strip
    $stripAttributes =false, // Strip Tag Attributes
    $keepTags =false, // Keep Html Tags
    $stripHtmlChars =true, // Strip Html Chars &*;
    $stripSpaces =true, // Replace Consecutive Spaces With Space
    $stripImages =true, // Replace Images With Alt Values
    $stripAcronyms =true // Replace Acronyms With Title Value
){
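    //-- Separate Tags From Text
    $tags2Separate = "div|p|td|li|acronym|img";
    $tags2Separate = preg_split("/[^a-z0-9]+/i",$tags2Separate,-1,PREG_SPLIT_NO_EMPTY);
    //-- The tag names are grouped so the optional "/" applies to all of them;
    //-- \b keeps "p" from matching <pre> and [^>]* also allows tags without attributes.
    $html = preg_replace("/<(\/?(?:".implode("|",$tags2Separate)."))\b([^>]*)>/i"," <\\1\\2>",$html);
    //-- Remove Images and Replace Them With Their Alts
    if($stripImages){
        $html=preg_replace("/<img[^>]+?alt=\"([^\"]*)\"[^>]*>/si","\\1",$html);
    }
    //-- Remove Acronyms and Replace Them With Their Titles
    if($stripAcronyms){
        //-- [^>]* keeps the title lookup inside the opening acronym tag
        $html=preg_replace("/<acronym[^>]*title=\"([^\"]*)\"[^>]*>.*<\/acronym>/Usi","\\1",$html);
    }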
    $text = stripLargeTags($html); // Change Params to keep <code>
    if($stripAttributes){
        //-- We strip attributes here! To overcome XHTML format we take the long way.
        //-- We find all tags and replace them.
        //-- [^>]* stops at the end of the tag, so a tag like "<br >" cannot swallow the text after it.
        if(preg_match_all("@<[^\s>]+\s[^>]*>@s", $text, $tags)){
            $tags = array_unique($tags[0]);
            foreach($tags as $tag){
                if(!strchr($tag,' ')) continue;
                $tagml = $tag; $tag = trim($tag,"<>"); $tag = trim($tag);
                if(!preg_match("@^([^\s]+)\s.*$@is", $tag, $pcs)) continue;
                $tag_name = $pcs[1];
                if($tag_name[0]=='/'){
                    $tag_name = substr($tag_name, 1);
                    $text = str_replace($tagml, "</".$tag_name.">", $text);
                    continue;
                }
                if($tag_name[strlen($tag_name)-1]=='/'){
                    $tag_name = substr($tag_name, 0, strlen($tag_name)-1);
                    $text = str_replace($tagml, "<".$tag_name."/>", $text);
                    continue;
                }
                $text = str_replace($tagml, "<".$tag_name.">", $text);
            }
        }
    }
    //-- Strip Regular Tags
    if(is_bool($keepTags)){
        if(!$keepTags){
            //-- We strip all the tags here
            $text = preg_replace("@<[^>]+>@si", "", $text);
        }
        //-- Otherwise we keep all the tags
    }else{
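        if(is_string($keepTags)){
            $keepTags = preg_split("/[^a-z0-9]+/i",$keepTags,-1,PREG_SPLIT_NO_EMPTY);
        }
        if(is_array($keepTags)){
            if(preg_match_all("/<([^\/\s>]+)[^>]*>/si",$text,$tags)){
                //-- Tags Found in The Html, compared case-insensitively
                $tags = array_unique(array_map('strtolower',$tags[1]));
                $stripTags = array_diff($tags, array_map('strtolower',$keepTags));
                //-- Only run the replace when something is left to strip: an empty
                //-- list would build a pattern that matches every tag.
                if($stripTags){
                    $stripTags = array_map('preg_quote',$stripTags);
                    //-- The lookahead ends the tag name, so stripping "b" leaves <br> alone
                    $text=preg_replace("/<\/?(?:".implode("|",$stripTags).
                        ")(?=[\s\/>])[^>]*>/i","",$text
                    );
                }
            }
        }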
    }
    //-- Remove Here Some WellKnown Html Chars
    $text=str_replace("&amp;","&",$text);
    //-- Trash all special characters
    if($stripHtmlChars){
        $text = unhtmlentities($text);
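        // If anything is left over: match only real entity patterns, so that a plain
        // "&" followed later by a ";" in ordinary text is not swallowed.
        $text = preg_replace("/&#?[a-z0-9]+;/i"," ",$text);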
    }
    //-- Strip all Consecutive Spaces
    if($stripSpaces){
        $text=str_replace("&nbsp;"," ",$text);
        $text=preg_replace("/\s+/"," ",$text);
    }
    //-- Here you can continue to play with the output
    //-- And then return it
    return trim($text);
}
?>

Understanding the strip_tags replacement function.

First I will explain the parameters and how to combine them to get different results.

  1. $html
    This is the HTML code you want to play with. You can retrieve web pages with PHP and cURL using another PHP class I wrote and posted here.
  2. $stripAttributes
    This lets you keep or strip the attributes of tags. It only matters when tags are kept, as the next parameter and the example after this list show.
  3. $keepTags
    This parameter can be a bool or a string/array.
    • If it is a bool (true or false) it works like this: true keeps all tags untouched, false strips all HTML tags out.
    • If it is an array it is used as is; if it is a string it is split on separators into an array. All tag names found there are kept and the rest are stripped off.
  4. $stripHtmlChars
    If this is set to true, HTML special characters (everything between & and ;) are decoded to their characters, and any that are left over are replaced with spaces. These characters have little value in text.
  5. $stripSpaces
    If set to true, it replaces all consecutive spaces with one space. It will also include CRLF along the way: line breaks count as whitespace and are collapsed too.
  6. $stripImages
    Will replace IMG tags with their ALT attribute values.
  7. $stripAcronyms
    Will replace ACRONYM tags with their TITLE attribute values.
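The selective mode of $keepTags can be illustrated in isolation. The following is a minimal sketch, not the stripTags code itself: keepOnlyTags is a hypothetical helper that drops every tag whose name is not on the keep list and leaves the text alone.

```php
<?php
// Hypothetical helper for illustration only; it is not part of stripTags.
// Removes every tag whose name is not in $keep, leaving the text in place.
function keepOnlyTags($html, array $keep) {
    $keep = array_map('strtolower', $keep);
    return preg_replace_callback(
        '/<\/?([a-z][a-z0-9]*)\b[^>]*>/i',
        function ($m) use ($keep) {
            // $m[0] is the whole tag, $m[1] is its name
            return in_array(strtolower($m[1]), $keep, true) ? $m[0] : '';
        },
        $html
    );
}

echo keepOnlyTags('<div><b>bold</b> and <i>italic</i></div>', array('b')), "\n";
// prints: <b>bold</b> and italic
```

Deciding per tag in a callback avoids building one large alternation out of the tag names, so a short name such as b cannot accidentally match br or body.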

Finally, here is how two of the parameters combine. If you set both $keepTags and $stripAttributes to true, you are left with bare tags without attributes. Like this:

<a href="#">link</a>
     will become
<a>link</a>
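The same transformation can be sketched with a single regular expression. stripAttributesOnly is a hypothetical helper, shown here only to illustrate the idea; it assumes that attribute values never contain a ">" character.

```php
<?php
// Hypothetical helper for illustration only: reduce every opening tag to its
// bare name, keeping a trailing "/" on self-closing tags.
// Assumes attribute values do not contain a ">" character.
function stripAttributesOnly($html) {
    return preg_replace('/<([a-z][a-z0-9]*)\b[^>]*?(\/?)>/i', '<$1$2>', $html);
}

echo stripAttributesOnly('<a href="#">link</a>'), "\n";
// prints: <a>link</a>
```

Closing tags are not touched, because the pattern only matches a "<" that is followed directly by a letter.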

The stripTags function can be easily customized!

<?php
     $tags2Separate = "div|p|td|li|acronym|img|br";
?>

This list can be changed so that other tags break naturally and gain some space between them even if it is not there in the source.
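The effect of the list can be tried out on its own. The sketch below is a simplified stand-in, not the stripTags function: stripTagsSpaced is a hypothetical helper that puts a space in front of every listed tag, lets the built-in strip_tags remove all tags, and then collapses the whitespace.

```php
<?php
// Hypothetical helper for illustration only. It puts a space in front of
// every listed tag (opening or closing), strips all tags, then collapses
// the whitespace that is left.
function stripTagsSpaced($html, $tags2Separate = 'div|p|td|li|br') {
    // Split the list on anything that is not a letter or a digit.
    $names  = preg_split('/[^a-z0-9]+/i', $tags2Separate, -1, PREG_SPLIT_NO_EMPTY);
    $spaced = preg_replace('/<\/?(?:' . implode('|', $names) . ')\b[^>]*>/i', ' $0', $html);
    return trim(preg_replace('/\s+/', ' ', strip_tags($spaced)));
}

echo stripTagsSpaced('<p>Paragraph1 Content</p><p>Paragraph2 Content</p>'), "\n";
// prints: Paragraph1 Content Paragraph2 Content

echo stripTagsSpaced('one<br>two', 'p'), "\n";
// prints: onetwo (br is not in the list, so no space is added)
```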

Last but not least, any bugs you notice will be quickly addressed, and I'm looking forward to creative input to extend its functionality.

19 Comments Posted By Readers :

#1 cass from United States
Posted on Friday, 24 August, 2007
this looks like great code, well-written and easily customized! thank you!
#2 5ubliminal web
Posted on Friday, 24 August, 2007
Thanks. These comments keep me going! ;)

Keep checking back every once in a while. More great code to come.
#3 ladynada from United States web
Posted on Wednesday, 29 August, 2007
thank you!
nada
#4 dwikristianto from Indonesia web
Posted on Tuesday, 11 September, 2007
hi, nice functions there. btw, i want to enhance my template system (FastTemplate) with several features, like : trimming unneeded space in tags, trim space between tags, fix html tags to xhtml, etc etc. can u help me on this ?
thanks
#5 5ubliminal web
Posted on Tuesday, 11 September, 2007
First things first: Thanks and I'm glad you like them!

Unfortunately I have a rule. Stay away from template systems. I never use them, I code everything myself so I'm not into 3rd party solutions.

… Still I can give you some tips …

Check the output buffering tutorial. Add output buffering to your template system, and then, right before HTML is output, use regexp and play around with the tags. It's not difficult but your template system has to have a file included on top of all files it uses. So you can add output buffering in front of everything at once.

This is pretty much the only way to easily do what you want and not to have to learn how the Template System works internally.
#6 Dev from Netherlands
Posted on Friday, 21 September, 2007
This is exactly what I'm looking for. I'm stripping content from a wysiwyg editor to generate keywords for a search engine, yet whenever a or tag is stripped, it doesn't add a space creating weird new words ;)
#7 5ubliminal web
Posted on Friday, 21 September, 2007
I had the same problem and this is why I needed it: to keep words apart. It's scary first because you don't know where it comes from.:)

Glad it helped. And change it as you need.
Regards.
#8 charliefortune from Great Britain
Posted on Monday, 17 December, 2007
This is a really useful function, thanks. I want to add an array to the $stripAttributes part to keep certain attributes i.e. src for tags. I'll put it up if when I've done it.
#9 5ubliminal web
Posted on Monday, 17 December, 2007
That's a good idea.
I'll have it done in January if you don't do it by then as I'm taking some time off now.
#10 Santosh Patnaik from United States
Posted on Saturday, 26 January, 2008
Have a look at the htmLawed script -- a highly customizable PHP filter/purifier that can remove XSS code, specified HTML tags/attributes, etc.
#11 David Leigh from France web
Posted on Thursday, 31 January, 2008
I was searching for a strip_tags replacement because it totally wiped out my content. This is for use with phplist to create text versions of HTML emails. As a drop-in replacement yours leaves my content, but line breaks are gone too (which kind of whacks any basic formatting). In your $stripSpaces parameter, you state, "It will also include CRLF along the way." Does that mean it will "include" newlines in what it strips or that it should replace "BR"s with newlines? I'm assuming if I leave the br html tag, that it will show up in my plain text. What I really need I guess, is something that, after stripping the tags will replace the br's with newline characters (crlf). Can I do that WITH your function or do I simply leave in br html tags and then convert them afterwards myself?

thank you for this great function and your hard work!
#12 5ubliminal web
Posted on Thursday, 31 January, 2008
Well … true. I changed the code and should work now.
Just copy paste the code again and tell me if you have other problems.

Regards.
#13 David Leigh from France web
Posted on Monday, 11 February, 2008
Yeah, that's much better on the formatting. I'll play with it more when I have time to see if there are other tweaks to make.
Thanks!
#14 nogenius from United States
Posted on Tuesday, 26 February, 2008
Thanks 5ubliminal - I am currently working on making my own content source, and this code will be very helpful to use.

One issue though - I am getting a warning:
"Warning: reset() [function.reset]: Passed variable is not an array or object"
due to this line in the StripLargeTags function "} reset($search);", and I am not sure what is wrong. I invoked the function with "$new=stripTags($page,false,false,true,true,false,true);"

Thanks,
nogenius
#15 silent from Indonesia web
Posted on Wednesday, 27 February, 2008
Thanks for the code...
It help me on my research.
#16 5ubliminal web
Posted on Wednesday, 27 February, 2008
@nogenius: My bad :) It's reset($searches). I fixed it. Thanks.
I keep warnings and notices disabled so … I did not notice the error but it had no negative effect on code.

@Silent: U'r welcome.
#17 stefys from Romania
Posted on Thursday, 20 March, 2008
stripLargeTags() should not be used to sanitize user input, as it is not safe.
For example, one could use:
--------------
&lt;script&gt;
alert("JS injection");
&lt;/script&gt;
--------------
The regex replacements are done in the specified order, and removal is before removal.
Therefore, the pattern won't replace anything, but will be removed afterwards and you'll end up with some valid tag that won't get removed.

Possible solutions (only if you're planning on using stripLargeTags for untrusted input, otherwise there's no problem):
1. Use the pattern in increasing order of importance (resulting in being the last pattern).
2. Apply the patterns more times, until none of them match.
3. Do not make the match of tag ending mandatory (i.e. )
Note: This solutions are probably not enough, one could find other ways of making the pattern not match.

Another way one could avoid the removal is by using a tab character, so instead of he'll use ( =tab character).
This will prevent the pattern from replacing, but it will still be rendered by browsers (tested on IE6 and FF2)
#18 5ubliminal web
Posted on Thursday, 20 March, 2008
You have two posts from two different IPs. Cool!

I have used this function for quite a while for scraping legit pages that exist in search engine results. It has not failed up to now. It was not built to withstand malformed or evil HTML, just to be used on what search engines consider good.
I think you're right about the problem you mention here and I'll look into it within the next few days as time allows.

Regards and greetings.
PS: stripLargeTags is not meant to sanitize but to remove some tags which are useless to me.
#19 Terry Spratt from Canada
Posted on Saturday, 19 April, 2008
Excellent script. Thank you.
5ubliminal's TellinYa.com SEM & SEO Blog © 2007 - All rights reserved unless mentioned otherwise .
Rendered On : [Thursday, 21 August, 2008 - 20:31:23 GMT]   No Ajax / Flash Used Here