PHP tip: How to strip HTML tags, scripts, and styles from a web page

Technologies: PHP 4.3+, UTF-8

The HTML tags on a web page must be stripped away to get clean text for a PHP search engine, keyword extractor, or some other page analysis tool. PHP's standard strip_tags( ) function will do part of the job, but you need to strip out styles, scripts, embedded objects, and other unwanted page code first. This tip shows how.

This article is both an independent article and part of an article series on How to extract keywords from a web page. The rest of the series looks at how to get a web page from a web server, get the page's content type, convert to UTF-8, strip away punctuation, symbol characters, and numbers, and break a page down into a keyword list.

Code

PHP's handy strip_tags( ) function removes HTML tags that look like <word...>, <word.../>, or </word>. However, it doesn't understand the tags it's removing. It will blindly remove the opening and closing style tags in <style>code</style>, but leave the style code behind to confuse text parsing. This simplistic tag removal also joins the words on either side of each tag, creating hard-to-parse text.

To fix these problems, you need to process certain tags before calling strip_tags(). This is easily done with a few regular expressions that:

  • Remove HTML tag pairs and enclosed content for styles, scripts, embedded objects, etc.
  • Add line breaks around block-level tags to prevent word joining problems after tag removal.

Once this is done, call strip_tags() to remove the remaining tags.

Below is sample code to do this. Its regular expressions are more verbose than strictly necessary, but this helps make the function clearer. More explanations follow in the sections after the code.

Downloads: strip_html_tags.zip.

/**
 * Remove HTML tags, including invisible text such as style and
 * script code, and embedded objects.  Add line breaks around
 * block-level tags to prevent word joining after tag removal.
 */
function strip_html_tags( $text )
{
    $text = preg_replace(
        array(
          // Remove invisible content
            '@<head[^>]*?>.*?</head>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?>.*?</script>@siu',
            '@<object[^>]*?>.*?</object>@siu',
            '@<embed[^>]*?>.*?</embed>@siu',
            '@<applet[^>]*?>.*?</applet>@siu',
            '@<noframes[^>]*?>.*?</noframes>@siu',
            '@<noscript[^>]*?>.*?</noscript>@siu',
            '@<noembed[^>]*?>.*?</noembed>@siu',
          // Add line breaks before and after blocks
            '@</?((address)|(blockquote)|(center)|(del))@iu',
            '@</?((div)|(h[1-6])|(ins)|(isindex)|(p)|(pre))@iu',
            '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
            '@</?((table)|(th)|(td)|(caption))@iu',
            '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
            '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
            '@</?((frameset)|(frame)|(iframe))@iu',
        ),
        array(
            ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
            "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
            "\n\$0",
        ),
        $text );
    return strip_tags( $text );
}

Example

Read an HTML file, convert to UTF-8, strip out HTML tags and invisible content, and decode HTML entities into UTF-8:

/* Read an HTML file */
$raw_text = file_get_contents( $filename );

/* Get the file's character encoding from a <meta> tag */
preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s+charset=([^\s"]+))?@i',
    $raw_text, $matches );
$encoding = $matches[3];

/* Convert to UTF-8 before doing anything else */
$utf8_text = iconv( $encoding, "utf-8", $raw_text );

/* Strip HTML tags and invisible text */
$utf8_text = strip_html_tags( $utf8_text );

/* Decode HTML entities */
$utf8_text = html_entity_decode( $utf8_text, ENT_QUOTES, "UTF-8" );
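
The example above assumes the page has a Content-Type <meta> tag that names a charset. When it doesn't, $matches[3] is not set and iconv() has no source encoding to work from. Here is a minimal sketch of a more forgiving lookup. The function name detect_meta_charset() and the ISO-8859-1 fallback are assumptions made for illustration, not part of the original code:

```php
<?php
/**
 * Hypothetical helper: find a charset in a <meta> tag, or fall back
 * to a default.  The ISO-8859-1 default is an assumption.
 */
function detect_meta_charset( $html, $default = 'ISO-8859-1' )
{
    // Matches the http-equiv form and the shorter <meta charset="...">
    if ( preg_match( '@<meta[^>]+charset\s*=\s*["\']?([\w-]+)@i', $html, $m ) )
        return $m[1];
    return $default;
}

$with_meta    = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
$without_meta = '<html><body>No meta tag here</body></html>';

$found    = detect_meta_charset( $with_meta );      // "utf-8"
$fallback = detect_meta_charset( $without_meta );   // "ISO-8859-1"
```

The returned name can then be passed to iconv() as the source encoding.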

Explanation

For search engine indexing or keyword analysis, you need to read an HTML web page and strip away everything but the raw text. This gives you a clean word list that you can tokenize, stem, analyze, and index.

PHP's standard strip_tags() function is widely used for removing tags, mostly because it is built into PHP. However, as I said earlier, it has two important limitations that have to be addressed:

  1. It leaves behind text that should be invisible, such as style and script code.
  2. It joins together words to the left and right of a removed tag.
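
Both limitations are easy to reproduce. The sketch below runs strip_tags() on a small made-up HTML fragment (the markup is an assumption chosen for illustration):

```php
<?php
$html = '<style>p { color: red; }</style><p>First</p><p>Second</p>';

// strip_tags() removes the tags but knows nothing about what they mean
$text = strip_tags( $html );

// $text is now "p { color: red; }FirstSecond":
// the style code survives and the two paragraphs run together
```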

Removing invisible text

Style, script, and other invisible page code is enclosed by tag pairs like <script>...</script>. A regular expression can easily find these. But be sure to use pattern modifiers to let "." match line breaks (an "s" pattern modifier), do case-insensitive matches (an "i" pattern modifier), and be UTF-8 safe (a "u" pattern modifier). This ensures that your code will work on multi-line scripts and styles, on upper-, lower-, and mixed-case tags, and on international web pages.

@<script[^>]*?>.*?</script>@siu
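
To see what each modifier contributes, here is a small sketch that applies the expression to a made-up fragment containing a mixed-case, multi-line script element:

```php
<?php
$html = "Before<SCRIPT type=\"text/javascript\">\nvar x = 1;\n</Script>After";

// "s" lets "." cross the line breaks inside the script,
// "i" matches <SCRIPT> and </Script>, "u" treats the text as UTF-8
$clean = preg_replace( '@<script[^>]*?>.*?</script>@siu', ' ', $html );

// $clean is now "Before After"
```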

The sample code at the top of this article uses a similar expression for several tag pairs that surround code and invisible text:

  • <head>...</head> encloses header information, such as the title and page styles and scripts.
  • <style>...</style> encloses page styles.
  • <script>...</script> encloses page scripts.
  • <object>...</object> encloses an object and its parameters.
  • <embed>...</embed> (deprecated) encloses a plug-in and its parameters.
  • <applet>...</applet> (deprecated) encloses a Java applet and its parameters.
  • <noframes>...</noframes> encloses text to show if frames are not supported.
  • <noscript>...</noscript> encloses text to show if scripts are not supported.
  • <noembed>...</noembed> (deprecated) encloses text to show if a plug-in cannot be embedded.

There are a few more tag pairs whose enclosed content you may want to remove:

  • <area>...</area> encloses an area in an image map.
  • <map>...</map> encloses the areas of an image map.
  • <marquee>...</marquee> (deprecated) encloses text for a scrolling marquee.
  • <menu>...</menu> (deprecated) can enclose options for a menu.
  • <select>...</select> can enclose options for a drop-down menu.
  • <textarea>...</textarea> encloses text to be shown within a multi-line text area.
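
If you decide to remove some of these as well, the same kind of expression can be generated from a list of element names. This is only a sketch: the helper name strip_extra_elements() and the particular names in its list are assumptions, and it is meant to run before strip_tags():

```php
<?php
/**
 * Hypothetical helper: remove additional tag pairs and their content.
 * The list of names is only an example.
 */
function strip_extra_elements( $text )
{
    $names    = array( 'map', 'marquee', 'select', 'textarea' );
    $patterns = array();
    foreach ( $names as $name )
    {
        // \b keeps "map" from matching a longer name that starts with it
        $patterns[] = '@<' . $name . '\b[^>]*>.*?</' . $name . '>@siu';
    }
    return preg_replace( $patterns, ' ', $text );
}

$html  = 'Pick: <select name="c"><option>Red</option></select> done';
$clean = strip_extra_elements( $html );

// The whole <select> element has been replaced by a single space
```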

Preventing incorrect word joins

When PHP's strip_tags() removes HTML tags, the text on either side of the tag is joined. For example, this text:

<p>First <strong>J</strong>oined paragraph</p><p>Second <strong>J</strong>oined paragraph</p>

becomes this:

First Joined paragraphSecond Joined paragraph

For inline tags like <strong> or <a>, word joining is the right behavior. The word "Joined" above is correctly created by joining the bold "J" with the normal "oined".

But for block tags like <p> or <div>, word joining is not correct. These tags have an implicit line break before and after the block. In the example above, the words "paragraph" and "Second" should not have been joined.

To correct this problem, the sample code in this article recognizes block tags and inserts line breaks before and after each block.
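
Here is a cut-down sketch of that step applied to the example above. It handles only <p> and <div>, and it adds a \b word boundary that the full sample code does not use, so treat it as an illustration, not a replacement:

```php
<?php
$html = '<p>First <strong>J</strong>oined paragraph</p>'
      . '<p>Second <strong>J</strong>oined paragraph</p>';

// Put a line break in front of every opening or closing <p> and <div>.
// $0 is the matched text, so the tag itself is kept for strip_tags().
$with_breaks = preg_replace( '@</?(p|div)\b@iu', "\n\$0", $html );

$text  = strip_tags( $with_breaks );
$lines = preg_split( '@\n+@', trim( $text ) );

// $lines is array( "First Joined paragraph", "Second Joined paragraph" )
```

The inline <strong> tags are left alone, so "Joined" is still assembled correctly.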

Removing HTML tags

Finally, use PHP's strip_tags() to remove the remaining HTML tags and comments.

Other issues

For this to work reliably:

  • Use the page's content type to get its character set, then convert to UTF-8 using the iconv() function. This ensures that the text is in a known character encoding. The preg_replace() function handles multi-byte characters in UTF-8 if the /u pattern modifier is included.
  • Strip HTML tags before decoding HTML entities. Decoding some entities, such as &lt; and &gt;, generates characters that will confuse tag parsing.
  • HTML web pages must be syntactically correct (or reasonably close). For instance, if there is an open <script> tag and no closing tag, everything from the tag to the end of the file will be removed.
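
The ordering point is the easiest one to get wrong. A short sketch, using a made-up sentence that mentions a tag by name, shows what happens each way:

```php
<?php
$html = '<p>Use the &lt;b&gt; tag for bold text.</p>';

// Right order: strip tags first, then decode entities
$right = html_entity_decode( strip_tags( $html ), ENT_QUOTES, 'UTF-8' );

// Wrong order: the decoded <b> looks like a real tag and is stripped
$wrong = strip_tags( html_entity_decode( $html, ENT_QUOTES, 'UTF-8' ) );

// $right is "Use the <b> tag for bold text."
// $wrong is "Use the  tag for bold text."
```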

Comments

Thank you

Fantastic - just what I was looking for.

Thanks for the code :)

Arkad

just great

Thanks for the code !
I was also searching for it.

thank you very much ^^

thank you very much ^^

From ukraine

Cool!!!! You save my time!
Thank you!!

THANK YOU SO MUCH.

That just saved me from a serious headache that was starting to develop.

Worked perfectly.

It works! but there seems to be one issue

The function works great in most cases... I just found a bug that I'm pretty sure can be fixed with one character (or two).

If you have something like < style type="text/css"> .flickr-photo ... It doesn't work.

Does anyone know enough about regular expressions to fix this?

Issue with tag stripping

Well, in your example, there is a space between the < and style. This is not legal HTML. The element name must follow immediately after the left angle bracket. Since this is not legal HTML, browsers and this PHP function will leave the text alone. This is the correct behavior. An HTML syntax remover should only remove text that definitely is HTML syntax, and not text that's merely close.

If you really need to recognize this incorrect HTML, you can modify the regular expressions to include \s* (zero or more white space characters) before and after each element name. So, the expression to remove the <head> element would read '@<\s*head[^>]*?>.*?<\s*/head\s*>@siu'.
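
As a sketch of the difference, here are the strict and lenient forms of the <style> expression applied to a fragment like the one in the question (the CSS rule inside it is made up):

```php
<?php
$html = 'A< style type="text/css"> .flickr-photo { border: 0; } </style>B';

$strict  = '@<style[^>]*?>.*?</style>@siu';
$lenient = '@<\s*style[^>]*?>.*?<\s*/style\s*>@siu';

// The strict form never sees "<style", so the text is left alone
$strict_result  = preg_replace( $strict,  ' ', $html );

// The lenient form accepts the stray space and removes the block
$lenient_result = preg_replace( $lenient, ' ', $html );

// $lenient_result is "A B"
```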

thanks man

Thanks You save my time and effort.
Regards,
Imran khan

Great idea!

It works! thank you for your quality scripting! :)

Neat

Neat Code!
Impeccable contribution!

Fixed Regular Expressions

When I use your code as-is, the line breaks cause problems. Basically, it looks like your regular expressions are adding breaks after "". So, I added ".*?>" to the end of the regular expression just before the at sign. That seems to work.

Finally, I also removed the "u" modifier from all of mine. This seems to work better for me and I assume it will work better in cases where there is no meta tag indicating the encoding. I may have trouble with this if I run across other encodings, but I'm lucky enough to be dealing with files from a CMS system that I believe all have the same character encoding.

Thanks,
Joel

Fixed Regular Expressions

First, I'm not sure what problem you are encountering with the line breaks (but I'd like to know). Remember that adding line breaks is intentional so that when the HTML tags are removed, words on either side of block-level tags won't be incorrectly joined. The current regular expressions look for the start of an end-of-block tag, such as </p>, and add a line break before the tag. Your change to the regular expression adds a minor restriction on the tag's format (it forces a closing > to be found for each tag), but it won't change the way line breaks are added before valid tags. So, I'm not sure what your fix is intended to fix.

Second, after you remove the "u" modifier, you must be very sure that all of the text you parse uses only 1 byte per character. This is the case for Latin-1 (English and many European languages) and a couple of other encodings, but not for UTF-8. And UTF-8 is increasingly the default in many CMSes. If you try to parse UTF-8 text without the "u", you can get occasional strange matches and sometimes a garbled mess. But also note that plain ASCII is a subset of UTF-8, so it never hurts to include the "u" when the text is ASCII. Latin-1 text that contains accented characters is not valid UTF-8, though, and PCRE rejects subjects that are not valid UTF-8 when "u" is set, so convert such text with iconv( ) first. So, I'm afraid I don't see what you gain by removing the "u" once the text has been converted (there is a minor performance improvement, though).

Depending upon your task, you may be introducing a problem further down the road by removing the "u" and skipping UTF-8 handling. If you don't use UTF-8, and you try to expand HTML entities into characters, some of those entities won't expand. There simply are no character equivalents for some of them when you use 1-byte characters. Since those HTML entities won't expand, you'll still have that syntax in your text and that could mess up further text parsing.

The point of this UTF-8 handling is to convert text from limited encodings into a very flexible one. Then parse in that one without having further worries about encoding quirks. If you're absolutely sure you won't need anything but Latin-1, then you're fine. But the cost of being wrong is strange parsing problems when the CMS configuration changes or you try to apply the code to some other source text. And the cost of doing it right in the first place is a few extra lines of easy code. Essentially, all you do is look for the character encoding in the HTTP header, pass that to PHP's iconv( ) function to convert to UTF-8, and leave the "u" on the above regular expressions. That's about it.
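
One thing worth checking when you keep the "u": the PCRE functions expect the subject to be valid UTF-8 and report an error otherwise, so the conversion has to happen first. A small sketch, using a made-up Latin-1 string and a shortened block-tag pattern:

```php
<?php
$latin1  = "caf\xE9<p>menu</p>";     // "cafe" with an accented e, in Latin-1
$pattern = '@</?p\b@iu';

// With "u", a subject that is not valid UTF-8 makes preg_replace()
// fail and return NULL instead of the text
$before = preg_replace( $pattern, "\n\$0", $latin1 );

// Convert first, then the same pattern works
$utf8  = iconv( 'ISO-8859-1', 'UTF-8', $latin1 );
$after = preg_replace( $pattern, "\n\$0", $utf8 );
```

Checking the return value for NULL is a cheap way to catch pages whose declared encoding was wrong.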

You can read more about the full context of page parsing, including character encoding, by skimming my article on How to extract keywords from a web page.

Thanks

Hi,

Just a quick thank you for taking the time and effort to publish this for others. The code is great, and has saved me (and many others) so much time.

Best regards, Ben.

Excellent

You good sir are a legend. This was exactly what I was looking for and saved me a great deal of time!

Thanks you :)

Ben

Well Done

Thanx Buddy
This page is very helpful for me and solved my problem in seconds
Thanx

I have used this tutorial

I have used this tutorial it works perfect.
Thanks for sharing this very good and helpful topic.
Regards
cobro

Great!!!

Hello,

I was looking a code to strip html tags of a document.
Thank you.

Helped A lot

Awesome function, shortest and worthiest!!

Thanks for the code, it helped me so much, prevented excessive hair loss

Bug in the code

Well, this code does not delete amp and nbsp sort of things.

Should have to generalize to work with these things too.

Re: Bug in the code

Well, no, it isn't a bug in the code. As the article title says, this code strips out HTML tags, scripts, and styles. It does not remove HTML entities. However, the example usage code shown earlier shows you how to do it using a built-in PHP function:

$utf8_text = html_entity_decode( $utf8_text, ENT_QUOTES, "UTF-8" );
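
For instance, here is a sketch with a made-up string. Note that the result of decoding &nbsp; is a no-break space, which you may also want to turn into an ordinary space before splitting words:

```php
<?php
$text = 'Fish &amp; chips&nbsp;&euro;5';

$decoded = html_entity_decode( $text, ENT_QUOTES, 'UTF-8' );

// &nbsp; becomes a UTF-8 no-break space (bytes C2 A0), not a plain
// space, so swap it if later word splitting relies on whitespace
$plain = str_replace( "\xC2\xA0", ' ', $decoded );

// $plain is now "Fish & chips " followed by the euro sign and "5"
```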

This is awesome.

This is awesome. I can't thank you enough. One issue though. My version of PCRE didn't support Unicode. Had to change the settings to get it to do that. Once I did that (had to google a bunch to find out how) I rebooted my server and it worked.

Thanks a lot

Great stuff! thanks so much.
This is just what I needed.

Superb!

Thanks very much, this is exactly what I was looking for and it works perfectly. Cheers!

php & forms?

can the strip HTML tags function be used to filter out html tags that are maliciously placed into a form?

Re: php & forms?

I expect you could use this to strip out HTML tags in form input. However, many web sites are willing to accept a few specific tags in certain types of forms, such as those for posting comments. In that case you would need different code to strip away only the tags you want to remove while leaving the rest.
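
One built-in option for that case is the second argument of strip_tags(), which lists tags to keep. A minimal sketch with a made-up comment string:

```php
<?php
$comment = '<p>Nice <em>post</em>! <script>alert(1)</script></p>';

// Keep <em> and <strong>, strip every other tag
$filtered = strip_tags( $comment, '<em><strong>' );

// $filtered is "Nice <em>post</em>! alert(1)"
// Attributes on the allowed tags are kept as-is, and the script text
// survives as plain text, so this alone is not a security filter
```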

Great Job Guys

Thanks a lot :D

Very much appreciated

Well done! Quite concise as well.

very nice

Function is what I need.
Thanks a lot

just what i needed saved my

just what i needed

saved my time and efforts

regards to you dude you rock!!

error

In the php code:

preg_match(... $raw_Text, $matches )

note that in the rest of the code the variable "$raw_text" uses a small "t" in "text" while the preg_match() function uses "raw_Text" with a capital "T" which of course causes an error in PHP.

Great stuff

Cheers mate, this has just solved my problems in an instance, great stuff thanks!

The solution for my...

Very cool. Perfect for my onlineshop. I search long time for this solution. Thanks very much.

Bless you

Perfect, Thank you so much.

You rule!

Excellent work

Saved a lot of time. Actually, my software allows people to post comments in rich text using tinymce - but some clients using the software have javascript disabled (a prerequisite for rich text editing). A lot of hassle for a minority - but for those few cases, my software had to identify that they have no javascript, and display the stripped version of data in a textarea instead of the version with tags. (further hassle with the fact that they must load another page from the server rather than using javascript functions to popup an iframe holding the input form!)

I used this code for the new editor page so that they can still edit existing data (in a textarea) under the knowledge that all formatting will be lost. The code you supplied worked perfectly to remove the formatting for them to load and edit.

Thanks again!

thanks

Thanks!
It Worked :)

script

Well done! Quite concise as well.

Awesome script

I am hoping to introduce a simple messaging service between members on my site, and this code is perfect for stripping out anything I don't want people to do!

Thanks you so much :)

SF

Great Function !!! This is

Great Function !!!

This is what i need ...

nice code

Thanks for you PHP code. That's what I am looking for.

asdasdasd

excellent

Thank you

Thank you for the code.
It is very good and very helpful.

Thanks

Just thanks, I can now keep some of my hair !

Thanks + bonus

As I was saying, this helped me a lot.

I also had to add this to correct accents and other such things:

$text = html_entity_decode($text, ENT_COMPAT, 'UTF-8');

Thank you!

I just want to say thank you!

You did a hell of a good job. This save me a ton of time and also my sanity.

Regards

Good Tips

Very well said dude, keep updating for us..


Nadeau software consulting