PHP tip: How to strip HTML tags, scripts, and styles from a web page

Technologies: PHP 4.3+, UTF-8

The HTML tags on a web page must be stripped away to get clean text for a PHP search engine, keyword extractor, or some other page analysis tool. PHP's standard strip_tags( ) function will do part of the job, but you need to strip out styles, scripts, embedded objects, and other unwanted page code first. This tip shows how.

This article is both an independent article and part of an article series on How to extract keywords from a web page. The rest of the series looks at how to get a web page from a web server, get the page's content type, convert to UTF-8, strip away punctuation, symbol characters, and numbers, and break a page down into a keyword list.

Code

PHP's handy strip_tags( ) function removes HTML tags that look like <word...>, <word.../>, or </word>. However, it doesn't understand the tags it's removing. It will blindly remove the opening and closing style tags in <style>code</style>, but leave the style code behind to confuse text parsing. This simplistic tag removal also joins the words on either side of each tag, creating hard-to-parse text.
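Both problems are easy to reproduce. The following is a minimal sketch; the sample HTML string is made up for illustration and is not taken from the downloadable code.

```php
<?php
// A made-up page fragment: a style block followed by two paragraphs.
$html = '<style>p { color: red; }</style><p>First</p><p>Second</p>';

// strip_tags() removes the tags themselves but keeps the CSS text,
// and it leaves nothing between the two paragraphs.
$stripped = strip_tags( $html );

echo $stripped, "\n";   // p { color: red; }FirstSecond
```

The style code survives as ordinary text, and "First" and "Second" run together into a single word.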

To fix these problems, you need to process certain tags before calling strip_tags(). This is easily done with a few regular expressions that:

  • Remove HTML tag pairs and enclosed content for styles, scripts, embedded objects, etc.
  • Add line breaks around block-level tags to prevent word joining problems after tag removal.

Once this is done, call strip_tags() to remove the remaining tags.

Below is sample code to do this. Its regular expressions are more verbose than strictly necessary, but this helps make the function clearer. More explanations follow in the sections after the code.

Downloads: strip_html_tags.zip.

/**
 * Remove HTML tags, including invisible text such as style and
 * script code, and embedded objects.  Add line breaks around
 * block-level tags to prevent word joining after tag removal.
 */
function strip_html_tags( $text )
{
    $text = preg_replace(
        array(
          // Remove invisible content
            '@<head[^>]*?>.*?</head>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?>.*?</script>@siu',
            '@<object[^>]*?>.*?</object>@siu',
            '@<embed[^>]*?>.*?</embed>@siu',
            '@<applet[^>]*?>.*?</applet>@siu',
            '@<noframes[^>]*?>.*?</noframes>@siu',
            '@<noscript[^>]*?>.*?</noscript>@siu',
            '@<noembed[^>]*?>.*?</noembed>@siu',
          // Add line breaks before and after blocks
            '@</?((address)|(blockquote)|(center)|(del))@iu',
            '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
            '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
            '@</?((table)|(th)|(td)|(caption))@iu',
            '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
            '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
            '@</?((frameset)|(frame)|(iframe))@iu',
        ),
        array(
            ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
            "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
            "\n\$0",
        ),
        $text );
    return strip_tags( $text );
}

Example

Read an HTML file, convert to UTF-8, strip out HTML tags and invisible content, and decode HTML entities into UTF-8:

/* Read an HTML file */
$raw_text = file_get_contents( $filename );

/* Get the file's character encoding from a <meta> tag */
preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s+charset=([^\s"]+))?@i',
    $raw_text, $matches );
$encoding = $matches[3];

/* Convert to UTF-8 before doing anything else */
$utf8_text = iconv( $encoding, "utf-8", $raw_text );

/* Strip HTML tags and invisible text */
$utf8_text = strip_html_tags( $utf8_text );

/* Decode HTML entities */
$utf8_text = html_entity_decode( $utf8_text, ENT_QUOTES, "UTF-8" );
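The example assumes the page has a <meta> tag that names a character set. When it doesn't, $matches[3] is never set. Here is one possible way to guard against that, using the same regular expression. The helper name detect_meta_charset() and the UTF-8 fallback are my own assumptions for this sketch, not part of the downloadable code.

```php
<?php
/**
 * Hypothetical helper: return the charset named in a
 * <meta http-equiv="Content-Type"> tag, or $default if none is found.
 */
function detect_meta_charset( $html, $default = 'UTF-8' )
{
    $pattern = '@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s+charset=([^\s"]+))?@i';

    // Group 3 is only present when the optional charset part matched.
    if ( preg_match( $pattern, $html, $matches ) && isset( $matches[3] ) )
        return $matches[3];
    return $default;
}

$with_meta    = '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">';
$without_meta = '<html><body>No meta tag here</body></html>';

$found    = detect_meta_charset( $with_meta );
$fallback = detect_meta_charset( $without_meta );

echo $found, "\n";      // ISO-8859-1
echo $fallback, "\n";   // UTF-8
```

Whether UTF-8 is a sensible fallback depends on where your pages come from. The HTTP Content-Type header, when available, is usually a better source than a guess.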

Explanation

For search engine indexing or keyword analysis, you need to read an HTML web page and strip away everything but the raw text. This gives you a clean word list that you can tokenize, stem, analyze, and index.

PHP's standard strip_tags() function is widely used for removing tags, mostly because it is built into PHP. However, as I said earlier, it has two important limitations that have to be addressed:

  1. It leaves behind text that should be invisible, such as style and script code.
  2. It joins together words to the left and right of a removed tag.

Removing invisible text

Style, script, and other invisible page code is enclosed by tag pairs like <script>...</script>. A regular expression can easily find these. But be sure to use pattern modifiers to do case-insensitive matches (an "i" pattern modifier), let "." match line breaks so that multi-line blocks are found (an "s" pattern modifier), and be UTF-8 safe (a "u" pattern modifier). This ensures that your code will work on upper, lower, and mixed-case tags, on code that spans several lines, and on international web pages.

@<script[^>]*?>.*?</script>@siu
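To see why the modifiers matter, compare the expression with and without "s" and "i". This is a small sketch with a made-up input string containing a mixed-case, multi-line script block.

```php
<?php
// Made-up input: mixed-case tags around script code that spans lines.
$html = "Before<SCRIPT type=\"text/javascript\">\nvar x = 1;\n</Script>After";

$pattern_siu = '@<script[^>]*?>.*?</script>@siu';
$pattern_u   = '@<script[^>]*?>.*?</script>@u';    // no "s", no "i"

$with    = preg_replace( $pattern_siu, ' ', $html );
$without = preg_replace( $pattern_u, ' ', $html );

echo $with, "\n";                                   // Before After
echo ( $without === $html ) ? "unchanged\n" : "changed\n";   // unchanged
```

Without the modifiers the expression never matches this input, so the script code is left in the text.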

The sample code at the top of this article uses a similar expression for several tag pairs that surround code and invisible text:

  • <head>...</head> encloses header information, such as the title and page styles and scripts.
  • <style>...</style> encloses page styles.
  • <script>...</script> encloses page scripts.
  • <object>...</object> encloses an object and its parameters.
  • <embed>...</embed> (deprecated) encloses a plug-in and its parameters.
  • <applet>...</applet> (deprecated) encloses a Java applet and its parameters.
  • <noframes>...</noframes> encloses text to show if frames are not supported.
  • <noscript>...</noscript> encloses text to show if scripts are not supported.
  • <noembed>...</noembed> (deprecated) encloses text to show if a plug-in cannot be embedded.

There are a few more tag pairs whose enclosed content you may want to remove:

  • <area> (normally an empty tag with no closing pair) defines an area in an image map.
  • <map>...</map> encloses the areas of an image map.
  • <marquee>...</marquee> (deprecated) encloses text for a scrolling marquee.
  • <menu>...</menu> (deprecated) can enclose options for a menu.
  • <select>...</select> can enclose options for a drop-down menu.
  • <textarea>...</textarea> encloses text to be shown within a multi-line text area.
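If you do want to drop some of these, the same kind of expression works. The sketch below is one way to do it as a separate pass. The function name strip_optional_elements() and the choice of elements are assumptions for illustration only.

```php
<?php
/**
 * Hypothetical extra pass: remove a few optional tag pairs and
 * everything they enclose, replacing each with a single space.
 */
function strip_optional_elements( $text )
{
    return preg_replace(
        array(
            '@<map[^>]*?>.*?</map>@siu',
            '@<marquee[^>]*?>.*?</marquee>@siu',
            '@<select[^>]*?>.*?</select>@siu',
            '@<textarea[^>]*?>.*?</textarea>@siu',
        ),
        ' ',
        $text );
}

$html   = 'Pick one: <select><option>Red</option><option>Blue</option></select> then continue.';
$result = strip_optional_elements( $html );

echo $result, "\n";   // the option text "Red" and "Blue" is gone
```

Whether form options and text area defaults count as page content depends on your application, so treat this list as a starting point.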

Preventing incorrect word joins

When PHP's strip_tags() removes HTML tags, the text on either side of the tag is joined. For example, this text:

<p>First <strong>J</strong>oined paragraph</p><p>Second <strong>J</strong>oined paragraph</p>

becomes this:

First Joined paragraphSecond Joined paragraph

For inline tags like <strong> or <a>, word joining is the right behavior. The word "Joined" above is correctly created by joining the bold "J" with the normal "oined".

But for block tags like <p> or <div>, word joining is not correct. These tags have an implicit line break before and after the block. In the example above, the words "paragraph" and "Second" should not have been joined.

To correct this problem, the sample code in this article recognizes block tags and inserts line breaks before and after each block.
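Here is the idea reduced to a minimal sketch that handles only <p> and <div>. The shortened pattern, including the \b word boundary that stops it from matching longer tag names such as <param>, is my simplification for illustration and differs from the patterns in the full function.

```php
<?php
$html = '<p>First <strong>J</strong>oined paragraph</p>'
      . '<p>Second <strong>J</strong>oined paragraph</p>';

// Insert a line break before every opening or closing <p> or <div>.
// $0 in the replacement is the matched text, which is put back as-is.
$with_breaks = preg_replace( '@</?(p|div)\b@iu', "\n\$0", $html );

// Now the inline <strong> tags still join, but the blocks do not.
$text = trim( strip_tags( $with_breaks ) );

echo $text, "\n";
```

The result keeps "Joined" intact while "paragraph" and "Second" end up on separate lines.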

Removing HTML tags

Finally, use PHP's strip_tags() to remove the remaining HTML tags and comments.
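A quick check of this last step, using a made-up string with an inline tag and a comment:

```php
<?php
$html = 'Visible<!-- hidden comment --> text <em>here</em>';

// strip_tags() drops both the <em> tags and the HTML comment.
$text = strip_tags( $html );

echo $text, "\n";   // Visible text here
```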

Other issues

For this to work reliably:

  • Use the page's content type to get its character set, then convert to UTF-8 using the iconv() function. This ensures that the text is in a known character encoding. The preg_replace() function handles multi-byte characters in UTF-8 if the /u pattern modifier is included.
  • Strip HTML tags before decoding HTML entities. Decoding some entities, such as &lt; and &gt;, generates characters that will confuse tag parsing.
  • HTML web pages must be syntactically correct (or reasonably close). For instance, if a <script> tag has no closing </script>, the expression will not match and the script code is left in the text. If a later </script> appears elsewhere on the page, everything between the two is removed.
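The ordering rule for entities is easy to demonstrate. This sketch uses a made-up paragraph that mentions a tag by name using entities:

```php
<?php
$html = '<p>Use &lt;b&gt; for bold text</p>';

// Recommended order: strip tags first, then decode entities.
$right = html_entity_decode( strip_tags( $html ), ENT_QUOTES, 'UTF-8' );

// Reversed order: decoding first turns &lt;b&gt; into something that
// looks like a real tag, and strip_tags() then removes it.
$wrong = strip_tags( html_entity_decode( $html, ENT_QUOTES, 'UTF-8' ) );

echo $right, "\n";   // Use <b> for bold text
echo $wrong, "\n";   // Use  for bold text
```

In the reversed order, part of the visible text is lost.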

Thank you

Fantastic - just what I was looking for.

Thanks for the code :)

Arkad

just great

Thanks for the code !
I was also searching for it.

thank you very much ^^

From ukraine

Cool!!!! You save my time!
Thank you!!

THANK YOU SO MUCH.

That just saved me from a serious headache that was starting to develop.

Worked perfectly.

It works! but there seems to be one issue

The function works great in most cases... I just found a bug that I'm pretty sure can be fixed with one character (or two).

If you have something like < style type="text/css"> .flickr-photo ... It doesn't work.

Does anyone know enough about regular expressions to fix this?

Issue with tag stripping

Well, in your example, there is a space between the < and style. This is not legal HTML. The element name must follow immediately after the left angle bracket. Since this is not legal HTML, browsers and this PHP function will leave the text alone. This is the correct behavior. An HTML syntax remover should only remove text that definitely is HTML syntax, and not text that's merely close.

If you really need to recognize this incorrect HTML, you can modify the regular expressions to include \s* (zero or more white space characters) before and after each element name. So, the expression to remove the <head> element would read '@<\s*head[^>]*?>.*?<\s*/head\s*>@siu'.
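As a rough sketch of that change applied to the <style> expression (the input string is made up, loosely based on the example in the question):

```php
<?php
// Made-up input with an illegal space after the left angle bracket.
$html = 'Intro < style type="text/css"> .flickr-photo { border: 0; } </style> Body text';

$strict  = '@<style[^>]*?>.*?</style>@siu';
$lenient = '@<\s*style[^>]*?>.*?<\s*/style\s*>@siu';

$strict_result  = preg_replace( $strict, ' ', $html );
$lenient_result = preg_replace( $lenient, ' ', $html );

// The strict expression leaves the text alone; the lenient one
// removes the style block.
echo ( $strict_result === $html ) ? "strict: unchanged\n" : "strict: changed\n";
echo $lenient_result, "\n";
```

The same \s* additions would have to be made to every expression in the function, and the lenient form may occasionally remove ordinary text that happens to look like a tag.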

thanks man

Thanks You save my time and effort.
Regards,
Imran khan

Great idea!

It works! thank you for your quality scripting! :)

Neat

Neat Code!
Impeccable contribution!

Fixed Regular Expressions

When I use your code as-is, the line breaks cause problems. Basically, it looks like your regular expressions are adding breaks after "". So, I added ".*?>" to the end of the regular expression just before the at sign. That seems to work.

Finally, I also removed the "u" modifier from all of mine. This seems to work better for me and I assume it will work better in cases where there is no meta tag indicating the encoding. I may have trouble with this if I run across other encodings, but I'm lucky enough to be dealing with files from a CMS system that I believe all have the same character encoding.

Thanks,
Joel

Fixed Regular Expressions

First, I'm not sure what problem you are encountering with the line breaks (but I'd like to know). Remember that adding line breaks is intentional so that when the HTML tags are removed, words on either side of block-level tags won't be incorrectly joined. The current regular expressions look for the start of an end-of-block tag, such as </p>, and add a line break before the tag. Your change to the regular expression adds a minor restriction on the tag's format (it forces a closing > to be found for each tag), but it won't change the way line breaks are added before valid tags. So, I'm not sure what your fix is intended to fix.

Second, after you remove the "u" modifier, you must be very sure that all of the text you parse uses only 1 byte per character. This is the case for Latin-1 (English and many European languages) and a couple of other encodings, but not for UTF-8. And UTF-8 is increasingly the default in many CMSes. If you try to parse UTF-8 text without the "u", you may get occasional strange matches and sometimes a garbled mess. Note, though, that only the ASCII range of Latin-1 is byte-compatible with UTF-8. With the "u" modifier, text containing accented Latin-1 characters must first be converted to UTF-8, otherwise PHP rejects it as invalid UTF-8 and the replacement fails. Once the text is converted, I'm afraid I don't see what you gain by removing the "u" (there is a minor performance improvement, though).

Depending upon your task, you may be introducing a problem further down the road by removing the "u" and skipping UTF-8 handling. If you don't use UTF-8, and you try to expand HTML entities into characters, some of those entities won't expand. There simply are no character equivalents for some of them when you use 1-byte characters. Since those HTML entities won't expand, you'll still have that syntax in your text and that could mess up further text parsing.

The point of this UTF-8 handling is to convert text from limited encodings into a very flexible one. Then parse in that one without having further worries about encoding quirks. If you're absolutely sure you won't need anything but Latin-1, then you're fine. But the cost of being wrong is strange parsing problems when the CMS configuration changes or you try to apply the code to some other source text. And the cost of doing it right in the first place is a few extra lines of easy code. Essentially, all you do is look for the character encoding in the HTTP header, pass that to PHP's iconv( ) function to convert to UTF-8, and leave the "u" on the above regular expressions. That's about it.
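Those few lines might look roughly like the sketch below. The helper name charset_from_content_type() and the sample header and body are assumptions for illustration, and how you obtain the header value depends on how you fetch the page. The sketch also shows what happens when a "u" expression is run on text that has not been converted.

```php
<?php
/**
 * Hypothetical helper: pull the charset out of a Content-Type header
 * value such as "text/html; charset=ISO-8859-1".
 */
function charset_from_content_type( $header, $default = 'UTF-8' )
{
    if ( preg_match( '@charset=([^\s;"]+)@i', $header, $matches ) )
        return $matches[1];
    return $default;
}

$header = 'text/html; charset=ISO-8859-1';
$body   = "<p>caf\xE9</p>";          // "café" in ISO-8859-1

// Without conversion, the lone 0xE9 byte is not valid UTF-8, so a
// "u" expression fails and preg_replace() returns NULL.
$unconverted = preg_replace( '@<p[^>]*?>@siu', "\n", $body );

// Convert first, then parse.
$charset = charset_from_content_type( $header );
$utf8    = iconv( $charset, 'UTF-8', $body );
$clean   = trim( strip_tags( $utf8 ) );

var_dump( $unconverted );   // NULL
echo $clean, "\n";          // café (as UTF-8)
```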

You can read more about the full context of page parsing, including character encoding, by skimming my article on How to extract keywords from a web page.

Nadeau software consulting