PHP tip: How to extract keywords from a web page

Technologies: PHP 5+, UTF-8

Web page keywords characterize the page's topic for a search engine. Extracting keywords requires that you recognize the page's character encoding, strip away HTML tags, scripts, and styles, decode HTML entities, and remove unwanted punctuation, symbols, numbers, and stop words. This article shows how.

Get the page text

Keyword extraction starts by reading the HTML page into a text string. If you already have the page in a file, just read it in using file_get_contents( ) or its equivalent. Otherwise, have the keyword script get the page directly from a web server. This ensures that you always get a complete page, with everything added by server-side scripts and includes.

Whichever way you get the HTML page, once you have the text you'll need to handle the many different character encodings used by international web pages. Let's do this step by step.

Get the web page from a disk file

PHP has several functions to read a file from disk. I'll use file_get_contents( ), which returns a text string containing the entire file.

$encodedText = file_get_contents( $filename );

Get the web page from the server

PHP has two main ways to connect to a web server and download a page. First, PHP's fopen( ) wrappers let you use standard file reading functions to open a URL and read page text. Second, PHP's CURL (Client URL) functions let you build a detailed web server request and read the server's response and the returned page text. Both of these are available in standard PHP distributions from 4.0.4 onward.

I've covered these in two separate articles. Both articles include explanations and sample code.

CURL is the more flexible choice. The CURL article above includes a sample get_web_page( ) function that uses CURL. Pass it a URL and it returns an associative array containing the page text and the web server header. Error codes tell you when the URL is bad or the server is down.

$result = get_web_page( $url );

if ( $result['errno'] != 0 )
    ... Error: bad URL, timeout, or redirect loop ...

if ( $result['http_code'] != 200 )
    ... Error: no server, no permissions, or no page ...

$encodedText = $result['content'];

Determine the page's character encoding

The "content type" for a web file tells you the file's MIME type, such as "image/gif" for a GIF image and "text/html" for an HTML page. The content type for text files also includes a "character encoding" (or "charset") that tells you how the file represents characters. Some encodings use one byte per character, others use several bytes. You'll need to determine the encoding so that you can recognize letters, numbers, punctuation, and other symbols in the text.

There are dozens of character encodings, such as the old ASCII encoding used for American English, ISO 8859 for the Latin alphabet used by English, German, Spanish, French, and others, Big5 for traditional Chinese, Shift JIS for Japanese, and several old Windows-specific encodings. Older web pages may use any of these encodings, but newer pages are shifting to the UTF-8 international standard. This single generic encoding can represent any of Unicode's over 100,000 characters spanning all of the world's languages.

The page's character encoding is usually found in the web server's response header. If it isn't there, you can look for a <meta> tag at the top of the page instead. If there isn't one, you can try PHP's mb_detect_encoding( ) to guess at the encoding used by a page.
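If you need the &lt;meta&gt; tag fallback, a minimal sketch follows. The meta_charset( ) function and its regular expression are my own illustration, not part of the article's sample code, and real-world pages vary enough that you shouldn't treat it as robust:

```php
<?php
// Hypothetical fallback: pull the charset out of a
// <meta http-equiv="Content-Type" ...> tag when the HTTP header omits it.
function meta_charset( $html )
{
    $pattern = '@<meta\s+http-equiv=["\']?Content-Type["\']?\s+' .
               'content=["\']?[^"\'>]*charset=([\w-]+)@i';
    if ( preg_match( $pattern, $html, $matches ) )
        return strtolower( $matches[1] );
    return false;    // no usable <meta> tag found
}
```

Pass it the raw page text; it returns a lower-case charset name, or false when no recognizable tag is present.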

I cover several ways to get the content type and character encoding in a separate article. The article explains things a bit more and includes sample code.

If you use the CURL-based get_web_page( ) function I used in the previous section, it returns the content type in the associative array. A typical content type looks like this:

text/html; charset=utf-8

Parse that with a regular expression to get the character encoding name (the word after "charset"). The charset parameter is optional, so check that it was matched before using it.

$contentType = $result['content_type'];
preg_match( '@([\w/+]+)(;\s+charset=(\S+))?@i', $contentType, $matches );
$charset = isset( $matches[3] ) ? $matches[3] : '';

Convert the page text to UTF-8

It isn't practical to write a custom keyword extractor for every character encoding. Instead, always convert web page text into the generic UTF-8 encoding.

To convert text, use PHP's iconv( ) function. Pass it the name of the current encoding, the name of the one you want to use ("utf-8"), and the text to convert. The function returns the converted text. If the text is already in UTF-8, the returned text is the same as the original text.

$text = iconv( $charset, "utf-8", $encodedText );

You also can use PHP's mb_convert_encoding( ) function. The arguments are the same, but in a different order:

$text = mb_convert_encoding( $encodedText, "utf-8", $charset );

In either case, the returned UTF-8 text uses between one and four bytes per character. Some of PHP's string functions won't work right with these multibyte characters. For instance, strlen( ) will return the number of bytes in a string, not the number of multibyte characters.

So, you'll have to be careful which string functions you use on the text from here forward. In most cases, there are "mb" (multibyte) equivalents of the standard string functions. For example, use mb_strlen( ) instead of strlen( ), mb_substr( ) instead of substr( ), and so on. Remember to pass the encoding name, "utf-8", as the last argument to all of these functions. There's nothing in the string itself to tell the functions what encoding you're using, and if they use the wrong one you'll get garbled text.
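The byte-versus-character difference is easy to demonstrate, assuming the mbstring extension is loaded:

```php
<?php
// "é" is one character but two bytes (0xC3 0xA9) in UTF-8.
$word = "caf\xC3\xA9";                     // "café"
echo strlen( $word ), "\n";                // counts bytes: 5
echo mb_strlen( $word, "utf-8" ), "\n";    // counts characters: 4
```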

Several of the functions discussed below use PHP's preg_replace( ), which works well with multibyte strings as long as you add the /u pattern modifier. See the Multibyte String Functions manual for much more detail.
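For instance, a blunt one-liner that strips runs of Unicode punctuation (far cruder than the strip_punctuation( ) function discussed below, which handles the special cases) needs /u to keep multibyte characters intact:

```php
<?php
// \p{P} matches any Unicode punctuation character; the /u modifier makes
// PCRE treat both the pattern and the subject as UTF-8, not raw bytes.
$quoted = "\xE2\x80\x9CHello,\xE2\x80\x9D he said.";   // “Hello,” he said.
$text   = preg_replace( '/\p{P}+/u', ' ', $quoted );
echo $text;
```

Without /u, the bytes of a multibyte character can be matched individually and the text gets corrupted.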

Remove HTML syntax

We need to clean the text of all HTML syntax. This includes removing HTML tags, scripts, and styles, and decoding HTML character references and entities used to embed special characters.

Before doing this, you may wish to extract URLs. You can use these to guide spidering of your web site, do link checking, build a site map, and build a table of internal and external links from the page. I cover this in a separate article that includes explanations and sample code.

After you remove HTML tags, you'll be left with a long string of unstructured text. Page headers, footers, sidebars, menus, and the body are all included, without distinguishing one from another. Yet the body text is probably more relevant for our analysis than, say, a copyright statement in the footer. It would be nice if we could use semantic markup to focus on the most relevant text.

Each new version of the HTML standard adds more tags with semantic meaning. HTML 4 defines the <acronym>, <cite>, <abbr>, and other tags for acronyms, citations, and abbreviations, plus the (deprecated) <dir> and <menu> tags for directory lists and menus. The forthcoming HTML 5 defines the <nav> tag to mark the navigation menu of a page, <header> and <footer> for the page's header and footer, and <article> and <aside> tags for the principal content of a page. When properly used, these tags enable page analysis to find the relevant parts of a page and ignore the rest.

Unfortunately, HTML 5 isn't a standard yet and the semantic tags of HTML 4 are not widely used. If you know something about the pages you're analyzing, you can use that knowledge here to throw out uninteresting parts of the page. Otherwise you'll have to process the entire page.

Strip HTML tags, scripts, and styles

PHP's strip_tags( ) function will remove the tags, but it leaves behind the unwanted content between <script> and <style> tags. When it removes the tags, it also joins together the words before and after them. For block-level tags, like <p>, this is the wrong behavior and it'll garble your keyword list.

In a separate article I cover the steps needed to properly remove HTML tags, avoid word-joining, and get rid of embedded scripts and styles. The article includes sample code and more explanations.

The article above includes a strip_html_tags( ) function to do the job. Pass it the UTF-8 page text and it returns the stripped text.

$text = strip_html_tags( $text );

Decode HTML character references

An "HTML character reference" is a special sequence of characters used to include a symbol that may not be supported by the page's character encoding. These always start with an "&" followed by a name or number and a semi-colon. For instance, &euro; creates a € symbol and &copy; creates a © symbol. There are hundreds of named character references, called "HTML entities". Additionally, any Unicode character can be referenced by its decimal or hexadecimal code, such as &#x03A9; for the Greek letter Omega Ω.

HTML character references and named entities are often inserted automatically by HTML authoring tools. We need to convert these to normal UTF-8 characters so that they can be processed along with the rest of the text. You can do this with PHP's standard html_entity_decode( ) function. Unfortunately, the function is often used improperly, so I've discussed it in a brief separate article.

Call html_entity_decode( ) and pass it the text to decode, a flag indicating how it should handle quotes, and the name of the character encoding to use ("utf-8"):

$text = html_entity_decode( $text, ENT_QUOTES, "utf-8" );

Remove unwanted characters

At this point you have UTF-8 text stripped of all HTML syntax. However, since we're only interested in keywords, we need to remove the punctuation, special symbol characters, numbers, currency signs, plus and minus, and other math characters.

Beware, however, that stripping away punctuation joins sentences together into a single long string of words. This is fine for extracting keywords, but not phrases. The words at the end of one sentence will blend into those at the start of the next sentence, creating odd word pairings that can give unexpected phrase searching results.

Removing punctuation also destroys the grammatical cues needed by natural language processing. If you intend to use more sophisticated ways of extracting meaning from a page, stop here. But for keyword extraction, we don't need punctuation, symbols, or numbers — just the words.

Strip punctuation characters

Punctuation includes full stops (periods), commas, quotes, brackets, dashes, and so on. In Unicode there are hundreds of these characters. For instance, there are 65 different closing brackets alone. There are 11 types of opening quotes, 18 different dashes, and several hundred special marks for Hebrew, Arabic, Thai, Cuneiform, and other languages.

Punctuation removal is complicated by characters that have different meanings in different contexts. For example, a dash is removable when used parenthetically — like this — but not when used as a hyphen in up-to-date. A full stop (period) is removable punctuation at the end of a sentence, but not when embedded in a domain name like example.com or a file name like index.htm. An apostrophe is removable when used as a quote in 'this', but not in a contraction like can't or a possessive like Dave's.

There's too much to cover here, so I discuss this in a separate article that includes sample code and quite a bit of explanation.

The article above defines a strip_punctuation( ) function that takes a text string and returns the string without punctuation:

$text = strip_punctuation( $text );

Strip symbol characters

Unicode has an enormous number of special symbols. Some are in common use, such as + = < > © ® ™ $ ¢ £ € ¥. Others are only found in mathematics text, such as ± ÷ ∞ ∑ ∅ ∀ ∴ ∪ ∩ ∫, or in other special situations, such as ♜ ♘ ♫ ♥ ♣ ♻ ☯ ✓ ☿ ♋ ☂ ⏄ ⏆ ↑ ↓ ↩ ↪ ♨ ☞ ☼ ☾ ☎. There are 914 different mathematical symbols and another 2,958 specialty symbols.

As with punctuation removal, there are a few special cases to handle. If you intend to keep numbers in the text (such as dates, IP addresses, and currencies), you'll need to keep currency symbols and remove plus and minus only when they aren't adjacent to the digits of a number. For East Asian languages, you'll need to skip past radical and stroke symbols used to assemble ideographs.

Again, there's too much to cover here. I discuss these special cases in a separate article that includes sample code and further explanations.

The article above defines a strip_symbols( ) function that takes a text string and returns the string without symbol characters:

$text = strip_symbols( $text );

Strip number characters

Depending upon your needs, numbers in the page may or may not be interesting. If you are processing a financial report or a list of IP addresses, the numbers may be worth keeping. Otherwise, you can strip them out.

While most of the world's languages use the same 0 through 9 Arabic numerals, there are 280 more Unicode digit symbols for languages like Thai, Tibetan, and Balinese. There are another 210 numeric letter symbols for Roman numerals and those in Old Persian, Cuneiform, and others. Add another 336 symbols for fractions, superscripts, subscripts, and specialty digits, 41 currency symbols, and hundreds more for units of measure, like degrees Celsius and square feet.

When removing numbers we also have to watch for non-digits related to those numbers. This includes plus and minus, AM/PM after a time, a colon between the hours and minutes of a time, a dash within a numeric range like 2006-2008, decimal points in floating point numbers, and so on.

Once again, this is too much to cover here. I discuss this in a separate article that includes explanations and sample code.

The article above defines a strip_numbers( ) function that takes a text string and returns the string without number characters:

$text = strip_numbers( $text );

Convert to lower case

Differences in upper and lower case are not that significant for keywords. So use the multibyte-safe mb_strtolower( ) function to convert to lowercase.

$text = mb_strtolower( $text, "utf-8" );

Process the page's words

So far, processing has been largely character-based. Now it's time to split the text into words and build a unique keyword list.

Split the text into a word list

PHP's explode( ) function is often used to split text into an array of words. Unfortunately, it isn't safe with multibyte characters. Instead, use the mb_split( ) function. Its first argument is a pattern for a word delimiter (use a space). The second argument is the string to split; trim it first so that leading or trailing spaces don't produce empty entries in the word list. Unlike most of the other "mb" functions, mb_split( ) doesn't have an argument to select the character encoding to use. Instead, call the mb_regex_encoding( ) function first to set the character encoding.

mb_regex_encoding( "utf-8" );
$words = mb_split( ' +', trim( $text ) );

Stem the words

Stemming shortens a word to its root. This reduces words like "Stem", "Stems", and "Stemming" to just "Stem". All of the variants of a word collapse into the same stem word, letting us focus on the keywords themselves instead of each minor difference in the way they are used.

For English, the "Porter stemmer" algorithm, by Martin Porter, is widely used. Porter's web page on The Porter Stemming Algorithm explains the algorithm and links to a free PHP implementation.

Stem each word in the word list.

foreach ( $words as $key => $word )
    $words[$key] = PorterStemmer::Stem( $word, true );

Stemming algorithms must be different for different languages. Wikipedia's Stemming page is a good place to start if you need to stem a language other than English.

Remove stop words

English words like "the", "and", "a", "by", and others are very common and make for poor search words. Search indexing can skip these so-called "stop words."

There is no official stop word list for English, or any other language. But there are many suggested lists available if you do a web search on "stop words". The one I use has about 700 words.

To use a stop word list, read it into a string and split it into an array of words. Be sure to use UTF-8, lower case, and stems of the stop words:

$stopText  = file_get_contents( $stopWordsFilename );
$stopWords = mb_split( '[ \n]+', mb_strtolower( $stopText, 'utf-8' ) );
foreach ( $stopWords as $key => $word )
    $stopWords[$key] = PorterStemmer::Stem( $word, true );

To remove stop words, use PHP's array_diff( ) function to remove any word in the word list that is also in the stop word list.

$words = array_diff( $words, $stopWords );

Remove unwanted words

Beyond stop words, extremely common words, like "web" or "html", make poor keywords. Depending upon your needs, you may strip your word list of poor keywords at this stage. This can significantly shorten the keyword list and simplify further processing.

For my own use, I maintain a long list of words to remove. The list primarily contains adjectives (like "big"), adverbs (like "quickly"), pronouns (like "he" and "she"), prepositions (like "to" and "from"), conjunctions (like "but"), and interjections (like "Ouch!"). Very generic verbs and nouns are also on the list (like "edit" and "html"). After removing these words, I'm left with a strong list of keywords that characterize the page's text.

To remove unwanted words, use PHP's array_diff( ) function again and your unwanted word list:

$words = array_diff( $words, $unwantedWords );

Count keyword usage

PHP's array_count_values( ) function counts the number of times each value occurs in an array. It returns a list of those keywords as array keys, and their counts as the array values. Sort on those values using the arsort( ) function to get a keyword list ordered from most to least frequently occurring.

$keywordCounts = array_count_values( $words );
arsort( $keywordCounts, SORT_NUMERIC );

Call array_keys( ) on the array to get a list of unique keywords:

$uniqueKeywords = array_keys( $keywordCounts );

Be careful if you want to sort the keywords alphabetically. PHP's default alphabetic sorts for English won't sort well for other languages. Instead, you'll need the Unicode Collation Algorithm, which is beyond the scope of this article.

Where to go from here

For simple search indexing, add your keywords to the search index for the page.

Page keywords and their frequency roughly characterize the page's topic. Tag clouds have become a trendy way to show that topic. Take the keyword counts list above (before sorting it), and print out each keyword. Increase the keyword's font size for more frequent keywords.
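To make the scaling concrete, here is a minimal sketch. The tag_cloud_sizes( ) function and its 10pt-to-24pt range are my own illustration, not part of the article's sample code:

```php
<?php
// Scale font sizes linearly between $minSize and $maxSize points based on
// each keyword's count. $keywordCounts is the (unsorted) result of
// array_count_values( ) from the previous section.
function tag_cloud_sizes( $keywordCounts, $minSize = 10, $maxSize = 24 )
{
    $min   = min( $keywordCounts );
    $max   = max( $keywordCounts );
    $sizes = array( );
    foreach ( $keywordCounts as $word => $count )
    {
        // Map the count into a 0.0 to 1.0 range, guarding against
        // division by zero when all counts are equal.
        $scale = ($max == $min) ? 1.0 : ($count - $min) / ($max - $min);
        $sizes[$word] = (int) round( $minSize + $scale * ($maxSize - $minSize) );
    }
    return $sizes;
}
```

Print each keyword styled with its computed size and you have a basic tag cloud.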

Actually, keyword frequency isn't a very reliable way to characterize a page topic. Instead, look at the Vector space model. The idea is to weight the keywords both by their frequency and by their global importance across some large document set. All the pages in Google's index, for instance, would be a great document set. The more often a keyword occurs in that document set, the less well that keyword distinguishes one document from another and the worse that keyword is for a search index. We already did something like this by removing stop words and other common words. The vector space model describes the math needed to weight your keywords by importance, rather than just frequency.
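To make the weighting concrete, here is a toy sketch. The tf_idf( ) function is my own simplified illustration of term frequency times inverse document frequency, not a full vector space implementation:

```php
<?php
// Weight each word in $words by its frequency in the page (term frequency)
// and its rarity across $documents, an array of word lists, one per
// document in some reference collection (inverse document frequency).
function tf_idf( $words, $documents )
{
    $counts  = array_count_values( $words );
    $total   = count( $words );
    $nDocs   = count( $documents );
    $weights = array( );
    foreach ( $counts as $word => $count )
    {
        // Term frequency: how often the word appears in this page.
        $tf = $count / $total;

        // Document frequency: how many documents contain the word.
        $df = 0;
        foreach ( $documents as $doc )
            if ( in_array( $word, $doc ) )
                ++$df;

        // Rare words score high; words in every document score zero.
        $idf = log( $nDocs / max( 1, $df ) );
        $weights[$word] = $tf * $idf;
    }
    arsort( $weights );
    return $weights;
}
```

Notice that a word appearing in every document gets a weight of zero, which is exactly the stop-word effect described above, derived from data instead of a hand-made list.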

Comments

Great Article

Very interesting article. Makes for a great starting point for formalising my own method for gathering keywords. I came across this page as I'm looking for a way to rank my advertisements against what people are searching for or viewing. The stemming algorithm, along with the vector space model are great resources. Thanks.

Problems

Dear Mr. Nadeau

I have experienced some problems trying to convert web page to text using your strip_html_tags() function.

I am trying to get the text contents of this page http://www.henningercpa.com/ with HTML tags stripped out. After downloading the page and converting it to UTF-8, I have a $string with the page contents as its value. However, when I pass the $string to strip_html_tags( ), I get an empty $string.

I experience this problem with the above page only.

Could you please check what's wrong ?

Thank you,
Vladimir

Problems stripping tags from www.henningercpa.com

My strip_html_tags( ) function removes script, style, and other blocks, then calls PHP's standard strip_tags( ) function to remove the remaining tags. The above page makes it through my code, but fails in strip_tags( ). The problem appears to be an HTML comment that reads "<!-- URL's used in the movie-->". The apostrophe on "URL's" causes strip_tags( ) to fail. After removing it, everything works. This is clearly a bug in PHP's strip_tags( ), since apostrophes are legal characters in HTML comments.

By the way... the above page is invalid HTML. It includes <script> and <style> tags before the DOCTYPE (which must be the first line of the file), it doesn't quote some attribute values, it uses invalid attributes, and it doesn't close some tags. Many browsers are very forgiving and can handle this, but it'd be better if it were valid HTML. If this is your page, you should run it through the W3C validator and make the changes it recommends.

Wonderful Resource

Thank you for your informational posting! Up until this point I've been dabbling with perl to provide text analysis. Will give this a try. I was wondering if you have code available for download that provides the whole process described in this tip? I'm working through the snippets at the moment, but I learn best by taking apart examples.

Thanks again!

- Jason

encoding problems

Very interesting...
I have a little encoding problem.. on some webpages. Maybe you can help..

example:
When reading some web pages, I first get the server-side encoding from the header. It's windows-1250.
Then I get the charset from the page's http-equiv meta tag. It's UTF-8.
Now in those cases, sometimes the page is UTF-8 and sometimes it's windows-1250.

It seems it is windows-1250 (the server encoding) if the title or meta description or meta keywords comes before the meta http-equiv tag, and it takes the other encoding if the meta http-equiv charset comes first. Now this is an assumption, because I'm not sure (I just happened to find 5 or 6 pages like that).

It is clear that this comes from bad HTML, but I didn't program those pages.

Here is an example of a Russian page that has this double encoding. Maybe you have advice on how to determine the encoding in this case: the http-equiv charset is UTF-8 and the server-side encoding in the header is windows-1251. The page itself seems to be windows-1251.
http://www.historymuseum.org/?lang_id=2

If you have any ideas for a solution,
I would be very grateful

Luc

Re: encoding problems

The HTML specification's section on Specifying the character encoding says that browsers (and scripts like yours) should check for the page's charset in this order (high to low priority):

  1. If the HTTP header includes a "Content-Type" field, use the value of the "charset" parameter.
  2. If the page includes a META tag with "http-equiv" set to "Content-type", use the "charset" value of the tag.
  3. If a tag includes a "charset" attribute, use the value.
  4. Otherwise, guess the character set or use a default charset chosen by the user.

In your case, the page you refer to has an HTTP header saying that the page is in "Windows-1251", while the page's own META tag says it is in "UTF-8". Since the HTTP header has a higher priority, use it. For this page, this gives the correct answer: the page indeed uses "Windows-1251" and the META tag is wrong.

The specification further notes that a META tag giving the character set should occur before other tags that provide non-ASCII content, such as the page's title text. However, the spec doesn't say if you should or should not believe the META tag if the title comes before it. Nor does the spec say what you should do if there is more than one META tag with conflicting information, which Microsoft's site did until recently. But, really none of this helps in your case since you already know that the HTML is wrong and you can't fix it.

So, if neither the HTTP header nor a META tag is reliable, you can try PHP's mb_detect_encoding( ) function. Pass it a string containing the page text and, optionally, a priority-ordered list of encodings you think it might be using. The function returns its guess at the encoding being used.
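For example, assuming the mbstring extension is loaded; the candidate list below is just a guess at likely encodings for your pages:

```php
<?php
// Sample bytes; in practice this is the raw downloaded page text.
$encodedText = "caf\xC3\xA9";

// Ask mb_detect_encoding( ) for the first candidate encoding that the
// bytes are valid in; the third argument requests strict checking.
$charset = mb_detect_encoding( $encodedText,
                               array( "UTF-8", "Windows-1251", "ISO-8859-1" ),
                               true );
if ( $charset === false )
    $charset = "Windows-1251";   // fall back to the HTTP header's claim
echo $charset;
```

Put the encoding you trust most first; detection is a guess, not a guarantee, especially for short pages.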

Stemming via PHP

Just as a word stemming tip for PHP...
If you are serious about building something, you should try installing the PECL stem package http://pecl.php.net/package/stem

It is compliant with the Snowball API (http://snowball.tartarus.org), relatively fast, and supports a few languages (not just English).

BTW. Nice article.

Function reuse

Very useful article! I am unclear though whether you are actually permitting the reuse of the PHP functions you have developed for your posts (like strip_symbols etc). Your copyright policy seems to imply a restriction on reuse but that would be really odd for this kind of code. Please clarify.

Re: Function reuse

My code is covered by the OSI BSD license, which is one of the most permissive licenses there is. Indeed you can re-use the code, modify it, redistribute it, sell it, or whatever. You only have to retain my copyright message and not use my name to endorse or promote your product. Very standard stuff.

Quoting from the "Readme.txt" file in my ZIP file for the code (emphasis added):

"These files are distributed under a BSD license that enables you to use, modify and redistribute this code as you see fit."

Quoting from the comment header in the code: (emphasis added)

"Redistribution and use in source and binary forms, with or without modification, are permitted..."

See also the full OSI BSD license at:

http://www.opensource.org/licenses/bsd-license.php

Thanks

Thank you, your entry is very useful ;)

Great Article

This is a great article. I've been searching for this kind of knowledge all over the net. I really appreciate your patience in explaining every fine detail.

Please keep writing such articles.

Roopak

Awesome

Thanks for the php tip.

Amazing Info!!

Thanks, one of the best sites for PHP web scraping info

very precise, well written article

I would like to read more articles you have written. very nice!

special thanks for this

special thanks for this article

Extremely useful. A life-saver in fact.

Thank you for a very well thought out, detailed and structured description of the solution to a nagging problem. Good wishes for you.

fantastic

wonderful info, thank you

Sense of appreciation

Thanks very much for time taken to do such brilliant presentation. Must be a pretty good teacher. The presentation is well structured and the communication is good.

Thanks very much!

Malformed keywords :(

I have issues with some keywords in the $uniqueKeywords array. The word "memory" is extracted as "memori", and many other words behave the same way. I don't know why this is happening. What can be the reason?

Re: Malformed keywords

The Porter Stemmer algorithm strives to map many variations of a word to a common root word so that only that root word is counted, not every minor variation. The root word may not be an actual dictionary word. It's an algorithmically generated root word. This is why you get "memori" as the root word of "memories", "memorize", "memorable", etc.

Many Many Thanks

I don't have enough words to thank you. It has helped me a lot. It's an awesome article, very beautifully explained.


Nadeau software consulting