PHP tip: How to extract keywords from a web page

Technologies: PHP 5+, UTF-8

Web page keywords characterize the page's topic for a search engine. Extracting keywords requires that you recognize the page's character encoding, strip away HTML tags, scripts, and styles, decode HTML entities, and remove unwanted punctuation, symbols, numbers, and stop words. This article shows how.

Get the page text

Keyword extraction starts by reading the HTML page into a text string. If you already have the page in a file, read it with file_get_contents( ) or its equivalent. Otherwise, have the keyword script get the page directly from a web server. This ensures that you always get a complete page that includes everything added by server-side scripts and includes.

Whichever way you get the HTML page, once you have the text you'll need to handle the many different character encodings used by international web pages. Let's do this step by step.

Get the web page from a disk file

PHP has several functions to read a file from disk. I'll use file_get_contents( ), which returns a text string containing the entire file.

$encodedText = file_get_contents( $filename );

Get the web page from the server

PHP has two main ways to connect to a web server and download a page. First, PHP's fopen( ) wrappers let you use standard file reading functions to open a URL and read page text. Second, PHP's CURL (Client URL) functions let you build a detailed web server request and read the server's response and the returned page text. Both are available in standard PHP distributions from 4.0.4 onward, although the URL wrappers only work when the allow_url_fopen setting is on and the CURL functions require the CURL extension to be enabled.

I've covered both in separate articles that include explanations and sample code.

CURL is the more flexible choice. The CURL article above includes a sample get_web_page( ) function that uses CURL. Pass it a URL and it returns an associative array containing the page text and the web server header. Error codes tell you when the URL is bad or the server is down.

$result = get_web_page( $url );

if ( $result['errno'] != 0 ) {
    // Error: bad URL, timeout, or redirect loop. Report it and stop.
}
if ( $result['http_code'] != 200 ) {
    // Error: no server, no permissions, or no page. Report it and stop.
}

$encodedText = $result['content'];

Determine the page's character encoding

The "content type" for a web file tells you the file's MIME type, such as "image/gif" for a GIF image and "text/html" for an HTML page. The content type for text files also includes a "character encoding" (or "charset") that tells you how the file represents characters. Some encodings use one byte per character, others use several bytes. You'll need to determine the encoding so that you can recognize letters, numbers, punctuation, and other symbols in the text.

There are dozens of character encodings. Examples include the old ASCII encoding used for American English, ISO 8859 for the Latin alphabet used by English, German, Spanish, French, and others, Big5 for traditional Chinese, Shift JIS for Japanese, and several old Windows-specific encodings. Older web pages may use any of these encodings, but newer pages are shifting to the UTF-8 international standard. This single generic encoding can represent any of Unicode's more than 100,000 characters spanning all of the world's languages.

The page's character encoding is usually found in the web server's response header. If it isn't there, you can look for a <meta> tag at the top of the page instead. If there isn't one, you can try PHP's mb_detect_encoding( ) to guess at the encoding used by a page.

I cover several ways to get the content type and character encoding in a separate article, which explains things a bit more and includes sample code.

If you use the CURL-based get_web_page( ) function I used in the previous section, it returns the content type in the associative array. A typical content type looks like this:

text/html; charset=utf-8

Parse that with a regular expression to get the character encoding name (the word after "charset").

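$contentType = $result['content_type'];
$charset = '';
// Allow "text/html;charset=utf-8" (no space) and a header with no charset at all.
if ( preg_match( '@([\w/+]+)(;\s*charset=(\S+))?@i', $contentType, $matches ) && isset( $matches[3] ) )
    $charset = trim( $matches[3], '"\'' );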
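The fallback order described earlier (response header first, then a <meta> tag, then a guess) can be sketched in one function. This is only an illustration: find_charset( ) is a hypothetical name, not a function from the articles mentioned here, and the 4096-byte search window and the two-entry candidate list for mb_detect_encoding( ) are assumptions you should adjust.

```php
<?php
// Hypothetical helper (not from the referenced articles): return a lower-case
// charset name from the Content-Type value, else from a <meta> tag, else a guess.
function find_charset( $contentType, $html )
{
    // 1. The response header value, e.g. "text/html; charset=utf-8".
    if ( preg_match( '@charset\s*=\s*["\']?([\w.:-]+)@i', (string) $contentType, $m ) )
        return strtolower( $m[1] );

    // 2. A <meta> tag near the top of the page. This matches both
    //    <meta charset="..."> and <meta http-equiv=... content="...; charset=...">.
    $head = substr( $html, 0, 4096 );   // assumed search window
    if ( preg_match( '@<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)@i', $head, $m ) )
        return strtolower( $m[1] );

    // 3. Last resort: let mbstring guess from a short, assumed candidate list.
    if ( function_exists( 'mb_detect_encoding' ) ) {
        $guess = mb_detect_encoding( $html, array( 'UTF-8', 'ISO-8859-1' ), true );
        if ( $guess !== false )
            return strtolower( $guess );
    }
    return null;   // nothing found
}
```

A guess from step 3 is the least trustworthy answer, so it is worth logging when the function has to fall back that far.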

Convert the page text to UTF-8

It isn't practical to write a custom keyword extractor for every character encoding. Instead, always convert web page text into the generic UTF-8 encoding.

To convert text, use PHP's iconv( ) function. Pass it the name of the current encoding, the name of the one you want to use ("utf-8"), and the text to convert. The function returns the converted text, or false if the conversion fails. If the text is already in UTF-8, the returned text is the same as the original text.

$text = iconv( $charset, "utf-8", $encodedText );

You also can use PHP's mb_convert_encoding( ) function. The arguments are the same, but in a different order:

$text = mb_convert_encoding( $encodedText, "utf-8", $charset );

In either case, the returned UTF-8 text uses between one and four bytes per character. Some of PHP's string functions won't work right with these multibyte characters. For instance, strlen( ) will return the number of bytes in a string, not the number of multibyte characters.

So, you'll have to be careful which string functions you use on the text from here forward. In most cases, there are "mb" (multibyte) equivalents of the standard string functions. For example, use mb_strlen( ) instead of strlen( ), mb_substr( ) instead of substr( ), and so on. Remember to pass the encoding name, "utf-8", as the last argument to all of these functions. There's nothing in the string itself to tell the functions what encoding you're using and if they use the wrong one you'll get garbled text.
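A small illustration of the byte versus character difference, assuming both the iconv and mbstring extensions are enabled. The sample string is made up for the demonstration.

```php
<?php
// "café" in ISO-8859-1 is four bytes; the last byte, 0xE9, is "é".
$encodedText = "caf\xE9";
$text = iconv( 'ISO-8859-1', 'UTF-8', $encodedText );

echo strlen( $text ), "\n";               // 5: bytes, because "é" now takes two
echo mb_strlen( $text, 'UTF-8' ), "\n";   // 4: characters

// substr( ) can cut a multibyte character in half; mb_substr( ) cannot.
$lastByte = substr( $text, -1 );                  // "\xA9", only half of "é"
$lastChar = mb_substr( $text, -1, 1, 'UTF-8' );   // "\xC3\xA9", the whole "é"
```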

Several of the functions discussed below use PHP's preg_replace( ), which works well with multibyte strings as long as you add the /u pattern modifier. See the Multibyte String Functions manual for much more detail.
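To see what the /u modifier changes, compare how many times "." matches with and without it. The sample string is an assumption chosen for the demonstration.

```php
<?php
// "naïve café" in UTF-8: 10 characters stored in 12 bytes.
$sample = "na\xC3\xAFve caf\xC3\xA9";

$byteMatches = preg_match_all( '/./',  $sample, $m );   // 12: "." matches single bytes
$charMatches = preg_match_all( '/./u', $sample, $m );   // 10: "." matches whole characters

// With /u, Unicode classes such as \p{L} (any letter) treat "ï" and "é" as letters.
$joined = preg_replace( '/[^\p{L}]+/u', '_', $sample );  // "naïve_café"
```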

Remove HTML syntax

We need to clean the text of all HTML syntax. This includes removing HTML tags, scripts, and styles, and decoding HTML character references and entities used to embed special characters.

Before doing this, you may wish to extract URLs. You can use these to guide spidering of your web site, do link checking, build a site map, and build a table of internal and external links from the page. I cover this in a separate article that includes explanations and sample code.

After you remove HTML tags, you'll be left with a long string of unstructured text. Page headers, footers, sidebars, menus, and the body are all included, without distinguishing one from another. Yet the body text is probably more relevant for our analysis than, say, a copyright statement in the footer. It would be nice if we could use semantic markup to focus on the most relevant text.

Each new version of the HTML standard adds more tags with semantic meaning. HTML 4 defines the <acronym>, <cite>, <abbr>, and other tags for acronyms, citations, and abbreviations, plus the (deprecated) <dir> and <menu> tags for directory lists and menus. The forthcoming HTML 5 defines the <nav> tag to mark the navigation menu of a page, <header> and <footer> for the page's header and footer, <article> for the principal content of a page, and <aside> for content only loosely related to it. When properly used, these tags enable page analysis to find the relevant parts of a page and ignore the rest.

Unfortunately, HTML 5 isn't a standard yet and the semantic tags of HTML 4 are not widely used. If you know something about the pages you're analyzing, you can use that knowledge here to throw out uninteresting parts of the page. Otherwise you'll have to process the entire page.

Strip HTML tags, scripts, and styles

PHP's strip_tags( ) function will remove the tags, but it leaves behind the styles, scripts, and other unwanted text between the tags. It also joins together the words before and after each tag it removes. For block-level tags, like <p>, this is the wrong thing to do and it'll garble your keyword list.

In a separate article I cover the steps needed to properly remove HTML tags, avoid word-joining, and get rid of embedded scripts and styles. The article includes sample code and more explanations.

The article above includes a strip_html_tags( ) function to do the job. Pass it the UTF-8 page text and it returns the stripped text.

$text = strip_html_tags( $text );
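If you don't have that function at hand, the two problems with strip_tags( ) can be worked around roughly like this. This is a minimal sketch, not the strip_html_tags( ) function itself: strip_html_sketch( ) is a hypothetical name, the list of block-level tags is an assumed, incomplete one, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical stand-in for illustration only.
function strip_html_sketch( $html )
{
    // Remove <script> and <style> elements together with their contents.
    $html = preg_replace( '@<(script|style)\b[^>]*>.*?</\1\s*>@isu', ' ', $html );

    // Put a space in front of block-level tags so the words on either side
    // are not joined when the tags disappear. Inline tags are left alone.
    $html = preg_replace(
        '@</?(?:address|blockquote|br|div|dl|dt|dd|h[1-6]|hr|li|ol|p|pre|table|td|th|tr|ul)\b[^>]*>@iu',
        ' $0', $html );

    // Now strip_tags( ) can remove the tags safely.
    $text = strip_tags( $html );

    // Collapse the extra white space left behind.
    return trim( preg_replace( '/\s+/u', ' ', $text ) );
}

$page = '<p>One</p><p>Two <b>bold</b>er</p><script>var x = "<p>no</p>";</script><style>p { color: red }</style>';
$joined  = strip_tags( '<p>One</p><p>Two</p>' );   // "OneTwo": words run together
$cleaned = strip_html_sketch( $page );             // "One Two bolder"
```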

Decode HTML character references

An "HTML character reference" is a special sequence of characters used to include a symbol that may not be supported by the page's character encoding. These always start with an "&" followed by a name or number and a semicolon. For instance, &euro; creates a € symbol and &copy; creates a © symbol. There are hundreds of named character references, called "HTML entities". Additionally, any Unicode character can be referenced by its decimal or hexadecimal code, such as &#x03A9; for the Greek letter Omega Ω.

HTML character references and named entities are often inserted automatically by HTML authoring tools. We need to convert these to normal UTF-8 characters so that they can be processed along with the rest of the text. You can do this with PHP's standard html_entity_decode( ) function. Unfortunately, the function is often used improperly so I've discussed it in a brief separate article.

Call html_entity_decode( ) and pass it the text to decode, a flag indicating how it should handle quotes, and the name of the character encoding to use ("utf-8"):

$text = html_entity_decode( $text, ENT_QUOTES, "utf-8" );
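A quick check of what the call does with named, decimal, and hexadecimal references. The sample string is made up for the demonstration.

```php
<?php
$html = 'Caf&eacute; &amp; bar &copy; 2008, &euro;5, &#x03A9; and &#8364;';

// Always name the encoding. Before PHP 5.4 the default was ISO-8859-1,
// which cannot represent characters such as the Greek Omega.
$decoded = html_entity_decode( $html, ENT_QUOTES, 'UTF-8' );

// $decoded is now the UTF-8 string "Café & bar © 2008, €5, Ω and €".
```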

Remove unwanted characters

At this point you have UTF-8 text stripped of all HTML syntax. However, since we're only interested in keywords, we need to remove the punctuation, special symbol characters, numbers, currency signs, plus and minus, and other math characters.

Beware, however, that stripping away punctuation joins sentences together into a single long string of words. This is fine for extracting keywords, but not phrases. The words at the end of one sentence will blend into those at the start of the next sentence, creating odd word pairings that can give unexpected phrase searching results.

Removing punctuation also destroys the grammatical cues needed by natural language processing. If you intend to use more sophisticated ways of extracting meaning from a page, stop here. But for keyword extraction, we don't need punctuation, symbols, or numbers — just the words.

Strip punctuation characters

Punctuation includes full stops (periods), commas, quotes, brackets, dashes, and so on. In Unicode there are hundreds of these characters. For instance, there are 65 different closing brackets alone. There are 11 types of opening quotes, 18 different dashes, and several hundred special marks for Hebrew, Arabic, Thai, Cuneiform, and other languages.

Punctuation removal is complicated by characters that have different meanings in different contexts. For example, a dash is removable when used parenthetically — like this — but not when used as a hyphen in up-to-date. A full stop (period) is removable punctuation at the end of a sentence, but not when embedded in a domain name like example.com or a file name like index.htm. An apostrophe is removable when used as a quote in 'this', but not in a contraction like can't or a possessive like Dave's.

There's too much to cover here, so I discuss this in a separate article that includes sample code and quite a bit of explanation.

The article above defines a strip_punctuation( ) function that takes a text string and returns the string without punctuation:

$text = strip_punctuation( $text );
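To give a feel for the approach, here is a much simpler sketch built on the Unicode punctuation class \p{P}. It is not the strip_punctuation( ) function from that article: strip_punctuation_sketch( ) is a hypothetical name, and the rule it applies (keep punctuation only when it sits between two letters or digits) is an assumption that covers contractions, hyphenated words, and domain names but misses many other cases.

```php
<?php
// Hypothetical, simplified stand-in for illustration only.
function strip_punctuation_sketch( $text )
{
    // Remove a run of punctuation unless it has a letter or digit on both
    // sides, as in "can't", "up-to-date", or "example.com".
    $text = preg_replace(
        '/(?<![\p{L}\p{N}])\p{P}+|\p{P}+(?![\p{L}\p{N}])/u', ' ', $text );

    // Collapse the spaces left behind.
    return trim( preg_replace( '/\s+/u', ' ', $text ) );
}

// "\xE2\x80\x94" is a long dash used parenthetically.
$sample = "Well \xE2\x80\x94 it's up-to-date, see example.com.";
$result = strip_punctuation_sketch( $sample );   // "Well it's up-to-date see example.com"
```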

Strip symbol characters

Unicode has an enormous number of special symbols. Some are in common use, such as + = < > © ® ™ $ ¢ £ € ¥. Others are only found in mathematics text, such as ± ÷ ∞ ∑ ∅ ∀ ∴ ∪ ∩ ∫, or in other special situations, such as ♜ ♘ ♫ ♥ ♣ ♻ ☯ ✓ ☿ ♋ ☂ ⏄ ⏆ ↑ ↓ ↩ ↪ ♨ ☞ ☼ ☾ ☎. There are 914 different mathematical symbols and another 2,958 specialty symbols.

As with punctuation removal, there are a few special cases to handle. If you intend to keep numbers in the text (such as dates, IP addresses, and currencies), you'll need to keep currency symbols and remove plus and minus only when they aren't adjacent to the digits of a number. For East Asian languages, you'll need to skip past radical and stroke symbols used to assemble ideographs.

Again, there's too much to cover here. I discuss these special cases in a separate article that includes sample code and further explanations.

The article above defines a strip_symbols( ) function that takes a text string and returns the string without symbol characters:

$text = strip_symbols( $text );
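The simplest form of the idea uses the Unicode symbol class \p{S}, which covers math, currency, and other symbols. This sketch is an assumption-laden stand-in, not the article's strip_symbols( ): strip_symbols_sketch( ) is a hypothetical name and it ignores every special case mentioned above.

```php
<?php
// Hypothetical, simplified stand-in for illustration only.
function strip_symbols_sketch( $text )
{
    // \p{S} matches math, currency, modifier, and other symbols.
    $text = preg_replace( '/\p{S}+/u', ' ', $text );
    return trim( preg_replace( '/\s+/u', ' ', $text ) );
}

// "price < © Acme™ + co" with the copyright and trademark signs in UTF-8.
$sample = "price < \xC2\xA9 Acme\xE2\x84\xA2 + co";
$result = strip_symbols_sketch( $sample );   // "price Acme co"
```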

Strip number characters

Depending upon your needs, numbers in the page may or may not be interesting. If you are processing a financial report or a list of IP addresses, the numbers may be worth keeping. Otherwise, you can strip them out.

While most of the world's languages use the same 0 through 9 Arabic numerals, there are 280 more Unicode digit symbols for languages like Thai, Tibetan, and Balinese. There are another 210 numeric letter symbols for Roman numerals and those in Old Persian, Cuneiform, and others. Add another 336 symbols for fractions, superscripts, subscripts, and specialty digits, 41 currency symbols, and hundreds more for units of measure, like degrees Celsius and square feet.

When removing numbers we also have to watch for non-digits related to those numbers. This includes plus and minus, AM/PM after a time, a colon between the hours and minutes of a time, a dash within a numeric range like 2006-2008, decimal points in floating point numbers, and so on.

Once again, this is too much to cover here. I discuss this in a separate article that includes explanations and sample code.

The article above defines a strip_numbers( ) function that takes a text string and returns the string without number characters:

$text = strip_numbers( $text );
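As a rough illustration, the Unicode number class \p{N} plus a few separators handles the common cases of ranges, times, and decimals. This is a sketch under assumptions, not the article's strip_numbers( ): strip_numbers_sketch( ) is a hypothetical name, the separator list is an assumed one, and trailing words such as AM/PM are not handled.

```php
<?php
// Hypothetical, simplified stand-in for illustration only.
function strip_numbers_sketch( $text )
{
    // An optional sign, then digits, then any further digit groups joined by
    // a period, comma, colon, slash, or dash (as in 3.50, 10:30, 2006-2008).
    $text = preg_replace( '@[+-]?\p{N}+(?:[.,:/-]\p{N}+)*@u', ' ', $text );
    return trim( preg_replace( '/\s+/u', ' ', $text ) );
}

$sample = 'from 2006-2008 at 10:30 cost 3.50 total';
$result = strip_numbers_sketch( $sample );   // "from at cost total"
```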

Convert to lower case

Differences in upper and lower case are not that significant for keywords. So use the multibyte-safe mb_strtolower( ) function to convert to lowercase.

$text = mb_strtolower( $text, "utf-8" );

Process the page's words

So far, processing has been largely character-based. Now it's time to split the text into words and build a unique keyword list.

Split the text into a word list

PHP's explode( ) function is often used to split text into an array of words. Unfortunately, it splits on one fixed delimiter string, so every extra space in a run of spaces produces an empty word. Instead, use the mb_split( ) function, which takes a multibyte-aware regular expression. Its first argument is a pattern for a word delimiter (use one or more spaces). The second argument is the string to split. Unlike most of the other "mb" functions, mb_split( ) doesn't have an argument to select the character encoding to use. Instead, call the mb_regex_encoding( ) function first to set the character encoding.

mb_regex_encoding( "utf-8" );
$words = mb_split( ' +', $text );
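If the text may still contain tabs, newlines, or leading and trailing spaces, one alternative (my suggestion, not part of the steps above) is preg_split( ) with the /u modifier and the PREG_SPLIT_NO_EMPTY flag, which drops empty entries. The sample string is made up.

```php
<?php
$sample = "  stem  the\nwords\t";

// Split on any run of white space and discard empty strings.
$list = preg_split( '/\s+/u', $sample, -1, PREG_SPLIT_NO_EMPTY );

// $list is array( 'stem', 'the', 'words' )
```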

Stem the words

Stemming shortens a word to its root. This reduces words like "Stem", "Stems", and "Stemming" to just "Stem". All of the variants of a word collapse into the same stem word, letting us focus on the keywords themselves instead of each minor difference in the way they are used.

For English, the "Porter stemmer" algorithm, by Martin Porter, is widely used. Porter's web page on The Porter Stemming Algorithm explains the algorithm and links to a free PHP implementation.

Stem each word in the word list.

foreach ( $words as $key => $word )
    $words[$key] = PorterStemmer::Stem( $word, true );

Stemming algorithms must be different for different languages. Wikipedia's Stemming page is a good place to start if you need to stem a language other than English.

Remove stop words

English words like "the", "and", "a", "by", and others are very common and make for poor search words. Search indexing can skip these so-called "stop words."

There is no official stop word list for English, or any other language. But there are many suggested lists available if you search the web for "stop words". The one I use has about 700 words.

To use a stop word list, read it into a string and split it into an array of words. Be sure to use UTF-8, lower case, and stems of the stop words.

$stopText  = file_get_contents( $stopWordsFilename );
$stopWords = mb_split( '[ \n]+', mb_strtolower( $stopText, 'utf-8' ) );
foreach ( $stopWords as $key => $word )
    $stopWords[$key] = PorterStemmer::Stem( $word, true );

To remove stop words, use PHP's array_diff( ) function to remove any word in the word list that is also in the stop word list.

$words = array_diff( $words, $stopWords );
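One detail worth knowing: array_diff( ) keeps the original array keys, so the result has gaps in its numbering. A short made-up example, with array_values( ) added to renumber the list if you need consecutive keys:

```php
<?php
$wordList = array( 'the', 'cat', 'and', 'the', 'hat' );
$stopList = array( 'the', 'and', 'a' );

$kept = array_diff( $wordList, $stopList );   // array( 1 => 'cat', 4 => 'hat' )
$renumbered = array_values( $kept );          // array( 0 => 'cat', 1 => 'hat' )
```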

Remove unwanted words

Beyond stop words, extremely common words, like "web" or "html", make poor keywords. Depending upon your needs, you may strip your word list of poor keywords at this stage. This can significantly shorten the keyword list and simplify further processing.

For my own use, I maintain a long list of words to remove. The list primarily contains adjectives (like "big"), adverbs (like "quickly"), pronouns (like "he" and "she"), prepositions (like "to" and "from"), conjunctions (like "but"), and interjections (like "Ouch!"). Very generic verbs and nouns are also on the list (like "edit" and "html"). After removing these words, I'm left with a strong list of keywords that characterize the page's text.

To remove unwanted words, use PHP's array_diff( ) function again and your unwanted word list:

$words = array_diff( $words, $unwantedWords );

Count keyword usage

PHP's array_count_values( ) function counts the number of times each value occurs in an array. It returns a list of those keywords as array keys, and their counts as the array values. Sort on those values using the arsort( ) function to get a keyword list ordered from most to least frequently occurring.

$keywordCounts = array_count_values( $words );
arsort( $keywordCounts, SORT_NUMERIC );

Call array_keys( ) on the array to get a list of unique keywords:

$uniqueKeywords = array_keys( $keywordCounts );
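Here are the same calls run on a tiny made-up word list, with array_slice( ) added (my addition) to keep only the most frequent keywords:

```php
<?php
$wordList = array( 'stem', 'word', 'stem', 'page', 'word', 'stem' );

$counts = array_count_values( $wordList );   // stem => 3, word => 2, page => 1
arsort( $counts, SORT_NUMERIC );             // most frequent first

$unique = array_keys( $counts );               // array( 'stem', 'word', 'page' )
$topTwo = array_slice( $counts, 0, 2, true );  // array( 'stem' => 3, 'word' => 2 )
```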

Be careful if you want to sort the keywords alphabetically. PHP's default alphabetic sorts for English won't sort well for other languages. Instead, you'll need the Unicode Collation Algorithm, which is beyond the scope of this article.

Where to go from here

For simple search indexing, add your keywords to the search index for the page.

Page keywords and their frequency roughly characterize the page's topic. Tag clouds have become a trendy way to show that topic. Take the keyword counts list above (before sorting it), and print out each keyword. Increase the keyword's font size for more frequent keywords.
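A tag cloud along those lines might be sketched as follows. Everything here is an assumption for illustration: tag_cloud_html( ) is a hypothetical name, the font sizes are in pixels, and the size grows linearly with the count between a chosen minimum and maximum.

```php
<?php
// Hypothetical helper: build tag cloud HTML from a keyword => count array.
function tag_cloud_html( array $keywordCounts, $minPx = 12, $maxPx = 32 )
{
    if ( empty( $keywordCounts ) )
        return '';

    $low   = min( $keywordCounts );
    $high  = max( $keywordCounts );
    $range = max( 1, $high - $low );   // avoid dividing by zero when all counts match

    $html = '';
    foreach ( $keywordCounts as $keyword => $count ) {
        // Scale the font size linearly between $minPx and $maxPx.
        $size = (int) round( $minPx + ( $maxPx - $minPx ) * ( $count - $low ) / $range );
        $html .= '<span style="font-size:' . $size . 'px">'
               . htmlspecialchars( (string) $keyword, ENT_QUOTES, 'UTF-8' )
               . '</span> ';
    }
    return rtrim( $html );
}

$cloud = tag_cloud_html( array( 'stem' => 3, 'word' => 2, 'page' => 1 ), 10, 30 );
```

The keywords are escaped with htmlspecialchars( ) because they came from an arbitrary web page and are being written back into HTML.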

Actually, keyword frequency isn't a very reliable way to characterize a page topic. Instead, look at the vector space model. The idea is to weight the keywords both by their frequency and by their global importance across some large document set. All the pages in Google's index, for instance, would be a great document set. The more often a keyword occurs in that document set, the less well that keyword distinguishes one document from another and the worse that keyword is for a search index. We already did something like this by removing stop words and other common words. The vector space model describes the math needed to weight your keywords by importance, rather than just frequency.
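One common weighting used with the vector space model is tf-idf: a keyword's count on the page multiplied by the logarithm of (number of documents / number of documents containing the keyword). The sketch below assumes that formula, which this article does not spell out. tfidf_weights( ) is a hypothetical name, and the document set is a made-up array of word lists that is assumed to include the page itself.

```php
<?php
// Hypothetical helper: weight a page's keywords by tf-idf against a document set.
// $pageWords is the page's word list; $documents is an array of word lists.
function tfidf_weights( array $pageWords, array $documents )
{
    $docCount = count( $documents );

    // Document frequency: in how many documents does each word appear?
    $docFreq = array();
    foreach ( $documents as $docWords )
        foreach ( array_unique( $docWords ) as $word )
            $docFreq[$word] = isset( $docFreq[$word] ) ? $docFreq[$word] + 1 : 1;

    $weights = array();
    foreach ( array_count_values( $pageWords ) as $word => $tf ) {
        // Assumption: a word missing from the set is treated as being in one document.
        $df = isset( $docFreq[$word] ) ? $docFreq[$word] : 1;
        $weights[$word] = $tf * log( $docCount / $df );
    }
    arsort( $weights, SORT_NUMERIC );   // most distinctive keywords first
    return $weights;
}

$docs = array(
    array( 'php', 'keyword', 'php' ),
    array( 'php', 'page' ),
    array( 'php', 'stem' ),
);
$weights = tfidf_weights( $docs[0], $docs );
```

In this made-up set, "php" appears in every document, so its weight drops to zero even though it is the most frequent word on the page, while "keyword" rises to the top.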
