Web page keywords characterize the page's topic for a search engine. Extracting keywords requires that you recognize the page's character encoding, strip away HTML tags, scripts, and styles, decode HTML entities, and remove unwanted punctuation, symbols, numbers, and stop words. This article shows how.
Table of Contents
- Get the page text
- Get the web page from a disk file
- Get the web page from the server
- Determine the page's character encoding
- Convert the page text to UTF-8
- Remove HTML syntax
- Remove unwanted characters
- Process the page's words
- Split the text into a word list
- Stem the words
- Remove stop words
- Remove unwanted words
- Count keyword usage
- Where to go from here
Get the page text
Keyword extraction starts by reading the HTML page into a text string. If you have the page in a file already, just read it in using
file_get_contents( ) or its equivalent. Otherwise, have the keyword script get the page directly from a web server. This ensures that you always get a complete page that includes everything added by server-side scripts and includes.
Whichever way you get the HTML page, once you have the text you'll need to handle the many different character encodings used by international web pages. Let's do this step by step.
Get the web page from a disk file
PHP has several functions to read a file from disk. I'll use
file_get_contents( ), which returns a text string containing the entire file.
$encodedText = file_get_contents( $filename );
Get the web page from the server
PHP has two main ways to connect to a web server and download a page. First, PHP's
fopen( ) wrappers let you use standard file reading functions to open a URL and read page text. Second, PHP's CURL (Client URL) functions let you build a detailed web server request and read the server's response and the returned page text. Both of these are available in standard PHP distributions from 4.0.4 onward.
I've covered these in two separate articles. Both articles include explanations and sample code:
CURL is the more flexible choice. The CURL article above includes a sample
get_web_page( ) function that uses CURL. Pass it a URL and it returns an associative array containing the page text and the web server header. Error codes tell you when the URL is bad or the server is down.
$result = get_web_page( $url );
if ( $result['errno'] != 0 )
    ... Error: bad URL, timeout, or redirect loop ...
if ( $result['http_code'] != 200 )
    ... Error: no server, no permissions, or no page ...
$encodedText = $result['content'];
Determine the page's character encoding
The "content type" for a web file tells you the file's MIME type, such as "image/gif" for a GIF image and "text/html" for an HTML page. The content type for text files also includes a "character encoding" (or "charset") that tells you how the file represents characters. Some encodings use one byte per character, others use several bytes. You'll need to determine the encoding so that you can recognize letters, numbers, punctuation, and other symbols in the text.
There are dozens of character encodings, such as the old ASCII encoding used for American English, ISO 8859 for the Latin alphabet used by English, German, Spanish, French, and others, Big5 for traditional Chinese, Shift JIS for Japanese, and several old Windows-specific encodings. Older web pages may use any of these encodings, but newer pages are shifting to the UTF-8 international standard. This single generic encoding can represent any of Unicode's over 100,000 characters spanning all of the world's languages.
The page's character encoding is usually found in the web server's response header. If it isn't there, you can look for a
<meta> tag at the top of the page instead. If there isn't one, you can try PHP's
mb_detect_encoding( ) to guess at the encoding used by a page.
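That fallback chain (response header, then <meta> tag, then a guess) can be sketched as a small helper. The helper name and the header-parsing patterns here are illustrative assumptions, not code from the articles below:

```php
<?php
// Hypothetical helper: find a page's charset by trying the HTTP
// header, then a <meta> tag, then mb_detect_encoding( ) as a last resort.
function find_charset( $contentTypeHeader, $pageText )
{
    // 1. HTTP header, e.g. "text/html; charset=ISO-8859-1"
    if ( preg_match( '/charset=([\w\-]+)/i', $contentTypeHeader, $m ) )
        return strtolower( $m[1] );

    // 2. <meta http-equiv="Content-Type" ...> or an HTML 5 <meta charset="...">
    if ( preg_match( '/<meta[^>]+charset=["\']?([\w\-]+)/i', $pageText, $m ) )
        return strtolower( $m[1] );

    // 3. Guess from the bytes themselves (imperfect, but better than nothing)
    $guess = mb_detect_encoding( $pageText, array( 'UTF-8', 'ISO-8859-1' ), true );
    return $guess !== false ? strtolower( $guess ) : 'utf-8';
}
```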
I cover several ways to get the content type and character encoding in a separate article. The article explains things a bit more and includes sample code:
If you use the CURL-based
get_web_page( ) function I used in the previous section, it returns the content type in the associative array. A typical content type looks like this:
Parse that with a regular expression to get the character encoding name (the word after "charset").
$contentType = $result['content_type'];
preg_match( '@([\w/+]+)(;\s+charset=(\S+))?@i', $contentType, $matches );
if ( isset( $matches[3] ) )
    $charset = $matches[3];
Convert the page text to UTF-8
It isn't practical to write a custom keyword extractor for every character encoding. Instead, always convert web page text into the generic UTF-8 encoding.
To convert text, use PHP's
iconv( ) function. Pass it the name of the current encoding, the name of the one you want to use ("utf-8"), and the text to convert. The function returns the converted text. If the text is already in UTF-8, the returned text is the same as the original text.
$text = iconv( $charset, "utf-8", $encodedText );
You also can use PHP's
mb_convert_encoding( ) function. The arguments are the same, but in a different order:
$text = mb_convert_encoding( $encodedText, "utf-8", $charset );
In either case, the returned UTF-8 text uses between one and four bytes per character. Some of PHP's string functions won't work right with these multibyte characters. For instance,
strlen( ) will return the number of bytes in a string, not the number of multibyte characters.
So, you'll have to be careful which string functions you use on the text from here forward. In most cases, there are "mb" (multibyte) equivalents of the standard string functions. For example, use
mb_strlen( ) instead of
strlen( ),
mb_substr( ) instead of
substr( ), and so on. Remember to pass the encoding name, "utf-8", as the last argument to all of these functions. There's nothing in the string itself to tell the functions what encoding you're using, and if they use the wrong one you'll get garbled text.
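A quick illustration of why the "mb" variants matter: in UTF-8 the "é" in "café" occupies two bytes, so the byte-counting strlen( ) overcounts.

```php
<?php
$word = "café";   // the "é" is two bytes in UTF-8

// strlen( ) counts bytes, not characters
$bytes = strlen( $word );

// mb_strlen( ) counts characters when told the encoding
$chars = mb_strlen( $word, "utf-8" );

// $bytes is 5, $chars is 4
```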
Several of the functions discussed below use PHP's
preg_replace( ), which works well with multibyte strings as long as you add the
/u pattern modifier. See the Multibyte String Functions manual for much more detail.
Remove HTML syntax
We need to clean the text of all HTML syntax. This includes removing HTML tags, scripts, and styles, and decoding HTML character references and entities used to embed special characters.
Before doing this, you may wish to extract URLs. You can use these to guide spidering of your web site, do link checking, build a site map, and build a table of internal and external links from the page. I cover this in a separate article that includes explanations and sample code:
After you remove HTML tags, you'll be left with a long string of unstructured text. Page headers, footers, sidebars, menus, and the body are all included, without distinguishing one from another. Yet the body text is probably more relevant for our analysis than, say, a copyright statement in the footer. It would be nice if we could use semantic markup to focus on the most relevant text.
Each new version of the HTML standard adds more tags with semantic meaning. HTML 4 defines the
<acronym>, <cite>, and <abbr> tags for acronyms, citations, and abbreviations, plus the (deprecated)
<dir> and <menu> tags for directory lists and menus. The forthcoming HTML 5 defines the
<nav> tag to mark the navigation menu of a page,
<header> and <footer> tags for the page's header and footer,
<aside> for content set off from the main flow, and
<article> tags for the principal content of a page. When properly used, these tags enable page analysis to find the relevant parts of a page and ignore the rest.
Unfortunately, HTML 5 isn't a standard yet and the semantic tags of HTML 4 are not widely used. If you know something about the pages you're analyzing, you can use that knowledge here to throw out uninteresting parts of the page. Otherwise you'll have to process the entire page.
Strip HTML tags, scripts, and styles
PHP's strip_tags( ) function will remove the tags, but it does not remove styles, scripts, or other unwanted text between the tags. When it removes the tags, it also joins together the words before and after each tag. For block-level tags, like
<p>, this is the wrong thing to do and it'll garble your keyword list.
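To see the word-joining problem, and one common workaround (replace each tag with a space before stripping, then collapse the extra whitespace; scripts and styles still need separate handling):

```php
<?php
$html = "<p>First paragraph.</p><p>Second paragraph.</p>";

// strip_tags( ) alone glues the last word of one block to the
// first word of the next: "First paragraph.Second paragraph."
$joined = strip_tags( $html );

// Workaround sketch: turn each tag into a space, then collapse
// runs of whitespace into single spaces.
$spaced = preg_replace( '/<[^>]*>/', ' ', $html );
$clean  = trim( preg_replace( '/\s+/u', ' ', $spaced ) );

// $clean is "First paragraph. Second paragraph."
```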
In a separate article I cover the steps needed to properly remove HTML tags, avoid word-joining, and get rid of embedded scripts and styles. The article includes sample code and more explanations.
The article above includes a
strip_html_tags( ) function to do the job. Pass it the UTF-8 page text and it returns the stripped text.
$text = strip_html_tags( $text );
Decode HTML character references
An "HTML character reference" is a special sequence of characters used to include a symbol that may not be supported by the page's character encoding. These always start with an "&" followed by a name or number and a semi-colon. For instance,
&euro; creates a € symbol and
&copy; creates a © symbol. There are hundreds of named character references, called "HTML entities". Additionally, any Unicode character can be referenced by its decimal or hexadecimal code, such as
&#x03A9; for the Greek letter Omega, Ω.
HTML character references and named entities are often inserted automatically by HTML authoring tools. We need to convert these to normal UTF-8 characters so that they can be processed along with the rest of the text. You can do this with PHP's standard
html_entity_decode( ) function. Unfortunately, the function is often used improperly so I've discussed it in a brief separate article:
Call html_entity_decode( ) and pass it the text to decode, a flag indicating how it should handle quotes, and the name of the character encoding to use ("utf-8"):
$text = html_entity_decode( $text, ENT_QUOTES, "utf-8" );
Remove unwanted characters
At this point you have UTF-8 text stripped of all HTML syntax. However, since we're only interested in keywords, we need to remove the punctuation, special symbol characters, numbers, currency signs, plus and minus, and other math characters.
Beware, however, that stripping away punctuation joins sentences together into a single long string of words. This is fine for extracting keywords, but not phrases. The words at the end of one sentence will blend into those at the start of the next sentence, creating odd word pairings that can give unexpected phrase searching results.
Removing punctuation also destroys the grammatical cues needed by natural language processing. If you intend to use more sophisticated ways of extracting meaning from a page, stop here. But for keyword extraction, we don't need punctuation, symbols, or numbers — just the words.
Strip punctuation characters
Punctuation includes full stops (periods), commas, quotes, brackets, dashes, and so on. In Unicode there are hundreds of these characters. For instance, there are 65 different closing brackets alone. There are 11 types of opening quotes, 18 different dashes, and several hundred special marks for Hebrew, Arabic, Thai, Cuneiform, and other languages.
Punctuation removal is complicated by characters that have different meanings in different contexts. For example, a dash is removable when used parenthetically — like this — but not when used as a hyphen in up-to-date. A full stop (period) is removable punctuation at the end of a sentence, but not when embedded in a domain name like example.com or a file name like index.htm. An apostrophe is removable when used as a quote in 'this', but not in a contraction like can't or a possessive like Dave's.
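A rough approximation of those rules, using a Unicode-aware regular expression: remove a punctuation character only when it is not flanked by word characters on both sides, which preserves intra-word cases like can't, up-to-date, and example.com. The real strip_punctuation( ) handles many more cases; this is only a sketch.

```php
<?php
// Remove punctuation that is NOT both preceded and followed by a
// word character.  Intra-word apostrophes, hyphens, and full stops
// (can't, up-to-date, example.com) survive.  Sketch only.
$text = "Visit example.com — it's up-to-date. Really!";
$text = preg_replace( '/(?<!\w)\p{P}|\p{P}(?!\w)/u', ' ', $text );
$text = trim( preg_replace( '/\s+/u', ' ', $text ) );

// $text is "Visit example.com it's up-to-date Really"
```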
There's too much to cover here, so I discuss this in a separate article that includes sample code and quite a bit of explanation:
The article above defines a
strip_punctuation( ) function that takes a text string and returns the string without punctuation:
$text = strip_punctuation( $text );
Strip symbol characters
Unicode has an enormous number of special symbols. Some are in common use, such as + = < > © ® ™ $ ¢ £ € ¥. Others are only found in mathematics text, such as ± ÷ ∞ ∑ ∅ ∀ ∴ ∪ ∩ ∫, or in other special situations, such as ♜ ♘ ♫ ♥ ♣ ♻ ☯ ✓ ☿ ♋ ☂ ⏄ ⏆ ↑ ↓ ↩ ↪ ♨ ☞ ☼ ☾ ☎. There are 914 different mathematical symbols and another 2,958 more specialty symbols.
As with punctuation removal, there are a few special cases to handle. If you intend to keep numbers in the text (such as dates, IP addresses, and currencies), you'll need to keep currency symbols and remove plus and minus only when they aren't adjacent to the digits of a number. For East Asian languages, you'll need to skip past radical and stroke symbols used to assemble ideographs.
Again, there's too much to cover here. I discuss these special cases in a separate article that includes sample code and further explanations:
The article above defines a
strip_symbols( ) function that takes a text string and returns the string without symbol characters:
$text = strip_symbols( $text );
Strip number characters
Depending upon your needs, numbers in the page may or may not be interesting. If you are processing a financial report or a list of IP addresses, the numbers may be worth keeping. Otherwise, you can strip them out.
While most of the world's languages use the same 0 through 9 Arabic numerals, there are 280 more Unicode digit symbols for languages like Thai, Tibetan, and Balinese. There are another 210 numeric letter symbols for Roman numerals and those in Old Persian, Cuneiform, and others. Add another 336 symbols for fractions, superscripts, subscripts, and specialty digits, 41 currency symbols, and hundreds more for units of measure, like degrees Celsius and square feet.
When removing numbers we also have to watch for non-digits related to those numbers. This includes plus and minus, AM/PM after a time, a colon between the hours and minutes of a time, a dash within a numeric range like 2006-2008, decimal points in floating point numbers, and so on.
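As a sketch of the standalone-number case only: remove digit runs (with an optional sign) that aren't attached to letters, other digits, or a decimal point, so a bare year disappears while an IP address or a word like "utf8" survives. The full strip_numbers( ) covers far more of the cases above.

```php
<?php
// Remove standalone numbers such as "2008", but keep digits that are
// part of something larger: "192.168.0.1", "utf8".  Sketch only.
$text = "Written in 2008 at 192.168.0.1 using utf8 tools";
$text = preg_replace( '/(?<![\p{L}\p{N}.])[+-]?\p{N}+(?![\p{L}\p{N}.])/u', ' ', $text );
$text = trim( preg_replace( '/\s+/u', ' ', $text ) );

// $text is "Written in at 192.168.0.1 using utf8 tools"
```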
Once again, this is too much to cover here. I discuss this in a separate article that includes explanations and sample code:
The article above defines a
strip_numbers( ) function that takes a text string and returns the string without number characters:
$text = strip_numbers( $text );
Convert to lower case
Differences in upper and lower case are not that significant for keywords. So use the multibyte-safe
mb_strtolower( ) function to convert to lowercase.
$text = mb_strtolower( $text, "utf-8" );
Process the page's words
So far, processing has been largely character-based. Now it's time to split the text into words and build a unique keyword list.
Split the text into a word list
PHP's explode( ) function is often used to split text into an array of words. Unfortunately, it isn't safe with multibyte characters. Instead, use the
mb_split( ) function. Its first argument is a pattern for a word delimiter (use a space). The second argument is the string to split. Unlike most of the other "mb" functions,
mb_split( ) doesn't have an argument to select the character encoding to use. Instead, use the
mb_regex_encoding( ) function first to set the character encoding to use.
mb_regex_encoding( "utf-8" );
$words = mb_split( ' +', $text );
Stem the words
Stemming shortens a word to its root. This reduces words like "Stem", "Stems", and "Stemming" to just "Stem". All of the variants of a word collapse into the same stem word, letting us focus on the keywords themselves instead of each minor difference in the way they are used.
For English, the "Porter stemmer" algorithm, by Martin Porter, is widely used. Porter's web page on The Porter Stemming Algorithm explains the algorithm and links to a free PHP implementation.
Stem each word in the word list.
foreach ( $words as $key => $word )
    $words[$key] = PorterStemmer::Stem( $word, true );
Stemming algorithms must be different for different languages. Wikipedia's Stemming page is a good place to start if you need to stem a language other than English.
Remove stop words
English words like "the", "and", "a", "by", and others are very common and make for poor search words. Search indexing can skip these so-called "stop words."
There is no official stop word list for English or any other language. But there are many suggested lists available if you do a web search on "stop words". The one I use has about 700 words.
To use a stop word list, read it into a string and split it into an array of words. Be sure to use UTF-8, lower case, and stems of the stop words:
$stopText = file_get_contents( $stopWordsFilename );
$stopWords = mb_split( '[ \n]+', mb_strtolower( $stopText, 'utf-8' ) );
foreach ( $stopWords as $key => $word )
    $stopWords[$key] = PorterStemmer::Stem( $word, true );
To remove stop words, use PHP's
array_diff( ) function to remove any word in the word list that is also in the stop word list.
$words = array_diff( $words, $stopWords );
Remove unwanted words
Beyond stop words, extremely common words, like "web" or "html", make poor keywords. Depending upon your needs, you may strip your word list of poor keywords at this stage. This can significantly shorten the keyword list and simplify further processing.
For my own use, I maintain a long list of words to remove. The list primarily contains adjectives (like "big"), adverbs (like "quickly"), pronouns (like "he" and "she"), prepositions (like "to" and "from"), conjunctions (like "but"), and interjections (like "Ouch!"). Very generic verbs and nouns are also on the list (like "edit" and "html"). After removing these words, I'm left with a strong list of keywords that characterize the page's text.
To remove unwanted words, use PHP's
array_diff( ) function again and your unwanted word list:
$words = array_diff( $words, $unwantedWords );
Count keyword usage
PHP's array_count_values( ) function counts the number of times each value occurs in an array. It returns a list of those keywords as array keys, and their counts as the array values. Sort on those values using the
arsort( ) function to get a keyword list ordered from most to least frequently occurring.
$keywordCounts = array_count_values( $words );
arsort( $keywordCounts, SORT_NUMERIC );
Use PHP's array_keys( ) on the array to get a list of unique keywords:
$uniqueKeywords = array_keys( $keywordCounts );
Be careful if you want to sort the keywords alphabetically. PHP's default alphabetic sorts for English won't sort well for other languages. Instead, you'll need the Unicode Collation Algorithm, which is beyond the scope of this article.
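That said, if your PHP build includes the intl extension, its Collator class implements the Unicode Collation Algorithm for you; a minimal sketch, assuming the extension is available:

```php
<?php
// Locale-aware alphabetic sort via the intl extension's Collator,
// which implements the Unicode Collation Algorithm.  A plain sort( )
// would order "résumé" after "zebra" by byte value.
$keywords = array( 'résumé', 'zebra', 'apple' );
if ( class_exists( 'Collator' ) )
{
    $collator = new Collator( 'en_US' );
    $collator->sort( $keywords );   // apple, résumé, zebra
}
```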
Where to go from here
For simple search indexing, add your keywords to the search index for the page.
Page keywords and their frequency roughly characterize the page's topic. Tag clouds have become a trendy way to show that topic. Take the keyword counts list above (before sorting it), and print out each keyword. Increase the keyword's font size for more frequent keywords.
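A minimal tag-cloud sketch: scale each keyword's font size linearly between a smallest and largest size based on its count. The helper name and the 10-to-30 point range are arbitrary choices for illustration.

```php
<?php
// Hypothetical tag cloud helper: font size grows linearly with
// keyword count, from $minSize (rarest) to $maxSize (most frequent).
function tag_cloud_sizes( $keywordCounts, $minSize = 10, $maxSize = 30 )
{
    $max   = max( $keywordCounts );
    $sizes = array( );
    foreach ( $keywordCounts as $keyword => $count )
        $sizes[$keyword] = (int) round(
            $minSize + ( $maxSize - $minSize ) * $count / $max );
    return $sizes;
}
```

Emit each keyword in a <span> whose font-size comes from the returned array.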
Actually, keyword frequency isn't a very reliable way to characterize a page topic. Instead, look at the Vector space model. The idea is to weight the keywords both by their frequency and by their global importance across some large document set. All the pages in Google's index, for instance, would be a great document set. The more often a keyword occurs in that document set, the less well that keyword distinguishes one document from another and the worse that keyword is for a search index. We already did something like this by removing stop words and other common words. The vector space model describes the math needed to weight your keywords by importance, rather than just frequency.
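The usual weighting in the vector space model is tf-idf: a keyword's on-page frequency times the log of how rare it is across the document set. A sketch, with made-up document frequencies:

```php
<?php
// tf-idf: weight a keyword by its frequency in this page ($tf) times
// the log of its rarity across a document set of $N pages, $df of
// which contain the keyword (idf).
function tf_idf( $tf, $df, $N )
{
    return $tf * log( $N / $df );
}

// A keyword on 10 of 1,000 pages outweighs one on 900 of 1,000,
// even at the same on-page frequency.
$rare   = tf_idf( 5, 10,  1000 );
$common = tf_idf( 5, 900, 1000 );
```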