HTML entities encode special characters and symbols, such as &euro; for € or &copy; for ©. When building a PHP search engine or web page analysis tool, HTML entities within a page must be decoded into single characters to get clean, parsable text. PHP’s standard html_entity_decode() function will do the job, but you must use a rich character encoding, such as UTF-8, and multibyte character strings. This tip shows how.
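As a rough illustration of the decoding step, the sketch below wraps html_entity_decode() in a small helper. The helper name decode_entities_utf8() and the sample string are made up for this example, and the flag choice (ENT_QUOTES | ENT_HTML5, which needs PHP 5.4 or later) is one reasonable option rather than the only one.

```php
<?php
// Hypothetical helper (not from the original tip): decode every HTML entity
// in $html into a UTF-8 encoded character.
function decode_entities_utf8($html)
{
    // ENT_QUOTES also decodes &quot; and &#039;.
    // ENT_HTML5 selects the HTML5 named-entity table.
    // 'UTF-8' is the target encoding, so symbols such as the euro sign
    // are not lost the way they would be in a single-byte encoding.
    return html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Made-up sample input for illustration.
$sample = 'Price: &euro;10 &copy; Example &amp; Co.';
echo decode_entities_utf8($sample), "\n";
```

Because the decoded text may contain multibyte characters, later processing would normally use the mb_* string functions (for example mb_strlen() or mb_substr() with 'UTF-8') instead of their byte-oriented counterparts.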
A web page’s content type tells you the page’s MIME type (such as “text/html” or “image/png”) and the character set used by the page’s text. You’ll need the character set to interpret the page’s characters correctly when processing text for a search engine or keyword extractor. The content type should be in the web server’s HTTP header for the page, but it can also be set in an HTML file’s <meta> tag or an XML file’s <?xml> declaration. This tip shows how to get the page’s content type and extract the MIME type and character set.
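A minimal sketch of the extraction step might look like the following. It assumes the content type value has already been obtained (for instance from the HTTP Content-Type header) and only splits it into its parts; the function name parse_content_type() and the returned array keys are assumptions made for this example.

```php
<?php
// Hypothetical helper: split a content type value such as
// "text/html; charset=utf-8" into its MIME type and character set.
function parse_content_type(string $contentType): array
{
    // The MIME type is everything before the first semicolon.
    $parts = explode(';', $contentType);
    $mime = strtolower(trim(array_shift($parts)));

    // Look through the remaining parameters for charset=...
    $charset = null;
    foreach ($parts as $param) {
        // Accept both charset=utf-8 and charset="utf-8".
        if (preg_match('/^\s*charset\s*=\s*"?([^";\s]+)"?/i', $param, $m)) {
            $charset = strtoupper($m[1]);
            break;
        }
    }

    // 'charset' stays null when the value does not name one.
    return ['mime' => $mime, 'charset' => $charset];
}

$info = parse_content_type('text/html; charset=utf-8');
echo $info['mime'], ' ', $info['charset'], "\n";
```

The same helper could be applied to the content attribute of an HTML <meta http-equiv="Content-Type"> tag, since that attribute uses the same "type; charset=..." layout. A null charset signals that the caller needs to fall back to another source or a default.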
The first step when building a PHP search engine, link checker, or keyword extractor is to get the web page from the web server. There are several ways to do this. From PHP 4 onwards, the most flexible way uses PHP’s CURL (Client URL) functions. This tip shows how.
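One way the fetch could be sketched with PHP’s cURL extension is shown below. The function name fetch_page(), the returned array keys, and the redirect and timeout limits are illustrative assumptions, not values taken from the tip; the code also assumes the cURL extension is enabled and uses PHP 7 type declarations.

```php
<?php
// Hypothetical helper: fetch a URL with cURL and return the body together
// with the HTTP status code and the server-reported content type.
function fetch_page(string $url): array
{
    $ch = curl_init($url);

    // Return the response as a string instead of printing it.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Follow redirects, with an arbitrary cap of 5 for this sketch.
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
    // Give up after 15 seconds (also an arbitrary choice).
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);

    $body = curl_exec($ch);

    // curl_exec() returns false on failure; curl_error() explains why.
    // The handle is released automatically when $ch goes out of scope.
    return [
        'body'         => $body === false ? null : $body,
        'error'        => $body === false ? curl_error($ch) : null,
        'status'       => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'content_type' => curl_getinfo($ch, CURLINFO_CONTENT_TYPE),
    ];
}
```

Calling fetch_page() with a page URL returns an array whose 'body' entry holds the page text (or null on failure) and whose 'content_type' entry holds the server’s Content-Type value, which can then be split into MIME type and character set.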