June, 2007

  • June 30, 2007

    HTML entities encode special characters and symbols, such as &euro; for € or &copy; for ©. When building a PHP search engine or web page analysis tool, you must decode the HTML entities within a page into single characters to get clean, parsable text. PHP’s standard html_entity_decode() function will do the job, but you must use a rich character encoding, such as UTF-8, and multibyte character strings. This tip shows how.

  • June 16, 2007

    A web page’s content type tells you the page’s MIME type (such as “text/html” or “image/png”) and the character set used by the page’s text. You need the character set to interpret the page’s characters when processing text for a search engine or keyword extractor. The content type should be in the web server’s HTTP header for the page, but it can also be set in an HTML file’s <meta> tag or an XML file’s <?xml ?> declaration. This tip shows how to get the page’s content type and extract the MIME type and character set.

  • June 10, 2007

    The first step when building a PHP search engine, link checker, or keyword extractor is to get the web page from the web server. There are several ways to do this. From PHP 4 onwards, the most flexible way uses PHP’s cURL (Client URL) functions. This tip shows how.
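
As a rough illustration of how the three tips listed above fit together, here is a minimal sketch. It is not the code from the tips themselves: the function names fetchPage(), parseContentType(), and decodeEntities() are hypothetical, the timeout value is arbitrary, and the sketch assumes the content type comes from the HTTP header only (it does not look at <meta> tags or XML declarations). It also assumes the cURL and mbstring extensions are available when those code paths are used.

```php
<?php
// Hypothetical helper: fetch a page with PHP's cURL functions.
// Returns [body, contentTypeHeader], or null if the request failed.
function fetchPage(string $url): ?array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // arbitrary timeout, in seconds
    $body = curl_exec($ch);
    $type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    return $body === false ? null : [$body, $type];
}

// Hypothetical helper: split a Content-Type value such as
// "text/html; charset=UTF-8" into a MIME type and a character set.
// The charset is null when the header does not name one.
function parseContentType(string $header): array
{
    $parts = explode(';', $header);
    $mime = strtolower(trim(array_shift($parts)));
    $charset = null;
    foreach ($parts as $param) {
        if (preg_match('/^\s*charset\s*=\s*"?([^";\s]+)"?/i', $param, $m)) {
            $charset = strtolower($m[1]);
        }
    }
    return ['mime' => $mime, 'charset' => $charset];
}

// Hypothetical helper: convert text to UTF-8 if needed, then decode
// HTML entities into single (multibyte) characters.
function decodeEntities(string $text, string $charset = 'UTF-8'): string
{
    if (strcasecmp($charset, 'UTF-8') !== 0) {
        // Assumes $charset is an encoding name that mbstring recognizes.
        $text = mb_convert_encoding($text, 'UTF-8', $charset);
    }
    return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Usage without any network access:
$info = parseContentType('text/html; charset=UTF-8');
echo $info['mime'], ' / ', $info['charset'], "\n";
echo decodeEntities('&euro;5 &copy; 2007', $info['charset']), "\n";
```

In a real crawler, the header string would come from fetchPage(), and a missing charset would need a fallback, such as checking the page's <meta> tag as described in the June 16 tip.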

Nadeau software consulting