Text processing

  • January 6, 2008

    The starting point for building a link checker, web spider, or web page analyzer is, of course, to get the web page from the web server. Java's java.net package includes classes to manage URLs and to open web server connections. This tip shows how to use them to get a text, image, audio, or data file from a web server.

  • August 8, 2009

    Java has several ways to parse integers from strings. Performance differences between these methods can be significant when parsing a large number of integers. Doing your own integer parsing can provide an important speed boost. This tip looks at five ways to parse integers, compares their features, and benchmarks them to see which method is the fastest.

  • May 27, 2008

    An absolute URL is complete and ready to use to download a web file. But web pages often include relative URLs that omit parts, such as the "http" scheme, the host name, or the first part of a file path. These parts need to be filled in by copying them from an absolute base URL. This article shows how and includes code to do it.

  • June 30, 2007

    HTML entities encode special characters and symbols, such as &euro; for €, or &copy; for ©. When building a PHP search engine or web page analysis tool, HTML entities within a page must be decoded into single characters to get clean parsable text. PHP’s standard html_entity_decode() function will do the job, but you must use a rich character encoding, such as UTF-8, and multibyte character strings. This tip shows how.

  • April 13, 2008

    Web page keywords characterize the page's topic for a search engine. Extracting keywords requires that you recognize the page's character encoding, strip away HTML tags, scripts, and styles, decode HTML entities, and remove unwanted punctuation, symbols, numbers, and stop words. This article shows how.

  • January 3, 2008

    Though HTML is usually the focus for extracting URLs for a link checker or analysis tool, CSS files also include URLs. The CSS @import rule uses a URL to include another CSS file, and many style properties include a URL to load an image or other content. This tip shows how to scan a CSS file and extract its URLs.

  • January 3, 2008

    URL extraction is at the core of link checkers, search engine spiders, and a variety of web page analysis tools. While <a> and <img> elements are primary sources of URLs, there are more than 70 element attributes with URLs in HTML, XHTML, WML, and assorted HTML extensions. This tip shows how to extract URLs from all of these.

  • June 16, 2007

    A web page’s content type tells you the page's MIME type (such as “text/html” or “image/png”) and the character set used by page text. You'll need the character set to interpret the page's characters for text processing for a search engine or keyword extractor. The content type should be in the web server’s HTTP header for the page, but it also can be set in an HTML file’s <meta> tag, or an XML file’s <?xml> tag. This tip shows how to get the page's content type and extract the MIME type and character set.

  • June 10, 2007

    The first step when building a PHP search engine, link checker, or keyword extractor is to get the web page from the web server. There are several ways to do this. From PHP 4 onwards, the most flexible way uses PHP’s CURL (Client URL) functions. This tip shows how.

  • July 14, 2007

    PHP’s fopen wrappers enable the standard file functions to read web pages from a web server. A few additional calls are needed to set parameters for a web server request and to get the server’s HTTP response header. This tip shows how.
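
As a rough illustration of the relative URL item above (May 27, 2008), the missing parts of a relative URL can be filled in from an absolute base URL with Java's standard java.net.URI class. This is a minimal sketch, not the code from the linked article; the class name ResolveUrlSketch, the helper resolve(), and the example.com URLs are made up for illustration.

```java
import java.net.URI;

public class ResolveUrlSketch {
    // Hypothetical helper: fill in the missing parts of a relative URL
    // (scheme, host, leading path) by copying them from an absolute base URL.
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        // Assumed example base URL; any absolute URL works the same way.
        String base = "http://example.com/docs/guide/index.html";

        // Path relative to the base document's directory.
        System.out.println(resolve(base, "images/logo.png"));
        // Parent-directory reference.
        System.out.println(resolve(base, "../faq.html"));
        // Host-relative path: only the scheme and host come from the base.
        System.out.println(resolve(base, "/about.html"));
        // Scheme-relative URL: only the scheme comes from the base.
        System.out.println(resolve(base, "//cdn.example.com/site.css"));
    }
}
```

URI.resolve() follows the generic URI resolution rules, so an already absolute URL passed as the second argument is returned unchanged.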

Nadeau software consulting