Java tip: How to get a web page

Technologies: Java 5+

The starting point for building a link checker, web spider, or web page analyzer is, of course, to get the web page from the web server. Java's java.net package includes classes to manage URLs and to open web server connections. This tip shows how to use them to get a text, image, audio, or data file from a web server.

Introduction

Downloading a web file sends a "request" to a web server using the standard HTTP 1.1 protocol. The server processes your request and sends you a "response". The response's "header" tells you the file's size, last modified date, MIME type, and other useful information. Finally, the response's payload is the file itself.

Java has several ways to do this request-response interchange between your application and a web server. Most of them hide these details. While this simplifies the process, a lot of control and error handling is lost. The simpler approaches don't let you check the response header to see if you got an error and what type of error it was. These approaches also don't let you ask for or receive compressed content and they don't tell you the MIME type and character encoding of returned data.

I don't advocate using the simpler approaches that hide the request-response interchange. I list them at the end of this article if you're interested. Instead, this article shows you how to use the lower-level classes that give you more control over requests and responses. As you'll see, this isn't that hard.

Code

First off is the complete code for a sample WebFile class that illustrates the use of java.net's URL, URLConnection, and HttpURLConnection classes. The constructor takes a URL string, opens a web server connection, sends a server request, checks the response, and reads the file (text, image, audio, etc.). If needed, the file is transcoded from the server's character encoding into a Java String of Unicode characters. Methods on the class return important header values and the web file content.

In the sections after the code I'll explain the code, the request-response steps, and how to use the appropriate Java classes.

/**
 * Get a web file.
 */
public final class WebFile {
    // Saved response.
    private java.util.Map<String,java.util.List<String>> responseHeader = null;
    private java.net.URL responseURL = null;
    private int responseCode = -1;
    private String MIMEtype  = null;
    private String charset   = null;
    private Object content   = null;
 
    /** Open a web file. */
    public WebFile( String urlString )
        throws java.net.MalformedURLException, java.io.IOException {
        // Open a URL connection.
        final java.net.URL url = new java.net.URL( urlString );
        final java.net.URLConnection uconn = url.openConnection( );
        if ( !(uconn instanceof java.net.HttpURLConnection) )
            throw new java.lang.IllegalArgumentException(
                "URL protocol must be HTTP." );
        final java.net.HttpURLConnection conn =
            (java.net.HttpURLConnection)uconn;
 
        // Set up a request.
        conn.setConnectTimeout( 10000 );    // 10 sec
        conn.setReadTimeout( 10000 );       // 10 sec
        conn.setInstanceFollowRedirects( true );
        conn.setRequestProperty( "User-agent", "spider" );
 
        // Send the request.
        conn.connect( );
 
        // Get the response.
        responseHeader    = conn.getHeaderFields( );
        responseCode      = conn.getResponseCode( );
        responseURL       = conn.getURL( );
        final int length  = conn.getContentLength( );
        final String type = conn.getContentType( );
        if ( type != null ) {
            final String[] parts = type.split( ";" );
            MIMEtype = parts[0].trim( );
            for ( int i = 1; i < parts.length && charset == null; i++ ) {
                final String t  = parts[i].trim( );
                final int index = t.toLowerCase( ).indexOf( "charset=" );
                if ( index != -1 )
                    charset = t.substring( index+8 );
            }
        }
 
        // Get the content.
        final java.io.InputStream stream = conn.getErrorStream( );
        if ( stream != null )
            content = readStream( length, stream );
        else if ( (content = conn.getContent( )) != null &&
            content instanceof java.io.InputStream )
            content = readStream( length, (java.io.InputStream)content );
        conn.disconnect( );
    }
 
    /** Read stream bytes and transcode. */
    private Object readStream( int length, java.io.InputStream stream )
        throws java.io.IOException {
        final int buflen = Math.max( 1024, Math.max( length, stream.available() ) );
        final byte[] buf = new byte[buflen];
        final java.io.ByteArrayOutputStream out =
            new java.io.ByteArrayOutputStream( buflen );

        // Append only the bytes actually read on each pass. A read may
        // return fewer bytes than the buffer holds.
        for ( int nRead = stream.read(buf); nRead != -1; nRead = stream.read(buf) )
            out.write( buf, 0, nRead );
        final byte[] bytes = out.toByteArray( );
 
        if ( charset == null )
            return bytes;
        try {
            return new String( bytes, charset );
        }
        catch ( java.io.UnsupportedEncodingException e ) { }
        return bytes;
    }
 
    /** Get the content. */
    public Object getContent( ) {
        return content;
    }
 
    /** Get the response code. */
    public int getResponseCode( ) {
        return responseCode;
    }
 
    /** Get the response header. */
    public java.util.Map<String,java.util.List<String>> getHeaderFields( ) {
        return responseHeader;
    }
 
    /** Get the URL of the received page. */
    public java.net.URL getURL( ) {
        return responseURL;
    }
 
    /** Get the MIME type. */
    public String getMIMEType( ) {
        return MIMEtype;
    }
}

Examples

Open a WebFile and get the HTML text, if any:

WebFile file   = new WebFile( "http://example.com" );
String MIME    = file.getMIMEType( );
Object content = file.getContent( );
if ( MIME.equals( "text/html" ) && content instanceof String )
{
    String html = (String)content;
    ...
}

Open a WebFile and get the image, if any:

WebFile file   = new WebFile( "http://example.com/example.gif" );
String MIME    = file.getMIMEType( );
Object content = file.getContent( );
if ( MIME.startsWith( "image" ) && content instanceof java.awt.Image )
{
    java.awt.Image image = (java.awt.Image)content;
    ...
}

Open a WebFile and get the audio clip, if any:

WebFile file   = new WebFile( "http://example.com/example.aiff" );
String MIME    = file.getMIMEType( );
Object content = file.getContent( );
if ( MIME.startsWith( "audio" ) && content instanceof java.applet.AudioClip )
{
    java.applet.AudioClip audio = (java.applet.AudioClip)content;
    ...
}

Explanation

The java.net package's URL, URLConnection, and HttpURLConnection classes have been available since JDK 1.1. Getting a web file using these classes always includes these steps:

  1. Create a URL object from a URL string.
  2. Open a URLConnection object from the URL object.
  3. Set up the web server request by calling set* methods on the URLConnection object.
  4. Send the request to the web server by calling connect() on the URLConnection object.
  5. Get the web server response by calling get* methods on the URLConnection object.
  6. Decode the file content based upon the content type.

So, let's walk through each of these steps.

Creating a URL

Starting with a URL string like "http://example.com", simply call the URL( string ) constructor. A MalformedURLException is thrown if there is a URL syntax problem.

URL url = new URL( "http://example.com" );
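
As a minimal sketch of handling that exception, the helper below returns null instead of throwing when the string has a syntax problem. The names UrlParsing and parseOrNull are made up for this example and are not part of java.net.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not part of java.net.
final class UrlParsing {
    /** Parse a URL string, returning null if its syntax is bad. */
    static URL parseOrNull( String urlString ) {
        try {
            return new URL( urlString );
        }
        catch ( MalformedURLException e ) {
            // Missing or unknown protocol, or another syntax problem.
            return null;
        }
    }
}
```

Returning null suits a spider that wants to skip bad links quietly. An application that needs to report the problem should let the exception pass through instead.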

Opening a URL connection

To prepare for communications with a web server, call URL.openConnection( ) to get a URLConnection object. Despite the method name, no communication actually occurs yet. Instead, the object provides methods to build a web server request, then issue that request and get the server's response.

URLConnection uconn = url.openConnection( );

A URLConnection is a generic interface to several protocols supported by Java. For the HTTP protocol used by web servers, the object is actually an HttpURLConnection object, which offers several necessary methods for web server communications. So, cast the URLConnection object to an HttpURLConnection when you're dealing with a web server.

HttpURLConnection conn = (HttpURLConnection)uconn;

Setting up a request

Methods on the URLConnection enable you to configure the file request. Here are the essentials:

  • setConnectTimeout( ). Fail if a web server doesn't respond to a connection within a time limit (in milliseconds). The default value is 0, which waits forever and is clearly undesirable. Ten or twenty seconds is more reasonable.
  • setReadTimeout( ). Fail if a web server doesn't return the web file within a time limit (in milliseconds). Again, the default value is 0, which waits forever and is clearly bad. Ten or twenty seconds is a better limit.
  • setInstanceFollowRedirects( ). When true, follow web page redirects automatically. When false, only make the first request and stop. The default is true unless it has been changed for the whole application with the static HttpURLConnection.setFollowRedirects( ), so setting it here mainly makes the intent explicit.
  • setRequestProperty( ). Set other request properties. See below.

conn.setConnectTimeout( 10000 );
conn.setReadTimeout( 10000 );
conn.setInstanceFollowRedirects( true );
conn.setRequestProperty( "User-agent", "spider" );

Use setRequestProperty( ) to set special properties in the request. The most important is the "User-Agent", which gives the name of the application making the request (such as a web browser). If left empty or set to a non-standard value, some web servers will reject the request. UserAgentString.com has a good list of known user agent strings for common web browsers and spiders. However, it is poor web etiquette to lie and send a bogus user agent string. Instead, send something that identifies the name and purpose of the application.
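
The request setup described above can be gathered into one helper. This is only a sketch: RequestSetup and prepare are names invented for illustration, and the caller supplies whatever honest agent string describes the application (for instance a hypothetical "ExampleLinkChecker/1.0"). Nothing is sent to the server here, since openConnection( ) only creates the object.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper for this example.
final class RequestSetup {
    /**
     * Create (but do not connect) an HTTP connection with timeouts and
     * a user agent set. Assumes the URL uses HTTP; any other protocol
     * makes the cast fail with a ClassCastException.
     */
    static HttpURLConnection prepare( String urlString, String userAgent )
        throws IOException {
        final HttpURLConnection conn =
            (HttpURLConnection)new URL( urlString ).openConnection( );
        conn.setConnectTimeout( 10000 );    // 10 sec
        conn.setReadTimeout( 10000 );       // 10 sec
        conn.setInstanceFollowRedirects( true );
        conn.setRequestProperty( "User-Agent", userAgent );
        return conn;
    }
}
```

Request properties must be set before the connection is made. Calling setRequestProperty( ) afterwards throws an IllegalStateException.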

Other settable request properties include the "Authorization" for logging in to a site, the "Referer" for the URL of a page that led you to the requested file, or any of several "If" properties for conditionally getting the file. All of these are standard HTTP 1.1 request header fields discussed in the HTTP 1.1 specification. Wikipedia has a nice summary of HTTP.

If you are sending a form, you can configure the request as a GET or a POST using setRequestMethod( ). The same method also supports HEAD, OPTIONS, PUT, DELETE, and TRACE requests. The HTTP 1.1 specification discusses these types of requests. For most situations, use the default GET request.
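
For example, a link checker that only needs to know whether a page exists might use HEAD, which asks the server for the response header without the content. The sketch below assumes an HTTP URL; HeadRequest and prepareHead are illustrative names only.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper for this example.
final class HeadRequest {
    /** Prepare a HEAD request, which asks for the header but no content. */
    static HttpURLConnection prepareHead( String urlString )
        throws IOException {
        final HttpURLConnection conn =
            (HttpURLConnection)new URL( urlString ).openConnection( );
        // Throws java.net.ProtocolException for an unsupported method name.
        conn.setRequestMethod( "HEAD" );
        return conn;
    }
}
```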

Sending the request

Call the connect( ) method to connect to the web server, send your request, and collect the server's response. An IOException is thrown if there is a problem.

conn.connect( );

Actually, this call is optional. Calling any of the response methods below will automatically call connect( ) if needed.

Getting the response

Once your request is sent, response values are available from get* methods. Here are the essentials:

  • getHeaderFields( ). Get all of the response header fields as a Map from each field name to a list of its string values. The most common header fields are also available by dedicated methods, like those in the next few bullets here.
  • getResponseCode( ). Get the HTTP numeric response code from the header. This is important since it tells you if your request succeeded. See the HTTP specification or Wikipedia's List of HTTP status codes. On success, the code is 200. 404 means the page wasn't found, 503 means the site was down, and so on for about 50 standard codes. The HttpURLConnection class also has constants for the codes, such as HTTP_OK for 200, HTTP_NOT_FOUND for 404, etc.
  • getURL( ). Before sending the request, this gets the original URL. But after getting a response, this gets the URL from that response. Normally, they are the same. But if the response is a redirect to another page, this will be the URL of that page.
  • getContentType( ). Get the content type string from the header. This is discussed more below.
  • getContentEncoding( ). Get the encoding (compression) used by the content. Web servers are supposed to always return uncompressed content unless the request includes an "Accept-encoding" property. If you've included such a property, the content still might not be compressed if the server doesn't support it or if it didn't want to do it for some reason. Use getContentEncoding( ) and look for a value like "gzip" to see if the returned content is compressed.
  • getContentLength( ). Get the length of the content, in bytes. This value will be a -1 if the length is not known. This is quite common and occurs when content is sent in bursts by server-side PHP, Perl, or Java code generating the content on-the-fly. When you get a -1 for the content length, you can't size a buffer in advance and must simply read the stream until it reports the end of the data.

Map<String,List<String>> header = conn.getHeaderFields( );
int responseCode       = conn.getResponseCode( );
URL responseURL        = conn.getURL( );
String contentType     = conn.getContentType( );
String contentEncoding = conn.getContentEncoding( );
int contentLength      = conn.getContentLength( );
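
If you do ask for compressed content, the value from getContentEncoding( ) tells you whether the stream needs to be uncompressed as you read it. Here is a minimal sketch that handles only "gzip" and passes everything else through untouched. ContentDecoding and decode are names invented for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for this example.
final class ContentDecoding {
    /**
     * Wrap a response stream so that reading it yields uncompressed
     * bytes. The encoding argument is the value returned by
     * getContentEncoding( ), which may be null.
     */
    static InputStream decode( String encoding, InputStream raw )
        throws IOException {
        if ( encoding != null && encoding.trim( ).equalsIgnoreCase( "gzip" ) )
            return new GZIPInputStream( raw );
        return raw;     // not compressed, or an encoding this sketch ignores
    }
}
```

To use it, add conn.setRequestProperty( "Accept-Encoding", "gzip" ) to the request, then read from ContentDecoding.decode( conn.getContentEncoding( ), conn.getInputStream( ) ) instead of reading the raw stream.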

There are a few less essential methods as well. Call getDate( ) to get the date and time of the server's response. Call getLastModified( ) for the date the content was last modified on the server, and getExpiration( ) to get the server's recommended cache expiration date. But if you aren't planning on managing a cache, you can ignore these.

The getHeaderFields( ) method returns a rather cumbersome Map of named lists of strings. This map includes the content type, content encoding, content length, etc., retrieved more easily by the above methods. It also includes the "Set-Cookie" field for cookies, "Keep-Alive" for controlling connections that remain open for a while, "Server" for the name and version number of the web server, "Cache-Control" for the server's preferences on caching the returned content, and others. All of these are explained further in the HTTP 1.1 specification. But if you are just getting a single web file, you can usually ignore these.
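
If you do need a field from that Map, a small lookup helper takes some of the pain out of it. The sketch below is not part of java.net; HeaderLookup and firstValue are names made up here. It compares names without regard to case, since HTTP header names are case-insensitive, and it skips the null key under which the status line is commonly stored.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper for this example.
final class HeaderLookup {
    /**
     * Find the first value of a named header field, ignoring case.
     * Returns null if the field is absent.
     */
    static String firstValue( Map<String,List<String>> header, String name ) {
        for ( Map.Entry<String,List<String>> e : header.entrySet( ) ) {
            final String key = e.getKey( );     // may be null (status line)
            if ( key != null && key.equalsIgnoreCase( name )
                && !e.getValue( ).isEmpty( ) )
                return e.getValue( ).get( 0 );
        }
        return null;
    }
}
```

For a single field, conn.getHeaderField( name ) does much the same job directly on the connection.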

Getting the content

Finally, there are two methods to get the payload of the server's response:

  • getErrorStream( ). Get the InputStream to access the web server's error response, if any. Some response codes, such as 404 for a missing file, tell you that an error occurred when trying to get the file. Nevertheless, web servers usually send a custom error page to be shown to the user. getErrorStream( ) returns an InputStream for reading this error page. It also returns null if no error occurred. Checking for this null is a quick way to detect an error.
  • getContent( ). Get the web file's content if no error occurred. On an error it throws an exception.

java.io.InputStream errorStream = conn.getErrorStream( );
if ( errorStream == null ) {
    Object content = conn.getContent( );
    ...
}
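
The null check tells you that something went wrong, but not what. For that, look at the value from getResponseCode( ). As a rough sketch, the helper below sorts a code into the broad classes defined by HTTP, using two of the HttpURLConnection constants. ResponseCodes, describe, and the returned labels are all invented for this example.

```java
import java.net.HttpURLConnection;

// Hypothetical helper for this example.
final class ResponseCodes {
    /** Describe a response code by its HTTP class (2xx, 3xx, 4xx, 5xx). */
    static String describe( int code ) {
        if ( code == HttpURLConnection.HTTP_OK )
            return "ok";
        if ( code >= 200 && code < 300 )
            return "success";
        if ( code >= 300 && code < 400 )
            return "redirect";
        if ( code == HttpURLConnection.HTTP_NOT_FOUND )
            return "not found";
        if ( code >= 400 && code < 500 )
            return "client error";
        if ( code >= 500 && code < 600 )
            return "server error";
        return "unknown";   // includes -1, returned when the response is not valid HTTP
    }
}
```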

There are several types of content that can be delivered by a web server, such as HTML text, images, audio clips, videos, ZIP archives, etc. Since URLConnection is generic, its getContent( ) method simply returns an Object. The type of Object depends upon the content:

  • Image content returns a java.awt.Image object, if the image file format is supported by Java, such as GIF, JPEG, and PNG.
  • Audio content returns a java.applet.AudioClip object, if the audio file format is supported by Java, such as AIFF and WAV.
  • HTML and plain text, ZIP files, JAR files, unrecognized image and audio types, and all other content returns a java.io.InputStream so that the application can parse the content itself. You can also get this stream by calling getInputStream( ).

The getContentType( ) method listed earlier returns the type of content returned by the server. For instance, if it reads "image/gif", getContent( ) returns an Image constructed from a GIF file. You can do string comparisons on the content type, but it is easier to use instanceof to check the object returned by getContent( ):

Object content = conn.getContent( );
if ( content instanceof java.awt.Image )
{
    java.awt.Image image = (java.awt.Image)content;
    ...
}
else if ( content instanceof java.applet.AudioClip )
{
    java.applet.AudioClip audio = (java.applet.AudioClip)content;
    ...
}
else
{
    java.io.InputStream stream = (java.io.InputStream)content;
    ...
}

However, none of this is guaranteed! The Java API does not specify what types of objects may be returned by getContent( ). Instead, it leaves this to whatever ContentHandler objects are returned by a ContentHandlerFactory set on the URLConnection. The defaults are not specified and are implemented by Sun's internal code. While that code has been released as Open Source along with Java, there is no requirement that other vendors support that code in their Java releases. Vendors may skip it, change it, or add their own. For instance, Apple's Java in Mac OS X uses different image code that is probably more efficient on its platform. Apple's code still returns an Image object, but not via Sun's own implementation.

Sun's default content handlers are all in the sun.net.www.content package and include the following classes as of Java 5:

  • sun.net.www.content.audio.aiff
  • sun.net.www.content.audio.wav
  • sun.net.www.content.image.gif
  • sun.net.www.content.image.jpeg
  • sun.net.www.content.image.png
  • sun.net.www.content.text.plain

Notice that none of these are for HTML. To handle HTML text, you can write your own ContentHandler or, more easily, process the bytes from an InputStream.

If you decide to write your own ContentHandler, Java Boutique has a brief tutorial on Creating Content and Protocol Handlers in Java.

Interpreting the content

When getContent( ) returns an InputStream (such as for HTML text), you can use a loop to read all of the stream's data into a byte array. Note that these are bytes, not characters.
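
Such a loop can be sketched as follows. It makes no assumption about the content length, so it works even when the server did not send one. StreamBytes and readAll are names chosen for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for this example.
final class StreamBytes {
    /** Read a stream to its end and return everything as one byte array. */
    static byte[] readAll( InputStream stream ) throws IOException {
        final ByteArrayOutputStream out = new ByteArrayOutputStream( );
        final byte[] buf = new byte[4096];
        // read( ) may return fewer bytes than requested; -1 marks the end.
        for ( int n = stream.read( buf ); n != -1; n = stream.read( buf ) )
            out.write( buf, 0, n );
        return out.toByteArray( );
    }
}
```

The important detail is to append only the n bytes each read( ) actually returned, since a network stream rarely fills the whole buffer.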

To convert the stream's bytes to characters you need to know the character encoding. This is available within the content type returned by getContentType( ). That type has one of two common forms:

  • "type/subtype"
  • "type/subtype; charset=set"

The type and subtype characterize the content, such as "text/html" or "image/png". Together, these form a MIME type. These values are standardized and the Internet Assigned Numbers Authority has a list of standard MIME Media Types. Some common MIME types include:

MIME type                Meaning
text/plain               Plain text file
text/html                HTML web page
application/xml          XML data file
application/xhtml+xml    XHTML web page
image/jpeg               JPEG image
image/png                PNG image
image/gif                GIF image

While XHTML has its own MIME type, most web servers are intentionally misconfigured to return XHTML under the incorrect "text/html" MIME type. This is done because Internet Explorer 6 didn't recognize XHTML's MIME type and would incorrectly show XHTML as an unstyled hierarchical XML listing. Additionally, incorrectly configured web servers sometimes return ZIP archives, JAR files, and anything else they don't recognize as "text/plain". So, the MIME type should be considered only a strong hint, not absolute truth.

In a content type, optional parameters may follow the MIME type after a semicolon. While there are several parameters possible, common use primarily includes the "charset" parameter giving a character set (encoding) name for text content. This important name indicates how to map the raw bytes from the InputStream into characters.
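
Splitting a content type into its MIME type and character set might look like the sketch below. It follows the same approach as the WebFile constructor, with two small additions that are my own assumptions about what is useful: it lower-cases the MIME type for easier comparison, and it strips quotes that some servers put around the charset value. The ContentType class and its fields are named for this example only.

```java
import java.util.Locale;

// Hypothetical helper for this example.
final class ContentType {
    final String mimeType;   // e.g. "text/html", lower-cased
    final String charset;    // e.g. "UTF-8", or null if not given

    ContentType( String contentType ) {
        final String[] parts = contentType.split( ";" );
        mimeType = parts.length > 0
            ? parts[0].trim( ).toLowerCase( Locale.ROOT ) : "";
        String cs = null;
        for ( int i = 1; i < parts.length && cs == null; i++ ) {
            final String p = parts[i].trim( );
            if ( p.toLowerCase( Locale.ROOT ).startsWith( "charset=" ) ) {
                cs = p.substring( "charset=".length( ) ).trim( );
                // Some servers quote the value: charset="utf-8"
                if ( cs.length( ) >= 2
                    && cs.startsWith( "\"" ) && cs.endsWith( "\"" ) )
                    cs = cs.substring( 1, cs.length( ) - 1 );
                if ( cs.isEmpty( ) )
                    cs = null;
            }
        }
        charset = cs;
    }
}
```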

If there is no character set name, the content is data, not text. It is up to the application to process the data appropriately. For instance, if the MIME type were "application/zip", you could pass the InputStream to java.util.zip.ZipInputStream to read and unarchive a ZIP file.

Text content is supposed to always have a character set name. Character set names are standardized. Older content may use "US-ASCII" for US English text, "Big5" for Chinese, or "Shift_JIS" for Japanese. There are also a flock of Microsoft Windows-specific encodings that may be in use for content generated by Microsoft products. Fortunately, today, content is moving towards the "UTF-8" character encoding for the international Unicode character set. Java strings also hold Unicode characters, though internally they are stored as UTF-16 rather than UTF-8.

To transcode the stream's bytes into Java's Unicode characters, extract the character set name from the content type and create a new String. An UnsupportedEncodingException is thrown if the character set name isn't recognized.

String text = new String( bytes, charset );
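
If you would rather test for a bad name than catch a checked exception, the java.nio.charset.Charset class offers another route. The sketch below returns null when the name is malformed or unsupported, leaving the caller to fall back to the raw bytes. TextDecoding and decode are illustrative names, and the String constructor that takes a Charset needs Java 6 or later.

```java
import java.nio.charset.Charset;

// Hypothetical helper for this example.
final class TextDecoding {
    /**
     * Decode bytes using the named character set. Returns null if the
     * name is null, malformed, or not supported by this Java runtime.
     */
    static String decode( byte[] bytes, String charsetName ) {
        try {
            return new String( bytes, Charset.forName( charsetName ) );
        }
        catch ( IllegalArgumentException e ) {
            // Covers IllegalCharsetNameException and
            // UnsupportedCharsetException, which are both subclasses.
            return null;
        }
    }
}
```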

At this point you finally have HTML, XHTML, or plain text content from the web server. Congratulations!

Interpreting the content without a character set name

But what if the content is text, but there is no character set name in the content type? When improperly configured web servers do this, applications must fall back to looking at the content itself. For HTML, look for a <meta> tag:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

For XML and XHTML, look for an <?xml> tag:

<?xml version="1.0" encoding="UTF-8" ?>

There should be just one of these tags early in the content. However, improperly authored content may not include these, may include multiple tags, or may include tags whose character set is wrong or doesn't match that in the response header (if any). For instance, as of this writing, Microsoft's own microsoft.com home page incorrectly includes two <meta> tags with differing character sets. What's an application to do?

When there is conflicting information about the character set, the response header's content type has first priority. If that is missing, the content's first <meta> or <?xml> tag has second priority. Finally, when there is no explicit character set stated anywhere, the application can guess or use its own preferences. Often, "UTF-8" is a good guess.

Since all HTML and XML tags use 8-bit bytes in the US-ASCII character set, Java code can fairly safely scan through the beginning of the byte array from an InputStream and look for a <meta> or <?xml> tag if needed. Once the tag is found and the character set name is extracted, transcode the bytes into a Java String and start over with whatever processing you intend to do with the web page.
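
Here is one way such a scan could be sketched, using regular expressions on the first few kilobytes. The patterns are deliberately loose and are my own approximation, not a full HTML or XML parser. The 4096-byte limit is arbitrary, the CharsetSniffer and sniff names are invented for this example, and StandardCharsets needs Java 7 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for this example.
final class CharsetSniffer {
    // charset=NAME inside a <meta> tag, with or without quotes.
    private static final Pattern META = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
        Pattern.CASE_INSENSITIVE );
    // encoding="NAME" inside an <?xml ... ?> declaration.
    private static final Pattern XML = Pattern.compile(
        "<\\?xml[^>]*encoding\\s*=\\s*[\"']([A-Za-z0-9._:-]+)",
        Pattern.CASE_INSENSITIVE );

    /** Return the charset named by the first tag found, or null. */
    static String sniff( byte[] bytes ) {
        // Only look at the start of the content.
        final int len = Math.min( bytes.length, 4096 );
        // ISO-8859-1 maps each byte to one char, so ASCII tags survive.
        final String head =
            new String( bytes, 0, len, StandardCharsets.ISO_8859_1 );
        final Matcher x = XML.matcher( head );
        final Matcher m = META.matcher( head );
        final boolean hasXml  = x.find( );
        final boolean hasMeta = m.find( );
        // Whichever tag comes first in the content wins.
        if ( hasXml && ( !hasMeta || x.start( ) < m.start( ) ) )
            return x.group( 1 );
        return hasMeta ? m.group( 1 ) : null;
    }
}
```

Remember that this is the second-priority source. Use it only when the response header's content type gave no character set.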

Alternatives

The code and discussion above describe a general approach for getting content from a URL. However, there are several simpler alternatives listed below. All of these are actually implemented using the URLConnection and HttpURLConnection objects we discussed above. However, because these objects are not exposed, you can't set HTTP request parameters (timeouts, user agent string, etc.), or get response headers, the response code, a redirected-to URL, or the content type. This is a significant drawback and why I advocate using the underlying classes directly.

Here are a few of the many places URLs are used by Java classes to automatically load content from a web server:

  • For all URLs:
    Object content = url.getContent( );
  • For text URLs displayed in a JEditorPane:
    javax.swing.JEditorPane editor = new javax.swing.JEditorPane( url );
    
  • For GIF, JPEG, and PNG image URLs:
    java.awt.Toolkit toolkit = java.awt.Toolkit.getDefaultToolkit( );
    java.awt.Image   image  = toolkit.createImage( url );
  • For GIF, JPEG, and PNG image URLs loaded as an ImageIcon:
    javax.swing.ImageIcon icon = new javax.swing.ImageIcon( url );
  • For GIF, JPEG, PNG, BMP, and WBMP image URLs:
    java.awt.Image image = javax.imageio.ImageIO.read( url );
  • For audio URLs:
    java.applet.AudioClip audio = java.applet.Applet.newAudioClip( url );
  • For MIDI sequence URLs:
    javax.sound.midi.Sequence midi = javax.sound.midi.MidiSystem.getSequence( url ); 

The InputStream from a URLConnection also can be used to load content:

  • For ZIP input streams:
    java.util.zip.ZipInputStream zip = new java.util.zip.ZipInputStream(
        (java.io.InputStream)url.getContent( ) );
    java.util.zip.ZipEntry entry = null;
    while ( (entry = zip.getNextEntry( )) != null )
    { ... }
  • For GZIP input streams (a GZIP stream holds one compressed file, not a list of entries):
    java.util.zip.GZIPInputStream gzip = new java.util.zip.GZIPInputStream(
        (java.io.InputStream)url.getContent( ) );
    byte[] buf = new byte[4096];
    for ( int n = gzip.read( buf ); n != -1; n = gzip.read( buf ) )
    { ... }
  • For JAR input streams:
    java.util.jar.JarInputStream jar = new java.util.jar.JarInputStream(
        (java.io.InputStream)url.getContent( ) );
    java.util.zip.ZipEntry entry = null;
    while ( (entry = jar.getNextEntry( )) != null )
    { ... }

Comments

Nice

Nice material ! thanks !!

Hey, I just tried your

Hey, I just tried your class, it's interesting & useful.
just tell you maybe 2 bugs I found.

1, final int buflen = Math.max( 1024, Math.max( length, stream.available() ) );
in this line, I believe the first one should be "Math.min".

2, When MIMEtype is an image or other non-text,
I believe the content should readStream from conn.getInputStream( ),
instead of just content = readStream( length, (java.io.InputStream)content );
The content seems to be empty at this time. What do you think?

Thanks,

Re: Hey, I just tried your

I'm glad you find the code useful. :-) Now, to address your questions:

1. Regarding this line of code:

final int buflen = Math.max( 1024, Math.max( length, stream.available() ) );

Well, the code is correct as-is. The code here selects a buffer size for reading data from the stream. For the best performance, we'd like that buffer to be as large as the incoming data so that we make as few I/O calls as possible. We are given two estimates of the length of that data: the length variable filled in earlier from the HTTP header's "content-length" field, and the value returned by stream.available( ). Unfortunately, both of these can be unusable. The HTTP "content-length" field is optional and will not be present for server-side script-generated content, such as that from PHP, Perl, or Java scripts. In these cases, the length returned by the earlier length = conn.getContentLength( ) call will be -1. Next, the stream's available bytes will be zero if no data has arrived yet because the connection is slow. But if both of these values are non-positive, we still need a non-zero length buffer. So, we take the Max of the Max of 1024 and these two values.

2. Regarding this line of code:

content = readStream( length, (java.io.InputStream)content ); 

Well, this code is correct as-is too. The idea here is to read in the stream's bytes and return them in some usable form. The code in the constructor first checks if the returned "content" from the connection is a stream or not. For recognized image content types, the returned content will be an Image object and we return that as-is. For recognized sound types, the content will be an AudioClip and we can return that as-is too.

For all other content, we just get back an input stream. It might contain HTML text, a different type of image, or something else. But no matter what it is, we have to read the bytes first before we can do anything more. If the HTTP header's "content-type" field just includes a MIME type, and no charset, then the input is binary data (such as an unrecognized image type) and we can just read in the bytes and return them as-is. The application has to take it from there. But if the HTTP header includes a charset value, then the content is some form of text and we have to transcode the raw bytes from that charset into Java characters. The transcoded result is returned as a String.

The internal call to readStream( ) does the reading and transcoding (if needed) for you. If you call conn.getInputStream( ) yourself, then you'll need to do these same tasks too and your code will probably look pretty much the same as the code above.

Re: Charset

How would you extract the charset again? I'm trying to read a page where charset=iso-8859-1. Not quite sure how this line of code:
String text = new String( bytes, charset );
helps anything.
thanks!!

Re: Charset

The character set name is available as part of the content type in the HTTP header when you get the file, or in a meta tag at the top of the file. The code in this Java article parses the HTTP header content type in the WebFile constructor in the code chunk commented as "Get the response". Once you've got the charset value (such as "ISO-8859-1"), then calling the String constructor creates a new string and converts from the file's charset to Java's internal UTF-16 character encoding. This is discussed some in the above section Interpreting the content.

I also have a PHP article that may help. The article explains more about the content type and includes PHP regular expressions to parse it from the HTTP header or from a meta tag. While the PHP syntax is certainly different from Java, the regular expressions used are essentially the same. See PHP tip: How to get a web page content type.

I hope this helps.

I found this explanation

I found this explanation very useful
Thanks

Great post

I had some real problems with the size of the byte arrays that your class created. I found that a BufferedReader did the same thing much easier....

[ Ed: removed extra code not related to the BufferedReader topic ]
BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) 
    content += inputLine;
in.close();

re: Great post

Your use of a BufferedReader assumes that the stream being read contains text, and that it uses your platform's default character encoding, which is what an InputStreamReader falls back to when no encoding is named. Those may be poor assumptions. How will you read an image? Or text using a different encoding?

My code instead reads a raw stream of bytes, without assumptions. If a character encoding was provided in the HTTP header, then and only then does my code interpret those bytes as characters. And it interprets them with whatever encoding was named, not just the default.

getting content

Regarding:

  if ( errorStream != null )
  {
    Object content = conn.getContent();
    ...
  }

Why would you get the content if errorStream is NOT null? Wouldn't you want to get it only if errorStream IS null?

Re: getting content

You're right. This was a bug in the explanation text, but not in the main code at the top of the article. I've updated the text. Thanks.

Very Useful Post, and with a question

Hi, (from a beginner), I tried your files, but met an error when compiling. Hope you can give me a hint. the error is "postc.java:5: unreported exception java.net.MalformedURLException; must be caught or declared to be thrown". I just use two java file one is "WebFile.java" with the content mentioned in the article, and another one is "postc.java" with the code below:

import java.awt.*;
import java.applet.*;
public class postc extends Applet {
    public static void main(String[] args)
    {
        WebFile file   = new WebFile( "http://www.liacs.com" );
        String MIME    = file.getMIMEType( );
        Object content = file.getContent( );
        if ( MIME.equals( "text/html" ) && content instanceof String )
        {    String html = (String)content;
            System.out.println(content);
        }
    }    
}

Thanks!

Re: Very Useful Post, and with a question

The WebFile constructor takes a URL text string and internally creates a URL object. That URL creation may fail with a MalformedURLException if the URL string is badly formatted. When that happens, the WebFile constructor fails by letting the exception pass through to the caller. In this case, that means it will interrupt your first line of code and it is up to you to do something about it. This is easily done by adding a try...catch block:

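try
{
    WebFile file = new WebFile( "http://www.liacs.com" );
    ... rest of your code ...
}
catch ( java.net.MalformedURLException e )
{
    System.err.println( "Bad URL" );
}
catch ( java.io.IOException e )
{
    // The constructor also declares IOException, so catch it too.
    System.err.println( "Could not get the page: " + e.getMessage( ) );
}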

Most Useful Explanation I've Read!

Well, first off, I'd like to say you did an excellent job with this. I've read through quite a few tutorials and explanations on how to get a web request, but they all lacked what you had.

For that reason, I was curious if you'd be okay with me using this in an application I'm making for the Android OS. I didn't see any specific copyright on the code, so I figured I'd ask before I just go out and use it.

Again, much appreciated for the great explanation!

getUrl()

the getUrl() method does not return the redirected url, it gets the original URL!

Cheers,
MyD

hey thanks

this is a great walk-through and had everything i needed to start working with the java.net.URL, java.net.URLConnection, and java.net.HttpURLConnection libraries.


Nadeau software consulting