Don’t bother using HTML white-space removal to speed up a web site

Technologies: HTML 4+, an HTML optimizer, Drupal 5+ (optional)

Removing HTML white-space (spaces, tabs, blank lines, and comments) makes a file slightly smaller and faster to send to a site visitor. The improvement you get depends upon how verbose your HTML is to start with. This article uses the HTML Tidy optimizer and measures the improvement for a sample web site and 22 different standard themes or page templates. Each theme generates different HTML and shows a different level of improvement from HTML optimization. Unfortunately, in all cases the improvement is tiny and probably not worth the effort.

This article is part of a series on Essential steps to speed up a Drupal web site. The discussion here, however, applies equally well to any type of web site, not just one that uses Drupal.

How to remove HTML white space

The bigger the HTML page, the longer it takes to send to your site's visitors. Speed up a site by reducing page size. One way is to remove HTML bytes that don't need to be there:

  • Remove indentation.
  • Remove blank lines.
  • Remove comments.
  • Concatenate short lines to remove unnecessary line breaks.

Indentation, comments, etc., are often in the HTML to make the file easier to read for the site's designers. But none of these make a difference to the site visitor or their web browser. Reduce page load times by removing these extra bytes.
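
As a rough illustration of the removals listed above, here is a minimal Python sketch. It is not how HTML Tidy works internally, and the function name and sample page are made up for this example. It handles comments, indentation, and blank lines only. It does not join short lines, and it does not protect content where white space matters.

```python
import re

# Naive sketch only: it does not protect <pre>, <textarea>, <script>, or
# conditional comments, all of which a real optimizer must leave alone.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

def strip_html_whitespace(html: str) -> str:
    """Remove comments, indentation, trailing spaces, and blank lines."""
    without_comments = COMMENT.sub("", html)
    lines = (line.strip() for line in without_comments.splitlines())
    return "\n".join(line for line in lines if line)

# Hypothetical sample page, invented for this example.
sample = """<html>
  <!-- page banner -->
  <body>

    <p>Hello,   world</p>
  </body>
</html>
"""

optimized = strip_html_whitespace(sample)
print(len(sample.encode("utf-8")), "->", len(optimized.encode("utf-8")), "bytes")
```

Spaces inside a line are left alone, since collapsing them safely requires knowing which element the text belongs to.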

HTML optimizers

Doing optimization by hand is tedious. Instead, use an HTML optimizer application. There are lots of free and commercial tools.

All of these tools will do about the same thing. Their main differences are in the quality of their user interfaces, the operating systems they work on, and how much they cost.

HTML Tidy is my tool of choice. It is one of the first HTML optimizers, it's free, it is under constant development by the Open Source community, and it has been incorporated into multiple free and commercial products for Windows, Mac OS X, and Linux. There are command-line tools for use in scripts, programmer libraries for custom Java, Perl, or Python programs, and the mod_tidy module for Apache 2 that will automatically optimize every HTML page delivered by the web server.
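
For scripted use, the command-line tool can be driven from a short program. The sketch below assumes a `tidy` executable is on the PATH. The option values are an assumption chosen to minimize white space, not the settings used for the tests in this article, and the function names are made up for this example.

```python
import shutil
import subprocess
from typing import List, Optional

def build_tidy_command() -> List[str]:
    # Assumed settings, not the article's: quiet mode, no indentation,
    # no line wrapping, comments dropped, no generator <meta> tag added.
    return [
        "tidy", "-q",
        "--indent", "no",
        "--wrap", "0",
        "--hide-comments", "yes",
        "--tidy-mark", "no",
    ]

def tidy_html(html: str) -> Optional[str]:
    """Run HTML Tidy on a string; return None if tidy is not installed."""
    if shutil.which("tidy") is None:
        return None
    result = subprocess.run(
        build_tidy_command(),
        input=html, capture_output=True, text=True, check=False,
    )
    # Tidy exits with 1 for warnings and 2 for errors, so a non-zero
    # exit status is not treated as a failure here.
    return result.stdout

page = "<html><head><title>t</title></head><body>\n\n  <p>Hello</p>\n</body></html>"
optimized = tidy_html(page)
print(optimized if optimized is not None else "tidy is not installed")
```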

Some other tools will do more "optimizations" than HTML Tidy, but beware. One "optimization" promoted by another tool removes all of the double-quotes around tag attribute values. This produces invalid XHTML, and in HTML 4 it is valid only for simple values made of letters, digits, hyphens, periods, underscores, and colons. Another "optimization" removes some closing tags. This also produces invalid XHTML, and it is valid HTML only where the closing tag is optional. Still another "optimization" replaces <strong> tags with <b>, <em> with <i>, etc., which changes the meaning of the markup and will break style sheets that select the original tags. Once you turn off these risky optimizations in fancy tools, you get back to the same safe set of optimizations that all of the tools can do.

Before spending time and money on any of these, though, let's do some testing to see what HTML optimization can do for a web site.

How well does it work?

I measured the effect of HTML optimization for a variety of test cases. I hand-edited several files first and compared them to the output of HTML Tidy. They differed by only a few bytes, which is surprisingly good. Instead of doing any further hand-editing, all of the following tests use HTML Tidy. I used the Balthisar Tidy user interface for HTML Tidy on a Mac.

Effect on HTML page template/theme sizes

Many web sites today use page templates or themes that provide the common elements of every site page, such as banners, menus, and footers. Only the body text changes from page to page. Change the template or theme, and the look of every page changes.

To optimize a site quickly, optimize the page template with HTML Tidy. I measured the improvement on a selection of page templates from themes for the Drupal content management system. The same could have been done for themes for WordPress, or templates used by Dreamweaver or iWeb. See the appendix for details on how I did these tests.

Impact on Drupal theme page.tpl.php size
(sizes in bytes; higher savings are better)
Theme           Original  Optimized  Saved  Percent
Aberdeen           3,141      2,373    768      24%
Amadou             3,733      2,286  1,447      39%
Andreas09          2,154      2,026    128       6%
Antique Modern     2,970      2,223    747      25%
Arcmateria         2,485      2,165    320      13%
Blix               2,054      1,780    274      13%
Blue Breeze        3,556      2,709    847      24%
Blue Marine        2,304      2,113    191       8%
Brushed Steel      4,530      3,291  1,239      27%
Fancy              3,818      3,624    194       5%
Gagarin            2,134      1,938    196       9%
Garamond           2,324      2,015    309      13%
Garland            3,706      2,929    777      21%
Glossy Blue        2,349      2,080    269      11%
iTheme             2,364      1,934    430      18%
Kubrick            1,692      1,494    198      12%
News Portal        1,840      1,720    120       7%
Ocadia             3,241      2,837    404      12%
Push Button        3,882      3,407    475      12%
Slash              2,582      2,570     12       0%
Stylized Beauty    3,500      2,734    766      22%
Zen                3,685      2,623  1,062      29%
Average                                508      16%

The average reduction in size for optimized page templates was 16%. This seems good, but a page template is only the shell of a page, lacking any real content. What happens if we add the content?
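
The Saved and Percent columns are simple arithmetic on the two measured sizes. A small Python sketch, using three rows copied from the table above (the function name is made up for this example):

```python
def savings(original: int, optimized: int) -> tuple:
    """Return (bytes saved, percent saved rounded to a whole number)."""
    saved = original - optimized
    return saved, round(100 * saved / original)

# Three rows copied from the table above: (theme, original, optimized).
rows = [("Aberdeen", 3141, 2373), ("Amadou", 3733, 2286), ("Slash", 2582, 2570)]
for theme, original, optimized in rows:
    saved, percent = savings(original, optimized)
    print(f"{theme}: {saved} bytes saved ({percent}%)")
```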

Effect on HTML page sizes

I applied each theme to a test web site and measured the size of the site’s home page before and after optimization with HTML Tidy. The home page I used is fairly complex, with multiple blocks of text, lists, images, forms, and tables. See the appendix for details on how I did these tests.

Impact on Drupal page size
(sizes in bytes; higher savings are better)
Theme           Original  Optimized  Saved  Percent
Aberdeen          28,141     26,017  2,124       8%
Amadou            27,764     25,631  2,133       8%
Andreas09         25,791     25,108    683       3%
Antique Modern    27,057     25,601  1,456       5%
Arcmateria        26,766     25,772    994       4%
Blix              26,395     25,699    696       3%
Blue Breeze       28,624     26,177  2,447       9%
Blue Marine       26,911     26,030    881       3%
Brushed Steel     29,418     26,933  2,485       8%
Fancy             30,400     29,327  1,073       4%
Gagarin           26,164     25,457    707       3%
Garamond          24,773     24,102    671       3%
Garland           28,273     26,607  1,666       6%
Glossy Blue       24,789     24,151    638       3%
iTheme            27,283     26,340    943       3%
Kubrick           22,994     22,583    411       2%
News Portal       25,743     24,954    789       3%
Ocadia            26,374     25,820    554       2%
Push Button       27,856     26,845  1,011       4%
Slash             28,099     27,021  1,078       4%
Stylized Beauty   28,013     26,555  1,458       5%
Zen               28,634     26,272  2,362       8%
Average                              1,239       4%

The average reduction in size for optimized HTML was 4%. That's less impressive. Bytes were still saved, but the size of the content overwhelms the bytes saved through HTML optimization.

Effect on HTML page sizes, after compression

Production web sites should always use Apache file compression. What happens if we use HTML Tidy, then compress the pages using Apache and mod_gzip or mod_deflate? See the appendix for details on how I did these tests.

Impact on Drupal page size, after compression
(sizes in bytes; higher savings are better)
Theme           Original  Optimized  Saved  Percent
Aberdeen           5,986      5,718    268       4%
Amadou             6,089      5,700    389       6%
Andreas09          5,567      5,522     45       1%
Antique Modern     5,763      5,586    177       3%
Arcmateria         5,795      5,632    163       3%
Blix               5,751      5,648    103       2%
Blue Breeze        5,862      5,698    164       3%
Blue Marine        5,783      5,723     60       1%
Brushed Steel      6,236      5,888    348       6%
Fancy              6,365      6,229    136       2%
Gagarin            5,708      5,599    109       2%
Garamond           5,249      5,133    116       2%
Garland            6,018      5,754    264       4%
Glossy Blue        5,245      5,149     96       2%
iTheme             5,953      5,764    189       3%
Kubrick            5,080      5,017     63       1%
News Portal        5,554      5,506     48       1%
Ocadia             5,784      5,708     76       1%
Push Button        5,951      5,859     92       2%
Slash              5,805      5,765     40       1%
Stylized Beauty    6,006      5,919     87       1%
Zen                6,026      5,762    264       4%
Average                                150       3%

The average reduction in size for compressed optimized HTML is just 3%. Compression did a spectacular job of reducing page sizes by 78%, and removing 21,000 bytes on average. In comparison, the 150 byte improvement from HTML optimization isn't very impressive.
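
One way to see why compression shrinks the benefit is that repeated runs of indentation are exactly what a gzip-style compressor encodes most cheaply. The Python sketch below uses a made-up, highly repetitive page fragment and the standard gzip module rather than Apache. Real pages compress less well than this sample, so treat the numbers as an illustration of the trend only.

```python
import gzip

def sizes(html: str) -> tuple:
    """Return (raw size, gzip-compressed size) in bytes."""
    data = html.encode("utf-8")
    return len(data), len(gzip.compress(data))

# Hypothetical page fragment: an indented list of 50 teaser links.
items = "\n".join(
    f'        <li><a href="/node/{i}">Post number {i}</a></li>' for i in range(50)
)
indented = f"<ul>\n{items}\n</ul>\n"
# Same markup with the leading indentation removed from every line.
stripped = "\n".join(line.lstrip() for line in indented.splitlines()) + "\n"

raw_before, gz_before = sizes(indented)
raw_after, gz_after = sizes(stripped)
print("saved before compression:", raw_before - raw_after, "bytes")
print("saved after compression: ", gz_before - gz_after, "bytes")
```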

The HTML is just part of a real web page. What happens if we include the style sheets and images?

Effect on HTML page sizes, after compression and including CSS and images

I re-calculated the improvement from HTML optimization, now including the size of CSS and images needed by the home page for each of the tested themes. See the appendix for details on how I did these tests.

Impact on Drupal page size, after compression, including CSS and images
(sizes in bytes; higher savings are better)
Theme           Original  Optimized  Saved  Percent
Aberdeen          23,999     23,731    268     1.1%
Amadou            36,187     35,798    389     1.1%
Andreas09         21,358     21,313     45     0.2%
Antique Modern    52,785     52,608    177     0.3%
Arcmateria        19,007     18,844    163     0.9%
Blix              26,699     26,596    103     0.4%
Blue Breeze       63,878     63,714    164     0.3%
Blue Marine       13,864     13,804     60     0.4%
Brushed Steel     99,466     99,118    348     0.4%
Fancy            200,484    200,348    136     0.1%
Gagarin           56,322     56,213    109     0.2%
Garamond          36,359     36,243    116     0.3%
Garland           26,192     25,928    264     1.0%
Glossy Blue       32,006     31,910     96     0.3%
iTheme            90,039     89,850    189     0.2%
Kubrick           25,105     25,042     63     0.3%
News Portal       20,427     20,379     48     0.2%
Ocadia            76,697     76,621     76     0.1%
Push Button       28,370     28,278     92     0.3%
Slash             32,670     32,630     40     0.1%
Stylized Beauty   28,573     28,486     87     0.3%
Zen               49,112     48,848    264     0.5%
Average                                150     0.4%

The average reduction in size for optimized HTML in the context of a full page is just 0.4%. The size of the CSS and images for the page overwhelms the meager gains from HTML optimization.

Conclusions

HTML optimization seems like a good idea at first. But, in the real-world context of a web page with style sheets and images, and a web server properly configured to compress content, the few bytes saved from HTML optimization are hardly worth the effort.

To put this in perspective, the average savings in the above tests, after compression, was just 150 bytes. On a typical cable modem or DSL connection supporting 6 Mbps (megabits per second) downloads, or roughly 600 Kbytes per second, those 150 bytes take about 1/4000th of a second to send. Saving that time isn't going to be noticeable to your site visitors.
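
The arithmetic behind that estimate can be written out in a few lines of Python. The sketch treats 6 Mbps as roughly 600 Kbytes per second, the same rule of thumb used in the reply to the first comment further down this page. At a strict 8 bits per byte the link carries 750 Kbytes per second and the result is 1/5000th of a second, which does not change the conclusion.

```python
def transfer_seconds(size_bytes: float, bytes_per_second: float) -> float:
    """Time to send size_bytes over a link, ignoring latency."""
    return size_bytes / bytes_per_second

# 6 Mbps treated as roughly 600 Kbytes/second (about 10 bits per byte
# on the wire), a rule-of-thumb figure rather than a measured rate.
CABLE_BYTES_PER_SECOND = 600_000

t = transfer_seconds(150, CABLE_BYTES_PER_SECOND)
print(f"{t:.5f} seconds, or 1/{round(1 / t)} of a second")
```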

Also consider that one web server message to get a file takes about 250 bytes, plus the size of the file. To optimize a site, you would do much better to remove just one image from the design, saving that 250 bytes for the message, plus the bytes for the image. Dropping even a tiny one-pixel image will save you more than HTML page optimization is likely to do.

An advertisement for one HTML optimization product claimed a 20% reduction in page size! But what is a "typical" page for that claim? Most pages today are created by authoring tools (such as Dreamweaver or iWeb) or by content management systems (such as Drupal or WordPress), and both ways produce lean HTML with little to remove in HTML optimization. If you can get a 20% reduction from optimization, then something is wrong with the authoring tool you are using.

Appendix: How I tested

All testing used a Drupal 5.1 web site loaded with sample content and a home page layout that listed teasers for the 10 most recent posts in the body. Blocks on the left side of the page supported menus and a recent image. Testing used standard themes downloaded from the Drupal web site.

All size measurements were done by simply counting bytes before and after running HTML Tidy on the theme template or the site home page. Full page sizes, including CSS and images, were measured by monitoring a page loaded into Firefox while using the Charles proxy server on a Mac.

As with all benchmarking, your results may vary due to differences in your page layout, content, and choice of theme or page template. Use these results only as a rough guideline.

Comments

Disagree

I have compressed my pages with the Absolute HTML tool, and I have to say my site is running a lot faster. Pingdom.com results say the page load time has dropped by 5-10 seconds on all my pages... I would not say that compressing pages is hardly worth it. In fact, I think it can be very good when all else fails...

Re: Disagree

Well, if you shaved 5-10 seconds by removing white space, then something was seriously wrong with your HTML in the first place! Consider that a typical cable modem can receive about 6 Mbits/second, or about 600 Kbytes/second. So, shaving 5 seconds requires a reduction of 5 * 600 Kbytes = 3,000 Kbytes, or about 3 Mbytes! And if you had 3 Mbytes of unnecessary spaces in your HTML, then you badly need to re-examine whatever tool you're using to create that HTML. Don't depend upon optimizer tools to patch your problems. Fix the problems instead.

As another point for comparison, a typical complex web page is between 50 and 100 Kbytes (HTML only). A savings of 3 Mbytes is 30 to 60 times the total size of such a page. That seems unlikely.

If you used Pingdom's "full page test", then be aware that its results are not useful. Here are some of its problems:

  • It doesn't make HTTP requests for compressed content, though most web browsers do. Web servers will respond by delivering uncompressed HTML, CSS, and JavaScript to Pingdom. Such files are larger and slower to deliver, increasing Pingdom's reported download time relative to what a real browser would see.
  • It downloads any image referenced in the CSS, not just those used by the page being tested. For sites that use a site-wide CSS with multiple stock images used for different parts of the site, Pingdom will download many extra files and report a load time much larger than what a real browser would see.
  • It doesn't run JavaScript, so it won't download images (such as ads) and other content used by your pages. Since these do take real download time, Pingdom's reported times will be smaller than those for a real browser.
  • It has an unspecified file download cap. This will crop the download and report fewer files and a lower download time than a real browser would see.
  • It doesn't state the bandwidth and latency assumptions it is using to calculate load times, so there is no way to compare its values to those real users might see.

The numbers reported by Pingdom, and similar tools, are of no use. Instead, try Firefox's Web Developer and YSlow plugins or Safari's development features. These give real load times in real browsers under real conditions.
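
The first problem in the list above, whether a client asks for compressed content, comes down to one request header. The Python sketch below builds requests with and without it using the standard urllib module. The function names are made up for this example and the URL in the usage note is a placeholder. Without the header, urllib asks for uncompressed ("identity") content.

```python
import urllib.request

def build_request(url: str, accept_gzip: bool = True) -> urllib.request.Request:
    """Build a GET request that does or does not advertise gzip support."""
    headers = {"Accept-Encoding": "gzip"} if accept_gzip else {}
    return urllib.request.Request(url, headers=headers)

def transferred_bytes(request: urllib.request.Request) -> int:
    """Fetch the URL and count the body bytes as sent over the wire."""
    # urllib does not decompress the body, so this is the transfer size.
    with urllib.request.urlopen(request, timeout=10) as response:
        return len(response.read())
```

To compare the two cases for one of your own pages, call transferred_bytes(build_request("http://example.com/")) and transferred_bytes(build_request("http://example.com/", accept_gzip=False)), substituting a real address for the placeholder. If the server has compression enabled, the first number should be noticeably smaller.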

Also note that "Absolute HTML Compressor", like many others, can create invalid HTML and change the appearance of your web pages. For instance, it can remove <!DOCTYPE> declarations, which will switch a page from strict to quirks mode rendering and affect the layout. It can remove <META> tags, which can affect how page character sets are interpreted and how web spiders treat page links. It can remove double-quotes around attribute values, which makes the code invalid XHTML (though many browsers will accept it anyway).

Be aware that so-called compressor/optimizer tools game the results they advertise by enabling these invalid and appearance-changing "optimizations" and by applying their tools to ridiculously bad HTML in the first place. When you test these tools on real content, as I have, you find that they provide little benefit and a lot of potential for messed-up results. Shaving a few hundred or even a few thousand bytes from a web page has little impact on page load time. Unless you've got 3 Mbytes of unnecessary spaces in your HTML...

Disagree

Perhaps it may be true in a Drupal environment. I've saved 50% compressing JS, CSS, and HTML on some projects.

The real performance gains come when you limit the number of requests made by the client's web browser, more so than the actual file size. If you can get all your content in 4 requests, then your page will load instantly, as web browsers are hardwired to only do 4 requests at a time.

So using 4 requests:

1. (IMAGE.PNG) CSS sprite image file
2. (SITE.CSS) CSS, compressed
3. (MYSCRIPT.JS) JavaScript include
4. (INDEX.HTML) HTML

That's how you make a website scream.

Re: Disagree

First, think about that... you saved 50%? That would require that every other character in those files is a space, carriage-return, or part of a comment. If that's really the case, something is seriously wrong with the way you are authoring these files. Either you are using very bad tools or you are embedding large comment blocks in those files. Get better tools and move comment blocks to a design document. They should never be included in anything you download over and over and over.

Second, you should always use Apache's compression (mod_deflate) to zip files as they are being delivered. This will substantially reduce the size of any text file and save you more than white-space removal will. Once you enable compression, you'll find that the impact of white-space removal is too trivial to be bothered with (see the article's results above).

Third, the HTTP/1.1 specification (RFC 2616) says a web browser should keep no more than two connections open to any one server, so it may make two requests in parallel, not four. Some browsers don't follow these rules, and some browser plugins let you override these rules. But as a web site developer, you should design for browsers that follow the rules.

Fourth, the way browsers make parallel requests is more complex than you imply. Browsers start with the HTML file and queue file requests as they encounter tags for images, Flash, CSS files, and scripts. File queuing recurses into CSS files to queue background images for the site's theme. File queuing also pauses on scripts until the script finishes executing because it may change the web page's content and what files need to be queued next. The order in which files are requested from the queue then depends upon the order in which requested files arrive, and that depends upon network and web server load, and what is cached at the browser or within the network. All of this makes trying to predict exactly how a browser will request files nearly impossible, so designing a site to optimize for this order is pointless.

Instead of worrying about white-space removal, compression, and parallel browser file requests, use this one simple rule:

Use fewer files

The time it takes to request and get a file swamps the time it takes to download it (unless it is a giant multi-megabyte file). Don't worry about making files a few bytes smaller. Worry about using fewer files in the first place.
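
A toy model makes the comparison concrete. In the Python sketch below, every file costs a fixed request delay plus its transfer time. The 50 ms delay and the file sizes are assumptions picked for illustration, not measurements, and the model ignores parallel requests and caching.

```python
def load_seconds(file_sizes, request_overhead_s, bytes_per_second):
    """Sequential model: each file costs a fixed request delay plus transfer time."""
    return sum(request_overhead_s + size / bytes_per_second for size in file_sizes)

BYTES_PER_SECOND = 600_000   # the 6 Mbps connection used elsewhere on this page
REQUEST_OVERHEAD_S = 0.05    # assumed 50 ms delay per request; not measured

page = [27_000] + [2_000] * 10           # hypothetical: HTML plus ten small images
smaller_html = [26_000] + [2_000] * 10   # same page, 1,000 bytes trimmed from the HTML
fewer_files = [27_000] + [2_000] * 9     # same page with one image dropped

for name, files in [("original", page), ("smaller HTML", smaller_html),
                    ("fewer files", fewer_files)]:
    print(f"{name}: {load_seconds(files, REQUEST_OVERHEAD_S, BYTES_PER_SECOND):.4f} s")
```

Under these assumptions, dropping one small image saves far more time than trimming 1,000 bytes from the HTML, because the fixed cost of the request dominates.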

Compression not even worth it over numerous hits

I completely agree with your premise that compressing HTML isn't worth it. I think compressing JavaScript is worth it though, as it only takes seconds to do (numerous online resources) and it almost always saves more than a few bytes (I got an 80 KB file down to 63 KB). On some of those savings for HTML you cited, even if you got one million hits you'd only save a few gigabytes of bandwidth. Even if you have a crappy service provider it wouldn't matter. I'm totally with you on this one. Unless you're Google or CNN, compressing HTML isn't worth it. I wonder why big sites don't do it? If you look at most constant-hit sites (like news sites, etc.), they aren't compressing. It'd be interesting to find out why (probably a CMS issue).

Re: Compression not even worth it over numerous hits

Older web browsers sometimes do not support receiving compressed content. To support users with old software on old PCs, big generic sites may elect to skip file compression. For everybody else, file compression for HTML, CSS, and JavaScript is a trivial one-time cost on the server (compressed files are usually cached) and everybody benefits from the lower bandwidth use.

This article argues against using HTML white-space removal and other HTML "optimizer" tools. While such tools do reduce the file size, they don't reduce it by much if the code has been written well in the first place. If you see big gains from an "optimizer", chances are your original code was pretty bad. And that would be an indicator that you should fix the code rather than patch it with an optimizer.

JavaScript is a different situation. JS files are often a concatenated hodge-podge of scripts written for the site or obtained from third parties to provide pop-up menus, hovering images, animated collapsing section headers, and all sorts of other bells and whistles. A site's developers may not know how third-party scripts work so it may not be practical to fix them to reduce their size (such as by removing giant comment blocks). In these cases, using a JavaScript optimizer that automatically strips comments and white space may be the only practical way to reduce the script size. After that, the server's file compression will reduce the file size even further.

Nice Writeup

Great article, thanks for putting the thought and effort into this, really puts things in perspective!

-J


Nadeau software consulting