Some notes on the Cocoon HTML generator.


  • Can generate XHTML/XML from a local file or remote URL
  • Configured as a generator in the sitemap, including a reference to a properties file for configuring JTidy:
<map:generator name="html" src="org.apache.cocoon.generation.HTMLGenerator">
  • Multiple JTidy configs are supported by declaring multiple generator definitions, each with its own name
  • Caches results based on the last-modified date of the document
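
A rough sketch of how the points above might look in a sitemap. This is a fragment to merge into an existing sitemap, not a complete file. The generator name html-scrape, both properties file names, the match pattern and the example.org URL are made up for illustration. The jtidy-config child element is how the Cocoon documentation I have seen points the generator at its properties file, so check it against your Cocoon version. Note that a generator is declared with name="..." and then selected in a pipeline with type="...".

```xml
<map:generators>
  <!-- Default HTML generator with one JTidy configuration -->
  <map:generator name="html" src="org.apache.cocoon.generation.HTMLGenerator">
    <jtidy-config>jtidy.properties</jtidy-config>
  </map:generator>

  <!-- Second definition (hypothetical name) with its own JTidy configuration -->
  <map:generator name="html-scrape" src="org.apache.cocoon.generation.HTMLGenerator">
    <jtidy-config>jtidy-scrape.properties</jtidy-config>
  </map:generator>
</map:generators>

<!-- Inside a map:pipeline: src can be a local file or a remote URL -->
<map:match pattern="scrape">
  <map:generate type="html-scrape" src="http://example.org/page.html"/>
  <map:serialize type="xml"/>
</map:match>
```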


Useful options when screen-scraping

By default JTidy preserves the original case of elements and attributes, unless you switch that off with uppercase-tags or uppercase-attributes (HTML/XHTML only).

  • add-xml-space=true|false -- generate xml:space="preserve" attributes for pre, style, script. Ensures that whitespace is preserved correctly (as there's no DTD to check it against)
  • clean=true|false -- replace presentational markup with style rules. Useful for reducing the amount of markup you need to wade through
  • numeric-entities -- use numeric entities rather than named entities. Useful as you don't have to worry about entity declarations.
  • enclose-text/enclose-block-text
  • char-encoding
  • doctype
  • output-xhtml -- sets doctype and default namespace. Case is always preserved, other options ignored
  • output-xml
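
As an illustration, a JTidy properties file for screen-scraping could combine several of the options above. The file name and the chosen values are assumptions, not a recommended configuration. utf8 is one of the encoding names Tidy accepts for char-encoding.

```properties
# jtidy-scrape.properties (hypothetical file name)

# Emit XHTML: sets the doctype and default namespace
output-xhtml=true

# Numeric entities, so no entity declarations are needed downstream
numeric-entities=true

# Replace presentational markup with style rules
clean=true

# Add xml:space="preserve" to pre, style and script
add-xml-space=true

# Encoding of the input document
char-encoding=utf8
```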

