Some notes on the Cocoon HTML generator.
- Can generate XHTML/XML from a local file or remote URL
- Configured as generator in sitemap, including reference to properties file for configuring JTidy:
  <map:generator name="html" src="org.apache.cocoon.generation.HTMLGenerator">
    <jtidy-config>jtidy.properties</jtidy-config>
  </map:generator>
- Multiple JTidy configs are supported by using multiple generator definitions
- Caches results based on the last-modified date of the document
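
A rough sketch of how two generator definitions with different JTidy configs might sit side by side, and how a pipeline would pick one. This assumes components are declared with name= and referenced with type=, as in released Cocoon 2 sitemaps. The names html-scrape and jtidy-scrape.properties, the match patterns, the source paths and the xml serializer are all made up for illustration, so treat this as a fragment to adapt rather than a working sitemap.

```xml
<!-- Fragment only: goes inside <map:components> of a sitemap -->
<map:generators>
  <!-- Default HTML generator, using the config named in the notes above -->
  <map:generator name="html" src="org.apache.cocoon.generation.HTMLGenerator">
    <jtidy-config>jtidy.properties</jtidy-config>
  </map:generator>
  <!-- Hypothetical second definition with its own JTidy settings -->
  <map:generator name="html-scrape" src="org.apache.cocoon.generation.HTMLGenerator">
    <jtidy-config>jtidy-scrape.properties</jtidy-config>
  </map:generator>
</map:generators>

<!-- Fragment only: goes inside a <map:pipeline> -->
<!-- Local file, cleaned up with the hypothetical scraping config -->
<map:match pattern="scrape/local">
  <map:generate type="html-scrape" src="pages/example.html"/>
  <map:serialize type="xml"/>
</map:match>
<!-- Remote URL, cleaned up with the default config -->
<map:match pattern="scrape/remote">
  <map:generate type="html" src="http://www.example.com/"/>
  <map:serialize type="xml"/>
</map:match>
```

Each pipeline selects its JTidy settings purely through the type attribute, so adding another config means adding another generator definition.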
Useful options when screen-scraping:
- Preserves the original case of elements and attributes, unless you switch that off with uppercase-tags or uppercase-attributes (HTML/XHTML only)
- add-xml-space=true|false -- generate xml:space="preserve" attributes for pre, style, script. Ensures that whitespace is preserved correctly (as there's no DTD to check it against)
- clean=true|false -- remove presentational markup and use styles instead. Useful for reducing the amount of markup you need to wade through
- numeric-entities -- use numeric entities rather than named entities. Useful as you don't have to worry about declarations.
- output-xhtml -- sets doctype and default namespace. Case is always preserved, other options are ignored
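
As a starting point, the options above could be combined in the properties file referenced from the sitemap. This is only a sketch: it assumes the file is read as an ordinary Java properties file (key=value lines, # for comments) and that true/false is accepted for every option listed, as it is shown for add-xml-space and clean. The values are illustrative, not recommended defaults.

```properties
# jtidy.properties -- illustrative settings for screen-scraping
# Emit xml:space="preserve" on pre, style and script
add-xml-space=true
# Replace presentational markup with styles
clean=true
# Numeric entities, so no entity declarations are needed
numeric-entities=true
# Set the doctype and default namespace for XHTML output
output-xhtml=true
```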