I have an html DOM tree in memory (after having passed html through
JTidy and NekoHTML for validation/cleanup) and I'm trying to write it
back out as valid html.  I'm using Xerces 2.9.1 and Xalan 2.7.1 with
Sun JDK 1.5.0_14.  I'm running this from the command line, so I have careful
control of the classpath.  The jars in my project are very minimal but
I wouldn't rule out conflicts with the JDK yet (though I'm not sure
how to check that).  The specific examples I'm having trouble with
follow, as well as the code I'm using to do the export.

The main situation I'm having trouble with is empty tags.  For
instance... my input file contains:
<P>This is some <STRONG></STRONG> paragraph text.</P>
<P>This is a textarea.  <TEXTAREA name="foo"></TEXTAREA>  It has text
after it.</P>

It gets into my in-memory dom tree okay.  But when I use a
transformer to output the html, I get this instead, which Firefox
chokes on:
<P>This is some <STRONG/> paragraph text.</P>
<P>This is a textarea.  <TEXTAREA name="foo"/> It has text after it.</P>

(Firefox sees <STRONG/> and thinks it means <STRONG> and sees
<TEXTAREA/> and thinks it means <TEXTAREA>  ... which leaves the tags
hanging open and they boldface or otherwise consume the rest of the
page; on other tags such as div it may even make the whole page
un-renderable.)


So here's what I'm doing for export code, and my intention is simply
to produce valid HTML that a browser can render later.
============
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.METHOD, "html");
transformer.setOutputProperty(OutputKeys.MEDIA_TYPE, "text/html");
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

StringWriter sw = new StringWriter();
try {
         transformer.transform(new DOMSource(domDocument), new StreamResult(sw));
} catch (TransformerException te) {
         return(te.toString());
}

============

(Yes, I do really actually want it in a string after that, not an
output stream... this will eventually be a module in the middle of a
handling pipeline)

So, I'm trying to tell it to give me html, but what I get is a
document that contains xml-like empty tags wherever the tag was empty,
which results in browser bombs, and starts with:
<HTML xmlns="http://www.w3.org/1999/xhtml" lang="en">
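
One thing that might narrow this down is the xmlns on the root element.
As far as I can tell from the XSLT 1.0 spec, the html output method only
applies its HTML rules to elements that are in no namespace, and anything
in a namespace is written the way the xml method would write it.  Below
is a minimal sketch for comparing the two cases with the same output
properties as above.  The class and method names (EmptyTagCheck, toHtml,
build) are made up for illustration, it builds a tiny DOM by hand rather
than going through JTidy/NekoHTML, and I have not confirmed what the
namespaced case prints on every Xalan/JDK combination, so treat it as an
experiment rather than an answer.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical test harness, not part of the original export module.
public class EmptyTagCheck {

    // Serializes a DOM document to a String using the html output method.
    static String toHtml(Document doc) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.METHOD, "html");
        t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        // Indentation is switched off only to keep the output easy to compare.
        t.setOutputProperty(OutputKeys.INDENT, "no");
        StringWriter sw = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(sw));
        return sw.toString();
    }

    // Builds a P element containing an empty STRONG element.
    // Pass null for elements in no namespace, or a URI to namespace them.
    static Document build(String namespaceUri) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true);
        Document doc = f.newDocumentBuilder().newDocument();
        Element p = doc.createElementNS(namespaceUri, "P");
        p.appendChild(doc.createTextNode("This is some "));
        p.appendChild(doc.createElementNS(namespaceUri, "STRONG"));
        p.appendChild(doc.createTextNode(" paragraph text."));
        doc.appendChild(p);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        // Case 1: elements in no namespace.
        System.out.println(toHtml(build(null)));
        // Case 2: elements in the XHTML namespace, like my real document.
        System.out.println(toHtml(build("http://www.w3.org/1999/xhtml")));
    }
}
```

If the first line comes out with a proper closing tag and the second
does not, then the namespace that the cleanup step attaches to the tree
would be the thing to look at, rather than the transformer settings.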


I'm sure there's something I'm missing here (configuration? other
setup?), but I'm not sure what.  Thanks for your help.


Jenny Brown
