Correction - I found where I am telling it to output a doctype, and I can simply turn that off, so it's not printing a misleading one. So I have no remaining questions. Thanks. :)
On Thu, Apr 17, 2008 at 5:31 PM, Jenny Brown <[EMAIL PROTECTED]> wrote:
> Aha. The final solution to this was reconfiguring JTidy (the first
> step in my processing pipeline) to say:
>
> tidy.setXHTML(false);
> tidy.setXmlOut(false);
>
> instead of saying:
>
> tidy.setXHTML(true);
>
> Fixing that means JTidy no longer "pretty-prints" with an inserted
> namespace, which means NekoHTML doesn't get a wrong namespace, which
> means I avoid the eventual output problems. And now my tags look
> right: <STRONG></STRONG> is being output.
>
> Do I need to be concerned about this line showing up in my html source?
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
>
> Or is that appropriate for a regular html file?
>
> Thanks so much! Code is working now.
>
> Jenny Brown
>
> On Thu, Apr 17, 2008 at 10:57 AM, Brian Minchau <[EMAIL PROTECTED]> wrote:
> >
> > Hi Jenny.
> >
> > Yes, Henry is right.
> >
> > I don't know how I missed what you wrote:
> > > which results in browser bombs, and starts with:
> > > <HTML xmlns="http://www.w3.org/1999/xhtml" lang="en">
> >
> > That default namespace forces this HTML element to be treated as XML.
> > Likewise for any other element that is in a non-null namespace.
> >
> > - Brian
> >
> > ----- Forwarded by Brian Minchau/Toronto/IBM on 04/17/2008 11:54 AM -----
> >
> > Henry Zongaro/Toronto/I [EMAIL PROTECTED]
> > 04/17/2008 10:50 AM
> >
> > To: "Jenny Brown" <[EMAIL PROTECTED]>
> > cc: xalan-j-users@xml.apache.org
> > Subject: Re: Trouble exporting HTML from a DOM in memory
> >
> > Hi, Jenny.
> >
> > "Jenny Brown" <[EMAIL PROTECTED]> wrote on 2008-04-16 09:27:44 PM:
> > > The main situation I'm having trouble with is empty tags. For
> > > instance... my input file contains:
> > > <P>This is some <STRONG></STRONG> paragraph text.</P>
> > > <P>This is a textarea. <TEXTAREA name="foo"></TEXTAREA> It has text
> > > after it.</P>
> > >
> > > It gets into my in-memory dom tree okay. But then when I try to use a
> > > transformer to output the html, instead I get this which Firefox
> > > chokes on:
> > > <P>This is some <STRONG/> paragraph text.</P>
> > > <P>This is a textarea. <TEXTAREA name="foo"/> It has text after it.</P>
> > >
> > > [Snip]
> > >
> > > Transformer transformer =
> > >     TransformerFactory.newInstance().newTransformer();
> > > transformer.setOutputProperty(OutputKeys.METHOD, "html");
> > > transformer.setOutputProperty(OutputKeys.MEDIA_TYPE, "text/html");
> > > transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
> > > transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
> > >
> > > [Snip]
> > >
> > > So, I'm trying to tell it to give me html, but what I get is a
> > > document that contains xml-like empty tags wherever the tag was empty,
> > > which results in browser bombs, and starts with:
> > > <HTML xmlns="http://www.w3.org/1999/xhtml" lang="en">
> >
> > I think this is the key. You have specified that you want to use the html
> > output method, but your output is really xhtml. Because your output is in
> > an XML namespace, the serializer is required to serialize the output as
> > XML, despite the fact that you've used the html output method. However,
> > XHTML has to adhere to certain lexical conventions in order to be
> > correctly displayed in a browser that ordinary XML does not have to
> > adhere to.
> >
> > XSLT 1.0 does not define an xhtml output method, but Xalan-J does allow
> > you to give it a clue that what you're serializing is really XHTML. If
> > you add the following output property, the serializer will emit empty
> > tags using a space before the trailing /> - thus, <STRONG />
> >
> > transformer.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC,
> >     "-//W3C//DTD XHTML 1.0 Transitional//EN");
> >
> > That will probably help with a tag like <br> which is always supposed to
> > be empty - it will be serialized as <br /> - but probably not with STRONG
> > and TEXTAREA which happen to have no content in your DOM tree, but
> > ordinarily would have content. They really should be serialized as
> > <STRONG></STRONG> rather than <STRONG />. This issue has previously been
> > reported as Jira issue XALANJ-1906.[1]
> >
> > In the meanwhile, you probably have a couple of options for working around
> > this issue: one would be to recreate the DOM tree using elements that are
> > in no namespace rather than being in the XHTML namespace - then the html
> > output method would work properly; another would be to search the DOM tree
> > looking for elements that ordinarily have content that are actually empty,
> > and give them a single whitespace node child or remove them from the tree
> > entirely. You could also write XSLT stylesheets to implement any of those
> > work-arounds; let us know if you'd like an example.
> >
> > Thanks,
> >
> > Henry
> > [1] http://issues.apache.org/jira/browse/XALANJ-1906
> > ------------------------------------------------------------------
> > Henry Zongaro
> > XML Transformation & Query Development
> > IBM Toronto Lab T/L 313-6044; Phone +1 905 413-6044
> > mailto:[EMAIL PROTECTED]
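
For anyone finding this thread in the archive: below is a rough sketch of the second work-around Henry mentions, i.e. walking the DOM tree and giving each empty element that would ordinarily have content a single whitespace text child, so the serializer cannot collapse it to <STRONG/>. It uses only the JDK's built-in DOM and JAXP classes. The class name EmptyTagWorkaround, the methods padEmptyElements, sampleDocument and toHtml, and the VOID_ELEMENTS list are all made up for illustration and are not part of Xalan-J, JTidy or NekoHTML; the list of always-empty elements is an assumption and probably incomplete.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class EmptyTagWorkaround {
    // Assumption: elements that are always supposed to be empty (like <br>).
    // Illustrative only; extend it for the tags your documents actually use.
    private static final Set<String> VOID_ELEMENTS = new HashSet<>(Arrays.asList(
            "area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"));

    /**
     * Gives every childless element that is not in VOID_ELEMENTS a single-space
     * text child. Returns the number of elements that were padded.
     */
    public static int padEmptyElements(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE && !node.hasChildNodes()) {
            // getLocalName() is null for elements created without a namespace.
            String name = node.getLocalName() != null ? node.getLocalName() : node.getNodeName();
            if (!VOID_ELEMENTS.contains(name.toLowerCase(Locale.ROOT))) {
                node.appendChild(node.getOwnerDocument().createTextNode(" "));
                return 1;
            }
            return 0;
        }
        int padded = 0;
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            padded += padEmptyElements(child);
        }
        return padded;
    }

    /** Builds a small tree in the XHTML namespace, mirroring the problem case. */
    public static Document sampleDocument() {
        try {
            String xhtml = "http://www.w3.org/1999/xhtml";
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();
            Element p = doc.createElementNS(xhtml, "P");
            doc.appendChild(p);
            p.appendChild(doc.createTextNode("This is some "));
            p.appendChild(doc.createElementNS(xhtml, "STRONG")); // empty, should be padded
            p.appendChild(doc.createTextNode(" paragraph text."));
            p.appendChild(doc.createElementNS(xhtml, "BR"));     // always empty, left alone
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Serializes with the same output properties used earlier in the thread. */
    public static String toHtml(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "html");
            transformer.setOutputProperty(OutputKeys.MEDIA_TYPE, "text/html");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = sampleDocument();
        int padded = padEmptyElements(doc.getDocumentElement());
        System.out.println("Padded " + padded + " element(s)");
        System.out.println(toHtml(doc));
    }
}
```

The padding changes the DOM itself, so it should not depend on which serializer ends up writing the tree. The trade-off is a stray space inside otherwise empty elements, which matters for something like TEXTAREA where the content is shown to the user; in that case removing the namespace (Henry's first work-around, or the JTidy setting above) is probably the cleaner fix.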