Re: Fw: Trouble exporting HTML from a DOM in memory

Jenny Brown Thu, 17 Apr 2008 16:51:37 -0700

Correction - I found where I am telling it to output a doctype, and I
can simply turn that off, so it's not printing a misleading one.  So,
nothing much for remaining questions.  Thanks.  :)




On Thu, Apr 17, 2008 at 5:31 PM, Jenny Brown <[EMAIL PROTECTED]> wrote:
> Aha.  The final solution to this was reconfiguring JTidy (the first
>  step in my processing pipeline) to say:
>
>                 tidy.setXHTML(false);
>                 tidy.setXmlOut(false);
>
>  instead of saying:
>
>                 tidy.setXHTML(true);
>
>  Fixing that means JTidy no longer "pretty-prints" with an inserted
>  namespace, which means NekoHTML doesn't get a wrong namespace, which
>  means I avoid the eventual output problems.  And now my tags look
>  right:  <STRONG></STRONG> is being output.
>
>  Do I need to be concerned about this line showing up in my html source?
>  <!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
>
>  Or is that appropriate for a regular html file?
>
>  Thanks so much!  Code is working now.
>
>  Jenny Brown
>
>
>
>
>  On Thu, Apr 17, 2008 at 10:57 AM, Brian Minchau <[EMAIL PROTECTED]> wrote:
>  >
>  >  Hi Jenny.
>  >
>  >  Yes, Henry is right.
>  >
>  >
>  >  I don't know how I missed what you wrote:
>  >  > which results in browser bombs, and starts with:
>  >  > <HTML xmlns="http://www.w3.org/1999/xhtml" lang="en">
>  >
>  >  That default namespace forces this HTML element to be treated as XML.
>  >  Likewise for any other element that is in a non-null namespace.
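
To check whether a DOM tree is affected by the point above, one option is to walk it and list every element whose namespace URI is non-null. The following is a minimal sketch using only the JDK's DOM API; the class and method names (`NamespaceCheck`, `parse`, `namespacedElements`) are hypothetical and do not come from the thread or from Xalan-J.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
final class NamespaceCheck {

    /** Parses XML text into a namespace-aware DOM; checked exceptions are wrapped. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // without this, getNamespaceURI() is always null
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    /** Returns "name -> namespace URI" for every element that is in a non-null namespace. */
    static List<String> namespacedElements(Node root) {
        List<String> found = new ArrayList<>();
        collect(root, found);
        return found;
    }

    private static void collect(Node node, List<String> found) {
        if (node.getNodeType() == Node.ELEMENT_NODE && node.getNamespaceURI() != null) {
            found.add(node.getNodeName() + " -> " + node.getNamespaceURI());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, found);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<HTML xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"en\">"
                + "<BODY><P>This is some <STRONG/> paragraph text.</P></BODY></HTML>");
        namespacedElements(doc).forEach(System.out::println);
    }
}
```

Because a default namespace declaration is inherited, every descendant of the HTML element in this sample is reported, not just the root.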
>  >
>  >  - Brian
>  >
>  >  ----- Forwarded by Brian Minchau/Toronto/IBM on 04/17/2008 11:54 AM -----
>  >
>  >  From:    Henry Zongaro/Toronto/I[EMAIL PROTECTED]
>  >  Date:    04/17/2008 10:50 AM
>  >  To:      "Jenny Brown" <[EMAIL PROTECTED]>
>  >  cc:      xalan-j-users@xml.apache.org
>  >  Subject: Re: Trouble exporting HTML from a DOM in memory
>  >
>  >  Hi, Jenny.
>  >
>  >  "Jenny Brown" <[EMAIL PROTECTED]> wrote on 2008-04-16 09:27:44 PM:
>  >  > The main situation I'm having trouble with is empty tags.  For
>  >  > instance... my input file contains:
>  >  > <P>This is some <STRONG></STRONG> paragraph text.</P>
>  >  > <P>This is a textarea.  <TEXTAREA name="foo"></TEXTAREA>  It has text
>  >  > after it.</P>
>  >  >
>  >  > It gets into my in-memory dom tree okay.  But then when I try to use a
>  >  > transformer to output the html, instead I get this which Firefox
>  >  > chokes on:
>  >  > <P>This is some <STRONG/> paragraph text.</P>
>  >  > <P>This is a textarea.  <TEXTAREA name="foo"/> It has text after it.</P>
>  >  >
>  >  > [Snip]
>  >  >
>  >  > Transformer transformer = TransformerFactory.newInstance().newTransformer();
>  >  > transformer.setOutputProperty(OutputKeys.METHOD, "html");
>  >  > transformer.setOutputProperty(OutputKeys.MEDIA_TYPE, "text/html");
>  >  > transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
>  >  > transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
>  >  >
>  >  > [Snip]
>  >  >
>  >  > So, I'm trying to tell it to give me html, but what I get is a
>  >  > document that contains xml-like empty tags wherever the tag was empty,
>  >  > which results in browser bombs, and starts with:
>  >  > <HTML xmlns="http://www.w3.org/1999/xhtml" lang="en">
>  >
>  >  I think this is the key.  You have specified that you want to use the html
>  >  output method, but your output is really XHTML.  Because your output is in
>  >  an XML namespace, the serializer is required to serialize the output as
>  >  XML, despite the fact that you've used the html output method.  However,
>  >  to be displayed correctly in a browser, XHTML has to adhere to certain
>  >  lexical conventions that ordinary XML does not have to adhere to.
>  >
>  >  XSLT 1.0 does not define an xhtml output method, but Xalan-J does allow you
>  >  to give it a clue that what you're serializing is really XHTML.  If you add
>  >  the following output property, the serializer will emit empty tags using a
>  >  space before the trailing /> - thus, <STRONG />
>  >
>  >  transformer.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC,
>  >      "-//W3C//DTD XHTML 1.0 Transitional//EN");
>  >
>  >  That will probably help with a tag like <br> which is always supposed to be
>  >  empty - it will be serialized as <br /> - but probably not with STRONG and
>  >  TEXTAREA which happen to have no content in your DOM tree, but ordinarily
>  >  would have content.  They really should be serialized as <STRONG></STRONG>
>  >  rather than <STRONG />.  This issue has previously been reported as Jira
>  >  issue XALANJ-1906.[1]
>  >
>  >  In the meantime, you probably have a couple of options for working around
>  >  this issue:  one would be to recreate the DOM tree using elements that are
>  >  in no namespace rather than being in the XHTML namespace - then the html
>  >  output method would work properly; another would be to search the DOM tree
>  >  for elements that ordinarily have content but are actually empty, and give
>  >  them a single whitespace node child or remove them from the tree
>  >  entirely.  You could also write XSLT stylesheets to implement any of those
>  >  work-arounds; let us know if you'd like an example.
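
The two DOM-level work-arounds described above can be sketched in plain Java with the JDK's own DOM and transformer classes. This is only an illustration under stated assumptions: the class `HtmlExportWorkarounds`, its method names, and the `ALWAYS_EMPTY` list are hypothetical, and the list covers only a handful of the HTML 4 elements that are defined as empty.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class; none of these names come from Xalan-J or the thread.
final class HtmlExportWorkarounds {

    // Illustrative (not exhaustive) list of HTML 4 elements that are always empty.
    static final Set<String> ALWAYS_EMPTY = new HashSet<>(Arrays.asList(
            "area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"));

    /** Parses XML text into a namespace-aware DOM; only used to build sample input. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample input", e);
        }
    }

    /** Option 1: rebuild the tree so that every element is in no namespace. */
    static Document stripNamespaces(Document source) {
        try {
            Document target = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            target.appendChild(copy(source.getDocumentElement(), target));
            return target;
        } catch (Exception e) {
            throw new IllegalStateException("could not rebuild the tree", e);
        }
    }

    private static Node copy(Node node, Document target) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return target.importNode(node, true); // text, comments and so on are copied as-is
        }
        String name = node.getLocalName() != null ? node.getLocalName() : node.getNodeName();
        Element result = target.createElementNS(null, name); // null namespace URI
        NamedNodeMap attributes = node.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            Attr attr = (Attr) attributes.item(i);
            String attrName = attr.getName();
            if (!attrName.equals("xmlns") && !attrName.startsWith("xmlns:")) {
                result.setAttribute(attrName, attr.getValue()); // drops namespace declarations only
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            result.appendChild(copy(child, target));
        }
        return result;
    }

    /** Option 2: give childless elements that normally hold content one whitespace text child. */
    static void padEmptyElements(Node node) {
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            padEmptyElements(child);
        }
        if (node.getNodeType() == Node.ELEMENT_NODE && !node.hasChildNodes()
                && !ALWAYS_EMPTY.contains(node.getNodeName().toLowerCase(Locale.ROOT))) {
            node.appendChild(node.getOwnerDocument().createTextNode(" "));
        }
    }

    /** Serializes a DOM with the html output method, as in the snippet quoted earlier. */
    static String toHtml(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "html");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize the tree", e);
        }
    }
}
```

A typical use would be `toHtml(stripNamespaces(doc))` for the first option, or `padEmptyElements(doc)` followed by the existing serialization code for the second. With a serializer that follows the XSLT 1.0 html output method, the first option should write an empty STRONG element as <STRONG></STRONG>, since the elements are no longer in a namespace.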
>  >
>  >  Thanks,
>  >
>  >  Henry
>  >  [1] http://issues.apache.org/jira/browse/XALANJ-1906
>  >  ------------------------------------------------------------------
>  >  Henry Zongaro
>  >  XML Transformation & Query Development
>  >  IBM Toronto Lab   T/L 313-6044;  Phone +1 905 413-6044
>  >  mailto:[EMAIL PROTECTED]
>  >
>  >
>
