Hi,
I think there is a bug in the RTF parser. When parsing RTF (generated by
dragging an Outlook MSG file to the desktop and read in using POI's
MAPIMessage.getRtfBody()), I get an endElement(qName='title') AFTER
endElement(qName='head'). It should come before the head element is closed.
Note: I'm using XmlBeans Sax2Dom to process the SAX events, as my goal is to
convert RTF to HTML.
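Example code:
String rtf = ...; // bunch of RTF from the Outlook MSG
Sax2Dom sax2dom = new Sax2Dom();
Metadata metadata = new Metadata();
// The handler forwards the parser's SAX events to Sax2Dom, which builds the DOM.
XHTMLContentHandler handler = new XHTMLContentHandler(sax2dom, metadata);
new RTFParser().parse(new StringInputStream(rtf), handler, metadata,
        new ParseContext());
Node html = sax2dom.getDOM();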
The resulting HTML is malformed, for example:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>
<title></title>
</title>
<body></body> <!-- BAD FORMATTING! Should be </head><body>!! -->
<p>Message body text</p>
<p> </p>
<p>.etc.</p>
</head> <!-- BAD -- should be </body> -->
</html>
I can 'fix' this issue by creating a wrapper class that ignores all
startElement / endElement calls with qName='title', but that's not a real
solution :D
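For reference, the wrapper workaround could look roughly like the sketch below. It is only an illustration: TitleSkippingHandler and TitleSkippingDemo are hypothetical names of my own, the sketch uses only the standard org.xml.sax classes, and it drops just the title start/end events (any character data inside the title would still be forwarded).

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical workaround class (not part of any library): forwards every SAX
// event to the wrapped handler except the start/end events for <title>.
class TitleSkippingHandler extends XMLFilterImpl {

    TitleSkippingHandler(ContentHandler delegate) {
        setContentHandler(delegate);
    }

    private static boolean isTitle(String localName, String qName) {
        return "title".equalsIgnoreCase(qName) || "title".equalsIgnoreCase(localName);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (!isTitle(localName, qName)) {
            super.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (!isTitle(localName, qName)) {
            super.endElement(uri, localName, qName);
        }
    }
}

// Small demo: records the element events that survive the filter.
class TitleSkippingDemo {
    public static void main(String[] args) throws SAXException {
        StringBuilder seen = new StringBuilder();
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                seen.append('<').append(qName).append('>');
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                seen.append("</").append(qName).append('>');
            }
        };
        ContentHandler filtered = new TitleSkippingHandler(recorder);
        Attributes none = new AttributesImpl();
        filtered.startElement("", "head", "head", none);
        filtered.startElement("", "title", "title", none);
        filtered.endElement("", "title", "title");
        filtered.endElement("", "head", "head");
        System.out.println(seen); // title events are dropped, head events pass through
    }
}
```

In the example code above, the wrapper would sit between the parser and the existing handler, i.e. the handler passed to parse() would be new TitleSkippingHandler(handler). This only hides the symptom; the event ordering in the parser output is still wrong.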
Another issue is that embedded <img> tags are not emitted from the RTF, such
as this one:
{\*\htmltag84 <img width=142 height=59 id="Picture_x0020_1"
src="cid:[email protected]" alt="Entertainment Information">}
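Until the parser emits these, a rough stopgap might be to pull the <img> tags straight out of the RTF text. The sketch below assumes the tags always sit in a {\*\htmltagNN ...} group shaped exactly like the one above, with no RTF escapes or nested braces inside; HtmlTagImgExtractor is a hypothetical helper name, not an existing API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of any library).
class HtmlTagImgExtractor {

    // Matches groups of the form {\*\htmltagNN <img ...>} and captures the <img ...> tag.
    // Assumption: no '}' and no RTF escape sequences occur inside the tag.
    private static final Pattern IMG_GROUP =
            Pattern.compile("\\{\\\\\\*\\\\htmltag\\d+\\s*(<img\\b[^}]*>)\\s*\\}");

    // Returns every captured <img ...> tag, in document order.
    static List<String> extractImgTags(String rtf) {
        List<String> tags = new ArrayList<>();
        Matcher m = IMG_GROUP.matcher(rtf);
        while (m.find()) {
            tags.add(m.group(1));
        }
        return tags;
    }
}
```

Calling HtmlTagImgExtractor.extractImgTags(rtf) on the RTF string would return the raw tags, which would then still have to be placed into the generated HTML by hand.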
I could upload an example Java class, but I'm not sure whether attachments
are allowed on this mailing list.
Thanks!
David Van Camp | Software Engineer III | 40 Media Drive, Queensbury, NY 12804
Toll Free: 800.833-9581 Ext 2145 | Web: TribuneMediaServices.com | Email:
[email protected]