This looks just like:

    https://issues.apache.org/jira/browse/TIKA-801

Tika's parser is likely (incorrectly) producing invalid XHTML tags for
your document. When you open the Jira issue, can you attach the
problematic document?  Thanks.
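
To narrow this down before opening the issue, here is a minimal sketch of a
SAX handler that fails fast on unbalanced element events. It assumes the
NPE comes from an endElement call with no matching startElement, which the
trace suggests but does not prove. BalanceCheckingHandler is a hypothetical
name used for illustration; it is not part of Tika.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: tracks open elements and reports
// the first start/end mismatch instead of failing later in a serializer.
class BalanceCheckingHandler extends DefaultHandler {
    // Names of elements that have been started but not yet ended.
    private final Deque<String> open = new ArrayDeque<>();

    // Prefer the qualified name; fall back to the local name if it is empty.
    private static String name(String localName, String qName) {
        return (qName == null || qName.isEmpty()) ? localName : qName;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        open.push(name(localName, qName));
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        String closing = name(localName, qName);
        if (open.isEmpty()) {
            throw new SAXException(
                "endElement(" + closing + ") with no open element");
        }
        String expected = open.pop();
        if (!expected.equals(closing)) {
            throw new SAXException(
                "endElement(" + closing + ") but <" + expected + "> is open");
        }
    }

    @Override
    public void endDocument() throws SAXException {
        if (!open.isEmpty()) {
            throw new SAXException("unclosed elements at endDocument: " + open);
        }
    }
}
```

If the assumption holds, passing an instance of this as the ContentHandler
argument to the parser's parse call (in place of the GUI's serializer)
should raise a SAXException naming the offending tag.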

Mike McCandless

http://blog.mikemccandless.com

On Wed, Dec 7, 2011 at 1:31 PM, P. Hill <[email protected]> wrote:
> On 12/6/2011 6:50 PM, Nick Burch wrote:
>>
>> On Tue, 6 Dec 2011, P. Hill wrote:
>>>
>>>   at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
>>>   at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
>>> [rest of stack trace removed]
>>
>>
>> You've alas snipped the interesting bit, which is what the parser broke on
>
>
> Further note on the type of message: it was a many-level nested reply chain
> generated, I believe, by Outlook for all correspondents.  The attached PDF
> itself parses in all versions of tika-app.
>
> Wow, really?  You wanted to see the AWT call? Probably not, but here is the
> trace to swing followed by the cause.
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@6337bb9c
>    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>    at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
>    at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279)
>    at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
>    at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
>    at javax.swing.TransferHandler.importData(Unknown Source)
>
> OOPS Sorry I didn't see the cause way down there: :-)
>
> Caused by: java.lang.NullPointerException
>    at com.sun.org.apache.xml.internal.serializer.ToHTMLStream.endElement(Unknown Source)
>    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerHandlerImpl.endElement(Unknown Source)
>    at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>    at org.apache.tika.gui.TikaGUI$2.endElement(TikaGUI.java:519)
>    at org.apache.tika.sax.TeeContentHandler.endElement(TeeContentHandler.java:94)
>    at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>    at org.apache.tika.sax.SecureContentHandler.endElement(SecureContentHandler.java:256)
>    at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>    at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>    at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>    at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:273)
>    at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:213)
>    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:178)
>    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>
>
>> Try with a recent svn nightly build, and see if that fixes it. If not,
>> please post a problem file and the full stacktrace to a new issue in JIRA
>
>
> I will try to find time to check into that.
> -Paul
>
