Nick,

As I mentioned, I wrote out my own XHTML using the Tika event stream. When I tried to parse it, I got:
XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
Bytes: 0xA9 0x20 0x32 0x30, line 3581, column 17

(I'm using Python lxml, which wraps libxml2.) So for me this makes it more probable that the problems in the core are caused by a character-encoding issue. Since Tika is used for scraping, however, it should probably be able to read anything Excel can write, including inconsistent Unicode. (If it is inconsistent -- I note that I don't receive a "processingInstruction" callback from the parser with which to write the document's encoding declaration. Are you assuming anything about the text encoding that might not always hold, even in a properly formed Excel file?)

Thanks,

-- Shaun

On Dec 19, 2010, at 7:43 PM, Nick Burch wrote:

> On Fri, 17 Dec 2010, Shaun Cutts wrote:
>> Caused by: java.lang.NullPointerException
>>   at com.sun.org.apache.xml.internal.serializer.ToStream.writeAttrString(ToStream.java:1962)
>>   at com.sun.org.apache.xml.internal.serializer.ToStream.processAttributes(ToStream.java:1942)
>
> This doesn't look like the sort of code that should be giving problems...
>
> Can you try with some other Excel files and see if they work though? If they
> do, any chance you could upload the problem file to JIRA so we can try to
> track down why the core JVM XML code is null pointering?
>
> Nick
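P.S. For what it's worth, the bytes in the error message are consistent with a Latin-1/cp1252 copyright sign leaking into output that's declared (or assumed) to be UTF-8. A quick stdlib-only check -- no Tika or lxml involved; the byte values are just copied from the error above:

```python
# The bytes reported by libxml2: 0xA9 0x20 0x32 0x30
raw = b"\xa9 20"

# 0xA9 on its own is not a valid UTF-8 sequence...
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print("not UTF-8:", e.reason)  # invalid start byte

# ...but in cp1252 (and Latin-1) it's the copyright sign,
# which is exactly the kind of character Excel text often contains.
print(raw.decode("cp1252"))  # -> '© 20'
```

If the source cell really did contain "© 20..." and it was emitted as raw cp1252 bytes, that would explain both the libxml2 failure and, plausibly, odd behaviour further down the pipeline.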
