Nick,

As I mentioned, I wrote out my own XHTML using the Tika event stream. When I tried to parse it, I got:
XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
Bytes: 0xA9 0x20 0x32 0x30, line 3581, column 17

(I'm using Python lxml, which wraps libxml2.) So for me this makes it more probable that the problems in the core are caused by a character-encoding issue. Since Tika is used for scraping, however, it should probably be able to read anything Excel can write, including inconsistent Unicode. (If it is inconsistent -- I note that I don't receive a "processingInstruction" callback from the parser with which to write the document's encoding declaration. Are you assuming anything about the text encoding that might not always hold, even in a properly formed Excel file?)

Thanks,

-- Shaun

On Dec 19, 2010, at 7:43 PM, Nick Burch wrote:

> On Fri, 17 Dec 2010, Shaun Cutts wrote:
>> Caused by: java.lang.NullPointerException
>>   at com.sun.org.apache.xml.internal.serializer.ToStream.writeAttrString(ToStream.java:1962)
>>   at com.sun.org.apache.xml.internal.serializer.ToStream.processAttributes(ToStream.java:1942)
>
> This doesn't look like the sort of code that should be giving problems...
>
> Can you try with some other Excel files and see if they work though? If they
> do, any chance you could upload the problem file to JIRA so we can try to
> track down why the core JVM XML code is null pointering?
>
> Nick
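P.S. For what it's worth, the bytes in the error message are consistent with a Latin-1/cp1252 copyright sign leaking into output that's declared (or assumed) to be UTF-8. A quick stdlib-only check -- no Tika or lxml involved; the byte values are just copied from the error above:

```python
# The bytes reported by libxml2: 0xA9 0x20 0x32 0x30
raw = b"\xa9 20"

# 0xA9 on its own is not a valid UTF-8 sequence...
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print("not UTF-8:", e.reason)  # invalid start byte

# ...but in cp1252 (and Latin-1) it's the copyright sign,
# which is exactly the kind of character Excel text often contains.
print(raw.decode("cp1252"))  # -> '© 20'
```

If the source cell really did contain "© 20..." and it was emitted as raw cp1252 bytes, that would explain both the libxml2 failure and, plausibly, odd behaviour further down the pipeline.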
