Hmm...

OK, but when I call parse, my ContentHandler.characters() callback gets 
a char[],
which is passed as:

(Pdb) ch
array('c', '\xa9 2010 Crane Data LLC. All rights reserved.')

so when I try to convert it to unicode I get an error:

(Pdb) ch.tounicode()
*** ValueError: tounicode() may only be called on type 'u' arrays

So it would seem to me that I'm not in fact getting a unicode string here. When 
I try to decode it with various codecs, I get problems. One question: what is the 
standard codec name for "UCS-2"? When I try to use that name it fails. Is it a 
subset of UTF-16?
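
For reference, here is a minimal sketch of how those bytes behave under a few codecs. It assumes the array really holds single-byte Latin-1 (or cp1252) text, which is only a guess based on the leading \xa9; the variable names are illustrative. In the Python 2 session above, ch.tostring() should give the raw byte string to start from. On the naming question: Python has no separate codec for UCS-2, but for characters in the Basic Multilingual Plane the UCS-2 code units are identical to UTF-16, so the "utf-16-le" and "utf-16-be" codecs can be used for it.

```python
# The bytes shown in the pdb session: 0xA9 followed by plain ASCII text.
raw = b"\xa9 2010 Crane Data LLC. All rights reserved."

# Assumption: the data is Latin-1. Byte 0xA9 then maps to U+00A9 (copyright sign).
text = raw.decode("latin-1")

# A lone 0xA9 is not valid UTF-8, so decoding as UTF-8 fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# UCS-2 is fixed-width, two bytes per character, and matches UTF-16 for
# BMP characters, so the UTF-16 codecs round-trip this text.
ucs2_le = text.encode("utf-16-le")
round_trip = ucs2_le.decode("utf-16-le")

print(repr(text))
print("valid UTF-8:", utf8_ok)
print("bytes per character as UTF-16-LE:", len(ucs2_le) // len(text))
```

If the bytes arriving in characters() were actually UCS-2 or UTF-16, every ASCII character would be followed (or preceded) by a zero byte, which is not what the pdb output shows.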

-- Shaun


On Dec 20, 2010, at 5:10 AM, Nick Burch wrote:

> On Mon, 20 Dec 2010, Shaun Cutts wrote:
>> As you are being used for scraping purposes, however, you should probably be 
>> able to read anything excel can write, including inconsistent unicode. (If 
>> it is inconsistent -- I note that I don't receive a "processingInstruction" 
>> callback to write the document encoding type from the parser. Are you 
>> assuming anything about the text encoding that might not be always valid 
>> even in a properly formed excel file?)
> 
> Excel stores strings in one of two formats, basically US-ASCII or UCS-2. POI, 
> which is the library Tika uses internally, handles all that for you. 
> Everything you get as Java strings ought to be correctly handled as regular 
> Java unicode strings.
> 
> Nick
