Hmm...
OK, but when I call parse, my ContentHandler.characters() callback gets
a char[],
which is passed as:
(Pdb) ch
array('c', '\xa9 2010 Crane Data LLC. All rights reserved.')
So when I try to convert it to unicode, I get an error:
(Pdb) ch.tounicode()
*** ValueError: tounicode() may only be called on type 'u' arrays
So it seems that I'm not actually getting a unicode string here: the array has
type 'c' (single bytes), not 'u'. When I try to decode it with various codecs,
I get errors. One question: what is the standard codec name for "UCS-2"? When I
try to use that name, it fails. Is it a subset of utf-16?
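
For reference, here is a minimal sketch of the decoding attempts, written in
plain Python 3 rather than the Jython 2.x session above. It assumes the lone
byte 0xA9 is meant to be the copyright sign, which would point at Latin-1 (or
cp1252) rather than UTF-8; that is a guess from the pdb output, not something
the parser reports. The helper name try_decode is made up for illustration.

```python
# The bytes as they appear in the pdb output: one byte per character.
raw = b"\xa9 2010 Crane Data LLC. All rights reserved."


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Assumption: 0xA9 is the copyright sign, as it is in Latin-1 / cp1252.
latin1_text = try_decode(raw, "latin-1")

# A bare 0xA9 is not a valid UTF-8 start byte, so this attempt fails.
utf8_text = try_decode(raw, "utf-8")

# For text made only of BMP characters, UTF-16 uses exactly two bytes per
# character, which is the fixed-width layout that UCS-2 describes.
utf16_bytes = latin1_text.encode("utf-16-le")
round_trip = try_decode(utf16_bytes, "utf-16-le")

print(repr(latin1_text))
print(repr(utf8_text))
print(len(utf16_bytes), len(latin1_text))
```

If the Latin-1 guess is right, the text was probably narrowed to bytes
somewhere on the way into the callback, which would explain why tounicode()
refuses the array.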
-- Shaun
On Dec 20, 2010, at 5:10 AM, Nick Burch wrote:
> On Mon, 20 Dec 2010, Shaun Cutts wrote:
>> As you are being used for scraping purposes, however, you should probably be
>> able to read anything excel can write, including inconsistent unicode. (If
>> it is inconsistent -- I note that I don't receive a "processingInstruction"
>> callback to write the document encoding type from the parser. Are you
>> assuming anything about the text encoding that might not be always valid
>> even in a properly formed excel file?)
>
> Excel stores strings in one of two formats, basically US-ASCII or UCS-2. POI,
> which is the library Tika uses internally, handles all that for you.
> Everything you get as Java strings ought to be correctly handled as regular
> Java unicode strings.
>
> Nick