Nick,

The other thing is that Tika calls the characters() callback without "&<>" 
escaped. I can work around this, but I don't know if that conforms to SAX?

-- Shaun
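
On the escaping question: as far as I know this is what SAX intends. characters() reports character data after parsing, so "&amp;" arrives as "&", and re-escaping is left to whatever serializes the text back to XML or HTML. A minimal sketch using Python's own xml.sax rather than Tika (the handler class and helper name are made up for illustration):

```python
import xml.sax
from xml.sax.saxutils import escape


class CollectText(xml.sax.ContentHandler):
    """Hypothetical handler: stores every chunk passed to characters()."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser may deliver the text in several pieces.
        self.chunks.append(content)


def text_seen_by_handler(xml_bytes):
    """Parse xml_bytes and return the text the handler received."""
    handler = CollectText()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.chunks)


# The input contains escaped markup characters...
raw = text_seen_by_handler(b"<p>AT&amp;T &lt;b&gt;</p>")

# ...but the handler sees them unescaped, so escape again before
# writing the text into XML or HTML output.
escaped = escape(raw)
```

So the workaround of escaping inside the handler (or in the code that writes the output) looks like the expected pattern rather than a hack.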

On Dec 21, 2010, at 5:28 AM, Nick Burch wrote:

> On Tue, 21 Dec 2010, Shaun Cutts wrote:
>> ok, but when I call parse, my ContentHandler.characters() callback 
>> gets a char[], and this is passed as:
>> 
>> (Pdb) ch
>> array('c', '\xa9 2010 Crane Data LLC. All rights reserved.')
>> 
>> so when I try unicode I get an error:
>> 
>> (Pdb) ch.tounicode()
>> *** ValueError: tounicode() may only be called on type 'u' arrays
> 
> You sure there isn't a problem with your python-java bridge? All Java strings 
> are always unicode
> 
>> So it would seem to me that in fact I'm not getting a unicode string here. 
>> When I try to decode in various codecs, I get problems. One question is what 
>> is the standard name for "UCS-2" -- as when I try to use that name it fails; 
>> is it a subset of utf-16?
> 
> UCS-2 is a predecessor to UTF-16. It doesn't handle supplementary code 
> points, so it can't hold the whole of the unicode range.
> http://en.wikipedia.org/wiki/UTF-16/UCS-2
> 
> Nick
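
A rough sketch of the decoding question in the quoted exchange, written for Python 3. It rests on an assumption that nothing in the thread confirms: that the bridge narrows each Java char to a single byte, which would make the array effectively Latin-1. The single byte '\xa9' standing in for the copyright sign is consistent with that guess, but it is only a guess. The helper names are made up for illustration.

```python
# Bytes as shown in the pdb session (a Python 2 array('c') holds bytes).
raw = b"\xa9 2010 Crane Data LLC. All rights reserved."


def decode_narrowed_chars(data):
    """ASSUMPTION: one byte per Java char, all below U+0100 (Latin-1)."""
    return data.decode("latin-1")


def is_valid_utf8(data):
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def utf16_code_units(s):
    """Number of 16-bit code units s needs in UTF-16."""
    return len(s.encode("utf-16-be")) // 2


text = decode_narrowed_chars(raw)

# A lone 0xA9 byte is not valid UTF-8, which would explain errors
# when trying that codec on this data.
looks_like_utf8 = is_valid_utf8(raw)

# Characters in the Basic Multilingual Plane take one code unit, the
# part UCS-2 could represent; supplementary characters take two.
bmp_units = utf16_code_units("\u00a9")               # copyright sign
supplementary_units = utf16_code_units("\U0001D11E")  # musical G clef
```

For text limited to the Basic Multilingual Plane, the "utf-16-le" or "utf-16-be" codecs should give the same result a UCS-2 decoder would, so one of those is probably the name to use if the bridge ever does hand over two bytes per char.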
