On Mon, 20 Dec 2010, Shaun Cutts wrote:
As you are being used for scraping purposes, however, you should probably be able to read anything Excel can write, including inconsistent Unicode. (If it is inconsistent -- I note that I don't receive a "processingInstruction" callback to write the document encoding type from the parser. Are you assuming anything about the text encoding that might not always be valid, even in a properly formed Excel file?)
Excel stores strings in one of two formats, basically US-ASCII or UCS-2. POI, which is the library Tika uses internally, handles all that for you. Everything you get as Java strings ought to be correctly handled as regular Java Unicode strings.
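
To illustrate the idea (this is only a rough sketch using the plain JDK, not POI's actual code): whichever of the two storage forms the bytes came from, once decoded they are ordinary Java strings. The class and method names below are made up for the example, and treating the one-byte form as ISO-8859-1 (a superset of US-ASCII) and the two-byte form as little-endian UTF-16 are assumptions.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; POI does the equivalent of this internally.
public class ExcelStringDecoding {

    // One byte per character (assumed here to be ISO-8859-1).
    static String decodeCompressed(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Two bytes per character, little-endian (assumed here to be UTF-16LE).
    static String decodeWide(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] narrow = {0x41, 0x42, 0x43};            // "ABC", one byte each
        byte[] wide = {0x41, 0x00, (byte) 0xA9, 0x03}; // "A" followed by U+03A9

        String a = decodeCompressed(narrow);
        String b = decodeWide(wide);

        // Both results are plain java.lang.String values; no encoding
        // information needs to travel with them after this point.
        System.out.println(a.equals("ABC"));
        System.out.println(b.equals("A\u03A9"));
    }
}
```

So by the time Tika hands you text, there is no separate document encoding left to report.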
Nick
