On Mon, 20 Dec 2010, Shaun Cutts wrote:
As you are being used for scraping purposes, however, you should probably be able to read anything Excel can write, including inconsistent Unicode. (If it is inconsistent -- I note that I don't receive a "processingInstruction" callback from the parser with which to write the document encoding. Are you assuming anything about the text encoding that might not always be valid, even in a properly formed Excel file?)

Excel stores strings in one of two formats, basically US-ASCII or UCS-2. POI, which is the library Tika uses internally, handles all that for you. Everything you get back as Java strings ought to be correctly decoded, regular Java Unicode strings.
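
To illustrate the idea (a minimal sketch using only the JDK, not POI's actual code): the class name, the `decode` helper and its `wide` flag below are hypothetical, and using little-endian UTF-16 to stand in for UCS-2 is an assumption. The point is only that both storage forms end up as the same ordinary Java string.

```java
import java.nio.charset.StandardCharsets;

public class ExcelStringSketch {
    // Hypothetical helper: turns raw stored bytes into a Java String.
    // wide == false -> one byte per character (treated here as US-ASCII)
    // wide == true  -> two bytes per character (treated here as UTF-16LE,
    //                  an assumed stand-in for UCS-2)
    static String decode(byte[] raw, boolean wide) {
        return new String(raw, wide ? StandardCharsets.UTF_16LE
                                    : StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] narrow = {0x54, 0x69, 0x6B, 0x61};             // "Tika", 1 byte/char
        byte[] wide = {0x54, 0, 0x69, 0, 0x6B, 0, 0x61, 0};   // "Tika", 2 bytes/char

        String a = decode(narrow, false);
        String b = decode(wide, true);

        // Once decoded, the original storage format no longer matters.
        System.out.println(a.equals(b));

        // An output encoding is only chosen when the string is serialized.
        byte[] utf8 = b.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);
    }
}
```

A Java string carries no encoding of its own, so there is nothing encoding-specific to pass along at that stage; the choice of encoding is made by whatever writes the strings out.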

Nick
