Hi,

Congratulations first: I have been following Tika for a little while now and am very happy to see its first release. Well done, everybody!

I am particularly interested in the project because we work on text analysis with GATE and UIMA. Being able to extract text from different formats is crucial for what we do, and so is extracting the markup information.

That leads me to the following question: how difficult would it be to get the OfficeParser to generate information about the markup (pages, headers, tables, etc.)? I am not a POI expert at all; is this supported by it?

Thanks,

Julien

PS: I will probably go to the Apache EU conference. Is anyone from the Tika community going there?

<http://www.digitalpebble.com>