Hi,

First, congratulations: I have been following Tika for a little while now and
am very happy to see its first release. Well done, everybody!

I am particularly interested in the project as we work on text analysis with
GATE and UIMA. Obviously being able to extract text from different formats
is crucial for what we do and so is the extraction of the markup
information. That leads me to the following question: how difficult would it
be to get the OfficeParser to generate information about the markup (pages,
headers, tables, etc.)? I am not a POI expert at all; is this supported
by it?

Thanks,

Julien

PS: I will probably go to the Apache EU conference. Anyone from the Tika
community going there?
<http://www.digitalpebble.com>