get markup information via ContentHandler for OfficeParser

Julien Nioche Tue, 12 Feb 2008 02:23:20 -0800

Hi,

Congratulations first: I have been following Tika for a little while now and
am very happy to see its first release. Well done everybody!


I am particularly interested in the project as we work on text analysis with
GATE and UIMA. Obviously being able to extract text from different formats
is crucial for what we do and so is the extraction of the markup
information. That leads me to the following question: how difficult would it
be to get the OfficeParser to generate information about the markup (pages,
headers, tables, etc.)? I am not a POI expert at all; is this supported
by it?

Thanks,

Julien

PS: I will probably go to the Apache EU conference. Anyone from the Tika
community going there?
<http://www.digitalpebble.com>
