Re: get markup information via ContentHandler for OfficeParser

Jukka Zitting Sun, 17 Feb 2008 01:33:34 -0800

Hi,

On Feb 12, 2008 12:22 PM, Julien Nioche <[EMAIL PROTECTED]> wrote:
> Congratulations first: I have been following Tika for a little bit now and
> am very happy to see a first release of it. Well done everybody!


Great to hear that, thanks!

> I am particularly interested in the project as we work on text analysis with
> GATE and UIMA. Obviously being able to extract text from different formats
> is crucial for what we do and so is the extraction of the markup
> information. That leads me to the following question: how difficult would it
> be to get the OfficeParser to generate information about the markup (pages,
> headers, tables, etc...)? I am not a POI expert at all, is this supported
> by it?

I think we should be able to do that, and since one of Tika's goals is
to support extraction of "structured text", doing that is right there
on our charter. However, since Tika is supposed to be a generic tool,
we probably don't want to replicate the structure of any specific
format in too much detail. You can always use the specific parser
libraries for details.

My proposal would be to try to support at least the following basic
structural constructs in all parsers that have the required
information:

    <div class="page"/>
    <h1/>
    <p/>
    <table/>
    <a/>

We could add more constructs based on existing demand.
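
As a rough illustration of the idea, here is a minimal sketch of a parser
emitting a page with a heading and a paragraph as SAX events to a
ContentHandler. Everything in it (the class StructuredTextSketch, the helper
methods, the Recorder handler) is made up for this example and is not existing
Tika or POI code; it only assumes the standard org.xml.sax API and the XHTML
namespace.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not part of Tika: it only shows the structural events.
class StructuredTextSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Opens an XHTML element, optionally with a class attribute.
    static void start(ContentHandler h, String name, String cssClass) throws SAXException {
        AttributesImpl atts = new AttributesImpl();
        if (cssClass != null) {
            atts.addAttribute("", "class", "class", "CDATA", cssClass);
        }
        h.startElement(XHTML, name, name, atts);
    }

    // Closes an XHTML element.
    static void end(ContentHandler h, String name) throws SAXException {
        h.endElement(XHTML, name, name);
    }

    // Emits an element that contains only character data.
    static void textElement(ContentHandler h, String name, String text) throws SAXException {
        start(h, name, null);
        h.characters(text.toCharArray(), 0, text.length());
        end(h, name);
    }

    // Emits one page: <div class="page"> wrapping an <h1> and a <p>.
    static void emitPage(ContentHandler h, String heading, String paragraph) throws SAXException {
        start(h, "div", "page");
        textElement(h, "h1", heading);
        textElement(h, "p", paragraph);
        end(h, "div");
    }

    // Stand-in consumer that records the events as a flat string of tags.
    static class Recorder extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            out.append('<').append(local);
            for (int i = 0; i < atts.getLength(); i++) {
                out.append(' ').append(atts.getLocalName(i))
                   .append("=\"").append(atts.getValue(i)).append('"');
            }
            out.append('>');
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            out.append("</").append(local).append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    public static void main(String[] args) throws SAXException {
        Recorder r = new Recorder();
        r.startDocument();
        emitPage(r, "Title", "Body text");
        r.endDocument();
        // prints <div class="page"><h1>Title</h1><p>Body text</p></div>
        System.out.println(r.out);
    }
}
```

A client interested in the markup would then only need to watch for these
few element names in its own ContentHandler, whatever the source format was.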

> PS: I will probably go to the Apache EU conference. Anyone from the Tika
> community going there?

I'll be there and I think a few other people as well.

BR,

Jukka Zitting
