Hi,

On Feb 12, 2008 12:22 PM, Julien Nioche <[EMAIL PROTECTED]> wrote:
> Congratulations first: I have been following Tika for a little bit now and
> am very happy to see a first release of it. Well done everybody!

Great to hear that, thanks!

> I am particularly interested in the project as we work on text analysis with
> GATE and UIMA. Obviously being able to extract text from different formats
> is crucial for what we do and so is the extraction of the markup
> information. That leads me to the following question: how difficult would it
> be to get the OfficeParser to generate information about the markup (pages,
> headers, tables, etc...)? I am not a POI expert at all, is this supported
> by it?

I think we should be able to do that, and since one of Tika's goals is to
support extraction of "structured text", doing that is right there in our
charter.

However, since Tika is supposed to be a generic tool, we probably don't want
to replicate the structure of any specific format in too much detail. You can
always use the format-specific parser libraries when you need the details.

My proposal would be to try to support at least the following basic
structural constructs in all parsers that have the required information:

    <div class="page"/>
    <h1/>
    <p/>
    <table/>
    <a/>

We could add more constructs based on existing demand.

> PS: I will probably go to the Apache EU conference. Anyone from the Tika
> community going there?

I'll be there and I think a few other people as well.

BR,

Jukka Zitting
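
PS: To make the proposed constructs a bit more concrete, here is a minimal
sketch of a parser emitting them as SAX events. It uses only the JDK's own
SAX and TrAX classes, not any Tika API. The class name StructuredXhtmlSketch,
the sample text, the tr/td rows inside the table and the href attribute on
the link are all assumptions made for illustration; the proposal above only
names the five top-level constructs.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical illustration only; not part of Tika.
public class StructuredXhtmlSketch {

    // Emit <name>text</name> without attributes.
    static void element(ContentHandler h, String name, String text)
            throws SAXException {
        h.startElement("", name, name, new AttributesImpl());
        h.characters(text.toCharArray(), 0, text.length());
        h.endElement("", name, name);
    }

    // Build one attribute list with a single name/value pair.
    static AttributesImpl attr(String name, String value) {
        AttributesImpl attrs = new AttributesImpl();
        attrs.addAttribute("", name, name, "CDATA", value);
        return attrs;
    }

    // Serialize one "page" that uses each of the five proposed constructs.
    static String render() throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer()
                .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        handler.startDocument();
        handler.startElement("", "div", "div", attr("class", "page"));
        element(handler, "h1", "Sample heading");
        element(handler, "p", "Sample paragraph.");

        // Table rows and cells are an assumption beyond the proposal.
        handler.startElement("", "table", "table", new AttributesImpl());
        handler.startElement("", "tr", "tr", new AttributesImpl());
        element(handler, "td", "cell");
        handler.endElement("", "tr", "tr");
        handler.endElement("", "table", "table");

        // The href attribute and URL are assumptions as well.
        handler.startElement("", "a", "a", attr("href", "http://example.com/"));
        handler.characters("link".toCharArray(), 0, 4);
        handler.endElement("", "a", "a");

        handler.endElement("", "div", "div");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render());
    }
}
```

A consumer such as a GATE or UIMA pipeline could then treat every format the
same way, for example by splitting on div elements with class "page",
without knowing anything about POI or the original file format.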