On Tue, 24 Apr 2012, Gabriel Valencia wrote:
We have been using Tika to parse iWork files, but have found many
limitations and potentially bugs.

At the moment, there aren't any suitable iWork file format libraries in Java for us to use, so what we do have has come from community contributions. There's a very strong chance that it only handles the kinds of files that the people who volunteered to work on it happened to have!

(The parsers all work by doing SAX parsing of the main content of the iWork file, mapping parts of the file's XML to XHTML where possible)
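
To give a feel for that approach, here is a minimal sketch using only the JDK's SAX classes. It is not the actual Tika parser code: the class name ParagraphTextHandler is made up for illustration, and the element name passed in ("p" in the usage below) is a placeholder assumption rather than the real iWork schema. It collects the text inside each matching element and adds a line break when the element closes, so that text from adjacent elements is not run together.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Collects the text inside every element with a given name, one line per element. */
final class ParagraphTextHandler extends DefaultHandler {
    // Hypothetical element name supplied by the caller; not taken from the iWork schema.
    private final String paragraphElement;
    private final StringBuilder text = new StringBuilder();
    private int depth = 0; // greater than zero while inside a matching element

    ParagraphTextHandler(String paragraphElement) {
        this.paragraphElement = paragraphElement;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals(paragraphElement)) {
            depth++;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only keep character data that sits inside a matching element.
        if (depth > 0) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals(paragraphElement)) {
            depth--;
            text.append('\n'); // separator between consecutive elements
        }
    }

    String text() {
        return text.toString();
    }

    /** Convenience wrapper: parse an XML string and return the collected text. */
    static String extract(String xml, String paragraphElement) throws Exception {
        ParagraphTextHandler handler = new ParagraphTextHandler(paragraphElement);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.text();
    }
}
```

For example, ParagraphTextHandler.extract("<doc><box><p>One</p></box><box><p>Two</p></box></doc>", "p") yields "One" and "Two" on separate lines. A real parser would emit XHTML events rather than build a string.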

Here is a sampling:
* Things like header and footer text and embedded text boxes are not
parsed.
* Pages docs created in Layout mode are not parsed at all. Only the
metadata is extracted.
* Text box text in Keynote slides is extracted, but all of the text of all
the boxes is lumped together without any spaces.

As a first step, are you able to produce some simple sample files that show these different document features? If so, please create new bugs in JIRA and upload them.

Step two is probably to write some (initially failing) unit tests to show what text you'd expect to get out.

Step three is to unzip your sample files (iWork files are a zip of XML + images), and hunt around for your missing text. For each case, try to work out where in the file structure the text is held, and whether it's inline with the surrounding content or stored off elsewhere with a pointer.
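
If unzipping by hand gets tedious, something along these lines could do the hunting for you. It is only a sketch, assuming the sample really is a plain zip: ZipTextFinder and entriesContaining are made-up names, and it does a naive UTF-8 substring match, so text that is escaped or split across XML elements will not be found.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class ZipTextFinder {
    /** Returns the names of the zip entries whose bytes, read as UTF-8, contain the needle. */
    static List<String> entriesContaining(Path zipPath, String needle) throws IOException {
        List<String> hits = new ArrayList<>();
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                try (InputStream in = zip.getInputStream(entry)) {
                    // Reads the whole entry into memory; fine for small sample files.
                    String content = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                    if (content.contains(needle)) {
                        hits.add(entry.getName());
                    }
                }
            }
        }
        return hits;
    }
}
```

Calling ZipTextFinder.entriesContaining(Path.of("sample.pages"), "my footer text") (with a sample file name of your own) lists the entries worth opening in an editor to see how the text is stored.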

Finally, step four is to update the parsers based on this information, and upload the patch!

* Password protected files throw an NPE.

Are you able to check if these use Zip-based password protection, or something else? A sample file that's small and has a known password would be handy.
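
One quick way to check, sketched below with made-up names (ZipEncryptionCheck, inspect): in the zip format, each local file header starts with the signature bytes "PK" 0x03 0x04, followed by a 2-byte version and a 2-byte little-endian general purpose flag, and bit 0 of that flag marks an encrypted entry. The sketch only looks at the first header in the file, so treat its answer as a hint rather than proof; a file that encrypts its content some other way would show up as PLAIN or NO_LOCAL_HEADER.

```java
import java.io.IOException;
import java.io.InputStream;

final class ZipEncryptionCheck {
    enum Result { NO_LOCAL_HEADER, PLAIN, ENCRYPTED }

    /** Inspects the first 8 bytes of the stream as a zip local file header. */
    static Result inspect(InputStream in) throws IOException {
        byte[] h = new byte[8];
        int n = in.readNBytes(h, 0, h.length);
        // Local file header signature: 'P' 'K' 0x03 0x04.
        if (n < h.length || h[0] != 'P' || h[1] != 'K' || h[2] != 3 || h[3] != 4) {
            return Result.NO_LOCAL_HEADER;
        }
        // Bytes 6-7 hold the general purpose bit flag, little-endian.
        int flags = (h[6] & 0xFF) | ((h[7] & 0xFF) << 8);
        // Bit 0 set means the first entry uses zip-level encryption.
        return (flags & 1) != 0 ? Result.ENCRYPTED : Result.PLAIN;
    }
}
```

Running it over a file is just a matter of passing Files.newInputStream(path) to inspect.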

Is there any work in progress or planned to improve the parsing of iWork
files?

I think you may have just volunteered yourself ;-)

Cheers
Nick
