Re: Limitations of iWork parsing

Nick Burch Wed, 25 Apr 2012 04:24:40 -0700

On Tue, 24 Apr 2012, Gabriel Valencia wrote:

We have been using Tika to parse iWork files, but have found many
limitations and potentially bugs.

At the moment, there aren't any suitable iWorks file format libraries in Java for us to use. So, what we do have has come from community contributions. There's a very strong chance that it'll only handle the kinds of files that the people who volunteered to work on it had!

(The parsers all work by doing SAX parsing of the main content of the iWorks file, mapping from parts of the file's XML to XHTML as possible)
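
To make that approach concrete, here is a minimal sketch of SAX-based mapping using only the JDK. It is not the actual Tika parser code: the class name, the input element name `para`, and the sample XML are all made up for illustration, and real iWork XML uses its own namespaced elements. The point is that any element the handler does not know about simply contributes nothing to the output, which is how text can go missing.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxToXhtmlSketch {
    // Hypothetical source element name, chosen for illustration only.
    static final String PARAGRAPH = "para";

    /** Maps each <para> in the input to an XHTML <p>; everything else is dropped. */
    static String toXhtml(String xml) throws Exception {
        StringBuilder out = new StringBuilder("<body>");
        DefaultHandler handler = new DefaultHandler() {
            private boolean inParagraph = false;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (PARAGRAPH.equals(qName)) {
                    inParagraph = true;
                    out.append("<p>");
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (PARAGRAPH.equals(qName)) {
                    inParagraph = false;
                    out.append("</p>");
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (!inParagraph) {
                    return; // text outside a known element is silently lost
                }
                for (int i = start; i < start + length; i++) {
                    char c = ch[i];
                    // Escape the characters that are not allowed raw in XHTML text.
                    if (c == '&') out.append("&amp;");
                    else if (c == '<') out.append("&lt;");
                    else if (c == '>') out.append("&gt;");
                    else out.append(c);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return out.append("</body>").toString();
    }
}
```

With an input such as `<doc><title>T</title><para>Hello</para></doc>`, the `title` text never reaches the output because no mapping exists for it.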

Here is a sampling:
* Things like header and footer text and embedded text boxes are not
parsed.
* Pages docs created in Layout mode are not parsed at all. Only the
metadata is extracted.
* Text box text in Keynote slides is extracted, but all of the text of all
the boxes is lumped together without any spaces.

As a first step, are you able to produce some simple sample files that show these different document features? If so, please create new bugs in JIRA and upload them.

Step two is probably to write some (initially failing) unit tests to show what text you'd expect to get out.

Step three is to unzip your sample files (iWorks are a zip of xml + images), and hunt around for your missing text. For each case, try to work out where in the file structure the text is held, and if it's inline with the surrounding content or stored off elsewhere with a pointer.
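
If hunting by hand gets tedious, something along these lines can narrow it down. This is only a rough sketch using `java.util.zip` from the JDK; the class and method names are invented here, and it assumes the text you are looking for is stored as plain UTF-8 inside the zip entries (it will not find text that is split across elements or stored in some other encoding).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class FindTextInZip {
    /** Returns the names of all zip entries whose content contains the given text. */
    static List<String> entriesContaining(Path zipPath, String needle) throws IOException {
        List<String> hits = new ArrayList<>();
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                // Every entry is searched, not just *.xml, since the main
                // content file may use a different extension.
                try (InputStream in = zip.getInputStream(entry)) {
                    String content = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                    if (content.contains(needle)) {
                        hits.add(entry.getName());
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java FindTextInZip <sample file> <text that went missing>
        for (String name : entriesContaining(Path.of(args[0]), args[1])) {
            System.out.println(name);
        }
    }
}
```

Once you know which entry holds the text, opening that entry and looking at the elements around the match should show whether it is inline or referenced from elsewhere.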

Finally, step four is to update the parsers based on this information, and upload the patch!

* Password protected files throw an NPE.

Are you able to check if these use Zip based password protection, or something else? A sample file that's small and has a known password would be handy.
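
One quick way to check for Zip based protection is to look at the general purpose bit flag in the first local file header: in the zip format, bit 0 of that flag marks an entry as encrypted. The sketch below reads just those bytes. The class and method names are invented for illustration, and it only inspects the first entry, so a negative result suggests (but does not prove) that the protection is applied some other way.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class ZipEncryptionCheck {
    // Local file header signature "PK\3\4", read as a little-endian int.
    static final int LOCAL_HEADER_SIGNATURE = 0x04034b50;

    /** Returns true if the first local file header has the encryption bit (bit 0) set. */
    static boolean firstEntryEncrypted(Path zipPath) throws IOException {
        try (InputStream in = Files.newInputStream(zipPath)) {
            // Layout: signature (4 bytes), version needed (2 bytes), flags (2 bytes).
            byte[] header = in.readNBytes(8);
            if (header.length < 8) {
                throw new IOException("File too short to be a zip");
            }
            int signature = (header[0] & 0xff)
                    | ((header[1] & 0xff) << 8)
                    | ((header[2] & 0xff) << 16)
                    | ((header[3] & 0xff) << 24);
            if (signature != LOCAL_HEADER_SIGNATURE) {
                throw new IOException("File does not start with a zip local file header");
            }
            int flags = (header[6] & 0xff) | ((header[7] & 0xff) << 8);
            return (flags & 0x1) != 0;
        }
    }
}
```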

Is there any work in progress or planned to improve the parsing of iWork
files?


I think you may have just volunteered yourself ;-)

Cheers
Nick
