Limitations of iWork parsing

Gabriel Valencia Tue, 24 Apr 2012 16:54:38 -0700

Hi all

We have been using Tika to parse iWork files, but have found many
limitations and possible bugs. Here is a sampling:
* Some content, such as header and footer text and embedded text boxes,
is not parsed.
* Pages docs created in Layout mode are not parsed at all. Only the
metadata is extracted.
* Text box text in Keynote slides is extracted, but all of the text of all
the boxes is lumped together without any spaces.
* Password-protected files throw a NullPointerException (NPE).
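
For the Keynote spacing issue, one possible client-side workaround is a
SAX handler that adds a space at every element boundary. This is only a
rough sketch: SpacingTextHandler and extract() are hypothetical names, not
part of Tika, and it only helps if the Keynote parser emits each text box
as its own element, which I have not verified. The demo method uses the
JDK SAX parser on a hand-written string rather than real Tika output.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: collects character data and
// inserts a single space at each element end so that text from adjacent
// elements is not run together.
class SpacingTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Add a separator only if the buffer does not already end in whitespace.
        int n = text.length();
        if (n > 0 && !Character.isWhitespace(text.charAt(n - 1))) {
            text.append(' ');
        }
    }

    public String getText() {
        return text.toString().trim();
    }

    // Demo only: feeds an XHTML-like string through the JDK SAX parser.
    // The input here is an assumption standing in for parser output.
    static String extract(String xml) {
        SpacingTextHandler handler = new SpacingTextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse demo input", e);
        }
        return handler.getText();
    }
}
```

Since the handler extends DefaultHandler, it should be usable anywhere a
ContentHandler is accepted, but it would not help if the parser emits all
of the text boxes inside a single element.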


Is there any work in progress or planned to improve the parsing of iWork
files? Or are improvements made only as defects are opened?
--
Gabriel Valencia
Software Development for IBM Content Integrator, IBM Content Analytics, and
IBM Content and Predictive Analytics
[email protected]
Tel: 408-463-4133 TL: 543-4133
