Hi all,

We have been using Tika to parse iWork files, but have found many limitations and potential bugs. Here is a sampling:

* Header and footer text and embedded text boxes are not parsed.
* Pages documents created in Layout mode are not parsed at all. Only the metadata is extracted.
* Text box text in Keynote slides is extracted, but the text of all the boxes is lumped together without any spaces between them.
* Password-protected files throw a NullPointerException (NPE).

Is there any work in progress or planned to improve the parsing of iWork files? Or are improvements made only as defects are opened?

--
Gabriel Valencia
Software Development for IBM Content Integrator, IBM Content Analytics, and IBM Content and Predictive Analytics
[email protected]
Tel: 408-463-4133
TL: 543-4133
