Hi Stephan,
  This is currently an omission/blind spot in Tika [1].  Regrettably,
the new iWork files are, um, complex.  Last I looked, the schemas
for iWork were enormous, and they conflicted across different
versions of iWork files.
  So, perhaps our best bet would be to decode the protobuf without a
schema, along the lines of [2], on [3].
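
  To make the schema-less idea in [2] a bit more concrete, here is a
minimal sketch in plain Java that walks the protobuf wire format and
lists each field number, wire type and raw value. The class and method
names (SchemalessProtobufDump, dump, readVarint) are made up for
illustration and are not part of Tika. It assumes you already have raw,
uncompressed protobuf bytes in hand; any container or compression
around them in real iWork files is out of scope here. Nested messages
are shown only as raw bytes, and the deprecated group wire types are
rejected.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Tika class: dumps top-level protobuf fields
// without knowing the schema.
class SchemalessProtobufDump {

    /** Reads one base-128 varint (at most 10 bytes) from the buffer. */
    static long readVarint(ByteBuffer buf) {
        long result = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            byte b = buf.get();
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result; // high bit clear: last byte of the varint
            }
        }
        throw new IllegalArgumentException("varint longer than 10 bytes");
    }

    /** Returns one line per top-level field: "fieldNumber: type value". */
    static List<String> dump(byte[] data) {
        // Fixed-width protobuf values are little-endian.
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        List<String> out = new ArrayList<>();
        while (buf.hasRemaining()) {
            long key = readVarint(buf);
            long field = key >>> 3;          // upper bits: field number
            int wireType = (int) (key & 0x7); // lower 3 bits: wire type
            switch (wireType) {
                case 0:
                    out.add(field + ": varint " + readVarint(buf));
                    break;
                case 1:
                    out.add(field + ": fixed64 " + buf.getLong());
                    break;
                case 2: {
                    long len = readVarint(buf);
                    if (len < 0 || len > buf.remaining()) {
                        throw new IllegalArgumentException("bad length " + len);
                    }
                    byte[] bytes = new byte[(int) len];
                    buf.get(bytes);
                    // Could be a string, raw bytes or a nested message;
                    // without a schema we can only guess, so show it as text.
                    out.add(field + ": bytes[" + len + "] "
                            + new String(bytes, StandardCharsets.UTF_8));
                    break;
                }
                case 5:
                    out.add(field + ": fixed32 " + buf.getInt());
                    break;
                default:
                    throw new IllegalArgumentException(
                            "unsupported wire type " + wireType);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Field 1 = varint 150, field 2 = the bytes of "testing".
        byte[] sample = {0x08, (byte) 0x96, 0x01,
                         0x12, 0x07, 't', 'e', 's', 't', 'i', 'n', 'g'};
        dump(sample).forEach(System.out::println);
    }
}
```

  A real parser would still have to guess which length-delimited fields
are nested messages and which are text, which is where most of the
work would be.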
  You could help out by sharing example files.  I don't know that I'll
have any time soon to work on this, but, yes, this is a known issue.
Sorry.

             Best,

                   Tim

[1] https://issues.apache.org/jira/browse/TIKA-1358
[2] https://stackoverflow.com/questions/25898230/decoding-protobuf-without-schema/25898551#25898551
[3] https://issues.apache.org/jira/browse/TIKA-2912

On Thu, Jul 25, 2019 at 9:22 AM Stephan Budach <[email protected]> wrote:
>
> Hello,
>
> I have just recently discovered Tika while playing around with
> fscrawler to help me index my file shares, and I came across a problem that I
> can't fix. Tika has had the ability to parse Apple iWork files for quite some
> time, but since Apple split the iWork suite into three separate apps,
> the media type has changed for each of the now separate file types.
>
> As I have learned from looking at the code of the Class IWorkPackageParser, 
> it defines this media type for iWork files:
>
> /**
>      * This parser handles all iWorks formats.
>      */
>     private final static Set<MediaType> supportedTypes =
>          Collections.unmodifiableSet(new HashSet<MediaType>(Arrays.asList(
>                 MediaType.application("vnd.apple.iwork"),
>                 IWORKDocumentType.KEYNOTE.getType(),
>                 IWORKDocumentType.NUMBERS.getType(),
>                 IWORKDocumentType.PAGES.getType()
>          )));
>
> However, fscrawler sends this MediaType to Tika, which of course triggers no 
> parser: application/vnd.apple.keynote
>
> Can the iWork parser be updated to handle Keynote files, or at
> least give it a try? Unfortunately, I am not a dev type, so I lack the
> skills to pull that off, but I'd be ready to try a new parser and provide
> feedback.
>
> Regards,
> Stephan
>
> --
> Krebs's 3 Basic Rules for Online Safety
> 1st - "If you didn't go looking for it, don't install it!"
> 2nd - "If you installed it, update it."
> 3rd - "If you no longer need it, remove it."
> http://krebsonsecurity.com/2011/05/krebss-3-basic-rules-for-online-safety
>
>
> Stephan Budach
> Head of IT
> Jung von Matt AG
> Glashüttenstraße 79
> D-20357 Hamburg
>
>
> Tel: +49 40-4321-1353
> Fax: +49 40-4321-1114
> E-Mail: [email protected]
> Internet: http://www.jvm.com
> WebEx: https://jvm.webex.com/meet/stephan.budach
>
> Vorstand: Dr. Peter Figge
> Vorsitzender des Aufsichtsrates: Dr. Jochen Gutbrod
> AG HH HRB 72893
>
