Hi Tim,

yeah, I think I have read all of those, the two Jira issues definitely. I
also didn't expect this to be a no-brainer. At least I do have all of
those apps on my Mac, so I can share example files without any issue. Thanks for
being willing to take a shot at it.

To start with one thing… Keynote has two flavours of files: bundles (all
files stored separately in a folder that carries the app's extension, e.g.
.key) and zip-compressed archives (a zip file, again with the extension .key
for Keynote instead of .zip). Can the current iWork parser handle both? That
wasn't clear to me when I looked at the code on GitHub. I do think, though,
that if the iWork parser encounters a zip-compressed file, it will have to
unzip it somewhere temporarily and then look into the structure (folders:
Data/Index) to find the interesting pieces.
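
To make the two flavours concrete, here is a minimal Java sketch that lists
the .iwa members below Index/ for either kind of .key file. The class and
method names (IWorkLayout, listIwaEntries) are made up for illustration and
are not part of Tika, and the layout (an Index folder holding .iwa files) is
only assumed from what I see in my own files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not Tika code.
final class IWorkLayout {

    /** Lists the .iwa members below Index/ for a bundle folder or a zip archive. */
    static List<String> listIwaEntries(Path keyFile) throws IOException {
        List<String> names = new ArrayList<>();
        if (Files.isDirectory(keyFile)) {
            // Bundle flavour: a plain folder whose name ends in .key
            Path index = keyFile.resolve("Index");
            if (!Files.isDirectory(index)) {
                return names; // assumed layout not present
            }
            try (Stream<Path> walk = Files.walk(index)) {
                walk.filter(Files::isRegularFile)
                    // normalise to forward slashes so both flavours report the same names
                    .map(p -> keyFile.relativize(p).toString().replace('\\', '/'))
                    .filter(n -> n.endsWith(".iwa"))
                    .sorted()
                    .forEach(names::add);
            }
        } else {
            // Archive flavour: read the zip directory in place
            try (ZipFile zip = new ZipFile(keyFile.toFile())) {
                zip.stream()
                   .map(ZipEntry::getName)
                   .filter(n -> n.startsWith("Index/") && n.endsWith(".iwa"))
                   .sorted()
                   .forEach(names::add);
            }
        }
        return names;
    }
}
```

Calling IWorkLayout.listIwaEntries(Paths.get("Deck.key")) would return the
same kind of list for both flavours. One thing the sketch suggests: since
java.util.zip.ZipFile can read entries directly, extracting the archive to a
temporary folder may not be strictly necessary.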

I will take a look at the protobuf tool and feed it some of the .iwa files… in
the end we're mostly interested in the text that is on those slides, and at
least I do know what's on the slides. ;)
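
As a rough idea of what decoding without a schema means, the following Java
sketch walks the top-level fields of a protobuf message and collects every
length-delimited field (wire type 2) as a UTF-8 string. It is only a sketch
under assumptions: RawProtoStrings is a made-up name, the input is assumed to
be a plain protobuf message already (the .iwa files as stored may well carry
extra framing or compression that has to be undone first), and nested
messages also use wire type 2, so real output would mix text with binary
blobs.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika or the protobuf library.
final class RawProtoStrings {

    /** Returns the payloads of all top-level length-delimited fields, decoded as UTF-8. */
    static List<String> lengthDelimited(byte[] buf) {
        List<String> out = new ArrayList<>();
        int[] pos = {0}; // single-element array used as a mutable cursor
        while (pos[0] < buf.length) {
            long tag = readVarint(buf, pos);
            int wireType = (int) (tag & 7); // low three bits; the rest is the field number
            switch (wireType) {
                case 0: // varint value
                    readVarint(buf, pos);
                    break;
                case 1: // fixed 64-bit value
                    pos[0] += 8;
                    break;
                case 5: // fixed 32-bit value
                    pos[0] += 4;
                    break;
                case 2: { // length-delimited: string, bytes or nested message
                    long len = readVarint(buf, pos);
                    if (len < 0 || len > buf.length - pos[0]) {
                        throw new IllegalArgumentException("truncated field");
                    }
                    out.add(new String(buf, pos[0], (int) len, StandardCharsets.UTF_8));
                    pos[0] += (int) len;
                    break;
                }
                default:
                    throw new IllegalArgumentException("unsupported wire type " + wireType);
            }
        }
        return out;
    }

    /** Reads one base-128 varint starting at pos[0] and advances the cursor. */
    static long readVarint(byte[] buf, int[] pos) {
        long value = 0;
        int shift = 0;
        while (true) {
            if (pos[0] >= buf.length) {
                throw new IllegalArgumentException("truncated varint");
            }
            byte b = buf[pos[0]++];
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return value;
            }
            shift += 7;
            if (shift >= 64) {
                throw new IllegalArgumentException("varint too long");
            }
        }
    }
}
```

Since I know what is written on my slides, searching the returned strings for
a known phrase would be a quick check of whether the text is reachable this
way at all.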

Thanks and regards,
Stephan


----- Original Message -----
> From: "Tim Allison" <[email protected]>
> To: [email protected]
> Sent: Thursday, 25 July 2019 17:07:21
> Subject: Re: Update Tika's Apple iWork parser?
> 
> Hi Stephan,
>   This is currently an omission/blindspot in Tika[1].  Regrettably,
> the new iWorks files are, um, complex, and last I looked the schemas
> for iWorks were enormous, and there were version conflicts in the
> schemas across different versions of iWorks files.
>   So, perhaps our best bet would be to follow something along the
> lines of [2] on [3].
>   You could help out by sharing example files.  I don't know that I'll
> have any time soon to work on this, but, y, this is a known issue.
> Sorry.
> 
>              Best,
> 
>                    Tim
> 
> [1] https://issues.apache.org/jira/browse/TIKA-1358
> [2]
> https://stackoverflow.com/questions/25898230/decoding-protobuf-without-schema/25898551#25898551
> [3] https://issues.apache.org/jira/browse/TIKA-2912
> 
> On Thu, Jul 25, 2019 at 9:22 AM Stephan Budach
> <[email protected]> wrote:
> >
> > Hello,
> >
> > I have just recently discovered Tika as I have been playing around
> > with fscrawler to help me index my file shares and I came across a
> > problem that I can't fix. Tika has had the ability to parse Apple
> > iWork files for quite some time, but since Apple has split up the
> > iWork suite into three separate apps, the media type has changed
> > for each of those, now separate, files.
> >
> > As I have learned from looking at the code of the Class
> > IWorkPackageParser, it defines this media type for iWork files:
> >
> > /**
> >      * This parser handles all iWorks formats.
> >      */
> >     private final static Set<MediaType> supportedTypes =
> >          Collections.unmodifiableSet(new
> >          HashSet<MediaType>(Arrays.asList(
> >                 MediaType.application("vnd.apple.iwork"),
> >                 IWORKDocumentType.KEYNOTE.getType(),
> >                 IWORKDocumentType.NUMBERS.getType(),
> >                 IWORKDocumentType.PAGES.getType()
> >          )));
> >
> > However, fscrawler sends this MediaType to Tika, which of course
> > triggers no parser: application/vnd.apple.keynote
> >
> > Can the iWork parser be updated to handle Keynote files, or
> > could someone at least give it a try? Unfortunately, I am not a dev
> > type, so I lack the skills to pull that off, but I'd be
> > ready to try a new parser and provide feedback.
> >
> > Regards,
> > Stephan
> >
> > --
> > Krebs's 3 Basic Rules for Online Safety
> > 1st - "If you didn't go looking for it, don't install it!"
> > 2nd - "If you installed it, update it."
> > 3rd - "If you no longer need it, remove it."
> > http://krebsonsecurity.com/2011/05/krebss-3-basic-rules-for-online-safety
> >
> >
> > Stephan Budach
> > Head of IT
> > Jung von Matt AG
> > Glashüttenstraße 79
> > D-20357 Hamburg
> >
> >
> > Tel: +49 40-4321-1353
> > Fax: +49 40-4321-1114
> > E-Mail: [email protected]
> > Internet: http://www.jvm.com
> > WebEx: https://jvm.webex.com/meet/stephan.budach
> >
> > Executive Board: Dr. Peter Figge
> > Chairman of the Supervisory Board: Dr. Jochen Gutbrod
> > AG HH HRB 72893
> >
> 
