> in the end we're mostly interested in the text

Ditto!  :D

The more help, the better.  Thank you!

On Thu, Jul 25, 2019 at 11:41 AM Stephan Budach <[email protected]> wrote:
>
> Hi Tim,
>
> yeah, I think I have read all of those - the two Jira issues definitely.
> I also didn't expect this to be a no-brainer, and at least I do have all of
> those apps on my Mac, so I can share example files without any issue. Thanks
> for being willing to take a shot at it.
>
> To start with one thing… Keynote has two flavours of files: bundled ones (all
> files stored separately in a folder carrying the app's extension, e.g. .key) or
> a zip-compressed archive (a zip file, again with the extension .key for
> Keynote instead of .zip). Can the current iWork parser handle both? That
> wasn't clear to me when I looked at the code on GitHub. I do think,
> though, that if the iWork parser encounters a zip-compressed file, it will
> have to unzip it somewhere temporarily and then look into the structure
> (folders: Data/Index) to find the interesting pieces.
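
On the two flavours: here is a rough sketch of how both could be handled, in plain Java rather than actual Tika code. It assumes the .iwa files sit under an Index/ folder in both the bundle and the zip archive (that is my reading of the layout you describe, not something verified across iWork versions). IwaLister and listIwa are made-up names. A zip-compressed file can be read in place with java.util.zip.ZipFile, so a temporary unzip may not be needed.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper (not part of Tika): finds .iwa members in either flavour of an iWork document. */
class IwaLister {

    /** Returns the sorted, '/'-separated names of all .iwa files below Index/ (assumed layout). */
    static List<String> listIwa(Path doc) throws IOException {
        List<String> names = new ArrayList<>();
        if (Files.isDirectory(doc)) {
            // Bundle flavour: a plain folder that carries the app's extension.
            Path index = doc.resolve("Index");
            if (Files.isDirectory(index)) {
                try (Stream<Path> files = Files.walk(index)) {
                    files.filter(Files::isRegularFile)
                         .map(p -> doc.relativize(p).toString().replace(File.separatorChar, '/'))
                         .filter(n -> n.endsWith(".iwa"))
                         .forEach(names::add);
                }
            }
        } else {
            // Zip flavour: read the archive's directory in place, no temporary extraction.
            try (ZipFile zip = new ZipFile(doc.toFile())) {
                Enumeration<? extends ZipEntry> entries = zip.entries();
                while (entries.hasMoreElements()) {
                    ZipEntry entry = entries.nextElement();
                    String name = entry.getName();
                    if (!entry.isDirectory() && name.startsWith("Index/") && name.endsWith(".iwa")) {
                        names.add(name);
                    }
                }
            }
        }
        Collections.sort(names);
        return names;
    }
}
```

Either flavour then yields the same kind of list, and the individual members can be opened as streams (ZipFile.getInputStream for the archive, Files.newInputStream for the bundle) without caring which flavour they came from.
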
>
> I will take a look at the protobuf tool and feed it some of the iwa files… in
> the end we're mostly interested in the text that is on those slides, and at
> least I do know what's on the slides. ;)
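
For a feel of what schema-less decoding involves, here is a minimal sketch that walks the top-level fields of a protobuf message using only the wire format. It assumes the input is already a raw protobuf message; whatever framing or compression the .iwa container may put around its messages would have to be handled first and is not covered here. RawProtoDump and dump are made-up names, and length-delimited fields are simply shown as UTF-8 text, since without a schema there is no way to tell a string from a nested message.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists the top-level fields of a raw protobuf message without a schema. */
class RawProtoDump {

    /** Reads a base-128 varint at pos[0] and advances pos[0] past it. */
    static long readVarint(byte[] buf, int[] pos) {
        long result = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            if (pos[0] >= buf.length) {
                throw new IllegalArgumentException("truncated varint");
            }
            byte b = buf[pos[0]++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IllegalArgumentException("varint too long");
    }

    /** Advances pos[0] by n bytes, failing if the buffer is too short. */
    static void skip(byte[] buf, int[] pos, long n) {
        if (n < 0 || n > buf.length - pos[0]) {
            throw new IllegalArgumentException("truncated field");
        }
        pos[0] += (int) n;
    }

    /** Returns one line per top-level field: "<field number>: <what was found>". */
    static List<String> dump(byte[] buf) {
        List<String> out = new ArrayList<>();
        int[] pos = {0};
        while (pos[0] < buf.length) {
            long key = readVarint(buf, pos);
            long field = key >>> 3;          // upper bits: field number
            int wireType = (int) (key & 7);  // lower three bits: wire type
            switch (wireType) {
                case 0: // varint (printed as a signed long)
                    out.add(field + ": varint " + readVarint(buf, pos));
                    break;
                case 1: // fixed 64-bit value, skipped
                    skip(buf, pos, 8);
                    out.add(field + ": 64-bit");
                    break;
                case 2: { // length-delimited: string, bytes or nested message
                    long len = readVarint(buf, pos);
                    int start = pos[0];
                    skip(buf, pos, len);
                    String text = new String(buf, start, (int) len, StandardCharsets.UTF_8);
                    out.add(field + ": len " + len + " \"" + text + "\"");
                    break;
                }
                case 5: // fixed 32-bit value, skipped
                    skip(buf, pos, 4);
                    out.add(field + ": 32-bit");
                    break;
                default:
                    throw new IllegalArgumentException("unsupported wire type " + wireType);
            }
        }
        return out;
    }
}
```

Since you know what is on the slides, searching the dumped text for a known slide title would be a quick way to see which fields carry it.
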
>
> Thanks and regards,
> Stephan
>
>
> ----- Original Message -----
> > From: "Tim Allison" <[email protected]>
> > To: [email protected]
> > Sent: Thursday, July 25, 2019 17:07:21
> > Subject: Re: Update Tika's Apple iWork parser?
> >
> > Hi Stephan,
> >   This is currently an omission/blindspot in Tika[1].  Regrettably,
> > the new iWork files are, um, complex, and last I looked the schemas
> > for iWork were enormous, and there were version conflicts in the
> > schemas across different versions of iWork files.
> >   So, perhaps our best bet would be to follow something along the
> > lines of [2] on [3].
> >   You could help out by sharing example files.  I don't know that I'll
> > have any time soon to work on this, but, y, this is a known issue.
> > Sorry.
> >
> >              Best,
> >
> >                    Tim
> >
> > [1] https://issues.apache.org/jira/browse/TIKA-1358
> > [2] https://stackoverflow.com/questions/25898230/decoding-protobuf-without-schema/25898551#25898551
> > [3] https://issues.apache.org/jira/browse/TIKA-2912
> >
> > On Thu, Jul 25, 2019 at 9:22 AM Stephan Budach
> > <[email protected]> wrote:
> > >
> > > Hello,
> > >
> > > I have just recently discovered Tika as I have been playing around
> > > with fscrawler to help me index my file shares, and I came across a
> > > problem that I can't fix. Tika has had the ability to parse Apple
> > > iWork files for quite some time, but since Apple has split up the
> > > iWork suite into three separate apps, the media type has changed
> > > for each of those, now separate, files.
> > >
> > > As I have learned from looking at the code of the class
> > > IWorkPackageParser, it defines these media types for iWork files:
> > >
> > >     /**
> > >      * This parser handles all iWorks formats.
> > >      */
> > >     private final static Set<MediaType> supportedTypes =
> > >             Collections.unmodifiableSet(new HashSet<MediaType>(Arrays.asList(
> > >                     MediaType.application("vnd.apple.iwork"),
> > >                     IWORKDocumentType.KEYNOTE.getType(),
> > >                     IWORKDocumentType.NUMBERS.getType(),
> > >                     IWORKDocumentType.PAGES.getType()
> > >             )));
> > >
> > > However, fscrawler sends this MediaType to Tika, which of course
> > > triggers no parser: application/vnd.apple.keynote
> > >
> > > Can the iWork parser be updated to handle Keynote
> > > files, or at least give it a try? Unfortunately, I am not a dev
> > > type, so I lack the skills to pull that off, but I'd be
> > > ready to try a new parser and provide feedback.
> > >
> > > Regards,
> > > Stephan
> > >
> > > --
> > > Krebs's 3 Basic Rules for Online Safety
> > > 1st - "If you didn't go looking for it, don't install it!"
> > > 2nd - "If you installed it, update it."
> > > 3rd - "If you no longer need it, remove it."
> > > http://krebsonsecurity.com/2011/05/krebss-3-basic-rules-for-online-safety
> > >
> > >
> > > Stephan Budach
> > > Head of IT
> > > Jung von Matt AG
> > > Glashüttenstraße 79
> > > D-20357 Hamburg
> > >
> > >
> > > Tel: +49 40-4321-1353
> > > Fax: +49 40-4321-1114
> > > E-Mail: [email protected]
> > > Internet: http://www.jvm.com
> > > WebEx: https://jvm.webex.com/meet/stephan.budach
> > >
> > > Executive Board: Dr. Peter Figge
> > > Chairman of the Supervisory Board: Dr. Jochen Gutbrod
> > > AG HH HRB 72893
> > >
> >
>
