Hi Tim, yes, I think I have read all of those, and definitely the two Jira issues. I didn't expect this to be a no-brainer either. At least I have all of those apps on my Mac, so I can share example files without any issue. Thanks for being willing to take a shot at it.
To start with one thing: Keynote has two flavours of files. A file is either a bundle (all files stored separately in a folder that carries the app's extension, e.g. .key) or a zip-compressed archive (a zip file, again with the extension .key for Keynote instead of .zip). Can the current iWork parser handle both? That wasn't clear to me when I looked at the code on GitHub. I do think, though, that if the iWork parser encounters a zip-compressed file, it will have to unzip it somewhere temporarily and then look into the structure (folders: Data/Index) to find the interesting pieces.

I will take a look at the protobuf tool and feed it some of the iwa files. In the end we're mostly interested in the text that is on those slides, and at least I do know what's on the slides. ;)

Thanks and regards,
Stephan

----- Original Message -----
> From: "Tim Allison" <[email protected]>
> To: [email protected]
> Sent: Thursday, 25 July 2019 17:07:21
> Subject: Re: Update Tika's Apple iWork parser?
>
> Hi Stephan,
> This is currently an omission/blindspot in Tika[1]. Regrettably,
> the new iWorks files are, um, complex, and last I looked the schemas
> for iWorks were enormous, and there were version conflicts in the
> schemas across different versions of iWorks files.
> So, perhaps our best bet would be to follow something along the
> lines of [2] on [3].
> You could help out by sharing example files. I don't know that
> I'll have any time soon to work on this, but, y, this is a known issue.
> Sorry.
>
> Best,
>
> Tim
>
> [1] https://issues.apache.org/jira/browse/TIKA-1358
> [2] https://stackoverflow.com/questions/25898230/decoding-protobuf-without-schema/25898551#25898551
> [3] https://issues.apache.org/jira/browse/TIKA-2912
>
> On Thu, Jul 25, 2019 at 9:22 AM Stephan Budach <[email protected]> wrote:
> >
> > Hello,
> >
> > I have just recently discovered Tika, as I have been playing around
> > with fscrawler to help me index my file shares, and I came across a
> > problem that I can't fix. Tika has had the ability to parse Apple
> > iWork files for quite some time, but since Apple split the
> > iWork suite into three separate apps, the media type has changed
> > for each of those, now separate, files.
> >
> > As I have learned from looking at the code of the class
> > IWorkPackageParser, it defines this media type for iWork files:
> >
> > /**
> >  * This parser handles all iWorks formats.
> >  */
> > private final static Set<MediaType> supportedTypes =
> >         Collections.unmodifiableSet(new HashSet<MediaType>(Arrays.asList(
> >                 MediaType.application("vnd.apple.iwork"),
> >                 IWORKDocumentType.KEYNOTE.getType(),
> >                 IWORKDocumentType.NUMBERS.getType(),
> >                 IWORKDocumentType.PAGES.getType()
> >         )));
> >
> > However, fscrawler sends this MediaType to Tika, which of course
> > triggers no parser: application/vnd.apple.keynote
> >
> > Can the iWork parser be updated to handle Keynote
> > files, or at least give it a try? Unfortunately, I am not a dev
> > type, so I lack the skills to pull that off, but I'd be
> > ready to try a new parser and provide feedback.
> >
> > Regards,
> > Stephan
> >
> > --
> > Krebs's 3 Basic Rules for Online Safety
> > 1st - "If you didn't go looking for it, don't install it!"
> > 2nd - "If you installed it, update it."
> > 3rd - "If you no longer need it, remove it."
> > http://krebsonsecurity.com/2011/05/krebss-3-basic-rules-for-online-safety
> >
> > Stephan Budach
> > Head of IT
> > Jung von Matt AG
> > Glashüttenstraße 79
> > D-20357 Hamburg
> >
> > Tel: +49 40-4321-1353
> > Fax: +49 40-4321-1114
> > E-Mail: [email protected]
> > Internet: http://www.jvm.com
> > WebEx: https://jvm.webex.com/meet/stephan.budach
> >
> > Executive Board: Dr. Peter Figge
> > Chairman of the Supervisory Board: Dr. Jochen Gutbrod
> > AG HH HRB 72893

Jung von Matt invests in the creatives of tomorrow: JvM-Academy. http://jvm-academy.org
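
A minimal sketch of the bundle-versus-zip question raised in this message, written in Java because that is the language of the quoted Tika snippet. It assumes, as described above, that the .iwa files sit under Index/ in both flavours. KeynoteFlavour and listIwaEntries are hypothetical names for illustration and are not part of Tika. The sketch only lists the entries; it does not decode them.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not a Tika class.
final class KeynoteFlavour {

    /** Lists the .iwa entries under Index/ for a bundle folder or a zipped .key file. */
    static List<String> listIwaEntries(Path key) throws IOException {
        if (Files.isDirectory(key)) {
            // Bundle flavour: the .key "file" is really a folder.
            Path index = key.resolve("Index");
            if (!Files.isDirectory(index)) {
                return new ArrayList<>();
            }
            try (Stream<Path> walk = Files.walk(index)) {
                return walk.filter(Files::isRegularFile)
                        // Normalise to forward slashes so both flavours report the same names.
                        .map(p -> key.relativize(p).toString().replace('\\', '/'))
                        .filter(name -> name.endsWith(".iwa"))
                        .sorted()
                        .collect(Collectors.toList());
            }
        }
        // Zip flavour: read the archive's entry names in place.
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(key.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                String name = entry.getName();
                if (!entry.isDirectory() && name.startsWith("Index/") && name.endsWith(".iwa")) {
                    names.add(name);
                }
            }
        }
        names.sort(String::compareTo);
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: KeynoteFlavour <file.key>");
            return;
        }
        listIwaEntries(Paths.get(args[0])).forEach(System.out::println);
    }
}
```

When the file is available on disk, java.util.zip.ZipFile reads entries in place, so listing them does not require extracting the archive to a temporary folder first. A parser that only receives a stream would need a different approach.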
