> in the end we're mostly interested in the text

Ditto! :D
The more help, the better. Thank you!

On Thu, Jul 25, 2019 at 11:41 AM Stephan Budach <[email protected]> wrote:
>
> Hi Tim,
>
> yeah, I have read, I think, all of those - the two Jira issues definitely.
> I also didn't expect this to be a no-brainer, and at least I do have all of
> those apps on my Mac, so I can share example files without any issue. Thanks
> for being willing to take a shot at it.
>
> To start with one thing… Keynote has two flavours of files: bundled ones (all
> files separately in a folder, carrying the app's extension, e.g. .key) or a
> zip-compressed archive (a zip file, again with the extension .key for
> Keynote, instead of .zip). Can the current iWork parser handle both? That
> wasn't clear to me when I looked at the code on GitHub. I do think,
> though, that if the iWorks parser encounters a zip-compressed file, it will
> have to unzip it somewhere temporarily and then look into the structure
> (folders: Data/Index) to find the interesting pieces.
>
> I will take a look at the protobuf tool and feed it some of the iwa files… in
> the end we're mostly interested in the text that is on those slides, and at
> least I do know what's on the slides. ;)
>
> Thanks and regards,
> Stephan
>
>
> ----- Original Message -----
> > From: "Tim Allison" <[email protected]>
> > To: [email protected]
> > Sent: Thursday, 25 July 2019 17:07:21
> > Subject: Re: Update Tika's Apple iWork parser?
> >
> > Hi Stephan,
> > This is currently an omission/blind spot in Tika [1]. Regrettably,
> > the new iWorks files are, um, complex, and last I looked the schemas
> > for iWorks were enormous, and there were version conflicts in the
> > schemas across different versions of iWorks files.
> > So, perhaps our best bet would be to follow something along the
> > lines of [2] on [3].
> > You could help out by sharing example files. I don't know that I'll
> > have any time soon to work on this, but, y, this is a known issue.
> > Sorry.
> >
> > Best,
> >
> > Tim
> >
> > [1] https://issues.apache.org/jira/browse/TIKA-1358
> > [2]
> > https://stackoverflow.com/questions/25898230/decoding-protobuf-without-schema/25898551#25898551
> > [3] https://issues.apache.org/jira/browse/TIKA-2912
> >
> > On Thu, Jul 25, 2019 at 9:22 AM Stephan Budach
> > <[email protected]> wrote:
> > >
> > > Hello,
> > >
> > > I have just recently discovered Tika as I have been playing around
> > > with fscrawler to help me index my file shares, and I came across a
> > > problem that I can't fix. Tika has had the ability to parse Apple
> > > iWork files for quite some time, but since Apple has split up the
> > > iWorks Suite into three separate apps, the media type has changed
> > > for each of those - now separate files.
> > >
> > > As I have learned from looking at the code of the class
> > > IWorkPackageParser, it defines this media type for iWork files:
> > >
> > > /**
> > >  * This parser handles all iWorks formats.
> > >  */
> > > private final static Set<MediaType> supportedTypes =
> > >     Collections.unmodifiableSet(new HashSet<MediaType>(Arrays.asList(
> > >         MediaType.application("vnd.apple.iwork"),
> > >         IWORKDocumentType.KEYNOTE.getType(),
> > >         IWORKDocumentType.NUMBERS.getType(),
> > >         IWORKDocumentType.PAGES.getType()
> > >     )));
> > >
> > > However, fscrawler sends this MediaType to Tika, which of course
> > > triggers no parser: application/vnd.apple.keynote
> > >
> > > Can the iWorks parser be updated to be able to handle Keynote
> > > files, or at least give it a try? Unfortunately, I am not a dev
> > > type, so I am lacking the skills to pull that off, but I'd be
> > > ready to try a new parser and provide feedback.
> > >
> > > Regards,
> > > Stephan
> > >
> > > --
> > > Krebs's 3 Basic Rules for Online Safety
> > > 1st - "If you didn't go looking for it, don't install it!"
> > > 2nd - "If you installed it, update it."
> > > 3rd - "If you no longer need it, remove it."
> > > http://krebsonsecurity.com/2011/05/krebss-3-basic-rules-for-online-safety
> > >
> > >
> > > Stephan Budach
> > > Head of IT
> > > Jung von Matt AG
> > > Glashüttenstraße 79
> > > D-20357 Hamburg
> > >
> > >
> > > Tel: +49 40-4321-1353
> > > Fax: +49 40-4321-1114
> > > E-Mail: [email protected]
> > > Internet: http://www.jvm.com
> > > WebEx: https://jvm.webex.com/meet/stephan.budach
> > >
> > > Executive Board: Dr. Peter Figge
> > > Chairman of the Supervisory Board: Dr. Jochen Gutbrod
> > > AG HH HRB 72893
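
As a rough illustration of the bundle-versus-zip point raised in the thread, here is a minimal Java sketch that lists the .iwa entries of a .key document in either form. It is not Tika code: the class IwaLocator and the method listIwaEntries are hypothetical names, and the assumption that the interesting pieces are entries ending in .iwa (for example under Index/) comes only from the description in the thread. Note that java.util.zip.ZipFile reads entries in place, so this sketch does not unpack the archive to a temporary folder.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper for illustration only; not part of Tika. */
final class IwaLocator {

    /**
     * Returns the sorted, '/'-separated names of all .iwa entries found in
     * the given document, which may be a bundle folder or a zip archive.
     */
    static List<String> listIwaEntries(Path document) throws IOException {
        List<String> names = new ArrayList<>();
        if (Files.isDirectory(document)) {
            // Bundle flavour: walk the folder and keep paths relative to it.
            try (Stream<Path> walk = Files.walk(document)) {
                walk.filter(Files::isRegularFile)
                    .map(p -> document.relativize(p).toString()
                                      .replace(File.separatorChar, '/'))
                    .filter(n -> n.endsWith(".iwa"))
                    .forEach(names::add);
            }
        } else {
            // Zip flavour: read the entry table without extracting anything.
            try (ZipFile zip = new ZipFile(document.toFile())) {
                Enumeration<? extends ZipEntry> entries = zip.entries();
                while (entries.hasMoreElements()) {
                    ZipEntry entry = entries.nextElement();
                    if (!entry.isDirectory() && entry.getName().endsWith(".iwa")) {
                        names.add(entry.getName());
                    }
                }
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: IwaLocator <path to .key folder or archive>");
            return;
        }
        for (String name : listIwaEntries(Paths.get(args[0]))) {
            System.out.println(name);
        }
    }
}
```

Passing the path of a .key folder or archive as the only argument prints one entry name per line. Decoding what is inside each .iwa entry is a separate problem and is not attempted here.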
