This feels a bit like a bug and not a feature? Maybe...
Nick, what do you think?
The PackageParser winds up parsing the file with this config:
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>application/vnd.apple.keynote</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.iwork.IWorkPackageParser"/>
    </parser>
  </parsers>
</properties>
and this one:
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>application/vnd.apple.keynote</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.iwork.IWorkPackageParser"/>
      <parser-exclude class="org.apache.tika.parser.pkg.PackageParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pkg.PackageParser">
      <mime-exclude>application/vnd.apple.keynote</mime-exclude>
    </parser>
  </parsers>
</properties>
Again, the problem is that the DefaultParser backs off to the parser
for application/zip because application/vnd.apple.keynote is a subtype
of application/zip.
Not sure what the fix should be...
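For what it's worth, the EmptyParser workaround from my first message might look roughly like the config below. This is an untested sketch: it assumes the file is detected as application/vnd.apple.keynote, and that an explicit <mime> mapping on org.apache.tika.parser.EmptyParser is matched before the lookup backs off to application/zip.

```xml
<properties>
  <parsers>
    <!-- Same exclusions as in the first config above -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>application/vnd.apple.keynote</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.iwork.IWorkPackageParser"/>
    </parser>
    <!-- Assumption: claiming the type explicitly keeps it from reaching the zip parser -->
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/vnd.apple.keynote</mime>
    </parser>
  </parsers>
</properties>
```

This still means enumerating every zip-based type by hand, so it does not address the general question.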
On Thu, May 20, 2021 at 2:24 PM Furkan KAMACI <[email protected]> wrote:
>
> Hi Tim,
>
> It seems that we can only exclude mime types per parser [1].
>
> How about globally excluding such mime-types? Is there any way to define it?
>
> [1] https://tika.apache.org/1.26/configuring.html
>
> Kind Regards,
> Furkan KAMACI
>
> On Thu, May 20, 2021 at 6:36 PM Tim Allison <[email protected]> wrote:
>>
>> All,
>>
>> Let's say I don't want to parse old iWork files (zip-based file
>> format). I can exclude that parser via TikaConfig, but then it gets
>> parsed by the PackageParser. So, then I have to decorate the
>> PackageParser with mime-exclude=application/iworks... or add an
>> EmptyParser that handles application/iWorks.
>>
>> Do we have a way to say: I only want the PackageParser to process
>> actual zip files and not zip-based files generally ... without having
>> to enumerate the zip-based files?
>>
>> Thank you.
>>
>> Cheers,
>>
>> Tim