This feels more like a bug than a feature. Maybe...

Nick, what do you think?

The PackageParser still winds up parsing the file with this config:
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>application/vnd.apple.keynote</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.iwork.IWorkPackageParser"/>
    </parser>
  </parsers>
</properties>

and this one:

<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>application/vnd.apple.keynote</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.iwork.IWorkPackageParser"/>
      <parser-exclude class="org.apache.tika.parser.pkg.PackageParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pkg.PackageParser">
      <mime-exclude>application/vnd.apple.keynote</mime-exclude>
    </parser>
  </parsers>
</properties>

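Again, the problem is that the DefaultParser backs off to
application/zip because application/vnd.apple.keynote is a subtype of
application/zip. So excluding the keynote mime type isn't enough: as
long as some parser is still registered for application/zip, the
file ends up there.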
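
To illustrate the back-off, here is a minimal Java sketch of a
supertype-walking lookup. It is a simplified model under my own
assumptions, not Tika's actual code: the class name, the string-keyed
maps, and the "fallback" value are all made up for illustration; only
the mime types and parser names come from the configs above.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model of a parser lookup that walks up the
// mime type hierarchy. Not Tika's real implementation.
class SupertypeFallbackSketch {

    // Hypothetical stand-in for the mime type hierarchy (child -> parent).
    static final Map<String, String> SUPERTYPE = Map.of(
            "application/vnd.apple.keynote", "application/zip",
            "application/zip", "application/octet-stream");

    // Returns the parser registered for the type, or for its nearest
    // supertype that has one; "fallback" if nothing in the chain matches.
    static String resolve(String type, Map<String, String> parsersByType) {
        String current = type;
        while (current != null) {
            String parser = parsersByType.get(current);
            if (parser != null) {
                return parser;
            }
            // No parser for this type: back off to its supertype.
            current = SUPERTYPE.get(current);
        }
        return "fallback";
    }

    public static void main(String[] args) {
        String keynote = "application/vnd.apple.keynote";

        // Keynote is excluded everywhere, but zip is still handled.
        Map<String, String> parsers = new HashMap<>();
        parsers.put("application/zip", "PackageParser");
        System.out.println(resolve(keynote, parsers));

        // Workaround from the thread: register a no-op parser for the
        // exact type so the lookup stops before reaching zip.
        parsers.put(keynote, "EmptyParser");
        System.out.println(resolve(keynote, parsers));
    }
}
```

In this model the first lookup lands on PackageParser and the second on
EmptyParser, which matches what both configs above do in practice and
why the EmptyParser workaround helps.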

Not sure what the fix should be...



On Thu, May 20, 2021 at 2:24 PM Furkan KAMACI <[email protected]> wrote:
>
> Hi Tim,
>
> It seems that mime types can only be excluded per parser [1].
>
> How about globally excluding such mime-types? Is there any way to define it?
>
> [1] https://tika.apache.org/1.26/configuring.html
>
> Kind Regards,
> Furkan KAMACI
>
> On Thu, May 20, 2021 at 6:36 PM Tim Allison <[email protected]> wrote:
>>
>> All,
>>
>> Let's say I don't want to parse old iWorks files (zip-based file
>> format).  I can exclude that parser via TikaConfig, but then it gets
>> parsed by the PackageParser.  So, then I have to decorate the
>> PackageParser with mime-exclude=application/iworks... or add an
>> EmptyParser that handles application/iWorks.
>>
>> Do we have a way to say: I only want the PackageParser to process
>> actual zip files and not zip-based files generally ... without having
>> to enumerate the zip-based files?
>>
>> Thank you.
>>
>> Cheers,
>>
>>           Tim
