Neha,
I posted a similar request here:
https://issues.apache.org/jira/projects/TIKA/issues/TIKA-3984.  While my issue 
specifically mentions a Maven mapping, it all relates to how we map file 
extensions, media types, and Tika modules.

If you or anyone else adds a comment there describing the need, I think it 
would help the issue get attention.
cheers
Marc


Marc Ubaldino
tel. 781-271-2159
Principal Data Science Engineer / Deputy Project Leader, ATS Lab
N163 – All Domain Integration, Air & Space Forces Center
MITRE | Solving Problems for a Safer World | http://www.mitre.org


From: Neha Kamat via user <[email protected]>
Date: Tuesday, May 16, 2023 at 3:02 AM
To: [email protected] <[email protected]>
Subject: [EXT] FIle extensions supported by TIKA
Hi team

Is there documentation for Apache Tika that clearly lists the file extensions 
supported by a particular Tika version? I can see the file formats supported 
by Tika at https://tika.apache.org/2.8.0/formats.html, but this page doesn't 
make clear which extensions are covered under a particular file format.
Based on the list of supported extensions, we plan to implement filters in our 
application so that only files with supported extensions are sent to Tika for 
extraction, and files with unsupported extensions are never sent to Tika for 
processing. I am also looking for documentation that captures performance 
statistics and recommendations for the different types of parsers currently 
supported by Tika, e.g. that <x> parser is resource intensive and <y> parser is 
time consuming, backed by published statistics.
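
A minimal sketch of the kind of extension pre-filter described above. The
allowlist contents and the names is_supported and partition are hypothetical
placeholders for illustration; the real list of supported extensions would
have to come from Tika itself, which is exactly what this message asks about.

```python
from pathlib import PurePath

# Hypothetical allowlist: placeholder entries only, NOT Tika's actual
# list of supported extensions.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt"}


def is_supported(filename, supported=SUPPORTED_EXTENSIONS):
    """Return True if the file's final extension is in the allowlist."""
    # suffix is the last extension only, e.g. ".gz" for "archive.tar.gz";
    # lower() makes the comparison case-insensitive.
    return PurePath(filename).suffix.lower() in supported


def partition(filenames, supported=SUPPORTED_EXTENSIONS):
    """Split filenames into (to_send, skipped) based on their extension."""
    to_send, skipped = [], []
    for name in filenames:
        if is_supported(name, supported):
            to_send.append(name)
        else:
            skipped.append(name)
    return to_send, skipped
```

Note that an extension is only a hint about a file's content, so a filter like
this can both reject files that would parse fine and admit files that will not.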

Is there a common shared test data location (something similar to govdocs, or 
test data maintained by Tika) against which parser testing is done?




