Neha, I posted a similar request here: https://issues.apache.org/jira/projects/TIKA/issues/TIKA-3984. While my issue specifically mentions a Maven mapping, it is all related to how we map file extensions, media types, and Tika modules.
I think it would help upvote the issue if you, or anyone else, added a comment there describing the need.

cheers
Marc

Marc Ubaldino
tel. 781-271-2159
Principal Data Science Engineer / Deputy Project Leader, ATS Lab N163 – All Domain Integration, Air & Space Forces Center
MITRE | Solving Problems for a Safer World | http://www.mitre.org

From: Neha Kamat via user <[email protected]>
Date: Tuesday, May 16, 2023 at 3:02 AM
To: [email protected] <[email protected]>
Subject: [EXT] FIle extensions supported by TIKA

Hi team

Is there documentation available for Apache TIKA that clearly lists the file extensions supported by a particular TIKA version? I can see the file formats supported by TIKA at https://tika.apache.org/2.8.0/formats.html, but this page doesn't make clear which extensions are covered under a particular file format. Based on the list of supported extensions, we plan to implement filters in our application so that only supported extensions are sent to TIKA for extraction and unsupported extensions are never sent to TIKA for processing.

I am also looking for documentation that captures performance statistics and recommendations for the different types of parsers currently supported by TIKA, e.g. <x> parser is resource intensive and <y> parser is time consuming, and so on, with the supporting statistics published.

Is there a common shared test data location (something similar to govdocs, or test data maintained by TIKA) against which parser testing is done?
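
The application-side filter described in the message above could look roughly like the sketch below. It is only an illustration under stated assumptions: the allow-list `SUPPORTED_EXTENSIONS` and the helper names `is_supported` and `partition_by_support` are hypothetical, and the extensions listed are placeholders, not an official list for any TIKA version. A real list would have to come from the TIKA release actually in use.

```python
from pathlib import PurePath

# Hypothetical allow-list, for illustration only. This is NOT an official
# list of extensions supported by any Tika version; populate it from the
# Tika release you actually deploy.
SUPPORTED_EXTENSIONS = frozenset({".pdf", ".docx", ".xlsx", ".html", ".txt"})


def is_supported(filename, supported=SUPPORTED_EXTENSIONS):
    """Return True if the file's final extension is in the allow-list.

    The comparison is case-insensitive. Only the last suffix is checked,
    so "archive.tar.gz" is judged by ".gz".
    """
    return PurePath(filename).suffix.lower() in supported


def partition_by_support(filenames, supported=SUPPORTED_EXTENSIONS):
    """Split filenames into (to_send, skipped), preserving input order."""
    to_send, skipped = [], []
    for name in filenames:
        if is_supported(name, supported):
            to_send.append(name)
        else:
            skipped.append(name)
    return to_send, skipped


if __name__ == "__main__":
    send, skip = partition_by_support(["report.PDF", "setup.exe", "notes.txt", "README"])
    print("send to extraction:", send)
    print("skipped:", skip)
```

Filtering on the extension alone is a heuristic: a file with no extension, or a misleading one, is skipped or passed regardless of its actual content, so a filter like this works best as a cheap first pass in front of extraction.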
