Marc,

Thank you for starting a separate thread. This is very helpful. Would you be willing to open an issue on our JIRA? If we can think of a way of automating this like we do with "Full list of Supported Formats in 'standard' artifacts" (https://tika.apache.org/2.7.0/formats.html), that'd be great.

Given that we change the modules outside of the standard artifacts so rarely, maybe we hard-code those in the "formats" template for now? As you point out, we need to include Maven coordinates with the lists. Does the "full list" on that link look like what you'd want (if we added coordinates)?

On Tue, Mar 7, 2023 at 1:03 PM Marc C Ubaldino <[email protected]> wrote:

> (I’m starting a new thread because I did not want to hijack the previous
> discussion on Metadata obj reuse, etc.)
>
> My original intent was to know if the Tika Project has a tabulation of
> Parsers ~ mapping a file type to a parser to a Maven artifact. Maven
> artifacts have proliferated and it’s now more important to know how it all
> ties together because you have to get your `tika-config.xml` just right …
> More thoughts below.
>
> From: Tim Allison [email protected]
> Date: Tuesday, March 7, 2023 at 12:48 PM
> Subject: Re: [EXT] Re: Best practice for extracting content and metadata
> repeatedly
>
> // Thank you, Marc.
> //
> // Please let us know how we can improve the documentation here:
> // https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0
> //
> // and/or if we need to add documentation elsewhere.
> //
> // Tim
>
> Thanks, Tim. A few more ideas below – and yes, I think a new Parser Index
> page is needed to tie this all together, or an update to the Parsers page
> below:
>
> https://cwiki.apache.org/confluence/display/TIKA/Parsers – This page looks
> close, but it’s jargon-based. Possibly not comprehensive, and more a list of
> worked examples?
>
> The “Migrating to Tika 2.x” page is also fun reading – if you are migrating.
> For those finding Tika now and using 2.x, the concept of migration is not
> relevant.
>
> Can you provide a simple tabulation of File type, Parser(s) and Maven plugin
> as a new page?
>
> Possible model: the Maven Plugins table is like this,
> https://maven.apache.org/plugins/index.html
>
> Marc
