I guess the question is how far we want to bake this in. I could see adding a field for the default extension in the CompositeDetector/DefaultDetector. This would then be triggered on embedded files, too. I can't imagine this would add much cost computationally(???), and it would just show up for free all over the place.
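
To make the idea concrete, here is a rough sketch of a detector wrapper that hands back the default extension alongside the detected type. Everything in it is hypothetical: `ExtensionAwareDetection`, `Detected`, `withDefaultExtension`, the toy `detect` method and the in-memory registry are made-up stand-ins, not Tika classes or the actual CompositeDetector/DefaultDetector API. It only illustrates that the extra work per file is a single lookup.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

public class ExtensionAwareDetection {

    /** Hypothetical result: the detected media type plus its default extension, if one is registered. */
    public static final class Detected {
        public final String mediaType;
        public final Optional<String> defaultExtension;

        Detected(String mediaType, Optional<String> defaultExtension) {
            this.mediaType = mediaType;
            this.defaultExtension = defaultExtension;
        }
    }

    /** Toy stand-in for a real detector: recognises only the "%PDF" magic bytes. */
    public static String detect(byte[] head) {
        int n = Math.min(head.length, 4);
        String prefix = new String(head, 0, n, StandardCharsets.US_ASCII);
        return prefix.equals("%PDF") ? "application/pdf" : "application/octet-stream";
    }

    /** Wraps any detection function and attaches the extension found in the (assumed) registry. */
    public static Function<byte[], Detected> withDefaultExtension(
            Function<byte[], String> detector, Map<String, String> extensionRegistry) {
        return bytes -> {
            String type = detector.apply(bytes);
            // One map lookup per detection; types without an entry get an empty Optional.
            return new Detected(type, Optional.ofNullable(extensionRegistry.get(type)));
        };
    }

    public static void main(String[] args) {
        // Assumed registry contents, for illustration only.
        Map<String, String> registry = Map.of("application/pdf", ".pdf");
        Function<byte[], Detected> detector =
                withDefaultExtension(ExtensionAwareDetection::detect, registry);

        Detected result = detector.apply("%PDF-1.7".getBytes(StandardCharsets.US_ASCII));
        System.out.println(result.mediaType + " " + result.defaultExtension.orElse("(none)"));
    }
}
```

Because the wrapper sits around the detection call itself, anything that runs detection (including on embedded files) would pick up the extension without further changes, which is the "for free" part.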

It does feel a bit smelly to add this one feature, but I've done worse in my career. :(

Or, do we want a custom handler/parameter on the detect/ endpoint in tika-server?

Is the use case that you want to parse the file _and_ get this information in one go? Or, are you only running detect on the main/container file?

On Thu, Feb 17, 2022 at 2:00 PM Nick Burch <[email protected]> wrote:
>
> On Thu, 10 Feb 2022, Nick Burch wrote:
> > On Thu, 10 Feb 2022, Willy T. Koch wrote:
> >> …and calling it as a webservice with Postman/curl.
> >
> > Ah, I think we might not be exposing the full details of the mime types via
> > the server, only details of their parsers and the heirarchy, eg
> > http://localhost:9998/mime-types#audio/vorbis
> >
> > (We have that info in Java we're just seemingly not making it available)
> >
> >
> > I'm not sure about exposing all the details of all the types by default,
> > but adding a flag and/or a sub-endpoint that would return the full
> > details of a type, including extensions and comments etc, seems OK to
> > me. Thoughts anyone?
>
> Tika devs - any thoughts on this? It's a pretty small code change (we
> already have the data on the mime type!), just need feedback on extending
> the existing API vs adding a new one
>
> Nick
