On Fri, 26 Apr 2024, Mauler, David wrote:
> I'm in the process of troubleshooting an issue with certain mp4 video
> files and tika. After a bunch of digging, it appears to be related to
> whatever ISO is set for the mp4 file. An mp4 with an ISO of
> 14496-12:2003 will be detected as video/quicktime but an mp4 with an ISO
> of 14496-14 is detected as video/mp4 which is what I was expecting for
> both files.

It depends on where in the file the type box lives. At the moment, we only
have mime-magic based detection for the QuickTime / MP4 family of formats.
If the right box in the container is at the start, we're OK; if it comes
later, we can't tell with just a mime magic signature.
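
To illustrate the limitation, here is a rough sketch of what a fixed-offset
magic check amounts to. This is not Tika's actual code: the class and helper
names are made up for illustration, and the only thing assumed about the
format is the usual box layout (4-byte big-endian size, then a 4-byte type).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; none of these names come from Tika.
final class FixedOffsetMagicDemo {

    // True if the ASCII string `magic` sits at exactly `offset` in `head`.
    static boolean matchesAt(byte[] head, int offset, String magic) {
        byte[] m = magic.getBytes(StandardCharsets.US_ASCII);
        if (offset < 0 || head.length - offset < m.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOfRange(head, offset, offset + m.length), m);
    }

    // Builds one box: 4-byte big-endian size, 4-byte type, then the payload.
    static byte[] box(String type, byte[] payload) {
        ByteBuffer b = ByteBuffer.allocate(8 + payload.length);
        b.putInt(8 + payload.length);
        b.put(type.getBytes(StandardCharsets.US_ASCII));
        b.put(payload);
        return b.array();
    }

    // Joins two byte arrays, e.g. two boxes laid out back to back.
    static byte[] concat(byte[] first, byte[] second) {
        byte[] out = Arrays.copyOf(first, first.length + second.length);
        System.arraycopy(second, 0, out, first.length, second.length);
        return out;
    }
}
```

With a type box at the very start, matchesAt(file, 4, "ftyp") succeeds. Put
any other box in front of it and the same check fails, even though the type
box is still there a few bytes further on.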
What we really need is a container-aware detector for the file format,
similar to what we have for Zip files, and for the Ogg family. That would
properly process the file in a format-aware way, checking the contents
to correctly identify the type.
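
As a very rough idea of what "container-aware" could mean here, the sketch
below walks the top-level boxes of a stream until it finds the type box and
returns its major brand. It is only a sketch under assumptions: the class
and method names are hypothetical, it is not wired into Tika's detector API,
and how a brand should map to a media type is left open.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika.
final class BoxWalkerSketch {

    // Returns the major brand of the first top-level "ftyp" box,
    // or null if none is found within maxBoxes boxes.
    static String findMajorBrand(InputStream in, int maxBoxes) {
        DataInputStream data = new DataInputStream(in);
        byte[] fourcc = new byte[4];
        try {
            for (int i = 0; i < maxBoxes; i++) {
                long size = data.readInt() & 0xFFFFFFFFL; // unsigned 32-bit size
                data.readFully(fourcc);
                long headerLen = 8;
                if (size == 1) {              // 64-bit size follows the type
                    size = data.readLong();
                    headerLen = 16;
                }
                String type = new String(fourcc, StandardCharsets.ISO_8859_1);
                if (type.equals("ftyp")) {
                    if (size != 0 && size - headerLen < 4) {
                        return null;          // too small to hold a brand
                    }
                    data.readFully(fourcc);   // major brand, e.g. "mp42"
                    return new String(fourcc, StandardCharsets.ISO_8859_1);
                }
                if (size == 0) {
                    return null;              // box runs to end of file
                }
                long toSkip = size - headerLen;
                if (toSkip < 0) {
                    return null;              // malformed size
                }
                skipFully(data, toSkip);
            }
        } catch (EOFException e) {
            return null;                      // ran out of data first
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return null;
    }

    // Skips exactly n bytes, or throws EOFException if the stream ends.
    private static void skipFully(DataInputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped <= 0) {
                if (in.read() < 0) {
                    throw new EOFException();
                }
                skipped = 1;
            }
            n -= skipped;
        }
    }
}
```

Something like findMajorBrand(stream, 16) would then find the brand whether
the type box is first or sits behind a few other boxes. The maxBoxes cap is
just a guess at a sensible way to bound the work on large or odd files.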
The long-standing issue is https://issues.apache.org/jira/browse/TIKA-2935
- do you have a few days of spare coding time you could put towards this,
and/or a bit of budget to sponsor someone to?
Thanks
Nick