On 05/12/11 21:41, Albretch Mueller wrote:
  If you're interested in helping ...

 Yes, I can and would offer man/mind hours to include parsing (and
eventually processing) of movie media files in Tika

Great!

 I am definitely more inclined to use ffmpeg (your third option), but I
think we should think carefully about this and probably support more than
one option. There is already a Java port of parts of the FFMPEG project
(jffmpeg.sourceforge.net) but, as you may know already ;-), its
licensing is messy

Something like ffmpeg (via an external process) or jffmpeg couldn't be included in the core product anyway, for licensing reasons. They'd have to be maintained at least partly externally, so there would be nothing to stop people picking the right one for them.

(Possibly the code to talk to ffmpeg could be included in core, with the user responsible for downloading ffmpeg to use it, but code to talk to jffmpeg would need to be external)
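As a rough sketch of that "code in core, ffmpeg downloaded by the user" idea, the external call could look something like this. The class and method names here are hypothetical, and it assumes ffmpeg is on the user's PATH:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class ExternalFfmpeg {

    /** Builds the command line; "ffmpeg -i" prints container/stream info. */
    public static String[] buildCommand(String mediaFile) {
        return new String[] { "ffmpeg", "-i", mediaFile };
    }

    /** Runs ffmpeg and returns its diagnostic output. Requires ffmpeg on the PATH. */
    public static String probe(String mediaFile) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(mediaFile))
                .redirectErrorStream(true)  // ffmpeg writes its info to stderr; merge it in
                .start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        p.waitFor();
        return out.toString();
    }
}
```

The nice thing about this split is that only the small wrapper class ships in core, and it can fail gracefully if ffmpeg isn't installed.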


  About your second option all the info is in the containers anyway,
codecs are just encoded data

Alas not really, at least not the way users seem to think of things...

Consider this example. We have an MPEG container. Within it we find 4 MP3 audio streams (tagged with different languages), and two 5.1-channel Ogg Vorbis audio streams (same language). We also find 2 subtitle streams, and 3 MPEG-2 video streams (one at a higher bitrate than the others).

If you ask the user, that's an MPEG-2 video with 2 alternate camera angles, high-quality English audio, a high-quality English director's commentary, and translated audio.

If we don't understand the codecs, we can't figure out which streams are at which bitrates, which are video and which are audio, etc. Especially on some container formats (Ogg springs to mind) which are very general: the container provides framing info, but you need to know about the codecs to figure out what is in it.


The first step is going to be to make sure we can recognise all the different media containers (there's something like 6-10 of them), as we'll need that to know whether we should handle them or not. Next we'd want to understand the basics of the container, to pull out whatever metadata we can about it. Finally we'd need to implement basic metadata extractors for the key codecs (we already have this for some of the audio formats) so we can get info on what's in the container.


  Could you guide me/us with a running list of what you think needs to be done?

First up, I'd say one thing to do is come up with some (very small!) sample files in the different formats. Initially just one per container, but ideally also some with different contents too. (For example, both mpeg with mpeg2+mp3, and mpeg with mpeg2+mp3+mp3+ogg)

Next, using these sample files, we need to ensure that we have mime magic for all the container formats, along with unit tests. We'll also need to sort out mimetypes for the common combinations, and maybe also think about how to describe some of the cases (do we always name it after the biggest video stream, for example? Do we care about the container?)
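For the mime magic part, an entry might look something like this, following the tika-mimetypes.xml conventions. Treat the values as illustrative: the bytes shown are the usual MPEG program-stream and sequence-header prefixes, but the exact magic, offsets and priorities would need checking against real sample files before committing anything:

```xml
<mime-type type="video/mpeg">
  <magic priority="50">
    <!-- MPEG sequence header (illustrative, verify against samples) -->
    <match value="0x000001b3" type="string" offset="0"/>
    <!-- MPEG program stream pack header (illustrative, verify against samples) -->
    <match value="0x000001ba" type="string" offset="0"/>
  </magic>
  <glob pattern="*.mpg"/>
  <glob pattern="*.mpeg"/>
</mime-type>
```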

Now, if we wanted to go down the ffmpeg external processor route, we need to do two sets of mappings. One is from "ffmpeg -formats" to mimetypes, so that our parser can correctly claim the mimetypes it can handle. In addition, we need to work out how to map the output of "ffmpeg -i" back to our (often new) mimetypes, so we can have a detector based on it
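To make the second mapping concrete, here's a minimal sketch of pulling stream info out of the "ffmpeg -i" diagnostic output. The "Stream #..." line format varies between ffmpeg versions, so the regex below is illustrative and the class name is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FfmpegStreamParser {

    // Matches lines like "  Stream #0:1(eng): Audio: mp3, 44100 Hz, stereo"
    // (older ffmpeg versions use "Stream #0.1" with a dot instead of a colon)
    private static final Pattern STREAM_LINE =
            Pattern.compile("Stream #\\d+[:.]\\d+(?:\\((\\w+)\\))?: (\\w+): (\\w+)");

    /** Returns "type/codec" pairs, e.g. "Audio/mp3", one per stream line found. */
    public static List<String> parseStreams(String ffmpegOutput) {
        List<String> streams = new ArrayList<>();
        Matcher m = STREAM_LINE.matcher(ffmpegOutput);
        while (m.find()) {
            streams.add(m.group(2) + "/" + m.group(3));
        }
        return streams;
    }
}
```

A detector would then translate those type/codec pairs into our (often new) mimetypes, which is exactly the mapping table we'd need to design.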


 I know there are developers extracting the sequences of images of the
subtitles and using OCR to change them to text ... Anyone can see
how useful such a thing could be. Could Tika reach into those deep
waters?

Let's have some more progress on the regular OCR stuff first, then we can worry about extracting the subtitles and finally figure out how to OCR them... :)

Nick
