On 05/12/11 21:41, Albretch Mueller wrote:
  If you're interested in helping ...

 Yes, I can and would offer man/mind hours to include parsing (and
eventually processing) of movie media files in Tika

Great!

 I am definitely more inclined to use ffmpeg (your third option), but I
think we should think carefully about this and probably support more than
one option. There is already a Java port of parts of the FFMPEG project
(jffmpeg.sourceforge.net) but, as you may know already ;-), its
licensing is messy

Something like ffmpeg (via an external process) or jffmpeg couldn't be included in the core product anyway, for licensing reasons. They'd have to be maintained at least partly externally, so there would be nothing to stop people picking the right one for them.

(Possibly the code to talk to ffmpeg could be included in core, with the user responsible for downloading ffmpeg to use it, but code to talk to jffmpeg would need to be external)
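As a rough sketch of that "code in core, ffmpeg downloaded by the user" idea, the external call could look something like this. The class and method names here are hypothetical, and it assumes ffmpeg is on the user's PATH:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class ExternalFfmpeg {

    /** Builds the command line; "ffmpeg -i" prints container/stream info. */
    public static String[] buildCommand(String mediaFile) {
        return new String[] { "ffmpeg", "-i", mediaFile };
    }

    /** Runs ffmpeg and returns its diagnostic output. Requires ffmpeg on the PATH. */
    public static String probe(String mediaFile) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(mediaFile))
                .redirectErrorStream(true)  // ffmpeg writes its info to stderr; merge it in
                .start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        p.waitFor();
        return out.toString();
    }
}
```

The nice thing about this split is that only the small wrapper class ships in core, and it can fail gracefully if ffmpeg isn't installed.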


  About your second option all the info is in the containers anyway,
codecs are just encoded data

Alas not really, at least not the way users seem to think of things...

Consider this example. We have an MPEG container. Within it we find 4 MP3 audio streams (tagged with different languages), and two 5.1-channel Ogg Vorbis audio streams (same language). We also find 2 subtitle streams, and 3 MPEG-2 video streams (one at a higher bitrate than the others).

If you ask the user, that's an MPEG-2 video with 2 alternate camera angles, high-quality English audio, a high-quality English director's commentary, and translated audio.

If we don't understand the codecs, we can't figure out which streams are at which bitrates, which are video and which are audio, etc. Especially on some container formats (Ogg springs to mind) which are very general: the container provides framing info, but you need to know about the codecs to figure out what is in it.


The first step is going to be to make sure we can recognise all the different media containers (there's something like 6-10 of them), as we'll need that to know whether we should handle them or not. Next we'd want to understand the basics of the container, to pull out whatever metadata we can about it. Finally we'd need to implement basic metadata extractors for the key codecs (we already have this for some of the audio formats) so we can get info on what's in the container.


  Could you guide me/us with a running list of what you think needs to be done?

First up, I'd say one thing to do is come up with some (very small!) sample files in the different formats. Initially just one per container, but ideally also some with different contents too. (For example, both mpeg with mpeg2+mp3, and mpeg with mpeg2+mp3+mp3+ogg)

Next, using these sample files, we need to ensure that we have mime magic for all the container formats, along with unit tests. We'll also need to sort out mimetypes for the common combinations, and maybe also think about how to describe some of the cases (do we always name it after the biggest video stream, for example? Do we care about the container?)
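For the mime magic part, an entry might look something like this, following the tika-mimetypes.xml conventions. Treat the values as illustrative: the bytes shown are the usual MPEG program-stream and sequence-header prefixes, but the exact magic, offsets and priorities would need checking against real sample files before committing anything:

```xml
<mime-type type="video/mpeg">
  <magic priority="50">
    <!-- MPEG sequence header (illustrative, verify against samples) -->
    <match value="0x000001b3" type="string" offset="0"/>
    <!-- MPEG program stream pack header (illustrative, verify against samples) -->
    <match value="0x000001ba" type="string" offset="0"/>
  </magic>
  <glob pattern="*.mpg"/>
  <glob pattern="*.mpeg"/>
</mime-type>
```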

Now, if we wanted to go down the ffmpeg external processor route, we need to do two sets of mappings. One is from "ffmpeg -formats" to mimetypes, so that our parser can correctly claim the mimetypes it can handle. In addition, we need to work out how to map the output of "ffmpeg -i" back to our (often new) mimetypes, so we can have a detector based on it
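To make the second mapping concrete, here's a minimal sketch of pulling stream info out of the "ffmpeg -i" diagnostic output. The "Stream #..." line format varies between ffmpeg versions, so the regex below is illustrative and the class name is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FfmpegStreamParser {

    // Matches lines like "  Stream #0:1(eng): Audio: mp3, 44100 Hz, stereo"
    // (older ffmpeg versions use "Stream #0.1" with a dot instead of a colon)
    private static final Pattern STREAM_LINE =
            Pattern.compile("Stream #\\d+[:.]\\d+(?:\\((\\w+)\\))?: (\\w+): (\\w+)");

    /** Returns "type/codec" pairs, e.g. "Audio/mp3", one per stream line found. */
    public static List<String> parseStreams(String ffmpegOutput) {
        List<String> streams = new ArrayList<>();
        Matcher m = STREAM_LINE.matcher(ffmpegOutput);
        while (m.find()) {
            streams.add(m.group(2) + "/" + m.group(3));
        }
        return streams;
    }
}
```

A detector would then translate those type/codec pairs into our (often new) mimetypes, which is exactly the mapping table we'd need to design.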


 I know there are developers extracting the sequences of images of the
subtitles and using OCR to change them to text ... Anyone can see
how useful such a thing could be. Could Tika reach into those deep
waters?

Let's have some more progress on the regular OCR stuff first, then we can worry about extracting the subtitles and finally figure out how to OCR them... :)

Nick
