On 14/06/2013 19:36, Ivan Frade wrote:
Hi Ivan,
During a Tracker/Nepomuk/SPARQL training I gave at one of my customers
I noted the interest in extractors that can dive into archives and
document types that have a tree of other documents (like MIME documents).
Just today another message in this mailing list was mentioning it :)
Yep. I noticed it.
Either that, or libtracker-extract should allow stream- or buffer-based
extraction, and/or file-descriptor-based extraction (in which case we
could pass the extractor modules, the ones now only used by
tracker-extract, a pipe-created FD from the E-mail client, and
write the Base64-decoded data to the pipe FD, or something like that).
Unfortunately, tracker-extract is right now entirely FILE based
(not really FD based, nor stream based).
FD passing and buffered extraction are both good ideas. They are also
independent: we could implement either of them without the other.
A problem that I see with FD passing using a pipe is that we can't
know for sure whether a library that we depend on for metadata
extraction won't use seek() and assume it got a real file's FD. I'm
even afraid that most do. The others are probably buffer or mmap based.
This means that libstreamanalyzer's way of just rewriting all extractors to
be stream based is probably the only way to end up with a consistent and
sensible solution for in-archive metadata extraction.
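To make the seek() concern concrete, here is a minimal sketch in C of how one could probe whether a descriptor supports seeking at all. The names fd_is_seekable() and seek_probe_demo() are made up for this mail; nothing like them exists in libtracker-extract as far as I know. It only relies on the POSIX behaviour that lseek() fails with ESPIPE on pipes, FIFOs and sockets.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of libtracker-extract: reports whether
 * a library that calls lseek() could work on this descriptor. */
static bool
fd_is_seekable (int fd)
{
    assert (fd >= 0);

    /* Asking for the current offset does not move the file position.
     * On pipes, FIFOs and sockets the call fails with ESPIPE. */
    return lseek (fd, 0, SEEK_CUR) != (off_t) -1;
}

/* Demonstration: probe a pipe and a regular temporary file.
 * Returns 0 if the pipe is not seekable and the file is, 1 otherwise. */
static int
seek_probe_demo (void)
{
    int fds[2];
    FILE *tmp;
    bool pipe_seekable, file_seekable;

    if (pipe (fds) != 0)
        return 1;
    pipe_seekable = fd_is_seekable (fds[0]);
    close (fds[0]);
    close (fds[1]);

    tmp = tmpfile ();
    if (tmp == NULL)
        return 1;
    file_seekable = fd_is_seekable (fileno (tmp));
    fclose (tmp);

    printf ("pipe seekable: %d, regular file seekable: %d\n",
            pipe_seekable, file_seekable);

    return (!pipe_seekable && file_seekable) ? 0 : 1;
}
```

A probe like this only tells us that seeking would fail. It does not tell us whether a given extraction library actually tries to seek, which is the part we can't know without reading each library's code.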
I think it would be a great first addition if the tracker-extract
.rule file based environment would be adapted to have two levels
of matching: first on container and then on MIME type. The first
level would be "Just File" for all of its native extractors, and
"MIMEDocument" and "Archive" for libstreamanalyzer's. The
second level would be the same as now. Ideally this level system
could also be used for multimedia files (videos first have a MIME
type and then a codec type, for example).
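As a rough illustration of the two-level idea, here is a small C sketch that matches first on a container kind and then on the MIME type. Everything in it is an assumption for the sake of the example: the ContainerKind enum, the Rule struct, the module names and the simple prefix comparison are invented here and do not reflect how the existing .rule files are parsed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical first matching level: what kind of container holds the data. */
typedef enum {
    CONTAINER_FILE,          /* "Just File": a plain file on disk */
    CONTAINER_ARCHIVE,       /* an entry inside an archive */
    CONTAINER_MIME_DOCUMENT  /* a part inside a MIME document */
} ContainerKind;

/* Hypothetical rule: container first, then a MIME type prefix. */
typedef struct {
    ContainerKind  container;
    const char    *mime_prefix;  /* "" matches every MIME type */
    const char    *module;       /* made-up module names */
} Rule;

static const Rule rules[] = {
    { CONTAINER_FILE,          "audio/",          "native-audio"   },
    { CONTAINER_FILE,          "application/pdf", "native-pdf"     },
    { CONTAINER_ARCHIVE,       "",                "streamanalyzer" },
    { CONTAINER_MIME_DOCUMENT, "",                "streamanalyzer" },
};

/* Returns the module of the first rule whose container and MIME prefix
 * both match, or NULL when no rule applies. */
static const char *
find_module (ContainerKind container, const char *mime)
{
    size_t i;

    assert (mime != NULL);

    for (i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        const Rule *r = &rules[i];

        if (r->container == container &&
            strncmp (mime, r->mime_prefix, strlen (r->mime_prefix)) == 0)
            return r->module;
    }

    return NULL;
}
```

With this table, find_module (CONTAINER_FILE, "audio/mpeg") selects the native audio module, while the same MIME type found inside an archive goes to the stream-based one.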
Is this two-level matching really needed? In the end, we recognize the
containers by their MIME types (e.g. application/x-tgz). With the current
.rules files, we can assign those "container MIME types" to the
topanalyzer.
You're right if there is only one such kind of extractor. As soon as you
want to select libstreamanalyzer for one archive/MIME-type
combination and another extractor for another archive/MIME-type
combination, this won't work. But I agree that we could have a
tracker-extract-container.c and .rule for application/x-tgz, among other
container types, that then splits it out to
tracker-extract-container-streamanalyzer.cpp and
tracker-extract-container-somethingelse.c. That split would be based on
logic defined not in the top .rule system but in what
tracker-extract-container.c itself does. At first this can simply be
to throw them all to a streamanalyzer.cpp one (which will likely look
a lot like what tracker-topanalyzer.cpp is now).
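A sketch of what that internal split could look like, in C. This is not existing code: container_extract(), the BackendId enum, the two backend stubs and the application/zip override are all invented for illustration. The point is only that the choice of backend lives inside the container module, with the streamanalyzer one as the default.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical identifiers for the backends behind the container module. */
typedef enum {
    BACKEND_STREAMANALYZER,
    BACKEND_SOMETHING_ELSE
} BackendId;

typedef BackendId (*ContainerBackend) (const char *uri);

/* Stub standing in for tracker-extract-container-streamanalyzer.cpp. */
static BackendId
extract_with_streamanalyzer (const char *uri)
{
    (void) uri;  /* a real backend would open and walk the container */
    return BACKEND_STREAMANALYZER;
}

/* Stub standing in for tracker-extract-container-somethingelse.c. */
static BackendId
extract_with_something_else (const char *uri)
{
    (void) uri;
    return BACKEND_SOMETHING_ELSE;
}

/* Made-up override table: container MIME types that should not go to
 * the default backend. At first this table could simply be empty. */
static const struct {
    const char       *mime;
    ContainerBackend  backend;
} overrides[] = {
    { "application/zip", extract_with_something_else },
};

/* Picks a backend for the container and runs it. Anything without an
 * override is thrown to the streamanalyzer backend. */
static BackendId
container_extract (const char *uri, const char *mime)
{
    size_t i;

    assert (uri != NULL && mime != NULL);

    for (i = 0; i < sizeof overrides / sizeof overrides[0]; i++) {
        if (strcmp (mime, overrides[i].mime) == 0)
            return overrides[i].backend (uri);
    }

    return extract_with_streamanalyzer (uri);
}
```

The top-level .rule file would then only need to map the container MIME types to this one module.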
Note that there's no reason to keep tracker-topanalyzer.cpp's filename.
With the .rule based system the filename topanalyzer.cpp makes no sense
anymore.
Then it would start being possible for an extractor module like
tracker-topanalyzer.cpp to get kicked into action for diving into
archive files and MIME documents (and the native ones would still
operate on native file types).
Also, tracker-topanalyzer.cpp should be fixed. It has been a
long time since it was last tested and I don't expect it to still
work. And for it to work, libstreamanalyzer would probably need
to be adapted to follow Tracker's Nepomuk
adaptations (right now libstreamanalyzer doesn't know about the
nmm ontology, afaik).
I wonder if Jos is still working on it. We could bring that topanalyzer
extractor back to life, use it for compressed files and move on
from there.
If Jos isn't working on it anymore, surely we can look into it
ourselves. I doubt that Jos would reject patches that conditionally make
libstreamanalyzer spit out a better ontology than the broken upstream
Nepomuk ontologies for multimedia. A bit of stream and decorator work in
C++ will do us good.
Kind regards,
Philip