On 28 June 2013 18:37, Martyn Russell <mar...@lanedo.com> wrote:
> On 28/06/13 07:45, Philip Van Hoof wrote:
>>
>> Op 28/06/2013 8:30, Jonatan Pålsson schreef:
>>
>> Hi Jonatan,
>
>
> Hello Jonatan, Philip,

Hey Martyn!

>
>
>>> The main reason for not modifying the original extractor is that I
>>> want to keep it as a fallback if this new extractor fails due to an
>>> unexpected file structure. Since png-faster tries to skip to the end
>>> of the file by estimating the location of the metadata contained in
>>> the end of the file using the file size & IDAT chunk size, I predict
>>> it may fail more often than the original. Since tracker-extract
>>> handles these failures gracefully, this is not a problem however.
>>>
>>> The best way I can see to get a similar functionality in to the
>>> existing extractor would be to modify libpng to allow skipping to the
>>> end of the file (right now there is a comment in the existing png
>>> extractor noting that this functionality is missing from the library),
>>> but since reading the PNG format is relatively simple I opted to put
>>> this functionality in the extractor rather than first patching libpng
>>> (I am not sure how much work this would be, either).
>>>
>>> What are your thoughts on keeping png-faster as a separate, optional
>>> extractor module which can be enabled when extraction speed is of
>>> primary concern?
>>
>>
>> After reading this, my opinion is that we should keep it as separate
>> modules. What do you think, Martyn?
>
>
> I think that really depends on the diff. It's harder to see what the real
> changes are with an entirely new module.

Yes, I agree. It is definitely more difficult to see what has changed
now. I did not base the new extractor on the code from the old one, so
getting a meaningful diff is a bit difficult. It would however be
possible to merge the two extractors (and thus get a cleaner diff),
but this would take some time.

>
> Ideally, I would like to share functionality where possible. That's the
> first thing.
>
> If it's a case of switching between fast() and slow() functions depending on
> IDAT chunks then it's likely to be quicker over thousands of files, to do
> that quick test first and then call the slow function in the one extractor
> instead of switching out to the next module where we have to do a bunch of
> new seeking from scratch. It's hard to say without looking at the real
> changes and should be tested too. I've not looked in detail myself yet and I
> plan to :)

I definitely see your point here. Failing over from one extractor to a
different one several thousand times is a huge waste of time. This
would happen whenever png-faster fails to skip to the end of the file,
most likely because the IDAT chunks are of variable size. I'd like to
point out that this should be an uncommon scenario, however (I have
never seen such a file). If it turns out to be much more common than I
anticipate, the usefulness of png-faster can be debated :) The worst
case for png-faster which I can think of is if all PNG files scanned
by Tracker are produced by the same software or camera, and those PNGs
have variable-sized IDATs. That would be bad.

I'm obviously partial here, given the approach taken in png-faster,
but I like the idea of separating different extraction strategies into
different extractor modules. It means they can easily be disabled,
prioritized, and so on. The alternative (if the two extractors are
merged) would be to use #ifdefs within the extractor module, which
lets us select a strategy at compile time, but only at compile time.

On a slightly different note: right now, some extractors can fall back
to a more generic extractor, for example GStreamer, which is exactly
what I am going for with png-faster as well. The argument you make
about the "faster" extractor failing applies to these extractors too,
and I wonder: wouldn't it be nice to blacklist certain extractors
dynamically if they prove error-prone? Say png-faster, or the MP3
extractor, has failed five times in a row (or several times within a
short period), and a more generic extractor is available;
tracker-extract could then automatically skip the specialized
extractor and choose the lower-priority one instead. With this
functionality in place, the original concern about png-faster failing
many times would be mitigated, and it could also give an overall
performance boost to the other modules which have more generic
fallbacks available. The blacklist could be kept in the memory of the
tracker-extract process, so it is invalidated after each mining run (I
assume permanently faulty extractor modules are rare). Thoughts on
this?

--
Regards,
Jonatan Pålsson

Pelagicore AB
Ekelundsgatan 4, 6th floor, SE-411 18 Gothenburg, Sweden
_______________________________________________
tracker-list mailing list
tracker-list@gnome.org
https://mail.gnome.org/mailman/listinfo/tracker-list