On 28 June 2013 18:37, Martyn Russell <mar...@lanedo.com> wrote:
> On 28/06/13 07:45, Philip Van Hoof wrote:
>>
>> On 28/06/2013 8:30, Jonatan Pålsson wrote:
>>
>> Hi Jonatan,
>
> Hello Jonatan, Philip,
Hey Martyn!

>>> The main reason for not modifying the original extractor is that I
>>> want to keep it as a fallback if this new extractor fails due to an
>>> unexpected file structure. Since png-faster tries to skip to the end
>>> of the file by estimating the location of the metadata contained in
>>> the end of the file using the file size & IDAT chunk size, I predict
>>> it may fail more often than the original. Since tracker-extract
>>> handles these failures gracefully, this is not a problem however.
>>>
>>> The best way I can see to get similar functionality into the
>>> existing extractor would be to modify libpng to allow skipping to
>>> the end of the file (right now there is a comment in the existing
>>> png extractor noting that this functionality is missing from the
>>> library), but since reading the PNG format is relatively simple I
>>> opted to put this functionality in the extractor rather than first
>>> patching libpng (I am not sure how much work that would be, either).
>>>
>>> What are your thoughts on keeping png-faster as a separate, optional
>>> extractor module which can be enabled when extraction speed is of
>>> primary concern?
>>
>> After reading this, my opinion is that we should keep it as separate
>> modules. What do you think, Martyn?
>
> I think that really depends on the diff. It's harder to see what the
> real changes are with an entirely new module.

Yes, I agree. It is definitely more difficult to see what has changed now. I did not base the new extractor on the code of the old one, so producing a diff is harder. It would, however, be possible to merge the two extractors (and thus more easily get a diff), but this would take some time.

> Ideally, I would like to share functionality where possible. That's
> the first thing.
> If it's a case of switching between fast() and slow() functions
> depending on IDAT chunks then it's likely to be quicker over
> thousands of files, to do that quick test first and then call the
> slow function in the one extractor instead of switching out to the
> next module where we have to do a bunch of new seeking from scratch.
> It's hard to say without looking at the real changes and should be
> tested too. I've not looked in detail myself yet and I plan to :)

I definitely see your point here. Failing one extractor only to switch over to a different one several thousand times is a huge waste of time. This would happen when png-faster fails to skip to the end of the file, most likely due to the IDAT chunks being of variable size. I'd like to point out that this should be an uncommon scenario, however (based on the fact that I have never seen such a file). If it turns out to be much more common than I anticipate, the usefulness of png-faster can be debated :)

The worst case for png-faster that I can think of is if the same software/camera produces all PNG files scanned by Tracker, and these PNGs have variable-sized IDATs. This would be bad.

I'm obviously partial here, due to the approach taken in png-faster, but I like the idea of separating different extraction strategies into different extractor modules. This means they can easily be disabled, prioritized, etc. A different approach (which would be taken if the two extractors were merged) would be to use #ifdefs within the extractor module; this means we could select an extraction strategy at compile time, but only at compile time.

On a slightly different note: right now, some extractors can fall back to a more generic extractor, for example GStreamer, which is exactly what I am going for in png-faster as well. The argument you make concerning when the "faster" extractor fails is very valid for these extractors too, and I wonder..
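To make the discussion above concrete, here is a rough Python sketch of the chunk-skipping strategy as I understand it; the function names and the equal-sized-IDAT heuristic are my own illustration, not the actual png-faster code (which is C). It reads the first IDAT chunk's size and jumps forward in steps of that size; if the estimate misses a chunk boundary or the IDAT sizes vary, it returns None so the caller can fall back to a full sequential parse in the same module, exactly the fast()-then-slow() pattern Martyn describes:

```python
import struct
import zlib

PNG_SIG = b"\x89PNG\r\n\x1a\n"

def chunk(ctype, data):
    """Serialize one PNG chunk: length, type, data, CRC."""
    return (struct.pack(">I", len(data)) + ctype + data
            + struct.pack(">I", zlib.crc32(ctype + data) & 0xFFFFFFFF))

def read_chunk_header(f):
    """Read an 8-byte chunk header; return (length, type) or (None, None)."""
    hdr = f.read(8)
    if len(hdr) < 8:
        return None, None
    length, ctype = struct.unpack(">I4s", hdr)
    return length, ctype

def fast_trailing_chunks(f):
    """Return the chunk types after the IDAT run, or None when the
    equal-sized-IDAT assumption does not hold for this file."""
    f.seek(len(PNG_SIG))
    # Walk the header chunks until the first IDAT.
    while True:
        length, ctype = read_chunk_header(f)
        if ctype is None:
            return None
        if ctype == b"IDAT":
            break
        f.seek(length + 4, 1)          # skip chunk data + CRC
    step = 8 + length + 4              # total size of one IDAT chunk
    f.seek(-8, 1)                      # back to the first IDAT header
    # Jump in fixed-size steps while we keep landing on IDAT headers.
    while True:
        pos = f.tell()
        length2, ctype2 = read_chunk_header(f)
        if ctype2 != b"IDAT":
            if ctype2 is None or not ctype2.isalpha():
                return None            # estimate failed: use the slow path
            f.seek(pos)                # landed on a valid trailing chunk
            break
        if length2 != length:
            return None                # variable-sized IDATs: use the slow path
        f.seek(step - 8, 1)
    # Collect the trailing chunk types (tEXt, tIME, IEND, ...).
    names = []
    while True:
        length3, ctype3 = read_chunk_header(f)
        if ctype3 is None:
            break
        names.append(ctype3.decode("ascii"))
        f.seek(length3 + 4, 1)
    return names
```

The key point for the single-module-vs-two-modules question: the None return is cheap to detect, so calling the slow parser directly from the same extractor avoids re-dispatching and re-seeking from scratch in a second module.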
Wouldn't it be nice to blacklist certain extractors dynamically if they are prone to errors? Say png-faster or the mp3 extractor has failed five times in a row (or several times within a short period of time), and there is a more generic extractor available; the specialized extractor could then automatically be skipped by tracker-extract, and the extractor with a lower priority could be chosen.

With this functionality in place, the original concern that png-faster might fail very many times would be mitigated, while also possibly contributing to an overall performance boost for the other modules which have more generic extractors available. The blacklist could be kept in the memory of the tracker-extract process, thus invalidating it after each mining run (I assume permanently faulty extractor modules are not common).

Thoughts on this?

-- 
Regards,
Jonatan Pålsson

Pelagicore AB
Ekelundsgatan 4, 6th floor, SE-411 18 Gothenburg, Sweden
_______________________________________________
tracker-list mailing list
tracker-list@gnome.org
https://mail.gnome.org/mailman/listinfo/tracker-list
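P.S. A rough Python sketch of the in-memory blacklist idea, to make it concrete; the class name, the threshold of five, and the priority-ordered list are all hypothetical illustrations, not existing tracker-extract API:

```python
from collections import defaultdict

FAILURE_THRESHOLD = 5   # consecutive failures before a module is skipped

class ExtractorDispatcher:
    """Try extractors in priority order, skipping any that has failed
    FAILURE_THRESHOLD times in a row. The counters live only in memory,
    so the blacklist resets with the process, matching the assumption
    that permanently faulty modules are rare."""

    def __init__(self, extractors):
        # extractors: list of (name, callable) ordered best-first
        self.extractors = extractors
        self.failures = defaultdict(int)

    def extract(self, path):
        for name, func in self.extractors:
            if self.failures[name] >= FAILURE_THRESHOLD:
                continue                    # dynamically blacklisted
            try:
                result = func(path)
            except Exception:
                self.failures[name] += 1    # extend the failure streak
                continue                    # fall through to the next module
            self.failures[name] = 0         # success resets the streak
            return result
        return None                         # no extractor could handle it
```

With this shape, a persistently failing png-faster stops being invoked at all after five misses, and every later PNG goes straight to the generic fallback, which is where the overall speedup for the thousands-of-files case would come from.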