Hey,

On Wed, 2011-09-14 at 15:15 +0100, Martyn Russell wrote:
> On 14/09/11 12:04, Carlos Garnacho wrote:
> > Hey hey,
>
> Hi Carlos,
>
> > Lately I've been thinking on how to improve TrackerMinerFS design and
> > performance, as it's a big piece of code that's getting too intricate at
>
> Me too, I will come on to my thoughts later in this email.
>
> > places. It mainly has 2 roles that we should separate further:
> >
> >   * Keeping track of what files to index (either fed through the crawler
> >     or the dir monitors)
> >   * actually indexing them
> >
> > For each of these 2 roles TrackerMinerFS maintains one cache (mtimes for
> > the first, URNs for the second) that's filled in per-directory as
> > processing goes, which introduces a latency directly related to how
> > scattered is the data in the FS.
> >
> > Another source of latency is the need to have a parent folder URN before
> > inserting the data for the file at hand, which forces a flush/commit
> > right before indexing files within a folder to keep
> > nfo:belongsToContainer consistent, but that's harder to beat.
> >
> > So, my idea to improve these situations is to separate the first role
> > out to a separate object that is able to carry out caching operations at
> > a higher level than folders (probably for entire configured
> > directories), and would hide the crawler and the monitor to the miner.
> > That way the miner would query in one go what now does in scattered
> > chunks. Very rough testing seemed to show crawling is reduced to 30%-40%
> > of the original time, just ~2x the effort of only adding the directory
> > monitors.
>
> That's quite impressive.
>
> > Additionally, I think a filesystem abstraction object should be in
> > place, where GFiles are canonicalized so every comparison afterwards can
> > be performed through == and !=, and directories (and related data,
> > mtime, URN...) are cached for a longer term, while regular files are
>
> This would indeed be nice. The comparison right now does feel clunky and
> we've had bugs in the past about 2 GFile objects being equal with
> g_file_equal() but the pointers are different. Would be nice to simplify
> things a bit.
>
> > more short-lived. I'd expect a slightly higher memory usage with this,
> > but almost negligible, since we already have GFiles in memory for every
> > monitored directory and every file waiting to be processed/indexed.
> >
> > But this would specially help in non-first indexes, as actual indexing
> > (mostly bound to tracker-extract) outweights these file operations.
>
> Indeed.
>
> > Opinions?
>
> It all sounds very good. Any ideas on time lines for this?

Not fully sure, it could probably be done in 2 weeks or a bit more. I
think TrackerCrawler and TrackerMonitor can be used as is, which saves
quite a lot of work, but there are operations at the miner level that'd
be affected and deserve extra care, especially:

  * mounts/unmounts
  * moving files, overwriting files
  * moving directories
  * moving stuff in and out of inspected directory trees

We should write unit tests for these to ensure correct behavior.

>
> My thoughts:
>
> - It might make sense to split current functionality into more modules
> first to make things easier to refactor in turn, I've been meaning to do
> this for one or two files which are > 5k LoC.

Very much agreed :)

>
> - How does this affect the miner-config branch which has yet to land in
> master?

It's somewhat orthogonal, but touches related code, and could also make
use of the filesystem abstraction, so it could probably be considered a
starting point for the bigger refactor.

>
> - There are some other features I would like to see added which have
> been recently mentioned in a bug from Bastien¹, namely:
>
>    1. Disable indexing removable media by default (how useful is this?,
>       I currently only need it for my music/photos but can specify it
>       directly anyway and it picks up a load of other crap like backups
>       if I just do it blindly so ...)

I think it'd still make sense to be able to whitelist some specific
devices, perhaps even with nautilus integration so it shows an "index
this media?" info bar :)

>
>    2. I wonder if we should be more clever about what we monitor, some
>       ideas I had:
>
>       A. Only monitor locations where files have changed in the last
>          month to avoid wasting monitors and spending so much time
>          setting them up?

Hmm, the downside of that is that you'd only notice changes in an older
directory on the next restart. I'd rather first try to see how fast we
can get at setting up monitors :)

>
>       B. Don't set up monitors for removable media, just crawl them (as
>          we do now anyway) when they're mounted? If data changes
>          frequently on them, users can add specific locations through
>          the config.
>
>       C. Don't add monitors to directories which are obviously code
>          repositories.

I quite agree there, Tracker isn't usually going to be the tool of
choice for code search.

>
>    3. I think we should have some option to force indexing source code
>       directories (this touches on 2C a bit). Bastien mentioned that
>       developers are having issues using their desktop with projects
>       checked out in $HOME somewhere and I will admit, I avoid indexing
>       my source dirs. Perhaps we should do more here.
>
>    4. Detect when the user has been away for n minutes (like Gossip and
>       other IM clients have done for years) and use that to index new
>       content in the background. This might have to be optional given
>       some people will expect content up to date.

Initial indexing being a one-time thing, I'm a bit unsure about this.
Perhaps initial indexing shouldn't be done at full throttle, though, so
it doesn't feel as taxing and is just a bit slower, but there's
certainly no magic throttling number that's good for all.

  Carlos

>
>
> ¹ https://bugzilla.gnome.org/show_bug.cgi?id=659025
>
> Thoughts?
>

_______________________________________________
tracker-list mailing list
http://mail.gnome.org/mailman/listinfo/tracker-list
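
To make the "canonicalized GFiles" idea from the thread concrete, here is a
minimal sketch in Python rather than the project's own C. It is only an
illustration under stated assumptions: FileInterner and FileNode are
hypothetical names, not Tracker or GIO API, and plain path normalization
stands in for whatever canonicalization the real filesystem abstraction
would do. The point it shows is that once every path is mapped to one shared
object, later comparisons can use identity instead of a g_file_equal()-style
deep comparison, and per-directory data (mtime, URN) can hang off that object.

```python
import posixpath


class FileNode:
    """One shared object per canonical path (hypothetical, for illustration)."""

    __slots__ = ("path", "mtime", "urn")

    def __init__(self, path):
        self.path = path
        self.mtime = None  # cached modification time, filled in by a crawler
        self.urn = None    # cached store URN, filled in after insertion


class FileInterner:
    """Maps equivalent path spellings to a single FileNode instance."""

    def __init__(self):
        self._by_path = {}

    def intern(self, path):
        # Normalize so "/a/b/", "/a//b" and "/a/./b" share one key.
        key = posixpath.normpath(path)
        node = self._by_path.get(key)
        if node is None:
            node = FileNode(key)
            self._by_path[key] = node
        return node

    def forget(self, path):
        # Drop a short-lived entry (e.g. a regular file once it is indexed).
        self._by_path.pop(posixpath.normpath(path), None)


fs = FileInterner()
a = fs.intern("/home/user/Music/")
b = fs.intern("/home/user//Music/./")
print(a is b)  # identity comparison is enough after interning
```

In this sketch directories would simply stay in the table for the long term,
while regular files would be dropped with forget() once processed, which
mirrors the longer-lived/shorter-lived split described above.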
