Re: [Tracker] Tracker daemon/indexer responsibilities

Jamie McCracken Wed, 25 Jun 2008 09:17:55 -0700

On Wed, 2008-06-25 at 16:46 +0100, Martyn Russell wrote:
> Hi all,
> 
> Things are all going well on the indexer-split branch; however, it
> occurred to me that the daemon has its work duplicated in the indexer. We
> need to resolve where the responsibility lies for the daemon and the
> indexer.
> 
> 
> The Modules
> ===========
> 
> First about the modules. So we have these modules, they all share a
> common API. This API includes functions to:
> 
> - Index content
> - Get directories
> - Know if a file or directory should be ignored
> 
> The modules include:
> 
> - applications
> - files
> - gaim-conversations
> - firefox-history
> 
> The idea is, each of these modules knows how to index, locate and ignore
> particular files and directories pertaining to their specific arena
> (i.e. instant messaging, browsing, applications, etc).


that's fine

> 
> 
> The Daemon
> ==========
> 
> So, recently in the daemon, I just finished writing the code to crawl
> the file system and queue ALL files in $HOME or wherever the config
> says we should index files from. The daemon also sets up monitors for
> each directory it finds along the way. This is all done using the new
> GIO functions and works nicely. The files found are then sent in chunks
> to the indexer to process. This includes monitor updates to files.
> 

for non-inotify I take it we will not use GIO? Do we have enough info to
manage things this way?

the daemon should also be able to get/set tags and user/app metadata
that does not come from the index

I think we need a libtrackermetadata to share the metadata stuff between
indexer and daemon


> 
> The Indexer
> ===========
> 
> The indexer process works like a state machine with 3 queues for:
> 
> - Files
> - Directories
> - Modules
> 
> The files queue has the highest priority, individual files are stored
> here, waiting for metadata extraction, etc... files are taken one by one
> in order to be processed, when this queue is empty, a single token from
> the _next_ queue is processed.
> 
> The directories queue is the _next_ queue. Directories are waiting for
> inspection here. When a directory is checked the contained files and
> directories will be prepended in their respective queues. When this
> queue is empty, a single token from the _next_ queue is processed.
> 
> The last queue and again the _next_ queue after the directory queue is
> the modules queue. When all files from the previous module have been
> inspected, the next module then does its part and this continues until
> all modules are finished. At this point the indexer quits. It should be
> noted here, the indexer is an impermanent entity. It only survives to
> process work given to it.


Not quite what I had in mind - the indexer should be dumb and fed stuff
to index by the daemon. The exception is directories, which need to be
recursively scanned (not sure we need separate queues for them)
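
To make the ordering in the quoted description concrete, here is a rough
Python sketch of the three queues as I read them. This is not Tracker
code: IndexerQueues, TREE, ROOTS and run are made-up names, a dict stands
in for the file system, and I am assuming a module contributes its root
directories when its turn comes.

```python
from collections import deque


class IndexerQueues:
    """Toy model of the files > directories > modules priority order."""

    def __init__(self, modules):
        self.files = deque()
        self.directories = deque()
        self.modules = deque(modules)

    def step(self, list_dir, module_roots):
        """Process one token; return (kind, name), or None when all is done."""
        if self.files:
            # Highest priority: files waiting for metadata extraction.
            return ("file", self.files.popleft())
        if self.directories:
            directory = self.directories.popleft()
            sub_dirs, sub_files = list_dir(directory)
            # Contents are prepended to their respective queues.
            self.files.extendleft(reversed(sub_files))
            self.directories.extendleft(reversed(sub_dirs))
            return ("directory", directory)
        if self.modules:
            module = self.modules.popleft()
            # Assumption: a module hands over the directories it cares about.
            self.directories.extend(module_roots(module))
            return ("module", module)
        return None  # nothing left, so the indexer would quit here


# Hypothetical data standing in for the real file system and modules.
TREE = {
    "/home": (["/home/docs"], ["/home/a.txt"]),
    "/home/docs": ([], ["/home/docs/b.txt"]),
    "/apps": ([], ["/apps/x.desktop"]),
}
ROOTS = {"files": ["/home"], "applications": ["/apps"]}


def run(queues):
    """Drive the state machine to completion and record the order."""
    order = []
    while True:
        token = queues.step(lambda d: TREE[d], lambda m: ROOTS[m])
        if token is None:
            return order
        order.append(token)
```

Calling run(IndexerQueues(["files", "applications"])) walks everything
under the first module before the second module gets a turn. If the
daemon does the crawling instead, only the files queue would need to
stay in the indexer.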



> 
> 
> The Problem
> ===========
> 
> The question is, should the daemon do some of this work? The issue here
> for the daemon is that what it does is highly specific to "files" only.
> It doesn't know anything about instant messaging files, locations, what
> should be ignored, what should be monitored, etc.
> 
> When running the indexer right now, it sits at about 25%->33% in the
> background indexing files (on my laptop). On my desktop, it can index my
> 140k files in about 130 seconds using no throttling and the system is
> very usable during this time (and we haven't optimised anything yet
> either). The daemon, however, does absolutely nothing after the initial
> 10-15 seconds (which is how long it takes to set up 6500 monitors and
> get all 140k files in my home directory, 30k of which have been ignored
> as being unsuitable). So the statistics look good, but the daemon can do
> more and should be doing things like monitoring the desktop file
> directory so we know when applications are added, removed or updated.


absolutely - all watching should be done by daemon

the indexer should be told which service is being indexed when it's passed
a URL. The daemon will know this as it keeps track of which directories
belong to which service
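
As a rough illustration of that bookkeeping, here is a minimal Python
sketch of a directory-to-service lookup. Nothing here is existing
Tracker code: ServiceMap, the service names and the paths are all
invented for the example, and I am assuming the deepest registered
directory should win.

```python
from pathlib import PurePosixPath


class ServiceMap:
    """Maps watched directories to the service that owns them."""

    def __init__(self):
        self._roots = {}

    def add(self, directory, service):
        self._roots[PurePosixPath(directory)] = service

    def service_for(self, path):
        """Return the service of the deepest registered ancestor, or None."""
        path = PurePosixPath(path)
        # parents runs from the nearest directory up to "/", so the first
        # hit is the most specific registration.
        for candidate in (path, *path.parents):
            if candidate in self._roots:
                return self._roots[candidate]
        return None
```

With "/home/me" registered as "Files" and "/home/me/.gaim/logs" as
"GaimConversations" (both hypothetical), a log file under the second
directory resolves to "GaimConversations", and the daemon could pass
that along with the URL.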

> 
> To do this, we have been thinking about how best to design the
> indexer/daemon work load so it is most efficient.
> 
> 
> The How
> =======
> 
> So after speaking with Carlos some more about this, the basic idea we
> had was to make the indexer JUST index.

that was my plan :)

> 
> To do this means the modules need to be shared. This is so that the
> indexer can get each module to index files the way it knows how to index
> and so the daemon can request locations to monitor and crawl. The idea
> being that the daemon crawls the files and sends all files and
> directories (we currently don't send directories, just files) to the
> indexer. The indexer needs both files and directories to add these to
> the database.
> 
> We can take this one step further. We can even have the daemon check in
> the database before sending files to the indexer to make sure we are not
> generating extra work unnecessarily. This is something we don't do at
> all yet, but is planned.

makes sense
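
A minimal sketch of what such a check could look like, in Python. The
assumptions are mine: the modification time is used as the "has this
changed" test, a plain dict stands in for the database lookup, and
files_needing_index is a made-up name.

```python
import os


def files_needing_index(paths, indexed_mtimes):
    """Return the paths whose on-disk mtime differs from the recorded one.

    indexed_mtimes is a dict of path -> st_mtime_ns, standing in for
    whatever the real database would return (an assumption).
    """
    pending = []
    for path in paths:
        try:
            mtime = os.stat(path).st_mtime_ns
        except FileNotFoundError:
            continue  # vanished between crawl and check, nothing to send
        if indexed_mtimes.get(path) != mtime:
            pending.append(path)  # new or modified since last index
    return pending
```

Only the returned paths would then be sent to the indexer; unchanged
files never leave the daemon.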

> 
> 
> The Conclusion
> ==============
> 
> This work is mostly done right now. It is merely a case of moving the
> architecture around a bit and moving code between processes. But is this
> the right approach, what do you think? Comments welcome!
> 

I think you are on the right track (I have not checked all the source
changes though - just a quick scan)

the main problem between indexer and daemon is which should handle
recursive indexing of folders (particularly during first time index) -
this can be done either way. I don't know which is better

for performance reasons the daemon should not be niced at all, so it's
important it does not consume too much CPU or disk I/O, whilst the
indexer does all the I/O and CPU intensive stuff and will be niced +19
and ioniced as much as possible

handling file moves also needs some discussion about whether the
indexer or the daemon should do this

I'm open to which way you go here - there may also be issues with NFS
(slow performance or broken file locking), so we need to be careful

architecturally you just need a libtrackermetadata for the common 
metadata routines between indexer/daemon (unless these are somewhere else?)



jamie
