Queries regarding a "ParallelDataImportHandler"

Avlesh Singh Sun, 02 Aug 2009 07:03:34 -0700

In my quest to improve indexing time (in a multi-core environment), I tried
writing a Solr RequestHandler called ParallelDataImportHandler.
I had a few lame questions to begin with, which Noble and Shalin answered
here -
http://www.lucidimagination.com/search/document/22b7371c063fdb06/using_dih_for_parallel_indexing


As the name suggests, the handler, when invoked, tries to execute multiple
DIH instances on the same core in parallel. Of course, the catch is that
only data sources that can be batched can benefit from this handler. In my
case, I am writing this for imports from a MySQL database. So I have a
single data-config.xml in which the query has placeholders for "limit" and
"offset". Each DIH instance uses the same data-config file and substitutes
its own values for the limit and offset (which are supplied by the parent
ParallelDataImportHandler).
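
The batching arithmetic described above can be sketched roughly as follows. This is only an illustration, not the handler's actual code: BatchPlanner, Batch, and the "${limit}" / "${offset}" placeholder spellings are hypothetical names chosen for the example, and the real data-config.xml may spell its placeholders differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits totalRows rows into (offset, limit) batches. */
class BatchPlanner {

    /** One batch: the values a single DIH instance would substitute into the query. */
    static final class Batch {
        final long offset;
        final long limit;

        Batch(long offset, long limit) {
            this.offset = offset;
            this.limit = limit;
        }
    }

    static List<Batch> plan(long totalRows, long batchSize) {
        if (totalRows < 0 || batchSize <= 0) {
            throw new IllegalArgumentException("totalRows must be >= 0 and batchSize > 0");
        }
        List<Batch> batches = new ArrayList<>();
        for (long offset = 0; offset < totalRows; offset += batchSize) {
            // The last batch may be shorter than batchSize.
            batches.add(new Batch(offset, Math.min(batchSize, totalRows - offset)));
        }
        return batches;
    }

    /** Fills a query template; the placeholder names are assumptions for this sketch. */
    static String toSql(String template, Batch b) {
        return template.replace("${limit}", Long.toString(b.limit))
                       .replace("${offset}", Long.toString(b.offset));
    }
}
```

For instance, plan(10, 4) yields the batches (0, 4), (4, 4) and (8, 2). Note that LIMIT/OFFSET paging only partitions the table cleanly if the query has a deterministic ORDER BY; otherwise batches may overlap or skip rows.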

I am achieving this by making my handler SolrCoreAware and creating
maxNumberOfDIHInstances (configurable) DIH instances in the inform method.
These instances are then initialized and registered with the core. Whenever
a request comes in, the ParallelDataImportHandler delegates the task to
these instances, schedules the remainder, and aggregates the responses from
each of them to return to the user.
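
The delegate / schedule-the-remainder / aggregate flow can be illustrated without any Solr classes. The sketch below is an assumption-laden stand-in, not the handler itself: ParallelDispatcher and runAll are hypothetical names, the importer argument stands in for a DIH instance (it maps an offset and a limit to the number of documents indexed), and a fixed thread pool plays the role of the maxNumberOfDIHInstances workers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.LongBinaryOperator;

/** Hypothetical dispatcher: at most maxInstances batches run at the same time. */
class ParallelDispatcher {

    /** Runs one import per batch and returns the sum of the per-batch document counts. */
    static long runAll(long totalRows, long batchSize, int maxInstances,
                       LongBinaryOperator importer) {
        if (batchSize <= 0 || maxInstances <= 0) {
            throw new IllegalArgumentException("batchSize and maxInstances must be > 0");
        }
        ExecutorService pool = Executors.newFixedThreadPool(maxInstances);
        try {
            List<Future<Long>> pending = new ArrayList<>();
            for (long offset = 0; offset < totalRows; offset += batchSize) {
                final long off = offset;
                final long lim = Math.min(batchSize, totalRows - offset);
                // Batches beyond maxInstances wait in the pool's queue (the "remainder").
                Callable<Long> task = () -> importer.applyAsLong(off, lim);
                pending.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> f : pending) {
                total += f.get(); // blocks until that batch has finished
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for imports", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("an import batch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With a fake importer such as (offset, limit) -> limit, runAll(10, 4, 2, importer) processes three batches on two threads and returns 10. A real DIH instance runs its import asynchronously, so wrapping it in a blocking call like this is an assumption of the sketch.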

Thankfully, all of this worked, and preliminary benchmarking with 5 million
records indicated a 50% decrease in re-indexing time. Moreover, all my cores
(Solr in my case is hosted on a quad-core machine) showed above 70% CPU
utilization. All that I could have asked for!

With respect to this whole thing, I have a few questions -

   1. Is something similar available out of the box?
   2. Is the idea flawed? Is the approach fundamentally correct?
   3. I am using Solr 1.3. DIH did not have "EventListeners" in the stone
   age. I need to know when a DIH instance is done with its task (mostly the
   "commit" operation). I could not figure out a clean way. As a hack, I keep
   pinging the DIH instances with command=status at regular intervals (in a
   separate thread) to figure out whether each is free to be assigned a task.
   This works, but obviously with the overhead of unnecessary wasted CPU
   cycles. Is there a better approach?
   4. I could improve the time taken even further if there were a way for me
   to tell a DIH instance not to open a new IndexSearcher. In the current
   scheme of things, as soon as one DIH instance is done committing, a new
   searcher is opened. This blocks the other DIH instances (which were
   active), and they cannot continue until the searcher is initialized. Is
   there a way I can issue a single commit once all these DIH instances are
   done with their tasks? I tried running each DIH instance with commit=false,
   without luck.
   5. Can this implementation be extended to support other data-sources
   supported in DIH (HTTP, File, URL etc)?
   6. If the utility is worth it, can I host this on Google code as an open
   source contrib?
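
To make questions 3 and 4 concrete, here is a rough sketch of the polling hack combined with the single commit I would like to issue at the end. It is illustrative only: IdlePoller and awaitAllThenCommit are hypothetical names, each BooleanSupplier stands in for a command=status check against one DIH instance, and commitOnce stands in for whatever would trigger the one final commit. It does not show how to stop a DIH instance from committing on its own.

```java
import java.util.List;
import java.util.function.BooleanSupplier;

/** Hypothetical poller: waits until no worker reports busy, then commits once. */
class IdlePoller {

    /**
     * Checks the workers in rounds, sleeping pollMillis between rounds so the
     * polling thread does not spin. Runs commitOnce exactly once after the first
     * round in which no worker is busy. Returns the number of rounds performed.
     */
    static int awaitAllThenCommit(List<BooleanSupplier> busyChecks, long pollMillis,
                                  Runnable commitOnce) {
        int rounds = 0;
        while (true) {
            rounds++;
            boolean anyBusy = false;
            for (BooleanSupplier isBusy : busyChecks) {
                if (isBusy.getAsBoolean()) {
                    anyBusy = true;
                    break; // one busy worker is enough to keep waiting
                }
            }
            if (!anyBusy) {
                break;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while polling", e);
            }
        }
        commitOnce.run(); // the single commit, issued after every worker is idle
        return rounds;
    }
}
```

A longer pollMillis wastes fewer cycles but delays noticing that a worker has finished, which is exactly the trade-off a completion callback would avoid.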

Any help will be deeply acknowledged and appreciated. While suggesting,
please don't forget that I am using Solr 1.3. If it all goes well, I don't
mind writing one for Solr 1.4.

Cheers
Avlesh
