In my quest to improve indexing time (on a multi-core machine), I tried writing a Solr RequestHandler called ParallelDataImportHandler. I had a few lame questions to begin with, which Noble and Shalin answered here: http://www.lucidimagination.com/search/document/22b7371c063fdb06/using_dih_for_parallel_indexing
As the name suggests, the handler, when invoked, tries to execute multiple DIH instances on the same core in parallel. Of course, the catch is that only data sources that can be batched can benefit from this handler. In my case, I am writing this for import from a MySQL database. So I have a single data-config.xml, in which the query has placeholders for "limit" and "offset". Each DIH instance uses the same data-config file and substitutes its own values for the limit and offset (which are in fact supplied by the parent ParallelDataImportHandler).

I am achieving this by making my handler SolrCoreAware and creating maxNumberOfDIHInstances (configurable) DIH instances in the inform method. These instances are then initialized and registered with the core. Whenever a request comes in, the ParallelDataImportHandler delegates the task to these instances, schedules the remainder, and aggregates the responses from each of these instances to return to the user.

Thankfully, all of this worked, and preliminary benchmarking with 5 million records showed a 50% decrease in re-indexing time. Moreover, all my cores (Solr in my case is hosted on a quad-core machine) showed above 70% CPU utilization. All that I could have asked for!

With respect to this whole thing, I have a few questions:

1. Is something similar available out of the box?

2. Is the idea flawed? Is the approach fundamentally correct?

3. I am using Solr 1.3. DIH did not have "EventListeners" in the stone age. I need to know when a DIH instance is done with its task (mostly the "commit" operation). I could not figure out a clean way. As a hack, I keep pinging the DIH instances with command=status at regular intervals (in a separate thread) to figure out whether an instance is free to be assigned a task. This works, but obviously with an overhead of wasted CPU cycles. Is there a better approach?

4. I could improve the time taken even further if there were a way for me to tell a DIH instance not to open a new IndexSearcher. In the current scheme of things, as soon as one DIH instance is done committing, a new searcher is opened. This blocks the other DIH instances (which were active), and they cannot continue until the searcher has been initialized. Is there a way I can issue a single commit once all these DIH instances are done with their tasks? I tried running each DIH instance with commit=false, without luck.

5. Can this implementation be extended to support the other data sources supported in DIH (HTTP, File, URL etc.)?

6. If the utility is worth it, can I host this on Google Code as an open source contrib?

Any help will be deeply acknowledged and appreciated. While suggesting, please don't forget that I am using Solr 1.3. If it all goes well, I don't mind writing one for Solr 1.4.

Cheers
Avlesh
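
For reference, the batching and delegation described above can be sketched roughly as follows. This is only an illustration in plain Java, not the actual handler: ParallelBatchSketch, Batch and BatchImporter are made-up names and are not Solr classes, and importBatch is a stand-in for "run one DIH instance with this limit and offset". It also assumes that the import call blocks until the batch is finished, which may well not hold for DIH in Solr 1.3.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper (not part of Solr): plans and dispatches limit/offset batches. */
class ParallelBatchSketch {

    /** One slice of the source table: rows [offset, offset + limit). */
    static final class Batch {
        final long offset;
        final long limit;

        Batch(long offset, long limit) {
            this.offset = offset;
            this.limit = limit;
        }
    }

    /** Stand-in for running one import with the given values; returns rows indexed. */
    interface BatchImporter {
        long importBatch(Batch batch) throws Exception;
    }

    /** Splits totalRows into consecutive batches of at most batchSize rows each. */
    static List<Batch> plan(long totalRows, long batchSize) {
        if (totalRows < 0 || batchSize <= 0) {
            throw new IllegalArgumentException("totalRows must be >= 0 and batchSize > 0");
        }
        List<Batch> batches = new ArrayList<Batch>();
        for (long offset = 0; offset < totalRows; offset += batchSize) {
            batches.add(new Batch(offset, Math.min(batchSize, totalRows - offset)));
        }
        return batches;
    }

    /** Runs all batches on at most maxInstances threads and sums the per-batch results. */
    static long runAll(List<Batch> batches, int maxInstances, final BatchImporter importer) {
        ExecutorService pool = Executors.newFixedThreadPool(maxInstances);
        try {
            List<Future<Long>> pending = new ArrayList<Future<Long>>();
            for (final Batch batch : batches) {
                Callable<Long> task = () -> importer.importBatch(batch);
                // Batches beyond maxInstances wait in the pool's queue until a thread frees up.
                pending.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> result : pending) {
                total += result.get(); // blocks until that batch completes; no status polling
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for batches", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a batch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With this sketch, plan(10, 4) yields three batches (offsets 0, 4 and 8, the last with a limit of 2), and runAll(batches, 2, importer) processes them two at a time. A fixed thread pool does the "schedule the remainder" part on its own. Whether it can replace the command=status polling depends entirely on being able to make a single DIH run synchronous.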