Sure, Noble. I'll do it pretty soon.

Cheers
Avlesh
2009/8/3 Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>

> On Mon, Aug 3, 2009 at 5:02 PM, Avlesh Singh <avl...@gmail.com> wrote:
> > We are generally talking about two things here -
> >
> > 1. Speed up indexing in general by creating separate thread(s) for
> > writing to the index. SOLR-1089 should take care of this.
> > 2. Ability to split the DIH commands into batches that can be
> > executed in parallel threads.
> >
> > My initial proposal was #2.
> > I see #1 as an "internal" optimization in DIH which we should do anyway.
> > With #2 an end user can decide how to batch the process (e.g. in a JDBC
> > datasource, limit and offset parameters can be used by multiple DIH
> > instances), how many parallel threads should be created for writing, etc.
> >
> > I am creating a JIRA issue for #2 and will add a more detailed
> > description with possible options.
> sure. just add the details on the JIRA itself
> >
> > Cheers
> > Avlesh
> >
> > 2009/8/3 Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>
> >
> >> then there is SOLR-1089, which does the writes to Lucene in a new
> >> thread.
> >>
> >> 2009/8/2 Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>:
> >> > On Sun, Aug 2, 2009 at 9:39 PM, Avlesh Singh <avl...@gmail.com> wrote:
> >> >>> There can be a batch command (which) will take in multiple
> >> >>> commands in one http request.
> >> >>
> >> >> You seem to be obsessed with this approach, Noble. SOLR-1093 also
> >> >> echoes the same sentiments :)
> >> >> I personally find this approach a bit restrictive and difficult to
> >> >> adapt to. IMHO, it is better handled as configuration, i.e. the
> >> >> user tells us how the single task can be "batched" (or "sliced",
> >> >> as you call it) while configuring the Parallel (or MultiThreaded)
> >> >> DIH inside solrconfig.
> >> > agreed.
> >> >
> >> > I suggested this as low-hanging fruit because the changes are less
> >> > invasive. I'm open to any other suggestion which you can come up
> >> > with.
> >> >
> >> >>
> >> >> As an example, for non-JDBC data sources where batching might be
> >> >> difficult to achieve in an abstract way, the user might choose to
> >> >> configure different data-config.xml's (for different DIH
> >> >> instances) altogether.
> >> >>
> >> >> Cheers
> >> >> Avlesh
> >> >>
> >> >> 2009/8/2 Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>
> >> >>>
> >> >>> On Sun, Aug 2, 2009 at 8:56 PM, Avlesh Singh <avl...@gmail.com> wrote:
> >> >>> > I have one more question w.r.t. the MultiThreaded DIH - what
> >> >>> > would be the logic behind distributing tasks to threads?
> >> >>> >
> >> >>> > I am sorry to have not mentioned this earlier - in my case, I
> >> >>> > take a "count query" parameter as a configuration element.
> >> >>> > Based on this count and the maxNumberOfDIHInstances, task
> >> >>> > assignment scheduling is done by "injecting" limit and offset
> >> >>> > values into the import query for each DIH instance.
> >> >>> > And this is one of the reasons why I call it a
> >> >>> > ParallelDataImportHandler.
> >> >>> There can be a batch command which will take in multiple commands
> >> >>> in one http request. So it will be like invoking multiple DIH
> >> >>> instances, and the user will have to find ways to split up the
> >> >>> whole task into multiple 'slices'. DIH in turn would fire up
> >> >>> multiple threads, and once all the threads have returned it
> >> >>> should issue a commit.
> >> >>>
> >> >>> this is a very dumb implementation but is a very easy path.
> >> >>> >
> >> >>> > Cheers
> >> >>> > Avlesh
> >> >>> >
> >> >>> > On Sun, Aug 2, 2009 at 8:39 PM, Avlesh Singh <avl...@gmail.com> wrote:
> >> >>> >
> >> >>> >> run the add() calls to Solr in a dedicated thread
> >> >>> >>
> >> >>> >> Makes absolute sense. This would actually mean DIH sits on top
> >> >>> >> of all the add/update operations, making it easier to
> >> >>> >> implement a multi-threaded DIH.
> >> >>> >>
> >> >>> >> I would create a JIRA issue right away.
> >> >>> >> However, I would still love to see responses to my problems
> >> >>> >> due to limitations in 1.3.
> >> >>> >>
> >> >>> >> Cheers
> >> >>> >> Avlesh
> >> >>> >>
> >> >>> >> 2009/8/2 Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>
> >> >>> >>
> >> >>> >>> a multithreaded DIH is in my top priority list. There are
> >> >>> >>> multiple approaches:
> >> >>> >>>
> >> >>> >>> 1) create multiple dataImporter instances in the same DIH
> >> >>> >>> instance, run them in parallel and commit when all of them
> >> >>> >>> are done
> >> >>> >>> 2) run the add() calls to Solr in a dedicated thread
> >> >>> >>> 3) make DIH automatically multithreaded. This is much harder
> >> >>> >>> to implement.
> >> >>> >>>
> >> >>> >>> but #1 and #2 can be implemented with ease. It does not have
> >> >>> >>> to be another implementation called ParallelDataImportHandler.
> >> >>> >>> I believe it can be done in DIH itself.
> >> >>> >>>
> >> >>> >>> you may not need to create a project on Google Code. you can
> >> >>> >>> open a JIRA issue and start posting patches and we can put it
> >> >>> >>> back into Solr.
> >> >>> >>>
> >> >>> >>> On Sun, Aug 2, 2009 at 7:33 PM, Avlesh Singh <avl...@gmail.com> wrote:
> >> >>> >>> > In my quest to improve indexing time (in a multi-core
> >> >>> >>> > environment), I tried writing a Solr RequestHandler called
> >> >>> >>> > ParallelDataImportHandler.
> >> >>> >>> > I had a few lame questions to begin with, which Noble and
> >> >>> >>> > Shalin answered here -
> >> >>> >>> >
> >> >>> >>> > http://www.lucidimagination.com/search/document/22b7371c063fdb06/using_dih_for_parallel_indexing
> >> >>> >>> >
> >> >>> >>> > As the name suggests, the handler, when invoked, tries to
> >> >>> >>> > execute multiple DIH instances on the same core in parallel.
> >> >>> >>> > Of course the catch here is that only those data sources
> >> >>> >>> > that can be batched can benefit from this handler. In my
> >> >>> >>> > case, I am writing this for imports from a MySQL database.
> >> >>> >>> > So, I have a single data-config.xml, in which the query has
> >> >>> >>> > to add placeholders for "limit" and "offset". Each DIH
> >> >>> >>> > instance uses the same data-config file, and replaces its
> >> >>> >>> > own values for the limit and offset (which are in fact
> >> >>> >>> > supplied by the parent ParallelDataImportHandler).
> >> >>> >>> >
> >> >>> >>> > I am achieving this by making my handler SolrCoreAware, and
> >> >>> >>> > creating maxNumberOfDIHInstances (configurable) in the
> >> >>> >>> > inform method. These instances are then initialized and
> >> >>> >>> > registered with the core.
> >> >>> >>> > Whenever a request comes in, the ParallelDataImportHandler
> >> >>> >>> > delegates the task to these instances, schedules the
> >> >>> >>> > remainder, and aggregates the responses from each of these
> >> >>> >>> > instances to return back to the user.
> >> >>> >>> >
> >> >>> >>> > Thankfully, all of this worked, and preliminary benchmarking
> >> >>> >>> > with 5 million records indicated a 50% decrease in
> >> >>> >>> > re-indexing time. Moreover, all my cores (Solr in my case is
> >> >>> >>> > hosted on a quad-core machine) indicated above 70% CPU
> >> >>> >>> > utilization. All that I could have asked for!
> >> >>> >>> >
> >> >>> >>> > With respect to this whole thing, I have a few questions -
> >> >>> >>> >
> >> >>> >>> > 1. Is something similar available out of the box?
> >> >>> >>> > 2. Is the idea flawed? Is the approach fundamentally correct?
> >> >>> >>> > 3. I am using Solr 1.3. DIH did not have "EventListeners" in
> >> >>> >>> > the stone age. I need to know if a DIH instance is done with
> >> >>> >>> > its task (mostly the "commit" operation). I could not figure
> >> >>> >>> > out a clean way. As a hack, I keep pinging the DIH instances
> >> >>> >>> > with command=status at regular intervals (in a separate
> >> >>> >>> > thread) to figure out if one is free to be assigned some
> >> >>> >>> > task. This works, but obviously with the overhead of
> >> >>> >>> > unnecessary wasted CPU cycles. Is there a better approach?
> >> >>> >>> > 4. I can better the time taken even further if there were a
> >> >>> >>> > way for me to tell a DIH instance not to open a new
> >> >>> >>> > IndexSearcher. In the current scheme of things, as soon as
> >> >>> >>> > one DIH instance is done committing, a new searcher is
> >> >>> >>> > opened. This is blocking for the other DIH instances (which
> >> >>> >>> > were active) and they cannot continue without the searcher
> >> >>> >>> > being initialized. Is there a way I can implement a single
> >> >>> >>> > commit once all these DIH instances are done with their
> >> >>> >>> > tasks? I tried each DIH instance with commit=false without
> >> >>> >>> > luck.
> >> >>> >>> > 5. Can this implementation be extended to support the other
> >> >>> >>> > data sources supported in DIH (HTTP, File, URL etc)?
> >> >>> >>> > 6. If the utility is worth it, can I host this on Google
> >> >>> >>> > Code as an open source contrib?
> >> >>> >>> >
> >> >>> >>> > Any help will be deeply acknowledged and appreciated. While
> >> >>> >>> > suggesting, please don't forget that I am using Solr 1.3. If
> >> >>> >>> > it all goes well, I don't mind writing one for Solr 1.4.
> >> >>> >>> >
> >> >>> >>> > Cheers
> >> >>> >>> > Avlesh
> >> >>> >>> >
>
> --
> -----------------------------------------------------
> Noble Paul | Principal Engineer | AOL | http://aol.com
>
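
To make the offset/limit "slicing" discussed in this thread concrete, below is a minimal client-side sketch of the idea: run the configured "count query", split the row range into maxNumberOfDIHInstances slices, and fire one full-import per slice in its own thread with committing disabled. The core URL, the /dataimport0../dataimport3 handler names and the offset/limit request parameters are assumptions made for illustration only; the actual ParallelDataImportHandler does this work inside the handler, and the import query in data-config.xml would have to pick up the injected values (e.g. via ${dataimporter.request.offset} and ${dataimporter.request.limit}, if that substitution is available in the Solr version in use).

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative driver: split one import into offset/limit slices and fire one
// full-import per slice against separately registered DIH instances.
public class SliceImportSketch {

  static final String CORE = "http://localhost:8983/solr/core0"; // assumed core URL
  static final int SLICES = 4;                                   // maxNumberOfDIHInstances

  public static void main(String[] args) throws Exception {
    long total = 5000000L;                          // value returned by the "count query"
    long sliceSize = (total + SLICES - 1) / SLICES; // rows per DIH instance

    ExecutorService pool = Executors.newFixedThreadPool(SLICES);
    for (int i = 0; i < SLICES; i++) {
      long offset = i * sliceSize;
      // commit=false: the goal is a single commit at the very end, not one per slice.
      // clean=false: no slice should wipe out the work done by the others.
      final String url = CORE + "/dataimport" + i
          + "?command=full-import&clean=false&commit=false"
          + "&offset=" + offset + "&limit=" + sliceSize;
      pool.execute(new Runnable() {
        public void run() {
          try {
            get(url); // DIH answers immediately; the import itself runs in the background
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.MINUTES);
  }

  static void get(String url) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    InputStream in = conn.getInputStream();
    while (in.read() != -1) { /* drain the response */ }
    in.close();
  }
}

Note that a DIH full-import request normally returns as soon as the import has been kicked off, so finishing this driver does not mean the indexing is done; completion has to be detected separately, which is what the next sketch covers.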
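
The command=status polling hack from question 3, combined with the single end-of-run commit asked about in question 4, might look roughly like the following. The handler names and the naive string check on the status response are again illustrative assumptions; a real implementation would parse the XML response rather than search it for "busy".

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Illustrative watcher: poll each DIH instance's status until none reports
// being busy, then issue one commit for the whole batch.
public class ImportWatcherSketch {

  static final String CORE = "http://localhost:8983/solr/core0"; // assumed core URL
  static final int SLICES = 4;

  public static void main(String[] args) throws Exception {
    boolean allIdle = false;
    while (!allIdle) {
      allIdle = true;
      for (int i = 0; i < SLICES; i++) {
        String status = get(CORE + "/dataimport" + i + "?command=status");
        // crude check; parse the response properly in real code
        if (status.contains("busy")) {
          allIdle = false;
        }
      }
      if (!allIdle) {
        Thread.sleep(2000); // this polling interval is the "wasted CPU cycles" overhead
      }
    }
    // one commit once every instance is done (the goal in question 4)
    get(CORE + "/update?commit=true");
  }

  static String get(String url) throws Exception {
    BufferedReader r = new BufferedReader(
        new InputStreamReader(new URL(url).openStream(), "UTF-8"));
    StringBuilder sb = new StringBuilder();
    for (String line; (line = r.readLine()) != null; ) {
      sb.append(line).append('\n');
    }
    r.close();
    return sb.toString();
  }
}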
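
Finally, "run the add() calls to Solr in a dedicated thread" (the idea behind SOLR-1089) is essentially a producer/consumer hand-off: the row-building side pushes documents onto a bounded queue while a single writer thread drains it, so fetching/transforming and writing can overlap. The sketch below is not the SOLR-1089 patch, just a generic illustration of the pattern, with a placeholder addToIndex() call standing in for the actual write.

import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative producer/consumer: one thread builds rows, a dedicated thread
// performs the add() calls.
public class DedicatedWriterSketch {

  // shared sentinel used as a poison pill to signal the end of the row stream
  private static final Map<String, Object> EOF = java.util.Collections.emptyMap();

  public static void main(String[] args) throws Exception {
    final BlockingQueue<Map<String, Object>> queue =
        new ArrayBlockingQueue<Map<String, Object>>(1000);

    Thread writer = new Thread(new Runnable() {
      public void run() {
        try {
          Map<String, Object> doc;
          while ((doc = queue.take()) != EOF) {
            addToIndex(doc); // stand-in for the actual add() call
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    });
    writer.start();

    // the "DIH" side: fetch rows from the datasource and hand them off
    for (int i = 0; i < 100; i++) {
      Map<String, Object> row = new java.util.HashMap<String, Object>();
      row.put("id", i);
      queue.put(row);
    }
    queue.put(EOF);   // tell the writer that no more rows are coming
    writer.join();    // a commit would go here, after the queue is drained
  }

  static void addToIndex(Map<String, Object> doc) {
    System.out.println("add " + doc); // placeholder for the real write
  }
}

The bounded queue keeps the producer from racing arbitrarily far ahead of the writer, and the shared empty map acts as the end-of-stream marker, after which a single commit can be issued.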