Sure, Noble. I'll do it pretty soon.

Cheers
Avlesh
2009/8/3 Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>

> On Mon, Aug 3, 2009 at 5:02 PM, Avlesh Singh <avl...@gmail.com> wrote:
> > We are generally talking about two things here -
> >
> > 1. Speed up indexing in general by creating separate thread(s) for
> > writing to the index. SOLR-1089 should take care of this.
> > 2. Ability to split the DIH commands into batches that can be
> > executed in parallel threads.
> >
> > My initial proposal was #2.
> > I see #1 as an "internal" optimization in DIH which we should do anyway.
> > With #2 an end user can decide how to batch the process (e.g. in a JDBC
> > datasource, limit and offset parameters can be used by multiple DIH
> > instances), how many parallel threads should be created for writing, etc.
> >
> > I am creating a JIRA issue for #2 and will add a more detailed
> > description with possible options.
> sure. just add the details on the JIRA itself
> >
> > Cheers
> > Avlesh
> >
> > 2009/8/3 Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>
> >
> >> then there is SOLR-1089, which does the writes to Lucene in a new
> >> thread.
> >>
> >> 2009/8/2 Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>:
> >> > On Sun, Aug 2, 2009 at 9:39 PM, Avlesh Singh <avl...@gmail.com> wrote:
> >> >>> There can be a batch command (which) will take in multiple
> >> >>> commands in one http request.
> >> >>
> >> >> You seem to be obsessed with this approach, Noble. SOLR-1093 also
> >> >> echoes the same sentiments :)
> >> >> I personally find this approach a bit restrictive and difficult to
> >> >> adapt to. IMHO, it is better handled as configuration, i.e. the
> >> >> user tells us how the single task can be "batched" (or "sliced",
> >> >> as you call it) while configuring the Parallel (or MultiThreaded)
> >> >> DIH inside solrconfig.
> >> > agreed.
> >> >
> >> > I suggested this as low-hanging fruit because the changes are less
> >> > invasive. I'm open to any other suggestion which you can come up
> >> > with.
> >> >
> >> >>
> >> >> As an example, for non-JDBC data sources where batching might be
> >> >> difficult to achieve in an abstract way, the user might choose to
> >> >> configure different data-config.xml's (for different DIH
> >> >> instances) altogether.
> >> >>
> >> >> Cheers
> >> >> Avlesh
> >> >>
> >> >> 2009/8/2 Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>
> >> >>>
> >> >>> On Sun, Aug 2, 2009 at 8:56 PM, Avlesh Singh <avl...@gmail.com> wrote:
> >> >>> > I have one more question w.r.t. the MultiThreaded DIH - what
> >> >>> > would be the logic behind distributing tasks to threads?
> >> >>> >
> >> >>> > I am sorry to have not mentioned this earlier - in my case, I
> >> >>> > take a "count query" parameter as a configuration element.
> >> >>> > Based on this count and the maxNumberOfDIHInstances, task
> >> >>> > assignment scheduling is done by "injecting" limit and offset
> >> >>> > values into the import query for each DIH instance.
> >> >>> > And this is one of the reasons why I call it a
> >> >>> > ParallelDataImportHandler.
> >> >>> There can be a batch command which will take in multiple commands
> >> >>> in one http request. So it will be like invoking multiple DIH
> >> >>> instances, and the user will have to find ways to split up the
> >> >>> whole task into multiple 'slices'. DIH in turn would fire up
> >> >>> multiple threads, and once all the threads have returned it
> >> >>> should issue a commit.
> >> >>>
> >> >>> this is a very dumb implementation but is a very easy path.
> >> >>> >
> >> >>> > Cheers
> >> >>> > Avlesh
> >> >>> >
> >> >>> > On Sun, Aug 2, 2009 at 8:39 PM, Avlesh Singh <avl...@gmail.com> wrote:
> >> >>> >
> >> >>> >> run the add() calls to Solr in a dedicated thread
> >> >>> >>
> >> >>> >> Makes absolute sense. This would actually mean DIH sits on top
> >> >>> >> of all the add/update operations, making it easier to
> >> >>> >> implement a multi-threaded DIH.
> >> >>> >>
> >> >>> >> I would create a JIRA issue right away.
> >> >>> >> However, I would still love to see responses to my problems
> >> >>> >> due to limitations in 1.3.
> >> >>> >>
> >> >>> >> Cheers
> >> >>> >> Avlesh
> >> >>> >>
> >> >>> >> 2009/8/2 Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>
> >> >>> >>
> >> >>> >>> a multithreaded DIH is in my top priority list. There are
> >> >>> >>> multiple approaches:
> >> >>> >>>
> >> >>> >>> 1) create multiple dataImporter instances in the same DIH
> >> >>> >>> instance, run them in parallel and commit when all of them
> >> >>> >>> are done
> >> >>> >>> 2) run the add() calls to Solr in a dedicated thread
> >> >>> >>> 3) make DIH automatically multithreaded. This is much harder
> >> >>> >>> to implement.
> >> >>> >>>
> >> >>> >>> but #1 and #2 can be implemented with ease. It does not have
> >> >>> >>> to be another implementation called ParallelDataImportHandler.
> >> >>> >>> I believe it can be done in DIH itself.
> >> >>> >>>
> >> >>> >>> you may not need to create a project on Google Code. you can
> >> >>> >>> open a JIRA issue and start posting patches and we can put it
> >> >>> >>> back into Solr.
> >> >>> >>>
> >> >>> >>> On Sun, Aug 2, 2009 at 7:33 PM, Avlesh Singh <avl...@gmail.com> wrote:
> >> >>> >>> > In my quest to improve indexing time (in a multi-core
> >> >>> >>> > environment), I tried writing a Solr RequestHandler called
> >> >>> >>> > ParallelDataImportHandler.
> >> >>> >>> > I had a few lame questions to begin with, which Noble and
> >> >>> >>> > Shalin answered here -
> >> >>> >>> >
> >> >>> >>> > http://www.lucidimagination.com/search/document/22b7371c063fdb06/using_dih_for_parallel_indexing
> >> >>> >>> >
> >> >>> >>> > As the name suggests, the handler, when invoked, tries to
> >> >>> >>> > execute multiple DIH instances on the same core in parallel.
> >> >>> >>> > Of course the catch here is that only those data sources
> >> >>> >>> > that can be batched can benefit from this handler. In my
> >> >>> >>> > case, I am writing this for imports from a MySQL database.
> >> >>> >>> > So, I have a single data-config.xml, in which the query has
> >> >>> >>> > to add placeholders for "limit" and "offset". Each DIH
> >> >>> >>> > instance uses the same data-config file, and replaces its
> >> >>> >>> > own values for the limit and offset (which are in fact
> >> >>> >>> > supplied by the parent ParallelDataImportHandler).
> >> >>> >>> >
> >> >>> >>> > I am achieving this by making my handler SolrCoreAware, and
> >> >>> >>> > creating maxNumberOfDIHInstances (configurable) in the
> >> >>> >>> > inform method. These instances are then initialized and
> >> >>> >>> > registered with the core.
> >> >>> >>> > Whenever a request comes in, the ParallelDataImportHandler
> >> >>> >>> > delegates the task to these instances, schedules the
> >> >>> >>> > remainder, and aggregates the responses from each of these
> >> >>> >>> > instances to return back to the user.
> >> >>> >>> >
> >> >>> >>> > Thankfully, all of this worked, and preliminary benchmarking
> >> >>> >>> > with 5 million records indicated a 50% decrease in
> >> >>> >>> > re-indexing time. Moreover, all my cores (Solr in my case is
> >> >>> >>> > hosted on a quad-core machine) indicated above 70% CPU
> >> >>> >>> > utilization. All that I could have asked for!
> >> >>> >>> >
> >> >>> >>> > With respect to this whole thing, I have a few questions -
> >> >>> >>> >
> >> >>> >>> > 1. Is something similar available out of the box?
> >> >>> >>> > 2. Is the idea flawed? Is the approach fundamentally correct?
> >> >>> >>> > 3. I am using Solr 1.3. DIH did not have "EventListeners" in
> >> >>> >>> > the stone age. I need to know if a DIH instance is done with
> >> >>> >>> > its task (mostly the "commit" operation). I could not figure
> >> >>> >>> > out a clean way. As a hack, I keep pinging the DIH instances
> >> >>> >>> > with command=status at regular intervals (in a separate
> >> >>> >>> > thread) to figure out if one is free to be assigned some
> >> >>> >>> > task. This works, but obviously with the overhead of
> >> >>> >>> > unnecessary wasted CPU cycles. Is there a better approach?
> >> >>> >>> > 4. I can better the time taken even further if there were a
> >> >>> >>> > way for me to tell a DIH instance not to open a new
> >> >>> >>> > IndexSearcher. In the current scheme of things, as soon as
> >> >>> >>> > one DIH instance is done committing, a new searcher is
> >> >>> >>> > opened. This is blocking for the other DIH instances (which
> >> >>> >>> > were active) and they cannot continue without the searcher
> >> >>> >>> > being initialized. Is there a way I can implement a single
> >> >>> >>> > commit once all these DIH instances are done with their
> >> >>> >>> > tasks? I tried each DIH instance with commit=false without
> >> >>> >>> > luck.
> >> >>> >>> > 5. Can this implementation be extended to support the other
> >> >>> >>> > data sources supported in DIH (HTTP, File, URL etc)?
> >> >>> >>> > 6. If the utility is worth it, can I host this on Google
> >> >>> >>> > Code as an open source contrib?
> >> >>> >>> >
> >> >>> >>> > Any help will be deeply acknowledged and appreciated. While
> >> >>> >>> > suggesting, please don't forget that I am using Solr 1.3. If
> >> >>> >>> > it all goes well, I don't mind writing one for Solr 1.4.
> >> >>> >>> >
> >> >>> >>> > Cheers
> >> >>> >>> > Avlesh
> >> >>> >>> >
>
> --
> -----------------------------------------------------
> Noble Paul | Principal Engineer | AOL | http://aol.com
>
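
To make the offset/limit "slicing" discussed in this thread concrete, below is a minimal client-side sketch of the idea: run the configured "count query", split the row range into maxNumberOfDIHInstances slices, and fire one full-import per slice in its own thread with committing disabled. The core URL, the /dataimport0../dataimport3 handler names and the offset/limit request parameters are assumptions made for illustration only; the actual ParallelDataImportHandler does this work inside the handler, and the import query in data-config.xml would have to pick up the injected values (e.g. via ${dataimporter.request.offset} and ${dataimporter.request.limit}, if that substitution is available in the Solr version in use).

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative driver: split one import into offset/limit slices and fire one
// full-import per slice against separately registered DIH instances.
public class SliceImportSketch {

  static final String CORE = "http://localhost:8983/solr/core0"; // assumed core URL
  static final int SLICES = 4;                                   // maxNumberOfDIHInstances

  public static void main(String[] args) throws Exception {
    long total = 5000000L;                          // value returned by the "count query"
    long sliceSize = (total + SLICES - 1) / SLICES; // rows per DIH instance

    ExecutorService pool = Executors.newFixedThreadPool(SLICES);
    for (int i = 0; i < SLICES; i++) {
      long offset = i * sliceSize;
      // commit=false: the goal is a single commit at the very end, not one per slice.
      // clean=false: no slice should wipe out the work done by the others.
      final String url = CORE + "/dataimport" + i
          + "?command=full-import&clean=false&commit=false"
          + "&offset=" + offset + "&limit=" + sliceSize;
      pool.execute(new Runnable() {
        public void run() {
          try {
            get(url); // DIH answers immediately; the import itself runs in the background
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.MINUTES);
  }

  static void get(String url) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    InputStream in = conn.getInputStream();
    while (in.read() != -1) { /* drain the response */ }
    in.close();
  }
}

Note that a DIH full-import request normally returns as soon as the import has been kicked off, so finishing this driver does not mean the indexing is done; completion has to be detected separately, which is what the next sketch covers.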
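
The command=status polling hack from question 3, combined with the single end-of-run commit asked about in question 4, might look roughly like the following. The handler names and the naive string check on the status response are again illustrative assumptions; a real implementation would parse the XML response rather than search it for "busy".

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Illustrative watcher: poll each DIH instance's status until none reports
// being busy, then issue one commit for the whole batch.
public class ImportWatcherSketch {

  static final String CORE = "http://localhost:8983/solr/core0"; // assumed core URL
  static final int SLICES = 4;

  public static void main(String[] args) throws Exception {
    boolean allIdle = false;
    while (!allIdle) {
      allIdle = true;
      for (int i = 0; i < SLICES; i++) {
        String status = get(CORE + "/dataimport" + i + "?command=status");
        // crude check; parse the response properly in real code
        if (status.contains("busy")) {
          allIdle = false;
        }
      }
      if (!allIdle) {
        Thread.sleep(2000); // this polling interval is the "wasted CPU cycles" overhead
      }
    }
    // one commit once every instance is done (the goal in question 4)
    get(CORE + "/update?commit=true");
  }

  static String get(String url) throws Exception {
    BufferedReader r = new BufferedReader(
        new InputStreamReader(new URL(url).openStream(), "UTF-8"));
    StringBuilder sb = new StringBuilder();
    for (String line; (line = r.readLine()) != null; ) {
      sb.append(line).append('\n');
    }
    r.close();
    return sb.toString();
  }
}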
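
Finally, "run the add() calls to Solr in a dedicated thread" (the idea behind SOLR-1089) is essentially a producer/consumer hand-off: the row-building side pushes documents onto a bounded queue while a single writer thread drains it, so fetching/transforming and writing can overlap. The sketch below is not the SOLR-1089 patch, just a generic illustration of the pattern, with a placeholder addToIndex() call standing in for the actual write.

import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative producer/consumer: one thread builds rows, a dedicated thread
// performs the add() calls.
public class DedicatedWriterSketch {

  // shared sentinel used as a poison pill to signal the end of the row stream
  private static final Map<String, Object> EOF = java.util.Collections.emptyMap();

  public static void main(String[] args) throws Exception {
    final BlockingQueue<Map<String, Object>> queue =
        new ArrayBlockingQueue<Map<String, Object>>(1000);

    Thread writer = new Thread(new Runnable() {
      public void run() {
        try {
          Map<String, Object> doc;
          while ((doc = queue.take()) != EOF) {
            addToIndex(doc); // stand-in for the actual add() call
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    });
    writer.start();

    // the "DIH" side: fetch rows from the datasource and hand them off
    for (int i = 0; i < 100; i++) {
      Map<String, Object> row = new java.util.HashMap<String, Object>();
      row.put("id", i);
      queue.put(row);
    }
    queue.put(EOF);   // tell the writer that no more rows are coming
    writer.join();    // a commit would go here, after the queue is drained
  }

  static void addToIndex(Map<String, Object> doc) {
    System.out.println("add " + doc); // placeholder for the real write
  }
}

The bounded queue keeps the producer from racing arbitrarily far ahead of the writer, and the shared empty map acts as the end-of-stream marker, after which a single commit can be issued.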