I do remember LuSQL and a discussion regarding the performance implications
of using it compared to the DIH. My only reason to stick with DIH is that we
may have other data sources for document loading in the near term that may
make LuSQL too specific for our needs.

Regarding the bug about writing to the index in a separate thread: while
helpful, it doesn't address my use case, which is as follows:
1) Write a loader application using EmbeddedSolr + SolrJ + DIH (create a
bogus local request with path='/dataimport') so that the DIH code is invoked
2) Instead of using the DirectUpdateHandler2 update handler, write a custom
update handler that takes a Solr document and POSTs it to a remote Solr
server. I could queue documents here and POST in bulk, but those are details.
3) Possibly multi-thread the DIH so that multiple threads can process
different database segments, then construct and POST Solr documents.
  - For example, thread 1 processes IDs 1-100; thread 2, 101-200; thread 3,
201-...
  - If the Solr server is multithreaded in writing to the index, that's
great and helps performance.
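A minimal sketch of the ID-range partitioning in step 3, using only plain
java.util (the class and method names are illustrative, not DIH APIs; each
range would then go to a worker thread that runs its own query and POSTs
the resulting documents):

```java
import java.util.ArrayList;
import java.util.List;

public class RangePartitioner {
    // Split the inclusive ID range [minId, maxId] into at most 'threads'
    // contiguous chunks, one per worker thread.
    public static List<long[]> partition(long minId, long maxId, int threads) {
        List<long[]> ranges = new ArrayList<long[]>();
        long total = maxId - minId + 1;
        long chunk = (total + threads - 1) / threads; // ceiling division
        for (long start = minId; start <= maxId; start += chunk) {
            long end = Math.min(start + chunk - 1, maxId);
            ranges.add(new long[] { start, end });
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Three workers over IDs 1..250 -> 1-84, 85-168, 169-250
        for (long[] r : partition(1, 250, 3)) {
            System.out.println(r[0] + "-" + r[1]);
        }
    }
}
```

Each worker could then run something like "WHERE id BETWEEN start AND end"
against its segment; the ranges never overlap and together cover the whole
ID space.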

#3 is possible depending on performance tests. #1 and #2 I believe I need
because I want my loader separated from the master server, for development,
deployment, and general separation of concerns.
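For what it's worth, steps 1 and 2 would mostly be solrconfig.xml wiring on
the loader's embedded core -- roughly along these lines (the custom update
handler class name is hypothetical):

```xml
<!-- Register DIH so the bogus local request with path='/dataimport'
     resolves to the DataImportHandler -->
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

<!-- Swap in a custom update handler (hypothetical class) in place of
     DirectUpdateHandler2; it would buffer documents and POST them in
     bulk to the remote master -->
<updateHandler class="com.example.RemotePostingUpdateHandler"/>
```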

Thanks
Amit

On Tue, Apr 28, 2009 at 6:03 AM, Glen Newton <glen.new...@gmail.com> wrote:

> Amit,
>
> You might want to take a look at LuSql[1] and see if it may be
> appropriate for the issues you have.
>
> thanks,
>
> Glen
>
> [1]http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
>
> 2009/4/27 Amit Nithian <anith...@gmail.com>:
> > All,
> > I have a few questions regarding the data import handler. We have some
> > pretty gnarly SQL queries to load our indices, and our current loader
> > implementation is extremely fragile. I am looking to migrate over to the
> > DIH; however, I am looking to use SolrJ + EmbeddedSolr + some custom
> > stuff to remotely load the indices so that my index loader and main
> > search engine are separated.
> > Currently, unless I am missing something, the data gathering from the
> > entity and the data processing (i.e. conversion to a Solr document) are
> > done sequentially, and I was looking to make this execute in parallel so
> > that I can have multiple threads processing different parts of the result
> > set and loading documents into Solr. Secondly, I need to create temporary
> > tables to store the results of a few queries and use them later in inner
> > joins; I was wondering how best to go about this.
> >
> > I am thinking to add support in DIH for the following:
> > 1) Temporary tables (maybe call them temporary entities)? --Specific
> > only to SQL, though, unless it can be generalized to other sources.
> > 2) Parallel support
> >  - Including some mechanism to get the number of records (whether it be
> > a count or MAX(custom_id)-MIN(custom_id))
> > 3) Support in DIH or Solr to post documents to a remote index (i.e.
> > create a new UpdateHandler instead of DirectUpdateHandler2).
> >
> > If any of these exist or anyone else is working on this (or you have
> > better suggestions), please let me know.
> >
> > Thanks!
> > Amit
> >
>
>
>
