Chantal,

You might consider LuSql[1].
It has much better performance than the Solr DIH: it runs 4-10 times
faster on a multicore machine, and it can run in 1/20th of the heap size
Solr needs. It produces a Lucene index.
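
To give a feel for what LuSql does, here is a minimal sketch of the core
loop it automates - streaming JDBC rows straight into a Lucene
IndexWriter (Lucene 2.9 API). The JDBC URL, query, and field names are
placeholders, and this is not LuSql's actual API:

    import java.io.File;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class JdbcToLucene {
        public static void main(String[] args) throws Exception {
            Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/datamart", "user", "password");
            IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("index")),
                new StandardAnalyzer(Version.LUCENE_29),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
            Statement st = con.createStatement();
            ResultSet rs = st.executeQuery(
                "SELECT id, title, body FROM documents");
            while (rs.next()) {
                // One SQL row becomes one Lucene document.
                Document doc = new Document();
                doc.add(new Field("id", rs.getString("id"),
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.add(new Field("title", rs.getString("title"),
                    Field.Store.YES, Field.Index.ANALYZED));
                doc.add(new Field("body", rs.getString("body"),
                    Field.Store.NO, Field.Index.ANALYZED));
                writer.addDocument(doc);
            }
            rs.close();
            st.close();
            writer.optimize();
            writer.close();
            con.close();
        }
    }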

See slides 22-25 in this presentation comparing Solr DIH with LuSql:
 http://code4lib.org/files/glen_newton_LuSql.pdf

[1]http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql

Disclosure: I am the author of LuSql.

Glen Newton
http://zzzoot.blogspot.com/

2009/7/22 Chantal Ackermann <chantal.ackerm...@btelligent.de>:
> Hi all,
>
> this is my first post, as I am new to SOLR (though I have some Lucene
> experience).
>
> I am trying to load data from an existing datamart into SOLR using the
> DataImportHandler, but in my opinion it is too slow due to the special
> structure of the datamart I have to use.
>
> Root Cause:
> This datamart uses a row-based layout (one attribute per row) to present
> its data. It was designed this way so that new attributes can be added to
> a data set without changing the table structure.
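>
> For example, a single data set is stored as several attribute rows
> (table and attribute names invented for illustration):
>
>   data_set_id | attr_name | attr_value
>   ------------+-----------+------------
>   42          | title     | Foo
>   42          | author    | Smith
>   42          | year      | 2008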
>
> Impact:
> To use the DataImportHandler, I have to pivot the data to recreate one
> row per data set. Unfortunately, this results in more queries that are
> also less performant. Moreover, there are sometimes multiple rows for a
> single attribute, which require separate queries - or trickier subselects
> that probably don't speed things up either.
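>
> For illustration, with the invented names from above, the pivot looks
> roughly like this (here as a Java query string; all names are
> hypothetical):
>
>     // Collapse attribute rows back into one row per data set.
>     String pivotSql =
>         "SELECT ds.id, "
>       + "  MAX(CASE WHEN a.attr_name = 'title'  THEN a.attr_value END) AS title, "
>       + "  MAX(CASE WHEN a.attr_name = 'author' THEN a.attr_value END) AS author "
>       + "FROM data_set ds JOIN attributes a ON a.data_set_id = ds.id "
>       + "GROUP BY ds.id";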
>
> Here is an example of the relation between DB requests, row fetches and
> actual number of documents created:
>
> <lst name="statusMessages">
> <str name="Total Requests made to DataSource">3737</str>
> <str name="Total Rows Fetched">5380</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2009-07-22 18:19:06</str>
> <str name="">
> Indexing completed. Added/Updated: 934 documents. Deleted 0 documents.
> </str>
> <str name="Committed">2009-07-22 18:22:29</str>
> <str name="Optimized">2009-07-22 18:22:29</str>
> <str name="Time taken ">0:3:22.484</str>
> </lst>
>
> (These numbers are from a full index creation.)
> There are about half a million data sets in total, so indexing would take
> about 30h? My feeling is that there are far too many row fetches per
> data set.
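>
> (Checking the arithmetic: 934 documents in 202s is about 4.6 documents
> per second, so 500,000 documents would take roughly 108,000s, i.e. about
> 30h. The stats above also work out to about 4 requests and almost 6 row
> fetches per document.)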
>
> I am testing it on a smaller machine (2GB, Windows :-( ) with Tomcat6
> using around 680MB RAM, and Java6. I haven't changed the Lucene
> configuration (merge factor 10, RAM buffer size 32MB).
>
> Possible solutions?
> A) Write my own DataImportHandler?
> B) Write my own "MultiRowTransformer" that accepts several rows as input
> (not sure this is a valid option - see the sketch after this list)?
> C) Approach the DB developers to add a flat table with one data set per row?
> D) ...?
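>
> Regarding B (the sketch mentioned above): as far as I can tell, DIH hands
> transformRow() one row at a time, so instead of merging several rows in
> one call, each attribute row could be mapped onto a field named after its
> attribute and collected under the parent entity. A rough sketch, with
> hypothetical column names:
>
>     import java.util.Map;
>     import org.apache.solr.handler.dataimport.Context;
>     import org.apache.solr.handler.dataimport.Transformer;
>
>     public class PivotTransformer extends Transformer {
>         @Override
>         public Object transformRow(Map<String, Object> row, Context context) {
>             Object name = row.get("ATTR_NAME");
>             Object value = row.get("ATTR_VALUE");
>             if (name != null && value != null) {
>                 // e.g. {ATTR_NAME=author, ATTR_VALUE=Smith}
>                 // becomes the document field author=Smith
>                 row.put(name.toString().toLowerCase(), value);
>             }
>             return row;
>         }
>     }
>
> The attribute sub-entity would then reference it via
> transformer="PivotTransformer" in data-config.xml.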
>
> If someone would like to share their experiences, that would be great!
>
> Thanks a lot!
> Chantal
>
>
>
> --
> Chantal Ackermann
>


