We have ~50 long-running SQL queries whose results need to be joined and
denormalized.  Not all of the queries go against the same database, and some
data comes from fixed-width data feeds.  Our current search engine (which we
are converting to Solr) has a fast disk-caching mechanism that lets you cache
all of these data sources and then joins them locally prior to indexing.

I'm in the process of developing something similar for DIH that uses
Berkeley DB to do the same thing.  It's good enough that I can do nightly
full re-indexes of all our data while developing the front-end, but it is
still very rough.  I would like to eventually get this refined enough to
submit as a JIRA ticket/patch, as it seems this is a somewhat common problem
that needs solving.
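
The core of the caching layer is roughly the following (a minimal sketch,
assuming Berkeley DB Java Edition; the class and method names here are
illustrative only, not from my actual code):

import com.sleepycat.je.Cursor;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Disk-backed cache: every data source dumps its rows in here keyed by
 *  the join key; the indexing pass then joins them locally. */
public class JoinCache {
    private final Environment env;
    private final Database db;

    public JoinCache(File dir) {
        EnvironmentConfig envCfg = new EnvironmentConfig();
        envCfg.setAllowCreate(true);
        env = new Environment(dir, envCfg);

        DatabaseConfig dbCfg = new DatabaseConfig();
        dbCfg.setAllowCreate(true);
        dbCfg.setSortedDuplicates(true); // many rows may share one join key
        db = env.openDatabase(null, "rows", dbCfg);
    }

    /** Called once per row while the source queries stream in. */
    public void put(String joinKey, byte[] row) {
        db.put(null, new DatabaseEntry(joinKey.getBytes()),
                     new DatabaseEntry(row));
    }

    /** Fetches all cached rows for one join key at index time. */
    public List<byte[]> get(String joinKey) {
        List<byte[]> rows = new ArrayList<byte[]>();
        DatabaseEntry key = new DatabaseEntry(joinKey.getBytes());
        DatabaseEntry data = new DatabaseEntry();
        Cursor c = db.openCursor(null, null);
        try {
            OperationStatus s = c.getSearchKey(key, data, LockMode.DEFAULT);
            while (s == OperationStatus.SUCCESS) {
                rows.add(data.getData());
                s = c.getNextDup(key, data, LockMode.DEFAULT);
            }
        } finally {
            c.close();
        }
        return rows;
    }

    public void close() {
        db.close();
        env.close();
    }
}

Each source writes into the cache as its query streams; the indexer then
walks the primary source and pulls the matching rows from the others, so the
join never touches the source databases.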

Even with our current search engine, the join & denormalize step is always the
longest-running part of the process.  However, I have it running fairly fast by
partitioning the data on a modulus of the primary key and then running several
jobs in parallel.  The trick is not to get I/O-bound.  Things run fast if you
can set it up to maximize CPU utilization.
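
The partitioning itself is nothing fancy.  Roughly (a sketch with made-up
table names, JDBC URL, and partition count):

import java.sql.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PartitionedLoad {
    static final int PARTITIONS = 8; // tune toward core count

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(PARTITIONS);
        for (int p = 0; p < PARTITIONS; p++) {
            final int part = p;
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        // hypothetical connection details
                        Connection conn = DriverManager.getConnection(
                            "jdbc:...", "user", "pass");
                        // each job sees a disjoint slice of the key space
                        PreparedStatement ps = conn.prepareStatement(
                            "select * from track t where mod(t.id, ?) = ?");
                        ps.setInt(1, PARTITIONS);
                        ps.setInt(2, part);
                        ResultSet rs = ps.executeQuery();
                        while (rs.next()) {
                            // cache/denormalize the row here
                        }
                        conn.close();
                    } catch (SQLException e) {
                        throw new RuntimeException(e);
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }
}

Because the slices are disjoint, the jobs never overlap, and you can raise
the partition count until the CPUs, not the disks, are the limit.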

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: Ephraim Ofir [mailto:ephra...@icq.com] 
Sent: Thursday, December 16, 2010 3:04 AM
To: solr-user@lucene.apache.org
Subject: RE: Dataimport performance

Check out 
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201008.mbox/%3c9f8b39cb3b7c6d4594293ea29ccf438b01702...@icq-mail.icq.il.office.aol.com%3e
This approach of not using sub-entities really improved our load time.
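
The gist, applied to the track/artist config quoted below, is to fold the
child rows into the parent query and split them in a transformer, so there is
one query total instead of one per track.  A rough sketch (this assumes
MySQL's GROUP_CONCAT and DIH's RegexTransformer; it is an illustration, not
necessarily exactly what the linked post does):

<entity name="track"
        query="select t.id as id, t.title as title, l.title as label,
                      group_concat(a.name separator '|') as artists
               from track t
               left join label l on (l.id = t.label_id)
               left join track_artist ta on (ta.track_id = t.id)
               left join artist a on (a.id = ta.artist_id)
               where t.deleted = 0
               group by t.id, t.title, l.title"
        transformer="RegexTransformer">
  <field column="title" name="title_t" />
  <field column="label" name="label_t" />
  <!-- split the concatenated artist names into a multi-valued field -->
  <field column="artists_t" splitBy="\|" sourceColName="artists" />
</entity>

That turns the N+1 queries into a single one, which is where the win comes
from.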

Ephraim Ofir

-----Original Message-----
From: Robert Gründler [mailto:rob...@dubture.com] 
Sent: Wednesday, December 15, 2010 4:49 PM
To: solr-user@lucene.apache.org
Subject: Re: Dataimport performance

I've benchmarked the import already with 500k records, once without the
artists subquery, and once without the join in the main query:


Without subquery: 500k in 3 min 30 sec

Without join and without subquery: 500k in 2 min 30 sec

With subquery and with left join: 320k in 6 min 30 sec


So the joins/subqueries are definitely a bottleneck.

How exactly did you implement the custom data import? 

In our case, we need to denormalize the relations of the SQL data for the
index, so I fear I can't really get rid of the join/subquery.


-robert





On Dec 15, 2010, at 15:43, Tim Heckman wrote:

> 2010/12/15 Robert Gründler <rob...@dubture.com>:
>> The data-config.xml looks like this (only 1 entity):
>> 
>>   <entity name="track"
>>           query="select t.id as id, t.title as title, l.title as label
>>                  from track t left join label l on (l.id = t.label_id)
>>                  where t.deleted = 0"
>>           transformer="TemplateTransformer">
>>     <field column="title" name="title_t" />
>>     <field column="label" name="label_t" />
>>     <field column="id" name="sf_meta_id" />
>>     <field column="metaclass" template="Track" name="sf_meta_class"/>
>>     <field column="metaid" template="${track.id}" name="sf_meta_id"/>
>>     <field column="uniqueid" template="Track_${track.id}" name="sf_unique_id"/>
>> 
>>     <entity name="artists"
>>             query="select a.name as artist
>>                    from artist a
>>                    left join track_artist ta on (ta.artist_id = a.id)
>>                    where ta.track_id=${track.id}">
>>       <field column="artist" name="artists_t" />
>>     </entity>
>> 
>>   </entity>
> 
> So there's one track entity with an artist sub-entity. My (admittedly
> rather limited) experience has been that sub-entities, where you have
> to run a separate query for every row in the parent entity, really
> slow down data import. For my own purposes, I wrote a custom data
> import using SolrJ to improve the performance (from 3 hours to 10
> minutes).
> 
> Just as a test, how long does it take if you comment out the artists entity?
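
For reference, a custom SolrJ import like the one Tim describes usually
boils down to one flat, denormalized query plus batched adds.  A minimal
sketch (SolrJ 1.4-era API; the JDBC URL, field names, and batching details
are placeholders and assumptions, not Tim's actual code):

import java.sql.*;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CustomImport {
    public static void main(String[] args) throws Exception {
        SolrServer solr =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        // hypothetical connection details
        Connection conn = DriverManager.getConnection(
            "jdbc:...", "user", "pass");

        // one flat query instead of a sub-entity per row
        Statement st = conn.createStatement();
        ResultSet rs = st.executeQuery(
            "select t.id, t.title, l.title as label from track t " +
            "left join label l on (l.id = t.label_id) where t.deleted = 0");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        while (rs.next()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("sf_meta_id", rs.getLong("id"));
            doc.addField("title_t", rs.getString("title"));
            doc.addField("label_t", rs.getString("label"));
            batch.add(doc);
            if (batch.size() >= 1000) {  // send in chunks, not per document
                solr.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) solr.add(batch);
        solr.commit();
        conn.close();
    }
}

The chunked adds matter: sending one document per request is usually what
makes SolrJ imports slow, not Solr itself.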
