RE: DIH parallel processing

2015-10-15 Thread Davis, Daniel (NIH/NLM) [C]
This is also what I have done, but I agree with the notion of using something 
external to load the data.

-----Original Message-----
From: Dyer, James [mailto:james.d...@ingramcontent.com] 
Sent: Thursday, October 15, 2015 9:24 AM
To: solr-user@lucene.apache.org
Subject: RE: DIH parallel processing

Nabil,

What we do is have multiple DIH request handlers configured in solrconfig.xml.
Then in the SQL query we put something like "where mod(id, ${partition})=0".
Then an external script calls a full import on each request handler at the same
time and monitors the response.  This isn't the most elegant solution, but it
gets around the fact that DIH is single-threaded.

James Dyer
Ingram Content Group




RE: DIH parallel processing

2015-10-15 Thread Dyer, James
Nabil,

What we do is have multiple DIH request handlers configured in solrconfig.xml.
Then in the SQL query we put something like "where mod(id, ${partition})=0".
Then an external script calls a full import on each request handler at the same
time and monitors the response.  This isn't the most elegant solution, but it
gets around the fact that DIH is single-threaded.

James Dyer
Ingram Content Group
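
A minimal sketch of the kind of external script described above, not James's
actual code. It assumes a core at the hypothetical URL
http://localhost:8983/solr/mycore with four DIH handlers registered under the
hypothetical names /dataimport0 to /dataimport3, each already configured with
its own partition. It relies on DIH's command=full-import and command=status
requests, and passes clean=false so that one handler's import does not wipe
what the others have indexed. The fetch parameter exists only so the HTTP call
can be swapped out for testing.

```python
import json
import time
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical core URL and handler names; adjust to match solrconfig.xml.
SOLR_CORE_URL = "http://localhost:8983/solr/mycore"
HANDLERS = ["/dataimport0", "/dataimport1", "/dataimport2", "/dataimport3"]


def http_get_json(url):
    """Fetch a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def handler_url(core_url, handler, command, **params):
    """Build a DIH request URL for the given handler and command."""
    query = urllib.parse.urlencode({"command": command, "wt": "json", **params})
    return f"{core_url}{handler}?{query}"


def run_import(core_url, handler, fetch=http_get_json, poll_seconds=5.0):
    """Start a full import on one handler, then poll until it stops being busy."""
    # clean=false: do not delete the existing index before importing.
    fetch(handler_url(core_url, handler, "full-import", clean="false"))
    while True:
        status = fetch(handler_url(core_url, handler, "status"))
        if status.get("status") != "busy":
            return handler, status.get("statusMessages", {})
        time.sleep(poll_seconds)


def run_all(core_url, handlers, fetch=http_get_json, poll_seconds=5.0):
    """Run every handler's import concurrently and collect the final statuses."""
    with ThreadPoolExecutor(max_workers=len(handlers)) as pool:
        futures = [
            pool.submit(run_import, core_url, h, fetch, poll_seconds)
            for h in handlers
        ]
        return dict(f.result() for f in futures)
```

Calling run_all(SOLR_CORE_URL, HANDLERS) would start all four imports at once
and return each handler's final status messages once none of them is busy.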




Re: DIH parallel processing

2015-10-15 Thread Charlie Hull

On 15/10/2015 09:57, nabil Kouici wrote:

Hi All,
I'm using DIH to index more than 15M records from SQL Server to Solr. This
takes more than 2 hours. A large part of this time is spent fetching data
from the database. I'm thinking about a solution that loads data in parallel
(multiple threads) within the same DIH, with each thread loading a part of
the data.
Do you have any experience with this kind of situation?
Regards,
Nabil.


Hi Nabil,

Although very convenient for database imports, DIH is single-threaded 
and difficult to optimise for performance. There is a batchSize 
parameter that you may try adjusting to see if that helps.


However, we generally avoid the DIH and roll our own indexers using 
Python or Java, reading the database using SQL (easy in either language) 
and then posting directly to Solr. This gives us a lot more flexibility 
in terms of conditioning the data, multi-threading and batching Solr 
updates. There are lots of great examples of high-performance indexing 
code available e.g.:

http://bryanbende.com/development/2014/08/16/indexing-wikipedia-with-apache-solr/

Best

Charlie
--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
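
A rough sketch of what such a hand-rolled indexer might look like in Python,
offered as an illustration rather than Flax's actual code. The table name
(docs), its columns (id, title), the update URL and the modulo-based slicing
are all assumptions; connect is any function returning a DB-API connection
(for SQL Server that could come from a driver such as pyodbc). Each worker
thread reads its own slice of the table and posts documents to Solr's JSON
update handler in batches.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical Solr update endpoint.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"


def post_batch(docs, url=SOLR_UPDATE_URL):
    """POST a list of documents to Solr as a JSON array."""
    req = urllib.request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        resp.read()


def index_partition(connect, partition, num_partitions,
                    send=post_batch, batch_size=1000):
    """Read one slice of the table and send it to Solr in batches."""
    conn = connect()  # one connection per worker thread
    try:
        cur = conn.cursor()
        # Hypothetical table and columns. The values are cast to int before
        # being placed in the SQL text, so no untrusted input reaches the query.
        cur.execute(
            "SELECT id, title FROM docs "
            f"WHERE id % {int(num_partitions)} = {int(partition)}"
        )
        sent = 0
        while True:
            rows = cur.fetchmany(batch_size)
            if not rows:
                return sent
            send([{"id": str(r[0]), "title": r[1]} for r in rows])
            sent += len(rows)
    finally:
        conn.close()


def index_all(connect, num_partitions=4, send=post_batch, batch_size=1000):
    """Index every partition in its own thread; return the total rows sent."""
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        futures = [
            pool.submit(index_partition, connect, p, num_partitions,
                        send, batch_size)
            for p in range(num_partitions)
        ]
        return sum(f.result() for f in futures)
```

The documents only become searchable after a commit, so this would need either
an explicit commit request once index_all returns or autoCommit configured on
the Solr side.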


DIH parallel processing

2015-10-15 Thread nabil Kouici
Hi All,
I'm using DIH to index more than 15M records from SQL Server to Solr. This
takes more than 2 hours. A large part of this time is spent fetching data
from the database. I'm thinking about a solution that loads data in parallel
(multiple threads) within the same DIH, with each thread loading a part of
the data.
Do you have any experience with this kind of situation?
Regards,
Nabil.