Hello,

Most of the time users end up coding a multithreaded SolrJ indexer, which I
consider a sad thing. As the contributor of the 3.x fix, I want to share my
view of the problem. While doing that work I realized that the join
operation itself is too hard, perhaps even impossible, to make concurrent.
I propose adding concurrency to the outbound and inbound streams instead.

My plan is:
1. Add threads to the outbound flow:
https://issues.apache.org/jira/browse/SOLR-3585. This lets DIH avoid waiting
for Solr. I mostly like that code, but recently I realized that it
implements the ConcurrentUpdateSolrServer (CUSS) algorithm. Looking forward,
I would prefer to unify some core concurrent code between them, or to
somehow use CUSS inside DIH's SolrWriter.
2. The next problem we faced is SqlEntityProcessor. It has two modes: one of
them gets miserable performance due to the N+1 problem, and the cached
version is not production capable with the default heap cache. Our proposal
for it is https://issues.apache.org/jira/browse/SOLR-4799. Unfortunately I
have no time to polish the patch.
3. After that, the only thing DIH waits for is JDBC. It can easily be sped
up by implementing a DataSource wrapper with a producer thread and a bounded
queue as a buffer.
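
To make item 3 concrete, here is a minimal sketch of the producer thread plus
bounded queue idea in plain Java. It is not taken from any DIH patch:
PrefetchingIterator is a hypothetical name, and it wraps a generic Iterator
rather than DIH's real DataSource API. It assumes the source never yields
null rows and that the consumer reads the stream to the end.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical helper (not part of DIH): prefetches rows on a background thread. */
class PrefetchingIterator<T> implements Iterator<T> {
    // Sentinel the producer enqueues last, after normal completion or a failure.
    private static final Object END = new Object();

    private final BlockingQueue<Object> buffer;
    private volatile Throwable failure; // set by the producer if the source throws
    private Object next;                // null means "nothing fetched yet"

    PrefetchingIterator(final Iterator<T> source, int capacity) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
        Thread producer = new Thread(() -> {
            try {
                while (source.hasNext()) {
                    buffer.put(source.next()); // blocks while the buffer is full
                }
            } catch (Throwable t) {
                failure = t;
            } finally {
                try {
                    buffer.put(END);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "prefetch-producer");
        producer.setDaemon(true); // do not keep the JVM alive if the consumer stops early
        producer.start();
    }

    @Override
    public boolean hasNext() {
        if (next == null) {
            try {
                next = buffer.take(); // blocks until the producer delivers a row or END
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for a row", e);
            }
        }
        if (next == END) {
            if (failure != null) {
                throw new IllegalStateException("producer failed", failure);
            }
            return false;
        }
        return true;
    }

    @Override
    @SuppressWarnings("unchecked")
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T row = (T) next;
        next = null;
        return row;
    }
}
```

The consumer keeps using an ordinary Iterator, so the entity processing code
would not need to change; the queue capacity bounds how far the reader may
run ahead of the indexing side.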

If we complete this plan, we will never need to code SolrJ indexers.

My particular question to you is: what do you need to speed up?

On Thu, Jun 13, 2013 at 11:01 PM, Shawn Heisey <s...@elyograg.org> wrote:

> On 6/13/2013 12:08 PM, bbarani wrote:
>
>> I see that the threads parameter has been removed from DIH from all
>> version
>> starting SOLR 4.x. Can someone let me know the best way to initiate
>> indexing
>> in multi threaded mode when using DIH now? Is there a way to do that?
>>
>
> That parameter was removed because it didn't work right, and there was no
> apparent way to fix it.  The change that went into a later 3.6 version was
> a bandaid, not a fix.  I don't know all the details.
>
> There's no way to get multithreading with DIH directly, but you can do it
> indirectly:
>
> Create multiple request handlers with different names, such as
> /dataimport1, /dataimport2, etc.  Configure each handler with settings that
> will pull part of your data source.  Start them so they run concurrently.
>
> Depending on your environment, it may be easier to just write a
> multi-threaded indexing application using the Solr API for your language of
> choice.
>
> Thanks,
> Shawn
>
>


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


 <mkhlud...@griddynamics.com>
