Hello,

Most of the time users end up coding a multithreaded SolrJ indexer, which I consider a sad thing. As the contributor of the 3.x fix, I want to share my view of the problem. While doing that work I realized that the join operation itself is too hard, perhaps even impossible, to make concurrent. Instead, I propose adding concurrency to the outbound and inbound streams.

My plan is:

1. Add threads to the outbound flow: https://issues.apache.org/jira/browse/SOLR-3585
   This lets DIH avoid waiting for Solr. I mostly like that code, but I recently realized that it implements the ConcurrentUpdateSolrServer (CUSS) algorithm. Looking forward, I would prefer to unify some core concurrent code between the two, or in effect to use CUSS inside DIH's SolrWriter.

2. The next problem we faced is SqlEntityProcessor. It has two modes: one performs miserably because of the N+1 query problem, and the cached one is not production-capable with the default in-heap cache. Our proposal for it: https://issues.apache.org/jira/browse/SOLR-4799
   Unfortunately, I have had no time to polish the patch.

3. After that, the only thing DIH waits for is JDBC. This can easily be sped up by implementing a DataSource wrapper with a producer thread and a bounded queue as a buffer.

If we complete this plan, we will never need to code SolrJ indexers.

A particular question for you: what do you need to speed up?

On Thu, Jun 13, 2013 at 11:01 PM, Shawn Heisey <s...@elyograg.org> wrote:

> On 6/13/2013 12:08 PM, bbarani wrote:
>
>> I see that the threads parameter has been removed from DIH from all
>> version starting SOLR 4.x. Can someone let me know the best way to
>> initiate indexing in multi threaded mode when using DIH now? Is there
>> a way to do that?
>>
>
> That parameter was removed because it didn't work right, and there was no
> apparent way to fix it. The change that went into a later 3.6 version was
> a bandaid, not a fix. I don't know all the details.
>
> There's no way to get multithreading with DIH directly, but you can do it
> indirectly:
>
> Create multiple request handlers with different names, such as
> /dataimport1, /dataimport2, etc. Configure each handler with settings that
> will pull part of your data source. Start them so they run concurrently.
>
> Depending on your environment, it may be easier to just write a
> multi-threaded indexing application using the Solr API for your language of
> choice.
>
> Thanks,
> Shawn
>

--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics
<mkhlud...@griddynamics.com>
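
P.S. To make step 3 a bit more concrete, here is a rough sketch of the producer-thread-plus-bounded-queue idea. It is not a patch and not existing DIH code: to stay self-contained it wraps a plain java.util.Iterator rather than DIH's DataSource, and the class name PrefetchingIterator and the capacity parameter are made up for illustration. A real wrapper would also need to handle close and abort.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical helper (not part of DIH): prefetches rows from a slow iterator. */
public class PrefetchingIterator<T> implements Iterator<T> {
    private static final Object END = new Object(); // end-of-stream marker
    private final BlockingQueue<Object> queue;      // bounded buffer between threads
    private volatile Throwable failure;             // set if the producer dies
    private Object lookahead;                       // element taken but not yet returned

    public PrefetchingIterator(final Iterator<T> source, int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        Thread producer = new Thread(() -> {
            try {
                while (source.hasNext()) {
                    // put() blocks while the buffer is full, so memory stays bounded.
                    // Note: null rows are not supported by ArrayBlockingQueue.
                    queue.put(source.next());
                }
            } catch (Throwable t) {
                failure = t;
            } finally {
                try {
                    queue.put(END); // always tell the consumer that the stream is over
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "row-prefetcher");
        producer.setDaemon(true); // do not keep the JVM alive if the consumer gives up
        producer.start();
    }

    @Override
    public boolean hasNext() {
        if (lookahead == null) {
            try {
                lookahead = queue.take(); // blocks until the producer delivers something
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for a row", e);
            }
        }
        if (lookahead == END) {
            if (failure != null) {
                throw new IllegalStateException("producer failed", failure);
            }
            return false;
        }
        return true;
    }

    @Override
    @SuppressWarnings("unchecked")
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T row = (T) lookahead;
        lookahead = null;
        return row;
    }
}
```

The consumer side (DIH in my plan) keeps iterating as before, while the source, for example a JDBC result set iterator, is read on the producer thread. The queue capacity limits how far the producer can run ahead.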