> Spark-Solr brings additional complexity. You could have too many
> executors for your Solr instance(s), i.e. too high parallelism.

I have reduced the parallelism of the spark-solr part by a factor of 5:
I had 40 executors loading 4 shards; now only 8 executors load those 4
shards. As a result, I see a 10x improvement in update throughput, and I
suspect the update process had been overwhelmed by Spark.

I was able to keep 40 executors for the document preprocessing and
reduce to 8 executors within the same Spark job by using the
"dataframe.coalesce" feature, which does not shuffle the data at all and
keeps both the Spark cluster and Solr quiet in terms of network traffic.
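
For reference, here is a minimal sketch of that setup, assuming the
spark-solr data source; the input path, zkhost, collection name and the
preprocessing step are placeholders:

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().appName("solr-update").getOrCreate()

    // Preprocessing runs at full parallelism (40 executors in my case).
    val docs = spark.read.parquet("/path/to/input")   // placeholder input
      .select("id", "my_field")                       // placeholder preprocessing

    // coalesce(8) narrows to 8 partitions without a shuffle, so at most
    // 8 tasks write to Solr concurrently.
    docs.coalesce(8)
      .write
      .format("solr")
      .option("zkhost", "zk1:2181,zk2:2181/solr")     // placeholder ZK string
      .option("collection", "mycollection")           // placeholder collection
      .mode(SaveMode.Overwrite)
      .save()

Unlike "repartition", "coalesce" only merges existing partitions, which
is why no shuffle traffic hits the network.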

Thanks

On Sat, Oct 19, 2019 at 10:10:36PM +0200, Jörn Franke wrote:
> Maybe you need to give more details. I always recommend trying and
> testing yourself, as you know your own solution best. Depending on your
> Spark process, atomic updates could be faster.
> 
> Spark-Solr brings additional complexity. You could have too many
> executors for your Solr instance(s), i.e. too high parallelism.
> 
> Probably the most important question is:
> What performance does your use case need, and what is your current performance?
> 
> Once this is clear, further architecture aspects can be derived, such
> as the number of Spark executors, the number of Solr instances,
> sharding, replication, commit timing, etc.
> 
> > On 19.10.2019 at 21:52, Nicolas Paris <nicolas.pa...@riseup.net> wrote:
> > 
> > Hi community,
> > 
> > Any advice on speeding up updates?
> > Is there any advice on commits, memory, docValues, stored fields, or
> > any other tips to make things faster?
> > 
> > Thanks
> > 
> > 
> >> On Wed, Oct 16, 2019 at 12:47:47AM +0200, Nicolas Paris wrote:
> >> Hi
> >> 
> >> I am looking for a way to speed up the update of documents.
> >> 
> >> In my context, the update replaces one of the many existing indexed
> >> fields and keeps the others as-is.
> >> 
> >> Right now, I am building the whole document and replacing the
> >> existing one by id.
> >> 
> >> I am wondering if the **atomic update feature** would speed up the process.
> >> 
> >> On the one hand, using this feature would save network traffic,
> >> because only a small subset of the document would be sent from the
> >> client to the server.
> >> On the other hand, the server would have to collect the existing
> >> values from disk and reindex them. In addition, this implies storing
> >> the values for every field (I am not storing every field), which uses
> >> more space.
> >> 
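> >> For concreteness, a minimal atomic-update sketch via SolrJ's
> >> CloudSolrClient; the ZooKeeper hosts, collection, id and field names
> >> are placeholders:
> >>
> >>     import org.apache.solr.client.solrj.impl.CloudSolrClient
> >>     import org.apache.solr.common.SolrInputDocument
> >>     // Scala 2.13; on 2.12 use scala.collection.JavaConverters
> >>     import scala.jdk.CollectionConverters._
> >>
> >>     val client = new CloudSolrClient.Builder(
> >>       List("zk1:2181", "zk2:2181").asJava,   // placeholder ZK hosts
> >>       java.util.Optional.of("/solr")          // placeholder chroot
> >>     ).build()
> >>
> >>     val doc = new SolrInputDocument()
> >>     doc.addField("id", "doc-42")              // placeholder id
> >>     // A java.util.Map value with a "set" modifier replaces only this
> >>     // field; the other fields must be stored (or docValues) so Solr
> >>     // can rebuild the rest of the document server-side.
> >>     doc.addField("my_field", Map("set" -> "new value").asJava)
> >>     client.add("mycollection", doc)
> >>     client.commit("mycollection")
> >>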
> >> I have also read that the ConcurrentUpdateSolrServer class might be
> >> an optimized way of updating documents.
> >> 
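> >> (In current SolrJ it is named ConcurrentUpdateSolrClient.) A minimal
> >> sketch, with the base URL, queue size, thread count and collection as
> >> placeholders:
> >>
> >>     import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient
> >>     import org.apache.solr.common.SolrInputDocument
> >>
> >>     // Queues documents and streams them to Solr from background
> >>     // threads, which helps bulk-update throughput.
> >>     val client = new ConcurrentUpdateSolrClient.Builder(
> >>       "http://solr-host:8983/solr")   // placeholder base URL
> >>       .withQueueSize(10000)
> >>       .withThreadCount(4)
> >>       .build()
> >>
> >>     val doc = new SolrInputDocument()
> >>     doc.addField("id", "doc-42")       // placeholder id
> >>     client.add("mycollection", doc)    // returns quickly; send is async
> >>     client.blockUntilFinished()        // wait for the queue to drain
> >>     client.commit("mycollection")
> >>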
> >> I am using the spark-solr library to deal with SolrCloud. If
> >> something exists to speed up the process, I would be glad to
> >> implement it in that library.
> >> Also, I have split the collection over multiple shards, and I admit
> >> this speeds up the update process, but who knows?
> >> 
> >> Thoughts?
> >> 
> >> -- 
> >> nicolas
> >> 
> > 
> > -- 
> > nicolas
> 

-- 
nicolas
