Hi list,

I'm working on improving the performance of the Solr scheme for Cascading.

This supports generating a Solr index as the output of a Hadoop job. We use 
SolrJ to write the index locally (via EmbeddedSolrServer).

I've seen mentions of using overwrite=false with the CSV request handler as a 
way of improving performance.

I see that https://issues.apache.org/jira/browse/SOLR-653 removed this support 
from SolrJ, because it was deemed too dangerous for mere mortals.

My question is whether anyone knows how much of a performance boost this 
actually provides.

For Hadoop-based workflows, it's straightforward to ensure that the unique key 
field is really unique. So if the performance gain is significant, I might 
look into some way (with a trigger lock) of re-enabling this support in SolrJ.

Thanks,

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr



