Hi list,

I'm working on improving the performance of the Solr scheme for Cascading. This supports generating a Solr index as the output of a Hadoop job. We use SolrJ to write the index locally (via EmbeddedSolrServer).

There are mentions of using overwrite=false with the CSV request handler as a way of improving performance. I see that https://issues.apache.org/jira/browse/SOLR-653 removed this support from SolrJ, because it was deemed too dangerous for mere mortals.

My question is whether anyone knows just how much of a performance boost this really provides.

For Hadoop-based workflows, it's straightforward to ensure that the unique key field is really unique, so if the performance gain is significant, I might look into figuring out some way (with a trigger lock) of re-enabling this support in SolrJ.

Thanks,

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr