Re: When not to use NRTCachingDirectory and what to use instead.
On 4/19/14, 6:51 AM, Ken Krugler kkrugler_li...@transpac.com wrote:

> The code I see seems to be using an FSDirectory, or is there another layer of wrapping going on here?
>
>     return new NRTCachingDirectory(FSDirectory.open(new File(path)), maxMergeSizeMB, maxCachedMB);

I was also curious about this subject. Not enough to test anything, but enough to look at the code too. FSDirectory.open picks one of MMapDirectory, SimpleFSDirectory and NIOFSDirectory in that order of preference based on what it thinks your system will support. There's still the possibility that the added caching functionality slows down bulk index operations, but setting that aside, it does look like NRTCachingDirectoryFactory is almost always the best choice.
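To make the caching question a bit more concrete, here is a small stand-alone sketch of the size gate that, as I read the NRTCachingDirectory Javadoc, decides whether a newly written segment is held in RAM or passed straight to the wrapped directory: the estimated size has to be at most maxMergeSizeMB, and the cache total has to stay within maxCachedMB. This is a simplified model, not Lucene's code. The class CacheGateSketch and its methods are hypothetical names, and the 4 MB / 48 MB limits are only illustrative values.

```java
// Simplified, stand-alone model of a size-gated write cache.
// NOT Lucene code: CacheGateSketch and its methods are hypothetical names
// used only to illustrate the two thresholds discussed in this thread.
public class CacheGateSketch {
    private final long maxMergeSizeBytes;
    private final long maxCachedBytes;
    private long cachedBytes = 0;

    public CacheGateSketch(double maxMergeSizeMB, double maxCachedMB) {
        this.maxMergeSizeBytes = (long) (maxMergeSizeMB * 1024 * 1024);
        this.maxCachedBytes = (long) (maxCachedMB * 1024 * 1024);
    }

    /** True if a new segment of the estimated size would be held in RAM. */
    public boolean wouldCache(long estimatedBytes) {
        return estimatedBytes <= maxMergeSizeBytes
            && estimatedBytes + cachedBytes <= maxCachedBytes;
    }

    /** Record a write; returns true if it went to the cache. */
    public boolean write(long estimatedBytes) {
        boolean cached = wouldCache(estimatedBytes);
        if (cached) {
            cachedBytes += estimatedBytes;
        }
        return cached;
    }

    public long cachedBytes() {
        return cachedBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        CacheGateSketch gate = new CacheGateSketch(4.0, 48.0); // illustrative limits

        // A tiny segment, typical of frequent NRT reopens.
        System.out.println("1 MB flush cached?   " + gate.write(1 * mb));
        // A large segment, typical of batch indexing with a big RAM buffer.
        System.out.println("512 MB flush cached? " + gate.write(512 * mb));
    }
}
```

Under this model, the large flushes and merges of a batch indexing run never qualify for the cache and go directly to the delegate directory, so the wrapper would mostly add a size check per new file. Whether that overhead is measurable in practice is exactly what nobody in this thread has tested.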
Re: When not to use NRTCachingDirectory and what to use instead.
Hi Ken,

Given the comments, which seemed to describe using NRT for the opposite of our use case, I just set our Solr 4 to use the solr.MMapDirectoryFactory. Didn't bother to test whether NRT would be better for our use case, mostly because it didn't sound like there was an advantage and I've been focused on other things relating to Solr 4.

I'd love to hear any results from someone who is testing for a batch indexing use case and has tested various xxxDirectoryFactory implementations. Please let me know your results if you do end up doing some testing.

Tom

On Sat, Apr 19, 2014 at 9:51 AM, Ken Krugler kkrugler_li...@transpac.com wrote:

> Tom - did you ever get any useful results from testing here?
>
> I'm also curious about the impact of various xxxDirectoryFactory implementations for batch indexing.
>
> Thanks,
>
> -- Ken
>
> --
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
Re: When not to use NRTCachingDirectory and what to use instead.
On Jul 10, 2013, at 9:16am, Shawn Heisey s...@elyograg.org wrote:

> On 7/10/2013 9:59 AM, Tom Burton-West wrote:
>
>> The Javadoc for NRTCachingDirectory (http://lucene.apache.org/core/4_3_1/core/org/apache/lucene/store/NRTCachingDirectory.html?is-external=true) says:
>>
>> "This class is likely only useful in a near real-time context, where indexing rate is lowish but reopen rate is highish, resulting in many tiny files being written..."
>>
>> It seems like we have exactly the opposite use case, so we would like advice on what directory implementation to use instead.
>>
>> We are doing offline batch indexing, so no searches are being done. So we don't need NRT. We also have a high indexing rate as we are trying to index 3 billion pages as quickly as possible.
>>
>> I am not clear what determines the reopen rate. Is it only related to searching or is it involved in indexing as well?
>>
>> Does the NRTCachingDirectory have any benefit for indexing under the use case noted above?
>>
>> I'm guessing we should just use the solr.StandardDirectoryFactory instead. Is this correct?
>
> The NRT directory object in Solr uses the MMap implementation as its default delegate.

The code I see seems to be using an FSDirectory, or is there another layer of wrapping going on here?

    return new NRTCachingDirectory(FSDirectory.open(new File(path)), maxMergeSizeMB, maxCachedMB);

> I would use MMapDirectoryFactory (the default for most of the 3.x releases) for testing whether you can get any improvement from moving away from the default. The advantages of memory mapping are not something you'd want to give up.

Tom - did you ever get any useful results from testing here?

I'm also curious about the impact of various xxxDirectoryFactory implementations for batch indexing.

Thanks,

-- Ken

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
Re: When not to use NRTCachingDirectory and what to use instead.
On 7/10/2013 9:59 AM, Tom Burton-West wrote:

> The Javadoc for NRTCachingDirectory (http://lucene.apache.org/core/4_3_1/core/org/apache/lucene/store/NRTCachingDirectory.html?is-external=true) says:
>
> "This class is likely only useful in a near real-time context, where indexing rate is lowish but reopen rate is highish, resulting in many tiny files being written..."
>
> It seems like we have exactly the opposite use case, so we would like advice on what directory implementation to use instead.
>
> We are doing offline batch indexing, so no searches are being done. So we don't need NRT. We also have a high indexing rate as we are trying to index 3 billion pages as quickly as possible.
>
> I am not clear what determines the reopen rate. Is it only related to searching or is it involved in indexing as well?
>
> Does the NRTCachingDirectory have any benefit for indexing under the use case noted above?
>
> I'm guessing we should just use the solr.StandardDirectoryFactory instead. Is this correct?

The NRT directory object in Solr uses the MMap implementation as its default delegate. I would use MMapDirectoryFactory (the default for most of the 3.x releases) for testing whether you can get any improvement from moving away from the default. The advantages of memory mapping are not something you'd want to give up.

Thanks,
Shawn