Hi Glen, I'd love to use LuSql, but our data is not in a db. It's 6-8TB of files containing OCR (one file per page, for about 1.5 billion pages) gzipped on disk, which are gunzipped, concatenated, and converted to Solr documents on-the-fly. We have multiple instances of our Solr document producer script running. At this point we can run enough producers that the rate at which Solr can ingest and index documents is our current bottleneck, and so far that bottleneck appears to be disk I/O for Solr/Lucene during merges.
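For illustration only, here is a minimal sketch of a producer along these lines, written against the modern SolrJ API. The class name, field names, core URL, and file layout are assumptions, not the actual script; commits are assumed to be left to Solr's autoCommit.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class OcrPageProducer {
    public static void main(String[] args) throws Exception {
        // args[0]: directory holding the gzipped per-page OCR files for one item (assumed layout).
        File itemDir = new File(args[0]);
        File[] pages = itemDir.listFiles((dir, name) -> name.endsWith(".txt.gz"));
        Arrays.sort(pages);   // assume lexicographic file-name order is page order

        // Gunzip each page and concatenate the text in page order.
        StringBuilder fullText = new StringBuilder();
        for (File page : pages) {
            try (BufferedReader r = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(new FileInputStream(page)), "UTF-8"))) {
                String line;
                while ((line = r.readLine()) != null) {
                    fullText.append(line).append('\n');
                }
            }
        }

        // Build one Solr document per item and hand it to Solr; field names and
        // the core URL are made up for this sketch.
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/ocr").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", itemDir.getName());
            doc.addField("ocr_text", fullText.toString());
            solr.add(doc);
        }
    }
}

Running several instances of something like this in parallel against the same Solr URL reproduces the multi-producer setup described above.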
Is there any obvious relationship between the size of the ramBuffer and how much heap you need to give the JVM, or is there some reasonable method of finding this out by experimentation? We would rather not find out by decreasing the amount of memory allocated to the JVM until we get an OOM.

Tom

I've run Lucene with heap sizes as large as 28GB of RAM (on a 32GB machine, 64-bit, Linux) and a ramBufferSize of 3GB. While I haven't noticed the GC issues Mark mentioned with this configuration, I have seen them in the ranges he discusses (on Java 1.6, before update 18).

You may consider using LuSql[1] to create the indexes, if your source content is in a JDBC-accessible db. It is quite a bit faster than Solr, as it is a tool specifically created and tuned for Lucene indexing, but it is command-line, not RESTful like Solr. The released version of LuSql only runs on a single machine (though it is designed for many threads); the new release will allow distributing indexing across any number of machines, with each machine building a shard. The new release also has pluggable sources, so it is not restricted to JDBC.

-Glen

[1] http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
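As a concrete illustration of the figures above, here is a minimal sketch of setting the RAM buffer when driving Lucene directly. It is written against the modern Lucene API rather than the 2.9/3.0 API current at the time, and the 3GB buffer and 28GB heap simply echo the numbers mentioned; they are not a recommendation. In Solr the equivalent knob is the ramBufferSizeMB element in solrconfig.xml, with the heap set via -Xmx on the JVM running Solr.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class RamBufferSketch {
    public static void main(String[] args) throws Exception {
        // Launched with a heap well above the buffer, e.g.: java -Xmx28g RamBufferSketch
        // The heap has to hold the RAM buffer plus analysis state, merge overhead,
        // and GC headroom, so in practice it is set several times larger than
        // ramBufferSizeMB rather than just slightly above it.
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        cfg.setRAMBufferSizeMB(3072.0);   // buffer roughly 3GB of documents before flushing a segment

        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/path/to/index")), cfg)) {
            // add documents here; the buffer flushes to a new segment whenever it fills
        }
    }
}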