Hi Glen,

I'd love to use LuSql, but our data is not in a database.  It's 6-8 TB of files
containing OCR (one file per page, about 1.5 billion pages in all), gzipped on
disk, which are gunzipped, concatenated, and converted to Solr documents
on the fly.  We run multiple instances of our Solr document producer script.
At this point we can run enough producers that the rate at which Solr can
ingest and index documents is our current bottleneck, and so far that
bottleneck appears to be disk I/O for Solr/Lucene during segment merges.
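
Roughly, each producer does something like the following SolrJ sketch (the
URL, field names, and one-document-per-volume-directory layout here are
illustrative only, not our actual script):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.GZIPInputStream;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class OcrProducer {
    // Illustrative only: the real Solr URL and field names are different.
    private static final String SOLR_URL = "http://localhost:8983/solr/ocr";

    public static void main(String[] args) throws Exception {
        Path volumeDir = Paths.get(args[0]);  // a directory of gzipped per-page OCR files
        try (SolrClient solr = new HttpSolrClient.Builder(SOLR_URL).build()) {
            // Gunzip each page file and concatenate the text in page order.
            List<Path> pages;
            try (Stream<Path> listing = Files.list(volumeDir)) {
                pages = listing.sorted().collect(Collectors.toList());
            }
            StringBuilder ocr = new StringBuilder();
            for (Path page : pages) {
                try (BufferedReader r = new BufferedReader(new InputStreamReader(
                        new GZIPInputStream(Files.newInputStream(page)), StandardCharsets.UTF_8))) {
                    ocr.append(r.lines().collect(Collectors.joining("\n"))).append('\n');
                }
            }
            // One Solr document per volume; commits are left to Solr's autoCommit
            // so that many concurrent producers don't force extra flushes.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", volumeDir.getFileName().toString());
            doc.addField("ocr_text", ocr.toString());
            solr.add(doc);
        }
    }
}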

Is there any obvious relationship between the size of the ramBuffer and how
much heap you need to give the JVM, or is there some reasonable method of
finding this out by experimentation?
We would rather not find out by decreasing the amount of memory allocated to
the JVM until we get an OOM.
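
As I understand it, Solr's ramBufferSizeMB is just Lucene's IndexWriter RAM
buffer, so the buffer itself has to fit in the heap along with whatever
analysis, flushing, and merging need on top of it. A stand-alone Lucene
sketch of that relationship (current Lucene API, with an illustrative 1 GB
buffer and made-up paths and fields):

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class RamBufferSketch {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());

        // Solr's ramBufferSizeMB sets this same Lucene knob: documents are
        // buffered on the heap until the buffer fills, then flushed as a new
        // segment. So the buffer is a floor on heap use, not the total --
        // analysis, the flush itself, and concurrent merges all need headroom
        // on top of it, which is the part we'd like to size without OOM trials.
        cfg.setRAMBufferSizeMB(1024);  // illustrative: a 1 GB buffer

        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/tmp/ocr-index")), cfg)) {
            Document doc = new Document();
            doc.add(new TextField("ocr_text", "page text here", Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}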

Tom



I've run Lucene with heap sizes as large as 28 GB of RAM (on a 32 GB
machine, 64-bit, Linux) and a ramBufferSize of 3 GB. While I haven't
noticed the GC issues Mark mentioned in this configuration, I have
seen them in the ranges he discusses (on Java 1.6 before update 18).

You may consider using LuSql [1] to create the indexes, if your source
content is in a JDBC-accessible database. It is quite a bit faster than
Solr, as it is a tool created and tuned specifically for Lucene
indexing. But it is command-line, not RESTful like Solr. The released
version of LuSql only runs on a single machine (though it is designed
to use many threads); the new release will allow indexing to be
distributed across any number of machines, with each machine building a
shard. The new release also has pluggable sources, so it is not
restricted to JDBC.

-Glen
[1]http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql

