On Wed, 2014-10-29 at 23:37 +0100, Will Martin wrote:
> This command only touches OS level caches that hold pages destined for (or
> not) the swap cache. Its use means that disk will be hit on future requests,
> but in many instances the pages were headed for ejection anyway.
>
> It does not have anything whatsoever to do with Solr caches.
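(For reference, since the integer matters further down: proc(5)
documents the drop_caches values as

  echo 1 > /proc/sys/vm/drop_caches  # drop the page cache
  echo 2 > /proc/sys/vm/drop_caches  # drop dentries and inodes
  echo 3 > /proc/sys/vm/drop_caches  # drop both

Dirty pages are not dropped, which is why the command is normally
preceded by a sync.)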
If you re-read my post, you will see "the OS had to spend a lot of
resources just bookkeeping memory". OS, not JVM.

> It also is not fragmentation related; it is a result of the kernel
> managing virtual pages in an "as designed manner". The proper command
> is
>
> #sync; echo 3 >/proc/sys/vm/drop_caches.

I just talked with a Systems guy to verify what happened when we had
the problem:

- The machine spawned Xmx1g JVMs with Tika, each instance processing a
  single 100MB ARC file, sending the result to a shared Solr instance
  and shutting down. 40 instances were running at all times, each
  instance living for a little less than 3 minutes. Besides taking
  ~40GB of RAM in total, this also meant that about 10GB of RAM was
  released and re-requested from the system each minute. I don't know
  how the memory mapping in Solr works with regard to re-use of
  existing allocations, so I can't say whether Solr added to that
  number or not.

- The indexing speed deteriorated after some days, grinding down to
  (loose guess) something like 1/4 of the initial speed.

- Running top showed that the majority of time was spent in the
  kernel.

- Running "echo 3 >/proc/sys/vm/drop_caches" (I asked Systems
  explicitly about the integer and it was '3') brought the speed back
  to the initial level. The temporary patch was to run it once every
  hour.

- Running top with the patch showed that the vast majority of time was
  spent in user space.

- Systems investigated and determined that "huge pages" were
  automatically requested by processes on the machine, leading to
  (virtual) memory fragmentation at the OS level. They used a tool in
  'sysfsutils' (just relaying what they said here) to change the
  default from huge pages to small pages (or whatever the default is
  named).

- The disabling of huge pages made the problem go away and we no
  longer use the drop_caches trick.

> http://linux.die.net/man/5/proc
>
> I have encountered resistance on the use of this on long-running
> processes for years ... from people who don't even research the
> matter.

The resistance is natural: Although running drop_caches might work, as
it did for us, it is still symptom treatment. Until the cause has been
isolated and determined to be practically unresolvable, drop_caches is
a red flag.

Your undetermined core problem might not be the same as ours, but it
is simple to check: Watch the kernel time percentage. If it rises over
time, try disabling huge pages (a rough sketch of the commands is in
the PS below).

- Toke Eskildsen, State and University Library, Denmark
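PS: For anyone who wants to check whether they are hitting the same
thing, something like the following should do. This is a sketch only:
the sysfs path below is the usual one for transparent huge pages, but
it differs on some distributions (older Red Hat kernels, for
instance).

  # Watch kernel time: the 'sy' column is CPU time spent in the
  # kernel. A steady climb under a constant workload is the symptom.
  vmstat 60

  # Check whether transparent huge pages are enabled.
  cat /sys/kernel/mm/transparent_hugepage/enabled

  # Disable them until the next reboot (as root).
  echo never > /sys/kernel/mm/transparent_hugepage/enabled

To make the setting survive reboots, sysfsutils reads /etc/sysfs.conf,
so a line like

  kernel/mm/transparent_hugepage/enabled = never

should do it. I did not perform the change myself, so treat the above
as a starting point rather than a recipe.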