On Wed, 2014-10-29 at 23:37 +0100, Will Martin wrote:
> This command only touches OS level caches that hold pages destined for (or
> not) the swap cache. Its use means that disk will be hit on future requests,
> but in many instances the pages were headed for ejection anyway.
> 
> It does not have anything whatsoever to do with Solr caches.

If you re-read my post, you will see "the OS had to spend a lot of
resources just bookkeeping memory". OS, not JVM.

> It also is not fragmentation related; it is a result of the kernel
> managing virtual pages in an "as designed manner". The proper command
> is
> 
> #sync; echo 3 >/proc/sys/vm/drop_caches. 

I just talked with a Systems guy to verify what happened when we had
the problem:

- The machine spawned -Xmx1g JVMs running Tika, each instance processing
  a single 100MB ARC file, sending the result to a shared Solr instance
  and shutting down. 40 instances were running at all times, each
  instance living for a little less than 3 minutes.
  Besides taking ~40GB of RAM in total, this also meant that about 10GB 
  of RAM was released and re-requested from the system each minute.
  I don't know how the memory mapping in Solr works with regard to
  re-use of existing allocations, so I can't say whether Solr added to
  that number or not.

- The indexing speed deteriorated after some days, grinding down to
  roughly a quarter of the initial speed (a loose guess).

- Running top showed that the majority of the time was spent in the
  kernel.

- Running "echo 3 >/proc/sys/vm/drop_caches" (I asked Systems explicitly
  about the integer and it was '3') brought the speed back to the
  initial level. The temporary patch was to run it once every hour (a
  sketch of that workaround follows this list).

- Running top with the patch showed that the vast majority of the time
  was spent in user space.

- Systems investigated and determined that "huge pages" were
  automatically requested by processes on the machine, leading to
  (virtual) memory fragmentation on the OS level. They used a tool from
  'sysfsutils' (just relaying what they said here) to change the default
  from huge pages back to normal small pages (see the sketch after this
  list).

- The disabling of huge pages made the problem go away, and we no
  longer use the drop_caches trick.
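
For completeness, the hourly workaround amounted to something like the
following (a sketch; Systems handled the actual scheduling, so the
crontab entry is only an illustration):

  # flush dirty pages, then drop page cache plus dentries and inodes
  sync; echo 3 > /proc/sys/vm/drop_caches

  # root crontab entry to run the above at the top of every hour
  0 * * * * sync; echo 3 > /proc/sys/vm/drop_caches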
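
As for the huge pages switch itself: my understanding is that these
were transparent huge pages (THP), so the change Systems made with
sysfsutils would look roughly like this (a sketch under that
assumption; they did the actual work):

  # show the current setting; the value in brackets is the active one
  cat /sys/kernel/mm/transparent_hugepage/enabled

  # disable THP until the next reboot
  echo never > /sys/kernel/mm/transparent_hugepage/enabled
  echo never > /sys/kernel/mm/transparent_hugepage/defrag

  # persist it with sysfsutils by adding these lines to /etc/sysfs.conf
  kernel/mm/transparent_hugepage/enabled = never
  kernel/mm/transparent_hugepage/defrag = never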

> http://linux.die.net/man/5/proc
> 
> I have encountered resistance on the use of this on long-running processes
> for years ... from people who don't even research the matter.

The resistance is natural: although dropping the caches might work, as
it did for us, it is still treating a symptom. Until the underlying
cause has been isolated and determined to be practically unresolvable,
reaching for drop_caches is a red flag.

Your undetermined core problem might not be the same as ours, but it is
simple to check: watch the kernel time percentage. If it rises over
time, try disabling huge pages.
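
A couple of ways to keep an eye on that (a sketch; any interval will
do):

  # 'sy' is the percentage of CPU time spent in the kernel, sampled
  # every 10 seconds
  vmstat 10

  # or a one-shot summary line from top in batch mode
  top -b -n 1 | grep -i 'cpu(s)'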

- Toke Eskildsen, State and University Library, Denmark

