> On 8 Aug 2018, at 12:38, Miguel Zilhão <[email protected]> wrote:
> 
> hi Ian,
>> The memory problems are very likely strongly related to the machine you run 
>> on.  I don't know that we can take much information from a smaller test run 
>> on a different machine. We already see from this run that Carpet is not 
>> "leaking" memory continuously; the curves for allocated memory show what has 
>> been malloced and not freed, and it remains more or less constant after the 
>> initial phase.
>> I think it's worth trying to get tcmalloc running on the cluster.  So this 
>> means that you have never seen the OOM happen when using tcmalloc.  It's 
>> possible that the improved memory allocation in tcmalloc over glibc would 
>> entirely solve the problem.
> 
> well, I did have cases where I ran out of memory on my workstation as well 
> (where I've been doing these tests), with this same configuration and higher 
> resolution, even with tcmalloc. I don't have an OOM killer on the 
> workstation, though, so at some point the system would just start to swap 
> (at which point I'd kill the job).

OK.
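
For reference, once the executable is linked against tcmalloc (or tcmalloc is 
preloaded with LD_PRELOAD), the statistics we have been plotting can be queried 
from inside the process through gperftools' MallocExtension interface.  A 
minimal sketch, with a hypothetical helper and no error handling:

    #include <cstdio>
    #include <gperftools/malloc_extension.h>

    // Print the tcmalloc statistics discussed in this thread.
    void print_tcmalloc_stats()
    {
      static const char* props[] = {
        "generic.current_allocated_bytes",   // malloced and not freed
        "tcmalloc.pageheap_free_bytes",      // free pages still mapped
        "tcmalloc.pageheap_unmapped_bytes",  // pages returned to the OS
      };
      for (const char* prop : props) {
        size_t value = 0;
        if (MallocExtension::instance()->GetNumericProperty(prop, &value))
          std::printf("%-35s %zu bytes\n", prop, value);
      }
    }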

>> Sorry, I made a mistake.  It should have been pageheap_unmapped, not 
>> pageheap_free.  Sorry!   pageheap_free is essentially zero, and cannot 
>> account for the difference.
> 
> Ah, no problem. I'm attaching the updated plot.

Good, that looks better.  So we see that the rss mostly follows the sum of 
allocated and unmapped memory.  One thing I have seen in the past is that a 
high rss is not necessarily an indication of a problem.  Even though the OS 
hasn't unmapped the pages from the process's address space, the memory is free 
if another process (or the current process) needs it.  I suspect that the 
saturation point at iteration ~3000 is the point at which all the processes 
have a lot of unmapped memory, and the OS needs to start actually reclaiming 
it, which stops the rss from growing any further.
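
To illustrate the comparison, on Linux the rss of the current process can be 
read from /proc and set side by side with tcmalloc's numbers.  A rough, 
Linux-specific sketch (the helper names are made up for illustration):

    #include <cstdio>
    #include <unistd.h>
    #include <gperftools/malloc_extension.h>

    // Resident set size of this process in bytes (Linux only).
    static size_t current_rss_bytes()
    {
      long pages = 0, resident = 0;
      std::FILE* f = std::fopen("/proc/self/statm", "r");
      if (f) {
        std::fscanf(f, "%ld %ld", &pages, &resident);
        std::fclose(f);
      }
      return (size_t)resident * (size_t)sysconf(_SC_PAGESIZE);
    }

    // Compare the rss with allocated + unmapped, as in the plots.
    void compare_rss_with_tcmalloc()
    {
      size_t allocated = 0, unmapped = 0;
      MallocExtension::instance()->GetNumericProperty(
          "generic.current_allocated_bytes", &allocated);
      MallocExtension::instance()->GetNumericProperty(
          "tcmalloc.pageheap_unmapped_bytes", &unmapped);
      std::printf("rss = %zu, allocated + unmapped = %zu\n",
                  current_rss_bytes(), allocated + unmapped);
    }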

>>>> The point that Roland made also applies here: we are looking at the max 
>>>> across all processes and assuming that every process is the same.  It's 
>>>> possible that one process has a high unmapped curve, but another has a 
>>>> high rss curve, and we don't see this on the plot.  We would have to do 1D 
>>>> output of the grid arrays and plot each process separately to see the full 
>>>> detail.  One way to see if this is necessary would be to plot both the max 
>>>> and min instead of just the max.  That way, we can see if this is likely 
>>>> to be an issue.
>>> 
>>> OK, I'm attaching another plot with both the min (dashed lines) and the 
>>> max (solid lines) plotted. I hope it helps.
>> Thanks.  This shows that the gridfunction usage is more or less similar 
>> across all processes, which is good.  However, there is significant 
>> variation in most of the other quantities across processes.   To understand 
>> this better, we would have to look at 1D ASCII output of the grid arrays, 
>> which is a bit painful to plot in gnuplot.  Before this, I would definitely 
>> try to get tcmalloc running and outputting this information on the cluster 
>> in a run that actually shows the OOM.  My guess is that you won't get an OOM 
>> with tcmalloc, and all will be fine :)
> 
> OK, I could also try to do this on the cluster once it's back online 
> (it's currently down for maintenance).

OK. I'll be interested to see the results when you have them.  The thing to 
look out for is generic_current_allocated growing.
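
If it helps, that quantity can be sampled on each rank and reduced to a min 
and max across processes, which is essentially what the min/max plots show.  
A sketch assuming MPI and the MallocExtension interface; this is an 
illustration, not the machinery Carpet actually uses:

    #include <cstdio>
    #include <mpi.h>
    #include <gperftools/malloc_extension.h>

    // Report min/max of tcmalloc's live allocation across all ranks.
    void report_allocated(int iteration, MPI_Comm comm)
    {
      size_t bytes = 0;
      MallocExtension::instance()->GetNumericProperty(
          "generic.current_allocated_bytes", &bytes);

      unsigned long long local = bytes, lo = 0, hi = 0;
      MPI_Reduce(&local, &lo, 1, MPI_UNSIGNED_LONG_LONG, MPI_MIN, 0, comm);
      MPI_Reduce(&local, &hi, 1, MPI_UNSIGNED_LONG_LONG, MPI_MAX, 0, comm);

      int rank;
      MPI_Comm_rank(comm, &rank);
      if (rank == 0)
        std::printf("iter %d: allocated min %llu  max %llu\n",
                    iteration, lo, hi);
    }

A max that keeps climbing with iteration here would be the signature of a 
genuine leak rather than allocator behaviour.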

-- 
Ian Hinder
https://ianhinder.net
