> On 8 Aug 2018, at 12:38, Miguel Zilhão <[email protected]> wrote:
>
> hi Ian,
>
>> The memory problems are very likely strongly related to the machine you
>> run on. I don't know that we can take much information from a smaller
>> test run on a different machine. We already see from this run that
>> Carpet is not "leaking" memory continuously; the curves for allocated
>> memory show what has been malloced and not freed, and it remains more
>> or less constant after the initial phase.
>>
>> I think it's worth trying to get tcmalloc running on the cluster. So
>> this means that you have never seen the OOM happen when using tcmalloc.
>> It's possible that the improved memory allocation in tcmalloc over
>> glibc would entirely solve the problem.
>
> well, i did run out of memory on my workstation with tcmalloc as well
> (that's where i've been doing these tests), with this same configuration
> and more resolution. i don't have an OOM killer on the workstation,
> though, so at some point the system would just start to swap (at which
> point i'd kill the job).
OK.

>> Sorry, I made a mistake. It should have been pageheap_unmapped, not
>> pageheap_free. pageheap_free is essentially zero, and cannot account
>> for the difference.
>
> ah, no problem. i'm attaching the updated plot.

Good, that looks better. So we see that the rss mostly follows the sum of
allocated and unmapped memory. One thing I have seen in the past is that a
high rss is not necessarily an indication of a problem. Even though the OS
hasn't unmapped the pages from the process' address space, the memory is
available if another process (or the current process) needs it. I suspect
that the saturation point at iteration ~3000 is the point at which all the
processes have a lot of unmapped memory, and the OS needs to start
actually unmapping it, which stops the rss from growing any further.

>>>> The point that Roland made also applies here: we are looking at the
>>>> max across all processes and assuming that every process is the
>>>> same. It's possible that one process has a high unmapped curve, but
>>>> another has a high rss curve, and we don't see this on the plot. We
>>>> would have to do 1D output of the grid arrays and plot each process
>>>> separately to see the full detail. One way to see if this is
>>>> necessary would be to plot both the max and min instead of just the
>>>> max. That way, we can see if this is likely to be an issue.
>>>
>>> ok, i'm attaching another plot with both the min (dashed lines) and
>>> the max (full lines) plotted. i hope it helps.
>>
>> Thanks. This shows that the gridfunction usage is more or less similar
>> across all processes, which is good. However, there is significant
>> variation in most of the other quantities across processes. To
>> understand this better, we would have to look at 1D ASCII output of
>> the grid arrays, which is a bit painful to plot in gnuplot. Before
>> this, I would definitely try to get tcmalloc running and outputting
>> this information on the cluster in a run that actually shows the OOM.
>> My guess is that you won't get an OOM with tcmalloc, and all will be
>> fine :)
>
> ok, i could also try to do this on the cluster once it's back online
> (currently it's down for maintenance).

OK. I'll be interested to see the results when you have them. The thing to
look out for is generic_current_allocated growing.

-- 
Ian Hinder
https://ianhinder.net
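[A small sketch of the min/max diagnostic discussed above. The data layout is made up for illustration — Carpet's actual ASCII output would first have to be parsed into a per-rank map like this one; the function and variable names are not from any Toolkit thorn.]

```python
# Toy diagnostic: given one memory curve per MPI rank, compute the min and
# max across ranks at each iteration. If min and max track each other, the
# ranks behave similarly; a large spread means per-rank plots are needed.

def min_max_curves(per_rank):
    """Return ([min per iteration], [max per iteration]) across ranks.

    per_rank: {rank: [bytes at iteration 0, 1, ...]}, all lists equal length.
    """
    mins, maxs = [], []
    for values in zip(*per_rank.values()):  # one tuple of rank values per iteration
        mins.append(min(values))
        maxs.append(max(values))
    return mins, maxs

# Example: rss in MiB for three ranks over four iterations (made-up numbers).
rss = {
    0: [100, 150, 200, 200],
    1: [100, 160, 240, 240],
    2: [100, 140, 180, 180],
}
lo, hi = min_max_curves(rss)
print(lo)  # [100, 140, 180, 180]
print(hi)  # [100, 160, 240, 240]
```

Plotting `lo` with dashed lines and `hi` with full lines reproduces the comparison described in the quoted message.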
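[For the "generic_current_allocated growing" check: in gperftools this quantity is the `generic.current_allocated_bytes` property. A hypothetical helper for flagging sustained growth in logged samples is sketched below; the window size and tolerance are arbitrary choices, not anything tcmalloc provides.]

```python
# Heuristic leak check on a series of allocated-bytes samples: flag the run
# only if the last few samples each grow by more than a fractional tolerance,
# so that curves which plateau (like the ones in the attached plots) pass.

def is_steadily_growing(samples, window=3, tol=0.01):
    """Return True if each of the last `window` steps grows by more than
    `tol` (fractional) over its predecessor."""
    if len(samples) < window + 1:
        return False
    tail = samples[-(window + 1):]
    return all(b > a * (1 + tol) for a, b in zip(tail, tail[1:]))

# A curve that saturates after the initial phase is not flagged...
print(is_steadily_growing([100, 120, 121, 121, 121]))  # False
# ...but one that keeps climbing is.
print(is_steadily_growing([100, 120, 140, 160, 180]))  # True
```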
_______________________________________________
Users mailing list
[email protected]
http://lists.einsteintoolkit.org/mailman/listinfo/users
