From a colleague:

Greetings,

I/we also discovered the performance problem in hostlist parsing, via a
different route, and have a different approach to a solution. The
malloc/calloc/xmalloc discussion is all well and good, but it misses a
critical point: why is the parser pessimistically allocating 64K nodes
(MAX_RANGES) worth of ranges? And in one code path that is called very
often, when there aren't even brackets that need the ranges at all?!?

Our solution is to dynamically size and resize the ranges buffer as
needed. If huge numbers of ranges are used, those (?rare?) cases will
take an additional penalty, but the standard cases will not. I am
developing a patch that implements this dynamic allocation and
resizing, using the xmalloc()/xfree() paradigm.

Aside: I believe SLURM should have xmallocNZ() and xreallocNZ()
functions that do Not Zero the contents, to be used judiciously when
the memory is about to be modified immediately (e.g., by copy-into or
IO reception), but NOT, for example, by hostlist parsing.

Humbly,
David Linden (HP)

-----Original Message-----
From: Janne Blomqvist [mailto:[email protected]]
Sent: Monday, March 31, 2014 2:50 AM
To: slurm-dev
Subject: [slurm-dev] Re: xmalloc in Slurm

On 2014-03-21 23:19, Daniel M. Weeks wrote:
> I started this email after reading the thread started by Hongjia Cao,
> "performance improvement for hostlist". After a bit of digging and
> writing it grew beyond the issue pointed out in that thread, so I
> thought it best to begin a new one since it may take on a discussion
> of its own. It also includes a good bit of ancillary information
> people may find generally useful. I don't think the solution committed
> for the problem Cao highlighted is correct. My proposed fix is in the
> second-to-last paragraph, although it somewhat depends on
> understanding the other items that lead up to it.
>
> It already looks long, doesn't it? Let's dive in...
> First, the timing test provided by Cao is bogus but perfectly
> illustrates the bigger problem Cao is trying to fix. (I'm not saying
> Cao is wrong or it was a bad test to run, because there's something to
> be learned and it led to a more careful analysis.)
>
> Second, a bit of background for those who are unfamiliar with
> malloc/memset; skip this paragraph if you know it. malloc takes
> practically no time on Linux with glibc because it doesn't really do
> much. It's not until the memory is actually accessed that it is truly
> allocated by Linux. That's why there is such a discrepancy in the
> timing: xmalloc is actually accessing the memory to zero it. This is
> also why zeroing a bunch of memory unnecessarily can be a big waste of
> time, as Cao found. (A more realistic test would involve accessing a
> portion of the memory between each xmalloc/malloc and xfree/free.)
>
> It's also worth noting at this time that there is a fundamental
> difference between malloc+memset and calloc. For a better explanation
> than I can include in this space, see Dietrich Epp's writeup on Stack
> Overflow[1].
>
> I'll say Cao is mostly right. It is probably better to *not* use
> xmalloc, since it is potentially allocating more memory than necessary
> and zeroing otherwise untouched memory is a waste. However, xmalloc
> does more than just zero the memory; by switching to malloc those
> other features are getting tossed.
>
> The question becomes whether those features of xmalloc are important
> enough to be kept for this case. (I have made some notes about
> additional problems to be corrected.) Let's take a look:
>
>> void *slurm_xmalloc(size_t size, const char *file, int line,
>>                     const char *func)
>> {
>> 	void *new;
>> 	int *p;
>>
>> 	xmalloc_assert(size >= 0 && size <= INT_MAX);
>
> Good for debugging. However, size is of type size_t. The comparison
> should be against SIZE_MAX.
Yes, but if you compare against SIZE_MAX then the assert will never
fire, since size_t is unsigned, so the assert might as well be removed.
Perhaps the idea was that if something tries to allocate more than
INT_MAX, something has gone terribly wrong, as Slurm typically doesn't
allocate very large objects, so you might as well abort at that point?
Then again, if so, why use INT_MAX and not some other magic number?
Furthermore, is xmalloc() the proper place for such checks, and even
more so, are such checks actually useful in general?

>> 	MALLOC_LOCK();
>> 	p = (int *)malloc(size + 2*sizeof(int));
>
> Don't cast malloc. Ever. It hides problems. The 2*sizeof(int) is a
> problem too, as seen a little later once we start assigning to it.
>
>> 	MALLOC_UNLOCK();
>
> *Great* on platforms with non-thread-safe malloc

MALLOC_LOCK and MALLOC_UNLOCK are actually macros; you can see the
definitions at the beginning of xmalloc.c. Note also that the
HAVE_UNSAFE_MALLOC macro that controls the expansion of those macros is
never actually set, so unless you have modified your source tree to set
it, MALLOC_LOCK/UNLOCK are no-ops. That being said, I don't think there
exists any remotely recent system which supports threads but doesn't
provide a thread-safe malloc, so IMHO those macros should be removed.

> and since we care about an OOM condition and want to handle it once.
> Since Slurm is multithreaded we don't want >1 OOM condition
> simultaneously.

Can this actually happen? Isn't the entire process killed immediately
once the OOM killer decides to kill it?

>> 	p[0] = XMALLOC_MAGIC;	/* add "secret" magic cookie */
>> 	p[1] = (int)size;	/* store size in buffer */
>
> Neat. The only problem is size_t and int aren't the same thing and
> this feature would not work as intended on some architectures.

Yeah, it means that xmalloc doesn't work for allocations larger than
INT_MAX, though the xmalloc_assert previously already guards against
this.
A simple fix would be to make "p" a size_t* instead of an int*.

>> 	new = &p[2];
>> 	memset(new, 0, size);
>
> Whoops. We can also get an OOM condition here and we can't catch it,
> thanks to Linux over-committing memory by default.

Yes, but there's not much one can do about it here.

> Now, there is the issue of malloc+memset vs calloc. For instances
> where it's *necessary* to zero the memory (namely ease of string
> manipulation, struct field initialization, or security), calloc
> *could* be used simply because it's more efficient. However, because
> of what xmalloc is trying to accomplish in handling OOM conditions,
> calloc will eliminate the memset call that can't be handled. Not only
> *could* calloc be used for efficiency, it *should* be used so xmalloc
> functions as intended (and more gracefully handles a memory
> allocation error).
>
> Now, all the way back to why this originally came up: I think a more
> appropriate change is to fix xmalloc to use calloc, which should on
> its own give a speedup to cases like the one Cao was seeing, without
> changing its semantics.
>
> If necessary, another malloc wrapper *could* be added that just uses
> malloc, without zeroing the memory. However, the only benefit over the
> calloc-based wrapper would be when a system is under sufficient load
> that the kernel has not had time to collect some all-zero pages and
> there is a slight delay to calloc. I suspect the system load is really
> going to be a bigger slowdown than the small delay in calloc, and
> there is no benefit to a non-zeroing malloc wrapper.

A "proper" fix would IMHO be to (in addition to the int/size_t issues
discussed above):

1) Create a calloc wrapper (xcalloc), and update all current users of
   xmalloc to use that instead.
2) Remove the memset from xmalloc, so that xmalloc properly is a
   malloc wrapper.
3) Audit users of xcalloc, changing back to xmalloc where zeroing
   isn't needed.

--
Janne Blomqvist, D.Sc.
(Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & BECS
+358503841576 || [email protected]
