Thanks for this post. I did learn something from it.
The performance difference is significant when a large block of memory
is allocated but not all of it is actually used by the program.
On Fri, 2014-03-21 at 14:18 -0700, Daniel M. Weeks wrote:
> I started this email after reading the thread started by Hongjia Cao
> "performance improvement for hostlist". After a bit of digging and
> writing it grew beyond the issue pointed out in that thread so I thought
> it best to begin a new one since it may take on a discussion of its own.
> It also includes a good bit of ancillary information people may find
> generally useful. I don't think the solution committed for the problem
> Cao highlighted is correct. My proposed fix is in the second to last
> paragraph although it somewhat depends on understanding the other items
> that lead up to it.
>
> It already looks long, doesn't it? Let's dive in...
>
> First, the timing test provided by Cao is bogus but perfectly
> illustrates the bigger problem Cao is trying to fix. (I'm not saying Cao
> is wrong or it was a bad test to run because there's something to be
> learned and it led to a more careful analysis.)
>
> Second, a bit of background for those that are unfamiliar with
> malloc/memset; skip the paragraph if you know this. malloc takes
> practically no time on Linux with glibc because it doesn't really do
> much. It's not until the memory is actually accessed that it is truly
> allocated by Linux. That's why there is such a discrepancy in the
> timing; xmalloc is actually accessing the memory to zero it. This is
> also why zeroing a bunch of memory unnecessarily can be a big waste of
> time as Cao found. (A more realistic test would involve accessing a
> portion of the memory between each xmalloc/malloc and xfree/free.)
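> To make the lazy-allocation point concrete, here is a minimal sketch of
> what a more realistic benchmark body could look like (alloc_and_touch is
> a hypothetical helper, not Slurm code):

```c
#include <stdlib.h>
#include <string.h>

/* Hedged sketch of the "more realistic" benchmark shape: allocate a large
 * block but touch only a portion of it, as real callers often do. On Linux
 * with glibc, the untouched pages are never faulted in, so plain malloc
 * appears nearly free while xmalloc's full memset pays for every page. */
static void *alloc_and_touch(size_t total, size_t used)
{
    char *buf = malloc(total);   /* fast: pages not committed yet */
    if (!buf)
        return NULL;
    memset(buf, 0xab, used);     /* only these pages actually get faulted in */
    return buf;
}
```

> Timing loops around calls like this, varying the used/total ratio, would
> show malloc's advantage shrinking as more of the block is touched.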
>
> It's also worth noting at this time that there is a fundamental
> difference between malloc+memset and calloc. For a better explanation
> than I can include in this space see Dietrich Epp's writeup on Stack
> Overflow[1].
>
> I'll say Cao is mostly right. It is probably better to *not* use xmalloc
> since it is potentially allocating more memory than necessary and
> zeroing otherwise untouched memory is a waste. However, xmalloc does
> more than just zero the memory; by switching to malloc those other
> features are getting tossed.
>
> The question becomes whether those features of xmalloc are important
> enough to be kept for this case. (I have made some notes about
> additional problems to be corrected.) Let's take a look:
>
> >void *slurm_xmalloc
> >(size_t size, const char *file, int line, const char *func)
> >{
> >void *new;
> >int *p;
> >
> >
> >xmalloc_assert(size >= 0 && size <= INT_MAX);
>
> Good for debugging. However, size has type size_t, which is unsigned, so
> the size >= 0 half is always true, and the upper comparison should be
> against SIZE_MAX rather than INT_MAX.
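> One way to make that assert meaningful, assuming the intent is to protect
> the later size + 2*sizeof(int) addition from wrapping (xmalloc_size_ok is
> a hypothetical name, not proposed patch code):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical replacement check: size >= 0 is always true for the
 * unsigned size_t, and size <= SIZE_MAX would be too, so the useful
 * guard is that adding the two-int bookkeeping header cannot wrap. */
static int xmalloc_size_ok(size_t size)
{
    return size <= SIZE_MAX - 2 * sizeof(int);
}
```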
>
> >MALLOC_LOCK();
> >p = (int *)malloc(size + 2*sizeof(int));
>
> Don't cast malloc. Ever. It hides problems. The 2*sizeof(int) is a
> problem too as seen a little later once we start assigning to it.
>
> >MALLOC_UNLOCK();
>
> *Great* on platforms with non-thread-safe malloc and since we care about
> an OOM condition and want to handle it once. Since Slurm is
> multithreaded we don't want >1 OOM condition simultaneously. These
> reasons are enough in my opinion to *not* replace xmalloc with malloc.
> However, I'll continue because there is another separate but interesting
> aspect that should be considered while we're here.
>
> >if (!p) {
> >/* out of memory */
> >log_oom(file, line, func);
> >abort();
> >}
>
> OK. Handle the OOM condition without attempting to allocate more memory
> (see the implementation of log_oom for details). However, because the
> mutex doesn't cover it, another thread could have already triggered
> and/or partially handled an OOM condition. It should probably be moved
> within the mutex block.
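> A sketch of what moving the handler inside the critical section could
> look like (xmalloc_locked and the mutex name are hypothetical, and
> log_oom is only indicated as a comment):

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t malloc_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Sketch: keep the OOM check and report inside the critical section so
 * only one thread at a time can be handling an out-of-memory condition. */
static void *xmalloc_locked(size_t size)
{
    void *p;

    pthread_mutex_lock(&malloc_mutex);
    p = malloc(size);
    if (!p) {
        /* log_oom(file, line, func) would run here, still under the lock,
         * so a second failing thread cannot interleave with the report. */
        pthread_mutex_unlock(&malloc_mutex);
        abort();
    }
    pthread_mutex_unlock(&malloc_mutex);
    return p;
}
```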
>
> >p[0] = XMALLOC_MAGIC; /* add "secret" magic cookie */
> >p[1] = (int)size; /* store size in buffer */
> >
>
> Neat. The only problem is size_t and int aren't the same thing and this
> feature would not work as intended on some architectures.
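> A quick illustration of the problem on 64-bit platforms, where size_t is
> wider than int (stored_size is a hypothetical helper mimicking the quoted
> p[1] = (int)size line):

```c
#include <stdint.h>
#include <stddef.h>

/* On LP64/LLP64 platforms size_t is 64-bit while int is 32-bit, so the
 * quoted p[1] = (int)size silently drops the high bits for allocations of
 * 2 GiB and up (the exact result of the narrowing conversion is
 * implementation-defined, but the size can no longer round-trip). */
static int stored_size(size_t size)
{
    return (int)size;    /* what the quoted code keeps in p[1] */
}
```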
>
> >new = &p[2];
> >memset(new, 0, size);
>
> Whoops. We can also hit an OOM condition here, and we can't catch it
> thanks to Linux over-committing memory by default. The malloc above
> could have succeeded, but because the memory wasn't *actually* allocated
> yet, another thread (or another program) could have already used enough
> memory for this memset to fail. The memset could be moved inside the
> mutex block to help prevent another Slurm thread from triggering an OOM
> condition here, but if memory is already scarce all bets are off because
> one will probably be triggered soon anyway.
>
> >return new;
> >}
>
> Now, there is the issue of malloc+memset vs. calloc. For instances where
> it's *necessary* to zero the memory (namely ease of string manipulation,
> struct field initialization, or security), calloc *could* be used simply
> because it's more efficient. Moreover, given what xmalloc is trying to
> accomplish in handling OOM conditions, calloc also eliminates the memset
> call whose failure can't be caught. Not only *could* calloc be used for
> efficiency, it *should* be used so xmalloc functions as intended (and
> more gracefully handles a memory allocation error).
>
> Now, all the way back to why this originally came up: I think a more
> appropriate change is to fix xmalloc to use calloc which should on its
> own give a speedup to cases like the one Cao was seeing without changing
> its semantics.
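> A sketch of what the calloc-based wrapper could look like, stripped of
> the locking and logging for brevity (MAGIC and xmalloc_calloc_sketch are
> stand-in names, not the proposed patch itself):

```c
#include <stdlib.h>

/* Sketch of the proposed fix: swap malloc+memset for calloc so that large,
 * mostly-untouched allocations come back as already-zeroed pages instead of
 * paying an explicit memset, and so there is no unhandled fault point after
 * the allocation check. The header layout mirrors the quoted code (two
 * ints); error handling here is reduced to returning NULL. */
#define MAGIC 0x42           /* stand-in for XMALLOC_MAGIC */

static void *xmalloc_calloc_sketch(size_t size)
{
    int *p = calloc(1, size + 2 * sizeof(int));
    if (!p)
        return NULL;         /* real code: log_oom(...) then abort() */
    p[0] = MAGIC;            /* "secret" magic cookie, as in the quoted code */
    p[1] = (int)size;        /* unchanged here; the size_t caveat still applies */
    return &p[2];
}
```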
>
> If necessary, another malloc wrapper *could* be added that just uses
> malloc without zeroing the memory. However, the only benefit over the
> calloc-based wrapper would come when the system is under sufficient load
> that the kernel cannot supply zeroed pages without a slight delay in
> calloc. I suspect the system load itself is going to be a bigger
> slowdown than that small delay, so there is little real benefit to a
> non-zeroing malloc wrapper.
>
> Please see the attached patches for my proposed fixes to the malloc
> wrappers. I have done some limited testing but they should definitely be
> well reviewed before committing.
>
> [1]
> http://stackoverflow.com/questions/2688466/why-mallocmemset-is-slower-than-calloc
>