The easy work described in Janne's email has been done:
https://github.com/SchedMD/slurm/commit/20004689197075a3dcab8bb946133a64820baa86
A review of all xmalloc references, changing them to xcalloc where
appropriate, would take a lot of effort for likely little benefit, so
it's probably not going to happen, even though it's arguably the
"proper" solution.
Quoting Janne Blomqvist <[email protected]>:
On 2014-03-21 23:19, Daniel M. Weeks wrote:
I started this email after reading the thread started by Hongjia Cao
"performance improvement for hostlist". After a bit of digging and
writing it grew beyond the issue pointed out in that thread so I thought
it best to begin a new one since it may take on a discussion of its own.
It also includes a good bit of ancillary information people may find
generally useful. I don't think the solution committed for the problem
Cao highlighted is correct. My proposed fix is in the second to last
paragraph although it somewhat depends on understanding the other items
that lead up to it.
It already looks long, doesn't it? Let's dive in...
First, the timing test provided by Cao is bogus, but it perfectly
illustrates the bigger problem Cao is trying to fix. (I'm not saying Cao
is wrong or that it was a bad test to run; there's something to be
learned, and it led to a more careful analysis.)
Second, a bit of background for those who are unfamiliar with
malloc/memset; skip this paragraph if you already know it. malloc takes
practically no time on Linux with glibc because it doesn't really do
much. It's not until the memory is actually accessed that it is truly
allocated by Linux. That's why there is such a discrepancy in the
timing; xmalloc is actually accessing the memory to zero it. This is
also why zeroing a bunch of memory unnecessarily can be a big waste of
time as Cao found. (A more realistic test would involve accessing a
portion of the memory between each xmalloc/malloc and xfree/free.)
It's also worth noting at this time that there is a fundamental
difference between malloc+memset and calloc. For a better explanation
than I can include in this space see Dietrich Epp's writeup on Stack
Overflow[1].
I'll say Cao is mostly right. It is probably better *not* to use xmalloc
here, since it potentially touches more memory than necessary, and
zeroing otherwise untouched memory is a waste. However, xmalloc does
more than just zero the memory; by switching to plain malloc those other
features get tossed.
The question becomes whether those features of xmalloc are important
enough to keep for this case. (I have made some notes about
additional problems to be corrected.) Let's take a look:
    void *slurm_xmalloc(size_t size,
                        const char *file, int line, const char *func)
    {
        void *new;
        int *p;
        xmalloc_assert(size >= 0 && size <= INT_MAX);
Good for debugging. However, size is type size_t. Comparison should be
against SIZE_MAX.
Yes, but if you compare against SIZE_MAX then the assert will never
fire, since size_t is unsigned, so the assert might as well be
removed then.
Perhaps the idea was that if something tries to allocate more than
INT_MAX, something has gone terribly wrong, since slurm typically
doesn't allocate very large objects, so you might as well abort at that
point? Then again, if so, why use INT_MAX and not some other magic
number? Furthermore, is xmalloc() the proper place for such checks,
and, even more so, are such checks actually useful in general?
        MALLOC_LOCK();
        p = (int *)malloc(size + 2*sizeof(int));
Don't cast malloc. Ever. It hides problems. The 2*sizeof(int) is a
problem too as seen a little later once we start assigning to it.
        MALLOC_UNLOCK();
*Great* on platforms with non-thread-safe malloc
MALLOC_LOCK and MALLOC_UNLOCK are actually macros, you can see the
definitions in the beginning of xmalloc.c. Note also that the
HAVE_UNSAFE_MALLOC macro that controls the expansion of those macros
is actually never set, so unless you have modified your source tree
to set it, MALLOC_LOCK/UNLOCK are no-ops.
That being said, I don't think there exists any remotely recent
system which supports threads but doesn't provide a thread-safe
malloc, so IMHO those macros should be removed.
and since we care about
an OOM condition and want to handle it once. Since Slurm is
multithreaded we don't want >1 OOM condition simultaneously.
Can this actually happen? Isn't the entire process killed
immediately once the OOM killer decides to kill it?
        p[0] = XMALLOC_MAGIC; /* add "secret" magic cookie */
        p[1] = (int)size;     /* store size in buffer */
Neat. The only problem is size_t and int aren't the same thing and this
feature would not work as intended on some architectures.
Yeah, it means that xmalloc doesn't work for allocations larger than
INT_MAX, though the xmalloc_assert previously already guards against
this. A simple fix would be to make "p" a size_t* instead of int*.
        new = &p[2];
        memset(new, 0, size);
Whoops. We can also get an OOM condition here and we can't catch it
thanks to Linux over-committing memory by default.
Yes, but there's not much one can do about it here
Now, there is the issue of malloc+memset vs. calloc. For instances where
it's *necessary* to zero the memory (namely ease of string manipulation,
struct-field initialization, or security), calloc *could* be used simply
because it's more efficient. However, given that xmalloc is trying to
handle OOM conditions, calloc also eliminates the memset call whose
failure cannot be caught. So not only *could* calloc be used for
efficiency, it *should* be used so that xmalloc functions as intended
(and handles a memory-allocation error more gracefully).
Now, all the way back to why this originally came up: I think a more
appropriate change is to fix xmalloc to use calloc, which should on its
own speed up cases like the one Cao was seeing, without changing
xmalloc's semantics.
If necessary, another malloc wrapper *could* be added that just uses
malloc without zeroing the memory. However, the only benefit over the
calloc-based wrapper would come when the system is under enough load
that the kernel has no pre-zeroed pages on hand and calloc incurs a
slight delay. I suspect that in such a case the system load itself would
be a bigger slowdown than calloc's small delay, so a non-zeroing malloc
wrapper offers no real benefit.
A "proper" fix would IMHO be to (in addition to the int/size_t
issues discussed above)
1) Create a calloc wrapper (xcalloc), and update all current users
of xmalloc to use that instead.
2) Remove the memset from xmalloc, so that xmalloc properly is a
malloc wrapper.
3) Audit users of xcalloc, change back to use xmalloc where zeroing
isn't needed.
--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & BECS
+358503841576 || [email protected]