On Wed, 29 Apr 2015 20:29:07 -0600 Scott Robison <scott at casaderobison.com> wrote:
> > That code can fail on a system configured to overcommit memory. By that standard, the pointer is invalid.
>
> Accidentally sent before I was finished. In any case, by "invalid pointer" I did not mean to imply "it returns a bit pattern that could never represent a valid pointer". I mean "if you dereference a pointer returned by malloc that is not null or some implementation defined value, it should not result in an invalid memory access".

Agreed. And I don't think that will happen with malloc. It might, and I have a plausible scenario, but I don't think that's what happened.

In the bizarre context of the Linux OOM killer, the OS may promise more memory than it can supply. The promise is made by malloc and materialized by writes to the memory allocated through the returned pointer, because at the time of the write the OS must actually (and may fail to) allocate the memory from RAM or swap. Exhaustion of overcommitted memory does *not* result in SIGSEGV, however. The OOM killer selects a process for SIGKILL, and the straw-on-the-camel's-back process that triggered the OOM condition is not necessarily the one that is selected.

As far as "invalid" goes, I don't see how we can single out pointers from malloc. In the presence of overcommitted memory, *all* addresses, including that of the program text, are invalid in the sense that they are undependable: the process may be killed through no fault of its own by virtue of a heuristic. I think it's fair to say overcommitment makes the machine nondeterministic, or at least adds to the machine's nondeterminism.

Can writing through a pointer returned by malloc (within the allocated range) ever result in SIGSEGV? Maybe. I have a plausible scenario involving sparse files and mmap, which malloc uses.

Let us say you have two processes on a 64-bit machine and a 1 TB filesystem. Each process opens a new file, seeks to position 1 TB - 1, and writes 1 byte. Each process now owns a file whose "size" is 1 TB but whose block count is 1. Most of the filesystem is still empty, yet the two files together nominally claim 200% of the available space. These are known as "sparse" files; the unwritten regions are called "holes".

Now each process calls mmap(2) on its file for the entire 1 TB. Still OK: mmap will not fail. The holes in the files read back as 0. When a hole is written to, the OS allocates a block from the filesystem and maps it to a page of memory. As each process writes 1's sequentially through its mapping, successive blocks are allocated. Soon enough the last block is allocated and the filesystem is really and truly full. At the next write into a hole, no block can be allocated and no page mapped. What to do? When calling write(2) on a full filesystem we expect ENOSPC, but there is nowhere to return an error condition for a write to memory. Consequently the OS has no choice but to signal the process. That signal will be, yes, SIGSEGV.

What does that have to do with malloc? GNU malloc uses mmap for large allocations; the pointer it returns is supplied by mmap as an anonymous mapping backed by the swap partition. If malloc creates sparse files, writes through malloc'd pointers could result in SIGSEGV. However, I do not know that that is what malloc does.

I do not think that is what is happening in the OP's case. I suspect the OP's process sailed past any memory-allocation constraints because of the overcommitted memory configuration, and eventually ran aground when the stack was exhausted.
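To put the overcommit part in concrete terms, here is a minimal sketch (my assumptions: a 64-bit Linux box; the 64 GB figure is arbitrary and meant to exceed RAM + swap; note the caution in the comment, since actually touching the pages can provoke the OOM killer against some other process):

    /* Sketch of the overcommit behaviour described above, assuming
     * 64-bit Linux.  CAUTION: the memset may invoke the OOM killer,
     * which can SIGKILL *some* process -- not necessarily this one --
     * so don't run this on a machine you care about. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        size_t sz = (size_t)64 << 30;   /* ask for 64 GB */

        /* Under vm.overcommit_memory = 1 the promise is always made;
         * under the default heuristic (0) it may or may not be. */
        char *p = malloc(sz);
        if (p == NULL) {
            puts("malloc declined to promise 64 GB");
            return 1;
        }
        puts("malloc promised 64 GB; now asking the OS to make good on it");

        /* The writes are what force the OS to find real RAM or swap
         * for each page.  If it cannot, the OOM killer picks a victim. */
        memset(p, 1, sz);

        puts("survived: there really was 64 GB of RAM + swap");
        free(p);
        return 0;
    }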
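And a minimal sketch of the sparse-file/mmap scenario, scaled down from 1 TB to 1 GB so it is less likely to hurt anything (the file name is made up and error handling is abbreviated; whether the signal arrives as SIGSEGV or SIGBUS may depend on the system, the point being that it arrives as a signal rather than as ENOSPC):

    /* Sketch of the sparse-file scenario: create a 1 GB sparse file,
     * map it, and write through the mapping.  On a nearly full
     * filesystem the writes into holes can fail with a signal,
     * because there is no error return for a memory write. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SPARSE_SIZE ((off_t)1 << 30)   /* 1 GB here; the text uses 1 TB */

    int main(void)
    {
        int fd = open("sparse.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Write one byte at the last position: the file's apparent size
         * is now SPARSE_SIZE, but its block count is tiny.  The unwritten
         * middle is a "hole". */
        if (pwrite(fd, "", 1, SPARSE_SIZE - 1) != 1) { perror("pwrite"); return 1; }

        /* Map the whole file.  mmap does not care how much space the
         * filesystem actually has left, so this succeeds. */
        char *p = mmap(NULL, (size_t)SPARSE_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Each write into a hole forces the OS to allocate a filesystem
         * block.  If the filesystem fills up first, there is nowhere to
         * return ENOSPC, and the process is signalled instead. */
        for (off_t i = 0; i < SPARSE_SIZE; i++)
            p[i] = 1;

        munmap(p, (size_t)SPARSE_SIZE);
        close(fd);
        return 0;
    }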
Others have already suggested fixing the overcommit setting as a first step. Other steps might be:

1. Examine the core dump to determine whether the SIGSEGV was triggered by a write to heap or stack memory. Or not, as the case may be. ;-)

2. Investigate the malloc algorithm and/or replace it with one that does not use sparse files.

3. Increase the stack space allocated to the process.

It's an interesting problem. I hope we learn the answer.

--jkl
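P.S. For (3), one way to raise the stack limit from within the process itself, as a sketch (assuming a POSIX system; the 64 MB target is just for illustration, and "ulimit -s" in the invoking shell accomplishes the same thing):

    /* Sketch: raise the process's stack soft limit toward its hard limit.
     * The 64 MB target is illustrative; pick a value suited to the workload. */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;
        if (getrlimit(RLIMIT_STACK, &rl) != 0) { perror("getrlimit"); return 1; }

        rlim_t want = (rlim_t)64 * 1024 * 1024;      /* 64 MB */
        if (rl.rlim_max != RLIM_INFINITY && want > rl.rlim_max)
            want = rl.rlim_max;                      /* cannot exceed the hard limit */

        rl.rlim_cur = want;
        if (setrlimit(RLIMIT_STACK, &rl) != 0) { perror("setrlimit"); return 1; }

        printf("stack soft limit now %llu bytes\n", (unsigned long long)rl.rlim_cur);
        return 0;
    }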