Hello!

I've upgraded Slurm from 14.03 to 15.08.12 and am now seeing memory fragmentation errors on multiple nodes. The following appears in the syslog:

Jun 20 22:43:36 wn029 kernel: SLAB: Unable to allocate memory on node 0 (gfp=0x100020)
Jun 20 22:43:36 wn029 kernel: cache: skbuff_fclone_cache(47:step_batch), object size: 512, order: 0
Jun 20 22:43:36 wn029 kernel:  node 0: slabs: 0/0, objs: 0/0, free: 0
Jun 20 22:43:36 wn029 kernel:  node 1: slabs: 0/0, objs: 0/0, free: 0
Jun 20 22:43:36 wn029 kernel: SLAB: Unable to allocate memory on node 0 (gfp=0x100020)
Jun 20 22:43:36 wn029 kernel: cache: skbuff_fclone_cache(47:step_batch), object size: 512, order: 0
Jun 20 22:43:36 wn029 kernel:  node 0: slabs: 0/0, objs: 0/0, free: 0
Jun 20 22:43:36 wn029 kernel:  node 1: slabs: 0/0, objs: 0/0, free: 0

After these errors, the Slurm controller reports that the node is not responding. That is accurate: the node is unreachable and has to be force-rebooted. Based on the errors, I thought it was a network issue and ran a few tests, but found nothing unusual.

I tested Slurm for memory leaks with valgrind; the results are attached to this mail. I am not sure whether these leaks are what actually causes the machine to crash.

Any help will be appreciated.
Cheers,
Barbara

==8301== HEAP SUMMARY:
==8301==     in use at exit: 57,036 bytes in 391 blocks
==8301==   total heap usage: 29,441 allocs, 29,050 frees, 3,081,722 bytes allocated
==8301== 
==8301== Searching for pointers to 391 not-freed blocks
==8301== Checked 290,976 bytes
==8301== 
==8301== 17 bytes in 1 blocks are definitely lost in loss record 16 of 90
==8301==    at 0x4C267BB: calloc (vg_replace_malloc.c:593)
==8301==    by 0x464272: slurm_xmalloc (xmalloc.c:84)
==8301==    by 0x46509E: xstrdup (xstring.c:361)
==8301==    by 0x445EFA: xcgroup_ns_create (xcgroup.c:102)
==8301==    by 0x785B7A1: ???
==8301==    by 0x7856F45: ???
==8301== 
==8301== 23 bytes in 1 blocks are definitely lost in loss record 27 of 90
==8301==    at 0x4C267BB: calloc (vg_replace_malloc.c:593)
==8301==    by 0x464272: slurm_xmalloc (xmalloc.c:84)
==8301==    by 0x46509E: xstrdup (xstring.c:361)
==8301==    by 0x445F0E: xcgroup_ns_create (xcgroup.c:103)
==8301==    by 0x785B7A1: ???
==8301==    by 0x7856F45: ???
==8301== 
==8301== 116 bytes in 1 blocks are definitely lost in loss record 46 of 90
==8301==    at 0x4C267BB: calloc (vg_replace_malloc.c:593)
==8301==    by 0x464272: slurm_xmalloc (xmalloc.c:84)
==8301==    by 0x46585F: _xstrdup_vprintf (xstring.c:620)
==8301==    by 0x465188: xstrdup_printf (xstring.c:381)
==8301==    by 0x445EE7: xcgroup_ns_create (xcgroup.c:100)
==8301==    by 0x785B7A1: ???
==8301== 
==8301== 116 bytes in 1 blocks are definitely lost in loss record 47 of 90
==8301==    at 0x4C267BB: calloc (vg_replace_malloc.c:593)
==8301==    by 0x464272: slurm_xmalloc (xmalloc.c:84)
==8301==    by 0x46585F: _xstrdup_vprintf (xstring.c:620)
==8301==    by 0x465188: xstrdup_printf (xstring.c:381)
==8301==    by 0x445F36: xcgroup_ns_create (xcgroup.c:104)
==8301==    by 0x785B7A1: ???
==8301== 
==8301== 2,064 bytes in 1 blocks are possibly lost in loss record 87 of 90
==8301==    at 0x4C267BB: calloc (vg_replace_malloc.c:593)
==8301==    by 0x464272: slurm_xmalloc (xmalloc.c:84)
==8301==    by 0x46B3B4: list_alloc_aux (list.c:1041)
==8301==    by 0x46B2CE: list_node_alloc (list.c:994)
==8301==    by 0x46B0AF: list_node_create (list.c:909)
==8301==    by 0x46B62C: _list_append_locked (list.c:1145)
==8301== 
==8301== 4,112 bytes in 1 blocks are possibly lost in loss record 88 of 90
==8301==    at 0x4C267BB: calloc (vg_replace_malloc.c:593)
==8301==    by 0x464272: slurm_xmalloc (xmalloc.c:84)
==8301==    by 0x46B3B4: list_alloc_aux (list.c:1041)
==8301==    by 0x46B306: list_iterator_alloc (list.c:1010)
==8301==    by 0x46AA26: list_iterator_create (list.c:716)
==8301==    by 0x54AD19: build_all_nodeline_info (node_conf.c:682)
==8301== 
==8301== 10,256 bytes in 1 blocks are possibly lost in loss record 89 of 90
==8301==    at 0x4C267BB: calloc (vg_replace_malloc.c:593)
==8301==    by 0x464272: slurm_xmalloc (xmalloc.c:84)
==8301==    by 0x46B3B4: list_alloc_aux (list.c:1041)
==8301==    by 0x46B296: list_alloc (list.c:978)
==8301==    by 0x469AB2: list_create (list.c:292)
==8301==    by 0x42C60A: _init_conf (slurmd.c:1208)
==8301== 
==8301== LEAK SUMMARY:
==8301==    definitely lost: 272 bytes in 4 blocks
==8301==    indirectly lost: 0 bytes in 0 blocks
==8301==      possibly lost: 16,432 bytes in 3 blocks
==8301==    still reachable: 40,332 bytes in 384 blocks
==8301==         suppressed: 0 bytes in 0 blocks
==8301== Reachable blocks (those to which a pointer was found) are not shown.
==8301== To see them, rerun with: --leak-check=full --show-reachable=yes
==8301== 
==8301== ERROR SUMMARY: 7 errors from 7 contexts (suppressed: 6 from 6)
