Re: [petsc-users] PetscAllreduceBarrierCheck is valgrind clean?
On Wed, Jan 13, 2021 at 11:49 AM Barry Smith wrote:
>
> Fande,
>
>    Look at
> https://scm.mvapich.cse.ohio-state.edu/svn/mpi/mvapich2/trunk/src/mpid/ch3/channels/common/src/detect/arch/mv2_arch_detect.c
>
>    cpubind_set = hwloc_bitmap_alloc();
>
> but I don't find a corresponding hwloc_bitmap_free(cpubind_set); in
> get_socket_bound_info().

Thanks. I added hwloc_bitmap_free(cpubind_set) to the end of
get_socket_bound_info(), and these valgrind messages disappeared. I will ask
the mvapich developers to fix this.

Thanks,

Fande

>    Barry
>
> > On Jan 13, 2021, at 12:32 PM, Fande Kong wrote:
> >
> > Hi All,
> >
> > I ran valgrind with mvapich-2.3.5 for a moose simulation. The motivation
> > was that we have a few non-deterministic parallel simulations in moose.
> > I want to check if we have any memory issues. I got some complaints from
> > PetscAllreduceBarrierCheck.
> >
> > Thanks,
> >
> > Fande
> >
> > ==98001== 88 (24 direct, 64 indirect) bytes in 1 blocks are definitely lost in loss record 31 of 54
> > ==98001==    at 0x4C29F73: malloc (vg_replace_malloc.c:307)
> > ==98001==    by 0xDAE1D5E: hwloc_bitmap_alloc (bitmap.c:74)
> > ==98001==    by 0xDA7523F: get_socket_bound_info (mv2_arch_detect.c:898)
> > ==98001==    by 0xD93C87A: create_intra_sock_comm (create_2level_comm.c:593)
> > ==98001==    by 0xD93BEBA: create_2level_comm (create_2level_comm.c:1762)
> > ==98001==    by 0xD59A894: mv2_increment_shmem_coll_counter (ch3_shmem_coll.c:2183)
> > ==98001==    by 0xD4E4CBB: PMPI_Allreduce (allreduce.c:912)
> > ==98001==    by 0x99F1766: PetscAllreduceBarrierCheck (pbarrier.c:26)
> > ==98001==    by 0x99F70BE: PetscSplitOwnership (psplit.c:84)
> > ==98001==    by 0x9C5C26B: PetscLayoutSetUp (pmap.c:262)
> > ==98001==    by 0xA08C66B: MatMPIAdjSetPreallocation_MPIAdj (mpiadj.c:630)
> > ==98001==    by 0xA08EB9A: MatMPIAdjSetPreallocation (mpiadj.c:856)
> > ==98001==    by 0xA08F6D3: MatCreateMPIAdj (mpiadj.c:904)
> >
> > ==98001== 88 (24 direct, 64 indirect) bytes in 1 blocks are definitely lost in loss record 32 of 54
> > ==98001==    at 0x4C29F73: malloc (vg_replace_malloc.c:307)
> > ==98001==    by 0xDAE1D5E: hwloc_bitmap_alloc (bitmap.c:74)
> > ==98001==    by 0xDA7523F: get_socket_bound_info (mv2_arch_detect.c:898)
> > ==98001==    by 0xD93C87A: create_intra_sock_comm (create_2level_comm.c:593)
> > ==98001==    by 0xD93BEBA: create_2level_comm (create_2level_comm.c:1762)
> > ==98001==    by 0xD59A9A4: mv2_increment_allgather_coll_counter (ch3_shmem_coll.c:2218)
> > ==98001==    by 0xD4E4CE4: PMPI_Allreduce (allreduce.c:917)
> > ==98001==    by 0xCD9D74D: libparmetis__gkMPI_Allreduce (gkmpi.c:103)
> > ==98001==    by 0xCDBB663: libparmetis__ComputeParallelBalance (stat.c:87)
> > ==98001==    by 0xCDA4FE0: libparmetis__KWayFM (kwayrefine.c:352)
> > ==98001==    by 0xCDA21ED: libparmetis__Global_Partition (kmetis.c:222)
> > ==98001==    by 0xCDA20B2: libparmetis__Global_Partition (kmetis.c:191)
> > ==98001==    by 0xCDA20B2: libparmetis__Global_Partition (kmetis.c:191)
> > ==98001==    by 0xCDA20B2: libparmetis__Global_Partition (kmetis.c:191)
> > ==98001==    by 0xCDA20B2: libparmetis__Global_Partition (kmetis.c:191)
> > ==98001==    by 0xCDA2748: ParMETIS_V3_PartKway (kmetis.c:94)
> > ==98001==    by 0xA2D6B39: MatPartitioningApply_Parmetis_Private (pmetis.c:145)
> > ==98001==    by 0xA2D77D9: MatPartitioningApply_Parmetis (pmetis.c:219)
> > ==98001==    by 0xA2CD46A: MatPartitioningApply (partition.c:332)
> >
> > ==98001== 88 (24 direct, 64 indirect) bytes in 1 blocks are definitely lost in loss record 33 of 54
> > ==98001==    at 0x4C29F73: malloc (vg_replace_malloc.c:307)
> > ==98001==    by 0xDAE1D5E: hwloc_bitmap_alloc (bitmap.c:74)
> > ==98001==    by 0xDA7523F: get_socket_bound_info (mv2_arch_detect.c:898)
> > ==98001==    by 0xD93C87A: create_intra_sock_comm (create_2level_comm.c:593)
> > ==98001==    by 0xD93BEBA: create_2level_comm (create_2level_comm.c:1762)
> > ==98001==    by 0xD59A894: mv2_increment_shmem_coll_counter (ch3_shmem_coll.c:2183)
> > ==98001==    by 0xD4E4CBB: PMPI_Allreduce (allreduce.c:912)
> > ==98001==    by 0x99F1766: PetscAllreduceBarrierCheck (pbarrier.c:26)
> > ==98001==    by 0x99F733E: PetscSplitOwnership (psplit.c:91)
> > ==98001==    by 0x9C5C26B: PetscLayoutSetUp (pmap.c:262)
> > ==98001==    by 0x9C5DB0D: PetscLayoutCreateFromSizes (pmap.c:112)
> > ==98001==    by 0x9D9A018: ISGeneralSetIndices_General (general.c:568)
> > ==98001==    by 0x9D9AB44: ISGeneralSetIndices (general.c:554)
> > ==98001==    by 0x9D9ADC4: ISCreateGeneral (general.c:529)
> > ==98001==    by 0x9B431E6: VecCreateGhostWithArray (pbvec.c:692)
> > ==98001==    by 0x9B43A33: VecCreateGhost (pbvec.c:748)
> >
> > ==98001== 88 (24 direct, 64 indirect) bytes in 1 blocks are definitely lost in loss record 34 of 54
> > ==98001==    at 0x4C29F73: malloc (vg_replace_malloc.c:307)
> > ==98001==    by 0xDAE1D5E:
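(For reference, below is a minimal standalone sketch of the hwloc allocate/use/free pattern behind the fix described above. It is not the actual mvapich2 get_socket_bound_info(), which does considerably more work; it only illustrates where the matching hwloc_bitmap_free() belongs. The hwloc calls shown are standard hwloc API.)

#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  hwloc_topology_t topology;
  hwloc_bitmap_t   cpubind_set;
  char            *str = NULL;

  hwloc_topology_init(&topology);
  hwloc_topology_load(topology);

  /* The allocation valgrind reports as "definitely lost" when it is never freed. */
  cpubind_set = hwloc_bitmap_alloc();
  hwloc_get_cpubind(topology, cpubind_set, HWLOC_CPUBIND_PROCESS);

  /* Use the binding information, e.g. print this process's cpuset. */
  hwloc_bitmap_asprintf(&str, cpubind_set);
  printf("bound to cpuset %s\n", str);
  free(str);

  /* The matching free that the traces above show is missing at the end of
     get_socket_bound_info(). */
  hwloc_bitmap_free(cpubind_set);
  hwloc_topology_destroy(topology);
  return 0;
}

Compiling this with -lhwloc and running it under valgrind --leak-check=full with the hwloc_bitmap_free() line removed reproduces a "definitely lost" record whose top frames are malloc and hwloc_bitmap_alloc, matching the traces above.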
Re: [petsc-users] PetscAllreduceBarrierCheck is valgrind clean?
Fande,

   Look at
https://scm.mvapich.cse.ohio-state.edu/svn/mpi/mvapich2/trunk/src/mpid/ch3/channels/common/src/detect/arch/mv2_arch_detect.c

   cpubind_set = hwloc_bitmap_alloc();

but I don't find a corresponding hwloc_bitmap_free(cpubind_set); in
get_socket_bound_info().

   Barry

> On Jan 13, 2021, at 12:32 PM, Fande Kong wrote:
>
> Hi All,
>
> I ran valgrind with mvapich-2.3.5 for a moose simulation. The motivation
> was that we have a few non-deterministic parallel simulations in moose.
> I want to check if we have any memory issues. I got some complaints from
> PetscAllreduceBarrierCheck.
>
> Thanks,
>
> Fande
>
> ==98001== 88 (24 direct, 64 indirect) bytes in 1 blocks are definitely lost in loss record 31 of 54
> ==98001==    at 0x4C29F73: malloc (vg_replace_malloc.c:307)
> ==98001==    by 0xDAE1D5E: hwloc_bitmap_alloc (bitmap.c:74)
> ==98001==    by 0xDA7523F: get_socket_bound_info (mv2_arch_detect.c:898)
> ==98001==    by 0xD93C87A: create_intra_sock_comm (create_2level_comm.c:593)
> ==98001==    by 0xD93BEBA: create_2level_comm (create_2level_comm.c:1762)
> ==98001==    by 0xD59A894: mv2_increment_shmem_coll_counter (ch3_shmem_coll.c:2183)
> ==98001==    by 0xD4E4CBB: PMPI_Allreduce (allreduce.c:912)
> ==98001==    by 0x99F1766: PetscAllreduceBarrierCheck (pbarrier.c:26)
> ==98001==    by 0x99F70BE: PetscSplitOwnership (psplit.c:84)
> ==98001==    by 0x9C5C26B: PetscLayoutSetUp (pmap.c:262)
> ==98001==    by 0xA08C66B: MatMPIAdjSetPreallocation_MPIAdj (mpiadj.c:630)
> ==98001==    by 0xA08EB9A: MatMPIAdjSetPreallocation (mpiadj.c:856)
> ==98001==    by 0xA08F6D3: MatCreateMPIAdj (mpiadj.c:904)
>
> ==98001== 88 (24 direct, 64 indirect) bytes in 1 blocks are definitely lost in loss record 32 of 54
> ==98001==    at 0x4C29F73: malloc (vg_replace_malloc.c:307)
> ==98001==    by 0xDAE1D5E: hwloc_bitmap_alloc (bitmap.c:74)
> ==98001==    by 0xDA7523F: get_socket_bound_info (mv2_arch_detect.c:898)
> ==98001==    by 0xD93C87A: create_intra_sock_comm (create_2level_comm.c:593)
> ==98001==    by 0xD93BEBA: create_2level_comm (create_2level_comm.c:1762)
> ==98001==    by 0xD59A9A4: mv2_increment_allgather_coll_counter (ch3_shmem_coll.c:2218)
> ==98001==    by 0xD4E4CE4: PMPI_Allreduce (allreduce.c:917)
> ==98001==    by 0xCD9D74D: libparmetis__gkMPI_Allreduce (gkmpi.c:103)
> ==98001==    by 0xCDBB663: libparmetis__ComputeParallelBalance (stat.c:87)
> ==98001==    by 0xCDA4FE0: libparmetis__KWayFM (kwayrefine.c:352)
> ==98001==    by 0xCDA21ED: libparmetis__Global_Partition (kmetis.c:222)
> ==98001==    by 0xCDA20B2: libparmetis__Global_Partition (kmetis.c:191)
> ==98001==    by 0xCDA20B2: libparmetis__Global_Partition (kmetis.c:191)
> ==98001==    by 0xCDA20B2: libparmetis__Global_Partition (kmetis.c:191)
> ==98001==    by 0xCDA20B2: libparmetis__Global_Partition (kmetis.c:191)
> ==98001==    by 0xCDA2748: ParMETIS_V3_PartKway (kmetis.c:94)
> ==98001==    by 0xA2D6B39: MatPartitioningApply_Parmetis_Private (pmetis.c:145)
> ==98001==    by 0xA2D77D9: MatPartitioningApply_Parmetis (pmetis.c:219)
> ==98001==    by 0xA2CD46A: MatPartitioningApply (partition.c:332)
>
> ==98001== 88 (24 direct, 64 indirect) bytes in 1 blocks are definitely lost in loss record 33 of 54
> ==98001==    at 0x4C29F73: malloc (vg_replace_malloc.c:307)
> ==98001==    by 0xDAE1D5E: hwloc_bitmap_alloc (bitmap.c:74)
> ==98001==    by 0xDA7523F: get_socket_bound_info (mv2_arch_detect.c:898)
> ==98001==    by 0xD93C87A: create_intra_sock_comm (create_2level_comm.c:593)
> ==98001==    by 0xD93BEBA: create_2level_comm (create_2level_comm.c:1762)
> ==98001==    by 0xD59A894: mv2_increment_shmem_coll_counter (ch3_shmem_coll.c:2183)
> ==98001==    by 0xD4E4CBB: PMPI_Allreduce (allreduce.c:912)
> ==98001==    by 0x99F1766: PetscAllreduceBarrierCheck (pbarrier.c:26)
> ==98001==    by 0x99F733E: PetscSplitOwnership (psplit.c:91)
> ==98001==    by 0x9C5C26B: PetscLayoutSetUp (pmap.c:262)
> ==98001==    by 0x9C5DB0D: PetscLayoutCreateFromSizes (pmap.c:112)
> ==98001==    by 0x9D9A018: ISGeneralSetIndices_General (general.c:568)
> ==98001==    by 0x9D9AB44: ISGeneralSetIndices (general.c:554)
> ==98001==    by 0x9D9ADC4: ISCreateGeneral (general.c:529)
> ==98001==    by 0x9B431E6: VecCreateGhostWithArray (pbvec.c:692)
> ==98001==    by 0x9B43A33: VecCreateGhost (pbvec.c:748)
>
> ==98001== 88 (24 direct, 64 indirect) bytes in 1 blocks are definitely lost in loss record 34 of 54
> ==98001==    at 0x4C29F73: malloc (vg_replace_malloc.c:307)
> ==98001==    by 0xDAE1D5E: hwloc_bitmap_alloc (bitmap.c:74)
> ==98001==    by 0xDA7523F: get_socket_bound_info (mv2_arch_detect.c:898)
> ==98001==    by 0xD93C87A: create_intra_sock_comm (create_2level_comm.c:593)
> ==98001==    by 0xD93BEBA: create_2level_comm (create_2level_comm.c:1762)
> ==98001==    by 0xD59A894: mv2_increment_shmem_coll_counter (ch3_shmem_coll.c:2183)
> ==98001==    by 0xD4E4CBB: PMPI_Allreduce