https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103985
Bug ID: 103985
Summary: segfault in finalize_transfer (fbuf_destroy) on (parallel) writing into character / string function argument
Product: gcc
Version: 10.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libfortran
Assignee: unassigned at gcc dot gnu.org
Reporter: thomas.or...@uni-hamburg.de
Target Milestone: ---

I noticed an infrequent crash of an OpenMP-parallelized scientific application that becomes observable when running hundreds of instances. A recent reproduction attempt on 250 Haswell nodes running Linux 4.14.259, with a loop over simulation setup and some short computation, showed 4 crashing runs against 3176 successful ones. Each run used 16 threads on a dual-socket node. At that rate, getting a reproducer is tricky.

The failing code is a rather trivial part of a rather elaborate codebase (one that has triggered a number of Fortran compiler bugs before, regardless of compiler vendor), so it could probably be extracted, but the low odds suggest that certain structures in memory may be necessary to really trip up the gfortran runtime. So I'll report the backtrace and some code fragments here; maybe this already points in a direction where one can home in on the underlying corruption.

This is the crash (each line is prefixed with a node name and seconds roughly since program start):

node292 @ 14.482751: model_setup: Done.
node292 @ 14.482965: memory/kiB peak=5924012 size=5924012 resp=3262324 ress=3225348
node292 @ 14.504973:
node292 @ 14.505052: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
node292 @ 14.505071:
node292 @ 14.505083: Backtrace for this error:
node292 @ 15.149332: #0  0x15523d2cd3ff in ???
node292 @ 15.149365: #1  0x15523d3174a9 in ???
node292 @ 15.149376: #2  0x15523d31820d in ???
node292 @ 15.150486: #3  0x15523e46f469 in finalize_transfer
node292 @ 15.150499:     at ../../../gcc-10.3.0/libgfortran/io/transfer.c:4228
node292 @ 15.150586: #4  0x430f6d in __stringtools_MOD_str_integer
node292 @ 15.150599:     at .preprocessed/misc/stringtools.f90:147
node292 @ 15.150789: #5  0x546b3f in find_worst_speeds
node292 @ 15.150800:     at .preprocessed/dgmodel/cfl.f90:309
node292 @ 15.150808: #6  0x546b3f in __cfl_MOD_cfl_timestep._omp_fn.0
node292 @ 15.150817:     at .preprocessed/dgmodel/cfl.f90:198
node292 @ 15.150939: #7  0x15523d8de4d1 in GOMP_parallel
node292 @ 15.150951:     at ../../../gcc-10.3.0/libgomp/parallel.c:171
node292 @ 15.150961: #8  0x547404 in __cfl_MOD_cfl_timestep
node292 @ 15.150969:     at .preprocessed/dgmodel/cfl.f90:195
node292 @ 15.151289: #9  0x548f59 in find_new_step
node292 @ 15.151300:     at .preprocessed/dgmodel/time_bdf.f90:635
node292 @ 15.151310: #10 0x54d98b in __time_bdf_MOD_time_bdf_explicit_part
node292 @ 15.151318:     at .preprocessed/dgmodel/time_bdf.f90:399
node292 @ 15.151326: #11 0x54d98b in __time_bdf_MOD_time_bdf_step
node292 @ 15.151334:     at .preprocessed/dgmodel/time_bdf.f90:489
node292 @ 15.151442: #12 0x526acc in __time_integration_MOD_timeint_step
node292 @ 15.151454:     at .preprocessed/dgmodel/time_integration.f90:141
node292 @ 15.151743: #13 0x40b058 in dgmodel
node292 @ 15.151754:     at .preprocessed/programs/dgmodel.f90:371
node292 @ 15.151764: #14 0x40760c in main
node292 @ 15.151772:     at .preprocessed/programs/dgmodel.f90:46
node292 @ 17.984833: 85.74user 6.77system 0:16.76elapsed 551%CPU (0avgtext+0avgdata 3262324maxresident)k
node292 @ 17.984892: 64680inputs+8737984outputs (18major+1123633minor)pagefaults 0swaps

The memory stats before/after the crash might be useful, too; I reckon that some significant memory action is necessary to observe the issue.
The respective code is inside an OpenMP section (line numbers prefixed):

186 !$omp parallel num_threads(cfl%threadcount) private(loopcount, ti)
187    loopcount = size(state%setup%grid%elements)
188    ti = ompapi_threadnum()
189
190
191    cfl%maxei(:,ti) = -1
192    cfl%max_e_speeds(:,ti) = 0._dp
193    cfl%speedratio(:,ti) = m_big
194
195 !$omp do
196    do ei=1,loopcount
197       call find_worst_speeds(state%setup%grid%elements(ei), state, mytype, cfl%maxei(:,ti), cfl%max_e_speeds(:,ti), cfl%speedratio&
198          &(:,ti), directions)
199    end do
200 !$omp end do
201 !$omp end parallel

(The loopcount business is a workaround for an earlier compiler bug with Sun Studio that messed up parallel loop boundaries, AFAIR.)

The routine find_worst_speeds does some computations on the handed-in structures and then does this:

308
309    if(err_follow_line('dgmodel/cfl.F90', 265, 'Trouble in element ' // trim(adjustl(str_number(element%id))))) return
310 end subroutine

The function err_follow_line checks a global error counter; if an error has been raised before, it prints the constructed string that was passed in and returns true. So, a simple kind of exception handling, sort of.

148 function err_follow_line(file, line, cmessage) result(lstatus)
149    character(*), intent(in) :: file
150    integer, intent(in) :: line
151    character(*), intent(in) :: cmessage
152    logical :: lstatus
153
154    if( .not. err_count == 0 ) then
155
156
157       if( (par_err_firstonly == 0 .or. backtrace) &
158            .and. (par_err_maxfollow < 1 .or. followed < par_err_maxfollow) ) then
159          followed = followed + (1)
160          write(0,'(A,A,A,A,A)') err_marker, ' error(follow), ', &
161               trim(err_filestring(file, line)), ' ', trim(cmessage)
162       end if
163       lstatus = .true.
164    else
165       lstatus = .false.
166    end if
167 end function err_follow_line

The construction of this string argument is where the trouble occurs. This is str_number, here the integer implementation behind the generic interface:

53 integer, parameter :: str_bufferlen = 100

143 function str_integer(number) result(out)
144    integer, intent(in) :: number
145    character(str_bufferlen) :: out
146
147    write(out, *) number
148 end function

The internal write to the character variable is what invokes the relevant section of libgfortran. To keep the context together, here is the block in which the crash occurs; the call to fbuf_destroy() looks like a plausible place for one. The position in the backtrace is stable, always pointing here:

4220  done:
4221
4222   if (dtp->u.p.unit_is_internal)
4223     {
4224       /* The unit structure may be reused later so clear the
4225          internal unit kind.  */
4226       dtp->u.p.current_unit->internal_unit_kind = 0;
4227
4228       fbuf_destroy (dtp->u.p.current_unit);
4229       if (dtp->u.p.current_unit
4230           && (dtp->u.p.current_unit->child_dtio == 0)
4231           && dtp->u.p.current_unit->s)
4232         {
4233           sclose (dtp->u.p.current_unit->s);
4234           dtp->u.p.current_unit->s = NULL;
4235         }
4236     }

I am not sure how much influence the parallel nature of the OpenMP section has, but I bet that it is causal. Hardware issues seem unlikely: the crash always occurs at the same place, and failures in ECC memory would be noticed.
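
In case it helps, this is a condensed, self-contained distillation of the pattern (module, program, and variable names are made up here; I have not confirmed that this minimal form reproduces the crash, given the observed failure rate of roughly 1 in 800 runs):

module stringtools_sketch
  implicit none
  integer, parameter :: str_bufferlen = 100
contains
  ! Internal WRITE into a character function result, as in str_integer above.
  function str_integer(number) result(out)
    integer, intent(in) :: number
    character(str_bufferlen) :: out
    write(out, *) number
  end function str_integer
end module stringtools_sketch

program reproducer_sketch
  use stringtools_sketch
  implicit none
  integer :: ei, total
  character(str_bufferlen) :: msg

  total = 0
  ! Many threads doing concurrent internal writes, mirroring the
  ! find_worst_speeds -> str_number call chain from the backtrace.
  !$omp parallel do private(msg) reduction(+:total)
  do ei = 1, 10000000
    msg = 'Trouble in element ' // trim(adjustl(str_integer(ei)))
    total = total + len_trim(msg)   ! keep the result live
  end do
  !$omp end parallel do
  print *, 'checksum:', total
end program reproducer_sketch

Built with "gfortran -fopenmp"; given the failure rate, one would presumably have to run this many times (and under comparable memory pressure) to have a chance of triggering the crash.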