https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103985
Bug ID: 103985
Summary: segfault in finalize_transfer (fbuf_destroy) on (parallel) writing into character / string function argument
Product: gcc
Version: 10.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libfortran
Assignee: unassigned at gcc dot gnu.org
Reporter: thomas.or...@uni-hamburg.de
Target Milestone: ---

I noticed an infrequent crash of an OpenMP-parallelized scientific application that becomes observable when running hundreds of instances. A recent reproduction attempt on 250 Haswell nodes running Linux 4.14.259, with a loop over simulation setup and some short computation, showed 4 crashing runs against 3176 successful ones. Each run used 16 threads on a dual-socket node. At that rate, getting a reproducer is tricky.

The failing code is a rather trivial part of a rather elaborate codebase (one that has triggered a number of Fortran compiler bugs before, regardless of compiler vendor), so it could probably be extracted, but the low odds suggest that certain structures in memory may be necessary to really trip up the gfortran runtime. So I'll report the backtrace and some code fragments here; maybe this already points in a direction where one can home in on the underlying corruption.

This is the crash (each line is prefixed with a node name and seconds roughly since program start):

node292 @ 14.482751: model_setup: Done.
node292 @ 14.482965: memory/kiB peak=5924012 size=5924012 resp=3262324 ress=3225348
node292 @ 14.504973:
node292 @ 14.505052: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
node292 @ 14.505071:
node292 @ 14.505083: Backtrace for this error:
node292 @ 15.149332: #0  0x15523d2cd3ff in ???
node292 @ 15.149365: #1  0x15523d3174a9 in ???
node292 @ 15.149376: #2  0x15523d31820d in ???
node292 @ 15.150486: #3  0x15523e46f469 in finalize_transfer
node292 @ 15.150499:     at ../../../gcc-10.3.0/libgfortran/io/transfer.c:4228
node292 @ 15.150586: #4  0x430f6d in __stringtools_MOD_str_integer
node292 @ 15.150599:     at .preprocessed/misc/stringtools.f90:147
node292 @ 15.150789: #5  0x546b3f in find_worst_speeds
node292 @ 15.150800:     at .preprocessed/dgmodel/cfl.f90:309
node292 @ 15.150808: #6  0x546b3f in __cfl_MOD_cfl_timestep._omp_fn.0
node292 @ 15.150817:     at .preprocessed/dgmodel/cfl.f90:198
node292 @ 15.150939: #7  0x15523d8de4d1 in GOMP_parallel
node292 @ 15.150951:     at ../../../gcc-10.3.0/libgomp/parallel.c:171
node292 @ 15.150961: #8  0x547404 in __cfl_MOD_cfl_timestep
node292 @ 15.150969:     at .preprocessed/dgmodel/cfl.f90:195
node292 @ 15.151289: #9  0x548f59 in find_new_step
node292 @ 15.151300:     at .preprocessed/dgmodel/time_bdf.f90:635
node292 @ 15.151310: #10 0x54d98b in __time_bdf_MOD_time_bdf_explicit_part
node292 @ 15.151318:     at .preprocessed/dgmodel/time_bdf.f90:399
node292 @ 15.151326: #11 0x54d98b in __time_bdf_MOD_time_bdf_step
node292 @ 15.151334:     at .preprocessed/dgmodel/time_bdf.f90:489
node292 @ 15.151442: #12 0x526acc in __time_integration_MOD_timeint_step
node292 @ 15.151454:     at .preprocessed/dgmodel/time_integration.f90:141
node292 @ 15.151743: #13 0x40b058 in dgmodel
node292 @ 15.151754:     at .preprocessed/programs/dgmodel.f90:371
node292 @ 15.151764: #14 0x40760c in main
node292 @ 15.151772:     at .preprocessed/programs/dgmodel.f90:46
node292 @ 17.984833: 85.74user 6.77system 0:16.76elapsed 551%CPU (0avgtext+0avgdata 3262324maxresident)k
node292 @ 17.984892: 64680inputs+8737984outputs (18major+1123633minor)pagefaults 0swaps

The memory stats before/after the crash might be useful, too; I reckon that some significant memory action is necessary to observe the issue.
The respective code is inside an OpenMP section (line numbers prefixed):

186 !$omp parallel num_threads(cfl%threadcount) private(loopcount, ti)
187    loopcount = size(state%setup%grid%elements)
188    ti = ompapi_threadnum()
189
190
191    cfl%maxei(:,ti) = -1
192    cfl%max_e_speeds(:,ti) = 0._dp
193    cfl%speedratio(:,ti) = m_big
194
195 !$omp do
196    do ei=1,loopcount
197       call find_worst_speeds(state%setup%grid%elements(ei), state, mytype, cfl%maxei(:,ti), cfl%max_e_speeds(:,ti), cfl%speedratio&
198          &(:,ti), directions)
199    end do
200 !$omp end do
201 !$omp end parallel

(The loopcount business is a workaround for an earlier compiler bug with Sun Studio that messed up parallel loop boundaries, AFAIR.)

The routine find_worst_speeds does some computations on the handed-in structures and then does this:

308
309    if(err_follow_line('dgmodel/cfl.F90', 265, 'Trouble in element ' // trim(adjustl(str_number(element%id))))) return
310 end subroutine

The function err_follow_line checks a global error counter; if an error has been raised before, it prints the constructed string that was passed in and returns true. So, a simple kind of exception handling, sort of.

148 function err_follow_line(file, line, cmessage) result(lstatus)
149    character(*), intent(in) :: file
150    integer, intent(in) :: line
151    character(*), intent(in) :: cmessage
152    logical :: lstatus
153
154    if( .not. err_count == 0 ) then
155
156
157       if( (par_err_firstonly == 0 .or. backtrace) &
158            .and. (par_err_maxfollow < 1 .or. followed < par_err_maxfollow) ) then
159          followed = followed + (1)
160          write(0,'(A,A,A,A,A)') err_marker, ' error(follow), ', &
161               trim(err_filestring(file, line)), ' ', trim(cmessage)
162       end if
163       lstatus = .true.
164    else
165       lstatus = .false.
166    end if
167 end function err_follow_line

The construction of this string argument is where the trouble occurs. This is str_number, here the integer implementation behind the generic interface:

53 integer, parameter :: str_bufferlen = 100

143 function str_integer(number) result(out)
144    integer, intent(in) :: number
145    character(str_bufferlen) :: out
146
147    write(out, *) number
148 end function

The internal write to the character variable is what invokes the relevant section of libgfortran. To keep the context together, here is the block in which the crash occurs; the call to fbuf_destroy() looks like a plausible place for one. The position in the backtrace is stable, always pointing here:

4220  done:
4221
4222   if (dtp->u.p.unit_is_internal)
4223     {
4224       /* The unit structure may be reused later so clear the
4225          internal unit kind.  */
4226       dtp->u.p.current_unit->internal_unit_kind = 0;
4227
4228       fbuf_destroy (dtp->u.p.current_unit);
4229       if (dtp->u.p.current_unit
4230           && (dtp->u.p.current_unit->child_dtio == 0)
4231           && dtp->u.p.current_unit->s)
4232         {
4233           sclose (dtp->u.p.current_unit->s);
4234           dtp->u.p.current_unit->s = NULL;
4235         }
4236     }

I am not sure how much influence the parallel nature of the OpenMP section has, but I bet that it is causal. Hardware issues seem unlikely: the crash always occurs at the same place, and failures in ECC memory would be noticed.
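
In case it helps, this is a condensed, self-contained distillation of the pattern (module, program, and variable names are made up here; I have not confirmed that this minimal form reproduces the crash, given the observed failure rate of roughly 1 in 800 runs):

module stringtools_sketch
  implicit none
  integer, parameter :: str_bufferlen = 100
contains
  ! Internal WRITE into a character function result, as in str_integer above.
  function str_integer(number) result(out)
    integer, intent(in) :: number
    character(str_bufferlen) :: out
    write(out, *) number
  end function str_integer
end module stringtools_sketch

program reproducer_sketch
  use stringtools_sketch
  implicit none
  integer :: ei, total
  character(str_bufferlen) :: msg

  total = 0
  ! Many threads doing concurrent internal writes, mirroring the
  ! find_worst_speeds -> str_number call chain from the backtrace.
  !$omp parallel do private(msg) reduction(+:total)
  do ei = 1, 10000000
    msg = 'Trouble in element ' // trim(adjustl(str_integer(ei)))
    total = total + len_trim(msg)   ! keep the result live
  end do
  !$omp end parallel do
  print *, 'checksum:', total
end program reproducer_sketch

Built with "gfortran -fopenmp"; given the failure rate, one would presumably have to run this many times (and under comparable memory pressure) to have a chance of triggering the crash.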