Hi Philippe,


Upgraded to 3.9.0 as you suggested and ran with these options:



                -v -v -v -d -d -d --trace-sched=yes --trace-syscalls=yes
--trace-signals=yes --quiet --track-origins=yes --free-fill=7a
--child-silent-after-fork=yes --fair-sched=no



After some time, a bunch of processes went into 'pipe_w' status.  These
were single-threaded processes.  Their logfiles (which were enormous -
hundreds of gigabytes!) all contained this line:



                --23014--   SCHED[3]: TRC: YIELD
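

In case it matters, I was spotting the stuck processes via ps's wait-channel
column, roughly like this ('pipe_w' is the kernel wait channel for a blocked
pipe read/write; the grep pattern matches the memcheck- command name that
lsof shows below):

                $ ps -eo pid,stat,wchan:20,comm | grep memcheck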



Each of the processes showed only one thread:



                GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5_5.1)
                Copyright (C) 2009 Free Software Foundation, Inc.
                License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
                This is free software: you are free to change and redistribute it.
                There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
                and "show warranty" for details.
                This GDB was configured as "x86_64-redhat-linux-gnu".
                For bug reporting instructions, please see:
                <http://www.gnu.org/software/gdb/bugs/>.
                Attaching to process 2071
                Reading symbols from /apps1/pkgs/valgrind-3.9.0/lib/valgrind/memcheck-amd64-linux...done.
                0x000000003804b559 in do_syscall_WRK ()
                (gdb) where
                #0  0x000000003804b559 in do_syscall_WRK ()
                #1  0x000000003804b94a in vgPlain_do_syscall (sysno=1028, a1=34516426208, a2=1, a3=18446744073709551615, a4=0, a5=0, a6=0, a7=0, a8=0) at m_syscall.c:674
                #2  0x0000000038035d44 in vgPlain_read (fd=1, buf=0xffffffffffffffff, count=<value optimized out>) at m_libcfile.c:158
                #3  0x00000000380daa98 in vgModuleLocal_sema_down (sema=0x802001830, as_LL=0 '\000') at m_scheduler/sema.c:109
                #4  0x0000000038083687 in vgPlain_acquire_BigLock_LL (tid=1, who=0x80956dde0 "") at m_scheduler/scheduler.c:355
                #5  vgPlain_acquire_BigLock (tid=1, who=0x80956dde0 "") at m_scheduler/scheduler.c:277
                #6  0x00000000380838f5 in vgPlain_scheduler (tid=<value optimized out>) at m_scheduler/scheduler.c:1227
                #7  0x00000000380b28b6 in thread_wrapper (tidW=1) at m_syswrap/syswrap-linux.c:103
                #8  run_a_thread_NORETURN (tidW=1) at m_syswrap/syswrap-linux.c:156
                #9  0x0000000000000000 in ?? ()
                (gdb) info threads
                * 1 process 2071  0x000000003804b559 in do_syscall_WRK ()
                (gdb)
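

Next time one wedges, I can also dump the big-lock semaphore and the
arguments of the blocked read() straight from those frames - something like
this (assuming gdb can see the vg_sema_t through the memcheck tool's debug
info, as the line numbers in the backtrace suggest):

                (gdb) frame 3
                (gdb) print *sema
                (gdb) frame 2
                (gdb) info args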



strace showed the same as before (i.e. a read() blocked on a high-numbered
file descriptor, around 1026 or 1027).  Someone suggested that this would
indicate valgrind is calling dup2 to create new file descriptors.  Evidence
from lsof bears this out, showing only 77 open files for each process.  The
fds not relevant to our application are:



                COMMAND    PID    USER    FD   TYPE DEVICE SIZE    NODE NAME
                memcheck- 2071 nbezj7v    5r  FIFO    0,6      297571407 pipe
                memcheck- 2071 nbezj7v    7u  sock    0,5      297780139 can't identify protocol
                memcheck- 2071 nbezj7v    8w  FIFO    0,6      297571410 pipe
                memcheck- 2071 nbezj7v    9r   CHR    1,3           3908 /dev/null
                memcheck- 2071 nbezj7v   10r   DIR  253,0 4096        2 /
                memcheck- 2071 nbezj7v 1025u   REG  253,0  637 1114475 /tmp/valgrind_proc_2071_cmdline_ad8659c2 (deleted)
                memcheck- 2071 nbezj7v 1026u   REG  253,0  256 1114491 /tmp/valgrind_proc_2071_auxv_ad8659c2 (deleted)
                memcheck- 2071 nbezj7v 1028r  FIFO    0,6      297571563 pipe
                memcheck- 2071 nbezj7v 1029w  FIFO    0,6      297571563 pipe
                memcheck- 2071 nbezj7v 1030r  FIFO  253,0        1114706 /tmp/vgdb-pipe-from-vgdb-to-2071-by-USERNAME-on-???
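

To confirm that the blocked fd really is the big-lock pipe, I plan to
correlate strace with /proc on the next hang (2071 stands for whichever pid
is stuck; fd 1028 comes from the backtrace above):

                $ strace -p 2071 -e trace=read
                $ ls -l /proc/2071/fd/1028

If the second command names the same pipe inode as the 1028r/1029w pair in
the listing above (297571563), that should settle it.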



I tried vgdb, but without much luck.  After invoking 'valgrind --vgdb=yes
--vgdb-error=0 /path/to/my/exe', I got this in another terminal:



                $ gdb /path/to/my/exe
                GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5_5.1)
                Copyright (C) 2009 Free Software Foundation, Inc.
                License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
                This is free software: you are free to change and redistribute it.
                There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
                and "show warranty" for details.
                This GDB was configured as "x86_64-redhat-linux-gnu".
                For bug reporting instructions, please see:
                <http://www.gnu.org/software/gdb/bugs/>...
                "/path/to/my/exe": not in executable format: File truncated
                (gdb) target remote | /apps1/pkgs/valgrind-3.9.0/bin/vgdb --pid=30352
                Remote debugging using | /apps1/pkgs/valgrind-3.9.0/bin/vgdb --pid=30352
                relaying data between gdb and process 30352
                Remote register badly formatted: T0506:0000000000000000;07:30f0fffe0f000000;10:700aa05d38000000;thread:7690;
                here: 00000000;07:30f0fffe0f000000;10:700aa05d38000000;thread:7690;
                Try to load the executable by `file' first,
                you may also check `set/show architecture'.



This also caused the vgdb server to hang up.  Trying again after the 'file'
command made no difference.  The "not in executable format" error is
entirely expected - we run an optimised, lightweight "test shell" process
which loads a bunch of heavyweight debug .so's.
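

Given gdb's own hint, is forcing the architecture before connecting the
right workaround when the executable itself can't be loaded?  Something
like this (an untested guess on my side):

                $ gdb
                (gdb) set architecture i386:x86-64
                (gdb) target remote | /apps1/pkgs/valgrind-3.9.0/bin/vgdb --pid=30352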



What is the next step?  Can I try different options, or perhaps instrument
or change the source code in some way to figure out what is happening?
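

One experiment I can run in the meantime is a single standalone invocation
outside the python process-pool, to rule out an interaction with
multiprocessing (same options and install path as above; the log path is
just an example):

                /apps1/pkgs/valgrind-3.9.0/bin/valgrind -v -v -v -d -d -d \
                    --trace-sched=yes --trace-syscalls=yes --trace-signals=yes \
                    --quiet --track-origins=yes --free-fill=7a \
                    --child-silent-after-fork=yes --fair-sched=no \
                    --log-file=/tmp/vg_single.log /path/to/my/exe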



Thanks,

David.


On Sunday, January 26, 2014, David Carter <dch...@gmail.com> wrote:

> Thank you very much, Philippe,
>
> The --fair-sched option was set in an attempt to fix this.  I had read
> about interminable FUTEX_WAIT status and I think that was one of the
> suggestions.  Clearly it doesn't make any difference.
>
> I think I've tried 3.9.0, but I will double-check and run that one from
> now on anyway.
>
> I have tried connecting with gdb and there wasn't much visible. I'll try
> again though and also try vgdb - I was unaware of this tool.
>
> Not sure what is getting locked, whether it's Valgrind or our code.  We do
> use threading but only in a limited way, and I'm pretty sure memcheck is
> hanging up on single-threaded cases.  Hopefully the extra logging etc will
> reveal something. I can't easily log onto the machine from here - I'll run
> the experiments you suggest and report back in a short while.
>
> One thing I didn't mention, which might be important, is that I run
> valgrind through a python-driven process-pool.  I use the multiprocessing
> module to spawn off a bunch of valgrinds.  I don't think it's relevant, as
> it was working fine for several weeks like this before the hang-ups started.
>
> Best wishes and thanks again,
> David.
>
>
>
> On Sun, Jan 26, 2014 at 1:07 PM, Philippe Waroquiers <
> philippe.waroqui...@skynet.be> wrote:
>
>
>> On Sun, 2014-01-26 at 02:20 +0000, David Carter wrote:
>> > Hi,
>> >
>> >
>> > I've got an issue with memcheck in Valgrind 3.8.1 hanging.  I've left
>> > processes running for weeks or even months but they don't complete
>> > (normally these processes run in a few minutes tops, and they were
>> > working fine with memcheck until a while ago).
>> >
>> >
>> > Has anyone seen anything like this before?  Here are the details:
>> >
>> >
>> > options:
>> >
>> > --quiet --track-origins=yes --free-fill=7a
>> > --child-silent-after-fork=yes --fair-sched=no --log-file=/path/to/log
>> >  --suppressions=/path/to/suppression.file
>> >
>> >
>> >
>> > strace shows:
>> >
>> > Process 5223 attached - interrupt to quit
>> >
>> > read(1027,
>> With --fair-sched=no, valgrind uses a pipe to implement a "big lock".
>> It is however not clear with what you have shown if this 1027 is
>> the valgrind pipe big lock fd. If yes, then it looks like a bug in
>> valgrind, as the above read means a thread wants to acquire the big
>> lock to run, but the thread currently holding the lock does not
>> release it.
>>
>> Here are various suggestions:
>> 1. when you are in the above blocked state, use gdb+vgdb
>>    to connect to your process, and examine the state
>>    of your process (e.g. which thread is doing what)
>>    (the most likely cause of deadlock/problem is your application, not
>>    valgrind, at least when looking at your mail with
>>    a "valgrind developer hat on" :).
>>
>> 2. upgrade to 3.9.0, there are many bugs solved since 3.8.1
>>    (probably not yours, I do not see anything related to deadlock
>>     but one never knows).
>>
>> 3. run with a lot more traces e.g.
>>     -v -v -v -d -d -d --trace-sched=yes --trace-syscalls=yes
>> --trace-signals=yes
>>   and see if there is some suspicious output.
>>
>> Philippe
>>
>>
>>
>>
>