On Fri, 2014-01-31 at 16:34 +0000, David Carter wrote:
Hello David,

> Hi Philippe,
> 
>  
> 
> Upgraded to 3.9.0 as you suggested and ran with these options:
> 
>  
> 
>                 -v -v -v -d -d -d --trace-sched=yes
> --trace-syscalls=yes --trace-signals=yes --quiet --track-origins=yes
> --free-fill=7a --child-silent-after-fork=yes --fair-sched=no
> 
>  
> 
> After some time, a bunch of processes went into 'pipe_w' status.
> These were single-threaded processes.  Their logfiles (which were
> enormous - hundreds of gigabytes!) all contained this line:
> 
>  
> 
>                 --23014--   SCHED[3]: TRC: YIELD
The above trace is strange: SCHED[3] indicates that valgrind
thread id 3 is doing the trace. That seems to indicate that there
was (at least at some point in time) more than one thread in play.
The YIELD scheduler trace is explained as:
      case VEX_TRC_JMP_YIELD:
         /* Explicit yield, because this thread is in a spin-lock
            or something.  Only let the thread run for a short while
            longer.  Because swapping to another thread is expensive,
            we're prepared to let this thread eat a little more CPU
            before swapping to another.  That means that short term
            spins waiting for hardware to poke memory won't cause a
            thread swap. */
         if (dispatch_ctr > 1000) 
            dispatch_ctr = 1000;
         break;


> 
>  
> 
> Each of the processes showed only one thread:
I have already seen in the past that GDB is not always able to show
all the threads (it is not clear when this happens; I suspect it
might be with a statically linked thread library?).

To double-check the number of threads, you can do one of the following:
   from the shell:
       vgdb -l
and then, for each reported PIDNR:
       vgdb --pid=<PIDNR> v.info scheduler
That will show the state of the valgrind scheduler and the list
of threads known by valgrind, and ask valgrind to produce a
(guest) stack trace for each thread.

Or alternatively:
       ls /proc/<PIDNR>/task
will list the thread IDs known at the Linux level.


>                 #3  0x00000000380daa98 in vgModuleLocal_sema_down
> (sema=0x802001830, as_LL=0 '\000') at m_scheduler/sema.c:109
> 
>                 #4  0x0000000038083687 in vgPlain_acquire_BigLock_LL
> (tid=1, who=0x80956dde0 "") at m_scheduler/scheduler.c:355
> 
>                 #5  vgPlain_acquire_BigLock (tid=1, who=0x80956dde0
> "") at m_scheduler/scheduler.c:277
The above indicates that the thread is trying to acquire the
valgrind "big lock". When using --fair-sched=no, the big lock is
implemented using a pipe: writing a character to the pipe releases
the lock, and reading the character from the pipe acquires it.
If the process is blocked reading from the pipe, then it looks
like the "lock" character was never written back?
Maybe run strace -f on valgrind and see what is happening with the
lock character. The lock character cycles over A .. Z and then back
to A. Check that the lock is properly released shortly before the
attempt to re-acquire it.

>  
> 
> strace showed the same as before (i.e. read on a high-numbered
> filehandle, around 1026 or 1027).  Someone has suggested that this
> would indicate that valgrind is calling dup2 to create new
> filehandles.  Evidence from lsof also bears this out, showing only 77
> open files for each process.  The fd's not relevant to our application
> are:
In the below, I guess that 1028 and 1029 are the two ends of the pipe
used for the valgrind lock (they refer to the same pipe inode,
297571563).

> 
>  
> 
>                 COMMAND    PID    USER   FD   TYPE DEVICE        SIZE
> NODE NAME
> 
>                 memcheck- 2071 nbezj7v    5r  FIFO    0,6
> 297571407 pipe
> 
>                 memcheck- 2071 nbezj7v    7u  sock    0,5
> 297780139 can't identify protocol
> 
>                 memcheck- 2071 nbezj7v    8w  FIFO    0,6
> 297571410 pipe
> 
>                 memcheck- 2071 nbezj7v    9r   CHR    1,3
> 3908 /dev/null
> 
>                 memcheck- 2071 nbezj7v   10r   DIR  253,0        4096
> 2 /
> 
>                 memcheck- 2071 nbezj7v 1025u   REG  253,0         637
> 1114475 /tmp/valgrind_proc_2071_cmdline_ad8659c2 (deleted)
> 
>                 memcheck- 2071 nbezj7v 1026u   REG  253,0         256
> 1114491 /tmp/valgrind_proc_2071_auxv_ad8659c2 (deleted)
> 
>                 memcheck- 2071 nbezj7v 1028r  FIFO    0,6
> 297571563 pipe
> 
>                 memcheck- 2071 nbezj7v 1029w  FIFO    0,6
> 297571563 pipe
> 
>                 memcheck- 2071 nbezj7v 1030r  FIFO  253,0
> 1114706 /tmp/vgdb-pipe-from-vgdb-to-2071-by-USERNAME-on-???
> 
>  
> 
> I tried vgdb, but not a lot of luck.  After invoking 'valgrind
> --vgdb=yes --vgdb-error=0 /path/to/my/exe', I then got this in another
> terminal:
> 
>  
> 
>                 $ gdb /path/to/my/exe

>                 "/path/to/my/exe": not in executable format: File
> truncated
> 
>                 (gdb) target remote
> | /apps1/pkgs/valgrind-3.9.0/bin/vgdb --pid=30352
> 
>                 Remote debugging using
> | /apps1/pkgs/valgrind-3.9.0/bin/vgdb --pid=30352
> 
>                 relaying data between gdb and process 30352
> 
>                 Remote register badly formatted:
> T0506:0000000000000000;07:30f0fffe0f000000;10:700aa05d38000000;thread:7690;
> 
>                 here:
> 00000000;07:30f0fffe0f000000;10:700aa05d38000000;thread:7690;
> 
>                 Try to load the executable by `file' first,
> 
>                 you may also check `set/show architecture'.
> 
>  
> 
> This also caused the vgdb server to hang up.  Trying the 'file'
> command made no difference.  The "not in executable format" is totally
> expected - we run an optimised lightweight "test shell" process which
> loads a bunch of heavy debug .so's.
The "not in executable format" is not expected by me :).

What kind of executable is that? I thought gdb should be able
to "understand" any executable that is launchable on e.g. Red Hat 5
(if I guessed the distro properly).

In the shell, what does 'file /path/to/my/exe' report?

Maybe you could download and compile the latest gdb version (7.6, or 7.7
if it has just been released) and see if gdb is now more intelligent?
> 
>  
> 
> What is the next stage, can I try different options? Or perhaps
> instrument/change the source code in some way in order to figure out
> what is happening?

As detailed above, you could:
  * confirm the list of threads in your process
     (ls /proc/..., vgdb v.info scheduler)
  * if v.info scheduler works, you might guess what your
    application is doing from the stack trace(s)
  * maybe debug using a newer gdb
  * strace -f valgrind .... /path/to/my/exe
    might also shed some light on what happens with the valgrind big
    lock.

Philippe 




_______________________________________________
Valgrind-users mailing list
Valgrind-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/valgrind-users
