Re: [zeromq-dev] are zmq::atomic_ptr_t<> Helgrind warnings known?

2018-02-25 Thread Luca Boccassi
On Sun, 2018-02-25 at 13:22 -0500, Bill Torpey wrote:
> Hi Franceso:
> 
> A few more points below …
> 
> Good luck, and please post back if you find out anything interesting!
> 
> Regards,
> 
> Bill
> 
> > On Feb 25, 2018, at 4:54 AM, Francesco  > m> wrote:
> > 
> > Hi Bill,
> > thanks for your answer.
> > 
> > 
> > 
> > 2018-02-24 21:49 GMT+01:00 Bill Torpey  > >:
> > ...
> > If T2 and T4 were application code, this would be a clear violation
> > of ZMQ’s threading rules (assuming “legacy”, non-thread-safe socket
> > types).  
> > Right. 
> > 
> >  
> > For instance, one technique would be to perform initialization
> > (e.g., bind’s) in the main thread, and only after that is done spin
> > up other threads to do processing.  In this case, TSAN wouldn't
> > have any way to know that the application guarantees that accesses
> > cannot result in a race, so TSAN would flag it.  I’ve gotten in the
> > habit of using mutexes to protect code like this even though it
> > should not strictly be needed, just to keep tools like TSAN happy,
> > and also because I don’t know the ZMQ code well enough to be 100%
> > certain that the mutexes are not necessary — better safe than
> > sorry!
> > Yeah, that's a possibility but it results in a lot of "clutter"
> > that decrease code readability and makes it harder to maintain in
> > the long run...
> > 
> > This situation is different, though, since T4 is not an application
> > thread — it’s an internal ZMQ worker thread.   So, I think in this
> > case we kind of have to accept that ZMQ is doing the right thing
> > here.
> > At least, that’s the approach I’ve been taking.  When I instrument
> > my apps and libraries with TSAN I specifically do NOT instrument
> > ZMQ, and I also use the “called_from_lib:libzmq.so” suppression
> > (which is listed as an example for TSAN: https://github.com/google/
> > sanitizers/wiki/ThreadSanitizerSuppressions
> >  > ons>).
> > Understood. I have a question though: if you
> > use   “called_from_lib:libzmq.so”  suppression, are you able to
> > spot the race condition due to T2 and T4 being application threads
> > (instead of being 1 application and 1 zmq internal)? 
> > I wonder if TSAN, detecting that one of the 2 threads generating
> > the data race is inside ZMQ, entirely suppress the race warning or
> > instead will suppress only race conditions involving 2 internal zmq
> > threads..
> 
> Good question.  I could only find one post that discusses this
> suppression: https://groups.google.com/forum/#!topic/thread-sanitizer
> /NEcgiPEG0N8  sanitizer/NEcgiPEG0N8>
> 
>   called_from_lib suppresses only interceptors (like read or
> memset) called directly from the lib. It's intended for non-
> instrumented libraries. 
> 
> However, when I try this with my test code, enabling the suppression
> actually increases the number of false positives reported by
> TSAN.  Disabling the suppression results in a smaller number of
> mostly different false positives.  You may want to experiment with
> this — I plan to take another look at whether enabling this
> suppression is a good idea based on what I’ve seen in my tests.
> 
> Unfortunately, it’s not possible to use the race:libzmq.so
> suppression to avoid all false positives in ZMQ, since that
> suppresses ALL warnings where libzmq.so appears ANYWHERE in the stack
> trace, and that is much too broad.
> 
> So, there’s no simple answer.  I’ve developed some scripts that parse
> the output of TSAN and generate MD5 hashes of the stack traces, which
> can then be used to suppress individual stack traces.  Going that
> route is a lot of work, but it’s the only way I know of at this time
> to provide more granular suppressions with TSAN.  It would be nice if
> the suppression mechanism in TSAN were more robust (e.g., more like
> valgrind’s), but it isn’t.
> 
> > 
> > 
> > Instrumenting libzmq and/or omitting the suppression causes a LOT
> > of warnings, esp. in the ZMQ worker threads.  So, unless I'm
> > willing to commit the time and effort to go through and investigate
> > each of these warnings, I feel I have little choice but to accept
> > that at this point in its lifetime ZMQ should be race-free for all
> > practical purposes.
> > 
> > FWIW, I’ve done fairly extensive testing, and specifically stress
> > testing, and have yet to find anything that looks like an honest-
> > to-goodness bug in ZMQ.  (Which is not to say that the docs are
> > always clear about what to expect in certain situations ;-)  I did
> > have one problem which appears to have been a bug in epoll, and
> > which was resolved by upgrading Linux, but that’s it.
> > 
> > I agree, I was just surprised to see so many warnings... 
> > 
> >  
> > BTW, there’s an excellent overview of how this all works at http://
> > 

Re: [zeromq-dev] are zmq::atomic_ptr_t<> Helgrind warnings known?

2018-02-25 Thread Bill Torpey
Hi Franceso:

A few more points below …

Good luck, and please post back if you find out anything interesting!

Regards,

Bill

> On Feb 25, 2018, at 4:54 AM, Francesco  wrote:
> 
> Hi Bill,
> thanks for your answer.
> 
> 
> 
> 2018-02-24 21:49 GMT+01:00 Bill Torpey  >:
> ...
> If T2 and T4 were application code, this would be a clear violation of ZMQ’s 
> threading rules (assuming “legacy”, non-thread-safe socket types).  
> Right. 
> 
>  
> For instance, one technique would be to perform initialization (e.g., bind’s) 
> in the main thread, and only after that is done spin up other threads to do 
> processing.  In this case, TSAN wouldn't have any way to know that the 
> application guarantees that accesses cannot result in a race, so TSAN would 
> flag it.  I’ve gotten in the habit of using mutexes to protect code like this 
> even though it should not strictly be needed, just to keep tools like TSAN 
> happy, and also because I don’t know the ZMQ code well enough to be 100% 
> certain that the mutexes are not necessary — better safe than sorry!
> Yeah, that's a possibility but it results in a lot of "clutter" that decrease 
> code readability and makes it harder to maintain in the long run...
> 
> This situation is different, though, since T4 is not an application thread — 
> it’s an internal ZMQ worker thread.   So, I think in this case we kind of 
> have to accept that ZMQ is doing the right thing here.
> At least, that’s the approach I’ve been taking.  When I instrument my apps 
> and libraries with TSAN I specifically do NOT instrument ZMQ, and I also use 
> the “called_from_lib:libzmq.so” suppression (which is listed as an example 
> for TSAN: 
> https://github.com/google/sanitizers/wiki/ThreadSanitizerSuppressions 
> ).
> Understood. I have a question though: if you use   
> “called_from_lib:libzmq.so”  suppression, are you able to spot the race 
> condition due to T2 and T4 being application threads (instead of being 1 
> application and 1 zmq internal)? 
> I wonder if TSAN, detecting that one of the 2 threads generating the data 
> race is inside ZMQ, entirely suppress the race warning or instead will 
> suppress only race conditions involving 2 internal zmq threads..

Good question.  I could only find one post that discusses this suppression: 
https://groups.google.com/forum/#!topic/thread-sanitizer/NEcgiPEG0N8 


called_from_lib suppresses only interceptors (like read or memset) 
called directly from the lib. It's intended for non-instrumented libraries. 

However, when I try this with my test code, enabling the suppression actually 
increases the number of false positives reported by TSAN.  Disabling the 
suppression results in a smaller number of mostly different false positives.  
You may want to experiment with this — I plan to take another look at whether 
enabling this suppression is a good idea based on what I’ve seen in my tests.

Unfortunately, it’s not possible to use the race:libzmq.so suppression to avoid 
all false positives in ZMQ, since that suppresses ALL warnings where libzmq.so 
appears ANYWHERE in the stack trace, and that is much too broad.

So, there’s no simple answer.  I’ve developed some scripts that parse the 
output of TSAN and generate MD5 hashes of the stack traces, which can then be 
used to suppress individual stack traces.  Going that route is a lot of work, 
but it’s the only way I know of at this time to provide more granular 
suppressions with TSAN.  It would be nice if the suppression mechanism in TSAN 
were more robust (e.g., more like valgrind’s), but it isn’t.

> 
> 
> Instrumenting libzmq and/or omitting the suppression causes a LOT of 
> warnings, esp. in the ZMQ worker threads.  So, unless I'm willing to commit 
> the time and effort to go through and investigate each of these warnings, I 
> feel I have little choice but to accept that at this point in its lifetime 
> ZMQ should be race-free for all practical purposes.
> 
> FWIW, I’ve done fairly extensive testing, and specifically stress testing, 
> and have yet to find anything that looks like an honest-to-goodness bug in 
> ZMQ.  (Which is not to say that the docs are always clear about what to 
> expect in certain situations ;-)  I did have one problem which appears to 
> have been a bug in epoll, and which was resolved by upgrading Linux, but 
> that’s it.
> 
> I agree, I was just surprised to see so many warnings... 
> 
>  
> BTW, there’s an excellent overview of how this all works at 
> http://zeromq.org/whitepapers:architecture 
>  — although it’s somewhat old, it 
> appears to still be relatively accurate.
> 
> thanks for the link, I read that and it's quite interesting. However it does 
> not mention how multi-thread 

Re: [zeromq-dev] are zmq::atomic_ptr_t<> Helgrind warnings known?

2018-02-25 Thread Francesco
Hi Bill,
thanks for your answer.



2018-02-24 21:49 GMT+01:00 Bill Torpey :
>
> ...
>
If T2 and T4 were application code, this would be a clear violation of
> ZMQ’s threading rules (assuming “legacy”, non-thread-safe socket types).
>
Right.



> For instance, one technique would be to perform initialization (e.g.,
> bind’s) in the main thread, and only after that is done spin up other
> threads to do processing.  In this case, TSAN wouldn't have any way to know
> that the application guarantees that accesses cannot result in a race, so
> TSAN would flag it.  I’ve gotten in the habit of using mutexes to protect
> code like this even though it should not strictly be needed, just to keep
> tools like TSAN happy, and also because I don’t know the ZMQ code well
> enough to be 100% certain that the mutexes are not necessary — better safe
> than sorry!
>
Yeah, that's a possibility but it results in a lot of "clutter" that
decrease code readability and makes it harder to maintain in the long run...

This situation is different, though, since T4 is not an application thread
> — it’s an internal ZMQ worker thread.   So, I think in this case we kind of
> have to accept that ZMQ is doing the right thing here.
> At least, that’s the approach I’ve been taking.  When I instrument my apps
> and libraries with TSAN I specifically do NOT instrument ZMQ, and I also
> use the “called_from_lib:libzmq.so” suppression (which is listed as an
> example for TSAN: https://github.com/google/sanitizers/wiki/
> ThreadSanitizerSuppressions).
>
Understood. I have a question though: if you use
“called_from_lib:libzmq.so”  suppression, are you able to spot the race
condition due to T2 and T4 being application threads (instead of being 1
application and 1 zmq internal)?
I wonder if TSAN, detecting that one of the 2 threads generating the data
race is inside ZMQ, entirely suppress the race warning or instead will
suppress only race conditions involving 2 internal zmq threads..


Instrumenting libzmq and/or omitting the suppression causes a LOT of
> warnings, esp. in the ZMQ worker threads.  So, unless I'm willing to commit
> the time and effort to go through and investigate each of these warnings, I
> feel I have little choice but to accept that at this point in its lifetime
> ZMQ should be race-free for all practical purposes.
>
> FWIW, I’ve done fairly extensive testing, and specifically stress testing,
> and have yet to find anything that looks like an honest-to-goodness bug in
> ZMQ.  (Which is not to say that the docs are always clear about what to
> expect in certain situations ;-)  I did have one problem which appears to
> have been a bug in epoll, and which was resolved by upgrading Linux, but
> that’s it.
>

I agree, I was just surprised to see so many warnings...



> BTW, there’s an excellent overview of how this all works at
> http://zeromq.org/whitepapers:architecture — although it’s somewhat old,
> it appears to still be relatively accurate.
>

thanks for the link, I read that and it's quite interesting. However it
does not mention how multi-thread safety is achieved.
Just out of curiosity I will take a look at ypipe implementation.



> I tried forcing use of C++11 internal atomics and all the warnings
> about  zmq::atomic_ptr_t<> disappear (I still get tons about ypipe and
> "pipe")..
> maybe it's best to prefer C++11 atomics when available?
>
>
> I would actually suggest using plain old pthread for synchronization
> primitives — TSAN’s support for pthread is pretty mature. Support for C++11
> is a bit less mature, and has certain constraints (
> https://clang.llvm.org/docs/ThreadSanitizer.html):
>

However using old pthread for synchronization you probably loose a lot in
performances... if C++11 atomics are supported by TSAN just fine, why not
preferring them over the GCC intrinsics?



> *ThreadSanitizer is in beta stage. It is known to work on large C++
> programs using pthreads, but we do not promise anything (yet). C++11
> threading is supported with llvm libc++. The test suite is integrated into
> CMake build and can be run with make check-tsan command.*
>
>
> In particular, to use C++11 you also need to use llvm’s libcxx, rather
> than gnu’s libstdc++.
>

I think that the quoted text is referring to "C++11 threading" that is to
http://en.cppreference.com/w/cpp/thread
As I mentioned from a simple test I do see that at least gcc 5.3.0 TSAN is
handling C++11 atomics fine (without forcing you to link LLVM libc++)




> Btw I will try to write a suppression file for ThreadSanitizer and ZMQ...
> I just start to doubt: is all ZMQ code really race-free? :)
>
>
> As mentioned above, that is likely to be a huge undertaking.  At this
> point, I think the only confirmation we have that ZMQ is race-free is based
> on “black-box” testing.  The good news is that ZMQ appears to be used
> widely enough that the code is well-tested.  (Of course, failure to find a
> bug doesn’t prove that there isn’t one, so