subject:"Re\: STDCXX forks"

On Mon, Sep 17, 2012 at 11:17 AM, Liviu Nicoara nikko...@hates.ms wrote:
I hope you agree that this synchronization is sufficient for the facet
initialization and reading of facet data.

I have reduced the number of reported race conditions in
22.locale.numpunct.mt from 12440:

http://s247136804.onlinehome.us/stdcxx-1056-SPARC-20120917/22.locale.numpunct.mt.nts.1.er.html/index.html

to 288:

http://s247136804.onlinehome.us/stdcxx-1056-20120918/22.locale.numpunct.mt.5.er.html/index.html

The changes are in the following files:

http://s247136804.onlinehome.us/stdcxx-1056-20120918/facet.cpp
http://s247136804.onlinehome.us/stdcxx-1056-20120918/punct.cpp

_numpunct.h looks like this:

http://s247136804.onlinehome.us/stdcxx-1056-20120918/_numpunct.h

With these changes, no races conditions are repoted for any of the
functions in std::numpunctT.

Still, there are 288 race conditions being reported in
__rw_locale::__rw_locale and in std::locale::_C_get_facet. We need to
identify the source and cause of these race conditions and correct
them as well.

This is not a complete solution to the problem, because we still have
to re-write the chunk of code I eliminated from facet.cpp. It is only
step one towards finding a real solution. But, at least for now, we
have pinpointed where the source of these race conditions is located,
and what causing it.

The test program was run as: ./22.locale.numpunct.mt --nthreads=8
--nloops=1.

More to follow.

--Stefan

--
Stefan Teleman
KDE e.V.
stefan.tele...@gmail.com

Re: STDCXX-1056 [was: Re: STDCXX forks]


On 09/18/12 08:55, Stefan Teleman wrote:

On Mon, Sep 17, 2012 at 11:17 AM, Liviu Nicoara nikko...@hates.ms wrote:

I hope you agree that this synchronization is sufficient for the facet
initialization and reading of facet data.


I have reduced the number of reported race conditions in
22.locale.numpunct.mt from 12440:



I am attaching a test program which, while 100% MT-safe, is flagged by the 
Solaris thread analyzer. The test program contains, conceptually, the exact 
same of shared variable access like the locale management code, the facet 
management code and the facet data management code. This proves, if there was a 
need, that the tool results are to be further analyzed and interpreted but not 
taken at literal value.

The fix we are looking for is one which corrects the initial MT failure 
observed in caching the facet data, as well as it preserves the unguarded reads 
of facet data for performance reasons.

Thanks.

Liviu



http://s247136804.onlinehome.us/stdcxx-1056-SPARC-20120917/22.locale.numpunct.mt.nts.1.er.html/index.html

to 288:

http://s247136804.onlinehome.us/stdcxx-1056-20120918/22.locale.numpunct.mt.5.er.html/index.html

The changes are in the following files:

http://s247136804.onlinehome.us/stdcxx-1056-20120918/facet.cpp
http://s247136804.onlinehome.us/stdcxx-1056-20120918/punct.cpp

_numpunct.h looks like this:

http://s247136804.onlinehome.us/stdcxx-1056-20120918/_numpunct.h

With these changes, no races conditions are repoted for any of the
functions in std::numpunctT.

Still, there are 288 race conditions being reported in
__rw_locale::__rw_locale and in std::locale::_C_get_facet. We need to
identify the source and cause of these race conditions and correct
them as well.

This is not a complete solution to the problem, because we still have
to re-write the chunk of code I eliminated from facet.cpp. It is only
step one towards finding a real solution. But, at least for now, we
have pinpointed where the source of these race conditions is located,
and what causing it.

The test program was run as: ./22.locale.numpunct.mt --nthreads=8
--nloops=1.

More to follow.

--Stefan

#include stdio.h
#include stdlib.h
#include pthread.h
#include unistd.h

#define MAX_THREADS 128

static long counter = 0, nloops = 1000, nthreads = 16;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

extern C {

static void* 
f (void* pv)
{
for (size_t i = 0; i  nloops; ++i) {
if (counter == 0) {
pthread_mutex_lock (lock);
if (counter == 0)
++counter;
pthread_mutex_unlock (lock);
}
else {
// counter value is safe to use here
}
}

printf (%ld\n, counter);

return 0;
}

}

int
main (int argc, char** argv)
{
switch (argc) {
case 3:
nloops = atol (argv [2]);
case 2:
nthreads = atol (argv [1]);
break;
}

pthread_t tid [MAX_THREADS] = { 0 };

if (nthreads  MAX_THREADS)
nthreads = MAX_THREADS;

for (int i = 0; i  nthreads; ++i) {
if (pthread_create (tid + i, 0, f, 0))
exit (-1);
}

for (int i = 0; i  nthreads; ++i) {
if (tid [i])
pthread_join (tid [i], 0);
}

return 0;
}

Re: STDCXX-1056 [was: Re: STDCXX forks]

On Tue, Sep 18, 2012 at 12:43 PM, Liviu Nicoara nikko...@hates.ms wrote:

 I am attaching a test program which, while 100% MT-safe, is flagged by
 the Solaris thread analyzer.

The program as written is not thread safe. It is reading the value of
the counter variable and performing a zero comparison outside of a
mutex lock:

for (size_t i = 0; i  nloops; ++i) {
if (counter == 0) {  // --- 
pthread_mutex_lock (lock);
if (counter == 0)
++counter;
pthread_mutex_unlock (lock);
}
else {
// counter value is safe to use here
}
}




-- 
Stefan Teleman
KDE e.V.
stefan.tele...@gmail.com

Re: STDCXX-1056 [was: Re: STDCXX forks]


On 09/18/12 13:21, Stefan Teleman wrote:

On Tue, Sep 18, 2012 at 12:43 PM, Liviu Nicoara nikko...@hates.ms wrote:


I am attaching a test program which, while 100% MT-safe, is flagged by
the Solaris thread analyzer.


The program as written is not thread safe. It is reading the value of
the counter variable and performing a zero comparison outside of a
mutex lock:


Stefan, I urge you to consider the argument on its merits. Yes, the thread 
analyzer flags it, but it is nonetheless MT-safe. Specifically:

1. writes are properly synchronized wrt themselves
2. reads are inherently thread-safe wrt themselves
3. reads are properly synchronized wrt writes
4.no thread can possibly observe an intermediate or otherwise incomplete value.

I will also add that the flag is either 0 or 1 during the execution of the 
program, with only one transition from 0 to 1, performed by one single thread.

I will concede that I might be wrong and I am open to arguments. I would accept 
as a counter-argument this program if you can show a runtime failure. I would 
also accept as argument a scenario under which two threads would see 
inconsistent/incorrect values or write the variable more than once, etc.

Thanks,
Liviu



for (size_t i = 0; i  nloops; ++i) {
 if (counter == 0) {  // --- 
 pthread_mutex_lock (lock);
 if (counter == 0)
 ++counter;
 pthread_mutex_unlock (lock);
 }
 else {
 // counter value is safe to use here
 }
 }

Re: STDCXX-1056 [was: Re: STDCXX forks]

On Tue, Sep 18, 2012 at 4:35 PM, Liviu Nicoara nikko...@hates.ms wrote:

 I will concede that I might be wrong and I am open to arguments. I would
 accept as a counter-argument this program if you can show a runtime failure.

The the first read of the counter variable is outside a mutex lock
correct? The read is followed by a 0 comparison, correct?

What guarantees that between the read and the comparison the value of
the counter variable hasn't been modified by another thread? The
thread currently doing the comparison cannot guarantee it: it hasn't
locked the mutex. Other threads may be running - actually they
probably are. Another thread may have already acquired the mutex and
incremented the value of counter. Your thread has no way of knowing if
that has happened, because it does not yet have exclusive access to
the counter variable. It will, after it acquires the mutex.

Where is it reading the variable  from? A register? Is it declared
volatile? L2 cache? L3 cache?

The program, as you wrote it, implicitly acknowledges that it is not
thread safe. That is the point of the double check: one before the
mutex lock, and one after it. The point of the first check has nothing
to do with thread-safety, and everything to do with a minor
optimization: if the value stored in variable counter is already not
zero, then there's no point in locking the mutex or performing the
increment.

--Stefan

-- 
Stefan Teleman
KDE e.V.
stefan.tele...@gmail.com

Re: STDCXX-1056 [was: Re: STDCXX forks]


On 09/18/12 18:24, Stefan Teleman wrote:

On Tue, Sep 18, 2012 at 4:35 PM, Liviu Nicoara nikko...@hates.ms wrote:


I will concede that I might be wrong and I am open to arguments. I would
accept as a counter-argument this program if you can show a runtime failure.


The the first read of the counter variable is outside a mutex lock
correct? The read is followed by a 0 comparison, correct?


Correct.



What guarantees that between the read and the comparison the value of
the counter variable hasn't been modified by another thread? The
thread currently doing the comparison cannot guarantee it: it hasn't
locked the mutex. Other threads may be running - actually they
probably are. Another thread may have already acquired the mutex and
incremented the value of counter. Your thread has no way of knowing if
that has happened, because it does not yet have exclusive access to
the counter variable. It will, after it acquires the mutex.


It does not matter if the value is changed concurrently in between the 
reading of it and the actual comparison. The value can only be 0, in 
which case it takes the branch that locks the mutex and the next read 
will absolutely see a fresh value; or it can be 1 in which case it has 
already been initialized and this thread does not pay the penalty of a 
lock anymore.




Where is it reading the variable  from? A register? Is it declared
volatile? L2 cache? L3 cache?


Irrelevant. If the value has not been changed it can be read from either 
the cache or directly from the memory. If the value has been changed 
concurrently in between the reading and the actual comparison, the 
processor sees either a stale value of 0, or the new value 1. It is 
guaranteed that any thread reading the value after the (one and only) 
unlock will see a fresh value.




The program, as you wrote it, implicitly acknowledges that it is not
thread safe. That is the point of the double check: one before the
mutex lock, and one after it.


That is a misstatement of the program intentions. See below.


The point of the first check has nothing
to do with thread-safety, and everything to do with a minor
optimization: if the value stored in variable counter is already not
zero, then there's no point in locking the mutex or performing the
increment.


The first check is indeed an optimization. It is the point of this 
exercise to show that the unguarded reads in the localization library 
are not defects and the code, simplified in my test case, is thread safe 
in exactly the respects I mentioned before: the program only observes 
consistent, correct values (program states) in a concurrent environment.


HTH.

Liviu

Re: STDCXX-1056 [was: Re: STDCXX forks]


On 9/18/12 7:04 PM, Stefan Teleman wrote:

On Tue, Sep 18, 2012 at 6:42 PM, Liviu Nicoara nikko...@hates.ms wrote:


The first check is indeed an optimization. It is the point of this exercise
to show that the unguarded reads in the localization library are not defects
and the code, simplified in my test case, is thread safe in exactly the
respects I mentioned before: the program only observes consistent, correct
values (program states) in a concurrent environment.


The Intel Thread Analyzer shows a data race in the exact same spot the
SunPro Thread Analyzer shows a data race, and in the exact same spot I
have stated earlier that there is a data race.

The results are available here:

http://s247136804.onlinehome.us/liviu-mt-test/22.mt.test.r000ti3.inspxez

As usual, it is an Intel Inspector XE 2003 Analysis Results Blob. It
opens with the Intel Inspector. i don't know what format it is in, or
whether or not it can be read by other means.

So far we have SunPro 12.3/Linux/Intel, SunPro 12.3/Solaris/Intel,
SunPro 12.3/Solaris/SPARC and Intel 2003/Linux/Intel saying that there
is a race condition in your program.

And now I would like to get back to solving the thread-unsafety and
race conditions problems in Apache stdcxx.


Stefan, you cannot ignore others' arguments and embrace a 'no' attitude; the 
tools are not saying what you think they do. They all show the same thing 
because they all do the exact same analysis and we could delve into that, too. I 
sincerely hoped you could come up with some sort of a logical argument that 
would prove my latest test case wrong, or at least show it failing at runtime.


My time is also limited and I have already spent more time than I wanted to 
trying to present a logical argument, down to the simplest test cases. I say we 
defer this to Martin for when he will be back from vacation, next week. For the 
record, because people may not have the energy to read all this back and forth, 
I will summarize here the gist of this thread:


1. The facet data caching is not MT-safe
2. The facet data initialization (STDCXX or system locales) is safe (*)
3. There is no unit test currently showing a failure in (2)
4. Timing results show that caching may be slower than non-caching in MT builds
5. A fix should, ideally, be binary compatible
6. A fix should, ideally, preserve performance or increase it
7. There is one patch, currently attached to the issue, by Stefan
8. Other partial patches are referenced from this thread

Please correct me if I missed anything. The above summary is a good starting 
point for a new thread dealing with this issue.


HTH.

Liviu

(*) When perfectly forwarding, with a theoretical concern I asked in an earlier 
post, directed at Martin.

Re: STDCXX-1056 [was: Re: STDCXX forks]

On Tue, Sep 18, 2012 at 7:38 PM, Liviu Nicoara nikko...@hates.ms wrote:

 1. The facet data caching is not MT-safe
 2. The facet data initialization (STDCXX or system locales) is safe (*)
 3. There is no unit test currently showing a failure in (2)
 4. Timing results show that caching may be slower than non-caching in MT
 builds
 5. A fix should, ideally, be binary compatible
 6. A fix should, ideally, preserve performance or increase it
 7. There is one patch, currently attached to the issue, by Stefan
 8. Other partial patches are referenced from this thread

 Please correct me if I missed anything. The above summary is a good starting

For the record, I fundamentally disagree with your assessment above.
It is not based on any verifiable and reproducible facts, analysis of
facts and/or measurements. It is based on assertions, which are, in
turn, based on other assertions.

As long as you continue dismissinsg the results from 4 [ FOUR ]
different thread analyzer, on 4 [ FOUR ] different operating systems,
and that solely on the basis on your assertions and beliefs, there is
no point in continuing this debate.

The results from all four thread analyzers contradict to all of your
assertions. If you firmly and strongly believe that you are always
right, and that the four thread analyzers are always wrong, that is
your choice.

--Stefan

-- 
Stefan Teleman
KDE e.V.
stefan.tele...@gmail.com

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-17 Thread Stefan Teleman

On Mon, Sep 17, 2012 at 8:46 AM, Liviu Nicoara nikko...@hates.ms wrote:

 In the meantime I would like to stress again that __rw_get_numpunct is
 perfectly thread-safe and does not need extra locking for perfect
 forwarding.

So, by removing the test for

  if (!(_C_flags  _RW::__rw_gr))

(or any other bitmask for that matter), the functions which were
thread-unsafe - and were exhibiting all the symptoms of a run-time
race condition -, magically became thread-safe?

I have looked *extensively* at the code in __rw_get_numpunct. It is
inherently thread-unsafe.

--Stefan

-- 
Stefan Teleman
KDE e.V.
stefan.tele...@gmail.com

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-17 Thread Liviu Nicoara


On 09/17/12 09:51, Stefan Teleman wrote:

On Mon, Sep 17, 2012 at 8:46 AM, Liviu Nicoara nikko...@hates.ms wrote:


In the meantime I would like to stress again that __rw_get_numpunct is
perfectly thread-safe and does not need extra locking for perfect
forwarding.


So, by removing the test for

   if (!(_C_flags  _RW::__rw_gr))

(or any other bitmask for that matter), the functions which were
thread-unsafe - and were exhibiting all the symptoms of a run-time
race condition -, magically became thread-safe?

I have looked *extensively* at the code in __rw_get_numpunct. It is
inherently thread-unsafe.


I mean to say that no extra locking is necessary when the public interface 
forwards and no caching is done.

In more detail: __rw_get_numpunct code is entered upon a call from the 
protected virtual interface. It calls facet _C_data member function which 
either returns objects constructed from previously-initialized POD data (in 
which case no locking is necessary), or it attempts to first initialize the 
facet data from the STDCXX database.

If the latter, the execution goes in the facet data initialization code which 
is appropriately synchronized, see this stack trace:

(gdb) bt
#0  __rw::__rw_facet::_C_get_data (this=0x6e10a0) at 
srcdir/stdcxx/branches/4.2.x/src/facet.cpp:179
#1  0x0043788b in __rw::__rw_facet::_C_data (this=0x6e10a0) at 
srcdir/stdcxx/branches/4.2.x/include/loc/_facet.h:194
#2  0x0044874f in __rw::__rw_get_numpunct (pfacet=0x6e10a0, flags=4) at 
srcdir/stdcxx/branches/4.2.x/src/punct.cpp:80
#3  0x00449a14 in __rw::__rw_get_punct (pfacet=0x6e10a0, flags=4) at 
srcdir/stdcxx/branches/4.2.x/src/punct.cpp:578
#4  0x0041cecf in std::numpunctchar::do_grouping (this=0x6e10a0) at 
srcdir/stdcxx/branches/4.2.x/include/loc/_numpunct.h:99
#5  0x0041bd31 in std::numpunctchar::grouping (this=0x6e10a0) at 
srcdir/stdcxx/branches/4.2.x/include/loc/_numpunct.h:190
#6  0x004036b8 in main (argc=2, argv=0x7fffe018) at 
srcdir/stdcxx/branches/4.2.x/tests/localization/s.cpp:29
(gdb)

When the above fails, the facet data has not been initialized, e.g., when there 
is no STDCXX locale database, or the name of the locale does not refer to a 
locale in STDCXX database. The __rw_get_numpunct function will attempt next to 
use libc and system locales, via the __rw_setlocale class, which again is 
properly synchronized and the scope of that lock extends to cover the facet 
data initialization. See this stack trace:

(gdb) bt
#0  __rw::__rw_setlocale::__rw_setlocale (this=0x7fffdd30, locname=0x6e112a 
en_US.utf8, cat=6, nothrow=0) at 
srcdir/stdcxx/branches/4.2.x/src/setlocale.cpp:84
#1  0x00448974 in __rw::__rw_get_numpunct (pfacet=0x6e10a0, flags=4) at 
srcdir/stdcxx/branches/4.2.x/src/punct.cpp:134
#2  0x00449a14 in __rw::__rw_get_punct (pfacet=0x6e10a0, flags=4) at 
srcdir/stdcxx/branches/4.2.x/src/punct.cpp:578
#3  0x0041cecf in std::numpunctchar::do_grouping (this=0x6e10a0) at 
srcdir/stdcxx/branches/4.2.x/include/loc/_numpunct.h:99
#4  0x0041bd31 in std::numpunctchar::grouping (this=0x6e10a0) at 
srcdir/stdcxx/branches/4.2.x/include/loc/_numpunct.h:190
#5  0x004036b8 in main (argc=2, argv=0x7fffe018) at 
srcdir/stdcxx/branches/4.2.x/tests/localization/s.cpp:29

I hope you agree that this synchronization is sufficient for the facet 
initialization and reading of facet data.

Thanks.

Liviu

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-17 Thread Stefan Teleman

On Mon, Sep 17, 2012 at 11:17 AM, Liviu Nicoara nikko...@hates.ms wrote:

 I hope you agree that this synchronization is sufficient for the facet
 initialization and reading of facet data.

Sorry, I do not agree. Two different thread analyzers from two
different compilers written by two different compiler writers tell me
not to.

--Stefan

-- 
Stefan Teleman
KDE e.V.
stefan.tele...@gmail.com

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-17 Thread Liviu Nicoara


On 09/17/12 11:21, Stefan Teleman wrote:

On Mon, Sep 17, 2012 at 11:17 AM, Liviu Nicoara nikko...@hates.ms wrote:


I hope you agree that this synchronization is sufficient for the facet
initialization and reading of facet data.


Sorry, I do not agree. Two different thread analyzers from two
different compilers written by two different compiler writers tell me
not to.


Stefan, that is indeed your prerogative. However, please keep in mind that 
tools may be buggy or may have limitations beyond what is advertised. I do not 
ask you to have faith in my analysis, which would be absurd, but to look for 
yourself, exercise due diligence in testing the code and draw your own 
conclusions.

Thanks,
Liviu

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-17 Thread Wojciech Meyer

Liviu Nicoara nikko...@hates.ms writes:

 On 09/17/12 11:21, Stefan Teleman wrote:
 On Mon, Sep 17, 2012 at 11:17 AM, Liviu Nicoara nikko...@hates.ms wrote:

 I hope you agree that this synchronization is sufficient for the facet
 initialization and reading of facet data.

 Sorry, I do not agree. Two different thread analyzers from two
 different compilers written by two different compiler writers tell me
 not to.

 Stefan, that is indeed your prerogative. However, please keep in mind
 that tools may be buggy or may have limitations beyond what is
 advertised. I do not ask you to have faith in my analysis, which would
 be absurd, but to look for yourself, exercise due diligence in testing
 the code and draw your own conclusions.

so which compilers do fail? You know, some of them might use the same
component.

-- 
Wojciech

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-17 Thread Stefan Teleman

On Mon, Sep 17, 2012 at 11:38 AM, Wojciech Meyer wojciech.me...@arm.com wrote:

 so which compilers do fail? You know, some of them might use the same
 component.

Intel Compiler/Thread Analyzer on Linux, SunPro Compiler/Thread
Analyzer on Linux and Solaris (Intel and SPARC). All three of them
show the same exact problems.

The Intel Compilers and the SunPro Compilers have nothing in common
with each other.

-- 
Stefan Teleman
KDE e.V.
stefan.tele...@gmail.com

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-16 Thread Stefan Teleman

On Sat, Sep 15, 2012 at 4:53 PM, Liviu Nicoara nikko...@hates.ms wrote:

 Now, to clear the confusion I created: the timing numbers I posted in the
 attachment stdcxx-1056-timings.tgz to STDCXX-1066 (09/11/2012) showed that a
 perfectly forwarding, no caching public interface (exemplified by a changed
 grouping) performs better than the current implementation. It was that test
 case that I hoped you could time, perhaps on SPARC, in both MT and ST
 builds. The t.cpp program is for MT, s.cpp for ST.

I got your patch, and have tested it.

I have created two Experiments (that's what they are called) with the
SunPro Performance Analyzer. Both experiments are targeting race
conditions and deadlocks in the instrumented program,  and both
experiments are running the 22.locale.numpunct.mt program from the
stdcxx test harness. One experiment is with  your patch applied. The
other experiment is with our (Solaris) patch applied.

Here are the results:

1. with your patch applied:

http://s247136804.onlinehome.us/22.locale.numpunct.mt.1.er.nts/

2. with our (Solaris) patch applied:

http://s247136804.onlinehome.us/22.locale.numpunct.mt.1.er.ts/

--Stefan

-- 
Stefan Teleman
KDE e.V.
stefan.tele...@gmail.com

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-16 Thread Liviu Nicoara


On 9/16/12 3:20 AM, Stefan Teleman wrote:

On Sat, Sep 15, 2012 at 4:53 PM, Liviu Nicoara nikko...@hates.ms wrote:


Now, to clear the confusion I created: the timing numbers I posted in the
attachment stdcxx-1056-timings.tgz to STDCXX-1066 (09/11/2012) showed that a
perfectly forwarding, no caching public interface (exemplified by a changed
grouping) performs better than the current implementation. It was that test
case that I hoped you could time, perhaps on SPARC, in both MT and ST
builds. The t.cpp program is for MT, s.cpp for ST.


I got your patch, and have tested it.


Thanks, Stefan. I looked over it and it seems very similar to, and somewhat more 
detailed than gprof profiling output.


I am going to update the incident shortly with a more detailed timing 
measurements on my side, in the form of a new attachment. Just FYI in case you 
still don't get notifications.


Liviu

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-16 Thread Liviu Nicoara


On 9/16/12 11:21 AM, Liviu Nicoara wrote:

On 9/16/12 3:20 AM, Stefan Teleman wrote:

On Sat, Sep 15, 2012 at 4:53 PM, Liviu Nicoara nikko...@hates.ms wrote:


Now, to clear the confusion I created: the timing numbers I posted in the
attachment stdcxx-1056-timings.tgz to STDCXX-1066 (09/11/2012) showed that a
perfectly forwarding, no caching public interface (exemplified by a changed
grouping) performs better than the current implementation. It was that test
case that I hoped you could time, perhaps on SPARC, in both MT and ST
builds. The t.cpp program is for MT, s.cpp for ST.


I got your patch, and have tested it.


Thanks, Stefan. I looked over it and it seems very similar to, and somewhat more
detailed than gprof profiling output.

I am going to update the incident shortly with a more detailed timing
measurements on my side, in the form of a new attachment. Just FYI in case you
still don't get notifications.


I have attached a new set of results to the incident, in the form of the 
archive:

http://tinyurl.com/9drzg4e

Please see the content for a description of the library changes (_numpunct.h 
file), the MT test program (t.cpp) and the results collected through two 
separate builds on two different machines (results.txt file).


Thanks.

Liviu

STDCXX-1066 [was: Re: STDCXX forks]


On 9/1/12 1:52 PM, Stefan Teleman wrote: On Sat, Sep 1, 2012 at 12:15 PM, 
 I opened yesterday STDCXX-1066:

 https://issues.apache.org/jira/browse/STDCXX-1056

 about the pthread_mutex_t/pthread_cond_t alignment on SPARCV8. I'll
 have patches done this weekend. Achtung: the patchset is very large
 and touches a very large number of files. It's strange that I didn't
 get an email about STDCXX-1066.


Hi Stefan,

I have read through the patches attached to the incident, then I briefly read 
about the SunPro pragma align and pack. Two questions:


1. AFAICT, the use of the packing pragma may interfere with a user's setting of 
the same value. I.e., a user sets the packing in their sources and then, 
directly or not, includes an STDCXX header. It seems to me that in such a 
situation, our setting of the packing value would interfere with the rest of the 
user's translation unit, since there is no way to `restore' the previous packing 
value.


Something along the lines of:

// user source file

#pragma pack (X) // X != 8

#include iostream

struct UserDef
{
// different alignment than X ?
// ...
};

Is my understanding correct?

2. The patches are against 4.2.1, but the change would be binary incompatible 
with the already released 4.2.1 branch. Do you plan to have this fix in 4.3.x?


Thanks.

Liviu

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-15 Thread Stefan Teleman

On Sat, Sep 15, 2012 at 9:01 AM, Liviu Nicoara nikko...@hates.ms wrote:

 That is funny. What compiler are you using? What does the following test
 case return for you?

It's the Intel compiler with the patched stdcxx for  the wrong case
and GCC 4.7.1 + GNU libstdc++ for  the correct case.

GCC + GNU libstdc++ are correct.

The patched facet does not call the protected do_*() virtual functions
from their public counterparts, as it is required to do by the
Standard. Instead, it returns the data mebers directly (the data
members were initialized in the constructor). That is the patch you
proposed, which is indeed much better performing than using a mutex
lock. Unfortunately, in doing so, overriding the virtual functions in
a derived facet type becomes pointless.

--Stefan

-- 
Stefan Teleman
KDE e.V.
stefan.tele...@gmail.com

Re: STDCXX-1056 [was: Re: STDCXX forks]


On 9/15/12 1:51 PM, Stefan Teleman wrote:

On Sat, Sep 15, 2012 at 9:01 AM, Liviu Nicoara nikko...@hates.ms wrote:


That is funny. What compiler are you using? What does the following test
case return for you?


It's the Intel compiler with the patched stdcxx for  the wrong case
and GCC 4.7.1 + GNU libstdc++ for  the correct case.

GCC + GNU libstdc++ are correct.

The patched facet does not call the protected do_*() virtual functions
from their public counterparts, as it is required to do by the
Standard. Instead, it returns the data mebers directly (the data
members were initialized in the constructor). That is the patch you
proposed, which is indeed much better performing than using a mutex
lock. Unfortunately, in doing so, overriding the virtual functions in
a derived facet type becomes pointless.


Ahh, I see now. You are indeed right, that patch is defective. I was under the 
impression that we were discussing the (later) attachment 
stdcxx-1056-timings.tgz which contains a perfectly forwarding implementation of 
the facet public grouping method. The timings I attached there were the ones I 
thought we were discussing all along.


Now, to clear the confusion I created: the timing numbers I posted in the 
attachment stdcxx-1056-timings.tgz to STDCXX-1066 (09/11/2012) showed that a 
perfectly forwarding, no caching public interface (exemplified by a changed 
grouping) performs better than the current implementation. It was that test case 
that I hoped you could time, perhaps on SPARC, in both MT and ST builds. The 
t.cpp program is for MT, s.cpp for ST.


Please let me know if this clarifies things. I apologize for the 
misunderstanding.

Liviu

Re: STDCXX-1066 [was: Re: STDCXX forks]


On 9/15/12 2:57 PM, Stefan Teleman wrote:

On Sat, Sep 15, 2012 at 1:02 PM, Liviu Nicoara nikko...@hates.ms wrote:


I have read through the patches attached to the incident, then I briefly
read about the SunPro pragma align and pack. Two questions:

1. AFAICT, the use of the packing pragma may interfere with a user's setting
of the same value. I.e., a user sets the packing in their sources and then,
directly or not, includes an STDCXX header. It seems to me that in such a
situation, our setting of the packing value would interfere with the rest of
the user's translation unit, since there is no way to `restore' the previous
packing value.

Something along the lines of:

// user source file

#pragma pack (X) // X != 8

#include iostream

struct UserDef
{
 // different alignment than X ?
 // ...
};

Is my understanding correct?


Yes, you are indeed correct, Sir. :-)

However, for every #pragma pack(8) there should be a corresponding
subsequent #pragma pack(0). If there isn't, that's a patch bug.
#pragma pack(0) restores the default alignment. So, there should be
(there *must* be) no silent packing side-effects in user programs.


Yes, but it restores the default packing, not an arbitrary one, potentially set 
by the user prior to including our headers. Say, the user sets 2, the default is 
4 and we set 8. When we set it to default it goes back to 4, instead of the 
expected 2. Did I get this right?





2. The patches are against 4.2.1, but the change would be binary
incompatible with the already released 4.2.1 branch. Do you plan to have
this fix in 4.3.x?


Yes, definitely for 4.2.x and 4.3.x. I sent them for 4.2.1 just
because that's what I have handy.


I see now. I tried to apply them to 4.2.x and some chunks failed, and I thought 
I was not applying them where they were intended to go.


Thanks!

Liviu

Re: STDCXX-1066 [was: Re: STDCXX forks]


On 9/15/12 5:19 PM, Stefan Teleman wrote:

On Sat, Sep 15, 2012 at 4:57 PM, Liviu Nicoara nikko...@hates.ms wrote:


Yes, but it restores the default packing, not an arbitrary one, potentially
set by the user prior to including our headers. Say, the user sets 2, the
default is 4 and we set 8. When we set it to default it goes back to 4,
instead of the expected 2. Did I get this right?


This is true, but leaving some arbirary pragma pack(N) (for N != 0) in
effect for the duration of a program, and expecting it to work sounds
like a very defective programming approach to me. It will certainly
not work on SPARC at all.  if the program needs to pack something in a
certain specific way, it need to do so for that specific something
only. Otherwise the side-effects of globally setting a non-default
packing will destroy the program anyway.


I merely wanted to point out that restoring the default packing is not the same 
as restoring the packing to the previous value in effect.


Given this, I thought about an alternative way of forcing this alignment, e.g., 
via a union, aligned on an appropriate type. I see an advantage here in that 
most of the changes would occur where we define the 'primitive' mutex and 
condition wrappers, saving a few pre-processor conditionals and pragmas along 
the way.


What do you think?

Liviu

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-12 Thread Liviu Nicoara


On 09/12/12 00:21, Stefan Teleman wrote:

On Tue, Sep 11, 2012 at 10:18 PM, Liviu Nicoara nikko...@hates.ms wrote:


AFAICT, there are two cases to consider:

1. Using STDCXX locale database initializes the __rw_punct_t data in the
first, properly synchronized pass through __rw_get_numpunct. All subsequent
calls use the __rw_punct_t data to construct returned objects.
2. Using the C library locales does the same in the first pass, via
setlocale and localeconv, but setlocale synchronization is via a per-process
lock. The facet data, once initialized is used just like above.

I probably missed this in the previous conversation, but did you detect a
race condition in the tests if the facets are simply forwarding to the
private virtual interface? I.e., did you detect that the facet
initialization code is unsafe? I think the facet __rw_punct_t data is safely
initialized in both cases, it's the caching that is done incorrectly.


I originally thought so too, but now I'm having doubts. :-) And I
haven't tracked it down with 100% accuracy yet. I saw today this
comment in src/facet.cpp, line 358:

// a per-process array of facet pointers sufficiently large
// to hold (pointers to) all standard facets for 8 locales
static __rw_facet*  std_facet_buf [__rw_facet::_C_last_type * 8];


The localization code is cleverly dense and low-level. There are a few patterns 
that are used throughout, though. One is the function-local static buffers, 
e.g., the locales buffer and the facets buffer, in locale_body.cpp and 
facet.cpp, respectively. They both start from a reasonable size and are grown 
and shrunk according to the needs, but neither locales nor facets are ever 
evicted from them. See:

locale_body:936  -- reduction of locale buffer
locale_body:1006 -- expansion of locale buffer

facet.cpp:397-- reduction of facet buffer
facet.cpp:462-- expansion of facet buffer

Facets buffer is guarded by a static guard in:

__rw_facet::_C_manage:366

The locales buffer is protected by a static guard in:

__rw_locale::_C_manage:880




[...] this is all investigative stuff for tomorrow. :-)

and I agree with Martin that breaking ABI in a minor release is really
not an option. I'm trying to find the best way of making these facets
thread-safe while inflicting the least horrible performance hit.


We are getting close to the point where we can summarize all these findings and 
gauge an appropriate course of action.



I will run your tests tomorrow and let you know. :-)



That would be very helpful!

Liviu

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-11 Thread Liviu Nicoara


On 9/11/12 9:40 PM, Martin Sebor wrote:

On 09/11/2012 04:15 PM, Stefan Teleman wrote:

On Mon, Sep 10, 2012 at 4:24 PM, Stefan Teleman

I think I have something which doesn't break BC - stay tuned because
I'm testing it now.


OK.

So, here's a possible implementation of __rw_get_numpunct() with
minimal locking, which passes the MT tests and does not break ABI:

s247136804.onlinehome.us/stdcxx-1056-20120911/punct.cpp

And the same for include/loc/_numpunct.h:

http://s247136804.onlinehome.us/stdcxx-1056-20120911/_numpunct.h

In _numpunct.h, all the functions perform no checks and no lazy
initialization. They function simply as a pass-through to
__rw_get_numpunct(). std::numpunctT's data members are now dead
varaiables.

The bad: performance is no better than with locking the mutex inside
each of the std::numpunctT::*() functions, and with lazy
instantiation.


I wouldn't expect this to be faster than the original. In fact,
I would expect it to be slower because each call to one of the
public, non-virtual members results in a call to the out-of-line
virtual functions (and another to __rw_get_moneypunct). Avoiding
the overhead of such calls is the main and only reason why the
caching exists.



AFAICT, there are two cases to consider:

1. Using STDCXX locale database initializes the __rw_punct_t data in the first, 
properly synchronized pass through __rw_get_numpunct. All subsequent calls use 
the __rw_punct_t data to construct returned objects.
2. Using the C library locales does the same in the first pass, via setlocale 
and localeconv, but setlocale synchronization is via a per-process lock. The 
facet data, once initialized is used just like above.


I probably missed this in the previous conversation, but did you detect a race 
condition in the tests if the facets are simply forwarding to the private 
virtual interface? I.e., did you detect that the facet initialization code is 
unsafe? I think the facet __rw_punct_t data is safely initialized in both cases, 
it's the caching that is done incorrectly.



I'm afraid unoptimized timings don't tell us much. Neither does
a comparison between two compilers, even on the same OS.

I looked at Liviu's timings today. I was puzzled by the difference
between (1) which, IIUC, is the current implementation (presumably
an optimized, thread-safe build with the same compiler and OS) and
(4), which, again IIUC, is the equivalent of your latest patch here
(again, presumably optimized, thread safe, same compiler/OS). I'm
having trouble envisioning how calling a virtual function to
retrieve the value of grouping can possibly be faster than not
calling it (and simply returning the value cached in the data
member of the facet.



The new results I attached to the issue come from a a bit clearer tests and they 
focus on just two cases: the current implementation vs. a non-caching one; the 
latter just forwards the grouping calls to the protected do_grouping, with _no_ 
other changes to the implementation.


The timing numbers seem to show that MT builds fare far worse with the caching 
than without. Stefan, if you have the time, could you please infirm :) my 
conclusions by timing it on one of your machines?


Thanks,

Liviu

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-11 Thread Stefan Teleman

On Tue, Sep 11, 2012 at 10:18 PM, Liviu Nicoara nikko...@hates.ms wrote:

 AFAICT, there are two cases to consider:

 1. Using STDCXX locale database initializes the __rw_punct_t data in the
 first, properly synchronized pass through __rw_get_numpunct. All subsequent
 calls use the __rw_punct_t data to construct returned objects.
 2. Using the C library locales does the same in the first pass, via
 setlocale and localeconv, but setlocale synchronization is via a per-process
 lock. The facet data, once initialized is used just like above.

 I probably missed this in the previous conversation, but did you detect a
 race condition in the tests if the facets are simply forwarding to the
 private virtual interface? I.e., did you detect that the facet
 initialization code is unsafe? I think the facet __rw_punct_t data is safely
 initialized in both cases, it's the caching that is done incorrectly.

I originally thought so too, but now I'm having doubts. :-) And I
haven't tracked it down with 100% accuracy yet. I saw today this
comment in src/facet.cpp, line 358:

// a per-process array of facet pointers sufficiently large
// to hold (pointers to) all standard facets for 8 locales
static __rw_facet*  std_facet_buf [__rw_facet::_C_last_type * 8];

this leads me to suspect that there is an upper limit of 8 locales +
their standard facets. If the locales (and their facets) are being
recycled in and out of this 8-limit cache, that would explain the
other thing I also noticed (which also answers your question): yes, i
have gotten the dreaded strcmp(3C) 'Assertion failed' in
22.locale.numpunct.mt when I test implemented 22.locale.numpunct.mt in
a similar way to your tests. which in theory shouldn't happen, but it
did. which means that there's something going on with
behind-the-scenes facet re-initialization that i haven't found yet.
which would partially explain your observation that MT-tests perform
much worse with caching than without.

this is all investigative stuff for tomorrow. :-)

and I agree with Martin that breaking ABI in a minor release is really
not an option. I'm trying to find the best way of making these facets
thread-safe while inflicting the least horrible performance hit.

i will run your tests tomorrow and let you know. :-)

--Stefan

-- 
Stefan Teleman
KDE e.V.
stefan.tele...@gmail.com

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-10 Thread Stefan Teleman

On Thu, Sep 6, 2012 at 6:43 PM, Stefan Teleman stefan.tele...@gmail.com wrote:
 On Thu, Sep 6, 2012 at 4:02 PM, Martin Sebor mse...@gmail.com wrote:

 Here's a thought: it's not pretty but how about having
 the function initialize the facet? It already has a pointer
 to the base class, so it could downcast it to std::numpunct
 (or moneypunct, respectively), and assign the initial values
 to the members. Would that work?

 I haven't looked at them in detail (yet) but a cursory look shows that
 they're both recursive for the successful case.

It's not going to work that way.

For one, __rw_get_numpunct() and __rw_get_moneypunct() are static
functions in the __rw namespace. Neither can access or modify the
std::numpunctT or std::moneypunctT data members directly, because
they are private.

Second, both __rw_get_numpunct() and __rw_get_moneypunct() are
recursive. Unless we want to start playing with
PTHREAD_MUTEX_RECURSIVE, which I'm not at all sure is supported on all
the platforms we support, we're not going to be able to solve the
thread-safety problem here (Linux supports it as
PTHREAD_MUTEX_RECURSIVE_NP, Solaris supports it
PTHREAD_MUTEX_RECURSIVE).

Third, both __rw_get_numpunct() and __rw_get_moneypunct() can return a
NULL pointer. This is bad, because it will cause a SEGV at string
assignment, during a call to either of the
std::numpunctT::grouping(), std::numpunctT::truename(), etc.
functions. We should fix this and throw an exception instead. The
Standard doesn't say that any of these functions can throw, but it
doesn't say they can't throw either. And both __rw_get_numpunct() and
__rw_get_moneypunct() throw already.

--Stefan

-- 
Stefan Teleman
KDE e.V.
stefan.tele...@gmail.com

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-10 Thread Martin Sebor


On 09/10/2012 10:56 AM, Stefan Teleman wrote:

On Thu, Sep 6, 2012 at 6:43 PM, Stefan Telemanstefan.tele...@gmail.com  wrote:

On Thu, Sep 6, 2012 at 4:02 PM, Martin Sebormse...@gmail.com  wrote:


Here's a thought: it's not pretty but how about having
the function initialize the facet? It already has a pointer
to the base class, so it could downcast it to std::numpunct
(or moneypunct, respectively), and assign the initial values
to the members. Would that work?


I haven't looked at them in detail (yet) but a cursory look shows that
they're both recursive for the successful case.


It's not going to work that way.

For one, __rw_get_numpunct() and __rw_get_moneypunct() are static
functions in the __rw namespace. Neither can access or modify the
std::numpunctT  or std::moneypunctT  data members directly, because
they are private.


There are ways around it -- see, for example, the hack in
src/access.h. This gives us, the implementation, a way to
cleanly access private data without exposing them to user
programs. In our case, we could solve our problem by having
the facet declare some helper (defined only in punct.cpp)
a friend and invoking the helper from __rw_get_numpunct.

That said, I'd certainly prefer to avoid hacks as much as
possible. This problem could perhaps more cleanly be solved
by having the facet pass a reference to the string (or to
all of its internal data) to modify to the function (or
something like that). Unfortunately, it would break binary
compatibility.



Second, both __rw_get_numpunct() and __rw_get_moneypunct() are
recursive. Unless we want to start playing with
PTHREAD_MUTEX_RECURSIVE, which I'm not at all sure is supported on all
the platforms we support, we're not going to be able to solve the
thread-safety problem here (Linux supports it as
PTHREAD_MUTEX_RECURSIVE_NP, Solaris supports it
PTHREAD_MUTEX_RECURSIVE).


If I remember correctly, the recursion is quite simple. The
first stage (pfacet-_C_data () != 0) sets up the internal
data and takes place just once per facet object. The second
(recursive) stage then extracts and returns it. I only looked
at it briefly last week but from what I saw I'd say it should
be possible to lock only the second, non-recursive stage and
do the assignment there.



Third, both __rw_get_numpunct() and __rw_get_moneypunct() can return a
NULL pointer. This is bad, because it will cause a SEGV at string
assignment, during a call to either of the
std::numpunctT::grouping(), std::numpunctT::truename(), etc.
functions. We should fix this and throw an exception instead. The
Standard doesn't say that any of these functions can throw, but it
doesn't say they can't throw either. And both __rw_get_numpunct() and
__rw_get_moneypunct() throw already.


The functions only return NULL on internal error (i.e., a bug
in the implementation). There's an assertion right above the
return statement that will fire in a debug build. There's no
point in throwing an exception to the user for our own bugs
(they would have no way to recover from it). It's better to
abort. We might want to do that (i.e., abort) unconditionally
in this case, even when assertions are disabled.

Martin



--Stefan

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-10 Thread Stefan Teleman

On Mon, Sep 10, 2012 at 2:21 PM, Liviu Nicoara nikko...@hates.ms wrote:

 4. Without caching of grouping values, grouping() delegates always to
 do_grouping():

 real0m5.668s
 user1m11.389s
 sys 0m3.952s

FWIW, 22.2.3.1.1 explicitly states that all of the decimal_point(),
grouping(), truename(), falsename() must return their do_*()
counterparts.

--Stefan

-- 
Stefan Teleman
KDE e.V.
stefan.tele...@gmail.com

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-10 Thread Liviu Nicoara


On 09/10/12 15:01, Stefan Teleman wrote:

On Mon, Sep 10, 2012 at 2:21 PM, Liviu Nicoara nikko...@hates.ms wrote:


4. Without caching of grouping values, grouping() delegates always to
do_grouping():

real0m5.668s
user1m11.389s
sys 0m3.952s


FWIW, 22.2.3.1.1 explicitly states that all of the decimal_point(),
grouping(), truename(), falsename() must return their do_*()
counterparts.


Yes, of course. :-) I just meant to say that is _always_ delegating to 
do_grouping(), i.e., does no caching.

Liviu



--Stefan




--
And now I see with eye serene
The very pulse of the machine.

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-10 Thread Stefan Teleman

On Mon, Sep 10, 2012 at 1:32 PM, Martin Sebor mse...@gmail.com wrote:


 That said, I'd certainly prefer to avoid hacks as much as
 possible. This problem could perhaps more cleanly be solved
 by having the facet pass a reference to the string (or to
 all of its internal data) to modify to the function (or
 something like that). Unfortunately, it would break binary
 compatibility.

I think I have something which doesn't break BC - stay tuned because
I'm testing it now.

--Stefan

-- 
Stefan Teleman
KDE e.V.
stefan.tele...@gmail.com

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-07 Thread Liviu Nicoara


On 09/06/12 19:54, Martin Sebor wrote:

I'm not sure how easily we can do that. Almost all of locale
is initialized lazily. Some of the layers might depend on the
facets being initialized lazily as well. This was a deliberate
design choice. One of the constraints was to avoid dynamic
initialization or allocation at startup. [...]


There would be a performance degradation. IMHO, it would be minor and would 
simplify the code considerably.

After finally being able to reproduce the defect with SunPro 12.3 on x86_64 I 
tried to remove the lazy initialization and a (smaller) test case now passes. I 
am attaching the program and the numpunct patch.


Out of curiosity, does it still work if you move the locale into
the thread function? Like this:


I created the test case because I wanted something more specific and with more 
predictable results. Initially, the locale object was created in the thread and 
it was passing.




Also, does the 27.objects test pass with this patch?


Will try. I don't think that we can actually use this patch, I bundled it with 
the test so that people can test the approach right away.



I don't know if we have tests for it but writing to cerr/cout
in a bad_alloc handler should succeed. I.e., this should not
abort:

   try {
   struct rlimit rlim = { };
   getrlimit (RLIMIT_DATA, rlim);
   rlim.rlim_cur = 0;
   setrlimit (RLIMIT_DATA, rlim);

   for ( ; ; ) new int;
   }
   catch (bad_alloc) {
   cout  caught bad alloc:  123  '\n';
   }


I see..

Liviu

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-07 Thread Liviu Nicoara


On 09/06/12 22:58, Stefan Teleman wrote:

On Thu, Sep 6, 2012 at 7:31 PM, Liviu Nicoara nikko...@hates.ms wrote:


There would be a performance degradation. IMHO, it would be minor and would 
simplify the code considerably.

After finally being able to reproduce the defect with SunPro 12.3 on x86_64 I 
tried to remove the lazy initialization and a (smaller) test case now passes. I 
am attaching the program and the numpunct patch.


With your patches, the performance is much much better:


I didn't think about timing it, thanks! The result is somewhat expected without 
the synchronization. However, the full construction-time initialization is 
inappropriate if you take in consideration the delegation from the public API 
to the protected virtual API of the facet class, and the possibility of 
overrides in subclasses. Also, lazy initialization follows the principle of 
economy of effort, and caching increases efficiency. If possible in a robust 
way, I would rather have those two. :-)

Liviu



# INFO (S1) (10 lines):
# TEXT:
# COMPILER: Intel C++, __INTEL_COMPILER = 1210,
__INTEL_COMPILER_BUILD_DATE = 20111011, __EDG_VERSION__ = 403
# ENVIRONMENT: pentiumpro running linux-elf (Fedora release 17 (Beefy
Miracle) (3.5.0-2.fc17.x86_64)) with glibc 2.15
# FILE: 22.locale.numpunct.mt.cpp
# COMPILED: Sep  6 2012, 20:50:13
# COMMENT: thread safety


# +---+--+--+--+
# | DIAGNOSTIC|  ACTIVE  |   TOTAL  | INACTIVE |
# +---+--+--+--+
# | (S1) INFO |   11 |   11 |   0% |
# | (S2) NOTE |1 |1 |   0% |
# | (S8) ERROR|0 |3 | 100% |
# | (S9) FATAL|0 |1 | 100% |
# +---+--+--+--+
real 1035.05
user 1191.76
sys 63.49

--Stefan




--
And now I see with eye serene
The very pulse of the machine.

dbx [was: Re: STDCXX-1056 [was: Re: STDCXX forks]]

2012-09-07 Thread Liviu Nicoara


On 09/07/12 11:58, Stefan Teleman wrote:

On Fri, Sep 7, 2012 at 8:52 AM, Liviu Nicoara nikko...@hates.ms wrote:

On 09/06/12 19:54, Martin Sebor wrote:



Also, does the 27.objects test pass with this patch?



No, it does not. It hangs at the first insertion, line 227. Unfortunately, I
cannot debug it because dbx does not work properly in my installation.


What are the symptoms with dbx mishbehaving? (maybe I can help).


Thanks!

I get this when launching the debugger:

$ dbx -xexec32 t
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
Reading t
Reading ld-linux.so.2
dbx: fetch at 0xf400 failed -- Input/output error
dbx: warning: could not put in breakpoint
dbx: warning: internal handler (-80) made defunct -- could not enable event BPT
dbx: warning: internal handler (-86) made defunct -- could not enable event BPT
...

Then the program just runs from the get go. If I set breakpoints and rerun, it 
ignores them.

Also, long running programs cannot be broken into with C-c:

^Cdbx: warning: Interrupt ignored but forwarded to child.
signal INT in (unknown) at 0xf77cf6c4
0xf77cf6c4: movl %edx,%eax
dbx: warning: 'stop' ignored -- while doing rtld handshake
terminating signal 2 SIGINT
dbx: warning: 'stop' ignored -- while doing rtld handshake
(dbx) where
dbx: program is not active
(dbx)

Liviu

Re: dbx [was: Re: STDCXX-1056 [was: Re: STDCXX forks]]

2012-09-07 Thread Stefan Teleman

On Fri, Sep 7, 2012 at 12:23 PM, Liviu Nicoara nikko...@hates.ms wrote:

 I get this when launching the debugger:

 $ dbx -xexec32 t
 For information about new features see `help changes'
 To remove this message, put `dbxenv suppress_startup_message 7.9' in your
 .dbxrc
 Reading t
 Reading ld-linux.so.2
 dbx: fetch at 0xf400 failed -- Input/output error
 dbx: warning: could not put in breakpoint

There is fix for this, but for Studio 12.2:

http://wesunsolve.net/bugid/id/6545393

It had priority 5 (very low) so that leads me to assume that it hasn't
been fixed yet in 12.3. But I will ask at work about 12.3/Linux.

Strangely enough, I don't get the error on Fedora 17 with 12.3.

--Stefan

-- 
Stefan Teleman
KDE e.V.
stefan.tele...@gmail.com

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-06 Thread Liviu Nicoara


On 09/05/12 23:51, Martin Sebor wrote:

On 09/05/2012 09:03 PM, Stefan Teleman wrote:

[...]
Agreed.

But: if the choice is between an implementation which [1] breaks ABI
and [2] performs 20% worse -- even in contrived test cases -- than
another implementation [2] which doesn't break ABI, and performs
better than the first one,  why would we even consider the first one?


Breaking the ABI is not an option (unless we rev the version).
But I'm not sure I understand what you think breaks the ABI.


I think Stefan is referring to adding a mutex member variable to the facet in 
question and breaking binary compatibility. If that is the case I have confused 
things when I suggested exactly that, earlier. A cursory read through the 
__rw_facet source shows that inherits from __rw_synchronized in MT builds, 
therefore each facet carries its own mutex member.

Liviu


We don't need to add a new mutex -- we can use the __rw_facet
member for the locking. Or did you mean something else?

Martin



--Stefan




--
And now I see with eye serene
The very pulse of the machine.

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-06 Thread Stefan Teleman

On Thu, Sep 6, 2012 at 9:16 AM, Liviu Nicoara nikko...@hates.ms wrote:

 I think Stefan is referring to adding a mutex member variable to the facet
 in question and breaking binary compatibility. If that is the case I have
 confused things when I suggested exactly that, earlier. A cursory read
 through the __rw_facet source shows that inherits from __rw_synchronized in
 MT builds, therefore each facet carries its own mutex member.

 On 09/05/12 23:51, Martin Sebor wrote:
 We don't need to add a new mutex -- we can use the __rw_facet
 member for the locking. Or did you mean something else?

A possible implementation using the __rw_facet mutex could look like this:

template class _CharT
inline string numpunct_CharT::grouping () const
{
if (!(_C_flags  _RW::__rw_gr)) {

numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);

_RWSTD_MT_GUARD (__self-_C_mutex);

if (!(_C_flags  _RW::__rw_gr)) {

// [try to] get the grouping first (may throw)
// then set a flag to avoid future initializations
__self-_C_grouping  = do_grouping ();
__self-_C_flags|= _RW::__rw_gr;

}
}

return _C_grouping;
}

Except that it will not work. Because the __rw_facet mutex member is
being locked  in file ../src/facet.cpp in function
__rw_facet::_C_manage at line 366:

// acquire lock
_RWSTD_MT_STATIC_GUARD (_RW::__rw_facet);

This will deadlock because this is the mutex already locked by
std::numpunctT::grouping().

I've already tested this with 3 compilers, and, it does indeed deadlock.

So yes, I did indeed mean something different. I meant adding another
mutex data member to the numpunct class.

--Stefan

-- 
Stefan Teleman
KDE e.V.
stefan.tele...@gmail.com

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-06 Thread Martin Sebor


On 09/06/2012 09:58 AM, Stefan Teleman wrote:

On Thu, Sep 6, 2012 at 9:16 AM, Liviu Nicoaranikko...@hates.ms  wrote:


I think Stefan is referring to adding a mutex member variable to the facet
in question and breaking binary compatibility. If that is the case I have
confused things when I suggested exactly that, earlier. A cursory read
through the __rw_facet source shows that inherits from __rw_synchronized in
MT builds, therefore each facet carries its own mutex member.



On 09/05/12 23:51, Martin Sebor wrote:

We don't need to add a new mutex -- we can use the __rw_facet
member for the locking. Or did you mean something else?


A possible implementation using the __rw_facet mutex could look like this:

templateclass _CharT
inline string numpunct_CharT::grouping () const
{
 if (!(_C_flags  _RW::__rw_gr)) {

 numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);

 _RWSTD_MT_GUARD (__self-_C_mutex);

 if (!(_C_flags  _RW::__rw_gr)) {

 // [try to] get the grouping first (may throw)
 // then set a flag to avoid future initializations
 __self-_C_grouping  = do_grouping ();
 __self-_C_flags|= _RW::__rw_gr;

 }
 }

 return _C_grouping;
}

Except that it will not work. Because the __rw_facet mutex member is
being locked  in file ../src/facet.cpp in function
__rw_facet::_C_manage at line 366:

// acquire lock
_RWSTD_MT_STATIC_GUARD (_RW::__rw_facet);


This locks a different mutex, one unrelated to __rw_facet::_C_mutex.

Look at the implementation of _RWSTD_MT_STATIC_GUARD() in rw/_defs.h:

  #  define _RWSTD_MT_STATIC_GUARD(type) \
typedef _RW::__rw_typetype,__LINE__ _UniqueType;   \
_RWSTD_MT_CLASS_GUARD (_UniqueType)

  #  define _RWSTD_MT_CLASS_GUARD(type)   \
_RWSTD_MT_GUARD (_RW::__rw_get_static_mutex ((type*)0))

and then at __rw_get_static_mutex in rw/_mutex.h. It gets a static
(global) mutex from a mutex factory via template instantiation:

  __rw_static_mutex__rw_facet::_C_mutex;



This will deadlock because this is the mutex already locked by
std::numpunctT::grouping().

I've already tested this with 3 compilers, and, it does indeed deadlock.


Something else must be locking the mutex then. I quickly looked
at __rw_get_punct (and __rw_get_numpunct) in punct.cpp but I
couldn't find any signs of the mutex being locked there. The
only thing I see being locked there is the global C locale. If
it's not the callee it has to be the caller.



So yes, I did indeed mean something different. I meant adding another
mutex data member to the numpunct class.


We can't (and shouldn't need to) add one. We need to be able to
make do with the existing member. That's what it's there for.

Martin



--Stefan

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-06 Thread Stefan Teleman

On Thu, Sep 6, 2012 at 2:46 PM, Stefan Teleman stefan.tele...@gmail.com wrote:

 [steleman@darthvader][/src/steleman/programming/stdcxx-ss122/stdcxx-4.2.1/build/tests][09/06/2012
 14:40:11][1084] ./22.locale.numpunct.mt --nthreads=2 --nloops=100
 # INFO (S1) (10 lines):
 # TEXT:
 # COMPILER: SunPro, __SUNPRO_CC = 0x5120
 # ENVIRONMENT: i386 running linux (Fedora release 17 (Beefy Miracle)
 (3.5.0-2.fc17.x86_64)) with glibc 2.15
 # FILE: 22.locale.numpunct.mt.cpp
 # COMPILED: Sep  6 2012, 14:38:42
 # COMMENT: thread safety
 

 # CLAUSE: lib.locale.numpunct

 # NOTE (S2) (5 lines):
 # TEXT: executing /usr/bin/locale -a  /tmp/tmpfile-YzXcb9
 # CLAUSE: lib.locale.numpunct
 # FILE: process.cpp
 # LINE: 276

 grouping: _RWSTD_MT_GUARD (__self-_C_mutex): 0xf774913c
 _C_get_data: _RWSTD_MT_GUARD (_C_mutex): 0xf774913c

 [ ... deadlock ... ]

And just to make absolutely sure that this isn't a case of SunPro
being insane, here's the output from the Intel compiler:

[steleman@darthvader][/src/steleman/programming/stdcxx-intel/stdcxx-4.2.1-thread-safe/build/tests][09/06/2012
15:44:23][1390] ./22.locale.numpunct.mt --nthreads=2 --nloops=100
# INFO (S1) (10 lines):
# TEXT:
# COMPILER: Intel C++, __INTEL_COMPILER = 1210,
__INTEL_COMPILER_BUILD_DATE = 20111011, __EDG_VERSION__ = 403
# ENVIRONMENT: pentiumpro running linux-elf (Fedora release 17 (Beefy
Miracle) (3.5.0-2.fc17.x86_64)) with glibc 2.15
# FILE: 22.locale.numpunct.mt.cpp
# COMPILED: Sep  6 2012, 15:43:29
# COMMENT: thread safety


# CLAUSE: lib.locale.numpunct

# NOTE (S2) (5 lines):
# TEXT: executing locale -a  /tmp/tmpfile-FoZR0J
# CLAUSE: lib.locale.numpunct
# FILE: process.cpp
# LINE: 276

grouping: _RWSTD_MT_GUARD (__self-_C_mutex): 0xf74c2424
_C_get_data: _RWSTD_MT_GUARD (_C_mutex): 0xf74c2424

[ ... deadlock ... ]

--Stefan

-- 
Stefan Teleman
KDE e.V.
stefan.tele...@gmail.com

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-06 Thread Liviu Nicoara


On Sep 5, 2012, at 4:03 PM, Martin Sebor wrote:

 On 09/05/2012 01:33 PM, Liviu Nicoara wrote:
 On 09/05/12 15:17, Martin Sebor wrote:
 On 09/05/2012 12:40 PM, Liviu Nicoara wrote:
 On 09/05/12 14:09, Stefan Teleman wrote:
 On Wed, Sep 5, 2012 at 10:52 AM, Martin Sebor mse...@gmail.com wrote:
 [...]
 OK so I did a little bit of testing, after looking at the *right*
 __rw_guard class. :-)
 
 I changed the std::numpunct class thusly:
 [...]
 
 I am afraid this would be unsafe, too (if I said otherwise earlier I was
 wrong). [...] Thoughts?
 
 You're right, there's still a problem. We didn't get the double
 checked locking quite right.
 
 I wish for simplicity: eliminate the lazy initialization, and fully
 initialize the facet in the constructor. Then we'd need no locking on
 copying its data (std::string takes care of its own copying).
 
 I'm not sure how easily we can do that. Almost all of locale
 is initialized lazily. Some of the layers might depend on the
 facets being initialized lazily as well. This was a deliberate
 design choice. One of the constraints was to avoid dynamic
 initialization or allocation at startup. [...]

There would be a performance degradation. IMHO, it would be minor and would 
simplify the code considerably.

After finally being able to reproduce the defect with SunPro 12.3 on x86_64 I 
tried to remove the lazy initialization and a (smaller) test case now passes. I 
am attaching the program and the numpunct patch. 

Will continue to look into it. Although STDCXX does not have an implementation 
of the atomics library we could probably use the existing, unexposed, atomics 
API to implement a more robust lazy initialization. 

Liviu

$ cat tests/localization/t.cpp; svn diff
#include locale

#include cstdio
#include cstdlib

#include unistd.h

#define MAX_THREADS16
#define MAX_LOOPS1000

static bool hold = true;

extern C {

static void* 
f (void* pv)
{
std::numpunctchar const fac = 
*reinterpret_cast std::numpunctchar*  (pv);

while (hold) ; 

for (int i = 0; i != MAX_LOOPS; ++i) {
std::string const grp = fac.grouping ();
}

return 0;
}

}

int
main (int argc, char** argv)
{
std::locale const loc = std::locale (argv [1]);

std::numpunctchar const fac =
std::use_facetstd::numpunctchar  (loc);

pthread_t tid [MAX_THREADS] = { 0 };

for (int i = 0; i  MAX_THREADS; ++i) {
if (pthread_create (tid + i, 0, f, (void*)fac)) 
exit (-1);
}

sleep (1);

hold = false;

for (int i = 0; i  MAX_THREADS; ++i) {
if (tid [i])
pthread_join (tid [i], 0);
}

return 0;
}

Index: include/loc/_numpunct.h
===
--- include/loc/_numpunct.h (revision 1381575)
+++ include/loc/_numpunct.h (working copy)
@@ -61,24 +61,40 @@ struct numpunct: _RW::__rw_facet
 string_type;
 
 _EXPLICIT numpunct (_RWSTD_SIZE_T __ref = 0)
-: _RW::__rw_facet (__ref), _C_flags (0) { }
+: _RW::__rw_facet (__ref) {
+_C_grouping   = do_grouping ();
+_C_truename   = do_truename ();
+_C_falsename  = do_falsename ();
+_C_decimal_point  = do_decimal_point ();
+_C_thousands_sep  = do_thousands_sep ();
+}
 
 virtual ~numpunct () _RWSTD_ATTRIBUTE_NOTHROW;
 
 // 22.2.3.1.1, p1
-char_type decimal_point () const;
+char_type decimal_point () const {
+return _C_decimal_point;
+}
 
 // 22.2.3.1.1, p2
-char_type thousands_sep () const;
+char_type thousands_sep () const {
+return _C_thousands_sep;
+}
 
 // 22.2.3.1.1, p3
-string grouping () const;
+string grouping () const {
+return _C_grouping;
+}
 
 // 22.2.3.1.1, p4
-string_type truename () const;
+string_type truename () const {
+return _C_truename;
+}
 
 // 22.2.3.1.1, p4
-string_type falsename () const;
+string_type falsename () const {
+return _C_falsename;
+}
 
 static _RW::__rw_facet_id id;
 
@@ -112,15 +128,13 @@ protected:
 
 private:
 
-int _C_flags;   // bitmap of cached data valid flags
-string  _C_grouping;// cached results of virtual members
+string  _C_grouping;
 string_type _C_truename;
 string_type _C_falsename;
 char_type   _C_decimal_point;
 char_type   _C_thousands_sep;
 };
 
-
 #ifndef _RWSTD_NO_SPECIALIZED_FACET_ID
 
 _RWSTD_SPECIALIZED_CLASS
@@ -134,95 +148,6 @@ _RW::__rw_facet_id numpunctwchar_t::id
 #  endif   // _RWSTD_NO_WCHAR_T
 #endif   // _RWSTD_NO_SPECIALIZED_FACET_ID
 
-
-template class _CharT
-inline _TYPENAME numpunct_CharT::char_type
-numpunct_CharT::decimal_point () const
-{
-if (!(_C_flags  _RW::__rw_dp)) {
-
-numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);
-
-// [try to] get the decimal point first (may throw)
-// then set a flag to avoid future

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-06 Thread Martin Sebor


I'm not sure how easily we can do that. Almost all of locale
is initialized lazily. Some of the layers might depend on the
facets being initialized lazily as well. This was a deliberate
design choice. One of the constraints was to avoid dynamic
initialization or allocation at startup. [...]


There would be a performance degradation. IMHO, it would be minor and would 
simplify the code considerably.

After finally being able to reproduce the defect with SunPro 12.3 on x86_64 I 
tried to remove the lazy initialization and a (smaller) test case now passes. I 
am attaching the program and the numpunct patch.


Out of curiosity, does it still work if you move the locale into
the thread function? Like this:


static void*
f (void* pv)
{
 std::numpunctchar  const  fac =
 *reinterpret_cast  std::numpunctchar*  (pv);

 while (hold) ;

 for (int i = 0; i != MAX_LOOPS; ++i) {


   const std::locale loc (en_US.utf8);

   typedef std::numpunctchar NumPunct;
   NumPunct const fac = std::use_facetNumPunct(loc);


 std::string const grp = fac.grouping ();
 }

 return 0;
}



Also, does the 27.objects test pass with this patch?

I don't know if we have tests for it but writing to cerr/cout
in a bad_alloc handler should succeed. I.e., this should not
abort:

  try {
  struct rlimit rlim = { };
  getrlimit (RLIMIT_DATA, rlim);
  rlim.rlim_cur = 0;
  setrlimit (RLIMIT_DATA, rlim);

  for ( ; ; ) new int;
  }
  catch (bad_alloc) {
  cout  caught bad alloc:  123  '\n';
  }

Martin

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-06 Thread Stefan Teleman

On Thu, Sep 6, 2012 at 7:31 PM, Liviu Nicoara nikko...@hates.ms wrote:

 There would be a performance degradation. IMHO, it would be minor and would 
 simplify the code considerably.

 After finally being able to reproduce the defect with SunPro 12.3 on x86_64 I 
 tried to remove the lazy initialization and a (smaller) test case now passes. 
 I am attaching the program and the numpunct patch.

With your patches, the performance is much much better:

# INFO (S1) (10 lines):
# TEXT:
# COMPILER: Intel C++, __INTEL_COMPILER = 1210,
__INTEL_COMPILER_BUILD_DATE = 20111011, __EDG_VERSION__ = 403
# ENVIRONMENT: pentiumpro running linux-elf (Fedora release 17 (Beefy
Miracle) (3.5.0-2.fc17.x86_64)) with glibc 2.15
# FILE: 22.locale.numpunct.mt.cpp
# COMPILED: Sep  6 2012, 20:50:13
# COMMENT: thread safety


# +---+--+--+--+
# | DIAGNOSTIC|  ACTIVE  |   TOTAL  | INACTIVE |
# +---+--+--+--+
# | (S1) INFO |   11 |   11 |   0% |
# | (S2) NOTE |1 |1 |   0% |
# | (S8) ERROR|0 |3 | 100% |
# | (S9) FATAL|0 |1 | 100% |
# +---+--+--+--+
real 1035.05
user 1191.76
sys 63.49

--Stefan

-- 
Stefan Teleman
KDE e.V.
stefan.tele...@gmail.com

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-05 Thread Liviu Nicoara


On 09/05/12 00:51, Stefan Teleman wrote:

On Tue, Sep 4, 2012 at 10:49 PM, Martin Sebor mse...@gmail.com wrote:


   template class _CharT
   inline string numpunct_CharT::grouping () const
   {
   if (!(_C_flags  _RW::__rw_gr)) {

   numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);

  _RWSTD_MT_GUARD (_C_mutex);

  // [try to] get the grouping first (may throw)
   // then set a flag to avoid future initializations
   __self-_C_grouping  = do_grouping ();
   __self-_C_flags|= _RW::__rw_gr;
   }

   return _C_grouping;
   }


That's what I wanted to do originally - use a per-object mutex.


FWIW, it is pretty obvious to me _now_ that these assignments are not MT-safe, 
by themselves and also in the context of the test guarding them.

You're right, access must be synchronized on a per-facet object basis. Since 
the data that needs to be protected on assignment is instance data, the mutex 
used in the guard must be a (yet to be added) instance mutex. That would make 
it binary incompatible and would go in ... 4.3.x?




This works:

template class _CharT
inline string numpunct_CharT::grouping () const
{
 if (!(_C_flags  _RW::__rw_gr)) {

 numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);

 _RWSTD_MT_STATIC_GUARD (_Type);


It seems to me that a static guard is excessive. The only thing that needs 
guarding is the actual assignment to facet instance members -- a mutex instance member 
would suffice.



And even so, this is still not thread-safe:

Two different threads [ T1 and T2 ], seeking two different locales
[en_US.UTF-8 and ja_JP.UTF-8], enter std::numpunct_CharT::grouping()
at the same time - because they are running on two different cores.
They both test for

 if (!(_C_flags  _RW::__rw_gr))

and then -- assuming the expression above evaluates to true -- one of
them wins the mutex [T1], and the other one [T2] blocks on the mutex.

When T1 is done and exits the function, the grouping is set to
en_US.UTF-8 and the mutex is released. Now T2 acquires the mutex, and
proceeds to setting grouping to ja-JP.UTF-8. Woe on the caller running
from T1 who now thinks he got en_US.UTF-8, but instead he gets
ja_JP.UTF-8, which was duly set so by T2, but T1 had no clue about it
(remember, the std::string grouping _charT buffer is shared by the
caller from T1 and the caller from T2).


I am pretty sure that in your example, they would operate on two different instances of the facet 
class. The static guard would synchronize their running alright but through two 
different facet instances, copying data from two different places of the locale database. In this 
respect a static guard is excessive.

I would defer it to Martin to validate a course of action but to me it looks 
like the only problem is the concurrent assignment to facet instance member 
variables inside the facet member functions. Which could be easily and nicely 
solved with a plain guard on a mutex ordinary member variable.

Thanks!

Liviu

Re: STDCXX-1056 [was: Re: STDCXX forks]


That's what I wanted to do originally - use a per-object mutext.
Unfortunately the _C_mutex member in rw::__rw_synchronized is static:


That's the wrong __rw_synchronized -- this one is for non-MT builds.
The one we get in MT builds in on line 370.



struct __rw_synchronized
{
 // static so that it takes up no space
 static _RWSTD_EXPORT __rw_mutex _C_mutex;

...

and __rw::rw_guard doesn't have an appropriate constructor.

Intel C++ complains about it too:

/src/steleman/programming/stdcxx-intel/stdcxx-4.2.1-thread-safe/include/loc/_numpunct.h(181):
error: no instance of constructor __rw::__rw_guard::__rw_guard
matches the argument list
 argument types are: (const __rw::__rw_mutex)
   _RWSTD_MT_GUARD (_C_mutex);
   ^


That's because the function is const and the guard takes a non-const
reference. This might be a defect in  __rw_synchronized: the mutex
member should probably be mutable.




...

And even so, this is still not thread-safe:

Two different threads [ T1 and T2 ], seeking two different locales
[en_US.UTF-8 and ja_JP.UTF-8], enter std::numpunct_CharT::grouping()
at the same time - because they are running on two different cores.


Those threads won't have the same numpunct facet. A single facet
is always associated with exactly one locale and the relationship
can never change. The locale class guarantees this.


They both test for

 if (!(_C_flags  _RW::__rw_gr))

and then -- assuming the expression above evaluates to true -- one of
them wins the mutex [T1], and the other one [T2] blocks on the mutex.

When T1 is done and exits the function, the grouping is set to
en_US.UTF-8 and the mutex is released. Now T2 acquires the mutex, and
proceeds to setting grouping to ja-JP.UTF-8. Woe on the caller running
from T1 who now thinks he got en_US.UTF-8, but instead he gets
ja_JP.UTF-8, which was duly set so by T2, but T1 had no clue about it
(remember, the std::string grouping _charT buffer is shared by the
caller from T1 and the caller from T2).

So at a minimum, the locking must happen before evaluating the

 if (!(_C_flags  _RW::__rw_gr))

expression.


You're right that there is a race here. But the race is benign
because the T2 will assign the same value to the string as T1
did (because the grouping must be the same in the same locale).
This doesn't reallocate the string but simply overwrites each
byte with its own value. On all architectures we support this
is atomic and safe.

Martin



This still doesn't solve what ends up being returned in grouping. If
we lock at the top of the function, then, when T2 acquires the mutex,
the test expression will evaluate to false. Therefore T2 will return
whatever is in grouping right now, which happens to be en_US.UTF-8 as
set by T1, when T2 really wanted ja_JP.UTF-8.

I really think the appropriate fix here -- which would address the
performance implications -- is more complex than this. I am thinking
about creating and using a (non-publicly accessible) internal locale
cache:

typedef std::mapstd::string, std::locale  locale_cache;

where all the locales are stored fully initialized, on demand. There
is only one locale instantiation and initialization overhead cost per
locale. After a locale has been instantiated and placed into the
cache, the caller of any specfic locale gets a copy from the cache,
fully instantiated and initialized. But this breaks ABI, so I'm
thinking it's for stdcxx 5.

Thoughts?

--Stefan

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-05 Thread Stefan Teleman

On Wed, Sep 5, 2012 at 10:52 AM, Martin Sebor mse...@gmail.com wrote:

 You're right that there is a race here. But the race is benign
 because the T2 will assign the same value to the string as T1
 did (because the grouping must be the same in the same locale).
 This doesn't reallocate the string but simply overwrites each
 byte with its own value. On all architectures we support this
 is atomic and safe.

OK so I did a little bit of testing, after looking at the *right*
__rw_guard class. :-)

I changed the std::numpunct class thusly:

template class _CharT
class numpunct : public _RW::__rw_facet
{
 public:
 // [ ... snip, no changes here ... ]

 private:

int _C_flags;   // bitmap of cached data valid flags
string  _C_grouping;// cached results of virtual members
string_type _C_truename;
string_type _C_falsename;
char_type   _C_decimal_point;
char_type   _C_thousands_sep;
mutable _RW::__rw_mutex  _C_object_mutex; // -
};

And then:

template class _CharT
inline string numpunct_CharT::grouping () const
{
if (!(_C_flags  _RW::__rw_gr)) {

_RWSTD_MT_GUARD (_C_object_mutex);

// double-test here to avoid re-writing an already written string
if (!(_C_flags  _RW::__rw_gr)) {

numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);

// [try to] get the grouping first (may throw)
// then set a flag to avoid future initializations
__self-_C_grouping  = do_grouping ();
__self-_C_flags|= _RW::__rw_gr;

}
}

return _C_grouping;
}

same changes for std::numpuncttruename() and std::numpunct::falsename().

Ran 22.locale.numpunct.mt with the default values for nthreads and nloops.

Yesterday's test results, with the static per-class mutex:

# INFO (S1) (10 lines):
# TEXT:
# COMPILER: Intel C++, __INTEL_COMPILER = 1210,
__INTEL_COMPILER_BUILD_DATE = 20111011, __EDG_VERSION__ = 403
# ENVIRONMENT: pentiumpro running linux-elf (Fedora release 17 (Beefy
Miracle) (3.5.0-2.fc17.x86_64)) with glibc 2.15
# FILE: 22.locale.numpunct.mt.cpp
# COMPILED:  Sep  4 2012, 09:11:36
# COMMENT: thread safety


# CLAUSE: lib.locale.numpunct

# NOTE (S2) (5 lines):
# TEXT: executing locale -a  /tmp/tmpfile-V7siTq
# CLAUSE: lib.locale.numpunct
# FILE: process.cpp
# LINE: 276

# INFO (S1) (3 lines):
# TEXT: testing std::numpunctcharT with 8 threads, 20 iterations
each, in 32 locales { C aa_DJ aa_DJ.iso88591 aa_DJ.utf8
aa_ER aa_ER@saaho aa_ER.utf8 aa_ER.utf8@saaho aa_ET
aa_ET.utf8 af_ZA af_ZA.iso88591 af_ZA.utf8 am_ET
am_ET.utf8 an_ES an_ES.iso885915 an_ES.utf8 ar_AE
ar_AE.iso88596 ar_AE.utf8 ar_BH ar_BH.iso88596 ar_BH.utf8
ar_DZ ar_DZ.iso88596 ar_DZ.utf8 ar_EG ar_EG.iso88596
ar_EG.utf8 ar_IN ar_IN.utf8 }
# CLAUSE: lib.locale.numpunct

[ ... ]

# +---+--+--+--+
# | DIAGNOSTIC|  ACTIVE  |   TOTAL  | INACTIVE |
# +---+--+--+--+
# | (S1) INFO |   11 |   11 |   0% |
# | (S2) NOTE |1 |1 |   0% |
# | (S8) ERROR|0 |3 | 100% |
# | (S9) FATAL|0 |1 | 100% |
# +---+--+--+--+
real 2139.31
user 2406.09
sys 155.61

===

Today's test results with the per-object mutex (as shown above):

# INFO (S1) (10 lines):
# TEXT:
# COMPILER: Intel C++, __INTEL_COMPILER = 1210,
__INTEL_COMPILER_BUILD_DATE = 20111011, __EDG_VERSION__ = 403
# ENVIRONMENT: pentiumpro running linux-elf (Fedora release 17 (Beefy
Miracle) (3.5.0-2.fc17.x86_64)) with glibc 2.15
# FILE: 22.locale.numpunct.mt.cpp
# COMPILED: Sep  5 2012, 13:08:03
# COMMENT: thread safety


# CLAUSE: lib.locale.numpunct

# NOTE (S2) (5 lines):
# TEXT: executing locale -a  /tmp/tmpfile-ww0nez
# CLAUSE: lib.locale.numpunct
# FILE: process.cpp
# LINE: 276

# INFO (S1) (3 lines):
# TEXT: testing std::numpunctcharT with 8 threads, 20 iterations
each, in 32 locales { C aa_DJ aa_DJ.iso88591 aa_DJ.utf8
aa_ER aa_ER@saaho aa_ER.utf8 aa_ER.utf8@saaho aa_ET
aa_ET.utf8 af_ZA af_ZA.iso88591 af_ZA.utf8 am_ET
am_ET.utf8 an_ES an_ES.iso885915 an_ES.utf8 ar_AE
ar_AE.iso88596 ar_AE.utf8 ar_BH ar_BH.iso88596 ar_BH.utf8
ar_DZ ar_DZ.iso88596 ar_DZ.utf8 ar_EG ar_EG.iso88596
ar_EG.utf8 ar_IN ar_IN.utf8 }
# CLAUSE: lib.locale.numpunct

[ ... ]

# +---+--+--+--+
# | DIAGNOSTIC|  ACTIVE  |   TOTAL  | INACTIVE |
# +---+--+--+--+
# | (S1) INFO |   11 |   11 |   0% |
# | (S2) NOTE |1 |1 |   0% |
# | (S8) ERROR|0 |3 | 100% |
# | (S9) FATAL|0 |1

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-05 Thread Liviu Nicoara


On 09/05/12 14:09, Stefan Teleman wrote:

On Wed, Sep 5, 2012 at 10:52 AM, Martin Sebor mse...@gmail.com wrote:
[...]
OK so I did a little bit of testing, after looking at the *right*
__rw_guard class. :-)

I changed the std::numpunct class thusly:
[...]
And then:

template class _CharT
inline string numpunct_CharT::grouping () const
{
 if (!(_C_flags  _RW::__rw_gr)) {

 _RWSTD_MT_GUARD (_C_object_mutex);

 // double-test here to avoid re-writing an already written string
 if (!(_C_flags  _RW::__rw_gr)) {

 numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);

 // [try to] get the grouping first (may throw)
 // then set a flag to avoid future initializations
 __self-_C_grouping  = do_grouping ();
 __self-_C_flags|= _RW::__rw_gr;

 }
 }

 return _C_grouping;
}


I am afraid this would be unsafe, too (if I said otherwise earlier I was 
wrong). The compiler might re-arrange the protected assignments, such that 
another thread sees a partially updated object, where the flags are updated and 
the string not. I don't think we're going to get away with this here without 
either a simpler and more inefficient top-level locking, or doing away 
completely with the lazy initialization. Thoughts?

Liviu

Re: STDCXX-1056 [was: Re: STDCXX forks]


On 09/05/2012 12:40 PM, Liviu Nicoara wrote:

On 09/05/12 14:09, Stefan Teleman wrote:

On Wed, Sep 5, 2012 at 10:52 AM, Martin Sebor mse...@gmail.com wrote:
[...]
OK so I did a little bit of testing, after looking at the *right*
__rw_guard class. :-)

I changed the std::numpunct class thusly:
[...]
And then:

template class _CharT
inline string numpunct_CharT::grouping () const
{
if (!(_C_flags  _RW::__rw_gr)) {

_RWSTD_MT_GUARD (_C_object_mutex);

// double-test here to avoid re-writing an already written string
if (!(_C_flags  _RW::__rw_gr)) {

numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);

// [try to] get the grouping first (may throw)
// then set a flag to avoid future initializations
__self-_C_grouping = do_grouping ();
__self-_C_flags |= _RW::__rw_gr;

}
}

return _C_grouping;
}


I am afraid this would be unsafe, too (if I said otherwise earlier I was
wrong). The compiler might re-arrange the protected assignments, such
that another thread sees a partially updated object, where the flags are
updated and the string not. I don't think we're going to get away with
this here without either a simpler and more inefficient top-level
locking, or doing away completely with the lazy initialization. Thoughts?


You're right, there's still a problem. We didn't get the double
checked locking quite right. We need to prevent the reordering
both by the compiler and by the hardware so that the reader
doesn't get an out of date value of _C_grouping. Making the
members volatile and/or moving the body of the first if into
an out-of-line function should help with the compiler reordering
but we'll need a barrier to prevent the hardware from serving us
up stale data (i.e., letting T2 see the updated _C_flags but
a _C_grouping that's still being modified). Calling
pthread_mutex_unlock does insert a barrier, but the barrier
is only executed when both tests pass, and not when the first
one fails. We need the barrier in both cases. This seems like
it would get too ugly (and big) to put in an inline function.

Martin



Liviu

Re: STDCXX-1056 [was: Re: STDCXX forks]


On 09/05/2012 01:33 PM, Liviu Nicoara wrote:

On 09/05/12 15:17, Martin Sebor wrote:

On 09/05/2012 12:40 PM, Liviu Nicoara wrote:

On 09/05/12 14:09, Stefan Teleman wrote:

On Wed, Sep 5, 2012 at 10:52 AM, Martin Sebor mse...@gmail.com wrote:
[...]
OK so I did a little bit of testing, after looking at the *right*
__rw_guard class. :-)

I changed the std::numpunct class thusly:
[...]


I am afraid this would be unsafe, too (if I said otherwise earlier I was
wrong). [...] Thoughts?


You're right, there's still a problem. We didn't get the double
checked locking quite right.


I wish for simplicity: eliminate the lazy initialization, and fully
initialize the facet in the constructor. Then we'd need no locking on
copying its data (std::string takes care of its own copying).


I'm not sure how easily we can do that. Almost all of locale
is initialized lazily. Some of the layers might depend on the
facets being initialized lazily as well. This was a deliberate
design choice. One of the constraints was to avoid dynamic
initialization or allocation at startup. Another was to be
able to use iostreams (to a limited extent) in low memory
conditions to write an error message to stderr. It's been
a long time and the details are more than a little fuzzy.
I'll need to spend some time going through the code and
refreshing my memory.

Martin



Liviu

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-05 Thread Pavel Heimlich, a.k.a. hajma

2012/9/5 Stefan Teleman stefan.tele...@gmail.com:
 On Tue, Sep 4, 2012 at 9:15 PM, Liviu Nicoara nikko...@hates.ms wrote:

 That is good, right?

 Yes, it is good. :-)


 Could not get an Intel build off the ground on the account of the LIMITS.cpp 
 test not running to completion because of a possible compiler bug. I posted 
 earlier an account of it. Do you have a support account that allows posting 
 bug reports for Intel's C++ compiler?

 Oh, yes I remember that now, it happened to me too. I worked around it
 by copying the LIMITS executable from a gcc build.

 I don't have a support contract with Intel - I'm using the free
 compilers at home. I'm guessing Intel must have an open forum for
 Intel developers using their compilers, so that might be a way to
 report the bug. But I haven't looked for it.



 I did not have access to a Solaris box since I left Rogue Wave. Does 
 Sun/Oracle have any program that would allow one to test drive their 
 compiler on a shared box somewhere? I vaguely remember that HP had something 
 like that a while ago.

 I know Sun had such a program. Oracle I don't really know but my
 educated guess is not a chance. ;-)

Gold and above members of the oracle parter network can* get access to
a lab running various Oracle HW and SW.
For regular persons Stefan's answer applies 100%.

However there's nothing stopping you from downloading Solaris 11 iso
or a virtualbox image from
http://www.oracle.com/technetwork/server-storage/solaris11/downloads/index.html
installing it in a virtualization solution of your choice (virtualbox
works best for me) and installing Studio from
http://www.oracle.com/technetwork/server-storage/solarisstudio/downloads/index.html

As long as the installation is used for sw development and not for
production use you do not need to pay Oracle a dime, you just won't
get any updates/patches. (check the licenses fo details, I'm no
lawyer)

P.

* this was announced in April, I have no idea about the current status


 One way of testing with SunPro is to download the free version of
 SunPro from Oracle, and use it on Linux (yes, SunPro exists for Linux
 as well). It works very well on OpenSUSE and Ubuntu (tested and used
 on both myself, debugged stdcxx on OpenSUSE with SunPro 12.2. Doesn't
 work at all on Fedora 17 because of the TLS glibc bug (however it used
 to on earlier Fedora releases).

 And now for the latest about 1056:

 I managed (after many, MANY tries) to get a run of
 22.locale.numpunct.mt on Solaris SPARC 32-bit build (at work), and
 without the thread-safety patches applied, and I hit the jackpot:

 # INFO (S1) (10 lines):
 # TEXT:
 # COMPILER: SunPro, __SUNPRO_CC = 0x5100
 # ENVIRONMENT: sparc-v8 running sunos-5.11
 # FILE: 22.locale.numpunct.mt.cpp
 # COMPILED: Sep  4 2012, 16:10:03
 # COMMENT: thread safety
 

 # CLAUSE: lib.locale.numpunct

 # NOTE (S2) (5 lines):
 # TEXT: executing /usr/bin/locale -a  /var/tmp/tmpfile-MWa4bQ
 # CLAUSE: lib.locale.numpunct
 # FILE: process.cpp
 # LINE: 276

 # INFO (S1) (3 lines):
 # TEXT: testing std::numpunctcharT with 4 threads, 1000 iterations
 each, in 32 locales { C POSIX af_ZA.UTF-8 ar_AE.UTF-8
 ar_BH.UTF-8 ar_DZ.UTF-8 ar_EG.ISO8859-6 ar_EG.UTF-8
 ar_IQ.UTF-8 ar_JO.UTF-8 ar_KW.UTF-8 ar_LY.UTF-8 ar_MA.UTF-8
 ar_OM.UTF-8 ar_QA.UTF-8 ar_SA.UTF-8 ar_TN.UTF-8 ar_YE.UTF-8
 as_IN.UTF-8 az_AZ.UTF-8 be_BY.UTF-8 bg_BG.ISO8859-5
 bg_BG.UTF-8 bn_IN.UTF-8 bs_BA.ISO8859-2 bs_BA.UTF-8
 ca_ES.ISO8859-1 ca_ES.ISO8859-15 ca_ES.UTF-8 cs_CZ.ISO8859-2
 cs_CZ.UTF-8 cs_CZ.UTF-8@euro }
 # CLAUSE: lib.locale.numpunct

 [ ... snip ... ]

 # INFO (S1) (4 lines):
 # TEXT: creating a thread pool with 4 threads
 # CLAUSE: lib.locale.numpunct
 # LINE: 548

 Abort (core dump)

 /builds/steleman/userland-stdcxx-upstream-thread-safety/components/stdcxx/stdcxx-4.2.1/tests/localization/22.locale.numpunct.mt.cpp:140:
 Assertion '0 == rw_strncmp (grp.c_str (), data.grouping_.c_str ())'
 failed.
 /builds/steleman/userland-stdcxx-upstream-thread-safety/components/stdcxx/stdcxx-4.2.1/build/lib/libstdcxx4.so.4.2.1'__1cE__rwQ__rw_assert_fail6Fpkc2i2_v_+0x78
 [0xff249af8]
 /builds/steleman/userland-stdcxx-upstream-thread-safety/components/stdcxx/stdcxx-4.2.1/build/tests/22.locale.numpunct.mt'thread_func+0x500
 [0x1b050]
 /lib/libc.so.1'_lwp_start+0x0 [0xf7b553f0]

 Bingo!

 At line 140 in 22.locale.numpunct.mt.cpp, we see:

  RW_ASSERT (0 == rw_strncmp (grp.c_str (),
 data.grouping_.c_str ()));

 What this means:

 Between the time std::string grp was initialized at line 133:

 const std::string grp = np.grouping ();

 and the time we hit line 140 (the RW_ASSERT), another thread modified
 the std::string _C_grouping member of class

 template class _CharT class numpunct : public _RW::__rw_facet.

 And because this is the non-patched version of _numpunct.h, the
 variable std::string grp did not really return a deep copy of
 _C_grouping, but a shared

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-05 Thread Stefan Teleman

On Wed, Sep 5, 2012 at 4:20 PM, Stefan Teleman stefan.tele...@gmail.com wrote:

 But then there's another aspect  -- which I probably failed to
 highlight in my previous email: the per-object mutex implementation is
 20% *slower* than the class-static mutex implementation.

 class-static implementation:
 real 2139.31
 user 2406.09
 sys 155.61

 pe-object implementation:
 real 2416.75
 user 2694.64
 sys 159.49

The above results are for the Intel compiler. Now here are the results
with GCC 4.7.0:

1. with the static per-class mutex:

# INFO (S1) (10 lines):
# TEXT:
# COMPILER: gcc 4.7.0, __VERSION__ = 4.7.0 20120507 (Red Hat 4.7.0-5)
# ENVIRONMENT: pentiumpro running linux-elf (Fedora release 17 (Beefy
Miracle) (3.5.0-2.fc17.x86_64)) with glibc 2.15
# FILE: 22.locale.numpunct.mt.cpp
# COMPILED: Sep  5 2012, 06:21:18
# COMMENT: thread safety


[ ... ]

# +---+--+--+--+
# | DIAGNOSTIC|  ACTIVE  |   TOTAL  | INACTIVE |
# +---+--+--+--+
# | (S1) INFO |   11 |   11 |   0% |
# | (S2) NOTE |1 |1 |   0% |
# | (S8) ERROR|0 |3 | 100% |
# | (S9) FATAL|0 |1 | 100% |
# +---+--+--+--+
real 2165.06
user 2428.08
sys 151.30


2. With the per-object mutex:

# INFO (S1) (10 lines):
# TEXT:
# COMPILER: gcc 4.7.0, __VERSION__ = 4.7.0 20120507 (Red Hat 4.7.0-5)
# ENVIRONMENT: pentiumpro running linux-elf (Fedora release 17 (Beefy
Miracle) (3.5.0-2.fc17.x86_64)) with glibc 2.15
# FILE: 22.locale.numpunct.mt.cpp
# COMPILED: Sep  5 2012, 21:29:56
# COMMENT: thread safety


# CLAUSE: lib.locale.numpunct

 [ ... ]

# +---+--+--+--+
# | DIAGNOSTIC|  ACTIVE  |   TOTAL  | INACTIVE |
# +---+--+--+--+
# | (S1) INFO |   11 |   11 |   0% |
# | (S2) NOTE |1 |1 |   0% |
# | (S8) ERROR|0 |3 | 100% |
# | (S9) FATAL|0 |1 | 100% |
# +---+--+--+--+
real 2438.70
user 2726.44
sys 155.79

About the same percentage difference as the Intel compiler.

--Stefan

---
Stefan Teleman
KDE e.V.
stefan.tele...@gmail.com

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-05 Thread Stefan Teleman

On Wed, Sep 5, 2012 at 10:55 PM, Martin Sebor mse...@gmail.com wrote:

 I suspect the difference is due to the overhead of the repeated
 initialization and destruction of the per-object mutex in the
 test. The test repeatedly creates (and discards) named locale
 objects.

 The per-class mutex is initialized just once in the process, no
 matter how many facet objects (how many distinct named locales)
 the test creates. But the per-object mutex must be created (and
 destroyed) for each named locale.

Agreed.

But: if the choice is between an implementation which [1] breaks ABI
and [2] performs 20% worse -- even in contrived test cases -- than
another implementation [2] which doesn't break ABI, and performs
better than the first one,  why would we even consider the first one?

--Stefan

-- 
Stefan Teleman
KDE e.V.
stefan.tele...@gmail.com

Re: STDCXX-1056 [was: Re: STDCXX forks]


On 09/05/2012 09:03 PM, Stefan Teleman wrote:

On Wed, Sep 5, 2012 at 10:55 PM, Martin Sebor mse...@gmail.com wrote:


I suspect the difference is due to the overhead of the repeated
initialization and destruction of the per-object mutex in the
test. The test repeatedly creates (and discards) named locale
objects.

The per-class mutex is initialized just once in the process, no
matter how many facet objects (how many distinct named locales)
the test creates. But the per-object mutex must be created (and
destroyed) for each named locale.


Agreed.

But: if the choice is between an implementation which [1] breaks ABI
and [2] performs 20% worse -- even in contrived test cases -- than
another implementation [2] which doesn't break ABI, and performs
better than the first one,  why would we even consider the first one?


Breaking the ABI is not an option (unless we rev the version).
But I'm not sure I understand what you think breaks the ABI.
We don't need to add a new mutex -- we can use the __rw_facet
member for the locking. Or did you mean something else?

Martin



--Stefan

Re: STDCXX-1056 [was: Re: STDCXX forks]