Re: Fwd: Re: STDCXX-1071 numpunct facet defect

2012-10-27 Thread Liviu Nicoara

On 10/26/12 18:50, Martin Sebor wrote:

On 10/26/2012 06:50 AM, Liviu Nicoara wrote:

[...]
tl;dr: removing the facet data cache is a priority. All else can be put
on the back-burner.
[...]
// facet data accessor
...
if (0 == _C_impsize) {              // 1
    mutex_lock ();
    if (_C_impsize)
        return _C_data;
    _C_data = get_facet_data ();    // 2
    ??                              // 3
    _C_impsize = 1;                 // 4
    mutex_unlock ();
}
??                                  // 5
return _C_data;                     // 6
[...]
Various compilers provide these features in various forms, but at the
moment we don't have a unified STDCXX API to implement this.



[...]
I suggested moving the body of the outer if from the header
into a .cpp file in the library where we could implement
ugly, bloated locking without the risk of breaking things
if we removed/replaced it in the future. That's when we ran
into questions about how exactly to do this cleanly, etc.
It didn't seem to be doable very cleanly but I still think
it's a viable approach.


Just making sure we are talking about the same thing. My argument (and the code 
oversimplification above) was about the initialization of the std::facet data, 
not the cached numpunct data. I think we should get rid of the numpunct cache 
right away because it makes the facet unusable in MT builds.

As for the double-checked initialization (DCII) I was talking about in the previous post, I would 
tackle it, but I only have access to x86(_64). Does the foundation have shared dev machines?

Liviu


Re: Fwd: Re: STDCXX-1071 numpunct facet defect

2012-10-26 Thread Liviu Nicoara

On 10/03/12 11:10, Martin Sebor wrote:

[...]
I was just thinking of a few simple loops along the lines of:

   void* thread_func (void*) {
   for (int i = 0; i < N; ++i)
   test 1: do some simple stuff inline
   test 2: call a virtual function to do the same stuff
   test 3: lock and unlock a mutex and do the same stuff
   }

Test 1 should be the fastest and test 3 the slowest. This should
hold regardless of what simple stuff is (eventually, even when
it's getting numpunct::grouping() data).


tl;dr: removing the facet data cache is a priority. All else can be put 
on the back-burner.


Conflicting test results aside, there still is the case of the incorrect 
handling of the cached data in the facet. I don't think there is a 
disagreement on that. Considering that the std::string is moving in the 
direction of dropping the handle-body implementation, simply getting rid 
of the cache is a step in the same direction.


I think that we should preserve the lock-free reading of the facet data, 
as a benign race, but making it benign is perhaps more complicated than 
previously suggested.


As a reminder, the core of the facet access and initialization code 
essentially looks like this (pseudocode-ish):



// facet data accessor
...
if (0 == _C_impsize) {              // 1
    mutex_lock ();
    if (_C_impsize)
        return _C_data;
    _C_data = get_facet_data ();    // 2
    ??                              // 3
    _C_impsize = 1;                 // 4
    mutex_unlock ();
}
??                                  // 5
return _C_data;                     // 6
...


with question marks for missing, necessary fixes. The compiler needs to 
be prevented from re-ordering both 2-4 and 1-6. Just for the sake of 
argument I can imagine an optimization that reorders the reads in 1-6:


register void* x = _C_data;
if (_C_impsize)
    return x;

and if the loads are executed in this order, the caller will see a stale 
_C_data.


First, the 2-4 writes need to be executed in the program order. This 
needs both a compiler barrier and a store-store memory barrier that will 
keep the writes ordered.


Then, the reads in 1-6 need to be ordered such that _C_data is read 
after _C_impsize, via a compiler barrier and a load-load memory barrier 
that will preserve the program order of the loads.


Various compilers provide these features in various forms, but at the 
moment we don't have a unified STDCXX API to implement this.
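
For illustration only, here is a sketch of what (3) and (5) might look like with GCC's
__atomic builtins (available since GCC 4.7); this is not an existing STDCXX interface,
and mutex_lock/mutex_unlock/get_facet_data are the placeholders from the pseudocode above:

// Sketch only: release/acquire pairing on _C_impsize; not STDCXX code.
if (0 == __atomic_load_n (&_C_impsize, __ATOMIC_ACQUIRE)) {   // 1 + 5
    mutex_lock ();
    if (_C_impsize) {
        mutex_unlock ();
        return _C_data;
    }
    _C_data = get_facet_data ();                              // 2
    // 3/4: the release store keeps the write to _C_data ordered
    // before the write to _C_impsize
    __atomic_store_n (&_C_impsize, (size_t)1, __ATOMIC_RELEASE);
    mutex_unlock ();
}
return _C_data;                                               // 6

The acquire load covers (5): a reader that observes a non-zero _C_impsize cannot have
its subsequent load of _C_data hoisted above the flag check.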


Of course, I might be wrong. Input is appreciated.

Thanks,
Liviu



Re: Fwd: Re: STDCXX-1071 numpunct facet defect

2012-10-26 Thread Martin Sebor

On 10/26/2012 06:50 AM, Liviu Nicoara wrote:

On 10/03/12 11:10, Martin Sebor wrote:

[...]
I was just thinking of a few simple loops along the lines of:

void* thread_func (void*) {
for (int i = 0; i < N; ++i)
test 1: do some simple stuff inline
test 2: call a virtual function to do the same stuff
test 3: lock and unlock a mutex and do the same stuff
}

Test 1 should be the fastest and test 3 the slowest. This should
hold regardless of what simple stuff is (eventually, even when
it's getting numpunct::grouping() data).


tl;dr: removing the facet data cache is a priority. All else can be put
on the back-burner.

Conflicting test results aside, there still is the case of the incorrect
handling of the cached data in the facet. I don't think there is a
disagreement on that. Considering that the std::string is moving in the
direction of dropping the handle-body implementation, simply getting rid
of the cache is a step in the same direction.

I think that we should preserve the lock-free reading of the facet data,
as a benign race, but making it benign is perhaps more complicated than
previously suggested.

As a reminder, the core of the facet access and initialization code
essentially looks like this (pseudocode-ish):


// facet data accessor
...
if (0 == _C_impsize) {              // 1
    mutex_lock ();
    if (_C_impsize)
        return _C_data;
    _C_data = get_facet_data ();    // 2
    ??                              // 3
    _C_impsize = 1;                 // 4
    mutex_unlock ();
}
??                                  // 5
return _C_data;                     // 6
...


with question marks for missing, necessary fixes. The compiler needs to
be prevented from re-ordering both 2-4 and 1-6. Just for the sake of
argument I can imagine an optimization that reorders the reads in 1-6:

register void* x = _C_data;
if (_C_impsize)
    return x;

and if the loads are executed in this order, the caller will see a stale
_C_data.

First, the 2-4 writes need to be executed in the program order. This
needs both a compiler barrier and a store-store memory barrier that will
keep the writes ordered.

Then, the reads in 1-6 need to be ordered such that _C_data is read
after _C_impsize, via a compiler barrier and a load-load memory barrier
that will preserve the program order of the loads.

Various compilers provide these features in various forms, but at the
moment we don't have a unified STDCXX API to implement this.

Of course, I might be wrong. Input is appreciated.


You're right (we don't have an API for it, and it is the double-checked
locking problem mentioned earlier in this thread).
Implementing it here would bloat the function (just in terms
of source code, not necessarily object code), and potentially
introduce a binary compatibility dependency on the new code
(i.e., we wouldn't be able to remove the fix in the future
w/o breaking it).

I suggested moving the body of the outer if from the header
into a .cpp file in the library where we could implement
ugly, bloated locking without the risk of breaking things
if we removed/replaced it in the future. That's when we ran
into questions about how exactly to do this cleanly, etc.
It didn't seem to be doable very cleanly but I still think
it's a viable approach.
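
For concreteness, a rough sketch of that split, with illustrative names only (the real
declarations would differ):

// header: only the fast path stays inline
struct __rw_facet {
    void* _C_data () {
        // no locking details exposed to user code
        return _C_impsize ? _C_impdata : _C_get_data ();
    }
    void* _C_get_data ();      // out of line, in the library

    size_t _C_impsize;
    void*  _C_impdata;
};

// facet.cpp: the ugly, bloated locking lives here and can be replaced
// or removed later without touching the inline fast path
void* __rw_facet::_C_get_data ()
{
    // lock a library-internal mutex, with whatever barriers turn out to be needed
    if (0 == _C_impsize) {
        _C_impdata = /* look up the facet data (placeholder) */ 0;
        _C_impsize = 1;
    }
    // unlock
    return _C_impdata;
}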

Martin



Thanks,
Liviu





Re: Fwd: Re: STDCXX-1071 numpunct facet defect

2012-10-21 Thread Liviu Nicoara

On 10/03/12 11:10, Martin Sebor wrote:

[...]
I was just thinking of a few simple loops along the lines of:



tl;dr: I attribute the results of the multi-threaded performance tests (12S builds, 
Intel/AMD multi-core machines) to heavy contention in the copying of 
reference-counted std::string objects.

The long version: I went back to the original test case (attached, t.cpp). As a 
reminder, the program is a simplification of the library numpunct facet 
creation process, which can be compiled either with data caching enabled or 
disabled.

I ran the perf tool (https://perf.wiki.kernel.org/index.php/Main_Page) on the 
program, and I noticed marked differences in the number of instructions and 
cycles (an order of magnitude) and also in the number of CPU-migrations and 
context-switches (way larger for the cached version):

 Performance counter stats for './t':

  121168.026301  task-clock-msecs #  3.778 CPUs
6133377  context-switches #  0.051 M/sec
  35457  CPU-migrations   #  0.000 M/sec
298  page-faults  #  0.000 M/sec
   338006437277  cycles   #   2789.568 M/sec
   136554855096  instructions #  0.404 IPC
31298346444  branches #258.305 M/sec
 2305761092  branch-misses#  7.367 %
 1592157186  cache-references # 13.140 M/sec
1729755  cache-misses #  0.014 M/sec

   32.070381363  seconds time elapsed

vs.

 Performance counter stats for './t':

   16194.235282  task-clock-msecs #  3.883 CPUs
   7491  context-switches #  0.000 M/sec
 47  CPU-migrations   #  0.000 M/sec
311  page-faults  #  0.000 M/sec
45210281987  cycles   #   2791.752 M/sec
76784904688  instructions #  1.698 IPC
18891711522  branches #   1166.570 M/sec
 3162251552  branch-misses# 16.739 %
3332839  cache-references #  0.206 M/sec
 119424  cache-misses #  0.007 M/sec

4.170904279  seconds time elapsed


Then I ran more detailed reports on the two runs. The report for the cached 
version is dominated by what appears to be thread synchronization overhead, 
whereas the non-cached version's is not:

# Events: 134K cycles
#
# Overhead  Command   Shared Object   Symbol
#   ...  ..  ...
#
50.69%  t  [kernel.kallsyms]   [k] _raw_spin_lock
 5.46%  t  libpthread-2.13.so  [.] pthread_mutex_lock
 4.16%  t  libpthread-2.13.so  [.] __pthread_mutex_unlock
 3.65%  t  libpthread-2.13.so  [.] __lll_lock_wait
 3.35%  t  [kernel.kallsyms]   [k] copy_user_generic_string
 3.09%  t  40395b              [.] 40395b
 2.91%  t  [kernel.kallsyms]   [k] hash_futex
 2.87%  t  [kernel.kallsyms]   [k] futex_wake
 2.67%  t  t                   [.] f
 2.37%  t  [kernel.kallsyms]   [k] futex_wait_setup
 1.61%  t  [kernel.kallsyms]   [k] system_call
 1.00%  t  [kernel.kallsyms]   [k] system_call_after_swapgs
 0.98%  t  [kernel.kallsyms]   [k] __switch_to
 0.96%  t  libpthread-2.13.so  [.] __lll_unlock_wake
 0.95%  t  [kernel.kallsyms]   [k] get_futex_key
 0.92%  t  libc-2.13.so        [.] __strlen_sse42
 [...]


vs.


# Events: 15K cycles
#
# Overhead  Command   Shared Object       Symbol
#   ...  ..  ..
#
49.55%  :29571  7f1ed3830a86        [.] 7f1ed3830a86
 8.67%  t       libc-2.13.so        [.] __malloc
 8.03%  t       libc-2.13.so        [.] cfree
 4.88%  t       t                   [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&)
 4.67%  t       libc-2.13.so        [.] __strlen_sse42
 4.14%  t       libpthread-2.13.so  [.] __pthread_mutex_unlock
 3.62%  t       libpthread-2.13.so  [.] pthread_mutex_lock
 2.64%  t       t                   [.] f
 2.40%  t       libc-2.13.so        [.] _int_malloc
 2.10%  t       libc-2.13.so        [.] _int_free
 1.53%  t       libpthread-2.13.so  [.] __pthread_mutex_init_internal
 1.33%  t       t                   [.] operator new(unsigned long)
 1.08%  t       t                   [.] X::numpunct::do_grouping() const
 1.04%  t       libc-2.13.so        [.] __memcpy_sse2
 

Re: Fwd: Re: STDCXX-1071 numpunct facet defect

2012-10-12 Thread Liviu Nicoara

On 10/04/12 22:41, Liviu Nicoara wrote:

On 10/4/12 10:10 PM, Liviu Nicoara wrote:


On 10/3/12 11:10 AM, Martin Sebor wrote:

On 10/03/2012 07:01 AM, Liviu Nicoara wrote:



   void* thread_func (void*) {
   for (int i = 0; i < N; ++i)
   test 1: do some simple stuff inline
   test 2: call a virtual function to do the same stuff
   test 3: lock and unlock a mutex and do the same stuff
   }

Test 1 should be the fastest and test 3 the slowest. This should
hold regardless of what simple stuff is (eventually, even when
it's getting numpunct::grouping() data).


That is expected; I attached test case x.cpp and results-x.txt.

I did not find it too interesting in its own, though. The difference
between the
cached and non-cached data is that in the case of the cached data the
copying of
the string involves nothing more than a bump in the reference counter,
whereas
in the non-cached version a string object is constructed anew, and
memory gets
allocated for its body. Yet, in my measurements, the cached version is
the one
which shows the worse performance.

So, I extracted the std::string class and simplified it down and put
it in
another test case. That would be u.cpp and the results are
results-u.txt. The
results show the same performance trends although the absolute values
have
skewed. Will get back to this after I digest the results a bit more.


This discussion has stagnated lately. I hope to allocate some time this 
weekend to go over the results, re-verify them, and analyze the 
disassembly. In the meantime, any thoughts?


Liviu



Re: Fwd: Re: STDCXX-1071 numpunct facet defect

2012-10-04 Thread Liviu Nicoara


On 10/3/12 11:10 AM, Martin Sebor wrote:

On 10/03/2012 07:01 AM, Liviu Nicoara wrote:


I am gathering some more measurements along these lines but it's time
consuming. I estimate I will have some ready for review later today or
tomorrow. In the meantime could you please post your kernel, glibc and
compiler versions?


I was just thinking of a few simple loops along the lines of:

   void* thread_func (void*) {
   for (int i = 0; i < N; ++i)
   test 1: do some simple stuff inline
   test 2: call a virtual function to do the same stuff
   test 3: lock and unlock a mutex and do the same stuff
   }

Test 1 should be the fastest and test 3 the slowest. This should
hold regardless of what simple stuff is (eventually, even when
it's getting numpunct::grouping() data).


That is expected; I attached test case x.cpp and results-x.txt.

I did not find it too interesting in its own right, though. The difference between the 
cached and non-cached data is that in the cached case copying the string involves 
nothing more than a bump of the reference counter, whereas in the non-cached version 
a string object is constructed anew and memory is allocated for its body. Yet, in my 
measurements, the cached version is the one that shows the worse performance.
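
For illustration, a much-simplified sketch (not stdcxx code) of the two copy paths being
compared; the mutex around the reference count mirrors what the library does when
_RWSTD_NO_STRING_ATOMIC_OPS is in effect:

#include <cstring>
#include <pthread.h>

static pthread_mutex_t ref_lock = PTHREAD_MUTEX_INITIALIZER;

// cached path: copies share one representation; every copy and every
// destruction takes the same mutex just to adjust the counter, so all
// threads serialize on it
struct ref_string {
    struct rep { long refs; char data [8]; };
    rep* body;

    explicit ref_string (rep* r) : body (r) { }

    ref_string (const ref_string& rhs) : body (rhs.body) {
        pthread_mutex_lock (&ref_lock);
        ++body->refs;
        pthread_mutex_unlock (&ref_lock);
    }

    ~ref_string () {
        pthread_mutex_lock (&ref_lock);
        --body->refs;   // the body is never freed in this sketch
        pthread_mutex_unlock (&ref_lock);
    }
};

// non-cached path: each call builds an independent object; it costs an
// allocation and a copy, but threads share no mutable state
struct plain_string {
    char data [8];
    explicit plain_string (const char* s) {
        std::strncpy (data, s, sizeof data - 1);
        data [sizeof data - 1] = '\0';
    }
};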


So, I extracted the std::string class, simplified it down, and put it in 
another test case. That would be u.cpp, and the results are in results-u.txt. The 
results show the same performance trends, although the absolute values have 
shifted. I will get back to this after I digest the results a bit more.


Liviu
-*- mode: org -*-

* iMac, 4x Core i5 , 12S, gcc 4.5.4:

$ nice make u

16, 1   1m18.811s   5m12.329s   0m0.263s
 8, 1   0m39.919s   2m36.198s   0m0.150s
 4, 1   0m20.449s   1m20.797s   0m0.050s
 2, 1   0m9.888s    0m19.725s   0m0.005s
 1, 1   0m2.483s    0m2.480s    0m0.002s

$ nice make CPPOPTS=-DNO_CACHE u

16, 1   0m37.418s   2m27.822s   0m0.872s
 8, 1   0m18.844s   1m14.607s   0m0.261s
 4, 1   0m10.165s   0m40.147s   0m0.023s
 2, 1   0m8.652s    0m17.278s   0m0.003s
 1, 1   0m8.482s    0m8.473s    0m0.007s

$ nice make CPPOPTS=-DNO_VIRTUAL_CALL u

16, 1   1m2.770s    4m9.307s    0m0.179s
 8, 1   0m31.890s   2m6.792s    0m0.087s
 4, 1   0m16.427s   1m5.133s    0m0.039s
 2, 1   0m8.497s    0m16.981s   0m0.007s
 1, 1   0m2.291s    0m2.288s    0m0.002s

$ nice make CPPOPTS="-DNO_CACHE -DNO_VIRTUAL_CALL" u

16, 1   0m35.838s   2m21.406s   0m0.877s
 8, 1   0m19.007s   1m14.920s   0m0.255s
 4, 1   0m10.099s   0m39.504s   0m0.042s
 2, 1   0m8.599s    0m17.190s   0m0.003s
 1, 1   0m8.986s    0m8.980s    0m0.005s

* Linux Slackware, 16x AMD Opteron, 12S, gcc 4.5.2

$ nice make u



$ nice make CPPOPTS=-DNO_CACHE u


$ nice make CPPOPTS=-DNO_VIRTUAL_CALL u

$ nice make CPPOPTS="-DNO_CACHE -DNO_VIRTUAL_CALL" u

-*- mode: org -*-

* iMac, 4x Core i5 , 12S, gcc 4.5.4:

$ nice make CPPOPTS="-DNO_LOCK -DNO_VIRTUAL_CALL" u

16, 1   0m7.864s    0m30.259s   0m0.035s
 8, 1   0m4.396s    0m17.034s   0m0.016s
 4, 1   0m2.729s    0m10.473s   0m0.011s
 2, 1   0m2.481s    0m4.929s    0m0.003s
 1, 1   0m2.461s    0m2.455s    0m0.002s

$ nice make CPPOPTS=-DNO_LOCK u

16, 1   0m9.724s    0m37.455s   0m0.043s
 8, 1   0m5.559s    0m20.309s   0m0.048s
 4, 1   0m3.160s    0m12.213s   0m0.013s
 2, 1   0m2.872s    0m5.694s    0m0.004s
 1, 1   0m2.845s    0m2.838s    0m0.002s

$ nice make u

16, 1   1m3.745s    3m58.570s   0m0.351s
 8, 1   0m32.351s   1m55.740s   0m0.203s
 4, 1   0m16.852s   1m1.633s    0m0.092s
 2, 1   0m8.419s    0m16.699s   0m0.010s
 1, 1   0m4.214s    0m4.179s    0m0.005s

* Linux Slackware, 16x AMD Opteron, 12S, gcc 4.7.1

$ nice make CPPOPTS="-DNO_LOCK -DNO_VIRTUAL_CALL" u

16, 1   0m4.382s    1m9.896s    0m0.004s
 8, 1   0m4.374s    0m34.904s   0m0.002s
 4, 1   0m4.368s    0m17.445s   0m0.002s
 2, 1   0m4.366s    0m8.720s    0m0.003s
 1, 1   0m4.355s    0m4.351s    0m0.001s

$ nice make CPPOPTS=-DNO_LOCK u

16, 1   0m5.415s    1m19.833s   0m0.005s
 8, 1   0m4.939s    0m39.438s   0m0.003s
 4, 1   0m4.936s    0m19.712s   0m0.001s
 2, 1   0m4.930s    0m9.847s    0m0.002s
 1, 1   0m4.921s    0m4.917s    0m0.002s

$ nice make u

16, 1   1m40.769s   24m17.198s  0m0.006s
 8, 1   0m51.702s   6m15.400s  

Re: Fwd: Re: STDCXX-1071 numpunct facet defect

2012-10-04 Thread Liviu Nicoara

On 10/4/12 10:10 PM, Liviu Nicoara wrote:


On 10/3/12 11:10 AM, Martin Sebor wrote:

On 10/03/2012 07:01 AM, Liviu Nicoara wrote:


I am gathering some more measurements along these lines but it's time
consuming. I estimate I will have some ready for review later today or
tomorrow. In the meantime could you please post your kernel, glibc and
compiler versions?


I was just thinking of a few simple loops along the lines of:

   void* thread_func (void*) {
   for (int i = 0; i < N; ++i)
   test 1: do some simple stuff inline
   test 2: call a virtual function to do the same stuff
   test 3: lock and unlock a mutex and do the same stuff
   }

Test 1 should be the fastest and test 3 the slowest. This should
hold regardless of what simple stuff is (eventually, even when
it's getting numpunct::grouping() data).


That is expected; I attached test case x.cpp and results-x.txt.

I did not find it too interesting in its own, though. The difference between the
cached and non-cached data is that in the case of the cached data the copying of
the string involves nothing more than a bump in the reference counter, whereas
in the non-cached version a string object is constructed anew, and memory gets
allocated for its body. Yet, in my measurements, the cached version is the one
which shows the worse performance.

So, I extracted the std::string class and simplified it down and put it in
another test case. That would be u.cpp and the results are results-u.txt. The
results show the same performance trends although the absolute values have
skewed. Will get back to this after I digest the results a bit more.


Attached some incomplete files previously. Corrected.

Liviu
-*- mode: org -*-

* iMac, 4x Core i5 , 12S, gcc 4.5.4:

$ nice make u

16, 1   1m18.811s   5m12.329s   0m0.263s
 8, 1   0m39.919s   2m36.198s   0m0.150s
 4, 1   0m20.449s   1m20.797s   0m0.050s
 2, 1   0m9.888s    0m19.725s   0m0.005s
 1, 1   0m2.483s    0m2.480s    0m0.002s

$ nice make CPPOPTS=-DNO_CACHE u

16, 1   0m37.418s   2m27.822s   0m0.872s
 8, 1   0m18.844s   1m14.607s   0m0.261s
 4, 1   0m10.165s   0m40.147s   0m0.023s
 2, 1   0m8.652s    0m17.278s   0m0.003s
 1, 1   0m8.482s    0m8.473s    0m0.007s

$ nice make CPPOPTS=-DNO_VIRTUAL_CALL u

16, 1   1m2.770s    4m9.307s    0m0.179s
 8, 1   0m31.890s   2m6.792s    0m0.087s
 4, 1   0m16.427s   1m5.133s    0m0.039s
 2, 1   0m8.497s    0m16.981s   0m0.007s
 1, 1   0m2.291s    0m2.288s    0m0.002s

$ nice make CPPOPTS="-DNO_CACHE -DNO_VIRTUAL_CALL" u

16, 1   0m35.838s   2m21.406s   0m0.877s
 8, 1   0m19.007s   1m14.920s   0m0.255s
 4, 1   0m10.099s   0m39.504s   0m0.042s
 2, 1   0m8.599s    0m17.190s   0m0.003s
 1, 1   0m8.986s    0m8.980s    0m0.005s

* Linux Slackware, 16x AMD Opteron, 12S, gcc 4.5.2

$ nice make u

16, 1   2m7.644s    30m10.014s  0m0.006s
 8, 1   1m4.449s    7m45.111s   0m0.004s
 4, 1   0m31.646s   2m1.438s    0m0.002s
 2, 1   0m14.828s   0m29.637s   0m0.002s
 1, 1   0m3.223s    0m3.220s    0m0.002s

$ nice make CPPOPTS=-DNO_CACHE u

16, 1   0m10.730s   2m46.032s   0m2.204s
 8, 1   0m9.922s    1m18.892s   0m0.008s
 4, 1   0m9.874s    0m39.192s   0m0.003s
 2, 1   0m9.652s    0m19.268s   0m0.004s
 1, 1   0m9.622s    0m9.617s    0m0.002s

$ nice make CPPOPTS=-DNO_VIRTUAL_CALL u

16, 1   1m54.812s   27m5.957s   0m0.007s
 8, 1   0m53.862s   6m26.391s   0m0.003s
 4, 1   0m26.287s   1m39.991s   0m0.001s
 2, 1   0m9.120s    0m17.233s   0m0.002s
 1, 1   0m3.004s    0m3.002s    0m0.001s

$ nice make CPPOPTS="-DNO_CACHE -DNO_VIRTUAL_CALL" u

16, 1   0m9.717s    2m34.412s   0m0.401s
 8, 1   0m9.613s    1m16.762s   0m0.007s
 4, 1   0m9.590s    0m38.285s   0m0.004s
 2, 1   0m9.629s    0m19.165s   0m0.002s
 1, 1   0m9.361s    0m9.357s    0m0.001s
-*- mode: org -*-

* iMac, 4x Core i5 , 12S, gcc 4.5.4:

$ nice make CPPOPTS="-DNO_LOCK -DNO_VIRTUAL_CALL" u

16, 1   0m7.864s    0m30.259s   0m0.035s
 8, 1   0m4.396s    0m17.034s   0m0.016s
 4, 1   0m2.729s    0m10.473s   0m0.011s
 2, 1   0m2.481s    0m4.929s    0m0.003s
 1, 1   0m2.461s    0m2.455s    0m0.002s

$ nice make CPPOPTS=-DNO_LOCK u

16, 1   0m9.724s    0m37.455s   0m0.043s
 8, 1   0m5.559s    0m20.309s   0m0.048s
 4, 1   0m3.160s    0m12.213s   0m0.013s
 2, 

Re: Fwd: Re: STDCXX-1071 numpunct facet defect

2012-10-03 Thread Liviu Nicoara

On 10/02/12 10:41, Martin Sebor wrote:

I haven't had time to look at this since my last email on
Sunday. I also forgot about the string mutex. I don't think
I'll have time to spend on this until later in the week.
Unless the disassembly reveals the smoking gun, I think we
might need to simplify the test to get to the bottom of the
differences in our measurements. (I.e., eliminate the library
and measure the runtime of a simple thread loop, with and
without locking.) We should also look at the GLIBC and
kernel versions on our systems, on the off chance that
there has been a change that could explain the discrepancy
between my numbers and yours. I suspect my system (RHEL 4.8)
is much older than yours (I don't remember now if you posted
your details).


I am gathering some more measurements along these lines but it's time 
consuming. I estimate I will have some ready for review later today or 
tomorrow. In the meantime could you please post your kernel, glibc and 
compiler versions?


Liviu



Martin

On 10/02/2012 06:22 AM, Liviu Nicoara wrote:

On 09/30/12 18:18, Martin Sebor wrote:

I see you did a 64-bit build while I did a 32-bit one. so
I tried 64-bits. The cached version (i.e., the one compiled
with -UNO_USE_NUMPUNCT_CACHE) is still about twice as fast
as the non-cached one (compiled with -DNO_USE_NUMPUNCT_CACHE).

I had made one change to the test program that I thought might
account for the difference: I removed the call to abort from
the thread function since it was causing the process to exit
prematurely in some of my tests. But since you used the
modified program for your latest measurements that couldn't
be it.

I can't explain the differences. They just don't make sense
to me. Your results should be the other way around. Can you
post the disassembly of function f() for each of the two
configurations of the test?


The first thing that struck me in the cached `f' was that __string_ref
class uses a mutex for synchronizing access to the ref counter. It turns
out, for Linux on AMD64 we explicitly use a mutex instead of the atomic
ops on the ref counter, via a block in rw/_config.h:

# if _RWSTD_VER_MAJOR < 5
# ifdef _RWSTD_OS_LINUX
// on Linux/AMD64, unless explicitly requested, disable the use
// of atomic operations in string for binary compatibility with
// stdcxx 4.1.x
# ifndef _RWSTD_USE_STRING_ATOMIC_OPS
# define _RWSTD_NO_STRING_ATOMIC_OPS
# endif // _RWSTD_USE_STRING_ATOMIC_OPS
# endif // _WIN32
# endif // stdcxx < 5.0


That is not the cause of the performance difference, though. Even after
building with _RWSTD_USE_STRING_ATOMIC_OPS defined I still see better
performance with the non-cached version.

Liviu




Re: Fwd: Re: STDCXX-1071 numpunct facet defect

2012-10-03 Thread Martin Sebor

On 10/03/2012 07:01 AM, Liviu Nicoara wrote:

On 10/02/12 10:41, Martin Sebor wrote:

I haven't had time to look at this since my last email on
Sunday. I also forgot about the string mutex. I don't think
I'll have time to spend on this until later in the week.
Unless the disassembly reveals the smoking gun, I think we
might need to simplify the test to get to the bottom of the
differences in our measurements. (I.e., eliminate the library
and measure the runtime of a simple thread loop, with and
without locking.) We should also look at the GLIBC and
kernel versions on our systems, on the off chance that
there has been a change that could explain the discrepancy
between my numbers and yours. I suspect my system (RHEL 4.8)
is much older than yours (I don't remember now if you posted
your details).


I am gathering some more measurements along these lines but it's time
consuming. I estimate I will have some ready for review later today or
tomorrow. In the meantime could you please post your kernel, glibc and
compiler versions?


I was just thinking of a few simple loops along the lines of:

  void* thread_func (void*) {
  for (int i = 0; i < N; ++i)
  test 1: do some simple stuff inline
  test 2: call a virtual function to do the same stuff
  test 3: lock and unlock a mutex and do the same stuff
  }

Test 1 should be the fastest and test 3 the slowest. This should
hold regardless of what simple stuff is (eventually, even when
it's getting numpunct::grouping() data).
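
One possible concrete form of those loops (a sketch, not the test program actually
used; TEST_1/TEST_2 and the helper names are made up for illustration; build with
-pthread and -DTEST_1, -DTEST_2, or neither):

#include <pthread.h>

#define N 10000000

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static volatile long counter;

static void work_inline () { ++counter; }

struct B { virtual ~B () { } virtual void work () = 0; };
struct D : B { virtual void work () { ++counter; } };
static B* const pb = new D;

extern "C" void* thread_func (void*)
{
    for (int i = 0; i < N; ++i) {
#if defined (TEST_1)
        work_inline ();              // test 1: simple stuff inline
#elif defined (TEST_2)
        pb->work ();                 // test 2: the same stuff via a virtual call
#else
        pthread_mutex_lock (&mtx);   // test 3: the same stuff under a mutex
        work_inline ();
        pthread_mutex_unlock (&mtx);
#endif
    }
    return 0;
}

int main ()
{
    pthread_t tid [16];
    for (int i = 0; i < 16; ++i)
        pthread_create (tid + i, 0, thread_func, 0);
    for (int i = 0; i < 16; ++i)
        pthread_join (tid [i], 0);
    return 0;
}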

For the Linux tests I used a 16 CPU (Xeon X5570 @ 3GHz) box with
RHEL 4.8 with 2.6.9-89.0.11.ELlargesmp, GLIBC version is 2.3.4,
and GCC 3.4.6.

Martin



Liviu



Martin

On 10/02/2012 06:22 AM, Liviu Nicoara wrote:

On 09/30/12 18:18, Martin Sebor wrote:

I see you did a 64-bit build while I did a 32-bit one. so
I tried 64-bits. The cached version (i.e., the one compiled
with -UNO_USE_NUMPUNCT_CACHE) is still about twice as fast
as the non-cached one (compiled with -DNO_USE_NUMPUNCT_CACHE).

I had made one change to the test program that I thought might
account for the difference: I removed the call to abort from
the thread function since it was causing the process to exit
prematurely in some of my tests. But since you used the
modified program for your latest measurements that couldn't
be it.

I can't explain the differences. They just don't make sense
to me. Your results should be the other way around. Can you
post the disassembly of function f() for each of the two
configurations of the test?


The first thing that struck me in the cached `f' was that __string_ref
class uses a mutex for synchronizing access to the ref counter. It turns
out, for Linux on AMD64 we explicitly use a mutex instead of the atomic
ops on the ref counter, via a block in rw/_config.h:

# if _RWSTD_VER_MAJOR < 5
# ifdef _RWSTD_OS_LINUX
// on Linux/AMD64, unless explicitly requested, disable the use
// of atomic operations in string for binary compatibility with
// stdcxx 4.1.x
# ifndef _RWSTD_USE_STRING_ATOMIC_OPS
# define _RWSTD_NO_STRING_ATOMIC_OPS
# endif // _RWSTD_USE_STRING_ATOMIC_OPS
# endif // _WIN32
# endif // stdcxx < 5.0


That is not the cause of the performance difference, though. Even after
building with _RWSTD_USE_STRING_ATOMIC_OPS defined I still see better
performance with the non-cached version.

Liviu






Re: Fwd: Re: STDCXX-1071 numpunct facet defect

2012-10-02 Thread Liviu Nicoara

On 09/30/12 18:18, Martin Sebor wrote:

I see you did a 64-bit build while I did a 32-bit one. so
I tried 64-bits. The cached version (i.e., the one compiled
with -UNO_USE_NUMPUNCT_CACHE) is still about twice as fast
as the non-cached one (compiled with -DNO_USE_NUMPUNCT_CACHE).

I had made one change to the test program that I thought might
account for the difference: I removed the call to abort from
the thread function since it was causing the process to exit
prematurely in some of my tests. But since you used the
modified program for your latest measurements that couldn't
be it.

I can't explain the differences. They just don't make sense
to me. Your results should be the other way around. Can you
post the disassembly of function f() for each of the two
configurations of the test?


The first thing that struck me in the cached `f' was that __string_ref 
class uses a mutex for synchronizing access to the ref counter. It turns 
out, for Linux on AMD64 we explicitly use a mutex instead of the atomic 
ops on the ref counter, via a block in rw/_config.h:


#  if _RWSTD_VER_MAJOR < 5
#    ifdef _RWSTD_OS_LINUX
       // on Linux/AMD64, unless explicitly requested, disable the use
       // of atomic operations in string for binary compatibility with
       // stdcxx 4.1.x
#      ifndef _RWSTD_USE_STRING_ATOMIC_OPS
#        define _RWSTD_NO_STRING_ATOMIC_OPS
#      endif   // _RWSTD_USE_STRING_ATOMIC_OPS
#    endif   // _WIN32
#  endif   // stdcxx < 5.0


That is not the cause of the performance difference, though. Even after 
building with _RWSTD_USE_STRING_ATOMIC_OPS defined I still see better 
performance with the non-cached version.


Liviu


Re: Fwd: Re: STDCXX-1071 numpunct facet defect

2012-10-02 Thread Martin Sebor

I haven't had time to look at this since my last email on
Sunday. I also forgot about the string mutex. I don't think
I'll have time to spend on this until later in the week.
Unless the disassembly reveals the smoking gun, I think we
might need to simplify the test to get to the bottom of the
differences in our measurements. (I.e., eliminate the library
and measure the runtime of a simple thread loop, with and
without locking.) We should also look at the GLIBC and
kernel versions on our systems, on the off chance that
there has been a change that could explain the discrepancy
between my numbers and yours. I suspect my system (RHEL 4.8)
is much older than yours (I don't remember now if you posted
your details).

Martin

On 10/02/2012 06:22 AM, Liviu Nicoara wrote:

On 09/30/12 18:18, Martin Sebor wrote:

I see you did a 64-bit build while I did a 32-bit one. so
I tried 64-bits. The cached version (i.e., the one compiled
with -UNO_USE_NUMPUNCT_CACHE) is still about twice as fast
as the non-cached one (compiled with -DNO_USE_NUMPUNCT_CACHE).

I had made one change to the test program that I thought might
account for the difference: I removed the call to abort from
the thread function since it was causing the process to exit
prematurely in some of my tests. But since you used the
modified program for your latest measurements that couldn't
be it.

I can't explain the differences. They just don't make sense
to me. Your results should be the other way around. Can you
post the disassembly of function f() for each of the two
configurations of the test?


The first thing that struck me in the cached `f' was that __string_ref
class uses a mutex for synchronizing access to the ref counter. It turns
out, for Linux on AMD64 we explicitly use a mutex instead of the atomic
ops on the ref counter, via a block in rw/_config.h:

# if _RWSTD_VER_MAJOR < 5
# ifdef _RWSTD_OS_LINUX
// on Linux/AMD64, unless explicitly requested, disable the use
// of atomic operations in string for binary compatibility with
// stdcxx 4.1.x
# ifndef _RWSTD_USE_STRING_ATOMIC_OPS
# define _RWSTD_NO_STRING_ATOMIC_OPS
# endif // _RWSTD_USE_STRING_ATOMIC_OPS
# endif // _WIN32
# endif // stdcxx < 5.0


That is not the cause of the performance difference, though. Even after
building with _RWSTD_USE_STRING_ATOMIC_OPS defined I still see better
performance with the non-cached version.

Liviu




Re: STDCXX-1071 numpunct facet defect

2012-09-30 Thread Martin Sebor

[Sending to dev after accidentally replying just to Liviu]

On 09/30/2012 12:09 PM, Martin Sebor wrote:

On 09/27/2012 06:36 PM, Liviu Nicoara wrote:

On 9/27/12 8:27 PM, Martin Sebor wrote:

On 09/27/2012 06:41 AM, Liviu Nicoara wrote:

On 09/26/12 20:12, Liviu Nicoara wrote:

I have created STDCXX-1071 and linked to STDCXX-1056. [...]

I am open to all questions, the more the better. Most of my opinions
have been expressed earlier, but please ask if you want to know more.



I am attaching here the proposed (4.3.x) patch and the timings results
(after re-verifying the correctness of the timing program and the
results). The 4.2.x patch, the 4.3.x patch, the test program and the
results file are also attached to the incident.


The patch isn't binary compatible. We can't remove data members
in a minor release. We could only do it in a major release.


There are two of them, one for 4.2.x, one for 4.3.x.



I'm still curious about the performance, though. It doesn't make
sense to me that a call to a virtual function is faster than one
to an almost trivial inline function.


I scratched my head long on that. I wager that it depends on the
machine, too.


Here are my timings for library-reduction.cpp when compiled
GCC 4.5.3 on Solaris 10 (4 SPARCV9 CPUs). I had to make a small
number of trivial changes to get it to compile:

With cache No cache
real 1m38.332s 8m58.568s
user 6m30.244s 34m25.942s
sys 0m0.060s 0m3.922s

I also experimented with the program on Linux (CEL 4 with 16
CPUs). Initially, I saw no differences between the two versions.
So I modified it a bit to make it closer to the library (the
modified program is attached). With those changes the timings
are below:

With cache No cache
real 0m 1.107s 0m 5.669s
user 0m17.204s 0m 5.669s
sys 0m 0.000s 0m22.347s

I also recompiled and re-ran the test on Solaris. To speed
things along, I set the number threads and loops to 8 and
100. The numbers are as follows:

With cache No cache
real 0m3.341s 0m26.333s
user 0m13.052s 1m37.470s
sys 0m0.009s 0m0.132s

The numbers match my expectation. The overhead without the
numpunct cache is considerable.

Somewhat unexpectedly, the test with the cache didn't crash.

I also tried timing the numpunct patch attached to the issue
with the test program. The test program runs to completion
without the patch but it crashes with a SIGSEGV after the
patch is applied. That suggests there's a bug somewhere in
do_thousands_sep() (in addition to the caching).

Martin



Let me know if there is anything I could help with.

Liviu






Fwd: Re: STDCXX-1071 numpunct facet defect

2012-09-30 Thread Liviu Nicoara

Forwarding with the attachment.

 Original Message 
Subject: Re: STDCXX-1071 numpunct facet defect
Date: Sun, 30 Sep 2012 12:09:10 -0600
From: Martin Sebor mse...@gmail.com
To: Liviu Nicoara nikko...@hates.ms

On 09/27/2012 06:36 PM, Liviu Nicoara wrote:

On 9/27/12 8:27 PM, Martin Sebor wrote:

On 09/27/2012 06:41 AM, Liviu Nicoara wrote:

On 09/26/12 20:12, Liviu Nicoara wrote:

I have created STDCXX-1071 and linked to STDCXX-1056. [...]

I am open to all questions, the more the better. Most of my opinions
have been expressed earlier, but please ask if you want to know more.



I am attaching here the proposed (4.3.x) patch and the timings results
(after re-verifying the correctness of the timing program and the
results). The 4.2.x patch, the 4.3.x patch, the test program and the
results file are also attached to the incident.


The patch isn't binary compatible. We can't remove data members
in a minor release. We could only do it in a major release.


There are two of them, one for 4.2.x, one for 4.3.x.



I'm still curious about the performance, though. It doesn't make
sense to me that a call to a virtual function is faster than one
to an almost trivial inline function.


I scratched my head long on that. I wager that it depends on the
machine, too.


Here are my timings for library-reduction.cpp when compiled
GCC 4.5.3 on Solaris 10 (4 SPARCV9 CPUs). I had to make a small
number of trivial changes to get it to compile:

         With cache    No cache
real     1m38.332s     8m58.568s
user     6m30.244s     34m25.942s
sys      0m0.060s      0m3.922s

I also experimented with the program on Linux (CEL 4 with 16
CPUs). Initially, I saw no differences between the two versions.
So I modified it a bit to make it closer to the library (the
modified program is attached). With those changes the timings
are below:

         With cache    No cache
real     0m 1.107s     0m 5.669s
user     0m17.204s     0m 5.669s
sys      0m 0.000s     0m22.347s

I also recompiled and re-ran the test on Solaris. To speed
things along, I set the number threads and loops to 8 and
100. The numbers are as follows:

         With cache    No cache
real     0m3.341s      0m26.333s
user     0m13.052s     1m37.470s
sys      0m0.009s      0m0.132s

The numbers match my expectation. The overhead without the
numpunct cache is considerable.

Somewhat unexpectedly, the test with the cache didn't crash.

I also tried timing the numpunct patch attached to the issue
with the test program. The test program runs to completion
without the patch but it crashes with a SIGSEGV after the
patch is applied. That suggests there's a bug somewhere in
do_thousands_sep() (in addition to the caching).

Martin



Let me know if there is anything I could help with.

Liviu





#include <iostream>
#include <locale>

#include <cstdio>
#include <cstdlib>
#include <cstring>

#include <pthread.h>
#include <unistd.h>

#define MAX_THREADS 128

static long nloops = 1000, nthreads = 16;
static bool volatile pwait = true;



namespace X {

struct guard 
{
    guard (pthread_mutex_t* ptr) : lock_ (ptr) { 
        pthread_mutex_lock (lock_);
    }

    ~guard () { 
        pthread_mutex_unlock (lock_); 
    }

private:

    guard (guard const&);
    guard& operator= (guard const&);

private:

    pthread_mutex_t* lock_;
};

struct facet
{
    facet () {
        pthread_mutex_init (&_C_mutex, 0);
    }

    void* _C_data () {
        return _C_impsize ? _C_impdata : _C_get_data ();
    }

    void* _C_get_data ();

    size_t _C_impsize;
    void*  _C_impdata;

    pthread_mutex_t _C_mutex;
};

void* facet::_C_get_data ()
{
    if (_C_impsize)
        return _C_impdata;

    guard g (&_C_mutex);

    if (_C_impsize)
        return _C_impdata;

#if !defined (NO_USE_STDCXX_LOCALES)
    _C_impdata = (void*)"\3\3";
#endif // USE_STDCXX_LOCALES
    _C_impsize = (size_t)(-1);

    return _C_impdata;
}

void* get_data (facet*);

struct numpunct : facet
{
    numpunct () : _C_flags () { }

    virtual std::string do_grouping () const;

    std::string grouping () const {
        numpunct* const __self = (numpunct*)this;

#if !defined (NO_USE_NUMPUNCT_CACHE)
        if (!(_C_flags & 1)) {
            __self->_C_flags    |= 1;
            __self->_C_grouping  = do_grouping ();
        }

        return _C_grouping;
#else
        return do_grouping ();
#endif // NO_USE_NUMPUNCT_CACHE
    }

private:

    int          _C_flags;
    std::string  _C_grouping;
};

std::string numpunct::do_grouping () const
{
    return (const char*)get_data (const_cast<numpunct*>(this));
}

void* get_data (facet* pfacet)
{
    void* pdata = pfacet->_C_data ();

    if (pdata) 
        return pdata;

    // __rw_setlocale loc (...); - other mutex

    if (pfacet->_C_data ())
        return get_data (pfacet);

    pfacet->_C_impdata = const_cast<char*>("\4\4");
    pfacet->_C_impsize = (size_t)(-1);

    return 0;
}

} // namespace X

Re: Fwd: Re: STDCXX-1071 numpunct facet defect

2012-09-30 Thread Liviu Nicoara

On 9/30/12 2:21 PM, Liviu Nicoara wrote:

Forwarding with the attachment.

 Original Message 
Subject: Re: STDCXX-1071 numpunct facet defect
Date: Sun, 30 Sep 2012 12:09:10 -0600
From: Martin Sebor mse...@gmail.com
To: Liviu Nicoara nikko...@hates.ms


On 9/27/12 8:27 PM, Martin Sebor wrote:


Here are my timings for library-reduction.cpp when compiled
GCC 4.5.3 on Solaris 10 (4 SPARCV9 CPUs). I had to make a small
number of trivial changes to get it to compile:

         With cache    No cache
real     1m38.332s     8m58.568s
user     6m30.244s     34m25.942s
sys      0m0.060s      0m3.922s

I also experimented with the program on Linux (CEL 4 with 16
CPUs). Initially, I saw no differences between the two versions.
So I modified it a bit to make it closer to the library (the
modified program is attached). With those changes the timings


I see the difference -- your program has a virtual function it calls from the 
inline grouping function.



are below:

         With cache    No cache
real     0m 1.107s     0m 5.669s
user     0m17.204s     0m 5.669s
sys      0m 0.000s     0m22.347s

I also recompiled and re-ran the test on Solaris. To speed
things along, I set the number threads and loops to 8 and
100. The numbers are as follows:

         With cache    No cache
real     0m3.341s      0m26.333s
user     0m13.052s     1m37.470s
sys      0m0.009s      0m0.132s

The numbers match my expectation. The overhead without the
numpunct cache is considerable.


I have done another (smaller) round of measurements, this time using the test 
program you posted. Here are the results:


* iMac, 4x Intel, 12S:

16, 1000:

         Cached        Not cached
real     0m9.300s      0m5.224s
user     0m36.441s     0m20.523s
sys      0m0.043s      0m0.068s

* iMac, 4x Intel, 12D:

         Cached        Not cached
real     0m9.012s      0m5.774s
user     0m35.343s     0m20.997s
sys      0m0.045s      0m0.183s

* Linux Slackware, 16x AMD Opteron, 12S:

16, 1000:

         Cached        Not cached
real     0m29.798s     0m3.278s
user     0m48.662s     0m47.338s
sys      6m18.525s     0m3.298s



Somewhat unexpectedly, the test with the cache didn't crash.


On my iMac it did not crash for me either (gcc 4.5.4) this time. On the other 
box (gcc 4.5.2) it crashed every time with caching, so I had to add a call to 
fac.grouping outside the thread function to initialize the facet.


Liviu


Re: Fwd: Re: STDCXX-1071 numpunct facet defect

2012-09-30 Thread Martin Sebor

I see you did a 64-bit build while I did a 32-bit one. so
I tried 64-bits. The cached version (i.e., the one compiled
with -UNO_USE_NUMPUNCT_CACHE) is still about twice as fast
as the non-cached one (compiled with -DNO_USE_NUMPUNCT_CACHE).

I had made one change to the test program that I thought might
account for the difference: I removed the call to abort from
the thread function since it was causing the process to exit
prematurely in some of my tests. But since you used the
modified program for your latest measurements that couldn't
be it.

I can't explain the differences. They just don't make sense
to me. Your results should be the other way around. Can you
post the disassembly of function f() for each of the two
configurations of the test?

Martin

On 09/30/2012 03:30 PM, Liviu Nicoara wrote:

On 9/30/12 2:21 PM, Liviu Nicoara wrote:

Forwarding with the attachment.

 Original Message 
Subject: Re: STDCXX-1071 numpunct facet defect
Date: Sun, 30 Sep 2012 12:09:10 -0600
From: Martin Sebor mse...@gmail.com
To: Liviu Nicoara nikko...@hates.ms


On 9/27/12 8:27 PM, Martin Sebor wrote:


Here are my timings for library-reduction.cpp when compiled
GCC 4.5.3 on Solaris 10 (4 SPARCV9 CPUs). I had to make a small
number of trivial changes to get it to compile:

With cache No cache
real 1m38.332s 8m58.568s
user 6m30.244s 34m25.942s
sys 0m0.060s 0m3.922s

I also experimented with the program on Linux (CEL 4 with 16
CPUs). Initially, I saw no differences between the two versions.
So I modified it a bit to make it closer to the library (the
modified program is attached). With those changes the timings


I see the difference -- your program has a virtual function it calls
from the inline grouping function.


are below:

With cache No cache
real 0m 1.107s 0m 5.669s
user 0m17.204s 0m 5.669s
sys 0m 0.000s 0m22.347s

I also recompiled and re-ran the test on Solaris. To speed
things along, I set the number threads and loops to 8 and
100. The numbers are as follows:

With cache No cache
real 0m3.341s 0m26.333s
user 0m13.052s 1m37.470s
sys 0m0.009s 0m0.132s

The numbers match my expectation. The overhead without the
numpunct cache is considerable.


I have done another (smaller) round of measurements, this time using the
test program you posted. Here are the results:

* iMac, 4x Intel, 12S:

16, 1000:

Cached Not cached
real 0m9.300s 0m5.224s
user 0m36.441s 0m20.523s
sys 0m0.043s 0m0.068s

* iMac, 4x Intel, 12D:

Cached Not cached
real 0m9.012s 0m5.774s
user 0m35.343s 0m20.997s
sys 0m0.045s 0m0.183s

* Linux Slackware, 16x AMD Opteron, 12S:

16, 1000:

Cached Not cached
real 0m29.798s 0m3.278s
user 0m48.662s 0m47.338s
sys 6m18.525s 0m3.298s



Somewhat unexpectedly, the test with the cache didn't crash.


On my iMac it did not crash for me either (gcc 4.5.4) this time. On the
other box (gcc 4.5.2) it crashed every time with caching, so I had to add a
call to fac.grouping outside the thread function to initialize the facet.

Liviu




Fwd: Re: Fwd: Re: STDCXX-1071 numpunct facet defect

2012-09-30 Thread Liviu Nicoara

Forwarding to the list. Duh.


 Original Message 
Subject: Re: Fwd: Re: STDCXX-1071 numpunct facet defect
Date: Sun, 30 Sep 2012 19:02:27 -0400
From: Liviu Nicoara nikko...@hates.ms
To: Martin Sebor mse...@gmail.com

On 9/30/12 6:18 PM, Martin Sebor wrote:

I see you did a 64-bit build while I did a 32-bit one. so
I tried 64-bits. The cached version (i.e., the one compiled
with -UNO_USE_NUMPUNCT_CACHE) is still about twice as fast
as the non-cached one (compiled with -DNO_USE_NUMPUNCT_CACHE).

I had made one change to the test program that I thought might
account for the difference: I removed the call to abort from
the thread function since it was causing the process to exit
prematurely in some of my tests. But since you used the
modified program for your latest measurements that couldn't
be it.

I can't explain the differences. They just don't make sense
to me. Your results should be the other way around. Can you
post the disassembly of function f() for each of the two
configurations of the test?



Here they are.

Liviu




Dump of assembler code for function f:
   0x00403870 <+0>:     push   %r15
   0x00403872 <+2>:     push   %r14
   0x00403874 <+4>:     push   %r13
   0x00403876 <+6>:     push   %r12
   0x00403878 <+8>:     push   %rbp
   0x00403879 <+9>:     push   %rbx
   0x0040387a <+10>:    mov    %rdi,%rbx
   0x0040387d <+13>:    sub    $0x38,%rsp
   0x00403881 <+17>:    nopl   0x0(%rax)
   0x00403888 <+24>:    movzbl 0x261f11(%rip),%eax        # 0x6657a0 <_ZL5pwait>
   0x0040388f <+31>:    test   %al,%al
   0x00403891 <+33>:    jne    0x403888 <f+24>
   0x00403893 <+35>:    cmpq   $0x0,0x261ef5(%rip)        # 0x665790 <_ZL6nloops>
   0x0040389b <+43>:    jle    0x403b12 <f+674>
   0x004038a1 <+49>:    xor    %ebp,%ebp
   0x004038a3 <+51>:    xor    %r12d,%r12d
   0x004038a6 <+54>:    lea    0x10(%rsp),%r13
   0x004038ab <+59>:    lea    0x48(%rbx),%r14
   0x004038af <+63>:    lea    0x20(%rsp),%r15
   0x004038b4 <+68>:    jmpq   0x4039a7 <f+311>
   0x004038b9 <+73>:    nopl   0x0(%rax)
   0x004038c0 <+80>:    cmp    $0x66a020,%rdx
   0x004038c7 <+87>:    mov    %rdi,0x20(%rsp)
   0x004038cc <+92>:    je     0x403ab8 <f+584>
   0x004038d2 <+98>:    mov    %rdx,%rdi
   0x004038d5 <+101>:   mov    %rdx,(%rsp)
   0x004038d9 <+105>:   callq  0x403658 <pthread_mutex_lock@plt>
   0x004038de <+110>:   test   %eax,%eax
   0x004038e0 <+112>:   mov    (%rsp),%rdx
   0x004038e4 <+116>:   je     0x4038fb <f+139>
   0x004038e6 <+118>:   mov    $0x4452a4,%esi
   0x004038eb <+123>:   mov    $0xa,%edi
   0x004038f0 <+128>:   xor    %eax,%eax
   0x004038f2 <+130>:   callq  0x404370 <_ZN4__rw10__rw_throwEiz>
   0x004038f7 <+135>:   mov    (%rsp),%rdx
   0x004038fb <+139>:   addl   $0x1,0x28(%rdx)
   0x004038ff <+143>:   test   %rdx,%rdx
   0x00403902 <+146>:   je     0x40390c <f+156>
   0x00403904 <+148>:   mov    %rdx,%rdi
   0x00403907 <+151>:   callq  0x4036c8 <pthread_mutex_unlock@plt>
   0x0040390c <+156>:   mov    0x20(%rsp),%rdx
   0x00403911 <+161>:   mov    %rdx,%rdi
   0x00403914 <+164>:   mov    %rdx,(%rsp)
   0x00403918 <+168>:   callq  0x403258 <strlen@plt>
   0x0040391d <+173>:   mov    (%rsp),%rdx
   0x00403921 <+177>:   add    %rax,%r12
   0x00403924 <+180>:   lea    -0x40(%rdx),%rcx
   0x00403928 <+184>:   cmp    $0x66a020,%rcx
   0x0040392f <+191>:   je     0x40398b <f+283>
   0x00403931 <+193>:   mov    %rcx,%rdi
   0x00403934 <+196>:   mov    %rcx,0x8(%rsp)
   0x00403939 <+201>:   callq  0x403658 <pthread_mutex_lock@plt>
   0x0040393e <+206>:   test   %eax,%eax
   0x00403940 <+208>:   mov    (%rsp),%rdx
   0x00403944 <+212>:   mov    0x8(%rsp),%rcx
   0x00403949 <+217>:   je     0x403965 <f+245>
   0x0040394b <+219>:   mov    $0x4452a4,%esi
   0x00403950 <+224>:   mov    $0xa,%edi
   0x00403955 <+229>:   xor    %eax,%eax
   0x00403957 <+231>:   callq  0x404370 <_ZN4__rw10__rw_throwEiz>
   0x0040395c <+236>:   mov    0x8(%rsp),%rcx
   0x00403961 <+241>:   mov    (%rsp),%rdx
   0x00403965 <+245>:   mov    -0x18(%rdx),%esi
   0x00403968 <+248>:   test   %rcx,%rcx
   0x0040396b <+251>:   lea    -0x1(%rsi),%eax
   0x0040396e <+254>:   mov    %eax,-0x18(%rdx)
   0x00403971 <+257>:   je     0x403983 <f+275>
   0x00403973 <+259>:   mov    %rcx,%rdi
   0x00403976 <+262>:   mov    %esi,0x8(%rsp)
   0x0040397a <+266>:   callq  0x4036c8 <pthread_mutex_unlock@plt>
   0x0040397f <+271>:   mov    0x8(%rsp),%esi
   0x00403983 <+275>:   test   %esi,%esi
   0x00403985 <+277>:   jle    0x403a80 <f+528>
   0x0040398b <+283>:   add    $0x1,%ebp
   0x0040398e <+286>:   movq   $0x0,0x20(%rsp

Re: STDCXX-1071 numpunct facet defect

2012-09-28 Thread Liviu Nicoara

I thought I replied but I see no trace of my post:

On 09/27/12 20:27, Martin Sebor wrote:

On 09/27/2012 06:41 AM, Liviu Nicoara wrote:

On 09/26/12 20:12, Liviu Nicoara wrote:

I have created STDCXX-1071 and linked to STDCXX-1056. [...]

I am open to all questions, the more the better. Most of my opinions
have been expressed earlier, but please ask if you want to know more.



I am attaching here the proposed (4.3.x) patch and the timings results
(after re-verifying the correctness of the timing program and the
results). The 4.2.x patch, the 4.3.x patch, the test program and the
results file are also attached to the incident.


The patch isn't binary compatible. We can't remove data members
in a minor release. We could only do it in a major release.


I posted two patches, one for 4.3.x and one for 4.2.x (the latter is binary 
compatible). My understanding was that minor version revisions may break binary 
compatibility.

Liviu


RE: STDCXX-1071 numpunct facet defect

2012-09-28 Thread Travis Vitek
Only major versions can break binary compatibility. The versioning policy for stdcxx can be 
found here:

  http://stdcxx.apache.org/versions.html

Travis

-Original Message-
From: Liviu Nicoara [mailto:nikko...@hates.ms] 
Sent: Friday, September 28, 2012 3:52 AM
To: dev@stdcxx.apache.org
Subject: Re: STDCXX-1071 numpunct facet defect

I thought I replied but I see no trace of my post:

On 09/27/12 20:27, Martin Sebor wrote:
 On 09/27/2012 06:41 AM, Liviu Nicoara wrote:
 On 09/26/12 20:12, Liviu Nicoara wrote:
 I have created STDCXX-1071 and linked to STDCXX-1056. [...]

 I am open to all questions, the more the better. Most of my opinions
 have been expressed earlier, but please ask if you want to know more.


 I am attaching here the proposed (4.3.x) patch and the timings results
 (after re-verifying the correctness of the timing program and the
 results). The 4.2.x patch, the 4.3.x patch, the test program and the
 results file are also attached to the incident.

 The patch isn't binary compatible. We can't remove data members
 in a minor release. We could only do it in a major release.

I posted two patches, one for 4.3.x and one for 4.2.x (the latter is binary 
compatible). My understanding was that minor version revisions may break binary 
compatibility.

Liviu


Re: STDCXX-1071 numpunct facet defect

2012-09-28 Thread Liviu Nicoara

On 09/28/12 11:01, Travis Vitek wrote:

Only major versions can break binary compatibility. The versioning policy for stdcxx can be 
found here:

   http://stdcxx.apache.org/versions.html


Thanks, that clarifies things.

Liviu


Re: STDCXX-1071 numpunct facet defect

2012-09-28 Thread Liviu Nicoara

On 9/28/12 11:01 AM, Travis Vitek wrote:

Only major versions can break binary compatibility. The versioning policy for stdcxx can be 
found here:

   http://stdcxx.apache.org/versions.html


I have renamed the binary-incompatible patch as patch-5.0.x.diff.

Thanks,
Liviu



Travis

-Original Message-
From: Liviu Nicoara [mailto:nikko...@hates.ms]
Sent: Friday, September 28, 2012 3:52 AM
To: dev@stdcxx.apache.org
Subject: Re: STDCXX-1071 numpunct facet defect

I thought I replied but I see no trace of my post:

On 09/27/12 20:27, Martin Sebor wrote:

On 09/27/2012 06:41 AM, Liviu Nicoara wrote:

On 09/26/12 20:12, Liviu Nicoara wrote:

I have created STDCXX-1071 and linked to STDCXX-1056. [...]

I am open to all questions, the more the better. Most of my opinions
have been expressed earlier, but please ask if you want to know more.



I am attaching here the proposed (4.3.x) patch and the timings results
(after re-verifying the correctness of the timing program and the
results). The 4.2.x patch, the 4.3.x patch, the test program and the
results file are also attached to the incident.


The patch isn't binary compatible. We can't remove data members
in a minor release. We could only do it in a major release.


I posted two patches, one for 4.3.x and one for 4.2.x (the latter is binary 
compatible). My understanding was that minor version revisions may break binary 
compatibility.

Liviu





Re: STDCXX-1071 numpunct facet defect

2012-09-27 Thread Liviu Nicoara

On 09/26/12 20:12, Liviu Nicoara wrote:

I have created STDCXX-1071 and linked to STDCXX-1056. [...]

I am open to all questions, the more the better. Most of my opinions have been 
expressed earlier, but please ask if you want to know more.



I am attaching here the proposed (4.3.x) patch and the timings results (after 
re-verifying the correctness of the timing program and the results). The 4.2.x 
patch, the 4.3.x patch, the test program and the results file are also attached 
to the incident.

Thanks,
Liviu



Index: include/loc/_numpunct.h
===
--- include/loc/_numpunct.h (revision 1388733)
+++ include/loc/_numpunct.h (working copy)
@@ -61,7 +61,7 @@ struct numpunct: _RW::__rw_facet
 string_type;
 
 _EXPLICIT numpunct (_RWSTD_SIZE_T __ref = 0)
-: _RW::__rw_facet (__ref), _C_flags (0) { }
+: _RW::__rw_facet (__ref) { }
 
 virtual ~numpunct () _RWSTD_ATTRIBUTE_NOTHROW;
 
@@ -109,15 +109,6 @@ protected:
 virtual string_type do_falsename () const {
 return _RW::__rw_get_punct (this, _RW::__rw_fn, char_type ());
 }
-
-private:
-
-int _C_flags;   // bitmap of cached data valid flags
-string  _C_grouping;// cached results of virtual members
-string_type _C_truename;
-string_type _C_falsename;
-char_type   _C_decimal_point;
-char_type   _C_thousands_sep;
 };
 
 
@@ -139,17 +130,7 @@ template <class _CharT>
 inline _TYPENAME numpunct<_CharT>::char_type
 numpunct<_CharT>::decimal_point () const
 {
-if (!(_C_flags & _RW::__rw_dp)) {
-
-numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);
-
-// [try to] get the decimal point first (may throw)
-// then set a flag to avoid future initializations
-__self->_C_decimal_point  = do_decimal_point ();
-__self->_C_flags |= _RW::__rw_dp;
-}
-
-return _C_decimal_point;
+return do_decimal_point ();
 }
 
 
@@ -157,34 +138,14 @@ template <class _CharT>
 inline _TYPENAME numpunct<_CharT>::char_type
 numpunct<_CharT>::thousands_sep () const
 {
-if (!(_C_flags & _RW::__rw_ts)) {
-
-numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);
-
-// [try to] get the thousands_sep first (may throw)
-// then set a flag to avoid future initializations
-__self->_C_thousands_sep  = do_thousands_sep ();
-__self->_C_flags |= _RW::__rw_ts;
-}
-
-return _C_thousands_sep;
+return do_thousands_sep ();
 }
 
 
 template <class _CharT>
 inline string numpunct<_CharT>::grouping () const
 {
-if (!(_C_flags & _RW::__rw_gr)) {
-
-numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);
-
-// [try to] get the grouping first (may throw)
-// then set a flag to avoid future initializations
-__self->_C_grouping  = do_grouping ();
-__self->_C_flags |= _RW::__rw_gr;
-}
-
-return _C_grouping;
+return do_grouping ();
 }
 
 
@@ -192,17 +153,7 @@ template <class _CharT>
 inline _TYPENAME numpunct<_CharT>::string_type
 numpunct<_CharT>::truename () const
 {
-if (!(_C_flags & _RW::__rw_tn)) {
-
-numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);
-
-// [try to] get the true name first (may throw)
-// then set a flag to avoid future initializations
-__self->_C_truename  = do_truename ();
-__self->_C_flags |= _RW::__rw_tn;
-}
-
-return _C_truename;
+return do_truename ();
 }
 
 
@@ -210,17 +161,7 @@ template <class _CharT>
 inline _TYPENAME numpunct<_CharT>::string_type
 numpunct<_CharT>::falsename () const
 {
-if (!(_C_flags & _RW::__rw_fn)) {
-
-numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);
-
-// [try to] get the false name first (may throw)
-// then set a flag to avoid future initializations
-__self->_C_falsename  = do_falsename ();
-__self->_C_flags |= _RW::__rw_fn;
-}
-
-return _C_falsename;
+return do_falsename ();
 }
 
 // #endif _RWSTD_NO_EXT_NUMPUNCT_PRIMARY
-*- mode: org -*-

* Machines:

** iMac, Intel, 4 cores:

$ uname -a; gcc -v
Darwin imax 11.4.0 Darwin Kernel Version 11.4.0: Mon Apr  9 19:32:15 PDT 2012; 
root:xnu-1699.26.8~1/RELEASE_X86_64 x86_64
gcc version 4.7.1 (GCC) 

** Linux Slackware, AMD, 16 cores:

$ uname -a; gcc -v
Linux behemoth 2.6.37.6 #3 SMP Sat Apr 9 22:49:32 CDT 2011 x86_64 AMD 
Opteron(tm) Processor 6134 AuthenticAMD GNU/Linux
gcc version 4.5.2 (GCC) 

* Method

** Library

Apply the patch. Build an optimized library (I used 12S in all runs). Build the 
library, rwtest, and locale database:

$ nice make -Clib
$ nice make -Cbin locales
$ nice make -Crwtest 

Export the necessary environment variable if running against the STDCXX locale
database, or leave it unset otherwise:

$ export RWSTD_LOCALE_ROOT=/path/to/.../nls

** Test program

Place the multi-threaded program source file, t.cpp, in
srcdir/tests/localization and run make in the 

Re: STDCXX-1071 numpunct facet defect

2012-09-27 Thread Martin Sebor

On 09/27/2012 06:41 AM, Liviu Nicoara wrote:

On 09/26/12 20:12, Liviu Nicoara wrote:

I have created STDCXX-1071 and linked to STDCXX-1056. [...]

I am open to all questions, the more the better. Most of my opinions
have been expressed earlier, but please ask if you want to know more.



I am attaching here the proposed (4.3.x) patch and the timings results
(after re-verifying the correctness of the timing program and the
results). The 4.2.x patch, the 4.3.x patch, the test program and the
results file are also attached to the incident.


The patch isn't binary compatible. We can't remove data members
in a minor release. We could only do it in a major release.

I'm still curious about the performance, though. It doesn't make
sense to me that a call to a virtual function is faster than one
to an almost trivial inline function.

Let me see if I can build the library on Solaris and put together
some numbers. I hope to have some time this weekend.

Martin



Thanks,
Liviu







STDCXX-1071 numpunct facet defect

2012-09-26 Thread Liviu Nicoara
I have created STDCXX-1071 and linked to STDCXX-1056. I have reduced the scope 
to numpunct because moneypunct is not failing for me. If someone has a 
moneypunct failure listing I want to see it.


I have reduced the library code to a failing test case and attached the reduced 
program there. The program shows the same failures as the library 
(interestingly enough, it does not fail when the data caching is removed).


I have (re)attached the forwarding patches I created for 1056, one binary 
compatible, the other not.


Martin, my wish is to move this to some completion. Please let me know if there 
is something else you think I should do to speed this up.


I am open to all questions, the more the better. Most of my opinions have been 
expressed earlier, but please ask if you want to know more.


Thanks,
Liviu