Re: Fwd: Re: STDCXX-1071 numpunct facet defect
On 10/26/12 18:50, Martin Sebor wrote:
On 10/26/2012 06:50 AM, Liviu Nicoara wrote:
[...]
tl;dr: removing the facet data cache is a priority. All else can be put on the back-burner.
[...]

    // facet data accessor ...
    if (0 == _C_impsize) {              // 1
        mutex_lock ();
        if (_C_impsize)
            return _C_data;
        _C_data = get_facet_data ();    // 2
        ??                              // 3
        _C_impsize = 1;                 // 4
        mutex_unlock ();
    }
    ??                                  // 5
    return _C_data;                     // 6

[...]
Various compilers provide these features in various forms, but at the moment we don't have a unified STDCXX API to implement this.
[...]
I suggested moving the body of the outer if from the header into a .cpp file in the library, where we could implement ugly, bloated locking without the risk of breaking things if we removed or replaced it in the future. That's when we ran into questions about how exactly to do this cleanly, etc. It didn't seem to be doable very cleanly, but I still think it's a viable approach.

Just making sure we are talking about the same thing: my argument (and the code oversimplification above) was about the initialization of the std::locale facet data, not the cached numpunct data. I think we should get rid of the numpunct cache right away because it makes the facet unusable in MT builds. As for the double-checked locking initialization I was talking about in the previous post, I would tackle it, but I only have access to x86(_64). Does the foundation have shared dev machines?

Liviu
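(For concreteness, here is a rough, hypothetical sketch of the split Martin describes above; all names are made up for illustration. Only the fast path stays inline in the header, and the slow path moves out of line into the library, where its locking can later be replaced without affecting code inlined into user binaries. Note the inline test still performs the unsynchronized read whose ordering is debated later in this thread.)

    // facet_sketch.h -- only the fast path is inline
    struct facet_sketch {
        facet_sketch () : _C_data (0), _C_impsize (0) { }

        const void* data () const {
            // unsynchronized fast-path read (the "benign" race)
            return _C_impsize ? _C_data : _C_init ();
        }
    private:
        const void* _C_init () const;   // out of line, in the library

        mutable const void*   _C_data;
        mutable unsigned long _C_impsize;
    };

    // facet_sketch.cpp -- slow path; free to use any locking/barriers
    #include <pthread.h>

    static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;

    const void* facet_sketch::_C_init () const
    {
        pthread_mutex_lock (&init_lock);
        if (0 == _C_impsize) {
            _C_data    = "\3\3";   // stands in for get_facet_data ()
            _C_impsize = 1;        // publish (barriers would go here)
        }
        pthread_mutex_unlock (&init_lock);
        return _C_data;
    }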
Re: Fwd: Re: STDCXX-1071 numpunct facet defect
On 10/03/12 11:10, Martin Sebor wrote:
[...]
I was just thinking of a few simple loops along the lines of:

    void* thread_func (void*) {
        for (int i = 0; i < N; ++i) {
            test 1: do some simple stuff inline
            test 2: call a virtual function to do the same stuff
            test 3: lock and unlock a mutex and do the same stuff
        }
    }

Test 1 should be the fastest and test 3 the slowest. This should hold regardless of what the simple stuff is (eventually, even when it's getting numpunct::grouping() data).

tl;dr: removing the facet data cache is a priority. All else can be put on the back-burner.

Conflicting test results aside, there is still the case of the incorrect handling of the cached data in the facet. I don't think there is any disagreement on that. Considering that std::string is moving in the direction of dropping the handle-body implementation, simply getting rid of the cache is a step in the same direction.

I think we should preserve the lock-free reading of the facet data, as a benign race, but making it benign is perhaps more complicated than previously suggested. As a reminder, the core of the facet access and initialization code essentially looks like this (pseudocode-ish):

    // facet data accessor ...
    if (0 == _C_impsize) {              // 1
        mutex_lock ();
        if (_C_impsize)
            return _C_data;
        _C_data = get_facet_data ();    // 2
        ??                              // 3
        _C_impsize = 1;                 // 4
        mutex_unlock ();
    }
    ??                                  // 5
    return _C_data;                     // 6

with the question marks standing for missing, necessary fixes. The compiler needs to be prevented from reordering both 2-4 and 1-6. Just for the sake of argument, I can imagine an optimization that reorders the reads in 1-6:

    register x = _C_data;
    if (_C_impsize)
        return x;

and if the loads are executed in this order, the caller will see a stale _C_data.

First, the 2-4 writes need to be executed in program order. This needs both a compiler barrier and a store-store memory barrier that will keep the writes ordered. Then, the reads in 1-6 need to be ordered such that _C_data is read after _C_impsize, via a compiler barrier and a load-load memory barrier that will preserve the program order of the loads. Various compilers provide these features in various forms, but at the moment we don't have a unified STDCXX API to implement this.

Of course, I might be wrong. Input is appreciated.

Thanks,
Liviu
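(Purely illustrative: here is what the ?? fixes amount to when expressed with C++11 <atomic>, which stdcxx cannot assume and which is named here only to make the required orderings concrete. The release store supplies the store-store ordering for 2-4 and the acquire load supplies the load-load ordering for 1-6; get_facet_data() is a stand-in.)

    #include <atomic>
    #include <mutex>

    struct facet_sketch {
        const void* data () {
            // 1/6: the acquire load pairs with the release store below,
            // so the _C_data read cannot be hoisted above the flag read
            if (_C_impsize.load (std::memory_order_acquire))
                return _C_data;

            std::lock_guard<std::mutex> guard (_C_mutex);
            if (_C_impsize.load (std::memory_order_relaxed))
                return _C_data;

            _C_data = get_facet_data ();                      // 2
            // 3/4: the release store keeps the _C_data write ordered
            // before the flag write (the store-store barrier)
            _C_impsize.store (1, std::memory_order_release);
            return _C_data;
        }

    private:
        static const void* get_facet_data () { return "\3\3"; }

        const void*       _C_data = 0;
        std::atomic<long> _C_impsize { 0 };
        std::mutex        _C_mutex;
    };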
Re: Fwd: Re: STDCXX-1071 numpunct facet defect
On 10/26/2012 06:50 AM, Liviu Nicoara wrote:
On 10/03/12 11:10, Martin Sebor wrote:
[...]
I was just thinking of a few simple loops along the lines of:

    void* thread_func (void*) {
        for (int i = 0; i < N; ++i) {
            test 1: do some simple stuff inline
            test 2: call a virtual function to do the same stuff
            test 3: lock and unlock a mutex and do the same stuff
        }
    }

Test 1 should be the fastest and test 3 the slowest. This should hold regardless of what the simple stuff is (eventually, even when it's getting numpunct::grouping() data).

tl;dr: removing the facet data cache is a priority. All else can be put on the back-burner.

Conflicting test results aside, there is still the case of the incorrect handling of the cached data in the facet. I don't think there is any disagreement on that. Considering that std::string is moving in the direction of dropping the handle-body implementation, simply getting rid of the cache is a step in the same direction.

I think we should preserve the lock-free reading of the facet data, as a benign race, but making it benign is perhaps more complicated than previously suggested. As a reminder, the core of the facet access and initialization code essentially looks like this (pseudocode-ish):

    // facet data accessor ...
    if (0 == _C_impsize) {              // 1
        mutex_lock ();
        if (_C_impsize)
            return _C_data;
        _C_data = get_facet_data ();    // 2
        ??                              // 3
        _C_impsize = 1;                 // 4
        mutex_unlock ();
    }
    ??                                  // 5
    return _C_data;                     // 6

with the question marks standing for missing, necessary fixes. The compiler needs to be prevented from reordering both 2-4 and 1-6. Just for the sake of argument, I can imagine an optimization that reorders the reads in 1-6:

    register x = _C_data;
    if (_C_impsize)
        return x;

and if the loads are executed in this order, the caller will see a stale _C_data.

First, the 2-4 writes need to be executed in program order. This needs both a compiler barrier and a store-store memory barrier that will keep the writes ordered. Then, the reads in 1-6 need to be ordered such that _C_data is read after _C_impsize, via a compiler barrier and a load-load memory barrier that will preserve the program order of the loads. Various compilers provide these features in various forms, but at the moment we don't have a unified STDCXX API to implement this.

Of course, I might be wrong. Input is appreciated.

You're right (we don't have an API for it, and it is the double-checked locking problem mentioned earlier in this thread). Implementing it here would bloat the function (just in terms of source code, not necessarily object code), and potentially introduce a binary compatibility dependency on the new code (i.e., we wouldn't be able to remove the fix in the future without breaking it).

I suggested moving the body of the outer if from the header into a .cpp file in the library, where we could implement ugly, bloated locking without the risk of breaking things if we removed or replaced it in the future. That's when we ran into questions about how exactly to do this cleanly, etc. It didn't seem to be doable very cleanly, but I still think it's a viable approach.

Martin

Thanks,
Liviu
Re: Fwd: Re: STDCXX-1071 numpunct facet defect
On 10/03/12 11:10, Martin Sebor wrote:
[...] I was just thinking of a few simple loops along the lines of: [...]

tl;dr: I attribute the results of the multi-threaded performance tests (12S, Intel/AMD multicores) to heavy contention in the copying of reference-counted std::string objects.

The long version: I went back to the original test case (attached, t.cpp). As a reminder, the program is a simplification of the library numpunct facet creation process, which can be compiled either with data caching enabled or disabled. I ran the perf tool (https://perf.wiki.kernel.org/index.php/Main_Page) on the program, and I noticed marked differences in the number of instructions and cycles (an order of magnitude) and also in the number of CPU-migrations and context-switches (far larger for the cached version):

    Performance counter stats for './t':

        121168.026301  task-clock-msecs    #     3.778 CPUs
              6133377  context-switches    #     0.051 M/sec
                35457  CPU-migrations      #     0.000 M/sec
                  298  page-faults         #     0.000 M/sec
         338006437277  cycles              #  2789.568 M/sec
         136554855096  instructions        #     0.404 IPC
          31298346444  branches            #   258.305 M/sec
           2305761092  branch-misses       #     7.367 %
           1592157186  cache-references    #    13.140 M/sec
              1729755  cache-misses        #     0.014 M/sec

         32.070381363  seconds time elapsed

vs.

    Performance counter stats for './t':

         16194.235282  task-clock-msecs    #     3.883 CPUs
                 7491  context-switches    #     0.000 M/sec
                   47  CPU-migrations      #     0.000 M/sec
                  311  page-faults         #     0.000 M/sec
          45210281987  cycles              #  2791.752 M/sec
          76784904688  instructions        #     1.698 IPC
          18891711522  branches            #  1166.570 M/sec
           3162251552  branch-misses       #    16.739 %
              3332839  cache-references    #     0.206 M/sec
               119424  cache-misses        #     0.007 M/sec

          4.170904279  seconds time elapsed

Then I ran more detailed reports on the runs. I noticed that the report for the cached version contains what appears to be thread synchronization overhead, whereas the non-cached version doesn't:

    # Events: 134K cycles
    #
    # Overhead  Command       Shared Object  Symbol
    # ........  .......  ..................  ......
    #
        50.69%        t  [kernel.kallsyms]   [k] _raw_spin_lock
         5.46%        t  libpthread-2.13.so  [.] pthread_mutex_lock
         4.16%        t  libpthread-2.13.so  [.] __pthread_mutex_unlock
         3.65%        t  libpthread-2.13.so  [.] __lll_lock_wait
         3.35%        t  [kernel.kallsyms]   [k] copy_user_generic_string
         3.09%        t  40395b              [.] 40395b
         2.91%        t  [kernel.kallsyms]   [k] hash_futex
         2.87%        t  [kernel.kallsyms]   [k] futex_wake
         2.67%        t  t                   [.] f
         2.37%        t  [kernel.kallsyms]   [k] futex_wait_setup
         1.61%        t  [kernel.kallsyms]   [k] system_call
         1.00%        t  [kernel.kallsyms]   [k] system_call_after_swapgs
         0.98%        t  [kernel.kallsyms]   [k] __switch_to
         0.96%        t  libpthread-2.13.so  [.] __lll_unlock_wake
         0.95%        t  [kernel.kallsyms]   [k] get_futex_key
         0.92%        t  libc-2.13.so        [.] __strlen_sse42
    [...]

vs.

    # Events: 15K cycles
    #
    # Overhead  Command       Shared Object  Symbol
    # ........  .......  ..................  ......
    #
        49.55%   :29571  7f1ed3830a86        [.] 7f1ed3830a86
         8.67%        t  libc-2.13.so        [.] __malloc
         8.03%        t  libc-2.13.so        [.] cfree
         4.88%        t  t                   [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&)
         4.67%        t  libc-2.13.so        [.] __strlen_sse42
         4.14%        t  libpthread-2.13.so  [.] __pthread_mutex_unlock
         3.62%        t  libpthread-2.13.so  [.] pthread_mutex_lock
         2.64%        t  t                   [.] f
         2.40%        t  libc-2.13.so        [.] _int_malloc
         2.10%        t  libc-2.13.so        [.] _int_free
         1.53%        t  libpthread-2.13.so  [.] __pthread_mutex_init_internal
         1.33%        t  t                   [.] operator new(unsigned long)
         1.08%        t  t                   [.] X::numpunct::do_grouping() const
         1.04%        t  libc-2.13.so        [.] __memcpy_sse2
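(A minimal sketch -- not the stdcxx source -- of the mechanism the profile points at: when the string body is shared and its reference counter is guarded by a mutex, every copy and every destruction of the cached grouping string funnels through the same lock, which is consistent with the _raw_spin_lock/futex entries above. The names are invented; see the rw/_config.h block quoted later in this thread for where stdcxx falls back to a mutexed ref counter.)

    #include <pthread.h>

    // one shared string body, one mutex-guarded reference count
    struct ref_sketch {
        long            refs;
        pthread_mutex_t lock;
    };

    static ref_sketch shared_body = { 1, PTHREAD_MUTEX_INITIALIZER };

    void copy_string ()      // every thread contends on the same lock
    {
        pthread_mutex_lock (&shared_body.lock);
        ++shared_body.refs;
        pthread_mutex_unlock (&shared_body.lock);
    }

    void release_string ()   // and again on destruction
    {
        pthread_mutex_lock (&shared_body.lock);
        --shared_body.refs;
        pthread_mutex_unlock (&shared_body.lock);
    }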
Re: Fwd: Re: STDCXX-1071 numpunct facet defect
On 10/04/12 22:41, Liviu Nicoara wrote:
On 10/4/12 10:10 PM, Liviu Nicoara wrote:
On 10/3/12 11:10 AM, Martin Sebor wrote:
On 10/03/2012 07:01 AM, Liviu Nicoara wrote:

    void* thread_func (void*) {
        for (int i = 0; i < N; ++i) {
            test 1: do some simple stuff inline
            test 2: call a virtual function to do the same stuff
            test 3: lock and unlock a mutex and do the same stuff
        }
    }

Test 1 should be the fastest and test 3 the slowest. This should hold regardless of what the simple stuff is (eventually, even when it's getting numpunct::grouping() data).

That is expected; I attached test case x.cpp and results-x.txt. I did not find it too interesting in its own right, though.

The difference between the cached and non-cached data is that in the case of the cached data the copying of the string involves nothing more than a bump of the reference counter, whereas in the non-cached version a string object is constructed anew, and memory gets allocated for its body. Yet, in my measurements, the cached version is the one that shows the worse performance.

So, I extracted the std::string class, simplified it down, and put it in another test case. That would be u.cpp, and the results are in results-u.txt. The results show the same performance trends, although the absolute values have skewed. Will get back to this after I digest the results a bit more.

This discussion has stagnated lately. I hope to allocate some time this weekend to go over the results, re-verify them, and analyze the disassembly. In the meantime, any thoughts?

Liviu
Re: Fwd: Re: STDCXX-1071 numpunct facet defect
On 10/3/12 11:10 AM, Martin Sebor wrote:
On 10/03/2012 07:01 AM, Liviu Nicoara wrote:
I am gathering some more measurements along these lines but it's time consuming. I estimate I will have some ready for review later today or tomorrow. In the meantime could you please post your kernel, glibc and compiler versions?

I was just thinking of a few simple loops along the lines of:

    void* thread_func (void*) {
        for (int i = 0; i < N; ++i) {
            test 1: do some simple stuff inline
            test 2: call a virtual function to do the same stuff
            test 3: lock and unlock a mutex and do the same stuff
        }
    }

Test 1 should be the fastest and test 3 the slowest. This should hold regardless of what the simple stuff is (eventually, even when it's getting numpunct::grouping() data).

That is expected; I attached test case x.cpp and results-x.txt. I did not find it too interesting in its own right, though. The difference between the cached and non-cached data is that in the case of the cached data the copying of the string involves nothing more than a bump of the reference counter, whereas in the non-cached version a string object is constructed anew, and memory gets allocated for its body. Yet, in my measurements, the cached version is the one that shows the worse performance. So, I extracted the std::string class, simplified it down, and put it in another test case. That would be u.cpp, and the results are in results-u.txt. The results show the same performance trends, although the absolute values have skewed. Will get back to this after I digest the results a bit more.

Liviu

-*- mode: org -*-

* iMac, 4x Core i5, 12S, gcc 4.5.4:

$ nice make u
16, 1    1m18.811s    5m12.329s    0m0.263s
 8, 1    0m39.919s    2m36.198s    0m0.150s
 4, 1    0m20.449s    1m20.797s    0m0.050s
 2, 1    0m9.888s     0m19.725s    0m0.005s
 1, 1    0m2.483s     0m2.480s     0m0.002s

$ nice make CPPOPTS=-DNO_CACHE u
16, 1    0m37.418s    2m27.822s    0m0.872s
 8, 1    0m18.844s    1m14.607s    0m0.261s
 4, 1    0m10.165s    0m40.147s    0m0.023s
 2, 1    0m8.652s     0m17.278s    0m0.003s
 1, 1    0m8.482s     0m8.473s     0m0.007s

$ nice make CPPOPTS=-DNO_VIRTUAL_CALL u
16, 1    1m2.770s     4m9.307s     0m0.179s
 8, 1    0m31.890s    2m6.792s     0m0.087s
 4, 1    0m16.427s    1m5.133s     0m0.039s
 2, 1    0m8.497s     0m16.981s    0m0.007s
 1, 1    0m2.291s     0m2.288s     0m0.002s

$ nice make CPPOPTS="-DNO_CACHE -DNO_VIRTUAL_CALL" u
16, 1    0m35.838s    2m21.406s    0m0.877s
 8, 1    0m19.007s    1m14.920s    0m0.255s
 4, 1    0m10.099s    0m39.504s    0m0.042s
 2, 1    0m8.599s     0m17.190s    0m0.003s
 1, 1    0m8.986s     0m8.980s     0m0.005s

* Linux Slackware, 16x AMD Opteron, 12S, gcc 4.5.2

$ nice make u
$ nice make CPPOPTS=-DNO_CACHE u
$ nice make CPPOPTS=-DNO_VIRTUAL_CALL u
$ nice make CPPOPTS="-DNO_CACHE -DNO_VIRTUAL_CALL" u

-*- mode: org -*-

* iMac, 4x Core i5, 12S, gcc 4.5.4:

$ nice make CPPOPTS="-DNO_LOCK -DNO_VIRTUAL_CALL" u
16, 1    0m7.864s     0m30.259s    0m0.035s
 8, 1    0m4.396s     0m17.034s    0m0.016s
 4, 1    0m2.729s     0m10.473s    0m0.011s
 2, 1    0m2.481s     0m4.929s     0m0.003s
 1, 1    0m2.461s     0m2.455s     0m0.002s

$ nice make CPPOPTS=-DNO_LOCK u
16, 1    0m9.724s     0m37.455s    0m0.043s
 8, 1    0m5.559s     0m20.309s    0m0.048s
 4, 1    0m3.160s     0m12.213s    0m0.013s
 2, 1    0m2.872s     0m5.694s     0m0.004s
 1, 1    0m2.845s     0m2.838s     0m0.002s

$ nice make u
16, 1    1m3.745s     3m58.570s    0m0.351s
 8, 1    0m32.351s    1m55.740s    0m0.203s
 4, 1    0m16.852s    1m1.633s     0m0.092s
 2, 1    0m8.419s     0m16.699s    0m0.010s
 1, 1    0m4.214s     0m4.179s     0m0.005s

* Linux Slackware, 16x AMD Opteron, 12S, gcc 4.7.1

$ nice make CPPOPTS="-DNO_LOCK -DNO_VIRTUAL_CALL" u
16, 1    0m4.382s     1m9.896s     0m0.004s
 8, 1    0m4.374s     0m34.904s    0m0.002s
 4, 1    0m4.368s     0m17.445s    0m0.002s
 2, 1    0m4.366s     0m8.720s     0m0.003s
 1, 1    0m4.355s     0m4.351s     0m0.001s

$ nice make CPPOPTS=-DNO_LOCK u
16, 1    0m5.415s     1m19.833s    0m0.005s
 8, 1    0m4.939s     0m39.438s    0m0.003s
 4, 1    0m4.936s     0m19.712s    0m0.001s
 2, 1    0m4.930s     0m9.847s     0m0.002s
 1, 1    0m4.921s     0m4.917s     0m0.002s

$ nice make u
16, 1    1m40.769s    24m17.198s   0m0.006s
 8, 1    0m51.702s    6m15.400s
Re: Fwd: Re: STDCXX-1071 numpunct facet defect
On 10/4/12 10:10 PM, Liviu Nicoara wrote:
On 10/3/12 11:10 AM, Martin Sebor wrote:
On 10/03/2012 07:01 AM, Liviu Nicoara wrote:
I am gathering some more measurements along these lines but it's time consuming. I estimate I will have some ready for review later today or tomorrow. In the meantime could you please post your kernel, glibc and compiler versions?

I was just thinking of a few simple loops along the lines of:

    void* thread_func (void*) {
        for (int i = 0; i < N; ++i) {
            test 1: do some simple stuff inline
            test 2: call a virtual function to do the same stuff
            test 3: lock and unlock a mutex and do the same stuff
        }
    }

Test 1 should be the fastest and test 3 the slowest. This should hold regardless of what the simple stuff is (eventually, even when it's getting numpunct::grouping() data).

That is expected; I attached test case x.cpp and results-x.txt. I did not find it too interesting in its own right, though. The difference between the cached and non-cached data is that in the case of the cached data the copying of the string involves nothing more than a bump of the reference counter, whereas in the non-cached version a string object is constructed anew, and memory gets allocated for its body. Yet, in my measurements, the cached version is the one that shows the worse performance. So, I extracted the std::string class, simplified it down, and put it in another test case. That would be u.cpp, and the results are in results-u.txt. The results show the same performance trends, although the absolute values have skewed. Will get back to this after I digest the results a bit more.

Attached some incomplete files previously. Corrected.

Liviu

-*- mode: org -*-

* iMac, 4x Core i5, 12S, gcc 4.5.4:

$ nice make u
16, 1    1m18.811s    5m12.329s    0m0.263s
 8, 1    0m39.919s    2m36.198s    0m0.150s
 4, 1    0m20.449s    1m20.797s    0m0.050s
 2, 1    0m9.888s     0m19.725s    0m0.005s
 1, 1    0m2.483s     0m2.480s     0m0.002s

$ nice make CPPOPTS=-DNO_CACHE u
16, 1    0m37.418s    2m27.822s    0m0.872s
 8, 1    0m18.844s    1m14.607s    0m0.261s
 4, 1    0m10.165s    0m40.147s    0m0.023s
 2, 1    0m8.652s     0m17.278s    0m0.003s
 1, 1    0m8.482s     0m8.473s     0m0.007s

$ nice make CPPOPTS=-DNO_VIRTUAL_CALL u
16, 1    1m2.770s     4m9.307s     0m0.179s
 8, 1    0m31.890s    2m6.792s     0m0.087s
 4, 1    0m16.427s    1m5.133s     0m0.039s
 2, 1    0m8.497s     0m16.981s    0m0.007s
 1, 1    0m2.291s     0m2.288s     0m0.002s

$ nice make CPPOPTS="-DNO_CACHE -DNO_VIRTUAL_CALL" u
16, 1    0m35.838s    2m21.406s    0m0.877s
 8, 1    0m19.007s    1m14.920s    0m0.255s
 4, 1    0m10.099s    0m39.504s    0m0.042s
 2, 1    0m8.599s     0m17.190s    0m0.003s
 1, 1    0m8.986s     0m8.980s     0m0.005s

* Linux Slackware, 16x AMD Opteron, 12S, gcc 4.5.2

$ nice make u
16, 1    2m7.644s     30m10.014s   0m0.006s
 8, 1    1m4.449s     7m45.111s    0m0.004s
 4, 1    0m31.646s    2m1.438s     0m0.002s
 2, 1    0m14.828s    0m29.637s    0m0.002s
 1, 1    0m3.223s     0m3.220s     0m0.002s

$ nice make CPPOPTS=-DNO_CACHE u
16, 1    0m10.730s    2m46.032s    0m2.204s
 8, 1    0m9.922s     1m18.892s    0m0.008s
 4, 1    0m9.874s     0m39.192s    0m0.003s
 2, 1    0m9.652s     0m19.268s    0m0.004s
 1, 1    0m9.622s     0m9.617s     0m0.002s

$ nice make CPPOPTS=-DNO_VIRTUAL_CALL u
16, 1    1m54.812s    27m5.957s    0m0.007s
 8, 1    0m53.862s    6m26.391s    0m0.003s
 4, 1    0m26.287s    1m39.991s    0m0.001s
 2, 1    0m9.120s     0m17.233s    0m0.002s
 1, 1    0m3.004s     0m3.002s     0m0.001s

$ nice make CPPOPTS="-DNO_CACHE -DNO_VIRTUAL_CALL" u
16, 1    0m9.717s     2m34.412s    0m0.401s
 8, 1    0m9.613s     1m16.762s    0m0.007s
 4, 1    0m9.590s     0m38.285s    0m0.004s
 2, 1    0m9.629s     0m19.165s    0m0.002s
 1, 1    0m9.361s     0m9.357s     0m0.001s

-*- mode: org -*-

* iMac, 4x Core i5, 12S, gcc 4.5.4:

$ nice make CPPOPTS="-DNO_LOCK -DNO_VIRTUAL_CALL" u
16, 1    0m7.864s     0m30.259s    0m0.035s
 8, 1    0m4.396s     0m17.034s    0m0.016s
 4, 1    0m2.729s     0m10.473s    0m0.011s
 2, 1    0m2.481s     0m4.929s     0m0.003s
 1, 1    0m2.461s     0m2.455s     0m0.002s

$ nice make CPPOPTS=-DNO_LOCK u
16, 1    0m9.724s     0m37.455s    0m0.043s
 8, 1    0m5.559s     0m20.309s    0m0.048s
 4, 1    0m3.160s     0m12.213s    0m0.013s
 2,
Re: Fwd: Re: STDCXX-1071 numpunct facet defect
On 10/02/12 10:41, Martin Sebor wrote:
I haven't had time to look at this since my last email on Sunday. I also forgot about the string mutex. I don't think I'll have time to spend on this until later in the week. Unless the disassembly reveals the smoking gun, I think we might need to simplify the test to get to the bottom of the differences in our measurements. (I.e., eliminate the library and measure the runtime of a simple thread loop, with and without locking.) We should also look at the GLIBC and kernel versions on our systems, on the off chance that there has been a change that could explain the discrepancy between my numbers and yours. I suspect my system (RHEL 4.8) is much older than yours (I don't remember now if you posted your details).

I am gathering some more measurements along these lines but it's time consuming. I estimate I will have some ready for review later today or tomorrow. In the meantime, could you please post your kernel, glibc and compiler versions?

Liviu

Martin

On 10/02/2012 06:22 AM, Liviu Nicoara wrote:
On 09/30/12 18:18, Martin Sebor wrote:
I see you did a 64-bit build while I did a 32-bit one. So I tried 64 bits. The cached version (i.e., the one compiled with -UNO_USE_NUMPUNCT_CACHE) is still about twice as fast as the non-cached one (compiled with -DNO_USE_NUMPUNCT_CACHE). I had made one change to the test program that I thought might account for the difference: I removed the call to abort from the thread function, since it was causing the process to exit prematurely in some of my tests. But since you used the modified program for your latest measurements, that couldn't be it. I can't explain the differences. They just don't make sense to me. Your results should be the other way around. Can you post the disassembly of function f() for each of the two configurations of the test?

The first thing that struck me in the cached `f' was that the __string_ref class uses a mutex for synchronizing access to the ref counter. It turns out that, for Linux on AMD64, we explicitly use a mutex instead of atomic ops on the ref counter, via a block in rw/_config.h:

    # if _RWSTD_VER_MAJOR < 5
    #   ifdef _RWSTD_OS_LINUX
          // on Linux/AMD64, unless explicitly requested, disable the use
          // of atomic operations in string for binary compatibility with
          // stdcxx 4.1.x
    #     ifndef _RWSTD_USE_STRING_ATOMIC_OPS
    #       define _RWSTD_NO_STRING_ATOMIC_OPS
    #     endif   // _RWSTD_USE_STRING_ATOMIC_OPS
    #   endif   // _WIN32
    # endif   // stdcxx 5.0

That is not the cause of the performance difference, though. Even after building with _RWSTD_USE_STRING_ATOMIC_OPS I get the same better performance with the non-cached version.

Liviu
Re: Fwd: Re: STDCXX-1071 numpunct facet defect
On 10/03/2012 07:01 AM, Liviu Nicoara wrote:
On 10/02/12 10:41, Martin Sebor wrote:
I haven't had time to look at this since my last email on Sunday. I also forgot about the string mutex. I don't think I'll have time to spend on this until later in the week. Unless the disassembly reveals the smoking gun, I think we might need to simplify the test to get to the bottom of the differences in our measurements. (I.e., eliminate the library and measure the runtime of a simple thread loop, with and without locking.) We should also look at the GLIBC and kernel versions on our systems, on the off chance that there has been a change that could explain the discrepancy between my numbers and yours. I suspect my system (RHEL 4.8) is much older than yours (I don't remember now if you posted your details).

I am gathering some more measurements along these lines but it's time consuming. I estimate I will have some ready for review later today or tomorrow. In the meantime, could you please post your kernel, glibc and compiler versions?

I was just thinking of a few simple loops along the lines of:

    void* thread_func (void*) {
        for (int i = 0; i < N; ++i) {
            test 1: do some simple stuff inline
            test 2: call a virtual function to do the same stuff
            test 3: lock and unlock a mutex and do the same stuff
        }
    }

Test 1 should be the fastest and test 3 the slowest. This should hold regardless of what the simple stuff is (eventually, even when it's getting numpunct::grouping() data).

For the Linux tests I used a 16-CPU box (Xeon X5570 @ 3GHz) running RHEL 4.8 with kernel 2.6.9-89.0.11.ELlargesmp, GLIBC 2.3.4, and GCC 3.4.6.

Martin

Liviu

Martin

On 10/02/2012 06:22 AM, Liviu Nicoara wrote:
On 09/30/12 18:18, Martin Sebor wrote:
I see you did a 64-bit build while I did a 32-bit one. So I tried 64 bits. The cached version (i.e., the one compiled with -UNO_USE_NUMPUNCT_CACHE) is still about twice as fast as the non-cached one (compiled with -DNO_USE_NUMPUNCT_CACHE). I had made one change to the test program that I thought might account for the difference: I removed the call to abort from the thread function, since it was causing the process to exit prematurely in some of my tests. But since you used the modified program for your latest measurements, that couldn't be it. I can't explain the differences. They just don't make sense to me. Your results should be the other way around. Can you post the disassembly of function f() for each of the two configurations of the test?

The first thing that struck me in the cached `f' was that the __string_ref class uses a mutex for synchronizing access to the ref counter. It turns out that, for Linux on AMD64, we explicitly use a mutex instead of atomic ops on the ref counter, via a block in rw/_config.h:

    # if _RWSTD_VER_MAJOR < 5
    #   ifdef _RWSTD_OS_LINUX
          // on Linux/AMD64, unless explicitly requested, disable the use
          // of atomic operations in string for binary compatibility with
          // stdcxx 4.1.x
    #     ifndef _RWSTD_USE_STRING_ATOMIC_OPS
    #       define _RWSTD_NO_STRING_ATOMIC_OPS
    #     endif   // _RWSTD_USE_STRING_ATOMIC_OPS
    #   endif   // _WIN32
    # endif   // stdcxx 5.0

That is not the cause of the performance difference, though. Even after building with _RWSTD_USE_STRING_ATOMIC_OPS I get the same better performance with the non-cached version.

Liviu
Re: Fwd: Re: STDCXX-1071 numpunct facet defect
On 09/30/12 18:18, Martin Sebor wrote:
I see you did a 64-bit build while I did a 32-bit one. So I tried 64 bits. The cached version (i.e., the one compiled with -UNO_USE_NUMPUNCT_CACHE) is still about twice as fast as the non-cached one (compiled with -DNO_USE_NUMPUNCT_CACHE). I had made one change to the test program that I thought might account for the difference: I removed the call to abort from the thread function, since it was causing the process to exit prematurely in some of my tests. But since you used the modified program for your latest measurements, that couldn't be it. I can't explain the differences. They just don't make sense to me. Your results should be the other way around. Can you post the disassembly of function f() for each of the two configurations of the test?

The first thing that struck me in the cached `f' was that the __string_ref class uses a mutex for synchronizing access to the ref counter. It turns out that, for Linux on AMD64, we explicitly use a mutex instead of atomic ops on the ref counter, via a block in rw/_config.h:

    # if _RWSTD_VER_MAJOR < 5
    #   ifdef _RWSTD_OS_LINUX
          // on Linux/AMD64, unless explicitly requested, disable the use
          // of atomic operations in string for binary compatibility with
          // stdcxx 4.1.x
    #     ifndef _RWSTD_USE_STRING_ATOMIC_OPS
    #       define _RWSTD_NO_STRING_ATOMIC_OPS
    #     endif   // _RWSTD_USE_STRING_ATOMIC_OPS
    #   endif   // _WIN32
    # endif   // stdcxx 5.0

That is not the cause of the performance difference, though. Even after building with _RWSTD_USE_STRING_ATOMIC_OPS I get the same better performance with the non-cached version.

Liviu
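(For comparison, a hypothetical sketch of what the atomic-ops path buys relative to the mutexed one; the stdcxx internals differ, but the GCC __sync builtins shown here are available on the compilers discussed in this thread.)

    #include <pthread.h>

    // one atomic read-modify-write, no lock/unlock round trip
    long add_ref_atomic (long* refs)
    {
        return __sync_add_and_fetch (refs, 1);
    }

    // the _RWSTD_NO_STRING_ATOMIC_OPS fallback, roughly:
    long add_ref_mutexed (long* refs, pthread_mutex_t* lock)
    {
        pthread_mutex_lock (lock);
        const long r = ++*refs;
        pthread_mutex_unlock (lock);
        return r;
    }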
Re: Fwd: Re: STDCXX-1071 numpunct facet defect
I haven't had time to look at this since my last email on Sunday. I also forgot about the string mutex. I don't think I'll have time to spend on this until later in the week.

Unless the disassembly reveals the smoking gun, I think we might need to simplify the test to get to the bottom of the differences in our measurements. (I.e., eliminate the library and measure the runtime of a simple thread loop, with and without locking.) We should also look at the GLIBC and kernel versions on our systems, on the off chance that there has been a change that could explain the discrepancy between my numbers and yours. I suspect my system (RHEL 4.8) is much older than yours (I don't remember now if you posted your details).

Martin

On 10/02/2012 06:22 AM, Liviu Nicoara wrote:
On 09/30/12 18:18, Martin Sebor wrote:
I see you did a 64-bit build while I did a 32-bit one. So I tried 64 bits. The cached version (i.e., the one compiled with -UNO_USE_NUMPUNCT_CACHE) is still about twice as fast as the non-cached one (compiled with -DNO_USE_NUMPUNCT_CACHE). I had made one change to the test program that I thought might account for the difference: I removed the call to abort from the thread function, since it was causing the process to exit prematurely in some of my tests. But since you used the modified program for your latest measurements, that couldn't be it. I can't explain the differences. They just don't make sense to me. Your results should be the other way around. Can you post the disassembly of function f() for each of the two configurations of the test?

The first thing that struck me in the cached `f' was that the __string_ref class uses a mutex for synchronizing access to the ref counter. It turns out that, for Linux on AMD64, we explicitly use a mutex instead of atomic ops on the ref counter, via a block in rw/_config.h:

    # if _RWSTD_VER_MAJOR < 5
    #   ifdef _RWSTD_OS_LINUX
          // on Linux/AMD64, unless explicitly requested, disable the use
          // of atomic operations in string for binary compatibility with
          // stdcxx 4.1.x
    #     ifndef _RWSTD_USE_STRING_ATOMIC_OPS
    #       define _RWSTD_NO_STRING_ATOMIC_OPS
    #     endif   // _RWSTD_USE_STRING_ATOMIC_OPS
    #   endif   // _WIN32
    # endif   // stdcxx 5.0

That is not the cause of the performance difference, though. Even after building with _RWSTD_USE_STRING_ATOMIC_OPS I get the same better performance with the non-cached version.

Liviu
Re: STDCXX-1071 numpunct facet defect
[Sending to dev after accidentally replying just to Liviu]

On 09/30/2012 12:09 PM, Martin Sebor wrote:
On 09/27/2012 06:36 PM, Liviu Nicoara wrote:
On 9/27/12 8:27 PM, Martin Sebor wrote:
On 09/27/2012 06:41 AM, Liviu Nicoara wrote:
On 09/26/12 20:12, Liviu Nicoara wrote:
I have created STDCXX-1071 and linked to STDCXX-1056. [...] I am open to all questions, the more the better. Most of my opinions have been expressed earlier, but please ask if you want to know more.

I am attaching here the proposed (4.3.x) patch and the timings results (after re-verifying the correctness of the timing program and the results). The 4.2.x patch, the 4.3.x patch, the test program and the results file are also attached to the incident.

The patch isn't binary compatible. We can't remove data members in a minor release. We could only do it in a major release.

There are two of them, one for 4.2.x, one for 4.3.x.

I'm still curious about the performance, though. It doesn't make sense to me that a call to a virtual function is faster than one to an almost trivial inline function.

I scratched my head long on that. I wager that it depends on the machine, too.

Here are my timings for library-reduction.cpp when compiled with GCC 4.5.3 on Solaris 10 (4 SPARCV9 CPUs). I had to make a small number of trivial changes to get it to compile:

            With cache    No cache
    real     1m38.332s   8m58.568s
    user     6m30.244s  34m25.942s
    sys       0m0.060s    0m3.922s

I also experimented with the program on Linux (CEL 4 with 16 CPUs). Initially, I saw no differences between the two versions. So I modified it a bit to make it closer to the library (the modified program is attached). With those changes the timings are below:

            With cache    No cache
    real     0m 1.107s   0m 5.669s
    user     0m17.204s   0m 5.669s
    sys      0m 0.000s   0m22.347s

I also recompiled and re-ran the test on Solaris. To speed things along, I set the number of threads and loops to 8 and 100. The numbers are as follows:

            With cache    No cache
    real      0m3.341s   0m26.333s
    user     0m13.052s   1m37.470s
    sys       0m0.009s    0m0.132s

The numbers match my expectation. The overhead without the numpunct cache is considerable.

Somewhat unexpectedly, the test with the cache didn't crash. I also tried timing the numpunct patch attached to the issue with the test program. The test program runs to completion without the patch, but it crashes with a SIGSEGV after the patch is applied. That suggests there's a bug somewhere in do_thousands_sep() (in addition to the caching).

Martin

Let me know if there is anything I could help with.

Liviu
Fwd: Re: STDCXX-1071 numpunct facet defect
Forwarding with the attachment.

-------- Original Message --------
Subject: Re: STDCXX-1071 numpunct facet defect
Date: Sun, 30 Sep 2012 12:09:10 -0600
From: Martin Sebor mse...@gmail.com
To: Liviu Nicoara nikko...@hates.ms

On 09/27/2012 06:36 PM, Liviu Nicoara wrote:
On 9/27/12 8:27 PM, Martin Sebor wrote:
On 09/27/2012 06:41 AM, Liviu Nicoara wrote:
On 09/26/12 20:12, Liviu Nicoara wrote:
I have created STDCXX-1071 and linked to STDCXX-1056. [...] I am open to all questions, the more the better. Most of my opinions have been expressed earlier, but please ask if you want to know more.

I am attaching here the proposed (4.3.x) patch and the timings results (after re-verifying the correctness of the timing program and the results). The 4.2.x patch, the 4.3.x patch, the test program and the results file are also attached to the incident.

The patch isn't binary compatible. We can't remove data members in a minor release. We could only do it in a major release.

There are two of them, one for 4.2.x, one for 4.3.x.

I'm still curious about the performance, though. It doesn't make sense to me that a call to a virtual function is faster than one to an almost trivial inline function.

I scratched my head long on that. I wager that it depends on the machine, too.

Here are my timings for library-reduction.cpp when compiled with GCC 4.5.3 on Solaris 10 (4 SPARCV9 CPUs). I had to make a small number of trivial changes to get it to compile:

            With cache    No cache
    real     1m38.332s   8m58.568s
    user     6m30.244s  34m25.942s
    sys       0m0.060s    0m3.922s

I also experimented with the program on Linux (CEL 4 with 16 CPUs). Initially, I saw no differences between the two versions. So I modified it a bit to make it closer to the library (the modified program is attached). With those changes the timings are below:

            With cache    No cache
    real     0m 1.107s   0m 5.669s
    user     0m17.204s   0m 5.669s
    sys      0m 0.000s   0m22.347s

I also recompiled and re-ran the test on Solaris. To speed things along, I set the number of threads and loops to 8 and 100. The numbers are as follows:

            With cache    No cache
    real      0m3.341s   0m26.333s
    user     0m13.052s   1m37.470s
    sys       0m0.009s    0m0.132s

The numbers match my expectation. The overhead without the numpunct cache is considerable.

Somewhat unexpectedly, the test with the cache didn't crash. I also tried timing the numpunct patch attached to the issue with the test program. The test program runs to completion without the patch, but it crashes with a SIGSEGV after the patch is applied. That suggests there's a bug somewhere in do_thousands_sep() (in addition to the caching).

Martin

Let me know if there is anything I could help with.

Liviu

#include <iostream>
#include <locale>
#include <cstdio>
#include <cstdlib>
#include <cstring>

#include <pthread.h>
#include <unistd.h>

#define MAX_THREADS 128

static long nloops = 1000, nthreads = 16;
static bool volatile pwait = true;

namespace X {

struct guard {
    guard (pthread_mutex_t* ptr) : lock_ (ptr) {
        pthread_mutex_lock (lock_);
    }
    ~guard () {
        pthread_mutex_unlock (lock_);
    }
private:
    guard (guard const&);
    guard& operator= (guard const&);
private:
    pthread_mutex_t* lock_;
};

struct facet {
    facet () {
        pthread_mutex_init (&_C_mutex, 0);
    }

    void* _C_data () {
        return _C_impsize ? _C_impdata : _C_get_data ();
    }

    void* _C_get_data ();

    size_t          _C_impsize;
    void*           _C_impdata;
    pthread_mutex_t _C_mutex;
};

void* facet::_C_get_data ()
{
    if (_C_impsize)
        return _C_impdata;

    guard g (&_C_mutex);

    if (_C_impsize)
        return _C_impdata;

#if !defined (NO_USE_STDCXX_LOCALES)
    _C_impdata = (void*)"\3\3";
#endif   // USE_STDCXX_LOCALES

    _C_impsize = (size_t)(-1);
    return _C_impdata;
}

void* get_data (facet*);

struct numpunct : facet {
    numpunct () : _C_flags () { }

    virtual std::string do_grouping () const;

    std::string grouping () const {
        numpunct* const __self = (numpunct*)this;
#if !defined (NO_USE_NUMPUNCT_CACHE)
        if (!(_C_flags & 1)) {
            __self->_C_flags |= 1;
            __self->_C_grouping = do_grouping ();
        }
        return _C_grouping;
#else
        return do_grouping ();
#endif   // NO_USE_NUMPUNCT_CACHE
    }

private:
    int         _C_flags;
    std::string _C_grouping;
};

std::string numpunct::do_grouping () const
{
    return (const char*)get_data (const_cast<numpunct*>(this));
}

void* get_data (facet* pfacet)
{
    void* pdata = pfacet->_C_data ();
    if (pdata)
        return pdata;

    // __rw_setlocale loc (...); - other mutex

    if (pfacet->_C_data ())
        return get_data (pfacet);

    pfacet->_C_impdata = const_cast<char*>("\4\4");
    pfacet->_C_impsize = (size_t)(-1);
    return 0;
}

}   // namespace X
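(The excerpt above ends with namespace X; the driver is not part of the message as archived. A minimal, hypothetical harness consistent with the globals above -- nloops, nthreads, pwait, MAX_THREADS -- and with the f() that appears in the disassembly later in the thread might look like this; it is a guess at the missing code, not the original.)

    static X::numpunct fac;

    extern "C" void* f (void*)
    {
        while (pwait)       // spin until main releases the threads
            ;

        size_t n = 0;
        for (long i = 0; i != nloops; ++i)
            n += fac.grouping ().size ();   // the timed operation

        return (void*)n;
    }

    int main ()
    {
        pthread_t tid [MAX_THREADS];

        for (long i = 0; i != nthreads && i != MAX_THREADS; ++i)
            pthread_create (tid + i, 0, f, 0);

        pwait = false;      // release the threads

        for (long i = 0; i != nthreads && i != MAX_THREADS; ++i)
            pthread_join (tid [i], 0);
    }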
Re: Fwd: Re: STDCXX-1071 numpunct facet defect
On 9/30/12 2:21 PM, Liviu Nicoara wrote:
Forwarding with the attachment.

-------- Original Message --------
Subject: Re: STDCXX-1071 numpunct facet defect
Date: Sun, 30 Sep 2012 12:09:10 -0600
From: Martin Sebor mse...@gmail.com
To: Liviu Nicoara nikko...@hates.ms

On 9/27/12 8:27 PM, Martin Sebor wrote:
Here are my timings for library-reduction.cpp when compiled with GCC 4.5.3 on Solaris 10 (4 SPARCV9 CPUs). I had to make a small number of trivial changes to get it to compile:

            With cache    No cache
    real     1m38.332s   8m58.568s
    user     6m30.244s  34m25.942s
    sys       0m0.060s    0m3.922s

I also experimented with the program on Linux (CEL 4 with 16 CPUs). Initially, I saw no differences between the two versions. So I modified it a bit to make it closer to the library (the modified program is attached). With those changes the timings

I see the difference -- your program has a virtual function it calls from the inline grouping function.

are below:

            With cache    No cache
    real     0m 1.107s   0m 5.669s
    user     0m17.204s   0m 5.669s
    sys      0m 0.000s   0m22.347s

I also recompiled and re-ran the test on Solaris. To speed things along, I set the number of threads and loops to 8 and 100. The numbers are as follows:

            With cache    No cache
    real      0m3.341s   0m26.333s
    user     0m13.052s   1m37.470s
    sys       0m0.009s    0m0.132s

The numbers match my expectation. The overhead without the numpunct cache is considerable.

I have done another (smaller) round of measurements, this time using the test program you posted. Here are the results:

* iMac, 4x Intel, 12S: 16, 1000:

            Cached      Not cached
    real    0m9.300s    0m5.224s
    user    0m36.441s   0m20.523s
    sys     0m0.043s    0m0.068s

* iMac, 4x Intel, 12D:

            Cached      Not cached
    real    0m9.012s    0m5.774s
    user    0m35.343s   0m20.997s
    sys     0m0.045s    0m0.183s

* Linux Slackware, 16x AMD Opteron, 12S: 16, 1000:

            Cached      Not cached
    real    0m29.798s   0m3.278s
    user    0m48.662s   0m47.338s
    sys     6m18.525s   0m3.298s

Somewhat unexpectedly, the test with the cache didn't crash.

On my iMac it did not crash for me either (gcc 4.5.4), this time. On the other box (gcc 4.5.2) it crashed every time with caching, so I had to add a call to fac.grouping outside the thread function to initialize the facet.

Liviu
Re: Fwd: Re: STDCXX-1071 numpunct facet defect
I see you did a 64-bit build while I did a 32-bit one. So I tried 64 bits. The cached version (i.e., the one compiled with -UNO_USE_NUMPUNCT_CACHE) is still about twice as fast as the non-cached one (compiled with -DNO_USE_NUMPUNCT_CACHE). I had made one change to the test program that I thought might account for the difference: I removed the call to abort from the thread function, since it was causing the process to exit prematurely in some of my tests. But since you used the modified program for your latest measurements, that couldn't be it.

I can't explain the differences. They just don't make sense to me. Your results should be the other way around. Can you post the disassembly of function f() for each of the two configurations of the test?

Martin

On 09/30/2012 03:30 PM, Liviu Nicoara wrote:
On 9/30/12 2:21 PM, Liviu Nicoara wrote:
Forwarding with the attachment.

-------- Original Message --------
Subject: Re: STDCXX-1071 numpunct facet defect
Date: Sun, 30 Sep 2012 12:09:10 -0600
From: Martin Sebor mse...@gmail.com
To: Liviu Nicoara nikko...@hates.ms

On 9/27/12 8:27 PM, Martin Sebor wrote:
Here are my timings for library-reduction.cpp when compiled with GCC 4.5.3 on Solaris 10 (4 SPARCV9 CPUs). I had to make a small number of trivial changes to get it to compile:

            With cache    No cache
    real     1m38.332s   8m58.568s
    user     6m30.244s  34m25.942s
    sys       0m0.060s    0m3.922s

I also experimented with the program on Linux (CEL 4 with 16 CPUs). Initially, I saw no differences between the two versions. So I modified it a bit to make it closer to the library (the modified program is attached). With those changes the timings

I see the difference -- your program has a virtual function it calls from the inline grouping function.

are below:

            With cache    No cache
    real     0m 1.107s   0m 5.669s
    user     0m17.204s   0m 5.669s
    sys      0m 0.000s   0m22.347s

I also recompiled and re-ran the test on Solaris. To speed things along, I set the number of threads and loops to 8 and 100. The numbers are as follows:

            With cache    No cache
    real      0m3.341s   0m26.333s
    user     0m13.052s   1m37.470s
    sys       0m0.009s    0m0.132s

The numbers match my expectation. The overhead without the numpunct cache is considerable.

I have done another (smaller) round of measurements, this time using the test program you posted. Here are the results:

* iMac, 4x Intel, 12S: 16, 1000:

            Cached      Not cached
    real    0m9.300s    0m5.224s
    user    0m36.441s   0m20.523s
    sys     0m0.043s    0m0.068s

* iMac, 4x Intel, 12D:

            Cached      Not cached
    real    0m9.012s    0m5.774s
    user    0m35.343s   0m20.997s
    sys     0m0.045s    0m0.183s

* Linux Slackware, 16x AMD Opteron, 12S: 16, 1000:

            Cached      Not cached
    real    0m29.798s   0m3.278s
    user    0m48.662s   0m47.338s
    sys     6m18.525s   0m3.298s

Somewhat unexpectedly, the test with the cache didn't crash.

On my iMac it did not crash for me either (gcc 4.5.4), this time. On the other box (gcc 4.5.2) it crashed every time with caching, so I had to add a call to fac.grouping outside the thread function to initialize the facet.

Liviu
Fwd: Re: Fwd: Re: STDCXX-1071 numpunct facet defect
Forwarding to the list. Duh.

-------- Original Message --------
Subject: Re: Fwd: Re: STDCXX-1071 numpunct facet defect
Date: Sun, 30 Sep 2012 19:02:27 -0400
From: Liviu Nicoara nikko...@hates.ms
To: Martin Sebor mse...@gmail.com

On 9/30/12 6:18 PM, Martin Sebor wrote:
I see you did a 64-bit build while I did a 32-bit one. So I tried 64 bits. The cached version (i.e., the one compiled with -UNO_USE_NUMPUNCT_CACHE) is still about twice as fast as the non-cached one (compiled with -DNO_USE_NUMPUNCT_CACHE). I had made one change to the test program that I thought might account for the difference: I removed the call to abort from the thread function, since it was causing the process to exit prematurely in some of my tests. But since you used the modified program for your latest measurements, that couldn't be it. I can't explain the differences. They just don't make sense to me. Your results should be the other way around. Can you post the disassembly of function f() for each of the two configurations of the test?

Here they are.

Liviu

Dump of assembler code for function f:
   0x00403870 <+0>:    push   %r15
   0x00403872 <+2>:    push   %r14
   0x00403874 <+4>:    push   %r13
   0x00403876 <+6>:    push   %r12
   0x00403878 <+8>:    push   %rbp
   0x00403879 <+9>:    push   %rbx
   0x0040387a <+10>:   mov    %rdi,%rbx
   0x0040387d <+13>:   sub    $0x38,%rsp
   0x00403881 <+17>:   nopl   0x0(%rax)
   0x00403888 <+24>:   movzbl 0x261f11(%rip),%eax   # 0x6657a0 <_ZL5pwait>
   0x0040388f <+31>:   test   %al,%al
   0x00403891 <+33>:   jne    0x403888 <f+24>
   0x00403893 <+35>:   cmpq   $0x0,0x261ef5(%rip)   # 0x665790 <_ZL6nloops>
   0x0040389b <+43>:   jle    0x403b12 <f+674>
   0x004038a1 <+49>:   xor    %ebp,%ebp
   0x004038a3 <+51>:   xor    %r12d,%r12d
   0x004038a6 <+54>:   lea    0x10(%rsp),%r13
   0x004038ab <+59>:   lea    0x48(%rbx),%r14
   0x004038af <+63>:   lea    0x20(%rsp),%r15
   0x004038b4 <+68>:   jmpq   0x4039a7 <f+311>
   0x004038b9 <+73>:   nopl   0x0(%rax)
   0x004038c0 <+80>:   cmp    $0x66a020,%rdx
   0x004038c7 <+87>:   mov    %rdi,0x20(%rsp)
   0x004038cc <+92>:   je     0x403ab8 <f+584>
   0x004038d2 <+98>:   mov    %rdx,%rdi
   0x004038d5 <+101>:  mov    %rdx,(%rsp)
   0x004038d9 <+105>:  callq  0x403658 <pthread_mutex_lock@plt>
   0x004038de <+110>:  test   %eax,%eax
   0x004038e0 <+112>:  mov    (%rsp),%rdx
   0x004038e4 <+116>:  je     0x4038fb <f+139>
   0x004038e6 <+118>:  mov    $0x4452a4,%esi
   0x004038eb <+123>:  mov    $0xa,%edi
   0x004038f0 <+128>:  xor    %eax,%eax
   0x004038f2 <+130>:  callq  0x404370 <_ZN4__rw10__rw_throwEiz>
   0x004038f7 <+135>:  mov    (%rsp),%rdx
   0x004038fb <+139>:  addl   $0x1,0x28(%rdx)
   0x004038ff <+143>:  test   %rdx,%rdx
   0x00403902 <+146>:  je     0x40390c <f+156>
   0x00403904 <+148>:  mov    %rdx,%rdi
   0x00403907 <+151>:  callq  0x4036c8 <pthread_mutex_unlock@plt>
   0x0040390c <+156>:  mov    0x20(%rsp),%rdx
   0x00403911 <+161>:  mov    %rdx,%rdi
   0x00403914 <+164>:  mov    %rdx,(%rsp)
   0x00403918 <+168>:  callq  0x403258 <strlen@plt>
   0x0040391d <+173>:  mov    (%rsp),%rdx
   0x00403921 <+177>:  add    %rax,%r12
   0x00403924 <+180>:  lea    -0x40(%rdx),%rcx
   0x00403928 <+184>:  cmp    $0x66a020,%rcx
   0x0040392f <+191>:  je     0x40398b <f+283>
   0x00403931 <+193>:  mov    %rcx,%rdi
   0x00403934 <+196>:  mov    %rcx,0x8(%rsp)
   0x00403939 <+201>:  callq  0x403658 <pthread_mutex_lock@plt>
   0x0040393e <+206>:  test   %eax,%eax
   0x00403940 <+208>:  mov    (%rsp),%rdx
   0x00403944 <+212>:  mov    0x8(%rsp),%rcx
   0x00403949 <+217>:  je     0x403965 <f+245>
   0x0040394b <+219>:  mov    $0x4452a4,%esi
   0x00403950 <+224>:  mov    $0xa,%edi
   0x00403955 <+229>:  xor    %eax,%eax
   0x00403957 <+231>:  callq  0x404370 <_ZN4__rw10__rw_throwEiz>
   0x0040395c <+236>:  mov    0x8(%rsp),%rcx
   0x00403961 <+241>:  mov    (%rsp),%rdx
   0x00403965 <+245>:  mov    -0x18(%rdx),%esi
   0x00403968 <+248>:  test   %rcx,%rcx
   0x0040396b <+251>:  lea    -0x1(%rsi),%eax
   0x0040396e <+254>:  mov    %eax,-0x18(%rdx)
   0x00403971 <+257>:  je     0x403983 <f+275>
   0x00403973 <+259>:  mov    %rcx,%rdi
   0x00403976 <+262>:  mov    %esi,0x8(%rsp)
   0x0040397a <+266>:  callq  0x4036c8 <pthread_mutex_unlock@plt>
   0x0040397f <+271>:  mov    0x8(%rsp),%esi
   0x00403983 <+275>:  test   %esi,%esi
   0x00403985 <+277>:  jle    0x403a80 <f+528>
   0x0040398b <+283>:  add    $0x1,%ebp
   0x0040398e <+286>:  movq   $0x0,0x20(%rsp
Re: STDCXX-1071 numpunct facet defect
I thought I replied but I see no trace of my post: On 09/27/12 20:27, Martin Sebor wrote: On 09/27/2012 06:41 AM, Liviu Nicoara wrote: On 09/26/12 20:12, Liviu Nicoara wrote: I have created STDCXX-1071 and linked to STDCXX-1056. [...] I am open to all questions, the more the better. Most of my opinions have been expressed earlier, but please ask if you want to know more. I am attaching here the proposed (4.3.x) patch and the timings results (after re-verifying the correctness of the timing program and the results). The 4.2.x patch, the 4.3.x patch, the test program and the results file are also attached to the incident. The patch isn't binary compatible. We can't remove data members in a minor release. We could only do it in a major release. I posted two patches, one for 4.3.x and one for 4.2.x (the latter is binary compatible). My understanding was that minor version revisions may break binary compatibility. Liviu
RE: STDCXX-1071 numpunct facet defect
Only major versions can break binary compatibility. The versioning policy for stdcxx can be found here:
http://stdcxx.apache.org/versions.html

Travis

-----Original Message-----
From: Liviu Nicoara [mailto:nikko...@hates.ms]
Sent: Friday, September 28, 2012 3:52 AM
To: dev@stdcxx.apache.org
Subject: Re: STDCXX-1071 numpunct facet defect

I thought I replied but I see no trace of my post:

On 09/27/12 20:27, Martin Sebor wrote:
On 09/27/2012 06:41 AM, Liviu Nicoara wrote:
On 09/26/12 20:12, Liviu Nicoara wrote:
I have created STDCXX-1071 and linked to STDCXX-1056. [...] I am open to all questions, the more the better. Most of my opinions have been expressed earlier, but please ask if you want to know more.

I am attaching here the proposed (4.3.x) patch and the timings results (after re-verifying the correctness of the timing program and the results). The 4.2.x patch, the 4.3.x patch, the test program and the results file are also attached to the incident.

The patch isn't binary compatible. We can't remove data members in a minor release. We could only do it in a major release.

I posted two patches, one for 4.3.x and one for 4.2.x (the latter is binary compatible). My understanding was that minor version revisions may break binary compatibility.

Liviu
Re: STDCXX-1071 numpunct facet defect
On 09/28/12 11:01, Travis Vitek wrote:
Only major versions can break binary compatibility. The versioning policy for stdcxx can be found here:
http://stdcxx.apache.org/versions.html

Thanks, that clarifies things.

Liviu
Re: STDCXX-1071 numpunct facet defect
On 9/28/12 11:01 AM, Travis Vitek wrote:
Only major versions can break binary compatibility. The versioning policy for stdcxx can be found here:
http://stdcxx.apache.org/versions.html

I have renamed the binary-incompatible patch as patch-5.0.x.diff.

Thanks,
Liviu

Travis

-----Original Message-----
From: Liviu Nicoara [mailto:nikko...@hates.ms]
Sent: Friday, September 28, 2012 3:52 AM
To: dev@stdcxx.apache.org
Subject: Re: STDCXX-1071 numpunct facet defect

I thought I replied but I see no trace of my post:

On 09/27/12 20:27, Martin Sebor wrote:
On 09/27/2012 06:41 AM, Liviu Nicoara wrote:
On 09/26/12 20:12, Liviu Nicoara wrote:
I have created STDCXX-1071 and linked to STDCXX-1056. [...] I am open to all questions, the more the better. Most of my opinions have been expressed earlier, but please ask if you want to know more.

I am attaching here the proposed (4.3.x) patch and the timings results (after re-verifying the correctness of the timing program and the results). The 4.2.x patch, the 4.3.x patch, the test program and the results file are also attached to the incident.

The patch isn't binary compatible. We can't remove data members in a minor release. We could only do it in a major release.

I posted two patches, one for 4.3.x and one for 4.2.x (the latter is binary compatible). My understanding was that minor version revisions may break binary compatibility.

Liviu
Re: STDCXX-1071 numpunct facet defect
On 09/26/12 20:12, Liviu Nicoara wrote:
I have created STDCXX-1071 and linked to STDCXX-1056. [...] I am open to all questions, the more the better. Most of my opinions have been expressed earlier, but please ask if you want to know more.

I am attaching here the proposed (4.3.x) patch and the timings results (after re-verifying the correctness of the timing program and the results). The 4.2.x patch, the 4.3.x patch, the test program and the results file are also attached to the incident.

Thanks,
Liviu

Index: include/loc/_numpunct.h
===================================================================
--- include/loc/_numpunct.h	(revision 1388733)
+++ include/loc/_numpunct.h	(working copy)
@@ -61,7 +61,7 @@ struct numpunct: _RW::__rw_facet
     string_type;
 
     _EXPLICIT numpunct (_RWSTD_SIZE_T __ref = 0)
-    : _RW::__rw_facet (__ref), _C_flags (0) { }
+    : _RW::__rw_facet (__ref) { }
 
     virtual ~numpunct () _RWSTD_ATTRIBUTE_NOTHROW;
 
@@ -109,15 +109,6 @@ protected:
     virtual string_type do_falsename () const {
         return _RW::__rw_get_punct (this, _RW::__rw_fn, char_type ());
     }
-
-private:
-
-    int         _C_flags;          // bitmap of cached data valid flags
-    string      _C_grouping;       // cached results of virtual members
-    string_type _C_truename;
-    string_type _C_falsename;
-    char_type   _C_decimal_point;
-    char_type   _C_thousands_sep;
 };
 
@@ -139,17 +130,7 @@ template <class _CharT>
 inline _TYPENAME numpunct<_CharT>::char_type
 numpunct<_CharT>::decimal_point () const
 {
-    if (!(_C_flags & _RW::__rw_dp)) {
-
-        numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);
-
-        // [try to] get the decimal point first (may throw)
-        // then set a flag to avoid future initializations
-        __self->_C_decimal_point = do_decimal_point ();
-        __self->_C_flags        |= _RW::__rw_dp;
-    }
-
-    return _C_decimal_point;
+    return do_decimal_point ();
 }
 
@@ -157,34 +138,14 @@ template <class _CharT>
 inline _TYPENAME numpunct<_CharT>::char_type
 numpunct<_CharT>::thousands_sep () const
 {
-    if (!(_C_flags & _RW::__rw_ts)) {
-
-        numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);
-
-        // [try to] get the thousands_sep first (may throw)
-        // then set a flag to avoid future initializations
-        __self->_C_thousands_sep = do_thousands_sep ();
-        __self->_C_flags        |= _RW::__rw_ts;
-    }
-
-    return _C_thousands_sep;
+    return do_thousands_sep ();
 }
 
 
 template <class _CharT>
 inline string
 numpunct<_CharT>::grouping () const
 {
-    if (!(_C_flags & _RW::__rw_gr)) {
-
-        numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);
-
-        // [try to] get the grouping first (may throw)
-        // then set a flag to avoid future initializations
-        __self->_C_grouping = do_grouping ();
-        __self->_C_flags   |= _RW::__rw_gr;
-    }
-
-    return _C_grouping;
+    return do_grouping ();
 }
 
@@ -192,17 +153,7 @@ template <class _CharT>
 inline _TYPENAME numpunct<_CharT>::string_type
 numpunct<_CharT>::truename () const
 {
-    if (!(_C_flags & _RW::__rw_tn)) {
-
-        numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);
-
-        // [try to] get the true name first (may throw)
-        // then set a flag to avoid future initializations
-        __self->_C_truename = do_truename ();
-        __self->_C_flags   |= _RW::__rw_tn;
-    }
-
-    return _C_truename;
+    return do_truename ();
 }
 
@@ -210,17 +161,7 @@ template <class _CharT>
 inline _TYPENAME numpunct<_CharT>::string_type
 numpunct<_CharT>::falsename () const
 {
-    if (!(_C_flags & _RW::__rw_fn)) {
-
-        numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);
-
-        // [try to] get the false name first (may throw)
-        // then set a flag to avoid future initializations
-        __self->_C_falsename = do_falsename ();
-        __self->_C_flags    |= _RW::__rw_fn;
-    }
-
-    return _C_falsename;
+    return do_falsename ();
 }

// #endif _RWSTD_NO_EXT_NUMPUNCT_PRIMARY

-*- mode: org -*-

* Machines:

** iMac, Intel, 4 cores:

$ uname -a; gcc -v
Darwin imax 11.4.0 Darwin Kernel Version 11.4.0: Mon Apr 9 19:32:15 PDT 2012; root:xnu-1699.26.8~1/RELEASE_X86_64 x86_64
gcc version 4.7.1 (GCC)

** Linux Slackware, AMD, 16 cores:

$ uname -a; gcc -v
Linux behemoth 2.6.37.6 #3 SMP Sat Apr 9 22:49:32 CDT 2011 x86_64 AMD Opteron(tm) Processor 6134 AuthenticAMD GNU/Linux
gcc version 4.5.2 (GCC)

* Method

** Library

Apply the patch. Build an optimized library (I used 12S in all runs). Build the library, rwtest, and locale database:

$ nice make -Clib
$ nice make -Cbin locales
$ nice make -Crwtest

Properly export the necessary envar if running against the STDCXX locale database, or unset it otherwise:

$ export RWSTD_LOCALE_ROOT=/path/to/.../nls

** Test program

Place the multi-threaded program source file, t.cpp, in srcdir/tests/localization and run make in the
Re: STDCXX-1071 numpunct facet defect
On 09/27/2012 06:41 AM, Liviu Nicoara wrote: On 09/26/12 20:12, Liviu Nicoara wrote: I have created STDCXX-1071 and linked to STDCXX-1056. [...] I am open to all questions, the more the better. Most of my opinions have been expressed earlier, but please ask if you want to know more. I am attaching here the proposed (4.3.x) patch and the timings results (after re-verifying the correctness of the timing program and the results). The 4.2.x patch, the 4.3.x patch, the test program and the results file are also attached to the incident. The patch isn't binary compatible. We can't remove data members in a minor release. We could only do it in a major release. I'm still curious about the performance, though. It doesn't make sense to me that a call to a virtual function is faster than one to an almost trivial inline function. Let me see if I can build the library on Solaris and put together some numbers. I hope to have some time this weekend. Martin Thanks, Liviu
STDCXX-1071 numpunct facet defect
I have created STDCXX-1071 and linked to STDCXX-1056. I have reduced the scope to numpunct because moneypunct is not failing for me. If someone has a moneypunct failure listing, I want to see it.

I have reduced the library code to a failing test case and attached the reduced program there. The program shows the same failures as the library (interestingly enough, it does not fail when the data caching is removed). I have (re)attached the forwarding patches I created for 1056, one binary compatible, the other not.

Martin, my wish is to move this to some completion. Please let me know if there is something else you think I should do to speed this up. I am open to all questions, the more the better. Most of my opinions have been expressed earlier, but please ask if you want to know more.

Thanks,
Liviu