[Bug libstdc++/45574] cin.getline() is extremely slow
--- Comment #24 from tstarling at wikimedia dot org 2010-09-10 15:25 --- Created an attachment (id=21766) -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=21766action=view) dynamic_cast hack The attached patch uses a dynamic_cast hack to avoid the need to break the ABI. I added *_unlocked functions to cstdio, I'm not sure if this is necessary, but it's easy enough to remove that part if not. I also added some lightly-tested autoconf stuff. I'm an autoconf newbie so that part should probably be reviewed carefully. stdio_sync_filebufwchar_t::_M_getline() is currently unreachable, since I only edited basic_istreamchar::getline() and not basic_istreamwchar_t::getline(). It would be easy enough to fix that. I haven't used getwc_unlocked() because it's a GNU extension, POSIX only has non-wide unlocked I/O. The timings for 1M lines with 500 bytes per line, user time only, are: Old library: 26.7s New library: 1.65s fgets: 0.280s So it's better, but not perfect. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45574
[Bug libstdc++/45574] cin.getline() is extremely slow
--- Comment #25 from paolo dot carlini at oracle dot com 2010-09-10 16:01 --- Good. Please give me a couple of days to come to your code. Note, since you don't have a Copyright Assignment on file, it will be difficult to fully acknowledge your work in the ChangeLog. Thus, I would suggest having first a minimal patch, limited to char, limited in any way ;) but still sufficient to bring most of the improvement we are aiming to within the ABI. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45574
[Bug libstdc++/45574] cin.getline() is extremely slow
--- Comment #15 from paolo dot carlini at oracle dot com 2010-09-09 09:25 --- If you write a patch it would be of course looked at. But *please* try first something that doesn't break the ABI, because we have *no* idea when you changes would be applied otherwise. About the *_unlocked functions, we know those glibc extensions exist, but, as far as I can see would only change the complexity by a not so so small multiplicative constant and, after years and years of using everywhere the normal versions, I don't believe we should change just now the configuration on Linux only. But, as I said, provided you don't break the ABI (to be concrete) and the improvements are substantive, you are certainly welcome to submit patches to the libstdc++ mailing list. Thanks. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45574
[Bug libstdc++/45574] cin.getline() is extremely slow
--- Comment #16 from jakub at gcc dot gnu dot org 2010-09-09 10:33 --- The *_unlocked versions are faster a lot actually, at least for the one character ops, because no locking is performed and the calls are inlined. But the question is whether libstdc++ can use them, unless there is some restriction that would disallow several threads from using the same FILE * (including using STL APIs in one thread and C stdio APIs in another thread). -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45574
[Bug libstdc++/45574] cin.getline() is extremely slow
--- Comment #17 from paolo dot carlini at oracle dot com 2010-09-09 10:42 --- At some point I tried quickly replacing some getc, did notice an improvement of course, but of the order of magnitude I mentioned. Worth further investigating sure (and simple, just replace in stdio_sync_ and measure, on Linux). In terms of the C++ Standard, I think that C++98 would allow essentially *anything*, not so C++0x, and within C++98 too I'm afraid we can break code making already some assumptions about the thread safety, which we don't spell out anywhere as impl def behavior, still... -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45574
[Bug libstdc++/45574] cin.getline() is extremely slow
--- Comment #18 from tstarling at wikimedia dot org 2010-09-09 14:12 --- Created an attachment (id=21752) -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=21752action=view) gprof output I haven't managed to get libstdc++ to compile with -pg, but compiling the test program with -static at least gives you a function breakdown. gprof output attached for 1 million lines, 500 bytes per line. To summarise: fgetc: 36.13% istream::getline: 18.01% ungetc: 16.70% _IO_sputbackc: 9.54% stdio_sync_filebuf::underflow: 5.66% stdio_sync_filebuf::uflow: 4.93% I should have spotted it from reading the code, it's not a loop of getc(), it's a loop of ungetc(getc()) getc(). It really demonstrates how poorly suited the streambuf interface is to unbuffered input. The virtual functions called by istream::getline() don't give much flexibility. So I still have no other ideas apart from breaking the ABI. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45574
[Bug libstdc++/45574] cin.getline() is extremely slow
--- Comment #19 from tstarling at wikimedia dot org 2010-09-09 14:28 --- (In reply to comment #16) The *_unlocked versions are faster a lot actually, at least for the one character ops, because no locking is performed and the calls are inlined. But the question is whether libstdc++ can use them, unless there is some restriction that would disallow several threads from using the same FILE * (including using STL APIs in one thread and C stdio APIs in another thread). My current idea is to do: flockfile(stdin); while (!eof) { c = getc_unlocked(stdin); ... } funlockfile(stdin); This is not only much faster, it's an improvement to the current behaviour in terms of locking and thread safety. The current behaviour, as I said in comment #4, could cause data to be badly mangled if one thread uses stdio while another uses cin.getline(). Using getc() in preference to getc_unlocked() does not help. And unlike getdelim(), the unlocked I/O functions are in POSIX.1-2001, says the man page, so it's relatively portable. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45574
[Bug libstdc++/45574] cin.getline() is extremely slow
--- Comment #20 from paolo dot carlini at oracle dot com 2010-09-09 14:53 --- Good about POSIX, we would add a configure time test with some hope to enable the mechanism outside Linux too. Anyway, I'm sure your kind of loop would improve the performance a lot - if only we could have it without breaking the ABI I would be in favor of having it immediately - still, we would still use a single char function, I think the complexity for long lines would still scale badly. As a matter of fact, I think the only completely satisfactory design would be that used by the old v2, with a low level libio layer, doing buffering and the low level operations, and used by the C and C++ libraries on top. Missing that, I don't think the C++ library, working purely on top of the Standard C library will ever be able to performe as well as C in the synced mode. The only hope could be exploiting, on Linux systems, a glibc *extension* (we do that in many other cases), like an fgetc not writing '\0' and newline, which the glibc people would essentially provide exactly to help the C++ library implementation. Anyway, sorry if I may have appeared a little too harsh in my first replies, the fact is I know the history of these facilities, I know all the effort other people besides me put to have a good overall compromise (eg, stdio_sync isn't mine and solved a *a lot* of problems), we are certainly open to improvements, but realistic ones, at least until we break all the ABIs. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45574
[Bug libstdc++/45574] cin.getline() is extremely slow
--- Comment #21 from jakub at gcc dot gnu dot org 2010-09-09 15:07 --- #if __GNUC__ = 3 # define _IO_BE(expr, res) __builtin_expect ((expr), res) #else # define _IO_BE(expr, res) (expr) #endif #define _IO_getc_unlocked(_fp) \ (_IO_BE ((_fp)-_IO_read_ptr = (_fp)-_IO_read_end, 0) \ ? __uflow (_fp) : *(unsigned char *) (_fp)-_IO_read_ptr++) ... __STDIO_INLINE int getc_unlocked (FILE *__fp) { return _IO_getc_unlocked (__fp); } i.e. if getc_unlocked is called directly, it should be very fast, unless the underlying stream is unbuffered. This is direct access to the buffer, just STL getline couldn't use memchr. Anyway, not sure which STL getline we are talking about here, because e.g. src/istream.cc getline seems to access the stdio buffer directly: streamsize __size = std::min(streamsize(__sb-egptr() - __sb-gptr()), streamsize(__n - _M_gcount - 1)); if (__size 1) { const char_type* __p = traits_type::find(__sb-gptr(), __size, __delim); -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45574
[Bug libstdc++/45574] cin.getline() is extremely slow
--- Comment #22 from paolo dot carlini at oracle dot com 2010-09-09 16:08 --- Jakub, when, by default, cin co boil down to stdio_sync_filebuf, the underlying basic_streambuf is unbuffered, everything is unbuffered in the C++ library. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45574
[Bug libstdc++/45574] cin.getline() is extremely slow
--- Comment #23 from tstarling at wikimedia dot org 2010-09-10 00:17 --- (In reply to comment #21) Anyway, not sure which STL getline we are talking about here, because e.g. src/istream.cc getline seems to access the stdio buffer directly: streamsize __size = std::min(streamsize(__sb-egptr() - __sb-gptr()), streamsize(__n - _M_gcount - 1)); __sb-gptr() and __sb-egptr() are always null for this kind of streambuf, so __size is always zero, and so the loop just calls snextc() on every iteration. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45574
[Bug libstdc++/45574] cin.getline() is extremely slow
--- Comment #10 from paolo dot carlini at oracle dot com 2010-09-08 09:56 --- (In reply to comment #8) Maybe you should tell that to Paolo Carlini, who closed bug 15002 as resolved fixed in 2004, And it *is* fixed. Did you actually open the testcases? Just plain fstreams, thus no syncing. or to Loren Rittle, who closed bug 5001 as resolved fixed in 2003, declaring This issue was addressed by gcc 3.2.X such that sync_with_stdio was no longer required for reasonable performance. And indeed it's true. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45574
[Bug libstdc++/45574] cin.getline() is extremely slow
--- Comment #11 from paolo dot carlini at oracle dot com 2010-09-08 09:59 --- (In reply to comment #8) But a 9-10x difference doesn't sound reasonable to me. The synced mode is not unbuffered, before or after my suggested change, it uses the internal buffer in glibc. So? We are not changing glibc here. The C++ library does *not* use buffering in the synced mode, and it does otherwise, for fstreams in particular. Where do you think the performance difference is essentially coming from? -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45574
[Bug libstdc++/45574] cin.getline() is extremely slow
--- Comment #12 from paolo dot carlini at oracle dot com 2010-09-08 10:20 --- By the way, getdelim is not standard, thus would work only on linux, even more special casing. More importantly, fgets *stores* newline and '\0', at variance with getline, I don't think it can be used as-is as an implementation detail. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45574
[Bug libstdc++/45574] cin.getline() is extremely slow
--- Comment #13 from jakub at gcc dot gnu dot org 2010-09-08 14:48 --- And, getdelim insist on allocating resp. reallocating the buffer. That is of course usually desirable when used from C, but I'm afraid it isn't in this case, where you have user provided buffer with fixed size. You don't want to read more than that size, while getdelim will read any amount of data. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45574
[Bug libstdc++/45574] cin.getline() is extremely slow
--- Comment #14 from tstarling at wikimedia dot org 2010-09-09 02:31 --- (In reply to comment #11) So? We are not changing glibc here. The C++ library does *not* use buffering in the synced mode, and it does otherwise, for fstreams in particular. Where do you think the performance difference is essentially coming from? Sure, buffering would help, because the interface between the C++ library and the buffer in the C library is slow. I just meant that the lack of a buffer in C++ isn't an excuse for slowness since it should theoretically be possible for C++ to access the buffer in the C library without much overhead. At another level, your question is unsolved and interesting, because while(getc(stdin)!=EOF); is much faster than cin.getline(), taking only 0.632s of user time for the attached test case. And a loop of getc_unlocked() only takes 0.188s of user time. So there may be opportunity for optimisation here without resorting to fgets() or getdelim(), which as you say, suck in various ways. I'll see if I have time for some more testing. If I wrote a patch involving a new virtual method or two, would it be looked at? -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45574