Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
Kevin Grittner kevin.gritt...@wicourts.gov writes: Andres Freund and...@anarazel.de wrote: [concerns addressed in new patch version] Looks good to me. I'm marking this Ready for Committer. Applied with minor editorialization --- improving the comments mostly. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
Andres Freund and...@anarazel.de wrote: I think you see no real benefit, because your strings are rather short - the documents I scanned when noticing the issue where rather long. The document I used in the test which showed the regression was 672,585 characters, containing 10,000 URLs. A rather extreme/contrived example: postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 'and...@anarazel.de http://www.postgresql.org/'::text FROM generate_series(1, 2) g(i)), ' - ')); The most extreme of your examples uses a 979,996 character string, which is less than 50% larger than my test. I am, however, able to see the performance difference for this particular example, so I now have something to work with. I'm seeing some odd behavior in terms of when there is what sort of difference. Once I can categorize it better, I'll follow up. Thanks for the sample which shows the difference. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
I wrote: Thanks for the sample which shows the difference. Ah, once I got on the right track, there is no problem seeing dramatic improvements with the patch. It changes some nasty O(N^2) cases to O(N). In particular, the fixes affect parsing of large strings encoded with multi-byte character encodings and containing email addresses or URLs with a non-IP-address host component. It strikes me as odd that URLs without a slash following the host portion, or with an IP address, are treated so differently in the parser, but if we want to address that, it's a matter for another patch. I'm inclined to think that the minimal differences found in some of my tests probably have more to do with happenstance of code alignment than the particulars of the patch. I did find one significant (although easily solved) problem. In the patch, the recursive setup of usewide, pgwstr, and wstr are not conditioned by #ifdef USE_WIDE_UPPER_LOWER in the non-patched version. Unless there's a good reason for that, the #ifdef should be added. Less critical, but worth fixing one way or the other, TParserClose does not drop breadcrumbs conditioned on #ifdef WPARSER_TRACE, but TParserCopyClose does. I think this should be consistent. Finally, there's that spelling error in the comment for TParserCopyInit. Please fix. If a patch is produced with fixes for these three things, I'd say it'll be ready for committer. I'm marking it as Waiting on Author for fixes to these three items. Sorry for the delay in review. I hope there's still time to get this committed in this CF. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
I wrote: I did find one significant (although easily solved) problem. In the patch, the recursive setup of usewide, pgwstr, and wstr are not conditioned by #ifdef USE_WIDE_UPPER_LOWER in the non-patched version. Unless there's a good reason for that, the #ifdef should be added. That should read: I did find one significant (although easily solved) problem. In the patch, the recursive setup of usewide, pgwstr, and wstr are not conditioned by #ifdef USE_WIDE_UPPER_LOWER as they are in the non-patched version. Unless there's a good reason for that, the #ifdef should be added. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
On Thursday 10 December 2009 19:10:24 Kevin Grittner wrote: I wrote: Thanks for the sample which shows the difference. Ah, once I got on the right track, there is no problem seeing dramatic improvements with the patch. It changes some nasty O(N^2) cases to O(N). In particular, the fixes affect parsing of large strings encoded with multi-byte character encodings and containing email addresses or URLs with a non-IP-address host component. It strikes me as odd that URLs without a slash following the host portion, or with an IP address, are treated so differently in the parser, but if we want to address that, it's a matter for another patch. Same here. Generally I do have to say that I dont find that parser very nice - and actually I think its not very efficient as well because it searches a big part of the search space all the time. I think a generated parser would be more efficient and way much easier to read... I'm inclined to think that the minimal differences found in some of my tests probably have more to do with happenstance of code alignment than the particulars of the patch. Yes, I think that as well. I did find one significant (although easily solved) problem. In the patch, the recursive setup of usewide, pgwstr, and wstr are not conditioned by #ifdef USE_WIDE_UPPER_LOWER in the non-patched version. Unless there's a good reason for that, the #ifdef should be added. Uh. Sorry. Fixed. Less critical, but worth fixing one way or the other, TParserClose does not drop breadcrumbs conditioned on #ifdef WPARSER_TRACE, but TParserCopyClose does. I think this should be consistent. Actually there was *some* thinking behind that: The end of parsing is quite obvious (the call returns, and its so verbose you will never do more than one call anyway) - but knowing where the copy ends is quite important to understand the parse flow. I added a log there because in the end that line is not going to make any difference ;-) Sorry for the delay in review. I hope there's still time to get this committed in this CF. Thanks for your reviewing! Actually I dont mind very much if it gets delayed or not. Its trivial enough that it shouldnt cause much work/conflicts/whatever next round and I am running patched versions anyway, so ... Andres From e01bac641f318b378c4353aa6ccebc76b3071166 Mon Sep 17 00:00:00 2001 From: Andres Freund and...@anarazel.de Date: Thu, 10 Dec 2009 19:54:22 +0100 Subject: [PATCH] Fix TSearch inefficiency because of repeated copying of strings --- src/backend/tsearch/wparser_def.c | 69 ++-- 1 files changed, 65 insertions(+), 4 deletions(-) diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c index 72b435c..46e86ee 100644 *** a/src/backend/tsearch/wparser_def.c --- b/src/backend/tsearch/wparser_def.c *** TParserInit(char *str, int len) *** 328,333 --- 328,371 return prs; } + /* + * As an alternative to a full TParserInit one can create a + * TParserCopy which basically is a normally TParser without a private + * copy of the string - instead it uses the one from another TParser. + * This is useful because at some places TParsers are created + * recursively and the repeated copying around of the strings can + * cause major inefficiency. + * Obviously one may not close the original TParser before the copy. + */ + static TParser * + TParserCopyInit(const TParser const* orig) + { + TParser*prs = (TParser *) palloc0(sizeof(TParser)); + + prs-charmaxlen = orig-charmaxlen; + prs-str = orig-str + orig-state-posbyte; + prs-lenstr = orig-lenstr - orig-state-posbyte; + + #ifdef USE_WIDE_UPPER_LOWER + prs-usewide = orig-usewide; + + if(orig-pgwstr) + prs-pgwstr = orig-pgwstr + orig-state-poschar; + if(orig-wstr) + prs-wstr = orig-wstr + orig-state-poschar; + #endif + + prs-state = newTParserPosition(NULL); + prs-state-state = TPS_Base; + + #ifdef WPARSER_TRACE + fprintf(stderr, parsing copy of \%.*s\\n, len, str); + #endif + + return prs; + } + + static void TParserClose(TParser *prs) { *** TParserClose(TParser *prs) *** 346,354 --- 384,415 pfree(prs-pgwstr); #endif + #ifdef WPARSER_TRACE + fprintf(stderr, closing parser); + #endif + pfree(prs); + } + + /* + * See TParserCopyInit + */ + static void + TParserCopyClose(TParser *prs) + { + while (prs-state) + { + TParserPosition *ptr = prs-state-prev; + + pfree(prs-state); + prs-state = ptr; + } + #ifdef WPARSER_TRACE + fprintf(stderr, closing parser copy); + #endif pfree(prs); } + /* * Character-type support functions, equivalent to is* macros, but * working with any possible encodings and locales. Notes: *** p_isignore(TParser *prs) *** 617,623 static int p_ishost(TParser *prs) { ! TParser*tmpprs = TParserInit(prs-str + prs-state-posbyte, prs-lenstr - prs-state-posbyte); int res = 0;
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
Andres Freund and...@anarazel.de wrote: [concerns addressed in new patch version] Looks good to me. I'm marking this Ready for Committer. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
On Tuesday 08 December 2009 17:15:36 Kevin Grittner wrote: Andres Freund and...@anarazel.de wrote: Could you show your testcase? OK. I was going to try to check other platforms first, and package up the information better, but here goes. I created 1 lines with random IP-based URLs for a test. The first few lines are: create table t1 (c1 int not null primary key, c2 text); insert into t1 values (2, 'http://255.102.51.212/*/quick/brown/fox?jumpsover*lazydog.html http://204.56.222.143/*/quick/brown/fox?jumpsover*lazydog.html http://138.183.168.227/*/quick/brown/fox?jumpsover*lazydog.html I think you see no real benefit, because your strings are rather short - the documents I scanned when noticing the issue where rather long. If your strings are short, they and the copy will fit into cpu cache anyway, so copying them around/converting them to some other string format is not that expensive compared to the rest of the work done. Also after each copying step for large strings the complete cache is filled with unrelated information (namely the end of the string). So every charwise access will need to wait for a memory access. A rather extreme/contrived example: postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 'and...@anarazel.de http://www.postgresql.org/'::text FROM generate_series(1, 1) g(i)), ' - ')); ?column? -- 1 (1 row) Time: 3.740 ms postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 'and...@anarazel.de http://www.postgresql.org/'::text FROM generate_series(1, 1000) g(i)), ' - ')); ?column? -- 1 (1 row) Time: 115.027 ms postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 'and...@anarazel.de http://www.postgresql.org/'::text FROM generate_series(1, 1) g(i)), ' - ')); ?column? -- 1 (1 row) Time: 24355.339 ms postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 'and...@anarazel.de http://www.postgresql.org/'::text FROM generate_series(1, 2) g(i)), ' - ')); ?column? -- 1 (1 row) Time: 47276.739 ms One easily can see the quadratic complexity here. The quadratic complexity lies in the length/amount of emails/urls of the strings, not in the number of to_tsvector calls! After the patch: postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 'and...@anarazel.de http://www.postgresql.org/'::text FROM generate_series(1, 2) g(i)), ' - ')); ?column? -- 1 (1 row) Time: 168.384 ms I could not reproduce the slowdown you mentioned: Without patch: postgres=# SELECT to_tsvector('and...@anarazel.de http://www.postgresql.org/'||g.i::text) FROM generate_series(1, 10) g(i) ORDER BY g.i LIMIT 1; to_tsvector --- '/1':4 'and...@anarazel.de':1 'www.postgresql.org':3 'www.postgresql.org/1':2 (1 row) Time: 1109.833 ms With patch: postgres=# SELECT to_tsvector('and...@anarazel.de http://www.postgresql.org/'||g.i::text) FROM generate_series(1, 10) g(i) ORDER BY g.i LIMIT 1; to_tsvector --- '/1':4 'and...@anarazel.de':1 'www.postgresql.org':3 'www.postgresql.org/1':2 (1 row) Time: 1036.689 ms So on the hardware I tried its even a little bit faster for small strings (Older Xeon32bit, Core2 Duo, 64bit, Nehalem based Xeon 64bit). I could not reproduce any difference with strings not involving urls or emails: Without patch: postgres=# SELECT to_tsvector('live hard, love fast, die young'||g.i::text) FROM generate_series(1, 10) g(i) ORDER BY g.i LIMIT 1; to_tsvector 'die':5 'fast':4 'hard':2 'live':1 'love':3 'young1':6 (1 row) Time: 988.426 ms With patch: postgres=# SELECT to_tsvector('live hard, love fast, die young'||g.i::text) FROM generate_series(1, 10) g(i) ORDER BY g.i LIMIT 1; to_tsvector 'die':5 'fast':4 'hard':2 'live':1 'love':3 'young1':6 (1 row) Time: 975.339 ms So at least in my testing I do see no danger in the patch ;-) Andres PS: I averaged all the time result over multiple runs where it was relevant. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
I wrote: Frankly, I'd be amazed if there was a performance regression, OK, I'm amazed. While it apparently helps some cases dramatically (Andres had a case where run time was reduced by 93.2%), I found a pretty routine case where run time was increased by 3.1%. I tweaked the code and got that down to a 2.5% run time increase. I'm having troubles getting it any lower than that. And yes, this is real, not noise -- the slowest unpatched time for this test is faster than the fastest time with any version of the patch. :-( Andres, could you provide more information on the test which showed the dramatic improvement? In particular, info on OS, CPU, character set, encoding scheme, and what kind of data was used for the test. I'll do some more testing and try to figure out how the patch is slowing things down and post with details. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
On Tuesday 08 December 2009 16:23:11 Kevin Grittner wrote: I wrote: Frankly, I'd be amazed if there was a performance regression, OK, I'm amazed. While it apparently helps some cases dramatically (Andres had a case where run time was reduced by 93.2%), I found a pretty routine case where run time was increased by 3.1%. I tweaked the code and got that down to a 2.5% run time increase. I'm having troubles getting it any lower than that. And yes, this is real, not noise -- the slowest unpatched time for this test is faster than the fastest time with any version of the patch. :-( Andres, could you provide more information on the test which showed the dramatic improvement? In particular, info on OS, CPU, character set, encoding scheme, and what kind of data was used for the test. I'll do some more testing and try to figure out how the patch is slowing things down and post with details. Could you show your testcase? I dont see why it could get slower? I tested with various data, the one benefiting most was some changelog where each entry was signed by an email. OS: Debian Sid, Core2 Duo, UTF-8, and I tried both C and de_DE.UTF8. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
Andres Freund and...@anarazel.de wrote: Could you show your testcase? OK. I was going to try to check other platforms first, and package up the information better, but here goes. I created 1 lines with random IP-based URLs for a test. The first few lines are: create table t1 (c1 int not null primary key, c2 text); insert into t1 values (2, 'http://255.102.51.212/*/quick/brown/fox?jumpsover*lazydog.html http://204.56.222.143/*/quick/brown/fox?jumpsover*lazydog.html http://138.183.168.227/*/quick/brown/fox?jumpsover*lazydog.html Actually, the special character was initially the word the, but I wanted to see if having non-ASCII characters in the value made any difference. It didn't. Unfortunately, I was testing at home last night and forgot to bring the exact test query with me, but it was this or something close to it: \timing select to_tsvector(c2) from t1, (select generate_series(1,200)) x where c1 = 2; I was running on Ubuntu 9.10, an AMD dual core CPU (don't have the model number handy), UTF-8, en_US.UTF8. I dont see why it could get slower? I don't either. The best I can tell, following the pointer from orig to any of its elements seems to be way more expensive than I would ever have guessed. The only thing that seemed to improve the speed was minimizing that by using a local variable to capture any element referenced more than once. (Although, there is overlap between the timings for the original patch and the one which seemed a slight improvement; I would need to do more testing to really rule out noise and have complete confidence that my changes actually are an improvement on the original patch.) Perhaps it is some quirk of using 32 bit pointers on the 64 bit AMD CPU? (I'm looking forward to testing this today on a 64 bit build on an Intel CPU.) -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
On Tuesday 08 December 2009 17:15:36 Kevin Grittner wrote: Andres Freund and...@anarazel.de wrote: Could you show your testcase? Will hopefully look into this later. I dont see why it could get slower? I don't either. The best I can tell, following the pointer from orig to any of its elements seems to be way more expensive than I would ever have guessed. The only thing that seemed to improve the speed was minimizing that by using a local variable to capture any element referenced more than once. (Although, there is overlap between the timings for the original patch and the one which seemed a slight improvement; I would need to do more testing to really rule out noise and have complete confidence that my changes actually are an improvement on the original patch.) Perhaps it is some quirk of using 32 bit pointers on the 64 bit AMD CPU? (I'm looking forward to testing this today on a 64 bit build on an Intel CPU.) Did you test that with a optimized build? Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
Andres Freund and...@anarazel.de wrote: Did you test that with a optimized build? After running a series of tests with --enable-cassert to confirm that nothing squawked, I disabled that before doing any performance testing. Going from memory, I used --enable-debug --with-libxml and --prefix. I didn't explicitly use or disable any compiler optimizations. Possibly relevant is that I was using Greg Smith's peg tool (as best I could): http://github.com/gregs1104/peg/ I'm not aware of it doing anything to the compile options, but I didn't look for that, either. I guess I should poke around that some more to be sure. After having peg do the checkout and set up the directories, did my own ./configure and make runs so that I could be sure of my configure settings. Are there any settings which you feel I should be using which PostgreSQL doesn't set up as you would recommend during the ./configure run? -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
Kevin Grittner wrote: After running a series of tests with --enable-cassert to confirm that nothing squawked, I disabled that before doing any performance testing. Going from memory, I used --enable-debug --with-libxml and --prefix. I didn't explicitly use or disable any compiler optimizations. Possibly relevant is that I was using Greg Smith's peg tool (as best I could): http://github.com/gregs1104/peg/ I'm not aware of it doing anything to the compile options, but I didn't look for that, either. Now that you ask, of course I just spotted a bug in there such that the documented behavior for the PGDEBUG feature doesn't actually work. If you were using that to turn off asserts, that may not have worked as expected. Don't know what you did there exactly. Fix now pushed to the repo. I general, when doing performance testing, now matter how careful I've been I try to do a: show debug_assertions; To confirm I'm not running with those on. Andres, are using any optimization flags when you're testing? Would have preferred that the first mention of my new project didn't involve a bug report, but that's software I guess. For everyone here who's not on the reviewers mailing list: peg is a scripting tool I've put together recently that makes it easier to setup the environment (source code, binaries, database) for testing PostgreSQL in a development content. It takes care of grabbing the latest source, building, starting the database, all that boring stuff. I tried to make it automate the most tedious parts of both development and patch review. Documentation and the program itself are at the git repo Kevin mentioned. -- Greg Smith2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support g...@2ndquadrant.com www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
Greg Smith g...@2ndquadrant.com wrote: Now that you ask, of course I just spotted a bug in there such that the documented behavior for the PGDEBUG feature doesn't actually work. If you were using that to turn off asserts, that may not have worked as expected. Don't know what you did there exactly. Fix now pushed to the repo. Thanks. I was going to reply to your original message with my experiences (and probably still will), but it seemed like it might be relevant here. I did check pg_config results before doing anything, and saw that the debug and cassert weren't set, so that's why I did explicit configure and make commands. Even without that covered, peg was a nice convenience -- I can say that it saved me more time already than it took to install and read the docs. Nice work! Anyway, I'm not sure whether your reply directly answers the point I was raising -- peg doesn't do anything with the compiler optimization flags under the covers, does it? -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
Hi, On Tuesday 08 December 2009 20:09:22 Greg Smith wrote: Andres, are using any optimization flags when you're testing? I was testing with and without debug/cassert - and did not get adverse results in both... Unfortunately it looks like I wont get to test today, but only tomorrow morning its 9pm and I am not yet fully finished with work Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
Kevin Grittner wrote: Anyway, I'm not sure whether your reply directly answers the point I was raising -- peg doesn't do anything with the compiler optimization flags under the covers, does it? Not really. It does this: PGDEBUG=--enable-cassert --enable-debug ./configure --prefix=$PGINST/$PGPROJECT --enable-depend --enable-thread-safety $PGDEBUG Which are pretty standard options. The idea is that you'd use the default normally, then just set PGDEBUG= for a non-debug build--or to otherwise change the configure flags but still get things done automatically for you. If it's set before the script starts, it doesn't change it. I did try to design things so that you could do any step in the automated series manually and not have that screw things up. There's hundreds of lines of code in there just for things like figuring out whether configure has been run or not yet when it decides you need to build, so it can try to do the right thing in either case. My hope was that anyone who tried peg out would find it a net positive time savings after a single use, glad to hear I accomplished that in your case. -- Greg Smith2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support g...@2ndquadrant.com www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
Greg Smith g...@2ndquadrant.com writes: Kevin Grittner wrote: Anyway, I'm not sure whether your reply directly answers the point I was raising -- peg doesn't do anything with the compiler optimization flags under the covers, does it? Not really. It does this: PGDEBUG=--enable-cassert --enable-debug ./configure --prefix=$PGINST/$PGPROJECT --enable-depend --enable-thread-safety $PGDEBUG --enable-cassert might have a noticeable performance impact. We usually try to not have that on when doing performance testing. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
Tom Lane wrote: --enable-cassert might have a noticeable performance impact. We usually try to not have that on when doing performance testing. All covered in the tool's documentation, and it looks like Kevin did the right thing during his tests by checking the pg_config output to confirm the right flags were used. I think we can rule out simple pilot error here and work under the assumption Kevin found a mild performance regression under some circumstances with the patch. Hopefully Andres will be able to replicate the problem, and it sounds like Kevin might be able to provide more information about it tonight too. -- Greg Smith2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support g...@2ndquadrant.com www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
Tom Lane t...@sss.pgh.pa.us wrote: --enable-cassert might have a noticeable performance impact. We usually try to not have that on when doing performance testing. Right. I turned it on for some initial tests to confirm that we had no assertion failures; then turned it off for the performance testing. I did leave --enable-debug on during performance testing. My understanding is that debug info doesn't affect performance, but I guess it never hurts to try an empirical test to confirm. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
I wrote: Perhaps it is some quirk of using 32 bit pointers on the 64 bit AMD CPU? (I'm looking forward to testing this today on a 64 bit build on an Intel CPU.) The exact same test on 64 bit OS (SuSE Enterprise Server) on Intel gave very different results. With 10 runs each of 200 iterations of parsing the 1 URLs, the patch Andres submitted ran 0.4% faster than HEAD, and my attempt to improve on it ran 0.6% slower than HEAD. I'll try to run the numbers to get the percentage chance that a random distribution would have generated a spread as large as either of those; but I think it's safe to say that the submitted patch doesn't hurt there and that my attempt to improve on it was misdirected. :-/ I would like to independently confirm the dramatic improvement reported by Andres. Could I get a short snippet from the log which was used for that, along with an indication of the size of the text parsed in that test? (Since the old code looks like it might have O(N^2) performance in some situations, while the patch changes that to O(N), I might not be testing with a big enough N.) -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
On Monday 07 December 2009 02:12:43 Greg Smith wrote: After getting off to a good start, it looks like this patch is now stuck waiting for a second review pass from Kevin right now, with no open items for Andres to correct. Since the only issues on the table seem to be that of code aesthetics and long-term planning for this style of implementation rather than specific functional bits, I'm leaning toward saying this one is ready to have a committer look at it. Any comments from Kevin or Andres about where this is at? I think it should be ready - the only know thing it needs is a s/usefull/useful/. I will take another look but I doubt I will see anything new. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
Greg Smith g...@2ndquadrant.com wrote: After getting off to a good start, it looks like this patch is now stuck waiting for a second review pass from Kevin right now, with no open items for Andres to correct. Since the only issues on the table seem to be that of code aesthetics and long-term planning for this style of implementation rather than specific functional bits, I'm leaning toward saying this one is ready to have a committer look at it. Any comments from Kevin or Andres about where this is at? Yeah, really the only thing I found to complain about was one misspelled word in a comment. I am currently the hold-up, due to fighting off a bout of some virus and having other real world issues impinge. The only thing left to do, besides correcting the spelling, is to confirm the author's performance improvements and confirm that there is no degradation in a non-targeted situation. Frankly, I'd be amazed if there was a performance regression, because all it really does is pass a pointer to a new spot in an existing input buffer rather than allocating new space and copying the input from the desired spot to the end of the buffer. I can't think of any situations where calculating the new address should be slower than calculating the new address and copying from there to the end of the buffer. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
After getting off to a good start, it looks like this patch is now stuck waiting for a second review pass from Kevin right now, with no open items for Andres to correct. Since the only issues on the table seem to be that of code aesthetics and long-term planning for this style of implementation rather than specific functional bits, I'm leaning toward saying this one is ready to have a committer look at it. Any comments from Kevin or Andres about where this is at? -- Greg Smith2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support g...@2ndquadrant.com www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
On Saturday 14 November 2009 15:33:00 Kevin Grittner wrote: Andres Freund wrote: On Saturday 14 November 2009 01:03:33 Kevin Grittner wrote: It is in context format, applies cleanly, and passes make check. Unfortunately the latter is not saying much - I had a bug there and it was not found by the regression tests. Perhaps I should take a stab and add at least some more... Sounds like a good idea. The one thing to avoid is anything with a long enough run time to annoy those that run it many times a day. Hm. There actually are tests excercising the part where I had a bug... Strange. It was a bug involving uninitialized data so probably the regression tests where just lucky. It is in context format, applies cleanly, and passes make check. Next I read through the code, and have the same question that Andres posed 12 days ago. His patch massively reduces the cost of the parser recursively calling itself for some cases, and it seems like the least invasive way to modify the parser to solve this performance problem; but it does beg the question of why a state machine like this should recursively call itself when it hits certain states. I was wondering about that as well. I am not completely sure but to me it looks like its just done to reduce the amount of rules and states. I'm assuming that's the reason, but didn't dig deep enough to be sure. I suspect to be really sure, I'd have to set it up without the recursion and see what breaks. I can't imagine it would be anything which couldn't be fixed by adding enough states; but perhaps they ran into something where these types would require so many new states that the recursion seemed like the lesser of evils. This is similar to my understanding... I have to say that that code is not exactly clear and well documented... Yeah. I was happy with the level of documentation that you added with your new code, but what was there before is mighty thin. If you gleaned enough information while working on it to feel comfortable adding documentation anywhere else, that would be a good thing. It definitely would be a good thing. But that would definitely be seperate patch. But I fear my current leel of knowledge is sufficient and also I am not sure if I can make myself interested enough in that part. So far the only vote is to proceed with the mitigation, which was my preference, and apparently yours -- so I guess we're at 3 to 0 in favor of that. I'll mark the patch as Waiting on Author so you can add any comments and regression tests you feel are appropriate. By the way, I found one typo in the comments -- it should by useful, not usefull. Ok, will update. I liked what I saw so far, but need to spend more time desk-checking for correctness, testing to confirm that it doesn't change results, and confirming the performance improvement. Thanks again for your reviewing! Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
Andres Freund and...@anarazel.de wrote: On Saturday 14 November 2009 15:33:00 Kevin Grittner wrote: Andres Freund wrote: I had a bug there and it was not found by the regression tests. Perhaps I should take a stab and add at least some more... Sounds like a good idea. Hm. There actually are tests excercising the part where I had a bug... Strange. It was a bug involving uninitialized data so probably the regression tests where just lucky. OK. I won't be looking for extra tests. I have to say that that code is not exactly clear and well documented... Yeah. I was happy with the level of documentation that you added with your new code, but what was there before is mighty thin. If you gleaned enough information while working on it to feel comfortable adding documentation anywhere else, that would be a good thing. It definitely would be a good thing. But that would definitely be seperate patch. But I fear my current leel of knowledge is sufficient and also I am not sure if I can make myself interested enough in that part. Fair enough. I won't be looking for new comments for the old code. By the way, I found one typo in the comments -- it should by useful, not usefull. Ok, will update. Given how trivial that is, I'm putting this back in Needs Review status, and resuming my review work. Barring surprises, I should wrap this up whenever I can free up a two or three hours. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
Andres Freund wrote: On Saturday 14 November 2009 01:03:33 Kevin Grittner wrote: It is in context format, applies cleanly, and passes make check. Unfortunately the latter is not saying much - I had a bug there and it was not found by the regression tests. Perhaps I should take a stab and add at least some more... Sounds like a good idea. The one thing to avoid is anything with a long enough run time to annoy those that run it many times a day. It is in context format, applies cleanly, and passes make check. Next I read through the code, and have the same question that Andres posed 12 days ago. His patch massively reduces the cost of the parser recursively calling itself for some cases, and it seems like the least invasive way to modify the parser to solve this performance problem; but it does beg the question of why a state machine like this should recursively call itself when it hits certain states. I was wondering about that as well. I am not completely sure but to me it looks like its just done to reduce the amount of rules and states. I'm assuming that's the reason, but didn't dig deep enough to be sure. I suspect to be really sure, I'd have to set it up without the recursion and see what breaks. I can't imagine it would be anything which couldn't be fixed by adding enough states; but perhaps they ran into something where these types would require so many new states that the recursion seemed like the lesser of evils. I have to say that that code is not exactly clear and well documented... Yeah. I was happy with the level of documentation that you added with your new code, but what was there before is mighty thin. If you gleaned enough information while working on it to feel comfortable adding documentation anywhere else, that would be a good thing. So far the only vote is to proceed with the mitigation, which was my preference, and apparently yours -- so I guess we're at 3 to 0 in favor of that. I'll mark the patch as Waiting on Author so you can add any comments and regression tests you feel are appropriate. By the way, I found one typo in the comments -- it should by useful, not usefull. I liked what I saw so far, but need to spend more time desk-checking for correctness, testing to confirm that it doesn't change results, and confirming the performance improvement. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
Andres Freund and...@anarazel.de wrote: On Sunday 01 November 2009 16:19:43 Andres Freund wrote: While playing around/evaluating tsearch I notices that to_tsvector is obscenely slow for some files. After some profiling I found that this is due using a seperate TSParser in p_ishost/p_isURLPath in wparser_def.c. If a multibyte encoding is in use TParserInit copies the whole remaining input and converts it to wchar_t or pg_wchar - for every email or protocol prefixed url in the the document. Which obviously is bad. I solved the issue by having a seperate TParserCopyInit/ TParserCopyClose which reuses the the already converted strings of the original TParser - only at different offsets. Another approach would be to get rid of the separate parser invocations - requiring a bunch of additional states. This seemed more complex to me, so I wanted to get some feedback first. Without patch: Time: 5835.676 ms With patch: Time: 395.341 ms As nobody commented here is a corrected (stupid thinko) and cleaned up version. I've taken this one for review, and have taken a preliminary look. It is in context format, applies cleanly, and passes make check. Next I read through the code, and have the same question that Andres posed 12 days ago. His patch massively reduces the cost of the parser recursively calling itself for some cases, and it seems like the least invasive way to modify the parser to solve this performance problem; but it does beg the question of why a state machine like this should recursively call itself when it hits certain states. The patch is mitigating the penalty for what seems to me to be an existing ugly kludge. Is it acceptable to address the problem in this way, or should the much more invasive work be done to eliminate the kludge? (Note: I personally would much rather see the performance penalty addressed this way, and a TODO added for the more invasive work, than to leave this alone for the next release if there's nobody willing to tackle the problem at a more fundamental level.) If nobody has a contrary opinion, I'll proceed with the review of this patch and add something to the TODO page for the more invasive work. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
Kevin Grittner wrote: (Note: I personally would much rather see the performance penalty addressed this way, and a TODO added for the more invasive work, than to leave this alone for the next release if there's nobody willing to tackle the problem at a more fundamental level.) +1 -- Alvaro Herrerahttp://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version
On Saturday 14 November 2009 01:03:33 Kevin Grittner wrote: It is in context format, applies cleanly, and passes make check. Unfortunately the latter is not saying much - I had a bug there and it was not found by the regression tests. Perhaps I should take a stab and add at least some more... It is in context format, applies cleanly, and passes make check. Next I read through the code, and have the same question that Andres posed 12 days ago. His patch massively reduces the cost of the parser recursively calling itself for some cases, and it seems like the least invasive way to modify the parser to solve this performance problem; but it does beg the question of why a state machine like this should recursively call itself when it hits certain states. I was wondering about that as well. I am not completely sure but to me it looks like its just done to reduce the amount of rules and states. I have to say that that code is not exactly clear and well documented... Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers