Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-12-15 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes:
 Andres Freund and...@anarazel.de wrote:
 [concerns addressed in new patch version]
 
 Looks good to me.  I'm marking this Ready for Committer.

Applied with minor editorialization --- improving the comments mostly.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-12-10 Thread Kevin Grittner
Andres Freund and...@anarazel.de wrote:
 
 I think you see no real benefit, because your strings are rather
 short - the documents I scanned when noticing the issue were
 rather long.
 
The document I used in the test which showed the regression was
672,585 characters, containing 10,000 URLs.
 
 A rather extreme/contrived example:
 
 postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 
 'and...@anarazel.de http://www.postgresql.org/'::text FROM 
 generate_series(1, 
 2) g(i)), ' -  '));
 
The most extreme of your examples uses a 979,996 character string,
which is less than 50% larger than my test.  I am, however, able to
see the performance difference for this particular example, so I now
have something to work with.  I'm seeing some odd behavior in terms
of when there is what sort of difference.  Once I can categorize it
better, I'll follow up.
 
Thanks for the sample which shows the difference.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-12-10 Thread Kevin Grittner
I wrote:
 
 Thanks for the sample which shows the difference.
 
Ah, once I got on the right track, there is no problem seeing
dramatic improvements with the patch.  It changes some nasty O(N^2)
cases to O(N).  In particular, the fixes affect parsing of large
strings encoded with multi-byte character encodings and containing
email addresses or URLs with a non-IP-address host component.  It
strikes me as odd that URLs without a slash following the host
portion, or with an IP address, are treated so differently in the
parser, but if we want to address that, it's a matter for another
patch.
 
I'm inclined to think that the minimal differences found in some of
my tests probably have more to do with happenstance of code
alignment than the particulars of the patch.
 
I did find one significant (although easily solved) problem.  In the
patch, the recursive setup of usewide, pgwstr, and wstr are not
conditioned by #ifdef USE_WIDE_UPPER_LOWER in the non-patched
version.  Unless there's a good reason for that, the #ifdef should
be added.
 
Less critical, but worth fixing one way or the other, TParserClose
does not drop breadcrumbs conditioned on #ifdef WPARSER_TRACE, but
TParserCopyClose does.  I think this should be consistent.
 
Finally, there's that spelling error in the comment for
TParserCopyInit.  Please fix.
 
If a patch is produced with fixes for these three things, I'd say
it'll be ready for committer.  I'm marking it as Waiting on Author
for fixes to these three items.
 
Sorry for the delay in review.  I hope there's still time to get
this committed in this CF.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-12-10 Thread Kevin Grittner
I wrote:
 
 I did find one significant (although easily solved) problem.  In
 the patch, the recursive setup of usewide, pgwstr, and wstr are
 not conditioned by #ifdef USE_WIDE_UPPER_LOWER in the non-patched
 version.  Unless there's a good reason for that, the #ifdef should
 be added.
 
That should read:
 
I did find one significant (although easily solved) problem.  In
the patch, the recursive setup of usewide, pgwstr, and wstr are
not conditioned by #ifdef USE_WIDE_UPPER_LOWER as they are in the
non-patched version.  Unless there's a good reason for that, the
#ifdef should be added.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-12-10 Thread Andres Freund
On Thursday 10 December 2009 19:10:24 Kevin Grittner wrote:
 I wrote:
  Thanks for the sample which shows the difference.
 
 Ah, once I got on the right track, there is no problem seeing
 dramatic improvements with the patch.  It changes some nasty O(N^2)
 cases to O(N).  In particular, the fixes affect parsing of large
 strings encoded with multi-byte character encodings and containing
 email addresses or URLs with a non-IP-address host component.  It
 strikes me as odd that URLs without a slash following the host
 portion, or with an IP address, are treated so differently in the
 parser, but if we want to address that, it's a matter for another
 patch.
Same here. Generally I have to say that I don't find that parser very nice - 
and actually I think it's not very efficient either, because it searches a big 
part of the search space all the time.
I think a generated parser would be more efficient and much easier to 
read...

 I'm inclined to think that the minimal differences found in some of
 my tests probably have more to do with happenstance of code
 alignment than the particulars of the patch.
Yes, I think that as well.

 I did find one significant (although easily solved) problem.  In the
 patch, the recursive setup of usewide, pgwstr, and wstr are not
 conditioned by #ifdef USE_WIDE_UPPER_LOWER in the non-patched
 version.  Unless there's a good reason for that, the #ifdef should
 be added.
Uh. Sorry. Fixed.

 Less critical, but worth fixing one way or the other, TParserClose
 does not drop breadcrumbs conditioned on #ifdef WPARSER_TRACE, but
 TParserCopyClose does.  I think this should be consistent.
Actually there was *some* thinking behind that: the end of parsing is quite 
obvious (the call returns, and it's so verbose you will never do more than one 
call anyway) - but knowing where the copy ends is quite important to 
understand the parse flow.
I added a log there because in the end that line is not going to make any 
difference ;-)

 Sorry for the delay in review.  I hope there's still time to get
 this committed in this CF.
Thanks for your reviewing!

Actually I don't mind very much whether it gets delayed or not. It's trivial enough 
that it shouldn't cause much work/conflicts/whatever next round, and I am running 
patched versions anyway, so ...



Andres
From e01bac641f318b378c4353aa6ccebc76b3071166 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Thu, 10 Dec 2009 19:54:22 +0100
Subject: [PATCH] Fix TSearch inefficiency because of repeated copying of strings

---
 src/backend/tsearch/wparser_def.c |   69 ++--
 1 files changed, 65 insertions(+), 4 deletions(-)

diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c
index 72b435c..46e86ee 100644
*** a/src/backend/tsearch/wparser_def.c
--- b/src/backend/tsearch/wparser_def.c
*** TParserInit(char *str, int len)
*** 328,333 
--- 328,371 
  	return prs;
  }
  
+ /*
+  * As an alternative to a full TParserInit one can create a
+  * TParserCopy which basically is a normal TParser without a private
+  * copy of the string - instead it uses the one from another TParser.
+  * This is useful because at some places TParsers are created
+  * recursively and the repeated copying around of the strings can
+  * cause major inefficiency.
+  * Obviously one may not close the original TParser before the copy.
+  */
+ static TParser *
+ TParserCopyInit(const TParser const* orig)
+ {
+ 	TParser    *prs = (TParser *) palloc0(sizeof(TParser));
+ 
+ 	prs->charmaxlen = orig->charmaxlen;
+ 	prs->str = orig->str + orig->state->posbyte;
+ 	prs->lenstr = orig->lenstr - orig->state->posbyte;
+ 
+ #ifdef USE_WIDE_UPPER_LOWER
+ 	prs->usewide = orig->usewide;
+ 
+ 	if (orig->pgwstr)
+ 		prs->pgwstr = orig->pgwstr + orig->state->poschar;
+ 	if (orig->wstr)
+ 		prs->wstr = orig->wstr + orig->state->poschar;
+ #endif
+ 
+ 	prs->state = newTParserPosition(NULL);
+ 	prs->state->state = TPS_Base;
+ 
+ #ifdef WPARSER_TRACE
+ 	fprintf(stderr, "parsing copy of \"%.*s\"\n", prs->lenstr, prs->str);
+ #endif
+ 
+ 	return prs;
+ }
+ 
+ 
  static void
  TParserClose(TParser *prs)
  {
*** TParserClose(TParser *prs)
*** 346,354 
--- 384,415 
  		pfree(prs->pgwstr);
  #endif
  
+ #ifdef WPARSER_TRACE
+ 	fprintf(stderr, "closing parser");
+ #endif
+ 	pfree(prs);
+ }
+ 
+ /*
+  * See TParserCopyInit
+  */
+ static void
+ TParserCopyClose(TParser *prs)
+ {
+ 	while (prs->state)
+ 	{
+ 		TParserPosition *ptr = prs->state->prev;
+ 
+ 		pfree(prs->state);
+ 		prs->state = ptr;
+ 	}
+ #ifdef WPARSER_TRACE
+ 	fprintf(stderr, "closing parser copy");
+ #endif
  	pfree(prs);
  }
  
+ 
  /*
   * Character-type support functions, equivalent to is* macros, but
   * working with any possible encodings and locales. Notes:
*** p_isignore(TParser *prs)
*** 617,623 
  static int
  p_ishost(TParser *prs)
  {
! 	TParser    *tmpprs = TParserInit(prs->str + prs->state->posbyte, prs->lenstr - prs->state->posbyte);
  	int			res = 0;
  
  	

Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-12-10 Thread Kevin Grittner
Andres Freund and...@anarazel.de wrote:
 
 [concerns addressed in new patch version]
 
Looks good to me.  I'm marking this Ready for Committer.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-12-09 Thread Andres Freund
On Tuesday 08 December 2009 17:15:36 Kevin Grittner wrote:
 Andres Freund and...@anarazel.de wrote:
  Could you show your testcase?
 
 OK.  I was going to try to check other platforms first, and package
 up the information better, but here goes.
 
 I created 10,000 lines with random IP-based URLs for a test.  The
 first few lines are:
 
 create table t1 (c1 int not null primary key, c2 text);
 insert into t1 values (2,
 'http://255.102.51.212/*/quick/brown/fox?jumpsover*lazydog.html
  http://204.56.222.143/*/quick/brown/fox?jumpsover*lazydog.html
  http://138.183.168.227/*/quick/brown/fox?jumpsover*lazydog.html
I think you see no real benefit, because your strings are rather short - the 
documents I scanned when noticing the issue were rather long.
If your strings are short, they and the copy will fit into the CPU cache anyway, so 
copying them around/converting them to some other string format is not that 
expensive compared to the rest of the work done.

Also, after each copying step for large strings the complete cache is filled 
with unrelated information (namely the end of the string), so every character-wise 
access will need to wait for a memory access.

A rather extreme/contrived example:


postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 
'and...@anarazel.de http://www.postgresql.org/'::text FROM generate_series(1, 
1) g(i)), ' -  '));
  ?column? 
 --
 1
 (1 row)
 
Time: 3.740 ms
postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 
'and...@anarazel.de http://www.postgresql.org/'::text FROM generate_series(1, 
1000) g(i)), ' -  '));
  ?column? 
 --
 1
 (1 row)
 
Time: 115.027 ms

 postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 
'and...@anarazel.de http://www.postgresql.org/'::text FROM generate_series(1, 
1) g(i)), ' -  '));
  ?column? 
 --
 1
 (1 row)
  
Time: 24355.339 ms
postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 
'and...@anarazel.de http://www.postgresql.org/'::text FROM generate_series(1, 
2) g(i)), ' -  '));
  ?column? 
 --
 1
 (1 row)
 
Time: 47276.739 ms


One can easily see the quadratic complexity here. The quadratic complexity 
lies in the length of the strings and the number of emails/URLs they contain, not in the 
number of to_tsvector calls!

After the patch:

postgres=# SELECT 1 FROM to_tsvector(array_to_string(ARRAY(SELECT 
'and...@anarazel.de http://www.postgresql.org/'::text FROM generate_series(1, 
2) g(i)), ' -  '));
  ?column? 
 --
 1
 (1 row)
 
Time: 168.384 ms


I could not reproduce the slowdown you mentioned:


Without patch:
postgres=# SELECT to_tsvector('and...@anarazel.de 
http://www.postgresql.org/'||g.i::text) FROM generate_series(1, 10) g(i) 
ORDER BY g.i LIMIT 1;
   to_tsvector  
 ---
  '/1':4 'and...@anarazel.de':1 'www.postgresql.org':3 
'www.postgresql.org/1':2
 (1 row)
Time: 1109.833 ms

With patch:
postgres=# SELECT to_tsvector('and...@anarazel.de 
http://www.postgresql.org/'||g.i::text) FROM generate_series(1, 10) g(i) 
ORDER BY g.i LIMIT 1;
   to_tsvector  
 ---
  '/1':4 'and...@anarazel.de':1 'www.postgresql.org':3 
'www.postgresql.org/1':2
 (1 row)
 
Time: 1036.689 ms

So on the hardware I tried, it's even a little bit faster for small strings 
(older 32-bit Xeon, 64-bit Core2 Duo, 64-bit Nehalem-based Xeon).


I could not reproduce any difference with strings not involving URLs or emails:

Without patch:

postgres=# SELECT to_tsvector('live hard, love fast, die young'||g.i::text) 
FROM generate_series(1, 10) g(i) ORDER BY g.i LIMIT 1;
   to_tsvector   
 
  'die':5 'fast':4 'hard':2 'live':1 'love':3 'young1':6
 (1 row)
 
Time: 988.426 ms

With patch:

postgres=# SELECT to_tsvector('live hard, love fast, die young'||g.i::text) 
FROM generate_series(1, 10) g(i) ORDER BY g.i LIMIT 1;
   to_tsvector   
 
  'die':5 'fast':4 'hard':2 'live':1 'love':3 'young1':6
 (1 row)
 
Time: 975.339 ms


So at least in my testing I see no danger in the patch ;-)

Andres



PS: I averaged all the timing results over multiple runs where it was relevant.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-12-08 Thread Kevin Grittner
I wrote:
 
 Frankly, I'd be amazed if there was a performance regression,
 
OK, I'm amazed.  While it apparently helps some cases dramatically
(Andres had a case where run time was reduced by 93.2%), I found a
pretty routine case where run time was increased by 3.1%.  I tweaked
the code and got that down to a 2.5% run time increase.  I'm having
trouble getting it any lower than that.  And yes, this is real, not
noise -- the slowest unpatched time for this test is faster than the
fastest time with any version of the patch.  :-(
 
Andres, could you provide more information on the test which showed
the dramatic improvement?  In particular, info on OS, CPU, character
set, encoding scheme, and what kind of data was used for the test.
 
I'll do some more testing and try to figure out how the patch is
slowing things down and post with details.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-12-08 Thread Andres Freund
On Tuesday 08 December 2009 16:23:11 Kevin Grittner wrote:
 I wrote:
  Frankly, I'd be amazed if there was a performance regression,
 
 OK, I'm amazed.  While it apparently helps some cases dramatically
 (Andres had a case where run time was reduced by 93.2%), I found a
 pretty routine case where run time was increased by 3.1%.  I tweaked
 the code and got that down to a 2.5% run time increase.  I'm having
 trouble getting it any lower than that.  And yes, this is real, not
 noise -- the slowest unpatched time for this test is faster than the
 fastest time with any version of the patch.  :-(
 
 Andres, could you provide more information on the test which showed
 the dramatic improvement?  In particular, info on OS, CPU, character
 set, encoding scheme, and what kind of data was used for the test.
 
 I'll do some more testing and try to figure out how the patch is
 slowing things down and post with details.
Could you show your testcase? I don't see why it could get slower.

I tested with various data; the one benefiting most was some changelog where 
each entry was signed with an email address.

OS: Debian Sid, Core2 Duo, UTF-8, and I tried both C and de_DE.UTF8.

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-12-08 Thread Kevin Grittner
Andres Freund and...@anarazel.de wrote: 
 
 Could you show your testcase?
 
OK.  I was going to try to check other platforms first, and package
up the information better, but here goes.
 
I created 10,000 lines with random IP-based URLs for a test.  The
first few lines are:
 
create table t1 (c1 int not null primary key, c2 text);
insert into t1 values (2, 
'http://255.102.51.212/*/quick/brown/fox?jumpsover*lazydog.html
 http://204.56.222.143/*/quick/brown/fox?jumpsover*lazydog.html
 http://138.183.168.227/*/quick/brown/fox?jumpsover*lazydog.html
 
Actually, the special character was initially the word "the", but I
wanted to see if having non-ASCII characters in the value made any
difference.  It didn't.
 
Unfortunately, I was testing at home last night and forgot to bring
the exact test query with me, but it was this or something close to
it:
 
\timing
select to_tsvector(c2)
  from t1, (select generate_series(1,200)) x where c1 = 2;
 
I was running on Ubuntu 9.10, an AMD dual core CPU (don't have the
model number handy), UTF-8, en_US.UTF8.
 
 I don't see why it could get slower.
 
I don't either.  As best I can tell, following the pointer from
orig to any of its elements seems to be way more expensive than I
would ever have guessed.  The only thing that seemed to improve the
speed was minimizing that by using a local variable to capture any
element referenced more than once.  (Although, there is overlap
between the timings for the original patch and the one which seemed
a slight improvement; I would need to do more testing to really rule
out noise and have complete confidence that my changes actually are
an improvement on the original patch.)
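 
Roughly, the tweak I tried looks like this (a minimal sketch with
made-up struct and field names, not the actual TParser code from the
patch):
 
struct pos
{
	int			posbyte;		/* byte offset into the original string */
};
 
struct prsr
{
	const char *str;
	int			lenstr;
	struct pos *state;
};
 
/* submitted patch (sketch): walk orig->state->... for every field */
static void
copy_offsets_orig(struct prsr *prs, const struct prsr *orig)
{
	prs->str = orig->str + orig->state->posbyte;
	prs->lenstr = orig->lenstr - orig->state->posbyte;
}
 
/* my tweak (sketch): hoist the repeated dereference chain into a local */
static void
copy_offsets_tweaked(struct prsr *prs, const struct prsr *orig)
{
	int			posbyte = orig->state->posbyte;
 
	prs->str = orig->str + posbyte;
	prs->lenstr = orig->lenstr - posbyte;
}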
 
Perhaps it is some quirk of using 32 bit pointers on the 64 bit AMD
CPU?  (I'm looking forward to testing this today on a 64 bit build
on an Intel CPU.)
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-12-08 Thread Andres Freund
On Tuesday 08 December 2009 17:15:36 Kevin Grittner wrote:
 Andres Freund and...@anarazel.de wrote:
  Could you show your testcase?
Will hopefully look into this later.

  I don't see why it could get slower.
 I don't either.  As best I can tell, following the pointer from
 orig to any of its elements seems to be way more expensive than I
 would ever have guessed.  The only thing that seemed to improve the
 speed was minimizing that by using a local variable to capture any
 element referenced more than once.  (Although, there is overlap
 between the timings for the original patch and the one which seemed
 a slight improvement; I would need to do more testing to really rule
 out noise and have complete confidence that my changes actually are
 an improvement on the original patch.)
 
 Perhaps it is some quirk of using 32 bit pointers on the 64 bit AMD
 CPU?  (I'm looking forward to testing this today on a 64 bit build
 on an Intel CPU.)
Did you test that with an optimized build?

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-12-08 Thread Kevin Grittner
Andres Freund and...@anarazel.de wrote:
 
 Did you test that with a optimized build?
 
After running a series of tests with --enable-cassert to confirm
that nothing squawked, I disabled that before doing any
performance testing.  Going from memory, I used --enable-debug
--with-libxml and --prefix.  I didn't explicitly use or disable any
compiler optimizations.
 
Possibly relevant is that I was using Greg Smith's peg tool (as best
I could):
 
http://github.com/gregs1104/peg/ 
 
I'm not aware of it doing anything to the compile options, but I
didn't look for that, either.  I guess I should poke around that
some more to be sure.  After having peg do the checkout and set up
the directories, I did my own ./configure and make runs so that I
could be sure of my configure settings.
 
Are there any settings which you feel I should be using which
PostgreSQL doesn't set up as you would recommend during the
./configure run?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-12-08 Thread Greg Smith

Kevin Grittner wrote:

After running a series of tests with --enable-cassert to confirm
that nothing squawked, I disabled that before doing any
performance testing.  Going from memory, I used --enable-debug
--with-libxml and --prefix.  I didn't explicitly use or disable any
compiler optimizations.
 
Possibly relevant is that I was using Greg Smith's peg tool (as best

I could):
 
http://github.com/gregs1104/peg/ 
 
I'm not aware of it doing anything to the compile options, but I

didn't look for that, either.
Now that you ask, of course I just spotted a bug in there such that the 
documented behavior for the PGDEBUG feature doesn't actually work.  If 
you were using that to turn off asserts, that may not have worked as 
expected.  Don't know what you did there exactly.  Fix now pushed to the 
repo. 

In general, when doing performance testing, no matter how careful I've 
been, I try to do a:


show debug_assertions;

To confirm I'm not running with those on.

Andres, are you using any optimization flags when you're testing? 

Would have preferred that the first mention of my new project didn't 
involve a bug report, but that's software I guess.  For everyone here 
who's not on the reviewers mailing list:  peg is a scripting tool I've 
put together recently that makes it easier to set up the environment 
(source code, binaries, database) for testing PostgreSQL in a 
development context.  It takes care of grabbing the latest source, 
building, starting the database, all that boring stuff.  I tried to make 
it automate the most tedious parts of both development and patch 
review.  Documentation and the program itself are at the git repo Kevin 
mentioned.


--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-12-08 Thread Kevin Grittner
Greg Smith g...@2ndquadrant.com wrote:
 
 Now that you ask, of course I just spotted a bug in there such
 that the documented behavior for the PGDEBUG feature doesn't
 actually work.   If you were using that to turn off asserts, that
 may not have worked as expected.  Don't know what you did there
 exactly.  Fix now pushed to the repo. 
 
Thanks.  I was going to reply to your original message with my
experiences (and probably still will), but it seemed like it might
be relevant here.  I did check pg_config results before doing
anything, and saw that the debug and cassert weren't set, so that's
why I did explicit configure and make commands.  Even without that
covered, peg was a nice convenience -- I can say that it saved me
more time already than it took to install and read the docs.  Nice
work!
 
Anyway, I'm not sure whether your reply directly answers the point
I was raising -- peg doesn't do anything with the compiler
optimization flags under the covers, does it?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-12-08 Thread Andres Freund
Hi,

On Tuesday 08 December 2009 20:09:22 Greg Smith wrote:
 Andres, are using any optimization flags when you're testing?
I was testing with and without debug/cassert - and did not get adverse results 
in either case...

Unfortunately it looks like I won't get to test today, but only tomorrow 
morning - it's 9pm and I am not yet fully finished with work.

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-12-08 Thread Greg Smith

Kevin Grittner wrote:

Anyway, I'm not sure whether your reply directly answers the point
I was raising -- peg doesn't do anything with the compiler
optimization flags under the covers, does it?
 

Not really.  It does this:

PGDEBUG="--enable-cassert --enable-debug"
./configure --prefix=$PGINST/$PGPROJECT --enable-depend 
--enable-thread-safety $PGDEBUG


Which are pretty standard options.  The idea is that you'd use the 
default normally, then just set PGDEBUG="" for a non-debug build--or to 
otherwise change the configure flags but still get things done 
automatically for you.  If it's set before the script starts, it doesn't 
change it.


I did try to design things so that you could do any step in the 
automated series manually and not have that screw things up.  There's 
hundreds of lines of code in there just for things like figuring out 
whether configure has been run or not yet when it decides you need to 
build, so it can try to do the right thing in either case.  My hope was 
that anyone who tried peg out would find it a net positive time savings 
after a single use, glad to hear I accomplished that in your case.


--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-12-08 Thread Tom Lane
Greg Smith g...@2ndquadrant.com writes:
 Kevin Grittner wrote:
 Anyway, I'm not sure whether your reply directly answers the point
 I was raising -- peg doesn't do anything with the compiler
 optimization flags under the covers, does it?
 
 Not really.  It does this:

 PGDEBUG="--enable-cassert --enable-debug"
 ./configure --prefix=$PGINST/$PGPROJECT --enable-depend 
 --enable-thread-safety $PGDEBUG

--enable-cassert might have a noticeable performance impact.
We usually try to not have that on when doing performance testing.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-12-08 Thread Greg Smith

Tom Lane wrote:

--enable-cassert might have a noticeable performance impact.
We usually try to not have that on when doing performance testing.
  
All covered in the tool's documentation, and it looks like Kevin did the 
right thing during his tests by checking the pg_config output to confirm 
the right flags were used.  I think we can rule out simple pilot error 
here and work under the assumption Kevin found a mild performance 
regression under some circumstances with the patch.  Hopefully Andres 
will be able to replicate the problem, and it sounds like Kevin might be 
able to provide more information about it tonight too.


--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-12-08 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 
 --enable-cassert might have a noticeable performance impact.
 We usually try to not have that on when doing performance testing.
 
Right.  I turned it on for some initial tests to confirm that we had
no assertion failures; then turned it off for the performance
testing.  I did leave --enable-debug on during performance testing. 
My understanding is that debug info doesn't affect performance, but
I guess it never hurts to try an empirical test to confirm.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-12-08 Thread Kevin Grittner
I wrote:
 
 Perhaps it is some quirk of using 32 bit pointers on the 64 bit
 AMD CPU?  (I'm looking forward to testing this today on a 64 bit
 build on an Intel CPU.)
 
The exact same test on a 64 bit OS (SuSE Enterprise Server) on Intel
gave very different results. With 10 runs each of 200 iterations of
parsing the 10,000 URLs, the patch Andres submitted ran 0.4% faster
than HEAD, and my attempt to improve on it ran 0.6% slower than
HEAD.  I'll try to run the numbers to get the percentage chance that
a random distribution would have generated a spread as large as
either of those; but I think it's safe to say that the submitted
patch doesn't hurt there and that my attempt to improve on it was
misdirected.  :-/
 
I would like to independently confirm the dramatic improvement
reported by Andres.  Could I get a short snippet from the log which
was used for that, along with an indication of the size of the text
parsed in that test?  (Since the old code looks like it might have
O(N^2) performance in some situations, while the patch changes that
to O(N), I might not be testing with a big enough N.)
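 
For reference, here is a quick back-of-the-envelope sketch (standalone
C with assumed numbers, not the tsearch code itself) of why the old
copying behavior is O(N^2) while the patched behavior is O(N):
 
#include <stdio.h>
 
int
main(void)
{
	long long	doc_len = 1000000;	/* ~1 MB document -- an assumption */
	long long	url_every = 70;		/* one URL per ~70 bytes -- an assumption */
	long long	copied = 0;
	long long	pos;
 
	/*
	 * Old behavior: each URL/email makes the parser copy (and, for a
	 * multi-byte encoding, re-convert) everything that follows it.
	 */
	for (pos = 0; pos < doc_len; pos += url_every)
		copied += doc_len - pos;
 
	/*
	 * Patched behavior: the converted string is shared, so the input is
	 * copied/converted only once.
	 */
	printf("old: ~%lld bytes copied, new: ~%lld\n", copied, doc_len);
	return 0;
}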
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-12-07 Thread Andres Freund
On Monday 07 December 2009 02:12:43 Greg Smith wrote:
 After getting off to a good start, it looks like this patch is now stuck
 waiting for a second review pass from Kevin right now, with no open
 items for Andres to correct.  Since the only issues on the table seem to
 be that of code aesthetics and long-term planning for this style of
 implementation rather than specific functional bits, I'm leaning toward
 saying this one is ready to have a committer look at it.  Any comments
 from Kevin or Andres about where this is at?
I think it should be ready - the only known thing it needs is a 
s/usefull/useful/.

I will take another look but I doubt I will see anything new.

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-12-07 Thread Kevin Grittner
Greg Smith g...@2ndquadrant.com wrote:
 
 After getting off to a good start, it looks like this patch is now
 stuck waiting for a second review pass from Kevin right now, with
 no open items for Andres to correct.  Since the only issues on the
 table seem to be that of code aesthetics and long-term planning
 for this style of implementation rather than specific functional
 bits, I'm leaning toward saying this one is ready to have a
 committer look at it.  Any comments from Kevin or Andres about
 where this is at?
 
Yeah, really the only thing I found to complain about was one
misspelled word in a comment.  I am currently the hold-up, due to
fighting off a bout of some virus and having other real world
issues impinge.  The only thing left to do, besides correcting the
spelling, is to confirm the author's performance improvements and
confirm that there is no degradation in a non-targeted situation.
 
Frankly, I'd be amazed if there was a performance regression,
because all it really does is pass a pointer to a new spot in an
existing input buffer rather than allocating new space and copying
the input from the desired spot to the end of the buffer.  I can't
think of any situations where calculating the new address should be
slower than calculating the new address and copying from there to
the end of the buffer.
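 
In code terms, the difference is roughly this (a minimal sketch with
invented names, not the actual wparser_def.c code):
 
#include <stdlib.h>
#include <string.h>
 
/* unpatched idea: give the nested parse its own copy of the tail */
static char *
nested_input_copy(const char *str, int len, int pos)
{
	char	   *sub = malloc(len - pos + 1);
 
	if (sub == NULL)
		return NULL;
	memcpy(sub, str + pos, len - pos);
	sub[len - pos] = '\0';
	return sub;					/* caller must free() it again */
}
 
/* patched idea: just hand back a pointer into the existing buffer */
static const char *
nested_input_share(const char *str, int pos)
{
	return str + pos;
}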
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-12-06 Thread Greg Smith
After getting off to a good start, it looks like this patch is now stuck 
waiting for a second review pass from Kevin right now, with no open 
items for Andres to correct.  Since the only issues on the table seem to 
be that of code aesthetics and long-term planning for this style of 
implementation rather than specific functional bits, I'm leaning toward 
saying this one is ready to have a committer look at it.  Any comments 
from Kevin or Andres about where this is at?


--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-11-25 Thread Andres Freund
On Saturday 14 November 2009 15:33:00 Kevin Grittner wrote:
 Andres Freund  wrote:
 
 On Saturday 14 November 2009 01:03:33 Kevin Grittner wrote:
  It is in context format, applies cleanly, and passes make check.
 
  Unfortunately the latter is not saying much - I had a bug there and
  it was not found by the regression tests. Perhaps I should take a
  stab and add at least some more...
 
 Sounds like a good idea.  The one thing to avoid is anything with a
 long enough run time to annoy those that run it many times a day.
Hm. There actually are tests exercising the part where I had a bug... 
Strange.
It was a bug involving uninitialized data, so probably the regression tests 
were just lucky.

  It is in context format, applies cleanly, and passes make check.
  Next I read through the code, and have the same question that
  Andres posed 12 days ago. His patch massively reduces the cost of
  the parser recursively calling itself for some cases, and it seems
  like the least invasive way to modify the parser to solve this
  performance problem; but it does beg the question of why a state
  machine like this should recursively call itself when it hits
  certain states.
  I was wondering about that as well. I am not completely sure but to
  me it looks like it's just done to reduce the number of rules and
  states.
 I'm assuming that's the reason, but didn't dig deep enough to be sure.
 I suspect to be really sure, I'd have to set it up without the
 recursion and see what breaks.  I can't imagine it would be anything
 which couldn't be fixed by adding enough states; but perhaps they ran
 into something where these types would require so many new states that
 the recursion seemed like the lesser of evils.
This is similar to my understanding...

  I have to say that that code is not exactly clear and well
  documented...
 Yeah.  I was happy with the level of documentation that you added with
 your new code, but what was there before is mighty thin.  If you
 gleaned enough information while working on it to feel comfortable
 adding documentation anywhere else, that would be a good thing.
It definitely would be a good thing. But that would definitely be a separate 
patch. But I fear my current level of knowledge is not sufficient and also I am not 
sure if I can make myself interested enough in that part.

 So far the only vote is to proceed with the mitigation, which was my
 preference, and apparently yours -- so I guess we're at 3 to 0 in
 favor of that.  I'll mark the patch as Waiting on Author so you can
 add any comments and regression tests you feel are appropriate.
 
 By the way, I found one typo in the comments -- it should be useful,
 not usefull.
Ok, will update.

 I liked what I saw so far, but need to spend more time desk-checking
 for correctness, testing to confirm that it doesn't change results,
 and confirming the performance improvement.
Thanks again for your reviewing!


Andres  

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-11-25 Thread Kevin Grittner
Andres Freund and...@anarazel.de wrote:
 On Saturday 14 November 2009 15:33:00 Kevin Grittner wrote:
 Andres Freund  wrote:
 
  I had a bug there and it was not found by the regression tests.
  Perhaps I should take a stab and add at least some more...
 
 Sounds like a good idea.
 
 Hm. There actually are tests exercising the part where I had a
 bug...   Strange.  It was a bug involving uninitialized data so
 probably the regression tests were just lucky.
 
OK.  I won't be looking for extra tests.
 
  I have to say that that code is not exactly clear and well
  documented...
 Yeah.  I was happy with the level of documentation that you added
 with your new code, but what was there before is mighty thin.  If
 you gleaned enough information while working on it to feel
 comfortable adding documentation anywhere else, that would be a
 good thing.
 It definitely would be a good thing. But that would definitely be
 a separate patch. But I fear my current level of knowledge is
 not sufficient and also I am not sure if I can make myself interested
 enough in that part.
 
Fair enough.  I won't be looking for new comments for the old code.
 
 By the way, I found one typo in the comments -- it should be
 useful, not usefull.
 Ok, will update.
 
Given how trivial that is, I'm putting this back in Needs Review
status, and resuming my review work.  Barring surprises, I should wrap
this up whenever I can free up two or three hours.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-11-14 Thread Kevin Grittner
Andres Freund  wrote:
On Saturday 14 November 2009 01:03:33 Kevin Grittner wrote:
 It is in context format, applies cleanly, and passes make check.
 Unfortunately the latter is not saying much - I had a bug there and
 it was not found by the regression tests. Perhaps I should take a
 stab and add at least some more...
 
Sounds like a good idea.  The one thing to avoid is anything with a
long enough run time to annoy those that run it many times a day.
 
 It is in context format, applies cleanly, and passes make check.
 Next I read through the code, and have the same question that
 Andres posed 12 days ago. His patch massively reduces the cost of
 the parser recursively calling itself for some cases, and it seems
 like the least invasive way to modify the parser to solve this
 performance problem; but it does beg the question of why a state
 machine like this should recursively call itself when it hits
 certain states.
 I was wondering about that as well. I am not completely sure but to
  me it looks like it's just done to reduce the number of rules and
 states.
 
I'm assuming that's the reason, but didn't dig deep enough to be sure.
I suspect to be really sure, I'd have to set it up without the
recursion and see what breaks.  I can't imagine it would be anything
which couldn't be fixed by adding enough states; but perhaps they ran
into something where these types would require so many new states that
the recursion seemed like the lesser of evils.
 
 I have to say that that code is not exactly clear and well
 documented...
 
Yeah.  I was happy with the level of documentation that you added with
your new code, but what was there before is mighty thin.  If you
gleaned enough information while working on it to feel comfortable
adding documentation anywhere else, that would be a good thing.
 
So far the only vote is to proceed with the mitigation, which was my
preference, and apparently yours -- so I guess we're at 3 to 0 in
favor of that.  I'll mark the patch as Waiting on Author so you can
add any comments and regression tests you feel are appropriate.
 
By the way, I found one typo in the comments -- it should be useful,
not usefull.
 
I liked what I saw so far, but need to spend more time desk-checking
for correctness, testing to confirm that it doesn't change results,
and confirming the performance improvement.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-11-13 Thread Kevin Grittner
Andres Freund and...@anarazel.de wrote:
 On Sunday 01 November 2009 16:19:43 Andres Freund wrote:
 While playing around/evaluating tsearch I noticed that to_tsvector
 is obscenely slow for some files. After some profiling I found that
 this is due to using a separate TSParser in p_ishost/p_isURLPath in
 wparser_def.c. If a multibyte encoding is in use, TParserInit copies
 the whole remaining input and converts it to wchar_t or pg_wchar -
 for every email or protocol-prefixed URL in the document. Which
 obviously is bad.
 
 I solved the issue by having a separate TParserCopyInit/
 TParserCopyClose which reuses the already converted strings of
 the original TParser - only at different offsets.
 
 Another approach would be to get rid of the separate parser
 invocations - requiring a bunch of additional states. This seemed
 more complex to me, so I wanted to get some feedback first.
 
 Without patch:
 
 Time: 5835.676 ms
 
 With patch:
 
 Time: 395.341 ms
 
 As nobody commented, here is a corrected (stupid thinko) and cleaned
 up version.
 
I've taken this one for review, and have taken a preliminary look.
 
It is in context format, applies cleanly, and passes make check.
Next I read through the code, and have the same question that Andres
posed 12 days ago.  His patch massively reduces the cost of the parser
recursively calling itself for some cases, and it seems like the least
invasive way to modify the parser to solve this performance problem;
but it does beg the question of why a state machine like this should
recursively call itself when it hits certain states.
 
The patch is mitigating the penalty for what seems to me to be an
existing ugly kludge.  Is it acceptable to address the problem in this
way, or should the much more invasive work be done to eliminate the
kludge?  (Note: I personally would much rather see the performance
penalty addressed this way, and a TODO added for the more invasive
work, than to leave this alone for the next release if there's nobody
willing to tackle the problem at a more fundamental level.)
 
If nobody has a contrary opinion, I'll proceed with the review of this
patch and add something to the TODO page for the more invasive work.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-11-13 Thread Alvaro Herrera
Kevin Grittner wrote:

 (Note: I personally would much rather see the performance
 penalty addressed this way, and a TODO added for the more invasive
 work, than to leave this alone for the next release if there's nobody
 willing to tackle the problem at a more fundamental level.)

+1

-- 
Alvaro Herrera    http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

2009-11-13 Thread Andres Freund
On Saturday 14 November 2009 01:03:33 Kevin Grittner wrote:
 It is in context format, applies cleanly, and passes make check.
Unfortunately the latter is not saying much - I had a bug there and it was not 
found by the regression tests. Perhaps I should take a stab and add at least 
some more...

 It is in context format, applies cleanly, and passes make check.
 Next I read through the code, and have the same question that Andres
 posed 12 days ago.  His patch massively reduces the cost of the parser
 recursively calling itself for some cases, and it seems like the least
 invasive way to modify the parser to solve this performance problem;
 but it does beg the question of why a state machine like this should
 recursively call itself when it hits certain states.
I was wondering about that as well. I am not completely sure, but to me it 
looks like it's just done to reduce the number of rules and states. 

I have to say that that code is not exactly clear and well documented...

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers