Re: [PERFORM] FTS performance with the Polish config

2009-11-15 Thread Oleg Bartunov

On Sun, 15 Nov 2009, Pavel Stehule wrote:


2009/11/15 Oleg Bartunov o...@sai.msu.su:

Yes, as stated original author use polish ispell dictionary.
Ispell dictionary is slow to load first time. In real life it should be no
problem.



it is a problem. People who needs fast access uses english without
czech. It drop some features, but it is significaly faster.


just don't use ispell dictionary, czech snowball stemmer is as fast as 
english.


Ispell dictionary (doesn't matter english, or other language) is slow for the 
first load and then it caches, so there is no problem if use persistent 
database connection, which is de facto standard for any serious projects.




Pavel


Oleg
On Sat, 14 Nov 2009, Pavel Stehule wrote:


2009/11/14 Tom Lane t...@sss.pgh.pa.us:

Kenneth Marshall k...@rice.edu writes:

On Sat, Nov 14, 2009 at 12:25:05PM +0100, Wojciech Knapik wrote:

I just finished implementing a search engine for my site and found
ts_headline extremely slow when used with a Polish tsearch
configuratio=

n,

while fast with English.



The documentation for ts_headline() states:
ts_headline uses the original document, not a tsvector summary, so it
can be slow and should be used with care.


That's true but the argument in the docs would apply just as well to
english or any other config. =C2=A0So while Wojciech would be well
advised
to try to avoid making a lot of calls to ts_headline, it's still curious
that it's so much slower in polish than english. =C2=A0Could we see a
self-contained test case?


is it dictionary based or stem based?

Dictionary based FTS is very slow (first load). Minimally czech FTS is
slow.

regards
Pavel Stehule



=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0
=C2=

=A0 =C2=A0regards, tom lane


--
Sent via pgsql-performance mailing list
(pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance



--=20
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance



       Regards,
               Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83





Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] FTS performance with the Polish config

2009-11-15 Thread Pavel Stehule
2009/11/15 Oleg Bartunov o...@sai.msu.su:
 On Sun, 15 Nov 2009, Pavel Stehule wrote:

 2009/11/15 Oleg Bartunov o...@sai.msu.su:

 Yes, as stated original author use polish ispell dictionary.
 Ispell dictionary is slow to load first time. In real life it should be
 no
 problem.


 it is a problem. People who needs fast access uses english without
 czech. It drop some features, but it is significaly faster.

 just don't use ispell dictionary, czech snowball stemmer is as fast as
 english.

czech stemmer doesn't exist :(


 Ispell dictionary (doesn't matter english, or other language) is slow for
 the first load and then it caches, so there is no problem if use persistent
 database connection, which is de facto standard for any serious projects.


I agree so connection pooling should be a solution. But it is good?
Cannot we share dictionary better?


 Pavel

 Oleg
 On Sat, 14 Nov 2009, Pavel Stehule wrote:

 2009/11/14 Tom Lane t...@sss.pgh.pa.us:

 Kenneth Marshall k...@rice.edu writes:

 On Sat, Nov 14, 2009 at 12:25:05PM +0100, Wojciech Knapik wrote:

 I just finished implementing a search engine for my site and found
 ts_headline extremely slow when used with a Polish tsearch
 configuratio=

 n,

 while fast with English.

 The documentation for ts_headline() states:
 ts_headline uses the original document, not a tsvector summary, so it
 can be slow and should be used with care.

 That's true but the argument in the docs would apply just as well to
 english or any other config. =C2=A0So while Wojciech would be well
 advised
 to try to avoid making a lot of calls to ts_headline, it's still
 curious
 that it's so much slower in polish than english. =C2=A0Could we see a
 self-contained test case?

 is it dictionary based or stem based?

 Dictionary based FTS is very slow (first load). Minimally czech FTS is
 slow.

 regards
 Pavel Stehule


 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0
 =C2=

 =A0 =C2=A0regards, tom lane

 --
 Sent via pgsql-performance mailing list
 (pgsql-performance@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-performance


 --=20
 Sent via pgsql-performance mailing list
 (pgsql-performance@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-performance


        Regards,
                Oleg
 _
 Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
 Sternberg Astronomical Institute, Moscow University, Russia
 Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
 phone: +007(495)939-16-83, +007(495)939-23-83



        Regards,
                Oleg
 _
 Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
 Sternberg Astronomical Institute, Moscow University, Russia
 Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
 phone: +007(495)939-16-83, +007(495)939-23-83

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] FTS performance with the Polish config

2009-11-15 Thread Oleg Bartunov

On Sun, 15 Nov 2009, Pavel Stehule wrote:



czech stemmer doesn't exist :(



I'd try morfessor http://www.cis.hut.fi/projects/morpho/, which is 
unsupervised morphological dictionary. I think it'd be not very hard to add

morfessor dictionary template to tsearch2, so people could create their
own stemmers.



Ispell dictionary (doesn't matter english, or other language) is slow for
the first load and then it caches, so there is no problem if use persistent
database connection, which is de facto standard for any serious projects.



I agree so connection pooling should be a solution. But it is good?
Cannot we share dictionary better?


We thought about this issue and got some idea. Teodor can be more clear here,
since I don't remember all details.






Pavel


Oleg
On Sat, 14 Nov 2009, Pavel Stehule wrote:


2009/11/14 Tom Lane t...@sss.pgh.pa.us:


Kenneth Marshall k...@rice.edu writes:


On Sat, Nov 14, 2009 at 12:25:05PM +0100, Wojciech Knapik wrote:


I just finished implementing a search engine for my site and found
ts_headline extremely slow when used with a Polish tsearch
configuratio=


n,


while fast with English.



The documentation for ts_headline() states:
ts_headline uses the original document, not a tsvector summary, so it
can be slow and should be used with care.


That's true but the argument in the docs would apply just as well to
english or any other config. =C2=A0So while Wojciech would be well
advised
to try to avoid making a lot of calls to ts_headline, it's still
curious
that it's so much slower in polish than english. =C2=A0Could we see a
self-contained test case?


is it dictionary based or stem based?

Dictionary based FTS is very slow (first load). Minimally czech FTS is
slow.

regards
Pavel Stehule



=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0
=C2=


=A0 =C2=A0regards, tom lane


--
Sent via pgsql-performance mailing list
(pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance



--=20
Sent via pgsql-performance mailing list
(pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance



       Regards,
               Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83





       Regards,
               Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] FTS performance with the Polish config

2009-11-14 Thread Wojciech Knapik


Hello

 I just finished implementing a search engine for my site and found 
ts_headline extremely slow when used with a Polish tsearch 
configuration, while fast with English. All of it boils down to a simple 
testcase, but first some background.


 I tested on 8.3.1 on G5/OSX 10.5.8 and Xeon/Gentoo AMD64-2008.0 
(2.6.21), then switched both installations to 8.3.8 (both packages 
compiled, but provided by the distro - port/emerge). The Polish 
dictionaries and config were created according to this article (it's in 
Polish, but the code is self-explanatory):


http://www.depesz.com/index.php/2008/04/22/polish-tsearch-in-83-polski-tsearch-w-postgresie-83/

 Now for the testcase:

text = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do 
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad 
minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip 
ex ea commodo consequat. Duis aute irure dolor in reprehenderit in 
voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur 
sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt 
mollit anim id est laborum.'


# explain analyze select ts_headline('polish', text, 
plainto_tsquery('polish', 'foobar'));
 QUERY PLAN 



 Result  (cost=0.00..0.01 rows=1 width=0) (actual time=6.407..6.470 
rows=1 loops=1)

 Total runtime: 6.524 ms
(2 rows)

# explain analyze select ts_headline('english', text, 
plainto_tsquery('english', 'foobar'));
 QUERY PLAN 



 Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.861..0.895 
rows=1 loops=1)

 Total runtime: 0.935 ms
(2 rows)

# explain analyze select ts_headline('simple', text, 
plainto_tsquery('simple', 'foobar'));
 QUERY PLAN 



 Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.627..0.660 
rows=1 loops=1)

 Total runtime: 0.697 ms
(2 rows)

#

 As you can see, the results differ by an order of magnitude between 
Polish and English. While in this simple testcase it's a non-issue, in 
the real world this translates into enormous overhead.


 One of the queries I ran testing my site's search function took 
1870ms. When I took that query and changed all ts_headline(foo) calls to 
just foo, the time dropped below 100ms. That's the difference between 
something completely unacceptable and something quite useful.


 I can post various details about the hardware, software and specific 
queries, but the testcases speak for themselves. I'm sure you can easily 
reproduce my results.


 Hints would be very much appreciated, since I've already spent way 
more time on this, than I could afford.



cheers,
Wojciech Knapik


PS. A few other details can be found here 
http://pastie.textmate.org/private/hqnqfnsfsknjyjlffzmog along with 
snippets of my conversations in #postgresql that lead to this testcase. 
Big thanks to RhodiumToad for helping me with fts for the last couple 
days ;]





--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] FTS performance with the Polish config

2009-11-14 Thread Kenneth Marshall
On Sat, Nov 14, 2009 at 12:25:05PM +0100, Wojciech Knapik wrote:

 Hello

  I just finished implementing a search engine for my site and found 
 ts_headline extremely slow when used with a Polish tsearch configuration, 
 while fast with English. All of it boils down to a simple testcase, but 
 first some background.

  I tested on 8.3.1 on G5/OSX 10.5.8 and Xeon/Gentoo AMD64-2008.0 (2.6.21), 
 then switched both installations to 8.3.8 (both packages compiled, but 
 provided by the distro - port/emerge). The Polish dictionaries and config 
 were created according to this article (it's in Polish, but the code is 
 self-explanatory):

 http://www.depesz.com/index.php/2008/04/22/polish-tsearch-in-83-polski-tsearch-w-postgresie-83/

  Now for the testcase:

 text = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do 
 eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad 
 minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex 
 ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate 
 velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat 
 cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id 
 est laborum.'

 # explain analyze select ts_headline('polish', text, 
 plainto_tsquery('polish', 'foobar'));
  QUERY PLAN 
 
  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=6.407..6.470 rows=1 
 loops=1)
  Total runtime: 6.524 ms
 (2 rows)

 # explain analyze select ts_headline('english', text, 
 plainto_tsquery('english', 'foobar'));
  QUERY PLAN 
 
  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.861..0.895 rows=1 
 loops=1)
  Total runtime: 0.935 ms
 (2 rows)

 # explain analyze select ts_headline('simple', text, 
 plainto_tsquery('simple', 'foobar'));
  QUERY PLAN 
 
  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.627..0.660 rows=1 
 loops=1)
  Total runtime: 0.697 ms
 (2 rows)

 #

  As you can see, the results differ by an order of magnitude between Polish 
 and English. While in this simple testcase it's a non-issue, in the real 
 world this translates into enormous overhead.

  One of the queries I ran testing my site's search function took 1870ms. 
 When I took that query and changed all ts_headline(foo) calls to just foo, 
 the time dropped below 100ms. That's the difference between something 
 completely unacceptable and something quite useful.

  I can post various details about the hardware, software and specific 
 queries, but the testcases speak for themselves. I'm sure you can easily 
 reproduce my results.

  Hints would be very much appreciated, since I've already spent way more 
 time on this, than I could afford.


 cheers,
 Wojciech Knapik


 PS. A few other details can be found here 
 http://pastie.textmate.org/private/hqnqfnsfsknjyjlffzmog along with 
 snippets of my conversations in #postgresql that lead to this testcase. Big 
 thanks to RhodiumToad for helping me with fts for the last couple days ;]


Hi,

The documentation for ts_headline() states:

ts_headline uses the original document, not a tsvector summary, so it can be 
slow
and should be used with care. A typical mistake is to call ts_headline for every
matching document when only ten documents are to be shown. SQL subqueries can 
help;
here is an example:

SELECT id, ts_headline(body, q), rank
FROM (SELECT id, body, q, ts_rank_cd(ti, q) AS rank
  FROM apod, to_tsquery('stars') q
  WHERE ti @@ q
  ORDER BY rank DESC
  LIMIT 10) AS foo;

It looks like you have proven that behavior. I have not looked at the 
ts_headline
code, but it may also be slowed by the locale, so showing that it is faster for
English is not really saying much. Maybe there is a better algorithm that could
be used, but that would require code changes. It may be that you can change some
of the parameters to speed it up. Good luck.

Regards,
Ken

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] FTS performance with the Polish config

2009-11-14 Thread Tom Lane
Kenneth Marshall k...@rice.edu writes:
 On Sat, Nov 14, 2009 at 12:25:05PM +0100, Wojciech Knapik wrote:
 I just finished implementing a search engine for my site and found 
 ts_headline extremely slow when used with a Polish tsearch configuration, 
 while fast with English.

 The documentation for ts_headline() states:
 ts_headline uses the original document, not a tsvector summary, so it
 can be slow and should be used with care.

That's true but the argument in the docs would apply just as well to
english or any other config.  So while Wojciech would be well advised
to try to avoid making a lot of calls to ts_headline, it's still curious
that it's so much slower in polish than english.  Could we see a
self-contained test case?

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] FTS performance with the Polish config

2009-11-14 Thread Pavel Stehule
2009/11/14 Tom Lane t...@sss.pgh.pa.us:
 Kenneth Marshall k...@rice.edu writes:
 On Sat, Nov 14, 2009 at 12:25:05PM +0100, Wojciech Knapik wrote:
 I just finished implementing a search engine for my site and found
 ts_headline extremely slow when used with a Polish tsearch configuration,
 while fast with English.

 The documentation for ts_headline() states:
 ts_headline uses the original document, not a tsvector summary, so it
 can be slow and should be used with care.

 That's true but the argument in the docs would apply just as well to
 english or any other config.  So while Wojciech would be well advised
 to try to avoid making a lot of calls to ts_headline, it's still curious
 that it's so much slower in polish than english.  Could we see a
 self-contained test case?

is it dictionary based or stem based?

Dictionary based FTS is very slow (first load). Minimally czech FTS is slow.

regards
Pavel Stehule


                        regards, tom lane

 --
 Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-performance


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] FTS performance with the Polish config

2009-11-14 Thread Oleg Bartunov

Yes, as stated original author use polish ispell dictionary.
Ispell dictionary is slow to load first time. In real life it should 
be no problem.


Oleg
On Sat, 14 Nov 2009, Pavel Stehule wrote:


2009/11/14 Tom Lane t...@sss.pgh.pa.us:
 Kenneth Marshall k...@rice.edu writes:
 On Sat, Nov 14, 2009 at 12:25:05PM +0100, Wojciech Knapik wrote:
 I just finished implementing a search engine for my site and found
 ts_headline extremely slow when used with a Polish tsearch configuratio=
n,
 while fast with English.

 The documentation for ts_headline() states:
 ts_headline uses the original document, not a tsvector summary, so it
 can be slow and should be used with care.

 That's true but the argument in the docs would apply just as well to
 english or any other config. =C2=A0So while Wojciech would be well advised
 to try to avoid making a lot of calls to ts_headline, it's still curious
 that it's so much slower in polish than english. =C2=A0Could we see a
 self-contained test case?

is it dictionary based or stem based?

Dictionary based FTS is very slow (first load). Minimally czech FTS is slow.

regards
Pavel Stehule


 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0regards, tom lane

 --
 Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-performance


--=20
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance