Re: [PERFORM] TSearch2 and optimisation ...

2004-09-12 Thread George Essig
--- Hervé Piedvache [EMAIL PROTECTED] wrote:

 George,
 
 I have read many pages about this subject ... but I have not found anything 
 so far that really helps me ...
 What can I do to optimize my PostgreSQL configuration for this particular use of 
 Tsearch2 ...
 I'm a little disappointed to see that the PostgreSQL Russian search engine using 
 Tsearch2 is really quick ... why can't I have the same result with a 
 bi-Pentium III 933 and 1 GB of RAM with the text indexing of 1 500 000 
 records ?
 
 Regards,
 -- 
 Hervé Piedvache
 
 Elma Ingénierie Informatique
 6 rue du Faubourg Saint-Honoré, F-75008 - Paris - France
 Pho. 33-144949901
 Fax. 33-144949902
 

Tsearch does not scale indefinitely.  It was designed for fast online updates and to be
integrated into PostgreSQL.  My understanding is that it uses a bloom filter together with
bit string signatures.  Typically, full text search engines use inverted indexes, which
scale better but are slower to update.

My understanding is that tsearch has a practical limit of 100,000 distinct word stems or
lexemes.  Note that word stems are not words.  Word stems are what are actually stored in
a tsvector after parsing and dictionary processing.

The key to making tsearch fast is to keep the number of word stems low.  You decrease the
number of word stems by using stop words, various dictionaries, synonyms, and by
preprocessing text before it gets to tsearch.  You can find which word stems are stored in
a tsvector column by using the stat function.  For examples of how to use the stat
function, see:

http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/stat.html

Note that the stat function will take a long time to run on large tables.
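
As a minimal sketch against the article table and idxfti column discussed in this thread
(assuming the usual tsearch2 stat() signature, which takes a query string returning a
tsvector and reports word, ndoc, and nentry), you can list the most frequent stems like so:

  -- 20 most frequent word stems; frequent stems nobody searches for are stop word candidates
  SELECT word, ndoc, nentry
    FROM stat('SELECT idxfti FROM article')
   ORDER BY ndoc DESC, nentry DESC, word
   LIMIT 20;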

Performance tuning must be done on a case by case basis.  It can take some time to try
different things and see the change in performance.  Each time you try something new, use
the stat function to see how the number of word stems has changed.

The largest project I used tsearch2 on contained 900,000 records.  Without performance
tuning, there were 275,000 distinct word stems.  After performance tuning, I got it down
to 14,000 distinct word stems.

By using the stat function, I noticed some obvious stop words that were very frequent and
that nobody would ever search for.  For how to use stop words, see:

http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/stop_words.html
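
As a rough sketch only (the en_stem dictionary name, the pg_ts_dict catalog layout, the
'default' configuration name, and the file path below are assumptions about a stock
tsearch2 installation; check the page above for your setup), adding your own stop words
looks roughly like this:

  -- point the English stemmer at a stop word file extended with your own words
  UPDATE pg_ts_dict
     SET dict_initoption = '/usr/local/pgsql/share/contrib/english.stop.custom'  -- illustrative path
   WHERE dict_name = 'en_stem';

  -- rebuild the tsvector column so the new stop words take effect
  UPDATE article
     SET idxfti = to_tsvector('default', coalesce(title,'') || ' ' || coalesce(description,''));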

Also I noticed some strange patterns by looking through all of the word stems.  

In one case, strings of 3 to 7 words were joined together with hyphens to indicate
category nesting.  Tsearch would store these long hyphenated words intact and also store
the stem of each individual word.  I made a judgment call that no one would ever search
for the long hyphenated words, so I preprocessed the text to remove the hyphens.
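
A minimal sketch of that kind of preprocessing done directly in SQL (one of several
equally valid places to do it; the title and description columns and the 'default'
configuration name follow the query earlier in this thread and are assumptions about the
schema):

  -- turn 'parent-child-grandchild' into three ordinary words before tsearch2 sees it
  UPDATE article
     SET idxfti = to_tsvector('default',
                    replace(coalesce(title,'') || ' ' || coalesce(description,''), '-', ' '));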

I also noticed that many of the word stems were alphanumeric IDs that were designed to be
unique.  There were many of these IDs in the tsvector column, although each ID would occur
only once or twice.  I again preprocessed the text to remove these IDs, but created a
btree index on a varchar column representing the IDs.  My search form allows users to
either search full text using tsearch2 or search IDs using 'LIKE' queries which use a
btree index.  For 'LIKE' queries, it was another matter to get postgres to use the btree
index and not use a sequential scan.  For this, see:

http://www.postgresql.org/docs/7.4/static/indexes-opclass.html
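
For example (the id_code column name is hypothetical; the operator class is the one that
page describes for non-C locales), the index and query look roughly like this:

  -- btree index usable by left-anchored LIKE searches in a non-C locale
  CREATE INDEX article_id_code_idx ON article (id_code varchar_pattern_ops);

  -- a prefix search like this can then use the index instead of a sequential scan
  SELECT id, title FROM article WHERE id_code LIKE 'AB123%';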

Last, I noticed that most users wanted to restrict the full text search to a subset
determined by another column in the table.  As a result, I created a multicolumn GiST
index on an integer column and a tsvector column.  For how to set up a multicolumn GiST
index, see:

http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/multi_column_index.html
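
A sketch of such an index on the tables from this thread (my assumption, based on the page
above, is that the contrib/btree_gist module must be installed so the integer column can
take part in a GiST index):

  -- one index covering both the restricting column and the full text column
  CREATE INDEX article_site_fti_idx ON article USING gist (id_site, idxfti);

  -- a query restricted by site and by full text can then use the single index
  SELECT id, title
    FROM article
   WHERE id_site = 42
     AND idxfti @@ to_tsquery('postgresql');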

There are no easy answers.  Like I said, performance tuning must be done on a case by
case basis.

Hope this helps,

George Essig



Re: [PERFORM] TSearch2 and optimisation ...

2004-09-09 Thread Hervé Piedvache
George,

On Thursday, August 26, 2004 at 19:58, George Essig wrote:
 Bill Footcow wrote:

 ...

  I have done a simple request, looking for articles with Postgres in the title or
  description, ordered by rank and date, like this:
  SELECT a.title, a.id, a.url, to_char(a.r_date, 'DD/MM/YYYY HH24:MI:SS')
  as dt, s.site_name, s.id_site, case when exists (select id_user from
  user_choice u where u.id_site=s.id_site and u.id_user = 1) then 1 else 0
  end as bookmarked FROM article a, site s
WHERE s.id_site = a.id_site
 AND idxfti @@ to_tsquery('postgresql')
 ORDER BY rank(idxfti, to_tsquery('postgresql')) DESC, a.r_date DESC;
 
  The request takes about 4 seconds ... I have about 1 400 000 records in
  article and 36 000 records in the site table ... it's a bi-Pentium III 933
  MHz server with 1 GB of memory ... I'm using PostgreSQL 7.4.5.
  For me this result is very, very slow; I really need a quicker result, in
  less than 1 second ...
  The next time I call the same request I get the result in 439 ms ...
  but

 ...

 The first query is slow because the relevant index pages are not cached in
 memory.  Everyone experiences this.  GiST indexes on tsvector columns can
 get really big.  You have done nothing wrong.  When you have a lot of
 records, tsearch2 will not run fast without extensive performance tuning.

 Read the following:

 Optimization
 http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/optimization.html
 
 ...

I have read many pages about this subject ... but I have not found anything 
so far that really helps me ...
What can I do to optimize my PostgreSQL configuration for this particular use of 
Tsearch2 ...
I'm a little disappointed to see that the PostgreSQL Russian search engine using 
Tsearch2 is really quick ... why can't I have the same result with a 
bi-Pentium III 933 and 1 GB of RAM with the text indexing of 1 500 000 
records ?

Regards,
-- 
Hervé Piedvache

Elma Ingénierie Informatique
6 rue du Faubourg Saint-Honoré
F-75008 - Paris - France
Pho. 33-144949901
Fax. 33-144949902



Re: [PERFORM] TSearch2 and optimisation ...

2004-08-26 Thread Hervé Piedvache
Josh,

On Thursday, August 26, 2004 at 01:50, Josh Berkus wrote:
  The request takes about 4 seconds ... I have about 1 400 000 records in
  article and 36 000 records in the site table ... it's a bi-Pentium III 933
  MHz server with 1 GB of memory ... I'm using PostgreSQL 7.4.5.
  For me this result is very, very slow; I really need a quicker result, in
  less than 1 second ...
  The next time I call the same request I get the result in 439 ms ...
  but if I replace Postgresql in my search with Linux, for example, I will
  get the next result in 5 seconds ... :o(

 Hmmm.  It sounds like your system is unable to keep all of the data cached
 in memory.  What else do you have going on on that machine?

There is Apache + PHP running at the same time ... 

  Explain gives me this result :

 Please do EXPLAIN ANALYZE so that we can see where time is actually
 spent.

 QUERY PLAN
------------------------------------------------------------------------------------------
 Sort  (cost=10740.35..10743.73 rows=1351 width=190) (actual time=7054.603..7054.707 rows=139 loops=1)
   Sort Key: rank(a.idxfti, '\'postgresql\''::tsquery), a.r_date
   ->  Merge Join  (cost=4123.09..10670.10 rows=1351 width=190) (actual time=5476.749..7052.766 rows=139 loops=1)
         Merge Cond: ("outer".id_site = "inner".id_site)
         ->  Index Scan using site_id_site_key on site s  (cost=0.00..2846.52 rows=35705 width=28) (actual time=43.985..1548.903 rows=34897 loops=1)
         ->  Sort  (cost=4123.09..4126.47 rows=1351 width=166) (actual time=5416.836..5416.983 rows=139 loops=1)
               Sort Key: a.id_site
               ->  Index Scan using idxfti_idx on article a  (cost=0.00..4052.84 rows=1351 width=166) (actual time=109.766..5415.108 rows=139 loops=1)
                     Index Cond: (idxfti @@ '\'postgresql\''::tsquery)
                     Filter: (idxfti @@ '\'postgresql\''::tsquery)
                     SubPlan
                       ->  Seq Scan on user_choice u  (cost=0.00..2.69 rows=1 width=4) (actual time=0.146..0.146 rows=0 loops=139)
                             Filter: ((id_site = $0) AND (id_user = 1))
 Total runtime: 7056.126 ms

Thanks for your help ...
-- 
Hervé Piedvache

Elma Ingénierie Informatique
6 rue du Faubourg Saint-Honoré
F-75008 - Paris - France
Pho. 33-144949901
Fax. 33-144949902



Re: [PERFORM] TSearch2 and optimisation ...

2004-08-26 Thread George Essig
Bill Footcow wrote:

...
 I have done a simple request, looking for articles with Postgres in the title or 
 description, ordered by rank and date, like this:
 SELECT a.title, a.id, a.url, to_char(a.r_date, 'DD/MM/YYYY HH24:MI:SS') as dt, 
 s.site_name, s.id_site, case when exists (select id_user from user_choice u 
 where u.id_site=s.id_site and u.id_user = 1) then 1 else 0 end as bookmarked
 FROM article a, site s
   WHERE s.id_site = a.id_site
AND idxfti @@ to_tsquery('postgresql')
ORDER BY rank(idxfti, to_tsquery('postgresql')) DESC, a.r_date DESC;
 
 The request takes about 4 seconds ... I have about 1 400 000 records in 
 article and 36 000 records in the site table ... it's a bi-Pentium III 933 MHz 
 server with 1 GB of memory ... I'm using PostgreSQL 7.4.5.
 For me this result is very, very slow; I really need a quicker result, in less 
 than 1 second ...
 The next time I call the same request I get the result in 439 ms ... but 
...

The first query is slow because the relevant index pages are not cached in memory.
Everyone experiences this.  GiST indexes on tsvector columns can get really big.  You have
done nothing wrong.  When you have a lot of records, tsearch2 will not run fast without
extensive performance tuning.

Read the following:

Optimization
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/optimization.html

stat function
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/stat.html

Stop words
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/stop_words.html

Multicolumn GiST index
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/multi_column_index.html

openfts-general mailing list archive
http://sourceforge.net/mailarchive/forum.php?forum=openfts-general

Try some of these things out and let me know how it goes.

George Essig





Re: [PERFORM] TSearch2 and optimisation ...

2004-08-26 Thread Hervé Piedvache
On Thursday, August 26, 2004 at 19:48, Josh Berkus wrote:
 Herve'

  (cost=0.00..4052.84 rows=1351 width=166) (actual time=109.766..5415.108 rows=139 loops=1)
   Index Cond: (idxfti @@ '\'postgresql\''::tsquery)
   Filter: (idxfti @@ '\'postgresql\''::tsquery)
 
 From this, it looks like your FTI index isn't fitting in your sort_mem.

 What's sort_mem at now?  Can you increase it?

shared_buffers = 3000
sort_mem = 10240
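
As a quick way to test Josh's suggestion before editing postgresql.conf (the value below is
only an illustration, and the SELECT list is trimmed to a few of the columns from earlier
in the thread), sort_mem can be raised for a single session:

  SET sort_mem = 51200;   -- kilobytes; per-session in 7.4, illustrative value only

  EXPLAIN ANALYZE
  SELECT a.title, a.id, a.url
    FROM article a, site s
   WHERE s.id_site = a.id_site
     AND idxfti @@ to_tsquery('postgresql')
   ORDER BY rank(idxfti, to_tsquery('postgresql')) DESC, a.r_date DESC;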

 Overall, though, I'm not sure you can get this sub-1s without a faster
 machine.   Although I'm doing FTI on about 25MB of FTI text on a
 single-processor machine, and getting 40ms response times, so maybe we can
 ...

Sorry, I don't quite understand what you mean here ... 
You are telling me to upgrade the hardware, but you manage 25 MB with a single 
processor ?? What do you mean ?
My database is about 450 MB ...

Regards,
-- 
Bill Footcow




Re: [PERFORM] TSearch2 and optimisation ...

2004-08-25 Thread Josh Berkus
Herve'

 The request takes about 4 seconds ... I have about 1 400 000 records in
 article and 36 000 records in the site table ... it's a bi-Pentium III 933 MHz
 server with 1 GB of memory ... I'm using PostgreSQL 7.4.5.
 For me this result is very, very slow; I really need a quicker result, in
 less than 1 second ...
 The next time I call the same request I get the result in 439 ms ...
 but if I replace Postgresql in my search with Linux, for example, I will
 get the next result in 5 seconds ... :o(

Hmmm.  It sounds like your system is unable to keep all of the data cached in 
memory.  What else do you have going on on that machine?

 Explain gives me this result :

Please do EXPLAIN ANALYZE so that we can see where time is actually spent.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco
