Re: [PERFORM] Speeding up JSON + TSQUERY + GIN

2017-02-26 Thread Oleg Bartunov
On Sun, Feb 26, 2017 at 4:28 PM, Sven R. Kunze  wrote:

> Hello everyone,
>
> I am currently evaluating the possibility of using PostgreSQL for storing
> and querying jsonb+tsvector data. Let's consider this setup:
>
> create table docs (id serial primary key, meta jsonb);
> # generate 10M entries, cf. appendix
> create index docs_meta_idx ON docs using gin (meta jsonb_path_ops);
> create index docs_name_idx ON docs using gin (to_tsvector('english',
> meta->>'name'));
> create index docs_address_idx ON docs using gin (to_tsvector('english',
> meta->>'address'));
>
>
Functional indexes tend to be slow; it's better to use a separate column (or
columns) for the tsvector.
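
A minimal sketch of that approach (the trigger function below is an assumption
for illustration, not from the thread):

alter table docs add column name_tsv tsvector;
update docs set name_tsv = to_tsvector('english', meta->>'name');
create index docs_name_tsv_idx on docs using gin (name_tsv);

-- keep the column current on insert/update
create function docs_tsv_update() returns trigger as $$
begin
  new.name_tsv := to_tsvector('english', new.meta->>'name');
  return new;
end $$ language plpgsql;

create trigger docs_tsv_trg before insert or update on docs
  for each row execute procedure docs_tsv_update();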


>
> Testing around with some smaller datasets, functionality-wise it's great.
> However, when increasing to 10M entries, things tend to slow down (using PostgreSQL 9.5):
>
>
> explain analyze select id from docs where meta @> '{"age": 20}';
>  Planning time: 0.121 ms
>  Execution time: 4873.507 ms
>
> explain analyze select id from docs where meta @> '{"age": 20}';
>  Planning time: 0.122 ms
>  Execution time: 206.289 ms
>
>
>
> explain analyze select id from docs where meta @> '{"age": 30}';
>  Planning time: 0.109 ms
>  Execution time: 7496.886 ms
>
> explain analyze select id from docs where meta @> '{"age": 30}';
>  Planning time: 0.114 ms
>  Execution time: 1169.649 ms
>
>
>
> explain analyze select id from docs where to_tsvector('english',
> meta->>'name') @@ to_tsquery('english', 'john');
>  Planning time: 0.179 ms
>  Execution time: 10109.375 ms
>
> explain analyze select id from docs where to_tsvector('english',
> meta->>'name') @@ to_tsquery('english', 'john');
>  Planning time: 0.188 ms
>  Execution time: 238.854 ms
>

What is the full output from EXPLAIN ANALYZE?
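
For example, with buffer statistics included (a sketch; EXPLAIN's BUFFERS
option has been available in this form since 9.0):

explain (analyze, buffers)
select id from docs
where to_tsvector('english', meta->>'name') @@ to_tsquery('english', 'john');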


>
>
> Using "select pg_prewarm('docs');" and on any of the indexes doesn't help
> either.
> After a "systemctl stop postgresql.service && sync && echo 3 >
> /proc/sys/vm/drop_caches && systemctl start postgresql.service" the age=20,
> 30 or name=john queries are slow again.
>
>
> Is there a way to speed up or to warm up things permanently?
>
>
> Regards,
> Sven
>
>
> Appendix I:
>
> example json:
>
> {"age": 20, "name": "Michelle Hernandez", "birth": "1991-08-16",
> "address": "94753 Tina Bridge Suite 318\\nEmilyport, MT 75302"}
>
>
>
> Appendix II:
>
>
> The Python script to generate fake json data. Needs "pip install faker".
>
> >>> python fake_json.py > test.json  # generates 2M entries; takes some
> time
> >>> cat test.json | psql -c 'copy docs (meta) from stdin'
> >>> cat test.json | psql -c 'copy docs (meta) from stdin'
> >>> cat test.json | psql -c 'copy docs (meta) from stdin'
> >>> cat test.json | psql -c 'copy docs (meta) from stdin'
> >>> cat test.json | psql -c 'copy docs (meta) from stdin'
>
>
> -- fake_json.py --
>
> import faker, json
> fake = faker.Faker()
> for i in range(2*10**6):
>     # double backslashes so COPY does not reinterpret the \n escapes in the JSON
>     print(json.dumps({"name": fake.name(), "birth": fake.date(),
>                       "address": fake.address(),
>                       "age": fake.random_int(0, 100)}).replace('\\', '\\\\'))
>
>


Re: [PERFORM] Searching GIN-index (FTS) and sort by timestamp-column

2016-03-22 Thread Oleg Bartunov
On Sat, Mar 19, 2016 at 5:44 AM, Jeff Janes  wrote:

> On Wed, Mar 16, 2016 at 6:53 AM, Andreas Joseph Krogh 
> wrote:
>
> >> On Wednesday, 16 March 2016 at 14:37:27, Tom Lane wrote:
>>
>> Andreas Joseph Krogh  writes:
> >> > 1. Why isn't folder_id part of the index-cond?
>>
>> Because a GIN index is useless for sorting.
>>
>> > 2. Is there a way to make it use the (same) index to sort by
>> > received_timestamp?
>>
>> No.
>>
>> > 3. Using a GIN-index, is there a way to use the index at all for
>> sorting?
>>
>> No.
>>
>> > 4. It doesn't seem like ts_rank uses the index for sorting either.
>>
>> Same reason.
>>
>> regards, tom lane
>>
>>
>> So it's basically impossible to use FTS/GIN with sorting on large
>> datasets?
>> Are there any plans to improve this situation?
>>
>
> I don't see why it would not be possible to create a new execution node
> type that does an index scan to obtain order (or just to satisfy an
> equality or range expression), and takes a bitmap (as produced by the
> FTS/GIN) to apply as a filter.  But, I don't know of anyone planning on
> doing that.
>

Please find the bitmap filtering patch, which we developed several months ago
but failed to find a good use case for :( Teodor is here now, so he can answer
any questions.


>
> Cheers,
>
> Jeff
>


bitfilter-0.9.patch.gz
Description: GNU Zip compressed data



Re: [PERFORM] Searching GIN-index (FTS) and sort by timestamp-column

2016-03-21 Thread Oleg Bartunov
On Mon, Mar 21, 2016 at 5:41 PM, Andreas Joseph Krogh 
wrote:

> On Saturday, 19 March 2016 at 03:44:55, Jeff Janes <
> jeff.ja...@gmail.com>:
>
> On Wed, Mar 16, 2016 at 6:53 AM, Andreas Joseph Krogh 
> wrote:
>>
>> On Wednesday, 16 March 2016 at 14:37:27, Tom Lane wrote:
>>
>> Andreas Joseph Krogh  writes:
>> > 1. Why isn't folder_id part of the index-cond?
>>
>> Because a GIN index is useless for sorting.
>>
>> > 2. Is there a way to make it use the (same) index to sort by
>> > received_timestamp?
>>
>> No.
>>
>> > 3. Using a GIN-index, is there a way to use the index at all for
>> sorting?
>>
>> No.
>>
>> > 4. It doesn't seem like ts_rank uses the index for sorting either.
>>
>> Same reason.
>>
>> regards, tom lane
>>
>>
>> So it's basically impossible to use FTS/GIN with sorting on large
>> datasets?
>> Are there any plans to improve this situation?
>>
>
> I don't see why it would not be possible to create a new execution node
> type that does an index scan to obtain order (or just to satisfy an
> equality or range expression), and takes a bitmap (as produced by the
> FTS/GIN) to apply as a filter.  But, I don't know of anyone planning on
> doing that.
>
>
> Isn't this what Postgres Pro are planning?
> http://postgrespro.com/roadmap/mssearch
>
> *"Unlike external special-purpose search engines, a full-text search
> engine built in a DBMS is capable of combining full-text and attributive
> search criteria in SQL query syntax. It is planned to improve the existing
> PostgreSQL full-text search engine by extending the functionality of
> Generalized Inverted Index (GIN) to make it capable of storing extra
> information required for ranging query results. This search acceleration
> will allow to go back from external full-text search engines, thus
> facilitating system administration and use, reducing technology risks, and
> improving information security."*
>

This is a different feature! Actually, we already have a prototype of what
Jeff suggested - we called it bitmap filtering - but failed to find a use case
where it provides benefits. Teodor will comment on this idea in more detail.


>
> --
> *Andreas Joseph Krogh*
> CTO / Partner - Visena AS
> Mobile: +47 909 56 963
> andr...@visena.com
> www.visena.com
> 
>
>


Re: [PERFORM] Workaround: Planner preference for tsquery filter vs. GIN index in fast text search

2014-04-22 Thread Oleg Bartunov
On Tue, Apr 22, 2014 at 10:28 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 04/20/2014 07:46 AM, Oleg Bartunov wrote:

 btw, 9.4 should be wiser in the case of rare+common terms, thanks to the GIN
 fast scan feature.


 Indeed, although we didn't actually do anything to the planner to make it
 understand when fast scan helps. Doing something about cost estimation is
 still on the 9.4 Open Items list, but I don't have any ideas on what to do
 about it, and I haven't heard anything from Alexander about that either.
 That means that the cost estimation issue Laurence saw is going to be even
 worse in 9.4, because GIN is going to be faster than a seq scan in more
 cases than before and the planner doesn't know about it.

 - Heikki

You are right, we should return to that topic.




Re: [PERFORM] Workaround: Planner preference for tsquery filter vs. GIN index in fast text search

2014-04-19 Thread Oleg Bartunov
btw, 9.4 should be wiser in the case of rare+common terms, thanks to the GIN
fast scan feature.

On Sun, Apr 20, 2014 at 8:30 AM, Laurence Parry greenrea...@hotmail.com wrote:
 As mentioned here and elsewhere (most recently in "How can I get the query
 planner to use a bitmap index scan instead of an index scan?" - 8 Mar
 2014), estimation of the relative cost of text search operations using
 GIN-indexed columns sometimes goes awry, particularly when there will be a
 large number of matches.

 The planner may choose to use a sequential or unrelated index scan with @@
 as a filter, especially when incorporated as a subquery, incurring
 significant cost (even without considering de-TOASTing). Pre-tsvectorizing
 the column offers only a slight mitigation and can cause regressions (if
 nothing else, it adds another large column).

 What worked for me (and I'm hoping for others, though YMMV) was adding
 'OFFSET 0' to the subquery involving the indexed column, e.g.

 ...
 (SELECT sk1.submission_id
 FROM submission_keywords sk1, keywords k1
 WHERE sk1.keyword_id = k1.keyword_id
AND
 to_tsvector('english_nostop', k1.keyword) @@ to_tsquery('english_nostop',
 'tails')
 OFFSET 0)
 ...

 The result is a bitmap scan:
 --
 Nested Loop
 (cost=8.73..4740.29 rows=21348 width=4)
 (actual time=0.621..13.661 rows=20097 loops=1)
   ->  Bitmap Heap Scan on keywords k1
       (cost=8.30..1028.72 rows=755 width=4)
       (actual time=0.603..2.276 rows=752 loops=1)
       Recheck Cond:
       (to_tsvector('english_nostop'::regconfig, keyword) @@
       '''tail'''::tsquery)
       ->  Bitmap Index Scan on keyword_to_tsvector_keywords
           (cost=0.00..8.11 rows=755 width=0)
           (actual time=0.496..0.496 rows=756 loops=1)
           Index Cond:
           (to_tsvector('english_nostop'::regconfig, keyword) @@
           '''tail'''::tsquery)
   ->  Index Only Scan using keyword_id_submission_id_submission_keywords on
       submission_keywords sk1
       (cost=0.43..3.47 rows=145 width=8)
       (actual time=0.005..0.010 rows=27 loops=752)
       Index Cond: (keyword_id = k1.keyword_id)
       Heap Fetches: 99
 Total runtime: 14.809 ms

 Without this the test was moved to a filter inside a nested loop, with
 disastrous results:
 ->  Hash Semi Join
     (cost=23.37..23.51 rows=1 width=8)
     (actual time=0.090..0.090 rows=0 loops=594670)
     Hash Cond: (s1.submission_id = sk1.submission_id)
     ->  Index Only Scan using submissions_pkey on submissions s1
         (cost=0.42..0.56 rows=1 width=4)
         (actual time=0.007..0.007 rows=1 loops=17352)
         Index Cond: (submission_id = s.submission_id)
         Heap Fetches: 8372
     ->  Hash
         (cost=22.94..22.94 rows=1 width=4)
         (actual time=0.086..0.086 rows=0 loops=594670)
         Buckets: 1024  Batches: 1  Memory Usage: 0kB
         ->  Nested Loop
             (cost=0.85..22.94 rows=1 width=4)
             (actual time=0.083..0.085 rows=0 loops=594670)
             ->  Index Only Scan using file_keyword on submission_keywords
                 sk1
                 (cost=0.43..0.80 rows=13 width=8)
                 (actual time=0.006..0.008 rows=9 loops=594670)
                 Index Cond: (submission_id = s.submission_id)
                 Heap Fetches: 21324
             ->  Index Scan using keywords_pkey on keywords k1
                 (cost=0.42..1.69 rows=1 width=4)
                 (actual time=0.008..0.008 rows=0 loops=5329219)
                 Index Cond: (keyword_id = sk1.keyword_id)
                 Filter: (to_tsvector('english_nostop'::regconfig,
                 keyword) @@ '''tail'''::tsquery)
 Total runtime: 55194.034 ms [there are other lines, but 50 sec is above]

 Yes, that's a ~3000x speedup! Not all search terms benefit so much, but we
 get a lot of searches for the most common terms, and scans just get worse
 the more you add.

 I got the idea from Seamus Abshere:
 http://seamusabshere.github.io/2013/03/29/hinting-postgres-and-mysql-with-offset-and-limit/

 I've heard it said that any Postgres DBA worth his salt knows this trick,
 as well as the use of WITH to create a common table expression. Alas, many
 of us are still learning . . . I beat my head over this for a week, and it's
 affected our site for far longer. This kind of issue makes people think they
 need to replace PostgreSQL with a dedicated search solution to be able to
 scale, which is a shame.

 I know hinting has a bad rep, but this is a localized fix, and what has been
 said before leads me to believe that estimating the cost of such situations
 is a hard nut to crack - one which is not on anyone's plate right now.

 Incidentally, documentation section 7.6 ("LIMIT and OFFSET") states that
 "OFFSET 0 is the same as omitting the OFFSET clause", which is clearly not
 the case here. I appreciate that this is an implementation detail which
 might change, but it's an important one 

Re: [PERFORM] Ways to speed up ts_rank

2012-10-10 Thread Oleg Bartunov

We'll present some improvements in FTS in Prague. Unfortunately, we have
only several minutes during a lightning talk. In short, we improved GIN to
store additional information - coordinates for FTS, for example - and to
return search results ordered by rank, which gave us performance better than
Sphinx. It's just a prototype, but we already got a median of 8 msec for
6 mln classifieds.


We haven't tested long documents yet.

Regards,
Oleg

On Wed, 10 Oct 2012, François Beausoleil wrote:



On 2012-10-09 at 17:38, Shane Hathaway wrote:


Hello,

The database has a text index of around 200,000 documents. Investigation 
revealed that text queries are slow only when using ts_rank or ts_rank_cd.  
Without a ts_rank function, any query is answered within 200ms or so; with 
ts_rank function, queries take up to 30 seconds.  Deeper investigation using 
gprof showed that the problem is probably not ts_rank or ts_rank_cd, but the 
fact that those functions retrieve thousands of TOASTed tsvectors.


Is the query perhaps doing something like this:

SELECT ...
FROM table
WHERE tsvectorcol @@ plainto_tsquery('...')
ORDER BY ts_rank(...)

If so, ts_rank() is run for every document. What you should do instead is:

SELECT *
FROM (
   SELECT ...
   FROM table
   WHERE tsvectorcol @@ plainto_tsquery('...')) AS t1
ORDER BY ts_rank(...)

Notice the ts_rank() is on the outer query, which means it'll only run on the 
subset of documents which match the query. This is explicitly mentioned in the 
docs:

"Ranking can be expensive since it requires consulting the tsvector of each matching 
document, which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since 
practical queries often result in large numbers of matches."

(last paragraph of) 
http://www.postgresql.org/docs/current/static/textsearch-controls.html#TEXTSEARCH-RANKING

Hope that helps!
François Beausoleil




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance issues

2011-03-07 Thread Oleg Bartunov

On Mon, 7 Mar 2011, Andreas Forø Tollefsen wrote:


Ok. Cheers. I will do some more testing on my heavy PostGIS queries, which
often take hours to complete.


I'd like to see those hours-long queries :) Please post EXPLAIN ANALYZE output.



Thanks.
Andreas

2011/3/7 Kenneth Marshall k...@rice.edu


On Mon, Mar 07, 2011 at 03:17:05PM +0100, Andreas Forø Tollefsen wrote:

Thanks, Ken.

It seems like the tip to turn off synchronous_commit did the trick:

/usr/lib/postgresql/8.4/bin/pgbench -T 60 test1
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 1
duration: 60 s
number of transactions actually processed: 86048
tps = 1434.123199 (including connections establishing)
tps = 1434.183362 (excluding connections establishing)

Is this acceptable compared to others when considering my setup?

Cheers,
Andreas




These are typical results for synchronous_commit off. The caveat
is that you must be able to handle losing transactions if you have a
database crash, but your database is still intact. This differs
from turning fsync off, in which case a crash means you would need to
restore from a backup.

Cheers,
Ken
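
For reference, a minimal sketch of the setting being discussed (it can also be
set cluster-wide in postgresql.conf):

-- per session
SET synchronous_commit = off;
-- or only for the current transaction
SET LOCAL synchronous_commit = off;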





Regards,
Oleg



Re: [PERFORM] application of KNN code to US zipcode searches?

2011-02-17 Thread Oleg Bartunov

Mark,

We are investigating pgSphere (http://pgsphere.projects.postgresql.org/) to see
if we could add KNN support.


Oleg
On Thu, 17 Feb 2011, Mark Stosberg wrote:




I thought the benefit of KNN was that you could retrieve the rows in
distance order, so that a query for the closest 20 locations (for
example) would be very fast.  I wouldn't have expected it to be
helpful when you're selecting all the rows regardless of distance.


Kevin,

Thanks for the feedback. You are right that my reduced test case
wasn't a good approximation. I added a limit, to simulate finding the
100 zipcodes closest to 90210.

Below I compare 4 approaches to the same query:

1. Cube search
2. Earth Distance Search
3. Simple point distance (no index)
4. Simple point distance (KNN)

Now KNN benchmarks at almost 100x faster! That's very promising.
Then there's only the issue that simple point distance is not expected
to be a good enough approximation of earth-distances. Perhaps that can
be solved by pre-computing coordinates based on the lat/long pairs,
much like the map projections used to present a curved surface on a flat
map? Given that it's OK to be a few miles off, it seems we have some
leeway here.

Recommendations?

   Mark

EXPLAIN ANALYZE
SELECT zipcode,
   cube_distance( '(-2513120.64361786, -4645511.0460328,
3575538.9507084)', zipcodes.earth_coords)/1609.344 AS radius
   FROM zipcodes ORDER BY radius LIMIT 100;

---
Limit  (cost=2946.70..2946.95 rows=100 width=62) (actual
time=167.650..168.064 rows=100 loops=1)
  ->  Sort  (cost=2946.70..3050.40 rows=41483 width=62) (actual
time=167.644..167.829 rows=100 loops=1)
        Sort Key: ((cube_distance('(-2513120.64361786,
-4645511.0460328, 3575538.9507084)'::cube, earth_coords) /
1609.344::double precision))
        Sort Method: top-N heapsort  Memory: 20kB
        ->  Seq Scan on zipcodes  (cost=0.00..1361.24 rows=41483
width=62) (actual time=0.030..90.807 rows=41483 loops=1)
Total runtime: 168.300 ms


-- Using Earthdistance
EXPLAIN ANALYZE SELECT zipcode,
   lon_lat <@> '(-118.412426,34.096629)' As radius
   FROM zipcodes
   ORDER BY lon_lat <@> '(-118.412426,34.096629)'
   LIMIT 100;


Limit  (cost=2842.99..2843.24 rows=100 width=22) (actual
time=187.995..188.451 rows=100 loops=1)
  ->  Sort  (cost=2842.99..2946.70 rows=41483 width=22) (actual
time=187.989..188.149 rows=100 loops=1)
        Sort Key: ((lon_lat <@> '(-118.412426,34.096629)'::point))
        Sort Method: top-N heapsort  Memory: 20kB
        ->  Seq Scan on zipcodes  (cost=0.00..1257.54 rows=41483
width=22) (actual time=0.033..108.203 rows=41483 loops=1)
Total runtime: 188.660 ms

##

Using simple point distance, but with no Gist Index:

EXPLAIN ANALYZE SELECT zipcode,
   lon_lat <-> '(-118.412426,34.096629)' As radius
   FROM zipcodes
   ORDER BY lon_lat <-> '(-118.412426,34.096629)'
   LIMIT 100;


Limit  (cost=2842.99..2843.24 rows=100 width=22) (actual
time=160.574..161.057 rows=100 loops=1)
  ->  Sort  (cost=2842.99..2946.70 rows=41483 width=22) (actual
time=160.568..160.691 rows=100 loops=1)
        Sort Key: ((lon_lat <-> '(-118.412426,34.096629)'::point))
        Sort Method: top-N heapsort  Memory: 20kB
        ->  Seq Scan on zipcodes  (cost=0.00..1257.54 rows=41483
width=22) (actual time=0.027..84.610 rows=41483 loops=1)
Total runtime: 161.226 ms

#

-- Using KNN-GIST index
EXPLAIN ANALYZE SELECT zipcode,
   lon_lat <-> '(-118.412426,34.096629)' As radius
   FROM zipcodes
   ORDER BY lon_lat <-> '(-118.412426,34.096629)'
   LIMIT 100;
--
Limit  (cost=0.00..12.94 rows=100 width=22) (actual time=0.447..1.892
rows=100 loops=1)
  ->  Index Scan using zipcodes_knn on zipcodes  (cost=0.00..5365.93
rows=41483 width=22) (actual time=0.440..1.407 rows=100 loops=1)
        Order By: (lon_lat <-> '(-118.412426,34.096629)'::point)
Total runtime: 2.198 ms
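
For reference, the zipcodes_knn index in the last plan would be a plain GiST
index on the point column; a minimal sketch, assuming the names from the
queries (ordering by <-> via an index needs the KNN-GiST support new in 9.1):

CREATE INDEX zipcodes_knn ON zipcodes USING gist (lon_lat);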





Regards,
Oleg



Re: [PERFORM] Full Text index is not using during OR operation

2010-11-29 Thread Oleg Bartunov
 crmentity(crmid) ON DELETE CASCADE
 TABLE cc2crmentity CONSTRAINT fk_cc2crmentity_crmentity FOREIGN KEY
(crm_id) REFERENCES crmentity(crmid) ON UPDATE CASCADE ON DELETE CASCADE


\d activity

 Table "public.activity"
 Column | Type | Modifiers
 -------+------+-----------
activityid | integer | not null default 0
subject | character varying(250) | not null
semodule | character varying(20) |
activitytype | character varying(200) | not null
date_start | date | not null
due_date | date |
time_start | character varying(50) |
time_end | character varying(50) |
sendnotification | character varying(3) | not null default '0'::character
varying
duration_hours | character varying(2) |
duration_minutes | character varying(200) |
status | character varying(200) |
eventstatus | character varying(200) |
priority | character varying(200) |
location | character varying(150) |
notime | character varying(3) | not null default '0'::character varying
visibility | character varying(50) | not null default 'all'::character
varying
recurringtype | character varying(200) |
end_date | date |
end_time | character varying(50) |
Indexes:
 activity_pkey PRIMARY KEY, btree (activityid)
 activity_activitytype_idx btree (activitytype)
 activity_date_start_idx btree (date_start)
 activity_due_date_idx btree (due_date)
 activity_eventstatus_idx btree (eventstatus)
 activity_status_idx btree (status)
 activity_subject_idx btree (subject)
 activity_time_start_idx btree (time_start)
 ftx_en_activity_subject gin (to_tsvector('vcrm_en'::regconfig,
for_fts(subject::text)))



Regards,
Oleg



Re: [PERFORM] Full Text index is not using during OR operation

2010-11-29 Thread Oleg Bartunov

On Mon, 29 Nov 2010, AI Rumman wrote:


Oh! Actually it is:
select replace('Dhaka University of Bangladesh:*', ' ', ':* & ');
No space at the start.


So, what are the actual problems with full text search? I'm mostly interested
in the server crash. We need test data, a test query, and the error message.





On Mon, Nov 29, 2010 at 6:37 PM, Oleg Bartunov o...@sai.msu.su wrote:


What does replace(' Dhaka University of Bangladesh:*', ' ', ':* & ') mean?
I see it produces something wrong for to_tsquery:

test=# select replace(' Dhaka University of Bangladesh:*', ' ', ':* & ');

                      replace
----------------------------------------------------
 :* & Dhaka:* & University:* & of:* & Bangladesh:*
(1 row)

Oleg
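
For comparison, a form that yields a well-formed prefix query (a sketch for
illustration, not from the original thread):

test=# select to_tsquery('en', replace('Dhaka University of Bangladesh', ' ', ':* & ') || ':*');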


On Mon, 29 Nov 2010, AI Rumman wrote:

 explain

SELECT crmentity.crmid, crmentity.setype, crmentity.modifiedtime,
activity.subject,case when ( users.user_name not like '') then
users.user_name else groups.groupname end as user_name,
activity.date_start
FROM crmentity INNER JOIN activity ON crmentity.crmid =
activity.activityid
and crmentity.deleted = 0
LEFT JOIN activitygrouprelation ON activitygrouprelation.activityid =
crmentity.crmid
LEFT JOIN groups ON groups.groupname = activitygrouprelation.groupname
LEFT join users ON crmentity.smownerid= users.id
WHERE
to_tsvector(' en', for_fts( activity.subject)) @@ to_tsquery(' en',
replace(' Dhaka University of Bangladesh:*', ' ', ':* & '))
or
to_tsvector(' en', for_fts( crmentity.description)) @@ to_tsquery(' en',
replace(' Dhaka University of Bangladesh:*', ' ', ':* & '))
ORDER BY crmentity.modifiedtime DESC LIMIT 100

 QUERY PLAN


--
Limit (cost=112724.54..112724.54 rows=1 width=99)
  ->  Sort (cost=112724.54..112724.54 rows=1 width=99)
        Sort Key: crmentity.modifiedtime
        ->  Nested Loop Left Join (cost=0.00..112724.53 rows=1 width=99)
              ->  Nested Loop Left Join (cost=0.00..112724.24 rows=1 width=82)
                    ->  Nested Loop Left Join (cost=0.00..112723.96 rows=1 width=79)
                          ->  Nested Loop (cost=0.00..112723.68 rows=1 width=56)
                                Join Filter: ((to_tsvector('en'::regconfig,
regexp_replace((activity.subject)::text,
'(&[^;]+;)|(<[^>]+>)|([\\s\\r\\n\\t]+)'::text, ''::text, 'gs'::text)) @@
''' Dhaka'':* & ''univers'':* & ''bangladesh'':*'::tsquery) OR
(to_tsvector('en'::regconfig, regexp_replace(crmentity.description,
'(&[^;]+;)|(<[^>]+>)|([\\s\\r\\n\\t]+)'::text, ' '::text, 'gs'::text)) @@
''' Dhaka'':* & ''univers'':* & ''bangladesh'':*'::tsquery))
                                ->  Index Scan using activity_pkey on activity (cost=0.00..10223.89
rows=343070 width=36)
                                ->  Index Scan using crmentity_pkey on crmentity (cost=0.00..0.27 rows=1
width=151)
                                      Index Cond: (crmentity.crmid = activity.activityid)
                                      Filter: (crmentity.deleted = 0)
                          ->  Index Scan using activitygrouprelation_activityid_idx on
activitygrouprelation (cost=0.00..0.27 rows=1 width=27)
                                Index Cond: (activitygrouprelation.activityid = crmentity.crmid)
                    ->  Index Scan using groups_groupname_idx on groups (cost=0.00..0.27 rows=1
width=26)
                          Index Cond: ((groups.groupname)::text =
(activitygrouprelation.groupname)::text)
              ->  Index Scan using users_pkey on users (cost=0.00..0.27 rows=1 width=25)
                    Index Cond: (crmentity.smownerid = users.id)


The above query does not use the FTS indexes, and it even hangs the server.

But,


explain
SELECT crmentity.crmid, crmentity.setype, crmentity.modifiedtime,
activity.subject,case when ( users.user_name not like '') then
users.user_name else groups.groupname end as user_name,
activity.date_start
FROM crmentity INNER JOIN activity ON crmentity.crmid =
activity.activityid
and crmentity.deleted = 0
LEFT JOIN activitygrouprelation ON activitygrouprelation.activityid =
crmentity.crmid
LEFT JOIN groups ON groups.groupname = activitygrouprelation.groupname
LEFT join users ON crmentity.smownerid= users.id
WHERE
to_tsvector(' en', for_fts( activity.subject)) @@ to_tsquery(' en',
replace(' Dhaka University of Bangladesh:*', ' ', ':* & '))
ORDER BY crmentity.modifiedtime DESC LIMIT 100


--

Limit (cost=1.46..1.47 rows=1 width=99) (actual time=0.824..0.824 rows=0
loops=1)
  ->  Sort (cost=1.46..1.47 rows=1 width=99) (actual time=0.819..0.819 rows=0
loops=1)
        Sort Key: crmentity.modifiedtime
        Sort Method: quicksort Memory: 17kB
        ->  Nested Loop Left Join (cost=0.27..1.45 rows=1 width=99) (actual
time=0.752..0.752 rows=0 loops=1)
              ->  Nested Loop Left Join (cost=0.27..1.17 rows=1 width=82) (actual
time=0.750..0.750 rows=0 loops=1)
                    ->  Nested Loop Left Join (cost=0.27..0.88 rows=1 width=79) (actual
time=0.748..0.748 rows=0 loops=1)
                          ->  Nested Loop (cost=0.27..0.60 rows=1 width=56) (actual
time=0.746..0.746 rows=0 loops=1)
                                ->  Bitmap Heap Scan on activity (cost=0.27..0.30 rows=1 width=36) (actual
time=0.744..0.744 rows=0 loops=1)
                                      Recheck Cond: (to_tsvector('en'::regconfig

Re: [PERFORM] Understanding tsearch2 performance

2010-07-14 Thread Oleg Bartunov

Something is not right with the statistics: 91 rows estimated vs. 8449 actually
returned. Returning 8449 rows could take quite a long time.

Oleg
On Wed, 14 Jul 2010, Ivan Voras wrote:


Here's a query and its EXPLAIN ANALYZE output:

cms=> select count(*) from forum;
count
---
90675
(1 row)

cms=> explain analyze select id,title from forum where _fts_ @@
'fer'::tsquery;
 QUERY PLAN

---
Bitmap Heap Scan on forum  (cost=29.21..361.21 rows=91 width=35)
(actual time=2.946..63.646 rows=8449 loops=1)
  Recheck Cond: (_fts_ @@ '''fer'''::tsquery)
  ->  Bitmap Index Scan on forum_fts  (cost=0.00..29.19 rows=91
width=0) (actual time=2.119..2.119 rows=8449 loops=1)
        Index Cond: (_fts_ @@ '''fer'''::tsquery)
Total runtime: 113.641 ms
(5 rows)

The problem is - tsearch2 seems too slow. I have nothing to compare it
to but 113 ms for searching through this small table of 90,000 records
seems too slow. The forum_fts index is of GIN type and the table
certainly fits into RAM.

When I issue a dumb query without an index, I get a comparable order of
magnitude performance:

cms=> explain analyze select id,title from forum where content ilike
'%fer%';
QUERY PLAN


Seq Scan on forum  (cost=0.00..7307.44 rows=3395 width=35) (actual
time=0.030..798.375 rows=10896 loops=1)
  Filter: (content ~~* '%fer%'::text)
Total runtime: 864.384 ms
(3 rows)

cms=> explain analyze select id,title from forum where content like '%fer%';
   QUERY PLAN

---
Seq Scan on forum  (cost=0.00..7307.44 rows=3395 width=35) (actual
time=0.024..146.959 rows=7596 loops=1)
  Filter: (content ~~ '%fer%'::text)
Total runtime: 191.732 ms
(3 rows)

Some peculiarities of the setup which might or might not influence this
performance:

1) I'm using ICU-patched postgresql because I cannot use my UTF-8 locale
otherwise - this is why the difference between the dumb queries is large
(but I don't see how this can influence tsearch2 since it pre-builds the
tsvector data with lowercase lexemes)

2) My tsearch2 lexer is somewhat slow (but I don't see how it can
influence these read-only queries on a pre-built, lexed and indexed data)

Any ideas?





Regards,
Oleg



Re: [PERFORM] Understanding tsearch2 performance

2010-07-14 Thread Oleg Bartunov

On Wed, 14 Jul 2010, Ivan Voras wrote:


Returning 8449 rows could be quite long.


You are right, I didn't test this. Issuing a query which returns a
smaller result set is much faster.

But, offtopic, why would returning 8500 records, each around 100 bytes
long, so around 8.5 MB, over local unix sockets, be so slow? The machine
in question has a sustained memory bandwidth of nearly 10 GB/s. Does
PostgreSQL spend much time marshalling the data through the socket stream?


It's disk access time. In the very bad case it could take ~5 ms (for a fast
drive) to fetch just one row.


Regards,
Oleg



Re: [PERFORM] Understanding tsearch2 performance

2010-07-14 Thread Oleg Bartunov

Ivan,

Here is the EXPLAIN ANALYZE output - 7122 out of 528155 docs:
tseval=# select count(*) from document;
 count
--------
 528155
(1 row)

Time: 345,562 ms

tseval=# explain analyze select docno, title from document where vector @@ 
to_tsquery('english','mars');
 Bitmap Heap Scan on document  (cost=1655.97..10518.34 rows=2641 width=13) 
(actual time=3.127..11.556 rows=7122 loops=1)
   Recheck Cond: (vector @@ '''mar'''::tsquery)
   ->  Bitmap Index Scan on idx_vector  (cost=0.00..1655.31 rows=2641 width=0) 
(actual time=1.899..1.899 rows=7122 loops=1)
         Index Cond: (vector @@ '''mar'''::tsquery)
 Total runtime: 12.303 ms
(5 rows)

This is PostgreSQL 8.4.4 on Ubuntu machine.


Oleg

On Wed, 14 Jul 2010, Ivan Voras wrote:


On 07/14/10 16:03, Kevin Grittner wrote:

Ivan Voras   ivo...@freebsd.org   wrote:

On 07/14/10 15:49, Stephen Frost wrote:



Regarding the statistics, it's entirely possible that the index
is *not* the fastest way to pull this data (it's nearly 10% of
the table..)


I think that what I'm asking here is: is it reasonable for
tsearch2 to extract 8,500 rows from an index of 90,000 rows in 118
ms, given that approximately the same task can be done with an
unindexed LIKE operator in nearly the same time?


The answer is yes.  When it's 10% of the table, a sequential scan
can be more efficient than an index, as Stephen indicated.


Ok, to verify this I've tried increasing statistics on the field and
running VACUUM FULL ANALYZE, which didn't help. Next, I've tried setting
enable_indexscan to off, which also didn't do it:

cms=> set enable_indexscan=off;
SET
cms=> explain analyze select id,title from forum where _fts_ @@
'fer'::tsquery order by id limit 10;
QUERY PLAN

-
Limit  (cost=363.18..363.20 rows=10 width=35) (actual
time=192.243..192.406 rows=10 loops=1)
  ->  Sort  (cost=363.18..363.40 rows=91 width=35) (actual
time=192.229..192.283 rows=10 loops=1)
        Sort Key: id
        Sort Method:  top-N heapsort  Memory: 25kB
        ->  Bitmap Heap Scan on forum  (cost=29.21..361.21 rows=91
width=35) (actual time=12.071..136.130 rows=8449 loops=1)
              Recheck Cond: (_fts_ @@ '''fer'''::tsquery)
              ->  Bitmap Index Scan on forum_fts  (cost=0.00..29.19
rows=91 width=0) (actual time=11.169..11.169 rows=8449 loops=1)
                    Index Cond: (_fts_ @@ '''fer'''::tsquery)
Total runtime: 192.686 ms
(9 rows)

Any ideas on how to verify this?







Regards,
Oleg



Re: [PERFORM] Ugh - bad plan with LIMIT in a complex SELECT, any way to fix this?

2010-05-09 Thread Oleg Bartunov
 rows=9808 width=0) (actual time=1.438..1.438 rows=0
loops=1)
Index Cond:
(to_tsvector('english'::text, message) @@ to_tsquery('hosehead'::text))
  ->  Index Scan using post_ordinal on post
(cost=0.00..6.41 rows=1 width=436) (never executed)
        Index Cond: (public.post.ordinal = public.post.ordinal)
Total runtime: 1.600 ms
(13 rows)

Why is the planner taking into consideration the LIMIT (I know the
docs say it does) and choosing to sequentially scan a table of nearly 2
million rows?!  I don't see how that makes sense irrespective of the
query being LIMITed.

If it matters, setting enable_seqscan to OFF does not change the results.

-- Karl



Regards,
Oleg



Re: [PERFORM] Digesting explain analyze

2010-01-07 Thread Oleg Bartunov

Jesper,

The whole idea of a bitmap index scan is to optimize heap access, so it ruins
any ordering returned by the index. That's why our new KNNGist, which returns
ordered index tuples, doesn't support bitmap index scans (note, this is only
for KNN search).

Oleg

On Wed, 6 Jan 2010, Robert Haas wrote:


On Wed, Jan 6, 2010 at 2:10 PM, Jesper Krogh jes...@krogh.cc wrote:
 Hi.

 I have a table that consists of somewhere on the order of 100,000,000
 rows, and all rows are tuples of this form:

 (id1,id2,evalue);

 Then I'd like to speed up a query like this:

 explain analyze select id from table where id1 = 2067 or id2 = 2067 order
 by evalue asc limit 100;
                                  QUERY PLAN
 ----------------------------------------------------------------------------
  Limit  (cost=1423.28..1423.28 rows=100 width=12) (actual
 time=2.565..2.567 rows=100 loops=1)
    ->  Sort  (cost=1423.28..1424.54 rows=505 width=12) (actual
 time=2.560..2.560 rows=100 loops=1)
          Sort Key: evalue
          Sort Method:  top-N heapsort  Memory: 25kB
          ->  Bitmap Heap Scan on table  (cost=16.58..1420.75 rows=505
 width=12) (actual time=0.709..1.752 rows=450 loops=1)
                Recheck Cond: ((id1 = 2067) OR (id2 = 2067))
                ->  BitmapOr  (cost=16.58..16.58 rows=506 width=0) (actual
 time=0.676..0.676 rows=0 loops=1)
                      ->  Bitmap Index Scan on id1_evalue_idx
 (cost=0.00..11.44 rows=423 width=0) (actual
 time=0.599..0.599 rows=450 loops=1)
                            Index Cond: (id1_id = 2067)
                      ->  Bitmap Index Scan on id2_evalue_idx
 (cost=0.00..4.89 rows=83 width=0) (actual
 time=0.070..0.070 rows=1 loops=1)
                            Index Cond: (id2_id = 2067)
  Total runtime: 2.642 ms
 (12 rows)


 What I had expected was to see the Bitmap Index Scan on id1_evalue_idx
 to chop it off at a limit 1. The inner sets are on average 3.000 for
 both id1 and id2 and a typical limit would be 100, so if I could convince
 postgresql to not fetch all of them then I would reduce the set retrieved
 by around 60. The dataset is quite large so the random query is not very
 likely to be hitting the same part of the dataset again, so there is going
 to be a fair amount of going to disk.,

 It would also mean that using it in a for loop in a stored procedure in
 plpgsql would not get any benefit from the CURSOR effect?

 I actually tried to stuff id1, id2 into an array and build a GiST index on
 (array, evalue), hoping that it would directly satisfy this query... it used
 the GiST index to fetch the rows, then post-sorted and limited the set.

 What it boils down to is more or less:

 Does a bitmap index scan support ordering and limit ?
 Does a multicolummn gist index support ordering and limit ?

 Have I missed anything that can hugely speed up fetching these (typically
 100 rows) from the database.

Bitmap index scans always return all the matching rows.  It would be
nice if they could fetch them in chunks for queries like this, but
they can't.  I am not sure whether there's any way to make GIST do
what you want.

You might try something like this (untested):

SELECT * FROM (
   select id, evalue from table where id1 = 2067 order by evalue asc limit 100
   union all
   select id, evalue from table where id2 = 2067 order by evalue asc limit 100
) x ORDER BY evalue LIMIT 100

If you have an index by (id1, evalue) and by (id2, evalue) then I
would think this would be pretty quick, as it should do two index
scans (not bitmap index scans) to fetch 100 rows for each, then append
the results, sort them, and then limit again.

...Robert
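
For reference, a sketch of the two composite indexes this suggestion assumes
("table" here is the placeholder name used in the thread):

CREATE INDEX id1_evalue_idx ON "table" (id1, evalue);
CREATE INDEX id2_evalue_idx ON "table" (id2, evalue);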




Regards,
Oleg



Re: [PERFORM] FTS performance with the Polish config

2009-11-15 Thread Oleg Bartunov

On Sun, 15 Nov 2009, Pavel Stehule wrote:


2009/11/15 Oleg Bartunov o...@sai.msu.su:

Yes, as stated, the original author uses a Polish ispell dictionary.
An ispell dictionary is slow to load the first time. In real life it should
be no problem.



It is a problem. People who need fast access use English without
Czech. It drops some features, but it is significantly faster.


Just don't use the ispell dictionary; the Czech snowball stemmer is as fast
as the English one.


An ispell dictionary (no matter whether English or another language) is slow
for the first load and is then cached, so there is no problem if you use
persistent database connections, which are the de facto standard for any
serious project.
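
For reference, a minimal sketch of how such an ispell dictionary is defined
(the Polish dictionary/affix/stopword file names are assumptions):

CREATE TEXT SEARCH DICTIONARY polish_ispell (
    TEMPLATE = ispell,
    DictFile = polish,    -- polish.dict in $SHAREDIR/tsearch_data
    AffFile = polish,     -- polish.affix
    StopWords = polish
);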




Pavel


Oleg
On Sat, 14 Nov 2009, Pavel Stehule wrote:


2009/11/14 Tom Lane t...@sss.pgh.pa.us:

Kenneth Marshall k...@rice.edu writes:

On Sat, Nov 14, 2009 at 12:25:05PM +0100, Wojciech Knapik wrote:

I just finished implementing a search engine for my site and found
ts_headline extremely slow when used with a Polish tsearch configuration,
while fast with English.



The documentation for ts_headline() states:
ts_headline uses the original document, not a tsvector summary, so it
can be slow and should be used with care.


That's true but the argument in the docs would apply just as well to
english or any other config.  So while Wojciech would be well advised
to try to avoid making a lot of calls to ts_headline, it's still curious
that it's so much slower in polish than english.  Could we see a
self-contained test case?


is it dictionary based or stem based?

Dictionary-based FTS is very slow (first load). At a minimum, Czech FTS is
slow.

regards
Pavel Stehule



                regards, tom lane








       Regards,
               Oleg





Regards,
Oleg


Re: [PERFORM] FTS performance with the Polish config

2009-11-15 Thread Oleg Bartunov

On Sun, 15 Nov 2009, Pavel Stehule wrote:



czech stemmer doesn't exist :(



I'd try Morfessor (http://www.cis.hut.fi/projects/morpho/), which is an
unsupervised morphological dictionary. I think it would not be very hard to
add a Morfessor dictionary template to tsearch2, so people could create their
own stemmers.



An ispell dictionary (no matter whether English or another language) is slow
for the first load and is then cached, so there is no problem if you use
persistent database connections, which are the de facto standard for any
serious project.



I agree that connection pooling could be a solution. But is it good?
Can't we share the dictionary better?


We thought about this issue and have some ideas. Teodor can explain more
clearly, since I don't remember all the details.






Pavel


Oleg
On Sat, 14 Nov 2009, Pavel Stehule wrote:


2009/11/14 Tom Lane t...@sss.pgh.pa.us:


Kenneth Marshall k...@rice.edu writes:


On Sat, Nov 14, 2009 at 12:25:05PM +0100, Wojciech Knapik wrote:


I just finished implementing a search engine for my site and found
ts_headline extremely slow when used with a Polish tsearch configuration,
while fast with English.



The documentation for ts_headline() states:
ts_headline uses the original document, not a tsvector summary, so it
can be slow and should be used with care.


That's true but the argument in the docs would apply just as well to
english or any other config.  So while Wojciech would be well advised
to try to avoid making a lot of calls to ts_headline, it's still curious
that it's so much slower in polish than english.  Could we see a
self-contained test case?


is it dictionary based or stem based?

Dictionary-based FTS is very slow (first load). At a minimum, Czech FTS is
slow.

regards
Pavel Stehule



                regards, tom lane








       Regards,
               Oleg





       Regards,
               Oleg




Regards,
Oleg


Re: [PERFORM] FTS performance with the Polish config

2009-11-14 Thread Oleg Bartunov

Yes, as stated, the original author uses a Polish ispell dictionary.
An ispell dictionary is slow to load the first time. In real life it should
be no problem.


Oleg
On Sat, 14 Nov 2009, Pavel Stehule wrote:


2009/11/14 Tom Lane t...@sss.pgh.pa.us:
 Kenneth Marshall k...@rice.edu writes:
 On Sat, Nov 14, 2009 at 12:25:05PM +0100, Wojciech Knapik wrote:
 I just finished implementing a search engine for my site and found
 ts_headline extremely slow when used with a Polish tsearch configuration,
 while fast with English.

 The documentation for ts_headline() states:
 ts_headline uses the original document, not a tsvector summary, so it
 can be slow and should be used with care.

 That's true but the argument in the docs would apply just as well to
 english or any other config.  So while Wojciech would be well advised
 to try to avoid making a lot of calls to ts_headline, it's still curious
 that it's so much slower in polish than english.  Could we see a
 self-contained test case?

is it dictionary based or stem based?

Dictionary-based FTS is very slow (first load). At a minimum, Czech FTS is slow.

regards
Pavel Stehule


                regards, tom lane






Regards,
Oleg



Re: [PERFORM] Full text search with ORDER BY performance issue

2009-07-21 Thread Oleg Bartunov

On Tue, 21 Jul 2009, Krade wrote:


On 7/21/2009 11:32, valgog wrote:

Hi,

There is a problem with GIN and GiST indexes: they cannot be used
by ORDER BY. Maybe it would be a nice idea to ask Oleg to make it
possible to use the b-tree columns in GiST or GIN to make the sort
easier, but I have no idea how difficult it would be to implement in the
current GIN or GiST structures. I think Oleg or even Tom would be the
right people to ask :) But even if it is possible, it will not be
implemented until at least 8.5, which will need a year to come, so until
then...

Unfortunately, it's not even just the lack of ORDER BY support; btree_gin 
indexes seem to be broken under some circumstances. So I can't even use my 
idea to limit searches to the last 10 days.


See this:
http://pgsql.privatepaste.com/5219TutUMk

The first query gives bogus results. It's not using the index correctly.

timestamp_comment_gin is a GIN index on timestamp, comment_tsv. The timestamp 
column is an integer. The queries work right if I drop the index. Is this a 
bug in btree_gin?
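
For context, such a composite GIN index comes from the btree_gin extension; a
minimal sketch, assuming the table and column names from the thread:

CREATE EXTENSION btree_gin;
CREATE INDEX timestamp_comment_gin ON a USING gin ("timestamp", comment_tsv);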


It'd be nice if you provided us the data, so we can reproduce your problem.


It is possible to split your table into several smaller ones, putting
them on different machines, and then split your query with DBLINK.
This will distribute the burden of sorting to several machines that
will have to sort smaller parts as well. After you have your 25 ids
from each of the machines, you can merge them, sort again and limit as
you wish. Doing large offsets will still be problematic, but faster
anyway in most reasonable offset ranges. (Load balancing tools like
pg_pool can automate this task, but I do not have practical experience
using them for that purpose.)

Yet another very interesting technology -- Sphinx search
(http://www.sphinxsearch.com/). It can distribute data over several machines
automatically, but it will probably be too expensive to start using
(if this is not your main task :)) as they do not have standard
automation scripts, it does not support live updates (so you will
always have some minutes of delay), and it is a standalone service
that needs to be maintained, configured, and synchronized with your
main database separately (though you can use PL/Python to access it
from postgres).

Good luck with your task :)
Yeah, I don't really have that sort of resources. This is a small hobby 
project (ie: no budget) that is growing a bit too large. I might just have to 
do text searches without time ordering.


On 7/21/2009 5:06, Scott Marlowe wrote:

Couldn't you do the second query as a WITH query, then run another
query to limit that result to everything greater than now()-xdays?

I suppose I could, but I have no way to do a fast query that does both a full 
text match and a < or > in the same WHERE due to the issue I described above, 
so my original plan won't work. A separate BTREE timestamp index obviously 
does nothing.


And again, thank you for all the help.




Regards,
Oleg



Re: [PERFORM] Atomic access to large arrays

2009-07-21 Thread Oleg Bartunov

Victor,

Just wondering, why do you use an array?

Oleg
On Wed, 22 Jul 2009, Victor de Buen (Bayes) wrote:


Hi

I'm storing historical meteorological gridded data from GFS (
http://www.nco.ncep.noaa.gov/pmb/products/gfs/) into an array field in a
table like this:

CREATE TABLE grid_f_data_i2 (
  -- Specifies the variable and other features of data
  id_inventory integer REFERENCES grid_d_inventory(id_inventory),
  -- A new grid is available each 3 hours since 5 years ago
  dh_date timestamp,
  -- Values are scaled to be stored as signed integers of 2 bytes
  vl_grid smallint[361][720],
  CONSTRAINT meteo_f_gfs_tmp PRIMARY KEY
    (id_inventory, dh_date)
);

Dimensions of each value of field vl_grid are (lat: 361 x lon: 720 = 259920
cells) for a grid of 0.5 degrees (about 55 km between points) around the world. So,
vl_grid[y][x] stores the value at dh_date of a meteorological variable
specified by id_inventory in the geodesic point

latitude  = -90 + y*0.5
longitude = x*0.5

The reverse formula for the closest point in the grid of an arbitrary
geodesic point will be

y = Round((latitude+90) * 2)
x = Round(longitude * 2)

Field vl_grid is stored in the TOAST table and has a good compression level.
PostgreSQL is the only database that is able to store this huge amount
of data in only 34 GB of disk. It's a really great system. Queries returning
big rectangular areas are very fast, but the target of almost all queries is
to get a historical series for a geodesic point:

SELECT  dh_date, vl_grid[123][152]
FROM  grid_f_data_i2
WHERE  id_inventory = 6
ORDER BY dh_date

In this case, atomic access to just one cell in each of only a few
thousand rows becomes too slow.

Please, could somebody answer some of these questions?

   - Is it possible to tune some TOAST parameters to get faster atomic access
   to large arrays?


   - Could using the EXTERNAL strategy for storing TOAST-able columns solve
   the problem? (See the sketch after this list.)


   - Will atomic access be faster if the table stores vectors of data for the
   same parallel instead of matrices of global data?
   CREATE TABLE grid_f_data_i2 (
     -- Specifies the variable and other features of data
     id_inventory integer REFERENCES grid_d_inventory(id_inventory),
     -- A new grid is available each 3 hours since 5 years ago
     dh_date timestamp,
     -- nu_parallel = y = Round((latitude+90) * 2)
     nu_parallel smallint,
     -- Values are scaled to be stored as signed integers of 2 bytes
     vl_parallel smallint[],
     CONSTRAINT meteo_f_gfs_tmp PRIMARY KEY
       (id_inventory, nu_parallel, dh_date)
   );

   - Is there another, faster solution?
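
A minimal sketch of the EXTERNAL storage strategy asked about above (EXTERNAL
keeps the value out of line but uncompressed, so slices of the array can be
fetched without decompressing the whole datum; the names are from the post):

ALTER TABLE grid_f_data_i2 ALTER COLUMN vl_grid SET STORAGE EXTERNAL;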

Thanks in advance and best regards




Regards,
Oleg



Re: [PERFORM] Full text search with ORDER BY performance issue

2009-07-20 Thread Oleg Bartunov

Krade,

On Sat, 18 Jul 2009, Krade wrote:


Here's a couple of queries:

archive=> explain analyze select * from a where  comment_tsv @@ 
plainto_tsquery('love') order by timestamp desc limit 24 offset 0;


QUERY PLAN
--
Limit  (cost=453248.73..453248.79 rows=24 width=281) (actual 
time=188441.047..188441.148 rows=24 loops=1)
  ->  Sort  (cost=453248.73..453882.82 rows=253635 width=281) (actual 
time=188441.043..188441.079 rows=24 loops=1)
        Sort Key: timestamp
        Sort Method:  top-N heapsort  Memory: 42kB
        ->  Bitmap Heap Scan on a  (cost=17782.16..446166.02 rows=253635 
width=281) (actual time=2198.930..187948.050 rows=256378 loops=1)
              Recheck Cond: (comment_tsv @@ plainto_tsquery('love'::text))
              ->  Bitmap Index Scan on timestamp_comment_gin 
(cost=0.00..17718.75 rows=253635 width=0) (actual time=2113.664..2113.664 
rows=259828 loops=1)
                    Index Cond: (comment_tsv @@ 
plainto_tsquery('love'::text))

Total runtime: 188442.617 ms
(9 rows)

archive=> explain analyze select * from a where  comment_tsv @@ 
plainto_tsquery('love') limit 24 offset 0;


QUERY PLAN
--
Limit  (cost=0.00..66.34 rows=24 width=281) (actual time=14.632..53.647 
rows=24 loops=1)
  ->  Seq Scan on a  (cost=0.00..701071.49 rows=253635 width=281) (actual 
time=14.629..53.588 rows=24 loops=1)
        Filter: (comment_tsv @@ plainto_tsquery('love'::text))
Total runtime: 53.731 ms
(4 rows)

First one runs painfully slow.


Hmm, everything is already written in the explain :) In the first query 
253635 rows have to be read from disk and sorted, while in the
second query only 24 (random) rows are read from disk, so there is a
difference of four orders of magnitude, and in the worst case you should
expect the time for the 1st query to be about 53*10^4 ms.



Is there really no way to have efficient full text search results ordered by 
a separate field? I'm really open to all possibilities, at this point.


Thanks.




Regards,
Oleg

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] GiST index performance

2009-05-07 Thread Oleg Bartunov

On Wed, 6 May 2009, Tom Lane wrote:


Matthew Wakeling matt...@flymine.org writes:

Here is my patch ported over to the seg contrib package, attached. Apply
it to seg.c and all should be well. A similar thing needs to be done to
cube, but I haven't looked at that.


Teodor, Oleg, do you intend to review/apply this patch?


Tom,

I just returned from a trek around Annapurna and just learned about Matthew's
experiments. Teodor is on holidays and will be available after May 11,
and then there should be PGCon, so if it can wait, we could look at this

after PGCon.

Matthew, did you try various data? From our experience we learned there
can be various corner cases.



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Extremely slow intarray index creation and inserts.

2009-03-19 Thread Oleg Bartunov

We usually say about 200 unique values as a limit for
gist_int_ops.

On Wed, 18 Mar 2009, Tom Lane wrote:


Ron Mayer rm...@cheapcomplexdevices.com writes:

Oleg Bartunov wrote:
OB: it's not about short or long arrays, it's about small or big
OB: cardinality of the whole set (the number of unique elements)



I'm re-reading the docs and it still wasn't obvious to me.   A
potential docs patch is attached below.


Done, though not in exactly those words.  I wonder though if we can
be less vague about it --- can we suggest a typical cutover point?
Like use gist__intbig_ops if there are more than about 10,000 distinct
array values?  Even a rough order of magnitude for where to worry
about this would save a lot of people time.

regards, tom lane

Index: intarray.sgml
===
RCS file: /cvsroot/pgsql/doc/src/sgml/intarray.sgml,v
retrieving revision 1.5
retrieving revision 1.6
diff -c -r1.5 -r1.6
*** intarray.sgml   10 Dec 2007 05:32:51 -  1.5
--- intarray.sgml   18 Mar 2009 20:18:18 -  1.6
***
*** 237,245 ****
    <para>
     Two GiST index operator classes are provided:
     <literal>gist__int_ops</> (used by default) is suitable for
!    small and medium-size arrays, while
     <literal>gist__intbig_ops</> uses a larger signature and is more
!    suitable for indexing large arrays.
    </para>

    <para>
--- 237,246 ----
    <para>
     Two GiST index operator classes are provided:
     <literal>gist__int_ops</> (used by default) is suitable for
!    small- to medium-size data sets, while
     <literal>gist__intbig_ops</> uses a larger signature and is more
!    suitable for indexing large data sets (i.e., columns containing
!    a large number of distinct array values).
    </para>

    <para>



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Extremely slow intarray index creation and inserts.

2009-03-19 Thread Oleg Bartunov

On Thu, 19 Mar 2009, Tom Lane wrote:


Oleg Bartunov o...@sai.msu.su writes:

We usually say about 200 unique values as a limit for
gist_int_ops.


That seems awfully small ... should we make gist_intbig_ops the default,
or more likely, raise the signature size of both opclasses?  Even at a
crossover point of 1 I'm not sure that many real-world apps would
bother considering gist_int_ops.


gist__int_ops doesn't use signatures; it uses range compression, which
is not lossy, but not capacious. Perhaps that's why we decided to use it
as the default opclass.



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Extremely slow intarray index creation and inserts.

2009-03-19 Thread Oleg Bartunov

On Thu, 19 Mar 2009, Oleg Bartunov wrote:


On Thu, 19 Mar 2009, Tom Lane wrote:


Oleg Bartunov o...@sai.msu.su writes:

We usually say about 200 unique values as a limit for
gist_int_ops.


That seems awfully small ... should we make gist_intbig_ops the default,
or more likely, raise the signature size of both opclasses?  Even at a
crossover point of 1 I'm not sure that many real-world apps would
bother considering gist_int_ops.


gist__int_ops doesn't use signatures; it uses range compression, which
is not lossy, but not capacious. Perhaps that's why we decided to use it as


sorry, it's lossy


default opclass.


Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] bitmap heap scan recheck on full text search with gin index

2009-02-05 Thread Oleg Bartunov

Hal,

just create a separate column with a tsvector and create an index on it.
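
A minimal sketch of that (column and index names are illustrative; a trigger
would keep the column current on insert/update):

ALTER TABLE download_texts ADD COLUMN download_text_tsv tsvector;
UPDATE download_texts
   SET download_text_tsv = to_tsvector('english', download_text);
CREATE INDEX download_texts_tsv_idx
    ON download_texts USING gin (download_text_tsv);

With the tsvector stored, a recheck compares against the precomputed column
instead of re-parsing the whole text of every matched row.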

Oleg
On Thu, 5 Feb 2009, Hal Roberts wrote:


Hi All,

I'm getting poor performance on full text searches that return lots of 
entries from a table with about 7 million rows.  I think the cause is 
rechecking the text match on all of the returned rows, even though I'm using 
a @@ query on a gin index, which the docs say should not require a recheck.


Here's the table:


mediacloud=> \d download_texts
                                Table "public.download_texts"
      Column       |  Type   |                              Modifiers
-------------------+---------+----------------------------------------------------------------------------
 download_texts_id | integer | not null default nextval('download_texts_download_texts_id_seq'::regclass)
 downloads_id      | integer | not null
 download_text     | text    | not null
Indexes:
    "download_texts_pkey" PRIMARY KEY, btree (download_texts_id)
    "download_texts_downloads_id_index" UNIQUE, btree (downloads_id)
    "download_texts_textsearch_idx" gin (to_tsvector('english'::regconfig, download_text)), tablespace "large_table_space"
Foreign-key constraints:
    "download_texts_downloads_id_fkey" FOREIGN KEY (downloads_id) REFERENCES downloads(downloads_id)
Tablespace: "large_table_space"


And here's the query:


mediacloud=> explain analyze select count(dt.download_texts_id) from
download_texts dt where to_tsvector('english', download_text) @@
to_tsquery('english', 'stimulus');

                                   QUERY PLAN

 Aggregate  (cost=26161.16..26161.17 rows=1 width=4) (actual time=153640.083..153640.083 rows=1 loops=1)
   ->  Bitmap Heap Scan on download_texts dt  (cost=3937.41..26146.11 rows=6018 width=4) (actual time=1957.074..153529.351 rows=72225 loops=1)
         Recheck Cond: (to_tsvector('english'::regconfig, download_text) @@ '''stimulus'''::tsquery)
         ->  Bitmap Index Scan on download_texts_textsearch_idx  (cost=0.00..3935.90 rows=6018 width=0) (actual time=1048.556..1048.556 rows=72225 loops=1)
               Index Cond: (to_tsvector('english'::regconfig, download_text) @@ '''stimulus'''::tsquery)

 Total runtime: 153642.249 ms


Nearly all of the time is being spent in the bitmap heap scan, I suspect 
because of the work of rereading and rechecking the text of all the matched 
entries.  Is this indeed what's going on here?  Is there any way to make 
postgres not do that recheck?



Thanks!

-hal




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] LIKE Query performance

2009-01-29 Thread Oleg Bartunov

On Thu, 29 Jan 2009, Hari, Balaji wrote:


Is there a patch to make Wildspeed work with postgresql version 8.3.1?


unfortunately, no.



P.S
My bad, the version number was incorrect in my previous mail.
-Original Message-
From: Oleg Bartunov [mailto:o...@sai.msu.su]
Sent: Wednesday, January 28, 2009 1:27 AM
To: Hari, Balaji
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] LIKE Query performance

Only wildspeed http://www.sai.msu.su/~megera/wiki/wildspeed
has index support for %text%
But, it has limitations.

Oleg
On Tue, 27 Jan 2009, Hari, Balaji wrote:


Hi,

I am relatively new to PostgreSQL(8.1) and facing the following problem.

We have indexes defined on timestamp and description (create index 
description_idx on event using btree (description varchar_pattern_ops))

EXPLAIN ANALYZE SELECT event_id, category, current_session_number, description, 
event_type_id, realm_name, root_session_number, severity, source_name, 
target_key, target_name, timestamp, jdo_version FROM event WHERE description 
like '%mismatch%' ORDER BY timestamp desc;
                                   QUERY PLAN
--------------------------------------------------------------------------
 Sort  (cost=36267.09..36272.73 rows=2256 width=314) (actual time=19255.075..20345.774 rows=647537 loops=1)
   Sort Key: timestamp
   Sort Method:  external merge  Disk: 194080kB
   ->  Seq Scan on event  (cost=0.00..36141.44 rows=2256 width=314) (actual time=0.080..1475.041 rows=647537 loops=1)
         Filter: ((description)::text ~~ '%mismatch%'::text)
 Total runtime: 22547.292 ms
(6 rows)

But a starts-with query uses indexes.

EXPLAIN ANALYZE SELECT event_id, category, current_session_number, description, 
event_type_id, realm_name, root_session_number, severity, source_name, 
target_key, target_name, timestamp, jdo_version FROM event WHERE description 
like 'mismatch%' ORDER BY timestamp desc;
                                   QUERY PLAN
--------------------------------------------------------------------------
 Sort  (cost=9.26..9.27 rows=1 width=314) (actual time=0.766..0.766 rows=0 loops=1)
   Sort Key: timestamp
   Sort Method:  quicksort  Memory: 17kB
   ->  Index Scan using description_idx on event  (cost=0.00..9.25 rows=1 width=314) (actual time=0.741..0.741 rows=0 loops=1)
         Index Cond: (((description)::text ~>=~ 'mismatch'::text) AND ((description)::text ~<~ 'mismatci'::text))
         Filter: ((description)::text ~~ 'mismatch%'::text)
 Total runtime: 0.919 ms
(7 rows)

Are there any tweaks to force pgsql to use the index on description?

Balaji

P.S The event database has 700k records.



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83





Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] LIKE Query performance

2009-01-27 Thread Oleg Bartunov

Only wildspeed http://www.sai.msu.su/~megera/wiki/wildspeed
has index support for %text% 
But, it has limitations.


Oleg
On Tue, 27 Jan 2009, Hari, Balaji wrote:


Hi,

I am relatively new to PostgreSQL(8.1) and facing the following problem.

We have indexes defined on timestamp and description (create index 
description_idx on event using btree (description varchar_pattern_ops))

EXPLAIN ANALYZE SELECT event_id, category, current_session_number, description, 
event_type_id, realm_name, root_session_number, severity, source_name, 
target_key, target_name, timestamp, jdo_version FROM event WHERE description 
like '%mismatch%' ORDER BY timestamp desc;
                                   QUERY PLAN
--------------------------------------------------------------------------
 Sort  (cost=36267.09..36272.73 rows=2256 width=314) (actual time=19255.075..20345.774 rows=647537 loops=1)
   Sort Key: timestamp
   Sort Method:  external merge  Disk: 194080kB
   ->  Seq Scan on event  (cost=0.00..36141.44 rows=2256 width=314) (actual time=0.080..1475.041 rows=647537 loops=1)
         Filter: ((description)::text ~~ '%mismatch%'::text)
 Total runtime: 22547.292 ms
(6 rows)

But a starts-with query uses indexes.

EXPLAIN ANALYZE SELECT event_id, category, current_session_number, description, 
event_type_id, realm_name, root_session_number, severity, source_name, 
target_key, target_name, timestamp, jdo_version FROM event WHERE description 
like 'mismatch%' ORDER BY timestamp desc;
                                   QUERY PLAN
--------------------------------------------------------------------------
 Sort  (cost=9.26..9.27 rows=1 width=314) (actual time=0.766..0.766 rows=0 loops=1)
   Sort Key: timestamp
   Sort Method:  quicksort  Memory: 17kB
   ->  Index Scan using description_idx on event  (cost=0.00..9.25 rows=1 width=314) (actual time=0.741..0.741 rows=0 loops=1)
         Index Cond: (((description)::text ~>=~ 'mismatch'::text) AND ((description)::text ~<~ 'mismatci'::text))
         Filter: ((description)::text ~~ 'mismatch%'::text)
 Total runtime: 0.919 ms
(7 rows)

Are there any tweaks to force pgsql to use the index on description?

Balaji

P.S The event database has 700k records.



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] how to estimate shared_buffers...

2008-07-12 Thread Oleg Bartunov

On Sat, 12 Jul 2008, Jessica Richard wrote:


On a running production machine, we have 900M configured on a 16G-memory
Linux host. The db size for all dbs combined is about 50G. There are many
transactions going on all the time (deletes, inserts, updates). We do not
have a testing environment with the same setup and the same amount of
workload. I want to evaluate on the production host whether this 900M is
enough; if not, we still have room to go up a little bit to speed up all
Postgres activities. I don't know enough about the SA side. I would just
imagine that if something like the top command or some other tool can
measure how much total memory Postgres is actually using (against the
configured 900M shared buffers), and if Postgres is using almost 900M all
the time, I would take this as an indication that shared_buffers can go up
by another 100M...

What is the best way to tell how much memory Postgres (all Postgres-related
things) is actually using?


there is contrib/pg_buffercache, which can tell you about the usage of shared
memory. Also, you can estimate how much memory of the OS cache is occupied by
postgres files (tables, indexes). Look at
http://www.kennygorman.com/wordpress/?p=246 for some details.

I wrote a perl script which simplifies estimation of OS buffers, but
it's not yet ready for the public.
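
A rough sketch of the kind of query pg_buffercache enables (run in the
database of interest, with the module installed): the top relations by
buffers currently held.

SELECT c.relname, count(*) AS buffers
  FROM pg_buffercache b
  JOIN pg_class c ON b.relfilenode = c.relfilenode
 GROUP BY c.relname
 ORDER BY buffers DESC
 LIMIT 10;

If the hottest relations together already fill nearly all of the 900M, that
is a hint that raising shared_buffers may help.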


Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] IN() statement values order makes 2x performance hit

2008-05-29 Thread Oleg Bartunov

You may try contrib/intarray, which we developed specifically for
denormalization.
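
A minimal sketch of the intarray idea (table and column names are
illustrative), assuming the feed ids are denormalized into an int[] column:

CREATE INDEX items_feed_ids_idx
    ON items USING gist (feed_ids gist__int_ops);

SELECT id FROM items
 WHERE feed_ids && '{54461}'::int[];  -- overlap: matches any listed id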

Oleg
On Thu, 29 May 2008, Alexey Kupershtokh wrote:


Hello everybody!

I have found a performance issue with 2 equivalent queries stably taking
different (~x2) time to finish. In just a few words it can be described
like this: if you have a lot of values in an IN() statement, you should
put most heavy (specifying most number of rows) ids first.
This is mostly just a bug submit, than looking for help.

So this is what I have:
 *  RHEL
 *  PostgreSQL 8.3.1
 *  A table ext_feeder_item with ~4.6M records.
kia=# \d+ ext_feeder_item
                                 Table "public.ext_feeder_item"
  Column  |           Type           |                          Modifiers                            | Description
----------+--------------------------+---------------------------------------------------------------+-------------
 id       | bigint                   | not null default nextval('ext_feeder_item_id_seq'::regclass) |
 feed_id  | bigint                   | not null                                                      |
 pub_date | timestamp with time zone |                                                               |
Indexes:
    "ext_feeder_item_pkey" PRIMARY KEY, btree (id)
    "ext_feeder_item_feed_id_pub_date_idx" btree (feed_id, pub_date)
    "ext_feeder_item_idx" btree (feed_id)
Triggers:

Has OIDs: no
 *  Statistics for the fields feed_id and pub_date are set to 1000;
 *  The table have just been vacuumed and analyzed.
 *  A simple query to the table:
SELECT
id
FROM
ext_feeder_item AS i
WHERE
i.feed_id IN (...)
ORDER BY pub_date DESC, id DESC
LIMIT 11 OFFSET 0;

with many (~1200) ids in the IN() statement.
 *  The distribution of row counts for these ids (they may be thought of as
foreign keys in this table) is the following:
id = 54461: ~18 - actually the heaviest id in the whole table.
other ids: a single id specifies at most 2032 rows; 6036 rows total.
 *  If I perform a query with
IN(54461, ...)
it stably (5 attempts) takes 4.5..4.7 secs. to perform.
QUERY PLAN
 Limit  (cost=1463104.22..1463104.25 rows=11 width=16) (actual time=4585.420..4585.452 rows=11 loops=1)
   ->  Sort  (cost=1463104.22..1464647.29 rows=617228 width=16) (actual time=4585.415..4585.425 rows=11 loops=1)
         Sort Key: pub_date, id
         Sort Method:  top-N heapsort  Memory: 17kB
         ->  Bitmap Heap Scan on ext_feeder_item i  (cost=13832.40..1449341.79 rows=617228 width=16) (actual time=894.622..4260.441 rows=185625 loops=1)
               Recheck Cond: (feed_id = ANY ('{54461, ...}'::bigint[]))
               ->  Bitmap Index Scan on ext_feeder_item_idx  (cost=0.00..13678.10 rows=617228 width=0) (actual time=884.686..884.686 rows=185625 loops=1)
                     Index Cond: (feed_id = ANY ('{54461, ...}'::bigint[]))
 Total runtime: 4585.852 ms
 *  If I perform a query with
IN(..., 54461)
it stably (5 attempts) takes 9.3..9.5 secs. to perform.
QUERY PLAN
 Limit  (cost=1463104.22..1463104.25 rows=11 width=16) (actual time=9330.267..9330.298 rows=11 loops=1)
   ->  Sort  (cost=1463104.22..1464647.29 rows=617228 width=16) (actual time=9330.263..9330.273 rows=11 loops=1)
         Sort Key: pub_date, id
         Sort Method:  top-N heapsort  Memory: 17kB
         ->  Bitmap Heap Scan on ext_feeder_item i  (cost=13832.40..1449341.79 rows=617228 width=16) (actual time=1018.401..8971.029 rows=185625 loops=1)
               Recheck Cond: (feed_id = ANY ('{..., 54461}'::bigint[]))
               ->  Bitmap Index Scan on ext_feeder_item_idx  (cost=0.00..13678.10 rows=617228 width=0) (actual time=1008.791..1008.791 rows=185625 loops=1)
                     Index Cond: (feed_id = ANY ('{..., 54461}'::bigint[]))
 Total runtime: 9330.729 ms
I don't know the roots of the problem, but I think some symptomatic
healing could be applied: PostgreSQL could sort the IDs according to the
statistics.
So currently I tend to select the IDs from another table, ordering them
by their weights; that's easy for me thanks to denormalization.

Also, I would expect PostgreSQL to sort the values to make the
index scan more sequential, but this expectation already conflicts with
the bug described above :)




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Re: what worked: performance improvements for geo-spatial searching on FreeBSD

2008-04-08 Thread Oleg Bartunov

On Tue, 8 Apr 2008, Mark Stosberg wrote:


Oleg Bartunov wrote:

Mark,

do you know about our sky segmentation code Q3C?
See details at http://www.sai.msu.su/~megera/wiki/SkyPixelization
We use it for billions of objects in a database and are quite happy.


Oleg,

Thanks for the response. That sounds interesting, but it's not clear to me 
how I would put together a geo-spatial search calculating distances around 
the curvature of the earth using this technique. Is there is a SQL sample for 
this that you could point to?


it's not about calculating distances, but about searching for objects around
a given point.
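
For illustration, a radial (cone) search with Q3C looks roughly like this (a
sketch, assuming the package's q3c_radial_query helper and a table
objects(ra, dec) with coordinates in degrees):

SELECT * FROM objects
 WHERE q3c_radial_query(ra, dec, 11.04, 41.27, 0.5);  -- 0.5-degree cone
                                                      -- around (11.04, 41.27)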



Also, I didn't recognize the names of the techniques you were benchmarking 
against RADEC and Rtree, are either of these related to the 
earthdistance or cube() based searches I would have used already?


Rtree is a standard spatial tree; RADEC is the naive approach of
having two indexes, one on ra (right ascension) and another on dec
(declination).
Both are astronomical coordinates.



 Mark





Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] what worked: performance improvements for geo-spatial searching on FreeBSD

2008-04-07 Thread Oleg Bartunov

Mark,

do you know about our sky segmentation code Q3C?
See details at http://www.sai.msu.su/~megera/wiki/SkyPixelization
We use it for billions of objects in a database and are quite happy.

Oleg

On Mon, 7 Apr 2008, Mark Stosberg wrote:


The http://www.1-800-save-a-pet.com site is hosted with FreeBSD and
PostgreSQL and has a geo-spatial search as a central feature.

One thing that made a substantial performance improvement was switching
from the geo_distance() search in the earthdistance contrib to the
cube-based geo-spatial calculations, also available in contrib/.
In our case, the slight loss of precision between the two methods didn't
matter.
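
For reference, the cube-based pattern looks roughly like this (a sketch with
illustrative table and column names; earth_box() is a fast bounding-box
pre-filter, so exact results need an earth_distance() recheck):

CREATE INDEX venues_earth_idx
    ON venues USING gist (ll_to_earth(lat, lon));

SELECT id
  FROM venues
 WHERE earth_box(ll_to_earth(41.0, -87.6), 80467)  -- ~50 miles, in meters
       @> ll_to_earth(lat, lon);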

The other things that made a noticeable performance improvement was
upgrading our servers from FreeBSD 4.x or 5.x (old, I know!) to FreeBSD
6.2 or later. We also upgrade these systems from PostgreSQL 8.1 to 8.2
at the same time.  I assume the upgrade to 8.2 must be responsible at
least in part for the performance gains.

The result of these two rounds of updates is that our overall CPU
capacity in the cluster seems to be double or triple what it was before.

As the site grows we continue to be very happy with the performance,
features and stability of PostgreSQL.

  Mark





Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] The many nulls problem

2008-03-14 Thread Oleg Bartunov

Kynn,

have you seen contrib/hstore? You can have one table with the common
attributes and hide the others in an hstore column.
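
A minimal sketch (the schema is illustrative): the billing address is stored
only for the minority of customers where it differs from the shipping
address.

CREATE TABLE customers (
    id           serial PRIMARY KEY,
    name         text NOT NULL,
    ship_address text,
    extras       hstore  -- rare attributes, e.g. 'billing_address=>...'
);

SELECT id, extras -> 'billing_address'
  FROM customers
 WHERE extras ? 'billing_address';  -- only rows that carry the key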

Oleg
On Fri, 14 Mar 2008, Kynn Jones wrote:


It often happens that a particular piece of information is non-null for a
small minority of cases.  A superficially different manifestation of this is
when two pieces of information are identical in all but a small minority of
cases.  This can be easily mapped to the previous description by defining a
null in one column to mean that its contents should be obtained from those
of another column.  A further variant of this is when one piece of
information is a simple function of another one in all but a small minority
of cases.

(BTW, I vaguely recall that RDb theorists have a technical term for this
particular design issue, but I don't remember it.)

In all these cases, the design choice, at least according to RDb's 101, is
between including a column in the table that will be NULL most of the time,
or defining a second auxiliary column that references the first one and
holds the non-redundant information for the minority of cases for which this
is necessary (and maybe define a VIEW that includes all the columns).

But for me it is a frequent occurrence that my quaint and simple RDb's 101
reasoning doesn't really apply for PostgreSQL.  Basically, Pg is too smart
for it!  For example, does a large proportion of NULLs really imply a lot of
wasted space?  Maybe this is true for fixed-length data types, but what
about for type TEXT or VARCHAR?

Just to be concrete, consider the case of a customers database for some home
shopping website.  Suppose that, as it happens, for the majority of this
site's customers, the shipping and billing addresses are identical.  Or
consider the scenario of a company in which, for most employees, the email
address can be readily computed from the first and last name using the rule
First M. Last = [EMAIL PROTECTED], but the company allows some
flexibility for special cases (e.g. for people like Yasuhiro Tanaka who's
known to everyone by his nickname, Yaz, the email is
[EMAIL PROTECTED] hardly anyone remembers or even knows his
full name.)

What's your schema design approach for such situations?  How would you go
about deciding whether the number of exceptional cases is small enough to
warrant a second table?  Of course, one could do a systematic profiling of
various possible scenarios, but as a first approximation what's your
rule-of-thumb?

TIA!

Kynn



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] The many nulls problem

2008-03-14 Thread Oleg Bartunov

On Fri, 14 Mar 2008, Kynn Jones wrote:


On Fri, Mar 14, 2008 at 2:59 PM, Oleg Bartunov [EMAIL PROTECTED] wrote:


have you seen contrib/hstore ? You can have one table with common
attributes
and hide others in hstore



That's interesting.  I'll check it out.  Thanks!


actually, hstore was designed specifically for this kind of problem.




Kynn



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Improve Full text rank in a query

2008-03-07 Thread Oleg Bartunov

On Fri, 7 Mar 2008, b wragg wrote:


Hi all,

I'm running the following query to match a supplied text string to an actual
place name which is recorded in a table with extra info like coordinates,
etc.

SELECT ts_rank_cd(textsearchable_index_col , query, 32 /* rank/(rank+1) */)
AS rank,*
FROM gazetteer, to_tsquery('Gunbower|Island|Vic') query
WHERE query @@ textsearchable_index_col order by rank desc, concise_ga desc,
auda_alloc desc LIMIT 10

When I run this I get the following top two results:

Pos  Rank     Name                            State
1    0.23769  Gunbower Island Primary School  Vic
2    0.23769  Gunbower Island                 Vic

The textsearchable_index_col for each of these looks like this:

'vic':6 '':5 'gunbow':1 'island':2 'school':4 'primari':3 'victoria':7
'vic':4 '':3 'gunbow':1 'island':2 'victoria':5

I'm new to this, but I can't figure out why the Gunbower Island Primary
School is getting top place. How do I get the query to improve the ranking
so that an exact match (like Gunbower|Island|Vic) gets a higher position?


you can read the documentation and use a document-length normalization flag,
or write your own ranking function.
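
For example (a sketch): adding normalization flag 1, which divides the rank
by 1 + the logarithm of the document length, penalizes the longer "Primary
School" document relative to the exact match:

SELECT ts_rank_cd(textsearchable_index_col, query, 1|32) AS rank, *
  FROM gazetteer, to_tsquery('Gunbower|Island|Vic') query
 WHERE query @@ textsearchable_index_col
 ORDER BY rank DESC, concise_ga DESC, auda_alloc DESC
 LIMIT 10;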



Thanks,

bw










Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] t1.col like '%t2.col%'

2008-02-27 Thread Oleg Bartunov

On Wed, 27 Feb 2008, Dan Kaplan wrote:


I've got a lot of rows in one table and a lot of rows in another table.  I
want to do a bunch of queries on their join column.  One of these is like
this: t1.col like '%t2.col%'


We have an idea how to speed up wildcard search at the expense of size:
we have to index all permutations of the original word. Then we could
use GIN for queries like a*b.





I know that always sucks.  I'm wondering how I can make it better.  First, I
should let you know that I can likely hold both of these tables entirely in
RAM.  Since that's the case, would it be better to accomplish this with my
programming language?  Also, you should know that in most cases, t1.col and
t2.col are two words or less.  I'm not sure if that matters; I mention it
because it may make tsearch2 perform badly.



contrib/pg_trgm should help you.
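
A minimal sketch of the trigram approach (names are illustrative; note that %
matches by similarity above pg_trgm's threshold, see set_limit() and
show_limit(), rather than by exact substring):

CREATE INDEX t2_col_trgm_idx
    ON t2 USING gist (col gist_trgm_ops);

SELECT t1.col, t2.col
  FROM t1 JOIN t2 ON t1.col % t2.col;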





Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [PERFORM] index usage on arrays

2008-02-07 Thread Oleg Bartunov

Andrew,
what are your queries? Have you seen contrib/intarray and the
GIN index?
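
For a one-dimensional int[] column, a GIN index supports the containment and
overlap operators directly, which replaces the per-element OR construction
(a sketch with illustrative names):

CREATE INDEX t_arr_gin_idx ON t USING gin (arr);

SELECT * FROM t WHERE arr @> ARRAY[42];     -- contains the element
SELECT * FROM t WHERE arr && ARRAY[7, 42];  -- overlaps either element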


On Thu, 7 Feb 2008, andrew klassen wrote:


I am using Postgres 8.2.5.

I have a table that has rows containing a variable length array with a known 
maximum.
I was doing selects on the array elements using an ANY match. The performance
was not too good as my table got bigger. So I added an index on the array.
That didn't help since the select was not using it.  I saw a thread in the
mailing lists stating the index wouldn't be used.

So I created indices on the individual array elements and then do a select
on each element separately and then combine each match using OR.
This did substantially increase the select performance. However, it may
be difficult to maintain this approach over time as the maximum array
size may increase dramatically and forming the query will become tedious.

Is there any alternative to what I am currently doing, other than creating a
row for each array element, i.e. stop using an array and use a separate row
for each array index? The reason I didn't want to take this approach is that
the other columns in the row would be duplicated needlessly.

Thanks, Andrew


 




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PERFORM] Search for fixed set of keywords

2008-01-10 Thread Oleg Bartunov

On Thu, 10 Jan 2008, Jörg Kiegeland wrote:


Did you try integer arrays with GIN (inverted index) ?
I have now tried this, and GIN turned out to be linear time, compared with
GiST, which was acceptable. However, I tested this only for Z=infinity; for
Z=1000, GiST and GIN are both not acceptable.


Sorry, I didn't follow your problem, but GIN should certainly be
logarithmic in the number of unique words. Also, it'd be much clearer
if you showed us your queries and EXPLAIN ANALYZE output.


Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Search for fixed set of keywords

2008-01-09 Thread Oleg Bartunov

Did you try integer arrays with GIN (inverted index) ?

Oleg
On Wed, 9 Jan 2008, Jörg Kiegeland wrote:


Hello,

I have an interesting generic search task, for which I have done different 
performance tests and I would like to share and discuss my results on this 
newsgroup.


So I begin to describe the search task:

=
You have a set of N unique IDs. Every ID is associated with an integer 
scoring value. Every ID is also associated with up to K different keywords 
(there are totally K different keywords K1 ... Kn). Now find the first Z 
best-scored IDs which are associated with a given set of keywords in one of 
two ways:


(C1) The ID must be associated with all keywords of the given set of 
keywords.
(C2) The ID must be associated with at least one keyword of the given set of 
keywords.

=


My tests showed that only a multiple-column approach resulted in an acceptable
query response time. I also tried an int-array approach using GiST, a
sub-string approach, a bit-column approach, and even a sub-string approach
using Solr.
Actually, the int-array approach was 20% faster for Z=infinity, but it became
linear for the test case [Z=1000 and *all* IDs match the search condition].
(To avoid being misunderstood, acceptable time means: with a fixed Z, a fixed
set of keywords K, a fixed query, and an increasing N, response time is
constant up to logarithmic; linear or worse-than-linear time is not
accepted.)


In the multiple-column approach, there is one table. The table has a boolean
column for each keyword. It also has a column for the ID and one for the
scoring. For each keyword column and for the scoring column, a separate index
is created.
C1 is implemented by an AND query on the keyword columns, C2 by an OR query,
and the result is sorted by the scoring column, cutting off after the first Z
results.


However, our requirements for the search task have changed, and I have not
yet managed to find a search approach with acceptable response time for the
following variation: namely, one uses C2 and does not sort by a scoring
column, but uses the number of matched keywords for a given ID as the
scoring value.
The difficulty in this query type is that the scoring depends on the query
itself.


So, has anyone an idea how to solve this query type with acceptable response
time, or can anybody tell/prove that this is theoretically not possible?




---(end of broadcast)---
TIP 6: explain analyze is your friend



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [PERFORM] TB-sized databases

2007-11-26 Thread Oleg Bartunov

We have a several-TB database in production and it works well on an
HP rx1620 dual Itanium2, MSA 20, running Linux. It's read-only storage for
astronomical catalogs with about 4 billion objects. We have a custom
index for spherical coordinates which provides great performance.

Oleg
On Mon, 26 Nov 2007, Peter Koczan wrote:


Hi all,

I have a user who is looking to store 500+ GB of data in a database
(and when all the indexes and metadata are factored in, it's going to
be more like 3-4 TB). He is wondering how well PostgreSQL scales with
TB-sized databases and what can be done to help optimize them (mostly
hardware and config parameters, maybe a little advocacy). I can't
speak on that since I don't have any DBs approaching that size.

The other part of this puzzle is that he's torn between MS SQL Server
(running on Windows and unsupported by us) and PostgreSQL (running on
Linux...which we would fully support). If any of you have ideas of how
well PostgreSQL compares to SQL Server, especially in TB-sized
databases, that would be much appreciated.

We're running PG 8.2.5, by the way.

Peter

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [PERFORM] [GENERAL] Slow TSearch2 performance for table with 1 million documents.

2007-10-05 Thread Oleg Bartunov

On Fri, 5 Oct 2007, Tom Lane wrote:


Benjamin Arai [EMAIL PROTECTED] writes:

# explain analyze select * FROM fulltext_article, to_tsquery
('simple','dog') AS q  WHERE idxfti @@ q ORDER BY rank(idxfti, q) DESC;



   QUERY PLAN

 Sort  (cost=6576.74..6579.07 rows=933 width=774) (actual time=12969.237..12970.490 rows=5119 loops=1)
   Sort Key: rank(fulltext_article.idxfti, q.q)
   ->  Nested Loop  (cost=3069.79..6530.71 rows=933 width=774) (actual time=209.513..12955.498 rows=5119 loops=1)
         ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32) (actual time=0.005..0.006 rows=1 loops=1)
         ->  Bitmap Heap Scan on fulltext_article  (cost=3069.79..6516.70 rows=933 width=742) (actual time=209.322..234.390 rows=5119 loops=1)
               Recheck Cond: (fulltext_article.idxfti @@ q.q)
               ->  Bitmap Index Scan on fulltext_article_idxfti_idx  (cost=0.00..3069.56 rows=933 width=0) (actual time=208.373..208.373 rows=5119 loops=1)
                     Index Cond: (fulltext_article.idxfti @@ q.q)
 Total runtime: 12973.035 ms
(9 rows)


The time seems all spent at the join step, which is odd because it
really hasn't got much to do.  AFAICS all it has to do is compute the
rank() values that the sort step will use.  Is it possible that
rank() is really slow?


can you try rank_cd() instead?
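
That is, the same query with tsearch2's cover-density ranking function
substituted, nothing else changed:

explain analyze select * FROM fulltext_article, to_tsquery('simple','dog') AS q
 WHERE idxfti @@ q ORDER BY rank_cd(idxfti, q) DESC;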




regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Fast tsearch2, trigram matching on short phrases

2007-08-22 Thread Oleg Bartunov

On Wed, 22 Aug 2007, Carlo Stonebanks wrote:

I have read that trigram matching (similarity()) performance degrades when
the matching is on longer strings such as phrases. I need to quickly match
strings and rate them by similarity. The strings are typically one to seven
words in length, and will often include unconventional abbreviations and
misspellings.


I have a stored function which does more thorough testing of the phrases, 
including spelling correction, abbreviation translation, etc... and scores 
the results - I pick the winning score that passes a pass/fail constant. 
However, the function is slow. My solution was to reduce the number of rows 
that are passed to the function by pruning obvious mismatches using 
similarity(). However, trigram matching on phrases is slow as well.


you didn't show us explain analyze of your select.



I have experimented with tsearch2 but I have two problems:

1) I need a score so I can decide if match passed or failed. trigram 
similarity() has a fixed result that you can test, but I don't know if rank() 
returns results that can be compared to a fixed value


2) I need an efficient methodology to create vectors based on trigrams, and a 
way to create an index to support it. My tsearch2 experiment with normal 
vectors used gist(text tsvector) and an on insert/update trigger to populate 
the vector field.


Any suggestions on where to go with this project to improve performance would 
be greatly appreciated.


Carlo



---(end of broadcast)---
TIP 6: explain analyze is your friend



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [PERFORM] Fast tsearch2, trigram matching on short phrases

2007-08-22 Thread Oleg Bartunov

On Wed, 22 Aug 2007, Carlo Stonebanks wrote:


Hi Oleg,


you didn't show us explain analyze of your select.


I didn't because I didn't expect any reaction to it - my understanding is
that trigram matching for phrases is not recommended because of the
performance. Do you believe that I SHOULD expect good performance from
trigram matching on phrases (as opposed to words)?


The problem is in the idea, not in the performance.



Carlo 


---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

  http://www.postgresql.org/about/donate



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] startup caching suggestions

2007-06-25 Thread Oleg Bartunov

On Mon, 25 Jun 2007, Bryan Murphy wrote:


We have a search facility in our database that uses full text indexing to
search about 300,000 records spread across 2 tables.  Nothing fancy there.

The problem is, whenever we restart the database (system crash, lost
connectivity to SAN, upgrade, configuration change, etc.) our data is not
cached and query performance is really sketchy the first five to ten minutes
or so after the restart.  This is particularly problematic because the only
way the data gets cached in memory is if somebody actively searches for it,
and the first few people who visit our site after a restart are pretty much
screwed.

I'd like to know what are the recommended strategies for dealing with this
problem.  We need our search queries to be near instantaneous, and we just
can't afford the startup penalty.


Bryan, did you try the 'dd if=/path/to/your/table of=/dev/null' trick?
It reads your data into the kernel's buffers very fast.



I'm also concerned that Postgres may not be pulling data off the SAN as
efficiently as theory dictates.  What's the best way I can diagnose if the
SAN is performing up to spec?  I've been using iostat, and some of what I'm
seeing concerns me.  Here's a typical iostat output (iostat -m -d 1):

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda               0.00         0.00         0.00          0          0
sdb             102.97         2.03         0.00          2          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0

sda is the OS partition (local), sdb is the primary database partition (SAN),
sdc is the log file partition (SAN), and sdd is used only for backups
(SAN).  I very rarely see sdb MB_read/s much above 2, and most of the time
it hovers around 1 or lower.  This seems awfully goddamn slow to me, but
maybe I just don't fully understand what iostat is telling me.  I've seen
sdc writes get as high as 10 during a database restore.

A few bits of information about our setup:

Debian Linux 2.6.18-4-amd64 (stable)
4x Intel(R) Xeon(R) CPU 5110 @ 1.60GHz (100% dedicated to database)
RAID 1+0 iSCSI partitions over Gig/E MTU 9000 (99% dedicated to database)
8GB RAM
Postgres v8.1.9

The database is only about 4GB in size and the key tables total about 700MB.
Primary keys are CHAR(32) GUIDs

Thanks,
Bryan



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Slow indexscan

2007-06-20 Thread Oleg Bartunov

Mikko,

I don't follow this thread; I just see the familiar spherical coordinates we
work with a lot. If you need a fast radial query, you can use our
q3c package, available from q3c.sf.net. See some details at
http://www.sai.msu.su/~megera/wiki/SkyPixelization


Oleg

On Wed, 20 Jun 2007, Tom Lane wrote:


Mikko Partio [EMAIL PROTECTED] writes:


 Index Scan using tbl_20070601_pkey on tbl_20070601 t1  (cost=0.00..365.13 rows=13 width=137) (actual time=120.83..10752.64 rows=539 loops=1)
   Index Cond: ((validtime >= 2007060200::bigint) AND (validtime <= 20070602235500::bigint) AND (latitude = 60.2744::double precision) AND (longitude = 26.4417::double precision))
   Filter: (parname = 'temperature'::character varying)


You do realize that's going to scan the entire index range from
2007060200 to 20070602235500?

If this is a typical query you'd be better off putting the lat/long
columns first in the index.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Cannot make GIN intarray index be used by the planner

2007-05-09 Thread Oleg Bartunov

Do you have both indexes (GiST, GIN) on the same table ?

On Wed, 9 May 2007, Valentine Gogichashvili wrote:


Hello all,

I am trying to move from GiST intarray index to GIN intarray index, but my
GIN index is not being used by the planner.

The normal query is like that

select *
from sourcetablewith_int4
where ARRAY[myint] <@ myint_array
  and some_other_filters

(with GiST index everything works fine, but GIN index is not being used)

If I create the same table populating it with text[] data like

select myint_array::text[] as myint_array_as_textarray
into newtablewith_text
from sourcetablewith_int4

and then create a GIN index using this new text[] column

the planner starts to use the index and queries run with great speed when
the query looks like this:

select *
from newtablewith_text
where ARRAY['myint'] <@ myint_array_as_textarray
  and some_other_filters

Where can the problem be with the _int4 GIN index in this constellation?

For now, enable_seqscan is set to off in the configuration.

With best regards,

-- Valentine Gogichashvili



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] index structure for 114-dimension vector

2007-04-26 Thread Oleg Bartunov

On Fri, 27 Apr 2007, Alexander Staubo wrote:


On 4/20/07, Andrew Lazarus [EMAIL PROTECTED] wrote:

I have a table with 2.5 million real[] arrays. (They are points in a
time series.) Given a new array X, I'd like to find, say, the 25
closest to X in some sense--for simplification, let's just say in the
usual vector norm. Speed is critical here, and everything I have tried
has been too slow.


Let me chime in with the observation that this is a multidimensional
nearest-neighbour problem (reverse nearest neighbour and its close cousin,
k-NN) that is well known in statistics, and particularly relevant to
statistical learning and classification. Knowing the jargon might help
you dig up efficient algorithms to mine your data; there are tons of
fascinating papers available through Citeseer.

In particular, I recommend the paper Efficient k-NN Search on
Vertically Decomposed Data by de Vries et al, SIGMOD 2002 (PDF here:
http://citeseer.ist.psu.edu/618138.html), if only for inspiration. It
proposes an algorithm called BOND to drastically reduce the search
space by probalistic means. They give an example using image
histograms, but the algorithm applies to any multidimensional data.
Briefly put, it points out that proximity comparison can be computed
vertically, a few dimensions at a time, and entire subsets can be
thrown away when it's apparent that they are below a statistically
derived lower bound. The only gotcha is that the algorithm derives
much of its performance from the assumption that your data is
vertically decomposed, one table per dimension, otherwise the search
effectively incurs a sequential scan of the entire dataset, and then
you're pretty much back to square one.

The most common approach to nearest neighbour search is to use a
spatial data structure. The classic algorithm is the kd-tree
(http://en.wikipedia.org/wiki/Kd-tree) and there's the newer K-D-B
tree, neither of which are available in PostgreSQL. If I remember
correctly, R-trees have also been shown to be useful for high numbers
of dimensions; with PostgreSQL you have R-trees and even better
R-tree-equivalent support through GiST. I have no idea whether you can
actually munge your integer vectors into something GiST can index and
search, but it's a thought. (GiST, presumably, can also theoretically
index kd-trees and other spatial trees.)


you're right, but currently only theoretically, due to an interface
restriction.
We have a plan to improve it sometime. There was the SP-GiST project, which
could be used for a K-D-B tree; see http://www.cs.purdue.edu/spgist/

I don't know if it works with the 8.2 version. Also, it doesn't support
concurrency and recovery.



Alexander.

---(end of broadcast)---
TIP 4: Have you searched our list archives?

 http://archives.postgresql.org



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [PERFORM] Simple query, 10 million records...MySQL ten times faster

2007-04-26 Thread Oleg Bartunov

Folks,

we in astronomy permanently work with billions of objects with spherical
attributes and have several sky-indexing schemes. See my page
for links: http://www.sai.msu.su/~megera/wiki/SkyPixelization

We have q3c package for PostgreSQL available from q3c.sf.net, which 
we use in production with terabytes-sized database.


Oleg
On Thu, 26 Apr 2007, Scott Marlowe wrote:


On Tue, 2007-04-24 at 16:26, zardozrocks wrote:

I have this table:

CREATE TABLE test_zip_assoc (
id serial NOT NULL,
f_id integer DEFAULT 0 NOT NULL,
lat_radians numeric(6,5) DEFAULT 0.0 NOT NULL,
long_radians numeric(6,5) DEFAULT 0.0 NOT NULL
);


As someone else mentioned, numeric types are SLOW.  See if you can use
integers, or at least floats.

I also wonder if you might be better served with geometric types and
GiST indexes on them than using your lat / long grid.  With a geometric
type, you could define the lat / long as a point and use geometric
operations with it.

See the pgsql manual:

http://www.postgresql.org/docs/8.1/static/datatype-geometric.html
http://www.postgresql.org/docs/8.1/static/functions-geometry.html
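
A sketch of that idea (names are illustrative): storing each location as a
degenerate box lets a stock GiST index answer the range query as a single
overlap test.

ALTER TABLE test_zip_assoc ADD COLUMN loc box;
UPDATE test_zip_assoc
   SET loc = box(point(long_radians, lat_radians),
                 point(long_radians, lat_radians));
CREATE INDEX test_zip_assoc_loc_idx
    ON test_zip_assoc USING gist (loc);

SELECT id FROM test_zip_assoc
 WHERE loc && box(point(-1.35446228028, 0.69014816041),
                  point(-1.32923017502, 0.71538026567));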


It's basically a table that associates some foreign_key (for an event,
for instance) with a particular location using longitude and
latitude.  I'm basically doing a simple proximity search.  I have
populated the database with *10 million* records.  I then test
performance by picking 50 zip codes at random and finding the records
within 50 miles with a query like this:


I assume that there aren't 10 million zip codes, right?

Are you storing the lat / long of the individual venues?  Or the zip
codes?  If you're storing the lat / long of the zips, then I can't
imagine there are 10 million zip codes.  If you could use the lat / long
numbers to find the zip codes that are in your range, then join that to
a venue table that fks off of the zip code table, I would think it would
be much faster, as you'd have a smaller data set to trundle through.


SELECT id
FROM test_zip_assoc
WHERE
    lat_radians > 0.69014816041
    AND lat_radians < 0.71538026567
    AND long_radians > -1.35446228028
    AND long_radians < -1.32923017502


On my development server (dual proc/dual core Opteron 2.8 Ghz with 4GB
ram) this query averages 1.5 seconds each time it runs after a brief
warmup period.  In PostGreSQL it averages about 15 seconds.


I wonder how well it would run if you had 10, 20, 30, 40 etc... users
running it at the same time.  My guess is that you'll be very lucky to
get anything close to linear scaling in any database.  That's because
this is CPU / Memory bandwidth intensive, so it's gonna kill your DB.
OTOH, if it was I/O bound you could throw more hardware at it (bb cache
RAID controller, etc)


Both of those times are too slow.  I need the query to run in under a
second with as many as a billion records.  I don't know if this is
possible but I'm really hoping someone can help me restructure my
indexes (multicolumn?, multiple indexes with a 'where' clause?) so
that I can get this running as fast as possible.


You're trying to do a whole lot of processing in a little time.  You're
either gonna have to accept a less exact answer (i.e. base it on zip
codes) or come up with some way of mining the data for the answers ahead
of time, kind of like a full text search for lat and long.

So, have you tried what I suggested about increasing shared_buffers and
work_mem yet?

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PERFORM] Array indexes, GIN?

2007-03-01 Thread Oleg Bartunov

On Thu, 1 Mar 2007, Josh Berkus wrote:


Adam,


On the surface, looks like a job for GIN, but GIN seems undocumented,
specifically mentions it doesn't support the deletes we'll have many of
since it's designed for word searching apparently, the performance
implications are undocumented. I searched, I read, and even IRC'd, and
it seems like GIN is just not used much.


It's new (as of 8.2).  And the authors, Oleg and Teodor, are notorious for
skimpy documentation.


We're getting better, we have 72 pages written about new FTS :)



I'd start with the code in INTARRAY contrib module (also by Teodor) and bug
them on pgsql-hackers about helping you implement a GIN index for arrays.


GIN already has support for one-dimensional arrays, and intarray, in
particular, also has GIN support.
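
For example, a minimal sketch of what that looks like in 8.2:

create table items (id serial primary key, tags int[]);
create index items_tags_idx on items using gin (tags);
-- find rows whose tags array contains 42
select id from items where tags @> array[42];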


Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [PERFORM] Writting a search engine for a pgsql DB

2007-02-28 Thread Oleg Bartunov

On Wed, 28 Feb 2007, Dave Page wrote:


Steinar H. Gunderson wrote:

On Tue, Feb 27, 2007 at 01:33:47PM +, Dave Page wrote:

When we outgrow PostgreSQL & Tsearch2, then, well, we'll need to stop
pretending to be Google...


Just for the record: Google has been known to sponsor sites in need with
Google Minis and such earlier -- I don't know what their[1] policy is on the
matter, but if tsearch2 should at some point stop being usable for indexing
postgresql.org, asking them might be worth a shot.


I think if postgresql.org outgrows tsearch2 then the preferred solution
would be to improve tsearch2/postgresql, but thanks for the tip :-)


Guys, current tsearch2 should work with millions of documents. Actually,
the performance killer is the necessity to consult the heap to calculate rank,
which is unavoidably slow, since one needs to read all records.
Search itself is incredibly fast ! If we find a way to store additional
information in the index and work out the visibility issue, full text search
will be damn fast.



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [PERFORM] pg_trgm performance

2007-02-26 Thread Oleg Bartunov

On Mon, 26 Feb 2007, Guillaume Smet wrote:


On 2/24/07, Steinar H. Gunderson [EMAIL PROTECTED] wrote:

Thanks for your time.


GIN version, short:
   ->  Bitmap Heap Scan on tags  (cost=8.64..151.79 rows=41 width=0)
(actual time=5.555..30.157 rows=7 loops=1)
         Filter: (title % 'foo'::text)
         ->  Bitmap Index Scan on trgm_idx  (cost=0.00..8.63 rows=41
width=0) (actual time=2.857..2.857 rows= loops=1)
               Index Cond: (title % 'foo'::text)


This is currently the worst case in the GiST vs. GIN comparison, because
in the index scan the GIN version doesn't have the length of the indexed
string. So it returns a lot of rows which contain every trigram of your
search string but in fact have a low similarity due to the length of
the indexed string (many index rows reduced to 7 rows by the filter).
It cannot be fixed at the moment due to the way GIN indexes work.

So, the GIN version seems to be a bit faster for long queries, but it's
still too slow -- in fact, _unindexed_ versions give 141ms, 342ms, 725ms
for these three queries, so for the longer queries, the gain is only about
a factor of two. (By the way, I would like to stress that this is not my
personal music collection! :-P)


The fact is that pg_trgm is designed to index words and not to index
long sentences. I'm not that surprised it's slow in your case.

It's also my case but following the instructions in README.pg_trgm I
created a dictionary of words using tsearch2 (stat function) and I use
pg_trgm on this dictionary to find similar words in my dictionary.

For example, I rewrite the search:
auberge cevenes
as:
(auberge | auberges | aubberge | auberg) & (ceven | cene | cevenol |
cevennes)

using pg_trgm, and my query can find "Auberge des Cévennes" (currently
it's limited to the 4 most similar words but I can change it
easily).
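
In other words (a sketch, assuming a one-column words table built
beforehand from tsearch2's stat() output):

create index words_trgm_idx on words using gist (word gist_trgm_ops);

select word, similarity(word, 'cevenes') as sml
from words
where word % 'cevenes'
order by sml desc
limit 4;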


Did you rewrite the query manually or use the rewrite feature of tsearch2 ?



--
Guillaume




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [PERFORM] pg_trgm performance

2007-02-26 Thread Oleg Bartunov

On Mon, 26 Feb 2007, Guillaume Smet wrote:


On 2/26/07, Oleg Bartunov oleg@sai.msu.su wrote:

Did you rewrite the query manually or use the rewrite feature of tsearch2 ?


Currently, it's manual. I perform a pg_trgm query for each word of the
search words (a few stop words excluded) and I generate the ts_query
with the similar words instead of using the search words.
Is there any benefit of using rewrite() in this case?


not really, just a matter of possible interesting architectural design.

Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [PERFORM] optimizing a geo_distance() proximity query

2007-02-03 Thread Oleg Bartunov

Mark,

in astronomy we extensively use this kind of query, which we call a
radial search or cone search. There are several algorithms which perform
such queries efficiently using spherical coordinates.
Specifically, we use our Q3C algorithm, see

http://www.sai.msu.su/~megera/wiki/SkyPixelization for details,
which was designed for PostgreSQL and is freely available.

The paper is http://lnfm1.sai.msu.ru/~math/docs/adass_proceedings2005.pdf
Web site - http://q3c.sf.net/


Oleg

On Sat, 3 Feb 2007, Mark Stosberg wrote:



Hello,

I'm using geo_distance() from contrib/earthdistance and would like to find a
way to speed up the geo distance calculation if possible. This is for a
proximity search: Show me adoptable pets within 250 miles of this
zipcode.

I've researched a number of approaches to this, but none seem as workable
as I would have hoped.

I read this claim [1] that Jobster used a lookup table of pre-calculated
distances between zipcodes...it took about 100 million rows.

1. http://bostonsteamer.livejournal.com/831325.html

I'd like to avoid that, but I think there's a sound concept in there: we're
repeatedly making a complex calculation with the same inputs, and the
outputs are always the same.

The zipdy project [2] used some interesting approaches, both similar to
the large table idea. One variation involved a PL routine that would
look up the result in a cache table. If no result was found, it would
compute the result and add it to the cache table. Besides
eventually creating millions of rows in the cache table, I tried this
technique and found it was much slower than using geo_distance() without
a cache. Another variation in the zipdy distribution just uses several
smaller cache tables, like one for zipcodes less than 25 miles away and
one for zipcodes less than 50 miles away.  Equally unattractive.

2. http://www.cryptnet.net/fsp/zipdy/

I looked at doing the calculation outside of PostgreSQL, and passing in
the resulting list of zipcodes in an explicit IN() list. This seemed
promising at first. Geo::Postalcode (Perl) could do the calculation in
5ms, which seemed to beat PostgreSQL. For a small proximity, I think
that combination might have performed better. However, some places have
close to 5,000 zipcodes within 250 miles. I tried putting /that/
resulting list into an explicit IN() clause, and it got much slower. :)

I did find there are some possible optimizations that can be made to the
Haversine algorithm itself. As this post pointed out [3], we could
pre-convert the lat/lon pair to radians, and also compute their sin()
and cos() values. However, the person suggesting this approach provided
no benchmarks to suggest it was worth it, and I have no evidence so far
that it matters either.

3.
http://www.voxclandestina.com/2006-09-01/optimizing-spatial-proximity-searches-in-sql/

What strikes me to consider at this point are a couple of options:

A. Explore a way add some memory caching or memoizing to
geo_distance() so it would hold on to frequently pre-computed values,
but without storing all the millions of possibilities.

B. Look at an alternate implementation. I suspect that given a small
enough radius and the relatively large size of zipcodes, a simpler
representation of the Earth's curvature could be used, with a sufficient
accuracy. Perhaps a cylinder, or even a flat projection... We currently
max out at 250 miles. ( I just discussed this option with my wife, the
math teacher. :)
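
For what it's worth, option B can be very small. A sketch of the
equirectangular approximation (3959 is the Earth's radius in miles;
the accuracy is fine for radii much smaller than the Earth):

create or replace function approx_miles(lat1 float8, lon1 float8,
                                        lat2 float8, lon2 float8)
returns float8 as $$
    -- x = delta-longitude scaled by cos(mean latitude), y = delta-latitude
    select 3959.0 * sqrt(
           pow(radians($3 - $1), 2)
         + pow(cos(radians(($1 + $3) / 2.0)) * radians($4 - $2), 2));
$$ language sql immutable;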

Advice from other people who have deployed high-performance proximity
searches with PostgreSQL would be appreciated!

  Mark



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [pgsql-advocacy] [PERFORM] Postgres and really huge tables

2007-01-19 Thread Oleg Bartunov

On Thu, 18 Jan 2007, Tom Lane wrote:


Brian Hurt [EMAIL PROTECTED] writes:

Is there any experience with Postgresql and really huge tables?  I'm
talking about terabytes (plural) here in a single table.


The 2MASS sky survey point-source catalog
http://www.ipac.caltech.edu/2mass/releases/allsky/doc/sec2_2a.html
is 470 million rows by 60 columns; I don't have it loaded up but
a very conservative estimate would be a quarter terabyte.  (I've
got a copy of the data ... 5 double-sided DVDs, gzipped ...)
I haven't heard from Rae Stiening recently but I know he's been using
Postgres to whack that data around since about 2001 (PG 7.1 or so,
which is positively medieval compared to current releases).  So at
least for static data, it's certainly possible to get useful results.
What are your processing requirements?


We are working in production with 2MASS and other catalogues, and
2MASS is not the biggest. The NOMAD catalog has more than a billion records.
You could query them online
http://vo.astronet.ru/cas/conesearch.php
Everything is in PostgreSQL 8.1.5, and at present we are migrating to 8.2.1,
which is very slow, due to slow COPY.
The hardware we use is HP rx1620, dual Itanium2, MSA 20, currently
4.5 Tb.





regards, tom lane




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [PERFORM] Regex performance issue

2006-12-02 Thread Oleg Bartunov

I may be missing something, but I'd use tsearch2. Check the
intdict dictionary for the basic idea -  http://www.sai.msu.su/~megera/wiki/Gendict

Oleg
On Sat, 2 Dec 2006, Alexandru Coseru wrote:


Hello...

I cannot use LIKE, because the order of the match is reversed.
The prefix column contains telephone destinations.
E.g.: ^001 - US, ^0039 - Italy, etc.

Here is a small sample:

asterisk= select * from destlist LIMIT 10;
id | id_ent | dir |   prefix   |   country   |  network   | tip
++-++-++-
1 | -1 |   0 | (^0093)| AFGHANISTAN | AFGHANISTAN|   6
2 | -1 |   0 | (^00937)   | AFGHANISTAN | AFGHANISTAN Mobile |   5
3 | -1 |   0 | (^00355)   | ALBANIA | ALBANIA|   6
4 | -1 |   0 | (^0035538) | ALBANIA | ALBANIA Mobile |   5
5 | -1 |   0 | (^0035568) | ALBANIA | ALBANIA Mobile |   5
6 | -1 |   0 | (^0035569) | ALBANIA | ALBANIA Mobile |   5
7 | -1 |   0 | (^00213)   | ALGERIA | ALGERIA|   6
8 | -1 |   0 | (^0021361) | ALGERIA | ALGERIA Mobile |   5
9 | -1 |   0 | (^0021362) | ALGERIA | ALGERIA Mobile |   5
10 | -1 |   0 | (^0021363) | ALGERIA | ALGERIA Mobile |   5


Now, I have to match a dialed number (let's say 00213618833) and find
its destination... (It's Algeria Mobile.)
I tried to write a query using LIKE, but I was not able to get it
to work.
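
(For what it's worth, if the prefixes were stored as bare digits with no
regex decoration, LIKE does work in the "reversed" direction -- a sketch:)

-- assuming prefix stored as plain digits, e.g. '0021361'
select *
from destlist
where '00213618833' like prefix || '%'
order by length(prefix) desc
limit 1;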



Regards
  Alex





- Original Message - From: Dave Dutcher [EMAIL PROTECTED]
To: 'Alexandru Coseru' [EMAIL PROTECTED]; 
pgsql-performance@postgresql.org

Sent: Saturday, December 02, 2006 10:36 PM
Subject: RE: [PERFORM] Regex performance issue



-Original Message-
From: [EMAIL PROTECTED] On Behalf Of Alexandru Coseru
asterisk= explain analyze SELECT * FROM destlist WHERE
'0039051248787' ~
prefix AND id_ent='-2' AND dir=0 ORDER by length(prefix) DESC;


QUERY PLAN
--
--
 Sort  (cost=7925.07..7925.15 rows=31 width=67) (actual
time=857.715..857.716 rows=2 loops=1)
   Sort Key: length((prefix)::text)
   ->  Bitmap Heap Scan on destlist  (cost=60.16..7924.30 rows=31 width=67)
(actual time=2.156..857.686 rows=2 loops=1)
         Recheck Cond: ((id_ent = -2) AND (dir = 0))
         Filter: ('0039051248787'::text ~ (prefix)::text)
         ->  Bitmap Index Scan on destlist_indx2  (cost=0.00..60.16
rows=6193 width=0) (actual time=1.961..1.961 rows=5205 loops=1)
               Index Cond: ((id_ent = -2) AND (dir = 0))
 Total runtime: 857.804 ms
(8 rows)


mmumu btree (prefix varchar_pattern_ops)



I'm surprised Postgres isn't using the index on prefix, seeing as the index
uses the varchar_pattern_ops operator class.  It could be that the index
isn't selective enough, or is Postgres not able to use an index with POSIX
regular expressions?  The docs seem to say that it can, but I'd be curious
to see what happens if you use LIKE instead of ~.

Dave








Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [PERFORM] Regex performance issue

2006-12-02 Thread Oleg Bartunov

On Sun, 3 Dec 2006, Alexandru Coseru wrote:


Hello..

I have never used tsearch2, but at first glance I would not see any
major improvement, because the main advantage of tsearch is splitting
a phrase into words.

But here I only have one word (no spaces).

Oh, yes, I was confused :) What if you consider your prefix as
1.2.3.4.5.6? Then you could try our contrib/ltree module.
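
A sketch of that idea, assuming the prefixes are re-stored as dotted
digit paths for contrib/ltree:

-- '^0021361' becomes the label path '0.0.2.1.3.6.1'
create table destpath (id int, prefix ltree, country text, network text);
create index destpath_prefix_idx on destpath using gist (prefix);

-- longest matching prefix for the dialed number 00213618833
select *
from destpath
where prefix @> '0.0.2.1.3.6.1.8.8.3.3'::ltree
order by nlevel(prefix) desc
limit 1;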



Oleg





Regards
  Alex



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [PERFORM] Help optimizing a slow index scan

2006-03-18 Thread Oleg Bartunov

I may be wrong, but we in astronomy have several sky indexing schemes which
allow effective use of a classical btree index. See
http://www.sai.msu.su/~megera/oddmuse/index.cgi/SkyPixelization
for details. Sergei Koposov has developed the Q3C contrib module for
PostgreSQL 8.1+ and we use it with billion-row astronomical catalogs.



Oleg

On Fri, 17 Mar 2006, Michael Fuhr wrote:


On Fri, Mar 17, 2006 at 11:41:11PM -0500, Tom Lane wrote:

Dan Harris [EMAIL PROTECTED] writes:

Furthermore, by doing so, I am tying my queries directly to
postgres-isms.  One of the long term goals of this project is to be
able to fairly transparently support any ANSI SQL-compliant back end
with the same code base.


Unfortunately, there isn't any portable or standard (not exactly the
same thing ;-)) SQL functionality for dealing gracefully with
two-dimensional searches, which is what your lat/long queries are.


The OpenGIS Simple Features Specification[1] is a step in that
direction, no?  PostGIS[2], MySQL[3], and Oracle Spatial[4] implement
to varying degrees.  With PostGIS you do have to add non-standard
operators to a query's predicate to benefit from GiST indexes on
spatial columns, but the rest of the query can be straight out of
the SQL and OGC standards.

[1] http://www.opengeospatial.org/docs/99-049.pdf
[2] http://www.postgis.org/
[3] http://dev.mysql.com/doc/refman/5.0/en/spatial-extensions.html
[4] http://www.oracle.com/technology/products/spatial/index.html




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [PERFORM] Help optimizing a slow index scan

2006-03-18 Thread Oleg Bartunov

On Fri, 17 Mar 2006, Evgeny Gridasov wrote:


Try contrib/btree_gist.


contrib/btree_gist does nothing more than the built-in btree - it just
provides support for multicolumn GiST indices.


I've tried that one, but for my case it didn't help much.
The performance was almost equal or even slower than built-in btree.

On Fri, 17 Mar 2006 08:53:44 -0700
Dan Harris [EMAIL PROTECTED] wrote:


Dan Harris wrote:

Markus Bertheau wrote:

Have you tried using a GIST index on lat & long? These things are
meant for two-dimensional data, whereas btree doesn't handle
two-dimensional data that well. How many rows satisfy either of the
long / lat condition?





According to the analyze, less than 500 rows matched.  I'll look into
GIST indexes, thanks for the feedback.

-Dan



When I try to create a GIST index, I get the following error:

create index eventgeo_lat_idx on eventgeo using GIST (lat);

ERROR:  data type double precision has no default operator class for
access method gist
HINT:  You must specify an operator class for the index or define a
default operator class for the data type.

I'm not sure what a default operator class is, exactly..

-Dan
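
For the record, core has no default GiST opclass for bare float8 columns;
GiST wants a 2-D type. A sketch of one standard workaround using
contrib/cube + contrib/earthdistance (radius is in meters, and on
releases before 8.2 the cube containment operator is spelled @ rather
than @>):

create index eventgeo_earth_idx on eventgeo
    using gist (ll_to_earth(lat, long));

-- everything within ~16 km of a given point; earth_box is slightly lossy,
-- so recheck with earth_distance() if exactness matters
select * from eventgeo
where earth_box(ll_to_earth(39.75, -105.0), 16000) @> ll_to_earth(lat, long);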








Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [PERFORM] tsearch2 headline and postgresql.conf

2006-01-22 Thread Oleg Bartunov

You didn't provide us any query with explain analyze output.
Just to make sure you're fine.

Oleg
On Sun, 22 Jan 2006, [EMAIL PROTECTED] wrote:


Hi folks,

I'm not sure if this is the right place for this but thought I'd ask.  I'm 
relatively new to postgres, having only used it on 3 projects, and am just
delving into the setup and admin for the second time.


I decided to try tsearch2 for this project's search requirements but am 
having trouble attaining adequate performance.  I think I've nailed it down 
to trouble with the headline() function in tsearch2. 
In short, there is a crawler that grabs HTML docs and places them in a 
database.  The search is done using tsearch2 pretty much installed according 
to instructions.  I have read a couple online guides suggested by this list 
for tuning the postgresql.conf file.  I only made modest adjustments because 
I'm not working with top-end hardware and am still uncertain of the actual 
impact of the different parameters.


I've been learning 'explain' and over the course of reading I have done 
enough query tweaking to discover the source of my headache seems to be 
headline().


On a query of 429 documents, of which the avg size of the stripped down 
document as stored is 21KB, and the max is 518KB (an anomaly), tsearch2 
performs exceptionally well returning most queries in about 100ms.


On the other hand, following the tsearch2 guide which suggests returning that 
first portion as a subquery and then generating the headline() from those 
results, I see the query increase to 4 seconds!


This seems to be directly related to document size.  If I filter out that 
518KB doc along with some 100KB docs by returning substring( stripped_text 
FROM 0 FOR 5) AS stripped_text I decrease the time to 1.4 seconds, but 
increase the risk of not getting a headline.


Seeing as how this problem is directly tied to document size, I'm wondering 
if there are any specific settings in postgresql.conf that may help, or is 
this just a fact of life for the headline() function?  Or, does anyone know 
what the problem is and how to overcome it?





Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-21 Thread Oleg Bartunov

On Sat, 21 Jan 2006, Martijn van Oosterhout wrote:


However, IMHO, this algorithm is optimising the wrong thing. It
shouldn't be trying to split into sets that are far apart, it should be
trying to split into sets that minimize the number of set bits (ie
distance from zero), since that's what's will speed up searching.


Martijn, you're right! We want not only to split the page into very
different parts, but also not to increase the number of set bits in the
resulting signatures, which are the union (OR) of all signatures
in each part. We need not only fast index creation (thanks, Tom !),
but a better index. Some information is available here
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals
There should be a more detailed document, but I don't remember where :)

That's harder though (this algorithm does approximate it, sort of)
and I haven't come up with an algorithm yet


Don't ask how hard we thought :)

Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very

2006-01-21 Thread Oleg Bartunov

On Sat, 21 Jan 2006, Ron wrote:


At 07:23 PM 1/20/2006, Tom Lane wrote:

Steinar H. Gunderson [EMAIL PROTECTED] writes:
 On Fri, Jan 20, 2006 at 06:52:37PM -0500, Tom Lane wrote:
 It's also worth considering that the entire approach is a heuristic,
 really --- getting the furthest-apart pair of seeds doesn't guarantee
 an optimal split as far as I can see.  Maybe there's some totally
 different way to do it.

 For those of us who don't know what tsearch2/gist is trying to accomplish
 here, could you provide some pointers? :-)

Well, we're trying to split an index page that's gotten full into two index 
pages, preferably with approximately equal numbers of items in

each new page (this isn't a hard requirement though).


Maybe we are over thinking this.  What happens if we do the obvious and just 
make a new page and move the last n/2 items on the full page to the new 
page?


Various forms of move the last n/2 items can be tested here:
0= just split the table in half.  Sometimes KISS  works. O(1).
1= the one's with the highest (or lowest) x value.
2= the one's with the highest sum of coordinates (x+y+...= values in the 
top/bottom n/2 of entries).
3= split the table so that each table has entries whose size_waste values add 
up to approximately the same value.

4= I'm sure there are others.
1-5 can be done in O(n) time w/o auxiliary data.  They can be done in O(1) if 
we've kept track of the appropriate metric as we've built the current page.



I think the true figure of merit for a split is how often will subsequent 
searches have to descend into *both* of the resulting pages as opposed to 
just one
--- the less often that is true, the better.  I'm not very clear on what 
tsearch2 is doing with these bitmaps, but it looks like an upper page's 
downlink has the union (bitwise OR) of the one-bits in the values on the 
lower page, and you have to visit the lower page if this union has a 
nonempty intersection with the set you are looking for.  If that's correct, 
what you really want is to divide the values so that the unions of the two 
sets have minimal overlap ... which seems to me to have little to do with 
what the code does at present.

I'm not sure what upper page and lower page mean here?



Teodor, Oleg, can you clarify what's needed here?

Ditto.  Guys what is the real motivation and purpose for this code?


we want not just to split the page into two very distinct parts, but to keep
the resulting signatures, which are the OR of all signatures on the page,
as sparse as possible.
some information available here

http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals

Unfortunately, we're rather busy right now and couldn't be very useful.



Ron 






Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very

2006-01-21 Thread Oleg Bartunov

On Sat, 21 Jan 2006, Ron wrote:


Perhaps a different approach to this problem is called for:

_Managing Gigabytes: Compressing and Indexing Documents and Images_  2ed
Witten, Moffat, Bell
ISBN 1-55860-570-3

This is a VERY good book on the subject.

I'd also suggest looking at the publicly available work on indexing and 
searching for search engines like Inktomi (sp?) and Google.

Ron


Ron,

you completely miss the problem ! We do know MG and other SE.  Actually,
we've implemented several search engines based on inverted index technology 
(see, for example, pgsql.ru/db/pgsearch). tsearch2 was designed for

online indexing, while keeping inverted index online is rather difficult
problem. We do have plan to implement inverted index as an option for
large read-only archives, but now we discuss how to organize online
index and if possible to optimize current storage for signatures 
without breaking search performance.










Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-21 Thread Oleg Bartunov

On Sat, 21 Jan 2006, Martijn van Oosterhout wrote:


On Sat, Jan 21, 2006 at 04:29:13PM +0300, Oleg Bartunov wrote:

Martijn, you're right! We want not only to split the page into very
different parts, but also not to increase the number of set bits in the
resulting signatures, which are the union (OR) of all signatures
in each part. We need not only fast index creation (thanks, Tom !),
but a better index. Some information is available here
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals
There should be a more detailed document, but I don't remember where :)


I see how it works, what I don't quite get is whether the inverted
index you refer to is what we're working with here, or just what's in
tsearchd?


just tsearchd. We plan to implement an inverted index in the PostgreSQL core
and then adapt tsearch2 to use it as an option for read-only archives.




That's harder though (this algorithm does approximate it sort of)
and I havn't come up with an algorithm yet


Don't ask how hard we thought :)


Well, looking at how other people are struggling with it, it's
definitely a Hard Problem. One thing though, I don't think the picksplit
algorithm as is really requires you to strictly have the longest
distance, just something reasonably long. So I think the alternate
algorithm I posted should produce equivalent results. No idea how to
test it though...


you may try our development module 'gevel' to see how dense a signature is.

www=# \d v_pages
  Table public.v_pages
  Column   |   Type| Modifiers
---+---+---
 tid   | integer   | not null
 path  | character varying | not null
 body  | character varying |
 title | character varying |
 di| integer   |
 dlm   | integer   |
 de| integer   |
 md5   | character(22) |
 fts_index | tsvector  |
Indexes:
v_pages_pkey PRIMARY KEY, btree (tid)
v_pages_path_key UNIQUE, btree (path)
v_gist_key gist (fts_index)

# select * from gist_print('v_gist_key') as t(level int, valid bool, a 
gtsvector) where level =1;
 level | valid |   a
---+---+
 1 | t | 1698 true bits, 318 false bits
 1 | t | 1699 true bits, 317 false bits
 1 | t | 1701 true bits, 315 false bits
 1 | t | 1500 true bits, 516 false bits
 1 | t | 1517 true bits, 499 false bits
(5 rows)



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-21 Thread Oleg Bartunov
gevel is available from 
http://www.sai.msu.su/~megera/postgres/gist/


Oleg


Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [PERFORM] High context switches occurring

2005-12-19 Thread Oleg Bartunov
branches row, you will be measuring mostly row-update
contention overhead if there are more concurrent transactions than there
are rows.  In the case -s 1, which is what you've got here, there is no
actual concurrency at all --- all the transactions stack up on the
single branches row.

2. Running a small number of transactions per client means that
startup/shutdown transients overwhelm the steady-state data.  You should
probably run at least a thousand transactions per client if you want
repeatable numbers.

Try something like -s 10 -c 10 -t 3000 to get numbers reflecting test
conditions more like what the TPC council had in mind when they designed
this benchmark.  I tend to repeat such a test 3 times to see if the
numbers are repeatable, and quote the middle TPS number as long as
they're not too far apart.

regards, tom lane





Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [PERFORM] First query is slow, subsequent queries fast

2005-12-07 Thread Oleg Bartunov

Stephan,

your cache is too low :) Try increasing shared_buffers; for example,
with 2Gb of RAM I'd set it to 100,000

On Wed, 7 Dec 2005, Stephan Vollmer wrote:


Hi everybody!

This is my first posting to this list and I'm quite a PostgreSQL
newbie. My question is:

The first time I execute a query, it is very slow, but subsequent
queries are as fast as expected. I would be very glad if somebody
could explain why the first query is so slow and what I could do to
speed it up.

The query operates on a tsearch2 indexed column, but I experienced
the same issue on other tables as well, so I don't think it's a
tsearch2 issue.

To get a better overview of the queries and EXPLAIN outputs, I've
put them on a temporary website, together with the table definition
and my postgresql.conf:

http://dblp.dyndns.org:8080/dblptest/explain.jsp

I'm running PostgreSQL 8.1 on Windows XP SP2, Athlon64 3000+, 2 GB
RAM, 400 GB SATA HDD, 120 GB ATA HDD. The data reside on the first
HDD, the indexes in an index tablespace on the second HDD.

In the example below, the first query is still quite fast compared
to others. Sometimes the first query takes up to 9000 ms (see
website). I've run VACUUM FULL, but it didn't seem to solve the problem.

Thanks very much in advance,

- Stephan



Query:

SELECT keyword, overview
FROM publications
WHERE idx_fti @@ to_tsquery('default', 'linux & kernel')
ORDER BY rank_cd(idx_fti, 'linux & kernel') DESC;



EXPLAIN for first query:

Sort  (cost=859.89..860.48 rows=237 width=299) (actual
time=1817.962..1817.971 rows=10 loops=1)
  Sort Key: rank_cd(idx_fti, '''linux'' & ''kernel'''::tsquery)
  ->  Bitmap Heap Scan on publications  (cost=3.83..850.54 rows=237
width=299) (actual time=1817.839..1817.914 rows=10 loops=1)
        Filter: (idx_fti @@ '''linux'' & ''kernel'''::tsquery)
        ->  Bitmap Index Scan on idx_fti_idx  (cost=0.00..3.83
rows=237 width=0) (actual time=1817.792..1817.792 rows=10 loops=1)
              Index Cond: (idx_fti @@ '''linux'' & ''kernel'''::tsquery)
Total runtime: 1818.068 ms



EXPLAIN for second query:

Sort  (cost=859.89..860.48 rows=237 width=299) (actual
time=4.817..4.826 rows=10 loops=1)
  Sort Key: rank_cd(idx_fti, '''linux'' & ''kernel'''::tsquery)
  ->  Bitmap Heap Scan on publications  (cost=3.83..850.54 rows=237
width=299) (actual time=4.727..4.769 rows=10 loops=1)
        Filter: (idx_fti @@ '''linux'' & ''kernel'''::tsquery)
        ->  Bitmap Index Scan on idx_fti_idx  (cost=0.00..3.83
rows=237 width=0) (actual time=4.675..4.675 rows=10 loops=1)
              Index Cond: (idx_fti @@ '''linux'' & ''kernel'''::tsquery)
Total runtime: 4.914 ms




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [PERFORM] TSearch2 vs. Apache Lucene

2005-12-06 Thread Oleg Bartunov

Folks,

tsearch2 and Lucene are very different search engines, so it'd be an unfair
comparison. If you need full access to metadata and instant indexing,
you'll probably find tsearch2 more suitable than Lucene. But if
you can live without those features and need to search read-only
archives, you need Lucene.

Tsearch2 integration into pgsql would be cool, but I see no problem with
using tsearch2 as an official extension module. After completing our
todo, which we hope will happen for the 8.2 release, you could
forget about Lucene and other engines :) We'll be available for development
in spring and we estimate about three months for our todo, so it's
really doable.

Oleg

On Tue, 6 Dec 2005, Michael Riess wrote:



Has anyone ever compared TSearch2 to Lucene, as far as performance is 
concerned?


I'll stay away from TSearch2 until it is fully integrated in the postgres 
core (like create index foo_text on foo (texta, textb) USING TSearch2). 
Because a full integration is unlikely to happen in the near future (as far 
as I know), I'll stick to Lucene.


Mike




Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Re: [PERFORM] tsearch2/GIST performance factors?

2005-10-17 Thread Oleg Bartunov

On Sat, 15 Oct 2005, Craig A. James wrote:

We are indexing about 5 million small documents using tsearch2/GIST.  Each 
document contains 2 to 50 words.  This is a write once, read many 
situation.  Write performance is unimportant, and the database contents are 
static.  (We build it offline.)


We're having problems with inconsistent performance, and it's very hard to 
separate the effects of various factors.  Here are the things we think may be 
relevant.


1. Total number of words

Our documents currently contain about 110,000 unique words.  Oleg wrote: 
[The limit is] 100K, but it's very fuzzy limit. By trial and error, we've 
learned that 50,000 works well, and 150,000 works poorly, so Oleg's comment 
appears to be a good rule-of-thumb.  (With SIGLENINT enlarged, see below.) 
But there may be other factors that affect this conclusion (such as shared 
memory, total memory, etc.).




Did you consider *decreasing* SIGLENINT ? The size of the index will diminish
and performance could increase. In my current project I use SIGLENINT=15



2. Total size of the table

5 million documents is not a very big database (each document is a few to a 
few hundred bytes), so we don't think this is relevant.



3. Number of documents per word

There seems to be a VERY strong effect related to common words.  When a 
word occurs in more than about 1% of the documents (say 50,000 to 150,000 
documents), performance goes WAY down.  Not just for that specific query, but 
it screws up tsearch2/GIST completely.


We have a test of 100 queries that return 382,000 documents total.  The first 
time we run it, it's slow, about 20 minutes (as expected).  The second time 
we run it, it's very fast, about 72 seconds -- very fast!!  As long as we 
avoid queries with common words, performance is very good.


But, if we run just one query that contains a common word (a word that's in 
more than about 2% of the documents, roughly 150,000 documents), then the 
next time we run the 100 test queries, it will take 20 minutes again.




We can't simply eliminate these common words.  First of all, they can be very 
significant.  Second, it doesn't seem like 2% is common.  I can understand 
that a words like the which occur in most documents shouldn't be indexed. 
But a word that occurs in 2% of the database seems like a very good word to 
index, yet it causes us great problems.




tsearch2's index is a lossy index, read 
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals
so search results  should be rechecked !


I've read a bit about tsearchd, and wonder if it would solve our problem. 
For our application, consistent performance is VERY important.  If we could 
lock the GIST index into memory, I think it would fix our problem.


I think so; tsearchd was designed with static content in mind, and its
index doesn't require rechecking !



I tried copying the GIST indexes (which are in a separate tablespace) to a 1 
GB RAM disk, and it made the initial query faster, but overall performance 
seemed worse, probably because the RAM disk was using memory that could have 
been used by the file-system cache.



4. Available RAM and Disk drives

Would more RAM help?  How would we tell Postgres to use it effectively?  The 
GIST indexes are currently about 2.6 GB on the disk.


try to decrease signature size, say, 
#define SIGLENINT  15



I feel like I'm shooting in the dark -- Linux, Postgres and tsearch2/GIST are 
interacting in ways that I can't predict or analyze.  Thanks very much for 
any comments and advice.


We have our TODO http://www.sai.msu.su/~megera/oddmuse/index.cgi/todo
and hope to find sponsorships for the fts project for the 8.2 release.
Unfortunately, I haven't found spare time to package tsearchd for you;
it should certainly help you.


Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Re: [PERFORM] tsearch2 seem very slow

2005-09-25 Thread Oleg Bartunov

On Mon, 26 Sep 2005, Ahmad Fajar wrote:


Hi Oleg,

Sorry for my late reply. From the stat() function I got 1.5 million rows;
although I've added garbage words to the stop word file, there still seem
to be garbage words. So I asked my team to identify the garbage words and
add them to


what kind of garbage ? Probably you index token types that aren't needed, for
example, email addresses, file names


the stop words, and I will update the articles after that. As for my articles,
they are quite big; the average length is about 2900 characters. I think
I have to tune tsearch2 and concentrate on the garbage words. Most articles
are in the Indonesian language. What other ways are there to tune tsearch2
besides the garbage words?


do you need proximity ? If not, use the strip(tsvector) function to remove
positional information from the tsvector.
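
i.e., something like:

-- drops word positions (and with them proximity/ranking info), shrinking
-- the tsvector and the index
update articles set fti = strip(fti);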

don't index the default configuration; index only the needed tokens. For
example, to index only 3 types of tokens, first create a 'qq' configuration
and specify the tokens to index.

insert into pg_ts_cfg values('qq','default','en_US');
-- tokens to index
insert into pg_ts_cfgmap values('qq','lhword','{en_ispell,en_stem}');
insert into pg_ts_cfgmap values('qq','lword','{en_ispell,en_stem}');
insert into pg_ts_cfgmap values('qq','lpart_hword','{en_ispell,en_stem}');




Besides that, I still have a problem. If I do a simple query like:
Select ids, keywords from dict where keywords='blabla' ('blabla' is a single
word); The table has 200 million rows and I have indexed the keywords field.
The first time, my query seems slow, taking about 15-60 sec to
get the result. I use the latest pgAdmin3 to test all queries. But if I repeat
the query I get a fast result. My question is why the query seems slow
the first time.


because index pages have to be read from disk into shared buffers, so the next
query will benefit from that. You need enough shared memory to get the real
benefit. You can look at the postgresql stats and check the cache hit ratio.

btw, how does your query ( keywords='blabla') relate to tsearch2 ?



I tried to cluster the table based on the keyword index, but after 15 hours
of waiting it hadn't finished, so I stopped clustering. Now I think I have to
change the


don't use cluster for big tables ! simple
 select *  into clustered_foo from foo order by indexed_field
would be faster and does the same job.
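
i.e., a sketch (recreate the indexes on the copy, then swap the tables):

select * into dict_sorted from dict order by keywords;
create index dict_sorted_keywords_idx on dict_sorted (keywords);
analyze dict_sorted;
-- then drop the old table and rename dict_sorted to dict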


file system for the postgresql data. Do you have any idea what's best for
postgresql, JFS or XFS? I will not try reiserfs, because there are some
rumors about reiserfs stability, although reiserfs is fast enough for
postgresql. And must I downgrade my postgresql from version 8.0.3 to 7.4.8?



I'm not experienced with filesystems :)





Regards,
ahmad fajar

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Oleg Bartunov
Sent: Saturday, September 24, 2005 1:08 PM
To: Ahmad Fajar
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] tsearch2 seem very slow

Ahmad,

what about the number of unique words ? I mean the stat() function.
Sometimes it helps to identify garbage words.
How big are your articles (average length) ?

please, cut'n'paste queries and output from psql ! How fast are
subsequent queries ?

Oleg
On Fri, 23 Sep 2005, Ahmad Fajar wrote:


Hi Oleg,

For the single index I try this query:
explain analyze
select articleid, title, datee from articles
where fti @@ to_tsquery('bank&indonesia');

analyze result:

Index Scan using fti_idx on articles  (cost=0.00..862.97 rows=420
width=51)
(actual time=0.067..183761.324 rows=46186 loops=1)
   Index Cond: (fti @@ '\'bank\' & \'indonesia\''::tsquery)
Total runtime: 183837.826 ms

And for the multicolumn index I try this query:
explain analyze
select articleid, title, datee from articles
where fti @@ to_tsquery('bank&mega');

analyze result:

Index Scan using articles_x1 on articles  (cost=0.00..848.01 rows=410
width=51) (actual time=52.204..37914.135 rows=1841 loops=1)
   Index Cond: ((datee >= '2002-01-01'::date) AND (datee <=
('now'::text)::date) AND (fti @@ '\'bank\' & \'mega\''::tsquery))
Total runtime: 37933.757 ms

The table structure is as mentioned in the first mail. If you want to know
how many tables are in my database, it's about 100 tables or maybe more. Now
I'm developing version 2 of my web application, you can take a look at:
http://www.mediatrac.net, so it will hold a lot of data. But the biggest
table is the articles table. For developing version 2 I just use half the
data of the articles table (about 419804 rows). Maybe if I import all of the
articles table data it will have 1 million rows. The articles table grows
rapidly, about 10 rows per week. My development database size is 28 GB (not
the real database, because I'm still developing version 2 and I use half of
the data to play around with). I just want to perform a quick search
(fulltext search) on my articles table, not other tables. For version 1, the
currently running version, I use the same hardware specification as mentioned
below

Re: [PERFORM] tsearch2 seem very slow

2005-09-25 Thread Oleg Bartunov

Ahmad,

On Mon, 26 Sep 2005, Ahmad Fajar wrote:


Hi Oleg,


what king of garbage ? Probably you index not needed token types, for
example, email address, file names



do you need proximity ? If no, use strip(tsvector) function to remove
coordinate information from tsvector.


I need proximity. Sometimes I have to rank my articles and make a chart from
that.


don't index default configuration and index only needed tokens, for
example, to index only 3 type of tokens, first create 'qq' configuration
and specify tokens to index.



insert into pg_ts_cfg values('qq','default','en_US');

-- tokens to index

insert into pg_ts_cfgmap values('qq','lhword','{en_ispell,en_stem}');
insert into pg_ts_cfgmap values('qq','lword','{en_ispell,en_stem}');
insert into pg_ts_cfgmap values('qq','lpart_hword','{en_ispell,en_stem}');


I still don't understand the tsearch2 configuration, so until now I've just
used the default configuration. I will try your suggestion. But how can I get
en_ispell? Will my system know what to do if I use '{en_ispell,en_stem}'?
In the default configuration I only see: ..., '{en_stem}';


I think you should read the documentation. I can't explain things that are
already written there.




Besides that, I still have a problem. If I do a simple query like:
Select ids, keywords from dict where keywords='blabla' ('blabla' is a single
word); The table has 200 million rows and I have indexed the keywords field.
The first time, my query seems slow, taking about 15-60 sec to
get the result. I use the latest pgAdmin3 to test all queries. But if I repeat
the query I get a fast result. My question is why the query seems slow
the first time.



because index pages have to be read from disk into shared buffers, so the
next query will benefit from that. You need enough shared memory to get the
real benefit. You can look at the postgresql stats and check the cache hit
ratio.



btw, how does your query ( keywords='blabla') relate to tsearch2 ?


(Keywords='blabla') isn't related to tsearch2; I just got an idea from
tsearch2 and tried a different approach. But I'm stuck on the query speed:
it's very slow to get a result on the first query.
And how do I see the postgresql stats and look at the cache hit ratio? I
still don't know how to get them.



learn from http://www.postgresql.org/docs/8.0/static/monitoring-stats.html
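
For example, a rough per-database cache hit ratio can be read from
pg_stat_database (on 8.0 this needs stats_block_level = on):

select datname,
       blks_hit::float8 / nullif(blks_hit + blks_read, 0) as cache_hit_ratio
from pg_stat_database;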


I tried to cluster the table based on the keyword index, but after 15 hours
of waiting it hadn't finished, so I stopped clustering.



don't use cluster for big tables ! simple
 select *  into clustered_foo from foo order by indexed_field
would be faster and does the same job.


What's the use of the clustered_foo table? And how do I use it?
I don't think it will remove duplicate rows. And the clustered_foo table
still does not have an index, so if I query this table, I think the query
will be very slow.


oh guy, you certainly need to read the documentation
http://www.postgresql.org/docs/8.0/static/sql-cluster.html
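
Spelled out, the manual alternative to CLUSTER looks something like this (a
sketch; dict/keywords are the names from this thread, and the index and
statistics must be re-created by hand on the new table):

select * into clustered_dict from dict order by keywords;
create index clustered_dict_keywords_idx on clustered_dict (keywords);
analyze clustered_dict;
-- then drop the old table and rename clustered_dict into its place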




Regards,
ahmad fajar



Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] tsearch2 seem very slow

2005-09-24 Thread Oleg Bartunov

Ahmad,

what about the number of unique words ? I mean the stat() function.
Sometimes it helps to identify garbage words.
How big are your articles (average length) ?
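
For the word count, e.g., a sketch against the table from this thread:

select count(*) from stat('select fti from tbarticles');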

please, cut'n paste the queries and output from psql ! How fast are the
next runs ?


Oleg
On Fri, 23 Sep 2005, Ahmad Fajar wrote:


Hi Oleg,

For single index I try this query:
explain analyze
select articleid, title, datee from articles
where fti @@ to_tsquery('bank&indonesia');

analyze result:

Index Scan using fti_idx on articles  (cost=0.00..862.97 rows=420 width=51)
(actual time=0.067..183761.324 rows=46186 loops=1)
  Index Cond: (fti @@ '\'bank\' & \'indonesia\''::tsquery)
Total runtime: 183837.826 ms

And for multicolumn index I try this query:
explain analyze
select articleid, title, datee from articles
where fti @@ to_tsquery('bank&mega');

analyze result:

Index Scan using articles_x1 on articles  (cost=0.00..848.01 rows=410
width=51) (actual time=52.204..37914.135 rows=1841 loops=1)
  Index Cond: ((datee >= '2002-01-01'::date) AND (datee <=
('now'::text)::date) AND (fti @@ '\'bank\' & \'mega\''::tsquery))
Total runtime: 37933.757 ms

The table structure is as mentioned in the first message. If you want to know
how many tables are in my database: about 100 tables or maybe more. Now I
develop the version 2 of my web application, you can take a look at:
http://www.mediatrac.net, so it will hold much data. But the biggest table
is the article's table. While developing this version 2 I just use half the
data of the article's table (about 419804 rows). If I import all of the
article's table data it will have 1 million rows. The article's table grows
rapidly, about 10 rows per-week. My development database size is 28 GB (not
the real database, because I still develop version 2 and I use half of the
data to play around with). I just want to perform quick (fulltext) searches
on my article's table, not the other tables. On version 1, the currently
running version, I use the same hardware specification as mentioned below,
but there is no fulltext search. So I develop the new version with new
features, a new interface, and fulltext search included.

I do know that when the application is finished I must use powerful hardware.
But how can I guarantee the application will run smoothly if a fulltext
search on 419804 rows in a table takes a long time to get the result?

Could you or friends on this mailing-list help me, please... please...

Tsearch2 configuration:
-
I use the default configuration, the english stop word file as tsearch2
provides it, the stem dictionary as default (because I don't know how to
configure and add new data to the stem dictionary), and I added some words
to the english stop word file.

Postgresql configuration
-
max_connections = 32
shared_buffers = 32768
sort_mem = 8192
vacuum_mem = 65536
work_mem = 16384
maintenance_work_mem = 65536
max_fsm_pages = 3
max_fsm_relations = 1000
max_files_per_process = 10
checkpoint_segments = 15
effective_cache_size = 192000
random_page_cost = 2
geqo = true
geqo_threshold = 50
geqo_effort = 5
geqo_pool_size = 0
geqo_generations = 0
geqo_selection_bias = 2.0
from_collapse_limit = 10
join_collapse_limit = 15

OS configuration:
--
I use Redhat 4 AS, kernel 2.6.9-11
kernel.shmmax=1073741824
kernel.sem=250 32000 100 128
fs.aio-max-nr=5242880
the server is configured just for postgresql; no other service is running
like: www, samba, ftp, email, firewall

hardware configuration:

Motherboard ASUS P5GD1
Processor P4 3,2 GHz
Memory 2 GB DDR 400,
2x200 GB Serial ATA 7200 RPM UltraATA/133, configured as RAID0 for postgresql
data, and the partition is EXT3
1x80 GB EIDE 7200 RPM configured for system and home directory, and the
partition is EXT3

Did I miss something?

Regards,
ahmad fajar


-Original Message-
From: Oleg Bartunov [mailto:[EMAIL PROTECTED]
Sent: Jumat, 23 September 2005 18:26
To: Ahmad Fajar
Cc: pgsql-performance@postgresql.org
Subject: RE: [PERFORM] tsearch2 seem very slow

On Fri, 23 Sep 2005, Ahmad Fajar wrote:


Hi Oleg,

I didn't deny that on the third repeat or more it can reach 600 msec. That is
only because the result is still in the postgres cache, but how about on the
first run? I didn't dare; the values are unacceptable. My table will grow
rapidly, about 10 rows per-week. And the visitors will search for anything
that I cannot predict, whether it's a repeated search or a new search, and
whether it's in the postgres cache or not.


if you have enough shared memory postgresql will keep index pages there.




I just compared with http://www.postgresql.org; the search is quite fast, and
I don't know whether the site uses tsearch2 or something else. But as far as
I know, if the rows reach 100 million (I have tried 200 million rows and it
seems very slow), even if I don't use tsearch2 and only use a simple query
like:
select f1, f2 from table1 where f2='blabla',
with f2 indexed, my postgres is still slow on the first

Re: [PERFORM] tsearch2 seem very slow

2005-09-23 Thread Oleg Bartunov

On Fri, 23 Sep 2005, Ahmad Fajar wrote:


Hi Oleg,

I didn't deny that on the third repeat or more it can reach 600 msec. That is
only because the result is still in the postgres cache, but how about on the
first run? I didn't dare; the values are unacceptable. My table will grow
rapidly, about 10 rows per-week. And the visitors will search for anything
that I cannot predict, whether it's a repeated search or a new search, and
whether it's in the postgres cache or not.


if you have enough shared memory postgresql will keep index pages there.




I just compared with http://www.postgresql.org; the search is quite fast, and
I don't know whether the site uses tsearch2 or something else. But as far as
I know, if the rows reach 100 million (I have tried 200 million rows and it
seems very slow), even if I don't use tsearch2 and only use a simple query
like:
select f1, f2 from table1 where f2='blabla',
with f2 indexed, my postgres is still slow the first time, about 10 sec.
Because of this I tried something brand new to fulfill my needs. I have used
fti, and tsearch2, but it is still slow.

I don't know what is going wrong with my postgres, or what configuration I
must change to make the query return fast results. Or must I use EnterpriseDB
2005 or Pervasive Postgres (both use postgres)? I don't know very much about
these two products.


you didn't show us your configuration (hardware, postgresql and tsearch2) or
explain analyze of your queries, so we can't help you.
How big are your database and your tsearch2 index ?
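
A rough way to check sizes from the catalog (a sketch; assumes the default
8 kB block size, and relpages is only as fresh as the last VACUUM/ANALYZE):

select relname, relpages, relpages * 8 / 1024 as approx_mb
from pg_class
where relname in ('tbarticles', 'fti_idx1', 'fti_idx2');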




Regards,
ahmad fajar


-Original Message-
From: Oleg Bartunov [mailto:[EMAIL PROTECTED]
Sent: Jumat, 23 September 2005 14:36
To: Ahmad Fajar
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] tsearch2 seem very slow

Ahmad,

how fast are repeated runs ? The first time the system could be very slow.
Also, have you checked my page
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes
and some info about tsearch2 internals
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals

Oleg
On Thu, 22 Sep 2005, Ahmad Fajar wrote:


I have about 419804 rows in my article table. I have installed tsearch2 and
its gist index correctly.

My table structure is:

CREATE TABLE tbarticles
(
  articleid int4 NOT NULL,
  title varchar(250),
  mediaid int4,
  datee date,
  content text,
  contentvar text,
  mmcol float4 NOT NULL,
  sirkulasi float4,
  page varchar(10),
  tglisidata date,
  namapc varchar(12),
  usere varchar(12),
  file_pdf varchar(255),
  file_pdf2 varchar(50),
  kolom int4,
  size_jpeg int4,
  journalist varchar(120),
  ratebw float4,
  ratefc float4,
  fti tsvector,
  CONSTRAINT pk_tbarticles PRIMARY KEY (articleid)
) WITHOUT OIDS;

Create index fti_idx1 on tbarticles using gist (fti);

Create index fti_idx2 on tbarticles using gist (datee, fti);



But when I search something like:

Select articleid, title, datee from tbarticles where fti @@
to_tsquery('susilo&bambang&yudhoyono&jusuf&kalla');

It takes about 30 sec. I run explain analyze and the index is used
correctly.



Then I tried a multi column index to filter by date, and my query is something
like:

Select articleid, title, datee from tbarticles where fti @@
to_tsquery('susilo&bambang&yudhoyono&jusuf&kalla') and datee >= '2002-01-01'
and datee <= current_date

And it still runs for about 25 sec. I did run explain analyze and my
multicolumn index is used correctly.

This is not acceptable if I want to publish my website and the search takes
this long.



I have run vacuum full analyze before doing such queries. What is going wrong
with my query?? Is there any way to make this faster?

I have tried to tune my postgres configuration, but it seems hopeless. My
linux box is Redhat 4 AS, and
the hardware: 2 GB RAM DDR 400, 2x200 GB Serial ATA 7200RPM configured as
RAID0 (just for postgres data); my system runs on an 80GB EIDE 7200 RPM disk.



Please.help.help.




Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PERFORM] Looking for a large database for testing

2005-08-16 Thread Oleg Bartunov

Sebastian,

you can try a document generator. I used
http://www.cs.rmit.edu.au/~jz/resources/finnegan.zip

you can play with the freq. of words and the document length distribution.
Also, I have SentenceGenerator.java which could be used for
generation of synthetic texts.

Oleg
On Tue, 16 Aug 2005, Sebastian Hennebrueder wrote:


Hello,

I would like to test the performance of my Java/PostgreSQL applications,
especially when making full text searches.
For this I am looking for a database of 50 to 300 MB having text fields,
e.g. a table of books with fields holding a comment, table of contents,
example chapters, or whatever else.

Does anybody have an idea where I can find a database like this or does
even have something like this?




Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PERFORM] Forums tsearch2 - best practices reg. concurrency

2005-06-22 Thread Oleg Bartunov

Markus,

wait for 8.1 which should solve all of these issues. We're working
on GiST concurrency & recovery right now. See
http://www.pgsql.ru/db/mw/msg.html?mid=2073083
for details.

Oleg
On Wed, 22 Jun 2005, Markus Wollny wrote:


Hello!

We're using PostgreSQL 8.0.1 as general backend for all of our websites,
including our online forums (aka bulletin boards or whatever you wish to
call that). As for full text search capabilities, we've chosen to
implement this via tsearch2. However, the tables themselves are quite
large, and as there's lots of weird user input in them (just no way of
limiting our users to proper orthography), so are the indices; we have
already split up the main posting-table in two, one containing the more
recent messages (< 6 months) and one for everything else.

Search capabilities have been limited to accessing only one of those,
either recent or archive. Still, the tsearch2-GiST-index for a table is
around 325MB in size; the recent messages table itself without any
indices weighs in at about 1.8GB containing over one million rows, the
archive-table is a little over 3GB and contains about 1.3 million rows.
A full text search in the table with the recent postings can take up to
five minutes.

This wouldn't be much of a problem, as we're providing other, quicker
search options (like searching for an author or a full text search just
on the topics); the problem with the full text search lies in the
locking mechanisms: As long as there's a search going on, all the
subsequent INSERTs or UPDATEs on that table fail due to timeout. This
means that currently, whenever we allow full text searching, there may
be a timeframe of more than one hour, during which users cannot write
any new postings in our forum or edit (i.e. update) anything. This is
hardly acceptable...

This is what I did to actually diagnose that simple tsearch2-related
SELECTs where causing the write-locks:

First I started a full text search query which I knew would run over
four minutes. Then I waited for other users to try and post some
messages; soon enough a 'ps ax|grep wait' showed several INSERT/UPDATE
waiting-backends. So I took a look at the locks:

select s.current_query as statement,
   l.mode as lock_mode,
   l.granted as lock_granted,
   c.relname as locked_relation,
   c.relnamespace as locked_relnamespace,
   c.reltype as locked_reltype
from pg_stat_activity s,
   pg_locks l,
   pg_class c
where
   l.pid = s.procpid
and
   l.relation = c.oid
order by age(s.query_start) desc;

I found four locks for the search query at the very beginning of the
resultset - all of them of the AccessShareLock persuasion and granted
alright: one on the message-table, one on the thread-table, one on the
tsearch2-index and another one on the primary key index of the
thread-table.

The hanging inserts/updates were waiting for an AccessExclusiveLock on
the tsearch2-index - all the other locks of these queries were marked as
granted.

As far as I understand from some of the previous messages on the mailing
list regarding concurrency issues with GiST-type indices, any SELECT
that's using a tsearch2-index would completely lock write-access to that
index for the runtime of the query - is that correct so far?

Now I'd like to find out about possible solutions or workarounds for
this issue. Surely some of you must have encountered quite similar
situations, so what did you do about it? I already pondered the idea of
a separate insert/update-queue-table which would then be processed by a
cron-job, thus separating the information-entry from the actual insert
into the table that's blocked due to the lock on the index. Another
possibility (which I find a little bit more compelling) would involve
replicating the message-table via Slony-I to another database which
could then be used as only target for any search-queries which require
use of the GiST-index. Would this provide the needed asynchronicity to
avoid this race condition between the AccessShareLock from the
search-SELECT and the AccessExclusiveLock from the write access queries?

I'd be very glad to know your opinions on this matter.

Kind regards

  Markus

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match



Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Fwd: Index on table when using DESC clause

2005-05-23 Thread Oleg Bartunov

You didn't say what version of PostgreSQL you're trying.
I recall old versions didn't use an index for backward pagination.

Oleg

On Mon, 23 May 2005, Yves Vindevogel wrote:


I tried that, but

create index ixTest on table1 (pages desc, documentname)

gives me a syntax error


On 23 May 2005, at 20:03, Steinar H. Gunderson wrote:


On Mon, May 23, 2005 at 07:41:19PM +0200, Yves Vindevogel wrote:

However, when I query my db using for instance   order by pages,
documentname, it is very fast.
If I use   order by pages desc, documentname, it is not fast at
all, like it is not using the index properly at all.


Make an index on pages desc, documentname asc.

/* Steinar */
--
Homepage: http://www.sesse.net/

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match



Met vriendelijke groeten,
Bien à vous,
Kind regards,

Yves Vindevogel
Implements




Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match


Re: [PERFORM] Tsearch2 performance on big database

2005-03-24 Thread Oleg Bartunov
On Thu, 24 Mar 2005, Rick Jansen wrote:
Oleg Bartunov wrote:
from my notes
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes
It's useful to see word statistics, for example, to check how well
your dictionaries work or how you configured pg_ts_cfgmap. Also, you
may notice probable stop words relevant for your collection. Tsearch
provides the stat() function:
...
Don't hesitate to read it and if you find some bugs or know better wording
I'd be glad to improve my notes.
Thanks, but that stat() query takes way too long.. I let it run for like
4 hours and still nothing. The database I am testing tsearch2 on is also
the production database (mysql) server so I have to be careful not to
use too many resources :o
stat() is indeed a big dog; it was designed for developers' needs,
so we recommend saving its results in a table.
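
Something like this sketch, using the books table from this thread (stat()
returns word/ndoc/nentry rows; the word_stat name is illustrative):

select * into word_stat from stat('select description_fti from books');
create index word_stat_ndoc_idx on word_stat (ndoc);
select word, ndoc, nentry from word_stat order by ndoc desc limit 100;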
Anyway, here's my pg_ts_cfgmap now (well the relevant bits):
default_english | lhword   | {en_ispell,en_stem}
default_english | lpart_hword  | {en_ispell,en_stem}
default_english | lword| {en_ispell,en_stem}
Is it normal that queries for single words (or perhaps they are words
that are common) take a really long time? Like this:
'hispanic' isn't common, I see you get only 674 rows and
'buckingham & palace' returns 185 rows. Did you run 'vacuum analyze' ?
I see a big discrepancy between estimated rows (8041) and actual rows.


ilab=# explain analyze select count(*) from books where description_fti @@ 
to_tsquery('default', 'hispanic');
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=20369.81..20369.81 rows=1 width=0) (actual time=261512.031..261512.031 rows=1 loops=1)
   ->  Index Scan using idxfti_idx on books  (cost=0.00..20349.70 rows=8041 width=0) (actual time=45777.760..261509.288 rows=674 loops=1)
         Index Cond: (description_fti @@ '\'hispan\''::tsquery)
 Total runtime: 261518.529 ms
(4 rows)

ilab=# explain analyze select titel from books where description_fti @@
to_tsquery('default', 'buckingham & palace');
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Index Scan using idxfti_idx on books  (cost=0.00..20349.70 rows=8041 width=57) (actual time=18992.045..48863.385 rows=185 loops=1)
   Index Cond: (description_fti @@ '\'buckingham\' & \'palac\''::tsquery)
 Total runtime: 48863.874 ms
(3 rows)

I don't know what happened, these queries were a lot faster 2 days ago.. what
the feck is going on?!

Rick

Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] Delete query takes exorbitant amount of time

2005-03-24 Thread Oleg Bartunov
On Thu, 24 Mar 2005, Tom Lane wrote:
Mark Lewis [EMAIL PROTECTED] writes:
I've got a similar problem with deletes taking a very long time.  I know
that there are lots of foreign keys referencing this table, and other
foreign keys referencing those tables, etc.  I've been curious, is there
a way to find out how long the foreign key checks take for each
dependent table?
There is not any easy way at the moment.
Hmm ... I wonder how hard it would be to teach EXPLAIN ANALYZE to show
the runtime expended in each trigger when the statement is of a kind
that has triggers.  We couldn't break down the time *within* the
triggers, but even this info would help a lot in terms of finger
pointing ...
Seq Scan on ... (nn.nnn ms)
Trigger foo: nn.mmm ms
Trigger bar: nn.mmm ms
Total time: nn.mmm ms
and if you add
Index foo_idx: nn.mm ss
Heap  foo_tbl: nn.mm ss

regards, tom lane
---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
  http://www.postgresql.org/docs/faq
Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match


Re: [PERFORM] Tsearch2 performance on big database

2005-03-22 Thread Oleg Bartunov
On Tue, 22 Mar 2005, Mike Rylander wrote:
And Oleg should know.  Unless I'm mistaken, he (co)wrote tsearch2.
You're not mistaken :)
Other than shared buffers, I can't imagine what could be causing that
kind of slowness.  EXPLAIN ANALYZE, please?
tsearch2 configs are also very important. I've seen a lot of
mistakes in configs !

Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly


Re: [PERFORM] Tsearch2 performance on big database

2005-03-22 Thread Oleg Bartunov
Mike,
no comments before Rick posts tsearch configs and increases buffers !
Union shouldn't be faster than (term1|term2).
The tsearch2 internals description might help you understand tsearch2's
limitations.
See  http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals
Also, don't miss my notes:
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes
Oleg
On Tue, 22 Mar 2005, Mike Rylander wrote:
On Tue, 22 Mar 2005 14:25:19 +0100, Rick Jansen [EMAIL PROTECTED] wrote:
ilab=# explain analyze select count(titel) from books where idxfti @@
to_tsquery('default', 'buckingham | palace');
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=35547.99..35547.99 rows=1 width=56) (actual time=125968.119..125968.120 rows=1 loops=1)
   ->  Index Scan using idxfti_idx on books  (cost=0.00..35525.81 rows=8869 width=56) (actual time=0.394..125958.245 rows=3080 loops=1)
         Index Cond: (idxfti @@ '\'buckingham\' | \'palac\''::tsquery)
 Total runtime: 125968.212 ms
(4 rows)
Time: 125969.264 ms
ilab=#
Ahh... I should have qualified my claim.  I am creating a google-esque
search interface and almost every query uses '&' as the term joiner.
'AND' queries and one-term queries are orders of magnitude faster than
'OR' queries, and fortunately are the expected default for most users.
(Think, I typed in these words, therefore I want to match these
words...)  An interesting test may be to time multiple queries
independently, one for each search term, and see if the combined cost
is less than a single 'OR' search.  If so, you could use UNION to join
the results.
However, the example you originally gave ('terminology') should be
very fast.  On a comparable query (select count(value) from
metabib.full_rec where index_vector @@ to_tsquery('default','jane');)
I get 12ms.
Oleg, do you see anything else on the surface here?
Try:
EXPLAIN ANALYZE
 SELECT titel FROM books WHERE idxfti @@
   to_tsquery('default', 'buckingham')
 UNION
 SELECT titel FROM books WHERE idxfti @@
   to_tsquery('default', 'palace');
and see if using '&' instead of '|' where you can helps out.  I
imagine you'd be surprised by the speed of:
 SELECT titel FROM books WHERE idxfti @@
   to_tsquery('default', 'buckingham&palace');

 As an example of what I think you *should* be seeing, I have a similar
 box (4 procs, but that doesn't matter for one query) and I can search
 a column with tens of millions of rows in around a second.

That sounds very promising, I'd love to get those results.. could you
tell me what your settings are, howmuch memory you have and such?
16G of RAM on a dedicated machine.
shared_buffers = 15000  # min 16, at least max_connections*2, 8KB 
each
work_mem = 10240# min 64, size in KB
maintenance_work_mem = 100  # min 1024, size in KB
# big m_w_m for loading data...
random_page_cost = 2.5  # units are one sequential page fetch cost
# fast drives, and tons of RAM

Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] Tsearch2 performance on big database

2005-03-22 Thread Oleg Bartunov
On Tue, 22 Mar 2005, Rick Jansen wrote:
Oleg Bartunov wrote:
Mike,
no comments before Rick posts tsearch configs and increases buffers !
Union shouldn't be faster than (term1|term2).
The tsearch2 internals description might help you understand tsearch2's
limitations.
See  http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals
Also, don't miss my notes:
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes

Oleg
Thanks Oleg, I've seen those pages before :) I've set shared_buffers to 45000
now (yes, that's probably very much, isn't it?) and it already seems a lot
quicker.

How do I find out what my tsearch config is? I followed the intro 
(http://www.sai.msu.su/~megera/oddmuse/index.cgi/tsearch-v2-intro) and 
applied it to our books table, thats all, didnt change anything else about 
configs.
Hmm, the default configuration is too eager: you index every lexeme using the
simple dictionary ! Probably it's too much. Here is what I have for my
russian configuration in the dictionary database:

 default_russian | lword| {en_ispell,en_stem}
 default_russian | lpart_hword  | {en_ispell,en_stem}
 default_russian | lhword   | {en_ispell,en_stem}
 default_russian | nlword   | {ru_ispell,ru_stem}
 default_russian | nlpart_hword | {ru_ispell,ru_stem}
 default_russian | nlhword  | {ru_ispell,ru_stem}
Notice, I index only russian and english words; no numbers, urls, etc.
You may just delete the unwanted rows in pg_ts_cfgmap for your configuration,
but I'd recommend just updating them, setting dict_name to NULL.
For example, to not index integers:
update pg_ts_cfgmap set dict_name=NULL where ts_name='default_russian' 
and tok_alias='int';

voc=# select token,dict_name,tok_type,tsvector from ts_debug('Do you have +7 bucks');
 token |      dict_name      | tok_type | tsvector
-------+---------------------+----------+----------
 Do    | {en_ispell,en_stem} | lword    |
 you   | {en_ispell,en_stem} | lword    |
 have  | {en_ispell,en_stem} | lword    |
 +7    |                     | int      |
 bucks | {en_ispell,en_stem} | lword    | 'buck'

Only 'bucks' gets indexed :)
Hmm, probably I should add this into documentation.
What about word statistics (# of unique words, for example).

Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] What needs to be done for real Partitioning?

2005-03-20 Thread Oleg Bartunov
On Sun, 20 Mar 2005, Josh Berkus wrote:

whole system make a lot more sense: individual partitions are really
tables. The partitioned tables themselves are just meta-objects like views.
If a partition is a table, could I define different indices for them ?
In our prototype of scaled full text search we create another index
which is optimized for archived (not changed) data - it's a sort of
standard inverted index which is proven to scale, while tsearch2's index
is good for online data. All the interfaces (dictionaries, parsers, ranking)
are the same, so it's possible to combine search results.
This is rather easy to implement using table inheritance, but I'd like
to do this with partitioning
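
A sketch of the inheritance route (table and index names are illustrative;
the archive child would carry the archive-optimized index in place of a
second GiST one):

create table docs (id serial, fti tsvector);
create table docs_online () inherits (docs);
create table docs_archive () inherits (docs);
-- each child can have its own index definition:
create index docs_online_fti_idx on docs_online using gist (fti);
create index docs_archive_fti_idx on docs_archive using gist (fti);
-- a query against the parent scans both children (and their indexes):
select id from docs where fti @@ to_tsquery('term');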
Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org


Re: [PERFORM] How to speed up tsearch2 indexing

2005-03-09 Thread Oleg Bartunov
On Wed, 9 Mar 2005, Werner Bohl wrote:
On Wed, 2005-03-09 at 20:41 +0300, Oleg Bartunov wrote:
What's your tsearch2 configuration ? Do you use dictionaries ?
I wrote a brief explanation of tsearch2 internals
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals
Tsearch2 is using default english configuration. No dictionaries, just
put some more stop words (10) in english.stop.
that's not good, because you probably have a lot of unique words.
Do you have some statistics, see stat() function ?


Hope, it could help you.

TIA,

Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
---(end of broadcast)---
TIP 6: Have you searched our list archives?
   http://archives.postgresql.org

Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org


Re: [PERFORM] GiST indexes and concurrency (tsearch2)

2005-02-03 Thread Oleg Bartunov
On Thu, 3 Feb 2005, Marinos J. Yannikos wrote:
Hi,
according to http://www.postgresql.org/docs/8.0/interactive/limitations.html 
, concurrent access to GiST indexes isn't possible at the moment. I haven't 
read the thesis mentioned there, but I presume that concurrent read access is 
also impossible. Is there any workaround for this, esp. if the index is 
usually only read and not written to?
there should be no problem with READ access.
It seems to be a big problem with tsearch2, when multiple clients are 
hammering the db (we have a quad opteron box here that stays 75% idle despite 
an apachebench with concurrency 10 stressing the php script that uses 
tsearch2, with practically no disk accesses)
I'm willing to see some details: version, query, explain analyze.


Regards,
Marinos
Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] GiST indexes and concurrency (tsearch2)

2005-02-03 Thread Oleg Bartunov
Marinos,
what if you construct an apachebench & Co free script and see if
the issue still exists. There could be many issues not connected
to postgresql and tsearch2.
Oleg
On Thu, 3 Feb 2005, Marinos J. Yannikos wrote:
Oleg Bartunov wrote:
On Thu, 3 Feb 2005, Marinos J. Yannikos wrote:
concurrent access to GiST indexes isn't possible at the moment. I [...]
there should be no problem with READ access.
OK, thanks everyone (perhaps it would make sense to clarify this in the 
manual).

I'm willing to see some details: version, query, explain analyze.
8.0.0
Query while the box is idle:
explain analyze select count(*) from fr_offer o, fr_merchant m where idxfti
@@ to_tsquery('ranz & mc') and eur = 70 and m.m_id=o.m_id;

 Aggregate  (cost=2197.48..2197.48 rows=1 width=0) (actual time=88.052..88.054 rows=1 loops=1)
   ->  Merge Join  (cost=2157.42..2196.32 rows=461 width=0) (actual time=88.012..88.033 rows=3 loops=1)
         Merge Cond: ("outer".m_id = "inner".m_id)
         ->  Index Scan using fr_merchant_pkey on fr_merchant m  (cost=0.00..29.97 rows=810 width=4) (actual time=0.041..1.233 rows=523 loops=1)
         ->  Sort  (cost=2157.42..2158.57 rows=461 width=4) (actual time=85.779..85.783 rows=3 loops=1)
               Sort Key: o.m_id
               ->  Index Scan using idxfti_idx on fr_offer o  (cost=0.00..2137.02 rows=461 width=4) (actual time=77.957..85.754 rows=3 loops=1)
                     Index Cond: (idxfti @@ '\'ranz\' & \'mc\''::tsquery)
                     Filter: (eur = 70::double precision)

 Total runtime: 88.131 ms
now, while using apachebench (-c10), top says this:
Cpu0  : 15.3% us, 10.0% sy,  0.0% ni, 74.7% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu1  : 13.3% us, 11.6% sy,  0.0% ni, 75.1% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu2  : 16.9% us,  9.6% sy,  0.0% ni, 73.4% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu3  : 18.7% us, 14.0% sy,  0.0% ni, 67.0% id,  0.0% wa,  0.0% hi,  0.3% si
(this is with shared_buffers = 2000; a larger setting makes almost no 
difference for overall performance: although according to top system time 
goes to ~0 and user time to ~25%, the system still stays 70-75% idle)

vmstat:
 r  b   swpd    free   buff   cache  si  so  bi   bo   in    cs us sy id wa
 2  0      0 8654316  64908 4177136   0   0  56   35  279   286  5  1 94  0
 2  0      0 8646188  64908 4177136   0   0   0    0 1156  2982 15 10 75  0
 2  0      0 8658412  64908 4177136   0   0   0    0 1358  3098 19 11 70  0
 1  0      0 8646508  64908 4177136   0   0   0  104 1145  2070 13 12 75  0

so the script's execution speed is apparently not limited by the CPUs.
The query execution times go up like this while apachebench is running (and 
the system is 75% idle):

 Aggregate  (cost=2197.48..2197.48 rows=1 width=0) (actual time=952.661..952.663 rows=1 loops=1)
   ->  Merge Join  (cost=2157.42..2196.32 rows=461 width=0) (actual time=952.621..952.641 rows=3 loops=1)
         Merge Cond: ("outer".m_id = "inner".m_id)
         ->  Index Scan using fr_merchant_pkey on fr_merchant m  (cost=0.00..29.97 rows=810 width=4) (actual time=2.078..3.338 rows=523 loops=1)
         ->  Sort  (cost=2157.42..2158.57 rows=461 width=4) (actual time=948.345..948.348 rows=3 loops=1)
               Sort Key: o.m_id
               ->  Index Scan using idxfti_idx on fr_offer o  (cost=0.00..2137.02 rows=461 width=4) (actual time=875.643..948.301 rows=3 loops=1)
                     Index Cond: (idxfti @@ '\'ranz\' & \'mc\''::tsquery)
                     Filter: (eur = 70::double precision)
 Total runtime: 952.764 ms

I can't seem to find out where the bottleneck is, but it doesn't seem to be 
CPU or disk. top shows that postgres processes are frequently in this 
state:

 PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  WCHAN COMMAND
6701 postgres  16   0  204m  58m  56m S  9.3  0.2   0:06.96 semtimedo
^
postmaste
Any hints are appreciated...
Regards,
Marinos
Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] GiST indexes and concurrency (tsearch2)

2005-02-03 Thread Oleg Bartunov
On Thu, 3 Feb 2005, Tom Lane wrote:
Marinos J. Yannikos [EMAIL PROTECTED] writes:
I can't seem to find out where the bottleneck is, but it doesn't seem to
be CPU or disk. top shows that postgres processes are frequently in
this state:

  6701 postgres  16   0  204m  58m  56m S  9.3  0.2   0:06.96 semtimedo
  ^
What's the platform exactly (hardware and OS)?
it should be 'semtimedop'

regards, tom lane
Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org


Re: [PERFORM] [SQL] OFFSET impact on Performance???

2005-01-27 Thread Oleg Bartunov
On Thu, 27 Jan 2005, PFC wrote:

That's a really good idea, just store a list of the sorted ids in the
temp table - a small amount of data for the insert... I like it!
Alex Turner
NetEconomist
	The best part is that you can skip the LIMIT/OFFSET entirely if you
put page numbers in your cache table while inserting into it, via a temporary
sequence or something. Retrieving the results will then be very fast, but
beware that SELECT * FROM table WHERE id =ANY( array ) won't use an index, so
you'll have to trick the thing by generating a query with IN, or joining
against a SRF returning the elements of the array one by one, which might be
better.

contrib/intarray provides index access to such queries.
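
A sketch of that cached-pagination idea (names are illustrative; the subselect
is there so rows get numbered in sorted order):

create temporary sequence pos_seq;
create temporary table result_cache as
  select nextval('pos_seq') as pos, id
  from (select id from big_table order by sort_col) as sorted;
-- page 3 at 20 rows per page, no LIMIT/OFFSET against the big table:
select t.* from big_table t join result_cache c on c.id = t.id
where c.pos between 41 and 60 order by c.pos;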

---(end of broadcast)---
TIP 6: Have you searched our list archives?
 http://archives.postgresql.org
Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Tsearch2 really slower than ilike ?

2004-11-18 Thread Oleg Bartunov
Have you run 'vacuum analyze' ?
my desktop is very simple PIII, 512 Mb RAM.

Oleg

On Thu, 18 Nov 2004, [iso-8859-15] Hervé Piedvache wrote:
Oleg,
Le Mercredi 17 Novembre 2004 18:23, Oleg Bartunov a écrit :
Sorry but when I do your request I get :
# select id_site from site where idx_site_name @@ 'livejourn';
ERROR: type  does not exist
no idea :) btw, what version of postgresql and OS you're running.
Could you try minimal test - check sql commands from tsearch2 sources,
some basic queries from tsearch2 documentation, tutorials.
btw, your query should looks like
select id_site from site_rss where idx_site_name @@ 'livejourn';
 
How did you run your queries at all ? I mean your first message about
poor tsearch2 performance.
I don't know what happened yesterday ... it's running now ...
You sent me :
zz=# explain analyze select id_site from site_rss where idx_site_name
@@ 'livejourn';
  QUERY PLAN
---
 Index Scan using ix_idx_site_name on site_rss (cost=0.00..733.62 rows=184
width=4) (actual time=0.339..39.183 rows=1737 loops=1)
  Index Cond: (idx_site_name @@ '\'livejourn\''::tsquery)
  Filter: (idx_site_name @@ '\'livejourn\''::tsquery)
 Total runtime: 40.997 ms
(4 rows)
It's really fast ! So, I don't understand your problem.
I run query on my desktop machine, nothing special.

I get this :
  QUERY PLAN
-
Index Scan using ix_idx_site_name on site_rss s  (cost=0.00..574.19 rows=187
width=24) (actual time=105.097..7157.277 rows=388 loops=1)
  Index Cond: (idx_site_name @@ '\'livejourn\''::tsquery)
  Filter: (idx_site_name @@ '\'livejourn\''::tsquery)
Total runtime: 7158.576 ms
(4 rows)
With the ilike I get :
QUERY PLAN

Seq Scan on site_rss s  (cost=0.00..8360.23 rows=1 width=24) (actual
time=8.195..879.440 rows=404 loops=1)
  Filter: (site_name ~~* '%livejourn%'::text)
Total runtime: 882.600 ms
(3 rows)
I don't know what is your desktop ... but I'm using PostgreSQL 7.4.6, on
Debian Woody with a PC Bi-PIII 933 Mhz and 1 Gb of memory ... the server is
dedicated to this database ... !!
I have no idea !
Regards,

Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Tsearch2 really slower than ilike ?

2004-11-18 Thread Oleg Bartunov
1;2c1;2c1;2cBlin !
what's happening with my terminal when I read messages from this guy ?
I don't even know how to call him - I see just Herv?
Oleg
1;2c1;2c1;2c1;2c
On Thu, 18 Nov 2004, [iso-8859-15] Hervé Piedvache wrote:
Le Jeudi 18 Novembre 2004 10:37, Oleg Bartunov a écrit :
Have you run 'vacuum analyze' ?
Yep every night VACUUM FULL VERBOSE ANALYZE; of all the database !
1;2c1;2c1;2c
1;2c1;2c1;2cmy desktop is very simple PIII, 512 Mb RAM.
1;2c1;2c11;2c1;2c1;2c;2c Oleg1;2c1;2c1;2c
11;2c1;2c1;2c;2c1;2c1;2c
YOU send strange characters ! ;o)
On Thu, 18 Nov 2004, [iso-8859-15] Hervé Piedvache wrote:
Oleg,
Le Mercredi 17 Novembre 2004 18:23, Oleg Bartunov a écrit :
Sorry but when I do your request I get :
# select id_site from site where idx_site_name @@ 'livejourn';
ERROR: type  does not exist
no idea :) btw, what version of postgresql and OS you're running.
Could you try minimal test - check sql commands from tsearch2 sources,
some basic queries from tsearch2 documentation, tutorials.
btw, your query should looks like
select id_site from site_rss where idx_site_name @@ 'livejourn';
 
How did you run your queries at all ? I mean your first message about
poor tsearch2 performance.
I don't know what happened yesterday ... it's running now ...
You sent me :
zz=# explain analyze select id_site from site_rss where idx_site_name
@@ 'livejourn';
  QUERY PLAN
---------------------------------------------------------------------------------------------------
 Index Scan using ix_idx_site_name on site_rss  (cost=0.00..733.62 rows=184 width=4) (actual time=0.339..39.183 rows=1737 loops=1)
   Index Cond: (idx_site_name @@ '\'livejourn\''::tsquery)
   Filter: (idx_site_name @@ '\'livejourn\''::tsquery)
 Total runtime: 40.997 ms
(4 rows)
It's really fast ! So, I don't understand your problem.
I run query on my desktop machine, nothing special.
I get this :
  QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Index Scan using ix_idx_site_name on site_rss s  (cost=0.00..574.19 rows=187 width=24) (actual time=105.097..7157.277 rows=388 loops=1)
   Index Cond: (idx_site_name @@ '\'livejourn\''::tsquery)
   Filter: (idx_site_name @@ '\'livejourn\''::tsquery)
 Total runtime: 7158.576 ms
(4 rows)
With the ilike I get :
QUERY PLAN
------------------------------------------------------------------------------------------------
 Seq Scan on site_rss s  (cost=0.00..8360.23 rows=1 width=24) (actual time=8.195..879.440 rows=404 loops=1)
   Filter: (site_name ~~* '%livejourn%'::text)
 Total runtime: 882.600 ms
(3 rows)
I don't know what is your desktop ... but I'm using PostgreSQL 7.4.6, on
Debian Woody with a PC Bi-PIII 933 Mhz and 1 Gb of memory ... the server
is dedicated to this database ... !!
I have no idea !
Regards,
Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
---(end of broadcast)---
TIP 8: explain analyze is your friend

Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org


Re: [PERFORM] Tsearch2 really slower than ilike ?

2004-11-17 Thread Oleg Bartunov
On Wed, 17 Nov 2004, [iso-8859-15] Hervé Piedvache wrote:
Oleg,
Sorry but when I do your request I get :
# select id_site from site where idx_site_name @@ 'livejourn';
ERROR: type  does not exist
no idea :) btw, what version of postgresql and OS you're running.
Could you try minimal test - check sql commands from tsearch2 sources,
some basic queries from tsearch2 documentation, tutorials.
btw, your query should looks like
select id_site from site_rss where idx_site_name @@ 'livejourn';

How did you run your queries at all ? I mean your first message about 
poor tsearch2 performance.

1;2c1;2c1;2c
What is this ?
(private: I don't know what happened to my mail, but I do nothing special to
disturb the contents when I'm writing to you ...)
Le Mardi 16 Novembre 2004 22:13, Oleg Bartunov a écrit :
ok, I downloaded dump of table and here is what I found:
zz=# select count(*) from tt;
  count

  183956
(1 row)
zz=# select * from stat('select tt from tt') order by ndoc desc, nentry
desc, word limit 10;
     word     | ndoc  | nentry
--------------+-------+--------
 blog         | 12710 |  12835
 weblog       |  4857 |   4859
 news         |  4402 |   4594
 life         |  4136 |   4160
 world        |  1980 |   1986
 journal      |  1882 |   1883
 livejourn    |  1737 |   1737
 thought      |  1669 |   1677
 web          |  1154 |   1161
 scotsman.com |  1138 |   1138
(10 rows)
zz=# explain analyze select tt from tt where tt @@  'blog';
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Index Scan using tt_idx on tt  (cost=0.00..728.83 rows=184 width=32) (actual time=0.047..141.110 rows=12710 loops=1)
   Index Cond: (tt @@ '\'blog\''::tsquery)
   Filter: (tt @@ '\'blog\''::tsquery)
 Total runtime: 154.105 ms
(4 rows)
It's really fast ! So, I don't understand your problem.
I run query on my desktop machine, nothing special.
Oleg
On Tue, 16 Nov 2004, [iso-8859-15] Hervé Piedvache wrote:
Hi,
I'm completely disappointed with Tsearch2 ...
I have a table like this :
                           Table "public.site"
    Column     |            Type             |                         Modifiers
---------------+-----------------------------+------------------------------------------------------------
 id_site       | integer                     | not null default nextval('public.site_id_site_seq'::text)
 site_name     | text                        |
 site_url      | text                        |
 url           | text                        |
 language      | text                        |
 datecrea      | date                        | default now()
 id_category   | integer                     |
 time_refresh  | integer                     |
 active        | integer                     |
 error         | integer                     |
 description   | text                        |
 version       | text                        |
 idx_site_name | tsvector                    |
 lastcheck     | date                        |
 lastupdate    | timestamp without time zone |
Indexes:
    "site_id_site_key" unique, btree (id_site)
    "ix_idx_site_name" gist (idx_site_name)
Triggers:
   tsvectorupdate_site_name BEFORE INSERT OR UPDATE ON site FOR EACH ROW
EXECUTE PROCEDURE tsearch2('idx_site_name', 'site_name')
I have 183 956 records in the database ...
SELECT s.site_name, s.id_site, s.description, s.site_url,
       case when exists (select id_user
                         from user_choice u
                         where u.id_site = s.id_site
                           and u.id_user = 1)
            then 1 else 0 end as bookmarked
FROM site s
WHERE s.idx_site_name @@ to_tsquery('atari');
Explain Analyze :
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Index Scan using ix_idx_site_name on site s  (cost=0.00..1202.12 rows=184 width=158) (actual time=4687.674..4698.422 rows=1 loops=1)
   Index Cond: (idx_site_name @@ '\'atari\''::tsquery)
   Filter: (idx_site_name @@ '\'atari\''::tsquery)
   SubPlan
     ->  Seq Scan on user_choice u  (cost=0.00..3.46 rows=1 width=4) (actual time=0.232..0.232 rows=0 loops=1)
           Filter: ((id_site = $0) AND (id_user = 1))
 Total runtime: 4698.608 ms
The first time I ran the request it took about 28 seconds to get a result.
SELECT s.site_name, s.id_site, s.description, s.site_url,
       case when exists (select id_user
                         from user_choice u
                         where u.id_site = s.id_site
                           and u.id_user = 1)
            then 1 else 0 end as bookmarked
FROM site_rss s
WHERE s.site_name ilike '%atari

Re: [PERFORM] Tsearch2 really slower than ilike ?

2004-11-16 Thread Oleg Bartunov
could you provide me a dump of your table (just the id and tsvector columns),
so I could try it on my computer? Also, a plain query (simple and clean) which
demonstrates your problem would be preferred next time !
   Oleg
On Tue, 16 Nov 2004, [iso-8859-15] Hervé Piedvache wrote:
Michael,
Le Mardi 16 Novembre 2004 16:32, Michael Fuhr a écrit :
On Tue, Nov 16, 2004 at 03:55:58PM +0100, Hervé Piedvache wrote:
WHERE s.idx_site_name @@ to_tsquery('atari');
How much text does each site_name field contain?  From the field
name I'd guess only a few words.  Based on my own experience, if
the fields were documents containing thousands of words then I'd
expect tsearch2 to be faster than ILIKE by an order of magnitude
or more.
Yes, site name ... these are company names or web site names ... so not many
words in each record ... but I don't understand why more words are more
efficient than few words ?? sorry ...
Regards,
Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Tsearch2 really slower than ilike ?

2004-11-16 Thread Oleg Bartunov
On Tue, 16 Nov 2004, Joshua D. Drake wrote:

or more.
Yes site name ... is company names or web site name ... so not many word in 
each record ... but I don't understand why more words are more efficient 
than few words ?? sorry ...

Well there are a couple of reasons but the easiest one is index size.
An ILIKE btree index is in general going to be much smaller than a gist 
index.
The smaller the index the faster it is searched.
for single word queries @@ should have the same performance as ilike with the
index disabled, and better for complex queries.
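
The two can be compared directly with the queries already shown in this
thread, e.g.:

explain analyze select id_site from site_rss where site_name ilike '%atari%';
explain analyze select id_site from site_rss where idx_site_name @@ to_tsquery('atari');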


Sincerely,
Joshua D. Drake

Regards,


Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Tsearch2 really slower than ilike ?

2004-11-16 Thread Oleg Bartunov
 be a better solution ...
Regards,
Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match