Re: [HACKERS] GIN improvements part 3: ordering in index

2013-07-06 Thread Tomas Vondra
Hi,

this is a follow-up to the message I posted to the thread about
additional info in GIN.

I've applied all three patches (ginaddinfo7.patch, gin_fast_scan.4.patch
and gin_ordering.4.patch) onto commit b8fd1a09. I ended up with two
definitions of ‘cmpEntries’ in ginget.c, but I suppose this is due to
split of the patch into multiple pieces. The definitions are exactly the
same so I've commented out the second one.

After applying fast scan the queries fail with 'buffer is not owned by
resource owner Portal' errors, the ordering patch causes segmentation
faults when loading the data.

Loading the data is basically a bunch of INSERT statements into
messages table, with a GIN index on the message body. So the table and
index are defined like this:

CREATE TABLE messages (

idSERIAL PRIMARY KEY,
parent_id INT REFERENCES messages(id),
thread_id INT,
level INT,
hash_id   VARCHAR(32) NOT NULL UNIQUE,

list  VARCHAR(32) NOT NULL REFERENCES lists(id),
message_idVARCHAR(200),
in_reply_to   TEXT[],
refs  TEXT[],
sent  TIMESTAMP,

subject   TEXT,
authorTEXT,

body_plainTEXT,

body_tsvector tsvector,
subject_tsvector  tsvector,

headers   HSTORE,
raw_message   TEXT
);

CREATE INDEX message_body_idx on messages using gin(body_tsvector);

I've observed about three failure scenarios:

1) autovacuum runs VACUUM on the 'messages' table and fails, killing
   all the connections, with this message in the server log

LOG:  server process (PID 16611) was terminated by signal
  11: Segmentation fault
DETAIL:  Failed process was running: autovacuum: ANALYZE
 public.messages


2) manual run of VACUUM on the table, with about the same result and
   this output on the console (and the same segfault in the server log)

archie=# vacuum messages;
WARNING:  relation messages page 6226 is uninitialized --- fixing
WARNING:  relation messages page 6227 is uninitialized --- fixing
WARNING:  relation messages page 6228 is uninitialized --- fixing
WARNING:  relation messages page 6229 is uninitialized --- fixing
WARNING:  relation messages page 6230 is uninitialized --- fixing
WARNING:  relation messages page 6231 is uninitialized --- fixing
WARNING:  relation messages page 6232 is uninitialized --- fixing
WARNING:  relation messages page 6233 is uninitialized --- fixing
The connection to the server was lost. Attempting reset: Failed.


3) disabled autovacuum, the load fails (always at exactly the same
   place) - I have collected a backtrace from gdb (after recompiling
   with disabled optimization), see the attachment.

All three scenarios might actually be caused by the same bug, as I've
checked the backtrace for the VACUUM and it fails at exactly the same
place as the third case.

regards
Tomas
Program received signal SIGSEGV, Segmentation fault.
0x0047517d in ginDataPageLeafReadItemPointer (ptr=0x15f00ae Address 
0x15f00ae out of bounds, iptr=0x7fff1781d340, addInfoIsNull=0x7fff1781d2d7 ) 
at ../../../../src/include/access/gin_private.h:866
866 v = *ptr;
(gdb) bt
#0  0x0047517d in ginDataPageLeafReadItemPointer (ptr=0x15f00ae 
Address 0x15f00ae out of bounds, iptr=0x7fff1781d340, 
addInfoIsNull=0x7fff1781d2d7 ) at 
../../../../src/include/access/gin_private.h:866
#1  0x004752ae in ginDataPageLeafRead (ptr=0x15f00ae Address 0x15f00ae 
out of bounds, attnum=1, iptr=0x7fff1781d340, addInfo=0x7fff1781d348, 
addInfoIsNull=0x7fff1781d347 , ginstate=0x7fff1781d9c0)
at ../../../../src/include/access/gin_private.h:916
#2  0x004779d6 in dataSplitPageLeaf (btree=0x8b97580, lbuf=33601, 
rbuf=33602, off=1018, prdata=0x7fff1781d428) at gindatapage.c:954
#3  0x00478d0d in dataSplitPage (btree=0x8b97580, lbuf=33601, 
rbuf=33602, off=1018, prdata=0x7fff1781d428) at gindatapage.c:1262
#4  0x0047a056 in ginInsertValue (btree=0x8b97580, stack=0x8b98748, 
buildStats=0x0) at ginbtree.c:385
#5  0x004793d1 in ginInsertItemPointers (ginstate=0x7fff1781d9c0, 
attnum=1, gdi=0x8b97580, items=0x43c6726, addInfo=0x3a41b08, 
addInfoIsNull=0x8b97b31 , nitem=1795, buildStats=0x0) at gindatapage.c:1414
#6  0x0046f7b2 in buildFreshLeafTuple (ginstate=0x7fff1781d9c0, 
attnum=1, key=49986944, category=0 '\000', items=0x43c4f50, addInfo=0x3a3fb40, 
addInfoIsNull=0x8b97738 , nitem=2812, buildStats=0x0) at gininsert.c:418
#7  0x0046faf0 in ginEntryInsert (ginstate=0x7fff1781d9c0, attnum=1, 
key=49986944, category=0 '\000', items=0x43c4f50, addInfo=0x3a3fb40, 
addInfoIsNull=0x8b97738 , nitem=2812, buildStats=0x0) at gininsert.c:512
#8  0x004859c3 in ginInsertCleanup (ginstate=0x7fff1781d9c0, 
vac_delay=0 '\000', stats=0x0) at ginfast.c:960
#9  0x0048425a in ginHeapTupleFastInsert 

Re: [HACKERS] GIN improvements part 3: ordering in index

2013-07-01 Thread Alexander Korotkov
On Sun, Jun 30, 2013 at 3:09 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 25.06.2013 21:18, Alexander Korotkov wrote:

 On Tue, Jun 25, 2013 at 7:31 PM, Heikki Linnakangashlinnakangas@**
 vmware.com hlinnakan...@vmware.com

 wrote:


  In summary: The test case you presented as motivation for this patch is a
 bit of a worst-case scenario for the current tidbitmap implementation.
 The
 speedup from your patch comes from avoiding the tidbitmap. However, it
 would be fairly easy to optimize the tidbitmap to handle this scenario
 better, which would benefit all kinds of queries that use bitmap scans.
 There is really no reason to complicate the GIN API for this. Let's just
 optimize tidbitmap.

 I'm not sure if I fullly understand your patch, though. Is there some
 other test scenario where it performs significantly better, which can not
 be attributed to a tidbitmap overhead? I'm assuming 'no' for now, and
 marking this patch as rejected in the commitfest app, but feel free to
 reopen if there is.


 So, it's likely I've positioned this patch wrong from the begging, because
 my examples were focused on CPU time improvement. But initial purpose of
 this patch was to decrease IO.


 Ok. Storing the additional information bloats the index considerably, so
 it's clearly not going to be a win in all cases. So whether you store the
 additional information or not needs to configurable somehow.


Yes, I think we should have two distinct opclasses.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] GIN improvements part 3: ordering in index

2013-06-30 Thread Heikki Linnakangas

On 25.06.2013 21:18, Alexander Korotkov wrote:

On Tue, Jun 25, 2013 at 7:31 PM, Heikki Linnakangashlinnakan...@vmware.com

wrote:



In summary: The test case you presented as motivation for this patch is a
bit of a worst-case scenario for the current tidbitmap implementation. The
speedup from your patch comes from avoiding the tidbitmap. However, it
would be fairly easy to optimize the tidbitmap to handle this scenario
better, which would benefit all kinds of queries that use bitmap scans.
There is really no reason to complicate the GIN API for this. Let's just
optimize tidbitmap.

I'm not sure if I fullly understand your patch, though. Is there some
other test scenario where it performs significantly better, which can not
be attributed to a tidbitmap overhead? I'm assuming 'no' for now, and
marking this patch as rejected in the commitfest app, but feel free to
reopen if there is.


So, it's likely I've positioned this patch wrong from the begging, because
my examples were focused on CPU time improvement. But initial purpose of
this patch was to decrease IO.


Ok. Storing the additional information bloats the index considerably, so 
it's clearly not going to be a win in all cases. So whether you store 
the additional information or not needs to configurable somehow.


I'm marking this as returned with feedback, as we need new performance 
testing from I/O point of view. The comparison should be with the base 
additional information patch or at least the part of that that packs 
the item pointers more tightly. Also, this depends on the additional 
information patch, so we need to get that committed before this one, and 
I just returned that patch.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GIN improvements part 3: ordering in index

2013-06-25 Thread Heikki Linnakangas

On 25.06.2013 01:24, Alexander Korotkov wrote:

On Wed, Jun 19, 2013 at 1:21 AM, Alexander Korotkovaekorot...@gmail.comwrote:


On Mon, Jun 17, 2013 at 10:27 PM, Heikki Linnakangas
hlinnakan...@vmware.com  wrote:

That has some obvious limitations. First of all, you can run out of

memory.


Yes, it is so. qsort should be replaced with tuplesort.


In attached patch qsort is replaced with tuplesort. As expected, it leads
to some performance drawback, but it's not dramatic.
Also, some doc is added for new distance method of GIN.


Thanks. But the fact remains that with this patch, you fetch all the 
tuples and then you sort them, which is exactly the same thing you do 
without the patch. The way it happens without the patch just seems to be 
slower. Time to break out the profiler..


I downloaded what I believe to be the same DBLP titles dataset that you 
tested with. Without the patch:


postgres=# explain analyze select * from dblp_titles where tsvector @@
to_tsquery('english', 'statistics') order by  ts_rank(tsvector, 
to_tsquery('english', 'statistics')) limit 10;
QUERY 
PLAN


-
-
 Limit  (cost=42935.81..42935.84 rows=10 width=64) (actual 
time=57.945..57.948 ro

ws=10 loops=1)
   -  Sort  (cost=42935.81..42980.86 rows=18020 width=64) (actual 
time=57.943..5

7.944 rows=10 loops=1)
 Sort Key: (ts_rank(tsvector, '''statist'''::tsquery))
 Sort Method: top-N heapsort  Memory: 28kB
 -  Bitmap Heap Scan on dblp_titles  (cost=211.66..42546.41 
rows=18020 w

idth=64) (actual time=13.061..46.358 rows=15142 loops=1)
   Recheck Cond: (tsvector @@ '''statist'''::tsquery)
   -  Bitmap Index Scan on gin_title  (cost=0.00..207.15 
rows=18020

width=0) (actual time=8.339..8.339 rows=15142 loops=1)
 Index Cond: (tsvector @@ '''statist'''::tsquery)
 Total runtime: 57.999 ms
(9 rows)

And the profile looks like this:

6,94%  postgrestbm_iterate
6,12%  postgreshash_search_with_hash_value
4,40%  postgrestbm_comparator
3,79%  libc-2.17.so__memcpy_ssse3_back
3,68%  postgresheap_hot_search_buffer
2,62%  postgresslot_deform_tuple
2,47%  postgresnocachegetattr
2,37%  postgresheap_page_prune_opt
2,27%  libc-2.17.so__memcmp_sse4_1
2,21%  postgresheap_fill_tuple
2,18%  postgrespg_qsort
1,96%  postgrestas
1,88%  postgrespalloc0
1,83%  postgrescalc_rank_or

Drilling into that, tbm_iterate, tbm_comparator and pq_sort calls come 
from the Tidbitmap code, as well as about 1/3 of the 
hash_search_with_hash_value calls. There seems to be a fair amount of 
overhead in building and iterating the tid bitmap. Is that what's 
killing us?


For comparison, this is the same with your patch:

postgres=# explain analyze select * from dblp_titles where tsvector @@
to_tsquery('english', 'statistics') order by tsvector 
to_tsquery('english', 'statistics') limit 10;
   QUERY 
PLAN


-

 Limit  (cost=16.00..52.81 rows=10 width=136) (actual time=9.957..9.980 
rows=10 l

oops=1)
   -  Index Scan using gin_title on dblp_titles  (cost=16.00..52198.94 
rows=1417

5 width=136) (actual time=9.955..9.977 rows=10 loops=1)
 Index Cond: (tsvector @@ '''statist'''::tsquery)
 Order By: (tsvector  '''statist'''::tsquery)
 Total runtime: 10.084 ms
(5 rows)

9,57%  postgresscanGetItemFast
7,02%  postgrescalc_rank_or
5,71%  postgresFunctionCall10Coll
5,59%  postgresentryGetNextItem
5,19%  postgreskeyGetOrdering
5,13%  postgresginDataPageLeafReadItemPointer
4,89%  postgresentryShift
4,85%  postgresginCompareItemPointers
3,44%  postgresginDataPageLeafRead
3,28%  postgresAllocSetAlloc
3,27%  postgresinsertScanItem
3,18%  postgresgin_tsquery_distance
2,38%  postgresputtuple_common
2,26%  postgrescheckcondition_gin
2,20%  postgrescmpEntries
2,17%  postgresAllocSetFreeIndex
2,11%  postgrescalc_rank

Unsurprisingly, the tidbitmap overhead is not visible in the profile 
with your patch.


To see how much of the difference is caused by the tidbitmap overhead, I 
wrote the attached quick  dirty patch (it will crash and burn with 
queries that require tidbitmap unions or intersects etc.). When there 
are fewer than 10 items on a page, the tidbitmap keeps the offsets of 
those items in an ordered array of offsets, instead of setting the bits 
in the 

Re: [HACKERS] GIN improvements part 3: ordering in index

2013-06-25 Thread Alexander Korotkov
On Tue, Jun 25, 2013 at 7:31 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 25.06.2013 01:24, Alexander Korotkov wrote:

 On Wed, Jun 19, 2013 at 1:21 AM, Alexander Korotkovaekorot...@gmail.com
 **wrote:

  On Mon, Jun 17, 2013 at 10:27 PM, Heikki Linnakangas
 hlinnakan...@vmware.com  wrote:

 That has some obvious limitations. First of all, you can run out of

 memory.


 Yes, it is so. qsort should be replaced with tuplesort.


 In attached patch qsort is replaced with tuplesort. As expected, it leads
 to some performance drawback, but it's not dramatic.
 Also, some doc is added for new distance method of GIN.


 Thanks. But the fact remains that with this patch, you fetch all the
 tuples and then you sort them, which is exactly the same thing you do
 without the patch. The way it happens without the patch just seems to be
 slower. Time to break out the profiler..


Actually, it's no exactly so. gingettuple doesn't touch any heap tuple. It
gets all required information to calculate relevance directly from index
itself. It's possible with respect to additional information which is word
positions for fts. So, patch is doing similar but with significant
difference: source of data for ranking changes from heap to index. The
information required for ranking in index is much more well-localized than
it is in heap.


 I downloaded what I believe to be the same DBLP titles dataset that you
 tested with. Without the patch:

 postgres=# explain analyze select * from dblp_titles where tsvector @@
 to_tsquery('english', 'statistics') order by  ts_rank(tsvector,
 to_tsquery('english', 'statistics')) limit 10;
 QUERY PLAN

 --**--**
 -
 --**---
  Limit  (cost=42935.81..42935.84 rows=10 width=64) (actual
 time=57.945..57.948 ro
 ws=10 loops=1)
-  Sort  (cost=42935.81..42980.86 rows=18020 width=64) (actual
 time=57.943..5
 7.944 rows=10 loops=1)
  Sort Key: (ts_rank(tsvector, '''statist'''::tsquery))

  Sort Method: top-N heapsort  Memory: 28kB
  -  Bitmap Heap Scan on dblp_titles  (cost=211.66..42546.41
 rows=18020 w
 idth=64) (actual time=13.061..46.358 rows=15142 loops=1)
Recheck Cond: (tsvector @@ '''statist'''::tsquery)
-  Bitmap Index Scan on gin_title  (cost=0.00..207.15
 rows=18020
 width=0) (actual time=8.339..8.339 rows=15142 loops=1)

  Index Cond: (tsvector @@ '''statist'''::tsquery)
  Total runtime: 57.999 ms
 (9 rows)

 And the profile looks like this:

 6,94%  postgrestbm_iterate
 6,12%  postgreshash_search_with_hash_value
 4,40%  postgrestbm_comparator
 3,79%  libc-2.17.so__memcpy_ssse3_back
 3,68%  postgresheap_hot_search_buffer
 2,62%  postgresslot_deform_tuple
 2,47%  postgresnocachegetattr
 2,37%  postgresheap_page_prune_opt
 2,27%  libc-2.17.so__memcmp_sse4_1
 2,21%  postgresheap_fill_tuple
 2,18%  postgrespg_qsort
 1,96%  postgrestas
 1,88%  postgrespalloc0
 1,83%  postgrescalc_rank_or

 Drilling into that, tbm_iterate, tbm_comparator and pq_sort calls come
 from the Tidbitmap code, as well as about 1/3 of the
 hash_search_with_hash_value calls. There seems to be a fair amount of
 overhead in building and iterating the tid bitmap. Is that what's killing
 us?

 For comparison, this is the same with your patch:

 postgres=# explain analyze select * from dblp_titles where tsvector @@

 to_tsquery('english', 'statistics') order by tsvector 
 to_tsquery('english', 'statistics') limit 10;
QUERY PLAN

 --**--**
 -
 --**--
  Limit  (cost=16.00..52.81 rows=10 width=136) (actual time=9.957..9.980
 rows=10 l
 oops=1)
-  Index Scan using gin_title on dblp_titles  (cost=16.00..52198.94
 rows=1417
 5 width=136) (actual time=9.955..9.977 rows=10 loops=1)

  Index Cond: (tsvector @@ '''statist'''::tsquery)
  Order By: (tsvector  '''statist'''::tsquery)
  Total runtime: 10.084 ms
 (5 rows)

 9,57%  postgresscanGetItemFast
 7,02%  postgrescalc_rank_or
 5,71%  postgresFunctionCall10Coll
 5,59%  postgresentryGetNextItem
 5,19%  postgreskeyGetOrdering
 5,13%  postgresginDataPageLeafReadItemPointer
 4,89%  postgresentryShift
 4,85%  postgresginCompareItemPointers
 3,44%  postgresginDataPageLeafRead
 3,28%  postgresAllocSetAlloc
 3,27%  postgresinsertScanItem
 3,18%  postgresgin_tsquery_distance
 2,38%  postgres

Re: [HACKERS] GIN improvements part 3: ordering in index

2013-06-24 Thread Alexander Korotkov
On Wed, Jun 19, 2013 at 1:21 AM, Alexander Korotkov aekorot...@gmail.comwrote:

 On Mon, Jun 17, 2013 at 10:27 PM, Heikki Linnakangas 
 hlinnakan...@vmware.com wrote:

 On 17.06.2013 15:56, Alexander Korotkov wrote:

 On Sat, Jun 15, 2013 at 3:02 AM, Alexander Korotkovaekorot...@gmail.com
 **wrote:

  This patch introduces new interface method of GIN which takes same
 arguments as consistent but returns float8.
 float8 gin_ordering(bool check[], StrategyNumber n, Datum query, int32
 nkeys, Pointer extra_data[], bool *recheck, Datum queryKeys[], bool
 nullFlags[], Datum addInfo[], bool addInfoIsNull[])
 This patch implements gingettuple method which can return ordering data
 using KNN infrastructure. Also it introduces  operator for fts which
 support ordering in GIN index. Some example:

 postgres=# explain analyze select * from dblp_titles2 where tsvector @@
 to_tsquery('english', 'statistics') order by tsvector
 to_tsquery('english', 'statistics') limit 10;

 QUERY
 PLAN

 --**--**
 --**--**
 -
   Limit  (cost=12.00..48.22 rows=10 width=136) (actual time=6.999..7.120
 rows=10 loops=1)
 -   Index Scan using dblp_titles2_idx on dblp_titles2
   (cost=12.00..43003.03 rows=11868 width=136) (actual time=6.996..7.115
 rows=10 loops=1)
   Index Cond: (tsvector @@ '''statist'''::tsquery)
   Order By: (tsvector  '''statist'''::tsquery)
   Total runtime: 7.556 ms
 (5 rows)


 Attached version of patch has some refactoring and bug fixes.


 Thanks. There are no docs changes and not many comments, that needs to be
 fixed, but I think I understand how it works:

 On the first call to gingettuple, the index is first scanned for all the
 matches, which are collected in an array in memory. Then, the array is
 sorted with qsort(), and the matches are returned one by one from the
 sorted array.


 Right.

 That has some obvious limitations. First of all, you can run out of
 memory.


 Yes, it is so. qsort should be replaced with tuplesort.


In attached patch qsort is replaced with tuplesort. As expected, it leads
to some performance drawback, but it's not dramatic.
Also, some doc is added for new distance method of GIN.

--
With best regards,
Alexander Korotkov.


gin_ordering.4.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GIN improvements part 3: ordering in index

2013-06-18 Thread Alexander Korotkov
On Mon, Jun 17, 2013 at 10:27 PM, Heikki Linnakangas 
hlinnakan...@vmware.com wrote:

 On 17.06.2013 15:56, Alexander Korotkov wrote:

 On Sat, Jun 15, 2013 at 3:02 AM, Alexander Korotkovaekorot...@gmail.com
 **wrote:

  This patch introduces new interface method of GIN which takes same
 arguments as consistent but returns float8.
 float8 gin_ordering(bool check[], StrategyNumber n, Datum query, int32
 nkeys, Pointer extra_data[], bool *recheck, Datum queryKeys[], bool
 nullFlags[], Datum addInfo[], bool addInfoIsNull[])
 This patch implements gingettuple method which can return ordering data
 using KNN infrastructure. Also it introduces  operator for fts which
 support ordering in GIN index. Some example:

 postgres=# explain analyze select * from dblp_titles2 where tsvector @@
 to_tsquery('english', 'statistics') order by tsvector
 to_tsquery('english', 'statistics') limit 10;
 QUERY
 PLAN

 --**--**
 --**--**
 -
   Limit  (cost=12.00..48.22 rows=10 width=136) (actual time=6.999..7.120
 rows=10 loops=1)
 -   Index Scan using dblp_titles2_idx on dblp_titles2
   (cost=12.00..43003.03 rows=11868 width=136) (actual time=6.996..7.115
 rows=10 loops=1)
   Index Cond: (tsvector @@ '''statist'''::tsquery)
   Order By: (tsvector  '''statist'''::tsquery)
   Total runtime: 7.556 ms
 (5 rows)


 Attached version of patch has some refactoring and bug fixes.


 Thanks. There are no docs changes and not many comments, that needs to be
 fixed, but I think I understand how it works:

 On the first call to gingettuple, the index is first scanned for all the
 matches, which are collected in an array in memory. Then, the array is
 sorted with qsort(), and the matches are returned one by one from the
 sorted array.


Right.

That has some obvious limitations. First of all, you can run out of memory.


Yes, it is so. qsort should be replaced with tuplesort.

Secondly, is that really any faster than the plan you get without this
 patch? Ie. scan all the matches with a bitmap index scan, and sort the
 results on the rank function. If it is faster, why?


At, first it's obviously much faster when not both heap and index fit into
cache, because of IO. With patch you need only posting trees and posting
lists of requested entries. If query matching significant part of documents
then without patch you will need almost all the heap.
Also it's faster when both heap and index are in the cache. There are some
examples on DBLP paper titles (2.5M titles of 47 average length).

Without patch:
postgres=# explain analyze select id, s from dblp_titles2 where ts @@
to_tsquery('english', 'system') order by ts_rank(ts, to_tsquery('english',
'system')) desc limit 10;
 QUERY
PLAN

 Limit  (cost=60634.15..60634.17 rows=10 width=134) (actual
time=200.204..200.205 rows=10 loops=1)
   -  Sort  (cost=60634.15..61041.28 rows=162854 width=134) (actual
time=200.202..200.203 rows=10 loops=1)
 Sort Key: (ts_rank(ts, '''system'''::tsquery))
 Sort Method: top-N heapsort  Memory: 28kB
 -  Bitmap Heap Scan on dblp_titles2  (cost=1758.12..57114.93
rows=162854 width=134) (actual time=33.592..158.006 rows=168308 loops=1)
   Recheck Cond: (ts @@ '''system'''::tsquery)
   -  Bitmap Index Scan on dblp_titles2_idx
 (cost=0.00..1717.41 rows=162854 width=0) (actual time=27.327..27.327
rows=168308 loops=1)
 Index Cond: (ts @@ '''system'''::tsquery)
 Total runtime: 200.610 ms
(9 rows)

With patch:
postgres=# explain analyze select id, s from dblp_titles2 where ts @@
to_tsquery('english', 'system') order by ts  to_tsquery('english',
'system') limit 10;
 QUERY
PLAN
-
 Limit  (cost=12.00..43.05 rows=10 width=136) (actual time=39.585..39.597
rows=10 loops=1)
   -  Index Scan using dblp_titles2_idx on dblp_titles2
 (cost=12.00..493681.61 rows=159005 width=136) (actual time=39.584..39.593
rows=10 loops=1)
 Index Cond: (ts @@ '''system'''::tsquery)
 Order By: (ts  '''system'''::tsquery)
 Total runtime: 40.115 ms
(5 rows)

Without patch:
postgres=# explain analyze select id, s, ts_rank(ts, to_tsquery('english',
'statistics')) from dblp_titles2 where ts @@ to_tsquery('english',
'statistics') order by ts_rank(ts, to_tsquery('english', 'statistics'))
desc limit 10;
  QUERY PLAN

Re: [HACKERS] GIN improvements part 3: ordering in index

2013-06-17 Thread Alexander Korotkov
On Sat, Jun 15, 2013 at 3:02 AM, Alexander Korotkov aekorot...@gmail.comwrote:

 attached patch implementing ordering inside GIN index. This is third patch
 of GIN improvements, see previous two:

 http://www.postgresql.org/message-id/capphfduxv-il7aedwpw0w5fxrwgakfxijwm63_hzujacrxn...@mail.gmail.com

 http://www.postgresql.org/message-id/CAPpHfdvftaJq7www381naLw1=4u0h+qpxgwvnhceb9hmvyw...@mail.gmail.com

 This patch introduces new interface method of GIN which takes same
 arguments as consistent but returns float8.
 float8 gin_ordering(bool check[], StrategyNumber n, Datum query, int32
 nkeys, Pointer extra_data[], bool *recheck, Datum queryKeys[], bool
 nullFlags[], Datum addInfo[], bool addInfoIsNull[])
 This patch implements gingettuple method which can return ordering data
 using KNN infrastructure. Also it introduces  operator for fts which
 support ordering in GIN index. Some example:

 postgres=# explain analyze select * from dblp_titles2 where tsvector @@
 to_tsquery('english', 'statistics') order by tsvector 
 to_tsquery('english', 'statistics') limit 10;
QUERY
 PLAN

 -
  Limit  (cost=12.00..48.22 rows=10 width=136) (actual time=6.999..7.120
 rows=10 loops=1)
-  Index Scan using dblp_titles2_idx on dblp_titles2
  (cost=12.00..43003.03 rows=11868 width=136) (actual time=6.996..7.115
 rows=10 loops=1)
  Index Cond: (tsvector @@ '''statist'''::tsquery)
  Order By: (tsvector  '''statist'''::tsquery)
  Total runtime: 7.556 ms
 (5 rows)


Attached version of patch has some refactoring and bug fixes.

--
With best regards,
Alexander Korotkov.


gin_ordering.2.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GIN improvements part 3: ordering in index

2013-06-17 Thread Heikki Linnakangas

On 17.06.2013 15:56, Alexander Korotkov wrote:

On Sat, Jun 15, 2013 at 3:02 AM, Alexander Korotkovaekorot...@gmail.comwrote:


This patch introduces new interface method of GIN which takes same
arguments as consistent but returns float8.
float8 gin_ordering(bool check[], StrategyNumber n, Datum query, int32
nkeys, Pointer extra_data[], bool *recheck, Datum queryKeys[], bool
nullFlags[], Datum addInfo[], bool addInfoIsNull[])
This patch implements gingettuple method which can return ordering data
using KNN infrastructure. Also it introduces  operator for fts which
support ordering in GIN index. Some example:

postgres=# explain analyze select * from dblp_titles2 where tsvector @@
to_tsquery('english', 'statistics') order by tsvector
to_tsquery('english', 'statistics') limit 10;
QUERY
PLAN

-
  Limit  (cost=12.00..48.22 rows=10 width=136) (actual time=6.999..7.120
rows=10 loops=1)
-   Index Scan using dblp_titles2_idx on dblp_titles2
  (cost=12.00..43003.03 rows=11868 width=136) (actual time=6.996..7.115
rows=10 loops=1)
  Index Cond: (tsvector @@ '''statist'''::tsquery)
  Order By: (tsvector  '''statist'''::tsquery)
  Total runtime: 7.556 ms
(5 rows)



Attached version of patch has some refactoring and bug fixes.


Thanks. There are no docs changes and not many comments, that needs to 
be fixed, but I think I understand how it works:


On the first call to gingettuple, the index is first scanned for all the 
matches, which are collected in an array in memory. Then, the array is 
sorted with qsort(), and the matches are returned one by one from the 
sorted array.


That has some obvious limitations. First of all, you can run out of 
memory. Secondly, is that really any faster than the plan you get 
without this patch? Ie. scan all the matches with a bitmap index scan, 
and sort the results on the rank function. If it is faster, why?


What parts of the previous two patches does this rely on?

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers