Re: [HACKERS] GSoC 2015 proposal. Bitmap Index-only Count

2015-03-27 Thread Alexander Korotkov
On Wed, Mar 25, 2015 at 11:07 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Anastasia Lubennikova lubennikov...@gmail.com writes:
  2015-03-24 18:01 GMT+04:00 Tom Lane t...@sss.pgh.pa.us:
  I wonder whether it'd be possible to teach GIN to support index_getnext
  instead.  Initially it would probably work only for cases where the
  index didn't have to return any columns ... but if we did it, maybe the
  door would be open to cases where GIN could reconstruct actual values.

  Another idea is to write index_getnext() for GIN which would return some
  fake tuple. Since there is no difference for COUNT aggregate what the
 tuple
  contains. COUNT just wants to know whether we have tuple that satisfy the
  qual.

 Well, yeah, that would be the idea (at least initially).  You don't have
 to return any real data unless you claim you can do so via amcanreturn.
 The planner is still capable of selecting an index-only scan as long as
 the query retrieves no columns.

 The trick would be to not return the same heap TID more than once per
 scan.  A zero-order implementation would be to construct the same bitmap
 we do now and then just provide a gingetnext function that scans through
 that.  That would be pretty awful in terms of scan startup time, so doing
 better would be nice; but perhaps it would be useful even in that form.


My ideal picture for FTS using GIN looks like this:

1) Have lexemes offsets in GIN stored with item pointers.
2) Calculate relevance using only GIN information without using heap.
3) Sort results by relevance in either GIN itself or executor node.
4) Get both TOP-N most relevant rows and total rows count (using index-only
scan) from single GIN index scan.

Implementing index_getnext() for GIN looks step forward for me because it
allows index only count and potentially could be used for ordered output.
However, it's unclear for me if it's feasible to have #4? Could we return
TOP-N results and total count from single GIN index scan?

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] GSoC 2015 proposal. Bitmap Index-only Count

2015-03-25 Thread Anastasia Lubennikova
2015-03-24 18:01 GMT+04:00 Tom Lane t...@sss.pgh.pa.us:

 Anastasia Lubennikova lubennikov...@gmail.com writes:
  There is a problem of slow counting in PostgreSQL [1]. The reason why
 this
  is slow is related to the *MVCC* implementation in PostgreSQL. Index-only
  scans (implemented since PostgreSQL-9.2) providing some performance
  improvements where the *visibility map* of the table allows it. That’s
  good. But it works only for access methods which provide amgettuple
 method.
  Unfortunately GIN supports only BitmapIndexScan and has no implementation
  of index_getnext() interface [2].

 Right ...

  As a GSoC student I will create new Node “Bitmap Index-Only Scan”, which
  would catch tuples from Bitmap Index Scan node and pass them to Aggregate
  node. Thus, new query plan will be as follow:

 I'm pretty hesitant about adding a whole new plan node type (which will
 require quite a lot of infrastructure) for such a narrow use-case.
 I think the odds are good that if you proceed down this path, you will
 end up with something that never gets committed to Postgres.


Thanks a lot for reply. It was just approximate idea. I thought is wasn't
very good.

I wonder whether it'd be possible to teach GIN to support index_getnext
 instead.  Initially it would probably work only for cases where the
 index didn't have to return any columns ... but if we did it, maybe the
 door would be open to cases where GIN could reconstruct actual values.

 Another idea is to write index_getnext() for GIN which would return some
fake tuple. Since there is no difference for COUNT aggregate what the tuple
contains. COUNT just wants to know whether we have tuple that satisfy the
qual.
Is this idea better? Is it possible for planner to use index_getnext() for
GIN only with COUNT aggregate?


-- 
Best regards,
Lubennikova Anastasia


Re: [HACKERS] GSoC 2015 proposal. Bitmap Index-only Count

2015-03-25 Thread Tom Lane
Anastasia Lubennikova lubennikov...@gmail.com writes:
 2015-03-24 18:01 GMT+04:00 Tom Lane t...@sss.pgh.pa.us:
 I wonder whether it'd be possible to teach GIN to support index_getnext
 instead.  Initially it would probably work only for cases where the
 index didn't have to return any columns ... but if we did it, maybe the
 door would be open to cases where GIN could reconstruct actual values.

 Another idea is to write index_getnext() for GIN which would return some
 fake tuple. Since there is no difference for COUNT aggregate what the tuple
 contains. COUNT just wants to know whether we have tuple that satisfy the
 qual.

Well, yeah, that would be the idea (at least initially).  You don't have
to return any real data unless you claim you can do so via amcanreturn.
The planner is still capable of selecting an index-only scan as long as
the query retrieves no columns.

The trick would be to not return the same heap TID more than once per
scan.  A zero-order implementation would be to construct the same bitmap
we do now and then just provide a gingetnext function that scans through
that.  That would be pretty awful in terms of scan startup time, so doing
better would be nice; but perhaps it would be useful even in that form.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GSoC 2015 proposal. Bitmap Index-only Count

2015-03-24 Thread Tom Lane
Anastasia Lubennikova lubennikov...@gmail.com writes:
 There is a problem of slow counting in PostgreSQL [1]. The reason why this
 is slow is related to the *MVCC* implementation in PostgreSQL. Index-only
 scans (implemented since PostgreSQL-9.2) providing some performance
 improvements where the *visibility map* of the table allows it. That’s
 good. But it works only for access methods which provide amgettuple method.
 Unfortunately GIN supports only BitmapIndexScan and has no implementation
 of index_getnext() interface [2].

Right ...

 As a GSoC student I will create new Node “Bitmap Index-Only Scan”, which
 would catch tuples from Bitmap Index Scan node and pass them to Aggregate
 node. Thus, new query plan will be as follow:

I'm pretty hesitant about adding a whole new plan node type (which will
require quite a lot of infrastructure) for such a narrow use-case.
I think the odds are good that if you proceed down this path, you will
end up with something that never gets committed to Postgres.

I wonder whether it'd be possible to teach GIN to support index_getnext
instead.  Initially it would probably work only for cases where the
index didn't have to return any columns ... but if we did it, maybe the
door would be open to cases where GIN could reconstruct actual values.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] GSoC 2015 proposal. Bitmap Index-only Count

2015-03-24 Thread Anastasia Lubennikova
Hi, hackers!
Here is the text of my proposal which I've applied to GSoC.
(and link
http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/lubennikovaav/5657382461898752
 )
Any suggestions and comments are welcome.


*Project name*

Bitmap Index-only Count



*Brief review*

There is a problem of slow counting in PostgreSQL [1]. The reason why this
is slow is related to the *MVCC* implementation in PostgreSQL. Index-only
scans (implemented since PostgreSQL-9.2) providing some performance
improvements where the *visibility map* of the table allows it. That’s
good. But it works only for access methods which provide amgettuple method.
Unfortunately GIN supports only BitmapIndexScan and has no implementation
of index_getnext() interface [2]. Idea of the proposal is to implement
Bitmap Index-Only Count method that allows to count elements, without
reference to the heap.



*Benefits to the PostgreSQL Community*

Faster count(*) would be actual for GIN. Probably it would accelerate
count(*) for other access methods too, but I do not sure if it would be
significant difference.



*Quantifiable results*

Acceleration of count(*) queries with WHERE clause which use GIN.



*Project details*

Let’s look at count(*) query:

EXPLAIN  SELECT count(*) from tbl_mytbl where val @ '{b:2}';



Now the query plan looks like this:

Aggregate

   -  Bitmap Heap Scan on tbl_mytbl

 Recheck Cond: (val @ '{b: 2}'::jsonb)

 -  Bitmap Index Scan on idx_mytbl

   Index Cond: (val @ '{b: 2}'::jsonb)



Idea of the proposal is to implement Bitmap Index-Only Count method that
allows to count elements, without reference to the heap.

Conditions:


   -  all tuples are visible (the same problem as for Index-only count(*));
   - result TIDBitmap is not lossy. nchunks = 0;

int nchunks
http://doxygen.postgresql.org/structTIDBitmap.html#a85d5883056bad6863cb47a15446581c7;
/* number of lossy entries in pagetable */


   - pages in TIDBitmap don’t require recheck

* recheck is used only on exact pages --- it indicates that although

* only the stated tuples need be checked, the full index qual condition

* must be checked for each (ie, these are candidate matches).



If all conditions are true, it’s possible to call aggregate COUNT function
for each tuple from TIDBitmap returned by Bitmap Index Scan (or
BitmapAnd/BitmapOr nodes). And there’s no necessity to call Bitmap Heap
Scan.





As a GSoC student I will create new Node “Bitmap Index-Only Scan”, which
would catch tuples from Bitmap Index Scan node and pass them to Aggregate
node. Thus, new query plan will be as follow:



Aggregate

   -  Bitmap Index-only Count

 -  Bitmap Index Scan on idx_mytbl

   Index Cond: (val @ '{b: 2}'::jsonb)





*Project Schedule*

until May 25

Read documentation and source code, clarify details of implementation.

until June 30

Implement Bitmap Index-only Count node.

1 July – 31 July

Add Bitmap Index-only Count node to Planner.

1 August -15 August

Final refactoring, testing and submitting a patch.



*Links*


   1.  https://wiki.postgresql.org/wiki/Slow_Counting
   2.
   
http://postgresql.nabble.com/Todo-item-Support-amgettuple-in-GIN-td5780810.html



-- 
Best regards,
Lubennikova Anastasia