Re: [HACKERS] Small GIN optimizations (after 9.4)

2014-02-11 Thread Bruce Momjian
On Sun, Feb  9, 2014 at 02:17:12PM +0400, Alexander Korotkov wrote:
 On Thu, Feb 6, 2014 at 8:31 PM, PostgreSQL - Hans-Jürgen Schönig
 postg...@cybertec.at wrote:
 
 i think there is one more thing which would be really good in GIN and which
 would solve a ton of issues.
 atm GIN entries are sorted by item pointer.
 if we could sort them by a column it would fix a couple of real-world issues
 such as ...
 
 SELECT ... FROM foo WHERE tsearch_query ORDER BY price DESC LIMIT 10
 
 ... or so.
 in many cases you want to search for, say, a product and find the cheapest
 / most expensive one.
 if the tsearch_query yields a high number of rows (which it often does) the
 subsequent sort will kill you.
 
 
 This is not intended to be a small change. However, some solution might be
 possible in post-9.4 GIN improvements or in the new secret indexing project
 which will be presented at PGCon :-)

Would any of the listed changes cause backward-incompatible changes to
the on-disk format, causing problems for pg_upgrade?

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Small GIN optimizations (after 9.4)

2014-02-11 Thread Alexander Korotkov
On Wed, Feb 12, 2014 at 2:58 AM, Bruce Momjian br...@momjian.us wrote:

 On Sun, Feb  9, 2014 at 02:17:12PM +0400, Alexander Korotkov wrote:
  On Thu, Feb 6, 2014 at 8:31 PM, PostgreSQL - Hans-Jürgen Schönig
  postg...@cybertec.at wrote:
 
  i think there is one more thing which would be really good in GIN and which
  would solve a ton of issues.
  atm GIN entries are sorted by item pointer.
  if we could sort them by a column it would fix a couple of real-world issues
  such as ...
 
  SELECT ... FROM foo WHERE tsearch_query ORDER BY price DESC LIMIT 10
 
  ... or so.
  in many cases you want to search for, say, a product and find the cheapest
  / most expensive one.
  if the tsearch_query yields a high number of rows (which it often does) the
  subsequent sort will kill you.
 
 
  This is not intended to be a small change. However, some solution might be
  possible in post-9.4 GIN improvements or in the new secret indexing project
  which will be presented at PGCon :-)

 Would any of the listed changes cause backward-incompatible changes to
 the on-disk format, causing problems for pg_upgrade?


None of them.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Small GIN optimizations (after 9.4)

2014-02-09 Thread Alexander Korotkov
On Thu, Feb 6, 2014 at 8:31 PM, PostgreSQL - Hans-Jürgen Schönig 
postg...@cybertec.at wrote:

 i think there is one more thing which would be really good in GIN and
 which would solve a ton of issues.
 atm GIN entries are sorted by item pointer.
 if we could sort them by a column it would fix a couple of real-world
 issues such as ...

 SELECT ... FROM foo WHERE tsearch_query ORDER BY price DESC
 LIMIT 10

 ... or so.
 in many cases you want to search for, say, a product and find the cheapest
 / most expensive one.
 if the tsearch_query yields a high number of rows (which it often does)
 the subsequent sort will kill you.


This is not intended to be a small change. However, some solution might be
possible in post-9.4 GIN improvements or in the new secret indexing project
which will be presented at PGCon :-)

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Small GIN optimizations (after 9.4)

2014-02-09 Thread Alexander Korotkov
On Thu, Feb 6, 2014 at 3:36 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

 While hacking on the GIN patches, I've come up with a few different ideas
 for improving performance. It's too late for 9.4, but I'll list them here
 if someone wants to work on them later:

 * Represent ItemPointers as uint64's, to speed up comparisons.
 ginCompareItemPointers is inlined into only a few instructions, but it's
 still more expensive than a single-instruction 64-bit comparison.
 ginCompareItemPointers is called very heavily in a GIN scan, so even a
 small improvement there would make for a noticeable speedup. It might be an
 improvement in code clarity, too.

 * Keep the entry streams of a GinScanKey in a binary heap, to quickly find
 the minimum curItem among them.

 I did this in various versions of the fast scan patch, but then I realized
 that the straightforward way of doing it is wrong, because a single
 GinScanEntry can be part of multiple GinScanKeys. If an entry's curItem is
 updated as part of advancing one key, and the entry is in a heap of another
 key, updating the curItem can violate the heap property of the other
 entry's heap.

 * Build a truth table (or cache) of consistent-function's results, and use
 that instead of calling consistent for every item.


Caching seems more appropriate to me. My intuition is that when the number
of entries is high, far from all possible consistent combinations will
actually be used. However, intuition might be wrong :-)

* Deduce AND or OR logic from the consistent function. Or have the opclass
 provide a tree of AND/OR/NOT nodes directly, instead of a consistent
 function. For example, if the query is foo & bar, we could avoid calling
 consistent function altogether, and only return items that match both.


I also had this idea, but it doesn't cover similarity queries. If you have
20 entries and the consistent function returns true when at least 10 of the
20 are present, then an AND/OR/NOT representation would need on the order of
C(20,10) = 184,756 AND branches, which is far too large to be useful.

* Delay decoding segments during a scan. Currently, we decode all segments
 of a posting tree page into a single array at once. But with fast scan,
 we might be able to skip over all entries in some of the segments. So it
 would be better to copy the segments into backend-private memory in
 compressed format, and decode them one segment at a time (or maybe even one
 item at a time), when needed. That would avoid the unnecessary decoding of
 segments that can be skipped over, and would also reduce memory usage of a
 scan.


I would like to add another one as a continuation of fast scan:
* Skip scanning some entries entirely, forcing a recheck instead. The correct
decision should be made based on costs. However, I'm not sure about the
design, because this is essentially a planning feature. How appropriate is it
to do that inside GIN?
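
To make the cost idea concrete, a very rough sketch of the decision (all
names and the cost model here are invented, just to illustrate the shape of
the trade-off):

/*
 * Sketch only: skip scanning a frequent entry's posting data and force a
 * recheck of the candidate heap tuples instead, whenever the estimated
 * recheck work is cheaper than scanning that entry.
 */
#include <stdbool.h>

static bool
should_skip_entry(double entryPostingPages,   /* pages to scan for this entry */
                  double candidateRows,       /* rows matching the other entries */
                  double pageCost,            /* cost of reading one posting page */
                  double recheckCostPerRow)   /* cost of rechecking one heap row */
{
    double      scanCost = entryPostingPages * pageCost;
    double      recheckCost = candidateRows * recheckCostPerRow;

    return recheckCost < scanCost;
}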

--
With best regards,
Alexander Korotkov.


[HACKERS] Small GIN optimizations (after 9.4)

2014-02-06 Thread Heikki Linnakangas
While hacking on the GIN patches, I've come up with a few different 
ideas for improving performance. It's too late for 9.4, but I'll list 
them here if someone wants to work on them later:


* Represent ItemPointers as uint64's, to speed up comparisons. 
ginCompareItemPointers is inlined into only a few instructions, but it's 
still more expensive than a single-instruction 64-bit comparison. 
ginCompareItemPointers is called very heavily in a GIN scan, so even a 
small improvement there would make for a noticeable speedup. It might be 
an improvement in code clarity, too.
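
For illustration, a minimal self-contained C sketch of the packing and the
single-instruction comparison. The type and helper names here are invented
for this example and are not the actual PostgreSQL definitions; real code
would build the value from ItemPointerData's block and offset fields:

/*
 * Sketch only: pack a (block number, offset number) item pointer into one
 * uint64 so that ordering comparisons become a single integer compare.
 * Block number goes in the high bits, offset in the low 16 bits, which
 * preserves the (block, offset) sort order used by ginCompareItemPointers.
 */
#include <stdint.h>

typedef struct
{
    uint32_t    block;          /* stand-in for the 32-bit block number */
    uint16_t    offset;         /* stand-in for OffsetNumber */
} SimpleItemPointer;            /* hypothetical type, for illustration */

static inline uint64_t
itemptr_pack(SimpleItemPointer p)
{
    return ((uint64_t) p.block << 16) | p.offset;
}

static inline int
itemptr_compare(uint64_t a, uint64_t b)
{
    /* one comparison instead of the field-by-field comparison */
    return (a > b) - (a < b);
}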


* Keep the entry streams of a GinScanKey in a binary heap, to quickly 
find the minimum curItem among them.


I did this in various versions of the fast scan patch, but then I 
realized that the straightforward way of doing it is wrong, because a 
single GinScanEntry can be part of multiple GinScanKeys. If an entry's 
curItem is updated as part of advancing one key, and the entry is in a 
heap of another key, updating the curItem can violate the heap property 
of the other entry's heap.
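
A rough sketch of what such a heap could look like, with invented names and
only the sift-down step shown; as explained above, this is only safe if each
entry belongs to exactly one key's heap:

/*
 * Sketch only: a binary min-heap of a key's entry streams, ordered by
 * curItem (shown here as a packed uint64).  entries[0] always holds the
 * stream with the minimum curItem; after that stream advances, sift it
 * down to restore the heap property in O(log n).
 */
#include <stdint.h>

typedef struct
{
    uint64_t    curItem;        /* current item of this entry stream */
    /* ... the rest of the per-entry scan state ... */
} EntryStream;                  /* hypothetical, for illustration */

typedef struct
{
    EntryStream **entries;      /* heap array */
    int         nentries;
} EntryHeap;

static void
entry_heap_sift_down(EntryHeap *heap)
{
    int         i = 0;

    for (;;)
    {
        int         left = 2 * i + 1;
        int         right = 2 * i + 2;
        int         smallest = i;
        EntryStream *tmp;

        if (left < heap->nentries &&
            heap->entries[left]->curItem < heap->entries[smallest]->curItem)
            smallest = left;
        if (right < heap->nentries &&
            heap->entries[right]->curItem < heap->entries[smallest]->curItem)
            smallest = right;
        if (smallest == i)
            break;

        tmp = heap->entries[i];
        heap->entries[i] = heap->entries[smallest];
        heap->entries[smallest] = tmp;
        i = smallest;
    }
}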


* Build a truth table (or cache) of consistent-function's results, and 
use that instead of calling consistent for every item.
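
A minimal sketch of the caching variant, with invented names; it assumes the
number of entries is small enough that the combination of "entry present"
flags fits in a bitmask, and falls back to a direct call otherwise:

/*
 * Sketch only: cache the consistent function's result per combination of
 * entry-match flags, so consistent is called at most once per distinct
 * combination rather than once per item.
 */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_CACHED_ENTRIES 12   /* 2^12 = 4096 cache slots */

typedef bool (*consistent_fn) (const bool *check, int nentries, void *arg);

typedef struct
{
    consistent_fn consistent;
    void       *arg;
    int         nentries;
    int8_t      cache[1 << MAX_CACHED_ENTRIES]; /* -1 unknown, 0/1 cached */
} ConsistentCache;

static void
consistent_cache_init(ConsistentCache *c, consistent_fn fn, void *arg,
                      int nentries)
{
    c->consistent = fn;
    c->arg = arg;
    c->nentries = nentries;
    memset(c->cache, -1, sizeof(c->cache));
}

static bool
consistent_cached(ConsistentCache *c, const bool *check)
{
    uint32_t    key = 0;
    int         i;

    /* too many entries to enumerate: just call consistent directly */
    if (c->nentries > MAX_CACHED_ENTRIES)
        return c->consistent(check, c->nentries, c->arg);

    for (i = 0; i < c->nentries; i++)
        if (check[i])
            key |= (uint32_t) 1 << i;

    if (c->cache[key] < 0)
        c->cache[key] = c->consistent(check, c->nentries, c->arg) ? 1 : 0;
    return c->cache[key] != 0;
}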


* Deduce AND or OR logic from the consistent function. Or have the 
opclass provide a tree of AND/OR/NOT nodes directly, instead of a 
 consistent function. For example, if the query is foo & bar, we could 
avoid calling consistent function altogether, and only return items that 
match both.
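
A small sketch of how a scan could evaluate such a tree directly from the
per-entry match flags (node types and names invented for illustration); for
foo & bar the tree is one AND node over two entry leaves and the consistent
function is never called:

/*
 * Sketch only: evaluate an opclass-provided AND/OR/NOT tree against the
 * array of per-entry match flags for the current item.
 */
#include <stdbool.h>

typedef enum
{
    NODE_ENTRY, NODE_AND, NODE_OR, NODE_NOT
} QueryNodeKind;

typedef struct QueryNode
{
    QueryNodeKind kind;
    int         entryIndex;     /* used by NODE_ENTRY */
    struct QueryNode *left;
    struct QueryNode *right;    /* unused by NODE_NOT */
} QueryNode;

static bool
eval_query_tree(const QueryNode *node, const bool *entryMatches)
{
    switch (node->kind)
    {
        case NODE_ENTRY:
            return entryMatches[node->entryIndex];
        case NODE_AND:
            return eval_query_tree(node->left, entryMatches) &&
                   eval_query_tree(node->right, entryMatches);
        case NODE_OR:
            return eval_query_tree(node->left, entryMatches) ||
                   eval_query_tree(node->right, entryMatches);
        case NODE_NOT:
            return !eval_query_tree(node->left, entryMatches);
    }
    return false;               /* keep compilers quiet */
}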


* Delay decoding segments during a scan. Currently, we decode all 
segments of a posting tree page into a single array at once. But with 
fast scan, we might be able to skip over all entries in some of the 
segments. So it would be better to copy the segments into 
backend-private memory in compressed format, and decode them one segment 
at a time (or maybe even one item at a time), when needed. That would 
avoid the unnecessary decoding of segments that can be skipped over, and 
would also reduce memory usage of a scan.
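
A sketch of the lazy-decoding idea (structure and names invented; the real
decoder would be the existing varbyte segment decoder, represented here only
by a placeholder declaration):

/*
 * Sketch only: keep each segment of a posting-tree page in compressed form
 * in backend-private memory and decode it the first time the scan actually
 * needs an item from it.  Segments that fast scan skips entirely are never
 * decoded.
 */
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    const unsigned char *compressed;    /* private copy of the on-page segment */
    size_t      compressedSize;
    uint64_t    firstItem;              /* lowest item, from the segment header */
    uint64_t   *decodedItems;           /* NULL until first needed */
    int         nitems;
} LazySegment;                          /* hypothetical, for illustration */

/* Placeholder for the real segment decoder; assumed, not defined here. */
extern uint64_t *decode_segment(const unsigned char *data, size_t size,
                                int *nitems);

static const uint64_t *
lazy_segment_items(LazySegment *seg, int *nitems)
{
    if (seg->decodedItems == NULL)
        seg->decodedItems = decode_segment(seg->compressed,
                                           seg->compressedSize,
                                           &seg->nitems);
    *nitems = seg->nitems;
    return seg->decodedItems;
}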


I'll add these to the TODO.

- Heikki




Re: [HACKERS] Small GIN optimizations (after 9.4)

2014-02-06 Thread PostgreSQL - Hans-Jürgen Schönig
i think there is one more thing which would be really good in GIN and which 
would solve a ton of issues.
atm GIN entries are sorted by item pointer.
if we could sort them by a column it would fix a couple of real-world issues 
such as ...

SELECT ... FROM foo WHERE tsearch_query ORDER BY price DESC LIMIT 10 

... or so.
in many cases you want to search for, say, a product and find the cheapest / 
most expensive one.
if the tsearch_query yields a high number of rows (which it often does) the 
subsequent sort will kill you.

many thanks,

hans


On Feb 6, 2014, at 12:36 PM, Heikki Linnakangas wrote:

 While hacking on the GIN patches, I've come up with a few different ideas for 
 improving performance. It's too late for 9.4, but I'll list them here if 
 someone wants to work on them later:
 
 * Represent ItemPointers as uint64's, to speed up comparisons. 
 ginCompareItemPointers is inlined into only a few instructions, but it's 
 still more expensive than a single-instruction 64-bit comparison. 
 ginCompareItemPointers is called very heavily in a GIN scan, so even a small 
 improvement there would make for a noticeable speedup. It might be an 
 improvement in code clarity, too.
 
 * Keep the entry streams of a GinScanKey in a binary heap, to quickly find 
 the minimum curItem among them.
 
 I did this in various versions of the fast scan patch, but then I realized 
 that the straightforward way of doing it is wrong, because a single 
 GinScanEntry can be part of multiple GinScanKeys. If an entry's curItem is 
 updated as part of advancing one key, and the entry is in a heap of another 
 key, updating the curItem can violate the heap property of the other entry's 
 heap.
 
 * Build a truth table (or cache) of consistent-function's results, and use 
 that instead of calling consistent for every item.
 
 * Deduce AND or OR logic from the consistent function. Or have the opclass 
 provide a tree of AND/OR/NOT nodes directly, instead of a consistent 
 function. For example, if the query is foo & bar, we could avoid calling 
 consistent function altogether, and only return items that match both.
 
 * Delay decoding segments during a scan. Currently, we decode all segments of 
 a posting tree page into a single array at once. But with fast scan, we 
 might be able to skip over all entries in some of the segments. So it would 
 be better to copy the segments into backend-private memory in compressed 
 format, and decode them one segment at a time (or maybe even one item at a 
 time), when needed. That would avoid the unnecessary decoding of segments 
 that can be skipped over, and would also reduce memory usage of a scan.
 
 I'll add these to the TODO.
 
 - Heikki
 
 
 

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de


