[PERFORM] GIST optimization to limit calls to operator on sub nodes

2014-06-23 Thread Pujol Mathieu

Hello,
I already post my question in the General Mailing list, but without 
succeed so I try this one that seems to me more specialized.

My question is about GIST index.
I made my own index to handle specific data and operators. It works 
pretty fine but I wonder if it was possible to optimize it.
When I run my operator on a GIST node (in the method 
gist_range_consistent) it returns "NotConsistent" / "MaybeConsistent" / 
"FullyConsistent".
NotConsistent -> means that all subnodes could be ignored, 
gist_range_consistent return false
MaybeConsistent -> means that at least one subnode/leaf will be 
consistent, gist_range_consistent return true
FullyConsistent -> means that all subnodes/leaves will be consistent, 
gist_range_consistent return true


So like with the "recheck flag" I would like to know if there is a way 
to notify postgres that it is not necessary to rerun my operator on 
subnodes, to speedup the search.


For example, consider the following gist tree
  R
/ \
   Na  Nb
 /   \   /\
La1  La2Lb1  Lb2

If all nodes return FullyConsistent, postgres will run tests in that 
Order : R, Na, Nb, La1, La2, Lb1, Lb2, thanks to recheck flag it will 
not test rows associated to leaves Lxx.
My goal is that postgres run test on R and then skip tests on other 
nodes. So is there a way to do that in the GIST API ? Or could I share 
data from R to Nx and then From Na to Lax and Nb to Lbx ?

Thanks,
Mathieu



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] GIST optimization to limit calls to operator on sub nodes

2014-06-30 Thread Pujol Mathieu


Le 29/06/2014 22:14, Emre Hasegeli a écrit :

Pujol Mathieu :

Hello,
I already post my question in the General Mailing list, but without
succeed so I try this one that seems to me more specialized.
My question is about GIST index.
I made my own index to handle specific data and operators. It works
pretty fine but I wonder if it was possible to optimize it.
When I run my operator on a GIST node (in the method
gist_range_consistent) it returns "NotConsistent" /
"MaybeConsistent" / "FullyConsistent".
NotConsistent -> means that all subnodes could be ignored,
gist_range_consistent return false
MaybeConsistent -> means that at least one subnode/leaf will be
consistent, gist_range_consistent return true
FullyConsistent -> means that all subnodes/leaves will be
consistent, gist_range_consistent return true

So like with the "recheck flag" I would like to know if there is a
way to notify postgres that it is not necessary to rerun my operator
on subnodes, to speedup the search.

I do not think it is possible at the moment.  The GiST framework can
be extended to support this use case.  I am not sure about the
speedup.  Most of the consistent functions do not seem very expensive
compared to other operations of the GiST framework.  I would be
happy to test it, if you would implement.



Thanks for your reply.
I am not sure to have time to develop inside the framework, but if I try 
I'll let you know my results. In my case the consistent function is 
costly and the number of row important so this optimization could save 
several hundred tests on a single request.


--
Mathieu PUJOL
Ingénieur Réalité Virtuelle
REAL FUSIO - 3D Computer Graphics
10, rue des arts - 31000 TOULOUSE - FRANCE
mathieu.pu...@realfusio.com - http://www.realfusio.com



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] GIST optimization to limit calls to operator on sub nodes

2014-06-30 Thread Pujol Mathieu


Le 29/06/2014 22:30, Tom Lane a écrit :

Emre Hasegeli  writes:

Pujol Mathieu :

I made my own index to handle specific data and operators. It works
pretty fine but I wonder if it was possible to optimize it.
When I run my operator on a GIST node (in the method
gist_range_consistent) it returns "NotConsistent" /
"MaybeConsistent" / "FullyConsistent".
NotConsistent -> means that all subnodes could be ignored,
gist_range_consistent return false
MaybeConsistent -> means that at least one subnode/leaf will be
consistent, gist_range_consistent return true
FullyConsistent -> means that all subnodes/leaves will be
consistent, gist_range_consistent return true

So like with the "recheck flag" I would like to know if there is a
way to notify postgres that it is not necessary to rerun my operator
on subnodes, to speedup the search.

I do not think it is possible at the moment.  The GiST framework can
be extended to support this use case.  I am not sure about the
speedup.  Most of the consistent functions do not seem very expensive
compared to other operations of the GiST framework.  I would be
happy to test it, if you would implement.

I don't actually understand what's being requested here that the
NotConsistent case doesn't already cover.

regards, tom lane



Hi,
The NotConsistent case is correctly covered, the sub nodes are not 
tested because I know that no child could pass the consistent_test.
The MaybeConsistent case is also correctly covered, all sub nodes are 
tested because I don't know which sub nodes will pass the consistent_test.
My problem is with the FullyConsistent, because when I test a node I can 
know that all it's childs nodes and leaves will pass the test, so I want 
to notify GIST framework that it can't skip consistent test on those 
nodes. Like we can notify it when testing a leaf that it could skip 
consistent test on the row. Maybe I miss something on the API to do 
that. On my tests, the "recheck_flag" works only for leaves.

Thanks
Mathieu



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] GIST optimization to limit calls to operator on sub nodes

2014-07-01 Thread Pujol Mathieu


Le 30/06/2014 16:04, Tom Lane a écrit :

Pujol Mathieu  writes:

Le 29/06/2014 22:30, Tom Lane a écrit :

I don't actually understand what's being requested here that the
NotConsistent case doesn't already cover.

The NotConsistent case is correctly covered, the sub nodes are not
tested because I know that no child could pass the consistent_test.
The MaybeConsistent case is also correctly covered, all sub nodes are
tested because I don't know which sub nodes will pass the consistent_test.
My problem is with the FullyConsistent, because when I test a node I can
know that all it's childs nodes and leaves will pass the test, so I want
to notify GIST framework that it can't skip consistent test on those
nodes. Like we can notify it when testing a leaf that it could skip
consistent test on the row. Maybe I miss something on the API to do
that. On my tests, the "recheck_flag" works only for leaves.

Hm ... that doesn't seem like a case that'd come up often enough to be
worth complicating the APIs for, unless maybe you are expecting a lot
of exact-duplicate index entries.  If you are, you might find that GIN
is a better fit for your problem than GIST --- it's designed to be
efficient for lots-of-duplicates.

Another view of this is that if you can make exact satisfaction checks
at upper-page entries, you're probably storing too much information in
the index entries (and thereby bloating the index).  The typical tradeoff
in GIST indexes is something like storing bounding boxes for geometric
objects --- which is necessarily lossy, but it results in small indexes
that are fast to search.  It's particularly important for upper-page
entries to be small, so that fanout is high and you have a better chance
of keeping all the upper pages in cache.

If you've got a compelling example where this actually makes sense,
I'd be curious to hear the details.

regards, tom lane



Hi,
I have a table containing several millions of rows, and each row 
contains an unique integer as identifier. My goal is to select all rows 
which have an identifier that is contained into at least one range of a 
list.

CREATE TABLE t (id int4 UNIQUE, ...)
SELECT * FROM t WHERE id @@> ARRAY[...]::range[]
I use a custom operator to check if an integer is contained in an array 
of ranges (a custom structure defined by my plugin).
I build my own GIST to speedup those requests. So each node of my GIST 
is represented by a range (like a BVH or octree of 3D boxes). I have no 
duplicated keys in my index.
When I run my tests I am able to quickly discard entire portions of the 
index which leads to great performance improvements.
During GIST traversal when I test consistency of a node 
(isRangeOverlap(range,range[]) the test return Fully, Partially, No. So 
when the tests return Fully I know for sure that all subnodes will also 
return Fully.
In practice my operator is not free in execution time, so if I could 
propagate the information on subnodes it will allow to save lot of 
computation time.

This optimization could also be benefical for cube extension.
I don't think that it would complicate the API, we could use existing 
recheck_flag. Today this value is used only for leaves node. Maybe it 
could be propagated by GIST API from a node to its subnodes. So if a 
node set it to false, it will be false for it's subnodes allowing client 
to use it or no. So the API could remain the same without changes for 
existing plugins and need only small memory to propagate this boolean.
I already achieve great performance improvements with my GIST, my 
request is to optimize it in use cases that select several rows to limit 
overhead of my consistent_operator.


regards,
mathieu pujol


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] less than 2 sec for response - possible?

2016-07-05 Thread Pujol Mathieu

Hello

Have you solved your problem ?

Could it be a conversion overhead from 'timestamp without time zone' to 
'date' ? In this case, I don't know if planer store constants as date or 
timestamp.


Mathieu Pujol

Le 02/07/2016 à 04:48, trafdev a écrit :

Thanks Tom.

I've created index on aid, date:

create index aaa on stats.feed_sub(aid,date);

and simplified a query (dropped gran as it's equal for all rows anyway):

SELECT
  sum(stats.feed_sub.c_filt_click_cap_ip) AS clicks_from_ip,
  sum(stats.feed_sub.c_filt_doubleclick)  AS clicks_on_target,
  sum(stats.feed_sub.c_filt_delay_clicks) AS ip_click_period,
  sum(stats.feed_sub.c_filt_fast_click)   AS fast_click,
  sum(stats.feed_sub.c_filt_ip_mismatch)  AS ip_mismatch,
  sum(stats.feed_sub.c_filt_lng_mismatch) AS lng_mismatch,
  sum(stats.feed_sub.c_filt_ref_mismatch) AS ref_mismatch,
  sum(stats.feed_sub.c_filt_ua_mismatch)  AS ua_mismatch,
  sum(stats.feed_sub.c_filt_url_expired)  AS url_expired,
  stats.feed_sub.subidAS stats_feed_sub_subid,
  stats.feed_sub.sid  AS stats_feed_sub_sid
FROM stats.feed_sub
WHERE stats.feed_sub.date >= '2016-06-01' :: TIMESTAMP AND
  stats.feed_sub.date <= '2016-06-30' :: TIMESTAMP AND
  stats.feed_sub.aid = 3
GROUP BY
  stats.feed_sub.subid, stats.feed_sub.sid;

All data is in the cache and it still takes almost 5 seconds to complete:

QUERY PLAN
HashAggregate  (cost=792450.42..803727.24 rows=346979 width=86) 
(actual time=4742.145..4882.468 rows=126533 loops=1)

"  Group Key: subid, sid"
  Buffers: shared hit=1350371
  ->  Index Scan using aaa on feed_sub  (cost=0.43..697031.39 
rows=3469783 width=86) (actual time=0.026..1655.394 rows=3588376 loops=1)
Index Cond: ((aid = 3) AND (date >= '2016-06-01 
00:00:00'::timestamp without time zone) AND (date <= '2016-06-30 
00:00:00'::timestamp without time zone))

Buffers: shared hit=1350371
Planning time: 0.159 ms
Execution time: 4899.934 ms

It's better, but still is far from "<2 secs" goal.

Any thoughts?


On 07/01/16 18:23, Tom Lane wrote:

trafdev  writes:

CREATE INDEX ix_feed_sub_date
   ON stats.feed_sub
   USING brin
   (date);



CREATE UNIQUE INDEX ixu_feed_sub
   ON stats.feed_sub
   USING btree
   (date, gran, aid, pid, sid, fid, subid COLLATE 
pg_catalog."default");



HashAggregate (cost=901171.72..912354.97 rows=344100 width=86) (actual
time=7207.825..7335.473 rows=126044 loops=1)
"  Group Key: subid, sid"
   Buffers: shared hit=3635804
   ->  Index Scan using ixu_feed_sub on feed_sub (cost=0.56..806544.38
rows=3440994 width=86) (actual time=0.020..3650.208 rows=3578344 
loops=1)

 Index Cond: ((date >= '2016-06-01 00:00:00'::timestamp without
time zone) AND (date <= '2016-06-30 00:00:00'::timestamp without time
zone) AND (gran = '1 day'::interval) AND (aid = 3))
 Buffers: shared hit=3635804
Planning time: 0.150 ms
Execution time: 7352.009 ms


Neither of those indexes is terribly well designed for this query.
A btree index on (aid, gran, date) or (gran, aid, date) would work
much better.  See

https://www.postgresql.org/docs/9.5/static/indexes-multicolumn.html

You could rearrange the column order in that giant unique index
and get some of the benefit.  But if you're desperate to optimize
this particular query, an index not bearing so many irrelevant columns
would probably be better for it.

An alternative way of thinking would be to create an index with those
three leading columns and then all of the other columns used by this
query as later columns.  That would be an even larger index, but it 
would

allow an index-only scan, which might be quite a lot faster. The fact
that you seem to be hitting about one page for each row retrieved says
that the data you need is pretty badly scattered, so constructing an 
index

that concentrates everything you need into one range of the index might
be the ticket.

Either of these additional-index ideas is going to penalize table
insertions/updates, so keep an eye on that end of the performance
question too.

regards, tom lane








--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance