Re: Bad selectivity estimate when using a sub query to determine WHERE condition

2020-02-10 Thread Justin Pryzby
On Mon, Feb 10, 2020 at 11:34:01AM +0100, Chris Borckholder wrote:
> I have a large table of immutable events that need to be aggregated
> regularly to derive statistics. To improve the performance, that table is
> rolled up every 15 minutes, so that online checks can aggregate rolled-up
> data and combine it with the latest events created after the last roll up.
> 
> To implement this, a query is executed that selects only events after the
> time of the last rollup.
> That time is determined dynamically based on a log table.

Perhaps that could be done as an indexed column in the large table, rather
than querying a 2nd log table.
Possibly with a partial index on that column: WHERE unprocessed='t'.
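
A minimal sketch of that idea (the flag column and index name are invented
here, assuming the large table is the exchange.ledger from your query below):

ALTER TABLE exchange.ledger ADD COLUMN unprocessed boolean NOT NULL DEFAULT true;
CREATE INDEX ledger_unprocessed_idx ON exchange.ledger (created) WHERE unprocessed='t';
-- online checks then read only the not-yet-rolled-up rows:
SELECT * FROM exchange.ledger WHERE unprocessed='t';
-- and the rollup job clears the flag for the rows it has aggregated:
UPDATE exchange.ledger SET unprocessed='f' WHERE unprocessed='t';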

> When using a sub select or CTE to get the latest roll up time, the query
> planner fails to recognize that most of the large table would be filtered
> out by the condition and tries a sequential scan instead of an index scan.
> When using the literal value for the WHERE condition, the plan correctly
> uses an index scan, which is much faster.
> 
> I analyzed the involved tables and increased the collected histogram size, but
> the query plan did not improve. Is there a way to help the query planner
> recognize this in the dynamic case?

Also, if you used partitioning with postgres v11 or later, then I think most
partitions would be excluded:

https://www.postgresql.org/docs/12/release-12.html
|Allow partition elimination during query execution (David Rowley, Beena Emerson)
|Previously, partition elimination only happened at planning time, meaning many
|joins and prepared queries could not use partition elimination.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=499be013de65242235ebdde06adb08db887f0ea5

https://www.postgresql.org/about/featurematrix/detail/332/
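
A rough sketch of what that could look like (partition names and monthly
boundaries are invented here, assuming exchange.ledger is range-partitioned
on its created column):

CREATE TABLE exchange.ledger_part (LIKE exchange.ledger)
    PARTITION BY RANGE (created);
CREATE TABLE exchange.ledger_part_2020_01 PARTITION OF exchange.ledger_part
    FOR VALUES FROM ('2020-01-01') TO ('2020-02-01');
CREATE TABLE exchange.ledger_part_2020_02 PARTITION OF exchange.ledger_part
    FOR VALUES FROM ('2020-02-01') TO ('2020-03-01');
-- With run-time partition pruning, a filter like
--   created > (SELECT cutoff FROM current_rollup)
-- can skip the older partitions during execution, even though the cutoff is
-- only known at run time.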

Justin




Re: Bad selectivity estimate when using a sub query to determine WHERE condition

2020-02-10 Thread Tom Lane
Chris Borckholder  writes:
> When using a sub select or CTE to get the latest roll up time, the query
> planner fails to recognize that most of the large table would be filtered
> out by the condition and tries a sequential scan instead of an index scan.
> When using the literal value for the WHERE condition, the plan correctly
> uses an index scan, which is much faster.

Yeah, a scalar sub-select is pretty much a black box to the planner.

> EXPLAIN (ANALYZE, BUFFERS) WITH current_rollup AS (
> SELECT COALESCE(MAX(window_end), '-infinity') AS cutoff
> FROM exchange.ledger_zerosum_rollup
> )
> SELECT *
> FROM exchange.ledger
> WHERE created > (SELECT cutoff FROM current_rollup);

Well, it's not that hard to get rid of that scalar sub-select: since
you're already relying on current_rollup to produce exactly one row,
you could write a plain join instead, something like

WITH current_rollup AS ...
SELECT l.*
FROM exchange.ledger l, current_rollup c
WHERE l.created > c.cutoff;

Unfortunately I doubt that will improve matters much, since the
planner also knows relatively little about MAX() and nothing about
COALESCE, so it's not going to be able to estimate what comes out
of the WITH.  I think you're going to have to cheat a bit.

The form of cheating that comes to mind is to wrap the sub-select
in a function that's marked STABLE:

create function current_rollup_cutoff() returns timestamp -- or whatever
stable language sql as $$
SELECT COALESCE(MAX(window_end), '-infinity') AS cutoff
FROM exchange.ledger_zerosum_rollup
$$;

SELECT *
FROM exchange.ledger
WHERE created > current_rollup_cutoff();

I have not actually tried this, but I think that since the function is
marked stable, the planner would test-run it to get an estimated value,
and then produce a plan similar to what you'd get with a literal constant.

Of course, then it's going to run the function once more when the query is
executed for-real, so this approach doubles the cost of getting the MAX().
That shouldn't be too awful if you have an index on window_end, though.
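(If that index doesn't exist yet, it would just be something like

create index on exchange.ledger_zerosum_rollup (window_end);

so the MAX() turns into a cheap index-endpoint lookup.)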

If you like living dangerously, you could cheat a LOT and mark the
function immutable so that its value gets substituted at plan time.
But that will only work for interactive submission of the outer
query --- if the plan gets cached and re-used, you'll have a stale
cutoff value.  Personally I wouldn't risk that.
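
(Purely for illustration, that variant would just be the same function with a
different volatility marking, e.g.

alter function current_rollup_cutoff() immutable;

which deliberately lies to the planner about the function's volatility.)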

regards, tom lane




Re: Fwd: TOAST table performance problem

2020-02-10 Thread Asya Nevra Buyuksoy
  Ok, you are right. Thanks for everything.

Jeff Janes  wrote the following on Mon, 10 Feb 2020 at 15:54:

> On Mon, Feb 10, 2020 at 7:38 AM Asya Nevra Buyuksoy 
> wrote:
>
>> Got it, thanks! I understand that PostgreSQL is not slow, but
>> I would like some advice on how I can load this data quickly :)
>>
>
> You haven't told us anything about your client, so what advice can we
> offer?  Unless the bottleneck is in the libpq library, this is probably not
> the right place to ask about it anyway.
>
> Cheers,
>
> Jeff
>


Re: Fwd: TOAST table performance problem

2020-02-10 Thread Asya Nevra Buyuksoy
Got it, thanks! I understand that PostgreSQL is not slow, but I
would like some advice on how I can load this data quickly :)

Luís Roberto Weck  wrote the following on Mon, 10 Feb 2020 at 15:31:

> -- Forwarded message -
> From: Asya Nevra Buyuksoy 
> Date: Mon, 10 Feb 2020, 10:51
> Subject: Re: TOAST table performance problem
> To: Andreas Joseph Krogh 
>
>
> I copied my data to a CSV file, and yes, that is very fast. However, this does
> not solve my problem.
> After deserializing it on the front end, I want to visualize my data on
> the web page effectively.
> When I select my data one row at a time with a LIMIT clause, the query executes
> in 200 ms. For example, if I fetch the data with a loop, the execution
> time will be 200 ms * 1000 = 200 sec.
>
>
> Andreas Joseph Krogh  wrote the following on Fri, 7 Feb 2020 at 18:15:
>
>> Try \o  in psql to redirect the output to a file, and prevent psql
>> from processing the json (i.e. formatting it)
>>
>> On 7 February 2020 15:59:05 CET, Asya Nevra Buyuksoy <
>> ayisan1...@gmail.com> wrote:
>>>
>>> I tried it, but there is no improvement.
>>> I read this link about TOAST and also its sub-links;
>>> https://blog.gojekengineering.com/a-toast-from-postgresql-83b83d0d0683
>>> When I execute this query, which excludes the JSON data:
>>> SELECT paramuser_id, paramperson_id, paramisdeleted, paramactiontime
>>>   FROM zamazin;
>>> it takes 94 ms. :)
>>>
>>>
>>> Andreas Joseph Krogh  wrote the following on Fri, 7 Feb 2020 at 17:42:
>>>
 On Friday, 7 February 2020 at 15:35:04, Asya Nevra Buyuksoy <
 ayisan1...@gmail.com> wrote:

 I use pgadmin3.


 Try "psql", it has the lowest overhead (I think). pgAdmin might use
 time presenting the results etc. which is easy to overlook.

 --
 Andreas Joseph Krogh

>>>
>> --
>> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>>
>
> What Andreas is trying to say is that it's not PostgreSQL that is slow to
> read the JSON, but your client app that is slow to parse it.
>


Re: Fwd: TOAST table performance problem

2020-02-10 Thread Luís Roberto Weck

-- Forwarded message -
From: Asya Nevra Buyuksoy
Date: Mon, 10 Feb 2020, 10:51
Subject: Re: TOAST table performance problem
To: Andreas Joseph Krogh <andr...@visena.com>


I copied my data to a CSV file, and yes, that is very fast. However, this
does not solve my problem.
After deserializing it on the front end, I want to visualize my data
on the web page effectively.
When I select my data one row at a time with a LIMIT clause, the query
executes in 200 ms. For example, if I fetch the data with a loop, the
execution time will be 200 ms * 1000 = 200 sec.


Andreas Joseph Krogh <andr...@visena.com> wrote the following on
Fri, 7 Feb 2020 at 18:15:

    Try \o  in psql to redirect the output to a file, and
    prevent psql from processing the json (i.e. formatting it)

    On 7 February 2020 15:59:05 CET, Asya Nevra Buyuksoy
    <ayisan1...@gmail.com> wrote:

        I tried it, but there is no improvement.
        I read this link about TOAST and also its sub-links;
        https://blog.gojekengineering.com/a-toast-from-postgresql-83b83d0d0683

        When I execute this query, which excludes the JSON data:
        SELECT paramuser_id, paramperson_id, paramisdeleted, paramactiontime
          FROM zamazin;
        it takes 94 ms. :)


        Andreas Joseph Krogh <andr...@visena.com> wrote the following on
        Fri, 7 Feb 2020 at 17:42:

            On Friday, 7 February 2020 at 15:35:04, Asya Nevra Buyuksoy
            <ayisan1...@gmail.com> wrote:

            I use pgadmin3.

            Try "psql", it has the lowest overhead (I think). pgAdmin
            might spend time presenting the results etc., which is easy
            to overlook.
            --
            Andreas Joseph Krogh


    --
    Sent from my Android device with K-9 Mail. Please excuse my brevity.


What Andreas is trying to say is that it's not PostgreSQL that is slow 
to read the JSON, but your client app that is slow to parse it.


Bad selectivity estimate when using a sub query to determine WHERE condition

2020-02-10 Thread Chris Borckholder
Hi,

I have a large table of immutable events that need to be aggregated
regularly to derive statistics. To improve the performance, that table is
rolled up every 15 minutes, so that online checks can aggregate rolled-up
data and combine it with the latest events created after the last roll up.

To implement this, a query is executed that selects only events after the
time of the last rollup.
That time is determined dynamically based on a log table.

When using a sub select or CTE to get the latest roll up time, the query
planner fails to recognize that most of the large table would be filtered
out by the condition and tries a sequential scan instead of an index scan.
When using the literal value for the WHERE condition, the plan correctly
uses an index scan, which is much faster.

I analyzed the involved tables and increased the collected histogram size, but
the query plan did not improve. Is there a way to help the query planner
recognize this in the dynamic case?

Best Regards
Chris

 Original query with a CTE to get the timestamp to filter on

https://explain.depesz.com/s/Hsix

EXPLAIN (ANALYZE, BUFFERS) WITH current_rollup AS (
SELECT COALESCE(MAX(window_end), '-infinity') AS cutoff
FROM exchange.ledger_zerosum_rollup
)
SELECT *
FROM exchange.ledger
WHERE created > (SELECT cutoff FROM current_rollup);

 Query with literal value

https://explain.depesz.com/s/ULAq

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM exchange.ledger
WHERE created > '2020-02-10T08:54:39.857789Z';