Re: percentile value check can be slow

2017-11-19 Thread jotpe



On 19.11.2017 18:49, David Fetter wrote:

On Sun, Nov 19, 2017 at 01:23:42PM +0100, Tomas Vondra wrote:

Hi,

On 11/19/2017 03:10 AM, David Fetter wrote:

On Sat, Nov 18, 2017 at 11:05:47PM +0100, Tomas Vondra wrote:

Hi,

...

Is 'recognizer' an established definition I should know? Is it the same
as 'validator' or is it something new/different?


I borrowed it from http://langsec.org/

I'm not entirely sure what you mean by a validator, but a recognizer
is something that gives a quick and sure read as to whether the input
is well-formed. In general, it's along the lines of a tokenizer, a
parser, and something that does very light post-parse analysis for
correctness of form.

For the case that started the thread, a recognizer would check
something along the lines of

 CHECK('[0,1]' @> ALL(input_array))
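
A minimal runnable sketch of that containment test (assuming numrange as
the range type; the snippet above leaves the cast implicit):

 SELECT numrange(0, 1, '[]') @> ALL (ARRAY[0, 0.25, 0.5, 1]::numeric[]); -- true
 SELECT numrange(0, 1, '[]') @> ALL (ARRAY[0, 0.25, 2]::numeric[]);      -- false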


OK, thanks. From what I understand, recognizer is more about recognizing
if a string is valid within a given formal language (essentially, if
it's a well-formed program). That may not be the right term for checks
on parameter values.


There are two hard problems in computer science: naming things, cache
coherency, and off-by-one.


OTOH we already have "validators" in a number of places - functions
checking various parameters, e.g. reloptions for FDWs, etc.
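
For illustration, that existing machinery in action (a sketch assuming
contrib/file_fdw is installed; the validator rejects a bad option at DDL
time, long before any query runs):

 CREATE EXTENSION file_fdw;
 CREATE SERVER files FOREIGN DATA WRAPPER file_fdw;
 CREATE FOREIGN TABLE t_bad (x int) SERVER files
   OPTIONS (filename '/tmp/data.csv', format 'nonsense');
 -- ERROR:  COPY format "nonsense" not recognized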

But I guess the naming can be solved later ...


Indeed.


Way Bigger Lift, As Far As I Can Tell, But More Fun For Users:
 Allow optional CHECK constraints in CREATE AGGREGATE for direct
 arguments.



How will any of the approaches deal with something like

 select percentile_cont((select array_agg(v) from p))
   within group (order by a) from t;

In this case the values are still unknown after parse analysis, so I
guess it does not really address that.


It doesn't.  Does it make sense to do a one-shot execution for cases
like that?  It might well be worth it to do the aggregate once in
advance as a throw-away if the query execution time is already going
to take a while.  Of course, you can break that one by making p a JOIN
to yet another thing...
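
(A cheap manual workaround, sketched on the assumption that p(v) holds
numeric fractions: run the containment check from above as its own fast
query before firing the expensive aggregate.)

 SELECT numrange(0, 1, '[]') @> ALL ((SELECT array_agg(v) FROM p)::numeric[]);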


FWIW while I won't stand in the way of improving this, I wonder if this
is really worth the additional complexity. If you get errors like this
with a static list of values, you will fix the list and you're done. If
the list is dynamic (computed in the query itself), you'll still get the
error much later during query execution.

So if you're getting many failures like this for the "delayed error
reporting" to be an issue, perhaps there's something wrong in you stack
and you should address that instead?


I'd like to think that getting something to fail quickly and cheaply
when it can will give our end users a better experience.  Here,
"cheaply" refers to their computing resources and time.


The trouble is, this increases execution time for everyone, including
people who carefully construct the parameter values. That seems rather
undesirable.


I may be wrong but I'm pretty sure that a check for well-formed direct
parameters will not impose a significant cost on aggregates.

It occurs to me that this particular aggregate could take an array of
a domain defined along the lines of:

 CREATE DOMAIN float4_0_1_closed AS float4
 NOT NULL
 CHECK(VALUE >= 0.0 AND VALUE <= 1.0);

Then the check would happen much earlier without adding a bunch of
potentially expensive machinery.
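
For instance, with such a domain the cast fails immediately, before any
input rows are read (a sketch; the constraint name in the real error
message is whatever the system assigned):

 SELECT ARRAY[0, 0.25, 0.5, 2]::float4_0_1_closed[];
 -- ERROR:  value for domain float4_0_1_closed violates check constraint ...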


Clearly, not having this happen in this case bothered Johannes
enough to wade in here.


No. He was surprised the error is reported after a significant amount
of time, but he does not seem to claim failing faster would be
valuable to him. That is your assumption, and I have my doubts about
it.


My mistake.  I shouldn't have guessed when there was a better
alternative.


I already wrote that: I did not know about the complexity that is needed 
to precheck the parameters. I thought maybe it could be done easily.

If it's too hard to change that, I wouldn't want that improvement.


Johannes, could you help us understand your thinking in reporting
this?


When executing this query I wanted to see what happens with wrong 
fractions. I expected the query to fail quickly, but it seems 
PostgreSQL has to read a lot of data before it checks (at the very end) 
that one parameter was outside the valid range.


I thought maybe there were technical reasons to do it that way, or that 
it could simply be improved to fail fast.


Best regards.



Re: percentile value check can be slow

2017-11-19 Thread Tomas Vondra
Hi,

On 11/19/2017 03:10 AM, David Fetter wrote:
> On Sat, Nov 18, 2017 at 11:05:47PM +0100, Tomas Vondra wrote:
>> Hi,
>>
>> ...
>>
>> Is 'recognizer' an established definition I should know? Is it the same
>> as 'validator' or is it something new/different?
> 
> I borrowed it from http://langsec.org/
> 
> I'm not entirely sure what you mean by a validator, but a recognizer
> is something that gives a quick and sure read as to whether the input
> is well-formed. In general, it's along the lines of a tokenizer, a
> parser, and something that does very light post-parse analysis for
> correctness of form.
> 
> For the case that started the thread, a recognizer would check
> something along the lines of 
> 
> CHECK('[0,1]' @> ALL(input_array))
> 

OK, thanks. From what I understand, recognizer is more about recognizing
if a string is valid within a given formal language (essentially, if
it's a well-formed program). That may not be the right term for checks
on parameter values.

OTOH we already have "validators" in a number of places - functions
checking various parameters, e.g. reloptions for FDWs, etc.

But I guess the naming can be solved later ...

>>> Way Bigger Lift, As Far As I Can Tell, But More Fun For Users:
>>> Allow optional CHECK constraints in CREATE AGGREGATE for direct
>>> arguments.
>>>
>>
>> How will any of the approaches deal with something like
>>
>> select percentile_cont((select array_agg(v) from p))
>>within group (order by a) from t;
>>
>> In this case the values are still unknown after parse analysis, so I
>> guess it does not really address that.
> 
> It doesn't.  Does it make sense to do a one-shot execution for cases
> like that?  It might well be worth it to do the aggregate once in
> advance as a throw-away if the query execution time is already going
> to take a while.  Of course, you can break that one by making p a JOIN
> to yet another thing...
> 
>> FWIW while I won't stand in the way of improving this, I wonder if this
>> is really worth the additional complexity. If you get errors like this
>> with a static list of values, you will fix the list and you're done. If
>> the list is dynamic (computed in the query itself), you'll still get the
>> error much later during query execution.
>>
>> So if you're getting many failures like this for the "delayed error
>> reporting" to be an issue, perhaps there's something wrong in you stack
>> and you should address that instead?
> 
> I'd like to think that getting something to fail quickly and cheaply
> when it can will give our end users a better experience.  Here,
> "cheaply" refers to their computing resources and time.

The trouble is, this increases execution time for everyone, including
people who carefully construct the parameter values. That seems rather
undesirable.

>
> Clearly, not having this happen in this case bothered Johannes
> enough to wade in here.
> 

No. He was surprised the error is reported after a significant amount of
time, but he does not seem to claim failing faster would be valuable to
him. That is your assumption, and I have my doubts about it.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: percentile value check can be slow

2017-11-18 Thread David Fetter
On Sat, Nov 18, 2017 at 11:05:47PM +0100, Tomas Vondra wrote:
> Hi,
> 
> On 11/18/2017 10:30 PM, David Fetter wrote:
> > On Sat, Nov 18, 2017 at 10:44:36AM -0500, Tom Lane wrote:
> >> jotpe  writes:
> >>> I tried to enter invalid percentile fractions, and was astonished
> >>> that it seems to be checked after much of the work is done?
> >>
> >> IIRC, only the aggregate's final-function is concerned with direct
> >> arguments, so it's not all that astonishing.
> > 
> > It may not be surprising from the point of view of a systems
> > programmer, but it's pretty surprising that this check is deferred
> > until many seconds in, when the system has all the information it needs
> > in order to establish this before execution begins.
> > 
> > I'm not sure I see an easy way to do this check early, but it's worth
> > trying on grounds of POLA violation.  I have a couple of ideas on how
> > to do this, one less invasive but hinky, the other a lot more invasive
> > but better overall.
> > 
> > Ugly Hack With Ugly Performance Consequences:
> > Inject a subtransaction at the start of execution that supplies an
> > empty input to the final function with the supplied direct
> > arguments.
> 
> I'm pretty sure you realize this is quite unlikely to get accepted.

I should hope not, but I mentioned it because I'd like to get it on
the record as rejected.

> > Bigger Lift:
> > Require a separate recognizer function for direct arguments and fire
> > it during post-parse analysis.  Perhaps this could be called
> > recognizer along with the corresponding mrecognizer.  It's not
> > clear that it's sane to have separate ones, but I thought I'd
> > bring it up for completeness.
> > 
> 
> Is 'recognizer' an established definition I should know? Is it the same
> as 'validator' or is it something new/different?

I borrowed it from http://langsec.org/

I'm not entirely sure what you mean by a validator, but a recognizer
is something that gives a quick and sure read as to whether the input
is well-formed. In general, it's along the lines of a tokenizer, a
parser, and something that does very light post-parse analysis for
correctness of form.

For the case that started the thread, a recognizer would check
something along the lines of 

CHECK('[0,1]' @> ALL(input_array))

> > Way Bigger Lift, As Far As I Can Tell, But More Fun For Users:
> > Allow optional CHECK constraints in CREATE AGGREGATE for direct
> > arguments.
> > 
> 
> How will any of the approaches deal with something like
> 
> select percentile_cont((select array_agg(v) from p))
>within group (order by a) from t;
> 
> In this case the values are still unknown after parse analysis, so I
> guess it does not really address that.

It doesn't.  Does it make sense to do a one-shot execution for cases
like that?  It might well be worth it to do the aggregate once in
advance as a throw-away if the query execution time is already going
to take a while.  Of course, you can break that one by making p a JOIN
to yet another thing...

> FWIW while I won't stand in the way of improving this, I wonder if this
> is really worth the additional complexity. If you get errors like this
> with a static list of values, you will fix the list and you're done. If
> the list is dynamic (computed in the query itself), you'll still get the
> error much later during query execution.
> 
> So if you're getting many failures like this for the "delayed error
> reporting" to be an issue, perhaps there's something wrong in you stack
> and you should address that instead?

I'd like to think that getting something to fail quickly and cheaply
when it can will give our end users a better experience.  Here,
"cheaply" refers to their computing resources and time.  Clearly, not
having this happen in this case bothered Johannes enough to wade in
here.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: percentile value check can be slow

2017-11-18 Thread Tomas Vondra
Hi,

On 11/18/2017 10:30 PM, David Fetter wrote:
> On Sat, Nov 18, 2017 at 10:44:36AM -0500, Tom Lane wrote:
>> jotpe  writes:
>>> I tried to enter invalid percentile fractions, and was astonished
>>> that it seems to be checked after much of the work is done?
>>
>> IIRC, only the aggregate's final-function is concerned with direct
>> arguments, so it's not all that astonishing.
> 
> It may not be surprising from the point of view of a systems
> programmer, but it's pretty surprising that this check is deferred
> until many seconds in, when the system has all the information it needs
> in order to establish this before execution begins.
> 
> I'm not sure I see an easy way to do this check early, but it's worth
> trying on grounds of POLA violation.  I have a couple of ideas on how
> to do this, one less invasive but hinky, the other a lot more invasive
> but better overall.
> 
> Ugly Hack With Ugly Performance Consequences:
> Inject a subtransaction at the start of execution that supplies an
> empty input to the final function with the supplied direct
> arguments.
> 

I'm pretty sure you realize this is quite unlikely to get accepted.

> Bigger Lift:
> Require a separate recognizer function for direct arguments and fire
> it during post-parse analysis.  Perhaps this could be called
> recognizer along with the corresponding mrecognizer.  It's not
> clear that it's sane to have separate ones, but I thought I'd
> bring it up for completeness.
> 

Is 'recognizer' an established definition I should know? Is it the same
as 'validator' or is it something new/different?

> Way Bigger Lift, As Far As I Can Tell, But More Fun For Users:
> Allow optional CHECK constraints in CREATE AGGREGATE for direct
> arguments.
> 

How will any of the approaches deal with something like

select percentile_cont((select array_agg(v) from p))
   within group (order by a) from t;

In this case the values are still unknown after parse analysis, so I
guess it does not really address that.


FWIW while I won't stand in the way of improving this, I wonder if this
is really worth the additional complexity. If you get errors like this
with a static list of values, you will fix the list and you're done. If
the list is dynamic (computed in the query itself), you'll still get the
error much later during query execution.

So if you're getting many failures like this for the "delayed error
reporting" to be an issue, perhaps there's something wrong in you stack
and you should address that instead?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



percentile value check can be slow

2017-11-18 Thread jotpe
I tried to enter invalid percentile fractions, and was astonished that 
it seems to be checked after much of the work is done?


psql (11devel)
Type "help" for help.

jotpe=# \timing
Timing is on.
jotpe=# select percentile_cont(array[0,0.25,0.5,1,1,null,2]) within 
group(order by txk) from klima_tag;

ERROR:  percentile value 2 is not between 0 and 1
Time: 19155,565 ms (00:19,156)
jotpe=# select count(*) from klima_tag;
  count
----------
 13950214
(1 row)

Time: 933,847 ms


Best regards, Johannes