Re: percentile value check can be slow

2017-11-19 Thread jotpe



On 19.11.2017 18:49, David Fetter wrote:

On Sun, Nov 19, 2017 at 01:23:42PM +0100, Tomas Vondra wrote:

Hi,

On 11/19/2017 03:10 AM, David Fetter wrote:

On Sat, Nov 18, 2017 at 11:05:47PM +0100, Tomas Vondra wrote:

Hi,

...

Is 'recognizer' an established term I should know? Is it the same
as 'validator', or is it something new/different?


I borrowed it from http://langsec.org/

I'm not entirely sure what you mean by a validator, but a recognizer
is something that gives a quick and sure read as to whether the input
is well-formed. In general, it's along the lines of a tokenizer, a
parser, and something that does very light post-parse analysis for
correctness of form.

For the case that started the thread, a recognizer would check
something along the lines of

 CHECK('[0,1]' @> ALL(input_array))
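
A minimal standalone sketch of the same check (the explicit numrange
cast and the sample values here are only illustrative, not taken from
the thread):

 -- false, because 2 lies outside [0,1]; a recognizer built on this test
 -- could reject the direct argument before any table rows are read
 SELECT numrange(0, 1, '[]') @> ALL (ARRAY[0, 0.25, 0.5, 1, 2]::numeric[]);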


OK, thanks. From what I understand, recognizer is more about recognizing
if a string is valid within a given formal language (essentially, if
it's a well-formed program). That may not be the right term for checks
on parameter values.


There are two hard problems in computer science: naming things, cache
coherency, and off-by-one.


OTOH we already have "validators" on a number of places - functions
checking various parameters, e.g. reloptions for FDWs, etc.

But I guess the naming can be solved later ...


Indeed.


Way Bigger Lift, As Far As I Can Tell, But More Fun For Users:
 Allow optional CHECK constraints in CREATE AGGREGATE for direct
 arguments.



How will any of the approaches deal with something like

 select percentile_cont((select array_agg(v) from p))
within group (order by a) from t;

In this case the values are still unknown after parse analysis, so I
guess it does not really address that.


It doesn't.  Does it make sense to do a one-shot execution for cases
like that?  It might well be worth it to do the aggregate once in
advance as a throw-away if the query execution time is already going
to take awhile.  Of course, you can break that one by making p a JOIN
to yet another thing...


FWIW while I won't stand in the way of improving this, I wonder if this
is really worth the additional complexity. If you get errors like this
with a static list of values, you will fix the list and you're done. If
the list is dynamic (computed in the query itself), you'll still get the
error much later during query execution.

So if you're getting enough failures like this for the "delayed error
reporting" to be an issue, perhaps there's something wrong in your stack
and you should address that instead?


I'd like to think that getting something to fail quickly and cheaply
when it can will give our end users a better experience.  Here,
"cheaply" refers to their computing resources and time.


The trouble is, this increases execution time for everyone, including
people who carefully construct the parameter values. That seems rather
undesirable.


I may be wrong but I'm pretty sure that a check for well-formed direct
parameters will not impose a significant cost on aggregates.

It occurs to me that this particular aggregate could take an array of
a domain defined along the lines of:

 CREATE DOMAIN float4_0_1_closed AS float4
 NOT NULL
 CHECK(VALUE >= 0.0 AND VALUE <= 1.0);

Then the check would happen much earlier without adding a bunch of
potentially expensive machinery.
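
A hedged usage sketch of that approach (the domain name above and the
sample values are illustrative only): coercing the direct-argument array
to the domain's array type should make the range check fire when the
constant list is evaluated, rather than only after the big table has
been scanned.

 SELECT ARRAY[0, 0.25, 0.5, 1, 2]::float4_0_1_closed[];
 -- ERROR:  value for domain float4_0_1_closed violates check constraint
 --         "float4_0_1_closed_check"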


Clearly, not having this happen in this case bothered Johannes
enough to wade in here.


No. He was surprised the error is reported after a significant amount
of time, but he does not seem to claim that failing faster would be
valuable to him. That is your assumption, and I have my doubts about
it.


My mistake.  I shouldn't have guessed when there was a better
alternative.


As I already wrote: I did not know about the complexity needed to
precheck the parameters. I thought maybe it could be done easily.

If it's too hard to change that, I wouldn't want that improvement.


Johannes, could you help us understand your thinking in reporting
this?


When executing this query I wanted to see what happens with wrong
fractions. I expected the query to fail quickly, but it seems PostgreSQL
has to read a lot of data before it checks (at the very end) that one
parameter was outside the valid range.


I thought maybe there are technical reasons to do it that way, or that
it could simply be improved to fail fast.


Best regards.



percentile value check can be slow

2017-11-18 Thread jotpe
I tried to enter invalid percentile fractions, and was astonished that
they seem to be checked only after a lot of work is done.


psql (11devel)
Type "help" for help.

jotpe=# \timing
Timing is on.
jotpe=# select percentile_cont(array[0,0.25,0.5,1,1,null,2]) within 
group(order by txk) from klima_tag;

ERROR:  percentile value 2 is not between 0 and 1
Time: 19155,565 ms (00:19,156)
jotpe=# select count(*) from klima_tag;
  count
----------
 13950214
(1 row)

Time: 933,847 ms


Best regards Johannes