Re: [HACKERS] user-defined numeric data types triggering ERROR: unsupported type

2017-09-23 Thread Tom Lane
Tomas Vondra  writes:
>>> [ scalarineqsel may fall over when used by extension operators ]

> What about using two-pronged approach:

> 1) fall back to mid bucket in back branches (9.3 - 10)
> 2) do something smarter (along the lines you outlined) in PG11

Sure.  We need to test the fallback case anyway.

>> [ sketch of a more extensible design ]

> Sounds reasonable to me, I guess - I can't really think of anything
> simpler that would give us the same flexibility.

Actually, on further thought, that's still too simple.  If you look
at convert_string_to_scalar() you'll see it's examining all three
values concurrently (the probe value, of one datatype, and two bin
boundary values of possibly a different type).  The reason is that
it's stripping off whatever common prefix those strings have before
trying to form a numeric equivalent.  While certainly
convert_string_to_scalar is pretty stupid in the face of non-ASCII
sort orders, the prefix-stripping is something I really don't want
to give up.  So the design I sketched of considering each value
totally independently isn't good enough.
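
For illustration only (this is not the actual selfuncs.c code), the
prefix handling is essentially the following, which is why no value can
be converted without looking at the other two:

static void
strip_common_prefix(const char *value, const char *lobound,
                    const char *hibound,
                    const char **restvalue, const char **restlo,
                    const char **resthi)
{
    /* advance past the prefix shared by the probe value and both bounds */
    while (*value != '\0' && *value == *lobound && *value == *hibound)
    {
        value++;
        lobound++;
        hibound++;
    }

    /* only the differing suffixes get mapped onto the numeric scale */
    *restvalue = value;
    *restlo = lobound;
    *resthi = hibound;
}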

We could, perhaps, provide an API that lets an operator estimation
function replace convert_to_scalar in toto, but that seems like
you'd still end up duplicating code in many cases.  Not sure about
how to find a happy medium.

regards, tom lane


Re: [HACKERS] user-defined numeric data types triggering ERROR: unsupported type

2017-09-23 Thread Tomas Vondra
On 09/21/2017 04:24 PM, Tom Lane wrote:
> Tomas Vondra  writes:
>> [ scalarineqsel may fall over when used by extension operators ]
> 
> I concur with your thought that we could have
> ineq_histogram_selectivity fall back to a "mid bucket" default if
> it's working with a datatype it is unable to convert_to_scalar. But I
> think if we're going to touch this at all, we ought to have higher
> ambition than that, and try to provide a mechanism whereby an
> extension that's willing to work a bit harder could get that
> additional increment of estimation accuracy. I don't care for this
> way to do that:
> 

The question is - do we need a solution that is back-patchable? If we
do, we can't really use FIXEDDECIMAL without effectively copying a lot
of the selfuncs.c stuff, only to make some minor fixes.

What about using two-pronged approach:

1) fall back to mid bucket in back branches (9.3 - 10)
2) do something smarter (along the lines you outlined) in PG11

I'm willing to spend some time on both, but (2) alone is not
particularly attractive for us, as it only helps PG11 and we'd have to
do the copy-paste evil anyway to get the data type working on back branches.

>> * Make convert_numeric_to_scalar smarter, so that it checks whether
>> there is an implicit cast to numeric, and fails only if it does not
>> find one.
> 
> because it's expensive, and it only works for numeric-category cases,
> and it will fail outright for numbers outside the range of "double".
> (Notice that convert_numeric_to_scalar is *not* calling the regular
> cast function for numeric-to-double.) Moreover, any operator ought to
> know what types it can accept; we shouldn't have to do more catalog
> lookups to figure out what to do.
> 

Ah, I see. I hadn't realized it's not using the regular cast functions,
but now that you point it out, it's clear that relying on casts would fail.

> Now that scalarltsel and friends are just trivial wrappers for a
> common function, we could imagine exposing scalarineqsel_wrapper as a
> non-static function, with more arguments (and perhaps a better-chosen
> name ;-)). The idea would be for extensions that want to go this
> extra mile to provide their own selectivity estimation functions,
> which again would just be trivial wrappers for the core function, but
> would provide additional knowledge through additional arguments.
> 
> The additional arguments I'm envisioning are a couple of C function 
> pointers, one function that knows how to convert the operator's 
> left-hand input type to scalar, and one function that knows how to
> convert the right-hand type to scalar. (Identical APIs of course.) 
> Passing a NULL would imply that the core code must fall back on its 
> own devices for that input.
> 
> Now the thing about convert_to_scalar is that there are several
> different conversion conventions already (numeric, string, timestamp,
> ...), and there probably could be more once extension types are
> coming to the party. So I'm imagining that the API for these
> conversion functions could be something like
> 
>   bool convert(Datum value, Oid valuetypeid,
>                int *conversion_convention, double *converted_value);
> 
> The conversion_convention output would make use of some agreed-on 
> constants, say 1 for numeric, 2 for string, etc etc. In the core 
> code, if either converter fails (returns false) or if they don't 
> return the same conversion_convention code, we give up and use the 
> mid-bucket default. If they both produce results with the same 
> conversion_convention, then we can treat the converted_values as 
> commensurable.
> 

OK, so instead of re-implementing the whole function, we'd essentially
just do something like this:

typedef bool (*convert_callback) (Datum value, Oid valuetypeid,
                                  int *conversion_convention,
                                  double *converted_value);

double
scalarineqsel(PlannerInfo *root, Oid operator,
              bool isgt, bool iseq, VariableStatData *vardata,
              Datum constval, Oid consttype,
              convert_callback convert);

and then, from the extension

double
scalarineqsel_wrapper(PlannerInfo *root, Oid operator,
                      bool isgt, bool iseq, VariableStatData *vardata,
                      Datum constval, Oid consttype)
{
    return scalarineqsel(root, operator, isgt, iseq, vardata,
                         constval, consttype, my_convert_func);
}
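
For fixeddecimal, the converter might then be as simple as this (a
sketch only - FIXEDDECIMALOID, CONVERT_NUMERIC and
FIXEDDECIMAL_MULTIPLIER are placeholder names, assuming the scaled-int64
representation the data type uses):

static bool
my_convert_func(Datum value, Oid valuetypeid,
                int *conversion_convention, double *converted_value)
{
    if (valuetypeid == FIXEDDECIMALOID)
    {
        /* fixeddecimal keeps its value in a scaled int64 */
        *conversion_convention = CONVERT_NUMERIC;
        *converted_value =
            (double) DatumGetInt64(value) / FIXEDDECIMAL_MULTIPLIER;
        return true;
    }

    /* unknown type - let the core code fall back to its own devices */
    return false;
}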

Sounds reasonable to me, I guess - I can't really think of anything
simpler that would give us the same flexibility.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] user-defined numeric data types triggering ERROR: unsupported type

2017-09-21 Thread Tom Lane
Tomas Vondra  writes:
> [ scalarineqsel may fall over when used by extension operators ]

I concur with your thought that we could have ineq_histogram_selectivity
fall back to a "mid bucket" default if it's working with a datatype it
is unable to convert_to_scalar.  But I think if we're going to touch this
at all, we ought to have higher ambition than that, and try to provide a
mechanism whereby an extension that's willing to work a bit harder could
get that additional increment of estimation accuracy.  I don't care for
this way to do that:

> * Make convert_numeric_to_scalar smarter, so that it checks whether there
> is an implicit cast to numeric, and fails only if it does not find one.

because it's expensive, and it only works for numeric-category cases,
and it will fail outright for numbers outside the range of "double".
(Notice that convert_numeric_to_scalar is *not* calling the regular
cast function for numeric-to-double.)  Moreover, any operator ought to
know what types it can accept; we shouldn't have to do more catalog
lookups to figure out what to do.

Now that scalarltsel and friends are just trivial wrappers for a common
function, we could imagine exposing scalarineqsel_wrapper as a non-static
function, with more arguments (and perhaps a better-chosen name ;-)).
The idea would be for extensions that want to go this extra mile to
provide their own selectivity estimation functions, which again would
just be trivial wrappers for the core function, but would provide
additional knowledge through additional arguments.

The additional arguments I'm envisioning are a couple of C function
pointers, one function that knows how to convert the operator's
left-hand input type to scalar, and one function that knows how
to convert the right-hand type to scalar.  (Identical APIs of course.)
Passing a NULL would imply that the core code must fall back on its
own devices for that input.

Now the thing about convert_to_scalar is that there are several different
conversion conventions already (numeric, string, timestamp, ...), and
there probably could be more once extension types are coming to the party.
So I'm imagining that the API for these conversion functions could be
something like

bool convert(Datum value, Oid valuetypeid,
             int *conversion_convention, double *converted_value);

The conversion_convention output would make use of some agreed-on
constants, say 1 for numeric, 2 for string, etc etc.  In the core
code, if either converter fails (returns false) or if they don't
return the same conversion_convention code, we give up and use the
mid-bucket default.  If they both produce results with the same
conversion_convention, then we can treat the converted_values as
commensurable.
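
Roughly, and purely as a sketch with invented names, the core side
could then do something like this:

/* convert_callback is the proposed function-pointer type shown above */
static bool
converted_values_commensurable(convert_callback var_convert,
                               convert_callback const_convert,
                               Datum histval, Oid histtype,
                               double *scaled_hist,
                               Datum constval, Oid consttype,
                               double *scaled_const)
{
    int         hist_conv;
    int         const_conv;

    /* NULL callbacks ("use the core code's own devices") omitted here */
    if (!var_convert(histval, histtype, &hist_conv, scaled_hist))
        return false;
    if (!const_convert(constval, consttype, &const_conv, scaled_const))
        return false;

    /* commensurable only if both reported the same convention */
    return hist_conv == const_conv;
}

If that returns false, ineq_histogram_selectivity just falls back to the
mid-bucket default for that bucket.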

regards, tom lane


[HACKERS] user-defined numeric data types triggering ERROR: unsupported type

2017-09-19 Thread Tomas Vondra
Hi,

while testing a custom data type, FIXEDDECIMAL [1], which implements a
numeric-like data type with limited range, I ran into several issues
that I suspect may not be entirely intentional / expected behavior.

[1] https://github.com/2ndQuadrant/fixeddecimal

Attached is a minimal subset of the extension SQL definition, which may
be more convenient when looking into the issue.

The most important issue is that when planning a simple query, the
estimation of range queries on a column with the custom data type fails
like this:

test=# create table t (a fixeddecimal);
CREATE TABLE
test=# insert into t select random() from generate_series(1,10);
INSERT 0 10
test=# analyze t;
ANALYZE
test=# select * from t where a > 0.9;
ERROR:  unsupported type: 16385

The error message here comes from convert_numeric_to_scalar, which gets
called during histogram processing (ineq_histogram_selectivity) when
approximating the histogram. convert_to_scalar does this:

switch (valuetypid)
{
  ...
  case NUMERICOID:
  ...
    /* the probe constant converts fine - it really is numeric */
    *scaledvalue = convert_numeric_to_scalar(value, valuetypid);
    /* the histogram bounds are fixeddecimal, so these two error out */
    *scaledlobound = convert_numeric_to_scalar(lobound, boundstypid);
    *scaledhibound = convert_numeric_to_scalar(hibound, boundstypid);
    return true;

  ...
}

The first call works fine, as the constant really is numeric
(valuetypid=1700). But the histogram boundaries are of the custom data
type, causing the error, as convert_numeric_to_scalar handles only a
bunch of hard-coded data types. So it's pretty much guaranteed to fail
with any user-defined data type.

This seems a bit unfortunate :-(

One solution would be to implement a custom estimation function,
replacing scalarltsel/scalargtsel. But that seems rather unnecessary,
especially considering there is an implicit cast from fixeddecimal to
numeric. Another oddity is that when there's just an MCV list (and no
histogram), the estimation works just fine.

So I see two basic ways to fix this:

* Make convert_numeric_to_scalar smarter, so that it checks whether there
is an implicit cast to numeric, and fails only if it does not find one.

* Make convert_to_scalar smarter, so that it returns false for unexpected
data types, letting ineq_histogram_selectivity use the default estimate
of 0.5 (for that one bucket). A rough sketch of this follows below.
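
Just to illustrate the second option (a sketch only - the bool *failure
parameter is made up, not what selfuncs.c has today): the convert_*
helpers would report failure instead of throwing, convert_to_scalar
would check the flag and return false, and ineq_histogram_selectivity
would end up using 0.5 for that one bucket.

static double
convert_numeric_to_scalar(Datum value, Oid typid, bool *failure)
{
    switch (typid)
    {
        case INT4OID:
            return (double) DatumGetInt32(value);
        case NUMERICOID:
            /* out-of-range values get clamped to +/-HUGE_VAL */
            return (double)
                DatumGetFloat8(DirectFunctionCall1(numeric_float8_no_overflow,
                                                   value));
        /* ... the other hard-coded types stay as they are ... */
        default:
            /* instead of elog(ERROR, "unsupported type: %u", typid) */
            *failure = true;
            return 0;
    }
}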

Both options seem more favorable than what's happening currently. Is
there anything I missed, making those fixes unacceptable?

If anything, the fact that MCV estimates work while histogram estimates
do not makes this somewhat unpredictable - a change in the data
distribution (or perhaps even just a different sample in ANALYZE) may
result in sudden failures.


I ran into one additional strange thing while investigating this. The
attached SQL script defines two operator classes - fixeddecimal_ops and
fixeddecimal_numeric_ops, defining (fixeddecimal,fixeddecimal) and
(fixeddecimal,numeric) operators. Dropping one of those operator classes
changes the estimates in somewhat suspicious ways.

When only fixeddecimal_ops is defined, we get this:

test=# explain select * from t where a > 0.1;
                      QUERY PLAN
------------------------------------------------------
 Seq Scan on t  (cost=0.00..1943.00 rows=3 width=8)
   Filter: ((a)::numeric > 0.1)
(2 rows)

That is, we get the default estimate for inequality clauses, 33%. But
when only fixeddecimal_numeric_ops is defined, we get this:

test=# explain select * from t where a > 0.1;
                      QUERY PLAN
------------------------------------------------------
 Seq Scan on t  (cost=0.00..1693.00 rows=5 width=8)
   Filter: (a > 0.1)
(2 rows)

That is, we get a 50% estimate, because that's what scalarineqsel uses
when ineq_histogram_selectivity can't compute the selectivity from a
histogram for some reason.

Wouldn't it make more sense to use the default 33% estimate here?


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Attachment: fixeddecimal-minimal.sql (application/sql)
