Re: [HACKERS] TODO: Fix CREATE CAST on DOMAINs

2006-09-21 Thread Martijn van Oosterhout
On Wed, Sep 20, 2006 at 10:56:08AM -0700, Mark Dilger wrote:
 If the system chooses cast chains based on a breadth-first search, then the 
 existing int2 - int8 cast would be chosen over an int2 - int4 - int8 
 chain, or an int2 - int3 - int4 - int8 chain, or in fact any chain at 
 all, because the int2 - int8 cast is the shortest.

But we're not talking about a search here, we don't always know where
the endpoint is. Imagine you have the following three functions:

abs(int8)
abs(float4)
abs(numeric)

And you have an int2. Which is the best cast to use? What's the answer
if you have a float8? What if it's an unknown type text string?

Now, consider that functions can have up to 32 arguments and that this
resolution might have to be applied to each argument and you find that
searching is going to get very expensive very quickly.

The current system of requiring only a single step is at least
predictable. If you have the choice between:

- first argument matches, second needs three safe conversions, and
- first argument need one unsafe conversion, second matches exactly

Which is cheaper?

To make this manageable you have to keep the number of types you can
cast to small, or you'll get lost in the possibilites. Adding just a
single step domain to base type conversion seems pretty safe, but
anything more is going to be hard.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] TODO: Fix CREATE CAST on DOMAINs

2006-09-21 Thread Mark Dilger

Martijn van Oosterhout wrote:

On Wed, Sep 20, 2006 at 10:56:08AM -0700, Mark Dilger wrote:
If the system chooses cast chains based on a breadth-first search, then the 
existing int2 - int8 cast would be chosen over an int2 - int4 - int8 
chain, or an int2 - int3 - int4 - int8 chain, or in fact any chain at 
all, because the int2 - int8 cast is the shortest.


But we're not talking about a search here, we don't always know where
the endpoint is. Imagine you have the following three functions:

abs(int8)
abs(float4)
abs(numeric)

And you have an int2. Which is the best cast to use? What's the answer
if you have a float8? What if it's an unknown type text string?

Now, consider that functions can have up to 32 arguments and that this
resolution might have to be applied to each argument and you find that
searching is going to get very expensive very quickly.

The current system of requiring only a single step is at least
predictable. If you have the choice between:

- first argument matches, second needs three safe conversions, and
- first argument need one unsafe conversion, second matches exactly

Which is cheaper?

To make this manageable you have to keep the number of types you can
cast to small, or you'll get lost in the possibilites. Adding just a
single step domain to base type conversion seems pretty safe, but
anything more is going to be hard.

Have a nice day,


The searching never needs to be done at runtime.  It should be computable at 
cast creation time.  A new cast creates a potential bridge between any two types 
in the system.  Using a shortest path algorithm, the best chain (if any exists) 
from one type to another can be computed and pre-compiled, right?


So, assume the following already exists:

Types A,B,C, fully connected with casts A-B, B-A, A-C, C-A, B-C, C-B, with 
some marked IMPLICIT, some marked EXPLICIT, and some marked SAFE.


Types X,Y,Z, also fully connected with casts, as above.

Then assume someone comes along and creates a new type M with conversions A-M, 
M-A, X-M, and M-X.  At the time that type and those casts are added to the 
system, the system could calculate any additional casts to/from B, C, Y, and Z. 
 A simple implementation (but maybe not optimal) would be for the system to 
autogenerate code like:


CREATE FUNCTION cast_M_Y (arg M) RETURNS Y AS $$
SELECT arg::X::Y;
$$ LANGUAGE SQL;
CREATE CAST (M AS Y) WITH FUNCTION cast_M_Y(M) [ AS ASSIGNMENT | AS IMPLICIT ]

And then load that function and cast.  The only real trick seems to be 
determining the rules for which cast chain gets used within that autogenerated 
function, and whether the generated cast is IMPLICIT, EXPLICIT, or ASSIGNMENT.




Looking over what I have just written, another idea pops up.  To avoid having 
the system decide which casts are reasonable, you could extend the syntax and 
allow an easy shorthand for the user.  Something like:


CREATE CAST (M AS A)
WITH FUNCTION cast_M_A
AS ASSIGNMENT
PROPOGATES TO B AS ASSIGNMENT,
PROPOGATES TO C AS ASSIGNMENT;

CREATE CAST (A AS M)
WITH FUNCTION cast_A_M
AS ASSIGNMENT
PROPOGATES FROM B,
PROPOGATES FROM C;

And then the casts from M-B, M-C, B-M, and C-M would all be added to the 
system.

Thoughts?

mark


---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] TODO: Fix CREATE CAST on DOMAINs

2006-09-20 Thread Tom Lane
Gevik Babakhani [EMAIL PROTECTED] writes:
 First I would like to know how PG's code looked like without the
 domains.

IIRC, as far as the datatype coercion and operator/function resolution
code were concerned, the domain patch basically consisted of dropping
getBaseType() calls in at a bunch of choke points, so that these
routines all treat a domain the same as its base type.

I was never real happy about this, mainly because it adds extra syscache
lookups that buy you nothing when you're not using domains.

What would be nice is to revert all that, and instead have domain
creation insert explicit cast entries between a domain and its base type
into pg_cast, so that domains don't require any special-case code in
this part of the system.  The problem with that glib answer is that
maintaining anything like the current behavior would seem to require
that we abandon the current principle that we consider only one-step
cast pathways when trying to match arguments to functions.  (For
instance, a varchar value can be passed to a text-taking function
because varchar-text is an implicit cast according to pg_cast.  What
of a domain over varchar?)  And allowing multi-step casts to be chosen
automatically is unbelievably scary --- I think it would basically
destroy the system's ability to resolve overloaded functions at all,
because you can get from any type to any other if you allow enough cast
steps.  (The unreasonably large number of implicit casts to text aren't
helping any here :-()

So the hard part of this doesn't really require any understanding of
code at all.  What we need is a proposal for an algorithm that loosens
the casting rules just enough to make explicit pg_cast entries for
domains work the way we would like them to, without wholesale breakage
of situations that have nothing to do with domains.  See
http://developer.postgresql.org/pgdocs/postgres/typeconv.html

Rereading what I just wrote, it might be as simple as allowing a
two-step cast in certain cases, only if the first step is a domain to
base type coercion (which we assume would be specially marked in
pg_cast).  But the devil is in the details ... and anyway there might
be a cleaner approach than that.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] TODO: Fix CREATE CAST on DOMAINs

2006-09-20 Thread Andrew Dunstan

Tom Lane wrote:

So the hard part of this doesn't really require any understanding of
code at all.  What we need is a proposal for an algorithm that loosens
the casting rules just enough to make explicit pg_cast entries for
domains work the way we would like them to, without wholesale breakage
of situations that have nothing to do with domains.  See
http://developer.postgresql.org/pgdocs/postgres/typeconv.html

Rereading what I just wrote, it might be as simple as allowing a
two-step cast in certain cases, only if the first step is a domain to
base type coercion (which we assume would be specially marked in
pg_cast).  


FWIW, before I got to this paragraph that was the thought that 
immediately occurred to me.


cheers

andrew

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] TODO: Fix CREATE CAST on DOMAINs

2006-09-20 Thread Mark Dilger

Tom Lane wrote:

Rereading what I just wrote, it might be as simple as allowing a
two-step cast in certain cases, only if the first step is a domain to
base type coercion (which we assume would be specially marked in
pg_cast).  But the devil is in the details ... and anyway there might
be a cleaner approach than that.


ISTM casts from a domain to their base type are fundamentally different from 
casts between types.  In general, casting TYPE_X to TYPE_Y requires malloc'ing 
memory for TYPE_Y, and converting the data of TYPE_X into TYPE_Y, possibly with 
loss of accuracy or correctness, etc.  (4-byte or less types are handled on the 
stack, not the heap, but that seems irrelevant to me and I'm only mentioning it 
here to head off any replies along those lines.)  Certainly, having the system 
chain together lots of implicit casts of this sort is scary.  But casting a 
domain to its base type never involves loss of accuracy or correctness, right? 
(Casting from the base type to the domain might not work, on account of the 
domain restrictions forbidding the particular value stored in the base.)


Perhaps we need to be able to register casts with more information than just 
IMPLICIT vs. EXPLICIT.  Perhaps we also need something like SAFE or some other 
term, and then have a rule that no chain of casts chosen by the system (as 
opposed to specified by the user) can contain more than one IMPLICIT cast, but 
can contain unlimited many SAFE casts.


When a domain is created, a SAFE cast from the domain to its base type could 
automatically be generated.


Casts between the existing varchar(n) to text could be marked as SAFE, given 
that the underlying storage scheme for varchar(n) is the same as text.  (Casts 
from text to varchar(n) are not SAFE, because the text might be too long to fit.)


Casts from int2 - int4, int2 - int8, and int4 - int8 would all be SAFE, I 
think, because they are not lossy.  But perhaps I have not thought enough about 
this and these should be IMPLICIT rather than SAFE.


Casts from non-text types to text would remain IMPLICIT, I expect.

If a user created their own type, such as the recent discussion of an int3 type, 
 they could also create an int3 - int4 cast marked as SAFE, and from int2 - 
int3 marked as SAFE, and from int3 - int2 marked as EXPLICIT, and from int4 - 
int3 marked as EXPLICIT, and could avoid writing all the casts to other integral 
types.


(I've pretty much abandoned the idea of an int3 type because my testing 
convinced me there were no performance advantages to it.  But it serves ok as an 
 example.)


mark

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] TODO: Fix CREATE CAST on DOMAINs

2006-09-20 Thread Mark Dilger

Mark Dilger wrote:
Casts from int2 - int4, int2 - int8, and int4 - int8 would all be 
SAFE, I think, because they are not lossy.  But perhaps I have not 
thought enough about this and these should be IMPLICIT rather than SAFE.


I have thought about this some more.  I think these are indeed SAFE.  The 
distinction between SAFE and IMPLICIT should not, I think, be whether the 
storage type is identical, but rather whether there is any possible loss of 
precision, range, accuracy, etc., or whether there is any change in the 
fundamental interpretation of the data when cast from the source to destination 
type.


The built-in cast from int2 - int4, which is currently IMPLICIT, cannot lose 
any information, nor can it cause the data to be interpreted differently. 
Therefore it is SAFE.


The built-in cast from int8 - float8 is currently marked as IMPLICIT, but since 
 a large integer value which is cast in this fashion might be somewhat altered, 
it is not SAFE.  It is also interpreted differently, since floating point 
numbers are typically interpreted as approximations, whereas integers are 
interpreted as exact.  (Hence the tendency to compare integers for equality, but 
not floats.)


Can anyone think of examples where chaining together SAFE casts would cause 
problems, using the guidelines for what is SAFE outlined above?


mark

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] TODO: Fix CREATE CAST on DOMAINs

2006-09-20 Thread Tom Lane
Mark Dilger [EMAIL PROTECTED] writes:
 Mark Dilger wrote:
 Casts from int2 - int4, int2 - int8, and int4 - int8 would all be 
 SAFE, I think, because they are not lossy.  But perhaps I have not 
 thought enough about this and these should be IMPLICIT rather than SAFE.

 I have thought about this some more.  I think these are indeed SAFE.  The 
 distinction between SAFE and IMPLICIT should not, I think, be whether the 
 storage type is identical, but rather whether there is any possible loss of 
 precision, range, accuracy, etc., or whether there is any change in the 
 fundamental interpretation of the data when cast from the source to 
 destination 
 type.

You are going in exactly the wrong direction --- this line of thought is
aiming to make *more* casts possible by default, which is not what we
need, at least not among the collection of base types.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] TODO: Fix CREATE CAST on DOMAINs

2006-09-20 Thread Mark Dilger

Tom Lane wrote:

Mark Dilger [EMAIL PROTECTED] writes:

Mark Dilger wrote:
Casts from int2 - int4, int2 - int8, and int4 - int8 would all be 
SAFE, I think, because they are not lossy.  But perhaps I have not 
thought enough about this and these should be IMPLICIT rather than SAFE.


I have thought about this some more.  I think these are indeed SAFE.  The 
distinction between SAFE and IMPLICIT should not, I think, be whether the 
storage type is identical, but rather whether there is any possible loss of 
precision, range, accuracy, etc., or whether there is any change in the 
fundamental interpretation of the data when cast from the source to destination 
type.


You are going in exactly the wrong direction --- this line of thought is
aiming to make *more* casts possible by default, which is not what we
need, at least not among the collection of base types.



If I understand correctly, you are worried about two issues:  ambiguity and 
performance.  You don't want the system to be slower from the extra searching 
needed to find possible multiple step casts, and you don't want any new 
ambiguity where the system can't deterministically decide which choice of 
cast(s) should be used.  Is that right?


If the system chooses cast chains based on a breadth-first search, then the 
existing int2 - int8 cast would be chosen over an int2 - int4 - int8 chain, 
or an int2 - int3 - int4 - int8 chain, or in fact any chain at all, because 
the int2 - int8 cast is the shortest.


So the code to search chains should only be invoked in what would currently be 
an *error condition*, that being that the SQL includes a request for a cast that 
cannot be resolved without chaining.


Since the chaining code would be new, and the rules for it would be new, we can 
still design them however we like (within reason.)  I would propose:


1) Shorter chains trump longer chains.

2) When comparing two equal length chains, one made entirely of SAFE casts 
trumps one which contains an IMPLICIT cast.


3) When two or more chains remain that cannot be resolved under the above two 
rules, the SQL is considered ambiguous and an error condition is raised.


I don't see how this would break any existing valid SQL.  But it seems like it 
would solve both the DOMAIN problem you mentioned and the oft lamented problem 
that adding a new datatype requires quadratically many casts to the system.


mark

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] TODO: Fix CREATE CAST on DOMAINs

2006-09-20 Thread Tom Lane
Mark Dilger [EMAIL PROTECTED] writes:
 If the system chooses cast chains based on a breadth-first search,
 then the existing int2 - int8 cast would be chosen over an int2 -
 int4 - int8 chain, or an int2 - int3 - int4 - int8 chain, or in
 fact any chain at all, because the int2 - int8 cast is the shortest.

Well, this is the sort of thing that has to be thought about pretty
carefully.  Is length of chain the most appropriate metric?  What are
you going to do when resolving a multi-input operator or function, and
there are different ways to match different candidates with different
sets of path lengths?

There's been some prior discussion of attaching a measure of goodness
to different potential cast pathways.  I'm too lazy to look it up at the
moment but I strongly suggest whoever wants to produce a concrete
proposal in this area should go check the archives.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] TODO: Fix CREATE CAST on DOMAINs

2006-09-20 Thread Martijn van Oosterhout
On Wed, Sep 20, 2006 at 09:31:48AM -0700, Mark Dilger wrote:
 Perhaps we need to be able to register casts with more information than 
 just IMPLICIT vs. EXPLICIT.  Perhaps we also need something like SAFE or 
 some other term, and then have a rule that no chain of casts chosen by the 
 system (as opposed to specified by the user) can contain more than one 
 IMPLICIT cast, but can contain unlimited many SAFE casts.

Currently cast have three types, implicit, explicit and assignment.

You don't want to have an unlimited number of anything. Cast lookups
are expensive enough as it is.

 Casts from int2 - int4, int2 - int8, and int4 - int8 would all be SAFE, 
 I think, because they are not lossy.  But perhaps I have not thought enough 
 about this and these should be IMPLICIT rather than SAFE.

Yeah, but the trick is you want, with the above casts to only ever
produce the *shortest* path. That's what makes it expensive.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] TODO: Fix CREATE CAST on DOMAINs

2006-09-20 Thread Martijn van Oosterhout
On Wed, Sep 20, 2006 at 10:26:55AM -0700, Mark Dilger wrote:
 I have thought about this some more.  I think these are indeed SAFE.  The 
 distinction between SAFE and IMPLICIT should not, I think, be whether the 
 storage type is identical, but rather whether there is any possible loss of 
 precision, range, accuracy, etc., or whether there is any change in the 
 fundamental interpretation of the data when cast from the source to 
 destination type.

My question is whether there should be any implicit casts that are not
safe. Your example int8 - float8 being implicit is I think an error
and we should wonder why that cast implicit now anyway.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] TODO: Fix CREATE CAST on DOMAINs

2006-09-20 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 My question is whether there should be any implicit casts that are not
 safe. Your example int8 - float8 being implicit is I think an error
 and we should wonder why that cast implicit now anyway.

Because the SQL spec requires it.  You are not required to write a cast
to add an exact and an approximate quantity, and the spec says the
result is approximate.

Trying to design this stuff purely according to abstract notions of
elegance of the cast rules isn't going to work out well --- we have
both spec requirements and backwards compatibility to worry about.

Now we do have the flexibility to alter the default contents of pg_cast
--- there could be more or fewer entries in there than there are now,
if the type coercion rules are altered to do less or more automatically
than they do now.  But the end-result behavior needs to wind up being
pretty darn near the same thing, at least within the numeric type
category (I'm not as certain that we have the other ones right, but the
numeric category has been *very* heavily scrutinized and beat upon).
The only thing I really want to see changing is the behavior for domain
types --- and even there, the default behavior when there are no
user-created domain-specific operators or casts has to stay the same.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] TODO: Fix CREATE CAST on DOMAINs

2006-09-20 Thread Mark Dilger

Tom Lane wrote:

Now we do have the flexibility to alter the default contents of pg_cast
--- there could be more or fewer entries in there than there are now,
if the type coercion rules are altered to do less or more automatically
than they do now.  But the end-result behavior needs to wind up being
pretty darn near the same thing, at least within the numeric type
category (I'm not as certain that we have the other ones right, but the
numeric category has been *very* heavily scrutinized and beat upon).
The only thing I really want to see changing is the behavior for domain
types --- and even there, the default behavior when there are no
user-created domain-specific operators or casts has to stay the same.


Your suggestion upthread that domains have two-step casts (from domain to base, 
then from base to whatever) is what got my attention.  I don't like the idea of 
having an interim solution to that subset of the problem if it might get in the 
way of solving the general problem later.  But perhaps it can be argued that no 
cruftiness would result from the special case code for casting domains to their 
base types?


mark


---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] TODO: Fix CREATE CAST on DOMAINs

2006-09-20 Thread Gevik Babakhani
 Trying to design this stuff purely according to abstract notions of
 elegance of the cast rules isn't going to work out well --- we have
 both spec requirements and backwards compatibility to worry about.
 
 Now we do have the flexibility to alter the default contents of pg_cast
 --- there could be more or fewer entries in there than there are now,
 if the type coercion rules are altered to do less or more automatically
 than they do now.  But the end-result behavior needs to wind up being
 pretty darn near the same thing, at least within the numeric type
 category (I'm not as certain that we have the other ones right, but the
 numeric category has been *very* heavily scrutinized and beat upon).
 The only thing I really want to see changing is the behavior for domain
 types --- and even there, the default behavior when there are no
 user-created domain-specific operators or casts has to stay the same.

Trying to solve this problem requires more investigation having spec
requirements and backwards compatibility etc.. etc.. in mind.

After reading the thread, I think there are some interesting
similarities, ideas (or even techniques) used in OO languages like JAVA
and C# regarding internal handling when type boxing and type casting. (I
would like to think domains as inherited classes of their super or the
base class.)

I will come back with more thoughts after I have investigated a thing or
two.  Say tuned


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly