Re: Improving inferred query column names

2023-03-02 Thread Peter Eisentraut

On 22.02.23 21:38, Andres Freund wrote:

On 2023-02-20 16:08:00 +0100, Peter Eisentraut wrote:

On 11.02.23 20:24, Andres Freund wrote:
I think we should just do it and not care about what breaks.  There has
never been any guarantee about these.

FWIW, "most" other SQL implementations appear to generate column names like

SELECT SUM(reads), SUM(writes) FROM pg_stat_io;
column names: "SUM(reads)", "SUM(writes)"

Hm, personally I don't like leaving in parens in the names, that makes it
unnecessarily hard to reference the columns.  sum_reads imo is more usable
than than "SUM(reads)".


If you want something without special characters, the example you gave 
is manageable, but what are you going to do with


SELECT a, b, a * b, a / b FROM ...

or

SELECT a, b, SUM(a * b) FROM ...

and so on.  What would be the actual rule to produce the output you want?

I think a question here is what "usable" means in this context.

If you want a name that you can refer to (in a client API, for example), 
you should give it a name explicitly.


I think the uses for the automatic names are that they look pretty and 
meaningful in visual output (psql, pgadmin, etc.).  In that context, I 
think it is ok to use special characters without limitation, since you 
are just going to look at the thing, not type it back in.






Re: Improving inferred query column names

2023-02-23 Thread Joe Conway

On 2/22/23 23:03, Tom Lane wrote:

Andres Freund  writes:

We could just do something like printing __. So
something like avg(reltuples / relpages) would end up as
avg_reltuples_float48div_relpages.
Whether that's worth it, or whether column name lengths would be too painful,
IDK.


I think you'd soon be hitting NAMEDATALEN limits ...




Probably an unpalatable idea, but if we did something like 
md5('avg(reltuples / relpages)') for the column name, it would be 
(reasonably) unique and deterministic. Not pretty, but possibly useful 
in some cases.




--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Improving inferred query column names

2023-02-22 Thread Tom Lane
Andres Freund  writes:
> On 2023-02-22 16:38:51 -0500, Tom Lane wrote:
>> The proposal so far was just to handle a function call wrapped
>> around something else by converting to the function name followed
>> by whatever we'd emit for the something else.

> SELECT sum(relpages), sum(reltuples), 1+1 FROM pg_class;
> ┌──┬───┬──┐
> │ sum_relpages │ sum_reltuples │ ?column? │
> ├──┼───┼──┤

So far so good, but what about multi-argument functions?
Do we do "f_x_y_z", and truncate wherever?  How well will this
work with nested function calls?

>> You cannot realistically
>> handle, say, operator expressions without emitting names that will
>> require quoting, which doesn't seem attractive.

> Well, it doesn't require much to be better than "?column?", which already
> requires quoting...

I think the point of "?column?" is to use something that nobody's going
to want to reference that way, quoted or otherwise.  The SQL spec says
(in SQL:2021, it's 7.16  syntax rule 18) that if the
column expression is anything more complex than a simple column reference
(or SQL parameter reference, which I think we don't support) then the
column name is implementation-dependent, which is standards-ese for
"here be dragons".

BTW, SQL92 and SQL99 had a further constraint:

c) Otherwise, the  of the i-th column of the  is implementation-dependent and different
  from the  of any column, other than itself, of
  a table referenced by any  contained in the
  SQL-statement.

We never tried to implement that literally, and now I'm glad we didn't
bother, because recent spec versions only say "implementation-dependent",
full stop.  In any case, the spec is clearly in the camp of "don't depend
on these column names".

> We could just do something like printing __. So
> something like avg(reltuples / relpages) would end up as
> avg_reltuples_float48div_relpages.
> Whether that's worth it, or whether column name lengths would be too painful,
> IDK.

I think you'd soon be hitting NAMEDATALEN limits ...

>> And no, deduplication isn't on the table at all here.

> +1

I remembered while looking at the spec that duplicate column names
in SELECT output are not only allowed but *required* by the spec.
If you write, say, "SELECT 1 AS x, 2 AS x, ..." then the column
names of those two columns are both "x", no wiggle room at all.
So I see little point in trying to deduplicate generated names,
even aside from the points you made.

regards, tom lane




Re: Improving inferred query column names

2023-02-22 Thread Andres Freund
Hi,

On 2023-02-22 16:38:51 -0500, Tom Lane wrote:
> Vladimir Churyukin  writes:
> > It doesn't need to be perfect, but it needs to be consistent. So far you
> > proposed a rule to replace () with _. What is the plan for expressions, how
> > to convert them to names (with deduplication I guess?, because there could
> > be 2 similar expressions mapped to the same name potentially).
> 
> I do not think we need to do anything for arbitrary expressions.

Exactly. It's not like they have a useful name today.  Nor are they unique.


> The proposal so far was just to handle a function call wrapped
> around something else by converting to the function name followed
> by whatever we'd emit for the something else.

Just to showcase that better, what I think we're discussing is changing:

SELECT sum(relpages), sum(reltuples), 1+1 FROM pg_class;
┌──┬┬──┐
│ sum  │  sum   │ ?column? │
├──┼┼──┤
│ 2774 │ 257896 │2 │
└──┴┴──┘
(1 row)

to

SELECT sum(relpages), sum(reltuples), 1+1 FROM pg_class;
┌──┬───┬──┐
│ sum_relpages │ sum_reltuples │ ?column? │
├──┼───┼──┤
│ 2774 │257896 │2 │
└──┴───┴──┘
(1 row)


> You cannot realistically
> handle, say, operator expressions without emitting names that will
> require quoting, which doesn't seem attractive.

Well, it doesn't require much to be better than "?column?", which already
requires quoting...

We could just do something like printing __. So
something like avg(reltuples / relpages) would end up as
avg_reltuples_float48div_relpages.

Whether that's worth it, or whether column name lengths would be too painful,
IDK.


> And no, deduplication isn't on the table at all here.

+1

Greetings,

Andres Freund




Re: Improving inferred query column names

2023-02-22 Thread Tom Lane
Vladimir Churyukin  writes:
> It doesn't need to be perfect, but it needs to be consistent. So far you
> proposed a rule to replace () with _. What is the plan for expressions, how
> to convert them to names (with deduplication I guess?, because there could
> be 2 similar expressions mapped to the same name potentially).

I do not think we need to do anything for arbitrary expressions.
The proposal so far was just to handle a function call wrapped
around something else by converting to the function name followed
by whatever we'd emit for the something else.  You cannot realistically
handle, say, operator expressions without emitting names that will
require quoting, which doesn't seem attractive.

And no, deduplication isn't on the table at all here.

regards, tom lane




Re: Improving inferred query column names

2023-02-22 Thread Vladimir Churyukin
On Wed, Feb 22, 2023, 12:40 PM Andres Freund  wrote:

> Hi,
>
> On 2023-02-11 12:47:04 -0800, Vladimir Churyukin wrote:
> > That is a good idea for simple cases, I'm just curious how it would look
> > like for more complex cases (you can have all kinds of expressions as
> > parameters for aggregate function calls).
> > If it works only for simple cases, I think it would be confusing and not
> > very helpful.
>
> I don't think it needs to be perfect to be helpful.
>


It doesn't need to be perfect, but it needs to be consistent. So far you
proposed a rule to replace () with _. What is the plan for expressions, how
to convert them to names (with deduplication I guess?, because there could
be 2 similar expressions mapped to the same name potentially).


> > Wouldn't it make more sense to just deduplicate the names by adding
> > numerical postfixes, like sum_1, sum_2?
>
> That'd be considerably worse than what we do today imo, because any
> reordering
> / added aggregate would lead to everything else changing as well.
>


Ok, that I kinda agree with. Not necessarily worse overall, but worse for
some cases. Well, the proposal above about keeping the names exactly the
same as the full expressions is probably the best we can do then. It will
take care of possible duplications and won't be position-sensitive. And
will be consistent. The only issue is somewhat unusual column names that
you will have to use quotes to refer to. But is that a real issue?

-Vladimir Churyukin

>


Re: Improving inferred query column names

2023-02-22 Thread Andres Freund
Hi,

On 2023-02-11 12:47:04 -0800, Vladimir Churyukin wrote:
> That is a good idea for simple cases, I'm just curious how it would look
> like for more complex cases (you can have all kinds of expressions as
> parameters for aggregate function calls).
> If it works only for simple cases, I think it would be confusing and not
> very helpful.

I don't think it needs to be perfect to be helpful.


> Wouldn't it make more sense to just deduplicate the names by adding
> numerical postfixes, like sum_1, sum_2?

That'd be considerably worse than what we do today imo, because any reordering
/ added aggregate would lead to everything else changing as well.


Greetings,

Andres Freund




Re: Improving inferred query column names

2023-02-22 Thread Andres Freund
Hi,

On 2023-02-20 16:08:00 +0100, Peter Eisentraut wrote:
> On 11.02.23 20:24, Andres Freund wrote:
> I think we should just do it and not care about what breaks.  There has
> never been any guarantee about these.
> 
> FWIW, "most" other SQL implementations appear to generate column names like
> 
> SELECT SUM(reads), SUM(writes) FROM pg_stat_io;
> column names: "SUM(reads)", "SUM(writes)"

Hm, personally I don't like leaving in parens in the names, that makes it
unnecessarily hard to reference the columns.  sum_reads imo is more usable
than than "SUM(reads)".


> We had a colleague look into this a little while ago, and it got pretty
> tedious to implement this for all the expression types.

Hm, any chance that colleague could be pointed at this discussion and chime
in? It doesn't immediately look that hard to do substantially better than
today. Of course there's an approximately endless amount of effort that could
be poured into this, but even some fairly basic improvements seem like a big
win.


> And, you know, the bikeshedding ...

Indeed. I already started above :)

Greetings,

Andres Freund




Re: Improving inferred query column names

2023-02-22 Thread Peter Eisentraut

On 20.02.23 16:17, David G. Johnston wrote:

I think we should just do it and not care about what breaks.  There has
never been any guarantee about these.


I'm going to toss a -1 into the ring but if this does go through a 
strong request that it be disabled via a GUC.  The ugliness of that 
option is why we shouldn't do this.


Defacto reality is still a reality we are on the hook for.

I too find the legacy design choice to be annoying but not so much that 
changing it seems like a good idea.


Well, a small backward compatibility GUC might not be too cumbersome.





Re: Improving inferred query column names

2023-02-20 Thread David G. Johnston
On Mon, Feb 20, 2023 at 8:08 AM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> On 11.02.23 20:24, Andres Freund wrote:
> >
> > I think on a green field it'd be clearly better to do something like the
> > above.  What does give me pause is that it seems quite likely to break
> > existing queries, and to a lesser degree, might break applications
> relying on
> > inferred column names
> >
> > Can anybody think of a good way out of that? It's not like that problem
> is
> > going to go away at some point...
>
> I think we should just do it and not care about what breaks.  There has
> never been any guarantee about these.
>
>
I'm going to toss a -1 into the ring but if this does go through a strong
request that it be disabled via a GUC.  The ugliness of that option is why
we shouldn't do this.

Defacto reality is still a reality we are on the hook for.

I too find the legacy design choice to be annoying but not so much that
changing it seems like a good idea.

David J.


Re: Improving inferred query column names

2023-02-20 Thread Peter Eisentraut

On 11.02.23 20:24, Andres Freund wrote:

Not useful column names:
SELECT SUM(reads), SUM(writes) FROM pg_stat_io;
column names: sum, sum

So i often end up manually writing:
SELECT SUM(reads) AS sum_reads, SUM(writes) AS sum_writes, ... FROM pg_stat_io;

Of course we can't infer useful column names for everything, but for something
like this, it should't be too hard to do better. E.g. by combining the
function name with the column name in the argument, if a single plain column
is the argument.

I think on a green field it'd be clearly better to do something like the
above.  What does give me pause is that it seems quite likely to break
existing queries, and to a lesser degree, might break applications relying on
inferred column names

Can anybody think of a good way out of that? It's not like that problem is
going to go away at some point...


I think we should just do it and not care about what breaks.  There has 
never been any guarantee about these.


FWIW, "most" other SQL implementations appear to generate column names like

SELECT SUM(reads), SUM(writes) FROM pg_stat_io;
column names: "SUM(reads)", "SUM(writes)"

(various capitalization of course).

We had a colleague look into this a little while ago, and it got pretty 
tedious to implement this for all the expression types.  And, you know, 
the bikeshedding ...


But I'm all in favor of improving this.





Re: Improving inferred query column names

2023-02-11 Thread Corey Huinker
On Sat, Feb 11, 2023 at 3:47 PM Vladimir Churyukin 
wrote:

> For backwards compatibility I guess you can have a GUC flag controlling
> that behavior that can be set into backwards compatibility mode if required.
> The previous functionality can be declared deprecated and removed (with
> the flag) once the current version becomes unsupported.
>

Seems more like a per-session setting than a GUC.

Here's a suggestion off the top of my head.

We create a session setting inferred_column_name_template.

The template takes a formatting directive %N which is just a counter

SET inferred_column_name_template = 'col_%N'


which would give you col_1, col_2, regardless of what kind of expression
the columns were

We could introduce another directive, %T

SET inferred_column_name_template = '%T_%N'


which prints the datatype short name of the column. In this case, %N would
increment per datatype, so text_1, integer_1, text_2, timestamptz_1, text_3

Getting fancier, we could introduce something less datatype centric, %F

SET inferred_column_name_template = '%F_%N'


Which would walk the following waterfall and stop on the first match

   1. The datatype short name if the expression is explicitly casted
(either CAST or ::)
   2. the name of the function if the outermost expression was a function
(aggregate, window, or scalar),  so sum_1, substr_1
   3. 'case' if the outermost expression was  case
   4. 'expr' if the expression was effectively an operator (  SELECT 3+4,
'a'  || 'b' etc)
   5. the datatype short name for anything that doesn't match any of the
previous, and for explicit casts


Keeping track of all the %N counters could get silly, so maybe a %P which
is simply the numeric column position of the column, so your result set
would go like:  id, name, col_3, last_login, col_5.

We would have to account for the case where the user left either %N or %P
out of the template, so one of them would be an implied suffix if both were
absent, or we maybe go with

SET inferred_column_name_prefix = '%F_';
SET inferred_column_name_counter = 'position';   /* position, counter,
per_type_counter */

Or we just cook up a few predefined naming schemes, and let the user pick
from those.

One caution I have is that I have seen several enterprise app database
designs that have lots of user-customizable columns with names like
varchar1, numeric4, etc. Presumably the user would know their environment
and not pick a confusing template.


Re: Improving inferred query column names

2023-02-11 Thread Vladimir Churyukin
That is a good idea for simple cases, I'm just curious how it would look
like for more complex cases (you can have all kinds of expressions as
parameters for aggregate function calls).
If it works only for simple cases, I think it would be confusing and not
very helpful.
Wouldn't it make more sense to just deduplicate the names by adding
numerical postfixes, like sum_1, sum_2?
For backwards compatibility I guess you can have a GUC flag controlling
that behavior that can be set into backwards compatibility mode if required.
The previous functionality can be declared deprecated and removed (with the
flag) once the current version becomes unsupported.
(or with a different deprecation policy, I'm not sure what is the general
rule for breaking changes and deprecation currently).
If there is a clearly defined deprecation policy and a backwards
compatibility option, it should be good, no? Just my 2 cents.

-Vladimir Churyukin

On Sat, Feb 11, 2023 at 11:24 AM Andres Freund  wrote:

> Hi,
>
> A common annoyance when writing ad-hoc analytics queries is column naming
> once
> aggregates are used.
>
> Useful column names:
> SELECT reads, writes FROM pg_stat_io;
> column names: reads, writes
>
> Not useful column names:
> SELECT SUM(reads), SUM(writes) FROM pg_stat_io;
> column names: sum, sum
>
> So i often end up manually writing:
> SELECT SUM(reads) AS sum_reads, SUM(writes) AS sum_writes, ... FROM
> pg_stat_io;
>
>
> Of course we can't infer useful column names for everything, but for
> something
> like this, it should't be too hard to do better. E.g. by combining the
> function name with the column name in the argument, if a single plain
> column
> is the argument.
>
> I think on a green field it'd be clearly better to do something like the
> above.  What does give me pause is that it seems quite likely to break
> existing queries, and to a lesser degree, might break applications relying
> on
> inferred column names
>
> Can anybody think of a good way out of that? It's not like that problem is
> going to go away at some point...
>
> Greetings,
>
> Andres Freund
>
>
>


Improving inferred query column names

2023-02-11 Thread Andres Freund
Hi,

A common annoyance when writing ad-hoc analytics queries is column naming once
aggregates are used.

Useful column names:
SELECT reads, writes FROM pg_stat_io;
column names: reads, writes

Not useful column names:
SELECT SUM(reads), SUM(writes) FROM pg_stat_io;
column names: sum, sum

So i often end up manually writing:
SELECT SUM(reads) AS sum_reads, SUM(writes) AS sum_writes, ... FROM pg_stat_io;


Of course we can't infer useful column names for everything, but for something
like this, it should't be too hard to do better. E.g. by combining the
function name with the column name in the argument, if a single plain column
is the argument.

I think on a green field it'd be clearly better to do something like the
above.  What does give me pause is that it seems quite likely to break
existing queries, and to a lesser degree, might break applications relying on
inferred column names

Can anybody think of a good way out of that? It's not like that problem is
going to go away at some point...

Greetings,

Andres Freund