Re: [HACKERS] Sum aggregate calculation for single precsion real

2017-02-17 Thread Tom Lane
Robert Haas  writes:
> On Fri, Feb 17, 2017 at 6:51 AM, Greg Stark  wrote:
>> Moreover, it wouldn't be hard to make sum(float4) use a float8 as an
>> accumulator and then cast to float4 for the final state. That would be
>> 100% compatible with the existing behaviour aside from producing more
>> accurate results.

> Sure, but if somebody wants that, they can get it already just by a
> minor change to the SQL.  The point is that adding up float4 as float4
> is a reasonable thing to do.  Adding it as float8 might also be a
> reasonable thing to do, but nobody's keeping anybody from doing that.

Also, if we changed sum(float4) to work that way, it would become very
hard to sum float4 in float4 --- you'd pretty much have to build your
own aggregate function, which is a lot harder than just inserting a
cast.  The argument for changing this boils down to "nobody would ever
want that", which I do not think I buy.  It's been a long time since
I studied numerical analysis, but I think there are applications where
you do want to do that.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sum aggregate calculation for single precsion real

2017-02-17 Thread Robert Haas
On Fri, Feb 17, 2017 at 6:51 AM, Greg Stark  wrote:
> On 15 February 2017 at 12:52, Robert Haas  wrote:
>> Personally, I find it somewhere in the middle: I think the way it
>> works now is reasonable, and I think what he wants would have been
>> reasonable as well.  However, I find it hard to believe it would be
>> worth changing now on backward compatibility grounds.
>
> Honestly I was quite surprised to discover that sum(float4) worked
> this way because I assumed it followed the same pattern as integers.
>
> But I wonder what you mean by backward compatibility grounds. If
> sum(float4) returned a float8 the only compatibility issue would be
> someone who did "create table as" and then expected to get a float4
> column and instead got a float8 column. That seems like a much more
> minor corner case than most any other backward incompatible change in
> any release.

That is what I mean, yes.  Plus somebody could have a view defined,
and the types would change.  I think it's only possible to consider
that kind of thing as minor because most people don't use float4 for
anything.  Changing the SQL types of things tends to result in fairly
wide-ranging SQL breakage.

> Moreover, it wouldn't be hard to make sum(float4) use a float8 as an
> accumulator and then cast to float4 for the final state. That would be
> 100% compatible with the existing behaviour aside from producing more
> accurate results.

Sure, but if somebody wants that, they can get it already just by a
minor change to the SQL.  The point is that adding up float4 as float4
is a reasonable thing to do.  Adding it as float8 might also be a
reasonable thing to do, but nobody's keeping anybody from doing that.
It's just not the behavior you get out of the box.

> (Though as an aside, I think Konstantin would be much better served by
> using integers and storing cents or whatever unit of currency is small
> enough. That would actually result in accurate results which neither
> float4 nor float8 guarantee.)

Yep.  Or changing the SQL to do sum(f4::float8)::float4, if that's
what he wants.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sum aggregate calculation for single precsion real

2017-02-17 Thread Greg Stark
On 15 February 2017 at 12:52, Robert Haas  wrote:
> Personally, I find it somewhere in the middle: I think the way it
> works now is reasonable, and I think what he wants would have been
> reasonable as well.  However, I find it hard to believe it would be
> worth changing now on backward compatibility grounds.

Honestly I was quite surprised to discover that sum(float4) worked
this way because I assumed it followed the same pattern as integers.

But I wonder what you mean by backward compatibility grounds. If
sum(float4) returned a float8 the only compatibility issue would be
someone who did "create table as" and then expected to get a float4
column and instead got a float8 column. That seems like a much more
minor corner case than most any other backward incompatible change in
any release.

Moreover, it wouldn't be hard to make sum(float4) use a float8 as an
accumulator and then cast to float4 for the final state. That would be
100% compatible with the existing behaviour aside from producing more
accurate results.

(Though as an aside, I think Konstantin would be much better served by
using integers and storing cents or whatever unit of currency is small
enough. That would actually result in accurate results which neither
float4 nor float8 guarantee.)

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sum aggregate calculation for single precsion real

2017-02-15 Thread Claudio Freire
On Wed, Feb 15, 2017 at 9:52 AM, Robert Haas  wrote:
> On Tue, Feb 14, 2017 at 11:45 PM, Tom Lane  wrote:
>> You could perhaps make an argument that sum(float4) would have less risk
>> of overflow if it accumulated in and returned float8, but frankly that
>> seems a bit thin.
>
> I think that's more or less the argument Konstantin is in fact making.
> Whether it's a good argument or a thin one is a value judgement.
> Personally, I find it somewhere in the middle: I think the way it
> works now is reasonable, and I think what he wants would have been
> reasonable as well.  However, I find it hard to believe it would be
> worth changing now on backward compatibility grounds.  He doesn't like
> the way it works currently, but we have no way of knowing how many
> people who are happy with the way it works today would become unhappy
> if we changed it.  We need a fairly compelling reason to risk breaking
> somebody's SQL, and I don't think this rises to that level.


I know this is said from time to time in this list, but a third option
that wouldn't break anybody's SQL would be using compensated summation
in the input type.

AFAIK, that can only increase precision, but it will cost cycles. The
impact could however fall below the noise and be worth a try.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sum aggregate calculation for single precsion real

2017-02-15 Thread Tom Lane
Robert Haas  writes:
> I think that's more or less the argument Konstantin is in fact making.
> Whether it's a good argument or a thin one is a value judgement.
> Personally, I find it somewhere in the middle: I think the way it
> works now is reasonable, and I think what he wants would have been
> reasonable as well.  However, I find it hard to believe it would be
> worth changing now on backward compatibility grounds.

I agree with that conclusion.  It might have been reasonable to change
it fifteen years ago when we changed the integer sum() implementations,
but I think doing so now would make more people unhappy than it would
make happy; or at least, there's little reason to think the reverse
is true.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sum aggregate calculation for single precsion real

2017-02-15 Thread Robert Haas
On Tue, Feb 14, 2017 at 11:45 PM, Tom Lane  wrote:
> You could perhaps make an argument that sum(float4) would have less risk
> of overflow if it accumulated in and returned float8, but frankly that
> seems a bit thin.

I think that's more or less the argument Konstantin is in fact making.
Whether it's a good argument or a thin one is a value judgement.
Personally, I find it somewhere in the middle: I think the way it
works now is reasonable, and I think what he wants would have been
reasonable as well.  However, I find it hard to believe it would be
worth changing now on backward compatibility grounds.  He doesn't like
the way it works currently, but we have no way of knowing how many
people who are happy with the way it works today would become unhappy
if we changed it.  We need a fairly compelling reason to risk breaking
somebody's SQL, and I don't think this rises to that level.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sum aggregate calculation for single precsion real

2017-02-14 Thread Konstantin Knizhnik



On 14.02.2017 16:59, Jim Nasby wrote:

On 2/13/17 10:45 AM, Konstantin Knizhnik wrote:

It is not true - please notice query execution time of this two queries:


I bet you'd get even less difference if you simply cast to float8 
instead of adding 0.0. Same result, no floating point addition.



The expectation for SUM(float4) is that you want speed and are
prepared to cope with the consequences.  It's easy enough to cast your
input to float8 if you want a wider accumulator, or to numeric if
you'd like more stable (not necessarily more accurate :-() results.
I do not think it's the database's job to make those choices for you.


From my point of your it is strange and wrong expectation.
I am choosing "float4" type for a column just because it is enough to
represent range of data I have and I need to minimize size of record.


In other words, you've decided to trade accuracy for performance...


Could not agree with it...
1. If I choose float4 type to store bid price (which usually has 5-6 
significant digits) - I do not loose precision and accuracy is not suffered.
The accuracy is important when I am calculating sum of prices. But here 
the assumption that accuracy of sum calculation should depend on type of 
summed field
is non obvious. May be it is more or less clear for C programmers but 
not for SQL users.
In all database I have tested SUM  of single precision floats is 
calculated at least using double precision numbers (or using numeric type).


2. There is no huge gap in performance between accumulating  in float4 
and float8. There are no "orders of magnitude":

postgres=# select sum(l_quantity) from lineitem_projection;
 sum
-
 1.07374e+09
(1 row)

Time: 4659.509 ms (00:04.660)

postgres=# select sum(l_quantity::float8) from lineitem_projection;
sum

 1529738036
(1 row)

Time: 5465.320 ms (00:05.465)


So do not think that there is actually compromise here between 
performance and accuracy.
But current implementation cause leads to many confusions and 
contradictions with users expectations:


1. The fact that sum(l_quantity) and sum(l_quantity::float8) are 
absolutely different (1.5 times!!! - we loose 0.5 milliard dollars:)
2. avg(l_quantity)*count(l_quantity) is not equal to sum(l_quantity) But 
in case of casting to float8 result is the same.
3. sum of aggregates for groups is not equal to total sum (once again no 
problem for float8 type_/



But when I am calculating sum, I expect to receive more or less precise
result. Certainly I realize that even in case of using double it is


... but now you want to trade performance for accuracy? Why would you 
expect the database to magically come to that conclusion?


Se above. No trading here. Please notice that current Postgres 
implementation of AVG aggregates calculates at sum and sum of squares 
even if last one is not needed for AVG.

The comment in the code says:

 * It might seem attractive to optimize this by having multiple accumulator
 * functions that only calculate the sums actually needed.  But on most
 * modern machines, a couple of extra floating-point multiplies will be
 * insignificant compared to the other per-tuple overhead, so I've chosen
 * to minimize code space instead.

And it is true!
In the addition to the results above I can add AVG timing for AVG 
calculation:


postgres=# select avg(l_quantity) from lineitem_projection;
   avg
--
 25.5015621964919
(1 row)

postgres=# select avg(l_quantity::float8) from lineitem_projection;
   avg
--
 25.5015621964919
(1 row)

Please notice that avg for float is calculated using float4_accum which 
use float8 accumulator and also calculates sumX2!



Time: 6103.807 ms (00:06.104)



So I do not see reasonable arguments here for using float4pl for 
sum(float4)!

And I do not know any database which has such strange behavior.
I know that "be as others" or especially "be as Oracle" are never good 
argument for Postgres community but doing something differently (and 
IMHO  wrong) without any significant reasons seems to be very strange.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Sum aggregate calculation for single precsion real

2017-02-14 Thread Tom Lane
Robert Haas  writes:
> Well put.  Although it's worth noting that we aren't 100% consistent
> about this stuff: sum(smallint), sum(integer), and sum(bigint) all use
> an output data type different from the input data type, but other
> versions of sum() don't.

In those cases I believe the main reason for the different output type is
that there's a significant risk of overflow if we don't.  See commits
bec98a31c and 5f7c2bdb5 for some history.

You could perhaps make an argument that sum(float4) would have less risk
of overflow if it accumulated in and returned float8, but frankly that
seems a bit thin.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sum aggregate calculation for single precsion real

2017-02-14 Thread Robert Haas
On Tue, Feb 14, 2017 at 8:59 AM, Jim Nasby  wrote:
>> From my point of your it is strange and wrong expectation.
>> I am choosing "float4" type for a column just because it is enough to
>> represent range of data I have and I need to minimize size of record.
>
> In other words, you've decided to trade accuracy for performance...
>
>> But when I am calculating sum, I expect to receive more or less precise
>> result. Certainly I realize that even in case of using double it is
>
> ... but now you want to trade performance for accuracy? Why would you expect
> the database to magically come to that conclusion?

Well put.  Although it's worth noting that we aren't 100% consistent
about this stuff: sum(smallint), sum(integer), and sum(bigint) all use
an output data type different from the input data type, but other
versions of sum() don't.  To some extent all of these decisions are
just guesses about what users will find useful, and as this thread
shows, not everybody's going to agree.  But I don't think our guesses
are flagrantly unreasonable or anything.

There's also nothing to prevent Konstantin or anybody else who doesn't
like the default behavior to create their own version of sum(float4)
and put it in a schema that's listed before pg_catalog in search_path.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sum aggregate calculation for single precsion real

2017-02-14 Thread Jim Nasby

On 2/13/17 10:45 AM, Konstantin Knizhnik wrote:

It is not true - please notice query execution time of this two queries:


I bet you'd get even less difference if you simply cast to float8 
instead of adding 0.0. Same result, no floating point addition.



The expectation for SUM(float4) is that you want speed and are
prepared to cope with the consequences.  It's easy enough to cast your
input to float8 if you want a wider accumulator, or to numeric if
you'd like more stable (not necessarily more accurate :-() results.
I do not think it's the database's job to make those choices for you.


From my point of your it is strange and wrong expectation.
I am choosing "float4" type for a column just because it is enough to
represent range of data I have and I need to minimize size of record.


In other words, you've decided to trade accuracy for performance...


But when I am calculating sum, I expect to receive more or less precise
result. Certainly I realize that even in case of using double it is


... but now you want to trade performance for accuracy? Why would you 
expect the database to magically come to that conclusion?

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sum aggregate calculation for single precsion real

2017-02-13 Thread Konstantin Knizhnik



On 13.02.2017 19:20, Tom Lane wrote:

Konstantin Knizhnik  writes:

I wonder why SUM aggregate is calculated for real (float4) type using
floating point accumulator?

If you can't deal with the vagaries of floating-point arithmetic, you
shouldn't be storing your data in float format.  Use numeric.


4-byte floats are widely used for example in trading applications just 
because it is two times shorter then double and range of stored data is 
relatively small (do not need a lot of significant digits). At the same 
time volume of stored data is very large and switching from float4 to 
float8 will almost double it. It requires two times more storage and 
almost two times increase query execution time.

So this is not acceptable answer.




Are there are reasons of using float4pl function for SUM aggregate instead of 
float4_accum?

The latter is probably a good two orders of magnitude slower, and it
wouldn't really do much to solve the inherent accuracy problems of
adding float4 values that have a wide dynamic range.


It is not true - please notice query execution time of this two queries:

postgres=# select sum(l_quantity)  from lineitem where l_shipdate <= 
'1998-12-01';

 sum
-
 1.52688e+09
(1 row)

Time: 2858.852 ms
postgres=# select sum(l_quantity+0.0)  from lineitem where l_shipdate <= 
'1998-12-01';

sum

 1529738036
(1 row)

Time: 3174.529 ms


Looks like now in Postgres aggregate calculation itself is not a 
bottleneck, comparing with tuple deform cost.





The expectation for SUM(float4) is that you want speed and are
prepared to cope with the consequences.  It's easy enough to cast your
input to float8 if you want a wider accumulator, or to numeric if
you'd like more stable (not necessarily more accurate :-() results.
I do not think it's the database's job to make those choices for you.


From my point of your it is strange and wrong expectation.
I am choosing "float4" type for a column just because it is enough to 
represent range of data I have and I need to minimize size of record.
But when I am calculating sum, I expect to receive more or less precise 
result. Certainly I realize that even in case of using double it is 
possible to loose precision while calculation and result may depend on 
sum order (if we add very small and very larger values). But in real use 
cases (for example in trading data) such large difference in attribute 
values is very rare. If you have, for example, stock price, then it is 
very unlikely that one company has value 0.01 and another 1000.0
At least in TPC-H example (which certainly deal with dummy generated 
data), double type produce "almost price" result.




regards, tom lane


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Sum aggregate calculation for single precsion real

2017-02-13 Thread Tom Lane
Konstantin Knizhnik  writes:
> I wonder why SUM aggregate is calculated for real (float4) type using 
> floating point accumulator?

If you can't deal with the vagaries of floating-point arithmetic, you
shouldn't be storing your data in float format.  Use numeric.

> Are there are reasons of using float4pl function for SUM aggregate instead of 
> float4_accum?

The latter is probably a good two orders of magnitude slower, and it
wouldn't really do much to solve the inherent accuracy problems of
adding float4 values that have a wide dynamic range.

The expectation for SUM(float4) is that you want speed and are
prepared to cope with the consequences.  It's easy enough to cast your
input to float8 if you want a wider accumulator, or to numeric if
you'd like more stable (not necessarily more accurate :-() results.
I do not think it's the database's job to make those choices for you.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Sum aggregate calculation for single precsion real

2017-02-13 Thread Konstantin Knizhnik

Hi hackers,

I wonder why SUM aggregate is calculated for real (float4) type using 
floating point accumulator?

It cause very confusing and unexpected behavior:

-- postgres=# select sum(l_quantity)  from lineitem where l_shipdate <= 
'1998-12-01';
 sum
-
 1.52688e+09
(1 row)

postgres=# select sum(l_quantity+0.0)  from lineitem where l_shipdate <= 
'1998-12-01';
sum

 1529738036


It is specified in any SQL standard how aggregates should be calculated?
At least Oracle and MS-SQL are calculating SUM for single precision type in 
different (and more natual) way.
Are there are reasons of using float4pl function for SUM aggregate instead of 
float4_accum?



Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers