Re: [HACKERS] Fix number skipping in to_number

2017-11-12 Thread Tom Lane
Oliver Ford  writes:
> [ 0001-apply-number-v3.patch ]

I looked at this patch briefly and have a couple of comments:

* It seems entirely wrong to be matching to L_thousands_sep in the
NUM_COMMA case; that format code is by definition not locale aware,
so it should be matching to plain ',' independently of locale.

* Don't we need to fix the NUM_L (currency symbol) case in the
same manner?  (The NUM_D and NUM_S cases are handled in
NUM_numpart_from_char and seem ok at a quick glance.)

* I'm not in love with the noadd flag.  Other places in this
switch that want to skip the final increment do it with
"continue", and I think this should do likewise.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix number skipping in to_number

2017-10-02 Thread Daniel Gustafsson
> On 25 Sep 2017, at 02:52, Nathan Wagner  wrote:
> 
> On Thu, Aug 17, 2017 at 12:33:02PM +0100, Oliver Ford wrote:
> 
>> Ok I've made that change in the attached v3. I'm not sure as I'm on
>> en_US.UTF-8 locale too. Maybe something Windows specific?
> 
> This patch applies against master (8485a25a), compiles, and
> passes a make check.
> 
> I tested both on my mac laptop, and my linux server.
> 
> If we want this patch, I'd say it's ready for committer.  We may want
> (and I can't believe I'm saying this) more discussion as to exactly what
> the strategy for to_number() (and friends) is.

Based on this review, I am moving this patch to the next commitfest and will
update it to Ready for committer.  There might, as you say, be more discussion
needed, but due to the lack of extensive such so far I’ll move it for now.

cheers ./daniel

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix number skipping in to_number

2017-09-26 Thread Nathan Wagner
On Mon, Sep 25, 2017 at 07:52:19PM +0100, Oliver Ford wrote:

> Thanks for your review. The issue is that Oracle throws errors on many
> more input cases than Postgres does, so making it exactly like Oracle
> could break a lot of existing users. E.g. to_number ('123,000', '999')
> returns '123' on Postgres, but throws an error on Oracle. So making it
> exactly Oracle-like could break existing users who might rely on the
> current behavior.

I wouldn't use to_number for anything other than oracle compatibility,
and then just so I didn't have to wade through all the ported oracle
code.  I would use a regex or some such to clean up the number, and then
cast the result.  For an input string of '123,000' I might do a
translate('123,000', ',', '')::integer or perhaps use regexp_replace().
Either way I would want a more positive decision as to what was valid or
not, based on the input.

> My view is that we shouldn't deliberately introduce errors in order to be
> exactly like Oracle if we don't currently and there's a sane use case for
> current behavior. Do you have any examples of results that are different
> between Oracle and Postgres, and you think the Oracle result makes more
> sense?

Not really, other than I think an error report might make more sense.
'123,000' doesn't really match the format of '999'.  If anything it
seems like we're guessing rather than validating input.  It is
surprising (to me at least) that

to_char(to_number('123,000', '999'), '999')

doesn't give us the original input (in the sense that identical formats
don't preserve the original string).  So I'm not sure the current
behavior is a sane use case, but perhaps more people are using
to_number() to get *some* numeric result, rather than for wanting it to
be like oracle.  I would generally prefer to throw an exception instead
of getting a number I wasn't expecting, but I can see cases where that
might not be the case.

-- 
nw


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix number skipping in to_number

2017-09-25 Thread Oliver Ford
On Monday, 25 September 2017, Nathan Wagner  wrote:

> On Thu, Aug 17, 2017 at 12:33:02PM +0100, Oliver Ford wrote:
>
> > Ok I've made that change in the attached v3. I'm not sure as I'm on
> > en_US.UTF-8 locale too. Maybe something Windows specific?
>
> This patch applies against master (8485a25a), compiles, and
> passes a make check.
>
> I tested both on my mac laptop, and my linux server.
>
> If we want this patch, I'd say it's ready for committer.  We may want
> (and I can't believe I'm saying this) more discussion as to exactly what
> the strategy for to_number() (and friends) is.  Do we want to duplicate
> Oracle's functionality, or do we want a similar function to do similar
> things, without necessarily having a goal of identical behavior to
> oracle?
>
> For myself, I pretty much never use the to_date, to_number, or
> to_timestamp functions except when porting oracle code.  I do use the
> to_char functions on occasion.  If strftime were available, I probably
> wouldn't use them.
>
> I would commit this patch and update the TODO with a goal of making
> to_number as Oracle compatible as is reasonable.
>
>
Thanks for your review. The issue is that Oracle throws errors on many more
input cases than Postgres does, so making it exactly like Oracle could
break a lot of existing users. E.g. to_number ('123,000', '999') returns
'123' on Postgres, but throws an error on Oracle. So making it exactly
Oracle-like could break existing users who might rely on the current
behavior.

My view is that we shouldn't deliberately introduce errors in order to be
exactly like Oracle if we don't currently and there's a sane use case for
current behavior. Do you have any examples of results that are different
between Oracle and Postgres, and you think the Oracle result makes more
sense?


Re: [HACKERS] Fix number skipping in to_number

2017-09-24 Thread Nathan Wagner
On Thu, Aug 17, 2017 at 12:33:02PM +0100, Oliver Ford wrote:
 
> Ok I've made that change in the attached v3. I'm not sure as I'm on
> en_US.UTF-8 locale too. Maybe something Windows specific?

This patch applies against master (8485a25a), compiles, and
passes a make check.

I tested both on my mac laptop, and my linux server.

If we want this patch, I'd say it's ready for committer.  We may want
(and I can't believe I'm saying this) more discussion as to exactly what
the strategy for to_number() (and friends) is.  Do we want to duplicate
Oracle's functionality, or do we want a similar function to do similar
things, without necessarily having a goal of identical behavior to
oracle?

For myself, I pretty much never use the to_date, to_number, or
to_timestamp functions except when porting oracle code.  I do use the
to_char functions on occasion.  If strftime were available, I probably
wouldn't use them.

I would commit this patch and update the TODO with a goal of making
to_number as Oracle compatible as is reasonable.

-- 
nw


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix number skipping in to_number

2017-08-17 Thread Oliver Ford
On Thu, Aug 17, 2017 at 11:55 AM, Thomas Munro
 wrote:

> Hmm.  Just in case my macOS laptop (CC=Apple's clang,
> LANG=en_NZ.UTF-8) was unduly affected by cosmic rays I tried on a
> couple of nearby servers running FreeBSD 11.1 (CC=clang, LANG=)
> and CentOS 7.3 (CC=gcc, LANG=en_US.utf-8) and got the same result:
> int8 has lost some whitespace in to_char output.
>
> It looks a bit like it has lost a leading space for every ',' that was
> present in the format string but which didn't finish up getting output
> (because the number was too small).  It looks a bit like that doesn't
> happen for 'G', so let's compare the NUM_COMMA and NUM_G cases.  Ahh..
> I'm not sure, but I think you might have a close-brace in the wrong
> place here:
>
> @@ -4879,6 +4883,10 @@ NUM_processor(FormatNode *node, NUMDesc *Num,
> char *inout,
> continue;
> }
> }<--- this guy here might
> need to move after "noadd = true"?
> +   if
> (strncmp(Np->inout_p, Np->L_thousands_sep, separator_len) == 0)
> +   Np->inout_p +=
> separator_len - 1;
> +   else
> +   noadd = true;
> break;
>
> case NUM_G:
>
> With that change the test passes for me.  But why do we get different results?
>
> --
> Thomas Munro
> http://www.enterprisedb.com

Ok I've made that change in the attached v3. I'm not sure as I'm on
en_US.UTF-8 locale too. Maybe something Windows specific?


0001-apply-number-v3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix number skipping in to_number

2017-08-17 Thread Thomas Munro
On Thu, Aug 17, 2017 at 9:48 PM, Oliver Ford  wrote:
>> The tests you added pass for me but the int8 test now fails with the
>> following (this is from regression.diff after running
>> 'PG_REGRESS_DIFF_OPTS="-dU10" make check').  It looks like some new
>> whitespace is appearing on the right in the output of to_char().  I
>> didn't try to understand why.
>>
>> @@ -453,34 +453,34 @@
>>  --+--
>>   4567890123456789 | 4567890123456789
>>  (1 row)
>>
>>  -- TO_CHAR()
>>  --
>>  SELECT '' AS to_char_1, to_char(q1, '9G999G999G999G999G999'),
>> to_char(q2, '9,999,999,999,999,999')
>> FROM INT8_TBL;
>>   to_char_1 |to_char |to_char
>>  ---++
>> -   |123 |456
>> +   |123 |   456
>> |123 |  4,567,890,123,456,789
>> -   |  4,567,890,123,456,789 |123
>> +   |  4,567,890,123,456,789 |   123
>> |  4,567,890,123,456,789 |  4,567,890,123,456,789
>> |  4,567,890,123,456,789 | -4,567,890,123,456,789
>>  (5 rows)
>>
>>  SELECT '' AS to_char_2, to_char(q1, '9G999G999G999G999G999D999G999'),
>> to_char(q2, '9,999,999,999,999,999.999,999')
>> FROM INT8_TBL;
>>   to_char_2 |to_char |to_char
>>  
>> ---++
>> -   |123.000,000 |456.000,000
>> +   |123.000,000 |   456.000,000
>> |123.000,000 |  4,567,890,123,456,789.000,000
>> -   |  4,567,890,123,456,789.000,000 |123.000,000
>> +   |  4,567,890,123,456,789.000,000 |   123.000,000
>> |  4,567,890,123,456,789.000,000 |  4,567,890,123,456,789.000,000
>> |  4,567,890,123,456,789.000,000 | -4,567,890,123,456,789.000,000
>>  (5 rows)
>>
>>  SELECT '' AS to_char_3, to_char( (q1 * -1), 'PR'),
>> to_char( (q2 * -1), '.999PR')
>> FROM INT8_TBL;
>>   to_char_3 |  to_char   |to_char
>>  ---++
>> |  <123> |  <456.000>
>> |  <123> | <4567890123456789.000>
>>
>
> That's strange, I can't replicate that issue on my Windows build. I've
> tried with 'PG_REGRESS_DIFF_OPTS="-dU10" make check' and all the tests
> (including int8) pass fine. The spacing in the results is perfectly
> normal. I wonder if something else on your build is causing this? I've
> also tried several "make check" options for different locales
> mentioned in the docs and they pass fine.

Hmm.  Just in case my macOS laptop (CC=Apple's clang,
LANG=en_NZ.UTF-8) was unduly affected by cosmic rays I tried on a
couple of nearby servers running FreeBSD 11.1 (CC=clang, LANG=)
and CentOS 7.3 (CC=gcc, LANG=en_US.utf-8) and got the same result:
int8 has lost some whitespace in to_char output.

It looks a bit like it has lost a leading space for every ',' that was
present in the format string but which didn't finish up getting output
(because the number was too small).  It looks a bit like that doesn't
happen for 'G', so let's compare the NUM_COMMA and NUM_G cases.  Ahh..
I'm not sure, but I think you might have a close-brace in the wrong
place here:

@@ -4879,6 +4883,10 @@ NUM_processor(FormatNode *node, NUMDesc *Num,
char *inout,
continue;
}
}<--- this guy here might
need to move after "noadd = true"?
+   if
(strncmp(Np->inout_p, Np->L_thousands_sep, separator_len) == 0)
+   Np->inout_p +=
separator_len - 1;
+   else
+   noadd = true;
break;

case NUM_G:

With that change the test passes for me.  But why do we get different results?

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix number skipping in to_number

2017-08-17 Thread Oliver Ford
> The tests you added pass for me but the int8 test now fails with the
> following (this is from regression.diff after running
> 'PG_REGRESS_DIFF_OPTS="-dU10" make check').  It looks like some new
> whitespace is appearing on the right in the output of to_char().  I
> didn't try to understand why.
>
> @@ -453,34 +453,34 @@
>  --+--
>   4567890123456789 | 4567890123456789
>  (1 row)
>
>  -- TO_CHAR()
>  --
>  SELECT '' AS to_char_1, to_char(q1, '9G999G999G999G999G999'),
> to_char(q2, '9,999,999,999,999,999')
> FROM INT8_TBL;
>   to_char_1 |to_char |to_char
>  ---++
> -   |123 |456
> +   |123 |   456
> |123 |  4,567,890,123,456,789
> -   |  4,567,890,123,456,789 |123
> +   |  4,567,890,123,456,789 |   123
> |  4,567,890,123,456,789 |  4,567,890,123,456,789
> |  4,567,890,123,456,789 | -4,567,890,123,456,789
>  (5 rows)
>
>  SELECT '' AS to_char_2, to_char(q1, '9G999G999G999G999G999D999G999'),
> to_char(q2, '9,999,999,999,999,999.999,999')
> FROM INT8_TBL;
>   to_char_2 |to_char |to_char
>  ---++
> -   |123.000,000 |456.000,000
> +   |123.000,000 |   456.000,000
> |123.000,000 |  4,567,890,123,456,789.000,000
> -   |  4,567,890,123,456,789.000,000 |123.000,000
> +   |  4,567,890,123,456,789.000,000 |   123.000,000
> |  4,567,890,123,456,789.000,000 |  4,567,890,123,456,789.000,000
> |  4,567,890,123,456,789.000,000 | -4,567,890,123,456,789.000,000
>  (5 rows)
>
>  SELECT '' AS to_char_3, to_char( (q1 * -1), 'PR'),
> to_char( (q2 * -1), '.999PR')
> FROM INT8_TBL;
>   to_char_3 |  to_char   |to_char
>  ---++
> |  <123> |  <456.000>
> |  <123> | <4567890123456789.000>
>

That's strange, I can't replicate that issue on my Windows build. I've
tried with 'PG_REGRESS_DIFF_OPTS="-dU10" make check' and all the tests
(including int8) pass fine. The spacing in the results is perfectly
normal. I wonder if something else on your build is causing this? I've
also tried several "make check" options for different locales
mentioned in the docs and they pass fine.

>
> One superficial comment after first glimpse at the patch:
>
> +if(!strncmp(Np->inout_p, Np->L_thousands_sep, separator_len))
>
> I believe the usual coding around here would be if (strncmp(...) == 0)
>

Yes you're right, that is the coding standard. I've changed it to that
in the attached v2.


0001-apply-number-v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix number skipping in to_number

2017-08-16 Thread Thomas Munro
On Thu, Aug 10, 2017 at 10:21 PM, Oliver Ford  wrote:
> Prevents an issue where numbers can be skipped in the to_number()
> function when the format mask contains a "G" or a "," but the input
> string doesn't contain a separator. This resolves the TODO item "Fix
> to_number() handling for values not matching the format string".

>From the TODO I found the previous discussion here:

https://www.postgresql.org/message-id/flat/be46a4f30909212157o71dc82bep7e074f9fa7eb1d14%40mail.gmail.com

There were some votes against changing it and some votes for.  I don't
see what's good about the current behaviour.  It's hard (though not
impossible) to imagine that someone is depending on that weirdness,
and it's harder to believe that anyone wants that behaviour knowingly.
For people migrating from Oracle or writing application portable to
both, the current behaviour sucks.  It's not a SQL standard feature,
it's a be-like-Oracle feature, so it should be like Oracle.  Aside
from that, by the principle of least astonishment alone I think it's
pretty clear that "a separator goes here" shouldn't mean "eat a
digit".

So +1 from me.

> == Change ==
>
> Currently, if the format mask in to_number() has a "G" or a ",", it
> will assume that the input string contains a separator character in
> the same place. If however a number is there instead, that number will
> be silently skipped and not appear in the output. So we get:
>
> select to_number('34,50','999,99');
>  to_number
> ---
>340
> (1 row)
>
> This patch checks the input string when it encounters a "G" or "," in
> the format mask. If the separator character is found, the code moves
> over it as normal. If it's not found, then the code no longer
> increments the relevant pointer so as not to skip the character. After
> the patch, we get the correct result:
>
> select to_number('34,50','999,99');
>  to_number
> ---
>   3450
> (1 row)
>
> This is in line with Oracle's result.

In other words the separators are completely ignored.  Presumably the
only reason you'd have them at all is so that you could use the same
format string in to_number() and to_char() calls to do round-trip
conversions.  That seems reasonable to me.

> == Rationale ==
>
> This patch is a small change, which leaves PostgreSQL behavior
> different from Oracle behavior in related cases. Oracle's
> implementation seems to read from right-to-left, and raises an
> "ORA-01722: invalid number" error if there are digits in the input
> string which don't have corresponding characters in the format mask. I
> have chosen not to throw such errors, because there are use cases for
> only returning part of a number string. For example, the following is
> an error on Oracle:
>
> select to_number('123,000', '999G') from dual;
> Error report -
> SQL Error: ORA-01722: invalid number
>
> But if you wanted to only take the characters before the comma, and
> discard the thousands part, you can do so on PostgreSQL with:
>
> select to_number('123,000', '999G');
>  to_number
> ---
>123
> (1 row)

I can see that it's hard to change this one, because it's more likely
to break existing applications that were written by people who didn't
think of the format string as limiting the maximum size of the number.
I think someone could argue for changing that too, but I agree with
your choice for this patch.

> == Testing ==
>
> Tested on Windows with MinGW using the latest checkout from master.
> Added regression tests to check for this new behavior. All existing
> tests pass.

The tests you added pass for me but the int8 test now fails with the
following (this is from regression.diff after running
'PG_REGRESS_DIFF_OPTS="-dU10" make check').  It looks like some new
whitespace is appearing on the right in the output of to_char().  I
didn't try to understand why.

@@ -453,34 +453,34 @@
 --+--
  4567890123456789 | 4567890123456789
 (1 row)

 -- TO_CHAR()
 --
 SELECT '' AS to_char_1, to_char(q1, '9G999G999G999G999G999'),
to_char(q2, '9,999,999,999,999,999')
FROM INT8_TBL;
  to_char_1 |to_char |to_char
 ---++
-   |123 |456
+   |123 |   456
|123 |  4,567,890,123,456,789
-   |  4,567,890,123,456,789 |123
+   |  4,567,890,123,456,789 |   123
|  4,567,890,123,456,789 |  4,567,890,123,456,789
|  4,567,890,123,456,789 | -4,567,890,123,456,789
 (5 rows)

 SELECT '' AS to_char_2, to_char(q1, '9G999G999G999G999G999D999G999'),
to_char(q2, '9,999,999,999,999,999.999,999')
FROM INT8_TBL;
  to_char_2 |to_char |to_char
 ---++
-   |123.000,000 |