RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4 (workarounds)

2021-08-29 Thread l...@laurent-hasson.com


   >  -Original Message-
   >  From: Justin Pryzby 
   >  Sent: Sunday, August 29, 2021 23:17
   >  To: Pavel Stehule 
   >  Cc: l...@laurent-hasson.com; Tom Lane ; Ranier
   >  Vilela ; Andrew Dunstan
   >  ; pgsql-performa...@postgresql.org
   >  Subject: Re: Big Performance drop of Exceptions in UDFs between V11.2
   >  and 13.4 (workarounds)
   >  
   >  On Mon, Aug 30, 2021 at 04:43:23AM +0200, Pavel Stehule wrote:
   >  > po 30. 8. 2021 v 2:44 odesílatel l...@laurent-hasson.com napsal:
   >  > > At this point, I am not sure how to proceed except to rethink that
   >  > > toFloat() function and many other places where we use exceptions.
   >  We
   >  > > get such dirty data that I need a "safe" way to convert a string to
   >  > > float without throwing an exception. BTW, I tried other
   >  combinations
   >  > > in case there may have been some weird interactions with the ::REAL
   >  > > conversion operator, but nothing made any change. Could you
   >  > > recommend another approach off the top of your head? I could use
   >  > > regexes for testing etc... Or maybe there is another option like a
   >  > > no-throw conversion that's built in or in some extension that you
   >  may know of? Like the "SAFE." Prefix in BigQuery.
   >  >
   >  > CREATE OR REPLACE FUNCTION safe_to_double_precision(t text)
   >  RETURNS
   >  > double precision AS $$ BEGIN
   >  >   IF $1 SIMILAR TO '[+-]?([0-9]*[.])?[0-9]+' THEN
   >  > RETURN $1::double precision;
   >  >   ELSE
   >  > RETURN NULL;
   >  >   END IF;
   >  > END;
   >  > $$ LANGUAGE plpgsql IMMUTABLE STRICT;
   >  
   >  This tries to use a regex to determine if something is a "Number" or not.
   >  Which has all the issues enumerated in painful detail by long answers on
   >  stack overflow, and other wiki/blog/forums.
   >  
   >  Rather than trying to define Numbers using regex, I'd try to avoid only
   >  the most frequent exceptions and get 90% of the performance back.  I
   >  don't know what your data looks like, but you might try things like this:
   >  
   >  IF $1 IS NULL THEN RETURN $2
   >  ELSE IF $1 ~ '^$' THEN RETURN $2
   >  ELSE IF $1 ~ '[[:alpha:]]{2}' THEN RETURN $2 ELSE IF $1 !~ '[[:digit:]]' 
THEN
   >  RETURN $2
   >  BEGIN
   > RETURN $1::float;
   >  EXCEPTION WHEN OTHERS THEN
   > RETURN $2;
   >  END;
   >  
   >  You can check the stackoverflow page for ideas as to what kind of thing
   >  to reject, but it may depend mostly on your data (what is the most
   >  common string?
   >  The most common exceptional string?).
   >  
   >  I think it's possible that could even be *faster* than the original, 
since it
   >  avoids the exception block for values which are for sure going to cause
   >  an exception anyway.  It might be that using alternation (|) is faster (if
   >  less
   >  readable) than using a handful of IF branches.
   >  
   >  --
   >  Justin

That's exactly where my head was at. I have looked different way to test for a 
floating point number and recognize the challenge 😊

The data is very messy with people entering data by hand. We have seen alpha 
and punctuation, people copy/pasting from excel so large numbers get the "e" 
notation. It's a total mess. The application that authors that data is a piece 
of crap and we have no chance to change it unfortunately. Short of rolling out 
an ETL process, which is painful for the way our data comes in, I need an in-db 
solution.

Thank you!
Laurent.


Re: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4 (workarounds)

2021-08-29 Thread Justin Pryzby
On Mon, Aug 30, 2021 at 04:43:23AM +0200, Pavel Stehule wrote:
> po 30. 8. 2021 v 2:44 odesílatel l...@laurent-hasson.com napsal:
> > At this point, I am not sure how to proceed except to rethink that
> > toFloat() function and many other places where we use exceptions. We get
> > such dirty data that I need a "safe" way to convert a string to float
> > without throwing an exception. BTW, I tried other combinations in case
> > there may have been some weird interactions with the ::REAL conversion
> > operator, but nothing made any change. Could you recommend another approach
> > off the top of your head? I could use regexes for testing etc... Or maybe
> > there is another option like a no-throw conversion that's built in or in
> > some extension that you may know of? Like the "SAFE." Prefix in BigQuery.
> 
> CREATE OR REPLACE FUNCTION safe_to_double_precision(t text)
> RETURNS double precision AS $$
> BEGIN
>   IF $1 SIMILAR TO '[+-]?([0-9]*[.])?[0-9]+' THEN
> RETURN $1::double precision;
>   ELSE
> RETURN NULL;
>   END IF;
> END;
> $$ LANGUAGE plpgsql IMMUTABLE STRICT;

This tries to use a regex to determine if something is a "Number" or not.
Which has all the issues enumerated in painful detail by long answers on stack
overflow, and other wiki/blog/forums.

Rather than trying to define Numbers using regex, I'd try to avoid only the
most frequent exceptions and get 90% of the performance back.  I don't know
what your data looks like, but you might try things like this:

IF $1 IS NULL THEN RETURN $2
ELSE IF $1 ~ '^$' THEN RETURN $2
ELSE IF $1 ~ '[[:alpha:]]{2}' THEN RETURN $2
ELSE IF $1 !~ '[[:digit:]]' THEN RETURN $2
BEGIN   

  
   RETURN $1::float;
EXCEPTION WHEN OTHERS THEN  

  
   RETURN $2;
END;

  

You can check the stackoverflow page for ideas as to what kind of thing to
reject, but it may depend mostly on your data (what is the most common string?
The most common exceptional string?).

I think it's possible that could even be *faster* than the original, since it
avoids the exception block for values which are for sure going to cause an
exception anyway.  It might be that using alternation (|) is faster (if less
readable) than using a handful of IF branches.

-- 
Justin