HyukjinKwon opened a new pull request #25130: [SPARK-28359][SQL][PYTHON][TESTS] 
Make integrated UDF tests robust by making them no-op
URL: https://github.com/apache/spark/pull/25130
   ## What changes were proposed in this pull request?
   The UDFs currently available in `IntegratedUDFTestUtils` are not exactly no-ops: they 
convert the input column to strings and output strings.
   This causes issues when we convert and port the tests in SPARK-27921, because the 
integrated UDF test cases share one output file and should all produce the same output. 
   1. Special values are converted into strings differently:
       | Scala      | Python |
       | ---------- | ------ |
       | `null`     | `None` |
       | `Infinity` | `inf`  |
       | `-Infinity`| `-inf` |
       | `NaN`      | `nan`  |
   2. Due to the float limitation in Python (see 
https://docs.python.org/3/tutorial/floatingpoint.html), if a float is passed into 
Python and sent back to the JVM, the value may not be exactly correct.
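
Both differences can be reproduced in plain Python without Spark. The sketch below is illustrative only: `widen_float32` is a hypothetical helper, and it assumes the imprecision shows up when a 32-bit float is widened to Python's 64-bit `float` before `str()` is applied.

```python
import struct

# How Python's str() renders the special values from the table above.
python_strings = {
    "null": str(None),
    "Infinity": str(float("inf")),
    "-Infinity": str(float("-inf")),
    "NaN": str(float("nan")),
}


def widen_float32(value):
    """Round a Python float to 32-bit precision, then widen it back to 64 bits.

    Hypothetical helper: it mimics what is assumed to happen when a 32-bit
    float value reaches Python, whose float type is always 64-bit.
    """
    return struct.unpack("f", struct.pack("f", value))[0]


widened = widen_float32(0.1)

if __name__ == "__main__":
    for jvm_name, py_string in python_strings.items():
        print(f"{jvm_name:>9} -> {py_string}")
    # The widened value no longer prints as "0.1".
    print(str(widened))
```

Because the string is produced on the Python side in both cases, the shared output file ends up with text that the Scala variant of the same test would not produce.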
   To work around this, this PR wraps the current UDFs in casts: the input 
column is cast to string, the UDF returns the strings as they are, 
and the output column is then cast back to the input column's type.
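   As an example, before this change:
   from pyspark.sql.functions import udf
   df = spark.range(3).toDF("col")
   df.select(udf(lambda x: str(x), "string")("col")).show()
   After this change:
   from pyspark.sql.functions import udf
   df = spark.range(3).toDF("col")
   python_udf = udf(lambda x: str(x), "string")
   # Cast the input column to string before it reaches the UDF.
   casted_col = python_udf(df.col.cast("string"))
   # Cast the UDF output back to the input column's type.
   df.select(casted_col.cast(df.schema["col"].dataType)).show()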
   In this way, the UDF is virtually a no-op, although there might be some subtleties 
due to the round trip through the string cast.
   Python native functions and Scala native functions take strings and 
output them as they are, so test failures caused by conversion 
differences between Python and Scala are avoided.
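
A toy model of the round trip, in plain Python, may make the difference concrete. It is a sketch under stated assumptions and not Spark code: `jvm_cast_to_string`, `old_style_udf` and `noop_udf` are hypothetical names, and the model assumes that the JVM-side cast renders special values as in the Scala column of the table above and leaves NULL as NULL.

```python
import math


def jvm_cast_to_string(value):
    """Toy stand-in for the JVM-side cast to string (an assumption, not Spark code)."""
    if value is None:
        return None  # NULL is assumed to stay NULL, not become the text "None"
    if isinstance(value, float):
        if math.isnan(value):
            return "NaN"
        if math.isinf(value):
            return "Infinity" if value > 0 else "-Infinity"
    return str(value)


def old_style_udf(value):
    # Before: the UDF itself converts whatever it receives into a string.
    return str(value)


def noop_udf(text):
    # After: the UDF receives a string and returns it as is.
    return text


values = [None, float("inf"), float("-inf"), float("nan"), 3]

# Before: the string conversion happens inside Python.
before = [old_style_udf(v) for v in values]
# After: the string conversion happens before the UDF is called.
after = [noop_udf(jvm_cast_to_string(v)) for v in values]

if __name__ == "__main__":
    for value, b, a in zip(values, before, after):
        print(f"{value!r:>6}: before={b!r}, after={a!r}")
```

In this model the "after" strings do not depend on which language the UDF body is written in, which is the property the shared output file relies on.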
   After this fix, for instance, `udf-aggregates_part1.sql` outputs exactly 
the same as `aggregates_part1.sql`:
   <details><summary>Diff comparing to 'pgSQL/aggregates_part1.sql'</summary>
   diff --git 
   index 51ca1d55869..801735781c7 100644
   @@ -3,7 +3,7 @@
    -- !query 0
   -SELECT avg(four) AS avg_1 FROM onek
   +SELECT avg(udf(four)) AS avg_1 FROM onek
    -- !query 0 schema
    -- !query 0 output
   @@ -11,7 +11,7 @@ struct<avg_1:double>
    -- !query 1
   -SELECT avg(a) AS avg_32 FROM aggtest WHERE a < 100
   +SELECT udf(avg(a)) AS avg_32 FROM aggtest WHERE a < 100
    -- !query 1 schema
    -- !query 1 output
   @@ -19,7 +19,7 @@ struct<avg_32:double>
    -- !query 2
   -select CAST(avg(b) AS Decimal(10,3)) AS avg_107_943 FROM aggtest
   +select CAST(avg(udf(b)) AS Decimal(10,3)) AS avg_107_943 FROM aggtest
    -- !query 2 schema
    -- !query 2 output
   @@ -27,7 +27,7 @@ struct<avg_107_943:decimal(10,3)>
    -- !query 3
   -SELECT sum(four) AS sum_1500 FROM onek
   +SELECT sum(udf(four)) AS sum_1500 FROM onek
    -- !query 3 schema
    -- !query 3 output
   @@ -35,7 +35,7 @@ struct<sum_1500:bigint>
    -- !query 4
   -SELECT sum(a) AS sum_198 FROM aggtest
   +SELECT udf(sum(a)) AS sum_198 FROM aggtest
    -- !query 4 schema
    -- !query 4 output
   @@ -43,7 +43,7 @@ struct<sum_198:bigint>
    -- !query 5
   -SELECT sum(b) AS avg_431_773 FROM aggtest
   +SELECT udf(udf(sum(b))) AS avg_431_773 FROM aggtest
    -- !query 5 schema
    -- !query 5 output
   @@ -51,7 +51,7 @@ struct<avg_431_773:double>
    -- !query 6
   -SELECT max(four) AS max_3 FROM onek
   +SELECT udf(max(four)) AS max_3 FROM onek
    -- !query 6 schema
    -- !query 6 output
   @@ -59,7 +59,7 @@ struct<max_3:int>
    -- !query 7
   -SELECT max(a) AS max_100 FROM aggtest
   +SELECT max(udf(a)) AS max_100 FROM aggtest
    -- !query 7 schema
    -- !query 7 output
   @@ -67,7 +67,7 @@ struct<max_100:int>
    -- !query 8
   -SELECT max(aggtest.b) AS max_324_78 FROM aggtest
   +SELECT udf(udf(max(aggtest.b))) AS max_324_78 FROM aggtest
    -- !query 8 schema
    -- !query 8 output
   @@ -75,237 +75,238 @@ struct<max_324_78:float>
    -- !query 9
   -SELECT stddev_pop(b) FROM aggtest
   +SELECT stddev_pop(udf(b)) FROM aggtest
    -- !query 9 schema
   -struct<stddev_pop(CAST(b AS DOUBLE)):double>
   +struct<stddev_pop(CAST(CAST(udf(cast(b as string)) AS FLOAT) AS 
    -- !query 9 output
    -- !query 10
   -SELECT stddev_samp(b) FROM aggtest
   +SELECT udf(stddev_samp(b)) FROM aggtest
    -- !query 10 schema
   -struct<stddev_samp(CAST(b AS DOUBLE)):double>
   +struct<CAST(udf(cast(stddev_samp(cast(b as double)) as string)) AS 
    -- !query 10 output
    -- !query 11
   -SELECT var_pop(b) FROM aggtest
   +SELECT var_pop(udf(b)) FROM aggtest
    -- !query 11 schema
   -struct<var_pop(CAST(b AS DOUBLE)):double>
   +struct<var_pop(CAST(CAST(udf(cast(b as string)) AS FLOAT) AS 
    -- !query 11 output
    -- !query 12
   -SELECT var_samp(b) FROM aggtest
   +SELECT udf(var_samp(b)) FROM aggtest
    -- !query 12 schema
   -struct<var_samp(CAST(b AS DOUBLE)):double>
   +struct<CAST(udf(cast(var_samp(cast(b as double)) as string)) AS 
    -- !query 12 output
    -- !query 13
   -SELECT stddev_pop(CAST(b AS Decimal(38,0))) FROM aggtest
   +SELECT udf(stddev_pop(CAST(b AS Decimal(38,0)))) FROM aggtest
    -- !query 13 schema
   -struct<stddev_pop(CAST(CAST(b AS DECIMAL(38,0)) AS DOUBLE)):double>
   +struct<CAST(udf(cast(stddev_pop(cast(cast(b as decimal(38,0)) as double)) 
as string)) AS DOUBLE):double>
    -- !query 13 output
    -- !query 14
   -SELECT stddev_samp(CAST(b AS Decimal(38,0))) FROM aggtest
   +SELECT stddev_samp(CAST(udf(b) AS Decimal(38,0))) FROM aggtest
    -- !query 14 schema
   -struct<stddev_samp(CAST(CAST(b AS DECIMAL(38,0)) AS DOUBLE)):double>
   +struct<stddev_samp(CAST(CAST(CAST(udf(cast(b as string)) AS FLOAT) AS 
DECIMAL(38,0)) AS DOUBLE)):double>
    -- !query 14 output
    -- !query 15
   -SELECT var_pop(CAST(b AS Decimal(38,0))) FROM aggtest
   +SELECT udf(var_pop(CAST(b AS Decimal(38,0)))) FROM aggtest
    -- !query 15 schema
   -struct<var_pop(CAST(CAST(b AS DECIMAL(38,0)) AS DOUBLE)):double>
   +struct<CAST(udf(cast(var_pop(cast(cast(b as decimal(38,0)) as double)) as 
string)) AS DOUBLE):double>
    -- !query 15 output
    -- !query 16
   -SELECT var_samp(CAST(b AS Decimal(38,0))) FROM aggtest
   +SELECT var_samp(udf(CAST(b AS Decimal(38,0)))) FROM aggtest
    -- !query 16 schema
   -struct<var_samp(CAST(CAST(b AS DECIMAL(38,0)) AS DOUBLE)):double>
   +struct<var_samp(CAST(CAST(udf(cast(cast(b as decimal(38,0)) as string)) AS 
DECIMAL(38,0)) AS DOUBLE)):double>
    -- !query 16 output
    -- !query 17
   -SELECT var_pop(1.0), var_samp(2.0)
   +SELECT udf(var_pop(1.0)), var_samp(udf(2.0))
    -- !query 17 schema
   -struct<var_pop(CAST(1.0 AS DOUBLE)):double,var_samp(CAST(2.0 AS 
   +struct<CAST(udf(cast(var_pop(cast(1.0 as double)) as string)) AS 
DOUBLE):double,var_samp(CAST(CAST(udf(cast(2.0 as string)) AS DECIMAL(2,1)) AS 
    -- !query 17 output
    0.0    NaN
    -- !query 18
   -SELECT stddev_pop(CAST(3.0 AS Decimal(38,0))), stddev_samp(CAST(4.0 AS 
   +SELECT stddev_pop(udf(CAST(3.0 AS Decimal(38,0)))), 
stddev_samp(CAST(udf(4.0) AS Decimal(38,0)))
    -- !query 18 schema
   -struct<stddev_pop(CAST(CAST(3.0 AS DECIMAL(38,0)) AS 
DOUBLE)):double,stddev_samp(CAST(CAST(4.0 AS DECIMAL(38,0)) AS DOUBLE)):double>
   +struct<stddev_pop(CAST(CAST(udf(cast(cast(3.0 as decimal(38,0)) as string)) 
AS DECIMAL(38,0)) AS DOUBLE)):double,stddev_samp(CAST(CAST(CAST(udf(cast(4.0 as 
string)) AS DECIMAL(2,1)) AS DECIMAL(38,0)) AS DOUBLE)):double>
    -- !query 18 output
    0.0    NaN
    -- !query 19
   -select sum(CAST(null AS int)) from range(1,4)
   +select sum(udf(CAST(null AS int))) from range(1,4)
    -- !query 19 schema
   -struct<sum(CAST(NULL AS INT)):bigint>
   +struct<sum(CAST(udf(cast(cast(null as int) as string)) AS INT)):bigint>
    -- !query 19 output
    -- !query 20
   -select sum(CAST(null AS long)) from range(1,4)
   +select sum(udf(CAST(null AS long))) from range(1,4)
    -- !query 20 schema
   -struct<sum(CAST(NULL AS BIGINT)):bigint>
   +struct<sum(CAST(udf(cast(cast(null as bigint) as string)) AS 
    -- !query 20 output
    -- !query 21
   -select sum(CAST(null AS Decimal(38,0))) from range(1,4)
   +select sum(udf(CAST(null AS Decimal(38,0)))) from range(1,4)
    -- !query 21 schema
   -struct<sum(CAST(NULL AS DECIMAL(38,0))):decimal(38,0)>
   +struct<sum(CAST(udf(cast(cast(null as decimal(38,0)) as string)) AS 
    -- !query 21 output
    -- !query 22
   -select sum(CAST(null AS DOUBLE)) from range(1,4)
   +select sum(udf(CAST(null AS DOUBLE))) from range(1,4)
    -- !query 22 schema
   -struct<sum(CAST(NULL AS DOUBLE)):double>
   +struct<sum(CAST(udf(cast(cast(null as double) as string)) AS 
    -- !query 22 output
    -- !query 23
   -select avg(CAST(null AS int)) from range(1,4)
   +select avg(udf(CAST(null AS int))) from range(1,4)
    -- !query 23 schema
   -struct<avg(CAST(NULL AS INT)):double>
   +struct<avg(CAST(udf(cast(cast(null as int) as string)) AS INT)):double>
    -- !query 23 output
    -- !query 24
   -select avg(CAST(null AS long)) from range(1,4)
   +select avg(udf(CAST(null AS long))) from range(1,4)
    -- !query 24 schema
   -struct<avg(CAST(NULL AS BIGINT)):double>
   +struct<avg(CAST(udf(cast(cast(null as bigint) as string)) AS 
    -- !query 24 output
    -- !query 25
   -select avg(CAST(null AS Decimal(38,0))) from range(1,4)
   +select avg(udf(CAST(null AS Decimal(38,0)))) from range(1,4)
    -- !query 25 schema
   -struct<avg(CAST(NULL AS DECIMAL(38,0))):decimal(38,4)>
   +struct<avg(CAST(udf(cast(cast(null as decimal(38,0)) as string)) AS 
    -- !query 25 output
    -- !query 26
   -select avg(CAST(null AS DOUBLE)) from range(1,4)
   +select avg(udf(CAST(null AS DOUBLE))) from range(1,4)
    -- !query 26 schema
   -struct<avg(CAST(NULL AS DOUBLE)):double>
   +struct<avg(CAST(udf(cast(cast(null as double) as string)) AS 
    -- !query 26 output
    -- !query 27
   -select sum(CAST('NaN' AS DOUBLE)) from range(1,4)
   +select sum(CAST(udf('NaN') AS DOUBLE)) from range(1,4)
    -- !query 27 schema
   -struct<sum(CAST(NaN AS DOUBLE)):double>
   +struct<sum(CAST(CAST(udf(cast(NaN as string)) AS STRING) AS DOUBLE)):double>
    -- !query 27 output
    -- !query 28
   -select avg(CAST('NaN' AS DOUBLE)) from range(1,4)
   +select avg(CAST(udf('NaN') AS DOUBLE)) from range(1,4)
    -- !query 28 schema
   -struct<avg(CAST(NaN AS DOUBLE)):double>
   +struct<avg(CAST(CAST(udf(cast(NaN as string)) AS STRING) AS DOUBLE)):double>
    -- !query 28 output
    -- !query 30
   -SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
   +SELECT avg(CAST(udf(x) AS DOUBLE)), var_pop(CAST(udf(x) AS DOUBLE))
    FROM (VALUES ('Infinity'), ('1')) v(x)
    -- !query 30 schema
   -struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
   +struct<avg(CAST(CAST(udf(cast(x as string)) AS STRING) AS 
DOUBLE)):double,var_pop(CAST(CAST(udf(cast(x as string)) AS STRING) AS 
    -- !query 30 output
    Infinity       NaN
    -- !query 31
   -SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
   +SELECT avg(CAST(udf(x) AS DOUBLE)), var_pop(CAST(udf(x) AS DOUBLE))
    FROM (VALUES ('Infinity'), ('Infinity')) v(x)
    -- !query 31 schema
   -struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
   +struct<avg(CAST(CAST(udf(cast(x as string)) AS STRING) AS 
DOUBLE)):double,var_pop(CAST(CAST(udf(cast(x as string)) AS STRING) AS 
    -- !query 31 output
    Infinity       NaN
    -- !query 32
   -SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
   +SELECT avg(CAST(udf(x) AS DOUBLE)), var_pop(CAST(udf(x) AS DOUBLE))
    FROM (VALUES ('-Infinity'), ('Infinity')) v(x)
    -- !query 32 schema
   -struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
   +struct<avg(CAST(CAST(udf(cast(x as string)) AS STRING) AS 
DOUBLE)):double,var_pop(CAST(CAST(udf(cast(x as string)) AS STRING) AS 
    -- !query 32 output
    NaN    NaN
    -- !query 33
   -SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
   +SELECT avg(udf(CAST(x AS DOUBLE))), udf(var_pop(CAST(x AS DOUBLE)))
    FROM (VALUES (100000003), (100000004), (100000006), (100000007)) v(x)
    -- !query 33 schema
   -struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
   +struct<avg(CAST(udf(cast(cast(x as double) as string)) AS 
DOUBLE)):double,CAST(udf(cast(var_pop(cast(x as double)) as string)) AS 
    -- !query 33 output
    1.00000005E8   2.5
    -- !query 34
   -SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
   +SELECT avg(udf(CAST(x AS DOUBLE))), udf(var_pop(CAST(x AS DOUBLE)))
    FROM (VALUES (7000000000005), (7000000000007)) v(x)
    -- !query 34 schema
   -struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
   +struct<avg(CAST(udf(cast(cast(x as double) as string)) AS 
DOUBLE)):double,CAST(udf(cast(var_pop(cast(x as double)) as string)) AS 
    -- !query 34 output
    7.000000000006E12      1.0
    -- !query 35
   -SELECT covar_pop(b, a), covar_samp(b, a) FROM aggtest
   +SELECT udf(covar_pop(b, udf(a))), covar_samp(udf(b), a) FROM aggtest
    -- !query 35 schema
   -struct<covar_pop(CAST(b AS DOUBLE), CAST(a AS 
DOUBLE)):double,covar_samp(CAST(b AS DOUBLE), CAST(a AS DOUBLE)):double>
   +struct<CAST(udf(cast(covar_pop(cast(b as double), cast(cast(udf(cast(a as 
string)) as int) as double)) as string)) AS 
DOUBLE):double,covar_samp(CAST(CAST(udf(cast(b as string)) AS FLOAT) AS 
    -- !query 35 output
    653.6289553875104      871.5052738500139
    -- !query 36
   -SELECT corr(b, a) FROM aggtest
   +SELECT corr(b, udf(a)) FROM aggtest
    -- !query 36 schema
   -struct<corr(CAST(b AS DOUBLE), CAST(a AS DOUBLE)):double>
   +struct<corr(CAST(b AS DOUBLE), CAST(CAST(udf(cast(a as string)) AS INT) AS 
    -- !query 36 output
    -- !query 37
   -SELECT count(four) AS cnt_1000 FROM onek
   +SELECT count(udf(four)) AS cnt_1000 FROM onek
    -- !query 37 schema
    -- !query 37 output
   @@ -313,7 +314,7 @@ struct<cnt_1000:bigint>
    -- !query 38
   -SELECT count(DISTINCT four) AS cnt_4 FROM onek
   +SELECT udf(count(DISTINCT four)) AS cnt_4 FROM onek
    -- !query 38 schema
    -- !query 38 output
   @@ -321,10 +322,10 @@ struct<cnt_4:bigint>
    -- !query 39
   -select ten, count(*), sum(four) from onek
   +select ten, udf(count(*)), sum(udf(four)) from onek
    group by ten order by ten
    -- !query 39 schema
   +struct<ten:int,CAST(udf(cast(count(1) as string)) AS 
BIGINT):bigint,sum(CAST(udf(cast(four as string)) AS INT)):bigint>
    -- !query 39 output
    0      100     100
    1      100     200
   @@ -339,10 +340,10 @@ struct<ten:int,count(1):bigint,sum(four):bigint>
    -- !query 40
   -select ten, count(four), sum(DISTINCT four) from onek
   +select ten, count(udf(four)), udf(sum(DISTINCT four)) from onek
    group by ten order by ten
    -- !query 40 schema
   -struct<ten:int,count(four):bigint,sum(DISTINCT four):bigint>
   +struct<ten:int,count(CAST(udf(cast(four as string)) AS 
INT)):bigint,CAST(udf(cast(sum(distinct cast(four as bigint)) as string)) AS 
    -- !query 40 output
    0      100     2
    1      100     4
   @@ -357,11 +358,11 @@ struct<ten:int,count(four):bigint,sum(DISTINCT 
    -- !query 41
   -select ten, sum(distinct four) from onek a
   +select ten, udf(sum(distinct four)) from onek a
    group by ten
   -having exists (select 1 from onek b where sum(distinct a.four) = b.four)
   +having exists (select 1 from onek b where udf(sum(distinct a.four)) = 
    -- !query 41 schema
   -struct<ten:int,sum(DISTINCT four):bigint>
   +struct<ten:int,CAST(udf(cast(sum(distinct cast(four as bigint)) as string)) 
AS BIGINT):bigint>
    -- !query 41 output
    0      2
    2      2
   @@ -374,23 +375,23 @@ struct<ten:int,sum(DISTINCT four):bigint>
    select ten, sum(distinct four) from onek a
    group by ten
    having exists (select 1 from onek b
   -               where sum(distinct a.four + b.four) = b.four)
   +               where sum(distinct a.four + b.four) = udf(b.four))
    -- !query 42 schema
    -- !query 42 output
    Aggregate/Window/Generate expressions are not valid in where clause of the 
   -Expression in where clause: [(sum(DISTINCT CAST((outer() + b.`four`) AS 
BIGINT)) = CAST(b.`four` AS BIGINT))]
   +Expression in where clause: [(sum(DISTINCT CAST((outer() + b.`four`) AS 
BIGINT)) = CAST(CAST(udf(cast(four as string)) AS INT) AS BIGINT))]
    Invalid expressions: [sum(DISTINCT CAST((outer() + b.`four`) AS BIGINT))];
    -- !query 43
   -  (select max((select i.unique2 from tenk1 i where i.unique1 = o.unique1)))
   +  (select udf(max((select i.unique2 from tenk1 i where i.unique1 = 
    from tenk1 o
    -- !query 43 schema
    -- !query 43 output
   -cannot resolve '`o.unique1`' given input columns: [i.even, i.fivethous, 
i.four, i.hundred, i.odd, i.string4, i.stringu1, i.stringu2, i.ten, i.tenthous, 
i.thousand, i.twenty, i.two, i.twothousand, i.unique1, i.unique2]; line 2 pos 63
   +cannot resolve '`o.unique1`' given input columns: [i.even, i.fivethous, 
i.four, i.hundred, i.odd, i.string4, i.stringu1, i.stringu2, i.ten, i.tenthous, 
i.thousand, i.twenty, i.two, i.twothousand, i.unique1, i.unique2]; line 2 pos 67
   ## How was this patch tested?
   Manually tested.
