[jira] [Updated] (SPARK-17100) pyspark filter on a udf column after join gives java.lang.UnsupportedOperationException

2016-09-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-17100:
---
Fix Version/s: 2.1.0 (was: 2.2.0)

> pyspark filter on a udf column after join gives 
> java.lang.UnsupportedOperationException
> ---
>
> Key: SPARK-17100
> URL: https://issues.apache.org/jira/browse/SPARK-17100
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
> Environment: spark-2.0.0-bin-hadoop2.7. Python2 and Python3.
>Reporter: Tim Sell
>Assignee: Davies Liu
> Fix For: 2.0.1, 2.1.0
>
> Attachments: bug.py, test_bug.py
>
>
> In PySpark, filtering on a UDF-derived column after certain join types causes
> the optimized logical plan to fail with a
> java.lang.UnsupportedOperationException.
> I could not replicate this in Scala code from the shell, only in Python. It is
> a PySpark regression from Spark 1.6.2.
> This can be replicated with: bin/spark-submit bug.py
> {code:python:title=bug.py}
> import pyspark.sql.functions as F
> from pyspark.sql import Row, SparkSession
> if __name__ == '__main__':
>     spark = SparkSession.builder.appName("test").getOrCreate()
>     left = spark.createDataFrame([Row(a=1)])
>     right = spark.createDataFrame([Row(a=1)])
>     df = left.join(right, on='a', how='left_outer')
>     df = df.withColumn('b', F.udf(lambda x: 'x')(df.a))
>     df = df.filter('b = "x"')
>     df.explain(extended=True)
> {code}
> The output is:
> {code}
> == Parsed Logical Plan ==
> 'Filter ('b = x)
> +- Project [a#0L, (a#0L) AS b#8]
>    +- Project [a#0L]
>       +- Join LeftOuter, (a#0L = a#3L)
>          :- LogicalRDD [a#0L]
>          +- LogicalRDD [a#3L]
> == Analyzed Logical Plan ==
> a: bigint, b: string
> Filter (b#8 = x)
> +- Project [a#0L, (a#0L) AS b#8]
>    +- Project [a#0L]
>       +- Join LeftOuter, (a#0L = a#3L)
>          :- LogicalRDD [a#0L]
>          +- LogicalRDD [a#3L]
> == Optimized Logical Plan ==
> java.lang.UnsupportedOperationException: Cannot evaluate expression: (input[0, bigint, true])
> == Physical Plan ==
> java.lang.UnsupportedOperationException: Cannot evaluate expression: (input[0, bigint, true])
> {code}
> It fails when the join is:
> * how='outer', on=column expression
> * how='left_outer', on=string or column expression
> * how='right_outer', on=string or column expression
> It passes when the join is:
> * how='inner', on=string or column expression
> * how='outer', on=string
> I made some tests to demonstrate each of these; a sketch of such tests follows
> below. Run with: bin/spark-submit test_bug.py
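> The test_bug.py attachment isn't reproduced in this thread. As a rough
> illustration only, a parameterized check over the join type/condition
> combinations listed above might look like the sketch below (the PASS/FAIL
> reporting and the dropping of the duplicate key column are assumptions,
> not the attachment's actual code):
> {code:python:title=test_bug.py (illustrative sketch)}
> import itertools
> 
> import pyspark.sql.functions as F
> from pyspark.sql import Row, SparkSession
> 
> if __name__ == '__main__':
>     spark = SparkSession.builder.appName("test_bug").getOrCreate()
>     for how, on_kind in itertools.product(
>             ['inner', 'outer', 'left_outer', 'right_outer'],
>             ['string', 'column expression']):
>         left = spark.createDataFrame([Row(a=1)])
>         right = spark.createDataFrame([Row(a=1)])
>         on = 'a' if on_kind == 'string' else left.a == right.a
>         df = left.join(right, on=on, how=how)
>         if on_kind != 'string':
>             # A column-expression join keeps both key columns; drop one so
>             # df.a is unambiguous, mirroring the single 'a' column in bug.py.
>             df = df.drop(right.a)
>         df = df.withColumn('b', F.udf(lambda x: 'x')(df.a))
>         df = df.filter('b = "x"')
>         try:
>             df.count()  # forces optimization and execution of the plan
>             print('PASS: how=%s, on=%s' % (how, on_kind))
>         except Exception as e:
>             print('FAIL: how=%s, on=%s (%s)' % (how, on_kind, type(e).__name__))
>     spark.stop()
> {code}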



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17100) pyspark filter on a udf column after join gives java.lang.UnsupportedOperationException

2016-08-16 Thread Tim Sell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Sell updated SPARK-17100:
-
Attachment: test_bug.py




[jira] [Updated] (SPARK-17100) pyspark filter on a udf column after join gives java.lang.UnsupportedOperationException

2016-08-16 Thread Tim Sell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Sell updated SPARK-17100:
-
Attachment: bug.py
