[jira] [Assigned] (SPARK-18766) Push Down Filter Through BatchEvalPython
[ https://issues.apache.org/jira/browse/SPARK-18766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li reassigned SPARK-18766:
-------------------------------

    Assignee: Xiao Li

> Push Down Filter Through BatchEvalPython
> ----------------------------------------
>
>                 Key: SPARK-18766
>                 URL: https://issues.apache.org/jira/browse/SPARK-18766
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 2.0.2
>            Reporter: Xiao Li
>            Assignee: Xiao Li
>             Fix For: 2.2.0
>
> Currently, when users use a Python UDF in a Filter, {{BatchEvalPython}} is always generated below {{FilterExec}}. However, not all of the predicates need to be evaluated after the Python UDF runs, so the UDF-independent predicates can be pushed down through {{BatchEvalPython}}.
> {noformat}
> >>> df = spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], ["key", "value"])
> >>> from pyspark.sql.functions import udf, col
> >>> from pyspark.sql.types import BooleanType
> >>> my_filter = udf(lambda a: a < 2, BooleanType())
> >>> sel = df.select(col("key"), col("value")).filter((my_filter(col("key"))) & (df.value < "2"))
> >>> sel.explain(True)
> {noformat}
> {noformat}
> == Physical Plan ==
> *Project [key#0L, value#1]
> +- *Filter ((isnotnull(value#1) && pythonUDF0#9) && (value#1 < 2))
>    +- BatchEvalPython [<lambda>(key#0L)], [key#0L, value#1, pythonUDF0#9]
>       +- Scan ExistingRDD[key#0L,value#1]
> {noformat}
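To illustrate the predicate split that the proposed rule would perform automatically, the same query can be rewritten by hand so that the UDF-free predicate is applied before the Python UDF. This is only a sketch of the idea, reusing the {{df}}, {{col}}, and {{my_filter}} definitions from the session above; it is not the optimizer change itself, and the exact plan it produces is not guaranteed.

{noformat}
>>> # Sketch of a manual rewrite (not the proposed optimizer rule):
>>> # apply the UDF-free predicate first so it can be pushed toward the scan,
>>> # then evaluate the Python UDF predicate on the already-reduced rows.
>>> pre = df.filter(df.value < "2")          # plain predicate, no Python needed
>>> sel = pre.select(col("key"), col("value")).filter(my_filter(col("key")))
>>> sel.explain(True)
{noformat}

In this form the {{value < "2"}} predicate no longer has to wait for {{BatchEvalPython}}, which is the effect the pushdown rule aims to achieve for the original, unsplit filter.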
[jira] [Assigned] (SPARK-18766) Push Down Filter Through BatchEvalPython
[ https://issues.apache.org/jira/browse/SPARK-18766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18766:
------------------------------------

    Assignee:     (was: Apache Spark)

> Push Down Filter Through BatchEvalPython
> ----------------------------------------
>
>                 Key: SPARK-18766
>                 URL: https://issues.apache.org/jira/browse/SPARK-18766
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 2.0.2
>            Reporter: Xiao Li
[jira] [Assigned] (SPARK-18766) Push Down Filter Through BatchEvalPython
[ https://issues.apache.org/jira/browse/SPARK-18766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18766:
------------------------------------

    Assignee: Apache Spark

> Push Down Filter Through BatchEvalPython
> ----------------------------------------
>
>                 Key: SPARK-18766
>                 URL: https://issues.apache.org/jira/browse/SPARK-18766
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 2.0.2
>            Reporter: Xiao Li
>            Assignee: Apache Spark