[jira] [Commented] (SPARK-32347) BROADCAST hint makes a weird message that "column can't be resolved" (it was OK in Spark 2.4)

2020-07-22 Thread Dongjoon Hyun (Jira)


[ https://issues.apache.org/jira/browse/SPARK-32347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163150#comment-17163150 ]

Dongjoon Hyun commented on SPARK-32347:
---

Thank you, [~cltlfcjin] and [~JinxinTang]. I'll close this JIRA as a duplicate 
of SPARK-32237.

> BROADCAST hint makes a weird message that "column can't be resolved" (it was 
> OK in Spark 2.4)
> -
>
> Key: SPARK-32347
> URL: https://issues.apache.org/jira/browse/SPARK-32347
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0.0, Jupyter notebook, Spark launched in 
> local[4] mode; it fails the same way with a Standalone cluster.
>  
>  
>Reporter: Ihor Bobak
>Priority: Major
> Attachments: 2020-07-17 17_46_32-Window.png, 2020-07-17 17_49_27-Window.png, 2020-07-17 17_52_51-Window.png
>
>
> The bug is easily reproduced: run the same code below in Spark 2.4.3 and in 
> 3.0.0. The analyzer raises an invalid error message in 3.0.0, although the 
> SQL statement is fine and the query works as expected in Spark 2.4.3.
> {code:python}
> import pandas as pd
> pdf_sales = pd.DataFrame([(1, 10), (2, 20)], columns=["BuyerID", "Qty"])
> pdf_buyers = pd.DataFrame([(1, "John"), (2, "Jack")], columns=["BuyerID", "BuyerName"])
> df_sales = spark.createDataFrame(pdf_sales)
> df_buyers = spark.createDataFrame(pdf_buyers)
> df_sales.createOrReplaceTempView("df_sales")
> df_buyers.createOrReplaceTempView("df_buyers")
> spark.sql("""
> with b as (
> select /*+ BROADCAST(df_buyers) */
> BuyerID, BuyerName 
> from df_buyers
> )
> select 
> b.BuyerID,
> b.BuyerName,
> s.Qty
> from df_sales s
> inner join b on s.BuyerID =  b.BuyerID
> """).toPandas()
> {code}
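> Both temp views expose exactly the columns that the error below lists as 
> available inputs, which is what makes the "cannot resolve" message wrong. A 
> quick sanity check (a sketch, not part of the failing repro itself):
> {code:python}
> # Both views resolve, and each contains the columns referenced in the query.
> spark.table("df_sales").printSchema()   # BuyerID: long, Qty: long
> spark.table("df_buyers").printSchema()  # BuyerID: long, BuyerName: string
> {code}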
> The (wrong) error message:
> ---
> AnalysisException                         Traceback (most recent call last)
> <ipython-input-…> in <module>
>      22 from df_sales s
>      23 inner join b on s.BuyerID =  b.BuyerID
> ---> 24 """).toPandas()
> /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/session.py in sql(self, sqlQuery)
>     644         [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]
>     645         """
> --> 646         return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
>     647 
>     648     @since(2.0)
> /opt/spark-3.0.0-bin-without-hadoop/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
>    1303         answer = self.gateway_client.send_command(command)
>    1304         return_value = get_return_value(
> -> 1305             answer, self.gateway_client, self.target_id, self.name)
>    1306 
>    1307         for temp_arg in temp_args:
> /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in deco(*a, **kw)
>     135                 # Hide where the exception came from that shows a non-Pythonic
>     136                 # JVM exception message.
> --> 137                 raise_from(converted)
>     138             else:
>     139                 raise
> /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in raise_from(e)
> AnalysisException: cannot resolve '`s.BuyerID`' given input columns: [s.BuyerID, b.BuyerID, b.BuyerName, s.Qty]; line 12 pos 24;
> 'Project ['b.BuyerID, 'b.BuyerName, 's.Qty]
> +- 'Join Inner, ('s.BuyerID = 'b.BuyerID)
>    :- SubqueryAlias s
>    :  +- SubqueryAlias df_sales
>    :     +- LogicalRDD [BuyerID#23L, Qty#24L], false
>    +- SubqueryAlias b
>       +- Project [BuyerID#27L, BuyerName#28]
>          +- SubqueryAlias df_buyers
>             +- LogicalRDD [BuyerID#27L, BuyerName#28], false
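> A possible workaround (a sketch, assuming the same views as above): express 
> the join through the DataFrame API and mark df_buyers with 
> pyspark.sql.functions.broadcast, which requests a broadcast join without 
> putting a /*+ BROADCAST */ hint inside a CTE:
> {code:python}
> from pyspark.sql.functions import broadcast, col
> 
> # Hint via the API instead of SQL; no CTE is involved, so the analyzer bug
> # should not be triggered.
> s = df_sales.alias("s")
> b = broadcast(df_buyers).alias("b")
> (s.join(b, col("s.BuyerID") == col("b.BuyerID"), "inner")
>  .select("b.BuyerID", "b.BuyerName", "s.Qty")
>  .toPandas())
> {code}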





[jira] [Commented] (SPARK-32347) BROADCAST hint makes a weird message that "column can't be resolved" (it was OK in Spark 2.4)

2020-07-19 Thread Lantao Jin (Jira)


[ https://issues.apache.org/jira/browse/SPARK-32347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17160859#comment-17160859 ]

Lantao Jin commented on SPARK-32347:


Duplicate of SPARK-32237.




[jira] [Commented] (SPARK-32347) BROADCAST hint makes a weird message that "column can't be resolved" (it was OK in Spark 2.4)

2020-07-19 Thread JinxinTang (Jira)


[ https://issues.apache.org/jira/browse/SPARK-32347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17160627#comment-17160627 ]

JinxinTang commented on SPARK-32347:


cc [~ibobak] Thanks for reporting this issue; I have just raised a PR for it.




[jira] [Commented] (SPARK-32347) BROADCAST hint makes a weird message that "column can't be resolved" (it was OK in Spark 2.4)

2020-07-19 Thread Apache Spark (Jira)


[ https://issues.apache.org/jira/browse/SPARK-32347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17160620#comment-17160620 ]

Apache Spark commented on SPARK-32347:
--

User 'TJX2014' has created a pull request for this issue:
https://github.com/apache/spark/pull/29156




[jira] [Commented] (SPARK-32347) BROADCAST hint makes a weird message that "column can't be resolved" (it was OK in Spark 2.4)

2020-07-17 Thread JinxinTang (Jira)


[ https://issues.apache.org/jira/browse/SPARK-32347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17160284#comment-17160284 ]

JinxinTang commented on SPARK-32347:


This is the same CTE issue as https://issues.apache.org/jira/browse/SPARK-32237
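If so, the same join would be expected to succeed on 3.0.0 once the hint is 
moved out of the CTE (an untested sketch, consistent with the CTE hint 
resolution bug tracked there):
{code:python}
# Same query as in the description, but with the BROADCAST hint applied in
# the outer select instead of inside the CTE.
spark.sql("""
select /*+ BROADCAST(df_buyers) */
    b.BuyerID,
    b.BuyerName,
    s.Qty
from df_sales s
inner join df_buyers b on s.BuyerID = b.BuyerID
""").toPandas()
{code}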
