[
https://issues.apache.org/jira/browse/SPARK-45722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17780965#comment-17780965
]
Alexey Dmitriev commented on SPARK-45722:
-
Turning off spark.sql.analyzer.failAmbiguousSelfJoin doesn't help, so the
issue is probably not exactly where I thought it was:
{code:java}
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
session = SparkSession.Builder().getOrCreate()
session.conf.set('spark.sql.analyzer.failAmbiguousSelfJoin', False)
A = session.createDataFrame([(1,)], ['a'])
B = session.createDataFrame([(1,)], ['b'])
A.join(B).select(B.b)
C = A.join(A.join(B), on=F.lit(False), how='leftanti')
C.join(B).select(B.b)
{code}
AnalysisException: Resolved attribute(s) b#2L missing from a#0L,b#12L in
operator !Project [b#2L]. Attribute(s) with the same name appear in the
operation: b. Please check if the right attribute(s) are used.;
!Project [b#2L]
+- Join Inner
   :- Join LeftAnti, false
   :  :- LogicalRDD [a#0L], false
   :  +- Join Inner
   :     :- LogicalRDD [a#9L], false
   :     +- LogicalRDD [b#2L], false
   +- LogicalRDD [b#12L], false
> False positive when checking for ambiguous columns
> --
>
> Key: SPARK-45722
> URL: https://issues.apache.org/jira/browse/SPARK-45722
> Project: Spark
> Issue Type: Bug
> Components: PySpark
>Affects Versions: 3.4.0
> Environment: py3.11
> pyspark 3.4.0
>Reporter: Alexey Dmitriev
>Priority: Major
>
> I have the following code, which I expect to work:
> {code:java}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as F
> session = SparkSession.Builder().getOrCreate()
> A = session.createDataFrame([(1,)], ['a'])
> B = session.createDataFrame([(1,)], ['b'])
> A.join(B).select(B.b) # works fine
> # C has the same columns as A (columns, not only names)
> C = A.join(A.join(B), on=F.lit(False), how='leftanti')
> C.join(B).select(B.b) # doesn't work, says B.b is ambiguous
> {code}
> Exception below:
> {code:java}
> AnalysisException: Column b#11L are ambiguous. It's probably because you
> joined several Datasets together, and some of these Datasets are the same.
> This column points to one of the Datasets but Spark is unable to figure out
> which one. Please alias the Datasets with different names via `Dataset.as`
> before joining them, and specify the column using qualified name, e.g.
> `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You can also set
> spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.{code}