[jira] [Commented] (SPARK-21380) Join with Columns thinks inner join is cross join even when aliased
[ https://issues.apache.org/jira/browse/SPARK-21380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17728184#comment-17728184 ]

shufan commented on SPARK-21380:
--------------------------------

[~dongjoon] Another situation:

{code}
SELECT *
FROM ( SELECT age, 'bob' AS NAME FROM person ) p
LEFT JOIN temp_person t_p ON t_p.NAME = p.NAME;
{code}

For this query, JoinSelection chooses BroadcastNestedLoopJoinExec, which may lead to an OOM -- and in our case it actually did. After adding the parameter

{code}
set spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.FoldablePropagation;
{code}

JoinSelection chooses SortMergeJoin instead and the query no longer OOMs, but excluding an optimizer rule is not something I want to rely on in practice. Do you have a better suggestion?
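A minimal spark-shell sketch of the scenario in the comment above (the {{person}} / {{temp_person}} table names come from the comment; everything else, including the assumption of Spark 2.4+ for {{spark.sql.optimizer.excludedRules}}, is illustrative only):

{code}
// Sketch only -- assumes person / temp_person are real, non-trivial tables.
val q = """
  SELECT *
  FROM ( SELECT age, 'bob' AS NAME FROM person ) p
  LEFT JOIN temp_person t_p ON t_p.NAME = p.NAME
"""

// Default behaviour: FoldablePropagation substitutes the literal 'bob' for p.NAME
// in the join condition, so no equi-join key remains between the two sides and
// JoinSelection falls back to BroadcastNestedLoopJoinExec (risking OOM).
spark.sql(q).explain()

// Workaround from the comment: exclude the rule so the condition stays an
// equi-join and JoinSelection can pick SortMergeJoin.
spark.conf.set("spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.FoldablePropagation")
spark.sql(q).explain()
{code}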
> Join with Columns thinks inner join is cross join even when aliased
> --------------------------------------------------------------------
>
>                 Key: SPARK-21380
>                 URL: https://issues.apache.org/jira/browse/SPARK-21380
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer, SQL
>    Affects Versions: 2.1.0, 2.1.1
>            Reporter: Everett Anderson
>            Priority: Major
>              Labels: correctness
>
> While this seemed to work in Spark 2.0.2, it fails in 2.1.0 and 2.1.1.
> Even after aliasing both the table names and all the columns, joining Datasets using criteria assembled from Columns, rather than with the join(usingColumns) method variants, errors out complaining that the join is a cross join / Cartesian product even when it isn't.
> Example:
> {noformat}
> Dataset<Row> left = spark.sql("select 'bob' as name, 23 as age");
> left = left
>     .alias("l")
>     .select(
>         left.col("name").as("l_name"),
>         left.col("age").as("l_age"));
>
> Dataset<Row> right = spark.sql("select 'bob' as name, 'bobco' as company");
> right = right
>     .alias("r")
>     .select(
>         right.col("name").as("r_name"),
>         right.col("company").as("r_age"));
>
> Dataset<Row> result = left.join(
>     right,
>     left.col("l_name").equalTo(right.col("r_name")),
>     "inner");
>
> result.show();
> {noformat}
> Results in
> {noformat}
> org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
> Project [bob AS l_name#22, 23 AS l_age#23]
> +- OneRowRelation$
> and
> Project [bob AS r_name#33, bobco AS r_age#34]
> +- OneRowRelation$
> Join condition is missing or trivial.
> Use the CROSS JOIN syntax to allow cartesian products between these relations.;
>   at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1067)
>   at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1064)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:273)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:273)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:257)
>   at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts.apply(Optimizer.scala:1064)
>   at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts.apply(Optimizer.scala:1049)
>   at ...
> {noformat}
[jira] [Commented] (SPARK-21380) Join with Columns thinks inner join is cross join even when aliased
[ https://issues.apache.org/jira/browse/SPARK-21380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084274#comment-16084274 ]

Dongjoon Hyun commented on SPARK-21380:
----------------------------------------

I see. I agree with your point that the warning is misleading here.
[jira] [Commented] (SPARK-21380) Join with Columns thinks inner join is cross join even when aliased
[ https://issues.apache.org/jira/browse/SPARK-21380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083429#comment-16083429 ]

Everett Anderson commented on SPARK-21380:
-------------------------------------------

[~dongjoon] Hey -- I don't totally follow. It sounds like you're saying it's correct for a join of two single-row tables to fail because the join is considered a Cartesian product. What if you happened to have only one row in each table? It seems unfortunate to error out.
[jira] [Commented] (SPARK-21380) Join with Columns thinks inner join is cross join even when aliased
[ https://issues.apache.org/jira/browse/SPARK-21380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083434#comment-16083434 ]

Dongjoon Hyun commented on SPARK-21380:
----------------------------------------

One row in a real, normal table is okay. In your example the values are constants, so `FoldablePropagation` and `ConstantFolding` are applied. See the optimized result.
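To make the distinction concrete, here is a minimal spark-shell sketch (the variable names and the Seq-based test data are made up for illustration; the behaviour described is that of the 2.1.x optimizer):

{code}
// Literal-only one-row relations: FoldablePropagation rewrites the join condition
// to ('bob' = 'bob') and ConstantFolding reduces it to `true`, so
// CheckCartesianProducts reports "Detected cartesian product for INNER join".
val constLeft  = spark.sql("select 'bob' as name, 23 as age")
val constRight = spark.sql("select 'bob' as name, 'bobco' as company")
// constLeft.join(constRight, constLeft("name") === constRight("name"), "inner").show()
//   --> org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join ...

// One-row relations backed by real data: the columns are not foldable, the
// condition stays a genuine equi-join, and the query runs.
import spark.implicits._
val realLeft  = Seq(("bob", 23)).toDF("name", "age")
val realRight = Seq(("bob", "bobco")).toDF("name", "company")
realLeft.join(realRight, realLeft("name") === realRight("name"), "inner").show()
// one matching row: bob, 23, bob, bobco
{code}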
[jira] [Commented] (SPARK-21380) Join with Columns thinks inner join is cross join even when aliased
[ https://issues.apache.org/jira/browse/SPARK-21380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083439#comment-16083439 ]

Everett Anderson commented on SPARK-21380:
-------------------------------------------

Ah, I see. Okay, that makes sense. Thanks for the explanation! I sure wish we didn't have so many quirky 'This looks like a Cartesian product join' cases in Spark, though!
[jira] [Commented] (SPARK-21380) Join with Columns thinks inner join is cross join even when aliased
[ https://issues.apache.org/jira/browse/SPARK-21380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083272#comment-16083272 ]

Dongjoon Hyun commented on SPARK-21380:
----------------------------------------

Your case is too simple, so it gets optimized away. The following is the normal case you mentioned.

{code}
scala> val l = spark.sql("select name, age from values ('bob', 1), ('sam', 2) T(name,age)")
scala> val r = spark.sql("select name, company from values ('bob', 'bobco'), ('larry', 'larryco') T(name,company)")
scala> val left = l.alias("l").select(l.col("name").as("l_name"), l.col("age").as("l_age"))
scala> val right = r.alias("r").select(r.col("name").as("r_name"), r.col("company").as("r_age"))

scala> l.show()
+----+---+
|name|age|
+----+---+
| bob|  1|
| sam|  2|
+----+---+

scala> r.show()
+-----+-------+
| name|company|
+-----+-------+
|  bob|  bobco|
|larry|larryco|
+-----+-------+

scala> left.join(right, left.col("l_name").equalTo(right.col("r_name")), "inner").show
+------+-----+------+-----+
|l_name|l_age|r_name|r_age|
+------+-----+------+-----+
|   bob|    1|   bob|bobco|
+------+-----+------+-----+
{code}
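If a Cartesian product is actually what is wanted, the error message's suggestion can be followed instead of restructuring the data. A minimal sketch reusing `left` and `right` from the shell session above (Spark 2.1+ APIs; the `product` name is only illustrative):

{code}
// Explicit cross join via the dedicated API (no configuration change needed) ...
val product = left.crossJoin(right)
product.show()

// ... or opt in globally, which makes CheckCartesianProducts accept inner joins
// whose condition has been folded away to a trivial predicate.
spark.conf.set("spark.sql.crossJoin.enabled", "true")
{code}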
[jira] [Commented] (SPARK-21380) Join with Columns thinks inner join is cross join even when aliased
[ https://issues.apache.org/jira/browse/SPARK-21380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083266#comment-16083266 ]

Dongjoon Hyun commented on SPARK-21380:
----------------------------------------

Hi, [~everett]. It's the correct result of optimization. Please see the following.

{code}
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.FoldablePropagation ===
!Join Inner, (l_name#11 = r_name#17)                  Join Inner, (bob = bob)
 :- Project [bob AS l_name#11, 23 AS l_age#12]         :- Project [bob AS l_name#11, 23 AS l_age#12]
 :  +- OneRowRelation$                                 :  +- OneRowRelation$
 +- Project [bob AS r_name#17, bobco AS r_age#18]      +- Project [bob AS r_name#17, bobco AS r_age#18]
    +- OneRowRelation$                                    +- OneRowRelation$

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantFolding ===
!Join Inner, (bob = bob)                               Join Inner, true
 :- Project [bob AS l_name#11, 23 AS l_age#12]         :- Project [bob AS l_name#11, 23 AS l_age#12]
 :  +- OneRowRelation$                                 :  +- OneRowRelation$
 +- Project [bob AS r_name#17, bobco AS r_age#18]      +- Project [bob AS r_name#17, bobco AS r_age#18]
    +- OneRowRelation$                                    +- OneRowRelation$
{code}
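One way to see this folded plan directly for the original one-row example is to temporarily allow the Cartesian product so optimization can finish, then print the plans. A sketch of that (spark-shell, Spark 2.1.x assumed; variable names are only illustrative):

{code}
val l = spark.sql("select 'bob' as name, 23 as age")
val r = spark.sql("select 'bob' as name, 'bobco' as company")
val left  = l.alias("l").select(l.col("name").as("l_name"), l.col("age").as("l_age"))
val right = r.alias("r").select(r.col("name").as("r_name"), r.col("company").as("r_age"))

// Lift the check so the optimizer can finish, then inspect the plans.
spark.conf.set("spark.sql.crossJoin.enabled", "true")
left.join(right, left.col("l_name").equalTo(right.col("r_name")), "inner").explain(true)
// The optimized logical plan shows the join condition folded to a trivial `true`
// predicate, exactly as in the rule trace above -- which is what
// CheckCartesianProducts rejects when spark.sql.crossJoin.enabled is false (default).
{code}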
[jira] [Commented] (SPARK-21380) Join with Columns thinks inner join is cross join even when aliased
[ https://issues.apache.org/jira/browse/SPARK-21380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083233#comment-16083233 ]

Everett Anderson commented on SPARK-21380:
-------------------------------------------

[~dongjoon] Sure thing! I'll update this when I've tried it.
[jira] [Commented] (SPARK-21380) Join with Columns thinks inner join is cross join even when aliased
[ https://issues.apache.org/jira/browse/SPARK-21380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083230#comment-16083230 ]

Dongjoon Hyun commented on SPARK-21380:
----------------------------------------

Hi, [~everett]. Thank you for reporting. Apache Spark 2.2.0 was released today. Could you check this on 2.2?