[jira] [Comment Edited] (SPARK-24826) Self-Join not working in Apache Spark 2.2.2
[ https://issues.apache.org/jira/browse/SPARK-24826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593085#comment-16593085 ]

Michail Giannakopoulos edited comment on SPARK-24826 at 8/27/18 12:53 AM:
--------------------------------------------------------------------------

[~dongjoon] I will try to repro and let you know...

was (Author: miccagiann):
[~dongjoon] I will and let you know...

> Self-Join not working in Apache Spark 2.2.2
> -------------------------------------------
>
>                 Key: SPARK-24826
>                 URL: https://issues.apache.org/jira/browse/SPARK-24826
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer
>    Affects Versions: 2.2.2
>            Reporter: Michail Giannakopoulos
>            Priority: Major
>         Attachments: part-0-48210471-3088-4cee-8670-a332444bae66-c000.gz.parquet
>
>
> Running a self-join against a table derived from a parquet file with many
> columns fails during the planning phase with the following stack-trace:
>
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
> Exchange(coordinator id: 331918455) hashpartitioning(_row_id#0L, 2), coordinator[target post-shuffle partition size: 67108864]
> +- Project [_row_id#0L, id#1L, member_id#2L, loan_amnt#3L, funded_amnt#4L, funded_amnt_inv#5L, term#6, int_rate#7, installment#8, grade#9, sub_grade#10, emp_title#11, emp_length#12, home_ownership#13, annual_inc#14, verification_status#15, issue_d#16, loan_status#17, pymnt_plan#18, url#19, desc_#20, purpose#21, title#22, zip_code#23, ... 92 more fields]
>    +- Filter isnotnull(_row_id#0L)
>       +- FileScan parquet [_row_id#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,... 92 more fields] Batched: false, Format: Parquet, Location: InMemoryFileIndex[file:/c:/Users/gianna/Desktop/alpha.parquet/part-0-48210471-3088-4cee-8670-..., PartitionFilters: [], PushedFilters: [IsNotNull(_row_id)], ReadSchema: struct<_row_id:bigint,id:bigint,member_id:bigint,loan_amnt:bigint,funded_amnt:bigint,funded_amnt_...
> at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
> at org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
> at org.apache.spark.sql.execution.SortExec.doExecute(SortExec.scala:101)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
> at org.apache.spark.sql.execution.joins.SortMergeJoinExec.doExecute(SortMergeJoinExec.scala:141)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
> at
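For reference, the failing shape the description reports can be sketched roughly as follows. This is an illustrative sketch, not the reporter's actual code: the class name, the local file path, and the session settings are placeholders, and it assumes Spark 2.2.x on the classpath with a wide parquet file like the attached one.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetSelfJoinSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[2]")
                .appName("SPARK-24826 self-join sketch")
                .getOrCreate();

        // "alpha.parquet" stands in for the attached wide
        // (24 listed + 92 more columns) parquet file.
        Dataset<Row> t = spark.read().parquet("alpha.parquet");
        t.createOrReplaceTempView("t");

        // Per the stack trace above, it is the shuffle exchange planned
        // for this self-join (under the ExchangeCoordinator) that throws
        // the TreeNodeException on 2.2.2.
        spark.sql("SELECT * FROM t a JOIN t b ON a._row_id = b._row_id").show();

        spark.stop();
    }
}
```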
[jira] [Comment Edited] (SPARK-24826) Self-Join not working in Apache Spark 2.2.2
[ https://issues.apache.org/jira/browse/SPARK-24826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16566981#comment-16566981 ]

Joseph Fourny edited comment on SPARK-24826 at 8/2/18 3:57 PM:
---------------------------------------------------------------

I was able to reproduce this defect with an inner join of two temp views that refer to equivalent local relations. I started by creating two datasets (in Java) from a List of GenericRow and registered them as separate views. As far as the optimizer is concerned, the contents of the local relations are the same. If you update one of the datasets so that they are distinct, the assertion is no longer triggered. Note: I had to force a SortMergeJoin to trigger the issue in ExchangeCoordinator.

was (Author: josephfourny):
I was able to reproduce this defect with an inner-join of two temp views that refer to equivalent local relations. I started by creating 2 datasets (in Java) from a List of GenericRow and registered them as separate views. As far as the optimizer is concerned, the contents of the local relations are the same. Note: I have to force a SortMergeJoin to trigger the issue in ExchangeCoordinator.
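Joseph's reproduction steps can be sketched along these lines. This is a guess at his setup, not his actual code: the schema and sample rows are invented, and it assumes that disabling broadcast joins (`spark.sql.autoBroadcastJoinThreshold = -1`) is how the SortMergeJoin was forced and that `spark.sql.adaptive.enabled` is what engages the ExchangeCoordinator on 2.2.x.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class EquivalentLocalRelationsSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[2]")
                .appName("SPARK-24826 local-relation sketch")
                // Disable broadcast joins to force a SortMergeJoin,
                // and enable the ExchangeCoordinator code path.
                .config("spark.sql.autoBroadcastJoinThreshold", "-1")
                .config("spark.sql.adaptive.enabled", "true")
                .getOrCreate();

        StructType schema = new StructType()
                .add("id", DataTypes.LongType)
                .add("name", DataTypes.StringType);

        // RowFactory.create builds GenericRow instances, as in the comment.
        List<Row> rows = Arrays.asList(
                RowFactory.create(1L, "a"),
                RowFactory.create(2L, "b"));

        // Two datasets built from the same rows, registered as separate
        // views: the optimizer sees two equivalent local relations.
        spark.createDataFrame(rows, schema).createOrReplaceTempView("t1");
        spark.createDataFrame(rows, schema).createOrReplaceTempView("t2");

        // Per the comment, this inner join trips the assertion in
        // ExchangeCoordinator on 2.2.2; making one view's rows distinct
        // from the other's avoids it.
        spark.sql("SELECT * FROM t1 JOIN t2 ON t1.id = t2.id").show();

        spark.stop();
    }
}
```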