[jira] [Comment Edited] (SPARK-24826) Self-Join not working in Apache Spark 2.2.2

2018-08-26 Thread Michail Giannakopoulos (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-24826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593085#comment-16593085 ]

Michail Giannakopoulos edited comment on SPARK-24826 at 8/27/18 12:53 AM:
--

[~dongjoon] I will try to repro and let you know...


was (Author: miccagiann):
[~dongjoon] I will and let you know...

> Self-Join not working in Apache Spark 2.2.2
> ---
>
> Key: SPARK-24826
> URL: https://issues.apache.org/jira/browse/SPARK-24826
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.2.2
>Reporter: Michail Giannakopoulos
>Priority: Major
> Attachments: 
> part-0-48210471-3088-4cee-8670-a332444bae66-c000.gz.parquet
>
>
> Running a self-join against a table derived from a Parquet file with many 
> columns fails during the planning phase with the stack trace below (a repro 
> sketch follows the trace):
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
>  Exchange(coordinator id: 331918455) hashpartitioning(_row_id#0L, 2), 
> coordinator[target post-shuffle partition size: 67108864]
>  +- Project [_row_id#0L, id#1L, member_id#2L, loan_amnt#3L, funded_amnt#4L, 
> funded_amnt_inv#5L, term#6, int_rate#7, installment#8, grade#9, sub_grade#10, 
> emp_title#11, emp_length#12, home_ownership#13, annual_inc#14, 
> verification_status#15, issue_d#16, loan_status#17, pymnt_plan#18, url#19, 
> desc_#20, purpose#21, title#22, zip_code#23, ... 92 more fields]
>  +- Filter isnotnull(_row_id#0L)
>  +- FileScan parquet 
> [_row_id#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,...
>  92 more fields] Batched: false, Format: Parquet, Location: 
> InMemoryFileIndex[file:/c:/Users/gianna/Desktop/alpha.parquet/part-0-48210471-3088-4cee-8670-...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(_row_id)], ReadSchema: 
> struct<_row_id:bigint,id:bigint,member_id:bigint,loan_amnt:bigint,funded_amnt:bigint,funded_amnt_...
> at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
>  at org.apache.spark.sql.execution.SortExec.doExecute(SortExec.scala:101)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
>  at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.doExecute(SortMergeJoinExec.scala:141)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
>  at 
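For reference, a minimal sketch of the reported scenario. Everything here is inferred from the plan above rather than confirmed: the join key appears to be _row_id (from Filter isnotnull(_row_id#0L)), hashpartitioning(_row_id#0L, 2) suggests spark.sql.shuffle.partitions was 2, and the coordinator line with its 67108864-byte target matches adaptive execution being enabled with the default target post-shuffle size.

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SelfJoinRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[2]")
        // Assumption: adaptive execution was on; it is what installs the
        // ExchangeCoordinator, and 67108864 is the default value of
        // spark.sql.adaptive.shuffle.targetPostShuffleInputSize.
        .config("spark.sql.adaptive.enabled", true)
        // Assumption: hashpartitioning(_row_id#0L, 2) implies 2 shuffle partitions.
        .config("spark.sql.shuffle.partitions", 2)
        .getOrCreate();

    // Path taken from the plan's InMemoryFileIndex (the attached parquet file).
    Dataset<Row> t = spark.read().parquet("c:/Users/gianna/Desktop/alpha.parquet");
    t.createOrReplaceTempView("t");

    // Self-join on _row_id, matching the Filter isnotnull(_row_id#0L) and
    // hashpartitioning(_row_id#0L, 2) nodes in the failing plan.
    spark.sql("SELECT * FROM t a JOIN t b ON a._row_id = b._row_id").show();
  }
}
{code}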

[jira] [Comment Edited] (SPARK-24826) Self-Join not working in Apache Spark 2.2.2

2018-08-02 Thread Joseph Fourny (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-24826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16566981#comment-16566981 ]

Joseph Fourny edited comment on SPARK-24826 at 8/2/18 3:57 PM:
---

I was able to reproduce this defect with an inner join of two temp views that 
refer to equivalent local relations. I started by creating two datasets (in Java) 
from a List of GenericRow and registered them as separate views. As far as the 
optimizer is concerned, the contents of the two local relations are the same. If 
you update one of the datasets so that the two become distinct, the assertion is 
no longer triggered. Note: I had to force a SortMergeJoin to trigger the issue in 
ExchangeCoordinator. A sketch of this repro is below.
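A minimal sketch of that repro, with illustrative column names and data (the actual schema should not matter, only that the two local relations are equivalent). The configs are assumptions: broadcast joins are disabled to force the SortMergeJoin, and adaptive execution is enabled so the ExchangeCoordinator participates.

{code:java}
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class EquivalentLocalRelationJoin {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[2]")
        // Disable broadcast joins so the planner picks SortMergeJoin.
        .config("spark.sql.autoBroadcastJoinThreshold", -1)
        // Enable adaptive execution so the ExchangeCoordinator is used.
        .config("spark.sql.adaptive.enabled", true)
        .getOrCreate();

    // Illustrative schema; any columns should do.
    StructType schema = new StructType()
        .add("id", DataTypes.LongType)
        .add("name", DataTypes.StringType);

    // RowFactory.create(...) builds GenericRow instances under the hood.
    List<Row> rows = Arrays.asList(
        RowFactory.create(1L, "a"),
        RowFactory.create(2L, "b"));

    // Two datasets with identical contents -> equivalent local relations.
    Dataset<Row> left = spark.createDataFrame(rows, schema);
    Dataset<Row> right = spark.createDataFrame(rows, schema);
    left.createOrReplaceTempView("v1");
    right.createOrReplaceTempView("v2");

    // Inner join of the two equivalent views.
    spark.sql("SELECT * FROM v1 JOIN v2 ON v1.id = v2.id").show();
  }
}
{code}

With the two views backed by identical data, planning the join on 2.2.2 reportedly trips the assertion in ExchangeCoordinator; changing a row in either list makes the relations distinct and the query plans normally.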


was (Author: josephfourny):
I was able to reproduce this defect with an inner-join of two temp views that 
refer to equivalent local relations. I started by creating 2 datasets (in Java) 
from a List of GenericRow and registered them as separate views. As far as the 
optimizer is concerned, the contents of the local relations are the same. Note: 
I have to force a SortMergeJoin to trigger the issue in ExchangeCoordinator. 
