[jira] [Commented] (SPARK-25332) Instead of broadcast hash join ,Sort merge join has selected when restart spark-shell/spark-JDBC for hive provider
[ https://issues.apache.org/jira/browse/SPARK-25332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682329#comment-16682329 ]

Marco Gaido commented on SPARK-25332:
-------------------------------------

[~Bjangir] please don't use "Critical" and "Blocker": they are reserved for committers. Thanks.

> Instead of broadcast hash join, Sort merge join gets selected when restarting
> spark-shell/spark-JDBC for hive provider
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-25332
>                 URL: https://issues.apache.org/jira/browse/SPARK-25332
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Babulal
>            Priority: Major
>
> spark.sql("create table x1(name string,age int) stored as parquet")
> spark.sql("insert into x1 select 'a',29")
> spark.sql("create table x2 (name string,age int) stored as parquet")
> spark.sql("insert into x2_ex select 'a',29")
>
> scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> == Physical Plan ==
> *(2) BroadcastHashJoin [name#101], [name#103], Inner, BuildRight
> :- *(2) Project [name#101, age#102]
> :  +- *(2) Filter isnotnull(name#101)
> :     +- *(2) FileScan parquet default.x1_ex[name#101,age#102] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1], PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: struct<name:string,age:int>
> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
>    +- *(1) Project [name#103, age#104]
>       +- *(1) Filter isnotnull(name#103)
>          +- *(1) FileScan parquet default.x2_ex[name#103,age#104] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2], PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: struct<name:string,age:int>
>
> Now restart spark-shell (or spark-submit, or restart the JDBC server) and run the same select query again:
>
> scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> == Physical Plan ==
> *(5) SortMergeJoin [name#43], [name#45], Inner
> :- *(2) Sort [name#43 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(name#43, 200)
> :     +- *(1) Project [name#43, age#44]
> :        +- *(1) Filter isnotnull(name#43)
> :           +- *(1) FileScan parquet default.x1[name#43,age#44] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1], PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: struct<name:string,age:int>
> +- *(4) Sort [name#45 ASC NULLS FIRST], false, 0
>    +- Exchange hashpartitioning(name#45, 200)
>       +- *(3) Project [name#45, age#46]
>          +- *(3) Filter isnotnull(name#45)
>             +- *(3) FileScan parquet default.x2[name#45,age#46] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2], PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: struct<name:string,age:int>
>
> scala> spark.sql("desc formatted x1").show(200,false)
> +----------------------------+--------------------------------------------------------------+-------+
> |col_name                    |data_type                                                     |comment|
> +----------------------------+--------------------------------------------------------------+-------+
> |name                        |string                                                        |null   |
> |age                         |int                                                           |null   |
> |                            |                                                              |       |
> |# Detailed Table Information|                                                              |       |
> |Database                    |default                                                       |       |
> |Table                       |x1                                                            |       |
> |Owner                       |Administrator                                                 |       |
> |Created Time                |Sun Aug 19 12:36:58 IST 2018                                  |       |
> |Last Access                 |Thu Jan 01 05:30:00 IST 1970                                  |       |
> |Created By                  |Spark 2.3.0                                                   |       |
> |Type                        |MANAGED                                                       |       |
> |Provider                    |hive                                                          |       |
> |Table Properties            |[transient_lastDdlTime=1534662418]                            |       |
> |Location                    |file:/D:/spark_release/spark/bin/spark-warehouse/x1           |       |
> |Serde Library               |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe   |       |
> |InputFormat                 |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat |       |
> |OutputFormat                |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat|       |
> |Storage Properties          |[serialization.format=1]                                      |       |
> |Partition Provider          |Catalog                                                       |       |
> +----------------------------+--------------------------------------------------------------+-------+
>
> With a datasource table it works fine (create table using parquet instead of stored as).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
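The plan flip in the report comes down to Spark's size-based join selection: a side is broadcast only when its estimated size is at or below spark.sql.autoBroadcastJoinThreshold (10 MB by default). A minimal sketch of that decision follows; it is illustrative only, not Spark's actual planner code, and the sizes passed in are made-up numbers:

```scala
// Illustrative sketch of size-based join selection (not Spark's real code).
// Default spark.sql.autoBroadcastJoinThreshold is 10 MB.
val autoBroadcastJoinThreshold: Long = 10L * 1024 * 1024

def chooseJoin(leftSizeBytes: Long, rightSizeBytes: Long): String =
  if (math.min(leftSizeBytes, rightSizeBytes) <= autoBroadcastJoinThreshold)
    "BroadcastHashJoin" // one side is small enough to ship to every executor
  else
    "SortMergeJoin"     // both sides look big: shuffle and sort instead

// First session: the one-row parquet tables have known, tiny file sizes.
println(chooseJoin(714L, 714L))                       // BroadcastHashJoin
// After a restart, a hive-provider table without statistics is assumed to
// occupy Long.MaxValue bytes, so neither side qualifies for broadcast.
println(chooseJoin(Long.MaxValue, Long.MaxValue))     // SortMergeJoin
```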
[jira] [Commented] (SPARK-25332) Instead of broadcast hash join ,Sort merge join has selected when restart spark-shell/spark-JDBC for hive provider
[ https://issues.apache.org/jira/browse/SPARK-25332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16679574#comment-16679574 ]

Babulal commented on SPARK-25332:
---------------------------------

Since the issue causes performance degradation, I am marking it as 'Critical'.
[jira] [Commented] (SPARK-25332) Instead of broadcast hash join ,Sort merge join has selected when restart spark-shell/spark-JDBC for hive provider
[ https://issues.apache.org/jira/browse/SPARK-25332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16654170#comment-16654170 ]

Apache Spark commented on SPARK-25332:
--------------------------------------

User 'sujith71955' has created a pull request for this issue:
https://github.com/apache/spark/pull/22758
[jira] [Commented] (SPARK-25332) Instead of broadcast hash join ,Sort merge join has selected when restart spark-shell/spark-JDBC for hive provider
[ https://issues.apache.org/jira/browse/SPARK-25332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652914#comment-16652914 ]

Sujith commented on SPARK-25332:
--------------------------------

[~Bjangir] I think you are right: there is a bug while inserting data into a table when the "stored by" clause is used in the create command. I am working on it and will raise a PR soon. [~maropu] [srowen|https://github.com/srowen] [cloud-fan|https://github.com/cloud-fan] I will raise a PR to handle this and keep you in the loop. Thanks.
[jira] [Commented] (SPARK-25332) Instead of broadcast hash join ,Sort merge join has selected when restart spark-shell/spark-JDBC for hive provider
[ https://issues.apache.org/jira/browse/SPARK-25332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16609534#comment-16609534 ]

Babulal commented on SPARK-25332:
---------------------------------

Hi [~maropu], this seemed to be a straightforward issue, so I raised it here directly. The issue happens because the relation size is correct within the same session, but after the application restarts the HadoopFsRelation size becomes the default relation size (spark.sql.defaultSizeInBytes, which is Long.MaxValue); that is why SortMergeJoin is chosen instead of a broadcast join.
[jira] [Commented] (SPARK-25332) Instead of broadcast hash join ,Sort merge join has selected when restart spark-shell/spark-JDBC for hive provider
[ https://issues.apache.org/jira/browse/SPARK-25332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603761#comment-16603761 ]

Takeshi Yamamuro commented on SPARK-25332:
------------------------------------------

Probably, you need to describe more about this case. Also, I think you'd better ask on the spark-user mailing list first.