[jira] [Commented] (SPARK-19618) Inconsistency wrt max. buckets allowed from Dataframe API vs SQL
[ https://issues.apache.org/jira/browse/SPARK-19618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473584#comment-16473584 ]

Fernando Pereira commented on SPARK-19618:
------------------------------------------

[~cloud_fan] I have created the JIRA and an implementation that lifts the limit via a configuration option. Internally we are forced to run our modified build, and it would be nice to get back in sync with upstream at some point. It is a very small patch in the end. Thanks.

> Inconsistency wrt max. buckets allowed from Dataframe API vs SQL
> ----------------------------------------------------------------
>
>                 Key: SPARK-19618
>                 URL: https://issues.apache.org/jira/browse/SPARK-19618
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Tejas Patil
>            Assignee: Tejas Patil
>            Priority: Major
>             Fix For: 2.2.0
>
>
> A high number of buckets is allowed while creating a table via a SQL query:
> {code}
> sparkSession.sql("""
>   CREATE TABLE bucketed_table(col1 INT) USING parquet
>   CLUSTERED BY (col1) SORTED BY (col1) INTO 147483647 BUCKETS
> """)
> sparkSession.sql("DESC FORMATTED bucketed_table").collect.foreach(println)
>
> [Num Buckets:,147483647,]
> [Bucket Columns:,[col1],]
> [Sort Columns:,[col1],]
> {code}
> Trying the same via the DataFrame API does not work:
> {code}
> df.write.format("orc").bucketBy(147483647, "j", "k").sortBy("j", "k").saveAsTable("bucketed_table")
>
> java.lang.IllegalArgumentException: requirement failed: Bucket number must be greater than 0 and less than 100000.
>   at scala.Predef$.require(Predef.scala:224)
>   at org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:293)
>   at org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:291)
>   at scala.Option.map(Option.scala:146)
>   at org.apache.spark.sql.DataFrameWriter.getBucketSpec(DataFrameWriter.scala:291)
>   at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:429)
>   at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:410)
>   at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:365)
>   ... 50 elided
> {code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
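The DataFrame-API failure above comes from a hard-coded `require` on the bucket count in `DataFrameWriter.getBucketSpec`, while the SQL path did not apply an equivalent check. A minimal, self-contained sketch of that validation (the function name `validateNumBuckets` and the default limit are illustrative assumptions, not Spark's exact code):

```scala
// Sketch of the kind of check DataFrameWriter.getBucketSpec performs.
// The function name and the default limit are illustrative assumptions.
def validateNumBuckets(numBuckets: Int, maxBuckets: Int = 100000): Int = {
  require(numBuckets > 0 && numBuckets < maxBuckets,
    s"Bucket number must be greater than 0 and less than $maxBuckets.")
  numBuckets
}
```

A reasonable bucket count passes the check, while 147483647 trips the requirement and raises the `IllegalArgumentException` shown in the stack trace.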
[jira] [Commented] (SPARK-19618) Inconsistency wrt max. buckets allowed from Dataframe API vs SQL
[ https://issues.apache.org/jira/browse/SPARK-19618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16440085#comment-16440085 ]

Fernando Pereira commented on SPARK-19618:
------------------------------------------

Opened [SPARK-23997|https://issues.apache.org/jira/browse/SPARK-23997]. Thanks.
[jira] [Commented] (SPARK-19618) Inconsistency wrt max. buckets allowed from Dataframe API vs SQL
[ https://issues.apache.org/jira/browse/SPARK-19618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438912#comment-16438912 ]

Wenchen Fan commented on SPARK-19618:
-------------------------------------

Making it configurable sounds like a good idea. Can you open a JIRA for it? Thanks!
[jira] [Commented] (SPARK-19618) Inconsistency wrt max. buckets allowed from Dataframe API vs SQL
[ https://issues.apache.org/jira/browse/SPARK-19618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438822#comment-16438822 ]

Fernando Pereira commented on SPARK-19618:
------------------------------------------

Is there any technical obstacle to using more than 100k buckets? Otherwise, what about making the limit configurable? We have an 80 TB workload, and to keep partitions "manageable" we do need a large number of buckets. While this might seem like a lot today, workloads can be expected to keep growing in size...
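The configurable limit suggested here can be sketched as a check that reads its maximum from a session setting instead of a hard-coded constant. The config key name `spark.sql.sources.bucketing.maxBuckets` and the `SessionConf` type below are illustrative assumptions, not the actual Spark patch:

```scala
// Illustrative sketch of a configurable bucket-count limit.
// The key name "spark.sql.sources.bucketing.maxBuckets" is an assumption.
final case class SessionConf(settings: Map[String, String]) {
  def maxBuckets: Int =
    settings.getOrElse("spark.sql.sources.bucketing.maxBuckets", "100000").toInt
}

// True when the requested bucket count is within the configured limit.
def checkNumBuckets(numBuckets: Int, conf: SessionConf): Boolean =
  numBuckets > 0 && numBuckets < conf.maxBuckets
```

With the default in place, 147483647 buckets is still rejected, while a deployment with a very large workload could raise the limit explicitly in its configuration.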
[jira] [Commented] (SPARK-19618) Inconsistency wrt max. buckets allowed from Dataframe API vs SQL
[ https://issues.apache.org/jira/browse/SPARK-19618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868963#comment-15868963 ]

Apache Spark commented on SPARK-19618:
--------------------------------------

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/16948