[jira] [Commented] (SPARK-31162) Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing

2021-07-21 Thread Ashish Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385071#comment-17385071
 ] 

Ashish Singh commented on SPARK-31162:
--

This is needed for reasons other than supporting Hive bucketed-table writes. For 
example, it is also needed so that custom partitioners from Hive (using a Hive 
UDF) can partition data the same way Hive does.

Assigning this to myself, but let me know if you are already working on it, 
[~maropu].
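
Below is a minimal, self-contained sketch (illustrative only, not part of this 
ticket) showing that Spark's Murmur3-based hash and Hive's hash disagree on the 
same keys, so bucket assignments diverge. Note that HiveHash is a Spark-internal 
catalyst expression, and wrapping it in a Column is not a stable public API.

import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.catalyst.expressions.HiveHash
import org.apache.spark.sql.functions.hash

object HashComparisonSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-vs-murmur-hash")
      .master("local[1]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(1, 42, 12345).toDF("key")

    // Spark's built-in hash() is the Murmur3 hash that bucketBy uses for bucket ids.
    val murmur = hash($"key")
    // Hive's hash, available only as an internal catalyst expression.
    val hiveHashCol = new Column(HiveHash(Seq($"key".expr)))

    // The two columns generally disagree, which is why buckets written by Spark
    // are not readable as Hive-compatible buckets without an option to use HiveHash.
    df.select($"key", murmur.as("murmur3"), hiveHashCol.as("hive_hash")).show()

    spark.stop()
  }
}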

> Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing
> -
>
> Key: SPARK-31162
> URL: https://issues.apache.org/jira/browse/SPARK-31162
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> I couldn't find a configuration parameter to choose Hive hashing instead of 
> Spark's default Murmur hash when performing a Spark bucketBy operation. 
> Following the discussion with [~maropu] and [~hyukjin.kwon], it was suggested 
> to open a new JIRA. 




[jira] [Commented] (SPARK-31162) Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing

2020-03-16 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060540#comment-17060540
 ] 

Takeshi Yamamuro commented on SPARK-31162:
--

I've checked the original PR ([https://github.com/apache/spark/pull/10498]) that 
implemented bucketBy, and I think that comment says nothing about compatibility 
with Hive's bucketing scheme. But the comment is a bit confusing, so I'll fix it 
later.

> Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing
> -
>
> Key: SPARK-31162
> URL: https://issues.apache.org/jira/browse/SPARK-31162
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> I couldn't find a configuration parameter to choose Hive hashing instead of 
> Spark's default Murmur hash when performing a Spark bucketBy operation. 
> Following the discussion with [~maropu] and [~hyukjin.kwon], it was suggested 
> to open a new JIRA. 




[jira] [Commented] (SPARK-31162) Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing

2020-03-16 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060408#comment-17060408
 ] 

Felix Kizhakkel Jose commented on SPARK-31162:
--

I have seen the following in the API documentation:



/**
 * Buckets the output by the given columns. If specified, the output is laid out on the file
 * system similar to Hive's bucketing scheme.
 *
 * This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark
 * 2.1.0.
 *
 * @since 2.0
 */
@scala.annotation.varargs
def bucketBy(numBuckets: Int, colName: String, colNames: String*): DataFrameWriter[T] = {
  this.numBuckets = Option(numBuckets)
  this.bucketColumnNames = Option(colName +: colNames)
  this
}

How can we specify that, i.e., make the output follow Hive's bucketing scheme?
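
For reference, a minimal usage sketch of the bucketBy API quoted above (table and 
column names are illustrative): the output is laid out as one file set per bucket, 
but the bucket id is always derived from Spark's Murmur3 hash, and there is no 
public configuration to make the buckets follow Hive's hashing, which is what this 
issue asks for.

import org.apache.spark.sql.SparkSession

object BucketByUsageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bucketBy-usage")
      .master("local[1]")
      .getOrCreate()
    import spark.implicits._

    // bucketBy only takes effect together with saveAsTable (a catalog table);
    // it cannot be combined with a plain path-based save().
    spark.range(100)
      .withColumn("key", ($"id" % 10).cast("int"))
      .write
      .bucketBy(8, "key")
      .sortBy("key")
      .saveAsTable("bucketed_example") // illustrative table name

    spark.stop()
  }
}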

 

> Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing
> -
>
> Key: SPARK-31162
> URL: https://issues.apache.org/jira/browse/SPARK-31162
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> I couldn't find a configuration parameter to choose Hive hashing instead of 
> Spark's default Murmur hash when performing a Spark bucketBy operation. 
> Following the discussion with [~maropu] and [~hyukjin.kwon], it was suggested 
> to open a new JIRA. 


