[jira] [Commented] (HIVE-11672) Hive Streaming API handles bucketing incorrectly

2015-11-02 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985919#comment-14985919
 ] 

Roshan Naik commented on HIVE-11672:


yes.

> Hive Streaming API handles bucketing incorrectly
> 
>
> Key: HIVE-11672
> URL: https://issues.apache.org/jira/browse/HIVE-11672
> Project: Hive
> Issue Type: Bug
> Components: HCatalog, Hive, Transactions
> Affects Versions: 1.2.1
> Reporter: Raj Bains
> Assignee: Roshan Naik
> Priority: Critical
>
> Hive Streaming API allows clients to get a random bucket and then insert
> data into it. However, this leads to incorrect bucketing, as Hive expects data
> to be distributed into buckets based on a hash function applied to the bucket
> key. The data is currently inserted randomly by the clients. They have no way
> of:
> # Knowing what bucket a row (tuple) belongs to
> # Asking for a specific bucket
> There are optimizations, such as Sort Merge Join and Bucket Map Join, that rely
> on the data being correctly distributed across buckets, and these will return
> incorrect read results if the data is not distributed correctly.
> There are two obvious design choices:
> # Hive Streaming API should fix this internally by distributing the data
> correctly
> # Hive Streaming API should expose the data distribution scheme to the clients
> and allow them to distribute the data correctly
> The first option will mean every client thread will write to many buckets,
> causing many small files in each bucket and too many open connections. This
> does not seem feasible. The second option pushes more functionality into the
> client of the Hive Streaming API, but can maintain high throughput and write
> good-sized ORC files. This option seems preferable.
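
To illustrate the second design choice in the description, here is a minimal sketch of how a client might work out which bucket a row belongs to. It assumes the bucket id is the non-negative hash of the bucket key modulo the number of buckets. The class and method names are hypothetical and are not part of the Hive Streaming API, and Java's String.hashCode() is only a stand-in: a real client would have to use exactly the hash Hive applies to the bucketing column's type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not part of the Hive Streaming API.
final class BucketSketch {

    // Maps a bucket-key hash to a bucket id in [0, numBuckets).
    // Masking with Integer.MAX_VALUE clears the sign bit, so the
    // result is never negative even for negative hash codes.
    static int bucketFor(int keyHash, int numBuckets) {
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        return (keyHash & Integer.MAX_VALUE) % numBuckets;
    }

    // Groups bucket keys by bucket id so that each writer can handle
    // a single bucket instead of writing to many of them.
    // ASSUMPTION: String.hashCode() stands in for Hive's own hash.
    static List<List<String>> groupByBucket(List<String> keys, int numBuckets) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(bucketFor(key.hashCode(), numBuckets)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "c", "d", "e");
        List<List<String>> buckets = groupByBucket(keys, 4);
        for (int i = 0; i < buckets.size(); i++) {
            System.out.println("bucket " + i + ": " + buckets.get(i));
        }
    }
}
```

With a mapping like this exposed to clients, each client thread could route rows to the writer for one bucket, which is what allows the second option to keep throughput high without creating many small files.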



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-11672) Hive Streaming API handles bucketing incorrectly

2015-09-29 Thread Eugene Koifman (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14935893#comment-14935893
 ] 

Eugene Koifman commented on HIVE-11672:
---

[~roshan_naik] is this a dup of HIVE-11983?



