[jira] [Commented] (HIVE-11672) Hive Streaming API handles bucketing incorrectly
[ https://issues.apache.org/jira/browse/HIVE-11672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985919#comment-14985919 ]

Roshan Naik commented on HIVE-11672:

yes.

> Hive Streaming API handles bucketing incorrectly
>
> Key: HIVE-11672
> URL: https://issues.apache.org/jira/browse/HIVE-11672
> Project: Hive
> Issue Type: Bug
> Components: HCatalog, Hive, Transactions
> Affects Versions: 1.2.1
> Reporter: Raj Bains
> Assignee: Roshan Naik
> Priority: Critical
>
> The Hive Streaming API allows clients to get a random bucket and then insert
> data into it. However, this leads to incorrect bucketing, as Hive expects data
> to be distributed into buckets based on a hash function applied to the bucket
> key. Right now the clients insert the data randomly. They have no way of:
> # Knowing what bucket a row (tuple) belongs to
> # Asking for a specific bucket
>
> There are optimizations such as Sort Merge Join and Bucket Map Join that rely
> on the data being correctly distributed across buckets, and these will cause
> incorrect read results if the data is not distributed correctly.
>
> There are two obvious design choices:
> # Hive Streaming API should fix this internally by distributing the data
> correctly
> # Hive Streaming API should expose the data distribution scheme to the clients
> and allow them to distribute the data correctly
>
> The first option means every client thread will write to many buckets,
> causing many small files in each bucket and too many open connections. This
> does not seem feasible. The second option pushes more functionality into the
> client of the Hive Streaming API, but can maintain high throughput and write
> good-sized ORC files. This option seems preferable.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
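To make the second design choice in the description concrete, here is a minimal sketch of client-side routing: each row's bucket is derived from a hash of its bucket key, so every writer agrees on where a row belongs. The class and method names (BucketRouting, bucketFor, partitionByBucket) are hypothetical and not part of the Hive Streaming API. The sketch also assumes the bucket is the non-negative hash modulo the bucket count and uses Java's hashCode() as a stand-in; a real client would have to use whatever hash Hive itself applies to the bucket column types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the Hive Streaming API.
final class BucketRouting {

    // Maps a bucket key to a bucket index in [0, numBuckets).
    // Assumption: non-negative hash modulo bucket count, with Java's
    // hashCode() standing in for Hive's own type-specific hash.
    static int bucketFor(Object bucketKey, int numBuckets) {
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        int hash = (bucketKey == null) ? 0 : bucketKey.hashCode();
        // Clear the sign bit so the modulo result is never negative.
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    // Groups keys by destination bucket so that each writer handles one bucket.
    static List<List<String>> partitionByBucket(List<String> keys, int numBuckets) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(bucketFor(key, numBuckets)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("alice", "bob", "carol", "dave");
        List<List<String>> buckets = partitionByBucket(keys, 4);
        for (int i = 0; i < buckets.size(); i++) {
            System.out.println("bucket " + i + ": " + buckets.get(i));
        }
    }
}
```

With routing like this, a client thread that owns one bucket writes only the rows that hash to it, which is what keeps the number of open connections and small files down in the second option.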
[jira] [Commented] (HIVE-11672) Hive Streaming API handles bucketing incorrectly
[ https://issues.apache.org/jira/browse/HIVE-11672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14935893#comment-14935893 ]

Eugene Koifman commented on HIVE-11672:

[~roshan_naik] is this a dup of HIVE-11983?