HCatalog streaming works with Hive's transactional tables, which are currently 
required to be bucketed.
The latter requirement is there to improve read performance, since these tables 
also support update/delete operations (though not via streaming).
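
For reference, a table that satisfies this looks something like the sketch 
below (the table and column names are illustrative, not from this thread; an 
ACID table also needs ORC storage and the transactional table property):

    CREATE TABLE web_logs (               -- illustrative names
      ip  STRING,
      msg STRING
    )
    PARTITIONED BY (dt STRING)
    CLUSTERED BY (ip) INTO 10 BUCKETS     -- bucketed, but not sorted
    STORED AS ORC                         -- ACID tables must be stored as ORC
    TBLPROPERTIES ('transactional' = 'true');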

"The number of buckets is ideally ..." this is obsolete (as of HIVE-11983).  
There isn't really a relationship.  Each HiveEndPoint will write as many files 
as you have buckets.
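
To illustrate, a single writer against such a table looks roughly like this 
(the metastore URI, database/table names, and record contents are placeholders; 
this is a sketch against the org.apache.hive.hcatalog.streaming API, not code 
from this thread):

    import org.apache.hive.hcatalog.streaming.DelimitedInputWriter;
    import org.apache.hive.hcatalog.streaming.HiveEndPoint;
    import org.apache.hive.hcatalog.streaming.StreamingConnection;
    import org.apache.hive.hcatalog.streaming.TransactionBatch;

    import java.util.Arrays;

    public class StreamingSketch {
        public static void main(String[] args) throws Exception {
            // One endpoint per table/partition; the bucket count is fixed by
            // the table DDL, not by how many of these writers you run.
            HiveEndPoint endPt = new HiveEndPoint(
                    "thrift://metastore-host:9083",    // placeholder URI
                    "default", "web_logs",
                    Arrays.asList("2016-04-08"));      // partition key values

            // true = auto-create the partition if it does not exist
            StreamingConnection conn = endPt.newConnection(true);

            // Delimited text records mapped onto the table's columns.
            DelimitedInputWriter writer =
                    new DelimitedInputWriter(new String[]{"ip", "msg"}, ",", endPt);

            // Each transaction batch produces one delta directory holding up
            // to <numBuckets> bucket files, regardless of the writer count.
            TransactionBatch txnBatch = conn.fetchTransactionBatch(10, writer);
            txnBatch.beginNextTransaction();
            txnBatch.write("127.0.0.1,hello".getBytes());
            txnBatch.commit();
            txnBatch.close();

            conn.close();
        }
    }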

Eugene



From: Igor Kuzmenko <f1she...@gmail.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Friday, April 8, 2016 at 2:35 AM
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: Hive Hcatalog Streaming. Why hive table must be bucketed?

Hello, I've got a few questions about Hive HCatalog streaming 
<https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest>.
This feature has a requirement:
"The Hive table must be bucketed 
<https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables>, 
but not sorted. So something like "clustered by (colName) into 10 buckets" 
must be specified during table creation. The number of buckets is ideally the 
same as the number of streaming writers."

1) I wonder why this is a required condition for streaming?
2) How many buckets should I create when the number of streaming writers 
changes over time (for example, from 1 to 10)?
