> so they're asking "where is the Hive bucketing spec".  Is it just to read the 
> code for that function? 

This happened the other way around in time - implementation first, spec second. 
ACIDv1 implemented streaming ingest via Storm, and it used an explicit 
"bucket_<n>" naming convention for the filenames.

Since the actual base files don't exist until compaction runs, the ACID 
bucketing implementation has to treat a missing bucket as 0 rows in the base 
file + possibly more rows in uncompacted deltas.
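
The on-disk layout looks roughly like this (the write-id digits here are made 
up for illustration, not taken from a real table):

/warehouse/t/base_0000001/bucket_00000           <- compacted base (buckets can be missing)
/warehouse/t/delta_0000002_0000002/bucket_00001  <- uncompacted delta

so a reader has to union base + deltas per bucket, and treat a bucket with no 
files anywhere as empty.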

ACID's implementation has forced the two bucketing implementations to work the 
same way, so that bucket map-joins between ACID & non-ACID bucketed tables stay 
possible - particularly around the modulus for negative numbers, which was 
broken in Hive 1.0.

https://issues.apache.org/jira/browse/HIVE-12025

that's the place where this all got refactored so that joins & filters for 
bucketed tables work the same way for ACID & non-ACID tables.
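
A quick way to see the negative-modulus hazard from beeline (pmod is shown here 
only as the sign-safe contrast, not as the exact code Hive uses):

-- Java's % keeps the sign of the dividend, so a naive hash % buckets
-- maps a negative hash to a negative bucket id
select -5 % 4;       -- -1
select pmod(-5, 4);  --  3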

Because of that refactoring, the spec now lives in the comments as a regex.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1283
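
To get a concrete feel for that contract, here is a simplified stand-in - the 
authoritative pattern is in the Utilities.java comment linked above. The bucket 
id is the leading digit run of the file name, and any _copy_N suffix is 
ignored:

select cast(regexp_extract('000001_0_copy_1', '^([0-9]+)_', 1) as int); -- 1
select cast(regexp_extract('000003_0', '^([0-9]+)_', 1) as int);        -- 3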

> They were looking for something more explicit, I think.

I think a simple unit test would probably help them a bit more.

create external table bucketed (x int) clustered by (x) into 4 buckets stored 
as orc;
insert into bucketed values(1),(2),(3),(4);
insert into bucketed values(1),(2),(3),(4);

0: jdbc:hive2://localhost:2181/> dfs -ls /apps/hive/warehouse/bucketed;

| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000000_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000000_0_copy_1 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000001_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000001_0_copy_1 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000002_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000002_0_copy_1 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000003_0 |
| -rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000003_0_copy_1 |

Even when all buckets are covered, Presto should expect more than one file per 
bucket.

I saw a JIRA comment which said "sort in file order and assign buckets" - you 
can see that only works for the 1st insert into the table (& the regex will 
strip the copy numbering anyway).

And oddly enough this week, I saw an academic paper with a negative analysis of 
Hive bucketing.

https://www.researchgate.net/publication/323997831_Partitioning_and_Bucketing_in_Hive-Based_Big_Data_Warehouses

Cheers,
Gopal

On 4/3/18, 1:42 PM, "Richard A. Bross" <r...@oaktreepeak.com> wrote:

    Gopal,
    
    The Presto devs say they are willing to make the changes to adhere to the 
Hive bucket spec.  I quoted 
    
    "Presto could fix their fail-safe for bucketing implementation to actually 
trust the Hive bucketing spec & get you out of this mess - the bucketing 
contract for Hive is actual file name -> hash % buckets 
(Utilities::getBucketIdFromFile)."
    
    so they're asking "where is the Hive bucketing spec".  Is it just to read 
the code for that function?  They were looking for something more explicit, I 
think.
    
    Thanks
    
    ----- Original Message -----
    From: "Gopal Vijayaraghavan" <gop...@apache.org>
    To: user@hive.apache.org
    Sent: Tuesday, April 3, 2018 3:15:46 AM
    Subject: Re: Hive, Tez, clustering, buckets, and Presto
    
    >    * I'm interested in your statement that CLUSTERED BY does not CLUSTER 
BY.  My understanding was that this was related to the number of buckets, but 
you are relating it to ORC stripes.  It is odd that no examples that I've seen 
include the SORTED BY statement other than in relation to ORC indexes (that I 
understand).  So the question is: regardless of whether efficient ORC stripes 
are created, wouldn't I have to also specify 'orc.create.index'='true' for this 
to have much of an effect?
    
    ORC + bucketing has been something I've spent a lot of time with - a lot of 
this has to do with secondary characteristics of the data (i.e. the same device 
has natural progressions for its metrics), which when combined with a columnar 
format & ordering within files produces better storage and runtimes together 
(which I guess is usually a trade-off).
    
    Without a SORTED BY, the organizing function for the data-shuffle does not 
order in any specific way - the partition key for the shuffle is the modulus, 
while the order key is 0 bytes long, so it sorts by (modulus,), which for a 
quick-sort also loses the input order into the shuffle, & each bucket file is 
produced in random order within itself.
    
    An explicit sort with bucketing is what I recommend to most of the HDP 
customers who have performance problems with ORC.
    
    This turns the shuffle key into (modulus, key1, key2) producing more 
predictable order during shuffle.
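    
    As a sketch of what that looks like in DDL - the table & column names here 
are made up for illustration:
    
    create table metrics (device_id bigint, ts bigint, value double)
    clustered by (device_id) sorted by (device_id, ts) into 32 buckets
    stored as orc;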
    
    Then key1 (device_id in the sketch above) can be RLE encoded so that the 
ORC vectorized impl will pass it on as key1 x 1024 repetitions & do ~1000x 
fewer comparisons when filtering rows on integer columns.
    
    https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5
    
    was written as a warning to customers who use bucketing to try & solve 
performance problems, but have ended up with bucketing as their main problem.
    
    Most of what I have written above was discussed a few years back, and in 
general, bucketing on a high-cardinality column + sorting on a low-cardinality 
one has given good results for my customers.
    
    >    I hadn't thought of the even number issue, not having looked at the 
function; I had assumed that it was a hash, not a modulus; shame on me.  
Reading the docs I see that hash is only used on string columns
    
    Actually a hash is used in theory, but I entirely blame Java for it - the 
Java hash is an identity function for Integers.
    
    scala> 42.hashCode
    res1: Int = 42
    
    scala> 42L.hashCode
    res2: Int = 42
    
    > Finally, I'm not sure that I got a specific answer to my original 
question, which is can I force Tez to create all bucket files so Presto queries 
can succeed?  Anyway, I will be testing today and the solution will either be 
to forgo buckets completely or to simply rely on ORC indexes.
    
    There's no config to do that today & Presto is already incompatible with 
Hive 3.0 tables (Update/Delete support).
    
    Presto could fix their fail-safe for bucketing implementation to actually 
trust the Hive bucketing spec & get you out of this mess - the bucketing 
contract for Hive is actual file name -> hash % buckets 
(Utilities::getBucketIdFromFile).
    
    The file-count is a very flaky way to check whether a table is bucketed 
correctly - either you trust the user to have properly bucketed the table or 
you don't use the bucketing at all. Failing hard on valid tables looks pretty 
bad compared to a soft fallback.
    
    I wrote a few UDFs which were used to validate suspect tables and fix them, 
for customers who had bad historical data loaded with 
"enforce.bucketing=false" or during the short hive-0.13 period with HIVE-12945.
    
    
https://github.com/t3rmin4t0r/hive-bucket-helpers/blob/master/src/main/java/org/notmysock/hive/udf/BucketCheckUDF.java#L27
    
    LLAP has a bucket pruning implementation if Presto wants to copy from it 
(LLAP's S3 BI mode goes further and caches column indexes in memory or SSD).
    
    Optimizer: 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/FixedBucketPruningOptimizer.java#L236
    Runtime: 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L281
    
    That actually does things according to the Hive bucketing contract, where 
uncovered buckets are assumed to have 0 rows when no file is present, instead 
of erroring out.
    
    If you do have the ability to redeploy Hive, the change you are looking for 
is a 1-liner to enable.
    
    
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1248
    
    Cheers,
    Gopal
    
    

