Todd Lipcon has posted comments on this change.

Change subject: Ported delta encoding from Impala to KUDU.
......................................................................


Patch Set 5:

(25 comments)

here are comments on an older rev (I'd queued up a bunch so just stuck with 
reviewing that revision)

http://gerrit.cloudera.org:8080/#/c/1210/5/src/kudu/cfile/packed_diff_block.h
File src/kudu/cfile/packed_diff_block.h:

Line 43: // Delta encoding supports signed integer(INT64, INT32, INT16, INT8).
is this an implementation of an algorithm from the literature? one of Lemire's? 
Some tweak thereof? Would be good to reference it


Line 50: //  - list of bitwidths of miniblocks (one byte each)
can a miniblock have a bitwidth of 0 (eg in the case that you have purely 
sequential data?)


Line 54: //  - list of miniblocks: each miniblock is a list of bit packed
this doesn't make clear what the bit packed ints represent - i.e whether it's 
differential (ie each int is a delta from the previous one) or what.


PS5, Line 56: So setting the number of deltas stored in one miniblock
this should refer to kNumEntriesPerMiniBlock below and make it more clear that 
it's set to a fixed number that's part of the format (not a choice).


Line 84:       if (num_elems_ == 0) {
can move this block into the if block on line 90 to get rid of a branch in the 
hot path


Line 86:         cur_mblock_.num_elems = 0;
why not initialize this in Reset()? seems like it would make more sense there 
plus make the tight loop have a bit less code


Line 90:       if (cur_mblock_.num_elems == 0) {
probably PREDICT_FALSE here (only 1/128 of iterations hit it, right?)


Line 108:       if (cur_mblock_.num_elems < kEntriesPerMiniBlock) {
PREDICT_TRUE this?


Line 166:     }
layout wise, we should consider advantages and disadvantages of this layout vs 
an "interleaved" layout, where we have all the info for a given miniblock 
contiguous -- maybe some cache miss differences


PS5, Line 184: boost::make_unsigned<
std::make_unsigned


Line 214:     // Caculate the number of bits needed to store the diff between 
max_delta and min_delta.
typo: calculateh


Line 232:   // The number of deltas stored in one miniblock
worth a comment here saying that this is a multiple of 8 so that the miniblocks 
end up aligned


PS5, Line 252: OVERRIDE
can use 'override' now (same elsewhere in this file)


PS5, Line 257: Header corruption"
let's make these more specific to this code path, like 'PackedDiffBlock header 
too short'


PS5, Line 268: Meta-data section corruption
same


Line 271:     data_bit_offset_ = bit_reader_.position();
is there a guarantee that we are byte-aligned at this point? I think so, 
considering we've only read whole bytes. worth a DCHECK to document the 
invariant here.


PS5, Line 292: (c
nit: dont need the extra parens here


Line 319:       // Iterate the elements in this miniblock until we find the 
element that is equal or greater
can we not std::find (binary search) here? with 128 elements I think binary 
search is worth it. Similarly maybe binary search is worth it above to find the 
right miniblock?


Line 333:     if (cur_idx_ == num_elems_)
add {}, maybe PREDICT_FALSE


Line 363:     return (num_elems_ - cur_idx_) > 0;
why not the more straight forward 'cur_idx < num_elems_'? (also less apt to 
have an int underflow)


PS5, Line 373: k 
can we find a better name for this var?


Line 381:       if (pending_.size() == 0) {
.empty


Line 387:     if (pending_.size() > 0) {
.empty


Line 400:       
RETURN_NOT_OK(DoGetNextMiniBlock(reinterpret_cast<CppType*>(vals)));
this cast seems a no-op


http://gerrit.cloudera.org:8080/#/c/1210/5/src/kudu/util/bit-stream-utils.inline.h
File src/kudu/util/bit-stream-utils.inline.h:

Line 120:   int bytes_remaining = max_bytes_ - byte_offset_;
can this just be a call to BufferValues()?


-- 
To view, visit http://gerrit.cloudera.org:8080/1210
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1446a78f22773c28a7cc877fbe861697e39b2af8
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: jingkai.y...@intel.com
Gerrit-Reviewer: David Ribeiro Alves <david.al...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: todd+t...@lipcon.org
Gerrit-HasComments: Yes

Reply via email to