Todd Lipcon has posted comments on this change. Change subject: Ported delta encoding from Impala to KUDU. ......................................................................
Patch Set 5: (25 comments) here are comments on an older rev (I'd queued up a bunch so just stuck with reviewing that revision) http://gerrit.cloudera.org:8080/#/c/1210/5/src/kudu/cfile/packed_diff_block.h File src/kudu/cfile/packed_diff_block.h: Line 43: // Delta encoding supports signed integer(INT64, INT32, INT16, INT8). is this an implementation of an algorithm from the literature? one of Lemire's? Some tweak thereof? Would be good to reference it Line 50: // - list of bitwidths of miniblocks (one byte each) can a miniblock have a bitwidth of 0 (eg in the case that you have purely sequential data?) Line 54: // - list of miniblocks: each miniblock is a list of bit packed this doesn't make clear what the bit packed ints represent - i.e whether it's differential (ie each int is a delta from the previous one) or what. PS5, Line 56: So setting the number of deltas stored in one miniblock this should refer to kNumEntriesPerMiniBlock below and make it more clear that it's set to a fixed number that's part of the format (not a choice). Line 84: if (num_elems_ == 0) { can move this block into the if block on line 90 to get rid of a branch in the hot path Line 86: cur_mblock_.num_elems = 0; why not initialize this in Reset()? seems like it would make more sense there plus make the tight loop have a bit less code Line 90: if (cur_mblock_.num_elems == 0) { probably PREDICT_FALSE here (only 1/128 of iterations hit it, right?) Line 108: if (cur_mblock_.num_elems < kEntriesPerMiniBlock) { PREDICT_TRUE this? Line 166: } layout wise, we should consider advantages and disadvantages of this layout vs an "interleaved" layout, where we have all the info for a given miniblock contiguous -- maybe some cache miss differences PS5, Line 184: boost::make_unsigned< std::make_unsigned Line 214: // Caculate the number of bits needed to store the diff between max_delta and min_delta. typo: calculateh Line 232: // The number of deltas stored in one miniblock worth a comment here saying that this is a multiple of 8 so that the miniblocks end up aligned PS5, Line 252: OVERRIDE can use 'override' now (same elsewhere in this file) PS5, Line 257: Header corruption" let's make these more specific to this code path, like 'PackedDiffBlock header too short' PS5, Line 268: Meta-data section corruption same Line 271: data_bit_offset_ = bit_reader_.position(); is there a guarantee that we are byte-aligned at this point? I think so, considering we've only read whole bytes. worth a DCHECK to document the invariant here. PS5, Line 292: (c nit: dont need the extra parens here Line 319: // Iterate the elements in this miniblock until we find the element that is equal or greater can we not std::find (binary search) here? with 128 elements I think binary search is worth it. Similarly maybe binary search is worth it above to find the right miniblock? Line 333: if (cur_idx_ == num_elems_) add {}, maybe PREDICT_FALSE Line 363: return (num_elems_ - cur_idx_) > 0; why not the more straight forward 'cur_idx < num_elems_'? (also less apt to have an int underflow) PS5, Line 373: k can we find a better name for this var? Line 381: if (pending_.size() == 0) { .empty Line 387: if (pending_.size() > 0) { .empty Line 400: RETURN_NOT_OK(DoGetNextMiniBlock(reinterpret_cast<CppType*>(vals))); this cast seems a no-op http://gerrit.cloudera.org:8080/#/c/1210/5/src/kudu/util/bit-stream-utils.inline.h File src/kudu/util/bit-stream-utils.inline.h: Line 120: int bytes_remaining = max_bytes_ - byte_offset_; can this just be a call to BufferValues()? -- To view, visit http://gerrit.cloudera.org:8080/1210 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I1446a78f22773c28a7cc877fbe861697e39b2af8 Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: jingkai.y...@intel.com Gerrit-Reviewer: David Ribeiro Alves <david.al...@cloudera.com> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-Reviewer: todd+t...@lipcon.org Gerrit-HasComments: Yes