Hi Janne,

This is an interesting question.

If you never plan on querying by those columns themselves, concatenating them into a single binary column as the PK will save a bit of space relative to storing them separately. For a composite primary key, Kudu internally encodes a concatenated binary key and stores it using prefix encoding. So, if you store the fields separately, you get that same composite binary encoding plus the additional storage for the separate columns.

However, if you have any use case for querying on them, having separate columns would be quite useful, since Kudu can push down predicates to individual columns.

Being able to use the subfields for partitioning is also likely to be useful. For example, you might want to hash-partition on 'topic+partition' together so that all data for a given topic partition always ends up stored together. This wouldn't be possible with a combined (manually encoded) key.

-Todd

On Fri, Aug 25, 2017 at 11:10 PM, Janne Keskitalo <[email protected]> wrote:

> Hi
>
> We're inserting messages from kafka into kudu tables and some messages
> don't have a natural primary key, hence we decided to use kafka
> topic/partition/offset -combination as the key. Is it better to concatenate
> the fields into one kudu column or create a separate column for each? Do we
> get better compression if using individual columns? And is the PK index
> structure maintained outside of the actual table data?
>
> --
> Br.
> Janne Keskitalo,
> Database Architect, PAF.COM
> For support: [email protected]
>

--
Todd Lipcon
Software Engineer, Cloudera
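
For illustration only, here is a minimal sketch of what a manually encoded topic/partition/offset key could look like if you went the single-binary-column route. The function names and the byte layout (UTF-8 topic, a zero-byte terminator, then big-endian partition and offset) are assumptions made up for this example; this is not Kudu's internal composite-key encoding, and it does not use the Kudu client API.

```python
import struct


def encode_key(topic: str, partition: int, offset: int) -> bytes:
    """Hypothetical manual encoding of (topic, partition, offset) into one binary key."""
    topic_bytes = topic.encode("utf-8")
    if b"\x00" in topic_bytes:
        # The zero byte is used as the terminator, so it may not appear in the topic.
        raise ValueError("topic must not contain a zero byte")
    # Big-endian unsigned integers keep byte-wise ordering equal to numeric ordering.
    # Assumes a 32-bit partition number and a 64-bit offset, both non-negative.
    return topic_bytes + b"\x00" + struct.pack(">IQ", partition, offset)


def decode_key(key: bytes) -> tuple[str, int, int]:
    """Inverse of encode_key: split at the first zero byte, then unpack the integers."""
    topic_bytes, _, rest = key.partition(b"\x00")
    partition, offset = struct.unpack(">IQ", rest)
    return topic_bytes.decode("utf-8"), partition, offset
```

With a layout like this, the encoded keys sort in the same order as the (topic, partition, offset) tuples, but the individual fields are only recoverable by decoding on the client side, which is the trade-off against keeping three separate key columns.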
