Thank you for your time, Jeff, very helpful. I couldn't find anything out
there about the subject and I suspected that this could be the case.
Regarding the clustering key in this case:
Back in the RDBMS world, you would normally assign a sequential (or as
sequential as possible) clustering key to a table to minimize fragmentation
and speed up insertions. In the Cassandra world, does the
same apply to the clustering key? For example, is it a good idea to use
a UUID as a clustering key, or would a timestamp be a better choice? I am
thinking that partitions need to keep some sort of binary index for the
clustering keys, and that for relatively large partitions this index can be
relatively expensive to maintain.
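
To make the UUID vs. timestamp comparison concrete, here is a minimal Python
sketch using only the standard-library uuid module. The helper make_uuid1 and
the sample timestamps are made up for illustration. It shows that a version 1
(time-based) UUID embeds a timestamp, and that sorting by that timestamp is
not the same as sorting by the raw bytes. As far as I understand, Cassandra's
timeuuid type orders by the embedded timestamp, but that is my assumption and
this says nothing about how clustering keys are indexed internally.

```python
import uuid


def make_uuid1(timestamp: int, clock_seq: int = 0,
               node: int = 0x123456789ABC) -> uuid.UUID:
    """Build a version 1 UUID from a 60-bit timestamp.

    The timestamp counts 100 ns ticks since 1582-10-15 (RFC 4122).
    Hypothetical helper: clock_seq and node are arbitrary fixed values.
    """
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000  # version 1
    clock_seq_low = clock_seq & 0xFF
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80  # RFC 4122 variant
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


# Two made-up timestamps one tick apart, chosen so that the low 32 bits
# (which come first in the UUID byte layout) wrap around between them.
earlier = (0x1E8 << 48) | (0x1234 << 32) | 0xFFFFFFFF
later = earlier + 1

uuids = [make_uuid1(later), make_uuid1(earlier)]

# Ordering by the embedded timestamp gives chronological order.
by_time = sorted(uuids, key=lambda u: u.time)
# Ordering by raw bytes compares time_low first, so it can disagree.
by_bytes = sorted(uuids, key=lambda u: u.bytes)

print("by time :", [str(u) for u in by_time])
print("by bytes:", [str(u) for u in by_bytes])
```

A random (version 4) UUID has no timestamp to order by at all, which is the
part of my question I am least sure about.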
F Javier Pareja
On Wed, Mar 7, 2018 at 5:20 PM, Jeff Jirsa <jji...@gmail.com> wrote:
> On Wed, Mar 7, 2018 at 7:13 AM, Carlos Rolo <r...@pythian.com> wrote:
>> Hi Jeff,
>> Could you expand: "Tables without clustering keys are often deceptively
>> expensive to compact, as a lot of work (relative to the other cell
>> boundaries) happens on partition boundaries." This is something I didn't
>> know, and I would be very interested to know more about it!
> We do a lot "by partition". We build column indexes by partition. We
> update the partition index on each partition. We invalidate key cache by
> partition. They're not super expensive, but they take time, and tables with
> tiny partitions can actually be slower to compact.
> There's no magic cutoff where it does/doesn't make sense; my comment is
> mostly a warning that the edges of the "normal" use cases tend to be less
> optimized than the common case. Take a table with a hundred billion
> records, where the key is numeric and the value is a single byte (let's say
> you're keeping track of whether or not a specific sensor has ever detected
> some magic event, and you have 100B sensors): that table will be close to
> the worst-case example of this behavior.
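
One way to read the point above is as a back-of-the-envelope model: some work
is paid once per partition and some once per row, so the per-partition share
grows as partitions shrink. The Python sketch below is only that, a toy model.
The function name and the unit costs (5.0 per partition, 1.0 per row) are
invented for illustration and are not measurements of Cassandra.

```python
def partition_overhead_share(total_rows: int, rows_per_partition: int,
                             partition_cost: float = 5.0,
                             row_cost: float = 1.0) -> float:
    """Fraction of total (hypothetical) compaction work spent on
    per-partition tasks rather than per-row tasks."""
    # Ceiling division: a partially filled partition still counts as one.
    partitions = -(-total_rows // rows_per_partition)
    partition_work = partitions * partition_cost
    row_work = total_rows * row_cost
    return partition_work / (partition_work + row_work)


# Same number of rows, laid out as tiny vs. wide partitions.
tiny = partition_overhead_share(1_000_000, rows_per_partition=1)
wide = partition_overhead_share(1_000_000, rows_per_partition=1_000)

print(f"one row per partition : {tiny:.1%} of work is per-partition")
print(f"1000 rows per partition: {wide:.1%} of work is per-partition")
```

Whatever the real costs are, the shape is the same: with one row per
partition, every row also pays the full per-partition price.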