Re: Improved DeletionTime serialization to reduce disk size

2023-07-17 Thread Brandon Williams
On Sun, Jul 16, 2023 at 11:47 PM Berenguer Blasi
 wrote:
> one q that came up during the review: What should we do if we find a 
> markForDeleteAt (mfda) using the MSByte? That is, an mfda beyond year 4254:
>
> A. That is a mistake/bug. It makes no sense when localDeletionTime can't 
> already go any further than year 2106. We should reject/fail, maybe log and 
> add an upgrade note.

I think creation of doomstones is always a bug, but perhaps there is a
use case I cannot think of.  One option that was discussed is setting
a default for the maximum_timestamp_fail_threshold which I think could
make sense, since it would provide protection but allow a way out.

> B. That was supported, regardless of how weird it may be. Cap it to the 
> current max year 4254, maybe log and add an upgrade note.

I am not a fan of doing something other than what we were asked to do;
I think we should either reject it, or do it.
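
For reference, the guardrail Brandon mentions lives in cassandra.yaml. A hedged sketch of what setting a default might look like (the key names follow the guardrail naming in the thread, but exact value syntax should be verified against the shipped cassandra.yaml for your version):

```yaml
# Hypothetical cassandra.yaml fragment: bound how far ahead of the node's
# clock a write timestamp may be before warning/rejecting. Value formats
# are an assumption and should be checked against the guardrails docs.
maximum_timestamp_warn_threshold: 1d
maximum_timestamp_fail_threshold: 10d
```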


Re: Improved DeletionTime serialization to reduce disk size

2023-07-16 Thread Berenguer Blasi

Hi All,

one q that came up during the review: What should we do if we find a 
markForDeleteAt (mfda) using the MSByte? That is, an mfda beyond year 4254:


A. That is a mistake/bug. It makes no sense when localDeletionTime can't 
already go any further than year 2106. We should reject/fail, maybe log 
and add an upgrade note.


B. That was supported, regardless of how weird it may be. Cap it to the 
current max year 4254, maybe log and add an upgrade note.


Happy to hear your thoughts.
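
For illustration, the A/B choice above might be sketched like this (constant and method names here are made up for the example, not taken from the actual patch; the 7-byte maximum is the ~year-4254 bound discussed in the thread):

```java
// Sketch of detecting a markForDeleteAt (mfda) that needs the most
// significant byte, i.e. a microseconds-since-epoch value beyond ~year 4254.
// Option A rejects it; option B caps it to the representable maximum.
public class MfdaCheck {
    // Largest microseconds-since-Unix-epoch value that fits in 7 bytes:
    // (2^56 - 1) µs ≈ 2284 years after 1970, i.e. ~year 4254.
    static final long MAX_7_BYTE_MFDA = (1L << 56) - 1;

    static long validateOrCap(long markForDeleteAt, boolean capInsteadOfFail) {
        if (markForDeleteAt > MAX_7_BYTE_MFDA) {
            if (!capInsteadOfFail)                       // Option A: reject/fail
                throw new IllegalArgumentException(
                    "markForDeleteAt beyond ~year 4254: " + markForDeleteAt);
            return MAX_7_BYTE_MFDA;                      // Option B: cap
        }
        return markForDeleteAt;
    }

    public static void main(String[] args) {
        long sane = 1_689_600_000_000_000L;   // mid-2023, in microseconds
        long doomstone = 1L << 60;            // far beyond year 4254
        System.out.println(validateOrCap(sane, false) == sane);                // true
        System.out.println(validateOrCap(doomstone, true) == MAX_7_BYTE_MFDA); // true
    }
}
```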

On 5/7/23 7:05, Berenguer Blasi wrote:


Hi All,

https://issues.apache.org/jira/browse/CASSANDRA-18648 up for review 
and the PR is quite small


Regards

On 3/7/23 11:03, Berenguer Blasi wrote:


Thanks for the comments Benedict. Given 
DeletionTime.localDeletionTime is what caps everything to year 2106 
(uint encoded now) I am ok with a DeletionTime.markForDeleteAt that 
can go up to year 4254, personal opinion ofc.


And yes I hope once I read, doc and understand the sstable format 
better I can look into your suggestion and anything else I come across.


On 3/7/23 9:46, Benedict wrote:
I checked and I’m pretty sure we do, but it doesn’t apply any 
liveness optimisation. I had misunderstood the optimisation you 
proposed. Ideally we would encode any non-live timestamp with the 
delta offset, but since that’s a distinct optimisation perhaps that 
can be left to another patch.


Are we happy, though, that the two different deletion time 
serialisers can store different ranges of timestamp? Both are large 
ranges, but I am not 100% comfortable with them diverging.


On 3 Jul 2023, at 05:45, Berenguer Blasi  
wrote:




I can look into it. I don't have a deep knowledge of the sstable 
format hence why I wanted to document it someday. But DeletionTime 
is being serialized in other places as well iirc and I doubt 
(finger in the air) we'll have that Epoch handy.


On 29/6/23 17:22, Benedict wrote:
So I’m just taking a quick peek at SerializationHeader and we 
already have a method for reading and writing a deletion time with 
offsets from EncodingStats.


So perhaps we simply have a bug where we are using DeletionTime 
Serializer instead of SerializationHeader.writeLocalDeletionTime? 
It looks to me like this is already available at most (perhaps 
all) of the relevant call sites.




On 29 Jun 2023, at 15:53, Josh McKenzie  wrote:



I would prefer we not plan on two distinct changes to this

I agree with this sentiment, /*and*/


+1, if you have time for this approach and no other in this window.
People are going to use 5.0 for a while. Better to have an 
improvement in their hands for that duration than no improvement 
at all IMO. Justifies the cost of the double implementation and 
transitions to me.


On Tue, Jun 27, 2023, at 5:43 AM, Mick Semb Wever wrote:


Just for completeness the change is a handful of loc. The rest
is added tests and we'd lose the sstable format change
opportunity window.



+1, if you have time for this approach and no other in this window.

(If you have time for the other, or someone else does, then the 
technically superior approach should win)





Re: Improved DeletionTime serialization to reduce disk size

2023-07-04 Thread Berenguer Blasi

Hi All,

https://issues.apache.org/jira/browse/CASSANDRA-18648 up for review and 
the PR is quite small


Regards

On 3/7/23 11:03, Berenguer Blasi wrote:


Thanks for the comments Benedict. Given DeletionTime.localDeletionTime 
is what caps everything to year 2106 (uint encoded now) I am ok with 
a DeletionTime.markForDeleteAt that can go up to year 4254, personal 
opinion ofc.


And yes I hope once I read, doc and understand the sstable format 
better I can look into your suggestion and anything else I come across.


On 3/7/23 9:46, Benedict wrote:
I checked and I’m pretty sure we do, but it doesn’t apply any 
liveness optimisation. I had misunderstood the optimisation you 
proposed. Ideally we would encode any non-live timestamp with the 
delta offset, but since that’s a distinct optimisation perhaps that 
can be left to another patch.


Are we happy, though, that the two different deletion time 
serialisers can store different ranges of timestamp? Both are large 
ranges, but I am not 100% comfortable with them diverging.


On 3 Jul 2023, at 05:45, Berenguer Blasi  
wrote:




I can look into it. I don't have a deep knowledge of the sstable 
format hence why I wanted to document it someday. But DeletionTime 
is being serialized in other places as well iirc and I doubt (finger 
in the air) we'll have that Epoch handy.


On 29/6/23 17:22, Benedict wrote:
So I’m just taking a quick peek at SerializationHeader and we 
already have a method for reading and writing a deletion time with 
offsets from EncodingStats.


So perhaps we simply have a bug where we are using DeletionTime 
Serializer instead of SerializationHeader.writeLocalDeletionTime? 
It looks to me like this is already available at most (perhaps all) 
of the relevant call sites.




On 29 Jun 2023, at 15:53, Josh McKenzie  wrote:



I would prefer we not plan on two distinct changes to this

I agree with this sentiment, /*and*/


+1, if you have time for this approach and no other in this window.
People are going to use 5.0 for a while. Better to have an 
improvement in their hands for that duration than no improvement 
at all IMO. Justifies the cost of the double implementation and 
transitions to me.


On Tue, Jun 27, 2023, at 5:43 AM, Mick Semb Wever wrote:


Just for completeness the change is a handful of loc. The rest
is added tests and we'd lose the sstable format change
opportunity window.



+1, if you have time for this approach and no other in this window.

(If you have time for the other, or someone else does, then the 
technically superior approach should win)





Re: Improved DeletionTime serialization to reduce disk size

2023-07-03 Thread Berenguer Blasi
Thanks for the comments Benedict. Given DeletionTime.localDeletionTime 
is what caps everything to year 2106 (uint encoded now) I am ok with a 
DeletionTime.markForDeleteAt that can go up to year 4254, personal 
opinion ofc.


And yes I hope once I read, doc and understand the sstable format better 
I can look into your suggestion and anything else I come across.


On 3/7/23 9:46, Benedict wrote:
I checked and I’m pretty sure we do, but it doesn’t apply any liveness 
optimisation. I had misunderstood the optimisation you proposed. 
Ideally we would encode any non-live timestamp with the delta offset, 
but since that’s a distinct optimisation perhaps that can be left to 
another patch.


Are we happy, though, that the two different deletion time serialisers 
can store different ranges of timestamp? Both are large ranges, but I 
am not 100% comfortable with them diverging.


On 3 Jul 2023, at 05:45, Berenguer Blasi  
wrote:




I can look into it. I don't have a deep knowledge of the sstable 
format hence why I wanted to document it someday. But DeletionTime is 
being serialized in other places as well iirc and I doubt (finger in 
the air) we'll have that Epoch handy.


On 29/6/23 17:22, Benedict wrote:
So I’m just taking a quick peek at SerializationHeader and we 
already have a method for reading and writing a deletion time with 
offsets from EncodingStats.


So perhaps we simply have a bug where we are using DeletionTime 
Serializer instead of SerializationHeader.writeLocalDeletionTime? It 
looks to me like this is already available at most (perhaps all) of 
the relevant call sites.




On 29 Jun 2023, at 15:53, Josh McKenzie  wrote:



I would prefer we not plan on two distinct changes to this

I agree with this sentiment, /*and*/


+1, if you have time for this approach and no other in this window.
People are going to use 5.0 for a while. Better to have an 
improvement in their hands for that duration than no improvement at 
all IMO. Justifies the cost of the double implementation and 
transitions to me.


On Tue, Jun 27, 2023, at 5:43 AM, Mick Semb Wever wrote:


Just for completeness the change is a handful of loc. The rest is
added tests and we'd lose the sstable format change
opportunity window.



+1, if you have time for this approach and no other in this window.

(If you have time for the other, or someone else does, then the 
technically superior approach should win)





Re: Improved DeletionTime serialization to reduce disk size

2023-07-03 Thread Benedict
I checked and I’m pretty sure we do, but it doesn’t apply any liveness 
optimisation. I had misunderstood the optimisation you proposed. Ideally 
we would encode any non-live timestamp with the delta offset, but since 
that’s a distinct optimisation perhaps that can be left to another patch.


Are we happy, though, that the two different deletion time serialisers 
can store different ranges of timestamp? Both are large ranges, but I am 
not 100% comfortable with them diverging.


On 3 Jul 2023, at 05:45, Berenguer Blasi  wrote:


I can look into it. I don't have a deep knowledge of the sstable format 
hence why I wanted to document it someday. But DeletionTime is being 
serialized in other places as well iirc and I doubt (finger in the air) 
we'll have that Epoch handy.


On 29/6/23 17:22, Benedict wrote:

So I’m just taking a quick peek at SerializationHeader and we already 
have a method for reading and writing a deletion time with offsets from 
EncodingStats.


So perhaps we simply have a bug where we are using DeletionTime 
Serializer instead of SerializationHeader.writeLocalDeletionTime? It 
looks to me like this is already available at most (perhaps all) of the 
relevant call sites.


On 29 Jun 2023, at 15:53, Josh McKenzie  wrote:


I would prefer we not plan on two distinct changes to this

I agree with this sentiment, and


+1, if you have time for this approach and no other in this window.

People are going to use 5.0 for a while. Better to have an improvement 
in their hands for that duration than no improvement at all IMO. 
Justifies the cost of the double implementation and transitions to me.


On Tue, Jun 27, 2023, at 5:43 AM, Mick Semb Wever wrote:

Just for completeness the change is a handful of loc. The rest is added 
tests and we'd lose the sstable format change opportunity window.


+1, if you have time for this approach and no other in this window.

(If you have time for the other, or someone else does, then the 
technically superior approach should win)

Re: Improved DeletionTime serialization to reduce disk size

2023-07-02 Thread Berenguer Blasi
I can look into it. I don't have a deep knowledge of the sstable format 
hence why I wanted to document it someday. But DeletionTime is being 
serialized in other places as well iirc and I doubt (finger in the air) 
we'll have that Epoch handy.


On 29/6/23 17:22, Benedict wrote:
So I’m just taking a quick peek at SerializationHeader and we already 
have a method for reading and writing a deletion time with offsets 
from EncodingStats.


So perhaps we simply have a bug where we are using DeletionTime 
Serializer instead of SerializationHeader.writeLocalDeletionTime? It 
looks to me like this is already available at most (perhaps all) of 
the relevant call sites.




On 29 Jun 2023, at 15:53, Josh McKenzie  wrote:



I would prefer we not plan on two distinct changes to this

I agree with this sentiment, /*and*/


+1, if you have time for this approach and no other in this window.
People are going to use 5.0 for a while. Better to have an improvement 
in their hands for that duration than no improvement at all IMO. 
Justifies the cost of the double implementation and transitions to me.


On Tue, Jun 27, 2023, at 5:43 AM, Mick Semb Wever wrote:


Just for completeness the change is a handful of loc. The rest is
added tests and we'd lose the sstable format change opportunity
window.



+1, if you have time for this approach and no other in this window.

(If you have time for the other, or someone else does, then the 
technically superior approach should win)





Re: Improved DeletionTime serialization to reduce disk size

2023-07-02 Thread Berenguer Blasi
The idea is 11 bytes less per LIVE partition. So small partitions will 
benefit the most.
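
The "11 bytes less per LIVE partition" arithmetic can be sketched as follows. This is an illustrative toy, not the actual patch: the flag value, the layout, and the helper names are invented here, and it assumes LIVE is represented as (Long.MIN_VALUE, Integer.MAX_VALUE) as discussed upthread. Since a markedForDeleteAt that fits in 7 bytes already covers ~2284 years (through ~year 4254), the most significant byte is free for flags, so the common LIVE case can collapse from 12 bytes (8-byte mfda long + 4-byte ldt int) to a single flag byte:

```java
import java.io.*;

// Toy codec: 1 flag byte for LIVE, otherwise the usual 12 bytes where the
// mfda's unused MSByte doubles as the "not LIVE" flag byte (0).
public class DeletionTimeCodec {
    static final int LIVE_FLAG = 0x80;            // hypothetical flag bit
    static final long LIVE_MFDA = Long.MIN_VALUE; // LIVE sentinel timestamp

    static void serialize(long mfda, int ldt, DataOutputStream out) throws IOException {
        if (mfda == LIVE_MFDA) {
            out.writeByte(LIVE_FLAG);  // 1 byte instead of 12
            return;
        }
        out.writeLong(mfda);  // MSByte is 0 for any mfda below ~year 4254
        out.writeInt(ldt);    // 12 bytes total, as before
    }

    static long[] deserialize(DataInputStream in) throws IOException {
        int first = in.readUnsignedByte();
        if ((first & LIVE_FLAG) != 0)
            return new long[] { LIVE_MFDA, Integer.MAX_VALUE }; // LIVE
        long mfda = first;
        for (int i = 0; i < 7; i++)  // remaining 7 big-endian mfda bytes
            mfda = (mfda << 8) | in.readUnsignedByte();
        return new long[] { mfda, in.readInt() };
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream live = new ByteArrayOutputStream();
        serialize(LIVE_MFDA, Integer.MAX_VALUE, new DataOutputStream(live));
        ByteArrayOutputStream tomb = new ByteArrayOutputStream();
        serialize(1_689_600_000_000_000L, 1_689_600_000, new DataOutputStream(tomb));
        System.out.println(live.size());  // 1 -> 11 bytes saved per LIVE partition
        System.out.println(tomb.size());  // 12
        long[] rt = deserialize(new DataInputStream(new ByteArrayInputStream(tomb.toByteArray())));
        System.out.println(rt[0] == 1_689_600_000_000_000L && rt[1] == 1_689_600_000); // true
    }
}
```

The read side only needs one branch on the first byte, which is why the raw-algorithm benchmarks upthread show near-identical ns/op for the old and new formats.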


On 29/6/23 18:44, Brandon Williams wrote:

On Thu, Jun 29, 2023 at 11:42 AM Jeff Jirsa  wrote:

3-4% reduction on disk ... for what exactly?

It seems exceptionally uncommon to have 3% of your data SIZE be tombstones.

If the data is TTL'd I think it's not entirely uncommon.

Kind Regards,
Brandon


Re: Improved DeletionTime serialization to reduce disk size

2023-06-29 Thread Brandon Williams
On Thu, Jun 29, 2023 at 11:42 AM Jeff Jirsa  wrote:
> 3-4% reduction on disk ... for what exactly?
>
> It seems exceptionally uncommon to have 3% of your data SIZE be tombstones.

If the data is TTL'd I think it's not entirely uncommon.

Kind Regards,
Brandon


Re: Improved DeletionTime serialization to reduce disk size

2023-06-29 Thread Jeff Jirsa
On Thu, Jun 22, 2023 at 11:23 PM Berenguer Blasi 
wrote:

> Hi all,
>
> Given we're already introducing a new sstable format (OA) in 5.0 I would
> like to try to get this in before the freeze. The point being that
> sstables with lots of small partitions would benefit from a smaller DT
> at partition level. My tests show a 3%-4% size reduction on disk.
>


3-4% reduction on disk ... for what exactly?

It seems exceptionally uncommon to have 3% of your data SIZE be tombstones.

Is this enhancement driven by a pathological data model that's like "mostly
tiny records OR tombstones" ?


Re: Improved DeletionTime serialization to reduce disk size

2023-06-29 Thread Benedict
So I’m just taking a quick peek at SerializationHeader and we already have a 
method for reading and writing a deletion time with offsets from EncodingStats.

So perhaps we simply have a bug where we are using DeletionTime Serializer 
instead of SerializationHeader.writeLocalDeletionTime? It looks to me like this 
is already available at most (perhaps all) of the relevant call sites.

 

> On 29 Jun 2023, at 15:53, Josh McKenzie  wrote:
> 
> 
>> 
>> I would prefer we not plan on two distinct changes to this
> I agree with this sentiment, and
> 
>> +1, if you have time for this approach and no other in this window.
> People are going to use 5.0 for a while. Better to have an improvement in 
> their hands for that duration than no improvement at all IMO. Justifies the 
> cost of the double implementation and transitions to me.
> 
>> On Tue, Jun 27, 2023, at 5:43 AM, Mick Semb Wever wrote:
>> Just for completeness the change is a handful of loc. The rest is added tests 
>> and we'd lose the sstable format change opportunity window.
>> 
>> 
>> 
>> +1, if you have time for this approach and no other in this window.
>> 
>> (If you have time for the other, or someone else does, then the technically 
>> superior approach should win)
>> 
>> 
> 


Re: Improved DeletionTime serialization to reduce disk size

2023-06-29 Thread Josh McKenzie
> I would prefer we not plan on two distinct changes to this
I agree with this sentiment, **and**

> +1, if you have time for this approach and no other in this window.
People are going to use 5.0 for a while. Better to have an improvement in their 
hands for that duration than no improvement at all IMO. Justifies the cost of 
the double implementation and transitions to me.

On Tue, Jun 27, 2023, at 5:43 AM, Mick Semb Wever wrote:
>> Just for completeness the change is a handful of loc. The rest is added tests 
>> and we'd lose the sstable format change opportunity window.
>> 
> 
> 
> +1, if you have time for this approach and no other in this window.
> 
> (If you have time for the other, or someone else does, then the technically 
> superior approach should win)
> 
> 
> 


Re: Improved DeletionTime serialization to reduce disk size

2023-06-27 Thread Mick Semb Wever
>
> Just for completeness the change is a handful of loc. The rest is added tests
> and we'd lose the sstable format change opportunity window.
>


+1, if you have time for this approach and no other in this window.

(If you have time for the other, or someone else does, then the technically
superior approach should win)


Re: Improved DeletionTime serialization to reduce disk size

2023-06-26 Thread Berenguer Blasi
Just for completeness the change is a handful of loc. The rest is added 
tests and we'd lose the sstable format change opportunity window.


Thx again for the replies.

On 26/6/23 9:33, Benedict wrote:
I would prefer we not plan on two distinct changes to this, 
particularly when neither change is particularly more complex than the 
other. There is a modest cost to maintenance from changing this 
multiple times.


But if others feel strongly otherwise I won’t stand in the way.

On 26 Jun 2023, at 05:49, Berenguer Blasi  
wrote:




Thanks for the replies.

I intend to javadoc the sstable format in detail someday and more 
improvements might come up then, along the vint encoding mentioned 
here. But unless somebody volunteers to do that in 5.0, is anybody 
against me trying to get the original proposal (1 byte flags for 
sentinel values) in?


Regards


Distant future people will not be happy about this, I can already 
tell you now.
Eh, they'll all be AI's anyway and will just rewrite the code in a 
background thread.


LOL




On 23/6/23 15:44, Josh McKenzie wrote:
If we’re doing this, why don’t we delta encode a vint from some 
per-sstable minimum value? I’d expect that to commonly compress to 
a single byte or so.

+1 to this approach.

Distant future people will not be happy about this, I can already 
tell you now.
Eh, they'll all be AI's anyway and will just rewrite the code in a 
background thread.


On Fri, Jun 23, 2023, at 9:02 AM, Berenguer Blasi wrote:

It's a possibility. Though I haven't coded and benchmarked such an
approach and I don't think I would have the time before the freeze to
take advantage of the sstable format change opportunity.

Still it's something that can be explored later. If we can shave a few 
extra % then that would always be great imo.

On 23/6/23 13:57, Benedict wrote:
> If we’re doing this, why don’t we delta encode a vint from some 
per-sstable minimum value? I’d expect that to commonly compress to 
a single byte or so.

>
>> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko  
wrote:

>>
>> Distant future people will not be happy about this, I can 
already tell you now.

>>
>> Sounds like a reasonable improvement to me however.
>>
>>> On 23 Jun 2023, at 07:22, Berenguer Blasi 
 wrote:

>>>
>>> Hi all,
>>>
>>> DeletionTime.markedForDeleteAt is a long of microseconds since the 
Unix Epoch. But I noticed that with 7 bytes we can already encode ~2284 
years. We can either shed the 8th byte, for reduced IO and disk, or 
can encode some sentinel values (such as LIVE) as flags there. That 
would mean reading and writing 1 byte instead of 12 (8 mfda long + 
4 ldts int). Yes we already avoid serializing DeletionTime (DT) in 
sstables at _row_ level entirely but not at _partition_ level and 
it is also serialized at index, metadata, etc.

>>>
>>> So here's a POC: 
https://github.com/bereng/cassandra/commits/ldtdeser-trunk and some 
jmh (1) to evaluate the impact of the new alg (2). It's tested here 
against a 70% and a 30% LIVE DTs to see how we perform:

>>>
>>>  [java] Benchmark (liveDTPcParam)  (sstableParam)  Mode  Cnt  Score   Error  Units
>>>  [java] DeletionTimeDeSerBench.testRawAlgReads   70PcLive  NC  avgt   15  0.331 ± 0.001  ns/op
>>>  [java] DeletionTimeDeSerBench.testRawAlgReads   70PcLive  OA  avgt   15  0.335 ± 0.004  ns/op
>>>  [java] DeletionTimeDeSerBench.testRawAlgReads   30PcLive  NC  avgt   15  0.334 ± 0.002  ns/op
>>>  [java] DeletionTimeDeSerBench.testRawAlgReads   30PcLive  OA  avgt   15  0.340 ± 0.008  ns/op
>>>  [java] DeletionTimeDeSerBench.testNewAlgWrites  70PcLive  NC  avgt   15  0.337 ± 0.006  ns/op
>>>  [java] DeletionTimeDeSerBench.testNewAlgWrites  70PcLive  OA  avgt   15  0.340 ± 0.004  ns/op
>>>  [java] DeletionTimeDeSerBench.testNewAlgWrites  30PcLive  NC  avgt   15  0.339 ± 0.004  ns/op
>>>  [java] DeletionTimeDeSerBench.testNewAlgWrites  30PcLive  OA  avgt   15  0.343 ± 0.016  ns/op

>>>
>>> That was ByteBuffer backed to test the extra bit level 
operations impact. But what would be the impact of an end to end 
test against disk?

>>>
>>>  [java] Benchmark (diskRAMParam)  (liveDTPcParam)  (sstableParam)  Mode  Cnt  Score  Error  Units
>>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM   70PcLive  NC  avgt   15   605236.515 ± 19929.058  ns/op
>>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM   70PcLive  OA  avgt   15   586477.039 ±  7384.632  ns/op
>>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM   30PcLive  NC  avgt   15   937580.311 ± 30669.647  ns/op
>>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM   30PcLive  OA  avgt   15   914097.770 ±  9865.070  ns/op
>>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  Disk  70PcLive  NC  avgt   15  1314417.207 ± 37879.012  ns/op
>>>  [java] 

Re: Improved DeletionTime serialization to reduce disk size

2023-06-26 Thread Benedict
I would prefer we not plan on two distinct changes to this, particularly 
when neither change is particularly more complex than the other. There 
is a modest cost to maintenance from changing this multiple times. But 
if others feel strongly otherwise I won’t stand in the way.

On 26 Jun 2023, at 05:49, Berenguer Blasi  wrote:


Thanks for the replies.

I intend to javadoc the sstable format in detail someday and more 
improvements might come up then, along the vint encoding mentioned 
here. But unless somebody volunteers to do that in 5.0, is anybody 
against me trying to get the original proposal (1 byte flags for 
sentinel values) in?

Regards


Distant future people will not be happy about this, I can already 
tell you now.
Eh, they'll all be AI's anyway and will just rewrite the code in a 
background thread.

LOL


On 23/6/23 15:44, Josh McKenzie wrote:

If we’re doing this, why don’t we delta encode a vint from some 
per-sstable minimum value? I’d expect that to commonly compress to a 
single byte or so.

+1 to this approach.

Distant future people will not be happy about this, I can already 
tell you now.
Eh, they'll all be AI's anyway and will just rewrite the code in a 
background thread.


On Fri, Jun 23, 2023, at 9:02 AM, Berenguer Blasi wrote:

It's a possibility. Though I haven't coded and benchmarked such an 
approach and I don't think I would have the time before the freeze to 
take advantage of the sstable format change opportunity.

Still it's something that can be explored later. If we can shave a few 
extra % then that would always be great imo.

On 23/6/23 13:57, Benedict wrote:
> If we’re doing this, why don’t we delta encode a vint from some 
per-sstable minimum value? I’d expect that to commonly compress to a 
single byte or so.
>
>> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko  wrote:
>>
>> Distant future people will not be happy about this, I can already 
tell you now.
>>
>> Sounds like a reasonable improvement to me however.
>>
>>> On 23 Jun 2023, at 07:22, Berenguer Blasi  wrote:
>>>
>>> Hi all,
>>>
>>> DeletionTime.markedForDeleteAt is a long of microseconds since the 
Unix Epoch. But I noticed that with 7 bytes we can already encode ~2284 
years. We can either shed the 8th byte, for reduced IO and disk, or can 
encode some sentinel values (such as LIVE) as flags there. That would 
mean reading and writing 1 byte instead of 12 (8 mfda long + 4 ldts 
int). Yes we already avoid serializing DeletionTime (DT) in sstables at 
_row_ level entirely but not at _partition_ level and it is also 
serialized at index, metadata, etc.
>>>
>>> So here's a POC: 
https://github.com/bereng/cassandra/commits/ldtdeser-trunk and some jmh 
(1) to evaluate the impact of the new alg (2). It's tested here against 
a 70% and a 30% LIVE DTs to see how we perform:
>>>
>>>  [java] Benchmark (liveDTPcParam)  (sstableParam)  Mode  Cnt  Score   Error  Units
>>>  [java] DeletionTimeDeSerBench.testRawAlgReads   70PcLive  NC  avgt   15  0.331 ± 0.001  ns/op
>>>  [java] DeletionTimeDeSerBench.testRawAlgReads   70PcLive  OA  avgt   15  0.335 ± 0.004  ns/op
>>>  [java] DeletionTimeDeSerBench.testRawAlgReads   30PcLive  NC  avgt   15  0.334 ± 0.002  ns/op
>>>  [java] DeletionTimeDeSerBench.testRawAlgReads   30PcLive  OA  avgt   15  0.340 ± 0.008  ns/op
>>>  [java] DeletionTimeDeSerBench.testNewAlgWrites  70PcLive  NC  avgt   15  0.337 ± 0.006  ns/op
>>>  [java] DeletionTimeDeSerBench.testNewAlgWrites  70PcLive  OA  avgt   15  0.340 ± 0.004  ns/op
>>>  [java] DeletionTimeDeSerBench.testNewAlgWrites  30PcLive  NC  avgt   15  0.339 ± 0.004  ns/op
>>>  [java] 

Re: Improved DeletionTime serialization to reduce disk size

2023-06-25 Thread Berenguer Blasi

Thanks for the replies.

I intend to javadoc the sstable format in detail someday and more 
improvements might come up then, along the vint encoding mentioned here. 
But unless somebody volunteers to do that in 5.0, is anybody against me 
trying to get the original proposal (1 byte flags for sentinel values) in?


Regards


Distant future people will not be happy about this, I can already tell 
you now.
Eh, they'll all be AI's anyway and will just rewrite the code in a 
background thread.


LOL




On 23/6/23 15:44, Josh McKenzie wrote:
If we’re doing this, why don’t we delta encode a vint from some 
per-sstable minimum value? I’d expect that to commonly compress to a 
single byte or so.

+1 to this approach.

Distant future people will not be happy about this, I can already 
tell you now.
Eh, they'll all be AI's anyway and will just rewrite the code in a 
background thread.


On Fri, Jun 23, 2023, at 9:02 AM, Berenguer Blasi wrote:

It's a possibility. Though I haven't coded and benchmarked such an
approach and I don't think I would have the time before the freeze to
take advantage of the sstable format change opportunity.

Still it's sthg that can be explored later. If we can shave a few extra
% then that would always be great imo.

On 23/6/23 13:57, Benedict wrote:
> If we’re doing this, why don’t we delta encode a vint from some 
per-sstable minimum value? I’d expect that to commonly compress to a 
single byte or so.

>
>> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko  
wrote:

>>
>> Distant future people will not be happy about this, I can already 
tell you now.

>>
>> Sounds like a reasonable improvement to me however.
>>
>>> On 23 Jun 2023, at 07:22, Berenguer Blasi 
 wrote:

>>>
>>> Hi all,
>>>
>>> DeletionTime.markedForDeleteAt is a long of microseconds since the 
Unix Epoch. But I noticed that with 7 bytes we can already encode ~2284 
years. We can either shed the 8th byte, for reduced IO and disk, or 
can encode some sentinel values (such as LIVE) as flags there. That 
would mean reading and writing 1 byte instead of 12 (8 mfda long + 4 
ldts int). Yes we already avoid serializing DeletionTime (DT) in 
sstables at _row_ level entirely but not at _partition_ level and it 
is also serialized at index, metadata, etc.

>>>
>>> So here's a POC: 
https://github.com/bereng/cassandra/commits/ldtdeser-trunk and some 
jmh (1) to evaluate the impact of the new alg (2). It's tested here 
against a 70% and a 30% LIVE DTs  to see how we perform:

>>>
>>>  [java] Benchmark (liveDTPcParam)  (sstableParam)  Mode  Cnt  Score   Error  Units
>>>  [java] DeletionTimeDeSerBench.testRawAlgReads   70PcLive  NC  avgt   15  0.331 ± 0.001  ns/op
>>>  [java] DeletionTimeDeSerBench.testRawAlgReads   70PcLive  OA  avgt   15  0.335 ± 0.004  ns/op
>>>  [java] DeletionTimeDeSerBench.testRawAlgReads   30PcLive  NC  avgt   15  0.334 ± 0.002  ns/op
>>>  [java] DeletionTimeDeSerBench.testRawAlgReads   30PcLive  OA  avgt   15  0.340 ± 0.008  ns/op
>>>  [java] DeletionTimeDeSerBench.testNewAlgWrites  70PcLive  NC  avgt   15  0.337 ± 0.006  ns/op
>>>  [java] DeletionTimeDeSerBench.testNewAlgWrites  70PcLive  OA  avgt   15  0.340 ± 0.004  ns/op
>>>  [java] DeletionTimeDeSerBench.testNewAlgWrites  30PcLive  NC  avgt   15  0.339 ± 0.004  ns/op
>>>  [java] DeletionTimeDeSerBench.testNewAlgWrites  30PcLive  OA  avgt   15  0.343 ± 0.016  ns/op

>>>
>>> That was ByteBuffer backed to test the extra bit level operations 
impact. But what would be the impact of an end to end test against disk?

>>>
>>>  [java] Benchmark (diskRAMParam)  (liveDTPcParam)  (sstableParam)  Mode  Cnt  Score  Error  Units
>>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM   70PcLive  NC  avgt   15   605236.515 ± 19929.058  ns/op
>>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM   70PcLive  OA  avgt   15   586477.039 ±  7384.632  ns/op
>>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM   30PcLive  NC  avgt   15   937580.311 ± 30669.647  ns/op
>>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM   30PcLive  OA  avgt   15   914097.770 ±  9865.070  ns/op
>>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  Disk  70PcLive  NC  avgt   15  1314417.207 ± 37879.012  ns/op
>>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  Disk  70PcLive  OA  avgt   15   805256.345 ± 15471.587  ns/op
>>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  Disk  30PcLive  NC  avgt   15  1583239.011 ± 50104.245  ns/op
>>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  Disk  30PcLive  OA  avgt   15  1439605.006 ± 64342.510  ns/op
>>>  [java] DeletionTimeDeSerBench.testE2ESerializeDT    RAM   70PcLive  NC  avgt   15   295711.217 ±  5432.507  ns/op
>>>  [java] DeletionTimeDeSerBench.testE2ESerializeDT    RAM   70PcLive  OA  avgt   15   305282.827 ±  1906.841 

Re: Improved DeletionTime serialization to reduce disk size

2023-06-23 Thread Josh McKenzie
> If we’re doing this, why don’t we delta encode a vint from some per-sstable 
> minimum value? I’d expect that to commonly compress to a single byte or so.
+1 to this approach.
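
Benedict's delta-vint idea can be sketched as follows. This is a toy illustration, not Cassandra's actual vint codec or EncodingStats API (the helper name and the LEB128-style layout are assumptions for the example): each timestamp is written as an unsigned vint of its offset from a per-sstable minimum, so typical deltas take a byte or two instead of a fixed 8-byte long.

```java
import java.io.*;

public class DeltaVint {
    // Toy unsigned vint: 7 data bits per byte, high bit set means "more
    // bytes follow". Returns the number of bytes written.
    static int writeUnsignedVint(long v, OutputStream out) throws IOException {
        int written = 0;
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
            written++;
        }
        out.write((int) v);
        return written + 1;
    }

    public static void main(String[] args) throws IOException {
        long sstableMinTimestamp = 1_689_500_000_000_000L; // per-sstable minimum
        long mfda = 1_689_500_000_900_000L;                // a deletion timestamp
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // delta = 900_000 µs needs 3 vint bytes vs. a fixed 8-byte long
        int n = writeUnsignedVint(mfda - sstableMinTimestamp, out);
        System.out.println(n); // 3
    }
}
```

The trade-off the thread discusses: this needs the per-sstable minimum (e.g. from EncodingStats) available at every call site that serializes a DeletionTime, which is not guaranteed everywhere it is written (index, metadata, etc.).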

> Distant future people will not be happy about this, I can already tell you 
> now.
Eh, they'll all be AI's anyway and will just rewrite the code in a background 
thread.

On Fri, Jun 23, 2023, at 9:02 AM, Berenguer Blasi wrote:
> It's a possibility. Though I haven't coded and benchmarked such an 
> approach and I don't think I would have the time before the freeze to 
> take advantage of the sstable format change opportunity.
> 
> Still it's something that can be explored later. If we can shave a few extra 
> % then that would always be great imo.
> 
> On 23/6/23 13:57, Benedict wrote:
> > If we’re doing this, why don’t we delta encode a vint from some per-sstable 
> > minimum value? I’d expect that to commonly compress to a single byte or so.
> >
> >> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko  wrote:
> >>
> >> Distant future people will not be happy about this, I can already tell 
> >> you now.
> >>
> >> Sounds like a reasonable improvement to me however.
> >>
> >>> On 23 Jun 2023, at 07:22, Berenguer Blasi  
> >>> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> DeletionTime.markedForDeleteAt is a long useconds since Unix Epoch. But I 
> >>> noticed that with 7 bytes we can already encode ~2284 years. We can 
> >>> either shed the 8th byte, for reduced IO and disk, or can encode some 
> >>> sentinel values (such as LIVE) as flags there. That would mean reading 
> >>> and writing 1 byte instead of 12 (8 mfda long + 4 ldts int). Yes we 
> >>> already avoid serializing DeletionTime (DT) in sstables at _row_ level 
> >>> entirely but not at _partition_ level and it is also serialized at index, 
> >>> metadata, etc.
> >>>
> >>> So here's a POC: 
> >>> https://github.com/bereng/cassandra/commits/ldtdeser-trunk and some jmh 
> >>> (1) to evaluate the impact of the new alg (2). It's tested here against a 
> >>> 70% and a 30% LIVE DTs  to see how we perform:
> >>>
> >>>  [java] Benchmark                                (liveDTPcParam)  (sstableParam)  Mode  Cnt  Score   Error  Units
> >>>  [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive              NC  avgt   15  0.331 ± 0.001  ns/op
> >>>  [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive              OA  avgt   15  0.335 ± 0.004  ns/op
> >>>  [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive              NC  avgt   15  0.334 ± 0.002  ns/op
> >>>  [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive              OA  avgt   15  0.340 ± 0.008  ns/op
> >>>  [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive              NC  avgt   15  0.337 ± 0.006  ns/op
> >>>  [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive              OA  avgt   15  0.340 ± 0.004  ns/op
> >>>  [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive              NC  avgt   15  0.339 ± 0.004  ns/op
> >>>  [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive              OA  avgt   15  0.343 ± 0.016  ns/op
> >>>
> >>> That was ByteBuffer backed to test the extra bit level operations impact. 
> >>> But what would be the impact of an end to end test against disk?
> >>>
> >>>  [java] Benchmark                                    (diskRAMParam)  (liveDTPcParam)  (sstableParam)  Mode  Cnt        Score        Error  Units
> >>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            RAM         70PcLive              NC  avgt   15   605236.515 ±  19929.058  ns/op
> >>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            RAM         70PcLive              OA  avgt   15   586477.039 ±   7384.632  ns/op
> >>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            RAM         30PcLive              NC  avgt   15   937580.311 ±  30669.647  ns/op
> >>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            RAM         30PcLive              OA  avgt   15   914097.770 ±   9865.070  ns/op
> >>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT           Disk         70PcLive              NC  avgt   15  1314417.207 ±  37879.012  ns/op
> >>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT           Disk         70PcLive              OA  avgt   15   805256.345 ±  15471.587  ns/op
> >>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT           Disk         30PcLive              NC  avgt   15  1583239.011 ±  50104.245  ns/op
> >>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT           Disk         30PcLive              OA  avgt   15  1439605.006 ±  64342.510  ns/op
> >>>  [java] DeletionTimeDeSerBench.testE2ESerializeDT              RAM         70PcLive              NC  avgt   15   295711.217 ±   5432.507  ns/op
> >>>  [java] DeletionTimeDeSerBench.testE2ESerializeDT              RAM         70PcLive              OA  avgt   15   305282.827 ±   1906.841  ns/op
> >>>  [java] 

Re: Improved DeletionTime serialization to reduce disk size

2023-06-23 Thread Berenguer Blasi
It's a possibility, though I haven't coded and benchmarked such an 
approach, and I don't think I would have the time before the freeze to 
take advantage of the sstable format change opportunity.


Still, it's something that can be explored later. If we can shave a few 
extra % then that would always be great imo.
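
The per-sstable delta + vint idea being discussed in this thread could be sketched roughly like this. Note this is purely illustrative: Cassandra's real vint code (VIntCoding) uses a different wire format, and the class and method names here are my own, not the POC's.

```java
import java.io.ByteArrayOutputStream;

// Sketch: store markedForDeleteAt as an unsigned vint of its delta from a
// per-sstable minimum, so values close to the minimum shrink to a byte or two.
public class DeltaVintSketch {
    // Unsigned LEB128-style vint: 7 payload bits per byte, high bit = "more".
    static byte[] writeUnsignedVint(long v) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
        return out.toByteArray();
    }

    // Delta-encode against the sstable's minimum mfda before vint encoding.
    static byte[] encodeMfda(long mfda, long sstableMinMfda) {
        return writeUnsignedVint(mfda - sstableMinMfda);
    }
}
```

A delta within 127 µs of the minimum fits in a single byte; each further factor of 128 costs one more byte.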


On 23/6/23 13:57, Benedict wrote:

If we’re doing this, why don’t we delta encode a vint from some per-sstable 
minimum value? I’d expect that to commonly compress to a single byte or so.


On 23 Jun 2023, at 12:55, Aleksey Yeshchenko  wrote:

Distant future people will not be happy about this, I can already tell you now.

Sounds like a reasonable improvement to me however.


On 23 Jun 2023, at 07:22, Berenguer Blasi  wrote:

Hi all,

DeletionTime.markedForDeleteAt is a long of microseconds since the Unix 
epoch, but I noticed that 7 bytes can already encode ~2284 years. We can 
either shed the 8th byte, for reduced IO and disk, or encode some sentinel 
values (such as LIVE) as flags there. That would mean reading and writing 
1 byte instead of 12 (8-byte mfda long + 4-byte ldts int). Yes, we already 
avoid serializing DeletionTime (DT) in sstables at _row_ level entirely, but 
not at _partition_ level, and it is also serialized in the index, metadata, etc.
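
For illustration, here is a minimal sketch of that flag-byte idea. It is NOT the actual patch: the class, method and flag names are invented for this example, and the real POC may lay the bits out differently.

```java
// Valid markedForDeleteAt values fit in 7 bytes (2^56 microseconds since
// the Unix epoch is ~2284 years, i.e. up to ~year 4254), so the would-be
// MSByte is free to carry flags such as LIVE.
public class DeletionTimeCodecSketch {
    static final int FLAG_LIVE = 0x01; // hypothetical flag value

    // LIVE serializes to 1 byte; otherwise 1 flags byte + 7 mfda + 4 ldts bytes.
    static byte[] serialize(boolean live, long markedForDeleteAt, int localDeletionTime) {
        if (live)
            return new byte[] { FLAG_LIVE };  // 1 byte instead of 12
        byte[] out = new byte[12];
        out[0] = 0;                           // flags byte: no sentinel set
        for (int i = 0; i < 7; i++)           // big-endian lower 56 bits of mfda
            out[1 + i] = (byte) (markedForDeleteAt >>> (8 * (6 - i)));
        for (int i = 0; i < 4; i++)           // 4-byte localDeletionTime
            out[8 + i] = (byte) (localDeletionTime >>> (8 * (3 - i)));
        return out;
    }

    // Returns null for LIVE, else { markedForDeleteAt, localDeletionTime }.
    static long[] deserialize(byte[] in) {
        if ((in[0] & FLAG_LIVE) != 0)
            return null;
        long mfda = 0;
        for (int i = 0; i < 7; i++)
            mfda = (mfda << 8) | (in[1 + i] & 0xFF);
        long ldts = 0;
        for (int i = 0; i < 4; i++)
            ldts = (ldts << 8) | (in[8 + i] & 0xFF);
        return new long[] { mfda, ldts };
    }
}
```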

So here's a POC: https://github.com/bereng/cassandra/commits/ldtdeser-trunk and 
some jmh benchmarks (1) to evaluate the impact of the new algorithm (2). It's 
tested against 70% and 30% ratios of LIVE DTs to see how we perform:

 [java] Benchmark                                (liveDTPcParam)  (sstableParam)  Mode  Cnt  Score   Error  Units
 [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive              NC  avgt   15  0.331 ± 0.001  ns/op
 [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive              OA  avgt   15  0.335 ± 0.004  ns/op
 [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive              NC  avgt   15  0.334 ± 0.002  ns/op
 [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive              OA  avgt   15  0.340 ± 0.008  ns/op
 [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive              NC  avgt   15  0.337 ± 0.006  ns/op
 [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive              OA  avgt   15  0.340 ± 0.004  ns/op
 [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive              NC  avgt   15  0.339 ± 0.004  ns/op
 [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive              OA  avgt   15  0.343 ± 0.016  ns/op

That was ByteBuffer-backed, to test the impact of the extra bit-level 
operations. But what would be the impact of an end-to-end test against disk?

 [java] Benchmark                                    (diskRAMParam)  (liveDTPcParam)  (sstableParam)  Mode  Cnt        Score        Error  Units
 [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            RAM         70PcLive              NC  avgt   15   605236.515 ±  19929.058  ns/op
 [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            RAM         70PcLive              OA  avgt   15   586477.039 ±   7384.632  ns/op
 [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            RAM         30PcLive              NC  avgt   15   937580.311 ±  30669.647  ns/op
 [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            RAM         30PcLive              OA  avgt   15   914097.770 ±   9865.070  ns/op
 [java] DeletionTimeDeSerBench.testE2EDeSerializeDT           Disk         70PcLive              NC  avgt   15  1314417.207 ±  37879.012  ns/op
 [java] DeletionTimeDeSerBench.testE2EDeSerializeDT           Disk         70PcLive              OA  avgt   15   805256.345 ±  15471.587  ns/op
 [java] DeletionTimeDeSerBench.testE2EDeSerializeDT           Disk         30PcLive              NC  avgt   15  1583239.011 ±  50104.245  ns/op
 [java] DeletionTimeDeSerBench.testE2EDeSerializeDT           Disk         30PcLive              OA  avgt   15  1439605.006 ±  64342.510  ns/op
 [java] DeletionTimeDeSerBench.testE2ESerializeDT              RAM         70PcLive              NC  avgt   15   295711.217 ±   5432.507  ns/op
 [java] DeletionTimeDeSerBench.testE2ESerializeDT              RAM         70PcLive              OA  avgt   15   305282.827 ±   1906.841  ns/op
 [java] DeletionTimeDeSerBench.testE2ESerializeDT              RAM         30PcLive              NC  avgt   15   446029.899 ±   4038.938  ns/op
 [java] DeletionTimeDeSerBench.testE2ESerializeDT              RAM         30PcLive              OA  avgt   15   479085.875 ±  10032.804  ns/op
 [java] DeletionTimeDeSerBench.testE2ESerializeDT             Disk         70PcLive              NC  avgt   15  1789434.838 ± 206455.771  ns/op
 [java] DeletionTimeDeSerBench.testE2ESerializeDT             Disk         70PcLive              OA  avgt   15   589752.861 ±  31676.265  ns/op
 [java] DeletionTimeDeSerBench.testE2ESerializeDT             Disk         30PcLive              NC  avgt   15  1754862.122 ± 164903.051  ns/op
 [java] DeletionTimeDeSerBench.testE2ESerializeDT             Disk         30PcLive              OA  avgt   15  1252162.253 ± 121626.818  ns/op

We can see big improvements when 

Re: Improved DeletionTime serialization to reduce disk size

2023-06-23 Thread Benedict
If we’re doing this, why don’t we delta encode a vint from some per-sstable 
minimum value? I’d expect that to commonly compress to a single byte or so.

> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko  wrote:
> 
> Distant future people will not be happy about this, I can already tell you 
> now.
> 
> Sounds like a reasonable improvement to me however.
> 
>> On 23 Jun 2023, at 07:22, Berenguer Blasi  wrote:
>> 
>> Hi all,
>> 
>> DeletionTime.markedForDeleteAt is a long useconds since Unix Epoch. But I 
>> noticed that with 7 bytes we can already encode ~2284 years. We can either 
>> shed the 8th byte, for reduced IO and disk, or can encode some sentinel 
>> values (such as LIVE) as flags there. That would mean reading and writing 1 
>> byte instead of 12 (8 mfda long + 4 ldts int). Yes we already avoid 
>> serializing DeletionTime (DT) in sstables at _row_ level entirely but not at 
>> _partition_ level and it is also serialized at index, metadata, etc.
>> 
>> So here's a POC: https://github.com/bereng/cassandra/commits/ldtdeser-trunk 
>> and some jmh (1) to evaluate the impact of the new alg (2). It's tested here 
>> against a 70% and a 30% LIVE DTs  to see how we perform:
>> 
>> [java] Benchmark                                (liveDTPcParam)  (sstableParam)  Mode  Cnt  Score   Error  Units
>> [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive              NC  avgt   15  0.331 ± 0.001  ns/op
>> [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive              OA  avgt   15  0.335 ± 0.004  ns/op
>> [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive              NC  avgt   15  0.334 ± 0.002  ns/op
>> [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive              OA  avgt   15  0.340 ± 0.008  ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive              NC  avgt   15  0.337 ± 0.006  ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive              OA  avgt   15  0.340 ± 0.004  ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive              NC  avgt   15  0.339 ± 0.004  ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive              OA  avgt   15  0.343 ± 0.016  ns/op
>> 
>> That was ByteBuffer backed to test the extra bit level operations impact. 
>> But what would be the impact of an end to end test against disk?
>> 
>> [java] Benchmark                                    (diskRAMParam)  (liveDTPcParam)  (sstableParam)  Mode  Cnt        Score        Error  Units
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            RAM         70PcLive              NC  avgt   15   605236.515 ±  19929.058  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            RAM         70PcLive              OA  avgt   15   586477.039 ±   7384.632  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            RAM         30PcLive              NC  avgt   15   937580.311 ±  30669.647  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            RAM         30PcLive              OA  avgt   15   914097.770 ±   9865.070  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT           Disk         70PcLive              NC  avgt   15  1314417.207 ±  37879.012  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT           Disk         70PcLive              OA  avgt   15   805256.345 ±  15471.587  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT           Disk         30PcLive              NC  avgt   15  1583239.011 ±  50104.245  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT           Disk         30PcLive              OA  avgt   15  1439605.006 ±  64342.510  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT              RAM         70PcLive              NC  avgt   15   295711.217 ±   5432.507  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT              RAM         70PcLive              OA  avgt   15   305282.827 ±   1906.841  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT              RAM         30PcLive              NC  avgt   15   446029.899 ±   4038.938  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT              RAM         30PcLive              OA  avgt   15   479085.875 ±  10032.804  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT             Disk         70PcLive              NC  avgt   15  1789434.838 ± 206455.771  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT             Disk         70PcLive              OA  avgt   15   589752.861 ±  31676.265  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT             Disk         30PcLive              NC  avgt   15  1754862.122 ± 164903.051  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT             Disk         30PcLive              OA  avgt   15  1252162.253 ± 121626.818  ns/op
>> 
>> We can see big improvements when backed with the disk and little impact from 
>> the new alg.
>> 
>> Given we're already introducing a new sstable format (OA) in 5.0 I would 

Re: Improved DeletionTime serialization to reduce disk size

2023-06-23 Thread Aleksey Yeshchenko
Distant future people will not be happy about this, I can already tell you now.

Sounds like a reasonable improvement to me however.

> On 23 Jun 2023, at 07:22, Berenguer Blasi  wrote:
> 
> Hi all,
> 
> DeletionTime.markedForDeleteAt is a long useconds since Unix Epoch. But I 
> noticed that with 7 bytes we can already encode ~2284 years. We can either 
> shed the 8th byte, for reduced IO and disk, or can encode some sentinel 
> values (such as LIVE) as flags there. That would mean reading and writing 1 
> byte instead of 12 (8 mfda long + 4 ldts int). Yes we already avoid 
> serializing DeletionTime (DT) in sstables at _row_ level entirely but not at 
> _partition_ level and it is also serialized at index, metadata, etc.
> 
> So here's a POC: https://github.com/bereng/cassandra/commits/ldtdeser-trunk 
> and some jmh (1) to evaluate the impact of the new alg (2). It's tested here 
> against a 70% and a 30% LIVE DTs  to see how we perform:
> 
>  [java] Benchmark                                (liveDTPcParam)  (sstableParam)  Mode  Cnt  Score   Error  Units
>  [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive              NC  avgt   15  0.331 ± 0.001  ns/op
>  [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive              OA  avgt   15  0.335 ± 0.004  ns/op
>  [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive              NC  avgt   15  0.334 ± 0.002  ns/op
>  [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive              OA  avgt   15  0.340 ± 0.008  ns/op
>  [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive              NC  avgt   15  0.337 ± 0.006  ns/op
>  [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive              OA  avgt   15  0.340 ± 0.004  ns/op
>  [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive              NC  avgt   15  0.339 ± 0.004  ns/op
>  [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive              OA  avgt   15  0.343 ± 0.016  ns/op
> 
> That was ByteBuffer backed to test the extra bit level operations impact. But 
> what would be the impact of an end to end test against disk?
> 
>  [java] Benchmark                                    (diskRAMParam)  (liveDTPcParam)  (sstableParam)  Mode  Cnt        Score        Error  Units
>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            RAM         70PcLive              NC  avgt   15   605236.515 ±  19929.058  ns/op
>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            RAM         70PcLive              OA  avgt   15   586477.039 ±   7384.632  ns/op
>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            RAM         30PcLive              NC  avgt   15   937580.311 ±  30669.647  ns/op
>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            RAM         30PcLive              OA  avgt   15   914097.770 ±   9865.070  ns/op
>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT           Disk         70PcLive              NC  avgt   15  1314417.207 ±  37879.012  ns/op
>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT           Disk         70PcLive              OA  avgt   15   805256.345 ±  15471.587  ns/op
>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT           Disk         30PcLive              NC  avgt   15  1583239.011 ±  50104.245  ns/op
>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT           Disk         30PcLive              OA  avgt   15  1439605.006 ±  64342.510  ns/op
>  [java] DeletionTimeDeSerBench.testE2ESerializeDT              RAM         70PcLive              NC  avgt   15   295711.217 ±   5432.507  ns/op
>  [java] DeletionTimeDeSerBench.testE2ESerializeDT              RAM         70PcLive              OA  avgt   15   305282.827 ±   1906.841  ns/op
>  [java] DeletionTimeDeSerBench.testE2ESerializeDT              RAM         30PcLive              NC  avgt   15   446029.899 ±   4038.938  ns/op
>  [java] DeletionTimeDeSerBench.testE2ESerializeDT              RAM         30PcLive              OA  avgt   15   479085.875 ±  10032.804  ns/op
>  [java] DeletionTimeDeSerBench.testE2ESerializeDT             Disk         70PcLive              NC  avgt   15  1789434.838 ± 206455.771  ns/op
>  [java] DeletionTimeDeSerBench.testE2ESerializeDT             Disk         70PcLive              OA  avgt   15   589752.861 ±  31676.265  ns/op
>  [java] DeletionTimeDeSerBench.testE2ESerializeDT             Disk         30PcLive              NC  avgt   15  1754862.122 ± 164903.051  ns/op
>  [java] DeletionTimeDeSerBench.testE2ESerializeDT             Disk         30PcLive              OA  avgt   15  1252162.253 ± 121626.818  ns/op
> 
> We can see big improvements when backed with the disk and little impact from 
> the new alg.
> 
> Given we're already introducing a new sstable format (OA) in 5.0 I would like 
> to try to get this in before the freeze. The point being that sstables with 
> lots of small partitions would benefit from a smaller DT at partition level. 
> My tests show a 3%-4% size reduction on disk.
> 
> Before proceeding though I'd like to bounce the idea