Re: Improved DeletionTime serialization to reduce disk size
On Sun, Jul 16, 2023 at 11:47 PM Berenguer Blasi wrote:
> one q that came up during the review: What should we do if we find a
> markForDeleteAt (mfda) using the MSByte? That is, a mfda beyond year 4254:
>
> A. That is a mistake/bug. It makes no sense when localDeletionTime can't
> already go any further than year 2106. We should reject/fail, maybe log and
> add an upgrade note.

I think creation of doomstones is always a bug, but perhaps there is a use case I cannot think of. One option that was discussed is setting a default for maximum_timestamp_fail_threshold, which I think could make sense, since it would provide protection but allow a way out.

> B. That was supported, regardless of how weird it may be. Cap it to the
> current max year 4254, maybe log and add an upgrade note.

I am not a fan of doing something other than what we were asked to do; I think we should either reject it, or do it.
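[Editor's note: for readers following along, the "year 4254" figure comes from how far 56 bits (7 bytes) of microseconds reach past 1970. A quick illustrative sketch, not code from the patch:

```java
public class MfdaBound {
    // markedForDeleteAt (mfda) is microseconds since the Unix Epoch, stored in a long.
    // Dropping the most significant byte leaves 56 bits:
    static final long MAX_7_BYTE_MICROS = (1L << 56) - 1;

    // A mfda "using the MSByte" is simply one that needs the 8th byte:
    static boolean usesMSByte(long mfda) {
        return (mfda >>> 56) != 0;
    }

    public static void main(String[] args) {
        // ~2283 full years of microseconds fit in 7 bytes: 1970 + 2283 ~= year 4254
        long years = (long) (MAX_7_BYTE_MICROS / 1_000_000.0 / 31_557_600.0);
        System.out.println(years);                             // 2283
        System.out.println(usesMSByte(MAX_7_BYTE_MICROS));     // false
        System.out.println(usesMSByte(MAX_7_BYTE_MICROS + 1)); // true
    }
}
```

So option A's check is a single shift-and-compare on the raw long.]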
Re: Improved DeletionTime serialization to reduce disk size
Hi All, one q that came up during the review: What should we do if we find a markForDeleteAt (mfda) using the MSByte? That is, a mfda beyond year 4254:

A. That is a mistake/bug. It makes no sense when localDeletionTime can't already go any further than year 2106. We should reject/fail, maybe log and add an upgrade note.

B. That was supported, regardless of how weird it may be. Cap it to the current max year 4254, maybe log and add an upgrade note.

Happy to hear your thoughts.
Re: Improved DeletionTime serialization to reduce disk size
Hi All, https://issues.apache.org/jira/browse/CASSANDRA-18648 is up for review and the PR is quite small. Regards
Re: Improved DeletionTime serialization to reduce disk size
Thanks for the comments Benedict. Given DeletionTime.localDeletionTime is what caps everything to year 2106 (uint encoded now), I am ok with a DeletionTime.markForDeleteAt that can go up to year 4284, personal opinion ofc. And yes, I hope once I read, doc and understand the sstable format better I can look into your suggestion and anything else I come across.
Re: Improved DeletionTime serialization to reduce disk size
I checked and I’m pretty sure we do, but it doesn’t apply any liveness optimisation. I had misunderstood the optimisation you proposed. Ideally we would encode any non-live timestamp with the delta offset, but since that’s a distinct optimisation perhaps that can be left to another patch.

Are we happy, though, that the two different deletion time serialisers can store different ranges of timestamp? Both are large ranges, but I am not 100% comfortable with them diverging.
Re: Improved DeletionTime serialization to reduce disk size
I can look into it. I don't have a deep knowledge of the sstable format, hence why I wanted to document it someday. But DeletionTime is being serialized in other places as well iirc, and I doubt (finger in the air) we'll have that Epoch handy.
Re: Improved DeletionTime serialization to reduce disk size
The idea is 11 bytes less per LIVE partition, so small partitions will benefit the most.
Re: Improved DeletionTime serialization to reduce disk size
On Thu, Jun 29, 2023 at 11:42 AM Jeff Jirsa wrote: > 3-4% reduction on disk ... for what exactly? > > It seems exceptionally uncommon to have 3% of your data SIZE be tombstones. If the data is TTL'd I think it's not entirely uncommon. Kind Regards, Brandon
Re: Improved DeletionTime serialization to reduce disk size
On Thu, Jun 22, 2023 at 11:23 PM Berenguer Blasi wrote: > Hi all, > > Given we're already introducing a new sstable format (OA) in 5.0 I would > like to try to get this in before the freeze. The point being that > sstables with lots of small partitions would benefit from a smaller DT > at partition level. My tests show a 3%-4% size reduction on disk. > 3-4% reduction on disk ... for what exactly? It seems exceptionally uncommon to have 3% of your data SIZE be tombstones. Is this enhancement driven by a pathological data model that's like "mostly tiny records OR tombstones" ?
Re: Improved DeletionTime serialization to reduce disk size
So I’m just taking a quick peek at SerializationHeader and we already have a method for reading and writing a deletion time with offsets from EncodingStats. So perhaps we simply have a bug where we are using DeletionTime.Serializer instead of SerializationHeader.writeLocalDeletionTime? It looks to me like this is already available at most (perhaps all) of the relevant call sites.
Re: Improved DeletionTime serialization to reduce disk size
> I would prefer we not plan on two distinct changes to this

I agree with this sentiment, **and**

> +1, if you have time for this approach and no other in this window.

People are going to use 5.0 for awhile. Better to have an improvement in their hands for that duration than no improvement at all IMO. Justifies the cost of the double implementation and transitions to me.
Re: Improved DeletionTime serialization to reduce disk size
> Just for completeness the change is a handful of loc. The rest is added tests and we'd lose the sstable format change opportunity window.

+1, if you have time for this approach and no other in this window.

(If you have time for the other, or someone else does, then the technically superior approach should win)
Re: Improved DeletionTime serialization to reduce disk size
Just for completeness: the change is a handful of loc. The rest is added tests, and we'd lose the sstable format change opportunity window. Thx again for the replies.
> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko wrote:
>
> Distant future people will not be happy about this, I can already tell you now.
>
> Sounds like a reasonable improvement to me however.
>
>> On 23 Jun 2023, at 07:22, Berenguer Blasi wrote:
>>
>> Hi all,
>>
>> DeletionTime.markedForDeleteAt is a long of microseconds since the Unix Epoch. But I noticed that with 7 bytes we can already encode ~2284 years. We can either shed the 8th byte, for reduced IO and disk, or encode some sentinel values (such as LIVE) as flags there. That would mean reading and writing 1 byte instead of 12 (8-byte mfda long + 4-byte ldts int). Yes, we already avoid serializing DeletionTime (DT) in sstables at _row_ level entirely, but not at _partition_ level, and it is also serialized at index, metadata, etc.
>>
>> So here's a POC: https://github.com/bereng/cassandra/commits/ldtdeser-trunk and some jmh (1) to evaluate the impact of the new alg (2). It's tested here against 70% and 30% LIVE DTs to see how we perform:
>>
>> [java] Benchmark                                (liveDTPcParam) (sstableParam)  Mode  Cnt  Score   Error  Units
>> [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive             NC  avgt   15  0.331 ± 0.001  ns/op
>> [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive             OA  avgt   15  0.335 ± 0.004  ns/op
>> [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive             NC  avgt   15  0.334 ± 0.002  ns/op
>> [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive             OA  avgt   15  0.340 ± 0.008  ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive             NC  avgt   15  0.337 ± 0.006  ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive             OA  avgt   15  0.340 ± 0.004  ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive             NC  avgt   15  0.339 ± 0.004  ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive             OA  avgt   15  0.343 ± 0.016  ns/op
>>
>> That was ByteBuffer backed to test the extra bit-level operations impact. But what would be the impact of an end to end test against disk?
>>
>> [java] Benchmark                                    (diskRAMParam) (liveDTPcParam) (sstableParam)  Mode  Cnt        Score       Error  Units
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            RAM        70PcLive             NC  avgt   15   605236.515 ± 19929.058  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            RAM        70PcLive             OA  avgt   15   586477.039 ±  7384.632  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            RAM        30PcLive             NC  avgt   15   937580.311 ± 30669.647  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            RAM        30PcLive             OA  avgt   15   914097.770 ±  9865.070  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT           Disk        70PcLive             NC  avgt   15  1314417.207 ± 37879.012  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT           Disk        70PcLive             OA  avgt   15   805256.345 ± 15471.587  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT           Disk        30PcLive             NC  avgt   15  1583239.011 ± 50104.245  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT           Disk        30PcLive             OA  avgt   15  1439605.006 ± 64342.510  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT              RAM        70PcLive             NC  avgt   15   295711.217 ±  5432.507  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT              RAM        70PcLive             OA  avgt   15   305282.827 ±  1906.841  ns/op
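[Editor's note: the 1-byte-vs-12-bytes idea quoted above can be sketched roughly as follows. Names and exact layout are invented for illustration; the real POC is in the linked branch. The LIVE sentinel values match Cassandra's DeletionTime.LIVE (mfda = Long.MIN_VALUE, ldt = Integer.MAX_VALUE), if memory serves:

```java
import java.io.*;

// Rough sketch of the proposal: spend the freed MSByte on a flag so the common
// LIVE DeletionTime serializes as 1 byte instead of 8 (mfda) + 4 (ldt) = 12.
public class DeletionTimeSketch {
    static final long LIVE_MFDA = Long.MIN_VALUE;    // sentinel for "not deleted"
    static final int  LIVE_LDT  = Integer.MAX_VALUE;
    static final int  LIVE_FLAG = 0x80;              // top bit of the first byte

    static void serialize(long mfda, int ldt, DataOutput out) throws IOException {
        if (mfda == LIVE_MFDA && ldt == LIVE_LDT) {
            out.writeByte(LIVE_FLAG);                // 1 byte for the LIVE case
        } else {
            out.writeLong(mfda);                     // MSByte is 0 for any mfda below ~year 4254
            out.writeInt(ldt);                       // 12 bytes total
        }
    }

    static long[] deserialize(DataInput in) throws IOException {
        int first = in.readUnsignedByte();
        if ((first & LIVE_FLAG) != 0)
            return new long[]{ LIVE_MFDA, LIVE_LDT };
        // Otherwise 'first' was the (zero) MSByte of mfda; read the remaining 11 bytes.
        // A non-LIVE mfda that actually used the MSByte would be ambiguous here, which
        // is exactly the reject-or-cap question raised later in the thread.
        long mfda = (long) first << 56;
        for (int shift = 48; shift >= 0; shift -= 8)
            mfda |= (long) in.readUnsignedByte() << shift;
        return new long[]{ mfda, in.readInt() };
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream live = new ByteArrayOutputStream();
        serialize(LIVE_MFDA, LIVE_LDT, new DataOutputStream(live));
        System.out.println(live.size());             // 1

        ByteArrayOutputStream del = new ByteArrayOutputStream();
        serialize(1_688_428_800_000_000L, 1_688_428_800, new DataOutputStream(del));
        System.out.println(del.size());              // 12

        long[] rt = deserialize(new DataInputStream(new ByteArrayInputStream(del.toByteArray())));
        System.out.println(rt[0] == 1_688_428_800_000_000L && rt[1] == 1_688_428_800L); // true
    }
}
```

The 11-byte saving per LIVE partition-level DeletionTime falls out of the first branch.]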
Re: Improved DeletionTime serialization to reduce disk size
I would prefer we not plan on two distinct changes to this, particularly when neither change is particularly more complex than the other. There is a modest cost to maintenance from changing this multiple times. But if others feel strongly otherwise I won’t stand in the way.
Re: Improved DeletionTime serialization to reduce disk size
Thanks for the replies. I intend to javadoc the sstable format in detail someday, and more improvements might come up then, along with the vint encoding mentioned here. But unless somebody volunteers to do that in 5.0, is anybody against me trying to get the original proposal (1-byte flags for sentinel values) in? Regards
Re: Improved DeletionTime serialization to reduce disk size
> If we’re doing this, why don’t we delta encode a vint from some per-sstable > minimum value? I’d expect that to commonly compress to a single byte or so. +1 to this approach. > Distant future people will not be happy about this, I can already tell you > now. Eh, they'll all be AI's anyway and will just rewrite the code in a background thread. On Fri, Jun 23, 2023, at 9:02 AM, Berenguer Blasi wrote: > It's a possibility. Though I haven't coded and benchmarked such an > approach and I don't think I would have the time before the freeze to > take advantage of the sstable format change opportunity. > > Still it's sthg that can be explored later. If we can shave a few extra > % then that would always be great imo. > > On 23/6/23 13:57, Benedict wrote: > > If we’re doing this, why don’t we delta encode a vint from some per-sstable > > minimum value? I’d expect that to commonly compress to a single byte or so. > > > >> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko wrote: > >> > >> Distant future people will not be happy about this, I can already tell > >> you now. > >> > >> Sounds like a reasonable improvement to me however. > >> > >>> On 23 Jun 2023, at 07:22, Berenguer Blasi > >>> wrote: > >>> > >>> Hi all, > >>> > >>> DeletionTime.markedForDeleteAt is a long useconds since Unix Epoch. But I > >>> noticed that with 7 bytes we can already encode ~2284 years. We can > >>> either shed the 8th byte, for reduced IO and disk, or can encode some > >>> sentinel values (such as LIVE) as flags there. That would mean reading > >>> and writing 1 byte instead of 12 (8 mfda long + 4 ldts int). Yes we > >>> already avoid serializing DeletionTime (DT) in sstables at _row_ level > >>> entirely but not at _partition_ level and it is also serialized at index, > >>> metadata, etc. > >>> > >>> So here's a POC: > >>> https://github.com/bereng/cassandra/commits/ldtdeser-trunk and some jmh > >>> (1) to evaluate the impact of the new alg (2). 
> >>> It's tested here against a 70% and a 30% LIVE DT ratio to see how we perform:
> >>>
> >>> [java] Benchmark                                (liveDTPcParam) (sstableParam) Mode Cnt Score   Error  Units
> >>> [java] DeletionTimeDeSerBench.testRawAlgReads   70PcLive        NC             avgt  15 0.331 ± 0.001  ns/op
> >>> [java] DeletionTimeDeSerBench.testRawAlgReads   70PcLive        OA             avgt  15 0.335 ± 0.004  ns/op
> >>> [java] DeletionTimeDeSerBench.testRawAlgReads   30PcLive        NC             avgt  15 0.334 ± 0.002  ns/op
> >>> [java] DeletionTimeDeSerBench.testRawAlgReads   30PcLive        OA             avgt  15 0.340 ± 0.008  ns/op
> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites  70PcLive        NC             avgt  15 0.337 ± 0.006  ns/op
> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites  70PcLive        OA             avgt  15 0.340 ± 0.004  ns/op
> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites  30PcLive        NC             avgt  15 0.339 ± 0.004  ns/op
> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites  30PcLive        OA             avgt  15 0.343 ± 0.016  ns/op
> >>>
> >>> That was ByteBuffer backed, to test the impact of the extra bit-level operations. But what would be the impact of an end-to-end test against disk?
> >>>
> >>> [java] Benchmark                                    (diskRAMParam) (liveDTPcParam) (sstableParam) Mode Cnt       Score        Error  Units
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM            70PcLive        NC             avgt  15  605236.515 ±  19929.058  ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM            70PcLive        OA             avgt  15  586477.039 ±   7384.632  ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM            30PcLive        NC             avgt  15  937580.311 ±  30669.647  ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM            30PcLive        OA             avgt  15  914097.770 ±   9865.070  ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  Disk           70PcLive        NC             avgt  15 1314417.207 ±  37879.012  ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  Disk           70PcLive        OA             avgt  15  805256.345 ±  15471.587  ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  Disk           30PcLive        NC             avgt  15 1583239.011 ±  50104.245  ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  Disk           30PcLive        OA             avgt  15 1439605.006 ±  64342.510  ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT    RAM            70PcLive        NC             avgt  15  295711.217 ±   5432.507  ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT    RAM            70PcLive        OA             avgt  15  305282.827 ±   1906.841  ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT    RAM            30PcLive        NC             avgt  15  446029.899 ±   4038.938  ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT    RAM            30PcLive        OA             avgt  15  479085.875 ±  10032.804  ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT    Disk           70PcLive        NC             avgt  15 1789434.838 ± 206455.771  ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT    Disk           70PcLive        OA             avgt  15  589752.861 ±  31676.265  ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT    Disk           30PcLive        NC             avgt  15 1754862.122 ± 164903.051  ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT    Disk           30PcLive        OA             avgt  15 1252162.253 ± 121626.818  ns/op
> >>>
> >>> We can see big improvements when backed with the disk and little impact from the new alg.
> >>>
> >>> Given we're already introducing a new sstable format (OA) in 5.0 I would like to try to get this in before the freeze. The point being that sstables with lots of small partitions would benefit from a smaller DT at partition level. My tests show a 3%-4% size reduction on disk.
> >>>
> >>> Before proceeding though I'd like to bounce the idea
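As an editor's illustration of the encoding idea in the proposal above (this is a sketch, not the actual CASSANDRA-18648 patch; class and method names are invented): a header byte whose top bit marks LIVE lets a live DeletionTime serialize in 1 byte, while a real deletion packs markedForDeleteAt (mfda) into the remaining 7 bits plus 6 more bytes, followed by the 4-byte localDeletionTime (ldts). Note that spending one bit on the flag leaves 55 bits for the timestamp (roughly year 3100), less than the full 7-byte range (~year 4254) the thread discusses.

```java
import java.nio.ByteBuffer;

// Illustrative sketch only, NOT Cassandra's actual serializer.
final class DeletionTimeCodec
{
    static final int LIVE_FLAG = 0x80;           // top bit of the header byte
    static final long MAX_MFDA = (1L << 55) - 1; // 55 usable bits of microseconds

    static void serialize(long mfda, int ldts, boolean live, ByteBuffer out)
    {
        if (live)
        {
            out.put((byte) LIVE_FLAG);           // 1 byte total instead of 12
            return;
        }
        if (mfda > MAX_MFDA)
            throw new IllegalArgumentException("mfda does not fit in 55 bits");
        out.put((byte) ((mfda >>> 48) & 0x7F));  // top 7 bits, LIVE flag clear
        for (int shift = 40; shift >= 0; shift -= 8)
            out.put((byte) (mfda >>> shift));    // remaining 6 bytes of mfda
        out.putInt(ldts);                        // 7 + 4 = 11 bytes total
    }

    // Returns null for LIVE, otherwise { mfda, ldts }.
    static long[] deserialize(ByteBuffer in)
    {
        int header = in.get() & 0xFF;
        if ((header & LIVE_FLAG) != 0)
            return null;
        long mfda = header & 0x7F;
        for (int i = 0; i < 6; i++)
            mfda = (mfda << 8) | (in.get() & 0xFF);
        return new long[]{ mfda, in.getInt() };
    }
}
```

So the common LIVE case shrinks from 12 bytes to 1, and a real partition-level tombstone from 12 to 11.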
Re: Improved DeletionTime serialization to reduce disk size
Distant future people will not be happy about this, I can already tell you now.

Sounds like a reasonable improvement to me however.

> On 23 Jun 2023, at 07:22, Berenguer Blasi wrote:
> [...]
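Benedict's alternative, delta-encoding a vint from some per-sstable minimum value, could be sketched as below. This is an editor's illustration using a plain LEB128-style unsigned varint, not Cassandra's actual vint implementation, and the names are hypothetical; the per-sstable minimum would presumably come from the sstable's stats metadata, so timestamps near it commonly cost one or two bytes.

```java
import java.nio.ByteBuffer;

// Illustrative sketch of delta-encoding mfda against a per-sstable minimum.
final class DeltaVint
{
    static void writeDelta(long mfda, long sstableMin, ByteBuffer out)
    {
        long delta = mfda - sstableMin;               // non-negative by choice of minimum
        while ((delta & ~0x7FL) != 0)
        {
            out.put((byte) ((delta & 0x7F) | 0x80));  // 7 payload bits + continuation bit
            delta >>>= 7;
        }
        out.put((byte) delta);                        // final byte, continuation bit clear
    }

    static long readDelta(long sstableMin, ByteBuffer in)
    {
        long delta = 0;
        int shift = 0;
        int b;
        do
        {
            b = in.get() & 0xFF;
            delta |= (long) (b & 0x7F) << shift;
            shift += 7;
        }
        while ((b & 0x80) != 0);                      // stop when continuation bit clear
        return sstableMin + delta;
    }
}
```

A deletion timestamp within ~2 minutes of the sstable minimum fits in one byte; within ~4.5 hours, two bytes, which is where the "single byte or so" expectation comes from.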