[jira] [Updated] (CASSANDRA-12526) For LCS, single SSTable up-level is handled inefficiently

2018-06-20, Jeff Jirsa (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Jirsa updated CASSANDRA-12526:
---
Fix Version/s: 4.0 (was: 4.x)

> For LCS, single SSTable up-level is handled inefficiently
> -
>
> Key: CASSANDRA-12526
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12526
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Compaction
>Reporter: Wei Deng
>Assignee: Marcus Eriksson
>Priority: Major
>  Labels: compaction, lcs, performance
> Fix For: 4.0
>
>
> I'm using the latest trunk (as of August 2016, which probably is going to be 
> 3.10) to run some experiments on LeveledCompactionStrategy and noticed this 
> inefficiency.
> The test data is generated using cassandra-stress default parameters 
> (keyspace1.standard1), so as you can imagine, it consists of a ton of newly 
> inserted partitions that will never merge in compactions, which is probably 
> the worst kind of workload for LCS (however, I'll detail later why this 
> scenario should not be ignored as a corner case; for now, let's just assume 
> we still want to handle this scenario efficiently).
> After the compaction test is done, I searched debug.log for lines that 
> match the "Compacted" summary so that I can see how long each individual 
> compaction took and how many bytes it processed. The search pattern is the 
> following:
> {noformat}
> grep 'Compacted.*standard1' debug.log
> {noformat}
> Interestingly, I noticed that a lot of the finished compactions are marked as 
> having *only one* SSTable involved. With the workload mentioned above, these 
> "single SSTable" compactions actually constitute the majority of all 
> compactions (as shown below), so their efficiency can affect the overall 
> compaction throughput quite a bit.
> {noformat}
> automaton@0ce59d338-1:~/cassandra-trunk/logs$ grep 'Compacted.*standard1' 
> debug.log-test1 | wc -l
> 243
> automaton@0ce59d338-1:~/cassandra-trunk/logs$ grep 'Compacted.*standard1' 
> debug.log-test1 | grep ") 1 sstable" | wc -l
> 218
> {noformat}
> By looking at the code, it appears that there's a way to directly edit the 
> level of a particular SSTable like the following:
> {code}
> sstable.descriptor.getMetadataSerializer().mutateLevel(sstable.descriptor, 
> targetLevel);
> sstable.reloadSSTableMetadata();
> {code}
> To be exact, I summed up the time spent on these single-SSTable compactions 
> (the total data size is 60GB) and found that if each compaction only needed to 
> spend 100ms on the metadata change alone (instead of the 10+ seconds it takes 
> now), it would already achieve a 22.75% saving on total compaction time.
> Compared to what we have now (reading the whole single SSTable from the old 
> level and writing out the same single SSTable at the new level), the only 
> difference I can think of with this approach is that the new SSTable 
> will keep the same file name (sequence number) as the old one, which could 
> break some assumptions in other parts of the code. However, since it avoids 
> the full read/write I/O, the overhead of cleaning up the old file and 
> creating the new one, and the extra churn in the heap and file buffers, the 
> benefits seem to outweigh the inconvenience. So I'd argue this JIRA is 
> low-hanging fruit (LHF) and should be made available in 3.0.x as well.
> As mentioned in the 2nd paragraph, I'm also going to address why this kind of 
> all-new-partition workload should not be ignored as a corner case. Basically, 
> for the main use case of LCS, where you need to frequently merge partitions to 
> optimize reads and eliminate tombstones and expired data sooner, LCS can be 
> perfectly happy and efficiently perform the partition merging and tombstone 
> elimination for a long time. However, as soon as the node becomes a bit 
> unhealthy for various reasons (it could be a bad disk causing it to miss a 
> whole bunch of mutations and need repair, the user ingesting way more data 
> than the node can handle, or, God forbid, some DBA running the offline 
> sstablelevelreset tool), you will have to handle this kind of 
> "all-new-partition with a lot of SSTables in L0" scenario, and once all the 
> L0 SSTables finally get up-leveled to L1, you will likely see a lot of such 
> single-SSTable compactions, which is the situation this JIRA is intended to 
> address.
> Actually, when I think more about this, making this kind of single-SSTable 
> up-level more efficient will not only help the all-new-partition scenario, 
> but also help in general whenever there is a big backlog of L0 SSTables due 
> to too many flushes or excessive repair streaming with vnodes. In those 
> situations, by default STCS-in-L0 will be triggered, and you will end up with 
> a bunch of much bigger L0 SSTables after STCS is done. When it's time to 
> up-level those much bigger L0 SSTables, most likely they will overlap among 
> themselves, and you will add them all into your compaction session (along 
> with all the overlapping L1 SSTables). Those much bigger L0 SSTables have 
> already gone through a few rounds of STCS compaction, so if partition merging 
> needed to happen because fragments of the same partition were dispersed 
> across the smaller L0 SSTables earlier, then after those STCS rounds the much 
> bigger L0 SSTables (generated by STCS) will not have much more ...
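
A minimal sketch of the proposed metadata-only up-level, for illustration: it is 
built solely on the two calls quoted in the description above (mutateLevel and 
reloadSSTableMetadata); the class name, method name, imports, and error handling 
are assumptions for a 3.x-era trunk, not the committed patch. For scale, the 218 
single-SSTable compactions counted above, at 10+ seconds each, represent roughly 
2,200 seconds of work that a ~100 ms metadata edit per SSTable would almost 
entirely eliminate.

{code}
import java.io.IOException;

import org.apache.cassandra.io.sstable.format.SSTableReader;

public final class SingleSSTableUpLevel
{
    /**
     * Hypothetical helper: moves one SSTable to targetLevel by rewriting only
     * its Stats metadata component; the data file itself is never read or
     * rewritten, so the cost is a tiny metadata write instead of a full
     * compaction.
     */
    static void upLevel(SSTableReader sstable, int targetLevel) throws IOException
    {
        // Rewrite the level recorded in the on-disk SSTable metadata.
        sstable.descriptor.getMetadataSerializer()
                          .mutateLevel(sstable.descriptor, targetLevel);
        // Reload the metadata so the in-memory reader reflects the new level.
        sstable.reloadSSTableMetadata();
    }
}
{code}

The caveat from the description still applies to this sketch: the up-leveled 
SSTable keeps its old file name (sequence number), so any code that assumes a 
compaction always produces new files would need auditing.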

[jira] [Updated] (CASSANDRA-12526) For LCS, single SSTable up-level is handled inefficiently

2018-06-19, Alex Petrov (JIRA)



Alex Petrov updated CASSANDRA-12526:

Resolution: Fixed
Status: Resolved  (was: Ready to Commit)


[jira] [Updated] (CASSANDRA-12526) For LCS, single SSTable up-level is handled inefficiently

2018-06-13, Marcus Eriksson (JIRA)



Marcus Eriksson updated CASSANDRA-12526:

Status: Ready to Commit  (was: Patch Available)


[jira] [Updated] (CASSANDRA-12526) For LCS, single SSTable up-level is handled inefficiently

2018-05-18, Marcus Eriksson (JIRA)


Marcus Eriksson updated CASSANDRA-12526:

Reviewer: Alex Petrov


[jira] [Updated] (CASSANDRA-12526) For LCS, single SSTable up-level is handled inefficiently

2018-04-18, Marcus Eriksson (JIRA)


Marcus Eriksson updated CASSANDRA-12526:

Status: Patch Available  (was: Open)

https://github.com/krummas/cassandra/commits/marcuse/12526 (includes 
CASSANDRA-14388 to make the test work)
https://circleci.com/gh/krummas/cassandra/tree/marcuse%2F12526

In my silly benchmarks (writing 100M keys to an LCS table with 10MB sstables and 
measuring the time until LCS is fully leveled), this reduces the time spent 
compacting by about 20%.


[jira] [Updated] (CASSANDRA-12526) For LCS, single SSTable up-level is handled inefficiently

2017-05-25, Jeff Jirsa (JIRA)


Jeff Jirsa updated CASSANDRA-12526:
---
Fix Version/s: 4.x


[jira] [Updated] (CASSANDRA-12526) For LCS, single SSTable up-level is handled inefficiently

2016-09-04, Wei Deng (JIRA)


Wei Deng updated CASSANDRA-12526:
-
Issue Type: Improvement  (was: Bug)


[jira] [Updated] (CASSANDRA-12526) For LCS, single SSTable up-level is handled inefficiently

2016-09-01, Wei Deng (JIRA)


Wei Deng updated CASSANDRA-12526:
-
Description: 

[jira] [Updated] (CASSANDRA-12526) For LCS, single SSTable up-level is handled inefficiently

2016-08-23, Wei Deng (JIRA)


Wei Deng updated CASSANDRA-12526:
-
Description: 

[jira] [Updated] (CASSANDRA-12526) For LCS, single SSTable up-level is handled inefficiently

2016-08-23, Wei Deng (JIRA)


Wei Deng updated CASSANDRA-12526:
-
Description: 

[jira] [Updated] (CASSANDRA-12526) For LCS, single SSTable up-level is handled inefficiently

2016-08-23, Wei Deng (JIRA)


Wei Deng updated CASSANDRA-12526:
-
Labels: compaction lcs performance  (was: )




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)