[jira] [Commented] (CASSANDRA-13530) GroupCommitLogService

2017-11-13 Thread Yuji Ito (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250863#comment-16250863
 ] 

Yuji Ito commented on CASSANDRA-13530:
--

Thank you, [~jasobrown].
Sorry for the late reply; I'm glad I've been able to contribute to this.

> GroupCommitLogService
> -
>
> Key: CASSANDRA-13530
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13530
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Yuji Ito
>Assignee: Yuji Ito
> Fix For: 2.2.x, 3.0.x, 3.11.x
>
> Attachments: GuavaRequestThread.java, MicroRequestThread.java, 
> groupAndBatch.png, groupCommit22.patch, groupCommit30.patch, 
> groupCommit3x.patch, groupCommitLog_noSerial_result.xlsx, 
> groupCommitLog_result.xlsx
>
>
> I propose a new CommitLogService, GroupCommitLogService, to improve 
> throughput when many requests are received.
> It improved throughput by up to 94%.
> I'd like to discuss this CommitLogService.
> Currently, we can select one of two CommitLog services: Periodic and Batch.
> With Periodic, we might lose commit log entries that haven't been written to 
> the disk.
> With Batch, we write the commit log to the disk every time, but each write 
> is very small (< 4KB). Under high concurrency, these writes are gathered and 
> persisted to the disk at once. Under insufficient concurrency, however, many 
> small writes are issued and performance decreases due to the latency of the 
> disk. Even on an SSD, processing many IO commands decreases performance.
> GroupCommitLogService writes several commit log entries to the disk at once.
> The patch adds GroupCommitLogService (enabled by setting `commitlog_sync` 
> and `commitlog_sync_group_window_in_ms` in cassandra.yaml).
> The only difference from Batch is waiting on the semaphore.
> By waiting on the semaphore, several commit log writes are executed at the 
> same time.
> With GroupCommitLogService, latency becomes worse when there is no 
> concurrency.
> I measured performance with my microbenchmark (MicroRequestThread.java) 
> while increasing the number of threads. The cluster has 3 nodes (replication 
> factor: 3); each node is an AWS EC2 m4.large instance with a 200 IOPS io1 
> volume.
> The results are below. GroupCommitLogService with a 10ms window improved 
> updates with Paxos by 94% and selects with Paxos by 76%.
> h6. SELECT / sec
> ||\# of threads||Batch 2ms||Group 10ms||
> |1|192|103|
> |2|163|212|
> |4|264|416|
> |8|454|800|
> |16|744|1311|
> |32|1151|1481|
> |64|1767|1844|
> |128|2949|3011|
> |256|4723|5000|
> h6. UPDATE / sec
> ||\# of threads||Batch 2ms||Group 10ms||
> |1|45|26|
> |2|39|51|
> |4|58|102|
> |8|102|198|
> |16|167|213|
> |32|289|295|
> |64|544|548|
> |128|1046|1058|
> |256|2020|2061|
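
The grouping mechanism the description alludes to can be pictured as follows. 
This is a minimal sketch, not the patch itself: the real service waits on a 
semaphore, while the sketch uses a lock/condition pair, and all names here 
(GroupCommitSketch, awaitSync, syncAndSignal) are illustrative.

{code}
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

class GroupCommitSketch
{
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition synced = lock.newCondition();
    private long lastSyncedAt = 0;

    // Called on every write path: block until a sync completes that covers
    // this write (one that finished after the write was appended).
    void awaitSync(long appendedAt) throws InterruptedException
    {
        lock.lock();
        try
        {
            while (lastSyncedAt < appendedAt)
                synced.await();
        }
        finally
        {
            lock.unlock();
        }
    }

    // Called by the single sync thread once per window (e.g. every 10ms);
    // one fsync satisfies every writer that queued up during the window.
    void syncAndSignal()
    {
        // ... fsync the active commit log segment here ...
        lock.lock();
        try
        {
            lastSyncedAt = System.nanoTime();
            synced.signalAll();
        }
        finally
        {
            lock.unlock();
        }
    }
}
{code}

This also shows why single-threaded latency gets worse: a lone writer still 
waits out the window before its fsync happens, while under load the same wait 
is amortized across every queued write.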






[jira] [Commented] (CASSANDRA-13992) Don't send new_metadata_id for conditional updates

2017-11-13 Thread Kurt Greaves (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250786#comment-16250786
 ] 

Kurt Greaves commented on CASSANDRA-13992:
--

My understanding is that, at the moment, {{METADATA_CHANGED}} will _always_ be 
set for a conditional update, regardless of whether it's necessary or not, 
where "necessary" means the schema has actually changed and the prepared 
statements need to be updated client side to reflect those schema changes. 
[~omichallat], is this true? What exactly is "metadata" referring to on the 
driver side, and why is the answer "always no" for conditional updates? If 
there is a change to one of the columns in the update, is that going to cause 
problems if we don't tell the driver that it has changed?

I'm with Olivier that that's a hacky addition to the driver, but if it's not 
even necessary, as per the above, then simply passing an empty digest will be 
sufficient.

I've updated my 
[branch|https://github.com/apache/cassandra/compare/trunk...kgreav:13992] to 
reflect this. Note I've changed to using {{MD5Digest#compute}} to calculate an 
"empty" digest. Although the underlying digest instance is thread-local, the 
result will always be the same digest, and this will also solve the initial 
preparation problem, as that path also uses the {{EMPTY}} resultset + metadata.
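
The reason an "empty" digest is stable across nodes is simply that MD5 of zero 
bytes is a fixed constant. A minimal sketch using the JDK directly (Cassandra's 
{{MD5Digest#compute}} wraps the same primitive):

{code}
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class EmptyDigest
{
    public static void main(String[] args) throws NoSuchAlgorithmException
    {
        byte[] digest = MessageDigest.getInstance("MD5").digest(new byte[0]);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest)
            hex.append(String.format("%02x", b));
        // Always prints d41d8cd98f00b204e9800998ecf8427e, on every node.
        System.out.println(hex);
    }
}
{code}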





> Don't send new_metadata_id for conditional updates
> --
>
> Key: CASSANDRA-13992
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13992
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Olivier Michallat
>Assignee: Kurt Greaves
>Priority: Minor
>
> This is a follow-up to CASSANDRA-10786.
> Given the table
> {code}
> CREATE TABLE foo (k int PRIMARY KEY)
> {code}
> And the prepared statement
> {code}
> INSERT INTO foo (k) VALUES (?) IF NOT EXISTS
> {code}
> The result set metadata changes depending on the outcome of the update:
> * if the row didn't exist, there is only a single column \[applied] = true
> * if it did, the result contains \[applied] = false, plus the current value 
> of column k.
> The way this was handled so far is that the PREPARED response contains no 
> result set metadata, and therefore all EXECUTE messages have SKIP_METADATA = 
> false, and the responses always include the full (and correct) metadata.
> CASSANDRA-10786 still sends the PREPARED response with no metadata, *but the 
> response to EXECUTE now contains a {{new_metadata_id}}*. The driver thinks it 
> is because of a schema change, and updates its local copy of the prepared 
> statement's result metadata.
> The next EXECUTE is sent with SKIP_METADATA = true, but the server appears to 
> ignore that, and still sends the metadata in the response. So each response 
> includes the correct metadata, the driver uses it, and there is no visible 
> issue for client code.
> The only drawback is that the driver updates its local copy of the metadata 
> unnecessarily, every time. We can work around that by only updating if we had 
> metadata before, at the cost of an extra volatile read. But I think the best 
> thing to do would be to never send a {{new_metadata_id}} for a conditional 
> update.
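
From client code, the two result shapes look roughly like this with the 
DataStax Java driver (a 3.x-style sketch; session setup and the prepared 
statement are assumed):

{code}
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

class AppliedCheckSketch
{
    static void insertIfNotExists(Session session, PreparedStatement stmt, int k)
    {
        ResultSet rs = session.execute(stmt.bind(k));
        Row row = rs.one();
        if (row.getBool("[applied]"))
        {
            // Row didn't exist: the result carries only the [applied] column.
        }
        else
        {
            // Row existed: the result also carries the current value of k.
            int existing = row.getInt("k");
        }
    }
}
{code}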






[jira] [Comment Edited] (CASSANDRA-13530) GroupCommitLogService

2017-11-13 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250740#comment-16250740
 ] 

Jason Brown edited comment on CASSANDRA-13530 at 11/14/17 3:03 AM:
---

[~yuji] We would like this functionality rather soon, so I'd like to take it 
over. You've done a nice job up to now, and let's drive it home.

[~aweisberg] I've taken [~yuji]'s patch and added the comments and tests. wrt 
utests, the functionality I wanted to test is largely all in {{CommitLogTest}}, 
but the choice of commitlog mode is driven by {{test/conf/cassandra.yaml}}. Add 
to this [~JoshuaMcKenzie]'s attempts to make the commitlog more amenable to 
unit testing (read: it is still not very friendly for unit testing; see [this 
comment|https://issues.apache.org/jira/browse/CASSANDRA-13123?focusedCommentId=16189523&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16189523]), 
and I've subclassed {{CommitLogTest}} for each of the three modes (periodic, 
batch, and now group). This way we can test group mode, and get periodic as a 
bonus. This is the cowardly way of testing the different modes and their 
replayability, rather than reworking the commit log as a whole (as 
[~JoshuaMcKenzie] alludes to), but that seems like a larger issue to tackle (on 
a different ticket).


||13530||
|[branch|https://github.com/jasobrown/cassandra/tree/13530]|
|[utests|https://circleci.com/gh/jasobrown/cassandra/tree/13530]|

Note: I know there's a problem with {{PeriodicCommitLogTest}}, and I'll look 
into it in the morning. It should not hold up review of the small amount I've 
added, if you start reviewing before I fix the test.
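
For the curious, the subclass-per-mode approach looks something like the 
following sketch. It is not the code on the branch, and it assumes a config 
hook along the lines of {{DatabaseDescriptor#setCommitLogSync}}; the exact 
setter may differ.

{code}
import org.apache.cassandra.config.Config;
import org.apache.cassandra.config.DatabaseDescriptor;
import org.junit.BeforeClass;

public class GroupCommitLogTest extends CommitLogTest
{
    @BeforeClass
    public static void setCommitLogMode()
    {
        // Every test inherited from CommitLogTest now runs in group mode.
        DatabaseDescriptor.setCommitLogSync(Config.CommitLogSync.group);
    }
}
{code}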



was (Author: jasobrown):
[~yuji] We would like this functionality rather soon, so I'd like to take it 
over. You've done a nice job up to now, and let's drive it home.

@Ariel, I've taken [~yuji]'s patch and added the comments and tests. wrt 
utests, the functionality I wanted to test is largely all in {{CommitLogTest}}, 
but the choice of commitlog mode is driven by {{test/conf/cassandra.yaml}}. Add 
to this [~JoshuaMcKenzie]'s attempts to make the commitlog more amenable to 
unit testing (read: it is still not very friendly for unit testing; see [this 
comment|https://issues.apache.org/jira/browse/CASSANDRA-13123?focusedCommentId=16189523&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16189523]), 
and I've subclassed {{CommitLogTest}} for each of the three modes (periodic, 
batch, and now group). This way we can test group mode, and get periodic as a 
bonus. This is the cowardly way of testing the different modes and their 
replayability, rather than reworking the commit log as a whole (as 
[~JoshuaMcKenzie] alludes to), but that seems like a larger issue to tackle (on 
a different ticket).


||13530||
|[branch|https://github.com/jasobrown/cassandra/tree/13530]|
|[utests|https://circleci.com/gh/jasobrown/cassandra/tree/13530]|

Note: I know there's a problem with {{PeriodicCommitLogTest}}, and I'll look 
into it in the morning. It should not hold up review of the small amount I've 
added, if you start reviewing before I fix the test.


> GroupCommitLogService
> -
>
> Key: CASSANDRA-13530
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13530
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Yuji Ito
>Assignee: Yuji Ito
> Fix For: 2.2.x, 3.0.x, 3.11.x
>
> Attachments: GuavaRequestThread.java, MicroRequestThread.java, 
> groupAndBatch.png, groupCommit22.patch, groupCommit30.patch, 
> groupCommit3x.patch, groupCommitLog_noSerial_result.xlsx, 
> groupCommitLog_result.xlsx
>
>
> I propose a new CommitLogService, GroupCommitLogService, to improve 
> throughput when many requests are received.
> It improved throughput by up to 94%.
> I'd like to discuss this CommitLogService.
> Currently, we can select one of two CommitLog services: Periodic and Batch.
> With Periodic, we might lose commit log entries that haven't been written to 
> the disk.
> With Batch, we write the commit log to the disk every time, but each write 
> is very small (< 4KB). Under high concurrency, these writes are gathered and 
> persisted to the disk at once. Under insufficient concurrency, however, many 
> small writes are issued and performance decreases due to the latency of the 
> disk. Even on an SSD, processing many IO commands decreases performance.
> GroupCommitLogService writes several commit log entries to the disk at once.
> The patch adds GroupCommitLogService (enabled by setting `commitlog_sync` 
> and `commitlog_sync_group_window_in_ms` in cassandra.yaml).
> The only difference from Batch is waiting on the semaphore.
> By waiting on the semaphore, several commit log writes are executed at the
> 

[jira] [Commented] (CASSANDRA-13530) GroupCommitLogService

2017-11-13 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250740#comment-16250740
 ] 

Jason Brown commented on CASSANDRA-13530:
-

[~yuji] We would like this functionality rather soon, so I'd like to take it 
over. You've done a nice job up to now, and let's drive it home.

@Ariel, I've taken [~yuji]'s patch and added the comments and tests. wrt 
utests, the functionality I wanted to test is largely all in {{CommitLogTest}}, 
but the choice of commitlog mode is driven by {{test/conf/cassandra.yaml}}. Add 
to this [~JoshuaMcKenzie]'s attempts to make the commitlog more amenable to 
unit testing (read: it is still not very friendly for unit testing; see [this 
comment|https://issues.apache.org/jira/browse/CASSANDRA-13123?focusedCommentId=16189523&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16189523]), 
and I've subclassed {{CommitLogTest}} for each of the three modes (periodic, 
batch, and now group). This way we can test group mode, and get periodic as a 
bonus. This is the cowardly way of testing the different modes and their 
replayability, rather than reworking the commit log as a whole (as 
[~JoshuaMcKenzie] alludes to), but that seems like a larger issue to tackle (on 
a different ticket).


||13530||
|[branch|https://github.com/jasobrown/cassandra/tree/13530]|
|[utests|https://circleci.com/gh/jasobrown/cassandra/tree/13530]|

Note: I know there's a problem with {{PeriodicCommitLogTest}}, and I'll look 
into it in the morning. It should not hold up review of the small amount I've 
added, if you start reviewing before I fix the test.


> GroupCommitLogService
> -
>
> Key: CASSANDRA-13530
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13530
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Yuji Ito
>Assignee: Yuji Ito
> Fix For: 2.2.x, 3.0.x, 3.11.x
>
> Attachments: GuavaRequestThread.java, MicroRequestThread.java, 
> groupAndBatch.png, groupCommit22.patch, groupCommit30.patch, 
> groupCommit3x.patch, groupCommitLog_noSerial_result.xlsx, 
> groupCommitLog_result.xlsx
>
>
> I propose a new CommitLogService, GroupCommitLogService, to improve 
> throughput when many requests are received.
> It improved throughput by up to 94%.
> I'd like to discuss this CommitLogService.
> Currently, we can select one of two CommitLog services: Periodic and Batch.
> With Periodic, we might lose commit log entries that haven't been written to 
> the disk.
> With Batch, we write the commit log to the disk every time, but each write 
> is very small (< 4KB). Under high concurrency, these writes are gathered and 
> persisted to the disk at once. Under insufficient concurrency, however, many 
> small writes are issued and performance decreases due to the latency of the 
> disk. Even on an SSD, processing many IO commands decreases performance.
> GroupCommitLogService writes several commit log entries to the disk at once.
> The patch adds GroupCommitLogService (enabled by setting `commitlog_sync` 
> and `commitlog_sync_group_window_in_ms` in cassandra.yaml).
> The only difference from Batch is waiting on the semaphore.
> By waiting on the semaphore, several commit log writes are executed at the 
> same time.
> With GroupCommitLogService, latency becomes worse when there is no 
> concurrency.
> I measured performance with my microbenchmark (MicroRequestThread.java) 
> while increasing the number of threads. The cluster has 3 nodes (replication 
> factor: 3); each node is an AWS EC2 m4.large instance with a 200 IOPS io1 
> volume.
> The results are below. GroupCommitLogService with a 10ms window improved 
> updates with Paxos by 94% and selects with Paxos by 76%.
> h6. SELECT / sec
> ||\# of threads||Batch 2ms||Group 10ms||
> |1|192|103|
> |2|163|212|
> |4|264|416|
> |8|454|800|
> |16|744|1311|
> |32|1151|1481|
> |64|1767|1844|
> |128|2949|3011|
> |256|4723|5000|
> h6. UPDATE / sec
> ||\# of threads||Batch 2ms||Group 10ms||
> |1|45|26|
> |2|39|51|
> |4|58|102|
> |8|102|198|
> |16|167|213|
> |32|289|295|
> |64|544|548|
> |128|1046|1058|
> |256|2020|2061|






[jira] [Commented] (CASSANDRA-14013) Data loss in snapshots keyspace after service restart

2017-11-13 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250453#comment-16250453
 ] 

Jason Brown commented on CASSANDRA-14013:
-

OK, I walked through [~kongo2002]'s example script above on the 3.11 branch, 
and indeed I am able to reproduce. I tried on 3.0, and I think it did not 
repro (I would need to do it again, tbqh).

I don't have time to dig in for the next few days, but I suspect it's because 
you named the keyspace "{{snapshots}}", and cassandra might be getting 
confused and trying to clean up any data it thinks is "snapshot" data. 
Especially since you have other keyspaces with other names where you are not 
seeing this problem, I'm guessing we have a bug in the handling of 
subdirectories named "snapshots".
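
To illustrate the suspected collision (a hypothetical layout; the table id is 
a placeholder): Cassandra keeps per-table snapshots in a subdirectory named 
"snapshots", so a keyspace carrying that same name gives the data-directory 
sweep two very different things that both look like snapshot paths.

{code}
data/
  snapshots/                    <-- user keyspace, same name as the reserved dir
    test_idx-<table-id>/
      mc-1-big-Data.db
      snapshots/                <-- where per-table snapshots actually live
{code}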

> Data loss in snapshots keyspace after service restart
> -
>
> Key: CASSANDRA-14013
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14013
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Gregor Uhlenheuer
>
> I am posting this bug in the hope of discovering the stupid mistake I am 
> making, because I can't imagine a reasonable explanation for the behavior I 
> see right now :-)
> In short, I observe data loss in a keyspace called *snapshots* after 
> restarting the Cassandra service. Say I have 1000 records in a table called 
> *snapshots.test_idx*; after a restart the table has fewer entries or is even 
> empty.
> The "mysterious" observation is that it happens only in a keyspace called 
> *snapshots*...
> h3. Steps to reproduce
> These steps reproduce the described behavior in "most" attempts (not every 
> single time, though).
> {code}
> # create keyspace
> CREATE KEYSPACE snapshots WITH replication = {'class': 'SimpleStrategy', 
> 'replication_factor': 1};
> # create table
> CREATE TABLE snapshots.test_idx (key text, seqno bigint, primary key(key));
> # insert some test data
> INSERT INTO snapshots.test_idx (key,seqno) values ('key1', 1);
> ...
> INSERT INTO snapshots.test_idx (key,seqno) values ('key1000', 1000);
> # count entries
> SELECT count(*) FROM snapshots.test_idx;
> 1000
> # restart service
> kill <pid>
> cassandra -f
> # count entries
> SELECT count(*) FROM snapshots.test_idx;
> 0
> {code}
> I hope someone can point me to the obvious mistake I am making :-)
> This happened to me using both Cassandra 3.9 and 3.11.0.






[jira] [Commented] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability

2017-11-13 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250428#comment-16250428
 ] 

Jason Brown commented on CASSANDRA-13987:
-

I don't believe we've had a policy or guarantee in the past about the 
availability of commit log data that was unflushed (not {{msync}}'ed), thus 
I'm not sure how much of a 'regression' this changed behavior is. It's 
unfortunate that some previous assumptions that both developers and operators 
may have had were altered, and the end result may be data loss. So I'm kind of 
on the fence about how far to go back, but I think 3.0 and up is reasonable.

> Multithreaded commitlog subtly changed durability
> -
>
> Key: CASSANDRA-13987
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13987
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jason Brown
>Assignee: Jason Brown
> Fix For: 4.x
>
>
> When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly 
> changed the way that commitlog durability worked. Everything still gets 
> written to an mmap file. However, not everything is replayable from the 
> mmapped file after a process crash, in periodic mode.
> In brief, the reason this changed is the chained markers that are required 
> for the multithreaded commit log. At each msync, we wait for outstanding 
> mutations to serialize into the commitlog, and update a marker before and 
> after the commits that have accumulated since the last sync. With those 
> markers, we can safely replay that section of the commitlog. Without the 
> markers, we have no guarantee that the commits in that section were 
> successfully written, thus we abandon those commits on replay.
> If you have correlated process failures of multiple nodes at "nearly" the 
> same time (see ["There Is No 
> Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have 
> data loss if none of the nodes msync the commitlog. For example, with RF=3, 
> if a quorum write succeeds on two nodes (and we acknowledge the write back 
> to the client), and then the process on both nodes OOMs (say, due to reading 
> the index for a 100GB partition), the write will be lost if neither process 
> msync'ed the commitlog. More exactly, the commitlog cannot be fully 
> replayed. The reason this data is silently lost is the chained markers that 
> were introduced with CASSANDRA-3578.
> The problem we are addressing with this ticket is incrementally improving 
> 'durability' under process crash, not host crash. (Note: operators should 
> use batch mode to ensure greater durability, but batch mode in its current 
> implementation is a) borked, and b) will burn through SSDs that don't have a 
> non-volatile write cache sitting in front *very* rapidly.)
> The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which 
> means that a node could lose up to ten seconds of data due to a process 
> crash. The unfortunate thing is that the data is still available, in the 
> mmap file, but we can't replay it due to incomplete chained markers.
> ftr, I don't believe we've ever had a stated policy about commitlog 
> durability wrt process crash. Pre-2.0 we naturally piggy-backed off the 
> memory-mapped file and the fact that every mutation acquired a lock and 
> wrote into the mmap buffer, and the ability to replay everything out of it 
> came for free. With CASSANDRA-3578, that was subtly changed.
> Something [~jjirsa] pointed out to me is that [MySQL provides a way to 
> adjust the durability 
> guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit]
>  of each commit in InnoDB via {{innodb_flush_log_at_trx_commit}}. I'm using 
> that idea as a loose springboard for what to do here.
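
A minimal sketch of the chained-marker scheme the description refers to 
(illustrative names and framing, not the real CommitLog classes):

{code}
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

class SyncMarkerSketch
{
    // Patched in by the sync thread once the section's mutations are msync'ed:
    // the marker records where the next section begins, plus a checksum.
    static void writeMarker(ByteBuffer segment, int sectionStart, int nextSectionStart)
    {
        segment.putInt(sectionStart, nextSectionStart);
        segment.putInt(sectionStart + 4, crc(sectionStart, nextSectionStart));
    }

    // Replay walks the chain; a zeroed or mismatching marker means the section
    // was never synced, so its commits are abandoned.
    static boolean sectionReplayable(ByteBuffer segment, int sectionStart)
    {
        int next = segment.getInt(sectionStart);
        return next != 0 && segment.getInt(sectionStart + 4) == crc(sectionStart, next);
    }

    private static int crc(int start, int next)
    {
        CRC32 crc = new CRC32();
        crc.update(start);
        crc.update(next);
        return (int) crc.getValue();
    }
}
{code}

The data-loss window falls directly out of this: anything written after the 
last patched marker is sitting in the mmap file but fails the marker check, so 
replay abandons it.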






[jira] [Commented] (CASSANDRA-14013) Data loss in snapshots keyspace after service restart

2017-11-13 Thread Gregor Uhlenheuer (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250381#comment-16250381
 ] 

Gregor Uhlenheuer commented on CASSANDRA-14013:
---

[~jasobrown] Thanks for the pointer to CASSANDRA-13987 - although I don't 
think this is the same problem, as I do indeed wait for more than 10 seconds. 
It actually appears that I can pretty much restart the service a couple of 
times until the table in the *snapshots* keyspace is completely empty. I just 
tried again on a different machine with the same behavior.

> Data loss in snapshots keyspace after service restart
> -
>
> Key: CASSANDRA-14013
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14013
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Gregor Uhlenheuer
>
> I am posting this bug in the hope of discovering the stupid mistake I am 
> making, because I can't imagine a reasonable explanation for the behavior I 
> see right now :-)
> In short, I observe data loss in a keyspace called *snapshots* after 
> restarting the Cassandra service. Say I have 1000 records in a table called 
> *snapshots.test_idx*; after a restart the table has fewer entries or is even 
> empty.
> The "mysterious" observation is that it happens only in a keyspace called 
> *snapshots*...
> h3. Steps to reproduce
> These steps reproduce the described behavior in "most" attempts (not every 
> single time, though).
> {code}
> # create keyspace
> CREATE KEYSPACE snapshots WITH replication = {'class': 'SimpleStrategy', 
> 'replication_factor': 1};
> # create table
> CREATE TABLE snapshots.test_idx (key text, seqno bigint, primary key(key));
> # insert some test data
> INSERT INTO snapshots.test_idx (key,seqno) values ('key1', 1);
> ...
> INSERT INTO snapshots.test_idx (key,seqno) values ('key1000', 1000);
> # count entries
> SELECT count(*) FROM snapshots.test_idx;
> 1000
> # restart service
> kill <pid>
> cassandra -f
> # count entries
> SELECT count(*) FROM snapshots.test_idx;
> 0
> {code}
> I hope someone can point me to the obvious mistake I am making :-)
> This happened to me using both Cassandra 3.9 and 3.11.0.






[jira] [Commented] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability

2017-11-13 Thread Jeff Jirsa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250373#comment-16250373
 ] 

Jeff Jirsa commented on CASSANDRA-13987:


I get that the ship has sailed on 2.1/2.2, and I accept that. 

I'd like it in 3.0/3.11 because I think it's a guarantee people expect, but I'm 
open to arguments that it's too dangerous (I haven't touched that code in 
months, you have, so I'll defer to you if you think it's straightforward 
enough).



> Multithreaded commitlog subtly changed durability
> -
>
> Key: CASSANDRA-13987
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13987
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jason Brown
>Assignee: Jason Brown
> Fix For: 4.x
>
>
> When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly 
> changed the way that commitlog durability worked. Everything still gets 
> written to an mmap file. However, not everything is replayable from the 
> mmapped file after a process crash, in periodic mode.
> In brief, the reason this changed is the chained markers that are required 
> for the multithreaded commit log. At each msync, we wait for outstanding 
> mutations to serialize into the commitlog, and update a marker before and 
> after the commits that have accumulated since the last sync. With those 
> markers, we can safely replay that section of the commitlog. Without the 
> markers, we have no guarantee that the commits in that section were 
> successfully written, thus we abandon those commits on replay.
> If you have correlated process failures of multiple nodes at "nearly" the 
> same time (see ["There Is No 
> Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have 
> data loss if none of the nodes msync the commitlog. For example, with RF=3, 
> if a quorum write succeeds on two nodes (and we acknowledge the write back 
> to the client), and then the process on both nodes OOMs (say, due to reading 
> the index for a 100GB partition), the write will be lost if neither process 
> msync'ed the commitlog. More exactly, the commitlog cannot be fully 
> replayed. The reason this data is silently lost is the chained markers that 
> were introduced with CASSANDRA-3578.
> The problem we are addressing with this ticket is incrementally improving 
> 'durability' under process crash, not host crash. (Note: operators should 
> use batch mode to ensure greater durability, but batch mode in its current 
> implementation is a) borked, and b) will burn through SSDs that don't have a 
> non-volatile write cache sitting in front *very* rapidly.)
> The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which 
> means that a node could lose up to ten seconds of data due to a process 
> crash. The unfortunate thing is that the data is still available, in the 
> mmap file, but we can't replay it due to incomplete chained markers.
> ftr, I don't believe we've ever had a stated policy about commitlog 
> durability wrt process crash. Pre-2.0 we naturally piggy-backed off the 
> memory-mapped file and the fact that every mutation acquired a lock and 
> wrote into the mmap buffer, and the ability to replay everything out of it 
> came for free. With CASSANDRA-3578, that was subtly changed.
> Something [~jjirsa] pointed out to me is that [MySQL provides a way to 
> adjust the durability 
> guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit]
>  of each commit in InnoDB via {{innodb_flush_log_at_trx_commit}}. I'm using 
> that idea as a loose springboard for what to do here.






[jira] [Commented] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability

2017-11-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250370#comment-16250370
 ] 

Benedict commented on CASSANDRA-13987:
--

The discussed behavioural change was introduced in 2.1, so if it's considered 
a regression it should probably go all the way back, at least to 2.2 (I think 
we're still servicing that, right?).

However, if we consider it a regression, this doesn't fundamentally fix the 
problem, and we should probably file a follow-up ticket if we want to restore 
the 2.0 behaviour.

For the record, it's quite likely that for unencrypted segments we can get 
very nearly identical behaviour to before with changes to replay only, by just 
skipping corrupted sync markers and continuing to replay records while we are 
able to. Some changes to the file format and/or the time at which we serialize 
the size/checksum could make this more reliable, but here we're talking about 
race conditions, which arguably isn't a regression given that these writes 
could equivalently have simply been held up in the queue for the commit log 
thread before.

For encrypted segments, I don't know if we need to "restore" behaviour, since 
it was never available before, but it would make sense to do so (least 
surprise and all that). In that case we'd probably want to modify our segment 
writing to happen concurrently (but serially writing the bytes, of course). 
This probably isn't actually such a dramatic change, though it's been a while 
since I've looked at the code. This way we could "just" do the same as above, 
but also abort when we hit a corrupted/abrupt end of an encrypted/compressed 
stream.
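
A minimal sketch of the "skip corrupted markers and keep replaying" idea for 
unencrypted segments, under an assumed record framing of [int length][int 
crc-of-payload][payload] (the real format differs):

{code}
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

class LenientReplaySketch
{
    static int replay(ByteBuffer seg)
    {
        int replayed = 0;
        int pos = 0;
        while (pos + 8 <= seg.limit())
        {
            int len = seg.getInt(pos);
            if (len > 0 && pos + 8 + len <= seg.limit())
            {
                byte[] payload = new byte[len];
                ByteBuffer view = seg.duplicate();
                view.position(pos + 8);
                view.get(payload);
                CRC32 crc = new CRC32();
                crc.update(payload);
                if ((int) crc.getValue() == seg.getInt(pos + 4))
                {
                    replayed++;      // record verified: apply it, jump past it
                    pos += 8 + len;
                    continue;
                }
            }
            pos++; // bad length or checksum: slide forward one byte and retry
        }
        return replayed;
    }
}
{code}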

> Multithreaded commitlog subtly changed durability
> -
>
> Key: CASSANDRA-13987
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13987
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jason Brown
>Assignee: Jason Brown
> Fix For: 4.x
>
>
> When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly 
> changed the way that commitlog durability worked. Everything still gets 
> written to an mmap file. However, not everything is replayable from the 
> mmapped file after a process crash, in periodic mode.
> In brief, the reason this changed is the chained markers that are required 
> for the multithreaded commit log. At each msync, we wait for outstanding 
> mutations to serialize into the commitlog, and update a marker before and 
> after the commits that have accumulated since the last sync. With those 
> markers, we can safely replay that section of the commitlog. Without the 
> markers, we have no guarantee that the commits in that section were 
> successfully written, thus we abandon those commits on replay.
> If you have correlated process failures of multiple nodes at "nearly" the 
> same time (see ["There Is No 
> Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have 
> data loss if none of the nodes msync the commitlog. For example, with RF=3, 
> if a quorum write succeeds on two nodes (and we acknowledge the write back 
> to the client), and then the process on both nodes OOMs (say, due to reading 
> the index for a 100GB partition), the write will be lost if neither process 
> msync'ed the commitlog. More exactly, the commitlog cannot be fully 
> replayed. The reason this data is silently lost is the chained markers that 
> were introduced with CASSANDRA-3578.
> The problem we are addressing with this ticket is incrementally improving 
> 'durability' under process crash, not host crash. (Note: operators should 
> use batch mode to ensure greater durability, but batch mode in its current 
> implementation is a) borked, and b) will burn through SSDs that don't have a 
> non-volatile write cache sitting in front *very* rapidly.)
> The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which 
> means that a node could lose up to ten seconds of data due to a process 
> crash. The unfortunate thing is that the data is still available, in the 
> mmap file, but we can't replay it due to incomplete chained markers.
> ftr, I don't believe we've ever had a stated policy about commitlog 
> durability wrt process crash. Pre-2.0 we naturally piggy-backed off the 
> memory-mapped file and the fact that every mutation acquired a lock and 
> wrote into the mmap buffer, and the ability to replay everything out of it 
> came for free. With CASSANDRA-3578, that was subtly changed.
> Something [~jjirsa] pointed out to me is that [MySQL provides a way to 
> adjust the durability 
> guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit]
>  of each commit in InnoDB via {{innodb_flush_log_at_trx_commit}}. I'm using 
> that idea as a loose springboard for what to do here.

[jira] [Commented] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability

2017-11-13 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250368#comment-16250368
 ] 

Jason Brown commented on CASSANDRA-13987:
-

[~jjirsa] This is a change in behavior from when the multithreaded commitlog 
(CASSANDRA-3578) was introduced, in 2.1. I'm pretty sure we don't want to 
update 2.1, and 2.2 is highly doubtful as well, but I'm fine with 3.0 and 
higher if folks think it's worth it.

> Multithreaded commitlog subtly changed durability
> -
>
> Key: CASSANDRA-13987
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13987
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jason Brown
>Assignee: Jason Brown
> Fix For: 4.x
>
>
> When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly 
> changed the way that commitlog durability worked. Everything still gets 
> written to an mmap file. However, not everything is replayable from the 
> mmapped file after a process crash, in periodic mode.
> In brief, the reason this changed is the chained markers that are required 
> for the multithreaded commit log. At each msync, we wait for outstanding 
> mutations to serialize into the commitlog, and update a marker before and 
> after the commits that have accumulated since the last sync. With those 
> markers, we can safely replay that section of the commitlog. Without the 
> markers, we have no guarantee that the commits in that section were 
> successfully written, thus we abandon those commits on replay.
> If you have correlated process failures of multiple nodes at "nearly" the 
> same time (see ["There Is No 
> Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have 
> data loss if none of the nodes msync the commitlog. For example, with RF=3, 
> if a quorum write succeeds on two nodes (and we acknowledge the write back 
> to the client), and then the process on both nodes OOMs (say, due to reading 
> the index for a 100GB partition), the write will be lost if neither process 
> msync'ed the commitlog. More exactly, the commitlog cannot be fully 
> replayed. The reason this data is silently lost is the chained markers that 
> were introduced with CASSANDRA-3578.
> The problem we are addressing with this ticket is incrementally improving 
> 'durability' under process crash, not host crash. (Note: operators should 
> use batch mode to ensure greater durability, but batch mode in its current 
> implementation is a) borked, and b) will burn through SSDs that don't have a 
> non-volatile write cache sitting in front *very* rapidly.)
> The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which 
> means that a node could lose up to ten seconds of data due to a process 
> crash. The unfortunate thing is that the data is still available, in the 
> mmap file, but we can't replay it due to incomplete chained markers.
> ftr, I don't believe we've ever had a stated policy about commitlog 
> durability wrt process crash. Pre-2.0 we naturally piggy-backed off the 
> memory-mapped file and the fact that every mutation acquired a lock and 
> wrote into the mmap buffer, and the ability to replay everything out of it 
> came for free. With CASSANDRA-3578, that was subtly changed.
> Something [~jjirsa] pointed out to me is that [MySQL provides a way to 
> adjust the durability 
> guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit]
>  of each commit in InnoDB via {{innodb_flush_log_at_trx_commit}}. I'm using 
> that idea as a loose springboard for what to do here.






[jira] [Commented] (CASSANDRA-14013) Data loss in snapshots keyspace after service restart

2017-11-13 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250357#comment-16250357
 ] 

Jason Brown commented on CASSANDRA-14013:
-

[~kongo2002] After you perform the inserts, how long do you wait before 
bouncing cassandra? If you wait for >= 10 seconds (or whatever 
{{commitlog_sync_period_in_ms}} is set to in the {{cassandra.yaml}}), do you 
still have the same problem?

I believe CASSANDRA-13987 addresses the same issue that you are raising here. 
You can read that ticket for all the gory details.

> Data loss in snapshots keyspace after service restart
> -
>
> Key: CASSANDRA-14013
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14013
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Gregor Uhlenheuer
>
> I am posting this bug in the hope of discovering the stupid mistake I am 
> making, because I can't imagine a reasonable explanation for the behavior I 
> see right now :-)
> In short, I observe data loss in a keyspace called *snapshots* after 
> restarting the Cassandra service. Say I have 1000 records in a table called 
> *snapshots.test_idx*; after a restart the table has fewer entries or is even 
> empty.
> The "mysterious" observation is that it happens only in a keyspace called 
> *snapshots*...
> h3. Steps to reproduce
> These steps reproduce the described behavior in "most" attempts (not every 
> single time, though).
> {code}
> # create keyspace
> CREATE KEYSPACE snapshots WITH replication = {'class': 'SimpleStrategy', 
> 'replication_factor': 1};
> # create table
> CREATE TABLE snapshots.test_idx (key text, seqno bigint, primary key(key));
> # insert some test data
> INSERT INTO snapshots.test_idx (key,seqno) values ('key1', 1);
> ...
> INSERT INTO snapshots.test_idx (key,seqno) values ('key1000', 1000);
> # count entries
> SELECT count(*) FROM snapshots.test_idx;
> 1000
> # restart service
> kill <pid>
> cassandra -f
> # count entries
> SELECT count(*) FROM snapshots.test_idx;
> 0
> {code}
> I hope someone can point me to the obvious mistake I am making :-)
> This happened to me using both Cassandra 3.9 and 3.11.0.






[jira] [Commented] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability

2017-11-13 Thread sankalp kohli (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250338#comment-16250338
 ] 

sankalp kohli commented on CASSANDRA-13987:
---

+1 for doing this in 3.0+ 

> Multithreaded commitlog subtly changed durability
> -
>
> Key: CASSANDRA-13987
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13987
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jason Brown
>Assignee: Jason Brown
> Fix For: 4.x
>
>
> When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly 
> changed the way that commitlog durability worked. Everything still gets 
> written to an mmap file. However, not everything is replayable from the 
> mmapped file after a process crash, in periodic mode.
> In brief, the reason this changed is the chained markers that are required 
> for the multithreaded commit log. At each msync, we wait for outstanding 
> mutations to serialize into the commitlog, and update a marker before and 
> after the commits that have accumulated since the last sync. With those 
> markers, we can safely replay that section of the commitlog. Without the 
> markers, we have no guarantee that the commits in that section were 
> successfully written, thus we abandon those commits on replay.
> If you have correlated process failures of multiple nodes at "nearly" the 
> same time (see ["There Is No 
> Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have 
> data loss if none of the nodes msync the commitlog. For example, with RF=3, 
> if a quorum write succeeds on two nodes (and we acknowledge the write back 
> to the client), and then the process on both nodes OOMs (say, due to reading 
> the index for a 100GB partition), the write will be lost if neither process 
> msync'ed the commitlog. More exactly, the commitlog cannot be fully 
> replayed. The reason this data is silently lost is the chained markers that 
> were introduced with CASSANDRA-3578.
> The problem we are addressing with this ticket is incrementally improving 
> 'durability' under process crash, not host crash. (Note: operators should 
> use batch mode to ensure greater durability, but batch mode in its current 
> implementation is a) borked, and b) will burn through SSDs that don't have a 
> non-volatile write cache sitting in front *very* rapidly.)
> The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which 
> means that a node could lose up to ten seconds of data due to a process 
> crash. The unfortunate thing is that the data is still available, in the 
> mmap file, but we can't replay it due to incomplete chained markers.
> ftr, I don't believe we've ever had a stated policy about commitlog 
> durability wrt process crash. Pre-2.0 we naturally piggy-backed off the 
> memory-mapped file and the fact that every mutation acquired a lock and 
> wrote into the mmap buffer, and the ability to replay everything out of it 
> came for free. With CASSANDRA-3578, that was subtly changed.
> Something [~jjirsa] pointed out to me is that [MySQL provides a way to 
> adjust the durability 
> guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit]
>  of each commit in InnoDB via {{innodb_flush_log_at_trx_commit}}. I'm using 
> that idea as a loose springboard for what to do here.






[jira] [Comment Edited] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability

2017-11-13 Thread sankalp kohli (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250338#comment-16250338
 ] 

sankalp kohli edited comment on CASSANDRA-13987 at 11/13/17 10:07 PM:
--

 +1 for doing this in 3.0+ 


was (Author: kohlisankalp):
+1 for doing this in 3.0+ 

> Multithreaded commitlog subtly changed durability
> -
>
> Key: CASSANDRA-13987
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13987
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jason Brown
>Assignee: Jason Brown
> Fix For: 4.x
>
>
> When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly 
> changed the way that commitlog durability worked. Everything still gets 
> written to an mmap file. However, not everything is replayable from the 
> mmapped file after a process crash, in periodic mode.
> In brief, the reason this changed is the chained markers that are required 
> for the multithreaded commit log. At each msync, we wait for outstanding 
> mutations to serialize into the commitlog, and update a marker before and 
> after the commits that have accumulated since the last sync. With those 
> markers, we can safely replay that section of the commitlog. Without the 
> markers, we have no guarantee that the commits in that section were 
> successfully written, thus we abandon those commits on replay.
> If you have correlated process failures of multiple nodes at "nearly" the 
> same time (see ["There Is No 
> Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have 
> data loss if none of the nodes msync the commitlog. For example, with RF=3, 
> if a quorum write succeeds on two nodes (and we acknowledge the write back 
> to the client), and then the process on both nodes OOMs (say, due to reading 
> the index for a 100GB partition), the write will be lost if neither process 
> msync'ed the commitlog. More exactly, the commitlog cannot be fully 
> replayed. The reason this data is silently lost is the chained markers that 
> were introduced with CASSANDRA-3578.
> The problem we are addressing with this ticket is incrementally improving 
> 'durability' under process crash, not host crash. (Note: operators should 
> use batch mode to ensure greater durability, but batch mode in its current 
> implementation is a) borked, and b) will burn through SSDs that don't have a 
> non-volatile write cache sitting in front *very* rapidly.)
> The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which 
> means that a node could lose up to ten seconds of data due to a process 
> crash. The unfortunate thing is that the data is still available, in the 
> mmap file, but we can't replay it due to incomplete chained markers.
> ftr, I don't believe we've ever had a stated policy about commitlog 
> durability wrt process crash. Pre-2.0 we naturally piggy-backed off the 
> memory-mapped file and the fact that every mutation acquired a lock and 
> wrote into the mmap buffer, and the ability to replay everything out of it 
> came for free. With CASSANDRA-3578, that was subtly changed.
> Something [~jjirsa] pointed out to me is that [MySQL provides a way to 
> adjust the durability 
> guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit]
>  of each commit in InnoDB via {{innodb_flush_log_at_trx_commit}}. I'm using 
> that idea as a loose springboard for what to do here.






[jira] [Commented] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability

2017-11-13 Thread Jeff Jirsa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250333#comment-16250333
 ] 

Jeff Jirsa commented on CASSANDRA-13987:


Am I the only one who thinks this belongs in 3.0+ instead of 4.0?  It's a 
regression (though not from 2.1/2.2, I guess), and it impacts data safety. 


> Multithreaded commitlog subtly changed durability
> -
>
> Key: CASSANDRA-13987
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13987
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jason Brown
>Assignee: Jason Brown
> Fix For: 4.x
>
>
> When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly 
> changed the way that commitlog durability worked. Everything still gets 
> written to an mmap file. However, not everything is replayable from the 
> mmapped file after a process crash, in periodic mode.
> In brief, the reason this changed is the chained markers that are required 
> for the multithreaded commit log. At each msync, we wait for outstanding 
> mutations to serialize into the commitlog, and update a marker before and 
> after the commits that have accumulated since the last sync. With those 
> markers, we can safely replay that section of the commitlog. Without the 
> markers, we have no guarantee that the commits in that section were 
> successfully written, thus we abandon those commits on replay.
> If you have correlated process failures of multiple nodes at "nearly" the 
> same time (see ["There Is No 
> Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have 
> data loss if none of the nodes msync the commitlog. For example, with RF=3, 
> if a quorum write succeeds on two nodes (and we acknowledge the write back 
> to the client), and then the process on both nodes OOMs (say, due to reading 
> the index for a 100GB partition), the write will be lost if neither process 
> msync'ed the commitlog. More exactly, the commitlog cannot be fully 
> replayed. The reason this data is silently lost is the chained markers that 
> were introduced with CASSANDRA-3578.
> The problem we are addressing with this ticket is incrementally improving 
> 'durability' under process crash, not host crash. (Note: operators should 
> use batch mode to ensure greater durability, but batch mode in its current 
> implementation is a) borked, and b) will burn through SSDs that don't have a 
> non-volatile write cache sitting in front *very* rapidly.)
> The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which 
> means that a node could lose up to ten seconds of data due to a process 
> crash. The unfortunate thing is that the data is still available, in the 
> mmap file, but we can't replay it due to incomplete chained markers.
> ftr, I don't believe we've ever had a stated policy about commitlog 
> durability wrt process crash. Pre-2.0 we naturally piggy-backed off the 
> memory-mapped file and the fact that every mutation acquired a lock and 
> wrote into the mmap buffer, and the ability to replay everything out of it 
> came for free. With CASSANDRA-3578, that was subtly changed.
> Something [~jjirsa] pointed out to me is that [MySQL provides a way to 
> adjust the durability 
> guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit]
>  of each commit in InnoDB via {{innodb_flush_log_at_trx_commit}}. I'm using 
> that idea as a loose springboard for what to do here.






[jira] [Commented] (CASSANDRA-13983) Support a means of logging all queries as they were invoked

2017-11-13 Thread Blake Eggleston (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250240#comment-16250240
 ] 

Blake Eggleston commented on CASSANDRA-13983:
-

Well I meant more than one :).

Anyway, if it's there intentionally, I don't have a problem with it. I was just 
calling it out because it seemed like it could have been something left over 
from an earlier iteration.

> Support a means of logging all queries as they were invoked
> ---
>
> Key: CASSANDRA-13983
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13983
> Project: Cassandra
>  Issue Type: New Feature
>  Components: CQL, Observability, Testing, Tools
>Reporter: Ariel Weisberg
>Assignee: Ariel Weisberg
> Fix For: 4.0
>
>
> For correctness testing it's useful to be able to capture production traffic 
> so that it can be replayed against both the old and new versions of 
> Cassandra while comparing the results.
> Implementing this functionality once, inside the database, performs well and 
> presents less operational complexity.
> In [this patch|https://github.com/apache/cassandra/pull/169] there is an 
> implementation of a full query log that uses chronicle-queue (Apache 
> licensed; the maven artifacts are labeled incorrectly in some cases; 
> dependencies are also Apache licensed) to implement a rotating log of 
> queries.
> * A single thread asynchronously writes log entries to disk to reduce the 
> impact on query latency
> * Heap memory usage is bounded by a weighted queue, with a configurable 
> maximum weight, sitting in front of the logging thread
> * If the weighted queue is full, producers can be blocked or samples can be 
> dropped
> * Disk utilization is bounded by deleting old log segments once a 
> configurable size is reached
> * The on-disk serialization uses a flexible-schema binary format 
> (chronicle-wire), making it easy to skip unrecognized fields, add new ones, 
> and omit old ones
> * Can be enabled and configured via JMX, disabled, and reset (deleting 
> on-disk data); the logging path is configurable via both JMX and YAML
> * Introduces a new {{fqltool}} in /bin that currently implements {{Dump}}, 
> which can dump full query logs in a human-readable format as well as follow 
> active full query logs
> Follow-up work:
> * Introduce a new {{fqltool}} command, Replay, which can replay N full query 
> logs to two different clusters, compare the results, and check for 
> inconsistencies. <- Actively working on getting this done
> * Log not just queries but their results, to facilitate a comparison between 
> the original query result and the replayed result. <- Really just don't have 
> a specific use case at the moment
> * "Consistent" query logging, allowing replay to fully replicate the 
> original order of execution and completion even in the face of races 
> (including CAS). <- This is more speculative
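
The weighted-queue bound described above can be sketched with a semaphore 
carrying the weight (a minimal illustration; the actual patch builds on 
chronicle-queue, and these names are not from it):

{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.Semaphore;

class WeightedLogQueueSketch
{
    private final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();
    private final Semaphore weight; // permits = bytes allowed in the queue

    WeightedLogQueueSketch(int maxWeightBytes)
    {
        this.weight = new Semaphore(maxWeightBytes);
    }

    // Producer (query thread): blocks once maxWeightBytes of entries are
    // queued. Assumes any single entry is smaller than the cap.
    void offer(byte[] entry) throws InterruptedException
    {
        weight.acquire(entry.length);
        queue.put(entry);
    }

    // The single logging thread: drain entries, persist, release their weight.
    void drainLoop() throws InterruptedException
    {
        while (true)
        {
            byte[] entry = queue.take();
            // ... append entry to the rotating on-disk log here ...
            weight.release(entry.length);
        }
    }
}
{code}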






[jira] [Updated] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability

2017-11-13 Thread Sam Tunnicliffe (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Tunnicliffe updated CASSANDRA-13987:

Status: In Progress  (was: Patch Available)

> Multithreaded commitlog subtly changed durability
> -
>
> Key: CASSANDRA-13987
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13987
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jason Brown
>Assignee: Jason Brown
> Fix For: 4.x
>
>
> When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly 
> changed the way that commitlog durability worked. Everything still gets 
> written to an mmap file. However, not everything is replayable from the 
> mmaped file after a process crash, in periodic mode.
> In brief, the reason this changesd is due to the chained markers that are 
> required for the multithreaded commit log. At each msync, we wait for 
> outstanding mutations to serialize into the commitlog, and update a marker 
> before and after the commits that have accumluated since the last sync. With 
> those markers, we can safely replay that section of the commitlog. Without 
> the markers, we have no guarantee that the commits in that section were 
> successfully written, thus we abandon those commits on replay.
> If you have correlated process failures of multiple nodes at "nearly" the 
> same time (see ["There Is No 
> Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have 
> data loss if none of the nodes msync the commitlog. For example, with RF=3, 
> if quorum write succeeds on two nodes (and we acknowledge the write back to 
> the client), and then the process on both nodes OOMs (say, due to reading the 
> index for a 100GB partition), the write will be lost if neither process 
> msync'ed the commitlog. More exactly, the commitlog cannot be fully replayed. 
> The reason why this data is silently lost is due to the chained markers that 
> were introduced with CASSANDRA-3578.
> The problem we are addressing with this ticket is incrementally improving 
> 'durability' due to process crash, not host crash. (Note: operators should 
> use batch mode to ensure greater durability, but batch mode in its current 
> implementation is a) borked, and b) will burn through, *very* rapidly, SSDs 
> that don't have a non-volatile write cache sitting in front.) 
> The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which 
> means that a node could lose up to ten seconds of data due to process crash. 
> The unfortunate thing is that the data is still available, in the mmap file, 
> but we can't replay it due to incomplete chained markers.
> ftr, I don't believe we've ever had a stated policy about commitlog 
> durability wrt process crash. Pre-2.0 we naturally piggy-backed off the 
> memory mapped file and the fact that every mutation acquired a lock and 
> wrote into the mmap buffer, and the ability to replay everything out of it 
> came for free. With CASSANDRA-3578, that was subtly changed. 
> Something [~jjirsa] pointed out to me is that [MySQL provides a way to adjust 
> the durability 
> guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit]
>  of each commit in innodb via the {{innodb_flush_log_at_trx_commit}}. I'm 
> using that idea as a loose springboard for what to do here.
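
To make the chained-marker scheme described above concrete, here is a minimal, 
self-contained sketch (hypothetical layout and names, not Cassandra's actual 
commitlog format): each sync completes a marker around the commits accumulated 
since the last sync, and replay stops at the first section whose marker was 
never finished.

{code:java}
import java.nio.ByteBuffer;

public class ChainedMarkerSketch
{
    private static final int NO_MARKER = -1;

    // Layout per section: [int endOffset][payload bytes ...]
    // endOffset is only written at sync time, so an un-synced tail reads NO_MARKER.
    public static void appendSection(ByteBuffer log, byte[] payload)
    {
        int markerPos = log.position();
        log.putInt(NO_MARKER);            // placeholder until the sync completes
        log.put(payload);                 // mutations serialized since the last sync
        int end = log.position();
        log.putInt(markerPos, end);       // "msync" point: the marker is now valid
    }

    // Replay walks the chain and stops at the first incomplete marker,
    // abandoning any commits past it: the data-loss window discussed above.
    public static int replayableBytes(ByteBuffer log)
    {
        int pos = 0;
        while (pos + Integer.BYTES <= log.position())
        {
            int end = log.getInt(pos);
            if (end == NO_MARKER || end > log.position())
                break;
            pos = end;
        }
        return pos;
    }

    public static void main(String[] args)
    {
        ByteBuffer log = ByteBuffer.allocate(1024);
        appendSection(log, new byte[]{1, 2, 3});
        System.out.println(replayableBytes(log)); // 7: one complete section
    }
}
{code}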



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability

2017-11-13 Thread Sam Tunnicliffe (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250195#comment-16250195
 ] 

Sam Tunnicliffe commented on CASSANDRA-13987:
-

Previously, {{writeCDCIndexFile}} was only ever called after a flush, 
which would be consistent with its comment that states:
{code}We persist the offset of the last data synced to disk so clients can 
parse only durable data if they choose{code}
So currently this definition of durable would include durability in the face of 
host failures, whereas with this patch the index file may contain offsets for 
segments that are durable under process crash, but which have not yet been 
msynced/fsynced and so may not survive a host failure. Should we move the call 
to {{writeCDCIndexFile}} into the {{if (flush || close)}} block, to after the 
flush has completed?
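
For illustration, a rough sketch of the suggested re-ordering (field and method 
names are assumptions, not the patch's actual API):

{code:java}
public class CdcIndexOrderingSketch
{
    private long writtenOffset;  // last offset written to the mmapped buffer
    private long syncedOffset;   // last offset known to be on stable storage

    public void append(int size)
    {
        writtenOffset += size;   // mutation serialized into the buffer
    }

    public void sync(boolean flush, boolean close)
    {
        // chained markers for the newly written section would be updated here
        if (flush || close)
        {
            msync();                         // durable across host failure from here on
            writeCdcIndexFile(syncedOffset); // only advertise offsets that survived the fsync
        }
    }

    private void msync() { syncedOffset = writtenOffset; }

    private void writeCdcIndexFile(long offset)
    {
        System.out.println("cdc index offset: " + offset);
    }
}
{code}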

That question aside, the code seems solid and I've manually tested both as-is 
and with some added hacks to inject failures etc, but I feel like it could 
still benefit from some automated testing to cover the new behaviour. I know 
that writing tests for this area is non-trivial and usually involves byteman, 
but do you think it's worth adding a unit test or two for this?

Nits:
* Typo in cassandra.yaml #380 s/mmaped/mmapped 
* The comment atop {{AbstractCommitLogSegmentManager::sync}} could use 
updating. The fact that it says it flushes, but also takes a boolean flush arg 
is a bit confusing.
* {{CompressedSegment}} and {{EncryptedSegment}} no longer need to import 
{{SyncUtil}}


> Multithreaded commitlog subtly changed durability
> -
>
> Key: CASSANDRA-13987
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13987
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jason Brown
>Assignee: Jason Brown
> Fix For: 4.x
>
>
> When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly 
> changed the way that commitlog durability worked. Everything still gets 
> written to an mmap file. However, not everything is replayable from the 
> mmaped file after a process crash, in periodic mode.
> In brief, the reason this changed is due to the chained markers that are 
> required for the multithreaded commit log. At each msync, we wait for 
> outstanding mutations to serialize into the commitlog, and update a marker 
> before and after the commits that have accumulated since the last sync. With 
> those markers, we can safely replay that section of the commitlog. Without 
> the markers, we have no guarantee that the commits in that section were 
> successfully written, thus we abandon those commits on replay.
> If you have correlated process failures of multiple nodes at "nearly" the 
> same time (see ["There Is No 
> Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have 
> data loss if none of the nodes msync the commitlog. For example, with RF=3, 
> if quorum write succeeds on two nodes (and we acknowledge the write back to 
> the client), and then the process on both nodes OOMs (say, due to reading the 
> index for a 100GB partition), the write will be lost if neither process 
> msync'ed the commitlog. More exactly, the commitlog cannot be fully replayed. 
> The reason why this data is silently lost is due to the chained markers that 
> were introduced with CASSANDRA-3578.
> The problem we are addressing with this ticket is incrementally improving 
> 'durability' due to process crash, not host crash. (Note: operators should 
> use batch mode to ensure greater durability, but batch mode in its current 
> implementation is a) borked, and b) will burn through, *very* rapidly, SSDs 
> that don't have a non-volatile write cache sitting in front.) 
> The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which 
> means that a node could lose up to ten seconds of data due to process crash. 
> The unfortunate thing is that the data is still available, in the mmap file, 
> but we can't replay it due to incomplete chained markers.
> ftr, I don't believe we've ever had a stated policy about commitlog 
> durability wrt process crash. Pre-2.0 we naturally piggy-backed off the 
> memory mapped file and the fact that every mutation acquired a lock and 
> wrote into the mmap buffer, and the ability to replay everything out of it 
> came for free. With CASSANDRA-3578, that was subtly changed. 
> Something [~jjirsa] pointed out to me is that [MySQL provides a way to adjust 
> the durability 
> guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit]
>  of each commit in innodb via the {{innodb_flush_log_at_trx_commit}}. I'm 
> using that idea as a loose springboard for what to do here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (CASSANDRA-13992) Don't send new_metadata_id for conditional updates

2017-11-13 Thread Olivier Michallat (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250154#comment-16250154
 ] 

Olivier Michallat commented on CASSANDRA-13992:
---

{{METADATA_CHANGED}} tells the client if it needs to update its local copy of 
the metadata. For conditional updates, the answer is always no (since the 
client should never store that information in the first place); that is why I 
think it's more intuitive to set the flag to false.

To put it another way: if the flag is forced to true, I have to add a condition 
in the client code ({{newMetadataId.bytes.length > 0}}). My worry is that a 
client implementation could forget to check that the id is empty, and end up 
with sub-optimal behavior (updating the local metadata unnecessarily each 
time).

If the flag is absent, conditional updates can be handled like any other 
statement.
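
As a sketch of the client-side difference (hypothetical types; the flag value 
is taken from the v5 protocol draft and should be treated as an assumption): 
if {{METADATA_CHANGED}} is simply absent for conditional updates, no special 
case is needed, whereas forcing it to true with an empty id requires the extra 
guard below in every driver.

{code:java}
public class MetadataFlagSketch
{
    // Metadata_changed flag bit in the v5 RESULT/Rows metadata (assumption
    // based on the protocol draft; verify against native_protocol_v5.spec).
    static final int METADATA_CHANGED = 0x0008;

    interface PreparedCache { void updateResultMetadata(byte[] id); }

    // With the flag absent for conditional updates, the length check below
    // (the guard a client could forget) would be unnecessary.
    static void onRowsResponse(int flags, byte[] newMetadataId, PreparedCache cache)
    {
        if ((flags & METADATA_CHANGED) != 0
            && newMetadataId != null
            && newMetadataId.length > 0)
        {
            cache.updateResultMetadata(newMetadataId);
        }
    }
}
{code}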



> Don't send new_metadata_id for conditional updates
> --
>
> Key: CASSANDRA-13992
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13992
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Olivier Michallat
>Assignee: Kurt Greaves
>Priority: Minor
>
> This is a follow-up to CASSANDRA-10786.
> Given the table
> {code}
> CREATE TABLE foo (k int PRIMARY KEY)
> {code}
> And the prepared statement
> {code}
> INSERT INTO foo (k) VALUES (?) IF NOT EXISTS
> {code}
> The result set metadata changes depending on the outcome of the update:
> * if the row didn't exist, there is only a single column \[applied] = true
> * if it did, the result contains \[applied] = false, plus the current value 
> of column k.
> The way this was handled so far is that the PREPARED response contains no 
> result set metadata, and therefore all EXECUTE messages have SKIP_METADATA = 
> false, and the responses always include the full (and correct) metadata.
> CASSANDRA-10786 still sends the PREPARED response with no metadata, *but the 
> response to EXECUTE now contains a {{new_metadata_id}}*. The driver thinks it 
> is because of a schema change, and updates its local copy of the prepared 
> statement's result metadata.
> The next EXECUTE is sent with SKIP_METADATA = true, but the server appears to 
> ignore that, and still sends the metadata in the response. So each response 
> includes the correct metadata, the driver uses it, and there is no visible 
> issue for client code.
> The only drawback is that the driver updates its local copy of the metadata 
> unnecessarily, every time. We can work around that by only updating if we had 
> metadata before, at the cost of an extra volatile read. But I think the best 
> thing to do would be to never send a {{new_metadata_id}} in for a conditional 
> update.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-13992) Don't send new_metadata_id for conditional updates

2017-11-13 Thread Alex Petrov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250071#comment-16250071
 ] 

Alex Petrov edited comment on CASSANDRA-13992 at 11/13/17 7:43 PM:
---

[~omichallat] not sure, since {{METADATA_CHANGED}} is just a flag: i.e. if it's 
set it's {{true}}, otherwise it's {{false}}. Moreover, I think that the default 
behaviour for LWTs has to be that we _always_ update metadata: there's no way 
for the server to know what the last metadata on the client was (since it depends 
on the result), and the server can't distinguish metadata hash inequality 
caused by {{ALTER}} from inequality caused by the success or failure of an LWT.

Unless I'm missing something, my patch achieves exactly that (also, without any 
driver changes): it forces the server to _always_ send the metadata. This, 
combined with a metadata id consisting of zeroes, tells the client that 
caching metadata is possible but won't gain anything: new result metadata 
will just be re-delivered on every call, since it potentially changes 
on every request.

I haven't updated the spec though. I will, if/when we agree on the behaviour.


was (Author: ifesdjeen):
[~omichallat] not sure, since {{METADATA_CHANGED}} is just a flag: i.e. if it's 
set it's {{true}}, otherwise it's {{false}}. Moreover, I think that the default 
behaviour for LWTs has to be that we _always_ update metadata: there's no way 
for the server to know what the last metadata on the client was (since it depends 
on the result), and the server can't distinguish metadata hash inequality 
caused by {{ALTER}} from inequality caused by the success or failure of an LWT.

Unless I'm missing something, my patch achieves exactly that (also, without any 
driver changes): it forces the server to _always_ send the metadata. This, 
combined with a metadata id consisting of zeroes, tells the client that 
caching metadata is possible but won't gain anything: new result metadata 
will just be re-delivered on every call, since it potentially changes 
on every request.

> Don't send new_metadata_id for conditional updates
> --
>
> Key: CASSANDRA-13992
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13992
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Olivier Michallat
>Assignee: Kurt Greaves
>Priority: Minor
>
> This is a follow-up to CASSANDRA-10786.
> Given the table
> {code}
> CREATE TABLE foo (k int PRIMARY KEY)
> {code}
> And the prepared statement
> {code}
> INSERT INTO foo (k) VALUES (?) IF NOT EXISTS
> {code}
> The result set metadata changes depending on the outcome of the update:
> * if the row didn't exist, there is only a single column \[applied] = true
> * if it did, the result contains \[applied] = false, plus the current value 
> of column k.
> The way this was handled so far is that the PREPARED response contains no 
> result set metadata, and therefore all EXECUTE messages have SKIP_METADATA = 
> false, and the responses always include the full (and correct) metadata.
> CASSANDRA-10786 still sends the PREPARED response with no metadata, *but the 
> response to EXECUTE now contains a {{new_metadata_id}}*. The driver thinks it 
> is because of a schema change, and updates its local copy of the prepared 
> statement's result metadata.
> The next EXECUTE is sent with SKIP_METADATA = true, but the server appears to 
> ignore that, and still sends the metadata in the response. So each response 
> includes the correct metadata, the driver uses it, and there is no visible 
> issue for client code.
> The only drawback is that the driver updates its local copy of the metadata 
> unnecessarily, every time. We can work around that by only updating if we had 
> metadata before, at the cost of an extra volatile read. But I think the best 
> thing to do would be to never send a {{new_metadata_id}} in for a conditional 
> update.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13992) Don't send new_metadata_id for conditional updates

2017-11-13 Thread Alex Petrov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250071#comment-16250071
 ] 

Alex Petrov commented on CASSANDRA-13992:
-

[~omichallat] not sure, since {{METADATA_CHANGED}} is just a flag: i.e. if it's 
set it's {{true}}, otherwise it's {{false}}. Moreover, I think that the default 
behaviour for LWTs has to be that we _always_ update metadata: there's no way 
for the server to know what the last metadata on the client was (since it depends 
on the result), and the server can't distinguish metadata hash inequality 
caused by {{ALTER}} from inequality caused by the success or failure of an LWT.

Unless I'm missing something, my patch achieves exactly that (also, without any 
driver changes): it forces the server to _always_ send the metadata. This, 
combined with a metadata id consisting of zeroes, tells the client that 
caching metadata is possible but won't gain anything: new result metadata 
will just be re-delivered on every call, since it potentially changes 
on every request.
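
A minimal server-side sketch of that behaviour (hypothetical names; the 
all-zero sentinel and its 16-byte size are assumptions, not the patch itself):

{code:java}
import java.util.Arrays;

public class LwtMetadataSketch
{
    // All-zero id: "metadata may differ on every execution, don't cache it".
    // The 16-byte size matches an MD5-style digest and is an assumption here.
    static final byte[] UNSET_METADATA_ID = new byte[16];

    static byte[] metadataIdFor(boolean conditionalUpdate, byte[] computedId)
    {
        // Conditional updates always resend full result metadata, so the id
        // they carry is the zeroed sentinel rather than a real hash.
        return conditionalUpdate ? UNSET_METADATA_ID : computedId;
    }

    public static void main(String[] args)
    {
        System.out.println(Arrays.toString(metadataIdFor(true, new byte[]{1, 2})));
    }
}
{code}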

> Don't send new_metadata_id for conditional updates
> --
>
> Key: CASSANDRA-13992
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13992
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Olivier Michallat
>Assignee: Kurt Greaves
>Priority: Minor
>
> This is a follow-up to CASSANDRA-10786.
> Given the table
> {code}
> CREATE TABLE foo (k int PRIMARY KEY)
> {code}
> And the prepared statement
> {code}
> INSERT INTO foo (k) VALUES (?) IF NOT EXISTS
> {code}
> The result set metadata changes depending on the outcome of the update:
> * if the row didn't exist, there is only a single column \[applied] = true
> * if it did, the result contains \[applied] = false, plus the current value 
> of column k.
> The way this was handled so far is that the PREPARED response contains no 
> result set metadata, and therefore all EXECUTE messages have SKIP_METADATA = 
> false, and the responses always include the full (and correct) metadata.
> CASSANDRA-10786 still sends the PREPARED response with no metadata, *but the 
> response to EXECUTE now contains a {{new_metadata_id}}*. The driver thinks it 
> is because of a schema change, and updates its local copy of the prepared 
> statement's result metadata.
> The next EXECUTE is sent with SKIP_METADATA = true, but the server appears to 
> ignore that, and still sends the metadata in the response. So each response 
> includes the correct metadata, the driver uses it, and there is no visible 
> issue for client code.
> The only drawback is that the driver updates its local copy of the metadata 
> unnecessarily, every time. We can work around that by only updating if we had 
> metadata before, at the cost of an extra volatile read. But I think the best 
> thing to do would be to never send a {{new_metadata_id}} in for a conditional 
> update.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13983) Support a means of logging all queries as they were invoked

2017-11-13 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250045#comment-16250045
 ] 

Ariel Weisberg commented on CASSANDRA-13983:


bq. It doesn't look like we use special Weigher implementations anywhere. I 
think this could be slightly simplified if we made the type param ?

There is one. It's the natural weigher :-) I think we don't do enough to bound 
resources by weight in general. Having a piece of library code ready to go 
lowers the barrier to using it. The idiom of allowing a pluggable weigher for 
legacy items or classes you can't modify is pretty common in this kind of 
library code (Comparable and sorting and navigable maps, Guava Cache's weigher).

I get that if you want to cut to the bone then yes, technically this could be 
done without it, but it's already written and unit tested. I'd like to 
keep it, but you're right that it's not core to what this ticket is trying to do.
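
As an illustration of the idiom (names are illustrative, not the patch's actual 
classes), a weighted queue with a pluggable weigher plus a "natural" default 
for items that can weigh themselves might look like:

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

public class WeightedQueueSketch<T>
{
    public interface Weigher<T> { long weigh(T item); }
    public interface Weighable { long weight(); }

    private final Queue<T> queue = new ArrayDeque<>();
    private final Weigher<T> weigher;
    private final long maxWeight;
    private long currentWeight;

    public WeightedQueueSketch(long maxWeight, Weigher<T> weigher)
    {
        this.maxWeight = maxWeight;
        this.weigher = weigher;
    }

    /** Natural weigher for elements that know their own weight. */
    public static <T extends Weighable> WeightedQueueSketch<T> natural(long maxWeight)
    {
        return new WeightedQueueSketch<>(maxWeight, Weighable::weight);
    }

    /** Returns false when full: the caller may block or drop the sample. */
    public synchronized boolean offer(T item)
    {
        long w = weigher.weigh(item);
        if (currentWeight + w > maxWeight)
            return false;
        currentWeight += w;
        return queue.add(item);
    }

    public synchronized T poll()
    {
        T item = queue.poll();
        if (item != null)
            currentWeight -= weigher.weigh(item);
        return item;
    }
}
{code}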

> Support a means of logging all queries as they were invoked
> ---
>
> Key: CASSANDRA-13983
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13983
> Project: Cassandra
>  Issue Type: New Feature
>  Components: CQL, Observability, Testing, Tools
>Reporter: Ariel Weisberg
>Assignee: Ariel Weisberg
> Fix For: 4.0
>
>
> For correctness testing it's useful to be able to capture production traffic 
> so that it can be replayed against both the old and new versions of Cassandra 
> while comparing the results.
> Implementing this functionality once inside the database is high performance 
> and presents less operational complexity.
> In [this patch|https://github.com/apache/cassandra/pull/169] there is an 
> implementation of a full query log that uses chronicle-queue (apache 
> licensed, the maven artifacts are labeled incorrectly in some cases, 
> dependencies are also apache licensed) to implement a rotating log of queries.
> * Single thread asynchronously writes log entries to disk to reduce impact on 
> query latency
> * Heap memory usage bounded by a weighted queue with configurable maximum 
> weight sitting in front of logging thread
> * If the weighted queue is full producers can be blocked or samples can be 
> dropped
> * Disk utilization is bounded by deleting old log segments once a 
> configurable size is reached
> * The on disk serialization uses a flexible schema binary format 
> (chronicle-wire) making it easy to skip unrecognized fields, add new ones, 
> and omit old ones.
> * Can be enabled and configured via JMX, disabled, and reset (delete on disk 
> data), logging path is configurable via both JMX and YAML
> * Introduce new {{fqltool}} in /bin that currently implements {{Dump}}, which 
> can dump full query logs in a human-readable format as well as follow active 
> full query logs
> Follow up work:
> * Introduce new {{fqltool}} command Replay which can replay N full query logs 
> to two different clusters and compare the result and check for 
> inconsistencies. <- Actively working on getting this done
> * Log not just queries but their results to facilitate a comparison between 
> the original query result and the replayed result. <- Really just don't have a 
> specific use case at the moment
> * "Consistent" query logging allowing replay to fully replicate the original 
> order of execution and completion even in the face of races (including CAS). 
> <- This is more speculative



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14013) Data loss in snapshots keyspace after service restart

2017-11-13 Thread Gregor Uhlenheuer (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249947#comment-16249947
 ] 

Gregor Uhlenheuer commented on CASSANDRA-14013:
---

What additionally throws me off is that similar {{INSERTs}} into another 
keyspace with the exact same schema and settings do survive every restart 
without any issues.

> Data loss in snapshots keyspace after service restart
> -
>
> Key: CASSANDRA-14013
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14013
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Gregor Uhlenheuer
>
> I am posting this bug in the hope of discovering the stupid mistake I am making, 
> because I can't imagine a reasonable answer for the behavior I see right now 
> :-)
> In short, I observe data loss in a keyspace called *snapshots* after 
> restarting the Cassandra service. Say I have 1000 records in a table 
> called *snapshots.test_idx*; after a restart the table has fewer entries or 
> is even empty.
> My kind of "mysterious" observation is that it happens only in a keyspace 
> called *snapshots*...
> h3. Steps to reproduce
> These steps to reproduce show the described behavior in "most" attempts (not 
> every single time though).
> {code}
> # create keyspace
> CREATE KEYSPACE snapshots WITH replication = {'class': 'SimpleStrategy', 
> 'replication_factor': 1};
> # create table
> CREATE TABLE snapshots.test_idx (key text, seqno bigint, primary key(key));
> # insert some test data
> INSERT INTO snapshots.test_idx (key,seqno) values ('key1', 1);
> ...
> INSERT INTO snapshots.test_idx (key,seqno) values ('key1000', 1000);
> # count entries
> SELECT count(*) FROM snapshots.test_idx;
> 1000
> # restart service
> kill 
> cassandra -f
> # count entries
> SELECT count(*) FROM snapshots.test_idx;
> 0
> {code}
> I hope someone can point me to the obvious mistake I am making :-)
> This happened to me using both Cassandra 3.9 and 3.11.0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14013) Data loss in snapshots keyspace after service restart

2017-11-13 Thread Gregor Uhlenheuer (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249896#comment-16249896
 ] 

Gregor Uhlenheuer commented on CASSANDRA-14013:
---

It's the default (which is {{periodic}} if I recall correctly) since I tested 
with the vanilla configuration.

> Data loss in snapshots keyspace after service restart
> -
>
> Key: CASSANDRA-14013
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14013
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Gregor Uhlenheuer
>
> I am posting this bug in the hope of discovering the stupid mistake I am making, 
> because I can't imagine a reasonable answer for the behavior I see right now 
> :-)
> In short, I observe data loss in a keyspace called *snapshots* after 
> restarting the Cassandra service. Say I have 1000 records in a table 
> called *snapshots.test_idx*; after a restart the table has fewer entries or 
> is even empty.
> My kind of "mysterious" observation is that it happens only in a keyspace 
> called *snapshots*...
> h3. Steps to reproduce
> These steps to reproduce show the described behavior in "most" attempts (not 
> every single time though).
> {code}
> # create keyspace
> CREATE KEYSPACE snapshots WITH replication = {'class': 'SimpleStrategy', 
> 'replication_factor': 1};
> # create table
> CREATE TABLE snapshots.test_idx (key text, seqno bigint, primary key(key));
> # insert some test data
> INSERT INTO snapshots.test_idx (key,seqno) values ('key1', 1);
> ...
> INSERT INTO snapshots.test_idx (key,seqno) values ('key1000', 1000);
> # count entries
> SELECT count(*) FROM snapshots.test_idx;
> 1000
> # restart service
> kill 
> cassandra -f
> # count entries
> SELECT count(*) FROM snapshots.test_idx;
> 0
> {code}
> I hope someone can point me to the obvious mistake I am making :-)
> This happened to me using both Cassandra 3.9 and 3.11.0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14013) Data loss in snapshots keyspace after service restart

2017-11-13 Thread sankalp kohli (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249851#comment-16249851
 ] 

sankalp kohli commented on CASSANDRA-14013:
---

Which commit log mode are you using? 

> Data loss in snapshots keyspace after service restart
> -
>
> Key: CASSANDRA-14013
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14013
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Gregor Uhlenheuer
>
> I am posting this bug in the hope of discovering the stupid mistake I am making, 
> because I can't imagine a reasonable answer for the behavior I see right now 
> :-)
> In short, I observe data loss in a keyspace called *snapshots* after 
> restarting the Cassandra service. Say I have 1000 records in a table 
> called *snapshots.test_idx*; after a restart the table has fewer entries or 
> is even empty.
> My kind of "mysterious" observation is that it happens only in a keyspace 
> called *snapshots*...
> h3. Steps to reproduce
> These steps to reproduce show the described behavior in "most" attempts (not 
> every single time though).
> {code}
> # create keyspace
> CREATE KEYSPACE snapshots WITH replication = {'class': 'SimpleStrategy', 
> 'replication_factor': 1};
> # create table
> CREATE TABLE snapshots.test_idx (key text, seqno bigint, primary key(key));
> # insert some test data
> INSERT INTO snapshots.test_idx (key,seqno) values ('key1', 1);
> ...
> INSERT INTO snapshots.test_idx (key,seqno) values ('key1000', 1000);
> # count entries
> SELECT count(*) FROM snapshots.test_idx;
> 1000
> # restart service
> kill 
> cassandra -f
> # count entries
> SELECT count(*) FROM snapshots.test_idx;
> 0
> {code}
> I hope someone can point me to the obvious mistake I am making :-)
> This happened to me using both Cassandra 3.9 and 3.11.0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-13992) Don't send new_metadata_id for conditional updates

2017-11-13 Thread Olivier Michallat (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249838#comment-16249838
 ] 

Olivier Michallat edited comment on CASSANDRA-13992 at 11/13/17 5:18 PM:
-

[~ifesdjeen] that would work, the driver can treat an empty {{new_metadata_id}} 
as "don't update my local copy". Namely, changing [this 
line|https://github.com/datastax/java-driver/blob/6eeb8b2193ab5b50b73b0d9a533e775265f11007/driver-core/src/main/java/com/datastax/driver/core/ArrayBackedResultSet.java#L83]
 to:
{code}
if (newMetadataId != null && newMetadataId.bytes.length > 0) {
{code}
However that feels kind of hacky. Consider how we would have to explain that in 
the protocol spec:
{quote}
- <new_metadata_id> is \[short bytes] representing the new, changed resultset
   metadata. The new metadata ID must also be used in subsequent executions of
   the corresponding prepared statement, if any, *except if it is empty*.
{quote}
It would make so much more sense to force {{METADATA_CHANGED}} to *false* for 
conditional updates, isn't there any way we can do that?


was (Author: omichallat):
[~ifesdjeen] that would work, the driver can treat an empty {{new_metadata_id}} 
as "don't update my local copy". Namely, changing [this 
line|https://github.com/datastax/java-driver/blob/6eeb8b2193ab5b50b73b0d9a533e775265f11007/driver-core/src/main/java/com/datastax/driver/core/ArrayBackedResultSet.java#L83]
 to:
{code}
if (newMetadataId != null && newMetadataId.bytes.length > 0) {
{code}
However that feels kind of hacky. Consider how we would have to update the 
protocol spec to explain this:
{quote}
- <new_metadata_id> is \[short bytes] representing the new, changed resultset
   metadata. The new metadata ID must also be used in subsequent executions of
   the corresponding prepared statement, if any, *except if it is empty*.
{quote}
It would make so much more sense to force {{METADATA_CHANGED}} to *false* for 
conditional statements, isn't there any way we can do that?

> Don't send new_metadata_id for conditional updates
> --
>
> Key: CASSANDRA-13992
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13992
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Olivier Michallat
>Assignee: Kurt Greaves
>Priority: Minor
>
> This is a follow-up to CASSANDRA-10786.
> Given the table
> {code}
> CREATE TABLE foo (k int PRIMARY KEY)
> {code}
> And the prepared statement
> {code}
> INSERT INTO foo (k) VALUES (?) IF NOT EXISTS
> {code}
> The result set metadata changes depending on the outcome of the update:
> * if the row didn't exist, there is only a single column \[applied] = true
> * if it did, the result contains \[applied] = false, plus the current value 
> of column k.
> The way this was handled so far is that the PREPARED response contains no 
> result set metadata, and therefore all EXECUTE messages have SKIP_METADATA = 
> false, and the responses always include the full (and correct) metadata.
> CASSANDRA-10786 still sends the PREPARED response with no metadata, *but the 
> response to EXECUTE now contains a {{new_metadata_id}}*. The driver thinks it 
> is because of a schema change, and updates its local copy of the prepared 
> statement's result metadata.
> The next EXECUTE is sent with SKIP_METADATA = true, but the server appears to 
> ignore that, and still sends the metadata in the response. So each response 
> includes the correct metadata, the driver uses it, and there is no visible 
> issue for client code.
> The only drawback is that the driver updates its local copy of the metadata 
> unnecessarily, every time. We can work around that by only updating if we had 
> metadata before, at the cost of an extra volatile read. But I think the best 
> thing to do would be to never send a {{new_metadata_id}} in for a conditional 
> update.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13992) Don't send new_metadata_id for conditional updates

2017-11-13 Thread Olivier Michallat (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249838#comment-16249838
 ] 

Olivier Michallat commented on CASSANDRA-13992:
---

[~ifesdjeen] that would work, the driver can treat an empty {{new_metadata_id}} 
as "don't update my local copy". Namely, changing [this 
line|https://github.com/datastax/java-driver/blob/6eeb8b2193ab5b50b73b0d9a533e775265f11007/driver-core/src/main/java/com/datastax/driver/core/ArrayBackedResultSet.java#L83]
 to:
{code}
if (newMetadataId != null && newMetadataId.bytes.length > 0) {
{code}
However that feels kind of hacky. Consider how we would have to update the 
protocol spec to explain this:
{quote}
- <new_metadata_id> is \[short bytes] representing the new, changed resultset
   metadata. The new metadata ID must also be used in subsequent executions of
   the corresponding prepared statement, if any, *except if it is empty*.
{quote}
It would make so much more sense to force {{METADATA_CHANGED}} to *false* for 
conditional statements, isn't there any way we can do that?

> Don't send new_metadata_id for conditional updates
> --
>
> Key: CASSANDRA-13992
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13992
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Olivier Michallat
>Assignee: Kurt Greaves
>Priority: Minor
>
> This is a follow-up to CASSANDRA-10786.
> Given the table
> {code}
> CREATE TABLE foo (k int PRIMARY KEY)
> {code}
> And the prepared statement
> {code}
> INSERT INTO foo (k) VALUES (?) IF NOT EXISTS
> {code}
> The result set metadata changes depending on the outcome of the update:
> * if the row didn't exist, there is only a single column \[applied] = true
> * if it did, the result contains \[applied] = false, plus the current value 
> of column k.
> The way this was handled so far is that the PREPARED response contains no 
> result set metadata, and therefore all EXECUTE messages have SKIP_METADATA = 
> false, and the responses always include the full (and correct) metadata.
> CASSANDRA-10786 still sends the PREPARED response with no metadata, *but the 
> response to EXECUTE now contains a {{new_metadata_id}}*. The driver thinks it 
> is because of a schema change, and updates its local copy of the prepared 
> statement's result metadata.
> The next EXECUTE is sent with SKIP_METADATA = true, but the server appears to 
> ignore that, and still sends the metadata in the response. So each response 
> includes the correct metadata, the driver uses it, and there is no visible 
> issue for client code.
> The only drawback is that the driver updates its local copy of the metadata 
> unnecessarily, every time. We can work around that by only updating if we had 
> metadata before, at the cost of an extra volatile read. But I think the best 
> thing to do would be to never send a {{new_metadata_id}} in for a conditional 
> update.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-14013) Data loss in snapshots keyspace after service restart

2017-11-13 Thread Gregor Uhlenheuer (JIRA)
Gregor Uhlenheuer created CASSANDRA-14013:
-

 Summary: Data loss in snapshots keyspace after service restart
 Key: CASSANDRA-14013
 URL: https://issues.apache.org/jira/browse/CASSANDRA-14013
 Project: Cassandra
  Issue Type: Bug
Reporter: Gregor Uhlenheuer


I am posting this bug in the hope of discovering the stupid mistake I am making, 
because I can't imagine a reasonable answer for the behavior I see right now :-)

In short, I observe data loss in a keyspace called *snapshots* after 
restarting the Cassandra service. Say I have 1000 records in a table called 
*snapshots.test_idx*; after a restart the table has fewer entries or is even 
empty.

My kind of "mysterious" observation is that it happens only in a keyspace 
called *snapshots*...

h3. Steps to reproduce

These steps to reproduce show the described behavior in "most" attempts (not 
every single time though).

{code}
# create keyspace
CREATE KEYSPACE snapshots WITH replication = {'class': 'SimpleStrategy', 
'replication_factor': 1};

# create table
CREATE TABLE snapshots.test_idx (key text, seqno bigint, primary key(key));

# insert some test data
INSERT INTO snapshots.test_idx (key,seqno) values ('key1', 1);
...
INSERT INTO snapshots.test_idx (key,seqno) values ('key1000', 1000);

# count entries
SELECT count(*) FROM snapshots.test_idx;
1000

# restart service
kill 
cassandra -f

# count entries
SELECT count(*) FROM snapshots.test_idx;
0
{code}

I hope someone can point me to the obvious mistake I am making :-)

This happened to me using both Cassandra 3.9 and 3.11.0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14007) cqlshlib tests fail due to compact table

2017-11-13 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249778#comment-16249778
 ] 

Joel Knighton commented on CASSANDRA-14007:
---

Yeah, the cqlshlib tests have their own script to run and don't run as part of 
dtests. See 
[https://github.com/apache/cassandra-builds/blob/f0e63d66269f9086c3a0393a24a55577d21b4454/build-scripts/cassandra-cqlsh-tests.sh]
 for an example of how to run them.

> cqlshlib tests fail due to compact table
> 
>
> Key: CASSANDRA-14007
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14007
> Project: Cassandra
>  Issue Type: Bug
>  Components: Testing
>Reporter: Joel Knighton
>Assignee: Alex Petrov
>
> The pylib/cqlshlib tests fail on initialization with the error 
> {{SyntaxException: <Error from server: code=2000 \[Syntax error in CQL query\] 
> message="Compact tables are not allowed in Cassandra starting with 4.0 version.">}}. 
> The table {{dynamic_columns}} is created {{WITH COMPACT STORAGE}}. Since 
> [CASSANDRA-10857], this is no longer supported. It looks like dropping the 
> COMPACT STORAGE modifier is enough for the tests to run, but I haven't looked 
> if we should instead remove the table and all related tests entirely, or if 
> there's an interesting code path covered by this that we should test in a 
> different way now. [~ifesdjeen] might know at a glance.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-14012) Document gossip protocol

2017-11-13 Thread Jörn Heissler (JIRA)
Jörn Heissler created CASSANDRA-14012:
-

 Summary: Document gossip protocol
 Key: CASSANDRA-14012
 URL: https://issues.apache.org/jira/browse/CASSANDRA-14012
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jörn Heissler
Priority: Minor


I had an issue today with two nodes communicating with each other; there's a 
flaw in my configuration (wrong broadcast address).

I saw a little bit of traffic on port 7000, but I couldn't understand it for 
lack of documentation.
With documentation I would have understood my issue very quickly (7f 00 01 01 
is a bad broadcast address!). But I didn't recognize those 4 bytes as the 
broadcast address.
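
For what it's worth, decoding those four bytes as an IPv4 address makes the 
misconfiguration obvious; a quick sketch (nothing Cassandra-specific):

{code:java}
import java.net.InetAddress;

public class DecodeBroadcast
{
    public static void main(String[] args) throws Exception
    {
        byte[] bytes = {0x7f, 0x00, 0x01, 0x01};
        System.out.println(InetAddress.getByAddress(bytes).getHostAddress());
        // prints 127.0.1.1: a loopback address, not a usable broadcast_address
    }
}
{code}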

Could you please document the gossip protocol?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-14011) Multi threaded L0 -> L1 compaction

2017-11-13 Thread Marcus Eriksson (JIRA)
Marcus Eriksson created CASSANDRA-14011:
---

 Summary: Multi threaded L0 -> L1 compaction
 Key: CASSANDRA-14011
 URL: https://issues.apache.org/jira/browse/CASSANDRA-14011
 Project: Cassandra
  Issue Type: Improvement
Reporter: Marcus Eriksson
 Fix For: 4.x


Currently L0 -> L1 compactions are almost always single threaded because every 
L0 sstable will overlap with all L1 sstables. To improve this, we should 
range-split the input sstables into a configurable number of parts and then use 
multiple threads to write out the results. This is similar to the 
{{max_subcompactions}} option in RocksDB: 
https://github.com/facebook/rocksdb/wiki/Leveled-Compaction
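
A minimal sketch of the range-split idea (illustrative types only, not 
Cassandra's compaction API): split the token span covered by the input 
sstables into N contiguous parts and hand each part to its own compaction task.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SubcompactionSketch
{
    static final class TokenRange
    {
        final long left, right;
        TokenRange(long left, long right) { this.left = left; this.right = right; }
        public String toString() { return "(" + left + ", " + right + "]"; }
    }

    // Split the overall token span into `parts` contiguous sub-ranges.
    static List<TokenRange> split(TokenRange full, int parts)
    {
        List<TokenRange> result = new ArrayList<>(parts);
        long width = (full.right - full.left) / parts;
        long start = full.left;
        for (int i = 0; i < parts; i++)
        {
            long end = (i == parts - 1) ? full.right : start + width;
            result.add(new TokenRange(start, end));
            start = end;
        }
        return result;
    }

    public static void main(String[] args)
    {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (TokenRange r : split(new TokenRange(-1000, 1000), 4))
            pool.submit(() -> System.out.println("compact L0 -> L1 for " + r));
        pool.shutdown();
    }
}
{code}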



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability

2017-11-13 Thread Sam Tunnicliffe (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Tunnicliffe updated CASSANDRA-13987:

Reviewer: Sam Tunnicliffe  (was: Blake Eggleston)

> Multithreaded commitlog subtly changed durability
> -
>
> Key: CASSANDRA-13987
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13987
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jason Brown
>Assignee: Jason Brown
> Fix For: 4.x
>
>
> When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly 
> changed the way that commitlog durability worked. Everything still gets 
> written to an mmap file. However, not everything is replayable from the 
> mmaped file after a process crash, in periodic mode.
> In brief, the reason this changesd is due to the chained markers that are 
> required for the multithreaded commit log. At each msync, we wait for 
> outstanding mutations to serialize into the commitlog, and update a marker 
> before and after the commits that have accumluated since the last sync. With 
> those markers, we can safely replay that section of the commitlog. Without 
> the markers, we have no guarantee that the commits in that section were 
> successfully written, thus we abandon those commits on replay.
> If you have correlated process failures of multiple nodes at "nearly" the 
> same time (see ["There Is No 
> Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have 
> data loss if none of the nodes msync the commitlog. For example, with RF=3, 
> if quorum write succeeds on two nodes (and we acknowledge the write back to 
> the client), and then the process on both nodes OOMs (say, due to reading the 
> index for a 100GB partition), the write will be lost if neither process 
> msync'ed the commitlog. More exactly, the commitlog cannot be fully replayed. 
> The reason why this data is silently lost is due to the chained markers that 
> were introduced with CASSANDRA-3578.
> The problem we are addressing with this ticket is incrementally improving 
> 'durability' due to process crash, not host crash. (Note: operators should 
> use batch mode to ensure greater durability, but batch mode in it's current 
> implementation is a) borked, and b) will burn through, *very* rapidly, SSDs 
> that don't have a non-volatile write cache sitting in front.) 
> The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which 
> means that a node could lose up to ten seconds of data due to process crash. 
> The unfortunate thing is that the data is still avaialble, in the mmap file, 
> but we can't replay it due to incomplete chained markers.
> ftr, I don't believe we've ever had a stated policy about commitlog 
> durability wrt process crash. Pre-2.0 we naturally piggy-backed off the 
> memory mapped file and the fact that every mutation was acquired a lock and 
> wrote into the mmap buffer, and the ability to replay everything out of it 
> came for free. With CASSANDRA-3578, that was subtly changed. 
> Something [~jjirsa] pointed out to me is that [MySQL provides a way to adjust 
> the durability 
> guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit]
>  of each commit in innodb via the {{innodb_flush_log_at_trx_commit}}. I'm 
> using that idea as a loose springboard for what to do here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-14010) NullPointerException when creating keyspace

2017-11-13 Thread Jonathan Pellby (JIRA)
Jonathan Pellby created CASSANDRA-14010:
---

 Summary: NullPointerException when creating keyspace
 Key: CASSANDRA-14010
 URL: https://issues.apache.org/jira/browse/CASSANDRA-14010
 Project: Cassandra
  Issue Type: Bug
Reporter: Jonathan Pellby


We have a test environment where we drop and create keyspaces and tables several 
times within a short time frame. Since upgrading from 3.11.0 to 3.11.1, we are 
seeing a lot of create statements failing. See the logs below:
{code:java}
2017-11-13T14:29:20.037986449Z WARN Directory /tmp/ramdisk/commitlog doesn't 
exist
2017-11-13T14:29:20.038009590Z WARN Directory /tmp/ramdisk/saved_caches doesn't 
exist
2017-11-13T14:29:20.094337265Z INFO Initialized prepared statement caches with 
10 MB (native) and 10 MB (Thrift)
2017-11-13T14:29:20.805946340Z INFO Initializing system.IndexInfo
2017-11-13T14:29:21.934686905Z INFO Initializing system.batches
2017-11-13T14:29:21.973914733Z INFO Initializing system.paxos
2017-11-13T14:29:21.994550268Z INFO Initializing system.local
2017-11-13T14:29:22.014097194Z INFO Initializing system.peers
2017-11-13T14:29:22.124211254Z INFO Initializing system.peer_events
2017-11-13T14:29:22.153966833Z INFO Initializing system.range_xfers
2017-11-13T14:29:22.174097334Z INFO Initializing system.compaction_history
2017-11-13T14:29:22.194259920Z INFO Initializing system.sstable_activity
2017-11-13T14:29:22.210178271Z INFO Initializing system.size_estimates
2017-11-13T14:29:22.223836992Z INFO Initializing system.available_ranges
2017-11-13T14:29:22.237854207Z INFO Initializing system.transferred_ranges
2017-11-13T14:29:22.253995621Z INFO Initializing system.views_builds_in_progress
2017-11-13T14:29:22.264052481Z INFO Initializing system.built_views
2017-11-13T14:29:22.283334779Z INFO Initializing system.hints
2017-11-13T14:29:22.304110311Z INFO Initializing system.batchlog
2017-11-13T14:29:22.318031950Z INFO Initializing system.prepared_statements
2017-11-13T14:29:22.326547917Z INFO Initializing system.schema_keyspaces
2017-11-13T14:29:22.337097407Z INFO Initializing system.schema_columnfamilies
2017-11-13T14:29:22.354082675Z INFO Initializing system.schema_columns
2017-11-13T14:29:22.384179063Z INFO Initializing system.schema_triggers
2017-11-13T14:29:22.394222027Z INFO Initializing system.schema_usertypes
2017-11-13T14:29:22.414199833Z INFO Initializing system.schema_functions
2017-11-13T14:29:22.427205182Z INFO Initializing system.schema_aggregates
2017-11-13T14:29:22.427228345Z INFO Not submitting build tasks for views in 
keyspace system as storage service is not initialized
2017-11-13T14:29:22.652838866Z INFO Scheduling approximate time-check task with 
a precision of 10 milliseconds
2017-11-13T14:29:22.732862906Z INFO Initializing system_schema.keyspaces
2017-11-13T14:29:22.746598744Z INFO Initializing system_schema.tables
2017-11-13T14:29:22.759649011Z INFO Initializing system_schema.columns
2017-11-13T14:29:22.766245435Z INFO Initializing system_schema.triggers
2017-11-13T14:29:22.778716809Z INFO Initializing system_schema.dropped_columns
2017-11-13T14:29:22.791369819Z INFO Initializing system_schema.views
2017-11-13T14:29:22.839141724Z INFO Initializing system_schema.types
2017-11-13T14:29:22.852911976Z INFO Initializing system_schema.functions
2017-11-13T14:29:22.852938112Z INFO Initializing system_schema.aggregates
2017-11-13T14:29:22.869348526Z INFO Initializing system_schema.indexes
2017-11-13T14:29:22.874178682Z INFO Not submitting build tasks for views in 
keyspace system_schema as storage service is not initialized
2017-11-13T14:29:23.700250435Z INFO Initializing key cache with capacity of 25 
MBs.
2017-11-13T14:29:23.724357053Z INFO Initializing row cache with capacity of 0 
MBs
2017-11-13T14:29:23.724383599Z INFO Initializing counter cache with capacity of 
12 MBs
2017-11-13T14:29:23.724386906Z INFO Scheduling counter cache save to every 7200 
seconds (going to save all keys).
2017-11-13T14:29:23.984408710Z INFO Populating token metadata from system tables
2017-11-13T14:29:24.032687075Z INFO Global buffer pool is enabled, when pool is 
exhausted (max is 125.000MiB) it will allocate on heap
2017-11-13T14:29:24.214123695Z INFO Token metadata:
2017-11-13T14:29:24.304218769Z INFO Completed loading (14 ms; 8 keys) KeyCache 
cache
2017-11-13T14:29:24.363978406Z INFO No commitlog files found; skipping replay
2017-11-13T14:29:24.364005238Z INFO Populating token metadata from system tables
2017-11-13T14:29:24.394408476Z INFO Token metadata:
2017-11-13T14:29:24.709411652Z INFO Preloaded 0 prepared statements
2017-11-13T14:29:24.719332880Z INFO Cassandra version: 3.11.1
2017-11-13T14:29:24.719355969Z INFO Thrift API version: 20.1.0
2017-11-13T14:29:24.719359443Z INFO CQL supported versions: 3.4.4 (default: 
3.4.4)
2017-11-13T14:29:24.719362103Z INFO Native protocol supported versions: 3/v3, 
4/v4, 5/v5-beta (default: 4/v4)

[jira] [Deleted] (CASSANDRA-14009) _to_be_deleted

2017-11-13 Thread Jeff Jirsa (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Jirsa deleted CASSANDRA-14009:
---


> _to_be_deleted
> --
>
> Key: CASSANDRA-14009
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14009
> Project: Cassandra
>  Issue Type: Test
>Reporter: Andrzej Bober
>Priority: Trivial
>
> __deleted__



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-14009) _to_be_deleted

2017-11-13 Thread Andrzej Bober (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bober updated CASSANDRA-14009:
--
   Priority: Trivial  (was: Major)
Component/s: (was: Auth)
 Issue Type: Test  (was: Bug)

> _to_be_deleted
> --
>
> Key: CASSANDRA-14009
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14009
> Project: Cassandra
>  Issue Type: Test
>Reporter: Andrzej Bober
>Priority: Trivial
>
> __deleted__



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-14009) _to_be_deleted

2017-11-13 Thread Andrzej Bober (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bober updated CASSANDRA-14009:
--
Labels:   (was: security)

> _to_be_deleted
> --
>
> Key: CASSANDRA-14009
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14009
> Project: Cassandra
>  Issue Type: Test
>Reporter: Andrzej Bober
>Priority: Trivial
>
> __deleted__



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-14009) _to_be_deleted

2017-11-13 Thread Andrzej Bober (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bober updated CASSANDRA-14009:
--
Summary: _to_be_deleted  (was: Any user can overwrite any table with 
sstableloader)

> _to_be_deleted
> --
>
> Key: CASSANDRA-14009
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14009
> Project: Cassandra
>  Issue Type: Bug
>  Components: Auth
>Reporter: Andrzej Bober
>  Labels: security
>
> __deleted__



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Resolved] (CASSANDRA-14009) Any user can overwrite any table with sstableloader

2017-11-13 Thread Andrzej Bober (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bober resolved CASSANDRA-14009.
---
   Resolution: Incomplete
Fix Version/s: (was: 3.11.x)
   (was: 3.0.x)
   (was: 2.2.x)
   (was: 2.1.x)

> Any user can overwrite any table with sstableloader
> ---
>
> Key: CASSANDRA-14009
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14009
> Project: Cassandra
>  Issue Type: Bug
>  Components: Auth
>Reporter: Andrzej Bober
>  Labels: security
>
> __deleted__



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-14009) Any user can overwrite any table with sstableloader

2017-11-13 Thread Andrzej Bober (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bober updated CASSANDRA-14009:
--
Description: __deleted__  (was: Hi there,

Looks like any user can overwrite any table with sstableloader.
Tested ubuntu 16.04.3, Java 1.8.0_151_b12, and Cassandra 2.1.19 / 2.2.11 / 
3.0.15 / 3.11.1.

{code:sql}
cassandra@cqlsh> CREATE USER alice WITH PASSWORD 'Alice';
cassandra@cqlsh> CREATE USER bob WITH PASSWORD 'Bob';

cassandra@cqlsh>  CREATE KEYSPACE db4alice WITH replication = {'class': 
'SimpleStrategy', 'replication_factor': 1};
cassandra@cqlsh>  GRANT ALL PERMISSIONS ON KEYSPACE db4alice TO alice;

alice@cqlsh> CREATE TABLE users (userid text PRIMARY KEY, password text);

alice@cqlsh> INSERT INTO users (userid, password) VALUES ('user1', 'pass1');
alice@cqlsh> INSERT INTO users (userid, password) VALUES ('user2', 'pass2');
alice@cqlsh> INSERT INTO users (userid, password) VALUES ('user3', 'pass3');

alice@cqlsh> truncate users;

alice@cqlsh> select * from db4alice.users ;
 userid | password
+--
(0 rows)

sstableloader -d 127.0.0.1 -u bob -pw Bob ./db4alice/users

alice@cqlsh> select * from db4alice.users ;

 userid | password
+--
  user2 |pass2
  user1 |pass1
  user3 |pass3

(3 rows)
{code}

Looks like a pretty serious bug to me.)

> Any user can overwrite any table with sstableloader
> ---
>
> Key: CASSANDRA-14009
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14009
> Project: Cassandra
>  Issue Type: Bug
>  Components: Auth
>Reporter: Andrzej Bober
>  Labels: security
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.11.x
>
>
> __deleted__



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-9988) Introduce leaf-only iterator

2017-11-13 Thread Jason Brown (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Brown updated CASSANDRA-9988:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

committed as sha {{0eab80bf389114be8d6f7627f72249bbc3c02e64}}

Thanks!

> Introduce leaf-only iterator
> 
>
> Key: CASSANDRA-9988
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9988
> Project: Cassandra
>  Issue Type: Sub-task
>Reporter: Benedict
>Assignee: Jay Zhuang
>Priority: Minor
>  Labels: patch
> Fix For: 4.0
>
> Attachments: 9988-3tests.png, 9988-data.png, 9988-result.png, 
> 9988-result2.png, 9988-result3.png, 9988-test-result-expsearch.xlsx, 
> 9988-test-result-raw.png, 9988-test-result.xlsx, 9988-test-result3.png, 
> 9988-trunk-new-update.txt, 9988-trunk-new.txt, trunk-9988.txt
>
>
> In many cases we have small btrees, small enough to fit in a single leaf 
> page. In this case it _may_ be more efficient to specialise our iterator.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



cassandra git commit: Introduce Leaf-only BTree Iterator

2017-11-13 Thread jasobrown
Repository: cassandra
Updated Branches:
  refs/heads/trunk 07258a96b -> 0eab80bf3


Introduce Leaf-only BTree Iterator

patch by Piotr Jastrzebski, Jay Zhuang; reviewed by jasobrown for CASSANDRA-9988


Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo
Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/0eab80bf
Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/0eab80bf
Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/0eab80bf

Branch: refs/heads/trunk
Commit: 0eab80bf389114be8d6f7627f72249bbc3c02e64
Parents: 07258a9
Author: Jay Zhuang 
Authored: Sun Jan 15 16:52:58 2017 -0800
Committer: Jason Brown 
Committed: Mon Nov 13 05:43:22 2017 -0800

--
 CHANGES.txt |   1 +
 .../db/partitions/AbstractBTreePartition.java   |   3 +-
 .../org/apache/cassandra/utils/btree/BTree.java |  26 +-
 .../utils/btree/BTreeSearchIterator.java| 137 +--
 .../apache/cassandra/utils/btree/BTreeSet.java  |   3 +-
 .../utils/btree/FullBTreeSearchIterator.java| 159 
 .../utils/btree/LeafBTreeSearchIterator.java| 113 +
 .../microbench/BTreeSearchIteratorBench.java| 143 +++
 .../utils/btree/BTreeSearchIteratorTest.java| 241 +++
 9 files changed, 683 insertions(+), 143 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/cassandra/blob/0eab80bf/CHANGES.txt
--
diff --git a/CHANGES.txt b/CHANGES.txt
index f5951d6..494901c 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,4 +1,5 @@
 4.0
+ * Introduce leaf-only iterator (CASSANDRA-9988)
  * Upgrade Guava to 23.3 and Airline to 0.8 (CASSANDRA-13997)
  * Allow only one concurrent call to StatusLogger (CASSANDRA-12182)
  * Refactoring to specialised functional interfaces (CASSANDRA-13982)

http://git-wip-us.apache.org/repos/asf/cassandra/blob/0eab80bf/src/java/org/apache/cassandra/db/partitions/AbstractBTreePartition.java
--
diff --git 
a/src/java/org/apache/cassandra/db/partitions/AbstractBTreePartition.java 
b/src/java/org/apache/cassandra/db/partitions/AbstractBTreePartition.java
index d913cb3..6dbaff5 100644
--- a/src/java/org/apache/cassandra/db/partitions/AbstractBTreePartition.java
+++ b/src/java/org/apache/cassandra/db/partitions/AbstractBTreePartition.java
@@ -26,7 +26,6 @@ import org.apache.cassandra.db.filter.ColumnFilter;
 import org.apache.cassandra.db.rows.*;
 import org.apache.cassandra.utils.SearchIterator;
 import org.apache.cassandra.utils.btree.BTree;
-import org.apache.cassandra.utils.btree.BTreeSearchIterator;
 
 import static org.apache.cassandra.utils.btree.BTree.Dir.desc;
 
@@ -131,7 +130,7 @@ public abstract class AbstractBTreePartition implements 
Partition, Iterable<Row>
 final Holder current = holder();
 return new SearchIterator<Clustering, Row>()
 {
-private final SearchIterator<Clustering, Row> rawIter = new 
BTreeSearchIterator<>(current.tree, metadata().comparator, desc(reversed));
+private final SearchIterator<Clustering, Row> rawIter = 
BTree.slice(current.tree, metadata().comparator, desc(reversed));
 private final DeletionTime partitionDeletion = 
current.deletionInfo.getPartitionDeletion();
 
 public Row next(Clustering clustering)

http://git-wip-us.apache.org/repos/asf/cassandra/blob/0eab80bf/src/java/org/apache/cassandra/utils/btree/BTree.java
--
diff --git a/src/java/org/apache/cassandra/utils/btree/BTree.java 
b/src/java/org/apache/cassandra/utils/btree/BTree.java
index a4519b9..9ed7534 100644
--- a/src/java/org/apache/cassandra/utils/btree/BTree.java
+++ b/src/java/org/apache/cassandra/utils/btree/BTree.java
@@ -201,12 +201,14 @@ public class BTree
 
 public static <V> Iterator<V> iterator(Object[] btree, Dir dir)
 {
-return new BTreeSearchIterator<>(btree, null, dir);
+return isLeaf(btree) ? new LeafBTreeSearchIterator<>(btree, null, dir)
+ : new FullBTreeSearchIterator<>(btree, null, dir);
 }
 
 public static <V> Iterator<V> iterator(Object[] btree, int lb, int ub, Dir 
dir)
 {
-return new BTreeSearchIterator<>(btree, null, dir, lb, ub);
+return isLeaf(btree) ? new LeafBTreeSearchIterator<>(btree, null, dir, 
lb, ub)
+ : new FullBTreeSearchIterator<>(btree, null, dir, 
lb, ub);
 }
 
 public static <V> Iterable<V> iterable(Object[] btree)
@@ -234,7 +236,8 @@ public class BTree
  */
 public static <K, V extends K> BTreeSearchIterator<K, V> slice(Object[] btree, 
Comparator<? super K> comparator, Dir dir)
 {
-return new BTreeSearchIterator<>(btree, 

[jira] [Updated] (CASSANDRA-13975) Add a workaround for overly large read repair mutations

2017-11-13 Thread Aleksey Yeschenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksey Yeschenko updated CASSANDRA-13975:
--
   Resolution: Fixed
Fix Version/s: (was: 3.11.x)
   (was: 3.0.x)
   3.11.2
   3.0.16
   Status: Resolved  (was: Ready to Commit)

> Add a workaround for overly large read repair mutations
> ---
>
> Key: CASSANDRA-13975
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13975
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: Aleksey Yeschenko
>Assignee: Aleksey Yeschenko
> Fix For: 3.0.16, 3.11.2
>
>
> It's currently possible for {{DataResolver}} to accumulate more changes to 
> read repair than would fit in a single serialized mutation. If that happens, 
> the node receiving the mutation would fail, the read would time out, and reads 
> won't be able to proceed until the operator runs repair or manually drops the 
> affected partitions.
> Ideally we should either read repair iteratively, or at least split the 
> resulting mutation into smaller chunks in the end. In the meantime, for 
> 3.0.x, I suggest we add logging to catch this, and a -D flag to allow 
> proceeding with the requests as is when the mutation is too large, without 
> read repair.
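
The guard is easiest to read in one piece, since the commit mails below are 
truncated mid-method. Everything up to the {{if}} matches the committed diff; 
the three branch bodies are a reconstruction, and the exact logging, metric 
and exception details may differ from the committed code:

{code:java}
private void sendRepairMutation(PartitionUpdate partition, InetAddress destination)
{
    Mutation mutation = new Mutation(partition);
    int messagingVersion = MessagingService.instance().getVersion(destination);

    int mutationSize = (int) Mutation.serializer.serializedSize(mutation, messagingVersion);
    int maxMutationSize = DatabaseDescriptor.getMaxMutationSize();

    if (mutationSize <= maxMutationSize)
    {
        // normal path: ship the repair mutation to the out-of-date replica
        repairResults.add(MessagingService.instance()
                                          .sendRR(mutation.createMessage(MessagingService.Verb.READ_REPAIR),
                                                  destination));
    }
    else if (DROP_OVERSIZED_READ_REPAIR_MUTATIONS)
    {
        // operator opted in via the -D flag: drop the oversized mutation so
        // the read completes, leaving the inconsistency to anti-entropy repair
        logger.debug("Skipped oversized read repair mutation ({} > {} bytes) to {}",
                     mutationSize, maxMutationSize, destination);
    }
    else
    {
        // default: fail the read visibly instead of letting the replica choke
        // on a mutation it can never apply
        logger.warn("Oversized read repair mutation ({} > {} bytes) to {}",
                    mutationSize, maxMutationSize, destination);
        int blockFor = consistency.blockFor(keyspace);
        throw new ReadTimeoutException(consistency, blockFor - 1, blockFor, true);
    }
}
{code}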



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13975) Add a workaround for overly large read repair mutations

2017-11-13 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249540#comment-16249540
 ] 

Aleksey Yeschenko commented on CASSANDRA-13975:
---

Thanks, committed as 
[f1e850a492126572efc636a6838cff90333806b9|https://github.com/apache/cassandra/commit/f1e850a492126572efc636a6838cff90333806b9]
 to 3.0 and merged up to 3.11 and trunk.

> Add a workaround for overly large read repair mutations
> ---
>
> Key: CASSANDRA-13975
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13975
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: Aleksey Yeschenko
>Assignee: Aleksey Yeschenko
> Fix For: 3.0.16, 3.11.2
>
>
> It's currently possible for {{DataResolver}} to accumulate more changes to 
> read repair than would fit in a single serialized mutation. If that happens, 
> the node receiving the mutation would fail, the read would time out, and reads 
> won't be able to proceed until the operator runs repair or manually drops the 
> affected partitions.
> Ideally we should either read repair iteratively, or at least split the 
> resulting mutation into smaller chunks in the end. In the meantime, for 
> 3.0.x, I suggest we add logging to catch this, and a -D flag to allow 
> proceeding with the requests as is when the mutation is too large, without 
> read repair.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[1/6] cassandra git commit: Add flag to allow dropping oversized read repair mutations

2017-11-13 Thread aleksey
Repository: cassandra
Updated Branches:
  refs/heads/cassandra-3.0 f767d35ae -> f1e850a49
  refs/heads/cassandra-3.11 387d3a4eb -> 9ee44db49
  refs/heads/trunk 7707b736c -> 07258a96b


Add flag to allow dropping oversized read repair mutations

patch by Aleksey Yeschenko; reviewed by Sam Tunnicliffe for
CASSANDRA-13975


Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo
Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/f1e850a4
Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/f1e850a4
Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/f1e850a4

Branch: refs/heads/cassandra-3.0
Commit: f1e850a492126572efc636a6838cff90333806b9
Parents: f767d35
Author: Aleksey Yeschenko 
Authored: Wed Oct 25 20:15:39 2017 +0100
Committer: Aleksey Yeschenko 
Committed: Mon Nov 13 13:10:28 2017 +

--
 CHANGES.txt |  2 +
 .../apache/cassandra/metrics/TableMetrics.java  |  2 +
 .../apache/cassandra/service/DataResolver.java  | 53 +---
 3 files changed, 49 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/cassandra/blob/f1e850a4/CHANGES.txt
--
diff --git a/CHANGES.txt b/CHANGES.txt
index e3026aa..a3c43fd 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,4 +1,5 @@
 3.0.16
+ * Add flag to allow dropping oversized read repair mutations (CASSANDRA-13975)
  * Fix SSTableLoader logger message (CASSANDRA-14003)
  * Fix repair race that caused gossip to block (CASSANDRA-13849)
  * Tracing interferes with digest requests when using RandomPartitioner 
(CASSANDRA-13964)
@@ -8,6 +9,7 @@
  * Mishandling of cells for removed/dropped columns when reading legacy files 
(CASSANDRA-13939)
  * Deserialise sstable metadata in nodetool verify (CASSANDRA-13922)
 
+
 3.0.15
  * Improve TRUNCATE performance (CASSANDRA-13909)
  * Implement short read protection on partition boundaries (CASSANDRA-13595)

http://git-wip-us.apache.org/repos/asf/cassandra/blob/f1e850a4/src/java/org/apache/cassandra/metrics/TableMetrics.java
--
diff --git a/src/java/org/apache/cassandra/metrics/TableMetrics.java 
b/src/java/org/apache/cassandra/metrics/TableMetrics.java
index fe88a63..eb56ed9 100644
--- a/src/java/org/apache/cassandra/metrics/TableMetrics.java
+++ b/src/java/org/apache/cassandra/metrics/TableMetrics.java
@@ -151,6 +151,7 @@ public class TableMetrics
 public final static LatencyMetrics globalWriteLatency = new 
LatencyMetrics(globalFactory, globalAliasFactory, "Write");
 public final static LatencyMetrics globalRangeLatency = new 
LatencyMetrics(globalFactory, globalAliasFactory, "Range");
 
+public final Meter readRepairRequests;
 public final Meter shortReadProtectionRequests;
 
 public final Map samplers;
@@ -648,6 +649,7 @@ public class TableMetrics
 casPropose = new LatencyMetrics(factory, "CasPropose", 
cfs.keyspace.metric.casPropose);
 casCommit = new LatencyMetrics(factory, "CasCommit", 
cfs.keyspace.metric.casCommit);
 
+readRepairRequests = 
Metrics.meter(factory.createMetricName("ReadRepairRequests"));
 shortReadProtectionRequests = 
Metrics.meter(factory.createMetricName("ShortReadProtectionRequests"));
 }
 

http://git-wip-us.apache.org/repos/asf/cassandra/blob/f1e850a4/src/java/org/apache/cassandra/service/DataResolver.java
--
diff --git a/src/java/org/apache/cassandra/service/DataResolver.java 
b/src/java/org/apache/cassandra/service/DataResolver.java
index 5fb34c6..f02b565 100644
--- a/src/java/org/apache/cassandra/service/DataResolver.java
+++ b/src/java/org/apache/cassandra/service/DataResolver.java
@@ -44,6 +44,9 @@ import org.apache.cassandra.utils.FBUtilities;
 
 public class DataResolver extends ResponseResolver
 {
+private static final boolean DROP_OVERSIZED_READ_REPAIR_MUTATIONS =
+Boolean.getBoolean("cassandra.drop_oversized_readrepair_mutations");
+
 @VisibleForTesting
 final List<AsyncOneResponse> repairResults = 
Collections.synchronizedList(new ArrayList<>());
 
@@ -452,15 +455,49 @@ public class DataResolver extends ResponseResolver
 public void close()
 {
 for (int i = 0; i < repairs.length; i++)
+if (null != repairs[i])
+sendRepairMutation(repairs[i], sources[i]);
+}
+
+private void sendRepairMutation(PartitionUpdate partition, 
InetAddress destination)
+{
+Mutation mutation = new Mutation(partition);
+int messagingVersion = 
MessagingService.instance().getVersion(destination);
+
+

[4/6] cassandra git commit: Merge branch 'cassandra-3.0' into cassandra-3.11

2017-11-13 Thread aleksey
Merge branch 'cassandra-3.0' into cassandra-3.11


Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo
Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/9ee44db4
Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/9ee44db4
Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/9ee44db4

Branch: refs/heads/trunk
Commit: 9ee44db49b13d4b4c91c9d6332ce06a6e2abf944
Parents: 387d3a4 f1e850a
Author: Aleksey Yeschenko 
Authored: Mon Nov 13 13:13:06 2017 +
Committer: Aleksey Yeschenko 
Committed: Mon Nov 13 13:13:06 2017 +

--
 CHANGES.txt |  1 +
 .../apache/cassandra/metrics/TableMetrics.java  |  2 +
 .../apache/cassandra/service/DataResolver.java  | 53 +---
 3 files changed, 48 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/cassandra/blob/9ee44db4/CHANGES.txt
--
diff --cc CHANGES.txt
index 6a78b60,a3c43fd..a1a1a37
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@@ -1,9 -1,5 +1,10 @@@
 -3.0.16
 +3.11.2
 + * Add asm jar to build.xml for maven builds (CASSANDRA-11193)
 + * Round buffer size to powers of 2 for the chunk cache (CASSANDRA-13897)
 + * Update jackson JSON jars (CASSANDRA-13949)
 + * Avoid locks when checking LCS fanout and if we should defrag 
(CASSANDRA-13930)
 +Merged from 3.0:
+  * Add flag to allow dropping oversized read repair mutations 
(CASSANDRA-13975)
   * Fix SSTableLoader logger message (CASSANDRA-14003)
   * Fix repair race that caused gossip to block (CASSANDRA-13849)
   * Tracing interferes with digest requests when using RandomPartitioner 
(CASSANDRA-13964)

http://git-wip-us.apache.org/repos/asf/cassandra/blob/9ee44db4/src/java/org/apache/cassandra/metrics/TableMetrics.java
--
diff --cc src/java/org/apache/cassandra/metrics/TableMetrics.java
index b0f667c,eb56ed9..e78bb66
--- a/src/java/org/apache/cassandra/metrics/TableMetrics.java
+++ b/src/java/org/apache/cassandra/metrics/TableMetrics.java
@@@ -167,40 -151,7 +167,41 @@@ public class TableMetric
  public final static LatencyMetrics globalWriteLatency = new 
LatencyMetrics(globalFactory, globalAliasFactory, "Write");
  public final static LatencyMetrics globalRangeLatency = new 
LatencyMetrics(globalFactory, globalAliasFactory, "Range");
  
 +public final static Gauge<Double> globalPercentRepaired = 
Metrics.register(globalFactory.createMetricName("PercentRepaired"),
 +new Gauge<Double>()
 +{
 +public Double getValue()
 +{
 +double repaired = 0;
 +double total = 0;
 +for (String keyspace : Schema.instance.getNonSystemKeyspaces())
 +{
 +Keyspace k = Schema.instance.getKeyspaceInstance(keyspace);
 +if 
(SchemaConstants.DISTRIBUTED_KEYSPACE_NAME.equals(k.getName()))
 +continue;
 +if (k.getReplicationStrategy().getReplicationFactor() < 2)
 +continue;
 +
 +for (ColumnFamilyStore cf : k.getColumnFamilyStores())
 +{
 +if (!SecondaryIndexManager.isIndexColumnFamily(cf.name))
 +{
 +for (SSTableReader sstable : 
cf.getSSTables(SSTableSet.CANONICAL))
 +{
 +if (sstable.isRepaired())
 +{
 +repaired += sstable.uncompressedLength();
 +}
 +total += sstable.uncompressedLength();
 +}
 +}
 +}
 +}
 +return total > 0 ? (repaired / total) * 100 : 100.0;
 +}
 +});
 +
+ public final Meter readRepairRequests;
  public final Meter shortReadProtectionRequests;
  
  public final Map samplers;

http://git-wip-us.apache.org/repos/asf/cassandra/blob/9ee44db4/src/java/org/apache/cassandra/service/DataResolver.java
--
diff --cc src/java/org/apache/cassandra/service/DataResolver.java
index 111d561,f02b565..f63f4f5
--- a/src/java/org/apache/cassandra/service/DataResolver.java
+++ b/src/java/org/apache/cassandra/service/DataResolver.java
@@@ -44,15 -44,17 +44,18 @@@ import org.apache.cassandra.utils.FBUti
  
  public class DataResolver extends ResponseResolver
  {
+ private static final boolean DROP_OVERSIZED_READ_REPAIR_MUTATIONS =
+ Boolean.getBoolean("cassandra.drop_oversized_readrepair_mutations");
+ 
  @VisibleForTesting
  final List<AsyncOneResponse> repairResults = 
Collections.synchronizedList(new ArrayList<>());
 -
 +

[6/6] cassandra git commit: Merge branch 'cassandra-3.11' into trunk

2017-11-13 Thread aleksey
Merge branch 'cassandra-3.11' into trunk


Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo
Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/07258a96
Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/07258a96
Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/07258a96

Branch: refs/heads/trunk
Commit: 07258a96bfde3a6df839b4cc2c79e500d95163f0
Parents: 7707b73 9ee44db
Author: Aleksey Yeschenko 
Authored: Mon Nov 13 13:15:15 2017 +
Committer: Aleksey Yeschenko 
Committed: Mon Nov 13 13:18:03 2017 +

--
 CHANGES.txt |  1 +
 .../apache/cassandra/metrics/TableMetrics.java  |  2 +
 .../apache/cassandra/service/DataResolver.java  | 51 +---
 3 files changed, 46 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/cassandra/blob/07258a96/CHANGES.txt
--

http://git-wip-us.apache.org/repos/asf/cassandra/blob/07258a96/src/java/org/apache/cassandra/metrics/TableMetrics.java
--
diff --cc src/java/org/apache/cassandra/metrics/TableMetrics.java
index 04fbf46,e78bb66..5c4a849
--- a/src/java/org/apache/cassandra/metrics/TableMetrics.java
+++ b/src/java/org/apache/cassandra/metrics/TableMetrics.java
@@@ -248,33 -201,7 +248,34 @@@ public class TableMetric
  }
  });
  
 +public static final Gauge<Long> globalBytesRepaired = 
Metrics.register(globalFactory.createMetricName("BytesRepaired"),
 +   new Gauge<Long>()
 +{
 +public Long getValue()
 +{
 +return totalNonSystemTablesSize(SSTableReader::isRepaired).left;
 +}
 +});
 +
 +public static final Gauge<Long> globalBytesUnrepaired = 
Metrics.register(globalFactory.createMetricName("BytesUnrepaired"),
 + new Gauge<Long>()
 +{
 +public Long getValue()
 +{
 +return totalNonSystemTablesSize(s -> !s.isRepaired() && 
!s.isPendingRepair()).left;
 +}
 +});
 +
 +public static final Gauge<Long> globalBytesPendingRepair = 
Metrics.register(globalFactory.createMetricName("BytesPendingRepair"),
 +  new Gauge<Long>()
 +{
 +public Long getValue()
 +{
 +return 
totalNonSystemTablesSize(SSTableReader::isPendingRepair).left;
 +}
 +});
 +
+ public final Meter readRepairRequests;
  public final Meter shortReadProtectionRequests;
  
  public final Map samplers;
@@@ -825,26 -698,7 +826,27 @@@
  casPropose = new LatencyMetrics(factory, "CasPropose", 
cfs.keyspace.metric.casPropose);
  casCommit = new LatencyMetrics(factory, "CasCommit", 
cfs.keyspace.metric.casCommit);
  
 +repairsStarted = createTableCounter("RepairJobsStarted");
 +repairsCompleted = createTableCounter("RepairJobsCompleted");
 +
 +anticompactionTime = createTableTimer("AnticompactionTime", 
cfs.keyspace.metric.anticompactionTime);
 +validationTime = createTableTimer("ValidationTime", 
cfs.keyspace.metric.validationTime);
 +syncTime = createTableTimer("SyncTime", 
cfs.keyspace.metric.repairSyncTime);
 +
 +bytesValidated = createTableHistogram("BytesValidated", 
cfs.keyspace.metric.bytesValidated, false);
 +partitionsValidated = createTableHistogram("PartitionsValidated", 
cfs.keyspace.metric.partitionsValidated, false);
 +bytesAnticompacted = createTableCounter("BytesAnticompacted");
 +bytesMutatedAnticompaction = 
createTableCounter("BytesMutatedAnticompaction");
 +mutatedAnticompactionGauge = 
createTableGauge("MutatedAnticompactionGauge", () ->
 +{
 +double bytesMutated = bytesMutatedAnticompaction.getCount();
 +double bytesAnticomp = bytesAnticompacted.getCount();
 +if (bytesAnticomp + bytesMutated > 0)
 +return bytesMutated / (bytesAnticomp + bytesMutated);
 +return 0.0;
 +});
 +
+ readRepairRequests = 
Metrics.meter(factory.createMetricName("ReadRepairRequests"));
  shortReadProtectionRequests = 
Metrics.meter(factory.createMetricName("ShortReadProtectionRequests"));
  }
  

http://git-wip-us.apache.org/repos/asf/cassandra/blob/07258a96/src/java/org/apache/cassandra/service/DataResolver.java
--
diff --cc src/java/org/apache/cassandra/service/DataResolver.java
index d4c77d1,f63f4f5..933014f
--- a/src/java/org/apache/cassandra/service/DataResolver.java
+++ 

[3/6] cassandra git commit: Add flag to allow dropping oversized read repair mutations

2017-11-13 Thread aleksey
Add flag to allow dropping oversized read repair mutations

patch by Aleksey Yeschenko; reviewed by Sam Tunnicliffe for
CASSANDRA-13975


Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo
Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/f1e850a4
Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/f1e850a4
Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/f1e850a4

Branch: refs/heads/trunk
Commit: f1e850a492126572efc636a6838cff90333806b9
Parents: f767d35
Author: Aleksey Yeschenko 
Authored: Wed Oct 25 20:15:39 2017 +0100
Committer: Aleksey Yeschenko 
Committed: Mon Nov 13 13:10:28 2017 +

--
 CHANGES.txt |  2 +
 .../apache/cassandra/metrics/TableMetrics.java  |  2 +
 .../apache/cassandra/service/DataResolver.java  | 53 +---
 3 files changed, 49 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/cassandra/blob/f1e850a4/CHANGES.txt
--
diff --git a/CHANGES.txt b/CHANGES.txt
index e3026aa..a3c43fd 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,4 +1,5 @@
 3.0.16
+ * Add flag to allow dropping oversized read repair mutations (CASSANDRA-13975)
  * Fix SSTableLoader logger message (CASSANDRA-14003)
  * Fix repair race that caused gossip to block (CASSANDRA-13849)
  * Tracing interferes with digest requests when using RandomPartitioner 
(CASSANDRA-13964)
@@ -8,6 +9,7 @@
  * Mishandling of cells for removed/dropped columns when reading legacy files 
(CASSANDRA-13939)
  * Deserialise sstable metadata in nodetool verify (CASSANDRA-13922)
 
+
 3.0.15
  * Improve TRUNCATE performance (CASSANDRA-13909)
  * Implement short read protection on partition boundaries (CASSANDRA-13595)

http://git-wip-us.apache.org/repos/asf/cassandra/blob/f1e850a4/src/java/org/apache/cassandra/metrics/TableMetrics.java
--
diff --git a/src/java/org/apache/cassandra/metrics/TableMetrics.java 
b/src/java/org/apache/cassandra/metrics/TableMetrics.java
index fe88a63..eb56ed9 100644
--- a/src/java/org/apache/cassandra/metrics/TableMetrics.java
+++ b/src/java/org/apache/cassandra/metrics/TableMetrics.java
@@ -151,6 +151,7 @@ public class TableMetrics
 public final static LatencyMetrics globalWriteLatency = new 
LatencyMetrics(globalFactory, globalAliasFactory, "Write");
 public final static LatencyMetrics globalRangeLatency = new 
LatencyMetrics(globalFactory, globalAliasFactory, "Range");
 
+public final Meter readRepairRequests;
 public final Meter shortReadProtectionRequests;
 
 public final Map samplers;
@@ -648,6 +649,7 @@ public class TableMetrics
 casPropose = new LatencyMetrics(factory, "CasPropose", 
cfs.keyspace.metric.casPropose);
 casCommit = new LatencyMetrics(factory, "CasCommit", 
cfs.keyspace.metric.casCommit);
 
+readRepairRequests = 
Metrics.meter(factory.createMetricName("ReadRepairRequests"));
 shortReadProtectionRequests = 
Metrics.meter(factory.createMetricName("ShortReadProtectionRequests"));
 }
 

http://git-wip-us.apache.org/repos/asf/cassandra/blob/f1e850a4/src/java/org/apache/cassandra/service/DataResolver.java
--
diff --git a/src/java/org/apache/cassandra/service/DataResolver.java 
b/src/java/org/apache/cassandra/service/DataResolver.java
index 5fb34c6..f02b565 100644
--- a/src/java/org/apache/cassandra/service/DataResolver.java
+++ b/src/java/org/apache/cassandra/service/DataResolver.java
@@ -44,6 +44,9 @@ import org.apache.cassandra.utils.FBUtilities;
 
 public class DataResolver extends ResponseResolver
 {
+private static final boolean DROP_OVERSIZED_READ_REPAIR_MUTATIONS =
+Boolean.getBoolean("cassandra.drop_oversized_readrepair_mutations");
+
 @VisibleForTesting
 final List<AsyncOneResponse> repairResults = 
Collections.synchronizedList(new ArrayList<>());
 
@@ -452,15 +455,49 @@ public class DataResolver extends ResponseResolver
 public void close()
 {
 for (int i = 0; i < repairs.length; i++)
+if (null != repairs[i])
+sendRepairMutation(repairs[i], sources[i]);
+}
+
+private void sendRepairMutation(PartitionUpdate partition, 
InetAddress destination)
+{
+Mutation mutation = new Mutation(partition);
+int messagingVersion = 
MessagingService.instance().getVersion(destination);
+
+int mutationSize = (int) 
Mutation.serializer.serializedSize(mutation, messagingVersion);
+int maxMutationSize = DatabaseDescriptor.getMaxMutationSize();
+
+if 

[2/6] cassandra git commit: Add flag to allow dropping oversized read repair mutations

2017-11-13 Thread aleksey
Add flag to allow dropping oversized read repair mutations

patch by Aleksey Yeschenko; reviewed by Sam Tunnicliffe for
CASSANDRA-13975


Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo
Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/f1e850a4
Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/f1e850a4
Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/f1e850a4

Branch: refs/heads/cassandra-3.11
Commit: f1e850a492126572efc636a6838cff90333806b9
Parents: f767d35
Author: Aleksey Yeschenko 
Authored: Wed Oct 25 20:15:39 2017 +0100
Committer: Aleksey Yeschenko 
Committed: Mon Nov 13 13:10:28 2017 +

--
 CHANGES.txt |  2 +
 .../apache/cassandra/metrics/TableMetrics.java  |  2 +
 .../apache/cassandra/service/DataResolver.java  | 53 +---
 3 files changed, 49 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/cassandra/blob/f1e850a4/CHANGES.txt
--
diff --git a/CHANGES.txt b/CHANGES.txt
index e3026aa..a3c43fd 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,4 +1,5 @@
 3.0.16
+ * Add flag to allow dropping oversized read repair mutations (CASSANDRA-13975)
  * Fix SSTableLoader logger message (CASSANDRA-14003)
  * Fix repair race that caused gossip to block (CASSANDRA-13849)
  * Tracing interferes with digest requests when using RandomPartitioner 
(CASSANDRA-13964)
@@ -8,6 +9,7 @@
  * Mishandling of cells for removed/dropped columns when reading legacy files 
(CASSANDRA-13939)
  * Deserialise sstable metadata in nodetool verify (CASSANDRA-13922)
 
+
 3.0.15
  * Improve TRUNCATE performance (CASSANDRA-13909)
  * Implement short read protection on partition boundaries (CASSANDRA-13595)

http://git-wip-us.apache.org/repos/asf/cassandra/blob/f1e850a4/src/java/org/apache/cassandra/metrics/TableMetrics.java
--
diff --git a/src/java/org/apache/cassandra/metrics/TableMetrics.java 
b/src/java/org/apache/cassandra/metrics/TableMetrics.java
index fe88a63..eb56ed9 100644
--- a/src/java/org/apache/cassandra/metrics/TableMetrics.java
+++ b/src/java/org/apache/cassandra/metrics/TableMetrics.java
@@ -151,6 +151,7 @@ public class TableMetrics
 public final static LatencyMetrics globalWriteLatency = new 
LatencyMetrics(globalFactory, globalAliasFactory, "Write");
 public final static LatencyMetrics globalRangeLatency = new 
LatencyMetrics(globalFactory, globalAliasFactory, "Range");
 
+public final Meter readRepairRequests;
 public final Meter shortReadProtectionRequests;
 
 public final Map samplers;
@@ -648,6 +649,7 @@ public class TableMetrics
 casPropose = new LatencyMetrics(factory, "CasPropose", 
cfs.keyspace.metric.casPropose);
 casCommit = new LatencyMetrics(factory, "CasCommit", 
cfs.keyspace.metric.casCommit);
 
+readRepairRequests = 
Metrics.meter(factory.createMetricName("ReadRepairRequests"));
 shortReadProtectionRequests = 
Metrics.meter(factory.createMetricName("ShortReadProtectionRequests"));
 }
 

http://git-wip-us.apache.org/repos/asf/cassandra/blob/f1e850a4/src/java/org/apache/cassandra/service/DataResolver.java
--
diff --git a/src/java/org/apache/cassandra/service/DataResolver.java 
b/src/java/org/apache/cassandra/service/DataResolver.java
index 5fb34c6..f02b565 100644
--- a/src/java/org/apache/cassandra/service/DataResolver.java
+++ b/src/java/org/apache/cassandra/service/DataResolver.java
@@ -44,6 +44,9 @@ import org.apache.cassandra.utils.FBUtilities;
 
 public class DataResolver extends ResponseResolver
 {
+private static final boolean DROP_OVERSIZED_READ_REPAIR_MUTATIONS =
+Boolean.getBoolean("cassandra.drop_oversized_readrepair_mutations");
+
 @VisibleForTesting
 final List<AsyncOneResponse> repairResults = 
Collections.synchronizedList(new ArrayList<>());
 
@@ -452,15 +455,49 @@ public class DataResolver extends ResponseResolver
 public void close()
 {
 for (int i = 0; i < repairs.length; i++)
+if (null != repairs[i])
+sendRepairMutation(repairs[i], sources[i]);
+}
+
+private void sendRepairMutation(PartitionUpdate partition, 
InetAddress destination)
+{
+Mutation mutation = new Mutation(partition);
+int messagingVersion = 
MessagingService.instance().getVersion(destination);
+
+int mutationSize = (int) 
Mutation.serializer.serializedSize(mutation, messagingVersion);
+int maxMutationSize = DatabaseDescriptor.getMaxMutationSize();
+
+   

[5/6] cassandra git commit: Merge branch 'cassandra-3.0' into cassandra-3.11

2017-11-13 Thread aleksey
Merge branch 'cassandra-3.0' into cassandra-3.11


Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo
Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/9ee44db4
Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/9ee44db4
Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/9ee44db4

Branch: refs/heads/cassandra-3.11
Commit: 9ee44db49b13d4b4c91c9d6332ce06a6e2abf944
Parents: 387d3a4 f1e850a
Author: Aleksey Yeschenko 
Authored: Mon Nov 13 13:13:06 2017 +
Committer: Aleksey Yeschenko 
Committed: Mon Nov 13 13:13:06 2017 +

--
 CHANGES.txt |  1 +
 .../apache/cassandra/metrics/TableMetrics.java  |  2 +
 .../apache/cassandra/service/DataResolver.java  | 53 +---
 3 files changed, 48 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/cassandra/blob/9ee44db4/CHANGES.txt
--
diff --cc CHANGES.txt
index 6a78b60,a3c43fd..a1a1a37
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@@ -1,9 -1,5 +1,10 @@@
 -3.0.16
 +3.11.2
 + * Add asm jar to build.xml for maven builds (CASSANDRA-11193)
 + * Round buffer size to powers of 2 for the chunk cache (CASSANDRA-13897)
 + * Update jackson JSON jars (CASSANDRA-13949)
 + * Avoid locks when checking LCS fanout and if we should defrag 
(CASSANDRA-13930)
 +Merged from 3.0:
+  * Add flag to allow dropping oversized read repair mutations 
(CASSANDRA-13975)
   * Fix SSTableLoader logger message (CASSANDRA-14003)
   * Fix repair race that caused gossip to block (CASSANDRA-13849)
   * Tracing interferes with digest requests when using RandomPartitioner 
(CASSANDRA-13964)

http://git-wip-us.apache.org/repos/asf/cassandra/blob/9ee44db4/src/java/org/apache/cassandra/metrics/TableMetrics.java
--
diff --cc src/java/org/apache/cassandra/metrics/TableMetrics.java
index b0f667c,eb56ed9..e78bb66
--- a/src/java/org/apache/cassandra/metrics/TableMetrics.java
+++ b/src/java/org/apache/cassandra/metrics/TableMetrics.java
@@@ -167,40 -151,7 +167,41 @@@ public class TableMetric
  public final static LatencyMetrics globalWriteLatency = new 
LatencyMetrics(globalFactory, globalAliasFactory, "Write");
  public final static LatencyMetrics globalRangeLatency = new 
LatencyMetrics(globalFactory, globalAliasFactory, "Range");
  
 +public final static Gauge<Double> globalPercentRepaired = 
Metrics.register(globalFactory.createMetricName("PercentRepaired"),
 +new Gauge<Double>()
 +{
 +public Double getValue()
 +{
 +double repaired = 0;
 +double total = 0;
 +for (String keyspace : Schema.instance.getNonSystemKeyspaces())
 +{
 +Keyspace k = Schema.instance.getKeyspaceInstance(keyspace);
 +if 
(SchemaConstants.DISTRIBUTED_KEYSPACE_NAME.equals(k.getName()))
 +continue;
 +if (k.getReplicationStrategy().getReplicationFactor() < 2)
 +continue;
 +
 +for (ColumnFamilyStore cf : k.getColumnFamilyStores())
 +{
 +if (!SecondaryIndexManager.isIndexColumnFamily(cf.name))
 +{
 +for (SSTableReader sstable : 
cf.getSSTables(SSTableSet.CANONICAL))
 +{
 +if (sstable.isRepaired())
 +{
 +repaired += sstable.uncompressedLength();
 +}
 +total += sstable.uncompressedLength();
 +}
 +}
 +}
 +}
 +return total > 0 ? (repaired / total) * 100 : 100.0;
 +}
 +});
 +
+ public final Meter readRepairRequests;
  public final Meter shortReadProtectionRequests;
  
  public final Map samplers;

http://git-wip-us.apache.org/repos/asf/cassandra/blob/9ee44db4/src/java/org/apache/cassandra/service/DataResolver.java
--
diff --cc src/java/org/apache/cassandra/service/DataResolver.java
index 111d561,f02b565..f63f4f5
--- a/src/java/org/apache/cassandra/service/DataResolver.java
+++ b/src/java/org/apache/cassandra/service/DataResolver.java
@@@ -44,15 -44,17 +44,18 @@@ import org.apache.cassandra.utils.FBUti
  
  public class DataResolver extends ResponseResolver
  {
+ private static final boolean DROP_OVERSIZED_READ_REPAIR_MUTATIONS =
+ Boolean.getBoolean("cassandra.drop_oversized_readrepair_mutations");
+ 
  @VisibleForTesting
  final List<AsyncOneResponse> repairResults = 
Collections.synchronizedList(new ArrayList<>());
 

[jira] [Commented] (CASSANDRA-14008) RTs at index boundaries in 2.x sstables can create unexpected CQL row in 3.x

2017-11-13 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249432#comment-16249432
 ] 

Aleksey Yeschenko commented on CASSANDRA-14008:
---

We can probably generate an sstable that triggers this bug relatively easily 
for a regression test (nice to have, but we won't block the patch on the lack 
of one).

And, as Jeff mentions, it would be nice to find a way to un-break 3.0 sstables 
where the damage's been done already, in a follow-up JIRA.

> RTs at index boundaries in 2.x sstables can create unexpected CQL row in 3.x
> 
>
> Key: CASSANDRA-14008
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14008
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local Write-Read Paths
>Reporter: Jeff Jirsa
>Assignee: Jeff Jirsa
>  Labels: correctness
> Fix For: 3.0.x, 3.11.x
>
>
> In 2.1/2.2, it is possible for a range tombstone that isn't a row deletion 
> and isn't a complex deletion to appear between two cells with the same 
> clustering. The 8099 legacy code incorrectly treats the two (non-RT) cells as 
> two distinct CQL rows, despite having the same clustering prefix.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-14008) RTs at index boundaries in 2.x sstables can create unexpected CQL row in 3.x

2017-11-13 Thread Aleksey Yeschenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksey Yeschenko updated CASSANDRA-14008:
--
Component/s: (was: Core)
 Local Write-Read Paths

> RTs at index boundaries in 2.x sstables can create unexpected CQL row in 3.x
> 
>
> Key: CASSANDRA-14008
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14008
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local Write-Read Paths
>Reporter: Jeff Jirsa
>Assignee: Jeff Jirsa
>  Labels: correctness
> Fix For: 3.0.x, 3.11.x
>
>
> In 2.1/2.2, it is possible for a range tombstone that isn't a row deletion 
> and isn't a complex deletion to appear between two cells with the same 
> clustering. The 8099 legacy code incorrectly treats the two (non-RT) cells as 
> two distinct CQL rows, despite having the same clustering prefix.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14007) cqlshlib tests fail due to compact table

2017-11-13 Thread Alex Petrov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249370#comment-16249370
 ] 

Alex Petrov commented on CASSANDRA-14007:
-

I've just re-run all the dtests and they seem to be clean. Or do we run the 
cqlshlib tests in some other way?

> cqlshlib tests fail due to compact table
> 
>
> Key: CASSANDRA-14007
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14007
> Project: Cassandra
>  Issue Type: Bug
>  Components: Testing
>Reporter: Joel Knighton
>Assignee: Alex Petrov
>
> The pylib/cqlshlib tests fail on initialization with the error 
> {{SyntaxException: [Syntax error in CQL query] message="Compact tables are 
> not allowed in Cassandra starting with 4.0 version."}}. 
> The table {{dynamic_columns}} is created {{WITH COMPACT STORAGE}}. Since 
> [CASSANDRA-10857], this is no longer supported. It looks like dropping the 
> COMPACT STORAGE modifier is enough for the tests to run, but I haven't looked 
> if we should instead remove the table and all related tests entirely, or if 
> there's an interesting code path covered by this that we should test in a 
> different way now. [~ifesdjeen] might know at a glance.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-14009) Any user can overwrite any table with sstableloader

2017-11-13 Thread Andrzej Bober (JIRA)
Andrzej Bober created CASSANDRA-14009:
-

 Summary: Any user can overwrite any table with sstableloader
 Key: CASSANDRA-14009
 URL: https://issues.apache.org/jira/browse/CASSANDRA-14009
 Project: Cassandra
  Issue Type: Bug
  Components: Auth
Reporter: Andrzej Bober
 Fix For: 2.1.x, 2.2.x, 3.0.x, 3.11.x


Hi there,

Looks like any user can overwrite any table with sstableloader.
Tested on Ubuntu 16.04.3, Java 1.8.0_151_b12, and Cassandra 2.1.19 / 2.2.11 / 
3.0.15 / 3.11.1.

{code:sql}
cassandra@cqlsh> CREATE USER alice WITH PASSWORD 'Alice';
cassandra@cqlsh> CREATE USER bob WITH PASSWORD 'Bob';

cassandra@cqlsh>  CREATE KEYSPACE db4alice WITH replication = {'class': 
'SimpleStrategy', 'replication_factor': 1};
cassandra@cqlsh>  GRANT ALL PERMISSIONS ON KEYSPACE db4alice TO alice;

alice@cqlsh> CREATE TABLE users (userid text PRIMARY KEY, password text);

alice@cqlsh> INSERT INTO users (userid, password) VALUES ('user1', 'pass1');
alice@cqlsh> INSERT INTO users (userid, password) VALUES ('user2', 'pass2');
alice@cqlsh> INSERT INTO users (userid, password) VALUES ('user3', 'pass3');

alice@cqlsh> truncate users;

alice@cqlsh> select * from db4alice.users ;
 userid | password
+--
(0 rows)

sstableloader -d 127.0.0.1 -u bob -pw Bob ./db4alice/users

alice@cqlsh> select * from db4alice.users ;

 userid | password
+--
  user2 |pass2
  user1 |pass1
  user3 |pass3

(3 rows)
{code}

Looks like a pretty serious bug to me.
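
Nothing in the stream-receiving path consults the authorizer, which is why any 
authenticated user (bob above has no grants at all on {{db4alice}}) can load 
data. Purely as an illustration of the missing check, not a posted patch, a 
guard using the existing auth API might look like:

{code:java}
// Hypothetical sketch only; no fix is attached to this ticket. The idea is to
// verify that the authenticated user holds MODIFY on the target table before
// accepting inbound sstables for it.
static void ensureCanStreamInto(AuthenticatedUser user, String keyspace, String table)
{
    Set<Permission> granted = DatabaseDescriptor.getAuthorizer()
                                                .authorize(user, DataResource.table(keyspace, table));
    if (!granted.contains(Permission.MODIFY))
        throw new UnauthorizedException(String.format("User %s has no MODIFY permission on %s.%s",
                                                      user.getName(), keyspace, table));
}
{code}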



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13997) Upgrade Guava to 23.3 and Airline to 0.8

2017-11-13 Thread Stefan Podkowinski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249290#comment-16249290
 ] 

Stefan Podkowinski commented on CASSANDRA-13997:


If we do that, then we should probably include the guava artifact directly 
instead of j2objc. This should override the ancient guava-16 version that is 
pulled by ohc. 

> Upgrade Guava to 23.3 and Airline to 0.8
> 
>
> Key: CASSANDRA-13997
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13997
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Libraries
>Reporter: Marcus Eriksson
>Assignee: Marcus Eriksson
> Fix For: 4.0
>
> Attachments: airline-0.8.jar.asc, guava-23.3-jre.jar.asc
>
>
> For 4.0 we should upgrade guava to the latest version
> patch here: https://github.com/krummas/cassandra/commits/marcuse/guava23
> A bunch of commonly used methods have been deprecated since guava 18, the 
> version we use now ({{Throwables.propagate}}, for example); this patch mostly 
> updates call sites where compilation fails. {{Futures.transform(ListenableFuture 
> ..., AsyncFunction ...)}} was deprecated in Guava 19 and removed in 20, for 
> example; we should probably open new tickets to remove calls to all 
> deprecated guava methods.
> Also had to add a dependency on {{com.google.j2objc.j2objc-annotations}}, to 
> avoid some build-time warnings (maybe due to 
> https://github.com/google/guava/commit/fffd2b1f67d158c7b4052123c5032b0ba54a910d
>  ?)
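
For reference, the two migrations called out above follow the pattern below (a 
sketch of the generic replacements from Guava's own deprecation notices, not 
lines taken from the linked branch):

{code:java}
import com.google.common.base.Throwables;
import com.google.common.util.concurrent.AsyncFunction;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.MoreExecutors;

public class GuavaMigrationSketch
{
    // Old: throw Throwables.propagate(t);  (deprecated)
    // New: the two-step replacement recommended in the deprecation notice.
    static RuntimeException rethrow(Throwable t)
    {
        Throwables.throwIfUnchecked(t);
        throw new RuntimeException(t);
    }

    // Old: Futures.transform(future, asyncFunction);  (removed in Guava 20)
    // New: transformAsync, now with an explicit executor.
    static ListenableFuture<Integer> lengthOf(ListenableFuture<String> input)
    {
        AsyncFunction<String, Integer> length = s -> Futures.immediateFuture(s.length());
        return Futures.transformAsync(input, length, MoreExecutors.directExecutor());
    }
}
{code}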



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Assigned] (CASSANDRA-14007) cqlshlib tests fail due to compact table

2017-11-13 Thread Alex Petrov (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov reassigned CASSANDRA-14007:
---

Assignee: Alex Petrov

> cqlshlib tests fail due to compact table
> 
>
> Key: CASSANDRA-14007
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14007
> Project: Cassandra
>  Issue Type: Bug
>  Components: Testing
>Reporter: Joel Knighton
>Assignee: Alex Petrov
>
> The pylib/cqlshlib tests fail on initialization with the error 
> {{SyntaxException: [Syntax error in CQL query] message="Compact tables are 
> not allowed in Cassandra starting with 4.0 version."}}. 
> The table {{dynamic_columns}} is created {{WITH COMPACT STORAGE}}. Since 
> [CASSANDRA-10857], this is no longer supported. It looks like dropping the 
> COMPACT STORAGE modifier is enough for the tests to run, but I haven't looked 
> if we should instead remove the table and all related tests entirely, or if 
> there's an interesting code path covered by this that we should test in a 
> different way now. [~ifesdjeen] might know at a glance.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13948) Reload compaction strategies when JBOD disk boundary changes

2017-11-13 Thread Paulo Motta (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249228#comment-16249228
 ] 

Paulo Motta commented on CASSANDRA-13948:
-

Testall passed with no failures, and [dtest 
failures|https://issues.apache.org/jira/secure/attachment/12897298/dtest13948.png]
 look unrelated.

> Reload compaction strategies when JBOD disk boundary changes
> 
>
> Key: CASSANDRA-13948
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13948
> Project: Cassandra
>  Issue Type: Bug
>  Components: Compaction
>Reporter: Paulo Motta
>Assignee: Paulo Motta
> Fix For: 3.11.x, 4.x
>
> Attachments: debug.log, dtest13948.png, threaddump-cleanup.txt, 
> threaddump.txt, trace.log
>
>
> The thread dump below shows a race between an sstable replacement by the 
> {{IndexSummaryRedistribution}} and 
> {{AbstractCompactionTask.getNextBackgroundTask}}:
> {noformat}
> Thread 94580: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
> line=175 (Compiled frame)
>  - 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt() 
> @bci=1, line=836 (Compiled frame)
>  - 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.util.concurrent.locks.AbstractQueuedSynchronizer$Node,
>  int) @bci=67, line=870 (Compiled frame)
>  - java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(int) 
> @bci=17, line=1199 (Compiled frame)
>  - java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock() @bci=5, 
> line=943 (Compiled frame)
>  - 
> org.apache.cassandra.db.compaction.CompactionStrategyManager.handleListChangedNotification(java.lang.Iterable,
>  java.lang.Iterable) @bci=359, line=483 (Interpreted frame)
>  - 
> org.apache.cassandra.db.compaction.CompactionStrategyManager.handleNotification(org.apache.cassandra.notifications.INotification,
>  java.lang.Object) @bci=53, line=555 (Interpreted frame)
>  - 
> org.apache.cassandra.db.lifecycle.Tracker.notifySSTablesChanged(java.util.Collection,
>  java.util.Collection, org.apache.cassandra.db.compaction.OperationType, 
> java.lang.Throwable) @bci=50, line=409 (Interpreted frame)
>  - 
> org.apache.cassandra.db.lifecycle.LifecycleTransaction.doCommit(java.lang.Throwable)
>  @bci=157, line=227 (Interpreted frame)
>  - 
> org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.commit(java.lang.Throwable)
>  @bci=61, line=116 (Compiled frame)
>  - 
> org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.commit()
>  @bci=2, line=200 (Interpreted frame)
>  - 
> org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.finish()
>  @bci=5, line=185 (Interpreted frame)
>  - 
> org.apache.cassandra.io.sstable.IndexSummaryRedistribution.redistributeSummaries()
>  @bci=559, line=130 (Interpreted frame)
>  - 
> org.apache.cassandra.db.compaction.CompactionManager.runIndexSummaryRedistribution(org.apache.cassandra.io.sstable.IndexSummaryRedistribution)
>  @bci=9, line=1420 (Interpreted frame)
>  - 
> org.apache.cassandra.io.sstable.IndexSummaryManager.redistributeSummaries(org.apache.cassandra.io.sstable.IndexSummaryRedistribution)
>  @bci=4, line=250 (Interpreted frame)
>  - 
> org.apache.cassandra.io.sstable.IndexSummaryManager.redistributeSummaries() 
> @bci=30, line=228 (Interpreted frame)
>  - org.apache.cassandra.io.sstable.IndexSummaryManager$1.runMayThrow() 
> @bci=4, line=125 (Interpreted frame)
>  - org.apache.cassandra.utils.WrappedRunnable.run() @bci=1, line=28 
> (Interpreted frame)
>  - 
> org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run()
>  @bci=4, line=118 (Compiled frame)
>  - java.util.concurrent.Executors$RunnableAdapter.call() @bci=4, line=511 
> (Compiled frame)
>  - java.util.concurrent.FutureTask.runAndReset() @bci=47, line=308 (Compiled 
> frame)
>  - 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask)
>  @bci=1, line=180 (Compiled frame)
>  - java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run() 
> @bci=37, line=294 (Compiled frame)
>  - 
> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker)
>  @bci=95, line=1149 (Compiled frame)
>  - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=624 
> (Interpreted frame)
>  - 
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(java.lang.Runnable)
>  @bci=1, line=81 (Interpreted frame)
>  - org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$8.run() @bci=4 
> 

[jira] [Updated] (CASSANDRA-13948) Reload compaction strategies when JBOD disk boundary changes

2017-11-13 Thread Paulo Motta (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-13948:

Attachment: dtest13948.png

> Reload compaction strategies when JBOD disk boundary changes
> 
>
> Key: CASSANDRA-13948
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13948
> Project: Cassandra
>  Issue Type: Bug
>  Components: Compaction
>Reporter: Paulo Motta
>Assignee: Paulo Motta
> Fix For: 3.11.x, 4.x
>
> Attachments: debug.log, dtest13948.png, threaddump-cleanup.txt, 
> threaddump.txt, trace.log
>
>
> The thread dump below shows a race between an sstable replacement by the 
> {{IndexSummaryRedistribution}} and 
> {{AbstractCompactionTask.getNextBackgroundTask}}:
> {noformat}
> Thread 94580: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
> line=175 (Compiled frame)
>  - 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt() 
> @bci=1, line=836 (Compiled frame)
>  - 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.util.concurrent.locks.AbstractQueuedSynchronizer$Node,
>  int) @bci=67, line=870 (Compiled frame)
>  - java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(int) 
> @bci=17, line=1199 (Compiled frame)
>  - java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock() @bci=5, 
> line=943 (Compiled frame)
>  - 
> org.apache.cassandra.db.compaction.CompactionStrategyManager.handleListChangedNotification(java.lang.Iterable,
>  java.lang.Iterable) @bci=359, line=483 (Interpreted frame)
>  - 
> org.apache.cassandra.db.compaction.CompactionStrategyManager.handleNotification(org.apache.cassandra.notifications.INotification,
>  java.lang.Object) @bci=53, line=555 (Interpreted frame)
>  - 
> org.apache.cassandra.db.lifecycle.Tracker.notifySSTablesChanged(java.util.Collection,
>  java.util.Collection, org.apache.cassandra.db.compaction.OperationType, 
> java.lang.Throwable) @bci=50, line=409 (Interpreted frame)
>  - 
> org.apache.cassandra.db.lifecycle.LifecycleTransaction.doCommit(java.lang.Throwable)
>  @bci=157, line=227 (Interpreted frame)
>  - 
> org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.commit(java.lang.Throwable)
>  @bci=61, line=116 (Compiled frame)
>  - 
> org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.commit()
>  @bci=2, line=200 (Interpreted frame)
>  - 
> org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.finish()
>  @bci=5, line=185 (Interpreted frame)
>  - 
> org.apache.cassandra.io.sstable.IndexSummaryRedistribution.redistributeSummaries()
>  @bci=559, line=130 (Interpreted frame)
>  - 
> org.apache.cassandra.db.compaction.CompactionManager.runIndexSummaryRedistribution(org.apache.cassandra.io.sstable.IndexSummaryRedistribution)
>  @bci=9, line=1420 (Interpreted frame)
>  - 
> org.apache.cassandra.io.sstable.IndexSummaryManager.redistributeSummaries(org.apache.cassandra.io.sstable.IndexSummaryRedistribution)
>  @bci=4, line=250 (Interpreted frame)
>  - 
> org.apache.cassandra.io.sstable.IndexSummaryManager.redistributeSummaries() 
> @bci=30, line=228 (Interpreted frame)
>  - org.apache.cassandra.io.sstable.IndexSummaryManager$1.runMayThrow() 
> @bci=4, line=125 (Interpreted frame)
>  - org.apache.cassandra.utils.WrappedRunnable.run() @bci=1, line=28 
> (Interpreted frame)
>  - 
> org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run()
>  @bci=4, line=118 (Compiled frame)
>  - java.util.concurrent.Executors$RunnableAdapter.call() @bci=4, line=511 
> (Compiled frame)
>  - java.util.concurrent.FutureTask.runAndReset() @bci=47, line=308 (Compiled 
> frame)
>  - 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask)
>  @bci=1, line=180 (Compiled frame)
>  - java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run() 
> @bci=37, line=294 (Compiled frame)
>  - 
> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker)
>  @bci=95, line=1149 (Compiled frame)
>  - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=624 
> (Interpreted frame)
>  - 
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(java.lang.Runnable)
>  @bci=1, line=81 (Interpreted frame)
>  - org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$8.run() @bci=4 
> (Interpreted frame)
>  - java.lang.Thread.run() @bci=11, line=748 (Compiled frame)
> {noformat}
> {noformat}
> Thread 94573: (state = IN_JAVA)
>  - 

[jira] [Commented] (CASSANDRA-13992) Don't send new_metadata_id for conditional updates

2017-11-13 Thread Alex Petrov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249215#comment-16249215
 ] 

Alex Petrov commented on CASSANDRA-13992:
-

I've composed a version of the patch to demonstrate my thinking 
[here|https://github.com/apache/cassandra/compare/trunk...ifesdjeen:CASSANDRA-13992].
 It seems that we can solve this problem without patching the driver. In fact, 
it might be even better if the inner workings of the metadata hash stay 
transparent to the driver.

In short, we can always force {{METADATA_CHANGED}} for conditional statements 
and avoid computing their metadata hash altogether. It's a rough equivalent of 
making the metadata hash random, just simpler to reason about. What do you 
think, [~KurtG], [~omichallat]?

> Don't send new_metadata_id for conditional updates
> --
>
> Key: CASSANDRA-13992
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13992
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Olivier Michallat
>Assignee: Kurt Greaves
>Priority: Minor
>
> This is a follow-up to CASSANDRA-10786.
> Given the table
> {code}
> CREATE TABLE foo (k int PRIMARY KEY)
> {code}
> And the prepared statement
> {code}
> INSERT INTO foo (k) VALUES (?) IF NOT EXISTS
> {code}
> The result set metadata changes depending on the outcome of the update:
> * if the row didn't exist, there is only a single column \[applied] = true
> * if it did, the result contains \[applied] = false, plus the current value 
> of column k.
> The way this was handled so far is that the PREPARED response contains no 
> result set metadata, and therefore all EXECUTE messages have SKIP_METADATA = 
> false, and the responses always include the full (and correct) metadata.
> CASSANDRA-10786 still sends the PREPARED response with no metadata, *but the 
> response to EXECUTE now contains a {{new_metadata_id}}*. The driver thinks it 
> is because of a schema change, and updates its local copy of the prepared 
> statement's result metadata.
> The next EXECUTE is sent with SKIP_METADATA = true, but the server appears to 
> ignore that, and still sends the metadata in the response. So each response 
> includes the correct metadata, the driver uses it, and there is no visible 
> issue for client code.
> The only drawback is that the driver updates its local copy of the metadata 
> unnecessarily, every time. We can work around that by only updating if we had 
> metadata before, at the cost of an extra volatile read. But I think the best 
> thing to do would be to never send a {{new_metadata_id}} for a conditional 
> update.
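
A standalone toy of the decision being discussed, with invented names (in 
Cassandra the equivalent logic sits where ROWS responses are built for 
protocol v5):

{code:java}
import java.util.Arrays;

public class MetadataIdSketch
{
    static final int METADATA_CHANGED = 0x0008; // v5 ROWS metadata flag

    static int flagsFor(boolean conditional, byte[] preparedId, byte[] currentId)
    {
        // Conditional (LWT) results legitimately flip between two shapes
        // ([applied] only vs [applied] plus the existing row), so their
        // metadata id can never signal a schema change: always mark changed.
        if (conditional)
            return METADATA_CHANGED;

        // For ordinary statements, only signal a change when the result
        // metadata id really differs from the one the client prepared with.
        return Arrays.equals(preparedId, currentId) ? 0 : METADATA_CHANGED;
    }

    public static void main(String[] args)
    {
        byte[] id = { 1, 2, 3 };
        System.out.println(flagsFor(true, id, id));               // 8: forced
        System.out.println(flagsFor(false, id, id));              // 0: unchanged
        System.out.println(flagsFor(false, id, new byte[]{ 9 })); // 8: real change
    }
}
{code}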



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org