[jira] [Commented] (CASSANDRA-15700) Performance regression on internode messaging

2020-06-25 Thread Sergio Bossa (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17145048#comment-17145048
 ] 

Sergio Bossa commented on CASSANDRA-15700:
--

[~aleksey] apologies for this late reply. I've pushed the recommended changes, 
please have a look when you have a moment.

> Performance regression on internode messaging
> -
>
> Key: CASSANDRA-15700
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15700
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Internode
>Reporter: Sergio Bossa
>Assignee: Sergio Bossa
>Priority: Normal
>  Labels: pull-request-available
> Fix For: 4.0-beta
>
> Attachments: Oss40patchedvsOss311.png, Oss40vsOss311.png, oss40.gc, 
> oss40_nogc.tar.xz, oss40_system.log
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [~jasonstack] and I have been investigating a performance regression 
> affecting 4.0 during a 3-node, RF 3 write throughput test with a 
> timeseries-like workload, as shown in this plot, where blue is 3.11 and 
> orange is 4.0:
> !Oss40vsOss311.png|width=389,height=214!
>  It's been a bit of a long investigation, but two clues ended up standing out:
> 1) An abnormal number of expired messages on 4.0 (as shown in the attached 
> system log), while 3.11 has almost none.
> 2) Abnormal GC activity (as shown in the attached gc log).
> It turns out the two are related, as the [on expired 
> callback|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundConnection.java#L462]
>  creates a huge amount of strings in the {{id()}} call. The next question is 
> what causes all those message expirations; we thoroughly reviewed the 
> internode messaging code, and the only issue we have found so far is related 
> to the "batch pruning" calls 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L81]
>  and 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L188]:
>  it _seems_ too much time is spent on those, causing the event loop to fall 
> behind in processing the rest of the messages, which will end up being 
> expired. This is supported by the analysis of the collapsed stacks (after 
> fixing the GC issue):
> {noformat}
> (tprint (top-aggregated-calls oss40nogc "EventLoopDelivery:doRun" 5))
> org/apache/cassandra/net/OutboundConnection$EventLoopDelivery:doRun 3456
> org/apache/cassandra/net/OutboundMessageQueue:access$600 1621
> org/apache/cassandra/net/PrunableArrayQueue:prune 1621
> org/apache/cassandra/net/OutboundMessageQueue$WithLock:close 1621
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1620
> {noformat}
> Those are the top 5 sampled calls from {{EventLoopDelivery#doRun()}} which 
> spends half of its time pruning. But only a tiny portion of such pruning time 
> is spent actually expiring:
> {noformat}
> (tprint (top-aggregated-calls oss40nogc 
> "OutboundMessageQueue:pruneInternalQueueWithLock" 5))
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1900
> org/apache/cassandra/net/PrunableArrayQueue:prune 1894
> org/apache/cassandra/net/OutboundMessageQueue$1Pruner:onPruned 147
> org/apache/cassandra/net/OutboundConnection$$Lambda$444/740904487:accept 147
> org/apache/cassandra/net/OutboundConnection:onExpired 147
> {noformat}
> And indeed, the {{PrunableArrayQueue:prune()}} self time is dominant:
> {noformat}
> (tprint (top-self-calls oss40nogc "PrunableArrayQueue:prune" 5))
> org/apache/cassandra/net/PrunableArrayQueue:prune 1718
> org/apache/cassandra/net/OutboundConnection:releaseCapacity 27
> java/util/concurrent/ConcurrentHashMap:replaceNode 19
> java/util/concurrent/ConcurrentLinkedQueue:offer 16
> java/util/concurrent/LinkedBlockingQueue:offer 15
> {noformat}
> That said, before proceeding with a PR to fix those issues, I'd like to 
> understand: what's the reason to prune so often, rather than just when 
> polling the message during delivery? If there's a reason I'm missing, let's 
> talk about how to optimize pruning, otherwise let's get rid of that.
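> To make the pruning cost concrete, here is a minimal sketch of an O(n) prune 
> pass over an array-backed queue (hypothetical names, not the actual 
> {{PrunableArrayQueue}} code); running such a pass on every delivery run, 
> rather than only once the earliest expiry deadline has passed, is what the 
> collapsed stacks above point at:
> {code:java}
> // Hedged sketch: every prune scans all buffered messages, and the expiry
> // callback may allocate (e.g. building id strings) once per expired message.
> final class SketchQueue
> {
>     interface Msg { long expiresAtNanos(); }
> 
>     private Msg[] buffer = new Msg[1024];
>     private int size;
> 
>     int prune(long nowNanos, java.util.function.Consumer<Msg> onExpired)
>     {
>         int kept = 0;
>         for (int i = 0; i < size; i++)
>         {
>             Msg m = buffer[i];
>             if (m.expiresAtNanos() <= nowNanos)
>                 onExpired.accept(m); // per-message allocation happens here
>             else
>                 buffer[kept++] = m;  // compact survivors to the front
>         }
>         int expired = size - kept;
>         for (int i = kept; i < size; i++)
>             buffer[i] = null;        // let expired slots be GCed
>         size = kept;
>         return expired;
>     }
> }
> {code}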



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15665) StreamManager should clearly differentiate between "initiator" and "receiver" sessions

2020-06-11 Thread Sergio Bossa (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133127#comment-17133127
 ] 

Sergio Bossa commented on CASSANDRA-15665:
--

[~jasonstack], [~blerer], apologies for this late reply. I'm afraid I'll be 
busy until at least next week. Feel free to go ahead without my review.

> StreamManager should clearly differentiate between "initiator" and "receiver" 
> sessions
> --
>
> Key: CASSANDRA-15665
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15665
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Streaming and Messaging
>Reporter: Sergio Bossa
>Assignee: ZhaoYang
>Priority: Normal
> Fix For: 4.0
>
>
> {{StreamManager}} currently does a suboptimal job of differentiating between 
> stream sessions (in the form of {{StreamResultFuture}}) which have been either 
> initiated or "received", for the following reasons:
> 1) Naming is IMO confusing: a "receiver" session could actually both send and 
> receive files, so technically an initiator is also a receiver.
> 2) {{StreamManager#findSession()}} assumes we should first look into 
> "initiator" sessions, then into "receiver" ones: this is a dangerous 
> assumption, in particular for test environments where the same process could 
> work as both an initiator and a receiver.
> I would recommend the following changes (see the sketch below):
> 1) Rename "receiver" to "follower" everywhere the former is used.
> 2) Introduce a new flag into {{StreamMessageHeader}} to signal whether the 
> message comes from an initiator or follower session, in order to correctly 
> differentiate and look up sessions in {{StreamManager}}.
> While my arguments above might seem trivial, I believe they will improve 
> clarity and save us from potential bugs/headaches at testing time, and now 
> that we're revamping streaming for 4.0 seems the right time to make them.
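> As a rough illustration of point 2, a hedged sketch (all names are 
> hypothetical, not part of the actual patch):
> {code:java}
> import java.util.Map;
> import java.util.UUID;
> import java.util.concurrent.ConcurrentHashMap;
> 
> // Hedged sketch: route session lookup on an explicit initiator/follower flag
> // carried by the message header, instead of probing both maps in order.
> final class SketchStreamManager
> {
>     static final class Header { UUID planId; boolean sentByInitiator; }
>     static final class Session { }
> 
>     private final Map<UUID, Session> initiatorSessions = new ConcurrentHashMap<>();
>     private final Map<UUID, Session> followerSessions = new ConcurrentHashMap<>();
> 
>     // A message flagged as sent by the initiator targets one of our follower
>     // sessions, and vice versa: no ordered probing, no ambiguity.
>     Session findSession(Header header)
>     {
>         return header.sentByInitiator
>              ? followerSessions.get(header.planId)
>              : initiatorSessions.get(header.planId);
>     }
> }
> {code}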



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15700) Performance regression on internode messaging

2020-06-03 Thread Sergio Bossa (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125211#comment-17125211
 ] 

Sergio Bossa commented on CASSANDRA-15700:
--

[~aleksey] thanks for reviewing: getting back to this after two months and 
having to rebuild the whole thing in my mind made me realize how complex this 
is, and that I should have added a couple more comments, so apologies, and 
kudos for getting through it all alone :)

That said, to answer your question straight: the race you point out is 
absolutely correct, and one I unfortunately overlooked, but there should be an 
easy fix. Before going with that, I think I first have to clarify 
why I introduced the new {{nextExpirationDeadline}} variable alongside 
{{earliestExpiryTime}}. Simply put, if we only track time via 
{{earliestExpiryTime}} during both {{add}} and {{prune}}, we risk getting into 
a race where we accumulate an unbounded number of messages until the next 
expiration, as shown in the following scenario:
1. {{add}} is called N times with {{earliestExpiryTime}} set to the minimum 
time among the added messages; you can't adjust by the current time in this 
case, otherwise you would never expire (as the time would always shift).
2. {{prune}} is called and {{earliestExpiryTime}} is set to the minimum between 
the minimum time among the pruned messages and the current expiry time.
3. This means that any message arriving between the start and end of the 
pruning whose expiry time was not the minimum, but still less than the minimum 
expiry time among pruned messages, would be "ignored" and remain in the queue.

At this point you might say it really sounds like the race you discovered, and 
in a way it is :) Although in the above case the race window would be much 
larger (the whole {{prune}}).

Anyway, as I said, there should be an easy fix: the deadline should be updated 
only if it's actually the minimum value (adjusted by the current time), and 
I've sent a new commit with such a fix. I don't see a way to fix this kind of 
race by keeping just a single variable, but let me know if you find one.
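
For reference, the gist of that fix can be sketched as a lock-free "update 
only if strictly earlier" operation (a hedged sketch with hypothetical names, 
not the committed code):
{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Hedged sketch: the deadline only ever moves earlier, never later, so a
// concurrent add/prune cannot "hide" an earlier expiry behind a later one.
final class DeadlineSketch
{
    private final AtomicLong nextExpirationDeadline = new AtomicLong(Long.MAX_VALUE);

    void proposeDeadline(long candidateNanos)
    {
        long current = nextExpirationDeadline.get();
        // CAS loop: only succeed while our candidate is strictly the minimum.
        while (candidateNanos < current
               && !nextExpirationDeadline.compareAndSet(current, candidateNanos))
        {
            current = nextExpirationDeadline.get();
        }
    }
}
{code}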

> Performance regression on internode messaging
> -
>
> Key: CASSANDRA-15700
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15700
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Internode
>Reporter: Sergio Bossa
>Assignee: Sergio Bossa
>Priority: Normal
>  Labels: pull-request-available
> Fix For: 4.0-beta
>
> Attachments: Oss40patchedvsOss311.png, Oss40vsOss311.png, oss40.gc, 
> oss40_nogc.tar.xz, oss40_system.log
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [~jasonstack] and I have been investigating a performance regression 
> affecting 4.0 during a 3-node, RF 3 write throughput test with a 
> timeseries-like workload, as shown in this plot, where blue is 3.11 and 
> orange is 4.0:
> !Oss40vsOss311.png|width=389,height=214!
>  It's been a bit of a long investigation, but two clues ended up standing out:
> 1) An abnormal number of expired messages on 4.0 (as shown in the attached 
> system log), while 3.11 has almost none.
> 2) Abnormal GC activity (as shown in the attached gc log).
> It turns out the two are related, as the [on expired 
> callback|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundConnection.java#L462]
>  creates a huge amount of strings in the {{id()}} call. The next question is 
> what causes all those message expirations; we thoroughly reviewed the 
> internode messaging code, and the only issue we have found so far is related 
> to the "batch pruning" calls 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L81]
>  and 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L188]:
>  it _seems_ too much time is spent on those, causing the event loop to fall 
> behind in processing the rest of the messages, which will end up being 
> expired. This is supported by the analysis of the collapsed stacks (after 
> fixing the GC issue):
> {noformat}
> (tprint (top-aggregated-calls oss40nogc "EventLoopDelivery:doRun" 5))
> org/apache/cassandra/net/OutboundConnection$EventLoopDelivery:doRun 3456
> org/apache/cassandra/net/OutboundMessageQueue:access$600 1621
> org/apache/cassandra/net/PrunableArrayQueue:prune 1621
> org/apache/cassandra/net/OutboundMessageQueue$WithLock:close 1621
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1620
> {noformat}
> Those are the top 5 sampled calls from {{EventLoopDelivery#doRun()}} which 
> spends half of its time pruning. But only a tiny portion of such pruning time 
> is spent actually expiring:
> {noformat}
> 

[jira] [Commented] (CASSANDRA-15666) Race condition when completing stream sessions

2020-04-20 Thread Sergio Bossa (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087966#comment-17087966
 ] 

Sergio Bossa commented on CASSANDRA-15666:
--

Good to merge for me.

> Race condition when completing stream sessions
> --
>
> Key: CASSANDRA-15666
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15666
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Streaming and Messaging
>Reporter: Sergio Bossa
>Assignee: ZhaoYang
>Priority: Normal
>  Labels: pull-request-available
> Fix For: 4.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{StreamSession#prepareAsync()}} executes, as the name implies, 
> asynchronously from the IO thread: this opens the door to race conditions 
> between the sending of the {{PrepareSynAckMessage}} and the call to 
> {{StreamSession#maybeCompleted()}}. That is, the following could happen:
> 1) Node A sends {{PrepareSynAckMessage}} from the {{prepareAsync()}} thread.
> 2) Node B receives it and starts streaming.
> 3) Node A receives the streamed file and sends {{ReceivedMessage}}.
> 4) At this point, if this was the only file to stream, both nodes are ready 
> to close the session via {{maybeCompleted()}}, but:
> a) Node A will call it twice from both the IO thread and the thread at #1, 
> closing the session and its channels.
> b) Node B will attempt to send a {{CompleteMessage}}, but will fail because 
> the session has been closed in the meantime.
> There are other subtle variations of the pattern above, depending on the 
> order of concurrently sent/received messages.
> I believe the best fix would be to modify the message exchange so that:
> 1) Only the "follower" is allowed to send the {{CompleteMessage}}.
> 2) Only the "initiator" is allowed to close the session and its channels 
> after receiving the {{CompleteMessage}}.
> By doing so, the message exchange logic would be easier to reason about, 
> which is overall a win anyway.
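> Independently of the protocol change, the double {{maybeCompleted()}} call in 
> 4a could be made harmless with an idempotence guard, along these lines (a 
> hedged sketch with hypothetical names, not the actual streaming code):
> {code:java}
> import java.util.concurrent.atomic.AtomicBoolean;
> 
> // Hedged sketch: concurrent calls from the IO thread and the prepareAsync()
> // thread close the session exactly once.
> final class CompletionSketch
> {
>     private final AtomicBoolean completed = new AtomicBoolean();
> 
>     void maybeCompleted()
>     {
>         if (allTransfersAndReceivesDone() && completed.compareAndSet(false, true))
>             closeSessionAndChannels(); // runs at most once
>     }
> 
>     boolean allTransfersAndReceivesDone() { return true; } // placeholder
>     void closeSessionAndChannels() { }                     // placeholder
> }
> {code}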



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Assigned] (CASSANDRA-13606) Improve handling of 2i initialization failures

2020-04-17 Thread Sergio Bossa (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa reassigned CASSANDRA-13606:


Assignee: (was: Sergio Bossa)

> Improve handling of 2i initialization failures
> --
>
> Key: CASSANDRA-13606
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13606
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Feature/2i Index
>Reporter: Sergio Bossa
>Priority: Normal
> Fix For: 4.0
>
>
> CASSANDRA-10130 fixes the 2i build management, but initialization failures 
> are still not properly handled, most notably because:
> * Initialization failures make the index non-queryable, but it can still be 
> written to.
> * Initialization failures can only be recovered from via full rebuilds.
> Both points above are probably suboptimal: the initialization logic 
> could be more complex than just an index build, hence recovery shouldn't be 
> reduced to a simple rebuild, and an initialization failure could warrant 
> making the index fully unavailable not just for reads, but for writes as well.
> So, we should better handle initialization failures, as sketched below, by:
> * Allowing the index implementation to specify whether it is unavailable for 
> reads, writes, or both. 
> * Providing a proper method to recover, distinct from index rebuilds.
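> Roughly, the proposed contract could look like this (a sketch of the idea 
> only; all names are hypothetical):
> {code:java}
> // Hedged sketch of the proposed contract: an index reports which operations
> // remain unavailable after a failed initialization, and exposes a recovery
> // hook distinct from a plain rebuild.
> interface IndexInitFailurePolicy
> {
>     enum Unavailability { READS, WRITES, READS_AND_WRITES }
> 
>     // Which operations to reject while the index is in a failed-init state.
>     Unavailability unavailabilityOnInitFailure();
> 
>     // Recovery entry point, which may do more than rebuilding from SSTables.
>     void recoverFromInitFailure() throws Exception;
> }
> {code}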



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted

2020-04-17 Thread Sergio Bossa (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-10130:
-
Reviewers: Paulo Motta, Sergio Bossa  (was: Paulo Motta)

> Node failure during 2i update after streaming can have incomplete 2i when 
> restarted
> ---
>
> Key: CASSANDRA-10130
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10130
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Coordination
>Reporter: Yuki Morishita
>Assignee: Andres de la Peña
>Priority: Low
> Fix For: 4.0
>
>
> Since the MV/2i update happens after SSTables are received, a node failure 
> during the MV/2i update can leave the received SSTables live on restart while 
> the MV/2i are only partially up to date.
> We can add some kind of tracking mechanism to automatically rebuild at 
> startup, or at least warn the user when the node restarts (see the sketch 
> below).
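> One hedged sketch of such a tracking mechanism (hypothetical names; a real 
> implementation would persist the markers in a system table rather than in 
> memory):
> {code:java}
> // Hedged sketch: mark a table before rebuilding views/indexes for newly
> // streamed SSTables, clear the marker on success, and check it at startup.
> final class RebuildTrackerSketch
> {
>     private final java.util.Set<String> pendingRebuilds =
>             java.util.concurrent.ConcurrentHashMap.newKeySet();
> 
>     void beforeIndexUpdate(String table)  { pendingRebuilds.add(table); }    // persist marker
>     void afterIndexUpdate(String table)   { pendingRebuilds.remove(table); } // clear marker
> 
>     // At startup: anything still marked was interrupted mid-update.
>     void onStartup()
>     {
>         for (String table : pendingRebuilds)
>             System.err.println("WARN: incomplete 2i/MV update on " + table + "; rebuild needed");
>     }
> }
> {code}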



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15667) StreamResultFuture check for completeness is inconsistent, leading to races

2020-04-14 Thread Sergio Bossa (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083146#comment-17083146
 ] 

Sergio Bossa commented on CASSANDRA-15667:
--

[~maxtomassi], [~e.dimitrova], thanks for your PRs. I'm +1 to both.

> StreamResultFuture check for completeness is inconsistent, leading to races
> ---
>
> Key: CASSANDRA-15667
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15667
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Streaming and Messaging
>Reporter: Sergio Bossa
>Assignee: Massimiliano Tomassi
>Priority: Normal
> Fix For: 4.0
>
> Attachments: log_bootstrap_resumable
>
>
> {{StreamResultFuture#maybeComplete()}} uses 
> {{StreamCoordinator#hasActiveSessions()}} to determine if all sessions are 
> completed, but then accesses each session state via 
> {{StreamCoordinator#getAllSessionInfo()}}: this is inconsistent, as the 
> former relies on the actual {{StreamSession}} state, while the latter on the 
> {{SessionInfo}} state, and the two are concurrently updated with no 
> coordination whatsoever.
> This leads to races, apparent e.g. in some spurious dtest failures, such as 
> {{TestBootstrap.resumable_bootstrap_test}} in CASSANDRA-15614 cc 
> [~e.dimitrova].
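> The fix direction can be sketched like this (hypothetical names): both the 
> completeness and the failure checks read the same {{StreamSession}} state, so 
> they cannot disagree:
> {code:java}
> import java.util.Collection;
> 
> // Hedged sketch: completeness and failure are derived from a single state
> // source instead of mixing StreamSession and SessionInfo state.
> final class MaybeCompleteSketch
> {
>     enum State { STREAMING, COMPLETE, FAILED }
>     interface Session { State state(); }
> 
>     // Returns true only when all sessions are done and none has failed.
>     boolean maybeComplete(Collection<Session> sessions)
>     {
>         boolean anyActive = sessions.stream().anyMatch(s -> s.state() == State.STREAMING);
>         if (anyActive)
>             return false;
>         boolean anyFailed = sessions.stream().anyMatch(s -> s.state() == State.FAILED);
>         return !anyFailed;
>     }
> }
> {code}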



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15666) Race condition when completing stream sessions

2020-04-09 Thread Sergio Bossa (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079527#comment-17079527
 ] 

Sergio Bossa commented on CASSANDRA-15666:
--

[~jasonstack] thanks for the follow-ups; I've sent another round of comments.

> Race condition when completing stream sessions
> --
>
> Key: CASSANDRA-15666
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15666
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Streaming and Messaging
>Reporter: Sergio Bossa
>Assignee: ZhaoYang
>Priority: Normal
> Fix For: 4.0
>
>
> {{StreamSession#prepareAsync()}} executes, as the name implies, 
> asynchronously from the IO thread: this opens the door to race conditions 
> between the sending of the {{PrepareSynAckMessage}} and the call to 
> {{StreamSession#maybeCompleted()}}. That is, the following could happen:
> 1) Node A sends {{PrepareSynAckMessage}} from the {{prepareAsync()}} thread.
> 2) Node B receives it and starts streaming.
> 3) Node A receives the streamed file and sends {{ReceivedMessage}}.
> 4) At this point, if this was the only file to stream, both nodes are ready 
> to close the session via {{maybeCompleted()}}, but:
> a) Node A will call it twice from both the IO thread and the thread at #1, 
> closing the session and its channels.
> b) Node B will attempt to send a {{CompleteMessage}}, but will fail because 
> the session has been closed in the meantime.
> There are other subtle variations of the pattern above, depending on the 
> order of concurrently sent/received messages.
> I believe the best fix would be to modify the message exchange so that:
> 1) Only the "follower" is allowed to send the {{CompleteMessage}}.
> 2) Only the "initiator" is allowed to close the session and its channels 
> after receiving the {{CompleteMessage}}.
> By doing so, the message exchange logic would be easier to reason about, 
> which is overall a win anyway.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15667) StreamResultFuture check for completeness is inconsistent, leading to races

2020-04-09 Thread Sergio Bossa (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079258#comment-17079258
 ] 

Sergio Bossa commented on CASSANDRA-15667:
--

Thanks [~e.dimitrova] for chiming in.
{quote}From what I recall the bootstrap was sometimes completing too fast 
before the streaming is really interrupted from the byteman code and we didn't 
really have an instrument to control that.
{quote}
I went through the "resumable bootstrap" test, and unfortunately I don't see 
how the bootstrap could ever complete before the byteman script is invoked: 
this is because such script [makes node 1 fail before it starts to stream 
files|https://github.com/ekaterinadimitrova2/cassandra-dtest/blob/b56887d67c353d6d69cd60cfd74859405fa37685/byteman/4.0/stream_failure.btm#L10],
 which means there's no way for node 3 to finish bootstrapping before it has 
received all files from both nodes, which will never happen due to said script 
causing node 1 to fail.

So why did the test fail?

I believe that's because of this issue: in other words, node 3 was correctly 
seeing its streaming session completed (after node 1 finished streaming with an 
error) but *not* failed; this is because the "completed" state is read through 
the actual session state, while the "failed" state is read through the 
{{SessionInfo}} state, which is what we're fixing here.

That said, I would propose to still re-introduce the original 
{{resumable_bootstrap_test}}: resumable bootstrap is an important enough 
feature to deserve its own test, and that test uses 3 nodes, which increases 
the chances of detecting errors/races.

Thoughts?

> StreamResultFuture check for completeness is inconsistent, leading to races
> ---
>
> Key: CASSANDRA-15667
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15667
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Streaming and Messaging
>Reporter: Sergio Bossa
>Assignee: Massimiliano Tomassi
>Priority: Normal
> Fix For: 4.0
>
>
> {{StreamResultFuture#maybeComplete()}} uses 
> {{StreamCoordinator#hasActiveSessions()}} to determine if all sessions are 
> completed, but then accesses each session state via 
> {{StreamCoordinator#getAllSessionInfo()}}: this is inconsistent, as the 
> former relies on the actual {{StreamSession}} state, while the latter on the 
> {{SessionInfo}} state, and the two are concurrently updated with no 
> coordination whatsoever.
> This leads to races, apparent e.g. in some spurious dtest failures, such as 
> {{TestBootstrap.resumable_bootstrap_test}} in CASSANDRA-15614 cc 
> [~e.dimitrova].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-15700) Performance regression on internode messaging

2020-04-09 Thread Sergio Bossa (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-15700:
-
Source Control Link: https://github.com/apache/cassandra/pull/531

> Performance regression on internode messaging
> -
>
> Key: CASSANDRA-15700
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15700
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Internode
>Reporter: Sergio Bossa
>Assignee: Sergio Bossa
>Priority: Normal
>  Labels: pull-request-available
> Fix For: 4.0-beta
>
> Attachments: Oss40patchedvsOss311.png, Oss40vsOss311.png, oss40.gc, 
> oss40_nogc.tar.xz, oss40_system.log
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [~jasonstack] and I have been investigating a performance regression 
> affecting 4.0 during a 3-node, RF 3 write throughput test with a 
> timeseries-like workload, as shown in this plot, where blue is 3.11 and 
> orange is 4.0:
> !Oss40vsOss311.png|width=389,height=214!
>  It's been a bit of a long investigation, but two clues ended up standing out:
> 1) An abnormal number of expired messages on 4.0 (as shown in the attached 
> system log), while 3.11 has almost none.
> 2) Abnormal GC activity (as shown in the attached gc log).
> It turns out the two are related, as the [on expired 
> callback|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundConnection.java#L462]
>  creates a huge amount of strings in the {{id()}} call. The next question is 
> what causes all those message expirations; we thoroughly reviewed the 
> internode messaging code, and the only issue we have found so far is related 
> to the "batch pruning" calls 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L81]
>  and 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L188]:
>  it _seems_ too much time is spent on those, causing the event loop to fall 
> behind in processing the rest of the messages, which will end up being 
> expired. This is supported by the analysis of the collapsed stacks (after 
> fixing the GC issue):
> {noformat}
> (tprint (top-aggregated-calls oss40nogc "EventLoopDelivery:doRun" 5))
> org/apache/cassandra/net/OutboundConnection$EventLoopDelivery:doRun 3456
> org/apache/cassandra/net/OutboundMessageQueue:access$600 1621
> org/apache/cassandra/net/PrunableArrayQueue:prune 1621
> org/apache/cassandra/net/OutboundMessageQueue$WithLock:close 1621
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1620
> {noformat}
> Those are the top 5 sampled calls from {{EventLoopDelivery#doRun()}} which 
> spends half of its time pruning. But only a tiny portion of such pruning time 
> is spent actually expiring:
> {noformat}
> (tprint (top-aggregated-calls oss40nogc 
> "OutboundMessageQueue:pruneInternalQueueWithLock" 5))
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1900
> org/apache/cassandra/net/PrunableArrayQueue:prune 1894
> org/apache/cassandra/net/OutboundMessageQueue$1Pruner:onPruned 147
> org/apache/cassandra/net/OutboundConnection$$Lambda$444/740904487:accept 147
> org/apache/cassandra/net/OutboundConnection:onExpired 147
> {noformat}
> And indeed, the {{PrunableArrayQueue:prune()}} self time is dominant:
> {noformat}
> (tprint (top-self-calls oss40nogc "PrunableArrayQueue:prune" 5))
> org/apache/cassandra/net/PrunableArrayQueue:prune 1718
> org/apache/cassandra/net/OutboundConnection:releaseCapacity 27
> java/util/concurrent/ConcurrentHashMap:replaceNode 19
> java/util/concurrent/ConcurrentLinkedQueue:offer 16
> java/util/concurrent/LinkedBlockingQueue:offer 15
> {noformat}
> That said, before proceeding with a PR to fix those issues, I'd like to 
> understand: what's the reason to prune so often, rather than just when 
> polling the message during delivery? If there's a reason I'm missing, let's 
> talk about how to optimize pruning, otherwise let's get rid of that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-15700) Performance regression on internode messaging

2020-04-09 Thread Sergio Bossa (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-15700:
-
Test and Documentation Plan: See attached performance test plots. Also see 
unit tests.
 Status: Patch Available  (was: In Progress)

> Performance regression on internode messaging
> -
>
> Key: CASSANDRA-15700
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15700
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Internode
>Reporter: Sergio Bossa
>Assignee: Sergio Bossa
>Priority: Normal
>  Labels: pull-request-available
> Fix For: 4.0-beta
>
> Attachments: Oss40patchedvsOss311.png, Oss40vsOss311.png, oss40.gc, 
> oss40_nogc.tar.xz, oss40_system.log
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [~jasonstack] and I have been investigating a performance regression 
> affecting 4.0 during a 3-node, RF 3 write throughput test with a 
> timeseries-like workload, as shown in this plot, where blue is 3.11 and 
> orange is 4.0:
> !Oss40vsOss311.png|width=389,height=214!
>  It's been a bit of a long investigation, but two clues ended up standing out:
> 1) An abnormal number of expired messages on 4.0 (as shown in the attached 
> system log), while 3.11 has almost none.
> 2) Abnormal GC activity (as shown in the attached gc log).
> It turns out the two are related, as the [on expired 
> callback|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundConnection.java#L462]
>  creates a huge amount of strings in the {{id()}} call. The next question is 
> what causes all those message expirations; we thoroughly reviewed the 
> internode messaging code, and the only issue we have found so far is related 
> to the "batch pruning" calls 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L81]
>  and 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L188]:
>  it _seems_ too much time is spent on those, causing the event loop to fall 
> behind in processing the rest of the messages, which will end up being 
> expired. This is supported by the analysis of the collapsed stacks (after 
> fixing the GC issue):
> {noformat}
> (tprint (top-aggregated-calls oss40nogc "EventLoopDelivery:doRun" 5))
> org/apache/cassandra/net/OutboundConnection$EventLoopDelivery:doRun 3456
> org/apache/cassandra/net/OutboundMessageQueue:access$600 1621
> org/apache/cassandra/net/PrunableArrayQueue:prune 1621
> org/apache/cassandra/net/OutboundMessageQueue$WithLock:close 1621
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1620
> {noformat}
> Those are the top 5 sampled calls from {{EventLoopDelivery#doRun()}} which 
> spends half of its time pruning. But only a tiny portion of such pruning time 
> is spent actually expiring:
> {noformat}
> (tprint (top-aggregated-calls oss40nogc 
> "OutboundMessageQueue:pruneInternalQueueWithLock" 5))
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1900
> org/apache/cassandra/net/PrunableArrayQueue:prune 1894
> org/apache/cassandra/net/OutboundMessageQueue$1Pruner:onPruned 147
> org/apache/cassandra/net/OutboundConnection$$Lambda$444/740904487:accept 147
> org/apache/cassandra/net/OutboundConnection:onExpired 147
> {noformat}
> And indeed, the {{PrunableArrayQueue:prune()}} self time is dominant:
> {noformat}
> (tprint (top-self-calls oss40nogc "PrunableArrayQueue:prune" 5))
> org/apache/cassandra/net/PrunableArrayQueue:prune 1718
> org/apache/cassandra/net/OutboundConnection:releaseCapacity 27
> java/util/concurrent/ConcurrentHashMap:replaceNode 19
> java/util/concurrent/ConcurrentLinkedQueue:offer 16
> java/util/concurrent/LinkedBlockingQueue:offer 15
> {noformat}
> That said, before proceeding with a PR to fix those issues, I'd like to 
> understand: what's the reason to prune so often, rather than just when 
> polling the message during delivery? If there's a reason I'm missing, let's 
> talk about how to optimize pruning, otherwise let's get rid of that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15700) Performance regression on internode messaging

2020-04-09 Thread Sergio Bossa (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079208#comment-17079208
 ] 

Sergio Bossa commented on CASSANDRA-15700:
--

Added some more thorough unit tests in the meantime.

> Performance regression on internode messaging
> -
>
> Key: CASSANDRA-15700
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15700
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Internode
>Reporter: Sergio Bossa
>Assignee: Sergio Bossa
>Priority: Normal
>  Labels: pull-request-available
> Fix For: 4.0-beta
>
> Attachments: Oss40patchedvsOss311.png, Oss40vsOss311.png, oss40.gc, 
> oss40_nogc.tar.xz, oss40_system.log
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [~jasonstack] and I have been investigating a performance regression 
> affecting 4.0 during a 3-node, RF 3 write throughput test with a 
> timeseries-like workload, as shown in this plot, where blue is 3.11 and 
> orange is 4.0:
> !Oss40vsOss311.png|width=389,height=214!
>  It's been a bit of a long investigation, but two clues ended up standing out:
> 1) An abnormal number of expired messages on 4.0 (as shown in the attached 
> system log), while 3.11 has almost none.
> 2) Abnormal GC activity (as shown in the attached gc log).
> It turns out the two are related, as the [on expired 
> callback|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundConnection.java#L462]
>  creates a huge amount of strings in the {{id()}} call. The next question is 
> what causes all those message expirations; we thoroughly reviewed the 
> internode messaging code, and the only issue we have found so far is related 
> to the "batch pruning" calls 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L81]
>  and 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L188]:
>  it _seems_ too much time is spent on those, causing the event loop to fall 
> behind in processing the rest of the messages, which will end up being 
> expired. This is supported by the analysis of the collapsed stacks (after 
> fixing the GC issue):
> {noformat}
> (tprint (top-aggregated-calls oss40nogc "EventLoopDelivery:doRun" 5))
> org/apache/cassandra/net/OutboundConnection$EventLoopDelivery:doRun 3456
> org/apache/cassandra/net/OutboundMessageQueue:access$600 1621
> org/apache/cassandra/net/PrunableArrayQueue:prune 1621
> org/apache/cassandra/net/OutboundMessageQueue$WithLock:close 1621
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1620
> {noformat}
> Those are the top 5 sampled calls from {{EventLoopDelivery#doRun()}} which 
> spends half of its time pruning. But only a tiny portion of such pruning time 
> is spent actually expiring:
> {noformat}
> (tprint (top-aggregated-calls oss40nogc 
> "OutboundMessageQueue:pruneInternalQueueWithLock" 5))
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1900
> org/apache/cassandra/net/PrunableArrayQueue:prune 1894
> org/apache/cassandra/net/OutboundMessageQueue$1Pruner:onPruned 147
> org/apache/cassandra/net/OutboundConnection$$Lambda$444/740904487:accept 147
> org/apache/cassandra/net/OutboundConnection:onExpired 147
> {noformat}
> And indeed, the {{PrunableArrayQueue:prune()}} self time is dominant:
> {noformat}
> (tprint (top-self-calls oss40nogc "PrunableArrayQueue:prune" 5))
> org/apache/cassandra/net/PrunableArrayQueue:prune 1718
> org/apache/cassandra/net/OutboundConnection:releaseCapacity 27
> java/util/concurrent/ConcurrentHashMap:replaceNode 19
> java/util/concurrent/ConcurrentLinkedQueue:offer 16
> java/util/concurrent/LinkedBlockingQueue:offer 15
> {noformat}
> That said, before proceeding with a PR to fix those issues, I'd like to 
> understand: what's the reason to prune so often, rather than just when 
> polling the message during delivery? If there's a reason I'm missing, let's 
> talk about how to optimize pruning, otherwise let's get rid of that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-15700) Performance regression on internode messaging

2020-04-09 Thread Sergio Bossa (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-15700:
-
Authors: Sergio Bossa  (was: Sergio Bossa, ZhaoYang)

> Performance regression on internode messaging
> -
>
> Key: CASSANDRA-15700
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15700
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Internode
>Reporter: Sergio Bossa
>Assignee: Sergio Bossa
>Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: Oss40patchedvsOss311.png, Oss40vsOss311.png, oss40.gc, 
> oss40_nogc.tar.xz, oss40_system.log
>
>
> [~jasonstack] and I have been investigating a performance regression 
> affecting 4.0 during a 3-node, RF 3 write throughput test with a 
> timeseries-like workload, as shown in this plot, where blue is 3.11 and 
> orange is 4.0:
> !Oss40vsOss311.png|width=389,height=214!
>  It's been a bit of a long investigation, but two clues ended up standing out:
> 1) An abnormal number of expired messages on 4.0 (as shown in the attached 
> system log), while 3.11 has almost none.
> 2) Abnormal GC activity (as shown in the attached gc log).
> It turns out the two are related, as the [on expired 
> callback|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundConnection.java#L462]
>  creates a huge amount of strings in the {{id()}} call. The next question is 
> what causes all those message expirations; we thoroughly reviewed the 
> internode messaging code, and the only issue we have found so far is related 
> to the "batch pruning" calls 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L81]
>  and 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L188]:
>  it _seems_ too much time is spent on those, causing the event loop to fall 
> behind in processing the rest of the messages, which will end up being 
> expired. This is supported by the analysis of the collapsed stacks (after 
> fixing the GC issue):
> {noformat}
> (tprint (top-aggregated-calls oss40nogc "EventLoopDelivery:doRun" 5))
> org/apache/cassandra/net/OutboundConnection$EventLoopDelivery:doRun 3456
> org/apache/cassandra/net/OutboundMessageQueue:access$600 1621
> org/apache/cassandra/net/PrunableArrayQueue:prune 1621
> org/apache/cassandra/net/OutboundMessageQueue$WithLock:close 1621
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1620
> {noformat}
> Those are the top 5 sampled calls from {{EventLoopDelivery#doRun()}} which 
> spends half of its time pruning. But only a tiny portion of such pruning time 
> is spent actually expiring:
> {noformat}
> (tprint (top-aggregated-calls oss40nogc 
> "OutboundMessageQueue:pruneInternalQueueWithLock" 5))
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1900
> org/apache/cassandra/net/PrunableArrayQueue:prune 1894
> org/apache/cassandra/net/OutboundMessageQueue$1Pruner:onPruned 147
> org/apache/cassandra/net/OutboundConnection$$Lambda$444/740904487:accept 147
> org/apache/cassandra/net/OutboundConnection:onExpired 147
> {noformat}
> And indeed, the {{PrunableArrayQueue:prune()}} self time is dominant:
> {noformat}
> (tprint (top-self-calls oss40nogc "PrunableArrayQueue:prune" 5))
> org/apache/cassandra/net/PrunableArrayQueue:prune 1718
> org/apache/cassandra/net/OutboundConnection:releaseCapacity 27
> java/util/concurrent/ConcurrentHashMap:replaceNode 19
> java/util/concurrent/ConcurrentLinkedQueue:offer 16
> java/util/concurrent/LinkedBlockingQueue:offer 15
> {noformat}
> That said, before proceeding with a PR to fix those issues, I'd like to 
> understand: what's the reason to prune so often, rather than just when 
> polling the message during delivery? If there's a reason I'm missing, let's 
> talk about how to optimize pruning, otherwise let's get rid of that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15700) Performance regression on internode messaging

2020-04-09 Thread Sergio Bossa (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079102#comment-17079102
 ] 

Sergio Bossa commented on CASSANDRA-15700:
--

Thank you both.

> Performance regression on internode messaging
> -
>
> Key: CASSANDRA-15700
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15700
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Internode
>Reporter: Sergio Bossa
>Assignee: Sergio Bossa
>Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: Oss40patchedvsOss311.png, Oss40vsOss311.png, oss40.gc, 
> oss40_nogc.tar.xz, oss40_system.log
>
>
> [~jasonstack] and I have been investigating a performance regression 
> affecting 4.0 during a 3-node, RF 3 write throughput test with a 
> timeseries-like workload, as shown in this plot, where blue is 3.11 and 
> orange is 4.0:
> !Oss40vsOss311.png|width=389,height=214!
>  It's been a bit of a long investigation, but two clues ended up standing out:
> 1) An abnormal number of expired messages on 4.0 (as shown in the attached 
> system log), while 3.11 has almost none.
> 2) Abnormal GC activity (as shown in the attached gc log).
> It turns out the two are related, as the [on expired 
> callback|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundConnection.java#L462]
>  creates a huge amount of strings in the {{id()}} call. The next question is 
> what causes all those message expirations; we thoroughly reviewed the 
> internode messaging code, and the only issue we have found so far is related 
> to the "batch pruning" calls 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L81]
>  and 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L188]:
>  it _seems_ too much time is spent on those, causing the event loop to fall 
> behind in processing the rest of the messages, which will end up being 
> expired. This is supported by the analysis of the collapsed stacks (after 
> fixing the GC issue):
> {noformat}
> (tprint (top-aggregated-calls oss40nogc "EventLoopDelivery:doRun" 5))
> org/apache/cassandra/net/OutboundConnection$EventLoopDelivery:doRun 3456
> org/apache/cassandra/net/OutboundMessageQueue:access$600 1621
> org/apache/cassandra/net/PrunableArrayQueue:prune 1621
> org/apache/cassandra/net/OutboundMessageQueue$WithLock:close 1621
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1620
> {noformat}
> Those are the top 5 sampled calls from {{EventLoopDelivery#doRun()}} which 
> spends half of its time pruning. But only a tiny portion of such pruning time 
> is spent actually expiring:
> {noformat}
> (tprint (top-aggregated-calls oss40nogc 
> "OutboundMessageQueue:pruneInternalQueueWithLock" 5))
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1900
> org/apache/cassandra/net/PrunableArrayQueue:prune 1894
> org/apache/cassandra/net/OutboundMessageQueue$1Pruner:onPruned 147
> org/apache/cassandra/net/OutboundConnection$$Lambda$444/740904487:accept 147
> org/apache/cassandra/net/OutboundConnection:onExpired 147
> {noformat}
> And indeed, the {{PrunableArrayQueue:prune()}} self time is dominant:
> {noformat}
> (tprint (top-self-calls oss40nogc "PrunableArrayQueue:prune" 5))
> org/apache/cassandra/net/PrunableArrayQueue:prune 1718
> org/apache/cassandra/net/OutboundConnection:releaseCapacity 27
> java/util/concurrent/ConcurrentHashMap:replaceNode 19
> java/util/concurrent/ConcurrentLinkedQueue:offer 16
> java/util/concurrent/LinkedBlockingQueue:offer 15
> {noformat}
> That said, before proceeding with a PR to fix those issues, I'd like to 
> understand: what's the reason to prune so often, rather than just when 
> polling the message during delivery? If there's a reason I'm missing, let's 
> talk about how to optimize pruning, otherwise let's get rid of that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15700) Performance regression on internode messaging

2020-04-08 Thread Sergio Bossa (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078510#comment-17078510
 ] 

Sergio Bossa commented on CASSANDRA-15700:
--

{quote}Honestly though it sounds like there is not much between our proposals 
now, which is reassuring, so perhaps we should focus on the common ground we 
have. 
{quote}
Indeed, I think we agree on the necessity to fix this in the least risky way, 
which seems to mean keeping the current pruning implementation and accuracy, 
but avoiding running it *at every single delivery* (you might dispute that 
this is the problem, but the collapsed stacks speak clearly about it).

In the spirit of that, I've fixed the message queue algorithm to compute the 
expiration deadline in a way that can be used to actually run the pruning task 
_only after such a deadline_. I'll give it another review on my own and 
possibly add more unit tests (please note the current implementation seems to 
have had none at all), but performance tests now look much better (orange is 
patched 4.0):

!Oss40patchedvsOss311.png|width=556,height=299!

Here's the branch: 
[https://github.com/sbtourist/cassandra/commits/CASSANDRA-15700]
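
The gist of the change, as a hedged sketch (hypothetical names, not the actual 
commit): the delivery loop consults the tracked deadline and only pays for a 
prune once it has actually passed.
{code:java}
// Hedged sketch: prune only once the tracked expiration deadline has passed,
// so the O(n) scan leaves the hot delivery path.
final class DeliverySketch
{
    private long nextExpirationDeadline = Long.MAX_VALUE;

    void doRun(long nowNanos)
    {
        if (nowNanos >= nextExpirationDeadline)                      // cheap per-run check
            nextExpirationDeadline = pruneAndNextDeadline(nowNanos); // O(n), now rare
        deliverPending();                                            // common, prune-free path
    }

    long pruneAndNextDeadline(long nowNanos) { return Long.MAX_VALUE; } // placeholder
    void deliverPending() { }                                           // placeholder
}
{code}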

I understand there's no _panic_ to fix it, and you'll be away the next couple 
weeks, but this means realistically postponing this issue for at least 3 weeks, 
which will add to the delay we're already accumulating in 4.0 for other 
reasons, so maybe you could delegate this to [~aleksey] as you mentioned?

> Performance regression on internode messaging
> -
>
> Key: CASSANDRA-15700
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15700
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Internode
>Reporter: Sergio Bossa
>Assignee: Sergio Bossa
>Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: Oss40patchedvsOss311.png, Oss40vsOss311.png, oss40.gc, 
> oss40_nogc.tar.xz, oss40_system.log
>
>
> [~jasonstack] and I have been investigating a performance regression 
> affecting 4.0 during a 3-node, RF 3 write throughput test with a 
> timeseries-like workload, as shown in this plot, where blue is 3.11 and 
> orange is 4.0:
> !Oss40vsOss311.png|width=389,height=214!
>  It's been a bit of a long investigation, but two clues ended up standing out:
> 1) An abnormal number of expired messages on 4.0 (as shown in the attached 
> system log), while 3.11 has almost none.
> 2) Abnormal GC activity (as shown in the attached gc log).
> It turns out the two are related, as the [on expired 
> callback|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundConnection.java#L462]
>  creates a huge amount of strings in the {{id()}} call. The next question is 
> what causes all those message expirations; we thoroughly reviewed the 
> internode messaging code, and the only issue we have found so far is related 
> to the "batch pruning" calls 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L81]
>  and 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L188]:
>  it _seems_ too much time is spent on those, causing the event loop to fall 
> behind in processing the rest of the messages, which will end up being 
> expired. This is supported by the analysis of the collapsed stacks (after 
> fixing the GC issue):
> {noformat}
> (tprint (top-aggregated-calls oss40nogc "EventLoopDelivery:doRun" 5))
> org/apache/cassandra/net/OutboundConnection$EventLoopDelivery:doRun 3456
> org/apache/cassandra/net/OutboundMessageQueue:access$600 1621
> org/apache/cassandra/net/PrunableArrayQueue:prune 1621
> org/apache/cassandra/net/OutboundMessageQueue$WithLock:close 1621
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1620
> {noformat}
> Those are the top 5 sampled calls from {{EventLoopDelivery#doRun()}} which 
> spends half of its time pruning. But only a tiny portion of such pruning time 
> is spent actually expiring:
> {noformat}
> (tprint (top-aggregated-calls oss40nogc 
> "OutboundMessageQueue:pruneInternalQueueWithLock" 5))
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1900
> org/apache/cassandra/net/PrunableArrayQueue:prune 1894
> org/apache/cassandra/net/OutboundMessageQueue$1Pruner:onPruned 147
> org/apache/cassandra/net/OutboundConnection$$Lambda$444/740904487:accept 147
> org/apache/cassandra/net/OutboundConnection:onExpired 147
> {noformat}
> And indeed, the {{PrunableArrayQueue:prune()}} self time is dominant:
> {noformat}
> (tprint (top-self-calls oss40nogc "PrunableArrayQueue:prune" 5))
> org/apache/cassandra/net/PrunableArrayQueue:prune 1718
> 

[jira] [Updated] (CASSANDRA-15700) Performance regression on internode messaging

2020-04-08 Thread Sergio Bossa (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-15700:
-
Attachment: Oss40patchedvsOss311.png

> Performance regression on internode messaging
> -
>
> Key: CASSANDRA-15700
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15700
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Internode
>Reporter: Sergio Bossa
>Assignee: Sergio Bossa
>Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: Oss40patchedvsOss311.png, Oss40vsOss311.png, oss40.gc, 
> oss40_nogc.tar.xz, oss40_system.log
>
>
> [~jasonstack] and I have been investigating a performance regression 
> affecting 4.0 during a 3-node, RF 3 write throughput test with a 
> timeseries-like workload, as shown in this plot, where blue is 3.11 and 
> orange is 4.0:
> !Oss40vsOss311.png|width=389,height=214!
>  It's been a bit of a long investigation, but two clues ended up standing out:
> 1) An abnormal number of expired messages on 4.0 (as shown in the attached 
> system log), while 3.11 has almost none.
> 2) Abnormal GC activity (as shown in the attached gc log).
> It turns out the two are related, as the [on expired 
> callback|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundConnection.java#L462]
>  creates a huge amount of strings in the {{id()}} call. The next question is 
> what causes all those message expirations; we thoroughly reviewed the 
> internode messaging code, and the only issue we have found so far is related 
> to the "batch pruning" calls 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L81]
>  and 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L188]:
>  it _seems_ too much time is spent on those, causing the event loop to fall 
> behind in processing the rest of the messages, which will end up being 
> expired. This is supported by the analysis of the collapsed stacks (after 
> fixing the GC issue):
> {noformat}
> (tprint (top-aggregated-calls oss40nogc "EventLoopDelivery:doRun" 5))
> org/apache/cassandra/net/OutboundConnection$EventLoopDelivery:doRun 3456
> org/apache/cassandra/net/OutboundMessageQueue:access$600 1621
> org/apache/cassandra/net/PrunableArrayQueue:prune 1621
> org/apache/cassandra/net/OutboundMessageQueue$WithLock:close 1621
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1620
> {noformat}
> Those are the top 5 sampled calls from {{EventLoopDelivery#doRun()}} which 
> spends half of its time pruning. But only a tiny portion of such pruning time 
> is spent actually expiring:
> {noformat}
> (tprint (top-aggregated-calls oss40nogc 
> "OutboundMessageQueue:pruneInternalQueueWithLock" 5))
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1900
> org/apache/cassandra/net/PrunableArrayQueue:prune 1894
> org/apache/cassandra/net/OutboundMessageQueue$1Pruner:onPruned 147
> org/apache/cassandra/net/OutboundConnection$$Lambda$444/740904487:accept 147
> org/apache/cassandra/net/OutboundConnection:onExpired 147
> {noformat}
> And indeed, the {{PrunableArrayQueue:prune()}} self time is dominant:
> {noformat}
> (tprint (top-self-calls oss40nogc "PrunableArrayQueue:prune" 5))
> org/apache/cassandra/net/PrunableArrayQueue:prune 1718
> org/apache/cassandra/net/OutboundConnection:releaseCapacity 27
> java/util/concurrent/ConcurrentHashMap:replaceNode 19
> java/util/concurrent/ConcurrentLinkedQueue:offer 16
> java/util/concurrent/LinkedBlockingQueue:offer 15
> {noformat}
> That said, before proceeding with a PR to fix those issues, I'd like to 
> understand: what's the reason to prune so often, rather than just when 
> polling the message during delivery? If there's a reason I'm missing, let's 
> talk about how to optimize pruning, otherwise let's get rid of that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15700) Performance regression on internode messaging

2020-04-08 Thread Sergio Bossa (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078042#comment-17078042
 ] 

Sergio Bossa commented on CASSANDRA-15700:
--

{quote}This would be a regression from 3.0 unfortunately
{quote}
Oh that's a good point, forgot we do that in 3.0+ too. We can keep the 
{{enqueue()}} pruning, as that's not the worst offender (see collapsed stacks).
{quote}I assume you mean to maintain a guess of the number of items we _on 
average_ have in the queue?
{quote}
No, I meant to compute the "next expire time" as an approximation from the 
expire times of the processed messages, rather than relying on computing it 
exactly via the pruner at every event loop delivery run.
{quote}One of the features running through this work is that we make strong 
guarantees, and this would weaken that.  I would prefer to be able to stipulate 
that (e.g.) we never reject an {{enqueue}} if more than 20% of the queue is 
already expired.
{quote}
How is that related to what we're discussing? Enqueuing new messages is 
regulated by memory limiting, so that's what gives us strong guarantees; also, 
we already prune the backlog when memory limits are reached.
{quote}messing with important control flow semantics is something I would 
prefer to avoid.
{quote}
I'd still like to understand how the pruning approach we're discussing here is 
important to the control flow semantics at all, as I don't think I've got a 
clear answer yet, although it might be me missing the point. What I've 
heard/understood is:

1) It protects us against saturating memory upon network stalls.

2) It protects us against saturating memory upon too many expired messages.

AFAIU, none of those is accurate, as the current implementation doesn't satisfy 
#1, and #2 is covered by the memory limits implementation.
{quote}I think the least risky approach is to change how we compute and select 
the expiration time we use for triggering an expiration. This also has the 
benefit of maintaining well-defined guarantees, is simple, and modifies no 
behaviours besides the selection of this value.
{quote}
This is something that can definitely be tried first, to reduce the number of 
pruner runs. I will test this next.
{quote}More specifically (e.g.), you pick a time {{t}}, such that you expect 
the set of messages with expiration less than {{t}}, say {{M(<t)}} [...]
{quote}

> Performance regression on internode messaging
> -
>
> Key: CASSANDRA-15700
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15700
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Internode
>Reporter: Sergio Bossa
>Assignee: Sergio Bossa
>Priority: Normal
> Fix For: 4.0
>
> Attachments: Oss40vsOss311.png, oss40.gc, oss40_nogc.tar.xz, 
> oss40_system.log
>
>
> [~jasonstack] and I have been investigating a performance regression 
> affecting 4.0 during a 3-node, RF 3 write throughput test with a 
> timeseries-like workload, as shown in this plot, where blue is 3.11 and orange is 4.0:
> !Oss40vsOss311.png|width=389,height=214!
>  It's been a bit of a long investigation, but two clues ended up standing out:
> 1) An abnormal number of expired messages on 4.0 (as shown in the attached  
> system log), while 3.11 has almost none.
> 2) An abnormal GC activity (as shown in the attached gc log).
> Turns out the two are related, as the [on expired 
> callback|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundConnection.java#L462]
>  creates a huge amount of strings in the {{id()}} call. The next question is 
> what causes all those message expirations; we thoroughly reviewed the 
> internode messaging code and the only issue we could find so far is related 
> to the "batch pruning" calls 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L81]
>  and 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L188]:
>  it _seems_ too much time is spent on those, causing the event loop to fall 
> behind in processing the rest of the messages, which will end up being 
> expired. This is supported by the analysis of the collapsed stacks (after 
> fixing the GC issue):
> {noformat}
> (tprint (top-aggregated-calls oss40nogc "EventLoopDelivery:doRun" 5))
> org/apache/cassandra/net/OutboundConnection$EventLoopDelivery:doRun 3456
> org/apache/cassandra/net/OutboundMessageQueue:access$600 1621
> org/apache/cassandra/net/PrunableArrayQueue:prune 1621
> org/apache/cassandra/net/OutboundMessageQueue$WithLock:close 1621
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1620
> {noformat}
> Those are the top 5 sampled calls from {{EventLoopDelivery#doRun()}} which 
> spends half of its time pruning. [...]

[jira] [Updated] (CASSANDRA-15667) StreamResultFuture check for completeness is inconsistent, leading to races

2020-04-07 Thread Sergio Bossa (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-15667:
-
Reviewers: Sergio Bossa, ZhaoYang, Sergio Bossa  (was: Sergio Bossa, ZhaoYang)
   Status: Review In Progress  (was: Patch Available)

+1

[~maxtomassi] (and [~e.dimitrova]), I think we should reintroduce 
{{TestBootstrap.resumable_bootstrap_test}} and check if it was actually failing 
due to this race: what do you think?

> StreamResultFuture check for completeness is inconsistent, leading to races
> ---
>
> Key: CASSANDRA-15667
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15667
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Streaming and Messaging
>Reporter: Sergio Bossa
>Assignee: Massimiliano Tomassi
>Priority: Normal
> Fix For: 4.0
>
>
> {{StreamResultFuture#maybeComplete()}} uses 
> {{StreamCoordinator#hasActiveSessions()}} to determine if all sessions are 
> completed, but then accesses each session state via 
> {{StreamCoordinator#getAllSessionInfo()}}: this is inconsistent, as the 
> former relies on the actual {{StreamSession}} state, while the latter on the 
> {{SessionInfo}} state, and the two are concurrently updated with no 
> coordination whatsoever.
> This leads to races, e.g. apparent in some spurious dtest failures, such as 
> {{TestBootstrap.resumable_bootstrap_test}} in CASSANDRA-15614 cc 
> [~e.dimitrova].






[jira] [Commented] (CASSANDRA-15666) Race condition when completing stream sessions

2020-04-07 Thread Sergio Bossa (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077478#comment-17077478
 ] 

Sergio Bossa commented on CASSANDRA-15666:
--

{quote}Let's see what [~blerer] has to say
{quote}
Let's not delay this fix further, unless [~blerer] really wants to chime in?

> Race condition when completing stream sessions
> --
>
> Key: CASSANDRA-15666
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15666
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Streaming and Messaging
>Reporter: Sergio Bossa
>Assignee: ZhaoYang
>Priority: Normal
> Fix For: 4.0
>
>
> {{StreamSession#prepareAsync()}} executes, as the name implies, 
> asynchronously from the IO thread: this opens the door to race conditions between 
> the sending of the {{PrepareSynAckMessage}} and the call to 
> {{StreamSession#maybeCompleted()}}. E.g., the following could happen:
> 1) Node A sends {{PrepareSynAckMessage}} from the {{prepareAsync()}} thread.
> 2) Node B receives it and starts streaming.
> 3) Node A receives the streamed file and sends {{ReceivedMessage}}.
> 4) At this point, if this was the only file to stream, both nodes are ready 
> to close the session via {{maybeCompleted()}}, but:
> a) Node A will call it twice from both the IO thread and the thread at #1, 
> closing the session and its channels.
> b) Node B will attempt to send a {{CompleteMessage}}, but will fail because 
> the session has been closed in the meantime.
> There are other subtle variations of the pattern above, depending on the 
> order of concurrently sent/received messages.
> I believe the best fix would be to modify the message exchange so that:
> 1) Only the "follower" is allowed to send the {{CompleteMessage}}.
> 2) Only the "initiator" is allowed to close the session and its channels 
> after receiving the {{CompleteMessage}}.
> By doing so, the message exchange logic would be easier to reason about, 
> which is overall a win anyway.






[jira] [Commented] (CASSANDRA-15700) Performance regression on internode messaging

2020-04-07 Thread Sergio Bossa (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077469#comment-17077469
 ] 

Sergio Bossa commented on CASSANDRA-15700:
--

{quote}So, I'm actually misremembering: though I had thought we scheduled 
exactly what you suggest
{quote}
It's ok, I do also forget most stuff past the 24hrs cut-off :)

On a more serious note, I get your point about the need to be precise vs. 
amortising costs, but I'm wondering what data we have to help with such a 
decision, if any.

We definitely have evidence that the _current_ implementation massively affects 
performance, and given the need to stabilise the codebase ahead of the 4.0 
release, I would personally opt for a fix which doesn't increase overall 
complexity, but rather strikes a balance between complexity and the desired 
accuracy and performance, along the lines of:

1) Remove pruning and expire time tracking from enqueue.

2) Remove pruning from the event loop delivery run.

3) During event loop delivery:

a) Compute a weighted average of the processed messages' expire times.

b) If potentially stalling (flushing over the high water mark), schedule a 
background pruning using the computed expire time.

c) If resuming from stalling before the pruning deadline, cancel it (because 
we'll keep pruning as we process). 

This should be reasonably easy to implement, have decent accuracy, and perform 
well, as the expire deadline would be computed on the single-threaded event 
loop (no contention) and the pruner would only run if actually stalling (which 
should *not* be the case most of the time).
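In rough code, just to make the shape of 3a-3c concrete (every name below is hypothetical, and I'm using a plain {{ScheduledExecutorService}} only as a placeholder for whatever timer we'd actually pick):

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of steps 3a-3c above, not real Cassandra code.
public class StallPruningSketch
{
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    private double avgExpiresAtNanos = Double.NaN; // 3a: weighted average, event loop thread only
    private ScheduledFuture<?> pendingPrune;       // 3b/3c: at most one background prune pending

    void onDelivered(long expiresAtNanos)          // called per message from the delivery loop
    {
        avgExpiresAtNanos = Double.isNaN(avgExpiresAtNanos)
                          ? expiresAtNanos
                          : 0.9 * avgExpiresAtNanos + 0.1 * expiresAtNanos;
    }

    void onStall(long nowNanos)                    // 3b: flushing over the high water mark
    {
        long delay = Math.max(0, (long) avgExpiresAtNanos - nowNanos);
        pendingPrune = timer.schedule(this::pruneExpired, delay, TimeUnit.NANOSECONDS);
    }

    void onResume()                                // 3c: writable again before the deadline
    {
        if (pendingPrune != null)
            pendingPrune.cancel(false);            // normal delivery will keep pruning as it polls
    }

    private void pruneExpired() { /* run the existing pruner against the backlog */ }
}
{code}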

How does that sound? Am I missing anything?

 

> Performance regression on internode messaging
> -
>
> Key: CASSANDRA-15700
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15700
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Internode
>Reporter: Sergio Bossa
>Assignee: Sergio Bossa
>Priority: Normal
> Fix For: 4.0
>
> Attachments: Oss40vsOss311.png, oss40.gc, oss40_nogc.tar.xz, 
> oss40_system.log
>
>
> [~jasonstack] and I have been investigating a performance regression 
> affecting 4.0 during a 3-node, RF 3 write throughput test with a 
> timeseries-like workload, as shown in this plot, where blue is 3.11 and orange is 4.0:
> !Oss40vsOss311.png|width=389,height=214!
>  It's been a bit of a long investigation, but two clues ended up standing out:
> 1) An abnormal number of expired messages on 4.0 (as shown in the attached  
> system log), while 3.11 has almost none.
> 2) An abnormal GC activity (as shown in the attached gc log).
> Turns out the two are related, as the [on expired 
> callback|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundConnection.java#L462]
>  creates a huge amount of strings in the {{id()}} call. The next question is 
> what causes all those message expirations; we thoroughly reviewed the 
> internode messaging code and the only issue we could find so far is related 
> to the "batch pruning" calls 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L81]
>  and 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L188]:
>  it _seems_ too much time is spent on those, causing the event loop to fall 
> behind in processing the rest of the messages, which will end up being 
> expired. This is supported by the analysis of the collapsed stacks (after 
> fixing the GC issue):
> {noformat}
> (tprint (top-aggregated-calls oss40nogc "EventLoopDelivery:doRun" 5))
> org/apache/cassandra/net/OutboundConnection$EventLoopDelivery:doRun 3456
> org/apache/cassandra/net/OutboundMessageQueue:access$600 1621
> org/apache/cassandra/net/PrunableArrayQueue:prune 1621
> org/apache/cassandra/net/OutboundMessageQueue$WithLock:close 1621
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1620
> {noformat}
> Those are the top 5 sampled calls from {{EventLoopDelivery#doRun()}} which 
> spends half of its time pruning. But only a tiny portion of such pruning time 
> is spent actually expiring:
> {noformat}
> (tprint (top-aggregated-calls oss40nogc 
> "OutboundMessageQueue:pruneInternalQueueWithLock" 5))
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1900
> org/apache/cassandra/net/PrunableArrayQueue:prune 1894
> org/apache/cassandra/net/OutboundMessageQueue$1Pruner:onPruned 147
> org/apache/cassandra/net/OutboundConnection$$Lambda$444/740904487:accept 147
> org/apache/cassandra/net/OutboundConnection:onExpired 147
> {noformat}
> And indeed, the {{PrunableArrayQueue:prune()}} self time is dominant:
> {noformat}
> (tprint (top-self-calls oss40nogc "PrunableArrayQueue:prune" 5))
> [...]

[jira] [Commented] (CASSANDRA-15700) Performance regression on internode messaging

2020-04-07 Thread Sergio Bossa (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077309#comment-17077309
 ] 

Sergio Bossa commented on CASSANDRA-15700:
--

{quote}Delivery will only be scheduled at enqueue if it is not already 
processing outstanding work, and though iirc the enqueue operation will itself 
forcefully prune in some circumstances, there are anyway no guarantees that new 
messages will be arriving to this connection.
{quote}
Correct.
{quote}Delivery will only reschedule itself once the outbound TCP channel has 
room, and so in a network outage that does not terminate the connection or 
stalled recipient process (for example), where it is making no (or very slow) 
progress, it will not be woken up to process further pending messages.
{quote}
Isn't this a problem regardless, given the current implementation? That is, we 
currently prune upon:

1) Receiving a new message.

2) Finishing processing messages in an event loop run.

As we said, we have no guarantee we'll keep receiving messages, which rules #1 
out, and if the event loop is stalling (e.g. due to network issues or whatever) 
we'll not run the message processing loop anyway, which rules #2 out and leaves 
us with the same problem of leaving expired messages in the queue.

If this really is the problem we want to solve, which is a bit of an edge case 
but fine, something as easy as scheduling a background pruning at the earliest 
expiry date before "stalling" (i.e. before setting the connection as 
unwritable) would do it (possibly using a Hashed Wheel Timer because the Netty 
scheduler sucks a bit at scheduled tasks).
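E.g., something along these lines, where only the {{HashedWheelTimer}} usage is real Netty API and everything else is a made-up name:

{code:java}
import io.netty.util.HashedWheelTimer;
import io.netty.util.Timeout;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of "schedule a prune before stalling".
public class TimerPruneSketch
{
    private final HashedWheelTimer wheel = new HashedWheelTimer(); // coarse ticks are fine for expiry
    private volatile Timeout pendingPrune;

    void beforeMarkingUnwritable(long earliestExpiryNanos, long nowNanos)
    {
        long delay = Math.max(0, earliestExpiryNanos - nowNanos);
        pendingPrune = wheel.newTimeout(t -> pruneExpired(), delay, TimeUnit.NANOSECONDS);
    }

    void onWritableAgain()
    {
        Timeout t = pendingPrune;
        if (t != null)
            t.cancel(); // the delivery loop will expire messages as it polls them anyway
    }

    private void pruneExpired() { /* expire whatever is past its deadline in the backlog */ }
}
{code}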

> Performance regression on internode messaging
> -
>
> Key: CASSANDRA-15700
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15700
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Internode
>Reporter: Sergio Bossa
>Assignee: Sergio Bossa
>Priority: Normal
> Fix For: 4.0
>
> Attachments: Oss40vsOss311.png, oss40.gc, oss40_nogc.tar.xz, 
> oss40_system.log
>
>
> [~jasonstack] and I have been investigating a performance regression 
> affecting 4.0 during a 3-node, RF 3 write throughput test with a 
> timeseries-like workload, as shown in this plot, where blue is 3.11 and orange is 4.0:
> !Oss40vsOss311.png|width=389,height=214!
>  It's been a bit of a long investigation, but two clues ended up standing out:
> 1) An abnormal number of expired messages on 4.0 (as shown in the attached  
> system log), while 3.11 has almost none.
> 2) An abnormal GC activity (as shown in the attached gc log).
> Turns out the two are related, as the [on expired 
> callback|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundConnection.java#L462]
>  creates a huge amount of strings in the {{id()}} call. The next question is 
> what causes all those message expirations; we thoroughly reviewed the 
> internode messaging code and the only issue we could find so far is related 
> to the "batch pruning" calls 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L81]
>  and 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L188]:
>  it _seems_ too much time is spent on those, causing the event loop to fall 
> behind in processing the rest of the messages, which will end up being 
> expired. This is supported by the analysis of the collapsed stacks (after 
> fixing the GC issue):
> {noformat}
> (tprint (top-aggregated-calls oss40nogc "EventLoopDelivery:doRun" 5))
> org/apache/cassandra/net/OutboundConnection$EventLoopDelivery:doRun 3456
> org/apache/cassandra/net/OutboundMessageQueue:access$600 1621
> org/apache/cassandra/net/PrunableArrayQueue:prune 1621
> org/apache/cassandra/net/OutboundMessageQueue$WithLock:close 1621
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1620
> {noformat}
> Those are the top 5 sampled calls from {{EventLoopDelivery#doRun()}} which 
> spends half of its time pruning. But only a tiny portion of such pruning time 
> is spent actually expiring:
> {noformat}
> (tprint (top-aggregated-calls oss40nogc 
> "OutboundMessageQueue:pruneInternalQueueWithLock" 5))
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1900
> org/apache/cassandra/net/PrunableArrayQueue:prune 1894
> org/apache/cassandra/net/OutboundMessageQueue$1Pruner:onPruned 147
> org/apache/cassandra/net/OutboundConnection$$Lambda$444/740904487:accept 147
> org/apache/cassandra/net/OutboundConnection:onExpired 147
> {noformat}
> And indeed, the {{PrunableArrayQueue:prune()}} self time is dominant:
> {noformat}
> (tprint (top-self-calls oss40nogc "PrunableArrayQueue:prune" 5))
> [...]

[jira] [Commented] (CASSANDRA-15700) Performance regression on internode messaging

2020-04-07 Thread Sergio Bossa (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077179#comment-17077179
 ] 

Sergio Bossa commented on CASSANDRA-15700:
--

{quote}This is the only way (presently) to know what deadline the eventLoop 
needs to wake up on, to ensure we do not let expiring messages accumulate 
indefinitely
{quote}
I'm not sure I follow, it would be great if you could elaborate more, maybe 
with some pointers to the code. AFAIU, the "event loop" (assuming you mean the 
{{EventLoopDelivery}} class) is potentially scheduled at every new message, 
then rescheduling itself based on the number of pending bytes and the writable 
state; this means, as long as there are pending bytes, and even if no new 
messages arrive, it will keep rescheduling itself and getting rid of expired 
messages along the way, so I don't see how they could accumulate indefinitely, 
although I might very well be missing something. Or do you just mean such 
"rescheduling to expire" is simply inefficient?

> Performance regression on internode messaging
> -
>
> Key: CASSANDRA-15700
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15700
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Internode
>Reporter: Sergio Bossa
>Assignee: Sergio Bossa
>Priority: Normal
> Fix For: 4.0
>
> Attachments: Oss40vsOss311.png, oss40.gc, oss40_nogc.tar.xz, 
> oss40_system.log
>
>
> [~jasonstack] and I have been investigating a performance regression 
> affecting 4.0 during a 3-node, RF 3 write throughput test with a 
> timeseries-like workload, as shown in this plot, where blue is 3.11 and orange is 4.0:
> !Oss40vsOss311.png|width=389,height=214!
>  It's been a bit of a long investigation, but two clues ended up standing out:
> 1) An abnormal number of expired messages on 4.0 (as shown in the attached  
> system log), while 3.11 has almost none.
> 2) An abnormal GC activity (as shown in the attached gc log).
> Turns out the two are related, as the [on expired 
> callback|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundConnection.java#L462]
>  creates a huge amount of strings in the {{id()}} call. The next question is 
> what causes all those message expirations; we thoroughly reviewed the 
> internode messaging code and the only issue we could find so far is related 
> to the "batch pruning" calls 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L81]
>  and 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L188]:
>  it _seems_ too much time is spent on those, causing the event loop to fall 
> behind in processing the rest of the messages, which will end up being 
> expired. This is supported by the analysis of the collapsed stacks (after 
> fixing the GC issue):
> {noformat}
> (tprint (top-aggregated-calls oss40nogc "EventLoopDelivery:doRun" 5))
> org/apache/cassandra/net/OutboundConnection$EventLoopDelivery:doRun 3456
> org/apache/cassandra/net/OutboundMessageQueue:access$600 1621
> org/apache/cassandra/net/PrunableArrayQueue:prune 1621
> org/apache/cassandra/net/OutboundMessageQueue$WithLock:close 1621
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1620
> {noformat}
> Those are the top 5 sampled calls from {{EventLoopDelivery#doRun()}} which 
> spends half of its time pruning. But only a tiny portion of such pruning time 
> is spent actually expiring:
> {noformat}
> (tprint (top-aggregated-calls oss40nogc 
> "OutboundMessageQueue:pruneInternalQueueWithLock" 5))
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1900
> org/apache/cassandra/net/PrunableArrayQueue:prune 1894
> org/apache/cassandra/net/OutboundMessageQueue$1Pruner:onPruned 147
> org/apache/cassandra/net/OutboundConnection$$Lambda$444/740904487:accept 147
> org/apache/cassandra/net/OutboundConnection:onExpired 147
> {noformat}
> And indeed, the {{PrunableArrayQueue:prune()}} self time is dominant:
> {noformat}
> (tprint (top-self-calls oss40nogc "PrunableArrayQueue:prune" 5))
> org/apache/cassandra/net/PrunableArrayQueue:prune 1718
> org/apache/cassandra/net/OutboundConnection:releaseCapacity 27
> java/util/concurrent/ConcurrentHashMap:replaceNode 19
> java/util/concurrent/ConcurrentLinkedQueue:offer 16
> java/util/concurrent/LinkedBlockingQueue:offer 15
> {noformat}
> That said, before proceeding with a PR to fix those issues, I'd like to 
> understand: what's the reason to prune so often, rather than just when 
> polling the message during delivery? If there's a reason I'm missing, let's 
> talk about how to optimize pruning, otherwise let's get rid of that.




[jira] [Updated] (CASSANDRA-15700) Performance regression on internode messaging

2020-04-07 Thread Sergio Bossa (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-15700:
-
Authors: Sergio Bossa, ZhaoYang  (was: Sergio Bossa)

> Performance regression on internode messaging
> -
>
> Key: CASSANDRA-15700
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15700
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Internode
>Reporter: Sergio Bossa
>Assignee: Sergio Bossa
>Priority: Normal
> Fix For: 4.0
>
> Attachments: Oss40vsOss311.png, oss40.gc, oss40_nogc.tar.xz, 
> oss40_system.log
>
>
> [~jasonstack] and I have been investigating a performance regression 
> affecting 4.0 during a 3-node, RF 3 write throughput test with a 
> timeseries-like workload, as shown in this plot, where blue is 3.11 and orange is 4.0:
> !Oss40vsOss311.png|width=389,height=214!
>  It's been a bit of a long investigation, but two clues ended up standing out:
> 1) An abnormal number of expired messages on 4.0 (as shown in the attached  
> system log), while 3.11 has almost none.
> 2) An abnormal GC activity (as shown in the attached gc log).
> Turns out the two are related, as the [on expired 
> callback|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundConnection.java#L462]
>  creates a huge amount of strings in the {{id()}} call. The next question is 
> what causes all those message expirations; we thoroughly reviewed the 
> internode messaging code and the only issue we could find so far is related 
> to the "batch pruning" calls 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L81]
>  and 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L188]:
>  it _seems_ too much time is spent on those, causing the event loop to fall 
> behind in processing the rest of the messages, which will end up being 
> expired. This is supported by the analysis of the collapsed stacks (after 
> fixing the GC issue):
> {noformat}
> (tprint (top-aggregated-calls oss40nogc "EventLoopDelivery:doRun" 5))
> org/apache/cassandra/net/OutboundConnection$EventLoopDelivery:doRun 3456
> org/apache/cassandra/net/OutboundMessageQueue:access$600 1621
> org/apache/cassandra/net/PrunableArrayQueue:prune 1621
> org/apache/cassandra/net/OutboundMessageQueue$WithLock:close 1621
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1620
> {noformat}
> Those are the top 5 sampled calls from {{EventLoopDelivery#doRun()}} which 
> spends half of its time pruning. But only a tiny portion of such pruning time 
> is spent actually expiring:
> {noformat}
> (tprint (top-aggregated-calls oss40nogc 
> "OutboundMessageQueue:pruneInternalQueueWithLock" 5))
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1900
> org/apache/cassandra/net/PrunableArrayQueue:prune 1894
> org/apache/cassandra/net/OutboundMessageQueue$1Pruner:onPruned 147
> org/apache/cassandra/net/OutboundConnection$$Lambda$444/740904487:accept 147
> org/apache/cassandra/net/OutboundConnection:onExpired 147
> {noformat}
> And indeed, the {{PrunableArrayQueue:prune()}} self time is dominant:
> {noformat}
> (tprint (top-self-calls oss40nogc "PrunableArrayQueue:prune" 5))
> org/apache/cassandra/net/PrunableArrayQueue:prune 1718
> org/apache/cassandra/net/OutboundConnection:releaseCapacity 27
> java/util/concurrent/ConcurrentHashMap:replaceNode 19
> java/util/concurrent/ConcurrentLinkedQueue:offer 16
> java/util/concurrent/LinkedBlockingQueue:offer 15
> {noformat}
> That said, before proceeding with a PR to fix those issues, I'd like to 
> understand: what's the reason to prune so often, rather than just when 
> polling the message during delivery? If there's a reason I'm missing, let's 
> talk about how to optimize pruning, otherwise let's get rid of that.






[jira] [Updated] (CASSANDRA-15700) Performance regression on internode messaging

2020-04-07 Thread Sergio Bossa (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-15700:
-
Fix Version/s: 4.0

> Performance regression on internode messaging
> -
>
> Key: CASSANDRA-15700
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15700
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Internode
>Reporter: Sergio Bossa
>Assignee: Sergio Bossa
>Priority: Normal
> Fix For: 4.0
>
> Attachments: Oss40vsOss311.png, oss40.gc, oss40_nogc.tar.xz, 
> oss40_system.log
>
>
> [~jasonstack] and I have been investigating a performance regression 
> affecting 4.0 during a 3-node, RF 3 write throughput test with a 
> timeseries-like workload, as shown in this plot, where blue is 3.11 and orange is 4.0:
> !Oss40vsOss311.png|width=389,height=214!
>  It's been a bit of a long investigation, but two clues ended up standing out:
> 1) An abnormal number of expired messages on 4.0 (as shown in the attached  
> system log), while 3.11 has almost none.
> 2) An abnormal GC activity (as shown in the attached gc log).
> Turns out the two are related, as the [on expired 
> callback|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundConnection.java#L462]
>  creates a huge amount of strings in the {{id()}} call. The next question is 
> what causes all those message expirations; we thoroughly reviewed the 
> internode messaging code and the only issue we could find so far is related 
> to the "batch pruning" calls 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L81]
>  and 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L188]:
>  it _seems_ too much time is spent on those, causing the event loop to fall 
> behind in processing the rest of the messages, which will end up being 
> expired. This is supported by the analysis of the collapsed stacks (after 
> fixing the GC issue):
> {noformat}
> (tprint (top-aggregated-calls oss40nogc "EventLoopDelivery:doRun" 5))
> org/apache/cassandra/net/OutboundConnection$EventLoopDelivery:doRun 3456
> org/apache/cassandra/net/OutboundMessageQueue:access$600 1621
> org/apache/cassandra/net/PrunableArrayQueue:prune 1621
> org/apache/cassandra/net/OutboundMessageQueue$WithLock:close 1621
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1620
> {noformat}
> Those are the top 5 sampled calls from {{EventLoopDelivery#doRun()}} which 
> spends half of its time pruning. But only a tiny portion of such pruning time 
> is spent actually expiring:
> {noformat}
> (tprint (top-aggregated-calls oss40nogc 
> "OutboundMessageQueue:pruneInternalQueueWithLock" 5))
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1900
> org/apache/cassandra/net/PrunableArrayQueue:prune 1894
> org/apache/cassandra/net/OutboundMessageQueue$1Pruner:onPruned 147
> org/apache/cassandra/net/OutboundConnection$$Lambda$444/740904487:accept 147
> org/apache/cassandra/net/OutboundConnection:onExpired 147
> {noformat}
> And indeed, the {{PrunableArrayQueue:prune()}} self time is dominant:
> {noformat}
> (tprint (top-self-calls oss40nogc "PrunableArrayQueue:prune" 5))
> org/apache/cassandra/net/PrunableArrayQueue:prune 1718
> org/apache/cassandra/net/OutboundConnection:releaseCapacity 27
> java/util/concurrent/ConcurrentHashMap:replaceNode 19
> java/util/concurrent/ConcurrentLinkedQueue:offer 16
> java/util/concurrent/LinkedBlockingQueue:offer 15
> {noformat}
> That said, before proceeding with a PR to fix those issues, I'd like to 
> understand: what's the reason to prune so often, rather than just when 
> polling the message during delivery? If there's a reason I'm missing, let's 
> talk about how to optimize pruning, otherwise let's get rid of that.






[jira] [Created] (CASSANDRA-15700) Performance regression on internode messaging

2020-04-07 Thread Sergio Bossa (Jira)
Sergio Bossa created CASSANDRA-15700:


 Summary: Performance regression on internode messaging
 Key: CASSANDRA-15700
 URL: https://issues.apache.org/jira/browse/CASSANDRA-15700
 Project: Cassandra
  Issue Type: Bug
  Components: Messaging/Internode
Reporter: Sergio Bossa
Assignee: Sergio Bossa
 Attachments: Oss40vsOss311.png, oss40.gc, oss40_nogc.tar.xz, 
oss40_system.log

[~jasonstack] and I have been investigating a performance regression affecting 
4.0 during a 3-node, RF 3 write throughput test with a timeseries-like 
workload, as shown in this plot, where blue is 3.11 and orange is 4.0:

!Oss40vsOss311.png|width=389,height=214!

 It's been a bit of a long investigation, but two clues ended up standing out:
1) An abnormal number of expired messages on 4.0 (as shown in the attached  
system log), while 3.11 has almost none.
2) An abnormal GC activity (as shown in the attached gc log).

Turns out the two are related, as the [on expired 
callback|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundConnection.java#L462]
 creates a huge amount of strings in the {{id()}} call. The next question is 
what causes all those message expirations; we thoroughly reviewed the internode 
messaging code and the only issue we could find so far is related to the "batch 
pruning" calls 
[here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L81]
 and 
[here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L188]:
 it _seems_ too much time is spent on those, causing the event loop to fall 
behind in processing the rest of the messages, which will end up being expired. 
This is supported by the analysis of the collapsed stacks (after fixing the GC 
issue):
{noformat}
(tprint (top-aggregated-calls oss40nogc "EventLoopDelivery:doRun" 5))
org/apache/cassandra/net/OutboundConnection$EventLoopDelivery:doRun 3456
org/apache/cassandra/net/OutboundMessageQueue:access$600 1621
org/apache/cassandra/net/PrunableArrayQueue:prune 1621
org/apache/cassandra/net/OutboundMessageQueue$WithLock:close 1621
org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1620
{noformat}

Those are the top 5 sampled calls from {{EventLoopDelivery#doRun()}} which 
spends half of its time pruning. But only a tiny portion of such pruning time 
is spent actually expiring:
{noformat}
(tprint (top-aggregated-calls oss40nogc 
"OutboundMessageQueue:pruneInternalQueueWithLock" 5))
org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1900
org/apache/cassandra/net/PrunableArrayQueue:prune 1894
org/apache/cassandra/net/OutboundMessageQueue$1Pruner:onPruned 147
org/apache/cassandra/net/OutboundConnection$$Lambda$444/740904487:accept 147
org/apache/cassandra/net/OutboundConnection:onExpired 147
{noformat}

And indeed, the {{PrunableArrayQueue:prune()}} self time is dominant:
{noformat}
(tprint (top-self-calls oss40nogc "PrunableArrayQueue:prune" 5))
org/apache/cassandra/net/PrunableArrayQueue:prune 1718
org/apache/cassandra/net/OutboundConnection:releaseCapacity 27
java/util/concurrent/ConcurrentHashMap:replaceNode 19
java/util/concurrent/ConcurrentLinkedQueue:offer 16
java/util/concurrent/LinkedBlockingQueue:offer 15
{noformat}

That said, before proceeding with a PR to fix those issues, I'd like to 
understand: what's the reason to prune so often, rather than just when polling 
the message during delivery? If there's a reason I'm missing, let's talk about 
how to optimize pruning, otherwise let's get rid of that.






[jira] [Commented] (CASSANDRA-15666) Race condition when completing stream sessions

2020-04-02 Thread Sergio Bossa (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073831#comment-17073831
 ] 

Sergio Bossa commented on CASSANDRA-15666:
--

[~jasonstack] your fix looks good, just left a few minor comments on the PR. 
Regarding the changes to the {{CompleteMessage}} exchange, I still think that'd 
be a win regardless if the race is fixed in a different way, as the current 
implementation makes it harder to reason about its correctness (which means it 
could be prone to other races), but I also understand we want to limit the 
scope of changes to 4.0, so not pushing hard for it.

> Race condition when completing stream sessions
> --
>
> Key: CASSANDRA-15666
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15666
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Streaming and Messaging
>Reporter: Sergio Bossa
>Assignee: ZhaoYang
>Priority: Normal
> Fix For: 4.0
>
>
> {{StreamSession#prepareAsync()}} executes, as the name implies, 
> asynchronously from the IO thread: this opens the door to race conditions between 
> the sending of the {{PrepareSynAckMessage}} and the call to 
> {{StreamSession#maybeCompleted()}}. E.g., the following could happen:
> 1) Node A sends {{PrepareSynAckMessage}} from the {{prepareAsync()}} thread.
> 2) Node B receives it and starts streaming.
> 3) Node A receives the streamed file and sends {{ReceivedMessage}}.
> 4) At this point, if this was the only file to stream, both nodes are ready 
> to close the session via {{maybeCompleted()}}, but:
> a) Node A will call it twice from both the IO thread and the thread at #1, 
> closing the session and its channels.
> b) Node B will attempt to send a {{CompleteMessage}}, but will fail because 
> the session has been closed in the meantime.
> There are other subtle variations of the pattern above, depending on the 
> order of concurrently sent/received messages.
> I believe the best fix would be to modify the message exchange so that:
> 1) Only the "follower" is allowed to send the {{CompleteMessage}}.
> 2) Only the "initiator" is allowed to close the session and its channels 
> after receiving the {{CompleteMessage}}.
> By doing so, the message exchange logic would be easier to reason about, 
> which is overall a win anyway.






[jira] [Updated] (CASSANDRA-15666) Race condition when completing stream sessions

2020-04-02 Thread Sergio Bossa (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-15666:
-
Reviewers: Benjamin Lerer, Sergio Bossa, Sergio Bossa  (was: Benjamin Lerer, Sergio Bossa)
   Status: Review In Progress  (was: Patch Available)

> Race condition when completing stream sessions
> --
>
> Key: CASSANDRA-15666
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15666
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Streaming and Messaging
>Reporter: Sergio Bossa
>Assignee: ZhaoYang
>Priority: Normal
> Fix For: 4.0
>
>
> {{StreamSession#prepareAsync()}} executes, as the name implies, 
> asynchronously from the IO thread: this opens the door to race conditions between 
> the sending of the {{PrepareSynAckMessage}} and the call to 
> {{StreamSession#maybeCompleted()}}. E.g., the following could happen:
> 1) Node A sends {{PrepareSynAckMessage}} from the {{prepareAsync()}} thread.
> 2) Node B receives it and starts streaming.
> 3) Node A receives the streamed file and sends {{ReceivedMessage}}.
> 4) At this point, if this was the only file to stream, both nodes are ready 
> to close the session via {{maybeCompleted()}}, but:
> a) Node A will call it twice from both the IO thread and the thread at #1, 
> closing the session and its channels.
> b) Node B will attempt to send a {{CompleteMessage}}, but will fail because 
> the session has been closed in the meantime.
> There are other subtle variations of the pattern above, depending on the 
> order of concurrently sent/received messages.
> I believe the best fix would be to modify the message exchange so that:
> 1) Only the "follower" is allowed to send the {{CompleteMessage}}.
> 2) Only the "initiator" is allowed to close the session and its channels 
> after receiving the {{CompleteMessage}}.
> By doing so, the message exchange logic would be easier to reason about, 
> which is overall a win anyway.






[jira] [Updated] (CASSANDRA-15667) StreamResultFuture check for completeness is inconsistent, leading to races

2020-04-01 Thread Sergio Bossa (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-15667:
-
Reviewers: Sergio Bossa, ZhaoYang  (was: ZhaoYang)

> StreamResultFuture check for completeness is inconsistent, leading to races
> ---
>
> Key: CASSANDRA-15667
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15667
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Streaming and Messaging
>Reporter: Sergio Bossa
>Assignee: Benjamin Lerer
>Priority: Normal
> Fix For: 4.0
>
>
> {{StreamResultFuture#maybeComplete()}} uses 
> {{StreamCoordinator#hasActiveSessions()}} to determine if all sessions are 
> completed, but then accesses each session state via 
> {{StreamCoordinator#getAllSessionInfo()}}: this is inconsistent, as the 
> former relies on the actual {{StreamSession}} state, while the latter on the 
> {{SessionInfo}} state, and the two are concurrently updated with no 
> coordination whatsoever.
> This leads to races, e.g. apparent in some spurious dtest failures, such as 
> {{TestBootstrap.resumable_bootstrap_test}} in CASSANDRA-15614 cc 
> [~e.dimitrova].






[jira] [Commented] (CASSANDRA-15666) Race condition when completing stream sessions

2020-03-27 Thread Sergio Bossa (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068663#comment-17068663
 ] 

Sergio Bossa commented on CASSANDRA-15666:
--

bq. it's still possible to send 2 CompleteMessage by follower when 
maybeComplete() in prepareAsync() is delayed and race with maybeComplete() in 
taskCompleted().

Correct. I agree the overall thread safety of {{StreamSession}} should be fixed.

> Race condition when completing stream sessions
> --
>
> Key: CASSANDRA-15666
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15666
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Streaming and Messaging
>Reporter: Sergio Bossa
>Assignee: ZhaoYang
>Priority: Normal
>
> {{StreamSession#prepareAsync()}} executes, as the name implies, 
> asynchronously from the IO thread: this opens the door to race conditions between 
> the sending of the {{PrepareSynAckMessage}} and the call to 
> {{StreamSession#maybeCompleted()}}. E.g., the following could happen:
> 1) Node A sends {{PrepareSynAckMessage}} from the {{prepareAsync()}} thread.
> 2) Node B receives it and starts streaming.
> 3) Node A receives the streamed file and sends {{ReceivedMessage}}.
> 4) At this point, if this was the only file to stream, both nodes are ready 
> to close the session via {{maybeCompleted()}}, but:
> a) Node A will call it twice from both the IO thread and the thread at #1, 
> closing the session and its channels.
> b) Node B will attempt to send a {{CompleteMessage}}, but will fail because 
> the session has been closed in the meantime.
> There are other subtle variations of the pattern above, depending on the 
> order of concurrently sent/received messages.
> I believe the best fix would be to modify the message exchange so that:
> 1) Only the "follower" is allowed to send the {{CompleteMessage}}.
> 2) Only the "initiator" is allowed to close the session and its channels 
> after receiving the {{CompleteMessage}}.
> By doing so, the message exchange logic would be easier to reason about, 
> which is overall a win anyway.






[jira] [Updated] (CASSANDRA-15375) Remove BackPressureStrategy

2020-03-27 Thread Sergio Bossa (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-15375:
-
Reviewers: Sergio Bossa, Sergio Bossa  (was: Sergio Bossa)
   Status: Review In Progress  (was: Patch Available)

[~benedict], apologies for this late review. I've added some comments to your 
commit here: 
https://github.com/belliottsmith/cassandra/commit/3584b4305ab87f836fdc94d4d5b28bd102642ccb

> Remove BackPressureStrategy
> ---
>
> Key: CASSANDRA-15375
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15375
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client, Observability/Logging
>Reporter: Jon Haddad
>Assignee: Jon Haddad
>Priority: Low
>
> This is odd:
> {{INFO [main] 2019-10-25 10:33:07,985 DatabaseDescriptor.java:803 - 
> Back-pressure is disabled with strategy 
> org.apache.cassandra.net.RateBasedBackPressure\{high_ratio=0.9, factor=5, 
> flow=FAST}.}}
> When I saw that, I wasn't sure if back pressure was actually disabled, or if 
> I was really using {{RateBasedBackPressure.}}
> This should change to output either:
> {{Back-pressure is disabled}}
> {{or}}
> {{Back-pressure is enabled with strategy 
> org.apache.cassandra.net.RateBasedBackPressure\{high_ratio=0.9, factor=5, 
> flow=FAST}.}}
>  
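
For illustration, the suggested split is roughly this (a sketch only: the field names are assumptions, not the actual {{DatabaseDescriptor}} code):

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch of the suggested log-message split; names are assumed.
public class BackPressureLogSketch
{
    private static final Logger logger = LoggerFactory.getLogger(BackPressureLogSketch.class);

    static void logBackPressure(boolean enabled, Object strategy)
    {
        if (enabled)
            logger.info("Back-pressure is enabled with strategy {}.", strategy);
        else
            logger.info("Back-pressure is disabled.");
    }
}
{code}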






[jira] [Updated] (CASSANDRA-15375) Remove BackPressureStrategy

2020-03-27 Thread Sergio Bossa (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-15375:
-
Fix Version/s: 4.0

> Remove BackPressureStrategy
> ---
>
> Key: CASSANDRA-15375
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15375
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client, Observability/Logging
>Reporter: Jon Haddad
>Assignee: Jon Haddad
>Priority: Low
> Fix For: 4.0
>
>
> This is odd:
> {{INFO [main] 2019-10-25 10:33:07,985 DatabaseDescriptor.java:803 - 
> Back-pressure is disabled with strategy 
> org.apache.cassandra.net.RateBasedBackPressure\{high_ratio=0.9, factor=5, 
> flow=FAST}.}}
> When I saw that, I wasn't sure if back pressure was actually disabled, or if 
> I was really using {{RateBasedBackPressure.}}
> This should change to output either:
> {{Back-pressure is disabled}}
> {{or}}
> {{Back-pressure is enabled with strategy 
> org.apache.cassandra.net.RateBasedBackPressure\{high_ratio=0.9, factor=5, 
> flow=FAST}.}}
>  






[jira] [Updated] (CASSANDRA-15375) Remove BackPressureStrategy

2020-03-27 Thread Sergio Bossa (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-15375:
-
Reviewers: Sergio Bossa

> Remove BackPressureStrategy
> ---
>
> Key: CASSANDRA-15375
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15375
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client, Observability/Logging
>Reporter: Jon Haddad
>Assignee: Jon Haddad
>Priority: Low
>
> This is odd:
> {{INFO [main] 2019-10-25 10:33:07,985 DatabaseDescriptor.java:803 - 
> Back-pressure is disabled with strategy 
> org.apache.cassandra.net.RateBasedBackPressure\{high_ratio=0.9, factor=5, 
> flow=FAST}.}}
> When I saw that, I wasn't sure if back pressure was actually disabled, or if 
> I was really using {{RateBasedBackPressure.}}
> This should change to output either:
> {{Back-pressure is disabled}}
> {{or}}
> {{Back-pressure is enabled with strategy 
> org.apache.cassandra.net.RateBasedBackPressure\{high_ratio=0.9, factor=5, 
> flow=FAST}.}}
>  






[jira] [Updated] (CASSANDRA-15667) StreamResultFuture check for completeness is inconsistent, leading to races

2020-03-27 Thread Sergio Bossa (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-15667:
-
Fix Version/s: 4.0

> StreamResultFuture check for completeness is inconsistent, leading to races
> ---
>
> Key: CASSANDRA-15667
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15667
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Streaming and Messaging
>Reporter: Sergio Bossa
>Priority: Normal
> Fix For: 4.0
>
>
> {{StreamResultFuture#maybeComplete()}} uses 
> {{StreamCoordinator#hasActiveSessions()}} to determine if all sessions are 
> completed, but then accesses each session state via 
> {{StreamCoordinator#getAllSessionInfo()}}: this is inconsistent, as the 
> former relies on the actual {{StreamSession}} state, while the latter on the 
> {{SessionInfo}} state, and the two are concurrently updated with no 
> coordination whatsoever.
> This leads to races, e.g. apparent in some spurious dtest failures, such as 
> {{TestBootstrap.resumable_bootstrap_test}} in CASSANDRA-15614 cc 
> [~e.dimitrova].






[jira] [Created] (CASSANDRA-15667) StreamResultFuture check for completeness is inconsistent, leading to races

2020-03-27 Thread Sergio Bossa (Jira)
Sergio Bossa created CASSANDRA-15667:


 Summary: StreamResultFuture check for completeness is 
inconsistent, leading to races
 Key: CASSANDRA-15667
 URL: https://issues.apache.org/jira/browse/CASSANDRA-15667
 Project: Cassandra
  Issue Type: Bug
  Components: Legacy/Streaming and Messaging
Reporter: Sergio Bossa


{{StreamResultFuture#maybeComplete()}} uses 
{{StreamCoordinator#hasActiveSessions()}} to determine if all sessions are 
completed, but then accesses each session state via 
{{StreamCoordinator#getAllSessionInfo()}}: this is inconsistent, as the former 
relies on the actual {{StreamSession}} state, while the latter on the 
{{SessionInfo}} state, and the two are concurrently updated with no 
coordination whatsoever.

This leads to races, e.g. apparent in some spurious dtest failures, such as 
{{TestBootstrap.resumable_bootstrap_test}} in CASSANDRA-15614 cc [~e.dimitrova].
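
A toy model of the inconsistency (all types below are stand-ins, not the real streaming classes): the completeness check and the follow-up per-session read should derive from the same state source.

{code:java}
import java.util.List;

// Toy model: completeness is checked against one state source while the
// follow-up inspection reads another, mirroring the race described above.
public class MaybeCompleteSketch
{
    static final class Session { volatile boolean complete; volatile boolean infoComplete; }

    // Inconsistent: a hasActiveSessions()-style check on one field,
    // then a getAllSessionInfo()-style read on the other.
    static boolean maybeCompleteRacy(List<Session> sessions)
    {
        boolean anyActive = sessions.stream().anyMatch(s -> !s.complete);
        if (anyActive)
            return false;
        // s.infoComplete may still lag behind s.complete at this point:
        return sessions.stream().allMatch(s -> s.infoComplete);
    }

    // Consistent: both decisions read the same field.
    static boolean maybeCompleteConsistent(List<Session> sessions)
    {
        return sessions.stream().allMatch(s -> s.complete);
    }
}
{code}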






[jira] [Commented] (CASSANDRA-14754) Add verification of state machine in StreamSession

2020-03-27 Thread Sergio Bossa (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068579#comment-17068579
 ] 

Sergio Bossa commented on CASSANDRA-14754:
--

Thanks [~jasonstack]. Also see CASSANDRA-15667.

> Add verification of state machine in StreamSession
> --
>
> Key: CASSANDRA-14754
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14754
> Project: Cassandra
>  Issue Type: Task
>  Components: Legacy/Streaming and Messaging
>Reporter: Jason Brown
>Assignee: ZhaoYang
>Priority: Normal
> Fix For: 4.x
>
>
> {{StreamSession}} contains an implicit state machine, but we have no 
> verification of the safety of the transitions between states. For example, we 
> have no checks to ensure we cannot leave the final states (COMPLETED, FAILED).
> I propose we add some program logic in {{StreamSession}}, tests, and 
> documentation to ensure the correctness of the state transitions.






[jira] [Commented] (CASSANDRA-14754) Add verification of state machine in StreamSession

2020-03-26 Thread Sergio Bossa (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067826#comment-17067826
 ] 

Sergio Bossa commented on CASSANDRA-14754:
--

I think just validating state transitions doesn't really add much: for example, 
moving from {{INITIALIZED}} to {{COMPLETE}} is 
[valid|https://github.com/apache/cassandra/pull/480/files#diff-b397cc5438b1a4be20f026c9ec9ecd6eR215],
 but only if there's nothing to actually stream, which is not expressed in the 
current validation (IIUIC, correct me if I'm wrong). In other words, I think it 
would be best to implement a proper state machine, but is it really necessary 
for 4.0? OTOH, I think we should focus on other {{StreamSession}} bugs such as 
CASSANDRA-15665 and CASSANDRA-15666. Thoughts?
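
To be clear about what "a proper state machine" would look like, a minimal sketch with an explicit transition table plus the {{INITIALIZED}} -> {{COMPLETE}} guard (the states are illustrative, not the full {{StreamSession.State}} set):

{code:java}
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Minimal sketch of an explicit, verifiable state machine: final states
// have no outgoing transitions, and guarded transitions are checked.
public class StateMachineSketch
{
    enum State { INITIALIZED, PREPARING, STREAMING, COMPLETE, FAILED }

    private static final Map<State, Set<State>> ALLOWED = new EnumMap<>(State.class);
    static
    {
        ALLOWED.put(State.INITIALIZED, EnumSet.of(State.PREPARING, State.COMPLETE, State.FAILED));
        ALLOWED.put(State.PREPARING,   EnumSet.of(State.STREAMING, State.COMPLETE, State.FAILED));
        ALLOWED.put(State.STREAMING,   EnumSet.of(State.COMPLETE, State.FAILED));
        ALLOWED.put(State.COMPLETE,    EnumSet.noneOf(State.class)); // final: cannot be left
        ALLOWED.put(State.FAILED,      EnumSet.noneOf(State.class)); // final: cannot be left
    }

    private State state = State.INITIALIZED;

    void transition(State next, boolean nothingToStream)
    {
        // Guard: INITIALIZED -> COMPLETE is only legal when there's nothing to stream.
        if (state == State.INITIALIZED && next == State.COMPLETE && !nothingToStream)
            throw new IllegalStateException("INITIALIZED -> COMPLETE with pending streams");
        if (!ALLOWED.get(state).contains(next))
            throw new IllegalStateException(state + " -> " + next);
        state = next;
    }
}
{code}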

> Add verification of state machine in StreamSession
> --
>
> Key: CASSANDRA-14754
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14754
> Project: Cassandra
>  Issue Type: Task
>  Components: Legacy/Streaming and Messaging
>Reporter: Jason Brown
>Assignee: ZhaoYang
>Priority: Normal
> Fix For: 4.0
>
>
> {{StreamSession}} contains an implicit state machine, but we have no 
> verification of the safety of the transitions between states. For example, we 
> have no checks to ensure we cannot leave the final states (COMPLETED, FAILED).
> I propose we add some program logic in {{StreamSession}}, tests, and 
> documentation to ensure the correctness of the state transitions.






[jira] [Created] (CASSANDRA-15666) Race condition when completing stream sessions

2020-03-26 Thread Sergio Bossa (Jira)
Sergio Bossa created CASSANDRA-15666:


 Summary: Race condition when completing stream sessions
 Key: CASSANDRA-15666
 URL: https://issues.apache.org/jira/browse/CASSANDRA-15666
 Project: Cassandra
  Issue Type: Bug
  Components: Legacy/Streaming and Messaging
Reporter: Sergio Bossa


{{StreamSession#prepareAsync()}} executes, as the name implies, asynchronously 
from the IO thread: this opens the door to race conditions between the sending of 
the {{PrepareSynAckMessage}} and the call to 
{{StreamSession#maybeCompleted()}}. E.g., the following could happen:
1) Node A sends {{PrepareSynAckMessage}} from the {{prepareAsync()}} thread.
2) Node B receives it and starts streaming.
3) Node A receives the streamed file and sends {{ReceivedMessage}}.
4) At this point, if this was the only file to stream, both nodes are ready to 
close the session via {{maybeCompleted()}}, but:
a) Node A will call it twice from both the IO thread and the thread at #1, 
closing the session and its channels.
b) Node B will attempt to send a {{CompleteMessage}}, but will fail because the 
session has been closed in the meantime.

There are other subtle variations of the pattern above, depending on the order 
of concurrently sent/received messages.

I believe the best fix would be to modify the message exchange so that:
1) Only the "follower" is allowed to send the {{CompleteMessage}}.
2) Only the "initiator" is allowed to close the session and its channels after 
receiving the {{CompleteMessage}}.

By doing so, the message exchange logic would be easier to reason about, which 
is overall a win anyway.
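
A toy sketch of the proposed exchange (all names hypothetical): the follower is the only side sending the completion message, and the initiator the only side closing.

{code:java}
// Toy model of the proposed protocol, not the real StreamSession code.
public class CompletionProtocolSketch
{
    interface Channel { void send(String message); void close(); }

    static final class Session
    {
        final boolean isFollower;
        final Channel channel;

        Session(boolean isFollower, Channel channel) { this.isFollower = isFollower; this.channel = channel; }

        void maybeCompleted()             // everything streamed and acknowledged
        {
            if (isFollower)
                channel.send("COMPLETE"); // 1) only the follower sends CompleteMessage
            // the initiator just waits: no racy double-close from two threads
        }

        void onCompleteMessage()
        {
            if (!isFollower)
                channel.close();          // 2) only the initiator closes the session
        }
    }
}
{code}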






[jira] [Updated] (CASSANDRA-15665) StreamManager should clearly differentiate between "initiator" and "receiver" sessions

2020-03-26 Thread Sergio Bossa (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-15665:
-
Fix Version/s: 4.0

> StreamManager should clearly differentiate between "initiator" and "receiver" 
> sessions
> --
>
> Key: CASSANDRA-15665
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15665
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Streaming and Messaging
>Reporter: Sergio Bossa
>Priority: Normal
> Fix For: 4.0
>
>
> {{StreamManager}} currently does a suboptimal job of differentiating between 
> stream sessions (in the form of {{StreamResultFuture}}) which have been either 
> initiated or "received", for the following reasons:
> 1) Naming is IMO confusing: a "receiver" session could actually both send and 
> receive files, so technically an initiator is also a receiver.
> 2) {{StreamManager#findSession()}} assumes we should first look into 
> "initiator" sessions, then into "receiver" ones: this is a dangerous 
> assumption, in particular for test environments where the same process could 
> work as both an initiator and a receiver.
> I would recommend the following changes:
> 1) Rename "receiver" to "follower" everywhere the former is used.
> 2) Introduce a new flag into {{StreamMessageHeader}} to signal if the message 
> comes from an initiator or follower session, in order to correctly 
> differentiate and look up sessions in {{StreamManager}}.
> While my arguments above might seem trivial, I believe they will improve 
> clarity and save us from potential bugs/headaches at testing time, and now 
> that we're revamping streaming for 4.0 seems the right time for such changes.






[jira] [Created] (CASSANDRA-15665) StreamManager should clearly differentiate between "initiator" and "receiver" sessions

2020-03-26 Thread Sergio Bossa (Jira)
Sergio Bossa created CASSANDRA-15665:


 Summary: StreamManager should clearly differentiate between 
"initiator" and "receiver" sessions
 Key: CASSANDRA-15665
 URL: https://issues.apache.org/jira/browse/CASSANDRA-15665
 Project: Cassandra
  Issue Type: Bug
  Components: Legacy/Streaming and Messaging
Reporter: Sergio Bossa


{{StreamManager}} currently does a suboptimal job of differentiating between 
stream sessions (in the form of {{StreamResultFuture}}) which have been either 
initiated or "received", for the following reasons:
1) Naming is IMO confusing: a "receiver" session could actually both send and 
receive files, so technically an initiator is also a receiver.
2) {{StreamManager#findSession()}} assumes we should first look into 
"initiator" sessions, then into "receiver" ones: this is a dangerous 
assumption, in particular for test environments where the same process could 
work as both an initiator and a receiver.

I would recommend the following changes:
1) Rename "receiver" to "follower" everywhere the former is used.
2) Introduce a new flag into {{StreamMessageHeader}} to signal whether the 
message comes from an initiator or a follower session, so that sessions can be 
correctly differentiated and looked up in {{StreamManager}} (see the sketch 
below).

While my arguments above might seem trivial, I believe they will improve 
clarity and save us from potential bugs/headaches at testing time; now that 
we're revamping streaming for 4.0 seems the right time for such changes.
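As an illustration of change #2, a toy, self-contained sketch of how a header 
flag could make session lookup unambiguous; the names below 
({{sentByFollower}}, the session maps) are hypothetical, not the actual 
{{StreamMessageHeader}}/{{StreamManager}} code:

{code:java}
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

class StreamMessageHeaderSketch
{
    final UUID planId;
    final boolean sentByFollower; // the proposed flag

    StreamMessageHeaderSketch(UUID planId, boolean sentByFollower)
    {
        this.planId = planId;
        this.sentByFollower = sentByFollower;
    }
}

class StreamManagerSketch
{
    private final Map<UUID, Object> initiatorSessions = new ConcurrentHashMap<>();
    private final Map<UUID, Object> followerSessions = new ConcurrentHashMap<>();

    // No more "look into initiator sessions first, then receiver ones":
    // the flag makes the lookup unambiguous even when one process plays
    // both roles, as happens in test environments.
    Object findSession(StreamMessageHeaderSketch header)
    {
        // a message sent by the follower belongs to our initiator session,
        // and vice versa
        return header.sentByFollower
             ? initiatorSessions.get(header.planId)
             : followerSessions.get(header.planId);
    }
}
{code}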






[jira] [Commented] (CASSANDRA-15375) Remove BackPressureStrategy

2020-03-04 Thread Sergio Bossa (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051099#comment-17051099
 ] 

Sergio Bossa commented on CASSANDRA-15375:
--

bq. I'd like to disclaim any desire to discredit the original work

Of course, and as I said it was experimental and meant to evolve once widely 
used, which never happened. Do you want me to review your patch?

> Remove BackPressureStrategy
> ---
>
> Key: CASSANDRA-15375
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15375
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client, Observability/Logging
>Reporter: Jon Haddad
>Assignee: Jon Haddad
>Priority: Low
>
> This is odd:
> {{INFO [main] 2019-10-25 10:33:07,985 DatabaseDescriptor.java:803 - 
> Back-pressure is disabled with strategy 
> org.apache.cassandra.net.RateBasedBackPressure\{high_ratio=0.9, factor=5, 
> flow=FAST}.}}
> When I saw that, I wasn't sure if back pressure was actually disabled, or if 
> I was really using {{RateBasedBackPressure}}.
> This should change to output either:
> {{Back-pressure is disabled}}
> or
> {{Back-pressure is enabled with strategy 
> org.apache.cassandra.net.RateBasedBackPressure\{high_ratio=0.9, factor=5, 
> flow=FAST}.}}
>  






[jira] [Commented] (CASSANDRA-15375) Remove BackPressureStrategy

2020-03-03 Thread Sergio Bossa (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17050482#comment-17050482
 ] 

Sergio Bossa commented on CASSANDRA-15375:
--

As the original author of that implementation, I regard it as experimental and 
never proved to be widely used, so I agree with removing it. Dead code is bad 
code (regardless of its actual quality or past merits).

> Remove BackPressureStrategy
> ---
>
> Key: CASSANDRA-15375
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15375
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client, Observability/Logging
>Reporter: Jon Haddad
>Assignee: Jon Haddad
>Priority: Low
>
> This is odd:
> {{INFO [main] 2019-10-25 10:33:07,985 DatabaseDescriptor.java:803 - 
> Back-pressure is disabled with strategy 
> org.apache.cassandra.net.RateBasedBackPressure\{high_ratio=0.9, factor=5, 
> flow=FAST}.}}
> When I saw that, I wasn't sure if back pressure was actually disabled, or if 
> I was really using {{RateBasedBackPressure}}.
> This should change to output either:
> {{Back-pressure is disabled}}
> or
> {{Back-pressure is enabled with strategy 
> org.apache.cassandra.net.RateBasedBackPressure\{high_ratio=0.9, factor=5, 
> flow=FAST}.}}
>  






[jira] [Comment Edited] (CASSANDRA-15375) Remove BackPressureStrategy

2020-03-03 Thread Sergio Bossa (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17050482#comment-17050482
 ] 

Sergio Bossa edited comment on CASSANDRA-15375 at 3/3/20 7:01 PM:
--

As the original author of that implementation, I regard it as experimental and 
never proven to be widely used, so I agree with removing it. Dead code is bad 
code (regardless of its actual quality or past merits).


was (Author: sbtourist):
As the original author of that implementation, I regard it as experimental and 
never proved to be widely used, so I agree with removing it. Dead code is bad 
code (regardless of its actual quality or past merits).

> Remove BackPressureStrategy
> ---
>
> Key: CASSANDRA-15375
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15375
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client, Observability/Logging
>Reporter: Jon Haddad
>Assignee: Jon Haddad
>Priority: Low
>
> This is odd:
> {{INFO [main] 2019-10-25 10:33:07,985 DatabaseDescriptor.java:803 - 
> Back-pressure is disabled with strategy 
> org.apache.cassandra.net.RateBasedBackPressure\{high_ratio=0.9, factor=5, 
> flow=FAST}.}}
> When I saw that, I wasn't sure if back pressure was actually disabled, or if 
> I was really using {{RateBasedBackPressure}}.
> This should change to output either:
> {{Back-pressure is disabled}}
> or
> {{Back-pressure is enabled with strategy 
> org.apache.cassandra.net.RateBasedBackPressure\{high_ratio=0.9, factor=5, 
> flow=FAST}.}}
>  






[jira] [Created] (CASSANDRA-14878) Race condition when setting bootstrap flags

2018-11-08 Thread Sergio Bossa (JIRA)
Sergio Bossa created CASSANDRA-14878:


 Summary: Race condition when setting bootstrap flags
 Key: CASSANDRA-14878
 URL: https://issues.apache.org/jira/browse/CASSANDRA-14878
 Project: Cassandra
  Issue Type: Bug
Reporter: Sergio Bossa
Assignee: Sergio Bossa
 Fix For: 3.0.x, 3.11.x, 4.x


{{StorageService#bootstrap()}} is supposed to wait for bootstrap to finish, but 
Guava calls the future listeners 
[after|https://github.com/google/guava/blob/ec2dedebfa359991cbcc8750dc62003be63ec6d3/guava/src/com/google/common/util/concurrent/AbstractFuture.java#L890]
 unparking its waiters, which causes a race on when {{bootstrapFinished()}} 
will be executed, making it non-deterministic.
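One way to sidestep the listener/waiter ordering entirely is for the waiting 
thread to run the post-bootstrap step itself once the future completes; a 
minimal sketch (not the actual {{StorageService}} code, {{bootstrapFinished()}} 
is a stand-in):

{code:java}
import java.util.concurrent.ExecutionException;
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.Uninterruptibles;

class BootstrapSketch
{
    void bootstrapBlocking(ListenableFuture<?> bootstrapFuture)
    {
        try
        {
            // Returns only once the future is done, so bootstrapFinished()
            // deterministically runs after bootstrap, on this thread, with
            // no dependence on Guava's listener-vs-waiter ordering.
            Uninterruptibles.getUninterruptibly(bootstrapFuture);
            bootstrapFinished();
        }
        catch (ExecutionException e)
        {
            throw new RuntimeException("Bootstrap failed", e.getCause());
        }
    }

    void bootstrapFinished() { /* flag setting elided */ }
}
{code}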






[jira] [Updated] (CASSANDRA-8272) 2ndary indexes can return stale data

2018-06-12 Thread Sergio Bossa (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-8272:

Reviewer:   (was: Sergio Bossa)

> 2ndary indexes can return stale data
> 
>
> Key: CASSANDRA-8272
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8272
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sylvain Lebresne
>Assignee: Andrés de la Peña
>Priority: Major
> Fix For: 3.0.x
>
>
> When replicas return 2ndary index results, it's possible for a single replica 
> to return a stale result, and that result will be sent back to the user, 
> potentially breaking the CL contract.
> For instance, consider 3 replicas A, B and C, and the following situation:
> {noformat}
> CREATE TABLE test (k int PRIMARY KEY, v text);
> CREATE INDEX ON test(v);
> INSERT INTO test(k, v) VALUES (0, 'foo');
> {noformat}
> with every replica up to date. Now, suppose that the following queries are 
> done at {{QUORUM}}:
> {noformat}
> UPDATE test SET v = 'bar' WHERE k = 0;
> SELECT * FROM test WHERE v = 'foo';
> {noformat}
> then, if A and B acknowledge the update but C responds to the read before 
> having applied it, the now-stale result will be returned (since 
> C will return it and A or B will return nothing).
> A potential solution would be that when we read a tombstone in the index (and 
> provided we make the index inherit the gcGrace of its parent CF), instead of 
> skipping that tombstone, we'd insert a corresponding range tombstone into the 
> result.
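To see why shipping a tombstone fixes reconciliation, here is a toy model 
(deliberately not Cassandra code) where the newest-timestamp-wins rule lets the 
up-to-date replica's tombstone shadow the stale row:

{code:java}
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class StaleIndexResultSketch
{
    record Result(String key, String value, long timestamp, boolean tombstone) {}

    // Coordinator-side reconciliation: newest timestamp wins; a winning
    // tombstone means the row is excluded from the final result.
    static Optional<Result> reconcile(List<Result> replicaResults)
    {
        Result newest = replicaResults.stream()
                                      .max(Comparator.comparingLong(Result::timestamp))
                                      .orElseThrow();
        return newest.tombstone ? Optional.empty() : Optional.of(newest);
    }

    public static void main(String[] args)
    {
        Result staleFromC = new Result("k0", "foo", 1, false);
        Result tombstoneFromA = new Result("k0", null, 2, true);
        // prints Optional.empty: the stale row no longer leaks to the user
        System.out.println(reconcile(List.of(staleFromC, tombstoneFromA)));
    }
}
{code}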






[jira] [Commented] (CASSANDRA-14507) OutboundMessagingConnection backlog is not fully written in case of race conditions

2018-06-09 Thread Sergio Bossa (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506886#comment-16506886
 ] 

Sergio Bossa commented on CASSANDRA-14507:
--

bq.  Wouldn't they be picked up by the 
MessageOutHandler::channelWritabilityChanged and then get drained?

That is assuming the writability changes before the timeout window, which might 
very well not be the case, unless I'm misunderstanding your question? Also, 
{{channelWritabilityChanged()}} is race-prone by itself as mentioned above.

> OutboundMessagingConnection backlog is not fully written in case of race 
> conditions
> ---
>
> Key: CASSANDRA-14507
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14507
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
>Reporter: Sergio Bossa
>Priority: Major
>
> The {{OutboundMessagingConnection}} writes into a backlog queue before the 
> connection handshake is successfully completed, and then writes such backlog 
> to the channel as soon as the successful handshake moves the channel state to 
> {{READY}}.
> This is unfortunately race prone, as the following could happen:
> 1) One or more writer threads see the channel state as {{NOT_READY}} in 
> {{#sendMessage()}} and are about to enqueue to the backlog, but they get 
> descheduled by the OS.
> 2) The handshake thread is scheduled by the OS and moves the channel state to 
> {{READY}}, emptying the backlog.
> 3) The writer threads are scheduled back and add to the backlog, but the 
> channel state is {{READY}} at this point, so those writes would sit in the 
> backlog and expire.
> Please note a similar race condition exists between 
> {{OutboundMessagingConnection#sendMessage()}} and 
> {{MessageOutHandler#channelWritabilityChanged()}}, which is way more serious 
> as the channel writability could frequently change; luckily, it looks like 
> {{ChannelWriter#write()}} never gets invoked with {{checkWritability}} set to 
> {{true}} (so writes never go to the backlog when the channel is not writable).






[jira] [Created] (CASSANDRA-14507) OutboundMessagingConnection backlog is not fully written in case of race conditions

2018-06-08 Thread Sergio Bossa (JIRA)
Sergio Bossa created CASSANDRA-14507:


 Summary: OutboundMessagingConnection backlog is not fully written 
in case of race conditions
 Key: CASSANDRA-14507
 URL: https://issues.apache.org/jira/browse/CASSANDRA-14507
 Project: Cassandra
  Issue Type: Bug
  Components: Streaming and Messaging
Reporter: Sergio Bossa


The {{OutboundMessagingConnection}} writes into a backlog queue before the 
connection handshake is successfully completed, and then writes such backlog to 
the channel as soon as the successful handshake moves the channel state to 
{{READY}}.
This is unfortunately race prone, as the following could happen:
1) One or more writer threads see the channel state as {{NOT_READY}} in 
{{#sendMessage()}} and are about to enqueue to the backlog, but they get 
descheduled by the OS.
2) The handshake thread is scheduled by the OS and moves the channel state to 
{{READY}}, emptying the backlog.
3) The writer threads are scheduled back and add to the backlog, but the 
channel state is {{READY}} at this point, so those writes would sit in the 
backlog and expire.

Please note a similar race condition exists between 
{{OutboundMessagingConnection#sendMessage()}} and 
{{MessageOutHandler#channelWritabilityChanged()}}, which is way more serious as 
the channel writability could frequently change; luckily, it looks like 
{{ChannelWriter#write()}} never gets invoked with {{checkWritability}} set to 
{{true}} (so writes never go to the backlog when the channel is not writable).
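A common fix for this check-then-enqueue race is to re-check the state after 
enqueueing and drain if it flipped in the meantime; a self-contained sketch 
with illustrative names (not the real {{OutboundMessagingConnection}}):

{code:java}
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

class ConnectionSketch
{
    enum State { NOT_READY, READY }

    private final AtomicReference<State> state = new AtomicReference<>(State.NOT_READY);
    private final Queue<Object> backlog = new ConcurrentLinkedQueue<>();

    void sendMessage(Object message)
    {
        if (state.get() == State.READY)
        {
            writeToChannel(message);
            return;
        }
        backlog.add(message);
        // Re-check: the handshake thread may have moved the state to READY
        // (and drained the backlog) between our check and our add, which is
        // exactly the window described above.
        if (state.get() == State.READY)
            drainBacklog();
    }

    void finishHandshake()
    {
        state.set(State.READY);
        drainBacklog(); // racing with the re-check is benign: poll() hands each message out once
    }

    private void drainBacklog()
    {
        Object message;
        while ((message = backlog.poll()) != null)
            writeToChannel(message);
    }

    private void writeToChannel(Object message) { /* netty write elided */ }
}
{code}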






[jira] [Created] (CASSANDRA-14503) Internode connection management is race-prone

2018-06-07 Thread Sergio Bossa (JIRA)
Sergio Bossa created CASSANDRA-14503:


 Summary: Internode connection management is race-prone
 Key: CASSANDRA-14503
 URL: https://issues.apache.org/jira/browse/CASSANDRA-14503
 Project: Cassandra
  Issue Type: Bug
  Components: Streaming and Messaging
Reporter: Sergio Bossa


Following CASSANDRA-8457, internode connection management has been rewritten to 
rely on Netty, but the new implementation in {{OutboundMessagingConnection}} 
seems quite race prone to me, in particular in these two cases:

* {{#finishHandshake()}} racing with {{#close()}}: in such a case the former 
could run into an NPE if the latter nulls out {{channelWriter}} (but this is 
just one example; other conflicts might happen).
* Connection timeout and retry racing with state-changing methods: 
{{connectionRetryFuture}} and {{connectionTimeoutFuture}} are cancelled when 
handshaking or closing, but there's no guarantee those will actually be 
cancelled (as they might already be running), so they might end up changing the 
connection state concurrently with other methods (e.g. by unexpectedly closing 
the channel or clearing the backlog).

Overall, the thread safety of {{OutboundMessagingConnection}} is very difficult 
to assess given the current implementation: I would suggest refactoring it into 
a single-threaded model, where all connection-state-changing actions are 
enqueued on a single-threaded scheduler, so that state transitions can be 
clearly defined and checked.
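A minimal sketch of what such a single-threaded model could look like 
(illustrative names only, not a proposed patch):

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

class SingleThreadedConnectionSketch
{
    enum State { CREATED, HANDSHAKING, READY, CLOSED }

    private final ScheduledExecutorService eventLoop =
        Executors.newSingleThreadScheduledExecutor();

    private State state = State.CREATED; // only ever touched on eventLoop

    void finishHandshake() { eventLoop.execute(() -> transition(State.HANDSHAKING, State.READY)); }
    void close()           { eventLoop.execute(() -> state = State.CLOSED); }

    private void transition(State expected, State next)
    {
        // A cancelled-but-already-running timeout task simply observes CLOSED
        // here and becomes a no-op, instead of mutating shared state mid-flight.
        if (state == expected)
            state = next;
    }
}
{code}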






[jira] [Updated] (CASSANDRA-13798) Disallow filtering on non-primary-key base column for MV

2017-08-29 Thread Sergio Bossa (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-13798:
-
Reviewer: Paulo Motta

> Disallow filtering on non-primary-key base column for MV
> 
>
> Key: CASSANDRA-13798
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13798
> Project: Cassandra
>  Issue Type: Bug
>  Components: Materialized Views
>Reporter: ZhaoYang
>Assignee: ZhaoYang
>
> We should probably consider disallowing filtering conditions on 
> non-primary-key base columns for Materialized Views, a feature introduced in 
> CASSANDRA-10368.
> The main problem is that the liveness of a view row now depends on multiple 
> base columns (multiple filtered non-PK base columns + the base column used in 
> the view PK), and this semantic could not be properly supported without 
> drastic storage format changes. (SEE CASSANDRA-11500, 
> [background|https://issues.apache.org/jira/browse/CASSANDRA-11500?focusedCommentId=16119823=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16119823])
> We should step back and reconsider the non-primary-key filtering feature 
> together with supporting multiple non-PK cols in the MV clustering key in 
> CASSANDRA-10226.






[jira] [Updated] (CASSANDRA-13640) CQLSH error when using 'login' to switch users

2017-08-22 Thread Sergio Bossa (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-13640:
-
Reviewer: ZhaoYang

> CQLSH error when using 'login' to switch users
> --
>
> Key: CASSANDRA-13640
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13640
> Project: Cassandra
>  Issue Type: Bug
>  Components: CQL
>Reporter: Andrés de la Peña
>Assignee: Andrés de la Peña
>Priority: Minor
> Fix For: 3.0.x
>
>
> Using {{PasswordAuthenticator}} and {{CassandraAuthorizer}}:
> {code}
> bin/cqlsh -u cassandra -p cassandra
> Connected to Test Cluster at 127.0.0.1:9042.
> [cqlsh 5.0.1 | Cassandra 3.0.14-SNAPSHOT | CQL spec 3.4.0 | Native protocol 
> v4]
> Use HELP for help.
> cassandra@cqlsh> create role super with superuser = true and password = 'p' 
> and login = true;
> cassandra@cqlsh> login super;
> Password:
> super@cqlsh> list roles;
> 'Row' object has no attribute 'values'
> {code}
> When we initialize the Shell, we configure certain settings on the session 
> object such as
> {code}
> self.session.default_timeout = request_timeout
> self.session.row_factory = ordered_dict_factory
> self.session.default_consistency_level = cassandra.ConsistencyLevel.ONE
> {code}
> However, once we perform a LOGIN command, which calls do_login(..), we create 
> a new cluster/session object but never actually set those settings on the new 
> session.
> It isn't failing on 3.x.
> As a workaround, it is possible to log out and log back in, and things work 
> correctly.






[jira] [Updated] (CASSANDRA-13464) Failed to create Materialized view with a specific token range

2017-08-22 Thread Sergio Bossa (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-13464:
-
Reviewer: ZhaoYang

> Failed to create Materialized view with a specific token range
> --
>
> Key: CASSANDRA-13464
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13464
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Natsumi Kojima
>Assignee: Krishna Dattu Koneru
>Priority: Minor
>  Labels: materializedviews
>
> Failed to create Materialized view with a specific token range.
> Example :
> {code:java}
> $ ccm create "MaterializedView" -v 3.0.13
> $ ccm populate  -n 3
> $ ccm start
> $ ccm status
> Cluster: 'MaterializedView'
> ---
> node1: UP
> node3: UP
> node2: UP
> $ccm node1 cqlsh
> Connected to MaterializedView at 127.0.0.1:9042.
> [cqlsh 5.0.1 | Cassandra 3.0.13 | CQL spec 3.4.0 | Native protocol v4]
> Use HELP for help.
> cqlsh> CREATE KEYSPACE test WITH replication = {'class':'SimpleStrategy', 
> 'replication_factor':3};
> cqlsh> CREATE TABLE test.test ( id text PRIMARY KEY , value1 text , value2 
> text, value3 text);
> $ccm node1 ring test 
> Datacenter: datacenter1
> ==
> Address    Rack   Status  State   Load       Owns      Token
> 
>                                                        3074457345618258602
> 127.0.0.1  rack1  Up      Normal  64.86 KB   100.00%   -9223372036854775808
> 127.0.0.2  rack1  Up      Normal  86.49 KB   100.00%   -3074457345618258603
> 127.0.0.3  rack1  Up      Normal  89.04 KB   100.00%   3074457345618258602
> $ ccm node1 cqlsh
> cqlsh> INSERT INTO test.test (id, value1 , value2, value3 ) VALUES ('aaa', 
> 'aaa', 'aaa' ,'aaa');
> cqlsh> INSERT INTO test.test (id, value1 , value2, value3 ) VALUES ('bbb', 
> 'bbb', 'bbb' ,'bbb');
> cqlsh> SELECT token(id),id,value1 FROM test.test;
>  system.token(id) | id  | value1
> --+-+
>  -4737872923231490581 | aaa |aaa
>  -3071845237020185195 | bbb |bbb
> (2 rows)
> cqlsh> CREATE MATERIALIZED VIEW test.test_view AS SELECT value1, id FROM 
> test.test WHERE id IS NOT NULL AND value1 IS NOT NULL AND TOKEN(id) > 
> -9223372036854775808 AND TOKEN(id) < -3074457345618258603 PRIMARY KEY(value1, 
> id) WITH CLUSTERING ORDER BY (id ASC);
> ServerError: java.lang.ClassCastException: 
> org.apache.cassandra.cql3.TokenRelation cannot be cast to 
> org.apache.cassandra.cql3.SingleColumnRelation
> {code}
> Stacktrace :
> {code:java}
> INFO  [MigrationStage:1] 2017-04-19 18:32:48,131 ColumnFamilyStore.java:389 - 
> Initializing test.test
> WARN  [SharedPool-Worker-1] 2017-04-19 18:44:07,263 FBUtilities.java:337 - 
> Trigger directory doesn't exist, please create it and try again.
> ERROR [SharedPool-Worker-1] 2017-04-19 18:46:10,072 QueryMessage.java:128 - 
> Unexpected error during query
> java.lang.ClassCastException: org.apache.cassandra.cql3.TokenRelation cannot 
> be cast to org.apache.cassandra.cql3.SingleColumnRelation
>   at 
> org.apache.cassandra.db.view.View.relationsToWhereClause(View.java:275) 
> ~[apache-cassandra-3.0.13.jar:3.0.13]
>   at 
> org.apache.cassandra.cql3.statements.CreateViewStatement.announceMigration(CreateViewStatement.java:219)
>  ~[apache-cassandra-3.0.13.jar:3.0.13]
>   at 
> org.apache.cassandra.cql3.statements.SchemaAlteringStatement.execute(SchemaAlteringStatement.java:93)
>  ~[apache-cassandra-3.0.13.jar:3.0.13]
>   at 
> org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:206)
>  ~[apache-cassandra-3.0.13.jar:3.0.13]
>   at 
> org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:237) 
> ~[apache-cassandra-3.0.13.jar:3.0.13]
>   at 
> org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:222) 
> ~[apache-cassandra-3.0.13.jar:3.0.13]
>   at 
> org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:115)
>  ~[apache-cassandra-3.0.13.jar:3.0.13]
>   at 
> org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:513)
>  [apache-cassandra-3.0.13.jar:3.0.13]
>   at 
> org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:407)
>  [apache-cassandra-3.0.13.jar:3.0.13]
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>  [netty-all-4.0.44.Final.jar:4.0.44.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
>  [netty-all-4.0.44.Final.jar:4.0.44.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35)
>  

[jira] [Updated] (CASSANDRA-13737) Node start can fail if the base table of a materialized view is not found

2017-08-08 Thread Sergio Bossa (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-13737:
-
Reviewer: T Jake Luciani

> Node start can fail if the base table of a materialized view is not found
> -
>
> Key: CASSANDRA-13737
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13737
> Project: Cassandra
>  Issue Type: Bug
>  Components: Distributed Metadata, Materialized Views
>Reporter: Andrés de la Peña
>Assignee: Andrés de la Peña
> Fix For: 3.0.x, 3.11.x, 4.x
>
>
> Node start can fail if the base table of a materialized view is not found, 
> which is something that can happen under certain circumstances. There is a 
> dtest reproducing the problem:
> {code}
> cluster = self.cluster
> cluster.populate(3)
> cluster.start()
> node1, node2, node3 = self.cluster.nodelist()
> session = self.patient_cql_connection(node1, 
> consistency_level=ConsistencyLevel.QUORUM)
> create_ks(session, 'ks', 3)
> session.execute('CREATE TABLE users (username varchar PRIMARY KEY, state 
> varchar)')
> node3.stop(wait_other_notice=True)
> # create a materialized view only in nodes 1 and 2
> session.execute(('CREATE MATERIALIZED VIEW users_by_state AS '
>  'SELECT * FROM users WHERE state IS NOT NULL AND username IS 
> NOT NULL '
>  'PRIMARY KEY (state, username)'))
> node1.stop(wait_other_notice=True)
> node2.stop(wait_other_notice=True)
> # drop the base table only in node 3
> node3.start(wait_for_binary_proto=True)
> session = self.patient_cql_connection(node3, 
> consistency_level=ConsistencyLevel.QUORUM)
> session.execute('DROP TABLE ks.users')
> cluster.stop()
> cluster.start()  # Fails
> {code}
> This is the error during node start:
> {code}
> java.lang.IllegalArgumentException: Unknown CF 
> 958ebc30-76e4-11e7-869a-9d8367a71c76
>   at 
> org.apache.cassandra.db.Keyspace.getColumnFamilyStore(Keyspace.java:215) 
> ~[main/:na]
>   at 
> org.apache.cassandra.db.view.ViewManager.addView(ViewManager.java:143) 
> ~[main/:na]
>   at 
> org.apache.cassandra.db.view.ViewManager.reload(ViewManager.java:113) 
> ~[main/:na]
>   at org.apache.cassandra.schema.Schema.alterKeyspace(Schema.java:618) 
> ~[main/:na]
>   at org.apache.cassandra.schema.Schema.lambda$merge$18(Schema.java:591) 
> ~[main/:na]
>   at 
> java.util.Collections$UnmodifiableMap$UnmodifiableEntrySet.lambda$entryConsumer$0(Collections.java:1575)
>  ~[na:1.8.0_131]
>   at java.util.HashMap$EntrySet.forEach(HashMap.java:1043) ~[na:1.8.0_131]
>   at 
> java.util.Collections$UnmodifiableMap$UnmodifiableEntrySet.forEach(Collections.java:1580)
>  ~[na:1.8.0_131]
>   at org.apache.cassandra.schema.Schema.merge(Schema.java:591) ~[main/:na]
>   at 
> org.apache.cassandra.schema.Schema.mergeAndAnnounceVersion(Schema.java:564) 
> ~[main/:na]
>   at 
> org.apache.cassandra.schema.MigrationTask$1.response(MigrationTask.java:89) 
> ~[main/:na]
>   at 
> org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:53)
>  ~[main/:na]
>   at 
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:72) 
> ~[main/:na]
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[na:1.8.0_131]
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
> ~[na:1.8.0_131]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  ~[na:1.8.0_131]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [na:1.8.0_131]
>   at 
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
>  [main/:na]
>   at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_131]
> {code}






[jira] [Updated] (CASSANDRA-12952) AlterTableStatement propagates base table and affected MV changes inconsistently

2017-07-24 Thread Sergio Bossa (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-12952:
-
Reviewer: ZhaoYang

> AlterTableStatement propagates base table and affected MV changes 
> inconsistently
> 
>
> Key: CASSANDRA-12952
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12952
> Project: Cassandra
>  Issue Type: Bug
>  Components: Distributed Metadata, Materialized Views
>Reporter: Aleksey Yeschenko
>Assignee: Andrés de la Peña
> Fix For: 3.0.x, 3.11.x, 4.x
>
>
> In {{AlterTableStatement}}, when renaming columns or changing their types, we 
> also keep track of all affected MVs - ones that also need column renames or 
> type changes. Then, in the end, we announce the migration for the table change, 
> and afterwards, separately, one for each affected MV.
> This creates a window in which view definitions and base table definition are 
> not in sync with each other. If a node fails in between receiving those 
> pushes, it's likely to have startup issues.
> The fix is trivial: table change and affected MV change should be pushed as a 
> single schema mutation.






[jira] [Updated] (CASSANDRA-13725) Secondary indexes are always rebuilt at startup

2017-07-24 Thread Sergio Bossa (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-13725:
-
Reviewer: Andrés de la Peña

> Secondary indexes are always rebuilt at startup
> ---
>
> Key: CASSANDRA-13725
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13725
> Project: Cassandra
>  Issue Type: Bug
>  Components: Secondary Indexes
>Reporter: Sergio Bossa
>Assignee: Sergio Bossa
>Priority: Critical
> Fix For: 4.0
>
>
> Following CASSANDRA-10130, a bug has been introduced that causes a 2i to be 
> rebuilt at startup, even if the index is already built.






[jira] [Updated] (CASSANDRA-13725) Secondary indexes are always rebuilt at startup

2017-07-24 Thread Sergio Bossa (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-13725:
-
Status: Patch Available  (was: Open)

> Secondary indexes are always rebuilt at startup
> ---
>
> Key: CASSANDRA-13725
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13725
> Project: Cassandra
>  Issue Type: Bug
>  Components: Secondary Indexes
>Reporter: Sergio Bossa
>Assignee: Sergio Bossa
>Priority: Critical
> Fix For: 4.0
>
>
> Following CASSANDRA-10130, a bug has been introduced that causes a 2i to be 
> rebuilt at startup, even if the index is already built.






[jira] [Updated] (CASSANDRA-13725) Secondary indexes are always rebuilt at startup

2017-07-24 Thread Sergio Bossa (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-13725:
-

This is caused by calling {{SIM#markIndexesBuilding}} when creating the index 
during column family initialization, which marks the index as "not built" and 
causes the index initialization task to rebuild it.

Given there's no need to mark the index when a new column family is created (as 
the index will be "not built" by definition and there can't be any concurrent 
indexing), we can just pass a boolean up to {{createIndex()}} to distinguish 
between index creation at different times, i.e. when a column family is 
[created|https://github.com/sbtourist/cassandra/blob/CASSANDRA-13725/src/java/org/apache/cassandra/db/Keyspace.java#L394]
 or 
[reloaded|https://github.com/sbtourist/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L129].

Such a solution is implemented in the following patch, with a new dtest 
verifying it:
|[trunk|https://github.com/apache/cassandra/pull/135]|[dtest|https://github.com/apache/cassandra-dtest/pull/2]|

Test runs are in progress on our internal CI and I will report results as soon 
as they're ready.
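For illustration, a toy model of the flag described above (not the actual 
{{SecondaryIndexManager}} API; all names below are hypothetical):

{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class IndexManagerSketch
{
    private final Set<String> markedBuilding = ConcurrentHashMap.newKeySet();

    void createIndex(String indexName, boolean isNewCF)
    {
        // Only an index created on an *existing* table is marked as building:
        // a brand-new table has no data and no concurrent index builds, so
        // marking it would just force a spurious full rebuild at startup.
        if (!isNewCF)
            markedBuilding.add(indexName);

        // the initialization task would be submitted here
    }

    boolean willRebuildAtStartup(String indexName)
    {
        return markedBuilding.contains(indexName);
    }
}
{code}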


> Secondary indexes are always rebuilt at startup
> ---
>
> Key: CASSANDRA-13725
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13725
> Project: Cassandra
>  Issue Type: Bug
>  Components: Secondary Indexes
>Reporter: Sergio Bossa
>Assignee: Sergio Bossa
>Priority: Critical
> Fix For: 4.0
>
>
> Following CASSANDRA-10130, a bug has been introduced that causes a 2i to be 
> rebuilt at startup, even if the index is already built.






[jira] [Created] (CASSANDRA-13725) Secondary indexes are always rebuilt at startup

2017-07-24 Thread Sergio Bossa (JIRA)
Sergio Bossa created CASSANDRA-13725:


 Summary: Secondary indexes are always rebuilt at startup
 Key: CASSANDRA-13725
 URL: https://issues.apache.org/jira/browse/CASSANDRA-13725
 Project: Cassandra
  Issue Type: Bug
  Components: Secondary Indexes
Reporter: Sergio Bossa
Assignee: Sergio Bossa
Priority: Critical
 Fix For: 4.0


Following CASSANDRA-10130, a bug has been introduced that causes a 2i to be 
rebuilt at startup, even if the index is already built.






[jira] [Updated] (CASSANDRA-13547) Filtered materialized views missing data

2017-06-27 Thread Sergio Bossa (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-13547:
-
Reviewer: ZhaoYang

> Filtered materialized views missing data
> 
>
> Key: CASSANDRA-13547
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13547
> Project: Cassandra
>  Issue Type: Bug
>  Components: Materialized Views
> Environment: Official Cassandra 3.10 Docker image (ID 154b919bf8ce).
>Reporter: Craig Nicholson
>Assignee: Krishna Dattu Koneru
>Priority: Blocker
>  Labels: materializedviews
> Fix For: 3.11.x
>
>
> When creating a materialized view against a base table the materialized view 
> does not always reflect the correct data.
> Using the following test schema:
> {code:title=Schema|language=sql}
> DROP KEYSPACE IF EXISTS test;
> CREATE KEYSPACE test
>   WITH REPLICATION = { 
>'class' : 'SimpleStrategy', 
>'replication_factor' : 1 
>   };
> CREATE TABLE test.table1 (
> id int,
> name text,
> enabled boolean,
> foo text,
> PRIMARY KEY (id, name));
> CREATE MATERIALIZED VIEW test.table1_mv1 AS SELECT id, name, foo
> FROM test.table1
> WHERE id IS NOT NULL 
> AND name IS NOT NULL 
> AND enabled = TRUE
> PRIMARY KEY ((name), id);
> CREATE MATERIALIZED VIEW test.table1_mv2 AS SELECT id, name, foo, enabled
> FROM test.table1
> WHERE id IS NOT NULL 
> AND name IS NOT NULL 
> AND enabled = TRUE
> PRIMARY KEY ((name), id);
> {code}
> When I insert a row into the base table the materialized views are updated 
> appropriately. (+)
> {code:title=Insert row|language=sql}
> cqlsh> INSERT INTO test.table1 (id, name, enabled, foo) VALUES (1, 'One', 
> TRUE, 'Bar');
> cqlsh> SELECT * FROM test.table1;
>  id | name | enabled | foo
> +--+-+-
>   1 |  One |True | Bar
> (1 rows)
> cqlsh> SELECT * FROM test.table1_mv1;
>  name | id | foo
> --++-
>   One |  1 | Bar
> (1 rows)
> cqlsh> SELECT * FROM test.table1_mv2;
>  name | id | enabled | foo
> --++-+-
>   One |  1 |True | Bar
> (1 rows)
> {code}
> Updating the record in the base table and setting enabled to FALSE will 
> filter the record from both materialized views. (+)
> {code:title=Disable the row|language=sql}
> cqlsh> UPDATE test.table1 SET enabled = FALSE WHERE id = 1 AND name = 'One';
> cqlsh> SELECT * FROM test.table1;
>  id | name | enabled | foo
> +--+-+-
>   1 |  One |   False | Bar
> (1 rows)
> cqlsh> SELECT * FROM test.table1_mv1;
>  name | id | foo
> --++-
> (0 rows)
> cqlsh> SELECT * FROM test.table1_mv2;
>  name | id | enabled | foo
> --++-+-
> (0 rows)
> {code}
> However, a further update to the base table setting enabled to TRUE should 
> include the record in both materialized views, yet only one view 
> (table1_mv2) gets updated. (-)
> It appears that only the view (table1_mv2) that returns the filtered column 
> (enabled) is updated. (-)
> Additionally, columns that are not part of the partition or clustering key 
> are not updated. You can see that the foo column has a null value in 
> table1_mv2. (-)
> {code:title=Enable the row|language=sql}
> cqlsh> UPDATE test.table1 SET enabled = TRUE WHERE id = 1 AND name = 'One';
> cqlsh> SELECT * FROM test.table1;
>  id | name | enabled | foo
> +--+-+-
>   1 |  One |True | Bar
> (1 rows)
> cqlsh> SELECT * FROM test.table1_mv1;
>  name | id | foo
> --++-
> (0 rows)
> cqlsh> SELECT * FROM test.table1_mv2;
>  name | id | enabled | foo
> --++-+--
>   One |  1 |True | null
> (1 rows)
> {code}






[jira] [Updated] (CASSANDRA-13627) Index queries are rejected on COMPACT tables

2017-06-23 Thread Sergio Bossa (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-13627:
-
Reviewer: Andrés de la Peña  (was: Benjamin Lerer)

> Index queries are rejected on COMPACT tables
> 
>
> Key: CASSANDRA-13627
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13627
> Project: Cassandra
>  Issue Type: Bug
>  Components: CQL
>Reporter: Benjamin Lerer
>Assignee: Benjamin Lerer
>
> Since {{3.0}}, {{compact}} tables use {{static}} columns under the hood. Due 
> to that, {{SELECT}} queries using secondary indexes get rejected with the 
> following error:
> This problem can be reproduced using the following unit test:
> {code}@Test
> public void testIndicesOnCompactTable() throws Throwable
> {
> createTable("CREATE TABLE %s (pk int PRIMARY KEY, v int) WITH COMPACT 
> STORAGE");
> createIndex("CREATE INDEX ON %s(v)");
> execute("INSERT INTO %S (pk, v) VALUES (?, ?)", 1, 1);
> execute("INSERT INTO %S (pk, v) VALUES (?, ?)", 2, 1);
> execute("INSERT INTO %S (pk, v) VALUES (?, ?)", 3, 3);
> assertRows(execute("SELECT pk, v FROM %s WHERE v = 1"),
>row(1, 1),
>row(2, 1));
> }{code}






[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted

2017-06-23 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060619#comment-16060619
 ] 

Sergio Bossa commented on CASSANDRA-10130:
--

Excellent job everyone! +1

> Node failure during 2i update after streaming can have incomplete 2i when 
> restarted
> ---
>
> Key: CASSANDRA-10130
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10130
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: Yuki Morishita
>Assignee: Andrés de la Peña
>Priority: Minor
>
> Since MV/2i update happens after SSTables are received, node failure during 
> MV/2i update can leave received SSTables live when restarted while MV/2i are 
> partially up to date.
> We can add some kind of tracking mechanism to automatically rebuild at the 
> startup, or at least warn user when the node restarts.






[jira] [Updated] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted

2017-06-22 Thread Sergio Bossa (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-10130:
-
Component/s: (was: Materialized Views)

> Node failure during 2i update after streaming can have incomplete 2i when 
> restarted
> ---
>
> Key: CASSANDRA-10130
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10130
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: Yuki Morishita
>Assignee: Andrés de la Peña
>Priority: Minor
>
> Since MV/2i update happens after SSTables are received, node failure during 
> MV/2i update can leave received SSTables live when restarted while MV/2i are 
> partially up to date.
> We can add some kind of tracking mechanism to automatically rebuild at the 
> startup, or at least warn user when the node restarts.






[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted

2017-06-22 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16059725#comment-16059725
 ] 

Sergio Bossa commented on CASSANDRA-10130:
--

Excellent, can we have another utests/dtests run before the final +1?

> Node failure during 2i update after streaming can have incomplete 2i when 
> restarted
> ---
>
> Key: CASSANDRA-10130
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10130
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination, Materialized Views
>Reporter: Yuki Morishita
>Assignee: Andrés de la Peña
>Priority: Minor
>
> Since MV/2i update happens after SSTables are received, node failure during 
> MV/2i update can leave received SSTables live when restarted while MV/2i are 
> partially up to date.
> We can add some kind of tracking mechanism to automatically rebuild at the 
> startup, or at least warn user when the node restarts.






[jira] [Updated] (CASSANDRA-13512) SASI full-text search queries using standard analyzer do not work in multi-node environments

2017-06-22 Thread Sergio Bossa (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-13512:
-
Reviewer: Andrés de la Peña

> SASI full-text search queries using standard analyzer do not work in 
> multi-node environments 
> -
>
> Key: CASSANDRA-13512
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13512
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>
> SASI full-text search queries using standard analyzer do not work in 
> multi-node environments. Standard Analyzer will rewind the buffer and search 
> term will be empty for any node other than coordinator, so will return no 
> results.






[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted

2017-06-22 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16059129#comment-16059129
 ] 

Sergio Bossa commented on CASSANDRA-10130:
--

[~adelapena], the latest commit looks good, just two minor notes about tests:
1) Shouldn't we set the 
[throwable|https://github.com/adelapena/cassandra/compare/ca7da30f621e742f85b6a7b1f66d320ba224a6a4...adelapena:0f6972eacdab6b0c81e00d8c0c59968106d3f462#diff-0ccfd193eadf8f1c2a48cf846b4eb791R549]
 to null in {{TestingIndex#clear()}}?
2) Shouldn't we also test the stability inspector on index creation?

> Node failure during 2i update after streaming can have incomplete 2i when 
> restarted
> ---
>
> Key: CASSANDRA-10130
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10130
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: Yuki Morishita
>Assignee: Andrés de la Peña
>Priority: Minor
>
> Since MV/2i update happens after SSTables are received, node failure during 
> MV/2i update can leave received SSTables live when restarted while MV/2i are 
> partially up to date.
> We can add some kind of tracking mechanism to automatically rebuild at the 
> startup, or at least warn user when the node restarts.






[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted

2017-06-19 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054296#comment-16054296
 ] 

Sergio Bossa commented on CASSANDRA-10130:
--

bq. I think we should probably run exceptions during 2i rebuild failure 
(logAndMarkIndexesFailed) via the JVMStabilityInspector

Agreed.

+1 otherwise, excellent job everyone :)

> Node failure during 2i update after streaming can have incomplete 2i when 
> restarted
> ---
>
> Key: CASSANDRA-10130
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10130
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: Yuki Morishita
>Assignee: Andrés de la Peña
>Priority: Minor
>
> Since MV/2i update happens after SSTables are received, node failure during 
> MV/2i update can leave received SSTables live when restarted while MV/2i are 
> partially up to date.
> We can add some kind of tracking mechanism to automatically rebuild at the 
> startup, or at least warn user when the node restarts.






[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted

2017-06-19 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16053861#comment-16053861
 ] 

Sergio Bossa commented on CASSANDRA-10130:
--

bq. I really think it makes the code cleaner and less prone to errors to update 
SecondaryIndexManager state (needsFullRebuild, queryableIndexes and 
inProgressBuilds) exclusively in the mark* methods rather than in many places 
throughout the code.

I know that would be cleaner, but bear in mind that's also a change from 
previous behaviour *and* would cause the index to be considered queryable even 
if the initialization task failed and later a non-full rebuild succeeded. I'm 
not a fan of that, but I'll leave it to you guys.

bq. I also removed the marking of index as failed before fetching the 
initialization task on this commit.

Looks good, and also properly marks the index as failed (which we forgot to do 
before).

bq. Regarding the exception being passed as the last parameter to the logger, 
that's not a bug since sl4j will properly handle that case

Apologies, I misread the diff.

Finally, I'd like to highlight two further changes from previous behaviour that 
just crossed my mind (unrelated to Paulo's patches):
1) If the index initialization fails because of a non-build error, and the 
index was previously built, it will be rebuilt again from scratch anyway.
2) If a single sstable build fails, the index will be (potentially) fully 
rebuilt at restart; while theoretically correct, this could have a non-trivial 
performance impact at startup.

That said, #1 is probably better addressed by CASSANDRA-13606, while #2 is what 
worries me most; I personally do not think that automatically rebuilding the 
indexes at startup gives much advantage, as it's an async operation anyway. So 
I'd propose adding a {{cassandra.yaml}} option to explicitly enable automatic 
rebuilds, and/or a NodeTool flag to rebuild indexes _only if_ not already built 
(see the sketch below).

Thoughts?
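A rough sketch of the gate I have in mind; the yaml option and all names below 
are hypothetical, purely to illustrate the behaviour:

{code:java}
class StartupRebuildSketch
{
    boolean autoRebuildOnStartup; // would come from the new cassandra.yaml option

    void maybeRebuildAtStartup(String indexName, boolean alreadyBuilt)
    {
        // Rebuild only if the index was never built, or the operator has
        // explicitly opted into automatic rebuilds at startup.
        if (!alreadyBuilt || autoRebuildOnStartup)
            submitRebuild(indexName);
    }

    void submitRebuild(String indexName) { /* async build submission elided */ }
}
{code}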

> Node failure during 2i update after streaming can have incomplete 2i when 
> restarted
> ---
>
> Key: CASSANDRA-10130
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10130
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: Yuki Morishita
>Assignee: Andrés de la Peña
>Priority: Minor
>
> Since MV/2i update happens after SSTables are received, node failure during 
> MV/2i update can leave received SSTables live when restarted while MV/2i are 
> partially up to date.
> We can add some kind of tracking mechanism to automatically rebuild at the 
> startup, or at least warn user when the node restarts.






[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted

2017-06-16 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051990#comment-16051990
 ] 

Sergio Bossa commented on CASSANDRA-10130:
--

bq. Update queryableIndexes only on markIndexBuilt and markIndexRemoved

I think we should preserve the original behaviour; it is not perfect, but 
CASSANDRA-13606 should provide a proper fix.

bq. Remove unnecessary marking of index in need of full rebuild on index 
creation

I do not think that was unnecessary, because {{createIndex}} could fail due to 
an exception in {{Index#getInitializationTask()}}, hence the index should be 
considered failed.

bq. Warn user to run full rebuild after index failure and log failure stack 
trace if present

+1 (but {{logAndMarkIndexesFailed}} seems to have a bug 
[here|https://github.com/pauloricardomg/cassandra/commit/19b6f3c68d1b00895de2d3da5aec54102b44ce9f#diff-3f2c8994c4ff8748c3faf7e70958520dR627])

Other than that, I'm +1 on the latest changes, I think we're almost done :)

> Node failure during 2i update after streaming can have incomplete 2i when 
> restarted
> ---
>
> Key: CASSANDRA-10130
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10130
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: Yuki Morishita
>Assignee: Andrés de la Peña
>Priority: Minor
>
> Since MV/2i update happens after SSTables are received, node failure during 
> MV/2i update can leave received SSTables live when restarted while MV/2i are 
> partially up to date.
> We can add some kind of tracking mechanism to automatically rebuild at the 
> startup, or at least warn user when the node restarts.






[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted

2017-06-15 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050347#comment-16050347
 ] 

Sergio Bossa commented on CASSANDRA-10130:
--

Also, a style note about indentation for lambdas and anonymous classes; I see 
[some|https://github.com/adelapena/cassandra/blob/eb0316d651bbdec70896be041ab1981cc9349a18/src/java/org/apache/cassandra/index/SecondaryIndexManager.java#L453]
 are deeply indented, but I think that makes the code harder to read, and I 
prefer the [standard 4-space 
indent|https://github.com/adelapena/cassandra/blob/eb0316d651bbdec70896be041ab1981cc9349a18/src/java/org/apache/cassandra/index/SecondaryIndexManager.java#L492]:
 thoughts? Whatever the choice, we should make the patch consistent (but 
possibly avoid changing otherwise untouched code).

> Node failure during 2i update after streaming can have incomplete 2i when 
> restarted
> ---
>
> Key: CASSANDRA-10130
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10130
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: Yuki Morishita
>Assignee: Andrés de la Peña
>Priority: Minor
>
> Since MV/2i update happens after SSTables are received, node failure during 
> MV/2i update can leave received SSTables live when restarted while MV/2i are 
> partially up to date.
> We can add some kind of tracking mechanism to automatically rebuild at the 
> startup, or at least warn user when the node restarts.






[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted

2017-06-15 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050313#comment-16050313
 ] 

Sergio Bossa commented on CASSANDRA-10130:
--

As a side note, I opened CASSANDRA-13606 to further improve initialization 
failures handling.

> Node failure during 2i update after streaming can have incomplete 2i when 
> restarted
> ---
>
> Key: CASSANDRA-10130
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10130
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: Yuki Morishita
>Assignee: Andrés de la Peña
>Priority: Minor
>
> Since MV/2i update happens after SSTables are received, node failure during 
> MV/2i update can leave received SSTables live when restarted while MV/2i are 
> partially up to date.
> We can add some kind of tracking mechanism to automatically rebuild at the 
> startup, or at least warn user when the node restarts.






[jira] [Created] (CASSANDRA-13606) Improve handling of 2i initialization failures

2017-06-15 Thread Sergio Bossa (JIRA)
Sergio Bossa created CASSANDRA-13606:


 Summary: Improve handling of 2i initialization failures
 Key: CASSANDRA-13606
 URL: https://issues.apache.org/jira/browse/CASSANDRA-13606
 Project: Cassandra
  Issue Type: Improvement
Reporter: Sergio Bossa
Assignee: Sergio Bossa
 Fix For: 4.0


CASSANDRA-10130 fixes the 2i build management, but initialization failures are 
still not properly handled, most notably because:
* Initialization failures make the index non-queryable, but it can still be 
written to.
* Initialization failures can be recovered via full rebuilds.

Both points above are probably suboptimal: the initialization logic could be 
more complex than just an index build, hence it shouldn't be made recoverable 
via a simple rebuild, and a failure could leave the index fully unavailable not 
just for reads, but for writes as well.

So, we should better handle initialization failures by:
* Allowing the index implementation to specify whether it is unavailable for 
reads, writes, or both.
* Providing a proper method to recover, distinct from index rebuilds (see the 
sketch below).
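A hypothetical sketch of what this could look like on the index interface (not 
the actual {{org.apache.cassandra.index.Index}} API):

{code:java}
import java.util.Set;
import java.util.concurrent.Callable;

interface IndexAvailabilitySketch
{
    enum Unavailability { READS, WRITES }

    // On initialization failure, the implementation declares what it can
    // still safely serve, instead of being blanket-marked non-queryable
    // while remaining writable.
    Set<Unavailability> unavailableAfter(Throwable initializationFailure);

    // Recovery hook distinct from a plain rebuild, since initialization
    // may involve more than just indexing existing data.
    Callable<?> getRecoveryTask();
}
{code}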






[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted

2017-06-15 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050253#comment-16050253
 ] 

Sergio Bossa commented on CASSANDRA-10130:
--

[~adelapena], [~pauloricardomg], excellent additions overall, just a few 
comments:

* I mistakenly removed {{queryableIndexes}} in my refactor (as noted); 
[this|https://github.com/adelapena/cassandra/commit/c5604ffd5ae36cc34d13af6c4c4eadba98458fa3]
 commit added it back, but it's unfortunately not enough: if the initialization 
task fails, the index will never be made queryable. We need to fix this, and we 
need a test for that failure case.
* Nit/OCD: I would change the 
[rebuildFromSSTablesBlocking|https://github.com/adelapena/cassandra/commit/a945358a36f97632e94e9103650d72c04735354d#diff-3f2c8994c4ff8748c3faf7e70958520dR306]
 signature to take the {{sstables}} first, consistent with 
{{buildIndexesBlocking}}.
* In {{buildIndexesBlocking}}, any exceptions thrown by 
{{flushIndexesBlocking}} in the {{finally}} block will hide previous 
exceptions: we should accumulate them (see the sketch after this list).
* Nit/OCD: In {{buildIndexesBlocking}}, we should add an explanatory comment 
before invoking {{flushIndexesBlocking}}.
* If {{rebuildFromSSTablesBlocking}} is not used elsewhere, I would strongly 
recommend making it private/protected and moving its comment to 
{{rebuildIndexesBlocking}}: the [general 
rule|https://image.slidesharecdn.com/effectivejava-2ndedition-chapter4-151203200429-lva1-app6891/95/effective-java-chapter-4-classes-and-interfaces-4-638.jpg?cb=1449173201]
 is that methods (as well as attributes) should have the most restricted 
visibility allowed by working code.
* Regarding the {{test_standalone_scrub}} failure, I _believe_ it's due to the 
fact that we only manage counters in our marking methods when 
{{DatabaseDescriptor.isDaemonInitialized()}} returns true, which is obviously 
not the case for the scrubber; if so, it's probably an easy fix.
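
Regarding the exception accumulation point, here's a minimal sketch of the 
intended pattern; the runnables stand in for the real build and 
{{flushIndexesBlocking}} steps, so this is not the actual {{SIM}} code:

{noformat}
// Minimal sketch of the suggested accumulation; illustrative only.
final class ExceptionAccumulation
{
    static void buildThenFlush(Runnable build, Runnable flush)
    {
        Throwable accumulated = null;
        try
        {
            build.run();
        }
        catch (Throwable t)
        {
            accumulated = t;
        }
        finally
        {
            try
            {
                flush.run();
            }
            catch (Throwable t)
            {
                // Don't let a flush failure hide the original build failure.
                if (accumulated == null)
                    accumulated = t;
                else
                    accumulated.addSuppressed(t);
            }
        }
        if (accumulated != null)
            throw new RuntimeException(accumulated);
    }
}
{noformat}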

> Node failure during 2i update after streaming can have incomplete 2i when 
> restarted
> ---
>
> Key: CASSANDRA-10130
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10130
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: Yuki Morishita
>Assignee: Andrés de la Peña
>Priority: Minor
>
> Since the MV/2i update happens after SSTables are received, a node failure
> during the MV/2i update can leave the received SSTables live on restart while
> the MV/2i are only partially up to date.
> We can add some kind of tracking mechanism to rebuild automatically at
> startup, or at least warn the user when the node restarts.






[jira] [Comment Edited] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted

2017-06-14 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049090#comment-16049090
 ] 

Sergio Bossa edited comment on CASSANDRA-10130 at 6/14/17 1:31 PM:
---

[~adelapena], I reviewed your latest patch and found the concurrent lifecycle 
implementation a bit hard to understand and maintain. I tried to explain it and 
give suggestions in writing, but in the end I found it easier and more 
productive to try an alternative implementation myself, which you can find 
[here|https://github.com/sbtourist/cassandra/commit/efe452846d29173e8b0d522651e7dc9f49f2597e];
 in a nutshell, this implementation doesn't require any additional locks and 
manages a fully thread-safe lifecycle, whose cases I've tried to test 
extensively.

Let me know if you (and [~pauloricardomg]) have any feedback, and feel free to 
pull it into your branch (just note the patch is implemented against 
[this|https://github.com/adelapena/cassandra/commit/c002d26255eaf53070a840d7232788e33f852e19]
 commit in your branch, i.e. the one before your latest changes).


was (Author: sbtourist):
[~adelapena], I reviewed your latest patch and found the concurrent lifecycle 
implementation a bit hard to understand and maintain. I tried to explain it and 
give suggestions in writing, but in the end I found it easier and more 
productive to try an alternative implementation myself, which you can find 
[here|https://github.com/sbtourist/cassandra/commit/e1e316eae4d45a01aff7c86e1e21dc3a555e9ca5];
 in a nutshell, this implementation doesn't require any additional locks and 
manages a fully thread-safe lifecycle, whose cases I've tried to test 
extensively.

Let me know if you (and [~pauloricardomg]) have any feedback, and feel free to 
pull it into your branch (just note the patch is implemented against 
[this|https://github.com/adelapena/cassandra/commit/c002d26255eaf53070a840d7232788e33f852e19]
 commit in your branch, i.e. the one before your latest changes).

> Node failure during 2i update after streaming can have incomplete 2i when 
> restarted
> ---
>
> Key: CASSANDRA-10130
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10130
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: Yuki Morishita
>Assignee: Andrés de la Peña
>Priority: Minor
>
> Since the MV/2i update happens after SSTables are received, a node failure
> during the MV/2i update can leave the received SSTables live on restart while
> the MV/2i are only partially up to date.
> We can add some kind of tracking mechanism to rebuild automatically at
> startup, or at least warn the user when the node restarts.






[jira] [Comment Edited] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted

2017-06-14 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049090#comment-16049090
 ] 

Sergio Bossa edited comment on CASSANDRA-10130 at 6/14/17 11:49 AM:


[~adelapena], I reviewed your latest patch and found the concurrent lifecycle 
implementation a bit hard to understand and maintain. I tried to explain it and 
give suggestions in writing, but in the end I found it easier and more 
productive to try an alternative implementation myself, which you can find 
[here|https://github.com/sbtourist/cassandra/commit/e1e316eae4d45a01aff7c86e1e21dc3a555e9ca5];
 in a nutshell, this implementation doesn't require any additional locks and 
manages a fully thread-safe lifecycle, whose cases I've tried to test 
extensively.

Let me know if you (and [~pauloricardomg]) have any feedback, and feel free to 
pull it into your branch (just note the patch is implemented against 
[this|https://github.com/adelapena/cassandra/commit/c002d26255eaf53070a840d7232788e33f852e19]
 commit in your branch, i.e. the one before your latest changes).


was (Author: sbtourist):
[~adelapena], I reviewed your latest patch and found the concurrent lifecycle 
implementation a bit hard to understand and maintain. I tried to explain it and 
give suggestions in writing, but in the end I found it easier and more 
productive to try an alternative implementation myself, which you can find 
[here|https://github.com/sbtourist/cassandra/commit/e1e316eae4d45a01aff7c86e1e21dc3a555e9ca5];
 in a nutshell, this implementation doesn't require any additional locks and 
manages a fully thread-safe lifecycle, whose cases I've tried to test 
extensively.

Let me know if you (and [~pauloricardomg]) have any feedback, and feel free to 
pull it into your branch.

> Node failure during 2i update after streaming can have incomplete 2i when 
> restarted
> ---
>
> Key: CASSANDRA-10130
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10130
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: Yuki Morishita
>Assignee: Andrés de la Peña
>Priority: Minor
>
> Since the MV/2i update happens after SSTables are received, a node failure
> during the MV/2i update can leave the received SSTables live on restart while
> the MV/2i are only partially up to date.
> We can add some kind of tracking mechanism to rebuild automatically at
> startup, or at least warn the user when the node restarts.






[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted

2017-06-14 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049090#comment-16049090
 ] 

Sergio Bossa commented on CASSANDRA-10130:
--

[~adelapena], I reviewed your latest patch and found the concurrent lifecycle 
implementation a bit hard to understand and maintain. I tried to explain it and 
give suggestions in writing, but in the end I found it easier and more 
productive to try an alternative implementation myself, which you can find 
[here|https://github.com/sbtourist/cassandra/commit/e1e316eae4d45a01aff7c86e1e21dc3a555e9ca5];
 in a nutshell, this implementation doesn't require any additional locks and 
manages a fully thread-safe lifecycle, whose cases I've tried to test 
extensively.

Let me know if you (and [~pauloricardomg]) have any feedback, and feel free to 
pull it into your branch.

> Node failure during 2i update after streaming can have incomplete 2i when 
> restarted
> ---
>
> Key: CASSANDRA-10130
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10130
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: Yuki Morishita
>Assignee: Andrés de la Peña
>Priority: Minor
>
> Since the MV/2i update happens after SSTables are received, a node failure
> during the MV/2i update can leave the received SSTables live on restart while
> the MV/2i are only partially up to date.
> We can add some kind of tracking mechanism to rebuild automatically at
> startup, or at least warn the user when the node restarts.






[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted

2017-06-06 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16038692#comment-16038692
 ] 

Sergio Bossa commented on CASSANDRA-10130:
--

bq. Sorry for being picky here but while we are fixing the original limitation, 
we are introducing a new limitation that if there's ever a non-fatal index 
build failure, a successful full index rebuild will not mark the index as built 
until the node is restarted and the index is unnecessarily rebuilt.

Excellent point, you're not being picky at all. There's actually a related 
problem: if indexing a single sstable fails, we restart the node, and we then 
load a *new* sstable, the index will be marked as built even though there's an 
sstable whose indexing failed.

In other words, it seems to me we should mark the index as built _only_ if:
1) It is a full rebuild.
2) It is an initialization task (which should be considered the initial full 
build).
3) It is an indexing of a single sstable (or group of sstables) and the index 
was *already built*: that is, if we initiate an sstable indexing but the index 
was *not* marked as built, we should preserve that state (implementation-wise, 
that probably means just keeping the counters at 0 and avoiding any marking).
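
A tiny sketch of the rule above (illustrative names, not the actual 
{{SecondaryIndexManager}} code):

{noformat}
// Only a full rebuild, the initialization task (the initial full build), or
// an incremental sstable indexing on an already-built index may mark the
// index as built; otherwise the unbuilt state is preserved.
class BuildMarker
{
    private volatile boolean built;

    void onBuildCompleted(boolean fullRebuild, boolean initialization)
    {
        if (fullRebuild || initialization || built)
            built = true;
    }
}
{noformat}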

Other than that, I have a concern about [flushing indexes in the future 
transformation|https://github.com/apache/cassandra/compare/trunk...adelapena:10130-trunk#diff-3f2c8994c4ff8748c3faf7e70958520dR399],
 which would cause that blocking activity to happen on the compaction thread, a 
departure from previous behaviour and probably an unwanted one.

> Node failure during 2i update after streaming can have incomplete 2i when 
> restarted
> ---
>
> Key: CASSANDRA-10130
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10130
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: Yuki Morishita
>Assignee: Andrés de la Peña
>Priority: Minor
>
> Since the MV/2i update happens after SSTables are received, a node failure
> during the MV/2i update can leave the received SSTables live on restart while
> the MV/2i are only partially up to date.
> We can add some kind of tracking mechanism to rebuild automatically at
> startup, or at least warn the user when the node restarts.






[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted

2017-06-02 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16034963#comment-16034963
 ] 

Sergio Bossa commented on CASSANDRA-10130:
--

bq. It's still possible that an index is created after the 
SSTableAddedNotification but before all sstables are added to the tracker

That also made me think of an ABA-like race, where a new index is registered 
_and built_ in between the two notifications, causing it to miss the 
soon-to-be-added sstables. Weird stuff, and overall excellent points, 
[~pauloricardomg].

Given most of the races come from the 2-phase event-based solution, I'd suggest 
simplifying it in a way that still at least solves the original issue: that is, 
keep just the {{SSTableAddedNotification}} and mark the index building/built 
around it; this will not protect against failures happening _while_ adding 
SSTables, but it would at least solve the original problem of the index itself 
failing after adding the SSTables and never being rebuilt. Then, we can think 
about a more complete solution (i.e. one that doesn't rebuild the whole index 
because of a single failed stream) in another ticket.
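
For concreteness, the simplified flow could look roughly like this (a sketch 
with the mark/index steps injected as runnables; not the actual 
{{SecondaryIndexManager}} code):

{noformat}
// Sketch only: mark the index as building before indexing the added sstables,
// and as built only on success, so a failure leaves it flagged for rebuild
// at startup.
final class SimplifiedFlow
{
    static void onSSTablesAdded(Runnable markBuilding, Runnable indexAdded, Runnable markBuilt)
    {
        markBuilding.run();  // clears the "built" marker first
        indexAdded.run();    // if this throws, the marker stays cleared...
        markBuilt.run();     // ...and this is reached only on success
    }
}
{noformat}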

Thoughts?

> Node failure during 2i update after streaming can have incomplete 2i when 
> restarted
> ---
>
> Key: CASSANDRA-10130
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10130
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: Yuki Morishita
>Assignee: Andrés de la Peña
>Priority: Minor
>
> Since the MV/2i update happens after SSTables are received, a node failure
> during the MV/2i update can leave the received SSTables live on restart while
> the MV/2i are only partially up to date.
> We can add some kind of tracking mechanism to rebuild automatically at
> startup, or at least warn the user when the node restarts.






[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted

2017-06-02 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16034482#comment-16034482
 ] 

Sergio Bossa commented on CASSANDRA-10130:
--

bq.  I have replaced SSTableLoadedNotification by a 
SSTableBeforeAddedNotification

Nit: {{SSTableBeforeAddedNotification}} would probably read better as 
{{SSTableBeforeAddNotification}}.

bq. It's not clear that users of buildIndexesBlocking should mark the indexes 
as building beforehand, so we should make that explicit via an assertion and 
probably also add a comment

A comment is a good idea, but I think we shouldn't go further than that: 
{{buildIndexesBlocking}} is private so we can assume callers know what they're 
doing.

bq. There's still a slight chance that an index is created between an 
SSTableBeforeAddedNotification and SSTableAddedNotification and we won't have 
marked it as building

I think in such a case the new {{Index}} initialization task would take care of 
the indexing, so it doesn't really matter if the new index misses the sstable 
notification. So, to cover the specific race you mentioned, we can just filter 
out in {{SIM#handleNotification()}}, when receiving the 
{{SSTableAddedNotification}}, all indexes not marked as building, as we can 
assume those missed the first notification because they were not yet registered 
(and, being just registered, their initialization task will eventually take 
care of any initial indexing).
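
A rough sketch of that filtering, with generic placeholders standing in for 
the real {{SIM}} types:

{noformat}
import java.util.Set;
import java.util.function.BiConsumer;

// Sketch only: skip indexes that missed the "before add" notification; their
// initialization task will index the new sstables anyway.
class AddedNotificationHandler<I, S>
{
    void onSSTablesAdded(Iterable<I> registered, Set<I> building,
                         Iterable<S> added, BiConsumer<I, Iterable<S>> indexer)
    {
        for (I index : registered)
            if (building.contains(index))
                indexer.accept(index, added);
    }
}
{noformat}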

bq. we could probably get rid of the pendingBuilds counter complexity since we 
wouldn't race with manual individual index rebuilds on the BUILT_INDEXES table

I think the race above can be solved easily without adding columns to the 
system table. Other than that, let's not forget the {{pendingBuilds}} counter 
was needed to protect us not just against concurrent building of multiple 
indexes, but also against concurrent building of multiple sstable "batches" for 
the same index.

Thoughts? I'll now proceed with reviewing the latest code changes.

> Node failure during 2i update after streaming can have incomplete 2i when 
> restarted
> ---
>
> Key: CASSANDRA-10130
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10130
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: Yuki Morishita
>Assignee: Andrés de la Peña
>Priority: Minor
>
> Since the MV/2i update happens after SSTables are received, a node failure
> during the MV/2i update can leave the received SSTables live on restart while
> the MV/2i are only partially up to date.
> We can add some kind of tracking mechanism to rebuild automatically at
> startup, or at least warn the user when the node restarts.






[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted

2017-05-31 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16031771#comment-16031771
 ] 

Sergio Bossa commented on CASSANDRA-10130:
--

[~pauloricardomg], excellent suggestion.

[~adelapena], I personally like the "memtable" version more since, logically 
speaking, if an sstable is added with a memtable it means it has also been 
indexed, while in all other cases the sstable(s) have been externally loaded 
and need indexing. But I have a few concerns:
1) There are too many overloads of {{addSSTable(s)}}, some of which do not make 
much sense and are there just to support self calls (i.e. the version with many 
sstables and a single memtable, I believe), so can we clean that up?
2) Who is actually calling the {{addSSTable}} method with the memtable? I think 
I'm missing where it's actually used...
3) Why did you reduce the test coverage in 
{{indexCorrectlyMarkedAsBuildAndRemoved}}?

> Node failure during 2i update after streaming can have incomplete 2i when 
> restarted
> ---
>
> Key: CASSANDRA-10130
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10130
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: Yuki Morishita
>Assignee: Andrés de la Peña
>Priority: Minor
>
> Since the MV/2i update happens after SSTables are received, a node failure
> during the MV/2i update can leave the received SSTables live on restart while
> the MV/2i are only partially up to date.
> We can add some kind of tracking mechanism to rebuild automatically at
> startup, or at least warn the user when the node restarts.






[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted

2017-05-18 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16015593#comment-16015593
 ] 

Sergio Bossa commented on CASSANDRA-10130:
--

[~adelapena], the approach looks good to me; here is some specific feedback:

* Why execute the 
[preBuildTask|https://github.com/apache/cassandra/compare/trunk...adelapena:10130-trunk#diff-3f2c8994c4ff8748c3faf7e70958520dR389]
 in another thread via {{executeBlocking}}, rather than just calling 
{{Runnable#run}}?

* Currently, {{buildIndexesBlocking}} will end up rebuilding *all* indexes even 
if just a single one failed, because {{markIndexBuilt}} is called in bulk at 
the very end; I know this is in line with the previous behaviour, but wouldn't 
it make sense to improve it in this issue?

* Do we have tests checking the following (see the sketch after this list for 
an example)?
** Index status pre and post (re)building actions (create index, rebuild index).
** Index status upon index removal.
** Index status and automatic rebuild in case of failures.
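
As an example of the failure case, a JUnit-style sketch; the helpers below are 
hypothetical placeholders, not existing test utilities:

{noformat}
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;
import org.junit.Test;

// Illustrative sketch of the missing failure-case test.
public class IndexBuildStatusTest
{
    @Test
    public void indexStaysUnbuiltWhenInitializationFails()
    {
        String index = createIndexWithFailingInitialization(); // init task throws
        assertFalse(isIndexBuilt(index));  // must not be marked as built
        rebuildIndex(index);               // recovery via a full rebuild
        assertTrue(isIndexBuilt(index));   // only now marked as built
    }

    // Placeholders standing in for real cluster/test plumbing.
    private boolean built = false;
    private String createIndexWithFailingInitialization() { return "idx"; }
    private boolean isIndexBuilt(String index) { return built; }
    private void rebuildIndex(String index) { built = true; }
}
{noformat}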

> Node failure during 2i update after streaming can have incomplete 2i when 
> restarted
> ---
>
> Key: CASSANDRA-10130
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10130
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: Yuki Morishita
>Assignee: Andrés de la Peña
>Priority: Minor
>
> Since the MV/2i update happens after SSTables are received, a node failure
> during the MV/2i update can leave the received SSTables live on restart while
> the MV/2i are only partially up to date.
> We can add some kind of tracking mechanism to rebuild automatically at
> startup, or at least warn the user when the node restarts.






[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted

2017-05-16 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16012006#comment-16012006
 ] 

Sergio Bossa commented on CASSANDRA-10130:
--

Some more feedback from my side too.

I think there is still a subtle concurrency issue, due to the fact that some 
methods manage the "index status" in the system keyspace via the counter, 
while other methods directly mess with the system keyspace and/or the pending 
builds map; e.g., what happens if the index is unregistered while it is being 
rebuilt? Probably an 
[NPE|https://github.com/apache/cassandra/compare/trunk...adelapena:10130-trunk#diff-3f2c8994c4ff8748c3faf7e70958520dR424].
 I would suggest cleaning it up a bit by not keeping "pending builds" in the 
map after they're all done, that is, after the index has been marked as built: 
by doing so, you can just remove the pending build inside {{markIndexBuilt}} 
when the counter reaches 0 (you'd have to guard the {{mark*}} methods rather 
than the counter ones), and hence keep the "build management" code inside the 
two "mark building/built" methods only.

I'm somewhat unsure whether we should actually call {{markIndexBuilding}} 
inside {{createIndex}}: what if the user forgets to call {{markIndexBuilt}} 
inside the initialization task? There's no contract forcing the user to do 
that. So, as a corollary, we should probably accept calls to {{markIndexBuilt}} 
even without a previous {{markIndexBuilding}} call (that is, make it 
idempotent).
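
Something along these lines (a sketch with illustrative names, not the actual 
implementation):

{noformat}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of idempotent mark methods guarded by a per-index pending counter.
class IndexBuildTracker
{
    private final Map<String, AtomicInteger> pendingBuilds = new ConcurrentHashMap<>();

    void markIndexBuilding(String index)
    {
        pendingBuilds.computeIfAbsent(index, k -> new AtomicInteger()).incrementAndGet();
    }

    void markIndexBuilt(String index)
    {
        AtomicInteger pending = pendingBuilds.get(index);
        // Idempotent: tolerate a markIndexBuilt without a matching
        // markIndexBuilding (e.g. an initialization task that never called it).
        if (pending == null || pending.decrementAndGet() <= 0)
        {
            pendingBuilds.remove(index);
            // ... write the "built" marker to the system keyspace here ...
        }
    }
}
{noformat}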

Also, {{buildAllIndexesBlocking}} doesn't follow the "mark building/built" 
pattern: isn't that a weird anomaly? I know {{StreamReceiveTask}} does that, 
but what about other callers?

Regarding {{PendingIndexBuildsCounter}}, it isn't really just a counter, so I 
would rename it (e.g. just {{PendingBuild}}) and rename its methods as well for 
further clarity; or we could ditch it altogether, as suggested by 
[~pauloricardomg].

Finally, regarding:

bq. the user will expect the index NOT to be rebuilt on restart if there was a 
subsequent successful rebuild.

Is that what [~adelapena] actually meant? I don't think the index is actually 
rebuilt on restart if it was successfully rebuilt manually, or am I missing the 
point?

> Node failure during 2i update after streaming can have incomplete 2i when 
> restarted
> ---
>
> Key: CASSANDRA-10130
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10130
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: Yuki Morishita
>Assignee: Andrés de la Peña
>Priority: Minor
>
> Since the MV/2i update happens after SSTables are received, a node failure
> during the MV/2i update can leave the received SSTables live on restart while
> the MV/2i are only partially up to date.
> We can add some kind of tracking mechanism to rebuild automatically at
> startup, or at least warn the user when the node restarts.






[jira] [Assigned] (CASSANDRA-8273) Allow filtering queries can return stale data

2017-05-15 Thread Sergio Bossa (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa reassigned CASSANDRA-8273:
---

Assignee: Andrés de la Peña

> Allow filtering queries can return stale data
> -
>
> Key: CASSANDRA-8273
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8273
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sylvain Lebresne
>Assignee: Andrés de la Peña
>
> Data filtering is done replica side. That means that a single replica with 
> stale data may make the whole query return that stale data.
> For instance, consider 3 replicas A, B and C, and the following situation:
> {noformat}
> CREATE TABLE test (k int PRIMARY KEY, v1 text, v2 int);
> CREATE INDEX ON test(v1);
> INSERT INTO test(k, v1, v2) VALUES (0, 'foo', 1);
> {noformat}
> with every replica up to date. Now, suppose that the following queries are 
> done at {{QUORUM}}:
> {noformat}
> UPDATE test SET v2 = 2 WHERE k = 0;
> SELECT * FROM test WHERE v1 = 'foo' AND v2 = 1;
> {noformat}
> then, if A and B acknowledge the update but C responds to the read before
> having applied it, the now-stale result will be returned. Let's note that
> this is a problem related to filtering, not 2ndary indexes.
> This issue shares similarities with CASSANDRA-8272 but, contrary to that
> issue, I'm not sure how to fix it. Obviously, moving the filtering to the
> coordinator would remove the problem, but doing so would, on top of not being
> trivial to implement, have a serious performance impact, since we can't know
> in advance how much data will be filtered and we may have to redo the query
> to the replicas multiple times.






[jira] [Commented] (CASSANDRA-8272) 2ndary indexes can return stale data

2017-05-15 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16010605#comment-16010605
 ] 

Sergio Bossa commented on CASSANDRA-8272:
-

bq. Here is a draft patch showing the approach.

Excellent, I like it too.

> 2ndary indexes can return stale data
> 
>
> Key: CASSANDRA-8272
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8272
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sylvain Lebresne
>Assignee: Andrés de la Peña
> Fix For: 3.0.x
>
>
> When replicas return 2ndary index results, it's possible for a single
> replica to return a stale result that will be sent back to the user,
> potentially violating the CL contract.
> For instance, consider 3 replicas A, B and C, and the following situation:
> {noformat}
> CREATE TABLE test (k int PRIMARY KEY, v text);
> CREATE INDEX ON test(v);
> INSERT INTO test(k, v) VALUES (0, 'foo');
> {noformat}
> with every replica up to date. Now, suppose that the following queries are 
> done at {{QUORUM}}:
> {noformat}
> UPDATE test SET v = 'bar' WHERE k = 0;
> SELECT * FROM test WHERE v = 'foo';
> {noformat}
> then, if A and B acknowledge the update but C responds to the read before
> having applied it, the now-stale result will be returned (since C will return
> it and A or B will return nothing).
> A potential solution would be that, when we read a tombstone in the index
> (provided we make the index inherit the gcGrace of its parent CF), instead of
> skipping that tombstone, we'd insert a corresponding range tombstone in the
> result.






[jira] [Comment Edited] (CASSANDRA-8272) 2ndary indexes can return stale data

2017-05-12 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008026#comment-16008026
 ] 

Sergio Bossa edited comment on CASSANDRA-8272 at 5/12/17 12:25 PM:
---

bq.  I just want us to have the discussion around filtering in CASSANDRA-8273 
to avoid mixing things up, but If we can agree on moving filtering server-side 
there quickly, then I'm totally fine doing that and the indexing in a single 
patch if we prefer.

I'm fine with moving filtering server side, and I'm fine with dealing with it 
in either issue, provided we eventually address CASSANDRA-8273 straight away.

bq. If you're making a general point [about API], then sure.

Yes, I'm making a general point. Also see below.

bq. we'd have to rely on the version exchanged through gossip in a way we never 
have, so with risks associated 

Alright, I don't have enough knowledge to provide a proper risk assessment on 
my own, so I'm fine with the previously proposed "split" patch.

bq. I think we can and should make most things index agnostic here

Totally agree here, and that's what I meant when saying we should make it work 
with all index implementations and devise APIs to leverage. I agree we should 
move away from using {{Index#postProcessorFor}}, and I agree we should keep the 
counting outside the index implementation (again, as it will be useful to all 
index implementations), but I'll leave the actual implementation details to 
[~adelapena] and discuss once we've got a first cut.

So here's my takeaway on how we should move forward:
1) Provide a first patch based on 3.11 to fix the coordinator side.
2) Provide a second patch based on trunk to fix both coordinator and replica 
side (we could address this first, review, and backport the coordinator side 
once in agreement).
3) Discuss filtering in CASSANDRA-8273 and eventually address it here (this 
patch is getting quite meaty already, but OTOH CASSANDRA-8273 will require a 
split approach too, so we might want to make our life easier and do everything 
in one place).

Thoughts?


was (Author: sbtourist):
bq.  I just want us to have the discussion around filtering in CASSANDRA-8273 
to avoid mixing things up, but If we can agree on moving filtering server-side 
there quickly, then I'm totally fine doing that and the indexing in a single 
patch if we prefer.

I'm fine with moving filtering server side, and I'm fine with dealing with it 
in either issue, provided we eventually address CASSANDRA-8273 straight away.

bq. If you're making a general point [about API], then sure.

Yes, I'm making a general point. Also see below.

bq. we'd have to rely on the version exchanged through gossip in a way we never 
have, so with risks associated 

Alright, I don't have enough knowledge to provide a proper risk assessment on 
my own, so I'm fine with the previously proposed "split" patch.

bq. I think we can and should make most things index agnostic here

Totally agree here, and that's what I meant when saying we should make it work 
with all index implementations and devise APIs to leverage. I agree we should 
move away from using {{Index#postProcessorFor}}, and I agree we should keep the 
counting outside the index implementation (again, as it will be useful to all 
index implementations), but I'll leave the actual implementation details to 
[~adelapena] and discuss once we've got a first cut.

So here's my takeaway on how we should move forward:
1) Provide a first patch based on 3.11 to fix the coordinator side.
2) Provide a second patch based on trunk to fix both coordinator and replica 
side (we could address this first, review, and backport the coordinator side 
once in agreement).
3) Leave filtering to CASSANDRA-8273, which we'll address straight after (this 
patch is getting quite meaty already and I'm now convinced it's indeed better 
to separate the two issues).

Thoughts?

> 2ndary indexes can return stale data
> 
>
> Key: CASSANDRA-8272
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8272
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sylvain Lebresne
>Assignee: Andrés de la Peña
> Fix For: 3.0.x
>
>
> When replicas return 2ndary index results, it's possible for a single
> replica to return a stale result that will be sent back to the user,
> potentially violating the CL contract.
> For instance, consider 3 replicas A, B and C, and the following situation:
> {noformat}
> CREATE TABLE test (k int PRIMARY KEY, v text);
> CREATE INDEX ON test(v);
> INSERT INTO test(k, v) VALUES (0, 'foo');
> {noformat}
> with every replica up to date. Now, suppose that the following queries are 
> done at {{QUORUM}}:
> {noformat}
> UPDATE test SET v = 'bar' WHERE k = 0;
> SELECT * FROM test WHERE v = 'foo';
> {noformat}
> then, if A 

[jira] [Comment Edited] (CASSANDRA-8272) 2ndary indexes can return stale data

2017-05-12 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008026#comment-16008026
 ] 

Sergio Bossa edited comment on CASSANDRA-8272 at 5/12/17 12:10 PM:
---

bq.  I just want us to have the discussion around filtering in CASSANDRA-8273 
to avoid mixing things up, but If we can agree on moving filtering server-side 
there quickly, then I'm totally fine doing that and the indexing in a single 
patch if we prefer.

I'm fine with moving filtering server side, and I'm fine with dealing with it 
in either issue, provided we eventually address CASSANDRA-8273 straight away.

bq. If you're making a general point [about API], then sure.

Yes, I'm making a general point. Also see below.

bq. we'd have to rely on the version exchanged through gossip in a way we never 
have, so with risks associated 

Alright, I don't have enough knowledge to provide a proper risk assessment on 
my own, so I'm fine with the previously proposed "split" patch.

bq. I think we can and should make most things index agnostic here

Totally agree here, and that's what I meant when saying we should make it work 
with all index implementations and devise APIs to leverage. I agree we should 
move away from using {{Index#postProcessorFor}}, and I agree we should keep the 
counting outside the index implementation (again, as it will be useful to all 
index implementations), but I'll leave the actual implementation details to 
[~adelapena] and discuss once we've got a first cut.

So here's my takeaway on how we should move forward:
1) Provide a first patch based on 3.11 to fix the coordinator side.
2) Provide a second patch based on trunk to fix both coordinator and replica 
side (we could address this first, review, and backport the coordinator side 
once in agreement).
3) Leave filtering to CASSANDRA-8273, which we'll address straight after (this 
patch is getting quite meaty already and I'm now convinced it's indeed better 
to separate the two issues).

Thoughts?


was (Author: sbtourist):
bq.  I just want us to have the discussion around filtering in CASSANDRA-8273 
to avoid mixing things up, but If we can agree on moving filtering server-side 
there quickly, then I'm totally fine doing that and the indexing in a single 
patch if we prefer.

I'm fine with moving filtering server side, and I'm fine with dealing with it 
in either issue, provided we eventually address CASSANDRA-8273 straight away.

bq. If you're making a general point [about API], then sure.

Yes, I'm making a general point. Also see below.

bq. we'd have to rely on the version exchanged through gossip in a way we never 
have, so with risks associated 

Alright, I don't have enough knowledge to provide a proper risk assessment on 
my own, so I'm fine with the previously proposed "split" patch.

bq. I think we can and should make most things index agnostic here

Totally agree here, and that's what I meant when saying we should make it work 
with all index implementations and devise APIs to leverage. I agree we should 
move away from using {{Index#postProcessorFor}}, and I agree we should keep the 
counting outside the index implementation (again, as it will be useful to all 
index implementations), but I'll leave the actual implementation details to 
[~adelapena] and discuss once we've got a first cut.

So here's my takeaway on how we should move forward:
1) Provide a first patch based on 3.11 to fix the coordinator side.
2) Provide a second patch based on trunk to fix both coordinator and replica 
side (we could address this first, review, and backport the coordinator side 
once in agreement).
3) Leave filtering to CASSANDRA-8273, which we'll address straight after (this 
patch is getting quite meaty already and I'm not convinced it's indeed better 
to separate the two issues).

Thoughts?

> 2ndary indexes can return stale data
> 
>
> Key: CASSANDRA-8272
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8272
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sylvain Lebresne
>Assignee: Andrés de la Peña
> Fix For: 3.0.x
>
>
> When replicas return 2ndary index results, it's possible for a single
> replica to return a stale result that will be sent back to the user,
> potentially violating the CL contract.
> For instance, consider 3 replicas A, B and C, and the following situation:
> {noformat}
> CREATE TABLE test (k int PRIMARY KEY, v text);
> CREATE INDEX ON test(v);
> INSERT INTO test(k, v) VALUES (0, 'foo');
> {noformat}
> with every replica up to date. Now, suppose that the following queries are 
> done at {{QUORUM}}:
> {noformat}
> UPDATE test SET v = 'bar' WHERE k = 0;
> SELECT * FROM test WHERE v = 'foo';
> {noformat}
> then, if A and B acknowledge the update but C responds to the read before 

[jira] [Commented] (CASSANDRA-8272) 2ndary indexes can return stale data

2017-05-12 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008026#comment-16008026
 ] 

Sergio Bossa commented on CASSANDRA-8272:
-

bq.  I just want us to have the discussion around filtering in CASSANDRA-8273 
to avoid mixing things up, but If we can agree on moving filtering server-side 
there quickly, then I'm totally fine doing that and the indexing in a single 
patch if we prefer.

I'm fine with moving filtering server side, and I'm fine with dealing with it 
in either issue, provided we eventually address CASSANDRA-8273 straight away.

bq. If you're making a general point [about API], then sure.

Yes, I'm making a general point. Also see below.

bq. we'd have to rely on the version exchanged through gossip in a way we never 
have, so with risks associated 

Alright, I don't have enough knowledge to provide a proper risk assessment on 
my own, so I'm fine with the previously proposed "split" patch.

bq. I think we can and should make most things index agnostic here

Totally agree here, and that's what I meant when saying we should make it work 
with all index implementations and devise APIs to leverage. I agree we should 
move away from using {{Index#postProcessorFor}}, and I agree we should keep the 
counting outside the index implementation (again, as it will be useful to all 
index implementations), but I'll leave the actual implementation details to 
[~adelapena] and discuss once we've got a first cut.

So here's my takeaway on how we should move forward:
1) Provide a first patch based on 3.11 to fix the coordinator side.
2) Provide a second patch based on trunk to fix both coordinator and replica 
side (we could address this first, review, and backport the coordinator side 
once in agreement).
3) Leave filtering to CASSANDRA-8273, which we'll address straight after (this 
patch is getting quite meaty already and I'm not convinced it's indeed better 
to separate the two issues).

Thoughts?

> 2ndary indexes can return stale data
> 
>
> Key: CASSANDRA-8272
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8272
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sylvain Lebresne
>Assignee: Andrés de la Peña
> Fix For: 3.0.x
>
>
> When replicas return 2ndary index results, it's possible for a single
> replica to return a stale result that will be sent back to the user,
> potentially violating the CL contract.
> For instance, consider 3 replicas A, B and C, and the following situation:
> {noformat}
> CREATE TABLE test (k int PRIMARY KEY, v text);
> CREATE INDEX ON test(v);
> INSERT INTO test(k, v) VALUES (0, 'foo');
> {noformat}
> with every replica up to date. Now, suppose that the following queries are 
> done at {{QUORUM}}:
> {noformat}
> UPDATE test SET v = 'bar' WHERE k = 0;
> SELECT * FROM test WHERE v = 'foo';
> {noformat}
> then, if A and B acknowledge the update but C responds to the read before
> having applied it, the now-stale result will be returned (since C will return
> it and A or B will return nothing).
> A potential solution would be that, when we read a tombstone in the index
> (provided we make the index inherit the gcGrace of its parent CF), instead of
> skipping that tombstone, we'd insert a corresponding range tombstone in the
> result.






[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted

2017-05-12 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16007972#comment-16007972
 ] 

Sergio Bossa commented on CASSANDRA-10130:
--

bq. There is still a slight possibility of race though if an index is being 
built by some other thread and we mark the index as built after the streaming 
operation but before the other thread build is completed.

There's actually a very probable race in the case of parallel repairs, when 
multiple streaming tasks for the same table are created. We should either find 
a better (race-free) place to put those calls, or employ a guarded counter as 
suggested.

> Node failure during 2i update after streaming can have incomplete 2i when 
> restarted
> ---
>
> Key: CASSANDRA-10130
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10130
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: Yuki Morishita
>Assignee: Andrés de la Peña
>Priority: Minor
>
> Since the MV/2i update happens after SSTables are received, a node failure
> during the MV/2i update can leave the received SSTables live on restart while
> the MV/2i are only partially up to date.
> We can add some kind of tracking mechanism to rebuild automatically at
> startup, or at least warn the user when the node restarts.






[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted

2017-05-11 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16006895#comment-16006895
 ] 

Sergio Bossa commented on CASSANDRA-10130:
--

I had a quick look at this one and I think we could avoid adding yet another 
system table by just reusing the {{BUILT_INDEXES}} table and following the same 
pattern used by {{SIM#rebuildIndexesBlocking()}}, that is:
* Call {{SystemKeyspace#setIndexRemoved()}}.
* Build index.
* Call {{SystemKeyspace#setIndexBuilt()}}.

Thoughts? Am I missing anything actually requiring a new table?
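
In code, the pattern would be as simple as this (a sketch; the wrapper method 
is illustrative, and it assumes the {{SystemKeyspace}} helpers keep their 
(keyspace, index) signatures):

{noformat}
// Sketch of the suggested pattern; the wrapper method is illustrative.
void buildWithTracking(String keyspace, String index, Runnable build)
{
    // Clear the "built" marker first, so a crash mid-build leaves the
    // index flagged for rebuild at startup.
    SystemKeyspace.setIndexRemoved(keyspace, index);
    build.run();
    // Only reached on success.
    SystemKeyspace.setIndexBuilt(keyspace, index);
}
{noformat}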

> Node failure during 2i update after streaming can have incomplete 2i when 
> restarted
> ---
>
> Key: CASSANDRA-10130
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10130
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: Yuki Morishita
>Assignee: Andrés de la Peña
>Priority: Minor
>
> Since the MV/2i update happens after SSTables are received, a node failure
> during the MV/2i update can leave the received SSTables live on restart while
> the MV/2i are only partially up to date.
> We can add some kind of tracking mechanism to rebuild automatically at
> startup, or at least warn the user when the node restarts.






[jira] [Commented] (CASSANDRA-8272) 2ndary indexes can return stale data

2017-05-11 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16006545#comment-16006545
 ] 

Sergio Bossa commented on CASSANDRA-8272:
-

bq. my main point is that on principle we should be careful to look at the 
whole solution before comparing it to alternatives and deciding which one we 
"stick with". I've seen simple solutions get pretty messy once you fix all edge 
cases to the point that it wasn't the best solution anymore.

Of course. And I indeed gave some thought of my own to the tombstones solution 
(as I'm sure [~adelapena] did as well), and I've found it quite a bit more 
complex than the current one, with little/no gain in return and, something I 
didn't mention before, not really complete for indexes covering multiple 
columns, or if we'll ever want to support multiple indexes per row: in such 
cases, mixing tombstones and valid column values for all combinations would 
easily turn into a mess IMHO, while actually returning the row and 
post-filtering later is IMHO cleaner and less error prone. To be noted, we 
could still "skim" the row when we detect it's related to a stale entry and 
only keep the index-related columns (and easily add a merging step in the 
future for the multiple-indexes case): this would buy us the performance 
optimization you mentioned above, but I see it as slightly error prone and I'd 
rather go with a functionally complete solution first.

bq. It's in particular not true that fixing this bug will be "invalidated when 
filtering is applied"

I disagree here: if filtering is applied on top of index results, you'll still 
get wrong results, which is confusing to me (as a user). I understand filtering 
is also orthogonal, so what about fixing filtering (that is, moving to 
coordinator-side filtering) only when indexes are present?

bq. That [fixing other index implementations] I agree is something we should 
consider. Though tbh, I have doubts we can have a solution that is completely 
index agnostic. 

Of course. But we can still provide some API (i.e. the {{isSatisfiedBy()}} you 
mentioned) they can leverage. And if we do this kind of work on the 
SASI-enabled branches, we'll have two different index implementations to test 
the goodness of our API.

bq. One thing that hasn't been mentioned is that the fix has impact on 
upgrades. Namely, in a mixed cluster, some replica will start to return invalid 
results and if the coordinator isn't upgraded yet, it won't filter those, which 
means we'll return invalid entries.

Excellent point! And definitely something to avoid.

bq. That does mean we should consider starting to filter entries on index 
queries coordinator-side in 3.0/3.11 (even though we never return them), and 
only do the replica-side parts in 4.0, with a fat warning that you need to only 
upgrade to 4.0 from a 3.X version that has the coordinator-side fix.

Mmm ... clunky. And error prone, as the 3.X code would probably be 
untestable. Couldn't the replica detect the coordinator version and return 
results accordingly?

bq. Worth noting that this doesn't entirely fly for index using custom indexes: 
we'd need to have them implement the CustomExpression#isSatistiedBy method in 
3.X in that scheme since we need it for the coordinator-side filtering as well, 
but making that method abstract in 3.X is, as said above, a breaking change.

I'm not sure I get why you _have to_ make that abstract: I think it's fine to 
leave it as it is and warn users that they'll have to override it on upgrade if 
they want consistent results. And for those implementations that can't 
implement it, we should maybe add an {{isConsistent}} predicate to disable 
"consistent filtering" altogether.

> 2ndary indexes can return stale data
> 
>
> Key: CASSANDRA-8272
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8272
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sylvain Lebresne
>Assignee: Andrés de la Peña
> Fix For: 3.0.x
>
>
> When replicas return 2ndary index results, it's possible for a single
> replica to return a stale result that will be sent back to the user,
> potentially violating the CL contract.
> For instance, consider 3 replicas A, B and C, and the following situation:
> {noformat}
> CREATE TABLE test (k int PRIMARY KEY, v text);
> CREATE INDEX ON test(v);
> INSERT INTO test(k, v) VALUES (0, 'foo');
> {noformat}
> with every replica up to date. Now, suppose that the following queries are 
> done at {{QUORUM}}:
> {noformat}
> UPDATE test SET v = 'bar' WHERE k = 0;
> SELECT * FROM test WHERE v = 'foo';
> {noformat}
> then, if A and B acknowledge the update but C responds to the read before
> having applied it, the now-stale result will be returned (since C will return
> it and A or B will return nothing).
> A potential 

[jira] [Commented] (CASSANDRA-8272) 2ndary indexes can return stale data

2017-05-11 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16006297#comment-16006297
 ] 

Sergio Bossa commented on CASSANDRA-8272:
-

One more thing I'm realizing we should fix is paging, which seems broken in a 
way similar to limiting: that is, if the page size is less than the number of 
mismatched rows, we might end up going through "empty" pages until we get to 
the valid results.

> 2ndary indexes can return stale data
> 
>
> Key: CASSANDRA-8272
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8272
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sylvain Lebresne
>Assignee: Andrés de la Peña
> Fix For: 3.0.x
>
>
> When replicas return 2ndary index results, it's possible for a single
> replica to return a stale result that will be sent back to the user,
> potentially violating the CL contract.
> For instance, consider 3 replicas A, B and C, and the following situation:
> {noformat}
> CREATE TABLE test (k int PRIMARY KEY, v text);
> CREATE INDEX ON test(v);
> INSERT INTO test(k, v) VALUES (0, 'foo');
> {noformat}
> with every replica up to date. Now, suppose that the following queries are 
> done at {{QUORUM}}:
> {noformat}
> UPDATE test SET v = 'bar' WHERE k = 0;
> SELECT * FROM test WHERE v = 'foo';
> {noformat}
> then, if A and B acknowledge the update but C responds to the read before
> having applied it, the now-stale result will be returned (since C will return
> it and A or B will return nothing).
> A potential solution would be that, when we read a tombstone in the index
> (provided we make the index inherit the gcGrace of its parent CF), instead of
> skipping that tombstone, we'd insert a corresponding range tombstone in the
> result.






[jira] [Commented] (CASSANDRA-8272) 2ndary indexes can return stale data

2017-05-11 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16006245#comment-16006245
 ] 

Sergio Bossa commented on CASSANDRA-8272:
-

[~adelapena], I gave a first review pass and the approach looks sensible, so +1 
on that.

Unfortunately, the problem is actually quite subtle and there are at least a 
couple cases where it doesn't fully work.

First of all, when a {{LIMIT}} clause is provided, the query might return no 
results when there actually are some valid ones: this is because the rows 
returned as a result of an "index mismatch" are still counted against the limit 
(by {{CQLCounter}}), which means the coordinator might end up with fewer valid 
rows than the requested limit, simply because some replicas returned only 
mismatched rows. Here's a simple scenario with two nodes:
1) Write row {{key=1,index=1}}.
2) Write row {{key=2,index=1}}.
3) Shut down node 2.
4) Delete column {{index}} from row {{key=1}}: the delete will go to node 1, 
while node 2 will miss it.
5) Restart node 2 (hints need to be disabled).
6) Query for {{index=1}}.
7) Node 1 will return the first row found, i.e. the "mismatched" one {{key=1}}.
8) Node 2 will return the "missed delete" with {{key=1}}.
9) The coordinator will merge/post-process the rows, realize there's a mismatch 
and return no results, while it should instead have returned {{key=2}}.

Second, this patch doesn't fix filtering; while it's true we have a different 
issue for that ({{CASSANDRA-8273}}), and while we could argue filtering isn't 
exactly a form of indexing, it is still used in conjunction with indexing, and 
fixing indexing just to have its results invalidated when filtering is applied 
seems quite confusing to me.

In the end, I'd suggest the following:
1) Stick with the current approach! It's good and I do not think using special 
tombstones would buy us anything.
2) Fix the first problem above (see the sketch below).
3) Generalize the approach so we can fix filtering and any other indexing 
implementation (most notably SASI).
4) To ease the burden of porting between versions, and given this is not a 
trivial bug fix at all, I'd also suggest applying it only to 3.11 onwards.

Thoughts?
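
For point 2, a sketch of the counting fix with generic placeholders (not the 
actual {{CQLCounter}} code): rows discarded as stale mismatches must not 
consume the user limit.

{noformat}
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch only: collect up to 'limit' rows, skipping stale index mismatches
// without counting them against the limit.
final class LimitCollector
{
    static <T> List<T> collectValid(Iterable<T> merged, Predicate<T> stillMatches, int limit)
    {
        List<T> out = new ArrayList<>();
        for (T row : merged)
        {
            if (!stillMatches.test(row))
                continue; // stale entry: skip without consuming the limit
            out.add(row);
            if (out.size() == limit)
                break;
        }
        return out;
    }
}
{noformat}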

> 2ndary indexes can return stale data
> 
>
> Key: CASSANDRA-8272
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8272
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sylvain Lebresne
>Assignee: Andrés de la Peña
> Fix For: 3.0.x
>
>
> When replicas return 2ndary index results, it's possible for a single
> replica to return a stale result that will be sent back to the user,
> potentially violating the CL contract.
> For instance, consider 3 replicas A, B and C, and the following situation:
> {noformat}
> CREATE TABLE test (k int PRIMARY KEY, v text);
> CREATE INDEX ON test(v);
> INSERT INTO test(k, v) VALUES (0, 'foo');
> {noformat}
> with every replica up to date. Now, suppose that the following queries are 
> done at {{QUORUM}}:
> {noformat}
> UPDATE test SET v = 'bar' WHERE k = 0;
> SELECT * FROM test WHERE v = 'foo';
> {noformat}
> then, if A and B acknowledge the update but C responds to the read before
> having applied it, the now-stale result will be returned (since C will return
> it and A or B will return nothing).
> A potential solution would be that, when we read a tombstone in the index
> (provided we make the index inherit the gcGrace of its parent CF), instead of
> skipping that tombstone, we'd insert a corresponding range tombstone in the
> result.






[jira] [Updated] (CASSANDRA-8272) 2ndary indexes can return stale data

2017-05-09 Thread Sergio Bossa (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-8272:

Reviewer: Sergio Bossa

> 2ndary indexes can return stale data
> 
>
> Key: CASSANDRA-8272
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8272
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sylvain Lebresne
>Assignee: Andrés de la Peña
> Fix For: 3.0.x
>
>
> When replicas return 2ndary index results, it's possible for a single replica 
> to return a stale result, and that result will be sent back to the user, 
> potentially failing the CL contract.
> For instance, consider 3 replicas A, B and C, and the following situation:
> {noformat}
> CREATE TABLE test (k int PRIMARY KEY, v text);
> CREATE INDEX ON test(v);
> INSERT INTO test(k, v) VALUES (0, 'foo');
> {noformat}
> with every replica up to date. Now, suppose that the following queries are 
> done at {{QUORUM}}:
> {noformat}
> UPDATE test SET v = 'bar' WHERE k = 0;
> SELECT * FROM test WHERE v = 'foo';
> {noformat}
> then, if A and B acknowledge the update but C responds to the read before 
> having applied it, the now stale result will be returned (since C will 
> return it and A or B will return nothing).
> A potential solution would be that when we read a tombstone in the index (and 
> provided we make the index inherit the gcGrace of its parent CF), instead of 
> skipping that tombstone, we'd insert a corresponding range tombstone in the 
> result.






[jira] [Commented] (CASSANDRA-13225) Best Consistency Level

2017-02-16 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870593#comment-15870593
 ] 

Sergio Bossa commented on CASSANDRA-13225:
--

bq. Does the back-pressure strategy set at SLOW and a consistency level of 
Quorum mean that while we will only wait for 3 replicas out of 5 for the write 
to succeed we will overall be limiting the number of write requests to all 
replicas based on the slowest responses we are getting from the cluster?

Yes. Specifically, the rate limiting is computed based on the ratio between 
sent messages and received responses.

bq. Does this then mean that a large set of individual writes will be rate 
limited by the slowest write responses and we will in effect be waiting for all 
replicas even though each individual write is only waiting for Quorum replicas?

Kinda. Keep in mind we're still not actually *waiting* for all replicas, just 
rate limiting (i.e. pausing) based on the feedback loop between messages sent 
and responses received.
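
For illustration only, here's a rough Java sketch of that feedback loop; the 
names are hypothetical, and this is not the actual {{RateBasedBackPressure}} 
implementation:
{noformat}
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a SLOW-style feedback loop: outgoing writes are
// paused in proportion to how far responses lag behind sent messages.
class BackPressureSketch
{
    private final AtomicLong sent = new AtomicLong();
    private final AtomicLong received = new AtomicLong();

    void onSend() { sent.incrementAndGet(); }
    void onResponse() { received.incrementAndGet(); }

    // The pause grows as the received/sent ratio drops below 1.0, i.e. as
    // the slowest replicas fall behind.
    long pauseMillis(long maxPauseMillis)
    {
        long s = sent.get();
        if (s == 0)
            return 0;
        double ratio = (double) received.get() / s; // 1.0 means keeping up
        return ratio >= 1.0 ? 0 : (long) (maxPauseMillis * (1.0 - ratio));
    }
}
{noformat}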

> Best Consistency Level
> --
>
> Key: CASSANDRA-13225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13225
> Project: Cassandra
>  Issue Type: New Feature
>Reporter: Connor Warrington
>Priority: Minor
>
> When writing data into a cluster there are a few consistency levels to choose 
> from. When choosing the consistency level to write with you are making a 
> tradeoff between consistency and failover availability. If you choose 
> consistency level ALL then all replicas have to be up and when a write 
> succeeds all replicas received the write. If you choose consistency level 
> QUORUM then a quorum number of replicas have to be up and when a write 
> succeeds a quorum number of replicas received the write. The tradeoff comes 
> in when there are more than quorum nodes available for the write. We would 
> like a write to succeed only when all replicas that are up have received the 
> write. Hence the suggestion of best as a consistency level. This would be 
> available for the existing consistency levels. The main idea behind this 
> feature request is that we are okay with a replica going down (fault 
> tolerance) but when the cluster is in a good state we don't mind waiting for 
> all nodes to get the write. This would enable the writer to operate at the 
> speed of the slowest node instead of potentially getting into a state where 
> that slow node gets even further behind. This would also enable back pressure 
> to be better propagated through the system, as the slowest node likely has 
> back pressure it is trying to tell the client about, but if we don't wait for 
> that node the writer loses that information.
> Example scenarios:
> If we have replication factor of 3: 
> ALL consistency means 3 replicas have to be up and 3 replicas have to 
> successfully get the write. 
> QUORUM consistency means 2 replicas have to be up and 2 replicas have to 
> successfully get the write. 
> BEST_QUORUM consistency means 2 replicas have to be up and all up replicas 
> have to successfully get the write.
> If 3 replicas are up with replication factor of 3: 
> ALL would succeed as all 3 replicas are up and would return success when all 
> 3 replicas get the write 
> QUORUM would succeed as all 3 replicas are up and would return success when 2 
> replicas get the write 
> BEST_QUORUM would succeed as all 3 replicas are up and would return success 
> when all 3 replicas get the write
> If 2 replicas are up with replication factor of 3: 
> ALL would fail as only 2 replicas are up 
> QUORUM would succeed as 2 replicas are up and would return success when 2 
> replicas get the write 
> BEST_QUORUM would succeed as 2 replicas are up and would return success when 
> 2 replicas get the write





[jira] [Commented] (CASSANDRA-13225) Best Consistency Level

2017-02-16 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869855#comment-15869855
 ] 

Sergio Bossa commented on CASSANDRA-13225:
--

bq. If what you are worried about is overloading nodes and want to get all 3 
acks back to slow down the writes, then you might want to take a look at using 
a back_pressure_strategy; this is exactly the case it was implemented for: 
reducing the number of dropped mutations when all nodes are up and there is a 
constant stream of writes coming in.

[~Connor Warrington], I fully second [~jjordan]'s suggestion here.

The main problem with a "variable" CL is that under failure conditions it 
doesn't really give you any guarantees, as it will succeed whether quorum or 
"more than quorum" is met. Hence, it's practically equal to QUORUM in terms of 
guarantees.
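
To make the comparison concrete, here's a hypothetical Java sketch of the acks 
each level would require ({{BEST_QUORUM}} is the proposed level, not an 
existing one):
{noformat}
// Hypothetical sketch: acks required for a write to succeed, where `up` is
// the number of live replicas (assuming up >= quorum; below that, QUORUM
// and BEST_QUORUM both fail outright).
static int requiredAcks(String cl, int rf, int up)
{
    int quorum = rf / 2 + 1;
    switch (cl)
    {
        case "ALL":         return rf;                   // needs up == rf
        case "QUORUM":      return quorum;
        case "BEST_QUORUM": return Math.max(quorum, up); // all live replicas
        default:            throw new IllegalArgumentException(cl);
    }
}
// Under failures (up == quorum), BEST_QUORUM degrades to exactly QUORUM,
// which is why it offers no stronger guarantee than QUORUM itself.
{noformat}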

Configuring the back-pressure strategy at SLOW will probably be a better way to 
reduce the number of dropped mutations *and* allow hints to readily fix the 
ones you'll still get.

> Best Consistency Level
> --
>
> Key: CASSANDRA-13225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13225
> Project: Cassandra
>  Issue Type: New Feature
>Reporter: Connor Warrington
>Priority: Minor
>
> When writing data into a cluster there are a few consistency levels to choose 
> from. When choosing the consistency level to write with you are making a 
> tradeoff between consistency and failover availability. If you choose 
> consistency level ALL then all replicas have to be up and when a write 
> succeeds all replicas received the write. If you choose consistency level 
> QUORUM then a quorum number of replicas have to be up and when a write 
> succeeds a quorum number of replicas received the write. The tradeoff comes 
> in when there are more than quorum nodes available for the write. We would 
> like a write to succeed only when all replicas that are up have received the 
> write. Hence the suggestion of best as a consistency level. This would be 
> available for the existing consistency levels. The main idea behind this 
> feature request is that we are okay with a replica going down (fault 
> tolerance) but when the cluster is in a good state we don't mind waiting for 
> all nodes to get the write. This would enable the writer to operate at the 
> speed of the slowest node instead of potentially getting into a state where 
> that slow node gets even further behind. This would also enable back pressure 
> to be better propagated through the system, as the slowest node likely has 
> back pressure it is trying to tell the client about, but if we don't wait for 
> that node the writer loses that information.
> Example scenarios:
> If we have replication factor of 3: 
> ALL consistency means 3 replicas have to be up and 3 replicas have to 
> successfully get the write. 
> QUORUM consistency means 2 replicas have to be up and 2 replicas have to 
> successfully get the write. 
> BEST_QUORUM consistency means 2 replicas have to be up and all up replicas 
> have to successfully get the write.
> If 3 replicas are up with replication factor of 3: 
> ALL would succeed as all 3 replicas are up and would return success when all 
> 3 replicas get the write 
> QUORUM would succeed as all 3 replicas are up and would return success when 2 
> replicas get the write 
> BEST_QUORUM would succeed as all 3 replicas are up and would return success 
> when all 3 replicas get the write
> If 2 replicas are up with replication factor of 3: 
> ALL would fail as only 2 replicas are up 
> QUORUM would succeed as 2 replicas are up and would return success when 2 
> replicas get the write 
> BEST_QUORUM would succeed as 2 replicas are up and would return success when 
> 2 replicas get the write





[jira] [Commented] (CASSANDRA-13075) Indexer is not correctly invoked when building indexes over sstables

2017-01-19 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15830158#comment-15830158
 ] 

Sergio Bossa commented on CASSANDRA-13075:
--

+1

> Indexer is not correctly invoked when building indexes over sstables
> 
>
> Key: CASSANDRA-13075
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13075
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sergio Bossa
>Assignee: Alex Petrov
>Priority: Critical
> Attachments: CustomIndexTest.java
>
>
> Following CASSANDRA-12796, {{SecondaryIndexManager#indexPartition()}} calls 
> each {{Indexer}}'s {{begin}} and {{finish}} methods multiple times per 
> partition (depending on the page size), as 
> {{PartitionIterators#getOnlyElement()}} returns an empty partition even when 
> the iterator is exhausted.
> This leads to bugs for {{Indexer}} implementations doing actual work in those 
> methods but, even worse, it feeds the {{Indexer}} an empty partition 
> containing only a non-live partition deletion as input, while the 
> {{Indexer#partitionDelete()}} method is *not* actually called.
> My proposed solution:
> 1) Stop the iteration before the empty partition is returned and ingested 
> into the {{Indexer}}.
> 2) Actually call the {{Indexer#partitionDelete()}} method inside 
> {{SecondaryIndexManager#indexPartition()}} (which requires using a filtered 
> iterator so it actually contains the deletion info).





[jira] [Commented] (CASSANDRA-13075) Indexer is not correctly invoked when building indexes over sstables

2017-01-18 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828347#comment-15828347
 ] 

Sergio Bossa commented on CASSANDRA-13075:
--

Looks better now [~ifesdjeen]!

Just a few more points:
1) As you mentioned, the patch is still missing a test with range tombstones crossing pages.
2) 
[This|https://github.com/ifesdjeen/cassandra/commit/7aa0c7c41960ed11b35545c49ca7e8f869f76e2f#diff-13cb97758bcb11cce8fc6f4cb1990dd6R729]
 should use {{pageSize}}.
3) 
[This|https://github.com/ifesdjeen/cassandra/commit/7aa0c7c41960ed11b35545c49ca7e8f869f76e2f#diff-13cb97758bcb11cce8fc6f4cb1990dd6R785]
 should probably be replaced with {{StubIndex}} (I've used that other one just 
as a quick hack).

> Indexer is not correctly invoked when building indexes over sstables
> 
>
> Key: CASSANDRA-13075
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13075
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sergio Bossa
>Assignee: Alex Petrov
>Priority: Critical
> Attachments: CustomIndexTest.java
>
>
> Following CASSANDRA-12796, {{SecondaryIndexManager#indexPartition()}} calls 
> each {{Indexer}}'s {{begin}} and {{finish}} methods multiple times per 
> partition (depending on the page size), as 
> {{PartitionIterators#getOnlyElement()}} returns an empty partition even when 
> the iterator is exhausted.
> This leads to bugs for {{Indexer}} implementations doing actual work in those 
> methods but, even worse, it feeds the {{Indexer}} an empty partition 
> containing only a non-live partition deletion as input, while the 
> {{Indexer#partitionDelete()}} method is *not* actually called.
> My proposed solution:
> 1) Stop the iteration before the empty partition is returned and ingested 
> into the {{Indexer}}.
> 2) Actually call the {{Indexer#partitionDelete()}} method inside 
> {{SecondaryIndexManager#indexPartition()}} (which requires using a filtered 
> iterator so it actually contains the deletion info).





[jira] [Commented] (CASSANDRA-13075) Indexer is not correctly invoked when building indexes over sstables

2017-01-17 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826423#comment-15826423
 ] 

Sergio Bossa commented on CASSANDRA-13075:
--

[~ifesdjeen],

I believe there are a few problems with your patch:

* 
[Passing|https://github.com/ifesdjeen/cassandra/commit/803d3c6ea6e6852103204c7d0bb46001be191ab2#diff-3f2c8994c4ff8748c3faf7e70958520dR550]
 the partition columns for the metadata definition is subtly different from 
passing the columns for the read partition, as the latter could be a subset of 
the former (at least according to javadocs and a brief code inspection); I'm 
not sure how much it matters in practice, but that could potentially lead to 
unwanted index calls.
* 
[Processing|https://github.com/ifesdjeen/cassandra/commit/803d3c6ea6e6852103204c7d0bb46001be191ab2#diff-3f2c8994c4ff8748c3faf7e70958520dR567]
 the collected range tombstones inside the loop, piggybacking on the fact that 
the loop itself will do one more iteration after exhaustion, doesn't actually 
work; that is, range tombstones seem never to be processed. This can easily be 
verified by checking for range tombstones 
[here|https://github.com/ifesdjeen/cassandra/commit/803d3c6ea6e6852103204c7d0bb46001be191ab2#diff-13cb97758bcb11cce8fc6f4cb1990dd6R754]
 (as this test _should_ check for range tombstones, unless I'm missing 
something?).
* Even if the last point worked, I believe it would lead to duplicate 
{{begin}} and {{finish}} calls for range tombstones.
* 
[Unused|https://github.com/ifesdjeen/cassandra/commit/803d3c6ea6e6852103204c7d0bb46001be191ab2#diff-71513a85468b7cbc97f1e820b06a20a8R125]?

> Indexer is not correctly invoked when building indexes over sstables
> 
>
> Key: CASSANDRA-13075
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13075
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sergio Bossa
>Assignee: Alex Petrov
>Priority: Critical
> Attachments: CustomIndexTest.java
>
>
> Following CASSANDRA-12796, {{SecondaryIndexManager#indexPartition()}} calls 
> each {{Indexer}}'s {{begin}} and {{finish}} methods multiple times per 
> partition (depending on the page size), as 
> {{PartitionIterators#getOnlyElement()}} returns an empty partition even when 
> the iterator is exhausted.
> This leads to bugs for {{Indexer}} implementations doing actual work in those 
> methods but, even worse, it feeds the {{Indexer}} an empty partition 
> containing only a non-live partition deletion as input, while the 
> {{Indexer#partitionDelete()}} method is *not* actually called.
> My proposed solution:
> 1) Stop the iteration before the empty partition is returned and ingested 
> into the {{Indexer}}.
> 2) Actually call the {{Indexer#partitionDelete()}} method inside 
> {{SecondaryIndexManager#indexPartition()}} (which requires using a filtered 
> iterator so it actually contains the deletion info).





[jira] [Updated] (CASSANDRA-13075) Indexer is not correctly invoked when building indexes over sstables

2017-01-09 Thread Sergio Bossa (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-13075:
-
Attachment: CustomIndexTest.java

[~ifesdjeen], please see the attached {{CustomIndexTest.java}}: it adds a 
quickly hacked {{partitionIsNotOverIndexed}} test showing how the {{finish}} 
method is invoked twice even when only a single partition with a single row is 
present.
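
For reference, here's a stripped-down sketch of what the test demonstrates; 
the types are hypothetical, while the real test drives 
{{SecondaryIndexManager#indexPartition()}} through a custom index:
{noformat}
import java.util.Iterator;
import java.util.List;

// Hypothetical reproduction: a paging loop that brackets every page with
// begin()/finish() ends up calling finish() twice for a one-row partition,
// because the pager yields a trailing empty page once the iterator is
// exhausted.
public class OverIndexedSketch
{
    public static void main(String[] args)
    {
        Iterator<List<String>> pages =
                List.of(List.of("row1"), List.<String>of()).iterator();
        int finishes = 0;
        while (pages.hasNext())
        {
            List<String> page = pages.next();
            finishes++; // one begin()/finish() pair per page: the bug
            System.out.println("page " + page + " -> finish() call #" + finishes);
        }
        System.out.println("expected 1 finish() call, got " + finishes);
    }
}
{noformat}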

> Indexer is not correctly invoked when building indexes over sstables
> 
>
> Key: CASSANDRA-13075
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13075
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sergio Bossa
>Assignee: Alex Petrov
>Priority: Critical
> Attachments: CustomIndexTest.java
>
>
> Following CASSANDRA-12796, {{SecondaryIndexManager#indexPartition()}} calls 
> each {{Indexer}}'s {{begin}} and {{finish}} methods multiple times per 
> partition (depending on the page size), as 
> {{PartitionIterators#getOnlyElement()}} returns an empty partition even when 
> the iterator is exhausted.
> This leads to bugs for {{Indexer}} implementations doing actual work in those 
> methods but, even worse, it feeds the {{Indexer}} an empty partition 
> containing only a non-live partition deletion as input, while the 
> {{Indexer#partitionDelete()}} method is *not* actually called.
> My proposed solution:
> 1) Stop the iteration before the empty partition is returned and ingested 
> into the {{Indexer}}.
> 2) Actually call the {{Indexer#partitionDelete()}} method inside 
> {{SecondaryIndexManager#indexPartition()}} (which requires using a filtered 
> iterator so it actually contains the deletion info).





[jira] [Commented] (CASSANDRA-13075) Indexer is not correctly invoked when building indexes over sstables

2017-01-09 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15811740#comment-15811740
 ] 

Sergio Bossa commented on CASSANDRA-13075:
--

[~ifesdjeen],

my description above might have been misleading, as I do not think the actual 
problem is calling {{begin}} and {{finish}} just once: as long as they are 
called together, it will be fine, even if they're called multiple times for the 
same partition *but different subsets of rows*.

The real problem to me is that they are called an additional time even when the 
partition is exhausted. The solution you proposed could work on paper in terms 
of avoiding those methods being called multiple times (at all), even if, 
practically speaking, it could be troublesome; but it only addresses the first 
part of the problem: what about calling {{Indexer#partitionDelete()}}?

> Indexer is not correctly invoked when building indexes over sstables
> 
>
> Key: CASSANDRA-13075
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13075
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sergio Bossa
>Assignee: Alex Petrov
>Priority: Critical
>
> Following CASSANDRA-12796, {{SecondaryIndexManager#indexPartition()}} calls 
> each {{Indexer}}'s {{begin}} and {{finish}} methods multiple times per 
> partition (depending on the page size), as 
> {{PartitionIterators#getOnlyElement()}} returns an empty partition even when 
> the iterator is exhausted.
> This leads to bugs for {{Indexer}} implementations doing actual work in those 
> methods but, even worse, it feeds the {{Indexer}} an empty partition 
> containing only a non-live partition deletion as input, while the 
> {{Indexer#partitionDelete()}} method is *not* actually called.
> My proposed solution:
> 1) Stop the iteration before the empty partition is returned and ingested 
> into the {{Indexer}}.
> 2) Actually call the {{Indexer#partitionDelete()}} method inside 
> {{SecondaryIndexManager#indexPartition()}} (which requires using a filtered 
> iterator so it actually contains the deletion info).





[jira] [Created] (CASSANDRA-13075) Indexer is not correctly invoked when building indexes over sstables

2016-12-22 Thread Sergio Bossa (JIRA)
Sergio Bossa created CASSANDRA-13075:


 Summary: Indexer is not correctly invoked when building indexes 
over sstables
 Key: CASSANDRA-13075
 URL: https://issues.apache.org/jira/browse/CASSANDRA-13075
 Project: Cassandra
  Issue Type: Bug
Reporter: Sergio Bossa
Priority: Critical


Following CASSANDRA-12796, {{SecondaryIndexManager#indexPartition()}} calls 
each {{Indexer}}'s {{begin}} and {{finish}} methods multiple times per partition 
(depending on the page size), as {{PartitionIterators#getOnlyElement()}} 
returns an empty partition even when the iterator is exhausted.

This leads to bugs for {{Indexer}} implementations doing actual work in those 
methods but, even worse, it feeds the {{Indexer}} an empty partition containing 
only a non-live partition deletion as input, while the 
{{Indexer#partitionDelete()}} method is *not* actually called.

My proposed solution:
1) Stop the iteration before the empty partition is returned and ingested into 
the {{Indexer}}.
2) Actually call the {{Indexer#partitionDelete()}} method inside 
{{SecondaryIndexManager#indexPartition()}} (which requires using a filtered 
iterator so it actually contains the deletion info).
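
As a rough sketch of the proposed flow (hypothetical types, not the actual 
Cassandra API):
{noformat}
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins for the real paging/indexing machinery.
interface Indexer
{
    void begin();
    void partitionDelete();      // forwarded deletion info (point 2)
    void insertRow(String row);
    void finish();
}

class IndexPartitionSketch
{
    // begin()/finish() bracket the partition exactly once, the trailing
    // empty page is never fed to the indexer (point 1), and a non-live
    // partition deletion is forwarded explicitly (point 2).
    static void indexPartition(Iterator<List<String>> pages,
                               boolean partitionDeleted,
                               Indexer indexer)
    {
        indexer.begin();
        if (partitionDeleted)
            indexer.partitionDelete();
        while (pages.hasNext())
        {
            List<String> page = pages.next();
            if (page.isEmpty())
                break; // iterator exhausted: stop before the empty page
            page.forEach(indexer::insertRow);
        }
        indexer.finish();
    }
}
{noformat}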





[jira] [Commented] (CASSANDRA-12836) Set JOINING mode before executing the pre-join index callback

2016-10-27 Thread Sergio Bossa (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15611989#comment-15611989
 ] 

Sergio Bossa commented on CASSANDRA-12836:
--

Thanks!

> Set JOINING mode before executing the pre-join index callback
> -
>
> Key: CASSANDRA-12836
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12836
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Sergio Bossa
>Assignee: Sergio Bossa
> Fix For: 3.10
>
>
> The pre-join index callback introduced by CASSANDRA-12039 might perform 
> long-running tasks, and given it's blocking, it would be good to set the node 
> in JOINING mode, which currently only happens in case of bootstrap, so that:
> 1) The mode can be properly read by tools.
> 2) Async callback implementations can read the mode themselves to verify at 
> which point of the startup process they are executing.





[jira] [Updated] (CASSANDRA-12836) Set JOINING mode before executing the pre-join index callback

2016-10-25 Thread Sergio Bossa (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Bossa updated CASSANDRA-12836:
-
Issue Type: Improvement  (was: Bug)

> Set JOINING mode before executing the pre-join index callback
> -
>
> Key: CASSANDRA-12836
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12836
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Sergio Bossa
>Assignee: Sergio Bossa
>
> The pre-join index callback introduced by CASSANDRA-12039 might perform 
> long-running tasks, and given it's blocking, it would be good to set the node 
> in JOINING mode, which currently only happens in case of bootstrap, so that:
> 1) The mode can be properly read by tools.
> 2) Async callback implementations can read the mode themselves to verify at 
> which point of the startup process they are executing.




