[jira] [Comment Edited] (CASSANDRA-5483) Repair tracing

2014-10-22 Thread Ben Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180236#comment-14180236
 ] 

Ben Chan edited comment on CASSANDRA-5483 at 10/22/14 5:53 PM:
---

I'm not planning on making any more changes pre-review. Feel free to go ahead.

(Though it would have been fine even if you had started prior to 5483-v17-00 
and 5483-v17-01; the differences are not substantial. Edit: even if they had 
been substantial, you were well within your rights to start reviewing as soon 
as I posted up the initial patch.)



was (Author: usrbincc):
I'm not planning on making any more changes pre-review. Feel free to go ahead.

(Though it would have been fine even if you had started prior to 5483-v17-00 
and 5483-v17-01; the differences are not substantial)


 Repair tracing
 --

 Key: CASSANDRA-5483
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5483
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Yuki Morishita
Assignee: Ben Chan
Priority: Minor
  Labels: repair
 Fix For: 3.0

 Attachments: 5483-full-trunk.txt, 
 5483-v06-04-Allow-tracing-ttl-to-be-configured.patch, 
 5483-v06-05-Add-a-command-column-to-system_traces.events.patch, 
 5483-v06-06-Fix-interruption-in-tracestate-propagation.patch, 
 5483-v07-07-Better-constructor-parameters-for-DebuggableThreadPoolExecutor.patch,
  5483-v07-08-Fix-brace-style.patch, 
 5483-v07-09-Add-trace-option-to-a-more-complete-set-of-repair-functions.patch,
  5483-v07-10-Correct-name-of-boolean-repairedAt-to-fullRepair.patch, 
 5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch, 
 5483-v08-12-Trace-streaming-in-Differencer-StreamingRepairTask.patch, 
 5483-v08-13-sendNotification-of-local-traces-back-to-nodetool.patch, 
 5483-v08-14-Poll-system_traces.events.patch, 
 5483-v08-15-Limit-trace-notifications.-Add-exponential-backoff.patch, 
 5483-v09-16-Fix-hang-caused-by-incorrect-exit-code.patch, 
 5483-v10-17-minor-bugfixes-and-changes.patch, 
 5483-v10-rebased-and-squashed-471f5cc.patch, 5483-v11-01-squashed.patch, 
 5483-v11-squashed-nits.patch, 5483-v12-02-cassandra-yaml-ttl-doc.patch, 
 5483-v13-608fb03-May-14-trace-formatting-changes.patch, 
 5483-v14-01-squashed.patch, 
 5483-v15-02-Hook-up-exponential-backoff-functionality.patch, 
 5483-v15-03-Exact-doubling-for-exponential-backoff.patch, 
 5483-v15-04-Re-add-old-StorageService-JMX-signatures.patch, 
 5483-v15-05-Move-command-column-to-system_traces.sessions.patch, 
 5483-v15.patch, 5483-v17-00.patch, 5483-v17-01.patch, 5483-v17.patch, 
 ccm-repair-test, cqlsh-left-justify-text-columns.patch, 
 prerepair-vs-postbuggedrepair.diff, test-5483-system_traces-events.txt, 
 trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch, 
 trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch,
  tr...@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt, 
 tr...@8ebeee1-5483-v01-002-simple-repair-tracing.txt, 
 v02p02-5483-v03-0003-Make-repair-tracing-controllable-via-nodetool.patch, 
 v02p02-5483-v04-0003-This-time-use-an-EnumSet-to-pass-boolean-repair-options.patch,
  v02p02-5483-v05-0003-Use-long-instead-of-EnumSet-to-work-with-JMX.patch


 I think it would be nice to log repair stats and results the way query tracing 
 stores traces to the system keyspace. With it, you don't have to look up each 
 log file to see what the status was and how the repair you invoked performed. 
 Instead, you can query the repair log with a session ID to see the state and 
 stats of all nodes involved in that repair session.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-5483) Repair tracing

2014-06-11 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027754#comment-14027754
 ] 

Benedict edited comment on CASSANDRA-5483 at 6/11/14 1:34 PM:
--

Agreed. Uploaded a patch that has identical behaviour, but is a bit easier to 
understand.

I should note I've made no attempt to corroborate that this behaviour is 
sensible; I've only simplified it :)


was (Author: benedict):
Agreed. Uploaded a patch that has identical behaviour, but is a bit easier to 
understand.



[jira] [Comment Edited] (CASSANDRA-5483) Repair tracing

2014-06-11 Thread Ben Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027905#comment-14027905
 ] 

Ben Chan edited comment on CASSANDRA-5483 at 6/11/14 5:37 PM:
--

I'm sorry I can't think this through very deeply at the moment, so please allow 
a little slack if I say something incorrect. I'm writing this during a small 
window of time (where I can't do anything else) in a crisis I'm dealing with.

bq. Why is command part of events instead of sessions?

It was a somewhat arbitrary design decision. I can move it over to sessions. 
The only real reason was historical (see the Mar 3 comment); it was a 
proof-of-concept that never got followed up on until just now.

bq. Also: should use an enum internally. Logging as string representation is 
fine.

Just to be clear, you mean Java code should work with an enum, and the actual 
cql table column is fine as a string?

The code actually does use an enum of sorts (though not a proper Enum): the 
traceType. The traceType is passed to Tracing#getCommandName to look up the 
String for the command name.

bq. It makes people grumpy when we break JMX signatures. Can we add a new 
overload instead, preserving the old? This should cut down on some of the code 
churn in StorageService as well.

I will admit that I didn't really consider the entire ecosystem of tools that 
use JMX to control Cassandra (though note that I did mention the JMX API change 
in a Mar 8 comment). The server isn't going to work with old versions of 
nodetool, and a newer nodetool still won't work with older server versions.
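The overload approach can be sketched as follows. This is a hedged illustration only: the class and method shapes below are hypothetical, and the real StorageService JMX methods have different, longer signatures.

```java
// Sketch only: preserving an old JMX method signature by adding an overload.
// Names and parameters here are hypothetical, not the real StorageService API.
public class RepairMBeanSketch {
    // new signature: adds a traceRepair flag
    public int forceRepairAsync(String keyspace, boolean isSequential, boolean traceRepair) {
        return traceRepair ? 2 : 1; // stand-in for actually kicking off the repair
    }

    // old signature kept intact so existing JMX clients (and old nodetool
    // binaries) keep working; it simply delegates with tracing off
    public int forceRepairAsync(String keyspace, boolean isSequential) {
        return forceRepairAsync(keyspace, isSequential, false);
    }
}
```

Old clients resolve to the two-argument method and see unchanged behaviour; only callers that opt in pass the new flag.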

bq. It's a minor thing to get hung up on, but I'm not wild about all the work 
needed to propagate TTLs around. Is it really super important to persist repair 
traces much longer than queries? If so, what if we used a separate table and 
just allowed users to modify the default ttl? (The original trace code predates 
default TTLs, I think, or we would have made use of it there.)

I guess the question is how many different types of things (bootstrap, repair, 
query, decommission, ... anything else?) might eventually end up being traced. 
If n is small, then having n tables may be fine.

The logic was this (see Mar 1-3 discussion): Repair can take a long time, so 24 
hours may be too short of a ttl.

I recall reading about problematic repairs taking several days, which wouldn't 
mix well with a 24 hour ttl.

bq. Also have a nagging feeling that the notify/wait logic in StorageService + 
TraceState is more complex than necessary.

If there is guaranteed to only ever be one consumer of notifications at a time, 
then the updated v15 logic seems fine. But if there are ever two threads 
polling the same TraceState, then there could be dropped notifications. The 
problem (I believe) is that all consumers may not be in a wait state at the 
moment when notifyAll is signalled. This means a notification could be missed, 
right? I'm not experienced in Java concurrency, and this isn't the best time 
for me to slowly think things through, so it's quite possible I'm wrong here.

But it does seem reasonable there will only ever be one polling thread for any 
given tracing session, so the v15 code should work fine in that respect, as far 
as my cursory inspection goes.
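The lost-notification hazard described above is the classic reason to pair {{wait}}/{{notifyAll}} with a state predicate re-checked in a loop. A toy sketch (not the actual TraceState code) where an event counter makes a signal observable even if it fires before the consumer starts waiting:

```java
// Sketch of a guarded wait: the counter, not the wakeup itself, carries the
// state, so a notifyAll issued while no thread is waiting is never lost.
public class ActivitySketch {
    private int events = 0;

    // producer side: record an event and wake any waiters
    public synchronized void signal() {
        events++;
        notifyAll();
    }

    // consumer side: wait until the count moves past lastSeen or the timeout
    // expires; the predicate is re-checked after every wakeup
    public synchronized int await(int lastSeen, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (events == lastSeen) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) break;
            wait(remaining);
        }
        return events;
    }
}
```

With multiple consumers polling the same state, each tracks its own lastSeen, so a wakeup that arrives while a consumer is busy processing is caught on its next predicate check rather than dropped.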

Note, however, that the polling in this case is only a heuristic: it's *likely* 
that an external trace happened somewhere around the notification time, plus 
or minus (as far as I know, there is no way in Cassandra to be notified of cf 
updates). The actual trace may only arrive *after* the 
notification, so for two notifications ~maxWait seconds apart, your 
logging of events might be maxWait seconds late:

{noformat}
time  action                result
----  ------                ------
   0  receive notification  no events found
  10  event A
1000  receive notification  sendNotification(event A)
{noformat}

This is why I had an exponential backoff: I wanted to poll with high 
frequency for a likely event, polling less and less often as the notification 
recedes into the past. There are, of course, endless tweaks possible to this 
basic algorithm. It depends upon what you can assume about the likely time 
distribution of the events hinted at by the notification.

{noformat}
time  action                result
----  ------                ------
   0  receive notification  no events found
   2  poll  no events found
   4  poll  no events found
   8  poll  no events found
  10  event A
  16  poll  sendNotification(event A)
  32  poll  no events found
1000  receive notification  no events found
{noformat}
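The poll schedule above can be generated with exact doubling from a base interval, capped so polling never stops entirely. A small sketch (illustrative only; the constants and method names are made up, not the patch's):

```java
// Sketch of the backoff schedule: exact doubling from a base interval,
// capped at a maximum. Constants and names are illustrative only.
public class BackoffSketch {
    // delay before poll number `attempt` (0-based) after a notification
    public static long delay(int attempt, long base, long cap) {
        long d = base;
        for (int i = 0; i < attempt && d < cap; i++) {
            d *= 2; // exact doubling; loop exits once the cap is reached
        }
        return Math.min(d, cap);
    }
}
```

For base 2 this yields the 2, 4, 8, 16, 32, ... sequence in the table, and the cap bounds how stale a late-arriving event can get.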

bq. I should note I've made no attempt to corroborate that this behaviour is 
sensible; I've only simplified it.

Any feedback would be welcome. As I've said before, heuristics are messy. 

[jira] [Comment Edited] (CASSANDRA-5483) Repair tracing

2014-05-22 Thread Ben Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998924#comment-13998924
 ] 

Ben Chan edited comment on CASSANDRA-5483 at 5/22/14 12:01 PM:
---

May 14 formatting changes in 
[^5483-v13-608fb03-May-14-trace-formatting-changes.patch] (based off of commit 
608fb03).

{quote}
I think the session log messages are still confusing, especially since we use 
the same term for repairing a subrange and streaming data.
{quote}

Currently the session terminology is baked into the source code, in 
{{StreamSession.java}} and {{RepairSession.java}}. If the messages are changed 
to reflect different terminology, hopefully the source code can eventually be 
changed to match (fewer special cases to remember). Perhaps the best thing is 
to always qualify them, e.g. "stream session" and "repair session"?

{quote}
I don't actually see the session uuid being used in the logs except at 
start/finish.
{quote}

Sorry, that was another inadvertent mixing of nodetool messages and trace 
output. {{\[2014-05-13 23:49:52,283] Repair session 
cd6aad80-db1a-11e3-b0e7-f94811c7b860 for range 
(3074457345618258602,-9223372036854775808] finished}} is not a trace, but a 
separate (pre-patch) sendNotification in {{StorageService.java}}. This message 
(and some of the error messages, I think) is redundant when combined with trace 
output. It should have been either one or the other, not both. In the trace 
proper, the session UUID only shows up at the start.

But note: not all nodetool messages are rendered redundant by trace output. 
Since we can't just suppress all non-trace sendNotification, how can we 
unambiguously tell nodetool trace output from normal sendNotification messages? 
I'm currently leaning towards just marking all sendNotification trace output 
with a {{TRACE:}} tag.

The repair session UUIDs used to be prepended to everything, but were removed 
in [^5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch]. Without 
them, things are less verbose, but it's sometimes hard to unambiguously follow 
traces for concurrent repair sessions. To make the point clearer, I've marked 
each sub-task graphically in the nodetool trace output below (I've 
cross-checked this with the logs, which do retain the UUIDs). If you cover up 
the left side, it's harder to figure out which trace goes with which sub-task. 
Real-world repair traces will probably be even more confusing.

Note: indentation here does not denote nesting; the column roughly indicates 
task identity, though I reuse columns when it's not ambiguous.

{noformat}
1 [2014-05-15 11:31:37,839] Starting repair command #1, repairing 3 
ranges for s1.users (seq=true, full=true)
  x   [2014-05-15 11:31:37,922] Syncing range 
(-3074457345618258603,3074457345618258602]
  x   [2014-05-15 11:31:38,108] Requesting merkle trees for users from 
[/127.0.0.2, /127.0.0.3, /127.0.0.1]
  x   [2014-05-15 11:31:38,833] /127.0.0.2: Sending completed merkle tree 
to /127.0.0.1 for s1.users
  x   [2014-05-15 11:31:39,953] Received merkle tree for users from 
/127.0.0.2
  x   [2014-05-15 11:31:40,939] /127.0.0.3: Sending completed merkle tree 
to /127.0.0.1 for s1.users
  x   [2014-05-15 11:31:41,279] Received merkle tree for users from 
/127.0.0.3
  x   [2014-05-15 11:31:42,632] Received merkle tree for users from 
/127.0.0.1
x [2014-05-15 11:31:42,671] Syncing range 
(-9223372036854775808,-3074457345618258603]
x [2014-05-15 11:31:42,766] Requesting merkle trees for users from 
[/127.0.0.2, /127.0.0.3, /127.0.0.1]
  x   [2014-05-15 11:31:42,905] Endpoint /127.0.0.2 is consistent with 
/127.0.0.3 for users
  x   [2014-05-15 11:31:43,044] Endpoint /127.0.0.2 is consistent with 
/127.0.0.1 for users
  x   [2014-05-15 11:31:43,047] Endpoint /127.0.0.3 is consistent with 
/127.0.0.1 for users
  x   [2014-05-15 11:31:43,084] Completed sync of range 
(-3074457345618258603,3074457345618258602]
x [2014-05-15 11:31:43,251] /127.0.0.2: Sending completed merkle tree 
to /127.0.0.1 for s1.users
x [2014-05-15 11:31:43,422] Received merkle tree for users from 
/127.0.0.2
x [2014-05-15 11:31:44,495] /127.0.0.3: Sending completed merkle tree 
to /127.0.0.1 for s1.users
x [2014-05-15 11:31:44,637] Received merkle tree for users from 
/127.0.0.3
x [2014-05-15 11:31:45,474] Received merkle tree for users from 
/127.0.0.1
  x   [2014-05-15 11:31:45,494] Syncing range 
(3074457345618258602,-9223372036854775808]
x [2014-05-15 11:31:45,499] Endpoint /127.0.0.3 is consistent with 
/127.0.0.1 for users
x [2014-05-15 11:31:45,520] Endpoint /127.0.0.2 is consistent with 
/127.0.0.1 for users
x [2014-05-15 11:31:45,544] Endpoint /127.0.0.2 is consistent with 
/127.0.0.3 for users
x [2014-05-15 11:31:45,564] Completed sync of range 

[jira] [Comment Edited] (CASSANDRA-5483) Repair tracing

2014-04-11 Thread Ben Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13966778#comment-13966778
 ] 

Ben Chan edited comment on CASSANDRA-5483 at 4/11/14 5:42 PM:
--

I made some additional changes. Everything is included in 
[^5483-v10-rebased-and-squashed-471f5cc.patch], but I attached 
[^5483-v10-17-minor-bugfixes-and-changes.patch] to make it more convenient 
to review.

-Repair fails without 
[7000-2.1-v2.txt|https://issues.apache.org/jira/secure/attachment/12639633/7000-2.1-v2.txt]
 so I've included that patch in the test code.-

Overview:
 * Limit exponential backoff.
 * Handle the case where traceType is negative.
 * Reimplement log2 in BitUtil style.
 * Forgot to add a trace parameter.
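The "BitUtil style" bullet refers to computing floor(log2) by testing progressively smaller bit groups instead of looping bit by bit. A sketch of the idea (not necessarily the patch's exact code):

```java
// Sketch of a BitUtil-style floor(log2(v)) for v >= 1: halve the search
// window over bit positions instead of shifting one bit at a time.
public class Log2Sketch {
    public static int log2(long v) {
        int r = 0;
        if ((v & 0xFFFFFFFF00000000L) != 0) { v >>>= 32; r += 32; }
        if ((v & 0xFFFF0000L) != 0) { v >>>= 16; r += 16; }
        if ((v & 0xFF00L) != 0) { v >>>= 8; r += 8; }
        if ((v & 0xF0L) != 0) { v >>>= 4; r += 4; }
        if ((v & 0xCL) != 0) { v >>>= 2; r += 2; }
        if ((v & 0x2L) != 0) { r += 1; }
        return r;
    }
}
```

Six constant-time checks cover all 64 bit positions, which is why this shape is preferred over a shift loop in hot paths.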

{noformat}
# rebased against trunk @ 471f5cc, tested against cbb3c8f
# git checkout cbb3c8f
W=https://issues.apache.org/jira/secure/attachment
for url in \
  $W/12639821/5483-v10-rebased-and-squashed-471f5cc.patch
do
  { [ -e $(basename $url) ] || curl -sO $url; } && git apply $(basename $url)
done &&
ant clean && ant &&
./ccm-repair-test -kR &&
ccm node1 stop &&
ccm node1 clear &&
ccm node1 start &&
./ccm-repair-test -rt
{noformat}

Edit: 7000 landed.



was (Author: usrbincc):
I made some additional changes. Everything is included in 
[^5483-v10-rebased-and-squashed-471f5cc.patch], but I attached 
[^5483-v10-17-minor-bugfixes-and-changes.patch] to make it more convenient 
to review.

Repair fails without 
[7000-2.1-v2.txt|https://issues.apache.org/jira/secure/attachment/12639633/7000-2.1-v2.txt]
 so I've included that patch in the test code.

Overview:
 * Limit exponential backoff.
 * Handle the case where traceType is negative.
 * Reimplement log2 in BitUtil style.
 * Forgot to add a trace parameter.

{noformat}
# rebased against trunk @ 471f5cc
# git checkout 471f5cc
W=https://issues.apache.org/jira/secure/attachment
for url in \
  $W/12639821/5483-v10-rebased-and-squashed-471f5cc.patch \
  $W/12639633/7000-2.1-v2.txt
do
  { [ -e $(basename $url) ] || curl -sO $url; } && git apply $(basename $url)
done &&
ant clean && ant &&
./ccm-repair-test -kR &&
ccm node1 stop &&
ccm node1 clear &&
ccm node1 start &&
./ccm-repair-test -rt
{noformat}




[jira] [Comment Edited] (CASSANDRA-5483) Repair tracing

2014-04-03 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959183#comment-13959183
 ] 

Jonathan Ellis edited comment on CASSANDRA-5483 at 4/3/14 8:39 PM:
---

Can you both rebase and squash [~usrbincc]?


was (Author: jbellis):
Can you both rebase and squash?



[jira] [Comment Edited] (CASSANDRA-5483) Repair tracing

2014-03-30 Thread Ben Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954746#comment-13954746
 ] 

Ben Chan edited comment on CASSANDRA-5483 at 3/30/14 5:24 PM:
--

Ouch. After running a before and after test, I'm 99% sure this was the problem. 
There was some obviously wrong code in {{waitActivity}} (an older version used 
0 instead of -1 to signify "done"; I apparently forgot to update everything 
when I changed this).

Sorry about removing the previous patch. It didn't have the correct {{git diff 
-p}} parameters.

For convenience:

{noformat}
W=https://issues.apache.org/jira/secure/attachment
for url in \
  $W/12637720/5483-v09-16-Fix-hang-caused-by-incorrect-exit-code.patch
do [ -e $(basename $url) ] || curl -sO $url; done &&
git apply 5483-v09-*.patch &&
ant clean && ant
{noformat}

Here's what I used to test with; I get slower and slower repairs, with a hang 
on the 5th repair with the before code, and consistent 10-second repairs with 
the after code.

{noformat}
cat > ccm-nodetool <<"EE"
#!/bin/sh

# ccm doesn't let us call nodetool with options, but we still need to get the
# host and port config from it.
read -r JMXGET <<E
/jmx_port/{p=\$2;} \
/binary/{split(\$2,a,/\047/);h=a[2];} \
END{printf("bin/nodetool -h %s -p %s\n",h,p);}
E

NODETOOL=$(ccm $1 show | awk -F= "$JMXGET")
shift
$NODETOOL "$@"
EE

chmod +x ccm-nodetool

for x in $(seq 3); do 
  for y in $(seq 2); do
echo repair node$x \#$y
./ccm-nodetool node$x repair -tr
  done
done
{noformat}

edit: minor awk code cleanup, properly nest heredocs.



was (Author: usrbincc):
Ouch. After running a before and after test, I'm 99% sure this was the problem. 
There was some obviously wrong code in {{waitActivity}} (an older version used 
0 instead of -1 to signify "done"; I apparently forgot to update everything 
when I changed this).

Sorry about removing the previous patch. It didn't have the correct {{git diff 
-p}} parameters.

For convenience:

{noformat}
W=https://issues.apache.org/jira/secure/attachment
for url in \
  $W/12637720/5483-v09-16-Fix-hang-caused-by-incorrect-exit-code.patch
do [ -e $(basename $url) ] || curl -sO $url; done &&
git apply 5483-v09-*.patch &&
ant clean && ant
{noformat}

Here's what I used to test with; I get slower and slower repairs, with a hang 
on the 5th repair with the before code, and consistent 10-second repairs with 
the after code.

{noformat}
cat > ccm-nodetool <<E
#!/bin/sh

# ccm doesn't let us call nodetool with options, but we still need to get the
# host and port config from it.
read -r JMXGET <<E
/jmx_port/{p=\$2;} \
/binary/{split(\$2,a,/\047/);h=a[2];} \
END{printf("bin/nodetool -h %s -p %s\n",h,p,cmd);}
E

NODETOOL=$(ccm $1 show | awk -F= "$JMXGET")
shift
$NODETOOL $@
E
chmod +x ccm-nodetool
for x in $(seq 3); do 
  for y in $(seq 2); do
echo repair node$x \#$y
./ccm-nodetool node$x repair -tr
  done
done
{noformat}



[jira] [Comment Edited] (CASSANDRA-5483) Repair tracing

2014-03-13 Thread Yuki Morishita (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13934196#comment-13934196
 ] 

Yuki Morishita edited comment on CASSANDRA-5483 at 3/13/14 10:13 PM:
-

bq. Is there a simple way to create a situation where a repair requires 
streaming?

The easiest way is to nuke one node after writing data to all nodes.
So, for example: create a 3-node cluster, create a keyspace with RF=3, fill 
data in, clear one node ({{ccm node1 clear}}), bring it up again, then repair.

bq.  Somehow the streaming repair is getting done somewhere other than 
Differencer#run. 

Differencer creates a StreamingRepairTask at the end, which creates a StreamPlan 
and executes it.
In order to log streaming, you can add a StreamEventHandler to catch streaming 
events.

For more detail, see http://wiki.apache.org/cassandra/Streaming2.
For example, BulkLoader/SSTableLoader uses StreamEvent to track progress.


was (Author: yukim):
bq. Is there a simple way to create a situation where a repair requires 
streaming?

The easiest way is to nuke one node after writing data to all nodes.
So, for example, you create 3 nodes cluster, create keyspace with RF=3, fill 
data in, clear one node(ccm node1 clear), bring it up again, then repair.




[jira] [Comment Edited] (CASSANDRA-5483) Repair tracing

2014-03-12 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13931858#comment-13931858
 ] 

Jonathan Ellis edited comment on CASSANDRA-5483 at 3/12/14 3:10 PM:


Can you post an example trace session?  

Does this add a verbose flag to nodetool repair to show the trace 
as-it-happens or do I have to manually check the trace tables?


was (Author: jbellis):
Can you post an example trace session?  

Does this add a verbose flag to trace to show the trace as-it-happens or do I 
have to manually check the trace tables?



[jira] [Comment Edited] (CASSANDRA-5483) Repair tracing

2014-03-12 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932510#comment-13932510
 ] 

Jonathan Ellis edited comment on CASSANDRA-5483 at 3/12/14 10:14 PM:
-

Thanks.  I'm going to bikeshed the messages a bit:

- {{Starting repair command #3, repairing 2 ranges for keyspace system_traces 
(seq=true, full=true)}} -> {{Repairing 2 ranges for system_traces.\{sessions, 
events\} (seq=true, full=true)}}
- {{[repair #fedc3790-a9fa-11e3-ab61-2387bbb38302] new session: will sync 
/127.0.0.1, /127.0.0.3 on range (-3074457345618258603,3074457345618258602] for 
system_traces.[sessions, events]}} -> {{New session 
fedc3790-a9fa-11e3-ab61-2387bbb38302 will sync range 
(-3074457345618258603,3074457345618258602]}}
- All the rest: omit the [repair] session id since that's already traced, and 
capitalize the first word
- Add a message if there's nothing to stream

What does it log when there *is* something to stream?


was (Author: jbellis):
Thanks.  I'm going to bikeshed the messages a bit:

- {{Starting repair command #3, repairing 2 ranges for keyspace system_traces 
(seq=true, full=true)}} -> {{Repairing 2 ranges for system_traces.{sessions, 
events} (seq=true, full=true)}}
- {{[repair #fedc3790-a9fa-11e3-ab61-2387bbb38302] new session: will sync 
/127.0.0.1, /127.0.0.3 on range (-3074457345618258603,3074457345618258602] for 
system_traces.[sessions, events]}} -> {{New session 
#fedc3790-a9fa-11e3-ab61-2387bbb38302 will sync range 
(-3074457345618258603,3074457345618258602]}}
- All the rest: omit the session id since that's already traced, and capitalize 
the first word
- Add a message if there's nothing to stream

What does it log when there *is* something to stream?



[jira] [Comment Edited] (CASSANDRA-5483) Repair tracing

2014-03-03 Thread Lyuben Todorov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918251#comment-13918251
 ] 

Lyuben Todorov edited comment on CASSANDRA-5483 at 3/3/14 5:25 PM:
---

Are the latest 3 patches supposed to be applied incrementally on top of 
{{trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch}} 
and 
{{trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch}}? 
As in

{noformat}
1 - apply 
trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch
2 - apply 
trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch
3 - apply one of the three latest patches (v3, v4 or v5)
{noformat}
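Sketched as shell (for the v3 case), assuming the patch files sit in the working directory of a trunk checkout at 4620823:

```shell
# Sketch of the patch order listed above (v3); assumes the files are in
# the working directory of a trunk checkout at commit 4620823.
PATCHES="trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch
trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch
v02p02-5483-v03-0003-Make-repair-tracing-controllable-via-nodetool.patch"

for p in $PATCHES; do
    if [ -f "$p" ]; then
        # dry-run first, then apply for real
        git apply --check "$p" && git apply "$p"
    else
        echo "missing: $p (run inside the trunk checkout)"
    fi
done
```

{{git apply}} here is an assumption; {{patch -p1}} would work equally well for these diffs.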

v5 does a lot of refactoring that I think is outside the scope of this ticket 
(though it might be worth its own ticket, as the idea is good), so my vote is 
for v3. However, I'm getting a NoSuchMethodException; can you post a branch 
with all the patches applied onto trunk (for v3)?

The exception: 
{noformat}
java.lang.NoSuchMethodException: forceRepairAsync(java.lang.String, boolean, java.util.Collection, java.util.Collection, boolean, boolean, boolean, [Ljava.lang.String;)
    at com.sun.jmx.mbeanserver.PerInterface.noSuchMethod(PerInterface.java:168)
    at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:135)
    at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252)
    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
    at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
    at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1487)
    at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:97)
    at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1328)
    at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1420)
    at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:848)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:322)
    at sun.rmi.transport.Transport$1.run(Transport.java:177)
    at sun.rmi.transport.Transport$1.run(Transport.java:174)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.Transport.serviceCall(Transport.java:173)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:556)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:811)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:670)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
{noformat}

bq. I am thinking of calling the new table something generic like 
system_traces.trace_logs. I also assume, that like system_traces.events

I'd say {{events}} is already pretty generic; the new table's name should show 
that its traces aren't query-related the way those in {{events}} are. If we are 
going to add new tables to the trace CF, it's worth thinking about refactoring 
{{events}} into something more specific and adding new tables with names that 
carry meaning. Another possible solution is to add a {{command}} field to 
{{system_traces.events}} that lets users retrieve data about specific kinds of 
events. [~jbellis] WDYT? E.g.:

{noformat}
SELECT * FROM system_traces.events;
 session_id   | ... | thread| command
--+ ... +---+-
 09d48eb0-a2f1-11e3-9f04-7d9e3709bf93 | ... | Thrift:1  | REPAIR
 29084f90-a2f3-11e3-9f04-7d9e3709bf93 | ... | Thrift:1  | QUERY

(2 rows)

SELECT * FROM system_traces.events WHERE command='REPAIR';

 session_id   | ... | thread| command
--+ ... +---+-
 09d48eb0-a2f1-11e3-9f04-7d9e3709bf93 | ... | Thrift:1  | REPAIR

(1 rows)
{noformat}
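The schema shape behind that example would amount to something like the following. This is a sketch only: the column and index names are my assumptions, and a real change to a {{system_traces}} table would be made in Cassandra's schema code rather than via {{ALTER}} from cqlsh. Note that a secondary index (or {{ALLOW FILTERING}}) would be needed for {{WHERE command='REPAIR'}}, since {{command}} would not be part of the primary key:

```shell
# Sketch only: schema shape for the example above. Column/index names
# are assumptions; the real change would live in Cassandra's schema
# code, not an ALTER statement run by an operator.
CQL="ALTER TABLE system_traces.events ADD command text;
CREATE INDEX events_command_idx ON system_traces.events (command);"

if command -v cqlsh >/dev/null 2>&1; then
    cqlsh 127.0.0.1 -e "$CQL"
else
    echo "cqlsh not available; statements shown for reference only"
fi
```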


bq. the rows in this table should expire, though perhaps not as fast as 24 
hours. 

+1. Repairs can take a very long time, so the expiry should be configurable, 
with the default perhaps being around 90 days. With incremental repairs (in 
2.1) it will end up logging a lot of data, but that's still a better choice 
than users doing regular repairs missing