I want to add an extra data point to this thread, having encountered much
the same problem. I'm using Apache Cassandra 3.10. To take advantage of some
downtime during which the cluster was not fielding traffic, I attempted to
run an incremental repair limited to each node's primary partitioner range:
nodetool repair --partitioner-range
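
For what it's worth, --partitioner-range is the long form of -pr, and on
this version nodetool repair is incremental by default, so the invocation
above should be equivalent to the shorter form below (an optional keyspace
argument would narrow it further; I omitted it to cover all keyspaces):

nodetool repair -pr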

On a couple of nodes, I was seeing the repair fail with the vague "Some repair
failed" message:
[2017-07-27 15:30:59,283] Some repair failed
[2017-07-27 15:30:59,286] Repair command #2 finished in 10 seconds
error: Repair job has failed with the error message: [2017-07-27 15:30:59,283] Some repair failed
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message: [2017-07-27 15:30:59,283] Some repair failed
at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:116)
at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)

Running with the --trace option yielded no additional relevant information.
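
For anyone else hitting this: nodetool only relays the JMX progress
notification, so the server-side system.log on the nodes involved is probably
the better place to look for the real cause. Assuming the default log
location for the Ubuntu packages, something like:

grep -iE 'repair|validation' /var/log/cassandra/system.log | grep -iE 'error|fail'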

On one node where this was arising, I was able to rerun the repair with just
the keyspace of interest and see that succeed, then rerun the repair across
all keyspaces and see that succeed as well.
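
Concretely, that sequence was along the lines of the following, where
my_keyspace stands in for the keyspace of interest:

nodetool repair -pr my_keyspace
nodetool repair -pr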

On another node, simply retrying did not work. What did work was running a
"nodetool compact". The subsequent repair on that node succeeded, though it
took inordinately long. Strangely, another repair after that failed, but the
next couple of repairs succeeded.
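
For completeness, the compaction step was just the plain invocation below
(with no arguments it compacts every keyspace); nodetool compactionstats is
a handy way to watch it drain before kicking off the next repair:

nodetool compact
nodetool compactionstats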

I proceeded to do a "df -h" on the Ubuntu hosts and noticed that disk usage
was inordinately high. My hypothesis is that the lack of free disk space was
the underlying cause of the failures. Fortunately for me, this is only a dev
cluster.
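
The check itself was just plain df -h. If you want to narrow it to the
Cassandra data volume, the Ubuntu packages default to /var/lib/cassandra
(adjust for your data_file_directories setting):

df -h /var/lib/cassandra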

Pertinent troubleshooting steps:
* nodetool compact
* Check disk usage. Better yet, preemptively alert on disk usage exceeding
a certain threshold (a rough version of such a check is sketched below).
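
For the second point, here is a minimal sketch of the kind of check I have
in mind, assuming the default /var/lib/cassandra data location and an
arbitrary 80% threshold; wiring it into cron or a monitoring system is left
as an exercise:

#!/bin/sh
# Warn when the filesystem holding the Cassandra data dir crosses a threshold.
# DATA_DIR and THRESHOLD are assumptions; adjust for your deployment.
DATA_DIR=/var/lib/cassandra
THRESHOLD=80

# df -P guarantees single-line POSIX output; field 5 is "Use%".
USED=$(df -P "$DATA_DIR" | awk 'NR==2 { gsub("%", "", $5); print $5 }')

if [ "$USED" -ge "$THRESHOLD" ]; then
    echo "WARNING: $DATA_DIR is ${USED}% full (threshold ${THRESHOLD}%)" >&2
    exit 1
fi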

Further insights welcome...
