I want to add an extra data point to this thread, having encountered much the same problem. I'm using Apache Cassandra 3.10. I attempted to run an incremental repair optimized to take advantage of some downtime while the cluster was not fielding traffic, repairing only each node's primary partitioner range:

```
nodetool repair --partitioner-range
```
On a couple of nodes, the repair failed with the vague "Some repair failed" message:

```
[2017-07-27 15:30:59,283] Some repair failed
[2017-07-27 15:30:59,286] Repair command #2 finished in 10 seconds
error: Repair job has failed with the error message: [2017-07-27 15:30:59,283] Some repair failed
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message: [2017-07-27 15:30:59,283] Some repair failed
    at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:116)
    at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
    at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)
    at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)
    at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
    at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)
```

Running with the `--trace` option yielded no additional relevant information.

On one node where this arose, I was able to rerun the repair with just the keyspace of interest and see it succeed, then run the repair again across all keyspaces and see that succeed as well. On another node, simply retrying did not work. What did work was running `nodetool compact`. The subsequent repair on that node succeeded, though it took inordinately long. Strangely, another repair after that failed, but the next couple succeeded.

I then ran `df -h` on the Ubuntu hosts and noticed that disk usage was inordinately high. My hypothesis is that this was the underlying cause. Fortunately for me, this was a dev cluster.

Pertinent troubleshooting steps:

* Run `nodetool compact`.
* Check disk usage. Better yet, preemptively alert when disk usage exceeds a certain threshold.

Further insights welcome...
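For the second step, a minimal sketch of such a preemptive check, suitable for cron or a pre-repair hook. The 80% threshold and the mount point default are my own assumptions, not values from the cluster described above:

```shell
#!/usr/bin/env bash
# Warn before running a repair if the data disk is too full.
# THRESHOLD and the default mount point are assumptions; tune for your cluster.
THRESHOLD=80
MOUNT="${1:-/}"

# df -P gives stable POSIX output; column 5 is "Use%" (e.g. "42%").
usage=$(df -P "$MOUNT" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')

if [ "$usage" -ge "$THRESHOLD" ]; then
    echo "WARNING: $MOUNT is at ${usage}%; consider running 'nodetool compact' before repairing"
    exit 1
fi
echo "OK: $MOUNT is at ${usage}%"
```

Wiring this into whatever alerting you already have (exit status 1 on breach) is enough to catch the condition before a repair fails with the unhelpful "Some repair failed" message.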