The first thing to ask is: why doesn't the repair -pr operation complete?
The second question is: which repair option is best?
One probable cause of stuck repairs: if the firewall between the DCs is closing
TCP connections and Cassandra then tries to use those connections, repairs will
hang. Please refer
. We faced that.
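
A rough way to check for this on a Linux node is sketched below. It assumes
the firewall drops connections that have been idle longer than some timeout,
and that TCP keepalive probes are what keep inter-DC connections alive. The
3600 second firewall timeout is a made-up placeholder, not a value from our
setup; ask your network team for the real one.

```python
from pathlib import Path

# Linux exposes the keepalive idle time (seconds) here; other OSes differ.
KEEPALIVE_PATH = Path("/proc/sys/net/ipv4/tcp_keepalive_time")

# ASSUMPTION: placeholder idle timeout of the inter-DC firewall, in seconds.
FIREWALL_IDLE_TIMEOUT_S = 3600


def read_keepalive_time(path=KEEPALIVE_PATH):
    """Return the idle time before the first keepalive probe, or None."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def keepalive_beats_firewall(keepalive_time_s, firewall_idle_timeout_s):
    """True if a probe is sent before the firewall's idle timeout expires."""
    return keepalive_time_s < firewall_idle_timeout_s


current = read_keepalive_time()
if current is None:
    print("could not read tcp_keepalive_time (not Linux?)")
elif keepalive_beats_firewall(current, FIREWALL_IDLE_TIMEOUT_S):
    print(f"ok: keepalive after {current}s, firewall timeout {FIREWALL_IDLE_TIMEOUT_S}s")
else:
    print(f"risk: keepalive after {current}s, firewall timeout {FIREWALL_IDLE_TIMEOUT_S}s")
```

If the check reports a risk, lowering net.ipv4.tcp_keepalive_time below the
firewall timeout is one thing to try.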
Also make sure you meet the basic bandwidth requirement between DCs.
The recommendation is 1000 Mb/s (1 gigabit) or greater.
Answers to your specific questions:
1. As far as I understand, not all replicas participate in DC-local repairs,
so such a repair would be ineffective. You need to make sure that all replicas
of the data in all DCs are in sync.
2. Each DC is not a ring of its own. All DCs together form a single token
ring. So I think yes, you should run repair -pr on all nodes.
3. Yes. I don't have experience with incremental repairs, but you can run
repair -pr on all nodes of all DCs.
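
To illustrate running repair -pr on every node of every DC, here is a minimal
sketch that builds one nodetool command per node and runs them one at a time.
The host addresses, DC names and keyspace name are hypothetical, and the
script defaults to a dry run that only prints the commands.

```python
import subprocess

# ASSUMPTION: hypothetical inventory; replace with your own nodes per DC.
HOSTS_BY_DC = {
    "dc1": ["10.0.1.1", "10.0.1.2"],
    "dc2": ["10.0.2.1", "10.0.2.2"],
}


def build_repair_commands(hosts, keyspace=None):
    """Return one `nodetool repair -pr` argv list per host."""
    commands = []
    for host in hosts:
        cmd = ["nodetool", "-h", host, "repair", "-pr"]
        if keyspace:
            cmd.append(keyspace)
        commands.append(cmd)
    return commands


def run_sequentially(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # Stops at the first failing node so it can be investigated.
            subprocess.run(cmd, check=True)


# Every node in every DC, because -pr repairs only each node's primary ranges.
all_hosts = [h for dc in sorted(HOSTS_BY_DC) for h in HOSTS_BY_DC[dc]]
commands = build_repair_commands(all_hosts, keyspace="my_keyspace")
run_sequentially(commands, dry_run=True)
```

Running the nodes one after another keeps the load predictable. If a node is
skipped, its primary ranges are not repaired at all.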
Regarding the best approach to repair, have a look at the repair presentations
from Cassandra Summit 2016. They are all online now.
I attended the summit, and people running large clusters generally use subrange
repairs to repair them. But such large deployments are on older Cassandra
versions and generally don't use vnodes, so it is easy for them to know which
nodes hold which token ranges.
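
As a sketch of the subrange idea: take one token range owned by a node, cut it
into smaller pieces, and repair each piece with the -st / -et options of
nodetool repair. The host, keyspace and token values below are hypothetical,
and ranges that wrap around the ring (start greater than end) are not handled.

```python
def split_range(start, end, parts):
    """Split the token range (start, end] into `parts` contiguous subranges."""
    width = end - start
    if parts < 1 or width < parts:
        raise ValueError("need start < end and at most (end - start) parts")
    bounds = [start + (width * i) // parts for i in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def subrange_commands(host, keyspace, start, end, parts):
    """Build one `nodetool repair -st .. -et ..` argv list per subrange."""
    return [
        ["nodetool", "-h", host, "repair",
         "-st", str(st), "-et", str(et), keyspace]
        for st, et in split_range(start, end, parts)
    ]


# ASSUMPTION: hypothetical host, keyspace and token range for illustration.
for cmd in subrange_commands("10.0.1.1", "my_keyspace", 0, 100, 4):
    print(" ".join(cmd))
```

Smaller subranges mean each repair session is shorter, so a failed or hung
session costs less to rerun.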