Re: Repair failed and crash the node, how to bring it back?

2019-07-31 Thread Alexander Dejanovski
Hi Martin,

Apparently this is the bug on hints that you've been hit by:
https://issues.apache.org/jira/browse/CASSANDRA-14080
It was fixed in 3.0.17.

You didn't provide the Cassandra logs from the time of the crash, only the
output of nodetool, so it's hard to say what caused it. You may have been hit
by this bug: https://issues.apache.org/jira/browse/CASSANDRA-14096
This is unlikely to happen with Reaper (as mentioned in the description of
the ticket), since it generates smaller Merkle trees: each subrange covers
fewer partitions per repair session.
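
For illustration, a single subrange repair session looks roughly like this at
the nodetool level (the token values below are placeholders, not tokens from
your ring); Reaper automates splitting each node's ranges into many such small
segments:

  nodetool repair --full -st -9223372036854775808 -et -9123372036854775808 keyspace_masked

Because each segment only covers a small slice of the ring, the Merkle trees
built during validation stay small.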

So the advice is: upgrade to 3.0.19 (or even 3.11.4 IMHO, as 3.0 offers lower
performance than 3.11) and use Reaper to handle/schedule repairs.

Cheers,

-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


On Thu, Aug 1, 2019 at 12:05 AM Martin Xue  wrote:

> Hi Alex,
>
> Thanks for your reply. The disk space was around 80%. The crash happened
> during repair, primary range full repair on 1TB keyspace.
>
> Would that crash again?
>
> Thanks
> Regards
> Martin
>
> On Thu., 1 Aug. 2019, 12:04 am Alexander Dejanovski, <
> a...@thelastpickle.com> wrote:
>
>> It looks like you have a corrupted hint file.
>> Did the node run out of disk space while repair was running?
>>
>> You might want to move the hint files off their current directory and try
>> to restart the node again.
>> Since you'll have lost mutations then, you'll need... to run repair
>> ¯\_(ツ)_/¯
>>
>> -
>> Alexander Dejanovski
>> France
>> @alexanderdeja
>>
>> Consultant
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>>
>> On Wed, Jul 31, 2019 at 3:51 PM Martin Xue  wrote:
>>
>>> Hi,
>>>
>>> I am running repair on production, started with one of 6 nodes in the
>>> cluster (3 nodes in each of two DC). Cassandra version 3.0.14.
>>>
>>> running: repair -pr --full keyspace on node 1, 1TB data, takes two days,
>>> and crash,
>>>
>>> error shows:
>>> 3202]] finished (progress: 3%)
>>> Exception occurred during clean-up.
>>> java.lang.reflect.UndeclaredThrowableException
>>> Cassandra has shutdown.
>>> error: [2019-07-31 20:19:20,797] JMX connection closed. You should check
>>> server log for repair status of keyspace keyspace_masked (Subsequent
>>> keyspaces are not going to be repaired).
>>> -- StackTrace --
>>> java.io.IOException: [2019-07-31 20:19:20,797] JMX connection closed.
>>> You should check server log for repair status of keyspace keyspace_masked
>>> keyspaces are not going to be repaired).
>>> at
>>> org.apache.cassandra.tools.RepairRunner.handleConnectionFailed(RepairRunner.java:97)
>>> at
>>> org.apache.cassandra.tools.RepairRunner.handleConnectionClosed(RepairRunner.java:91)
>>> at
>>> org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:90)
>>> at
>>> javax.management.NotificationBroadcasterSupport.handleNotification(NotificationBroadcasterSupport.java:275)
>>> at
>>> javax.management.NotificationBroadcasterSupport$SendNotifJob.run(NotificationBroadcasterSupport.java:352)
>>> at
>>> javax.management.NotificationBroadcasterSupport$1.execute(NotificationBroadcasterSupport.java:337)
>>> at
>>> javax.management.NotificationBroadcasterSupport.sendNotification(NotificationBroadcasterSupport.java:248)
>>> at
>>> javax.management.remote.rmi.RMIConnector.sendNotification(RMIConnector.java:441)
>>> at
>>> javax.management.remote.rmi.RMIConnector.close(RMIConnector.java:533)
>>> at
>>> javax.management.remote.rmi.RMIConnector.access$1300(RMIConnector.java:121)
>>> at
>>> javax.management.remote.rmi.RMIConnector$RMIClientCommunicatorAdmin.gotIOException(RMIConnector.java:1534)
>>> at
>>> javax.management.remote.rmi.RMIConnector$RMINotifClient.fetchNotifs(RMIConnector.java:1352)
>>> at
>>> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.fetchOneNotif(ClientNotifForwarder.java:655)
>>> at
>>> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.fetchNotifs(ClientNotifForwarder.java:607)
>>> at
>>> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:471)
>>> at
>>> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
>>> at
>>> com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)
>>>
>>> system.log shows
>>> INFO  [Service Thread] 2019-07-31 20:19:08,579 GCInspector.java:284 - G1
>>> Young Generation GC in 2915ms.  G1 Eden Space: 914358272 -> 0; G1 Old Gen:
>>> 19043999248 -> 20219035248;
>>> INFO  [Service Thread] 2019-07-31 20:19:08,579 StatusLogger.java:52 -
>>> Pool NameActive   Pending  Completed   Blocked  All
>>> Time Blocked
>>> INFO  [Service Thread] 2019-07-31 20:19:08,584 St

Re: Repair failed and crash the node, how to bring it back?

2019-07-31 Thread Martin Xue
Hi Alex,

Thanks for your reply. The disk space was around 80%. The crash happened
during repair, primary range full repair on 1TB keyspace.

Would it crash again?

Thanks
Regards
Martin

On Thu., 1 Aug. 2019, 12:04 am Alexander Dejanovski, 
wrote:

> It looks like you have a corrupted hint file.
> Did the node run out of disk space while repair was running?
>
> You might want to move the hint files off their current directory and try
> to restart the node again.
> Since you'll have lost mutations then, you'll need... to run repair
> ¯\_(ツ)_/¯
>
> -
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
> On Wed, Jul 31, 2019 at 3:51 PM Martin Xue  wrote:
>
>> Hi,
>>
>> I am running repair on production, started with one of 6 nodes in the
>> cluster (3 nodes in each of two DC). Cassandra version 3.0.14.
>>
>> running: repair -pr --full keyspace on node 1, 1TB data, takes two days,
>> and crash,
>>
>> error shows:
>> 3202]] finished (progress: 3%)
>> Exception occurred during clean-up.
>> java.lang.reflect.UndeclaredThrowableException
>> Cassandra has shutdown.
>> error: [2019-07-31 20:19:20,797] JMX connection closed. You should check
>> server log for repair status of keyspace keyspace_masked (Subsequent
>> keyspaces are not going to be repaired).
>> -- StackTrace --
>> java.io.IOException: [2019-07-31 20:19:20,797] JMX connection closed. You
>> should check server log for repair status of keyspace keyspace_masked
>> keyspaces are not going to be repaired).
>> at
>> org.apache.cassandra.tools.RepairRunner.handleConnectionFailed(RepairRunner.java:97)
>> at
>> org.apache.cassandra.tools.RepairRunner.handleConnectionClosed(RepairRunner.java:91)
>> at
>> org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:90)
>> at
>> javax.management.NotificationBroadcasterSupport.handleNotification(NotificationBroadcasterSupport.java:275)
>> at
>> javax.management.NotificationBroadcasterSupport$SendNotifJob.run(NotificationBroadcasterSupport.java:352)
>> at
>> javax.management.NotificationBroadcasterSupport$1.execute(NotificationBroadcasterSupport.java:337)
>> at
>> javax.management.NotificationBroadcasterSupport.sendNotification(NotificationBroadcasterSupport.java:248)
>> at
>> javax.management.remote.rmi.RMIConnector.sendNotification(RMIConnector.java:441)
>> at
>> javax.management.remote.rmi.RMIConnector.close(RMIConnector.java:533)
>> at
>> javax.management.remote.rmi.RMIConnector.access$1300(RMIConnector.java:121)
>> at
>> javax.management.remote.rmi.RMIConnector$RMIClientCommunicatorAdmin.gotIOException(RMIConnector.java:1534)
>> at
>> javax.management.remote.rmi.RMIConnector$RMINotifClient.fetchNotifs(RMIConnector.java:1352)
>> at
>> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.fetchOneNotif(ClientNotifForwarder.java:655)
>> at
>> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.fetchNotifs(ClientNotifForwarder.java:607)
>> at
>> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:471)
>> at
>> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
>> at
>> com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)
>>
>> system.log shows
>> INFO  [Service Thread] 2019-07-31 20:19:08,579 GCInspector.java:284 - G1
>> Young Generation GC in 2915ms.  G1 Eden Space: 914358272 -> 0; G1 Old Gen:
>> 19043999248 -> 20219035248;
>> INFO  [Service Thread] 2019-07-31 20:19:08,579 StatusLogger.java:52 -
>> Pool NameActive   Pending  Completed   Blocked  All
>> Time Blocked
>> INFO  [Service Thread] 2019-07-31 20:19:08,584 StatusLogger.java:56 -
>> MutationStage1915 9578177305 0
>> 0
>>
>> INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 -
>> ViewMutationStage 0 0  0 0
>> 0
>>
>> INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 -
>> ReadStage10 0  219357504 0
>> 0
>>
>> INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 -
>> RequestResponseStage  1 0  625174550 0
>> 0
>>
>> INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 -
>> ReadRepairStage   0 02544772 0
>> 0
>>
>> INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 -
>> CounterMutationStage  0 0  0 0
>> 0
>>
>> INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 -
>> MiscStage

Re: Differing snitches in different datacenters

2019-07-31 Thread Voytek Jarnot
Thanks Paul. Yes - finding a definitive answer is where I'm failing as
well. I think we're probably going to try it and see what happens, but
that's a bit worrisome.
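
(For reference, the manual rack/DC specification Paul mentions below, which
GossipingPropertyFileSnitch requires, lives in conf/cassandra-rackdc.properties
on each node; the values here are only placeholders, not actual DC/rack names:

  dc=aws-dc1
  rack=rack1

Ec2MultiRegionSnitch instead derives the DC from the EC2 region and the rack
from the availability zone.)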

On Mon, Jul 29, 2019 at 3:35 PM Paul Chandler  wrote:

> Hi Voytek,
>
> I looked into this a little while ago, and couldn’t really find a
> definitive answer. We ended up keeping the GossipingPropertyFileSnitch in
> our GCP Datacenter, the only downside that I could see is that you have to
> manually specify the rack and DC. But doing it that way does allow you to
> create a multi vendor cluster if you wished in the future.
>
> I would also be interested if anyone has the definitive answer on this.
>
> Thanks
>
> Paul
> www.redshots.com
>
> On 29 Jul 2019, at 17:06, Voytek Jarnot  wrote:
>
> Just a quick bump - hoping someone can shed some light on whether running
> different snitches in different datacenters is a terrible idea or not. It'd
> be fairly temporary: once the new DC is stood up and nodes are rebuilt, the
> old DC will be decommissioned.
>
> On Thu, Jul 25, 2019 at 12:36 PM Voytek Jarnot 
> wrote:
>
>> Quick and hopefully easy question for the list. Background is existing
>> cluster (1 DC) will be migrated to AWS-hosted cluster via standing up a
>> second datacenter, existing cluster will be subsequently decommissioned.
>>
>> We currently use GossipingPropertyFileSnitch and are thinking about using
>> Ec2MultiRegionSnitch in the new AWS DC - that'd position us nicely if in
>> the future we want to run a multi-DC cluster in AWS. My question is: are
>> there any issues with one DC using GossipingPropertyFileSnitch and the
>> other using Ec2MultiRegionSnitch? This setup would be temporary, existing
>> until the new DC nodes have rebuilt and the old DC is decommissioned.
>>
>> Thanks,
>> Voytek Jarnot
>>
>
>


Re: Repair failed and crash the node, how to bring it back?

2019-07-31 Thread Alexander Dejanovski
It looks like you have a corrupted hint file.
Did the node run out of disk space while repair was running?

You might want to move the hint files off their current directory and try
to restart the node again.
Since you'll have lost mutations then, you'll need... to run repair
¯\_(ツ)_/¯
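
If it helps, a rough sketch of that workaround (the paths below are the
defaults, so check hints_directory in cassandra.yaml, and the service commands
assume systemd):

  # on the affected node
  sudo systemctl stop cassandra
  mkdir -p /var/lib/cassandra/hints_quarantine
  mv /var/lib/cassandra/hints/* /var/lib/cassandra/hints_quarantine/
  sudo systemctl start cassandra

Once the node is back up, a full repair of its ranges will deliver the writes
those hints would have replayed.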

-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


On Wed, Jul 31, 2019 at 3:51 PM Martin Xue  wrote:

> Hi,
>
> I am running repair on production, started with one of 6 nodes in the
> cluster (3 nodes in each of two DC). Cassandra version 3.0.14.
>
> running: repair -pr --full keyspace on node 1, 1TB data, takes two days,
> and crash,
>
> error shows:
> 3202]] finished (progress: 3%)
> Exception occurred during clean-up.
> java.lang.reflect.UndeclaredThrowableException
> Cassandra has shutdown.
> error: [2019-07-31 20:19:20,797] JMX connection closed. You should check
> server log for repair status of keyspace keyspace_masked (Subsequent
> keyspaces are not going to be repaired).
> -- StackTrace --
> java.io.IOException: [2019-07-31 20:19:20,797] JMX connection closed. You
> should check server log for repair status of keyspace keyspace_masked
> keyspaces are not going to be repaired).
> at
> org.apache.cassandra.tools.RepairRunner.handleConnectionFailed(RepairRunner.java:97)
> at
> org.apache.cassandra.tools.RepairRunner.handleConnectionClosed(RepairRunner.java:91)
> at
> org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:90)
> at
> javax.management.NotificationBroadcasterSupport.handleNotification(NotificationBroadcasterSupport.java:275)
> at
> javax.management.NotificationBroadcasterSupport$SendNotifJob.run(NotificationBroadcasterSupport.java:352)
> at
> javax.management.NotificationBroadcasterSupport$1.execute(NotificationBroadcasterSupport.java:337)
> at
> javax.management.NotificationBroadcasterSupport.sendNotification(NotificationBroadcasterSupport.java:248)
> at
> javax.management.remote.rmi.RMIConnector.sendNotification(RMIConnector.java:441)
> at
> javax.management.remote.rmi.RMIConnector.close(RMIConnector.java:533)
> at
> javax.management.remote.rmi.RMIConnector.access$1300(RMIConnector.java:121)
> at
> javax.management.remote.rmi.RMIConnector$RMIClientCommunicatorAdmin.gotIOException(RMIConnector.java:1534)
> at
> javax.management.remote.rmi.RMIConnector$RMINotifClient.fetchNotifs(RMIConnector.java:1352)
> at
> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.fetchOneNotif(ClientNotifForwarder.java:655)
> at
> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.fetchNotifs(ClientNotifForwarder.java:607)
> at
> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:471)
> at
> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
> at
> com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)
>
> system.log shows
> INFO  [Service Thread] 2019-07-31 20:19:08,579 GCInspector.java:284 - G1
> Young Generation GC in 2915ms.  G1 Eden Space: 914358272 -> 0; G1 Old Gen:
> 19043999248 -> 20219035248;
> INFO  [Service Thread] 2019-07-31 20:19:08,579 StatusLogger.java:52 - Pool
> NameActive   Pending  Completed   Blocked  All Time
> Blocked
> INFO  [Service Thread] 2019-07-31 20:19:08,584 StatusLogger.java:56 -
> MutationStage1915 9578177305 0
> 0
>
> INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 -
> ViewMutationStage 0 0  0 0
> 0
>
> INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 -
> ReadStage10 0  219357504 0
> 0
>
> INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 -
> RequestResponseStage  1 0  625174550 0
> 0
>
> INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 -
> ReadRepairStage   0 02544772 0
> 0
>
> INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 -
> CounterMutationStage  0 0  0 0
> 0
>
> INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 -
> MiscStage 0 0  0 0
> 0
>
> INFO  [Service Thread] 2019-07-31 20:19:08,586 StatusLogger.java:56 -
> CompactionExecutor1 19515493 0
> 0
>
>
> When I restart the cassandra, it still failed,
> now the error in system.log shows:
>
> INFO  [main] 2019-07-31 21:35:02,044 StorageService.jav

Re: Repair / compaction for 6 nodes, 2 DC cluster

2019-07-31 Thread Alexander Dejanovski
Hi Martin,

You can stop the anticompaction by rolling-restarting the nodes (I'm not sure
whether "nodetool stop COMPACTION" will actually stop anticompaction, I never
tried).

Note that this will leave your cluster with SSTables marked as repaired and
others that are not. These two types of SSTables will never be compacted
together, which can delay reclaiming disk space over time because
overwrites and tombstones won't get merged.
If you plan to stick with nodetool, leave the anticompaction running and
hope that it's just taking a long time because it's your first repair (if
it is your first repair).

Otherwise (and I obviously recommend that), if you choose to use Reaper, you
can stop the running anticompactions right away and prepare for Reaper.
Since Reaper won't trigger anticompactions, you'll have to mark your
SSTables back to the unrepaired state so that all SSTables can be compacted
with each other in the future.
To that end, you'll need to use the sstablerepairedset command line tool
(it ships with Cassandra) and follow the procedure (in a nutshell: stop
Cassandra, mark sstables as unrepaired, restart Cassandra).
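
A rough sketch of that procedure for one node (the data path and the scratch
file are examples, keyspace_event is taken from your earlier mail, and the
service commands assume systemd; double-check sstablerepairedset's options for
your exact version):

  # 1. check whether anticompaction is still running
  nodetool compactionstats

  # 2. stop Cassandra on the node
  sudo systemctl stop cassandra

  # 3. mark all sstables of the keyspace as unrepaired
  ls /var/lib/cassandra/data/keyspace_event/*/*-Data.db > /tmp/sstables.txt
  sstablerepairedset --really-set --is-unrepaired -f /tmp/sstables.txt

  # 4. restart the node, then move on to the next one
  sudo systemctl start cassandra

You can verify with sstablemetadata that "Repaired at" is back to 0 on the
rewritten sstables.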

Cheers,

-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


On Wed, Jul 31, 2019 at 3:53 PM Martin Xue  wrote:

> Sorry ASAD, don't have chance, still bogged down with the production
> issue...
>
> On Wed, Jul 31, 2019 at 10:56 PM ZAIDI, ASAD A  wrote:
>
>> Did you get chance to look at tlp reaper tool i.e.
>> http://cassandra-reaper.io/
>>
>> It is pretty awesome – Thanks to TLP team.
>>
>>
>>
>>
>>
>>
>>
>> *From:* Martin Xue [mailto:martin...@gmail.com]
>> *Sent:* Wednesday, July 31, 2019 12:09 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* Repair / compaction for 6 nodes, 2 DC cluster
>>
>>
>>
>> Hello,
>>
>>
>>
>> Good day. This is Martin.
>>
>>
>>
>> Can someone help me with the following query regarding Cassandra repair
>> and compaction?
>>
>>
>> Currently we have a large keyspace (keyspace_event) with 1TB of data (in
>> /var/lib/cassandra/data/keyspace_event);
>> There is a cluster with Datacenter 1 contains 3 nodes, Data center 2
>> containing 3 nodes; All together 6 nodes;
>>
>>
>> As part of maintenance, I run the repair on this keyspace with the
>> following command:
>>
>>
>> nodetool repair -pr --full keyspace_event;
>>
>>
>> now it has been run for 2 days. yes 2 days, when doing nodetool tpstats,
>> it shows there is a compaction running:
>>
>>
>> CompactionExecutor1 15783732 0
>>   0
>>
>> nodetool compactionstats shows:
>>
>>
>> pending tasks: 6
>> id   compaction type
>>   keyspace  table   completed
>> totalunit   progress
>>   249ec5f1-b225-11e9-82bd-5b36ef02cadd   Anticompaction after repair
>> keyspace_event table_event   1916937740948   2048931045927   bytes
>> 93.56%
>>
>>
>>
>>
>> Now my questions are:
>> 1. why running repair (with primary range option, -pr, as I want to limit
>> the repair node by node), triggered the compaction running on other nodes?
>> 2. when I run the repair on the second node with nodetool repair -pr
>> --full keyspace_event; will the subsequent compaction run again on all the
>> 6 nodes?
>>
>> I want to know what are the best option to run the repair (full repair)
>> as we did not run it before, especially if it can take less time (in
>> current speed it will take 2 weeks to finish all).
>>
>> I am running Cassandra 3.0.14
>>
>> Any suggestions will be appreciated.
>>
>>
>>
>> Thanks
>>
>> Regards
>>
>> Martin
>>
>>
>>
>


Re: Repair / compaction for 6 nodes, 2 DC cluster

2019-07-31 Thread Martin Xue
Sorry ASAD, haven't had a chance yet, still bogged down with the production
issue...

On Wed, Jul 31, 2019 at 10:56 PM ZAIDI, ASAD A  wrote:

> Did you get chance to look at tlp reaper tool i.e.
> http://cassandra-reaper.io/
>
> It is pretty awesome – Thanks to TLP team.
>
>
>
>
>
>
>
> *From:* Martin Xue [mailto:martin...@gmail.com]
> *Sent:* Wednesday, July 31, 2019 12:09 AM
> *To:* user@cassandra.apache.org
> *Subject:* Repair / compaction for 6 nodes, 2 DC cluster
>
>
>
> Hello,
>
>
>
> Good day. This is Martin.
>
>
>
> Can someone help me with the following query regarding Cassandra repair
> and compaction?
>
>
> Currently we have a large keyspace (keyspace_event) with 1TB of data (in
> /var/lib/cassandra/data/keyspace_event);
> There is a cluster with Datacenter 1 contains 3 nodes, Data center 2
> containing 3 nodes; All together 6 nodes;
>
>
> As part of maintenance, I run the repair on this keyspace with the
> following command:
>
>
> nodetool repair -pr --full keyspace_event;
>
>
> now it has been run for 2 days. yes 2 days, when doing nodetool tpstats,
> it shows there is a compaction running:
>
>
> CompactionExecutor1 15783732 0
> 0
>
> nodetool compactionstats shows:
>
>
> pending tasks: 6
> id   compaction type
> keyspace  table   completed
>   totalunit   progress
>   249ec5f1-b225-11e9-82bd-5b36ef02cadd   Anticompaction after repair
> keyspace_event table_event   1916937740948   2048931045927   bytes
> 93.56%
>
>
>
>
> Now my questions are:
> 1. why running repair (with primary range option, -pr, as I want to limit
> the repair node by node), triggered the compaction running on other nodes?
> 2. when I run the repair on the second node with nodetool repair -pr
> --full keyspace_event; will the subsequent compaction run again on all the
> 6 nodes?
>
> I want to know what are the best option to run the repair (full repair) as
> we did not run it before, especially if it can take less time (in current
> speed it will take 2 weeks to finish all).
>
> I am running Cassandra 3.0.14
>
> Any suggestions will be appreciated.
>
>
>
> Thanks
>
> Regards
>
> Martin
>
>
>


Repair failed and crash the node, how to bring it back?

2019-07-31 Thread Martin Xue
Hi,

I am running repair on production, starting with one of the 6 nodes in the
cluster (3 nodes in each of two DCs). Cassandra version 3.0.14.

Running repair -pr --full keyspace on node 1 (1TB of data) took two days and
then crashed.

error shows:
3202]] finished (progress: 3%)
Exception occurred during clean-up.
java.lang.reflect.UndeclaredThrowableException
Cassandra has shutdown.
error: [2019-07-31 20:19:20,797] JMX connection closed. You should check
server log for repair status of keyspace keyspace_masked (Subsequent
keyspaces are not going to be repaired).
-- StackTrace --
java.io.IOException: [2019-07-31 20:19:20,797] JMX connection closed. You
should check server log for repair status of keyspace keyspace_masked
keyspaces are not going to be repaired).
at
org.apache.cassandra.tools.RepairRunner.handleConnectionFailed(RepairRunner.java:97)
at
org.apache.cassandra.tools.RepairRunner.handleConnectionClosed(RepairRunner.java:91)
at
org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:90)
at
javax.management.NotificationBroadcasterSupport.handleNotification(NotificationBroadcasterSupport.java:275)
at
javax.management.NotificationBroadcasterSupport$SendNotifJob.run(NotificationBroadcasterSupport.java:352)
at
javax.management.NotificationBroadcasterSupport$1.execute(NotificationBroadcasterSupport.java:337)
at
javax.management.NotificationBroadcasterSupport.sendNotification(NotificationBroadcasterSupport.java:248)
at
javax.management.remote.rmi.RMIConnector.sendNotification(RMIConnector.java:441)
at
javax.management.remote.rmi.RMIConnector.close(RMIConnector.java:533)
at
javax.management.remote.rmi.RMIConnector.access$1300(RMIConnector.java:121)
at
javax.management.remote.rmi.RMIConnector$RMIClientCommunicatorAdmin.gotIOException(RMIConnector.java:1534)
at
javax.management.remote.rmi.RMIConnector$RMINotifClient.fetchNotifs(RMIConnector.java:1352)
at
com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.fetchOneNotif(ClientNotifForwarder.java:655)
at
com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.fetchNotifs(ClientNotifForwarder.java:607)
at
com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:471)
at
com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
at
com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)

system.log shows
INFO  [Service Thread] 2019-07-31 20:19:08,579 GCInspector.java:284 - G1
Young Generation GC in 2915ms.  G1 Eden Space: 914358272 -> 0; G1 Old Gen:
19043999248 -> 20219035248;
INFO  [Service Thread] 2019-07-31 20:19:08,579 StatusLogger.java:52 - Pool
NameActive   Pending  Completed   Blocked  All Time
Blocked
INFO  [Service Thread] 2019-07-31 20:19:08,584 StatusLogger.java:56 -
MutationStage1915 9578177305 0
0

INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 -
ViewMutationStage 0 0  0 0
0

INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 -
ReadStage10 0  219357504 0
0

INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 -
RequestResponseStage  1 0  625174550 0
0

INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 -
ReadRepairStage   0 02544772 0
0

INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 -
CounterMutationStage  0 0  0 0
0

INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 -
MiscStage 0 0  0 0
0

INFO  [Service Thread] 2019-07-31 20:19:08,586 StatusLogger.java:56 -
CompactionExecutor1 19515493 0
0


When I restart Cassandra, it still fails;
now the error in system.log shows:

INFO  [main] 2019-07-31 21:35:02,044 StorageService.java:575 - Cassandra
version: 3.0.14
INFO  [main] 2019-07-31 21:35:02,044 StorageService.java:576 - Thrift API
version: 20.1.0
INFO  [main] 2019-07-31 21:35:02,044 StorageService.java:577 - CQL
supported versions: 3.4.0 (default: 3.4.0)
ERROR [main] 2019-07-31 21:35:02,075 CassandraDaemon.java:710 - Exception
encountered during startup
org.apache.cassandra.io.FSReadError: java.io.EOFException
at
org.apache.cassandra.hints.HintsDescriptor.readFromFile(HintsDescriptor.java:142)
~[apache-cassandra-3.0.14.jar:3.0.14]
at
java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
~[na:1.8.0_171]
at
java.util.stream.Ref

Re: Repair / compaction for 6 nodes, 2 DC cluster

2019-07-31 Thread Martin Xue
Thanks Alex,

In this case, I have already run the repair and anticompaction has started
(including on other nodes). I don't know how long it will take to finish; is
there a way to check? nodetool compactionstats shows one process finishing,
then another one coming up.

Shall I stop the anticompaction process if it is taking too long? If so, how
do I do it?

Your help is much appreciated

Thanks
Regards
Martin

On Wed, Jul 31, 2019 at 9:18 PM Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> On Wed, Jul 31, 2019 at 7:10 AM Martin Xue  wrote:
>
>> Hello,
>>
>> Good day. This is Martin.
>>
>> Can someone help me with the following query regarding Cassandra repair
>> and compaction?
>>
>
> Martin,
>
> This blog post from The Last Pickle provides an in-depth explanation as
> well as some practical advice:
> https://thelastpickle.com/blog/2017/12/14/should-you-use-incremental-repair.html
>
>
> The status hasn't changed from the time of writing, as far as I'm aware.
> Nonetheless, you might want to upgrade to the latest released version in
> 3.0 series, which is 3.0.18: http://cassandra.apache.org/download/
>
> Regards,
> --
> Alex
>
>


RE: Repair / compaction for 6 nodes, 2 DC cluster

2019-07-31 Thread ZAIDI, ASAD A
Did you get a chance to look at the TLP Reaper tool, i.e. http://cassandra-reaper.io/
It is pretty awesome – thanks to the TLP team.



From: Martin Xue [mailto:martin...@gmail.com]
Sent: Wednesday, July 31, 2019 12:09 AM
To: user@cassandra.apache.org
Subject: Repair / compaction for 6 nodes, 2 DC cluster

Hello,

Good day. This is Martin.

Can someone help me with the following query regarding Cassandra repair and 
compaction?

Currently we have a large keyspace (keyspace_event) with 1TB of data (in 
/var/lib/cassandra/data/keyspace_event);
The cluster has Datacenter 1 containing 3 nodes and Datacenter 2 containing
3 nodes; 6 nodes altogether.

As part of maintenance, I run the repair on this keyspace with the following 
command:

nodetool repair -pr --full keyspace_event;

It has now been running for 2 days. Yes, 2 days. nodetool tpstats shows there
is a compaction running:

CompactionExecutor1 15783732 0  
   0

nodetool compactionstats shows:

pending tasks: 6
id                                     compaction type               keyspace         table         completed       total           unit    progress
249ec5f1-b225-11e9-82bd-5b36ef02cadd   Anticompaction after repair   keyspace_event   table_event   1916937740948   2048931045927   bytes   93.56%


Now my questions are:
1. Why did running repair (with the primary range option, -pr, which I use to
limit the repair node by node) trigger compaction on the other nodes?
2. When I run the repair on the second node with nodetool repair -pr --full
keyspace_event, will the subsequent compaction run again on all 6 nodes?

I want to know the best way to run the repair (full repair), as we have not
run it before, and especially whether it can take less time (at the current
speed it will take 2 weeks to finish all nodes).

I am running Cassandra 3.0.14

Any suggestions will be appreciated.

Thanks
Regards
Martin



Re: Repair / compaction for 6 nodes, 2 DC cluster

2019-07-31 Thread Oleksandr Shulgin
On Wed, Jul 31, 2019 at 7:10 AM Martin Xue  wrote:

> Hello,
>
> Good day. This is Martin.
>
> Can someone help me with the following query regarding Cassandra repair
> and compaction?
>

Martin,

This blog post from The Last Pickle provides an in-depth explanation as
well as some practical advice:
https://thelastpickle.com/blog/2017/12/14/should-you-use-incremental-repair.html


The status hasn't changed from the time of writing, as far as I'm aware.
Nonetheless, you might want to upgrade to the latest released version in the
3.0 series, which is 3.0.18: http://cassandra.apache.org/download/

Regards,
--
Alex