Actually, I found a lot of .db files in the following directory: /var/lib/cassandra/data/mykeyspace/mytable-2795c0204a2d11e9aba361828766468f/snapshots/dropped-1614575293790-mytable
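As a quick sanity check, the directory name itself encodes when the auto-snapshot was taken. A minimal sketch, assuming GNU date, the default data directory, and the keyspace/table names used in this thread:

    # list every auto-snapshot created by a DROP, on each node
    find /var/lib/cassandra/data -type d -name 'dropped-*'

    # the number after "dropped-" is milliseconds since the epoch; strip the
    # last three digits and convert it
    date -u -d @1614575293
    # prints roughly "Mon Mar  1 05:08:13 UTC 2021", which lines up with the
    # "Drop Keyspace" log entry below, allowing for the log's local-time offset

    # confirm the SSTable components are still there
    ls -lh /var/lib/cassandra/data/mykeyspace/mytable-2795c0204a2d11e9aba361828766468f/snapshots/dropped-1614575293790-mytable/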
I also found this:

2021-03-01 06:08:08,864 INFO [Native-Transport-Requests-1] MigrationManager.java:542 announceKeyspaceDrop Drop Keyspace 'mykeyspace'

so I think that you, @erick and @bowen, are right: something dropped the keyspace. I will try to follow your procedure, @bowen, thank you very much! Do you know what could cause this issue? It seems like a serious one. I found this bug https://issues.apache.org/jira/browse/CASSANDRA-14957?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel, maybe they are related...

Thank you @Bowen and @Erick

On Mon, 1 Mar 2021 at 13:39, Bowen Song <bo...@bso.ng.invalid> wrote:

> The warning message indicates that the node y.y.y.y went down (or became
> unreachable via the network) before 2021-02-28 05:17:33. Is there any chance
> you can find the log file on that node from around or before that time? It
> may show why that node went down. The reason might be irrelevant to the
> missing keyspace, but it is still worth a look in order to prevent the same
> thing from happening again.
>
> As Erick said, the table's CF ID isn't new, so it's unlikely to be a schema
> synchronization issue. Therefore I also suspect the keyspace was accidentally
> dropped. Cassandra only logs "Drop Keyspace 'keyspace_name'" on the node that
> received the "DROP KEYSPACE ..." query, so you may have to search the log
> files from all nodes to find it.
>
> Assuming the keyspace was dropped but you still have the SSTable files, you
> can recover the data by re-creating the keyspace and tables with an identical
> replication strategy and schema, then copying the SSTable files to the
> corresponding new table directories (with different CF ID suffixes) on the
> same node, and finally running "nodetool refresh ..." or restarting the node.
> Since you don't yet have a full backup, I strongly recommend making a backup,
> and ideally testing a restore to a different cluster, before attempting this.
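Spelled out, the recovery Bowen describes might look roughly like the sketch below for a single table. It is only a sketch: the angle-bracket placeholders are hypothetical, the log path assumes a default package install, the example replication settings must be replaced with the keyspace's original ones, and a backup should be taken (and test-restored) before touching anything.

    # 1. the DROP is only logged on the coordinator that received it,
    #    so search every node's log for it
    grep -H "Drop Keyspace 'mykeyspace'" /var/log/cassandra/system.log*

    # 2. re-create the keyspace and table with their original definitions, e.g.
    #      cqlsh -e "CREATE KEYSPACE mykeyspace WITH replication =
    #                {'class': 'NetworkTopologyStrategy', 'DC1': 3};"   # RF shown is only an example
    #      cqlsh -e "CREATE TABLE mykeyspace.mytable ( ...original columns and keys... );"
    #    If the snapshot directory contains a schema.cql file, it holds the
    #    original table definition.

    # 3. on every node, copy the snapshotted files into the *new* table directory
    #    (same table name, different CF ID suffix) and fix ownership
    cp /var/lib/cassandra/data/mykeyspace/mytable-<old_cfid>/snapshots/dropped-<ts>-mytable/* \
       /var/lib/cassandra/data/mykeyspace/mytable-<new_cfid>/
    chown -R cassandra:cassandra /var/lib/cassandra/data/mykeyspace/mytable-<new_cfid>/

    # 4. make Cassandra load the copied SSTables (or restart the node instead)
    nodetool refresh -- mykeyspace mytable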
> On 01/03/2021 11:48, Marco Gasparini wrote:
>
> here is the previous error:
>
> 2021-02-28 05:17:33,262 WARN NodeConnectionsService.java:165
> validateAndConnectIfNeeded failed to connect to node
> {y.y.y.y}{9ba2d3ee-bc82-4e76-ae24-9e20eb334c24}{9ba2d3ee-bc82-4e76-ae24-9e20eb334c24}{y.y.y.y}{y.y.y.y:9300}{ALIVE}{rack=r1, dc=DC1} (tried [1] times)
> org.elasticsearch.transport.ConnectTransportException: [y.y.y.y][y.y.y.y:9300] connect_timeout[30s]
>     at org.elasticsearch.transport.TcpChannel.awaitConnected(TcpChannel.java:163)
>     at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:616)
>     at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:513)
>     at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:336)
>     at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:323)
>     at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:156)
>     at org.elasticsearch.cluster.NodeConnectionsService$ConnectionChecker.doRun(NodeConnectionsService.java:185)
>     at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672)
>     at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
>
> Yes, this node (y.y.y.y) stopped because it ran out of disk space.
>
> I said "deleted" because I'm not a native English speaker :)
> I usually remove snapshots via 'nodetool clearsnapshot' or the
> cassandra-reaper user interface.
>
> On Mon, 1 Mar 2021 at 12:39, Bowen Song <bo...@bso.ng.invalid> wrote:
>
>> What was the warning? Is it related to the disk failure policy? Could you
>> please share the relevant log? You can edit it and redact any sensitive
>> information before sharing it.
>>
>> Also, I can't help noticing that you used the word "delete" (instead of
>> "clear") to describe the process of removing snapshots. May I ask how you
>> deleted the snapshots? Was it "nodetool clearsnapshot ...", "rm -rf ..."
>> or something else?
>>
>> On 01/03/2021 11:27, Marco Gasparini wrote:
>>
>> thanks Bowen for answering
>>
>> Actually, I checked the server log and the only warning was that a node
>> went offline.
>> No, I have no backups or snapshots.
>>
>> In the meantime I found that Cassandra probably moved all the files from a
>> directory into the snapshot directory. I am pretty sure of that because I
>> recently deleted all the snapshots I had made, since it was running out of
>> disk space, and I found this very directory full of files whose
>> modification timestamp matches the first error I got in the log.
>>
>> On Mon, 1 Mar 2021 at 12:13, Bowen Song <bo...@bso.ng.invalid> wrote:
>>
>>> The first thing I'd check is the server log. It may contain vital
>>> information about the cause, and there may be different ways to recover
>>> depending on what that cause is.
>>>
>>> Also, please allow me to ask a seemingly obvious question: do you have a
>>> backup?
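Since snapshot clean-up comes up in this exchange: going through nodetool, rather than rm, avoids the risk of accidentally removing live SSTables. A small sketch, with a made-up snapshot tag:

    # show existing snapshots and how much space they take up
    nodetool listsnapshots

    # remove one snapshot by tag, limited to a single keyspace
    nodetool clearsnapshot -t my_snapshot_tag -- mykeyspace

    # remove all snapshots for that keyspace
    nodetool clearsnapshot -- mykeyspace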
>>>
>>> On 01/03/2021 09:34, Marco Gasparini wrote:
>>>
>>> hello everybody,
>>>
>>> This morning, Monday!!!, I was checking on the Cassandra cluster and I
>>> noticed that all the data was missing. I saw the following error on each
>>> node (9 nodes in the cluster):
>>>
>>> 2021-03-01 09:05:52,984 WARN [MessagingService-Incoming-/x.x.x.x]
>>> IncomingTcpConnection.java:103 run UnknownColumnFamilyException reading
>>> from socket; closing
>>> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table
>>> for cfId cba90a70-5c46-11e9-9e36-f54fe3235e69. If a table was just
>>> created, this is likely due to the schema not being fully propagated.
>>> Please wait for schema agreement on table creation.
>>>     at org.apache.cassandra.config.CFMetaData$Serializer.deserialize(CFMetaData.java:1533)
>>>     at org.apache.cassandra.db.ReadCommand$Serializer.deserialize(ReadCommand.java:758)
>>>     at org.apache.cassandra.db.ReadCommand$Serializer.deserialize(ReadCommand.java:697)
>>>     at org.apache.cassandra.io.ForwardingVersionedSerializer.deserialize(ForwardingVersionedSerializer.java:50)
>>>     at org.apache.cassandra.net.MessageIn.read(MessageIn.java:123)
>>>     at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:195)
>>>     at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:183)
>>>     at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:94)
>>>
>>> I tried to query the keyspace and got this:
>>>
>>> node1# cqlsh
>>> Connected to Cassandra Cluster at x.x.x.x:9042.
>>> [cqlsh 5.0.1 | Cassandra 3.11.5.1 | CQL spec 3.4.4 | Native protocol v4]
>>> Use HELP for help.
>>> cqlsh> select * from mykeyspace.mytable where id = 123935;
>>> InvalidRequest: Error from server: code=2200 [Invalid query]
>>> message="Keyspace mykeyspace does not exist"
>>>
>>> Investigating on each node, I found that all the SSTables exist, so I
>>> think the data is still there but the keyspace vanished, "magically".
>>>
>>> Other facts I can tell you:
>>>
>>> - I have been getting anticompaction errors from 2 nodes because their
>>>   disks were almost full.
>>> - The cluster was online on Friday.
>>> - This morning, Monday, the whole cluster was offline and I noticed the
>>>   "missing keyspace" problem.
>>> - During the weekend the cluster was subject to inserts and deletes.
>>> - It is a 9-node (HDD) Cassandra 3.11 cluster.
>>>
>>> I really need help on this: how can I restore the cluster?
>>>
>>> Thank you very much
>>> Marco
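For reference, the UnknownColumnFamilyException above means the receiving nodes no longer recognise the cfId in the incoming request. Two quick checks that show what the cluster currently agrees on (a sketch; output abbreviated, keyspace name as used in this thread):

    # every node should report the same schema version under "Schema versions"
    nodetool describecluster

    # list the CF IDs Cassandra currently associates with the keyspace's tables
    cqlsh -e "SELECT keyspace_name, table_name, id FROM system_schema.tables WHERE keyspace_name = 'mykeyspace';"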