Re: [EXTERNAL] Re: Adding new DC results in clients failing to connect

2020-10-22 Thread João Reis
 Hi,

We have received another report of this issue and this time we were able to
identify the bug and fix it. Today's release of the driver (version 3.16.1)
contains this fix. The JIRA issue is CSHARP-943 [1].

Thanks,
João Reis

[1] https://datastax-oss.atlassian.net/browse/CSHARP-943

Gediminas Blazys wrote on Monday, 18/05/2020 at 07:16:

> Hey,
>
>
>
> Apologies for the late reply João.
>
>
>
> We really, really appreciate your interest. Likewise, we could not
> reproduce this issue anywhere other than the production cluster where it
> occurred, which is slightly undesirable. As we could not afford to keep
> the DC in this state, we have removed it from our cluster, so I'm afraid
> we cannot provide you with the info you've requested.
>
>
>
> Gediminas
>
>
>
> *From:* João Reis 
> *Sent:* Tuesday, May 12, 2020 19:58
> *To:* user@cassandra.apache.org
> *Subject:* Re: [EXTERNAL] Re: Adding new DC results in clients failing to
> connect
>
>
>
> Unfortunately I'm not able to reproduce this.
>
>
>
> Would it be possible for you to run a couple of queries and give us the
> results? The queries are "SELECT * FROM system.peers" and "SELECT * FROM
> system_schema.keyspaces". You should run both of these queries on any node
> that the driver uses to set up the control connection when that error
> occurs. To determine the node you can look for this driver log message:
> "Connection established to [NODE_ADDRESS] using protocol version [VERSION]."
>
>
>
> It should be easier to reproduce the issue with the results of those
> queries.
>
>
>
> Thanks,
>
> João Reis
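
For reference, the two diagnostic queries being requested here can be captured
with a short C# driver program along these lines. This is only a sketch:
"NODE_ADDRESS" is a placeholder for the node named in the "Connection
established to ..." log line, and authentication/error handling are omitted.

    using System;
    using System.Collections.Generic;
    using Cassandra;

    class ControlConnectionDiagnostics
    {
        static void Main()
        {
            var cluster = Cluster.Builder().AddContactPoint("NODE_ADDRESS").Build();
            var session = cluster.Connect();

            // Peer rows with null columns (like the ones discussed later in
            // the thread) show up directly in this output.
            foreach (var row in session.Execute("SELECT peer, data_center, rack FROM system.peers"))
            {
                Console.WriteLine("{0}  dc={1}  rack={2}",
                    row.GetValue<System.Net.IPAddress>("peer"),
                    row.GetValue<string>("data_center"),
                    row.GetValue<string>("rack"));
            }

            // Replication settings the driver reads when it builds the token map.
            foreach (var row in session.Execute("SELECT keyspace_name, replication FROM system_schema.keyspaces"))
            {
                Console.WriteLine("{0}: {1}",
                    row.GetValue<string>("keyspace_name"),
                    string.Join(", ", row.GetValue<IDictionary<string, string>>("replication")));
            }

            cluster.Shutdown();
        }
    }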
>
>
>
> Gediminas Blazys wrote on Friday, 8/05/2020 at 08:27:
>
> Hello,
>
>
>
> Thanks for looking into this. As far as the time for token map calculation
> goes, we are considering reducing the number of vnodes for future DCs.
> However, in the meantime we were able to deploy another DC, DC8 (testing
> the hypothesis that the issue may be isolated to DC7 only), and the
> deployment worked. DC8 is part of the cluster now and is currently being
> rebuilt; we did not notice login issues with this expansion. So the
> topology now is this:
>
>
>
> DC1 - 18 nodes - 256 vnodes - working
>
> DC2 - 18 nodes - 256 vnodes - working
>
> DC3 - 18 nodes - 256 vnodes - working
>
> DC4 - 18 nodes - 256 vnodes - working
>
> DC5 - 18 nodes - 256 vnodes - working
>
> DC6 - 60 nodes - 256 vnodes - working
>
> DC7 - 60 nodes - 256 vnodes - once added to replication, clients can't
> connect to any DC
>
> DC8 - 60 nodes - 256 vnodes - rebuilding at the moment, including this DC
> into replication did not cause login issues.
>
>
>
> The major difference between DC7 and the other DCs is that in DC7 we only
> have two racks, while in the other locations we use three; the replication
> factor, however, remains the same – 3 for all user-defined keyspaces. Maybe
> this is something that could cause issues with duplicates? It's
> theoretical, but Cassandra, having to place two replicas on the same rack,
> may have placed both the primary and a backup replica on the same node.
> Hence a duplicate...
>
>
>
> Gediminas
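
For context, "adding the DC to replication" in the topology list above amounts
to an ALTER KEYSPACE statement per keyspace. A hedged illustration follows; the
keyspace name, contact point and the set of DC/factor pairs are placeholders,
not the real schema of this cluster.

    using Cassandra;

    class AddDcToReplication
    {
        static void Main()
        {
            var cluster = Cluster.Builder().AddContactPoint("NODE_ADDRESS").Build();
            var session = cluster.Connect();

            // Placeholder keyspace and DC list; every DC that should hold data
            // needs an explicit replication factor here.
            session.Execute(
                "ALTER KEYSPACE app_ks WITH replication = {" +
                " 'class': 'NetworkTopologyStrategy'," +
                " 'DC1': 3, 'DC6': 3, 'DC7': 3 }");

            cluster.Shutdown();
        }
    }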
>
>
>
> *From:* João Reis 
> *Sent:* Thursday, May 7, 2020 19:22
> *To:* user@cassandra.apache.org
> *Subject:* Re: [EXTERNAL] Re: Adding new DC results in clients failing to
> connect
>
>
>
> Hi,
>
>
>
> I don't believe that the peers entry is responsible for that exception.
> Looking at the driver code, I can't even think of a scenario where that
> exception would be thrown... I will run some tests in the next couple of
> days to try and figure something out.
>
>
>
> One thing that is certain from those log messages is that the tokenmap
> computation is very slow (20 seconds). With 100+ nodes and 256 vnodes per
> node, we should expect the token map computation to be a bit slower but 20
> seconds is definitely too much. I've opened CSHARP-901 to track this. [1]
>
>
>
> João Reis
>
>
>
> [1] https://datastax-oss.atlassian.net/browse/CSHARP-901

Re: [EXTERNAL] Re: Adding new DC results in clients failing to connect

2020-05-07 Thread João Reis
Hi,

I don't believe that the peers entry is responsible for that exception.
Looking at the driver code, I can't even think of a scenario where that
exception would be thrown... I will run some tests in the next couple of
days to try and figure something out.

One thing that is certain from those log messages is that the tokenmap
computation is very slow (20 seconds). With 100+ nodes and 256 vnodes per
node, we should expect the token map computation to be a bit slower but 20
seconds is definitely too much. I've opened CSHARP-901 to track this. [1]
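
For scale, going by the "Finished building TokenMap for 7 keyspaces and 210
hosts" log line quoted further down, the rebuild covers roughly:

    210 hosts x 256 vnodes            = 53,760 tokens in the ring
    53,760 token ranges x 7 keyspaces ~ up to ~376,000 replica sets to compute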

João Reis

[1] https://datastax-oss.atlassian.net/browse/CSHARP-901

Gediminas Blazys wrote on Monday, 4/05/2020 at 11:13:

> Hello again,
>
>
>
> Looking into system.peers, we found that some nodes contain entries about
> themselves with null values. Not sure if this could be an issue; maybe
> someone has seen something similar? This state is present even before
> including the funky DC into replication.
>
> peer            |
> data_center     | null
> host_id         | null
> preferred_ip    | 192.168.104.111
> rack            | null
> release_version | null
> rpc_address     | null
> schema_version  | null
> tokens          | null
>
>
>
> Have a wonderful day 
>
>
>
> Gediminas
>
>
>
> *From:* Gediminas Blazys 
> *Sent:* Monday, May 4, 2020 10:09
> *To:* user@cassandra.apache.org
> *Subject:* RE: [EXTERNAL] Re: Adding new DC results in clients failing to
> connect
>
>
>
> Hello,
>
>
>
> Thanks for the reply.
>
>
>
> Following your advice we took a look at system.local for seed nodes and
> compared that data with nodetool ring. Both sources contain the same tokens
> for these specific hosts. Will continue looking into system.peers.
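
A client-side version of that check is also possible; a rough sketch (the
contact point is a placeholder) that reads the token list a node advertises
about itself, for comparison with nodetool ring:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using Cassandra;

    class LocalTokens
    {
        static void Main()
        {
            var cluster = Cluster.Builder().AddContactPoint("SEED_NODE_ADDRESS").Build();
            var session = cluster.Connect();

            // system.local has a single row describing the node we are connected to.
            var row = session.Execute("SELECT broadcast_address, tokens FROM system.local").First();
            var tokens = row.GetValue<IEnumerable<string>>("tokens");

            Console.WriteLine("{0} advertises {1} tokens",
                row.GetValue<System.Net.IPAddress>("broadcast_address"),
                tokens.Count());

            cluster.Shutdown();
        }
    }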
>
>
>
> We have enabled more verbosity on the C# driver and this is the message
> that we get now:
>
>
>
> ControlConnection: 05/03/2020 14:28:42.346 +03:00 : Updating keyspaces metadata
>
> ControlConnection: 05/03/2020 14:28:42.377 +03:00 : Rebuilding token map
>
> ControlConnection: 05/03/2020 14:29:03.837 +03:00 : Finished building TokenMap for 7 keyspaces and 210 hosts. It took 19403 milliseconds.
>
> ControlConnection: 05/03/2020 14:29:03.901 +03:00 ALARMA: ENDPOINT: <>:9042 EXCEPTION: System.ArgumentException: The source argument contains duplicate keys.
>
>    at System.Collections.Concurrent.ConcurrentDictionary`2.InitializeFromCollection(IEnumerable`1 collection)
>    at System.Collections.Concurrent.ConcurrentDictionary`2..ctor(IEnumerable`1 collection, IEqualityComparer`1 comparer)
>    at System.Collections.Concurrent.ConcurrentDictionary`2..ctor(IEnumerable`1 collection)
>    at Cassandra.TokenMap..ctor(TokenFactory factory, IReadOnlyDictionary`2 tokenToHostsByKeyspace, List`1 ring, IReadOnlyDictionary`2 primaryReplicas, IReadOnlyDictionary`2 keyspaceTokensCache, IReadOnlyDictionary`2 datacenters, Int32 numberOfHostsWithTokens)
>    at Cassandra.TokenMap.Build(String partitioner, ICollection`1 hosts, ICollection`1 keyspaces)
>    at Cassandra.Metadata.d__59.MoveNext()
> --- End of stack trace from previous location where exception was thrown ---
>    at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
>    at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
>    at System.Runtime.CompilerServices.ConfiguredTaskAwaitable.ConfiguredTaskAwaiter.GetResult()
>    at Cassandra.Connections.ControlConnection.d__44.MoveNext()
>
>
>
> The error occurs in Cassandra.TokenMap. We are analyzing the objects that
> the driver initializes during token map creation, but we have yet to find
> the dictionary with duplicated keys.
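
The exception itself is easy to reproduce in isolation. This is not driver
code, just a minimal demonstration that ConcurrentDictionary's collection
constructor throws exactly that ArgumentException when the source sequence
repeats a key; the key/value strings are made up.

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;

    class DuplicateKeyDemo
    {
        static void Main()
        {
            var source = new List<KeyValuePair<string, string>>
            {
                new KeyValuePair<string, string>("host-1", "DC7"),
                new KeyValuePair<string, string>("host-1", "DC7") // same key twice
            };

            try
            {
                new ConcurrentDictionary<string, string>(source);
            }
            catch (ArgumentException ex)
            {
                // Prints: "The source argument contains duplicate keys."
                Console.WriteLine(ex.Message);
            }
        }
    }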
>
> Just to note, once this new DC is added to replication, the Python driver
> is unable to establish a connection either; cqlsh, though, seems to be OK.
> It is hard to say for sure, but for now at least, this issue seems to point
> to Cassandra.
>
>
>
> Gediminas
>
>
>
> *From:* Jorge Bay Gondra 
> *Sent:* Thursday, April 30, 2020 11:45
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] Re: Adding new DC results in clients failing to
> connect
>
>
>
> Hi,
>
> You can enable logging at the driver level to see what's happening under the hood:
> https://docs.datastax.com/en/developer/csharp-driver/3.14/faq/#how-can-i-enable-logging-in-the-driver
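
Along the lines of that FAQ entry, one way to turn up verbosity with the 3.x
C# driver is its built-in System.Diagnostics trace switch; the trace level,
listener, and file name below are only examples.

    using System.Diagnostics;

    class EnableDriverTracing
    {
        static void Main()
        {
            // Make the driver's tracing verbose and route it to a file.
            Cassandra.Diagnostics.CassandraTraceSwitch.Level = TraceLevel.Verbose;
            Trace.Listeners.Add(new TextWriterTraceListener("csharp-driver-trace.log"));
            Trace.AutoFlush = true;

            // ... build the Cluster/Session as usual; messages such as
            // "Rebuilding token map" will now appear in the trace output.
        }
    }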