Hi,

I don't believe that the system.peers entry with null values is responsible
for that exception. Looking at the driver code, I can't even think of a
scenario where that exception would be thrown... I will run some tests in the
next couple of days to try to figure out what is going on.

One thing that is certain from those log messages is that the token map
computation is very slow (about 20 seconds; the log reports 19403 ms). With
100+ nodes and 256 vnodes per node, we should expect the token map
computation to be somewhat slower, but 20 seconds is definitely too much.
I've opened CSHARP-901 to track this. [1]
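For a sense of scale, this quick Python calculation counts the tokens on the
ring for the topology described further down in this thread (five 18-node
DCs plus DC6 and DC7 with 60 nodes each). The 16-vnode figure is only an
illustrative comparison, not a recommendation, and the sketch just counts
tokens; it does not model how the driver builds its per-keyspace replica maps.

```python
# Node counts per DC, from the topology in the original report.
NODES_PER_DC = {
    "DC1": 18, "DC2": 18, "DC3": 18, "DC4": 18, "DC5": 18,
    "DC6": 60, "DC7": 60,
}


def ring_size(nodes_per_dc, vnodes_per_node):
    """Return (total hosts, total tokens on the ring)."""
    hosts = sum(nodes_per_dc.values())
    return hosts, hosts * vnodes_per_node


# 256 is the current num_tokens; 16 is only an illustrative lower value.
for vnodes in (256, 16):
    hosts, tokens = ring_size(NODES_PER_DC, vnodes)
    print(f"num_tokens={vnodes}: {hosts} hosts, {tokens} tokens on the ring")
```

With 256 vnodes this gives 210 hosts and 53760 tokens; with 16 it would be
3360 tokens.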

João Reis

[1] https://datastax-oss.atlassian.net/browse/CSHARP-901

Gediminas Blazys <gediminas.bla...@microsoft.com.invalid> wrote on
Monday, 4/05/2020 at 11:13:

> Hello again,
>
>
>
> Looking into system.peers, we found that some nodes contain entries about
> themselves with null values. We are not sure whether this could be an
> issue; has anyone seen something similar? This state was already present
> before we included the problematic DC in replication.
>
>  peer         | data_center | host_id | preferred_ip    | rack | release_version | rpc_address | schema_version | tokens
> --------------+-------------+---------+-----------------+------+-----------------+-------------+----------------+-------
>  <IP address> |        null |    null | 192.168.104.111 | null |            null |        null |           null |   null
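A quick way to scan for entries like the one above is sketched below in
Python. It flags rows whose peer equals the address of the node being
queried, or whose key columns are null. The column list in REQUIRED_COLUMNS
and the addresses in the sample rows are assumptions for illustration (the
real IP was redacted), and suspicious_peers is a hypothetical helper, not
part of any driver.

```python
# Columns a healthy system.peers row is expected to populate
# (an assumption for this sketch; adjust as needed).
REQUIRED_COLUMNS = ("data_center", "host_id", "rack", "rpc_address", "tokens")


def suspicious_peers(local_address, peer_rows):
    """Return (peer, reasons) for rows that look broken.

    local_address: address of the node whose system.peers was queried.
    peer_rows: iterable of dicts keyed by system.peers column name.
    """
    findings = []
    for row in peer_rows:
        reasons = []
        # A node should not list itself in its own system.peers table.
        if row.get("peer") == local_address:
            reasons.append("entry describes the local node itself")
        missing = [col for col in REQUIRED_COLUMNS if row.get(col) is None]
        if missing:
            reasons.append("null columns: " + ", ".join(missing))
        if reasons:
            findings.append((row.get("peer"), reasons))
    return findings


# Hypothetical sample rows; the first mimics the entry reported above.
rows = [
    {"peer": "10.0.0.1", "data_center": None, "host_id": None,
     "preferred_ip": "192.168.104.111", "rack": None, "rpc_address": None,
     "tokens": None},
    {"peer": "10.0.0.2", "data_center": "DC1", "host_id": "hypothetical-id",
     "rack": "r1", "rpc_address": "10.0.0.2", "tokens": {"-9100"}},
]
for peer, reasons in suspicious_peers("10.0.0.1", rows):
    print(peer, "->", "; ".join(reasons))
```

With the DataStax Python driver, rows from
session.execute("SELECT * FROM system.peers") can be returned as dicts by
setting session.row_factory = dict_factory (from cassandra.query). Since
system.peers is local to the node that answers the query, the check has to
be run against each node of interest.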
>
>
>
> Have a wonderful day 😊
>
>
>
> Gediminas
>
>
>
> *From:* Gediminas Blazys <gediminas.bla...@microsoft.com.INVALID>
> *Sent:* Monday, May 4, 2020 10:09
> *To:* user@cassandra.apache.org
> *Subject:* RE: [EXTERNAL] Re: Adding new DC results in clients failing to
> connect
>
>
>
> Hello,
>
>
>
> Thanks for the reply.
>
>
>
> Following your advice, we looked at system.local on the seed nodes and
> compared that data with the output of nodetool ring. Both sources contain
> the same tokens for these hosts. We will continue looking into system.peers.
>
>
>
> We enabled more verbose logging in the C# driver, and this is the output
> we get now:
>
>
>
> ControlConnection: 05/03/2020 14:28:42.346 +03:00 : Updating keyspaces metadata
> ControlConnection: 05/03/2020 14:28:42.377 +03:00 : Rebuilding token map
> ControlConnection: 05/03/2020 14:29:03.837 +03:00 : Finished building TokenMap for 7 keyspaces and 210 hosts. It took 19403 milliseconds.
> ControlConnection: 05/03/2020 14:29:03.901 +03:00 ALARMA: ENDPOINT: <<IPADDRESS>>:9042 EXCEPTION: System.ArgumentException: The source argument contains duplicate keys.
>    at System.Collections.Concurrent.ConcurrentDictionary`2.InitializeFromCollection(IEnumerable`1 collection)
>    at System.Collections.Concurrent.ConcurrentDictionary`2..ctor(IEnumerable`1 collection, IEqualityComparer`1 comparer)
>    at System.Collections.Concurrent.ConcurrentDictionary`2..ctor(IEnumerable`1 collection)
>    at Cassandra.TokenMap..ctor(TokenFactory factory, IReadOnlyDictionary`2 tokenToHostsByKeyspace, List`1 ring, IReadOnlyDictionary`2 primaryReplicas, IReadOnlyDictionary`2 keyspaceTokensCache, IReadOnlyDictionary`2 datacenters, Int32 numberOfHostsWithTokens)
>    at Cassandra.TokenMap.Build(String partitioner, ICollection`1 hosts, ICollection`1 keyspaces)
>    at Cassandra.Metadata.<RebuildTokenMapAsync>d__59.MoveNext()
> --- End of stack trace from previous location where exception was thrown ---
>    at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
>    at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
>    at System.Runtime.CompilerServices.ConfiguredTaskAwaitable.ConfiguredTaskAwaiter.GetResult()
>    at Cassandra.Connections.ControlConnection.<Connect>d__44.MoveNext()
>
>
>
> The error occurs in the Cassandra.TokenMap constructor. We are analyzing
> the objects the driver initializes while building the token map, but we
> have yet to find the dictionary with duplicate keys.
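For reference, the .NET ConcurrentDictionary constructor that takes a
collection of key/value pairs throws exactly this ArgumentException when the
collection contains the same key more than once. The Python sketch below
mimics that check but also reports which keys collide, which can help when
inspecting candidate inputs such as token-to-host pairs. strict_dict and
DuplicateKeyError are hypothetical names, and the trace alone does not tell
us which collection TokenMap was building.

```python
from collections import Counter


class DuplicateKeyError(ValueError):
    """Raised when the same key appears more than once in the source pairs."""


def strict_dict(pairs):
    """Build a dict from (key, value) pairs, refusing duplicate keys.

    Mirrors the duplicate-key check of .NET's ConcurrentDictionary
    collection constructor, but names the offending keys.
    """
    pairs = list(pairs)
    counts = Counter(key for key, _ in pairs)
    duplicates = sorted((key for key, n in counts.items() if n > 1), key=str)
    if duplicates:
        raise DuplicateKeyError(
            "The source argument contains duplicate keys: "
            + ", ".join(map(str, duplicates)))
    return dict(pairs)


# Hypothetical token -> host pairs where two hosts claim the same token.
token_owners = [("-9100", "host-a"), ("42", "host-b"), ("-9100", "host-c")]
try:
    strict_dict(token_owners)
except DuplicateKeyError as err:
    print(err)  # names -9100 as the duplicated key
```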
>
> Note that once this new DC is added to replication, the Python driver is
> unable to establish a connection either. cqlsh, though, seems to be fine.
> It is hard to say for sure, but for now at least, the issue seems to point
> to Cassandra rather than to the C# driver.
>
>
>
> Gediminas
>
>
>
> *From:* Jorge Bay Gondra <jorgebaygon...@gmail.com>
> *Sent:* Thursday, April 30, 2020 11:45
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] Re: Adding new DC results in clients failing to
> connect
>
>
>
> Hi,
>
> You can enable logging in the driver to see what's happening under the hood:
> https://docs.datastax.com/en/developer/csharp-driver/3.14/faq/#how-can-i-enable-logging-in-the-driver
>
> With logging information, it should be easy to track the issue down.
>
>
>
> Can you query system.local and system.peers on a seed node / contact point
> to check whether the node list and token information are as expected? You
> can compare them with the output of nodetool ring.
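One way to do that comparison is sketched below in Python. It assumes you
have already collected, per host, the tokens from the tokens column of
system.local and system.peers, and the tokens listed by nodetool ring;
parsing either source is left out. compare_token_maps is a hypothetical
helper, not part of any driver, and the sample addresses are made up.

```python
def compare_token_maps(from_system_tables, from_nodetool):
    """Compare host -> tokens maps gathered from two sources.

    Both arguments map a host address (str) to a set of token strings.
    Returns a list of human-readable discrepancies (empty if they agree).
    """
    problems = []
    for host in sorted(from_system_tables.keys() | from_nodetool.keys()):
        a = from_system_tables.get(host)
        b = from_nodetool.get(host)
        if a is None:
            problems.append(f"{host}: missing from system.local/system.peers")
        elif b is None:
            problems.append(f"{host}: missing from nodetool ring")
        elif a != b:
            problems.append(f"{host}: {len(a - b)} tokens only in system tables, "
                            f"{len(b - a)} only in nodetool ring")
    # A token owned by two hosts in the system tables is also worth flagging.
    owners = {}
    for host, tokens in from_system_tables.items():
        for token in tokens:
            owners.setdefault(token, []).append(host)
    for token, hosts in sorted(owners.items()):
        if len(hosts) > 1:
            problems.append(f"token {token} claimed by {', '.join(sorted(hosts))}")
    return problems


system_tables = {"10.0.0.1": {"1", "2"}, "10.0.0.2": {"2"}}
nodetool = {"10.0.0.1": {"1", "2"}}
for line in compare_token_maps(system_tables, nodetool):
    print(line)
```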
>
>
>
> Not directly related: 256 vnodes is probably more than you want.
>
>
>
> Thanks,
>
> Jorge
>
>
>
> On Thu, Apr 30, 2020 at 9:48 AM Gediminas Blazys <
> gediminas.bla...@microsoft.com.invalid> wrote:
>
> Hello,
>
>
>
> We have run into a very interesting issue; maybe some of you have
> encountered it or have an idea where to look.
>
>
>
> We are working on adding new DCs to our cluster. Here is the current
> topology:
>
> DC1 - 18 nodes
>
> DC2 - 18 nodes
>
> DC3 - 18 nodes
>
> DC4 - 18 nodes
>
> DC5 - 18 nodes
>
>
>
> Recently we introduced a new DC6 (60 nodes) into our cluster. Joining and
> rebuilding DC6 went smoothly, and clients are using it without issues.
> This is how it looked after joining DC6:
>
> DC1 - 18 nodes
>
> DC2 - 18 nodes
>
> DC3 - 18 nodes
>
> DC4 - 18 nodes
>
> DC5 - 18 nodes
>
> DC6 - 60 nodes
>
>
>
> Next, we added another DC, DC7 (also 60 nodes), bringing the cluster to a
> total of 210 nodes. Joining the new nodes went smoothly, but once we
> changed the replication of the user-defined keyspaces to include DC7, no
> client was able to connect to Cassandra, regardless of which DC it
> addressed. The clients throw the exception included at the end of this
> email.
>
>
>
> Cassandra version 3.11.4.
>
> C# driver version 3.12.0 (also tested with 3.14.0). We use the DC-aware
> round-robin load-balancing policy, and clients update ring metadata when
> connecting.
>
> Number of vnodes per node: 256
>
>
>
> The stack trace starts with the exception 'The source argument contains
> duplicate keys.' Do you know what kind of data this dictionary holds? What
> data could be duplicated here?
>
>
>
> Clients are unable to connect until we remove DC7 from replication. Once
> replication is adjusted to exclude DC7, clients can connect normally.
>
>
>
> Cassandra.NoHostAvailableException: All hosts tried for query failed (tried <<IPaddress>>:9042: ArgumentException 'The source argument contains duplicate keys.') 2020/04/29 10:19:27.51410636
> at Cassandra.Connections.ControlConnection.<Connect>d__39.MoveNext()
> --- End of stack trace from previous location where exception was thrown ---
> System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
> System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
> Cassandra.Connections.ControlConnection.<InitAsync>d__36.MoveNext()
> --- End of stack trace from previous location where exception was thrown ---
> System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
> System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
> Cassandra.Tasks.TaskHelper.<WaitToCompleteAsync>d__10.MoveNext()
> --- End of stack trace from previous location where exception was thrown ---
> System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
> System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
> Cassandra.Cluster.<Cassandra-SessionManagement-IInternalCluster-OnInitializeAsync>d__50.MoveNext()
> --- End of stack trace from previous location where exception was thrown ---
> System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
> System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
> Cassandra.ClusterLifecycleManager.<InitializeAsync>d__3.MoveNext()
> --- End of stack trace from previous location where exception was thrown ---
> System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
> System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
> Cassandra.Cluster.<Cassandra-SessionManagement-IInternalCluster-ConnectAsync>d__47`1.MoveNext()
> --- End of stack trace from previous location where exception was thrown ---
> System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
> System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
> Cassandra.Cluster.<ConnectAsync>d__46.MoveNext()
> --- End of stack trace from previous location where exception was thrown ---
> System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
> Cassandra.Tasks.TaskHelper.WaitToComplete(Task task, Int32 timeout)
> Cassandra.Cluster.Connect()
>
>
>
> We would really appreciate your input, big thanks in advance.
>
>
>
> Gediminas
>
>
>
>
