Re: Cassandra 2.1: replace running node without streaming

2018-02-03 Thread Jürgen Albersdorfer
Good point about the rack, Kyrill! This makes total sense to me.
Deleting the system keyspace does not, though, since it contains all the
essential information about the node.
Maybe this makes sense in conjunction with the replace_address(_first_boot) option.
Some comments from the devs about this would be great.
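
For reference, the documented replacement path passes that as a JVM option, e.g.
in cassandra-env.sh. A minimal sketch; the IP is just the old node's address from
the logs further down in this thread, so treat it as a placeholder, and older
versions only know -Dcassandra.replace_address:

   JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.10.10.223"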

Regards,
Jürgen 

> On 03.02.2018 at 16:42, Kyrylo Lebediev wrote:
> 
> I've found a modified version of Carlos' article (more recent than the one I 
> was referring to), and it describes the same method you outlined, Oleksandr:
> https://mrcalonso.com/2016/01/26/cassandra-instantaneous-in-place-node-replacement
> 
> Thank you for your readiness to help!
> 
> Kind Regards, 
> Kyrill
> From: Kyrylo Lebediev 
> Sent: Saturday, February 3, 2018 12:23:15 PM
> To: User
> Subject: Re: Cassandra 2.1: replace running node without streaming
>  
> Thank you Oleksandr,
> Just tested on 3.11.1 and it worked for me (you may see the logs below).
> Just realized that there is one important prerequisite for this method to 
> work: the new node MUST be located in the same rack (in C* terms) as the old 
> one. Otherwise the correct replica placement will be violated (i.e. where 
> replicas of the same token range should be placed in different racks). 
> 
> Anyway, even after a successful node replacement run in a sandbox, I'm still 
> in doubt. 
> Just wondering why this procedure, which seems much easier than 
> [add/remove node] or [replace a node] (the documented ways of live node 
> replacement), has never been included in the documentation. 
> Does anybody on the ML know the reason for this?
> 
> Also, for some reason, in his article Carlos drops the files of the system 
> keyspace (which contains the system.local table):
> In the new node, delete all system tables except for the schema ones. This 
> will ensure that the new Cassandra node will not have any corrupt or previous 
> configuration assigned.
> sudo cd /var/lib/cassandra/data/system && sudo ls | grep -v schema | xargs -I 
> {} sudo rm -rf {}
> 
> http://engineering.mydrivesolutions.com/posts/cassandra_nodes_replacement/
> [Carlos, if you are here, could you please comment?]
> 
> So still a mystery to me. 
> 
> -
> Logs for 3.11.1
> -
> 
> == Before:
> --  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
> UN  10.10.10.222  256.61 KiB  3       100.0%            bd504008-5ff0-4b6c-a3a6-a07049e61c31  rack1
> UN  10.10.10.223  225.65 KiB  3       100.0%            c562263f-4126-4935-b9f7-f4e7d0dc70b4  rack1 <<
> UN  10.10.10.221  187.39 KiB  3       100.0%            d312c083-8808-4c98-a3ab-72a7cd18b31f  rack1
> 
> == After:
> --  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
> UN  10.10.10.222  245.84 KiB  3       100.0%            bd504008-5ff0-4b6c-a3a6-a07049e61c31  rack1
> UN  10.10.10.221  192.8 KiB   3       100.0%            d312c083-8808-4c98-a3ab-72a7cd18b31f  rack1
> UN  10.10.10.224  266.61 KiB  3       100.0%            c562263f-4126-4935-b9f7-f4e7d0dc70b4  rack1  <
> 
> 
> 
> == Logs from another node (10.10.10.221):
> INFO  [HANDSHAKE-/10.10.10.224] 2018-02-03 11:33:01,397 
> OutboundTcpConnection.java:560 - Handshaking version with /10.10.10.224
> INFO  [GossipStage:1] 2018-02-03 11:33:01,431 Gossiper.java:1067 - Node 
> /10.10.10.224 is now part of the cluster
> INFO  [RequestResponseStage-1] 2018-02-03 11:33:02,190 Gossiper.java:1031 - 
> InetAddress /10.10.10.224 is now UP
> INFO  [RequestResponseStage-1] 2018-02-03 11:33:02,190 Gossiper.java:1031 - 
> InetAddress /10.10.10.224 is now UP
> WARN  [GossipStage:1] 2018-02-03 11:33:08,375 StorageService.java:2313 - Host 
> ID collision for c562263f-4126-4935-b9f7-f4e7d0dc70b4 between /10.10.10.223 
> and /10.10.10.224; /10.10.10.224 is the new owner
> INFO  [GossipTasks:1] 2018-02-03 11:33:08,806 Gossiper.java:810 - FatClient 
> /10.10.10.223 has been silent for 3ms, removing from gossip
> 
> == Logs from new node:
> INFO  [main] 2018-02-03 11:33:01,926 StorageService.java:1442 - JOINING: 
> Finish joining ring
> INFO  [GossipStage:1] 2018-02-03 11:33:02,659 Gossiper.java:1067 - Node 
> /10.10.10.223 is now part of the cluster
> WARN  [GossipStage:1] 2018-02-03 11:33:02,676 StorageService.java:2307 - Not 
> updating host ID c562263f-4126-4935-b9f7-f4e7d0dc70b4 for /10.10.10.223 
> because it's mine
> INFO  [GossipStage:1] 2018-02-03 11:33:02,683 StorageService.java:2365 - 
> Nodes /10.10.10.223 and /10.10.10.224 have the same token 
> -7774421781914237508.  Ignoring /10.10.10.223
> INFO  [GossipStage:1] 2018-02-03 11:33:02,686 StorageService.java:2365 - 
> Nodes /10.10.10.223 and /10.10.10.224 have the same token 
> 2257660731441815305.  Ignoring /10.10.10.223
> INFO  [GossipStage:1] 2018-02-03 11:33:02,692 StorageService.java:2365 - 
> Nodes /10.10.10.223 and 

Re: index_interval

2018-02-03 Thread Jonathan Haddad
I would also optimize for your worst case, which is hitting zero caches.
If you're using the default settings when creating a table, you're going to
get compression settings that are terrible for reads.  If you've got memory
to spare, I suggest changing your chunk_length_in_kb to 4 and disabling
readahead on your drives entirely.  I've seen 50-100x improvement in read
latency and throughput just by changing those settings.  I just did a talk
on this topic last week, slides are here:
https://www.slideshare.net/JonHaddad/performance-tuning-86995333
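
If you want to try both of those, a minimal sketch (the keyspace/table name and
the block device are placeholders, not anything from the talk, and the CQL below
uses the 3.x compression syntax):

   # Smaller compression chunks (4 KB) help read-heavy workloads
   cqlsh -e "ALTER TABLE my_ks.my_table WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 4};"

   # Disable OS readahead on the data drive (pick your actual device)
   sudo blockdev --setra 0 /dev/sda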

Jon

On Wed, Jul 12, 2017 at 2:03 PM Jeff Jirsa  wrote:

>
>
> On 2017-07-12 12:03 (-0700), Fay Hou [Storage Service] <
> fay...@coupang.com> wrote:
> > First, a big thank you to Jeff, who has spent endless time helping this
> > mailing list.
> > Agreed that we should tune the key cache. In my case, my key cache hit
> > rate is about 20%, mainly because we do random reads. We're just going to
> > leave the index_interval as is for now.
> >
>
> That's pretty painful. If you can up that a bit, it'll probably help you
> out. You can adjust the index intervals, too, but I'd significantly
> increase key cache size first if it were my cluster.
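
A rough sketch of both knobs, with placeholder values rather than recommendations:

   # Key cache capacity in MB (key / row / counter); persist via key_cache_size_in_mb in cassandra.yaml
   nodetool setcachecapacity 512 0 50

   # Index intervals are per-table options, e.g.:
   cqlsh -e "ALTER TABLE my_ks.my_table WITH min_index_interval = 64 AND max_index_interval = 2048;"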
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Cassandra 2.1: replace running node without streaming

2018-02-03 Thread Kyrylo Lebediev
I've found a modified version of Carlos' article (more recent than the one I was 
referring to), and it describes the same method you outlined, Oleksandr:

https://mrcalonso.com/2016/01/26/cassandra-instantaneous-in-place-node-replacement


Thank you for your readiness to help!

Kind Regards,

Kyrill


From: Kyrylo Lebediev 
Sent: Saturday, February 3, 2018 12:23:15 PM
To: User
Subject: Re: Cassandra 2.1: replace running node without streaming


Thank you Oleksandr,

Just tested on 3.11.1 and it worked for me (you may see the logs below).

Just realized that there is one important prerequisite for this method to work: 
the new node MUST be located in the same rack (in C* terms) as the old one. 
Otherwise the correct replica placement will be violated (i.e. where replicas of 
the same token range should be placed in different racks).

Anyway, even after a successful node replacement run in a sandbox, I'm still in 
doubt.

Just wondering why this procedure, which seems much easier than [add/remove node] 
or [replace a node] (the documented ways of live node replacement), has never 
been included in the documentation.

Does anybody on the ML know the reason for this?


Also, for some reason, in his article Carlos drops the files of the system 
keyspace (which contains the system.local table):

In the new node, delete all system tables except for the schema ones. This will 
ensure that the new Cassandra node will not have any corrupt or previous 
configuration assigned.

  1.  sudo cd /var/lib/cassandra/data/system && sudo ls | grep -v schema | 
xargs -I {} sudo rm -rf {}
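
A side note on that snippet: "sudo cd" does not change the calling shell's
directory (cd is a shell builtin), so the pipeline either fails outright or
operates on the wrong directory. A rough equivalent that actually runs, assuming
the default data directory location, would be:

   sudo find /var/lib/cassandra/data/system -mindepth 1 -maxdepth 1 ! -name '*schema*' -exec rm -rf {} +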


http://engineering.mydrivesolutions.com/posts/cassandra_nodes_replacement/
[Carlos, if you are here, could you please comment?]

So still a mystery to me.

-
Logs for 3.11.1

-

== Before:

--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
UN  10.10.10.222  256.61 KiB  3       100.0%            bd504008-5ff0-4b6c-a3a6-a07049e61c31  rack1
UN  10.10.10.223  225.65 KiB  3       100.0%            c562263f-4126-4935-b9f7-f4e7d0dc70b4  rack1 <<
UN  10.10.10.221  187.39 KiB  3       100.0%            d312c083-8808-4c98-a3ab-72a7cd18b31f  rack1

== After:
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
UN  10.10.10.222  245.84 KiB  3       100.0%            bd504008-5ff0-4b6c-a3a6-a07049e61c31  rack1
UN  10.10.10.221  192.8 KiB   3       100.0%            d312c083-8808-4c98-a3ab-72a7cd18b31f  rack1
UN  10.10.10.224  266.61 KiB  3       100.0%            c562263f-4126-4935-b9f7-f4e7d0dc70b4  rack1  <



== Logs from another node (10.10.10.221):
INFO  [HANDSHAKE-/10.10.10.224] 2018-02-03 11:33:01,397 
OutboundTcpConnection.java:560 - Handshaking version with /10.10.10.224
INFO  [GossipStage:1] 2018-02-03 11:33:01,431 Gossiper.java:1067 - Node 
/10.10.10.224 is now part of the cluster
INFO  [RequestResponseStage-1] 2018-02-03 11:33:02,190 Gossiper.java:1031 - 
InetAddress /10.10.10.224 is now UP
INFO  [RequestResponseStage-1] 2018-02-03 11:33:02,190 Gossiper.java:1031 - 
InetAddress /10.10.10.224 is now UP
WARN  [GossipStage:1] 2018-02-03 11:33:08,375 StorageService.java:2313 - Host 
ID collision for c562263f-4126-4935-b9f7-f4e7d0dc70b4 between /10.10.10.223 and 
/10.10.10.224; /10.10.10.224 is the new owner
INFO  [GossipTasks:1] 2018-02-03 11:33:08,806 Gossiper.java:810 - FatClient 
/10.10.10.223 has been silent for 3ms, removing from gossip

== Logs from new node:
INFO  [main] 2018-02-03 11:33:01,926 StorageService.java:1442 - JOINING: Finish 
joining ring
INFO  [GossipStage:1] 2018-02-03 11:33:02,659 Gossiper.java:1067 - Node 
/10.10.10.223 is now part of the cluster
WARN  [GossipStage:1] 2018-02-03 11:33:02,676 StorageService.java:2307 - Not 
updating host ID c562263f-4126-4935-b9f7-f4e7d0dc70b4 for /10.10.10.223 because 
it's mine
INFO  [GossipStage:1] 2018-02-03 11:33:02,683 StorageService.java:2365 - Nodes 
/10.10.10.223 and /10.10.10.224 have the same token -7774421781914237508.  
Ignoring /10.10.10.223
INFO  [GossipStage:1] 2018-02-03 11:33:02,686 StorageService.java:2365 - Nodes 
/10.10.10.223 and /10.10.10.224 have the same token 2257660731441815305.  
Ignoring /10.10.10.223
INFO  [GossipStage:1] 2018-02-03 11:33:02,692 StorageService.java:2365 - Nodes 
/10.10.10.223 and /10.10.10.224 have the same token 51879124242594885.  
Ignoring /10.10.10.223
WARN  [GossipTasks:1] 2018-02-03 11:33:03,985 Gossiper.java:789 - Gossip stage 
has 5 pending tasks; skipping status check (no nodes will be marked down)
INFO  [main] 2018-02-03 11:33:04,394 SecondaryIndexManager.java:509 - Executing 
pre-join tasks for: CFS(Keyspace='test', ColumnFamily='usr')
WARN  [GossipTasks:1] 2018-02-03 11:33:05,088 Gossiper.java:789 - Gossip stage 
has 7 pending tasks; skipping status check (no nodes will be marked down)
INFO  [GossipStage:1] 

Re: Node won't start

2018-02-03 Thread brian . spindler
Thanks Alex.  That’s exactly what I ended up doing - it did take maybe 45m to 
come back up though :(

-B



Sent from my iPhone
> On Feb 3, 2018, at 9:03 AM, Alexander Dejanovski  
> wrote:
> 
> Hi Brian,
> 
> I just tested this on a CCM cluster and the node started without problem. It 
> flushed some new SSTables a short while after.
> 
> I honestly do not know the specifics of how size_estimates is used, but if it 
> prevented a node from restarting I'd definitely remove the sstables to get it 
> back up.
> 
> Cheers,
> 
>> On Sat, Feb 3, 2018 at 1:53 PM Brian Spindler  
>> wrote:
>> Hi guys, I've got a 2.1.15 node that will not start it seems.  Hangs on 
>> Opening system.size_estimates.  Sometimes it can take a while but I've let 
>> it run for 90m and nothing.  Should I move this sstable out of the way to 
>> let it start?  will it rebuild/refresh size estimates if I remove that 
>> folder?  
>> 
>> thanks
>> -B
> -- 
> -
> Alexander Dejanovski
> France
> @alexanderdeja
> 
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com


Re: Node won't start

2018-02-03 Thread Alexander Dejanovski
Hi Brian,

I just tested this on a CCM cluster and the node started without problem.
It flushed some new SSTables a short while after.

I honestly do not know the specifics of how size_estimates is used, but if
it prevented a node from restarting I'd definitely remove the sstables to
get it back up.
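
A minimal sketch of that workaround, assuming the default data directory layout
and a service-based install (the backup location is arbitrary; size_estimates is
repopulated periodically once the node is back up):

   sudo service cassandra stop
   sudo mkdir -p /var/tmp/size_estimates_backup
   sudo mv /var/lib/cassandra/data/system/size_estimates-*/* /var/tmp/size_estimates_backup/
   sudo service cassandra start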

Cheers,

On Sat, Feb 3, 2018 at 1:53 PM Brian Spindler 
wrote:

> Hi guys, I've got a 2.1.15 node that will not start it seems.  Hangs on
> Opening system.size_estimates.  Sometimes it can take a while but I've let
> it run for 90m and nothing.  Should I move this sstable out of the way to
> let it start?  will it rebuild/refresh size estimates if I remove that
> folder?
>
> thanks
> -B
>
-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Node won't start

2018-02-03 Thread Brian Spindler
Hi guys, I've got a 2.1.15 node that will not start it seems.  Hangs on
Opening system.size_estimates.  Sometimes it can take a while but I've let
it run for 90m and nothing.  Should I move this sstable out of the way to
let it start?  will it rebuild/refresh size estimates if I remove that
folder?

thanks
-B


Re: Cassandra 2.1: replace running node without streaming

2018-02-03 Thread Kyrylo Lebediev
Thank you Oleksandr,

Just tested on 3.11.1 and it worked for me (you may see the logs below).

Just realized that there is one important prerequisite for this method to work: 
the new node MUST be located in the same rack (in C* terms) as the old one. 
Otherwise the correct replica placement will be violated (i.e. where replicas of 
the same token range should be placed in different racks).

Anyway, even after a successful node replacement run in a sandbox, I'm still in 
doubt.

Just wondering why this procedure, which seems much easier than [add/remove node] 
or [replace a node] (the documented ways of live node replacement), has never 
been included in the documentation.

Does anybody on the ML know the reason for this?


Also, for some reason, in his article Carlos drops the files of the system 
keyspace (which contains the system.local table):

In the new node, delete all system tables except for the schema ones. This will 
ensure that the new Cassandra node will not have any corrupt or previous 
configuration assigned.

  1.  sudo cd /var/lib/cassandra/data/system && sudo ls | grep -v schema | 
xargs -I {} sudo rm -rf {}


http://engineering.mydrivesolutions.com/posts/cassandra_nodes_replacement/
[Carlos, if you are here, could you please comment?]

So still a mystery to me.

-
Logs for 3.11.1

-

== Before:

--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
UN  10.10.10.222  256.61 KiB  3       100.0%            bd504008-5ff0-4b6c-a3a6-a07049e61c31  rack1
UN  10.10.10.223  225.65 KiB  3       100.0%            c562263f-4126-4935-b9f7-f4e7d0dc70b4  rack1 <<
UN  10.10.10.221  187.39 KiB  3       100.0%            d312c083-8808-4c98-a3ab-72a7cd18b31f  rack1

== After:
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
UN  10.10.10.222  245.84 KiB  3       100.0%            bd504008-5ff0-4b6c-a3a6-a07049e61c31  rack1
UN  10.10.10.221  192.8 KiB   3       100.0%            d312c083-8808-4c98-a3ab-72a7cd18b31f  rack1
UN  10.10.10.224  266.61 KiB  3       100.0%            c562263f-4126-4935-b9f7-f4e7d0dc70b4  rack1  <



== Logs from another node (10.10.10.221):
INFO  [HANDSHAKE-/10.10.10.224] 2018-02-03 11:33:01,397 
OutboundTcpConnection.java:560 - Handshaking version with /10.10.10.224
INFO  [GossipStage:1] 2018-02-03 11:33:01,431 Gossiper.java:1067 - Node 
/10.10.10.224 is now part of the cluster
INFO  [RequestResponseStage-1] 2018-02-03 11:33:02,190 Gossiper.java:1031 - 
InetAddress /10.10.10.224 is now UP
INFO  [RequestResponseStage-1] 2018-02-03 11:33:02,190 Gossiper.java:1031 - 
InetAddress /10.10.10.224 is now UP
WARN  [GossipStage:1] 2018-02-03 11:33:08,375 StorageService.java:2313 - Host 
ID collision for c562263f-4126-4935-b9f7-f4e7d0dc70b4 between /10.10.10.223 and 
/10.10.10.224; /10.10.10.224 is the new owner
INFO  [GossipTasks:1] 2018-02-03 11:33:08,806 Gossiper.java:810 - FatClient 
/10.10.10.223 has been silent for 3ms, removing from gossip

== Logs from new node:
INFO  [main] 2018-02-03 11:33:01,926 StorageService.java:1442 - JOINING: Finish 
joining ring
INFO  [GossipStage:1] 2018-02-03 11:33:02,659 Gossiper.java:1067 - Node 
/10.10.10.223 is now part of the cluster
WARN  [GossipStage:1] 2018-02-03 11:33:02,676 StorageService.java:2307 - Not 
updating host ID c562263f-4126-4935-b9f7-f4e7d0dc70b4 for /10.10.10.223 because 
it's mine
INFO  [GossipStage:1] 2018-02-03 11:33:02,683 StorageService.java:2365 - Nodes 
/10.10.10.223 and /10.10.10.224 have the same token -7774421781914237508.  
Ignoring /10.10.10.223
INFO  [GossipStage:1] 2018-02-03 11:33:02,686 StorageService.java:2365 - Nodes 
/10.10.10.223 and /10.10.10.224 have the same token 2257660731441815305.  
Ignoring /10.10.10.223
INFO  [GossipStage:1] 2018-02-03 11:33:02,692 StorageService.java:2365 - Nodes 
/10.10.10.223 and /10.10.10.224 have the same token 51879124242594885.  
Ignoring /10.10.10.223
WARN  [GossipTasks:1] 2018-02-03 11:33:03,985 Gossiper.java:789 - Gossip stage 
has 5 pending tasks; skipping status check (no nodes will be marked down)
INFO  [main] 2018-02-03 11:33:04,394 SecondaryIndexManager.java:509 - Executing 
pre-join tasks for: CFS(Keyspace='test', ColumnFamily='usr')
WARN  [GossipTasks:1] 2018-02-03 11:33:05,088 Gossiper.java:789 - Gossip stage 
has 7 pending tasks; skipping status check (no nodes will be marked down)
INFO  [GossipStage:1] 2018-02-03 11:33:05,718 Gossiper.java:1046 - InetAddress 
/10.10.10.223 is now DOWN
INFO  [main] 2018-02-03 11:33:06,872 StorageService.java:2268 - Node 
/10.10.10.224 state jump to NORMAL
INFO  [main] 2018-02-03 11:33:06,998 Gossiper.java:1655 - Waiting for gossip to 
settle...
INFO  [main] 2018-02-03 11:33:15,004 Gossiper.java:1686 - No gossip backlog; 
proceeding
INFO  [GossipTasks:1] 2018-02-03 11:33:20,114 Gossiper.java:1046 - InetAddress 
/10.10.10.222 is now DOWN< have no idea 

Re: Cassandra 2.1: replace running node without streaming

2018-02-03 Thread Oleksandr Shulgin
On 3 Feb 2018 08:49, "Jürgen Albersdorfer"  wrote:

Cool, good to know. Do you know whether this is still true for 3.11.1?


Well, I've never tried with that specific version, but this is pretty
fundamental, so I would expect it to work the same way. Test in isolation
if you want to be sure, though.

I don't think this is documented anywhere, however, since I had the same
doubts before seeing that it worked for the first time.

--
Alex

On 03.02.2018 at 08:19, Oleksandr Shulgin <oleksandr.shul...@zalando.de> wrote:

On 3 Feb 2018 02:42, "Kyrylo Lebediev"  wrote:

Thanks, Oleksandr,
In my case I'll need to replace all nodes in the cluster (one-by-one), so
streaming will introduce perceptible overhead.
My question is not about data movement/copy itself, but more about all this
token magic.

Okay, let's say we stopped the old node and moved its data to the new node.
Once it's started with auto_bootstrap=false, it will be added to the cluster
like a usual node, just skipping the streaming stage, right?

For a cluster with vnodes enabled, the token ranges of a newly added node are
calculated automatically by C* on startup.

So, how will C* know that this new node must be responsible for exactly the
same token ranges as the old node?
How would the rest of the nodes in the cluster ('peers') figure out that the
old node should be replaced in the ring by the new one?

Do you know of any limitations of this process in the case of C* 2.1.x
with vnodes enabled?


A node stores its tokens and host id in the system.local table. Next time
it starts up, it will use the same tokens as previously and the host id
allows the rest of the cluster to see that it is the same node and ignore
the IP address change. This happens regardless of auto_bootstrap setting.

Try "select * from system.local" to see what is recorded for the old node.
When the new node starts up it should log "Using saved tokens" with the
list of numbers. Other nodes should log something like "ignoring IP address
change" for the affected node addresses.

Be careful though, to make sure that you put the data directory exactly
where the new node expects to find it: otherwise it might just join as a
brand new one, allocating new tokens. As a precaution it helps to ensure
that the system user running the Cassandra process has no permission to
create the data directory: this should stop the startup in case of
misconfiguration.
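
To make that concrete, here is a rough sketch of the whole in-place replacement
under the assumptions discussed in this thread (same rack, same cassandra.yaml
apart from addresses, auto_bootstrap: false on the new node; the host name,
paths and service commands are placeholders for your environment):

   # On the old node: flush memtables and stop Cassandra
   nodetool drain && sudo service cassandra stop

   # Copy the entire data directory, including the system keyspace
   # (it holds the tokens and host id the new node will reuse)
   rsync -a /var/lib/cassandra/ newnode:/var/lib/cassandra/

   # On the new node: start Cassandra, then verify it kept the old identity
   sudo service cassandra start
   grep "Using saved tokens" /var/log/cassandra/system.log
   nodetool status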

Cheers,
--
Alex