Re: [EXTERNAL] How to reduce vnodes without downtime

2020-01-30 Thread Anthony Grasso
Hi Maxim,

Basically what Sean suggested is the way to do this without downtime.

To clarify, the *three* steps following the "Decommission each node in
the DC you are working on" step should be applied to *only* the
decommissioned nodes. So where it says "*all nodes*" or "*every node*", it
applies only to the decommissioned nodes.

In addition, for the step that says "Wipe data on all the nodes", I would
delete all files in the following directories on the decommissioned nodes.

   - data (usually located in /var/lib/cassandra/data)
   - commitlog (usually located in /var/lib/cassandra/commitlog)
   - hints (usually located in /var/lib/cassandra/hints)
   - saved_caches (usually located in /var/lib/cassandra/saved_caches)
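
For reference, here is a minimal sketch of that wipe step, assuming the default
/var/lib/cassandra layout and that Cassandra is already stopped on the
decommissioned node. The paths and the script itself are illustrative only --
adjust them to whatever your cassandra.yaml actually points at.

#!/usr/bin/env python3
"""Sketch: clear local state on an already-decommissioned, stopped node."""
import shutil
from pathlib import Path

# Directories listed above; adjust if your cassandra.yaml points elsewhere.
CASSANDRA_DIRS = [
    Path("/var/lib/cassandra/data"),
    Path("/var/lib/cassandra/commitlog"),
    Path("/var/lib/cassandra/hints"),
    Path("/var/lib/cassandra/saved_caches"),
]

def wipe_node_state():
    for directory in CASSANDRA_DIRS:
        if not directory.is_dir():
            continue
        # Remove the contents but keep the directory itself so ownership and
        # permissions survive for the restart.
        for entry in directory.iterdir():
            if entry.is_dir():
                shutil.rmtree(entry)
            else:
                entry.unlink()
        print(f"cleared {directory}")

if __name__ == "__main__":
    wipe_node_state()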


Cheers,
Anthony

On Fri, 31 Jan 2020 at 03:05, Durity, Sean R wrote:

> Your procedure won’t work very well. On the first node, if you switched to
> 4, you would end up with only a tiny fraction of the data (because the
> other nodes would still be at 256). I updated a large cluster (over 150
> nodes – 2 DCs) to a smaller number of vnodes. The basic outline was this:
>
>
>
>- Stop all repairs
>- Make sure the app is running against one DC only
>- Change the replication settings on keyspaces to use only 1 DC
>(basically cutting off the other DC)
>- Decommission each node in the DC you are working on. Because the
>replication settings are changed, no streaming occurs. But it releases the
>token assignments
>- Wipe data on all the nodes
>- Update configuration on every node to your new settings, including
>auto_bootstrap = false
>- Start all nodes. They will choose tokens, but not stream any data
>- Update replication factor for all keyspaces to include the new DC
>- I disabled binary on those nodes to prevent app connections
>- Run nodetool rebuild with the other DC as the source on as many nodes as your
>system can safely handle until they are all rebuilt.
>- Re-enable binary (and app connections to the rebuilt DC)
>- Turn on repairs
>- Rest for a bit, then reverse the process for the remaining DCs
>
>
>
>
>
>
>
> Sean Durity – Staff Systems Engineer, Cassandra
>
>
>
> *From:* Maxim Parkachov 
> *Sent:* Thursday, January 30, 2020 10:05 AM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] How to reduce vnodes without downtime
>
>
>
> Hi everyone,
>
>
>
> with the discussion about reducing the default number of vnodes in version
> 4.0, I would like to ask what the optimal procedure would be to reduce vnodes
> in an existing 3.11.x cluster which was set up with the default value of 256.
> The cluster has 2 DCs with 5 nodes each and RF=3. There is one more
> restriction: I cannot add more servers, nor create an additional DC;
> everything is physical. This should be done without downtime.
>
>
>
> My idea for such procedure would be
>
>
>
> for each node:
>
> - decommission node
>
> - set auto_bootstrap to true and vnodes to 4
>
> - start and wait till node joins cluster
>
> - run cleanup on rest of nodes in cluster
>
> - run repair on whole cluster (not sure if needed after cleanup)
>
> - set auto_bootstrap to false
>
> repeat for each node
>
>
>
> rolling restart of cluster
>
> cluster repair
>
>
>
> Does this sound right? My concern is that after decommissioning, the node
> will start on the same IP, which could create some confusion.
>
>
>
> Regards,
>
> Maxim.
>


Re: Introducing DSBench

2020-01-30 Thread Jonathan Shook
Here is a link to get started with DSBench:
https://github.com/datastax/dsbench-labs#getting-started

and DataStax Labs:
https://downloads.datastax.com/#labs

On Thu, Jan 30, 2020 at 11:47 AM Jonathan Shook  wrote:
>
> Some of you may remember NGCC talks on metagener (now VirtualDataSet)
> and engineblock from 2015 and 2016. The main themes went something
> along the lines of "testing c* with realistic workloads is hard,
> sizing cassandra is hard, we need tools in this space that go beyond
> what cassandra-stress can do but don't require math phd skills."
>
> We just released our latest attempt at solving this difficult problem
> set. It's called DSBench and it's free to download from DataStax Labs.
> Looking forward to your feedback and hope this tool can prove valuable
> for your sizing, stress testing, and performance benchmarking needs.




Introducing DSBench

2020-01-30 Thread Jonathan Shook
Some of you may remember NGCC talks on metagener (now VirtualDataSet)
and engineblock from 2015 and 2016. The main themes went something
along the lines of "testing c* with realistic workloads is hard,
sizing cassandra is hard, we need tools in this space that go beyond
what cassandra-stress can do but don't require math phd skills."

We just released our latest attempt at solving this difficult problem
set. It's called DSBench and it's free to download from DataStax Labs.
Looking forward to your feedback and hope this tool can prove valuable
for your sizing, stress testing, and performance benchmarking needs.




Re: Cassandra OS Patching.

2020-01-30 Thread Michael Shuler
That is some good info. To add just a little more: knowing what the 
pending security updates are for your nodes helps you decide what to do 
afterwards. Read the security update notes from your vendor.


Java or Cassandra update? Of course the service needs to be restarted - do a 
rolling upgrade and restart the `cassandra` service on each node as usual.
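
Purely as an illustration of that rolling pattern (not a drop-in script), the
loop below drains a node, restarts the service, and waits for it to come back
before moving on. It assumes SSH access, a systemd unit named `cassandra`, and
hypothetical host names -- adapt all of that to your own environment.

#!/usr/bin/env python3
"""Illustrative rolling-restart sketch: drain, restart, wait for the node to rejoin."""
import socket
import subprocess
import time

NODES = ["cass-01.example.com", "cass-02.example.com"]  # hypothetical host list

def ssh(host, *cmd):
    return subprocess.run(["ssh", host, *cmd], capture_output=True, text=True, check=True)

def is_up_normal(host):
    """True once the node reports itself as UN in nodetool status (crude check)."""
    ip = socket.gethostbyname(host)
    try:
        out = ssh(host, "nodetool", "status").stdout
    except subprocess.CalledProcessError:
        return False  # nodetool/JMX may not be reachable right after the restart
    return any(line.startswith("UN") and ip in line for line in out.splitlines())

for host in NODES:
    ssh(host, "nodetool", "drain")  # flush memtables, stop accepting traffic
    # (install the Java/Cassandra package update here, e.g. via your config management)
    ssh(host, "sudo", "systemctl", "restart", "cassandra")  # assumes a systemd unit named "cassandra"
    while not is_up_normal(host):   # wait before touching the next node
        time.sleep(10)
    print(f"{host} is back up")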


Linux kernel update? Node needs a full reboot, so follow a rolling 
reboot plan.


Other OS updates? Most can be done while not affecting Cassandra. For 
instance, an OpenSSH security update to patch some vulnerability should 
most certainly be done as soon as possible, and the node updates can even 
be done in parallel without causing any problems with the JVM or 
Cassandra service. Most intelligent package update systems will install 
the update and restart the affected service, in this hypothetical case 
`sshd`.


Michael

On 1/30/20 3:56 AM, Erick Ramirez wrote:
There is no need to shut down the application because you should be able 
to carry out the operating system upgrade without an outage to the 
database, particularly since you have a lot of nodes in your cluster.


Provided your cluster has sufficient capacity, you might even have the 
ability to upgrade multiple nodes in parallel to reduce the upgrade 
window. If you decide to do nodes in parallel and you fully understand 
the token allocations and where the nodes are positioned in the ring in 
each DC, make sure you only upgrade nodes which are at least 5 nodes 
"away" to the right so you know none of the nodes would have overlapping 
token ranges and they're not replicas of each other.


Other points to consider are:

  * If a node goes down (for whatever reason), I suggest you upgrade the
OS on the node before bringing it back up. It's already down so you
might as well take advantage of it since you have so many nodes to
upgrade.
  * Resist the urge to run nodetool decommission or nodetool removenode
if you encounter an issue while upgrading a node. This is a common
knee-jerk reaction which can prove costly because the cluster will
rebalance automatically, adding more time to your upgrade window.
Either fix the problem on the server or replace the node using the
"replace_address" flag.
  * Test, test, and test again. Familiarity with the process is your
friend when the unexpected happens.
  * Plan ahead and rehearse your recovery method (i.e. replace the node)
should you run into unexpected issues.
  * Stick to the plan and be prepared to implement it -- don't deviate.
Don't spend 4 hours or more investigating why a server won't start.
  * Be decisive. Activate your recovery/remediation plan immediately.

I'm sure others will chime in with their recommendations. Let us know 
how you go as I'm sure others would be interested in hearing from your 
experience. Not a lot of shops have a deployment as large as yours so 
you are in an enviable position. Good luck!


On Thu, Jan 30, 2020 at 3:45 PM Anshu Vajpayee wrote:


Hi Team,
What is the best way to patch OS of 1000 nodes Multi DC Cassandra
cluster where we cannot suspend application traffic( we can redirect
traffic to one DC).

Please suggest if anyone has any best practice around it.

-- 
Cheers,
Anshu V






RE: [EXTERNAL] How to reduce vnodes without downtime

2020-01-30 Thread Durity, Sean R
Your procedure won’t work very well. On the first node, if you switched to 4, 
you would end up with only a tiny fraction of the data (because the other nodes 
would still be at 256). I updated a large cluster (over 150 nodes – 2 DCs) to 
a smaller number of vnodes. The basic outline was this:


  *   Stop all repairs
  *   Make sure the app is running against one DC only
  *   Change the replication settings on keyspaces to use only 1 DC (basically 
cutting off the other DC)
  *   Decommission each node in the DC you are working on. Because the 
replication settings are changed, no streaming occurs. But it releases the token 
assignments
  *   Wipe data on all the nodes
  *   Update configuration on every node to your new settings, including 
auto_bootstrap = false
  *   Start all nodes. They will choose tokens, but not stream any data
  *   Update replication factor for all keyspaces to include the new DC
  *   I disabled binary on those nodes to prevent app connections
  *   Run nodetool rebuild with the other DC as the source on as many nodes as your 
system can safely handle until they are all rebuilt.
  *   Re-enable binary (and app connections to the rebuilt DC)
  *   Turn on repairs
  *   Rest for a bit, then reverse the process for the remaining DCs
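
As a rough illustration of the two replication steps in that outline, the
sketch below uses the Python driver. The keyspace names, contact point, and
replication factors are placeholders, and you would need to cover every
keyspace that replicates to the DC being rebuilt (including system_auth and
friends if they use NetworkTopologyStrategy). Treat it as a sketch of the
idea, not a finished runbook.

#!/usr/bin/env python3
"""Sketch of the keyspace replication changes in the outline above (names are placeholders)."""
from cassandra.cluster import Cluster  # pip install cassandra-driver

KEYSPACES = ["my_app_ks", "system_auth"]  # placeholders; list every keyspace replicating to the DC

def set_replication(session, keyspace, dc_rf):
    """dc_rf is a dict like {'DC1': 3} or {'DC1': 3, 'DC2': 3}."""
    opts = ", ".join(f"'{dc}': {rf}" for dc, rf in dc_rf.items())
    session.execute(
        f"ALTER KEYSPACE {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {opts}}}"
    )

cluster = Cluster(["10.0.0.1"])  # a contact point in the DC that keeps serving the app
session = cluster.connect()

# Step "Change the replication settings on keyspaces to use only 1 DC":
for ks in KEYSPACES:
    set_replication(session, ks, {"DC1": 3})

# ... decommission the DC2 nodes, wipe them, set num_tokens/auto_bootstrap, restart them ...

# Step "Update replication factor for all keyspaces to include the new DC":
for ks in KEYSPACES:
    set_replication(session, ks, {"DC1": 3, "DC2": 3})

cluster.shutdown()
# After that, run `nodetool rebuild <name-of-the-still-populated-DC>` on the rebuilt
# nodes, a few at a time.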



Sean Durity – Staff Systems Engineer, Cassandra

From: Maxim Parkachov 
Sent: Thursday, January 30, 2020 10:05 AM
To: user@cassandra.apache.org
Subject: [EXTERNAL] How to reduce vnodes without downtime

Hi everyone,

with the discussion about reducing the default number of vnodes in version 4.0, I would 
like to ask what the optimal procedure would be to reduce vnodes in an existing 3.11.x 
cluster which was set up with the default value of 256. The cluster has 2 DCs with 5 
nodes each and RF=3. There is one more restriction: I cannot add more servers, nor 
create an additional DC; everything is physical. This should be done without downtime.

My idea for such procedure would be

for each node:
- decommission node
- set auto_bootstrap to true and vnodes to 4
- start and wait till node joins cluster
- run cleanup on rest of nodes in cluster
- run repair on whole cluster (not sure if needed after cleanup)
- set auto_bootstrap to false
repeat for each node

rolling restart of cluster
cluster repair

Does this sound right? My concern is that after decommissioning, the node will start 
on the same IP, which could create some confusion.

Regards,
Maxim.





How to reduce vnodes without downtime

2020-01-30 Thread Maxim Parkachov
Hi everyone,

with the discussion about reducing the default number of vnodes in version
4.0, I would like to ask what the optimal procedure would be to reduce
vnodes in an existing 3.11.x cluster which was set up with the default value
of 256. The cluster has 2 DCs with 5 nodes each and RF=3. There is one more
restriction: I cannot add more servers, nor create an additional DC;
everything is physical. This should be done without downtime.

My idea for such procedure would be

for each node:
- decommission node
- set auto_bootstrap to true and vnodes to 4
- start and wait till node joins cluster
- run cleanup on rest of nodes in cluster
- run repair on whole cluster (not sure if needed after cleanup)
- set auto_bootstrap to false
repeat for each node

rolling restart of cluster
cluster repair

Does this sound right? My concern is that after decommissioning, the node will
start on the same IP, which could create some confusion.

Regards,
Maxim.


Re: KeyCache Harmless Error on Startup

2020-01-30 Thread Shalom Sagges
Thanks Erick!

I will check with the owners of this keyspace, hoping to find the culprit.
If they won't come up with anything, is there a way to read the key cache
file? (as I understand it's a binary file)
On another note, there's actually another keyspace I forgot to point out, on
which I found some weird behavior (not necessarily related though).

CREATE KEYSPACE ks3 WITH replication = {'class': 'NetworkTopologyStrategy',
'DC1': '3', 'DC2': '3'}  AND durable_writes = true;

CREATE TABLE ks3.tbl4 (
    account_id text,
    consumer_phone_number text,
    channel text,
    event_time_stamp timestamp,
    brand_phone_number text,
    campaign_id bigint,
    engagement_id bigint,
    event_type text,
    PRIMARY KEY ((account_id, consumer_phone_number), channel, event_time_stamp)
);

When I select from this table, I get the following warning:
*cqlsh.py:395: DateOverFlowWarning: Some timestamps are larger than Python
datetime can represent. Timestamps are displayed in milliseconds from
epoch.*

I don't know if it's related but worth pointing out.

 account_id            | 12345678
 consumer_phone_number | OIs1HXovJ9W/AJZI+Tm8CSCbAavdVI06qt0c
 channel               | sms
 event_time_stamp      | *1580305508799000*
 brand_phone_number    | PY0yHHItI9BibOtNis8hDuLwN91prPa+
 campaign_id           | null
 engagement_id         | null
 event_type            | opt-out
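
For what it's worth, the warning itself is easy to reproduce: cqlsh interprets a
timestamp column as milliseconds since the epoch, and 1580305508799000 ms lands
tens of thousands of years in the future, beyond what Python's datetime can
represent, so cqlsh falls back to printing the raw number. The same value read
as microseconds is 2020-01-29, so a client writing microseconds instead of
milliseconds is one possible (unconfirmed) explanation.

from datetime import datetime, timezone

raw = 1580305508799000

# cqlsh treats timestamp columns as milliseconds since the epoch:
try:
    print(datetime.fromtimestamp(raw / 1000, tz=timezone.utc))
except (OverflowError, OSError, ValueError) as exc:
    print(f"overflows Python datetime: {exc!r}")

# The same number read as microseconds is a plausible 2020 date:
print(datetime.fromtimestamp(raw / 1_000_000, tz=timezone.utc))  # 2020-01-29 13:45:08.799+00:00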


Thanks!



On Thu, Jan 30, 2020 at 1:43 AM Erick Ramirez  wrote:

> Specifically for the NegativeArraySizeException, what's happening is that
>> the keyLength is so huge that it blows up MAX_UNSIGNED_SHORT so it looks
>> like it's a negative value. Someone will correct me if I got that wrong but
>> the "Key length longer than max" error confirms that.
>>
>
> Is it possible that you have a rogue metric_name value that's impossibly
> long? I'm a bit more convinced now that's what's happening because you said
> it happens on multiple servers which rules out local file corruption at the
> filesystem level. Cheers!
>


Re: Cassandra OS Patching.

2020-01-30 Thread Erick Ramirez
There is no need to shut down the application because you should be able to
carry out the operating system upgrade without an outage to the database,
particularly since you have a lot of nodes in your cluster.

Provided your cluster has sufficient capacity, you might even have the
ability to upgrade multiple nodes in parallel to reduce the upgrade window.
If you decide to do nodes in parallel and you fully understand the token
allocations and where the nodes are positioned in the ring in each DC, make
sure you only upgrade nodes which are at least 5 nodes "away" to the right
so you know none of the nodes would have overlapping token ranges and
they're not replicas of each other.
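
As a rough sketch of that spacing rule only (the node list, its ordering, and
the spacing value are assumptions you would have to derive from your own ring
layout and token allocation), something like this can pre-compute the batches:

#!/usr/bin/env python3
"""Group one DC's nodes into upgrade batches whose members sit >= SPACING positions apart."""

def ring_distance(i, j, n):
    """Shortest distance between positions i and j on a ring of n nodes."""
    d = abs(i - j)
    return min(d, n - d)

def batches(nodes, spacing=5):
    """Greedily build batches in which every pair of members is at least `spacing` apart."""
    n = len(nodes)
    remaining = list(range(n))
    result = []
    while remaining:
        batch = []
        for idx in list(remaining):
            if all(ring_distance(idx, chosen, n) >= spacing for chosen in batch):
                batch.append(idx)
                remaining.remove(idx)
        result.append([nodes[i] for i in batch])
    return result

# Hypothetical example: 20 nodes of one DC, listed in ring order.
dc_nodes = [f"10.0.1.{i}" for i in range(1, 21)]
for round_no, batch in enumerate(batches(dc_nodes), start=1):
    print(f"round {round_no}: {batch}")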

Other points to consider are:

   - If a node goes down (for whatever reason), I suggest you upgrade the
   OS on the node before bringing it back up. It's already down so you might as
   well take advantage of it since you have so many nodes to upgrade.
   - Resist the urge to run nodetool decommission or nodetool removenode if
   you encounter an issue while upgrading a node. This is a common knee-jerk
   reaction which can prove costly because the cluster will rebalance
   automatically, adding more time to your upgrade window. Either fix the
   problem on the server or replace the node using the "replace_address" flag.
   - Test, test, and test again. Familiarity with the process is your
   friend when the unexpected happens.
   - Plan ahead and rehearse your recovery method (i.e. replace the node)
   should you run into unexpected issues.
   - Stick to the plan and be prepared to implement it -- don't deviate.
   Don't spend 4 hours or more investigating why a server won't start.
   - Be decisive. Activate your recovery/remediation plan immediately.
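
For context, the "replace_address" approach mentioned above is normally done by
starting the replacement node with the cassandra.replace_address_first_boot JVM
system property set to the dead node's address, for example by adding a line to
jvm.options (or cassandra-env.sh on older packaging). The path and address below
are hypothetical; this only sketches the idea.

#!/usr/bin/env python3
"""Illustrative only: enable 'replace address' on a freshly prepared replacement node."""
from pathlib import Path

JVM_OPTIONS = Path("/etc/cassandra/jvm.options")  # hypothetical path; depends on your packaging
DEAD_NODE_IP = "10.0.1.17"                        # hypothetical address of the node being replaced

flag = f"-Dcassandra.replace_address_first_boot={DEAD_NODE_IP}\n"
if flag not in JVM_OPTIONS.read_text():
    with JVM_OPTIONS.open("a") as fh:
        fh.write(flag)
# Start Cassandra afterwards, and remove the line again once the node has finished joining.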

I'm sure others will chime in with their recommendations. Let us know how
you go as I'm sure others would be interested in hearing from your
experience. Not a lot of shops have a deployment as large as yours so you
are in an enviable position. Good luck!

On Thu, Jan 30, 2020 at 3:45 PM Anshu Vajpayee wrote:

> Hi Team,
> What is the best way to patch OS of 1000 nodes Multi DC Cassandra cluster
> where we cannot suspend application traffic( we can redirect traffic to one
> DC).
>
> Please suggest if anyone has any best practice around it.
>
> --
> Cheers,
> Anshu V
>
>
>


RE: Cassandra going OOM due to tombstones (heapdump screenshots provided)

2020-01-30 Thread Steinmaurer, Thomas
If possible, prefer m5 over m4, because they run on a newer (KVM-based) 
hypervisor; single-core performance is ~10% better compared to m4, and m5 is 
even slightly cheaper than m4.

Thomas

From: Erick Ramirez 
Sent: Donnerstag, 30. Jänner 2020 03:00
To: user@cassandra.apache.org
Subject: Re: Cassandra going OOM due to tombstones (heapdump screenshots 
provided)

It looks like the number of tables is the problem; with 5,000 - 10,000 tables, 
that is way above the recommendations.
Take a look here: 
https://docs.datastax.com/en/dse-planning/doc/planning/planningAntiPatterns.html#planningAntiPatterns__AntiPatTooManyTables
This suggests that 5-10GB of heap is going to be taken up just by the table 
information (1MB per table).
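
As a back-of-the-envelope check of that figure (using the ~1MB-per-table rule
of thumb quoted above, not a measured value):

# Rough heap overhead from table metadata alone, using the ~1MB/table rule of thumb above.
MB_PER_TABLE = 1
for table_count in (200, 400, 5_000, 10_000):
    print(f"{table_count:>6} tables -> ~{table_count * MB_PER_TABLE / 1024:.1f} GB of heap")
# 5,000-10,000 tables -> roughly 4.9-9.8 GB, i.e. most or all of an 8GB heap before any real data.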

+1 to Paul Chandler & Hannu Kröger. Although there isn't a hard limit on the 
maximum number of tables, there's a reasonable number that is operationally 
sound and we recommend that 200 total tables per cluster is the sweet spot. We 
know from experience that the clusters suffer as the total number of tables 
approaches 400+ so stick as close to 200 as possible. I had these 
recommendations published in the DataStax Docs a couple of years ago to provide 
clear guidance to users.

1000 keyspaces suggests that you have a multi-tenant setup. Perhaps you can 
distribute the keyspaces across multiple clusters so each cluster has less than 
500 tables. To be clear, the number of keyspaces isn't relevant in this context 
-- it's the total number of tables across all keyspaces that matters.

- We observed this problem on a c4.4xlarge (AWS EC2) instance having 30GB RAM 
with 8GB heap
- We observed the same problem on a c4.8xlarge having 60GB RAM with 12GB heap

A little off-topic but it sounds like you've been evaluating different instance 
types. The c4 instances may not be ideal for your circumstances because you're 
trading less RAM for more powerful CPUs. I generally recommend m4 instances 
because they're a good balance of CPU and RAM for the money. In a m4.4xlarge 
configuration, what you lose in raw CPU power over a c4.4xlarge (2.4GHz Intel 
Xeon E5-2676 vs 2.9GHz E5-2666) you gain 34GB of RAM (64GB vs 30GB) for nearly 
identical pricing. I think the m4 type is better value compared to c4. YMMV but 
run your tests and you might be surprised.

In relation to the heap, I imagine you're using CMS so allocate at least 16GB 
but 20 or 24GB might turn out to be the ideal size for your cluster based on 
your testing. Just make sure you reserve at least 8GB of RAM for the operating 
system.

I hope this helps. Cheers!