Re: Nodetool rebuild question
Read repairs (both foreground/blocking, required by the consistency level, and background/non-blocking, driven by the per-table probability option) go memtable -> flush -> sstable.
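The "table option/probability" mentioned above is the per-table background read repair chance. A minimal cqlsh sketch for the pre-4.0 options (keyspace and table names are placeholders):

    # probabilistic (background) read repair settings on a hypothetical table
    cqlsh -e "ALTER TABLE my_ks.my_table WITH read_repair_chance = 0.1 AND dclocal_read_repair_chance = 0.1;"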
RE: Nodetool rebuild question
Sure. When a read repair happens, does it go via the memtable -> SSTable route, or does the source node send SSTable tmp files directly to the inconsistent replica?
Re: Nodetool rebuild question
If you set RF to 0, you can ignore my second sentence/paragraph. The third still applies.
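To confirm which keyspaces still replicate to the DC being retired, the replication map can be read back from the schema tables (a sketch, assuming a 3.x system_schema layout):

    # list each keyspace's replication settings; check for the DC being retired
    cqlsh -e "SELECT keyspace_name, replication FROM system_schema.keyspaces;"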
RE: Nodetool rebuild question
Thanks. We always set RF to 0 and then "removenode" all nodes in the DC that we want to decom. So I highly doubt that is the problem. Plus, the number of SSTables on a given node is ~2000 on average (we have 140 nodes in one ring and two rings overall).
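As commands, the workflow described above looks roughly like this (a sketch; the keyspace name, DC name, and host ID are placeholders):

    # drop the retired DC from each keyspace's replication map (equivalent to RF 0 for that DC)
    cqlsh -e "ALTER KEYSPACE my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'remaining_dc': 3};"
    # then remove every node in the retired DC by host ID (host IDs come from nodetool status)
    nodetool removenode <host-id>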
Re: Nodetool rebuild question
Both of your statements are true.

During your decom, you likely streamed LOTs of sstables to the remaining nodes (especially true if you didn't drop the replication factor to 0 for the DC you decommissioned). Since those tens of thousands of sstables take a while to compact, if you then rebuild (or bootstrap) before compaction is done, you'll get a LOT of extra sstables.

This is one of the reasons that people with large clusters don't use vnodes – if you needed to bootstrap ~100 more nodes into a cluster, you'd have to wait potentially a day or more per node to compact away the leftovers before bootstrapping the next, which is prohibitive at scale.

- Jeff

From: Anubhav Kale
Date: Wednesday, October 5, 2016 at 1:34 PM
To: user@cassandra.apache.org
Subject: Nodetool rebuild question

Hello,

As part of rebuild, I noticed that the destination node gets -tmp- files from other nodes. Are the following statements correct?

1. The files are written to disk without going through memtables.
2. Regular compactors eventually compact them to bring the number of SSTables down to a reasonable level.

We have noticed that the destination node created > 40K *Data* files in the first hour of streaming itself. We have not seen such a pattern before, so we are trying to understand what could have changed. (We do use vnodes, and we haven't increased the number of nodes recently, but we have decommissioned a DC.)

Thanks much!
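A rough way to check whether the post-decommission compaction backlog has cleared before starting the rebuild, along the lines of the advice earlier in the thread (the keyspace name, source DC name, and default data directory are assumptions):

    # pending/active compactions on this node
    nodetool compactionstats
    # per-table "SSTable count" for the keyspace of interest
    nodetool cfstats my_ks
    # or count Data.db components on disk directly (default data path assumed)
    find /var/lib/cassandra/data/my_ks -name '*-Data.db' | wc -l
    # once the backlog has drained, stream the new DC from a source DC
    nodetool rebuild source_dc_name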