RE: Huge single-node DCs (?)

2021-04-09 Thread Durity, Sean R
DataStax Enterprise has a new-ish feature set called Big Node that is supposed 
to help with using much denser nodes. We are going to be doing some testing 
with that for a similar use case with ever-growing disk needs, but no real 
increase in read or write volume. At some point it may become available in the 
open source version, too.


Sean Durity – Staff Systems Engineer, Cassandra

From: Elliott Sims 
Sent: Thursday, April 8, 2021 6:36 PM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Huge single-node DCs (?)

I'm not sure I'd suggest building a single DIY Backblaze pod.  The SATA port 
multipliers are a pain both from a supply chain and systems management 
perspective.  Can be worth it when you're amortizing that across a lot of 
servers and can exert some leverage over wholesale suppliers, but less so for a 
one-off.  There are a lot more whitebox/OEM/etc options for high-density storage 
servers these days from Seagate, Dell, HP, Supermicro, etc that are worth a 
look.

I'd agree with this (both examples) sounding like a poor fit for Cassandra.  
Seems like you could always just spin up a bunch of Cassandra VMs in the ESX 
cluster instead of one big one, but something like MySQL or PostgreSQL might 
suit your needs better.  Or even some sort of flatfile archive with something 
like Parquet if it's more being kept "just in case" with no need for quick 
random access.

For the 10PB example, it may be time to look at something like Hadoop, or maybe 
Ceph.

On Thu, Apr 8, 2021 at 10:39 AM Bowen Song <bo...@bso.ng> wrote:

This is off-topic. But if your goal is to maximise storage density while also 
ensuring data durability and availability, this is what you should be looking 
at:

  *   hardware: https://www.backblaze.com/blog/open-source-data-storage-server/
  *   architecture and software: 
https://www.backblaze.com/blog/vault-cloud-storage-architecture/


On 08/04/2021 17:50, Joe Obernberger wrote:
I am also curious about this question.  Say your use case is to store 10PBytes of 
data in a new server room / data-center with new equipment, what makes the most 
sense?  If your database is primarily write with little read, I think you'd 
want to maximize disk space per rack space.  So you may opt for a 2U server 
with 24 3.5" disks at 16TBytes each for a node with 384TBytes of disk - so ~27 
servers for 10PBytes.

Cassandra doesn't seem to be a good choice for that configuration; the rule 
of thumb that I'm hearing is ~2Tbytes per node, in which case we'd need over 
5000 servers.  This seems really unreasonable.
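
The two server counts above can be reproduced with a few lines of Python. This is only a rough sketch: it assumes decimal units (1 PByte = 1000 TBytes, which the thread does not specify) and counts raw disk only, with no replication or compaction headroom. The function name `nodes_needed` is made up for this illustration.

```python
import math

TB_PER_PB = 1000  # assumption: decimal units; the thread does not say which


def nodes_needed(total_tb: float, tb_per_node: float) -> int:
    """Smallest whole number of nodes whose raw capacity covers total_tb."""
    return math.ceil(total_tb / tb_per_node)


total_tb = 10 * TB_PER_PB   # the 10 PByte target from the question
dense_node_tb = 24 * 16     # 24 disks x 16 TBytes = 384 TBytes per 2U server
rule_of_thumb_tb = 2        # the ~2 TBytes per node rule of thumb

print(nodes_needed(total_tb, dense_node_tb))     # dense storage servers
print(nodes_needed(total_tb, rule_of_thumb_tb))  # nodes at ~2 TBytes each
```

With binary units (1 PByte = 1024 TBytes) the second figure rises to 5120, which matches the "over 5000" estimate.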

-Joe

On 4/8/2021 9:56 AM, Lapo Luchini wrote:

Hi, one project I wrote is using Cassandra to back the huge amount of data it 
needs (data is written only once and read very rarely, but needs to be 
accessible for years, so the storage needs become huge in time and I chose 
Cassandra mainly for its horizontal scalability regarding disk size) and a 
client of mine needs to install that on his hosts.

Problem is, while I usually use a cluster of 6 "smallish" nodes (which can grow 
in time), he only has big ESX servers with huge disk space (which is already 
RAID-6 redundant) but wouldn't have the possibility to have 3+ nodes per DC.

This is out of my usual experience with Cassandra and, as far as I read around, 
out of most use-cases found on the website or this mailing list, so the 
question is:
does it make sense to use Cassandra with a big (let's talk 6TB today, up to 
20TB in a few years) single-node DataCenter, and another single-node DataCenter 
(to act as disaster recovery)?

Thanks in advance for any suggestion or comment!




Re: Huge single-node DCs (?)

2021-04-09 Thread Joe Obernberger
We run a ~1PByte HBase cluster on top of Hadoop/HDFS that works pretty 
well.  I would love to be able to use Cassandra instead on a system 
like that.  HBase queries / scans are not the easiest to deal with, 
but, as with Cassandra, if you know the primary key, you can get to your 
data fast, even in trillions of rows. Cassandra offers some 
capabilities that HBase doesn't that I would like to leverage, but yeah 
- how can you use Cassandra with modern equipment in a bare metal 
environment?  Kubernetes could make sense as long as you're able to 
maintain data locality with however your storage is configured.
Even all SSDs - you can get a system with 24, 2 TByte SSDs, which is too 
large for 1 instance of Cassandra.  Does 4.x address any of this?


eBay uses Cassandra and claims to have 80+ petabytes.  What do they do?

-Joe

On 4/8/2021 6:35 PM, Elliott Sims wrote:
I'm not sure I'd suggest building a single DIY Backblaze pod.  The 
SATA port multipliers are a pain both from a supply chain and systems 
management perspective.  Can be worth it when you're amortizing that 
across a lot of servers and can exert some leverage over wholesale 
suppliers, but less so for a one-off.  There are a lot more 
whitebox/OEM/etc options for high-density storage servers these days 
from Seagate, Dell, HP, Supermicro, etc that are worth a look.



I'd agree with this (both examples) sounding like a poor fit for 
Cassandra.  Seems like you could always just spin up a bunch of 
Cassandra VMs in the ESX cluster instead of one big one, but something 
like MySQL or PostgreSQL might suit your needs better.  Or even some 
sort of flatfile archive with something like Parquet if it's more 
being kept "just in case" with no need for quick random access.


For the 10PB example, it may be time to look at something like Hadoop, 
or maybe Ceph.


On Thu, Apr 8, 2021 at 10:39 AM Bowen Song wrote:

This is off-topic. But if your goal is to maximise storage density
while also ensuring data durability and availability, this is what
you should be looking at:

  * hardware:
https://www.backblaze.com/blog/open-source-data-storage-server/

  * architecture and software:
https://www.backblaze.com/blog/vault-cloud-storage-architecture/



On 08/04/2021 17:50, Joe Obernberger wrote:

I am also curious about this question.  Say your use case is to
store 10PBytes of data in a new server room / data-center with
new equipment, what makes the most sense?  If your database is
primarily write with little read, I think you'd want to maximize
disk space per rack space.  So you may opt for a 2U server with
24 3.5" disks at 16TBytes each for a node with 384TBytes of disk
- so ~27 servers for 10PBytes.

Cassandra doesn't seem to be a good choice for that
configuration; the rule of thumb that I'm hearing is ~2Tbytes per
node, in which case we'd need over 5000 servers.  This seems
really unreasonable.

-Joe

On 4/8/2021 9:56 AM, Lapo Luchini wrote:

Hi, one project I wrote is using Cassandra to back the huge
amount of data it needs (data is written only once and read very
rarely, but needs to be accessible for years, so the storage
needs become huge in time and I chose Cassandra mainly for its
horizontal scalability regarding disk size) and a client of mine
needs to install that on his hosts.

Problem is, while I usually use a cluster of 6 "smallish" nodes
(which can grow in time), he only has big ESX servers with huge
disk space (which is already RAID-6 redundant) but wouldn't have
the possibility to have 3+ nodes per DC.

This is out of my usual experience with Cassandra and, as far as
I read around, out of most use-cases found on the website or
this mailing list, so the question is:
does it make sense to use Cassandra with a big (let's talk 6TB
today, up to 20TB in a few years) single-node DataCenter, and
another single-node DataCenter (to act as disaster recovery)?

Thanks in advance for any suggestion or comment!




Re: Log Rotation of Extended Compaction Logging

2021-04-09 Thread Jens Fischer
Hi Erik,

thank you for the link, very instructive.

To summarise my understanding of your mail, the code and my experiments:

- as long as the compaction logger is running, it will write into the same 
"compaction.log" file
- if a new logger gets started (for example through a restart of the Cassandra 
node), the current file will be moved to "compaction-.log" 
and a new "compaction.log" file will be created
- files will never be archived (compressed) or deleted

Correct?
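
If that understanding holds, any archiving would have to happen outside Cassandra. Below is a minimal sketch of what such an external cleanup could look like. The function name `compress_rotated_logs`, the glob pattern `compaction-*.log`, and the seven-day threshold are all assumptions made for illustration and are not taken from Cassandra itself. The pattern requires the hyphen, so the active "compaction.log" is never touched.

```python
import gzip
import shutil
import time
from pathlib import Path


def compress_rotated_logs(log_dir, pattern="compaction-*.log", min_age_days=7.0):
    """Gzip rotated compaction logs older than min_age_days; return the .gz paths.

    The pattern and the age threshold are illustrative assumptions.
    """
    cutoff = time.time() - min_age_days * 86400
    compressed = []
    for path in sorted(Path(log_dir).glob(pattern)):
        if not path.is_file() or path.stat().st_mtime > cutoff:
            continue  # skip directories and files that are still recent
        target = path.with_name(path.name + ".gz")
        with path.open("rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        path.unlink()  # remove the original only after the copy has finished
        compressed.append(target)
    return compressed
```

It could be run from cron against whatever directory the node writes its logs to, for example `compress_rotated_logs("/path/to/cassandra/logs")`, where the path is a placeholder.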

Best
Jens