> On Apr 9, 2021, at 6:15 AM, Joe Obernberger wrote:
>
> We run a ~1PByte HBase cluster on top of Hadoop/HDFS that works pretty well.
> I would love to be able to use Cassandra instead on a system like that.
>
1PB is definitely in the range of viable Cassandra clusters today.
Even 4.0 has gone a ways to enable better densification of nodes, but it wasn't
a main focus. We're probably still only thinking that 4TB - 8TB nodes will
be feasible (and then maybe only for expert users). The main problems tend
to be streaming, compaction, and repairs when it comes to dense nodes.
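
For a rough sense of what that density range means at the 1PB scale discussed in this thread, here is a back-of-the-envelope sketch. It assumes decimal units (1PB = 1000TB), treats 1PB as the logical data size before replication, and uses a replication factor of 3; none of those figures are stated in the thread, and the function name `nodes_needed` is made up for illustration.

```python
import math


def nodes_needed(logical_tb: float, per_node_tb: float, replication_factor: int = 3) -> int:
    """Estimate the node count for a given per-node data density.

    logical_tb         -- data size before replication, in TB (assumption)
    per_node_tb        -- data each node is expected to hold, in TB
    replication_factor -- copies kept of each row (3 is an assumption here)
    """
    total_on_disk_tb = logical_tb * replication_factor
    # Round up: a fractional node still means one more machine.
    return math.ceil(total_on_disk_tb / per_node_tb)


if __name__ == "__main__":
    for density_tb in (4, 8):
        count = nodes_needed(1000, density_tb)
        print(f"{density_tb}TB per node -> about {count} nodes")
```

The estimate ignores compaction headroom, compression, and growth, so real sizing would likely land higher.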
available in the
open source version, too.
Sean Durity – Staff Systems Engineer, Cassandra
From: Elliott Sims
Sent: Thursday, April 8, 2021 6:36 PM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Huge single-node DCs (?)
We run a ~1PByte HBase cluster on top of Hadoop/HDFS that works pretty
well. I would love to be able to use Cassandra instead on a system
like that. HBase queries / scans are not the easiest to deal with,
but, as with Cassandra, if you know the primary key, you can get to your
data fast,
I'm not sure I'd suggest building a single DIY Backblaze pod. The SATA
port multipliers are a pain both from a supply chain and systems management
perspective. Can be worth it when you're amortizing that across a lot of
servers and can exert some leverage over wholesale suppliers, but less so
This is off-topic. But if your goal is to maximise storage density while
also ensuring data durability and availability, this is what you should
be looking at:
* hardware:
https://www.backblaze.com/blog/open-source-data-storage-server/
* architecture and software:
I am also curious about this question. Say your use case is to store
10PBytes of data in a new server room / data-center with new equipment,
what makes the most sense? If your database is primarily write with
little read, I think you'd want to maximize disk space per rack space.
So you may opt
I'm sure there are lots of pitfalls. A few of them in my mind right now:
* With a single node, you will completely lose the benefit of high
availability from Cassandra. Not only will hardware failure result
in downtime, but routine maintenance (such as a software upgrade) can
also result in