RE: [EXTERNAL] Re: Data Node Density

2017-12-27 Thread Durity, Sean R
You asked for experience; here’s mine.

I support one PR cluster where the hardware was built more for HBase than 
Cassandra, so the data capacity is large (4.5 TB/node). Administratively, it is 
the worst cluster to work on, because any kind of repair, streaming, or node 
replacement takes forever. And when some nodes were hitting the disk capacity? 
Yikes!

So, I am hesitant to recommend anything over 3 TB/node for any application in 
our setting. I understand that the cost of disk storage (with 35-50% compaction 
overhead, the replication factor, and more nodes) makes denser nodes more 
appealing, but I resist.
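
For a sense of the arithmetic behind that trade-off, here is a back-of-envelope 
sketch (the 50% headroom and RF=3 below are illustrative assumptions, not 
recommendations):

# Back-of-envelope usable-capacity estimate for a Cassandra cluster.
# The compaction headroom and replication factor are illustrative assumptions.
def usable_cluster_capacity_tb(nodes, disk_per_node_tb,
                               compaction_headroom=0.50,
                               replication_factor=3):
    # Unique (pre-replication) data the cluster can comfortably hold.
    usable_per_node = disk_per_node_tb * (1 - compaction_headroom)
    return nodes * usable_per_node / replication_factor

# e.g. 3 nodes x 20 TB at RF=3 with 50% headroom -> ~10 TB of unique data
print(usable_cluster_capacity_tb(3, 20))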


Sean Durity



Re: Data Node Density

2017-12-15 Thread Jeff Jirsa
Typing this on a phone during my commute, so please excuse the inevitable typos 
in what I expect will be a long email, because there's nothing else for me to do 
right now.

There are a few reasons people don't typically recommend huge nodes, the biggest 
being expansion and replacement. This question comes up from time to time, so 
here's at least one other explanation I've written in the past: 
https://stackoverflow.com/questions/31563447/cassandra-cluster-data-density-data-size-per-node-looking-for-feedback-and/31690279#31690279

Streaming (the mechanism for bootstrap / rebuild / repair) doesn't have a ton 
of retries built in. The larger the amount of data to stream, the more 
opportunities there are for failures. Streaming a terabyte probably succeeds 
just fine 99% of the time; streaming 60 TB probably succeeds much less often. In 
2.2 and newer, resumable bootstrap makes this slightly less of a concern 
(assuming it's implemented correctly).
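
To illustrate the resumable-bootstrap point: since 2.2 a failed bootstrap can be 
picked up where it left off with "nodetool bootstrap resume". A minimal sketch of 
retrying that from a script (it assumes nodetool is on the PATH and that the node 
is sitting in a failed, resumable bootstrap state):

# Sketch: retry a failed bootstrap with "nodetool bootstrap resume" (2.2+).
# Assumes nodetool is on the PATH and local JMX is reachable.
import subprocess
import time

def resume_bootstrap(max_attempts=3, wait_seconds=60):
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(["nodetool", "bootstrap", "resume"],
                                capture_output=True, text=True)
        if result.returncode == 0:
            print("bootstrap resume completed")
            return True
        print(f"attempt {attempt} failed: {result.stderr.strip()}")
        time.sleep(wait_seconds)
    return False

if __name__ == "__main__":
    resume_bootstrap()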

There are also some internals in play. When you bootstrap a new node, we create 
a streaming plan. To create that, we need to inspect all of the data files on 
disk, figure out which files to transfer, and figure out how much actual data 
that is (which involves interacting with the compression info); then we queue the 
files up and send them, and the other side compresses them again, recalculates 
the metadata, and writes them to disk.

The compression/metadata work runs single-threaded per stream, so you're 
typically bound by the number of streams, which correlates with the number of 
sending hosts. If you use vnodes, you can set the number of vnodes close to the 
number of cores/machines you'll have, so you end up with approximately as many 
streams as cores.
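
As a rough illustration of why the stream count matters (the per-stream 
throughput here is an assumed figure for the sketch, not a measured one):

# Very rough bootstrap-time estimate when work is single-threaded per stream.
# The per-stream throughput is an illustrative assumption; measure your own.
def estimate_bootstrap_hours(data_tb, sending_hosts, per_stream_mb_per_s=50):
    data_mb = data_tb * 1024 * 1024
    concurrent_streams = sending_hosts   # roughly one stream per source host
    seconds = data_mb / (concurrent_streams * per_stream_mb_per_s)
    return seconds / 3600

# e.g. pulling 4.5 TB from 2 peers at ~50 MB/s per stream -> ~13 hours
print(f"{estimate_bootstrap_hours(4.5, 2):.1f}")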

If you've already bought the hardware, you can try to make it work. You'll need 
the heap to be big enough to calculate the streaming plans, and you'll want to 
think about how you lay out the data directories (for JBOD to be safe you'll 
need to be on 3.11; otherwise just RAID0 it). Alternatively, as someone mentioned 
on this list in the past few weeks, you can try to add some extra IPs and run 
more than one Cassandra instance per host - doing so lets you treat each of them 
as a smaller instance. If you do this, you'll need to use rack awareness to make 
sure you don't have multiple copies of the data on the same machine, or a single 
hardware failure could make you lose data.
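
One hedged sketch of the rack-awareness part, assuming GossipingPropertyFileSnitch: 
give every instance on the same physical machine the same rack name, so that 
NetworkTopologyStrategy spreads replicas across machines (provided there are at 
least RF physical hosts). The paths, datacenter name, and host list below are 
made up for illustration:

# Sketch: write cassandra-rackdc.properties per instance so "rack" equals the
# physical host, so replicas of the same data land on different machines
# (assuming at least RF physical hosts).
# Paths, datacenter name, and hosts are illustrative assumptions.
import pathlib

DATACENTER = "dc1"
INSTANCES = {
    "host-a": ["/etc/cassandra/instance1", "/etc/cassandra/instance2"],
    "host-b": ["/etc/cassandra/instance1", "/etc/cassandra/instance2"],
}

for physical_host, conf_dirs in INSTANCES.items():
    for conf_dir in conf_dirs:
        props = pathlib.Path(conf_dir) / "cassandra-rackdc.properties"
        props.write_text(f"dc={DATACENTER}\nrack={physical_host}\n")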

If you're having specific problems trying to run a rebuild or bootstrap, you may 
have better luck with subrange repair - you'll stream less data, and you can do 
it in very small chunks. Most importantly, if you're having specific problems, 
don't ask us whether it works; tell us what's failing and show us the errors.
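
A minimal sketch of the subrange idea: split the Murmur3Partitioner token range 
into small chunks and repair each chunk with nodetool's --start-token/--end-token 
options. The keyspace name and chunk count are placeholders, and in practice 
you'd usually restrict this to the ranges the node actually owns (e.g. from 
nodetool describering) rather than walking the whole ring:

# Sketch: subrange repair over the full Murmur3Partitioner token range.
# Keyspace and chunk count are placeholders; assumes nodetool is on the PATH.
import subprocess

MIN_TOKEN = -(2 ** 63)        # Murmur3Partitioner minimum token
MAX_TOKEN = 2 ** 63 - 1       # Murmur3Partitioner maximum token
KEYSPACE = "my_keyspace"      # hypothetical keyspace name
CHUNKS = 1000

step = (MAX_TOKEN - MIN_TOKEN) // CHUNKS
start = MIN_TOKEN
for i in range(CHUNKS):
    end = MAX_TOKEN if i == CHUNKS - 1 else start + step
    subprocess.run(["nodetool", "repair", "-st", str(start), "-et", str(end),
                    KEYSPACE], check=True)
    start = end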

Having an outside firm come in and help explain and troubleshoot this for you is 
probably a good idea. The firms I'd personally trust if you were a close relative 
of mine asking for help are TheLastPickle and Instaclustr, but there are also 
some very competent people at Pythian and SmartCat.io.



-- 
Jeff Jirsa




Re: Data Node Density

2017-12-15 Thread Amit Agrawal
Thanks Nicolas. I'm aware of the official recommendations. However, in our last 
project we tried 5 TB per node and it worked fine.

So I'm asking around for experiences.

Does anybody know of anyone who provides consultancy on open-source Cassandra? 
DataStax only does it for the enterprise version!



Re: Data Node Density

2017-12-15 Thread Nicolas Guyomar
Hi Amit,

This is way too much data per node. The official recommendation is to try to 
stay below 2 TB per node; I have seen nodes up to 4 TB, but then maintenance gets 
really complicated (backup, bootstrap, streaming for repair, etc.).

Nicolas

On 15 December 2017 at 15:01, Amit Agrawal 
wrote:

> Hi,
>
> We are trying to set up a 3-node cluster with 20 TB of disk on each node.
> It's a bare-metal setup with 44 cores on each node.
>
> So in total: 60 TB, 66 cores, 3-node cluster.
>
> The data velocity is very low, with low access rates.
>
> Has anyone tried this configuration?
>
> It's a bit urgent.
>
> Regards,
> -A
>
>
>