Re: Big Data Question

2023-08-21 Thread daemeon reiydelle
- k8s

   1. depending on the version and networking, number of containers per
   node, nodepooling, etc. you can expect to see 1-2% additional storage IO
   latency (depends on whether all are on the same network vs. separate
   storage IO TCP network)
   2. System overhead may be 3-15% depending on what security mitigations
   are in place (if you own the systems and workload is dedicated, turn them
   off!)
   3. c* pod loss recovery is the big win here. Pod failure and recovery
   (e.g. to another node) will bring up the SAME c* node as of the time of
   the failure, so only a few updates need to catch up. Perhaps 2x
   replication, or none if the storage itself is replicated (see the quick
   check below).
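
A quick way to see that behaviour (this assumes a StatefulSet named
"cassandra" with persistent volumes; names and labels are illustrative):

kubectl delete pod cassandra-2                 # simulate a pod loss
kubectl get pods -l app=cassandra -w           # watch it get rescheduled
kubectl exec cassandra-0 -- nodetool status    # the same host ID should return to UN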

I wonder if you folks have already set out OLA's for "minimum outage" with
no data loss? Write amplification is mostly only a problem when networks
are heavily used. May not even be an issue in your case.
*.*
*Arthur C. Clarke famously said that "technology sufficiently advanced is
indistinguishable from magic." Magic is coming, and it's coming for all of
us*

*Daemeon Reiydelle*
*email: daeme...@gmail.com *
*LI: https://www.linkedin.com/in/daemeonreiydelle/
<https://www.linkedin.com/in/daemeonreiydelle/>*
*San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle*


On Mon, Aug 21, 2023 at 8:49 AM Patrick McFadin  wrote:

> ...and a shameless plug for the Cassandra Summit in December. We have a
> talk from somebody that is doing 70TB per node and will be digging into all
> the aspects that make that work for them. I hope everyone in this thread is
> at that talk! I can't wait to hear all the questions.
>
> Patrick
>
> On Mon, Aug 21, 2023 at 8:01 AM Jeff Jirsa  wrote:
>
>> There's a lot of questionable advice scattered in this thread. Set aside
>> most of the guidance like 2TB/node, it's old and super nuanced.
>>
>> If you're bare metal, do what your organization is good at. If you have
>> millions of dollars in SAN equipment and you know how SANs work and fail
>> and get backed up, run on a SAN if your organization knows how to properly
>> operate a SAN. Just make sure you understand it's a single point of failure.
>>
>> If you're in the cloud, EBS is basically the same concept. You can lose
>> EBS in an AZ, just like you can lose SAN in a DC. Persist outside of that.
>> Have backups. Know how to restore them.
>>
>> The reason the "2TB/node" limit was a thing was around time to recover
>> from failure more than anything else. I described this in detail here, in
>> 2015, before faster-streaming in 4.0 was a thing :
>> https://stackoverflow.com/questions/31563447/cassandra-cluster-data-density-data-size-per-node-looking-for-feedback-and/31690279
>> . With faster streaming, IF you use LCS (so faster streaming works), you
>> can probably go at least 4-5x more dense than before, if you understand how
>> likely your disks are to fail and you can ensure you don't have correlated
>> failures when they age out (that means if you're on bare metal, measuring
>> flash life, and ideally mixing vendors to avoid firmware bugs).
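>>
>> (For reference, switching a table to LCS is a one-line change; the
>> keyspace and table names here are placeholders.)
>>
>> ALTER TABLE my_ks.my_table
>>   WITH compaction = {'class': 'LeveledCompactionStrategy'};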
>>
>> You'll still see risks of huge clusters, largely in gossip and schema
>> propagation. Upcoming CEPs address those. 4.0 is better there (with schema,
>> especially) than 3.0 was, but for "max nodes in a cluster", what you're
>> really comparing is "how many gossip speakers and tokens are in the
>> cluster" (which means your vnode settings matter, for things like pending
>> range calculators).
>>
>> Looking at the roadmap, your real question comes down to :
>> - If you expect to use the transactional features in Accord/5.0 to
>> transact across rows/keys, you probably want to keep one cluster
>> - If you don't ever expect to use multi-key transactions, just de-risk by
>> sharding your cluster into many smaller clusters now, with consistent
>> hashing to map keys to clusters, and have 4 clusters of the same smaller
>> size, with whatever node density you think you can do based on your
>> compaction strategy and streaming rate (and disk type); a toy sketch of
>> the key-to-cluster routing follows.
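>>
>> (This is plain hash-mod with a fixed cluster count rather than a true
>> consistent-hash ring; the key and cluster names are made up.)
>>
>> clusters=4
>> key="customer:12345"
>> bucket=$(( 0x$(printf '%s' "$key" | md5sum | cut -c1-8) % clusters ))
>> echo "route $key to cluster-$bucket"   # the same key always maps to the same cluster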
>>
>> If you have time and budget, create a 3 node cluster with whatever disks
>> you have, fill them, start working on them - expand to 4, treat one as
>> failed and replace it - simulate the operations you'll do at that size.
>> It's expensive to mimic a 500 host cluster, but if you've got budget, try
>> it in AWS and see what happens when you apply your real schema, and then do
>> a schema change.
>>
>>
>>
>>
>>
>> On Mon, Aug 21, 2023 at 7:31 AM Joe Obernberger <
>> joseph.obernber...@gmail.com> wrote:
>>
>>> For our scenario, the goal is to minimize down-time for a single (at
>>> l

Re: Big Data Question

2023-08-17 Thread daemeon reiydelle
I started to respond, then realized I and the other posters are not
thinking the same way: what is the business case for availability and data
loss/reload/recoverability? You all argue for higher availability and damn
the cost. But no one asked "can you lose access, for 20 minutes, to a
portion of the data, 10 times a year, on a 250 node cluster in AWS, if the
data is not actually lost?" Or would you rather lose access only 1-2 times
a year, for the cost of a 500 node cluster holding the same data?

Then we can discuss 32/64g JVM and SSD's.
*.*
*Arthur C. Clarke famously said that "technology sufficiently advanced is
indistinguishable from magic." Magic is coming, and it's coming for all of
us*

*Daemeon Reiydelle*
*email: daeme...@gmail.com *
*LI: https://www.linkedin.com/in/daemeonreiydelle/
<https://www.linkedin.com/in/daemeonreiydelle/>*
*San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle*


On Thu, Aug 17, 2023 at 1:53 PM Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> Was assuming reaper did incremental?  That was probably a bad assumption.
>
> nodetool repair -pr
> I know it well now!
>
> :)
>
> -Joe
>
> On 8/17/2023 4:47 PM, Bowen Song via user wrote:
> > I don't have experience with Cassandra on Kubernetes, so I can't
> > comment on that.
> >
> > For repairs, may I interest you in incremental repairs? They will make
> > repairs a hell of a lot faster. Of course, an occasional full repair is
> > still needed, but that's another story.
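> >
> > For reference (defaults vary by version, so check yours):
> >
> > nodetool repair            # incremental repair is the default on recent versions
> > nodetool repair -full -pr  # the occasional full repair, primary ranges only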
> >
> >
> > On 17/08/2023 21:36, Joe Obernberger wrote:
> >> Thank you.  Enjoying this conversation.
> >> Agree on blade servers, where each blade has a small number of SSDs.
> >> Yeh/Nah to a kubernetes approach assuming fast persistent storage?  I
> >> think that might be easier to manage.
> >>
> >> In my current benchmarks, the performance is excellent, but the
> >> repairs are painful.  I come from the Hadoop world where it was all
> >> about large servers with lots of disk.
> >> Relatively small number of tables, but some have a high number of
> >> rows, 10bil + - we use spark to run across all the data.
> >>
> >> -Joe
> >>
> >> On 8/17/2023 12:13 PM, Bowen Song via user wrote:
> >>> The optimal node size largely depends on the table schema and
> >>> read/write pattern. In some cases 500 GB per node is too large, but
> >>> in some other cases 10TB per node works totally fine. It's hard to
> >>> estimate that without benchmarking.
> >>>
> >>> Again, just pointing out the obvious, you did not count the off-heap
> >>> memory and page cache. 1TB of RAM for 24GB heap * 40 instances is
> >>> definitely not enough. You'll most likely need between 1.5 and 2 TB
> >>> memory for 40x 24GB heap nodes. You may be better off with blade
> >>> servers than a single server with gigantic memory and disk sizes.
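> >>>
> >>> Rough arithmetic behind that range: 40 x 24 GB = 960 GB of heap alone;
> >>> allow roughly another 14-26 GB per instance for off-heap structures
> >>> (memtables, bloom filters, index summaries) plus page cache, and you
> >>> land at 40 x (24 + 14) ≈ 1.5 TB up to 40 x (24 + 26) = 2 TB.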
> >>>
> >>>
> >>> On 17/08/2023 15:46, Joe Obernberger wrote:
> >>>> Thanks for this - yeah - duh - forgot about replication in my example!
> >>>> So - is 2TBytes per Cassandra instance advisable?  Better to use
> >>>> more/less?  Modern 2U servers can be had with 24 3.8TByte SSDs; so
> >>>> assume 80Tbytes per server, you could do:
> >>>> (1024*3)/80 = 39 servers, but you'd have to run 40 instances of
> >>>> Cassandra on each server; maybe 24G of heap per instance, so a
> >>>> server with 1TByte of RAM would work.
> >>>> Is this what folks would do?
> >>>>
> >>>> -Joe
> >>>>
> >>>> On 8/17/2023 9:13 AM, Bowen Song via user wrote:
> >>>>> Just pointing out the obvious, for 1PB of data on nodes with 2TB
> >>>>> disk each, you will need far more than 500 nodes.
> >>>>>
> >>>>> 1, it is unwise to run Cassandra with replication factor 1. It
> >>>>> usually makes sense to use RF=3, so 1PB data will cost 3PB of
> >>>>> storage space, minimal of 1500 such nodes.
> >>>>>
> >>>>> 2, depending on the compaction strategy you use and the write
> >>>>> access pattern, there's a disk space amplification to consider.
> >>>>> For example, with STCS, the disk usage can be many times of the
> >>>>> actual live data size.
> >>>>>
> >>>>> 3, you will need some extra free disk space as temporary space for
> >>>>> running compactions.
> >>>>>
> >>>>> 4, the data 

Re: Big Data Question

2023-08-17 Thread daemeon reiydelle
A lot of (actually all) the replies seem to be based on local nodes with
1Gb networks of spinning rust. Much of what is mentioned below is TOTALLY
wrong for cloud. So clarify whether you are in the "real world" or the
rusty, slow data center world (definitely not a modern DC either).

E.g. "a node should not handle more than 2TB of ACTIVE disk" was advice for
spinning rust with maybe 1Gb networks. 10TB of modern high speed SSD is
more typical with 10 or 40Gb networks. If data is persisted to cloud
storage, replication should be 1; VMs fail over to new hardware. Obviously
if your storage is ephemeral, you have a different discussion. More of a
monologue with an idiot in Finance, but 
*.*
*Arthur C. Clarke famously said that "technology sufficiently advanced is
indistinguishable from magic." Magic is coming, and it's coming for all of
us*

*Daemeon Reiydelle*
*email: daeme...@gmail.com *
*LI: https://www.linkedin.com/in/daemeonreiydelle/
<https://www.linkedin.com/in/daemeonreiydelle/>*
*San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle*


On Thu, Aug 17, 2023 at 6:13 AM Bowen Song via user <
user@cassandra.apache.org> wrote:

> Just pointing out the obvious, for 1PB of data on nodes with 2TB disk
> each, you will need far more than 500 nodes.
>
> 1, it is unwise to run Cassandra with replication factor 1. It usually
> makes sense to use RF=3, so 1PB data will cost 3PB of storage space,
> minimal of 1500 such nodes.
>
> 2, depending on the compaction strategy you use and the write access
> pattern, there's a disk space amplification to consider. For example,
> with STCS, the disk usage can be many times of the actual live data size.
>
> 3, you will need some extra free disk space as temporary space for
> running compactions.
>
> 4, the data is rarely going to be perfectly evenly distributed among all
> nodes, and you need to take that into consideration and size the nodes
> based on the node with the most data.
>
> 5, enough of bad news, here's a good one. Compression will save you (a
> lot) of disk space!
>
> With all the above considered, you probably will end up with a lot more
> than the 500 nodes you initially thought. Your choice of compaction
> strategy and compression ratio can dramatically affect this calculation.
>
>
> On 16/08/2023 16:33, Joe Obernberger wrote:
> > General question on how to configure Cassandra.  Say I have 1PByte of
> > data to store.  The general rule of thumb is that each node (or at
> > least instance of Cassandra) shouldn't handle more than 2TBytes of
> > disk.  That means 500 instances of Cassandra.
> >
> > Assuming you have very fast persistent storage (such as a NetApp,
> > PorterWorx etc.), would using Kubernetes or some orchestration layer
> > to handle those nodes be a viable approach?  Perhaps the worker nodes
> > would have enough RAM to run 4 instances (pods) of Cassandra, you
> > would need 125 servers.
> > Another approach is to build your servers with 5 (or more) SSD devices
> > - one for OS, four for each instance of Cassandra running on that
> > server.  Then build some scripts/ansible/puppet that would manage
> > Cassandra start/stops, and other maintenance items.
> >
> > Where I think this runs into problems is with repairs, or
> > sstablescrubs that can take days to run on a single instance.  How is
> > that handled 'in the real world'?  With seed nodes, how many would you
> > have in such a configuration?
> > Thanks for any thoughts!
> >
> > -Joe
> >
> >
>


Re: TLS/SSL overhead

2022-02-06 Thread daemeon reiydelle
The % numbers seem high for a clean network and a reasonably fast client.
The 5% is really not reasonable. No jumbo frames? No network retries
(netstats)?



*Daemeon Reiydelle*

*email: daeme...@gmail.com *
*San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle*

*"Why is it so hard to rhyme either Life or Love?" - Sondheim*


On Sun, Feb 6, 2022 at 6:06 PM Dinesh Joshi  wrote:

> I wish there was an easy answer to this question. Like you pointed out it
> is hardware dependent but software stack plays a big part. For instance,
> the JVM you're running makes a difference too. Cassandra comes with netty
> and IIRC we include tcnative which accelerates TLS. You could also slip
> Amazon's Corretto Crypto Provider into your runtime. I am not suggesting
> using everything all at once but a combination of libraries, runtimes, JVM,
> OS, cipher suites can make a big difference. Therefore it is best to try it
> out on your stack.
>
> Typically modern hardware has accelerators for common encryption
> algorithms. If the software stack enables you to optimally take advantage
> of the hardware then you could see very little to no impact on latencies.
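>
> A quick way to gauge what the accelerator buys you on a given box (this
> assumes OpenSSL built with AES-NI support; the mask value simply disables
> AES-NI for the comparison run):
>
> openssl speed -evp aes-128-gcm
> OPENSSL_ia32cap="~0x200000200000000" openssl speed -evp aes-128-gcm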
>
> Cassandra maintains persistent connections therefore the visible impact is
> on connection establishment time (TLS handshake is expensive). Encryption
> will make thundering herd problems worse. You should watch out for those
> two issues.
>
> Dinesh
>
>
> On Feb 5, 2022, at 3:53 AM, onmstester onmstester 
> wrote:
>
> Hi,
>
> Anyone measured impact of wire encryption using TLS
> (client_encryption/server_encryption) on cluster latency/throughput?
> It may be dependent on Hardware or even data model but I already did some
> sort of measurements and got to 2% for client encryption and 3-5% for
> client + server encryption and wanted to validate that with community.
>
> Best Regards
>
> Sent using Zoho Mail <https://www.zoho.com/mail/>
>
>
>
>


Re: about memory problem in write heavy system..

2022-01-07 Thread daemeon reiydelle
Maybe SSD's? Take a look at the IO read/write wait times.

FYI, your config changes simply push more activity into memory. Trading IO
for mem footprint ;{)
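
For example (assuming the sysstat package is installed):

iostat -xz 5 3     # per-device await / %util, three 5-second samples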

*Daemeon Reiydelle*

*email: daeme...@gmail.com *
*San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle*

Cognitive Bias: (written in 1935) ...
One of the painful things about our time is that those who feel certainty
are stupid, and those with any imagination and understanding are filled
with doubt and indecision. - Bertrand Russel



On Fri, Jan 7, 2022 at 8:27 AM Jeff Jirsa  wrote:

> 3.11.4 is a very old release, with lots of known bugs. It's possible the
> memory is related to that.
>
> If you bounce one of the old nodes, where does the memory end up?
>
>
> On Thu, Jan 6, 2022 at 3:44 PM Eunsu Kim  wrote:
>
>>
>> Looking at the memory usage chart, it seems that the physical memory
>> usage of the existing node has increased since the new node was added with
>> auto_bootstrap=false.
>>
>>
>>
>>
>> On Fri, Jan 7, 2022 at 1:11 AM Eunsu Kim  wrote:
>>
>>> Hi,
>>>
>>> I have a Cassandra cluster(3.11.4) that does heavy writing work.
>>> (14k~16k write throughput per second per node)
>>>
>>> Nodes are physical machine in data center. Number of nodes are 30. Each
>>> node has three data disks mounted.
>>>
>>>
>>> A few days ago, a QueryTimeout problem occurred due to Full GC.
>>> So, referring to this blog(
>>> https://thelastpickle.com/blog/2018/04/11/gc-tuning.html), it seemed to
>>> have been solved by changing the memtable_allocation_type to
>>> offheap_objects.
>>>
>>> But today, I got an alarm saying that some nodes are using more than 90%
>>> of physical memory. (115GiB /125GiB)
>>>
>>> Native memory usage of some nodes is gradually increasing.
>>>
>>>
>>>
>>> All tables use TWCS, and TTL is 2 weeks.
>>>
>>> Below is the applied jvm option.
>>>
>>> -Xms31g
>>> -Xmx31g
>>> -XX:+UseG1GC
>>> -XX:G1RSetUpdatingPauseTimePercent=5
>>> -XX:MaxGCPauseMillis=500
>>> -XX:InitiatingHeapOccupancyPercent=70
>>> -XX:ParallelGCThreads=24
>>> -XX:ConcGCThreads=24
>>> …
>>>
>>>
>>> What additional things can I try?
>>>
>>> I am looking forward to the advice of experts.
>>>
>>> Regards.
>>>
>>
>>


Re: Latest Supported RedHat Linux version for Cassandra 3.11

2021-09-27 Thread daemeon reiydelle
Cassandra 3.11 runs on RHEL through 8.4 for sure.

*Daemeon Reiydelle*
*email: daeme...@gmail.com *
*LI: https://www.linkedin.com/in/daemeonreiydelle/
<https://www.linkedin.com/in/daemeonreiydelle/>*
*San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle*

*“*“I have a different idea of elegance. I don't dress like a fop, it’s
true, but my moral grooming is impeccable. I never appear in public with a
soiled conscience, a tarnished honor, threadbare scruples, or an insult
that I haven't washed away. I'm always immaculately clean, adorned with
independence and frankness. I may not cut a stylish figure, but I hold my
soul erect. I wear my deeds as ribbons, my wit is sharper than the finest
mustache, and when I walk among men I make truths ring like spurs.”

Edmond Rostand, Cyrano de Bergerac


On Mon, Sep 27, 2021 at 9:10 AM Saha, Sushanta K <
sushanta.s...@verizonwireless.com> wrote:

> I am currently running Open Source Apache Cassandra 3.11.1 on RedHat 7.7.
> But, need to upgrade the OS to RedHat to 7.9 or 8.x.
>
> The site
> cassandra.apache.org/doc/latest/cassandra/getting_started/installing.html
> has listed "CentOS & RedHat Enterprise Linux (RHEL) including 6.6 to 7.7".
> FYI.
>
> Question : Can I run Cassandra 3.11.1 on RedHat 7.9 or 8.x?
>
> Thanks
>  Sushanta
>
>


Re: High mutation stage in multi dc deployment

2021-07-19 Thread daemeon reiydelle
You may want to think about the latency impacts of a cluster that has one
node "far away". This is such a basic design flaw that you need to do some
basic learning, and some basic understanding of networking and latency.





On Mon, Jul 19, 2021 at 10:38 AM MyWorld  wrote:

> Hi all,
>
> Currently we have a cluster with 2 DC of 3 nodes each. One DC is in GCP-US
> while other is in GCP-India. Just to add here, configuration of every node
> accross both DC is same. Cpu-6, Ram-32gb, Heap-8gb
>
> We do all our write on US data center. While performing a bulk write on
> GCP US, we observe normal load of 1 on US while this load at GCP India
> spikes to 10.
>
> On observing tpstats further in grafana we found mutation stage at GCP
> India is going to 1million intermittently though our overall write is
> nearly 300 per sec per node. Don't know the reason but whenever we have
> this spike, we are having load issue.
> Please help what could be the possible reason for this?
>
> Regards,
> Ashish
>


Re: underutilized servers

2021-03-05 Thread daemeon reiydelle
You did not specify read and write consistency levels; the default would be
to hit two nodes (one for data, one for digest) with every query. "Network
load is around 50%" is not too helpful on its own. 1Gbit? 10Gbit? 50% of
each direction, or an average of both?
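
For example (assuming the sysstat package and an interface called eth0;
both are guesses):

ethtool eth0 | grep Speed       # is the link 1 or 10 Gbit?
sar -n DEV 5 3                  # rx vs tx kB/s per interface, each direction separately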

Iowait is not great for a system of this size: assuming that you have 3
vm's on THREE SEPARATE physical systems and WITHOUT network attached storage
...


*Daemeon Reiydelle*
*email: daeme...@gmail.com *
*LI: https://www.linkedin.com/in/daemeonreiydelle/
<https://www.linkedin.com/in/daemeonreiydelle/>*
*San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle*

"Life should not be a journey to the grave with the intention of arriving
safely in a pretty and well preserved body, but rather to skid in broadside
in a cloud of smoke, thoroughly used up, totally worn out, and loudly
proclaiming “Wow! What a Ride!" - Hunter S. Thompson


On Fri, Mar 5, 2021 at 6:48 AM Attila Wind  wrote:

> Hi guys,
>
> I have a DevOps related question - hope someone here could give some
> ideas/pointers...
>
> We are running a 3 nodes Cassandra cluster
> Recently we realized we do have performance issues. And based on
> investigation we took it seems our bottleneck is the Cassandra cluster. The
> application layer is waiting a lot for Cassandra ops. So queries are
> running slow on Cassandra side however due to our monitoring it looks the
> Cassandra servers still have lots of free resources...
>
> The Cassandra machines are virtual machines (we do own the physical hosts
> too) built with kvm - with 6 CPU cores (3 physical) and 32GB RAM dedicated
> to it.
> We are using Ubuntu Linux 18.04 distro - everywhere the same version (the
> physical and virtual host)
> We are running Cassandra 4.0-alpha4
>
> What we see is
>
>- CPU load is around 20-25% - so we have lots of spare capacity
>- iowait is around 2-5% - so disk bandwidth should be fine
>- network load is around 50% of the full available bandwidth
>- loadavg is max around 4 - 4.5 but typically around 3 (because of the
>cpu count 6 should represent 100% load)
>
> and still, query performance is slow ... and we do not understand what
> could hold Cassandra back to fully utilize the server resources...
>
> We are clearly missing something!
> Anyone any idea / tip?
>
> thanks!
> --
> Attila Wind
>
> http://www.linkedin.com/in/attilaw
> Mobile: +49 176 43556932
>
>
>


Re: AWS ephemeral instances + backup

2019-12-05 Thread daemeon reiydelle
If you can handle the slower IO of S3 this can work, but you will have a
window of out-of-date images. You don't have a concept of persistent
snapshots.

<==>
Life lived is not about the size of the dog in the fight:
It is about the size of the fight in the dog.

*Daemeon Reiydelle*

*email: daeme...@gmail.com *
*San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle*



On Thu, Dec 5, 2019 at 2:06 PM Jon Haddad  wrote:

> You can easily do this with bcache or LVM
> http://rustyrazorblade.com/post/2018/2018-04-24-intro-to-lvm/.
>
> Medusa might be a good route to go down if you want to do backups instead:
> https://thelastpickle.com/blog/2019/11/05/cassandra-medusa-backup-tool-is-open-source.html
>
>
>
> On Thu, Dec 5, 2019 at 12:21 PM Carl Mueller
>  wrote:
>
>> Does anyone have experience tooling written to support this strategy:
>>
>> Use case: run cassandra on i3 instances on ephemerals but synchronize the
>> sstables and commitlog files to the cheapest EBS volume type (those have
>> bad IOPS but decent enough throughput)
>>
>> On node replace, the startup script for the node, back-copies the
>> sstables and commitlog state from the EBS to the ephemeral.
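>>
>> Roughly something like this (paths are made up, and this glosses over
>> flushing/quiescing before the copy):
>>
>> # periodically / on clean shutdown: push state from ephemeral to the EBS mount
>> rsync -a --delete /mnt/ephemeral/cassandra/ /mnt/ebs/cassandra/
>> # on the replacement node, before starting cassandra: pull it back
>> rsync -a /mnt/ebs/cassandra/ /mnt/ephemeral/cassandra/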
>>
>> As can be seen:
>> https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html
>>
>> the (presumably) spinning rust tops out at 2375 MB/sec (using
>> multiple EBS volumes presumably) that would incur about a ten minute delay
>> for node replacement for a 1TB node, but I imagine this would only be used
>> on higher IOPS r/w nodes with smaller densities, so 100GB would be about a
>> minute of delay only, already within the timeframes of an AWS node
>> replacement/instance restart.
>>
>>
>>


Re: JOB | The Last Pickle (Consultant) in USA

2019-11-20 Thread daemeon reiydelle
Sounds VERY interesting! If the resume passes the BS sniff test (I do big
data, which has included C* for a NUMBER of years), I would love to chat.
FYI I do a fair amount of readiness assessments, before, during (with
laughable results), and now/after my tenure at Accenture/Avanade.

Cheers, D.

<==>
Made weak by time and fate, but strong in will,
To strive, to seek, to find, and not to yield.
Ulysses - A. Lord Tennyson

*Daemeon C.M. Reiydelle*

*email: daeme...@gmail.com *
*San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle*



On Wed, Nov 20, 2019 at 6:24 AM Mick Semb Wever  wrote:

>
> The Last Pickle is hiring in the US:
> https://thelastpickle.com/blog/2019/10/24/tlp-is-hiring-another-consultant.html
>
> If you enjoy Cassandra like we do, and are keen to join our team, reach
> out (see details in link above).
>
> regards,
> Mick
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org

Re: Aws instance stop and star with ebs

2019-11-06 Thread daemeon reiydelle
No connection timeouts? No TCP level retries? I am truly sorry, but you
have exceeded my capability. I have never seen a java.io timeout without
either a session half-open failure (no response) or multiple retries.

I am out of my depth, so please feel free to ignore this, but did you see
the packets that are making the initial connection (which must have timed
out)? Out of curiosity, a netstat -arn should be showing bad packets,
timeouts, etc. To see progress, create a simple shell script that dumps the
date, dumps netstat, sleeps 100 seconds, and repeats. During that window,
stop the remote node, wait 10 seconds, and restart it.
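
Something like this (untested sketch):

while true; do
  date
  netstat -s | egrep -i 'retrans|timeout|fail'   # protocol counters worth watching
  sleep 100
done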

<==>
Made weak by time and fate, but strong in will,
To strive, to seek, to find, and not to yield.
Ulysses - A. Lord Tennyson

*Daemeon C.M. Reiydelle*

*email: daeme...@gmail.com *
*San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle*



On Wed, Nov 6, 2019 at 9:11 AM Rahul Reddy  wrote:

> Thank you.
>
> I have stopped instance in east. i see that all other instances can gossip
> to that instance and only one instance in west having issues gossiping to
> that node.  when i enable debug mode i see below on the west node
>
> i see bellow messages from 16:32 to 16:47
> DEBUG [RMI TCP Connection(272)-127.0.0.1] 2019-11-06 16:44:50,
> 417 StorageProxy.java:2361 - Hosts not in agreement. Didn't get a response
> from everybody:
> 424 StorageProxy.java:2361 - Hosts not in agreement. Didn't get a response
> from everybody:
>
> later i see timeout
>
> DEBUG [MessagingService-Outgoing-/eastip-Gossip] 2019-11-06 16:47:04,831
> OutboundTcpConnection.java:350 - Error writing to /eastip
> java.io.IOException: Connection timed out
>
> then  INFO  [GossipStage:1] 2019-11-06 16:47:05,792 StorageService.j
> ava:2289 - Node /eastip state jump to NORMAL
>
> DEBUG [GossipStage:1] 2019-11-06 16:47:06,244 MigrationManager
> .java:99 - Not pulling schema from /eastip, because sche
> ma versions match: local/real=cdbb639b-1675-31b3-8a0d-84aca18e
> 86bf, local/compatible=49bf1daa-d585-38e0-a72b-b36ce82da9cb, r
> emote=cdbb639b-1675-31b3-8a0d-84aca18e86bf
>
> i tried running some tcpdump during that time i dont see any packet loss
> during that time.  still unsure why east instance which was stopped and
> started unreachable to west node almost for 15 minutes.
>
>
> On Tue, Nov 5, 2019 at 10:14 PM daemeon reiydelle 
> wrote:
>
>> 10 minutes is 600 seconds, and there are several timeouts that are set to
>> that, including the data center timeout as I recall.
>>
>> You may be forced to tcpdump the interface(s) to see where the chatter
>> is. Out of curiosity, when you restart the node, have you snapped the jvm's
>> memory to see if e.g. heap is even in use?
>>
>>
>> On Tue, Nov 5, 2019 at 7:03 PM Rahul Reddy 
>> wrote:
>>
>>> Thanks Ben,
>>> Before stoping the ec2 I did run nodetool drain .so i ruled it out and
>>> system.log also doesn't show commitlogs being applied.
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Nov 5, 2019, 7:51 PM Ben Slater 
>>> wrote:
>>>
>>>> The logs between first start and handshaking should give you a clue but
>>>> my first guess would be replaying commit logs.
>>>>
>>>> Cheers
>>>> Ben
>>>>
>>>> ---
>>>>
>>>>
>>>> *Ben Slater**Chief Product Officer*
>>>>
>>>> <https://www.instaclustr.com/platform/>
>>>>
>>>> <https://www.facebook.com/instaclustr>
>>>> <https://twitter.com/instaclustr>
>>>> <https://www.linkedin.com/company/instaclustr>
>>>>
>>>> Read our latest technical blog posts here
>>>> <https://www.instaclustr.com/blog/>.
>>>>
>>>> This email has been sent on behalf of Instaclustr Pty. Limited
>>>> (Australia) and Instaclustr Inc (USA).
>>>>
>>>> This email and any attachments may contain confidential and legally
>>>> privileged information.  If you are not the intended recipient, do not copy
>>>> or disclose its content, but please reply to this email immediately and
>>>> highlight the error to the sender and then immediately delete the message.
>>>>
>>>>
>>>> On Wed, 6 Nov 2019 at 04:36, Rahul Reddy 
>>>> wrote:
>>>>
>>>>> I can reproduce the issue.
>>>>>
>>>>> I did drain Cassandra node then stop and started Cassandra instance .
>>>>> Cassandra instance comes up but other nodes will be in DN state around 10
>>>>> minutes.
>&g

Re: Aws instance stop and star with ebs

2019-11-05 Thread daemeon reiydelle
10 minutes is 600 seconds, and there are several timeouts that are set to
that, including the data center timeout as I recall.

You may be forced to tcpdump the interface(s) to see where the chatter is.
Out of curiosity, when you restart the node, have you snapped the jvm's
memory to see if e.g. heap is even in use?
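
For example (the pid is a placeholder for Cassandra's process id):

nodetool info | grep -i heap          # Cassandra's own view of heap / off-heap use
jstat -gcutil <cassandra-pid> 5000    # live GC and heap occupancy every 5 seconds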


On Tue, Nov 5, 2019 at 7:03 PM Rahul Reddy  wrote:

> Thanks Ben,
> Before stoping the ec2 I did run nodetool drain .so i ruled it out and
> system.log also doesn't show commitlogs being applied.
>
>
>
>
>
> On Tue, Nov 5, 2019, 7:51 PM Ben Slater 
> wrote:
>
>> The logs between first start and handshaking should give you a clue but
>> my first guess would be replaying commit logs.
>>
>> Cheers
>> Ben
>>
>> ---
>>
>>
>> *Ben Slater**Chief Product Officer*
>>
>> 
>>
>> 
>> 
>> 
>>
>> Read our latest technical blog posts here
>> .
>>
>> This email has been sent on behalf of Instaclustr Pty. Limited
>> (Australia) and Instaclustr Inc (USA).
>>
>> This email and any attachments may contain confidential and legally
>> privileged information.  If you are not the intended recipient, do not copy
>> or disclose its content, but please reply to this email immediately and
>> highlight the error to the sender and then immediately delete the message.
>>
>>
>> On Wed, 6 Nov 2019 at 04:36, Rahul Reddy 
>> wrote:
>>
>>> I can reproduce the issue.
>>>
>>> I did drain Cassandra node then stop and started Cassandra instance .
>>> Cassandra instance comes up but other nodes will be in DN state around 10
>>> minutes.
>>>
>>> I don't see error in the systemlog
>>>
>>> DN  xx.xx.xx.59   420.85 MiB  256  48.2% id  2
>>> UN  xx.xx.xx.30   432.14 MiB  256  50.0% id  0
>>> UN  xx.xx.xx.79   447.33 MiB  256  51.1% id  4
>>> DN  xx.xx.xx.144  452.59 MiB  256  51.6% id  1
>>> DN  xx.xx.xx.19   431.7 MiB  256  50.1% id  5
>>> UN  xx.xx.xx.6421.79 MiB  256  48.9%
>>>
>>> when i do nodetool status 3 nodes still showing down. and i dont see
>>> errors in system.log
>>>
>>> and after 10 mins it shows the other node is up as well.
>>>
>>>
>>> INFO  [HANDSHAKE-/10.72.100.156] 2019-11-05 15:05:09,133
>>> OutboundTcpConnection.java:561 - Handshaking version with /stopandstarted
>>> node
>>> INFO  [RequestResponseStage-7] 2019-11-05 15:16:27,166
>>> Gossiper.java:1019 - InetAddress /nodewhichitwasshowing down is now UP
>>>
>>> what is causing delay for 10mins to be able to say that node is reachable
>>>
>>> On Wed, Oct 30, 2019, 8:37 AM Rahul Reddy 
>>> wrote:
>>>
 And also aws ec2 stop and start comes with new instance with same ip
 and all our file systems are in ebs mounted fine.  Does coming new instance
 with same ip cause any gossip issues?

 On Tue, Oct 29, 2019, 6:16 PM Rahul Reddy 
 wrote:

> Thanks Alex. We have 6 nodes in each DC with RF=3  with CL local
> qourum . and we stopped and started only one instance at a time . Tough
> nodetool status says all nodes UN and system.log says canssandra started
> and started listening . Jmx explrter shows instance stayed down longer how
> do we determine what caused  the Cassandra unavialbe though log says its
> stared and listening ?
>
> On Tue, Oct 29, 2019, 4:44 PM Oleksandr Shulgin <
> oleksandr.shul...@zalando.de> wrote:
>
>> On Tue, Oct 29, 2019 at 9:34 PM Rahul Reddy 
>> wrote:
>>
>>>
>>> We have our infrastructure on aws and we use ebs storage . And aws
>>> was retiring on of the node. Since our storage was persistent we did
>>> nodetool drain and stopped and start the instance . This caused 500 
>>> errors
>>> in the service. We have local_quorum and rf=3 why does stopping one
>>> instance cause application to have issues?
>>>
>>
>> Can you still look up what was the underlying error from Cassandra
>> driver in the application logs?  Was it request timeout or not enough
>> replicas?
>>
>> For example, if you only had 3 Cassandra nodes, restarting one of
>> them reduces your cluster capacity by 33% temporarily.
>>
>> Cheers,
>> --
>> Alex
>>
>>


Re: Ram & Space...

2019-10-23 Thread daemeon reiydelle
Pretty clear evidence of a memory leak, tombstone problem (still memory),
etc.

If this is Apache Cassandra, then you may need to do some heap dumps and
see what is going on (if it is the Java heap that is OOM'ing, which I
suspect). Might want to do some periodic vmstat or equivalent; a brute
force approach is periodic captures of top sorted by %MEM, to see which
process is actually leaking.
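
For example (pid and dump path are placeholders):

top -b -o %MEM -n 1 | head -20
jmap -dump:live,format=b,file=/tmp/cassandra-heap.hprof <cassandra-pid>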

But (smiling) could be the spirits telling you to actually USE the cluster?

<==>
Made weak by time and fate, but strong in will,
To strive, to seek, to find, and not to yield.
Ulysses - A. Lord Tennyson

*Daemeon C.M. Reiydelle*

*email: daeme...@gmail.com *
*San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle*



On Wed, Oct 23, 2019 at 1:45 PM Paul Chandler  wrote:

> We had what sounds like a similar problem with a DSE cluster a little
> while ago, It was not being used, and had no tables in it. The memory kept
> rising until it was killed by the oom-killer.
>
> We spent along time trying to get to the bottom of the problem, but it
> suddenly stopped when the developers started using the cluster. Perhaps the
> same will happen when you start using yours.
>
> Thanks
>
> Paul
>
> On 23 Oct 2019, at 18:26, A  wrote:
>
> Thank you. But I have added any tables yet. It’s empty...
>
>
> Sent from Yahoo Mail for iPhone
> 
>
> On Tuesday, October 22, 2019, 1:15 AM, Matthias Pfau <
> matthias.p...@tutao.de.INVALID> wrote:
>
> Did you check nodetool status and logs? If so, what is reported?
>
> Regarding that more and more memory is used. This might be a problem with
> your table design. I would start analyzing nodetool tablestats output. It
> reports how much memory (especially off heap) is used by which table.
>
> Best,
> Matthias
>
>
> Oct 19, 2019, 18:46 by htt...@yahoo.com.INVALID:
> What are minimum and recommended ram and space requirements to run
> Cassandra in AWS?
>
> Every like 24 hours Cassandra stops working. Even though the service is
> active, it’s dead and non responsive until I restart the service.
>
> Top shows %MEM slowly creeping upwards. Yesterday it showed 75%.
>
> In the logs it throws that Cassandra is running in degraded mode and that
> I should consider adding more space to the free 25G...
>
> Thanks in advance for your help. Newbie here... lots to learn.
>
> Angel
>
>
> Sent from Yahoo Mail for iPhone
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>
>


Re: Looking for feedback on automated root-cause system

2019-02-19 Thread daemeon reiydelle
Welcome to the world of testing predictive analytics. I will pass this on
to my folks at Accenture; I know of a couple of C* clients we run, and I'm
wondering what you had in mind?


*Daemeon C.M. Reiydelle*

*email: daeme...@gmail.com *
*San Francisco 1.415.501.0198/London 44 020 8144 9872/Skype
daemeon.c.m.reiydelle*



On Tue, Feb 19, 2019 at 3:35 PM Matthew Stump  wrote:

> Howdy,
>
> I’ve been engaged in the Cassandra user community for a long time, almost
> 8 years, and have worked on hundreds of Cassandra deployments. One of the
> things I’ve noticed in myself and a lot of my peers that have done
> consulting, support or worked on really big deployments is that we get
> burnt out. We fight a lot of the same fires over and over again, and don’t
> get to work on new or interesting stuff. Also, what we do is really hard to
> transfer to other people because it’s based on experience.
>
> Over the past year my team and I have been working to overcome that gap,
> creating an assistant that’s able to scale some of this knowledge. We’ve
> got it to the point where it’s able to classify known root causes for an
> outage or an SLA breach in Cassandra with an accuracy greater than 90%. It
> can accurately diagnose bugs, data-modeling issues, or misuse of certain
> features and when it does give you specific remediation steps with links to
> knowledge base articles.
>
> We think we’ve seeded our database with enough root causes that it’ll
> catch the vast majority of issues but there is always the possibility that
> we’ll run into something previously unknown like CASSANDRA-11170 (one of
> the issues our system found in the wild).
>
> We’re looking for feedback and would like to know if anyone is interested
> in giving the product a trial. The process would be a collaboration, where
> we both get to learn from each other and improve how we’re doing things.
>
> Thanks,
> Matt Stump
>
>


Re: benefits of HBase over Cassandra

2018-08-25 Thread daemeon reiydelle
Messenger can tolerate some losses in degenerate infrastructure cases, for
a given infra footprint. There is also some ability to scale up faster as
demand increases, for peak loads, etc. It therefore becomes a use-case
specific optimization. Also, HBase can run inside Hadoop more easily,
leveraging HDFS for blobs, etc. So it depends on your use case.

<==>
Be the reason someone smiles today.
Or the reason they need a drink.
Whichever works.

*Daemeon C.M. Reiydelle*

*email: daeme...@gmail.com *
*San Francisco 1.415.501.0198/London 44 020 8144 9872/Skype
daemeon.c.m.reiydelle*



On Fri, Aug 24, 2018 at 10:40 PM Vitaliy Semochkin 
wrote:

> Thank you very much for fast reply, Dinesh!
> I was under impression that with tunable consistency  Cassandra can
> act as CP (in case it is needed), e.g  by setting  ALL on both reads
> and writes.
> Do you agree with this statement?
>
> PS Are there any other benefits of HBase you have found? I'd be glad
> to hear usecases list.
>
>
>
> On Sat, Aug 25, 2018 at 12:44 AM dinesh.jo...@yahoo.com.INVALID
>  wrote:
> >
> > I've worked with both databases. They're suitable for different
> use-cases. If you look at the CAP theorem; HBase is CP while Cassandra is a
> AP. If we talk about a specific use-case, it'll be easier to discuss.
> >
> > Dinesh
> >
> >
> > On Friday, August 24, 2018, 1:56:31 PM PDT, Vitaliy Semochkin <
> vitaliy...@gmail.com> wrote:
> >
> >
> > Hi,
> >
> > I read that once Facebook chose HBase over Cassandra for it's messenger,
> > but I never found what are the benefits for HBase over Cassandra,
> > can someone list, if there are any?
> >
> > Regards,
> > Vitaliy
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: user-h...@cassandra.apache.org
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: JBOD disk failure

2018-08-14 Thread daemeon reiydelle
you have to explain what you mean by "JBOD". All in one large vdisk?
Separate drives?

At the end of the day, if a device fails in a way that the data housed on
that device (or array) is no longer available, that HDFS storage is marked
down. HDFS now needs to create a 3rd replicant. Various timers control how
long HDFS waits to see if the device comes back on line. But assume
immediately for convenience. Remember that a write is to a (random) copy of
the data, and that datanode then replicates to the next node, and so forth.
The in-process-of-being-created 3rd copy will also get those delete
"updates". Have you read up on how "deleting" a record works?

<==>
Be the reason someone smiles today.
Or the reason they need a drink.
Whichever works.

*Daemeon C.M. Reiydelle*

*email: daeme...@gmail.com *
*San Francisco 1.415.501.0198/London 44 020 8144 9872/Skype
daemeon.c.m.reiydelle*



On Tue, Aug 14, 2018 at 6:10 AM Christian Lorenz <
christian.lor...@webtrekk.com> wrote:

> Hi,
>
>
>
> given a cluster with RF=3 and CL=LOCAL_ONE and application is deleting
> data, what happens if the nodes are setup with JBOD and one disk fails? Do
> I get consistent results while the broken drive is replaced and a nodetool
> repair is running on the node with the replaced drive?
>
>
>
> Kind regards,
>
> Christian
>


Re: Size of a single Data Row?

2018-06-10 Thread daemeon reiydelle
I'd like to split your question into two parts.

Part one is around recovery. If you lose a copy of the underlying data
because a node fails, and let's assume you have three copies, how long can
you tolerate the time to restore the third copy?

The second question is about the absolute length of a row. This is more
about the time to read a row: a single super long row can only be read from
one node, whereas if the row is split into multiple shorter rows then in
most cases there is an opportunity to read it in parallel.
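
One common way to do that split is a chunking table (sketch only; the names
and the 1 MB chunk size are illustrative, not a recommendation):

CREATE TABLE document_chunks (
    doc_id   text,
    chunk_no int,
    data     blob,
    PRIMARY KEY ((doc_id, chunk_no))
);
-- the application splits each payload into e.g. 1 MB chunks and records the
-- chunk count elsewhere; chunks land on different replicas, so the client
-- can fetch them in parallel and reassemble them.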

The sizes you're looking at are not in themselves an issue, it's more how
you want to access and use the data.

I might argue that you might not want to use Cassandra if this is your only
use case for it. I might suggest you look at something like the ELK stack;
whether or not you end up using Elasticsearch or Cassandra, the comparison
should get you thinking about your architecture for this particular
business case. But of course, if you have multiple use cases to store, some
with long columns and others shorter, then overall Cassandra would be an
excellent choice.

But as is often the case, and I do hope I'm being helpful in this response,
your overall family of business processes can drive compromises in one
business process to facilitate a single storage solution and simplified
administration.


Daemeon (Dæmœn) Reiydelle
USA 1.415.501.0198

On Sun, Jun 10, 2018, 02:54 Ralph Soika  wrote:

> Hi,
> I have a general question concerning the Cassandra technology. I already
> read 2 books but after all I am more and more confused about the question
> if Cassandra is the right technology. My goal is to store Business Data
> form a workflow engine into Cassandra. I want to use Cassandra as a kind of
> archive service because of its fault tolerant and decentralized approach.
>
> But here are two things which are confusing me. On the one hand the
> project claims that a single column value can be 2 GB (1 MB is
> recommended). On the other hand people explain that a partition should not
> be larger than 100MB.
>
> I plan only one single simple table:
>
> CREATE TABLE documents (
>created text,
>id text,
>data text,
>PRIMARY KEY (created,id)
> );
>
> 'created' is the partition key holding the date in ISO fomat (-MM-DD).
> The 'id' is a clustering key and is unique.
>
> But my 'data' column holds a XML document with business data. This cell
> contains many unstructured data and also media data. The data cell will be
> between 1 and 10 MB. BUT it can also hold more than 100MB and less than 2GB
> in some cases.
>
> Is Cassandra able to handle this kind of table? Or is Cassandra at the end
> not recommended for this kind of data?
>
> For example I would like to ask if data for a specific date is available :
>
> SELECT created, id FROM documents WHERE created = '2018-06-10'
>
> I select without the data column and just ask if data exists. Is the
> performance automatically poor only because the data cell (no primary key)
> of some rows is grater then 100MB? Or is cassandra running out of heap
> space in any case? It is perfectly clear that it makes no sense to select
> multiple cells which each contain over 100 MB of data in one single query.
> But this is a fundamental problem and has nothing to do with Cassandra. My
> java application running in Wildfly would also not be able to handle a data
> result with multiple GB of data.  But I would expect hat I can select a set
> of keys just to decide whether to load one single data cell.
>
> Cassandra seems like a great system. But many people seem to claim that it
> is only suitable for mapping a user status list ala Facebook? Is this true?
> Thanks for you comments in advance.
>
>
>
>
> ===
> Ralph
>
>


Re: Mongo DB vs Cassandra

2018-05-31 Thread daemeon reiydelle
If you are starting with a modest amount of data (e.g. under 0.25 PB) and
do not have extremely high availability requirements, then it is easier to
start with MongoDB, avoiding HA clusters. Both are great, but C* scales far
beyond MongoDB FOR A GIVEN LEVEL OF DBA ADMIN AND CONFIG.


<==>
"When I finish a project for a client, I have ... learned their issues with
life,
their personal secrets, I have come to care about them.
Once the project is over, I lose them as if I lost family.
For the client, however, they’ve just dismissed a service worker." ...
"Thought on the Gig Economy" by Francine Brevetti


*Daemeon C.M. ReiydelleSan Francisco 1.415.501.0198/London 44 020 8144
9872/Skype daemeon.c.m.reiydelle*


On Thu, May 31, 2018 at 4:49 AM, Sudhakar Ganesan <
sudhakar.gane...@flex.com.invalid> wrote:

> Team,
>
>
>
> I need to make a decision on Mongo DB vs Cassandra for loading the csv
> file data and store csv file as well. If any of you did such study in last
> couple of months, please share your analysis or observations.
>
>
>
> Regards,
>
> Sudhakar
> Legal Disclaimer :
> The information contained in this message may be privileged and
> confidential.
> It is intended to be read only by the individual or entity to whom it is
> addressed
> or by their designee. If the reader of this message is not the intended
> recipient,
> you are on notice that any distribution of this message, in any form,
> is strictly prohibited. If you have received this message in error,
> please immediately notify the sender and delete or destroy any copy of
> this message!
>


Re: Does Cassandra supports ACID txn

2018-04-25 Thread daemeon reiydelle
If ACID is needed, then C* is the wrong architecture. Your architecture
needs to match to your business processes as Ben pointed out: "Ask if it’s
really needed"

There is a concept of a velocity file (modern tech is memSQL'ish) that
delivers the high performance, acid transactions of lambda architectures.
It means the architecture is designed to support ONLY those functions that
need to be acid'ic. FYI, velocity files are the ultra-fast record of ATM
transactions that "just" happened, and are slowly replicated to the
persistent account balances.

So again, design your application architecture to support your needs. C* is
BASE (basically available, soft state, eventually consistent). Learn to
love it. It is basically your friend in big data, high volume solutions.


<==>
"When I finish a project for a client, I have ... learned their issues with
life,
their personal secrets, I have come to care about them.
Once the project is over, I lose them as if I lost family.
For the client, however, they’ve just dismissed a service worker." ...
"Thought on the Gig Economy" by Francine Brevetti


On Wed, Apr 25, 2018 at 8:33 PM, Ben Slater 
wrote:

> Would be interested to hear if anyone else has any different approaches
> but my approaches would be:
> 1) Ask if it’s really needed - in the example you gave would it really
> matter that, for a small period of time, the hotel appeared in once kind of
> search but not another? (Although clearly there are examples where it might
> matter.)
> 2) Put the state that matters in a single table. In this example, have a
> hotel_enabled table. Search would have to both find the hotel in one of
> your hotel_by_* tables and  then look up the hotel in hotel_enabled to
> check it is really enabled. “deleting” a hotel is then a single write to
> hotel_enabled. hotel_enabled could also be something like hotel_details so
> the other tables really are just indexes. You need to do more reads but
> whatever you do consistency doesn’t come for free. (Rough sketch below.)
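>
> Sketching that idea (table and column names just follow the example above):
>
> CREATE TABLE hotel_details (
>     hotel_id text PRIMARY KEY,
>     name     text,
>     enabled  boolean
> );
>
> -- hotels_by_name / hotels_by_poi stay as pure indexes pointing at hotel_id;
> -- a search hit is only surfaced if the follow-up read finds enabled = true,
> -- so "deleting" a hotel is a single write:
> UPDATE hotel_details SET enabled = false WHERE hotel_id = 'h123';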
>
> Cheers
> Ben
>
>
> On Thu, 26 Apr 2018 at 12:44 Rajesh Kishore 
> wrote:
>
>> Correction from previous query
>>
>>
>> Thanks Ben and all experts.
>>
>> I am almost a newbie to NoSQL world and thus I have a very general
>> question how does consumer application of Cassandra/other NoSQL
>> technologies deal with atomicity & other factors when there is need to 
>> *de-normalize
>> *data. For example:
>>
>> Let us say I have requirement for queries
>> - find all hotels by name
>> - Find all hotels by Point of Interest (POI)
>> - Find POI near by a hotel
>>
>> For these queries I would end up more or less in following tables
>> hotels_by_name(hotel_name,hotel_id,city,) primary key -
>> hotel_name
>> hotels_by_poi(poi_name,poi_id,hotel_id,hotel_name,..) primary key -
>> poi_name
>> poi_by_hotel(hotel_id,poi_name,poi_id,poi_loc,hotel_name,..) primary
>> key - hotel_id
>>
>> So, If I have to add/remove a hotel from/into hotels_by_name , I may need
>> to add/remove into/from tables hotels_by_poi/poi_by_hotel. So, here my
>> assumption is these operations would need to be atomic( and may be
>> supporting other ACID properties) . How these kind of operations/usecases
>> being handled in Cassandra/NoSQL world?
>>
>> Appreciate your response.
>>
>> Thanks,
>> Rajesh
>>
>> On Thu, Apr 26, 2018 at 8:05 AM, Rajesh Kishore 
>> wrote:
>>
>>> Thanks Ben and all experts.
>>>
>>> I am almost a newbie to NoSQL world and thus I have a very general
>>> question how does consumer application of Cassandra/other NoSQL
>>> technologies deal with atomicity & other factors when there is need to
>>> normalize data. For example:
>>>
>>> Let us say I have requirement for queries
>>> - find all hotels by name
>>> - Find all hotels by Point of Interest (POI)
>>> - Find POI near by a hotel
>>>
>>> For these queries I would end up more or less in following tables
>>> hotels_by_name(hotel_name,hotel_id,city,) primary key -
>>> hotel_name
>>> hotels_by_poi(poi_name,poi_id,hotel_id,hotel_name,..) primary key -
>>> poi_name
>>> poi_by_hotel(hotel_id,poi_name,poi_id,poi_loc,hotel_name,..)
>>> primary key - hotel_id
>>>
>>> So, If I have to add/remove a hotel from/into hotels_by_name , I may
>>> need to add/remove into/from tables hotels_by_poi/poi_by_hotel. So, here my
>>> assumption is these operations would need to be atomic( and may be
>>> supporting other ACID properties) . How these kind of operations/usecases
>>> being handled in Cassandra/NoSQL world?
>>>
>>> Appreciate your response.
>>>
>>> Thanks,
>>> Rajesh
>>>
>>>
>>>
>>> On Fri, Apr 20, 2018 at 11:07 AM, Ben Slater >> > wrote:
>>>
 The second SO answer just says the partitions will be collocated (ie on
 the same server) not that the two tables will use the same partition. In
 any event, Cassandra does not have the kind of functionality you are
 looking for. The closest is logged batch but as Sylvain said, 

Re: 答复: A node down every day in a 6 nodes cluster

2018-03-26 Thread daemeon reiydelle
Look for errors on your network interface. I think you have periodic errors
in your network connectivity
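
For example (eth0 is a guess for the interface name):

ip -s link show eth0                    # RX/TX errors and drops
ethtool -S eth0 | grep -iE 'err|drop'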


<==>
"Who do you think made the first stone spear? The Asperger guy.
If you get rid of the autism genetics, there would be no Silicon Valley"
Temple Grandin


*Daemeon C.M. ReiydelleSan Francisco 1.415.501.0198London 44 020 8144 9872*


On Mon, Mar 26, 2018 at 8:26 PM, Xiangfei Ni  wrote:

> Hi Jeff,
>
> I need to restart the node manually every time,only one node has this
> problem.
>
> I have attached the nodetool output,thanks.
>
>
>
> Best Regards,
>
>
>
> 倪项菲*/ **David Ni*
>
> 中移德电网络科技有限公司
>
> Virtue Intelligent Network Ltd, co.
>
> Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
>
> Mob: +86 13797007811 <+86%20137%209700%207811>|Tel: + 86 27 5024 2516
> <+86%2027%205024%202516>
>
>
>
> *发件人:* Jeff Jirsa 
> *发送时间:* 2018年3月27日 11:03
> *收件人:* user@cassandra.apache.org
> *主题:* Re: A node down every day in a 6 nodes cluster
>
>
>
> That warning isn’t sufficient to understand why the node is going down
>
>
>
>
>
> Cassandra 3.9 has some pretty serious known issues - upgrading to 3.11.3
> is likely a good idea
>
>
>
> Are the nodes coming up on their own? Or are you restarting them?
>
>
>
> Paste the output of nodetool tpstats and nodetool cfstats
>
>
>
>
>
>
>
> --
>
> Jeff Jirsa
>
>
>
>
> On Mar 26, 2018, at 7:56 PM, Xiangfei Ni  wrote:
>
> Hi Cassandra experts,
>
>   I am facing an issue,a node downs every day in a 6 nodes cluster,the
> cluster is just in one DC,
>
>   Every node has 4C 16G,and the heap configuration is MAX_HEAP_SIZE=8192m
> HEAP_NEWSIZE=512m,every node load about 200G data,the RF for the business
> CF is 3,a node downs one time every day,the system.log shows below info:
>
> WARN  [Native-Transport-Requests-19] 2018-03-26 18:53:17,128
> CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize
> # for 
>
> ERROR [Native-Transport-Requests-19] 2018-03-26 18:53:17,129
> QueryMessage.java:128 - Unexpected error during query
>
> com.google.common.util.concurrent.UncheckedExecutionException:
> java.lang.RuntimeException: 
> org.apache.cassandra.exceptions.ReadTimeoutException:
> Operation timed out - received only 0 responses.
>
> at 
> com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203)
> ~[guava-18.0.jar:na]
>
> at com.google.common.cache.LocalCache.get(LocalCache.java:3937)
> ~[guava-18.0.jar:na]
>
> at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941)
> ~[guava-18.0.jar:na]
>
> at 
> com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824)
> ~[guava-18.0.jar:na]
>
> at org.apache.cassandra.auth.AuthCache.get(AuthCache.java:108)
> ~[apache-cassandra-3.9.jar:3.9]
>
> at 
> org.apache.cassandra.auth.PermissionsCache.getPermissions(PermissionsCache.java:45)
> ~[apache-cassandra-3.9.jar:3.9]
>
> at 
> org.apache.cassandra.auth.AuthenticatedUser.getPermissions(AuthenticatedUser.java:104)
> ~[apache-cassandra-3.9.jar:3.9]
>
> at 
> org.apache.cassandra.service.ClientState.authorize(ClientState.java:419)
> ~[apache-cassandra-3.9.jar:3.9]
>
> at org.apache.cassandra.service.ClientState.
> checkPermissionOnResourceChain(ClientState.java:352)
> ~[apache-cassandra-3.9.jar:3.9]
>
> at 
> org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:329)
> ~[apache-cassandra-3.9.jar:3.9]
>
> at 
> org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:316)
> ~[apache-cassandra-3.9.jar:3.9]
>
> at 
> org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:300)
> ~[apache-cassandra-3.9.jar:3.9]
>
> at org.apache.cassandra.cql3.statements.ModificationStatement.
> checkAccess(ModificationStatement.java:211) ~[apache-cassandra-3.9.jar:3.
> 9]
>
> at 
> org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:185)
> ~[apache-cassandra-3.9.jar:3.9]
>
> at 
> org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:219)
> ~[apache-cassandra-3.9.jar:3.9]
>
> at 
> org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:204)
> ~[apache-cassandra-3.9.jar:3.9]
>
> at 
> org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:115)
> ~[apache-cassandra-3.9.jar:3.9]
>
> at 
> org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:513)
> [apache-cassandra-3.9.jar:3.9]
>
> at 
> org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:407)
> [apache-cassandra-3.9.jar:3.9]
>
> at io.netty.channel.SimpleChannelInboundHandler.channelRead(
> SimpleChannelInboundHandler.java:105) [netty-all-4.0.39.Final.jar:4.
> 0.39.Final]
>
> at io.netty.channel.AbstractChannelHandlerContext.
> invokeChannelRead(AbstractChannelHandlerContext.java:366)

Re: Cassandra on high performance machine: virtualization vs Docker

2018-02-27 Thread daemeon reiydelle
Docker will provide lower per-node overhead.

And yes, virtualizing smaller nodes out of a bigger physical server makes
sense. Of course you lose some per-node failure protection, since several
Cassandra nodes now share one box, but I guess this is not production?

<==>
"Who do you think made the first stone spear? The Asperger guy.
If you get rid of the autism genetics, there would be no Silicon Valley"
Temple Grandin


*Daemeon C.M. ReiydelleSan Francisco 1.415.501.0198London 44 020 8144 9872*


On Tue, Feb 27, 2018 at 8:26 PM, onmstester onmstester 
wrote:

> What I've got to set up my Apache Cassandra cluster are some servers with
> 20-core CPUs * 2 threads, 128 GB RAM and 8 * 2TB disks.
> Having read all over the web "do not use big nodes for your cluster", I'm
> convinced to run multiple nodes on a single physical server.
> So the question is which technology should I use: Docker or virtualization
> (ESX)? Any experience?
>
> Sent using Zoho Mail
>
>
>
>


Re: What kind of Automation you have for Cassandra related operations on AWS ?

2018-02-08 Thread daemeon reiydelle
Terraform plus Ansible. It works, but it's messy. Used for anywhere from 5 to 30,000 nodes plus the supporting infra.
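
As a rough sketch of that workflow (the variable, inventory and playbook names are
made up; it assumes the Terraform config exposes the instance IPs as an output):

  terraform init && terraform apply -var "node_count=6"
  terraform output -json instance_ips > ips.json
  ansible-playbook -i inventory/ec2.ini cassandra.yml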


Daemeon (Dæmœn) Reiydelle
USA 1.415.501.0198

On Thu, Feb 8, 2018, 15:57 Ben Wood  wrote:

> Shameless plug of our (DC/OS) Apache Cassandra service:
> https://docs.mesosphere.com/services/cassandra/2.0.3-3.0.14.
>
> You must run DC/OS, but it will handle:
> Restarts
> Replacement of nodes
> Modification of configuration
> Backups and Restores (to S3)
>
> On Thu, Feb 8, 2018 at 3:46 PM, Krish Donald  wrote:
>
>> Hi All,
>>
>> What kind of Automation you have for Cassandra related operations on AWS
>> like restacking, restart of the cluster , changing cassandra.yaml
>> parameters etc ?
>>
>> Thanks
>>
>>
>
>
> --
> Ben Wood
> Software Engineer - Data Agility
> Mesosphere
>


Re: Meltdown/Spectre Linux patch - Performance impact on Cassandra?

2018-01-09 Thread daemeon reiydelle
Good luck with that. PCID has been out since mid-2017, as I recall?


Daemeon (Dæmœn) Reiydelle
USA 1.415.501.0198

On Jan 9, 2018 10:31 AM, "Dor Laor"  wrote:

Make sure you pick instances with the PCID CPU capability; their TLB flush
overhead is much smaller
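
A quick way to verify this from inside an instance (a minimal check; PCID has to be
exposed by the hypervisor to the guest for it to help):

  grep -qw pcid /proc/cpuinfo && echo "pcid present" || echo "pcid missing"
  grep -qw invpcid /proc/cpuinfo && echo "invpcid present" || echo "invpcid missing"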

On Tue, Jan 9, 2018 at 2:04 AM, Steinmaurer, Thomas <
thomas.steinmau...@dynatrace.com> wrote:

> Quick follow up.
>
>
>
> Others in AWS reporting/seeing something similar, e.g.:
> https://twitter.com/BenBromhead/status/950245250504601600
>
>
>
> So, while we have seen a relative CPU increase of ~ 50% since Jan 4,
> 2018, we now also have applied a kernel update at OS/VM level on a single
> node (loadtest and not production though), thus more or less double patched
> now. Additional CPU impact by OS/VM level kernel patching is more or less 
> negligible,
> so looks highly Hypervisor related.
>
>
>
> Regards,
>
> Thomas
>
>
>
> *From:* Steinmaurer, Thomas [mailto:thomas.steinmau...@dynatrace.com]
> *Sent:* Freitag, 05. Jänner 2018 12:09
> *To:* user@cassandra.apache.org
> *Subject:* Meltdown/Spectre Linux patch - Performance impact on Cassandra?
>
>
>
> Hello,
>
>
>
> has anybody already some experience/results if a patched Linux kernel
> regarding Meltdown/Spectre is affecting performance of Cassandra negatively?
>
>
>
> In production, all nodes running in AWS with m4.xlarge, we see up to a 50%
> relative (e.g. AVG CPU from 40% => 60%) CPU increase since Jan 4, 2018,
> most likely correlating with Amazon finished patching the underlying
> Hypervisor infrastructure …
>
>
>
> Anybody else seeing a similar CPU increase?
>
>
>
> Thanks,
>
> Thomas
>
>
>
> The contents of this e-mail are intended for the named addressee only. It
> contains information that may be confidential. Unless you are the named
> addressee or an authorized designee, you may not copy or use it, or
> disclose it to anyone else. If you received it in error please notify us
> immediately and then destroy it. Dynatrace Austria GmbH (registration
> number FN 91482h) is a company registered in Linz whose registered office
> is at 4040 Linz, Austria, Freistädterstraße 313
> 
> The contents of this e-mail are intended for the named addressee only. It
> contains information that may be confidential. Unless you are the named
> addressee or an authorized designee, you may not copy or use it, or
> disclose it to anyone else. If you received it in error please notify us
> immediately and then destroy it. Dynatrace Austria GmbH (registration
> number FN 91482h) is a company registered in Linz whose registered office
> is at 4040 Linz, Austria, Freistädterstraße 313
> 
>


Re:

2017-10-01 Thread daemeon reiydelle
What specifically are you looking to monitor? As per above, Datadog has
superb components for monitoring, and no need to develop and support
anything, for a price of course. I have found management sometimes sees
devops resources as pretty low cost (pay for 40, get 70 hours work per
week). Depends on how big your clusters are, whether they are Hadoop MR,
add Hive, add Spark, add Ignite, etc.

Same sort of questions apply to your etl/ingest: Kafka/NiFi, Streaming, etc.
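
For reference, a minimal sketch of a Nagios-style check built on nodetool (the awk
parsing assumes the standard nodetool status layout and is illustrative only):

  #!/usr/bin/env bash
  # alert if any node in the cluster is not Up/Normal
  down=$(nodetool status 2>/dev/null | awk '/^[UD][NLJM]/ && $1 != "UN" {print $2}')
  if [ -n "$down" ]; then
    echo "CRITICAL: nodes not UN: $down"; exit 2
  fi
  echo "OK: all nodes UN"; exit 0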

We like to say that we don’t get to choose our parents, that they were
given by chance – yet, we can truly choose whose children we wish to be. -
Seneca the Younger



*Daemeon C.M. ReiydelleSan Francisco 1.415.501.0198London 44 020 8144 9872*


On Sun, Oct 1, 2017 at 9:57 AM, Jeff Jirsa  wrote:

> I've seen successful AWS deployments in the past with Datadog and
> Graphite+Seyren
>
>
>
> On Sun, Oct 1, 2017 at 9:14 AM, Bill Walters 
> wrote:
>
>> Hi All,
>>
>> I need some help with deploying a monitoring and alerting system for our
>> new Cassandra 3.0.4 cluster that we are setting up in AWS East region.
>> I have a good experience with Cassandra as we are running some 2.0.16
>> clusters in production on our on-prem servers. We use Nagios tool to
>> monitor and alert our on-call people if the any of the nodes in our on-prem
>> servers go down. (Nagios is the default monitoring and alerting system used
>> by our company)
>> Since, our leadership started a plan to migrate our infrastructure to
>> cloud, we have chosen AWS as our public cloud.
>> We are planning to use same old Nagios as our monitoring and alerting
>> system even for our cloud servers.
>> But not sure if this is the ideal approach; I have seen use cases where Yelp
>> used Sensu and Netflix wrote their own tool for
>> monitoring their cloud Cassandra clusters.
>>
>> Please let me know if there are any cloud native monitoring systems that
>> work well with Cassandra, we will review it for our setup.
>>
>>
>>
>> Thank You,
>> Bill Walters.
>>
>
>


Re: new question ;-) // RE: understanding batch atomicity

2017-09-29 Thread daemeon reiydelle
recall that a delete is actually a corner case of an update, as is an
insert.

As I read the snippet, you are updating multiple tables. The partition key
is table specific, so two sets of update batches are handled here.
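
For concreteness, a hedged sketch of the scenario being discussed (the keyspace,
table and column names are invented):

  # same partition key value, but two different tables in one batch
  cqlsh -e "
  BEGIN BATCH
    UPDATE ks.table_1 SET val = 'x' WHERE pk = 1;
    UPDATE ks.table_2 SET val = 'y' WHERE pk = 1;
  APPLY BATCH;"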

We like to say that we don’t get to choose our parents, that they were
given by chance – yet, we can truly choose whose children we wish to be. -
Seneca the Younger



*Daemeon C.M. ReiydelleSan Francisco 1.415.501.0198London 44 020 8144 9872*


On Fri, Sep 29, 2017 at 8:59 AM, DE VITO Dominique <
dominique.dev...@thalesgroup.com> wrote:

> Thanks DuyHai !
>
>
>
> Does anyone know if BATCH provides atomicity for all mutations of a given
> partition key for a __single__ table ?
>
>
>
> Or if BATCH provides atomicity for all mutations of a given partition key
> for __ALL__ mutated tables into the BATCH ?
>
>
>
> That is, in case of :
>
>
>
> BEGIN BATCH
>
> Update table_1 where PartitionKey_table_1 = 1 … => (A) mutation
>
> Update table_2 where PartitionKey_table_2 = 1 … => (B) mutation
>
> END BATCH
>
>
>
> Here, both mutations occur for the same PartitionKey = 1
>
> => are mutations (A) & (B) done in an atomic way (all or nothing) ?
>
>
>
> Thanks.
>
>
>
> Dominique
>
>
>
>
>
>
>
> [@@ THALES GROUP INTERNAL @@]
>
>
>
> *De :* DuyHai Doan [mailto:doanduy...@gmail.com]
> *Envoyé :* vendredi 29 septembre 2017 17:10
> *À :* user
> *Objet :* Re: understanding batch atomicity
>
>
>
> All updates here means all mutations == INSERT/UPDATE or DELETE
>
>
>
>
>
>
>
> On Fri, Sep 29, 2017 at 5:07 PM, DE VITO Dominique <
> dominique.dev...@thalesgroup.com> wrote:
>
> Hi,
>
>
>
> About BATCH, the Apache doc https://cassandra.apache.org/
> doc/latest/cql/dml.html?highlight=atomicity says :
>
>
>
> “*The BATCH statement group multiple modification statements
> (insertions/updates and deletions) into a single statement. It serves
> several purposes:*
>
> *...*
>
> *All updates in a BATCH belonging to a given partition key are performed
> in isolation*”
>
>
>
> Is “All *updates*” meaning equivalent to “All modifications (whatever
> it’s sources: INSERT or UPDATE statements)” ?
>
>
>
> Or, is “*updates*” meaning partition-level isolation *only* for UPDATE
> statements into the batch (w/o taking into isolation the INSERT other
> statements into the batch) ?
>
>
>
> Thanks
>
>
>
> Regards
>
> Dominique
>
>
>
>
>


Re: cassandra hardware requirements (STAT/SSD)

2017-09-29 Thread daemeon reiydelle
Note to the AWS poster, you have some limited understanding of how disks
are presented to AWS compute nodes. As a result your post is not relevant,
and misleading.

When considering throughput, recall that disk IO is ideally parallel. While
C* handles IO across multiple devices nicely, the unit of storage is a very
large "block". Whether that serial read is adequate, or whether you do RAID
0 (max parallel, no checksum overhead, loss of one drive makes the whole
volume unavailable) is a performance vs. reliability tradeoff.
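
As a sketch of the two ends of that tradeoff (device names, mount points and the
cassandra.yaml snippet are illustrative):

  # Option A, JBOD: let Cassandra spread data across disks via cassandra.yaml
  #   data_file_directories:
  #       - /mnt/disk1/cassandra
  #       - /mnt/disk2/cassandra
  # Option B, RAID 0: one striped volume; losing any one disk loses the volume
  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
  mkfs.xfs /dev/md0
  mount -o noatime /dev/md0 /mnt/cassandra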



We like to say that we don’t get to choose our parents, that they were
given by chance – yet, we can truly choose whose children we wish to be. -
Seneca the Younger



*Daemeon C.M. ReiydelleSan Francisco 1.415.501.0198London 44 020 8144 9872*


On Fri, Sep 29, 2017 at 6:11 AM, Lutaya Shafiq Holmes <
lutayasha...@gmail.com> wrote:

> Please try and USE AWS
>
> amazon web services on aws.amazon.com
>
> On 9/29/17, Peng Xiao <2535...@qq.com> wrote:
> > Hi there,
> > we are struggling with hardware selection. We all know that SSD is good, and
> > DataStax suggests we use SSD. But as Cassandra is a CPU-bound DB, we are
> > considering using SATA disks; we noticed that the normal IO throughput is
> > 7MB/s.
> >
> >
> > Could anyone give some advice?
> >
> >
> > Thanks,
> > Peng Xiao
>
>
> --
> Lutaaya Shafiq
> Web: www.ronzag.com | i...@ronzag.com
> Mobile: +256702772721 | +256783564130
> Twitter: @lutayashafiq
> Skype: lutaya5
> Blog: lutayashafiq.com
> http://www.fourcornersalliancegroup.com/?a=shafiqholmes
>
> "The most beautiful people we have known are those who have known defeat,
> known suffering, known struggle, known loss and have found their way out of
> the depths. These persons have an appreciation, a sensitivity and an
> understanding of life that fills them with compassion, gentleness and a
> deep loving concern. Beautiful people do not just happen." - *Elisabeth
> Kubler-Ross*
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Tool to manage cassandra

2017-06-16 Thread daemeon reiydelle
Ambari





*Daemeon C.M. ReiydelleUSA (+1) 415.501.0198London (+44) (0) 20 8144 9872*


*"It is better to be insulted with the truth than kissed with a lie”*

On Fri, Jun 16, 2017 at 6:01 AM, Ram Bhatia  wrote:

> Hi
>
> May I know, if there a tool similar to Oracle Enterprise Manager for
> managing Cassandra ?
>
> Thank you in advance for your help,
> Ram Bhatia
> - To
> unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org For additional
> commands, e-mail: user-h...@cassandra.apache.org


Re: Restarting nodes and reported load

2017-06-01 Thread daemeon reiydelle
Some random thoughts; I would like to thank you for giving us an
interesting problem. Cassandra can get boring sometimes, it is too stable.

- Do you have a way to monitor the network traffic to see if it is
increasing between restarts or does it seem relatively flat?
- What activities are happening when you observe the (increasing)
latencies? Something must be writing to keyspaces, something I presume is
reading. What is the workload?
- when using SSDs, there are some device-level optimizations for SSDs. I wonder
if those were done (they will cause some IO latency, but not like this); see the sketch below.
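
A minimal checklist sketch for those three points (device and path names are
illustrative):

  sar -n DEV 1 5                         # per-interface rx/tx rates between restarts
  nodetool tpstats                       # pending/blocked thread pools while latency climbs
  nodetool compactionstats               # outstanding compactions
  cat /sys/block/sda/queue/scheduler     # SSDs usually want noop/none
  mount | grep cassandra                 # check for noatime on the data mount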







*Daemeon C.M. ReiydelleUSA (+1) 415.501.0198London (+44) (0) 20 8144 9872*



On Thu, Jun 1, 2017 at 7:18 AM, Daniel Steuernol 
wrote:

> I am just restarting cassandra. I'm not having any disk space issues I
> think, but we're having issues where operations have increased latency, and
> these are fixed by a restart. It seemed like the load reported by nodetool
> status might be helpful in understanding what is going wrong but I'm not
> sure. Another symptom is that nodes will report as DN in nodetool status
> and then come back up again just a minute later.
>
> I'm not really sure what to track to find out what exactly is going wrong
> on the cluster, so any insight or debugging techniques would be super
> helpful
>
>
> On May 31 2017, at 5:07 pm, Anthony Grasso 
> wrote:
>
>> Hi Daniel,
>>
>> When you say that the nodes have to be restarted, are you just restarting
>> the Cassandra service or are you restarting the machine?
>> How are you reclaiming disk space at the moment? Does disk space free up
>> after the restart?
>>
>> Regarding storage on nodes, keep in mind the more data stored on a node,
>> the longer some operations to maintain that data will take to complete. In
>> addition, the more data that is on each node, the long it will take to
>> stream data to other nodes. Whether it is replacing a down node or
>> inserting a new node, having a large amount of data on each node will mean
>> that it takes longer for a node to join the cluster if it is streaming the
>> data.
>>
>> Kind regards,
>> Anthony
>>
>> On 30 May 2017 at 02:43, Daniel Steuernol  wrote:
>>
>> The cluster is running with RF=3, right now each node is storing about
>> 3-4 TB of data. I'm using r4.2xlarge EC2 instances, these have 8 vCPU's, 61
>> GB of RAM, and the disks attached for the data drive are gp2 ssd ebs
>> volumes with 10k iops. I guess this brings up the question of what's a good
>> marker to decide on whether to increase disk space vs provisioning a new
>> node?
>>
>>
>>
>> On May 29 2017, at 9:35 am, tommaso barbugli 
>> wrote:
>>
>> Hi Daniel,
>>
>> This is not normal. Possibly a capacity problem. Whats the RF, how much
>> data do you store per node and what kind of servers do you use (core count,
>> RAM, disk, ...)?
>>
>> Cheers,
>> Tommaso
>>
>> On Mon, May 29, 2017 at 6:22 PM, Daniel Steuernol 
>> wrote:
>>
>>
>> I am running a 6 node cluster, and I have noticed that the reported load
>> on each node rises throughout the week and grows way past the actual disk
>> space used and available on each node. Also eventually latency for
>> operations suffers and the nodes have to be restarted. A couple questions
>> on this, is this normal? Also does cassandra need to be restarted every few
>> days for best performance? Any insight on this behaviour would be helpful.
>>
>> Cheers,
>> Daniel
>> - To
>> unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org For
>> additional commands, e-mail: user-h...@cassandra.apache.org
>>
>>
>> - To
>> unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org For
>> additional commands, e-mail: user-h...@cassandra.apache.org
>>
>>
>> - To
> unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org For additional
> commands, e-mail: user-h...@cassandra.apache.org


Re: Restarting nodes and reported load

2017-05-30 Thread daemeon reiydelle
Did you notice that HDFS is the distributed file system used?





*Daemeon C.M. ReiydelleUSA (+1) 415.501.0198London (+44) (0) 20 8144 9872*


*“All men dream, but not equally. Those who dream by night in the dusty
recesses of their minds wake up in the day to find it was vanity, but the
dreamers of the day are dangerous men, for they may act their dreams with
open eyes, to make it possible.” — T.E. Lawrence*


On Tue, May 30, 2017 at 2:18 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:

> This isn't an HDFS mailing list.
>
> On Tue, May 30, 2017 at 2:14 PM daemeon reiydelle <daeme...@gmail.com>
> wrote:
>
>> no, 3tb is small. 30-50tb of hdfs space is typical these days per hdfs
>> node. Depends somewhat on whether there is a mix of more and less
>> frequently accessed data. But even storing only hot data, never saw
>> anything less than 20tb hdfs per node.
>>
>>
>>
>>
>>
>> *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198 <(415)%20501-0198>London
>> (+44) (0) 20 8144 9872 <+44%2020%208144%209872>*
>>
>>
>> *“All men dream, but not equally. Those who dream by night in the dusty
>> recesses of their minds wake up in the day to find it was vanity, but the
>> dreamers of the day are dangerous men, for they may act their dreams with
>> open eyes, to make it possible.” — T.E. Lawrence*
>>
>>
>> On Tue, May 30, 2017 at 2:00 PM, tommaso barbugli <tbarbu...@gmail.com>
>> wrote:
>>
>>> Am I the only one thinking 3TB is way too much data for a single node on
>>> a VM?
>>>
>>> On Tue, May 30, 2017 at 10:36 PM, Daniel Steuernol <
>>> dan...@sendwithus.com> wrote:
>>>
>>>> I don't believe incremental repair is enabled, I have never enabled it
>>>> on the cluster, and unless it's the default then it is off. Also I don't
>>>> see a setting in cassandra.yaml for it.
>>>>
>>>>
>>>>
>>>> On May 30 2017, at 1:10 pm, daemeon reiydelle <daeme...@gmail.com>
>>>> wrote:
>>>>
>>>>> Unless there is a bug, snapshots are excluded (they are not HDFS
>>>>> anyway!) from nodetool status.
>>>>>
>>>>> Out of curiousity, is incremenatal repair enabled? This is almost
>>>>> certainly a rat hole, but there was an issue a few releases back where 
>>>>> load
>>>>> would only increase until the node was restarted. Had been fixed ages ago,
>>>>> but wondering what happens if you restart a node, IF you have incremental
>>>>> enabled.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198 <+1%20415-501-0198>London
>>>>> (+44) (0) 20 8144 9872 <+44%2020%208144%209872>*
>>>>>
>>>>>
>>>>> *“All men dream, but not equally. Those who dream by night in the
>>>>> dusty recesses of their minds wake up in the day to find it was vanity, 
>>>>> but
>>>>> the dreamers of the day are dangerous men, for they may act their dreams
>>>>> with open eyes, to make it possible.” — T.E. Lawrence*
>>>>>
>>>>>
>>>>> On Tue, May 30, 2017 at 12:15 PM, Varun Gupta <var...@uber.com> wrote:
>>>>>
>>>>> Can you please check if you have incremental backup enabled and
>>>>> snapshots are occupying the space.
>>>>>
>>>>> run nodetool clearsnapshot command.
>>>>>
>>>>> On Tue, May 30, 2017 at 11:12 AM, Daniel Steuernol <
>>>>> dan...@sendwithus.com> wrote:
>>>>>
>>>>> It's 3-4TB per node, and by load rises, I'm talking about load as
>>>>> reported by nodetool status.
>>>>>
>>>>>
>>>>>
>>>>> On May 30 2017, at 10:25 am, daemeon reiydelle <daeme...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> When you say "the load rises ... ", could you clarify what you mean by
>>>>> "load"? That has a specific Linux term, and in e.g. Cloudera Manager. But
>>>>> in neither case would that be relevant to transient or persisted disk. Am 
>>>>> I
>>>>> missing something?
>>>>>
>>>>>
>>>>> On Tue, May 30, 2017 at 10:18 AM, tommaso barbugli <
>>>>> tbarbu...@gmail.com> wrote:
>>>>>
>>>>> 3-4 T

Re: Restarting nodes and reported load

2017-05-30 Thread daemeon reiydelle
no, 3tb is small. 30-50tb of hdfs space is typical these days per hdfs
node. Depends somewhat on whether there is a mix of more and less
frequently accessed data. But even storing only hot data, never saw
anything less than 20tb hdfs per node.





*Daemeon C.M. ReiydelleUSA (+1) 415.501.0198London (+44) (0) 20 8144 9872*


*“All men dream, but not equally. Those who dream by night in the dusty
recesses of their minds wake up in the day to find it was vanity, but the
dreamers of the day are dangerous men, for they may act their dreams with
open eyes, to make it possible.” — T.E. Lawrence*


On Tue, May 30, 2017 at 2:00 PM, tommaso barbugli <tbarbu...@gmail.com>
wrote:

> Am I the only one thinking 3TB is way too much data for a single node on a
> VM?
>
> On Tue, May 30, 2017 at 10:36 PM, Daniel Steuernol <dan...@sendwithus.com>
> wrote:
>
>> I don't believe incremental repair is enabled, I have never enabled it on
>> the cluster, and unless it's the default then it is off. Also I don't see a
>> setting in cassandra.yaml for it.
>>
>>
>>
>> On May 30 2017, at 1:10 pm, daemeon reiydelle <daeme...@gmail.com>
>> wrote:
>>
>>> Unless there is a bug, snapshots are excluded (they are not HDFS
>>> anyway!) from nodetool status.
>>>
>>> Out of curiousity, is incremenatal repair enabled? This is almost
>>> certainly a rat hole, but there was an issue a few releases back where load
>>> would only increase until the node was restarted. Had been fixed ages ago,
>>> but wondering what happens if you restart a node, IF you have incremental
>>> enabled.
>>>
>>>
>>>
>>>
>>>
>>> *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198 <+1%20415-501-0198>London
>>> (+44) (0) 20 8144 9872 <+44%2020%208144%209872>*
>>>
>>>
>>> *“All men dream, but not equally. Those who dream by night in the dusty
>>> recesses of their minds wake up in the day to find it was vanity, but the
>>> dreamers of the day are dangerous men, for they may act their dreams with
>>> open eyes, to make it possible.” — T.E. Lawrence*
>>>
>>>
>>> On Tue, May 30, 2017 at 12:15 PM, Varun Gupta <var...@uber.com> wrote:
>>>
>>> Can you please check if you have incremental backup enabled and
>>> snapshots are occupying the space.
>>>
>>> run nodetool clearsnapshot command.
>>>
>>> On Tue, May 30, 2017 at 11:12 AM, Daniel Steuernol <
>>> dan...@sendwithus.com> wrote:
>>>
>>> It's 3-4TB per node, and by load rises, I'm talking about load as
>>> reported by nodetool status.
>>>
>>>
>>>
>>> On May 30 2017, at 10:25 am, daemeon reiydelle <daeme...@gmail.com>
>>> wrote:
>>>
>>> When you say "the load rises ... ", could you clarify what you mean by
>>> "load"? That has a specific Linux term, and in e.g. Cloudera Manager. But
>>> in neither case would that be relevant to transient or persisted disk. Am I
>>> missing something?
>>>
>>>
>>> On Tue, May 30, 2017 at 10:18 AM, tommaso barbugli <tbarbu...@gmail.com>
>>> wrote:
>>>
>>> 3-4 TB per node or in total?
>>>
>>> On Tue, May 30, 2017 at 6:48 PM, Daniel Steuernol <dan...@sendwithus.com
>>> > wrote:
>>>
>>> I should also mention that I am running cassandra 3.10 on the cluster
>>>
>>>
>>>
>>> On May 29 2017, at 9:43 am, Daniel Steuernol <dan...@sendwithus.com>
>>> wrote:
>>>
>>> The cluster is running with RF=3, right now each node is storing about
>>> 3-4 TB of data. I'm using r4.2xlarge EC2 instances, these have 8 vCPU's, 61
>>> GB of RAM, and the disks attached for the data drive are gp2 ssd ebs
>>> volumes with 10k iops. I guess this brings up the question of what's a good
>>> marker to decide on whether to increase disk space vs provisioning a new
>>> node?
>>>
>>>
>>> On May 29 2017, at 9:35 am, tommaso barbugli <tbarbu...@gmail.com>
>>> wrote:
>>>
>>> Hi Daniel,
>>>
>>> This is not normal. Possibly a capacity problem. Whats the RF, how much
>>> data do you store per node and what kind of servers do you use (core count,
>>> RAM, disk, ...)?
>>>
>>> Cheers,
>>> Tommaso
>>>
>>> On Mon, May 29, 2017 at 6:22 PM, Daniel Steuernol <dan...@sendwithus.com
>>> > wrote:
>>>
>>>
>>> I am running a 6 node cluster, and I have noticed that the reported load
>>> on each node rises throughout the week and grows way past the actual disk
>>> space used and available on each node. Also eventually latency for
>>> operations suffers and the nodes have to be restarted. A couple questions
>>> on this, is this normal? Also does cassandra need to be restarted every few
>>> days for best performance? Any insight on this behaviour would be helpful.
>>>
>>> Cheers,
>>> Daniel
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org For
>>> additional commands, e-mail: user-h...@cassandra.apache.org
>>>
>>>
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org For
>>> additional commands, e-mail: user-h...@cassandra.apache.org
>>>
>>>
>>>
>>>
>


Re: Restarting nodes and reported load

2017-05-30 Thread daemeon reiydelle
No degradation.





*Daemeon C.M. ReiydelleUSA (+1) 415.501.0198London (+44) (0) 20 8144 9872*


*“All men dream, but not equally. Those who dream by night in the dusty
recesses of their minds wake up in the day to find it was vanity, but the
dreamers of the day are dangerous men, for they may act their dreams with
open eyes, to make it possible.” — T.E. Lawrence*


On Tue, May 30, 2017 at 1:54 PM, Daniel Steuernol <dan...@sendwithus.com>
wrote:

> That does sound like what's happening, did performance degrade as the
> reported load increased?
>
>
>
> On May 30 2017, at 1:52 pm, daemeon reiydelle <daeme...@gmail.com> wrote:
>
>> OK, thanks.
>>
>> So there was a bug in a prior version of C*, symptoms were:
>>
>> Nodetool would show increasing load utilization over time. Stopping and
>> restarting C* nodes would reset the storage back to what one would expect
>> on that node, for a while, then it would creep upwards again, until the
>> node(s) are restarted, etc. FYI it ONLY occurred on an in-use system, etc.
>>
>> I know (double checked) that the problem was fixed a while back.
>> Wondering if it resurfaced?
>>
>>
>>
>>
>>
>> *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198 <(415)%20501-0198>London
>> (+44) (0) 20 8144 9872 <+44%2020%208144%209872>*
>>
>>
>> *“All men dream, but not equally. Those who dream by night in the dusty
>> recesses of their minds wake up in the day to find it was vanity, but the
>> dreamers of the day are dangerous men, for they may act their dreams with
>> open eyes, to make it possible.” — T.E. Lawrence*
>>
>>
>> On Tue, May 30, 2017 at 1:36 PM, Daniel Steuernol <dan...@sendwithus.com>
>> wrote:
>>
>> I don't believe incremental repair is enabled, I have never enabled it on
>> the cluster, and unless it's the default then it is off. Also I don't see a
>> setting in cassandra.yaml for it.
>>
>>
>> On May 30 2017, at 1:10 pm, daemeon reiydelle <daeme...@gmail.com>
>> wrote:
>>
>> Unless there is a bug, snapshots are excluded (they are not HDFS anyway!)
>> from nodetool status.
>>
>> Out of curiousity, is incremenatal repair enabled? This is almost
>> certainly a rat hole, but there was an issue a few releases back where load
>> would only increase until the node was restarted. Had been fixed ages ago,
>> but wondering what happens if you restart a node, IF you have incremental
>> enabled.
>>
>>
>>
>>
>>
>> *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198London (+44) (0) 20 8144 9872*
>>
>>
>> *“All men dream, but not equally. Those who dream by night in the dusty
>> recesses of their minds wake up in the day to find it was vanity, but the
>> dreamers of the day are dangerous men, for they may act their dreams with
>> open eyes, to make it possible.” — T.E. Lawrence*
>>
>>
>> On Tue, May 30, 2017 at 12:15 PM, Varun Gupta <var...@uber.com> wrote:
>>
>> Can you please check if you have incremental backup enabled and snapshots
>> are occupying the space.
>>
>> run nodetool clearsnapshot command.
>>
>> On Tue, May 30, 2017 at 11:12 AM, Daniel Steuernol <dan...@sendwithus.com
>> > wrote:
>>
>> It's 3-4TB per node, and by load rises, I'm talking about load as
>> reported by nodetool status.
>>
>>
>>
>> On May 30 2017, at 10:25 am, daemeon reiydelle <daeme...@gmail.com>
>> wrote:
>>
>> When you say "the load rises ... ", could you clarify what you mean by
>> "load"? That has a specific Linux term, and in e.g. Cloudera Manager. But
>> in neither case would that be relevant to transient or persisted disk. Am I
>> missing something?
>>
>>
>> On Tue, May 30, 2017 at 10:18 AM, tommaso barbugli <tbarbu...@gmail.com>
>> wrote:
>>
>> 3-4 TB per node or in total?
>>
>> On Tue, May 30, 2017 at 6:48 PM, Daniel Steuernol <dan...@sendwithus.com>
>> wrote:
>>
>> I should also mention that I am running cassandra 3.10 on the cluster
>>
>>
>>
>> On May 29 2017, at 9:43 am, Daniel Steuernol <dan...@sendwithus.com>
>> wrote:
>>
>> The cluster is running with RF=3, right now each node is storing about
>> 3-4 TB of data. I'm using r4.2xlarge EC2 instances, these have 8 vCPU's, 61
>> GB of RAM, and the disks attached for the data drive are gp2 ssd ebs
>> volumes with 10k iops. I guess this brings up the question of what's a good
>> marker to decide on whether to increase disk space 

Re: Restarting nodes and reported load

2017-05-30 Thread daemeon reiydelle
OK, thanks.

So there was a bug in a prior version of C*, symptoms were:

Nodetool would show increasing load utilization over time. Stopping and
restarting C* nodes would reset the storage back to what one would expect
on that node, for a while, then it would creep upwards again, until the
node(s) are restarted, etc. FYI it ONLY occurred on an in-use system, etc.

I know (double checked) that the problem was fixed a while back. Wondering
if it resurfaced?





*Daemeon C.M. ReiydelleUSA (+1) 415.501.0198London (+44) (0) 20 8144 9872*


*“All men dream, but not equally. Those who dream by night in the dusty
recesses of their minds wake up in the day to find it was vanity, but the
dreamers of the day are dangerous men, for they may act their dreams with
open eyes, to make it possible.” — T.E. Lawrence*


On Tue, May 30, 2017 at 1:36 PM, Daniel Steuernol <dan...@sendwithus.com>
wrote:

> I don't believe incremental repair is enabled, I have never enabled it on
> the cluster, and unless it's the default then it is off. Also I don't see a
> setting in cassandra.yaml for it.
>
>
> On May 30 2017, at 1:10 pm, daemeon reiydelle <daeme...@gmail.com> wrote:
>
>> Unless there is a bug, snapshots are excluded (they are not HDFS anyway!)
>> from nodetool status.
>>
>> Out of curiousity, is incremenatal repair enabled? This is almost
>> certainly a rat hole, but there was an issue a few releases back where load
>> would only increase until the node was restarted. Had been fixed ages ago,
>> but wondering what happens if you restart a node, IF you have incremental
>> enabled.
>>
>>
>>
>>
>>
>> *Daemeon C.M. ReiydelleUSA (+1) 415.501.0198 <(415)%20501-0198>London
>> (+44) (0) 20 8144 9872 <+44%2020%208144%209872>*
>>
>>
>> *“All men dream, but not equally. Those who dream by night in the dusty
>> recesses of their minds wake up in the day to find it was vanity, but the
>> dreamers of the day are dangerous men, for they may act their dreams with
>> open eyes, to make it possible.” — T.E. Lawrence*
>>
>>
>> On Tue, May 30, 2017 at 12:15 PM, Varun Gupta <var...@uber.com> wrote:
>>
>> Can you please check if you have incremental backup enabled and snapshots
>> are occupying the space.
>>
>> run nodetool clearsnapshot command.
>>
>> On Tue, May 30, 2017 at 11:12 AM, Daniel Steuernol <dan...@sendwithus.com
>> > wrote:
>>
>> It's 3-4TB per node, and by load rises, I'm talking about load as
>> reported by nodetool status.
>>
>>
>>
>> On May 30 2017, at 10:25 am, daemeon reiydelle <daeme...@gmail.com>
>> wrote:
>>
>> When you say "the load rises ... ", could you clarify what you mean by
>> "load"? That has a specific Linux term, and in e.g. Cloudera Manager. But
>> in neither case would that be relevant to transient or persisted disk. Am I
>> missing something?
>>
>>
>> On Tue, May 30, 2017 at 10:18 AM, tommaso barbugli <tbarbu...@gmail.com>
>> wrote:
>>
>> 3-4 TB per node or in total?
>>
>> On Tue, May 30, 2017 at 6:48 PM, Daniel Steuernol <dan...@sendwithus.com>
>> wrote:
>>
>> I should also mention that I am running cassandra 3.10 on the cluster
>>
>>
>>
>> On May 29 2017, at 9:43 am, Daniel Steuernol <dan...@sendwithus.com>
>> wrote:
>>
>> The cluster is running with RF=3, right now each node is storing about
>> 3-4 TB of data. I'm using r4.2xlarge EC2 instances, these have 8 vCPU's, 61
>> GB of RAM, and the disks attached for the data drive are gp2 ssd ebs
>> volumes with 10k iops. I guess this brings up the question of what's a good
>> marker to decide on whether to increase disk space vs provisioning a new
>> node?
>>
>>
>> On May 29 2017, at 9:35 am, tommaso barbugli <tbarbu...@gmail.com>
>> wrote:
>>
>> Hi Daniel,
>>
>> This is not normal. Possibly a capacity problem. Whats the RF, how much
>> data do you store per node and what kind of servers do you use (core count,
>> RAM, disk, ...)?
>>
>> Cheers,
>> Tommaso
>>
>> On Mon, May 29, 2017 at 6:22 PM, Daniel Steuernol <dan...@sendwithus.com>
>> wrote:
>>
>>
>> I am running a 6 node cluster, and I have noticed that the reported load
>> on each node rises throughout the week and grows way past the actual disk
>> space used and available on each node. Also eventually latency for
>> operations suffers and the nodes have to be restarted. A couple questions
>> on this, is this normal? Also does cassandra need to be restarted every few
>> days for best performance? Any insight on this behaviour would be helpful.
>>
>> Cheers,
>> Daniel
>> - To
>> unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org For
>> additional commands, e-mail: user-h...@cassandra.apache.org
>>
>>
>>
>>
>> - To
>> unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org For
>> additional commands, e-mail: user-h...@cassandra.apache.org
>>
>>
>>
>>


Re: Restarting nodes and reported load

2017-05-30 Thread daemeon reiydelle
Unless there is a bug, snapshots are excluded (they are not HDFS anyway!)
from nodetool status.

Out of curiousity, is incremenatal repair enabled? This is almost certainly
a rat hole, but there was an issue a few releases back where load would
only increase until the node was restarted. Had been fixed ages ago, but
wondering what happens if you restart a node, IF you have incremental
enabled.
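
If you want to rule snapshots in or out, a quick sketch (paths assume a default
package install):

  nodetool status                               # "load" counts live sstables only
  du -sh /var/lib/cassandra/data                # includes snapshots and backups
  nodetool listsnapshots
  nodetool clearsnapshot                        # drops all snapshots if they are the gap
  grep -E '^incremental_backups' /etc/cassandra/cassandra.yaml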





*Daemeon C.M. ReiydelleUSA (+1) 415.501.0198London (+44) (0) 20 8144 9872*


*“All men dream, but not equally. Those who dream by night in the dusty
recesses of their minds wake up in the day to find it was vanity, but the
dreamers of the day are dangerous men, for they may act their dreams with
open eyes, to make it possible.” — T.E. Lawrence*


On Tue, May 30, 2017 at 12:15 PM, Varun Gupta <var...@uber.com> wrote:

> Can you please check if you have incremental backup enabled and snapshots
> are occupying the space.
>
> run nodetool clearsnapshot command.
>
> On Tue, May 30, 2017 at 11:12 AM, Daniel Steuernol <dan...@sendwithus.com>
> wrote:
>
>> It's 3-4TB per node, and by load rises, I'm talking about load as
>> reported by nodetool status.
>>
>>
>>
>> On May 30 2017, at 10:25 am, daemeon reiydelle <daeme...@gmail.com>
>> wrote:
>>
>>> When you say "the load rises ... ", could you clarify what you mean by
>>> "load"? That has a specific Linux term, and in e.g. Cloudera Manager. But
>>> in neither case would that be relevant to transient or persisted disk. Am I
>>> missing something?
>>>
>>>
>>> On Tue, May 30, 2017 at 10:18 AM, tommaso barbugli <tbarbu...@gmail.com>
>>> wrote:
>>>
>>> 3-4 TB per node or in total?
>>>
>>> On Tue, May 30, 2017 at 6:48 PM, Daniel Steuernol <dan...@sendwithus.com
>>> > wrote:
>>>
>>> I should also mention that I am running cassandra 3.10 on the cluster
>>>
>>>
>>>
>>> On May 29 2017, at 9:43 am, Daniel Steuernol <dan...@sendwithus.com>
>>> wrote:
>>>
>>> The cluster is running with RF=3, right now each node is storing about
>>> 3-4 TB of data. I'm using r4.2xlarge EC2 instances, these have 8 vCPU's, 61
>>> GB of RAM, and the disks attached for the data drive are gp2 ssd ebs
>>> volumes with 10k iops. I guess this brings up the question of what's a good
>>> marker to decide on whether to increase disk space vs provisioning a new
>>> node?
>>>
>>>
>>> On May 29 2017, at 9:35 am, tommaso barbugli <tbarbu...@gmail.com>
>>> wrote:
>>>
>>> Hi Daniel,
>>>
>>> This is not normal. Possibly a capacity problem. Whats the RF, how much
>>> data do you store per node and what kind of servers do you use (core count,
>>> RAM, disk, ...)?
>>>
>>> Cheers,
>>> Tommaso
>>>
>>> On Mon, May 29, 2017 at 6:22 PM, Daniel Steuernol <dan...@sendwithus.com
>>> > wrote:
>>>
>>>
>>> I am running a 6 node cluster, and I have noticed that the reported load
>>> on each node rises throughout the week and grows way past the actual disk
>>> space used and available on each node. Also eventually latency for
>>> operations suffers and the nodes have to be restarted. A couple questions
>>> on this, is this normal? Also does cassandra need to be restarted every few
>>> days for best performance? Any insight on this behaviour would be helpful.
>>>
>>> Cheers,
>>> Daniel
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org For
>>> additional commands, e-mail: user-h...@cassandra.apache.org
>>>
>>>
>>>
>>>
>>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org For
>> additional commands, e-mail: user-h...@cassandra.apache.org
>>
>
>


Re: Restarting nodes and reported load

2017-05-30 Thread daemeon reiydelle
When you say "the load rises ... ", could you clarify what you mean by
"load"? That has a specific Linux term, and in e.g. Cloudera Manager. But
in neither case would that be relevant to transient or persisted disk. Am I
missing something?


On Tue, May 30, 2017 at 10:18 AM, tommaso barbugli 
wrote:

> 3-4 TB per node or in total?
>
> On Tue, May 30, 2017 at 6:48 PM, Daniel Steuernol 
> wrote:
>
>> I should also mention that I am running cassandra 3.10 on the cluster
>>
>>
>>
>> On May 29 2017, at 9:43 am, Daniel Steuernol 
>> wrote:
>>
>>> The cluster is running with RF=3, right now each node is storing about
>>> 3-4 TB of data. I'm using r4.2xlarge EC2 instances, these have 8 vCPU's, 61
>>> GB of RAM, and the disks attached for the data drive are gp2 ssd ebs
>>> volumes with 10k iops. I guess this brings up the question of what's a good
>>> marker to decide on whether to increase disk space vs provisioning a new
>>> node?
>>>
>>>
>>> On May 29 2017, at 9:35 am, tommaso barbugli 
>>> wrote:
>>>
>>> Hi Daniel,
>>>
>>> This is not normal. Possibly a capacity problem. Whats the RF, how much
>>> data do you store per node and what kind of servers do you use (core count,
>>> RAM, disk, ...)?
>>>
>>> Cheers,
>>> Tommaso
>>>
>>> On Mon, May 29, 2017 at 6:22 PM, Daniel Steuernol >> > wrote:
>>>
>>>
>>> I am running a 6 node cluster, and I have noticed that the reported load
>>> on each node rises throughout the week and grows way past the actual disk
>>> space used and available on each node. Also eventually latency for
>>> operations suffers and the nodes have to be restarted. A couple questions
>>> on this, is this normal? Also does cassandra need to be restarted every few
>>> days for best performance? Any insight on this behaviour would be helpful.
>>>
>>> Cheers,
>>> Daniel
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org For
>>> additional commands, e-mail: user-h...@cassandra.apache.org
>>>
>>>
>>>
>


Re: How do you do automatic restacking of AWS instance for cassandra?

2017-05-28 Thread daemeon reiydelle
This is in fact an interesting security practice that makes sense. It
assumes the existing AMI had security holes that WERE ALREADY exploited.
See if you can negotiate moving the hdfs volumes to persistent storage. FYI,
two major banks I have worked with did much the same, but as the storage
was SAN (with VMware) I was able to make adjustments to the Ansible
scripts (the client was providing mobile banking solutions to the bank).

I had another client using AWS, Chef, Terraform. I WAS NOT able to make
this work in Chef. I can do it with Ansible, Terraform, AWS however.
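
For what it's worth, a minimal sketch of the per-node replacement step that such
automation ends up driving (the IP and paths are invented; the
replace_address_first_boot option is discussed further down this thread):

  # on the freshly-built replacement instance, before the first Cassandra start
  echo 'JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.1.23"' \
    >> /etc/cassandra/cassandra-env.sh        # 10.0.1.23 = the retired node's address
  service cassandra start
  # wait until the node shows as UN in "nodetool status" before moving on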

“All men dream, but not equally. Those who dream by night in the dusty
recesses of their minds wake up in the day to find it was vanity, but the
dreamers of the day are dangerous men, for they may act their dreams with
open eyes, to make it possible.” — T.E. Lawrence

sent from my mobile
Daemeon Reiydelle
skype daemeon.c.m.reiydelle
USA 415.501.0198

On May 28, 2017 1:25 AM, "Anthony Grasso" <anthony.gra...@gmail.com> wrote:

> Hi Surbhi,
>
> Please see my comment inline below.
>
> On 28 May 2017 at 12:11, Jeff Jirsa <jji...@apache.org> wrote:
>
>>
>>
>> On 2017-05-27 18:04 (-0700), Surbhi Gupta <surbhi.gupt...@gmail.com>
>> wrote:
>> > Thanks a lot for all of your reply.
>> > Our requirement is :
>> > Our company releases AMI almost every month where they have some or the
>> > other security packages.
>> > So as per our security team we need to move our cassandra cluster to the
>> > new AMI .
>> > As this process happens every month, we would like to automate the
>> process .
>> > Few points to consider here:
>> >
>> > 1. We are using ephemeral drives to store cassandra data
>> > 2. We are on dse 4.8.x
>> >
>> > So currently to do the process, we pinup a new nodes with new DC name
>> and
>> > join that DC, alter the keyspace, do rebuild  and later alter the
>> keyspace
>> > again to remove the old DC .
>> >
>> > But all of this process is manually done as of now.
>> >
>> > So i wanted to understand , on AWS, how do you do above kind of task
>> > automatically ?
>>
>>
>> At a previous employer, they used M4 class instances with data on a
>> dedicated EBS volumes, so we could swap AMIs / stop / start / adjust
>> instances without having to deal with this. This worked reasonably well for
>> their scale (which was petabytes of data).
>>
>
> This is a really good option as it avoids streaming data to replace a node
> which could potentially be quicker if dealing with large amounts of data on
> each node.
>
>
>>
>> Other companies using ephemeral tend to be more willing to just terminate
>> instances and replace them (-Dcassandra.replace_address). If you stop
>> cassandra, then boot a replacement with 'replace_address' set, it'll take
>> over for the stopped instance, including re-streaming all data (as best it
>> can, subject to consistency level and repair status). This may be easier
>> for you to script than switching your fleet to EBS, but it's not without
>> risk.
>>
>
> A quick note if you do decide to go down this path. If you are using
> Cassandra version 2.x.x and above, the cassandra.replace_address_firs
> t_boot can also be used. This option works once when Cassandra is first
> started and the replacement node inserted into the cluster. After that, the
> option is ignored for all subsequent restarts, where as
> cassandra.replace_address needs to be removed from the *cassandra-env.sh*
> file in order to restart the node. Restart behaviour aside, both options
> operate in the same way to replace a node in the cluster.
>
>
>>
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>>
>


Re: How do you do automatic restacking of AWS instance for cassandra?

2017-05-25 Thread daemeon reiydelle
What is restacking?





*Daemeon C.M. ReiydelleUSA (+1) 415.501.0198London (+44) (0) 20 8144 9872*


*“All men dream, but not equally. Those who dream by night in the dusty
recesses of their minds wake up in the day to find it was vanity, but the
dreamers of the day are dangerous men, for they may act their dreams with
open eyes, to make it possible.” — T.E. Lawrence*


On Thu, May 25, 2017 at 10:24 AM, Surbhi Gupta 
wrote:

> Hi,
>
> Wanted to understand, how do you do automatic restacking of cassandra
> nodes on AWS?
>
> Thanks
> Surbhi
>


Re: How to avoid flush if the data can fit into memtable

2017-05-25 Thread daemeon reiydelle
This sounds exactly like a previous post that ended when I asked the person
to document the number of nodes, EC2 instance type, and size. I suspected a
single-node system. So the poster reposts? Hmm.

“All men dream, but not equally. Those who dream by night in the dusty
recesses of their minds wake up in the day to find it was vanity, but the
dreamers of the day are dangerous men, for they may act their dreams with
open eyes, to make it possible.” — T.E. Lawrence

sent from my mobile
Daemeon Reiydelle
skype daemeon.c.m.reiydelle
USA 415.501.0198

On May 25, 2017 9:14 AM, "Jonathan Haddad" <j...@jonhaddad.com> wrote:

Sorry for the confusion.  That was for the OP.  I wrote it quickly right
after waking up.

What I'm asking is why does the OP want to keep his data in the memtable
exclusively?  If the goal is to "make reads fast", then just turn on row
caching.

If there's so little data that it fits in memory (300MB), and there aren't
going to be any writes past the initial small dataset, why use Cassandra?
It sounds like the wrong tool for this job.  Sounds like something that
could easily be stored in S3 and loaded in memory when the app is fired up.
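
If row caching is the route taken, a minimal sketch (the keyspace/table name and
cache size are invented):

  # cassandra.yaml: give the row cache some memory (it is off by default)
  #   row_cache_size_in_mb: 512
  cqlsh -e "ALTER TABLE ks.small_table
            WITH caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'};"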


On Thu, May 25, 2017 at 8:06 AM Avi Kivity <a...@scylladb.com> wrote:

> Not sure whether you're asking me or the original poster, but the more
> times data gets overwritten in a memtable, the less it has to be compacted
> later on (and even without overwrites, larger memtables result in less
> compaction).
>
> On 05/25/2017 05:59 PM, Jonathan Haddad wrote:
>
> Why do you think keeping your data in the memtable is a what you need to
> do?
> On Thu, May 25, 2017 at 7:16 AM Avi Kivity <a...@scylladb.com> wrote:
>
>> Then it doesn't have to (it still may, for other reasons).
>>
>> On 05/25/2017 05:11 PM, preetika tyagi wrote:
>>
>> What if the commit log is disabled?
>>
>> On May 25, 2017 4:31 AM, "Avi Kivity" <a...@scylladb.com> wrote:
>>
>>> Cassandra has to flush the memtable occasionally, or the commit log
>>> grows without bounds.
>>>
>>> On 05/25/2017 03:42 AM, preetika tyagi wrote:
>>>
>>> Hi,
>>>
>>> I'm running Cassandra with a very small dataset so that the data can
>>> exist on memtable only. Below are my configurations:
>>>
>>> In jvm.options:
>>>
>>> -Xms4G
>>> -Xmx4G
>>>
>>> In cassandra.yaml,
>>>
>>> memtable_cleanup_threshold: 0.50
>>> memtable_allocation_type: heap_buffers
>>>
>>> As per the documentation in cassandra.yaml, the
>>> *memtable_heap_space_in_mb* and *memtable_heap_space_in_mb* will be set
>>> of 1/4 of heap size i.e. 1000MB
>>>
>>> According to the documentation here (http://docs.datastax.com/en/
>>> cassandra/3.0/cassandra/configuration/configCassandra_
>>> yaml.html#configCassandra_yaml__memtable_cleanup_threshold), the
>>> memtable flush will trigger if the total size of memtabl(s) goes beyond
>>> (1000+1000)*0.50=1000MB.
>>>
>>> Now if I perform several write requests which results in almost ~300MB
>>> of the data, memtable still gets flushed since I see sstables being created
>>> on file system (Data.db etc.) and I don't understand why.
>>>
>>> Could anyone explain this behavior and point out if I'm missing
>>> something here?
>>>
>>> Thanks,
>>>
>>> Preetika
>>>
>>>
>>>
>>
>


Re: Replication issue with Multi DC setup in cassandra

2017-05-24 Thread daemeon reiydelle
Cqlsh looks at the cluster, not node

“All men dream, but not equally. Those who dream by night in the dusty
recesses of their minds wake up in the day to find it was vanity, but the
dreamers of the day are dangerous men, for they may act their dreams with
open eyes, to make it possible.” — T.E. Lawrence

sent from my mobile
Daemeon Reiydelle
skype daemeon.c.m.reiydelle
USA 415.501.0198

On May 16, 2017 2:42 PM, "suraj pasuparthy" <suraj.pasupar...@gmail.com>
wrote:

> So i though the same,
> I see the data via the CQLSH in both the datacenters. consistency is set
> to LQ
>
> thanks
> -Suraj
>
> On Tue, May 16, 2017 at 2:19 PM, Nitan Kainth <ni...@bamlabs.com> wrote:
>
>> Do you see data on other DC or just directory structure? Directory
>> structure would populate because it is DDL but inserts shouldn’t populate,
>> ideally.
>>
>> On May 16, 2017, at 3:19 PM, suraj pasuparthy <suraj.pasupar...@gmail.com>
>> wrote:
>>
>> elp me fig
>>
>>
>>
>
>
> --
> Suraj Pasuparthy
>
> cisco systems
> Software Engineer
> San Jose CA
>


Re: Replication issue with Multi DC setup in cassandra

2017-05-24 Thread daemeon reiydelle
May I inquire if your configuration is actually data center aware? Do you
understand the difference between LQ and replication?
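
As a sketch of what a DC-aware definition looks like for a keyspace that should
stay in one data center only (keyspace and DC names are invented):

  cqlsh -e "ALTER KEYSPACE my_ks
            WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};"
  # removing a DC from the map stops new writes from replicating there, but it does
  # not delete data already on that DC's nodes; cleanup there is the usual way out
  nodetool cleanup my_ks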





*Daemeon C.M. ReiydelleUSA (+1) 415.501.0198London (+44) (0) 20 8144 9872*


*“All men dream, but not equally. Those who dream by night in the dusty
recesses of their minds wake up in the day to find it was vanity, but the
dreamers of the day are dangerous men, for they may act their dreams with
open eyes, to make it possible.” — T.E. Lawrence*


On Wed, May 24, 2017 at 12:03 PM, Igor Leão  wrote:

> Did you run `nodetool repair` after changing the keyspace? (not sure if it
> makes sense though)
>
> 2017-05-16 19:52 GMT-03:00 Nitan Kainth :
>
>> Strange. Anybody else might share something more important.
>>
>> Sent from my iPhone
>>
>> On May 16, 2017, at 5:23 PM, suraj pasuparthy 
>> wrote:
>>
>> Yes is see them in the datacenter's data directories.. infact i see then
>> even after i bring down the interface between the 2 DC's which further
>> confirms that a local copy is maintained in the DC that was not configured
>> in the strategy ..
>> its quite important that we block the info for this keyspace from
>> replicating :(.. not sure why this does not work
>>
>> Thanks
>> Suraj
>>
>> On Tue, May 16, 2017 at 3:06 PM Nitan Kainth  wrote:
>>
>>> check for datafiles on filesystem in both DCs.
>>>
>>> On May 16, 2017, at 4:42 PM, suraj pasuparthy <
>>> suraj.pasupar...@gmail.com> wrote:
>>>
>>> So i though the same,
>>> I see the data via the CQLSH in both the datacenters. consistency is set
>>> to LQ
>>>
>>> thanks
>>> -Suraj
>>>
>>> On Tue, May 16, 2017 at 2:19 PM, Nitan Kainth  wrote:
>>>
 Do you see data on other DC or just directory structure? Directory
 structure would populate because it is DDL but inserts shouldn’t populate,
 ideally.

 On May 16, 2017, at 3:19 PM, suraj pasuparthy <
 suraj.pasupar...@gmail.com> wrote:

 elp me fig



>>>
>>>
>>> --
>>> Suraj Pasuparthy
>>>
>>> cisco systems
>>> Software Engineer
>>> San Jose CA
>>>
>>>
>>>
>>>
>>>
>>>
>
>
> --
> Igor Leão  Site Reliability Engineer
>
> Mobile: +55 81 99727-1083 
> Skype: *igorvpcleao*
> Office: +55 81 4042-9757 
> Website: inlocomedia.com 


Re: Impact on latency with larger memtable

2017-05-24 Thread daemeon reiydelle
You speak of an increase. Please provide your results with specific examples, e.g.
a 25% increase in memtable size results in an n% increase in latency. Also please
include the number of nodes, total keyspace size, replication factor, etc.

Hopefully this is a 6 node cluster with several hundred gig per keyspace,
not some single node free tier box.

“All men dream, but not equally. Those who dream by night in the dusty
recesses of their minds wake up in the day to find it was vanity, but the
dreamers of the day are dangerous men, for they may act their dreams with
open eyes, to make it possible.” — T.E. Lawrence

sent from my mobile
Daemeon Reiydelle
skype daemeon.c.m.reiydelle
USA 415.501.0198

On May 24, 2017 9:32 AM, "preetika tyagi" <preetikaty...@gmail.com> wrote:

> Hi,
>
> I'm experimenting with memtable/heap size on my Cassandra server to
> understand how it impacts the latency/throughput for read requests.
>
> I vary heap size (Xms and -Xmx) in jvm.options so memtable will be 1/4 of
> this. When I increase the heap size and hence memtable, I notice the drop
> in throughput and increase in latency. I'm also creating the database such
> that its size doesn't exceed the size of memtable. Therefore, all data
> exist in memtable and I'm not able to reason why bigger size of memtable is
> resulting into higher latency/low throughput.
>
> Since everything is DRAM, shouldn't the throughput/latency remain same in
> all the cases?
>
> Thanks,
> Preetika
>


Re: Cassandra Node Density thresholds

2017-05-19 Thread daemeon reiydelle
500 nodes, 20tb of ACTIVE DATA per node in hdfs, no brainer, no problem.
But remember the cross DC traffic will get substantial.

“All men dream, but not equally. Those who dream by night in the dusty
recesses of their minds wake up in the day to find it was vanity, but the
dreamers of the day are dangerous men, for they may act their dreams with
open eyes, to make it possible.” — T.E. Lawrence

sent from my mobile
Daemeon Reiydelle
skype daemeon.c.m.reiydelle
USA 415.501.0198

On May 19, 2017 9:05 AM, "ZAIDI, ASAD A" <az1...@att.com> wrote:

> Hello Folks -
>
> I'm using open source apache Cassandra 2.2 .My cluster is spread over 14
> nodes in cluster in two data centers.
>
>
>
> My DC1 data center nodes are reaching 2TB of consumed volume. we don't
> have much space left on disk.
>
> I am wondering if there is guideline available that can point me to
> certain best practice that describe when we should add more nodes to the
> cluster.  should we add more storage or add more nodes. I guess we should
> scale Cassandra horizontally so adding node may be better option.. i am
> looking for a criteria that describes node density thresholds, if there are
> any.
>
> Can you guys please share your thoughts , experience. I'll much appreciate
> your reply. Thanks/Asad
>
>
>
>
>


Re: Can I have multiple datacenter with different versions of Cassandra

2017-05-18 Thread daemeon reiydelle
Yes, or decommission the old one and build anew after the new one is operational

“All men dream, but not equally. Those who dream by night in the dusty
recesses of their minds wake up in the day to find it was vanity, but the
dreamers of the day are dangerous men, for they may act their dreams with
open eyes, to make it possible.” — T.E. Lawrence

sent from my mobile
Daemeon Reiydelle
skype daemeon.c.m.reiydelle
USA 415.501.0198

On May 18, 2017 8:20 AM, "Chuck Reynolds" <creyno...@ancestry.com> wrote:

> I have a need to create another datacenter and upgrade my existing
> Cassandra from 2.1.13 to Cassandra 3.0.9.
>
>
>
> Can I do this as one step?  Create a new Cassandra ring that is version
> 3.0.9 and replicate the data from an existing ring that is Cassandra 2.1.13?
>
>
>
> After replicating to the new ring if possible them I would upgrade the old
> ring to Cassandra 3.0.9
>


Re: Bootstraping a Node With a Newer Version

2017-05-17 Thread daemeon reiydelle
So you are not upgrading the kernel, you are upgrading the OS. Not what
you asked about. Your devops team is right.

However, depending on what is using Python, the new version of Python may
break older scripts (I do not know, just mentioning this; testing required?).
When I am doing an OS upgrade (and usually ditto with Hadoop), I
add nodes to the cluster at the new OS/HDFS version, decommission
old nodes, and repeat. The replication takes a bit, but there is zero down
time, etc. Since you don't have a lot of storage per node, I don't think
you will have a lot of high network traffic impacting the performance of
nodes.
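
The per-node loop is roughly (host names are illustrative; run one node at a time):

  nodetool -h old-node-01 decommission   # streams its ranges to the remaining replicas
  nodetool status                        # confirm it has left the ring before the next one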





*Daemeon C.M. ReiydelleUSA (+1) 415.501.0198London (+44) (0) 20 8144 9872*



On Wed, May 17, 2017 at 12:51 AM, Shalom Sagges <shal...@liveperson.com>
wrote:

> Our DevOPS team told me that their policy is not to perform major kernel
> upgrades but simply install a clean new version.
> I also checked online and found a lot of recommendations *not *to do so
> as there might be a lot of dependencies issues that may affect processes
> such as yum.
> e.g.
> https://www.centos.org/forums/viewtopic.php?t=53678
> "The upgrade from CentOS 6 to 7 is a process that is fraught with danger
> and very very untested. Almost no-one succeeds without extreme effort. The
> CentOS wiki page about it has a big fat warning saying "Do not do this". If
> at all possible you should do a parallel install, migrate your data, apps
> and settings to the new box and decommission the old one.
>
> The problem comes about because there are a large number of packages in
> el6 that already have a higher version number than those in el7. This means
> that the el6 packages take precedence in the update and there are quite a
> few orphans left behind and these break lilttle things like yum. For
> example, one that I know about is openldap which is
> openldap-2.4.40-5.el6.x86_64 and openldap-2.4.39-6.el7.x86_64 so the el6
> package is seen as newer than the el7 one. Anything that's linked against
> openldap (a *lot*) now will not function until that package is replaced
> with its el7 equivalent, The easiest way to do this would be to yum
> downgrade openldap but, ooops, one of the things that needs openldap is
> yum so it doesn't work."
>
>
> I've also checked the Centos Wiki page and found the same recommendation:
> https://wiki.centos.org/FAQ/General?highlight=%28upgrade%
> 29%7C%28to%29%7C%28centos7%29#head-3ac1bdb51f0fecde1f98142cef90e8
> 87b1b12a00 :
>
> *"Upgrades in place are not supported nor recommended by CentOS or TUV. A
> backup followed by a fresh install is the only recommended upgrade path.
> See the Migration Guide for more information."*
>
>
> Since I have around twenty 2TB nodes in each DC (2 DCs in 6 different
> farms) and I don't want it to take forever, perhaps the best way would be
> to either leave it with Centos 6 and install Python 2.7 (I understand
> that's not so user friendly) or perform the backup recommendations shown on
> the Centos page (which sounds extremely agonizing as well).
>
> What do you think?
>
> Thanks!
>
>
> Shalom Sagges
> DBA
> T: +972-74-700-4035 <+972%2074-700-4035>
> <http://www.linkedin.com/company/164748> <http://twitter.com/liveperson>
> <http://www.facebook.com/LivePersonInc> We Create Meaningful Connections
>
>
>
> On Tue, May 16, 2017 at 6:48 PM, daemeon reiydelle <daeme...@gmail.com>
> wrote:
>
>> What makes you think you cannot upgrade the kernel?
>>
>> “All men dream, but not equally. Those who dream by night in the dusty
>> recesses of their minds wake up in the day to find it was vanity, but the
>> dreamers of the day are dangerous men, for they may act their dreams with
>> open eyes, to make it possible.” — T.E. Lawrence
>>
>> sent from my mobile
>> Daemeon Reiydelle
>> skype daemeon.c.m.reiydelle
>> USA 415.501.0198
>>
>> On May 16, 2017 5:27 AM, "Shalom Sagges" <shal...@liveperson.com> wrote:
>>
>>> Hi All,
>>>
>>> Hypothetically speaking, let's say I want to upgrade my Cassandra
>>> cluster, but I also want to perform a major upgrade to the kernel of all
>>> nodes.
>>> In order to upgrade the kernel, I need to reinstall the server, hence
>>> lose all data on the node.
>>>
>>> My question is this, after reinstalling the server with the new kernel,
>>> can I first install the upgraded Cassandra version and then bootstrap it to
>>> the cluster?
>>>
>>> Since there's already no data on the node, I wish to skip the agonizing
>>> sstable upgrade process.
>>>
>>> Does anyone know if this is doable?

Re: Bootstraping a Node With a Newer Version

2017-05-16 Thread daemeon reiydelle
What makes you think you cannot upgrade the kernel?

“All men dream, but not equally. Those who dream by night in the dusty
recesses of their minds wake up in the day to find it was vanity, but the
dreamers of the day are dangerous men, for they may act their dreams with
open eyes, to make it possible.” — T.E. Lawrence

sent from my mobile
Daemeon Reiydelle
skype daemeon.c.m.reiydelle
USA 415.501.0198

On May 16, 2017 5:27 AM, "Shalom Sagges" <shal...@liveperson.com> wrote:

> Hi All,
>
> Hypothetically speaking, let's say I want to upgrade my Cassandra cluster,
> but I also want to perform a major upgrade to the kernel of all nodes.
> In order to upgrade the kernel, I need to reinstall the server, hence lose
> all data on the node.
>
> My question is this, after reinstalling the server with the new kernel,
> can I first install the upgraded Cassandra version and then bootstrap it to
> the cluster?
>
> Since there's already no data on the node, I wish to skip the agonizing
> sstable upgrade process.
>
> Does anyone know if this is doable?
>
> Thanks!
>
>
>
> Shalom Sagges
> DBA
> T: +972-74-700-4035
> <http://www.linkedin.com/company/164748> <http://twitter.com/liveperson>
> <http://www.facebook.com/LivePersonInc> We Create Meaningful Connections
>
>
>
> This message may contain confidential and/or privileged information.
> If you are not the addressee or authorized to receive this on behalf of
> the addressee you must not use, copy, disclose or take action based on this
> message or any information herein.
> If you have received this message in error, please advise the sender
> immediately by reply email and delete this message. Thank you.
>


Re: Cassandra as a key/object store for many small (10-60k) files

2017-05-05 Thread daemeon reiydelle
I would guess you have network overload issues; I have seen pretty much
exactly what you describe many times, and (so far ;{) this has always been
the issue, especially with 1gbit networks, no jumbo frames, etc. Get your
network guys to monitor the error/retry packets across ALL of the interfaces
(all the nodes, top-of-rack switch, network switches, etc.). If you see ANY
retries, timeouts, or errors, you have found your problem.

Or it could be something like Java garbage collection, CPU overload,
etc.
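A rough sketch of the per-node checks implied above (standard Linux tools
assumed; the interface name is a placeholder, and the switch counters still
have to come from the network side):

    # Interface-level errors/drops; any non-zero, growing counter is suspect.
    ip -s link show eth0        # or: netstat -i

    # TCP retransmits/timeouts; sample twice, a minute apart, and compare.
    netstat -s | grep -i -E 'retrans|timeout'

    # Cassandra's own view of dropped messages on this node.
    nodetool tpstats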


*...*

*Making a billion dollar startup is easy: "take a human desire, preferably
one that has been around for a really long time … Identify that desire and
use modern technology to take out steps."*


*Daemeon C.M. Reiydelle | USA (+1) 415.501.0198 | London (+44) (0) 20 8144 9872*

On Fri, May 5, 2017 at 12:26 PM, Jonathan Guberman <j...@tineye.com> wrote:

> Yes, local storage volumes on each machine.
>
> On May 5, 2017, at 3:25 PM, daemeon reiydelle <daeme...@gmail.com> wrote:
>
> These numbers do not match e.g. AWS, so guessing you are using local
> storage?
>
>
> *...*
>
> *Making a billion dollar startup is easy: "take a human desire, preferably
> one that has been around for a really long time … Identify that desire and
> use modern technology to take out steps."*
>
>
> *Daemeon C.M. Reiydelle | USA (+1) 415.501.0198 | London (+44) (0) 20 8144 9872*
>
> On Fri, May 5, 2017 at 12:19 PM, Jonathan Guberman <j...@tineye.com> wrote:
>
>> Hello,
>>
>> We’re currently testing Cassandra for use as a pure key-object store for
>> data blobs around 10kB - 60kB each. Our use case is storing on the order of
>> 10 billion objects with about 5-20 million new writes per day. A written
>> object will never be updated or deleted. Objects will be read at least
>> once, some time within 10 days of being written. This will generally happen
>> as a batch; that is, all of the images written on a particular day will be
>> read together at the same time. This batch read will only happen one time;
>> future reads will happen on individual objects, with no grouping, and they
>> will follow a long-tail distribution, with popular objects read thousands
>> of times per year but most read never or virtually never.
>>
>> I’ve set up a small four node test cluster and have written test scripts
>> to benchmark writing and reading our data. The table I’ve set up is very
>> simple: an ascii primary key column with the object ID and a blob column
>> for the data. All other settings were left at their defaults.
>>
>> I’ve found write speeds to be very fast most of the time. However,
>> periodically, writes will slow to a crawl for anywhere between half an hour
>> to two hours, after which speeds recover to their previous levels. I assume
>> this is some sort of data compaction or flushing to disk, but I haven’t
>> been able to figure out the exact cause.
>>
>> Read speeds have been more disappointing. Cached reads are very fast, but
>> random read speed averages about 2 MB/sec, which is too slow when we need
>> to read out a batch of several million objects. I don’t think it’s
>> reasonable to assume that these rows will all still be cached by the time
>> we need to read them for that first large batch read.
>>
>> My general question is whether anyone has any suggestions for how to
>> improve performance for our use case. More specifically:
>>
>> - Is there a way to mitigate or eliminate the huge slowdowns I see when
>> writing millions of rows?
>> - Are there settings I should be using in order to maximize read speeds
>> for random reads?
>> - Is there a way to design our tables to improve the read speeds for the
>> initial large batched reads? I was thinking of using a batch ID column that
>> could be used to retrieve the data for the initial block. However, future
>> reads would need to be done by the object ID, not the batch ID, so it seems
>> to me I’d need to duplicate the data, one in a “objects by batch” table,
>> and the other in a simple “objects” table. Is there a better approach than
>> this?
>>
>> Thank you!
>>
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>>
>
>
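For the last question in the message above, a minimal CQL sketch of the
duplicated "objects by batch" plus "objects" layout the poster describes; the
table, column, and bucketing choices are illustrative only, not something
proposed in the thread:

    CREATE TABLE objects (
        object_id ascii PRIMARY KEY,   -- long-tail point lookups by object ID
        data      blob
    );

    CREATE TABLE objects_by_batch (
        batch_id  ascii,               -- e.g. the ingest day
        bucket    int,                 -- splits a batch so partitions stay bounded
        object_id ascii,
        data      blob,
        PRIMARY KEY ((batch_id, bucket), object_id)
    );

    -- Each object is written to both tables: the batch table serves the one-time
    -- bulk read, the objects table serves everything afterwards.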


Re: Cassandra as a key/object store for many small (10-60k) files

2017-05-05 Thread daemeon reiydelle
These numbers do not match e.g. AWS, so guessing you are using local
storage?


*...*

*Making a billion dollar startup is easy: "take a human desire, preferably
one that has been around for a really long time … Identify that desire and
use modern technology to take out steps."*


*Daemeon C.M. Reiydelle | USA (+1) 415.501.0198 | London (+44) (0) 20 8144 9872*

On Fri, May 5, 2017 at 12:19 PM, Jonathan Guberman  wrote:

> Hello,
>
> We’re currently testing Cassandra for use as a pure key-object store for
> data blobs around 10kB - 60kB each. Our use case is storing on the order of
> 10 billion objects with about 5-20 million new writes per day. A written
> object will never be updated or deleted. Objects will be read at least
> once, some time within 10 days of being written. This will generally happen
> as a batch; that is, all of the images written on a particular day will be
> read together at the same time. This batch read will only happen one time;
> future reads will happen on individual objects, with no grouping, and they
> will follow a long-tail distribution, with popular objects read thousands
> of times per year but most read never or virtually never.
>
> I’ve set up a small four node test cluster and have written test scripts
> to benchmark writing and reading our data. The table I’ve set up is very
> simple: an ascii primary key column with the object ID and a blob column
> for the data. All other settings were left at their defaults.
>
> I’ve found write speeds to be very fast most of the time. However,
> periodically, writes will slow to a crawl for anywhere between half an hour
> to two hours, after which speeds recover to their previous levels. I assume
> this is some sort of data compaction or flushing to disk, but I haven’t
> been able to figure out the exact cause.
>
> Read speeds have been more disappointing. Cached reads are very fast, but
> random read speed averages about 2 MB/sec, which is too slow when we need
> to read out a batch of several million objects. I don’t think it’s
> reasonable to assume that these rows will all still be cached by the time
> we need to read them for that first large batch read.
>
> My general question is whether anyone has any suggestions for how to
> improve performance for our use case. More specifically:
>
> - Is there a way to mitigate or eliminate the huge slowdowns I see when
> writing millions of rows?
> - Are there settings I should be using in order to maximize read speeds
> for random reads?
> - Is there a way to design our tables to improve the read speeds for the
> initial large batched reads? I was thinking of using a batch ID column that
> could be used to retrieve the data for the initial block. However, future
> reads would need to be done by the object ID, not the batch ID, so it seems
> to me I’d need to duplicate the data, one in a “objects by batch” table,
> and the other in a simple “objects” table. Is there a better approach than
> this?
>
> Thank you!
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Service discovery in the Cassandra cluster

2017-05-02 Thread daemeon reiydelle
My compliments to all of you for being adults, excessively kind, and
definitely excessively nice.


*...*



*Daemeon C.M. Reiydelle | USA (+1) 415.501.0198 | London (+44) (0) 20 8144 9872*

On Tue, May 2, 2017 at 5:08 PM, Steve Robenalt 
wrote:

> Hi Roman,
>
> I'm assuming you were intending your first statement to be in jest, but
> it's really not that hard to startup a Cassandra cluster. The defaults are
> pretty usable, so if all you want to do is set the IPs and start it up, the
> cluster probably will just take care of everything else.
>
> So I jest a little bit too. It's normally desirable to set up storage
> properly for your database, and there's a few options for which you might
> want to change the defaults, such as the snitch.
>
> Still, if that means you only need to take note of a couple of IPs and
> designate them as seeds so your cluster can mostly manage itself, you can
> say that's sad, but I'd say it's a small price to pay for all that you
> don't have to do.
>
> Steve
>
> On Mon, May 1, 2017 at 4:55 PM, Roman Naumenko 
> wrote:
>
>> Lol yeah, why
>> I guess I run some ec2 instances, drop some cassandra deb packages on 'em
>> - the thing will figure out how to run...
>>
>> Also, how would you get "initial state of the cluster" if the cluster...
>> is being initialized?
>> Or that's easy, according to the docs - just hardcode some seed IPs into
>> each node, lol
>>
>> It's all kinda funny, but in a sad way.
>>
>> On Mon, May 1, 2017 at 4:45 PM, Jon Haddad 
>> wrote:
>>
>>> Why do you have to figure out what’s up w/ them by accident?  You’ve
>>> gotten all the information you need.  Seeds are used to get the initial
>>> state of the cluster and as an optimization to spread gossip faster.
>>> That’s it.
>>>
>>>
>>>
>>> On May 1, 2017, at 4:37 PM, Roman Naumenko  wrote:
>>>
>>> Well, I guess I have to figure out what’s up with IPs/hostnames by
>>> experiment.
>>> Information about service discovery is practically absent.
>>> Not to mention all important details about fqdns/hostnames, automatic
>>> replacing seed nodes or what not.
>>>
>>> —
>>> Roman
>>>
>>> On May 1, 2017, at 4:14 PM, Jon Haddad 
>>> wrote:
>>>
>>> The in-tree docs do not mention this anywhere, and even have some of the
>>> answers you’re asking:
>>>
>>> https://cassandra.apache.org/doc/latest/faq/index.html?highl
>>> ight=seed#what-are-seeds
>>>
>>> The DataStax docs are maintained outside of the project, you’ll have to
>>> ask them why they’re wrong or misleading.
>>>
>>> Jon
>>>
>>> On May 1, 2017, at 4:10 PM, Roman Naumenko  wrote:
>>>
>>> The docs mention IP addresses everywhere.
>>>
>>> http://docs.datastax.com/en/archived/cassandra/2.0/cassandra
>>> /operations/ops_replace_seed_node.html
>>> Promote an existing node to a seed node by adding its IP address to
>>> -seeds list and remove (demote) the IP address of the dead seed node from
>>> the cassandra.yaml file for each node in the cluster.
>>>
>>> http://docs.datastax.com/en/archived/cassandra/2.0/cassandra
>>> /operations/ops_replace_node_t.html
>>> Note the Address of the dead node; it is used in step 5.
>>>
>>> http://docs.datastax.com/en/cassandra/2.1/cassandra/initiali
>>> ze/initializeSingleDS.html
>>>
>>> Properties to set:
>>> num_tokens: recommended value: 256
>>> -seeds: internal IP address of each seed node
>>>
>>>
>>> I saw also *hostnames *mentioned few times, but it just makes it even
>>> more confusing.
>>>
>>> —
>>> Roman
>>>
>>> On May 1, 2017, at 3:50 PM, Jon Haddad 
>>> wrote:
>>>
>>> Sure, you could use DNS.  Where does it say IP addresses are a
>>> requirement?
>>>
>>> On May 1, 2017, at 1:36 PM, Roman Naumenko  wrote:
>>>
>>> If I understand how Cassandra nodes work, they must contain a list of
>>> seed’s IP addressed in config file.
>>>
>>> This requirement makes cluster setup unnecessarily complicated. Is it
>>> possible to use DNS name for seed nodes?
>>>
>>> Thanks,
>>>
>>> —
>>> Roman
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>
>
> --
>
>
> * Steve Robenalt Software Architect, HighWire Press, Inc. *
> www.highwire.org| Los Gatos, CA| Belfast, NI| Brighton, UK
> 
> 
>
> *HighWire Summer Publishers' Meeting, London, June 12-13
> *
> STM Annual US Conference, April 25-27: Michiel Klein Swormink and Jennifer
> Chang are representing HighWire
> 
> 2017 CSE Annual Meeting: John Sack is presenting on topic of 

Re: Service discovery in the Cassandra cluster

2017-05-01 Thread daemeon reiydelle
Yes, you can use host names. That merely adds another level of
configuration. When using Terraform, I often use node names like
 and just use those. They are only routable within the
region/VPC but are in fact already in DNS. You do have to watch out,
because if you change the seeds (in tf) the cluster can get terminated and
rebuilt. If you have a way to capture these (you can do it in Ansible; I had
been told it is really hard to do in Chef/Puppet) then your config management
system can just adjust cassandra.yaml as needed without fussing with Route 53.
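A hedged cassandra.yaml excerpt showing host names in the seed list, as
discussed above; the names are placeholders:

    seed_provider:
        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
          parameters:
              - seeds: "cass-seed-1.internal.example,cass-seed-2.internal.example"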


*...*



*Daemeon C.M. Reiydelle | USA (+1) 415.501.0198 | London (+44) (0) 20 8144 9872*

On Mon, May 1, 2017 at 3:50 PM, Jon Haddad 
wrote:

> Sure, you could use DNS.  Where does it say IP addresses are a requirement?
>
> > On May 1, 2017, at 1:36 PM, Roman Naumenko  wrote:
> >
> > If I understand how Cassandra nodes work, they must contain a list of
> seed’s IP addressed in config file.
> >
> > This requirement makes cluster setup unnecessarily complicated. Is it
> possible to use DNS name for seed nodes?
> >
> > Thanks,
> >
> > —
> > Roman
> > -
> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: user-h...@cassandra.apache.org
> >
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Seed nodes as part of cluster

2017-05-01 Thread daemeon reiydelle
Caps below for emphasis, not shouting ;{)

Seed nodes are IDENTICAL to all other Cassandra nodes, or you will wish
otherwise. Folks get confused because of terminology. I refer to this stuff
as "the seed node service of a normal Cassandra node". ANY NODE IS ABLE TO
ACT AS A SEED NODE BY DEFINITION. But ONLY the nodes listed as seeds in
cassandra.yaml will be contacted, however.

The seed "function" is only used by new nodes when they FIRST join the
cluster for the FIRST time, then never used again (once a node joins the
cluster it is using different protocols, a separate list of nodes, etc.).




*...*



*Daemeon C.M. Reiydelle | USA (+1) 415.501.0198 | London (+44) (0) 20 8144 9872*

On Mon, May 1, 2017 at 2:05 PM, Roman Naumenko  wrote:

> So they are like any other “data” node… but special?
>
> I’m so freaking confused by this seed nodes design.
>
> —
> Roman
>
> On May 1, 2017, at 1:37 PM, vasu gunja  wrote:
>
> Seed will contain meta data + actual data too
>
> On Mon, May 1, 2017 at 3:34 PM, Roman Naumenko 
> wrote:
>
>> Hi,
>>
>> I’d like to confirm that seed nodes doesn’t contain any data. Is it
>> correct?
>>
>> Can the instances for seed nodes be smaller size than for data nodes?
>>
>> Thank you
>> Roman
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>>
>
>


Re: Migrating from Datastax Distribution to Apache Cassandra

2017-04-07 Thread daemeon reiydelle
Having done variants of this, I would suggest you bring up new nodes, at
approximately the same Apache version, as a separate data center in your
same cluster. The replication strategy may need to be tweaked.
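A hedged CQL sketch of the replication tweak being referred to: add the new
data center to each keyspace's replication map, then rebuild it from the old
one. Keyspace and DC names are placeholders:

    ALTER KEYSPACE my_keyspace WITH replication = {
      'class': 'NetworkTopologyStrategy',
      'dc_datastax': 3,
      'dc_apache': 3
    };

    -- then, once on each node in the new data center:
    --   nodetool rebuild dc_datastax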


*...*



*Daemeon C.M. Reiydelle | USA (+1) 415.501.0198 | London (+44) (0) 20 8144 9872*

On Fri, Apr 7, 2017 at 1:55 AM, Eren Yilmaz 
wrote:

> Hi,
>
>
>
> We have Cassandra 3.7 installation on Ubuntu, from Datastax distribution
> (using the repo). Since Datastax has announced that they will no longer
> support a community Cassandra distribution, I want to migrate to Apache
> distribution. Are there any differences between distributions? Can I use
> the upgrading procedures as described in https://docs.datastax.com/en/
> latest-upgrade/upgrade/cassandra/upgrdCassandraDetails.html?
>
>
>
> Thanks,
>
> Eren
>


Re: Cassandra and LINUX CPU Context Switches

2017-04-05 Thread daemeon reiydelle
This would be normal if the switches are user-to-kernel mode (disk and
network IO are kernel mode activities). If your run queue (jobs waiting to
run) is much larger than the number of cores (just a swag, but anything
beyond about 2-3x the number of cores), you might have other issues.
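A quick, rough sketch of how to eyeball the run queue against the core count
(standard Linux tools):

    nproc                # number of cores to compare against
    vmstat 5 5           # "r" = run queue, "cs" = context switches per second
    # "r" persistently beyond roughly 2-3x the core count suggests CPU pressure;
    # a high "cs" on its own, with disk/network IO in flight, is expected.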


*...*



*Daemeon C.M. Reiydelle | USA (+1) 415.501.0198 | London (+44) (0) 20 8144 9872*

On Wed, Apr 5, 2017 at 5:45 AM, William Boutin 
wrote:

> I’ve noticed that my apache-cassandra 2.2.6 process is consistently
> performing CPU Context Switches above 10,000 per second.
>
> Is this to be expected or should I be looking into ways to lower the
> number of context switches done on my Cassandra cluster?
>
> Thanks in advance.
>
>
>
>
> *WILLIAM L. BOUTIN *
> Engineer IV - Sftwr
> BMDA PADB DSE DU CC NGEE
>
>
> *Ericsson*
> 1 Ericsson Drive, US PI06 1.S747
> Piscataway, NJ, 08854, USA
> Phone (913) 241-5574
> Mobile (732) 213-1368
> Emergency (732) 354-1263
> william.bou...@ericsson.com
> www.ericsson.com
>
>
> Legal entity: EUS - ERICSSON INC., registered office in US PI01 4A242.
> This Communication is Confidential. We only send and receive email on the
> basis of the terms set out at www.ericsson.com/email_disclaimer
>
>
>


Re: nodes are always out of sync

2017-04-01 Thread daemeon reiydelle
What you are doing is predictably going to result in this, IF there is
substantial backlog/network/disk or whatever pressure.

What do you think will happen when you write with a replication factor
greater than the consistency level of the write? Perhaps your mental model
of how C* works needs work?


*...*



*Daemeon C.M. Reiydelle | USA (+1) 415.501.0198 | London (+44) (0) 20 8144 9872*

On Sat, Apr 1, 2017 at 11:09 AM, Vladimir Yudovin 
wrote:

> Hi,
>
> did you try to read data with consistency ALL immediately after write with
> consistency ONE? Does it succeed?
>
> Best regards, Vladimir Yudovin,
> *Winguzone  - Cloud Cassandra Hosting*
>
>
>  On Thu, 30 Mar 2017 04:22:28 -0400 *Roland Otta
> >* wrote 
>
> hi,
>
> we see the following behaviour in our environment:
>
> cluster consists of 6 nodes (cassandra version 3.0.7). keyspace has a
> replication factor 3.
> clients are writing data to the keyspace with consistency one.
>
> we are doing parallel, incremental repairs with cassandra reaper.
>
> even if a repair just finished and we are starting a new one
> immediately, we can see the following entries in our logs:
>
> INFO  [RepairJobTask:1] 2017-03-30 10:14:00,782 SyncTask.java:73 -
> [repair #d0f651f6-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.188
> and /192.168.0.191 have 1 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:2] 2017-03-30 10:14:00,782 SyncTask.java:73 -
> [repair #d0f651f6-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.188
> and /192.168.0.189 have 1 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:4] 2017-03-30 10:14:00,782 SyncTask.java:73 -
> [repair #d0f651f6-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.189
> and /192.168.0.191 have 1 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:2] 2017-03-30 10:14:03,997 SyncTask.java:73 -
> [repair #d0fa70a1-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.26
> and /192.168.0.189 have 2 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:1] 2017-03-30 10:14:03,997 SyncTask.java:73 -
> [repair #d0fa70a1-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.26
> and /192.168.0.191 have 2 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:4] 2017-03-30 10:14:03,997 SyncTask.java:73 -
> [repair #d0fa70a1-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.189
> and /192.168.0.191 have 2 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:1] 2017-03-30 10:14:05,375 SyncTask.java:73 -
> [repair #d0fbd033-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.189
> and /192.168.0.191 have 1 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:2] 2017-03-30 10:14:05,375 SyncTask.java:73 -
> [repair #d0fbd033-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.189
> and /192.168.0.190 have 1 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:4] 2017-03-30 10:14:05,375 SyncTask.java:73 -
> [repair #d0fbd033-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.190
> and /192.168.0.191 have 1 range(s) out of sync for ad_event_history
>
> we cant see any hints on the systems ... so we thought everything is
> running smoothly with the writes.
>
> do we have to be concerned about the nodes always being out of sync or
> is this a normal behaviour in a write intensive table (as the tables
> will never be 100% in sync for the latest inserts)?
>
> bg,
> roland
>
>
>
>


Re: How to add a node with zero downtime

2017-03-21 Thread daemeon reiydelle
Possible areas to check:
- too few nodes (node overload) - you did not indicate either the replication
factor or the number of nodes. I assume the nodes are *rather* full.
- network overload (check your top-of-rack switches' errors, and also the tcp
stats on the relevant nodes)
- look for stop-the-world garbage collection on multiple nodes.


*...*



*Daemeon C.M. Reiydelle | USA (+1) 415.501.0198 | London (+44) (0) 20 8144 9872*

On Tue, Mar 21, 2017 at 11:17 AM, Cogumelos Maravilha <
cogumelosmaravi...@sapo.pt> wrote:

> Hi list,
>
> I'm using C* 3.10;
>
> authenticator: PasswordAuthenticator and authorizer: CassandraAuthorizer
>
> When adding a node and before nodetool repair system_auth finished all my
> clients die with:
>
> cassandra.cluster.NoHostAvailable: ('Unable to connect to any servers',
> {'10.100.100.19': AuthenticationFailed('Failed to authenticate to ...
>
> Thanks in advance.
>


Re: repair performance

2017-03-20 Thread daemeon reiydelle
I would zero in on network throughput, especially interrack trunks
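A rough sketch of how that could be checked (iperf assumed installed; host
names are placeholders):

    # Raw node-to-node bandwidth across racks:
    #   on one node:               iperf -s
    #   on a node in another rack:
    iperf -c other-node -t 30

    # What Cassandra is streaming right now, and its current throttles:
    nodetool netstats
    nodetool getstreamthroughput
    nodetool getcompactionthroughput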


sent from my mobile
Daemeon Reiydelle
skype daemeon.c.m.reiydelle
USA 415.501.0198

On Mar 17, 2017 2:07 PM, "Roland Otta" <roland.o...@willhaben.at> wrote:

> hello,
>
> we are quite inexperienced with cassandra at the moment and are playing
> around with a new cluster we built up for getting familiar with
> cassandra and its possibilites.
>
> while getting familiar with that topic we recognized that repairs in
> our cluster take a long time. To get an idea of our current setup here
> are some numbers:
>
> our cluster currently consists of 4 nodes (replication factor 3).
> these nodes are all on dedicated physical hardware in our own
> datacenter. all of the nodes have
>
> 32 cores @2,9Ghz
> 64 GB ram
> 2 ssds (raid0) 900 GB each for data
> 1 seperate hdd for OS + commitlogs
>
> current dataset:
> approx 530 GB per node
> 21 tables (biggest one has more than 200 GB / node)
>
>
> i already tried setting compactionthroughput + streamingthroughput to
> unlimited for testing purposes ... but that did not change anything.
>
> when checking system resources i cannot see any bottleneck (cpus are
> pretty idle and we have no iowaits).
>
> when issuing a repair via
>
> nodetool repair -local on a node the repair takes longer than a day.
> is this normal or could we normally expect a faster repair?
>
> i also recognized that initalizing of new nodes in the datacenter was
> really slow (approx 50 mbit/s). also here i expected a much better
> performance - could those 2 problems be somehow related?
>
> br//
> roland


Re: Random slow read times in Cassandra

2017-03-17 Thread daemeon reiydelle
check for level 2 (stop the world) garbage collections.
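A minimal sketch of how to spot those pauses; the log path is the package
default and may differ on your install, and the PID is a placeholder:

    # Long stop-the-world pauses are logged by GCInspector.
    grep GCInspector /var/log/cassandra/system.log | tail -20

    # Live GC counters for the Cassandra JVM (JDK jstat, sampled every 5s).
    jstat -gcutil <cassandra-pid> 5s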


*...*



*Daemeon C.M. Reiydelle | USA (+1) 415.501.0198 | London (+44) (0) 20 8144 9872*

On Fri, Mar 17, 2017 at 11:51 AM, Chuck Reynolds 
wrote:

> I have a large Cassandra 2.1.13 ring (60 nodes) in AWS that has
> consistently random high read times.  In general most reads are under 10
> milliseconds, but within the 30 requests there is usually a read time that
> is a couple of seconds.
>
>
>
> Instance type: r4.8xlarge
>
> EBS GP2 volumes, 3tb with 9000 IOPS
>
> 30 Gig Heap
>
>
>
> Data per node is about 170 gigs
>
>
>
> The keyspace is an id & a blob.  When I check the data the slow reads
> don’t seem to have anything to do with size of the blobs
>
>
>
> This system has repairs run once a week because it takes a lot of updates.
>
>
>
> The client makes a call and does 30 request serially to Cassandra and the
> response times look like this in milliseconds.
>
>
>
> What could make these so slow and what can I do to diagnosis this?
>
>
>
>
>
> *Responses*
>
>
>
> Get Person time: 3 319746229:9009:66
>
> Get Person time: 7 1830093695:9009:66
>
> Get Person time: 4 30072253:9009:66
>
> Get Person time: 4 2303790089:9009:66
>
> Get Person time: 2 156792066:9009:66
>
> Get Person time: 8 491230624:9009:66
>
> Get Person time: 7 284904599:9009:66
>
> Get Person time: 4 600370489:9009:66
>
> Get Person time: 2 281007386:9009:66
>
> Get Person time: 4 971178094:9009:66
>
> Get Person time: 1 1322259885:9009:66
>
> Get Person time: 2 1937958542:9009:66
>
> Get Person time: 9 286536648:9009:66
>
> Get Person time: 9 1835633470:9009:66
>
> Get Person time: 2 300867513:9009:66
>
> Get Person time: 3 178975468:9009:66
>
> Get Person time: 2900 293043081:9009:66
>
> Get Person time: 8 214913830:9009:66
>
> Get Person time: 2 1956710764:9009:66
>
> Get Person time: 4 237673776:9009:66
>
> Get Person time: 17 68942206:9009:66
>
> Get Person time: 1800 20072145:9009:66
>
> Get Person time: 2 304698506:9009:66
>
> Get Person time: 2 308177320:9009:66
>
> Get Person time: 2 998436038:9009:66
>
> Get Person time: 10 1036890112:9009:66
>
> Get Person time: 1 1629649548:9009:66
>
> Get Person time: 6 1595339706:9009:66
>
> Get Person time: 4 1079637599:9009:66
>
> Get Person time: 3 556342855:9009:66
>
>
>
>
>
> Get Person time: 5 1856382256:9009:66
>
> Get Person time: 3 1891737174:9009:66
>
> Get Person time: 2 1179373651:9009:66
>
> Get Person time: 2 1482602756:9009:66
>
> Get Person time: 3 1236458510:9009:66
>
> Get Person time: 11 1003159823:9009:66
>
> Get Person time: 2 1264952556:9009:66
>
> Get Person time: 2 1662234295:9009:66
>
> Get Person time: 1 246108569:9009:66
>
> Get Person time: 5 1709881651:9009:66
>
> Get Person time: 3213 11878078:9009:66
>
> Get Person time: 2 112866483:9009:66
>
> Get Person time: 2 201870153:9009:66
>
> Get Person time: 6 227696684:9009:66
>
> Get Person time: 2 1946780190:9009:66
>
> Get Person time: 2 2197987101:9009:66
>
> Get Person time: 18 1838959725:9009:66
>
> Get Person time: 3 1782937802:9009:66
>
> Get Person time: 3 1692530939:9009:66
>
> Get Person time: 9 1765654196:9009:66
>
> Get Person time: 2 1597757121:9009:66
>
> Get Person time: 2 1853127153:9009:66
>
> Get Person time: 3 1533599253:9009:66
>
> Get Person time: 6 1693244112:9009:66
>
> Get Person time: 6 82047537:9009:66
>
> Get Person time: 2 96221961:9009:66
>
> Get Person time: 4 98202209:9009:66
>
> Get Person time: 9 12952388:9009:66
>
> Get Person time: 2 300118652:9009:66
>
> Get Person time: 10 78801084:9009:66
>
>
>
>
>
> Get Person time: 13 1856424913:9009:66
>
> Get Person time: 2 255814186:9009:66
>
> Get Person time: 2 1183397424:9009:66
>
> Get Person time: 5 1828603730:9009:66
>
> Get Person time: 9 132965919:9009:66
>
> Get Person time: 4 1616190071:9009:66
>
> Get Person time: 2 15929337:9009:66
>
> Get Person time: 10 297005427:9009:66
>
> Get Person time: 2 1306460047:9009:66
>
> Get Person time: 5 620139216:9009:66
>
> Get Person time: 2 1364349058:9009:66
>
> Get Person time: 3 629543403:9009:66
>
> Get Person time: 5 1299827034:9009:66
>
> Get Person time: 4 1593205912:9009:66
>
> Get Person time: 2 1755460077:9009:66
>
> Get Person time: 2 1906388666:9009:66
>
> Get Person time: 1 1838653952:9009:66
>
> Get Person time: 2 2249662508:9009:66
>
> Get Person time: 3 1931708432:9009:66
>
> Get Person time: 2 2177004948:9009:66
>
> Get Person time: 2 2042756682:9009:66
>
> Get Person time: 5 41764865:9009:66
>
> Get Person time: 4023 1733384704:9009:66
>
> Get Person time: 1 1614842189:9009:66
>
> Get Person time: 2 2194211396:9009:66
>
> Get Person time: 3 1711330834:9009:66
>
> Get Person time: 2 2264849689:9009:66
>
> Get Person time: 3 1819027970:9009:66
>
> Get Person time: 2 1978614851:9009:66
>
> Get Person time: 1 1863483129:9009:66
>
>
>


Re: Issue with Cassandra consistency in results

2017-03-17 Thread daemeon reiydelle
The prepared statement is needed; if I recall correctly it must remain in the
cache for the query to complete. I don't have the docs handy to dig out the
yaml param that adjusts that cache. I had run into the problem stress testing
a smallish cluster with many queries at once.

Do you have a sense of how many distinct queries are hitting the cluster at
peak?

If many clients, how do you balance the connection load or do you always
hit the same node?
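For what it's worth, on the 3.x line being discussed the relevant knob is
believed to be prepared_statements_cache_size_mb in cassandra.yaml (present
since roughly 3.6; the cluster in this thread is 3.9). A hedged excerpt, with
an illustrative value:

    # Defaults to the greater of 10 MB or 1/256th of the heap; raise it if the
    # logs show prepared statements being evicted/discarded.
    prepared_statements_cache_size_mb: 64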


sent from my mobile
Daemeon Reiydelle
skype daemeon.c.m.reiydelle
USA 415.501.0198

On Mar 16, 2017 3:25 PM, "srinivasarao daruna" <sree.srin...@gmail.com>
wrote:

> Hi reiydelle,
>
> I cannot confirm the range as the volume of data is huge and the query
> frequency is also high.
> If the cache is the cause of issue, can we increase cache size or is there
> solution to avoid dropped prep statements.?
>
>
>
>
>
>
> Thank You,
> Regards,
> Srini
>
> On Thu, Mar 16, 2017 at 2:13 PM, daemeon reiydelle <daeme...@gmail.com>
> wrote:
>
>> The discard due to oom is causing the zero returned. I would guess a
>> cache miss problem of some sort, but not sure. Are you using row, index,
>> etc. caches? Are you seeing the failed prep statement on random nodes (duh,
>> nodes that have the relevant data ranges)?
>>
>>
>> *...*
>>
>>
>>
>> *Daemeon C.M. Reiydelle | USA (+1) 415.501.0198 | London (+44) (0) 20 8144 9872*
>>
>> On Thu, Mar 16, 2017 at 10:56 AM, Ryan Svihla <r...@foundev.pro> wrote:
>>
>>> Depends actually, restore just restores what's there, so if only one
>>> node had a copy of the data then only one node had a copy of the data
>>> meaning quorum will still be wrong sometimes.
>>>
>>> On Thu, Mar 16, 2017 at 1:53 PM, Arvydas Jonusonis <
>>> arvydas.jonuso...@gmail.com> wrote:
>>>
>>>> If the data was written at ONE, consistency is not guaranteed. ..but
>>>> considering you just restored the cluster, there's a good chance something
>>>> else is off.
>>>>
>>>> On Thu, Mar 16, 2017 at 18:19 srinivasarao daruna <
>>>> sree.srin...@gmail.com> wrote:
>>>>
>>>>> Want to make read and write QUORUM as well.
>>>>>
>>>>>
>>>>> On Mar 16, 2017 1:09 PM, "Ryan Svihla" <r...@foundev.pro> wrote:
>>>>>
>>>>> Replication factor is 3, and write consistency is ONE and read
>>>>> consistency is QUORUM.
>>>>>
>>>>> That combination is not gonna work well:
>>>>>
>>>>> *Write succeeds to NODE A but fails on node B,C*
>>>>>
>>>>> *Read goes to NODE B, C*
>>>>>
>>>>> If you can tolerate some temporary inaccuracy you can use QUORUM but
>>>>> may still have the situation where
>>>>>
>>>>> Write succeeds on node A at timestamp 1, B succeeds at timestamp 2
>>>>> Read succeeds on node B and C at timestamp 1
>>>>>
>>>>> If you need fully race condition free counts I'm afraid you need to
>>>>> use SERIAL or LOCAL_SERIAL (for in DC only accuracy)
>>>>>
>>>>> On Thu, Mar 16, 2017 at 1:04 PM, srinivasarao daruna <
>>>>> sree.srin...@gmail.com> wrote:
>>>>>
>>>>> Replication strategy is SimpleReplicationStrategy.
>>>>>
>>>>> Smith is : EC2 snitch. As we deployed cluster on EC2 instances.
>>>>>
>>>>> I was worried that CL=ALL have more read latency and read failures.
>>>>> But won't rule out trying it.
>>>>>
>>>>> Should I switch select count (*) to select partition_key column? Would
>>>>> that be of any help.?
>>>>>
>>>>>
>>>>> Thank you
>>>>> Regards
>>>>> Srini
>>>>>
>>>>> On Mar 16, 2017 12:46 PM, "Arvydas Jonusonis" <
>>>>> arvydas.jonuso...@gmail.com> wrote:
>>>>>
>>>>> What are your replication strategy and snitch settings?
>>>>>
>>>>> Have you tried doing a read at CL=ALL? If it's an actual inconsistency
>>>>> issue (missing data), this should cause the correct results to be 
>>>>> returned.
>>>>> You'll need to run a repair to fix the inconsistencies.
>>>>>
>>>>> If all the data is actually there, you might have one or several
>>>>> nodes that aren't identifying the correct replicas.

Re: Issue with Cassandra consistency in results

2017-03-16 Thread daemeon reiydelle
The discard due to oom is causing the zero returned. I would guess a cache
miss problem of some sort, but not sure. Are you using row, index, etc.
caches? Are you seeing the failed prep statement on random nodes (duh,
nodes that have the relevant data ranges)?


*...*



*Daemeon C.M. Reiydelle | USA (+1) 415.501.0198 | London (+44) (0) 20 8144 9872*

On Thu, Mar 16, 2017 at 10:56 AM, Ryan Svihla  wrote:

> Depends actually, restore just restores what's there, so if only one node
> had a copy of the data then only one node had a copy of the data meaning
> quorum will still be wrong sometimes.
>
> On Thu, Mar 16, 2017 at 1:53 PM, Arvydas Jonusonis <
> arvydas.jonuso...@gmail.com> wrote:
>
>> If the data was written at ONE, consistency is not guaranteed. ..but
>> considering you just restored the cluster, there's a good chance something
>> else is off.
>>
>> On Thu, Mar 16, 2017 at 18:19 srinivasarao daruna 
>> wrote:
>>
>>> Want to make read and write QUORUM as well.
>>>
>>>
>>> On Mar 16, 2017 1:09 PM, "Ryan Svihla"  wrote:
>>>
>>> Replication factor is 3, and write consistency is ONE and read
>>> consistency is QUORUM.
>>>
>>> That combination is not gonna work well:
>>>
>>> *Write succeeds to NODE A but fails on node B,C*
>>>
>>> *Read goes to NODE B, C*
>>>
>>> If you can tolerate some temporary inaccuracy you can use QUORUM but may
>>> still have the situation where
>>>
>>> Write succeeds on node A at timestamp 1, B succeeds at timestamp 2
>>> Read succeeds on node B and C at timestamp 1
>>>
>>> If you need fully race condition free counts I'm afraid you need to use
>>> SERIAL or LOCAL_SERIAL (for in DC only accuracy)
>>>
>>> On Thu, Mar 16, 2017 at 1:04 PM, srinivasarao daruna <
>>> sree.srin...@gmail.com> wrote:
>>>
>>> Replication strategy is SimpleReplicationStrategy.
>>>
>>> Smith is : EC2 snitch. As we deployed cluster on EC2 instances.
>>>
>>> I was worried that CL=ALL have more read latency and read failures. But
>>> won't rule out trying it.
>>>
>>> Should I switch select count (*) to select partition_key column? Would
>>> that be of any help.?
>>>
>>>
>>> Thank you
>>> Regards
>>> Srini
>>>
>>> On Mar 16, 2017 12:46 PM, "Arvydas Jonusonis" <
>>> arvydas.jonuso...@gmail.com> wrote:
>>>
>>> What are your replication strategy and snitch settings?
>>>
>>> Have you tried doing a read at CL=ALL? If it's an actual inconsistency
>>> issue (missing data), this should cause the correct results to be returned.
>>> You'll need to run a repair to fix the inconsistencies.
>>>
>>> If all the data is actually there, you might have one or several nodes
>>> that aren't identifying the correct replicas.
>>>
>>> Arvydas
>>>
>>>
>>>
>>> On Thu, Mar 16, 2017 at 5:31 PM, srinivasarao daruna <
>>> sree.srin...@gmail.com> wrote:
>>>
>>> Hi Team,
>>>
>>> We are struggling with a problem related to cassandra counts, after
>>> backup and restore of the cluster. Aaron Morton has suggested to send this
>>> to user list, so some one of the list will be able to help me.
>>>
>>> We are have a rest api to talk to cassandra and one of our query which
>>> fetches count is creating problems for us.
>>>
>>> We have done backup and restore and copied all the data to new cluster.
>>> We have done nodetool refresh on the tables, and did the nodetool repair as
>>> well.
>>>
>>> However, one of our key API call is returning inconsistent results. The
>>> result count is 0 in the first call and giving the actual values for later
>>> calls. The query frequency is bit high and failure rate has also raised
>>> considerably.
>>>
>>> 1) The count query has partition keys in it. Didnt see any read timeout
>>> or any errors from api logs.
>>>
>>> 2) This is how our code of creating session looks.
>>>
>>> val poolingOptions = new PoolingOptions
>>> poolingOptions
>>>   .setCoreConnectionsPerHost(HostDistance.LOCAL, 4)
>>>   .setMaxConnectionsPerHost(HostDistance.LOCAL, 10)
>>>   .setCoreConnectionsPerHost(HostDistance.REMOTE, 4)
>>>   .setMaxConnectionsPerHost( HostDistance.REMOTE, 10)
>>>
>>> val builtCluster = clusterBuilder.withCredentials(username, password)
>>>   .withPoolingOptions(poolingOptions)
>>>   .build()
>>> val cassandraSession = builtCluster.get.connect()
>>>
>>> val preparedStatement = cassandraSession.prepare(state
>>> ment).setConsistencyLevel(ConsistencyLevel.QUORUM)
>>> cassandraSession.execute(preparedStatement.bind(args :_*))
>>>
>>> Query: SELECT count(*) FROM table_name WHERE parition_column=? AND
>>> text_column_of_clustering_key=? AND date_column_of_clustering_key<=?
>>> AND date_column_of_clustering_key>=?
>>>
>>> 3) Cluster configuration:
>>>
>>> 6 Machines: 3 seeds, we are using apache cassandra 3.9 version. Each
>>> machine is equipped with 16 Cores and 64 GB Ram.
>>>
>>> Replication factor is 3, and write consistency is ONE and read
>>> consistency is QUORUM.
>>>
>>> 4) cassandra is never down on any machine
>>>

Re: Does "nodetool repair" need to be run on each node for a given table?

2017-03-14 Thread daemeon reiydelle
Am I unreasonable in expecting a poster to have looked at the documentation
before posting? And that reposting the same query WITHOUT reading the
documents (when pointed out to them) when asked to do so is not
appropriate? Do we have a way to blackball such?


*...*



*Daemeon C.M. Reiydelle | USA (+1) 415.501.0198 | London (+44) (0) 20 8144 9872*

On Mon, Mar 13, 2017 at 1:30 PM, Thakrar, Jayesh <
jthak...@conversantmedia.com> wrote:

> I understand that the nodetool command connects to a specific server and
> for many of the commands, e.g. "info", "compactionstats", etc, the
> information is for that specific node.
>
> While for some other commands like "status", the info is for the whole
> cluster.
>
>
>
> So is "nodetool repair" that operates at a single node level (i.e. repairs
> the partitions contained on the target node?).
>
> If so, what is the recommended approach to doing repairs?
>
>
>
> E.g. we have a large number of tables (20+), large amount of data (40+ TB)
> and a number of nodes (40+).
>
> Do I need to iterate through each server AND each table?
>
>
>
> Thanks,
>
> Jayesh
>
>
>
>
>
>
>


Re: Does "nodetool repair" need to be run on each node for a given table?

2017-03-13 Thread daemeon reiydelle
I find it helpful to read the manual first. After review, I would be happy
to answer specific questions.

https://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsRepair.html
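For the concrete question above, the usual rotation is a primary-range repair
run on every node in turn; a hedged sketch, with the host list and keyspace as
placeholders:

    # "-pr" repairs only each node's primary ranges, so looping over every node
    # covers all data exactly once per cycle.
    for host in node1 node2 node3; do
        ssh "$host" 'nodetool repair -pr my_keyspace'
    done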


*...*



*Daemeon C.M. Reiydelle | USA (+1) 415.501.0198 | London (+44) (0) 20 8144 9872*

On Mon, Mar 13, 2017 at 1:30 PM, Thakrar, Jayesh <
jthak...@conversantmedia.com> wrote:

> I understand that the nodetool command connects to a specific server and
> for many of the commands, e.g. "info", "compactionstats", etc, the
> information is for that specific node.
>
> While for some other commands like "status", the info is for the whole
> cluster.
>
>
>
> So is "nodetool repair" that operates at a single node level (i.e. repairs
> the partitions contained on the target node?).
>
> If so, what is the recommended approach to doing repairs?
>
>
>
> E.g. we have a large number of tables (20+), large amount of data (40+ TB)
> and a number of nodes (40+).
>
> Do I need to iterate through each server AND each table?
>
>
>
> Thanks,
>
> Jayesh
>
>
>
>
>
>
>


Re: scylladb

2017-03-11 Thread daemeon reiydelle
Recall that garbage collection on a busy node can occur minutes or seconds
apart. Note that stop the world GC also happens as frequently as every
couple of minutes on every node. Remove that and do the simple arithmetic.


sent from my mobile
Daemeon Reiydelle
skype daemeon.c.m.reiydelle
USA 415.501.0198

On Mar 10, 2017 8:59 AM, "Bhuvan Rawal" <bhu1ra...@gmail.com> wrote:

> Agreed, C++ gives an added advantage to talk to underlying hardware with
> better efficiency; it sounds good, but can a piece of code written in C++
> give 1000% the throughput of a Java app? Is a TPC design 10X more performant
> than a SEDA arch?
>
> And if C/C++ is indeed that fast how can Aerospike (which is itself
> written in C) claim to be 10X faster than Scylla here
> http://www.aerospike.com/benchmarks/scylladb-initial/ ? (Combining your's
> and aerospike's benchmarks it appears that Aerospike is 100X performant
> than C* - I highly doubt that!! )
>
> For a moment lets forget about evaluating 2 different databases, one can
> observe 10X performance difference between a mistuned cassandra cluster and
> one thats tuned as per data model - there are so many Tunables in yaml as
> well as table configs.
>
> Idea is - in order to strengthen your claim, you need to provide complete
> system metrics (Disk, CPU, Network), the OPS increase starts to decay along
> with the configs used. Having plain ops per second and 99p latency is
> blackbox.
>
> Regards,
> Bhuvan
>
> On Fri, Mar 10, 2017 at 12:47 PM, Avi Kivity <a...@scylladb.com> wrote:
>
>> ScyllaDB engineer here.
>>
>> C++ is really an enabling technology here. It is directly responsible for
>> a small fraction of the gain by executing faster than Java.  But it is
>> indirectly responsible for the gain by allowing us direct control over
>> memory and threading.  Just as an example, Scylla starts by taking over
>> almost all of the machine's memory, and dynamically assigning it to
>> memtables, cache, and working memory needed to handle requests in flight.
>> Memory is statically partitioned across cores, allowing us to exploit NUMA
>> fully.  You can't do these things in Java.
>>
>> I would say the major contributors to Scylla performance are:
>>  - thread-per-core design
>>  - replacement of the page cache with a row cache
>>  - careful attention to many small details, each contributing a little,
>> but with a large overall impact
>>
>> While I'm here I can say that performance is not the only goal here, it
>> is stable and predictable performance over varying loads and during
>> maintenance operations like repair, without any special tuning.  We measure
>> the amount of CPU and I/O spent on foreground (user) and background
>> (maintenance) tasks and divide them fairly.  This work is not complete but
>> already makes operating Scylla a lot simpler.
>>
>>
>> On 03/10/2017 01:42 AM, Kant Kodali wrote:
>>
>> I dont think ScyllaDB performance is because of C++. The design decisions
>> in scylladb are indeed different from Cassandra such as getting rid of SEDA
>> and moving to TPC and so on.
>>
>> If someone thinks it is because of C++ then just show the benchmarks that
>> proves it is indeed the C++ which gave 10X performance boost as ScyllaDB
>> claims instead of stating it.
>>
>>
>> On Thu, Mar 9, 2017 at 3:22 PM, Richard L. Burton III <mrbur...@gmail.com
>> > wrote:
>>
>>> They spend an enormous amount of time focusing on performance. You can
>>> expect them to continue on with their optimization and keep crushing it.
>>>
>>> P.S., I don't work for ScyllaDB.
>>>
>>> On Thu, Mar 9, 2017 at 6:02 PM, Rakesh Kumar <rakeshkumar...@outlook.com
>>> > wrote:
>>>
>>>> In all of their presentation they keep harping on the fact that
>>>> scylladb is written in C++ and does not carry the overhead of Java.  Still
>>>> the difference looks staggering.
>>>> 
>>>> From: daemeon reiydelle <daeme...@gmail.com>
>>>> Sent: Thursday, March 9, 2017 14:21
>>>> To: user@cassandra.apache.org
>>>> Subject: Re: scylladb
>>>>
>>>> The comparison is fair, and conservative. Did substantial performance
>>>> comparisons for two clients, both results returned throughputs that were
>>>> faster than the published comparisons (15x as I recall). At that time the
>>>> client preferred to utilize a Cass COTS solution and use a caching solution
>>>> for OLA compliance.
>>>>
>>>>
>

Re: Disconnecting two data centers

2017-03-08 Thread daemeon reiydelle
I guess it depends on the experience one has. This is a common process to
bring up, move, build full prod copies, etc.

What is outlined is pretty much exactly what I have done 20-50 times (too
many to remember).

FYI, some of this should be done with nodes DOWN.



*...*



*Daemeon C.M. Reiydelle | USA (+1) 415.501.0198 | London (+44) (0) 20 8144 9872*

On Wed, Mar 8, 2017 at 6:38 AM, Ryan Svihla  wrote:

> it's a bit tricky and I don't advise it, but the typical pattern is (say
> you have DC1 and DC2):
>
> 1. partition the data centers from one another..kill the routing however
> you can (firewall, etc)
> 2. while partitioned log onto DC1 alter schema so that DC2 is not
> replicating), repeat for other.
> 2a. If using propertyfilesnitch remove the DC2 from all the DC1 property
> files and vice versa
> 2b. change the seeds setting in the cassandra.yaml accordingly (DC1 yaml's
> shouldn't have any seeds from DC2, etc)
> 3. rolling restart to account for this.
> 4. run repair (not even sure how necessary this step is, but after doing
> RF changes I do this to prevent hiccups)
>
> I've done this a couple of times but really failing all of that, the more
> well supported and harder to mess up but more work approach is:
>
> 1. Set DC2 to RF 0
> 2. remove all nodes from DC2
> 3. change yamls for seed files (update property file if need be)
> 4. create new cluster in DC2,
> 5. use sstableloader to stream DC1 data to DC2.
>
> On Wed, Mar 8, 2017 at 8:13 AM, Chuck Reynolds 
> wrote:
>
>> I’m running C* 2.1.13 and I have two rings that are replicating data from
>> our data center to one in AWS.
>>
>>
>>
>> We would like to keep both of them for a while but we have a need to
>> disconnect them.  How can this be done?
>>
>
>
>
> --
>
> Thanks,
> Ryan Svihla
>
>
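Step 2 of the first procedure quoted above (altering the schema so the other
DC no longer replicates) comes down to dropping that DC from each keyspace's
replication map. A hedged CQL sketch with placeholder names, repeated per
keyspace (including system_auth if it was replicated to both):

    -- 'DC2' is simply omitted so it no longer holds replicas.
    ALTER KEYSPACE my_keyspace WITH replication = {
      'class': 'NetworkTopologyStrategy',
      'DC1': 3
    };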


Re: AWS NVMe i3 instances performances

2017-03-01 Thread daemeon reiydelle
We did. We found, with both CentOS and Ubuntu (both for application
compatibility reasons), that there is somewhat less IO and better CPU
throughput at the price point. At the time my optimization work for that
client ended, Amazon was looking at the IO issue, as perhaps the frame
configurations needed further optimization; this was 2 months ago. A very
superficial pass (no kernel tuning) done last month seems to indicate the
same tradeoffs. Testing was performed in both cases with the C* stress tool
and with CI test suites. Does this help?
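A rough sketch of the kind of stress run referred to, using the
cassandra-stress tool that ships with Cassandra; operation counts, thread
counts and the node address are placeholders:

    cassandra-stress write n=5000000 -rate threads=200 -node 10.0.0.10
    cassandra-stress read  n=5000000 -rate threads=200 -node 10.0.0.10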


*...*



*Daemeon C.M. Reiydelle | USA (+1) 415.501.0198 | London (+44) (0) 20 8144 9872*

On Wed, Mar 1, 2017 at 3:30 AM, Romain Hardouin  wrote:

> Hi all,
>
> AWS launched i3 instances a few days ago*. NVMe SSDs seem very promising!
>
> Did someone already benchmark an i3 with Cassandra? e.g. i2 vs i3
> If yes, with which OS and kernel version?
> Did you make any system tuning for NVMe? e.g. PCIe IRQ? etc.
>
> We plan to make some benchmarks but Debian is not listed as a supported OS
> so we have to upgrade our kernel and see if it works :P
> Here is what we have in mind for the time being:
> * OS: Debian
> * Kernel: v4.9
> * IRQ: try several configurations
> Also I would like to compare performances between our Debian AMI and a
> standard AWS Linux AMI.
>
> Thanks!
>
> [*] https://aws.amazon.com/fr/blogs/aws/now-available-i3-
> instances-for-demanding-io-intensive-applications/
>
>
>


Re: Current data density limits with Open Source Cassandra

2017-02-08 Thread daemeon reiydelle
YMMV. Think of that storage limit as fairly reasonable for active data
likely to tombstone. Add more for older/historic data. Then think about the
time to recover a node.


*...*



*Daemeon C.M. Reiydelle | USA (+1) 415.501.0198 | London (+44) (0) 20 8144 9872*

On Wed, Feb 8, 2017 at 2:14 PM, Ben Slater 
wrote:

> The major issue we’ve seen with very high density (we generally say <2TB
> node is best) is manageability - if you need to replace a node or add a node
> then restreaming data takes a *long* time and there is a fairly high chance
> of a glitch in the universe meaning you have to start again before it’s
> done.
>
> Also, if you’re uses STCS you can end up with gigantic compactions which
> also take a long time and can cause issues.
>
> Heap limitations are mainly related to partition size rather than node
> density in my experience.
>
> Cheers
> Ben
>
> On Thu, 9 Feb 2017 at 08:20 Hannu Kröger  wrote:
>
>> Hello,
>>
>> Back in the day it was recommended that max disk density per node for
>> Cassandra 1.2 was at around 3-5TB of uncompressed data.
>>
>> IIRC it was mostly because of heap memory limitations? Now that off-heap
>> support is there for certain data and 3.x has different data storage
>> format, is that 3-5TB still a valid limit?
>>
>> Does anyone have experience on running Cassandra with 3-5TB compressed
>> data ?
>>
>> Cheers,
>> Hannu
>
> --
> 
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798
>


Re: Instaclustr Masters scholarship

2017-02-07 Thread daemeon reiydelle
A bunch more welcome than here in the US, to our deep shame and foolishness.

Sadly while I am actually involved in this area, I am happy in San
Francisco. I would be interested in being part of a pro bono team should
that transpire.

Thanks, D.


*...*



*Daemeon C.M. Reiydelle | USA (+1) 415.501.0198 | London (+44) (0) 20 8144 9872*

On Tue, Feb 7, 2017 at 7:24 PM, Ben Bromhead  wrote:

> As part of our commitment to contributing back to the Apache Cassandra
> open source project and the wider community we are always looking for ways
> we can foster knowledge sharing and improve usability of Cassandra itself.
> One of the ways we have done so previously was to open up our internal
> builds and versions of Cassandra (https://github.com/instaclustr/cassandra
> ).
>
> We have also been looking at a few novel or outside the box ways we can
> further contribute back to the community. As such, we are sponsoring a
> masters project in conjunction with the Australian based University of
> Canberra. Instaclustr’s staff will be available to provide advice and
> feedback to the successful candidate.
>
> *Scope*
> Distributed database systems are relatively new technology compared to
> traditional relational databases. Distributed advantages provide
> significant advantages in terms of reliability and scalability but often at
> a cost of increased complexity. This complexity presents challenges for
> testing of these systems to prove correct operation across all possible
> system states. The scope of this masters scholarship is to use the Apache
> Cassandra repair process as an example to consider and improve available
> approaches to distributed database systems testing.
>
> The repair process in Cassandra is a scheduled process that runs to ensure
> the multiple copies of each piece of data that is maintained by Cassandra
> are kept synchronised. Correct operation of repairs has been an ongoing
> challenge for the Cassandra project partly due to the difficulty in
> designing and developing  comprehensive automated tests for this
> functionality.
>
> The expected scope of this project is to:
>
>- survey and understand the existing testing framework available as
>part of the Cassandra project, particularly as it pertains to testing
>repairs
>- consider, research and develop enhanced approaches to testing of
>repairs
>- submit any successful approaches to the Apache Cassandra project for
>feedback and inclusion in the project code base
>
> Australia is a pretty great place to advance your education and is
> welcoming of foreign students.
>
> We are also open to sponsoring a PhD project with a more in depth focus
> for the right candidate.
>
> For more details please don't hesitate to get in touch with myself or
> reach out to i...@instaclustr.com.
>
> Cheers
>
> Ben
> --
> Ben Bromhead
> CTO | Instaclustr 
> +1 650 284 9692
> Managed Cassandra / Spark on AWS, Azure and Softlayer
>


Re: Why does `now()` produce different times within the same query?

2016-11-30 Thread daemeon reiydelle
This is not a bug, and in fact changing it would be a serious bug.

What it is is a wonderful case of bad coding: would one expect a
java/py/bash script that loops on a bunch of read/execute/update calls,
where each iteration calls time, to return the same exact time for the
duration of the execution of the code? Whether the code runs for 5 seconds
or 5 hours?

Every call to a system call is unique, including within C*. Calling now
PRIOR to initiating multiple inserts is in most cases exactly what one does
to assure unique time stamps FOR THE BATCH OF INSERTS. To get a nearly
identical system time as would be the uuid of the row, one tries to call
time as close to just before the insert as possible. Then repeat.

You have a logic issue in your code. If you want the same value for a set
of calls, the ONLY practice is to set the value before initiating the
sequence of calls.
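A minimal sketch of that client-side approach, here with the DataStax Java
driver 3.x (keyspace, table and column names are illustrative, and the table
is assumed to already exist):

    import com.datastax.driver.core.*;
    import com.datastax.driver.core.utils.UUIDs;
    import java.util.UUID;

    public class SameTimeuuidExample {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("my_keyspace");
            PreparedStatement ins =
                session.prepare("INSERT INTO events (id, bucket, ts) VALUES (?, ?, ?)");

            // Generate the timeuuid once, before the sequence of inserts, and
            // bind that same value to every statement instead of calling now().
            UUID ts = UUIDs.timeBased();
            for (int bucket = 0; bucket < 3; bucket++) {
                session.execute(ins.bind("sensor-1", bucket, ts));
            }
            cluster.close();
        }
    }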



*...*



*Daemeon C.M. Reiydelle | USA (+1) 415.501.0198 | London (+44) (0) 20 8144 9872*

On Wed, Nov 30, 2016 at 6:16 PM, Cody Yancey  wrote:

> Getting the same TimeUUID values might be a major problem. Getting two
> different TimeUUIDs that at least have time component would not be a major
> problem as this is the main case today. Getting different time components
> is actually the corner case, and it is a corner case that breaks
> Internet-of-Things applications. We can tightly control clock skew in our
> cluster. We most definitely CANNOT control clock skew on the thousands of
> sensors that write to our cluster.
>
> Thanks,
> Cody
>
> On Wed, Nov 30, 2016 at 5:33 PM, Robert Wille  wrote:
>
>> In my opinion, this is not broken and “fixing” it would break existing
>> code. Consider a batch that includes multiple inserts, each of which
>> inserts the value returned by now(). Getting the same UUID for each insert
>> would be a major problem.
>>
>> Cheers
>>
>> Robert
>>
>>
>> On Nov 30, 2016, at 4:46 PM, Todd Fast  wrote:
>>
>> FWIW I'd suggest opening a bug--this behavior is certainly quite
>> unexpected and more than just a documentation issue. In general I can't
>> imagine any desirable properties of the current implementation, and there
>> are likely a bunch of latent bugs sitting out there, so it should be fixed.
>>
>> Todd
>>
>> On Wed, Nov 30, 2016 at 12:37 PM Terry Liu  wrote:
>>
>>> Sorry for my typo. Obviously, I meant:
>>> "It appears that a single query that calls Cassandra's`now()` time
>>> function *multiple times *may actually cause a query to write or return
>>> different times."
>>>
>>> Less of a surprise now that I realize more about the implementation, but
>>> I agree that more explicit documentation around when exactly the
>>> "execution" of each now() statement happens and what implications it has
>>> for the resulting timestamps would be helpful when running into this.
>>>
>>> Thanks for the quick responses!
>>>
>>> -Terry
>>>
>>>
>>>
>>> On Tue, Nov 29, 2016 at 2:45 PM, Marko Švaljek 
>>> wrote:
>>>
>>> every now() call in statement is under the hood "replaced" with newly
>>> generated uuid.
>>>
>>> It can happen that they belong to  different milliseconds in time.
>>>
>>> If you need to have same timestamps you need to set them on the client
>>> side.
>>>
>>>
>>> @msvaljek 
>>>
>>> 2016-11-29 22:49 GMT+01:00 Terry Liu :
>>>
>>> It appears that a single query that calls Cassandra's `now()` time
>>> function may actually cause a query to write or return different times.
>>>
>>> Is this the expected or defined behavior, and if so, why does it behave
>>> like this rather than evaluating `now()` once across an entire statement?
>>>
>>> This really affects UPDATE statements but to test it more easily, you
>>> could try something like:
>>>
>>> SELECT toTimestamp(now()) as a, toTimestamp(now()) as b
>>> FROM keyspace.table
>>> LIMIT 100;
>>>
>>> If you run that a few times, you should eventually see that the
>>> timestamp returned moves onto the next millisecond mid-query.
>>>
>>> --
>>> *Software Engineer*
>>> Turnitin - http://www.turnitin.com
>>> t...@turnitin.com
>>>
>>>
>>>
>>>
>>>
>>> --
>>> *Software Engineer*
>>> Turnitin - http://www.turnitin.com
>>> t...@turnitin.com
>>>
>>
>>
>


Re: Throughput of hints delivery

2016-09-17 Thread daemeon reiydelle
Timeouts indicate network (or equivalent) throughput delays, anywhere from the
physical box's network card out to the other DC's card. If you are using VMs,
add that layer as well. Your network team needs to be looking for ANY
timeouts, retries, packets delivered in a retry window > 0, etc. ANY value
other than zero, ever, is your problem.
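
For what it's worth, a rough sketch of the kind of watch I mean (the interval
and grep patterns are just a starting point); run it on each node and alert on
any counter that keeps growing:

while true; do
  date
  netstat -s | grep -iE 'retrans|timeout'   # TCP retransmits/timeouts; should stay flat
  nodetool tpstats | grep -i hinted         # hinted handoff tasks still pending/blocked
  sleep 10
done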


*...*



*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Sat, Sep 17, 2016 at 12:02 PM, laxmikanth sadula  wrote:

> Hi Matija,
>
> All nodes are UP & running and even GC patterns are all well. But I see a
> lot of "Timed out replaying hints" in the HintedHandOff Manager, and I suspect
> this might be the reason why GBs of hints are getting piled up instead of
> being properly delivered.
> So this clearly indicates some network-related issues, so I just wanted to
> know the way to monitor hints delivery throughput and also TCP
> throughput (packets sent, received, dropped, etc. on an interface).
>
> If anyone monitoring such stats, please let me know.
>
> On Sat, Sep 17, 2016 at 11:26 PM, Matija Gobec 
> wrote:
>
>> Hi,
>>
>> You should first figure out why you have so many hints and then think
>> about throughput of hints delivery.
>> Hints are generated for dead nodes and in a healthy cluster are not
>> present.
>> Are all your nodes alive and running? What is the issue of inter DC
>> connectivity?
>>
>> Matija
>>
>> --
>>
>> *Matija Gobec*
>> *Co-Founder & Senior Consultant*
>> www.smartcat.io
>>
>>
>>
>> *Data  --> Knowledge
>>  --> Power  *
>>
>> On Sat, Sep 17, 2016 at 3:16 PM, laxmikanth sadula <
>> laxmikanth...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Is there any way to monitor hints delivery throughout/performance/issue
>>> delivering hints?
>>>
>>> We have 2 DC c* cluster with 2.0.17 with RF=3 setup. Due to inter DC
>>> connectivity issues/some other issues hints shoot upto GBs/node.
>>>
>>> So I would like to monitor hints throughput/pin point the reason for
>>> hints growth on nodes.
>>>
>>> Kindly let us know if any one of you have such thing to monitor hints.
>>>
>>>
>>> Thanks
>>> Laxmikanth
>>>
>>>
>>> --
>>> Regards,
>>> Laxmikanth
>>> 99621 38051
>>>
>>>
>>>
>>
>
>
> --
> Regards,
> Laxmikanth
> 99621 38051
>
>


Re: Questions about anti-entropy repair

2016-07-20 Thread daemeon reiydelle
I don't know if my perspective on this will assist, so YMMV:

Summary

   1. Nodetool repairs are required when a node has issues and can't get
   its resync (e.g. hinted handoff) done: the culprit is usually the network,
   sometimes the container/VM layer, rarely disk.
   2. Scripts to do partition range repair are a pain to maintain, and you have
   to be CONSTANTLY checking for new keyspaces, parsing them, etc. GitHub
   project, anyone?
   3. Monitor/monitor/monitor: if you do a best-practices job of actually
   monitoring the FULL stack, you only need to do repairs when the world goes
   south.
   4. Are you alerted when errors show up in the logs, the network goes wacky,
   etc.? No? Then you have to CYA by doing Hail Mary passes with periodic
   nodetool repairs.
   5. Nodetool repair is a CYA for a cluster whose status is not well
   monitored.

Daemeon's thoughts:

Nodetool repair is not required for a cluster that is, and "always has been",
in a known good state. Monitoring of the relevant logs/network/disk/etc. is
the only way that I know of to assure this state. Because nodes can disappear
(e.g. on AWS, and on EVERY ONE OF my clients' infrastructures with their
screwed-up networks), the cluster *can* get overloaded (network traffic),
causing hinted handoffs to hit all of the worst-case corner cases you could
never hope to see.

So, if you have good monitoring in place to assure that there is known good
cluster behaviour (network, disk, etc.), repairs are not required until you
are alerted that a cluster health problem has occurred. Partition range
repair is a pain in various parts of the anatomy because one has to
CONSTANTLY be updating the scripts that generate the commands (I have not
seen a GitHub project around this; I would love to see responses that point
one out!).
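
As a rough illustration of the kind of wrapper I mean (this assumes a cqlsh new
enough to support -e and that it runs locally on each node), re-discovering the
keyspaces on every pass so newly created ones are not silently skipped:

for ks in $(cqlsh -e "DESCRIBE KEYSPACES"); do
  case "$ks" in system*) continue ;; esac          # skip the system keyspaces
  echo "repairing $ks"
  nodetool repair -pr "$ks" || echo "repair of $ks FAILED" >&2
done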



*...*



*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Wed, Jul 20, 2016 at 4:33 AM, Alain RODRIGUEZ  wrote:

> Hi Satoshi,
>
>
>> Q1:
>> According to the DataStax document, it's recommended to run full repair
>> weekly or monthly. Is it needed even if repair with partitioner range
>> option ("nodetool repair -pr", in C* v2.2+) is set to run periodically for
>> every node in the cluster?
>>
>
> More accurately you need to run a repair for each node and each table
> within the gc_grace_seconds value defined at the table level to ensure no
> deleted data will return. Also running this on a regular basis ensures a
> constantly low entropy in your cluster, allowing better consistency (if not
> using a strong consistency like with CL.R = quorum).
>
> A full repair means every piece of data has been repaired. On a 3 node
> cluster with RF=3, running 'nodetool repair -pr' on the 3 nodes or
> 'nodetool repair' on one node are an equivalent "full repair". The best
> approach is often to run repair with '-pr' on all the nodes indeed. This is
> a full repair.
>
> Is it a good practice to repair a node without using non-repaired
>> snapshots when I want to restore a node because repair process is too slow?
>
>
> I am sorry, this is unclear to me. But from this "actually 1GB data is
> updated because the snapshot is already repaired" I understand you are
> using incremental repairs (or that you think that Cassandra repair uses it
> by default, which is not the case in your version).
> http://www.datastax.com/dev/blog/more-efficient-repairs
>
> Also, be aware that repair is a PITA for all the operators using
> Cassandra, that lead to many tries to improve things:
>
> Range repair: https://github.com/BrianGallew/cassandra_range_repair
> Reaper: https://github.com/spotify/cassandra-reaper
> Ticket to automatically schedule / handle repairs in Cassandra:
> https://issues.apache.org/jira/browse/CASSANDRA-10070
> Ticket to switch to Mutation Based Repairs (MBR):
> https://issues.apache.org/jira/browse/CASSANDRA-8911
>
> And probably many more... There is a lot to read and try, repair is an
> important yet non trivial topic for any Cassandra operator.
>
> C*heers,
> ---
> Alain Rodriguez - al...@thelastpickle.com
> France
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
>
>
> 2016-07-14 9:41 GMT+02:00 Satoshi Hikida :
>
>> Hi,
>>
>> I have two questions about anti-entropy repair.
>>
>> Q1:
>> According to the DataStax document, it's recommended to run full repair
>> weekly or monthly. Is it needed even if repair with partitioner range
>> option ("nodetool repair -pr", in C* v2.2+) is set to run periodically for
>> every node in the cluster?
>>
>> References:
>> - DataStax, "When to run anti-entropy repair",
>> http://docs.datastax.com/en/cassandra/2.2/cassandra/operations/opsRepairNodesWhen.html
>>
>>
>> Q2:
>> Is it a good practice to repair a node without using non-repaired
>> snapshots when I want to restore a node because repair process is too slow?
>>
>> I've done some simple verifications for anti-entropy repair and found out
>> that the repair process spends 

Re: Problems with cassandra on AWS

2016-07-11 Thread daemeon reiydelle
Well, I seem to recall that the private IPs are valid for communications
WITHIN one VPC. I assume you can log into one machine and ping (or ssh) the
others. If so, check that cassandra.yaml is not set to listen on 127.0.0.1
(localhost).
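
For reference, a minimal sketch of the cassandra.yaml pieces that usually
matter here (the addresses are placeholders; each node uses its own private IP,
which is fine for node-to-node traffic within one VPC):

listen_address: 172.31.7.10         # this node's private IP, never 127.0.0.1
rpc_address: 0.0.0.0                # or the private IP; where clients connect
broadcast_rpc_address: 172.31.7.10  # what ends up advertised in system.peers
# seed list: a few private IPs, e.g. "172.31.7.10,172.31.7.11"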


*...*



*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Sun, Jul 10, 2016 at 4:54 PM, Kant Kodali  wrote:

> Hi Guys,
>
> I installed a 3 node Cassandra cluster on AWS and my replication factor is
> 3. I am trying to insert some data into a table. I set the consistency
> level of QUORUM at a Cassandra Session level. It only inserts into one node
> and unable to talk to other nodes because it is trying to contact other
> nodes through private IP and obviously that is failing so I am not sure how
> to change settings in say cassandra.yaml or somewhere such that rpc_address
> in system.peers table is updated to public IP's? I tried changing the seeds
> to all public IP's that didn't work as it looks like ec2 instances cannot
> talk to each other using public IP's. any help would be appreciated!
>
> Thanks,
> kant
>


Re: Blog post on Cassandra's inner workings and performance - feedback welcome

2016-07-09 Thread daemeon reiydelle
I saw this really useful post a few days ago. I found the organization and
presentation quite clear and helpful (I often struggle trying to do high
level comparisons of Hadoop and Cass). Thank you!

If there were sections I would like to see your clear thoughts appear
within, they would be around:

   - (1) why networks need to be clean (the impact of "dirty"/erratic
   networks);
   - (2) the impact of Java (off-heap memory, stop-the-world garbage
   collection, why more memory makes things worse);
   - (3) table design decisions (read-mostly, write-mostly, mixed
   read/write, etc.)

A really great writeup, thank you!





*...*



*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Fri, Jul 8, 2016 at 11:59 PM, Manuel Kiessling <
kiessling.man...@gmail.com> wrote:

> Yes, the joke's on me. It was a copy error, and I've since posted
> the correct URL (journeymonitor.com
> :4000/tutorials/2016/02/29/cassandra-inner-workings-and-how-this-relates-to-performance/).
>
> Substantial feedback regarding the actual post still very much welcome.
>
> Regards,
> Manuel
>
> Am 09.07.2016 um 03:32 schrieb daemeon reiydelle <daeme...@gmail.com>:
>
> Localhost is a special network address that never leaves the operating
> system. It only goes "half way" down the IP stack. Thanks for your efforts!
>
>
> *...*
>
>
>
> *Daemeon C.M. Reiydelle*
> *USA (+1) 415.501.0198*
> *London (+44) (0) 20 8144 9872*
>
> On Fri, Jul 8, 2016 at 5:53 PM, Joaquin Alzola <joaquin.alz...@lebara.com>
> wrote:
>
>> Hi Manuel
>>
>>
>>
>> I think localhost will not work for people on the internet.
>>
>>
>>
>> BR
>>
>>
>>
>> Joaquin
>>
>>
>>
>> *From:* kiessling.man...@gmail.com [mailto:kiessling.man...@gmail.com] *On
>> Behalf Of *Manuel Kiessling
>> *Sent:* 07 July 2016 14:12
>> *To:* user@cassandra.apache.org
>> *Subject:* Blog post on Cassandra's inner workings and performance -
>> feedback welcome
>>
>>
>>
>> Hi all,
>>
>> I'm currently in the process of understanding the inner workings of
>> Cassandra with regards to network and local storage mechanisms and
>> operations. In order to do so, I've written a blog post about it which is
>> now in a "first final" version.
>>
>> Any feedback, especially corrections regarding misunderstandings on my
>> side, would be highly appreciated. The post really represents my very
>> subjective view on how Cassandra works under the hood, which makes it prone
>> to errors of course.
>>
>> You can access the current version at
>> http://localhost:4000/tutorials/2016/02/29/cassandra-inner-workings-and-how-this-relates-to-performance/
>>
>>
>>
>> Thanks,
>>
>> --
>>
>>  Manuel
>> This email is confidential and may be subject to privilege. If you are
>> not the intended recipient, please do not copy or disclose its content but
>> contact the sender immediately upon receipt.
>>
>
>


Re: Blog post on Cassandra's inner workings and performance - feedback welcome

2016-07-08 Thread daemeon reiydelle
Localhost is a special network address that never leaves the operating
system. It only goes "half way" down the IP stack. Thanks for your efforts!


*...*



*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Fri, Jul 8, 2016 at 5:53 PM, Joaquin Alzola 
wrote:

> Hi Manuel
>
>
>
> I think localhost will not work for people on the internet.
>
>
>
> BR
>
>
>
> Joaquin
>
>
>
> *From:* kiessling.man...@gmail.com [mailto:kiessling.man...@gmail.com] *On
> Behalf Of *Manuel Kiessling
> *Sent:* 07 July 2016 14:12
> *To:* user@cassandra.apache.org
> *Subject:* Blog post on Cassandra's inner workings and performance -
> feedback welcome
>
>
>
> Hi all,
>
> I'm currently in the process of understanding the inner workings of
> Cassandra with regards to network and local storage mechanisms and
> operations. In order to do so, I've written a blog post about it which is
> now in a "first final" version.
>
> Any feedback, especially corrections regarding misunderstandings on my
> side, would be highly appreciated. The post really represents my very
> subjective view on how Cassandra works under the hood, which makes it prone
> to errors of course.
>
> You can access the current version at
> http://localhost:4000/tutorials/2016/02/29/cassandra-inner-workings-and-how-this-relates-to-performance/
>
>
>
> Thanks,
>
> --
>
>  Manuel
> This email is confidential and may be subject to privilege. If you are not
> the intended recipient, please do not copy or disclose its content but
> contact the sender immediately upon receipt.
>


Re: Is my cluster normal?

2016-07-07 Thread daemeon reiydelle
Those numbers, as I suspected, line up pretty well with your AWS
configuration and network latencies within AWS. It is clear that this is a
WRITE ONLY test. You might want to do a mixed (e.g. 50% read, 50% write)
test for sanity. Note that the test will populate the data BEFORE it begins
doing the read/write tests.
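
Something along these lines would do it (cassandra-stress syntax from the 2.1+
tool; the node address, counts and thread count are placeholders), writing
first so the mixed phase has data to read back:

cassandra-stress write n=1000000 cl=QUORUM -node 10.0.0.1 -rate threads=100
cassandra-stress mixed ratio\(write=1,read=1\) n=1000000 cl=QUORUM -node 10.0.0.1 -rate threads=100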

In a dedicated environment at a recent client, with 10gbit links (just
grabbing one cassandra-stress run from my archives), I see a little under twice
your throughput. Note that your latency max is the result of a stop-the-world
garbage collection. There were huge GC pauses in the run below because that
particular run was using a 24 GB (Cassandra 2.x) Java heap.

op rate   : 21567 [WRITE:21567]
partition rate: 21567 [WRITE:21567]
row rate  : 21567 [WRITE:21567]
latency mean  : 9.3 [WRITE:9.3]
latency median: 7.7 [WRITE:7.7]
latency 95th percentile   : 13.2 [WRITE:13.2]
latency 99th percentile   : 32.6 [WRITE:32.6]
latency 99.9th percentile : 97.2 [WRITE:97.2]
latency max   : 14906.1 [WRITE:14906.1]
Total partitions  : 8333 [WRITE:8333]
Total errors  : 0 [WRITE:0]
total gc count: 705
total gc mb   : 1691132
total gc time (s) : 30
avg gc time(ms)   : 43
stdev gc time(ms) : 13
Total operation time  : 01:04:23


*...*



*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Thu, Jul 7, 2016 at 2:51 PM, Yuan Fang <y...@kryptoncloud.com> wrote:

> Yes, here is my stress test result:
> Results:
> op rate   : 12200 [WRITE:12200]
> partition rate: 12200 [WRITE:12200]
> row rate  : 12200 [WRITE:12200]
> latency mean  : 16.4 [WRITE:16.4]
> latency median: 7.1 [WRITE:7.1]
> latency 95th percentile   : 38.1 [WRITE:38.1]
> latency 99th percentile   : 204.3 [WRITE:204.3]
> latency 99.9th percentile : 465.9 [WRITE:465.9]
> latency max   : 1408.4 [WRITE:1408.4]
> Total partitions  : 100 [WRITE:100]
> Total errors  : 0 [WRITE:0]
> total gc count: 0
> total gc mb   : 0
> total gc time (s) : 0
> avg gc time(ms)   : NaN
> stdev gc time(ms) : 0
> Total operation time  : 00:01:21
> END
>
> On Thu, Jul 7, 2016 at 2:49 PM, Ryan Svihla <r...@foundev.pro> wrote:
>
>> Lots of variables you're leaving out.
>>
>> Depends on write size, if you're using logged batch or not, what
>> consistency level, what RF, if the writes come in bursts, etc, etc.
>> However, that's all sort of moot for determining "normal" really you need a
>> baseline as all those variables end up mattering a huge amount.
>>
>> I would suggest using Cassandra stress as a baseline and go from there
>> depending on what those numbers say (just pick the defaults).
>>
>> Sent from my iPhone
>>
>> On Jul 7, 2016, at 4:39 PM, Yuan Fang <y...@kryptoncloud.com> wrote:
>>
>> yes, it is about 8k writes per node.
>>
>>
>>
>> On Thu, Jul 7, 2016 at 2:18 PM, daemeon reiydelle <daeme...@gmail.com>
>> wrote:
>>
>>> Are you saying 7k writes per node? or 30k writes per node?
>>>
>>>
>>> *...*
>>>
>>>
>>>
>>> *Daemeon C.M. Reiydelle*
>>> *USA (+1) 415.501.0198*
>>> *London (+44) (0) 20 8144 9872*
>>>
>>> On Thu, Jul 7, 2016 at 2:05 PM, Yuan Fang <y...@kryptoncloud.com> wrote:
>>>
>>>> writes 30k/second is the main thing.
>>>>
>>>>
>>>> On Thu, Jul 7, 2016 at 1:51 PM, daemeon reiydelle <daeme...@gmail.com>
>>>> wrote:
>>>>
>>>>> Assuming you meant 100k, that likely for something with 16mb of
>>>>> storage (probably way small) where the data is more that 64k hence will 
>>>>> not
>>>>> fit into the row cache.
>>>>>
>>>>>
>>>>> *...*
>>>>>
>>>>>
>>>>>
>>>>> *Daemeon C.M. Reiydelle*
>>>>> *USA (+1) 415.501.0198*
>>>>> *London (+44) (0) 20 8144 9872*
>>>>>
>>>>> On Thu, Jul 7, 2016 at 1:25 PM, Yuan Fang <y...@kryptoncloud.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> I have a cluster of 4 m4.xlarge nodes(4 cpus and 16 gb memory and
>>>>>> 600GB ssd EBS).
>>>>>> I can reach a cluster wide write requests of 30k/second and read
>>>>>> request about 100/second. The cluster OS load constantly above 10. Are
>>>>>> those normal?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Yuan
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Is my cluster normal?

2016-07-07 Thread daemeon reiydelle
Are you saying 7k writes per node? or 30k writes per node?


*...*



*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Thu, Jul 7, 2016 at 2:05 PM, Yuan Fang <y...@kryptoncloud.com> wrote:

> writes 30k/second is the main thing.
>
>
> On Thu, Jul 7, 2016 at 1:51 PM, daemeon reiydelle <daeme...@gmail.com>
> wrote:
>
>> Assuming you meant 100k, that likely for something with 16mb of storage
>> (probably way small) where the data is more that 64k hence will not fit
>> into the row cache.
>>
>>
>> *...*
>>
>>
>>
>> *Daemeon C.M. Reiydelle*
>> *USA (+1) 415.501.0198*
>> *London (+44) (0) 20 8144 9872*
>>
>> On Thu, Jul 7, 2016 at 1:25 PM, Yuan Fang <y...@kryptoncloud.com> wrote:
>>
>>>
>>>
>>> I have a cluster of 4 m4.xlarge nodes(4 cpus and 16 gb memory and 600GB
>>> ssd EBS).
>>> I can reach a cluster wide write requests of 30k/second and read request
>>> about 100/second. The cluster OS load constantly above 10. Are those normal?
>>>
>>> Thanks!
>>>
>>>
>>> Best,
>>>
>>> Yuan
>>>
>>>
>>
>


Re: Is my cluster normal?

2016-07-07 Thread daemeon reiydelle
Assuming you meant 100k, that is likely for something with 16 MB of storage
(probably way too small) where the data is more than 64k and hence will not
fit into the row cache.


*...*



*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Thu, Jul 7, 2016 at 1:25 PM, Yuan Fang  wrote:

>
>
> I have a cluster of 4 m4.xlarge nodes(4 cpus and 16 gb memory and 600GB
> ssd EBS).
> I can reach a cluster wide write requests of 30k/second and read request
> about 100/second. The cluster OS load constantly above 10. Are those normal?
>
> Thanks!
>
>
> Best,
>
> Yuan
>
>


Re: Debugging high tail read latencies (internal timeout)

2016-07-07 Thread daemeon reiydelle
Hmm. Would you mind looking at your network interface (with the appropriate
netstat commands)? If I am right, you will be seeing packet errors, drops,
retries, out-of-window packet receives, etc.

What you may be missing is that you reported zero cross-node DROPPED latency,
not the mean LATENCY. Check your netstats. ANY VALUE CHANGE IS BAD (except the
total read/write byte counts). If your network guys say otherwise, escalate to
someone who understands TCP retries and sliding windows.
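
Concretely, these are the sorts of counters I mean (the interface name is a
placeholder); every error/drop/retransmit number should stay at zero and never
move:

netstat -i                               # per-interface RX-ERR / TX-ERR / RX-DRP / TX-DRP
ip -s link show eth0                     # same counters with overrun/carrier detail
netstat -s | grep -iE 'retrans|timeout'  # TCP retransmits and timeouts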



*...*



*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Thu, Jul 7, 2016 at 11:35 AM, Bryan Cheng  wrote:

> Hi Nimi,
>
> My suspicions would probably lie somewhere between GC and large partitions.
>
> The first tool would probably be a trace but if you experience full client
> timeouts from dropped messages you may find it hard to find the issue. You
> can try running the trace with cqlsh's timeouts cranked all the way against
> the local node with CL=ONE to try to force the local machine to answer.
>
> What does nodetool tpstats report for dropped message counts? Are they
> very high? Primarily restricted to READ, or including MUTATION, etc. ?
>
> Are there specific PK's that trigger this behavior, either all the time or
> more consistently? That would finger either very large partition sizes or
> potentially bad hardware on a node. cfhistograms will show you various
> percentile partition sizes and your max as well.
>
> GC should be accessible via JMX and also you should have GCInspector logs
> in cassandra/system.log that should give you per-collection breakdowns.
>
> --Bryan
>
>
> On Wed, Jul 6, 2016 at 6:22 PM, Nimi Wariboko Jr 
> wrote:
>
>> Hi,
>>
>> I've begun experiencing very high tail latencies across my clusters.
>> While Cassandra's internal metrics report <1ms read latencies, measuring
>> responses from within the driver in my applications (roundtrips of
>> query/execute frames), have 90% round trip times of up to a second for very
>> basic queries (SELECT a,b FROM table WHERE pk=x).
>>
>> I've been studying the logs to try and get a handle on what could be
>> going wrong. I don't think there are GC issues, but the logs mention
>> dropped messages due to timeouts while the threadpools are nearly empty -
>>
>> https://gist.github.com/nemothekid/28b2a8e8353b3e60d7bbf390ed17987c
>>
>> Relevant line:
>> REQUEST_RESPONSE messages were dropped in last 5000 ms: 1 for internal
>> timeout and 0 for cross node timeout. Mean internal dropped latency: 54930
>> ms and Mean cross-node dropped latency: 0 ms
>>
>> Are there any tools I can use to start to understand what is causing
>> these issues?
>>
>> Nimi
>>
>>
>


Re: all the hosts are not reachable when running massive deletes

2016-04-04 Thread daemeon reiydelle
Network issues. Could be inconsistent jumbo frames or something else.

sent from my mobile
Daemeon C.M. Reiydelle
USA 415.501.0198
London +44.0.20.8144.9872
On Apr 4, 2016 5:34 AM, "Paco Trujillo"  wrote:

> Hi everyone
>
>
>
> We are having problems with our cluster (7 nodes version 2.0.17) when
> running “massive deletes” on one of the nodes (via cql command line). At
> the beginning everything is fine, but after a while we start getting
> constant NoHostAvailableException using the datastax driver:
>
>
>
> Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException:
> All host(s) tried for query failed (tried: /172.31.7.243:9042
> (com.datastax.driver.core.exceptions.DriverException: Timeout while trying
> to acquire available connection (you may want to increase the driver number
> of per-host connections)), /172.31.7.245:9042
> (com.datastax.driver.core.exceptions.DriverException: Timeout while trying
> to acquire available connection (you may want to increase the driver number
> of per-host connections)), /172.31.7.246:9042
> (com.datastax.driver.core.exceptions.DriverException: Timeout while trying
> to acquire available connection (you may want to increase the driver number
> of per-host connections)), /172.31.7.247:9042, /172.31.7.232:9042, /
> 172.31.7.233:9042, /172.31.7.244:9042 [only showing errors of first 3
> hosts, use getErrors() for more details])
>
>
>
>
>
> All the nodes are running:
>
>
>
> UN  172.31.7.244  152.21 GB  256 14.5%
> 58abea69-e7ba-4e57-9609-24f3673a7e58  RAC1
>
> UN  172.31.7.245  168.4 GB   256 14.5%
> bc11b4f0-cf96-4ca5-9a3e-33cc2b92a752  RAC1
>
> UN  172.31.7.246  177.71 GB  256 13.7%
> 8dc7bb3d-38f7-49b9-b8db-a622cc80346c  RAC1
>
> UN  172.31.7.247  158.57 GB  256 14.1%
> 94022081-a563-4042-81ab-75ffe4d13194  RAC1
>
> UN  172.31.7.243  176.83 GB  256 14.6%
> 0dda3410-db58-42f2-9351-068bdf68f530  RAC1
>
> UN  172.31.7.233  159 GB 256 13.6%
> 01e013fb-2f57-44fb-b3c5-fd89d705bfdd  RAC1
>
> UN  172.31.7.232  166.05 GB  256 15.0%
> 4d009603-faa9-4add-b3a2-fe24ec16a7c1
>
>
>
> but two of them have high cpu load, especially the 232 because I am
> running a lot of deletes using cqlsh in that node.
>
>
>
> I know that deletes generate tombstones, but with 7 nodes in the cluster I
> do not think is normal that all the host are not accesible.
>
>
>
> We have a replication factor of 3 and for the deletes I am not using any
> consistency (so it is using the default ONE).
>
>
>
> I check the nodes which a lot of CPU (near 96%) and th gc activity remains
> on 1.6% (using only 3 GB from the 10 which have assigned). But looking at
> the thread pool stats, the mutation stages pending column grows without
> stop, could be that the problem?
>
>
>
> I cannot find the reason that originates the timeouts. I already have
> increased the timeouts, but It do not think that is a solution because the
> timeouts indicated another type of error. Anyone have a tip to try to
> determine where is the problem?
>
>
>
> Thanks in advance
>


Re: Unexpected high internode network activity

2016-02-25 Thread daemeon reiydelle
Hmm. From the AWS FAQ:

*Q: If I have two instances in different availability zones, how will I be
charged for regional data transfer?*

Each instance is charged for its data in and data out. Therefore, if data
is transferred between these two instances, it is charged out for the first
instance and in for the second instance.


I really am not seeing this factored into your numbers fully. If data
transfer is only twice as much as expected, the above billing would seem to
put the numbers in line. Since (I assume) you have one copy in EACH AZ (dc
aware but really dc=az) I am not seeing the bandwidth as that much out of
line.



*...*



*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Thu, Feb 25, 2016 at 11:00 PM, Gianluca Borello <gianl...@sysdig.com>
wrote:

> It is indeed very intriguing and I really hope to learn more from the
> experience of this mailing list. To address your points:
>
> - The theory that full data is coming from replicas during reads is not
> enough to explain the situation. In my scenario, over a time window I had
> 17.5 GB of intra node activity (port 7000) for 1 GB of writes and 1.5 GB of
> reads (measured on port 9042), so even if both reads and writes affected
> all replicas, I would have (1 + 1.5) * 3 = 7.5 GB, still leaving 10 GB on
> port 7000 unaccounted
>
> - We are doing regular backups the standard way, using periodic snapshots
> and synchronizing them to S3. This traffic is not part of the anomalous
> traffic we're seeing above, since this one goes on port 80 and it's clearly
> visible with a separate bpf filter, and its magnitude is far lower than
> that anyway
>
> Thanks
>
> On Thu, Feb 25, 2016 at 9:03 PM, daemeon reiydelle <daeme...@gmail.com>
> wrote:
>
>> Intriguing. It's enough data to look like full data is coming from the
>> replicants instead of digests when the read of the copy occurs. Are you
>> doing backup/dr? Are directories copied regularly and over the network or ?
>>
>>
>> *...*
>>
>>
>>
>> *Daemeon C.M. Reiydelle*
>> *USA (+1) 415.501.0198*
>> *London (+44) (0) 20 8144 9872*
>>
>> On Thu, Feb 25, 2016 at 8:12 PM, Gianluca Borello <gianl...@sysdig.com>
>> wrote:
>>
>>> Thank you for your reply.
>>>
>>> To answer your points:
>>>
>>> - I fully agree on the write volume, in fact my isolated tests confirm
>>> your estimation
>>>
>>> - About the read, I agree as well, but the volume of data is still much
>>> higher
>>>
>>> - I am writing to one single keyspace with RF 3, there's just one
>>> keyspace
>>>
>>> - I am not using any indexes, the column families are very simple
>>>
>>> - I am aware of the double count, in fact, I measured the traffic on
>>> port 9042 at the client side (so just counted once) and I divided by two
>>> the traffic on port 7000 as measured on each node (35 GB -> 17.5 GB). All
>>> the measurements have been done with iftop with proper bpf filters on the
>>> port and the total traffic matches what I see in cloudwatch (divided by two)
>>>
>>> So unfortunately I still don't have any ideas about what's going on and
>>> why I'm seeing 17 GB of internode traffic instead of ~ 5-6.
>>>
>>> On Thursday, February 25, 2016, daemeon reiydelle <daeme...@gmail.com>
>>> wrote:
>>>
>>>> If read & write at quorum then you write 3 copies of the data then
>>>> return to the caller; when reading you read one copy (assume it is not on
>>>> the coordinator), and 1 digest (because read at quorum is 2, not 3).
>>>>
>>>> When you insert, how many keyspaces get written to? (Are you using e.g.
>>>> inverted indices?) That is my guess, that your db has about 1.8 bytes
>>>> written for every byte inserted.
>>>>
>>>> ​Every byte you write is counted also as a read (system a sends 1gb to
>>>> system b, so system b receives 1gb). You would not be charged if intra AZ,
>>>> but inter AZ and inter DC will get that double count.
>>>>
>>>> So, my guess is reverse indexes, and you forgot to include receive and
>>>> transmit.​
>>>> ​
>>>>
>>>>
>>>> *...*
>>>>
>>>>
>>>>
>>>> *Daemeon C.M. Reiydelle*
>>>> *USA (+1) 415.501.0198*
>>>> *London (+44) (0) 20 8144 9872*
>>&

Re: Unexpected high internode network activity

2016-02-25 Thread daemeon reiydelle
Intriguing. It's enough data to look like full data is coming from the
replicas instead of digests when the read of the copy occurs. Are you
doing backup/DR? Are directories copied regularly over the network, or
something along those lines?


*...*



*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Thu, Feb 25, 2016 at 8:12 PM, Gianluca Borello <gianl...@sysdig.com>
wrote:

> Thank you for your reply.
>
> To answer your points:
>
> - I fully agree on the write volume, in fact my isolated tests confirm
> your estimation
>
> - About the read, I agree as well, but the volume of data is still much
> higher
>
> - I am writing to one single keyspace with RF 3, there's just one keyspace
>
> - I am not using any indexes, the column families are very simple
>
> - I am aware of the double count, in fact, I measured the traffic on port
> 9042 at the client side (so just counted once) and I divided by two the
> traffic on port 7000 as measured on each node (35 GB -> 17.5 GB). All the
> measurements have been done with iftop with proper bpf filters on the
> port and the total traffic matches what I see in cloudwatch (divided by two)
>
> So unfortunately I still don't have any ideas about what's going on and
> why I'm seeing 17 GB of internode traffic instead of ~ 5-6.
>
> On Thursday, February 25, 2016, daemeon reiydelle <daeme...@gmail.com>
> wrote:
>
>> If read & write at quorum then you write 3 copies of the data then return
>> to the caller; when reading you read one copy (assume it is not on the
>> coordinator), and 1 digest (because read at quorum is 2, not 3).
>>
>> When you insert, how many keyspaces get written to? (Are you using e.g.
>> inverted indices?) That is my guess, that your db has about 1.8 bytes
>> written for every byte inserted.
>>
>> ​Every byte you write is counted also as a read (system a sends 1gb to
>> system b, so system b receives 1gb). You would not be charged if intra AZ,
>> but inter AZ and inter DC will get that double count.
>>
>> So, my guess is reverse indexes, and you forgot to include receive and
>> transmit.​
>> ​
>>
>>
>> *...*
>>
>>
>>
>> *Daemeon C.M. Reiydelle*
>> *USA (+1) 415.501.0198*
>> *London (+44) (0) 20 8144 9872*
>>
>> On Thu, Feb 25, 2016 at 6:51 PM, Gianluca Borello <gianl...@sysdig.com>
>> wrote:
>>
>>> Hello,
>>>
>>> We have a Cassandra 2.1.9 cluster on EC2 for one of our live
>>> applications. There's a total of 21 nodes across 3 AWS availability zones,
>>> c3.2xlarge instances.
>>>
>>> The configuration is pretty standard, we use the default settings that
>>> come with the datastax AMI and the driver in our application is configured
>>> to use lz4 compression. The keyspace where all the activity happens has RF
>>> 3 and we read and write at quorum to get strong consistency.
>>>
>>> While analyzing our monthly bill, we noticed that the amount of network
>>> traffic related to Cassandra was significantly higher than expected. After
>>> breaking it down by port, it seems like over any given time, the internode
>>> network activity is 6-7 times higher than the traffic on port 9042, whereas
>>> we would expect something around 2-3 times, given the replication factor
>>> and the consistency level of our queries.
>>>
>>> For example, this is the network traffic broken down by port and
>>> direction over a few minutes, measured as sum of each node:
>>>
>>> Port 9042 from client to cluster (write queries): 1 GB
>>> Port 9042 from cluster to client (read queries): 1.5 GB
>>> Port 7000: 35 GB, which must be divided by two because the traffic is
>>> always directed to another instance of the cluster, so that makes it 17.5
>>> GB generated traffic
>>>
>>> The traffic on port 9042 completely matches our expectations, we do
>>> about 100k write operations writing 10KB binary blobs for each query, and a
>>> bit more reads on the same data.
>>>
>>> According to our calculations, in the worst case, when the coordinator
>>> of the query is not a replica for the data, this should generate about (1 +
>>> 1.5) * 3 = 7.5 GB, and instead we see 17 GB, which is quite a lot more.
>>>
>>> Also, hinted handoffs are disabled and nodes are healthy over the period
>>> of observation, and I get the same numbers across pretty much every time
>>> window, even including an entire 24 hours period.
>>>
>>> I tried to replicate this problem in a test environment so I connected a
>>> client to a test cluster done in a bunch of Docker containers (same
>>> parameters, essentially the only difference is the
>>> GossipingPropertyFileSnitch instead of the EC2 one) and I always get what I
>>> expect, the amount of traffic on port 7000 is between 2 and 3 times the
>>> amount of traffic on port 9042 and the queries are pretty much the same
>>> ones.
>>>
>>> Before doing more analysis, I was wondering if someone has an
>>> explanation on this problem, since perhaps we are missing something obvious
>>> here?
>>>
>>> Thanks
>>>
>>>
>>>
>>


Re: Unexpected high internode network activity

2016-02-25 Thread daemeon reiydelle
If read & write at quorum then you write 3 copies of the data then return
to the caller; when reading you read one copy (assume it is not on the
coordinator), and 1 digest (because read at quorum is 2, not 3).

When you insert, how many keyspaces get written to? (Are you using e.g.
inverted indices?) That is my guess, that your db has about 1.8 bytes
written for every byte inserted.

Every byte you write is counted also as a read (system A sends 1 GB to
system B, so system B receives 1 GB). You would not be charged if intra-AZ,
but inter-AZ and inter-DC will get that double count.

So, my guess is reverse indexes, and you forgot to include receive and
transmit.


*...*



*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Thu, Feb 25, 2016 at 6:51 PM, Gianluca Borello 
wrote:

> Hello,
>
> We have a Cassandra 2.1.9 cluster on EC2 for one of our live applications.
> There's a total of 21 nodes across 3 AWS availability zones, c3.2xlarge
> instances.
>
> The configuration is pretty standard, we use the default settings that
> come with the datastax AMI and the driver in our application is configured
> to use lz4 compression. The keyspace where all the activity happens has RF
> 3 and we read and write at quorum to get strong consistency.
>
> While analyzing our monthly bill, we noticed that the amount of network
> traffic related to Cassandra was significantly higher than expected. After
> breaking it down by port, it seems like over any given time, the internode
> network activity is 6-7 times higher than the traffic on port 9042, whereas
> we would expect something around 2-3 times, given the replication factor
> and the consistency level of our queries.
>
> For example, this is the network traffic broken down by port and direction
> over a few minutes, measured as sum of each node:
>
> Port 9042 from client to cluster (write queries): 1 GB
> Port 9042 from cluster to client (read queries): 1.5 GB
> Port 7000: 35 GB, which must be divided by two because the traffic is
> always directed to another instance of the cluster, so that makes it 17.5
> GB generated traffic
>
> The traffic on port 9042 completely matches our expectations, we do about
> 100k write operations writing 10KB binary blobs for each query, and a bit
> more reads on the same data.
>
> According to our calculations, in the worst case, when the coordinator of
> the query is not a replica for the data, this should generate about (1 +
> 1.5) * 3 = 7.5 GB, and instead we see 17 GB, which is quite a lot more.
>
> Also, hinted handoffs are disabled and nodes are healthy over the period
> of observation, and I get the same numbers across pretty much every time
> window, even including an entire 24 hours period.
>
> I tried to replicate this problem in a test environment so I connected a
> client to a test cluster done in a bunch of Docker containers (same
> parameters, essentially the only difference is the
> GossipingPropertyFileSnitch instead of the EC2 one) and I always get what I
> expect, the amount of traffic on port 7000 is between 2 and 3 times the
> amount of traffic on port 9042 and the queries are pretty much the same
> ones.
>
> Before doing more analysis, I was wondering if someone has an explanation
> on this problem, since perhaps we are missing something obvious here?
>
> Thanks
>
>
>


Re: Checking replication status

2016-02-25 Thread daemeon reiydelle
Hmm. What are your processes when a node comes back after "a long time
offline"? Long enough to take the node offline and do a repair? Run the risk
of serving stale data? Parallel repairs?

So, what sort of time frames count as "a long time"?
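
On the mechanics, a few things worth checking on the node that was away, plus
the blunt instrument for forcing convergence after an outage (the keyspace name
is a placeholder):

nodetool netstats                  # any streams still active to/from this node
nodetool tpstats | grep -i hinted  # hinted handoff still being replayed
nodetool repair -pr my_keyspace    # run on every node when you need a guaranteed sync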


*...*



*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Thu, Feb 25, 2016 at 11:36 AM, Jimmy Lin  wrote:

> hi all,
>
> what are the better ways to check replication overall status of cassandra 
> cluster?
>
>  within a single DC, unless a node is down for long time, most of the time i 
> feel it is pretty much non-issue and things are replicated pretty fast. But 
> when a node come back from a long offline, is there a way to check that the 
> node has finished its data sync with other nodes  ?
>
>  Now across DC, we have frequent VPN outage (sometime short sometims long) 
> between DCs, i also like to know if there is a way to find how the 
> replication progress between DC catching up under this condtion?
>
>  Also, if i understand correctly, the only gaurantee way to make sure data 
> are synced is to run a complete repair job,
> is that correct? I am trying to see if there is a way to "force a quick 
> replication sync" between DCs after vpn outage.
> Or maybe this is unnecessary, as Cassandra will catch up as fast as it can, 
> there is nothing else we/(system admin) can do to make it faster or better?
>
>
>
> Sent from my iPhone
>


Re: Nodes go down periodically

2016-02-23 Thread daemeon reiydelle
If you can, do a few short runs (maybe 10m records, deleting the default
schema between executions) of cassandra-stress against your production
cluster (replication=3, force quorum to 3). Look for a latency max in the 10s
of SECONDS. If your devops team is running a monitoring tool that looks at
the network, look for timeouts/retries/errors/lost packets, etc. during the
run. Worst case, run netstat against the relevant NIC every 10 seconds or so
on the node driving cassandra-stress and look for jumps in those counts. If
monitoring is enabled, look at the monitor's results for ALL of your nodes;
at least one of them is having some issues.


*...*



*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Tue, Feb 23, 2016 at 8:43 AM, Jack Krupansky 
wrote:

> The reality of modern distributed systems is that connectivity between
> nodes is never guaranteed and distributed software must be able to cope
> with occasional absence of connectivity. GC and network connectivity are
> the two issues that a lot of us are most familiar with. There may be others
> - but most technical problems on a node would be clearly logged on that
> node. If you see a lapse of connectivity no more than once or twice a day,
> consider yourselves lucky.
>
> Is it only one node at a time that goes down, and at widely dispersed
> times?
>
> How many nodes?
>
> -- Jack Krupansky
>
> On Tue, Feb 23, 2016 at 11:01 AM, Joel Samuelsson <
> samuelsson.j...@gmail.com> wrote:
>
>> Hi,
>>
>> Version is 2.0.17.
>> Yes, these are VMs in the cloud though I'm fairly certain they are on a
>> LAN rather than WAN. They are both in the same data centre physically. The
>> phi_convict_threshold is set to default. I'd rather find the root cause of
>> the problem than just hiding it by not convicting a node if it isn't
>> responding though. If pings are <2 ms without a single ping missed in
>> several days, I highly doubt that network is the reason for the downtime.
>>
>> Best regards,
>> Joel
>>
>> 2016-02-23 16:39 GMT+01:00 :
>>
>>> You didn’t mention version, but I saw this kind of thing very often in
>>> the 1.1 line. Often this is connected to network flakiness. Are these VMs?
>>> In the cloud? Connected over a WAN? You mention that ping seems fine. Take
>>> a look at the phi_convict_threshold in c assandra.yaml. You may need to
>>> increase it to reduce the UP/DOWN flapping behavior.
>>>
>>>
>>>
>>>
>>>
>>> Sean Durity
>>>
>>>
>>>
>>> *From:* Joel Samuelsson [mailto:samuelsson.j...@gmail.com]
>>> *Sent:* Tuesday, February 23, 2016 9:41 AM
>>> *To:* user@cassandra.apache.org
>>> *Subject:* Re: Nodes go down periodically
>>>
>>>
>>>
>>> Hi,
>>>
>>>
>>>
>>> Thanks for your reply.
>>>
>>>
>>>
>>> I have debug logging on and see no GC pauses that are that long. GC
>>> pauses are all well below 1s and 99 times out of 100 below 100ms.
>>>
>>> Do I need to enable GC log options to see the pauses?
>>>
>>> I see plenty of these lines:
>>> DEBUG [ScheduledTasks:1] 2016-02-22 10:43:02,891 GCInspector.java (line
>>> 118) GC for ParNew: 24 ms for 1 collections
>>>
>>> as well as a few CMS GC log lines.
>>>
>>>
>>>
>>> Best regards,
>>>
>>> Joel
>>>
>>>
>>>
>>> 2016-02-23 15:14 GMT+01:00 Hannu Kröger :
>>>
>>> Hi,
>>>
>>>
>>>
>>> Those are probably GC pauses. Memory tuning is probably needed. Check
>>> the parameters that you already have customised if they make sense.
>>>
>>>
>>>
>>> http://blog.mikiobraun.de/2010/08/cassandra-gc-tuning.html
>>>
>>>
>>>
>>> Hannu
>>>
>>>
>>>
>>>
>>>
>>> On 23 Feb 2016, at 16:08, Joel Samuelsson 
>>> wrote:
>>>
>>>
>>>
>>> Our nodes go down periodically, around 1-2 times each day. Downtime is
>>> from <1 second to 30 or so seconds.
>>>
>>>
>>>
>>> INFO [GossipTasks:1] 2016-02-22 10:05:14,896 Gossiper.java (line 992)
>>> InetAddress /109.74.13.67 is now DOWN
>>>
>>>  INFO [RequestResponseStage:8844] 2016-02-22 10:05:38,331 Gossiper.java
>>> (line 978) InetAddress /109.74.13.67 is now UP
>>>
>>>
>>>
>>> I find nothing odd in the logs around the same time. I logged a ping
>>> with timestamp and checked during the same time and saw nothing weird (ping
>>> is less than 2ms at all times).
>>>
>>>
>>>
>>> Does anyone have any suggestions as to why this might happen?
>>>
>>>
>>>
>>> Best regards,
>>> Joel
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>> The information in this Internet Email is confidential and may be
>>> legally privileged. It is intended solely for the addressee. Access to this
>>> Email by anyone else is unauthorized. If you are not the intended
>>> recipient, any disclosure, copying, distribution or any action taken or
>>> omitted to be taken in reliance on it, is prohibited and may be unlawful.
>>> When addressed to our clients any opinions or advice contained in this
>>> Email are subject to the terms and conditions expressed in any applicable
>>> governing The Home Depot terms of 

RE: Restart Cassandra automatically

2016-02-23 Thread daemeon reiydelle
Cassandra nodes do not go down "for no reason". They are not stateless. I
would like to thank you for this marvelous example of a wonderful
antipattern. Absolutely fantastic.

Thank you! I am not being a satirical smartass. I am sometimes challenged
by clients in my presentations about SRE best practices around C*, Hadoop,
and ELK on the grounds that "no one would ever do this in production". Now I
have objective proof!

Daemeon

sent from my mobile
Daemeon C.M. Reiydelle
USA 415.501.0198
London +44.0.20.8144.9872
On Feb 23, 2016 7:53 AM,  wrote:

> Yes, I can see the potential problem in theory. However, we never do your
> #2. Generally, we don’t have unused spare hardware. We just fix the host
> that is down and run repairs. (Side note: while I have seen nodes fight it
> out over who owns a particular token in earlier versions, it seems that
> 1.2+ doesn’t allow that to happen as easily. The second node will just not
> come up.)
>
>
>
> For most of our use cases, I would agree with your Coli Conjecture.
>
>
>
>
>
> Sean Durity
>
>
>
> *From:* Robert Coli [mailto:rc...@eventbrite.com]
> *Sent:* Tuesday, February 09, 2016 4:41 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Restart Cassandra automatically
>
>
>
> On Tue, Feb 9, 2016 at 6:20 AM,  wrote:
>
> Call me naïve, but we do use an in-house built program for keeping nodes
> started (based on a flag-check). The program is something that was written
> for all kinds of daemon processes here, not Cassandra specifically. The
> basic idea is that is runs a status check. If that fails, and the flag is
> set, start Cassandra. In my opinion, it has helped more than hurt us –
> especially with the very fragile 1.1 releases that were prone to heap
> problems.
>
>
>
> Ok, you're naïve.. ;P
>
>
>
> But seriously, think of this scenario :
>
>
>
> 1) Node A, responsible for range A-M, goes down due to hardware failure of
> a disk in a RAID
>
> 2) Node B is put into service and is made responsible for A-M
>
> 3) Months pass
>
> 4) Node A comes back up, announces that it is responsible for A-M, and the
> cluster agrees
>
>
>
> Consistency is now permanently broken for any involved rows. Why doesn't
> it (usually) matter?
>
>
>
> It's not so much that you are naïve but that you are providing still more
> support for the Coli Conjecture : "If you are using a distributed database
> you probably do not care about consistency, even if you think you do." You
> have repeatedly chosen Availability over Consistency and it has never had a
> negative impact on your actual application.
>
>
>
> =Rob
>
>
>
> --
>
> The information in this Internet Email is confidential and may be legally
> privileged. It is intended solely for the addressee. Access to this Email
> by anyone else is unauthorized. If you are not the intended recipient, any
> disclosure, copying, distribution or any action taken or omitted to be
> taken in reliance on it, is prohibited and may be unlawful. When addressed
> to our clients any opinions or advice contained in this Email are subject
> to the terms and conditions expressed in any applicable governing The Home
> Depot terms of business or client engagement letter. The Home Depot
> disclaims all responsibility and liability for the accuracy and content of
> this attachment and for any damages or losses arising from any
> inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other
> items of a destructive nature, which may be contained in this attachment
> and shall not be liable for direct, indirect, consequential or special
> damages in connection with this e-mail message or its attachment.
>


Re: Live upgrade 2.0 to 2.1 temporarily increases GC time causing timeouts and unavailability

2016-02-19 Thread daemeon reiydelle
FYI, my observations were with native, not thrift.


*...*



*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Fri, Feb 19, 2016 at 10:12 AM, Sotirios Delimanolis  wrote:

> Does your cluster contain 24+ nodes or fewer?
>
> We did the same upgrade on a smaller cluster of 5 nodes and we didn't see
> this behavior. On the 24 node cluster, the timeouts only took effect once
> ~5-6-7+ nodes had been upgraded.
>
> We're doing some more upgrades next week, trying different deployment
> plans. I'll report back with the results.
>
> Thanks for the reply (we absolutely want to move to CQL)
>
>
> On Friday, February 19, 2016 1:10 AM, Alain RODRIGUEZ 
> wrote:
>
>
> I performed this exact update a few days ago, excepted clients were using
> native protocol and it wen smoothly. So I think this might be thrift
> related. No idea what is producing this though, just wanted to give the
> info fwiw.
>
> As a side note, unrelated to the issue, performances using native are a
> lot better than thrift starting in C* 2.1. Drivers using native are also
> more modern allowing you to do very interesting stuff. Updating to native
> now that you are using 2.1 is something you might want to do soon enough
> :-).
>
> C*heers,
> -
> Alain Rodriguez
> France
>
> The Last Pickle
> http://www.thelastpickle.com
>
> 2016-02-19 3:07 GMT+01:00 Sotirios Delimanolis :
>
> We have a Cassandra cluster with 24 nodes. These nodes were running
> 2.0.16.
>
> While the nodes are in the ring and handling queries, we perform the
> upgrade to 2.1.12 as follows (more or less) one node at a time:
>
>
>1. Stop the Cassandra process
>2. Deploy jars, scripts, binaries, etc.
>3. Start the Cassandra process
>
>
> A few nodes into the upgrade, we start noticing that the majority of
> queries (mostly through Thrift) time out or report unavailable. Looking at
> system information, Cassandra GC time goes through the roof, which is what
> we assume causes the time outs.
>
> Once all nodes are upgraded, the cluster stabilizes and no more (barely
> any) time outs occur.
>
> What could explain this? Does it have anything to do with how a 2.0
> communicates with a 2.1?
>
> Our Cassandra consumers haven't changed.
>
>
>
>
>
>
>
>
>


Re: Live upgrade 2.0 to 2.1 temporarily increases GC time causing timeouts and unavailability

2016-02-19 Thread daemeon reiydelle
This may be unrelated, but I found highly variable latency (latency max) on
the 2.1 code tree when loading new data (and reading). Others found that G1
vs. CMS does not make a difference, and there is some evidence that 8/12/16 GB
of memory makes no difference either. These were latencies in the 10-30 SECOND
range, and they did cause timeouts. You may not be seeing a 2.0 vs. 2.1 issue,
but rather a 2.1 issue proper. While others did not find this associated with
stop-the-world GC, I saw some evidence of the same (using cassandra-stress,
and I recently reproduced the issue with YCSB!)


*...*



*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Fri, Feb 19, 2016 at 1:10 AM, Alain RODRIGUEZ  wrote:

> I performed this exact update a few days ago, excepted clients were using
> native protocol and it wen smoothly. So I think this might be thrift
> related. No idea what is producing this though, just wanted to give the
> info fwiw.
>
> As a side note, unrelated to the issue, performances using native are a
> lot better than thrift starting in C* 2.1. Drivers using native are also
> more modern allowing you to do very interesting stuff. Updating to native
> now that you are using 2.1 is something you might want to do soon enough
> :-).
>
> C*heers,
> -
> Alain Rodriguez
> France
>
> The Last Pickle
> http://www.thelastpickle.com
>
> 2016-02-19 3:07 GMT+01:00 Sotirios Delimanolis :
>
>> We have a Cassandra cluster with 24 nodes. These nodes were running
>> 2.0.16.
>>
>> While the nodes are in the ring and handling queries, we perform the
>> upgrade to 2.1.12 as follows (more or less) one node at a time:
>>
>>
>>1. Stop the Cassandra process
>>2. Deploy jars, scripts, binaries, etc.
>>3. Start the Cassandra process
>>
>>
>> A few nodes into the upgrade, we start noticing that the majority of
>> queries (mostly through Thrift) time out or report unavailable. Looking at
>> system information, Cassandra GC time goes through the roof, which is what
>> we assume causes the time outs.
>>
>> Once all nodes are upgraded, the cluster stabilizes and no more (barely
>> any) time outs occur.
>>
>> What could explain this? Does it have anything to do with how a 2.0
>> communicates with a 2.1?
>>
>> Our Cassandra consumers haven't changed.
>>
>>
>>
>>
>>
>>
>


Re: Compatibility, performance & portability of Cassandra data types (MAP, UDT & JSON) in DSE Search & Analytics

2016-02-18 Thread daemeon reiydelle
Given you only have 16 columns vs. over 200 ... I would expect a
substantial improvement in writes, but not 5x.
Ditto reads. I would be interested to understand where that 5x comes from.


*...*



*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Thu, Feb 18, 2016 at 8:20 PM, Chandra Sekar KR <
chandraseka...@hotmail.com> wrote:

> Hi,
>
>
> I'm looking for help in arriving at pros & cons of using MAP, UDT & JSON
> (Text) data types in Cassandra & its ease of use/impact across other DSE
> products - Spark & Solr. We are migrating an OLTP database from RDBMS to
> Cassandra which has 200+ columns and with an average daily volume of 25
> million records/day. The access pattern is quite simple and in OLTP the
> access is always based on primary key. For OLAP, there are other access
> patterns with a combination of columns where we are planning to use Spark &
> Solr for search & analytical capabilities (in a separate DC).
>
>
> The average size of each record is ~2KB and the application workload is of
> type INSERT only (no updates/deletes). We conducted performance tests on
> two types of data models
>
> 1) A table with 200+ columns similar to RDBMS
>
> 2) A table with 15 columns where only critical business fields are
> maintained as key/value pairs and the remaining are stored in a single
> column of type TEXT as JSON object.
>
>
> In the results, we noticed significant advantage in the JSON model where
> the performance was 5X times better than columnar data model.
> Alternatively, we are in the process of evaluating performance for other
> data types - MAP & UDT instead of using TEXT for storing JSON object.
> Sample data model structure for columnar, json, map & udt types are given
> below:
>
>
>
>
> I would like to know the performance, transformation, compatibility &
> portability impacts & ease of use of each of these data types from Search &
> Analytics perspective (Spark & Solr). I'm aware that we will have to use
> field transformers in Solr to use index on JSON fields, not sure about MAP
> & UDT. Any help on comparison of these data types in Spark & Solr is highly
> appreciated.
>
>
> Regards, KR
>


Re: High Bloom filter false ratio

2016-02-18 Thread daemeon reiydelle
The bloom filter hashes the keys into a relatively small number of buckets. I have
been surprised by how many cases I see with large cardinality where a few
values populate a given bloom filter bucket, resulting in high false positives and
a surprising impact on latencies!

Are you seeing 2:1 ranges between mean and worst-case latencies (allowing
for GC times)?
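For anyone following along, a minimal sketch of how to confirm and tune this (the keyspace, table, and column names are hypothetical, and 0.005 is only an example value):

    # Current false positive ratio and bloom filter footprint for the table
    nodetool cfstats ks.events | grep -i bloom

    # Trace a read for a partition known not to exist, to see whether the
    # bloom filter rejects it or the read goes to disk anyway
    cqlsh <<'EOF'
    TRACING ON;
    SELECT * FROM ks.events WHERE id1 = 0 AND id2 = 0;
    EOF

    # Tighten the target false positive chance; only SSTables written after
    # the change use the new filter, so rewrite existing ones if needed
    cqlsh -e "ALTER TABLE ks.events WITH bloom_filter_fp_chance = 0.005;"
    nodetool upgradesstables -a ks events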

Daemeon Reiydelle
On Feb 18, 2016 8:57 AM, "Tyler Hobbs" <ty...@datastax.com> wrote:

> You can try slightly lowering the bloom_filter_fp_chance on your table.
>
> Otherwise, it's possible that you're repeatedly querying one or two
> partitions that always trigger a bloom filter false positive.  You could
> try manually tracing a few queries on this table (for non-existent
> partitions) to see if the bloom filter rejects them.
>
> Depending on your Cassandra version, your false positive ratio could be
> inaccurate: https://issues.apache.org/jira/browse/CASSANDRA-8525
>
> There are also a couple of recent improvements to bloom filters:
> * https://issues.apache.org/jira/browse/CASSANDRA-8413
> * https://issues.apache.org/jira/browse/CASSANDRA-9167
>
>
> On Thu, Feb 18, 2016 at 1:35 AM, Anishek Agarwal <anis...@gmail.com>
> wrote:
>
>> Hello,
>>
>> We have a table with a composite partition key with humongous cardinality;
>> it's a combination of (long,long). On the table we have
>> bloom_filter_fp_chance=0.01.
>>
>> On doing "nodetool cfstats" on the 5 nodes we have in the cluster we are
>> seeing "Bloom filter false ratio:" in the range of 0.7-0.9.
>>
>> I thought over time the bloom filter would adjust to the key space
>> cardinality. We have been running the cluster for a long time now, but have
>> added significant traffic from Jan this year, which would not lead to
>> writes in the db but would lead to high reads to see if there are any values.
>>
>> Are there any settings that can be changed to allow a better ratio?
>>
>> Thanks
>> Anishek
>>
>
>
>
> --
> Tyler Hobbs
> DataStax <http://datastax.com/>
>


Re: Cassandra Collections performance issue

2016-02-09 Thread daemeon reiydelle
I think the key to your problem might be around "we overwrite every value".
You are creating a large number of tombstones, forcing many reads to pull
current results. You would do well to rethink why you are having to
overwrite values all the time under the same key. You would be better off
figuring out how to add values under a key and then age off the old values. I
would say that (at least at scale) you have a classic anti-pattern in play.
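As a minimal sketch of that append-then-age-off pattern (the keyspace, table, column names, and TTL below are purely illustrative), something along these lines avoids overwriting a map in place:

    cqlsh <<'EOF'
    -- Hypothetical table: each new reading is its own row under the partition
    -- key, ordered newest-first, and aged out by TTL instead of overwritten.
    CREATE TABLE IF NOT EXISTS ks.metric_history (
        key      text,
        observed timestamp,
        name     text,
        value    text,
        PRIMARY KEY ((key), observed, name)
    ) WITH CLUSTERING ORDER BY (observed DESC, name ASC);

    -- New values are plain inserts; the TTL (1 day here) ages off old entries.
    INSERT INTO ks.metric_history (key, observed, name, value)
    VALUES ('sensor-42', dateof(now()), 'temp', '21.4')
    USING TTL 86400;

    -- Reads pick up the most recent values without re-reading overwritten cells.
    SELECT name, value FROM ks.metric_history WHERE key = 'sensor-42' LIMIT 3;
    EOF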


*...*



*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Mon, Feb 8, 2016 at 5:23 PM, Robert Coli  wrote:

> On Mon, Feb 8, 2016 at 2:10 PM, Agrawal, Pratik 
> wrote:
>
>> Recently we added one of the table fields as a Map in *Cassandra
>> 2.1.11*. Currently we read every field from the Map and overwrite the map
>> values. The Map is of size 3. We saw that writes are 30-40% slower while reads
>> are 70-80% slower. Please find below some metrics that can help.
>>
>> My question is, are there any known issues in Cassandra Map performance?
>> As I understand it, each CQL3 Map entry maps to a column in
>> Cassandra; with that assumption we are just creating 3 columns, right? Any
>> insight on this issue would be helpful.
>>
>
> I have previously heard reports along similar lines, but in the other
> direction.
>
> eg - "I moved from a collection to a TEXT column with JSON in it, and my
> reads and writes both became much faster!"
>
> I'm not sure if the issue has been raised as an Apache Cassandra Jira, iow
> if it is a known and expected limitation as opposed to just a performance
> issue.
>
> If I were you, I would consider filing a repro case as a Jira ticket, and
> responding to this thread with its URL. :D
>
> =Rob
>
>


Re: Need Feedback about cassandra-stress tests

2016-01-23 Thread daemeon reiydelle
Might I suggest you START by using the default schema provided by
cassandra-stress. Using someone else's schema is great AFTER you have
used a standard and generally well understood baseline.

From that you can decide whether a 4-node x 2-DC cluster is right for you.
FYI, given your 6-way replication, the numbers above do not seem unreasonable. You
can see where YOUR test hits a peak and falls back in throughput. But what
are your USER requirements?

With the default schema (you can try distinct keyspaces as well as the default
keyspace1 shared keyspace), look very hard at your latencies. If you can
withstand occasional 10-second (10,000 ms) worst-case latencies during
compaction, great. If your 90/99/99.5 latencies are important, you will be
able to see where your tiny cluster plays out. Only THEN can you try some
other, no doubt fun, schema.
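For reference, a minimal baseline run with the stock schema might look like the sketch below (the node address, counts, and log file names are only placeholders):

    # Write a baseline data set into the default keyspace1/standard1 schema,
    # then run a mixed read/write pass against the same data.
    cassandra-stress write n=1000000 -node 10.41.55.21 -log file=write.log
    cassandra-stress mixed ratio\(write=1,read=3\) n=1000000 -node 10.41.55.21 -log file=mixed.log

    # The per-threadCount summary includes median/.95/.99/.999/max latency;
    # that is where occasional multi-second worst cases during compaction show up.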



*...*






*“Life should not be a journey to the grave with the intention of arriving
safely in a pretty and well preserved body, but rather to skid in broadside
in a cloud of smoke, thoroughly used up, totally worn out, and loudly
proclaiming “Wow! What a Ride!” - Hunter Thompson*
*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Sat, Jan 23, 2016 at 10:18 AM, Bhuvan Rawal  wrote:

> Hi All,
>
> I have been trying to set up a cluster for POC and ran cassandra-stress
> tests today on an 8 Node cluster with 2 DC having 4 nodes each. The disk
> used was SSD with I/O of greater than 500MB per sec, CPU- 2.3 Ghz ,4 Core
> Xeon and 8Gigs of RAM in each node.
>
> First of all I would like to thank *Alain Rodriguez* and *Sabastian
> Estevez* for helping me to sort out an issue which was troubling me.
>
> I used the schema and queries mentioned in the gist at this link -
> https://gist.github.com/tjake/fb166a659e8fe4c8d4a3#file-query2-txt-L14. I
> used NetworkTopologyStrategy on the keyspace with an RF of 3 for each DC, meaning
> each key would be replicated to 6 nodes.
>
> I'm not sure if I have managed to get the config right because the read
> operations seem to be slow. I would need your comments on them.
>
> Command - $ cassandra-stress user profile=./blogpost.yaml ops\(insert=1\) 
> -node 10.41.55.21
> Insert Operation
> type id total ops op/s pk/s row/s
> threadCount 4 insert 90259 2994 2994 4088
> threadCount 4 total 90259 2994 2994 4088
> threadCount 8 insert 161754 5362 5362 7335
> threadCount 8 total 161754 5362 5362 7335
> threadCount 16 insert 261958 7796 7796 10653
> threadCount 16 total 261958 7796 7796 10653
> threadCount 24 insert 769757 8132 8132 5
> threadCount 24 total 769757 8132 8132 5
> threadCount 36 insert 571600 9762 9762 13362
> threadCount 36 total 571600 9762 9762 13362
> threadCount 54 insert 884351 10730 10730 14672
> threadCount 54 total 884351 10730 10730 14672
> threadCount 81 insert 623874 10919 10919 14931
> threadCount 81 total 623874 10919 10919 14931
> threadCount 121 insert 1234867 11736 11736 16053
> threadCount 121 total 1234867 11736 11736 16053
> threadCount 181 insert 2377008 10310 10310 14110
> threadCount 181 total 2377008 10310 10310 14110
>
> Command-$cassandra-stress user profile=./blogpost.yaml 
> ops\(singlepost=2,timeline=1,insert=1\) -node 10.41.55.21
> Mixed – Read, Write
> Query – singlepost=2,timeline=1,insert=1
> type id total ops op/s pk/s row/s
> threadCount 4 insert 27179 76 76 103
> threadCount 4 singlepost 54118 151 151 151
> threadCount 4 timeline 27343 76 76 576
> threadCount 4 total 108640 303 303 831
> threadCount 8 insert 8642 149 149 203
> threadCount 8 singlepost 17838 307 307 307
> threadCount 8 timeline 8529 147 147 1107
> threadCount 8 total 35009 602 602 1617
> threadCount 16 insert 18784 176 176 242
> threadCount 16 singlepost 37960 356 356 356
> threadCount 16 timeline 19058 179 179 1359
> threadCount 16 total 75802 710 710 1957
> threadCount 24 insert 21564 151 151 208
> threadCount 24 singlepost 44545 313 313 313
> threadCount 24 timeline 21553 151 151 1150
> threadCount 24 total 87662 615 615 1670
> threadCount 36 insert 38054 185 185 252
> threadCount 36 singlepost 76495 372 372 372
> threadCount 36 timeline 38783 189 189 1443
> threadCount 36 total 153332 746 746 2068
> threadCount 54 insert 92639 167 167 229
> threadCount 54 singlepost 187085 338 338 338
> threadCount 54 timeline 92679 167 167 1292
> threadCount 54 total 372403 673 673 1859
>
> Command - $ cassandra-stress user profile=./blogpost.yaml ops\(singlepost=1\) 
> -node 10.41.55.21
>
> Read only load, single value, include blog text (which is 5000 chars) – 
> singlepost query
> Query- select * from blogposts where domain = ? LIMIT 1
> type id total ops op/s pk/s row/s
> threadCount 4 singlepost 11334 368 368 368
> threadCount 4 total 11334 368 368 368
> threadCount 8 singlepost 18744 547 547 547
> threadCount 8 total 18744 547 547 547
> threadCount 16 singlepost 26038 607 607 607
> threadCount 16 total 26038 607 607 607
> threadCount 24 singlepost 35903 654 654 654
> threadCount 24 total 35903 654 654 654
> 

Re: In UJ status for over a week trying to rejoin cluster in Cassandra 3.0.1

2016-01-17 Thread daemeon reiydelle
What do the logs say on the seed node (and on the UJ node)?

Look for timeout messages.

This problem has occurred for me when there was high network utilization
between the seed and the joining node, and also with routing issues.
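A few generic checks along those lines (the log path is the usual package default and is only an assumption here):

    # On both the seed and the joining .33 node, look for streaming/gossip timeouts
    grep -iE 'stream|timeout|OutboundTcpConnection' /var/log/cassandra/system.log | tail -50

    # Check whether any streams toward the joining node are actually progressing
    nodetool netstats

    # Quick look for packet loss / retransmits on the link between the two hosts
    netstat -s | grep -i retrans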



*...*






*“Life should not be a journey to the grave with the intention of arriving
safely in a pretty and well preserved body, but rather to skid in broadside
in a cloud of smoke, thoroughly used up, totally worn out, and loudly
proclaiming “Wow! What a Ride!” - Hunter Thompson*
*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Sun, Jan 17, 2016 at 2:24 PM, Kai Wang  wrote:

> Carlos,
>
> so you essentially replaced the .33 node. Did you follow this:
> https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_replace_node_t.html?
> The link is for 2.x, not sure about 3.x. What if you change the new node to
> .34?
>
>
>
> On Mon, Jan 11, 2016 at 12:57 AM, Carlos A  wrote:
>
>> Hello all,
>>
>> I have a small dev environment with 4 machines. One of them (.33) I had
>> removed from the cluster because I wanted to upgrade its HD to an SSD.
>> I then reinstalled it and tried to rejoin it. It has been in UJ status for a
>> week now with no changes.
>>
>> I have tried nodetool repair etc. but nothing helped.
>>
>> nodetool status output
>>
>> Datacenter: DC1
>> ===
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address   Load   Tokens   OwnsHost ID
>>   Rack
>> UN  192.168.1.30  16.13 MB   256  ?
>> 0e524b1c-b254-45d0-98ee-63b8f34a8531  RAC1
>> UN  192.168.1.31  20.12 MB   256  ?
>> 1f8000f5-026c-42c7-8189-cf19fbede566  RAC1
>> UN  192.168.1.32  17.73 MB   256  ?
>> 7b06f9e9-7c41-4364-ab18-f6976fd359e4  RAC1
>> UJ  192.168.1.33  877.6 KB   256  ?
>> 7a1507b5-198e-4a3a-a9fd-7af9e588fde2  RAC1
>>
>> Note: Non-system keyspaces don't have the same replication settings,
>> effective ownership information is meaningless
>>
>> Any tips on fixing this?
>>
>> Thanks,
>>
>> C.
>>
>
>


Re: electricity outage problem

2016-01-15 Thread daemeon reiydelle
A node needs about a 60-90 second delay before it can start accepting
connections as a seed node. A seed node also needs time to accept a node
starting up and syncing to other nodes (on 10 gigabit the max is only 1 or 2
new nodes; on 1 gigabit it can handle at least 3-4 new nodes connecting).
In a large cluster (500 nodes) I have seen a weird condition where nodetool
status shows overlapping subsets of nodes, and the problem does not go away
even after an hour on a 10 gigabit network.
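As a rough sketch of that staggered start (the host names, the service command, and the exact delays are assumptions to adapt to your environment):

    # Start the seeds first, one at a time, then let the remaining nodes join
    # with a gap between each so gossip can settle.
    SEEDS="cass-seed-1 cass-seed-2"
    OTHERS="cass-3 cass-4 cass-5 cass-6 cass-7 cass-8"

    for h in $SEEDS; do
      ssh "$h" 'sudo service cassandra start'
      sleep 90   # give each seed time to come up and accept connections
    done

    for h in $OTHERS; do
      ssh "$h" 'sudo service cassandra start'
      sleep 60   # let each node finish joining before starting the next
    done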



*...*






*“Life should not be a journey to the grave with the intention of arriving
safely in a pretty and well preserved body, but rather to skid in broadside
in a cloud of smoke, thoroughly used up, totally worn out, and loudly
proclaiming “Wow! What a Ride!” - Hunter Thompson*
*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Fri, Jan 15, 2016 at 9:17 AM, Adil <adil.cha...@gmail.com> wrote:

> Hi,
> we did a full restart of the cluster but nodetool status is still giving
> incoherent info from different nodes: some nodes appear UP from one node but
> appear DOWN from another, and in the log, as mentioned, we still see the
> message "received an invalid gossip generation for peer /x.x.x.x".
> The cassandra version is 2.1.2. We want to execute the purge operation as
> explained here
> https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_gossip_purge.html
> but we cannot find the peers folder. Should we do it via CQL by deleting the
> system.peers content? Should we do it for all nodes?
>
> thanks
>
>
> 2016-01-12 17:42 GMT+01:00 Jack Krupansky <jack.krupan...@gmail.com>:
>
>> Sometimes you may have to clear out the saved Gossip state:
>>
>> https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_gossip_purge.html
>>
>> Note the instruction about bringing up the seed nodes first. Normally
>> seed nodes are only relevant when initially joining a node to a cluster
>> (and then the Gossip state will be persisted locally), but if you clear the
>> persisted Gossip state the seed nodes will again be needed to find the rest
>> of the cluster.
>>
>> I'm not sure whether a power outage is the same as stopping and
>> restarting an instance (AWS) in terms of whether the restarted instance
>> retains its current public IP address.
>>
>>
>>
>> -- Jack Krupansky
>>
>> On Tue, Jan 12, 2016 at 10:02 AM, daemeon reiydelle <daeme...@gmail.com>
>> wrote:
>>
>>> This happens when there is insufficient time for nodes coming up to join
>>> a network. It takes a few seconds for a node to come up, e.g. your seed
>>> node. If you tell a node to join a cluster you can get this scenario
>>> because of high network utilization as well. I wait 90 seconds after the
>>> first (i.e. my first seed) node comes up to start the next one. Any nodes
>>> that are seeds need some 60 seconds, so the additional 30 seconds is a
>>> buffer. Additional nodes each wait 60 seconds before joining (although this
>>> is a parallel tree for large clusters).
>>>
>>>
>>>
>>>
>>>
>>> *...*
>>>
>>>
>>>
>>>
>>>
>>>
>>> *“Life should not be a journey to the grave with the intention of
>>> arriving safely in a pretty and well preserved body, but rather to skid in
>>> broadside in a cloud of smoke, thoroughly used up, totally worn out, and
>>> loudly proclaiming “Wow! What a Ride!” - Hunter Thompson*
>>> *Daemeon C.M. Reiydelle*
>>> *USA (+1) 415.501.0198*
>>> *London (+44) (0) 20 8144 9872*
>>>
>>> On Tue, Jan 12, 2016 at 6:56 AM, Adil <adil.cha...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> we have two DCs with 5 nodes in each cluster. Yesterday there was an
>>>> electricity outage causing all nodes to go down. We restarted the clusters, but when
>>>> we run nodetool status on DC1 it reports that some nodes are DN; the
>>>> strange thing is that running the command from a different node in DC1 doesn't
>>>> report the same nodes as down. We have noticed this message in the log:
>>>> "received an invalid gossip generation for peer". Does anyone know how to
>>>> resolve this problem? Should we purge the gossip state?
>>>>
>>>> thanks
>>>>
>>>> Adil
>>>>
>>>
>>>
>>
>


Re: Encryption in cassandra

2016-01-14 Thread daemeon reiydelle
The keys don't have to be on the box. You do need a login/password for C*.

sent from my mobile
Daemeon C.M. Reiydelle
USA 415.501.0198
London +44.0.20.8144.9872
On Jan 14, 2016 5:16 PM, "oleg yusim"  wrote:

> Greetings,
>
> Guys, can you please help me to understand following:
>
> I'm reading through the way the keystore and truststore are implemented, and
> it is all fine and great, but at the end the Cassandra documentation
> instructs you to extract all the keystore content and leave all certs and
> keys in the clear.
>
> Am I missing something here? Why are we doing it? What is the point of even
> having a keystore then? It doesn't look very secure to me...
>
> Another item - cassandra.yaml has the keystore and truststore passwords in
> clear text... what is the point of having these stores then, if the passwords
> are out?
>
> Thanks,
>
> Oleg
>


Re: electricity outage problem

2016-01-12 Thread daemeon reiydelle
This happens when there is insufficient time for nodes coming up to join a
network. It takes a few seconds for a node to come up, e.g. your seed node.
If you tell a node to join a cluster you can get this scenario because of
high network utilization as well. I wait 90 seconds after the first (i.e.
my first seed) node comes up to start the next one. Any nodes that are
seeds need some 60 seconds, so the additional 30 seconds is a buffer.
Additional nodes each wait 60 seconds before joining (although this is a
parallel tree for large clusters).





*...*






*“Life should not be a journey to the grave with the intention of arriving
safely in a pretty and well preserved body, but rather to skid in broadside
in a cloud of smoke, thoroughly used up, totally worn out, and loudly
proclaiming “Wow! What a Ride!” - Hunter Thompson*
*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Tue, Jan 12, 2016 at 6:56 AM, Adil  wrote:

> Hi,
>
> we have two DCs with 5 nodes in each cluster. Yesterday there was an
> electricity outage causing all nodes to go down. We restarted the clusters, but when
> we run nodetool status on DC1 it reports that some nodes are DN; the
> strange thing is that running the command from a different node in DC1 doesn't
> report the same nodes as down. We have noticed this message in the log:
> "received an invalid gossip generation for peer". Does anyone know how to
> resolve this problem? Should we purge the gossip state?
>
> thanks
>
> Adil
>


Re: Three questions about cassandra

2015-11-27 Thread daemeon reiydelle
There is a window after a node goes down during which changes that node should have
gotten will be kept. If the node is down LONGER than that, it will serve
stale data. If the consistency is greater than two, its data will be
ignored; if the consistency is one, its data could be the first returned; if the
consistency is two, then the application needs to be able to handle such a
situation. Nodetool repair needs to be run in this case to get the data
consistent. Cleanup does more than make things pretty, but it will do that.

The comment about disabling the Thrift listener is related to preventing
the node from serving old data if the timeout I mention above has expired
between the time the node comes online and the time the repair is
completed.

One of the advantages of using e.g. Ansible is that it can be configured to
whack an errant node's Thrift listener BEFORE it starts the node's Cassandra
instance. Agent-based tools like Puppet and Chef can also have this magic
performed. Whether to automatically start Cassandra vs. NOT automatically start the
service sometimes makes for interesting religious wars. And obviously if
the node didn't stop but just lost network connections, there are
advantages to agent-based tools.
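A minimal sketch of that disable-traffic-then-repair idea, run on the recovered node (whether you also need to disable the native protocol depends on your clients):

    # Stop serving client traffic while the node catches up, repair, then re-enable.
    nodetool disablethrift     # stop Thrift clients
    nodetool disablebinary     # stop native-protocol clients too, if any are in use
    nodetool repair -pr
    nodetool enablebinary
    nodetool enablethrift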





*...*






*“Life should not be a journey to the grave with the intention of arriving
safely in a pretty and well preserved body, but rather to skid in broadside
in a cloud of smoke, thoroughly used up, totally worn out, and loudly
proclaiming “Wow! What a Ride!” - Hunter Thompson*
*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Fri, Nov 27, 2015 at 3:51 AM, Hadmut Danisch  wrote:

> Thanks!
>
> Hadmut
>


Re: Repair Hangs while requesting Merkle Trees

2015-11-11 Thread daemeon reiydelle
Have you checked the network statistics on that machine (netstat -tas)
while attempting to repair? If netstat shows ANY issues you have a
problem. Could you put the command in a loop running every 60 seconds for
maybe 15 minutes and post back?

Out of curiosity, how many remote DC nodes are getting successfully
repaired?
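Something along these lines would capture the statistics being asked for (the grep pattern is just a convenience and can be dropped to post the full output):

    # Sample TCP statistics once a minute for ~15 minutes while the repair runs
    for i in $(seq 1 15); do
      date
      netstat -tas | grep -iE 'retrans|failed|reset|timeout'
      sleep 60
    done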



*...*






*“Life should not be a journey to the grave with the intention of arriving
safely in a pretty and well preserved body, but rather to skid in broadside
in a cloud of smoke, thoroughly used up, totally worn out, and loudly
proclaiming “Wow! What a Ride!” - Hunter Thompson*
*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Wed, Nov 11, 2015 at 1:06 PM, Anuj Wadehra 
wrote:

> Hi,
>
> we are using 2.0.14. We have 2 DCs at remote locations with 10GBps
> connectivity.We are able to complete repair (-par -pr) on 5 nodes. On only
> one node in DC2, we are unable to complete repair as it always hangs. Node
> sends Merkle Tree requests, but one or more nodes in DC1 (remote) never
> show that they sent the merkle tree reply to requesting node.
> Repair hangs infinitely.
>
> After increasing request_timeout_in_ms on the affected node, we were able to
> successfully run repair on one of the two occasions.
>
> Any comments, why this is happening on just one node? In
> OutboundTcpConnection.java,  when isTimeOut method always returns false for
> non-droppable verb such as Merkle Tree Request(verb=REPAIR_MESSAGE),why
> increasing request timeout solved problem on one occasion ?
>
>
> Thanks
> Anuj Wadehra
>
>
>
> On Thursday, 12 November 2015 2:35 AM, Anuj Wadehra <
> anujw_2...@yahoo.co.in> wrote:
>
>
> Hi,
>
> We have 2 DCs at remote locations with 10GBps connectivity.We are able to
> complete repair (-par -pr) on 5 nodes. On only one node in DC2, we are
> unable to complete repair as it always hangs. Node sends Merkle Tree
> requests, but one or more nodes in DC1 (remote) never show that they sent
> the merkle tree reply to requesting node.
> Repair hangs infinitely.
>
> After increasing request_timeout_in_ms on the affected node, we were able to
> successfully run repair on one of the two occasions.
>
> Any comments, why this is happening on just one node? In
> OutboundTcpConnection.java,  when isTimeOut method always returns false for
> non-droppable verb such as Merkle Tree Request(verb=REPAIR_MESSAGE),why
> increasing request timeout solved problem on one occasion ?
>
>
> Thanks
> Anuj Wadehra
>
>
>


Re: Can consistency-levels be different for "read" and "write" in Datastax Java-Driver?

2015-10-26 Thread daemeon reiydelle
If one rethinks "consistency" to mean "copies returned" and "copies
written", then one can have different values for the former (in the DataStax
driver) and the latter (within Cassandra). The latter changes eventual
consistency (e.g. two copies must be written); the former can speed up a result at
the (slight) risk of stale data. I have no experience with the former, just
recall it somewhere in the documentation: n-copy eventual consistency is
fine for all of my work.
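For illustration only (the host, keyspace, table, and values are hypothetical): read and write consistency can be set independently, shown here with cqlsh's CONSISTENCY command; the DataStax Java driver allows the same thing per statement.

    # Write at QUORUM, then read the same row back at ONE.
    cqlsh my-cassandra-host <<'EOF'
    CONSISTENCY QUORUM;
    INSERT INTO ks.users (id, name) VALUES (42, 'ajay');
    CONSISTENCY ONE;
    SELECT name FROM ks.users WHERE id = 42;
    EOF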



*...*






*“Life should not be a journey to the grave with the intention of arriving
safely in a pretty and well preserved body, but rather to skid in broadside
in a cloud of smoke, thoroughly used up, totally worn out, and loudly
proclaiming “Wow! What a Ride!” - Hunter Thompson*
*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Mon, Oct 26, 2015 at 11:52 AM, Jonathan Haddad  wrote:

> What's your query?  Do you have IF NOT EXISTS in there?
>
> On Mon, Oct 26, 2015 at 11:17 AM Ajay Garg  wrote:
>
>> Right now, I have set up "LOCAL_QUORUM" as the consistency level in the
>> driver, but it seems that "SERIAL" is being used during writes, and I
>> consistently get this error of type ::
>>
>> *Cassandra timeout during write query at consistency SERIAL (3 replica
>> were required but only 0 acknowledged the write)*
>>
>>
>> Am I missing something?
>>
>>
>>
>> --
>> Regards,
>> Ajay
>>
>


Re: How much disk is needed to compact Leveled compaction?

2015-04-05 Thread daemeon reiydelle
You appear to have multiple java binaries in your path. That needs to be
resolved.
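A couple of generic checks to see which java binaries are actually on the box (the macOS-specific commands apply because the warning in the quoted message comes from a Mac JDK install):

    # Show every java on the PATH and which one resolves first
    which -a java
    java -version

    # On macOS, list the installed JDKs the launcher can choose from
    /usr/libexec/java_home -V
    ls /Library/Java/JavaVirtualMachines/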

sent from my mobile
Daemeon C.M. Reiydelle
USA 415.501.0198
London +44.0.20.8144.9872
On Apr 5, 2015 1:40 AM, Jean Tremblay jean.tremb...@zen-innovations.com
wrote:

  Hi,
 I have a cluster of 5 nodes. We use cassandra 2.1.3.

  The 5 nodes use about 50-57% of the 1T SSD.
  One node managed to compact all its data. During one compaction this node
 used almost 100% of the drive. The other nodes refuse to continue
 compaction claiming that there is not enough disk space.

  From the documentation LeveledCompactionStrategy should be able to
 compact my data, well at least this is what I understand.

  Size-tiered compaction requires at least as much free disk space for
 compaction as the size of the largest column family. Leveled compaction
 needs much less space for compaction, only 10 * sstable_size_in_mb.
 However, even if you’re using leveled compaction, you should leave much
 more free disk space available than this to accommodate streaming, repair,
 and snapshots, which can easily use 10GB or more of disk space.
 Furthermore, disk performance tends to decline after 80 to 90% of the disk
 space is used, so don’t push the boundaries.

  This is the disk usage. Node 4 is the only one that could compact
 everything.
  node0: /dev/disk1 931Gi 534Gi 396Gi 57% /
 node1: /dev/disk1 931Gi 513Gi 417Gi 55% /
 node2: /dev/disk1 931Gi 526Gi 404Gi 57% /
 node3: /dev/disk1 931Gi 507Gi 424Gi 54% /
 node4: /dev/disk1 931Gi 475Gi 456Gi 51% /

  When I try to compact the other ones I get this:

  objc[18698]: Class JavaLaunchHelper is implemented in both /Library/Java/
 JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/bin/java and
 /Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/lib/libinstrument.dylib.
 One of the two will be used. Which one is undefined.
 error: Not enough space for compaction, estimated sstables = 2894,
 expected write size = 485616651726
 -- StackTrace --
 java.lang.RuntimeException: Not enough space for compaction, estimated
 sstables = 2894, expected write size = 485616651726
 at org.apache.cassandra.db.compaction.CompactionTask.
 checkAvailableDiskSpace(CompactionTask.java:293)
 at org.apache.cassandra.db.compaction.CompactionTask.
 runMayThrow(CompactionTask.java:127)
 at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
 at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(
 CompactionTask.java:76)
 at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(
 AbstractCompactionTask.java:59)
 at org.apache.cassandra.db.compaction.CompactionManager$7.runMayThrow(
 CompactionManager.java:512)
 at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)

   I did not set sstable_size_in_mb; I use the 160MB default.

  Is it normal that during compaction it needs so much disk space? What
 would be the best solution to overcome this problem?

  Thanks for your help




Re: COMMERCIAL:Re: Cross-datacenter requests taking a very long time.

2015-04-02 Thread daemeon reiydelle
You might want to check what quorum level is configured? I meant to ask that earlier.



*...*






*“Life should not be a journey to the grave with the intention of arriving
safely in a pretty and well preserved body, but rather to skid in broadside
in a cloud of smoke, thoroughly used up, totally worn out, and loudly
proclaiming “Wow! What a Ride!” - Hunter Thompson*
*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Thu, Apr 2, 2015 at 12:39 PM, Andrew Vant andrew.v...@rackspace.com
wrote:

 On Mar 31, 2015, at 4:59 PM, daemeon reiydelle daeme...@gmail.com wrote:
  What is your replication factor?

 NetworkTopologyStrategy with replfactor: 2 in each DC.

 Someone else asked about the endpoint snitch I'm using; it's set to
 GossipingPropertyFileSnitch.

  Any idea how much data has to be processed under the query?

 It does not matter what query I use, or what size; the problem occurs even
 just selecting a single user from the users table.

  While running the query against both DC's, you can take a look at
 netstats
  to get a really quick-and-dirty idea of network traffic.

 I'll try that. I should add that one of the other teams here has a similar
 setup (3 nodes in 3 DCs) that is working correctly. We're going to go
 through the config files and see if we can figure out what's different.

 --

 Andrew


Re: Best practice: Multiple clusters vs multiple tables in a single cluster?

2015-04-02 Thread daemeon reiydelle
Jack did a superb job of explaining all of your issues, and his last
sentence seems to fit your needs (and my experience) very well. The only
other point I would add is to ascertain whether the use patterns commend
microservices to abstract from data locality, even if the initial
deployment is a noop to a single cluster. This depends on whether you see a
rapid stream of special-purpose business functions. A second question is
about data access ... does Pig support your data access response times?
Many clients find Hadoop ideally suited to a sophisticated ECTL (extract,
cleanup, transformation, and load) model feeding fast, schema-oriented
repositories like e.g. MySQL. It all depends on the use case, growth &
fragmentation expectations for your business model(s), etc.

Good luck.

PS, Jack, thanks for your succinct comment.




On Thu, Apr 2, 2015 at 6:33 AM, Jack Krupansky jack.krupan...@gmail.com
wrote:

 There is an old saying in the software industry: The structure of a system
 follows from the structure of the organization that created it (Conway's
 Law). Seriously, the main, first question for your end is who owns the
 applications in terms of executive management, such that if management
 makes a decision that dramatically affects the app's impact on the cluster,
 is it likely that they will have done so with the concurrence of management
 who owns the other app. Trust me, you do not want to be in the middle when
 two managers are in dispute over whose app is more important. IOW, if one
 manager owns both apps, you are probably safe, but if two different
 managers might have differing views of each other's priorities, tread with
 caution.

 In any case, be prepared to move one of the apps to a different cluster if
 and when usage patterns cause them to conflict.

 There is also the concept of devOps, where the app developers also own
 operations. You really can't have two separate development teams administer
 operations for one set of hardware.

 If you are dedicated to operations for both app teams and the teams seem
 to be reasonably compatible, then it could be fine.

  In short, sure, technically a single cluster can support any number of
  keyspaces, but mostly it will come down to whether there might be an
  excess of contention for load and operations of the cluster in production.

 And then little things like software upgrades - one app might really need
 a disruptive or risky upgrade or need to bounce the entire cluster, but
 then the other app may be impacted even though it had no need for the
 upgrade or be bounced.

 Are the apps synergistic in some way, such that there is an architectural
 benefit from running on the same hardware?

 In the end, the simplest solution is typically the better solution, unless
 any of these other factors loom too large.


 -- Jack Krupansky

 On Thu, Apr 2, 2015 at 9:06 AM, Ian Rose ianr...@fullstory.com wrote:

 Hi all -

 We currently have a single cassandra cluster that is dedicated to a
 relatively narrow purpose, with just 2 tables.  Soon we will need cassandra
 for another, unrelated, system, and my debate is whether to just add the
 new tables to our existing cassandra cluster or whether to spin up an
 entirely new, separate cluster for this new system.

 Does anyone have pros/cons to share on this?  It appears from watching
 talks and such online that the big users (e.g. Netflix, Spotify) tend to
 favor multiple, single-purpose clusters, and thus that was my initial
 preference.  But we are (for now) no where close to them in traffic so I'm
 wondering if running an entirely separate cluster would be a premature
 optimization which wouldn't pay for the (nontrivial) overhead in
 configuration management and ops.  While we are still small it might be
 much smarter to reuse our existing clusters so that I can get it done
 faster...

 Thanks!
 - Ian





Re: Cluster status instability

2015-04-02 Thread daemeon reiydelle
Do you happen to be using a tool like Nagios or Ganglia that is able to
report utilization (CPU, load, disk IO, network)? There are plugins for
both that will also notify you (depending on whether you enabled the
intermediate GC logging) about what is happening.



On Thu, Apr 2, 2015 at 8:35 AM, Jan cne...@yahoo.com wrote:

 Marcin;

 are all your nodes within the same Region?
 If not in the same region, what is the Snitch type that you are using?

 Jan/



   On Thursday, April 2, 2015 3:28 AM, Michal Michalski 
 michal.michal...@boxever.com wrote:


 Hey Marcin,

 Are they actually going up and down repeatedly (flapping) or just down and
 they never come back?
 There might be different reasons for flapping nodes, but to list what I
 have at the top of my head right now:

 1. Network issues. I don't think it's your case, but you can read about
 the issues some people are having when deploying C* on AWS EC2 (keyword to
 look for: phi_convict_threshold)

 2. Heavy load. Node is under heavy load because of massive number of reads
 / writes / bulkloads or e.g. unthrottled compaction etc., which may result
 in extensive GC.

  Could any of these be a problem in your case? I'd start by investigating
  the GC logs, e.g. to see how long the stop-the-world full GC takes (GC
  logs should be on by default from what I can see [1])

 [1] https://issues.apache.org/jira/browse/CASSANDRA-5319
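A quick way to do that check (log locations are the usual package defaults and may differ on your install):

    # Long stop-the-world pauses are reported by Cassandra's GCInspector
    grep GCInspector /var/log/cassandra/system.log | tail -20

    # If JVM GC logging is enabled in cassandra-env.sh, pauses appear in gc.log too
    grep -E 'Full GC|Total time for which application threads were stopped' \
        /var/log/cassandra/gc.log | tail -20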

 Michał


 Kind regards,
 Michał Michalski,
 michal.michal...@boxever.com

 On 2 April 2015 at 11:05, Marcin Pietraszek mpietras...@opera.com wrote:

 Hi!

  We have a 56 node cluster with C* 2.0.13 + the CASSANDRA-9036 patch
  installed. Assume we have nodes A, B, C, D, E. On some irregular basis
  one of those nodes starts to report that a subset of the other nodes is in
  the DN state although the C* daemon on all nodes is running:

 A$ nodetool status
 UN B
 DN C
 DN D
 UN E

 B$ nodetool status
 UN A
 UN C
 UN D
 UN E

 C$ nodetool status
 DN A
 UN B
 UN D
 UN E

  After a restart of node A, C and D report that A is in UN, and A also
  claims that the whole cluster is in the UN state. Right now I don't have any
  clear steps to reproduce that situation; do you guys have any idea
  what could be causing such behaviour? How could this be prevented?

  It seems like when node A is the coordinator and gets a request for some
  data being replicated on C and D, it responds with an Unavailable
  exception; after restarting A that problem disappears.

 --
 mp







Re: Frequent timeout issues

2015-04-02 Thread daemeon reiydelle
May not be relevant, but what is the default heap size you have deployed?
It should be no more than 16GB (and be aware of the impact of GC on that
large a heap); I suggest not smaller than 8-12GB.



On Wed, Apr 1, 2015 at 11:28 AM, Anuj Wadehra anujw_2...@yahoo.co.in
wrote:

 Are you writing multiple CFs at the same time?
 Please run nodetool tpstats to make sure that FlushWriter etc. doesn't have
 high "All time blocked" counts. A blocked memtable FlushWriter may block/drop
 writes. If that's the case you may need to increase the memtable flush
 writers. If you have many secondary indexes in the CF, make sure that the
 memtable flush queue size is set at least equal to the number of indexes.

 Monitoring iostat and GC logs may help.
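A quick way to run those checks (the cassandra.yaml path is the usual package default and only an assumption):

    # Look for blocked flush writers (the "All time blocked" column)
    nodetool tpstats | grep -iE 'blocked|flush'

    # The relevant knobs if FlushWriter is blocking (values are workload dependent)
    grep -E 'memtable_flush_writers|memtable_flush_queue_size' /etc/cassandra/cassandra.yaml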

 Thanks
 Anuj Wadehra
 --
   *From*:Amlan Roy amlan@cleartrip.com
 *Date*:Wed, 1 Apr, 2015 at 9:27 pm
 *Subject*:Re: Frequent timeout issues

 Did not see any exception in cassandra.log and system.log. Monitored using
 JConsole. Did not see anything wrong. Do I need to see any specific info?
 Doing almost 1000 writes/sec.

 HBase and Cassandra are running on different clusters. For Cassandra I
 have 6 nodes with 64GB RAM (heap is at the default setting) and 32 cores.

 On 01-Apr-2015, at 8:43 pm, Eric R Medley emed...@xylocore.com wrote:




Re: Column value not getting updated

2015-04-02 Thread daemeon reiydelle
Interesting that you are finding excessive drift from public time servers.
I only once saw that problem with AWS' time servers. To be conservative I
sometimes recommend that clients spool up their own time server, but
realize IT will also drift if the public time servers do! Somewhat
different if in your own DC, but the same time server drift issues apply.

Google has resorted to putting tier one time servers (cesium clocks or
whatever) in every data center due to the public drift issues. Does anyone
know if AWS' time service is now stratum 1 backed?

However, it is better to have two (at least) in AWS; make sure their
private IPs are not in the same /24 CIDR subnet!

Of course this can get troublesome if load sharing between e.g. AWS East
and West.
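A quick check of how well each node is actually synced (the host names are placeholders, and ntpstat may not be installed on every distro):

    # On each node: show the selected peers and their offset/jitter
    ntpq -p
    ntpstat

    # Rough fleet-wide skew check from one host
    for h in cass-1 cass-2 cass-3; do
      echo -n "$h: "
      ssh "$h" date +%s.%N
    done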



*...*






*“Life should not be a journey to the grave with the intention of arriving
safely in a pretty and well preserved body, but rather to skid in broadside
in a cloud of smoke, thoroughly used up, totally worn out, and loudly
proclaiming “Wow! What a Ride!” - Hunter Thompson*
*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Tue, Mar 31, 2015 at 10:49 PM, Saurabh Sethi saurabh_se...@symantec.com
wrote:

 Thanks Mark. A great post indeed and saved me a lot of trouble.

 - Saurabh
 From: Mark Greene green...@gmail.com
 Reply-To: user@cassandra.apache.org user@cassandra.apache.org
 Date: Tuesday, March 31, 2015 at 10:15 PM
 To: user@cassandra.apache.org user@cassandra.apache.org

 Subject: Re: Column value not getting updated

 Hey Saurabh,

 We're actually preparing for this ourselves and spinning up our own NTP
 server pool. The public NTP pools have a lot of drift and should not be
 relied upon for cluster technology that is sensitive to time skew like C*.

 The folks at Logentries did a great write up about this which we used as a
 guide.



 - https://blog.logentries.com/2014/03/synchronizing-clocks-in-a-cassandra-cluster-pt-1-the-problem/
 - https://blog.logentries.com/2014/03/synchronizing-clocks-in-a-cassandra-cluster-pt-2-solutions/


 -Mark

 On Tue, Mar 31, 2015 at 5:59 PM, Saurabh Sethi saurabh_se...@symantec.com
  wrote:

 That’s what I found out that the clocks were not in sync.

 But I have setup NTP on all 3 nodes and would expect the clocks to be in
 sync.

 From: Nate McCall n...@thelastpickle.com
 Reply-To: user@cassandra.apache.org user@cassandra.apache.org
 Date: Tuesday, March 31, 2015 at 2:50 PM
 To: Cassandra Users user@cassandra.apache.org
 Subject: Re: Column value not getting updated

 You would see that if the servers' clocks were out of sync.

 Make sure the time on the servers is in sync or set the client timestamps
 explicitly.

 On Tue, Mar 31, 2015 at 3:23 PM, Saurabh Sethi 
 saurabh_se...@symantec.com wrote:

 I have written a unit test that creates a column family, inserts a row
 in that column family and then updates the value of one of the columns.

  After updating, the unit test immediately tries to read the updated value
  for that column, but Cassandra returns the old value.

- I am using QueryBuilder API and not CQL directly.
- I am using the consistency level of QUORUM for everything –
insert, update and read.
- Cassandra is running as a 3 node cluster with replication factor
of 3.


 Anyone has any idea what is going on here?

 Thanks,
 Saurabh




 --
 -
 Nate McCall
 Austin, TX
 @zznate

 Co-Founder  Sr. Technical Consultant
 Apache Cassandra Consulting
 http://www.thelastpickle.com




