Re: Adding a new node with the double of disk space

2017-08-17 Thread Kevin O'Connor
Are you saying that even if a node had double the hardware capacity in every
way, it would be a bad idea to up num_tokens? I thought that was the whole
idea of that setting.
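For reference, the setting in question is a single line in cassandra.yaml; a
quick way to check it (the package-install path below is an assumption) is:

    # keep this value identical on every node; the advice below is not to raise it
    grep '^num_tokens' /etc/cassandra/cassandra.yaml
    # num_tokens: 256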

On Thu, Aug 17, 2017 at 9:52 AM, Carlos Rolo  wrote:

> No.
>
> Even if you doubled all the hardware on that node vs. the others, it would
> still be a bad idea.
> Keep the cluster uniform vnodes-wise.
>
> Regards,
>
> Carlos Juzarte Rolo
> Cassandra Consultant / Datastax Certified Architect / Cassandra MVP
>
> Pythian - Love your data
>
> rolo@pythian | Twitter: @cjrolo | Skype: cjr2k3 | Linkedin:
> linkedin.com/in/carlosjuzarterolo
> Mobile: +351 918 918 100
> www.pythian.com
>
> On Thu, Aug 17, 2017 at 5:47 PM, Cogumelos Maravilha <
> cogumelosmaravi...@sapo.pt> wrote:
>
>> Hi all,
>>
>> I need to add a new node to my cluster, but this time the new node will
>> have double the disk space compared to the other nodes.
>>
>> I'm using the default vnodes setting (num_tokens: 256). To fully use the
>> disk space on the new node, do I just have to configure num_tokens: 512?
>>
>> Thanks in advance.
>>


Re: Truncate data from a single node

2017-07-12 Thread Kevin O'Connor
Thanks for the suggestions! Could altering the RF from 2 to 1 cause any
issues, or will it basically just change the coordinator's write paths
and guide future repairs/cleanups?

On Wed, Jul 12, 2017 at 22:29 Jeff Jirsa <jji...@apache.org> wrote:

>
>
> On 2017-07-11 20:09 (-0700), "Kevin O'Connor" <ke...@reddit.com.INVALID>
> wrote:
> > This might be an interesting question - but is there a way to truncate data
> > from just a single node or two as a test instead of truncating from the
> > entire cluster? We have time series data that we don't really mind having
> > gaps in, but it's taking up a huge amount of space and we're looking to
> > clear some. I'm worried that if we run a truncate on this huge CF it'll end
> > up locking up the cluster, but I don't care so much if it just kills a
> > single node.
> >
>
> IF YOU CAN TOLERATE DATA INCONSISTENCIES, you can stop a node, delete some
> sstables, and start it again. The risk in deleting arbitrary sstables is
> that you may remove a tombstone and bring data back to life, or remove the
> only replica with a write if you write at CL:ONE, but if you're OK with
> data going missing, you won't hurt much as long as you stop cassandra
> before you go killing sstables.
>
> TWCS does make this easier, because you can use sstablemetadata to
> identify timestamps/tombstone %s, and then nuke sstables that are
> old/mostly-expired first.
>
>
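As a sketch of the sstablemetadata step mentioned above (the data path and
table name are placeholders, and the exact output labels vary by Cassandra
version):

    # list timestamp ranges and droppable-tombstone estimates per sstable
    for f in /var/lib/cassandra/data/my_ks/my_table-*/*-Data.db; do
        echo "== $f"
        sstablemetadata "$f" | grep -Ei 'minimum timestamp|maximum timestamp|droppable'
    done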
> > Is doing something like deleting SSTables from disk possible? If I alter
> > this keyspace from an RF of 2 down to 1 and then delete them, they won't be
> > able to be repaired if I'm thinking this through right.
> >
>
> If you drop RF from 2 to 1, you can just run cleanup and delete half the
> data (though it'll rewrite sstables to do it, which will cause a short-term
> increase in disk usage).
>
>
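A minimal sketch of that sequence, assuming a keyspace named my_ks with
SimpleStrategy:

    # drop RF from 2 to 1, then have each node discard replicas it no longer owns
    cqlsh -e "ALTER KEYSPACE my_ks
              WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};"
    nodetool cleanup my_ks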


Truncate data from a single node

2017-07-11 Thread Kevin O'Connor
This might be an interesting question - but is there a way to truncate data
from just a single node or two as a test instead of truncating from the
entire cluster? We have time series data that we don't really mind having
gaps in, but it's taking up a huge amount of space and we're looking to
clear some. I'm worried that if we run a truncate on this huge CF it'll end
up locking up the cluster, but I don't care so much if it just kills a
single node.

Is doing something like deleting SSTables from disk possible? If I alter
this keyspace from an RF of 2 down to 1 and then delete them, they won't be
able to be repaired if I'm thinking this through right.

Thanks!


Re: How to avoid flush if the data can fit into memtable

2017-05-31 Thread Kevin O'Connor
Great post Akhil! Thanks for explaining that.

On Mon, May 29, 2017 at 5:43 PM, Akhil Mehra  wrote:

> Hi Preetika,
>
> After thinking about your scenario I believe your small SSTable size might
> be due to data compression. By default, all tables enable SSTable
> compression.
>
> Let's go through your scenario. Say you have allocated 4GB to your
> Cassandra node. Your memtable_heap_space_in_mb and
> memtable_offheap_space_in_mb will then each come to roughly 1GB. Since you
> have memtable_cleanup_threshold set to 0.50, a memtable cleanup will be
> triggered when the total allocated memtable space exceeds 0.5GB. Note the
> cleanup threshold is 0.50 of 1GB, not 0.50 of the combined heap and
> off-heap space. This memtable allocation is the total amount available for
> all tables on your node, including all system keyspaces. The cleanup
> process will write the largest memtable to disk.
>
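For concreteness, a sketch of the settings behind that arithmetic. These are
illustrative values mirroring the 4GB example above, not recommendations, and
they are commented out by default, so the grep only matches if they were set
explicitly:

    grep -E '^(memtable_heap_space_in_mb|memtable_offheap_space_in_mb|memtable_cleanup_threshold)' \
        /etc/cassandra/cassandra.yaml
    # memtable_heap_space_in_mb: 1024      (defaults to 1/4 of the heap)
    # memtable_offheap_space_in_mb: 1024
    # memtable_cleanup_threshold: 0.50     (flush the largest memtable once ~512 MB is in use)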
> For your case, I am assuming that you are on a *single node with only one
> table with insert activity*. I do not think the commit log will trigger a
> flush in this circumstance as by default the commit log has 8192 MB of
> space unless the commit log is placed on a very small disk.
>
> I am assuming your table on disk is smaller than 500MB because of
> compression. You can disable compression on your table and see if this
> helps get the desired size.
>
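The compression toggle is a per-table property; a minimal sketch, with the
table name as a placeholder (on 2.x the option is spelled 'sstable_compression'
rather than 'enabled'):

    # disable SSTable compression so on-disk size reflects the raw data
    cqlsh -e "ALTER TABLE my_ks.my_table WITH compression = {'enabled': 'false'};"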
> I have written up a blog post explaining memtable flushing (
> http://abiasforaction.net/apache-cassandra-memtable-flush/)
>
> Let me know if you have any other questions.
>
> I hope this helps.
>
> Regards,
> Akhil Mehra
>
>
> On Fri, May 26, 2017 at 6:58 AM, preetika tyagi 
> wrote:
>
>> I agree that for such a small data, Cassandra is obviously not needed.
>> However, this is purely an experimental setup by using which I'm trying to
>> understand how and exactly when memtable flush is triggered. As I mentioned
>> in my post, I read the documentation and tweaked the parameters accordingly
>> so that I never hit a memtable flush, but it is still happening. As far as
>> the setup is concerned, I'm just using 1 node and running Cassandra with
>> the "cassandra -R" option, and then running some queries to insert some dummy
>> data.
>>
>> I use the schema from CASSANDRA_HOME/tools/cqlstress-insanity-example.yaml
>> and add "durable_writes=false" in the keyspace_definition.
>>
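For reference, the same durable_writes toggle expressed directly in CQL looks
roughly like this (keyspace name and RF are placeholders):

    # skip the commit log entirely for writes to this keyspace
    cqlsh -e "CREATE KEYSPACE stress_ks
              WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
              AND durable_writes = false;"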
>> @Daemeon - The previous post led to this post, but since I was unaware of
>> memtable flushes and assumed a flush wasn't happening, the previous post was
>> about something else (throughput/latency etc.). This post is explicitly about
>> exactly when the memtable is being dumped to disk. Didn't want to confuse two
>> different goals; that's why I posted a new one.
>>
>> On Thu, May 25, 2017 at 10:38 AM, Avi Kivity  wrote:
>>
>>> It doesn't have to fit in memory. If your key distribution has strong
>>> temporal locality, then a larger memtable that can coalesce overwrites
>>> greatly reduces the disk I/O load for the memtable flush and subsequent
>>> compactions. Of course, I have no idea if this is what the OP had in mind.
>>>
>>>
>>> On 05/25/2017 07:14 PM, Jonathan Haddad wrote:
>>>
>>> Sorry for the confusion.  That was for the OP.  I wrote it quickly right
>>> after waking up.
>>>
>>> What I'm asking is why does the OP want to keep his data in the memtable
>>> exclusively?  If the goal is to "make reads fast", then just turn on row
>>> caching.
>>>
>>> If there's so little data that it fits in memory (300MB), and there
>>> aren't going to be any writes past the initial small dataset, why use
>>> Cassandra?  It sounds like the wrong tool for this job.  Sounds like
>>> something that could easily be stored in S3 and loaded in memory when the
>>> app is fired up.
>>>
>>> On Thu, May 25, 2017 at 8:06 AM Avi Kivity  wrote:
>>>
 Not sure whether you're asking me or the original poster, but the more
 times data gets overwritten in a memtable, the less it has to be compacted
 later on (and even without overwrites, larger memtables result in less
 compaction).

 On 05/25/2017 05:59 PM, Jonathan Haddad wrote:

 Why do you think keeping your data in the memtable is what you need
 to do?
 On Thu, May 25, 2017 at 7:16 AM Avi Kivity  wrote:

> Then it doesn't have to (it still may, for other reasons).
>
> On 05/25/2017 05:11 PM, preetika tyagi wrote:
>
> What if the commit log is disabled?
>
> On May 25, 2017 4:31 AM, "Avi Kivity"  wrote:
>
>> Cassandra has to flush the memtable occasionally, or the commit log
>> grows without bounds.
>>
>> On 05/25/2017 03:42 AM, preetika tyagi wrote:
>>
>> Hi,
>>
>> I'm running Cassandra with a very small dataset so that the data can
>> exist in the memtable only. Below are my configurations:
>>
>> In jvm.options:
>>
>> 

Re: STCS Compaction with wide rows & TTL'd data

2016-09-02 Thread Kevin O'Connor
On Fri, Sep 2, 2016 at 9:33 AM, Mark Rose <markr...@markrose.ca> wrote:

> Hi Kevin,
>
> The tombstones will live in an sstable until it gets compacted. Do you
> have a lot of pending compactions? If so, increasing the number of
> parallel compactors may help.


Nope, we are pretty well managed on compactions. Only ever 1 or 2 running
at a time per node.
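For reference, the pending-compaction count Mark asks about can be checked
with:

    nodetool compactionstats
    # prints "pending tasks: N" plus any compactions currently running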


> You may also be able to tune the STCS
> parameters. Here's a good explanation of how it works:
> https://shrikantbang.wordpress.com/2014/04/22/size-tiered-compaction-strategy-in-apache-cassandra/


Yeah interesting - I'd like to try that. Is there a way to verify what the
settings are before changing them? DESCRIBE TABLE doesn't seem to show the
compaction subproperties.
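On a 1.2/2.x node, one way to see the live values is to read them from the
schema system table; a sketch, with the keyspace name as a placeholder (column
names may differ slightly across versions, and this table was replaced by
system_schema.tables in 3.0):

    cqlsh -e "SELECT compaction_strategy_class, compaction_strategy_options
              FROM system.schema_columnfamilies
              WHERE keyspace_name = 'my_ks'
                AND columnfamily_name = 'OAuth2AccessTokensByUser';"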


> Anyway, LCS would probably be a better fit for your use case. LCS
> would help with eliminating tombstones, but it may also result in
> dramatically higher CPU usage for compaction. If LCS compaction can
> keep up, in addition to getting rid of tombstones faster, LCS should
> reduce the number of sstables that must be read to return the row and
> have a positive impact on read latency. STCS is a bad fit for rows
> that are updated frequently (which includes rows with TTL'ed data).
>

Thanks - that may end up being where we go with this.
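If you do go that route, the switch itself is a single alter (keyspace name is
a placeholder; expect a burst of compaction activity while the existing
sstables are re-leveled):

    cqlsh -e "ALTER TABLE my_ks.OAuth2AccessTokensByUser
              WITH compaction = {'class': 'LeveledCompactionStrategy'};"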

> Also, you may have an error in your application design. OAuth Access
> Tokens are designed to have a very short lifetime of seconds or
> minutes. On access token expiry, a Refresh Token should be used to get
> a new access token. A long-lived access token is a dangerous thing as
> there is no way to disable it (refresh tokens should be disabled to
> prevent the creation of new access tokens).
>

Yeah, noted. We only allow longer lived access tokens in some very specific
scenarios, so they are much less likely to be in that CF than the standard
3600s ones, but they're there.


>
> -Mark
>
> On Thu, Sep 1, 2016 at 3:53 AM, Kevin O'Connor <ke...@reddit.com> wrote:
> > We're running C* 1.2.11 and have two CFs, one called OAuth2AccessToken and
> > one OAuth2AccessTokensByUser. OAuth2AccessToken has the token as the row
> > key, and the columns are some data about the OAuth token. There's a TTL set
> > on it, usually 3600, but can be higher (up to 1 month).
> > OAuth2AccessTokensByUser has the user as the row key, and then all of the
> > user's token identifiers as column values. Each of the column values has a
> > TTL that is set to the same as the access token it corresponds to.
> >
> > The OAuth2AccessToken CF takes up around ~6 GB on disk, whereas the
> > OAuth2AccessTokensByUser CF takes around ~110 GB. If I use sstablemetadata,
> > I can see the droppable tombstones ratio is around 90% for the larger
> > sstables.
> >
> > My question is - why aren't these tombstones getting compacted away? I'm
> > guessing that it's because we use STCS and the large sstables that have
> > built up over time are never considered for compaction. Would LCS be a
> > better fit for the issue of trying to keep the tombstones in check?
> >
> > I've also tried forceUserDefinedCompaction via JMX on some of the largest
> > sstables and it just creates a new sstable of the exact same size, which was
> > pretty surprising. Why would this explicit request to compact an sstable not
> > remove tombstones?
> >
> > Thanks!
> >
> > Kevin
>


STCS Compaction with wide rows & TTL'd data

2016-09-01 Thread Kevin O'Connor
We're running C* 1.2.11 and have two CFs, one called OAuth2AccessToken and
one OAuth2AccessTokensByUser. OAuth2AccessToken has the token as the row
key, and the columns are some data about the OAuth token. There's a TTL set
on it, usually 3600, but can be higher (up to 1 month).
OAuth2AccessTokensByUser has the user as the row key, and then all of the
user's token identifiers as column values. Each of the column values has a
TTL that is set to the same as the access token it corresponds to.

The OAuth2AccessToken CF takes up around ~6 GB on disk, whereas the
OAuth2AccessTokensByUser CF takes around ~110 GB. If I use sstablemetadata,
I can see the droppable tombstones ratio is around 90% for the larger
sstables.
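For reference, the ratio mentioned above comes straight out of sstablemetadata;
a sketch assuming the default data directory layout (the output label wording
varies a bit by version):

    sstablemetadata /var/lib/cassandra/data/my_ks/OAuth2AccessTokensByUser/*-Data.db \
        | grep -i droppable
    # "Estimated droppable tombstones: 0.9..." for the big, old sstables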

My question is - why aren't these tombstones getting compacted away? I'm
guessing that it's because we use STCS and the large sstables that have
built up over time are never considered for compaction. Would LCS be a
better fit for the issue of trying to keep the tombstones in check?

I've also tried forceUserDefinedCompaction via JMX on some of the largest
sstables and it just creates a new sstable of the exact same size, which
was pretty surprising. Why would this explicit request to compact an
sstable not remove tombstones?

Thanks!

Kevin


Open source equivalents of OpsCenter

2016-07-13 Thread Kevin O'Connor
Now that OpsCenter doesn't work with open source installs, are there any
attempts at an open source equivalent? I'd be more interested in looking at
metrics of a running cluster and doing other tasks like managing
repairs/rolling restarts than in historical data.


Re: Latency overhead on Cassandra cluster deployed on multiple AZs (AWS)

2016-04-12 Thread Kevin O'Connor
Are you in VPC or EC2 Classic? Are you using enhanced networking?

On Tue, Apr 12, 2016 at 9:52 AM, Alessandro Pieri  wrote:

> Hi Jack,
>
> As mentioned before, I used m3.xlarge instance types together with two
> ephemeral disks in RAID 0 and, according to Amazon, they have "high"
> network performance.
>
> I ran many tests starting with a brand-new cluster every time and I got
> consistent results.
>
> I believe there's something that I cannot yet explain about the client used
> by cassandra-stress to connect to the nodes. I'd like to understand why
> there is such a big difference:
>
> Multi-AZ, CL=ONE, "--nodes node1,node2,node3,node4,node5,node6" -> 95th
> percentile: 38.14ms
> Multi-AZ, CL=ONE, "--nodes node1" -> 95th percentile: 5.9ms
>
> Hope you can help to figure it out.
>
> Cheers,
> Alessandro
>
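For anyone reproducing the comparison above with a current tool, roughly
equivalent runs in the newer (2.1+) cassandra-stress syntax would look
something like this; node names are placeholders, and the original test used
the 2.0-era tool and its --nodes flag:

    # write 20M rows at CL=ONE against the full contact list, then read back
    cassandra-stress write n=20000000 cl=ONE -node node1,node2,node3,node4,node5,node6
    cassandra-stress read  n=10000000 cl=ONE -node node1,node2,node3,node4,node5,node6
    # repeating the read with "-node node1" only is the single-contact-point case
    cassandra-stress read  n=10000000 cl=ONE -node node1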
>
>
>
> On Tue, Apr 12, 2016 at 5:43 PM, Jack Krupansky 
> wrote:
>
>> Which instance type are you using? Some may be throttled for EBS access,
>> so you could bump into a rate limit, and who knows what AWS will do at that
>> point.
>>
>> -- Jack Krupansky
>>
>> On Tue, Apr 12, 2016 at 6:02 AM, Alessandro Pieri <
>> alessan...@getstream.io> wrote:
>>
>>> Thanks Chris for your reply.
>>>
>>> I ran the tests 3 times for 20 minutes each and monitored the network
>>> latency in the meanwhile; it was very low (even the 99th percentile).
>>>
>>> I didn't notice any cpu spike caused by the GC but, as you pointed out,
>>> I will look into the GC log, just to be sure.
>>>
>>> In order to avoid the problem you mentioned with EBS and to keep the
>>> deviation under control I used two ephemeral disks in raid 0.
>>>
>>> I think the odd results come from the way cassandra-stress deals with
>>> multiple nodes. As soon as possible I will go through the Java code to get
>>> some more detail.
>>>
>>> If you have something else in your mind please let me know, your
>>> comments were really appreciated.
>>>
>>> Cheers,
>>> Alessandro
>>>
>>>
>>> On Mon, Apr 11, 2016 at 4:15 PM, Chris Lohfink 
>>> wrote:
>>>
 Where do you get the ~1ms latency between AZs? Comparing a short term
 average to a 99th percentile isn't very fair.

 "Over the last month, the median is 2.09 ms, 90th percentile is
 20ms, 99th percentile is 47ms." - per
 https://www.quora.com/What-are-typical-ping-times-between-different-EC2-availability-zones-within-the-same-region

 Are you using EBS? That would further impact latency on reads and GCs
 will always cause hiccups in the 99th+.

 Chris


 On Mon, Apr 11, 2016 at 7:57 AM, Alessandro Pieri 
 wrote:

> Hi everyone,
>
> Last week I ran some tests to estimate the latency overhead introduced
> in a Cassandra cluster by a multi-availability-zone setup on AWS EC2.
>
> I started a Cassandra cluster of 6 nodes deployed on 3 different AZs
> (2 nodes/AZ).
>
> Then I used cassandra-stress to run an INSERT (write) test of 20M
> entries with a replication factor of 3; right after, I ran cassandra-stress
> again to READ 10M entries.
>
> Well, I got the following unexpected result:
>
> Single-AZ, CL=ONE -> median/95th percentile/99th percentile:
> 1.06ms/7.41ms/55.81ms
> Multi-AZ, CL=ONE -> median/95th percentile/99th percentile:
> 1.16ms/38.14ms/47.75ms
>
> Basically, switching to the multi-AZ setup, the latency increased by
> ~30ms. That's too much considering that the average network latency between
> AZs on AWS is ~1ms.
>
> Since I couldn't find anything to explain those results, I decided to
> run cassandra-stress specifying only a single node entry (i.e. "--nodes
> node1" instead of "--nodes node1,node2,node3,node4,node5,node6") and,
> surprisingly, the latency went back down to 5.9 ms.
>
> Trying to recap:
>
> Multi-AZ, CL=ONE, "--nodes node1,node2,node3,node4,node5,node6" ->
> 95th percentile: 38.14ms
> Multi-AZ, CL=ONE, "--nodes node1" -> 95th percentile: 5.9ms
>
> For the sake of completeness, I ran a further test using a consistency
> level of LOCAL_QUORUM and the test did not show any large variance between
> using a single node and multiple ones.
>
> Do you guys know what could be the reason?
>
> The tests were executed on an m3.xlarge (network optimized) instance using
> the DataStax AMI 2.6.3 running Cassandra v2.0.15.
>
> Thank you in advance for your help.
>
> Cheers,
> Alessandro
>


>>>
>>>
>>> --
>>> Alessandro Pieri
>>> Software Architect @ Stream.io Inc
>>> e-Mail: alessan...@getstream.io - twitter: sirio7g
>>>
>>>
>>
>


Re: Cassandra is consuming a lot of disk space

2016-01-12 Thread Kevin O'Connor
Have you tried restarting? It's possible there are open file handles to
sstables that have been compacted away. You can verify by doing lsof and
grepping for DEL or deleted.

If it's not that, you can run nodetool cleanup on each node to scan all of
the sstables on disk and remove anything that it's not responsible for.
Generally this would only work if you added nodes recently.
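A quick way to check both of those (assuming a single Cassandra process and
default tooling on the node):

    # compacted-away sstables still held open show up as "(deleted)"
    lsof -p "$(pgrep -f CassandraDaemon)" | grep -i deleted
    # drop data this node no longer owns (mainly useful after adding nodes)
    nodetool cleanup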

On Tuesday, January 12, 2016, Rahul Ramesh  wrote:

> We have a 2 node Cassandra cluster with a replication factor of 2.
>
> The load on the nodes is around 350 GB.
>
> Datacenter: Cassandra
> ==
> Address      Rack   Status  State   Load      Owns      Token
>                                                         -5072018636360415943
> 172.31.7.91  rack1  Up      Normal  328.5 GB  100.00%   -7068746880841807701
> 172.31.7.92  rack1  Up      Normal  351.7 GB  100.00%   -5072018636360415943
>
> However, if I use df -h,
>
> /dev/xvdf   252G  223G   17G  94% /HDD1
> /dev/xvdg   493G  456G   12G  98% /HDD2
> /dev/xvdh   197G  167G   21G  90% /HDD3
>
>
> HDD1, 2 and 3 contain only Cassandra data. It amounts to close to 1 TB on one
> of the machines, and on the other machine it is close to 650 GB.
>
> I started a repair 2 days ago; after running it, the amount of disk
> space consumed has actually increased.
> I also checked whether this is because of snapshots. nodetool listsnapshots
> intermittently lists a snapshot, but it goes away after some time.
>
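For reference, snapshots (including ones that repair can leave behind) can be
listed and removed from the node itself; a quick sketch:

    nodetool listsnapshots
    # remove snapshots; pass -t <tag> for a specific one
    # (newer versions require "nodetool clearsnapshot --all" to clear everything)
    nodetool clearsnapshot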
> Can somebody please help me understand:
> 1. Why is so much disk space consumed?
> 2. Why did it increase after repair?
> 3. Is there any way to recover from this state?
>
>
> Thanks,
> Rahul
>
>