Re: Switching to Incremental Repair

2024-02-15 Thread Chris Lohfink
I would recommend adding something to C* to be able to flip the repaired
state on all sstables quickly (with default OSS you can turn nodes off one at a
time and use sstablerepairedset). It's a life saver to be able to revert
back to non-IR if a migration goes south. The same approach can be used to quickly
switch sstables into IR, with more caveats. Probably worth a JIRA to add a faster
solution.
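
For reference, the manual flow with stock OSS tooling is roughly (flags from
memory, double check against your version's sstablerepairedset help): nodetool
drain and stop the node, build a file listing the table's *-Data.db files, run
sstablerepairedset --really-set --is-unrepaired -f that-file (or --is-repaired
when moving the other way), then start the node back up and move on to the
next one.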

On Thu, Feb 15, 2024 at 12:50 PM Kristijonas Zalys  wrote:

> Hi folks,
>
> One last question regarding incremental repair.
>
> What would be a safe approach to temporarily stop running incremental
> repair on a cluster (e.g.: during a Cassandra major version upgrade)? My
> understanding is that if we simply stop running incremental repair, the
> cluster's nodes can, in the worst case, double in disk size as the repaired
> dataset will not get compacted with the unrepaired dataset. Similar to
> Sebastian, we have nodes where the disk usage is multiple TiBs so
> significant growth can be quite dangerous in our case. Would the only safe
> choice be to mark all SSTables as unrepaired before stopping regular
> incremental repair?
>
> Thanks,
> Kristijonas
>
>
> On Wed, Feb 7, 2024 at 4:33 PM Bowen Song via user <
> user@cassandra.apache.org> wrote:
>
>> The over-streaming is only problematic for the repaired SSTables, but it
>> can be triggered by inconsistencies within the unrepaired SSTables
>> during an incremental repair session. This is because although an
>> incremental repair will only compare the unrepaired SSTables, it
>> will stream both the unrepaired and repaired SSTables for the
>> inconsistent token ranges. Keep in mind that the source SSTables for
>> streaming are selected based on the token ranges, not the
>> repaired/unrepaired state.
>>
>> Based on the above, I'm unsure whether running an incremental repair before a
>> full repair can fully avoid the over-streaming issue.
>>
>> On 07/02/2024 22:41, Sebastian Marsching wrote:
>> > Thank you very much for your explanation.
>> >
>> > Streaming happens on the token range level, not the SSTable level,
>> right? So, when running an incremental repair before the full repair, the
>> problem that “some unrepaired SSTables are being marked as repaired on one
>> node but not on another” should not exist any longer. Now this data should
>> be marked as repaired on all nodes.
>> >
>> > Thus, when repairing the SSTables that are marked as repaired, this
>> data should be included on all nodes when calculating the Merkle trees and
>> no overstreaming should happen.
>> >
>> > Of course, this means that running an incremental repair *first* after
>> marking SSTables as repaired and only running the full repair *after* that
>> is critical. I have to admit that previously I wasn’t fully aware of how
>> critical this step is.
>> >
>> >> Am 07.02.2024 um 20:22 schrieb Bowen Song via user <
>> user@cassandra.apache.org>:
>> >>
>> >> Unfortunately repair doesn't compare each partition individually.
>> Instead, it groups multiple partitions together and calculates a hash of
>> them, stores the hash in a leaf of a merkle tree, and then compares the
>> merkle trees between replicas during a repair session. If any one of the
>> partitions covered by a leaf is inconsistent between replicas, the hash
>> values in these leaves will be different, and all partitions covered by the
>> same leaf will need to be streamed in full.
>> >>
>> >> Knowing that, and also knowing that your approach can create a lot of
>> inconsistencies in the repaired SSTables because some unrepaired SSTables
>> are being marked as repaired on one node but not on another, you would then
>> understand why over-streaming can happen. The over-streaming is only
>> problematic for the repaired SSTables, because they are much bigger than
>> the unrepaired ones.
>> >>
>> >>
>> >> On 07/02/2024 17:00, Sebastian Marsching wrote:
>>  Caution, using the method you described, the amount of data streamed
>> at the end with the full repair is not the amount of data written between
>> stopping the first node and the last node, but depends on the table size,
>> the number of partitions written, their distribution in the ring and the
>> 'repair_session_space' value. If the table is large, the writes touch a
>> large number of partitions scattered across the token ring, and the value
>> of 'repair_session_space' is small, you may end up with a very expensive
>> over-streaming.
>> >>> Thanks for the warning. In our case it worked well (obviously we
>> tested it on a test cluster before applying it on the production clusters),
>> but it is good to know that this might not always be the case.
>> >>>
>> >>> Maybe I misunderstand how full and incremental repairs work in C*
>> 4.x. I would appreciate if you could clarify this for me.
>> >>>
>> >>> So far, I assumed that a full repair on a cluster that is also using
>> incremental repair pretty much works like on a cluster that is not using
>> incremental repair at all, the only difference being that the set 

Re: Nodetool command to pre-load the chunk cache

2023-03-24 Thread Chris Lohfink
Something additional to consider (outside a C* fix) is using a tool like
happycache to keep the page cache consistent between them. Might be sufficient
if the data is in memory already.

Chris

On Tue, Mar 21, 2023 at 2:48 PM Jeff Jirsa  wrote:

> We serialize the other caches to disk to avoid cold-start problems, I
> don't see why we couldn't also serialize the chunk cache? Seems worth a
> JIRA to me.
>
> Until then, you can probably use the dynamic snitch (badness + severity)
> to route around newly started hosts.
>
> I'm actually pretty surprised the chunk cache is that effective, sort of
> nice to know.
>
>
>
> On Tue, Mar 21, 2023 at 10:17 AM Carlos Diaz  wrote:
>
>> Hi Team,
>>
>> We are heavy users of Cassandra at a pretty big bank.  Security measures
>> require us to constantly refresh our C* nodes every x number of days.  We
>> normally do this in a rolling fashion, taking one node down at a time and
>> then refreshing it with a new instance.  This process has been working for
>> us great for the past few years.
>>
>> However, we recently started having issues when a newly refreshed
>> instance comes back online. Our automation waits a few minutes for the node
>> to become "ready (UN)" and then moves on to the next node. The problem
>> that we are facing is that when the node is ready, the chunk cache is still
>> empty, so when the node starts accepting new connections, queries that go to
>> it take much longer to respond and this causes errors for our apps.
>>
>> I was thinking that it would be great if we had a nodetool command that
>> would allow us to prefetch a certain table or a set of tables to preload
>> the chunk cache.  Then we could simply add another check (nodetool info?),
>> to ensure that the chunk cache has been preloaded enough to handle queries
>> to this particular node.
>>
>> Would love to hear others' feedback on the feasibility of this idea.
>>
>> Thanks!
>>
>>
>>
>>


Re: oversized partition detection ? monitoring the partitions growth ?

2019-11-01 Thread Chris Lohfink
You can set compaction_large_partition_warning_threshold_mb and alert on the
following log message:

Writing large partition {}/{}:{} ({}) to sstable {}
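
For example, compaction_large_partition_warning_threshold_mb: 100 in
cassandra.yaml (100 being whatever threshold you care about) and then an alert
on any "Writing large partition" line in system.log should cover it.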

Chris

On Thu, Oct 31, 2019 at 8:01 AM Eric LELEU  wrote:

> Hi,
>
> I'm not sure that you are able to log which partition has reached 100MB,
> but you may monitor the "EstimatedPartitionSizeHistogram" and take the
> max value (or the 99th/95th percentile) to trigger an alert using your monitoring system.
>
> http://cassandra.apache.org/doc/latest/operating/metrics.html#table-metrics
>
> regards,
>
> Eric
>
> Le 31/10/2019 à 12:37, jagernico...@legtux.org a écrit :
>
>
> Hi,
> how can I detect a partition that reaches 100MB? Is it possible to
> log the size of every partition once per day?
>
> regards,
> Nicolas Jäger
>
>


Re: GC Tuning https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

2019-10-19 Thread Chris Lohfink
"It depends" on your version and heap size but G1 is easier to get right so
you probably want to stick with that unless you are using small heaps or are really
interested in tuning it (likely for massively smaller gains than tuning
your data model). There is no GC algo that is strictly better than the others
in all scenarios, unfortunately. If your JVM supports it, ZGC or Shenandoah
are likely going to give you the best latencies.

Chris

On Fri, Oct 18, 2019 at 8:41 PM Sergio Bilello 
wrote:

> Hello!
>
> Is it still better to use ParNew + CMS than G1GC these
> days?
>
> Any recommendation for i3.xlarge nodes read-heavy workload?
>
>
> Thanks,
>
> Sergio
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: loosing data during saving data from java

2019-10-19 Thread Chris Lohfink
If writes are coming fast enough that the commitlog can't keep up,
it will block applying mutations to the memtable (even with periodic mode, once
it hits >1.5x the flush time). Things will queue up and possibly time out, but they
will not be acknowledged until applied. If you do it fast enough for long enough
you can dump a lot into the mutation queue and cause the node to OOM or
GC thrash, but it won't acknowledge the writes, so you won't lose the data.

If you are firing off async writes without waiting for acknowledgement and
assume they succeeded, you may lose data if C* did not succeed (which you
will be notified of via a WriteFailure, WriteTimeout, or an
OperationTimeout). A simple write like that can be idempotent, so you can
just try again on failure.
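
As a rough sketch of that (DataStax Java driver 3.x API; the contact point,
table, and the 3 attempts are made-up values, so treat it as an illustration
rather than recommended client code):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.exceptions.OperationTimedOutException;
import com.datastax.driver.core.exceptions.UnavailableException;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

public class IdempotentRetryExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // the insert sets every column it touches, so replaying it is harmless
            Statement insert = new SimpleStatement(
                    "INSERT INTO ks.tbl (id, value) VALUES (?, ?)", 1, "v").setIdempotent(true);
            executeWithRetry(session, insert, 3);
        }
    }

    // retry only because the statement is idempotent; a timed out write may or may
    // not have been applied, so replaying it must be safe
    static void executeWithRetry(Session session, Statement stmt, int maxAttempts) {
        for (int attempt = 1; ; attempt++) {
            try {
                session.execute(stmt); // returns only once acknowledged at the requested CL
                return;
            } catch (WriteTimeoutException | UnavailableException | OperationTimedOutException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and surface the failure to the caller
                }
            }
        }
    }
}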

Chris

On Sat, Oct 19, 2019 at 1:26 AM adrien ruffie 
wrote:

> Thank Jeff 
>
> but if you save data too fast with the Cassandra repository and
> Cassandra can't keep up and inserts more slowly,
> what is the behavior? Does Cassandra store the overflow in an additional
> buffer? Can no data be lost on Cassandra's side?
>
> Thank a lot.
>
> Adrian
> --
> *De :* Jeff Jirsa 
> *Envoyé :* samedi 19 octobre 2019 00:41
> *À :* cassandra 
> *Objet :* Re: loosing data during saving data from java
>
> There is no buffer in cassandra that is known to (or suspected to)
> lose acknowledged writes if it's overwhelmed.
>
> There may be a client bug where you send so many async writes that they
> overwhelm a bounded queue, or otherwise get dropped or timeout, but those
> would be client bugs, and I'm not sure this list can help you with them.
>
>
>
> On Fri, Oct 18, 2019 at 3:16 PM adrien ruffie 
> wrote:
>
> Hello all,
>
> I have a Cassandra table where I quickly insert several Java entities,
> about 15,000 entries per minute. But at the end of the process, I only
> have for example 199,921 entries instead of 312,212.
> If I truncate the table and relaunch the process, several times I get
> 199,354 or 189,012 entries ... never a fixed number of saved entries ...
>
> Several coworkers tell me they have heard about a buffer which can be
> overwhelmed sometimes, losing several entities queued for insertion ...
> right? Because I don't understand why this insertion loss appears ...
> My Java code is very simple, like below:
> And I java code is very simple like below:
>
> myEntitiesList.forEach(myEntity -> {
>   try {
>     myEntitiesRepository.save(myEntity).subscribe();
>   } catch (Exception e) {
>     e.printStackTrace();
>   }
> });
>
> And the repository is a:
> public interface MyEntityRepository extends ReactiveCassandraRepository<MyEntity, String> {
> }
>
>
> Some one already heard about this problem ?
>
> Thank you very must and best regards
>
> Adrian
>
>


Re: Collecting Latency Metrics

2019-05-30 Thread Chris Lohfink
For what it is worth, I would generally recommend just using the mean vs
calculating it yourself. It's a lot easier, and averages are meaningless for
anything besides trending anyway (which is really what this is useful for,
finding issues on the larger scale), especially with high-volume clusters,
so the loss in accuracy is kinda moot. Your average for local reads/writes
will almost always be sub-millisecond, but you might end up having 500
millisecond requests or worse that the mean will hide.

Chris

On Thu, May 30, 2019 at 6:30 AM shalom sagges 
wrote:

> Thanks for your replies guys. I really appreciate it.
>
> @Alain, I use Graphite for backend on top of Grafana. But the goal is to
> move from Graphite to Prometheus eventually.
>
> I tried to find a direct way of getting a specific latency metric as an
> average, and as Chris pointed out, the Mean value isn't that accurate.
> I do not wish to use the percentile metrics either, but a single latency
> metric like the *"Local read latency"* output in nodetool tablestats.
> Looking at the code of nodetool tablestats, it seems that C* also divides
> *ReadTotalLatency.Count* by *ReadLatency.Count* to get the latency
> result.
>
> So I guess I will have no choice but to run the calculation on my own via
> Graphite:
>
> divideSeries(averageSeries(keepLastValue(nonNegativeDerivative($env.path.to.host.$host.org_apache_cassandra_metrics.Table.$ks.$cf.ReadTotalLatency.Count))),averageSeries(keepLastValue(nonNegativeDerivative($env.path.to.host.$host.org_apache_cassandra_metrics.Table.$ks.$cf.ReadLatency.Count
>
> Does this seem right to you?
>
> Thanks!
>
> On Thu, May 30, 2019 at 12:34 AM Paul Chandler  wrote:
>
>> There are various attributes under
>> org.apache.cassandra.metrics.ClientRequest.Latency.Read these measure the
>> latency in milliseconds
>>
>> Thanks
>>
>> Paul
>> www.redshots.com
>>
>> > On 29 May 2019, at 15:31, shalom sagges  wrote:
>> >
>> > Hi All,
>> >
>> > I'm creating a dashboard that should collect read/write latency metrics
>> on C* 3.x.
>> > In older versions (e.g. 2.0) I used to divide the total read latency in
>> microseconds with the read count.
>> >
>> > Is there a metric attribute that shows read/write latency without the
>> need to do the math, such as in nodetool tablestats "Local read latency"
>> output?
>> > I saw there's a Mean attribute in
>> org.apache.cassandra.metrics.ReadLatency but I'm not sure this is the right
>> one.
>> >
>> > I'd really appreciate your help on this one.
>> > Thanks!
>> >
>> >
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>>


Re: Collecting Latency Metrics

2019-05-30 Thread Chris Lohfink
>
> org.apache.cassandra.metrics.ClientRequest.Latency.Read these measure the
> latency in milliseconds
>

It's actually in microseconds, unless you call the values() operation, which
gives the histogram in nanoseconds.

On Wed, May 29, 2019 at 4:34 PM Paul Chandler  wrote:

> There are various attributes under
> org.apache.cassandra.metrics.ClientRequest.Latency.Read these measure the
> latency in milliseconds
>
> Thanks
>
> Paul
> www.redshots.com
>
> > On 29 May 2019, at 15:31, shalom sagges  wrote:
> >
> > Hi All,
> >
> > I'm creating a dashboard that should collect read/write latency metrics
> on C* 3.x.
> > In older versions (e.g. 2.0) I used to divide the total read latency in
> microseconds with the read count.
> >
> > Is there a metric attribute that shows read/write latency without the
> need to do the math, such as in nodetool tablestats "Local read latency"
> output?
> > I saw there's a Mean attribute in
> org.apache.cassandra.metrics.ReadLatency but I'm not sure this is the right
> one.
> >
> > I'd really appreciate your help on this one.
> > Thanks!
> >
> >
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Collecting Latency Metrics

2019-05-29 Thread Chris Lohfink
To answer your question,
org.apache.cassandra.metrics:type=Table,name=ReadTotalLatency can give you
the total local read latency in microseconds, and you can get the count from
the ReadLatency metric.

If you are going to do that, be sure to do it on the delta from the previous
query (new - last) for both the total latency and the count, or else you will
slowly converge to a global average that will almost never change as the
quantity of reads simply removes outliers. The Mean attribute of the
Latency metric you mentioned will actually give you an approximation of this,
as it's taking the total/count of a decaying histogram of the
latencies. It will however be even less accurate than using the deltas,
since the bounds of the decaying won't necessarily match up with your
reading intervals and the histogram introduces a worst-case 20% round up. Even
with deltas, though, this will hide outliers; you could end up with
really bad queries that don't even show up as a tick on your graph
(although *generally* it will).
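
A rough sketch of the delta approach over JMX (the keyspace=ks,scope=tbl parts
of the object names and the 60 second interval are placeholders, and error
handling is omitted):

import java.util.concurrent.TimeUnit;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class LocalReadLatencyDelta {
    public static void main(String[] args) throws Exception {
        JMXConnector jmxc = JMXConnectorFactory.connect(new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi"));
        MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
        ObjectName totalLatency = new ObjectName(
                "org.apache.cassandra.metrics:type=Table,keyspace=ks,scope=tbl,name=ReadTotalLatency");
        ObjectName readCount = new ObjectName(
                "org.apache.cassandra.metrics:type=Table,keyspace=ks,scope=tbl,name=ReadLatency");
        long lastTotal = ((Number) mbs.getAttribute(totalLatency, "Count")).longValue();
        long lastCount = ((Number) mbs.getAttribute(readCount, "Count")).longValue();
        while (true) {
            TimeUnit.SECONDS.sleep(60);
            long newTotal = ((Number) mbs.getAttribute(totalLatency, "Count")).longValue();
            long newCount = ((Number) mbs.getAttribute(readCount, "Count")).longValue();
            long reads = newCount - lastCount;
            // ReadTotalLatency is in microseconds, so this is the average micros per local
            // read over just this interval rather than since the node started
            double avgMicros = reads == 0 ? 0 : (double) (newTotal - lastTotal) / reads;
            System.out.printf("avg local read latency: %.1f us over %d reads%n", avgMicros, reads);
            lastTotal = newTotal;
            lastCount = newCount;
        }
    }
}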

Chris

On Wed, May 29, 2019 at 9:32 AM shalom sagges 
wrote:

> Hi All,
>
> I'm creating a dashboard that should collect read/write latency metrics on
> C* 3.x.
> In older versions (e.g. 2.0) I used to divide the total read latency in
> microseconds with the read count.
>
> Is there a metric attribute that shows read/write latency without the need
> to do the math, such as in nodetool tablestats "Local read latency" output?
> I saw there's a Mean attribute in org.apache.cassandra.metrics.ReadLatency
> but I'm not sure this is the right one.
>
> I'd really appreciate your help on this one.
> Thanks!
>
>
>


Re: Cassandra config in table

2019-02-25 Thread Chris Lohfink
In 4.0+ you can SELECT * FROM system_views.settings;
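
If you only want one setting, filtering on the name (the partition key) should
work too, e.g. SELECT * FROM system_views.settings WHERE name = 'concurrent_compactors';
(going from memory on the exact syntax).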

Chris

On Mon, Feb 25, 2019 at 9:22 AM Abdul Patel  wrote:

> Do we have any system table which stores all the config details which we have
> in the yaml or cassandra-env.sh?


Re: Cassandra collection tombstones

2019-01-25 Thread Chris Lohfink
>  The "estimated droppable tombstone" value is actually always wrong. Because 
> it's an estimate that does not consider overlaps (and I'm not sure about the 
> fact it considers the gc_grace_seconds either).

It considers the time the tombstone was created and the gc_grace_seconds; it 
doesn't matter if the tombstone is overlapped, it still needs to be kept for the 
gc_grace before purging or it can result in data resurrection. sstablemetadata 
cannot reliably or safely know the table parameters that are not kept in the 
sstable, so to get an accurate value you have to provide a -g or 
--gc-grace-seconds parameter. I am not sure where the "always wrong" comes in, 
as the quantity of data that's being shadowed is not what it's tracking (although 
it would be more meaningful for single-sstable compactions if it did), just 
when tombstones can be purged.
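
e.g. something like sstablemetadata --gc-grace-seconds 864000 /path/to/ks/tbl/*-Data.db, 
with 864000 replaced by the table's actual gc_grace_seconds (check the tool's help 
for the exact flag spelling in your version).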

Chris


> On Jan 25, 2019, at 8:11 AM, Alain RODRIGUEZ  wrote:
> 
> Hello, 
> 
> I think you might be inserting on the top of an existing collection, 
> implicitly, Cassandra creates a range tombstone. Cassandra does not 
> update/delete data, it always inserts (data or tombstone). Then eventually 
> compaction merges the data and evict the tombstones. Thus, when overwriting 
> an entire collection, Cassandra performs a delete first under the hood.
> 
> I wrote about this, in this post about 2 years ago, in the middle of this 
> (long) article: 
> http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html 
> 
> 
> Here is the part that might be of interest in your case:
> 
> "Note: When using collections, range tombstones will be generated by INSERT 
> and UPDATE operations every time you are using an entire collection, and not 
> updating parts of it. Inserting a collection over an existing collection, 
> rather than appending it or updating only an item in it, leads to range 
> tombstones insert followed by the insert of the new values for the 
> collection. This DELETE operation is hidden leading to some weird and 
> frustrating tombstones issues."
> 
> and
> 
> "From the mailing list I found out that James Ravn posted about this topic 
> using list example, but it is true for all the collections, so I won’t go 
> through more details, I just wanted to point this out as it can be 
> surprising, see: 
> http://www.jsravn.com/2015/05/13/cassandra-tombstones-collections.html#lists 
> "
> 
> Thus to specifically answer your questions:
> 
>  Does this tombstone ever get removed?
> 
> Yes, after gc_grace_seconds (table option) happened AND if the data that is 
> shadowed by the tombstone is also part of the same compaction (all the 
> previous shards need to be there if I remember correctly). So yes, but 
> eventually, not immediately nor any time soon (10+ days by default). 
>  
> Also when I run sstablemetadata on the only sstable, it shows "Estimated 
> droppable tombstones" as 0.5", Similarly it shows one record with epoch time 
> as insert time for - "Estimated tombstone drop times: 1548384720: 1". Does it 
> mean that when I do sstablemetadata on a table having collections, the 
> estimated droppable tombstone ratio and drop times values are not true and 
> dependable values due to collection/list range tombstones?
> 
> I do not remember this precisely but you can check the code, it's worth 
> having a look. The "estimated droppable tombstone" value is actually always 
> wrong. Because it's an estimate that does not consider overlaps (and I'm not 
> sure about the fact it considers the gc_grace_seconds either). But also 
> because calculation does not count a certain type of tombstones and the 
> weight of range tombstones compared to the tombstone cells makes the count 
> quite inaccurate: 
> http://thelastpickle.com/blog/2018/07/05/undetectable-tombstones-in-apache-cassandra.html
>  
> .
> 
> I think this evolved since I looked at it and might not remember well, but 
> this value is definitely not accurate. 
> 
> If you're re-inserting a collection for a given existing partition often, 
> there is probably plenty of tombstones sitting around though, that's almost 
> guaranteed.
> 
> Does tombstone_threshold of compaction depend on the sstablemetadata 
> threshold value? If so then for tables having collections, this is not a true 
> threshold right?
> 
> Yes, I believe the tombstone threshold actually uses the "estimated droppable 
> tombstone" value to chose to trigger or not a "single-SSTable"/"tombstone" 
> compaction. Yet, in your case, this will not clean the tombstones in the 
> first 10 days at least (gc_grace_seconds default value). Compactions do not 
> keep triggering because there is a minimum interval defined between 2 
> tombstones compactions of an SSTable (1 day by default). This 

Re: Compact storage removal effect

2019-01-22 Thread Chris Lohfink
In 3.x+ the on-disk format is the same with compact storage on or off, so you 
shouldn't expect much of a difference in table size with the new storage format, 
unlike compact vs non-compact in 2.x.

Chris

> On Jan 22, 2019, at 10:21 AM, Nitan Kainth  wrote:
> 
> hey Chris,
> 
> We upgraded form 3.0.4 to 3.11. yes, I did run upgradesstables -a to migrate 
> sstables. 
> Here is the table structure:
> 
> CREATE TABLE ks.cf1 (
> key text,
> column1 timestamp,
> value blob,
> PRIMARY KEY (key, column1)
> ) WITH COMPACT STORAGE
> 
> CREATE TABLE ks.cf2 (
> key bigint,
> column1 text,
> value blob,
> PRIMARY KEY (key, column1)
> ) WITH COMPACT STORAGE
> 
> CREATE TABLE ks.cf3 (
> key text,
> column1 timestamp,
> value int,
> PRIMARY KEY (key, column1)
> ) WITH COMPACT STORAGE  
> 
> On Tue, Jan 22, 2019 at 10:07 AM Chris Lohfink  
> wrote:
> What version are you running? Did you include an upgradesstables -a or 
> something to rebuild without the compact storage in your migration?
> 
> After 3.0 the new format can be more or less the same size as the 2.x compact 
> storage tables depending on schema (which can impact things a lot).
> 
> Chris
> 
> > On Jan 22, 2019, at 9:58 AM, Nitan Kainth  > <mailto:nitankai...@gmail.com>> wrote:
> > 
> > Hi,
> > 
> > We are testing to migrate off from compact storage. After removing compact 
> > storage, we were hoping to see an increase in disk usage but nothing 
> > changed. 
> > any feedback, why didn't we see an increase in storage?
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org 
> <mailto:user-unsubscr...@cassandra.apache.org>
> For additional commands, e-mail: user-h...@cassandra.apache.org 
> <mailto:user-h...@cassandra.apache.org>
> 



Re: Compact storage removal effect

2019-01-22 Thread Chris Lohfink
What version are you running? Did you include an upgradesstables -a or 
something to rebuild without the compact storage in your migration?

After 3.0 the new format can be more or less the same size as the 2.x compact 
storage tables depending on schema (which can impact things a lot).

Chris

> On Jan 22, 2019, at 9:58 AM, Nitan Kainth  wrote:
> 
> Hi,
> 
> We are testing to migrate off from compact storage. After removing compact 
> storage, we were hoping to see an increase in disk usage but nothing changed. 
> any feedback, why didn't we see an increase in storage?





Re: High CPU usage on some of the nodes due to message coalesce

2018-10-20 Thread Chris Lohfink
1s young gcs are horrible and likely the cause of some of your bad metrics. How 
large are your mutations/query results and what gc/heap settings are you using?

You can use https://github.com/aragozin/jvm-tools to see the threads generating 
allocation pressure and using the cpu (ttop) and what garbage is being created 
(hh --dead-young).

Just a shot in the dark, I would guess you have rather large mutations putting 
pressure on the commitlog and heap. G1 with a larger heap might help in that 
scenario to reduce fragmentation and adjust its eden and survivor regions to 
the allocation rate better (but give it a bigger reserve space), but there are 
limits to what can help if you can't change your workload. Without more info on 
schema etc it's hard to tell, but maybe that can give you some ideas on 
places to look. It could just as likely be repair coordination, wide partition 
reads, or compactions, so you need to look more at what within the app is causing 
the pressure to know whether it's possible to improve with settings or whether the load 
your application is producing exceeds what your cluster can handle (needs more 
nodes).

Chris

> On Oct 20, 2018, at 5:18 AM, onmstester onmstester 
>  wrote:
> 
> 3 nodes in my cluster have 100% cpu usage and most of it is used by 
> org.apache.cassandra.util.coalesceInternal and SepWorker.run.
> The most active threads are the messaging-service-incoming ones.
> Other nodes are normal. We have 30 nodes, using a rack-aware strategy with 10 
> racks each having 3 nodes. The problematic nodes are configured for one rack. 
> On normal write load, system.log reports too many hint messages dropped (cross 
> node). Also there are a lot of ParNew GCs of about 700-1000ms, and the commit log's 
> isolated disk is utilized about 80-90%. On startup of these 3 nodes, there 
> are a lot of "updating topology" logs (1000s of them pending). 
> Using iperf, i'm sure that network is OK
> checking NTPs and mutations on each node, load is balanced among the nodes.
> using apache cassandra 3.11.2
> I can not not figure out the root cause of the problem, although there are 
> some obvious symptoms.
> 
> Best Regards
> Sent using Zoho Mail 
> 
> 



Re: jmxterm "#NullPointerException: No such PID "

2018-09-20 Thread Chris Lohfink
For what it's worth, I highly recommend you remove that option in all
cassandra clusters first thing. A possibly non-existent improvement (i.e.
/tmp on a different low-throughput drive) vs being able to diagnose issues is
a no-brainer. You can measure or monitor gc logs for your safepoint pauses
to see if it's ever a significant portion of your GC pauses.

On Thu, Sep 20, 2018 at 6:05 AM Philip Ó Condúin 
wrote:

> Thank you Yuki, this explains it.
> I am used to working on C* 2.1 in production where this JVM flag is not
> enabled.
>
>
> On Wed, 19 Sep 2018 at 00:29, Yuki Morishita  wrote:
>
>> This is because Cassandra sets -XX:+PerfDisableSharedMem JVM option by
>> default.
>> This prevents tools such as jps to list jvm processes.
>> See https://issues.apache.org/jira/browse/CASSANDRA-9242 for detail.
>>
>> You can work around by doing what Riccardo said.
>> On Tue, Sep 18, 2018 at 9:41 PM Philip Ó Condúin
>>  wrote:
>> >
>> > Hi Riccardo,
>> >
>> > Yes that works for me:
>> >
>> > Welcome to JMX terminal. Type "help" for available commands.
>> > $> open localhost:7199
>> > #Connection to localhost:7199 is opened
>> > $>domains
>> > #following domains are available
>> > JMImplementation
>> > ch.qos.logback.classic
>> > com.sun.management
>> > java.lang
>> > java.nio
>> > java.util.logging
>> > org.apache.cassandra.db
>> > org.apache.cassandra.hints
>> > org.apache.cassandra.internal
>> > org.apache.cassandra.metrics
>> > org.apache.cassandra.net
>> > org.apache.cassandra.request
>> > org.apache.cassandra.service
>> > $>
>> >
>> > I can work with this :-)
>> >
>> > Not sure why the JVM is not listed when issuing the JVMS command, maybe
>> its a server setting, our production servers find the Cass JVM.  I've spent
>> half the day trying to figure it out so I think I'll just put it to bed now
>> and work on something else.
>> >
>> > Regards,
>> > Phil
>> >
>> > On Tue, 18 Sep 2018 at 13:34, Riccardo Ferrari 
>> wrote:
>> >>
>> >> Hi Philip,
>> >>
>> >> I've used jmxterm myself without any problems particular problems. On
>> my systems too, I don't get the cassandra daemon listed when issuing the
>> `jvms` command but I never spent much time investigating it.
>> >> Assuming you have not changed anything relevant in the
>> cassandra-env.sh you can connect using jmxterm by issuing 'open
>> 127.0.0.1:7199'. Would that work for you?
>> >>
>> >> HTH,
>> >>
>> >>
>> >>
>> >> On Tue, Sep 18, 2018 at 2:00 PM, Philip Ó Condúin <
>> philipocond...@gmail.com> wrote:
>> >>>
>> >>> Further info:
>> >>>
>> >>> I would expect to see the following when I list the jvm's:
>> >>>
>> >>> Welcome to JMX terminal. Type "help" for available commands.
>> >>> $>jvms
>> >>> 25815(m) - org.apache.cassandra.service.CassandraDaemon
>> >>> 17628( ) - jmxterm-1.0-alpha-4-uber.jar
>> >>>
>> >>> But jmxtem is not picking up the JVM for Cassandra for some reason.
>> >>>
>> >>> Can someone point me in the right direction?  Is there settings in
>> the cassandra-env.sh file I need to amend to get jmxterm to find the cass
>> jvm?
>> >>>
>> >>> Im not finding much about it on google.
>> >>>
>> >>> Thanks,
>> >>> Phil
>> >>>
>> >>>
>> >>> On Tue, 18 Sep 2018 at 12:09, Philip Ó Condúin <
>> philipocond...@gmail.com> wrote:
>> 
>>  Hi All,
>> 
>>  I need a little advice.  I'm trying to access the JMX terminal using
>> jmxterm-1.0-alpha-4-uber.jar with a very simple default install of C* 3.11.3
>> 
>>  I keep getting the following:
>> 
>>  [cassandra@reaper-1 conf]$ java -jar jmxterm-1.0-alpha-4-uber.jar
>>  Welcome to JMX terminal. Type "help" for available commands.
>>  $>open 1666
>>  #NullPointerException: No such PID 1666
>>  $>
>> 
>>  C* is running with a PID of 1666.  I've tried setting JMX_LOCAL=no
>> and have even created a new VM to test it.
>> 
>>  Does anyone know what I might be doing wrong here?
>> 
>>  Kind Regards,
>>  Phil
>> 
>> >>>
>> >>>
>> >>> --
>> >>> Regards,
>> >>> Phil
>> >>
>> >>
>> >
>> >
>> > --
>> > Regards,
>> > Phil
>>
>
>
> --
> Regards,
> Phil
>


Re: Setting up rerouting java/python driver read requests from unresponsive nodes to good ones

2018-08-15 Thread Chris Lohfink
That’s what the retry handler does (see Horia’s response). You can also use 
speculative executions to send requests to multiple coordinators a little 
earlier, to reduce the impact of slow requests (e.g. a GC pause). 
https://docs.datastax.com/en/developer/java-driver/3.1/manual/speculative_execution/
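
A minimal sketch with the 3.x Java driver (the 200ms delay and 2 extra
executions are just example numbers, not a recommendation):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.policies.ConstantSpeculativeExecutionPolicy;

public class SpeculativeReadExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                // if the first coordinator hasn't answered within 200ms, send the same
                // request to up to 2 more coordinators and take whichever answers first
                .withSpeculativeExecutionPolicy(new ConstantSpeculativeExecutionPolicy(200, 2))
                .build();
        Session session = cluster.connect();

        Statement read = new SimpleStatement("SELECT value FROM ks.tbl WHERE id = ?", 1)
                .setIdempotent(true); // the driver only speculates on idempotent statements
        session.execute(read);
        cluster.close();
    }
}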

Chris

Sent from my iPhone

> On Aug 15, 2018, at 6:57 AM, Horia Mocioi  wrote:
> 
> Hello,
> 
> I believe that this is what you are looking for - 
> https://docs.datastax.com/en/developer/java-driver/3.5/manual/retries/
> 
> In particular, tryNextHost().
> 
> Regards,
> Horia
> 
>> On ons, 2018-08-15 at 14:16 +0300, Vsevolod Filaretov wrote:
>> Hello Cassandra community!
>> 
>> Unfortunately, I cannot find the corresponding info via load balancing 
>> manuals, so the question is:
>> 
>> Is it possible to set up java/python cassandra driver to redirect 
>> unsuccessful read requests from the coordinator node, which came to be 
>> unresponsive during the session, to the up and running one (dynamically 
>> switch to other coordinator node from the dead one)?
>> 
>> If the answer is no, what could be my alternatives?
>> 
>> Thank you all in advance,
>> Vsevolod Filaretov.


Re: Cassandra Compaction Metrics - CompletedTasks vs TotalCompactionCompleted

2018-08-10 Thread Chris Lohfink
If it's occurring that often, you can monitor nodetool compactionstats to see 
what's running.

> On Aug 10, 2018, at 11:35 AM, Dionne Cloudoupoulos  
> wrote:
> 
> On 2017/10/31 16:56:29, Chris Lohfink wrote:
>> The "CompletedTasks" metric is a measure of how many tasks ran on these two
>> executors combined.
>> The "TotalCompactionsCompleted" metric is a measure of how many compactions
>> issued from the compaction manager ran (normal compactions, cache writes,
>> scrub, 2i and MVs).  So while they may be close, depending on whats
>> happening on the system, theres no assurance that they will be within any
>> bounds of each other.
> 
>all this is very interesting, but I do not understand why
> CompletedTasks grows at the rate of five thousand operations per hour in
> my cloud. Have an idea where can I look? kalo dromo
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
> 



Re: concurrent_compactors via JMX

2018-07-18 Thread Chris Lohfink
Refer to Alain's email, but to strictly answer the question of increasing 
concurrent_compactors via JMX:

There are two attributes you can increase that set the maximum number of 
concurrent compactions.

org.apache.cassandra.db:type=CompactionManager,name=MaximumCompactorThreads -> 6
org.apache.cassandra.db:type=CompactionManager,name=CoreCompactorThreads -> 6

would set it to 6. To decrease them you will want to go in the opposite order (core 
then max). Just increasing the number of concurrent compactors doesn't mean that 
all of them will be utilized though.
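
With jmxterm, for example, that would be roughly: open localhost:7199, then 
bean org.apache.cassandra.db:type=CompactionManager, then set MaximumCompactorThreads 6 
followed by set CoreCompactorThreads 6 (jmxterm command names from memory, 
check its help output).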

Chris

> On Jul 17, 2018, at 12:18 PM, Alain RODRIGUEZ  wrote:
> 
> Hello Riccardo,
> 
> I noticed I have been writing a novel to answer a simple couple of questions 
> again ¯\_(ツ)_/¯. So here is a short answer in the case that's what you were 
> looking for :). Also, there is a warning that it might be counter-productive 
> and stress the cluster even more to increase the compaction throughput. There 
> is more information below ('about the issue').
> 
> tl;dr: 
> 
> What about using 'nodetool setcompactionthroughput XX' instead. It should 
> available there.
> 
> In the same way 'nodetool getcompactionthroughput' gives you the current 
> value. Be aware that this change done through JMX/nodetool is not permanent. 
> You still need to update the cassandra.yaml file.
> 
> If you really want to use the MBean through JMX, because using 'nodetool' is 
> too easy (or for any other reason :p):
> 
> Mbean: org.apache.cassandra.service.StorageServiceMBean
> Attribute: CompactionThroughputMbPerSec
> 
> Long story with the "how to" since I went through this search myself, I did 
> not know where this MBean was.
> 
> Can someone point me to the right mbean? 
> I can not really find good docs about mbeans (or tools ...) 
> 
> I am not sure about the doc, but you can use jmxterm 
> (http://wiki.cyclopsgroup.org/jmxterm/download.html 
> ).
> 
> To replace the doc I use CCM (https://github.com/riptano/ccm 
> ) + jconsole to find the mbeans locally:
> 
> * Add loopback addresses for ccm (see the readme file)
> * then, create the cluster: * 'ccm create Cassandra-3-0-6 -v 3.0.6 -n 3 -s'
> * Start jconsole using the right pid: 'jconsole $(ccm node1 show | grep pid | 
> cut -d "=" -f 2)'
> * Explore MBeans, try to guess where this could be (and discover other funny 
> stuff in there :)).
> 
> I must admit I did not find it this way using C*3.0.6 and jconsole. 
> I looked at the code, I locally used C*3.0.6 and ran 'grep -RiI 
> CompactionThroughput' with this result: 
> https://gist.github.com/arodrime/f9591e4bdd2b1367a496447cdd959006 
> 
> 
> With this I could find the right MBean, the only code documentation that is 
> always up to date is the code itself I am afraid:
> 
> './src/java/org/apache/cassandra/service/StorageServiceMBean.java:public 
> void setCompactionThroughputMbPerSec(int value);' 
> 
> Note that the research in the code also leads to nodetool ;-).
> 
> I could finally find the MBean in the 'jconsole' too: 
> https://cdn.pbrd.co/images/HuUya3x.png 
>  (not sure how long this link will 
> live).
> 
> jconsole also allows you to see what attributes it is possible to set or not.
> 
> You can now find any other MBean you would need I hope :).
> 
> 
> see if it helps when the system is under stress
> 
> About the issue
> 
> You don't exactly say what you are observing, what is that "stress"? How is 
> it impacting the cluster?
> 
> I ask because I am afraid this change might not help and even be 
> counter-productive. Even though having SSTables nicely compacted make a huge 
> difference at the read time, if that's already the case for you and the data 
> is already nicely compacted, doing this change won't help. It might even make 
> things slightly worse if the current bottleneck is the disk IO during a 
> stress period as the compactors would increase their disk read throughput, 
> thus maybe fight with the read requests for disk throughput.
> 
> If you have a similar number of sstables on all nodes, not many compactions 
> pending (nodetool netstats -H) and read operations are hitting a small number 
> sstables (nodetool tablehistogram) then you probably don't need to increase 
> the compaction speed.
> 
> Let's say that the compaction throughput is not often the cause of stress 
> during peak hours nor a direct way to make things 'faster'. Generally when 
> compaction goes wrong, the number of sstables goes through the roof. If you 
> have a chart showing the number sstables, you can see this really well.
> 
> Of course, if you feel you are in this case, increasing the compaction 
> throughput will definitely help if the cluster also has spared disk 
> throughput.
> 
> To check what's wrong, if you believe it's something different, here are some 
> useful 

Re: Compaction process stuck

2018-07-05 Thread Chris Lohfink
That looks to me a bit like it isn't stuck but is just a long-running compaction. 
Can you include the output of `nodetool compactionstats` and `nodetool 
cfstats` with the schema for the table that's being compacted (redact names if 
necessary)?

You can stop the compaction with `nodetool stop COMPACTION` or by restarting the node.

Chris

> On Jul 5, 2018, at 12:08 AM, atul atri  wrote:
> 
> Hi,
> 
> We noticed that compaction process is also hanging on a node in backup ring. 
> Please find attached thread dump for both servers. Recently, we have made few 
> changes in cluster topology.
> 
> a. Added new server in backup data-center and decommissioned old server. 
> Backup ring only has 2 server.
> b. Added new node in primary data-center. Now it has 4 nods.
> 
> Is there way we can stop this compaction? As we have added a new node in this 
> cluster and we are waiting to run cleanup on this node on which compaction is 
> hanging. I am afraid that cleanup will not start until compaction job 
> finishes. 
> 
> Attachments:
> 1. cass-logg02.prod2.thread_dump.out: Thread dump from old node in primary 
> datacenter
> 2. cass-logg03.prod1.thread_dump.out: Thread dump from new node in backup 
> datacenter. This node is added recently.
> 
> Your help is much appreciated. 
> 
> Thanks & Regards,
> Atul Atri.
> 
> 
> On 4 July 2018 at 21:15, atul atri  <mailto:atulatri2...@gmail.com>> wrote:
> Hi Chris,
> Thanks for reply.
> 
> Unfortunately, our servers do not have jstack installed. 
> I tried "kill -3 " option but that is also not generating thread dump. 
> 
> Is there any other way I can generate thread dump?
> 
> Thanks & Regards,
> Atul Atri.
> 
> On 4 July 2018 at 20:32, Chris Lohfink  <mailto:clohf...@apple.com>> wrote:
> Can you take a thread dump (jstack) and share the state of the compaction 
> threads? Also check for “Exception” in logs
> 
> Chris
> 
> Sent from my iPhone
> 
> On Jul 4, 2018, at 8:37 AM, atul atri  <mailto:atulatri2...@gmail.com>> wrote:
> 
>> Hi,
>> 
>> On one of our server, compaction process is hanging. It's stuck at 80%. It 
>> was stuck for last 3 days. And today we did a cluster restart (one host at 
>> time). And again it is stuck at same 80%. CPU usages are 100% and there 
>> seems no IO issue. We are seeing following kinds of WARNING in system.log
>> 
>> BatchStatement.java (line 226) Batch of prepared statements for [, 
>> *] is of size 7557, exceeding specified threshold of 5120 by 2437.
>> 
>> 
>> Other than this there seems no error.  I have tried to stop compaction 
>> process, but it does not stop. Cassandra version is 2.1.
>> 
>>  Can someone please guide us in solving this issue?
>> 
>> Thanks & Regards,
>> Atul Atri.
> 
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Compaction process stuck

2018-07-04 Thread Chris Lohfink
Can you take a thread dump (jstack) and share the state of the compaction 
threads? Also check for “Exception” in logs

Chris

Sent from my iPhone

> On Jul 4, 2018, at 8:37 AM, atul atri  wrote:
> 
> Hi,
> 
> On one of our server, compaction process is hanging. It's stuck at 80%. It 
> was stuck for last 3 days. And today we did a cluster restart (one host at 
> time). And again it is stuck at same 80%. CPU usages are 100% and there seems 
> no IO issue. We are seeing following kinds of WARNING in system.log
> 
> BatchStatement.java (line 226) Batch of prepared statements for [, *] 
> is of size 7557, exceeding specified threshold of 5120 by 2437.
> 
> 
> Other than this there seems no error.  I have tried to stop compaction 
> process, but it does not stop. Cassandra version is 2.1.
> 
>  Can someone please guide us in solving this issue?
> 
> Thanks & Regards,
> Atul Atri.


Re: G1GC CPU Spike

2018-06-15 Thread Chris Lohfink
There are no bad GCs in the gclog (the worst is around 100ms). Everything looks great 
actually from what I see. CPU utilization isn't inherently a bad thing, for what 
it's worth.

Chris

> On Jun 14, 2018, at 1:18 PM, rajpal reddy  wrote:
> 
> Hey Chris,
> 
> Sorry to bother you. Did you get a chance to look at the gclog file I sent 
> last night.
> 
> On Wed, Jun 13, 2018, 8:44 PM rajpal reddy  <mailto:rajpalreddy...@gmail.com>> wrote:
> Chris,
> 
> sorry attached wrong log file. attaching gc collection seconds and cpu. there 
> were going high at the same time and also attached the gc.log. grafana 
> dashboard and gc.log timing are 4hours apart gc can be see 06/12th around 
> 22:50
> 
> rate(jvm_gc_collection_seconds_sum{"}[5m])
> 
> > On Jun 13, 2018, at 5:26 PM, Chris Lohfink  > <mailto:clohf...@apple.com>> wrote:
> > 
> > There are not even a 100ms GC pause in that, are you certain theres a 
> > problem?
> > 
> >> On Jun 13, 2018, at 3:00 PM, rajpal reddy  >> <mailto:rajpalreddy...@gmail.com>> wrote:
> >> 
> >> Thanks Chris I did attached the gc logs already. reattaching them 
> >> now.
> >> 
> >> it started yesterday around 11:54PM 
> >>> On Jun 13, 2018, at 3:56 PM, Chris Lohfink  >>> <mailto:clohf...@apple.com>> wrote:
> >>> 
> >>>> What is the criteria for picking up the value for G1ReservePercent?
> >>> 
> >>> 
> >>> it depends on the object allocation rate vs the size of the heap. 
> >>> Cassandra ideally would be sub 500-600mb/s allocations but it can spike 
> >>> pretty high with something like reading a wide partition or repair 
> >>> streaming which might exceed what the g1 ygcs tenuring and timing is 
> >>> prepared for from previous steady rate. Giving it a bigger buffer is a 
> >>> nice safety net for allocation spikes.
> >>> 
> >>>> is the HEAP_NEWSIZE is required only for CMS
> >>> 
> >>> 
> >>> it should only set Xmn with that if using CMS, with G1 it should be 
> >>> ignored or else yes it would be bad to set Xmn. Giving the gc logs will 
> >>> give the results of all the bash scripts along with details of whats 
> >>> happening so its your best option if you want help to share that.
> >>> 
> >>> Chris
> >>> 
> >>>> On Jun 13, 2018, at 12:17 PM, Subroto Barua 
> >>>>  wrote:
> >>>> 
> >>>> Chris,
> >>>> What is the criteria for picking up the value for G1ReservePercent?
> >>>> 
> >>>> Subroto 
> >>>> 
> >>>>> On Jun 13, 2018, at 6:52 AM, Chris Lohfink  >>>>> <mailto:clohf...@apple.com>> wrote:
> >>>>> 
> >>>>> G1ReservePercent
> >>>> 
> >>>> -
> >>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org 
> >>>> <mailto:user-unsubscr...@cassandra.apache.org>
> >>>> For additional commands, e-mail: user-h...@cassandra.apache.org 
> >>>> <mailto:user-h...@cassandra.apache.org>
> >>>> 
> >>> 
> >>> 
> >>> -
> >>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org 
> >>> <mailto:user-unsubscr...@cassandra.apache.org>
> >>> For additional commands, e-mail: user-h...@cassandra.apache.org 
> >>> <mailto:user-h...@cassandra.apache.org>
> >>> 
> >> 
> >> 
> >> 
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org 
> >> <mailto:user-unsubscr...@cassandra.apache.org>
> >> For additional commands, e-mail: user-h...@cassandra.apache.org 
> >> <mailto:user-h...@cassandra.apache.org>
> > 
> > 
> > -
> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org 
> > <mailto:user-unsubscr...@cassandra.apache.org>
> > For additional commands, e-mail: user-h...@cassandra.apache.org 
> > <mailto:user-h...@cassandra.apache.org>
> > 
> 



Re: G1GC CPU Spike

2018-06-13 Thread Chris Lohfink
There isn't even a 100ms GC pause in that; are you certain there's a problem?

> On Jun 13, 2018, at 3:00 PM, rajpal reddy  wrote:
> 
> Thanks Chris I did attached the gc logs already. reattaching them 
> now.
> 
> it started yesterday around 11:54PM 
>> On Jun 13, 2018, at 3:56 PM, Chris Lohfink  wrote:
>> 
>>> What is the criteria for picking up the value for G1ReservePercent?
>> 
>> 
>> it depends on the object allocation rate vs the size of the heap. Cassandra 
>> ideally would be sub 500-600mb/s allocations but it can spike pretty high 
>> with something like reading a wide partition or repair streaming which might 
>> exceed what the g1 ygcs tenuring and timing is prepared for from previous 
>> steady rate. Giving it a bigger buffer is a nice safety net for allocation 
>> spikes.
>> 
>>> is the HEAP_NEWSIZE is required only for CMS
>> 
>> 
>> it should only set Xmn with that if using CMS, with G1 it should be ignored 
>> or else yes it would be bad to set Xmn. Giving the gc logs will give the 
>> results of all the bash scripts along with details of whats happening so its 
>> your best option if you want help to share that.
>> 
>> Chris
>> 
>>> On Jun 13, 2018, at 12:17 PM, Subroto Barua  
>>> wrote:
>>> 
>>> Chris,
>>> What is the criteria for picking up the value for G1ReservePercent?
>>> 
>>> Subroto 
>>> 
>>>> On Jun 13, 2018, at 6:52 AM, Chris Lohfink  wrote:
>>>> 
>>>> G1ReservePercent
>>> 
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>> 
>> 
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>> 
> 
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org





Re: G1GC CPU Spike

2018-06-13 Thread Chris Lohfink
> What is the criteria for picking up the value for G1ReservePercent?


it depends on the object allocation rate vs the size of the heap. Cassandra 
ideally would be sub 500-600mb/s allocations, but it can spike pretty high with 
something like reading a wide partition or repair streaming, which might exceed 
what the G1 young gcs' tenuring and timing are prepared for from the previous steady rate. 
Giving it a bigger buffer is a nice safety net for allocation spikes.

> is the HEAP_NEWSIZE required only for CMS


it should only set Xmn with that if using CMS; with G1 it should be ignored, 
or else yes, it would be bad to set Xmn. The gc logs will give the results 
of all the bash scripts along with details of what's happening, so they are your 
best option to share if you want help.

Chris

> On Jun 13, 2018, at 12:17 PM, Subroto Barua  
> wrote:
> 
> Chris,
> What is the criteria for picking up the value for G1ReservePercent?
> 
> Subroto 
> 
>> On Jun 13, 2018, at 6:52 AM, Chris Lohfink  wrote:
>> 
>> G1ReservePercent
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
> 





Re: G1GC CPU Spike

2018-06-13 Thread Chris Lohfink
That metric is the total number of seconds spent in GC; it will increase over 
time with every young gc, which is expected. What's interesting is the rate of 
growth, not the fact that it's increasing. If your graphing tool has an option to 
graph the derivative, you should use that instead.

Chris

> On Jun 13, 2018, at 9:51 AM, rajpal reddy  wrote:
> 
> jvm_gc_collection_seconds_count{gc="G1 Young Generation”} and also young 
> generation seconds count keep increasing
> 
> 
> 
>> On Jun 13, 2018, at 9:52 AM, Chris Lohfink > <mailto:clohf...@apple.com>> wrote:
>> 
>> The gc log file is best to share when asking for help with tuning. The top 
>> of file has all the computed args it ran with and it gives details on what 
>> part of the GC is taking time. I would guess the CPU spike is from full GCs 
>> which with that small heap of a heap is probably from evacuation failures. 
>> Reserving more of the heap to be free (-XX:G1ReservePercent=25) can help, 
>> along with increasing the amount of heap. 8GB is pretty small for G1, might 
>> be better off with CMS.
>> 
>> Chris
>> 
>>> On Jun 13, 2018, at 8:42 AM, rajpal reddy >> <mailto:rajpalreddy...@gmail.com>> wrote:
>>> 
>>> Hello,
>>> 
>>> we are using G1GC and noticing garbage collection taking a while and during 
>>> that process we are seeing cpu spiking up to 70-80%. can you please let us 
>>> know. if we have to tune any parameters for that. attaching the 
>>> cassandra-env file with jam-options.
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org 
>>> <mailto:user-unsubscr...@cassandra.apache.org>
>>> For additional commands, e-mail: user-h...@cassandra.apache.org 
>>> <mailto:user-h...@cassandra.apache.org>
>> 
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org 
>> <mailto:user-unsubscr...@cassandra.apache.org>
>> For additional commands, e-mail: user-h...@cassandra.apache.org 
>> <mailto:user-h...@cassandra.apache.org>
>> 
> 



Re: G1GC CPU Spike

2018-06-13 Thread Chris Lohfink
The gc log file is best to share when asking for help with tuning. The top of the 
file has all the computed args it ran with, and it gives details on what part of 
the GC is taking time. I would guess the CPU spike is from full GCs, which with 
that small of a heap are probably from evacuation failures. Reserving more 
of the heap to be free (-XX:G1ReservePercent=25) can help, along with 
increasing the amount of heap. 8GB is pretty small for G1; you might be better off 
with CMS.

Chris

> On Jun 13, 2018, at 8:42 AM, rajpal reddy  wrote:
> 
> Hello,
> 
> we are using G1GC and noticing garbage collection taking a while, and during 
> that process we are seeing cpu spiking up to 70-80%. Can you please let us 
> know if we have to tune any parameters for that? Attaching the cassandra-env 
> file with the jvm options.
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org





Re: nodetool (2.1.18) - Xmx, ParallelGCThreads, High CPU usage

2018-05-29 Thread Chris Lohfink
Might be better to disable explicit gcs so the full gcs don’t even occur. They’re 
likely from the RMI DGC or DirectByteBuffers rather than any actual need to do gcs, 
or the concurrent gc threads would be an issue as well.

Nodetool also has no excuse to use that big of a heap, so it should have its max 
size capped too (along with the parallel and concurrent gc thread counts).
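
Concretely that would be something like adding -Xmx128m -XX:ParallelGCThreads=1 
-XX:ConcGCThreads=1 -XX:+DisableExplicitGC to the JVM options in the nodetool 
startup script (that exact flag set is just a suggestion, tune to taste).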

Chris

Sent from my iPhone

> On May 29, 2018, at 4:42 PM, kurt greaves  wrote:
> 
> Good to know. So that confirms it's just the GC threads causing problems.
> 
>> On Tue., 29 May 2018, 22:02 Steinmaurer, Thomas, 
>>  wrote:
>> Kurt,
>> 
>>  
>> 
>> in our test it also didn’t made a difference with the default number of GC 
>> Threads (43 on our large machine) and running with Xmx128M or XmX31G 
>> (derived from $MAX_HEAP_SIZE). For both Xmx, we saw the high CPU caused by 
>> nodetool.
>> 
>>  
>> 
>> Regards,
>> 
>> Thomas
>> 
>>  
>> 
>> From: kurt greaves [mailto:k...@instaclustr.com] 
>> Sent: Dienstag, 29. Mai 2018 13:06
>> To: User 
>> Subject: Re: nodetool (2.1.18) - Xmx, ParallelGCThreads, High CPU usage
>> 
>>  
>> 
>> Thanks Thomas. After a bit more research today I found that the whole 
>> $MAX_HEAP_SIZE issue isn't really a problem because we don't explicitly set 
>> -Xms so the minimum heapsize by default will be 256mb, which isn't hugely 
>> problematic, and it's unlikely more than that would get allocated.
>> 
>>  
>> 
>> On 29 May 2018 at 09:29, Steinmaurer, Thomas 
>>  wrote:
>> 
>> Hi Kurt,
>> 
>>  
>> 
>> thanks for pointing me to the Xmx issue.
>> 
>>  
>> 
>> JIRA + patch (for Linux only based on C* 3.11) for the parallel GC thread 
>> issue is available here: 
>> https://issues.apache.org/jira/browse/CASSANDRA-14475
>> 
>>  
>> 
>> Thanks,
>> 
>> Thomas
>> 
>>  
>> 
>> From: kurt greaves [mailto:k...@instaclustr.com] 
>> Sent: Dienstag, 29. Mai 2018 05:54
>> To: User 
>> Subject: Re: nodetool (2.1.18) - Xmx, ParallelGCThreads, High CPU usage
>> 
>>  
>> 
>> 1) nodetool is reusing the $MAX_HEAP_SIZE environment variable, thus if we 
>> are running Cassandra with e.g. Xmx31G, nodetool is started with Xmx31G as 
>> well
>> 
>> This was fixed in 3.0.11/3.10 in CASSANDRA-12739. Not sure why it didn't 
>> make it into 2.1/2.2.
>> 
>> 2) As -XX:ParallelGCThreads is not explicitly set upon startup, this 
>> basically defaults to a value dependent on the number of cores. In our case, 
>> with the machine above, the number of parallel GC threads for the JVM is set 
>> to 43!
>> 3) Test-wise, we have adapted the nodetool startup script in a way to get a 
>> Java Flight Recording file on JVM exit, thus with each nodetool invocation 
>> we can inspect a JFR file. Here we may have seen System.gc() calls (without 
>> visible knowledge where they come from), GC times for the entire JVM 
>> life-time (e.g. ~1min) showing high cpu. This happened for both Xmx128M 
>> (default as it seems) and Xmx31G
>>  
>> After explicitly setting -XX:ParallelGCThreads=1 in the nodetool startup 
>> script, CPU usage spikes by nodetool are entirely gone.
>>  
>> Is this something which has been already adapted/tackled in Cassandra 
>> versions > 2.1 or worth to be considered as some sort of RFC?
>> 
>> Can you create a JIRA for this (and a patch, if you like)? We should be 
>> explicitly setting this on nodetool invocations.
>> 
>> ​
>> 
>> The contents of this e-mail are intended for the named addressee only. It 
>> contains information that may be confidential. Unless you are the named 
>> addressee or an authorized designee, you may not copy or use it, or disclose 
>> it to anyone else. If you received it in error please notify us immediately 
>> and then destroy it. Dynatrace Austria GmbH (registration number FN 91482h) 
>> is a company registered in Linz whose registered office is at 4040 Linz, 
>> Austria, Freistädterstraße 313
>> 
>>  
>> 


Re: tablestats and gossip

2018-04-06 Thread Chris Lohfink
Yes, it's the count of all locally applied writes to that table. An insert into a 
table with RF=3 should increase the local write count by 1 on 3 different 
nodes.

Chris

> On Apr 6, 2018, at 5:00 AM, Grzegorz Pietrusza  wrote:
> 
> Hi all
> 
> Does local write count provided by tablestats include writes from gossip?





Re: Understanding Blocked and All Time Blocked columns in tpstats

2018-03-23 Thread Chris Lohfink
Increasing the queue would increase the number of requests waiting. It could make 
GCs worse if the requests are large INSERTs, but for lots of super tiny 
queries it helps to increase the queue size (to a point). You might want to look into 
what queries are being made and how, since there are options that can help 
with that (i.e. prepared queries, reviewing what the queries do, and limiting the 
number of async in-flight queries).
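
For the last point, here is a minimal sketch of capping async in-flight requests
with the DataStax Java driver (3.x API assumed; the keyspace/table in the INSERT
and the limit of 128 are made-up values for illustration, not anything from this
thread):

    import com.datastax.driver.core.*;
    import java.util.concurrent.Semaphore;

    public class BoundedAsyncWriter
    {
        // Hypothetical limit; tune to what the cluster's native transport queue can absorb.
        private static final int MAX_IN_FLIGHT = 128;
        private final Semaphore permits = new Semaphore(MAX_IN_FLIGHT);
        private final Session session;
        private final PreparedStatement insert;

        public BoundedAsyncWriter(Session session)
        {
            this.session = session;
            // Prepared once and reused for every write (avoids re-parsing the query).
            this.insert = session.prepare("INSERT INTO ks.tbl (id, value) VALUES (?, ?)");
        }

        public void write(String id, String value) throws InterruptedException
        {
            permits.acquire(); // blocks the producer instead of flooding the coordinator
            ResultSetFuture f = session.executeAsync(insert.bind(id, value));
            // Release the permit when the request completes (success or failure).
            f.addListener(permits::release, Runnable::run);
        }
    }

The effect is the same kind of back pressure on the client side: instead of
queueing thousands of requests on the server, the producer thread waits once the
in-flight budget is used up.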

Chris

> On Mar 23, 2018, at 11:42 AM, John Sanda <john.sa...@gmail.com> wrote:
> 
> Thanks for the explanation. In the past when I have run into problems related 
> to CASSANDRA-11363, I have increased the queue size via the 
> cassandra.max_queued_native_transport_requests system property. If I find 
> that the queue is frequently at capacity, would that be an indicator that the 
> node is having trouble keeping up with the load? And if so, will increasing 
> the queue size just exacerbate the problem?
> 
> On Fri, Mar 23, 2018 at 11:51 AM, Chris Lohfink <clohf...@apple.com 
> <mailto:clohf...@apple.com>> wrote:
> It blocks the caller attempting to add the task until there's room in the queue, 
> applying back pressure. It does not reject it. It mimics the behavior of the 
> pre-SEP DebuggableThreadPoolExecutor's RejectedExecutionHandler that the 
> other thread pools use (except for sampling/trace, which just throw away 
> rejected tasks).
> 
> Worth noting, this is only really possible in the native transport pool (SEP 
> pool) last I checked (since 2.1 at least; before that there were a few 
> others). That changes from version to version. For (basically) all other thread 
> pools the queue is limited only by memory.
> 
> Chris
> 
> 
>> On Mar 22, 2018, at 10:44 PM, John Sanda <john.sa...@gmail.com 
>> <mailto:john.sa...@gmail.com>> wrote:
>> 
>> I have been doing some work on a cluster that is impacted by 
>> https://issues.apache.org/jira/browse/CASSANDRA-11363 
>> <https://issues.apache.org/jira/browse/CASSANDRA-11363>. Reading through the 
>> ticket prompted me to take a closer look at 
>> org.apache.cassandra.concurrent.SEPExecutor. I am looking at the 3.0.14 
>> code. I am a little confused about the Blocked and All Time Blocked columns 
>> reported in nodetool tpstats and reported by StatusLogger. I understand that 
>> there is a queue for tasks. In the case of RequestThreadPoolExecutor, the 
>> size of that queue can be controlled via the 
>> cassandra.max_queued_native_transport_requests system property.
>> 
>> I have been looking at SEPExecutor.addTask(FutureTask task), and here is 
>> my question. If the queue is full, as defined by SEPExecutor.maxTasksQueued, 
>> are tasks rejected? I do not fully grok the code, but it looks like it is 
>> possible for tasks to be rejected here (some code and comments omitted for 
>> brevity):
>> 
>> public void addTask(FutureTask<?> task)
>> {
>>     tasks.add(task);
>>     ...
>>     else if (taskPermits >= maxTasksQueued)
>>     {
>>         WaitQueue.Signal s = hasRoom.register();
>> 
>>         if (taskPermits(permits.get()) > maxTasksQueued)
>>         {
>>             if (takeWorkPermit(true))
>>                 pool.schedule(new Work(this));
>> 
>>             metrics.totalBlocked.inc();
>>             metrics.currentBlocked.inc();
>>             s.awaitUninterruptibly();
>>             metrics.currentBlocked.dec();
>>         }
>>         else
>>             s.cancel();
>>     }
>> }
>> 
>> The first thing that happens is that the task is added to the tasks queue. 
>> pool.schedule() only gets called if takeWorkPermit() returns true. I am 
>> still studying the code, but can someone explain what exactly happens when 
>> the queue is full?
>> 
>> 
>> - John
> 
> 
> 
> 
> -- 
> 
> - John



Re: Understanding Blocked and All Time Blocked columns in tpstats

2018-03-23 Thread Chris Lohfink
It blocks the caller attempting to add the task until there's room in the queue, 
applying back pressure. It does not reject it. It mimics the behavior of the 
pre-SEP DebuggableThreadPoolExecutor's RejectedExecutionHandler that the other 
thread pools use (except for sampling/trace, which just throw away 
rejected tasks).

Worth noting, this is only really possible in the native transport pool (SEP 
pool) last I checked (since 2.1 at least; before that there were a few others). 
That changes from version to version. For (basically) all other thread pools the 
queue is limited only by memory.
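
As a plain-Java analogy of that blocking behavior (this uses a standard
ArrayBlockingQueue, not the actual SEP code, just to illustrate back pressure vs
rejection):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class BackPressureDemo
    {
        public static void main(String[] args) throws InterruptedException
        {
            BlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(2);
            queue.put(() -> {});
            queue.put(() -> {});

            // offer() is the "reject" style: it returns false immediately when full.
            boolean accepted = queue.offer(() -> {});
            System.out.println("offer accepted: " + accepted); // false

            // put() is the "back pressure" style: the calling thread parks here
            // until a consumer drains an element. That wait is what the
            // Blocked / All Time Blocked counters are counting in the SEP pool.
            new Thread(() -> {
                try { Thread.sleep(1000); queue.take(); } catch (InterruptedException ignored) {}
            }).start();
            queue.put(() -> {}); // blocks roughly 1s, then succeeds
            System.out.println("put eventually succeeded");
        }
    }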

Chris

> On Mar 22, 2018, at 10:44 PM, John Sanda  wrote:
> 
> I have been doing some work on a cluster that is impacted by 
> https://issues.apache.org/jira/browse/CASSANDRA-11363 
> . Reading through the 
> ticket prompted me to take a closer look at 
> org.apache.cassandra.concurrent.SEPExecutor. I am looking at the 3.0.14 code. 
> I am a little confused about the Blocked and All Time Blocked columns 
> reported in nodetool tpstats and reported by StatusLogger. I understand that 
> there is a queue for tasks. In the case of RequestThreadPoolExecutor, the 
> size of that queue can be controlled via the 
> cassandra.max_queued_native_transport_requests system property.
> 
> I have been looking at SEPExecutor.addTask(FutureTask task), and here is 
> my question. If the queue is full, as defined by SEPExecutor.maxTasksQueued, 
> are tasks rejected? I do not fully grok the code, but it looks like it is 
> possible for tasks to be rejected here (some code and comments omitted for 
> brevity):
> 
> public void addTask(FutureTask<?> task)
> {
>     tasks.add(task);
>     ...
>     else if (taskPermits >= maxTasksQueued)
>     {
>         WaitQueue.Signal s = hasRoom.register();
> 
>         if (taskPermits(permits.get()) > maxTasksQueued)
>         {
>             if (takeWorkPermit(true))
>                 pool.schedule(new Work(this));
> 
>             metrics.totalBlocked.inc();
>             metrics.currentBlocked.inc();
>             s.awaitUninterruptibly();
>             metrics.currentBlocked.dec();
>         }
>         else
>             s.cancel();
>     }
> }
> 
> The first thing that happens is that the task is added to the tasks queue. 
> pool.schedule() only gets called if takeWorkPermit() returns true. I am still 
> studying the code, but can someone explain what exactly happens when the 
> queue is full?
> 
> 
> - John



Re: Delete System_Traces Table

2018-03-19 Thread Chris Lohfink
traces and auth in that version have a whitelist of tables that can be dropped 
(legacy auth tables).

https://github.com/apache/cassandra/blob/cassandra-3.0.12/src/java/org/apache/cassandra/service/ClientState.java#L367
 
<https://github.com/apache/cassandra/blob/cassandra-3.0.12/src/java/org/apache/cassandra/service/ClientState.java#L367>

It does make sense to allow CREATEs in the distributed keyspaces, mostly 
because of auth. That way, if the auth tables are changed in a later version, you 
can pre-prime them before an upgrade. It might be a bit of an overstep in protecting 
users from themselves, but it doesn't hurt anything to have the table there. 
Just ignore it and its existence will not cause any issues.

Chris

> On Mar 19, 2018, at 10:27 AM, shalom sagges <shalomsag...@gmail.com> wrote:
> 
> That's weird... I'm using 3.0.12, so I should've still been able to drop it, 
> no?
> 
> Also, if I intend to upgrade to version 3.11.2, will the existence of the 
> table cause any issues?
> 
> Thanks!
> 
> On Mon, Mar 19, 2018 at 4:30 PM, Chris Lohfink <clohf...@apple.com 
> <mailto:clohf...@apple.com>> wrote:
> Oh I misread original, I see.
> 
> With https://issues.apache.org/jira/browse/CASSANDRA-13813 you won't be able to 
> drop the table, but it would be worth a ticket to prevent creation in those 
> keyspaces or to allow some sort of override when allowing creates.
> 
> Chris
> 
> 
>> On Mar 19, 2018, at 9:15 AM, shalom sagges <shalomsag...@gmail.com 
>> <mailto:shalomsag...@gmail.com>> wrote:
>> 
>> Yes, that's correct. 
>> 
>> I'd definitely like to keep the default tables. 
>> 
>> On Mon, Mar 19, 2018 at 4:10 PM, Rahul Singh <rahul.xavier.si...@gmail.com 
>> <mailto:rahul.xavier.si...@gmail.com>> wrote:
>> I think he just wants to delete the test table not the whole keyspace. Is 
>> that correct?
>> 
>> --
>> Rahul Singh
>> rahul.si...@anant.us <mailto:rahul.si...@anant.us>
>> 
>> Anant Corporation
>> 
>> On Mar 19, 2018, 9:08 AM -0500, Chris Lohfink <clohf...@apple.com 
>> <mailto:clohf...@apple.com>>, wrote:
>>> No.
>>> 
>>> Why do you want to? If you don't use tracing they will be empty, and if you 
>>> were able to drop them you would no longer be able to use tracing when 
>>> debugging.
>>> 
>>> Chris
>>> 
>>>> On Mar 19, 2018, at 7:52 AM, shalom sagges <shalomsag...@gmail.com 
>>>> <mailto:shalomsag...@gmail.com>> wrote:
>>>> 
>>>> Hi All,
>>>> 
>>>> I accidentally created a test table on the system_traces keyspace.
>>>> 
>>>> When I tried to drop the table with the Cassandra user, I got the 
>>>> following error:
>>>> Unauthorized: Error from server: code=2100 [Unauthorized] message="Cannot 
>>>> DROP "
>>>> 
>>>> Is there a way to drop this table permanently?
>>>> 
>>>> Thanks!
>>> 
>> 
> 
> 



Re: Delete System_Traces Table

2018-03-19 Thread Chris Lohfink
Oh I misread original, I see.

With https://issues.apache.org/jira/browse/CASSANDRA-13813 you won't be able to 
drop the table, but it would be worth a ticket to prevent creation in those 
keyspaces or to allow some sort of override when allowing creates.

Chris

> On Mar 19, 2018, at 9:15 AM, shalom sagges <shalomsag...@gmail.com> wrote:
> 
> Yes, that's correct. 
> 
> I'd definitely like to keep the default tables. 
> 
> On Mon, Mar 19, 2018 at 4:10 PM, Rahul Singh <rahul.xavier.si...@gmail.com 
> <mailto:rahul.xavier.si...@gmail.com>> wrote:
> I think he just wants to delete the test table not the whole keyspace. Is 
> that correct?
> 
> --
> Rahul Singh
> rahul.si...@anant.us <mailto:rahul.si...@anant.us>
> 
> Anant Corporation
> 
> On Mar 19, 2018, 9:08 AM -0500, Chris Lohfink <clohf...@apple.com 
> <mailto:clohf...@apple.com>>, wrote:
>> No.
>> 
>> Why do you want to? If you don't use tracing they will be empty, and if you were 
>> able to drop them you would no longer be able to use tracing when debugging.
>> 
>> Chris
>> 
>>> On Mar 19, 2018, at 7:52 AM, shalom sagges <shalomsag...@gmail.com 
>>> <mailto:shalomsag...@gmail.com>> wrote:
>>> 
>>> Hi All,
>>> 
>>> I accidentally created a test table on the system_traces keyspace.
>>> 
>>> When I tried to drop the table with the Cassandra user, I got the following 
>>> error:
>>> Unauthorized: Error from server: code=2100 [Unauthorized] message="Cannot 
>>> DROP "
>>> 
>>> Is there a way to drop this table permanently?
>>> 
>>> Thanks!
>> 
> 



Re: Delete System_Traces Table

2018-03-19 Thread Chris Lohfink
No.

Why do you want to? If you don't use tracing they will be empty, and if you were 
able to drop them you would no longer be able to use tracing when debugging.

Chris

> On Mar 19, 2018, at 7:52 AM, shalom sagges  wrote:
> 
> Hi All, 
> 
> I accidentally created a test table on the system_traces keyspace. 
> 
> When I tried to drop the table with the Cassandra user, I got the following 
> error:
> Unauthorized: Error from server: code=2100 [Unauthorized] message="Cannot 
> DROP "
> 
> Is there a way to drop this table permanently? 
> 
> Thanks!



Re: WARN [PERIODIC-COMMIT-LOG-SYNCER] .. exceeded the configured commit interval by an average of...

2018-03-16 Thread Chris Lohfink
If you just want to make it work, increase commitlog_segment_size_in_mb to 64. 
A single mutation cannot exceed 1/2 the segment size.

If you want to actually fix your problem, decrease the size of the mutations and 
limit the size of the value blob. <== recommended
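
If shrinking the writes means splitting a big blob across rows, a rough
client-side chunking sketch (DataStax Java driver 3.x assumed; the myks.blobs
table layout and the 1 MiB chunk size are hypothetical, not taken from this
thread):

    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;
    import java.nio.ByteBuffer;

    public class BlobChunker
    {
        // Keep each mutation comfortably under commitlog_segment_size_in_mb / 2.
        private static final int CHUNK_SIZE = 1 << 20; // 1 MiB per row, illustrative

        public static void writeChunked(Session session, String id, byte[] payload)
        {
            // In real code, prepare once and reuse the statement.
            PreparedStatement ps = session.prepare(
                "INSERT INTO myks.blobs (id, chunk_id, data) VALUES (?, ?, ?)");
            int chunks = (payload.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
            for (int i = 0; i < chunks; i++)
            {
                int from = i * CHUNK_SIZE;
                int to = Math.min(payload.length, from + CHUNK_SIZE);
                ByteBuffer chunk = ByteBuffer.wrap(payload, from, to - from);
                session.execute(ps.bind(id, i, chunk));
            }
        }
    }

The reader then fetches all chunk_id rows for an id and reassembles them, which
keeps every individual mutation small regardless of the total payload size.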

Chris

> On Mar 16, 2018, at 9:19 AM, Frank Limstrand  wrote:
> 
> Hi listmembers 
> 
> Basic info
> [cqlsh 5.0.1 | Cassandra 3.11.2 | CQL spec 3.4.4 | Native protocol v4]
> CREATE KEYSPACE mykeyspace WITH replication = {'class': 'SimpleStrategy', 
> 'replication_factor': '3'}  AND durable_writes = true;
> 8 linux nodes, SSD. 64GB memory on each server.
> Additional information after the signature.
> 
> We are about to enter production with this new cluster and are using our own 
> (homemade) application to test with.
> 
> Problem
> We see this frequently in system.log on all servers:
> 
> Timestamp WARN  [PERIODIC-COMMIT-LOG-SYNCER] NoSpamLogger.java:94 - Out of 27 
> commit log syncs over the past 266.24s with average duration of 53.21ms, 1 
> have exceeded the configured commit interval by an average of 3.89ms
> (The last ms number vary from log messages to log message but is never over 
> 1000ms, more in the 100 ms range)
> 
> We have had one ERROR log message on one node:
> 
> Timestamp ERROR [MutationStage-2] StorageProxy.java:1414 - Failed to apply 
> mutation locally : {}
> java.lang.IllegalArgumentException: Mutation of 24.142MiB is too large for 
> the maximum size of 16.000MiB
> 
> On two other nodes we got this
> Timestamp WARN  [MutationStage-3] AbstractLocalAwareExecutorService.java:167 
> - Uncaught exception on thread Thread[MutationStage-3,5,main]: {}
> java.lang.IllegalArgumentException: Mutation of 24.142MiB is too large for 
> the maximum size of 16.000MiB
> 
> Our application got this in the log
> Cassandra failure during write query at consistency QUORUM (2 responses were 
> required but only 0 replica responded, 2 failed)
> com.datastax.driver.core.exceptions.WriteFailureException: Cassandra failure 
> during write query at consistency QUORUM (2 responses were required but only 
> 0 replica responded, 2 failed)
> 
> Are the WARNings a sign that there can be ERRORs like this? Are they related 
> somehow?
> 
> We decided to relax some performance parameters in our application and the 
> WARN log messages now come very seldomly but they are there. We have seen the 
> same WARN log message at nightime when we don't run our application at all so 
> WARN messages were unexpected.
> 
> There are no GC warnings about long pauses.
> 
> Any thoughts about how to proceed with this issue?
> 
> Kind regards
> Frank Limstrand
> National Library of Norway
> 
> 
> All tables created like this:
> CREATE TABLE mykeyspace.mytable (
> key blob,
> column1 timeuuid,
> column2 text,
> value blob,
> PRIMARY KEY (key, column1, column2)
> ) WITH COMPACT STORAGE
> AND CLUSTERING ORDER BY (column1 ASC, column2 ASC)
> AND bloom_filter_fp_chance = 0.01
> AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
> AND comment = 'Column Family for storing job execution record information'
> AND compaction = {'class': 
> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 
> 'max_threshold': '32', 'min_threshold': '4'}
> AND compression = {'chunk_length_in_kb': '64', 'class': 
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
> AND crc_check_chance = 1.0
> AND dclocal_read_repair_chance = 0.1
> AND default_time_to_live = 0
> AND gc_grace_seconds = 864000
> AND max_index_interval = 2048
> AND memtable_flush_period_in_ms = 0
> AND min_index_interval = 128
> AND read_repair_chance = 0.0
> AND speculative_retry = '99PERCENTILE';
> 
> cassandra.yaml:
> hinted_handoff_enabled: true
> max_hint_window_in_ms: 1080 # 3 hours
> hinted_handoff_throttle_in_kb: 1024
> max_hints_delivery_threads: 2
> hints_directory: /d1/cassandra/data/hints
> hints_flush_period_in_ms: 1
> max_hints_file_size_in_mb: 128
> batchlog_replay_throttle_in_kb: 1024
> authenticator: AllowAllAuthenticator
> authorizer: AllowAllAuthorizer
> role_manager: CassandraRoleManager
> roles_validity_in_ms: 2000
> permissions_validity_in_ms: 2000
> credentials_validity_in_ms: 2000
> partitioner: org.apache.cassandra.dht.RandomPartitioner
> data_file_directories:
> - /d1/cassandra/data
> - /d2/cassandra/data
> commitlog_directory: /d1/cassandra/commitlog
> cdc_enabled: false
> disk_failure_policy: stop
> commit_failure_policy: stop
> prepared_statements_cache_size_mb:
> thrift_prepared_statements_cache_size_mb:
> key_cache_size_in_mb:
> key_cache_save_period: 14400
> row_cache_size_in_mb: 0
> row_cache_save_period: 0
> counter_cache_size_in_mb:
> counter_cache_save_period: 7200
> saved_caches_directory: /d2/cassandra/saved_caches
> commitlog_sync: periodic
> commitlog_sync_period_in_ms: 1
> commitlog_segment_size_in_mb: 32
> concurrent_reads: 32
> 

Re: system.size_estimates - safe to remove sstables?

2018-03-06 Thread Chris Lohfink
While it's off you can delete the files in that directory, yes.

Chris

> On Mar 6, 2018, at 2:35 AM, Kunal Gangakhedkar <kgangakhed...@gmail.com> 
> wrote:
> 
> Hi Chris,
> 
> I checked for snapshots and backups - none found.
> Also, we're not using opscenter, hadoop or spark or any such tool.
> 
> So, do you think we can just remove the cf and restart the service?
> 
> Thanks,
> Kunal
> 
> On 5 March 2018 at 21:52, Chris Lohfink <clohf...@apple.com 
> <mailto:clohf...@apple.com>> wrote:
> Any chance space used by snapshots? What files exist there that are taking up 
> space?
> 
> > On Mar 5, 2018, at 1:02 AM, Kunal Gangakhedkar <kgangakhed...@gmail.com 
> > <mailto:kgangakhed...@gmail.com>> wrote:
> >
> > Hi all,
> >
> > I have a 2-node cluster running cassandra 2.1.18.
> > One of the nodes has run out of disk space and died - almost all of it 
> > shows up as occupied by size_estimates CF.
> > Out of 296GiB, 288GiB shows up as consumed by size_estimates in 'du -sh' 
> > output.
> >
> > This is while the other node is chugging along - shows only 25MiB consumed 
> > by size_estimates (du -sh output).
> >
> > Any idea why this descripancy?
> > Is it safe to remove the size_estimates sstables from the affected node and 
> > restart the service?
> >
> > Thanks,
> > Kunal
> 
> 
> 
> 



Re: cfhistograms InstanceNotFoundException EstimatePartitionSizeHistogram

2018-03-06 Thread Chris Lohfink
Make sure you're using the same version of nodetool as your version of Cassandra. 
That metric was renamed from EstimatedRowSize, so if you use a nodetool built 
for a more recent version you would get this error, since 
EstimatePartitionSizeHistogram doesn't exist on the older Cassandra host.

Chris

Sent from my iPhone

> On Mar 6, 2018, at 3:29 AM, onmstester onmstester  wrote:
> 
> Running this command:
> nodetools cfhistograms keyspace1 table1
> 
> throws this exception in production server:
> javax.management.InstanceNotFoundException: 
> org.apache.cassandra.metrics:type=Table,keyspace=keyspace1,scope=table1,name=EstimatePartitionSizeHistogram
> 
> But i have no problem in a test server with few data in it and same datamodel.
> I'm using Casssandra 3.
> Sent using Zoho Mail
> 
> 
> 


Re: system.size_estimates - safe to remove sstables?

2018-03-05 Thread Chris Lohfink
Any chance space used by snapshots? What files exist there that are taking up 
space?

> On Mar 5, 2018, at 1:02 AM, Kunal Gangakhedkar  
> wrote:
> 
> Hi all,
> 
> I have a 2-node cluster running cassandra 2.1.18.
> One of the nodes has run out of disk space and died - almost all of it shows 
> up as occupied by size_estimates CF.
> Out of 296GiB, 288GiB shows up as consumed by size_estimates in 'du -sh' 
> output.
> 
> This is while the other node is chugging along - shows only 25MiB consumed by 
> size_estimates (du -sh output).
> 
> Any idea why this descripancy?
> Is it safe to remove the size_estimates sstables from the affected node and 
> restart the service?
> 
> Thanks,
> Kunal





Re: system.size_estimates - safe to remove sstables?

2018-03-05 Thread Chris Lohfink
Unless you are using Spark or Hadoop, nothing consumes the data in that table (unless 
you have tooling that may use it, like OpsCenter), so you are safe to 
just truncate it or rm the sstables while the instance is offline and you will be fine. If 
you do use that table, you can then run `nodetool refreshsizeestimates` to 
re-add it, or just wait for it to be regenerated automatically (every 5 min).

Chris

> On Mar 5, 2018, at 1:02 AM, Kunal Gangakhedkar  
> wrote:
> 
> Hi all,
> 
> I have a 2-node cluster running cassandra 2.1.18.
> One of the nodes has run out of disk space and died - almost all of it shows 
> up as occupied by size_estimates CF.
> Out of 296GiB, 288GiB shows up as consumed by size_estimates in 'du -sh' 
> output.
> 
> This is while the other node is chugging along - shows only 25MiB consumed by 
> size_estimates (du -sh output).
> 
> Any idea why this descripancy?
> Is it safe to remove the size_estimates sstables from the affected node and 
> restart the service?
> 
> Thanks,
> Kunal





Re: Cassandra Needs to Grow Up by Version Five!

2018-02-21 Thread Chris Lohfink
Instead of saying "Make X better" you can quantify "Here's how we can make X 
better" in a jira and the conversation will continue with interested parties 
(opening jiras are free!). Being combative and insulting project on mailing 
list may help vent some frustrations but it is counter productive and makes 
people defensive.

People are not averse to usability, quite the opposite actually. People do tend 
to be averse to conversations opened up with "cassandra is an idiot" with no 
clear definition of how to make it better or what a better solution would look 
like though. Note however that saying "make backups better" or "look at 
marketing literature for these guys" is hard for an engineer or architect to 
break into an actionable item. Coming up with cool ideas on how to do something 
will more likely hook a developer into working on it than trying to shame the 
community with a sales pitch from another DB's sales guy.

Chris

> On Feb 21, 2018, at 4:53 PM, Kenneth Brotman  
> wrote:
> 
> Hi Akash,
> 
> I get the part about outside work which is why in replying to Jeff Jirsa I 
> was suggesting the big companies could justify taking it on easy enough and 
> you know actually pay the people who would be working at it so those people 
> could have a life.
> 
> The part I don't get is the aversion to usability.  Isn't that what you think 
> about when you are coding?  "Am I making this thing I'm building easy to 
> use?"  If you were programming for me, we would be constantly talking about 
> what we are building and how we can make things easier for users.  If I had 
> to fight with a developer, architect or engineer about usability all the 
> time, they would be gone and quick.  How do approach programming if you 
> aren't trying to make things easy.
> 
> Kenneth Brotman
> 
> -Original Message-
> From: Akash Gangil [mailto:akashg1...@gmail.com] 
> Sent: Wednesday, February 21, 2018 2:24 PM
> To: d...@cassandra.apache.org
> Cc: user@cassandra.apache.org
> Subject: Re: Cassandra Needs to Grow Up by Version Five!
> 
> I would second Jon in the arguments he made. Contributing outside work is 
> draining and really requires a lot of commitment. If someone requires 
> features around usability etc, just pay for it, period.
> 
> On Wed, Feb 21, 2018 at 2:20 PM, Kenneth Brotman < 
> kenbrot...@yahoo.com.invalid> wrote:
> 
>> Jon,
>> 
>> Very sorry that you don't see the value of the time I'm taking for this.
>> I don't have demands; I do have a stern warning and I'm right Jon.  
>> Please be very careful not to mischaracterized my words Jon.
>> 
>> You suggest I put things in JIRA's, then seem to suggest that I'd be 
>> lucky if anyone looked at it and did anything. That's what I figured too.
>> 
>> I don't appreciate the hostility.  You will understand more fully in 
>> the next post where I'm coming from.  Try to keep the conversation civilized.
>> I'm trying or at least so you understand I think what I'm doing is 
>> saving your gig and mine.  I really like a lot of people is this group.
>> 
>> I've come to a preliminary assessment on things.  Soon the cloud will 
>> clear or I'll be gone.  Don't worry.  I'm a very peaceful person and 
>> like you I am driven by real important projects that I feel compelled 
>> to work on for the good of others.  I don't have time for people to 
>> hand hold a database and I can't get stuck with my projects on the wrong 
>> stuff.
>> 
>> Kenneth Brotman
>> 
>> 
>> -Original Message-
>> From: Jon Haddad [mailto:jonathan.had...@gmail.com] On Behalf Of Jon 
>> Haddad
>> Sent: Wednesday, February 21, 2018 12:44 PM
>> To: user@cassandra.apache.org
>> Cc: d...@cassandra.apache.org
>> Subject: Re: Cassandra Needs to Grow Up by Version Five!
>> 
>> Ken,
>> 
>> Maybe it’s not clear how open source projects work, so let me try to 
>> explain.  There’s a bunch of us who either get paid by someone or 
>> volunteer on our free time.  The folks that get paid, (yay!) usually 
>> take direction on what the priorities are, and work on projects that 
>> directly affect our jobs.  That means that someone needs to care 
>> enough about the features you want to work on them, if you’re not going to 
>> do it yourself.
>> 
>> Now as others have said already, please put your list of demands in 
>> JIRA, if someone is interested, they will work on it.  You may need to 
>> contribute a little more than you’ve done already, be prepared to get 
>> involved if you actually want to to see something get done.  Perhaps 
>> learning a little more about Cassandra’s internals and the people 
>> involved will reveal some of the design decisions and priorities of the 
>> project.
>> 
>> Third, you seem to be a little obsessed with market share.  While 
>> market share is fun to talk about, *most* of us that are working on 
>> and contributing to Cassandra do so because it does actually solve a 
>> problem we have, and solves it reasonably well.  If some magic open 
>> 

Re: Commitlogs are filling the Full Disk space and nodes are down

2018-01-30 Thread Chris Lohfink
The commitlog growing is often a symptom of a problem. If the memtable flush or 
post-flush fails in any way, the commitlogs will not be recycled/deleted and 
will continue to pile up.

You might want to go back further in the logs to make sure there's nothing like the 
post-memtable flusher getting a permission error (some tooling creates 
commitlogs, so if run by the wrong user it can create this problem), or a memtable 
flush error. You can also check tpstats to see if tasks are queued up in the 
post-memtable flusher, and jstack to see where the active ones are stuck if they 
are.

Chris

> On Jan 30, 2018, at 4:20 AM, Amit Singh  wrote:
> 
> Hi,
>  
> When you actually say nodetool flush, data from the memtable goes to a disk-based 
> structure as SSTables and, side by side, the commit log segments for that 
> particular data get written off; it's a continuous process. Maybe in your 
> case you can decrease the value of the below property in 
> cassandra.yaml:
>  
> commitlog_total_space_in_mb
>  
> Also this is what is it used for 
>  
> # Total space to use for commit logs on disk.
> #
> # If space gets above this value, Cassandra will flush every dirty CF
> # in the oldest segment and remove it.  So a small total commitlog space
> # will tend to cause more flush activity on less-active columnfamilies.
> #
> # The default value is the smaller of 8192, and 1/4 of the total space
> # of the commitlog volume.
>  
>  
> From: Mokkapati, Bhargav (Nokia - IN/Chennai) 
> [mailto:bhargav.mokkap...@nokia.com] 
> Sent: Tuesday, January 30, 2018 4:00 PM
> To: user@cassandra.apache.org
> Subject: Commitlogs are filling the Full Disk space and nodes are down
>  
> Hi Team,
>  
> My Cassandra version : Apache Cassandra 3.0.13
>  
> Cassandra nodes are down due to Commitlogs are getting filled up until full 
> disk size.
>  
> 
>  
> With “Nodetool flush” I didn’t see any commitlogs deleted.
>  
> Can anyone tell me how to flush the commitlogs without losing data.
>  
> Thanks,
> Bhargav M



Re: sstabledump tries to delete a file

2018-01-10 Thread Chris Lohfink
Yes, it should be read-only; please open a JIRA. It does look like it would rebuild
the summary if the fp chance changed or if it is missing. When it builds the table
metadata from the sstable it could just set the properties to match those of
the sstable to prevent this.

Chris

On Wed, Jan 10, 2018 at 4:16 AM, Python_Max  wrote:

> Hello all.
>
> I have an error when trying to dump SSTable (Cassandra 3.11.1):
>
> $ sstabledump mc-56801-big-Data.db
> Exception in thread "main" FSWriteError in /var/lib/cassandra/data/<
> keyspace>//mc-56801-big-Summary.db
> at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(
> FileUtils.java:142)
> at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(
> FileUtils.java:159)
> at org.apache.cassandra.io.sstable.format.SSTableReader.
> saveSummary(SSTableReader.java:935)
> at org.apache.cassandra.io.sstable.format.SSTableReader.
> saveSummary(SSTableReader.java:920)
> at org.apache.cassandra.io.sstable.format.SSTableReader.
> load(SSTableReader.java:788)
> at org.apache.cassandra.io.sstable.format.SSTableReader.
> load(SSTableReader.java:731)
> at org.apache.cassandra.io.sstable.format.SSTableReader.
> open(SSTableReader.java:516)
> at org.apache.cassandra.io.sstable.format.SSTableReader.
> openNoValidation(SSTableReader.java:396)
> at org.apache.cassandra.tools.SSTableExport.main(
> SSTableExport.java:191)
> Caused by: java.nio.file.AccessDeniedException: /var/lib/cassandra/data/<
> keyspace>//mc-56801-big-Summary.db
> at sun.nio.fs.UnixException.translateToIOException(
> UnixException.java:84)
> at sun.nio.fs.UnixException.rethrowAsIOException(
> UnixException.java:102)
> at sun.nio.fs.UnixException.rethrowAsIOException(
> UnixException.java:107)
> at sun.nio.fs.UnixFileSystemProvider.implDelete(
> UnixFileSystemProvider.java:244)
> at sun.nio.fs.AbstractFileSystemProvider.delete(
> AbstractFileSystemProvider.java:103)
> at java.nio.file.Files.delete(Files.java:1126)
> at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(
> FileUtils.java:136)
> ... 8 more
>
> Seems that sstabledump tries to delete and recreate summary file which I
> think is risky because external modification to files that should be
> modified only by Cassandra itself can lead to unpredictable behavior.
> When I copy all related files and change it's owner to myself and run
> sstabledump in that directory then Summary.db file is recreated but it's
> md5 is exactly the same as original Summary.db's file.
>
> I indeed have changed bloom_filter_fp_chance couple months ago, so I
> believe that's the reason why SSTableReader wants to recreate summary file.
>
> After nodetool scrub an error still happens.
>
> I have not found any issues like this in bug tracker.
> Shouldn't sstabledump be read only?
>
> --
> Best regards,
> Python_Max.
>


Re: sstable

2017-12-20 Thread Chris Lohfink
Somewhere along the line the sstabledump tool incorrectly got set up to use tool
initialization; it's fixed in
https://issues.apache.org/jira/browse/CASSANDRA-13683

Chris

On Tue, Dec 19, 2017 at 5:45 PM, Mounika kale 
wrote:

> Hi,
>   I'm getting below error for all sstable tools.
>
> sstabledump mc-173-big-Data.db
> Exception in thread "main" java.lang.ExceptionInInitializerError
> Caused by: org.apache.cassandra.exceptions.ConfigurationException:
> Expecting URI in variable: [cassandra.config]. Found[cassandra.yaml].
> Please prefix the file with [file:///] for local files and
> [file:///] for remote files. If you are executing this from an
> external tool, it needs to set Config.setClientMode(true) to avoid loading
> configuration.
> at org.apache.cassandra.config.YamlConfigurationLoader.
> getStorageConfigURL(YamlConfigurationLoader.java:80)
> at org.apache.cassandra.config.YamlConfigurationLoader.loadConfig(
> YamlConfigurationLoader.java:100)
> at org.apache.cassandra.config.DatabaseDescriptor.loadConfig(
> DatabaseDescriptor.java:261)
> at org.apache.cassandra.config.DatabaseDescriptor.toolInitialization(
> DatabaseDescriptor.java:179)
> at org.apache.cassandra.config.DatabaseDescriptor.toolInitialization(
> DatabaseDescriptor.java:150)
> at org.apache.cassandra.tools.SSTableExport.(
> SSTableExport.java:65)
>
>


Re: gc causes C* node hang

2017-11-30 Thread Chris Lohfink
Your mail client may be changing the character if you are copying and pasting: it's
"-" (a hyphen), not the Unicode en dash "–". I would recommend adding it to
jvm.options like Oleksandr pointed out
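
If it helps, a tiny throwaway check you could run against a jvm.options or
cassandra-env.sh file to spot pasted en/em dashes (the file path below is just an
example):

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    public class DashCheck
    {
        public static void main(String[] args) throws Exception
        {
            List<String> lines = Files.readAllLines(Paths.get("/etc/cassandra/jvm.options"));
            for (int i = 0; i < lines.size(); i++)
            {
                String line = lines.get(i);
                // U+2013 (en dash) and U+2014 (em dash) look like '-' but the JVM
                // will not recognise them as option prefixes.
                if (line.indexOf('\u2013') >= 0 || line.indexOf('\u2014') >= 0)
                    System.out.println("Suspicious dash on line " + (i + 1) + ": " + line);
            }
        }
    }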

Chris

On Thu, Nov 30, 2017 at 1:50 AM, Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> On Thu, Nov 30, 2017 at 1:38 AM, Peng Xiao <2535...@qq.com> wrote:
>
>> looks we are not able to enable –XX:PrintSafepointStatisticsCount=1
>> in cassandra-env.sh
>> Could anyone please advise?
>>
>> ...
>
>> Error: Could not find or load main class –XX:PrintSafepointStatisticsCo
>> unt=1
>>
>
> Hm, not sure how are you doing it, but it boils down to adding a line
> somewhere in the cassandra-env.sh like this one:
>
> JVM_OPTS="$JVM_OPTS -XX:PrintSafepointStatisticsCount=1"
>
> OR, if you're using a newer version (3.0 or newer), the following in the
> jvm.options file:
>
> -XX:PrintSafepointStatisticsCount=1
>
> Cheers,
> --
> Alex
>
>


Re: What is OneMinuteRate in Write Latency?

2017-11-03 Thread Chris Lohfink
It's from the metrics library's Meter object, which tracks an exponentially
weighted moving average of the event rate.
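
If you want an exact writes/sec number instead of the smoothed rate, you can sample
the monotonically increasing Count attribute twice and divide by the interval. A
minimal JMX sketch (assumes the default JMX port 7199 on localhost without auth;
the attribute names are those exposed by the metrics JMX reporter):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.*;

    public class WriteRateSampler
    {
        public static void main(String[] args) throws Exception
        {
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
            try (JMXConnector c = JMXConnectorFactory.connect(url))
            {
                MBeanServerConnection mbs = c.getMBeanServerConnection();
                ObjectName writeLatency = new ObjectName(
                    "org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency");

                long c1 = (Long) mbs.getAttribute(writeLatency, "Count");
                Thread.sleep(10_000);
                long c2 = (Long) mbs.getAttribute(writeLatency, "Count");

                // Count only ever increases, so the delta over the window is the
                // number of coordinator-level writes in that window.
                System.out.printf("writes/sec over 10s: %.1f%n", (c2 - c1) / 10.0);

                // OneMinuteRate is already a rate (events/sec), smoothed as an EWMA.
                System.out.println("OneMinuteRate: " + mbs.getAttribute(writeLatency, "OneMinuteRate"));
            }
        }
    }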

Chris

On Thu, Nov 2, 2017 at 12:10 PM, AI Rumman  wrote:

> Hi,
>
> I am trying to calculate the Read/second and Write/Second in my Cassandra
> 2.1 cluster. After searching and reading, I came to know about JMX bean
> "org.apache.cassandra.metrics:type=ClientRequest,scope=
> Write,name=Latency".
> Here I can see oneMinuteRate. I have started a brand new cluster and
> started collected these metrics from 0.
> When I started my first record, I can see
>
> Count = 1
>> OneMinuteRate = 0.01599111...
>
>
> Does it mean that my write/s is 0.0159911? Or does it mean that based on 1
> minute data, my write latency is 0.01599 where Write Latency refers to the
> response time for writing a record?
>
> Please help me understand the value.
>
> Thanks.
>
>
>
>
>


Re: Cassandra Compaction Metrics - CompletedTasks vs TotalCompactionCompleted

2017-10-31 Thread Chris Lohfink
CompactionMetrics is a combination of the compaction executor (sstable
compactions, secondary index build, view building, relocate,
garbagecollect, cleanup, scrub, etc.) and the validation executor (repairs). Keep
in mind that not all jobs execute 1 task per operation; things that use the
parallelAllSSTableOperation, like cleanup, will create 1 task per sstable.

The "CompletedTasks" metric is a measure of how many tasks ran on these two
executors combined.
The "TotalCompactionsCompleted" metric is a measure of how many compactions
issued from the compaction manager ran (normal compactions, cache writes,
scrub, 2i and MVs). So while they may be close, depending on what's
happening on the system, there's no assurance that they will be within any
bounds of each other.

So I would suspect validation compactions from repairs would be one major
difference. If you run other operational tasks there will likely be more.


On Mon, Oct 30, 2017 at 12:22 PM, Lucas Benevides <
lu...@maurobenevides.com.br> wrote:

> Kurt,
>
> I apreciate your answer but I don't believe CompletedTasks count the
> "validation compactions". These are compactions that occur from repair
> operations. I am running tests on 10 cluster nodes in the same physical
> rack, with Cassandra Stress Tool and I didn't make any Repair commands. The
> tables only last for seven hours, so it is not reasonable that tens of
> thousands of these validation compactions occur per node.
>
> I tried to see the code and the CompletedTasks counter seems to be
> populated by a method from the class java.util.concurrent.
> ThreadPoolExecutor.
> So I really don't know what it is but surely is not the amount of
> Compaction Completed Tasks.
>
> Thank you
> Lucas Benevides
>
>-
>
>
> 2017-10-30 8:05 GMT-02:00 kurt greaves :
>
>> I believe (may be wrong) that CompletedTasks counts Validation
>> compactions while TotalCompactionsCompleted does not. Considering a lot of
>> validation compactions can be created every repair it might explain the
>> difference. I'm not sure why they are named that way or work the way they
>> do. There appears to be no documentation around this in the code (what a
>> surprise) and looks like it was last touched in CASSANDRA-4009
>> , which also has
>> no useful info.
>>
>> On 27 October 2017 at 13:48, Lucas Benevides > > wrote:
>>
>>> Dear community,
>>>
>>> I am studying the behaviour of the Cassandra
>>> TimeWindowCompactionStragegy. To do so I am watching some metrics. Two of
>>> these metrics are important: Compaction.CompletedTasks, a gauge, and the
>>> TotalCompactionsCompleted, a Meter.
>>>
>>> According to the documentation (http://cassandra.apache.org/d
>>> oc/latest/operating/metrics.html#table-metrics):
>>> Completed Taks = Number of completed compactions since server [re]start.
>>> TotalCompactionsCompleted = Throughput of completed compactions since
>>> server [re]start.
>>>
>>> As I realized, the TotalCompactionsCompleted, in the Meter object, has a
>>> counter, which I supposed would be numerically close to the CompletedTasks
>>> gauge. But they are very different, with the Completed Tasks being much
>>> higher than the TotalCompactions Completed.
>>>
>>> According to the code, in github (class metrics.CompactionMetrics.java):
>>> Completed Taks - Number of completed compactions since server [re]start
>>> TotalCompactionsCompleted - Total number of compactions since server
>>> [re]start
>>>
>>> Can you help me and explain the difference between these two metrics, as
>>> they seem to have very distinct values, with the Completed Tasks being
>>> around 1000 times the value of the counter in TotalCompactionsCompleted.
>>>
>>> Thanks in Advance,
>>> Lucas Benevides
>>>
>>>
>>
>


Re: Inter Data Center Latency calculation of a Multi DC cluster running in AWS

2017-10-17 Thread Chris Lohfink
An alternative: if you are using >3.8, you can use the
org.apache.cassandra.metrics:type=Messaging,name=[DC]-Latency mbean, where
[DC] is the name of the DC, and you can get the inter-DC latency per node
(to that node). This does not account for NTP drift though, just how long
it takes messages (i.e. mutations) to get to a node from other DCs.
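
A small sketch of reading those per-DC mbeans over JMX (assumes JMX on
localhost:7199 without auth, and the usual metrics Timer attribute names such as
Mean and 99thPercentile; adjust both for your setup):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.*;

    public class CrossDcLatency
    {
        public static void main(String[] args) throws Exception
        {
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
            try (JMXConnector c = JMXConnectorFactory.connect(url))
            {
                MBeanServerConnection mbs = c.getMBeanServerConnection();
                // One mbean per remote DC, e.g. ...type=Messaging,name=DC1-Latency
                ObjectName pattern = new ObjectName(
                    "org.apache.cassandra.metrics:type=Messaging,name=*-Latency");
                for (ObjectName name : mbs.queryNames(pattern, null))
                {
                    Object mean = mbs.getAttribute(name, "Mean");
                    Object p99 = mbs.getAttribute(name, "99thPercentile");
                    System.out.println(name.getKeyProperty("name") + " mean=" + mean + " p99=" + p99);
                }
            }
        }
    }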

Chris

On Tue, Oct 17, 2017 at 7:18 PM, Jon Haddad  wrote:

> I recommend figuring out the latency between your datacenters.  Cassandra
> isn’t going to be any more than that barring JVM pauses on the remote
> coordinator.
>
>
> On Oct 17, 2017, at 4:17 PM, Bill Walters  wrote:
>
> Hi Everyone,
>
> I need some suggestions on finding the time taken for Cassandra
> replication to happen from east to west region for write and read
> operations on a multi DC cluster.
> Currently, below is our cluster setup.
>
> *Cassandra version:* DSE 5.0.7
> *No of Data centers:* 2 (AWS East and AWS West regions)
> *No of Nodes:* 12 nodes (6 nodes in AWS East and 6 nodes in AWS West)
> *Replication Factor:* 3 in each data center.
> *Cluster size*: Around 40 GB on each node
>
> Sometime, next year we have an activity where our clients are going to be
> reading only from AWS West region. The data center in AWS east will be
> available but we do not want any reads to be done on this.(Our management
> wants to know the time it takes for Cassandra to replicate from one DC to
> the other)
>
> Here are some options I have thought of in finding the time taken for
> Cassandra replication to happen from AWS East DC to AWS West DC.
>
> 1. Setup a Java client to write/read a transaction with *"Local Quorum" 
> *consistency
> level in* AWS East* region as Local data center, capture the time taken
> for this activity. Similarly use this client to perform read/write
> transaction with *"Local Quorum"* consistency level in *AWS West* region
> and capture the time. Then finally perform the same transaction with with 
> *"Each
> Quorum" *consistency level and capture the time.
>
> *Inter DC latency* = *Time taken for Each Quorum transaction* *-* *(Time
> taken for Local Quorum transaction in AWS East as local dc)* *-** (Time
> taken for Local Quorum transaction in AWS West as local dc)*.
>
>
> 2. Utilize the https://github.com/gitaroktato/cassandra-
> replication-latency-tools open source project where a Python Cassandra
> clients writes in one Data Center and other client reads in other data
> center.
>
>
> Can you please suggest if my strategies above will help in finding the
> Inter DC latency or there are other ways I need to follow.
>
>
> Thank You,
> Bill Walters.
>
>
>


Re: Cassandra and G1 Garbage collector stop the world event (STW)

2017-10-09 Thread Chris Lohfink
Can you share your schema and cfstats? This sounds kinda like a wide
partition, backed up compactions, or tombstone issue for it to create so
much and have issues like that so quickly with those settings.

A heap dump would be most telling but they are rather large and hard to
share.

Chris

On Mon, Oct 9, 2017 at 8:12 AM, Gustavo Scudeler 
wrote:

> Hello,
>
> @kurt greaves: Have you tried CMS with that sized heap?
>
>
> Yes, for testing purposes, I have 3 nodes with CMS and 3 with
> G1. The behavior is basically the same.
>
> *Using CMS suggested settings* http://gceasy.io/my-gc-report.jsp?p=
> c2hhcmVkLzIwMTcvMTAvOC8tLWdjLmxvZy4wLmN1cnJlbnQtLTE5LTAtNDk=
>
> *Using G1 suggested settings* http://gceasy.io/my-gc-report.jsp?p=
> c2hhcmVkLzIwMTcvMTAvOC8tLWdjLmxvZy4wLmN1cnJlbnQtLTE5LTExLTE3
>
>
> @Steinmaurer, Thomas If this happens in a very short very frequently and
>> depending on your allocation rate in MB/s, a combination of the G1 bug and
>> a small heap, might result going towards OOM.
>
>
> We have a really high obj allocation rate:
>
> Avg creation rate  622.9 mb/sec
> Avg promotion rate  18.39 mb/sec
>
> It could be the cause, where the GC can't keep up with this rate.
>
> I'm starting to think it could be some wrong configuration where Cassandra is
> configured in a way that bursts allocations in a manner that G1 can't keep
> up with.
>
> Any ideas?
>
> Best regards,
>
>
> 2017-10-09 12:44 GMT+01:00 Steinmaurer, Thomas <
> thomas.steinmau...@dynatrace.com>:
>
>> Hi,
>>
>>
>>
>> although not happening here with Cassandra (due to using CMS), we had
>> some weird problem with our server application e.g. hit by the following
>> JVM/G1 bugs:
>>
>> https://bugs.openjdk.java.net/browse/JDK-8140597
>>
>> https://bugs.openjdk.java.net/browse/JDK-8141402 (more or less  a
>> duplicate of above)
>>
>> https://bugs.openjdk.java.net/browse/JDK-8048556
>>
>>
>>
>> Especially the first, JDK-8140597, might be interesting, if you see
>> periodic humongous allocations (according to a GC log) resulting in mixed
>> GC phases being steadily interrupted due to G1 bug, thus no GC in OLD
>> regions. Humongous allocations will happen if a single (?) allocation is >
>> (region size / 2), if I remember correctly. Can’t recall the default G1
>> region size for a 12GB heap, but possibly 4MB. So, in case you are
>> allocating something larger than > 2MB, you might end up in something
>> called “humongous” allocations, spanning several G1 regions. If this
>> happens in a very short very frequently and depending on your allocation
>> rate in MB/s, a combination of the G1 bug and a small heap, might result
>> going towards OOM.
>>
>>
>>
>> Possibly worth a further route for investigation.
>>
>>
>>
>> Regards,
>>
>> Thomas
>>
>>
>>
>> *From:* Gustavo Scudeler [mailto:scudel...@gmail.com]
>> *Sent:* Monday, 09 October 2017 13:12
>> *To:* user@cassandra.apache.org
>> *Subject:* Cassandra and G1 Garbage collector stop the world event (STW)
>>
>>
>>
>> Hi guys,
>>
>> We have a 6 node Cassandra Cluster under heavy utilization. We have been
>> dealing a lot with garbage collector stop the world event, which can take
>> up to 50 seconds in our nodes, in the meantime Cassandra Node is
>> unresponsive, not even accepting new logins.
>>
>> Extra details:
>>
>> · Cassandra Version: 3.11
>>
>> · Heap Size = 12 GB
>>
>> · We are using G1 Garbage Collector with default settings
>>
>> · Nodes size: 4 CPUs 28 GB RAM
>>
>> · All CPU cores are at 100% all the time.
>>
>> · The G1 GC behavior is the same across all nodes.
>>
>> The behavior remains basically:
>>
>> 1.  Old Gen starts to fill up.
>>
>> 2.  GC can't clean it properly without a full GC and a STW event.
>>
>> 3.  The full GC starts to take longer, until the node is completely
>> unresponsive.
>>
>> *Extra details and GC reports:*
>>
>> https://stackoverflow.com/questions/46568777/cassandra-and-
>> g1-garbage-collector-stop-the-world-event-stw
>>
>>
>>
>> Can someone point me what configurations or events I could check?
>>
>>
>>
>> Thanks!
>>
>>
>>
>> Best regards,
>>
>>
>> The contents of this e-mail are intended for the named addressee only. It
>> contains information that may be confidential. Unless you are the named
>> addressee or an authorized designee, you may not copy or use it, or
>> disclose it to anyone else. If you received it in error please notify us
>> immediately and then destroy it. Dynatrace Austria GmbH (registration
>> number FN 91482h) is a company registered in Linz whose registered office
>> is at 4040 Linz, Austria, Freistädterstraße 313
>> 
>>
>
>
>
>


Re: [EXTERNAL] Re: Increasing VNodes

2017-10-04 Thread Chris Lohfink
Can't you just increase the segmentCount option to split it more?

On Wed, Oct 4, 2017 at 12:50 PM, Mohapatra, Kishore <
kishore.mohapa...@nuance.com> wrote:

> Thanks a lot for all of your input. We are actually using Cassandra
> reaper. But it is just splitting the ranges into 256 per node.
>
> But I will certainly try out splitting into smaller ranges going thru the
> system.size_estimate table.
>
>
>
> Thanks
>
>
>
> *Kishore Mohapatra*
>
> Principal Operations DBA
>
> Seattle, WA
>
> Email : kishore.mohapa...@nuance.com
>
>
>
>
>
> *From:* Jon Haddad [mailto:jonathan.had...@gmail.com] * On Behalf Of *Jon
> Haddad
> *Sent:* Wednesday, October 04, 2017 10:27 AM
> *To:* user <user@cassandra.apache.org>
> *Subject:* [EXTERNAL] Re: Increasing VNodes
>
>
>
> The site (with the docs) is probably more helpful to learn about how
> reaper works:  http://cassandra-reaper.io/
>
>
>
> On Oct 4, 2017, at 9:54 AM, Chris Lohfink <clohfin...@gmail.com> wrote:
>
>
>
> Increasing number of tokens will make repairs worse not better. You can
> just split the sub ranges into smaller chunks, you dont need to use vnodes
> to do that. Simple approach is to iterate through each host token range and
> split by N and repair them (ie https://github.com/onzra/
> cassandra_range_repair
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_onzra_cassandra-5Frange-5Frepair=DwMFAg=djjh8EKwHtOepW4Bjau0lKhLlu-DxM1dlgP0rrLsOzY=O20_rcIS1QazTO3_J10I1cPIygxnuBZ4sUCz1TS16XE=nHN7toaSQUjfwSABx1KXlVHLYmlaEcUMYPHzC3ky5TM=Ph50r9wV17T72OEwI3FsbAXBVZ3Pt-AmACQZYdsQqgk=>)
> To be more efficient you can grab ranges and split based on number of
> partitions in the range (ie fetch system.size_estimates and walk that) so
> you dont split empty or small ranges a ton unnecessarily, and because not
> all tables have some fixed N that is efficient.
>
>
>
> Using TLP's reaper https://github.com/thelastpickle/cassandra-reaper or
> DataStax OpsCenter's repair service is easiest solution without a lot of
> effort. Repairs are hard.
>
>
>
> Chris
>
>
>
> On Wed, Oct 4, 2017 at 11:48 AM, Jeff Jirsa <jji...@gmail.com> wrote:
>
> You don't need to change the number of vnodes, you can manually select
> CONTAINED token subranges and pass in -st and -et (just try to pick a
> number > 2^20 that is fully contained by at least one vnode).
>
>
>
>
>
>
>
>
>
> On Wed, Oct 4, 2017 at 9:46 AM, Mohapatra, Kishore <
> kishore.mohapa...@nuance.com> wrote:
>
> Hi,
>
> We are having a lot of problems in repair process. We use sub
> range repair. But most of the time, some ranges fails with streaming error
> or some other kind of error.
>
> So wondering if it will help if we increase the no. of VNodes from 256
> (default) to 512. But increasing the VNodes will be a lot of efforts, as it
> involves wiping out the data and bootstrapping.
>
> So is there any other way of splitting the range into small ranges ?
>
>
>
> We are using version 2.1.15.4 at the moment.
>
>
>
> Thanks
>
>
>
> *Kishore Mohapatra*
>
> Principal Operations DBA
>
> Seattle, WA
>
> Email : kishore.mohapa...@nuance.com
>
>
>
>
>
>
>
>
>
>
>


Re: Increasing VNodes

2017-10-04 Thread Chris Lohfink
Increasing the number of tokens will make repairs worse, not better. You can
just split the subranges into smaller chunks; you don't need to use vnodes
to do that. A simple approach is to iterate through each host's token ranges,
split each by N, and repair them (ie
https://github.com/onzra/cassandra_range_repair). To be more efficient you
can grab the ranges and split based on the number of partitions in each range (ie
fetch system.size_estimates and walk that) so you don't split empty or small
ranges a ton unnecessarily, and because not all tables have some fixed N
that is efficient.
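
A rough sketch of that size_estimates-driven splitting (DataStax Java driver 3.x
assumed; the keyspace/table names, the one-million-partitions split factor, and
shelling out to nodetool are all illustrative, and a real tool would need error
handling and retries):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import java.math.BigInteger;

    public class SubrangeRepair
    {
        public static void main(String[] args) throws Exception
        {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect())
            {
                for (Row r : session.execute(
                    "SELECT range_start, range_end, partitions_count FROM system.size_estimates " +
                    "WHERE keyspace_name = 'myks' AND table_name = 'mytable'"))
                {
                    BigInteger start = new BigInteger(r.getString("range_start"));
                    BigInteger end = new BigInteger(r.getString("range_end"));
                    long partitions = r.getLong("partitions_count");

                    // Skip the wrap-around range for simplicity in this sketch.
                    if (end.compareTo(start) <= 0)
                        continue;

                    // Split big ranges harder, leave small/empty ones alone.
                    int splits = (int) Math.max(1, partitions / 1_000_000);
                    BigInteger step = end.subtract(start).divide(BigInteger.valueOf(splits));
                    for (int i = 0; i < splits; i++)
                    {
                        BigInteger st = start.add(step.multiply(BigInteger.valueOf(i)));
                        BigInteger et = (i == splits - 1) ? end : st.add(step);
                        new ProcessBuilder("nodetool", "repair", "-st", st.toString(),
                                           "-et", et.toString(), "myks", "mytable")
                            .inheritIO().start().waitFor();
                    }
                }
            }
        }
    }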

Using TLP's reaper (https://github.com/thelastpickle/cassandra-reaper) or
DataStax OpsCenter's repair service is the easiest solution without a lot of
effort. Repairs are hard.

Chris

On Wed, Oct 4, 2017 at 11:48 AM, Jeff Jirsa  wrote:

> You don't need to change the number of vnodes, you can manually select
> CONTAINED token subranges and pass in -st and -et (just try to pick a
> number > 2^20 that is fully contained by at least one vnode).
>
>
>
>
> On Wed, Oct 4, 2017 at 9:46 AM, Mohapatra, Kishore <
> kishore.mohapa...@nuance.com> wrote:
>
>> Hi,
>>
>> We are having a lot of problems in repair process. We use sub
>> range repair. But most of the time, some ranges fails with streaming error
>> or some other kind of error.
>>
>> So wondering if it will help if we increase the no. of VNodes from 256
>> (default) to 512. But increasing the VNodes will be a lot of efforts, as it
>> involves wiping out the data and bootstrapping.
>>
>> So is there any other way of splitting the range into small ranges ?
>>
>>
>>
>> We are using version 2.1.15.4 at the moment.
>>
>>
>>
>> Thanks
>>
>>
>>
>> *Kishore Mohapatra*
>>
>> Principal Operations DBA
>>
>> Seattle, WA
>>
>> Email : kishore.mohapa...@nuance.com
>>
>>
>>
>>
>>
>
>


Re: Read-/ Write Latency - Cassandra 2.1 .15 vs 3.10

2017-10-03 Thread Chris Lohfink
The RecentReadLatency metrics have been deprecated for years (since 1.1 or 1.2) and were 
removed in 2.2. They were very misleading metrics. Instead, pull the Table-level 
ReadLatency metrics from the org.apache.cassandra.metrics domain: 
http://cassandra.apache.org/doc/latest/operating/metrics.html?highlight=metrics#table-metrics
 


Chris

> On Oct 3, 2017, at 10:06 AM, Anumod Mullachery  
> wrote:
> 
> Hi,
> 
> We were running Splunk queries to pull read/write latency. It's working fine 
> in 2.1.15, but not returning results on the upgraded version 3.10. The beans 
> used in the script are shown below. Let me know if there are any changes in 
> functionality between 2.1.15 and 3.10, or if it was replaced by some other bean.
> 
> perf_queries = { "org.apache.cassandra.db:type=StorageProxy" =>
>   "RecentReadLatencyMicros,RecentWriteLatencyMicros", }
> 
> stage_queries = { "org.apache.cassandra.request:type=*" =>
>   "ActiveCount,PendingTasks,CurrentlyBlockedTasks", }
> 
> curl http://localhost:8778/jolokia/read/org.apache.cassandra.db:type=StorageProxy/RecentReadLatencyMicros,RecentWriteLatencyMicros
> 
> curl http://localhost:8778/jolokia/read/org.apache.cassandra.request:type=*/ActiveCount,PendingTasks,CurrentlyBlockedTasks
> 
> ~ Thanks ~
> 
> Anumod



Re: Do not use Cassandra 3.11.0+ or Cassandra 3.0.12+

2017-09-12 Thread Chris Lohfink
Last I've seen of it, OpsCenter does not collect this metric. I don't think any 
monitoring tools do.

Chris

> On Sep 11, 2017, at 4:06 PM, CPC  wrote:
> 
> Hi,
> 
> Is this bug fixed in DSE 5.1.3? As I understand it, calling the JMX getTombStoneRatio
> triggers that bug. We are using OpsCenter as well; do you have any idea
> whether OpsCenter uses/calls this method?
> 
> Thanks
> 
> On Aug 29, 2017 6:35 AM, "Jeff Jirsa"  wrote:
> 
>> I shouldn't actually say I don't think it can happen on 3.0 - I haven't
>> seen this happen on 3.0 without some other code change to enable it, but
>> like I said, we're still investigating.
>> 
>> --
>> Jeff Jirsa
>> 
>> 
>>> On Aug 28, 2017, at 8:30 PM, Jeff Jirsa  wrote:
>>> 
>>> For what it's worth, I don't think this impacts 3.0 without adding some
>> other code change (the reporter of the bug on 3.0 had added custom metrics
>> that exposed a concurrency issue).
>>> 
>>> We're looking at it on 3.11. I think 13038 made it far more likely to
>> occur, but I think it could have happened pre-13038 as well (would take
>> some serious luck with your deletion time distribution though - the
>> rounding in 13038 does make it more likely, but the race was already there).
>>> 
>>> --
>>> Jeff Jirsa
>>> 
>>> 
 On Aug 28, 2017, at 8:24 PM, Jay Zhuang 
>> wrote:
 
 We're using 3.0.12+ for a few months and haven't seen the issue like
 that. Do we know what could trigger the problem? Or is 3.0.x really
 impacted?
 
 Thanks,
 Jay
 
> On 8/28/17 6:02 AM, Hannu Kröger wrote:
> Hello,
> 
> Current latest Cassandra version (3.11.0, possibly also 3.0.12+) has a
>> race
> condition that causes Cassandra to create broken sstables (stats file
>> in
> sstables to be precise).
> 
> Bug described here:
> https://issues.apache.org/jira/browse/CASSANDRA-13752
> 
> This change might be causing it (but not sure):
> https://issues.apache.org/jira/browse/CASSANDRA-13038
> 
> Other related issues:
> https://issues.apache.org/jira/browse/CASSANDRA-13718
> https://issues.apache.org/jira/browse/CASSANDRA-13756
> 
> I would not recommend using 3.11.0 nor upgrading to 3.0.12 or higher
>> before
> this is fixed.
> 
> Cheers,
> Hannu
> 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
 For additional commands, e-mail: user-h...@cassandra.apache.org
 
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>> 
>> 





Re: Cassandra CF Level Metrics (Read, Write Count and Latency)

2017-09-01 Thread Chris Lohfink
To be future-compatible you should consider using `type=Table` instead of
`type=ColumnFamily`, depending on your version.

> not matching with the total read requests

The table-level metrics for Read/Write latencies will not match the number
of requests you've made. This metric is the amount of time it took to
perform the read/write locally on that node. The
`type=ClientRequest` mbeans are the ones that are at the coordinator level,
including querying all the replicas and merging results, etc.

The table metrics do have a name=CoordinatorReadLatency (also Scan for
range queries) mbean, which may be what you're looking for. Table-level write
coordinator metrics are missing since the read coordinator metrics were
actually added for speculative retry, so I think writes were overlooked.

Chris

On Thu, Aug 31, 2017 at 10:58 PM, Jai Bheemsen Rao Dhanwada <
jaibheem...@gmail.com> wrote:

> okay, let me try it out
>
> On Thu, Aug 31, 2017 at 8:30 PM, Christophe Schmitz <
> christo...@instaclustr.com> wrote:
>
>> Hi Jai,
>>
>> The ReadLatency MBean expose a few metrics, including the count one,
>> which is the total read requests you are after.
>> See attached screenshot
>>
>> Cheers,
>>
>> Christophe
>>
>> On 1 September 2017 at 09:21, Jai Bheemsen Rao Dhanwada <
>> jaibheem...@gmail.com> wrote:
>>
>>> I did look at the document and tried setting up the metric as follows,
>>> but it is not matching the total read requests. I am using
>>> "ReadLatency_OneMinuteRate"
>>>
>>> /org.apache.cassandra.metrics:type=ColumnFamily,keyspace=*,scope=*,name=ReadLatency
>>>
>>> On Thu, Aug 31, 2017 at 4:17 PM, Christophe Schmitz <
>>> christo...@instaclustr.com> wrote:
>>>
 Hello Jai,

 Did you have a look at the following page:
 http://cassandra.apache.org/doc/latest/operating/metrics.html

 In your case, you would want the following MBeans:
 org.apache.cassandra.metrics:type=Table,keyspace=<Keyspace>,scope=<Table>,name=<MetricName>
 With MetricName set to ReadLatency and WriteLatency

 Cheers,

 Christophe



 On 1 September 2017 at 09:08, Jai Bheemsen Rao Dhanwada <
 jaibheem...@gmail.com> wrote:

> Hello All,
>
> I am looking to capture the CF level Read, Write count and Latency. As
> of now I am using Telegraf plugin to capture the JMX metrics.
>
> What is the MBeans, scope and metric to look for the CF level metrics?
>
>



>>>
>>
>>
>> --
>>
>>
>> *Christophe Schmitz*
>> *Director of consulting EMEA*AU: +61 4 03751980 / FR: +33 7 82022899
>>
>>
>> 
>>
>> 
>> 
>> 
>>
>> Read our latest technical blog posts here.
>>
>> This email has been sent on behalf of Instaclustr Pty. Limited
>> (Australia) and Instaclustr Inc (USA).
>>
>> This email and any attachments may contain confidential and legally
>> privileged information.  If you are not the intended recipient, do not copy
>> or disclose its content, but please reply to this email immediately and
>> highlight the error to the sender and then immediately delete the message.
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>
>


Re: Cassandra - Nodes can't restart due to java.lang.OutOfMemoryError: Direct buffer memory

2017-08-31 Thread Chris Lohfink
What version of Java are you running? There is a "kind of leak" in the JVM
around this that you may be running into; you can try
-Djdk.nio.maxCachedBufferSize=262144 if you are on 8u102 or above. You can
also try increasing the size allowed for direct byte buffers with
-XX:MaxDirectMemorySize=?G; it defaults to the size of the heap.
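
As a quick way to see how much direct memory is actually being held, here is a
small sketch using the standard java.lang.management API (nothing
Cassandra-specific; to inspect a running Cassandra process you would read the
same java.nio:type=BufferPool mbeans over JMX instead):

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

public class DirectBufferUsage {
    public static void main(String[] args) {
        // "direct" covers NIO DirectByteBuffers; "mapped" covers memory-mapped files.
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            System.out.printf("%-7s count=%d used=%d bytes capacity=%d bytes%n",
                    pool.getName(), pool.getCount(),
                    pool.getMemoryUsed(), pool.getTotalCapacity());
        }
    }
}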

Some NIO channel operations use temporary DirectByteBuffers which are
> cached in thread-local caches to avoid having to allocate / free a buffer
> at every operation.
> Unfortunately, there is no bound imposed on the size of buffers added to
> the thread-local caches. So, infrequent channel operations that require a
> very large buffer can create a native memory leak.



> *Ability to limit the capacity of buffers that can be held in the
> temporary buffer cache*The system property jdk.nio.maxCachedBufferSize has
> been introduced in 8u102 to limit the memory used by the "temporary buffer
> cache." The temporary buffer cache is a per-thread cache of direct memory
> used by the NIO implementation to support applications that do I/O with
> buffers backed by arrays in the java heap. The value of the property is the
> maximum capacity of a direct buffer that can be cached. If the property is
> not set, then no limit is put on the size of buffers that are cached.
> Applications with certain patterns of I/O usage may benefit from using this
> property. In particular, an application that does I/O with large
> multi-megabyte buffers at startup but does I/O with small buffers may see a
> benefit to using this property. Applications that do I/O using direct
> buffers will not see any benefit to using this system property.
> See JDK-8147468 


Chris

On Thu, Aug 31, 2017 at 4:59 AM, Jonathan Baynes <
jonathan.bay...@tradeweb.com> wrote:

> I wonder if its related to this bug  (below) that’s currently unresolved,
> albeit it being reproduced way back in 2.1.11
>
>
>
> https://issues.apache.org/jira/browse/CASSANDRA-10689
>
>
>
>
>
> *From:* qf zhou [mailto:zhouqf2...@gmail.com]
> *Sent:* 31 August 2017 10:58
> *To:* user@cassandra.apache.org
> *Subject:* Re: Cassandra - Nodes can't restart due to
> java.lang.OutOfMemoryError: Direct buffer memory
>
>
>
> I am using Cassandra 3.9 with cqlsh 5.0.1.
>
> On 31 Aug 2017, at 17:54, Jonathan Baynes wrote:
>
>
>
> again
>
>
>
> 
>
> This e-mail may contain confidential and/or privileged information. If you
> are not the intended recipient (or have received this e-mail in error)
> please notify the sender immediately and destroy it. Any unauthorized
> copying, disclosure or distribution of the material in this e-mail is
> strictly forbidden. Tradeweb reserves the right to monitor all e-mail
> communications through its networks. If you do not wish to receive
> marketing emails about our products / services, please let us know by
> contacting us, either by email at contac...@tradeweb.com or by writing to
> us at the registered office of Tradeweb in the UK, which is: Tradeweb
> Europe Limited (company number 3912826), 1 Fore Street Avenue London EC2Y
> 9DT. To see our privacy policy, visit our website @ www.tradeweb.com.
>


Re: Nodetool tablehistograms

2017-07-19 Thread Chris Lohfink
It's the number of sstables that may have been read from. This includes
sstables that had their bloom filters checked (which may hit disk). This
changes a bit in https://issues.apache.org/jira/browse/CASSANDRA-13120 to
be only the sstables that it's actually reading from.


On Wed, Jul 19, 2017 at 11:04 AM, Abhinav Solan 
wrote:

> Hi Everyone,
>
> Here is the result of my tablehistograms command on one of our tables.
>
> Percentile  SSTables   Write Latency   Read Latency   Partition Size   Cell Count
>                             (micros)       (micros)          (bytes)
> 50%             4.00           73.46         545.79           152321         8239
> 75%            10.00           88.15        2346.80           379022        20501
> 95%            10.00          152.32        4055.27          1358102        73457
> 98%            10.00          219.34        4866.32          1955666        88148
> 99%            10.00          315.85        5839.59          1955666       105778
> Min             0.00           17.09          35.43               73            3
> Max            10.00        36157.19       52066.35          2816159       152321
>
> What does SSTables column represent here?
> Does it mean how many SSTables the read is spanning to?
>
> Thanks,
> Abhinav
>


Re: reduced num_token = improved performance ??

2017-07-12 Thread Chris Lohfink
Probably worth mentioning that some operational procedures like repairs,
bootstrapping, etc. are helped massively by using fewer tokens. Incremental
repair is one of the things I would say is most impacted by it, since fewer
tokens mean fewer local ranges to iterate through and less anticompaction.
I would highly recommend using far fewer than 256 in 3.x.

Chris

On Tue, Jul 11, 2017 at 8:36 PM, Justin Cameron 
wrote:

> Hi,
>
> Using fewer vnodes means you'll have a higher chance of hot spots in your
> cluster. Hot spots in Cassandra are nodes that, by random chance, are
> responsible for a higher percentage of the token space than others. This
> means they will receive more data and also more traffic/load than other
> nodes in the cluster.
>
> CASSANDRA-7032 goes a long way towards addressing this issue by allocating
> vnode tokens more intelligently, rather than just randomly assigning them.
> If you're using a version of Cassandra that contains this feature (3.0+),
> you can use a smaller number of vnodes in your cluster.
>
> A high number of vnodes won't affect performance for most Cassandra
> workloads, but if you're running tasks that need to do token-range scans
> (such as Spark), there is usually a significant performance hit.
>
> If you're on C* 3.0+ and are using Spark (or similar workloads - cassandra
> lucene index plugin is also affected) then I'd recommend using fewer vnodes
> - 16 would be ok. You'll probably still see some variance in token-space
> ownership between nodes, but the trade-off for better Spark performance
> will likely be worth it.
>
> Justin
>
> On Wed, 12 Jul 2017 at 00:34 ZAIDI, ASAD A  wrote:
>
>> Hi Folks,
>>
>>
>>
>> Pardon me if I’m missing  something obvious.  I’m still using
>> apache-cassandra 2.2 and planning for upgrade to  3.x.
>>
>> I came across this jira [https://issues.apache.org/jira/browse/CASSANDRA-7032]
>> that suggests reducing num_token may improve general performance of
>> Cassandra, e.g. having num_token=16 instead of 256 may help!
>>
>>
>>
>> Can you please suggest whether having a lower num_token would provide real
>> performance benefits, or if it comes with any downsides that we should also
>> consider? I’ll much appreciate your insights.
>>
>>
>>
>> Thank you
>>
>> Asad
>>
> --
>
>
> *Justin Cameron*Senior Software Engineer
>
>
> 
>
>
> This email has been sent on behalf of Instaclustr Pty. Limited (Australia)
> and Instaclustr Inc (USA).
>
> This email and any attachments may contain confidential and legally
> privileged information.  If you are not the intended recipient, do not copy
> or disclose its content, but please reply to this email immediately and
> highlight the error to the sender and then immediately delete the message.
>


Re: Understanding of cassandra metrics

2017-07-07 Thread Chris Lohfink
The coordinator read/scan metrics (Scan is just a different name for Range, so
it is the coordinator's view of RangeLatency) are the latencies from the
coordinator's perspective, so they include network latency between replicas
and such. This was actually added for speculative retry (which is why there is
no CoordinatorWriteLatency); only CoordinatorReadLatency is used for it,
however.

The Read/RangeLatency metrics are for local reads, basically just how long it
takes to read from disk and merge data across sstables.

The View* metrics are only relevant to materialized views. There actually
is a partition lock for updates, which ViewLockAcquireTime gives visibility
into. Also, there are sometimes reads required for updating materialized
views, which ViewReadTime tracks. For more details I'd recommend
https://opencredo.com/everything-need-know-cassandra-materialized-views/

Chris

On Fri, Jul 7, 2017 at 9:42 AM, ZAIDI, ASAD A  wrote:

> What exactly does mean CoordinatorScanLatency for example
>
> CoordinatorScanLatency  is a timer metric that presents coordinator range
> scan latency for a table.
>
> Is it latency on full table scan or maybe range scan by clustering key?
>
> It is a range scan. The clustering key is only used to store
> data in sorted fashion; the partition key along with the chosen partitioner
> helps in range scans of data.
>
> Can anybody write into partition while locked?
>
> Writes are atomic – it depends on your chosen consistency
> level to determine if writes will fail or succeed.
>
>
>
> *From:* Павел Сапежко [mailto:amelius0...@gmail.com]
> *Sent:* Friday, July 07, 2017 8:23 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Understanding of cassandra metrics
>
>
>
> Do you really think that I don't read the docs? Is there enough
> information in the documentation? I think not. What exactly does
> CoordinatorScanLatency
> mean, for example? Is it the latency of a full table scan, or maybe a range
> scan by clustering key? What exactly does ViewLockAcquireTime mean? What is a
> "partition lock"? Can anybody write into a partition while it is locked? Etc.
>
> On Fri, 7 Jul 2017 at 13:01, Ivan Iliev wrote:
>
> 1st result on google returns:
>
>
>
> http://cassandra.apache.org/doc/latest/operating/metrics.html
> 
>
>
>
> On Fri, Jul 7, 2017 at 12:16 PM, Павел Сапежко 
> wrote:
>
> Hello, I have several questions about Cassandra metrics. What exactly do
> the following metrics mean:
>
>- CoordinatorReadLatency
>- CoordinatorScanLatency
>- ReadLatency
>- RangeLatency
>- ViewLockAcquireTime
>- ViewReadTime
>
> --
>
> Best regards,
>
> Павел Сапежко
>
> skype: p.sapezhko
>
>
>
> --
>
> Best regards,
>
> Павел Сапежко
>
> skype: p.sapezhko
>


Re: what is MemtableReclaimMemory mean ??

2017-05-01 Thread Chris Lohfink
A question though: how many tables do you have? If you have more than a few
hundred, it could be bottlenecking the flushing if it is flushing very
frequently.

On Mon, May 1, 2017 at 9:32 PM, Chris Lohfink <clohfin...@gmail.com> wrote:

> There's a read barrier to stop reclaiming a memtable when there are
> requests actively reading it. The *MemtableReclaimMemory* pool offloads
> that wait instead of blocking the caller. It in itself is not going to use
> any CPU or increase load. It will, however, block the releasing of the
> memtable resources, which might cause additional heap allocation pressure.
> It's more likely a symptom of GCs or reads being slow than the cause of the
> issue, however.
>
> Chris
>
> On Mon, May 1, 2017 at 9:01 PM, Pranay akula <pranay.akula2...@gmail.com>
> wrote:
>
>> Hi Alain,
>>
>> When "*MemtableReclaimMemory*" pending tasks are increasing, it slowly
>> backs up reads and writes, mostly writes. Yes, I am seeing a bit of high GC
>> pressure; currently we are using a 24GB heap and G1GC collection. I tried
>> changing the memtable flush threshold; it did help a little but not much.
>> I am not seeing any errors in the logs.
>>
>>
>> Thanks
>> Pranay.
>>
>> On Thu, Apr 27, 2017 at 6:08 AM, Alain RODRIGUEZ <arodr...@gmail.com>
>> wrote:
>>
>>> Hi Pranay,
>>>
>>> According to http://docs.datastax.com/en/ca
>>> ssandra/3.0/cassandra/tools/toolsTPstats.html, "*MemtableReclaimMemory*"
>>> is the thread pool used for "Making unused memory available". I don't know
>>> much about it since it was never an issue for me. Neither did I heard much
>>> about it.
>>>
>>>
>>>- Are pending tasks staying high for a long period? `watch -d
>>>nodetool tpstats`
>>>- What are your GC settings?
>>>- Any other threads pending, blocked or dropped?
>>>- Do you have errors or warnings in your logs?
>>>- Any GC pressure? (monitored through charts or logs at INFO level,
>>>or WARN on recent versions)
>>>
>>>
>>> C*heers,
>>> ---
>>> Alain Rodriguez - @arodream - al...@thelastpickle.com
>>> France
>>>
>>> The Last Pickle - Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>>
>>>
>>>
>>> 2017-04-16 16:04 GMT+02:00 Pranay akula <pranay.akula2...@gmail.com>:
>>>
>>>> Hi,
>>>>
>>>> what is *MemtableReclaimMemory* mean in nodetooltpstats ?? does this
>>>> mean trying to flushing memtable from memory to SStables.
>>>>
>>>> I can see sometimes increase in pending tasks of  MemtableReclaimMemory
>>>> in nodetool tpstats, at that time i can see increase in load on those 
>>>> nodes.
>>>>
>>>> Does decreasing memtable_cleanup_threshold will help ??
>>>>
>>>> Thanks
>>>> Pranay.
>>>>
>>>
>>>
>>
>


Re: what is MemtableReclaimMemory mean ??

2017-05-01 Thread Chris Lohfink
There's a read barrier to stop reclaiming a memtable when there are requests
actively reading it. The *MemtableReclaimMemory* pool offloads that wait
instead of blocking the caller. It in itself is not going to use any CPU or
increase load. It will, however, block the releasing of the memtable
resources, which might cause additional heap allocation pressure. It's more
likely a symptom of GCs or reads being slow than the cause of the issue,
however.

Chris

On Mon, May 1, 2017 at 9:01 PM, Pranay akula 
wrote:

> Hi Alain,
>
> When "*MemtableReclaimMemory*" pending tasks are increasing, it slowly
> backs up reads and writes, mostly writes. Yes, I am seeing a bit of high GC
> pressure; currently we are using a 24GB heap and G1GC collection. I tried
> changing the memtable flush threshold; it did help a little but not much.
> I am not seeing any errors in the logs.
>
>
> Thanks
> Pranay.
>
> On Thu, Apr 27, 2017 at 6:08 AM, Alain RODRIGUEZ 
> wrote:
>
>> Hi Pranay,
>>
>> According to http://docs.datastax.com/en/cassandra/3.0/cassandra/tools/to
>> olsTPstats.html, "*MemtableReclaimMemory*" is the thread pool used for
>> "Making unused memory available". I don't know much about it since it was
>> never an issue for me. Neither did I heard much about it.
>>
>>
>>- Are pending tasks staying high for a long period? `watch -d
>>nodetool tpstats`
>>- What are your GC settings?
>>- Any other threads pending, blocked or dropped?
>>- Do you have errors or warnings in your logs?
>>- Any GC pressure? (monitored through charts or logs at INFO level,
>>or WARN on recent versions)
>>
>>
>> C*heers,
>> ---
>> Alain Rodriguez - @arodream - al...@thelastpickle.com
>> France
>>
>> The Last Pickle - Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>>
>>
>> 2017-04-16 16:04 GMT+02:00 Pranay akula :
>>
>>> Hi,
>>>
>>> what is *MemtableReclaimMemory* mean in nodetooltpstats ?? does this
>>> mean trying to flushing memtable from memory to SStables.
>>>
>>> I can see sometimes increase in pending tasks of  MemtableReclaimMemory
>>> in nodetool tpstats, at that time i can see increase in load on those nodes.
>>>
>>> Does decreasing memtable_cleanup_threshold will help ??
>>>
>>> Thanks
>>> Pranay.
>>>
>>
>>
>


Re: system_auth replication strategy

2017-04-01 Thread Chris Lohfink
You should use NetworkTopologyStrategy with a high RF in each DC, or something
like the everywhere strategy.

You should never really use SimpleStrategy, especially if you have multiple DCs
and are using LOCAL or EACH consistencies. It's more for test and dev setups
than a prod environment.

The problem is that a LOCAL consistency level DOES ensure the request will be
targeted in the same DC as the coordinator, but SimpleStrategy doesn't ensure
there will be replicas of the data in each DC. This means that if there are no
nodes in your DC that are a replica of the data, you can get an unavailable
exception even if 100% of your nodes are up and healthy.

So if you have 10 nodes, 5 per dc, 2 dcs and a RF of 3 with simple strategy

DC1
[0] [10] [30] [40] [45]

DC2
[1] [11] [15] [21] [41]

Especially with random token assignment like the above, a partition with a
token of 11 can end up on

DC1
[ ] [ ] [ ] [ ] [ ]

DC2
[ ] [*] [*] [*] [ ]

In which case an insert or read sent to a node in DC1 with LOCAL_ONE or
LOCAL_QUORUM will result in an unavailable exception.
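
To make the above concrete, here is a toy sketch (not Cassandra code) of
SimpleStrategy-style placement: it just walks the ring clockwise and takes the
next RF tokens with no awareness of DCs, which is exactly why all three
replicas of token 11 land in DC2 with the layout above:

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class SimpleStrategyToy {
    // token -> datacenter of the node owning that token (one token per node here)
    static final TreeMap<Long, String> RING = new TreeMap<>();
    static {
        for (long t : new long[] {0, 10, 30, 40, 45}) RING.put(t, "DC1");
        for (long t : new long[] {1, 11, 15, 21, 41}) RING.put(t, "DC2");
    }

    // SimpleStrategy-style placement: the next RF tokens clockwise from the
    // partition's token, ignoring which DC the owning nodes are in.
    static List<String> replicas(long partitionToken, int rf) {
        List<String> result = new ArrayList<>();
        Long token = RING.ceilingKey(partitionToken);
        if (token == null) token = RING.firstKey();   // wrap around the ring
        while (result.size() < rf) {
            result.add(RING.get(token) + "/token=" + token);
            token = RING.higherKey(token);
            if (token == null) token = RING.firstKey();
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [DC2/token=11, DC2/token=15, DC2/token=21]: no replica in DC1,
        // so a LOCAL_ONE/LOCAL_QUORUM request coordinated in DC1 is unavailable.
        System.out.println(replicas(11, 3));
    }
}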

Chris

> On Apr 1, 2017, at 10:51 AM, Vlad  wrote:
> 
> Hi,
> 
> what is the suitable replication strategy for system_auth keyspace?
> As I understand factor should be equal to total nodes number, so can we use 
> SimpleStrategy? Does it ensure that queries with LOCAL_ONE consistency level 
> will be targeted to local DC (or the same node)?
> 
> Thanks.



Re: nodes are always out of sync

2017-04-01 Thread Chris Lohfink
Repairs do not have the ability to instantly build a perfect view of the
data across your 3 nodes at an exact point in time. When a piece of data is
written there is a delay between when it is applied on each node, even if
it's just 500ms. So if a request to read the data and build the merkle tree
finishes on node1 at 12:01 while node2 finishes at 12:02, that
one-minute-or-so delta (even if only a few seconds, or if using snapshot
repairs) means the partition/range hashes in the merkle trees can be
different. On a moving data set it's almost impossible to have the replicas
perfectly in sync for a repair. I wouldn't worry about that log message. If
you are worried about consistency between your reads/writes, use EACH_QUORUM
or LOCAL_QUORUM for both.

Chris

On Thu, Mar 30, 2017 at 1:22 AM, Roland Otta 
wrote:

> hi,
>
> we see the following behaviour in our environment:
>
> cluster consists of 6 nodes (cassandra version 3.0.7). keyspace has a
> replication factor 3.
> clients are writing data to the keyspace with consistency one.
>
> we are doing parallel, incremental repairs with cassandra reaper.
>
> even if a repair just finished and we are starting a new one
> immediately, we can see the following entries in our logs:
>
> INFO  [RepairJobTask:1] 2017-03-30 10:14:00,782 SyncTask.java:73 -
> [repair #d0f651f6-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.188
> and /192.168.0.191 have 1 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:2] 2017-03-30 10:14:00,782 SyncTask.java:73 -
> [repair #d0f651f6-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.188
> and /192.168.0.189 have 1 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:4] 2017-03-30 10:14:00,782 SyncTask.java:73 -
> [repair #d0f651f6-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.189
> and /192.168.0.191 have 1 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:2] 2017-03-30 10:14:03,997 SyncTask.java:73 -
> [repair #d0fa70a1-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.26
> and /192.168.0.189 have 2 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:1] 2017-03-30 10:14:03,997 SyncTask.java:73 -
> [repair #d0fa70a1-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.26
> and /192.168.0.191 have 2 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:4] 2017-03-30 10:14:03,997 SyncTask.java:73 -
> [repair #d0fa70a1-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.189
> and /192.168.0.191 have 2 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:1] 2017-03-30 10:14:05,375 SyncTask.java:73 -
> [repair #d0fbd033-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.189
> and /192.168.0.191 have 1 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:2] 2017-03-30 10:14:05,375 SyncTask.java:73 -
> [repair #d0fbd033-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.189
> and /192.168.0.190 have 1 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:4] 2017-03-30 10:14:05,375 SyncTask.java:73 -
> [repair #d0fbd033-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.190
> and /192.168.0.191 have 1 range(s) out of sync for ad_event_history
>
> we cant see any hints on the systems ... so we thought everything is
> running smoothly with the writes.
>
> do we have to be concerned about the nodes always being out of sync or
> is this a normal behaviour in a write intensive table (as the tables
> will never be 100% in sync for the latest inserts)?
>
> bg,
> roland
>
>
>


Re: partition sizes reported by nodetool tablehistograms

2017-02-24 Thread Chris Lohfink
It's the decompressed size of the partitions. Each sstable has a stats
component that contains histograms for the size and number of columns in
the partitions (among other things; you can see it with the sstablemetadata
tool), and tablehistograms merges them across sstables and gives the results.

Chris

On Fri, Feb 24, 2017 at 4:53 PM, John Sanda  wrote:

> I am working on some issues involving really big partitions. I have been
> making extensive use of nodetool tablehistograms. What exactly is the
> partition size being reported? I have a table for which the max value
> reported is about 3.5 GB, but running du -h against the table data
> directory reports 548 MB. Are the partition sizes reported by
> tablehistograms the decompressed size on disk?
>
> - John
>


Re: Help

2017-01-09 Thread Chris Lohfink
Do you have any monitoring setup around garbage collections?  A GC +
network latency > write timeout will cause intermittent hints.

On Sun, Jan 8, 2017 at 10:30 PM, Anshu Vajpayee 
wrote:

> Gossip shows - all nodes are up.
>
> But when  we perform writes , coordinator stores the hints. It means  -
> coordinator was not able to deliver the writes to few nodes after meeting
> consistency requirements.
>
> The nodes for which  writes were failing, are in different DC. Those nodes
> do not have any load.
>
> Gossips shows everything is up.  I already set write timeout to 60 sec,
> but no help.
>
> Can anyone encounter this scenario ? Network side everything is fine.
>
> Cassandra version is 2.1.13
>
> --
> *Regards,*
> *Anshu *
>
>
>


Re: Java GC pauses, reality check

2016-11-25 Thread Chris Lohfink
No tuning will eliminate GCs.

20-30 seconds is horrific and out of the ordinary. That most likely indicates
antipatterns and/or a poorly configured cluster. Sub-1s is realistic, but with
some workloads it may still require some tuning to maintain. Some workloads
are very unfriendly to GCs though (e.g. heavy tombstones, very wide
partitions).

Chris

On Fri, Nov 25, 2016 at 3:25 PM, S Ahmed  wrote:

> Hello!
>
> From what I understand java GC pauses are pretty much a fact of life, but
> you can tune the jvm to reduce the likelihood of the frequency and length
> of GC pauses.
>
> When using Cassandra, how frequent or long have these pauses known to be?
> Even with tuning, is it safe to assume they cannot be eliminated?
>
> Would a 20-30 second pause be something out of the ordinary?
>
> Thanks.
>


Re: Can a Select Count(*) Affect Writes in Cassandra?

2016-11-10 Thread Chris Lohfink
count(*) actually pages through all the data. So a select count(*) without
a limit would be expected to cause a lot of load on the system. The hit is
more than just I/O load and CPU; it also creates a lot of garbage that can
cause pauses slowing down the entire JVM. Some details here:
http://www.datastax.com/dev/blog/counting-keys-in-cassandra


You may want to consider maintaining the count yourself, using Spark, or, if
you just want a ballpark number, grabbing it from JMX.
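
For example, a rough sketch of grabbing that ballpark number over JMX. The
node list, keyspace/table (ks/tbl) and port are placeholders; the gauge is
named EstimatedPartitionCount under type=Table on 3.x, while older 2.x lines
expose it as EstimatedRowCount under type=ColumnFamily (and 2.0 may not expose
it at all). The value is per-node, so summing across nodes and dividing by the
replication factor only gives a rough cluster-wide total:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BallparkKeyCount {
    public static void main(String[] args) throws Exception {
        String[] nodes = {"10.0.0.1", "10.0.0.2", "10.0.0.3"};  // hypothetical node IPs
        int replicationFactor = 3;
        long total = 0;
        for (String node : nodes) {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + node + ":7199/jmxrmi");
            try (JMXConnector c = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = c.getMBeanServerConnection();
                ObjectName gauge = new ObjectName(
                        "org.apache.cassandra.metrics:type=Table,keyspace=ks,scope=tbl,name=EstimatedPartitionCount");
                // Each node reports its own estimate, counting every replica it holds.
                total += ((Number) mbs.getAttribute(gauge, "Value")).longValue();
            }
        }
        System.out.println("ballpark partition count: " + total / replicationFactor);
    }
}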

> Cassandra writes (mutations) are INSERTs, UPDATEs or DELETEs, it actually
has nothing to do with flushes. A flush is the operation of moving data
from memory (memtable) to disk (SSTable).

FWIW in 2.0 that's not completely accurate. Before 2.1 the process of
memtable flushing acquired a switchlock that blocks mutations during the
flush (the "pending task" metric is the measure of how many mutations are
blocked by this lock).

Chris

On Thu, Nov 10, 2016 at 8:10 AM, Shalom Sagges 
wrote:

> Hi Alexander,
>
> I'm referring to Writes Count generated from JMX:
> [image: Inline image 1]
>
> The higher curve shows the total write count per second for all nodes in
> the cluster and the lower curve is the average write count per second per
> node.
> The drop in the end is the result of shutting down one application node
> that performed this kind of query (we still haven't removed the query
> itself in this cluster).
>
>
> On a different cluster, where we already removed the "select count(*)"
> query completely, we can see that the issue was resolved (also verified
> this with running nodetool cfstats a few times and checked the write count
> difference):
> [image: Inline image 2]
>
>
> Naturally I asked how can a select query affect the write count of a node
> but weird as it seems, the issue was resolved once the query was removed
> from the code.
>
> Another side note.. One of our developers that wrote the query in the
> code, thought it would be nice to limit the query results to 560,000,000.
> Perhaps the ridiculously high limit might have caused this?
>
> Thanks!
>
>
>
> Shalom Sagges
> DBA
> T: +972-74-700-4035
>  
>  We Create Meaningful Connections
>
> 
>
>
> On Thu, Nov 10, 2016 at 3:21 PM, Alexander Dejanovski <
> a...@thelastpickle.com> wrote:
>
>> Hi Shalom,
>>
>> Cassandra writes (mutations) are INSERTs, UPDATEs or DELETEs, it actually
>> has nothing to do with flushes. A flush is the operation of moving data
>> from memory (memtable) to disk (SSTable).
>>
>> The Cassandra write path and read path are two different things and, as
>> far as I know, I see no way for a select count(*) to increase your write
>> count (if you are indeed talking about actual Cassandra writes, and not I/O
>> operations).
>>
>> Cheers,
>>
>> On Thu, Nov 10, 2016 at 1:21 PM Shalom Sagges 
>> wrote:
>>
>>> Yes, I know it's obsolete, but unfortunately this takes time.
>>> We're in the process of upgrading to 2.2.8 and 3.0.9 in our clusters.
>>>
>>> Thanks!
>>>
>>>
>>>
>>> Shalom Sagges
>>> DBA
>>> T: +972-74-700-4035 <+972%2074-700-4035>
>>>  
>>>  We Create Meaningful Connections
>>>
>>> 
>>>
>>>
>>> On Thu, Nov 10, 2016 at 1:31 PM, Vladimir Yudovin 
>>> wrote:
>>>
>>> As I said I'm not sure about it, but it will be interesting to check
>>> memory heap state with any JMX tool, e.g. https://github.com/patric
>>> -r/jvmtop
>>>
>>> By a way, why Cassandra 2.0.14? It's quit old and unsupported version.
>>> Even in 2.0 branch there is 2.0.17 available.
>>>
>>> Best regards, Vladimir Yudovin,
>>>
>>> *Winguzone  - Hosted Cloud
>>> CassandraLaunch your cluster in minutes.*
>>>
>>>
>>>  On Thu, 10 Nov 2016 05:47:37 -0500*Shalom Sagges
>>> >* wrote 
>>>
>>> Thanks for the quick reply Vladimir.
>>> Is it really possible that ~12,500 writes per second (per node in a 12
>>> nodes DC) are caused by memory flushes?
>>>
>>>
>>>
>>>
>>>
>>>
>>> Shalom Sagges
>>> DBA
>>> T: +972-74-700-4035
>>> 
>>> 
>>> 
>>> We Create Meaningful Connections
>>>
>>> 
>>>
>>>
>>>
>>> On Thu, Nov 10, 2016 at 11:02 AM, Vladimir Yudovin >> > wrote:
>>>
>>>
>>>
>>> This message may contain confidential and/or privileged 

Re: metrics not resetting after running proxyhistograms or cfhistograms

2016-10-25 Thread Chris Lohfink
That behavior went away with 2.2.
https://issues.apache.org/jira/browse/CASSANDRA-11752 adds decay to it to
make it recent data, which is much better than just resetting on reads.

Chris

On Tue, Oct 25, 2016 at 2:06 PM, Andrew Bialecki <
andrew.biale...@klaviyo.com> wrote:

> We're running 3.6. Running "nodetool proxyhistograms" twice, we're seeing
> the same data returned each time, but expecting the second run to be reset.
> We're seeing the same behavior with "nodetool cfhistograms."
>
> I believe resetting after each call used to be the behavior, did that
> change in recent version? We've confirmed metrics reset after the service
> is restarted.
>
> --
> AB
>


Re: system_distributed.repair_history table

2016-10-06 Thread Chris Lohfink
Small reminder that unless you have autosnapshot set to false in
cassandra.yaml, you will need to clear the snapshot (nodetool
clearsnapshot system_distributed) to actually delete the sstables.

On Thu, Oct 6, 2016 at 9:25 AM, Saladi Naidu <naidusp2...@yahoo.com> wrote:

> Thanks for the response. It makes sense to periodically truncate as it is
> only for debugging purposes
>
> Naidu Saladi
>
>
> On Wednesday, October 5, 2016 8:03 PM, Chris Lohfink <clohfin...@gmail.com>
> wrote:
>
>
> The only current solution is to truncate it periodically. I opened
> https://issues.apache.org/jira/browse/CASSANDRA-12701 about it if
> interested in following
>
> On Wed, Oct 5, 2016 at 4:23 PM, Saladi Naidu <naidusp2...@yahoo.com>
> wrote:
>
> We are seeing following warnings in system.log,  As *compaction_large_
> partition_warning_threshold_mb*   in cassandra.yaml file is as default
> value 100, we are seeing these warnings
>
> 110:WARN  [CompactionExecutor:91798] 2016-10-05 00:54:05,554
> BigTableWriter.java:184 - Writing large partition
> system_distributed/repair_ history:gccatmer:mer_admin_job (115943239 bytes)
>
> 111:WARN  [CompactionExecutor:91798] 2016-10-05 00:54:13,303 
> BigTableWriter.java:184 - Writing large partition system_distributed/repair_ 
> history:gcconfigsrvcks:user_ activation (163926097 bytes)
>
>
> When I looked at the table definition it is partitioned by keyspace and 
> cloumnfamily, under this partition, repair history is maintained. When I 
> looked at the count of rows in this partition, most of the paritions have 
> >200,000 rows and these will keep growing because of the partition strategy 
> right. There is no TTL on this so any idea what is the solution for reducing 
> partition size.
>
>
> I also looked at size_estimates table for this column family and found that 
> the mean partition size for each range is 50,610,179 which is very large 
> compared to any other tables.
>
>
>
>
>


Re: system_distributed.repair_history table

2016-10-05 Thread Chris Lohfink
The only current solution is to truncate it periodically. I opened
https://issues.apache.org/jira/browse/CASSANDRA-12701 about it if you are
interested in following it.

On Wed, Oct 5, 2016 at 4:23 PM, Saladi Naidu  wrote:

> We are seeing following warnings in system.log,  As
> *compaction_large_partition_warning_threshold_mb*  in cassandra.yaml file
> is as default value 100, we are seeing these warnings
>
> 110:WARN  [CompactionExecutor:91798] 2016-10-05 00:54:05,554
> BigTableWriter.java:184 - Writing large partition 
> system_distributed/repair_history:gccatmer:mer_admin_job
> (115943239 bytes)
>
> 111:WARN  [CompactionExecutor:91798] 2016-10-05 00:54:13,303 
> BigTableWriter.java:184 - Writing large partition 
> system_distributed/repair_history:gcconfigsrvcks:user_activation (163926097 
> bytes)
>
>
> When I looked at the table definition it is partitioned by keyspace and 
> cloumnfamily, under this partition, repair history is maintained. When I 
> looked at the count of rows in this partition, most of the paritions have 
> >200,000 rows and these will keep growing because of the partition strategy 
> right. There is no TTL on this so any idea what is the solution for reducing 
> partition size.
>
>
> I also looked at size_estimates table for this column family and found that 
> the mean partition size for each range is 50,610,179 which is very large 
> compared to any other tables.
>
>


Re: repair_history maintenance

2016-09-23 Thread Chris Lohfink
You should probably just periodically truncate it and clear snapshots when it
gets too big (it will probably take months before that's noticeable). I opened
https://issues.apache.org/jira/browse/CASSANDRA-12701 for discussion on
whether it should use TTLs.

Chris

On Thu, Sep 22, 2016 at 1:28 PM, sfesc...@gmail.com 
wrote:

> Should there be a maintenance schedule for repair_history? Meaning, a
> scheduled nodetool repair and/or deletion schedule? Or is it the intention
> that this table just grow for the life of the cluster?
>


Re: How to get information of each read/write request?

2016-08-30 Thread Chris Lohfink
Running a query with trace (`TRACING ON` in cqlsh) can give you a lot of
the information for an individual request. There has been a ticket to track
time in queue (https://issues.apache.org/jira/browse/CASSANDRA-8398) but no
one has worked on it yet.

Chris

On Tue, Aug 30, 2016 at 12:20 PM, Jun Wu  wrote:

> Hi there,
>
>  I'm very interested in the read/write path of Cassandra.
> Specifically, I'd like to know the whole process when a read/write request
> comes in.
>
> I noticed that for reach request it could go through multiple stages.
> For example, for read request, it could be in ReadStage,
> RequestResponseStage, ReadRepairStage. For each stage, actually it's a
> queue and thread pool to serve the request.
>
>First question is how to track each request in which stage.
>
>Also I'm very interested int the waiting time for each request to be in
> the queue, also the total queue in each stage. I noticed that in nodetool
> tpstats will have this information. However, I may want to get the
> real-time information of this, like print it out in the terminal.
>
> I'm wondering  whether someone has hints on this.
>
>Thanks in advance!
>
> Jun
>
>
>


Re: Hintedhandoff mutation

2016-08-17 Thread Chris Lohfink
Probably a question better suited for the dev@ list. But AFAIK the answer
is there is no way to tell the difference; it's probably safe to look at the
created time, as HHs tend to be older.

Chris

On Wed, Aug 17, 2016 at 5:02 AM, Stone Fang  wrote:

> Hi All,
>
> I want to differ hintedhandoff mutation and normal write mutation when i
> receive a mutation.
>
> how to get this in cassandra source code.have not found any attribute
> about this in Mutation class.
>
> or there is no way to get this.
>
>
> thanks
> stone
>


Re: a solution of getting cassandra cross-datacenter latency at a certain time

2016-08-08 Thread Chris Lohfink
If you invoke the values operation on the mbean every minute (or whatever
period) you can get a histogram of the cross-DC latencies. Just keep track
of the value of each bin in the histogram and look at the delta from the
previous time to the current time to find how many latencies occurred in
each bin's range during the period.
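
The bookkeeping is just a per-bin subtraction. A small sketch, assuming you
already have successive long[] snapshots of the histogram's bucket counts
(however you fetch them from the mbean each period):

import java.util.Arrays;

public class HistogramDelta {
    private long[] previous;  // bucket counts from the last poll, null until the first poll

    // Given the cumulative bucket counts from the current poll, return how many
    // latencies landed in each bucket since the previous poll.
    public long[] sinceLastPoll(long[] current) {
        long[] delta = new long[current.length];
        if (previous != null) {
            for (int i = 0; i < current.length && i < previous.length; i++) {
                delta[i] = current[i] - previous[i];
            }
        }
        // else: no baseline yet, so report zeros for the first period
        previous = Arrays.copyOf(current, current.length);
        return delta;
    }
}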

Also, you can wait for CASSANDRA-11752
<https://issues.apache.org/jira/browse/CASSANDRA-11752> for a "recent"
histogram (although it would need to be applied to this histogram as well).

Chris Lohfink

On Mon, Aug 8, 2016 at 8:50 AM, Ryan Svihla <r...@foundev.pro> wrote:

> The first issue I can think of is the Latency table, if I understand you
> correctly, has an unbounded size for the partition key of DC and will over
> time just get larger as more measurements are recorded.
>
> Regards,
>
> Ryan Svihla
>
> On Aug 8, 2016, at 2:58 AM, Stone Fang <cnstonef...@gmail.com> wrote:
>
> *objective*:get cassandra cross-datacenter latency in time
>
> *existing ticket:*
>
> there is a ticket [track cross-datacenter latency](https://issues.apache.org/jira/browse/CASSANDRA-11569),
> but it is a statistic accumulated since node start; I want to get the
> instantaneous value at a certain time.
>
> *thought*
>
> want to write a message into **MESSAGE TABLE** in 1s timer task(the period
> is similar to most of cross datacenter latency )
> ,and replicate to other datacenter,there will be a delay.and I capture
> it,and write to **LATENCY TABLE**.i can query the latency value from this
> table with the condition of certain time.
>
> *schema*
>
> message table for replicating data cross datacenter
>
>
> create keyspace heartbeat with replication=
> {'class':'NetworkTopologyStrategy','dc1':1, 'dc2':1...};
>
>
>
>  CREATE TABLE HEARTBEAT.MESSAGE{
> CREATED TIMESTAMP,
> FROMDC VARCHAR,
> PRIMARY KEY(CREATED,FROMDC)
> }
>
> latency Table for querying latency value
>
>  CREATE TABLE SYSTEM.LATENCY{
>  FROMDC VARCHAR,
>  ARRIVED TIMESTAMP,
>  CREATED TIMESTAMP,
>  LANTENCY BIGINT
>  PRIMARY KEY(FROMDC,ARRIVED)
> }WITH CLUSTERING ORDER BY(ARRIVED DESC);
>
> problems
>
> 1.can this solution work to get the cross-datacenter latency?
>
>
> 2.create heartbeat keyspace in cassandra bootstrap process,i need to load
> Heartbeat keyspace in Scheam.java.and save this keyspace into SystemSchema.
> also need to check if this keyspace has exist after first node start.so i
> think this is not a good solution.
>
> 3.compared to 1,try another solution.generate heartbeat message in a
> standalone jar.but always i need to capture heartbeat message mutation in
> cassandra.so i need to check if the mutation is about heartbeat message.and
> it seems strange that check the heartbeat keyspace which is not defined in
> cassandra,but third-party.
>
> hope to see your thought on this.
> thanks
> stone
>
>


Re: Approximate row count

2016-07-27 Thread Chris Lohfink
The number of keys is the number of *partition keys*, not row keys. You
have ~39434 partitions, ranging from 311 bytes to 386MB. Looks like you
have some wide partitions that contain many of your rows.

Chris Lohfink

On Wed, Jul 27, 2016 at 1:44 PM, Luke Jolly <l...@getadmiral.com> wrote:

> I have a table that I'm storing ad impression data in with every row being
> an impression.  I want to get a count of total rows / impressions.  I know
> that there is in the ball park of 200-400 million rows in this table and
> from my reading "Number of keys" in the output of cfstats should be a
> reasonably accurate estimate. However, it is 39434. Am I misunderstanding
> something? Every node in my cluster has a complete copy of the keyspace.
>
>
>   Table: impressions_2
>   SSTable count: 22
>   Space used (live): 51255709817
>   Space used (total): 51255709817
>   Space used by snapshots (total): 49415721741
>   Off heap memory used (total): 30824975
>   SSTable Compression Ratio: 0.20347134631246266
>   Number of keys (estimate): 39434
>   Memtable cell count: 18279
>   Memtable data size: 15897457
>   Memtable off heap memory used: 0
>   Memtable switch count: 1294
>   Local read count: 347016
>   Local read latency: 12.573 ms
>   Local write count: 109226238
>   Local write latency: 0.023 ms
>   Pending flushes: 0
>   Bloom filter false positives: 655
>   Bloom filter false ratio: 0.0
>   Bloom filter space used: 97552
>   Bloom filter off heap memory used: 97376
>   Index summary off heap memory used: 26719
>   Compression metadata off heap memory used: 30700880
>   Compacted partition minimum bytes: 311
>   Compacted partition maximum bytes: 386857368
>   Compacted partition mean bytes: 6424107
>   Average live cells per slice (last five minutes): 
> 1027.9502011434631
>   Maximum live cells per slice (last five minutes): 5722
>   Average tombstones per slice (last five minutes): 1.0
>   Maximum tombstones per slice (last five minutes): 1
>
>


Re: sstabledump failing for system keyspace tables

2016-06-11 Thread Chris Lohfink
Related to https://issues.apache.org/jira/browse/CASSANDRA-11330, most of
the system tables will work, but the batches table is kind of special-cased
and uses the LocalPartitioner (see:
https://github.com/apache/cassandra/blob/ff42012edd8651ca2567a670c2df9b3be6f51fcd/src/java/org/apache/cassandra/db/SystemKeyspace.java#L119
 ) like secondary indexes, but it isn't caught by the 2i check to use the local
partitioner.

If you want you can open a jira for this, or I can later. A workaround in the
meantime while waiting for a fix may be to actually use a relative path
with a ".." or "." in it, to take advantage of the issue mentioned in this
comment


Chris

On Sat, Jun 11, 2016 at 3:00 PM, Bhuvan Rawal  wrote:

> I have been trying to obtain json dump of batches table using sstabledump
> but I get this exception:
> $ sstabledump
> /sstable/data/system/batches-919a4bc57a333573b03e13fc3f68b465/ma-277-big-Data.db
> Exception in thread "main"
> org.apache.cassandra.exceptions.ConfigurationException: Cannot use abstract
> class 'org.apache.cassandra.dht.LocalPartitioner' as partitioner.
> at org.apache.cassandra.utils.FBUtilities.construct(FBUtilities.java:489)
> at
> org.apache.cassandra.utils.FBUtilities.instanceOrConstruct(FBUtilities.java:461)
> at
> org.apache.cassandra.utils.FBUtilities.newPartitioner(FBUtilities.java:402)
> at
> org.apache.cassandra.tools.SSTableExport.metadataFromSSTable(SSTableExport.java:108)
> at org.apache.cassandra.tools.SSTableExport.main(SSTableExport.java:184)
>
> I further tried Andrew Tolbert's sstable tool but it gives the same
> exception.
> $ java -jar sstable-tools-3.0.0-alpha4.jar describe
> /sstable/data/system/batches-919a4bc57a333573b03e13fc3f68b465/ma-277-big-Data.db
>
> /sstable/data/system/batches-919a4bc57a333573b03e13fc3f68b465/ma-277-big-Data.db
>
> 
> org.apache.cassandra.exceptions.ConfigurationException: Cannot use
> abstract class 'org.apache.cassandra.dht.LocalPartitioner' as partitioner.
> at org.apache.cassandra.utils.FBUtilities.construct(FBUtilities.java:489)
>
> Any way by which I can figure out the content of batches table?
>
> Thanks & Regards,
> Bhuvan
>


Re: Latency overhead on Cassandra cluster deployed on multiple AZs (AWS)

2016-04-11 Thread Chris Lohfink
Where do you get the ~1ms latency between AZs? Comparing a short term
average to a 99th percentile isn't very fair.

"Over the last month, the median is 2.09 ms, 90th percentile is 20ms,
99th percentile
is 47ms." - per
https://www.quora.com/What-are-typical-ping-times-between-different-EC2-availability-zones-within-the-same-region

Are you using EBS? That would further impact latency on reads and GCs will
always cause hiccups in the 99th+.

Chris


On Mon, Apr 11, 2016 at 7:57 AM, Alessandro Pieri  wrote:

> Hi everyone,
>
> Last week I ran some tests to estimate the latency overhead introduces in
> a Cassandra cluster by a multi availability zones setup on AWS EC2.
>
> I started a Cassandra cluster of 6 nodes deployed on 3 different AZs (2
> nodes/AZ).
>
> Then, I used cassandra-stress to create an INSERT (write) test of 20M
> entries with a replication factor = 3, right after, I ran cassandra-stress
> again to READ 10M entries.
>
> Well, I got the following unexpected result:
>
> Single-AZ, CL=ONE -> median/95th percentile/99th percentile:
> 1.06ms/7.41ms/55.81ms
> Multi-AZ, CL=ONE -> median/95th percentile/99th percentile:
> 1.16ms/38.14ms/47.75ms
>
> Basically, switching to the multi-AZ setup the latency increased of ~30ms.
> That's too much considering the the average network latency between AZs on
> AWS is ~1ms.
>
> Since I couldn't find anything to explain those results, I decided to run
> the cassandra-stress specifying only a single node entry (i.e. "--nodes
> node1" instead of "--nodes node1,node2,node3,node4,node5,node6") and
> surprisingly the latency went back to 5.9 ms.
>
> Trying to recap:
>
> Multi-AZ, CL=ONE, "--nodes node1,node2,node3,node4,node5,node6" -> 95th
> percentile: 38.14ms
> Multi-AZ, CL=ONE, "--nodes node1" -> 95th percentile: 5.9ms
>
> For the sake of completeness I've ran a further test using a consistency
> level = LOCAL_QUORUM and the test did not show any large variance with
> using a single node or multiple ones.
>
> Do you guys know what could be the reason?
>
> The test were executed on a m3.xlarge (network optimized) using the
> DataStax AMI 2.6.3 running Cassandra v2.0.15.
>
> Thank you in advance for your help.
>
> Cheers,
> Alessandro
>


Re: CRT

2016-02-23 Thread Chris Lohfink
Check out
http://www.datastax.com/dev/blog/testing-apache-cassandra-with-jepsen. You
can run it yourself to test as well.

Chris

On Tue, Feb 23, 2016 at 7:02 PM, Rakesh Kumar  wrote:

> https://www.aphyr.com/posts/294-jepsen-cassandra
>
> How much of this is still valid in ver 3.0. The above seems to have been
> written for ver 1.0.
>
> thanks.
>


Re: opscenter doesn't work with cassandra 3.0

2016-01-26 Thread Chris Lohfink
DataStax has a free program for startups
http://www.datastax.com/datastax-enterprise-for-startups

On Tue, Jan 26, 2016 at 9:42 AM, Otis Gospodnetić <
otis.gospodne...@gmail.com> wrote:

> Hi Duyhai,
>
> SPM is not free, but there is a free plan, plus we have special pricing
> for startups, non-profits, and education institutions.
>
> Otis
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
> On Tue, Jan 26, 2016 at 9:59 AM, DuyHai Doan  wrote:
>
>> Hello Otis
>>
>>  The Sematext tools, is it free or not ? And if not free, is there a
>> "limited" open-source version ?
>>
>> On Tue, Jan 26, 2016 at 3:39 PM, Otis Gospodnetić <
>> otis.gospodne...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> As Julien pointed out, there is a good OpsCenter alternative at
>>> https://sematext.com/spm/integrations/cassandra-monitoring.html
>>>
>>> Questions/comments/feedback/milk/cookies are all welcome.
>>>
>>> Otis
>>> --
>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>
>>>
>>> On Wed, Jan 6, 2016 at 12:00 PM, Michael Shuler 
>>> wrote:
>>>
 On 01/06/2016 10:55 AM, Michael Shuler wrote:
 > On 01/06/2016 01:47 AM, Wills Feng wrote:
 >> Looks like opscenter doesn't support cassandra 3.0?
 >
 > This is correct. OpsCenter does not support Cassandra >= 3.0.

 It took me a minute to find the correct document:


 http://docs.datastax.com/en/upgrade/doc/upgrade/opscenter/opscCompatibility.html

 According to this version table, OpsCenter does not officially support
 Cassandra > 2.1.

 --
 Michael

>>>
>>>
>>
>


Re: Estimated key count from nodetool tablestats

2016-01-24 Thread Chris Lohfink
It will give you an estimate of the number of partition keys. In newer
versions it will merge a sketch of the keys, and using HyperLogLog++
<http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/40671.pdf>
(p=13, sp=25) it will come up with an estimate of the cardinality. I would say
it's safe to assume it is within 2-ish% of the actual value. That does not
include the memtable data however, so that's added on top; things in both the
memtable and sstables will be double counted. It should still be a fair
estimate.

Before 2.1.6 it used the index and could be off by a lot in wide-row /
heavily-updated / many-sstable use cases.

---
Chris Lohfink

On Sun, Jan 24, 2016 at 6:32 PM, Jack Krupansky <jack.krupan...@gmail.com>
wrote:

> Does the nodetool tablestats output line for "Number of keys (estimate)"
> indicate partition keys or CQL row primary keys (PK)?
>
> We currently don't have doc on this and I couldn't get a solid answer from
> a quick examination of the code.
>
> Since it is an estimate, roughly what is the nature of the estimation?
>
> In particular, for a very wide partition with many CQL rows (even
> millions) is it estimating that as roughly one key or will the number of
> sstables that the partition spans make it a large number?
>
> Thanks.
>
> -- Jack Krupansky
>


Re: Infinite loop in SliceQueryFilter

2015-12-04 Thread Chris Lohfink
It may just be going over a lot of data. Does the output of 'nodetool cfstats'
show large partitions (partition maximum bytes)? "collecting 1 of 2147483647"
is suspicious. Are your queries using ALLOW FILTERING or do they have very high
limits? If you are trying to read 2 billion entries in 1 query you will have
memory issues. You may want to check with jvmtop/htop to make sure it's not
GCs using CPU as well. Is there a sane number of sstables? Providing some more
details can help (cfstats, cfhistograms, the queries you're making, schema).

Chris

On Fri, Dec 4, 2015 at 10:43 AM, Xihui He  wrote:

> Dear All,
>
> Recently one of node in our cluster has high cpu load ~100%. It seems to
> me there is a infinite loop in SliceQueryFilter.
>
> The below log is repeated in 5000ms (range_request_timeout_in_ms).
> TRACE [SharedPool-Worker-11] 2015-12-04 19:25:33,418
> SliceQueryFilter.java:269 - collecting 1 of 2147483647:
> images:b608719e728d11e5812b57f4c5416142:false:62@1444838867382000
>
> Our version is 2.19. Here is the bt:
>
> org.apache.cassandra.db.composites.AbstractComposite.isEmpty(AbstractComposite.java:30)
>
> org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(OnDiskAtom.java:76)
> org.apache.cassandra.db.AbstractCell$1.computeNext(AbstractCell.java:52)
> org.apache.cassandra.db.AbstractCell$1.computeNext(AbstractCell.java:46)
>
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>
> org.apache.cassandra.db.columniterator.SimpleSliceReader.computeNext(SimpleSliceReader.java:83)
>
> org.apache.cassandra.db.columniterator.SimpleSliceReader.computeNext(SimpleSliceReader.java:37)
>
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>
> org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:82)
> org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:173)
> org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:156)
>
> org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:146)
>
> org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:125)
>
> org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:99)
>
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>
> org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:264)
>
> org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:108)
>
> org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:82)
>
> org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:69)
>
> org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:314)
>
> org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
>
> org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:2033)
>
> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1876)
> org.apache.cassandra.db.Keyspace.getRow(Keyspace.java:357)
>
> org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:85)
> org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:47)
>
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:64)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>
> org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164)
> org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105)
>
> java.lang.Thread.run(Thread.java:745)
>
> Appreciate if anyone could help.
>
> Thanks in advance,
> Xihui
>


Re: Error Code

2015-10-29 Thread Chris Lohfink
It means a response (opcode 8) message couldn't be decoded. What driver are
you using? What version? What version of C*?

Chris

On Thu, Oct 29, 2015 at 9:19 AM, Eduardo Alfaia 
wrote:

> yes, but what does it mean?
>
> On 29 Oct 2015, at 15:18, Kai Wang  wrote:
>
>
> https://github.com/datastax/python-driver/blob/75ddc514617304797626cc69957eb6008695be1e/cassandra/connection.py#L573
>
> Is your error message complete?
>
> On Thu, Oct 29, 2015 at 9:45 AM, Eduardo Alfaia 
> wrote:
>
>> Hi Guys,
>>
>> Does anyone know what error code in cassandra is?
>>
>> Error decoding response from Cassandra. opcode: 0008;
>>
>> Thanks
>>
>
>
>


Re: confusion about nodetool cfstats

2015-09-10 Thread Chris Lohfink
All metrics reported in cfstats are for just the one node (it's pulled from
JMX). To see cluster aggregates it's best to use a monitoring tool like
opscenter, graphite, influxdb, nagios, etc. It's a good idea to have something
like this set up for many reasons anyway.

If you are using DSE you can use the performance service to get some of the
metrics (including aggregates across DC, keyspace, cluster, etc.) from CQL.

Chris Lohfink

On Thu, Sep 10, 2015 at 9:38 PM, Shuo Chen <chenatu2...@gmail.com> wrote:

> Sorry to send the previous message.
>
> I want to monitor columnfamily space used with nodetool cfstats. The
> document says,
> Space used (live), bytes:9592399Space that is measured depends on
> operating system
>
> Is this metric shows space used on one nodes or on the whole cluster?
>
> If it is just one node, is there a method to retrieve load info on the
> whole cluster?
>
> 
> Shuo Chen
>
>
> On Fri, Sep 11, 2015 at 10:36 AM, Shuo Chen <chenatu2...@gmail.com> wrote:
>
>> Hi!
>>
>> I want to monitor columnfamily space used with nodetool cfstats. The
>> document says,
>> Space used (live), bytes:9592399Space that is measured depends on
>> operating system
>>
>
>


Re: Last two metrics of cfstats

2015-09-02 Thread Chris Lohfink
It's the number of cells and tombstones seen in the partitions during reads.
Just ignore the "last five minutes" part though, since that's incorrect.

It being zero probably means there have been no actual reads off of disk on
that node. You might want to check if "Local read count" is non-zero, which
would imply queries to non-existent data (most likely at least).

On Wed, Sep 2, 2015 at 2:23 AM, Jayapandian Ponraj 
wrote:

> The last two metrics of cfstats shows zero for all the tables we have
>
> Average live cells per slice (last five minutes): 0.0
> Average tombstones per slice (last five minutes): 0.0
>
> What do these mean and why are they always zero?
>


Re: cfstats ERROR

2015-06-20 Thread Chris Lohfink
Issue here:
https://issues.apache.org/jira/browse/CASSANDRA-9580

Fixed in 2.1.7.

Chris

On Sat, Jun 20, 2015 at 1:40 PM, 曹志富 cao.zh...@gmail.com wrote:

 error:
 /home/ant/apache-cassandra-2.1.6/bin/../data/data/blogger/edgestore/blogger-edgestore-tmplink-ka-146100-Data.db
 -- StackTrace --
 java.lang.AssertionError:
 /home/ant/apache-cassandra-2.1.6/bin/../data/data/blogger/edgestore/blogger-edgestore-tmplink-ka-146100-Data.db
 at
 org.apache.cassandra.io.sstable.SSTableReader.getApproximateKeyCount(SSTableReader.java:270)
 at
 org.apache.cassandra.metrics.ColumnFamilyMetrics$9.value(ColumnFamilyMetrics.java:296)
 at
 org.apache.cassandra.metrics.ColumnFamilyMetrics$9.value(ColumnFamilyMetrics.java:290)
 at
 com.yammer.metrics.reporting.JmxReporter$Gauge.getValue(JmxReporter.java:63)
 at sun.reflect.GeneratedMethodAccessor30.invoke(Unknown Source)

 vnodes,LCS

 --
 Ranger Tsao



Re: Really high read latency

2015-03-23 Thread Chris Lohfink
  Compacted partition maximum bytes: 36904729268

That's huge... 36GB rows are going to cause a lot of problems; even when you
specify a precise cell under this, it still is going to have an enormous
column index to deserialize on every read of the partition. As mentioned
above, you should include your attribute name in the partition key
((row_time, attrs)) to spread this out... I'd call that critical.

Chris

On Mon, Mar 23, 2015 at 4:13 PM, Dave Galbraith david92galbra...@gmail.com
wrote:

 I haven't deleted anything. Here's output from a traced cqlsh query (I
 tried to make the spaces line up, hope it's legible):

 Execute CQL3 query                                                  | 2015-03-23 21:04:37.422000 | 172.31.32.211 |      0
 Parsing select * from default.metrics where row_time = 16511
   and attrs = '[redacted]' limit 100; [SharedPool-Worker-2]         | 2015-03-23 21:04:37.423000 | 172.31.32.211 |     93
 Preparing statement [SharedPool-Worker-2]                           | 2015-03-23 21:04:37.423000 | 172.31.32.211 |    696
 Executing single-partition query on metrics [SharedPool-Worker-1]   | 2015-03-23 21:04:37.425000 | 172.31.32.211 |   2807
 Acquiring sstable references [SharedPool-Worker-1]                  | 2015-03-23 21:04:37.425000 | 172.31.32.211 |   2993
 Merging memtable tombstones [SharedPool-Worker-1]                   | 2015-03-23 21:04:37.426000 | 172.31.32.211 |   3049
 Partition index with 484338 entries found for sstable 15966
   [SharedPool-Worker-1]                                             | 2015-03-23 21:04:38.625000 | 172.31.32.211 | 202304
 Seeking to partition indexed section in data file
   [SharedPool-Worker-1]                                             | 2015-03-23 21:04:38.625000 | 172.31.32.211 | 202354
 Bloom filter allows skipping sstable 5613 [SharedPool-Worker-1]     | 2015-03-23 21:04:38.625000 | 172.31.32.211 | 202445
 Bloom filter allows skipping sstable 5582 [SharedPool-Worker-1]     | 2015-03-23 21:04:38.625000 | 172.31.32.211 | 202478
 Bloom filter allows skipping sstable 5611 [SharedPool-Worker-1]     | 2015-03-23 21:04:38.625000 | 172.31.32.211 | 202508
 Bloom filter allows skipping sstable 5610 [SharedPool-Worker-1]     | 2015-03-23 21:04:38.625000 | 172.31.32.211 | 202539
 Bloom filter allows skipping sstable 5549 [SharedPool-Worker-1]     | 2015-03-23 21:04:38.625001 | 172.31.32.211 | 202678
 Bloom filter allows skipping sstable 5544 [SharedPool-Worker-1]     | 2015-03-23 21:04:38.625001 | 172.31.32.211 | 202720
 Bloom filter allows skipping sstable 5237 [SharedPool-Worker-1]     | 2015-03-23 21:04:38.625001 | 172.31.32.211 | 202752
 Bloom filter allows skipping sstable 2516 [SharedPool-Worker-1]     | 2015-03-23 21:04:38.625001 | 172.31.32.211 | 202782
 Bloom filter allows skipping sstable 2632 [SharedPool-Worker-1]     | 2015-03-23 21:04:38.625001 | 172.31.32.211 | 202812
 Bloom filter allows skipping sstable 3015 [SharedPool-Worker-1]     | 2015-03-23 21:04:38.625001 | 172.31.32.211 | 202852
 Skipped 0/11 non-slice-intersecting sstables, included 0 due to
   tombstones [SharedPool-Worker-1]                                  | 2015-03-23 21:04:38.625001 | 172.31.32.211 | 202882
 Merging data from memtables and 1 sstables [SharedPool-Worker-1]    | 2015-03-23 21:04:38.625001 | 172.31.32.211 | 202902
 Read 101 live and 0 tombstoned cells [SharedPool-Worker-1]          | 2015-03-23 21:04:38.626000 | 172.31.32.211 | 203752
 Request complete                                                    | 2015-03-23 21:04:38.628253 | 172.31.32.211 | 206253

 On Mon, Mar 23, 2015 at 11:53 AM, Eric Stevens migh...@gmail.com wrote:

 Enable tracing in cqlsh and see how many sstables are being read to
 satisfy the query (are you repeatedly writing to the same partition
 [row_time] over time?).

 Also watch for whether you're hitting a lot of tombstones (are you
 deleting lots of values in the same partition over time?).

 On Mon, Mar 23, 2015 at 4:01 AM, Dave Galbraith 
 david92galbra...@gmail.com wrote:

 Duncan: I'm thinking it might be something like that. I'm also seeing
 just a ton of garbage collection on the box, could it be pulling rows for
 all 100k attrs for a given row_time into memory since only row_time is the
 partition key?

 Jens: I'm not using EBS (although I used to until I read up on how
 useless it is). I'm not sure what constitutes proper paging but my client
 has a pretty small amount of available memory so I'm doing pages of size 5k
 using the C++ Datastax driver.

 Thanks for the replies!

 -Dave

 On Mon, Mar 23, 2015 at 2:00 AM, Jens Rantil jens.ran...@tink.se
 wrote:

 Also, two control questions:

- Are you using EBS for data storage? It might introduce additional
latencies.
- Are you doing proper paging when querying the 

Re: Out of Memory Error While Opening SSTables on Startup

2015-02-10 Thread Chris Lohfink
Your cluster is probably having issues with compactions (with STCS you
should never have this many).  I would probably punt with OpsCenter/rollups60:
turn the node off and move all of its sstables off to a different directory
for backup (or just rm them if you really don't care about 1 minute metrics),
then turn the server back on.
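A rough sketch of that, assuming a package install with data under
/var/lib/cassandra and the usual service scripts (adjust paths for your setup):

    sudo service cassandra stop
    mkdir -p /var/backups/rollups60
    mv /var/lib/cassandra/data/OpsCenter/rollups60*/* /var/backups/rollups60/
    sudo service cassandra start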

Once you get your cluster running again, go back and investigate why
compactions stopped. My guess is you hit an exception in the past that killed
your CompactionExecutor and things just built up slowly until you got to
this point.

Chris

On Tue, Feb 10, 2015 at 2:15 PM, Paul Nickerson pgn...@gmail.com wrote:

 Thank you Rob. I tried a 12 GiB heap size, and still crashed out. There
 are 1,617,289 files under OpsCenter/rollups60.

 Once I downgraded Cassandra to 2.1.1 (apt-get install cassandra=2.1.1), I
 was able to start up Cassandra OK with the default heap size formula.

 Now my cluster is running multiple versions of Cassandra. I think I will
 downgrade the rest to 2.1.1.

  ~ Paul Nickerson

 On Tue, Feb 10, 2015 at 2:05 PM, Robert Coli rc...@eventbrite.com wrote:

 On Tue, Feb 10, 2015 at 11:02 AM, Paul Nickerson pgn...@gmail.com
 wrote:

 I am getting an out of memory error when I try to start Cassandra on one
 of my nodes. Cassandra will run for a minute, and then exit without
 outputting any error in the log file. It is happening while SSTableReader
 is opening a couple hundred thousand things.

 ...

 Does anyone know how I might get Cassandra on this node running again?
 I'm not very familiar with correctly tuning Java memory parameters, and I'm
 not sure if that's the right solution in this case anyway.


 Try running 2.1.1, and/or increasing heap size beyond 8gb.

 Are there actually that many SSTables on disk?

 =Rob






Re: nodetool status shows large numbers of up nodes are down

2015-02-10 Thread Chris Lohfink
Are you hitting long GCs on your nodes? Can check gc log or look at
cassandra log for GCInspector.

Chris

On Tue, Feb 10, 2015 at 1:28 PM, Cheng Ren cheng@bloomreach.com wrote:

 Hi Carlos,
 Thanks for your suggestion. We did check the NTP setting and clock, and
 they are all working normally. Schema versions are also consistent with
 peers'.
 BTW, the only change we made was to set some of nodes' request
 timeout(read_request_timeout, write_request_timeout, range_request_timeout
 and request_timeout) from 3 to 1 for 6 nodes yesterday. Will this
 affect internode gossip?

 Thanks,
 Cheng

 On Mon, Feb 9, 2015 at 11:07 PM, Carlos Rolo r...@pythian.com wrote:

 Hi Cheng,

 Are all machines configured with NTP and all clocks in sync? If that is
 not the case do it.

 If your clocks are not in sync it causes some weird issues like the ones
 you see, but also schema disagreements and in some cases corrupted data.

 Regards,

 Regards,

 Carlos Juzarte Rolo
 Cassandra Consultant

 Pythian - Love your data

 rolo@pythian | Twitter: cjrolo | Linkedin: *linkedin.com/in/carlosjuzarterolo
 http://linkedin.com/in/carlosjuzarterolo*
 Tel: 1649
 www.pythian.com

 On Tue, Feb 10, 2015 at 3:40 AM, Cheng Ren cheng@bloomreach.com
 wrote:

 Hi,
 We have a two-dc cluster with 21 nodes and 27 nodes in each DC. Over the
 past few months, we have seen nodetool status marks 4-8 nodes down while
 they are actually functioning. Particularly today we noticed that running
 nodetool status on some nodes shows higher number of nodes are down than
 before while they are actually up and serving requests.
 For example, on one node it shows 42 nodes are down.

 phi_convict_threshold of all nodes are set as 12, and we are running
 cassandra 2.0.4 on AWS EC2 machines.

 Does anyone have recommendation on identifying the root cause of this?
 Will this cause any consequences?

 Thanks,
 Cheng



 --







Re: Out of Memory Error While Opening SSTables on Startup

2015-02-10 Thread Chris Lohfink
yeah... probably just 2.1.2 things and not compactions.  Still probably
want to do something about the 1.6 million files though.  It may be worth
just mv/rm'ing the 60 sec rollup data unless you're really attached to it.

Chris

On Tue, Feb 10, 2015 at 4:04 PM, Paul Nickerson pgn...@gmail.com wrote:

 I was having trouble with snapshots failing while trying to repair that
 table (http://www.mail-archive.com/user@cassandra.apache.org/msg40686.html).
 I have a repair running on it now, and it seems to be going successfully
 this time. I am going to wait for that to finish, then try a
 manual nodetool compact. If that goes successfully, then would it be safe
 to chalk the lack of compaction on this table in the past up to 2.1.2
 problems?


  ~ Paul Nickerson

 On Tue, Feb 10, 2015 at 3:34 PM, Chris Lohfink clohfin...@gmail.com
 wrote:

 Your cluster is probably having issues with compactions (with STCS you
 should never have this many).  I would probably punt with
 OpsCenter/rollups60. Turn the node off and move all of the sstables off to
 a different directory for backup (or just rm if you really don't care about
 1 minute metrics), than turn the server back on.

 Once you get your cluster running again go back and investigate why
 compactions stopped, my guess is you hit an exception in past that killed
 your CompactionExecutor and things just built up slowly until you got to
 this point.

 Chris

 On Tue, Feb 10, 2015 at 2:15 PM, Paul Nickerson pgn...@gmail.com wrote:

 Thank you Rob. I tried a 12 GiB heap size, and still crashed out. There
 are 1,617,289 files under OpsCenter/rollups60.

 Once I downgraded Cassandra to 2.1.1 (apt-get install cassandra=2.1.1),
 I was able to start up Cassandra OK with the default heap size formula.

 Now my cluster is running multiple versions of Cassandra. I think I will
 downgrade the rest to 2.1.1.

  ~ Paul Nickerson

 On Tue, Feb 10, 2015 at 2:05 PM, Robert Coli rc...@eventbrite.com
 wrote:

 On Tue, Feb 10, 2015 at 11:02 AM, Paul Nickerson pgn...@gmail.com
 wrote:

 I am getting an out of memory error when I try to start Cassandra on
 one of my nodes. Cassandra will run for a minute, and then exit without
 outputting any error in the log file. It is happening while SSTableReader
 is opening a couple hundred thousand things.

 ...

 Does anyone know how I might get Cassandra on this node running again?
 I'm not very familiar with correctly tuning Java memory parameters, and 
 I'm
 not sure if that's the right solution in this case anyway.


 Try running 2.1.1, and/or increasing heap size beyond 8gb.

 Are there actually that many SSTables on disk?

 =Rob








Re: High GC activity on node with 4TB on data

2015-02-09 Thread Chris Lohfink
 - number of tombstones - how can I reliably find it out?
https://github.com/spotify/cassandra-opstools
https://github.com/cloudian/support-tools

If you're not getting much compression it may be worth trying to disable it;
it may contribute, but it's very unlikely to be the cause of the GC pressure
itself.

7000 sstables but STCS? Sounds like compactions couldn't keep up.  Do you
have a lot of pending compactions (nodetool)?  You may want to increase your
compaction throughput (nodetool) to see if you can catch up a little; doing
reads with that many sstables causes a lot of heap overhead.  You may even
need to take more drastic measures if it can't catch back up.

May also be good to check `nodetool cfstats` for very wide partitions.
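For example (if I remember right the cfstats line is "Compacted row maximum
size" on 1.2, "Compacted partition maximum bytes" on newer versions):

    nodetool compactionstats              # pending tasks and what is compacting right now
    nodetool setcompactionthroughput 64   # raise from the 16 MB/s default if the disks can take it
    nodetool cfstats | grep -i 'maximum'  # spot very wide partitions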

There's a good chance that, under load with an over-8 GB heap, your GCs
could use tuning.  The bigger the nodes, the more manual tweaking it will
take to get the most out of them;
https://issues.apache.org/jira/browse/CASSANDRA-8150 also has some ideas.

Chris

On Mon, Feb 9, 2015 at 2:00 AM, Jiri Horky ho...@avast.com wrote:

  Hi all,

 thank you all for the info.

 To answer the questions:
  - we have 2 DCs with 5 nodes in each, each node has 256G of memory, 24x1T
 drives, 2x Xeon CPU - there are multiple cassandra instances running for
 different project. The node itself is powerful enough.
  - there 2 keyspaces, one with 3 replicas per DC, one with 1 replica per
 DC (because of amount of data and because it serves more or less like a
 cache)
  - there are about 4k/s Request-response, 3k/s Read and 2k/s Mutation
 requests  - numbers are sum of all nodes
  - we use STCS (LCS would be quite IO heavy for this amount of data)
  - number of tombstones - how can I reliably find it out?
  - the biggest CF (3.6T per node) has 7000 sstables

 Now, I understand that the best practice for Cassandra is to run with the
 minimum heap size that is enough, which for this case we thought is about
 12G - there is always 8G consumed by the SSTable readers. Also, I thought
 that a high number of tombstones creates pressure in the new space (which
 can then cause pressure in the old space as well), but this is not what we
 are seeing. We see continuous GC activity in the Old generation only.

 Also, I noticed that the biggest CF has Compression factor of 0.99 which
 basically means that the data come compressed already. Do you think that
 turning off the compression should help with memory consumption?

 Also, I think that tuning CMSInitiatingOccupancyFraction=75 might help
 here, as it seems that 8G is something Cassandra needs for bookkeeping this
 amount of data, and that this was slightly above the 75% limit which
 triggered the CMS again and again.

 I will definitely have a look at the presentation.

 Regards
 Jiri Horky


 On 02/08/2015 10:32 PM, Mark Reddy wrote:

 Hey Jiri,

  While I don't have any experience running 4TB nodes (yet), I would
 recommend taking a look at a presentation by Aaron Morton on large nodes:
 http://planetcassandra.org/blog/cassandra-community-webinar-videoslides-large-nodes-with-cassandra-by-aaron-morton/
 to see if you can glean anything from that.

  I would note that at the start of his talk he mentions that in version
 1.2 we can now talk about nodes around 1 - 3 TB in size, so if you are
 storing anything more than that you are getting into very specialised use
 cases.

  If you could provide us with some more information about your cluster
 setup (No. of CFs, read/write patterns, do you delete / update often, etc.)
 that may help in getting you to a better place.


  Regards,
 Mark

 On 8 February 2015 at 21:10, Kevin Burton bur...@spinn3r.com wrote:

 Do you have a lot of individual tables?  Or lots of small compactions?

  I think the general consensus is that (at least for Cassandra), 8GB
 heaps are ideal.

  If you have lots of small tables it’s a known anti-pattern (I believe)
 because the Cassandra internals could do a better job on handling the in
 memory metadata representation.

  I think this has been improved in 2.0 and 2.1 though, so the fact that
 you’re on 1.2.18 could exacerbate the issue.  You might want to consider an
 upgrade (though that has its own issues as well).

 On Sun, Feb 8, 2015 at 12:44 PM, Jiri Horky ho...@avast.com wrote:

 Hi all,

 we are seeing quite high GC pressure (in old space by CMS GC Algorithm)
 on a node with 4TB of data. It runs C* 1.2.18 with 12G of heap memory
 (2G for new space). The node runs fine for a couple of days, then the GC
 activity starts to rise and reaches about 15% of the C* activity, which
 causes dropped messages and other problems.

 Taking a look at heap dump, there is about 8G used by SSTableReader
 classes in org.apache.cassandra.io.compress.CompressedRandomAccessReader.

 Is this something expected and we have just reached the limit of how
 many data a single Cassandra instance can handle or it is possible to
 tune it better?

 Regards
 Jiri Horky




   --
   Founder/CEO Spinn3r.com
  Location: *San 

Re: How to remove obsolete error message in Datastax Opscenter?

2015-02-09 Thread Chris Lohfink
Restarting opscenter service will get rid of it.

Chris

On Mon, Feb 9, 2015 at 3:01 AM, Björn Hachmann bjoern.hachm...@metrigo.de
wrote:

 Good morning,

 unfortunately my last rolling restart of our Cassandra cluster issued from
 OpsCenter (5.0.2) failed. No big deal, but since then OpsCenter is showing
 an error message at the top of its screen:
 Error restarting cluster: Timed out waiting for Cassandra to start..

 Does anybody know how to remove that message permanently?

 Thank you very much in advance!

 Kind regards
 Björn Hachmann



Re: data distribution along column family partitions

2015-02-04 Thread Chris Lohfink
 What about 15 gb?

not ok :) Don't let a single partition get to 1 GB; at hundreds of MB the
flares should already be going up. The main reason is that compactions would
be horrifically slow and there will be a lot of GC pain. Bringing the time
bucket down to a day will probably be sufficient. It would take billions of
alarm events in a single time bucket, if that's the entire data payload, to
get that bad.
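For example, a by-day bucket could look like this (sketch; "myks" and the
payload column are placeholders for your real keyspace and alert columns):

    cqlsh -e "
    CREATE TABLE myks.alerts_by_user_day (
        user_id     uuid,
        time_bucket int,          -- e.g. 20150204, one partition per user per day
        ts          timestamp,
        alert_id    uuid,
        payload     text,
        PRIMARY KEY ((user_id, time_bucket), ts, alert_id)
    ) WITH CLUSTERING ORDER BY (ts DESC, alert_id DESC);"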

 If I use paging, Cassandra won't try to allocate the whole partition on
the server node, it will just allocate memory in the heap for that page.
Check?

Cassandra should never allocate an entire (large/wide) partition into
memory unless you're telling it to on a read. (Gross simplification coming
up.) Think of it more as streaming the partition's data from disk (more or
less) to fill a response to your query. Don't ask for 1 GB of data and you
won't get 1 GB of objects in your heap. Wide rows work well, but keeping
them smaller is an optimization that will save you a lot of pain down the
road from troublesome JVM GCs, slower compactions, unbalanced nodes, and
higher read latencies.

Chris

On Wed, Feb 4, 2015 at 9:33 AM, Marcelo Valle (BLOOMBERG/ LONDON) 
mvallemil...@bloomberg.net wrote:

  The data model lgtm. You may need to balance the size of the time
 buckets with the amount of alarms to prevent partitions from getting too
 large. 1
 month may be a little large, I would aim to keep the partitions below 25mb
 (can check with nodetool cfstats) or so in size to keep everything happy.
 Its ok if occasional ones go larger, something like 1gb can be bad.. but it
 would still work if not very efficiently.

 What about 15 gb?

  Deletes on an entire time-bucket at a time seems like a good approach,
 but just setting TTL would be far far better imho (why not just set it to
 two years?). May want to look into new DateTieredCompactionStrategy, or
 LeveledCompactionStrategy or the obsoleted data will very rarely go away.

 Excellent hint, I will take a good look on this. I didn't know
 DateTieredCompactionStrategy

  When reading just be sure to use paging (the good cql drivers will have
 it built in) and don't actually read it all in one massive query. If you
 decrease size of your time bucket you may end up having to page the query
 across multiple partitions if Y-X > bucket size.

 If I use paging, Cassandra won't try to allocate the whole partition on
 the server node, it will just allocate memory in the heap for that page.
 Check?

 Marcelo Valle

 From: user@cassandra.apache.org
 Subject: Re: data distribution along column family partitions

 The data model lgtm.  You may need to balance the size of the time buckets
 with the amount of alarms to prevent partitions from getting too large.  1
 month may be a little large, I would aim to keep the partitions below 25mb
 (can check with nodetool cfstats) or so in size to keep everything
 happy.  Its ok if occasional ones go larger, something like 1gb can be
 bad.. but it would still work if not very efficiently.

 Deletes on an entire time-bucket at a time seems like a good approach, but
 just setting TTL would be far far better imho (why not just set it to two
 years?).  May want to look into new DateTieredCompactionStrategy, or
 LeveledCompactionStrategy or the obsoleted data will very rarely go away.

 When reading just be sure to use paging (the good cql drivers will have it
 built in) and don't actually read it all in one massive query.  If you
 decrease size of your time bucket you may end up having to page the query
 across multiple partitions if Y-X > bucket size.

 Chris

 On Wed, Feb 4, 2015 at 4:34 AM, Marcelo Elias Del Valle 
 mvall...@gmail.com wrote:

 Hello,

 I am designing a model to store alerts users receive over time. I will
 want to store probably the last two years of alerts for each user.

 The first thought I had was having a column family partitioned by user +
 timebucket, where time bucket could be something like year + month. For
 instance:

 *partition key:*
 user-id = f47ac10b-58cc-4372-a567-0e02b2c3d479
 time-bucket = 201502
 *rest of primary key:*
 timestamp = column of tipy timestamp
 alert id = f47ac10b-58cc-4372-a567-0e02b2c3d480

 Question, would this make it easier to delete old data? Supposing I am
 not using TTL and I want to remove alerts older than 2 years, what would be
 better, just deleting the entire time-bucket for each user-id (through a
 map/reduce process) or having just user-id as partition key and deleting,
 for each user, where X < timestamp < Y?

 Is it the same for Cassandra, internally?

 Another question is: would data be distributed enough if I just choose to
 partition by user-id? I will have some users with a large number of alerts,
 but on average I could consider alerts to have a good distribution along
 user ids. The problem is I don't feel confident that a few partitions with
 A LOT of alerts would not be a problem at read time.

 What happens at read time when I try to read data from a big 

Re: Cassandra Doesn't Get Linear Performance Increment in Stress Test on Amazon EC2

2014-12-08 Thread Chris Lohfink
So I would -expect- an increase of ~20k qps per node with m3.xlarge, so
there may be something up with your client (I am not a C++ person, but
hopefully someone on the list will take notice).

Latency does not decrease linearly as you add nodes.  What you are likely
seeing with latency, since there are so few nodes, is a side effect of an
optimization.  When you read/write from a table, the node you request will
act as the coordinator.  If the data exists on the coordinator and you are
using rf=1 or cl=1, it will not have to send the request to another node,
just service it locally:

  +-+ +--+
  |  node0  | +--|node1 |
  |-| |--|
  |  client | --+| coordinator  |
  +-+ +--+

In this case the write latency is dominated by the network between
coordinator and client.  A second case is where the coordinator actually
has to send the request to another node:

  +-+ +--+ +---+
  |  node0  | +--|node1 |+-- |node2  |
  |-| |--| |---|
  |  client | --+| coordinator  |---+| data replica  |
  +-+ +--+ +---+

As you're adding nodes you're increasing the probability of hitting this second
scenario, where the coordinator has to make an additional network hop.  This is
possibly why you're seeing an increase (aside from client issues). To get an
idea of how latency is affected as you add nodes you really need to go higher
than 4 (i.e. graph the same rf for 5, 10, 15, 25 nodes; below 5 isn't really
the recommended way to run Cassandra anyway), since latency will approach that
of the 2nd scenario (plus some spike outliers for GCs) and then it should
settle down until you overwork the nodes.

May want to give https://github.com/datastax/cpp-driver a go (not a cpp guy,
take with a grain of salt).  I would still highly recommend using
cassandra-stress instead of your own tool if you want to test Cassandra and
not your code.
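For example, something along these lines with the 2.1 stress tool (node IPs
and thread counts are placeholders):

    cassandra-stress write n=1000000 -node 10.0.0.1,10.0.0.2 -rate threads=200
    cassandra-stress read  n=1000000 -node 10.0.0.1,10.0.0.2 -rate threads=200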

===
Chris Lohfink

On Mon, Dec 8, 2014 at 4:57 AM, 孔嘉林 kongjiali...@gmail.com wrote:

 Thanks Chris.
 I run a client on a separate AWS instance from the Cassandra cluster
 servers. At the client side, I create 40 or 50 threads for sending requests
 to each Cassandra node. I create one thrift client for each of the threads.
 And at the beginning, all the created thrift clients connect to the
 corresponding Cassandra nodes and keep connecting during the whole
 process(I did not close all the transports until the end of the test
 process). So I use very simple load balancing, since the same number of
 thrift clients connect to each node. And my source code is here:
 https://github.com/kongjialin/Cassandra/blob/master/cassandra_client.cpp It's
 very nice of you to help me improve my code.

 As I increase the number of threads, the latency gets longer.

 I'm using C++, so if I want to use native binary + prepared statements,
 the only way is to use C++ driver?
 Thanks very much.




 2014-12-08 12:51 GMT+08:00 Chris Lohfink clohfin...@gmail.com:

 I think your client could use improvements.  How many threads do you have
 running in your test?  With a thrift call like that you only can do one
 request at a time per connection.   For example, assuming C* takes 0ms, a
 10ms network latency/driver overhead will mean 20ms RTT and a max
 throughput of ~50 QPS per thread (native binary doesn't behave like this).
 Are you running client on its own system or shared with a node?  how are
 you load balancing your requests?  Source code would help since theres a
 lot that can become a bottleneck.

 Generally you will see a bit of a dip in latency from N=RF=1 and N=2,
 RF=2 etc since there are optimizations on the coordinator node when it
 doesn't need to send the request to the replicas.  The impact of the
 network overhead decreases in significance as cluster grows.  Typically;
 latency wise, RF=N=1 is going to be fastest possible for smaller loads (ie
 when a client cannot fully saturate a single node).

 Main thing to expect is that latency will plateau and remain fairly
 constant as load/nodes increase while throughput potential will linearly
 (empirically at least) increase.

 You should really attempt it with the native binary + prepared
 statements, running cql over thrift is far from optimal.  I would recommend
 using the cassandra-stress tool if you want to stress test Cassandra (and
 not your code)
 http://www.datastax.com/dev/blog/improved-cassandra-2-1-stress-tool-benchmark-any-schema

 ===
 Chris Lohfink

 On Sun, Dec 7, 2014 at 9:48 PM, 孔嘉林 kongjiali...@gmail.com wrote:

 Hi Eric,
 Thank you very much for your reply!
 Do you mean that I should clear my table after each run? Indeed, I can
 see several times of compaction during my test, but could only a few times
 compaction

Re: Cassandra Doesn't Get Linear Performance Increment in Stress Test on Amazon EC2

2014-12-07 Thread Chris Lohfink
I think your client could use improvements.  How many threads do you have
running in your test?  With a thrift call like that you only can do one
request at a time per connection.   For example, assuming C* takes 0ms, a
10ms network latency/driver overhead will mean 20ms RTT and a max
throughput of ~50 QPS per thread (native binary doesn't behave like this).
Are you running the client on its own system or shared with a node?  How are
you load balancing your requests?  Source code would help since there's a
lot that can become a bottleneck.

Generally you will see a bit of a dip in latency from N=RF=1 and N=2, RF=2
etc since there are optimizations on the coordinator node when it doesn't
need to send the request to the replicas.  The impact of the network
overhead decreases in significance as the cluster grows.  Typically, latency
wise, RF=N=1 is going to be the fastest possible for smaller loads (ie when a
client cannot fully saturate a single node).

Main thing to expect is that latency will plateau and remain fairly
constant as load/nodes increase while throughput potential will linearly
(empirically at least) increase.

You should really attempt it with the native binary + prepared statements,
running cql over thrift is far from optimal.  I would recommend using the
cassandra-stress tool if you want to stress test Cassandra (and not your
code)
http://www.datastax.com/dev/blog/improved-cassandra-2-1-stress-tool-benchmark-any-schema

===
Chris Lohfink

On Sun, Dec 7, 2014 at 9:48 PM, 孔嘉林 kongjiali...@gmail.com wrote:

 Hi Eric,
 Thank you very much for your reply!
 Do you mean that I should clear my table after each run? Indeed, I can see
 compaction happen several times during my test, but could only a few
 compactions affect the performance that much? Also, I can see from
 OpsCenter that some ParNew GCs happen but no CMS GCs.

 I run my test on an EC2 cluster, so I think the network within it should be
 high speed. Each Cassandra server has 4 CPU units, 15 GiB memory and 80 GB of
 SSD storage, which is the m3.xlarge type.

 As for latency, which latency should I care about most? p(99) or p(999)? I
 want to get the max QPS under a certain limited latency.

 I know my testing scenario are not the common case in production, I just
 want to know how much burden my cluster can bear under stress.

 So, how did you test your cluster that can get 86k writes/sec? How many
 requests did you send to your cluster? Was it also 1 million? Did you also
 use OpsCenter to monitor the real time performance? I also wonder why the
 write and read QPS OpsCenter provide are much lower than what I calculate.
 Could you please describe in detail about your test deployment?

 Thank you very much,
 Joy

 2014-12-07 23:55 GMT+08:00 Eric Stevens migh...@gmail.com:

 Hi Joy,

 Are you resetting your data after each test run?  I wonder if your tests
 are actually causing you to fall behind on data grooming tasks such as
 compaction, and so performance suffers for your later tests.

 There are *so many* factors which can affect performance, without
 reviewing test methodology in great detail, it's really hard to say whether
 there are flaws which might uncover an antipattern, cause an atypical number
 of cache hits or misses, and so forth. You may also be producing gc pressure
 in the write path, and so forth.

 I *can* say that 28k writes per second looks just a little low, but it
 depends a lot on your network, hardware, and write patterns (eg, data
 size).  For a little performance test suite I wrote, with parallel batched
 writes, on a 3 node rf=3 cluster test cluster, I got about 86k writes per
 second.

 Also focusing exclusively on max latency is going to cause you some
 troubles especially in the case of magnetic media as you're using.  Between
 ill-timed GC and inconsistent performance characteristics from magnetic
 media, your max numbers will often look significantly worse than your p(99)
 or p(999) numbers.

 All this said, one node will often look better than several nodes for
 certain patterns because it completely eliminates proxy (coordinator) write
 times.  All writes are local writes.  It's an over-simple case that doesn't
 reflect any practical production use of Cassandra, so it's probably not
 worth even including in your tests.  I would recommend start at 3 nodes
 rf=3, and compare against 6 nodes rf=6.  Make sure you're staying on top of
 compaction and aren't seeing garbage collections in the logs (either of
 those will be polluting your results with variability you can't account for
 with small sample sizes of ~1 million).

 If you expect to sustain write volumes like this, you'll find these
 clusters are sized too small (on that hardware you won't keep up with
 compaction), and your tests are again testing scenarios you wouldn't
 actually see in production.

 On Sat Dec 06 2014 at 7:09:18 AM kong kongjiali...@gmail.com wrote:

 Hi,

 I am doing stress test on Datastax Cassandra Community 2.1.2, not using
 the provided stress test

Re: Programmatic Cassandra version detection/extraction

2014-11-13 Thread Chris Lohfink
There is a ReleaseVersion attribute in the
org.apache.cassandra.db:StorageService bean
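For completeness, the same value can be read from the command line without
writing any JMX code:

    nodetool version     # prints the ReleaseVersion the node is running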

---
Chris Lohfink

On Wed, Nov 12, 2014 at 5:57 PM, Michael Shuler mich...@pbandjelly.org
wrote:

 On 11/12/2014 04:58 PM, Michael Shuler wrote:

 On 11/12/2014 04:44 PM, Otis Gospodnetic wrote:

 Is there a way to detect which version of Cassandra one is running?
 Is there an API for that, or a constant with this value, or maybe an
 MBean or some other way to get to this info?


 I'm not sure if there are other methods, but this should always work:

SELECT release_version from system.local;


 I asked the devs about where I might find the version in jmx and got the
 hint that I could cheat and look at `nodetool gossipinfo`.

 It looks like RELEASE_VERSION is reported as a field in
 org.apache.cassandra.net FailureDetector AllEndpointStates.

 --
 Michael



Re: What actually causing java.lang.OutOfMemoryError: unable to create new native thread

2014-11-10 Thread Chris Lohfink
if you're using 64-bit, check the output of:

cat /proc/{cassandra pid}/limits

some older linux kernels won't have that file, so if it doesn't exist check
the ulimit -a output for the cassandra user. Max processes per user may be
your issue as well.
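A rough sketch of both checks, assuming the process runs as the 'cassandra' user:

    cat /proc/$(pgrep -f CassandraDaemon | head -1)/limits   # limits applied to the running JVM
    sudo -u cassandra bash -c 'ulimit -a'                    # fallback if /proc/<pid>/limits is missing
    ps -L -u cassandra | wc -l                               # rough thread count vs "max user processes"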

---
Chris Lohfink


On Mon, Nov 10, 2014 at 11:21 AM, graham sanderson gra...@vast.com wrote:

 First question are you running 32bit or 64bit… on 32bit you can easily run
 out of virtual address space for thread stacks.

 On Nov 10, 2014, at 8:25 AM, Jason Wee peich...@gmail.com wrote:

 Hello people, below is an extraction from cassandra system log.

 ERROR [Thread-273] 2012-04-10 16:33:18,328 AbstractCassandraDaemon.java
 (line 139) Fatal exception in thread Thread[Thread-273,5,main]
 java.lang.OutOfMemoryError: unable to create new native thread
 at java.lang.Thread.start0(Native Method)
 at java.lang.Thread.start(Thread.java:640)
 at
 java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
 at
 java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:657)
 at
 org.apache.cassandra.thrift.CustomTThreadPoolServer.serve(CustomTThreadPoolServer.java:104)
 at
 org.apache.cassandra.thrift.CassandraDaemon$ThriftServer.run(CassandraDaemon.java:214)

 Investigated into the call until the java native call,
 http://hg.openjdk.java.net/jdk7/jdk7/hotspot/file/tip/src/share/vm/prims/jvm.cpp#l2698

   if (native_thread-osthread() == NULL) {
 // No one should hold a reference to the 'native_thread'.
 delete native_thread;
 if (JvmtiExport::should_post_resource_exhausted()) {
   JvmtiExport::post_resource_exhausted(
 JVMTI_RESOURCE_EXHAUSTED_OOM_ERROR |
 JVMTI_RESOURCE_EXHAUSTED_THREADS,
 unable to create new native thread);
 }
 THROW_MSG(vmSymbols::java_lang_OutOfMemoryError(),
   unable to create new native thread);
   }

 Question. Is that out of memory error due to native os memory or java
 heap? Stack size given to the JVM is -Xss128k. Operating system file descriptor
 max user processes 26. open files capped at 65536

 Can any java/cpp expert pin point what JVMTI_RESOURCE_EXHAUSTED_OOM_ERROR
 and  JVMTI_RESOURCE_EXHAUSTED_THREADS means too?

 Thank you.

 Jason





Re: query tracing

2014-11-07 Thread Chris Lohfink
It saves a lot of information for each request that's traced, so there is
significant overhead.  If you start at a low probability and move it up
based on the load impact it will provide a lot of insight and you can
control the cost.
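For example (the starting value here is just a guess, to be refined against your load):

    nodetool settraceprobability 0.001    # trace ~0.1% of requests
    nodetool settraceprobability 0        # turn it back off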

---
Chris Lohfink

On Fri, Nov 7, 2014 at 11:35 AM, Jimmy Lin y2klyf+w...@gmail.com wrote:

 is there any significant performance penalty if one turns on Cassandra
 query tracing through the DataStax java driver (say, for every request of
 some troublesome query)?

 More sampling seems better but then doing so may also slow down the system
 in some other ways?

 thanks





Re: Multiple SSD disks per sever? Ideal config?

2014-11-06 Thread Chris Lohfink
If optimizing for IO, use Cassandra's JBOD configuration (list each disk
under the data directories in cassandra.yaml).  It will put sstables on the
disk that's least used.  If you want to optimize for disk space, I'd go with
RAID0.  You will probably want to tune concurrent readers/writers, stream
throughput (if you have the network for it) and compaction throughput if you
end up with IO to spare.  I generally would not recommend putting multiple C*
instances on a single box.
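Sketch of the relevant setting (mount points below are examples, not your real paths):

    # cassandra.yaml -- one entry per SSD:
    #   data_file_directories:
    #     - /mnt/ssd1/cassandra/data
    #     - /mnt/ssd2/cassandra/data
    grep -A4 'data_file_directories' /etc/cassandra/cassandra.yaml   # verify what's configured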

---
Chris Lohfink

On Thu, Nov 6, 2014 at 5:13 PM, Kevin Burton bur...@spinn3r.com wrote:

 I’m curious what people are doing with multiple SSDs per server.

 I think there are two main paths:

 - RAID 0 them… the problem here is that RAID0 is not a panacea and the
 drives may or may not see better IO throughput.

 - use N cassandra instances per box (or containers) and have one C* node
 accessing each SSD.  The upside here is that Cassandra sees the drive
 directly.  The downside is that you would probably have to cheat and tell
 C* that all the containers on that box are on the same “rack” so C* doesn’t
 schedule two replicas on the same box.

 Thoughts?

 Kevin

 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com




Re: tuning concurrent_reads param

2014-10-29 Thread Chris Lohfink
There's a bit to it; sometimes it can use tweaking though.  It's a good
default for most systems so I wouldn't increase it right off the bat. When
using ssds or something with a lot of horsepower it could be higher though
(ie i2.xlarge+ on ec2).  If you monitor the number of active threads in the
read thread pool (nodetool tpstats) you can see if they are actually all
busy or not.  If it's near 32 (or whatever you set it at) all the time it
may be a bottleneck.
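For example:

    watch -n5 'nodetool tpstats | grep -E "Pool Name|ReadStage"'
    # ReadStage Active pinned at 32 with Pending growing => the read pool is the bottleneck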

---
Chris Lohfink

On Wed, Oct 29, 2014 at 10:41 PM, Jimmy Lin y2klyf+w...@gmail.com wrote:

 Hi,
 looking at the docs, the default value for concurrent_reads is 32, which
 seems a bit small to me (compared to, say, an http server), because if my
 node is receiving even slight traffic, any more than 32 concurrent read
 queries will have to wait(?)

 The recommended rule is 16 * number of drives. Would that be different if I have
 SSDs?

 I am attempting to increase it because I have a few tables with wide rows
 that the app will fetch; the sheer size of the data may already be eating up
 the thread time, which can cause other read threads to wait and essentially
 be slow.

 thanks






Re: Exploring Simply Queueing

2014-10-05 Thread Chris Lohfink
It appears you are aware of the tombstone effect that leads people to label
this an anti-pattern.  Without a due time or other time-based value as part of
the partition key you will still get a lot of buildup.  You only have 1
partition per shard, which just linearly decreases the tombstones.  That isn't
likely to be enough to really help in a situation of high queue throughput,
especially with the default of 4 shards.

You may want to consider switching to LCS from the default STCS since you are
re-writing the same partitions a lot. It will still use STCS in L0, so in high
write/delete scenarios with a low enough gc_grace, when data never gets higher
than L1 it will have about the same write throughput. In scenarios where you
accumulate more data I suspect LCS will shine by reducing the number of
obsolete tombstones.  It would be hard to see the difference in small tests, I
think.
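If you try it, the switch is just a table property; a sketch (keyspace/table
name and the gc_grace value are placeholders):

    cqlsh -e "ALTER TABLE myks.queue_shard
              WITH compaction = {'class': 'LeveledCompactionStrategy'}
              AND gc_grace_seconds = 3600;"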

What's the plan to prevent two consumers from reading the same message off of a
queue?  You mention in the docs you will address it at a later point in time but
it's kind of a biggie.  A big lock and batch reads like the astyanax recipe?

---
Chris Lohfink


On Oct 5, 2014, at 6:03 PM, Jan Algermissen jan.algermis...@nordsc.com wrote:

 Hi,
 
 I have put together some thoughts on realizing simple queues with Cassandra.
 
 https://github.com/algermissen/cassandra-ruby-queue
 
 The design is inspired by (the much more sophisticated) Netfilx approach[1] 
 but very reduced.
 
 Given that I am still a C* newbie, I’d be very glad to hear some thoughts on 
 the design path I took.
 
 Jan
 
 [1] https://github.com/Netflix/astyanax/wiki/Message-Queue



Re: CPU consumption of Cassandra

2014-09-23 Thread Chris Lohfink
Well, first off you shouldn't run the stress tool on the node you're testing.  Give
it its own box.

With RF=N=2 you're essentially testing a single machine locally, which isn't the
best indicator long term (optimizations are available when reading data that's
local to the node).  80k/sec on a system is pretty good though; you're probably
seeing slower results on the 2nd query since it's returning 10x the data and there
will be more to go through within the partition. 42k/sec is still acceptable imho
since these are smaller boxes.  You are probably seeing high CPU because the system
is doing a lot :)

If you want to get more out of these systems you can probably do some tuning;
enable tracing to see what's actually the bottleneck.
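One way to grab a trace of the slower query from cqlsh (keyspace and values here
are placeholders):

    cqlsh> TRACING ON;
    cqlsh> SELECT * FROM myks.owner_to_buckets WHERE owner = 'o1' AND tenantid = 't1' LIMIT 10;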

Collections will very likely hurt more than help.

---
Chris Lohfink

On Sep 23, 2014, at 9:39 AM, Leleu Eric eric.le...@worldline.com wrote:

 I tried to run “cassandra-stress” on some of my table as proposed by Jake 
 Luciani.
  
 For a simple table, this tool is able to perform 80000 read op/s with low
 CPU consumption if I request the table by the PK (name, tenantid)
  
 Ex :  
 TABLE :
  
 CREATE TABLE IF NOT EXISTS buckets (tenantid varchar,
 name varchar,
 owner varchar,
 location varchar,
 description varchar,
 codeQuota varchar,
 creationDate timestamp,
 updateDate timestamp,
 PRIMARY KEY (name, tenantid));
  
 QUERY : select * from buckets where name = ? and tenantid = ? limit 1;
  
 TOP output for 900 threads on cassandra-stress :
 top - 13:17:09 up 173 days, 21:54,  4 users,  load average: 11.88, 4.30, 2.76
 Tasks: 272 total,   1 running, 270 sleeping,   0 stopped,   1 zombie
 Cpu(s): 71.4%us, 14.0%sy,  0.0%ni, 13.1%id,  0.0%wa,  0.0%hi,  1.5%si,  0.0%st
 Mem:  98894704k total, 96367436k used,  2527268k free,15440k buffers
 Swap:0k total,0k used,0k free, 88194556k cached
  
   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
 25857 root  20   0 29.7g 1.5g  12m S 693.0  1.6  38:45.58 java  <==
 Cassandra-stress
 29160 cassandr  20   0 16.3g 4.8g  10m S  1.3  5.0  44:46.89 java  <== Cassandra
  
  
  
 Now, If I run another query on a table that provides a list of buckets 
 according to the  owner, the number of op/s is divided by 2  (42000 op/s) and 
 CPU consumption grow UP.
  
 Ex :  
 TABLE :
  
 CREATE TABLE IF NOT EXISTS owner_to_buckets (tenantid varchar,
 name varchar,
 owner varchar,
 location varchar,
 description varchar,
 codeQuota varchar,
 creationDate timestamp,
 updateDate timestamp,
 PRIMARY KEY ((owner, tenantid), name));
  
 QUERY : select * from owner_to_buckets  where owner = ? and tenantid = ? 
 limit 10;
  
 TOP output for 4  threads on cassandra-stress:
  
 top - 13:49:16 up 173 days, 22:26,  4 users,  load average: 1.76, 1.48, 1.17
 Tasks: 273 total,   1 running, 271 sleeping,   0 stopped,   1 zombie
 Cpu(s): 26.3%us,  8.0%sy,  0.0%ni, 64.7%id,  0.0%wa,  0.0%hi,  1.0%si,  0.0%st
 Mem:  98894704k total, 97512156k used,  1382548k free,14580k buffers
 Swap:0k total,0k used,0k free, 90413772k cached
  
   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
 29160 cassandr  20   0 13.6g 4.8g  37m S 186.7  5.1  62:26.77 java <== Cassandra
 50622 root  20   0 28.8g 469m  12m S 102.5  0.5   0:45.84 java <==
 Cassandra-stress
  
 TOP output for 271  threads on cassandra-stress:
  
  
 top - 13:57:03 up 173 days, 22:34,  4 users,  load average: 4.67, 1.76, 1.25
 Tasks: 272 total,   1 running, 270 sleeping,   0 stopped,   1 zombie
 Cpu(s): 81.5%us, 14.0%sy,  0.0%ni,  3.1%id,  0.0%wa,  0.0%hi,  1.3%si,  0.0%st
 Mem:  98894704k total, 94955936k used,  3938768k free,15892k buffers
 Swap:0k total,0k used,0k free, 85993676k cached
  
   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
 29160 cassandr  20   0 13.6g 4.8g  38m S 430.0  5.1  82:31.80 java <== Cassandra
 50622 root  20   0 29.1g 2.3g  12m S 343.4  2.4  17:51.22 java <==
 Cassandra-stress
  
  
 I have 4 tables with  a composed PRIMARY KEY (two of them has 4 entries : 2 
 for the partition key, one for cluster column and one for sort column)
 Two of these tables are frequently read with the partition key because we 
 want to list data of a given user, this should explain my CPU load according 
 to the simple test done with Cassandra-stress …
  
 How can I avoid this?
 Collections could be an option but the number of data per user is not limited 
 and can easily exceed 200 entries. According to the Cassandra documentation, 
 collections have a size limited to 64KB. So it is probably not a solution in 
 my case. L
  
  
 Regards,
 Eric
  
  From: Chris Lohfink [mailto:clohf...@blackbirdit.com]
  Sent: Monday, September 22, 2014 22:03
  To: user@cassandra.apache.org
  Subject: Re: CPU consumption of Cassandra
  
 Its going to depend a lot on your data model but 5-6k is on the low end of 
 what I would expect.  N=RF=2 is not really something I would recommend.  That 
 said 93GB is not much

Re: CPU consumption of Cassandra

2014-09-23 Thread Chris Lohfink
CPU consumption may be affected by the cassandra-stress tool in the 2nd example
as well.  Running it on a separate system eliminates it as a possible cause.
There is a little extra work there, but not anything that I think would be that
obvious.  Tracing (can enable with nodetool) or profiling (ie with yourkit) can
give more exposure to the bottleneck.  I'd run the test from a separate system
first.

---
Chris Lohfink 


On Sep 23, 2014, at 12:48 PM, Leleu Eric eric.le...@worldline.com wrote:

 First of all, Thanks for your help ! :)
 
 Here is some details :
 
 With RF=N=2 your essentially testing a single machine locally which isnt the 
 best indicator long term
 I will  test with more nodes, (4 with RF = 2) but for now I'm limited to 2 
 nodes for non technical reason ...
 
 Well, first off you shouldn't run stress tool on the node your testing.  
 Give it its own box.
 I performed the test in a new Keyspace in order to have a clear dataset.
 
 the 2nd query since its returning 10x the data and there will be more to go 
 through within the partition
  I configured cassandra-stress in such a way that each user has only one bucket,
  so the amount of data is the same in both cases. (select * from buckets
  where name = ? and tenantid = ? limit 1 and select * from owner_to_buckets
  where owner = ? and tenantid = ? limit 10).
 Does cassandra perform extra read when the limit is bigger than the available 
 data (even if the partition key contains only one single value in the 
 clustering column) ?
 If the amount of data is the same, how can we explain the difference of CPU 
 consumption?
 
 
 Regards,
 Eric
 
 
 De : Chris Lohfink [clohf...@blackbirdit.com]
 Date d'envoi : mardi 23 septembre 2014 19:23
 À : user@cassandra.apache.org
 Objet : Re: CPU consumption of Cassandra
 
 Well, first off you shouldn't run stress tool on the node your testing.  Give 
 it its own box.
 
 With RF=N=2 your essentially testing a single machine locally which isnt the 
 best indicator long term (optimizations available when reading data thats 
 local to the node).  80k/sec on a system is pretty good though, your probably 
 seeing slower on the 2nd query since its returning 10x the data and there 
 will be more to go through within the partition. 42k/sec is still acceptable 
 imho since these are smaller boxes.  You are probably seeing high CPU because 
 the system is doing a lot :)
 
 If you want to get more out of these systems can do some tuning probably, 
 enable trace to see whats actually the bottleneck.
 
 Collections will very likely hurt more then help.
 
 ---
 Chris Lohfink
 
 On Sep 23, 2014, at 9:39 AM, Leleu Eric 
 eric.le...@worldline.commailto:eric.le...@worldline.com wrote:
 
 I tried to run “cassandra-stress” on some of my table as proposed by Jake 
 Luciani.
 
  For a simple table, this tool is able to perform 80000 read op/s with low
  CPU consumption if I request the table by the PK (name, tenantid)
 
 Ex :
 TABLE :
 
 CREATE TABLE IF NOT EXISTS buckets (tenantid varchar,
 name varchar,
 owner varchar,
 location varchar,
 description varchar,
 codeQuota varchar,
 creationDate timestamp,
 updateDate timestamp,
 PRIMARY KEY (name, tenantid));
 
 QUERY : select * from buckets where name = ? and tenantid = ? limit 1;
 
 TOP output for 900 threads on cassandra-stress :
 top - 13:17:09 up 173 days, 21:54,  4 users,  load average: 11.88, 4.30, 2.76
 Tasks: 272 total,   1 running, 270 sleeping,   0 stopped,   1 zombie
 Cpu(s): 71.4%us, 14.0%sy,  0.0%ni, 13.1%id,  0.0%wa,  0.0%hi,  1.5%si,  0.0%st
 Mem:  98894704k total, 96367436k used,  2527268k free,15440k buffers
 Swap:0k total,0k used,0k free, 88194556k cached
 
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
 25857 root  20   0 29.7g 1.5g  12m S 693.0  1.6  38:45.58 java  == 
 Cassandra-stress
 29160 cassandr  20   0 16.3g 4.8g  10m S  1.3  5.0  44:46.89 java  == 
 Cassandra
 
 
 
 Now, If I run another query on a table that provides a list of buckets 
 according to the  owner, the number of op/s is divided by 2  (42000 op/s) and 
 CPU consumption grow UP.
 
 Ex :
 TABLE :
 
 CREATE TABLE IF NOT EXISTS owner_to_buckets (tenantid varchar,
 name varchar,
 owner varchar,
 location varchar,
 description varchar,
 codeQuota varchar,
 creationDate timestamp,
 updateDate timestamp,
 PRIMARY KEY ((owner, tenantid), name));
 
 QUERY : select * from owner_to_buckets  where owner = ? and tenantid = ? 
 limit 10;
 
 TOP output for 4  threads on cassandra-stress:
 
 top - 13:49:16 up 173 days, 22:26,  4 users,  load average: 1.76, 1.48, 1.17
 Tasks: 273 total,   1 running, 271 sleeping,   0 stopped,   1 zombie
 Cpu(s): 26.3%us,  8.0%sy,  0.0%ni, 64.7%id,  0.0%wa,  0.0%hi,  1.0%si,  0.0%st
 Mem:  98894704k total, 97512156k used,  1382548k free,14580k buffers
 Swap:0k total,0k used,0k free, 90413772k cached
 
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME

Re: CPU consumption of Cassandra

2014-09-22 Thread Chris Lohfink
It's going to depend a lot on your data model, but 5-6k is on the low end of what
I would expect.  N=RF=2 is not really something I would recommend.  That said,
93GB is not much data, so the bottleneck may exist more in your data model,
queries, or client.

What profiler are you using?  The cpu on the select/read is marked as RUNNABLE
but it's really more of a wait state that may throw some profilers off; it may
be a red herring.

---
Chris Lohfink

On Sep 22, 2014, at 11:39 AM, Leleu Eric eric.le...@worldline.com wrote:

 Hi,
  
  
 I’m currently testing Cassandra 2.0.9  (and since the last week 2.1) under 
 some read heavy load…
  
 I have 2 cassandra nodes (RF : 2) running under CentOS 6 with 16GB of RAM and 
 8 Cores.
 I have around 93GB of data per node (one Disk of 300GB with SAS interface and 
 a Rotational Speed of 10500)
  
 I have 300 active client threads and they request the C* nodes with a 
 Consitency level set to ONE (I’m using the CQL datastax driver).
  
 During my tests I saw  a lot of CPU consumption (70% user / 6%sys / 4% iowait 
 / 20%idle).
 C* nodes respond to around 5000 op/s (sometime up to 6000op/s)
  
 I tried to profile a node and at first look, 60% of the CPU time is spent in
 the “sun.nio.ch” package. (SelectorImpl.select or Channel.read)
  
 I know that benchmark results are highly dependent on the dataset and use
 cases, but from my point of view this CPU consumption is normal
 for the load.
 Can someone confirm that point?
 According to my Hardware configuration, can I expect to have more than 6000 
 read op/s ?
  
  
 Regards,
 Eric
  
  
  
  
 
 
 This e-mail and the documents attached are confidential and intended solely 
 for the addressee; it may also be privileged. If you receive this e-mail in 
 error, please notify the sender immediately and destroy it. As its integrity 
 cannot be secured on the Internet, the Worldline liability cannot be 
 triggered for the message content. Although the sender endeavours to maintain 
 a computer virus-free network, the sender does not warrant that this 
 transmission is virus-free and will not be liable for any damages resulting 
 from any virus transmitted.



Re: High Compactions Pending

2014-09-22 Thread Chris Lohfink
35 isn't that high really in some scenarios (ie, there's a lot of column
families); is it continuing to climb or does it drop down shortly after?

---
Chris Lohfink

On Sep 22, 2014, at 7:57 PM, arun sirimalla arunsi...@gmail.com wrote:

 I have a 6 node (i2.2xlarge) cluster on AWS with DSE 4.5 running on it. I
 notice high compaction pending on one of the nodes, around 35.
 Compaction throughput is set to 64 MB and flush writers to 4. Any suggestion is
 much appreciated.
 
 -- 
 Arun 
 Senior Hadoop Engineer
 Cloudwick
 
 Champion of Big Data
 http://www.cloudera.com/content/dev-center/en/home/champions-of-big-data.html



Re: High Compactions Pending

2014-09-22 Thread Chris Lohfink
What's the output of 'nodetool compactionstats'?  Is concurrent_compactors not
set in your cassandra.yaml?  Any Exceptions or Errors in the system.log or
output.log?

---
Chris Lohfink

On Sep 22, 2014, at 9:50 PM, Arun arunsi...@gmail.com wrote:

 It's been constant for 4 hours. The remaining nodes have around 10 compactions. We
 have 4 column families.
 
 
 On Sep 22, 2014, at 19:39, Chris Lohfink clohf...@blackbirdit.com wrote:
 
 35 isn't that high really in some scenarios (ie, theres a lot of column 
 families), is it continuing to climb or does it drop down shortly after?
 
 ---
 Chris Lohfink
 
 On Sep 22, 2014, at 7:57 PM, arun sirimalla arunsi...@gmail.com wrote:
 
 I have a 6 (i2.2xlarge) node cluster on AWS with 4.5 DSE running on it. I 
 notice high compaction pending on one of the node around 35.
 Compaction throughput set to 64 MB and flush writes to 4. Any suggestion is 
  much appreciated.
 
 -- 
 Arun 
 Senior Hadoop Engineer
 Cloudwick
 
 Champion of Big Data
 http://www.cloudera.com/content/dev-center/en/home/champions-of-big-data.html
 


