Re: Lot of GC on two nodes out of 7

2016-03-03 Thread Jeff Jirsa
I’d personally have gone the other way – if you’re seeing ParNew pressure, 
increasing new gen instead of decreasing it should let short-lived objects be 
dropped there (faster) rather than promoted to survivor/old gen (slower)?
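
For reference, a minimal cassandra-env.sh sketch of that experiment (the sizes 
below are illustrative assumptions only, not tested values for this cluster):

# cassandra-env.sh -- illustrative only: grow the new generation so that
# short-lived objects die in eden instead of being promoted to
# survivor/old gen. Sizes are assumptions; tune per node.
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="3G"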



From:  Anishek Agarwal
Reply-To:  "user@cassandra.apache.org"
Date:  Thursday, March 3, 2016 at 8:55 PM
To:  "user@cassandra.apache.org"
Subject:  Re: Lot of GC on two nodes out of 7

Hello, 

Bryan, most of the partition sizes are under 45 KB

I have tried concurrent_compactors: 8 for one of the nodes, still no 
improvement.
I have tried max_heap_size: 8G, no improvement. 

I will try the new heap size of 2G, though I am sure CMS will take longer then.

Also, it doesn't look like I mentioned what type of GC was causing the problems. On 
both nodes it's the ParNew GC that's taking long for each run, and too many 
runs are happening in succession. 

anishek


On Fri, Mar 4, 2016 at 5:36 AM, Bryan Cheng  wrote:
Hi Anishek, 

In addition to the good advice others have given, do you notice any abnormally 
large partitions? What does cfhistograms report for 99% partition size? A few 
huge partitions will cause very disproportionate load on your cluster, 
including high GC.

--Bryan

On Wed, Mar 2, 2016 at 9:28 AM, Amit Singh F  wrote:
Hi Anishek,

 

We too faced a similar problem in 2.0.14 and, after doing some research, we 
configured a few parameters in cassandra.yaml and were able to overcome the GC 
pauses. Those are:

 

· memtable_flush_writers : increased from 1 to 3. From the tpstats output 
we can see mutations dropped, which means writes are getting blocked, so 
increasing this number helps absorb them.

· memtable_total_space_in_mb : defaults to 1/4 of the heap size; it can be lowered 
because large, long-lived objects create pressure on the heap, so it is better 
to reduce it somewhat.

· concurrent_compactors : Alain rightly pointed this out, i.e. reduce it 
to 8. You need to try this. (A cassandra.yaml sketch of these three settings 
follows below.)
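
A cassandra.yaml sketch of those three settings, using the figures mentioned 
above where given (the memtable_total_space_in_mb value is an assumed example, 
since no number was suggested):

# cassandra.yaml -- starting points to test, not drop-in recommendations
memtable_flush_writers: 3          # default 1; raise when tpstats shows dropped mutations
memtable_total_space_in_mb: 1024   # default is 1/4 of the heap; lower to ease heap pressure
concurrent_compactors: 8           # as suggested by Alain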

 

Also please check whether you see mutation drops on other nodes or not.

 

Hope this helps in your cluster too.

 

Regards

Amit Singh

From: Jonathan Haddad [mailto:j...@jonhaddad.com] 
Sent: Wednesday, March 02, 2016 9:33 PM
To: user@cassandra.apache.org
Subject: Re: Lot of GC on two nodes out of 7

 

Can you post a gist of the output of jstat -gccause (60 seconds worth)?  I 
think it's cool you're willing to experiment with alternative JVM settings but 
I've never seen anyone use max tenuring threshold of 50 either and I can't 
imagine it's helpful.  Keep in mind if your objects are actually reaching that 
threshold it means they've been copied 50x (really really slow) and also you're 
going to end up spilling your eden objects directly into your old gen if your 
survivor is full.  Considering the small amount of memory you're using for heap 
I'm really not surprised you're running into problems.  

 

I recommend G1GC + 12GB heap and just let it optimize itself for almost all 
cases with the latest JVM versions.
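
For reference, a sketch of how that jstat output could be captured, and what 
the G1 variant might look like in cassandra-env.sh (flag names are standard 
HotSpot/Cassandra ones; the pause target is an assumption, and the stock CMS 
flags would need to be removed first):

# 60 one-second samples of GC cause data for the Cassandra process
jstat -gccause <cassandra_pid> 1000 60 > gccause.txt

# cassandra-env.sh -- illustrative G1 setup (remove/comment the default CMS
# and HEAP_NEWSIZE settings before adding these)
MAX_HEAP_SIZE="12G"
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"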

 

On Wed, Mar 2, 2016 at 6:08 AM Alain RODRIGUEZ  wrote:

It looks like you are doing good work with this cluster and know a lot about 
the JVM, that's good :-).

 

our machine configurations are : 2 X 800 GB SSD , 48 cores, 64 GB RAM

 

That's good hardware too.

 

With 64 GB of RAM I would directly give `MAX_HEAP_SIZE=8G` a try on one of the 
2 bad nodes.

 

I would also try lowering `HEAP_NEWSIZE` to 2G and using 
`-XX:MaxTenuringThreshold=15`, still on the canary node, to observe the effects. 
But that's just an idea of something I would try to see the impact; I don't 
think it will solve your current issues, but it shouldn't make things worse for 
this node either.
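
Put together, that canary-node experiment would look roughly like this in 
cassandra-env.sh (replacing, not duplicating, the existing MaxTenuringThreshold 
setting):

# cassandra-env.sh on the canary node only -- values from the suggestion above
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="2G"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=15"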

 

Using G1GC would allow you to use a bigger Heap size. Using C*2.1 would allow 
you to store the memtables off-heap. Those are 2 improvements reducing the heap 
pressure that you might be interested in.

 

I have spent time reading about all other options before including them and a 
similar configuration on our other prod cluster is showing good GC graphs via 
gcviewer.

 

So, let's look for another reason. 

 

there are MUTATION and READ messages dropped in high numbers on the nodes in 
question, and on the other 5 nodes it varies between 1-3.

 

- Is Memory, CPU or disk a bottleneck? Is one of those running at the limits?

 

concurrent_compactors: 48

 

Reducing this to 8 would free some resources for transactions (read requests). It 
is probably worth a try, even more so when compaction is not keeping up and 
compaction throughput is not throttled.

 

Just found an issue about that: 
https://issues.apache.org/jira/browse/CASSANDRA-7139

 

Looks like `concurrent_compactors: 8` is the new default.
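
As a sketch, the compaction side can be checked and throttled at runtime with 
nodetool while `concurrent_compactors: 8` is set in cassandra.yaml for the next 
restart (the 64 MB/s figure is only an example value):

# check the compaction backlog
nodetool compactionstats

# throttle compaction at runtime (value in MB/s; 0 means unthrottled)
nodetool setcompactionthroughput 64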

 

C*heers,

---

Alain Rodriguez - al...@thelastpickle.com

France

 

The Last Pickle - Apache Cassandra Consulting

http://www.thelastpickle.com

 

 

 

 

 

 

2016-03-02 12:27 GMT+01:00 Anishek Agarwal :

Thanks a lot 

Re: Removing Node causes bunch of HostUnavailableException

2016-03-03 Thread Jack Krupansky
What is the exact exception you are getting and where do you get it? Is it
UnavailableException or NoHostAvailableException and does it occur on the
client, using the Java driver?

What is your LoadBalancingPolicy?

What consistency level is the client using?

What retry policy is the client using?

When you say that the failures don't last for more than a few minutes, you
mean from the moment you perform the nodetool removenode? And is operation
completely normal after those few minutes?
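
For context, a rough sketch of where those knobs live with the DataStax Java 
driver 2.x; the contact point, DC name and chosen policies below are 
placeholders/assumptions, not known values for this cluster:

import com.datastax.driver.core.*;
import com.datastax.driver.core.policies.*;

public class DriverConfigSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
            .addContactPoint("10.0.0.1")                       // placeholder contact point
            .withLoadBalancingPolicy(                          // load balancing policy
                new TokenAwarePolicy(new DCAwareRoundRobinPolicy("DC1")))
            .withRetryPolicy(DefaultRetryPolicy.INSTANCE)      // retry policy
            .withQueryOptions(new QueryOptions()
                .setConsistencyLevel(ConsistencyLevel.QUORUM)) // default consistency level
            .build();
        System.out.println("Connected to " + cluster.getMetadata().getClusterName());
        cluster.close();
    }
}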


-- Jack Krupansky

On Thu, Mar 3, 2016 at 4:40 PM, Peddi, Praveen  wrote:

> Hi Jack,
>
> Which node(s) were getting the HostNotAvailable errors - all nodes for
> every query, or just a small portion of the nodes on some queries?
>
> Not all read/writes are failing with Unavalable or Timeout exception.
> Writes failures were around 10% of total calls. Reads were little worse (as
> worse as 35% of total calls).
>
>
> It may take some time for the gossip state to propagate; maybe some of it
> is corrupted or needs a full refresh.
>
> Were any of the seed nodes in the collection of nodes that were removed?
> How many seed nodes does each node typically have?
>
> We currently use all hosts as seed hosts which I know is a very bad idea
> and we are going to fix that soon. The reason we use all hosts as seed
> hosts is because these hosts can get recycled for many reasons and we
> didn’t want to hard code the host names so we programmatically get host
> names (we wrote our own seed host provider). Could that be the reason for
> these failures? If a dead node is in the seed nodes list and we try to
> remove that node, could that lead to blip of failures. The failures don’t
> last for more than few minutes.
>
>
>
> -- Jack Krupansky
>
> On Thu, Mar 3, 2016 at 4:16 PM, Peddi, Praveen  wrote:
>
>> Thanks Alain for quick and detailed response. My answers inline. One
>> thing I want to clarify is, the nodes got recycled due to some automatic
>> health check failure. This means old nodes are dead and new nodes got added
>> w/o our intervention. So replacing nodes would not work for us since the
>> new nodes were already added.
>>
>>
>>
>>> We are not removing multiple nodes at the same time. All dead nodes are
>>> from same AZ so there were no errors when the nodes were down as expected
>>> (because we use QUORUM)
>>
>>
>> Do you use at leat 3 distinct AZ ? If so, you should indeed be fine
>> regarding data integrity. Also repair should then work for you. If you have
>> less than 3 AZ, then you are in troubles...
>>
>> Yes we use 3 distinct AZs and replicate to all 3 Azs which is why when 8
>> nodes were recycled, there were absolutely no outage on Cassandra (other
>> two nodes wtill satisfy quorum consistency)
>>
>>
>> About the unreachable errors, I believe it can be due to the overload due
>> to the missing nodes. Pressure on the remaining node might be too strong.
>>
>> It is certainly possible but we have beefed up cluster with <3% CPU,
>> hardly any network I/o and disk usage. We have 162 nodes in the cluster and
>> each node doesn’t have more than 80 to 100MB of data.
>>
>>
>>
>> However, As soon as I started removing nodes one by one, every time time
>>> we see lot of timeout and unavailable exceptions which doesn’t make any
>>> sense because I am just removing a node that doesn’t even exist.
>>>
>>
>> This probably added even more load, if you are using vnodes, all the
>> remaining nodes probably started streaming data to each other node at the
>> speed of "nodetool getstreamthroughput". AWS network isn't that good, and
>> is probably saturated. Also have you the phi_convict_threshold configured
>> to a high value at least 10 or 12 ? This would avoid nodes to be marked
>> down that often.
>>
>> We are using c3.2xlarge which has good network throughput (1GB/sec I
>> think). We are using default value which is 200MB/sec in 2.0.9. We will
>> play with it in future and see if this could make any difference but as I
>> mentioned the data size on each node is not huge.
>> Regarding phi_convict_threshold, our Cassandra is not bringing itself
>> down. There was a bug in health check from one of our internal tool and
>> that tool is recycling the nodes. Nothing to do with Cassandra health.
>> Again we will keep an eye on it in future.
>>
>>
>> What does "nodetool tpstats" outputs ?
>>
>> Nodetool tpstats on which node? Any node?
>>
>>
>> Also you might try to monitor resources and see what happens (my guess is
>> you should focus at iowait, disk usage and network, have an eye at cpu too).
>>
>> We did monitor cpu, disk and network and they are all very low.
>>
>>
>> A quick fix would probably be to hardly throttle the network on all the
>> nodes and see if it helps:
>>
>> nodetool setstreamthroughput 2
>>
>> We will play with this config. 2.0.9 defaults to 200MB/sec which I think
>> is too high.
>>
>>
>> If this work, you could incrementally increase it and monitor, find the
>> good tuning and 

Re: Lot of GC on two nodes out of 7

2016-03-03 Thread Anishek Agarwal
Hello,

Bryan, most of the partition sizes are under 45 KB

I have tried concurrent_compactors: 8 for one of the nodes, still no
improvement.
I have tried max_heap_size: 8G, no improvement.

I will try the new heap size of 2G, though I am sure CMS will take longer then.

Also, it doesn't look like I mentioned what type of GC was causing the
problems. On both nodes it's the ParNew GC that's taking long for each run,
and too many runs are happening in succession.

anishek


On Fri, Mar 4, 2016 at 5:36 AM, Bryan Cheng  wrote:

> Hi Anishek,
>
> In addition to the good advice others have given, do you notice any
> abnormally large partitions? What does cfhistograms report for 99%
> partition size? A few huge partitions will cause very disproportionate load
> on your cluster, including high GC.
>
> --Bryan
>
> On Wed, Mar 2, 2016 at 9:28 AM, Amit Singh F 
> wrote:
>
>> Hi Anishek,
>>
>>
>>
>> We too faced similar problem in 2.0.14 and after doing some research we
>> config few parameters in Cassandra.yaml and was able to overcome GC pauses
>> . Those are :
>>
>>
>>
>> · memtable_flush_writers : increased from 1 to 3 as from tpstats
>> output  we can see mutations dropped so it means writes are getting
>> blocked, so increasing number will have those catered.
>>
>> · memtable_total_space_in_mb : Default (1/4 of heap size), can
>> lowered because larger long lived objects will create pressure on HEAP, so
>> its better to reduce some amount of size.
>>
>> · Concurrent_compactors : Alain righlty pointed out this i.e
>> reduce it to 8. You need to try this.
>>
>>
>>
>> Also please check whether you have mutations drop in other nodes or not.
>>
>>
>>
>> Hope this helps in your cluster too.
>>
>>
>>
>> Regards
>>
>> Amit Singh
>>
>> *From:* Jonathan Haddad [mailto:j...@jonhaddad.com]
>> *Sent:* Wednesday, March 02, 2016 9:33 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Lot of GC on two nodes out of 7
>>
>>
>>
>> Can you post a gist of the output of jstat -gccause (60 seconds worth)?
>> I think it's cool you're willing to experiment with alternative JVM
>> settings but I've never seen anyone use max tenuring threshold of 50 either
>> and I can't imagine it's helpful.  Keep in mind if your objects are
>> actually reaching that threshold it means they've been copied 50x (really
>> really slow) and also you're going to end up spilling your eden objects
>> directly into your old gen if your survivor is full.  Considering the small
>> amount of memory you're using for heap I'm really not surprised you're
>> running into problems.
>>
>>
>>
>> I recommend G1GC + 12GB heap and just let it optimize itself for almost
>> all cases with the latest JVM versions.
>>
>>
>>
>> On Wed, Mar 2, 2016 at 6:08 AM Alain RODRIGUEZ 
>> wrote:
>>
>> It looks like you are doing a good work with this cluster and know a lot
>> about JVM, that's good :-).
>>
>>
>>
>> our machine configurations are : 2 X 800 GB SSD , 48 cores, 64 GB RAM
>>
>>
>>
>> That's good hardware too.
>>
>>
>>
>> With 64 GB of ram I would probably directly give a try to
>> `MAX_HEAP_SIZE=8G` on one of the 2 bad nodes probably.
>>
>>
>>
>> Also I would also probably try lowering `HEAP_NEWSIZE=2G.` and using
>> `-XX:MaxTenuringThreshold=15`, still on the canary node to observe the
>> effects. But that's just an idea of something I would try to see the
>> impacts, I don't think it will solve your current issues or even make it
>> worse for this node.
>>
>>
>>
>> Using G1GC would allow you to use a bigger Heap size. Using C*2.1 would
>> allow you to store the memtables off-heap. Those are 2 improvements
>> reducing the heap pressure that you might be interested in.
>>
>>
>>
>> I have spent time reading about all other options before including them
>> and a similar configuration on our other prod cluster is showing good GC
>> graphs via gcviewer.
>>
>>
>>
>> So, let's look for an other reason.
>>
>>
>>
>> there are MUTATION and READ messages dropped in high number on nodes in
>> question and on other 5 nodes it varies between 1-3.
>>
>>
>>
>> - Is Memory, CPU or disk a bottleneck? Is one of those running at the
>> limits?
>>
>>
>>
>> concurrent_compactors: 48
>>
>>
>>
>> Reducing this to 8 would free some space for transactions (R requests).
>> It is probably worth a try, even more when compaction is not keeping up and
>> compaction throughput is not throttled.
>>
>>
>>
>> Just found an issue about that:
>> https://issues.apache.org/jira/browse/CASSANDRA-7139
>>
>>
>>
>> Looks like `concurrent_compactors: 8` is the new default.
>>
>>
>>
>> C*heers,
>>
>> ---
>>
>> Alain Rodriguez - al...@thelastpickle.com
>>
>> France
>>
>>
>>
>> The Last Pickle - Apache Cassandra Consulting
>>
>> http://www.thelastpickle.com
>>
>>
>> 2016-03-02 12:27 GMT+01:00 Anishek Agarwal :
>>
>> Thanks a lot Alian for the details.
>>

Re: Consistent read timeouts for bursts of reads

2016-03-03 Thread Mike Heffner
Emils,

I realize this may be a big downgrade, but are your timeouts reproducible
under Cassandra 2.1.4?

Mike

On Thu, Feb 25, 2016 at 10:34 AM, Emīls Šolmanis 
wrote:

> Having had a read through the archives, I missed this at first, but this
> seems to be *exactly* like what we're experiencing.
>
> http://www.mail-archive.com/user@cassandra.apache.org/msg46064.html
>
> Only difference is we're getting this for reads and using CQL, but the
> behaviour is identical.
>
> On Thu, 25 Feb 2016 at 14:55 Emīls Šolmanis 
> wrote:
>
>> Hello,
>>
>> We're having a problem with concurrent requests. It seems that whenever
>> we try resolving more
>> than ~ 15 queries at the same time, one or two get a read timeout and
>> then succeed on a retry.
>>
>> We're running Cassandra 2.2.4 accessed via the 2.1.9 Datastax driver on
>> AWS.
>>
>> What we've found while investigating:
>>
>>  * this is not db-wide. Trying the same pattern against another table
>> everything works fine.
>>  * it fails 1 or 2 requests regardless of how many are executed in
>> parallel, i.e., it's still 1 or 2 when we ramp it up to ~ 120 concurrent
>> requests and doesn't seem to scale up.
>>  * the problem is consistently reproducible. It happens both under
>> heavier load and when just firing off a single batch of requests for
>> testing.
>>  * tracing the faulty requests says everything is great. An example
>> trace: https://gist.github.com/emilssolmanis/41e1e2ecdfd9a0569b1a
>>  * the only peculiar thing in the logs is there's no acknowledgement of
>> the request being accepted by the server, as seen in
>> https://gist.github.com/emilssolmanis/242d9d02a6d8fb91da8a
>>  * there's nothing funny in the timed out Cassandra node's logs around
>> that time as far as I can tell, not even in the debug logs.
>>
>> Any ideas about what might be causing this, pointers to server config
>> options, or how else we might debug this would be much appreciated.
>>
>> Kind regards,
>> Emils
>>
>>


-- 

  Mike Heffner 
  Librato, Inc.


Re: Broken links in Apache Cassandra home page

2016-03-03 Thread Eric Evans
On Thu, Mar 3, 2016 at 10:10 AM, Eric Evans  wrote:
>> https://www.w3.org/Addressing/draft-mirashi-url-irc-01.txt
>>
>> irc://freenode.net/#cassandra
>
> Hrmm, that seems like more of a candidate for a channel name link, no?
>  For example:
>
> Many of the Cassandra developers and community members hang out in
> the #cassandra channel on
> irc.freenode.net.
>
> And, perhaps until Freenode gets their website sorted, we could use
> the Wikipedia article instead
> (https://en.wikipedia.org/wiki/Freenode).

I went ahead and did this; Hopefully that's an improvement.

Cheers,

-- 
Eric Evans
eev...@wikimedia.org


Modeling transactional messages

2016-03-03 Thread I PVP
Hi everyone,

Can anyone please let me know if I am heading toward an antipattern or 
something else bad?

How would you model the following ... ?

I am migrating from MySQL to Cassandra. I have a scenario in which I need to 
store the content of "to be sent" transactional email messages that the 
customer will receive on events like: an order was created, an order was 
updated, an order was canceled, an order was shipped, an account was created, an 
account was confirmed, an account was locked, and so on.

On MySQL there is a table per email message "type", like: a table to store 
messages of "order-created", a table to store messages of "order-updated", and 
so on.

The messages are sent by a non-parallelized Java worker, scheduled to run every 
X seconds, that pushes the messages to a service like SendGrid/Mandrill/Mailjet.

For better performance, easier purging and overall code maintenance I am looking 
to have all message "types" in a single table/column family, as follows:

CREATE TABLE communication.transactional_email (
objectid timeuuid,
subject text,
content text,
fromname text,
fromaddr text,
toname text,
toaddr text,
wassent boolean,
createdate timestamp,
sentdate timestamp,
type text, // example: order_created, order_canceled
domain text, // example: hotmail.com, in case we need to stop sending to a specific 
domain
PRIMARY KEY (wassent, objectid)
);

CREATE INDEX ON communication.transactional_email (toaddr);
CREATE INDEX ON communication.transactional_email (sentdate);
CREATE INDEX ON communication.transactional_email (domain);
CREATE INDEX ON communication.transactional_email (type);


The requirements are :

1) select * from transactional_email where wassent = false and objectid < 
minTimeuuid(<current timestamp>) limit <N>;

(to get the messages that need to be sent)

2) update transactional_email set wassent = true where objectid = <objectid>;

(to update the message right after it was sent)

3) select * from transactional_email where toaddr = <email address>;

(to get all messages that were sent to a specific email address)

4) select * from transactional_email where domain = <domain>;

(to get all messages that were sent to a specific domain)

5) delete from transactional_email where wassent = true and objectid < 
minTimeuuid(<timestamp>);

(to purge, i.e. delete all messages sent before the last X days)

6) delete from transactional_email where toaddr = <email address>;

(to be able to delete all messages when a user account is closed)


Thanks

IPVP


Re: Lot of GC on two nodes out of 7

2016-03-03 Thread Bryan Cheng
Hi Anishek,

In addition to the good advice others have given, do you notice any
abnormally large partitions? What does cfhistograms report for 99%
partition size? A few huge partitions will cause very disproportionate load
on your cluster, including high GC.

--Bryan

On Wed, Mar 2, 2016 at 9:28 AM, Amit Singh F 
wrote:

> Hi Anishek,
>
>
>
> We too faced similar problem in 2.0.14 and after doing some research we
> config few parameters in Cassandra.yaml and was able to overcome GC pauses
> . Those are :
>
>
>
> · memtable_flush_writers : increased from 1 to 3 as from tpstats
> output  we can see mutations dropped so it means writes are getting
> blocked, so increasing number will have those catered.
>
> · memtable_total_space_in_mb : Default (1/4 of heap size), can
> lowered because larger long lived objects will create pressure on HEAP, so
> its better to reduce some amount of size.
>
> · Concurrent_compactors : Alain righlty pointed out this i.e
> reduce it to 8. You need to try this.
>
>
>
> Also please check whether you have mutations drop in other nodes or not.
>
>
>
> Hope this helps in your cluster too.
>
>
>
> Regards
>
> Amit Singh
>
> *From:* Jonathan Haddad [mailto:j...@jonhaddad.com]
> *Sent:* Wednesday, March 02, 2016 9:33 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Lot of GC on two nodes out of 7
>
>
>
> Can you post a gist of the output of jstat -gccause (60 seconds worth)?  I
> think it's cool you're willing to experiment with alternative JVM settings
> but I've never seen anyone use max tenuring threshold of 50 either and I
> can't imagine it's helpful.  Keep in mind if your objects are actually
> reaching that threshold it means they've been copied 50x (really really
> slow) and also you're going to end up spilling your eden objects directly
> into your old gen if your survivor is full.  Considering the small amount
> of memory you're using for heap I'm really not surprised you're running
> into problems.
>
>
>
> I recommend G1GC + 12GB heap and just let it optimize itself for almost
> all cases with the latest JVM versions.
>
>
>
> On Wed, Mar 2, 2016 at 6:08 AM Alain RODRIGUEZ  wrote:
>
> It looks like you are doing a good work with this cluster and know a lot
> about JVM, that's good :-).
>
>
>
> our machine configurations are : 2 X 800 GB SSD , 48 cores, 64 GB RAM
>
>
>
> That's good hardware too.
>
>
>
> With 64 GB of ram I would probably directly give a try to
> `MAX_HEAP_SIZE=8G` on one of the 2 bad nodes probably.
>
>
>
> Also I would also probably try lowering `HEAP_NEWSIZE=2G.` and using
> `-XX:MaxTenuringThreshold=15`, still on the canary node to observe the
> effects. But that's just an idea of something I would try to see the
> impacts, I don't think it will solve your current issues or even make it
> worse for this node.
>
>
>
> Using G1GC would allow you to use a bigger Heap size. Using C*2.1 would
> allow you to store the memtables off-heap. Those are 2 improvements
> reducing the heap pressure that you might be interested in.
>
>
>
> I have spent time reading about all other options before including them
> and a similar configuration on our other prod cluster is showing good GC
> graphs via gcviewer.
>
>
>
> So, let's look for an other reason.
>
>
>
> there are MUTATION and READ messages dropped in high number on nodes in
> question and on other 5 nodes it varies between 1-3.
>
>
>
> - Is Memory, CPU or disk a bottleneck? Is one of those running at the
> limits?
>
>
>
> concurrent_compactors: 48
>
>
>
> Reducing this to 8 would free some space for transactions (R requests).
> It is probably worth a try, even more when compaction is not keeping up and
> compaction throughput is not throttled.
>
>
>
> Just found an issue about that:
> https://issues.apache.org/jira/browse/CASSANDRA-7139
>
>
>
> Looks like `concurrent_compactors: 8` is the new default.
>
>
>
> C*heers,
>
> ---
>
> Alain Rodriguez - al...@thelastpickle.com
>
> France
>
>
>
> The Last Pickle - Apache Cassandra Consulting
>
> http://www.thelastpickle.com
>
> 2016-03-02 12:27 GMT+01:00 Anishek Agarwal :
>
> Thanks a lot Alian for the details.
>
> `HEAP_NEWSIZE=4G.` is probably far too high (try 1200M <-> 2G)
> `MAX_HEAP_SIZE=6G` might be too low, how much memory is available (You
> might want to keep this as it or even reduce it if you have less than 16 GB
> of native memory. Go with 8 GB if you have a lot of memory.
> `-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so
> far. I had luck with values between 4 <--> 16 in the past. I would give  a
> try with 15.
> `-XX:CMSInitiatingOccupancyFraction=70`--> Why not using default - 75 ?
> Using default and then tune from there to improve things is generally a
> good idea.
>
>
>
>
>
> we have a lot of reads and writes onto the system so keeping the high new
> size to make sure enough is held in memory 

Re: Removing Node causes bunch of HostUnavailableException

2016-03-03 Thread Peddi, Praveen
Hi Jack,
Which node(s) were getting the HostNotAvailable errors - all nodes for every 
query, or just a small portion of the nodes on some queries?
Not all reads/writes are failing with Unavailable or Timeout exceptions. Write 
failures were around 10% of total calls. Reads were a little worse (as bad as 
35% of total calls).


It may take some time for the gossip state to propagate; maybe some of it is 
corrupted or needs a full refresh.

Were any of the seed nodes in the collection of nodes that were removed? How 
many seed nodes does each node typically have?
We currently use all hosts as seed hosts, which I know is a very bad idea and we 
are going to fix that soon. The reason we use all hosts as seed hosts is 
because these hosts can get recycled for many reasons and we didn’t want to 
hard-code the host names, so we programmatically get host names (we wrote our 
own seed host provider). Could that be the reason for these failures? If a dead 
node is in the seed nodes list and we try to remove that node, could that lead 
to a blip of failures? The failures don’t last for more than a few minutes.


-- Jack Krupansky

On Thu, Mar 3, 2016 at 4:16 PM, Peddi, Praveen 
> wrote:
Thanks Alain for quick and detailed response. My answers inline. One thing I 
want to clarify is, the nodes got recycled due to some automatic health check 
failure. This means old nodes are dead and new nodes got added w/o our 
intervention. So replacing nodes would not work for us since the new nodes were 
already added.


We are not removing multiple nodes at the same time. All dead nodes are from 
same AZ so there were no errors when the nodes were down as expected (because 
we use QUORUM)

Do you use at leat 3 distinct AZ ? If so, you should indeed be fine regarding 
data integrity. Also repair should then work for you. If you have less than 3 
AZ, then you are in troubles...
Yes we use 3 distinct AZs and replicate to all 3 Azs which is why when 8 nodes 
were recycled, there were absolutely no outage on Cassandra (other two nodes 
wtill satisfy quorum consistency)

About the unreachable errors, I believe it can be due to the overload due to 
the missing nodes. Pressure on the remaining node might be too strong.
It is certainly possible but we have beefed up cluster with <3% CPU, hardly any 
network I/o and disk usage. We have 162 nodes in the cluster and each node 
doesn’t have more than 80 to 100MB of data.


However, As soon as I started removing nodes one by one, every time time we see 
lot of timeout and unavailable exceptions which doesn’t make any sense because 
I am just removing a node that doesn’t even exist.

This probably added even more load, if you are using vnodes, all the remaining 
nodes probably started streaming data to each other node at the speed of 
"nodetool getstreamthroughput". AWS network isn't that good, and is probably 
saturated. Also have you the phi_convict_threshold configured to a high value 
at least 10 or 12 ? This would avoid nodes to be marked down that often.
We are using c3.2xlarge which has good network throughput (1GB/sec I think). We 
are using default value which is 200MB/sec in 2.0.9. We will play with it in 
future and see if this could make any difference but as I mentioned the data 
size on each node is not huge.
Regarding phi_convict_threshold, our Cassandra is not bringing itself down. 
There was a bug in health check from one of our internal tool and that tool is 
recycling the nodes. Nothing to do with Cassandra health. Again we will keep an 
eye on it in future.


What does "nodetool tpstats" outputs ?
Nodetool tpstats on which node? Any node?


Also you might try to monitor resources and see what happens (my guess is you 
should focus at iowait, disk usage and network, have an eye at cpu too).
We did monitor cpu, disk and network and they are all very low.


A quick fix would probably be to hardly throttle the network on all the nodes 
and see if it helps:

nodetool setstreamthroughput 2
We will play with this config. 2.0.9 defaults to 200MB/sec which I think is too 
high.


If this work, you could incrementally increase it and monitor, find the good 
tuning and put it the cassandra.yaml.

I opened a ticket a while ago about that issue: 
https://issues.apache.org/jira/browse/CASSANDRA-9509
I voted for this issue. Lets see if it gets picked up :).


I hope this will help you to go back to a healthy state allowing you a fast 
upgrade ;-).

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-03-02 22:17 GMT+01:00 Peddi, Praveen 
>:
Hi Robert,
Thanks for your response.

Replication factor is 3.

We are in the process of upgrading to 2.2.4. We have had too many performance 
issues with later versions of Cassandra (I have asked asked for help related to 
that in the 

Re: Removing Node causes bunch of HostUnavailableException

2016-03-03 Thread Jack Krupansky
Which node(s) were getting the HostNotAvailable errors - all nodes for
every query, or just a small portion of the nodes on some queries?

It may take some time for the gossip state to propagate; maybe some of it
is corrupted or needs a full refresh.

Were any of the seed nodes in the collection of nodes that were removed?
How many seed nodes does each node typically have?


-- Jack Krupansky

On Thu, Mar 3, 2016 at 4:16 PM, Peddi, Praveen  wrote:

> Thanks Alain for quick and detailed response. My answers inline. One thing
> I want to clarify is, the nodes got recycled due to some automatic health
> check failure. This means old nodes are dead and new nodes got added w/o
> our intervention. So replacing nodes would not work for us since the new
> nodes were already added.
>
>
>
>> We are not removing multiple nodes at the same time. All dead nodes are
>> from same AZ so there were no errors when the nodes were down as expected
>> (because we use QUORUM)
>
>
> Do you use at leat 3 distinct AZ ? If so, you should indeed be fine
> regarding data integrity. Also repair should then work for you. If you have
> less than 3 AZ, then you are in troubles...
>
> Yes we use 3 distinct AZs and replicate to all 3 Azs which is why when 8
> nodes were recycled, there were absolutely no outage on Cassandra (other
> two nodes wtill satisfy quorum consistency)
>
>
> About the unreachable errors, I believe it can be due to the overload due
> to the missing nodes. Pressure on the remaining node might be too strong.
>
> It is certainly possible but we have beefed up cluster with <3% CPU,
> hardly any network I/o and disk usage. We have 162 nodes in the cluster and
> each node doesn’t have more than 80 to 100MB of data.
>
>
>
> However, As soon as I started removing nodes one by one, every time time
>> we see lot of timeout and unavailable exceptions which doesn’t make any
>> sense because I am just removing a node that doesn’t even exist.
>>
>
> This probably added even more load, if you are using vnodes, all the
> remaining nodes probably started streaming data to each other node at the
> speed of "nodetool getstreamthroughput". AWS network isn't that good, and
> is probably saturated. Also have you the phi_convict_threshold configured
> to a high value at least 10 or 12 ? This would avoid nodes to be marked
> down that often.
>
> We are using c3.2xlarge which has good network throughput (1GB/sec I
> think). We are using default value which is 200MB/sec in 2.0.9. We will
> play with it in future and see if this could make any difference but as I
> mentioned the data size on each node is not huge.
> Regarding phi_convict_threshold, our Cassandra is not bringing itself
> down. There was a bug in health check from one of our internal tool and
> that tool is recycling the nodes. Nothing to do with Cassandra health.
> Again we will keep an eye on it in future.
>
>
> What does "nodetool tpstats" outputs ?
>
> Nodetool tpstats on which node? Any node?
>
>
> Also you might try to monitor resources and see what happens (my guess is
> you should focus at iowait, disk usage and network, have an eye at cpu too).
>
> We did monitor cpu, disk and network and they are all very low.
>
>
> A quick fix would probably be to hardly throttle the network on all the
> nodes and see if it helps:
>
> nodetool setstreamthroughput 2
>
> We will play with this config. 2.0.9 defaults to 200MB/sec which I think
> is too high.
>
>
> If this work, you could incrementally increase it and monitor, find the
> good tuning and put it the cassandra.yaml.
>
> I opened a ticket a while ago about that issue:
> https://issues.apache.org/jira/browse/CASSANDRA-9509
>
> I voted for this issue. Lets see if it gets picked up :).
>
>
> I hope this will help you to go back to a healthy state allowing you a
> fast upgrade ;-).
>
> C*heers,
> ---
> Alain Rodriguez - al...@thelastpickle.com
> France
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> 2016-03-02 22:17 GMT+01:00 Peddi, Praveen :
>
>> Hi Robert,
>> Thanks for your response.
>>
>> Replication factor is 3.
>>
>> We are in the process of upgrading to 2.2.4. We have had too many
>> performance issues with later versions of Cassandra (I have asked asked for
>> help related to that in the forum). We are close to getting to similar
>> performance now and hopefully upgrade in next few weeks. Lot of testing to
>> do :(.
>>
>> We are not removing multiple nodes at the same time. All dead nodes are
>> from same AZ so there were no errors when the nodes were down as expected
>> (because we use QUORUM). However, As soon as I started removing nodes one
>> by one, every time time we see lot of timeout and unavailable exceptions
>> which doesn’t make any sense because I am just removing a node that doesn’t
>> even exist.
>>
>>
>>
>>
>>
>>
>> From: Robert Coli 
>> Reply-To: "user@cassandra.apache.org" 

Re: Removing Node causes bunch of HostUnavailableException

2016-03-03 Thread Peddi, Praveen
Thanks Alain for quick and detailed response. My answers inline. One thing I 
want to clarify is, the nodes got recycled due to some automatic health check 
failure. This means old nodes are dead and new nodes got added w/o our 
intervention. So replacing nodes would not work for us since the new nodes were 
already added.


We are not removing multiple nodes at the same time. All dead nodes are from 
same AZ so there were no errors when the nodes were down as expected (because 
we use QUORUM)

Do you use at least 3 distinct AZs? If so, you should indeed be fine regarding 
data integrity. Also repair should then work for you. If you have fewer than 3 
AZs, then you are in trouble...
Yes, we use 3 distinct AZs and replicate to all 3 AZs, which is why when 8 nodes 
were recycled there was absolutely no outage on Cassandra (the other two nodes 
still satisfy quorum consistency).

About the unreachable errors, I believe it can be due to the overload due to 
the missing nodes. Pressure on the remaining node might be too strong.
It is certainly possible, but we have a beefed-up cluster with <3% CPU and hardly 
any network I/O or disk usage. We have 162 nodes in the cluster and each node 
doesn’t have more than 80 to 100 MB of data.


However, as soon as I started removing nodes one by one, every time we see a 
lot of timeout and unavailable exceptions, which doesn’t make any sense because 
I am just removing a node that doesn’t even exist.

This probably added even more load: if you are using vnodes, all the remaining 
nodes probably started streaming data to each other at the speed of "nodetool 
getstreamthroughput". The AWS network isn't that good, and is probably 
saturated. Also, have you configured phi_convict_threshold to a high value, at 
least 10 or 12? This would avoid nodes being marked down that often.
We are using c3.2xlarge which has good network throughput (1GB/sec I think). We 
are using default value which is 200MB/sec in 2.0.9. We will play with it in 
future and see if this could make any difference but as I mentioned the data 
size on each node is not huge.
Regarding phi_convict_threshold, our Cassandra is not bringing itself down. 
There was a bug in health check from one of our internal tool and that tool is 
recycling the nodes. Nothing to do with Cassandra health. Again we will keep an 
eye on it in future.


What does "nodetool tpstats" outputs ?
Nodetool tpstats on which node? Any node?


Also you might try to monitor resources and see what happens (my guess is you 
should focus at iowait, disk usage and network, have an eye at cpu too).
We did monitor cpu, disk and network and they are all very low.


A quick fix would probably be to heavily throttle the network on all the nodes 
and see if it helps:

nodetool setstreamthroughput 2
We will play with this config. 2.0.9 defaults to 200MB/sec which I think is too 
high.


If this works, you could incrementally increase it and monitor, find the right 
tuning, and put it in cassandra.yaml.
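
Concretely, the throttle-and-monitor loop would look something like this; the 
yaml knob is stream_throughput_outbound_megabits_per_sec (note the unit is 
megabits per second), and the persisted value is whatever the monitoring shows 
works:

# throttle streaming hard on every node, then watch the cluster
nodetool setstreamthroughput 2
nodetool netstats    # streaming progress
nodetool tpstats     # dropped / pending requests

# once a good value is found, persist it in cassandra.yaml (applied on restart)
stream_throughput_outbound_megabits_per_sec: <tuned value>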

I opened a ticket a while ago about that issue: 
https://issues.apache.org/jira/browse/CASSANDRA-9509
I voted for this issue. Lets see if it gets picked up :).


I hope this will help you to go back to a healthy state allowing you a fast 
upgrade ;-).

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-03-02 22:17 GMT+01:00 Peddi, Praveen 
>:
Hi Robert,
Thanks for your response.

Replication factor is 3.

We are in the process of upgrading to 2.2.4. We have had too many performance 
issues with later versions of Cassandra (I have asked for help related to that 
in the forum). We are close to getting similar performance now and hopefully 
will upgrade in the next few weeks. Lots of testing to do :(.

We are not removing multiple nodes at the same time. All dead nodes are from the 
same AZ, so there were no errors when the nodes were down, as expected (because 
we use QUORUM). However, as soon as I started removing nodes one by one, every 
time we see a lot of timeout and unavailable exceptions, which doesn’t make any 
sense because I am just removing a node that doesn’t even exist.








From: Robert Coli >
Reply-To: "user@cassandra.apache.org" 
>
Date: Wednesday, March 2, 2016 at 2:52 PM
To: "user@cassandra.apache.org" 
>
Subject: Re: Removing Node causes bunch of HostUnavailableException

On Wed, Mar 2, 2016 at 8:10 AM, Peddi, Praveen 
> wrote:
We have few dead nodes in the cluster (Amazon ASG removed those thinking there 
is an issue with health). Now we are trying to remove those dead nodes from the 

Re: Broken links in Apache Cassandra home page

2016-03-03 Thread Eric Evans
On Wed, Mar 2, 2016 at 1:55 PM, Robert Coli  wrote:
> On Wed, Mar 2, 2016 at 7:00 AM, Eric Evans  wrote:
>>
>> On Tue, Mar 1, 2016 at 8:30 PM, ANG ANG  wrote:
>> > "#cassandra channel": http://freenode.net/
>>
>> The latter, while not presently useful, links to a "coming soon..."
>> for Freenode.  It might be pedantic to insist it's not broken, but I
>> don't know where else we could point that.  Freenode *is* the network
>> hosting the IRC channels, and such that it is, that is their website
>> (for now).
>
>
> https://www.w3.org/Addressing/draft-mirashi-url-irc-01.txt
>
> irc://freenode.net/#cassandra

Hrmm, that seems like more of a candidate for a channel name link, no?
 For example:

Many of the Cassandra developers and community members hang out in
the #cassandra channel on
irc.freenode.net.

And, perhaps until Freenode gets their website sorted, we could use
the Wikipedia article instead
(https://en.wikipedia.org/wiki/Freenode).


-- 
Eric Evans
eev...@wikimedia.org


Re: how to read parent_repair_history table?

2016-03-03 Thread Paulo Motta
> is there any other better way to find out a node's token range?  I see
systems.peers column family seems to include range information, so that is
promising but when I look at both datastax java driver and python driver,
its API both require a keyspace name and host name, I wonder why ?

range information is per keyspace, since it is related to the chosen topology
(replication factor, multi-DC, etc). You can construct ranges from
system.peers, but you will still need to look at the keyspace info, so that will
be much more complicated. So I guess the driver's getTokenRanges is your
best bet.
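
A small sketch of that driver call (Java driver 2.1; the contact point and
keyspace name are placeholders):

import com.datastax.driver.core.*;

public class TokenRangeSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build(); // placeholder
        Metadata metadata = cluster.getMetadata();
        // ranges each host is a replica for, for a given keyspace (topology-aware)
        for (Host host : metadata.getAllHosts()) {
            for (TokenRange range : metadata.getTokenRanges("my_keyspace", host)) { // placeholder keyspace
                System.out.println(host.getAddress() + " -> " + range);
            }
        }
        cluster.close();
    }
}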

> And just to be sure, the participants column in the repair_history table
represented the node being repaired and not the node being used to
comparing the data, correct?

participants include all participants of the repair, including the
coordinator.

2016-03-01 0:44 GMT-03:00 Jimmy Lin :

> is there any other better way to find out a node's token range?  I see
> systems.peers column family seems to include range information, so that is
> promising but when I look at both datastax java driver and python driver,
> its API both require a keyspace name and host name, I wonder why ?
>
>
> http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/Metadata.html#getTokenRanges-java.lang.String-com.datastax.driver.core.Host-
>
>
> And just to be sure, the participants column in the repair_history table
> represented the node being repaired and not the node being used to
> comparing the data, correct?
>
>
> On Thu, Feb 25, 2016 at 1:38 PM, Paulo Motta 
> wrote:
>
>> > how does it work when repair job targeting only local vs all DC? is
>> there any columns or flag i can tell the difference? or does it actualy
>> matter?
>>
>> You can not easily find out from the parent_repair_session table if a
>> repair is local-only or multi-dc. I created
>> https://issues.apache.org/jira/browse/CASSANDRA-11244 to add more
>> information to that table. Since that table only has id as primary key,
>> you'd need to do a full scan to perform checks on it, or keep track of the
>> parent id session when submitting the repair and query by primary key.
>>
>> What you could probably do to health check your nodes are repaired on
>> time is to check for each table:
>>
>> select * from repair_history where keyspace_name = 'ks' and columnfamily_name =
>> 'cf' and id > mintimeuuid(now() - gc_grace_seconds/2);
>>
>> And then verify for each node if all of its ranges have been repaired in
>> this period, and send an alert otherwise. You can find out a nodes range by
>> querying JMX via StorageServiceMBean.getRangeToEndpointMap.
>>
>> To make this task a bit simpler you could probably add a secondary index
>> to the participants column of repair_history table with:
>>
>> CREATE INDEX myindex ON system_distributed.repair_history (participants) ;
>>
>> and check each node status individually with:
>>
>> select * from repair_history where keyspace_name = 'ks' and columnfamily_name =
>> 'cf' and id > mintimeuuid(now() - gc_grace_seconds/2) AND participants
>> CONTAINS 'node_IP';
>>
>>
>>
>> 2016-02-25 16:22 GMT-03:00 Jimmy Lin :
>>
>>> hi Paulo,
>>>
>>> one more follow up ... :)
>>>
>>>  I noticed these tables are suppose to replicatd to all nodes in the 
>>> cluster, and it is not per node specific.
>>>
>>> how does it work when repair job targeting only local vs all DC? is there 
>>> any columns or flag i can tell the difference?
>>> or does it actualy matter?
>>>
>>>  thanks
>>>
>>>
>>>
>>>
>>> Sent from my iPhone
>>>
>>> On Feb 25, 2016, at 10:37 AM, Paulo Motta 
>>> wrote:
>>>
>>> > why each job repair execution will have 2 entries? I thought it will
>>> be one entry, begining with started_at column filled, and when it
>>> completed, finished_at column will be filled.
>>>
>>> that's correct, I was mistaken!
>>>
>>> > Also, if my cluster has more than 1 keyspace, and the way this table
>>> is structured, it will have multiple entries, one for each keysapce_name
>>> value. no ? thanks
>>>
>>> right, because repair sessions in different keyspaces will have
>>> different repair session ids.
>>>
>>> 2016-02-25 15:04 GMT-03:00 Jimmy Lin :
>>>
 hi Paulo,

 follow up on the # of entries question...

  why each job repair execution will have 2 entries?
 I thought it will be one entry, begining with started_at column filled, 
 and when it completed, finished_at column will be filled.

 Also, if my cluster has more than 1 keyspace, and the way this table is 
 structured, it will have multiple entries, one for each keysapce_name 
 value. no ?

 thanks



 Sent from my iPhone

 On Feb 25, 2016, at 5:48 AM, Paulo Motta 
 wrote:

 Hello Jimmy,

 The parent_repair_history table keeps track of start and finish
 information of a repair session.  The other 

Re: MemtableReclaimMemory pending building up

2016-03-03 Thread Alain RODRIGUEZ
Hi Dan,

I'll try to go through all the elements:

seeing this odd behavior happen, seemingly to single nodes at a time


Is that one node at a time or always the same node? Do you consider your
data model to be fairly evenly distributed?

The node starts to take more and more memory (instance has 48GB memory on
> G1GC)


Do you use 48 GB heap size or is that the total amount of memory in the
node ? Could we have your JVM settings (GC and heap sizes), also memtable
size and type (off heap?) and the amount of available memory ?

Note that there is a decent number of compactions going on as well but that
> is expected on these nodes and this particular one is catching up from a
> high volume of writes
>

Are the *concurrent_compactors* correctly throttled (about 8 with good
machines) and the *compaction_throughput_mb_per_sec* high enough to cope
with what is thrown at the node ? Using SSD I often see the latter
unthrottled (using 0 value), but I would try small increments first.

Also interestingly, neither CPU nor disk utilization are pegged while this
> is going on
>

First thing is making sure your memory management is fine. Having
information about the JVM and memory usage globally would help. Then, if
you are not fully using the resources you might want to try increasing the
number of *concurrent_writes* to a higher value (probably a way higher,
given the pending requests, but go safely, incrementally, first on a canary
node) and monitor tpstats + resources. Hope this will help Mutation pending
going down. My guess is that pending requests are messing with the JVM, but
it could be the exact contrary as well.

Native-Transport-Requests25 0  547935519 0
>   2586907


About native transport requests being blocked, you can probably mitigate things
by increasing native_transport_max_threads (default 128; try doubling it and
continue tuning incrementally). Also, an up-to-date client using native
protocol V3 handles connections / threads from clients a lot better. With a
heavy throughput like yours, you might want to give this a try.
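
For reference, the knobs mentioned in this thread all live in cassandra.yaml; a
sketch with the kind of incremental bumps being suggested (the concrete numbers
are assumptions to be validated on a canary node, one change at a time):

# cassandra.yaml -- starting points only, not recommendations
concurrent_writes: 64                  # default 32; raise only while CPU/disk have headroom
concurrent_compactors: 8               # keep compaction from starving request threads
compaction_throughput_mb_per_sec: 32   # raise in small increments (0 = unthrottled)
native_transport_max_threads: 256      # default 128; double it and re-check tpstats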

What is your current client ?
What does "netstat -an | grep -e 9042 -e 9160 | grep ESTABLISHED | wc -l"
outputs ? This is the number of clients connected to the node.
Do you have other significant errors or warning in the logs (other than
dropped mutations)? "grep -i -e "ERROR" -e "WARN"
/var/log/cassandra/system.log"

As a small conclusion, I would keep an eye on things related to memory
management and also try to push Cassandra's limits by increasing default
values, as you seem to have resources available, to make sure Cassandra can
cope with the high throughput. Pending operations = high memory pressure.
Reducing the pending work somehow will probably get you out of trouble.

Hope this first round of ideas will help you.

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-03-02 22:58 GMT+01:00 Dan Kinder :

> Also should note: Cassandra 2.2.5, Centos 6.7
>
> On Wed, Mar 2, 2016 at 1:34 PM, Dan Kinder  wrote:
>
>> Hi y'all,
>>
>> I am writing to a cluster fairly fast and seeing this odd behavior
>> happen, seemingly to single nodes at a time. The node starts to take more
>> and more memory (instance has 48GB memory on G1GC). tpstats shows that
>> MemtableReclaimMemory Pending starts to grow first, then later
>> MutationStage builds up as well. By then most of the memory is being
>> consumed, GC is getting longer, node slows down and everything slows down
>> unless I kill the node. Also the number of Active MemtableReclaimMemory
>> threads seems to stay at 1. Also interestingly, neither CPU nor disk
>> utilization are pegged while this is going on; it's on jbod and there is
>> plenty of headroom there. (Note that there is a decent number of
>> compactions going on as well but that is expected on these nodes and this
>> particular one is catching up from a high volume of writes).
>>
>> Anyone have any theories on why this would be happening?
>>
>>
>> $ nodetool tpstats
>> Pool NameActive   Pending  Completed   Blocked
>>  All time blocked
>> MutationStage   192715481  311327142 0
>>   0
>> ReadStage 7 09142871 0
>>   0
>> RequestResponseStage  1 0  690823199 0
>>   0
>> ReadRepairStage   0 02145627 0
>>   0
>> CounterMutationStage  0 0  0 0
>>   0
>> HintedHandoff 0 0144 0
>>   0
>> MiscStage 0 0  0 0
>>   0
>> CompactionExecutor   1224  41022 0
>>   0