Re: dropped mutations cross node

2020-10-05 Thread onmstester onmstester
Thanks,

I've made a lot of configuration changes trying to fix the problem, but nothing 
worked (the last one was disabling hints), and after a few days the problem went away on its own!

The source of the droppedCrossNode messages changed every half hour, and it was 
not always the new nodes.

There is no difference between the new nodes and the old ones in configuration or node spec.

Sent using https://www.zoho.com/mail/




 On Mon, 05 Oct 2020 09:14:17 +0330, Erick Ramirez wrote:


Sorry for the late reply. Do you still need assistance with this issue?



If the source of the dropped mutations and high latency are the newer nodes, 
that indicates to me that you have an issue with the commitlog disks. Are the 
newer nodes identical in hardware configuration to the pre-existing nodes? Any 
differences in configuration you could point out? Cheers!

Re: dropped mutations cross node

2020-10-04 Thread Erick Ramirez
Sorry for the late reply. Do you still need assistance with this issue?

If the source of the dropped mutations and high latency are the newer
nodes, that indicates to me that you have an issue with the commitlog
disks. Are the newer nodes identical in hardware configuration to the
pre-existing nodes? Any differences in configuration you could point out?
Cheers!


dropped mutations cross node

2020-09-21 Thread onmstester onmstester
Hi, 

I've extended a cluster by 10%, and since then, every hour on some of the nodes 
(which change randomly each time), "dropped mutations cross node" appears in the 
logs (each time 1 or 2 drops, and sometimes a few thousand, with cross-node 
latency ranging from 3000 ms up to 90000 ms, i.e. 90 seconds!), and the insert 
rate has decreased by about 50%:

Token ownership looks OK (the standard deviation of ownership percentage even 
decreased with the cluster extension).

CPU usage on the nodes is less than 30 percent and well balanced across them.

Disk utilization is less than 10% (watching through iostat), and there are no 
pending compactions on the nodes.
There are no other log entries besides the dropped reports (apart from a few GC 
pauses of about 200-300 ms every 5 minutes).

There is no sign of memory problems looking at Java VisualVM.

Honestly, I do not monitor the network equipment (switches), but the network has 
not changed since the cluster was extended, and there is no increase in packet 
discard counters on the node side.




So to emphasize: there are dropped mutations for which I cannot detect the root 
cause. Is there any workaround or monitoring metric that I have missed here?
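
One metric worth watching directly is the per-type dropped-message counter itself 
(the same number nodetool tpstats prints per node). Below is a minimal sketch of 
polling it over JMX; it assumes remote JMX is reachable on the default port 7199 
without authentication, and the node addresses are placeholders.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class DroppedMutationPoller {
    public static void main(String[] args) throws Exception {
        // Placeholder addresses -- replace with the cluster's node IPs.
        String[] nodes = {"10.0.0.1", "10.0.0.2", "10.0.0.3"};
        // Cassandra's dropped-message metric for the MUTATION verb.
        ObjectName droppedMutations = new ObjectName(
                "org.apache.cassandra.metrics:type=DroppedMessage,scope=MUTATION,name=Dropped");
        for (String node : nodes) {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + node + ":7199/jmxrmi");
            try (JMXConnector jmx = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = jmx.getMBeanServerConnection();
                // "Count" is cumulative since the node started; sample it
                // periodically and diff to get a drop rate per node.
                Object count = mbs.getAttribute(droppedMutations, "Count");
                System.out.println(node + " dropped MUTATION total: " + count);
            }
        }
    }
}

Sampling this every few minutes across all nodes makes it easier to see which 
node starts dropping and when, instead of waiting for the periodic 5-second 
summary line in the log.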





Cluster Info:
Cassandra 3.11.2

RF 3

30 Nodes


Sent using https://www.zoho.com/mail/

Re: Dropped mutations

2019-07-28 Thread Ayub M
What do the READ and _TRACE dropped messages mean? There is no tracing
enabled on any node in the cluster, so what are these _TRACE dropped messages?

INFO  [ScheduledTasks:1] 2019-07-25 21:17:13,878
MessagingService.java:1281 - READ messages were dropped in last 5000 ms: 1
internal and 0 cross node. Mean internal dropped latency: 5960 ms and Mean
cross-node dropped latency: 0 ms
INFO  [ScheduledTasks:1] 2019-07-25 20:38:43,788
MessagingService.java:1281 - _TRACE messages were dropped in last 5000 ms:
5035 internal and 0 cross node. Mean internal dropped latency: 0 ms and
Mean cross-node dropped latency: 0 ms



On Thu, Jul 25, 2019 at 1:49 PM Ayub M  wrote:

> Thanks Jeff, does internal mean local node operations - in this case
> mutation response from local node and cross node means the time it took to
> get response back from other nodes depending on the consistency level
> choosen?
>
> On Thu, Jul 25, 2019 at 11:51 AM Jeff Jirsa  wrote:
>
>> This means your database is seeing commands that have already timed out
>> by the time it goes to execute them, so it ignores them and gives up
>> instead of working on work items that have already expired.
>>
>> The first log line shows 5 second latencies, the second line 6s and 8s
>> latencies, which sounds like either really bad disks or really bad JVM GC
>> pauses.
>>
>>
>> On Thu, Jul 25, 2019 at 8:45 AM Ayub M  wrote:
>>
>>> Hello, how do I read dropped mutations error messages - whats internal
>>> and cross node? For mutations it fails on cross-node and read_repair/read
>>> it fails on internal. What does it mean?
>>>
>>> INFO  [ScheduledTasks:1] 2019-07-21 11:44:46,150
>>> MessagingService.java:1281 - MUTATION messages were dropped in last 5000
>>> ms: 0 internal and 65 cross node. Mean internal dropped latency: 0 ms and
>>> Mean cross-node dropped latency: 4966 ms
>>> INFO  [ScheduledTasks:1] 2019-07-19 05:01:10,620
>>> MessagingService.java:1281 - READ_REPAIR messages were dropped in last 5000
>>> ms: 9 internal and 8 cross node. Mean internal dropped latency: 6013 ms and
>>> Mean cross-node dropped latency: 8164 ms
>>>
>>> --
>>>
>>> Regards,
>>> Ayub
>>>
>>
>
> --
> Regards,
> Ayub
>


-- 
Regards,
Ayub


Re: Dropped mutations

2019-07-25 Thread Ayub M
Thanks Jeff. Does internal mean local-node operations - in this case, the
mutation response from the local node - and does cross node mean the time it took
to get a response back from the other nodes, depending on the consistency level
chosen?

On Thu, Jul 25, 2019 at 11:51 AM Jeff Jirsa  wrote:

> This means your database is seeing commands that have already timed out by
> the time it goes to execute them, so it ignores them and gives up instead
> of working on work items that have already expired.
>
> The first log line shows 5 second latencies, the second line 6s and 8s
> latencies, which sounds like either really bad disks or really bad JVM GC
> pauses.
>
>
> On Thu, Jul 25, 2019 at 8:45 AM Ayub M  wrote:
>
>> Hello, how do I read dropped mutations error messages - whats internal
>> and cross node? For mutations it fails on cross-node and read_repair/read
>> it fails on internal. What does it mean?
>>
>> INFO  [ScheduledTasks:1] 2019-07-21 11:44:46,150
>> MessagingService.java:1281 - MUTATION messages were dropped in last 5000
>> ms: 0 internal and 65 cross node. Mean internal dropped latency: 0 ms and
>> Mean cross-node dropped latency: 4966 ms
>> INFO  [ScheduledTasks:1] 2019-07-19 05:01:10,620
>> MessagingService.java:1281 - READ_REPAIR messages were dropped in last 5000
>> ms: 9 internal and 8 cross node. Mean internal dropped latency: 6013 ms and
>> Mean cross-node dropped latency: 8164 ms
>>
>> --
>>
>> Regards,
>> Ayub
>>
>

-- 
Regards,
Ayub


Re: Dropped mutations

2019-07-25 Thread Rajsekhar Mallick
Hello Jeff,

Could you please help explain how to interpret these terms:
1. Internal mutations
2. Cross node mutations
3. Mean internal dropped latency
4. Cross node dropped latency

Thanks,
Rajsekhar

On Thu, 25 Jul, 2019, 9:21 PM Jeff Jirsa,  wrote:

> This means your database is seeing commands that have already timed out by
> the time it goes to execute them, so it ignores them and gives up instead
> of working on work items that have already expired.
>
> The first log line shows 5 second latencies, the second line 6s and 8s
> latencies, which sounds like either really bad disks or really bad JVM GC
> pauses.
>
>
> On Thu, Jul 25, 2019 at 8:45 AM Ayub M  wrote:
>
>> Hello, how do I read dropped mutations error messages - whats internal
>> and cross node? For mutations it fails on cross-node and read_repair/read
>> it fails on internal. What does it mean?
>>
>> INFO  [ScheduledTasks:1] 2019-07-21 11:44:46,150
>> MessagingService.java:1281 - MUTATION messages were dropped in last 5000
>> ms: 0 internal and 65 cross node. Mean internal dropped latency: 0 ms and
>> Mean cross-node dropped latency: 4966 ms
>> INFO  [ScheduledTasks:1] 2019-07-19 05:01:10,620
>> MessagingService.java:1281 - READ_REPAIR messages were dropped in last 5000
>> ms: 9 internal and 8 cross node. Mean internal dropped latency: 6013 ms and
>> Mean cross-node dropped latency: 8164 ms
>>
>> --
>>
>> Regards,
>> Ayub
>>
>


Re: Dropped mutations

2019-07-25 Thread Jeff Jirsa
This means your database is seeing commands that have already timed out by
the time it goes to execute them, so it ignores them and gives up instead
of working on work items that have already expired.

The first log line shows 5 second latencies, the second line 6s and 8s
latencies, which sounds like either really bad disks or really bad JVM GC
pauses.
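
To make the "already timed out" part concrete, the sketch below (illustrative
only, not Cassandra's actual code) shows the decision that log line summarizes:
when a mutation is finally dequeued, its age is compared against
write_request_timeout_in_ms; if it is older than that, it is dropped and its age
goes into either the "internal" or the "cross node" bucket, depending on whether
it originated on the local node or arrived from another node.

// Illustrative sketch of the load-shedding decision behind the
// "MUTATION messages were dropped" log line; not Cassandra's real code.
final class DropDecisionSketch {
    // cassandra.yaml: write_request_timeout_in_ms (default 2000)
    static final long WRITE_TIMEOUT_MS = 2000;

    static long internalDropped = 0;
    static long crossNodeDropped = 0;

    static void onDequeue(long enqueuedAtMs, boolean arrivedFromAnotherNode) {
        long ageMs = System.currentTimeMillis() - enqueuedAtMs;
        if (ageMs > WRITE_TIMEOUT_MS) {
            // The coordinator gave up on this write long ago, so applying it now
            // helps nobody; count it and move on (load shedding).
            if (arrivedFromAnotherNode) {
                crossNodeDropped++;  // "cross node"; mean age -> cross-node dropped latency
            } else {
                internalDropped++;   // "internal"; mean age -> internal dropped latency
            }
            return;
        }
        // ...otherwise apply the mutation to the commitlog and memtable...
    }
}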


On Thu, Jul 25, 2019 at 8:45 AM Ayub M  wrote:

> Hello, how do I read dropped mutations error messages - whats internal and
> cross node? For mutations it fails on cross-node and read_repair/read it
> fails on internal. What does it mean?
>
> INFO  [ScheduledTasks:1] 2019-07-21 11:44:46,150
> MessagingService.java:1281 - MUTATION messages were dropped in last 5000
> ms: 0 internal and 65 cross node. Mean internal dropped latency: 0 ms and
> Mean cross-node dropped latency: 4966 ms
> INFO  [ScheduledTasks:1] 2019-07-19 05:01:10,620
> MessagingService.java:1281 - READ_REPAIR messages were dropped in last 5000
> ms: 9 internal and 8 cross node. Mean internal dropped latency: 6013 ms and
> Mean cross-node dropped latency: 8164 ms
>
> --
>
> Regards,
> Ayub
>


Dropped mutations

2019-07-25 Thread Ayub M
Hello, how do I read the dropped mutations error messages - what's internal and
cross node? For mutations the drops are cross-node, and for read_repair/read they
are internal. What does that mean?

INFO  [ScheduledTasks:1] 2019-07-21 11:44:46,150
MessagingService.java:1281 - MUTATION messages were dropped in last 5000
ms: 0 internal and 65 cross node. Mean internal dropped latency: 0 ms and
Mean cross-node dropped latency: 4966 ms
INFO  [ScheduledTasks:1] 2019-07-19 05:01:10,620
MessagingService.java:1281 - READ_REPAIR messages were dropped in last 5000
ms: 9 internal and 8 cross node. Mean internal dropped latency: 6013 ms and
Mean cross-node dropped latency: 8164 ms

-- 

Regards,
Ayub


Re: Problem with dropped mutations

2018-07-02 Thread Jeff Jirsa
Dropped mutations are load shedding - something's not happy.

Are you seeing GC pauses? 

What heap size and version?

What memtable settings?

-- 
Jeff Jirsa


> On Jul 2, 2018, at 12:48 AM, Hannu Kröger  wrote:
> 
> Yes, there are timeouts sometimes but more on the read side. And yes, there 
> are certain data modeling problems which will be soon addressed but we need 
> to keep things steady before we get there. 
> 
> I guess many write timeouts go unnoticed due to consistency level != ALL. 
> 
> Network looks to be working fine. 
> 
> Hannu
> 
>> ZAIDI, ASAD A wrote on 26.6.2018 at 21:42:
>> 
>> Are you also seeing time-outs on certain Cassandra operations?? If yes, you 
>> may have to tweak *request_timeout parameter in order to get rid of dropped 
>> mutation messages if application data model is not upto mark!
>> 
>> You can also check if network isn't dropping packets (ifconfig  -a tool) +  
>> storage (dstat tool) isn't reporting too slow disks.
>> 
>> Cheers/Asad
>> 
>> 
>> -Original Message-
>> From: Hannu Kröger [mailto:hkro...@gmail.com] 
>> Sent: Tuesday, June 26, 2018 9:49 AM
>> To: user 
>> Subject: Problem with dropped mutations
>> 
>> Hello,
>> 
>> We have a cluster with somewhat heavy load and we are seeing dropped 
>> mutations (variable amount and not all nodes have those).
>> 
>> Are there some clear trigger which cause those? What would be the best 
>> pragmatic approach to start debugging those? We have already added more 
>> memory which seemed to help somewhat but not completely.
>> 
>> Cheers,
>> Hannu
>> 
>> 
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>> 
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Problem with dropped mutations

2018-07-02 Thread Hannu Kröger
Yes, there are timeouts sometimes, but more on the read side. And yes, there are 
certain data modeling problems which will soon be addressed, but we need to keep 
things steady before we get there. 

I guess many write timeouts go unnoticed due to consistency level != ALL. 

Network looks to be working fine. 

Hannu

> ZAIDI, ASAD A wrote on 26.6.2018 at 21:42:
> 
> Are you also seeing time-outs on certain Cassandra operations?? If yes, you 
> may have to tweak *request_timeout parameter in order to get rid of dropped 
> mutation messages if application data model is not upto mark!
> 
> You can also check if network isn't dropping packets (ifconfig  -a tool) +  
> storage (dstat tool) isn't reporting too slow disks.
> 
> Cheers/Asad
> 
> 
> -Original Message-
> From: Hannu Kröger [mailto:hkro...@gmail.com] 
> Sent: Tuesday, June 26, 2018 9:49 AM
> To: user 
> Subject: Problem with dropped mutations
> 
> Hello,
> 
> We have a cluster with somewhat heavy load and we are seeing dropped 
> mutations (variable amount and not all nodes have those).
> 
> Are there some clear trigger which cause those? What would be the best 
> pragmatic approach to start debugging those? We have already added more 
> memory which seemed to help somewhat but not completely.
> 
> Cheers,
> Hannu
> 
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



RE: Problem with dropped mutations

2018-06-26 Thread ZAIDI, ASAD A
Are you also seeing timeouts on certain Cassandra operations? If yes, you may 
have to tweak the *request_timeout parameters in order to get rid of dropped 
mutation messages if the application data model is not up to the mark.

You can also check that the network isn't dropping packets (ifconfig -a) and that 
storage (dstat) isn't reporting overly slow disks.

Cheers/Asad


-Original Message-
From: Hannu Kröger [mailto:hkro...@gmail.com] 
Sent: Tuesday, June 26, 2018 9:49 AM
To: user 
Subject: Problem with dropped mutations

Hello,

We have a cluster with somewhat heavy load and we are seeing dropped mutations 
(variable amount and not all nodes have those).

Are there some clear trigger which cause those? What would be the best 
pragmatic approach to start debugging those? We have already added more memory 
which seemed to help somewhat but not completely.

Cheers,
Hannu



-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Problem with dropped mutations

2018-06-26 Thread Joshua Galbraith
Hannu,

Dropped mutations are often a sign of load-shedding due to an overloaded
node or cluster. Are you seeing resource saturation like high CPU usage
(because the write path is usually CPU-bound) on any of the nodes in your
cluster?

Some potential contributing factors that might be causing you to drop
mutations are long garbage collection (GC) pauses or large partitions. Do
the drops coincide with an increase in requests, a code change, or
compaction activity?
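
If large partitions are a suspect, nodetool tablehistograms will show a table's
partition size distribution. As a rough sketch of the same check done
programmatically -- assuming Cassandra 3.x metric names, remote JMX on the
default port 7199 without authentication, and placeholder node/keyspace/table
names:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class LargestPartitionCheck {
    public static void main(String[] args) throws Exception {
        String node = "10.0.0.1";        // placeholder
        String keyspace = "my_keyspace"; // placeholder
        String table = "my_table";       // placeholder
        ObjectName maxPartitionSize = new ObjectName(
                "org.apache.cassandra.metrics:type=Table,keyspace=" + keyspace
                + ",scope=" + table + ",name=MaxPartitionSize");
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + node + ":7199/jmxrmi");
        try (JMXConnector jmx = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = jmx.getMBeanServerConnection();
            // Size in bytes of the largest compacted partition in this table.
            long bytes = ((Number) mbs.getAttribute(maxPartitionSize, "Value")).longValue();
            // Partitions in the hundreds of MB or larger are a common driver of
            // long GC pauses and, in turn, dropped mutations.
            System.out.printf("%s.%s largest compacted partition: %.1f MiB%n",
                    keyspace, table, bytes / (1024.0 * 1024.0));
        }
    }
}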

On Tue, Jun 26, 2018 at 7:48 AM, Hannu Kröger  wrote:

> Hello,
>
> We have a cluster with somewhat heavy load and we are seeing dropped
> mutations (variable amount and not all nodes have those).
>
> Are there some clear trigger which cause those? What would be the best
> pragmatic approach to start debugging those? We have already added more
> memory which seemed to help somewhat but not completely.
>
> Cheers,
> Hannu
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


-- 
*Joshua Galbraith *| Senior Software Engineer | New Relic


Problem with dropped mutations

2018-06-26 Thread Hannu Kröger
Hello,

We have a cluster with somewhat heavy load and we are seeing dropped mutations 
(variable amount and not all nodes have those).

Are there any clear triggers which cause those? What would be the best 
pragmatic approach to start debugging them? We have already added more memory, 
which seemed to help somewhat but not completely.

Cheers,
Hannu



-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Dropped Mutations

2018-04-19 Thread Shalom Sagges
Thanks a lot Hitesh!

I'll try to re-tune the heap to a lower level


Shalom Sagges
DBA
T: +972-74-700-4035


On Thu, Apr 19, 2018 at 12:42 AM, hitesh dua <hiteshd...@gmail.com> wrote:

> Hi ,
>
> I'll recommend tuning you heap size further( preferably lower) as large
> Heap size can lead to Large Garbage collection pauses also known as also
> known as a stop-the-world event. A pause occurs when a region of memory is
> full and the JVM needs to make space to continue. During a pause all
> operations are suspended. Because a pause affects networking, the node can
> appear as down to other nodes in the cluster. Additionally, any Select and
> Insert statements will wait, which increases read and write latencies.
>
> Any pause of more than a second, or multiple pauses within a second that
> add to a large fraction of that second, should be avoided. The basic cause
> of the problem is the rate of data stored in memory outpaces the rate at
> which data can be removed
>
> MUTATION : If a write message is processed after its timeout
> (write_request_timeout_in_ms) it either sent a failure to the client or it
> met its requested consistency level and will relay on hinted handoff and
> read repairs to do the mutation if it succeeded.
>
> Another possible cause of the Issue could be you HDDs as that could too
> be a bottleneck.
>
> *MAX_HEAP_SIZE*
> The recommended maximum heap size depends on which GC is used:
> Hardware setupRecommended MAX_HEAP_SIZE
> Older computers Typically 8 GB.
> CMS for newer computers (8+ cores) with up to 256 GB RAM No more 14 GB.
>
>
> Thanks,
> Hitesh dua
> hiteshd...@gmail.com
>
> On Wed, Apr 18, 2018 at 10:07 PM, shalom sagges <shalomsag...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> I have a 44 node cluster (22 nodes on each DC).
>> Each node has 24 cores and 130 GB RAM, 3 TB HDDs.
>> Version 2.0.14 (soon to be upgraded)
>> ~10K writes per second per node.
>> Heap size: 8 GB max, 2.4 GB newgen
>>
>> I deployed Reaper and GC started to increase rapidly. I'm not sure if
>> it's because there was a lot of inconsistency in the data, but I decided to
>> increase the heap to 16 GB and new gen to 6 GB. I increased the max tenure
>> from 1 to 5.
>>
>> I tested on a canary node and everything was fine but when I changed the
>> entire DC, I suddenly saw a lot of dropped mutations in the logs on most of
>> the nodes. (Reaper was not running on the cluster yet but a manual repair
>> was running).
>>
>> Can the heap increment cause lots of dropped mutations?
>> When is a mutation considered as dropped? Is it during flush? Is it
>> during the write to the commit log or memtable?
>>
>> Thanks!
>>
>>
>>
>>
>



Re: Dropped Mutations

2018-04-18 Thread hitesh dua
Hi ,

I'd recommend tuning your heap size further (preferably lower), as a large heap
size can lead to long garbage collection pauses, also known as stop-the-world
events. A pause occurs when a region of memory is full and the JVM needs to make
space to continue. During a pause all operations are suspended. Because a pause
affects networking, the node can appear down to other nodes in the cluster.
Additionally, any SELECT and INSERT statements will wait, which increases read
and write latencies.

Any pause of more than a second, or multiple pauses within a second that add up
to a large fraction of that second, should be avoided. The basic cause of the
problem is that the rate at which data is stored in memory outpaces the rate at
which it can be removed.
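
A quick way to see whether that is happening on a node (besides reading the GC
log) is to watch the JVM's own garbage-collector counters over JMX. A minimal
sketch, assuming remote JMX on the default port 7199 without authentication and
a placeholder node address:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class GcTimeSampler {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://10.0.0.1:7199/jmxrmi"); // placeholder
        try (JMXConnector jmx = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = jmx.getMBeanServerConnection();
            while (true) {
                long totalGcMs = 0;
                for (ObjectName name : mbs.queryNames(
                        new ObjectName("java.lang:type=GarbageCollector,*"), null)) {
                    GarbageCollectorMXBean gc = ManagementFactory.newPlatformMXBeanProxy(
                            mbs, name.getCanonicalName(), GarbageCollectorMXBean.class);
                    totalGcMs += gc.getCollectionTime(); // cumulative, approximate
                }
                System.out.println("cumulative GC time: " + totalGcMs + " ms");
                Thread.sleep(5000);
            }
        }
    }
}

Comparing successive samples shows roughly how much of each interval the node
spent collecting; large jumps tend to line up with dropped mutations and with
nodes briefly appearing down to their peers.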

MUTATION: if a write message is processed after its timeout
(write_request_timeout_in_ms), the coordinator has either already returned a
failure to the client or already met the requested consistency level, and it
relies on hinted handoff and read repair to apply the mutation if the write
succeeded.

Another possible cause of the issue could be your HDDs, as they could also be a
bottleneck.

*MAX_HEAP_SIZE*
The recommended maximum heap size depends on which GC is used:

Hardware setup                                             Recommended MAX_HEAP_SIZE
Older computers                                            Typically 8 GB
CMS for newer computers (8+ cores) with up to 256 GB RAM   No more than 14 GB


Thanks,
Hitesh dua
hiteshd...@gmail.com

On Wed, Apr 18, 2018 at 10:07 PM, shalom sagges <shalomsag...@gmail.com>
wrote:

> Hi All,
>
> I have a 44 node cluster (22 nodes on each DC).
> Each node has 24 cores and 130 GB RAM, 3 TB HDDs.
> Version 2.0.14 (soon to be upgraded)
> ~10K writes per second per node.
> Heap size: 8 GB max, 2.4 GB newgen
>
> I deployed Reaper and GC started to increase rapidly. I'm not sure if it's
> because there was a lot of inconsistency in the data, but I decided to
> increase the heap to 16 GB and new gen to 6 GB. I increased the max tenure
> from 1 to 5.
>
> I tested on a canary node and everything was fine but when I changed the
> entire DC, I suddenly saw a lot of dropped mutations in the logs on most of
> the nodes. (Reaper was not running on the cluster yet but a manual repair
> was running).
>
> Can the heap increment cause lots of dropped mutations?
> When is a mutation considered as dropped? Is it during flush? Is it during
> the write to the commit log or memtable?
>
> Thanks!
>
>
>
>


Dropped Mutations

2018-04-18 Thread shalom sagges
Hi All,

I have a 44 node cluster (22 nodes on each DC).
Each node has 24 cores and 130 GB RAM, 3 TB HDDs.
Version 2.0.14 (soon to be upgraded)
~10K writes per second per node.
Heap size: 8 GB max, 2.4 GB newgen

I deployed Reaper and GC started to increase rapidly. I'm not sure if it's
because there was a lot of inconsistency in the data, but I decided to
increase the heap to 16 GB and new gen to 6 GB. I increased the max tenure
from 1 to 5.

I tested on a canary node and everything was fine but when I changed the
entire DC, I suddenly saw a lot of dropped mutations in the logs on most of
the nodes. (Reaper was not running on the cluster yet but a manual repair
was running).

Can the heap increase cause lots of dropped mutations?
When is a mutation considered dropped? Is it during flush? Is it during the
write to the commit log or the memtable?

Thanks!


Re: Dropped Mutations

2018-01-11 Thread kurt greaves
Dropped mutations aren't data loss. Data loss implies the data was already
there and is now gone, whereas for a dropped mutation the data was never
there in the first place. A dropped mutation just results in an
inconsistency, or potentially no data if all mutations are dropped, and C*
will tell you this; it's up to your client to respond accordingly (e.g.
re-write the data if it's an idempotent query and your desired CL failed to
be achieved).
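
As a concrete illustration of that client-side response, here is a minimal
sketch of re-writing an idempotent insert after a write timeout. It assumes the
DataStax Java driver 3.x; the contact point, keyspace and table are
placeholders.

import java.util.UUID;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

public class IdempotentRewriteSketch {
    public static void main(String[] args) throws Exception {
        try (Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build()) {
            Session session = cluster.connect("my_keyspace"); // placeholder keyspace
            // The same id is used on every attempt, so replaying the write cannot
            // create a different row -- that is what makes the retry safe.
            UUID id = UUID.randomUUID();
            SimpleStatement insert = new SimpleStatement(
                    "INSERT INTO events (id, payload) VALUES (?, ?)", id, "some payload");
            insert.setIdempotent(true);
            for (int attempt = 1; attempt <= 3; attempt++) {
                try {
                    session.execute(insert);
                    break; // the requested consistency level was met
                } catch (WriteTimeoutException e) {
                    // Some replicas may have applied the mutation and some may have
                    // dropped it; because the statement is idempotent, re-writing it
                    // simply converges the replicas.
                    if (attempt == 3) throw e;
                }
            }
        }
    }
}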

On 11 January 2018 at 08:18, ਨਿਹੰਗ <niih...@gmail.com> wrote:

> Hello
> Could the following be interpreted as, 'Dropped Mutations', in some cases
> mean data loss?
>
> http://cassandra.apache.org/doc/latest/faq/index.html#why-message-dropped
> For writes, this means that the mutation was not applied to all replicas
> it was sent to. The inconsistency will be repaired by read repair, hints or
> a manual repair. *The write operation may also have timeouted as a result*
> .
>
> Thanks
> N
>


Dropped Mutations

2018-01-11 Thread ਨਿਹੰਗ
Hello
Could the following be interpreted as meaning that 'Dropped Mutations' can, in
some cases, mean data loss?

http://cassandra.apache.org/doc/latest/faq/index.html#why-message-dropped
For writes, this means that the mutation was not applied to all replicas it
was sent to. The inconsistency will be repaired by read repair, hints or a
manual repair. *The write operation may also have timeouted as a result*.

Thanks
N


Increase in dropped mutations after major upgrade from 1.2.18 to 2.0.10

2014-11-10 Thread Paulo Ricardo Motta Gomes
Hey,

We've seen a considerable increase in the number of dropped mutations after
a major upgrade from 1.2.18 to 2.0.10. I initially thought it was due to
the extra load incurred by upgradesstables, but the dropped mutations
continue even after all sstables are upgraded.

Additional info: Overall (read, write and range) latency improved with the
upgrade, which is great, but I don't understand why dropped mutations has
increased. I/O and CPU load is pretty much the same, number of completed
tasks is the only metric that increased together with dropped mutations.

I also noticed that the number of all time blocked FlushWriter operations
is about 5% of completed operations, don't know if this is related, but in
case it helps out...

Does anyone have a clue what that could be, or what we should monitor to find
out? Any help or JIRA pointers would be kindly appreciated.

Cheers,

-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Increase in dropped mutations after major upgrade from 1.2.18 to 2.0.10

2014-11-10 Thread Duncan Sands

Hi Paulo,

On 10/11/14 15:18, Paulo Ricardo Motta Gomes wrote:

Hey,

We've seen a considerable increase in the number of dropped mutations after a
major upgrade from 1.2.18 to 2.0.10. I initially thought it was due to the extra
load incurred by upgradesstables, but the dropped mutations continue even after
all sstables are upgraded.


are the clocks on all your nodes synchronized with each other?

Ciao, Duncan.



Additional info: Overall (read, write and range) latency improved with the
upgrade, which is great, but I don't understand why dropped mutations has
increased. I/O and CPU load is pretty much the same, number of completed tasks
is the only metric that increased together with dropped mutations.

I also noticed that the number of all time blocked FlushWriter operations is
about 5% of completed operations, don't know if this is related, but in case it
helps out...

Anyone has a clue on what could that be? Or what should we monitor to find out?
Any help or JIRA pointers would be kindly appreciated.

Cheers,

--
*Paulo Motta*

Chaordic | /Platform/
_www.chaordic.com.br http://www.chaordic.com.br/_
+55 48 3232.3200




Re: Increase in dropped mutations after major upgrade from 1.2.18 to 2.0.10

2014-11-10 Thread Paulo Ricardo Motta Gomes
On Mon, Nov 10, 2014 at 12:46 PM, Duncan Sands duncan.sa...@gmail.com
wrote:

 Hi Paulo,

 On 10/11/14 15:18, Paulo Ricardo Motta Gomes wrote:

 Hey,

 We've seen a considerable increase in the number of dropped mutations
 after a
 major upgrade from 1.2.18 to 2.0.10. I initially thought it was due to
 the extra
 load incurred by upgradesstables, but the dropped mutations continue even
 after
 all sstables are upgraded.


 are the clocks on all your nodes synchronized with each other?

 Ciao, Duncan.


Yes, the servers are synchronized via NTP.

Cheers!




 Additional info: Overall (read, write and range) latency improved with the
 upgrade, which is great, but I don't understand why dropped mutations has
 increased. I/O and CPU load is pretty much the same, number of completed
 tasks
 is the only metric that increased together with dropped mutations.

 I also noticed that the number of all time blocked FlushWriter
 operations is
 about 5% of completed operations, don't know if this is related, but in
 case it
 helps out...

 Anyone has a clue on what could that be? Or what should we monitor to
 find out?
 Any help or JIRA pointers would be kindly appreciated.

 Cheers,

 --
 *Paulo Motta*

 Chaordic | /Platform/
 _www.chaordic.com.br http://www.chaordic.com.br/_
 +55 48 3232.3200





-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


dropped mutations, UnavailableException, and long GC

2011-02-24 Thread Jeffrey Wang
Hey all,

Our setup is 5 machines running Cassandra 0.7.0 with 24GB of heap and 1.5TB 
disk each collocated in a DC. We're doing bulk imports from each of the nodes 
with RF = 2 and write consistency ANY (write perf is very important). The 
behavior we're seeing is this:


-  Nodes often see each other as dead even though none of the nodes 
actually go down. I suspect this may be due to long GCs. It seems like 
increasing the RPC timeout could help this, but I'm not convinced this is the 
root of the problem. Note that in this case writes return with the 
UnavailableException.

-  As mentioned, long GCs. We see the ParNew GC doing a lot of smaller 
collections (few hundred MB) which are very fast (few hundred ms), but every 
once in a while the ConcurrentMarkSweep will take a LONG time (up to 15 min!) 
to collect upwards of 15GB at once.

-  On some nodes, we see a lot of pending MutationStages build up (e.g. 
500K), which leads to the messages "Dropped X MUTATION messages in the last 
5000ms," presumably meaning that Cassandra has decided to not write one of the 
replicas of the data. This is not a HUGE deal, but is less than ideal.

-  The end result is that a bunch of writes end up failing due to the 
UnavailableExceptions, so not all of our data is getting into Cassandra.

So my question is: what is the best way to avoid this behavior? Our memtable 
thresholds are fairly low (256MB) so there should be plenty of heap space to 
work with. We may experiment with write consistency ONE or ALL to see if the 
perf hit is not too bad, but I wanted to get some opinions on why this might be 
happening. Thanks!

-Jeffrey



Re: dropped mutations, UnavailableException, and long GC

2011-02-24 Thread Narendra Sharma
1. Why 24GB of heap? Do you need a heap that large? A bigger heap can lead to
longer GC cycles, but 15 min looks too long.
2. Do you have the row cache enabled?
3. How many column families do you have?
4. Enable GC logs and monitor what GC is doing to get an idea of why it is
taking so long. You can add the following to enable GC logging.
# GC logging options -- uncomment to enable
# JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
# JVM_OPTS="$JVM_OPTS -XX:+PrintGCTimeStamps"
# JVM_OPTS="$JVM_OPTS -XX:+PrintClassHistogram"
# JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
# JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
# JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"

5. Move to Cassandra 0.7.2, if possible. It has the following nice feature:
"added flush_largest_memtables_at and reduce_cache_sizes_at options to
cassandra.yaml as an escape valve for memory pressure"

Thanks,
Naren


On Thu, Feb 24, 2011 at 2:21 PM, Jeffrey Wang jw...@palantir.com wrote:

 Hey all,



 Our setup is 5 machines running Cassandra 0.7.0 with 24GB of heap and 1.5TB
 disk each collocated in a DC. We’re doing bulk imports from each of the
 nodes with RF = 2 and write consistency ANY (write perf is very important).
 The behavior we’re seeing is this:



 -  Nodes often see each other as dead even though none of the
 nodes actually go down. I suspect this may be due to long GCs. It seems like
 increasing the RPC timeout could help this, but I’m not convinced this is
 the root of the problem. Note that in this case writes return with the
 UnavailableException.

 -  As mentioned, long GCs. We see the ParNew GC doing a lot of
 smaller collections (few hundred MB) which are very fast (few hundred ms),
 but every once in a while the ConcurrentMarkSweep will take a LONG time (up
 to 15 min!) to collect upwards of 15GB at once.

 -  On some nodes, we see a lot of pending MutationStages build up
 (e.g. 500K), which leads to the messages “Dropped X MUTATION messages in the
 last 5000ms,” presumably meaning that Cassandra has decided to not write one
 of the replicas of the data. This is not a HUGE deal, but is less than
 ideal.

 -  The end result is that a bunch of writes end up failing due to
 the UnavailableExceptions, so not all of our data is getting into Cassandra.



 So my question is: what is the best way to avoid this behavior? Our
 memtable thresholds are fairly low (256MB) so there should be plenty of heap
 space to work with. We may experiment with write consistency ONE or ALL to
 see if the perf hit is not too bad, but I wanted to get some opinions on why
 this might be happening. Thanks!



 -Jeffrey