Hi Abhishek

The article with the futex bug description lists the solution, which is to
upgrade to a version of RHEL or CentOS that have the specified patch.

What help do you specifically need? If you need help upgrading the OS I
would look at the documentation for RHEL or CentOS.

Ben

On Mon, 14 Nov 2016 at 22:48 Abhishek Gupta <gupta.abhis...@snapdeal.com>
wrote:

Hi,

We are seeing an issue where the system CPU is shooting off to a figure or
> 90% when the cluster is subjected to a relatively high write workload i.e
4k wreq/secs.

2016-11-14T13:27:47.900+0530 Process summary
  process cpu=695.61%
  application cpu=676.11% (*user=200.63% sys=475.49%) **<== Very High
System CPU *
  other: cpu=19.49%
  heap allocation rate *403mb*/s
[000533] user= 1.43% sys= 6.91% alloc= 2216kb/s - SharedPool-Worker-129
[000274] user= 0.38% sys= 7.78% alloc= 2415kb/s - SharedPool-Worker-34
[000292] user= 1.24% sys= 6.77% alloc= 2196kb/s - SharedPool-Worker-56
[000487] user= 1.24% sys= 6.69% alloc= 2260kb/s - SharedPool-Worker-79
[000488] user= 1.24% sys= 6.56% alloc= 2064kb/s - SharedPool-Worker-78
[000258] user= 1.05% sys= 6.66% alloc= 2250kb/s - SharedPool-Worker-41

On doing strace it was found that the following system call is consuming
all the system CPU
 timeout 10s strace -f -p 5954 -c -q
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------

*88.33 1712.798399       16674    102723     22191 futex* 3.98   77.098730
       4356     17700           read
 3.27   63.474795      394253       161        29 restart_syscall
 3.23   62.601530       29768      2103           epoll_wait

On searching we found the following bug with the RHEL 6.6, CentOS 6.6
kernel seems to be a probable cause for the issue:

https://docs.datastax.com/en/landing_page/doc/landing_page/troubleshooting/cassandra/fetuxWaitBug.html

The patch fix mentioned in the doc is also not present in our kernel.

sudo rpm -q --changelog kernel-`uname -r` | grep futex | grep ref
- [kernel] futex_lock_pi() key refcnt fix (Danny Feng) [566347]
{CVE-2010-0623}

Can some who has faced and resolved this issue help us here.

Thanks,
Abhishek


-- 
Ben Bromhead
CTO | Instaclustr <https://www.instaclustr.com/>
+1 650 284 9692
Managed Cassandra / Spark on AWS, Azure and Softlayer

Reply via email to