Re: High system CPU during high write workload

2016-11-15 Thread Bhuvan Rawal
Hi Ben,

Thanks for your reply. We tested the same workload on kernel
version 4.6.4-1.el7.elrepo.x86_64 and found that the issue is not present
there.

The bug had resulted in really high CPU in write workloads, the very area in
which Cassandra excels, degrading performance by at least 5x! I suggest that
a mention of this be included in the Cassandra community wiki, as it could
impact a large audience.

Thanks & Regards,
Bhuvan

On Tue, Nov 15, 2016 at 12:33 PM, Ben Bromhead wrote:

> Hi Abhishek
>
> The article with the futex bug description lists the solution, which is to
> upgrade to a version of RHEL or CentOS that has the specified patch.
>
> What help do you specifically need? If you need help upgrading the OS I
> would look at the documentation for RHEL or CentOS.
>
> Ben
>
> On Mon, 14 Nov 2016 at 22:48 Abhishek Gupta wrote:
>
> Hi,
>
> We are seeing an issue where the system CPU shoots up to over 90% when the
> cluster is subjected to a relatively high write workload, i.e.
> 4k wreq/sec.
>
> 2016-11-14T13:27:47.900+0530 Process summary
>   process cpu=695.61%
>   application cpu=676.11% (user=200.63% sys=475.49%) <== Very High System CPU
>   other: cpu=19.49%
>   heap allocation rate *403mb*/s
> [000533] user= 1.43% sys= 6.91% alloc= 2216kb/s - SharedPool-Worker-129
> [000274] user= 0.38% sys= 7.78% alloc= 2415kb/s - SharedPool-Worker-34
> [000292] user= 1.24% sys= 6.77% alloc= 2196kb/s - SharedPool-Worker-56
> [000487] user= 1.24% sys= 6.69% alloc= 2260kb/s - SharedPool-Worker-79
> [000488] user= 1.24% sys= 6.56% alloc= 2064kb/s - SharedPool-Worker-78
> [000258] user= 1.05% sys= 6.66% alloc= 2250kb/s - SharedPool-Worker-41
>
> Running strace showed that the following system call is consuming most of
> the system CPU:
>
>  timeout 10s strace -f -p 5954 -c -q
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> *88.33 1712.798399       16674    102723     22191 futex*
>   3.98   77.098730        4356     17700           read
>   3.27   63.474795      394253       161        29 restart_syscall
>   3.23   62.601530       29768      2103           epoll_wait
>
> On searching, we found that the following bug in the RHEL 6.6 / CentOS 6.6
> kernel seems to be a probable cause of the issue:
>
> https://docs.datastax.com/en/landing_page/doc/landing_page/troubleshooting/cassandra/fetuxWaitBug.html
>
> The patch fix mentioned in the doc is also not present in our kernel.
>
> sudo rpm -q --changelog kernel-`uname -r` | grep futex | grep ref
> - [kernel] futex_lock_pi() key refcnt fix (Danny Feng) [566347]
> {CVE-2010-0623}
>
> Can someone who has faced and resolved this issue help us here?
>
> Thanks,
> Abhishek
>
>
> --
> Ben Bromhead
> CTO | Instaclustr 
> +1 650 284 9692
> Managed Cassandra / Spark on AWS, Azure and Softlayer
>


Re: High system CPU during high write workload

2016-11-14 Thread Ben Bromhead
Hi Abhishek

The article with the futex bug description lists the solution, which is to
upgrade to a version of RHEL or CentOS that has the specified patch.

What help do you specifically need? If you need help upgrading the OS I
would look at the documentation for RHEL or CentOS.

Ben

On Mon, 14 Nov 2016 at 22:48 Abhishek Gupta wrote:

Hi,

We are seeing an issue where the system CPU shoots up to over 90% when the
cluster is subjected to a relatively high write workload, i.e.
4k wreq/sec.

2016-11-14T13:27:47.900+0530 Process summary
  process cpu=695.61%
  application cpu=676.11% (user=200.63% sys=475.49%) <== Very High System CPU
  other: cpu=19.49%
  heap allocation rate *403mb*/s
[000533] user= 1.43% sys= 6.91% alloc= 2216kb/s - SharedPool-Worker-129
[000274] user= 0.38% sys= 7.78% alloc= 2415kb/s - SharedPool-Worker-34
[000292] user= 1.24% sys= 6.77% alloc= 2196kb/s - SharedPool-Worker-56
[000487] user= 1.24% sys= 6.69% alloc= 2260kb/s - SharedPool-Worker-79
[000488] user= 1.24% sys= 6.56% alloc= 2064kb/s - SharedPool-Worker-78
[000258] user= 1.05% sys= 6.66% alloc= 2250kb/s - SharedPool-Worker-41

Running strace showed that the following system call is consuming most of
the system CPU:

 timeout 10s strace -f -p 5954 -c -q
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
*88.33 1712.798399       16674    102723     22191 futex*
  3.98   77.098730        4356     17700           read
  3.27   63.474795      394253       161        29 restart_syscall
  3.23   62.601530       29768      2103           epoll_wait
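When the summary has many rows, the dominant syscall can be picked out mechanically. The following is a minimal sketch, assuming the standard `strace -c` summary layout (percentage in the first column, syscall name in the last); the function name `top_syscall` is hypothetical and not part of strace.

```shell
# Hypothetical helper: read an `strace -c` summary on stdin and print the
# syscall with the largest "% time" share, followed by that percentage.
# Assumes the usual layout: "% time" is the first field, the name is the last.
top_syscall() {
  awk '
    # Consider only data rows (first field is a decimal number);
    # skip the header, the dashed rulers, and the trailing "total" row.
    $1 ~ /^[0-9]+\.[0-9]+$/ && $NF != "total" {
      if ($1 + 0 > max) { max = $1 + 0; name = $NF }
    }
    END { if (name != "") printf "%s %.2f\n", name, max }
  '
}
```

Since strace writes its summary to stderr, a possible invocation for the process traced above would be `timeout 10s strace -f -p 5954 -c -q 2>&1 | top_syscall`.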

On searching, we found that the following bug in the RHEL 6.6 / CentOS 6.6
kernel seems to be a probable cause of the issue:

https://docs.datastax.com/en/landing_page/doc/landing_page/troubleshooting/cassandra/fetuxWaitBug.html

The patch fix mentioned in the doc is also not present in our kernel.

sudo rpm -q --changelog kernel-`uname -r` | grep futex | grep ref
- [kernel] futex_lock_pi() key refcnt fix (Danny Feng) [566347]
{CVE-2010-0623}
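The changelog check above can be wrapped so that it gives a yes/no answer across several hosts. This is only a sketch: the function name `has_futex_fix` is hypothetical, and the string it searches for is an assumption based on the title of the upstream kernel commit for this fix, not something stated in this thread. Verify the exact changelog wording against the linked article before relying on it.

```shell
# Hypothetical helper: read a kernel RPM changelog on stdin and report
# whether the futex barrier fix appears in it.
# ASSUMPTION: the fix is listed under the upstream commit title below;
# confirm the exact wording against the linked article.
has_futex_fix() {
  # -F treats the pattern as a fixed string, so "()" is matched literally.
  if grep -qF 'futex: Ensure get_futex_key_refs() always implies a barrier'; then
    echo "patched"
  else
    echo "not patched"
  fi
}
```

On a host it could be used as `rpm -q --changelog "kernel-$(uname -r)" | has_futex_fix`. Given the single changelog line shown above, it would report "not patched".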

Can someone who has faced and resolved this issue help us here?

Thanks,
Abhishek


-- 
Ben Bromhead
CTO | Instaclustr 
+1 650 284 9692
Managed Cassandra / Spark on AWS, Azure and Softlayer