Re: [Ocfs2-users] OCFS2 KVM Crashes Yet Again !

2017-10-27 Thread netbsd
Hello,

So we have finally created the new cluster: 3 identical KVMs:

-8 vCPUs
-10GB RAM per node
-Custom kernel 4.13.2OCFS
-All 3 VMs run on a single Dell host server, which has more than enough 
resources, so the network connection between the VMs cannot be an issue 
yet (we will move them to separate physical servers once they become 
rock solid)

It ran fine for 9 days, until today one of the webservers crashed on 
OCFS2 again.

Here is the picture of the crashed server:

https://ibb.co/kxSqLm


And the log from the other nodes:

Oct 27 13:11:06 webserver2 kernel: [789844.406061] o2net: Connection to 
node webserver3 (num 2) at 10.0.0.247: has been idle for 30.688 
secs.
Oct 27 13:11:36 webserver2 kernel: [789875.125863] o2net: Connection to 
node webserver3 (num 2) at 10.0.0.247: has been idle for 30.720 
secs.
Oct 27 13:11:40 webserver2 kernel: [789878.935510] o2net: No longer 
connected to node webserver3 (num 2) at 10.0.0.247:
Oct 27 13:11:40 webserver2 kernel: [789878.935924] o2cb: o2dlm has 
evicted node 2 from domain 428503AACBAA492D84DFA48C5CF305B4
Oct 27 13:11:40 webserver2 kernel: [789879.050040] o2cb: o2dlm has 
evicted node 2 from domain E6CEF44C077640538468D6FCD1E27C5F
Oct 27 13:11:41 webserver2 kernel: [789880.245846] o2dlm: Begin recovery 
on domain 428503AACBAA492D84DFA48C5CF305B4 for node 2
Oct 27 13:11:41 webserver2 kernel: [789880.246863] o2dlm: Node 1 (me) is 
the Recovery Master for the dead node 2 in domain 
428503AACBAA492D84DFA48C5CF305B4
Oct 27 13:11:41 webserver2 kernel: [789880.325817] o2dlm: End recovery 
on domain 428503AACBAA492D84DFA48C5CF305B4
Oct 27 13:11:42 webserver2 kernel: [789880.501802] o2dlm: Begin recovery 
on domain E6CEF44C077640538468D6FCD1E27C5F for node 2
Oct 27 13:11:42 webserver2 kernel: [789880.502841] o2dlm: Node 1 (me) is 
the Recovery Master for the dead node 2 in domain 
E6CEF44C077640538468D6FCD1E27C5F
Oct 27 13:11:47 webserver2 kernel: [789885.629843] o2dlm: End recovery 
on domain E6CEF44C077640538468D6FCD1E27C5F
Oct 27 13:11:47 webserver2 kernel: [789885.684062] ocfs2: Begin replay 
journal (node 2, slot 1) on device (254,64)
Oct 27 13:11:47 webserver2 kernel: [789885.707354] ocfs2: End replay 
journal (node 2, slot 1) on device (254,64)
Oct 27 13:11:47 webserver2 kernel: [789885.737907] ocfs2: Beginning 
quota recovery on device (254,64) for slot 1
Oct 27 13:11:47 webserver2 kernel: [789885.757285] ocfs2: Finishing 
quota recovery on device (254,64) for slot 1
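
As an aside, the 30-second idle messages above are governed by o2net's configurable timeouts. A minimal sketch for checking them, assuming a Debian-style install where the o2cb defaults live in /etc/default/o2cb (the path and variable names are the usual defaults, not taken from this report):

```shell
#!/bin/sh
# Sketch: print the o2cb timeouts that control the "has been idle for
# 30.xxx secs" warnings. Assumes a Debian-style /etc/default/o2cb;
# adjust the path for other distributions.
CFG=/etc/default/o2cb
if [ -r "$CFG" ]; then
    grep -E 'O2CB_(IDLE_TIMEOUT_MS|KEEPALIVE_DELAY_MS|RECONNECT_DELAY_MS|HEARTBEAT_THRESHOLD)' "$CFG"
else
    echo "o2cb defaults not found at $CFG (is ocfs2-tools installed?)"
fi
```

Raising O2CB_IDLE_TIMEOUT_MS only papers over whatever is stalling the node, but comparing it against the log can tell you whether the evictions above are timeout-driven.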
Oct 27 13:19:40 webserver2 kernel: [790358.453142] php-fpm7.0  D
0  8659   8654 0x
Oct 27 13:19:40 webserver2 kernel: [790358.453145] Call Trace:
Oct 27 13:19:40 webserver2 kernel: [790358.453153]  ? 
__schedule+0x3c8/0x860
Oct 27 13:19:40 webserver2 kernel: [790358.453155]  ? schedule+0x32/0x80
Oct 27 13:19:40 webserver2 kernel: [790358.453158]  ? 
rwsem_down_write_failed+0x232/0x410
Oct 27 13:19:40 webserver2 kernel: [790358.453160]  ? dput+0x2f/0x1f0
Oct 27 13:19:40 webserver2 kernel: [790358.453164]  ? 
call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: [790358.453165]  ? 
call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: [790358.453167]  ? 
down_write+0x29/0x40
Oct 27 13:19:40 webserver2 kernel: [790358.453170]  ? 
path_openat+0x3dc/0x1440
Oct 27 13:19:40 webserver2 kernel: [790358.453227]  ? 
ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:19:40 webserver2 kernel: [790358.453230]  ? 
do_filp_open+0x99/0x110
Oct 27 13:19:40 webserver2 kernel: [790358.453232]  ? 
kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:19:40 webserver2 kernel: [790358.453234]  ? dput+0x2f/0x1f0
Oct 27 13:19:40 webserver2 kernel: [790358.453236]  ? 
__check_object_size+0xb3/0x190
Oct 27 13:19:40 webserver2 kernel: [790358.453238]  ? 
__alloc_fd+0x44/0x170
Oct 27 13:19:40 webserver2 kernel: [790358.453240]  ? 
do_sys_open+0x12e/0x210
Oct 27 13:19:40 webserver2 kernel: [790358.453241]  ? 
do_sys_open+0x12e/0x210
Oct 27 13:19:40 webserver2 kernel: [790358.453243]  ? 
entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:19:40 webserver2 kernel: [790358.455597] php-fpm7.0  D
0  8662   8654 0x
Oct 27 13:19:40 webserver2 kernel: [790358.455624] Call Trace:
Oct 27 13:19:40 webserver2 kernel: [790358.455628]  ? 
__schedule+0x3c8/0x860
Oct 27 13:19:40 webserver2 kernel: [790358.455630]  ? schedule+0x32/0x80
Oct 27 13:19:40 webserver2 kernel: [790358.455632]  ? 
rwsem_down_write_failed+0x232/0x410
Oct 27 13:19:40 webserver2 kernel: [790358.455634]  ? dput+0x2f/0x1f0
Oct 27 13:19:40 webserver2 kernel: [790358.455637]  ? 
call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: [790358.455639]  ? 
call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: 
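
The traces above show php-fpm workers blocked in D state on an rwsem inside path_openat. A hedged sketch for dumping more blocked-task traces on demand (the sysrq write needs root and CONFIG_MAGIC_SYSRQ; the unprivileged fallback just lists D-state processes):

```shell
#!/bin/sh
# Sketch: ask the kernel to log stack traces of all blocked (D-state)
# tasks, like the php-fpm traces above, without waiting for the
# periodic hung-task detector.
if [ -w /proc/sysrq-trigger ]; then
    { echo w > /proc/sysrq-trigger; } 2>/dev/null  # 'w' = dump blocked tasks
    dmesg 2>/dev/null | tail -n 40                 # show the fresh traces
else
    # Unprivileged fallback: list tasks currently in uninterruptible sleep.
    ps -eo pid,stat,comm | awk 'NR > 1 && $2 ~ /^D/ {print}'
fi
```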

Re: [Ocfs2-users] OCFS2 KVM Crashes Yet Again !

2017-09-29 Thread Gang He
Hello netbsd,

Could you identify a way to trigger this crash in a normal ocfs2 
cluster? E.g. reproduction steps, or a shell script.
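
As a starting point, here is a hypothetical reproducer sketch pieced together from the workload netbsd describes elsewhere in the thread: an rsync of many small files between two ocfs2 mounts on one node. /mnt/s1 and /mnt/s2 come from the report; the file count and the 4 KiB file size are placeholder assumptions.

```shell
#!/bin/sh
# Hypothetical reproducer: populate one ocfs2 mount with many small
# files, then rsync it to a second ocfs2 mount on the same node.
# NFILES and the per-file size are guesses, not from the report.
SRC=${1:-/mnt/s1}
DST=${2:-/mnt/s2}
NFILES=${NFILES:-100000}

populate() {
    # Create $NFILES 4 KiB files under directory $1.
    mkdir -p "$1"
    i=0
    while [ "$i" -lt "$NFILES" ]; do
        head -c 4096 /dev/urandom > "$1/f$i"
        i=$((i + 1))
    done
}

if [ -d "$SRC" ] && [ -w "$SRC" ]; then
    populate "$SRC/reproduce.$$"
    rsync -av --numeric-ids --delete "$SRC/" "$DST/"   # command from the report
else
    echo "source mount $SRC not available; nothing to do"
fi
```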

Thanks
Gang


> Hello,
> 
> Find the full log below:
> 
> https://paste.ubuntu.com/25625787/
> 
> VM was restarted at 9:27 and no problem since then. We are rsyncing 
> about 2TB data (a lot of small files) between 2 OCFS shares on the same 
> vm:
> 
> 
> /dev/vdc  4.8T  2.8T  2.1T  58% /mnt/s1
> /dev/vdf  4.8T  985G  3.9T  21% /mnt/s2
> 
> rsync -av --numeric-ids --delete /mnt/s1/ /mnt/s2/
> 
> 
> On 2017-09-27 10:53, Gang He wrote:
>> Hello netbsd,
>> 
>> The ocfs2 project is still being developed by us (SUSE, Huawei,
>> Oracle, H3C, etc.).
>> If you encounter a problem, please send mail to the ocfs2-devel
>> mailing list; we usually watch that list for ocfs2 kernel-related issues.
>> 
>> 
>> 
>> 
> 
>>> Hello All,
>>> 
>>> I wrote earlier about our OCFS2 crash issue in KVM due to a bug in
>>> the SMP code.
>>> 
>>> For this we come up with a solution:
>>> 
>>> Instead of using multiple vcpus
>>>8
>>> 
>>> using a single one and multiple cores instead:
>>>  
>>> 
>>> And applying key tune options to sysctl.conf:
>>> 
>>> vm.min_free_kbytes=131072
>>> vm.zone_reclaim_mode=1
>>> 
>>> This seemed to help: the fs did not crash right away when we were
>>> hammering it with apache benchmarks with 1 requests. However, last
>>> night I started a large rsync operation from a 5TB OCFS2 FS mounted in
>>> the VM to another OCFS2 FS mounted in the same VM and ended up with:
>>> 
>>> https://ibb.co/gFeGg5
>> From the kernel crash backtrace, the problem appears to be that
>> acquiring a spin_lock takes so long that it triggers an NMI watchdog
>> interrupt.
>> Could you give detailed reproduction steps? We want to reproduce this
>> issue locally, then try to fix it.
>> 
>> 
>> Thanks
>> Gang
>> 
>>> 
>>> After trying a lot of different kernels starting from the 3.x series,
>>> we are now using the latest 4.13.2 kernel with the default
>>> configuration, but these issues are still present. Is the OCFS2
>>> project still being developed? With this crashing and unreliability it
>>> cannot be used in production unless you put in place a bunch of
>>> safeguards to reset the whole virtual machine when it crashes.
>>> 
>>> Thanks
>>> 


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] OCFS2 KVM Crashes Yet Again !

2017-09-27 Thread netbsd
Hello,

Find the full log below:

https://paste.ubuntu.com/25625787/

VM was restarted at 9:27 and no problem since then. We are rsyncing 
about 2TB data (a lot of small files) between 2 OCFS shares on the same 
vm:


/dev/vdc  4.8T  2.8T  2.1T  58% /mnt/s1
/dev/vdf  4.8T  985G  3.9T  21% /mnt/s2

rsync -av --numeric-ids --delete /mnt/s1/ /mnt/s2/


On 2017-09-27 10:53, Gang He wrote:
> Hello netbsd,
> 
> The ocfs2 project is still being developed by us (SUSE, Huawei,
> Oracle, H3C, etc.).
> If you encounter a problem, please send mail to the ocfs2-devel
> mailing list; we usually watch that list for ocfs2 kernel-related issues.
> 
> 
> 
> 
 
>> Hello All,
>> 
>> I wrote earlier about our OCFS2 crash issue in KVM due to a bug in
>> the SMP code.
>> 
>> For this we come up with a solution:
>> 
>> Instead of using multiple vcpus
>>8
>> 
>> using a single one and multiple cores instead:
>>  
>> 
>> And applying key tune options to sysctl.conf:
>> 
>> vm.min_free_kbytes=131072
>> vm.zone_reclaim_mode=1
>> 
>> This seemed to help: the fs did not crash right away when we were
>> hammering it with apache benchmarks with 1 requests. However, last
>> night I started a large rsync operation from a 5TB OCFS2 FS mounted in
>> the VM to another OCFS2 FS mounted in the same VM and ended up with:
>> 
>> https://ibb.co/gFeGg5
> From the kernel crash backtrace, the problem appears to be that
> acquiring a spin_lock takes so long that it triggers an NMI watchdog
> interrupt.
> Could you give detailed reproduction steps? We want to reproduce this
> issue locally, then try to fix it.
> 
> 
> Thanks
> Gang
> 
>> 
>> After trying a lot of different kernels starting from the 3.x series,
>> we are now using the latest 4.13.2 kernel with the default
>> configuration, but these issues are still present. Is the OCFS2
>> project still being developed? With this crashing and unreliability it
>> cannot be used in production unless you put in place a bunch of
>> safeguards to reset the whole virtual machine when it crashes.
>> 
>> Thanks
>> 

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] OCFS2 KVM Crashes Yet Again !

2017-09-27 Thread Gang He
Hello netbsd,

The ocfs2 project is still being developed by us (SUSE, Huawei, Oracle, 
H3C, etc.).
If you encounter a problem, please send mail to the ocfs2-devel mailing 
list; we usually watch that list for ocfs2 kernel-related issues.




> Hello All,
> 
> I wrote earlier about our OCFS2 crash issue in KVM due to a bug in the 
> SMP code.
> 
> For this we come up with a solution:
> 
> Instead of using multiple vcpus
>8
> 
> using a single one and multiple cores instead:
>  
> 
> And applying key tune options to sysctl.conf:
> 
> vm.min_free_kbytes=131072
> vm.zone_reclaim_mode=1
> 
> This seemed to help: the fs did not crash right away when we were 
> hammering it with apache benchmarks with 1 requests. However, last 
> night I started a large rsync operation from a 5TB OCFS2 FS mounted in 
> the VM to another OCFS2 FS mounted in the same VM and ended up with:
> 
> https://ibb.co/gFeGg5
From the kernel crash backtrace, the problem appears to be that 
acquiring a spin_lock takes so long that it triggers an NMI watchdog 
interrupt.
Could you give detailed reproduction steps? We want to reproduce this 
issue locally, then try to fix it.


Thanks
Gang 

> 
> After trying a lot of different kernels starting from the 3.x series, 
> we are now using the latest 4.13.2 kernel with the default 
> configuration, but these issues are still present. Is the OCFS2 
> project still being developed? With this crashing and unreliability it 
> cannot be used in production unless you put in place a bunch of 
> safeguards to reset the whole virtual machine when it crashes.
> 
> Thanks
> 


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] OCFS2 KVM Crashes Yet Again !

2017-09-27 Thread Changwei Ge
Hi,

Could you please paste the crash backtrace?

On 2017/9/27 16:15, net...@tango.lu wrote:
> Hello All,
> 
> I wrote earlier about our OCFS2 crash issue in KVM due to a bug in the
> SMP code.
> 
> For this we come up with a solution:
> 
> Instead of using multiple vcpus
> 8
> 
> using a single one and multiple cores instead:
>   
> 
> And applying key tune options to sysctl.conf:
> 
> vm.min_free_kbytes=131072
> vm.zone_reclaim_mode=1
> 
> This seemed to help: the fs did not crash right away when we were
> hammering it with apache benchmarks with 1 requests. However, last
> night I started a large rsync operation from a 5TB OCFS2 FS mounted in
> the VM to another OCFS2 FS mounted in the same VM and ended up with:
> 
> https://ibb.co/gFeGg5
> 
> After trying a lot of different kernels starting from the 3.x series,
> we are now using the latest 4.13.2 kernel with the default configuration,
> but these issues are still present. Is the OCFS2 project still being developed?
I admit that the development group has not been very active recently.

> With this crashing and unreliability it cannot be used in production
> unless you put in place a bunch of safeguards to reset the whole
> virtual machine when it crashes.
> 
> Thanks
> 


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users