Re: [robinhood-support] Robinhood 3.0.1 rebooting unexpectedly on initial scan with Lustre 2.8.0.8

2017-08-23 Thread LEIBOVICI Thomas
If you are getting kernel crashes, a system dump, the system console, or the 
system logs would give more information about the root cause than tracing 
a userland process.
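
As a rough illustration of what to look for in those logs, here is a minimal Python sketch that filters saved console or syslog text for common crash signatures. The pattern list is an assumption, not an exhaustive or Lustre-specific set, and `find_crash_lines` is a hypothetical helper name, not part of robinhood or Lustre.

```python
import re

# Assumed crash signatures; adjust for the kernel and Lustre version in use.
CRASH_PATTERNS = re.compile(
    r"Kernel panic|BUG:|Oops|general protection fault|LBUG|LustreError"
)


def find_crash_lines(log_text):
    """Return (line_number, line) pairs that match an assumed crash signature."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if CRASH_PATTERNS.search(line):
            hits.append((number, line))
    return hits
```

Run it on the text of the system log saved from the boot that crashed (or on a serial console capture); the lines just before the first hit are usually the most informative.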


On 08/23/17 16:55, Mervini, Joseph A wrote:
I just re-ran the scan using strace -f and this is the tail of the 
output up until the machine rebooted:


2017/08/23 08:39:44 [13249/7] ListMgr | SQL query: COMMIT
[pid 13254] <... futex resumed> )   = -1 EAGAIN (Resource temporarily unavailable)
[pid 13267] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL
[pid 13263] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL
[pid 13271] <... ioctl resumed> , 0x7f28351b9490) = 0
[pid 13257] <... write resumed> )   = 58
[pid 13254] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL
[pid 13271] newfstatat(10, "ompi-f77.pc",
[pid 13257] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1
[pid 13254] <... futex resumed> )   = -1 EAGAIN (Resource temporarily unavailable)
[pid 13270] <... openat resumed> )  = 12
[pid 13257] <... futex resumed> )   = 1
[pid 13255] <... futex resumed> )   = 0
[pid 13254] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1
[pid 13270] ioctl(12, _IOC(_IOC_READ, 0x66, 0xad, 0x08), 0x7f284190a680
[pid 13257] poll([{fd=15, events=POLLIN|POLLPRI}], 1, 0
[pid 13255] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1
[pid 13270] <... ioctl resumed> )   = 0
[pid 13257] <... poll resumed> )   = 0 (Timeout)
[pid 13255] <... futex resumed> )   = 1
[pid 13254] <... futex resumed> )   = 1
[pid 13253] <... futex resumed> )   = 0
[pid 13268] <... futex resumed> )   = 0
[pid 13270] close(12
[pid 13257] write(15, "\7\0\0\0\3COMMIT", 11
[pid 13255] write(2, "2017/08/23 08:39:44 [13249/6] Li"..., 108
2017/08/23 08:39:44 [13249/6] ListMgr | SQL query: SELECT id FROM ENTRIES WHERE id='0x20405:0xd5b0:0x0'
[pid 13268] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL
[pid 13253] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL
[pid 13257] <... write resumed> )   = 11
[pid 13255] <... write resumed> )   = 108
[pid 13254] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL
[pid 13271] <... newfstatat resumed> {st_mode=S_IFREG|0640, st_size=665, ...}, AT_SYMLINK_NOFOLLOW) = 0
[pid 13257] read(15,
[pid 13255] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1
[pid 13254] <... futex resumed> )   = -1 EAGAIN (Resource temporarily unavailable)
[pid 13271] openat(10, "ompi-f77.pc", O_RDONLY|O_NONBLOCK|O_NOFOLLOW|O_NOATIME
[pid 13266] <... futex resumed> )   = 0
[pid 13255] <... futex resumed> )   = 1
[pid 13254] write(2, "2017/08/23 08:39:44 [13249/5] Li"..., 108
2017/08/23 08:39:44 [13249/5] ListMgr | SQL query: SELECT id FROM ENTRIES WHERE id='0x20405:0xcc68:0x0'
[pid 13266] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL
[pid 13255] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL
[pid 13254] <... write resumed> )   = 108
[pid 13254] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 13270] <... close resumed> )   = 0
[pid 13259] <... futex resumed> )   = 0
Write failed: Broken pipe

NOTE: I captured the entire output from strace in a screen log.
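
For a captured strace -f log like the one above, it can help to list which system call each thread had started but not yet finished when the trace stops. The following is a minimal Python sketch under stated assumptions: each record is on one line and starts with a `[pid N]` prefix, a completed call ends with `) = value`, and a later `<... name resumed>` line marks completion. `in_flight_calls` is a hypothetical helper name; continuation lines from mail wrapping are simply ignored.

```python
import re

# "[pid N] rest" prefix that strace -f puts on each record.
PID_LINE = re.compile(r"^\[pid\s+(\d+)\]\s+(.*)$")
# "<... name resumed>" marks the completion of an earlier call.
RESUMED = re.compile(r"^<\.\.\. (\w+) resumed>")
# "name(" marks the start of a call.
CALL = re.compile(r"^(\w+)\(")
# A return value on the same line means the call finished immediately.
RETURNED = re.compile(r"\)\s+= ")


def in_flight_calls(strace_text):
    """Map each pid to the syscall it had started but not finished at end of trace."""
    pending = {}
    for line in strace_text.splitlines():
        match = PID_LINE.match(line.strip())
        if not match:
            continue  # log output or a wrapped continuation line
        pid, rest = int(match.group(1)), match.group(2)
        if RESUMED.match(rest):
            pending.pop(pid, None)
            continue
        call = CALL.match(rest)
        if call:
            if RETURNED.search(rest):
                pending.pop(pid, None)
            else:
                pending[pid] = call.group(1)
    return pending
```

The result only shows what userland was waiting on at the last captured line; it does not by itself identify the cause of a kernel crash.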



Joe Mervini
Sandia National Laboratories
High Performance Computing
505.844.6770
jame...@sandia.gov 



On Aug 22, 2017, at 9:17 AM, Mervini, Joseph A wrote:


Hello,

I have not received a response to this posting but have continued to 
try and figure out why this problem persists.


Since I initially opened the request I have been able to duplicate it 
on three different machines. I have also tried multiple kernel 
versions and lustre 2.8 client versions. I have also completely 
rebuilt my lustre file system with a newer version of 2.8 (2.8.0.9). 
The problem only presents itself when running a scan against the 2.8 
lustre file system. A 2.5 lustre file system works fine.


I decided to run the simultaneous scan against both file systems 
using valgrind. Although (again) the scan of the 2.5 file system 
completed, the system rebooted before the scan of the 2.8 file system 
could complete. However, in both scans valgrind's output was similar, 
with messages like this:


==8883== Thread 7:
==8883== Conditional jump or move depends on uninitialised value(s)
==8883==    at 0x44B8E8: batch_insert_stripe_info (listmgr_stripe.c:168)
==8883==    by 0x44B1F1: listmgr_batch_insert_no_tx (listmgr_insert.c:246)
==8883==    by 0x44B35D: ListMgr_Insert (listmgr_insert.c:274)
==8883==    by 0x4199E4: EntryProc_db_apply (std_pipeline.c:1616)
==8883==    by 0x415B5D: entry_proc_worker_thr (entry_proc_impl.c:145)
==8883==    by 0x60B2DC4: start_thread (in /usr/lib64/libpthread-2.17.so)
==8883==    by 0x6B0D76C: clone (in /usr/lib64/libc-2.17.so)
==8883==
==8883== Conditional jump or move depends on uninitialised value(s)
==8883==    at 0x44B9C5: batch_insert_stripe_info (listmgr_stripe.c:168)
==8883==    by 0x44B1F1: listmgr_batch_insert_no_tx 

[robinhood-support] Robinhood 3.0.1 rebooting unexpectedly on initial scan with Lustre 2.8.0.8

2017-07-11 Thread Mervini, Joseph A
Hello,

I have a problem similar to 
https://sourceforge.net/p/robinhood/mailman/message/35883907/ in which the 
robinhood server, running mariadb-5.5.52-1.el7.x86_64 and the lustre 2.8.0.8 
client, reboots when the initial scan is run. I am running this in a testbed 
environment prior to deployment on our production system because I want to get 
a complete handle on it before I commit to the deployment. I have 2 separate 
lustre file systems that I am running against: one is a 408TB lustre 2.8 file 
system with ~16M inodes, the other is a 204TB lustre 2.5.5 file system with ~3M 
inodes.

The curious thing is that I had successfully scanned both file systems 
independently on the system with everything working (including web-gui) and 
then basically blew away the databases to get a datapoint on how the system 
performed and the time it took if I ran a scan on both file systems 
simultaneously. It appears that it is only impacting the 2.8 file system 
database. I just ran a fresh scan against the 2.5.5 file system without 
problem. I then started a new scan against the 2.8 file system and once again 
it rebooted.

Like the other support ticket above, when I ran the scan only on the 2.8 file 
system in debug mode it also reported messages similar to “2017/07/10 15:44:58 
[15191/6] FS_Scan | openat failed on