Re: [lustre-discuss] lustre 2.10.5 or 2.11.0

2018-10-30 Thread Riccardo Veraldi

Sorry for replying late; I answered in-line.

On 10/21/18 6:00 AM, Andreas Dilger wrote:

It would be useful to post information like this on wiki.lustre.org so they can 
be found more easily by others.  There are already some ZFS tunings there (I 
don't have the URL handy, just on a plane), so it might be useful to include 
some information about the hardware and workload to give context to what this 
is tuned for.

Even more interesting would be to see if there is a general set of tunings that 
people agree should be made the default?  It is even better when new users 
don't have to seek out the various tuning parameters, and instead get good 
performance out of the box.

A few comments inline...

On Oct 19, 2018, at 17:52, Riccardo Veraldi wrote:

On 10/19/18 12:37 PM, Mohr Jr, Richard Frank (Rick Mohr) wrote:

On Oct 17, 2018, at 7:30 PM, Riccardo Veraldi wrote:

Anyway, especially for the OSSes, you may eventually need to tune some ZFS module 
parameters, in particular increasing the vdev_write and vdev_read max values above 
their defaults. You may also disable the ZIL, change redundant_metadata to "most", 
and set atime off.

I could send you a list of parameters that in my case work well.

Riccardo,

Would you mind sharing your ZFS parameters with the mailing list?  I would be 
interested to see which options you have changed.


This is what worked for me on my high-performance cluster:

options zfs zfs_prefetch_disable=1

This matches what I've seen in the past - at high bandwidth under concurrent 
client load the prefetched data on the server is lost, and just causes needless 
disk IO that is discarded.
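
One way to sanity-check this on a given OSS is to look at the prefetch hit/miss 
counters that ZFS on Linux exposes, for example:

grep prefetch /proc/spl/kstat/zfs/arcstats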


options zfs zfs_txg_history=120
options zfs metaslab_debug_unload=1
#
options zfs zfs_vdev_scheduler=deadline
options zfs zfs_vdev_async_write_active_min_dirty_percent=20
#
options zfs zfs_vdev_scrub_min_active=48
options zfs zfs_vdev_scrub_max_active=128
#
options zfs zfs_vdev_sync_write_min_active=8
options zfs zfs_vdev_sync_write_max_active=32
options zfs zfs_vdev_sync_read_min_active=8
options zfs zfs_vdev_sync_read_max_active=32
options zfs zfs_vdev_async_read_min_active=8
options zfs zfs_vdev_async_read_max_active=32
options zfs zfs_top_maxinflight=320
options zfs zfs_txg_timeout=30

This is interesting.  Is this actually setting the maximum TXG age up to 30s?


yes, I think the default is 5 seconds.
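
For reference, on ZFS on Linux the value can be checked or changed at runtime 
through the module parameter, for example:

cat /sys/module/zfs/parameters/zfs_txg_timeout
echo 30 > /sys/module/zfs/parameters/zfs_txg_timeout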





options zfs zfs_dirty_data_max_percent=40
options zfs zfs_vdev_async_write_min_active=8
options zfs zfs_vdev_async_write_max_active=32

##

These are the ZFS attributes that I changed on the OSSes:

zfs set mountpoint=none $ostpool
zfs set sync=disabled $ostpool
zfs set atime=off $ostpool
zfs set redundant_metadata=most $ostpool
zfs set xattr=sa $ostpool
zfs set recordsize=1M $ostpool

The recordsize=1M is already the default for Lustre OSTs.

Did you disable multimount, or just not include it here?  That is fairly
important for any multi-homed ZFS storage, to prevent multiple imports.
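
If this refers to the ZFS multihost (MMP) protection, a minimal sketch for a 
multi-homed pool would be (assuming ZFS 0.7 or later and a unique /etc/hostid on 
each server, with $ostpool as above):

zpool set multihost=on $ostpool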


#


These are the ko2iblnd parameters for FDR Mellanox IB interfaces:

options ko2iblnd timeout=100 peer_credits=63 credits=2560 concurrent_sends=63 
ntx=2048 fmr_pool_size=1280 fmr_flush_trigger=1024 ntx=5120

You have ntx= in there twice...


Yes, it is a mistake; I typed it twice.
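
For reference, the corrected line would keep a single ntx value, presumably 
something like (assuming ntx=2048 was the intended value):

options ko2iblnd timeout=100 peer_credits=63 credits=2560 concurrent_sends=63 ntx=2048 fmr_pool_size=1280 fmr_flush_trigger=1024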





If this provides a significant improvement for FDR, it might make sense to add 
machinery to lustre/conf/{ko2iblnd-probe,ko2iblnd.conf} so that a new alias 
"ko2iblnd-fdr" sets these values on Mellanox FDR IB cards by default?


I found it works better with FDR.

Anyway, most of the tunings I did were picked up here and there from what other 
people have done, mostly from these sources:


 * https://lustre.ornl.gov/lustre101-courses/content/C1/L5/LustreTuning.pdf
 * https://www.eofs.eu/_media/events/lad15/15_chris_horn_lad_2015_lnet.pdf
 * https://lustre.ornl.gov/ecosystem-2015/documents/LustreEco2015-Tutorial2.pdf

And by the way, the most effective tweaks came after reading Rick Mohr's 
advice in LustreTuning.pdf. Thanks, Rick!







These are the ksocklnd parameters:

options ksocklnd sock_timeout=100 credits=2560 peer_credits=63

##

These are other parameters that I tweaked:

echo 32 > /sys/module/ptlrpc/parameters/max_ptlrpcds
echo 3 > /sys/module/ptlrpc/parameters/ptlrpcd_bind_policy

This parameter is marked as obsolete in the code.


Yes, I should fix my configuration and use the new parameters.
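
To see which ptlrpc module parameters a given build actually supports, something 
like the following should work (generic module-inspection commands, not specific 
to any Lustre release):

modinfo ptlrpc | grep parm
ls /sys/module/ptlrpc/parameters/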



lctl set_param timeout=600
lctl set_param ldlm_timeout=200
lctl set_param at_min=250
lctl set_param at_max=600
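
These settings are not persistent across a reboot; on Lustre 2.5 and later the -P 
option to lctl set_param (run on the MGS) should make them permanent, for example:

lctl set_param -P timeout=600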

###

Also, I run this script at boot time to redistribute the IRQ assignments of the 
hard drives across all CPUs; it is not needed for kernels > 4.4.

#!/bin/sh
# numa_smp.sh - spread the IRQs of a given device across a range of CPUs
# usage: numa_smp.sh <device> <first_cpu> <last_cpu>
device=$1
cpu1=$2
cpu2=$3
cpu=$cpu1
grep $device /proc/interrupts | awk '{print $1}' | sed 's/://' | while read int
do
  # pin this IRQ to the current CPU, then advance (wrapping back to cpu1)
  echo $cpu > /proc/irq/$int/smp_affinity_list
  echo "echo CPU $cpu > /proc/irq/$int/smp_affinity_list"
  if [ $cpu = $cpu2 ]
  then
    cpu=$cpu1
  else
    cpu=$((cpu+1))
  fi
done
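
Usage would be along the lines of the following, where the device string and the 
CPU range are only examples:

./numa_smp.sh mpt3sas 0 11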

Cheers, Andreas
---
Andreas 

Re: [lustre-discuss] dd oflag=direct error (512 byte Direct I/O)

2018-10-30 Thread Andreas Dilger
I would totally be fine with that, so long as it works in a reasonable manner.

In theory, it would even be possible to do sub-page uncached writes from the 
client, and have the OSS handle the read-modify-write of a single page.  That 
would need some help from the CLIO layer to send the small write directly to 
the OST without going through the page cache (also invalidating any overlapping 
page from the client cache) and LNet handling the misaligned RDMA properly.

We used to allow misaligned RDMA with the old liblustre client because it 
didn't ever have any cache, but not with the Linux client.  It _might_ be 
possible to do without major surgery on the servers, and might even speed up 
sub-page random writes.  This would avoid the need to read a whole page over to 
the client just to overwrite part of it and send it back, and also avoid 
contending on DLM write locks for non-overlapping regions since the sub-page 
writes could be sent lockless from the client and the DLM locking and 
page-aligned IO would be handled on the OSS (that is already in the protocol).

That said, this is definitely more in your area of expertise Patrick (and 
Jinshan, CC'd).

Cheers, Andreas

> On Oct 30, 2018, at 09:10, Patrick Farrell  wrote:
> 
> Andreas,
>  
> An interesting thought on this, as the same limitation came up recently in 
> discussions with a Cray customer.  Strictly honoring the direct I/O 
> expectations around data copying is apparently optional.  GPFS is a notable 
> example – It allows non page-aligned/page-size direct I/O, but it apparently 
> (This is second hand from a GPFS knowledgeable person, so take with a grain 
> of salt) uses the buffered path (data copy, page cache, etc) and flushes it, 
> O_SYNC style.  My understanding from conversations is this is the general 
> approach taken by file systems that support unaligned direct I/O – they cheat 
> a little and do buffered I/O in those cases.
>  
> So rather than refusing to perform unaligned direct I/O, we could emulate the 
> approach taken by (some) other file systems.  There’s no clear standard here, 
> but this is an option others have taken that might improve the user 
> experience.  (I believe we persuaded our particular user to switch their code 
> away from direct I/O, since they had no real reason to be using it.)
>  
>   • Patrick
>  
> From: lustre-discuss  on behalf of 
> 김형근 
> Date: Sunday, October 28, 2018 at 11:40 PM
> To: Andreas Dilger 
> Cc: "lustre-discuss@lists.lustre.org" 
> Subject: Re: [lustre-discuss] dd oflag=direct error (512 byte Direct I/O)
>  
> The software I use is Red Hat Virtualization. When using a POSIX-compatible FS, 
> it seems to perform direct I/O with a block size of 256512 bytes.
>  
> If I can't resolve the issue with my storage configuration, I will contact 
> RedHat.
>  
> Your answer was very helpful.
> Thank you.
>  
>  
>  
>  
>  
> From: Andreas Dilger
> To: 김형근
> Cc: lustre-discuss@lists.lustre.org
> Date: 2018-10-25 16:47:58
> Subject: Re: [lustre-discuss] dd oflag=direct error (512 byte Direct I/O)
>  
>  
>  
> On Oct 25, 2018, at 15:05, 김형근
> wrote: 
> > 
> > Hi. 
> > It's a pleasure to meet you, the lustre specialists. 
> > (I do not speak English well ... Thank you for your understanding!) 
> 
> Your english is better than my Korean. :-) 
> 
> > I used the dd command on a Lustre mount point (using the oflag=direct option).
> > 
> >  
> > dd if=/dev/zero of=/mnt/testfile oflag=direct bs=512 count=1
> >  
> > 
> > I need direct I/O with a 512-byte block size.
> > This is a required check function on the software I use. 
> 
> What software is it? Is it possible to change the application to use 
> 4096-byte alignment? 
> 
> > But unfortunately, If the direct option is present, 
> > bs must be a multiple of 4K (4096) (for 8K, 12K, 256K, 1M, 8M, etc.) for 
> > operation. 
> > For example, if you enter a value such as 512 or 4095, it will not work. 
> > The error message is as follows. 
> > 
> > 'error message: dd: error writing [filename]: invalid argument' 
> > 
> > My test system is all up to date. (RHEL, lustre-server, client) 
> > I have used both ldiskfs and zfs for backfile systems. The result is same. 
> > 
> > 
> > My question is simply two. 
> > 
> > 1. Why does DirectIO work only in 4k multiples block size? 
> 
> The client PAGE_SIZE on an x86 system is 4096 bytes. The Lustre client 
> cannot cache data smaller than PAGE_SIZE, so the current implementation 
> is limited to have O_DIRECT read/write being a multiple of PAGE_SIZE. 
> 
> I think the same would happen if you try to use O_DIRECT on a disk with 
> 4096-byte native sector drive 
> (https://en.wikipedia.org/w/index.php?title=Advanced_Format&section=5#4K_native)?
> 
> > 2. Can I change the settings of the server and client to enable 512bytes of 

Re: [lustre-discuss] dd oflag=direct error (512 byte Direct I/O)

2018-10-30 Thread Patrick Farrell
Andreas,

An interesting thought on this, as the same limitation came up recently in 
discussions with a Cray customer.  Strictly honoring the direct I/O 
expectations around data copying is apparently optional.  GPFS is a notable 
example – It allows non page-aligned/page-size direct I/O, but it apparently 
(This is second hand from a GPFS knowledgeable person, so take with a grain of 
salt) uses the buffered path (data copy, page cache, etc) and flushes it, 
O_SYNC style.  My understanding from conversations is this is the general 
approach taken by file systems that support unaligned direct I/O – they cheat a 
little and do buffered I/O in those cases.

So rather than refusing to perform unaligned direct I/O, we could emulate the 
approach taken by (some) other file systems.  There’s no clear standard here, 
but this is an option others have taken that might improve the user experience. 
 (I believe we persuaded our particular user to switch their code away from 
direct I/O, since they had no real reason to be using it.)


  *   Patrick

From: lustre-discuss  on behalf of 김형근 

Date: Sunday, October 28, 2018 at 11:40 PM
To: Andreas Dilger 
Cc: "lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] dd oflag=direct error (512 byte Direct I/O)


The software I use is Red Hat Virtualization. When using a POSIX-compatible FS, it 
seems to perform direct I/O with a block size of 256512 bytes.



If I can't resolve the issue with my storage configuration, I will contact 
RedHat.



Your answer was very helpful.

Thank you.


From: Andreas Dilger
To: 김형근
Cc: lustre-discuss@lists.lustre.org
Date: 2018-10-25 16:47:58
Subject: Re: [lustre-discuss] dd oflag=direct error (512 byte Direct I/O)

On Oct 25, 2018, at 15:05, 김형근
wrote:
>
> Hi.
> It's a pleasure to meet you, the lustre specialists.
> (I do not speak English well ... Thank you for your understanding!)

Your english is better than my Korean. :-)

> I used the dd command on a Lustre mount point (using the oflag=direct option).
>
> 
> dd if=/dev/zero of=/mnt/testfile oflag=direct bs=512 count=1
> 
>
> I need direct I/O with a 512-byte block size.
> This is a required check function on the software I use.

What software is it? Is it possible to change the application to use
4096-byte alignment?

> But unfortunately, If the direct option is present,
> bs must be a multiple of 4K (4096) (for 8K, 12K, 256K, 1M, 8M, etc.) for 
> operation.
> For example, if you enter a value such as 512 or 4095, it will not work. The 
> error message is as follows.
>
> 'error message: dd: error writing [filename]: invalid argument'
>
> My test system is all up to date. (RHEL, lustre-server, client)
> I have used both ldiskfs and zfs for backfile systems. The result is same.
>
>
> My question is simply two.
>
> 1. Why does DirectIO work only in 4k multiples block size?

The client PAGE_SIZE on an x86 system is 4096 bytes. The Lustre client
cannot cache data smaller than PAGE_SIZE, so the current implementation
is limited to have O_DIRECT read/write being a multiple of PAGE_SIZE.
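
So on an x86 client an aligned transfer such as the following should work, while 
bs=512 fails with EINVAL (the path is just an example):

dd if=/dev/zero of=/mnt/testfile oflag=direct bs=4096 count=1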

I think the same would happen if you try to use O_DIRECT on a disk with
4096-byte native sector drive 
(https://en.wikipedia.org/w/index.php?title=Advanced_Format&section=5#4K_native)?

> 2. Can I change the settings of the server and client to enable 512bytes of 
> DirectIO?

This would not be possible without changing the Lustre client code.
I don't know how easily this is possible to do and still ensure that
the 512-byte writes are handled correctly.

So far we have not had other requests to change this limitation, so
it is not a high priority to change on our side, especially since
applications will have to deal with 4096-byte sectors in any case.

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud








___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre OSS kernel panic after mounting OSTs

2018-10-30 Thread Riccardo Veraldi
Thank you Fernando for the hint, I did it just now. I am running e2fsck again.

Anyway my problem was this:

https://jira.whamcloud.com/browse/LU-5040

thank you

On 10/30/18 5:28 AM, Fernando Perez wrote:

Dear Riccardo.

Have you tried upgrading the e2fsprogs packages before performing the e2fsck?

Regards.

=
Fernando Pérez
Institut de Ciències del Mar (CSIC)
Departament Oceanografía Física i Tecnològica
Passeig Marítim de la Barceloneta,37-49
08003 Barcelona
Phone:  (+34) 93 230 96 35
=

On 10/30/2018 01:05 PM, Riccardo Veraldi wrote:

Hello,

I have quite a critical problem.

One of my OSSes hangs in a kernel panic when trying to mount the OSTs.

After mounting 11 OSTs out of the 12 total OSTs it goes into a kernel panic.
It does not matter in which order they are mounted.


Any clues or hints?

I cannot really recover it and I have important data on it.

I already performed an e2fsck. Anyway, it did not fix the problem; it had
found a few inode count inconsistencies before.


kernel is 2.6.32-431.23.3.el6_lustre.x86_64

Red Hat Enterprise Linux Server release 6.7 (Santiago)

lustre-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64.x86_64


Oct 30 04:58:52 psanaoss231 kernel: INFO: task tgt_recov:4569 blocked 
for more than 120 seconds.


Oct 30 04:58:52 psanaoss231 kernel:  Not tainted 
2.6.32-431.23.3.el6_lustre.x86_64 #1
Oct 30 04:58:52 psanaoss231 kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 30 04:58:52 psanaoss231 kernel: tgt_recov D 
0003 0  4569  2 0x0080
Oct 30 04:58:52 psanaoss231 kernel: 880bf2ae1da0 0046 
 0003
Oct 30 04:58:52 psanaoss231 kernel: 880bf2ae1d30 81059096 
880bf2ae1d40 880bf2a1d500
Oct 30 04:58:52 psanaoss231 kernel: 880bf2b01ab8 880bf2ae1fd8 
fbc8 880bf2b01ab8

Oct 30 04:58:52 psanaoss231 kernel: Call Trace:
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
enqueue_task+0x66/0x80
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
check_for_clients+0x0/0x70 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [] 
target_recovery_overseer+0x9d/0x230 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
exp_connect_healthy+0x0/0x20 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
autoremove_wake_function+0x0/0x40
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
target_recovery_thread+0x0/0x1920 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [] 
target_recovery_thread+0x540/0x1920 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
default_wake_function+0x12/0x20
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
target_recovery_thread+0x0/0x1920 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [] 
kthread+0x96/0xa0
Oct 30 04:58:52 psanaoss231 kernel: [] 
child_rip+0xa/0x20
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
kthread+0x0/0xa0
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
child_rip+0x0/0x20
Oct 30 04:59:02 psanaoss231 kernel: Lustre: ana13-OST0004: Recovery 
over after 3:05, of 147 clients 146 recovered and 1 was evicted.
Oct 30 04:59:03 psanaoss231 kernel: Lustre: ana13-OST0004: Client 
89ba817f-45c3-5e64-99a8-b472651bbe45 (at 172.21.52.213@o2ib) 
reconnecting
Oct 30 04:59:03 psanaoss231 kernel: Lustre: Skipped 94 previous 
similar messages
Oct 30 04:59:21 psanaoss231 kernel: LustreError: 
4569:0:(ost_handler.c:1123:ost_brw_write()) Dropping timed-out write 
from 12345-172.21.49.129@tcp because locking object 0x0:14198730 took 
153 seconds (limit was 30).
Oct 30 04:59:21 psanaoss231 kernel: Lustre: ana13-OST0005: Bulk IO 
write error with 3a71df2f-16e7-d507-2495-ab60364d8e7c (at 
172.21.49.129@tcp), client will retry: rc -110

Oct 30 04:59:52 psanaoss231 kernel: [ cut here ]
Oct 30 04:59:52 psanaoss231 kernel: kernel BUG at 
fs/jbd2/transaction.c:1033!

Oct 30 04:59:52 psanaoss231 kernel: invalid opcode:  [#1] SMP
Oct 30 04:59:52 psanaoss231 kernel: last sysfs file: 
/sys/devices/system/cpu/online

Oct 30 04:59:52 psanaoss231 kernel: CPU 10
Oct 30 04:59:52 psanaoss231 kernel: Modules linked in: osp(U) ofd(U) 
lfsck(U) ost(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) 
ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) 
ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic 
sha256_generic crc32c_intel libcfs(U) nfs lockd fscache auth_rpcgss 
nfs_acl mpt3sas mpt2sas scsi_transport_sas raid_class mptctl mptbase 
autofs4 sunrpc ipt_REDIRECT iptable_nat nf_nat nf_conntrack_ipv4 
nf_conntrack nf_defrag_ipv4 ip_tables ib_ipoib rdma_ucm ib_ucm 
ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 microcode 
power_meter iTCO_wdt iTCO_vendor_support dcdbas ipmi_devintf sb_edac 
edac_core lpc_ich mfd_core shpchp igb i2c_algo_bit i2c_core ses 
enclosure sg ixgbe dca ptp pps_core mdio ext4 jbd2 mbcache raid1 
sd_mod crc_t10dif ahci wmi mlx4_ib ib_sa ib_mad ib_core mlx4_en 
mlx4_core megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [last 

Re: [lustre-discuss] Lustre OSS kernel panic after mounting OSTs

2018-10-30 Thread Fernando Perez

Dear Riccardo.

Have you tried upgrading the e2fsprogs packages before performing the e2fsck?

Regards.

=
Fernando Pérez
Institut de Ciències del Mar (CSIC)
Departament Oceanografía Física i Tecnològica
Passeig Marítim de la Barceloneta,37-49
08003 Barcelona
Phone:  (+34) 93 230 96 35
=

On 10/30/2018 01:05 PM, Riccardo Veraldi wrote:

Hello,

I have quite a critical problem.

One of my OSSes hangs in a kernel panic when trying to mount the OSTs.

After mounting 11 OSTs out of the 12 total OSTs it goes into a kernel panic.
It does not matter in which order they are mounted.


Any clues or hints?

I cannot really recover it and I have important data on it.

I already performed an e2fsck. Anyway, it did not fix the problem; it had
found a few inode count inconsistencies before.


kernel is 2.6.32-431.23.3.el6_lustre.x86_64

Red Hat Enterprise Linux Server release 6.7 (Santiago)

lustre-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64.x86_64


Oct 30 04:58:52 psanaoss231 kernel: INFO: task tgt_recov:4569 blocked 
for more than 120 seconds.


Oct 30 04:58:52 psanaoss231 kernel:  Not tainted 
2.6.32-431.23.3.el6_lustre.x86_64 #1
Oct 30 04:58:52 psanaoss231 kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 30 04:58:52 psanaoss231 kernel: tgt_recov D 
0003 0  4569  2 0x0080
Oct 30 04:58:52 psanaoss231 kernel: 880bf2ae1da0 0046 
 0003
Oct 30 04:58:52 psanaoss231 kernel: 880bf2ae1d30 81059096 
880bf2ae1d40 880bf2a1d500
Oct 30 04:58:52 psanaoss231 kernel: 880bf2b01ab8 880bf2ae1fd8 
fbc8 880bf2b01ab8

Oct 30 04:58:52 psanaoss231 kernel: Call Trace:
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
enqueue_task+0x66/0x80
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
check_for_clients+0x0/0x70 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [] 
target_recovery_overseer+0x9d/0x230 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
exp_connect_healthy+0x0/0x20 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
autoremove_wake_function+0x0/0x40
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
target_recovery_thread+0x0/0x1920 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [] 
target_recovery_thread+0x540/0x1920 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
default_wake_function+0x12/0x20
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
target_recovery_thread+0x0/0x1920 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [] 
kthread+0x96/0xa0
Oct 30 04:58:52 psanaoss231 kernel: [] 
child_rip+0xa/0x20
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
kthread+0x0/0xa0
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
child_rip+0x0/0x20
Oct 30 04:59:02 psanaoss231 kernel: Lustre: ana13-OST0004: Recovery 
over after 3:05, of 147 clients 146 recovered and 1 was evicted.
Oct 30 04:59:03 psanaoss231 kernel: Lustre: ana13-OST0004: Client 
89ba817f-45c3-5e64-99a8-b472651bbe45 (at 172.21.52.213@o2ib) reconnecting
Oct 30 04:59:03 psanaoss231 kernel: Lustre: Skipped 94 previous 
similar messages
Oct 30 04:59:21 psanaoss231 kernel: LustreError: 
4569:0:(ost_handler.c:1123:ost_brw_write()) Dropping timed-out write 
from 12345-172.21.49.129@tcp because locking object 0x0:14198730 took 
153 seconds (limit was 30).
Oct 30 04:59:21 psanaoss231 kernel: Lustre: ana13-OST0005: Bulk IO 
write error with 3a71df2f-16e7-d507-2495-ab60364d8e7c (at 
172.21.49.129@tcp), client will retry: rc -110

Oct 30 04:59:52 psanaoss231 kernel: [ cut here ]
Oct 30 04:59:52 psanaoss231 kernel: kernel BUG at 
fs/jbd2/transaction.c:1033!

Oct 30 04:59:52 psanaoss231 kernel: invalid opcode:  [#1] SMP
Oct 30 04:59:52 psanaoss231 kernel: last sysfs file: 
/sys/devices/system/cpu/online

Oct 30 04:59:52 psanaoss231 kernel: CPU 10
Oct 30 04:59:52 psanaoss231 kernel: Modules linked in: osp(U) ofd(U) 
lfsck(U) ost(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) 
ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) 
ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic 
sha256_generic crc32c_intel libcfs(U) nfs lockd fscache auth_rpcgss 
nfs_acl mpt3sas mpt2sas scsi_transport_sas raid_class mptctl mptbase 
autofs4 sunrpc ipt_REDIRECT iptable_nat nf_nat nf_conntrack_ipv4 
nf_conntrack nf_defrag_ipv4 ip_tables ib_ipoib rdma_ucm ib_ucm 
ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 microcode 
power_meter iTCO_wdt iTCO_vendor_support dcdbas ipmi_devintf sb_edac 
edac_core lpc_ich mfd_core shpchp igb i2c_algo_bit i2c_core ses 
enclosure sg ixgbe dca ptp pps_core mdio ext4 jbd2 mbcache raid1 
sd_mod crc_t10dif ahci wmi mlx4_ib ib_sa ib_mad ib_core mlx4_en 
mlx4_core megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [last 
unloaded: speedstep_lib]

Oct 30 04:59:52 psanaoss231 kernel:
Oct 30 04:59:52 psanaoss231 kernel: Pid: 4272, comm: ll_ost01_007 Not 
tainted 2.6.32-431.23.3.el6_lustre.x86_64 #1 Dell Inc. PowerEdge 
R620/0PXXHP
Oct 30 04:59:52 

Re: [lustre-discuss] Lustre OSS kernel panic after mounting OSTs

2018-10-30 Thread Riccardo Veraldi

I could mount the OSTs; the only way, though, was to mount with abort_recov,

thanks to this old ticket:

https://jira.whamcloud.com/browse/LU-5040
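
For reference, a minimal sketch of the mount command in question (the device and 
mount point here are only examples):

mount -t lustre -o abort_recov /dev/sdb /mnt/lustre/ost0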




On 10/30/18 5:05 AM, Riccardo Veraldi wrote:

Hello,

I have quite a critical problem.

One of my OSSes hangs in a kernel panic when trying to mount the OSTs.

After mounting 11 OSTs out of the 12 total OSTs it goes into a kernel panic.
It does not matter in which order they are mounted.


Any clues or hints?

I cannot really recover it and I have important data on it.

I already performed an e2fsck. Anyway, it did not fix the problem; it had
found a few inode count inconsistencies before.


kernel is 2.6.32-431.23.3.el6_lustre.x86_64

Red Hat Enterprise Linux Server release 6.7 (Santiago)

lustre-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64.x86_64


Oct 30 04:58:52 psanaoss231 kernel: INFO: task tgt_recov:4569 blocked 
for more than 120 seconds.


Oct 30 04:58:52 psanaoss231 kernel:  Not tainted 
2.6.32-431.23.3.el6_lustre.x86_64 #1
Oct 30 04:58:52 psanaoss231 kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 30 04:58:52 psanaoss231 kernel: tgt_recov D 
0003 0  4569  2 0x0080
Oct 30 04:58:52 psanaoss231 kernel: 880bf2ae1da0 0046 
 0003
Oct 30 04:58:52 psanaoss231 kernel: 880bf2ae1d30 81059096 
880bf2ae1d40 880bf2a1d500
Oct 30 04:58:52 psanaoss231 kernel: 880bf2b01ab8 880bf2ae1fd8 
fbc8 880bf2b01ab8

Oct 30 04:58:52 psanaoss231 kernel: Call Trace:
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
enqueue_task+0x66/0x80
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
check_for_clients+0x0/0x70 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [] 
target_recovery_overseer+0x9d/0x230 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
exp_connect_healthy+0x0/0x20 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
autoremove_wake_function+0x0/0x40
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
target_recovery_thread+0x0/0x1920 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [] 
target_recovery_thread+0x540/0x1920 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
default_wake_function+0x12/0x20
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
target_recovery_thread+0x0/0x1920 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [] 
kthread+0x96/0xa0
Oct 30 04:58:52 psanaoss231 kernel: [] 
child_rip+0xa/0x20
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
kthread+0x0/0xa0
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
child_rip+0x0/0x20
Oct 30 04:59:02 psanaoss231 kernel: Lustre: ana13-OST0004: Recovery 
over after 3:05, of 147 clients 146 recovered and 1 was evicted.
Oct 30 04:59:03 psanaoss231 kernel: Lustre: ana13-OST0004: Client 
89ba817f-45c3-5e64-99a8-b472651bbe45 (at 172.21.52.213@o2ib) reconnecting
Oct 30 04:59:03 psanaoss231 kernel: Lustre: Skipped 94 previous 
similar messages
Oct 30 04:59:21 psanaoss231 kernel: LustreError: 
4569:0:(ost_handler.c:1123:ost_brw_write()) Dropping timed-out write 
from 12345-172.21.49.129@tcp because locking object 0x0:14198730 took 
153 seconds (limit was 30).
Oct 30 04:59:21 psanaoss231 kernel: Lustre: ana13-OST0005: Bulk IO 
write error with 3a71df2f-16e7-d507-2495-ab60364d8e7c (at 
172.21.49.129@tcp), client will retry: rc -110

Oct 30 04:59:52 psanaoss231 kernel: [ cut here ]
Oct 30 04:59:52 psanaoss231 kernel: kernel BUG at 
fs/jbd2/transaction.c:1033!

Oct 30 04:59:52 psanaoss231 kernel: invalid opcode:  [#1] SMP
Oct 30 04:59:52 psanaoss231 kernel: last sysfs file: 
/sys/devices/system/cpu/online

Oct 30 04:59:52 psanaoss231 kernel: CPU 10
Oct 30 04:59:52 psanaoss231 kernel: Modules linked in: osp(U) ofd(U) 
lfsck(U) ost(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) 
ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) 
ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic 
sha256_generic crc32c_intel libcfs(U) nfs lockd fscache auth_rpcgss 
nfs_acl mpt3sas mpt2sas scsi_transport_sas raid_class mptctl mptbase 
autofs4 sunrpc ipt_REDIRECT iptable_nat nf_nat nf_conntrack_ipv4 
nf_conntrack nf_defrag_ipv4 ip_tables ib_ipoib rdma_ucm ib_ucm 
ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 microcode 
power_meter iTCO_wdt iTCO_vendor_support dcdbas ipmi_devintf sb_edac 
edac_core lpc_ich mfd_core shpchp igb i2c_algo_bit i2c_core ses 
enclosure sg ixgbe dca ptp pps_core mdio ext4 jbd2 mbcache raid1 
sd_mod crc_t10dif ahci wmi mlx4_ib ib_sa ib_mad ib_core mlx4_en 
mlx4_core megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [last 
unloaded: speedstep_lib]

Oct 30 04:59:52 psanaoss231 kernel:
Oct 30 04:59:52 psanaoss231 kernel: Pid: 4272, comm: ll_ost01_007 Not 
tainted 2.6.32-431.23.3.el6_lustre.x86_64 #1 Dell Inc. PowerEdge 
R620/0PXXHP
Oct 30 04:59:52 psanaoss231 kernel: RIP: 0010:[]  
[] jbd2_journal_dirty_metadata+0x10d/0x150 [jbd2]
Oct 30 04:59:52 psanaoss231 kernel: RSP: 0018:880c058437d0 EFLAGS: 
00010246
Oct 30 04:59:52 psanaoss231 kernel: RAX: 880c05573dc0 

[lustre-discuss] Lustre OSS kernel panic after mounting OSTs

2018-10-30 Thread Riccardo Veraldi

Hello,

I have quite a critical problem.

One of my OSSes hangs in a kernel panic when trying to mount the OSTs.

After mounting 11 OSTs out of the 12 total OSTs it goes into a kernel panic.
It does not matter in which order they are mounted.


Any clues or hints?

I cannot really recover it and I have important data on it.

I already performed an e2fsck. Anyway, it did not fix the problem; it had
found a few inode count inconsistencies before.


kernel is 2.6.32-431.23.3.el6_lustre.x86_64

Red Hat Enterprise Linux Server release 6.7 (Santiago)

lustre-2.5.3-2.6.32_431.23.3.el6_lustre.x86_64.x86_64


Oct 30 04:58:52 psanaoss231 kernel: INFO: task tgt_recov:4569 blocked 
for more than 120 seconds.


Oct 30 04:58:52 psanaoss231 kernel:  Not tainted 
2.6.32-431.23.3.el6_lustre.x86_64 #1
Oct 30 04:58:52 psanaoss231 kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 30 04:58:52 psanaoss231 kernel: tgt_recov D 0003 
0  4569  2 0x0080
Oct 30 04:58:52 psanaoss231 kernel: 880bf2ae1da0 0046 
 0003
Oct 30 04:58:52 psanaoss231 kernel: 880bf2ae1d30 81059096 
880bf2ae1d40 880bf2a1d500
Oct 30 04:58:52 psanaoss231 kernel: 880bf2b01ab8 880bf2ae1fd8 
fbc8 880bf2b01ab8

Oct 30 04:58:52 psanaoss231 kernel: Call Trace:
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
enqueue_task+0x66/0x80
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
check_for_clients+0x0/0x70 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [] 
target_recovery_overseer+0x9d/0x230 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
exp_connect_healthy+0x0/0x20 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
autoremove_wake_function+0x0/0x40
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
target_recovery_thread+0x0/0x1920 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [] 
target_recovery_thread+0x540/0x1920 [ptlrpc]
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
default_wake_function+0x12/0x20
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
target_recovery_thread+0x0/0x1920 [ptlrpc]

Oct 30 04:58:52 psanaoss231 kernel: [] kthread+0x96/0xa0
Oct 30 04:58:52 psanaoss231 kernel: [] child_rip+0xa/0x20
Oct 30 04:58:52 psanaoss231 kernel: [] ? kthread+0x0/0xa0
Oct 30 04:58:52 psanaoss231 kernel: [] ? 
child_rip+0x0/0x20
Oct 30 04:59:02 psanaoss231 kernel: Lustre: ana13-OST0004: Recovery over 
after 3:05, of 147 clients 146 recovered and 1 was evicted.
Oct 30 04:59:03 psanaoss231 kernel: Lustre: ana13-OST0004: Client 
89ba817f-45c3-5e64-99a8-b472651bbe45 (at 172.21.52.213@o2ib) reconnecting
Oct 30 04:59:03 psanaoss231 kernel: Lustre: Skipped 94 previous similar 
messages
Oct 30 04:59:21 psanaoss231 kernel: LustreError: 
4569:0:(ost_handler.c:1123:ost_brw_write()) Dropping timed-out write 
from 12345-172.21.49.129@tcp because locking object 0x0:14198730 took 
153 seconds (limit was 30).
Oct 30 04:59:21 psanaoss231 kernel: Lustre: ana13-OST0005: Bulk IO write 
error with 3a71df2f-16e7-d507-2495-ab60364d8e7c (at 172.21.49.129@tcp), 
client will retry: rc -110

Oct 30 04:59:52 psanaoss231 kernel: [ cut here ]
Oct 30 04:59:52 psanaoss231 kernel: kernel BUG at 
fs/jbd2/transaction.c:1033!

Oct 30 04:59:52 psanaoss231 kernel: invalid opcode:  [#1] SMP
Oct 30 04:59:52 psanaoss231 kernel: last sysfs file: 
/sys/devices/system/cpu/online

Oct 30 04:59:52 psanaoss231 kernel: CPU 10
Oct 30 04:59:52 psanaoss231 kernel: Modules linked in: osp(U) ofd(U) 
lfsck(U) ost(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) 
ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) 
ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic 
sha256_generic crc32c_intel libcfs(U) nfs lockd fscache auth_rpcgss 
nfs_acl mpt3sas mpt2sas scsi_transport_sas raid_class mptctl mptbase 
autofs4 sunrpc ipt_REDIRECT iptable_nat nf_nat nf_conntrack_ipv4 
nf_conntrack nf_defrag_ipv4 ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs 
ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 microcode power_meter iTCO_wdt 
iTCO_vendor_support dcdbas ipmi_devintf sb_edac edac_core lpc_ich 
mfd_core shpchp igb i2c_algo_bit i2c_core ses enclosure sg ixgbe dca ptp 
pps_core mdio ext4 jbd2 mbcache raid1 sd_mod crc_t10dif ahci wmi mlx4_ib 
ib_sa ib_mad ib_core mlx4_en mlx4_core megaraid_sas dm_mirror 
dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]

Oct 30 04:59:52 psanaoss231 kernel:
Oct 30 04:59:52 psanaoss231 kernel: Pid: 4272, comm: ll_ost01_007 Not 
tainted 2.6.32-431.23.3.el6_lustre.x86_64 #1 Dell Inc. PowerEdge R620/0PXXHP
Oct 30 04:59:52 psanaoss231 kernel: RIP: 0010:[]  
[] jbd2_journal_dirty_metadata+0x10d/0x150 [jbd2]
Oct 30 04:59:52 psanaoss231 kernel: RSP: 0018:880c058437d0 EFLAGS: 
00010246
Oct 30 04:59:52 psanaoss231 kernel: RAX: 880c05573dc0 RBX: 
880c043b8d08 RCX: 88175b0fedc8
Oct 30 04:59:52 psanaoss231 kernel: RDX:  RSI: 
88175b0fedc8 RDI: 
Oct 30 04:59:52 psanaoss231 kernel: RBP: