[Ocfs2-users] fsck.ocfs2 loops + hangs but does not check
Hi ocfs2-users,

my first post to this list from yesterday probably didn't get through. Anyway, I've made some progress in the meantime and may now ask more specific questions ...

I'm having issues with an 11 TB ocfs2 shared filesystem on Debian Wheezy:

Linux s1a 3.2.0-4-amd64 #1 SMP Debian 3.2.54-2 x86_64 GNU/Linux

The kernel modules are:

modinfo ocfs2 -> version: 1.5.0

using the stock ocfs2-tools 1.6.4-1+deb7u1 from the distribution. As an alternative I cloned and built the latest ocfs2-tools from markfasheh's ocfs2-tools on github, which should be version 1.8.4.

The filesystem runs on top of drbd, is filled to roughly 40 % and has suffered from read-only remounts and hanging clients since the last reboot. These may be DLM problems, but I suspect they stem from some corrupt disk structures. Before that it all ran stable for months.

This situation made me want to run fsck.ocfs2, and now I wonder how to do that. The filesystem is not mounted.

With the stock ocfs2-tools 1.6.4:

root@s1a:~# fsck.ocfs2 -v -f /dev/drbd1 > fsck_drbd1.log 2>&1
fsck.ocfs2 1.6.4
Checking OCFS2 filesystem in /dev/drbd1:
  Label:              ocfs2_ASSET
  UUID:               6A1A0189A3F94E32B6B9A526DF9060F3
  Number of blocks:   5557283182
  Block size:         2048
  Number of clusters: 2778641591
  Cluster size:       4096
  Number of slots:    16

Watching fsck_drbd1.log I find that it is making progress in

Pass 0a: Checking cluster allocation chains

until it reaches "chain 73" and goes into an infinite loop, filling the logfile with breathtaking speed.
With the newly built ocfs2-tools 1.8.4 I get:

root@s1a:~# fsck.ocfs2 -v -f /dev/drbd1 > fsck_drbd1.log 2>&1
fsck.ocfs2 1.8.4
Checking OCFS2 filesystem in /dev/drbd1:
  Label:              ocfs2_ASSET
  UUID:               6A1A0189A3F94E32B6B9A526DF9060F3
  Number of blocks:   5557283182
  Block size:         2048
  Number of clusters: 2778641591
  Cluster size:       4096
  Number of slots:    16

Again watching the verbose output in fsck_drbd1.log, I find that this time it proceeds up to

Pass 0a: Checking cluster allocation chains
o2fsck_pass0:1360 | found inode alloc 13 at block 13

and stays there without any further progress. I terminated this process after waiting for more than an hour.

Now I'm somehow lost ... and would very much appreciate it if anybody on this list would share his knowledge and give me a hint what to do next. What could be done to get this file system checked and repaired? Am I missing something important, or do I just have to wait a little bit longer? Is there a version of ocfs2-tools / fsck.ocfs2 which will perform as expected?

I'm prepared to upgrade the kernel to 3.16.0-0.bpo.4-amd64 but shy away from taking that risk without any clue of whether that might solve my problem ...

Thanks in advance ... Michael Ulbrich

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users
Re: [Ocfs2-users] fsck.ocfs2 loops + hangs but does not check
Hi Joseph,

ok, got it! Here's the loop in chain 73:

Group Chain: 73   Parent Inode: 13   Generation: 1172963971
CRC32: ECC:
##   Block#      Total  Used   Free   Contig  Size
0    4280773632  15872  11487  4385   1774    1984
1    2583263232  15872  5341   10531  5153    1984
2    4543613952  15872  5329   10543  5119    1984
3    4532662272  15872  10753  5119   5119    1984
4    4539963392  15872  3223   12649  7530    1984
5    4536312832  15872  5219   10653  5534    1984
6    4529011712  15872  6047   9825   3359    1984
7    4525361152  15872  4475   11397  5809    1984
8    4521710592  15872  3182   12690  5844    1984
9    4518060032  15872  5881   9991   5131    1984
10   4236966912  15872  10753  5119   5119    1984
11   4098245632  15872  10756  5116   3388    1984
12   4514409472  15872  8826   7046   5119    1984
13   3441144832  15872  15     15857  9680    1984
14   4404892672  15872  7563   8309   5119    1984
15   4233316352  15872  9398   6474   5114    1984
16   448882      15872  6358   9514   5119    1984
17   3901115392  15872  9932   5940   3757    1984
18   4507108352  15872  6557   9315   6166    1984
19   4083643392  15872  571    15301  4914    1984  <--
20   4510758912  15872  4834   11038  6601    1984
21   4492506112  15872  6532   9340   5119    1984
22   4496156672  15872  10753  5119   5119    1984
23   4503457792  15872  10718  5154   5119    1984
...
154  4083643392  15872  571    15301  4914    1984  <--
155  4510758912  15872  4834   11038  6601    1984
156  4492506112  15872  6532   9340   5119    1984
157  4496156672  15872  10753  5119   5119    1984
158  4503457792  15872  10718  5154   5119    1984
...
289  4083643392  15872  571    15301  4914    1984  <--
290  4510758912  15872  4834   11038  6601    1984
291  4492506112  15872  6532   9340   5119    1984
292  4496156672  15872  10753  5119   5119    1984
293  4503457792  15872  10718  5154   5119    1984

etc.

So the loop begins at record #154 and spans 135 records, right?

Will backup fs metadata as soon as I have some external storage at hand.

Thanks a lot so far ... Michael

On 03/24/2016 10:41 AM, Joseph Qi wrote:
> Hi Michael,
> It seems that dead loop happens in chain 73. You have formatted using 2K
> block and 4K cluster, so each chain should have 1522 or 1521 records.
> But at first glance, I cannot figure out which block goes wrong, because
> the output you pasted indicates all blocks are different.
> So I suggest you investigate all the blocks which belong to chain 73
> and try to find out if there is a loop there.
> BTW, have you backed up the metadata using o2image?
>
> Thanks,
> Joseph
>
> On 2016/3/24 16:40, Michael Ulbrich wrote:
>> Hi Joseph,
>>
>> thanks a lot for your help. It is very much appreciated!
>>
>> I ran debugfs.ocfs2 from ocfs2-tools 1.6.4 on the mounted file system:
>>
>> root@s1a:~# debugfs.ocfs2 -R 'stat //global_bitmap' /dev/drbd1 > debugfs_drbd1.log 2>&1
>>
>> Inode: 13   Mode: 0644   Generation: 1172963971 (0x45ea0283)
>> FS Generation: 1172963971 (0x45ea0283)
>> CRC32: ECC:
>> Type: Regular   Attr: 0x0   Flags: Valid System Allocbitmap Chain
>> Dynamic Features: (0x0)
>> User: 0 (root)   Group: 0 (root)   Size: 11381315956736
>> Links: 1   Clusters: 2778641591
>> ctime: 0x54010183 -- Sat Aug 30 00:41:07 2014
>> atime: 0x54010183 -- Sat Aug 30 00:41:07 2014
>> mtime: 0x54010183 -- Sat Aug 30 00:41:07 2014
>> dtime: 0x0 -- Thu Jan  1 01:00:00 1970
>> ctime_nsec: 0x -- 0
>> atime_nsec: 0x -- 0
>> mtime_nsec: 0x -- 0
>> Refcount Block: 0
>> Last Extblk: 0   Orphan Slot: 0
>> Sub Alloc Slot: Global   Sub Alloc Bit: 7
>> Bitmap Total: 2778641591   Used: 1083108631   Free: 1695532960
>> Clusters per Group: 15872   Bits per Cluster: 1
>> Count: 115   Next Free Rec: 115
>> ##  Total     Used     Free      Block#
>> 0   24173056  9429318  14743738  4533995520
>> 1   24173056  9421663  14751393  4548629504
>> 2   24173056  9432421  14740635  4588817408
>> 3   24173056  9427533  14745523  4548692992
>> 4   24173056  9433978  14739078  4508568576
>> 5   24173056  9436974  14736082  4636369920
>> 6   24173056  942
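Joseph's suggestion to look for a loop among the blocks of chain 73 can be mechanized instead of done by eye. A sketch, assuming the chain listing has been saved to a file (the name `chain73.txt` is mine) with one record per line in the columns `## Block# Total Used Free Contig Size` shown above: the first record whose Block# was already seen is the point where the chain re-enters itself.

```shell
# Flag the first record whose Block# (field $2) repeats -- the entry
# point of the loop. chain73.txt is assumed to hold the chain records
# saved from the debugfs.ocfs2 / fsck.ocfs2 output, one per line.
awk 'seen[$2]++ { printf "record %s reuses block %s\n", $1, $2; exit }' chain73.txt
```

On the chain 73 listing this should fire at record #154, which reuses Block# 4083643392 already listed at record #19.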
Re: [Ocfs2-users] fsck.ocfs2 loops + hangs but does not check
Hi Joseph,

thanks for this information, although this does not sound too optimistic ...

So, if I understand you correctly: if we had a metadata backup from o2image from _before_ the crash, we could have looked up the missing info to remove the loop from group chain 73, right?

But how could the loop issue be fixed and at the same time the damage to the data be minimized? There is a recent file level backup from which damaged or missing files could be restored later.

151  4054438912  15872  2152   13720  10606  1984
152  4094595072  15872  10753  5119   5119   1984
153  4090944512  15872  1818   14054  9646   1984  <--
154  4083643392  15872  571    15301  4914   1984
155  4510758912  15872  4834   11038  6601   1984
156  4492506112  15872  6532   9340   5119   1984

Could you describe a "brute force" way to dd out and edit record #153 to remove the loop and at the same time minimize the potential loss of data? So that fsck would have a chance to complete and fix the remaining issues?

Thanks a lot for your help ... Michael

On 03/24/2016 02:10 PM, Joseph Qi wrote:
> Hi Michael,
> So I think the block of record #153 goes wrong, which points next to
> block 4083643392 of record #19.
> But the problem is we don't know the right info of the block of record
> #153, otherwise we can dd out, edit it and then dd in to fix it.
>
> Thanks,
> Joseph
>
> On 2016/3/24 18:38, Michael Ulbrich wrote:
>> Hi Joseph,
>>
>> ok, got it!
>> Here's the loop in chain 73:
>>
>> Group Chain: 73   Parent Inode: 13   Generation: 1172963971
>> CRC32: ECC:
>> ##   Block#      Total  Used   Free   Contig  Size
>> 0    4280773632  15872  11487  4385   1774    1984
>> 1    2583263232  15872  5341   10531  5153    1984
>> 2    4543613952  15872  5329   10543  5119    1984
>> 3    4532662272  15872  10753  5119   5119    1984
>> 4    4539963392  15872  3223   12649  7530    1984
>> 5    4536312832  15872  5219   10653  5534    1984
>> 6    4529011712  15872  6047   9825   3359    1984
>> 7    4525361152  15872  4475   11397  5809    1984
>> 8    4521710592  15872  3182   12690  5844    1984
>> 9    4518060032  15872  5881   9991   5131    1984
>> 10   4236966912  15872  10753  5119   5119    1984
>> 11   4098245632  15872  10756  5116   3388    1984
>> 12   4514409472  15872  8826   7046   5119    1984
>> 13   3441144832  15872  15     15857  9680    1984
>> 14   4404892672  15872  7563   8309   5119    1984
>> 15   4233316352  15872  9398   6474   5114    1984
>> 16   448882      15872  6358   9514   5119    1984
>> 17   3901115392  15872  9932   5940   3757    1984
>> 18   4507108352  15872  6557   9315   6166    1984
>> 19   4083643392  15872  571    15301  4914    1984  <--
>> 20   4510758912  15872  4834   11038  6601    1984
>> 21   4492506112  15872  6532   9340   5119    1984
>> 22   4496156672  15872  10753  5119   5119    1984
>> 23   4503457792  15872  10718  5154   5119    1984
>> ...
>> 154  4083643392  15872  571    15301  4914    1984  <--
>> 155  4510758912  15872  4834   11038  6601    1984
>> 156  4492506112  15872  6532   9340   5119    1984
>> 157  4496156672  15872  10753  5119   5119    1984
>> 158  4503457792  15872  10718  5154   5119    1984
>> ...
>> 289  4083643392  15872  571    15301  4914    1984  <--
>> 290  4510758912  15872  4834   11038  6601    1984
>> 291  4492506112  15872  6532   9340   5119    1984
>> 292  4496156672  15872  10753  5119   5119    1984
>> 293  4503457792  15872  10718  5154   5119    1984
>>
>> etc.
>>
>> So the loop begins at record #154 and spans 135 records, right?
>>
>> Will backup fs metadata as soon as I have some external storage at hand.
>>
>> Thanks a lot so far ... Michael
>>
>> On 03/24/2016 10:41 AM, Joseph Qi wrote:
>>> Hi Michael,
>>> It seems that dead loop happens in chain 73. You have formatted using 2K
>>> block and 4K cluster, so each chain should have 1522 or 1521 records.
>>> But at first glance, I cannot figure out which block goes wrong, because
>>> the output you pasted indicates all blocks are different
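The raw-read half of the "dd out, edit, dd in" idea discussed above is plain arithmetic: with the 2048-byte block size reported by fsck.ocfs2, the group descriptor of record #153 starts at Block# x 2048 bytes. A hedged sketch (the write-back lines are deliberately left commented out, and the file name `gd_153.bin` is made up; never write back without verifying against the on-disk layout):

```shell
blocksize=2048        # Block size from the fsck.ocfs2 header above
block=4090944512      # Block# of the suspect record #153
offset=$((block * blocksize))
echo "record #153 descriptor starts at byte $offset"

# Read the single 2048-byte descriptor block out for offline inspection:
# dd if=/dev/drbd1 of=gd_153.bin bs=2048 count=1 skip=4090944512
# ...inspect/edit gd_153.bin with a hex editor...
# dd if=gd_153.bin of=/dev/drbd1 bs=2048 count=1 seek=4090944512
```

The echo should report byte offset 8378254360576; dd's `skip`/`seek` operate in units of `bs`, which is why the block number can be used directly.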
Re: [Ocfs2-users] fsck.ocfs2 loops + hangs but does not check
555253350415872115871158711984 1522 555618406415872115871158711984 ... all following group chains are similarly structured up to #73 which looks as follows: Group Chain: 73 Parent Inode: 13 Generation: 1172963971 CRC32: ECC: ## Block#TotalUsed Free Contig Size 02583263232158725341 105315153 1984 14543613952158725329 105435119 1984 2453266227215872107535119 5119 1984 34539963392158723223 126497530 1984 44536312832158725219 106535534 1984 54529011712158726047 9825 3359 1984 64525361152158724475 113975809 1984 74521710592158723182 126905844 1984 84518060032158725881 9991 5131 1984 9423696691215872107535119 5119 1984 ... 2059651 4299026432158724334 115384816 1984 2059652 4087293952158727003 8869 2166 1984 2059653 4295375872158726626 9246 5119 1984 2059654 428807475215872509 153639662 1984 2059655 4291725312158726151 9721 5119 1984 2059656 428442419215872100525820 5119 1984 2059657 4277123072158727383 8489 5120 1984 2059658 42734725121587214 158585655 1984 2059659 4269821952158722637 132357060 1984 2059660 426617139215872107585114 3674 1984 ... Assuming this would go on forever I stopped debugfs.ocfs2. With debugs.ocfs2 from ocfs2-tools 1.8.4 I get an identical result. Please let me know if I can provide any further information and help to fix this issue. Thanks again + Best regards ... Michael On 03/24/2016 01:30 AM, Joseph Qi wrote: > Hi Michael, > Could you please use debugfs to check the output? > # debugfs.ocfs2 -R 'stat //global_bitmap' > > Thanks, > Joseph > > On 2016/3/24 6:38, Michael Ulbrich wrote: >> Hi ocfs2-users, >> >> my first post to this list from yesterday probably didn't get through. >> >> Anyway, I've made some progress in the meantime and may now ask more >> specific questions ... 
>> >> I'm having issues with an 11 TB ocfs2 shared filesystem on Debian Wheezy: >> >> Linux s1a 3.2.0-4-amd64 #1 SMP Debian 3.2.54-2 x86_64 GNU/Linux >> >> the kernel modules are: >> >> modinfo ocfs2 -> version: 1.5.0 >> >> using stock ocfs2-tools 1.6.4-1+deb7u1 from the distri. >> >> As an alternative I cloned and built the latest ocfs2-tools from >> markfasheh's ocfs2-tools on github which should be version 1.8.4. >> >> The filesystem runs on top of drbd, is used to roughly 40 % and suffers >> from read-only remounts and hanging clients since the last reboot. This >> may be DLM problems but I suspect they stem from some corrupt disk >> structures. Before that it all ran stable for months. >> >> This situation made me want to run fsck.ocfs2 and now I wonder how to do >> that. The filesystem is not mounted. >> >> With the stock ocfs-tools 1.6.4: >> >> root@s1a:~# fsck.ocfs2 -v -f /dev/drbd1 > fsck_drbd1.log 2>&1 >> fsck.ocfs2 1.6.4 >> Checking OCFS2 filesystem in /dev/drbd1: >> Label: ocfs2_ASSET >> UUID: 6A1A0189A3F94E32B6B9A526DF9060F3 >> Number of blocks: 5557283182 >> Block size: 2048 >> Number of clusters: 2778641591 >> Cluster size: 4096 >> Number of slots:16 >> >> I'm checking fsck_drbd1.log and find that it is making progress in >> >> Pass 0a: Checking cluster allocation chains >> >> until it reaches "chain 73" and goes into an infinite loop filling the >> logfile with breathtaking speed. 
>> >> With the newly built ocfs-tools 1.8.4 I get: >> >> root@s1a:~# fsck.ocfs2 -v -f /dev/drbd1 > fsck_drbd1.log 2>&1 >> fsck.ocfs2 1.8.4 >> Checking OCFS2 filesystem in /dev/drbd1: >> Label: ocfs2_ASSET >> UUID: 6A1A0189A3F94E32B6B9A526DF9060F3 >> Number of blocks: 5557283182 >> Block size: 2048 >> Number of clusters: 2778641591 >> Cluster size: 4096 >> Number of slots:16 >> >> Again watching the verbose output in fsck_drbd1.log I find that this >> time it proceeds up to >> >> Pass 0a: Checking cluster allocation chains >> o2fsck_pass0:1360 | found inode alloc 13 at block 13 >> >> and stays there wi
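Joseph's later remark in this thread that each chain should hold 1522 or 1521 records can be sanity-checked from the stat output alone. A sketch using the numbers reported above (2778641591 clusters, 15872 clusters per group, 115 chain records in the global bitmap):

```shell
clusters=2778641591
cpg=15872      # Clusters per Group, from the //global_bitmap stat
chains=115     # Count / Next Free Rec, from the same stat
groups=$(( (clusters + cpg - 1) / cpg ))   # ceiling division: total block groups
echo "$groups groups over $chains chains -> ~$((groups / chains)) records per chain"
```

That gives 175066 groups and roughly 1522 records per chain, so a chain listing that runs past record 2,059,660 has cycled through its loop many hundreds of times.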
Re: [Ocfs2-users] fsck.ocfs2 loops + hangs but does not check
Joseph, thanks again for your help!

Currently I'm dumping out 4 TB of data from the broken ocfs2 device to an external disk. I have shut down the cluster and have the fs mounted read-only on a single node. It seems that the data structures are still intact and that the file system problems are confined to internal data areas (DLM?) which are not in use in the single node r/o mount use case. Will create a new ocfs2 device and restore the data later.

Besides taking metadata backups with o2image, is there any advice which you would give to avoid similar situations in the future?

All the best ... Michael

On 03/25/2016 01:36 AM, Joseph Qi wrote:
> Hi Michael,
>
> On 2016/3/24 21:47, Michael Ulbrich wrote:
>> Hi Joseph,
>>
>> thanks for this information although this does not sound too optimistic ...
>>
>> So, if I understand you correctly, if we had a metadata backup from
>> o2image _before_ the crash we could have looked up the missing info to
>> remove the loop from group chain 73, right?
> If we have a metadata backup, we can use o2image to restore it back, but
> this may lose some data.
>
>>
>> But how could the loop issue be fixed and at the same time the damage to
>> the data be minimized? There is a recent file level backup from which
>> damaged or missing files could be restored later.
>>
>> 151  4054438912  15872  2152   13720  10606  1984
>> 152  4094595072  15872  10753  5119   5119   1984
>> 153  4090944512  15872  1818   14054  9646   1984  <--
>> 154  4083643392  15872  571    15301  4914   1984
>> 155  4510758912  15872  4834   11038  6601   1984
>> 156  4492506112  15872  6532   9340   5119   1984
>>
>> Could you describe a "brute force" way to dd out and edit record
>> #153 to remove the loop and minimize potential loss of data at the same
>> time? So that fsck would have a chance to complete and fix the remaining
>> issues?
> This is dangerous until we know exactly what info the block should
> store.
>
> My idea is to find out the actual block of record #154 and let block
> 4090944512 of record #153 point to it.
> This must be a bit complicated and should be done with a deep
> understanding of the disk layout.
>
> I have gone through the fsck.ocfs2 patches, and found the following may help:
> commit efca4b0f2241 (Break a chain loop in group desc)
> But as you said, you have already upgraded to version 1.8.4. So I'm sorry,
> currently I don't have a better idea.
>
> Thanks,
> Joseph
>>
>> Thanks a lot for your help ... Michael
>>
>> On 03/24/2016 02:10 PM, Joseph Qi wrote:
>>> Hi Michael,
>>> So I think the block of record #153 goes wrong, which points next to
>>> block 4083643392 of record #19.
>>> But the problem is we don't know the right info of the block of record
>>> #153, otherwise we can dd out, edit it and then dd in to fix it.
>>>
>>> Thanks,
>>> Joseph
>>>
>>> On 2016/3/24 18:38, Michael Ulbrich wrote:
>>>> Hi Joseph,
>>>>
>>>> ok, got it! Here's the loop in chain 73:
>>>>
>>>> Group Chain: 73   Parent Inode: 13   Generation: 1172963971
>>>> CRC32: ECC:
>>>> ##   Block#      Total  Used   Free   Contig  Size
>>>> 0    4280773632  15872  11487  4385   1774    1984
>>>> 1    2583263232  15872  5341   10531  5153    1984
>>>> 2    4543613952  15872  5329   10543  5119    1984
>>>> 3    4532662272  15872  10753  5119   5119    1984
>>>> 4    4539963392  15872  3223   12649  7530    1984
>>>> 5    4536312832  15872  5219   10653  5534    1984
>>>> 6    4529011712  15872  6047   9825   3359    1984
>>>> 7    4525361152  15872  4475   11397  5809    1984
>>>> 8    4521710592  15872  3182   12690  5844    1984
>>>> 9    4518060032  15872  5881   9991   5131    1984
>>>> 10   4236966912  15872  10753  5119   5119    1984
>>>> 11   4098245632  15872  10756  5116   3388    1984
>>>> 12   4514409472  15872  8826   7046   5119    1984
>>>> 13   3441144832  15872  15     15857  9680    1984
>>>> 14   4404892672  15872  7563   8309   5119    1984
>>>> 15   4233316352  15872  9398   6474   5114    1984
>>>> 16   448882      15872  6358   9514   5119    1984
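Since o2image came up again as the preventive measure, here is a sketch of taking such a metadata backup periodically. The path and dated naming scheme are my own invention; o2image copies only metadata (inodes, chain allocators, bitmaps), not file data, so the image stays small relative to an 11 TB volume, and it should be taken from an unmounted or quiesced device:

```shell
# Build a dated image name (hypothetical backup path):
img="/backup/drbd1-$(date +%Y%m%d).o2image"
echo "$img"

# Take the metadata-only snapshot (commented out here, as it needs the
# real device):
# o2image /dev/drbd1 "$img"
```

A damaged structure could later be compared against such an image, or metadata restored from it (o2image supports a restore direction; see its man page), at the cost of losing changes made after the snapshot.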
[Ocfs2-users] Mixed mounts w/ different physical block sizes (long post)
Hi again,

chatting with a helpful person on the #ocfs2 IRC channel this morning I got encouraged to cross-post to ocfs2-devel. For historic background and further details pls. see my two previous posts to ocfs2-users from last week, which are unanswered so far. According to my current state of inspection I changed the topic from "Node 8 doesn't mount / Wrong slot map assignment" to the current "Mixed mounts ...". Here we go:

I've learnt that large hard disks in increasing number come formatted w/ 4k physical block size. Now I've created an ocfs2 shared file system on top of drbd on a RAID1 of two 6 TB disks with such 4k physical block size. File system creation was done on a hypervisor which actually saw the device as having 4k physical sector size. I'm using the default o2cb cluster stack. Version is ocfs2 1.6.4 on stock Debian 8.

A node (numbered "1" in cluster.conf) which mounts this device with 4k phys. blocks leads to a strange "times 8" numbering when checking heartbeat debug info with 'echo "hb" | debugfs.ocfs2 -n /dev/drbd1':

hb
node: node  seq       generation        checksum
8:    1     59bfd253  00bfa1b63f30e494  c518c55a

I'm not sure why the first 2 columns are named "node:" and "node", but assume the first "node:" is an index into some internal data structure (slot map? heartbeat region?) while the second "node" column shows the actual node number as given in cluster.conf.

Now a second node mounts the shared file system, again as a 4k block device:

hb
node: node  seq       generation        checksum
8:    1     59bfd36a  00bfa1b63f30e494  d4f79d63
16:   2     59bfd369  7acf8521da342228  4b8cd74d

As it actually happened in my setup of a two node cluster with 2 hypervisors and 3 virtual machines on top of each (8 nodes in total), when mounting the fs on the first virtual machine with node number 3 we get:

hb
node: node  seq       generation        checksum
3:    3     59bfd413  59eb77b4db07884b  87a5057d
8:    1     59bfd412  00bfa1b63f30e494  e782d86e
16:   2     59bfd413  7acf8521da342228  cd48c018

Uhm, ... wait ... 3 ??
Mounting on further VMs (nodes 4, 5, 6 and 7) leads to:

hb
node: node  seq       generation        checksum
3:    3     59bfd413  59eb77b4db07884b  87a5057d
4:    4     59bfd413  debf95d5ff50dc10  3839c791
5:    5     59bfd414  529a98c758325d5b  60080c42
6:    6     59bfd412  14acfb487fa8c8b8  f54cef9d
7:    7     59bfd413  4d2d36de0b0d6b2e  3f1ad275
8:    1     59bfd412  00bfa1b63f30e494  e782d86e
16:   2     59bfd413  7acf8521da342228  cd48c018

Up to this point I did not notice any error or warning in the machines' console or kernel logs. And then, trying to mount on node 8, finally there's an error:

kern.log node 1:

(o2hb-0AEE381A14,50990,4):o2hb_check_own_slot:582 ERROR: Another node is heartbeating on device (drbd1): expected(1:0x18acf7b0b3e5544c, 0x59b8445c), ondisk(8:0xb91302db72a65364, 0x59b8445b)

kern.log node 8:

ocfs2: Mounting device (254,16) on (node 8, slot 7) with ordered data mode.
(o2hb-0AEE381A14,518,1):o2hb_check_own_slot:582 ERROR: Another node is heartbeating on device (vdc): expected(8:0x18acf7b0b3e5544c, 0x59b8445c), ondisk(1:0x18acf7b0b3e5544c, 0x59b8445c)

(actual seq and generation are not from above hb debug dump)

Now we have a conflict on slot 8. When I encountered this error for the first time, I didn't know about heartbeat debug info, slot maps or heartbeat regions and had no idea what might have gone wrong, so I started experimenting and found a "solution" by swapping nodes 1 <-> 8 in cluster.conf. This leads to the following layout of the heartbeat region (?):

hb
node: node  seq       generation        checksum
1:    1     59bfd412  00bfa1b63f30e494  e782d86e
3:    3     59bfd413  59eb77b4db07884b  87a5057d
4:    4     59bfd413  debf95d5ff50dc10  3839c791
5:    5     59bfd414  529a98c758325d5b  60080c42
6:    6     59bfd412  14acfb487fa8c8b8  f54cef9d
7:    7     59bfd413  4d2d36de0b0d6b2e  3f1ad275
16:   2     59bfd413  7acf8521da342228  cd48c018
64:   8     59bfd413  73a63eb550a33095  f4e074d1

Voila - all 8 nodes mounted, problem solved - let's continue with getting this cluster ready for production ...
As it turned out, this was in no way a stable configuration, in that after a few weeks spurious reboots (fencing peer) started to happen (drbd losing its replication connection, all kinds of weird kernel oopses and panics from drbd and ocfs2). Reboots were usually preceded by a burst of errors like:

Sep 11 00:01:27 web1 kernel: [ 9697.644436] (o2hb-10254DCA50,515,1):o2hb_check_own_slot:582 ERROR: Heartbeat sequence mismatch on device (vdc): expected(3:0x743493e99d19e721, 0x59b5b635),
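The "times 8" pattern in the hb tables above is consistent with a sector-size mix-up rather than a bit mask: 4096 / 512 = 8, so a node that computes its heartbeat slot position in 4k sectors while other nodes read the region in 512-byte sectors would appear shifted by a factor of 8. This is purely a hypothesis sketch of the observed arithmetic, not something confirmed against the ocfs2 source:

```shell
# Hypothetical model: apparent slot index = node number scaled by
# (physical sector size / 512) on nodes that see 4k sectors.
for node in 1 2 8; do
  echo "node $node -> apparent slot $((node * 4096 / 512))"
done
```

This reproduces exactly the observed 8:1 and 16:2 entries, and the 64:8 entry after the swap, while the VMs (nodes 3 - 7), which see 512-byte virtual disks, stay at their own node numbers.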
[Ocfs2-users] Node 8 doesn't mount / Wrong slot map assignment?
Hi all,

we've a small (?) problem with a 2-node cluster on Debian 8:

Linux h1b 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u2 (2017-06-26) x86_64 GNU/Linux
ocfs2-tools 1.6.4-3

Two ocfs2 filesystems (drbd0, 600 GB w/ 8 slots, and drbd1, 6 TB w/ 6 slots) are created on top of drbd w/ 4k block and cluster size, 'max_features' enabled. cluster.conf assigns sequential node numbers 1 - 8. Nodes 1, 2 are the hypervisors. Nodes 3, 4, 5 are VMs on node 1. Nodes 6, 7, 8 are the corresponding VMs on node 2. The VMs all run Debian 8 as well:

Linux srv2 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1 (2016-12-30) x86_64 GNU/Linux

When mounting drbd0 in order of increasing node numbers and concurrently watching the 'hb' output from debugfs.ocfs2, we get a clean slot map (?):

hb
node: node  seq       generation        checksum
1:    1     59b8d94a  fa60f0d8423590d9  edec9643
2:    2     59b8d94c  aca059df4670f467  994e3458
3:    3     59b8d949  f03dc9ba8f27582c  d4473fc2
4:    4     59b8d94b  df5bbdb756e757f8  12a198eb
5:    5     59b8d94a  1af81d94a7cb681b  91fba906
6:    6     59b8d94b  104538f30cdb35fa  8713e798
7:    7     59b8d94b  195658c9fb8ca7f9  5e54edf6
8:    8     59b8d949  dc6bfb46b9cf1ac3  de7a8757

Device drbd1, in contrast, yields the following table after mounting on nodes 1, 2:

hb
node: node  seq       generation        checksum
8:    1     59b8d9ba  73a63eb550a33095  f4e074d1
16:   2     59b8d9b9  5c7504c05637983e  07d696ec

Proceeding with the drbd1 mounts on nodes 3, 5, 6 leads us to:

hb
node: node  seq       generation        checksum
3:    3     59b8da3b  9443b4b209b16175  f2cc87ec
5:    5     59b8da3c  4b742f709377466f  3ac41cf3
6:    6     59b8da3b  d96e2de0a55514f6  335a4d90
8:    1     59b8da3c  73a63eb550a33095  2312c1c4
16:   2     59b8da3d  5c7504c05637983e  659571a1

The problem arises when trying to mount node 8, since its slot is already occupied by node 1:

kern.log node 1:

(o2hb-0AEE381A14,50990,4):o2hb_check_own_slot:582 ERROR: Another node is heartbeating on device (drbd1): expected(1:0x18acf7b0b3e5544c, 0x59b8445c), ondisk(8:0xb91302db72a65364, 0x59b8445b)

kern.log node 8:

ocfs2: Mounting device (254,16) on (node 8, slot 7) with ordered data mode.
(o2hb-0AEE381A14,518,1):o2hb_check_own_slot:582 ERROR: Another node is heartbeating on device (vdc): expected(8:0x18acf7b0b3e5544c, 0x59b8445c), ondisk(1:0x18acf7b0b3e5544c, 0x59b8445c)

This can be "fixed" by exchanging node numbers 1 and 8 in cluster.conf. Then node 8 will be assigned slot 8, node 2 stays in slot 16, and nodes 3 to 7 land as expected. There is no node 16 configured, so there's no conflict.

But since we experience some other, so far unexplained instabilities with this ocfs2 device / system during operation further down the road, we decided to take care of and try to fix this issue first. Somehow the failure is reminiscent of a bit shift or masking problem:

1 << 3 = 8
2 << 3 = 16

But then again - what do I know ...

Tried so far:

A. Create the offending file system with 8 slots instead of 6 -> same issue.
B. Set features to 'default' (disables feature 'extended-slotmap') -> same issue.

We'd very much appreciate any comments on this. Has anything similar ever been experienced before? Are we completely missing something important here? If there's a fix already out for this, any pointers (src files / commits) to where to look would be greatly appreciated.

Thanks in advance + Best regards ... Michael U.

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users