Re: [Linux-cluster] GFS2 filesystem consistency error

2016-02-24 Thread Daniel Dehennin
Bob Peterson  writes:

> Hi Daniel,

Hello,


> I took a look at that metadata you sent me, but I didn't find any evidence
> relating to the problem you posted. Either the corruption happened a long
> time prior to your saving of the metadata, or else the metadata was saved
> after an fsck.gfs2 fixed (or attempted to fix) the problem?

- when I first encountered the problem, I ran an fsck on the filesystem
  with version 3.1.6 from Ubuntu.

- several days later, the same “dirty_inode: glock -5” messages
  started showing up on the same node as the first time.

- I ran an fsck with 3.1.8 built from git.

- a few days later, the same node had the “dirty_inode” messages again; I
  shut down that node and then ran “gfs2_edit savemeta”.

All nodes have the same hardware and OS/kernel/pacemaker versions.

> One thing's for sure: I don't see any evidence of wild file system corruption;
> certainly nothing that can account for those errors.
>
> You said the problem seemed to revolve around a gfs2_grow operation,
> right?

Not exactly. I live-grew the fs 6 months ago and ran into some
trouble; I did an fsck at that time and the fs ran fine for months.

Then the “dirty_inode” troubles started on Feb 9.

> Can you make sure the lvm2 volume group has the clustered bit set?
> Please do the "vgs" command and see if that volume has "c" listed in its
> flags. If not, it could have caused problems for the gfs2_grow.

Yes, it has the clustered flag set.
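
(For reference, this is roughly how I check it; the VG name below is only an
example, not our real one:)

  # the 6th character of the VG attributes should be "c" (clustered)
  vgs -o vg_name,vg_attr
    VG       Attr
    vg-gfs2  wz--nc
  # if it were missing, it could be set (with clvmd running) via:
  vgchange -cy vg-gfs2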

> I've seen problems like this very rarely. Once was a legitimate bug in
> GFS2 that we fixed in RHEL5, but I assume your kernel is newer than
> that.

We have 3.13.0-78-generic from Ubuntu.


[...]

> My only working theory is this:
>
> This might be related to the transition between "unlinked" dinodes and
> "free". After a file is deleted, it goes to "unlinked" and has to be
> transitioned to "free". This sometimes goes wrong because of the way
> it needs to check what other nodes in the cluster are doing.
>
> Maybe: If you have three nodes, and a file was unlinked on node 1, then
> maybe the internode communication got confused and nodes 2 and 3 both
> tried to transition it from Unlinked to Free. That is only a theory, and
> there is absolutely no proof. However, I have a set of patches that are
> experimental, and not even in the upstream kernel yet (hopefully soon!)
> that try to tighten up and fix problems like this. It's much more common
> for multiple nodes to try to transition from Unlinked to Free, and they
> all fail, leaving the file in an "Unlinked" state.

Thanks for the explanations. I'll try to re-add the down node to the
cluster and see what happens.

Regards.
-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF


-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] GFS2 filesystem consistency error

2016-02-24 Thread Shreekant Jena
Hi,
I am having a problem with a two-node cluster. The secondary node shows as
offline after a reboot and CMAN is not starting.
Below are the logs from the offline node:

[root@EI51SPM1 cluster]# clustat
msg_open: Invalid argument
Member Status: Inquorate

Resource Group Manager not running; no service information available.

Membership information not available
[root@EI51SPM1 cluster]# tail -10 /var/log/messages
Feb 24 13:36:23 EI51SPM1 ccsd[25487]: Error while processing connect:
Connection refused
Feb 24 13:36:23 EI51SPM1 kernel: CMAN: sending membership request
Feb 24 13:36:27 EI51SPM1 ccsd[25487]: Cluster is not quorate.  Refusing
connection.
Feb 24 13:36:27 EI51SPM1 ccsd[25487]: Error while processing connect:
Connection refused
Feb 24 13:36:28 EI51SPM1 kernel: CMAN: sending membership request
Feb 24 13:36:32 EI51SPM1 ccsd[25487]: Cluster is not quorate.  Refusing
connection.
Feb 24 13:36:32 EI51SPM1 ccsd[25487]: Error while processing connect:
Connection refused
Feb 24 13:36:32 EI51SPM1 ccsd[25487]: Cluster is not quorate.  Refusing
connection.
Feb 24 13:36:32 EI51SPM1 ccsd[25487]: Error while processing connect:
Connection refused
Feb 24 13:36:33 EI51SPM1 kernel: CMAN: sending membership request
[root@EI51SPM1 cluster]#
[root@EI51SPM1 cluster]# cman_tool status
Protocol version: 5.0.1
Config version: 166
Cluster name: IVRS_DB
Cluster ID: 9982
Cluster Member: No
Membership state: Joining
[root@EI51SPM1 cluster]# cman_tool nodes
Node  Votes Exp Sts  Name
[root@EI51SPM1 cluster]#
[root@EI51SPM1 cluster]#
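
I am not sure whether it matters, but this is what I plan to check next
(paths are the defaults; adjust for your setup):

# the cluster.conf config_version must match on both nodes; a mismatch is a
# common reason a node keeps sending membership requests without joining
cman_tool version
grep config_version /etc/cluster/cluster.conf
# a two-node cluster normally needs two_node="1" expected_votes="1" on the
# <cman> element so that a single member can be quorate
grep two_node /etc/cluster/cluster.conf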



Thanks & Regards,
Shreekanta Jena


On Tue, Feb 23, 2016 at 11:30 PM, Bob Peterson  wrote:

> - Original Message -
> > Bob Peterson  writes:
> >
> >
> > [...]
> >
> > > Hi Daniel,
> > >
> > > I'm downloading the metadata now. I'll let you know what I find.
> > > It may take a while because my storage is a bit in flux at the moment.
> >
> > Ok, thanks a lot for looking at our problems.
> >
> > Regards.
> > --
> > Daniel Dehennin
> > Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
> > Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF
>
> Hi Daniel,
>
> I took a look at that metadata you sent me, but I didn't find any evidence
> relating to the problem you posted. Either the corruption happened a long
> time prior to your saving of the metadata, or else the metadata was saved
> after an fsck.gfs2 fixed (or attempted to fix) the problem?
>
> One thing's for sure: I don't see any evidence of wild file system
> corruption;
> certainly nothing that can account for those errors.
>
> You said the problem seemed to revolve around a gfs2_grow operation, right?
> Can you make sure the lvm2 volume group has the clustered bit set?
> Please do the "vgs" command and see if that volume has "c" listed in its
> flags. If not, it could have caused problems for the gfs2_grow.
>
> I've seen problems like this very rarely. Once was a legitimate bug in
> GFS2 that we fixed in RHEL5, but I assume your kernel is newer than that.
> The other problem we weren't able to solve because there was no evidence
> of what went wrong.
>
> My only working theory is this:
>
> This might be related to the transition between "unlinked" dinodes and
> "free". After a file is deleted, it goes to "unlinked" and has to be
> transitioned to "free". This sometimes goes wrong because of the way
> it needs to check what other nodes in the cluster are doing.
>
> Maybe: If you have three nodes, and a file was unlinked on node 1, then
> maybe the internode communication got confused and nodes 2 and 3 both
> tried to transition it from Unlinked to Free. That is only a theory, and
> there is absolutely no proof. However, I have a set of patches that are
> experimental, and not even in the upstream kernel yet (hopefully soon!)
> that try to tighten up and fix problems like this. It's much more common
> for multiple nodes to try to transition from Unlinked to Free, and they
> all fail, leaving the file in an "Unlinked" state.
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems
>
> --
> Linux-cluster mailing list
> Linux-cluster@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] GFS2 filesystem consistency error

2016-02-23 Thread Bob Peterson
- Original Message -
> Bob Peterson  writes:
> 
> 
> [...]
> 
> > Hi Daniel,
> >
> > I'm downloading the metadata now. I'll let you know what I find.
> > It may take a while because my storage is a bit in flux at the moment.
> 
> Ok, thanks a lot for looking at our problems.
> 
> Regards.
> --
> Daniel Dehennin
> Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
> Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF

Hi Daniel,

I took a look at that metadata you sent me, but I didn't find any evidence
relating to the problem you posted. Either the corruption happened a long
time prior to your saving of the metadata, or else the metadata was saved
after an fsck.gfs2 fixed (or attempted to fix) the problem?

One thing's for sure: I don't see any evidence of wild file system corruption;
certainly nothing that can account for those errors.

You said the problem seemed to revolve around a gfs2_grow operation, right?
Can you make sure the lvm2 volume group has the clustered bit set?
Please do the "vgs" command and see if that volume has "c" listed in its
flags. If not, it could have caused problems for the gfs2_grow.

I've seen problems like this very rarely. Once was a legitimate bug in
GFS2 that we fixed in RHEL5, but I assume your kernel is newer than that.
The other problem we weren't able to solve because there was no evidence
of what went wrong.

My only working theory is this:

This might be related to the transition between "unlinked" dinodes and
"free". After a file is deleted, it goes to "unlinked" and has to be
transitioned to "free". This sometimes goes wrong because of the way
it needs to check what other nodes in the cluster are doing.

Maybe: If you have three nodes, and a file was unlinked on node 1, then
maybe the internode communication got confused and nodes 2 and 3 both
tried to transition it from Unlinked to Free. That is only a theory, and
there is absolutely no proof. However, I have a set of patches that are
experimental, and not even in the upstream kernel yet (hopefully soon!)
that try to tighten up and fix problems like this. It's much more common
for multiple nodes to try to transition from Unlinked to Free, and they
all fail, leaving the file in an "Unlinked" state.

Regards,

Bob Peterson
Red Hat File Systems

-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] GFS2 filesystem consistency error

2016-02-22 Thread Daniel Dehennin
Bob Peterson  writes:


[...]

> Hi Daniel,
>
> I'm downloading the metadata now. I'll let you know what I find.
> It may take a while because my storage is a bit in flux at the moment.

Ok, thanks a lot for looking at our problems.

Regards.
-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF


-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] GFS2 filesystem consistency error

2016-02-22 Thread Bob Peterson
- Original Message -
> Daniel Dehennin  writes:
> 
> 
> [...]
> 
> > I push everything on an HTTP server[1]:
> >
> > - gfs2-fsck.log is the output of “fsck.gfs2 -p ”
> >
> > - gfs2-fsck-forced.log is the output of “fsck.gfs2 -f -p ”
> >
> > - gfs2.meta.gz is the “gfs2_edit savemeta” file, with version 3.1.8 of
> >   gfs2-utils.
> 
> I just fixed the perms on the .gz file, sorry.
> --
> Daniel Dehennin
> Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
> Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF
> 
> --
> Linux-cluster mailing list
> Linux-cluster@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

Hi Daniel,

I'm downloading the metadata now. I'll let you know what I find.
It may take a while because my storage is a bit in flux at the moment.

Regards,

Bob Peterson
Red Hat File Systems

-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] GFS2 filesystem consistency error

2016-02-21 Thread Daniel Dehennin
Daniel Dehennin  writes:


[...]

> I push everything on an HTTP server[1]:
>
> - gfs2-fsck.log is the output of “fsck.gfs2 -p ”
>
> - gfs2-fsck-forced.log is the output of “fsck.gfs2 -f -p ”
>
> - gfs2.meta.gz is the “gfs2_edit savemeta” file, with version 3.1.8 of
>   gfs2-utils.

I just fixed the perms on the .gz file, sorry.
-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF


-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] GFS2 filesystem consistency error

2016-02-20 Thread Daniel Dehennin
Daniel Dehennin  writes:


[...]

> I preferred to do the fsck on the filesystem, two times[1], instead of
> the “gfs2_edit savemeta”:
>
> 1. “fsck.gfs2 -p ” was quick
> 2. “fsck.gfs2 -f -p ” took 4 hours

[...]

> Footnotes: 
> [1]  The logs are attached to this email

I forgot the files, but attachments don't seem to get through anyway.

I pushed everything to an HTTP server[1]:

- gfs2-fsck.log is the output of “fsck.gfs2 -p ”

- gfs2-fsck-forced.log is the output of “fsck.gfs2 -f -p ”

- gfs2.meta.gz is the “gfs2_edit savemeta” file, made with version 3.1.8 of
  gfs2-utils.

Regards.

Footnotes: 
[1]  http://eole.ac-dijon.fr/pub/.gfs2/

-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF


-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] GFS2 filesystem consistency error

2016-02-19 Thread Daniel Dehennin
Daniel Dehennin  writes:

> Thanks, I'm using 3.1.6.
>
> Tonight I'll build the version 3.1.8 from Git[1] and run “fsck.gfs2 -p” on 
> the fs.

Hello,

I preferred to run the fsck on the filesystem twice[1], instead of
the “gfs2_edit savemeta”:

1. “fsck.gfs2 -p ” was quick
2. “fsck.gfs2 -f -p ” took 4 hours
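
(For completeness, the rough sequence used; the paths below are placeholders,
not our real ones:)

# fsck.gfs2 must run with the filesystem unmounted on every node
umount /mnt/datastores           # on each node, or stop the cluster resource
fsck.gfs2 -p /dev/<vg>/<lv>      # quick pass
fsck.gfs2 -f -p /dev/<vg>/<lv>   # forced full check, ~4 hours here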

The cluster was brought back up afterwards and everything was working fine
until yesterday:

Feb 18 19:13:22 nebula3 kernel: [293848.682606] GFS2: buf_blk = 0x2089 
old_state=0, new_state=0
Feb 18 19:13:22 nebula3 kernel: [293848.682612] GFS2: rgrp=0xc0c5667 
bi_start=0x0
Feb 18 19:13:22 nebula3 kernel: [293848.682614] GFS2: bi_offset=0x80 
bi_len=0xf80
Feb 18 19:13:22 nebula3 kernel: [293848.682619] CPU: 6 PID: 7057 Comm: 
kworker/6:8 Tainted: GW 3.13.0-78-generic #122-Ubuntu
Feb 18 19:13:22 nebula3 kernel: [293848.682621] Hardware name: Dell Inc. 
PowerEdge M620/0T36VK, BIOS 2.2.7 01/21/2014
Feb 18 19:13:22 nebula3 kernel: [293848.682637] Workqueue: delete_workqueue 
delete_work_func [gfs2]
Feb 18 19:13:22 nebula3 kernel: [293848.682640]  0c0c7705 
8811256c59d8 81725768 0c0c76f6
Feb 18 19:13:22 nebula3 kernel: [293848.682648]  8811256c5a30 
a05bebbf 880f5ffe9200 a05c5977
Feb 18 19:13:22 nebula3 kernel: [293848.682653]  880f1ee574c8 
2089 882e8c622000 0010
Feb 18 19:13:22 nebula3 kernel: [293848.682658] Call Trace:
Feb 18 19:13:22 nebula3 kernel: [293848.682668]  [] 
dump_stack+0x45/0x56
Feb 18 19:13:22 nebula3 kernel: [293848.682681]  [] 
rgblk_free+0x1ff/0x230 [gfs2]
Feb 18 19:13:22 nebula3 kernel: [293848.682693]  [] 
__gfs2_free_blocks+0x34/0x120 [gfs2]
Feb 18 19:13:22 nebula3 kernel: [293848.682700]  [] 
recursive_scan+0x5b6/0x6a0 [gfs2]
Feb 18 19:13:22 nebula3 kernel: [293848.682707]  [] 
recursive_scan+0x46c/0x6a0 [gfs2]
Feb 18 19:13:22 nebula3 kernel: [293848.682714]  [] ? 
submit_bio+0x71/0x150
Feb 18 19:13:22 nebula3 kernel: [293848.682720]  [] ? 
bio_alloc_bioset+0x196/0x2a0
Feb 18 19:13:22 nebula3 kernel: [293848.682727]  [] ? 
_submit_bh+0x150/0x200
Feb 18 19:13:22 nebula3 kernel: [293848.682734]  [] 
recursive_scan+0x46c/0x6a0 [gfs2]
Feb 18 19:13:22 nebula3 kernel: [293848.682744]  [] ? 
gfs2_quota_hold+0x175/0x1f0 [gfs2]
Feb 18 19:13:22 nebula3 kernel: [293848.682752]  [] 
trunc_dealloc+0xfa/0x120 [gfs2]
Feb 18 19:13:22 nebula3 kernel: [293848.682760]  [] ? 
gfs2_glock_wait+0x3e/0x80 [gfs2]
Feb 18 19:13:22 nebula3 kernel: [293848.682769]  [] ? 
gfs2_glock_nq+0x280/0x430 [gfs2]
Feb 18 19:13:22 nebula3 kernel: [293848.682777]  [] 
gfs2_file_dealloc+0x10/0x20 [gfs2]
Feb 18 19:13:22 nebula3 kernel: [293848.682787]  [] 
gfs2_evict_inode+0x2b3/0x3e0 [gfs2]
Feb 18 19:13:22 nebula3 kernel: [293848.682796]  [] ? 
gfs2_evict_inode+0x113/0x3e0 [gfs2]
Feb 18 19:13:22 nebula3 kernel: [293848.682802]  [] 
evict+0xb0/0x1b0
Feb 18 19:13:22 nebula3 kernel: [293848.682807]  [] 
iput+0xf5/0x180
Feb 18 19:13:22 nebula3 kernel: [293848.682815]  [] 
delete_work_func+0x5c/0x90 [gfs2]
Feb 18 19:13:22 nebula3 kernel: [293848.682822]  [] 
process_one_work+0x182/0x450
Feb 18 19:13:22 nebula3 kernel: [293848.682827]  [] 
worker_thread+0x121/0x410
Feb 18 19:13:22 nebula3 kernel: [293848.682832]  [] ? 
rescuer_thread+0x430/0x430
Feb 18 19:13:22 nebula3 kernel: [293848.682837]  [] 
kthread+0xd2/0xf0
Feb 18 19:13:22 nebula3 kernel: [293848.682841]  [] ? 
kthread_create_on_node+0x1c0/0x1c0
Feb 18 19:13:22 nebula3 kernel: [293848.682846]  [] 
ret_from_fork+0x58/0x90
Feb 18 19:13:22 nebula3 kernel: [293848.682850]  [] ? 
kthread_create_on_node+0x1c0/0x1c0
Feb 18 19:13:22 nebula3 kernel: [293848.682855] GFS2: 
fsid=yggdrasil:datastores.1: fatal: filesystem consistency error
Feb 18 19:13:22 nebula3 kernel: [293848.682855] GFS2: 
fsid=yggdrasil:datastores.1:   RG = 202135143
Feb 18 19:13:22 nebula3 kernel: [293848.682855] GFS2: 
fsid=yggdrasil:datastores.1:   function = gfs2_setbit, file = 
/build/linux-OTIHGI/linux-3.13.0/fs/gfs2/rgrp.c, line = 103
Feb 18 19:13:22 nebula3 kernel: [293848.682859] GFS2: 
fsid=yggdrasil:datastores.1: about to withdraw this file system
Feb 18 19:13:22 nebula3 kernel: [293848.699050] GFS2: 
fsid=yggdrasil:datastores.1: dirty_inode: glock -5
Feb 18 19:13:22 nebula3 kernel: [293848.705401] GFS2: 
fsid=yggdrasil:datastores.1: dirty_inode: glock -5

Now, the “always faulty node” is down and I'm doing the “gfs2_edit savemeta” 
from the other node.

I'm wondering whether I should upgrade the kernels to something much newer
than 3.13.0.

Ubuntu Trusty has proposed kernels available up to 4.2.0.

Regards.

Footnotes: 
[1]  The logs are attached to this email

-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF

-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] GFS2 filesystem consistency error

2016-02-15 Thread Daniel Dehennin
Bob Peterson  writes:


[...]

> It sounds like the resulting metadata file will be too big to email, so
> you'll need to find a suitable server to put it on, so I can grab it.

Sure, for now I'm at “3604476 inodes processed, 272407 blocks saved (0%)
processed” with a meta file of 127MB.

My fs has 3TB used; I hope the meta file won't fill up my /home (2.5GB) ;-)
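
(For reference, what I'm running, with example paths:)

# save the metadata somewhere with enough free space, then compress it
# for transfer if savemeta has not already done so
gfs2_edit savemeta /dev/<vg>/<lv> /data/gfs2.meta
gzip /data/gfs2.meta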

Regards.
-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF


-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] GFS2 filesystem consistency error

2016-02-15 Thread Daniel Dehennin
Andrew Price  writes:


[...]

> I suspect you're using an older version of gfs2-utils that doesn't
> contain this patch:
>
> https://git.fedorahosted.org/cgit/gfs2-utils.git/commit/?id=45b761f6
>
> It's only a cosmetic "multiply by the fs block size before printing
> the value" step that's missing, so nothing to worry about.
>
> The patch was included in gfs2-utils 3.1.7.

Thanks, I'm using 3.1.6.

Tonight I'll build version 3.1.8 from Git[1] and run “fsck.gfs2 -p” on the
fs.
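
(Roughly the build steps I plan to use; gfs2-utils is autotools-based, and [1]
points at the tag I'll check out:)

# clone the gfs2-utils repository and check out the 3.1.8 tag (see [1])
git checkout 3.1.8
./autogen.sh && ./configure && make
# then run the freshly built fsck.gfs2 against the filesystem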

Regards.

Footnotes: 
[1]  https://git.fedorahosted.org/cgit/gfs2-utils.git/tag/?h=3.1.8

-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF


-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] GFS2 filesystem consistency error

2016-02-15 Thread Andrew Price

On 15/02/16 14:26, Daniel Dehennin wrote:

But the start of the “gfs2_edit savemeta” command looks strange to me:

 There are 1073479680 blocks of 4096 bytes in the destination device.
 Reading resource groups...Done. File system size: 1023.734M

Is it saying my FS is 1TB instead of the real 4TB?


I suspect you're using an older version of gfs2-utils that doesn't 
contain this patch:


https://git.fedorahosted.org/cgit/gfs2-utils.git/commit/?id=45b761f6

It's only a cosmetic "multiply by the fs block size before printing the 
value" step that's missing, so nothing to worry about.


The patch was included in gfs2-utils 3.1.7.

Cheers,
Andy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] GFS2 filesystem consistency error

2016-02-15 Thread Bob Peterson
- Original Message -
> It's running but looks like it will take a long time and produce a huge
> file.
> 
> But the start of the “gfs2_edit savemeta” command looks strange to me:
> 
> There are 1073479680 blocks of 4096 bytes in the destination device.
> Reading resource groups...Done. File system size: 1023.734M
> 
> Is it saying my FS is 1TB instead of the real 4TB?

Hi Daniel,

It sounds like the resulting metadata file will be too big to email, so
you'll need to find a suitable server to put it on, so I can grab it.

No, a 4TB device looks about right: 1073479680 * 4096.
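
(Spelling out the arithmetic:)

echo $((1073479680 * 4096))
4396972769280
# ~= 4.0 TiB, so the reported device size is correct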

Regards,

Bob Peterson
Red Hat File Systems

-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] GFS2 filesystem consistency error

2016-02-15 Thread Daniel Dehennin
Bob Peterson  writes:


[...]

> As Steve mentioned, this means it tried to free a block that was already free.
> The problem might be due to timing issues with regard to changing blocks
> from "unlinked" state to "free" state. I have a number of patches required
> to fix this, but some of them are not even in the upstream kernel yet.
> And there is no guarantee this is the cause or solution for your
> problem.

Ok, I understand.

> As for fixing the file system: Newer versions of fsck.gfs2 should be able to
> fix the file system. If it doesn't fix the file system, perhaps I could
> get a copy of your file system metadata (via "gfs2_edit savemeta") and I
> can see where it's failing. There are some known problems with fsck.gfs2 not
> being able to correctly repair file systems that have been grown with 
> gfs2_grow.
> I've got fixes for that too, but it is all experimental code and none have 
> gone
> upstream yet. Your metadata might help me test it. :)

It's running but looks like it will take a long time and produce a huge
file.

But the start of the “gfs2_edit savemeta” command looks strange to me:

There are 1073479680 blocks of 4096 bytes in the destination device.
Reading resource groups...Done. File system size: 1023.734M

Is it saying my FS is 1TB instead of the real 4TB?

Regards.
-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF


-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] GFS2 filesystem consistency error

2016-02-15 Thread Daniel Dehennin
Bob Peterson  writes:

> - Original Message -
>> Daniel Dehennin  writes:
>> 
> (snip)
>> Now the kernel gave me a warning, if it could help:

[...]

> This call trace may safely be ignored. This is a known problem and is
> harmless. This is documented as a very old bug record. I don't know if
> it's accessible externally, so you may not be able to read it:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=790188
>
> It does not relate to the previous error you posted.

Thanks for reassuring me; I can't read the bug but I'm trusting you ;-)

Regards.
-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF


-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] GFS2 filesystem consistency error

2016-02-15 Thread Bob Peterson
- Original Message -
> Hello,
> 
> We have been running into trouble for several days on our GFS2 (log attached):
> 
> - we ran the FS for some time without trouble (since 2014-11-03)
> 
> - the FS was grown from 3TB to 4TB nearly 6 months ago
> 
> - it seems to happen only on one node, “nebula3”
> 
> - I ran an FSCK when just fencing the node was not sufficient (2 crashes
>   the same day)
> 
> The nodes run up-to-date Ubuntu Trusty Tahr.
> 
> Do you have any idea?
> 
> Regards.
> 
> --
> Daniel Dehennin
> Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
> Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF

Hi Daniel,

As Steve mentioned, this means it tried to free a block that was already free.
The problem might be due to timing issues with regard to changing blocks
from "unlinked" state to "free" state. I have a number of patches required
to fix this, but some of them are not even in the upstream kernel yet.
And there is no guarantee this is the cause or solution for your problem.

As for fixing the file system: Newer versions of fsck.gfs2 should be able to
fix the file system. If it doesn't fix the file system, perhaps I could
get a copy of your file system metadata (via "gfs2_edit savemeta") and I
can see where it's failing. There are some known problems with fsck.gfs2 not
being able to correctly repair file systems that have been grown with gfs2_grow.
I've got fixes for that too, but it is all experimental code and none have gone
upstream yet. Your metadata might help me test it. :)

Regards,

Bob Peterson
Red Hat File Systems

-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] GFS2 filesystem consistency error

2016-02-15 Thread Bob Peterson
- Original Message -
> Daniel Dehennin  writes:
> 
(snip)
> Now the kernel gave me a warning, if it could help:
> 
> Feb 15 14:13:07 nebula3 kernel: [16423.261927] [ cut here
> ]
> Feb 15 14:13:07 nebula3 kernel: [16423.261943] WARNING: CPU: 8 PID: 4410 at
> /build/linux-OTIHGI/linux-3.13.0/mm/page_alloc.c:1604
> get_page_from_freelist+0x924/0x930()
> Feb 15 14:13:07 nebula3 kernel: [16423.261945] Modules linked in: vhost_net
> vhost macvtap macvlan gfs2 dlm sctp configfs ip6table_filter ip6_tables
> iptable_filter ip_tables x_tables dm_round_robin openvswitch gre vxlan
> ip_tunnel nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache bonding
> x86_pkg_temp_thermal intel_powerclamp ipmi_devintf gpio_ich coretemp dcdbas
> kvm_intel kvm crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw
> gf128mul glue_helper ablk_helper cryptd dm_multipath joydev scsi_dh mei_me
> shpchp mei sb_edac ipmi_si edac_core lpc_ich acpi_power_meter mac_hid wmi
> iTCO_wdt iTCO_vendor_support ses enclosure hid_generic qla2xxx usbhid hid
> ahci scsi_transport_fc libahci bnx2x tg3 megaraid_sas ptp scsi_tgt pps_core
> mdio libcrc32c
> Feb 15 14:13:07 nebula3 kernel: [16423.262017] CPU: 8 PID: 4410 Comm: rm Not
> tainted 3.13.0-78-generic #122-Ubuntu
> Feb 15 14:13:07 nebula3 kernel: [16423.262019] Hardware name: Dell Inc.
> PowerEdge M620/0T36VK, BIOS 2.2.7 01/21/2014
> Feb 15 14:13:07 nebula3 kernel: [16423.262022]  0009
> 882e5f9f7820 81725768 
> Feb 15 14:13:07 nebula3 kernel: [16423.262028]  882e5f9f7858
> 810678bd 0004 35de
> Feb 15 14:13:07 nebula3 kernel: [16423.262033]  0001
> 88187fffbf00  882e5f9f7868
> Feb 15 14:13:07 nebula3 kernel: [16423.262037] Call Trace:
> Feb 15 14:13:07 nebula3 kernel: [16423.262046]  []
> dump_stack+0x45/0x56
> Feb 15 14:13:07 nebula3 kernel: [16423.262052]  []
> warn_slowpath_common+0x7d/0xa0
> Feb 15 14:13:07 nebula3 kernel: [16423.262056]  []
> warn_slowpath_null+0x1a/0x20
> Feb 15 14:13:07 nebula3 kernel: [16423.262060]  []
> get_page_from_freelist+0x924/0x930
> Feb 15 14:13:07 nebula3 kernel: [16423.262091]  [] ?
> __switch_to+0x3fe/0x4d0
> Feb 15 14:13:07 nebula3 kernel: [16423.262096]  []
> __alloc_pages_nodemask+0x184/0xb80
> Feb 15 14:13:07 nebula3 kernel: [16423.262102]  [] ?
> find_get_page+0x1e/0xa0
> Feb 15 14:13:07 nebula3 kernel: [16423.262111]  [] ?
> find_lock_page+0x30/0x70
> Feb 15 14:13:07 nebula3 kernel: [16423.262115]  [] ?
> find_or_create_page+0x34/0x90
> Feb 15 14:13:07 nebula3 kernel: [16423.262125]  [] ?
> radix_tree_lookup_slot+0xe/0x10
> Feb 15 14:13:07 nebula3 kernel: [16423.262134]  []
> alloc_pages_current+0xa3/0x160
> Feb 15 14:13:07 nebula3 kernel: [16423.262144]  []
> __get_free_pages+0xe/0x50
> Feb 15 14:13:07 nebula3 kernel: [16423.262157]  []
> kmalloc_order_trace+0x2e/0xa0
> Feb 15 14:13:07 nebula3 kernel: [16423.262170]  [] ?
> wake_up_bit+0x25/0x30
> Feb 15 14:13:07 nebula3 kernel: [16423.262177]  []
> __kmalloc+0x211/0x230
> Feb 15 14:13:07 nebula3 kernel: [16423.262192]  []
> gfs2_rlist_alloc+0x26/0x70 [gfs2]
> Feb 15 14:13:07 nebula3 kernel: [16423.262199]  []
> recursive_scan+0x29d/0x6a0 [gfs2]
> Feb 15 14:13:07 nebula3 kernel: [16423.262206]  []
> recursive_scan+0x46c/0x6a0 [gfs2]
> Feb 15 14:13:07 nebula3 kernel: [16423.262217]  [] ?
> gfs2_quota_hold+0x175/0x1f0 [gfs2]
> Feb 15 14:13:07 nebula3 kernel: [16423.262224]  []
> trunc_dealloc+0xfa/0x120 [gfs2]
> Feb 15 14:13:07 nebula3 kernel: [16423.262232]  [] ?
> gfs2_glock_wait+0x3e/0x80 [gfs2]
> Feb 15 14:13:07 nebula3 kernel: [16423.262240]  [] ?
> gfs2_glock_nq+0x280/0x430 [gfs2]
> Feb 15 14:13:07 nebula3 kernel: [16423.262247]  []
> gfs2_file_dealloc+0x10/0x20 [gfs2]
> Feb 15 14:13:07 nebula3 kernel: [16423.262257]  []
> gfs2_evict_inode+0x2b3/0x3e0 [gfs2]
> Feb 15 14:13:07 nebula3 kernel: [16423.262276]  [] ?
> gfs2_evict_inode+0x113/0x3e0 [gfs2]
> Feb 15 14:13:07 nebula3 kernel: [16423.262286]  []
> evict+0xb0/0x1b0
> Feb 15 14:13:07 nebula3 kernel: [16423.262290]  []
> iput+0xf5/0x180
> Feb 15 14:13:07 nebula3 kernel: [16423.262296]  []
> do_unlinkat+0x18e/0x2b0
> Feb 15 14:13:07 nebula3 kernel: [16423.262305]  [] ?
> filp_close+0x56/0x70
> Feb 15 14:13:07 nebula3 kernel: [16423.262310]  []
> SyS_unlinkat+0x1b/0x40
> Feb 15 14:13:07 nebula3 kernel: [16423.262315]  []
> system_call_fastpath+0x1a/0x1f
> Feb 15 14:13:07 nebula3 kernel: [16423.262318] ---[ end trace
> 346ccba5c58117dc ]---
> 
> Regards.
> --
> Daniel Dehennin
> Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
> Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF
> 
> --
> Linux-cluster mailing list
> Linux-cluster@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

Hi,

This call trace may safely be ignored. This is a known problem and is
harmless. This is documented as a very old bug record. I don't know if
it's accessible externally, so you may not be able to read it:

https://bugzilla.redhat.com/show_bug.cgi?id=790188

It does not relate to the previous error you posted.

Re: [Linux-cluster] GFS2 filesystem consistency error

2016-02-15 Thread Daniel Dehennin
Daniel Dehennin  writes:


[...]

> We are using 3.1.6-0ubuntu1.
>
> Running an fsck is quite expensive for us, 4 hours with the shared FS
> unusable.
>
> I forgot to say that it stores qcow2 images, so there should not be
> concurrency on the file system except on some directories to
> create/access sub directories:
>
> ///
>
> Only the  should have concurrent write
> accesses, everything under  is accessed only by one node at a
> time, except for monitoring which is read only.
>
> So “looks like it is trying to free a block that is already marked as
> being free” looks strange.

Now the kernel gave me a warning, in case it helps:

Feb 15 14:13:07 nebula3 kernel: [16423.261927] [ cut here 
]
Feb 15 14:13:07 nebula3 kernel: [16423.261943] WARNING: CPU: 8 PID: 4410 at 
/build/linux-OTIHGI/linux-3.13.0/mm/page_alloc.c:1604 
get_page_from_freelist+0x924/0x930()
Feb 15 14:13:07 nebula3 kernel: [16423.261945] Modules linked in: vhost_net 
vhost macvtap macvlan gfs2 dlm sctp configfs ip6table_filter ip6_tables 
iptable_filter ip_tables x_tables dm_round_robin openvswitch gre vxlan 
ip_tunnel nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache bonding 
x86_pkg_temp_thermal intel_powerclamp ipmi_devintf gpio_ich coretemp dcdbas 
kvm_intel kvm crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul 
glue_helper ablk_helper cryptd dm_multipath joydev scsi_dh mei_me shpchp mei 
sb_edac ipmi_si edac_core lpc_ich acpi_power_meter mac_hid wmi iTCO_wdt 
iTCO_vendor_support ses enclosure hid_generic qla2xxx usbhid hid ahci 
scsi_transport_fc libahci bnx2x tg3 megaraid_sas ptp scsi_tgt pps_core mdio 
libcrc32c
Feb 15 14:13:07 nebula3 kernel: [16423.262017] CPU: 8 PID: 4410 Comm: rm Not 
tainted 3.13.0-78-generic #122-Ubuntu
Feb 15 14:13:07 nebula3 kernel: [16423.262019] Hardware name: Dell Inc. 
PowerEdge M620/0T36VK, BIOS 2.2.7 01/21/2014
Feb 15 14:13:07 nebula3 kernel: [16423.262022]  0009 
882e5f9f7820 81725768 
Feb 15 14:13:07 nebula3 kernel: [16423.262028]  882e5f9f7858 
810678bd 0004 35de
Feb 15 14:13:07 nebula3 kernel: [16423.262033]  0001 
88187fffbf00  882e5f9f7868
Feb 15 14:13:07 nebula3 kernel: [16423.262037] Call Trace:
Feb 15 14:13:07 nebula3 kernel: [16423.262046]  [] 
dump_stack+0x45/0x56
Feb 15 14:13:07 nebula3 kernel: [16423.262052]  [] 
warn_slowpath_common+0x7d/0xa0
Feb 15 14:13:07 nebula3 kernel: [16423.262056]  [] 
warn_slowpath_null+0x1a/0x20
Feb 15 14:13:07 nebula3 kernel: [16423.262060]  [] 
get_page_from_freelist+0x924/0x930
Feb 15 14:13:07 nebula3 kernel: [16423.262091]  [] ? 
__switch_to+0x3fe/0x4d0
Feb 15 14:13:07 nebula3 kernel: [16423.262096]  [] 
__alloc_pages_nodemask+0x184/0xb80
Feb 15 14:13:07 nebula3 kernel: [16423.262102]  [] ? 
find_get_page+0x1e/0xa0
Feb 15 14:13:07 nebula3 kernel: [16423.262111]  [] ? 
find_lock_page+0x30/0x70
Feb 15 14:13:07 nebula3 kernel: [16423.262115]  [] ? 
find_or_create_page+0x34/0x90
Feb 15 14:13:07 nebula3 kernel: [16423.262125]  [] ? 
radix_tree_lookup_slot+0xe/0x10
Feb 15 14:13:07 nebula3 kernel: [16423.262134]  [] 
alloc_pages_current+0xa3/0x160
Feb 15 14:13:07 nebula3 kernel: [16423.262144]  [] 
__get_free_pages+0xe/0x50
Feb 15 14:13:07 nebula3 kernel: [16423.262157]  [] 
kmalloc_order_trace+0x2e/0xa0
Feb 15 14:13:07 nebula3 kernel: [16423.262170]  [] ? 
wake_up_bit+0x25/0x30
Feb 15 14:13:07 nebula3 kernel: [16423.262177]  [] 
__kmalloc+0x211/0x230
Feb 15 14:13:07 nebula3 kernel: [16423.262192]  [] 
gfs2_rlist_alloc+0x26/0x70 [gfs2]
Feb 15 14:13:07 nebula3 kernel: [16423.262199]  [] 
recursive_scan+0x29d/0x6a0 [gfs2]
Feb 15 14:13:07 nebula3 kernel: [16423.262206]  [] 
recursive_scan+0x46c/0x6a0 [gfs2]
Feb 15 14:13:07 nebula3 kernel: [16423.262217]  [] ? 
gfs2_quota_hold+0x175/0x1f0 [gfs2]
Feb 15 14:13:07 nebula3 kernel: [16423.262224]  [] 
trunc_dealloc+0xfa/0x120 [gfs2]
Feb 15 14:13:07 nebula3 kernel: [16423.262232]  [] ? 
gfs2_glock_wait+0x3e/0x80 [gfs2]
Feb 15 14:13:07 nebula3 kernel: [16423.262240]  [] ? 
gfs2_glock_nq+0x280/0x430 [gfs2]
Feb 15 14:13:07 nebula3 kernel: [16423.262247]  [] 
gfs2_file_dealloc+0x10/0x20 [gfs2]
Feb 15 14:13:07 nebula3 kernel: [16423.262257]  [] 
gfs2_evict_inode+0x2b3/0x3e0 [gfs2]
Feb 15 14:13:07 nebula3 kernel: [16423.262276]  [] ? 
gfs2_evict_inode+0x113/0x3e0 [gfs2]
Feb 15 14:13:07 nebula3 kernel: [16423.262286]  [] 
evict+0xb0/0x1b0
Feb 15 14:13:07 nebula3 kernel: [16423.262290]  [] 
iput+0xf5/0x180
Feb 15 14:13:07 nebula3 kernel: [16423.262296]  [] 
do_unlinkat+0x18e/0x2b0
Feb 15 14:13:07 nebula3 kernel: [16423.262305]  [] ? 
filp_close+0x56/0x70
Feb 15 14:13:07 nebula3 kernel: [16423.262310]  [] 
SyS_unlinkat+0x1b/0x40
Feb 15 14:13:07 nebula3 kernel: [16423.262315]  [] 
system_call_fastpath+0x1a/0x1f
Feb 15 14:13:07 nebula3 kernel: [16423.262318] ---[ end trace 346ccba5c58117dc 
]---

Regards.
-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF

Re: [Linux-cluster] GFS2 filesystem consistency error

2016-02-15 Thread Daniel Dehennin
Steven Whitehouse  writes:

> That looks like it is trying to free a block that is already marked as
> being free. fsck should fix that. What version of gfs2-utils are you
> using?

Hello,

We are using 3.1.6-0ubuntu1.

Running an fsck is quite expensive for us, 4 hours with the shared FS
unusable.

I forgot to say that it stores qcow2 images, so there should not be much
concurrency on the file system except in the directories used to
create/access subdirectories:

///

Only the  should see concurrent write
access; everything under  is accessed by only one node at a
time, except for monitoring, which is read-only.

So “looks like it is trying to free a block that is already marked as
being free” looks strange.

Regards.
-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF


-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] GFS2 filesystem consistency error

2016-02-15 Thread Steven Whitehouse

Hi,

On 15/02/16 09:20, Daniel Dehennin wrote:

Hello,

We have been running into trouble for several days on our GFS2 (log attached):

- we ran the FS for some time without trouble (since 2014-11-03)

- the FS was grown from 3TB to 4TB nearly 6 months ago

- it seems to happen only on one node, “nebula3”

- I ran an FSCK when just fencing the node was not sufficient (2 crashes
   the same day)

The nodes run up-to-date Ubuntu Trusty Tahr.

Do you have any idea?

Regards.


That looks like it is trying to free a block that is already marked as 
being free. fsck should fix that. What version of gfs2-utils are you using?
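
(A quick way to check, e.g. on a Debian/Ubuntu box:)

fsck.gfs2 -V
dpkg-query -W gfs2-utils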


Steve.


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster