Re: [Gluster-users] Conv distr-repl 2 to repl 3 arb 1 now 2 of 6 arb bricks won't get healed

2021-06-01 Thread a . schwibbe
I am on 7.9-ubuntu1~focal1 amd64


A. ;)


"Strahil Nikolov" hunter86...@yahoo.com – 1. Juni 2021 14:56
> Glad to hear that.
>
> What version are you using ?
> It's interesting to find out the reason behind that defunct status.
>
>
>
> Best Regards,
> Strahil Nikolov
>
> >
> > Strahil,
> >
> >
> >
> > I was able to resolve the issue!
> >
> >
> >
> > On node1 I found a defunct [glusterfsd] process.
> >
> > I first put the failing arbiters into reset-brick, then formatted them anew.
> >
> > I killed the zombie process on node1, then stopped glusterd gracefully, then
> > killed all remaining gluster* processes, then started glusterd again,
> > and finally re-added the arbiters with 'reset-brick ... commit force' -
> >
> > now the arbiters started and have been populated correctly. :)
> >
> >
> >
> > Thanks for the support!
> >
> > Best
> >
> >
> > A.
> >
> >
> >
> >
> >
> > "Strahil Nikolov" hunter86...@yahoo.com – 31. Mai 2021 21:03
> >
> > > I would avoid shrinking the volume. An oVirt user reported issues after 
> > > volume shrinking.
> >
> > >
> >
> > > Did you try to format the arbiter brick and 'replace-brick' ?
> >
> > >
> >
> > > Best Regards,
> >
> > > Strahil Nikolov
> >
> > > >
> >
> > > > I can't find anything suspicious in the brick logs other than 
> > > > authetication refused to clients trying to mount a dir that is not 
> > > > existing on the arb_n, because the self-heal isn't working.
> >
> > > >
> >
> > > > I tried to add another node and replace-brick a faulty arbiter, however 
> > > > this new arbiter sees the same error.
> >
> > > >
> >
> > > >
> >
> > > > Last idea is to completely remove first subvolume, then re-add as new 
> > > > hoping it will work.
> >
> > > >
> >
> > > >
> >
> > > >
> >
> > > > A.
> >
> > > >
> >
> > > >
> >
> > > >
> >
> > > > "a.schwi...@gmx.net" a.schwi...@gmx.net – 31. Mai 2021 13:44
> >
> > > >
> >
> > > > > Ok, will do.
> >
> > > >
> >
> > > > >
> >
> > > >
> >
> > > > >
> >
> > > >
> >
> > > > > working arbiter:
> >
> > > >
> >
> > > > >
> >
> > > >
> >
> > > > > ls -ln /var/bricks/arb_0/ >>> drwxr-xr-x 13 33 33 146 Mai 29 22:38 
> > > > > brick
> >
> > > >
> >
> > > > >
> >
> > > >
> >
> > > > > ls- lna /var/bricks/arb_0/brick >>> drw--- 262 0 0 8192 Mai 29 
> > > > > 22:38 .glusterfs
> >
> > > >
> >
> > > > > + all data-brick dirs ...
> >
> > > >
> >
> > > > >
> >
> > > >
> >
> > > > >
> >
> > > >
> >
> > > > > affected arbiter:
> >
> > > >
> >
> > > > >
> >
> > > >
> >
> > > > > ls -ln /var/bricks/arb_0/ >>> drwxr-xr-x 3 0 0 24 Mai 30 16:23 brick
> >
> > > >
> >
> > > > > ls -lna /var/bricks/arb_0/brick >>> drw--- 7 0 0 99 Mai 30 16:23 
> > > > > .glusterfs
> >
> > > >
> >
> > > > > nothing else here
> >
> > > >
> >
> > > > >
> >
> > > >
> >
> > > > >
> >
> > > >
> >
> > > > > find /var/bricks/arb_0/brick -not -user 33 -print
> >
> > > >
> >
> > > > >
> >
> > > >
> >
> > > > > /var/bricks/arb_0/brick/.glusterfs
> >
> > > >
> >
> > > > > /var/bricks/arb_0/brick/.glusterfs/indices
> >
> > > >
> >
> > > > > /var/bricks/arb_0/brick/.glusterfs/indices/xattrop
> >
> > > >
> >
> > > > > /var/bricks/arb_0/brick/.glusterfs/indices/dirty
> >
> > > >
> >
> > > > > /var/bricks/arb_0/brick/.glusterfs/indices/entry-changes
> >
> > > >
> >
> > > > > /var/bricks/arb_0/brick/.glusterfs/changelogs
> >
> > > >
> >
> > > > > /var/bricks/arb_0/brick/.glusterfs/changelogs/htime
> >
> > > >
> >
> > > > > /var/bricks/arb_0/brick/.glusterfs/changelogs/csnap
> >
> > > >
> >
> > > > > /var/bricks/arb_0/brick/.glusterfs/00
> >
> > > >
> >
> > > > > /var/bricks/arb_0/brick/.glusterfs/00/00
> >
> > > >
> >
> > > > > /var/bricks/arb_0/brick/.glusterfs/00/00/----0001
> >
> > > >
> >
> > > > > /var/bricks/arb_0/brick/.glusterfs/landfill
> >
> > > >
> >
> > > > > /var/bricks/arb_0/brick/.glusterfs/unlink
> >
> > > >
> >
> > > > > /var/bricks/arb_0/brick/.glusterfs/health_check
> >
> > > >
> >
> > > > >
> >
> > > >
> >
> > > > > find /var/bricks/arb_0/brick -not -user 33 -print
> >
> > > >
> >
> > > > >
> >
> > > >
> >
> > > > > /var/bricks/arb_0/brick/.glusterfs
> >
> > > >
> >
> > > > > /var/bricks/arb_0/brick/.glusterfs/indices
> >
> > > >
> >
> > > > > /var/bricks/arb_0/brick/.glusterfs/indices/xattrop
> >
> > > >
> >
> > > > > /var/bricks/arb_0/brick/.glusterfs/indices/dirty
> >
> > > >
> >
> > > > > /var/bricks/arb_0/brick/.glusterfs/indices/entry-changes
> >
> > > >
> >
> > > > > /var/bricks/arb_0/brick/.glusterfs/changelogs
> >
> > > >
> >
> > > > > /var/bricks/arb_0/brick/.glusterfs/changelogs/htime
> >
> > > >
> >
> > > > > /var/bricks/arb_0/brick/.glusterfs/changelogs/csnap
> >
> > > >
> >
> > > > > /var/bricks/arb_0/brick/.glusterfs/00
> >
> > > >
> >
> > > > > /var/bricks/arb_0/brick/.glusterfs/00/00
> >
> > > >
> >
> > > > > /var/bricks/arb_0/brick/.glusterfs/00/00/----0001
> >
> > > >
> >
> > > > > /var/bricks/arb_0/brick/.glusterfs/landfill
> >
> > > >
> >
> > > > > /var/bricks/arb_0/brick/.glusterfs/unlink
> >
> > > >
> >
> > > > 

Re: [Gluster-users] Conv distr-repl 2 to repl 3 arb 1 now 2 of 6 arb bricks won't get healed

2021-06-01 Thread a . schwibbe
Strahil,


I was able to resolve the issue!


On node1 I found a defunct [glusterfsd] process.
I first put the failing arbiters into reset-brick, then formatted them anew.

I killed the zombie process on node1, then stopped glusterd gracefully, then killed
all remaining gluster* processes, then started glusterd again,
and finally re-added the arbiters with 'reset-brick ... commit force' -

now the arbiters started and have been populated correctly. :)
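
For reference, a rough sketch of that reset-brick cycle (gv0 is the volume from this thread; <host>, arb_N and the md device are placeholders for the affected node):

gluster volume reset-brick gv0 <host>:/var/bricks/arb_N/brick start            # take the failing arbiter offline
umount /var/bricks/arb_N && mkfs.xfs -f /dev/mdNN && mount /var/bricks/arb_N   # recreate the brick filesystem
systemctl stop glusterd; pkill -f gluster; systemctl start glusterd            # clear leftover gluster processes on that node
gluster volume reset-brick gv0 <host>:/var/bricks/arb_N/brick <host>:/var/bricks/arb_N/brick commit force

After the commit force the self-heal daemon should start repopulating the arbiter.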


Thanks for the support!
Best

A.



"Strahil Nikolov" hunter86...@yahoo.com – 31. Mai 2021 21:03
> I would avoid shrinking the volume. An oVirt user reported issues after 
> volume shrinking.
>
> Did you try to format the arbiter brick and 'replace-brick' ?
>
> Best Regards,
> Strahil Nikolov
> >
> > I can't find anything suspicious in the brick logs other than authentication 
> > refused to clients trying to mount a dir that does not exist on the arb_n, 
> > because the self-heal isn't working.
> >
> > I tried to add another node and replace-brick a faulty arbiter, however 
> > this new arbiter sees the same error.
> >
> >
> > Last idea is to completely remove the first subvolume, then re-add it as new, 
> > hoping it will work.
> >
> >
> >
> > A.
> >
> >
> >
> > "a.schwi...@gmx.net" a.schwi...@gmx.net – 31. Mai 2021 13:44
> >
> > > Ok, will do.
> >
> > >
> >
> > >
> >
> > > working arbiter:
> >
> > >
> >
> > > ls -ln /var/bricks/arb_0/ >>> drwxr-xr-x 13 33 33 146 Mai 29 22:38 brick
> >
> > >
> >
> > > ls- lna /var/bricks/arb_0/brick >>> drw--- 262 0 0 8192 Mai 29 22:38 
> > > .glusterfs
> >
> > > + all data-brick dirs ...
> >
> > >
> >
> > >
> >
> > > affected arbiter:
> >
> > >
> >
> > > ls -ln /var/bricks/arb_0/ >>> drwxr-xr-x 3 0 0 24 Mai 30 16:23 brick
> >
> > > ls -lna /var/bricks/arb_0/brick >>> drw--- 7 0 0 99 Mai 30 16:23 
> > > .glusterfs
> >
> > > nothing else here
> >
> > >
> >
> > >
> >
> > > find /var/bricks/arb_0/brick -not -user 33 -print
> >
> > >
> >
> > > /var/bricks/arb_0/brick/.glusterfs
> >
> > > /var/bricks/arb_0/brick/.glusterfs/indices
> >
> > > /var/bricks/arb_0/brick/.glusterfs/indices/xattrop
> >
> > > /var/bricks/arb_0/brick/.glusterfs/indices/dirty
> >
> > > /var/bricks/arb_0/brick/.glusterfs/indices/entry-changes
> >
> > > /var/bricks/arb_0/brick/.glusterfs/changelogs
> >
> > > /var/bricks/arb_0/brick/.glusterfs/changelogs/htime
> >
> > > /var/bricks/arb_0/brick/.glusterfs/changelogs/csnap
> >
> > > /var/bricks/arb_0/brick/.glusterfs/00
> >
> > > /var/bricks/arb_0/brick/.glusterfs/00/00
> >
> > > /var/bricks/arb_0/brick/.glusterfs/00/00/----0001
> >
> > > /var/bricks/arb_0/brick/.glusterfs/landfill
> >
> > > /var/bricks/arb_0/brick/.glusterfs/unlink
> >
> > > /var/bricks/arb_0/brick/.glusterfs/health_check
> >
> > >
> >
> > > find /var/bricks/arb_0/brick -not -user 33 -print
> >
> > >
> >
> > > /var/bricks/arb_0/brick/.glusterfs
> >
> > > /var/bricks/arb_0/brick/.glusterfs/indices
> >
> > > /var/bricks/arb_0/brick/.glusterfs/indices/xattrop
> >
> > > /var/bricks/arb_0/brick/.glusterfs/indices/dirty
> >
> > > /var/bricks/arb_0/brick/.glusterfs/indices/entry-changes
> >
> > > /var/bricks/arb_0/brick/.glusterfs/changelogs
> >
> > > /var/bricks/arb_0/brick/.glusterfs/changelogs/htime
> >
> > > /var/bricks/arb_0/brick/.glusterfs/changelogs/csnap
> >
> > > /var/bricks/arb_0/brick/.glusterfs/00
> >
> > > /var/bricks/arb_0/brick/.glusterfs/00/00
> >
> > > /var/bricks/arb_0/brick/.glusterfs/00/00/----0001
> >
> > > /var/bricks/arb_0/brick/.glusterfs/landfill
> >
> > > /var/bricks/arb_0/brick/.glusterfs/unlink
> >
> > > /var/bricks/arb_0/brick/.glusterfs/health_check
> >
> > >
> >
> > > Output is identical to user:group 36 as all these have UID:GID 0:0, but 
> > > these files have 0:0 also on the working arbiters.
> >
> > > And this is all files/dirs that exist on the affected arbs. Nothing more 
> > > on it. There should be much more, but this seems to missing self heal.
> >
> > >
> >
> > > Thanks.
> >
> > >
> >
> > > A.
> >
> > >
> >
> > >
> >
> > > "Strahil Nikolov" hunter86...@yahoo.com – 31. Mai 2021 13:11
> >
> > > > Hi,
> >
> > > >
> >
> > > > I think that the best way is to go through the logs on the affected 
> > > > arbiter brick (maybe even temporarily increase the log level).
> >
> > > >
> >
> > > > What is the output of:
> >
> > > >
> >
> > > > find /var/brick/arb_0/brick -not -user 36 -print
> >
> > > > find /var/brick/arb_0/brick -not group 36 -print
> >
> > > >
> >
> > > > Maybe there are some files/dirs that are with wrong ownership.
> >
> > > >
> >
> > > > Best Regards,
> >
> > > > Strahil Nikolov
> >
> > > >
> >
> > > > >
> >
> > > > > Thanks Strahil,
> >
> > > > >
> >
> > > > > unfortunately I cannot connect as the mount is denied as in mount.log 
> > > > > provided.
> >
> > > > > IPs > n.n.n..100 are clients and simply cannot mount the volume. When 
> > > > > killing the arb pids on node2 new clients can mount the volume. When 
> > > > > bringing them up again I experience the same problem.
> >

Re: [Gluster-users] Conv distr-repl 2 to repl 3 arb 1 now 2 of 6 arb bricks won't get healed

2021-05-31 Thread Strahil Nikolov
I would avoid shrinking the volume. An oVirt user reported issues after volume 
shrinking.
Did you try to format the arbiter brick and 'replace-brick' ?
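
Roughly, with placeholder host/brick names and an empty, freshly formatted target path:

gluster volume replace-brick gv0 <host>:/var/bricks/arb_X/brick <host>:/var/bricks/arb_X_new/brick commit force
gluster volume heal gv0 full    # optionally trigger a full heal afterwards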
Best Regards,
Strahil Nikolov
I can't find anything suspicious in the brick logs other than authentication 
refused to clients trying to mount a dir that does not exist on the arb_n, 
because the self-heal isn't working.
I tried to add another node and replace-brick a faulty arbiter, however this 
new arbiter sees the same error.

Last idea is to completely remove the first subvolume, then re-add it as new, 
hoping it will work.


A.


"a.schwi...@gmx.net" a.schwi...@gmx.net – 31. Mai 2021 13:44
> Ok, will do.
>
>
> working arbiter:
>
> ls -ln /var/bricks/arb_0/ >>> drwxr-xr-x 13 33 33 146 Mai 29 22:38 brick
>
> ls -lna /var/bricks/arb_0/brick >>> drw--- 262 0 0 8192 Mai 29 22:38 
> .glusterfs
> + all data-brick dirs ...
>
>
> affected arbiter:
>
> ls -ln /var/bricks/arb_0/ >>> drwxr-xr-x 3 0 0 24 Mai 30 16:23 brick
> ls -lna /var/bricks/arb_0/brick >>> drw--- 7 0 0 99 Mai 30 16:23 
> .glusterfs
> nothing else here
>
>
> find /var/bricks/arb_0/brick -not -user 33 -print
>
> /var/bricks/arb_0/brick/.glusterfs
> /var/bricks/arb_0/brick/.glusterfs/indices
> /var/bricks/arb_0/brick/.glusterfs/indices/xattrop
> /var/bricks/arb_0/brick/.glusterfs/indices/dirty
> /var/bricks/arb_0/brick/.glusterfs/indices/entry-changes
> /var/bricks/arb_0/brick/.glusterfs/changelogs
> /var/bricks/arb_0/brick/.glusterfs/changelogs/htime
> /var/bricks/arb_0/brick/.glusterfs/changelogs/csnap
> /var/bricks/arb_0/brick/.glusterfs/00
> /var/bricks/arb_0/brick/.glusterfs/00/00
> /var/bricks/arb_0/brick/.glusterfs/00/00/----0001
> /var/bricks/arb_0/brick/.glusterfs/landfill
> /var/bricks/arb_0/brick/.glusterfs/unlink
> /var/bricks/arb_0/brick/.glusterfs/health_check
>
> find /var/bricks/arb_0/brick -not -user 33 -print
>
> /var/bricks/arb_0/brick/.glusterfs
> /var/bricks/arb_0/brick/.glusterfs/indices
> /var/bricks/arb_0/brick/.glusterfs/indices/xattrop
> /var/bricks/arb_0/brick/.glusterfs/indices/dirty
> /var/bricks/arb_0/brick/.glusterfs/indices/entry-changes
> /var/bricks/arb_0/brick/.glusterfs/changelogs
> /var/bricks/arb_0/brick/.glusterfs/changelogs/htime
> /var/bricks/arb_0/brick/.glusterfs/changelogs/csnap
> /var/bricks/arb_0/brick/.glusterfs/00
> /var/bricks/arb_0/brick/.glusterfs/00/00
> /var/bricks/arb_0/brick/.glusterfs/00/00/----0001
> /var/bricks/arb_0/brick/.glusterfs/landfill
> /var/bricks/arb_0/brick/.glusterfs/unlink
> /var/bricks/arb_0/brick/.glusterfs/health_check
>
> Output is identical for user and group 36, as all of these have UID:GID 0:0, but 
> these files have 0:0 on the working arbiters as well.
> And this is all files/dirs that exist on the affected arbs. Nothing more on 
> it. There should be much more, but this seems to be due to the missing self-heal.
>
> Thanks.
>
> A.
>
>
> "Strahil Nikolov" hunter86...@yahoo.com – 31. Mai 2021 13:11
> > Hi,
> >
> > I think that the best way is to go through the logs on the affected arbiter 
> > brick (maybe even temporarily increase the log level).
> >
> > What is the output of:
> >
> > find /var/brick/arb_0/brick -not -user 36 -print
> > find /var/brick/arb_0/brick -not -group 36 -print
> >
> > Maybe there are some files/dirs with wrong ownership.
> >
> > Best Regards,
> > Strahil Nikolov
> >
> > >
> > > Thanks Strahil,
> > >
> > > unfortunately I cannot connect as the mount is denied as in mount.log 
> > > provided.
> > > IPs > n.n.n..100 are clients and simply cannot mount the volume. When 
> > > killing the arb pids on node2 new clients can mount the volume. When 
> > > bringing them up again I experience the same problem.
> > >
> > > I wonder why the root dir on the arb bricks has wrong UID:GID.
> > > I added regular data bricks before without any problems on node2.
> > >
> > > Also when executing "watch df"
> > >
> > > I see
> > >
> > > /dev/md50 11700224 33108 11667116 1% /var/bricks/arb_0
> > > ..
> > >
> > > /dev/md50 11700224 33128 11667096 1% /var/bricks/arb_0
> > >
> > > ..
> > >
> > > /dev/md50 11700224 33108 11667116 1% /var/bricks/arb_0
> > >
> > > So heal daemon might try to do something, which isn't working. Thus I 
> > > chowned UID:GID of ../arb_0/brick manually to match, but it did not work 
> > > either.
> > >
> > > As I added all 6 arbs at once and 4 are working as expected I really 
> > > don't get what's wrong with these...
> > >
> > > A.
> > >
> > > "Strahil Nikolov" hunter86...@yahoo.com – 31. Mai 2021 11:12
> > > > For the arb_0 I see only 8 clients, while there should be 12 clients:
> > > > Brick : 192.168.0.40:/var/bricks/0/brick
> > > > Clients connected : 12
> > > >
> > > > Brick : 192.168.0.41:/var/bricks/0/brick
> > > > Clients connected : 12
> > > >
> > > > Brick : 192.168.0.80:/var/bricks/arb_0/brick
> > > > Clients connected : 8
> > > >
> > > > Can you try to reconnect them. The most simple way is to kill the 
> > > > arbiter process and 

Re: [Gluster-users] Conv distr-repl 2 to repl 3 arb 1 now 2 of 6 arb bricks won't get healed

2021-05-31 Thread Andreas Schwibbe
Hm,

I tried format and reset-brick on node2 - no success.
I tried new brick on new node3 and replace-brick - no success as the new 
arbiter is created wrongly and self-heal does not work.

I also restarted all nodes turn by turn without any improvement. 

If shrinking the volume is not recommended, is converting it back to replica 2 
possible and, if successful, worth another try?
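
If it came to that, dropping the arbiters again would presumably look something like this (one arbiter brick per subvolume, all six in one command; the bricks below are placeholders, to be checked against 'gluster volume info gv0'):

gluster volume remove-brick gv0 replica 2 \
    <host>:/var/bricks/arb_0/brick <host>:/var/bricks/arb_1/brick <...remaining four arbiter bricks...> force

and later re-adding them with 'gluster volume add-brick gv0 replica 3 arbiter 1 <six new arbiter bricks>'.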

Thanks
A.



31.05.2021 21:03:01 Strahil Nikolov :

> I would avoid shrinking the volume. An oVirt user reported issues after 
> volume shrinking.
> 
> Did you try to format the arbiter brick and 'replace-brick' ?
> 
> Best Regards,
> Strahil Nikolov
>> 
>> I can't find anything suspicious in the brick logs other than authentication 
>> refused to clients trying to mount a dir that does not exist on the arb_n, 
>> because the self-heal isn't working.
>> I tried to add another node and replace-brick a faulty arbiter, however this 
>> new arbiter sees the same error.
>> 
>> Last idea is to completely remove the first subvolume, then re-add it as new, 
>> hoping it will work.
>> 
>> 
>> A.
>> 
>> 
>> "a.schwi...@gmx.net" a.schwi...@gmx.net – 31. Mai 2021 13:44
>>> Ok, will do.
>>>
>>>
>>> working arbiter:
>>>
>>> ls -ln /var/bricks/arb_0/ >>> drwxr-xr-x 13 33 33 146 Mai 29 22:38 brick
>>>
>>> ls -lna /var/bricks/arb_0/brick >>> drw--- 262 0 0 8192 Mai 29 22:38 
>>> .glusterfs
>>> + all data-brick dirs ...
>>>
>>>
>>> affected arbiter:
>>>
>>> ls -ln /var/bricks/arb_0/ >>> drwxr-xr-x 3 0 0 24 Mai 30 16:23 brick
>>> ls -lna /var/bricks/arb_0/brick >>> drw--- 7 0 0 99 Mai 30 16:23 
>>> .glusterfs
>>> nothing else here
>>>
>>>
>>> find /var/bricks/arb_0/brick -not -user 33 -print
>>>
>>> /var/bricks/arb_0/brick/.glusterfs
>>> /var/bricks/arb_0/brick/.glusterfs/indices
>>> /var/bricks/arb_0/brick/.glusterfs/indices/xattrop
>>> /var/bricks/arb_0/brick/.glusterfs/indices/dirty
>>> /var/bricks/arb_0/brick/.glusterfs/indices/entry-changes
>>> /var/bricks/arb_0/brick/.glusterfs/changelogs
>>> /var/bricks/arb_0/brick/.glusterfs/changelogs/htime
>>> /var/bricks/arb_0/brick/.glusterfs/changelogs/csnap
>>> /var/bricks/arb_0/brick/.glusterfs/00
>>> /var/bricks/arb_0/brick/.glusterfs/00/00
>>> /var/bricks/arb_0/brick/.glusterfs/00/00/----0001
>>> /var/bricks/arb_0/brick/.glusterfs/landfill
>>> /var/bricks/arb_0/brick/.glusterfs/unlink
>>> /var/bricks/arb_0/brick/.glusterfs/health_check
>>>
>>> find /var/bricks/arb_0/brick -not -user 33 -print
>>>
>>> /var/bricks/arb_0/brick/.glusterfs
>>> /var/bricks/arb_0/brick/.glusterfs/indices
>>> /var/bricks/arb_0/brick/.glusterfs/indices/xattrop
>>> /var/bricks/arb_0/brick/.glusterfs/indices/dirty
>>> /var/bricks/arb_0/brick/.glusterfs/indices/entry-changes
>>> /var/bricks/arb_0/brick/.glusterfs/changelogs
>>> /var/bricks/arb_0/brick/.glusterfs/changelogs/htime
>>> /var/bricks/arb_0/brick/.glusterfs/changelogs/csnap
>>> /var/bricks/arb_0/brick/.glusterfs/00
>>> /var/bricks/arb_0/brick/.glusterfs/00/00
>>> /var/bricks/arb_0/brick/.glusterfs/00/00/----0001
>>> /var/bricks/arb_0/brick/.glusterfs/landfill
>>> /var/bricks/arb_0/brick/.glusterfs/unlink
>>> /var/bricks/arb_0/brick/.glusterfs/health_check
>>>
>>> Output is identical for user and group 36, as all of these have UID:GID 0:0, but 
>>> these files have 0:0 on the working arbiters as well.
>>> And this is all files/dirs that exist on the affected arbs. Nothing more on 
>>> it. There should be much more, but this seems to be due to the missing self-heal.
>>>
>>> Thanks.
>>>
>>> A.
>>>
>>>
>>> "Strahil Nikolov" hunter86...@yahoo.com – 31. Mai 2021 13:11
>>> > Hi,
>>> >
>>> > I think that the best way is to go through the logs on the affected 
>>> > arbiter brick (maybe even temporarily increase the log level).
>>> >
>>> > What is the output of:
>>> >
>>> > find /var/brick/arb_0/brick -not -user 36 -print
>>> > find /var/brick/arb_0/brick -not -group 36 -print
>>> >
>>> > Maybe there are some files/dirs with wrong ownership.
>>> >
>>> > Best Regards,
>>> > Strahil Nikolov
>>> >
>>> > >
>>> > > Thanks Strahil,
>>> > >
>>> > > unfortunately I cannot connect as the mount is denied as in mount.log 
>>> > > provided.
>>> > > IPs > n.n.n..100 are clients and simply cannot mount the volume. When 
>>> > > killing the arb pids on node2 new clients can mount the volume. When 
>>> > > bringing them up again I experience the same problem.
>>> > >
>>> > > I wonder why the root dir on the arb bricks has wrong UID:GID.
>>> > > I added regular data bricks before without any problems on node2.
>>> > >
>>> > > Also when executing "watch df"
>>> > >
>>> > > I see
>>> > >
>>> > > /dev/md50 11700224 33108 11667116 1% /var/bricks/arb_0
>>> > > ..
>>> > >
>>> > > /dev/md50 11700224 33128 11667096 1% /var/bricks/arb_0
>>> > >
>>> > > ..
>>> > >
>>> > > /dev/md50 11700224 33108 11667116 1% /var/bricks/arb_0
>>> > >
>>> > > So heal daemon might try to do something, which isn't working. Thus I 
>>> > > chowned UID:GID of ../arb_0/brick manually to match, 

Re: [Gluster-users] Conv distr-repl 2 to repl 3 arb 1 now 2 of 6 arb bricks won't get healed

2021-05-31 Thread a . schwibbe
I can't find anything suspicious in the brick logs other than authentication 
refused to clients trying to mount a dir that does not exist on the arb_n, 
because the self-heal isn't working.
I tried to add another node and replace-brick a faulty arbiter, however this 
new arbiter sees the same error.

Last idea is to completely remove the first subvolume, then re-add it as new, 
hoping it will work.


A.


"a.schwi...@gmx.net" a.schwi...@gmx.net – 31. Mai 2021 13:44
> Ok, will do.
>
>
> working arbiter:
>
> ls -ln /var/bricks/arb_0/ >>> drwxr-xr-x 13 33 33 146 Mai 29 22:38 brick
>
> ls -lna /var/bricks/arb_0/brick >>> drw--- 262 0 0 8192 Mai 29 22:38 
> .glusterfs
> + all data-brick dirs ...
>
>
> affected arbiter:
>
> ls -ln /var/bricks/arb_0/ >>> drwxr-xr-x 3 0 0 24 Mai 30 16:23 brick
> ls -lna /var/bricks/arb_0/brick >>> drw--- 7 0 0 99 Mai 30 16:23 
> .glusterfs
> nothing else here
>
>
> find /var/bricks/arb_0/brick -not -user 33 -print
>
> /var/bricks/arb_0/brick/.glusterfs
> /var/bricks/arb_0/brick/.glusterfs/indices
> /var/bricks/arb_0/brick/.glusterfs/indices/xattrop
> /var/bricks/arb_0/brick/.glusterfs/indices/dirty
> /var/bricks/arb_0/brick/.glusterfs/indices/entry-changes
> /var/bricks/arb_0/brick/.glusterfs/changelogs
> /var/bricks/arb_0/brick/.glusterfs/changelogs/htime
> /var/bricks/arb_0/brick/.glusterfs/changelogs/csnap
> /var/bricks/arb_0/brick/.glusterfs/00
> /var/bricks/arb_0/brick/.glusterfs/00/00
> /var/bricks/arb_0/brick/.glusterfs/00/00/----0001
> /var/bricks/arb_0/brick/.glusterfs/landfill
> /var/bricks/arb_0/brick/.glusterfs/unlink
> /var/bricks/arb_0/brick/.glusterfs/health_check
>
> find /var/bricks/arb_0/brick -not -user 33 -print
>
> /var/bricks/arb_0/brick/.glusterfs
> /var/bricks/arb_0/brick/.glusterfs/indices
> /var/bricks/arb_0/brick/.glusterfs/indices/xattrop
> /var/bricks/arb_0/brick/.glusterfs/indices/dirty
> /var/bricks/arb_0/brick/.glusterfs/indices/entry-changes
> /var/bricks/arb_0/brick/.glusterfs/changelogs
> /var/bricks/arb_0/brick/.glusterfs/changelogs/htime
> /var/bricks/arb_0/brick/.glusterfs/changelogs/csnap
> /var/bricks/arb_0/brick/.glusterfs/00
> /var/bricks/arb_0/brick/.glusterfs/00/00
> /var/bricks/arb_0/brick/.glusterfs/00/00/----0001
> /var/bricks/arb_0/brick/.glusterfs/landfill
> /var/bricks/arb_0/brick/.glusterfs/unlink
> /var/bricks/arb_0/brick/.glusterfs/health_check
>
> Output is identical for user and group 36, as all of these have UID:GID 0:0, but 
> these files have 0:0 on the working arbiters as well.
> And this is all files/dirs that exist on the affected arbs. Nothing more on 
> it. There should be much more, but this seems to be due to the missing self-heal.
>
> Thanks.
>
> A.
>
>
> "Strahil Nikolov" hunter86...@yahoo.com – 31. Mai 2021 13:11
> > Hi,
> >
> > I think that the best way is to go through the logs on the affected arbiter 
> > brick (maybe even temporarily increase the log level).
> >
> > What is the output of:
> >
> > find /var/brick/arb_0/brick -not -user 36 -print
> > find /var/brick/arb_0/brick -not -group 36 -print
> >
> > Maybe there are some files/dirs with wrong ownership.
> >
> > Best Regards,
> > Strahil Nikolov
> >
> > >
> > > Thanks Strahil,
> > >
> > > unfortunately I cannot connect as the mount is denied as in mount.log 
> > > provided.
> > > IPs > n.n.n..100 are clients and simply cannot mount the volume. When 
> > > killing the arb pids on node2 new clients can mount the volume. When 
> > > bringing them up again I experience the same problem.
> > >
> > > I wonder why the root dir on the arb bricks has wrong UID:GID.
> > > I added regular data bricks before without any problems on node2.
> > >
> > > Also when executing "watch df"
> > >
> > > I see
> > >
> > > /dev/md50 11700224 33108 11667116 1% /var/bricks/arb_0
> > > ..
> > >
> > > /dev/md50 11700224 33128 11667096 1% /var/bricks/arb_0
> > >
> > > ..
> > >
> > > /dev/md50 11700224 33108 11667116 1% /var/bricks/arb_0
> > >
> > > So heal daemon might try to do something, which isn't working. Thus I 
> > > chowned UID:GID of ../arb_0/brick manually to match, but it did not work 
> > > either.
> > >
> > > As I added all 6 arbs at once and 4 are working as expected I really 
> > > don't get what's wrong with these...
> > >
> > > A.
> > >
> > > "Strahil Nikolov" hunter86...@yahoo.com – 31. Mai 2021 11:12
> > > > For the arb_0 I see only 8 clients, while there should be 12 clients:
> > > > Brick : 192.168.0.40:/var/bricks/0/brick
> > > > Clients connected : 12
> > > >
> > > > Brick : 192.168.0.41:/var/bricks/0/brick
> > > > Clients connected : 12
> > > >
> > > > Brick : 192.168.0.80:/var/bricks/arb_0/brick
> > > > Clients connected : 8
> > > >
> > > > Can you try to reconnect them. The most simple way is to kill the 
> > > > arbiter process and 'gluster volume start force' , but always verify 
> > > > that you have both data bricks up and running.
> > > >
> > > >
> > > >
> > > > Yet, this doesn't explain why the heal daemon is not 

Re: [Gluster-users] Conv distr-repl 2 to repl 3 arb 1 now 2 of 6 arb bricks won't get healed

2021-05-31 Thread Strahil Nikolov
Hi,
I think that the best way is to go through the logs on the affected arbiter 
brick (maybe even temporarily increase the log level).
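
For example, the brick log level could be raised temporarily with something like:

gluster volume set gv0 diagnostics.brick-log-level DEBUG
# reproduce the failed heal / mount, check /var/log/glusterfs/bricks/ on that node, then revert:
gluster volume reset gv0 diagnostics.brick-log-level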
What is the output of:
find /var/brick/arb_0/brick -not -user 36 -print
find /var/brick/arb_0/brick -not -group 36 -print
Maybe there are some files/dirs with wrong ownership.
Best Regards,
Strahil Nikolov
 
Thanks Strahil,


unfortunately I cannot connect as the mount is denied as in mount.log provided.
IPs > n.n.n..100 are clients and simply cannot mount the volume. When killing 
the arb pids on node2 new clients can mount the volume. When bringing them up 
again I experience the same problem.

I wonder why the root dir on the arb bricks has wrong UID:GID.
I added regular data bricks before without any problems on node2.


Also when executing "watch df"

I see

/dev/md50 11700224 33108 11667116 1% /var/bricks/arb_0
..

/dev/md50 11700224 33128 11667096 1% /var/bricks/arb_0

..

/dev/md50 11700224 33108 11667116 1% /var/bricks/arb_0

So heal daemon might try to do something, which isn't working. Thus I chowned 
UID:GID of ../arb_0/brick manually to match, but it did not work either.

As I added all 6 arbs at once and 4 are working as expected I really don't get 
what's wrong with these...


A.

"Strahil Nikolov" hunter86...@yahoo.com – 31. Mai 2021 11:12
> For the arb_0 I see only 8 clients, while there should be 12 clients:
> Brick : 192.168.0.40:/var/bricks/0/brick
> Clients connected : 12
>
> Brick : 192.168.0.41:/var/bricks/0/brick
> Clients connected : 12
>
> Brick : 192.168.0.80:/var/bricks/arb_0/brick
> Clients connected : 8
>
> Can you try to reconnect them. The most simple way is to kill the arbiter 
> process and 'gluster volume start force' , but always verify that you have 
> both data bricks up and running.
>
>
>
> Yet, this doesn't explain why the heal daemon is not able to replicate 
> properly.
>
>
>
> Best Regards,
> Strahil Nikolov
> >
> > Meanwhile I tried reset-brick on one of the failing arbiters on node2, but 
> > with same results. The behaviour is reproducible, arbiter stays empty.
> >
> > node0: 192.168.0.40
> >
> > node1: 192.168.0.41
> >
> > node3: 192.168.0.80
> >
> > volume info:
> >
> > Volume Name: gv0
> > Type: Distributed-Replicate
> > Volume ID: 9bafc4d2-d9b6-4b6d-a631-1cf42d1d2559
> > Status: Started
> > Snapshot Count: 0
> > Number of Bricks: 6 x (2 + 1) = 18
> > Transport-type: tcp
> > Bricks:
> > Brick1: 192.168.0.40:/var/bricks/0/brick
> > Brick2: 192.168.0.41:/var/bricks/0/brick
> > Brick3: 192.168.0.80:/var/bricks/arb_0/brick (arbiter)
> > Brick4: 192.168.0.40:/var/bricks/2/brick
> > Brick5: 192.168.0.80:/var/bricks/2/brick
> > Brick6: 192.168.0.41:/var/bricks/arb_1/brick (arbiter)
> > Brick7: 192.168.0.40:/var/bricks/1/brick
> > Brick8: 192.168.0.41:/var/bricks/1/brick
> > Brick9: 192.168.0.80:/var/bricks/arb_1/brick (arbiter)
> > Brick10: 192.168.0.40:/var/bricks/3/brick
> > Brick11: 192.168.0.80:/var/bricks/3/brick
> > Brick12: 192.168.0.41:/var/bricks/arb_0/brick (arbiter)
> > Brick13: 192.168.0.41:/var/bricks/3/brick
> > Brick14: 192.168.0.80:/var/bricks/4/brick
> > Brick15: 192.168.0.40:/var/bricks/arb_0/brick (arbiter)
> > Brick16: 192.168.0.41:/var/bricks/2/brick
> > Brick17: 192.168.0.80:/var/bricks/5/brick
> > Brick18: 192.168.0.40:/var/bricks/arb_1/brick (arbiter)
> > Options Reconfigured:
> > cluster.min-free-inodes: 6%
> > cluster.min-free-disk: 2%
> > performance.md-cache-timeout: 600
> > cluster.rebal-throttle: lazy
> > features.scrub-freq: monthly
> > features.scrub-throttle: lazy
> > features.scrub: Inactive
> > features.bitrot: off
> > cluster.server-quorum-type: none
> > performance.cache-refresh-timeout: 10
> > performance.cache-max-file-size: 64MB
> > performance.cache-size: 781901824
> > auth.allow: 
> > /(192.168.0.*),/usr/andreas(192.168.0.120),/usr/otis(192.168.0.168),/usr/otis(192.168.0.111),/usr/otis(192.168.0.249),/media(192.168.0.*),/virt(192.168.0.*),/cloud(192.168.0.247),/zm(192.168.0.136)
> > performance.cache-invalidation: on
> > performance.stat-prefetch: on
> > features.cache-invalidation-timeout: 600
> > cluster.quorum-type: auto
> > features.cache-invalidation: on
> > nfs.disable: on
> > transport.address-family: inet
> > cluster.self-heal-daemon: on
> > cluster.server-quorum-ratio: 51%
> >
> > volume status:
> >
> > Status of volume: gv0
> > Gluster process TCP Port RDMA Port Online Pid
> > --
> > Brick 192.168.0.40:/var/bricks/0/brick 49155 0 Y 713066
> > Brick 192.168.0.41:/var/bricks/0/brick 49152 0 Y 2082
> > Brick 192.168.0.80:/var/bricks/arb_0/brick 49152 0 Y 26186
> > Brick 192.168.0.40:/var/bricks/2/brick 49156 0 Y 713075
> > Brick 192.168.0.80:/var/bricks/2/brick 49154 0 Y 325
> > Brick 192.168.0.41:/var/bricks/arb_1/brick 49157 0 Y 1746903
> > Brick 192.168.0.40:/var/bricks/1/brick 49157 0 Y 713084
> > Brick 192.168.0.41:/var/bricks/1/brick 49153 0 Y 14104
> > Brick 

Re: [Gluster-users] Conv distr-repl 2 to repl 3 arb 1 now 2 of 6 arb bricks won't get healed

2021-05-31 Thread Strahil Nikolov
For the arb_0 I see only 8 clients, while there should be 12 clients:
Brick : 192.168.0.40:/var/bricks/0/brick
Clients connected : 12

Brick : 192.168.0.41:/var/bricks/0/brick
Clients connected : 12

Brick : 192.168.0.80:/var/bricks/arb_0/brick
Clients connected : 8
Can you try to reconnect them. The most simple way is to kill the arbiter 
process and 'gluster volume start force' , but always verify that you have both 
data bricks up and running.
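
Roughly (the brick PIDs are in 'gluster volume status gv0', the per-brick client counts in 'gluster volume status gv0 clients'):

gluster volume status gv0 clients        # confirm which bricks are short on clients
kill <pid-of-the-arbiter-glusterfsd>     # kill only the arbiter brick process
gluster volume start gv0 force           # respawns the killed brick; running bricks are left alone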

Yet, this doesn't explain why the heal daemon is not able to replicate properly.

Best Regards,
Strahil Nikolov
 
Meanwhile I tried reset-brick on one of the failing arbiters on node2, but 
with same results. The behaviour is reproducible, arbiter stays empty.


node0: 192.168.0.40

node1: 192.168.0.41

node3: 192.168.0.80


volume info:

Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 9bafc4d2-d9b6-4b6d-a631-1cf42d1d2559
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x (2 + 1) = 18
Transport-type: tcp
Bricks:
Brick1: 192.168.0.40:/var/bricks/0/brick
Brick2: 192.168.0.41:/var/bricks/0/brick
Brick3: 192.168.0.80:/var/bricks/arb_0/brick (arbiter)
Brick4: 192.168.0.40:/var/bricks/2/brick
Brick5: 192.168.0.80:/var/bricks/2/brick
Brick6: 192.168.0.41:/var/bricks/arb_1/brick (arbiter)
Brick7: 192.168.0.40:/var/bricks/1/brick
Brick8: 192.168.0.41:/var/bricks/1/brick
Brick9: 192.168.0.80:/var/bricks/arb_1/brick (arbiter)
Brick10: 192.168.0.40:/var/bricks/3/brick
Brick11: 192.168.0.80:/var/bricks/3/brick
Brick12: 192.168.0.41:/var/bricks/arb_0/brick (arbiter)
Brick13: 192.168.0.41:/var/bricks/3/brick
Brick14: 192.168.0.80:/var/bricks/4/brick
Brick15: 192.168.0.40:/var/bricks/arb_0/brick (arbiter)
Brick16: 192.168.0.41:/var/bricks/2/brick
Brick17: 192.168.0.80:/var/bricks/5/brick
Brick18: 192.168.0.40:/var/bricks/arb_1/brick (arbiter)
Options Reconfigured:
cluster.min-free-inodes: 6%
cluster.min-free-disk: 2%
performance.md-cache-timeout: 600
cluster.rebal-throttle: lazy
features.scrub-freq: monthly
features.scrub-throttle: lazy
features.scrub: Inactive
features.bitrot: off
cluster.server-quorum-type: none
performance.cache-refresh-timeout: 10
performance.cache-max-file-size: 64MB
performance.cache-size: 781901824
auth.allow: 
/(192.168.0.*),/usr/andreas(192.168.0.120),/usr/otis(192.168.0.168),/usr/otis(192.168.0.111),/usr/otis(192.168.0.249),/media(192.168.0.*),/virt(192.168.0.*),/cloud(192.168.0.247),/zm(192.168.0.136)
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
cluster.quorum-type: auto
features.cache-invalidation: on
nfs.disable: on
transport.address-family: inet
cluster.self-heal-daemon: on
cluster.server-quorum-ratio: 51%

volume status:

Status of volume: gv0
Gluster process TCP Port RDMA Port Online Pid
--
Brick 192.168.0.40:/var/bricks/0/brick 49155 0 Y 713066
Brick 192.168.0.41:/var/bricks/0/brick 49152 0 Y 2082
Brick 192.168.0.80:/var/bricks/arb_0/brick 49152 0 Y 26186
Brick 192.168.0.40:/var/bricks/2/brick 49156 0 Y 713075
Brick 192.168.0.80:/var/bricks/2/brick 49154 0 Y 325
Brick 192.168.0.41:/var/bricks/arb_1/brick 49157 0 Y 1746903
Brick 192.168.0.40:/var/bricks/1/brick 49157 0 Y 713084
Brick 192.168.0.41:/var/bricks/1/brick 49153 0 Y 14104
Brick 192.168.0.80:/var/bricks/arb_1/brick 49159 0 Y 2314
Brick 192.168.0.40:/var/bricks/3/brick 49153 0 Y 2978692
Brick 192.168.0.80:/var/bricks/3/brick 49155 0 Y 23269
Brick 192.168.0.41:/var/bricks/arb_0/brick 49158 0 Y 1746942
Brick 192.168.0.41:/var/bricks/3/brick 49155 0 Y 897058
Brick 192.168.0.80:/var/bricks/4/brick 49156 0 Y 27433
Brick 192.168.0.40:/var/bricks/arb_0/brick 49152 0 Y 3561115
Brick 192.168.0.41:/var/bricks/2/brick 49156 0 Y 902602
Brick 192.168.0.80:/var/bricks/5/brick 49157 0 Y 29522
Brick 192.168.0.40:/var/bricks/arb_1/brick 49154 0 Y 3561159
Self-heal Daemon on localhost N/A N/A Y 26199
Self-heal Daemon on 192.168.0.41 N/A N/A Y 2240635
Self-heal Daemon on 192.168.0.40 N/A N/A Y 3912810

Task Status of Volume gv0
--
There are no active volume tasks

volume heal info summary:

Brick 192.168.0.40:/var/bricks/0/brick <--- contains 100177 files in 25015 dirs
Status: Connected
Total Number of entries: 1006
Number of entries in heal pending: 1006
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick 192.168.0.41:/var/bricks/0/brick
Status: Connected
Total Number of entries: 1006
Number of entries in heal pending: 1006
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick 192.168.0.80:/var/bricks/arb_0/brick
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick 192.168.0.40:/var/bricks/2/brick
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of 

Re: [Gluster-users] Conv distr-repl 2 to repl 3 arb 1 now 2 of 6 arb bricks won't get healed

2021-05-31 Thread a . schwibbe
Ok, will do.


working arbiter:

ls -ln /var/bricks/arb_0/ >>> drwxr-xr-x 13 33 33 146 Mai 29 22:38 brick

ls -lna /var/bricks/arb_0/brick >>> drw--- 262 0 0 8192 Mai 29 22:38 
.glusterfs
+ all data-brick dirs ...


affected arbiter:

ls -ln /var/bricks/arb_0/ >>> drwxr-xr-x 3 0 0 24 Mai 30 16:23 brick
ls -lna /var/bricks/arb_0/brick >>> drw--- 7 0 0 99 Mai 30 16:23 .glusterfs
nothing else here


find /var/bricks/arb_0/brick -not -user 33 -print

/var/bricks/arb_0/brick/.glusterfs
/var/bricks/arb_0/brick/.glusterfs/indices
/var/bricks/arb_0/brick/.glusterfs/indices/xattrop
/var/bricks/arb_0/brick/.glusterfs/indices/dirty
/var/bricks/arb_0/brick/.glusterfs/indices/entry-changes
/var/bricks/arb_0/brick/.glusterfs/changelogs
/var/bricks/arb_0/brick/.glusterfs/changelogs/htime
/var/bricks/arb_0/brick/.glusterfs/changelogs/csnap
/var/bricks/arb_0/brick/.glusterfs/00
/var/bricks/arb_0/brick/.glusterfs/00/00
/var/bricks/arb_0/brick/.glusterfs/00/00/----0001
/var/bricks/arb_0/brick/.glusterfs/landfill
/var/bricks/arb_0/brick/.glusterfs/unlink
/var/bricks/arb_0/brick/.glusterfs/health_check

find /var/bricks/arb_0/brick -not -user 33 -print

/var/bricks/arb_0/brick/.glusterfs
/var/bricks/arb_0/brick/.glusterfs/indices
/var/bricks/arb_0/brick/.glusterfs/indices/xattrop
/var/bricks/arb_0/brick/.glusterfs/indices/dirty
/var/bricks/arb_0/brick/.glusterfs/indices/entry-changes
/var/bricks/arb_0/brick/.glusterfs/changelogs
/var/bricks/arb_0/brick/.glusterfs/changelogs/htime
/var/bricks/arb_0/brick/.glusterfs/changelogs/csnap
/var/bricks/arb_0/brick/.glusterfs/00
/var/bricks/arb_0/brick/.glusterfs/00/00
/var/bricks/arb_0/brick/.glusterfs/00/00/----0001
/var/bricks/arb_0/brick/.glusterfs/landfill
/var/bricks/arb_0/brick/.glusterfs/unlink
/var/bricks/arb_0/brick/.glusterfs/health_check

Output is identical for user and group 36, as all of these have UID:GID 0:0, but these 
files have 0:0 on the working arbiters as well.
And this is all files/dirs that exist on the affected arbs. Nothing more on it. 
There should be much more, but this seems to be due to the missing self-heal.

Thanks.

A.


"Strahil Nikolov" hunter86...@yahoo.com – 31. Mai 2021 13:11
> Hi,
>
> I think that the best way is to go through the logs on the affected arbiter 
> brick (maybe even temporarily increase the log level).
>
> What is the output of:
>
> find /var/brick/arb_0/brick -not -user 36 -print
> find /var/brick/arb_0/brick -not -group 36 -print
>
> Maybe there are some files/dirs with wrong ownership.
>
> Best Regards,
> Strahil Nikolov
>
> >
> > Thanks Strahil,
> >
> > unfortunately I cannot connect as the mount is denied as in mount.log 
> > provided.
> > IPs > n.n.n..100 are clients and simply cannot mount the volume. When 
> > killing the arb pids on node2 new clients can mount the volume. When 
> > bringing them up again I experience the same problem.
> >
> > I wonder why the root dir on the arb bricks has wrong UID:GID.
> > I added regular data bricks before without any problems on node2.
> >
> > Also when executing "watch df"
> >
> > I see
> >
> > /dev/md50 11700224 33108 11667116 1% /var/bricks/arb_0
> > ..
> >
> > /dev/md50 11700224 33128 11667096 1% /var/bricks/arb_0
> >
> > ..
> >
> > /dev/md50 11700224 33108 11667116 1% /var/bricks/arb_0
> >
> > So heal daemon might try to do something, which isn't working. Thus I 
> > chowned UID:GID of ../arb_0/brick manually to match, but it did not work 
> > either.
> >
> > As I added all 6 arbs at once and 4 are working as expected I really don't 
> > get what's wrong with these...
> >
> > A.
> >
> > "Strahil Nikolov" hunter86...@yahoo.com – 31. Mai 2021 11:12
> > > For the arb_0 I see only 8 clients, while there should be 12 clients:
> > > Brick : 192.168.0.40:/var/bricks/0/brick
> > > Clients connected : 12
> > >
> > > Brick : 192.168.0.41:/var/bricks/0/brick
> > > Clients connected : 12
> > >
> > > Brick : 192.168.0.80:/var/bricks/arb_0/brick
> > > Clients connected : 8
> > >
> > > Can you try to reconnect them. The most simple way is to kill the arbiter 
> > > process and 'gluster volume start force' , but always verify that you 
> > > have both data bricks up and running.
> > >
> > >
> > >
> > > Yet, this doesn't explain why the heal daemon is not able to replicate 
> > > properly.
> > >
> > >
> > >
> > > Best Regards,
> > > Strahil Nikolov
> > > >
> > > > Meanwhile I tried reset-brick on one of the failing arbiters on node2, 
> > > > but with same results. The behaviour is reproducible, arbiter stays 
> > > > empty.
> > > >
> > > > node0: 192.168.0.40
> > > >
> > > > node1: 192.168.0.41
> > > >
> > > > node3: 192.168.0.80
> > > >
> > > > volume info:
> > > >
> > > > Volume Name: gv0
> > > > Type: Distributed-Replicate
> > > > Volume ID: 9bafc4d2-d9b6-4b6d-a631-1cf42d1d2559
> > > > Status: Started
> > > > Snapshot Count: 0
> > > > Number of Bricks: 6 x (2 + 1) = 18
> > > > Transport-type: tcp
> > > > Bricks:
> > > > Brick1: 

Re: [Gluster-users] Conv distr-repl 2 to repl 3 arb 1 now 2 of 6 arb bricks won't get healed

2021-05-31 Thread a . schwibbe
Thanks Strahil,


unfortunately I cannot connect as the mount is denied as in mount.log provided.
IPs > n.n.n..100 are clients and simply cannot mount the volume. When killing 
the arb pids on node2 new clients can mount the volume. When bringing them up 
again I experience the same problem.

I wonder why the root dir on the arb bricks has wrong UID:GID.
I added regular data bricks before without any problems on node2.


Also when executing "watch df"

I see

/dev/md50 11700224 33108 11667116 1% /var/bricks/arb_0
..

/dev/md50 11700224 33128 11667096 1% /var/bricks/arb_0

..

/dev/md50 11700224 33108 11667116 1% /var/bricks/arb_0

So heal daemon might try to do something, which isn't working. Thus I chowned 
UID:GID of ../arb_0/brick manually to match, but it did not work either.
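
Something that might be worth comparing between a working and an affected arbiter (standard gluster/xattr tooling, not output I have at hand) is the metadata on the brick root:

getfattr -d -m . -e hex /var/bricks/arb_0/brick   # trusted.gfid, trusted.glusterfs.volume-id, afr xattrs
stat /var/bricks/arb_0/brick                      # UID:GID and mode of the brick root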

As I added all 6 arbs at once and 4 are working as expected I really don't get 
what's wrong with these...


A.

"Strahil Nikolov" hunter86...@yahoo.com – 31. Mai 2021 11:12
> For the arb_0 I see only 8 clients, while there should be 12 clients:
> Brick : 192.168.0.40:/var/bricks/0/brick
> Clients connected : 12
>
> Brick : 192.168.0.41:/var/bricks/0/brick
> Clients connected : 12
>
> Brick : 192.168.0.80:/var/bricks/arb_0/brick
> Clients connected : 8
>
> Can you try to reconnect them. The most simple way is to kill the arbiter 
> process and 'gluster volume start force' , but always verify that you have 
> both data bricks up and running.
>
>
>
> Yet, this doesn't explain why the heal daemon is not able to replicate 
> properly.
>
>
>
> Best Regards,
> Strahil Nikolov
> >
> > Meanwhile I tried reset-brick on one of the failing arbiters on node2, but 
> > with same results. The behaviour is reproducible, arbiter stays empty.
> >
> > node0: 192.168.0.40
> >
> > node1: 192.168.0.41
> >
> > node3: 192.168.0.80
> >
> > volume info:
> >
> > Volume Name: gv0
> > Type: Distributed-Replicate
> > Volume ID: 9bafc4d2-d9b6-4b6d-a631-1cf42d1d2559
> > Status: Started
> > Snapshot Count: 0
> > Number of Bricks: 6 x (2 + 1) = 18
> > Transport-type: tcp
> > Bricks:
> > Brick1: 192.168.0.40:/var/bricks/0/brick
> > Brick2: 192.168.0.41:/var/bricks/0/brick
> > Brick3: 192.168.0.80:/var/bricks/arb_0/brick (arbiter)
> > Brick4: 192.168.0.40:/var/bricks/2/brick
> > Brick5: 192.168.0.80:/var/bricks/2/brick
> > Brick6: 192.168.0.41:/var/bricks/arb_1/brick (arbiter)
> > Brick7: 192.168.0.40:/var/bricks/1/brick
> > Brick8: 192.168.0.41:/var/bricks/1/brick
> > Brick9: 192.168.0.80:/var/bricks/arb_1/brick (arbiter)
> > Brick10: 192.168.0.40:/var/bricks/3/brick
> > Brick11: 192.168.0.80:/var/bricks/3/brick
> > Brick12: 192.168.0.41:/var/bricks/arb_0/brick (arbiter)
> > Brick13: 192.168.0.41:/var/bricks/3/brick
> > Brick14: 192.168.0.80:/var/bricks/4/brick
> > Brick15: 192.168.0.40:/var/bricks/arb_0/brick (arbiter)
> > Brick16: 192.168.0.41:/var/bricks/2/brick
> > Brick17: 192.168.0.80:/var/bricks/5/brick
> > Brick18: 192.168.0.40:/var/bricks/arb_1/brick (arbiter)
> > Options Reconfigured:
> > cluster.min-free-inodes: 6%
> > cluster.min-free-disk: 2%
> > performance.md-cache-timeout: 600
> > cluster.rebal-throttle: lazy
> > features.scrub-freq: monthly
> > features.scrub-throttle: lazy
> > features.scrub: Inactive
> > features.bitrot: off
> > cluster.server-quorum-type: none
> > performance.cache-refresh-timeout: 10
> > performance.cache-max-file-size: 64MB
> > performance.cache-size: 781901824
> > auth.allow: 
> > /(192.168.0.*),/usr/andreas(192.168.0.120),/usr/otis(192.168.0.168),/usr/otis(192.168.0.111),/usr/otis(192.168.0.249),/media(192.168.0.*),/virt(192.168.0.*),/cloud(192.168.0.247),/zm(192.168.0.136)
> > performance.cache-invalidation: on
> > performance.stat-prefetch: on
> > features.cache-invalidation-timeout: 600
> > cluster.quorum-type: auto
> > features.cache-invalidation: on
> > nfs.disable: on
> > transport.address-family: inet
> > cluster.self-heal-daemon: on
> > cluster.server-quorum-ratio: 51%
> >
> > volume status:
> >
> > Status of volume: gv0
> > Gluster process TCP Port RDMA Port Online Pid
> > --
> > Brick 192.168.0.40:/var/bricks/0/brick 49155 0 Y 713066
> > Brick 192.168.0.41:/var/bricks/0/brick 49152 0 Y 2082
> > Brick 192.168.0.80:/var/bricks/arb_0/brick 49152 0 Y 26186
> > Brick 192.168.0.40:/var/bricks/2/brick 49156 0 Y 713075
> > Brick 192.168.0.80:/var/bricks/2/brick 49154 0 Y 325
> > Brick 192.168.0.41:/var/bricks/arb_1/brick 49157 0 Y 1746903
> > Brick 192.168.0.40:/var/bricks/1/brick 49157 0 Y 713084
> > Brick 192.168.0.41:/var/bricks/1/brick 49153 0 Y 14104
> > Brick 192.168.0.80:/var/bricks/arb_1/brick 49159 0 Y 2314
> > Brick 192.168.0.40:/var/bricks/3/brick 49153 0 Y 2978692
> > Brick 192.168.0.80:/var/bricks/3/brick 49155 0 Y 23269
> > Brick 192.168.0.41:/var/bricks/arb_0/brick 49158 0 Y 1746942
> > Brick 192.168.0.41:/var/bricks/3/brick 49155 0 Y 897058
> > Brick 192.168.0.80:/var/bricks/4/brick 49156 0 Y 27433
> > Brick 

Re: [Gluster-users] Conv distr-repl 2 to repl 3 arb 1 now 2 of 6 arb bricks won't get healed

2021-05-30 Thread a . schwibbe
Meanwhile I tried reset-brick on one of the failing arbiters on node2, but with 
same results. The behaviour is reproducible, arbiter stays empty.


node0: 192.168.0.40

node1: 192.168.0.41

node3: 192.168.0.80


volume info:

Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 9bafc4d2-d9b6-4b6d-a631-1cf42d1d2559
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x (2 + 1) = 18
Transport-type: tcp
Bricks:
Brick1: 192.168.0.40:/var/bricks/0/brick
Brick2: 192.168.0.41:/var/bricks/0/brick
Brick3: 192.168.0.80:/var/bricks/arb_0/brick (arbiter)
Brick4: 192.168.0.40:/var/bricks/2/brick
Brick5: 192.168.0.80:/var/bricks/2/brick
Brick6: 192.168.0.41:/var/bricks/arb_1/brick (arbiter)
Brick7: 192.168.0.40:/var/bricks/1/brick
Brick8: 192.168.0.41:/var/bricks/1/brick
Brick9: 192.168.0.80:/var/bricks/arb_1/brick (arbiter)
Brick10: 192.168.0.40:/var/bricks/3/brick
Brick11: 192.168.0.80:/var/bricks/3/brick
Brick12: 192.168.0.41:/var/bricks/arb_0/brick (arbiter)
Brick13: 192.168.0.41:/var/bricks/3/brick
Brick14: 192.168.0.80:/var/bricks/4/brick
Brick15: 192.168.0.40:/var/bricks/arb_0/brick (arbiter)
Brick16: 192.168.0.41:/var/bricks/2/brick
Brick17: 192.168.0.80:/var/bricks/5/brick
Brick18: 192.168.0.40:/var/bricks/arb_1/brick (arbiter)
Options Reconfigured:
cluster.min-free-inodes: 6%
cluster.min-free-disk: 2%
performance.md-cache-timeout: 600
cluster.rebal-throttle: lazy
features.scrub-freq: monthly
features.scrub-throttle: lazy
features.scrub: Inactive
features.bitrot: off
cluster.server-quorum-type: none
performance.cache-refresh-timeout: 10
performance.cache-max-file-size: 64MB
performance.cache-size: 781901824
auth.allow: 
/(192.168.0.*),/usr/andreas(192.168.0.120),/usr/otis(192.168.0.168),/usr/otis(192.168.0.111),/usr/otis(192.168.0.249),/media(192.168.0.*),/virt(192.168.0.*),/cloud(192.168.0.247),/zm(192.168.0.136)
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
cluster.quorum-type: auto
features.cache-invalidation: on
nfs.disable: on
transport.address-family: inet
cluster.self-heal-daemon: on
cluster.server-quorum-ratio: 51%

volume status:

Status of volume: gv0
Gluster process TCP Port RDMA Port Online Pid
--
Brick 192.168.0.40:/var/bricks/0/brick 49155 0 Y 713066
Brick 192.168.0.41:/var/bricks/0/brick 49152 0 Y 2082
Brick 192.168.0.80:/var/bricks/arb_0/brick 49152 0 Y 26186
Brick 192.168.0.40:/var/bricks/2/brick 49156 0 Y 713075
Brick 192.168.0.80:/var/bricks/2/brick 49154 0 Y 325
Brick 192.168.0.41:/var/bricks/arb_1/brick 49157 0 Y 1746903
Brick 192.168.0.40:/var/bricks/1/brick 49157 0 Y 713084
Brick 192.168.0.41:/var/bricks/1/brick 49153 0 Y 14104
Brick 192.168.0.80:/var/bricks/arb_1/brick 49159 0 Y 2314
Brick 192.168.0.40:/var/bricks/3/brick 49153 0 Y 2978692
Brick 192.168.0.80:/var/bricks/3/brick 49155 0 Y 23269
Brick 192.168.0.41:/var/bricks/arb_0/brick 49158 0 Y 1746942
Brick 192.168.0.41:/var/bricks/3/brick 49155 0 Y 897058
Brick 192.168.0.80:/var/bricks/4/brick 49156 0 Y 27433
Brick 192.168.0.40:/var/bricks/arb_0/brick 49152 0 Y 3561115
Brick 192.168.0.41:/var/bricks/2/brick 49156 0 Y 902602
Brick 192.168.0.80:/var/bricks/5/brick 49157 0 Y 29522
Brick 192.168.0.40:/var/bricks/arb_1/brick 49154 0 Y 3561159
Self-heal Daemon on localhost N/A N/A Y 26199
Self-heal Daemon on 192.168.0.41 N/A N/A Y 2240635
Self-heal Daemon on 192.168.0.40 N/A N/A Y 3912810

Task Status of Volume gv0
--
There are no active volume tasks

volume heal info summary:

Brick 192.168.0.40:/var/bricks/0/brick <--- contains 100177 files in 25015 dirs
Status: Connected
Total Number of entries: 1006
Number of entries in heal pending: 1006
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick 192.168.0.41:/var/bricks/0/brick
Status: Connected
Total Number of entries: 1006
Number of entries in heal pending: 1006
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick 192.168.0.80:/var/bricks/arb_0/brick
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick 192.168.0.40:/var/bricks/2/brick
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick 192.168.0.80:/var/bricks/2/brick
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick 192.168.0.41:/var/bricks/arb_1/brick
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick 192.168.0.40:/var/bricks/1/brick
Status: Connected
Total Number of entries: 1006
Number of entries in heal pending: 1006
Number of