Re: [ceph-users] Cluster hang (deep scrub bug? "waiting for scrub")

2017-11-13 Thread Matteo Dacrema
I’ve only seen that once, and I noticed there’s a bug fixed in 10.2.10
( http://tracker.ceph.com/issues/20041 ).
Yes, I use snapshots.

From what I can see, in my case the PG had been scrubbing for 20 days, but I only
keep 7 days of logs, so I’m not able to identify the affected PG.
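
For next time, rather than grepping old logs, it might be quicker to ask the
cluster directly which PGs are currently scrubbing. A minimal sketch, assuming
the stock Jewel CLI (the column layout of "ceph pg dump" varies a bit between
releases):

    # List PGs whose current state includes "scrubbing" or "scrubbing+deep",
    # together with the rest of their pg stats line.
    ceph pg dump pgs 2>/dev/null | grep -i scrubbing

Combined with dump_blocked_ops on the OSDs, that should be enough to pin down
the affected PG while the hang is still happening.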



> On 10 Nov 2017, at 14:05, Peter Maloney wrote:
> 
> I have often seen a problem where a single OSD stuck in an eternal deep scrub
> will hang any client trying to connect. Stopping or restarting that
> single OSD fixes the problem.
> 
> Do you use snapshots?
> 
> Here's what the scrub bug looks like (the "age" of ~49315 seconds below is about 14 hours):
> 
>> ceph daemon "osd.$osd_number" dump_blocked_ops
> 
>>  {
>>  "description": "osd_op(client.6480719.0:2000419292 4.a27969ae
>> rbd_data.46820b238e1f29.aa70 [set-alloc-hint object_size
>> 524288 write_size 524288,write 0~4096] snapc 16ec0=[16ec0]
>> ack+ondisk+write+known_if_redirected e148441)",
>>  "initiated_at": "2017-09-12 20:04:27.987814",
>>  "age": 49315.666393,
>>  "duration": 49315.668515,
>>  "type_data": [
>>  "delayed",
>>  {
>>  "client": "client.6480719",
>>  "tid": 2000419292
>>  },
>>  [
>>  {
>>  "time": "2017-09-12 20:04:27.987814",
>>  "event": "initiated"
>>  },
>>  {
>>  "time": "2017-09-12 20:04:27.987862",
>>  "event": "queued_for_pg"
>>  },
>>  {
>>  "time": "2017-09-12 20:04:28.004142",
>>  "event": "reached_pg"
>>  },
>>  {
>>  "time": "2017-09-12 20:04:28.004219",
>>  "event": "waiting for scrub"
>>  }
>>  ]
>>  ]
>>  }
> 
> 
> 
> 
> 
> 
> On 11/09/17 17:20, Matteo Dacrema wrote:
>> Update: I noticed that there was a PG that remained scrubbing from the
>> first day I found the issue until I rebooted the node and the problem
>> disappeared.
>> Can this cause the behaviour I described before?
>> 
>> 
>>> On 09 Nov 2017, at 15:55, Matteo Dacrema wrote:
>>> 
>>> Hi all,
>>> 
>>> I’ve experienced a strange issue with my cluster.
>>> The cluster is composed of 10 HDD nodes with 20 HDDs + 4 journal drives
>>> each, plus 4 SSD nodes with 5 SSDs each.
>>> All the nodes are behind 3 monitors, split across 2 different crush maps
>>> (HDD and SSD).
>>> The whole cluster is on 10.2.7.
>>> 
>>> About 20 days ago I started to notice that long backups hang with "task
>>> jbd2/vdc1-8:555 blocked for more than 120 seconds" on the HDD crush map.
>>> A few days ago another VM started to show high iowait without doing any
>>> iops, also on the HDD crush map.
>>>
>>> Today about a hundred VMs weren’t able to read/write from many volumes, all
>>> of them on the HDD crush map. Ceph health was OK and no significant log
>>> entries were found.
>>> Not all the VMs experienced this problem, and in the meantime the iops on
>>> the journals and HDDs were very low, even though I was able to do
>>> significant iops on the working VMs.
>>>
>>> After two hours of debugging I decided to reboot one of the OSD nodes and
>>> the cluster started to respond again. Now the OSD node is back in the
>>> cluster and the problem has disappeared.
>>> 
>>> Can someone help me to understand what happened?
>>> I see strange entries in the log files like:
>>> 
>>> accept replacing existing (lossy) channel (new one lossy=1)
>>> fault with nothing to send, going to standby
>>> leveldb manual compact 
>>> 
>>> I can share all the logs that can help to identify the issue.
>>> 
>>> Thank you.
>>> Regards,
>>> 
>>> Matteo
>>> 
>>> 
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> 
>>> 
>>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> -- 
> 
> 
> Peter Maloney
> Brockmann Consult
> Max-Planck-Str. 2
> 21502 Geesthacht
> Germany
> Tel: +49 4152 889 300
> Fax: +49 4152 889 333
> E-mail: peter.malo...@brockmann-consult.de
> Internet: http://www.brockmann-consult.de
> 
> 
> 

Re: [ceph-users] Cluster hang (deep scrub bug? "waiting for scrub")

2017-11-10 Thread Peter Maloney
I have often seen a problem where a single OSD stuck in an eternal deep scrub
will hang any client trying to connect. Stopping or restarting that
single OSD fixes the problem.

Do you use snapshots?

Here's what the scrub bug looks like (the "age" of ~49315 seconds below is about 14 hours):

> ceph daemon "osd.$osd_number" dump_blocked_ops

>  {
>  "description": "osd_op(client.6480719.0:2000419292 4.a27969ae
> rbd_data.46820b238e1f29.aa70 [set-alloc-hint object_size
> 524288 write_size 524288,write 0~4096] snapc 16ec0=[16ec0]
> ack+ondisk+write+known_if_redirected e148441)",
>  "initiated_at": "2017-09-12 20:04:27.987814",
>  "age": 49315.666393,
>  "duration": 49315.668515,
>  "type_data": [
>  "delayed",
>  {
>  "client": "client.6480719",
>  "tid": 2000419292
>  },
>  [
>  {
>  "time": "2017-09-12 20:04:27.987814",
>  "event": "initiated"
>  },
>  {
>  "time": "2017-09-12 20:04:27.987862",
>  "event": "queued_for_pg"
>  },
>  {
>  "time": "2017-09-12 20:04:28.004142",
>  "event": "reached_pg"
>  },
>  {
>  "time": "2017-09-12 20:04:28.004219",
>  "event": "waiting for scrub"
>  }
>  ]
>  ]
>  }
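
Once an OSD is blocked like that, the workaround above boils down to something
like the following sketch (assuming systemd-managed OSDs on a Jewel cluster;
substitute the real OSD id, and pausing scrubs is optional):

    # Optionally stop new scrubs from starting while you investigate.
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # Restart the single OSD whose ops are stuck in "waiting for scrub".
    systemctl restart ceph-osd@<osd_number>

    # Re-enable scrubbing once the blocked ops have cleared.
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub

Afterwards, dump_blocked_ops on that OSD should report no blocked ops again.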






On 11/09/17 17:20, Matteo Dacrema wrote:
> Update: I noticed that there was a PG that remained scrubbing from the first
> day I found the issue until I rebooted the node and the problem disappeared.
> Can this cause the behaviour I described before?
>
>
>> On 09 Nov 2017, at 15:55, Matteo Dacrema wrote:
>>
>> Hi all,
>>
>> I’ve experienced a strange issue with my cluster.
>> The cluster is composed of 10 HDD nodes with 20 HDDs + 4 journal drives
>> each, plus 4 SSD nodes with 5 SSDs each.
>> All the nodes are behind 3 monitors, split across 2 different crush maps
>> (HDD and SSD).
>> The whole cluster is on 10.2.7.
>>
>> About 20 days ago I started to notice that long backups hang with "task
>> jbd2/vdc1-8:555 blocked for more than 120 seconds" on the HDD crush map.
>> A few days ago another VM started to show high iowait without doing any
>> iops, also on the HDD crush map.
>>
>> Today about a hundred VMs weren’t able to read/write from many volumes, all
>> of them on the HDD crush map. Ceph health was OK and no significant log
>> entries were found.
>> Not all the VMs experienced this problem, and in the meantime the iops on
>> the journals and HDDs were very low, even though I was able to do
>> significant iops on the working VMs.
>>
>> After two hours of debugging I decided to reboot one of the OSD nodes and
>> the cluster started to respond again. Now the OSD node is back in the
>> cluster and the problem has disappeared.
>>
>> Can someone help me to understand what happened?
>> I see strange entries in the log files like:
>>
>> accept replacing existing (lossy) channel (new one lossy=1)
>> fault with nothing to send, going to standby
>> leveldb manual compact 
>>
>> I can share all the logs that can help to identify the issue.
>>
>> Thank you.
>> Regards,
>>
>> Matteo
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 


Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.malo...@brockmann-consult.de
Internet: http://www.brockmann-consult.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster hang

2017-11-09 Thread Matteo Dacrema
Update: I noticed that there was a PG that remained scrubbing from the first
day I found the issue until I rebooted the node and the problem disappeared.
Can this cause the behaviour I described before?


> On 09 Nov 2017, at 15:55, Matteo Dacrema wrote:
> 
> Hi all,
> 
> I’ve experienced a strange issue with my cluster.
> The cluster is composed of 10 HDD nodes with 20 HDDs + 4 journal drives
> each, plus 4 SSD nodes with 5 SSDs each.
> All the nodes are behind 3 monitors, split across 2 different crush maps
> (HDD and SSD).
> The whole cluster is on 10.2.7.
> 
> About 20 days ago I started to notice that long backups hang with "task
> jbd2/vdc1-8:555 blocked for more than 120 seconds" on the HDD crush map.
> A few days ago another VM started to show high iowait without doing any
> iops, also on the HDD crush map.
>
> Today about a hundred VMs weren’t able to read/write from many volumes, all
> of them on the HDD crush map. Ceph health was OK and no significant log
> entries were found.
> Not all the VMs experienced this problem, and in the meantime the iops on
> the journals and HDDs were very low, even though I was able to do
> significant iops on the working VMs.
>
> After two hours of debugging I decided to reboot one of the OSD nodes and
> the cluster started to respond again. Now the OSD node is back in the
> cluster and the problem has disappeared.
> 
> Can someone help me to understand what happened?
> I see strange entries in the log files like:
> 
> accept replacing existing (lossy) channel (new one lossy=1)
> fault with nothing to send, going to standby
> leveldb manual compact 
> 
> I can share all the logs that can help to identify the issue.
> 
> Thank you.
> Regards,
> 
> Matteo
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cluster hang

2017-11-09 Thread Matteo Dacrema
Hi all,

I’ve experienced a strange issue with my cluster.
The cluster is composed of 10 HDD nodes with 20 HDDs + 4 journal drives each,
plus 4 SSD nodes with 5 SSDs each.
All the nodes are behind 3 monitors, split across 2 different crush maps
(HDD and SSD).
The whole cluster is on 10.2.7.

About 20 days ago I started to notice that long backups hang with "task
jbd2/vdc1-8:555 blocked for more than 120 seconds" on the HDD crush map.
A few days ago another VM started to show high iowait without doing any iops,
also on the HDD crush map.

Today about a hundred VMs weren’t able to read/write from many volumes, all of
them on the HDD crush map. Ceph health was OK and no significant log entries
were found.
Not all the VMs experienced this problem, and in the meantime the iops on the
journals and HDDs were very low, even though I was able to do significant iops
on the working VMs.

After two hours of debugging I decided to reboot one of the OSD nodes and the
cluster started to respond again. Now the OSD node is back in the cluster and
the problem has disappeared.

Can someone help me to understand what happened?
I see strange entries in the log files like:

accept replacing existing (lossy) channel (new one lossy=1)
fault with nothing to send, going to standby
leveldb manual compact 
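
Since ceph health looked clean while clients were stuck, a rough sketch of what
I could check next time on a suspect OSD host (assuming the default admin
socket paths under /var/run/ceph; the 32-second threshold is just an example):

    # List ops currently in flight on each local OSD and show any "age" values
    # larger than 32 seconds ("ceph daemon" only talks to local admin sockets).
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        id=${sock##*/ceph-osd.}; id=${id%.asok}
        echo "== osd.$id =="
        ceph daemon "osd.$id" dump_ops_in_flight | grep '"age"' | awk -F: '$2+0 > 32'
    done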

I can share all the logs that can help to identify the issue.

Thank you.
Regards,

Matteo


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com