Re: [ceph-users] two osd stack on peereng after start osd to recovery

2013-07-17 Thread Dominik Mostowiec
Hi, Something interesting: the osd with problems eats much more memory. The standard is about 300M; this osd eats up to 30G. Can I do any tests to help find where the problem is? -- Regards Dominik 2013/7/16 Dominik Mostowiec : > Hi, > I noticed that the problem is more frequent at night, when traffic is sm
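
As to the question about tests: a few low-impact checks can narrow down where the memory goes. This is only a rough sketch, assuming the osd is built against tcmalloc (needed for the heap commands), the default admin-socket path, and N standing in for the id of the affected osd:

ceph tell osd.N heap stats
    # tcmalloc's view of the heap: bytes in use vs. bytes held in freelists
ceph tell osd.N heap start_profiler
    # start writing heap profiles next to the osd log for later inspection
ceph --admin-daemon /var/run/ceph/ceph-osd.N.asok perf dump
    # per-osd counters; unusually large buffer/throttle values hint at where the memory sits
ceph --admin-daemon /var/run/ceph/ceph-osd.N.asok dump_ops_in_flight
    # requests the osd is currently holding on to, if this command exists in your version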

Re: [ceph-users] two osd stack on peereng after start osd to recovery

2013-07-16 Thread Dominik Mostowiec
Hi, I noticed that the problem is more frequent at night, when traffic is smaller. Maybe it is caused by scrubbing (multiple scrubs on one osd) and a too-small "filestore op threads", or some other thread-count setting in my config:
osd heartbeat grace = 15
filestore flush min = 0
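
If concurrent scrubs during the quiet hours are the suspect, these are the settings one would normally look at first. A hedged ceph.conf sketch with illustrative values only (in 0.56/0.61 the defaults are osd max scrubs = 1 and filestore op threads = 2):

[osd]
    osd max scrubs = 1              # at most one scrub per osd at a time (the default)
    osd scrub load threshold = 0.5  # do not start scrubs while the load average is above this
    filestore op threads = 4        # raise the filestore worker pool from the default of 2
    osd op threads = 4              # likewise for the OSD op worker threads
    osd heartbeat grace = 15        # kept from the quoted config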

Re: [ceph-users] two osd stack on peereng after start osd to recovery

2013-07-04 Thread Dominik Mostowiec
I reported a bug: http://tracker.ceph.com/issues/5504 -- Regards Dominik 2013/7/2 Dominik Mostowiec : > Hi, > Some osd.87 performance graphs: > https://www.dropbox.com/s/o07wae2041hu06l/osd_87_performance.PNG > After 11.05 I restarted it. > > Mons .., maybe that is the problem. > > -- > Regard

Re: [ceph-users] two osd stack on peereng after start osd to recovery

2013-07-02 Thread Dominik Mostowiec
Hi, Some osd.87 performance graphs: https://www.dropbox.com/s/o07wae2041hu06l/osd_87_performance.PNG After 11.05 I restarted it. The mons .., maybe that is the problem. -- Regards Dominik 2013/7/2 Andrey Korolyov : > Hi Dominik, > > What about the performance of osd.87 at the moment? Do you

Re: [ceph-users] two osd stack on peereng after start osd to recovery

2013-07-02 Thread Dominik Mostowiec
Hi, I got it.

ceph health details
HEALTH_WARN 3 pgs peering; 3 pgs stuck inactive; 5 pgs stuck unclean; recovery 64/38277874 degraded (0.000%)
pg 5.df9 is stuck inactive for 138669.746512, current state peering, last acting [87,2,151]
pg 5.a82 is stuck inactive for 138638.121867, current state pee
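
With concrete pgids in hand (5.df9 and 5.a82 are quoted above), the usual next step is to dump the stuck set and see where those pgs map; a minimal sketch:

ceph pg dump_stuck inactive    # every pg stuck inactive, with its state and acting set
ceph pg dump_stuck unclean     # same for the unclean ones
ceph pg map 5.df9              # which osds the pg currently maps to (up set vs. acting set)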

Re: [ceph-users] two osd stack on peereng after start osd to recovery

2013-07-02 Thread Andrey Korolyov
Hi Dominik, What about the performance of osd.87 at the moment? Do you have any related measurements? As for my version of this issue, it seems that the quorum degrades over time: when I restarted the mons, the problem went away and peering time dropped by a factor of ten or so. Also see
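
Before restarting the mons it can be worth confirming that the quorum really is the slow part; a sketch, assuming mon.a and the default admin-socket path:

ceph mon stat         # which mons are in quorum and which one is the leader
ceph quorum_status    # election epoch and detailed quorum membership
ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok mon_status   # one monitor's own view of its state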

Re: [ceph-users] two osd stack on peereng after start osd to recovery

2013-06-30 Thread Andrey Korolyov
That's not a loop as it looks, sorry. I have reproduced the issue many times and there is no such CPU-eating behavior in most cases; only the locked pgs are present. Also, I can celebrate the return of the 'wrong down mark' bug, at least for the 0.61.4 tag. For the first one, I'll send a link with a core as quick a
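
For the 'wrong down mark' symptom, the osd logs it itself when it notices; a quick way to confirm it on an affected host (default log path assumed):

grep "wrongly marked me down" /var/log/ceph/ceph-osd.*.log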

Re: [ceph-users] two osd stack on peereng after start osd to recovery

2013-06-28 Thread Dominik Mostowiec
I have only the 'ceph health details' output from the previous crash.

ceph health details
HEALTH_WARN 6 pgs peering; 9 pgs stuck unclean
pg 3.c62 is stuck unclean for 583.220063, current state active, last acting [57,23,51]
pg 4.269 is stuck unclean for 4842.519837, current state peering, last acting [23,57,106]

Re: [ceph-users] two osd stack on peereng after start osd to recovery

2013-06-28 Thread Sage Weil
> Ver. 0.56.6 > Hmm, the osd did not die, 1 or more pgs are stuck peering on it. Can you get a pgid from 'ceph health detail' and then do 'ceph pg <pgid> query' and attach that output? Thanks! sage > > Regards > Dominik > > On Jun 28, 2013 11:28 PM, "Sage Weil" wrote: > On Sat, 29 Jun 2013, Andrey K
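
Spelled out, the request above amounts to something like the following, where <pgid> is a placeholder for whatever 'ceph health detail' reports as stuck peering:

ceph health detail | grep peering     # pick one of the pgs reported as stuck in peering
ceph pg <pgid> query > pg-query.txt   # the output Sage asks to have attached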

Re: [ceph-users] two osd stack on peereng after start osd to recovery

2013-06-28 Thread Dominik Mostowiec
Ver. 0.56.6. Hmm, the osd did not die, 1 or more pgs are stuck peering on it. Regards Dominik On Jun 28, 2013 11:28 PM, "Sage Weil" wrote: > On Sat, 29 Jun 2013, Andrey Korolyov wrote: > > There is almost the same problem with the 0.61 cluster, at least with the same > > symptoms. It can be reproduced quite easily

Re: [ceph-users] two osd stack on peereng after start osd to recovery

2013-06-28 Thread Sage Weil
On Sat, 29 Jun 2013, Andrey Korolyov wrote: > There is almost the same problem with the 0.61 cluster, at least with the same > symptoms. It can be reproduced quite easily: remove an osd and then > mark it as out, and with quite high probability one of its neighbors will > be stuck at the end of the peering process

Re: [ceph-users] two osd stack on peereng after start osd to recovery

2013-06-28 Thread Dominik Mostowiec
Today I had the peering problem not when I put osd.71 out, but during normal Ceph operation. Regards Dominik 2013/6/28 Andrey Korolyov : > There is almost the same problem with the 0.61 cluster, at least with the same > symptoms. It can be reproduced quite easily: remove an osd and then > mark it as out and with qui

Re: [ceph-users] two osd stack on peereng after start osd to recovery

2013-06-28 Thread Andrey Korolyov
There is almost the same problem with the 0.61 cluster, at least with the same symptoms. It can be reproduced quite easily: remove an osd and then mark it as out, and with quite high probability one of its neighbors will be stuck at the end of the peering process with a couple of peering pgs whose primary copy is on it.
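
A reproduction sketch of the steps described above (N is any osd id in a test cluster; how the daemon is stopped is distro-specific):

# stop the ceph-osd daemon for osd.N first (service/upstart, depending on the distro)
ceph osd out N                       # mark it out so its pgs start remapping
ceph -w                              # watch the cluster; look for pgs that never leave 'peering'
ceph health detail | grep peering    # identify the stuck pgs and the neighbor osd holding them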

Re: [ceph-users] two osd stack on peereng after start osd to recovery

2013-06-28 Thread Dominik Mostowiec
Hi, We took osd.71 out and now the problem is on osd.57. Something curious: op_rw on osd.57 is much higher than on the others. See here: https://www.dropbox.com/s/o5q0xi9wbvpwyiz/op_rw_osd57.PNG In the data on this osd I found:

> data/osd.57/current# du -sh omap/
> 2.3G    omap/

That much higher op_rw on one osd
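
One way to quantify how unusual osd.57 is would be to compare its op_rw counter and omap footprint with those of the other osds on the same host; a rough sketch, assuming the default admin-socket paths and the data layout quoted above:

du -sh data/osd.*/current/omap   # leveldb (omap) size per local osd, run from the same place as the du above
for sock in /var/run/ceph/ceph-osd.*.asok; do
    echo "$sock"
    ceph --admin-daemon "$sock" perf dump | grep -o '"op_rw":[^,}]*'
done
# the osd whose op_rw counter grows much faster than its peers is the one attracting the read-modify-write traffic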

Re: [ceph-users] two osd stack on peereng after start osd to recovery

2013-06-13 Thread Gregory Farnum
On Thu, Jun 13, 2013 at 6:33 AM, Sławomir Skowron wrote: > Hi, sorry for the late response. > > https://docs.google.com/file/d/0B9xDdJXMieKEdHFRYnBfT3lCYm8/view > > Logs in attachment, and on Google Drive, from today. > > https://docs.google.com/file/d/0B9xDdJXMieKEQzVNVHJ1RXFXZlU/view > > We have suc

Re: [ceph-users] two osd stack on peereng after start osd to recovery

2013-06-06 Thread Gregory Farnum
We don't have your logs (vger doesn't forward them). Can you describe the situation more completely in terms of what failures occurred and what steps you took? (Also, this should go on ceph-users. Adding that to the recipients list.) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.co
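
When resending, peering problems are usually much easier to diagnose if the affected osds log verbosely while the problem is reproduced; a hedged sketch, with N standing for each osd involved:

ceph tell osd.N injectargs '--debug-osd 20 --debug-ms 1'   # raise log levels at runtime
# or set them in ceph.conf under [osd] before restarting:
#   debug osd = 20
#   debug ms = 1
# then compress and link the resulting /var/log/ceph/ceph-osd.N.log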