Re: [PATCH 3/3] ceph: fix vmtruncate deadlock
On 02/23/2013 02:54 AM, Gregory Farnum wrote:
> I haven't spent that much time in the kernel client, but this patch
> isn't working out for me. In particular, I'm pretty sure we need to
> preserve this:
>
>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>> index 5d5c32b..b9d8417 100644
>> --- a/fs/ceph/caps.c
>> +++ b/fs/ceph/caps.c
>> @@ -2067,12 +2067,6 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want,
>>      }
>>      have = __ceph_caps_issued(ci, &implemented);
>>
>> -    /*
>> -     * disallow writes while a truncate is pending
>> -     */
>> -    if (ci->i_truncate_pending)
>> -            have &= ~CEPH_CAP_FILE_WR;
>> -
>>      if ((have & need) == need) {
>>              /*
>>               * Look at (implemented & ~have & not) so that we keep waiting
>
> Because if there's a pending truncate, we really can't write. You do
> handle it in the particular case of doing buffered file writes, but
> these caps are a marker of permission, and the client shouldn't have
> write permission to a file until it's up to date on the truncates. Or
> am I misunderstanding something?

A pending vmtruncate is only relevant to the buffered write case. If the
client doesn't have the 'b' cap, the page cache is empty and
__ceph_do_pending_vmtruncate is a no-op.

For buffered writes, this patch only affects the situation where the
client receives a truncate request from the MDS in the middle of a
write. I think it's better to do the vmtruncate after the write
finishes.

>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> index a1e5b81..bf7849a 100644
>> --- a/fs/ceph/file.c
>> +++ b/fs/ceph/file.c
>> @@ -653,7 +653,6 @@ static ssize_t ceph_aio_read(struct kiocb *iocb, const struct iovec *iov,
>>      dout("aio_read %p %llx.%llx %llu~%u trying to get caps on %p\n",
>>           inode, ceph_vinop(inode), pos, (unsigned)len, inode);
>>  again:
>> -    __ceph_do_pending_vmtruncate(inode);
>
> There doesn't seem to be any kind of replacement for this? We need to
> do any pending truncates before reading or we might get stale data
> back.
generic_file_aio_read checks i_size when copying data to the user
buffer, so the user program can't get stale data. This
__ceph_do_pending_vmtruncate is not protected by i_mutex, which is a
potential bug; that's the reason I removed it.

Regards
Yan, Zheng

> The first two in the series look good to me, but I'll wait and pull
> them in as a unit with this one once we're done discussing. :)
> -Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC] Inline data support for Ceph
Hi there,

We have a team working on Ceph optimization, for the purpose of
integrating with OpenStack. Currently, Ceph can make use of the inline
data feature implemented in the local file system, such as btrfs.
However, we think it may be better to implement inline data at a higher
level, i.e. to make Ceph itself aware of it, since that would save the
client the calculation of the object location and the communication
with the OSDs. It should give a good IO speedup for small files, at the
cost of a slightly heavier load on the MDS.

We have worked out a plan to do it and the work is ongoing. We are
wondering about the community's response to this work and its prospects
for inclusion; comments are appreciated.

Cheers,
Li
Re: Mon losing touch with OSDs
On Fri, Feb 22, 2013 at 05:52:11PM -0800, Sage Weil wrote:
> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>> On Fri, Feb 22, 2013 at 05:30:04PM -0800, Sage Weil wrote:
>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>> On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
>>>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>>>> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
>>>>>>>>> I just looked at the logs. I can't tell what happened to cause that 10
>>>>>>>>> second delay.. strangely, messages were passing from 0 -> 1, but nothing
>>>>>>>>> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
>>>>>>
>>>>>> Is there any way of telling where they were delayed, i.e. in the 1's output
>>>>>> queue or 0's input queue?
>>>>>
>>>>> Yeah, if you bump it up to 'debug ms = 20'. Be aware that that will
>>>>> generate a lot of logging, though.
>>>>
>>>> I really don't want to load the system with too much logging, but I'm happy
>>>> modifying code... Are there specific interesting debug outputs which I can
>>>> modify so they're output under "ms = 1"?
>>>
>>> I'm basically interested in everything in writer() and write_message(),
>>> and reader() and read_message()...
>>
>> Like this?
>
> Yeah. You could do 2 instead of 1 so you can turn it down. I suspect
> that this is the lion's share of what debug 20 will spam to the log, but
> hopefully the load is manageable!

Good idea on the '2'. I'll get that installed and wait for it to happen
again.

Chris
Re: Mon losing touch with OSDs
On Sat, 23 Feb 2013, Chris Dunlop wrote:
> On Fri, Feb 22, 2013 at 05:30:04PM -0800, Sage Weil wrote:
>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>> On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
>>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>>> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
>>>>>>>> I just looked at the logs. I can't tell what happened to cause that 10
>>>>>>>> second delay.. strangely, messages were passing from 0 -> 1, but nothing
>>>>>>>> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
>>>>>
>>>>> Is there any way of telling where they were delayed, i.e. in the 1's output
>>>>> queue or 0's input queue?
>>>>
>>>> Yeah, if you bump it up to 'debug ms = 20'. Be aware that that will
>>>> generate a lot of logging, though.
>>>
>>> I really don't want to load the system with too much logging, but I'm happy
>>> modifying code... Are there specific interesting debug outputs which I can
>>> modify so they're output under "ms = 1"?
>>
>> I'm basically interested in everything in writer() and write_message(),
>> and reader() and read_message()...
>
> Like this?

Yeah. You could do 2 instead of 1 so you can turn it down. I suspect
that this is the lion's share of what debug 20 will spam to the log, but
hopefully the load is manageable!
sage > -- > diff --git a/src/msg/Pipe.cc b/src/msg/Pipe.cc > index 37b1eeb..db4774f 100644 > --- a/src/msg/Pipe.cc > +++ b/src/msg/Pipe.cc > @@ -1263,7 +1263,7 @@ void Pipe::reader() > > // sleep if (re)connecting > if (state == STATE_STANDBY) { > - ldout(msgr->cct,20) << "reader sleeping during reconnect|standby" << > dendl; > + ldout(msgr->cct, 1) << "reader sleeping during reconnect|standby" << > dendl; >cond.Wait(pipe_lock); >continue; > } > @@ -1272,28 +1272,28 @@ void Pipe::reader() > > char buf[80]; > char tag = -1; > -ldout(msgr->cct,20) << "reader reading tag..." << dendl; > +ldout(msgr->cct, 1) << "reader reading tag..." << dendl; > if (tcp_read((char*)&tag, 1) < 0) { >pipe_lock.Lock(); > - ldout(msgr->cct,2) << "reader couldn't read tag, " << > strerror_r(errno, buf, sizeof(buf)) << dendl; > + ldout(msgr->cct, 1) << "reader couldn't read tag, " << > strerror_r(errno, buf, sizeof(buf)) << dendl; >fault(true); >continue; > } > > if (tag == CEPH_MSGR_TAG_KEEPALIVE) { > - ldout(msgr->cct,20) << "reader got KEEPALIVE" << dendl; > + ldout(msgr->cct, 1) << "reader got KEEPALIVE" << dendl; >pipe_lock.Lock(); >continue; > } > > // open ... 
> if (tag == CEPH_MSGR_TAG_ACK) { > - ldout(msgr->cct,20) << "reader got ACK" << dendl; > + ldout(msgr->cct, 1) << "reader got ACK" << dendl; >ceph_le64 seq; >int rc = tcp_read((char*)&seq, sizeof(seq)); >pipe_lock.Lock(); >if (rc < 0) { > - ldout(msgr->cct,2) << "reader couldn't read ack seq, " << > strerror_r(errno, buf, sizeof(buf)) << dendl; > + ldout(msgr->cct, 1) << "reader couldn't read ack seq, " << > strerror_r(errno, buf, sizeof(buf)) << dendl; > fault(true); >} else if (state != STATE_CLOSED) { > handle_ack(seq); > @@ -1302,7 +1302,7 @@ void Pipe::reader() > } > > else if (tag == CEPH_MSGR_TAG_MSG) { > - ldout(msgr->cct,20) << "reader got MSG" << dendl; > + ldout(msgr->cct, 1) << "reader got MSG" << dendl; >Message *m = 0; >int r = read_message(&m); > > @@ -1342,7 +1342,7 @@ void Pipe::reader() > >cond.Signal(); // wake up writer, to ack this > > - ldout(msgr->cct,10) << "reader got message " > + ldout(msgr->cct, 1) << "reader got message " > << m->get_seq() << " " << m << " " << *m > << dendl; > > @@ -1360,7 +1360,7 @@ void Pipe::reader() > } > > else if (tag == CEPH_MSGR_TAG_CLOSE) { > - ldout(msgr->cct,20) << "reader got CLOSE" << dendl; > + ldout(msgr->cct, 1) << "reader got CLOSE" << dendl; >pipe_lock.Lock(); >if (state == STATE_CLOSING) { > state = STATE_CLOSED; > @@ -1383,7 +1383,7 @@ void Pipe::reader() >reader_running = false; >reader_needs_join = true; >unlock_maybe_reap(); > - ldout(msgr->cct,10) << "reader done" << dendl; > + ldout(msgr->cct, 1) << "reader done" << dendl; > } > > /* write msgs to socket. > @@ -1395,7 +1395,7 @@ void Pipe::writer() > >pipe_lock.Lock(); >while (state != STATE_CLOSED) {// && state != STATE_WAIT) { > -ldout(msgr->cct,10) << "writer: state = " << get_state_name() > +ldout(msgr->cct, 1) << "writer: state = " << get_state_name() > << " policy.server="
Re: Mon losing touch with OSDs
On Fri, Feb 22, 2013 at 05:30:04PM -0800, Sage Weil wrote:
> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>> On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
>>>>>>> I just looked at the logs. I can't tell what happened to cause that 10
>>>>>>> second delay.. strangely, messages were passing from 0 -> 1, but nothing
>>>>>>> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
>>>>
>>>> Is there any way of telling where they were delayed, i.e. in the 1's output
>>>> queue or 0's input queue?
>>>
>>> Yeah, if you bump it up to 'debug ms = 20'. Be aware that that will
>>> generate a lot of logging, though.
>>
>> I really don't want to load the system with too much logging, but I'm happy
>> modifying code... Are there specific interesting debug outputs which I can
>> modify so they're output under "ms = 1"?
>
> I'm basically interested in everything in writer() and write_message(),
> and reader() and read_message()...

Like this?

--
diff --git a/src/msg/Pipe.cc b/src/msg/Pipe.cc
index 37b1eeb..db4774f 100644
--- a/src/msg/Pipe.cc
+++ b/src/msg/Pipe.cc
@@ -1263,7 +1263,7 @@ void Pipe::reader()
 
     // sleep if (re)connecting
     if (state == STATE_STANDBY) {
-      ldout(msgr->cct,20) << "reader sleeping during reconnect|standby" << dendl;
+      ldout(msgr->cct, 1) << "reader sleeping during reconnect|standby" << dendl;
       cond.Wait(pipe_lock);
       continue;
     }
@@ -1272,28 +1272,28 @@ void Pipe::reader()
 
     char buf[80];
     char tag = -1;
-    ldout(msgr->cct,20) << "reader reading tag..." << dendl;
+    ldout(msgr->cct, 1) << "reader reading tag..." << dendl;
     if (tcp_read((char*)&tag, 1) < 0) {
       pipe_lock.Lock();
-      ldout(msgr->cct,2) << "reader couldn't read tag, " << strerror_r(errno, buf, sizeof(buf)) << dendl;
+      ldout(msgr->cct, 1) << "reader couldn't read tag, " << strerror_r(errno, buf, sizeof(buf)) << dendl;
       fault(true);
       continue;
     }
 
     if (tag == CEPH_MSGR_TAG_KEEPALIVE) {
-      ldout(msgr->cct,20) << "reader got KEEPALIVE" << dendl;
+      ldout(msgr->cct, 1) << "reader got KEEPALIVE" << dendl;
       pipe_lock.Lock();
       continue;
     }
 
     // open ...
 
     if (tag == CEPH_MSGR_TAG_ACK) {
-      ldout(msgr->cct,20) << "reader got ACK" << dendl;
+      ldout(msgr->cct, 1) << "reader got ACK" << dendl;
       ceph_le64 seq;
       int rc = tcp_read((char*)&seq, sizeof(seq));
       pipe_lock.Lock();
       if (rc < 0) {
-        ldout(msgr->cct,2) << "reader couldn't read ack seq, " << strerror_r(errno, buf, sizeof(buf)) << dendl;
+        ldout(msgr->cct, 1) << "reader couldn't read ack seq, " << strerror_r(errno, buf, sizeof(buf)) << dendl;
         fault(true);
       } else if (state != STATE_CLOSED) {
         handle_ack(seq);
@@ -1302,7 +1302,7 @@ void Pipe::reader()
     }
 
     else if (tag == CEPH_MSGR_TAG_MSG) {
-      ldout(msgr->cct,20) << "reader got MSG" << dendl;
+      ldout(msgr->cct, 1) << "reader got MSG" << dendl;
       Message *m = 0;
       int r = read_message(&m);
 
@@ -1342,7 +1342,7 @@ void Pipe::reader()
 
       cond.Signal();  // wake up writer, to ack this
 
-      ldout(msgr->cct,10) << "reader got message "
+      ldout(msgr->cct, 1) << "reader got message "
                          << m->get_seq() << " " << m << " " << *m
                          << dendl;
 
@@ -1360,7 +1360,7 @@ void Pipe::reader()
     }
 
     else if (tag == CEPH_MSGR_TAG_CLOSE) {
-      ldout(msgr->cct,20) << "reader got CLOSE" << dendl;
+      ldout(msgr->cct, 1) << "reader got CLOSE" << dendl;
       pipe_lock.Lock();
       if (state == STATE_CLOSING) {
         state = STATE_CLOSED;
@@ -1383,7 +1383,7 @@ void Pipe::reader()
   reader_running = false;
   reader_needs_join = true;
   unlock_maybe_reap();
-  ldout(msgr->cct,10) << "reader done" << dendl;
+  ldout(msgr->cct, 1) << "reader done" << dendl;
 }
 
 /* write msgs to socket.
@@ -1395,7 +1395,7 @@ void Pipe::writer()
 
   pipe_lock.Lock();
   while (state != STATE_CLOSED) {// && state != STATE_WAIT) {
-    ldout(msgr->cct,10) << "writer: state = " << get_state_name()
+    ldout(msgr->cct, 1) << "writer: state = " << get_state_name()
                        << " policy.server=" << policy.server << dendl;
 
     // standby?
@@ -1413,7 +1413,7 @@ void Pipe::writer()
 
     if (state == STATE_CLOSING) {
       // write close tag
-      ldout(msgr->cct,20) << "writer writing CLOSE tag" << dendl;
+      ldout(msgr->cct, 1) << "writer writing CLOSE tag" << dendl;
       char tag = CEPH_MSGR_TAG_CLOSE;
       state = STATE_CLOSED;
       state_closed.set(1);
@@ -1436,7 +1436,7 @@ void Pipe::writer()
       int rc = write_keepalive();
       pipe_loc
Hadoop release jars
Hi all, I've pushed up some changes that let us build stand-alone jar files for the Hadoop CephFS bindings. github.com/ceph/hadoop-common.git cephfs/branch-1.0-build-jar Running "ant cephfs" will produce "build/hadoop-cephfs.jar". I've tested it locally and things work well, so hopefully this means we can continue to develop in the hadoop-common tree and produce jar releases for whatever version combinations we care about. Is this something that should be easy to integrate into gitbuilder so we can link off the documentation page to the release jars? -Noah -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Mon losing touch with OSDs
On Sat, 23 Feb 2013, Chris Dunlop wrote: > On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote: > > On Sat, 23 Feb 2013, Chris Dunlop wrote: > >> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote: > >>> On Sat, 23 Feb 2013, Chris Dunlop wrote: > On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote: > > On Fri, 22 Feb 2013, Chris Dunlop wrote: > >> G'day, > >> > >> It seems there might be two issues here: the first being the delayed > >> receipt of echo replies causing an seemingly otherwise healthy osd to > >> be > >> marked down, the second being the lack of recovery once the downed osd > >> is > >> recognised as up again. > >> > >> Is it worth my opening tracker reports for this, just so it doesn't get > >> lost? > > > > I just looked at the logs. I can't tell what happend to cause that 10 > > second delay.. strangely, messages were passing from 0 -> 1, but > > nothing > > came back from 1 -> 0 (although 1 was queuing, if not sending, them). > >> > >> Is there any way of telling where they were delayed, i.e. in the 1's output > >> queue or 0's input queue? > > > > Yeah, if you bump it up to 'debug ms = 20'. Be aware that that will > > generate a lot of logging, though. > > I really don't want to load the system with too much logging, but I'm happy > modifying code... Are there specific interesting debug outputs which I can > modify so they're output under "ms = 1"? I'm basically interested in everything in writer() and write_message(), and reader() and read_message()... sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Mon losing touch with OSDs
On Sat, Feb 23, 2013 at 11:50:26AM +1100, Chris Dunlop wrote:
> On Fri, Feb 22, 2013 at 04:25:39PM -0800, Sage Weil wrote:
>> Hi Chris-
>>
>> Can you confirm that both ceph-osd daemons are running v0.56.3 (i.e.,
>> they were restarted after the upgrade)?
>
> Not absolutely, but the indications are good: the osd.1 process
> was started Feb 16 08:38:43 2013, 20 minutes before I sent my
> email saying it had been upgraded. The osd.0 process was
> restarted more recently to kick things along after the most
> recent problem, however my command line history shows an "apt-get
> upgrade" followed by a "service ceph restart".

I can't see anything in the logs indicating the various daemon version
numbers. Unless I'm blind and it's already there, perhaps it would be a
good idea for every daemon to log its version number when it starts?

Chris
Re: Mon losing touch with OSDs
On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote: > On Sat, 23 Feb 2013, Chris Dunlop wrote: >> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote: >>> On Sat, 23 Feb 2013, Chris Dunlop wrote: On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote: > On Fri, 22 Feb 2013, Chris Dunlop wrote: >> G'day, >> >> It seems there might be two issues here: the first being the delayed >> receipt of echo replies causing an seemingly otherwise healthy osd to be >> marked down, the second being the lack of recovery once the downed osd is >> recognised as up again. >> >> Is it worth my opening tracker reports for this, just so it doesn't get >> lost? > > I just looked at the logs. I can't tell what happend to cause that 10 > second delay.. strangely, messages were passing from 0 -> 1, but nothing > came back from 1 -> 0 (although 1 was queuing, if not sending, them). >> >> Is there any way of telling where they were delayed, i.e. in the 1's output >> queue or 0's input queue? > > Yeah, if you bump it up to 'debug ms = 20'. Be aware that that will > generate a lot of logging, though. I really don't want to load the system with too much logging, but I'm happy modifying code... Are there specific interesting debug outputs which I can modify so they're output under "ms = 1"? Chris -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Mon losing touch with OSDs
On Fri, Feb 22, 2013 at 04:25:39PM -0800, Sage Weil wrote: > Hi Chris- > > Can you confirm that both ceph-osd daemons are running v0.56.3 (i.e., > they were restarted after the upgrade)? Not absolutely, but the indications are good: the osd.1 process was started Feb 16 08:38:43 2013, 20 minutes before I sent my email saying it had been upgraded. The osd.0 process was restarted more recently to kick things along after the most recent problem, however my command line history shows a "apt-get upgrade" followed by a "service ceph restart". Chris -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OSD memory leaks?
On Fri, 22 Feb 2013, Sébastien Han wrote:
> Hi all,
>
> I finally got a core dump.
>
> I did it with a kill -SEGV on the OSD process.
>
> https://www.dropbox.com/s/ahv6hm0ipnak5rf/core-ceph-osd-11-0-0-20100-1361539008
>
> Hope we will get something out of it :-).

AHA! We have a theory. The pg log isn't trimmed during scrub (because
the old scrub code required that), but the new (deep) scrub can take a
very long time, which means the pg log will eat RAM in the meantime..
especially under high iops.

Can you try wip-osd-log-trim (which is bobtail + a simple patch) and
see if that seems to work? Note that that patch shouldn't be run in a
mixed argonaut+bobtail cluster, since it isn't properly checking
whether the scrub is classic or chunky/deep.

Thanks!
sage

> --
> Regards,
> Sébastien Han.
>
> On Fri, Jan 11, 2013 at 7:13 PM, Gregory Farnum wrote:
>> On Fri, Jan 11, 2013 at 6:57 AM, Sébastien Han wrote:
>>>> Is osd.1 using the heap profiler as well? Keep in mind that active use
>>>> of the memory profiler will itself cause memory usage to increase --
>>>> this sounds a bit like that to me since it's staying stable at a large
>>>> but finite portion of total memory.
>>>
>>> Well, the memory consumption was already high before the profiler was
>>> started. So yes, with the memory profiler enabled an OSD might consume
>>> more memory, but this doesn't cause the memory leaks.
>>
>> My concern is that maybe you saw a leak but when you restarted with
>> the memory profiling you lost whatever conditions caused it.
>>
>>> Any ideas? Nothing to say about my scrubbing theory?
>>
>> I like it, but Sam indicates that without some heap dumps which
>> capture the actual leak then scrub is too large to effectively code
>> review for leaks. :(
>> -Greg
Re: Mon losing touch with OSDs
Hi Chris- Can you confirm that both ceph-osd daemons are running v0.56.3 (i.e., they were restarted after the upgrade)? sage On Fri, 22 Feb 2013, Sage Weil wrote: > On Sat, 23 Feb 2013, Chris Dunlop wrote: > > On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote: > > > On Sat, 23 Feb 2013, Chris Dunlop wrote: > > >> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote: > > >>> On Fri, 22 Feb 2013, Chris Dunlop wrote: > > G'day, > > > > It seems there might be two issues here: the first being the delayed > > receipt of echo replies causing an seemingly otherwise healthy osd to > > be > > marked down, the second being the lack of recovery once the downed osd > > is > > recognised as up again. > > > > Is it worth my opening tracker reports for this, just so it doesn't get > > lost? > > >>> > > >>> I just looked at the logs. I can't tell what happend to cause that 10 > > >>> second delay.. strangely, messages were passing from 0 -> 1, but > > >>> nothing > > >>> came back from 1 -> 0 (although 1 was queuing, if not sending, them). > > > > Is there any way of telling where they were delayed, i.e. in the 1's output > > queue or 0's input queue? > > Yeah, if you bump it up to 'debug ms = 20'. Be aware that that will > generate a lot of logging, though. > > > >>> The strange bit is that after this, you get those indefinite hangs. > > >>> From > > >>> the logs it looks like the OSD rebound to an old port that was > > >>> previously > > >>> open from osd.0.. probably from way back. Do you have logs going > > >>> further > > >>> back than what you posted? Also, do you have osdmaps, say, 750 and > > >>> onward? It looks like there is a bug in the connection handling code > > >>> (that is unrelated to the delay above). > > >> > > >> Currently uploading logs starting midnight to dropbox, will send > > >> links when when they're up. > > >> > > >> How would I retrieve the interesting osdmaps? > > > > > > They are in the monitor data directory, in the osdmap_full dir. 
> > > > Logs from midnight onwards and osdmaps are in this folder: > > > > https://www.dropbox.com/sh/7nq7gr2u2deorcu/Nvw3FFGiy2 > > > > ceph-mon.b2.log.bz2 > > ceph-mon.b4.log.bz2 > > ceph-mon.b5.log.bz2 > > ceph-osd.0.log.bz2 > > ceph-osd.1.log.bz2 (still uploading as I type) > > osdmaps.zip > > I'll take a look... -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Mon losing touch with OSDs
On Sat, 23 Feb 2013, Chris Dunlop wrote: > On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote: > > On Sat, 23 Feb 2013, Chris Dunlop wrote: > >> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote: > >>> On Fri, 22 Feb 2013, Chris Dunlop wrote: > G'day, > > It seems there might be two issues here: the first being the delayed > receipt of echo replies causing an seemingly otherwise healthy osd to be > marked down, the second being the lack of recovery once the downed osd is > recognised as up again. > > Is it worth my opening tracker reports for this, just so it doesn't get > lost? > >>> > >>> I just looked at the logs. I can't tell what happend to cause that 10 > >>> second delay.. strangely, messages were passing from 0 -> 1, but nothing > >>> came back from 1 -> 0 (although 1 was queuing, if not sending, them). > > Is there any way of telling where they were delayed, i.e. in the 1's output > queue or 0's input queue? Yeah, if you bump it up to 'debug ms = 20'. Be aware that that will generate a lot of logging, though. > >>> The strange bit is that after this, you get those indefinite hangs. From > >>> the logs it looks like the OSD rebound to an old port that was previously > >>> open from osd.0.. probably from way back. Do you have logs going further > >>> back than what you posted? Also, do you have osdmaps, say, 750 and > >>> onward? It looks like there is a bug in the connection handling code > >>> (that is unrelated to the delay above). > >> > >> Currently uploading logs starting midnight to dropbox, will send > >> links when when they're up. > >> > >> How would I retrieve the interesting osdmaps? > > > > They are in the monitor data directory, in the osdmap_full dir. 
> > Logs from midnight onwards and osdmaps are in this folder: > > https://www.dropbox.com/sh/7nq7gr2u2deorcu/Nvw3FFGiy2 > > ceph-mon.b2.log.bz2 > ceph-mon.b4.log.bz2 > ceph-mon.b5.log.bz2 > ceph-osd.0.log.bz2 > ceph-osd.1.log.bz2 (still uploading as I type) > osdmaps.zip I'll take a look... sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Mon losing touch with OSDs
On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote: > On Sat, 23 Feb 2013, Chris Dunlop wrote: >> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote: >>> On Fri, 22 Feb 2013, Chris Dunlop wrote: G'day, It seems there might be two issues here: the first being the delayed receipt of echo replies causing an seemingly otherwise healthy osd to be marked down, the second being the lack of recovery once the downed osd is recognised as up again. Is it worth my opening tracker reports for this, just so it doesn't get lost? >>> >>> I just looked at the logs. I can't tell what happend to cause that 10 >>> second delay.. strangely, messages were passing from 0 -> 1, but nothing >>> came back from 1 -> 0 (although 1 was queuing, if not sending, them). Is there any way of telling where they were delayed, i.e. in the 1's output queue or 0's input queue? >>> The strange bit is that after this, you get those indefinite hangs. From >>> the logs it looks like the OSD rebound to an old port that was previously >>> open from osd.0.. probably from way back. Do you have logs going further >>> back than what you posted? Also, do you have osdmaps, say, 750 and >>> onward? It looks like there is a bug in the connection handling code >>> (that is unrelated to the delay above). >> >> Currently uploading logs starting midnight to dropbox, will send >> links when when they're up. >> >> How would I retrieve the interesting osdmaps? > > They are in the monitor data directory, in the osdmap_full dir. Logs from midnight onwards and osdmaps are in this folder: https://www.dropbox.com/sh/7nq7gr2u2deorcu/Nvw3FFGiy2 ceph-mon.b2.log.bz2 ceph-mon.b4.log.bz2 ceph-mon.b5.log.bz2 ceph-osd.0.log.bz2 ceph-osd.1.log.bz2 (still uploading as I type) osdmaps.zip Cheers, Chris -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Mon losing touch with OSDs
On Sat, 23 Feb 2013, Chris Dunlop wrote: > On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote: > > On Fri, 22 Feb 2013, Chris Dunlop wrote: > >> G'day, > >> > >> It seems there might be two issues here: the first being the delayed > >> receipt of echo replies causing an seemingly otherwise healthy osd to be > >> marked down, the second being the lack of recovery once the downed osd is > >> recognised as up again. > >> > >> Is it worth my opening tracker reports for this, just so it doesn't get > >> lost? > > > > I just looked at the logs. I can't tell what happend to cause that 10 > > second delay.. strangely, messages were passing from 0 -> 1, but nothing > > came back from 1 -> 0 (although 1 was queuing, if not sending, them). > > > > The strange bit is that after this, you get those indefinite hangs. From > > the logs it looks like the OSD rebound to an old port that was previously > > open from osd.0.. probably from way back. Do you have logs going further > > back than what you posted? Also, do you have osdmaps, say, 750 and > > onward? It looks like there is a bug in the connection handling code > > (that is unrelated to the delay above). > > Currently uploading logs starting midnight to dropbox, will send > links when when they're up. > > How would I retrieve the interesting osdmaps? They are in the monitor data directory, in the osdmap_full dir. Thanks! sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Mon losing touch with OSDs
On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
> On Fri, 22 Feb 2013, Chris Dunlop wrote:
>> G'day,
>>
>> It seems there might be two issues here: the first being the delayed
>> receipt of echo replies causing a seemingly otherwise healthy osd to be
>> marked down, the second being the lack of recovery once the downed osd is
>> recognised as up again.
>>
>> Is it worth my opening tracker reports for this, just so it doesn't get
>> lost?
>
> I just looked at the logs. I can't tell what happened to cause that 10
> second delay.. strangely, messages were passing from 0 -> 1, but nothing
> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
>
> The strange bit is that after this, you get those indefinite hangs. From
> the logs it looks like the OSD rebound to an old port that was previously
> open from osd.0.. probably from way back. Do you have logs going further
> back than what you posted? Also, do you have osdmaps, say, 750 and
> onward? It looks like there is a bug in the connection handling code
> (that is unrelated to the delay above).

Currently uploading logs starting midnight to dropbox, will send
links when they're up.

How would I retrieve the interesting osdmaps?

Chris.
Re: [PATCH] ceph: fix statvfs fr_size
On Fri, Feb 22, 2013 at 3:23 PM, Sage Weil wrote:
> Different versions of glibc are broken in different ways, but the short of
> it is that for the time being, frsize should == bsize, and be used as the
> multiple for the blocks, free, and available fields. This mirrors what is
> done for NFS. The previous reporting of the page size for frsize meant
> that newer glibc and df would report a very small value for the fs size.
>
> Fixes http://tracker.ceph.com/issues/3793.
>
> Signed-off-by: Sage Weil
> ---
>  fs/ceph/super.c |    7 ++-
>  fs/ceph/super.h |    2 +-
>  2 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ceph/super.c b/fs/ceph/super.c
> index e86aa99..006f94b 100644
> --- a/fs/ceph/super.c
> +++ b/fs/ceph/super.c
> @@ -71,8 +71,14 @@ static int ceph_statfs(struct dentry *dentry, struct kstatfs *buf)
>         /*
>          * express utilization in terms of large blocks to avoid
>          * overflow on 32-bit machines.
> +        *
> +        * NOTE: for the time being, we make bsize == frsize so humor

s/so/to :)

> +        * not-yet-ancient versions of glibc that are broken.
> +        * Someday, we will probably want to report a real block
> +        * size...  whatever that may mean for a network file system!
>          */
>         buf->f_bsize = 1 << CEPH_BLOCK_SHIFT;
> +       buf->f_frsize = 1 << CEPH_BLOCK_SHIFT;
>         buf->f_blocks = le64_to_cpu(st.kb) >> (CEPH_BLOCK_SHIFT-10);
>         buf->f_bfree = le64_to_cpu(st.kb_avail) >> (CEPH_BLOCK_SHIFT-10);
>         buf->f_bavail = le64_to_cpu(st.kb_avail) >> (CEPH_BLOCK_SHIFT-10);
> @@ -80,7 +86,6 @@ static int ceph_statfs(struct dentry *dentry, struct kstatfs *buf)
>         buf->f_files = le64_to_cpu(st.num_objects);
>         buf->f_ffree = -1;
>         buf->f_namelen = NAME_MAX;
> -       buf->f_frsize = PAGE_CACHE_SIZE;
>
>         /* leave fsid little-endian, regardless of host endianness */
>         fsid = *(u64 *)(&monmap->fsid) ^ *((u64 *)&monmap->fsid + 1);
>
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index 9861cce..604526a 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -21,7 +21,7 @@
>
>  /* large granularity for statfs utilization stats to facilitate
>   * large volume sizes on 32-bit machines. */
> -#define CEPH_BLOCK_SHIFT   20  /* 1 MB */
> +#define CEPH_BLOCK_SHIFT   22  /* 4 MB */
>  #define CEPH_BLOCK         (1 << CEPH_BLOCK_SHIFT)
>
>  #define CEPH_MOUNT_OPT_DIRSTAT  (1<<4) /* `cat dirname` for stats */

I've already voiced my concerns about this (basically de-tuning the
stack above us with the large block sizes), but I dunno that they have
much validity, and this should work from the POV of our code
correctness, so other than the typo above:

Reviewed-by: Greg Farnum
[PATCH] ceph: fix statvfs fr_size
Different versions of glibc are broken in different ways, but the short of it is that for the time being, frsize should == bsize, and be used as the multiple for the blocks, free, and available fields. This mirrors what is done for NFS. The previous reporting of the page size for frsize meant that newer glibc and df would report a very small value for the fs size. Fixes http://tracker.ceph.com/issues/3793. Signed-off-by: Sage Weil --- fs/ceph/super.c |7 ++- fs/ceph/super.h |2 +- 2 files changed, 7 insertions(+), 2 deletions(-) diff --git a/fs/ceph/super.c b/fs/ceph/super.c index e86aa99..006f94b 100644 --- a/fs/ceph/super.c +++ b/fs/ceph/super.c @@ -71,8 +71,14 @@ static int ceph_statfs(struct dentry *dentry, struct kstatfs *buf) /* * express utilization in terms of large blocks to avoid * overflow on 32-bit machines. +* +* NOTE: for the time being, we make bsize == frsize so humor +* not-yet-ancient versions of glibc that are broken. +* Someday, we will probably want to report a real block +* size... whatever that may mean for a network file system! */ buf->f_bsize = 1 << CEPH_BLOCK_SHIFT; + buf->f_frsize = 1 << CEPH_BLOCK_SHIFT; buf->f_blocks = le64_to_cpu(st.kb) >> (CEPH_BLOCK_SHIFT-10); buf->f_bfree = le64_to_cpu(st.kb_avail) >> (CEPH_BLOCK_SHIFT-10); buf->f_bavail = le64_to_cpu(st.kb_avail) >> (CEPH_BLOCK_SHIFT-10); @@ -80,7 +86,6 @@ static int ceph_statfs(struct dentry *dentry, struct kstatfs *buf) buf->f_files = le64_to_cpu(st.num_objects); buf->f_ffree = -1; buf->f_namelen = NAME_MAX; - buf->f_frsize = PAGE_CACHE_SIZE; /* leave fsid little-endian, regardless of host endianness */ fsid = *(u64 *)(&monmap->fsid) ^ *((u64 *)&monmap->fsid + 1); diff --git a/fs/ceph/super.h b/fs/ceph/super.h index 9861cce..604526a 100644 --- a/fs/ceph/super.h +++ b/fs/ceph/super.h @@ -21,7 +21,7 @@ /* large granularity for statfs utilization stats to facilitate * large volume sizes on 32-bit machines. 
*/ -#define CEPH_BLOCK_SHIFT 20 /* 1 MB */ +#define CEPH_BLOCK_SHIFT 22 /* 4 MB */ #define CEPH_BLOCK (1 << CEPH_BLOCK_SHIFT) #define CEPH_MOUNT_OPT_DIRSTAT (1<<4) /* `cat dirname` for stats */ -- 1.7.9.5
Re: Mon losing touch with OSDs
On Fri, 22 Feb 2013, Chris Dunlop wrote: > G'day, > > It seems there might be two issues here: the first being the delayed > receipt of echo replies causing a seemingly otherwise healthy osd to be > marked down, the second being the lack of recovery once the downed osd is > recognised as up again. > > Is it worth my opening tracker reports for this, just so it doesn't get > lost? I just looked at the logs. I can't tell what happened to cause that 10 second delay.. strangely, messages were passing from 0 -> 1, but nothing came back from 1 -> 0 (although 1 was queuing, if not sending, them). The strange bit is that after this, you get those indefinite hangs. From the logs it looks like the OSD rebound to an old port that was previously open from osd.0.. probably from way back. Do you have logs going further back than what you posted? Also, do you have osdmaps, say, 750 and onward? It looks like there is a bug in the connection handling code (that is unrelated to the delay above). Thanks! sage > > Cheers, > > Chris > > On Wed, Feb 20, 2013 at 01:07:03PM +1100, Chris Dunlop wrote: > > On Tue, Feb 19, 2013 at 02:02:03PM +1100, Chris Dunlop wrote: > >> On Sun, Feb 17, 2013 at 05:44:29PM -0800, Sage Weil wrote: > >>> On Mon, 18 Feb 2013, Chris Dunlop wrote: > On Sat, Feb 16, 2013 at 09:05:21AM +1100, Chris Dunlop wrote: > > On Thu, Feb 14, 2013 at 08:57:11PM -0800, Sage Weil wrote: > >> On Fri, 15 Feb 2013, Chris Dunlop wrote: > >>> In an otherwise seemingly healthy cluster (ceph 0.56.2), what might > >>> cause the > >>> mons to lose touch with the osds? > >> > >> Can you enable 'debug ms = 1' on the mons and leave them that way, in > >> the > >> hopes that this happens again? It will give us more information to go > >> on. > > > > Debug turned on. > > We haven't experienced the cluster losing touch with the osds completely > since upgrading from 0.56.2 to 0.56.3, but we did lose touch with osd.1 > for a few seconds before it recovered.
See below for logs (reminder: 3 > boxes, b2 is mon-only, b4 is mon+osd.0, b5 is mon+osd.1). > >>> > >>> Hrm, I don't see any obvious clues. You could enable 'debug ms = 1' on > >>> the osds as well. That will give us more to go on if/when it happens > >>> again, and should not affect performance significantly. > >> > >> Done: ceph osd tell '*' injectargs '--debug-ms 1' > >> > >> Now to wait for it to happen again. > > > > OK, we got it again. Full logs covering the incident available at: > > > > https://www.dropbox.com/s/kguzwyjfglv3ypl/ceph-logs.zip > > > > Archive: /tmp/ceph-logs.zip > > Length MethodSize CmprDateTime CRC-32 Name > > -- --- -- - > >11492 Defl:X 1186 90% 2013-02-20 12:04 c0cba4ae ceph-mon.b2.log > > 1270789 Defl:X89278 93% 2013-02-20 12:00 2208d035 ceph-mon.b4.log > > 1375858 Defl:X 104025 92% 2013-02-20 12:05 c64c1dad ceph-mon.b5.log > > 2020042 Defl:X 215000 89% 2013-02-20 10:40 f74ae4ca ceph-osd.0.log > > 2075512 Defl:X 224098 89% 2013-02-20 12:05 b454d2ec ceph-osd.1.log > > 154938 Defl:X12989 92% 2013-02-20 12:04 d2729b05 ceph.log > > --- ------ > > 6908631 646576 91%6 files > > > > My naive analysis, based on the log extracts below (best viewed on a wide > > screen!)... > > > > Osd.0 starts hearing much-delayed ping_replies from osd.1 and tells the mon, > > which marks osd.1 down. > > > > However the whole time, the osd.1 log indicates that it's receiving and > > responding to each ping from osd.0 in a timely fashion. In contrast, the > > osd.0 > > log indicates it isn't seeing the osd.1 replies for a while, then sees them > > all > > arrive in a flurry, until they're "delayed" enough to cause osd.0 to tell > > the > > mon. > > > > During the time osd.0 is not seeing the osd.1 ping_replies, there's other > > traffic > > (osd_op, osd_sub_op, osd_sub_op_reply etc.) between osd.0 and osd.1, > > indicating > > that it's not a network problem. > > > > The load on both osds during this period was >90% idle and <1% iow. 
> > > > Is this pointing to osd.0 experiencing some kind of scheduling or priority > > starvation on the ping thread (assuming the ping is in its own thread)? > > > > The next odd thing is that, although the osds are both back by 04:38:50 ("2 > > osds: 2 up, 2 in"), the system still wasn't working (see the disk stats for > > both osd.0 and osd.1) and didn't recover until ceph (mon + osd) was > > restarted > > on one of the boxes at around 05:50 (not shown in the logs, but full logs > > available if needed). > > > > Prior to the restart: > > > > # ceph health > > HEALTH_WARN 281 pgs peering; 281 pgs stuck inactive; 576 pgs stuck unclean > > > > (Sorry, once again didn't get a 'ceph -s' prior to the restart.)
OSD memory usage
Hi folks, I was having trouble with OSDs crashing under ceph 0.56.2 (CentOS, 64-bit, installed from rpms, using the "elrepo" kernel), so I eagerly installed 0.56.3. Since then, I've been having a lot of trouble getting the OSDs running, either because of previous crashes or because of some change in 0.56.3. What's happening now is that the OSD processes are using so much memory that the machines start swapping, and eventually die. Today, I tried systematically starting up a few OSDs at a time and watching memory usage. The processes climb pretty quickly up to 2-3 GB in RSIZE. If I let them run for a while and then restart them, I find that they tend to settle into an RSIZE of about 1.5 GB. Unfortunately, I don't seem to have any records of how big these processes were under 0.56.2, but this is way too big to fit into the memory available, when I start up all of the OSD processes. My impression was that the sizes used to be well under 1 GB when the cluster was idle. Is there anything I can do to reduce the memory footprint of ceph-osd? Thanks for any advice. Bryan -- Bryan Wright |"If you take cranberries and stew them like Physics Department| applesauce, they taste much more like prunes University of Virginia| than rhubarb does." -- Groucho Charlottesville, VA 22901| (434) 924-7218| br...@virginia.edu
Re: [PATCH 3/3] ceph: fix vmtruncate deadlock
I haven't spent that much time in the kernel client, but this patch isn't working out for me. In particular, I'm pretty sure we need to preserve this: > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c > index 5d5c32b..b9d8417 100644 > --- a/fs/ceph/caps.c > +++ b/fs/ceph/caps.c > @@ -2067,12 +2067,6 @@ static int try_get_cap_refs(struct ceph_inode_info > *ci, int need, int want, > } > have = __ceph_caps_issued(ci, &implemented); > > - /* > -* disallow writes while a truncate is pending > -*/ > - if (ci->i_truncate_pending) > - have &= ~CEPH_CAP_FILE_WR; > - > if ((have & need) == need) { > /* > * Look at (implemented & ~have & not) so that we keep waiting Because if there's a pending truncate, we really can't write. You do handle it in the particular case of doing buffered file writes, but these caps are a marker of permission, and the client shouldn't have write permission to a file until it's up to date on the truncates. Or am I misunderstanding something? > diff --git a/fs/ceph/file.c b/fs/ceph/file.c > index a1e5b81..bf7849a 100644 > --- a/fs/ceph/file.c > +++ b/fs/ceph/file.c > @@ -653,7 +653,6 @@ static ssize_t ceph_aio_read(struct kiocb *iocb, const > struct iovec *iov, > dout("aio_read %p %llx.%llx %llu~%u trying to get caps on %p\n", > inode, ceph_vinop(inode), pos, (unsigned)len, inode); > again: > - __ceph_do_pending_vmtruncate(inode); There doesn't seem to be any kind of replacement for this? We need to do any pending truncates before reading or we might get stale data back. The first two in the series look good to me, but I'll wait and pull them in as a unit with this one once we're done discussing. :) -Greg -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ceph scalar & replicas performance
On Thu, Feb 21, 2013 at 5:01 PM, wrote:
> Hi all, I have some problems after my scaling performance test!
>
> Setup:
> Linux kernel: 3.2.0
> OS: Ubuntu 12.04
> Storage server: 11 HDD (each storage server has 11 osd, 7200 rpm, 1T) + 10GbE NIC + RAID card: LSI MegaRAID SAS 9260-4i
> For every HDD: RAID0, Write Policy: Write Back with BBU, Read Policy: ReadAhead, IO Policy: Direct
> Storage server number: 1 to 4
> Ceph version: 0.48.2
> Replicas: 2
>
> FIO cmd:
> [Sequential Read]
> fio --iodepth=32 --numjobs=1 --runtime=120 --bs=65536 --rw=read --ioengine=libaio --group_reporting --direct=1 --eta=always --ramp_time=10 --thinktime=10
> [Sequential Write]
> fio --iodepth=32 --numjobs=1 --runtime=120 --bs=65536 --rw=write --ioengine=libaio --group_reporting --direct=1 --eta=always --ramp_time=10 --thinktime=10
> [Random Read]
> fio --iodepth=32 --numjobs=8 --runtime=120 --bs=65536 --rw=randread --ioengine=libaio --group_reporting --direct=1 --eta=always --ramp_time=10 --thinktime=10
> [Random Write]
> fio --iodepth=32 --numjobs=8 --runtime=120 --bs=65536 --rw=randwrite --ioengine=libaio --group_reporting --direct=1 --eta=always --ramp_time=10 --thinktime=10
>
> Use a ceph client to create a 1T RBD image for testing; the client also has a 10GbE NIC, Linux kernel 3.2.0, Ubuntu 12.04.
>
> Performance result, bandwidth (MB/sec):
>
> storage servers | Seq Read | Seq Write | Rand Read | Rand Write
> 1               |      259 |        76 |       837 |         26
> 2               |      349 |       121 |       950 |         45
> 3               |      354 |       108 |       490 |         71
> 4               |      338 |       103 |       610 |         89
>
> We expected bandwidth to increase as storage servers are added in all cases, but the results do not show that. Can you share your thoughts on read/write bandwidth as storage servers increase?

There's a bunch of stuff that could be weird here. Is your switch capable of handling all the traffic going over it? Have you benchmarked the drives and filesystems on each node individually to make sure they all have the same behavior, or are some of your additions slower than the others? (My money is on you having some slow drives that are dragging everything down.)

> In another case, we fixed the setup at 4 storage servers, then adjusted the number of replicas from 2 to 4.
>
> Performance result, bandwidth (MB/sec):
>
> replicas | Seq Read | Seq Write | Rand Read | Rand Write
> 2        |      338 |       103 |       614 |         89
> 3        |      337 |        76 |       791 |         62
> 4        |      337 |        60 |       754 |         43
>
> The bandwidth of writes decreasing as replicas increase is easy to understand, but why did read bandwidth not increase?

Reads are always served from the "primary" OSD, but even if they weren't, you distribute the same number of reads over the same number of disks no matter how many replicas you have of each individual data block... But in particular the change in random read values that you're seeing indicates that your data is very noisy — I'm not sure I'd trust any of the values you're seeing, especially the weirder trends. It might be all noise and no real data value. -Greg
Re: [PATCH 2/5] libceph: separate non-locked fault handling
On 02/22/2013 11:26 AM, Alex Elder wrote: > An error occurring on a ceph connection is treated as a fault, causing the connection to be reset. The initial part of this fault handling has to be done while holding the connection mutex, but it must then be dropped for the last part. . . . Sorry about the duplicate(s) and the messed up subject lines. I ran into a little trouble with gmail in the middle of posting these. -Alex
[PATCH 5/5, v2] libceph: indent properly
This is just the second part of a two-part patch. It simply indents a block of code. This patch is going to be merged into its predecessor following review. Signed-off-by: Alex Elder --- v2: rebased net/ceph/messenger.c | 80 +- 1 file changed, 40 insertions(+), 40 deletions(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 223406f..2c0669f 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -2390,50 +2390,50 @@ static void con_work(struct work_struct *work) bool fault; mutex_lock(&con->mutex); -while (true) { - int ret; + while (true) { + int ret; - if ((fault = con_sock_closed(con))) { - dout("%s: con %p SOCK_CLOSED\n", __func__, con); - break; - } - if (con_backoff(con)) { - dout("%s: con %p BACKOFF\n", __func__, con); - break; - } - if (con->state == CON_STATE_STANDBY) { - dout("%s: con %p STANDBY\n", __func__, con); - break; - } - if (con->state == CON_STATE_CLOSED) { - dout("%s: con %p CLOSED\n", __func__, con); - BUG_ON(con->sock); - break; - } - if (con->state == CON_STATE_PREOPEN) { - dout("%s: con %p PREOPEN\n", __func__, con); - BUG_ON(con->sock); - } + if ((fault = con_sock_closed(con))) { + dout("%s: con %p SOCK_CLOSED\n", __func__, con); + break; + } + if (con_backoff(con)) { + dout("%s: con %p BACKOFF\n", __func__, con); + break; + } + if (con->state == CON_STATE_STANDBY) { + dout("%s: con %p STANDBY\n", __func__, con); + break; + } + if (con->state == CON_STATE_CLOSED) { + dout("%s: con %p CLOSED\n", __func__, con); + BUG_ON(con->sock); + break; + } + if (con->state == CON_STATE_PREOPEN) { + dout("%s: con %p PREOPEN\n", __func__, con); + BUG_ON(con->sock); + } - ret = try_read(con); - if (ret < 0) { - if (ret == -EAGAIN) - continue; - con->error_msg = "socket error on read"; - fault = true; - break; - } + ret = try_read(con); + if (ret < 0) { + if (ret == -EAGAIN) + continue; + con->error_msg = "socket error on read"; + fault = true; + break; + } - ret = try_write(con); - if (ret < 0) { - if (ret == -EAGAIN) - continue; - 
con->error_msg = "socket error on write"; - fault = true; - } + ret = try_write(con); + if (ret < 0) { + if (ret == -EAGAIN) + continue; + con->error_msg = "socket error on write"; + fault = true; + } - break; /* If we make it to here, we're done */ -} + break; /* If we make it to here, we're done */ + } if (fault) con_fault(con); mutex_unlock(&con->mutex); -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/5, v2] libceph: use a do..while loop in con_work()
This just converts a manually-implemented loop into a do..while loop in con_work(). It also moves handling of EAGAIN inside the blocks where it's already been determined an error code was returned. Also update a few dout() calls near the affected code for consistency. NOTE: This was done in two steps in order to facilitate review. The next patch simply indents the loop properly. This patch will be squashed into the next one before commit. Signed-off-by: Alex Elder --- v2: rebased net/ceph/messenger.c | 39 --- 1 file changed, 20 insertions(+), 19 deletions(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 18eb788..223406f 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -2387,51 +2387,53 @@ static void con_work(struct work_struct *work) { struct ceph_connection *con = container_of(work, struct ceph_connection, work.work); - bool fault = false; - int ret; + bool fault; mutex_lock(&con->mutex); -restart: - if (con_sock_closed(con)) { +while (true) { + int ret; + + if ((fault = con_sock_closed(con))) { dout("%s: con %p SOCK_CLOSED\n", __func__, con); - fault = true; - goto done; + break; } if (con_backoff(con)) { dout("%s: con %p BACKOFF\n", __func__, con); - goto done; + break; } if (con->state == CON_STATE_STANDBY) { - dout("con_work %p STANDBY\n", con); - goto done; + dout("%s: con %p STANDBY\n", __func__, con); + break; } if (con->state == CON_STATE_CLOSED) { - dout("con_work %p CLOSED\n", con); + dout("%s: con %p CLOSED\n", __func__, con); BUG_ON(con->sock); - goto done; + break; } if (con->state == CON_STATE_PREOPEN) { - dout("%s: con %p OPENING\n", __func__, con); + dout("%s: con %p PREOPEN\n", __func__, con); BUG_ON(con->sock); } ret = try_read(con); - if (ret == -EAGAIN) - goto restart; if (ret < 0) { + if (ret == -EAGAIN) + continue; con->error_msg = "socket error on read"; fault = true; - goto done; + break; } ret = try_write(con); - if (ret == -EAGAIN) - goto restart; if (ret < 0) { + if (ret == -EAGAIN) + continue;
con->error_msg = "socket error on write"; fault = true; } -done: + + break; /* If we make it to here, we're done */ +} if (fault) con_fault(con); mutex_unlock(&con->mutex); @@ -2442,7 +2444,6 @@ done: con->ops->put(con); } - /* * Generic error/fault handler. A retry mechanism is used with * exponential backoff -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/5, v2] libceph: use a flag to indicate a fault has occurred
This just rearranges the logic in con_work() a little bit so that a flag is used to indicate a fault has occurred. This allows both the fault and non-fault case to be handled the same way and avoids a couple of nearly consecutive gotos. Signed-off-by: Alex Elder --- v2: rebased net/ceph/messenger.c | 24 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index c3b9060..18eb788 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -2387,13 +2387,15 @@ static void con_work(struct work_struct *work) { struct ceph_connection *con = container_of(work, struct ceph_connection, work.work); + bool fault = false; int ret; mutex_lock(&con->mutex); restart: if (con_sock_closed(con)) { dout("%s: con %p SOCK_CLOSED\n", __func__, con); - goto fault; + fault = true; + goto done; } if (con_backoff(con)) { dout("%s: con %p BACKOFF\n", __func__, con); @@ -2418,7 +2420,8 @@ restart: goto restart; if (ret < 0) { con->error_msg = "socket error on read"; - goto fault; + fault = true; + goto done; } ret = try_write(con); @@ -2426,20 +2429,17 @@ restart: goto restart; if (ret < 0) { con->error_msg = "socket error on write"; - goto fault; + fault = true; } - done: + if (fault) + con_fault(con); mutex_unlock(&con->mutex); -done_unlocked: - con->ops->put(con); - return; -fault: - con_fault(con); - mutex_unlock(&con->mutex); - con_fault_finish(con); - goto done_unlocked; + if (fault) + con_fault_finish(con); + + con->ops->put(con); } -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/5] libceph: separate non-locked fault handling
An error occurring on a ceph connection is treated as a fault, causing the connection to be reset. The initial part of this fault handling has to be done while holding the connection mutex, but it must then be dropped for the last part. Separate the part of this fault handling that executes without the lock into its own function, con_fault_finish(). Move the call to this new function, as well as call that drops the connection mutex, into ceph_fault(). Rename that function con_fault() to reflect that it's only handling the connection part of the fault handling. The motivation for this was a warning from sparse about the locking being done here. Rearranging things this way keeps all the mutex manipulation within ceph_fault(), and this stops sparse from complaining. This partially resolves: http://tracker.ceph.com/issues/4184 Reported-by: Fengguang Wu Signed-off-by: Alex Elder --- v2: rebased net/ceph/messenger.c | 42 +++--- 1 file changed, 23 insertions(+), 19 deletions(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 9a29d8a..c3b9060 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -166,7 +166,7 @@ static struct lock_class_key socket_class; static void queue_con(struct ceph_connection *con); static void con_work(struct work_struct *); -static void ceph_fault(struct ceph_connection *con); +static void con_fault(struct ceph_connection *con); /* * Nicely render a sockaddr as a string. An array of formatted @@ -2363,6 +2363,23 @@ static bool con_backoff(struct ceph_connection *con) return true; } +/* Finish fault handling; con->mutex must *not* be held here */ + +static void con_fault_finish(struct ceph_connection *con) +{ + /* +* in case we faulted due to authentication, invalidate our +* current tickets so that we can get new ones. 
+*/ + if (con->auth_retry && con->ops->invalidate_authorizer) { + dout("calling invalidate_authorizer()\n"); + con->ops->invalidate_authorizer(con); + } + + if (con->ops->fault) + con->ops->fault(con); +} + /* * Do some work on a connection. Drop a connection ref when we're done. */ @@ -2419,7 +2436,9 @@ done_unlocked: return; fault: - ceph_fault(con); /* error/fault path */ + con_fault(con); + mutex_unlock(&con->mutex); + con_fault_finish(con); goto done_unlocked; } @@ -2428,8 +2447,7 @@ fault: * Generic error/fault handler. A retry mechanism is used with * exponential backoff */ -static void ceph_fault(struct ceph_connection *con) - __releases(con->mutex) +static void con_fault(struct ceph_connection *con) { pr_warning("%s%lld %s %s\n", ENTITY_NAME(con->peer_name), ceph_pr_addr(&con->peer_addr.in_addr), con->error_msg); @@ -2445,7 +2463,7 @@ static void ceph_fault(struct ceph_connection *con) if (con_flag_test(con, CON_FLAG_LOSSYTX)) { dout("fault on LOSSYTX channel, marking CLOSED\n"); con->state = CON_STATE_CLOSED; - goto out_unlock; + return; } if (con->in_msg) { @@ -2476,20 +2494,6 @@ static void ceph_fault(struct ceph_connection *con) con_flag_set(con, CON_FLAG_BACKOFF); queue_con(con); } - -out_unlock: - mutex_unlock(&con->mutex); - /* -* in case we faulted due to authentication, invalidate our -* current tickets so that we can get new ones. -*/ - if (con->auth_retry && con->ops->invalidate_authorizer) { - dout("calling invalidate_authorizer()\n"); - con->ops->invalidate_authorizer(con); - } - - if (con->ops->fault) - con->ops->fault(con); } -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/5] libceph: encapsulate connection backoff
Collect the code that tests for and implements a backoff delay for a ceph connection into a new function, con_backoff(). Make the debug output messages in that part of the code report things consistently by reporting a message in the socket closed case, and by making the one for PREOPEN state report the connection pointer like the rest. Signed-off-by: Alex Elder --- v2: rebased net/ceph/messenger.c | 37 - 1 file changed, 24 insertions(+), 13 deletions(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index ed9e237..9a29d8a 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -2345,6 +2345,24 @@ static bool con_sock_closed(struct ceph_connection *con) return true; } +static bool con_backoff(struct ceph_connection *con) +{ + int ret; + + if (!con_flag_test_and_clear(con, CON_FLAG_BACKOFF)) + return false; + + ret = queue_con_delay(con, round_jiffies_relative(con->delay)); + if (ret) { + dout("%s: con %p FAILED to back off %lu\n", __func__, + con, con->delay); + BUG_ON(ret == -ENOENT); + con_flag_set(con, CON_FLAG_BACKOFF); + } + + return true; +} + /* * Do some work on a connection. Drop a connection ref when we're done.
*/ @@ -2356,21 +2374,14 @@ static void con_work(struct work_struct *work) mutex_lock(&con->mutex); restart: - if (con_sock_closed(con)) + if (con_sock_closed(con)) { + dout("%s: con %p SOCK_CLOSED\n", __func__, con); goto fault; - - if (con_flag_test_and_clear(con, CON_FLAG_BACKOFF)) { - dout("con_work %p backing off\n", con); - ret = queue_con_delay(con, round_jiffies_relative(con->delay)); - if (ret) { - dout("con_work %p FAILED to back off %lu\n", con, -con->delay); - BUG_ON(ret == -ENOENT); - con_flag_set(con, CON_FLAG_BACKOFF); - } + } + if (con_backoff(con)) { + dout("%s: con %p BACKOFF\n", __func__, con); goto done; } - if (con->state == CON_STATE_STANDBY) { dout("con_work %p STANDBY\n", con); goto done; @@ -2381,7 +2392,7 @@ restart: goto done; } if (con->state == CON_STATE_PREOPEN) { - dout("con_work OPENING\n"); + dout("%s: con %p OPENING\n", __func__, con); BUG_ON(con->sock); } -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/5, v2] libceph clean up con_work()
(Re-posting because these changes have been rebased.)

This series cleans up con_work() a bit. The original motivation was to get rid of a warning issued by the sparse utility; addressing that required a little rework, and once that was done it was straightforward to make the function fairly simple. The warning itself arose because sparse cannot follow locking logic that is split across multiple cooperating functions. With these changes, both acquiring and releasing the connection mutex happen in con_work().

					-Alex

[PATCH 1/5] libceph: encapsulate connection backoff
[PATCH 2/5] libceph: separate non-locked fault handling
[PATCH 3/5] libceph: use a flag to indicate a fault has occurred
[PATCH 4/5] libceph: use a do..while loop in con_work()
[PATCH 5/5] libceph: indent properly

[PATCH, v2] libceph: eliminate sparse warnings
Eliminate most of the problems in the libceph code that cause sparse to issue warnings. - Convert functions that are never referenced externally to have static scope. - Pass NULL rather than 0 for a pointer argument in one spot in ceph_monc_delete_snapid() This partially resolves: http://tracker.ceph.com/issues/4184 Reported-by: Fengguang Wu Signed-off-by: Alex Elder --- v2: rebased net/ceph/crypto.c |7 --- net/ceph/messenger.c |2 +- net/ceph/mon_client.c |2 +- 3 files changed, 6 insertions(+), 5 deletions(-) diff --git a/net/ceph/crypto.c b/net/ceph/crypto.c index af14cb4..6e7a236 100644 --- a/net/ceph/crypto.c +++ b/net/ceph/crypto.c @@ -423,7 +423,8 @@ int ceph_encrypt2(struct ceph_crypto_key *secret, void *dst, size_t *dst_len, } } -int ceph_key_instantiate(struct key *key, struct key_preparsed_payload *prep) +static int ceph_key_instantiate(struct key *key, + struct key_preparsed_payload *prep) { struct ceph_crypto_key *ckey; size_t datalen = prep->datalen; @@ -458,12 +459,12 @@ err: return ret; } -int ceph_key_match(const struct key *key, const void *description) +static int ceph_key_match(const struct key *key, const void *description) { return strcmp(key->description, description) == 0; } -void ceph_key_destroy(struct key *key) { +static void ceph_key_destroy(struct key *key) { struct ceph_crypto_key *ckey = key->payload.data; ceph_crypto_key_destroy(ckey); diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 771d4c9..ed9e237 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -223,7 +223,7 @@ static void encode_my_addr(struct ceph_messenger *msgr) */ static struct workqueue_struct *ceph_msgr_wq; -void _ceph_msgr_exit(void) +static void _ceph_msgr_exit(void) { if (ceph_msgr_wq) { destroy_workqueue(ceph_msgr_wq); diff --git a/net/ceph/mon_client.c b/net/ceph/mon_client.c index 812eb3b..aef5b10 100644 --- a/net/ceph/mon_client.c +++ b/net/ceph/mon_client.c @@ -697,7 +697,7 @@ int ceph_monc_delete_snapid(struct ceph_mon_client *monc, u32 
pool, u64 snapid) { return do_poolop(monc, POOL_OP_CREATE_UNMANAGED_SNAP, - pool, snapid, 0, 0); + pool, snapid, NULL, 0); } -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
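The two kinds of fix in this patch are easy to show in miniature. The snippet below is a simplified illustration, not the ceph code: `helper()` and `do_op()` are hypothetical.

```c
#include <stddef.h>

/* Was: int helper(void) with external linkage — sparse warns
 * "symbol 'helper' was not declared. Should it be static?" */
static int helper(void)
{
    return 42;
}

/* Hypothetical op taking an optional pointer argument. */
static int do_op(int code, const char *name, size_t len)
{
    (void)name; (void)len;
    return code;
}

static int call_op(void)
{
    /* Was: do_op(7, 0, 0) — sparse warns "Using plain integer as
     * NULL pointer" for the pointer argument, so pass NULL instead. */
    return do_op(7, NULL, 0) + helper();
}
```

Neither change affects generated code; both simply make the declared intent (internal linkage, pointer-typed argument) explicit so sparse stops warning.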
[PATCH, v2] ceph: eliminate sparse warnings in fs code
Fix the causes for sparse warnings reported in the ceph file system code. Here there are only two (and they're sort of silly, but they're easy to fix).

This partially resolves:
    http://tracker.ceph.com/issues/4184

Reported-by: Fengguang Wu
Signed-off-by: Alex Elder
---
v2: rebased

 fs/ceph/xattr.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index 2135817..9b6b2b6 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -213,7 +213,7 @@ static struct ceph_vxattr ceph_dir_vxattrs[] = {
 	XATTR_NAME_CEPH(dir, rsubdirs),
 	XATTR_NAME_CEPH(dir, rbytes),
 	XATTR_NAME_CEPH(dir, rctime),
-	{ 0 }	/* Required table terminator */
+	{ .name = NULL, 0 }	/* Required table terminator */
 };
 static size_t ceph_dir_vxattrs_name_size;	/* total size of all names */
@@ -232,7 +232,7 @@ static struct ceph_vxattr ceph_file_vxattrs[] = {
 	XATTR_LAYOUT_FIELD(file, layout, stripe_count),
 	XATTR_LAYOUT_FIELD(file, layout, object_size),
 	XATTR_LAYOUT_FIELD(file, layout, pool),
-	{ 0 }	/* Required table terminator */
+	{ .name = NULL, 0 }	/* Required table terminator */
 };
 static size_t ceph_file_vxattrs_name_size;	/* total size of all names */
--
1.7.9.5
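The sentinel entry matters because callers walk these tables until they hit a NULL name. A small sketch (simplified structs, not the ceph ones) of why `{ .name = NULL, 0 }` is the required terminator:

```c
#include <stddef.h>

struct vxattr {
    const char *name;
    size_t name_size;
};

static const struct vxattr dir_vxattrs[] = {
    { "ceph.dir.entries", sizeof("ceph.dir.entries") },
    { "ceph.dir.files",   sizeof("ceph.dir.files") },
    { .name = NULL, 0 }   /* required table terminator */
};

static size_t vxattrs_name_size(const struct vxattr *table)
{
    size_t size = 0;
    const struct vxattr *vx;

    for (vx = table; vx->name; vx++)   /* iteration stops at the sentinel */
        size += vx->name_size;
    return size;
}
```

The designated initializer says explicitly that it is the NULL name (not some other zero field) that terminates iteration, which is the intent the bare `{ 0 }` left implicit.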
[PATCH, v2] rbd: eliminate sparse warnings
Fengguang Wu reminded me that there were outstanding sparse reports in the ceph and rbd code. This patch fixes these problems in rbd that lead to those reports: - Convert functions that are never referenced externally to have static scope. - Add a lockdep annotation to rbd_request_fn(), because it releases a lock before acquiring it again. This partially resolves: http://tracker.ceph.com/issues/4184 Reported-by: Fengguang Wu Signed-off-by: Alex Elder --- v2: rebased drivers/block/rbd.c | 12 +++- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index a9c86ca..c6b15d4 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -1141,7 +1141,7 @@ static bool obj_request_type_valid(enum obj_request_type type) } } -struct ceph_osd_req_op *rbd_osd_req_op_create(u16 opcode, ...) +static struct ceph_osd_req_op *rbd_osd_req_op_create(u16 opcode, ...) { struct ceph_osd_req_op *op; va_list args; @@ -1537,7 +1537,8 @@ static void rbd_obj_request_destroy(struct kref *kref) * that comprises the image request, and the Linux request pointer * (if there is one). 
*/ -struct rbd_img_request *rbd_img_request_create(struct rbd_device *rbd_dev, +static struct rbd_img_request *rbd_img_request_create( + struct rbd_device *rbd_dev, u64 offset, u64 length, bool write_request) { @@ -1971,6 +1972,7 @@ out: } static void rbd_request_fn(struct request_queue *q) + __releases(q->queue_lock) __acquires(q->queue_lock) { struct rbd_device *rbd_dev = q->queuedata; bool read_only = rbd_dev->mapping.read_only; @@ -2705,7 +2707,7 @@ static void rbd_spec_free(struct kref *kref) kfree(spec); } -struct rbd_device *rbd_dev_create(struct rbd_client *rbdc, +static struct rbd_device *rbd_dev_create(struct rbd_client *rbdc, struct rbd_spec *spec) { struct rbd_device *rbd_dev; @@ -4256,7 +4258,7 @@ static void rbd_sysfs_cleanup(void) device_unregister(&rbd_root_dev); } -int __init rbd_init(void) +static int __init rbd_init(void) { int rc; @@ -4272,7 +4274,7 @@ int __init rbd_init(void) return 0; } -void __exit rbd_exit(void) +static void __exit rbd_exit(void) { rbd_sysfs_cleanup(); } -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
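The lockdep/sparse annotation added to rbd_request_fn() documents an intentional lock imbalance. A userspace analogue of that pattern (a pthread mutex standing in for queue_lock; the names are illustrative, not the rbd code):

```c
#include <pthread.h>

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static int handled;

static void handle_request(void)
{
    handled++;                          /* stand-in for submitting I/O */
}

/* Called with queue_lock held; returns with queue_lock held.
 * In the kernel this imbalance is what the
 * __releases(q->queue_lock) __acquires(q->queue_lock)
 * annotations tell sparse is intentional. */
static void request_fn(void)
{
    pthread_mutex_unlock(&queue_lock);  /* drop lock around blocking work */
    handle_request();                   /* may sleep or allocate */
    pthread_mutex_lock(&queue_lock);    /* reacquire before returning */
}
```

Without the annotation, sparse sees a function that exits with a different lock balance than it entered with and warns about it on every build.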
Updated sparse warning message patches
I'm re-posting these patches because I've updated them to be based on the patches I just posted ("Four miscellaneous patches"). These patches are available in the branch "test/wip-4184" in the ceph-client git repository. That branch is based on branch "test/wip-4234,5,7,8".

(Here's the original description)

What follows are a few series of patches that get rid of code issues that lead to warnings from the sparse utility. The first three patches address the warnings in the rbd, ceph file system, and libceph code respectively. After that, one warning remains in libceph; it is addressed by a series of five patches that fix the underlying problem and also reorganize and clean up the surrounding code. The last two are meant to be combined into one before commit; they've been posted as two separate patches to make them easier to review.

					-Alex

[PATCH, v2] rbd: eliminate sparse warnings
[PATCH, v2] ceph: eliminate sparse warnings in fs code
[PATCH, v2] libceph: eliminate sparse warnings
[PATCH 1/5, v2] libceph: encapsulate connection backoff
[PATCH 2/5, v2] libceph: separate non-locked fault handling
[PATCH 3/5, v2] libceph: use a flag to indicate a fault has occurred
[PATCH 4/5, v2] libceph: use a do..while loop in con_work()
[PATCH 5/5, v2] libceph: indent properly
[PATCH] libceph: define connection flag helpers
Define and use functions that encapsulate operations performed on a connection's flags. This resolves: http://tracker.ceph.com/issues/4234 Signed-off-by: Alex Elder --- net/ceph/messenger.c | 107 -- 1 file changed, 78 insertions(+), 29 deletions(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 8a62a55..771d4c9 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -98,6 +98,57 @@ #define CON_FLAG_SOCK_CLOSED 3 /* socket state changed to closed */ #define CON_FLAG_BACKOFF 4 /* need to retry queuing delayed work */ +static bool con_flag_valid(unsigned long con_flag) +{ + switch (con_flag) { + case CON_FLAG_LOSSYTX: + case CON_FLAG_KEEPALIVE_PENDING: + case CON_FLAG_WRITE_PENDING: + case CON_FLAG_SOCK_CLOSED: + case CON_FLAG_BACKOFF: + return true; + default: + return false; + } +} + +static void con_flag_clear(struct ceph_connection *con, unsigned long con_flag) +{ + BUG_ON(!con_flag_valid(con_flag)); + + clear_bit(con_flag, &con->flags); +} + +static void con_flag_set(struct ceph_connection *con, unsigned long con_flag) +{ + BUG_ON(!con_flag_valid(con_flag)); + + set_bit(con_flag, &con->flags); +} + +static bool con_flag_test(struct ceph_connection *con, unsigned long con_flag) +{ + BUG_ON(!con_flag_valid(con_flag)); + + return test_bit(con_flag, &con->flags); +} + +static bool con_flag_test_and_clear(struct ceph_connection *con, + unsigned long con_flag) +{ + BUG_ON(!con_flag_valid(con_flag)); + + return test_and_clear_bit(con_flag, &con->flags); +} + +static bool con_flag_test_and_set(struct ceph_connection *con, + unsigned long con_flag) +{ + BUG_ON(!con_flag_valid(con_flag)); + + return test_and_set_bit(con_flag, &con->flags); +} + /* static tag bytes (protocol control messages) */ static char tag_msg = CEPH_MSGR_TAG_MSG; static char tag_ack = CEPH_MSGR_TAG_ACK; @@ -309,7 +360,7 @@ static void ceph_sock_write_space(struct sock *sk) * buffer. See net/ipv4/tcp_input.c:tcp_check_space() * and net/core/stream.c:sk_stream_write_space(). 
*/ - if (test_bit(CON_FLAG_WRITE_PENDING, &con->flags)) { + if (con_flag_test(con, CON_FLAG_WRITE_PENDING)) { if (sk_stream_wspace(sk) >= sk_stream_min_wspace(sk)) { dout("%s %p queueing write work\n", __func__, con); clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags); @@ -334,7 +385,7 @@ static void ceph_sock_state_change(struct sock *sk) case TCP_CLOSE_WAIT: dout("%s TCP_CLOSE_WAIT\n", __func__); con_sock_state_closing(con); - set_bit(CON_FLAG_SOCK_CLOSED, &con->flags); + con_flag_set(con, CON_FLAG_SOCK_CLOSED); queue_con(con); break; case TCP_ESTABLISHED: @@ -475,7 +526,7 @@ static int con_close_socket(struct ceph_connection *con) * received a socket close event before we had the chance to * shut the socket down. */ - clear_bit(CON_FLAG_SOCK_CLOSED, &con->flags); + con_flag_clear(con, CON_FLAG_SOCK_CLOSED); con_sock_state_closed(con); return rc; @@ -539,11 +590,10 @@ void ceph_con_close(struct ceph_connection *con) ceph_pr_addr(&con->peer_addr.in_addr)); con->state = CON_STATE_CLOSED; - clear_bit(CON_FLAG_LOSSYTX, &con->flags); /* so we retry next connect */ - clear_bit(CON_FLAG_KEEPALIVE_PENDING, &con->flags); - clear_bit(CON_FLAG_WRITE_PENDING, &con->flags); - clear_bit(CON_FLAG_KEEPALIVE_PENDING, &con->flags); - clear_bit(CON_FLAG_BACKOFF, &con->flags); + con_flag_clear(con, CON_FLAG_LOSSYTX); /* so we retry next connect */ + con_flag_clear(con, CON_FLAG_KEEPALIVE_PENDING); + con_flag_clear(con, CON_FLAG_WRITE_PENDING); + con_flag_clear(con, CON_FLAG_BACKOFF); reset_connection(con); con->peer_global_seq = 0; @@ -799,7 +849,7 @@ static void prepare_write_message(struct ceph_connection *con) /* no, queue up footer too and be done */ prepare_write_message_footer(con); - set_bit(CON_FLAG_WRITE_PENDING, &con->flags); + con_flag_set(con, CON_FLAG_WRITE_PENDING); } /* @@ -820,7 +870,7 @@ static void prepare_write_ack(struct ceph_connection *con) &con->out_temp_ack); con->out_more = 1; /* more will follow.. eventually.. 
*/ - set_bit(CON_FLAG_WRITE_PENDING, &con->flags); + con_flag_set(con, CON_FLAG_WRITE_PENDING); } /* @@ -831,7 +881,7 @@ static void prepare_write_keepalive(struct ceph_connection *con) dout("prepare_write_keepalive %p\n", con); con_out_kvec_reset(con); con_out_kvec_add(c
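What the con_flag_*() wrappers buy over open-coded bit calls is a single choke point that validates the flag. A userspace sketch (C11 atomics standing in for the kernel's set_bit()/test_and_clear_bit(); assert() standing in for BUG_ON()):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum {
    CON_FLAG_LOSSYTX = 0,
    CON_FLAG_KEEPALIVE_PENDING = 1,
    CON_FLAG_WRITE_PENDING = 2,
    CON_FLAG_SOCK_CLOSED = 3,
    CON_FLAG_BACKOFF = 4,
};

struct connection { atomic_ulong flags; };

static bool flag_valid(unsigned long f)
{
    return f <= CON_FLAG_BACKOFF;
}

static void con_flag_set(struct connection *con, unsigned long f)
{
    assert(flag_valid(f));              /* kernel: BUG_ON() */
    atomic_fetch_or(&con->flags, 1UL << f);
}

static bool con_flag_test_and_clear(struct connection *con, unsigned long f)
{
    assert(flag_valid(f));
    return atomic_fetch_and(&con->flags, ~(1UL << f)) & (1UL << f);
}
```

Every caller goes through the wrappers, so a typo'd or out-of-range flag bit is caught at the call site rather than silently corrupting an unrelated bit in con->flags; it also incidentally fixed the duplicated KEEPALIVE_PENDING clear in ceph_con_close().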
[PATCH] rbd: normalize dout() calls
Add dout() calls to facilitate tracing of image and object requests. Change a few existing calls so they use __func__ rather than the hard-coded function name. Have calls always add ":" after the name of the function, and prefix pointer values with a consistent tag indicating what it represents. (Note that there remain some older dout() calls that are left untouched by this patch.) Issue a warning if rbd_osd_write_callback() ever gets a short write. This resolves: http://tracker.ceph.com/issues/4235 Signed-off-by: Alex Elder --- drivers/block/rbd.c | 66 +++ 1 file changed, 61 insertions(+), 5 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index bd6078b..a9c86ca 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -443,7 +443,7 @@ static struct rbd_client *rbd_client_create(struct ceph_options *ceph_opts) struct rbd_client *rbdc; int ret = -ENOMEM; - dout("rbd_client_create\n"); + dout("%s:\n", __func__); rbdc = kmalloc(sizeof(struct rbd_client), GFP_KERNEL); if (!rbdc) goto out_opt; @@ -467,8 +467,8 @@ static struct rbd_client *rbd_client_create(struct ceph_options *ceph_opts) spin_unlock(&rbd_client_list_lock); mutex_unlock(&ctl_mutex); + dout("%s: rbdc %p\n", __func__, rbdc); - dout("rbd_client_create created %p\n", rbdc); return rbdc; out_err: @@ -479,6 +479,8 @@ out_mutex: out_opt: if (ceph_opts) ceph_destroy_options(ceph_opts); + dout("%s: error %d\n", __func__, ret); + return ERR_PTR(ret); } @@ -605,7 +607,7 @@ static void rbd_client_release(struct kref *kref) { struct rbd_client *rbdc = container_of(kref, struct rbd_client, kref); - dout("rbd_release_client %p\n", rbdc); + dout("%s: rbdc %p\n", __func__, rbdc); spin_lock(&rbd_client_list_lock); list_del(&rbdc->node); spin_unlock(&rbd_client_list_lock); @@ -1064,6 +1066,8 @@ out_err: static void rbd_obj_request_get(struct rbd_obj_request *obj_request) { + dout("%s: obj %p (was %d)\n", __func__, obj_request, + atomic_read(&obj_request->kref.refcount)); 
kref_get(&obj_request->kref); } @@ -1071,11 +1075,15 @@ static void rbd_obj_request_destroy(struct kref *kref); static void rbd_obj_request_put(struct rbd_obj_request *obj_request) { rbd_assert(obj_request != NULL); + dout("%s: obj %p (was %d)\n", __func__, obj_request, + atomic_read(&obj_request->kref.refcount)); kref_put(&obj_request->kref, rbd_obj_request_destroy); } static void rbd_img_request_get(struct rbd_img_request *img_request) { + dout("%s: img %p (was %d)\n", __func__, img_request, + atomic_read(&img_request->kref.refcount)); kref_get(&img_request->kref); } @@ -1083,6 +1091,8 @@ static void rbd_img_request_destroy(struct kref *kref); static void rbd_img_request_put(struct rbd_img_request *img_request) { rbd_assert(img_request != NULL); + dout("%s: img %p (was %d)\n", __func__, img_request, + atomic_read(&img_request->kref.refcount)); kref_put(&img_request->kref, rbd_img_request_destroy); } @@ -1097,6 +1107,8 @@ static inline void rbd_img_obj_request_add(struct rbd_img_request *img_request, rbd_assert(obj_request->which != BAD_WHICH); img_request->obj_request_count++; list_add_tail(&obj_request->links, &img_request->obj_requests); + dout("%s: img %p obj %p w=%u\n", __func__, img_request, obj_request, + obj_request->which); } static inline void rbd_img_obj_request_del(struct rbd_img_request *img_request, @@ -1104,6 +1116,8 @@ static inline void rbd_img_obj_request_del(struct rbd_img_request *img_request, { rbd_assert(obj_request->which != BAD_WHICH); + dout("%s: img %p obj %p w=%u\n", __func__, img_request, obj_request, + obj_request->which); list_del(&obj_request->links); rbd_assert(img_request->obj_request_count > 0); img_request->obj_request_count--; @@ -1200,11 +1214,14 @@ static void rbd_osd_req_op_destroy(struct ceph_osd_req_op *op) static int rbd_obj_request_submit(struct ceph_osd_client *osdc, struct rbd_obj_request *obj_request) { + dout("%s: osdc %p obj %p\n", __func__, osdc, obj_request); + return ceph_osdc_start_request(osdc, 
obj_request->osd_req, false); } static void rbd_img_request_complete(struct rbd_img_request *img_request) { + dout("%s: img %p\n", __func__, img_request); if (img_request->callback) img_request->callback(img_request); else @@ -1215,6 +1232,8 @@ static void rbd_img_request_complete(struct rbd_img_request *img_request) static int rbd_obj_request_wait(struct rbd_obj_request *obj_request) { + dout("%s: obj %p\n", __func__, obj_request); + ret
[PATCH] rbd: barriers are hard
Let's go shopping!

I'm afraid this may not have gotten it right:

    07741308 rbd: add barriers near done flag operations

The smp_wmb() should have been done *before* setting the done flag, to ensure all other data was valid before marking the object request done.

Switch to use atomic_inc_return() here to set the done flag, which allows us to verify we don't mark something done more than once. Doing this also implies general barriers before and after the call.

And although a read memory barrier might have been sufficient before reading the done flag, convert this to a full memory barrier just to put this issue to bed.

This resolves:
    http://tracker.ceph.com/issues/4238

Signed-off-by: Alex Elder
---
 drivers/block/rbd.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 3cc003b..bd6078b 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1226,13 +1226,22 @@ static void obj_request_done_init(struct rbd_obj_request *obj_request)
 static void obj_request_done_set(struct rbd_obj_request *obj_request)
 {
-	atomic_set(&obj_request->done, 1);
-	smp_wmb();
+	int done;
+
+	done = atomic_inc_return(&obj_request->done);
+	if (done > 1) {
+		struct rbd_img_request *img_request = obj_request->img_request;
+		struct rbd_device *rbd_dev;
+
+		rbd_dev = img_request ? img_request->rbd_dev : NULL;
+		rbd_warn(rbd_dev, "obj_request %p was already done\n",
+			obj_request);
+	}
 }
 
 static bool obj_request_done_test(struct rbd_obj_request *obj_request)
 {
-	smp_rmb();
+	smp_mb();
 	return atomic_read(&obj_request->done) != 0;
 }
--
1.7.9.5
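The same idea translates to C11 atomics: a sequentially-consistent read-modify-write both orders the surrounding accesses (the kernel's atomic_inc_return() likewise implies barriers on both sides) and returns a value that reveals a double "done" transition. A sketch (simplified struct, not the rbd code):

```c
#include <stdatomic.h>
#include <stdbool.h>

struct obj_request { atomic_int done; };

/* Returns true if this call was the first to mark the request done;
 * a false return corresponds to the rbd_warn() case in the patch. */
static bool request_done_set(struct obj_request *req)
{
    return atomic_fetch_add(&req->done, 1) == 0;
}

static bool request_done_test(struct obj_request *req)
{
    /* seq_cst load: full ordering, the analogue of smp_mb() + read */
    return atomic_load(&req->done) != 0;
}
```

Contrast with the buggy original: a plain store followed by a write barrier orders nothing *before* the flag becomes visible, whereas the single atomic RMW makes both the ordering and the "already done" check fall out of one operation.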
[PATCH] rbd: ignore zero-length requests
The old request code simply ignored zero-length requests. We should still operate that same way to avoid any changes in behavior. We can implement handling for special zero-length requests separately (see http://tracker.ceph.com/issues/4236).

Add some assertions based on this new constraint.

This resolves:
    http://tracker.ceph.com/issues/4237

Signed-off-by: Alex Elder
---
 drivers/block/rbd.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index b0eea3e..3cc003b 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1560,6 +1560,7 @@ static int rbd_img_request_fill_bio(struct rbd_img_request *img_request,
 	image_offset = img_request->offset;
 	rbd_assert(image_offset == bio_list->bi_sector << SECTOR_SHIFT);
 	resid = img_request->length;
+	rbd_assert(resid > 0);
 	while (resid) {
 		const char *object_name;
 		unsigned int clone_size;
@@ -1627,8 +1628,10 @@ static void rbd_img_obj_callback(struct rbd_obj_request *obj_request)
 	bool more = true;
 
 	img_request = obj_request->img_request;
+	rbd_assert(img_request != NULL);
 	rbd_assert(img_request->rq != NULL);
+	rbd_assert(img_request->obj_request_count > 0);
 	rbd_assert(which != BAD_WHICH);
 	rbd_assert(which < img_request->obj_request_count);
 	rbd_assert(which >= img_request->next_completion);
@@ -1918,6 +1921,19 @@ static void rbd_request_fn(struct request_queue *q)
 		/* Ignore any non-FS requests that filter through. */
 
 		if (rq->cmd_type != REQ_TYPE_FS) {
+			dout("%s: non-fs request type %d\n", __func__,
+				(int) rq->cmd_type);
+			__blk_end_request_all(rq, 0);
+			continue;
+		}
+
+		/* Ignore/skip any zero-length requests */
+
+		offset = (u64) blk_rq_pos(rq) << SECTOR_SHIFT;
+		length = (u64) blk_rq_bytes(rq);
+
+		if (!length) {
+			dout("%s: zero-length request\n", __func__);
 			__blk_end_request_all(rq, 0);
 			continue;
 		}
@@ -1947,9 +1963,6 @@ static void rbd_request_fn(struct request_queue *q)
 			goto end_request;
 		}
 
-		offset = (u64) blk_rq_pos(rq) << SECTOR_SHIFT;
-		length = (u64) blk_rq_bytes(rq);
-
 		result = -EINVAL;
 		if (WARN_ON(offset && length > U64_MAX - offset + 1))
 			goto end_request;	/* Shouldn't happen */
--
1.7.9.5
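The filtering order the patch establishes can be condensed into one predicate. This is a sketch with hypothetical types (the real function walks a struct request_queue and ends requests with __blk_end_request_all()):

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SHIFT 9

struct request {
    int is_fs;          /* stand-in for rq->cmd_type == REQ_TYPE_FS */
    uint64_t pos;       /* starting position, in 512-byte sectors */
    uint64_t bytes;     /* total length in bytes */
};

/* Returns true if the request should be completed immediately,
 * before any image I/O is built: non-FS requests and zero-length
 * requests are both ended early, the same way. */
static bool should_skip(const struct request *rq)
{
    uint64_t offset = rq->pos << SECTOR_SHIFT;
    uint64_t length = rq->bytes;

    (void)offset;       /* kept for the later overflow check */
    return !rq->is_fs || length == 0;
}
```

Computing offset/length up front (rather than after the snapshot checks, as before) is what lets the zero-length case short-circuit in the same place as the non-FS case.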
Four miscellaneous patches
The following patches address some issues that were found while investigating a kernel rbd client performance issue this week.

These patches are available in the branch "test/wip-4234,5,7,8" in the ceph-client git repository.

					-Alex

[PATCH] rbd: ignore zero-length requests
[PATCH] rbd: barriers are hard
[PATCH] rbd: normalize dout() calls
[PATCH] libceph: define connection flag helpers
Re: OSD memory leaks?
Hi all,

I finally got a core dump. I did it with a kill -SEGV on the OSD process.

https://www.dropbox.com/s/ahv6hm0ipnak5rf/core-ceph-osd-11-0-0-20100-1361539008

Hope we will get something out of it :-).

--
Regards,
Sébastien Han.

On Fri, Jan 11, 2013 at 7:13 PM, Gregory Farnum wrote:
> On Fri, Jan 11, 2013 at 6:57 AM, Sébastien Han wrote:
>>> Is osd.1 using the heap profiler as well? Keep in mind that active use
>>> of the memory profiler will itself cause memory usage to increase —
>>> this sounds a bit like that to me since it's staying stable at a large
>>> but finite portion of total memory.
>>
>> Well, the memory consumption was already high before the profiler was
>> started. So yes, with the memory profiler enabled an OSD might consume
>> more memory, but this doesn't cause the memory leaks.
>
> My concern is that maybe you saw a leak but when you restarted with
> the memory profiling you lost whatever conditions caused it.
>
>> Any ideas? Nothing to say about my scrubbing theory?
>
> I like it, but Sam indicates that without some heap dumps which
> capture the actual leak then scrub is too large to effectively code
> review for leaks. :(
> -Greg