Re: [PATCH 3/3] ceph: fix vmtruncate deadlock
On 02/23/2013 02:54 AM, Gregory Farnum wrote:
> I haven't spent that much time in the kernel client, but this patch
> isn't working out for me. In particular, I'm pretty sure we need to
> preserve this:
>
>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>> index 5d5c32b..b9d8417 100644
>> --- a/fs/ceph/caps.c
>> +++ b/fs/ceph/caps.c
>> @@ -2067,12 +2067,6 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want,
>>      }
>>      have = __ceph_caps_issued(ci, &implemented);
>>
>> -    /*
>> -     * disallow writes while a truncate is pending
>> -     */
>> -    if (ci->i_truncate_pending)
>> -            have &= ~CEPH_CAP_FILE_WR;
>> -
>>      if ((have & need) == need) {
>>              /*
>>               * Look at (implemented & ~have & not) so that we keep waiting
>
> Because if there's a pending truncate, we really can't write. You do
> handle it in the particular case of doing buffered file writes, but
> these caps are a marker of permission, and the client shouldn't have
> write permission to a file until it's up to date on the truncates. Or
> am I misunderstanding something?

A pending vmtruncate is only relevant to the buffered write case. If the
client doesn't have the 'b' cap, the page cache is empty and
__ceph_do_pending_vmtruncate is a no-op.

For buffered writes, this patch only affects the situation where the
client receives a truncate request from the MDS in the middle of a
write. I think it's better to do the vmtruncate after the write
finishes.

>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> index a1e5b81..bf7849a 100644
>> --- a/fs/ceph/file.c
>> +++ b/fs/ceph/file.c
>> @@ -653,7 +653,6 @@ static ssize_t ceph_aio_read(struct kiocb *iocb, const struct iovec *iov,
>>      dout("aio_read %p %llx.%llx %llu~%u trying to get caps on %p\n",
>>           inode, ceph_vinop(inode), pos, (unsigned)len, inode);
>>  again:
>> -    __ceph_do_pending_vmtruncate(inode);
>
> There doesn't seem to be any kind of replacement for this? We need to
> do any pending truncates before reading or we might get stale data
> back.
generic_file_aio_read checks i_size when copying data to the user
buffer, so the user program can't get stale data. This
__ceph_do_pending_vmtruncate is not protected by i_mutex, which is a
potential bug; that's the reason I removed it.

Regards
Yan, Zheng

> The first two in the series look good to me, but I'll wait and pull
> them in as a unit with this one once we're done discussing. :)
> -Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC] Inline data support for Ceph
Hi there,

We have a team working on Ceph optimization, for the purpose of
integrating with OpenStack. Currently, Ceph can make use of the inline
data feature implemented in the local file system, such as btrfs.
However, we think it may be better to implement inline data at a higher
level, i.e. to make Ceph itself aware of it, since that would save the
client the calculation of the object location and the communication
with the OSDs. It should give a good IO speedup for small files, at the
cost of a slightly heavier load on the MDS.

We have worked out a plan to do it and the work is ongoing. We are
wondering about the community's response to this work and its prospects
for inclusion; comments are appreciated.

Cheers,
Li
Re: Mon losing touch with OSDs
On Fri, Feb 22, 2013 at 05:52:11PM -0800, Sage Weil wrote:
> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>> On Fri, Feb 22, 2013 at 05:30:04PM -0800, Sage Weil wrote:
>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>> On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
>>>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>>>> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
>>>>>>>>> I just looked at the logs. I can't tell what happened to cause that 10
>>>>>>>>> second delay.. strangely, messages were passing from 0 -> 1, but nothing
>>>>>>>>> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
>>>>>>
>>>>>> Is there any way of telling where they were delayed, i.e. in the 1's output
>>>>>> queue or 0's input queue?
>>>>>
>>>>> Yeah, if you bump it up to 'debug ms = 20'. Be aware that that will
>>>>> generate a lot of logging, though.
>>>>
>>>> I really don't want to load the system with too much logging, but I'm happy
>>>> modifying code... Are there specific interesting debug outputs which I can
>>>> modify so they're output under "ms = 1"?
>>>
>>> I'm basically interested in everything in writer() and write_message(),
>>> and reader() and read_message()...
>>
>> Like this?
>
> Yeah. You could do 2 instead of 1 so you can turn it down. I suspect
> that this is the lion's share of what debug 20 will spam to the log, but
> hopefully the load is manageable!

Good idea on the '2'. I'll get that installed and wait for it to happen
again.

Chris
Re: Mon losing touch with OSDs
On Sat, 23 Feb 2013, Chris Dunlop wrote:
> On Fri, Feb 22, 2013 at 05:30:04PM -0800, Sage Weil wrote:
>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>> On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
>>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>>> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
>>>>>>>> I just looked at the logs. I can't tell what happened to cause that 10
>>>>>>>> second delay.. strangely, messages were passing from 0 -> 1, but nothing
>>>>>>>> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
>>>>>
>>>>> Is there any way of telling where they were delayed, i.e. in the 1's output
>>>>> queue or 0's input queue?
>>>>
>>>> Yeah, if you bump it up to 'debug ms = 20'. Be aware that that will
>>>> generate a lot of logging, though.
>>>
>>> I really don't want to load the system with too much logging, but I'm happy
>>> modifying code... Are there specific interesting debug outputs which I can
>>> modify so they're output under "ms = 1"?
>>
>> I'm basically interested in everything in writer() and write_message(),
>> and reader() and read_message()...
>
> Like this?

Yeah. You could do 2 instead of 1 so you can turn it down. I suspect
that this is the lion's share of what debug 20 will spam to the log, but
hopefully the load is manageable!
sage > -- > diff --git a/src/msg/Pipe.cc b/src/msg/Pipe.cc > index 37b1eeb..db4774f 100644 > --- a/src/msg/Pipe.cc > +++ b/src/msg/Pipe.cc > @@ -1263,7 +1263,7 @@ void Pipe::reader() > > // sleep if (re)connecting > if (state == STATE_STANDBY) { > - ldout(msgr->cct,20) << "reader sleeping during reconnect|standby" << > dendl; > + ldout(msgr->cct, 1) << "reader sleeping during reconnect|standby" << > dendl; >cond.Wait(pipe_lock); >continue; > } > @@ -1272,28 +1272,28 @@ void Pipe::reader() > > char buf[80]; > char tag = -1; > -ldout(msgr->cct,20) << "reader reading tag..." << dendl; > +ldout(msgr->cct, 1) << "reader reading tag..." << dendl; > if (tcp_read((char*)&tag, 1) < 0) { >pipe_lock.Lock(); > - ldout(msgr->cct,2) << "reader couldn't read tag, " << > strerror_r(errno, buf, sizeof(buf)) << dendl; > + ldout(msgr->cct, 1) << "reader couldn't read tag, " << > strerror_r(errno, buf, sizeof(buf)) << dendl; >fault(true); >continue; > } > > if (tag == CEPH_MSGR_TAG_KEEPALIVE) { > - ldout(msgr->cct,20) << "reader got KEEPALIVE" << dendl; > + ldout(msgr->cct, 1) << "reader got KEEPALIVE" << dendl; >pipe_lock.Lock(); >continue; > } > > // open ... 
> if (tag == CEPH_MSGR_TAG_ACK) { > - ldout(msgr->cct,20) << "reader got ACK" << dendl; > + ldout(msgr->cct, 1) << "reader got ACK" << dendl; >ceph_le64 seq; >int rc = tcp_read((char*)&seq, sizeof(seq)); >pipe_lock.Lock(); >if (rc < 0) { > - ldout(msgr->cct,2) << "reader couldn't read ack seq, " << > strerror_r(errno, buf, sizeof(buf)) << dendl; > + ldout(msgr->cct, 1) << "reader couldn't read ack seq, " << > strerror_r(errno, buf, sizeof(buf)) << dendl; > fault(true); >} else if (state != STATE_CLOSED) { > handle_ack(seq); > @@ -1302,7 +1302,7 @@ void Pipe::reader() > } > > else if (tag == CEPH_MSGR_TAG_MSG) { > - ldout(msgr->cct,20) << "reader got MSG" << dendl; > + ldout(msgr->cct, 1) << "reader got MSG" << dendl; >Message *m = 0; >int r = read_message(&m); > > @@ -1342,7 +1342,7 @@ void Pipe::reader() > >cond.Signal(); // wake up writer, to ack this > > - ldout(msgr->cct,10) << "reader got message " > + ldout(msgr->cct, 1) << "reader got message " > << m->get_seq() << " " << m << " " << *m > << dendl; > > @@ -1360,7 +1360,7 @@ void Pipe::reader() > } > > else if (tag == CEPH_MSGR_TAG_CLOSE) { > - ldout(msgr->cct,20) << "reader got CLOSE" << dendl; > + ldout(msgr->cct, 1) << "reader got CLOSE" << dendl; >pipe_lock.Lock(); >if (state == STATE_CLOSING) { > state = STATE_CLOSED; > @@ -1383,7 +1383,7 @@ void Pipe::reader() >reader_running = false; >reader_needs_join = true; >unlock_maybe_reap(); > - ldout(msgr->cct,10) << "reader done" << dendl; > + ldout(msgr->cct, 1) << "reader done" << dendl; > } > > /* write msgs to socket. > @@ -1395,7 +1395,7 @@ void Pipe::writer() > >pipe_lock.Lock(); >while (state != STATE_CLOSED) {// && state != STATE_WAIT) { > -ldout(msgr->cct,10) << "writer: state = " << get_state_name() > +ldout(msgr->cct, 1) << "writer: state = " << get_state_name() > << " policy.server="
Re: Mon losing touch with OSDs
On Fri, Feb 22, 2013 at 05:30:04PM -0800, Sage Weil wrote:
> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>> On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
>>>>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>>>>> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
>>>>>>> I just looked at the logs. I can't tell what happened to cause that 10
>>>>>>> second delay.. strangely, messages were passing from 0 -> 1, but nothing
>>>>>>> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
>>>>
>>>> Is there any way of telling where they were delayed, i.e. in the 1's output
>>>> queue or 0's input queue?
>>>
>>> Yeah, if you bump it up to 'debug ms = 20'. Be aware that that will
>>> generate a lot of logging, though.
>>
>> I really don't want to load the system with too much logging, but I'm happy
>> modifying code... Are there specific interesting debug outputs which I can
>> modify so they're output under "ms = 1"?
>
> I'm basically interested in everything in writer() and write_message(),
> and reader() and read_message()...

Like this?

--
diff --git a/src/msg/Pipe.cc b/src/msg/Pipe.cc
index 37b1eeb..db4774f 100644
--- a/src/msg/Pipe.cc
+++ b/src/msg/Pipe.cc
@@ -1263,7 +1263,7 @@ void Pipe::reader()
 
     // sleep if (re)connecting
     if (state == STATE_STANDBY) {
-      ldout(msgr->cct,20) << "reader sleeping during reconnect|standby" << dendl;
+      ldout(msgr->cct, 1) << "reader sleeping during reconnect|standby" << dendl;
       cond.Wait(pipe_lock);
       continue;
     }
@@ -1272,28 +1272,28 @@ void Pipe::reader()
 
     char buf[80];
     char tag = -1;
-    ldout(msgr->cct,20) << "reader reading tag..." << dendl;
+    ldout(msgr->cct, 1) << "reader reading tag..." << dendl;
     if (tcp_read((char*)&tag, 1) < 0) {
       pipe_lock.Lock();
-      ldout(msgr->cct,2) << "reader couldn't read tag, " << strerror_r(errno, buf, sizeof(buf)) << dendl;
+      ldout(msgr->cct, 1) << "reader couldn't read tag, " << strerror_r(errno, buf, sizeof(buf)) << dendl;
       fault(true);
       continue;
     }
 
     if (tag == CEPH_MSGR_TAG_KEEPALIVE) {
-      ldout(msgr->cct,20) << "reader got KEEPALIVE" << dendl;
+      ldout(msgr->cct, 1) << "reader got KEEPALIVE" << dendl;
       pipe_lock.Lock();
       continue;
     }
 
     // open ...
 
     if (tag == CEPH_MSGR_TAG_ACK) {
-      ldout(msgr->cct,20) << "reader got ACK" << dendl;
+      ldout(msgr->cct, 1) << "reader got ACK" << dendl;
       ceph_le64 seq;
       int rc = tcp_read((char*)&seq, sizeof(seq));
       pipe_lock.Lock();
       if (rc < 0) {
-        ldout(msgr->cct,2) << "reader couldn't read ack seq, " << strerror_r(errno, buf, sizeof(buf)) << dendl;
+        ldout(msgr->cct, 1) << "reader couldn't read ack seq, " << strerror_r(errno, buf, sizeof(buf)) << dendl;
         fault(true);
       } else if (state != STATE_CLOSED) {
         handle_ack(seq);
@@ -1302,7 +1302,7 @@ void Pipe::reader()
     }
 
     else if (tag == CEPH_MSGR_TAG_MSG) {
-      ldout(msgr->cct,20) << "reader got MSG" << dendl;
+      ldout(msgr->cct, 1) << "reader got MSG" << dendl;
       Message *m = 0;
       int r = read_message(&m);
 
@@ -1342,7 +1342,7 @@ void Pipe::reader()
 
       cond.Signal();  // wake up writer, to ack this
 
-      ldout(msgr->cct,10) << "reader got message "
+      ldout(msgr->cct, 1) << "reader got message "
                          << m->get_seq() << " " << m << " " << *m
                          << dendl;
 
@@ -1360,7 +1360,7 @@ void Pipe::reader()
     }
 
     else if (tag == CEPH_MSGR_TAG_CLOSE) {
-      ldout(msgr->cct,20) << "reader got CLOSE" << dendl;
+      ldout(msgr->cct, 1) << "reader got CLOSE" << dendl;
       pipe_lock.Lock();
       if (state == STATE_CLOSING) {
         state = STATE_CLOSED;
@@ -1383,7 +1383,7 @@ void Pipe::reader()
   reader_running = false;
   reader_needs_join = true;
   unlock_maybe_reap();
-  ldout(msgr->cct,10) << "reader done" << dendl;
+  ldout(msgr->cct, 1) << "reader done" << dendl;
 }
 
 /* write msgs to socket.
@@ -1395,7 +1395,7 @@ void Pipe::writer()
 
   pipe_lock.Lock();
   while (state != STATE_CLOSED) {// && state != STATE_WAIT) {
-    ldout(msgr->cct,10) << "writer: state = " << get_state_name()
+    ldout(msgr->cct, 1) << "writer: state = " << get_state_name()
                        << " policy.server=" << policy.server << dendl;
 
     // standby?
@@ -1413,7 +1413,7 @@ void Pipe::writer()
 
     if (state == STATE_CLOSING) {
       // write close tag
-      ldout(msgr->cct,20) << "writer writing CLOSE tag" << dendl;
+      ldout(msgr->cct, 1) << "writer writing CLOSE tag" << dendl;
       char tag = CEPH_MSGR_TAG_CLOSE;
       state = STATE_CLOSED;
       state_closed.set(1);
@@ -1436,7 +1436,7 @@ void Pipe::writer()
       int rc = write_keepalive();
       pipe_loc
Hadoop release jars
Hi all, I've pushed up some changes that let us build stand-alone jar files for the Hadoop CephFS bindings. github.com/ceph/hadoop-common.git cephfs/branch-1.0-build-jar Running "ant cephfs" will produce "build/hadoop-cephfs.jar". I've tested it locally and things work well, so hopefully this means we can continue to develop in the hadoop-common tree and produce jar releases for whatever version combinations we care about. Is this something that should be easy to integrate into gitbuilder so we can link off the documentation page to the release jars? -Noah -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Mon losing touch with OSDs
On Sat, 23 Feb 2013, Chris Dunlop wrote: > On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote: > > On Sat, 23 Feb 2013, Chris Dunlop wrote: > >> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote: > >>> On Sat, 23 Feb 2013, Chris Dunlop wrote: > On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote: > > On Fri, 22 Feb 2013, Chris Dunlop wrote: > >> G'day, > >> > >> It seems there might be two issues here: the first being the delayed > >> receipt of echo replies causing an seemingly otherwise healthy osd to > >> be > >> marked down, the second being the lack of recovery once the downed osd > >> is > >> recognised as up again. > >> > >> Is it worth my opening tracker reports for this, just so it doesn't get > >> lost? > > > > I just looked at the logs. I can't tell what happend to cause that 10 > > second delay.. strangely, messages were passing from 0 -> 1, but > > nothing > > came back from 1 -> 0 (although 1 was queuing, if not sending, them). > >> > >> Is there any way of telling where they were delayed, i.e. in the 1's output > >> queue or 0's input queue? > > > > Yeah, if you bump it up to 'debug ms = 20'. Be aware that that will > > generate a lot of logging, though. > > I really don't want to load the system with too much logging, but I'm happy > modifying code... Are there specific interesting debug outputs which I can > modify so they're output under "ms = 1"? I'm basically interested in everything in writer() and write_message(), and reader() and read_message()... sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Mon losing touch with OSDs
On Sat, Feb 23, 2013 at 11:50:26AM +1100, Chris Dunlop wrote:
> On Fri, Feb 22, 2013 at 04:25:39PM -0800, Sage Weil wrote:
>> Hi Chris-
>>
>> Can you confirm that both ceph-osd daemons are running v0.56.3 (i.e.,
>> they were restarted after the upgrade)?
>
> Not absolutely, but the indications are good: the osd.1 process
> was started Feb 16 08:38:43 2013, 20 minutes before I sent my
> email saying it had been upgraded. The osd.0 process was
> restarted more recently to kick things along after the most
> recent problem, however my command line history shows an "apt-get
> upgrade" followed by a "service ceph restart".

I can't see anything in the logs indicating the various daemon version
numbers. Unless I'm blind and it's already there, perhaps it would be a
good idea for every daemon to log its version number when it starts?

Chris
Re: Mon losing touch with OSDs
On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote: > On Sat, 23 Feb 2013, Chris Dunlop wrote: >> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote: >>> On Sat, 23 Feb 2013, Chris Dunlop wrote: On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote: > On Fri, 22 Feb 2013, Chris Dunlop wrote: >> G'day, >> >> It seems there might be two issues here: the first being the delayed >> receipt of echo replies causing an seemingly otherwise healthy osd to be >> marked down, the second being the lack of recovery once the downed osd is >> recognised as up again. >> >> Is it worth my opening tracker reports for this, just so it doesn't get >> lost? > > I just looked at the logs. I can't tell what happend to cause that 10 > second delay.. strangely, messages were passing from 0 -> 1, but nothing > came back from 1 -> 0 (although 1 was queuing, if not sending, them). >> >> Is there any way of telling where they were delayed, i.e. in the 1's output >> queue or 0's input queue? > > Yeah, if you bump it up to 'debug ms = 20'. Be aware that that will > generate a lot of logging, though. I really don't want to load the system with too much logging, but I'm happy modifying code... Are there specific interesting debug outputs which I can modify so they're output under "ms = 1"? Chris -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Mon losing touch with OSDs
On Fri, Feb 22, 2013 at 04:25:39PM -0800, Sage Weil wrote: > Hi Chris- > > Can you confirm that both ceph-osd daemons are running v0.56.3 (i.e., > they were restarted after the upgrade)? Not absolutely, but the indications are good: the osd.1 process was started Feb 16 08:38:43 2013, 20 minutes before I sent my email saying it had been upgraded. The osd.0 process was restarted more recently to kick things along after the most recent problem, however my command line history shows a "apt-get upgrade" followed by a "service ceph restart". Chris -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OSD memory leaks?
On Fri, 22 Feb 2013, Sébastien Han wrote:
> Hi all,
>
> I finally got a core dump.
>
> I did it with a kill -SEGV on the OSD process.
>
> https://www.dropbox.com/s/ahv6hm0ipnak5rf/core-ceph-osd-11-0-0-20100-1361539008
>
> Hope we will get something out of it :-).

AHA! We have a theory. The pg log isn't trimmed during scrub (because
the old scrub code required that), but the new (deep) scrub can take a
very long time, which means the pg log will eat RAM in the meantime..
especially under high iops.

Can you try wip-osd-log-trim (which is bobtail + a simple patch) and
see if that seems to work? Note that that patch shouldn't be run in a
mixed argonaut+bobtail cluster, since it isn't properly checking
whether the scrub is classic or chunky/deep.

Thanks!
sage

> --
> Regards,
> Sébastien Han.
>
> On Fri, Jan 11, 2013 at 7:13 PM, Gregory Farnum wrote:
>> On Fri, Jan 11, 2013 at 6:57 AM, Sébastien Han wrote:
>>>> Is osd.1 using the heap profiler as well? Keep in mind that active use
>>>> of the memory profiler will itself cause memory usage to increase --
>>>> this sounds a bit like that to me since it's staying stable at a large
>>>> but finite portion of total memory.
>>>
>>> Well, the memory consumption was already high before the profiler was
>>> started. So yes, with the memory profiler enabled an OSD might consume
>>> more memory, but this doesn't cause the memory leaks.
>>
>> My concern is that maybe you saw a leak but when you restarted with
>> the memory profiling you lost whatever conditions caused it.
>>
>>> Any ideas? Nothing to say about my scrubbing theory?
>>
>> I like it, but Sam indicates that without some heap dumps which
>> capture the actual leak then scrub is too large to effectively code
>> review for leaks. :(
>> -Greg
Re: Mon losing touch with OSDs
Hi Chris- Can you confirm that both ceph-osd daemons are running v0.56.3 (i.e., they were restarted after the upgrade)? sage On Fri, 22 Feb 2013, Sage Weil wrote: > On Sat, 23 Feb 2013, Chris Dunlop wrote: > > On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote: > > > On Sat, 23 Feb 2013, Chris Dunlop wrote: > > >> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote: > > >>> On Fri, 22 Feb 2013, Chris Dunlop wrote: > > G'day, > > > > It seems there might be two issues here: the first being the delayed > > receipt of echo replies causing an seemingly otherwise healthy osd to > > be > > marked down, the second being the lack of recovery once the downed osd > > is > > recognised as up again. > > > > Is it worth my opening tracker reports for this, just so it doesn't get > > lost? > > >>> > > >>> I just looked at the logs. I can't tell what happend to cause that 10 > > >>> second delay.. strangely, messages were passing from 0 -> 1, but > > >>> nothing > > >>> came back from 1 -> 0 (although 1 was queuing, if not sending, them). > > > > Is there any way of telling where they were delayed, i.e. in the 1's output > > queue or 0's input queue? > > Yeah, if you bump it up to 'debug ms = 20'. Be aware that that will > generate a lot of logging, though. > > > >>> The strange bit is that after this, you get those indefinite hangs. > > >>> From > > >>> the logs it looks like the OSD rebound to an old port that was > > >>> previously > > >>> open from osd.0.. probably from way back. Do you have logs going > > >>> further > > >>> back than what you posted? Also, do you have osdmaps, say, 750 and > > >>> onward? It looks like there is a bug in the connection handling code > > >>> (that is unrelated to the delay above). > > >> > > >> Currently uploading logs starting midnight to dropbox, will send > > >> links when when they're up. > > >> > > >> How would I retrieve the interesting osdmaps? > > > > > > They are in the monitor data directory, in the osdmap_full dir. 
> > > > Logs from midnight onwards and osdmaps are in this folder: > > > > https://www.dropbox.com/sh/7nq7gr2u2deorcu/Nvw3FFGiy2 > > > > ceph-mon.b2.log.bz2 > > ceph-mon.b4.log.bz2 > > ceph-mon.b5.log.bz2 > > ceph-osd.0.log.bz2 > > ceph-osd.1.log.bz2 (still uploading as I type) > > osdmaps.zip > > I'll take a look... -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Mon losing touch with OSDs
On Sat, 23 Feb 2013, Chris Dunlop wrote: > On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote: > > On Sat, 23 Feb 2013, Chris Dunlop wrote: > >> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote: > >>> On Fri, 22 Feb 2013, Chris Dunlop wrote: > G'day, > > It seems there might be two issues here: the first being the delayed > receipt of echo replies causing an seemingly otherwise healthy osd to be > marked down, the second being the lack of recovery once the downed osd is > recognised as up again. > > Is it worth my opening tracker reports for this, just so it doesn't get > lost? > >>> > >>> I just looked at the logs. I can't tell what happend to cause that 10 > >>> second delay.. strangely, messages were passing from 0 -> 1, but nothing > >>> came back from 1 -> 0 (although 1 was queuing, if not sending, them). > > Is there any way of telling where they were delayed, i.e. in the 1's output > queue or 0's input queue? Yeah, if you bump it up to 'debug ms = 20'. Be aware that that will generate a lot of logging, though. > >>> The strange bit is that after this, you get those indefinite hangs. From > >>> the logs it looks like the OSD rebound to an old port that was previously > >>> open from osd.0.. probably from way back. Do you have logs going further > >>> back than what you posted? Also, do you have osdmaps, say, 750 and > >>> onward? It looks like there is a bug in the connection handling code > >>> (that is unrelated to the delay above). > >> > >> Currently uploading logs starting midnight to dropbox, will send > >> links when when they're up. > >> > >> How would I retrieve the interesting osdmaps? > > > > They are in the monitor data directory, in the osdmap_full dir. 
> > Logs from midnight onwards and osdmaps are in this folder: > > https://www.dropbox.com/sh/7nq7gr2u2deorcu/Nvw3FFGiy2 > > ceph-mon.b2.log.bz2 > ceph-mon.b4.log.bz2 > ceph-mon.b5.log.bz2 > ceph-osd.0.log.bz2 > ceph-osd.1.log.bz2 (still uploading as I type) > osdmaps.zip I'll take a look... sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Mon losing touch with OSDs
On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote: > On Sat, 23 Feb 2013, Chris Dunlop wrote: >> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote: >>> On Fri, 22 Feb 2013, Chris Dunlop wrote: G'day, It seems there might be two issues here: the first being the delayed receipt of echo replies causing an seemingly otherwise healthy osd to be marked down, the second being the lack of recovery once the downed osd is recognised as up again. Is it worth my opening tracker reports for this, just so it doesn't get lost? >>> >>> I just looked at the logs. I can't tell what happend to cause that 10 >>> second delay.. strangely, messages were passing from 0 -> 1, but nothing >>> came back from 1 -> 0 (although 1 was queuing, if not sending, them). Is there any way of telling where they were delayed, i.e. in the 1's output queue or 0's input queue? >>> The strange bit is that after this, you get those indefinite hangs. From >>> the logs it looks like the OSD rebound to an old port that was previously >>> open from osd.0.. probably from way back. Do you have logs going further >>> back than what you posted? Also, do you have osdmaps, say, 750 and >>> onward? It looks like there is a bug in the connection handling code >>> (that is unrelated to the delay above). >> >> Currently uploading logs starting midnight to dropbox, will send >> links when when they're up. >> >> How would I retrieve the interesting osdmaps? > > They are in the monitor data directory, in the osdmap_full dir. Logs from midnight onwards and osdmaps are in this folder: https://www.dropbox.com/sh/7nq7gr2u2deorcu/Nvw3FFGiy2 ceph-mon.b2.log.bz2 ceph-mon.b4.log.bz2 ceph-mon.b5.log.bz2 ceph-osd.0.log.bz2 ceph-osd.1.log.bz2 (still uploading as I type) osdmaps.zip Cheers, Chris -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Mon losing touch with OSDs
On Sat, 23 Feb 2013, Chris Dunlop wrote: > On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote: > > On Fri, 22 Feb 2013, Chris Dunlop wrote: > >> G'day, > >> > >> It seems there might be two issues here: the first being the delayed > >> receipt of echo replies causing an seemingly otherwise healthy osd to be > >> marked down, the second being the lack of recovery once the downed osd is > >> recognised as up again. > >> > >> Is it worth my opening tracker reports for this, just so it doesn't get > >> lost? > > > > I just looked at the logs. I can't tell what happend to cause that 10 > > second delay.. strangely, messages were passing from 0 -> 1, but nothing > > came back from 1 -> 0 (although 1 was queuing, if not sending, them). > > > > The strange bit is that after this, you get those indefinite hangs. From > > the logs it looks like the OSD rebound to an old port that was previously > > open from osd.0.. probably from way back. Do you have logs going further > > back than what you posted? Also, do you have osdmaps, say, 750 and > > onward? It looks like there is a bug in the connection handling code > > (that is unrelated to the delay above). > > Currently uploading logs starting midnight to dropbox, will send > links when when they're up. > > How would I retrieve the interesting osdmaps? They are in the monitor data directory, in the osdmap_full dir. Thanks! sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Mon losing touch with OSDs
On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
> On Fri, 22 Feb 2013, Chris Dunlop wrote:
>> G'day,
>>
>> It seems there might be two issues here: the first being the delayed
>> receipt of echo replies causing a seemingly otherwise healthy osd to be
>> marked down, the second being the lack of recovery once the downed osd is
>> recognised as up again.
>>
>> Is it worth my opening tracker reports for this, just so it doesn't get
>> lost?
>
> I just looked at the logs. I can't tell what happened to cause that 10
> second delay.. strangely, messages were passing from 0 -> 1, but nothing
> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
>
> The strange bit is that after this, you get those indefinite hangs. From
> the logs it looks like the OSD rebound to an old port that was previously
> open from osd.0.. probably from way back. Do you have logs going further
> back than what you posted? Also, do you have osdmaps, say, 750 and
> onward? It looks like there is a bug in the connection handling code
> (that is unrelated to the delay above).

Currently uploading logs starting midnight to dropbox, will send
links when they're up.

How would I retrieve the interesting osdmaps?

Chris.
Re: [PATCH] ceph: fix statvfs fr_size
On Fri, Feb 22, 2013 at 3:23 PM, Sage Weil wrote:
> Different versions of glibc are broken in different ways, but the short of
> it is that for the time being, frsize should == bsize, and be used as the
> multiple for the blocks, free, and available fields. This mirrors what is
> done for NFS. The previous reporting of the page size for frsize meant
> that newer glibc and df would report a very small value for the fs size.
>
> Fixes http://tracker.ceph.com/issues/3793.
>
> Signed-off-by: Sage Weil
> ---
>  fs/ceph/super.c |    7 ++-
>  fs/ceph/super.h |    2 +-
>  2 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ceph/super.c b/fs/ceph/super.c
> index e86aa99..006f94b 100644
> --- a/fs/ceph/super.c
> +++ b/fs/ceph/super.c
> @@ -71,8 +71,14 @@ static int ceph_statfs(struct dentry *dentry, struct kstatfs *buf)
>         /*
>          * express utilization in terms of large blocks to avoid
>          * overflow on 32-bit machines.
> +        *
> +        * NOTE: for the time being, we make bsize == frsize so humor

s/so/to :)

> +        * not-yet-ancient versions of glibc that are broken.
> +        * Someday, we will probably want to report a real block
> +        * size...  whatever that may mean for a network file system!
>          */
>         buf->f_bsize = 1 << CEPH_BLOCK_SHIFT;
> +       buf->f_frsize = 1 << CEPH_BLOCK_SHIFT;
>         buf->f_blocks = le64_to_cpu(st.kb) >> (CEPH_BLOCK_SHIFT-10);
>         buf->f_bfree = le64_to_cpu(st.kb_avail) >> (CEPH_BLOCK_SHIFT-10);
>         buf->f_bavail = le64_to_cpu(st.kb_avail) >> (CEPH_BLOCK_SHIFT-10);
> @@ -80,7 +86,6 @@ static int ceph_statfs(struct dentry *dentry, struct kstatfs *buf)
>         buf->f_files = le64_to_cpu(st.num_objects);
>         buf->f_ffree = -1;
>         buf->f_namelen = NAME_MAX;
> -       buf->f_frsize = PAGE_CACHE_SIZE;
>
>         /* leave fsid little-endian, regardless of host endianness */
>         fsid = *(u64 *)(&monmap->fsid) ^ *((u64 *)&monmap->fsid + 1);
>
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index 9861cce..604526a 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -21,7 +21,7 @@
>
>  /* large granularity for statfs utilization stats to facilitate
>   * large volume sizes on 32-bit machines. */
> -#define CEPH_BLOCK_SHIFT   20  /* 1 MB */
> +#define CEPH_BLOCK_SHIFT   22  /* 4 MB */
>  #define CEPH_BLOCK         (1 << CEPH_BLOCK_SHIFT)
>
>  #define CEPH_MOUNT_OPT_DIRSTAT  (1<<4) /* `cat dirname` for stats */

I've already voiced my concerns about this (basically de-tuning the
stack above us with the large block sizes), but I dunno that they have
much validity, and this should work from the POV of our code
correctness, so other than the typo above:

Reviewed-by: Greg Farnum
[PATCH] ceph: fix statvfs fr_size
Different versions of glibc are broken in different ways, but the short of it is that for the time being, frsize should == bsize, and be used as the multiple for the blocks, free, and available fields. This mirrors what is done for NFS. The previous reporting of the page size for frsize meant that newer glibc and df would report a very small value for the fs size. Fixes http://tracker.ceph.com/issues/3793. Signed-off-by: Sage Weil --- fs/ceph/super.c |7 ++- fs/ceph/super.h |2 +- 2 files changed, 7 insertions(+), 2 deletions(-) diff --git a/fs/ceph/super.c b/fs/ceph/super.c index e86aa99..006f94b 100644 --- a/fs/ceph/super.c +++ b/fs/ceph/super.c @@ -71,8 +71,14 @@ static int ceph_statfs(struct dentry *dentry, struct kstatfs *buf) /* * express utilization in terms of large blocks to avoid * overflow on 32-bit machines. +* +* NOTE: for the time being, we make bsize == frsize so humor +* not-yet-ancient versions of glibc that are broken. +* Someday, we will probably want to report a real block +* size... whatever that may mean for a network file system! */ buf->f_bsize = 1 << CEPH_BLOCK_SHIFT; + buf->f_frsize = 1 << CEPH_BLOCK_SHIFT; buf->f_blocks = le64_to_cpu(st.kb) >> (CEPH_BLOCK_SHIFT-10); buf->f_bfree = le64_to_cpu(st.kb_avail) >> (CEPH_BLOCK_SHIFT-10); buf->f_bavail = le64_to_cpu(st.kb_avail) >> (CEPH_BLOCK_SHIFT-10); @@ -80,7 +86,6 @@ static int ceph_statfs(struct dentry *dentry, struct kstatfs *buf) buf->f_files = le64_to_cpu(st.num_objects); buf->f_ffree = -1; buf->f_namelen = NAME_MAX; - buf->f_frsize = PAGE_CACHE_SIZE; /* leave fsid little-endian, regardless of host endianness */ fsid = *(u64 *)(&monmap->fsid) ^ *((u64 *)&monmap->fsid + 1); diff --git a/fs/ceph/super.h b/fs/ceph/super.h index 9861cce..604526a 100644 --- a/fs/ceph/super.h +++ b/fs/ceph/super.h @@ -21,7 +21,7 @@ /* large granularity for statfs utilization stats to facilitate * large volume sizes on 32-bit machines. 
*/ -#define CEPH_BLOCK_SHIFT 20 /* 1 MB */ +#define CEPH_BLOCK_SHIFT 22 /* 4 MB */ #define CEPH_BLOCK (1 << CEPH_BLOCK_SHIFT) #define CEPH_MOUNT_OPT_DIRSTAT (1<<4) /* `cat dirname` for stats */ -- 1.7.9.5
Re: Mon losing touch with OSDs
On Fri, 22 Feb 2013, Chris Dunlop wrote: > G'day, > > It seems there might be two issues here: the first being the delayed > receipt of echo replies causing a seemingly otherwise healthy osd to be > marked down, the second being the lack of recovery once the downed osd is > recognised as up again. > > Is it worth my opening tracker reports for this, just so it doesn't get > lost? I just looked at the logs. I can't tell what happened to cause that 10 second delay.. strangely, messages were passing from 0 -> 1, but nothing came back from 1 -> 0 (although 1 was queuing, if not sending, them). The strange bit is that after this, you get those indefinite hangs. From the logs it looks like the OSD rebound to an old port that was previously open from osd.0.. probably from way back. Do you have logs going further back than what you posted? Also, do you have osdmaps, say, 750 and onward? It looks like there is a bug in the connection handling code (that is unrelated to the delay above). Thanks! sage > > Cheers, > > Chris > > On Wed, Feb 20, 2013 at 01:07:03PM +1100, Chris Dunlop wrote: > > On Tue, Feb 19, 2013 at 02:02:03PM +1100, Chris Dunlop wrote: > >> On Sun, Feb 17, 2013 at 05:44:29PM -0800, Sage Weil wrote: > >>> On Mon, 18 Feb 2013, Chris Dunlop wrote: > On Sat, Feb 16, 2013 at 09:05:21AM +1100, Chris Dunlop wrote: > > On Thu, Feb 14, 2013 at 08:57:11PM -0800, Sage Weil wrote: > >> On Fri, 15 Feb 2013, Chris Dunlop wrote: > >>> In an otherwise seemingly healthy cluster (ceph 0.56.2), what might > >>> cause the > >>> mons to lose touch with the osds? > >> > >> Can you enable 'debug ms = 1' on the mons and leave them that way, in > >> the > >> hopes that this happens again? It will give us more information to go > >> on. > > > > Debug turned on. > > We haven't experienced the cluster losing touch with the osds completely > since upgrading from 0.56.2 to 0.56.3, but we did lose touch with osd.1 > for a few seconds before it recovered.
See below for logs (reminder: 3 > boxes, b2 is mon-only, b4 is mon+osd.0, b5 is mon+osd.1). > >>> > >>> Hrm, I don't see any obvious clues. You could enable 'debug ms = 1' on > >>> the osds as well. That will give us more to go on if/when it happens > >>> again, and should not affect performance significantly. > >> > >> Done: ceph osd tell '*' injectargs '--debug-ms 1' > >> > >> Now to wait for it to happen again. > > > > OK, we got it again. Full logs covering the incident available at: > > > > https://www.dropbox.com/s/kguzwyjfglv3ypl/ceph-logs.zip > > > > Archive: /tmp/ceph-logs.zip > > Length MethodSize CmprDateTime CRC-32 Name > > -- --- -- - > >11492 Defl:X 1186 90% 2013-02-20 12:04 c0cba4ae ceph-mon.b2.log > > 1270789 Defl:X89278 93% 2013-02-20 12:00 2208d035 ceph-mon.b4.log > > 1375858 Defl:X 104025 92% 2013-02-20 12:05 c64c1dad ceph-mon.b5.log > > 2020042 Defl:X 215000 89% 2013-02-20 10:40 f74ae4ca ceph-osd.0.log > > 2075512 Defl:X 224098 89% 2013-02-20 12:05 b454d2ec ceph-osd.1.log > > 154938 Defl:X12989 92% 2013-02-20 12:04 d2729b05 ceph.log > > --- ------ > > 6908631 646576 91%6 files > > > > My naive analysis, based on the log extracts below (best viewed on a wide > > screen!)... > > > > Osd.0 starts hearing much-delayed ping_replies from osd.1 and tells the mon, > > which marks osd.1 down. > > > > However the whole time, the osd.1 log indicates that it's receiving and > > responding to each ping from osd.0 in a timely fashion. In contrast, the > > osd.0 > > log indicates it isn't seeing the osd.1 replies for a while, then sees them > > all > > arrive in a flurry, until they're "delayed" enough to cause osd.0 to tell > > the > > mon. > > > > During the time osd.0 is not seeing the osd.1 ping_replies, there's other > > traffic > > (osd_op, osd_sub_op, osd_sub_op_reply etc.) between osd.0 and osd.1, > > indicating > > that it's not a network problem. > > > > The load on both osds during this period was >90% idle and <1% iow. 
> > > > Is this pointing to osd.0 experiencing some kind of scheduling or priority > > starvation on the ping thread (assuming the ping is in its own thread)? > > > > The next odd thing is that, although the osds are both back by 04:38:50 ("2 > > osds: 2 up, 2 in"), the system still wasn't working (see the disk stats for > > both osd.0 and osd.1) and didn't recover until ceph (mon + osd) was > > restarted > > on one of the boxes at around 05:50 (not shown in the logs, but full logs > > available if needed). > > > > Prior to the restart: > > > > # ceph health > > HEALTH_WARN 281 pgs peering; 281 pgs stuck inactive; 576 pgs stuck unclean > > > > (Sorry, once again didn't get a 'ceph -s' prior to the restart.)
OSD memory usage
Hi folks, I was having trouble with OSDs crashing under ceph 0.56.2 (CentOS, 64-bit, installed from rpms, using the "elrepo" kernel), so I eagerly installed 0.56.3. Since then, I've been having a lot of trouble getting the OSDs running, either because of previous crashes or because of some change in 0.56.3. What's happening now is that the OSD processes are using so much memory that the machines start swapping, and eventually die. Today, I tried systematically starting up a few OSDs at a time and watching memory usage. The processes climb pretty quickly up to 2-3 GB in RSIZE. If I let them run for a while and then restart them, I find that they tend to settle into an RSIZE of about 1.5 GB. Unfortunately, I don't seem to have any records of how big these processes were under 0.56.2, but this is way too big to fit into the memory available, when I start up all of the OSD processes. My impression was that the sizes used to be well under 1 GB when the cluster was idle. Is there anything I can do to reduce the memory footprint of ceph-osd? Thanks for any advice. Bryan -- Bryan Wright |"If you take cranberries and stew them like Physics Department| applesauce, they taste much more like prunes University of Virginia| than rhubarb does." -- Groucho Charlottesville, VA 22901| (434) 924-7218| br...@virginia.edu
Re: [PATCH 3/3] ceph: fix vmtruncate deadlock
I haven't spent that much time in the kernel client, but this patch isn't working out for me. In particular, I'm pretty sure we need to preserve this: > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c > index 5d5c32b..b9d8417 100644 > --- a/fs/ceph/caps.c > +++ b/fs/ceph/caps.c > @@ -2067,12 +2067,6 @@ static int try_get_cap_refs(struct ceph_inode_info > *ci, int need, int want, > } > have = __ceph_caps_issued(ci, &implemented); > > - /* > -* disallow writes while a truncate is pending > -*/ > - if (ci->i_truncate_pending) > - have &= ~CEPH_CAP_FILE_WR; > - > if ((have & need) == need) { > /* > * Look at (implemented & ~have & not) so that we keep waiting Because if there's a pending truncate, we really can't write. You do handle it in the particular case of doing buffered file writes, but these caps are a marker of permission, and the client shouldn't have write permission to a file until it's up to date on the truncates. Or am I misunderstanding something? > diff --git a/fs/ceph/file.c b/fs/ceph/file.c > index a1e5b81..bf7849a 100644 > --- a/fs/ceph/file.c > +++ b/fs/ceph/file.c > @@ -653,7 +653,6 @@ static ssize_t ceph_aio_read(struct kiocb *iocb, const > struct iovec *iov, > dout("aio_read %p %llx.%llx %llu~%u trying to get caps on %p\n", > inode, ceph_vinop(inode), pos, (unsigned)len, inode); > again: > - __ceph_do_pending_vmtruncate(inode); There doesn't seem to be any kind of replacement for this? We need to do any pending truncates before reading or we might get stale data back. The first two in the series look good to me, but I'll wait and pull them in as a unit with this one once we're done discussing. :) -Greg -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ceph scalar & replicas performance
On Thu, Feb 21, 2013 at 5:01 PM, wrote:
> Hi all, I have some problems after my scaling performance test!
>
> Setup:
> Linux kernel: 3.2.0
> OS: Ubuntu 12.04
> Storage server: 11 HDD (each storage server has 11 osd, 7200 rpm, 1T) + 10GbE NIC + RAID card: LSI MegaRAID SAS 9260-4i
> For every HDD: RAID0, Write Policy: Write Back with BBU, Read Policy: ReadAhead, IO Policy: Direct
> Storage server number: 1 to 4
> Ceph version: 0.48.2
> Replicas: 2
>
> FIO cmd:
> [Sequential Read]
> fio --iodepth=32 --numjobs=1 --runtime=120 --bs=65536 --rw=read --ioengine=libaio --group_reporting --direct=1 --eta=always --ramp_time=10 --thinktime=10
> [Sequential Write]
> fio --iodepth=32 --numjobs=1 --runtime=120 --bs=65536 --rw=write --ioengine=libaio --group_reporting --direct=1 --eta=always --ramp_time=10 --thinktime=10
> [Random Read]
> fio --iodepth=32 --numjobs=8 --runtime=120 --bs=65536 --rw=randread --ioengine=libaio --group_reporting --direct=1 --eta=always --ramp_time=10 --thinktime=10
> [Random Write]
> fio --iodepth=32 --numjobs=8 --runtime=120 --bs=65536 --rw=randwrite --ioengine=libaio --group_reporting --direct=1 --eta=always --ramp_time=10 --thinktime=10
>
> Use a ceph client to create a 1T RBD image for testing; the client also has a 10GbE NIC, Linux kernel 3.2.0, Ubuntu 12.04.
>
> Performance result, bandwidth (MB/sec):
>
> storage servers | Seq Read | Seq Write | Rand Read | Rand Write
> 1               |      259 |        76 |       837 |         26
> 2               |      349 |       121 |       950 |         45
> 3               |      354 |       108 |       490 |         71
> 4               |      338 |       103 |       610 |         89
>
> We expected bandwidth to increase as storage servers are added in all cases, but the results do not show that. Can you share your thoughts on read/write bandwidth as storage servers increase?

There's a bunch of stuff that could be weird here. Is your switch capable of handling all the traffic going over it? Have you benchmarked the drives and filesystems on each node individually to make sure they all have the same behavior, or are some of your additions slower than the others? (My money is on you having some slow drives that are dragging everything down.)

> In another case, we fixed the setup at 4 storage servers, then adjusted the number of replicas from 2 to 4.
>
> Performance result, bandwidth (MB/sec):
>
> replicas | Seq Read | Seq Write | Rand Read | Rand Write
> 2        |      338 |       103 |       614 |         89
> 3        |      337 |        76 |       791 |         62
> 4        |      337 |        60 |       754 |         43
>
> The bandwidth of writes decreasing as replicas increase is easy to understand, but why did read bandwidth not increase?

Reads are always served from the "primary" OSD, but even if they weren't, you distribute the same number of reads over the same number of disks no matter how many replicas you have of each individual data block... But in particular the change in random read values that you're seeing indicates that your data is very noisy — I'm not sure I'd trust any of the values you're seeing, especially the weirder trends. It might be all noise and no real data value. -Greg
Re: [PATCH 2/5] libceph: separate non-locked fault handling
On 02/22/2013 11:26 AM, Alex Elder wrote: > An error occurring on a ceph connection is treated as a fault, causing the connection to be reset. The initial part of this fault handling has to be done while holding the connection mutex, but it must then be dropped for the last part. . . . Sorry about the duplicate(s) and the messed up subject lines. I ran into a little trouble with gmail in the middle of posting these. -Alex
[PATCH 5/5, v2] libceph: indent properly
This is just the second part of a two-part patch. It simply indents a block of code. This patch is going to be merged into its predecessor following review. Signed-off-by: Alex Elder --- v2: rebased net/ceph/messenger.c | 80 +- 1 file changed, 40 insertions(+), 40 deletions(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 223406f..2c0669f 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -2390,50 +2390,50 @@ static void con_work(struct work_struct *work) bool fault; mutex_lock(&con->mutex); -while (true) { - int ret; + while (true) { + int ret; - if ((fault = con_sock_closed(con))) { - dout("%s: con %p SOCK_CLOSED\n", __func__, con); - break; - } - if (con_backoff(con)) { - dout("%s: con %p BACKOFF\n", __func__, con); - break; - } - if (con->state == CON_STATE_STANDBY) { - dout("%s: con %p STANDBY\n", __func__, con); - break; - } - if (con->state == CON_STATE_CLOSED) { - dout("%s: con %p CLOSED\n", __func__, con); - BUG_ON(con->sock); - break; - } - if (con->state == CON_STATE_PREOPEN) { - dout("%s: con %p PREOPEN\n", __func__, con); - BUG_ON(con->sock); - } + if ((fault = con_sock_closed(con))) { + dout("%s: con %p SOCK_CLOSED\n", __func__, con); + break; + } + if (con_backoff(con)) { + dout("%s: con %p BACKOFF\n", __func__, con); + break; + } + if (con->state == CON_STATE_STANDBY) { + dout("%s: con %p STANDBY\n", __func__, con); + break; + } + if (con->state == CON_STATE_CLOSED) { + dout("%s: con %p CLOSED\n", __func__, con); + BUG_ON(con->sock); + break; + } + if (con->state == CON_STATE_PREOPEN) { + dout("%s: con %p PREOPEN\n", __func__, con); + BUG_ON(con->sock); + } - ret = try_read(con); - if (ret < 0) { - if (ret == -EAGAIN) - continue; - con->error_msg = "socket error on read"; - fault = true; - break; - } + ret = try_read(con); + if (ret < 0) { + if (ret == -EAGAIN) + continue; + con->error_msg = "socket error on read"; + fault = true; + break; + } - ret = try_write(con); - if (ret < 0) { - if (ret == -EAGAIN) - continue; - 
con->error_msg = "socket error on write"; - fault = true; - } + ret = try_write(con); + if (ret < 0) { + if (ret == -EAGAIN) + continue; + con->error_msg = "socket error on write"; + fault = true; + } - break; /* If we make it to here, we're done */ -} + break; /* If we make it to here, we're done */ + } if (fault) con_fault(con); mutex_unlock(&con->mutex); -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/5, v2] libceph: use a do..while loop in con_work()
This just converts a manually-implemented loop into a do..while loop in con_work(). It also moves handling of EAGAIN inside the blocks where it's already been determined an error code was returned. Also update a few dout() calls near the affected code for consistency. NOTE: This was done in two steps in order to facilitate review. The next patch simply indents the loop properly. This patch will be squashed into the next one before commit. Signed-off-by: Alex Elder --- v2: rebased net/ceph/messenger.c | 39 --- 1 file changed, 20 insertions(+), 19 deletions(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 18eb788..223406f 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -2387,51 +2387,53 @@ static void con_work(struct work_struct *work) { struct ceph_connection *con = container_of(work, struct ceph_connection, work.work); - bool fault = false; - int ret; + bool fault; mutex_lock(&con->mutex); -restart: - if (con_sock_closed(con)) { +while (true) { + int ret; + + if ((fault = con_sock_closed(con))) { dout("%s: con %p SOCK_CLOSED\n", __func__, con); - fault = true; - goto done; + break; } if (con_backoff(con)) { dout("%s: con %p BACKOFF\n", __func__, con); - goto done; + break; } if (con->state == CON_STATE_STANDBY) { - dout("con_work %p STANDBY\n", con); - goto done; + dout("%s: con %p STANDBY\n", __func__, con); + break; } if (con->state == CON_STATE_CLOSED) { - dout("con_work %p CLOSED\n", con); + dout("%s: con %p CLOSED\n", __func__, con); BUG_ON(con->sock); - goto done; + break; } if (con->state == CON_STATE_PREOPEN) { - dout("%s: con %p OPENING\n", __func__, con); + dout("%s: con %p PREOPEN\n", __func__, con); BUG_ON(con->sock); } ret = try_read(con); - if (ret == -EAGAIN) - goto restart; if (ret < 0) { + if (ret == -EAGAIN) + continue; con->error_msg = "socket error on read"; fault = true; - goto done; + break; } ret = try_write(con); - if (ret == -EAGAIN) - goto restart; if (ret < 0) { + if (ret == -EAGAIN) + continue;
con->error_msg = "socket error on write"; fault = true; } -done: + + break; /* If we make it to here, we're done */ +} if (fault) con_fault(con); mutex_unlock(&con->mutex); @@ -2442,7 +2444,6 @@ done: con->ops->put(con); } - /* * Generic error/fault handler. A retry mechanism is used with * exponential backoff -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/5, v2] libceph: use a flag to indicate a fault has occurred
This just rearranges the logic in con_work() a little bit so that a flag is used to indicate a fault has occurred. This allows both the fault and non-fault case to be handled the same way and avoids a couple of nearly consecutive gotos. Signed-off-by: Alex Elder --- v2: rebased net/ceph/messenger.c | 24 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index c3b9060..18eb788 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -2387,13 +2387,15 @@ static void con_work(struct work_struct *work) { struct ceph_connection *con = container_of(work, struct ceph_connection, work.work); + bool fault = false; int ret; mutex_lock(&con->mutex); restart: if (con_sock_closed(con)) { dout("%s: con %p SOCK_CLOSED\n", __func__, con); - goto fault; + fault = true; + goto done; } if (con_backoff(con)) { dout("%s: con %p BACKOFF\n", __func__, con); @@ -2418,7 +2420,8 @@ restart: goto restart; if (ret < 0) { con->error_msg = "socket error on read"; - goto fault; + fault = true; + goto done; } ret = try_write(con); @@ -2426,20 +2429,17 @@ restart: goto restart; if (ret < 0) { con->error_msg = "socket error on write"; - goto fault; + fault = true; } - done: + if (fault) + con_fault(con); mutex_unlock(&con->mutex); -done_unlocked: - con->ops->put(con); - return; -fault: - con_fault(con); - mutex_unlock(&con->mutex); - con_fault_finish(con); - goto done_unlocked; + if (fault) + con_fault_finish(con); + + con->ops->put(con); } -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/5] libceph: separate non-locked fault handling
An error occurring on a ceph connection is treated as a fault, causing the connection to be reset. The initial part of this fault handling has to be done while holding the connection mutex, but it must then be dropped for the last part. Separate the part of this fault handling that executes without the lock into its own function, con_fault_finish(). Move the call to this new function, as well as call that drops the connection mutex, into ceph_fault(). Rename that function con_fault() to reflect that it's only handling the connection part of the fault handling. The motivation for this was a warning from sparse about the locking being done here. Rearranging things this way keeps all the mutex manipulation within ceph_fault(), and this stops sparse from complaining. This partially resolves: http://tracker.ceph.com/issues/4184 Reported-by: Fengguang Wu Signed-off-by: Alex Elder --- v2: rebased net/ceph/messenger.c | 42 +++--- 1 file changed, 23 insertions(+), 19 deletions(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 9a29d8a..c3b9060 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -166,7 +166,7 @@ static struct lock_class_key socket_class; static void queue_con(struct ceph_connection *con); static void con_work(struct work_struct *); -static void ceph_fault(struct ceph_connection *con); +static void con_fault(struct ceph_connection *con); /* * Nicely render a sockaddr as a string. An array of formatted @@ -2363,6 +2363,23 @@ static bool con_backoff(struct ceph_connection *con) return true; } +/* Finish fault handling; con->mutex must *not* be held here */ + +static void con_fault_finish(struct ceph_connection *con) +{ + /* +* in case we faulted due to authentication, invalidate our +* current tickets so that we can get new ones. 
+*/ + if (con->auth_retry && con->ops->invalidate_authorizer) { + dout("calling invalidate_authorizer()\n"); + con->ops->invalidate_authorizer(con); + } + + if (con->ops->fault) + con->ops->fault(con); +} + /* * Do some work on a connection. Drop a connection ref when we're done. */ @@ -2419,7 +2436,9 @@ done_unlocked: return; fault: - ceph_fault(con); /* error/fault path */ + con_fault(con); + mutex_unlock(&con->mutex); + con_fault_finish(con); goto done_unlocked; } @@ -2428,8 +2447,7 @@ fault: * Generic error/fault handler. A retry mechanism is used with * exponential backoff */ -static void ceph_fault(struct ceph_connection *con) - __releases(con->mutex) +static void con_fault(struct ceph_connection *con) { pr_warning("%s%lld %s %s\n", ENTITY_NAME(con->peer_name), ceph_pr_addr(&con->peer_addr.in_addr), con->error_msg); @@ -2445,7 +2463,7 @@ static void ceph_fault(struct ceph_connection *con) if (con_flag_test(con, CON_FLAG_LOSSYTX)) { dout("fault on LOSSYTX channel, marking CLOSED\n"); con->state = CON_STATE_CLOSED; - goto out_unlock; + return; } if (con->in_msg) { @@ -2476,20 +2494,6 @@ static void ceph_fault(struct ceph_connection *con) con_flag_set(con, CON_FLAG_BACKOFF); queue_con(con); } - -out_unlock: - mutex_unlock(&con->mutex); - /* -* in case we faulted due to authentication, invalidate our -* current tickets so that we can get new ones. -*/ - if (con->auth_retry && con->ops->invalidate_authorizer) { - dout("calling invalidate_authorizer()\n"); - con->ops->invalidate_authorizer(con); - } - - if (con->ops->fault) - con->ops->fault(con); } -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/5] libceph: encapsulate connection backoff
Collect the code that tests for and implements a backoff delay for a ceph connection into a new function, con_backoff(). Make the debug output messages in that part of the code report things consistently by reporting a message in the socket closed case, and by making the one for PREOPEN state report the connection pointer like the rest. Signed-off-by: Alex Elder --- v2: rebased net/ceph/messenger.c | 37 - 1 file changed, 24 insertions(+), 13 deletions(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index ed9e237..9a29d8a 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -2345,6 +2345,24 @@ static bool con_sock_closed(struct ceph_connection *con) return true; } +static bool con_backoff(struct ceph_connection *con) +{ + int ret; + + if (!con_flag_test_and_clear(con, CON_FLAG_BACKOFF)) + return false; + + ret = queue_con_delay(con, round_jiffies_relative(con->delay)); + if (ret) { + dout("%s: con %p FAILED to back off %lu\n", __func__, + con, con->delay); + BUG_ON(ret == -ENOENT); + con_flag_set(con, CON_FLAG_BACKOFF); + } + + return true; +} + /* * Do some work on a connection. Drop a connection ref when we're done.
*/ @@ -2356,21 +2374,14 @@ static void con_work(struct work_struct *work) mutex_lock(&con->mutex); restart: - if (con_sock_closed(con)) + if (con_sock_closed(con)) { + dout("%s: con %p SOCK_CLOSED\n", __func__, con); goto fault; - - if (con_flag_test_and_clear(con, CON_FLAG_BACKOFF)) { - dout("con_work %p backing off\n", con); - ret = queue_con_delay(con, round_jiffies_relative(con->delay)); - if (ret) { - dout("con_work %p FAILED to back off %lu\n", con, -con->delay); - BUG_ON(ret == -ENOENT); - con_flag_set(con, CON_FLAG_BACKOFF); - } + } + if (con_backoff(con)) { + dout("%s: con %p BACKOFF\n", __func__, con); goto done; } - if (con->state == CON_STATE_STANDBY) { dout("con_work %p STANDBY\n", con); goto done; @@ -2381,7 +2392,7 @@ restart: goto done; } if (con->state == CON_STATE_PREOPEN) { - dout("con_work OPENING\n"); + dout("%s: con %p OPENING\n", __func__, con); BUG_ON(con->sock); } -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/5, v2] libceph clean up con_work()
(Re-posting because these changes have been rebased.)

This series cleans up con_work() a bit. The original motivation was to get rid of a warning issued by the sparse utility; addressing that required a little rework, and once that was done it was straightforward to make the function fairly simple. The warning itself arose because sparse cannot follow locking logic that is split across multiple cooperating functions. With these changes, both acquiring and releasing the connection mutex happen in con_work().

					-Alex

[PATCH 1/5] libceph: encapsulate connection backoff
[PATCH 2/5] libceph: separate non-locked fault handling
[PATCH 3/5] libceph: use a flag to indicate a fault has occurred
[PATCH 4/5] libceph: use a do..while loop in con_work()
[PATCH 5/5] libceph: indent properly

[PATCH, v2] libceph: eliminate sparse warnings
Eliminate most of the problems in the libceph code that cause sparse to issue warnings. - Convert functions that are never referenced externally to have static scope. - Pass NULL rather than 0 for a pointer argument in one spot in ceph_monc_delete_snapid() This partially resolves: http://tracker.ceph.com/issues/4184 Reported-by: Fengguang Wu Signed-off-by: Alex Elder --- v2: rebased net/ceph/crypto.c |7 --- net/ceph/messenger.c |2 +- net/ceph/mon_client.c |2 +- 3 files changed, 6 insertions(+), 5 deletions(-) diff --git a/net/ceph/crypto.c b/net/ceph/crypto.c index af14cb4..6e7a236 100644 --- a/net/ceph/crypto.c +++ b/net/ceph/crypto.c @@ -423,7 +423,8 @@ int ceph_encrypt2(struct ceph_crypto_key *secret, void *dst, size_t *dst_len, } } -int ceph_key_instantiate(struct key *key, struct key_preparsed_payload *prep) +static int ceph_key_instantiate(struct key *key, + struct key_preparsed_payload *prep) { struct ceph_crypto_key *ckey; size_t datalen = prep->datalen; @@ -458,12 +459,12 @@ err: return ret; } -int ceph_key_match(const struct key *key, const void *description) +static int ceph_key_match(const struct key *key, const void *description) { return strcmp(key->description, description) == 0; } -void ceph_key_destroy(struct key *key) { +static void ceph_key_destroy(struct key *key) { struct ceph_crypto_key *ckey = key->payload.data; ceph_crypto_key_destroy(ckey); diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 771d4c9..ed9e237 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -223,7 +223,7 @@ static void encode_my_addr(struct ceph_messenger *msgr) */ static struct workqueue_struct *ceph_msgr_wq; -void _ceph_msgr_exit(void) +static void _ceph_msgr_exit(void) { if (ceph_msgr_wq) { destroy_workqueue(ceph_msgr_wq); diff --git a/net/ceph/mon_client.c b/net/ceph/mon_client.c index 812eb3b..aef5b10 100644 --- a/net/ceph/mon_client.c +++ b/net/ceph/mon_client.c @@ -697,7 +697,7 @@ int ceph_monc_delete_snapid(struct ceph_mon_client *monc, u32 
pool, u64 snapid) { return do_poolop(monc, POOL_OP_CREATE_UNMANAGED_SNAP, - pool, snapid, 0, 0); + pool, snapid, NULL, 0); } -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
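The two kinds of fix in this patch are easy to show in miniature. The snippet below is a simplified illustration, not the ceph code: `helper()` and `do_op()` are hypothetical.

```c
#include <stddef.h>

/* Was: int helper(void) with external linkage — sparse warns
 * "symbol 'helper' was not declared. Should it be static?" */
static int helper(void)
{
    return 42;
}

/* Hypothetical op taking an optional pointer argument. */
static int do_op(int code, const char *name, size_t len)
{
    (void)name; (void)len;
    return code;
}

static int call_op(void)
{
    /* Was: do_op(7, 0, 0) — sparse warns "Using plain integer as
     * NULL pointer" for the pointer argument, so pass NULL instead. */
    return do_op(7, NULL, 0) + helper();
}
```

Neither change affects generated code; both simply make the declared intent (internal linkage, pointer-typed argument) explicit so sparse stops warning.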
[PATCH, v2] ceph: eliminate sparse warnings in fs code
Fix the causes for sparse warnings reported in the ceph file system code. Here there are only two (and they're sort of silly, but they're easy to fix).

This partially resolves:
    http://tracker.ceph.com/issues/4184

Reported-by: Fengguang Wu
Signed-off-by: Alex Elder
---
v2: rebased

 fs/ceph/xattr.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index 2135817..9b6b2b6 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -213,7 +213,7 @@ static struct ceph_vxattr ceph_dir_vxattrs[] = {
 	XATTR_NAME_CEPH(dir, rsubdirs),
 	XATTR_NAME_CEPH(dir, rbytes),
 	XATTR_NAME_CEPH(dir, rctime),
-	{ 0 }	/* Required table terminator */
+	{ .name = NULL, 0 }	/* Required table terminator */
 };
 static size_t ceph_dir_vxattrs_name_size;	/* total size of all names */
@@ -232,7 +232,7 @@ static struct ceph_vxattr ceph_file_vxattrs[] = {
 	XATTR_LAYOUT_FIELD(file, layout, stripe_count),
 	XATTR_LAYOUT_FIELD(file, layout, object_size),
 	XATTR_LAYOUT_FIELD(file, layout, pool),
-	{ 0 }	/* Required table terminator */
+	{ .name = NULL, 0 }	/* Required table terminator */
 };
 static size_t ceph_file_vxattrs_name_size;	/* total size of all names */
--
1.7.9.5
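The sentinel entry matters because callers walk these tables until they hit a NULL name. A small sketch (simplified structs, not the ceph ones) of why `{ .name = NULL, 0 }` is the required terminator:

```c
#include <stddef.h>

struct vxattr {
    const char *name;
    size_t name_size;
};

static const struct vxattr dir_vxattrs[] = {
    { "ceph.dir.entries", sizeof("ceph.dir.entries") },
    { "ceph.dir.files",   sizeof("ceph.dir.files") },
    { .name = NULL, 0 }   /* required table terminator */
};

static size_t vxattrs_name_size(const struct vxattr *table)
{
    size_t size = 0;
    const struct vxattr *vx;

    for (vx = table; vx->name; vx++)   /* iteration stops at the sentinel */
        size += vx->name_size;
    return size;
}
```

The designated initializer says explicitly that it is the NULL name (not some other zero field) that terminates iteration, which is the intent the bare `{ 0 }` left implicit.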
[PATCH, v2] rbd: eliminate sparse warnings
Fengguang Wu reminded me that there were outstanding sparse reports in the ceph and rbd code. This patch fixes these problems in rbd that lead to those reports: - Convert functions that are never referenced externally to have static scope. - Add a lockdep annotation to rbd_request_fn(), because it releases a lock before acquiring it again. This partially resolves: http://tracker.ceph.com/issues/4184 Reported-by: Fengguang Wu Signed-off-by: Alex Elder --- v2: rebased drivers/block/rbd.c | 12 +++- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index a9c86ca..c6b15d4 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -1141,7 +1141,7 @@ static bool obj_request_type_valid(enum obj_request_type type) } } -struct ceph_osd_req_op *rbd_osd_req_op_create(u16 opcode, ...) +static struct ceph_osd_req_op *rbd_osd_req_op_create(u16 opcode, ...) { struct ceph_osd_req_op *op; va_list args; @@ -1537,7 +1537,8 @@ static void rbd_obj_request_destroy(struct kref *kref) * that comprises the image request, and the Linux request pointer * (if there is one). 
*/ -struct rbd_img_request *rbd_img_request_create(struct rbd_device *rbd_dev, +static struct rbd_img_request *rbd_img_request_create( + struct rbd_device *rbd_dev, u64 offset, u64 length, bool write_request) { @@ -1971,6 +1972,7 @@ out: } static void rbd_request_fn(struct request_queue *q) + __releases(q->queue_lock) __acquires(q->queue_lock) { struct rbd_device *rbd_dev = q->queuedata; bool read_only = rbd_dev->mapping.read_only; @@ -2705,7 +2707,7 @@ static void rbd_spec_free(struct kref *kref) kfree(spec); } -struct rbd_device *rbd_dev_create(struct rbd_client *rbdc, +static struct rbd_device *rbd_dev_create(struct rbd_client *rbdc, struct rbd_spec *spec) { struct rbd_device *rbd_dev; @@ -4256,7 +4258,7 @@ static void rbd_sysfs_cleanup(void) device_unregister(&rbd_root_dev); } -int __init rbd_init(void) +static int __init rbd_init(void) { int rc; @@ -4272,7 +4274,7 @@ int __init rbd_init(void) return 0; } -void __exit rbd_exit(void) +static void __exit rbd_exit(void) { rbd_sysfs_cleanup(); } -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
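The lockdep/sparse annotation added to rbd_request_fn() documents an intentional lock imbalance. A userspace analogue of that pattern (a pthread mutex standing in for queue_lock; the names are illustrative, not the rbd code):

```c
#include <pthread.h>

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static int handled;

static void handle_request(void)
{
    handled++;                          /* stand-in for submitting I/O */
}

/* Called with queue_lock held; returns with queue_lock held.
 * In the kernel this imbalance is what the
 * __releases(q->queue_lock) __acquires(q->queue_lock)
 * annotations tell sparse is intentional. */
static void request_fn(void)
{
    pthread_mutex_unlock(&queue_lock);  /* drop lock around blocking work */
    handle_request();                   /* may sleep or allocate */
    pthread_mutex_lock(&queue_lock);    /* reacquire before returning */
}
```

Without the annotation, sparse sees a function that exits with a different lock balance than it entered with and warns about it on every build.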
Updated sparse warning message patches
I'm re-posting these patches because I've updated them to be based on the patches I just posted ("Four miscellaneous patches"). These patches are available in the branch "test/wip-4184" in the ceph-client git repository. That branch is based on branch "test/wip-4234,5,7,8".

(Here's the original description)

What follows are a few series of patches that get rid of code issues that lead to warnings from the sparse utility. The first three patches address the warnings in the rbd, ceph file system, and libceph code respectively. After that, one warning remains in libceph; it is addressed by a series of five patches that fix the underlying problem and also reorganize and clean up the surrounding code. The last two are meant to be combined into one before commit; they've been posted as two separate patches to make them easier to review.

					-Alex

[PATCH, v2] rbd: eliminate sparse warnings
[PATCH, v2] ceph: eliminate sparse warnings in fs code
[PATCH, v2] libceph: eliminate sparse warnings
[PATCH 1/5, v2] libceph: encapsulate connection backoff
[PATCH 2/5, v2] libceph: separate non-locked fault handling
[PATCH 3/5, v2] libceph: use a flag to indicate a fault has occurred
[PATCH 4/5, v2] libceph: use a do..while loop in con_work()
[PATCH 5/5, v2] libceph: indent properly
[PATCH] libceph: define connection flag helpers
Define and use functions that encapsulate operations performed on a connection's flags. This resolves: http://tracker.ceph.com/issues/4234 Signed-off-by: Alex Elder --- net/ceph/messenger.c | 107 -- 1 file changed, 78 insertions(+), 29 deletions(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 8a62a55..771d4c9 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -98,6 +98,57 @@ #define CON_FLAG_SOCK_CLOSED 3 /* socket state changed to closed */ #define CON_FLAG_BACKOFF 4 /* need to retry queuing delayed work */ +static bool con_flag_valid(unsigned long con_flag) +{ + switch (con_flag) { + case CON_FLAG_LOSSYTX: + case CON_FLAG_KEEPALIVE_PENDING: + case CON_FLAG_WRITE_PENDING: + case CON_FLAG_SOCK_CLOSED: + case CON_FLAG_BACKOFF: + return true; + default: + return false; + } +} + +static void con_flag_clear(struct ceph_connection *con, unsigned long con_flag) +{ + BUG_ON(!con_flag_valid(con_flag)); + + clear_bit(con_flag, &con->flags); +} + +static void con_flag_set(struct ceph_connection *con, unsigned long con_flag) +{ + BUG_ON(!con_flag_valid(con_flag)); + + set_bit(con_flag, &con->flags); +} + +static bool con_flag_test(struct ceph_connection *con, unsigned long con_flag) +{ + BUG_ON(!con_flag_valid(con_flag)); + + return test_bit(con_flag, &con->flags); +} + +static bool con_flag_test_and_clear(struct ceph_connection *con, + unsigned long con_flag) +{ + BUG_ON(!con_flag_valid(con_flag)); + + return test_and_clear_bit(con_flag, &con->flags); +} + +static bool con_flag_test_and_set(struct ceph_connection *con, + unsigned long con_flag) +{ + BUG_ON(!con_flag_valid(con_flag)); + + return test_and_set_bit(con_flag, &con->flags); +} + /* static tag bytes (protocol control messages) */ static char tag_msg = CEPH_MSGR_TAG_MSG; static char tag_ack = CEPH_MSGR_TAG_ACK; @@ -309,7 +360,7 @@ static void ceph_sock_write_space(struct sock *sk) * buffer. See net/ipv4/tcp_input.c:tcp_check_space() * and net/core/stream.c:sk_stream_write_space(). 
*/ - if (test_bit(CON_FLAG_WRITE_PENDING, &con->flags)) { + if (con_flag_test(con, CON_FLAG_WRITE_PENDING)) { if (sk_stream_wspace(sk) >= sk_stream_min_wspace(sk)) { dout("%s %p queueing write work\n", __func__, con); clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags); @@ -334,7 +385,7 @@ static void ceph_sock_state_change(struct sock *sk) case TCP_CLOSE_WAIT: dout("%s TCP_CLOSE_WAIT\n", __func__); con_sock_state_closing(con); - set_bit(CON_FLAG_SOCK_CLOSED, &con->flags); + con_flag_set(con, CON_FLAG_SOCK_CLOSED); queue_con(con); break; case TCP_ESTABLISHED: @@ -475,7 +526,7 @@ static int con_close_socket(struct ceph_connection *con) * received a socket close event before we had the chance to * shut the socket down. */ - clear_bit(CON_FLAG_SOCK_CLOSED, &con->flags); + con_flag_clear(con, CON_FLAG_SOCK_CLOSED); con_sock_state_closed(con); return rc; @@ -539,11 +590,10 @@ void ceph_con_close(struct ceph_connection *con) ceph_pr_addr(&con->peer_addr.in_addr)); con->state = CON_STATE_CLOSED; - clear_bit(CON_FLAG_LOSSYTX, &con->flags); /* so we retry next connect */ - clear_bit(CON_FLAG_KEEPALIVE_PENDING, &con->flags); - clear_bit(CON_FLAG_WRITE_PENDING, &con->flags); - clear_bit(CON_FLAG_KEEPALIVE_PENDING, &con->flags); - clear_bit(CON_FLAG_BACKOFF, &con->flags); + con_flag_clear(con, CON_FLAG_LOSSYTX); /* so we retry next connect */ + con_flag_clear(con, CON_FLAG_KEEPALIVE_PENDING); + con_flag_clear(con, CON_FLAG_WRITE_PENDING); + con_flag_clear(con, CON_FLAG_BACKOFF); reset_connection(con); con->peer_global_seq = 0; @@ -799,7 +849,7 @@ static void prepare_write_message(struct ceph_connection *con) /* no, queue up footer too and be done */ prepare_write_message_footer(con); - set_bit(CON_FLAG_WRITE_PENDING, &con->flags); + con_flag_set(con, CON_FLAG_WRITE_PENDING); } /* @@ -820,7 +870,7 @@ static void prepare_write_ack(struct ceph_connection *con) &con->out_temp_ack); con->out_more = 1; /* more will follow.. eventually.. 
*/ - set_bit(CON_FLAG_WRITE_PENDING, &con->flags); + con_flag_set(con, CON_FLAG_WRITE_PENDING); } /* @@ -831,7 +881,7 @@ static void prepare_write_keepalive(struct ceph_connection *con) dout("prepare_write_keepalive %p\n", con); con_out_kvec_reset(con); con_out_kvec_add(c
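What the con_flag_*() wrappers buy over open-coded bit calls is a single choke point that validates the flag. A userspace sketch (C11 atomics standing in for the kernel's set_bit()/test_and_clear_bit(); assert() standing in for BUG_ON()):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum {
    CON_FLAG_LOSSYTX = 0,
    CON_FLAG_KEEPALIVE_PENDING = 1,
    CON_FLAG_WRITE_PENDING = 2,
    CON_FLAG_SOCK_CLOSED = 3,
    CON_FLAG_BACKOFF = 4,
};

struct connection { atomic_ulong flags; };

static bool flag_valid(unsigned long f)
{
    return f <= CON_FLAG_BACKOFF;
}

static void con_flag_set(struct connection *con, unsigned long f)
{
    assert(flag_valid(f));              /* kernel: BUG_ON() */
    atomic_fetch_or(&con->flags, 1UL << f);
}

static bool con_flag_test_and_clear(struct connection *con, unsigned long f)
{
    assert(flag_valid(f));
    return atomic_fetch_and(&con->flags, ~(1UL << f)) & (1UL << f);
}
```

Every caller goes through the wrappers, so a typo'd or out-of-range flag bit is caught at the call site rather than silently corrupting an unrelated bit in con->flags; it also incidentally fixed the duplicated KEEPALIVE_PENDING clear in ceph_con_close().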
[PATCH] rbd: normalize dout() calls
Add dout() calls to facilitate tracing of image and object requests. Change a few existing calls so they use __func__ rather than the hard-coded function name. Have calls always add ":" after the name of the function, and prefix pointer values with a consistent tag indicating what it represents. (Note that there remain some older dout() calls that are left untouched by this patch.) Issue a warning if rbd_osd_write_callback() ever gets a short write. This resolves: http://tracker.ceph.com/issues/4235 Signed-off-by: Alex Elder --- drivers/block/rbd.c | 66 +++ 1 file changed, 61 insertions(+), 5 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index bd6078b..a9c86ca 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -443,7 +443,7 @@ static struct rbd_client *rbd_client_create(struct ceph_options *ceph_opts) struct rbd_client *rbdc; int ret = -ENOMEM; - dout("rbd_client_create\n"); + dout("%s:\n", __func__); rbdc = kmalloc(sizeof(struct rbd_client), GFP_KERNEL); if (!rbdc) goto out_opt; @@ -467,8 +467,8 @@ static struct rbd_client *rbd_client_create(struct ceph_options *ceph_opts) spin_unlock(&rbd_client_list_lock); mutex_unlock(&ctl_mutex); + dout("%s: rbdc %p\n", __func__, rbdc); - dout("rbd_client_create created %p\n", rbdc); return rbdc; out_err: @@ -479,6 +479,8 @@ out_mutex: out_opt: if (ceph_opts) ceph_destroy_options(ceph_opts); + dout("%s: error %d\n", __func__, ret); + return ERR_PTR(ret); } @@ -605,7 +607,7 @@ static void rbd_client_release(struct kref *kref) { struct rbd_client *rbdc = container_of(kref, struct rbd_client, kref); - dout("rbd_release_client %p\n", rbdc); + dout("%s: rbdc %p\n", __func__, rbdc); spin_lock(&rbd_client_list_lock); list_del(&rbdc->node); spin_unlock(&rbd_client_list_lock); @@ -1064,6 +1066,8 @@ out_err: static void rbd_obj_request_get(struct rbd_obj_request *obj_request) { + dout("%s: obj %p (was %d)\n", __func__, obj_request, + atomic_read(&obj_request->kref.refcount)); 
kref_get(&obj_request->kref); } @@ -1071,11 +1075,15 @@ static void rbd_obj_request_destroy(struct kref *kref); static void rbd_obj_request_put(struct rbd_obj_request *obj_request) { rbd_assert(obj_request != NULL); + dout("%s: obj %p (was %d)\n", __func__, obj_request, + atomic_read(&obj_request->kref.refcount)); kref_put(&obj_request->kref, rbd_obj_request_destroy); } static void rbd_img_request_get(struct rbd_img_request *img_request) { + dout("%s: img %p (was %d)\n", __func__, img_request, + atomic_read(&img_request->kref.refcount)); kref_get(&img_request->kref); } @@ -1083,6 +1091,8 @@ static void rbd_img_request_destroy(struct kref *kref); static void rbd_img_request_put(struct rbd_img_request *img_request) { rbd_assert(img_request != NULL); + dout("%s: img %p (was %d)\n", __func__, img_request, + atomic_read(&img_request->kref.refcount)); kref_put(&img_request->kref, rbd_img_request_destroy); } @@ -1097,6 +1107,8 @@ static inline void rbd_img_obj_request_add(struct rbd_img_request *img_request, rbd_assert(obj_request->which != BAD_WHICH); img_request->obj_request_count++; list_add_tail(&obj_request->links, &img_request->obj_requests); + dout("%s: img %p obj %p w=%u\n", __func__, img_request, obj_request, + obj_request->which); } static inline void rbd_img_obj_request_del(struct rbd_img_request *img_request, @@ -1104,6 +1116,8 @@ static inline void rbd_img_obj_request_del(struct rbd_img_request *img_request, { rbd_assert(obj_request->which != BAD_WHICH); + dout("%s: img %p obj %p w=%u\n", __func__, img_request, obj_request, + obj_request->which); list_del(&obj_request->links); rbd_assert(img_request->obj_request_count > 0); img_request->obj_request_count--; @@ -1200,11 +1214,14 @@ static void rbd_osd_req_op_destroy(struct ceph_osd_req_op *op) static int rbd_obj_request_submit(struct ceph_osd_client *osdc, struct rbd_obj_request *obj_request) { + dout("%s: osdc %p obj %p\n", __func__, osdc, obj_request); + return ceph_osdc_start_request(osdc, 
obj_request->osd_req, false); } static void rbd_img_request_complete(struct rbd_img_request *img_request) { + dout("%s: img %p\n", __func__, img_request); if (img_request->callback) img_request->callback(img_request); else @@ -1215,6 +1232,8 @@ static void rbd_img_request_complete(struct rbd_img_request *img_request) static int rbd_obj_request_wait(struct rbd_obj_request *obj_request) { + dout("%s: obj %p\n", __func__, obj_request); + ret
[PATCH] rbd: barriers are hard
Let's go shopping!

I'm afraid this may not have gotten it right:

    07741308 rbd: add barriers near done flag operations

The smp_wmb() should have been done *before* setting the done flag, to ensure all other data was valid before marking the object request done.

Switch to use atomic_inc_return() here to set the done flag, which allows us to verify we don't mark something done more than once. Doing this also implies general barriers before and after the call.

And although a read memory barrier might have been sufficient before reading the done flag, convert this to a full memory barrier just to put this issue to bed.

This resolves:
    http://tracker.ceph.com/issues/4238

Signed-off-by: Alex Elder
---
 drivers/block/rbd.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 3cc003b..bd6078b 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1226,13 +1226,22 @@ static void obj_request_done_init(struct rbd_obj_request *obj_request)
 static void obj_request_done_set(struct rbd_obj_request *obj_request)
 {
-	atomic_set(&obj_request->done, 1);
-	smp_wmb();
+	int done;
+
+	done = atomic_inc_return(&obj_request->done);
+	if (done > 1) {
+		struct rbd_img_request *img_request = obj_request->img_request;
+		struct rbd_device *rbd_dev;
+
+		rbd_dev = img_request ? img_request->rbd_dev : NULL;
+		rbd_warn(rbd_dev, "obj_request %p was already done\n",
+			obj_request);
+	}
 }
 
 static bool obj_request_done_test(struct rbd_obj_request *obj_request)
 {
-	smp_rmb();
+	smp_mb();
 	return atomic_read(&obj_request->done) != 0;
 }
--
1.7.9.5
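The same idea translates to C11 atomics: a sequentially-consistent read-modify-write both orders the surrounding accesses (the kernel's atomic_inc_return() likewise implies barriers on both sides) and returns a value that reveals a double "done" transition. A sketch (simplified struct, not the rbd code):

```c
#include <stdatomic.h>
#include <stdbool.h>

struct obj_request { atomic_int done; };

/* Returns true if this call was the first to mark the request done;
 * a false return corresponds to the rbd_warn() case in the patch. */
static bool request_done_set(struct obj_request *req)
{
    return atomic_fetch_add(&req->done, 1) == 0;
}

static bool request_done_test(struct obj_request *req)
{
    /* seq_cst load: full ordering, the analogue of smp_mb() + read */
    return atomic_load(&req->done) != 0;
}
```

Contrast with the buggy original: a plain store followed by a write barrier orders nothing *before* the flag becomes visible, whereas the single atomic RMW makes both the ordering and the "already done" check fall out of one operation.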
[PATCH] rbd: ignore zero-length requests
The old request code simply ignored zero-length requests. We should still operate that same way to avoid any changes in behavior. We can implement handling for special zero-length requests separately (see http://tracker.ceph.com/issues/4236).

Add some assertions based on this new constraint.

This resolves:
    http://tracker.ceph.com/issues/4237

Signed-off-by: Alex Elder
---
 drivers/block/rbd.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index b0eea3e..3cc003b 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1560,6 +1560,7 @@ static int rbd_img_request_fill_bio(struct rbd_img_request *img_request,
 	image_offset = img_request->offset;
 	rbd_assert(image_offset == bio_list->bi_sector << SECTOR_SHIFT);
 	resid = img_request->length;
+	rbd_assert(resid > 0);
 	while (resid) {
 		const char *object_name;
 		unsigned int clone_size;
@@ -1627,8 +1628,10 @@ static void rbd_img_obj_callback(struct rbd_obj_request *obj_request)
 	bool more = true;
 
 	img_request = obj_request->img_request;
+	rbd_assert(img_request != NULL);
 	rbd_assert(img_request->rq != NULL);
+	rbd_assert(img_request->obj_request_count > 0);
 	rbd_assert(which != BAD_WHICH);
 	rbd_assert(which < img_request->obj_request_count);
 	rbd_assert(which >= img_request->next_completion);
@@ -1918,6 +1921,19 @@ static void rbd_request_fn(struct request_queue *q)
 		/* Ignore any non-FS requests that filter through. */
 
 		if (rq->cmd_type != REQ_TYPE_FS) {
+			dout("%s: non-fs request type %d\n", __func__,
+				(int) rq->cmd_type);
+			__blk_end_request_all(rq, 0);
+			continue;
+		}
+
+		/* Ignore/skip any zero-length requests */
+
+		offset = (u64) blk_rq_pos(rq) << SECTOR_SHIFT;
+		length = (u64) blk_rq_bytes(rq);
+
+		if (!length) {
+			dout("%s: zero-length request\n", __func__);
 			__blk_end_request_all(rq, 0);
 			continue;
 		}
@@ -1947,9 +1963,6 @@ static void rbd_request_fn(struct request_queue *q)
 			goto end_request;
 		}
 
-		offset = (u64) blk_rq_pos(rq) << SECTOR_SHIFT;
-		length = (u64) blk_rq_bytes(rq);
-
 		result = -EINVAL;
 		if (WARN_ON(offset && length > U64_MAX - offset + 1))
 			goto end_request;	/* Shouldn't happen */
--
1.7.9.5
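The filtering order the patch establishes can be condensed into one predicate. This is a sketch with hypothetical types (the real function walks a struct request_queue and ends requests with __blk_end_request_all()):

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SHIFT 9

struct request {
    int is_fs;          /* stand-in for rq->cmd_type == REQ_TYPE_FS */
    uint64_t pos;       /* starting position, in 512-byte sectors */
    uint64_t bytes;     /* total length in bytes */
};

/* Returns true if the request should be completed immediately,
 * before any image I/O is built: non-FS requests and zero-length
 * requests are both ended early, the same way. */
static bool should_skip(const struct request *rq)
{
    uint64_t offset = rq->pos << SECTOR_SHIFT;
    uint64_t length = rq->bytes;

    (void)offset;       /* kept for the later overflow check */
    return !rq->is_fs || length == 0;
}
```

Computing offset/length up front (rather than after the snapshot checks, as before) is what lets the zero-length case short-circuit in the same place as the non-FS case.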
Four miscellaneous patches
The following patches address some issues that were found while investigating a kernel rbd client performance issue this week.

These patches are available in the branch "test/wip-4234,5,7,8" in the ceph-client git repository.

					-Alex

[PATCH] rbd: ignore zero-length requests
[PATCH] rbd: barriers are hard
[PATCH] rbd: normalize dout() calls
[PATCH] libceph: define connection flag helpers
Re: OSD memory leaks?
Hi all,

I finally got a core dump. I did it with a kill -SEGV on the OSD process.

https://www.dropbox.com/s/ahv6hm0ipnak5rf/core-ceph-osd-11-0-0-20100-1361539008

Hope we will get something out of it :-).

--
Regards,
Sébastien Han.

On Fri, Jan 11, 2013 at 7:13 PM, Gregory Farnum wrote:
> On Fri, Jan 11, 2013 at 6:57 AM, Sébastien Han wrote:
>>> Is osd.1 using the heap profiler as well? Keep in mind that active use
>>> of the memory profiler will itself cause memory usage to increase —
>>> this sounds a bit like that to me since it's staying stable at a large
>>> but finite portion of total memory.
>>
>> Well, the memory consumption was already high before the profiler was
>> started. So yes, with the memory profiler enabled an OSD might consume
>> more memory, but this doesn't cause the memory leaks.
>
> My concern is that maybe you saw a leak but when you restarted with
> the memory profiling you lost whatever conditions caused it.
>
>> Any ideas? Nothing to say about my scrubbing theory?
>
> I like it, but Sam indicates that without some heap dumps which
> capture the actual leak then scrub is too large to effectively code
> review for leaks. :(
> -Greg