At Fri, 12 Sep 2014 12:31:02 +0800, =?utf-8?B?QmluZ3BlbmcgWmh1?= wrote: > > [1 <multipart/alternative (8bit)>] > [1.1 <text/plain; utf-8 (base64)>] > Hi, Hitoshi: > I think you are right. But maybe this bug is hard to trigger. Here is my > analysis: > If md.nr_disks changes from 1 to 0, kick_recover() is not called > immediately. In most > case, there are more than one io thread. If other io thread access the > corrupted directory, > it will get SD_RES_EIO since md.nr_disks is 0 (See md_handle_eio() > function). Then, main > thread will call leave_cluster() (See io_op_done() function). > I think maybe we can change the code like this: > > > if (disk) { > if (nr > 0) { > update_node_disks(); > kick_recover(); > } else { > leave_cluster(); > } > } > > > ��> How do you think?
Thanks for your analysis! Your solution seems to be good. Could you send a patch? Thanks, Hitoshi > > > Thanks! > Bingpeng > > > ------------------ 原始邮件 ------------------ > 发件人: "Bingpeng Zhu";<nku...@foxmail.com>; > 发送时间: 2014年9月11日(星期四) 下午2:37 > 收件人: "Hitoshi Mitake"<mitake.hito...@lab.ntt.co.jp>; > > 主题: Re: [sheepdog] [PATCH v1] sheep/md: fix dogcluster info command show > wrong disk number > > > > I see. But it works OK in my experimental environment: > > > #dog cluster info > Cluster status: running, auto-recovery enabled > Cluster created at Thu Sep 11 14:07:48 2014 > > > Epoch Time Version > 2014-09-11 14:26:51 5 [127.0.0.1:7000(3), 127.0.0.1:7001(3), > 127.0.0.1:7002(3), 127.0.0.1:7003(3), 127.0.0.1:7004(3), 127.0.0.1:7005(3), > 127.0.0.1:7007(3)] > 2014-09-11 14:26:51 4 [127.0.0.1:7000(3), 127.0.0.1:7001(3), > 127.0.0.1:7002(3), 127.0.0.1:7003(3), 127.0.0.1:7004(3), 127.0.0.1:7005(3), > 127.0.0.1:7006(0), 127.0.0.1:7007(3)] > 2014-09-11 14:26:27 3 [127.0.0.1:7000(3), 127.0.0.1:7001(3), > 127.0.0.1:7002(3), 127.0.0.1:7003(3), 127.0.0.1:7004(3), 127.0.0.1:7005(3), > 127.0.0.1:7006(1), 127.0.0.1:7007(3)] > 2014-09-11 14:26:20 2 [127.0.0.1:7000(3), 127.0.0.1:7001(3), > 127.0.0.1:7002(3), 127.0.0.1:7003(3), 127.0.0.1:7004(3), 127.0.0.1:7005(3), > 127.0.0.1:7006(2), 127.0.0.1:7007(3)] > 2014-09-11 14:07:49 1 [127.0.0.1:7000(3), 127.0.0.1:7001(3), > 127.0.0.1:7002(3), 127.0.0.1:7003(3), 127.0.0.1:7004(3), 127.0.0.1:7005(3), > 127.0.0.1:7006(3), 127.0.0.1:7007(3)] > > > > Seems it will notify other nodes even md.nr_disks equals zero. I'll read the > code carefully to figure it out. > > > Thanks. > > > ------------------ 原始邮件 ------------------ > 发件人: "Hitoshi Mitake";<mitake.hito...@lab.ntt.co.jp>; > 发送时间: 2014年9月11日(星期四) 中午12:55 > 收件人: "Bingpeng Zhu"<nku...@foxmail.com>; > 抄送: "Hitoshi Mitake"<mitake.hito...@lab.ntt.co.jp>; > "sheepdog"<sheepdog@lists.wpkg.org>; "Yu Fang"<bingpeng....@alibaba-inc.com>; > 主题: Re: [sheepdog] [PATCH v1] sheep/md: fix dogcluster info command show > wrong disk number > > > > At Thu, 11 Sep 2014 10:53:42 +0800, > =?utf-8?B?QmluZ3BlbmcgWmh1?= wrote: > > > > [1 <multipart/alternative (8bit)>] > > [1.1 <text/plain; utf-8 (base64)>] > > If md.nr_disks changes from 1 to 0, the node is pure gateway then. Seems it > > doesn't need > > to do any recovery work since no object will locate in the pure gateway. > > But other nodes must execute recovery because redundancy is decreased, no? > > Thanks, > Hitoshi > > > > > > > > > > > > > ------------------ Original ------------------ > > From: "Hitoshi Mitake";<mitake.hito...@lab.ntt.co.jp>; > > Date: Thu, Sep 11, 2014 10:34 AM > > To: "Bingpeng Zhu"<nku...@foxmail.com>; > > Cc: "sheepdog"<sheepdog@lists.wpkg.org>; "Bingpeng > > Zhu"<bingpeng....@alibaba-inc.com>; > > Subject: Re: [sheepdog] [PATCH v1] sheep/md: fix dog cluster info command > > show wrong disk number > > > > > > > > At Wed, 10 Sep 2014 20:08:32 +0800, > > Bingpeng Zhu wrote: > > > > > > If we enable diskvnodes mode and encounter EIO, running dog > > > cluster info will show wrong disk number in the new epoch. > > > We should update node disks when handling EIO and unplug the > > > disk. > > > > > > Signed-off-by: Bingpeng Zhu <bingpeng....@alibaba-inc.com> > > > --- > > > sheep/md.c | 3 +++ > > > 1 files changed, 3 insertions(+), 0 deletions(-) > > > > > > diff --git a/sheep/md.c b/sheep/md.c > > > index b13d30a..378d1f1 100644 > > > --- a/sheep/md.c > > > +++ b/sheep/md.c > > > @@ -541,6 +541,9 @@ static void md_do_recover(struct work *work) > > > out: > > > sd_rw_unlock(&md.lock); > > > > > > + if (disk) > > > + update_node_disks(); > > > + > > > > The change looks reasonable. But I have a question: > > > > > if (nr > 0) > > > kick_recover(); > > > > Does the above conditional branch (nr > 0) required? When a node has > > single disk and faces EIO, md.nr_disks can change from 1 to 0. In such > > a case, the code will not call kick_recover(). I think it is a > > problematic behavior. How do you think? > > > > Thanks, > > Hitoshi > > > > > > > > -- > > > 1.7.1 > > > > > > > > > > > > > > > -- > > > sheepdog mailing list > > > sheepdog@lists.wpkg.org > > > http://lists.wpkg.org/mailman/listinfo/sheepdog > > [1.2 <text/html; utf-8 (base64)>] > > > > [2 <text/plain; us-ascii (7bit)>] > > -- > > sheepdog mailing list > > sheepdog@lists.wpkg.org > > http://lists.wpkg.org/mailman/listinfo/sheepdog > [1.2 <text/html; utf-8 (base64)>] > > [2 <text/plain; us-ascii (7bit)>] > -- > sheepdog mailing list > sheepdog@lists.wpkg.org > http://lists.wpkg.org/mailman/listinfo/sheepdog
-- sheepdog mailing list sheepdog@lists.wpkg.org http://lists.wpkg.org/mailman/listinfo/sheepdog