Re: Fwd: how io works when backfill

2015-12-28 Thread Dong Wu
If osd.7 is added and becomes the primary, pg1.0 [1, 2, 3] --> pg1.0
[7, 2, 3], is that similar to the example above?
Is a pg_temp entry still installed that maps the PG back to [1, 2, 3], with
backfill then going to osd.7 and normal writes going to [1, 2, 3], and is
IO to the portion of the PG that has already been backfilled also sent
to osd.7?

How about these examples of removing an OSD:
- pg1.0 [1, 2, 3]
- osd.3 goes down and is removed
- the mapping changes to [1, 2, 5], but osd.5 has no data, so a pg_temp
entry is installed mapping the PG back to [1, 2], then backfill to osd.5 starts
- normal writes go to [1, 2]; if a write hits an object that has already
been backfilled to osd.5, it is also sent to osd.5
- when backfill completes, the pg_temp entry is removed and the mapping
changes back to [1, 2, 5]


Another example:
- pg1.0 [1, 2, 3]
- osd.3 goes down and is removed
- the mapping changes to [5, 1, 2], but osd.5 has no data for the PG, so a
pg_temp entry is installed mapping the PG back to [1, 2], which makes osd.1
the temporary primary, then backfill to osd.5 starts
- normal writes go to [1, 2]; if a write hits an object that has already
been backfilled to osd.5, it is also sent to osd.5
- when backfill completes, the pg_temp entry is removed and the mapping
changes back to [5, 1, 2]

Is my analysis right?

2015-12-29 1:30 GMT+08:00 Sage Weil <s...@newdream.net>:
> On Mon, 28 Dec 2015, Zhiqiang Wang wrote:
>> 2015-12-27 20:48 GMT+08:00 Dong Wu <archer.wud...@gmail.com>:
>> > Hi,
>> > When an OSD is added or removed, Ceph backfills to rebalance data.
>> > e.g.:
>> > - pg1.0 [1, 2, 3]
>> > - add an OSD (e.g. osd.7)
>> > - Ceph starts backfill, then the pg1.0 OSD set changes to [1, 2, 7]
>> > - if [a, b, c, d, e] are the objects that need to be backfilled to
>> > osd.7, and object a is currently being backfilled
>> > - when a write IO hits object a, the IO has to wait for the backfill
>> > of a to complete, then proceeds.
>> > - but if the IO hits object b, which has not been backfilled yet, the
>> > IO reaches osd.1, then osd.1 sends the IO to osd.2 and osd.7, but osd.7
>> > does not have object b, so osd.7 has to wait for object b to be
>> > backfilled and then write. Is that right? Or does osd.1 send the IO
>> > only to osd.2, not both?
>>
>> I think in this case, when the write of object b reaches osd.1, it
>> holds the client write, raises the priority of the recovery of object
>> b, and kicks off its recovery. When the recovery of object b is
>> done, it requeues the client write, and then everything goes on as
>> usual.
>
> It's more complicated than that.  In a normal (log-based) recovery
> situation, it is something like the above: if the acting set is [1,2,3]
> but 3 is missing the latest copy of A, a write to A will block on the
> primary while the primary initiates recovery of A immediately.  Once that
> completes the IO will continue.
>
> For backfill, it's different.  In your example, you start with [1,2,3]
> then add in osd.7.  The OSD will see that 7 has no data for the PG and
> install a pg_temp entry mapping the PG back to [1,2,3] temporarily.  Then
> things will proceed normally while backfill happens to 7.  Backfill won't
> interfere with normal IO at all, except that IO to the portion of the PG
> that has already been backfilled will also be sent to the backfill target
> (7) so that it stays up to date.  Once it completes, the pg_temp entry is
> removed and the mapping changes back to [1,2,7].  Then osd.3 is allowed to
> remove its copy of the PG.
>
> sage
>
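
To make the backfill case in Sage's reply concrete, here is a minimal
Python sketch of the write fan-out for his [1,2,3] -> [1,2,7] example. It
is a toy model, not Ceph code: the class BackfillingPG, its fields, and the
idea of tracking progress as a single sorted-name cursor (last_backfill)
are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BackfillingPG:
    """Toy model of one PG while a new OSD is being backfilled (hypothetical)."""
    up: List[int]                        # mapping after the change, e.g. [1, 2, 7]
    acting: List[int]                    # pg_temp mapping serving IO, e.g. [1, 2, 3]
    backfill_target: Optional[int]       # OSD being filled, e.g. 7
    objects: List[str]                   # objects in the PG, backfilled in sorted order
    last_backfill: Optional[str] = None  # highest object name copied so far

    def write_targets(self, obj: str) -> List[int]:
        """OSDs that receive a client write to obj."""
        targets = list(self.acting)
        # Writes to the already-backfilled portion are also sent to the
        # backfill target so that it stays up to date.
        if (self.backfill_target is not None
                and self.last_backfill is not None
                and obj <= self.last_backfill):
            targets.append(self.backfill_target)
        return targets

    def backfill_next(self) -> Optional[str]:
        """Copy the next object to the target; return its name, or None if done."""
        remaining = sorted(o for o in self.objects
                           if self.last_backfill is None or o > self.last_backfill)
        if not remaining:
            return None
        self.last_backfill = remaining[0]
        return self.last_backfill

    def try_finish(self) -> bool:
        """Drop the pg_temp entry once every object has been copied."""
        if any(self.last_backfill is None or o > self.last_backfill
               for o in self.objects):
            return False
        self.acting = list(self.up)      # mapping changes back to the up set
        self.backfill_target = None
        return True


pg = BackfillingPG(up=[1, 2, 7], acting=[1, 2, 3], backfill_target=7,
                   objects=["a", "b", "c", "d", "e"])
pg.backfill_next()                       # object a is now on osd.7
print(pg.write_targets("a"))             # [1, 2, 3, 7]
print(pg.write_targets("b"))             # [1, 2, 3]
while pg.backfill_next() is not None:
    pass
pg.try_finish()
print(pg.acting)                         # [1, 2, 7]
```

In this model a write to object b never waits on osd.7: it goes only to the
pg_temp acting set until the cursor has passed b.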
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


how io works when backfill

2015-12-27 Thread Dong Wu
Hi,
When an OSD is added or removed, Ceph backfills to rebalance data.
e.g.:
- pg1.0 [1, 2, 3]
- add an OSD (e.g. osd.7)
- Ceph starts backfill, then the pg1.0 OSD set changes to [1, 2, 7]
- if [a, b, c, d, e] are the objects that need to be backfilled to
osd.7, and object a is currently being backfilled
- when a write IO hits object a, the IO has to wait for the backfill
of a to complete, then proceeds.
- but if the IO hits object b, which has not been backfilled yet, the
IO reaches osd.1, then osd.1 sends the IO to osd.2 and osd.7, but osd.7
does not have object b, so osd.7 has to wait for object b to be
backfilled and then write. Is that right? Or does osd.1 send the IO
only to osd.2, not both?


Re: [ceph-users] why not add (offset,len) to pglog

2015-12-25 Thread Dong Wu
Thank you for your reply. I am looking forward to Sage's opinion too @sage.
I'll also keep following the progress of BlueStore and KStore.

Regards

2015-12-25 14:48 GMT+08:00 Ning Yao <zay11...@gmail.com>:
> Hi, Dong Wu,
>
> 1. As I am currently working on other things, this proposal has been
> abandoned for a long time.
> 2. This is a complicated task, as we need to consider a lot of cases
> (not just write ops, but also truncate and delete) and also need to
> consider the different effects on different backends (Replicated, EC).
> 3. I don't think it is a good time to redo this patch now, since
> BlueStore and KStore are in progress, and I'm afraid of introducing
> side effects.  We may prepare and propose the whole design at the next CDS.
> 4. Currently, we already have some tricks to deal with recovery (like
> throttling the max recovery ops, setting the priority for recovery, and so
> on). So this kind of patch may not solve the critical problem but just
> make things better, and I am not quite sure that it will really
> bring a big improvement. Based on my previous tests, it works
> excellently on slow disks (say HDD), and also for short maintenance
> windows. Otherwise, the backfill process is triggered.  So let's wait
> for Sage's opinion @sage
>
> If you are interested in this, we could cooperate on it.
>
> Regards
> Ning Yao
>
>
> 2015-12-25 14:23 GMT+08:00 Dong Wu <archer.wud...@gmail.com>:
>> Thanks. From this pull request I see that this work is not
>> complete; is there any new progress on it?
>>
>> 2015-12-25 12:30 GMT+08:00 Xinze Chi (信泽) <xmdx...@gmail.com>:
>>> Yeah, this is a good idea for recovery, but not for backfill.
>>> @YaoNing opened a pull request about this earlier this year:
>>> https://github.com/ceph/ceph/pull/3837
>>>
>>> 2015-12-25 11:16 GMT+08:00 Dong Wu <archer.wud...@gmail.com>:
>>>> Hi,
>>>> I have a question about the pglog. The pglog contains (op, object,
>>>> version) etc. When peering, the pglog is used to construct the missing
>>>> list, and then the whole object in the missing list is recovered, even
>>>> if the data that differs among replicas is less than a whole object
>>>> (e.g. 4MB).
>>>> Why not add (offset, len) to the pglog? Then the missing list could
>>>> contain (object, offset, len), and we could reduce the amount of data
>>>> recovered.
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-us...@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Xinze Chi


Re: [ceph-users] why not add (offset,len) to pglog

2015-12-24 Thread Dong Wu
Thanks. From this pull request I see that this work is not
complete; is there any new progress on it?

2015-12-25 12:30 GMT+08:00 Xinze Chi (信泽) <xmdx...@gmail.com>:
> Yeah, this is a good idea for recovery, but not for backfill.
> @YaoNing opened a pull request about this earlier this year:
> https://github.com/ceph/ceph/pull/3837
>
> 2015-12-25 11:16 GMT+08:00 Dong Wu <archer.wud...@gmail.com>:
>> Hi,
>> I have a question about the pglog. The pglog contains (op, object,
>> version) etc. When peering, the pglog is used to construct the missing
>> list, and then the whole object in the missing list is recovered, even
>> if the data that differs among replicas is less than a whole object
>> (e.g. 4MB).
>> Why not add (offset, len) to the pglog? Then the missing list could
>> contain (object, offset, len), and we could reduce the amount of data
>> recovered.
>
>
>
> --
> Regards,
> Xinze Chi


why not add (offset,len) to pglog

2015-12-24 Thread Dong Wu
Hi,
I have a question about the pglog. The pglog contains (op, object,
version) etc. When peering, the pglog is used to construct the missing
list, and then the whole object in the missing list is recovered, even
if the data that differs among replicas is less than a whole object
(e.g. 4MB).
Why not add (offset, len) to the pglog? Then the missing list could
contain (object, offset, len), and we could reduce the amount of data
recovered.
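
To illustrate the idea, here is a rough Python sketch of what a missing
list built from log entries carrying (offset, len) might look like. This is
only a model of the proposal, not how Ceph's pglog is implemented: LogEntry,
build_missing, merge_extents and bytes_to_recover are hypothetical names,
and the 4MB object size is taken from the example above. Any op other than
a plain write (truncate, delete, ...) falls back to whole-object recovery
here, since those cases need separate treatment.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

OBJECT_SIZE = 4 * 1024 * 1024  # 4MB, as in the example above

Extent = Tuple[int, int]       # (offset, len)


@dataclass
class LogEntry:
    """Hypothetical pglog entry: (op, object, version) plus a proposed extent."""
    version: int
    op: str                          # "write", "truncate", "delete", ...
    obj: str
    extent: Optional[Extent] = None  # None means the dirty range is unknown


def merge_extents(extents: List[Extent]) -> List[Extent]:
    """Merge overlapping or adjacent (offset, len) pairs."""
    merged: List[Extent] = []
    for off, length in sorted(extents):
        if merged and off <= merged[-1][0] + merged[-1][1]:
            last_off, last_len = merged[-1]
            end = max(last_off + last_len, off + length)
            merged[-1] = (last_off, end - last_off)
        else:
            merged.append((off, length))
    return merged


def build_missing(log: List[LogEntry],
                  replica_version: int) -> Dict[str, Optional[List[Extent]]]:
    """Map each object the replica missed to its dirty extents.

    A value of None means the whole object must be recovered.
    """
    missing: Dict[str, Optional[List[Extent]]] = {}
    for entry in log:
        if entry.version <= replica_version:
            continue                     # replica already has this update
        if entry.op != "write" or entry.extent is None:
            missing[entry.obj] = None    # fall back to whole-object recovery
        elif entry.obj not in missing:
            missing[entry.obj] = [entry.extent]
        elif missing[entry.obj] is not None:
            missing[entry.obj].append(entry.extent)
    return {obj: None if ext is None else merge_extents(ext)
            for obj, ext in missing.items()}


def bytes_to_recover(missing: Dict[str, Optional[List[Extent]]]) -> int:
    """Bytes to copy, counting a full object wherever the extent is unknown."""
    return sum(OBJECT_SIZE if ext is None else sum(length for _, length in ext)
               for ext in missing.values())


log = [
    LogEntry(10, "write", "a", (0, 4096)),
    LogEntry(11, "write", "a", (4096, 4096)),
    LogEntry(12, "write", "b", (8192, 512)),
    LogEntry(13, "truncate", "c"),
]
missing = build_missing(log, replica_version=10)
print(missing)                              # {'a': [(4096, 4096)], 'b': [(8192, 512)], 'c': None}
print(bytes_to_recover(missing))            # 4198912
print(len(missing) * OBJECT_SIZE)           # 12582912 with whole-object recovery
```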


[no subject]

2015-11-22 Thread Dong Wu
subscribe ceph-devel