Re: Help on ext4/xattr linux kernel stability issue / ceph xattr use?

2015-11-09 Thread Laurent GUERBY
On Mon, 2015-11-09 at 05:24 -0800, Sage Weil wrote:
> The above is all correct.  The mbcache (didn't know that existed!) is 
> definitely not going to be useful here.
> > Also I think it is necessary to warn ceph users to avoid ext4 at all
> > costs until this kernel/ceph issue is sorted out: we went from
> > relatively stable production for more than a year to crashes everywhere
> > all the time since two weeks ago, probably after hitting some magic
> > limit. We migrated our machines to ubuntu trusty, our SSD based
> > filesystem to XFS but our HDD are still mostly on ext4 (60 TB
> > of data to move so not that easy...).
> 
> Was there a ceph upgrade in there somewhere?  The size of the user.ceph._ 
> xattr has increased over time, and (somewhat) recently crossed the 255 
> byte threshold (on average) which also triggered a performance regression 
> on XFS...


Hi Sage,

Thanks for the confirmation.

The history of our cluster is:
- initial cluster on ceph 0.80.7 (September 2014),
debian ext4 since xfs and btrfs were crashing on debian/ceph
- upgraded to 0.87 (December 2014)
- upgraded to 0.94.2 (June 2015)
- on October 26 2015 we got two disk failures in one night; we replaced
the disks but started to have random machine freezes during
and after the recovery. We upgraded to 0.94.5 to be able to restart
two of our OSDs due to:
http://tracker.ceph.com/issues/13594
- after changing various hardware parts and adding a new machine,
we started to suspect ceph/ext4, so we migrated all
our machines to ubuntu trusty and all SSDs to XFS, leaving
60 TB of data on rotational ext4 (too long to migrate)

During the whole time the cluster and data kept expanding,
from 4 machines and 2 TB to 11 machines and 60 TB of data now
(~75% full).

I have lightly tested a rebuild of the ubuntu trusty 3.19
kernel with the ext4 mbcache code removed, patch here:
https://bugzilla.kernel.org/show_bug.cgi?id=107301#c6

But now we have to decide whether to go live with it.

Sincerely,

Laurent

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] qemu : rbd block driver internal snapshot and vm_stop is hanging forever

2015-11-09 Thread Stefan Priebe - Profihost AG

> - Original Message -
>> From: "Alexandre DERUMIER" 
>> To: "ceph-devel" 
>> Cc: "qemu-devel" , jdur...@redhat.com
>> Sent: Monday, November 9, 2015 5:48:45 AM
>> Subject: Re: [Qemu-devel] qemu : rbd block driver internal snapshot and 
>> vm_stop is hanging forever
>>
>> adding to ceph.conf
>>
>> [client]
>> rbd_non_blocking_aio = false
>>
>>
>> fixes the problem for me (with rbd_cache=false)
>>
>>
>> (@cc jdur...@redhat.com)

+1, same for me.

Stefan

>>
>>
>>
>> - Mail original -
>> De: "Denis V. Lunev" 
>> À: "aderumier" , "ceph-devel"
>> , "qemu-devel" 
>> Envoyé: Lundi 9 Novembre 2015 08:22:34
>> Objet: Re: [Qemu-devel] qemu : rbd block driver internal snapshot and vm_stop
>> is hanging forever
>>
>> On 11/09/2015 10:19 AM, Denis V. Lunev wrote:
>>> On 11/09/2015 06:10 AM, Alexandre DERUMIER wrote:
 Hi,

 with qemu (2.4.1), if I do an internal snapshot of an rbd device,
 then I pause the vm with vm_stop,

 the qemu process is hanging forever


 monitor commands to reproduce:


 # snapshot_blkdev_internal drive-virtio0 yoursnapname
 # stop




 I don't see this with qcow2 or sheepdog block driver for example.


 Regards,

 Alexandre

>>> this could look like the problem I have recently been trying to
>>> fix with dataplane enabled. Patch series is named as
>>>
>>> [PATCH for 2.5 v6 0/10] dataplane snapshot fixes
>>>
>>> Den
>>
>> anyway, even if the above does not help, can you collect gdb
>> traces from all threads in the QEMU process? Maybe I'll be
>> able to give a hint.
>>
>> Den
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


make check bot resumed

2015-11-09 Thread Loic Dachary
Hi,

The machine sending notifications for the make check bot failed during the
weekend. It was rebooted and should resume its work.

The virtual machine was actually rebuilt because the underlying OpenStack
cloud was unable to find the volume used for root after a hard reboot. There
were also issues with the devicemapper docker backend, which was corrupted.
Wiping it out was enough to resolve the problem: it did not hold any
persistent data anyway.

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: [Qemu-devel] qemu : rbd block driver internal snapshot and vm_stop is hanging forever

2015-11-09 Thread Alexandre DERUMIER
>>Can you reproduce with Ceph debug logging enabled (i.e. debug rbd=20 in your 
>>ceph.conf)?  If you could attach the log to the Ceph tracker ticket I opened 
>>[1], that would be very helpful.
>>
>>[1] http://tracker.ceph.com/issues/13726

Yes, I'm able to reproduce it 100%; I have attached the log to the tracker.

Alexandre

- Mail original -
De: "Jason Dillaman" 
À: "aderumier" 
Cc: "ceph-devel" , "qemu-devel" 
, jdur...@redhat.com
Envoyé: Lundi 9 Novembre 2015 14:42:42
Objet: Re: [Qemu-devel] qemu : rbd block driver internal snapshot and vm_stop 
is hanging forever

Can you reproduce with Ceph debug logging enabled (i.e. debug rbd=20 in your 
ceph.conf)? If you could attach the log to the Ceph tracker ticket I opened 
[1], that would be very helpful. 

[1] http://tracker.ceph.com/issues/13726 

Thanks, 
Jason 


- Original Message - 
> From: "Alexandre DERUMIER"  
> To: "ceph-devel"  
> Cc: "qemu-devel" , jdur...@redhat.com 
> Sent: Monday, November 9, 2015 5:48:45 AM 
> Subject: Re: [Qemu-devel] qemu : rbd block driver internal snapshot and 
> vm_stop is hanging forever 
> 
> adding to ceph.conf 
> 
> [client] 
> rbd_non_blocking_aio = false 
> 
> 
> fixes the problem for me (with rbd_cache=false) 
> 
> 
> (@cc jdur...@redhat.com) 
> 
> 
> 
> - Mail original - 
> De: "Denis V. Lunev"  
> À: "aderumier" , "ceph-devel" 
> , "qemu-devel"  
> Envoyé: Lundi 9 Novembre 2015 08:22:34 
> Objet: Re: [Qemu-devel] qemu : rbd block driver internal snapshot and vm_stop 
> is hanging forever 
> 
> On 11/09/2015 10:19 AM, Denis V. Lunev wrote: 
> > On 11/09/2015 06:10 AM, Alexandre DERUMIER wrote: 
> >> Hi, 
> >> 
> >> with qemu (2.4.1), if I do an internal snapshot of an rbd device, 
> >> then I pause the vm with vm_stop, 
> >> 
> >> the qemu process is hanging forever 
> >> 
> >> 
> >> monitor commands to reproduce: 
> >> 
> >> 
> >> # snapshot_blkdev_internal drive-virtio0 yoursnapname 
> >> # stop 
> >> 
> >> 
> >> 
> >> 
> >> I don't see this with qcow2 or sheepdog block driver for example. 
> >> 
> >> 
> >> Regards, 
> >> 
> >> Alexandre 
> >> 
> > this could look like the problem I have recently been trying to 
> > fix with dataplane enabled. Patch series is named as 
> > 
> > [PATCH for 2.5 v6 0/10] dataplane snapshot fixes 
> > 
> > Den 
> 
> anyway, even if the above does not help, can you collect gdb 
> traces from all threads in the QEMU process? Maybe I'll be 
> able to give a hint. 
> 
> Den 
> -- 
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
> the body of a message to majord...@vger.kernel.org 
> More majordomo info at http://vger.kernel.org/majordomo-info.html 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ceph encoding optimization

2015-11-09 Thread Gregory Farnum
On Wed, Nov 4, 2015 at 7:07 AM, Gregory Farnum  wrote:
> The problem with this approach is that the encoded versions need to be
> platform-independent — they are shared over the wire and written to
> disks that might get transplanted to different machines. Apart from
> padding bytes, we also need to worry about endianness of the machine,
> etc. *And* we often mutate structures across versions in order to add
> new abilities, relying on the encode-decode process to deal with any
> changes to the system. How could we deal with that if just dumping the
> raw memory?
>
> Now, maybe we could make these changes on some carefully-selected
> structs, I'm not sure. But we'd need a way to pick them out, guarantee
> that we aren't breaking interoperability concerns, etc; and it would
> need to be something we can maintain as a group going forward. I'm not
> sure how to satisfy those constraints without burning a little extra
> CPU. :/
> -Greg

So it turns out we've actually had issues with this. Sage merged
(wrote?) some little-endian-only optimizations to the cephx code that
broke big-endian systems by doing a direct memcpy. Apparently our
tests don't find these issues, which makes me even more nervous about
taking that sort of optimization into the tree. :(
-Greg


On Sun, Nov 8, 2015 at 6:28 AM, Sage Weil  wrote:
> On Sat, 7 Nov 2015, Haomai Wang wrote:
>> Hi sage,
>>
>> Could we know about your progress to refactor MSubOP and hobject_t,
>> pg_stat_t decode problem?
>>
>> We could work on this based on your work if any.
>
> See Piotr's last email on this thread... it has Josh's patch attached.
>
> sage
>
>
>>
>>
>> On Thu, Nov 5, 2015 at 1:29 AM, Haomai Wang  wrote:
>> > On Thu, Nov 5, 2015 at 1:19 AM, piotr.da...@ts.fujitsu.com
>> >  wrote:
>> >>> -Original Message-
>> >>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
>> >>> ow...@vger.kernel.org] On Behalf Of ???
>> >>> Sent: Wednesday, November 04, 2015 4:34 PM
>> >>> To: Gregory Farnum
>> >>> Cc: ceph-devel@vger.kernel.org
>> >>> Subject: Re: ceph encoding optimization
>> >>>
>> >>> I agree that pg_stat_t (and friends) is a good first start.
>> >>> The eversion_t and utime_t are also good choices to start with because they are
>> >>> used in many places.
>> >>
>> >> At the Ceph Hackathon, Josh Durgin made initial steps in the right direction in 
>> >> terms of pg_stat_t encoding and decoding optimization, with the 
>> >> endianness-awareness left out. Even in that state, the performance 
>> >> improvements offered by this change were large enough to make it 
>> >> worthwhile. I'm attaching the patch, but please note that it is a 
>> >> prototype based on the mid-August state of the code, so you might need to 
>> >> take that into account when applying it.
>> >
>> > Cool, it's exactly what we want to see.
>> >
>> >>
>> >>
>> >> With best regards / Pozdrawiam
>> >> Piotr Dałek
>> >>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Cannot start osd due to permission of journal raw device

2015-11-09 Thread Sage Weil
On Mon, 9 Nov 2015, Chen, Xiaoxi wrote:
> There are no such rules (only 70-persistent-net.rules) in my /etc/udev/rules.d/
> 
> Could you point me to which part of the code creates the rules file? Is that 
> ceph-disk?

https://github.com/ceph/ceph/blob/master/udev/95-ceph-osd.rules

The package should install it in /lib/udev/rules.d or similar...

sage

> > -Original Message-
> > From: Sage Weil [mailto:s...@newdream.net]
> > Sent: Friday, November 6, 2015 6:33 PM
> > To: Chen, Xiaoxi
> > Cc: ceph-devel@vger.kernel.org
> > Subject: Re: Cannot start osd due to permission of journal raw device
> > 
> > On Fri, 6 Nov 2015, Chen, Xiaoxi wrote:
> > > Hi,
> > > I tried infernalis (version 9.1.0
> > > (3be81ae6cf17fcf689cd6f187c4615249fea4f61)) but it failed due to the
> > > permissions on the journal; the OSD was upgraded from hammer (also true
> > > for newly created OSDs).
> > >   I am using a raw device as the journal, and the default ownership of a
> > > raw block device is root:disk. Changing the journal owner to ceph:ceph
> > > solves the issue. It seems we can either:
> > >   1. add ceph to the "disk" group and run ceph-osd with --setuser ceph
> > > --setgroup disk?
> > >   2. require the user to set the ownership of the journal device to
> > > ceph:ceph if they want to use a raw device as the journal? Maybe we can
> > > do this in ceph-disk.
> > >
> > >   Personally I would prefer the second one; what do you think?
> > 
> > The udev rules should be setting the journal device ownership to ceph:ceph.
> > IIRC there was a race in ceph-disk that could prevent this from happening in
> > some cases but that is now fixed.  Can you try the infernalis branch?
> > 
> > sage
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Help on ext4/xattr linux kernel stability issue / ceph xattr use?

2015-11-09 Thread Sage Weil
On Mon, 9 Nov 2015, Laurent GUERBY wrote:
> Hi,
> 
> Part of our ceph cluster is using ext4 and we recently hit major kernel
> instability in the form of kernel lockups every few hours, issues
> opened:
> 
> http://tracker.ceph.com/issues/13662
> https://bugzilla.kernel.org/show_bug.cgi?id=107301
> 
> On kernel.org, kernel developers are asking about ceph's usage of xattrs,
> in particular whether there are lots of common xattr key/values or whether
> they are all different.
> 
> I attached a file with various xattr -l outputs:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=107301#c8
> https://bugzilla.kernel.org/attachment.cgi?id=192491
> 
> Looks like the "big" xattr "user.ceph._" is always different, same for
> the intermediate size "user.ceph.hinfo_key".
> 
> "user.cephos.spill_out" and "user.ceph.snapset" seem to have small
> values, and within a small value set.
> 
> Our cluster is used exclusively for virtual machines block devices with
> rbd, on replicated (3) and erasure coded pools (4+1 and 8+2).
> 
> Could someone knowledgeable add some information on ceph use of xattr in
> the kernel.org bugzilla above?

The above is all correct.  The mbcache (didn't know that existed!) is 
definitely not going to be useful here.
 
> Also I think it is necessary to warn ceph users to avoid ext4 at all
> costs until this kernel/ceph issue is sorted out: we went from
> relatively stable production for more than a year to crashes everywhere
> all the time since two weeks ago, probably after hitting some magic
> limit. We migrated our machines to ubuntu trusty, our SSD based
> filesystem to XFS but our HDD are still mostly on ext4 (60 TB
> of data to move so not that easy...).

Was there a ceph upgrade in there somewhere?  The size of the user.ceph._ 
xattr has increased over time, and (somewhat) recently crossed the 255 
byte threshold (on average) which also triggered a performance regression 
on XFS...

sage
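
For anyone who wants to check their own OSD files, here is a minimal sketch (plain Linux listxattr()/getxattr() calls, not a Ceph tool, with error handling kept to a minimum) that prints each xattr name and value size, which is enough to see whether user.ceph._ has crossed the 255-byte mark mentioned above:

// Sketch: print xattr names and value sizes for a file, e.g. an object
// file under an OSD's current/ directory.
#include <sys/xattr.h>
#include <cstdio>
#include <cstring>
#include <vector>

int main(int argc, char** argv) {
  if (argc < 2) { std::fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
  const char* path = argv[1];

  ssize_t len = listxattr(path, nullptr, 0);          // size of the name list
  if (len <= 0) { perror("listxattr"); return 1; }
  std::vector<char> names(len);
  len = listxattr(path, names.data(), names.size());
  if (len < 0) { perror("listxattr"); return 1; }

  // The buffer holds NUL-separated names such as "user.ceph._".
  for (ssize_t off = 0; off < len; off += std::strlen(names.data() + off) + 1) {
    const char* name = names.data() + off;
    ssize_t vlen = getxattr(path, name, nullptr, 0);  // value size only
    std::printf("%-24s %zd bytes\n", name, vlen);
  }
  return 0;
}

Run over a few object files on an ext4 OSD, this shows the pattern described in the bug report: the same small set of names on every file, but values of user.ceph._ and user.ceph.hinfo_key that differ from file to file, which is the worst case for mbcache's shared-xattr-block caching.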

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


suites' runs on jewel added to the schedule

2015-11-09 Thread Yuri Weinstein
(rados suite/jewel - on hold for the time being to avoid queue overload)

But other suites have been added to the schedule:

http://tracker.ceph.com/projects/ceph-releases/wiki/Sepia

Pls let me know if you see problems or any issues.

Thx
YuriW
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Request for Comments: Weighted Round Robin OP Queue

2015-11-09 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

On Mon, Nov 9, 2015 at 1:49 PM, Samuel Just  wrote:
> We basically don't want a single thread to see all of the operations -- it
> would cause a tremendous bottleneck and complicate the design
> immensely.  It shouldn't be necessary anyway since PGs are a form
> of coarse-grained locking, so it's probably fine to schedule work for
> different groups of PGs independently if we assume that all kinds of
> work are well distributed over those groups.

The only issue that I can see, based on the discussion last week, is
when the client I/O is small. There will be some points where each
thread will think it is OK to send a boulder along with the pebbles
(recovery I/O vs. client I/O). If all/most of the threads send a
boulder at the same time, would it cause issues for slow disks
(spindles)? A single queue would be much more intelligent about
situations like this and spread the boulders out better. It also seems
more scalable as you add threads (which I don't think is really practical on
spindles). I assume the bottleneck in your concern is the
communication between threads? I'm trying to understand and in no way
trying to attack you (I've been known to come across differently than I
intend to).
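
For context, a toy sketch of the weighted round robin idea under discussion (a hypothetical class, not the actual proposed patch): each priority class gets a weight, and dequeue() cycles through the classes so that low-priority work (the boulders) gets interleaved with client ops instead of being released in bursts or starved.

// Toy weighted-round-robin queue: each priority class releases items in
// proportion to its weight per refill cycle.  Simplified: no per-op cost,
// no strict-priority band, not thread safe, weights assumed >= 1.
#include <deque>
#include <map>
#include <string>
#include <utility>

class WrrQueue {
  struct Bucket {
    unsigned weight = 1;               // share of dequeues per refill cycle
    unsigned credit = 0;               // dequeues remaining in this cycle
    std::deque<std::string> items;     // stand-in for queued OpRequests
  };
  std::map<unsigned, Bucket> buckets;  // keyed by priority class
public:
  void set_weight(unsigned prio, unsigned w) { buckets[prio].weight = w; }
  void enqueue(unsigned prio, std::string op) {
    buckets[prio].items.push_back(std::move(op));
  }
  bool dequeue(std::string& out) {
    for (int pass = 0; pass < 2; ++pass) {
      for (auto& kv : buckets) {
        Bucket& b = kv.second;
        if (!b.items.empty() && b.credit > 0) {
          --b.credit;
          out = std::move(b.items.front());
          b.items.pop_front();
          return true;
        }
      }
      // Nothing had both items and credit left: start a new cycle, retry once.
      for (auto& kv : buckets) kv.second.credit = kv.second.weight;
    }
    return false;                      // all classes are empty
  }
};

As noted elsewhere in this thread, a strict-priority band still has to bypass a queue like this for things such as peering messages, and the real weights would be derived from op priority and cost.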

>> But the recovery is still happening in the recovery thread and not the
>> client thread, right? The recovery thread has a lower priority than
>> the op thread? That's how I understand it.
>>
>
> No, in hammer we removed the snap trim and scrub workqueues.  With
> wip-recovery-wq, I remove the recovery wqs as well.  Ideally, the only
> meaningful set of threads remaining will be the op_tp and associated
> queues.

OK, that is good news, I didn't do a scrub so I haven't seen the OPs
for that. Do you know the priorities of snap trim, scrub and recovery
so that I can do some math/logic on applying costs in an efficient way
as we talked about last week?

Thanks,

- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.2.3
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWQRB6CRDmVDuy+mK58QAAAsMP/RoBeyhqwNDURHagKJ9i
knjYW4jy0FFw1XmnFRhJN7FuFlYlHZ+bwvQGGYvmOkLlxgY9Y+J1GglwwV14
Vvtd/1LBOUw06Ch/WjhcgVFNIQdgdNBPHPaRurSTGxnofYKAwqB266gnzwAo
oX3EpgRskzrlwrOIg+b46Z3FhbdxYfJVqsWIEazIu9uFJDxf/pFimWSig0n1
bQsB0lZNeTbGKYww5GZqPtY3dVNqbfM6Xj5r5kxf5mhDZ2vKWJfvlc8nu86z
/VIDy5ZHPFZzv79wNlzNtZ9ofdmMT4n0Bhk8q4SFQSivs2z68DQxthcGXVaB
Bp5gy19QyE2mC6SeG3kwCYlEiGwJBGN5PVj9wDWrqDRiG/3eRS9yUs7N3RPW
hViKOYCt5lHBEhkkXaE824FweWZhupzXjiAjCMXYGtWek4LbLH9XFiMrigbR
b07EohO3cnXvrHL3+SmdEsHs0PIS0o9anyB7wn7Ze9oHQNYHXmzw48nzhth6
juGxCVeg80iNnlwpH/jQRfyEFB8rKfpJd7BLYdJgc/q4L25o/q588MeUqjUw
gc0cVkoKnegbz1fZ85CjI3YGXgXwRtVXFFl4Z+KdEJlEa1q9nRBGsho8LkT6
aanb77/QUJixLi7QQi8blXMvY0wjxzEkbtkoij0rL1OaxmKpoy/Nb8v6kyDL
rnL6
=IlY9
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RGW multi-tenancy APIs overview

2015-11-09 Thread Yehuda Sadeh-Weinraub
On Mon, Nov 9, 2015 at 9:10 PM, Pete Zaitcev  wrote:
> With ticket 5073 getting close to complete, we're getting the APIs mostly

Great! thanks for all the work you've done to get this closer to completion.

> nailed down. Most of them come down to selecting a syntax separator
> character. Unfortunately, there are several such characters. Plus,
> it is not always feasible to get by with a character (in S3 at least).
>
> So far we have the following changes:
>
> #1 Back-end and radosgw-admin use '/' or "tenant/bucket". This is what is
> literally stored in RADOS, because it's used to name bucket objects in
> the .rgw pool.
>
> #2 Buckets in Swift URLs use '\' (backslash), because there does not seem
> to be a way to use '/'. Example:
>  http://host.corp.com:8080/swift/v1/testen\testcont
>
> At first, I tried URL encoding (%2f), but that didn't work: we permit '%'
> in Swift container names, so there's a show-stopper compatibility problem.
> So, backslash. The backslash poses a similar problem, too, but hopefully
> nobody has created a container with a backslash in its name.
>
> Note that strictly speaking, we don't really need this, since Swift URLs
> could easily include tenant names where the reference Swift implementation places account names.
> It's just easier to implement without disturbing the authentication code.

I think that leveraging the native swift URL tenant encoding is
probably a cleaner solution than having it encoded as a backslash.

>
> #3 S3 host addressing of buckets
>
> This is similar to Swift and is slated to use backslash. Note that S3
> prohibits it, so we're reasonably safe with this choice.
>
> #4 S3 URL addressing of buckets
>
> Here we must use a period. Example:
>  bucket.tenant.host.corp.com
>

Can probably identify this automatically, if the host is at a
subdomain of a supported domain, and it's a second level subdomain
from the main domain then we can regard it as .

> #5 Listings and redirects.
>
> Listings present a difficulty in S3: we don't know if the name will be
> used in host-based or URL-based addressing of a bucket. So, we put the
> tenant of a bucket into a separate XML attribute.

You mean a separate http header? http param?

In the supported domains configuration, we can specify for each domain
whether a subdomain for it would be a bucket (as it is now), or
whether it would be a tenant (which implies the possibility of
bucket.tenant). This only affects the global (a.k.a the "empty")
tenant.

E.g., we can have two domains:

legacy-foo.com
new-foo.com

We'd specify that legacy-foo.com is a global tenant endpoint. In which
case, when accessing buck.legacy-foo.com, it will access the global
tenant, and bucket=buck.
Whereas, new-foo.com isn't a global tenant endpoint, in which case, if
we'd access buck.new-foo.com, it will mean that we accessed the 'buck'
tenant.


>
> Since Swift listings are always in a specific account, and thus tenant,
> they are unchanged.
>
> In addition to listings, bucket names leak into certain HTTP headers, where
> we add "Tenant:" headers as appropriate.
>
> Finally, multi-tenancy also puts user_uid namespaces under tenants as well
> as bucket namespaces. That one is easy though. A '$' separator is used
> consistently for it (tenant$user).
>

Does that work the same for object copy, and acls?

Thanks,
Yehuda

> -- Pete
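
To make the separators above concrete, a small hypothetical helper (not RGW's actual parsing code) that splits the fully qualified forms, treating an id without the separator as belonging to the legacy/global (empty) tenant:

// Hypothetical sketch: split "tenant$user" / "tenant/bucket" style ids.
#include <string>
#include <utility>

static std::pair<std::string, std::string>
split_id(const std::string& id, char sep) {
  std::string::size_type pos = id.find(sep);
  if (pos == std::string::npos)
    return std::make_pair(std::string(), id);   // global ("empty") tenant
  return std::make_pair(id.substr(0, pos), id.substr(pos + 1));
}

// split_id("testen$testuser", '$') -> {"testen", "testuser"}
// split_id("tenant/bucket", '/')   -> {"tenant", "bucket"}
// split_id("bucket", '/')          -> {"",       "bucket"}
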
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Cannot start osd due to permission of journal raw device

2015-11-09 Thread Chen, Xiaoxi
Hmm, I didn't use ceph-disk but partitioned & formatted the device myself and
called ceph-osd --mkfs directly; could that be the reason why the udev rules
don't take effect?

> -Original Message-
> From: Sage Weil [mailto:s...@newdream.net]
> Sent: Monday, November 9, 2015 9:18 PM
> To: Chen, Xiaoxi
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: Cannot start osd due to permission of journal raw device
> 
> On Mon, 9 Nov 2015, Chen, Xiaoxi wrote:
> > > There are no such rules (only 70-persistent-net.rules) in my
> > > /etc/udev/rules.d/
> > >
> > > Could you point me to which part of the code creates the rules file? Is
> > > that ceph-disk?
> 
> https://github.com/ceph/ceph/blob/master/udev/95-ceph-osd.rules
> 
> The package should install it in /lib/udev/rules.d or similar...
> 
> sage
> 
> > > -Original Message-
> > > From: Sage Weil [mailto:s...@newdream.net]
> > > Sent: Friday, November 6, 2015 6:33 PM
> > > To: Chen, Xiaoxi
> > > Cc: ceph-devel@vger.kernel.org
> > > Subject: Re: Cannot start osd due to permission of journal raw
> > > device
> > >
> > > On Fri, 6 Nov 2015, Chen, Xiaoxi wrote:
> > > > Hi,
> > > > I tried  infernalis (version 9.1.0
> > > (3be81ae6cf17fcf689cd6f187c4615249fea4f61)) but failed due to
> > > permission of journal ,  the OSD  was upgraded from hammer(also true
> > > for newly created OSD).
> > > >   I am using raw device as journal, this is because the default
> > > > privilege of
> > > raw block is root:disk. Changing the journal owner to ceph:ceph
> > > solves the issue. Seems we can either:
> > > >   1. add ceph to "disk" group and run ceph-osd with --setuser ceph
> > > > --
> > > setgroup disk?
> > > >   2. Require user to set the ownership of journal device to
> > > > ceph:ceph if they
> > > want to use raw as journal?  Maybe we can do this in ceph-disk.
> > > >
> > > >Personally I would prefer the second one , what do you think?
> > >
> > > The udev rules should be setting the journal device ownership to
> ceph:ceph.
> > > IIRC there was a race in ceph-disk that could prevent this from
> > > happening in some cases but that is now fixed.  Can you try the infernalis
> branch?
> > >
> > > sage
> >
> >
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ceph encoding optimization

2015-11-09 Thread Milosz Tanski
On Mon, Nov 9, 2015 at 10:24 AM, Sage Weil  wrote:
> On Mon, 9 Nov 2015, Gregory Farnum wrote:
>> On Wed, Nov 4, 2015 at 7:07 AM, Gregory Farnum  wrote:
>> > The problem with this approach is that the encoded versions need to be
>> > platform-independent -- they are shared over the wire and written to
>> > disks that might get transplanted to different machines. Apart from
>> > padding bytes, we also need to worry about endianness of the machine,
>> > etc. *And* we often mutate structures across versions in order to add
>> > new abilities, relying on the encode-decode process to deal with any
>> > changes to the system. How could we deal with that if just dumping the
>> > raw memory?
>> >
>> > Now, maybe we could make these changes on some carefully-selected
>> > structs, I'm not sure. But we'd need a way to pick them out, guarantee
>> > that we aren't breaking interoperability concerns, etc; and it would
>> > need to be something we can maintain as a group going forward. I'm not
>> > sure how to satisfy those constraints without burning a little extra
>> > CPU. :/
>> > -Greg
>>
>> So it turns out we've actually had issues with this. Sage merged
>> (wrote?) some little-endian-only optimizations to the cephx code that
>> broke big-endian systems by doing a direct memcpy. Apparently our
>> tests don't find these issues, which makes me even more nervous about
>> taking that sort of optimization into the tree. :(
>
> I think the way to make this maintainable will be to
>
> 1) Find a clean approach with a simple #if or #ifdef condition for
> little endian and/or architectures that can handle unaligned int pointer
> access.
>

In C++ you can also do that with a template or using
std::enable_if. The upside is the same as the downside (depending on
how you look at it): it will add compile-time checks (because the code
won't be discarded by the preprocessor) and it will take longer to build,
but you get extra checking and the compiler will later discard the
unused code.

If you do that, it should be easier to write unit tests for the functionality.
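
A minimal sketch of that approach (hypothetical names, not the real ceph::encode API, and assuming GCC/Clang's __BYTE_ORDER__ macros): the raw-copy overload only exists on little-endian hosts for POD types whose layout matches the wire format, and everything else goes through the explicit per-field path.

// Sketch only: choose a raw-copy encode at compile time via std::enable_if.
#include <cstdint>
#include <type_traits>
#include <vector>

using buffer = std::vector<char>;                    // stand-in for bufferlist

#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
constexpr bool host_is_le = true;
#else
constexpr bool host_is_le = false;
#endif

struct utime_like { uint32_t sec; uint32_t nsec; };  // hypothetical POD struct

inline void put_le32(uint32_t v, buffer& bl) {       // explicit wire byte order
  for (int i = 0; i < 4; ++i) bl.push_back(char((v >> (8 * i)) & 0xff));
}

// Fast path: the in-memory layout already matches the little-endian wire format.
template <typename T>
typename std::enable_if<std::is_pod<T>::value && host_is_le>::type
encode(const T& t, buffer& bl) {
  const char* p = reinterpret_cast<const char*>(&t);
  bl.insert(bl.end(), p, p + sizeof(T));
}

// Generic path (big-endian host, or non-POD type): encode field by field.
template <typename T>
typename std::enable_if<!(std::is_pod<T>::value && host_is_le)>::type
encode(const T& t, buffer& bl) {
  put_le32(t.sec, bl);   // fields of utime_like; real code has per-type encoders
  put_le32(t.nsec, bl);
}

A make check style test could then force both overloads on the same value and assert that the resulting buffers are identical, which ties in with point 4) quoted below.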

> 2) Maintain the parallel optimized implementation next to the generic
> encode/decode in a way that makes it as easy as possible to make changes
> and keep them in sync.
>
> 3) Optimize *only* the most recent encoding to minimize complexity.
>
> 4) Ensure that there is a set of encode/decode tests that verify they both
> work, triggered by make check (so that a simple make check on a big
> endian box will catch errors).  Ideally this'd be part of the
> test/encoding/readable.sh so that we run it over the entire corpus of old
> encodings..
>
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: mil...@adfin.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Request for Comments: Weighted Round Robin OP Queue

2015-11-09 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I should probably work against this branch.

I've got some more reading of code to do, but I'm thinking that there
isn't one of these queues for each OSD; it seems like there is one
queue for each thread in the OSD. If this is true, I think it makes
sense to break the queue into its own thread and have each 'worker'
thread push and pop OPs to and from that thread. I have been so focused on the
Queue code that I haven't really looked at the OSD/PG code until last
Friday, and it is like trying to drink from a fire hose going through
that code, so I may be misunderstanding something.

I'd appreciate any pointers to quickly understanding the OSD/PG code
specifically around the OPs and the queue.

Thanks,
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.2.3
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWQNWzCRDmVDuy+mK58QAAAGAQAJ44uFZNl84eGrHIMzDc
EyMBCE/STAOtZINV0DRmnKqrKLeWZ2ajHhr7gYdXByMdCi9QTnz/pYH8fP4m
sTtf8MnaEdDuFYpc+kVP4sOZx+efF64s4isN8lDpoa6noqDR68W3xJ7MV9/l
WJizoD9LWOvPVdPlO6M1jw3waL1eZMrxzPGpz2Xws4XnyGjIWeoUWl0kZYyT
EwGNGaQXBsioowd2PySc3axAY/zaeaJFPp4trw2k2sE9Yi4NT39R3tWgljkC
Ras8TjfHml1+xPeVadB4fdbYl2TaR8xYsVWCp+k1IuiEk/CAeljMjfST/Dqf
TBMhhw8h24AP1GLPwiOFdGIh6h6gj0UoXeXsfHKhSuW6M8Ur+9fuynyuhBUV
V0707nVmu9eiBwkgDHBcIRlnMQ0dDH60Uubf6ShagwjQSg6yfh6MNHVt6FFv
PJCcGDfEqzCjbcGhRyG0bE4aAXXAlHnUy4y2VRGIodmTHqUcZAfXoQd3dklC
KdSNyY+z/inOZip1Pbal4jNv3jAJBABn6Y1nNuB3W+33s/Jvt/aQbJpwYlkQ
iivTMkoMsimVNKAhoTybZpVwJ2Hy5TL/tWqDNwg3TBXtWSFU5S1XgJzoAQm5
yE7dbMwhAObw3XQ/eGMTmyICs1vwD0+mxaNHHWzSubtFKcdblUDW6BUxc+lj
ztfA
=GSDL
-END PGP SIGNATURE-

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Nov 9, 2015 at 9:49 AM, Samuel Just  wrote:
> It's partially in the unified queue.  The primary's background work
> for kicking off a recovery operation is not in the unified queue, but
> the messages to the replicas (pushes, pull, backfill scans) as well as
> their replies are in the unified queue as normal messages.  I've got a
> branch moving the primary's work to the queue as well (didn't quite
> make infernalis) --
> https://github.com/athanatos/ceph/tree/wip-recovery-wq.  I'm trying to
> stabilize it now for merge now that infernalis is out.
> -Sam
>
> On Sun, Nov 8, 2015 at 6:20 AM, Sage Weil  wrote:
>> On Fri, 6 Nov 2015, Robert LeBlanc wrote:
>>
>>> -BEGIN PGP SIGNED MESSAGE-
>>> Hash: SHA256
>>>
>>> After trying to look through the recovery code, I'm getting the
>>> feeling that recovery OPs are not scheduled in the OP queue that I've
>>> been working on. Does that sound right? In the OSD logs I'm only
>>> seeing priority 63, 127 and 192 (osd_op, osd_repop, osd_repop_reply).
>>> If the recovery is in another separate queue, then there is no
>>> reliable way to prioritize OPs between them.
>>>
>>> If I'm going off in to the weeds, please help me get back on the trail.
>>
>> Yeah, the recovery work isn't in the unified queue yet.
>>
>> sage
>>
>>
>>
>>>
>>> Thanks,
>>> - 
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>
>>>
>>> On Fri, Nov 6, 2015 at 10:03 AM, Robert LeBlanc  wrote:
>>> > -BEGIN PGP SIGNED MESSAGE-
>>> > Hash: SHA256
>>> >
>>> > On Fri, Nov 6, 2015 at 3:12 AM, Sage Weil  wrote:
>>> >> On Thu, 5 Nov 2015, Robert LeBlanc wrote:
>>> >>> -BEGIN PGP SIGNED MESSAGE-
>>> >>> Hash: SHA256
>>> >>>
>>> >>> Thanks Gregory,
>>> >>>
>>> >>> People are most likely busy and haven't had time to digest this and I
>>> >>> may be expecting more excitement from it (I'm excited due to the
>>> >>> results and probably also that such a large change still works). I'll
>>> >>> keep working towards a PR, this was mostly proof of concept, now that
>>> >>> there is some data I'll clean up the code.
>>> >>
>>> >> I'm *very* excited about this.  This is something that almost every
>>> >> operator has problems with so it's very encouraging to see that switching
>>> >> up the queue has a big impact in your environment.
>>> >>
>>> >> I'm just following up on this after a week of travel, so apologies if 
>>> >> this
>>> >> is covered already, but did you compare this implementation to the
>>> >> original one with the same tunables?  I see somewhere that you had
>>> >> max_backfills=20 at some point, which is going to be bad regardless of 
>>> >> the
>>> >> queue.
>>> >>
>> >> I also see that you changed the strict priority threshold from LOW to 
>>> >> HIGH
>>> >> in OSD.cc; I'm curious how much of an impact was from this vs the queue
>>> >> implementation.
>>> >
>>> > Yes max_backfills=20 is problematic for both queues and from what I
>>> > can tell is because the OPs are waiting for PGs to get healthy. In a
>>> > busy cluster it can take a while due to the recovery ops having low
>>> > priority. In the current queue, it is possible to be blocked for a
>>> > long time. The new queue seems to prevent that, 

Re: Request for Comments: Weighted Round Robin OP Queue

2015-11-09 Thread Samuel Just
Ops are hashed from the messenger (or any of the other enqueue sources
for non-message items) into one of N queues, each of which is serviced
by M threads.  We can't quite have a single thread own a single queue
yet because the current design allows multiple threads/queue
(important because if a sync read blocks on one thread, other threads
working on that queue can continue to make progress).  However, the
queue contents are hashed to a queue based on the PG, so if a PG
queues work, it'll be on the same queue as it is already operating
from (which I think is what you are getting at?).  I'm moving away
from that with the async read work I'm doing (ceph-devel subject
"Async reads, sync writes, op thread model discussion"), but I'll
still need a replacement for PrioritizedQueue.
-Sam
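
A rough sketch of that sharding (hypothetical types; the real code is the OSD's sharded op work queue): ops are hashed by PG into one of N shards so per-PG ordering is preserved, each shard has its own lock, and each shard would be served by its own small pool of M worker threads.

// Sketch: hash queued work to a fixed shard by PG id.
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <utility>
#include <vector>

struct OpRef {};                      // placeholder for an op/message reference
using pg_id = uint64_t;               // placeholder for spg_t

struct Shard {
  std::mutex lock;                    // contended only by this shard's threads
  std::deque<std::pair<pg_id, OpRef>> q;   // real code: a prioritized/WRR queue
};

class ShardedQueue {
  std::vector<Shard> shards;
public:
  explicit ShardedQueue(size_t n) : shards(n) {}
  void enqueue(pg_id pg, OpRef op) {
    Shard& s = shards[pg % shards.size()];   // same PG always maps to the same shard
    std::lock_guard<std::mutex> l(s.lock);
    s.q.emplace_back(pg, op);
  }
};

With M > 1 threads per shard, a sync read blocking one thread does not stall the whole shard, which is the constraint described above.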

On Mon, Nov 9, 2015 at 9:19 AM, Robert LeBlanc  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> I should probably work against this branch.
>
> I've got some more reading of code to do, but I'm thinking that there
> isn't one of these queues for each OSD, it seems like there is one
> queue for each thread in the OSD. If this is true, I think it makes
>> sense to break the queue into its own thread and have each 'worker'
> thread push and pop OPs out of that thread. I have been focused on the
> Queue code that I haven't really looked at the OSD/PG code until last
> Friday and it is like trying to drink from a fire hose going through
> that code, so I may be misunderstanding something.
>
> I'd appreciate any pointers to quickly understanding the OSD/PG code
> specifically around the OPs and the queue.
>
> Thanks,
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.2.3
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWQNWzCRDmVDuy+mK58QAAAGAQAJ44uFZNl84eGrHIMzDc
> EyMBCE/STAOtZINV0DRmnKqrKLeWZ2ajHhr7gYdXByMdCi9QTnz/pYH8fP4m
> sTtf8MnaEdDuFYpc+kVP4sOZx+efF64s4isN8lDpoa6noqDR68W3xJ7MV9/l
> WJizoD9LWOvPVdPlO6M1jw3waL1eZMrxzPGpz2Xws4XnyGjIWeoUWl0kZYyT
> EwGNGaQXBsioowd2PySc3axAY/zaeaJFPp4trw2k2sE9Yi4NT39R3tWgljkC
> Ras8TjfHml1+xPeVadB4fdbYl2TaR8xYsVWCp+k1IuiEk/CAeljMjfST/Dqf
> TBMhhw8h24AP1GLPwiOFdGIh6h6gj0UoXeXsfHKhSuW6M8Ur+9fuynyuhBUV
> V0707nVmu9eiBwkgDHBcIRlnMQ0dDH60Uubf6ShagwjQSg6yfh6MNHVt6FFv
> PJCcGDfEqzCjbcGhRyG0bE4aAXXAlHnUy4y2VRGIodmTHqUcZAfXoQd3dklC
> KdSNyY+z/inOZip1Pbal4jNv3jAJBABn6Y1nNuB3W+33s/Jvt/aQbJpwYlkQ
> iivTMkoMsimVNKAhoTybZpVwJ2Hy5TL/tWqDNwg3TBXtWSFU5S1XgJzoAQm5
> yE7dbMwhAObw3XQ/eGMTmyICs1vwD0+mxaNHHWzSubtFKcdblUDW6BUxc+lj
> ztfA
> =GSDL
> -END PGP SIGNATURE-
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Mon, Nov 9, 2015 at 9:49 AM, Samuel Just  wrote:
>> It's partially in the unified queue.  The primary's background work
>> for kicking off a recovery operation is not in the unified queue, but
>> the messages to the replicas (pushes, pull, backfill scans) as well as
>> their replies are in the unified queue as normal messages.  I've got a
>> branch moving the primary's work to the queue as well (didn't quite
>> make infernalis) --
>> https://github.com/athanatos/ceph/tree/wip-recovery-wq.  I'm trying to
>> stabilize it now for merge now that infernalis is out.
>> -Sam
>>
>> On Sun, Nov 8, 2015 at 6:20 AM, Sage Weil  wrote:
>>> On Fri, 6 Nov 2015, Robert LeBlanc wrote:
>>>
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA256

 After trying to look through the recovery code, I'm getting the
 feeling that recovery OPs are not scheduled in the OP queue that I've
 been working on. Does that sound right? In the OSD logs I'm only
 seeing priority 63, 127 and 192 (osd_op, osd_repop, osd_repop_reply).
 If the recovery is in another separate queue, then there is no
 reliable way to prioritize OPs between them.

 If I'm going off in to the weeds, please help me get back on the trail.
>>>
>>> Yeah, the recovery work isn't in the unified queue yet.
>>>
>>> sage
>>>
>>>
>>>

 Thanks,
 - 
 Robert LeBlanc
 PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


 On Fri, Nov 6, 2015 at 10:03 AM, Robert LeBlanc  wrote:
 > -BEGIN PGP SIGNED MESSAGE-
 > Hash: SHA256
 >
 > On Fri, Nov 6, 2015 at 3:12 AM, Sage Weil  wrote:
 >> On Thu, 5 Nov 2015, Robert LeBlanc wrote:
 >>> -BEGIN PGP SIGNED MESSAGE-
 >>> Hash: SHA256
 >>>
 >>> Thanks Gregory,
 >>>
 >>> People are most likely busy and haven't had time to digest this and I
 >>> may be expecting more excitement from it (I'm excited due to the
 >>> results and probably also that such a large change still works). I'll
 >>> keep working towards a PR, this was mostly proof of concept, now that
 >>> there is some data I'll clean up the code.
 >>
 >> I'm *very* excited about this.  This is something that almost every
 >> operator has problems with 

Re: Request for Comments: Weighted Round Robin OP Queue

2015-11-09 Thread Haomai Wang
On Tue, Nov 10, 2015 at 2:19 AM, Samuel Just  wrote:
> Ops are hashed from the messenger (or any of the other enqueue sources
> for non-message items) into one of N queues, each of which is serviced
> by M threads.  We can't quite have a single thread own a single queue
> yet because the current design allows multiple threads/queue
> (important because if a sync read blocks on one thread, other threads
> working on that queue can continue to make progress).  However, the
> queue contents are hashed to a queue based on the PG, so if a PG
> queues work, it'll be on the same queue as it is already operating
> from (which I think is what you are getting at?).  I'm moving away
> from that with the async read work I'm doing (ceph-devel subject
> "Async reads, sync writes, op thread model discussion"), but I'll
> still need a replacement for PrioritizedQueue.

I haven't thought clearly about the idea of making PriorityQueue (or
whatever weight-based queue we end up with) client-oriented. Because each
connection is currently owned by an async messenger thread, if the downstream
queue is PG-oriented, heavy lock contention can't be avoided as IOPS increase.

The only way I can see is to map msgr thread -> osd thread via the same
hash key (or whatever makes the two threads paired). What's more, the
msgr thread could work the same way as in Sam's branch; it could be only
one thread.

> -Sam
>
> On Mon, Nov 9, 2015 at 9:19 AM, Robert LeBlanc  wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> I should probably work against this branch.
>>
>> I've got some more reading of code to do, but I'm thinking that there
>> isn't one of these queues for each OSD, it seems like there is one
>> queue for each thread in the OSD. If this is true, I think it makes
>> sense to break the queue into its own thread and have each 'worker'
>> thread push and pop OPs out of that thread. I have been focused on the
>> Queue code that I haven't really looked at the OSD/PG code until last
>> Friday and it is like trying to drink from a fire hose going through
>> that code, so I may be misunderstanding something.
>>
>> I'd appreciate any pointers to quickly understanding the OSD/PG code
>> specifically around the OPs and the queue.
>>
>> Thanks,
>> -BEGIN PGP SIGNATURE-
>> Version: Mailvelope v1.2.3
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWQNWzCRDmVDuy+mK58QAAAGAQAJ44uFZNl84eGrHIMzDc
>> EyMBCE/STAOtZINV0DRmnKqrKLeWZ2ajHhr7gYdXByMdCi9QTnz/pYH8fP4m
>> sTtf8MnaEdDuFYpc+kVP4sOZx+efF64s4isN8lDpoa6noqDR68W3xJ7MV9/l
>> WJizoD9LWOvPVdPlO6M1jw3waL1eZMrxzPGpz2Xws4XnyGjIWeoUWl0kZYyT
>> EwGNGaQXBsioowd2PySc3axAY/zaeaJFPp4trw2k2sE9Yi4NT39R3tWgljkC
>> Ras8TjfHml1+xPeVadB4fdbYl2TaR8xYsVWCp+k1IuiEk/CAeljMjfST/Dqf
>> TBMhhw8h24AP1GLPwiOFdGIh6h6gj0UoXeXsfHKhSuW6M8Ur+9fuynyuhBUV
>> V0707nVmu9eiBwkgDHBcIRlnMQ0dDH60Uubf6ShagwjQSg6yfh6MNHVt6FFv
>> PJCcGDfEqzCjbcGhRyG0bE4aAXXAlHnUy4y2VRGIodmTHqUcZAfXoQd3dklC
>> KdSNyY+z/inOZip1Pbal4jNv3jAJBABn6Y1nNuB3W+33s/Jvt/aQbJpwYlkQ
>> iivTMkoMsimVNKAhoTybZpVwJ2Hy5TL/tWqDNwg3TBXtWSFU5S1XgJzoAQm5
>> yE7dbMwhAObw3XQ/eGMTmyICs1vwD0+mxaNHHWzSubtFKcdblUDW6BUxc+lj
>> ztfA
>> =GSDL
>> -END PGP SIGNATURE-
>> 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Mon, Nov 9, 2015 at 9:49 AM, Samuel Just  wrote:
>>> It's partially in the unified queue.  The primary's background work
>>> for kicking off a recovery operation is not in the unified queue, but
>>> the messages to the replicas (pushes, pull, backfill scans) as well as
>>> their replies are in the unified queue as normal messages.  I've got a
>>> branch moving the primary's work to the queue as well (didn't quite
>>> make infernalis) --
>>> https://github.com/athanatos/ceph/tree/wip-recovery-wq.  I'm trying to
>>> stabilize it now for merge now that infernalis is out.
>>> -Sam
>>>
>>> On Sun, Nov 8, 2015 at 6:20 AM, Sage Weil  wrote:
 On Fri, 6 Nov 2015, Robert LeBlanc wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> After trying to look through the recovery code, I'm getting the
> feeling that recovery OPs are not scheduled in the OP queue that I've
> been working on. Does that sound right? In the OSD logs I'm only
> seeing priority 63, 127 and 192 (osd_op, osd_repop, osd_repop_reply).
> If the recovery is in another separate queue, then there is no
> reliable way to prioritize OPs between them.
>
> If I'm going off in to the weeds, please help me get back on the trail.

 Yeah, the recovery work isn't in the unified queue yet.

 sage



>
> Thanks,
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Fri, Nov 6, 2015 at 10:03 AM, Robert LeBlanc  wrote:
> > -BEGIN PGP SIGNED MESSAGE-
> > Hash: SHA256
> >
> > On Fri, Nov 6, 

RE: Cannot start osd due to permission of journal raw device

2015-11-09 Thread Sage Weil
On Mon, 9 Nov 2015, Chen, Xiaoxi wrote:
> Hmm, I didn't use ceph-disk but partitioned & formatted the device myself and 
> called ceph-osd --mkfs directly; could that be the reason why the udev rules 
> don't take effect?

Yeah... the udev rule is based on the GPT partition label.  For example,

https://github.com/ceph/ceph/blob/master/udev/95-ceph-osd.rules#L4-L5

sage



> 
> > -Original Message-
> > From: Sage Weil [mailto:s...@newdream.net]
> > Sent: Monday, November 9, 2015 9:18 PM
> > To: Chen, Xiaoxi
> > Cc: ceph-devel@vger.kernel.org
> > Subject: RE: Cannot start osd due to permission of journal raw device
> > 
> > On Mon, 9 Nov 2015, Chen, Xiaoxi wrote:
> > > There are no such rules (only 70-persistent-net.rules) in my
> > > /etc/udev/rules.d/
> > >
> > > Could you point me to which part of the code creates the rules file? Is
> > > that ceph-disk?
> > 
> > https://github.com/ceph/ceph/blob/master/udev/95-ceph-osd.rules
> > 
> > The package should install it in /lib/udev/rules.d or similar...
> > 
> > sage
> > 
> > > > -Original Message-
> > > > From: Sage Weil [mailto:s...@newdream.net]
> > > > Sent: Friday, November 6, 2015 6:33 PM
> > > > To: Chen, Xiaoxi
> > > > Cc: ceph-devel@vger.kernel.org
> > > > Subject: Re: Cannot start osd due to permission of journal raw
> > > > device
> > > >
> > > > On Fri, 6 Nov 2015, Chen, Xiaoxi wrote:
> > > > > Hi,
> > > > > I tried  infernalis (version 9.1.0
> > > > (3be81ae6cf17fcf689cd6f187c4615249fea4f61)) but failed due to
> > > > permission of journal ,  the OSD  was upgraded from hammer(also true
> > > > for newly created OSD).
> > > > >   I am using raw device as journal, this is because the default
> > > > > privilege of
> > > > raw block is root:disk. Changing the journal owner to ceph:ceph
> > > > solves the issue. Seems we can either:
> > > > >   1. add ceph to "disk" group and run ceph-osd with --setuser ceph
> > > > > --
> > > > setgroup disk?
> > > > >   2. Require user to set the ownership of journal device to
> > > > > ceph:ceph if they
> > > > want to use raw as journal?  Maybe we can do this in ceph-disk.
> > > > >
> > > > >Personally I would prefer the second one , what do you think?
> > > >
> > > > The udev rules should be setting the journal device ownership to
> > ceph:ceph.
> > > > IIRC there was a race in ceph-disk that could prevent this from
> > > > happening in some cases but that is now fixed.  Can you try the 
> > > > infernalis
> > branch?
> > > >
> > > > sage
> > >
> > >
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Request for Comments: Weighted Round Robin OP Queue

2015-11-09 Thread Samuel Just
It's partially in the unified queue.  The primary's background work
for kicking off a recovery operation is not in the unified queue, but
the messages to the replicas (pushes, pull, backfill scans) as well as
their replies are in the unified queue as normal messages.  I've got a
branch moving the primary's work to the queue as well (didn't quite
make infernalis) --
https://github.com/athanatos/ceph/tree/wip-recovery-wq.  I'm trying to
stabilize it now for merge now that infernalis is out.
-Sam

On Sun, Nov 8, 2015 at 6:20 AM, Sage Weil  wrote:
> On Fri, 6 Nov 2015, Robert LeBlanc wrote:
>
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> After trying to look through the recovery code, I'm getting the
>> feeling that recovery OPs are not scheduled in the OP queue that I've
>> been working on. Does that sound right? In the OSD logs I'm only
>> seeing priority 63, 127 and 192 (osd_op, osd_repop, osd_repop_reply).
>> If the recovery is in another separate queue, then there is no
>> reliable way to prioritize OPs between them.
>>
>> If I'm going off in to the weeds, please help me get back on the trail.
>
> Yeah, the recovery work isn't in the unified queue yet.
>
> sage
>
>
>
>>
>> Thanks,
>> - 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Fri, Nov 6, 2015 at 10:03 AM, Robert LeBlanc  wrote:
>> > -BEGIN PGP SIGNED MESSAGE-
>> > Hash: SHA256
>> >
>> > On Fri, Nov 6, 2015 at 3:12 AM, Sage Weil  wrote:
>> >> On Thu, 5 Nov 2015, Robert LeBlanc wrote:
>> >>> -BEGIN PGP SIGNED MESSAGE-
>> >>> Hash: SHA256
>> >>>
>> >>> Thanks Gregory,
>> >>>
>> >>> People are most likely busy and haven't had time to digest this and I
>> >>> may be expecting more excitement from it (I'm excited due to the
>> >>> results and probably also that such a large change still works). I'll
>> >>> keep working towards a PR, this was mostly proof of concept, now that
>> >>> there is some data I'll clean up the code.
>> >>
>> >> I'm *very* excited about this.  This is something that almost every
>> >> operator has problems with so it's very encouraging to see that switching
>> >> up the queue has a big impact in your environment.
>> >>
>> >> I'm just following up on this after a week of travel, so apologies if this
>> >> is covered already, but did you compare this implementation to the
>> >> original one with the same tunables?  I see somewhere that you had
>> >> max_backfills=20 at some point, which is going to be bad regardless of the
>> >> queue.
>> >>
>> >> I also see that you changed the strict priority threshold from LOW to HIGH
>> >> in OSD.cc; I'm curious how much of an impact was from this vs the queue
>> >> implementation.
>> >
>> > Yes max_backfills=20 is problematic for both queues and from what I
>> > can tell is because the OPs are waiting for PGs to get healthy. In a
>> > busy cluster it can take a while due to the recovery ops having low
>> > priority. In the current queue, it is possible to be blocked for a
>> > long time. The new queue seems to prevent that, but they do still back
>> > up. After this, I think I'd like to look into promoting recovery OPs
>> > that are blocking client OPs to higher priorities so that client I/O
>> > doesn't suffer as much during recovery. I think that will be a very
>> > different problem to tackle because I don't think I can do the proper
>> > introspection at the queue level. I'll have to do that logic in OSD.cc
>> > or PG.cc.
>> >
>> > The strict priority threshold didn't make much of a difference with
>> > the original queue. I initially eliminated it all together in the WRR,
>> > but there were times that peering would never complete. I want to get
>> > as many OPs in the WRR queue to provide fairness as much as possible.
>> > I haven't tweaked the setting much in the WRR queue yet.
>> >
>> >>
>> >>> I was thinking that a config option to choose the scheduler would be a
>> >>> good idea. In terms of the project what is the better approach: create
>> >>> a new template and each place the template class is instantiated
>> >>> select the queue, or perform the queue selection in the same template
>> >>> class, or something else I haven't thought of.
>> >>
>> >> A config option would be nice, but I'd start by just cleaning up the code
>> >> and putting it in a new class (WeightedRoundRobinPriorityQueue or
>> >> whatever).  If we find that it's behaving better I'm not sure how much
>> >> value we get from a tunable.  Note that there is one other user
>> >> (msgr/simple/DispatchQueue) that we might also was to switch over at some
>> >> point.. especially if this implementation is faster.
>> >>
>> >> Once it's cleaned up (remove commented out code, new class) put it up as a
>> >> PR and we can review and get it through testing.
>> >
>> > In talking with Samuel in IRC, we think creating an abstract class for
>> > the queue is the best option. C++11 allows you to still optimize
>> > 

Re: ceph encoding optimization

2015-11-09 Thread Sage Weil
On Mon, 9 Nov 2015, Gregory Farnum wrote:
> On Wed, Nov 4, 2015 at 7:07 AM, Gregory Farnum  wrote:
> > The problem with this approach is that the encoded versions need to be
> > platform-independent -- they are shared over the wire and written to
> > disks that might get transplanted to different machines. Apart from
> > padding bytes, we also need to worry about endianness of the machine,
> > etc. *And* we often mutate structures across versions in order to add
> > new abilities, relying on the encode-decode process to deal with any
> > changes to the system. How could we deal with that if just dumping the
> > raw memory?
> >
> > Now, maybe we could make these changes on some carefully-selected
> > structs, I'm not sure. But we'd need a way to pick them out, guarantee
> > that we aren't breaking interoperability concerns, etc; and it would
> > need to be something we can maintain as a group going forward. I'm not
> > sure how to satisfy those constraints without burning a little extra
> > CPU. :/
> > -Greg
> 
> So it turns out we've actually had issues with this. Sage merged
> (wrote?) some little-endian-only optimizations to the cephx code that
> broke big-endian systems by doing a direct memcpy. Apparently our
> tests don't find these issues, which makes me even more nervous about
> taking that sort of optimization into the tree. :(

I think the way to make this maintainable will be to

1) Find a clean approach with a simple #if or #ifdef condition for 
little endian and/or architectures that can handle unaligned int pointer 
access.

2) Maintain the parallel optimized implementation next to the generic 
encode/decode in a way that makes it as easy as possible to make changes 
and keep them in sync.

3) Optimize *only* the most recent encoding to minimize complexity.

4) Ensure that there is a set of encode/decode tests that verify they both 
work, triggered by make check (so that a simple make check on a big 
endian box will catch errors).  Ideally this'd be part of the 
test/encoding/readable.sh so that we run it over the entire corpus of old 
encodings..
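
For 1) and 2), a sketch of the shape this could take (hypothetical names, including a hypothetical CEPH_FORCE_GENERIC_ENCODE opt-out for testing): a single feature test decided in one place, with the optimized encode kept directly next to the generic one so they get edited together.

// Sketch only: one guarded fast path next to the generic encoder.
#include <cstdint>
#include <vector>

using buffer = std::vector<char>;

#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__) && \
    !defined(CEPH_FORCE_GENERIC_ENCODE)        // hypothetical opt-out for tests
#define ENCODE_FAST_PATH 1
#else
#define ENCODE_FAST_PATH 0
#endif

// Toy struct chosen with no padding: padding bytes are exactly what makes a
// raw copy differ from the field-by-field wire format.
struct pg_stat_lite { uint32_t state; uint32_t num_objects; };

inline void encode(const pg_stat_lite& s, buffer& bl) {
#if ENCODE_FAST_PATH
  // Optimized: in-memory layout already equals the little-endian wire format.
  const char* p = reinterpret_cast<const char*>(&s);
  bl.insert(bl.end(), p, p + sizeof(s));
#else
  // Generic: emit each field explicitly little-endian; works on any host.
  for (int i = 0; i < 4; ++i) bl.push_back(char((s.state       >> (8 * i)) & 0xff));
  for (int i = 0; i < 4; ++i) bl.push_back(char((s.num_objects >> (8 * i)) & 0xff));
#endif
}

Building once with -DCEPH_FORCE_GENERIC_ENCODE and comparing its output against the default build over the existing encoding corpus is one cheap way to help cover point 4) even on little-endian build machines.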


sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


ceph branch status

2015-11-09 Thread ceph branch robot
-- All Branches --

Adam C. Emerson 
2015-10-16 13:49:09 -0400   wip-cxx11time
2015-10-17 13:20:15 -0400   wip-cxx11concurrency

Adam Crume 
2014-12-01 20:45:58 -0800   wip-doc-rbd-replay

Alfredo Deza 
2015-03-23 16:39:48 -0400   wip-11212

Alfredo Deza 
2014-07-08 13:58:35 -0400   wip-8679
2014-09-04 13:58:14 -0400   wip-8366
2014-10-13 11:10:10 -0400   wip-9730

Ali Maredia 
2015-10-12 14:28:30 -0400   wip-10587-split-servers
2015-11-06 14:12:14 -0500   wip-cmake

Barbora Ančincová 
2015-11-04 16:43:45 +0100   wip-doc-RGW

Boris Ranto 
2015-09-04 15:19:11 +0200   wip-bash-completion

Casey Bodley 
2015-09-28 17:09:11 -0400   wip-cxx14-test
2015-09-29 15:18:17 -0400   wip-fio-objectstore

Daniel Gryniewicz 
2015-10-28 08:53:55 -0400   wip-12997

Danny Al-Gaaf 
2015-04-23 16:32:00 +0200   wip-da-SCA-20150421
2015-04-23 17:18:57 +0200   wip-nosetests
2015-04-23 18:20:16 +0200   wip-unify-num_objects_degraded
2015-11-03 14:10:47 +0100   wip-da-SCA-20151029
2015-11-03 14:40:44 +0100   wip-da-SCA-20150910

David Zafman 
2014-08-29 10:41:23 -0700   wip-libcommon-rebase
2015-04-24 13:14:23 -0700   wip-cot-giant
2015-08-04 07:39:00 -0700   wip-12577-hammer
2015-09-28 11:33:11 -0700   wip-12983
2015-10-29 00:27:40 -0700   wip-zafman-testing

Dongmao Zhang 
2014-11-14 19:14:34 +0800   thesues-master

Greg Farnum 
2015-04-29 21:44:11 -0700   wip-init-names
2015-07-16 09:28:24 -0700   hammer-12297
2015-10-02 13:00:59 -0700   greg-infernalis-lock-testing
2015-10-02 13:09:05 -0700   greg-infernalis-lock-testing-cacher
2015-10-07 00:45:24 -0700   greg-infernalis-fs
2015-10-21 17:43:07 -0700   client-pagecache-norevoke
2015-10-27 11:32:46 -0700   hammer-pg-replay
2015-10-29 15:24:35 -0700   greg-fs-testing

Greg Farnum 
2014-10-23 13:33:44 -0700   wip-forward-scrub

Guang G Yang 
2015-06-26 20:31:44 +   wip-ec-readall
2015-07-23 16:13:19 +   wip-12316

Guang Yang 
2014-08-08 10:41:12 +   wip-guangyy-pg-splitting
2014-09-25 00:47:46 +   wip-9008
2014-09-30 10:36:39 +   guangyy-wip-9614

Haomai Wang 
2015-10-25 01:51:47 +0800   wip-13521

Haomai Wang 
2014-07-27 13:37:49 +0800   wip-flush-set
2015-04-20 00:47:59 +0800   update-organization
2015-07-21 19:33:56 +0800   fio-objectstore
2015-08-26 09:57:27 +0800   wip-recovery-attr
2015-10-24 23:39:07 +0800   fix-compile-warning

Ilya Dryomov 
2014-09-05 16:15:10 +0400   wip-rbd-notify-errors

Ivo Jimenez 
2015-08-24 23:12:45 -0700   hammer-with-new-workunit-for-wip-12551

James Page 
2015-11-04 11:08:42 +   javacruft-wip-ec-modules

Jason Dillaman 
2015-08-31 23:17:53 -0400   wip-12698
2015-09-01 10:17:02 -0400   wip-11287
2015-11-05 22:16:45 -0500   wip-librbd-qa-dillaman

Jenkins 
2015-11-04 14:31:13 -0800   rhcs-v0.94.3-ubuntu

Jenkins 
2014-07-29 05:24:39 -0700   wip-nhm-hang
2014-10-14 12:10:38 -0700   wip-2
2015-02-02 10:35:28 -0800   wip-sam-v0.92
2015-08-21 12:46:32 -0700   last
2015-08-21 12:46:32 -0700   loic-v9.0.3
2015-09-15 10:23:18 -0700   rhcs-v0.80.8
2015-09-21 16:48:32 -0700   rhcs-v0.94.1-ubuntu

Jenkins Build Slave User 
2015-11-03 16:58:32 +   infernalis

Joao Eduardo Luis 
2014-09-10 09:39:23 +0100   wip-leveldb-get.dumpling

Joao Eduardo Luis 
2014-07-22 15:41:42 +0100   wip-leveldb-misc

Joao Eduardo Luis 
2014-09-02 17:19:52 +0100   wip-leveldb-get
2014-10-17 16:20:11 +0100   wip-paxos-fix
2014-10-21 21:32:46 +0100   wip-9675.dumpling
2015-07-27 21:56:42 +0100   wip-11470.hammer
2015-09-09 15:45:45 +0100   wip-11786.hammer

Joao Eduardo Luis 
2014-11-17 16:43:53 +   wip-mon-osdmap-cleanup
2014-12-15 16:18:56 +   wip-giant-mon-backports
2014-12-17 17:13:57 +   wip-mon-backports.firefly
2014-12-17 23:15:10 +   wip-mon-sync-fix.dumpling
2015-01-07 23:01:00 +   wip-mon-blackhole-mlog-0.87.7

Re: Request for Comments: Weighted Round Robin OP Queue

2015-11-09 Thread Robert LeBlanc

Thanks, I think some of the fog is clearing. I was wondering how
operations between threads were keeping the order of operations in
PGs, that explains it.

My original thoughts were to have a queue in front and behind the
Prio/WRR queue. Threads scheduling work would queue to the pre-queue.
The queue thread would pull ops off that queue and place them into the
specialized queue, do housekeeping, etc., and would dequeue ops from that
queue to a post-queue that worker threads would monitor. The queue
thread could keep a certain number of items in the post-queue to
prevent starvation and to keep worker threads from being blocked.

It would require the worker thread to be able to handle any kind of
op, or having separate post-queues for the different kinds of work.
I'm getting the feeling that this may be a far too simplistic approach
to the problem (or at least in terms of the organization of Ceph at
this point). I'm also starting to feel that I'm getting out of my
league trying to understand all the intricacies of the OSD work flow
(trying to start with one of the most complicated parts of the system
doesn't help).

Maybe what I should do is just code up the queue to drop in as a
replacement for the Prio queue for the moment. Then as your async work
is completing we can shake out the potential issues with recovery and
costs that we talked about earlier. One thing that I'd like to look
into is elevating the priority of recovery ops that have client OPs
blocked. I don't think the WRR queue gives the recovery thread a lot
of time to get its work done.
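
(To make "code up the queue" a bit more concrete, here is a bare-bones
sketch of the weighted round robin idea in plain C -- purely illustrative,
with made-up names; it is neither Robert's implementation nor the interface
Ceph's PrioritizedQueue actually exposes.)

/* Minimal weighted round robin over a few op classes.  Each class gets
 * up to 'weight' dequeues per pass, so lower priority work keeps making
 * progress instead of starving. */
#include <stddef.h>

#define NCLASSES 4              /* e.g. client, repop, recovery, scrub */

struct op {
    struct op *next;
    int cost;                   /* whatever unit the scheduler charges */
};

struct wrr_class {
    struct op *head, *tail;     /* simple FIFO per class */
    int weight;                 /* dequeues allowed per round (> 0) */
    int credit;                 /* dequeues remaining this round */
};

struct wrr_queue {
    struct wrr_class cls[NCLASSES];
    int cur;                    /* class currently being drained */
};

static void wrr_enqueue(struct wrr_queue *q, int c, struct op *o)
{
    o->next = NULL;
    if (q->cls[c].tail)
        q->cls[c].tail->next = o;
    else
        q->cls[c].head = o;
    q->cls[c].tail = o;
}

static struct op *wrr_dequeue(struct wrr_queue *q)
{
    for (int tries = 0; tries < 2 * NCLASSES; tries++) {
        struct wrr_class *c = &q->cls[q->cur];
        if (c->head && c->credit > 0) {
            struct op *o = c->head;
            c->head = o->next;
            if (!c->head)
                c->tail = NULL;
            c->credit--;
            return o;
        }
        /* class empty or out of credit: refill it and move on */
        c->credit = c->weight;
        q->cur = (q->cur + 1) % NCLASSES;
    }
    return NULL;                /* nothing queued anywhere */
}

A cost-aware variant would charge c->credit -= o->cost instead of the flat
decrement, which is where the cost discussion from last week comes back in.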

Based on some testing on Friday, the number of recovery ops on an osd
did not really change whether there were 20 backfills or 1 backfill.
The difference came in how many client I/Os were blocked waiting
for objects to recover. When 20 backfills were going, there was a lot
more blocked I/O waiting for objects to show up or recover. With one
backfill, there was far less blocked I/O, but there were still times
I/O would block.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Nov 9, 2015 at 11:19 AM, Samuel Just  wrote:
> Ops are hashed from the messenger (or any of the other enqueue sources
> for non-message items) into one of N queues, each of which is serviced
> by M threads.  We can't quite have a single thread own a single queue
> yet because the current design allows multiple threads/queue
> (important because if a sync read blocks on one thread, other threads
> working on that queue can continue to make progress).  However, the
> queue contents are hashed to a queue based on the PG, so if a PG
> queues work, it'll be on the same queue as it is already operating
> from (which I think is what you are getting at?).  I'm moving away
> from that with the async read work I'm doing (ceph-devel subject
> "Async reads, sync writes, op thread model discussion"), but I'll
> still need a replacement for PrioritizedQueue.
> -Sam
>
> On Mon, Nov 9, 2015 at 9:19 AM, Robert LeBlanc  wrote:
>>
>> I should probably work against this branch.
>>
>> I've got some more reading of code to do, but I'm thinking that there
>> isn't one of these queues for each OSD, it seems like there is one
>> queue for each thread in the OSD. If this is true, I think it makes
>> sense to break the queue into it's own thread and have each 'worker'
>> thread push and pop OPs out of that thread. I have been focused on the
>> Queue code that I haven't really looked at the OSD/PG code until last
>> Friday and it is like trying to drink from a fire hose going through
>> that code, so I may be misunderstanding something.
>>
>> I'd appreciate any pointers to quickly understanding the OSD/PG code
>> specifically around the OPs and the queue.
>>
>> Thanks,

Re: Request for Comments: Weighted Round Robin OP Queue

2015-11-09 Thread Samuel Just
On Mon, Nov 9, 2015 at 12:31 PM, Robert LeBlanc  wrote:
>
> On Mon, Nov 9, 2015 at 12:47 PM, Samuel Just  wrote:
>> What I really want from PrioritizedQueue (and from the dmclock/mclock
>> approaches that are also being worked on) is a solution to the problem
>> of efficiently deciding which op to do next taking into account
>> fairness across io classes and ops with different costs.
>
>> On Mon, Nov 9, 2015 at 11:19 AM, Robert LeBlanc  wrote:
>>>
>>> Thanks, I think some of the fog is clearing. I was wondering how
>>> operations between threads were keeping the order of operations in
>>> PGs, that explains it.
>>>
>>> My original thoughts were to have a queue in front and behind the
>>> Prio/WRR queue. Threads scheduling work would queue to the pre-queue.
>>> The queue thread would pull ops off that queue and place them into the
>>> specialized queue, do house keeping, etc and would dequeue ops in that
>>> queue to a post-queue that worker threads would monitor. The thread
>>> queue could keep a certain amount of items in the post-queue to
>>> prevent starvation and worker threads from being blocked.
>>
>> I'm not sure what the advantage of this would be -- it adds another thread
>> to the processing pipeline at best.
>
> There are a few reasons I thought about it. 1. It is hard to
> prioritize/mange the work load if you can't see/manage all the
> operations. One queue allows the algorithm to make decisions based on
> all available information. (This point seems to be handled in a
> different way in the future) 2. Reduce latency in the Op path. When an
> OP is queued, there is overhead in getting it in the right place. When
> an OP is dequeued there is more overhead in spreading tokens, etc.
> Right now that is all serial, if an OP is stuck in the queue waiting
> to be dispatched some of this overhead can't be performed while in
> this waiting period. The idea is pushing that overhead to a separate
> thread and allowing a worker thread to queue/dequeue in the most
> efficient manner. It also allows for more complex trending,
> scheduling, etc because it can sit outside of the OP path. As the
> workload changes, it can dynamically change how it manages the queue
> like simple fifo for low periods where latency is dominated by compute
> time, to Token/WRR when latency is dominated by disk access, etc.
>

We basically don't want a single thread to see all of the operations -- it
would cause a tremendous bottleneck and complicate the design
immensely.  It shouldn't be necessary anyway since PGs are a form
of coarse-grained locking, so it's probably fine to schedule work for
different groups of PGs independently if we assume that all kinds of
work are well distributed over those groups.

>>> It would require the worker thread to be able to handle any kind of
>>> op, or having separate post-queues for the different kinds of work.
>>> I'm getting the feeling that this may be a far too simplistic approach
>>> to the problem (or at least in terms of the organization of Ceph at
>>> this point). I'm also starting to feel that I'm getting out of my
>>> league trying to understand all the intricacies of the OSD work flow
>>> (trying to start with one of the most complicated parts of the system
>>> doesn't help).
>>>
>>> Maybe what I should do is just code up the queue to drop in as a
>>> replacement for the Prio queue for the moment. Then as your async work
>>> is completing we can shake out the potential issues with recovery and
>>> costs that we talked about earlier. One thing that I'd like to look
>>> into is elevating the priority of recovery ops that have client OPs
>>> blocked. I don't think the WRR queue gives the recovery thread a lot
>>> of time to get its work done.
>>>
>>
>> If an op comes in that requires recovery to happen before it can be
>> processed, we send the recovery messages with client priority rather
>> than recovery priority.
>
> But the recovery is still happening the recovery thread and not the
> client thread, right? The recovery thread has a lower priority than
> the op thread? That's how I understand it.
>

No, in hammer we removed the snap trim and scrub workqueues.  With
wip-recovery-wq, I remove the recovery wqs as well.  Ideally, the only
meaningful set of threads remaining will be the op_tp and associated
queues.

>>> Based on some testing on Friday, the number of recovery ops on an osd
>>> did not really change if there were 20 backfilling or 1 backfilling.
>>> The difference came in with how many client I/Os were blocked waiting
>>> for objects to recover. When 20 backfills were going, there were a lot
>>> more blocked I/O waiting for objects to show up or recover. With one
>>> backfill, there were far less blocked I/O, but there were still times
>>> I/O would block.
>>
>> The number of recovery ops is actually a separate configurable
>> 

Re: [PATCH 1/9] drivers/staging/media/davinci_vpfe/vpfe_mc_capture.c: use correct structure type name in sizeof

2015-11-09 Thread Laurent Pinchart
Hi Julia,

Thank you for the patch.

On Tuesday 29 July 2014 17:16:43 Julia Lawall wrote:
> From: Julia Lawall 
> 
> Correct typo in the name of the type given to sizeof.  Because it is the
> size of a pointer that is wanted, the typo has no impact on compilation or
> execution.
> 
> This problem was found using Coccinelle (http://coccinelle.lip6.fr/).  The
> semantic patch used can be found in message 0 of this patch series.
> 
> Signed-off-by: Julia Lawall 
> 
> ---
>  drivers/staging/media/davinci_vpfe/vpfe_mc_capture.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/staging/media/davinci_vpfe/vpfe_mc_capture.c
> b/drivers/staging/media/davinci_vpfe/vpfe_mc_capture.c index
> cda8388..255590f 100644
> --- a/drivers/staging/media/davinci_vpfe/vpfe_mc_capture.c
> +++ b/drivers/staging/media/davinci_vpfe/vpfe_mc_capture.c
> @@ -227,7 +227,7 @@ static int vpfe_enable_clock(struct vpfe_device
> *vpfe_dev) return 0;
> 
>   vpfe_dev->clks = kzalloc(vpfe_cfg->num_clocks *
> -sizeof(struct clock *), GFP_KERNEL);
> +sizeof(struct clk *), GFP_KERNEL);

I'd use sizeof(*vpfe_dev->clks) to avoid such issues.
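
That is, the allocation would then read something like:

	vpfe_dev->clks = kzalloc(vpfe_cfg->num_clocks *
				 sizeof(*vpfe_dev->clks), GFP_KERNEL);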

Apart from that,

Acked-by: Laurent Pinchart 

I've applied the patch to my tree with the above change, there's no need to 
resubmit if you agree with the proposal.

>   if (vpfe_dev->clks == NULL) {
>   v4l2_err(vpfe_dev->pdev->driver, "Memory allocation failed\n");
>   return -ENOMEM;

-- 
Regards,

Laurent Pinchart



Re: Request for Comments: Weighted Round Robin OP Queue

2015-11-09 Thread Samuel Just
What I really want from PrioritizedQueue (and from the dmclock/mclock
approaches that are also being worked on) is a solution to the problem
of efficiently deciding which op to do next taking into account
fairness across io classes and ops with different costs.

On Mon, Nov 9, 2015 at 11:19 AM, Robert LeBlanc  wrote:
>
> Thanks, I think some of the fog is clearing. I was wondering how
> operations between threads were keeping the order of operations in
> PGs, that explains it.
>
> My original thoughts were to have a queue in front and behind the
> Prio/WRR queue. Threads scheduling work would queue to the pre-queue.
> The queue thread would pull ops off that queue and place them into the
> specialized queue, do house keeping, etc and would dequeue ops in that
> queue to a post-queue that worker threads would monitor. The thread
> queue could keep a certain amount of items in the post-queue to
> prevent starvation and worker threads from being blocked.

I'm not sure what the advantage of this would be -- it adds another thread
to the processing pipeline at best.

>
> It would require the worker thread to be able to handle any kind of
> op, or having separate post-queues for the different kinds of work.
> I'm getting the feeling that this may be a far too simplistic approach
> to the problem (or at least in terms of the organization of Ceph at
> this point). I'm also starting to feel that I'm getting out of my
> league trying to understand all the intricacies of the OSD work flow
> (trying to start with one of the most complicated parts of the system
> doesn't help).
>
> Maybe what I should do is just code up the queue to drop in as a
> replacement for the Prio queue for the moment. Then as your async work
> is completing we can shake out the potential issues with recovery and
> costs that we talked about earlier. One thing that I'd like to look
> into is elevating the priority of recovery ops that have client OPs
> blocked. I don't think the WRR queue gives the recovery thread a lot
> of time to get its work done.
>

If an op comes in that requires recovery to happen before it can be
processed, we send the recovery messages with client priority rather
than recovery priority.

> Based on some testing on Friday, the number of recovery ops on an osd
> did not really change if there were 20 backfilling or 1 backfilling.
> The difference came in with how many client I/Os were blocked waiting
> for objects to recover. When 20 backfills were going, there were a lot
> more blocked I/O waiting for objects to show up or recover. With one
> backfill, there were far less blocked I/O, but there were still times
> I/O would block.

The number of recovery ops is actually a separate configurable
(osd_recovery_max_active -- defaults to 15).  It's odd that with more
backfilling on a single osd, there is more blocked IO.  Looking into
that would be helpful and would probably give you some insight
into recovery and the op processing pipeline.
-Sam

> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Mon, Nov 9, 2015 at 11:19 AM, Samuel Just  wrote:
>> Ops are hashed from the messenger (or any of the other enqueue sources
>> for non-message items) into one of N queues, each of which is serviced
>> by M threads.  We can't quite have a single thread own a single queue
>> yet because the current design allows multiple threads/queue
>> (important because if a sync read blocks on one thread, other threads
>> working on that queue can continue to make progress).  However, the
>> queue contents are hashed to a queue based on the PG, so if a PG
>> queues work, it'll be on the same queue as it is already operating
>> from (which I think is what you are getting at?).  I'm moving away
>> from that with the async read work I'm doing (ceph-devel subject
>> "Async reads, sync writes, op thread model discussion"), but I'll
>> still need a replacement for PrioritizedQueue.
>> -Sam
>>
>> On Mon, Nov 9, 2015 

Re: Request for Comments: Weighted Round Robin OP Queue

2015-11-09 Thread Robert LeBlanc

On Mon, Nov 9, 2015 at 12:47 PM, Samuel Just  wrote:
> What I really want from PrioritizedQueue (and from the dmclock/mclock
> approaches that are also being worked on) is a solution to the problem
> of efficiently deciding which op to do next taking into account
> fairness across io classes and ops with different costs.

> On Mon, Nov 9, 2015 at 11:19 AM, Robert LeBlanc  wrote:
>>
>> Thanks, I think some of the fog is clearing. I was wondering how
>> operations between threads were keeping the order of operations in
>> PGs, that explains it.
>>
>> My original thoughts were to have a queue in front and behind the
>> Prio/WRR queue. Threads scheduling work would queue to the pre-queue.
>> The queue thread would pull ops off that queue and place them into the
>> specialized queue, do house keeping, etc and would dequeue ops in that
>> queue to a post-queue that worker threads would monitor. The thread
>> queue could keep a certain amount of items in the post-queue to
>> prevent starvation and worker threads from being blocked.
>
> I'm not sure what the advantage of this would be -- it adds another thread
> to the processing pipeline at best.

There are a few reasons I thought about it. 1. It is hard to
prioritize/manage the workload if you can't see/manage all the
operations. One queue allows the algorithm to make decisions based on
all available information. (This point seems to be handled in a
different way in the future.) 2. Reduce latency in the Op path. When an
OP is queued, there is overhead in getting it in the right place. When
an OP is dequeued there is more overhead in spreading tokens, etc.
Right now that is all serial; if an OP is stuck in the queue waiting
to be dispatched, some of this overhead can't be performed while in
this waiting period. The idea is pushing that overhead to a separate
thread and allowing a worker thread to queue/dequeue in the most
efficient manner. It also allows for more complex trending,
scheduling, etc because it can sit outside of the OP path. As the
workload changes, it can dynamically change how it manages the queue
like simple fifo for low periods where latency is dominated by compute
time, to Token/WRR when latency is dominated by disk access, etc.

>> It would require the worker thread to be able to handle any kind of
>> op, or having separate post-queues for the different kinds of work.
>> I'm getting the feeling that this may be a far too simplistic approach
>> to the problem (or at least in terms of the organization of Ceph at
>> this point). I'm also starting to feel that I'm getting out of my
>> league trying to understand all the intricacies of the OSD work flow
>> (trying to start with one of the most complicated parts of the system
>> doesn't help).
>>
>> Maybe what I should do is just code up the queue to drop in as a
>> replacement for the Prio queue for the moment. Then as your async work
>> is completing we can shake out the potential issues with recovery and
>> costs that we talked about earlier. One thing that I'd like to look
>> into is elevating the priority of recovery ops that have client OPs
>> blocked. I don't think the WRR queue gives the recovery thread a lot
>> of time to get its work done.
>>
>
> If an op comes in that requires recovery to happen before it can be
> processed, we send the recovery messages with client priority rather
> than recovery priority.

But the recovery is still happening in the recovery thread and not the
client thread, right? The recovery thread has a lower priority than
the op thread? That's how I understand it.

>> Based on some testing on Friday, the number of recovery ops on an osd
>> did not really change if there were 20 backfilling or 1 backfilling.
>> The difference came in with how many client I/Os were blocked waiting
>> for objects to recover. When 20 backfills were going, there were a lot
>> more blocked I/O waiting for objects to show up or recover. With one
>> backfill, there were far less blocked I/O, but there were still times
>> I/O would block.
>
> The number of recovery ops is actually a separate configurable
> (osd_recovery_max_active -- default to 15).  It's odd that with more
> backfilling on a single osd, there is more blocked IO.  Looking into
> that would be helpful and would probably give you some insight
> into recovery and the op processing pipeline.

I'll see what I can find here.

--
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

Re: make check bot resumed

2015-11-09 Thread Loic Dachary
Hi,

For some reason jenkins thought it was necessary to reconsider all commits
merged weeks ago. It was silenced so that it would not send test results about
pull requests that were already merged. It should now resume work on the
current pull requests. If a pull request needs to be visited by the make check
bot, it is enough to rebase and repush it.

Cheers

On 09/11/2015 15:33, Loic Dachary wrote:
> Hi,
> 
> The machine sending notifications for the make check bot failed during the 
> week-end. It was rebooted and it should resume its work. 
> 
> The virtual machine was actually re-built because the underlying OpenStack 
> cloud was unable to find the volume used for root after a hard reboot. There 
> were also issues with the devicemapper docker backend that was corrupted. 
> Wiping them out was enough to resolve the problem: they did not have any 
> persistent data anyway.
> 
> Cheers
> 

-- 
Loïc Dachary, Artisan Logiciel Libre





RGW multi-tenancy APIs overview

2015-11-09 Thread Pete Zaitcev
With ticket 5073 getting close to complete, we're getting the APIs mostly
nailed down. Most of them come down to selecting a syntax separator
character. Unfortunately, there are several such characters. Plus,
it is not always feasible to get by with a single character (in S3 at least).

So far we have the following changes:

#1 Back-end and radosgw-admin use '/' or "tenant/bucket". This is what is
literally stored in RADOS, because it's used to name bucket objects in
the .rgw pool.

#2 Buckets in Swift URLs use '\' (backslash), because there does not seem
to be a way to use '/'. Example:
 http://host.corp.com:8080/swift/v1/testen\testcont

At first, I tried URL encoding (%2f), but that didn't work: we permit '%'
in Swift container names, so there's a show-stopper compatibility problem.
So, backslash. The backslash poses a similar problem, too, but hopefully
nobody has created a container with a backslash in its name.

Note that strictly speaking, we don't really need this, since Swift URLs
could easily include tenant names where reference Swift places account names.
It's just easier to implement without disturbing the authentication code.

#3 S3 host addressing of buckets

This is similar to Swift and is slated to use backslash. Note that S3
prohibits it, so we're reasonably safe with this choice.

#4 S3 URL addressing of buckets

Here we must use a period. Example:
 bucket.tenant.host.corp.com

#5 Listings and redirects.

Listings present a difficulty in S3: we don't know if the name will be
used in host-based or URL-based addressing of a bucket. So, we put the
tenant of a bucket into a separate XML attribute.

Since Swift listings are always in a specific account, and thus tenant,
they are unchanged.

In addition to listings, bucket names leak into certain HTTP headers, where
we add "Tenant:" headers as appropriate.

Finally, multi-tenancy also puts user_uid namespaces under tenants as well
as bucket namespaces. That one is easy though. A '$' separator is used
consistently for it (tenant$user).
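
(A small sketch of how a consumer of these APIs might split the combined
identifiers -- illustrative only, not radosgw's actual parsing code:)

/* Split "tenant$user" or "tenant/bucket" into its two halves.  If no
 * separator is present the tenant is left empty, which matches the
 * legacy single-namespace case. */
#include <stdio.h>
#include <string.h>

static void split_id(const char *id, char sep,
                     char *tenant, size_t tlen,
                     char *name, size_t nlen)
{
    const char *p = strchr(id, sep);
    if (!p) {
        tenant[0] = '\0';
        snprintf(name, nlen, "%s", id);
        return;
    }
    snprintf(tenant, tlen, "%.*s", (int)(p - id), id);
    snprintf(name, nlen, "%s", p + 1);
}

int main(void)
{
    char tenant[64], user[64];
    split_id("testen$testuser", '$', tenant, sizeof tenant,
             user, sizeof user);
    printf("tenant=%s user=%s\n", tenant, user);  /* tenant=testen user=testuser */
    return 0;
}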

-- Pete


There is no next; only jewel

2015-11-09 Thread Sage Weil
Hey everyone,

Just a reminder that now that infernalis is out and we're back to focusing 
on jewel, we should send all bug fixes to the 'jewel' branch (which 
functions the same way the old 'next' branch did).  That is,

 bug fixes -> jewel
 new features -> master

Every dev release (hopefully we'll get back on a 2 week schedule) we'll
slurp master into jewel for the next sprint.  And during each sprint we'll
test/stabilize the jewel branch.
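
(In practice that means basing a bug-fix branch on jewel -- e.g.
git checkout -b wip-my-fix origin/jewel -- and opening the pull request
against jewel, while feature branches keep targeting master; the branch and
remote names here are just an example.)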

Expect feature freeze to be February-ish.

Thanks!
sage


Re: [PATCH] ceph:Fix error handling in the function down_reply

2015-11-09 Thread Ilya Dryomov
On Mon, Nov 9, 2015 at 11:15 AM, Yan, Zheng  wrote:
>
>> On Nov 9, 2015, at 11:11, Nicholas Krause  wrote:
>>
>> This fixes error handling in the function down_reply in order to
>> properly check and jump to the goto label, out_err for this
>> particular function if a error code is returned by any function
>> called in down_reply and therefore make checking be included
>> for the call to ceph_update_snap_trace in order to comply with
>> these error handling checks/paths.
>>
>> Signed-off-by: Nicholas Krause 
>> ---
>> fs/ceph/mds_client.c | 11 +++
>> 1 file changed, 7 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>> index 51cb02d..0b01f94 100644
>> --- a/fs/ceph/mds_client.c
>> +++ b/fs/ceph/mds_client.c
>> @@ -2495,14 +2495,17 @@ static void handle_reply(struct ceph_mds_session 
>> *session, struct ceph_msg *msg)
>>   realm = NULL;
>>   if (rinfo->snapblob_len) {
>>   down_write(&mdsc->snap_rwsem);
>> - ceph_update_snap_trace(mdsc, rinfo->snapblob,
>> - rinfo->snapblob + rinfo->snapblob_len,
>> - le32_to_cpu(head->op) == CEPH_MDS_OP_RMSNAP,
>> - &realm);
>> + err = ceph_update_snap_trace(mdsc, rinfo->snapblob,
>> +  rinfo->snapblob + 
>> rinfo->snapblob_len,
>> +  le32_to_cpu(head->op) == 
>> CEPH_MDS_OP_RMSNAP,
>> +  &realm);
>>   downgrade_write(&mdsc->snap_rwsem);
>>   } else {
>>   down_read(&mdsc->snap_rwsem);
>>   }
>> +
>> + if (err)
>> + goto out_err;
>>
>>   /* insert trace into our cache */
>>   mutex_lock(&req->r_fill_mutex);
>
> Applied, thanks

This looks to me like it'd leave snap_rwsem locked for read?  Also, the
name of the function in question is handle_reply(), not down_reply().

I'll revert it in the testing branch.

Thanks,

Ilya
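
(For readers following the patch: one way to address the problem Ilya points
out would be to check the error while the write lock is still held and drop
it before bailing out.  A sketch only, not necessarily what will land
upstream:)

	realm = NULL;
	if (rinfo->snapblob_len) {
		down_write(&mdsc->snap_rwsem);
		err = ceph_update_snap_trace(mdsc, rinfo->snapblob,
				rinfo->snapblob + rinfo->snapblob_len,
				le32_to_cpu(head->op) == CEPH_MDS_OP_RMSNAP,
				&realm);
		if (err) {
			up_write(&mdsc->snap_rwsem);  /* don't leave it held */
			goto out_err;
		}
		downgrade_write(&mdsc->snap_rwsem);
	} else {
		down_read(&mdsc->snap_rwsem);
	}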


Re: [Qemu-devel] qemu : rbd block driver internal snapshot and vm_stop is hanging forever

2015-11-09 Thread Alexandre DERUMIER
adding to ceph.conf

[client]
rbd_non_blocking_aio = false


fix the problem for me (with rbd_cache=false)


(@cc jdur...@redhat.com)



- Mail original -
De: "Denis V. Lunev" 
À: "aderumier" , "ceph-devel" 
, "qemu-devel" 
Envoyé: Lundi 9 Novembre 2015 08:22:34
Objet: Re: [Qemu-devel] qemu : rbd block driver internal snapshot and vm_stop 
is hanging forever

On 11/09/2015 10:19 AM, Denis V. Lunev wrote: 
> On 11/09/2015 06:10 AM, Alexandre DERUMIER wrote: 
>> Hi, 
>> 
>> with qemu (2.4.1), if I do an internal snapshot of an rbd device, 
>> then I pause the vm with vm_stop, 
>> 
>> the qemu process is hanging forever 
>> 
>> 
>> monitor commands to reproduce: 
>> 
>> 
>> # snapshot_blkdev_internal drive-virtio0 yoursnapname 
>> # stop 
>> 
>> 
>> 
>> 
>> I don't see this with qcow2 or sheepdog block driver for example. 
>> 
>> 
>> Regards, 
>> 
>> Alexandre 
>> 
> this could look like the problem I have recenty trying to 
> fix with dataplane enabled. Patch series is named as 
> 
> [PATCH for 2.5 v6 0/10] dataplane snapshot fixes 
> 
> Den 

anyway, even if the above does not help, can you collect gdb
traces from all threads in the QEMU process? Maybe I'll be
able to give a hint.

Den 


Re: Request for Comments: Weighted Round Robin OP Queue

2015-11-09 Thread Milosz Tanski
On Mon, Nov 9, 2015 at 3:49 PM, Samuel Just  wrote:
> On Mon, Nov 9, 2015 at 12:31 PM, Robert LeBlanc  wrote:
>>
>> On Mon, Nov 9, 2015 at 12:47 PM, Samuel Just  wrote:
>>> What I really want from PrioritizedQueue (and from the dmclock/mclock
>>> approaches that are also being worked on) is a solution to the problem
>>> of efficiently deciding which op to do next taking into account
>>> fairness across io classes and ops with different costs.
>>
>>> On Mon, Nov 9, 2015 at 11:19 AM, Robert LeBlanc  wrote:

 Thanks, I think some of the fog is clearing. I was wondering how
 operations between threads were keeping the order of operations in
 PGs, that explains it.

 My original thoughts were to have a queue in front and behind the
 Prio/WRR queue. Threads scheduling work would queue to the pre-queue.
 The queue thread would pull ops off that queue and place them into the
 specialized queue, do house keeping, etc and would dequeue ops in that
 queue to a post-queue that worker threads would monitor. The thread
 queue could keep a certain amount of items in the post-queue to
 prevent starvation and worker threads from being blocked.
>>>
>>> I'm not sure what the advantage of this would be -- it adds another thread
>>> to the processing pipeline at best.
>>
>> There are a few reasons I thought about it. 1. It is hard to
>> prioritize/mange the work load if you can't see/manage all the
>> operations. One queue allows the algorithm to make decisions based on
>> all available information. (This point seems to be handled in a
>> different way in the future) 2. Reduce latency in the Op path. When an
>> OP is queued, there is overhead in getting it in the right place. When
>> an OP is dequeued there is more overhead in spreading tokens, etc.
>> Right now that is all serial, if an OP is stuck in the queue waiting
>> to be dispatched some of this overhead can't be performed while in
>> this waiting period. The idea is pushing that overhead to a separate
>> thread and allowing a worker thread to queue/dequeue in the most
>> efficient manner. It also allows for more complex trending,
>> scheduling, etc because it can sit outside of the OP path. As the
>> workload changes, it can dynamically change how it manages the queue
>> like simple fifo for low periods where latency is dominated by compute
>> time, to Token/WRR when latency is dominated by disk access, etc.
>>
>
> We basically don't want a single thread to see all of the operations -- it
> would cause a tremendous bottleneck and complicate the design
> immensely.  It's shouldn't be necessary anyway since PGs are a form
> of course grained locking, so it's probably fine to schedule work for
> different groups of PGs independently if we assume that all kinds of
> work are well distributed over those groups.

There are some queue implementations that rely on a single thread
essentially playing traffic cop in between queues, and it's pretty
fast. FastFlow, the C++ lib, does that. It constructs other kinds of
queues from fast lock-free / wait-free SPSC queues. In the case of
something like MPMC there's a mediator thread that manages N
SPSC in-queues feeding M SPSC out-queues.

I'm only bringing this up since if you have a problem that might need
a mediator to arrange order, it's possible to do it fast.
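
(For reference, the SPSC building block being described is small; a bounded
single-producer/single-consumer ring in C11 looks roughly like the sketch
below.  Illustrative only -- this is not FastFlow's code.)

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 1024               /* must be a power of two */

struct spsc_ring {
    void *slot[RING_SIZE];
    _Atomic size_t head;             /* advanced only by the consumer */
    _Atomic size_t tail;             /* advanced only by the producer */
};

/* Producer side: returns false if the ring is full. */
static bool spsc_push(struct spsc_ring *r, void *item)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SIZE)
        return false;
    r->slot[tail & (RING_SIZE - 1)] = item;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns NULL if the ring is empty. */
static void *spsc_pop(struct spsc_ring *r)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return NULL;
    void *item = r->slot[head & (RING_SIZE - 1)];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return item;
}

An MPMC queue can then be assembled the way Milosz describes: N of these as
in-queues feeding a mediator thread that fans out to M of them as out-queues.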

>
 It would require the worker thread to be able to handle any kind of
 op, or having separate post-queues for the different kinds of work.
 I'm getting the feeling that this may be a far too simplistic approach
 to the problem (or at least in terms of the organization of Ceph at
 this point). I'm also starting to feel that I'm getting out of my
 league trying to understand all the intricacies of the OSD work flow
 (trying to start with one of the most complicated parts of the system
 doesn't help).

 Maybe what I should do is just code up the queue to drop in as a
 replacement for the Prio queue for the moment. Then as your async work
 is completing we can shake out the potential issues with recovery and
 costs that we talked about earlier. One thing that I'd like to look
 into is elevating the priority of recovery ops that have client OPs
 blocked. I don't think the WRR queue gives the recovery thread a lot
 of time to get its work done.

>>>
>>> If an op comes in that requires recovery to happen before it can be
>>> processed, we send the recovery messages with client priority rather
>>> than recovery priority.
>>
>> But the recovery is still happening the recovery thread and not the
>> client thread, right? The recovery thread has a lower priority than
>> the op thread? That's how I understand it.
>>
>
> No, in hammer we removed the snap trim and scrub workqueues.  With
> wip-recovery-wq, I remove the 

Re: Request for Comments: Weighted Round Robin OP Queue

2015-11-09 Thread Robert LeBlanc

It sounds like dmclock/mclock will alleviate a lot of the concerns I
have as long as it can be smart like you said. It sounds like the
queue thread was already tried so there is experience behind the
current implementation vs. me thinking it might be better. The basic
idea I had is below:


   Client,             Client,
   Repop,              Repop,
   backfill,    ...    Backfill,
   recovery,           Recovery,
   etc thread          etc thread
       |                   |
        \                 /
         \               /
   lock; push(prio, cost, strict,
    front/back, subsystem, ...); unlock
                |
                |
        (queue thread) pop
            /        \
           /          \
   if ops.low       Place op in prio
   (fast path)      queue, do any
       |            housekeeping
       |                |
       |            when post-queue.len
       |            < threads
        \               /
         \             /
         post-queue push
                |
         lock, cond, pop
            /        \
           /          \
      Worker    ...   Worker
      thread          thread

What I meant by more scalable is that the rate of boulders would be
constant and evenly dispersed. It also prevents any one worker thread
from being backed up while others are idle. This may not be an issue
if the PG is busy. This design could also suffer if many OPs require
some locking at the PG level instead of the object level. The queue
itself does not do any op work, only passing pointers to the work to be
done. As I mentioned before, it sounds like something like this already
proved to be limiting in performance, although thinking through this
has given me some ideas about implementing a fast path option in the
WRR queue to save some cycles.



Re: Request for Comments: Weighted Round Robin OP Queue

2015-11-09 Thread Samuel Just
On Mon, Nov 9, 2015 at 1:30 PM, Robert LeBlanc  wrote:
>
> On Mon, Nov 9, 2015 at 1:49 PM, Samuel Just  wrote:
>> We basically don't want a single thread to see all of the operations -- it
>> would cause a tremendous bottleneck and complicate the design
>> immensely.  It's shouldn't be necessary anyway since PGs are a form
>> of course grained locking, so it's probably fine to schedule work for
>> different groups of PGs independently if we assume that all kinds of
>> work are well distributed over those groups.
>
> The only issue that I can see, based on the discussion last week, is
> when the client I/O is small. There will be some points where each
> thread will think it is OK to send a boulder along with the pebbles
> (recovery I/O vs. client I/O). If all/most of the threads send a
> boulder at the same time, would it cause issues for slow disks
> (spindles)? A single queue would be much more intelligent about
> situations like this and spread the boulders out better. It also seems
> more scalable as you add threads (I don't think really practical on
> spindles). I assume the bottleneck in your concern is the
> communication between threads? I'm trying to understand and in no way
> trying to attack you (I've been known to come across differently than I
> intend to).
>

This is one of the advantages of the dmclock/mclock based designs:
we'd be able to portion out the available IO (expressed as cost/time)
among the threads and let each queue schedule against its own
quota.  A significant challenge there of course is estimating available
io capacity. Another piece is that there needs to be a bound on how
large boulders get.  Recovery will break up recovery of large objects
into lots of messages to avoid having too large a boulder.  Similarly,
there are limits at least on the bulk size of a client IO operation.
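
(A tiny illustration of the cost/time idea -- a toy model, not the
dmclock/mclock work itself: give each shard a budget that refills at its
share of the estimated backend throughput, and dispatch an op only when the
shard can pay for its cost.)

#include <stdbool.h>

struct shard_budget {
    double rate;     /* cost units this shard may spend per second */
    double burst;    /* cap so an idle shard cannot hoard credit */
    double credit;   /* cost units currently available */
    double last;     /* time of the last refill, in seconds */
};

static void budget_refill(struct shard_budget *b, double now)
{
    b->credit += (now - b->last) * b->rate;
    if (b->credit > b->burst)
        b->credit = b->burst;
    b->last = now;
}

/* Returns true if an op of the given cost may be dispatched now. */
static bool budget_try_spend(struct shard_budget *b, double now, double cost)
{
    budget_refill(b, now);
    if (b->credit < cost)
        return false;
    b->credit -= cost;
    return true;
}

Estimating the refill rate is exactly the "estimating available io capacity"
problem mentioned above.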

I don't understand how a single queue would be more scalable as we
add threads.  Pre-giant, that's how the queue worked, and it was
indeed a significant bottleneck.

As I see it, each operation is ordered in two ways (each requiring
a lock/thread of control/something):
1) The message stream from the client is ordered (represented by
the reader thread in the SimpleMessenger).  The ordering here
is actually part of the librados interface contract for the most part
(certain reads could theoretically be reordered here without
breaking the rules).
2) Operations on the PG are ordered necessarily by the PG lock
(client writes by necessity, most everything else by convenience).

So at a minimum, something ordered by 1 needs to pass off to
something ordered by 2.  We currently do this by allowing the
reader thread to fast-dispatch directly into the op queue responsible
for the PG which owns the op.  A thread local to the right PG then
takes it from there.  This means that two different ops each of which
is on a different client/pg combo may not interact at all and could be
handled entirely in parallel (that's the ideal, anyway).  Depending on
what you mean by "queue", putting all ops in a single queue
necessarily serializes all IO on that structure (even if only for a small
portion of the execution time).  This limits both parallelism and
the amount of computation you can actually do to make the
scheduling decision even more so than the current design does.
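
(As an aside, a toy sketch of that PG-to-queue mapping -- not the actual
sharded work queue code, and the option name in the comment is from memory:)

/* Map a PG onto one of the sharded op queues so that all ops for a given
 * PG keep their ordering on one queue while different PGs spread out. */
#include <stdint.h>

#define NUM_SHARDS 5                 /* think osd_op_num_shards */

struct pg_id {
    uint64_t pool;
    uint32_t seed;                   /* the PG's hash position in the pool */
};

static unsigned shard_for_pg(const struct pg_id *pg)
{
    /* any stable mix works; determinism per PG is what matters */
    uint64_t h = pg->pool * 0x9e3779b97f4a7c15ULL + pg->seed;
    h ^= h >> 33;
    return (unsigned)(h % NUM_SHARDS);
}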

Ideally, we'd like to have our cake and eat it too: we'd like good
scheduling (which PrioritizedQueue does not particularly well)
while minimizing overhead of the queue itself (an even bigger
problem with PrioritizedQueue) and keeping scaling as linear
as we can get it on many-core machines (which usually means
that independent ops should have a low probability of touching
the same structures).

>>> But the recovery is still happening the recovery thread and not the
>>> client thread, right? The recovery thread has a lower priority than
>>> the op thread? That's how I understand it.
>>>
>>
>> No, in hammer we removed the snap trim and scrub workqueues.  With
>> wip-recovery-wq, I remove the recovery wqs as well.  Ideally, the only
>> meaningful set of threads remaining will be the op_tp and associated
>> queues.
>
> OK, that is good news, I didn't do a scrub so I haven't seen the OPs
> for that. Do you know the priorities of snap trim, scrub and recovery
> so that I can do some math/logic on applying costs in an efficient way
> as we talked about last week?
>

There are config options in common/config_opt.h iirc.
-Sam

> Thanks,
>
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

Re: [PATCH] ceph:Fix error handling in the function down_reply

2015-11-09 Thread Yan, Zheng

> On Nov 9, 2015, at 11:11, Nicholas Krause  wrote:
> 
> This fixes error handling in the function down_reply in order to
> properly check and jump to the goto label, out_err for this
> particular function if a error code is returned by any function
> called in down_reply and therefore make checking be included
> for the call to ceph_update_snap_trace in order to comply with
> these error handling checks/paths.
> 
> Signed-off-by: Nicholas Krause 
> ---
> fs/ceph/mds_client.c | 11 +++
> 1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 51cb02d..0b01f94 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -2495,14 +2495,17 @@ static void handle_reply(struct ceph_mds_session 
> *session, struct ceph_msg *msg)
>   realm = NULL;
>   if (rinfo->snapblob_len) {
>   down_write(&mdsc->snap_rwsem);
> - ceph_update_snap_trace(mdsc, rinfo->snapblob,
> - rinfo->snapblob + rinfo->snapblob_len,
> - le32_to_cpu(head->op) == CEPH_MDS_OP_RMSNAP,
> - &realm);
> + err = ceph_update_snap_trace(mdsc, rinfo->snapblob,
> +  rinfo->snapblob + 
> rinfo->snapblob_len,
> +  le32_to_cpu(head->op) == 
> CEPH_MDS_OP_RMSNAP,
> +  &realm);
>   downgrade_write(&mdsc->snap_rwsem);
>   } else {
>   down_read(&mdsc->snap_rwsem);
>   }
> +
> + if (err)
> + goto out_err;
> 
>   /* insert trace into our cache */
>   mutex_lock(&req->r_fill_mutex);

Applied, thanks

Yan, Zheng

> -- 
> 2.5.0
> 



Help on ext4/xattr linux kernel stability issue / ceph xattr use?

2015-11-09 Thread Laurent GUERBY
Hi,

Part of our ceph cluster is using ext4 and we recently hit major kernel
instability in the form of kernel lockups every few hours, issues
opened:

http://tracker.ceph.com/issues/13662
https://bugzilla.kernel.org/show_bug.cgi?id=107301

On kernel.org the kernel developers are asking about ceph's usage of xattrs,
in particular whether there are lots of common xattr key/values or whether
they are all different.

I attached a file with various xattr -l outputs:

https://bugzilla.kernel.org/show_bug.cgi?id=107301#c8
https://bugzilla.kernel.org/attachment.cgi?id=192491
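
(If anyone wants to look at the same thing on their own OSDs: running
getfattr -d -e hex against an object file under the OSD's data directory
dumps the user.ceph.* attributes and their sizes.  Command from memory,
adjust as needed.)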

Looks like the "big" xattr "user.ceph._" is always different, same for
the intermediate size "user.ceph.hinfo_key".

"user.cephos.spill_out" and "user.ceph.snapset" seem to have small
values, and within a small value set.

Our cluster is used exclusively for virtual machines block devices with
rbd, on replicated (3) and erasure coded pools (4+1 and 8+2).

Could someone knowledgeable add some information on ceph use of xattr in
the kernel.org bugzilla above?

Also I think it is necessary to warn ceph users to avoid ext4 at all
costs until this kernel/ceph issue is sorted out: we went from
relatively stable production for more than a year to crashes everywhere
all the time since two weeks ago, probably after hitting some magic
limit. We migrated our machines to ubuntu trusty, our SSD based
filesystem to XFS but our HDD are still mostly on ext4 (60 TB
of data to move so not that easy...).

Thanks in advance for your help,

Sincerely,

Laurent GUERBY
http://tetaneutral.net




Re: [PATCH] ceph:Fix error handling in the function ceph_readddir_prepopulate

2015-11-09 Thread Yan, Zheng

> On Nov 9, 2015, at 05:13, Nicholas Krause  wrote:
> 
> This fixes error handling in the function ceph_readddir_prepopulate
> to properly check if the call to the function ceph_fill_dirfrag has
> failed by returning a error code. Further more if this does arise
> jump to the goto label, out of the function ceph_readdir_prepopulate
> in order to clean up previously allocated resources by this function
> before returning to the caller this errror code in order for all callers
> to be now aware and able to handle this failure in their own intended
> error paths.
> 
> Signed-off-by: Nicholas Krause 
> ---
> fs/ceph/inode.c | 7 +--
> 1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index 96d2bd8..7738be6 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -1417,8 +1417,11 @@ int ceph_readdir_prepopulate(struct ceph_mds_request 
> *req,
>   } else {
>   dout("readdir_prepopulate %d items under dn %p\n",
>rinfo->dir_nr, parent);
> - if (rinfo->dir_dir)
> - ceph_fill_dirfrag(d_inode(parent), rinfo->dir_dir);
> + if (rinfo->dir_dir) {
> + err = ceph_fill_dirfrag(d_inode(parent), 
> rinfo->dir_dir);
> + if (err)
> + goto out;
> + }
>   }
> 

ceph_fill_dirfrag() failure is not fatal. I think it’s better to not skip later 
code when it happens.

Regards
Yan, Zheng 


>   if (ceph_frag_is_leftmost(frag) && req->r_readdir_offset == 2) {
> -- 
> 2.5.0
> 
