Re: Help on ext4/xattr linux kernel stability issue / ceph xattr use?
On Mon, 2015-11-09 at 05:24 -0800, Sage Weil wrote:
> The above is all correct. The mbcache (didn't know that existed!) is
> definitely not going to be useful here.
>
> > Also I think it is necessary to warn ceph users to avoid ext4 at all
> > costs until this kernel/ceph issue is sorted out: we went from
> > relatively stable production for more than a year to crashes everywhere
> > all the time since two weeks ago, probably after hitting some magic
> > limit. We migrated our machines to ubuntu trusty, our SSD based
> > filesystem to XFS but our HDD are still mostly on ext4 (60 TB
> > of data to move so not that easy...).
>
> Was there a ceph upgrade in there somewhere? The size of the user.ceph._
> xattr has increased over time, and (somewhat) recently crossed the 255
> byte threshold (on average) which also triggered a performance regression
> on XFS...

Hi Sage,

Thanks for the confirmation. The history of our cluster is:

- initial cluster on ceph 0.80.7 (September 2014), on debian with ext4, since xfs and btrfs were crashing on debian/ceph
- upgraded to 0.87 (December 2014)
- upgraded to 0.94.2 (June 2015)
- on October 26 2015 we had two disk failures in one night; we replaced the disks, but we started to see random machine freezes during and after the recovery. We upgraded to 0.94.5 to be able to restart two of our OSDs because of: http://tracker.ceph.com/issues/13594
- after changing various hardware parts and adding new machines, we started to suspect ceph/ext4, so we migrated all our machines to ubuntu trusty and all SSDs to XFS, leaving 60 TB of data on rotational ext4 (too long to migrate)

During the whole time the cluster and its data kept expanding, from 4 machines and 2 TB to 11 machines and 60 TB of data now (~75% full).

I have lightly tested a rebuild of the ubuntu trusty 3.19 kernel with the ext4 mbcache code removed, patch here:

https://bugzilla.kernel.org/show_bug.cgi?id=107301#c6

But now we have to decide whether to go live with it.
Sincerely, Laurent -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] qemu : rbd block driver internal snapshot and vm_stop is hanging forever
> ----- Original Message -----
>> From: "Alexandre DERUMIER"
>> To: "ceph-devel"
>> Cc: "qemu-devel", jdur...@redhat.com
>> Sent: Monday, November 9, 2015 5:48:45 AM
>> Subject: Re: [Qemu-devel] qemu : rbd block driver internal snapshot and
>> vm_stop is hanging forever
>>
>> adding to ceph.conf
>>
>> [client]
>> rbd_non_blocking_aio = false
>>
>> fix the problem for me (with rbd_cache=false)
>>
>> (@cc jdur...@redhat.com)

+1, same for me.

Stefan

>> ----- Original Message -----
>> From: "Denis V. Lunev"
>> To: "aderumier", "ceph-devel", "qemu-devel"
>> Sent: Monday, November 9, 2015 08:22:34
>> Subject: Re: [Qemu-devel] qemu : rbd block driver internal snapshot and vm_stop
>> is hanging forever
>>
>> On 11/09/2015 10:19 AM, Denis V. Lunev wrote:
>>> On 11/09/2015 06:10 AM, Alexandre DERUMIER wrote:
>>>> Hi,
>>>>
>>>> with qemu (2.4.1), if I do an internal snapshot of an rbd device,
>>>> then I pause the vm with vm_stop,
>>>>
>>>> the qemu process is hanging forever
>>>>
>>>> monitor commands to reproduce:
>>>>
>>>> # snapshot_blkdev_internal drive-virtio0 yoursnapname
>>>> # stop
>>>>
>>>> I don't see this with qcow2 or sheepdog block driver for example.
>>>>
>>>> Regards,
>>>>
>>>> Alexandre
>>>
>>> this could look like the problem I have recently been trying to
>>> fix with dataplane enabled. Patch series is named as
>>>
>>> [PATCH for 2.5 v6 0/10] dataplane snapshot fixes
>>>
>>> Den
>>
>> anyway, even if above will not help, can you collect gdb
>> traces from all threads in QEMU process. Maybe I'll be
>> able to give a hint.
>>
>> Den
make check bot resumed
Hi,

The machine sending notifications for the make check bot failed during the weekend. It was rebooted and should now resume its work. The virtual machine was actually rebuilt, because the underlying OpenStack cloud was unable to find the volume used for root after a hard reboot.

There were also issues with the devicemapper docker backend, which was corrupted. Wiping the backend data out was enough to resolve the problem: it did not hold any persistent data anyway.

Cheers

--
Loïc Dachary, Artisan Logiciel Libre
Re: [Qemu-devel] qemu : rbd block driver internal snapshot and vm_stop is hanging forever
>>Can you reproduce with Ceph debug logging enabled (i.e. debug rbd=20 in your
>>ceph.conf)? If you could attach the log to the Ceph tracker ticket I opened
>>[1], that would be very helpful.
>>
>>[1] http://tracker.ceph.com/issues/13726

Yes, I'm able to reproduce it 100% of the time. I have attached the log to the tracker.

Alexandre

----- Original Message -----
From: "Jason Dillaman"
To: "aderumier"
Cc: "ceph-devel", "qemu-devel", jdur...@redhat.com
Sent: Monday, November 9, 2015 14:42:42
Subject: Re: [Qemu-devel] qemu : rbd block driver internal snapshot and vm_stop is hanging forever

Can you reproduce with Ceph debug logging enabled (i.e. debug rbd=20 in your ceph.conf)? If you could attach the log to the Ceph tracker ticket I opened [1], that would be very helpful.

[1] http://tracker.ceph.com/issues/13726

Thanks,

Jason

----- Original Message -----
> From: "Alexandre DERUMIER"
> To: "ceph-devel"
> Cc: "qemu-devel", jdur...@redhat.com
> Sent: Monday, November 9, 2015 5:48:45 AM
> Subject: Re: [Qemu-devel] qemu : rbd block driver internal snapshot and
> vm_stop is hanging forever
>
> adding to ceph.conf
>
> [client]
> rbd_non_blocking_aio = false
>
> fix the problem for me (with rbd_cache=false)
>
> (@cc jdur...@redhat.com)
>
> ----- Original Message -----
> From: "Denis V. Lunev"
> To: "aderumier", "ceph-devel", "qemu-devel"
> Sent: Monday, November 9, 2015 08:22:34
> Subject: Re: [Qemu-devel] qemu : rbd block driver internal snapshot and vm_stop
> is hanging forever
>
> On 11/09/2015 10:19 AM, Denis V. Lunev wrote:
> > On 11/09/2015 06:10 AM, Alexandre DERUMIER wrote:
> >> Hi,
> >>
> >> with qemu (2.4.1), if I do an internal snapshot of an rbd device,
> >> then I pause the vm with vm_stop,
> >>
> >> the qemu process is hanging forever
> >>
> >> monitor commands to reproduce:
> >>
> >> # snapshot_blkdev_internal drive-virtio0 yoursnapname
> >> # stop
> >>
> >> I don't see this with qcow2 or sheepdog block driver for example.
> >>
> >> Regards,
> >>
> >> Alexandre
> >
> > this could look like the problem I have recently been trying to
> > fix with dataplane enabled. Patch series is named as
> >
> > [PATCH for 2.5 v6 0/10] dataplane snapshot fixes
> >
> > Den
>
> anyway, even if above will not help, can you collect gdb
> traces from all threads in QEMU process. Maybe I'll be
> able to give a hint.
>
> Den
Re: ceph encoding optimization
On Wed, Nov 4, 2015 at 7:07 AM, Gregory Farnum wrote:
> The problem with this approach is that the encoded versions need to be
> platform-independent: they are shared over the wire and written to
> disks that might get transplanted to different machines. Apart from
> padding bytes, we also need to worry about endianness of the machine,
> etc. *And* we often mutate structures across versions in order to add
> new abilities, relying on the encode-decode process to deal with any
> changes to the system. How could we deal with that if just dumping the
> raw memory?
>
> Now, maybe we could make these changes on some carefully-selected
> structs, I'm not sure. But we'd need a way to pick them out, guarantee
> that we aren't breaking interoperability concerns, etc; and it would
> need to be something we can maintain as a group going forward. I'm not
> sure how to satisfy those constraints without burning a little extra
> CPU. :/
> -Greg

So it turns out we've actually had issues with this. Sage merged (wrote?) some little-endian-only optimizations to the cephx code that broke big-endian systems by doing a direct memcpy. Apparently our tests don't find these issues, which makes me even more nervous about taking that sort of optimization into the tree. :(
-Greg

On Sun, Nov 8, 2015 at 6:28 AM, Sage Weil wrote:
> On Sat, 7 Nov 2015, Haomai Wang wrote:
>> Hi sage,
>>
>> Could we know about your progress to refactor MSubOP and hobject_t,
>> pg_stat_t decode problem?
>>
>> We could work on this based on your work if any.
>
> See Piotr's last email on this thread... it has Josh's patch attached.
>
> sage
>
>> On Thu, Nov 5, 2015 at 1:29 AM, Haomai Wang wrote:
>> > On Thu, Nov 5, 2015 at 1:19 AM, piotr.da...@ts.fujitsu.com wrote:
>> >>> -----Original Message-----
>> >>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of ???
>> >>> Sent: Wednesday, November 04, 2015 4:34 PM
>> >>> To: Gregory Farnum
>> >>> Cc: ceph-devel@vger.kernel.org
>> >>> Subject: Re: ceph encoding optimization
>> >>>
>> >>> I agree that pg_stat_t (and friends) is a good first start.
>> >>> The eversion_t and utime_t are also good choices to start with because
>> >>> they are used in many places.
>> >>
>> >> At the Ceph Hackathon, Josh Durgin made initial steps in the right
>> >> direction in terms of pg_stat_t encoding and decoding optimization, with
>> >> the endianness-awareness thing left out. Even in that state, performance
>> >> improvements offered by this change were huge enough to make it
>> >> worthwhile. I'm attaching the patch, but please note that this is a
>> >> prototype and based on the mid-August state of the code, so you might
>> >> need to take that into account when applying the patch.
>> >
>> > Cool, it's exactly what we want to see.
>> >
>> >> With best regards / Pozdrawiam
>> >> Piotr Dałek
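To make Greg's two constraints concrete (byte order must not depend on the host, and a decoder must cope with structures that gained fields in later versions), here is a minimal sketch. It is not Ceph's encoding framework: the `Version` struct, the `put_le`/`get_le` helpers and the one-byte version tag are all hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

using Buffer = std::vector<uint8_t>;

// Hypothetical stand-in for a small Ceph struct (NOT the real eversion_t).
struct Version {
  uint64_t version = 0;
  uint32_t epoch = 0;
  uint32_t flags = 0;  // pretend this field was added in encoding v2
};

// Append an unsigned integer as little-endian bytes, whatever the host order.
template <typename T>
void put_le(Buffer& out, T value) {
  for (size_t i = 0; i < sizeof(T); ++i)
    out.push_back(static_cast<uint8_t>(value >> (8 * i)));
}

// Read a little-endian unsigned integer and advance the cursor.
template <typename T>
T get_le(const Buffer& in, size_t& pos) {
  if (pos + sizeof(T) > in.size())
    throw std::out_of_range("buffer too short");
  T value = 0;
  for (size_t i = 0; i < sizeof(T); ++i)
    value = static_cast<T>(value | (static_cast<T>(in[pos + i]) << (8 * i)));
  pos += sizeof(T);
  return value;
}

constexpr uint8_t kCurrentEncoding = 2;  // hypothetical version number

void encode(const Version& v, Buffer& out) {
  put_le<uint8_t>(out, kCurrentEncoding);  // version tag comes first
  put_le<uint64_t>(out, v.version);
  put_le<uint32_t>(out, v.epoch);
  put_le<uint32_t>(out, v.flags);
}

Version decode(const Buffer& in) {
  size_t pos = 0;
  const uint8_t struct_v = get_le<uint8_t>(in, pos);
  if (struct_v == 0 || struct_v > kCurrentEncoding)
    throw std::runtime_error("unsupported encoding version");
  Version v;
  v.version = get_le<uint64_t>(in, pos);
  v.epoch = get_le<uint32_t>(in, pos);
  if (struct_v >= 2)  // older encodings simply lack this field
    v.flags = get_le<uint32_t>(in, pos);
  return v;
}
```

A raw memcpy of the struct would skip both the byte-order normalisation and the version check, which is the kind of shortcut Greg describes as having broken big-endian systems.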
RE: Cannot start osd due to permission of journal raw device
On Mon, 9 Nov 2015, Chen, Xiaoxi wrote:
> There is no such rules (only 70-persistent-net.rules) in my /etc/udev/rules.d/
>
> Could you point me which part of the code create the rules file? Is that
> ceph-disk?

https://github.com/ceph/ceph/blob/master/udev/95-ceph-osd.rules

The package should install it in /lib/udev/rules.d or similar...

sage

> > -----Original Message-----
> > From: Sage Weil [mailto:s...@newdream.net]
> > Sent: Friday, November 6, 2015 6:33 PM
> > To: Chen, Xiaoxi
> > Cc: ceph-devel@vger.kernel.org
> > Subject: Re: Cannot start osd due to permission of journal raw device
> >
> > On Fri, 6 Nov 2015, Chen, Xiaoxi wrote:
> > > Hi,
> > > I tried infernalis (version 9.1.0 (3be81ae6cf17fcf689cd6f187c4615249fea4f61))
> > > but failed due to permission of journal, the OSD was upgraded from
> > > hammer (also true for newly created OSD).
> > > I am using raw device as journal, this is because the default privilege
> > > of raw block is root:disk. Changing the journal owner to ceph:ceph solves
> > > the issue. Seems we can either:
> > > 1. add ceph to "disk" group and run ceph-osd with --setuser ceph --setgroup disk?
> > > 2. Require user to set the ownership of journal device to ceph:ceph if
> > > they want to use raw as journal? Maybe we can do this in ceph-disk.
> > >
> > > Personally I would prefer the second one, what do you think?
> >
> > The udev rules should be setting the journal device ownership to ceph:ceph.
> > IIRC there was a race in ceph-disk that could prevent this from happening in
> > some cases but that is now fixed. Can you try the infernalis branch?
> >
> > sage
Re: Help on ext4/xattr linux kernel stability issue / ceph xattr use?
On Mon, 9 Nov 2015, Laurent GUERBY wrote:
> Hi,
>
> Part of our ceph cluster is using ext4 and we recently hit major kernel
> instability in the form of kernel lockups every few hours, issues
> opened:
>
> http://tracker.ceph.com/issues/13662
> https://bugzilla.kernel.org/show_bug.cgi?id=107301
>
> On kernel.org kernel developers are asking about ceph usage of xattr,
> in particular whether there are lots of common xattr key/value or whether
> they are all different.
>
> I attached a file with various xattr -l outputs:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=107301#c8
> https://bugzilla.kernel.org/attachment.cgi?id=192491
>
> Looks like the "big" xattr "user.ceph._" is always different, same for
> the intermediate size "user.ceph.hinfo_key".
>
> "user.cephos.spill_out" and "user.ceph.snapset" seem to have small
> values, and within a small value set.
>
> Our cluster is used exclusively for virtual machines block devices with
> rbd, on replicated (3) and erasure coded pools (4+1 and 8+2).
>
> Could someone knowledgeable add some information on ceph use of xattr in
> the kernel.org bugzilla above?

The above is all correct. The mbcache (didn't know that existed!) is definitely not going to be useful here.

> Also I think it is necessary to warn ceph users to avoid ext4 at all
> costs until this kernel/ceph issue is sorted out: we went from
> relatively stable production for more than a year to crashes everywhere
> all the time since two weeks ago, probably after hitting some magic
> limit. We migrated our machines to ubuntu trusty, our SSD based
> filesystem to XFS but our HDD are still mostly on ext4 (60 TB
> of data to move so not that easy...).

Was there a ceph upgrade in there somewhere? The size of the user.ceph._ xattr has increased over time, and (somewhat) recently crossed the 255 byte threshold (on average) which also triggered a performance regression on XFS...

sage
suites' runs on jewel added to the schedule
(rados suite/jewel - on hold for the time being to avoid queue overload)

But other suites have been added to the schedule: http://tracker.ceph.com/projects/ceph-releases/wiki/Sepia

Please let me know if you see any problems or issues.

Thanks,
YuriW
Re: Request for Comments: Weighted Round Robin OP Queue
On Mon, Nov 9, 2015 at 1:49 PM, Samuel Just wrote:
> We basically don't want a single thread to see all of the operations -- it
> would cause a tremendous bottleneck and complicate the design
> immensely. It shouldn't be necessary anyway since PGs are a form
> of coarse-grained locking, so it's probably fine to schedule work for
> different groups of PGs independently if we assume that all kinds of
> work are well distributed over those groups.

The only issue that I can see, based on the discussion last week, is when the client I/O is small. There will be some points where each thread will think it is OK to send a boulder along with the pebbles (recovery I/O vs. client I/O). If all or most of the threads send a boulder at the same time, would it cause issues for slow disks (spindles)? A single queue would be much more intelligent about situations like this and would spread the boulders out better. It also seems more scalable as you add threads (which I don't think is really practical on spindles). I assume the bottleneck you are concerned about is the communication between threads? I'm trying to understand and am in no way trying to attack you (I've been known to come across differently than I intend to).

>> But the recovery is still happening in the recovery thread and not the
>> client thread, right? The recovery thread has a lower priority than
>> the op thread? That's how I understand it.
>>
>
> No, in hammer we removed the snap trim and scrub workqueues. With
> wip-recovery-wq, I remove the recovery wqs as well. Ideally, the only
> meaningful set of threads remaining will be the op_tp and associated
> queues.

OK, that is good news. I didn't do a scrub, so I haven't seen the OPs for that. Do you know the priorities of snap trim, scrub and recovery, so that I can do some math/logic on applying costs in an efficient way, as we talked about last week?

Thanks,
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
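For readers joining the thread here, the general idea of a weighted round robin op queue can be sketched as follows. This is only an illustration under loud assumptions: it is not the code in Robert's branch nor Ceph's PrioritizedQueue, the class name is hypothetical, a bucket's weight is simply taken to be its priority, and the per-op costs discussed above are ignored.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical weighted round robin queue: in each pass a priority bucket may
// dequeue up to `priority` ops before the next lower bucket gets its turn.
class WeightedRoundRobinQueue {
 public:
  void enqueue(unsigned priority, std::string op) {
    buckets_[priority].push_back(std::move(op));
  }

  bool empty() const { return buckets_.empty(); }

  std::string dequeue() {
    assert(!empty());
    auto it = started_ ? buckets_.find(current_) : buckets_.end();
    if (it == buckets_.end() || credits_ == 0) {
      // Advance to the next lower priority, wrapping to the highest one.
      it = started_ ? buckets_.upper_bound(current_) : buckets_.begin();
      if (it == buckets_.end()) it = buckets_.begin();
      current_ = it->first;
      credits_ = current_ > 0 ? current_ : 1;  // assumption: weight == priority
      started_ = true;
    }
    std::string op = std::move(it->second.front());
    it->second.pop_front();
    --credits_;
    if (it->second.empty()) buckets_.erase(it);
    return op;
  }

 private:
  // Highest priority first.
  std::map<unsigned, std::deque<std::string>, std::greater<unsigned>> buckets_;
  unsigned current_ = 0;
  unsigned credits_ = 0;
  bool started_ = false;
};
```

With four client ops at priority 3 and two recovery ops at priority 1, this sketch interleaves them instead of draining the client ops first, which is the property a strict priority queue lacks. Whether one such queue per shard or a single shared queue spreads large ops better is exactly the open question in this message.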
Re: RGW multi-tenancy APIs overview
On Mon, Nov 9, 2015 at 9:10 PM, Pete Zaitcev wrote:
> With ticket 5073 getting close to complete, we're getting the APIs mostly

Great! Thanks for all the work you've done to get this closer to completion.

> nailed down. Most of them come down to selecting a syntax separator
> character. Unfortunately, there are several such characters. Plus,
> it is not always feasible to get by with a character (in S3 at least).
>
> So far we have the following changes:
>
> #1 Back-end and radosgw-admin use '/' or "tenant/bucket". This is what is
> literally stored in RADOS, because it's used to name bucket objects in
> the .rgw pool.
>
> #2 Buckets in Swift URLs use '\' (backslash), because there does not seem
> to be a way to use '/'. Example:
> http://host.corp.com:8080/swift/v1/testen\testcont
>
> At first, I tried URL encoding (%2f), but that didn't work: we permit '%'
> in Swift container names, so there's a show-stopper compatibility problem.
> So, backslash. The backslash poses a similar problem, too, but hopefully
> nobody created a container with backslash in name.
>
> Note that strictly speaking, we don't really need this, since Swift URLs
> could easily include tenant names where reference Swift places account names.
> It's just easier to implement without disturbing authentication code.

I think that leveraging the native swift URL tenant encoding is probably a cleaner solution than having it encoded as a backslash.

> #3 S3 host addressing of buckets
>
> This is similar to Swift and is slated to use backslash. Note that S3
> prohibits it, so we're reasonably safe with this choice.
>
> #4 S3 URL addressing of buckets
>
> Here we must use a period. Example:
> bucket.tenant.host.corp.com

We can probably identify this automatically: if the host is at a subdomain of a supported domain, and it's a second level subdomain from the main domain, then we can regard it as .

> #5 Listings and redirects.
>
> Listings present a difficulty in S3: we don't know if the name will be
> used in host-based or URL-based addressing of a bucket. So, we put the
> tenant of a bucket into a separate XML attribute.

You mean a separate http header? http param?

In the supported domains configuration, we can specify for each domain whether a subdomain for it would be a bucket (as it is now), or whether it would be a tenant (which implies the possibility of bucket.tenant). This only affects the global (a.k.a. the "empty") tenant. E.g., we can have two domains:

legacy-foo.com
new-foo.com

We'd specify that legacy-foo.com is a global tenant endpoint. In that case, accessing buck.legacy-foo.com will access the global tenant, with bucket=buck. Whereas new-foo.com isn't a global tenant endpoint, so accessing buck.new-foo.com will mean that we accessed the 'buck' tenant.

> Since Swift listings are always in a specific account, and thus tenant,
> they are unchanged.
>
> In addition to listings, bucket names leak into certain HTTP headers, where
> we add "Tenant:" headers as appropriate.
>
> Finally, multi-tenancy also puts user_uid namespaces under tenants as well
> as bucket namespaces. That one is easy though. A '$' separator is used
> consistently for it (tenant$user).

Does that work the same for object copy, and acls?

Thanks,
Yehuda

> -- Pete
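The separator rules and the per-domain proposal above can be sketched in a few lines. Everything here is hypothetical illustration rather than radosgw code: `split_tenant`, `BucketRef`, `parse_host` and the `global_tenant_endpoint` flag are names invented for this example, and splitting "bucket.tenant" at the last dot is an assumption.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Split "tenant<sep>name" into {tenant, name}. Input without the separator
// is treated as belonging to the global (empty) tenant.
std::pair<std::string, std::string> split_tenant(const std::string& s, char sep) {
  const std::string::size_type pos = s.find(sep);
  if (pos == std::string::npos) return {std::string(), s};
  return {s.substr(0, pos), s.substr(pos + 1)};
}

struct BucketRef {
  std::string tenant;
  std::string bucket;
};

// Resolve host-style addressing against one configured domain.
// global_tenant_endpoint == true : "<bucket>.<domain>" in the empty tenant.
// global_tenant_endpoint == false: "<tenant>.<domain>" or
//                                  "<bucket>.<tenant>.<domain>".
// Returns false when the host is not a subdomain of `domain`.
bool parse_host(const std::string& host, const std::string& domain,
                bool global_tenant_endpoint, BucketRef& out) {
  const std::string suffix = "." + domain;
  if (host.size() <= suffix.size() ||
      host.compare(host.size() - suffix.size(), suffix.size(), suffix) != 0)
    return false;
  const std::string prefix = host.substr(0, host.size() - suffix.size());
  if (global_tenant_endpoint) {
    out.tenant.clear();
    out.bucket = prefix;
    return true;
  }
  const std::string::size_type dot = prefix.rfind('.');
  if (dot == std::string::npos) {
    out.tenant = prefix;  // tenant only, no bucket in the host name
    out.bucket.clear();
  } else {
    out.bucket = prefix.substr(0, dot);
    out.tenant = prefix.substr(dot + 1);
  }
  return true;
}
```

The same `split_tenant` helper covers the back-end form (`'/'`, as in tenant/bucket) and the user form (`'$'`, as in tenant$user); only the host-based case needs the per-domain flag.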
RE: Cannot start osd due to permission of journal raw device
Hmm, I didn't use ceph-disk; I partitioned and formatted the disk myself and called ceph-osd --mkfs directly. Is that the reason why the udev rules don't take effect?

> -----Original Message-----
> From: Sage Weil [mailto:s...@newdream.net]
> Sent: Monday, November 9, 2015 9:18 PM
> To: Chen, Xiaoxi
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: Cannot start osd due to permission of journal raw device
>
> On Mon, 9 Nov 2015, Chen, Xiaoxi wrote:
> > There is no such rules (only 70-persistent-net.rules) in my
> > /etc/udev/rules.d/
> >
> > Could you point me which part of the code create the rules file? Is
> > that ceph-disk?
>
> https://github.com/ceph/ceph/blob/master/udev/95-ceph-osd.rules
>
> The package should install it in /lib/udev/rules.d or similar...
>
> sage
>
> > > -----Original Message-----
> > > From: Sage Weil [mailto:s...@newdream.net]
> > > Sent: Friday, November 6, 2015 6:33 PM
> > > To: Chen, Xiaoxi
> > > Cc: ceph-devel@vger.kernel.org
> > > Subject: Re: Cannot start osd due to permission of journal raw
> > > device
> > >
> > > On Fri, 6 Nov 2015, Chen, Xiaoxi wrote:
> > > > Hi,
> > > > I tried infernalis (version 9.1.0 (3be81ae6cf17fcf689cd6f187c4615249fea4f61))
> > > > but failed due to permission of journal, the OSD was upgraded from
> > > > hammer (also true for newly created OSD).
> > > > I am using raw device as journal, this is because the default
> > > > privilege of raw block is root:disk. Changing the journal owner to
> > > > ceph:ceph solves the issue. Seems we can either:
> > > > 1. add ceph to "disk" group and run ceph-osd with --setuser ceph --setgroup disk?
> > > > 2. Require user to set the ownership of journal device to
> > > > ceph:ceph if they want to use raw as journal? Maybe we can do this
> > > > in ceph-disk.
> > > >
> > > > Personally I would prefer the second one, what do you think?
> > >
> > > The udev rules should be setting the journal device ownership to ceph:ceph.
> > > IIRC there was a race in ceph-disk that could prevent this from
> > > happening in some cases but that is now fixed. Can you try the
> > > infernalis branch?
> > >
> > > sage
Re: ceph encoding optimization
On Mon, Nov 9, 2015 at 10:24 AM, Sage Weil wrote:
> On Mon, 9 Nov 2015, Gregory Farnum wrote:
>> On Wed, Nov 4, 2015 at 7:07 AM, Gregory Farnum wrote:
>> > The problem with this approach is that the encoded versions need to be
>> > platform-independent: they are shared over the wire and written to
>> > disks that might get transplanted to different machines. Apart from
>> > padding bytes, we also need to worry about endianness of the machine,
>> > etc. *And* we often mutate structures across versions in order to add
>> > new abilities, relying on the encode-decode process to deal with any
>> > changes to the system. How could we deal with that if just dumping the
>> > raw memory?
>> >
>> > Now, maybe we could make these changes on some carefully-selected
>> > structs, I'm not sure. But we'd need a way to pick them out, guarantee
>> > that we aren't breaking interoperability concerns, etc; and it would
>> > need to be something we can maintain as a group going forward. I'm not
>> > sure how to satisfy those constraints without burning a little extra
>> > CPU. :/
>> > -Greg
>>
>> So it turns out we've actually had issues with this. Sage merged
>> (wrote?) some little-endian-only optimizations to the cephx code that
>> broke big-endian systems by doing a direct memcpy. Apparently our
>> tests don't find these issues, which makes me even more nervous about
>> taking that sort of optimization into the tree. :(
>
> I think the way to make this maintainable will be to
>
> 1) Find a clean approach with a simple #if or #ifdef condition for
> little endian and/or architectures that can handle unaligned int pointer
> access.
>

In C++ you can also do that with a template or using std::enable_if. The upside is the same as the downside (depending on how you look at it): both variants are still seen by the compiler, because they are not discarded by the preprocessor. That adds compile-time checks and makes the build take longer, but you get the extra checks and the compiler will later discard the unused code.

If you do that, it should be easier to write unit tests for the functionality.

> 2) Maintain the parallel optimized implementation next to the generic
> encode/decode in a way that makes it as easy as possible to make changes
> and keep them in sync.
>
> 3) Optimize *only* the most recent encoding to minimize complexity.
>
> 4) Ensure that there is a set of encode/decode tests that verify they both
> work, triggered by make check (so that a simple make check on a big
> endian box will catch errors). Ideally this'd be part of the
> test/encoding/readable.sh so that we run it over the entire corpus of old
> encodings..
>
>
> sage

--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: mil...@adfin.com
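A rough sketch of how the std::enable_if suggestion could combine with Sage's points 2 and 4: a memcpy fast path that is only selected on little-endian hosts, kept next to a generic byte-by-byte path that a test can compare against. The names `encode`, `encode_generic` and `kHostLittleEndian` are hypothetical, and the byte-order detection assumes the GCC/Clang predefined macros; any other compiler falls back to the generic path.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

using Buffer = std::vector<uint8_t>;

// Assumption: GCC/Clang byte-order macros. Unknown compilers take the
// generic path, which is always correct.
#if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__) && \
    (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
constexpr bool kHostLittleEndian = true;
#else
constexpr bool kHostLittleEndian = false;
#endif

// Generic path: emits little-endian bytes on any host.
template <typename T>
void encode_generic(T value, Buffer& out) {
  static_assert(std::is_unsigned<T>::value, "unsigned integers only");
  for (size_t i = 0; i < sizeof(T); ++i)
    out.push_back(static_cast<uint8_t>(value >> (8 * i)));
}

// Fast path: a plain memcpy, selected only when the host is little-endian.
template <typename T, bool LE = kHostLittleEndian>
typename std::enable_if<LE>::type encode(T value, Buffer& out) {
  static_assert(std::is_unsigned<T>::value, "unsigned integers only");
  const size_t old_size = out.size();
  out.resize(old_size + sizeof(T));
  std::memcpy(out.data() + old_size, &value, sizeof(T));
}

// Fallback overload: forwards to the generic path on every other host.
template <typename T, bool LE = kHostLittleEndian>
typename std::enable_if<!LE>::type encode(T value, Buffer& out) {
  encode_generic(value, out);
}
```

Because both overloads stay visible to the compiler, a unit test can encode the same value through `encode` and `encode_generic` and require identical bytes, and it can force the fallback with `encode<uint64_t, false>(...)` even on a little-endian build machine.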
Re: Request for Comments: Weighted Round Robin OP Queue
I should probably work against this branch.

I've got some more reading of code to do, but I'm thinking that there isn't one of these queues for each OSD; it seems like there is one queue for each thread in the OSD. If this is true, I think it makes sense to break the queue into its own thread and have each 'worker' thread push and pop OPs out of that thread. I have been so focused on the queue code that I hadn't really looked at the OSD/PG code until last Friday, and going through that code is like trying to drink from a fire hose, so I may be misunderstanding something.

I'd appreciate any pointers to quickly understanding the OSD/PG code, specifically around the OPs and the queue.

Thanks,
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Mon, Nov 9, 2015 at 9:49 AM, Samuel Just wrote:
> It's partially in the unified queue. The primary's background work
> for kicking off a recovery operation is not in the unified queue, but
> the messages to the replicas (pushes, pull, backfill scans) as well as
> their replies are in the unified queue as normal messages. I've got a
> branch moving the primary's work to the queue as well (didn't quite
> make infernalis) --
> https://github.com/athanatos/ceph/tree/wip-recovery-wq. I'm trying to
> stabilize it now for merge now that infernalis is out.
> -Sam
>
> On Sun, Nov 8, 2015 at 6:20 AM, Sage Weil wrote:
>> On Fri, 6 Nov 2015, Robert LeBlanc wrote:
>>> After trying to look through the recovery code, I'm getting the
>>> feeling that recovery OPs are not scheduled in the OP queue that I've
>>> been working on. Does that sound right? In the OSD logs I'm only
>>> seeing priority 63, 127 and 192 (osd_op, osd_repop, osd_repop_reply).
>>> If the recovery is in another separate queue, then there is no
>>> reliable way to prioritize OPs between them.
>>>
>>> If I'm going off in to the weeds, please help me get back on the trail.
>>
>> Yeah, the recovery work isn't in the unified queue yet.
>>
>> sage
>>
>>> Thanks,
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>>>
>>> On Fri, Nov 6, 2015 at 10:03 AM, Robert LeBlanc wrote:
>>> > On Fri, Nov 6, 2015 at 3:12 AM, Sage Weil wrote:
>>> >> On Thu, 5 Nov 2015, Robert LeBlanc wrote:
>>> >>> Thanks Gregory,
>>> >>>
>>> >>> People are most likely busy and haven't had time to digest this and I
>>> >>> may be expecting more excitement from it (I'm excited due to the
>>> >>> results and probably also that such a large change still works). I'll
>>> >>> keep working towards a PR, this was mostly proof of concept, now that
>>> >>> there is some data I'll clean up the code.
>>> >>
>>> >> I'm *very* excited about this. This is something that almost every
>>> >> operator has problems with so it's very encouraging to see that switching
>>> >> up the queue has a big impact in your environment.
>>> >>
>>> >> I'm just following up on this after a week of travel, so apologies if
>>> >> this is covered already, but did you compare this implementation to the
>>> >> original one with the same tunables? I see somewhere that you had
>>> >> max_backfills=20 at some point, which is going to be bad regardless of
>>> >> the queue.
>>> >>
>>> >> I also see that you changed the strict priority threshold from LOW to
>>> >> HIGH in OSD.cc; I'm curious how much of an impact was from this vs the
>>> >> queue implementation.
>>> >
>>> > Yes max_backfills=20 is problematic for both queues and from what I
>>> > can tell is because the OPs are waiting for PGs to get healthy. In a
>>> > busy cluster it can take a while due to the recovery ops having low
>>> > priority. In the current queue, it is possible to be blocked for a
>>> > long time. The new queue seems to prevent that,
Re: Request for Comments: Weighted Round Robin OP Queue
Ops are hashed from the messenger (or any of the other enqueue sources for non-message items) into one of N queues, each of which is serviced by M threads. We can't quite have a single thread own a single queue yet because the current design allows multiple threads/queue (important because if a sync read blocks on one thread, other threads working on that queue can continue to make progress). However, the queue contents are hashed to a queue based on the PG, so if a PG queues work, it'll be on the same queue as it is already operating from (which I think is what you are getting at?). I'm moving away from that with the async read work I'm doing (ceph-devel subject "Async reads, sync writes, op thread model discussion"), but I'll still need a replacement for PrioritizedQueue.
-Sam

On Mon, Nov 9, 2015 at 9:19 AM, Robert LeBlanc wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> I should probably work against this branch.
>
> I've got some more reading of code to do, but I'm thinking that there
> isn't one of these queues for each OSD, it seems like there is one
> queue for each thread in the OSD. If this is true, I think it makes
> sense to break the queue into it's own thread and have each 'worker'
> thread push and pop OPs out of that thread. I have been focused on the
> Queue code that I haven't really looked at the OSD/PG code until last
> Friday and it is like trying to drink from a fire hose going through
> that code, so I may be misunderstanding something.
>
> I'd appreciate any pointers to quickly understanding the OSD/PG code
> specifically around the OPs and the queue.
> > Thanks, > -BEGIN PGP SIGNATURE- > Version: Mailvelope v1.2.3 > Comment: https://www.mailvelope.com > > wsFcBAEBCAAQBQJWQNWzCRDmVDuy+mK58QAAAGAQAJ44uFZNl84eGrHIMzDc > EyMBCE/STAOtZINV0DRmnKqrKLeWZ2ajHhr7gYdXByMdCi9QTnz/pYH8fP4m > sTtf8MnaEdDuFYpc+kVP4sOZx+efF64s4isN8lDpoa6noqDR68W3xJ7MV9/l > WJizoD9LWOvPVdPlO6M1jw3waL1eZMrxzPGpz2Xws4XnyGjIWeoUWl0kZYyT > EwGNGaQXBsioowd2PySc3axAY/zaeaJFPp4trw2k2sE9Yi4NT39R3tWgljkC > Ras8TjfHml1+xPeVadB4fdbYl2TaR8xYsVWCp+k1IuiEk/CAeljMjfST/Dqf > TBMhhw8h24AP1GLPwiOFdGIh6h6gj0UoXeXsfHKhSuW6M8Ur+9fuynyuhBUV > V0707nVmu9eiBwkgDHBcIRlnMQ0dDH60Uubf6ShagwjQSg6yfh6MNHVt6FFv > PJCcGDfEqzCjbcGhRyG0bE4aAXXAlHnUy4y2VRGIodmTHqUcZAfXoQd3dklC > KdSNyY+z/inOZip1Pbal4jNv3jAJBABn6Y1nNuB3W+33s/Jvt/aQbJpwYlkQ > iivTMkoMsimVNKAhoTybZpVwJ2Hy5TL/tWqDNwg3TBXtWSFU5S1XgJzoAQm5 > yE7dbMwhAObw3XQ/eGMTmyICs1vwD0+mxaNHHWzSubtFKcdblUDW6BUxc+lj > ztfA > =GSDL > -END PGP SIGNATURE- > > Robert LeBlanc > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 > > > On Mon, Nov 9, 2015 at 9:49 AM, Samuel Just wrote: >> It's partially in the unified queue. The primary's background work >> for kicking off a recovery operation is not in the unified queue, but >> the messages to the replicas (pushes, pull, backfill scans) as well as >> their replies are in the unified queue as normal messages. I've got a >> branch moving the primary's work to the queue as well (didn't quite >> make infernalis) -- >> https://github.com/athanatos/ceph/tree/wip-recovery-wq. I'm trying to >> stabilize it now for merge that infernalis is out. >> -Sam >> >> On Sun, Nov 8, 2015 at 6:20 AM, Sage Weil wrote: >>> On Fri, 6 Nov 2015, Robert LeBlanc wrote: >>> -BEGIN PGP SIGNED MESSAGE- Hash: SHA256 After trying to look through the recovery code, I'm getting the feeling that recovery OPs are not scheduled in the OP queue that I've been working on. Does that sound right? In the OSD logs I'm only seeing priority 63, 127 and 192 (osd_op, osd_repop, osd_repop_reply). 
If the recovery is in another separate queue, then there is no reliable way to prioritize OPs between them. If I'm going off in to the weeds, please help me get back on the trail. >>> >>> Yeah, the recovery work isn't in the unified queue yet. >>> >>> sage >>> >>> >>> Thanks, - Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Fri, Nov 6, 2015 at 10:03 AM, Robert LeBlanc wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA256 > > On Fri, Nov 6, 2015 at 3:12 AM, Sage Weil wrote: >> On Thu, 5 Nov 2015, Robert LeBlanc wrote: >>> -BEGIN PGP SIGNED MESSAGE- >>> Hash: SHA256 >>> >>> Thanks Gregory, >>> >>> People are most likely busy and haven't had time to digest this and I >>> may be expecting more excitement from it (I'm excited due to the >>> results and probably also that such a large change still works). I'll >>> keep working towards a PR, this was mostly proof of concept, now that >>> there is some data I'll clean up the code. >> >> I'm *very* excited about this. This is something that almost every >> operator has problems with
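To make Sam's description at the top of this message concrete (ops hashed by PG into one of N queues, so a PG's work always lands on the same queue), here is a minimal C++ sketch. It is not taken from the Ceph tree: `PgId`, `Op`, and `ShardedOpQueue` are hypothetical names, the hash mix is an arbitrary choice, and locking and the M servicing threads per shard are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <vector>

// Hypothetical stand-ins for illustration; not Ceph's real types.
struct PgId {
  std::uint64_t pool;
  std::uint32_t seed;
};

struct Op {
  PgId pg;
  int priority;
};

class ShardedOpQueue {
public:
  explicit ShardedOpQueue(std::size_t num_shards) : shards_(num_shards) {
    assert(num_shards > 0);
  }

  // Every op for a given PG maps to the same shard, so per-PG ordering is
  // kept no matter how many threads service that shard.
  std::size_t shard_for(const PgId& pg) const {
    std::size_t h = std::hash<std::uint64_t>{}(pg.pool) ^
                    (std::hash<std::uint32_t>{}(pg.seed) << 1);
    return h % shards_.size();
  }

  void enqueue(const Op& op) { shards_[shard_for(op.pg)].push_back(op); }

  std::size_t shard_size(std::size_t i) const { return shards_[i].size(); }

private:
  // In a real OSD each shard would carry its own lock and priority queue.
  std::vector<std::deque<Op>> shards_;
};
```

Because the shard is a pure function of the PG, any scheduling policy (PrioritizedQueue or a WRR replacement) only has to be fair within one shard.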
Re: Request for Comments: Weighted Round Robin OP Queue
On Tue, Nov 10, 2015 at 2:19 AM, Samuel Just wrote:
> Ops are hashed from the messenger (or any of the other enqueue sources
> for non-message items) into one of N queues, each of which is serviced
> by M threads. We can't quite have a single thread own a single queue
> yet because the current design allows multiple threads/queue
> (important because if a sync read blocks on one thread, other threads
> working on that queue can continue to make progress). However, the
> queue contents are hashed to a queue based on the PG, so if a PG
> queues work, it'll be on the same queue as it is already operating
> from (which I think is what you are getting at?). I'm moving away
> from that with the async read work I'm doing (ceph-devel subject
> "Async reads, sync writes, op thread model discussion"), but I'll
> still need a replacement for PrioritizedQueue.

I haven't fully thought through the idea of making PrioritizedQueue (or whatever weight-based queue replaces it) client-oriented. Currently each connection is owned by an async messenger thread, so if the later queue is PG-oriented, heavy lock contention can't be avoided as IOPS increase. The only way I can see is to route msgr thread -> OSD thread via the same hash key (or otherwise pair the two threads). What's more, the msgr thread could use the same approach as Sam's branch, so it could be only one thread.

> -Sam
>
> On Mon, Nov 9, 2015 at 9:19 AM, Robert LeBlanc wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> I should probably work against this branch.
>>
>> I've got some more reading of code to do, but I'm thinking that there
>> isn't one of these queues for each OSD, it seems like there is one
>> queue for each thread in the OSD. If this is true, I think it makes
>> sense to break the queue into it's own thread and have each 'worker'
>> thread push and pop OPs out of that thread.
I have been focused on the >> Queue code that I haven't really looked at the OSD/PG code until last >> Friday and it is like trying to drink from a fire hose going through >> that code, so I may be misunderstanding something. >> >> I'd appreciate any pointers to quickly understanding the OSD/PG code >> specifically around the OPs and the queue. >> >> Thanks, >> -BEGIN PGP SIGNATURE- >> Version: Mailvelope v1.2.3 >> Comment: https://www.mailvelope.com >> >> wsFcBAEBCAAQBQJWQNWzCRDmVDuy+mK58QAAAGAQAJ44uFZNl84eGrHIMzDc >> EyMBCE/STAOtZINV0DRmnKqrKLeWZ2ajHhr7gYdXByMdCi9QTnz/pYH8fP4m >> sTtf8MnaEdDuFYpc+kVP4sOZx+efF64s4isN8lDpoa6noqDR68W3xJ7MV9/l >> WJizoD9LWOvPVdPlO6M1jw3waL1eZMrxzPGpz2Xws4XnyGjIWeoUWl0kZYyT >> EwGNGaQXBsioowd2PySc3axAY/zaeaJFPp4trw2k2sE9Yi4NT39R3tWgljkC >> Ras8TjfHml1+xPeVadB4fdbYl2TaR8xYsVWCp+k1IuiEk/CAeljMjfST/Dqf >> TBMhhw8h24AP1GLPwiOFdGIh6h6gj0UoXeXsfHKhSuW6M8Ur+9fuynyuhBUV >> V0707nVmu9eiBwkgDHBcIRlnMQ0dDH60Uubf6ShagwjQSg6yfh6MNHVt6FFv >> PJCcGDfEqzCjbcGhRyG0bE4aAXXAlHnUy4y2VRGIodmTHqUcZAfXoQd3dklC >> KdSNyY+z/inOZip1Pbal4jNv3jAJBABn6Y1nNuB3W+33s/Jvt/aQbJpwYlkQ >> iivTMkoMsimVNKAhoTybZpVwJ2Hy5TL/tWqDNwg3TBXtWSFU5S1XgJzoAQm5 >> yE7dbMwhAObw3XQ/eGMTmyICs1vwD0+mxaNHHWzSubtFKcdblUDW6BUxc+lj >> ztfA >> =GSDL >> -END PGP SIGNATURE- >> >> Robert LeBlanc >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 >> >> >> On Mon, Nov 9, 2015 at 9:49 AM, Samuel Just wrote: >>> It's partially in the unified queue. The primary's background work >>> for kicking off a recovery operation is not in the unified queue, but >>> the messages to the replicas (pushes, pull, backfill scans) as well as >>> their replies are in the unified queue as normal messages. I've got a >>> branch moving the primary's work to the queue as well (didn't quite >>> make infernalis) -- >>> https://github.com/athanatos/ceph/tree/wip-recovery-wq. I'm trying to >>> stabilize it now for merge that infernalis is out. 
>>> -Sam >>> >>> On Sun, Nov 8, 2015 at 6:20 AM, Sage Weil wrote: On Fri, 6 Nov 2015, Robert LeBlanc wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA256 > > After trying to look through the recovery code, I'm getting the > feeling that recovery OPs are not scheduled in the OP queue that I've > been working on. Does that sound right? In the OSD logs I'm only > seeing priority 63, 127 and 192 (osd_op, osd_repop, osd_repop_reply). > If the recovery is in another separate queue, then there is no > reliable way to prioritize OPs between them. > > If I'm going off in to the weeds, please help me get back on the trail. Yeah, the recovery work isn't in the unified queue yet. sage > > Thanks, > - > Robert LeBlanc > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 > > > On Fri, Nov 6, 2015 at 10:03 AM, Robert LeBlanc wrote: > > -BEGIN PGP SIGNED MESSAGE- > > Hash: SHA256 > > > > On Fri, Nov 6,
RE: Cannot start osd due to permission of journal raw device
On Mon, 9 Nov 2015, Chen, Xiaoxi wrote: > Hmm I didn't use ceph-disk but partitioned & format by myself and > call ceph-osd --mkfs directly, that should be the reason why udev rules > doesn't make effect? Yeah... the udev rule is based on the GPT partition label. For example, https://github.com/ceph/ceph/blob/master/udev/95-ceph-osd.rules#L4-L5 sage > > > -Original Message- > > From: Sage Weil [mailto:s...@newdream.net] > > Sent: Monday, November 9, 2015 9:18 PM > > To: Chen, Xiaoxi > > Cc: ceph-devel@vger.kernel.org > > Subject: RE: Cannot start osd due to permission of journal raw device > > > > On Mon, 9 Nov 2015, Chen, Xiaoxi wrote: > > > There is no such rules (only 70-persistent-net.rules) in my > > > /etc/udev/ruled.d/ > > > > > > Could you point me which part of the code create the rules file? Is > > > that ceph-disk? > > > > https://github.com/ceph/ceph/blob/master/udev/95-ceph-osd.rules > > > > The package should install it in /lib/udev/rules.d or similar... > > > > sage > > > > > > -Original Message- > > > > From: Sage Weil [mailto:s...@newdream.net] > > > > Sent: Friday, November 6, 2015 6:33 PM > > > > To: Chen, Xiaoxi > > > > Cc: ceph-devel@vger.kernel.org > > > > Subject: Re: Cannot start osd due to permission of journal raw > > > > device > > > > > > > > On Fri, 6 Nov 2015, Chen, Xiaoxi wrote: > > > > > Hi, > > > > > I tried infernalis (version 9.1.0 > > > > (3be81ae6cf17fcf689cd6f187c4615249fea4f61)) but failed due to > > > > permission of journal , the OSD was upgraded from hammer(also true > > > > for newly created OSD). > > > > > I am using raw device as journal, this is because the default > > > > > privilege of > > > > raw block is root:disk. Changing the journal owner to ceph:ceph > > > > solve the issue. Seems we can either: > > > > > 1. add ceph to "disk" group and run ceph-osd with --setuser ceph > > > > > -- > > > > setgroup disk? > > > > > 2. 
Require user to set the ownership of journal device to > > > > > ceph:ceph is they > > > > want to use raw as journal? Maybe we can done this in ceph-disk. > > > > > > > > > >Personally I would prefer the second one , what do you think? > > > > > > > > The udev rules should be setting the jouranl device ownership to > > ceph:ceph. > > > > IIRC there was a race in ceph-disk that could prevent this from > > > > happening in some cases but that is now fixed. Can you try the > > > > infernalis > > branch? > > > > > > > > sage > > > > > > > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Request for Comments: Weighted Round Robin OP Queue
It's partially in the unified queue. The primary's background work for kicking off a recovery operation is not in the unified queue, but the messages to the replicas (pushes, pull, backfill scans) as well as their replies are in the unified queue as normal messages. I've got a branch moving the primary's work to the queue as well (didn't quite make infernalis) -- https://github.com/athanatos/ceph/tree/wip-recovery-wq. I'm trying to stabilize it for merge now that infernalis is out.
-Sam

On Sun, Nov 8, 2015 at 6:20 AM, Sage Weil wrote:
> On Fri, 6 Nov 2015, Robert LeBlanc wrote:
>
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> After trying to look through the recovery code, I'm getting the
>> feeling that recovery OPs are not scheduled in the OP queue that I've
>> been working on. Does that sound right? In the OSD logs I'm only
>> seeing priority 63, 127 and 192 (osd_op, osd_repop, osd_repop_reply).
>> If the recovery is in another separate queue, then there is no
>> reliable way to prioritize OPs between them.
>>
>> If I'm going off in to the weeds, please help me get back on the trail.
>
> Yeah, the recovery work isn't in the unified queue yet.
>
> sage
>
>>
>> Thanks,
>> -
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Fri, Nov 6, 2015 at 10:03 AM, Robert LeBlanc wrote:
>> > -BEGIN PGP SIGNED MESSAGE-
>> > Hash: SHA256
>> >
>> > On Fri, Nov 6, 2015 at 3:12 AM, Sage Weil wrote:
>> >> On Thu, 5 Nov 2015, Robert LeBlanc wrote:
>> >>> -BEGIN PGP SIGNED MESSAGE-
>> >>> Hash: SHA256
>> >>>
>> >>> Thanks Gregory,
>> >>>
>> >>> People are most likely busy and haven't had time to digest this and I
>> >>> may be expecting more excitement from it (I'm excited due to the
>> >>> results and probably also that such a large change still works). I'll
>> >>> keep working towards a PR, this was mostly proof of concept, now that
>> >>> there is some data I'll clean up the code.
>> >>
>> >> I'm *very* excited about this.
This is something that almost every >> >> operator has problems with so it's very encouraging to see that switching >> >> up the queue has a big impact in your environment. >> >> >> >> I'm just following up on this after a week of travel, so apologies if this >> >> is covered already, but did you compare this implementation to the >> >> original one with the same tunables? I see somewhere that you had >> >> max_backfills=20 at some point, which is going to be bad regardless of the >> >> queue. >> >> >> >> I also see that you chnaged the strict priority threshold from LOW to HIGH >> >> in OSD.cc; I'm curious how much of an impact was from this vs the queue >> >> implementation. >> > >> > Yes max_backfills=20 is problematic for both queues and from what I >> > can tell is because the OPs are waiting for PGs to get healthy. In a >> > busy cluster it can take a while due to the recovery ops having low >> > priority. In the current queue, it is possible to be blocked for a >> > long time. The new queue seems to prevent that, but they do still back >> > up. After this, I think I'd like to look into promoting recovery OPs >> > that are blocking client OPs to higher priorities so that client I/O >> > doesn't suffer as much during recovery. I think that will be a very >> > different problem to tackle because I don't think I can do the proper >> > introspection at the queue level. I'll have to do that logic in OSD.cc >> > or PG.cc. >> > >> > The strict priority threshold didn't make much of a difference with >> > the original queue. I initially eliminated it all together in the WRR, >> > but there were times that peering would never complete. I want to get >> > as many OPs in the WRR queue to provide fairness as much as possible. >> > I haven't tweaked the setting much in the WRR queue yet. >> > >> >> >> >>> I was thinking that a config option to choose the scheduler would be a >> >>> good idea. 
In terms of the project what is the better approach: create >> >>> a new template and each place the template class is instantiated >> >>> select the queue, or perform the queue selection in the same template >> >>> class, or something else I haven't thought of. >> >> >> >> A config option would be nice, but I'd start by just cleaning up the code >> >> and putting it in a new class (WeightedRoundRobinPriorityQueue or >> >> whatever). If we find that it's behaving better I'm not sure how much >> >> value we get from a tunable. Note that there is one other user >> >> (msgr/simple/DispatchQueue) that we might also was to switch over at some >> >> point.. especially if this implementation is faster. >> >> >> >> Once it's cleaned up (remove commented out code, new class) put it up as a >> >> PR and we can review and get it through testing. >> > >> > In talking with Samuel in IRC, we think creating an abstract class for >> > the queue is the best option. C++11 allows you to still optimize >> >
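For readers following the strict-priority-threshold discussion quoted above, here is a minimal C++ sketch of a weighted round robin queue with a strict cutoff. It is only an illustration under stated assumptions, not Robert's implementation: `WrrOpQueue` and `Bucket` are hypothetical names, ops are plain strings, each class's weight is simply its priority, and op cost is ignored.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <map>
#include <string>

class WrrOpQueue {
public:
  explicit WrrOpQueue(int strict_threshold) : threshold_(strict_threshold) {}

  void enqueue(int priority, const std::string& op) {
    assert(priority > 0);
    if (priority >= threshold_)
      strict_[priority].push_back(op);      // always served first
    else
      normal_[priority].ops.push_back(op);  // shares the queue by weight
  }

  bool empty() const { return strict_.empty() && normal_.empty(); }

  std::string dequeue() {
    assert(!empty());
    if (!strict_.empty()) {
      auto it = strict_.begin();  // highest strict priority first
      std::string op = it->second.front();
      it->second.pop_front();
      if (it->second.empty()) strict_.erase(it);
      return op;
    }
    for (;;) {
      // Serve the first class that still has credit in this round.
      for (auto it = normal_.begin(); it != normal_.end(); ++it) {
        if (it->second.credit > 0) {
          --it->second.credit;
          std::string op = it->second.ops.front();
          it->second.ops.pop_front();
          if (it->second.ops.empty()) normal_.erase(it);
          return op;
        }
      }
      // Round exhausted: every remaining class gets credit equal to its
      // priority, so low priorities still make progress each round.
      for (auto& kv : normal_) kv.second.credit = kv.first;
    }
  }

private:
  struct Bucket {
    std::deque<std::string> ops;
    int credit = 0;
  };
  int threshold_;
  std::map<int, std::deque<std::string>, std::greater<int>> strict_;
  std::map<int, Bucket, std::greater<int>> normal_;
};
```

Keeping a strict band above the threshold is one way to avoid the case mentioned above where peering never completes, while everything below the threshold is shared in proportion to priority instead of being starved.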
Re: ceph encoding optimization
On Mon, 9 Nov 2015, Gregory Farnum wrote:
> On Wed, Nov 4, 2015 at 7:07 AM, Gregory Farnum wrote:
> > The problem with this approach is that the encoded versions need to be
> > platform-independent -- they are shared over the wire and written to
> > disks that might get transplanted to different machines. Apart from
> > padding bytes, we also need to worry about endianness of the machine,
> > etc. *And* we often mutate structures across versions in order to add
> > new abilities, relying on the encode-decode process to deal with any
> > changes to the system. How could we deal with that if just dumping the
> > raw memory?
> >
> > Now, maybe we could make these changes on some carefully-selected
> > structs, I'm not sure. But we'd need a way to pick them out, guarantee
> > that we aren't breaking interoperability concerns, etc; and it would
> > need to be something we can maintain as a group going forward. I'm not
> > sure how to satisfy those constraints without burning a little extra
> > CPU. :/
> > -Greg
>
> So it turns out we've actually had issues with this. Sage merged
> (wrote?) some little-endian-only optimizations to the cephx code that
> broke big-endian systems by doing a direct memcpy. Apparently our
> tests don't find these issues, which makes me even more nervous about
> taking that sort of optimization into the tree. :(

I think the way to make this maintainable will be to

1) Find a clean approach with a simple #if or #ifdef condition for little endian and/or architectures that can handle unaligned int pointer access.

2) Maintain the parallel optimized implementation next to the generic encode/decode in a way that makes it as easy as possible to make changes and keep them in sync.

3) Optimize *only* the most recent encoding to minimize complexity.

4) Ensure that there is a set of encode/decode tests that verify they both work, triggered by make check (so that a simple make check on a big endian box will catch errors).
Ideally this'd be part of the test/encoding/readable.sh so that we run it over the entire corpus of old encodings..

sage
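As a rough illustration of points 1, 2 and 4 above, the following C++ sketch keeps a generic little-endian encoder next to a memcpy fast path selected by a single #if. None of this is Ceph's actual encoding code: the function names are hypothetical, and the `__BYTE_ORDER__` test assumes GCC/Clang predefined macros.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Generic path: always emits little-endian bytes, whatever the host is.
inline void encode_u32_generic(std::uint32_t v,
                               std::vector<unsigned char>& out) {
  for (int i = 0; i < 4; ++i)
    out.push_back(static_cast<unsigned char>((v >> (8 * i)) & 0xff));
}

// Optimized path, compiled only where host byte order matches the wire
// format; everywhere else it falls back to the generic encoder.
inline void encode_u32(std::uint32_t v, std::vector<unsigned char>& out) {
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  unsigned char buf[sizeof(v)];
  std::memcpy(buf, &v, sizeof(v));  // memcpy avoids unaligned pointer access
  out.insert(out.end(), buf, buf + sizeof(v));
#else
  encode_u32_generic(v, out);
#endif
}

// Decoder that is independent of host byte order and alignment.
inline std::uint32_t decode_u32(const unsigned char* p) {
  assert(p != nullptr);
  return static_cast<std::uint32_t>(p[0]) |
         (static_cast<std::uint32_t>(p[1]) << 8) |
         (static_cast<std::uint32_t>(p[2]) << 16) |
         (static_cast<std::uint32_t>(p[3]) << 24);
}
```

A test in the spirit of point 4 would encode the same value through both paths and compare the bytes, so that running it on a big endian box exercises the fallback and catches any divergence.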
ceph branch status
-- All Branches --

Adam C. Emerson
  2015-10-16 13:49:09 -0400  wip-cxx11time
  2015-10-17 13:20:15 -0400  wip-cxx11concurrency

Adam Crume
  2014-12-01 20:45:58 -0800  wip-doc-rbd-replay

Alfredo Deza
  2015-03-23 16:39:48 -0400  wip-11212

Alfredo Deza
  2014-07-08 13:58:35 -0400  wip-8679
  2014-09-04 13:58:14 -0400  wip-8366
  2014-10-13 11:10:10 -0400  wip-9730

Ali Maredia
  2015-10-12 14:28:30 -0400  wip-10587-split-servers
  2015-11-06 14:12:14 -0500  wip-cmake

Barbora Ančincová
  2015-11-04 16:43:45 +0100  wip-doc-RGW

Boris Ranto
  2015-09-04 15:19:11 +0200  wip-bash-completion

Casey Bodley
  2015-09-28 17:09:11 -0400  wip-cxx14-test
  2015-09-29 15:18:17 -0400  wip-fio-objectstore

Daniel Gryniewicz
  2015-10-28 08:53:55 -0400  wip-12997

Danny Al-Gaaf
  2015-04-23 16:32:00 +0200  wip-da-SCA-20150421
  2015-04-23 17:18:57 +0200  wip-nosetests
  2015-04-23 18:20:16 +0200  wip-unify-num_objects_degraded
  2015-11-03 14:10:47 +0100  wip-da-SCA-20151029
  2015-11-03 14:40:44 +0100  wip-da-SCA-20150910

David Zafman
  2014-08-29 10:41:23 -0700  wip-libcommon-rebase
  2015-04-24 13:14:23 -0700  wip-cot-giant
  2015-08-04 07:39:00 -0700  wip-12577-hammer
  2015-09-28 11:33:11 -0700  wip-12983
  2015-10-29 00:27:40 -0700  wip-zafman-testing

Dongmao Zhang
  2014-11-14 19:14:34 +0800  thesues-master

Greg Farnum
  2015-04-29 21:44:11 -0700  wip-init-names
  2015-07-16 09:28:24 -0700  hammer-12297
  2015-10-02 13:00:59 -0700  greg-infernalis-lock-testing
  2015-10-02 13:09:05 -0700  greg-infernalis-lock-testing-cacher
  2015-10-07 00:45:24 -0700  greg-infernalis-fs
  2015-10-21 17:43:07 -0700  client-pagecache-norevoke
  2015-10-27 11:32:46 -0700  hammer-pg-replay
  2015-10-29 15:24:35 -0700  greg-fs-testing

Greg Farnum
  2014-10-23 13:33:44 -0700  wip-forward-scrub

Guang G Yang
  2015-06-26 20:31:44 +  wip-ec-readall
  2015-07-23 16:13:19 +  wip-12316

Guang Yang
  2014-08-08 10:41:12 +  wip-guangyy-pg-splitting
  2014-09-25 00:47:46 +  wip-9008
  2014-09-30 10:36:39 +  guangyy-wip-9614

Haomai Wang
  2015-10-25 01:51:47 +0800  wip-13521

Haomai Wang
  2014-07-27 13:37:49 +0800  wip-flush-set
  2015-04-20 00:47:59 +0800  update-organization
  2015-07-21 19:33:56 +0800  fio-objectstore
  2015-08-26 09:57:27 +0800  wip-recovery-attr
  2015-10-24 23:39:07 +0800  fix-compile-warning

Ilya Dryomov
  2014-09-05 16:15:10 +0400  wip-rbd-notify-errors

Ivo Jimenez
  2015-08-24 23:12:45 -0700  hammer-with-new-workunit-for-wip-12551

James Page
  2015-11-04 11:08:42 +  javacruft-wip-ec-modules

Jason Dillaman
  2015-08-31 23:17:53 -0400  wip-12698
  2015-09-01 10:17:02 -0400  wip-11287
  2015-11-05 22:16:45 -0500  wip-librbd-qa-dillaman

Jenkins
  2015-11-04 14:31:13 -0800  rhcs-v0.94.3-ubuntu

Jenkins
  2014-07-29 05:24:39 -0700  wip-nhm-hang
  2014-10-14 12:10:38 -0700  wip-2
  2015-02-02 10:35:28 -0800  wip-sam-v0.92
  2015-08-21 12:46:32 -0700  last
  2015-08-21 12:46:32 -0700  loic-v9.0.3
  2015-09-15 10:23:18 -0700  rhcs-v0.80.8
  2015-09-21 16:48:32 -0700  rhcs-v0.94.1-ubuntu

Jenkins Build Slave User
  2015-11-03 16:58:32 +  infernalis

Joao Eduardo Luis
  2014-09-10 09:39:23 +0100  wip-leveldb-get.dumpling

Joao Eduardo Luis
  2014-07-22 15:41:42 +0100  wip-leveldb-misc

Joao Eduardo Luis
  2014-09-02 17:19:52 +0100  wip-leveldb-get
  2014-10-17 16:20:11 +0100  wip-paxos-fix
  2014-10-21 21:32:46 +0100  wip-9675.dumpling
  2015-07-27 21:56:42 +0100  wip-11470.hammer
  2015-09-09 15:45:45 +0100  wip-11786.hammer

Joao Eduardo Luis
  2014-11-17 16:43:53 +  wip-mon-osdmap-cleanup
  2014-12-15 16:18:56 +  wip-giant-mon-backports
  2014-12-17 17:13:57 +  wip-mon-backports.firefly
  2014-12-17 23:15:10 +  wip-mon-sync-fix.dumpling
  2015-01-07 23:01:00 +  wip-mon-blackhole-mlog-0.87.7
Re: Request for Comments: Weighted Round Robin OP Queue
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Thanks, I think some of the fog is clearing. I was wondering how operations between threads were keeping the order of operations in PGs, that explains it.

My original thoughts were to have a queue in front and behind the Prio/WRR queue. Threads scheduling work would queue to the pre-queue. The queue thread would pull ops off that queue and place them into the specialized queue, do housekeeping, etc. and would dequeue ops in that queue to a post-queue that worker threads would monitor. The thread queue could keep a certain amount of items in the post-queue to prevent starvation and worker threads from being blocked.

It would require the worker thread to be able to handle any kind of op, or having separate post-queues for the different kinds of work. I'm getting the feeling that this may be a far too simplistic approach to the problem (or at least in terms of the organization of Ceph at this point). I'm also starting to feel that I'm getting out of my league trying to understand all the intricacies of the OSD workflow (trying to start with one of the most complicated parts of the system doesn't help).

Maybe what I should do is just code up the queue to drop in as a replacement for the Prio queue for the moment. Then as your async work is completing we can shake out the potential issues with recovery and costs that we talked about earlier. One thing that I'd like to look into is elevating the priority of recovery ops that have client OPs blocked. I don't think the WRR queue gives the recovery thread a lot of time to get its work done.

Based on some testing on Friday, the number of recovery ops on an OSD did not really change if there were 20 backfilling or 1 backfilling. The difference came in with how many client I/Os were blocked waiting for objects to recover. When 20 backfills were going, there were a lot more blocked I/O waiting for objects to show up or recover.
With one backfill, there were far less blocked I/O, but there were still times I/O would block. -BEGIN PGP SIGNATURE- Version: Mailvelope v1.2.3 Comment: https://www.mailvelope.com wsFcBAEBCAAQBQJWQPHBCRDmVDuy+mK58QAA72EQAMgzgrw3OAvBi1/NmuWl LXGM0qGz3hE/p5oUsnqcnz2/+VYP3FZRanszyuU8+vKCwj+I/Ny9Olm1JAnw DSE7PvhuO6J5w0ymOIccKdX7uk2QZyP8ggO1D5fLC2M9/xqQQSZrAPE7vc4j O9HHuZsMF+ABUKU5RVCjn1ax+y2LhpetxH3nu37xpSKPDPFiowVnW8YlBGJy Cf1FYMVDLv60F5EmjstOn4FhSXC/+DuSATwP+CmNEPZ3JNTBgtPuU/22/De3 M4ZdDzeylVWYB66vbL9ijLeZDoCaxKgFL+QwUAswefaDBD1citCU2v7/7VQP aChnSzI8BYG0bHg5u7QEohzQyJUCC1OubiRkbUmOOeCiBI0Lqv3jf321T4ss PD3hqkagyhRe67zPB6bhhik0ZDOYHTAyV/ceAae4VDJTgu+/gI8Gc1c3mp5g nZL5z7hVohZ0AvfdEzasRhTnTcH6TfO9lpqU2nyMAc76SoPyDSTmAcMVt0tj /1BQAnk/I5rlCL5CKTxb2LR1/5WJt0eh7xtyKU1B0yh4G7JlMf/3kmrznOWu VEUUA3mJ1depDToadnECnCZMKHrGYC36XCy8xq3FDqhvl4BWV0VMA+yi1uhj zZ5udKKbN5Cxo/Sc48DG8wz9lQKn4LPCH2PD81oTcTfyd1iG2oNNkchrXa6K iwed =WjDS -END PGP SIGNATURE- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Mon, Nov 9, 2015 at 11:19 AM, Samuel Justwrote: > Ops are hashed from the messenger (or any of the other enqueue sources > for non-message items) into one of N queues, each of which is serviced > by M threads. We can't quite have a single thread own a single queue > yet because the current design allows multiple threads/queue > (important because if a sync read blocks on one thread, other threads > working on that queue can continue to make progress). However, the > queue contents are hashed to a queue based on the PG, so if a PG > queues work, it'll be on the same queue as it is already operating > from (which I think is what you are getting at?). I'm moving away > from that with the async read work I'm doing (ceph-devel subject > "Async reads, sync writes, op thread model discussion"), but I'll > still need a replacement for PrioritizedQueue. 
> -Sam > > On Mon, Nov 9, 2015 at 9:19 AM, Robert LeBlanc wrote: >> -BEGIN PGP SIGNED MESSAGE- >> Hash: SHA256 >> >> I should probably work against this branch. >> >> I've got some more reading of code to do, but I'm thinking that there >> isn't one of these queues for each OSD, it seems like there is one >> queue for each thread in the OSD. If this is true, I think it makes >> sense to break the queue into it's own thread and have each 'worker' >> thread push and pop OPs out of that thread. I have been focused on the >> Queue code that I haven't really looked at the OSD/PG code until last >> Friday and it is like trying to drink from a fire hose going through >> that code, so I may be misunderstanding something. >> >> I'd appreciate any pointers to quickly understanding the OSD/PG code >> specifically around the OPs and the queue. >> >> Thanks, >> -BEGIN PGP SIGNATURE- >> Version: Mailvelope v1.2.3 >> Comment: https://www.mailvelope.com >> >> wsFcBAEBCAAQBQJWQNWzCRDmVDuy+mK58QAAAGAQAJ44uFZNl84eGrHIMzDc >>
Re: Request for Comments: Weighted Round Robin OP Queue
On Mon, Nov 9, 2015 at 12:31 PM, Robert LeBlancwrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA256 > > On Mon, Nov 9, 2015 at 12:47 PM, Samuel Just wrote: >> What I really want from PrioritizedQueue (and from the dmclock/mclock >> approaches that are also being worked on) is a solution to the problem >> of efficiently deciding which op to do next taking into account >> fairness across io classes and ops with different costs. > >> On Mon, Nov 9, 2015 at 11:19 AM, Robert LeBlanc wrote: >>> -BEGIN PGP SIGNED MESSAGE- >>> Hash: SHA256 >>> >>> Thanks, I think some of the fog is clearing. I was wondering how >>> operations between threads were keeping the order of operations in >>> PGs, that explains it. >>> >>> My original thoughts were to have a queue in front and behind the >>> Prio/WRR queue. Threads scheduling work would queue to the pre-queue. >>> The queue thread would pull ops off that queue and place them into the >>> specialized queue, do house keeping, etc and would dequeue ops in that >>> queue to a post-queue that worker threads would monitor. The thread >>> queue could keep a certain amount of items in the post-queue to >>> prevent starvation and worker threads from being blocked. >> >> I'm not sure what the advantage of this would be -- it adds another thread >> to the processing pipeline at best. > > There are a few reasons I thought about it. 1. It is hard to > prioritize/mange the work load if you can't see/manage all the > operations. One queue allows the algorithm to make decisions based on > all available information. (This point seems to be handled in a > different way in the future) 2. Reduce latency in the Op path. When an > OP is queued, there is overhead in getting it in the right place. When > an OP is dequeued there is more overhead in spreading tokens, etc. > Right now that is all serial, if an OP is stuck in the queue waiting > to be dispatched some of this overhead can't be performed while in > this waiting period. 
> The idea is pushing that overhead to a separate
> thread and allowing a worker thread to queue/dequeue in the most
> efficient manner. It also allows for more complex trending,
> scheduling, etc because it can sit outside of the OP path. As the
> workload changes, it can dynamically change how it manages the queue
> like simple fifo for low periods where latency is dominated by compute
> time, to Token/WRR when latency is dominated by disk access, etc.
>

We basically don't want a single thread to see all of the operations -- it would cause a tremendous bottleneck and complicate the design immensely. It shouldn't be necessary anyway since PGs are a form of coarse-grained locking, so it's probably fine to schedule work for different groups of PGs independently if we assume that all kinds of work are well distributed over those groups.

>>> It would require the worker thread to be able to handle any kind of
>>> op, or having separate post-queues for the different kinds of work.
>>> I'm getting the feeling that this may be a far too simplistic approach
>>> to the problem (or at least in terms of the organization of Ceph at
>>> this point). I'm also starting to feel that I'm getting out of my
>>> league trying to understand all the intricacies of the OSD work flow
>>> (trying to start with one of the most complicated parts of the system
>>> doesn't help).
>>>
>>> Maybe what I should do is just code up the queue to drop in as a
>>> replacement for the Prio queue for the moment. Then as your async work
>>> is completing we can shake out the potential issues with recovery and
>>> costs that we talked about earlier. One thing that I'd like to look
>>> into is elevating the priority of recovery ops that have client OPs
>>> blocked. I don't think the WRR queue gives the recovery thread a lot
>>> of time to get its work done.
>>> >> >> If an op comes in that requires recovery to happen before it can be >> processed, we send the recovery messages with client priority rather >> than recovery priority. > > But the recovery is still happening the recovery thread and not the > client thread, right? The recovery thread has a lower priority than > the op thread? That's how I understand it. > No, in hammer we removed the snap trim and scrub workqueues. With wip-recovery-wq, I remove the recovery wqs as well. Ideally, the only meaningful set of threads remaining will be the op_tp and associated queues. >>> Based on some testing on Friday, the number of recovery ops on an osd >>> did not really change if there were 20 backfilling or 1 backfilling. >>> The difference came in with how many client I/Os were blocked waiting >>> for objects to recover. When 20 backfills were going, there were a lot >>> more blocked I/O waiting for objects to show up or recover. With one >>> backfill, there were far less blocked I/O, but there were still times >>> I/O would block. >> >> The number of recovery ops is actually a separate configurable >>
Re: [PATCH 1/9] drivers/staging/media/davinci_vpfe/vpfe_mc_capture.c: use correct structure type name in sizeof
Hi Julia, Thank you for the patch. On Tuesday 29 July 2014 17:16:43 Julia Lawall wrote: > From: Julia Lawall> > Correct typo in the name of the type given to sizeof. Because it is the > size of a pointer that is wanted, the typo has no impact on compilation or > execution. > > This problem was found using Coccinelle (http://coccinelle.lip6.fr/). The > semantic patch used can be found in message 0 of this patch series. > > Signed-off-by: Julia Lawall > > --- > drivers/staging/media/davinci_vpfe/vpfe_mc_capture.c |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/staging/media/davinci_vpfe/vpfe_mc_capture.c > b/drivers/staging/media/davinci_vpfe/vpfe_mc_capture.c index > cda8388..255590f 100644 > --- a/drivers/staging/media/davinci_vpfe/vpfe_mc_capture.c > +++ b/drivers/staging/media/davinci_vpfe/vpfe_mc_capture.c > @@ -227,7 +227,7 @@ static int vpfe_enable_clock(struct vpfe_device > *vpfe_dev) return 0; > > vpfe_dev->clks = kzalloc(vpfe_cfg->num_clocks * > -sizeof(struct clock *), GFP_KERNEL); > +sizeof(struct clk *), GFP_KERNEL); I'd use sizeof(*vpfe_dev->clks) to avoid such issues. Apart from that, Acked-by: Laurent Pinchart I've applied the patch to my tree with the above change, there's no need to resubmit if you agree with the proposal. > if (vpfe_dev->clks == NULL) { > v4l2_err(vpfe_dev->pdev->driver, "Memory allocation failed\n"); > return -ENOMEM; -- Regards, Laurent Pinchart -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
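Laurent's suggestion above, taking sizeof of the dereferenced destination pointer instead of naming the element type, can be sketched in userspace C. This is only an illustration: struct demo_dev, demo_alloc_clocks() and demo_free_clocks() are hypothetical names, and calloc() stands in for kzalloc().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the driver types; only the idiom matters. */
struct clk {
    int rate_hz;
};

struct demo_dev {
    struct clk **clks;      /* array of pointers, one per clock */
    size_t num_clocks;
};

/*
 * Allocate the pointer array. sizeof(*dev->clks) is the size of one
 * element (here a `struct clk *`), so the expression cannot drift out of
 * sync with the declared type of `clks`, and a mistyped type name like
 * the one fixed by the patch cannot occur.
 * calloc() is a userspace stand-in for kzalloc(): it zeroes the memory.
 */
int demo_alloc_clocks(struct demo_dev *dev, size_t num_clocks)
{
    assert(dev != NULL);
    dev->clks = calloc(num_clocks, sizeof(*dev->clks));
    if (dev->clks == NULL)
        return -1;
    dev->num_clocks = num_clocks;
    return 0;
}

/* Release the array and reset the bookkeeping. */
void demo_free_clocks(struct demo_dev *dev)
{
    assert(dev != NULL);
    free(dev->clks);
    dev->clks = NULL;
    dev->num_clocks = 0;
}
```

A caller would zero-initialise a struct demo_dev, call demo_alloc_clocks(&dev, n), and pair it with demo_free_clocks(&dev).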
Re: Request for Comments: Weighted Round Robin OP Queue
What I really want from PrioritizedQueue (and from the dmclock/mclock approaches that are also being worked on) is a solution to the problem of efficiently deciding which op to do next taking into account fairness across io classes and ops with different costs. On Mon, Nov 9, 2015 at 11:19 AM, Robert LeBlancwrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA256 > > Thanks, I think some of the fog is clearing. I was wondering how > operations between threads were keeping the order of operations in > PGs, that explains it. > > My original thoughts were to have a queue in front and behind the > Prio/WRR queue. Threads scheduling work would queue to the pre-queue. > The queue thread would pull ops off that queue and place them into the > specialized queue, do house keeping, etc and would dequeue ops in that > queue to a post-queue that worker threads would monitor. The thread > queue could keep a certain amount of items in the post-queue to > prevent starvation and worker threads from being blocked. I'm not sure what the advantage of this would be -- it adds another thread to the processing pipeline at best. > > It would require the worker thread to be able to handle any kind of > op, or having separate post-queues for the different kinds of work. > I'm getting the feeling that this may be a far too simplistic approach > to the problem (or at least in terms of the organization of Ceph at > this point). I'm also starting to feel that I'm getting out of my > league trying to understand all the intricacies of the OSD work flow > (trying to start with one of the most complicated parts of the system > doesn't help). > > Maybe what I should do is just code up the queue to drop in as a > replacement for the Prio queue for the moment. Then as your async work > is completing we can shake out the potential issues with recovery and > costs that we talked about earlier. One thing that I'd like to look > into is elevating the priority of recovery ops that have client OPs > blocked. 
I don't think the WRR queue gives the recovery thread a lot > of time to get its work done. > If an op comes in that requires recovery to happen before it can be processed, we send the recovery messages with client priority rather than recovery priority. > Based on some testing on Friday, the number of recovery ops on an osd > did not really change if there were 20 backfilling or 1 backfilling. > The difference came in with how many client I/Os were blocked waiting > for objects to recover. When 20 backfills were going, there were a lot > more blocked I/O waiting for objects to show up or recover. With one > backfill, there were far less blocked I/O, but there were still times > I/O would block. The number of recovery ops is actually a separate configurable (osd_recovery_max_active -- default to 15). It's odd that with more backfilling on a single osd, there is more blocked IO. Looking into that would be helpful and would probably give you some insight into recovery and the op processing pipeline. 
-Sam > -BEGIN PGP SIGNATURE- > Version: Mailvelope v1.2.3 > Comment: https://www.mailvelope.com > > wsFcBAEBCAAQBQJWQPHBCRDmVDuy+mK58QAA72EQAMgzgrw3OAvBi1/NmuWl > LXGM0qGz3hE/p5oUsnqcnz2/+VYP3FZRanszyuU8+vKCwj+I/Ny9Olm1JAnw > DSE7PvhuO6J5w0ymOIccKdX7uk2QZyP8ggO1D5fLC2M9/xqQQSZrAPE7vc4j > O9HHuZsMF+ABUKU5RVCjn1ax+y2LhpetxH3nu37xpSKPDPFiowVnW8YlBGJy > Cf1FYMVDLv60F5EmjstOn4FhSXC/+DuSATwP+CmNEPZ3JNTBgtPuU/22/De3 > M4ZdDzeylVWYB66vbL9ijLeZDoCaxKgFL+QwUAswefaDBD1citCU2v7/7VQP > aChnSzI8BYG0bHg5u7QEohzQyJUCC1OubiRkbUmOOeCiBI0Lqv3jf321T4ss > PD3hqkagyhRe67zPB6bhhik0ZDOYHTAyV/ceAae4VDJTgu+/gI8Gc1c3mp5g > nZL5z7hVohZ0AvfdEzasRhTnTcH6TfO9lpqU2nyMAc76SoPyDSTmAcMVt0tj > /1BQAnk/I5rlCL5CKTxb2LR1/5WJt0eh7xtyKU1B0yh4G7JlMf/3kmrznOWu > VEUUA3mJ1depDToadnECnCZMKHrGYC36XCy8xq3FDqhvl4BWV0VMA+yi1uhj > zZ5udKKbN5Cxo/Sc48DG8wz9lQKn4LPCH2PD81oTcTfyd1iG2oNNkchrXa6K > iwed > =WjDS > -END PGP SIGNATURE- > > Robert LeBlanc > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 > > > On Mon, Nov 9, 2015 at 11:19 AM, Samuel Just wrote: >> Ops are hashed from the messenger (or any of the other enqueue sources >> for non-message items) into one of N queues, each of which is serviced >> by M threads. We can't quite have a single thread own a single queue >> yet because the current design allows multiple threads/queue >> (important because if a sync read blocks on one thread, other threads >> working on that queue can continue to make progress). However, the >> queue contents are hashed to a queue based on the PG, so if a PG >> queues work, it'll be on the same queue as it is already operating >> from (which I think is what you are getting at?). I'm moving away >> from that with the async read work I'm doing (ceph-devel subject >> "Async reads, sync writes, op thread model discussion"), but I'll >> still need a replacement for PrioritizedQueue. >> -Sam >> >> On Mon, Nov 9, 2015
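The sharding Sam describes in the quoted message (ops hashed by PG into one of N queues, so that all work for a PG stays on one queue) can be sketched roughly as below. This is a simplified illustration, not the OSD code: struct pg_id, mix32() and shard_for_pg() are hypothetical names, and the hash is a generic integer mixer chosen for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified PG identifier: pool id plus placement seed. */
struct pg_id {
    uint64_t pool;
    uint32_t seed;
};

/* Small integer mixer (the 32-bit finalizer used by MurmurHash3). */
uint32_t mix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

/*
 * Map a PG to one of num_shards queues. The result depends only on the
 * PG, so every op for a given PG lands on the same queue and keeps its
 * relative order there, while ops for other PGs may land elsewhere and
 * be serviced in parallel.
 */
unsigned shard_for_pg(struct pg_id pg, unsigned num_shards)
{
    uint32_t pool_bits;

    assert(num_shards > 0);
    pool_bits = (uint32_t)pg.pool ^ (uint32_t)(pg.pool >> 32);
    return mix32(pool_bits ^ mix32(pg.seed)) % num_shards;
}
```

Each of the N queues could then be drained by its own small pool of M threads, which is the property that lets a blocked sync read on one thread coexist with progress on the same queue.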
Re: Request for Comments: Weighted Round Robin OP Queue
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 On Mon, Nov 9, 2015 at 12:47 PM, Samuel Just wrote: > What I really want from PrioritizedQueue (and from the dmclock/mclock > approaches that are also being worked on) is a solution to the problem > of efficiently deciding which op to do next taking into account > fairness across io classes and ops with different costs. > On Mon, Nov 9, 2015 at 11:19 AM, Robert LeBlanc wrote: >> -BEGIN PGP SIGNED MESSAGE- >> Hash: SHA256 >> >> Thanks, I think some of the fog is clearing. I was wondering how >> operations between threads were keeping the order of operations in >> PGs, that explains it. >> >> My original thoughts were to have a queue in front and behind the >> Prio/WRR queue. Threads scheduling work would queue to the pre-queue. >> The queue thread would pull ops off that queue and place them into the >> specialized queue, do house keeping, etc and would dequeue ops in that >> queue to a post-queue that worker threads would monitor. The thread >> queue could keep a certain amount of items in the post-queue to >> prevent starvation and worker threads from being blocked. > > I'm not sure what the advantage of this would be -- it adds another thread > to the processing pipeline at best. There are a few reasons I thought about it. 1. It is hard to prioritize/mange the work load if you can't see/manage all the operations. One queue allows the algorithm to make decisions based on all available information. (This point seems to be handled in a different way in the future) 2. Reduce latency in the Op path. When an OP is queued, there is overhead in getting it in the right place. When an OP is dequeued there is more overhead in spreading tokens, etc. Right now that is all serial, if an OP is stuck in the queue waiting to be dispatched some of this overhead can't be performed while in this waiting period. The idea is pushing that overhead to a separate thread and allowing a worker thread to queue/dequeue in the most efficient manner. 
It also allows for more complex trending, scheduling, etc because it can sit outside of the OP path. As the workload changes, it can dynamically change how it manages the queue like simple fifo for low periods where latency is dominated by compute time, to Token/WRR when latency is dominated by disk access, etc. >> It would require the worker thread to be able to handle any kind of >> op, or having separate post-queues for the different kinds of work. >> I'm getting the feeling that this may be a far too simplistic approach >> to the problem (or at least in terms of the organization of Ceph at >> this point). I'm also starting to feel that I'm getting out of my >> league trying to understand all the intricacies of the OSD work flow >> (trying to start with one of the most complicated parts of the system >> doesn't help). >> >> Maybe what I should do is just code up the queue to drop in as a >> replacement for the Prio queue for the moment. Then as your async work >> is completing we can shake out the potential issues with recovery and >> costs that we talked about earlier. One thing that I'd like to look >> into is elevating the priority of recovery ops that have client OPs >> blocked. I don't think the WRR queue gives the recovery thread a lot >> of time to get its work done. >> > > If an op comes in that requires recovery to happen before it can be > processed, we send the recovery messages with client priority rather > than recovery priority. But the recovery is still happening the recovery thread and not the client thread, right? The recovery thread has a lower priority than the op thread? That's how I understand it. >> Based on some testing on Friday, the number of recovery ops on an osd >> did not really change if there were 20 backfilling or 1 backfilling. >> The difference came in with how many client I/Os were blocked waiting >> for objects to recover. 
When 20 backfills were going, there were a lot >> more blocked I/O waiting for objects to show up or recover. With one >> backfill, there were far less blocked I/O, but there were still times >> I/O would block. > > The number of recovery ops is actually a separate configurable > (osd_recovery_max_active -- default to 15). It's odd that with more > backfilling on a single osd, there is more blocked IO. Looking into > that would be helpful and would probably give you some insight > into recovery and the op processing pipeline. I'll see what I can find here. - Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 -BEGIN PGP SIGNATURE- Version: Mailvelope v1.2.3 Comment: https://www.mailvelope.com wsFcBAEBCAAQBQJWQQJ0CRDmVDuy+mK58QAAeeUP/1uN/9EdqQDJdxW7fgeJ /E0X49LmnnCigMPL5QJ3fpGjf44C0xcc9LN5IGJwwumHd5ozznpocy8Oj30N +rNPJQ4dxcRao+bXUL/+DCQuY0wN/i7CqfMTW5PFmkdH4K9Lgce+bN6Q5Ora q8JZvAxaZLCLZ10N+uiD5ghs+3X68hu4Da8SYQj0vjLs5gV4oATebF3JuYXW GZ9qNfm2ygbeuT5Q0fhOKrvwJ9taKagMNrZLU10Wz5lHpGNitP3f17sVQznF
Re: make check bot resumed
Hi, For some reason jenkins thought it was necessary to reconsider all commits merged weeks ago. It was silenced to not send test results about pull request already merged. It should now resume work on the current pull requests. If a pull request needs to be visited by the make check bot, it is enough to rebase and repush it. Cheers On 09/11/2015 15:33, Loic Dachary wrote: > Hi, > > The machine sending notifications for the make check bot failed during the > week-end. It was rebooted and it should resume its work. > > The virtual machine was actually re-built because the underlying OpenStack > cloud was unable to find the volume used for root after a hard reboot. There > were also issues with the devicemapper docker backend that was corrupted. > Wiping them out was enough to resolve the problem: they did not have any > persistent data anyway. > > Cheers > -- Loïc Dachary, Artisan Logiciel Libre signature.asc Description: OpenPGP digital signature
RGW multi-tenancy APIs overview
With ticket 5073 getting close to complete, we're getting the APIs mostly nailed down. Most of them come down to selecting a syntax separator character. Unfortunately, there are several such characters. Plus, it is not always feasible to get by with a character (in S3 at least).

So far we have the following changes:

#1 Back-end and radosgw-admin use '/' or "tenant/bucket". This is what is literally stored in RADOS, because it's used to name bucket objects in the .rgw pool.

#2 Buckets in Swift URLs use '\' (backslash), because there does not seem to be a way to use '/'. Example:

 http://host.corp.com:8080/swift/v1/testen\testcont

At first, I tried URL encoding (%2f), but that didn't work: we permit '%' in Swift container names, so there's a show-stopper compatibility problem. So, backslash. The backslash poses a similar problem, too, but hopefully nobody created a container with a backslash in its name.

Note that strictly speaking, we don't really need this, since Swift URLs could easily include tenant names where reference Swift places account names. It's just easier to implement without disturbing authentication code.

#3 S3 host addressing of buckets

This is similar to Swift and is slated to use backslash. Note that S3 prohibits it, so we're reasonably safe with this choice.

#4 S3 URL addressing of buckets

Here we must use a period. Example:

 bucket.tenant.host.corp.com

#5 Listings and redirects.

Listings present a difficulty in S3: we don't know if the name will be used in host-based or URL-based addressing of a bucket. So, we put the tenant of a bucket into a separate XML attribute. Since Swift listings are always in a specific account, and thus tenant, they are unchanged.

In addition to listings, bucket names leak into certain HTTP headers, where we add "Tenant:" headers as appropriate.

Finally, multi-tenancy also puts user_uid namespaces under tenants as well as bucket namespaces. That one is easy though. A '$' separator is used consistently for it (tenant$user).
-- Pete
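The separator conventions listed above ('/' for tenant/bucket in the back end, '\' in Swift URLs, '$' for tenant$user) all reduce to splitting a name at one character. A minimal sketch of such a split follows; split_tenant() is a hypothetical helper written for this note, not an RGW function, and treating a name without a separator as having an empty tenant is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Split `name` at the first occurrence of `sep` into tenant and rest.
 * If `sep` is absent, the tenant is left empty (assumed here to mean the
 * legacy, tenant-less namespace) and the whole name goes to `rest`.
 * Returns 0 on success, -1 if either output buffer is too small.
 */
int split_tenant(const char *name, char sep,
                 char *tenant, size_t tenant_sz,
                 char *rest, size_t rest_sz)
{
    const char *p;
    const char *r;
    size_t tlen;

    assert(name != NULL && tenant != NULL && rest != NULL);

    p = strchr(name, sep);
    tlen = p ? (size_t)(p - name) : 0;
    r = p ? p + 1 : name;

    /* Both outputs need room for their terminating NUL. */
    if (tlen >= tenant_sz || strlen(r) >= rest_sz)
        return -1;

    memcpy(tenant, name, tlen);
    tenant[tlen] = '\0';
    strcpy(rest, r);
    return 0;
}
```

The same helper would be called with '/' for "tenant/bucket", with '\\' for the container part of a Swift URL such as testen\testcont, and with '$' for "tenant$user".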
There is no next; only jewel
Hey everyone,

Just a reminder that now that infernalis is out and we're back to focusing on jewel, we should send all bug fixes to the 'jewel' branch (which functions the same way the old 'next' branch did). That is,

 bug fixes -> jewel
 new features -> master

Every dev release (hopefully we'll get back on a 2 week schedule) we'll slurp master into jewel for the next sprint. And during each sprint we'll test/stabilize the jewel branch. Expect feature freeze to be February-ish.

Thanks!
sage
Re: [PATCH] ceph:Fix error handling in the function down_reply
On Mon, Nov 9, 2015 at 11:15 AM, Yan, Zheng wrote:
>
>> On Nov 9, 2015, at 11:11, Nicholas Krause wrote:
>>
>> This fixes error handling in the function down_reply in order to
>> properly check and jump to the goto label, out_err for this
>> particular function if a error code is returned by any function
>> called in down_reply and therefore make checking be included
>> for the call to ceph_update_snap_trace in order to comply with
>> these error handling checks/paths.
>>
>> Signed-off-by: Nicholas Krause
>> ---
>> fs/ceph/mds_client.c | 11 +++++++----
>> 1 file changed, 7 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>> index 51cb02d..0b01f94 100644
>> --- a/fs/ceph/mds_client.c
>> +++ b/fs/ceph/mds_client.c
>> @@ -2495,14 +2495,17 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
>>          realm = NULL;
>>          if (rinfo->snapblob_len) {
>>                  down_write(&mdsc->snap_rwsem);
>> -                ceph_update_snap_trace(mdsc, rinfo->snapblob,
>> -                                rinfo->snapblob + rinfo->snapblob_len,
>> -                                le32_to_cpu(head->op) == CEPH_MDS_OP_RMSNAP,
>> -                                &realm);
>> +                err = ceph_update_snap_trace(mdsc, rinfo->snapblob,
>> +                                rinfo->snapblob + rinfo->snapblob_len,
>> +                                le32_to_cpu(head->op) == CEPH_MDS_OP_RMSNAP,
>> +                                &realm);
>>                  downgrade_write(&mdsc->snap_rwsem);
>>          } else {
>>                  down_read(&mdsc->snap_rwsem);
>>          }
>> +
>> +        if (err)
>> +                goto out_err;
>>
>>          /* insert trace into our cache */
>>          mutex_lock(&req->r_fill_mutex);
>
> Applied, thanks

This looks to me like it'd leave snap_rwsem locked for read? Also, the name of the function in question is handle_reply(), not down_reply().

I'll revert testing.

Thanks,

                Ilya
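Ilya's concern is that both branches of the quoted hunk end with snap_rwsem held for read, so an early goto taken right after them must also release it. The toy model below illustrates one possible shape of an error path that does so. It is only a sketch: struct toy_rwsem and every toy_* function are hypothetical stand-ins that merely count lock holders, and this is not the fix that went into the kernel.

```c
#include <assert.h>

/* Toy stand-in for an rwsem: it only counts holders so a leak is visible. */
struct toy_rwsem {
    int readers;
    int writer;
};

void toy_down_write(struct toy_rwsem *s)
{
    assert(!s->writer && s->readers == 0);
    s->writer = 1;
}

/* Convert the write hold into a read hold, as downgrade_write() does. */
void toy_downgrade_write(struct toy_rwsem *s)
{
    assert(s->writer);
    s->writer = 0;
    s->readers++;
}

void toy_down_read(struct toy_rwsem *s)
{
    assert(!s->writer);
    s->readers++;
}

void toy_up_read(struct toy_rwsem *s)
{
    assert(s->readers > 0);
    s->readers--;
}

/* Hypothetical helper that may fail; `fail` injects the error. */
int toy_update_snap_trace(int fail)
{
    return fail ? -5 : 0;
}

/*
 * Both branches leave the semaphore held for read, so every exit taken
 * after this point, including the error exit, has to drop it.
 */
int toy_handle_reply(struct toy_rwsem *sem, int have_snapblob, int fail)
{
    int err = 0;

    if (have_snapblob) {
        toy_down_write(sem);
        err = toy_update_snap_trace(fail);
        toy_downgrade_write(sem);
    } else {
        toy_down_read(sem);
    }

    if (err)
        goto out_unlock;

    /* ... work done under the read lock would go here ... */

out_unlock:
    toy_up_read(sem);
    return err;
}
```

After any call to toy_handle_reply(), whether it succeeds or fails, the reader and writer counts should be back at zero; a leaked read lock would show up as readers == 1.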
Re: [Qemu-devel] qemu : rbd block driver internal snapshot and vm_stop is hanging forever
Adding to ceph.conf

[client]
rbd_non_blocking_aio = false

fixes the problem for me (with rbd_cache=false).

(@cc jdur...@redhat.com)

----- Original Message -----
From: "Denis V. Lunev"
To: "aderumier" , "ceph-devel" , "qemu-devel"
Sent: Monday 9 November 2015 08:22:34
Subject: Re: [Qemu-devel] qemu : rbd block driver internal snapshot and vm_stop is hanging forever

On 11/09/2015 10:19 AM, Denis V. Lunev wrote:
> On 11/09/2015 06:10 AM, Alexandre DERUMIER wrote:
>> Hi,
>>
>> with qemu (2.4.1), if I do an internal snapshot of an rbd device,
>> then I pause the vm with vm_stop,
>>
>> the qemu process is hanging forever
>>
>> monitor commands to reproduce:
>>
>> # snapshot_blkdev_internal drive-virtio0 yoursnapname
>> # stop
>>
>> I don't see this with qcow2 or sheepdog block driver for example.
>>
>> Regards,
>>
>> Alexandre
>>
> this could look like the problem I have recently trying to
> fix with dataplane enabled. Patch series is named as
>
> [PATCH for 2.5 v6 0/10] dataplane snapshot fixes
>
> Den

anyway, even if above will not help, can you collect gdb traces from all threads in QEMU process. May be I'll be able to give a hint.

Den
Re: Request for Comments: Weighted Round Robin OP Queue
On Mon, Nov 9, 2015 at 3:49 PM, Samuel Justwrote: > On Mon, Nov 9, 2015 at 12:31 PM, Robert LeBlanc wrote: >> -BEGIN PGP SIGNED MESSAGE- >> Hash: SHA256 >> >> On Mon, Nov 9, 2015 at 12:47 PM, Samuel Just wrote: >>> What I really want from PrioritizedQueue (and from the dmclock/mclock >>> approaches that are also being worked on) is a solution to the problem >>> of efficiently deciding which op to do next taking into account >>> fairness across io classes and ops with different costs. >> >>> On Mon, Nov 9, 2015 at 11:19 AM, Robert LeBlanc wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA256 Thanks, I think some of the fog is clearing. I was wondering how operations between threads were keeping the order of operations in PGs, that explains it. My original thoughts were to have a queue in front and behind the Prio/WRR queue. Threads scheduling work would queue to the pre-queue. The queue thread would pull ops off that queue and place them into the specialized queue, do house keeping, etc and would dequeue ops in that queue to a post-queue that worker threads would monitor. The thread queue could keep a certain amount of items in the post-queue to prevent starvation and worker threads from being blocked. >>> >>> I'm not sure what the advantage of this would be -- it adds another thread >>> to the processing pipeline at best. >> >> There are a few reasons I thought about it. 1. It is hard to >> prioritize/mange the work load if you can't see/manage all the >> operations. One queue allows the algorithm to make decisions based on >> all available information. (This point seems to be handled in a >> different way in the future) 2. Reduce latency in the Op path. When an >> OP is queued, there is overhead in getting it in the right place. When >> an OP is dequeued there is more overhead in spreading tokens, etc. >> Right now that is all serial, if an OP is stuck in the queue waiting >> to be dispatched some of this overhead can't be performed while in >> this waiting period. 
The idea is pushing that overhead to a separate >> thread and allowing a worker thread to queue/dequeue in the most >> efficient manner. It also allows for more complex trending, >> scheduling, etc because it can sit outside of the OP path. As the >> workload changes, it can dynamically change how it manages the queue >> like simple fifo for low periods where latency is dominated by compute >> time, to Token/WRR when latency is dominated by disk access, etc. >> > > We basically don't want a single thread to see all of the operations -- it > would cause a tremendous bottleneck and complicate the design > immensely. It's shouldn't be necessary anyway since PGs are a form > of course grained locking, so it's probably fine to schedule work for > different groups of PGs independently if we assume that all kinds of > work are well distributed over those groups. There are are some queue implementations that rely on a single thread essentially playing traffic cop in between queues and it's pretty fast. FastFlow, the C++ lib, does that. It constructs other kinds of queues from fast lock-free / wait-free SPSC queues. In the case of something like MPMC there's a mediator thread there that manages N SPSC in-queus to MSPC out-queues. I'm only bringing this up since if you have a problem that might need a mediator to arrange order, it's possible to do it fast. > It would require the worker thread to be able to handle any kind of op, or having separate post-queues for the different kinds of work. I'm getting the feeling that this may be a far too simplistic approach to the problem (or at least in terms of the organization of Ceph at this point). I'm also starting to feel that I'm getting out of my league trying to understand all the intricacies of the OSD work flow (trying to start with one of the most complicated parts of the system doesn't help). Maybe what I should do is just code up the queue to drop in as a replacement for the Prio queue for the moment. 
Then as your async work is completing we can shake out the potential issues with recovery and costs that we talked about earlier. One thing that I'd like to look into is elevating the priority of recovery ops that have client OPs blocked. I don't think the WRR queue gives the recovery thread a lot of time to get its work done. >>> >>> If an op comes in that requires recovery to happen before it can be >>> processed, we send the recovery messages with client priority rather >>> than recovery priority. >> >> But the recovery is still happening the recovery thread and not the >> client thread, right? The recovery thread has a lower priority than >> the op thread? That's how I understand it. >> > > No, in hammer we removed the snap trim and scrub workqueues. With > wip-recovery-wq, I remove the
Re: Request for Comments: Weighted Round Robin OP Queue
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

It sounds like dmclock/mclock will alleviate a lot of the concerns I have as long as it can be smart like you said. It sounds like the queue thread was already tried so there is experience behind the current implementation vs. me thinking it might be better. The basic idea I had is below:

  Client,                        Client,
  Repop,                         Repop,
  backfill,         ...          Backfill,
  recovery,                      Recovery,
  etc thread                     etc thread
       |                              |
        \                            /
         \                          /
   lock; push (prio,cost,strict, front/back,subsystem,); unlock
                     |
                     | (queue thread)
                    pop
                   /   \
                  /     \
       if ops.low        Place op in prio
       (fast path)       queue, do any
            |            housekeeping
            |                 |
            |            when post-queue.len
            |            < threads
             \               /
              \             /
             post-queue push
                     |
              lock, cond, pop
                  /     \
                 /       \
           Worker   ...   Worker
           thread         thread

What I meant by more scalable is that the rate of boulders would be constant and evenly dispersed. It also prevents any one worker thread from being backed up while others are idle. This may not be an issue if the PG is busy. This design could also suffer if many OPs require some locking at the PG level instead of the object level. The queue itself does not do any op work, only passing pointers to work to be done.

As I mentioned before, it sounds like something like this already proved to be limiting in performance, although thinking through this has given me some ideas about implementing a fast path option in the WRR queue to save some cycles.
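For readers following the thread without the patch at hand, a textbook weighted round robin dequeue can be sketched as below. This is a generic illustration of the scheduling idea only; it is neither PrioritizedQueue nor the WRR queue Robert is proposing, and struct wrr, the wrr_* functions and the fixed class count are all hypothetical. Each class may run up to `weight` ops per round before the cursor moves on, so a heavily weighted class cannot starve the others.

```c
#include <assert.h>
#include <stddef.h>

#define WRR_CLASSES 3   /* hypothetical number of op classes */

struct wrr {
    unsigned weight[WRR_CLASSES];   /* ops a class may run per round, > 0 */
    unsigned credit[WRR_CLASSES];   /* ops left in the current round */
    unsigned pending[WRR_CLASSES];  /* queued ops per class (counts only) */
    size_t cursor;                  /* class currently being served */
};

void wrr_init(struct wrr *q, const unsigned weight[WRR_CLASSES])
{
    for (size_t i = 0; i < WRR_CLASSES; i++) {
        assert(weight[i] > 0);
        q->weight[i] = weight[i];
        q->credit[i] = weight[i];
        q->pending[i] = 0;
    }
    q->cursor = 0;
}

void wrr_enqueue(struct wrr *q, size_t cls)
{
    assert(cls < WRR_CLASSES);
    q->pending[cls]++;
}

/* Returns the class of the next op to run, or -1 if nothing is queued. */
int wrr_dequeue(struct wrr *q)
{
    /*
     * At most WRR_CLASSES + 1 steps are needed to reach a class with
     * pending work (one step may only refill spent credit), so two
     * sweeps are a safe bound.
     */
    for (size_t step = 0; step < 2 * WRR_CLASSES; step++) {
        size_t c = q->cursor;

        if (q->pending[c] > 0 && q->credit[c] > 0) {
            q->pending[c]--;
            q->credit[c]--;
            return (int)c;
        }
        /* Class is empty or has used its share: refill and move on. */
        q->credit[c] = q->weight[c];
        q->cursor = (c + 1) % WRR_CLASSES;
    }
    return -1;
}
```

With weights 2:1:1, four ops queued for class 0 and two for class 1, the ops drain in class order 0, 0, 1, 0, 0, 1. A real op queue would carry the ops themselves and their costs rather than bare counters.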
Re: Request for Comments: Weighted Round Robin OP Queue
On Mon, Nov 9, 2015 at 1:30 PM, Robert LeBlancwrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA256 > > On Mon, Nov 9, 2015 at 1:49 PM, Samuel Just wrote: >> We basically don't want a single thread to see all of the operations -- it >> would cause a tremendous bottleneck and complicate the design >> immensely. It's shouldn't be necessary anyway since PGs are a form >> of course grained locking, so it's probably fine to schedule work for >> different groups of PGs independently if we assume that all kinds of >> work are well distributed over those groups. > > The only issue that I can see, based on the discussion last week, is > when the client I/O is small. There will be some points where each > thread will think it is OK so send a bolder along with the pebbles > (recovery I/O vs. client I/O), If all/most of the threads send a > bolder at the same time would it cause issues for slow disks > (spindles)? A single queue would be much more intelligent about > situations like this and spread the bolders out better. It also seems > more scalable as you add threads (I don't think really practical on > spindles). I assume the bottleneck in your concern is the thread > communication between threads? I'm trying to understand and in no way > trying to attack you (I've been know to come across differently than I > intend to). > This is one of the advantages of the dmclock/mclock based designs, we'd be able to portion out the available IO (expresed as cost/time) among the threads and let each queue schedule against its own quota. A significant challenge there of course is estimating available io capacity. Another piece is that there needs to be a bound on how large boulders get. Recovery will break up recovery of large objects into lots of messages to avoid having too large a boulder. Similarly, there are limits at least on the bulk size of a client IO operation. I don't understand how a single queue would be more scalable as we add threads. 
Pre-giant, that's how the queue worked, and it was indeed a significant bottleneck. As I see it, each operation is ordered in two ways (each requiring a lock/thread of control/something): 1) The message stream from the client is ordered (represented by the reader thread in the SimpleMessenger). The ordering here is actually part of the librados interface contract for the most part (certain reads could theoretically be reordered here without breaking the rules). 2) Operations on the PG are ordered necessarily by the PG lock (client writes by necessity, most everything else by convenience). So at a minimum, something ordered by 1 needs to pass off to something ordered by 2. We currently do this by allowing the reader thread to fast-dispatch directly into the op queue responsible for the PG which owns the op. A thread local to the right PG then takes it from there. This means that two different ops each of which is on a different client/pg combo may not interact at all and could be handled entirely in parallel (that's the ideal, anyway). Depending on what you mean by "queue", putting all ops in a single queue necessarily serializes all IO on that structure (even if only for a small portion of the execution time). This limits both parallelism and the amount of computation you can actually do to make the scheduling decision even more so than the current design does. Ideally, we'd like to have our cake and eat it too: we'd like good scheduling (which PrioritizedQueue does not particularly well) while minimizing overhead of the queue itself (an even bigger problem with PrioritizedQueue) and keeping scaling as linear as we can get it on many-core machines (which usually means that independent ops should have a low probability of touching the same structures). >>> But the recovery is still happening the recovery thread and not the >>> client thread, right? The recovery thread has a lower priority than >>> the op thread? That's how I understand it. 
>>> >> >> No, in hammer we removed the snap trim and scrub workqueues. With >> wip-recovery-wq, I remove the recovery wqs as well. Ideally, the only >> meaningful set of threads remaining will be the op_tp and associated >> queues. > > OK, that is good news, I didn't do a scrub so I haven't seen the OPs > for that. Do you know the priorities of snap trim, scrub and recovery > so that I can do some math/logic on applying costs in an efficient way > as we talked about last week? > There are config options in common/config_opt.h iirc. -Sam > Thanks, > > - > Robert LeBlanc > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 > -BEGIN PGP SIGNATURE- > Version: Mailvelope v1.2.3 > Comment: https://www.mailvelope.com > > wsFcBAEBCAAQBQJWQRB6CRDmVDuy+mK58QAAAsMP/RoBeyhqwNDURHagKJ9i > knjYW4jy0FFw1XmnFRhJN7FuFlYlHZ+bwvQGGYvmOkLlxgY9Y+J1GglwwV14 > Vvtd/1LBOUw06Ch/WjhcgVFNIQdgdNBPHPaRurSTGxnofYKAwqB266gnzwAo >
Re: [PATCH] ceph:Fix error handling in the function down_reply
> On Nov 9, 2015, at 11:11, Nicholas Krause wrote:
>
> This fixes error handling in the function down_reply in order to
> properly check and jump to the goto label, out_err for this
> particular function if a error code is returned by any function
> called in down_reply and therefore make checking be included
> for the call to ceph_update_snap_trace in order to comply with
> these error handling checks/paths.
>
> Signed-off-by: Nicholas Krause
> ---
> fs/ceph/mds_client.c | 11 +++++++----
> 1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 51cb02d..0b01f94 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -2495,14 +2495,17 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
>          realm = NULL;
>          if (rinfo->snapblob_len) {
>                  down_write(&mdsc->snap_rwsem);
> -                ceph_update_snap_trace(mdsc, rinfo->snapblob,
> -                                rinfo->snapblob + rinfo->snapblob_len,
> -                                le32_to_cpu(head->op) == CEPH_MDS_OP_RMSNAP,
> -                                &realm);
> +                err = ceph_update_snap_trace(mdsc, rinfo->snapblob,
> +                                rinfo->snapblob + rinfo->snapblob_len,
> +                                le32_to_cpu(head->op) == CEPH_MDS_OP_RMSNAP,
> +                                &realm);
>                  downgrade_write(&mdsc->snap_rwsem);
>          } else {
>                  down_read(&mdsc->snap_rwsem);
>          }
> +
> +        if (err)
> +                goto out_err;
>
>          /* insert trace into our cache */
>          mutex_lock(&req->r_fill_mutex);

Applied, thanks

Yan, Zheng

> --
> 2.5.0
>
Help on ext4/xattr linux kernel stability issue / ceph xattr use?
Hi,

Part of our ceph cluster is using ext4 and we recently hit major kernel
instability in the form of kernel lockups every few hours. Issues opened:

http://tracker.ceph.com/issues/13662
https://bugzilla.kernel.org/show_bug.cgi?id=107301

On kernel.org, kernel developers are asking about ceph's usage of xattrs,
in particular whether there are lots of common xattr key/value pairs or
whether they are all different.

I attached a file with various xattr -l outputs:

https://bugzilla.kernel.org/show_bug.cgi?id=107301#c8
https://bugzilla.kernel.org/attachment.cgi?id=192491

It looks like the "big" xattr "user.ceph._" is always different, and the
same goes for the intermediate-size "user.ceph.hinfo_key".
"user.cephos.spill_out" and "user.ceph.snapset" seem to have small
values, drawn from a small set of values.

Our cluster is used exclusively for virtual machine block devices with
rbd, on replicated (3) and erasure coded pools (4+1 and 8+2).

Could someone knowledgeable add some information on ceph's use of xattrs
in the kernel.org bugzilla above?

Also I think it is necessary to warn ceph users to avoid ext4 at all
costs until this kernel/ceph issue is sorted out: we went from
relatively stable production for more than a year to crashes everywhere
all the time since two weeks ago, probably after hitting some magic
limit. We migrated our machines to ubuntu trusty and our SSD-based
filesystems to XFS, but our HDDs are still mostly on ext4 (60 TB
of data to move so not that easy...).

Thanks in advance for your help,

Sincerely,

Laurent GUERBY
http://tetaneutral.net
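
For anyone wanting to collect the same kind of data on their own OSDs, the question of whether xattr values are shared or all different can be checked by printing, per attribute, its name, its value size, and a hash of its value; identical values then show up as identical hashes. The following is only a minimal sketch, assuming Linux with glibc's <sys/xattr.h>. The function names (fnv1a64, count_xattr_names, dump_xattrs) are hypothetical and not part of ceph or the kernel, and FNV-1a is just a convenient hash chosen for illustration.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* FNV-1a 64-bit hash; used only to spot identical xattr values cheaply. */
uint64_t fnv1a64(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint64_t h = 14695981039346656037ULL;	/* FNV offset basis */

	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 1099511628211ULL;		/* FNV prime */
	}
	return h;
}

/* Count the NUL-terminated names in a buffer as filled by listxattr(2). */
size_t count_xattr_names(const char *list, size_t len)
{
	size_t n = 0;

	for (size_t i = 0; i < len; i++)
		if (list[i] == '\0')
			n++;
	return n;
}

/*
 * Print "name <TAB> value size <TAB> value hash" for each xattr of one file.
 * Returns 0 on success, -1 if listing or allocation fails.
 */
int dump_xattrs(const char *path)
{
	ssize_t llen = listxattr(path, NULL, 0);	/* size query */
	if (llen < 0)
		return -1;
	if (llen == 0)
		return 0;				/* no attributes */

	char *list = malloc((size_t)llen);
	if (!list)
		return -1;
	llen = listxattr(path, list, (size_t)llen);
	if (llen < 0) {
		free(list);
		return -1;
	}

	for (char *name = list; name < list + llen; name += strlen(name) + 1) {
		ssize_t vlen = getxattr(path, name, NULL, 0);	/* size query */
		if (vlen < 0)
			continue;	/* attribute vanished or is unreadable */

		char *val = malloc(vlen > 0 ? (size_t)vlen : 1);
		if (!val) {
			free(list);
			return -1;
		}
		vlen = getxattr(path, name, val, (size_t)vlen);
		if (vlen >= 0)
			printf("%s\t%zd\t%016llx\n", name, vlen,
			       (unsigned long long)fnv1a64(val, (size_t)vlen));
		free(val);
	}
	free(list);
	return 0;
}
```

Calling dump_xattrs() on a sample of object files and piping the output through sort and uniq -c on the hash column would give a rough count of how many distinct values each attribute takes. Values that change between the size query and the read are simply skipped here.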
Re: [PATCH] ceph:Fix error handling in the function ceph_readddir_prepopulate
> On Nov 9, 2015, at 05:13, Nicholas Krause wrote:
>
> This fixes error handling in the function ceph_readddir_prepopulate
> to properly check if the call to the function ceph_fill_dirfrag has
> failed by returning a error code. Further more if this does arise
> jump to the goto label, out of the function ceph_readdir_prepopulate
> in order to clean up previously allocated resources by this function
> before returning to the caller this errror code in order for all callers
> to be now aware and able to handle this failure in their own intended
> error paths.
>
> Signed-off-by: Nicholas Krause
> ---
> fs/ceph/inode.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index 96d2bd8..7738be6 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -1417,8 +1417,11 @@ int ceph_readdir_prepopulate(struct ceph_mds_request *req,
>  	} else {
>  		dout("readdir_prepopulate %d items under dn %p\n",
>  		     rinfo->dir_nr, parent);
> -		if (rinfo->dir_dir)
> -			ceph_fill_dirfrag(d_inode(parent), rinfo->dir_dir);
> +		if (rinfo->dir_dir) {
> +			err = ceph_fill_dirfrag(d_inode(parent), rinfo->dir_dir);
> +			if (err)
> +				goto out;
> +		}
>  	}
>

ceph_fill_dirfrag() failure is not fatal. I think it’s better to not skip later code when it happens.

Regards
Yan, Zheng

>  	if (ceph_frag_is_leftmost(frag) && req->r_readdir_offset == 2) {
> --
> 2.5.0
>
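
The distinction drawn in the reply above, between a failure that should abort the function and one that should only be noted, can be illustrated outside the kernel. The following is a minimal userspace sketch and not the ceph code: fill_dirfrag_hint() and prepopulate() are hypothetical stand-ins, and the error value is arbitrary. It shows the shape of the alternative to "goto out", where the optional step's error is reported and the later work still runs.

```c
#include <stdio.h>

/* Hypothetical optional step: returns 0 on success, a negative code on failure. */
static int fill_dirfrag_hint(int should_fail)
{
	return should_fail ? -5 : 0;
}

/*
 * Hypothetical caller. The hint step is treated as non-fatal: its error is
 * logged but not propagated, so the per-entry loop below is never skipped.
 * Returns the number of entries processed.
 */
int prepopulate(int nr_entries, int hint_fails)
{
	int done = 0;
	int err = fill_dirfrag_hint(hint_fails);

	if (err)
		fprintf(stderr, "dirfrag hint failed (%d), continuing\n", err);

	for (int i = 0; i < nr_entries; i++)
		done++;		/* stand-in for the per-entry work */

	return done;
}
```

With this shape, prepopulate(3, 1) and prepopulate(3, 0) both process all three entries; only the first also emits a warning on stderr.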