Re: cmake
On Thu, 3 Dec 2015 19:26:52 -0500 (EST) Matt Benjamin wrote:
> Could you share the branch you are trying to build? (ceph/wip-5073 would not
> appear to be it.)

It's the trunk with a few of my insignificant cleanups.

But I found a fix: deleting the CMakeFiles/ and CMakeCache.txt let it run.
Thanks again for the tip about the separate build directory.

-- Pete
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RGW S3 Website hosting, non-clean code for early review
On 12/03/2015 10:47 PM, Robin H. Johnson wrote:
> On Wed, Dec 02, 2015 at 03:02:12PM +0100, Javier Muñoz wrote:
>> I would appreciate knowing the current status of the implementation if
>> possible. Any progress? Any 'deadline' to go upstream? :)
>
> When:
> As soon as it works 100% and passes my testsuite, which I hope is very
> soon. I would very much like to have this in Jewel.
>
> My work was being done in this branch:
> https://github.com/dreamhost/ceph/tree/wip-static-website-robbat2-master
> However, because master has moved forward, I can no longer use the latest
> parts of the gitbuilder and automatic testing successfully (they've moved
> on, while this has stayed behind).
>
> Yehuda wanted me to try and NOT rebase it, for ease of his review, but
> that was no longer possible :-(.
> (Tagged as wip-static-website-robbat2-master_yehuda-review-20151012 in
> the dreamhost fork.)
>
> The above, but squashed and updated to master as of 2015/12/02:
> https://github.com/dreamhost/ceph/tree/wip-static-website-robbat2-master-20151202
> It's presently running against my testsuite, and if it passes the pieces
> that I know it should [1], I'll be splitting it up to submit.
>
> [1] I'm seeing some failures in places where I didn't touch the code, so
> I'm having to separate those out.

Thanks for the update!

Javier
Re: cmake
On Fri, Dec 4, 2015 at 3:59 AM, Pete Zaitcev wrote:
> On Thu, 3 Dec 2015 19:26:52 -0500 (EST)
> Matt Benjamin wrote:
>
>> Could you share the branch you are trying to build? (ceph/wip-5073 would
>> not appear to be it.)
>
> It's the trunk with a few of my insignificant cleanups.
>
> But I found a fix: deleting the CMakeFiles/ and CMakeCache.txt let
> it run. Thanks again for the tip about the separate build directory.

FWIW, many cmake issues can be fixed by nuking the cmake-generated files. This is one of the big advantages of a separate build dir, since the simplest way to do this is to nuke the build dir.

Daniel
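Daniel's two options look roughly like this in practice. A minimal sketch (the paths are made up, and it operates on a scratch directory so it runs anywhere, but the same commands apply inside a real ceph build tree):

```shell
# Sketch of the two cleanup options, using a scratch directory in place of
# a real ceph checkout (all paths here are hypothetical, for the demo only).
set -e
demo=$(mktemp -d)
mkdir -p "$demo/build/CMakeFiles"
touch "$demo/build/CMakeCache.txt"

# Option 1: surgically remove only cmake's generated state, keep build output
rm -rf "$demo/build/CMakeCache.txt" "$demo/build/CMakeFiles"

# Option 2: with an out-of-source build, nuke the whole build dir and start over
rm -rf "$demo/build"
mkdir -p "$demo/build"
# (in a real tree you would now run: cd "$demo/build" && cmake /path/to/ceph)

ls -A "$demo/build"    # prints nothing: ready for a fresh configure
```

With an in-source build there is no equivalent of option 2, which is part of why the separate build directory Matt suggested pays off.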
Re: Why FailedAssertion is not my favorite exception
On Fri, Dec 4, 2015 at 12:31 PM, Adam C. Emerson wrote:
> Noble Creators of the Squid Cybernetic Swimming in a Distributed Data Sea,
>
> There is a spectre haunting src/common/assert.cc: the spectre of throw
> FailedAssertion.
>
> This seemingly inconsequential yet villainous statement destroys the
> stack frame in which a failing assert statement is evaluated-- a stack
> frame of great interest to those hoping to divine the cause of such
> failures-- at the moment of their detection.
>
> This consequence follows from the hope that some caller might be able to
> catch and recover from the failure. That is an unworthy goal, for any
> failure sufficiently dire to rate an 'assert' is a failure from which
> there can be no recovery. As I survey the code, I see FailedAssertion
> is only caught as part of unit tests and in a few small programs where
> it leads to an immediate exit.
>
> Therefore! If there is no objection, I would like to submit a patch that
> will replace 'throw FailedAssertion' with abort(). In support of this
> goal, the patch will also remove attempts to catch FailedAssertion from
> driver programs like librados-config and change tests expecting a throw
> of FailedAssertion to use the EXPECT_DEATH or ASSERT_DEATH macros instead.
>
> These changes, taken together, should be non-disruptive and make
> debugging easier.

I must be missing something here. As far as I can tell, "throw FailedAssertion" only happens in assert.cc, and I know that stuff doesn't destroy the relevant stack frames, since I've pulled the info out of core dumps when debugging?

-Greg
Re: Compiling for FreeBSD
On 4-12-2015 19:44, Gregory Farnum wrote:
> On Fri, Dec 4, 2015 at 10:30 AM, Willem Jan Withagen wrote:
>> On 3-12-2015 01:27, Yan, Zheng wrote:
>>> On Thu, Dec 3, 2015 at 4:52 AM, Willem Jan Withagen wrote:
>>>> On 2-12-2015 15:13, Yan, Zheng wrote:
>>>> I see that you have disabled uuid? Might I ask why?
>>>
>>> not disabled. Currently ceph uses the boost uuid implementation, so no
>>> need to link to libuuid.
>>
>> And
>>
>>> The uuid transition to boost::uuid has happened since then (a few months
>>> back) and I believe Rohan's AIX and Solaris ports for librados (that
>>> just merged) included a fix for the sockaddr_storage issue:
>>
>> I cannot seem to find the package or port that defines boost::uuid.
>> So how did you make it available to the build system?
>
> http://www.boost.org/doc/libs/1_59_0/libs/uuid/
>
> It's part of the boost labyrinth. I think in Debian it's just part of
> libboost-dev, but you might need to dig around in whatever packaging
> you're using for FreeBSD.

I've dumped all of the labels in the boost libraries. So it is not available by default with the pre-built packages, which is understandable given the size of all that is available in boost. But let's go and fetch/build some of that stuff.

--WjW
Why FailedAssertion is not my favorite exception
Noble Creators of the Squid Cybernetic Swimming in a Distributed Data Sea,

There is a spectre haunting src/common/assert.cc: the spectre of throw FailedAssertion.

This seemingly inconsequential yet villainous statement destroys the stack frame in which a failing assert statement is evaluated-- a stack frame of great interest to those hoping to divine the cause of such failures-- at the moment of their detection.

This consequence follows from the hope that some caller might be able to catch and recover from the failure. That is an unworthy goal, for any failure sufficiently dire to rate an 'assert' is a failure from which there can be no recovery. As I survey the code, I see FailedAssertion is only caught as part of unit tests and in a few small programs where it leads to an immediate exit.

Therefore! If there is no objection, I would like to submit a patch that will replace 'throw FailedAssertion' with abort(). In support of this goal, the patch will also remove attempts to catch FailedAssertion from driver programs like librados-config and change tests expecting a throw of FailedAssertion to use the EXPECT_DEATH or ASSERT_DEATH macros instead.

These changes, taken together, should be non-disruptive and make debugging easier.

Thank you all.

--
Senior Software Engineer
Red Hat Storage, Ann Arbor, MI, US
IRC: Aemerson@{RedHat, OFTC, Freenode}
0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C 7C12 80F7 544B 90ED BFB9
Re: Why FailedAssertion is not my favorite exception
On Fri, Dec 4, 2015 at 12:40 PM, Adam C. Emerson wrote:
> On 04/12/2015, Gregory Farnum wrote:
>> I must be missing something here. As far as I can tell, "throw
>> FailedAssertion" only happens in assert.cc, and I know that stuff
>> doesn't destroy the relevant stack frames since I've pulled the info
>> out of core dumps when debugging?
>> -Greg
>
> The behavior I'm seeing is that when I am attached to a process with gdb
> and an assertion fails, it bounces all the way up to terminate with an
> uncaught exception.

Oh, yeah, I've no idea what it does when you're attached. I just want to make sure it's generating core dump files on assert failure (that is by far the most common way of generating and analyzing issues), which it looks like abort() does, so fine with me. :)

-Greg
Re: Why FailedAssertion is not my favorite exception
On 04/12/2015, Gregory Farnum wrote:
> I must be missing something here. As far as I can tell, "throw
> FailedAssertion" only happens in assert.cc, and I know that stuff
> doesn't destroy the relevant stack frames since I've pulled the info
> out of core dumps when debugging?
> -Greg

The behavior I'm seeing is that when I am attached to a process with gdb and an assertion fails, it bounces all the way up to terminate with an uncaught exception.

--
Senior Software Engineer
Red Hat Storage, Ann Arbor, MI, US
IRC: Aemerson@{RedHat, OFTC, Freenode}
0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C 7C12 80F7 544B 90ED BFB9
Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
On 12/4/15 2:12 PM, Ric Wheeler wrote:
> On 12/01/2015 05:02 PM, Sage Weil wrote:
>> Hi David,
>>
>> On Tue, 1 Dec 2015, David Casier wrote:
>>> Hi Sage,
>>> With a standard disk (4 to 6 TB) and a small flash drive, it's easy
>>> to create an ext4 FS with metadata on flash.
>>>
>>> Example with sdg1 on flash and sdb on hdd:
>>>
>>> size_of() {
>>>    blockdev --getsize $1
>>> }
>>>
>>> mkdmsetup() {
>>>    _ssd=/dev/$1
>>>    _hdd=/dev/$2
>>>    _size_of_ssd=$(size_of $_ssd)
>>>    echo "0 $_size_of_ssd linear $_ssd 0
>>> $_size_of_ssd $(size_of $_hdd) linear $_hdd 0" | dmsetup create dm-${1}-${2}
>>> }
>>>
>>> mkdmsetup sdg1 sdb
>>>
>>> mkfs.ext4 -O ^has_journal,flex_bg,^uninit_bg,^sparse_super,sparse_super2,^extra_isize,^dir_nlink,^resize_inode -E packed_meta_blocks=1,lazy_itable_init=0 -G 32768 -I 128 -i $((1024*512)) /dev/mapper/dm-sdg1-sdb
>>>
>>> With that, all meta blocks are on the SSD.
>>>
>>> If omap is on SSD, there is almost no metadata on the HDD.
>>>
>>> Consequence: Ceph performance (with a hack on filestore without journal
>>> and directIO) is almost the same as the performance of the HDD alone.
>>>
>>> With cache-tier, it's very cool!
>> Cool! I know XFS lets you do that with the journal, but I'm not sure if
>> you can push the fs metadata onto a different device too.. I'm guessing
>> not?
>>
>>> That is why we are working on a hybrid HDD/flash approach on ARM or Intel.
>>>
>>> With newstore, it's much more difficult to control the I/O profile,
>>> because rocksdb embeds its own intelligence.
>> This is coincidentally what I've been working on today. So far I've just
>> added the ability to put the rocksdb WAL on a second device, but it's
>> super easy to push rocksdb data there as well (and have it spill over onto
>> the larger, slower device if it fills up). Or to put the rocksdb WAL on a
>> third device (e.g., expensive NVMe or NVRAM).
>>
>> See this ticket for the ceph-disk tooling that's needed:
>>
>> http://tracker.ceph.com/issues/13942
>>
>> I expect this will be more flexible and perform better than the ext4
>> metadata option, but we'll need to test on your hardware to confirm!
>>
>> sage
>
> I think that XFS "realtime" subvolumes are the thing that does this - the
> second volume contains only the data (no metadata).
>
> Seem to recall that it is popular historically with video appliances, etc.,
> but it is not commonly used.
>
> Some of the XFS crew cc'ed above would have more information on this.

The realtime subvolume puts all data on a separate volume and uses a different allocator; it is more for streaming-type applications, in general. And it's not enabled in RHEL - and not heavily tested at this point, I think.

-Eric
CephFS and single threaded RBD read performance
Hi,

I see CephFS read performance a bit lower than RBD single-threaded sequential read performance. Is this expected behaviour? Is file access with CephFS single-threaded by design?

fio shows 70 MB/s seq read with 4M blocks, libaio, 1 thread, direct.
fio seq write: 200 MB/s

# rados bench -t 1 -p test 60 write --no-cleanup
...
Bandwidth (MB/sec): 164.247
Stddev Bandwidth:   28.9474
Average Latency:    0.0243512
Stddev Latency:     0.0144412
#
# rados bench -t 1 -p test 60 seq
...
Bandwidth (MB/sec): 88.174
Average Latency:    0.0453621

On the other hand, 'rados bench -t 128 60 seq' shows about 1700 MB/s.

Is there anything to be tuned?

Hardware is 3 nodes x 24 HDDs with journals on SSDs, 2x10GbE. Triple replication. CephFS clients: kernel and ceph-fuse on Fedora 23.
wip-cxx11time and wip-cxx11concurrency
Sage and Fellow Ceph Developers,

Someone poked at me and asked me to rebase and repush the time and concurrency branches for merge. They have been so rebased, and the loicbot tester seems happy with them.

Thank you.

--
Senior Software Engineer
Red Hat Storage, Ann Arbor, MI, US
IRC: Aemerson@{RedHat, OFTC, Freenode}
0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C 7C12 80F7 544B 90ED BFB9
[GIT PULL] Ceph update for -rc4
Hi Linus,

Please pull the following fix from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

This addresses a refcounting bug that leads to a use-after-free.

Thanks!
sage

Ilya Dryomov (1):
      rbd: don't put snap_context twice in rbd_queue_workfn()

 drivers/block/rbd.c | 1 +
 1 file changed, 1 insertion(+)
Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
On 12/01/2015 05:02 PM, Sage Weil wrote:
> Hi David,
>
> On Tue, 1 Dec 2015, David Casier wrote:
>> Hi Sage,
>> With a standard disk (4 to 6 TB) and a small flash drive, it's easy
>> to create an ext4 FS with metadata on flash.
>>
>> Example with sdg1 on flash and sdb on hdd:
>>
>> size_of() {
>>    blockdev --getsize $1
>> }
>>
>> mkdmsetup() {
>>    _ssd=/dev/$1
>>    _hdd=/dev/$2
>>    _size_of_ssd=$(size_of $_ssd)
>>    echo "0 $_size_of_ssd linear $_ssd 0
>> $_size_of_ssd $(size_of $_hdd) linear $_hdd 0" | dmsetup create dm-${1}-${2}
>> }
>>
>> mkdmsetup sdg1 sdb
>>
>> mkfs.ext4 -O ^has_journal,flex_bg,^uninit_bg,^sparse_super,sparse_super2,^extra_isize,^dir_nlink,^resize_inode -E packed_meta_blocks=1,lazy_itable_init=0 -G 32768 -I 128 -i $((1024*512)) /dev/mapper/dm-sdg1-sdb
>>
>> With that, all meta blocks are on the SSD.
>>
>> If omap is on SSD, there is almost no metadata on the HDD.
>>
>> Consequence: Ceph performance (with a hack on filestore without journal
>> and directIO) is almost the same as the performance of the HDD alone.
>>
>> With cache-tier, it's very cool!
>
> Cool! I know XFS lets you do that with the journal, but I'm not sure if
> you can push the fs metadata onto a different device too.. I'm guessing
> not?
>
>> That is why we are working on a hybrid HDD/flash approach on ARM or Intel.
>>
>> With newstore, it's much more difficult to control the I/O profile,
>> because rocksdb embeds its own intelligence.
>
> This is coincidentally what I've been working on today. So far I've just
> added the ability to put the rocksdb WAL on a second device, but it's
> super easy to push rocksdb data there as well (and have it spill over onto
> the larger, slower device if it fills up). Or to put the rocksdb WAL on a
> third device (e.g., expensive NVMe or NVRAM).
>
> See this ticket for the ceph-disk tooling that's needed:
>
> http://tracker.ceph.com/issues/13942
>
> I expect this will be more flexible and perform better than the ext4
> metadata option, but we'll need to test on your hardware to confirm!
>
> sage

I think that XFS "realtime" subvolumes are the thing that does this - the second volume contains only the data (no metadata).

Seem to recall that it is popular historically with video appliances, etc., but it is not commonly used.

Some of the XFS crew cc'ed above would have more information on this,

Ric
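For reference, the table that mkdmsetup() pipes into dmsetup is just two "linear" targets: the SSD partition mapped at offset 0 and the HDD appended after it. A sketch with made-up device names and sector counts, so it runs without root or real disks (the real script gets sizes in 512-byte sectors from `blockdev --getsize`):

```shell
# Build the dmsetup "linear" table that concatenates an SSD partition in
# front of an HDD, as mkdmsetup() does. The device names and sector counts
# below are invented for the demo; on a real system they come from
# `blockdev --getsize /dev/sdg1` and `blockdev --getsize /dev/sdb`.
ssd_sectors=1048576        # pretend /dev/sdg1 is 512 MiB (512-byte sectors)
hdd_sectors=11721045168    # pretend /dev/sdb is ~5.5 TiB

table=$(printf '0 %s linear /dev/sdg1 0\n%s %s linear /dev/sdb 0\n' \
    "$ssd_sectors" "$ssd_sectors" "$hdd_sectors")
echo "$table"
# On a real system: echo "$table" | dmsetup create dm-sdg1-sdb
```

Because mkfs.ext4 is told to pack metadata at the start of the device (`packed_meta_blocks=1`), the concatenation order is what lands the metadata on the SSD.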
Re: Error handling during recovery read
I can't remember the details now, but I know that recovery needed additional work. If it were a simple fix, I would have done it when implementing that code.

I found this bug related to recovery and EC errors (http://tracker.ceph.com/issues/13493):

BUG #13493: osd: for ec, cascading crash during recovery if one shard is corrupted

David

On 12/4/15 2:03 AM, Markus Blank-Burian wrote:
> Hi David,
>
> I am using ceph 9.2.0 with an erasure-coded pool and have some problems
> with missing objects. Reads for degraded/backfilling objects on an EC
> pool which detect an error (-2 in my case) seem to be aborted immediately
> instead of reading from the remaining shards. Why is there an explicit
> check for "!rop.for_recovery" in ECBackend::handle_sub_read_reply? Would
> it be possible to remove this check and let the recovery read be completed
> from the remaining good shards?
>
> Markus
Re: CephFS and single threaded RBD read performance
Hi Ilja,

On 5 December 2015 at 07:16, Ilja Slepnev wrote:
> fio shows 70 MB/s seq read with 4M blocks, libaio, 1 thread, direct.
> fio seq write 200 MB/s

The fio numbers are from fio running on a CephFS mount, I take it?

> # rados bench -t 1 -p test 60 write --no-cleanup

I don't see an rbd test anywhere here...? I suggest comparing fio on CephFS with fio on rbd (as in using fio's rbd ioengine); then at least the application side of your tests is constant.

Cheers,
~Blairo
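To make that suggestion concrete, one way is a single fio job file with one section per backend, so the I/O pattern is identical on both sides. This is only a sketch: the pool/image names, client name, and CephFS mount point are assumptions (mirroring the parameters in the original mail), and fio must be built with rbd support.

```ini
; seqread-compare.fio -- hypothetical job file; run one section at a time:
;   fio --section=rbd-seqread seqread-compare.fio
;   fio --section=cephfs-seqread seqread-compare.fio

[global]
rw=read
bs=4M
iodepth=1
numjobs=1
runtime=60
time_based

; assumes pool "test" and image "testimg" exist, and client.admin keyring
[rbd-seqread]
ioengine=rbd
clientname=admin
pool=test
rbdname=testimg

; assumes CephFS is mounted at /mnt/cephfs; fio lays out the test file
[cephfs-seqread]
ioengine=libaio
direct=1
directory=/mnt/cephfs
size=10G
```

Comparing the two sections' bandwidth numbers isolates the CephFS-vs-RBD difference from any difference in benchmarking tools.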
rgw: new multisite update
Orit, Casey and I have been working for quite a while now on a new multi-site framework that will replace the current sync-agent system. There are many improvements that we are working on, and we think it will be worth the wait (and effort!).

First and foremost, the main complaint that we hear about rgw, and specifically about its multisite feature, is the configuration complexity. The new scheme will make things much easier to configure and will handle changes dynamically. There will be many new commands that remove the need to manually edit and inject json configurations as was previously needed. Changes to the zones configuration will be applied to running gateways, and these will be able to handle those changes without the need to restart the processes.

We're getting rid of the sync agent, and the gateways themselves will handle the sync process. Removing the sync agent will make the system much easier to set up and configure. It also helps with the sync process itself in many aspects. The data sync will now be active-active, so all zones within the same zone group will be writable. Note that 'region' is now called 'zone group'; there was too much confusion with the old term. The metadata changes will keep the master-slave scheme to keep things simpler in that arena.

We added a new entity called 'realm', which is a container for zone groups. Multiple realms can be created, which provides the ability to run completely different configurations on the same clusters with minimum effort. A new entity called 'period' (which might be renamed to 'config') holds the realm configuration structure. A period changes when the master zone changes. A period epoch is incremented whenever there's a change in the configuration that does not modify the master.

New radosgw-admin commands were added to provide a better view into the sync process status itself.
The scheme still requires handling 3 different logs (metadata, data, bucket indexes), and the sync statuses reflect the position in those logs (for incremental sync), or which entry is being synced (for full sync).

There is also a new admin socket command ('cr dump') that dumps the current state of the coroutines framework (which was created for this feature); it helps quite a bit with debugging problems.

Migrating from the old sync agent to the new sync will require the new sync to start from scratch. Note that this process should not copy any actual data, but the sync will need to build the new sync status (and verify that all the data is in place in the zones).

So, when is this going to be ready? We're aiming to have it in Jewel. At the moment nothing is merged yet (it's still on the wip-rgw-new-multisite branch); we're trying to make sure that things still work against it (e.g., that the sync agent can still work), and we'll get it merged once we feel comfortable with the backward compatibility. The metadata sync is still missing some functionality related to failover recovery, and the error reporting and retry still need some more work. The data sync itself has a few cases that we don't handle correctly. The realm/period bootstrapping still needs some more work. Documentation is almost nonexistent. But the most important piece that we actually need to work on is testing. We need to make sure that we have test coverage for all the new functionality.

Which brings me to this: it would be great if we had people outside of the team who could take an early look at it and help with mapping the pain points. It would be even greater if someone could help with the actual development of the automated tests (via teuthology), but even just manual testing and reporting of any issues will help a lot.

Note: Danger! Danger! This could and will eat your data! It shouldn't be tested in a production environment (yet!).
The following is a sample config of a single zone group with two separate zones. There are two machines that we set up the zones on, rgw1 and rgw2, where rgw1 serves as the master for metadata. Note that there are a bunch of commands here that we'll be able to drop later (e.g., all the default-setting ones, and the separate commands to create a zone and attach it to a zonegroup). In some of the cases, when you create the first realm/zonegroup/zone, the entity will automatically become the default. However, I ran into some issues when trying to set up multiple realms on a single cluster, and not having the default caused some problems. We'll need to clean that up.

access_key=
secret=

# run on rgw1
$ radosgw-admin realm create --rgw-realm=earth
$ radosgw-admin zonegroup create --rgw-zonegroup=us --endpoints=http://rgw1:80 --master
$ radosgw-admin zonegroup default --rgw-zonegroup=us
$ radosgw-admin zone create --rgw-zonegroup=us --rgw-zone=us-1 --access-key=${access_key} --secret=${secret} --endpoints=http://rgw1:80
$ radosgw-admin zone default --rgw-zone=us-1
$ radosgw-admin zonegroup add --rgw-zonegroup=us --rgw-zone=us-1
$ radosgw-admin
Re: Compiling for FreeBSD
On Fri, Dec 4, 2015 at 10:30 AM, Willem Jan Withagen wrote:
> On 3-12-2015 01:27, Yan, Zheng wrote:
>> On Thu, Dec 3, 2015 at 4:52 AM, Willem Jan Withagen wrote:
>>> On 2-12-2015 15:13, Yan, Zheng wrote:
>
>>> I see that you have disabled uuid?
>>> Might I ask why?
>>
>> not disabled. Currently ceph uses the boost uuid implementation, so no
>> need to link to libuuid.
>
> And
>
>> The uuid transition to boost::uuid has happened since then (a few months
>> back) and I believe Rohan's AIX and Solaris ports for librados (that just
>> merged) included a fix for the sockaddr_storage issue:
>
> I cannot seem to find the package or port that defines boost::uuid.
> So how did you make it available to the build system?

http://www.boost.org/doc/libs/1_59_0/libs/uuid/

It's part of the boost labyrinth. I think in Debian it's just part of libboost-dev, but you might need to dig around in whatever packaging you're using for FreeBSD.

-Greg
Re: Compiling for FreeBSD
On 3-12-2015 01:27, Yan, Zheng wrote:
> On Thu, Dec 3, 2015 at 4:52 AM, Willem Jan Withagen wrote:
>> On 2-12-2015 15:13, Yan, Zheng wrote:
>> I see that you have disabled uuid?
>> Might I ask why?
>
> not disabled. Currently ceph uses the boost uuid implementation, so no
> need to link to libuuid.

And

> The uuid transition to boost::uuid has happened since then (a few months
> back) and I believe Rohan's AIX and Solaris ports for librados (that just
> merged) included a fix for the sockaddr_storage issue:

I cannot seem to find the package or port that defines boost::uuid.
So how did you make it available to the build system?

Thanx,
--WjW