Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index
[Sorry for the piecemeal information... it's getting late here]

> Oops, I forgot: Before it crashed, it did modify /mnt/ceph/db; the
> overall size of that directory increased(!) from 3.9GB to 12GB. The
> compaction seems to have eaten two .log files, but created many more
> .sst files.

...and it upgraded the contents of db/CURRENT from "MANIFEST-053662"
to "MANIFEST-079750".

Good night,
-- Simon.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index
Simon Leinen writes:
> Sage Weil writes:
>> Try 'compact' instead of 'stats'?

> That ran for a while and then crashed, also in the destructor for
> rocksdb::Version, but with an otherwise different backtrace. [...]

Oops, I forgot: Before it crashed, it did modify /mnt/ceph/db; the
overall size of that directory increased(!) from 3.9GB to 12GB. The
compaction seems to have eaten two .log files, but created many more
.sst files.
-- Simon.
Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index
Sage Weil writes:
>> 2019-06-12 23:40:43.555 7f724b27f0c0 1 rocksdb: do_open column families: [default]
>> Unrecognized command: stats
>> ceph-kvstore-tool: /build/ceph-14.2.1/src/rocksdb/db/version_set.cc:356:
>> rocksdb::Version::~Version(): Assertion `path_id <
>> cfd_->ioptions()->cf_paths.size()' failed.
>> *** Caught signal (Aborted) **

> Ah, this looks promising... it looks like it got it open and has some
> problem with the error/teardown path.

> Try 'compact' instead of 'stats'?

That ran for a while and then crashed, also in the destructor for
rocksdb::Version, but with an otherwise different backtrace. I'm
attaching the log again.
-- Simon.

leinen@unil0047:/mnt/ceph/db$ sudo ceph-kvstore-tool rocksdb /mnt/ceph/db compact
2019-06-13 00:00:08.650 7f00f4c510c0 1 rocksdb: do_open column families: [default]
ceph-kvstore-tool: /build/ceph-14.2.1/src/rocksdb/db/version_set.cc:356: rocksdb::Version::~Version(): Assertion `path_id < cfd_->ioptions()->cf_paths.size()' failed.
*** Caught signal (Aborted) **
 in thread 7f00e5788700 thread_name:rocksdb:low0

 ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
 1: (()+0x12890) [0x7f00ea641890]
 2: (gsignal()+0xc7) [0x7f00e9531e97]
 3: (abort()+0x141) [0x7f00e9533801]
 4: (()+0x3039a) [0x7f00e952339a]
 5: (()+0x30412) [0x7f00e9523412]
 6: (rocksdb::Version::~Version()+0x224) [0x5603bd78bfe4]
 7: (rocksdb::Version::Unref()+0x35) [0x5603bd78c065]
 8: (rocksdb::Compaction::~Compaction()+0x25) [0x5603bd880f05]
 9: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*)+0xc46) [0x5603bd6dab76]
 10: (rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0x141) [0x5603bd6dcfa1]
 11: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x97) [0x5603bd6dd5f7]
 12: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x267) [0x5603bd8dc847]
 13: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x49) [0x5603bd8dca29]
 14: (()+0xbd57f) [0x7f00e9f5757f]
 15: (()+0x76db) [0x7f00ea6366db]
 16: (clone()+0x3f) [0x7f00e961488f]

2019-06-13 00:05:09.471 7f00e5788700 -1 *** Caught signal (Aborted) **
 in thread 7f00e5788700 thread_name:rocksdb:low0

 ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
 1: (()+0x12890) [0x7f00ea641890]
 2: (gsignal()+0xc7) [0x7f00e9531e97]
 3: (abort()+0x141) [0x7f00e9533801]
 4: (()+0x3039a) [0x7f00e952339a]
 5: (()+0x30412) [0x7f00e9523412]
 6: (rocksdb::Version::~Version()+0x224) [0x5603bd78bfe4]
 7: (rocksdb::Version::Unref()+0x35) [0x5603bd78c065]
 8: (rocksdb::Compaction::~Compaction()+0x25) [0x5603bd880f05]
 9: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*)+0xc46) [0x5603bd6dab76]
 10: (rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0x141) [0x5603bd6dcfa1]
 11: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x97) [0x5603bd6dd5f7]
 12: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x267) [0x5603bd8dc847]
 13: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x49) [0x5603bd8dca29]
 14: (()+0xbd57f) [0x7f00e9f5757f]
 15: (()+0x76db) [0x7f00ea6366db]
 16: (clone()+0x3f) [0x7f00e961488f]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.
--- begin dump of recent events ---
 -23> 2019-06-13 00:00:08.602 7f00f4c510c0 5 asok(0x5603bebba000) register_command assert hook 0x5603be844130
 -22> 2019-06-13 00:00:08.602 7f00f4c510c0 5 asok(0x5603bebba000) register_command abort hook 0x5603be844130
 -21> 2019-06-13 00:00:08.602 7f00f4c510c0 5 asok(0x5603bebba000) register_command perfcounters_dump hook 0x5603be844130
 -20> 2019-06-13 00:00:08.602 7f00f4c510c0 5 asok(0x5603bebba000) register_command 1 hook 0x5603be844130
 -19> 2019-06-13 00:00:08.602 7f00f4c510c0 5 asok(0x5603bebba000) register_command perf dump hook 0x5603be844130
 -18> 2019-06-13 00:00:08.602 7f00f4c510c0 5 asok(0x5603bebba000) register_command perfcounters_schema hook 0x5603be844130
 -17> 2019-06-13 00:00:08.602 7f00f4c510c0 5 asok(0x5603bebba000) register_command perf histogram dump hook 0x5603be844130
 -16> 2019-06-13 00:00:08.602 7f00f4c510c0 5 asok(0x5603bebba000) register_command 2 hook 0x5603be844130
 -15> 2019-06-13 00:00:08.602 7f00f4c510c0 5 asok(0x5603bebba000) register_command perf schema hook 0x5603be844130
 -14> 2019-06-13 00:00:08.602 7f00f4c510c0 5 asok(0x5603bebba000) register_command perf histogram schema hook 0x5603be844130
 -13> 2019-06-13 00:00:08.602 7f00f4c510c0 5 asok(0x5603bebba000) register_command perf reset hook 0x5603be844130
 -12> 2019-06-13 00:00:08.602 7f00f4c510c0 5 asok(0x5603bebba000) register_command config show hook 0x5
Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index
Sage Weil writes:
> What happens if you do
> ceph-kvstore-tool rocksdb /mnt/ceph/db stats

(I'm afraid that our ceph-kvstore-tool doesn't know about a "stats"
command; but it still tries to open the database.)

That aborts after complaining about many missing files in /mnt/ceph/db.

When I ( cd /mnt/ceph/db && sudo ln -s ../db.slow/* . ) and re-run, it
still aborts, just without complaining about missing files. I'm
attaching the output (stdout+stderr combined), in case that helps.

> or, if that works,
> ceph-kvstore-tool rocksdb /mnt/ceph/db compact

> It looks like bluefs is happy (in that it can read the whole set of
> rocksdb files), so the question is if rocksdb can open them, or if
> there's some corruption or problem at the rocksdb level.

> The original crash is actually here:
> ...
> 9: (tc_new()+0x283) [0x7fbdbed8e943]
> 10: (std::__cxx11::basic_string, std::allocator >::_M_mutate(unsigned long, unsigned long, char const*, unsigned long)+0x69) [0x5600b1268109]
> 11: (std::__cxx11::basic_string, std::allocator >::_M_append(char const*, unsigned long)+0x63) [0x5600b12f5b43]
> 12: (rocksdb::BlockBuilder::Add(rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Slice const*)+0x10b) [0x5600b1eaca9b]
> ...
> where tc_new is (I think) tcmalloc. Which looks to me like rocksdb is
> probably trying to allocate something very big. The question is will
> that happen with the exported files or only on bluefs...

Yes, that's what I was thinking as well. The server seems to have about
50GB of free RAM though, so maybe it was more like really big :-)

Also, your ceph-kvstore-tool command seems to have crashed somewhere
else (the destructor of a rocksdb::Version object?)

2019-06-12 23:40:43.555 7f724b27f0c0 1 rocksdb: do_open column families: [default]
Unrecognized command: stats
ceph-kvstore-tool: /build/ceph-14.2.1/src/rocksdb/db/version_set.cc:356: rocksdb::Version::~Version(): Assertion `path_id < cfd_->ioptions()->cf_paths.size()' failed.
*** Caught signal (Aborted) **
 in thread 7f724b27f0c0 thread_name:ceph-kvstore-to

 ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
 1: (()+0x12890) [0x7f7240c6f890]
 2: (gsignal()+0xc7) [0x7f723fb5fe97]
 3: (abort()+0x141) [0x7f723fb61801]
 4: (()+0x3039a) [0x7f723fb5139a]
 5: (()+0x30412) [0x7f723fb51412]
 6: (rocksdb::Version::~Version()+0x224) [0x559749529fe4]
 7: (rocksdb::Version::Unref()+0x35) [0x55974952a065]
 8: (rocksdb::SuperVersion::Cleanup()+0x68) [0x55974960f328]
 9: (rocksdb::ColumnFamilyData::~ColumnFamilyData()+0xf4) [0x5597496123d4]
 10: (rocksdb::ColumnFamilySet::~ColumnFamilySet()+0xb8) [0x559749612ba8]
 11: (rocksdb::VersionSet::~VersionSet()+0x4d) [0x55974951da5d]
 12: (rocksdb::DBImpl::CloseHelper()+0x6a8) [0x55974944a868]
 13: (rocksdb::DBImpl::~DBImpl()+0x65b) [0x559749455deb]
 14: (rocksdb::DBImpl::~DBImpl()+0x11) [0x559749455e21]
 15: (RocksDBStore::~RocksDBStore()+0xe9) [0x559749265349]
 16: (RocksDBStore::~RocksDBStore()+0x9) [0x559749265599]
 17: (main()+0x307) [0x5597490b5fb7]
 18: (__libc_start_main()+0xe7) [0x7f723fb42b97]
 19: (_start()+0x2a) [0x55974918e03a]

2019-06-12 23:40:51.363 7f724b27f0c0 -1 *** Caught signal (Aborted) **
 in thread 7f724b27f0c0 thread_name:ceph-kvstore-to

> Thanks!

Thanks so much for looking into this! We hope that we can get some
access to S3 bucket indexes back, possibly by somehow dropping and
re-creating those indexes.
-- Simon.
Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index
Dear Sage,

> Also, can you try ceph-bluestore-tool bluefs-export on this osd? I'm
> pretty sure it'll crash in the same spot, but just want to confirm
> it's a bluefs issue.

To my surprise, this actually seems to have worked:

$ time sudo ceph-bluestore-tool --out-dir /mnt/ceph bluefs-export --path /var/lib/ceph/osd/ceph-49
inferring bluefs devices from bluestore path
 slot 2 /var/lib/ceph/osd/ceph-49/block -> /dev/dm-9
 slot 1 /var/lib/ceph/osd/ceph-49/block.db -> /dev/dm-8
db/
db/072900.sst
db/072901.sst
db/076487.sst
db/076488.sst
db/076489.sst
db/076490.sst
[...]
db/079726.sst
db/079727.log
db/CURRENT
db/IDENTITY
db/LOCK
db/MANIFEST-053662
db/OPTIONS-053662
db/OPTIONS-053665
db.slow/
db.slow/049192.sst
db.slow/049193.sst
db.slow/049831.sst
db.slow/057443.sst
db.slow/057444.sst
db.slow/058254.sst
[...]
db.slow/079718.sst
db.slow/079719.sst
db.slow/079720.sst
db.slow/079721.sst
db.slow/079722.sst
db.slow/079723.sst
db.slow/079724.sst

real 5m19.953s
user 0m0.101s
sys 1m5.571s
leinen@unil0047:/var/lib/ceph/osd/ceph-49$

It left 3GB in /mnt/ceph/db (55 files of varying sizes), and 39GB in
/mnt/ceph/db.slow (620 files of mostly 68MB each).

Is there anything we can do with this?
-- Simon.
Re: [ceph-users] Migrating to a dedicated cluster network
Paul Emmerich writes:
> Split networks is rarely worth it. One fast network is usually better.
> And since you mentioned having only two interfaces: one bond is way
> better than two independent interfaces.

> IPv4/6 dual-stack setups will be supported in Nautilus; you currently
> have to use either IPv4 or IPv6.

> Jumbo frames: often mentioned but usually not worth it.
> (Yes, I know that this is somewhat controversial and increasing the
> MTU is often a standard trick for performance tuning, but I have yet
> to see a benchmark that actually shows a significant performance
> improvement. Some quick tests show that I can save around 5-10% CPU
> load on a system doing ~50 gbit/s of IO traffic, which is almost
> nothing given the total system load.)

Agree with everything Paul said. (I know this is lame, but I think all
of this bears repeating :-)

To address another question in Jan's original post: I would not
consider using link-local IPv6 addressing. Not just because I doubt
that this would work (Ceph would always need to know/tell the OS which
interface it should use with such an address), but mainly because even
if it does work, it will only work as long as everything is on a single
logical IPv6 network. This will artificially limit your options for the
evolution of your cluster. Routable addresses are cheap in IPv6, use
them!
-- Simon.
Re: [ceph-users] Using Ceph central backup storage - Best practice creating pools
cmonty14 writes:
> due to performance issues RGW is not an option. This statement may be
> wrong, but there's the following aspect to consider.
> If I write a backup that is typically a large file, this is normally a
> single IO stream.
> This causes massive performance issues on Ceph because this single IO
> stream is sequentially written in small pieces on OSDs.
> To overcome this issue multi IO stream should be used when writing
> large files, and this means the application writing the backup must
> support multi IO stream.

RGW (and the S3 protocol in general) supports multi-stream uploads
nicely, via the "multipart upload" feature: You split your file into
many pieces, which can be uploaded in parallel.

RGW with multipart uploads seems like a good fit for your application.
It could solve your naming and permission issues, has low overhead, and
could give you good performance as long as you use multipart uploads
with parallel threads. You just need to make sure that your RGW
gateways have enough throughput, but this capacity is relatively easy
and inexpensive to provide.

> Considering this the following question comes up: If I write a backup
> into a RBD (that could be considered as a network share), will Ceph
> use single IO stream or multi IO stream on storage side?

Ceph should be able to handle multiple parallel streams of I/O to an
RBD device (in general, writes will go to different "chunks" of the
RBD, and those chunk objects will be on different OSDs). But it's
another question whether your RBD client will be able to issue parallel
streams of requests. Usually you have some kind of file system and
kernel block I/O layer on the client side, and it's possible that those
will serialize I/O, which will make it hard to get high throughput.
-- Simon.

> THX
> Am Di., 22. Jan. 2019 um 23:20 Uhr schrieb Christian Wuerdig
> :
>>
>> If you use librados directly it's up to you to ensure you can
>> identify your objects.
>> Generally RADOS stores objects and not files,
>> so when you provide your object ids you need to come up with a
>> convention so you can correctly identify them. If you need to
>> provide meta data (i.e. a list of all existing backups, when they
>> were taken etc.) then again you need to manage that yourself
>> (probably in dedicated meta-data objects). Using RADOS namespaces
>> (like one per database) is probably a good idea.
>> Also keep in mind that for example Bluestore has a maximum object
>> size of 4GB, so mapping files 1:1 to objects is probably not a wise
>> approach and you should break up your files into smaller chunks when
>> storing them. There is libradosstriper which handles the striping of
>> large objects transparently, but not sure if that has support for
>> RADOS namespaces.
>>
>> Using RGW instead might be an easier route to go down.
>>
>> On Wed, 23 Jan 2019 at 10:10, cmonty14 <74cmo...@gmail.com> wrote:
>>>
>>> My backup client is using librados.
>>> I understand that defining a pool for the same application is
>>> recommended.
>>>
>>> However this would not answer my other questions:
>>> How can I identify a backup created by client A that I want to
>>> restore on another client Z?
>>> I mean typically client A would write a backup file identified by
>>> the filename.
>>> Would it be possible on client Z to identify this backup file by
>>> filename? If yes, how?
>>>
>>> Am Di., 22. Jan.
>>> 2019 um 15:07 Uhr schrieb :
>>> >
>>> > Hi,
>>> >
>>> > Ceph's pools are meant to let you define specific engineering
>>> > rules and/or application (rbd, cephfs, rgw).
>>> > They are not designed to be created in a massive fashion (see pgs
>>> > etc), so create a pool for each engineering ruleset, and store
>>> > your data in them.
>>> > For what is left of your project, I believe you have to implement
>>> > that on top of Ceph.
>>> >
>>> > For instance, let's say you simply create a pool, with a rbd
>>> > volume in it.
>>> > You then create a filesystem on that, and map it on some server.
>>> > Finally, you can push your files on that mountpoint, using various
>>> > Linux users, ACLs or whatever: beyond that point, there is nothing
>>> > more specific to Ceph, it is "just" a mounted filesystem.
>>> >
>>> > Regards,
>>> >
>>> > On 01/22/2019 02:16 PM, cmonty14 wrote:
>>> > > Hi,
>>> > >
>>> > > my use case for Ceph is providing a central backup storage.
>>> > > This means I will backup multiple databases in Ceph storage
>>> > > cluster.
>>> > >
>>> > > This is my question:
>>> > > What is the best practice for creating pools & images?
>>> > > Should I create multiple pools, means one pool per database?
>>> > > Or should I create a single pool "backup" and use namespace when
>>> > > writing data in the pool?
>>> > >
>>> > > This is the security demand that should be considered:
>>> > > DB-owner A can only modify the files that belong to A; other
>>> > > files (owned by B, C or D) are not accessible for A.
>>> > >
>>> > > And there's another issue:
>>> > > How can I identify a backup created by client A
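The multipart-upload approach suggested above can be sketched with a little Python. This is an illustrative helper (not tied to any particular S3 library) that computes the part boundaries a client would upload in parallel; the sizes are made-up examples, and note that the S3 protocol requires every part except the last to be at least 5 MiB.

```python
# Sketch: split a large backup into byte ranges suitable for an S3
# multipart upload. Each range could then be uploaded by a separate
# thread/connection and combined with a "complete multipart upload"
# request at the end.

def part_ranges(total_size, part_size=16 * 1024 * 1024):
    """Yield (part_number, offset, length) tuples; parts are 1-indexed,
    as the S3 multipart API expects."""
    number, offset = 1, 0
    while offset < total_size:
        length = min(part_size, total_size - offset)
        yield (number, offset, length)
        number += 1
        offset += length

# A hypothetical 100 MiB backup in 16 MiB parts:
ranges = list(part_ranges(100 * 1024 * 1024))
print(len(ranges))   # 7 parts: six of 16 MiB plus a 4 MiB tail
print(ranges[-1])
```

Each worker thread would seek to `offset` in the backup file, read `length` bytes, and upload that as part `part_number`.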
Re: [ceph-users] Ceph in OSPF environment
Burkhard Linke writes:
> I'm curious... what is the advantage of OSPF in your setup over
> e.g. LACP bonding both links?

Good question! Some people (including myself) are uncomfortable with
LACP (in particular "MLAG", i.e. port aggregation across multiple
chassis), and with fancy L2 setups in general. Concrete issues include:

* worries about compatibility between different MLAG implementations,
  which tend to vary subtly between vendors

* worries about the general reliability of L2 in complex topologies
  with lots of potential loops - as soon as spanning tree makes ONE
  brief mistake you can end up with broadcast storms and subsequent
  meltdowns

* the requirement to have direct inter-switch links between switches
  that share a VLAN with LACP; this runs counter to the fashionable
  "Clos" (aka fat-tree aka leaf/spine) topology.

(Also, you are talking about "both links"... I don't know about LACP,
but with OSPF it's trivial to use arbitrary numbers of uplinks -
e.g. 3 - to arbitrary routers/switches; you can also move servers
around freely between "leaf" switches.)

With modern L3 routing protocol implementations, you can use simple
and generic configurations using "unnumbered" routing adjacency
definitions. OSPF has had this for a long time, and modern BGP
implementations (e.g. FRR) also do. This neutralizes one of the most
important advantages of L2 networks, namely easy configuration.

Routing protocols such as OSPF, IS-IS, or BGP-4 have proven their
robustness in dynamic and wild environments, e.g. the Internet. IP
forwarding also has the wonderful "TTL" mechanism, which makes the
occasional routing loop much less disastrous.

By the way, a thoroughly "unnumbered" routing configuration can
alleviate the original poster's problem, because it lets you use a
single IP address (or a single IPv4 + a single IPv6 address...) across
all interfaces.
We use this in our setup, previously with OSPF(v2+v3), now using
BGP-4, between hosts running Ubuntu+FRR and switches running Cumulus
Linux (also with FRR), following Cumulus Networks's "routing to the
host" model.
-- Simon.
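For the curious, a "routing to the host" setup of the kind described above might look roughly like the following FRR configuration sketch. This is a hypothetical illustration, not our actual configuration: interface names, the AS number, and the loopback address are all made up. The host announces a single loopback /32 over BGP-unnumbered sessions on both uplinks, so no per-link addresses are needed.

```
! Hypothetical FRR "routing to the host" sketch (interface names,
! AS number, and address are invented for illustration).
interface lo
 ip address 192.0.2.10/32
!
router bgp 65001
 ! BGP unnumbered: sessions over link-local addresses, no per-link IPs
 neighbor swp1 interface remote-as external
 neighbor swp2 interface remote-as external
 address-family ipv4 unicast
  network 192.0.2.10/32
```

Losing either uplink then just withdraws one BGP path; the host keeps its single stable address on lo.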
[ceph-users] Avoid Ubuntu Linux kernel 4.15.0-36
As a little "heads-up": If you are running Ubuntu Bionic 18.04, or
Xenial 16.04 with "HWE" kernels, and have systems running under
4.15.0-36 - which was the default between 2018-10-01 and 2018-10-22 -
please consider upgrading to the latest 4.15.0-38 ASAP (or downgrade
to 4.15.0-34).

4.15.0-36 has a TCP bug[1] that can occasionally slow down a TCP
connection to a trickle of 2.5 Kbytes/s (512-byte segments every
200ms). Once a TCP connection is in this state, it will never get out.

This started happening within our Ceph clusters after we reinstalled a
few servers as part of our Bluestore migration. The effect on our RBD
users (OpenStack VMs) was pretty terrible - the typical 4MB
transaction would take about 27 MINUTES at this rate, causing timeouts
and crashes. This was absolutely painful to diagnose, because it
happened so rarely and was hard to reproduce. Fortunately the fix is
easy - just don't run this kernel.

I should note that our Ceph clusters run over IPv6; I'm not sure
whether the TCP bug can hit with IPv4 (the bug was reported for IPv6
as well), although I see no reason why it shouldn't.
-- Simon.

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1796895
Re: [ceph-users] Bluestore caching, flawed by design?
Christian Balzer writes:
> On Mon, 2 Apr 2018 08:33:35 +0200 John Hearns wrote:
>> Christian, you mention single socket systems for storage servers.
>> I often thought that the Xeon-D would be ideal as a building block
>> for storage servers
>> https://www.intel.com/content/www/us/en/products/processors/xeon/d-processors.html
>> Low power, and a complete System-On-Chip with 10gig Ethernet.

> If you (re)search the ML archives you should be able to find
> discussions about this, and I seem to remember them coming up as
> well.
> If you're going to have a typical setup of HDDs for storage and 1-2
> SSDs for journal/WAL/DB, they should do well enough.

We have such systems (QuantaGrid SD1Q-1ULH with Xeon D-1541) and are
generally happy with them. They are certainly very power-efficient.

> But in that scenario you're likely not all that latency conscious
> about NUMA issues to begin with, given that current CPU interlinks
> are quite decent.

Right.

> They however do feel underpowered when mated with really fast (NVMe)
> or more than 4 SSDs per node if you have a lot of small writes. [...]

The new Xeon-D 2100s look promising. I haven't seen any
storage-optimized servers based on them yet, though.
-- Simon.
[ceph-users] SRV mechanism for looking up mons lacks IPv6 support
We just upgraded our last cluster to Luminous. Since we might need to
renumber our mons in the not-too-distant future, it would be nice to
remove the literal IP addresses of the mons from ceph.conf. Kraken and
above support a DNS-based mechanism for this based on SRV records[1].

Unfortunately our Rados cluster is IPv6-based, and in testing we found
out that the code that resolves these SRV records only looks for IPv4
addresses (A records) of the hostnames that the SRVs point to.

I just created issue #23078[2] for this. The description points to
where I think the code would need to be changed. If I can do anything
to help (in particular test fixes), please let me know.

This might be relevant to others who run IPv6 Rados clusters.
-- Simon.

[1] http://docs.ceph.com/docs/master/rados/configuration/mon-lookup-dns/
[2] http://tracker.ceph.com/issues/23078
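To make the failure mode concrete, here is a sketch of the DNS records the mechanism in [1] relies on (all names and addresses are invented). The SRV records are found fine; the problem reported above is that only A lookups, not AAAA lookups, are done on the SRV targets:

```
; Hypothetical zone fragment for mon lookup via DNS (RFC 2782 SRV).
; Each SRV record points at a mon host; an IPv6-only cluster carries
; the mon addresses in AAAA records, which the resolver code ignores.
_ceph-mon._tcp.example.com. 3600 IN SRV 10 60 6789 mon1.example.com.
_ceph-mon._tcp.example.com. 3600 IN SRV 10 60 6789 mon2.example.com.
mon1.example.com.           3600 IN AAAA 2001:db8::11
mon2.example.com.           3600 IN AAAA 2001:db8::12
```

With records like these in place, ceph.conf can drop `mon_host` entirely (the SRV service name defaults to `ceph-mon`), which is what makes later renumbering painless - once AAAA lookups work.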
Re: [ceph-users] s3 bucket policys
Simon Leinen writes:
> Adam C Emerson writes:
>> On 03/11/2017, Simon Leinen wrote:
>> [snip]
>>> Is this supported by the Luminous version of RadosGW?

>> Yes! There's a few bugfixes in master that are making their way into
>> Luminous, but Luminous has all the features at present.

> Does that mean it should basically work in 10.2.1?

Sorry, I meant to say "in 12.2.1"!!!
-- Simon.
Re: [ceph-users] s3 bucket policys
Adam C Emerson writes:
> On 03/11/2017, Simon Leinen wrote:
> [snip]
>> Is this supported by the Luminous version of RadosGW?

> Yes! There's a few bugfixes in master that are making their way into
> Luminous, but Luminous has all the features at present.

Does that mean it should basically work in 10.2.1?

>> (Or even Jewel?)

> No!

I see; this will definitely motivate us to speed up our Luminous
upgrade!

>> Does this work with Keystone integration, i.e. can we refer to
>> Keystone users as principals?

> In principle, probably. I haven't tried it and I don't really know
> much about Keystone at present. It is hooked into the various
> IdentityApplier classes, and if RGW thinks a Keystone user is a
> 'user' and you supply whatever RGW thinks its username is, then it
> should work fine. I haven't tried it, though.

Unless someone beats us to it, we'll try as soon as we have our
cluster (with Keystone integration) on Luminous.

>> Let's say there are many read-only users rather than just one. Would
>> we simply add a new clause under "Statement" for each such user, or
>> is there a better way? (I understand that RadosGW doesn't support
>> groups, which could solve this elegantly and efficiently.)

> If you want to give a large number of users the same permissions,
> just put them all in the Principal array.

Right, thanks for the tip! That makes it more compact. For our use
case it won't be hundreds of users, I guess, more like dozens at most.
-- Simon.
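Concretely, the "put them all in the Principal array" suggestion turns the per-user read-only clause into a single Statement. This is a sketch based on the example policy quoted earlier in the thread; the user names here are made up:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "readers_can_read",
      "Effect": "Allow",
      "Principal": {"AWS": ["arn:aws:iam:::user/usr_a",
                            "arn:aws:iam:::user/usr_b",
                            "arn:aws:iam:::user/usr_c"]},
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": ["arn:aws:s3:::bucket_policy1",
                   "arn:aws:s3:::bucket_policy1/*"]
    }
  ]
}
```

Adding or removing a reader then only touches the Principal array, not the Statement list.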
[ceph-users] Blog post: storage server power consumption
It was a cold and rainy weekend here, so I did some power measurements
of the three types of storage servers we got over a few years of
running Ceph in production, and compared the results:

https://cloudblog.switch.ch/2017/11/06/ceph-storage-server-power-usage/

The last paragraph contains a challenge to developers: Can we save
more power in "cold storage" applications by turning off idle disks?
Crazy idea, or did anyone already try this?
-- Simon.
Re: [ceph-users] s3 bucket policys
Adam C Emerson writes:
> I'll save you, Citizen! I'm Captain Bucketpolicy!

Good to know!

> So! RGW's bucket policies are currently a subset of what's
> demonstrated in
> http://docs.aws.amazon.com/AmazonS3/latest/dev/example-bucket-policies.html
> The big limitations are that we don't support string interpolation or
> most condition keys, but that shouldn't be an issue for what you're
> doing.

> From your description you should be able to get what you want if you
> set something like this on bucket_upload:

> {
>   "Version": "2012-10-17",
>   "Statement": [
>     {
>       "Sid": "usr_upload_can_write",
>       "Effect": "Allow",
>       "Principal": {"AWS": ["arn:aws:iam:::user/usr_upload"]},
>       "Action": ["s3:ListBucket", "s3:PutObject"],
>       "Resource": ["arn:aws:s3:::bucket_policy1",
>                    "arn:aws:s3:::bucket_policy1/*"]
>     },
>     {
>       "Sid": "usr_process_can_read",
>       "Effect": "Allow",
>       "Principal": {"AWS": ["arn:aws:iam:::user/usr_process"]},
>       "Action": ["s3:ListBucket", "s3:GetObject"],
>       "Resource": ["arn:aws:s3:::bucket_policy1",
>                    "arn:aws:s3:::bucket_policy1/*"]
>     }
>   ]
> }
> [...]

Thanks, that's a great example that seems to fit a use case that we
have. A few questions:

Is this supported by the Luminous version of RadosGW? (Or even Jewel?)

Does this work with Keystone integration, i.e. can we refer to
Keystone users as principals?

Let's say there are many read-only users rather than just one. Would
we simply add a new clause under "Statement" for each such user, or is
there a better way? (I understand that RadosGW doesn't support groups,
which could solve this elegantly and efficiently.)
-- Simon.
Re: [ceph-users] Ceph and IPv4 -> IPv6
> I have it running the other way around. The RGW has IPv4 and IPv6,
> but the Ceph cluster is IPv6-only.
> RGW/librados talks to Ceph over IPv6 and handles client traffic on
> both protocols.
> No problem to run the RGW dual-stacked.

Just for the record, we've been doing exactly the same for several
years, on multiple clusters. So you're not alone!
-- Simon.
Re: [ceph-users] Ceph OSD network with IPv6 SLAAC networks?
Félix Barbeira writes:
> We are implementing an IPv6-native ceph cluster using SLAAC. We have
> some legacy machines that are not capable of using IPv6, only IPv4,
> due to some reasons (yeah, I know). I'm wondering what could happen
> if I use an additional IPv4 on the radosgw in addition to the IPv6
> that is already running. The rest of the ceph cluster components only
> have IPv6; the radosgw would be the only one with IPv4. Do you think
> that this would be a good practice, or should I stick to only IPv6?

That should work fine. We successfully had a similar setup for a long
time. (Except we have been using statically- or
DHCPv6-statefully-configured IPv6 addresses rather than SLAAC.)

If you make RadosGW listen on ::, then it will accept both IPv6 and
IPv4 connections.

Recently we changed our setup slightly: Now we have multiple RadosGW
instances behind a HAproxy front-end - the proxy listens on both IPv6
and IPv4, but will always talk IPv6 to the back-end RadosGW instances.
-- Simon.
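The HAproxy arrangement described above might look roughly like this configuration sketch. All addresses, ports, and file paths here are invented for illustration; the point is that one dual-stack `bind` line serves both address families while the backend servers are IPv6-only RadosGW instances:

```
# Hypothetical haproxy.cfg fragment (addresses/ports/paths made up).
frontend rgw
    # v4v6 makes the :: listener accept IPv4 (as mapped addresses) too
    bind :::443 v4v6 ssl crt /etc/haproxy/rgw.pem
    default_backend rgw_instances

backend rgw_instances
    balance roundrobin
    # IPv6-only RadosGW instances behind the dual-stack frontend
    server rgw1 [2001:db8::21]:7480 check
    server rgw2 [2001:db8::22]:7480 check
```

This also gives the legacy IPv4-only clients a single well-known endpoint while the cluster itself stays IPv6-only.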
Re: [ceph-users] Snapshot Costs
Gregory Farnum writes:
> On Tue, Mar 7, 2017 at 12:43 PM, Kent Borg wrote:
>> I would love it if someone could toss out some examples of the sorts
>> of things snapshots are good for and the sorts of things they are
>> terrible for. (And some hints as to why, please.)

> They're good for CephFS snapshots. They're good at RBD snapshots as
> long as you don't take them too frequently.

We take snapshots of about thirty 2-TB RBD images (Ceph Cinder
volumes) every night. We keep about 60 of each around. Does that still
fall under "reasonable"?

One round of snapshots is deleted every night; that causes significant
load on our cluster - currently Hammer, will be upgraded to Jewel
soon.

Most of the volumes (and thus snapshots) don't have the "object-map"
feature enabled yet; maybe after the Jewel upgrade we can add
object-maps to them to reduce the cost of deleting the snapshots. Do
object-maps help with snap trimming, or am I overly optimistic?
-- Simon.
Re: [ceph-users] Upgrading 2K OSDs from Hammer to Jewel. Our experience
cephmailinglist writes: > e) find /var/lib/ceph/ ! -uid 64045 -print0|xargs -0 chown ceph:ceph > [...] > [...] Also at that time one of our pools got a lot of extra data, > those files were stored with root permissions since we had not > restarted the Ceph daemons yet, and the 'find' in step e found so many > files that xargs (the shell) could not handle it (too many arguments). I've always found it disappointing that xargs behaves like this on many GNU/Linux distributions. I always thought xargs's main purpose in life was to know how many arguments can safely be passed to a process... Anyway, you should be able to limit the number of arguments per invocation by adding something like "-n 100" to the xargs command line. Thanks for sharing your upgrade experiences! -- Simon.
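The batching behaviour of "-n" is easy to see with a toy example (echo standing in for chown):

```shell
# Feed 10 arguments to xargs, at most 3 per invocation:
# echo is run 4 times instead of once.
seq 1 10 | xargs -n 3 echo
# prints:
#   1 2 3
#   4 5 6
#   7 8 9
#   10

# The chown from step e) would become, analogously:
#   find /var/lib/ceph/ ! -uid 64045 -print0 | xargs -0 -n 100 chown ceph:ceph
```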
[ceph-users] RadosGW: No caching when S3 tokens are validated against Keystone?
We're using the Hammer version of RadosGW, with Keystone for authN/Z. When a user started sending a lot of S3 requests (using rclone), the load on our Keystone service skyrocketed. This surprised me because all those requests are from the same user, and RadosGW has caching for Keystone tokens. But looking at the code, this caching only seems to be used by rgw_swift.cc, not by rgw_rest_s3.cc. That would explain why no caching is going on here. Can anyone confirm? And if so, is there a fundamental problem that makes it hard to use caching when validating S3 tokens against a Keystone backend? (Otherwise I guess I should write a feature request and/or start coding this up myself...) Here are the facts for background:

$ sudo ceph --admin-daemon /var/run/ceph/ceph-radosgw.gateway.asok config show | grep keystone
    "rgw_keystone_url": "https:\/\/...",
    "rgw_keystone_admin_token": "...",
    "rgw_keystone_admin_user": "",
    "rgw_keystone_admin_password": "",
    "rgw_keystone_admin_tenant": "",
    "rgw_keystone_accepted_roles": "_member_, ResellerAdmin",
    "rgw_keystone_token_cache_size": "1",
    "rgw_keystone_revocation_interval": "900",
    "rgw_s3_auth_use_keystone": "true",

$ sudo ceph --admin-daemon /var/run/ceph/ceph-radosgw.gateway.asok perf dump | grep token_cache
    "keystone_token_cache_hit": 0,
    "keystone_token_cache_miss": 0

When I turn on debugging (config set debug_rgw 20/20), I get many messages like these:

2017-02-09 21:50:06.606216 7f6ac5d83700 5 s3 keystone: validated token: : expires: 1486680606
2017-02-09 21:50:06.635940 7f6aac550700 20 s3 keystone: trying keystone auth
2017-02-09 21:50:06.747616 7f6aadd53700 20 s3 keystone: trying keystone auth
2017-02-09 21:50:06.818267 7f6ac2d7d700 5 s3 keystone: validated token: : expires: 1486680606
2017-02-09 21:50:06.853492 7f6ab3d5f700 20 s3 keystone: trying keystone auth
2017-02-09 21:50:06.895471 7f6ac5582700 5 s3 keystone: validated token: : expires: 1486680606
2017-02-09 21:50:06.951734 7f6abf576700 5 s3 keystone: validated token: : expires: 1486680606
2017-02-09 21:50:07.016555 7f6ab7566700 5 s3 keystone: validated token: : expires: 1486680606
2017-02-09 21:50:07.038997 7f6ab355e700 20 s3 keystone: trying keystone auth
2017-02-09 21:50:07.160196 7f6ac1d7b700 5 s3 keystone: validated token: : expires: 1486680606
2017-02-09 21:50:07.189930 7f6aaf556700 20 s3 keystone: trying keystone auth
2017-02-09 21:50:07.233593 7f6aabd4f700 5 s3 keystone: validated token: : expires: 1486680607
2017-02-09 21:50:07.263116 7f6abcd71700 5 s3 keystone: validated token: : expires: 1486680607
2017-02-09 21:50:07.263915 7f6ab8d69700 5 s3 keystone: validated token: : expires: 1486680607
2017-02-09 21:50:07.263990 7f6aae554700 20 s3 keystone: trying keystone auth
2017-02-09 21:50:07.280523 7f6ab2d5d700 5 s3 keystone: validated token: : expires: 1486680607
2017-02-09 21:50:07.290892 7f6aa954a700 20 s3 keystone: trying keystone auth
2017-02-09 21:50:07.311201 7f6ab6d65700 20 s3 keystone: trying keystone auth
2017-02-09 21:50:07.317667 7f6aad552700 20 s3 keystone: trying keystone auth
2017-02-09 21:50:07.380957 7f6ab6564700 5 s3 keystone: validated token: : expires: 1486680607
2017-02-09 21:50:07.421227 7f6abd572700 5 s3 keystone: validated token: : expires: 1486680607
2017-02-09 21:50:07.446867 7f6ab0d59700 20 s3 keystone: trying keystone auth
2017-02-09 21:50:07.459225 7f6aa9d4b700 20 s3 keystone: trying keystone auth

and, as I said, our Keystone service is pretty much DoSed right now... -- Simon.
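The kind of cache rgw_swift.cc keeps (and what I'd want in the S3 path) boils down to a small TTL-bounded map keyed on the access key, so repeated requests from the same user only hit Keystone once per validity window. A toy C++ sketch of the idea - not actual RadosGW code, all names invented:

```cpp
// Hypothetical sketch: a tiny TTL-bounded token cache in front of
// the (slow) Keystone validation call.
#include <map>
#include <string>
#include <utility>

struct TokenInfo {
  std::string user;
  long expires;          // like the "expires:" field in the log above
};

class TokenCache {
  // key -> (valid_until, cached validation result)
  std::map<std::string, std::pair<long, TokenInfo>> entries_;
  long ttl_;
public:
  long hits = 0, misses = 0;   // cf. keystone_token_cache_hit/miss counters
  explicit TokenCache(long ttl) : ttl_(ttl) {}

  // Returns cached info, or nullptr if the caller must go to Keystone.
  const TokenInfo* lookup(const std::string& key, long now) {
    auto it = entries_.find(key);
    if (it != entries_.end() && it->second.first > now) {
      ++hits;
      return &it->second.second;
    }
    ++misses;
    return nullptr;
  }

  // Store the result of a successful Keystone validation.
  void store(const std::string& key, const TokenInfo& info, long now) {
    entries_[key] = {now + ttl_, info};
  }
};
```

With something like this consulted before "trying keystone auth", the flood of validate calls above would collapse to one per key per TTL window.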
Re: [ceph-users] 10Gbit switch advice for small ceph cluster upgrade
Erik McCormick writes: > We use Edge-Core 5712-54x running Cumulus Linux. Anything off their > compatibility list would be good though. The switch is 48 10G sfp+ > ports. We just use copper cables with attached sfp. It also had 6 40G > ports. The switch cost around $4800 and the cumulus license is about > 3k for a perpetual license. Similar here, except we use Quanta switches (T5032-LY6). SFP+ slots and DAC cables. Actually our switches are 32*40GE, and we use "fan-out" DAC cables (QSFP on one side, 4 SFP+ on the other). Compared to 10GBaseT (RJ45), DAC cables are thicker, which may complicate cable management a little. On the other hand I think DAC still needs less power than 10GBaseT. And with the 40G setup, we have good port density and a smooth migration path to 40GE. We already use 40GE for our leaf-spine uplinks. Another advantage for us is that we can use a single SKU for both leaf and spine switches. The Cumulus licenses are a bit more expensive for those 40GE switches (as are the switches themselves), but it's still a good deal for us. Maybe these days it makes sense to look at 100GE switches in preference to 40GE; 100GE ports can normally be used as 2*50GE, 4*25GE, 1*40GE or 4*10GE as well, so the upgrade paths seem even nicer. And the prices are getting competitive I think. -- Simon.
Re: [ceph-users] v10.1.0 Jewel release candidate available
Sage Weil writes: > The first release candidate for Jewel is now available! Cool! [...] > Packages for aarch64 will also be posted shortly! According to the announcement, Ubuntu Xenial should now be supported instead of Precise; but I don't see Xenial packages on download.ceph.com. Will those arrive, or should we get them from Canonical's Xenial repo? -- Simon.
[ceph-users] OpenStack Ops Mid-Cycle session on OpenStack/Ceph integration
A "mid-cycle summit" for OpenStack operators will be held in Manchester (England) on Monday/Tuesday next week (15/16 February). The morning session on Tuesday will include a slot on Ceph integration. If you are a Ceph+OpenStack operator, please have a look at the Etherpad with the draft topic list: https://etherpad.openstack.org/p/MAN-ops-Ceph Feel free to add suggestions in-place, or post here. Hope to see some of you in Manchester next week! -- Simon.
Re: [ceph-users] rebooting nodes in a ceph cluster
David Clarke writes: Not directly related to Ceph, but you may want to investigate kexec[0] ('kexec-tools' package in Debian derived distributions) in order to get your machines rebooting quicker. It essentially re-loads the kernel as the last step of the shutdown procedure, skipping over the lengthy BIOS/UEFI/controller firmware etc. boot stages. [0]: http://en.wikipedia.org/wiki/Kexec I'd like to second that recommendation - I only discovered this recently, and on systems with long BIOS initialization, this cuts down the time to reboot *dramatically*, like from 5 minutes down to 1. -- Simon.
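For reference, a manual kexec reboot looks something like this (kernel/initrd paths follow Debian/Ubuntu naming and are assumptions; the kexec-tools package can also hook this into the normal reboot path automatically):

```
# Stage the currently running kernel into the kexec area,
# reusing the current kernel command line.
kexec -l /boot/vmlinuz-$(uname -r) \
      --initrd=/boot/initrd.img-$(uname -r) \
      --reuse-cmdline
# Boot straight into it, skipping BIOS/UEFI and controller firmware.
kexec -e
```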
Re: [ceph-users] Journal, SSD and OS
Gandalf Corvotempesta writes: What do you think of using the same SSD as journal and as root partition? For example: 1x 128GB SSD [...] All logs are stored remotely via rsyslogd. Is this good? AFAIK, in this configuration, ceph will be executed totally in ram. I think this is a fine configuration - you won't be writing to the root partition much, apart from the journals. We also put journals on the same SSDs as root partitions (not that we're very ambitious about performance...). -- Simon.
Re: [ceph-users] Why RBD is not enough [was: Inconsistent view on mounted CephFS]
Maciej Gałkiewicz writes: On 13 September 2013 17:12, Simon Leinen simon.lei...@switch.ch wrote: [We're not using it *instead* of rbd, we're using it *in addition to* rbd. For example, our OpenStack (users') cinder volumes are stored in rbd.] So you probably have cinder volumes in rbd but you boot instances from images. This is why you need cephfs for /var/lib/nova/instances. I suggest creating volumes from images and booting instances from them. Cephfs is not required then. Thanks, I know that we could boot from volume. Two problems: 1.) Our OpenStack installation is not a private cloud; we allow external users to set up VMs. These users need to be able to use the standard workflow (Horizon) to start VMs from an image. 2.) We didn't manage to make boot from volume work with RBD in Folsom. Yes, presumably it works fine in Grizzly and above, so we should just upgrade. What we want to achieve is to have a shared instance store (i.e. /var/lib/nova/instances) across all our nova-compute nodes, so that we can e.g. live-migrate instances between different hosts. And we want to use Ceph for that. In Folsom (but also in Grizzly, I think), this isn't straightforward to do with RBD. A feature[1] to make it more straightforward was merged in Havana(-3) just two weeks ago. I don't get it. I am successfully using live-migration (in Grizzly, haven't tried Folsom) of instances booted from cinder volumes stored as rbd volumes. What is not straightforward to do? Are you using KVM? As I said, boot from volume is not really an option for us. Yes, people want shared storage that they can access in a POSIXly way from multiple VMs. CephFS is a relatively easy way to give them that, though I don't consider it production-ready - mostly because secure isolation between different tenants is hard to achieve. For now GlusterFS may fit better here. Possibly, but it's another system we'd have to learn, configure and support.
And CephFS is already in standard kernels (though obviously it's not reliable, and there may be horrible regressions such as this bug in 3.10). -- Simon.
Re: [ceph-users] Inconsistent view on mounted CephFS
Yan, Zheng writes: On Fri, Sep 13, 2013 at 3:09 PM, Jens-Christian Fischer The problem we see is that the different hosts have different views on the filesystem (i.e. they see different numbers of files). [...] The bug was introduced in the 3.10 kernel, and will be fixed in the 3.12 kernel by commit 590fb51f1c (vfs: call d_op->d_prune() before unhashing dentry). Sage may backport the fix to the 3.11 and 3.10 kernels soon. This would be great! 3.12 won't be out until November. (Downgrading to e.g. 3.9.11 should also fix the issue, right?) Please use ceph-fuse at present. That's what we're doing now, but it seems slower. Best regards, -- Simon.
Re: [ceph-users] Why RBD is not enough [was: Inconsistent view on mounted CephFS]
Just out of curiosity: why are you using cephfs instead of rbd? [We're not using it *instead* of rbd, we're using it *in addition to* rbd. For example, our OpenStack (users') cinder volumes are stored in rbd.] To expand on what my colleague Jens-Christian wrote, two reasons: - We are still on Folsom. What we want to achieve is to have a shared instance store (i.e. /var/lib/nova/instances) across all our nova-compute nodes, so that we can e.g. live-migrate instances between different hosts. And we want to use Ceph for that. In Folsom (but also in Grizzly, I think), this isn't straightforward to do with RBD. A feature[1] to make it more straightforward was merged in Havana(-3) just two weeks ago. - Experience with shared storage, as this is something our customers are asking for all the time. Yes, people want shared storage that they can access in a POSIXly way from multiple VMs. CephFS is a relatively easy way to give them that, though I don't consider it production-ready - mostly because secure isolation between different tenants is hard to achieve. -- Simon.