Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Simon Leinen
[Sorry for the piecemeal information... it's getting late here]

> Oops, I forgot: Before it crashed, it did modify /mnt/ceph/db; the
> overall size of that directory increased(!) from 3.9GB to 12GB.  The
> compaction seems to have eaten two .log files, but created many more
> .sst files.

...and it upgraded the contents of db/CURRENT from "MANIFEST-053662" to
"MANIFEST-079750".

Good night,
-- 
Simon.


Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Simon Leinen
Simon Leinen writes:
> Sage Weil writes:
>> Try 'compact' instead of 'stats'?

> That ran for a while and then crashed, also in the destructor for
> rocksdb::Version, but with an otherwise different backtrace. [...]

Oops, I forgot: Before it crashed, it did modify /mnt/ceph/db; the
overall size of that directory increased(!) from 3.9GB to 12GB.  The
compaction seems to have eaten two .log files, but created many more
.sst files.
-- 
Simon.


Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Simon Leinen
Sage Weil writes:
>> 2019-06-12 23:40:43.555 7f724b27f0c0  1 rocksdb: do_open column families: 
>> [default]
>> Unrecognized command: stats
>> ceph-kvstore-tool: /build/ceph-14.2.1/src/rocksdb/db/version_set.cc:356: 
>> rocksdb::Version::~Version(): Assertion `path_id < 
>> cfd_->ioptions()->cf_paths.size()' failed.
>> *** Caught signal (Aborted) **

> Ah, this looks promising... it looks like it got it open and has some
> problem with the error/teardown path.

> Try 'compact' instead of 'stats'?

That ran for a while and then crashed, also in the destructor for
rocksdb::Version, but with an otherwise different backtrace.  I'm
attaching the log again.
-- 
Simon.
leinen@unil0047:/mnt/ceph/db$ sudo ceph-kvstore-tool rocksdb /mnt/ceph/db 
compact
2019-06-13 00:00:08.650 7f00f4c510c0  1 rocksdb: do_open column families: 
[default]
ceph-kvstore-tool: /build/ceph-14.2.1/src/rocksdb/db/version_set.cc:356: 
rocksdb::Version::~Version(): Assertion `path_id < 
cfd_->ioptions()->cf_paths.size()' failed.
*** Caught signal (Aborted) **
 in thread 7f00e5788700 thread_name:rocksdb:low0
 ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus 
(stable)
 1: (()+0x12890) [0x7f00ea641890]
 2: (gsignal()+0xc7) [0x7f00e9531e97]
 3: (abort()+0x141) [0x7f00e9533801]
 4: (()+0x3039a) [0x7f00e952339a]
 5: (()+0x30412) [0x7f00e9523412]
 6: (rocksdb::Version::~Version()+0x224) [0x5603bd78bfe4]
 7: (rocksdb::Version::Unref()+0x35) [0x5603bd78c065]
 8: (rocksdb::Compaction::~Compaction()+0x25) [0x5603bd880f05]
 9: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, 
rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*)+0xc46) 
[0x5603bd6dab76]
 10: 
(rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*,
 rocksdb::Env::Priority)+0x141) [0x5603bd6dcfa1]
 11: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x97) [0x5603bd6dd5f7]
 12: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x267) 
[0x5603bd8dc847]
 13: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x49) 
[0x5603bd8dca29]
 14: (()+0xbd57f) [0x7f00e9f5757f]
 15: (()+0x76db) [0x7f00ea6366db]
 16: (clone()+0x3f) [0x7f00e961488f]
2019-06-13 00:05:09.471 7f00e5788700 -1 *** Caught signal (Aborted) **
 in thread 7f00e5788700 thread_name:rocksdb:low0

 ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus 
(stable)
 1: (()+0x12890) [0x7f00ea641890]
 2: (gsignal()+0xc7) [0x7f00e9531e97]
 3: (abort()+0x141) [0x7f00e9533801]
 4: (()+0x3039a) [0x7f00e952339a]
 5: (()+0x30412) [0x7f00e9523412]
 6: (rocksdb::Version::~Version()+0x224) [0x5603bd78bfe4]
 7: (rocksdb::Version::Unref()+0x35) [0x5603bd78c065]
 8: (rocksdb::Compaction::~Compaction()+0x25) [0x5603bd880f05]
 9: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, 
rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*)+0xc46) 
[0x5603bd6dab76]
 10: 
(rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*,
 rocksdb::Env::Priority)+0x141) [0x5603bd6dcfa1]
 11: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x97) [0x5603bd6dd5f7]
 12: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x267) 
[0x5603bd8dc847]
 13: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x49) 
[0x5603bd8dca29]
 14: (()+0xbd57f) [0x7f00e9f5757f]
 15: (()+0x76db) [0x7f00ea6366db]
 16: (clone()+0x3f) [0x7f00e961488f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.

--- begin dump of recent events ---
   -23> 2019-06-13 00:00:08.602 7f00f4c510c0  5 asok(0x5603bebba000) 
register_command assert hook 0x5603be844130
   -22> 2019-06-13 00:00:08.602 7f00f4c510c0  5 asok(0x5603bebba000) 
register_command abort hook 0x5603be844130
   -21> 2019-06-13 00:00:08.602 7f00f4c510c0  5 asok(0x5603bebba000) 
register_command perfcounters_dump hook 0x5603be844130
   -20> 2019-06-13 00:00:08.602 7f00f4c510c0  5 asok(0x5603bebba000) 
register_command 1 hook 0x5603be844130
   -19> 2019-06-13 00:00:08.602 7f00f4c510c0  5 asok(0x5603bebba000) 
register_command perf dump hook 0x5603be844130
   -18> 2019-06-13 00:00:08.602 7f00f4c510c0  5 asok(0x5603bebba000) 
register_command perfcounters_schema hook 0x5603be844130
   -17> 2019-06-13 00:00:08.602 7f00f4c510c0  5 asok(0x5603bebba000) 
register_command perf histogram dump hook 0x5603be844130
   -16> 2019-06-13 00:00:08.602 7f00f4c510c0  5 asok(0x5603bebba000) 
register_command 2 hook 0x5603be844130
   -15> 2019-06-13 00:00:08.602 7f00f4c510c0  5 asok(0x5603bebba000) 
register_command perf schema hook 0x5603be844130
   -14> 2019-06-13 00:00:08.602 7f00f4c510c0  5 asok(0x5603bebba000) 
register_command perf histogram schema hook 0x5603be844130
   -13> 2019-06-13 00:00:08.602 7f00f4c510c0  5 asok(0x5603bebba000) 
register_command perf reset hook 0x5603be844130
   -12> 2019-06-13 00:00:08.602 7f00f4c510c0  5 asok(0x5603bebba000) 
register_command config show hook 0x5

Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Simon Leinen
Sage Weil writes:
> What happens if you do

>  ceph-kvstore-tool rocksdb /mnt/ceph/db stats

(I'm afraid that our ceph-kvstore-tool doesn't know about a "stats"
command; but it still tries to open the database.)

That aborts after complaining about many missing files in /mnt/ceph/db.

When I ( cd /mnt/ceph/db && sudo ln -s ../db.slow/* . ) and re-run,
it still aborts, just without complaining about missing files.

I'm attaching the output (stdout+stderr combined), in case that helps.

> or, if that works,

>  ceph-kvstore-tool rocksdb /mnt/ceph/db compact

> It looks like bluefs is happy (in that it can read the whole set 
> of rocksdb files), so the question is if rocksdb can open them, or 
> if there's some corruption or problem at the rocksdb level.

> The original crash is actually here:

>  ...
>  9: (tc_new()+0x283) [0x7fbdbed8e943]
>  10: (std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> >::_M_mutate(unsigned long, unsigned long, char const*,
> unsigned long)+0x69) [0x5600b1268109]
>  11: (std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> >::_M_append(char const*, unsigned long)+0x63)
> [0x5600b12f5b43]
>  12: (rocksdb::BlockBuilder::Add(rocksdb::Slice const&, rocksdb::Slice 
> const&, rocksdb::Slice const*)+0x10b) [0x5600b1eaca9b]
>  ...

> where tc_new is (I think) tcmalloc.  Which looks to me like rocksdb 
> is probably trying to allocate something very big.  The question is will 
> that happen with the exported files or only on bluefs...

Yes, that's what I was thinking as well.  The server seems to have about
50GB of free RAM though, so maybe it was more like absurdly big :-)

Also, your ceph-kvstore-tool command seems to have crashed somewhere
else (the destructor of a rocksdb::Version object?):

  2019-06-12 23:40:43.555 7f724b27f0c0  1 rocksdb: do_open column families: 
[default]
  Unrecognized command: stats
  ceph-kvstore-tool: /build/ceph-14.2.1/src/rocksdb/db/version_set.cc:356: 
rocksdb::Version::~Version(): Assertion `path_id < 
cfd_->ioptions()->cf_paths.size()' failed.
  *** Caught signal (Aborted) **
   in thread 7f724b27f0c0 thread_name:ceph-kvstore-to
   ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus 
(stable)
   1: (()+0x12890) [0x7f7240c6f890]
   2: (gsignal()+0xc7) [0x7f723fb5fe97]
   3: (abort()+0x141) [0x7f723fb61801]
   4: (()+0x3039a) [0x7f723fb5139a]
   5: (()+0x30412) [0x7f723fb51412]
   6: (rocksdb::Version::~Version()+0x224) [0x559749529fe4]
   7: (rocksdb::Version::Unref()+0x35) [0x55974952a065]
   8: (rocksdb::SuperVersion::Cleanup()+0x68) [0x55974960f328]
   9: (rocksdb::ColumnFamilyData::~ColumnFamilyData()+0xf4) [0x5597496123d4]
   10: (rocksdb::ColumnFamilySet::~ColumnFamilySet()+0xb8) [0x559749612ba8]
   11: (rocksdb::VersionSet::~VersionSet()+0x4d) [0x55974951da5d]
   12: (rocksdb::DBImpl::CloseHelper()+0x6a8) [0x55974944a868]
   13: (rocksdb::DBImpl::~DBImpl()+0x65b) [0x559749455deb]
   14: (rocksdb::DBImpl::~DBImpl()+0x11) [0x559749455e21]
   15: (RocksDBStore::~RocksDBStore()+0xe9) [0x559749265349]
   16: (RocksDBStore::~RocksDBStore()+0x9) [0x559749265599]
   17: (main()+0x307) [0x5597490b5fb7]
   18: (__libc_start_main()+0xe7) [0x7f723fb42b97]
   19: (_start()+0x2a) [0x55974918e03a]
  2019-06-12 23:40:51.363 7f724b27f0c0 -1 *** Caught signal (Aborted) **
   in thread 7f724b27f0c0 thread_name:ceph-kvstore-to

> Thanks!

Thanks so much for looking into this!

We hope that we can get some access to S3 bucket indexes back, possibly
by somehow dropping and re-creating those indexes.
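
(For the archives, the sort of thing we have in mind is sketched below.
The bucket name is made up, and whether "bucket check" can really rebuild
an index whose shards live on a dead PG is exactly the open question, so
treat this as an illustration rather than a recipe.)

  # look at the bucket and its index shards
  radosgw-admin bucket stats --bucket=mybucket

  # attempt to rebuild the index by walking the bucket's objects
  radosgw-admin bucket check --bucket=mybucket --check-objects --fix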
-- 
Simon.


Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-12 Thread Simon Leinen
Dear Sage,

> Also, can you try ceph-bluestore-tool bluefs-export on this osd?  I'm
> pretty sure it'll crash in the same spot, but just want to confirm
> it's a bluefs issue.

To my surprise, this actually seems to have worked:

  $ time sudo ceph-bluestore-tool --out-dir /mnt/ceph bluefs-export --path 
/var/lib/ceph/osd/ceph-49
  inferring bluefs devices from bluestore path
   slot 2 /var/lib/ceph/osd/ceph-49/block -> /dev/dm-9
   slot 1 /var/lib/ceph/osd/ceph-49/block.db -> /dev/dm-8
  db/
  db/072900.sst
  db/072901.sst
  db/076487.sst
  db/076488.sst
  db/076489.sst
  db/076490.sst
  [...]
  db/079726.sst
  db/079727.log
  db/CURRENT
  db/IDENTITY
  db/LOCK
  db/MANIFEST-053662
  db/OPTIONS-053662
  db/OPTIONS-053665
  db.slow/
  db.slow/049192.sst
  db.slow/049193.sst
  db.slow/049831.sst
  db.slow/057443.sst
  db.slow/057444.sst
  db.slow/058254.sst
  [...]
  db.slow/079718.sst
  db.slow/079719.sst
  db.slow/079720.sst
  db.slow/079721.sst
  db.slow/079722.sst
  db.slow/079723.sst
  db.slow/079724.sst
  
  real  5m19.953s
  user  0m0.101s
  sys   1m5.571s
  leinen@unil0047:/var/lib/ceph/osd/ceph-49$

It left 3GB in /mnt/ceph/db (55 files of varying sizes), and 39GB in
/mnt/ceph/db.slow (620 files of mostly 68MB each).

Is there anything we can do with this?
-- 
Simon.


Re: [ceph-users] Migrating to a dedicated cluster network

2019-01-26 Thread Simon Leinen
Paul Emmerich writes:
> Split networks is rarely worth it. One fast network is usually better.
> And since you mentioned having only two interfaces: one bond is way
> better than two independent interfaces.

> IPv4/6 dual stack setups will be supported in Nautilus, you currently
> have to use either IPv4 or IPv6.

> Jumbo frames: often mentioned but usually not worth it.
> (Yes, I know that this is somewhat controversial and increasing MTU is
> often a standard trick for performance tuning, but I have yet to see
> a benchmark that actually shows a significant performance
> improvement. Some quick tests show that I can save around 5-10% CPU
> load on a system doing ~50 gbit/s of IO traffic which is almost
> nothing given the total system load)

Agree with everything Paul said.  (I know this is lame, but I think all
of this bears repeating :-)

To address another question in Jan's original post:

I would not consider using link-local IPv6 addressing.  Not just because
I doubt that this would work (Ceph would always need to know/tell the OS
which interface it should use with such an address), but mainly because
even if it does work, it will only work as long as everything is on a
single logical IPv6 network.  This will artificially limit your options
for the evolution of your cluster.

Routable addresses are cheap in IPv6, use them!
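
To make that concrete, here is a minimal sketch of what a routable-IPv6
ceph.conf fragment looks like (the prefix is a documentation address, and
a separate cluster network is optional - as Paul said, one network is
usually fine):

  [global]
  ms_bind_ipv6 = true
  # one routable IPv6 network for everything
  public network = 2001:db8:100::/64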
-- 
Simon.


Re: [ceph-users] Using Ceph central backup storage - Best practice creating pools

2019-01-26 Thread Simon Leinen
cmonty14  writes:
> due to performance issues RGW is not an option.  This statement may be
> wrong, but there's the following aspect to consider.

> If I write a backup that is typically a large file, this is normally a
> single IO stream.
> This causes massive performance issues on Ceph because this single IO
> stream is sequentially written in small pieces on OSDs.
> To overcome this issue multi IO stream should be used when writing
> large files, and this means the application writing the backup must
> support multi IO stream.

RGW (and the S3 protocol in general) supports multi-stream uploads
nicely, via the "multipart upload" feature: You split your file into
many pieces, which can be uploaded in parallel.

RGW with multipart uploads seems like a good fit for your application.
It could solve your naming and permission issues, has low overhead, and
could give you good performance as long as you use multipart uploads
with parallel threads.  You just need to make sure that your RGW
gateways have enough throughput, but this capacity is relatively easy
and inexpensive to provide.
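
As an illustration (endpoint and bucket names are placeholders): with the
stock AWS CLI pointed at an RGW endpoint, large objects are uploaded as
multipart automatically, and the part size and parallelism can be tuned on
the client side:

  # tune part size and number of parallel part uploads (client-side settings)
  aws configure set default.s3.multipart_chunksize 64MB
  aws configure set default.s3.max_concurrent_requests 16

  # a large backup file is then split into parts and uploaded in parallel
  aws --endpoint-url https://rgw.example.com s3 cp db-backup.dump s3://backups/db-backup.dump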

> Considering this the following question comes up: If I write a backup
> into a RBD (that could be considered as a network share), will Ceph
> use single IO stream or multi IO stream on storage side?

Ceph should be able to handle multiple parallel streams of I/O to an RBD
device (in general, writes will go to different "chunks" of the RBD, and
those chunk objects will be on different OSDs).  But it's another
question whether your RBD client will be able to issue parallel streams
of requests.  Usually you have some kind of file system and kernel block
I/O layer on the client side, and it's possible that those will
serialize I/O, which will make it hard to get high throughput.
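
One way to see where the bottleneck is: benchmark the RBD image itself
with several concurrent writers, and compare that with what your backup
tool achieves through the filesystem on top.  A sketch (pool/image names
made up; needs a reasonably recent rbd CLI):

  # drive 16 parallel 4MB writes directly against an RBD image
  rbd bench --io-type write --io-size 4M --io-threads 16 --io-total 4G backups/testimage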
-- 
Simon.

> THX

> On Tue, 22 Jan 2019 at 23:20, Christian Wuerdig wrote:
>> 
>> If you use librados directly it's up to you to ensure you can
>> identify your objects. Generally RADOS stores objects and not files
>> so when you provide your object ids you need to come up with a
>> convention so you can correctly identify them. If you need to
>> provide meta data (i.e. a list of all existing backups, when they
>> were taken etc.) then again you need to manage that yourself
>> (probably in dedicated meta-data objects). Using RADOS namespaces
>> (like one per database) is probably a good idea.
>> Also keep in mind that for example Bluestore has a maximum object
>> size of 4GB so mapping files 1:1 to object is probably not a wise
>> approach and you should breakup your files into smaller chunks when
>> storing them. There is libradosstriper which handles the striping of
>> large objects transparently but not sure if that has support for
>> RADOS namespaces.
>> 
>> Using RGW instead might be an easier route to go down
>> 
>> On Wed, 23 Jan 2019 at 10:10, cmonty14 <74cmo...@gmail.com> wrote:
>>> 
>>> My backup client is using librados.
>>> I understand that defining a pool for the same application is recommended.
>>> 
>>> However this would not answer my other questions:
>>> How can I identify a backup created by client A that I want to restore
>>> on another client Z?
>>> I mean typically client A would write a backup file identified by the
>>> filename.
>>> Would it be possible on client Z to identify this backup file by
>>> filename? If yes, how?
>>> 
>>> On Tue, 22 Jan 2019 at 15:07, [...] wrote:
>>> >
>>> > Hi,
>>> >
>>> > Ceph's pool are meant to let you define specific engineering rules
>>> > and/or application (rbd, cephfs, rgw)
>>> > They are not designed to be created in a massive fashion (see pgs etc)
>>> > So, create a pool for each engineering ruleset, and store your data in 
>>> > them
>>> > For what is left of your project, I believe you have to implement that
>>> > on top of Ceph
>>> >
>>> > For instance, let say you simply create a pool, with a rbd volume in it
>>> > You then create a filesystem on that, and map it on some server
>>> > Finally, you can push your files on that mountpoint, using various
>>> > Linux's user, acl or whatever : beyond that point, there is nothing more
>>> > specific to Ceph, it is "just" a mounted filesystem
>>> >
>>> > Regards,
>>> >
>>> > On 01/22/2019 02:16 PM, cmonty14 wrote:
>>> > > Hi,
>>> > >
>>> > > my use case for Ceph is providing a central backup storage.
>>> > > This means I will backup multiple databases in Ceph storage cluster.
>>> > >
>>> > > This is my question:
>>> > > What is the best practice for creating pools & images?
>>> > > Should I create multiple pools, means one pool per database?
>>> > > Or should I create a single pool "backup" and use namespace when writing
>>> > > data in the pool?
>>> > >
>>> > > This is the security demand that should be considered:
>>> > > DB-owner A can only modify the files that belong to A; other files
>>> > > (owned by B, C or D) are accessible for A.
>>> > >
>>> > > And there's another issue:
>>> > > How can I identify a backup created by client A 

Re: [ceph-users] Ceph in OSPF environment

2019-01-21 Thread Simon Leinen
Burkhard Linke writes:
> I'm curious.what is the advantage of OSPF in your setup over
> e.g. LACP bonding both links?

Good question! Some people (including myself) are uncomfortable with
LACP (in particular "MLAG", i.e. port aggregation across multiple
chassis), and with fancy L2 setups in general.  Concrete issues include

* worries about compatibility between different MLAG implementations,
  which tend to vary subtly between vendors
* worries about general reliability of L2 in complex topologies with
  lots of potential loops - as soon as spanning tree makes ONE brief
  mistake you can end up with broadcast storms and subsequent meltdowns
* requirement to have direct inter-switch links between switches that
  share a VLAN with LACP; this runs counter to the fashionable "Clos"
  (aka fat-tree aka leaf/spine) topology.

(Also you are talking about "both links"... I don't know about LACP, but
with OSPF it's trivial to use arbitrary numbers of uplinks - e.g. 3 - to
arbitrary routers/switches; you can also move servers around freely
between "leaf" switches.)

With modern L3 routing protocol implementations, you can use simple and
generic configurations using "unnumbered" routing adjacency
definitions.  OSPF has had this for a long time, and modern BGP
implementations (e.g. FRR) also do.  This neutralizes one of the most
important advantages of L2 networks, namely easy configuration.

Routing protocols such as OSPF, IS-IS, or BGP-4 have proven their
robustness in dynamic and wild environments, e.g. the Internet.  IP
forwarding also has the wonderful "TTL" mechanism, which makes the
occasional routing loop much less disastrous.

By the way, a thoroughly "unnumbered" routing configuration can
alleviate the original poster's problem, because it lets you use a
single IP address (or a single IPv4 + a single IPv6 address...) across
all interfaces.  We use this in our setup, previously with OSPF(v2+v3),
now using BGP-4, between hosts running Ubuntu+FRR and switches running
Cumulus Linux (also with FRR) following Cumulus Networks's "routing to
the host" model.
-- 
Simon.


[ceph-users] Avoid Ubuntu Linux kernel 4.15.0-36

2018-10-28 Thread Simon Leinen
As a little "heads-up":

If you are running Ubuntu Bionic 18.04, or Xenial 16.04 with "HWE"
kernels, and have systems running under 4.15.0-36 - which was the
default between 2018-10-01 and 2018-10-22 - please consider upgrading to
the latest 4.15.0-38 ASAP (or downgrading to 4.15.0-34).

4.15.0-36 has a TCP bug[1] that can occasionally slow down a TCP
connection to a trickle of 2.5 Kbytes/s (512-byte segments every 200ms).
Once a TCP connection is in this state, it will never get out.

This started happening within our Ceph clusters after we reinstalled a
few servers as part of our Bluestore migration.  The effect on our RBD
users (OpenStack VMs) was pretty terrible - the typical 4MB transaction
would take about 27 MINUTES at this rate, causing timeouts and crashes.

This was absolutely painful to diagnose, because it happened so rarely
and was hard to reproduce.  Fortunately the fix is easy - just don't run
this kernel.
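
If you want to check quickly which of your hosts are affected, something
like this does the job (host names are obviously placeholders):

  # list the running kernel on every Ceph host; anything on 4.15.0-36
  # should be rebooted into -38 (or -34)
  for h in ceph-osd-{01..12} ceph-mon-{1..3}; do
      echo -n "$h: "; ssh "$h" uname -r
  done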

I should note that our Ceph clusters run over IPv6; I'm not sure whether
the TCP bug can hit with IPv4 (the bug was reported for IPv6 as well),
although I see no reason why it shouldn't.
-- 
Simon.
[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1796895


Re: [ceph-users] Bluestore caching, flawed by design?

2018-04-02 Thread Simon Leinen
Christian Balzer writes:
> On Mon, 2 Apr 2018 08:33:35 +0200 John Hearns wrote:
>> Christian, you mention single socket systems for storage servers.
>> I often thought that the Xeon-D would be ideal as a building block for
>> storage servers
>> https://www.intel.com/content/www/us/en/products/processors/xeon/d-processors.html
>> Low power, and a complete System-On-Chip with 10gig Ethernet.
>> 
> If you (re)search the ML archives you should be able find discussions
> about this and I seem to remember them coming up as well.

> If you're going to have a typical HDDs for storage and 1-2 SSDs for
> journal/WAL/DB setup, they should do well enough.

We have such systems (QuantaGrid SD1Q-1ULH with Xeon D-1541) and are
generally happy with them.  They are certainly very power-efficient.

> But in that scenario you're likely not all that latency conscious
> about NUMA issues to begin with, given that current CPU interlinks are
> quite decent.

Right.

> They however do feel underpowered when mated with really fast (NVMe) or
> more than 4 SSDs per node if you have a lot of small writes.
[...]

The new Xeon-D 2100 series looks promising.  I haven't seen any
storage-optimized servers based on this yet, though.
-- 
Simon.


[ceph-users] SRV mechanism for looking up mons lacks IPv6 support

2018-02-21 Thread Simon Leinen
We just upgraded our last cluster to Luminous.  Since we might need to
renumber our mons in the not-too-distant future, it would be nice to
remove the literal IP addresses of the mons from ceph.conf.  Kraken and
above support a DNS-based mechanism for this based on SRV records[1].
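
For reference, the mechanism boils down to publishing one SRV record per
mon and telling clients which service name to look up (names below are
placeholders; see [1] for details).  The targets of the SRV records then
need A records - and, once this issue is fixed, AAAA records - for the
mon hosts:

  ; DNS zone: one SRV record per mon
  _ceph-mon._tcp.ceph.example.com. 3600 IN SRV 10 60 6789 mon1.ceph.example.com.
  _ceph-mon._tcp.ceph.example.com. 3600 IN SRV 10 60 6789 mon2.ceph.example.com.
  _ceph-mon._tcp.ceph.example.com. 3600 IN SRV 10 60 6789 mon3.ceph.example.com.

  # ceph.conf then no longer needs "mon host = ..."; the default service
  # name is "ceph-mon", but it can be overridden:
  [global]
  mon_dns_srv_name = ceph-mon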

Unfortunately our Rados cluster is IPv6-based, and in testing we found
out that the code that resolves these SRV records only looks for IPv4
addresses (A records) of the hostnames that the SRVs point to.

I just created issue #23078[2] for this.  The description points to
where I think the code would need to be changed.  If I can do anything
to help (in particular test fixes), please let me know.

This might be relevant to others who run IPv6 Rados clusters.
-- 
Simon.
[1] http://docs.ceph.com/docs/master/rados/configuration/mon-lookup-dns/
[2] http://tracker.ceph.com/issues/23078


Re: [ceph-users] s3 bucket policys

2017-11-07 Thread Simon Leinen
Simon Leinen writes:
> Adam C Emerson writes:
>> On 03/11/2017, Simon Leinen wrote:
>> [snip]
>>> Is this supported by the Luminous version of RadosGW?

>> Yes! There's a few bugfixes in master that are making their way into
>> Luminous, but Luminous has all the features at present.

> Does that mean it should basically work in 10.2.1?

Sorry, I meant to say "in 12.2.1"!!!
-- 
Simon.


Re: [ceph-users] s3 bucket policys

2017-11-07 Thread Simon Leinen
Adam C Emerson writes:
> On 03/11/2017, Simon Leinen wrote:
> [snip]
>> Is this supported by the Luminous version of RadosGW?

> Yes! There's a few bugfixes in master that are making their way into
> Luminous, but Luminous has all the features at present.

Does that mean it should basically work in 10.2.1?

>> (Or even Jewel?)

> No!

I see; this will definitely motivate us to speed up our Luminous upgrade!

>> Does this work with Keystone integration, i.e. can we refer to Keystone
>> users as principals?

> In principle probably. I haven't tried it and I don't really know
> much about Keystone at present. It is hooked into the various
> IdentityApplier classes and if RGW thinks a Keystone user is a
> 'user' and you supply whatever RGW thinks its username is, then it
> should work fine. I haven't tried it, though.

Unless someone beats us to it, we'll try as soon as we have our
cluster (with Keystone integration) in Luminous.

>> Let's say there are many read-only users rather than just one.  Would we
>> simply add a new clause under "Statement" for each such user, or is
>> there a better way? (I understand that RadosGW doesn't support groups,
>> which could solve this elegantly and efficiently.)

> If you want to give a large number of users the same permissions, just
> put them all in the Principal array.

Right, thanks for the tip! That makes it more compact.  For our use
case it won't be hundreds of users, I guess, more like dozens at most.
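
For the archives, the read-only statement would then look something like
this (user names made up):

  "Principal": {"AWS": ["arn:aws:iam:::user/usr_process",
                        "arn:aws:iam:::user/usr_report",
                        "arn:aws:iam:::user/usr_audit"]},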
-- 
Simon.


[ceph-users] Blog post: storage server power consumption

2017-11-06 Thread Simon Leinen
It was a cold and rainy weekend here, so I did some power measurements
of the three types of storage servers we got over a few years of running
Ceph in production, and compared the results:

https://cloudblog.switch.ch/2017/11/06/ceph-storage-server-power-usage/

The last paragraph contains a challenge to developers: Can we save more
power in "cold storage" applications by turning off idle disks?
Crazy idea, or did anyone already try this?
-- 
Simon.


Re: [ceph-users] s3 bucket policys

2017-11-03 Thread Simon Leinen
Adam C Emerson writes:
> I'll save you, Citizen! I'm Captain Bucketpolicy!

Good to know!

> So! RGW's bucket policies are currently a subset of what's
> demonstrated in
> http://docs.aws.amazon.com/AmazonS3/latest/dev/example-bucket-policies.html

> The big limitations are that we don't support string interpolation or
> most condition keys, but that shouldn't be an issue for what you're
> doing.

> From your description you should be able to get what you want if you
> set something like this on bucket_upload:

> {
> "Version": "2012-10-17",
> "Statement": [
>   {
>   "Sid": "usr_upload_can_write",
>   "Effect": "Allow",
>   "Principal": {"AWS": ["arn:aws:iam:::user/usr_upload"]},
>   "Action": ["s3:ListBucket", "s3:PutObject"],
>   "Resource": ["arn:aws:s3:::bucket_policy1",
>"arn:aws:s3:::bucket_policy1/*"]
>   },
>   {
>   "Sid": "usr_process_can_read",
>   "Effect": "Allow",
>   "Principal": {"AWS": ["arn:aws:iam:::user/usr_process"]},
>   "Action": ["s3:ListBucket", "s3:GetObject"],
>   "Resource": ["arn:aws:s3:::bucket_policy1",
>"arn:aws:s3:::bucket_policy1/*"]
>   }
> ]
> }
[...]

Thanks, that's a great example that seems to fit a use case that we
have.  A few questions:

Is this supported by the Luminous version of RadosGW? (Or even Jewel?)

Does this work with Keystone integration, i.e. can we refer to Keystone
users as principals?

Let's say there are many read-only users rather than just one.  Would we
simply add a new clause under "Statement" for each such user, or is
there a better way? (I understand that RadosGW doesn't support groups,
which could solve this elegantly and efficiently.)
-- 
Simon.


Re: [ceph-users] Ceph and IPv4 -> IPv6

2017-07-02 Thread Simon Leinen
> I have it running the other way around. The RGW has IPv4 and IPv6, but
> the Ceph cluster is IPv6-only.

> RGW/librados talks to Ceph over IPv6 and handles client traffic on
> both protocols.

> No problem to run the RGW dual-stacked.

Just for the record, we've been doing exactly the same for several
years, on multiple clusters.  So you're not alone!
-- 
Simon.


Re: [ceph-users] Ceph OSD network with IPv6 SLAAC networks?

2017-04-18 Thread Simon Leinen
Félix Barbeira writes:
> We are implementing an IPv6 native ceph cluster using SLAAC. We have
> some legacy machines that are not capable of using IPv6, only IPv4 due
> to some reasons (yeah, I know). I'm wondering what could happen if I
> use an additional IPv4 on the radosgw in addition to the IPv6 that is
> already running. The rest of the ceph cluster components only have
> IPv6, the radosgw would be the only one with IPv4. Do you think that
> this would be a good practice or should I stick to only IPv6?

That should work fine. We successfully had a similar setup for a long time.
(Except we have been using statically- or DHCPv6-statefully-configured
IPv6 addresses rather than SLAAC.)

If you make RadosGW listen on ::, then it will accept both IPv6 and IPv4
connections.
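
With the civetweb frontend that is roughly the following (a sketch -
instance name and port are placeholders, and depending on your sysctl
setting for net.ipv6.bindv6only you may want to list an explicit IPv4
port as well):

  [client.rgw.gateway1]
  rgw frontends = civetweb port=[::]:7480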

Recently we changed our setup slightly: Now we have multiple RadosGW
instances behind a HAproxy front-end - the proxy listens on both IPv6
and IPv4, but will always talk IPv6 to the back-end RadosGW instances.
-- 
Simon.


Re: [ceph-users] Snapshot Costs

2017-03-19 Thread Simon Leinen
Gregory Farnum writes:
> On Tue, Mar 7, 2017 at 12:43 PM, Kent Borg  wrote:
>> I would love it if someone could toss out some examples of the sorts
>> of things snapshots are good for and the sorts of things they are
>> terrible for.  (And some hints as to why, please.)

> They're good for CephFS snapshots. They're good at RBD snapshots as
> long as you don't take them too frequently.

We take snapshots of about thirty 2-TB RBD images (Ceph Cinder volumes)
every night.  We keep about 60 of each around.  Does that still fall
under "reasonable"?

One round of snapshots is deleted every night; that causes significant
load on our cluster - currently Hammer, will be upgraded to Jewel soon.

Most of the volumes (and thus snapshots) don't have the "object-map"
feature enabled yet; maybe after the Jewel upgrade we can add
object-maps to them to reduce the cost of deleting the snapshots.
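
In case it's useful to others, adding the features after the fact looks
roughly like this (image name made up; object-map needs exclusive-lock,
and the rebuild step walks the whole image, so it isn't free):

  # enable the features on an existing image ...
  rbd feature enable volumes/volume-1234 exclusive-lock object-map fast-diff
  # ... and build the initial object map
  rbd object-map rebuild volumes/volume-1234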

Do object-maps help with snap trimming, or am I overly optimistic?
-- 
Simon.


Re: [ceph-users] Upgrading 2K OSDs from Hammer to Jewel. Our experience

2017-03-19 Thread Simon Leinen
cephmailinglist  writes:
> e) find /var/lib/ceph/ ! -uid 64045 -print0|xargs -0  chown ceph:ceph
> [...]

> [...] Also at that time one of our pools got a lot of extra data,
> those files where stored with root permissions since we did not
> restarted the Ceph daemons yet, the 'find' in step e found so much
> files that xargs (the shell) could not handle it (too many arguments).

I've always found it disappointing that xargs behaves like this on many
GNU/Linux distributions.  I always thought xargs's main purpose in life
was to know how many arguments can safely be passed to a process...

Anyway, you should be able to limit the number of arguments per
invocation by adding something like "-n 100" to the xargs command line.
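
In other words, something like this untested sketch:

  # limit each chown invocation to a manageable number of arguments
  find /var/lib/ceph ! -uid 64045 -print0 | xargs -0 -n 100 chown ceph:ceph

  # or let find batch the arguments itself
  find /var/lib/ceph ! -uid 64045 -exec chown ceph:ceph {} +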

Thanks for sharing your upgrade experiences!
-- 
Simon.


[ceph-users] RadosGW: No caching when S3 tokens are validated against Keystone?

2017-02-09 Thread Simon Leinen
We're using the Hammer version of RadosGW, with Keystone for authN/Z.
When a user started sending a lot of S3 requests (using rclone), the
load on our Keystone service has skyrocketed.

This surprised me because all those requests are from the same user, and
RadosGW has caching for Keystone tokens.  But looking at the code, this
caching only seems to be used by rgw_swift.cc, not by rgw_rest_s3.cc.
That would explain why no caching is going on here.

Can anyone confirm?

And if so, is there a fundamental problem that makes it hard to use
caching when validating S3 tokens against a Keystone backend?

(Otherwise I guess I should write a feature request and/or start coding
this up myself...)

Here are the facts for background:

$ sudo ceph --admin-daemon /var/run/ceph/ceph-radosgw.gateway.asok config 
show | grep keystone
"rgw_keystone_url": "https:\/\/...",
"rgw_keystone_admin_token": "...",
"rgw_keystone_admin_user": "",
"rgw_keystone_admin_password": "",
"rgw_keystone_admin_tenant": "",
"rgw_keystone_accepted_roles": "_member_, ResellerAdmin",
"rgw_keystone_token_cache_size": "1",
"rgw_keystone_revocation_interval": "900",
"rgw_s3_auth_use_keystone": "true",

$ sudo ceph --admin-daemon /var/run/ceph/ceph-radosgw.gateway.asok perf 
dump | grep token_cache
"keystone_token_cache_hit": 0,
"keystone_token_cache_miss": 0

When I turn on debugging (config set debug_rgw 20/20), I get many
messages like these:

2017-02-09 21:50:06.606216 7f6ac5d83700  5 s3 keystone: validated token: 
: expires: 1486680606
2017-02-09 21:50:06.635940 7f6aac550700 20 s3 keystone: trying keystone auth
2017-02-09 21:50:06.747616 7f6aadd53700 20 s3 keystone: trying keystone auth
2017-02-09 21:50:06.818267 7f6ac2d7d700  5 s3 keystone: validated token: 
: expires: 1486680606
2017-02-09 21:50:06.853492 7f6ab3d5f700 20 s3 keystone: trying keystone auth
2017-02-09 21:50:06.895471 7f6ac5582700  5 s3 keystone: validated token: 
: expires: 1486680606
2017-02-09 21:50:06.951734 7f6abf576700  5 s3 keystone: validated token: 
: expires: 1486680606
2017-02-09 21:50:07.016555 7f6ab7566700  5 s3 keystone: validated token: 
: expires: 1486680606
2017-02-09 21:50:07.038997 7f6ab355e700 20 s3 keystone: trying keystone auth
2017-02-09 21:50:07.160196 7f6ac1d7b700  5 s3 keystone: validated token: 
: expires: 1486680606
2017-02-09 21:50:07.189930 7f6aaf556700 20 s3 keystone: trying keystone auth
2017-02-09 21:50:07.233593 7f6aabd4f700  5 s3 keystone: validated token: 
: expires: 1486680607
2017-02-09 21:50:07.263116 7f6abcd71700  5 s3 keystone: validated token: 
: expires: 1486680607
2017-02-09 21:50:07.263915 7f6ab8d69700  5 s3 keystone: validated token: 
: expires: 1486680607
2017-02-09 21:50:07.263990 7f6aae554700 20 s3 keystone: trying keystone auth
2017-02-09 21:50:07.280523 7f6ab2d5d700  5 s3 keystone: validated token: 
: expires: 1486680607
2017-02-09 21:50:07.290892 7f6aa954a700 20 s3 keystone: trying keystone auth
2017-02-09 21:50:07.311201 7f6ab6d65700 20 s3 keystone: trying keystone auth
2017-02-09 21:50:07.317667 7f6aad552700 20 s3 keystone: trying keystone auth
2017-02-09 21:50:07.380957 7f6ab6564700  5 s3 keystone: validated token: 
: expires: 1486680607
2017-02-09 21:50:07.421227 7f6abd572700  5 s3 keystone: validated token: 
: expires: 1486680607
2017-02-09 21:50:07.446867 7f6ab0d59700 20 s3 keystone: trying keystone auth
2017-02-09 21:50:07.459225 7f6aa9d4b700 20 s3 keystone: trying keystone auth

and, as I said, our Keystone service is pretty much DoSed right now...
-- 
Simon.


Re: [ceph-users] 10Gbit switch advice for small ceph cluster upgrade

2016-10-31 Thread Simon Leinen
Erik McCormick writes:
> We use Edge-Core 5712-54x running Cumulus Linux. Anything off their
> compatibility list would be good though. The switch is 48 10G sfp+
> ports. We just use copper cables with attached sfp. It also had 6 40G
> ports. The switch cost around $4800 and the cumulus license is about
> 3k for a perpetual license.

Similar here, except we use Quanta switches (T5032-LY6).

SFP+ slots and DAC cables.  Actually our switches are 32*40GE, and we
use "fan-out" DAC cables (QSFP on one side, 4 SFP+ on the other).

Compared to 10GBaseT (RJ45), DAC cables are thicker, which may
complicate cable management a little.  On the other hand I think DAC
still needs less power than 10GBaseT.  And with the 40G setup, we have
good port density and a smooth migration path to 40GE.  We already use
40GE for our leaf-spine uplinks.  Another advantage for us is that we
can use a single SKU for both leaf and spine switches.

The Cumulus licenses are a bit more expensive for those 40GE switches
(as are the switches themselves), but it's still a good deal for us.

Maybe these days it makes sense to look at 100GE switches in preference
to 40GE; 100GE ports can normally be used as 2*50GE, 4*25GE, 1*40GE or
4*10GE as well, so the upgrade paths seem even nicer.  And the prices
are getting competitive I think.
-- 
Simon.


Re: [ceph-users] v10.1.0 Jewel release candidate available

2016-03-28 Thread Simon Leinen
Sage Weil writes:
> The first release candidate for Jewel is now available!

Cool!

[...]
> Packages for aarch64 will also be posted shortly!

According to the announcement, Ubuntu Xenial should now be supported
instead of Precise; but I don't see Xenial packages on
download.ceph.com.  Will those arrive, or should we get them from
Canonical's Xenial repo?
-- 
Simon.


[ceph-users] OpenStack Ops Mid-Cycle session on OpenStack/Ceph integration

2016-02-11 Thread Simon Leinen
A "mid-cycle summit" for OpenStack operators will be held in Manchester
(England) on Monday/Tuesday next week (15/16 February).

The morning session on Tuesday will include a slot on Ceph integration.

If there are any Ceph+OpenStack operators, please have a look at the
Etherpad with the draft topic list:

https://etherpad.openstack.org/p/MAN-ops-Ceph

Feel free to add suggestions in-place, or post here and/or on


Hope to see some of you in Manchester next week!
-- 
Simon.


Re: [ceph-users] rebooting nodes in a ceph cluster

2013-12-20 Thread Simon Leinen
David Clarke writes:
> Not directly related to Ceph, but you may want to investigate kexec[0]
> ('kexec-tools' package in Debian-derived distributions) in order to
> get your machines rebooting quicker.  It essentially re-loads the
> kernel as the last step of the shutdown procedure, skipping over the
> lengthy BIOS/UEFI/controller firmware etc. boot stages.
>
> [0]: http://en.wikipedia.org/wiki/Kexec

I'd like to second that recommendation - I only discovered this
recently, and on systems with long BIOS initialization, this cuts down
the time to reboot *dramatically*, like from 5 to 1 minute.
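
For anyone who wants to try it by hand, the incantation is roughly the
following (paths as on Debian/Ubuntu; the kexec-tools package can also
hook this into the normal reboot path via /etc/default/kexec, if I
remember correctly):

  # load the currently running kernel and initrd, reusing the command line
  kexec -l /boot/vmlinuz-$(uname -r) --initrd=/boot/initrd.img-$(uname -r) --reuse-cmdline
  # then reboot into it; "kexec -e" would jump immediately without a clean
  # shutdown, so prefer the init-system integration where available
  systemctl kexec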
-- 
Simon.



Re: [ceph-users] Journal, SSD and OS

2013-12-04 Thread Simon Leinen
Gandalf Corvotempesta writes:
> What do you think about using the same SSD as journal and as root partition?
>
> For example:
>  1x 128GB SSD
> [...]
> All logs are stored remotely via rsyslogd.
> Is this good? AFAIK, in this configuration, Ceph will be executed
> totally in RAM.

I think this is a fine configuration - you won't be writing to the root
partition too much, outside journals.  We also put journals on the same
SSDs as root partitions (not that we're very ambitious about
performance...).
-- 
Simon.


Re: [ceph-users] Why RBD is not enough [was: Inconsistent view on mounted CephFS]

2013-09-15 Thread Simon Leinen
Maciej Gałkiewicz writes:
> On 13 September 2013 17:12, Simon Leinen simon.lei...@switch.ch wrote:

>> [We're not using it *instead* of rbd, we're using it *in addition to*
>> rbd.  For example, our OpenStack (users') cinder volumes are stored in
>> rbd.]

> So you probably have cinder volumes in rbd but you boot instances from
> images. This is why you need CephFS for /var/lib/nova/instances. I
> suggest creating volumes from images and booting instances from them.
> CephFS is not required then.

Thanks, I know that we could boot from volume.  Two problems:

1.) Our OpenStack installation is not a private cloud; we allow
external users to set up VMs.  These users need to be able to use
the standard workflow (Horizon) to start VMs from an image.

2.) We didn't manage to make boot from volume work with RBD in Folsom.
Yes, presumably it works fine in Grizzly and above, so we should
just upgrade.

>> What we want to achieve is to have a shared instance store
>> (i.e. /var/lib/nova/instances) across all our nova-compute nodes, so
>> that we can e.g. live-migrate instances between different hosts.  And we
>> want to use Ceph for that.

>> In Folsom (but also in Grizzly, I think), this isn't straightforward to
>> do with RBD.  A feature[1] to make it more straightforward was merged in
>> Havana(-3) just two weeks ago.

> I don't get it. I am successfully using live-migration (in Grizzly,
> haven't tried Folsom) of instances booted from cinder volumes stored as
> rbd volumes. What is not straightforward to do? Are you using KVM?

As I said, boot from volume is not really an option for us.

>> Yes, people want shared storage that they can access in a POSIXly way
>> from multiple VMs.  CephFS is a relatively easy way to give them that,
>> though I don't consider it production-ready - mostly because secure
>> isolation between different tenants is hard to achieve.

> For now GlusterFS may fit better here.

Possibly, but it's another system we'd have to learn, configure and
support.  And CephFS is already in standard kernels (though obviously
it's not reliable, and there may be horrible regressions such as this
bug in 3.10).
-- 
Simon.


Re: [ceph-users] Inconsistent view on mounted CephFS

2013-09-15 Thread Simon Leinen
Yan, Zheng writes:
> On Fri, Sep 13, 2013 at 3:09 PM, Jens-Christian Fischer wrote:
>> The problem we see is that the different hosts have different
>> views on the filesystem (i.e. they see different amounts of files).
[...]

> The bug was introduced in the 3.10 kernel and will be fixed in the 3.12
> kernel by commit 590fb51f1c (vfs: call d_op->d_prune() before unhashing
> dentry).  Sage may backport the fix to the 3.11 and 3.10 kernels soon.

This would be great! 3.12 won't be out until November.

(Downgrading to e.g. 3.9.11 should also fix the issue, right?)

> please use ceph-fuse at present.

That's what we're doing now, but it seems slower.

Best regards,
-- 
Simon.


Re: [ceph-users] Why RBD is not enough [was: Inconsistent view on mounted CephFS]

2013-09-13 Thread Simon Leinen
> Just out of curiosity. Why are you using cephfs instead of rbd?

[We're not using it *instead* of rbd, we're using it *in addition to*
 rbd.  For example, our OpenStack (users') cinder volumes are stored in
 rbd.]

To expand on what my colleague Jens-Christian wrote:

> Two reasons:
>
> - we are still on Folsom

What we want to achieve is to have a shared instance store
(i.e. /var/lib/nova/instances) across all our nova-compute nodes, so
that we can e.g. live-migrate instances between different hosts.  And we
want to use Ceph for that.

In Folsom (but also in Grizzly, I think), this isn't straightforward to
do with RBD.  A feature[1] to make it more straightforward was merged in
Havana(-3) just two weeks ago.
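
For reference, the idea of that feature is to let Nova put the ephemeral
disks themselves into an RBD pool instead of a shared directory.  The
relevant nova.conf options end up looking roughly like this (option names
have moved between config sections across releases, so treat it purely as
an illustration):

  # nova.conf on the compute nodes (illustration only)
  [libvirt]
  images_type = rbd
  images_rbd_pool = vms
  images_rbd_ceph_conf = /etc/ceph/ceph.conf
  rbd_user = cinder
  rbd_secret_uuid = <uuid of the libvirt secret>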

> - Experience with shared storage as this is something our customers
> are asking for all the time

Yes, people want shared storage that they can access in a POSIXly way
from multiple VMs.  CephFS is a relatively easy way to give them that,
though I don't consider it production-ready - mostly because secure
isolation between different tenants is hard to achieve.
-- 
Simon.