Re: newstore direction

2015-10-21 Thread Mark Nelson

On 10/21/2015 05:06 AM, Allen Samuels wrote:

I agree that moving newStore to raw block is going to be a significant 
development effort. But the current scheme of using a KV store combined with a 
normal file system is always going to be problematic (FileStore or NewStore). 
This is caused by the transactional requirements of the ObjectStore interface: 
essentially, you need to make transactionally consistent updates to two indexes, 
one of which doesn't understand transactions (file systems) and can never be 
tightly connected to the other one.

You'll always be able to make this "loosely coupled" approach work, but it will 
never be optimal. The real question is whether the performance difference of a suboptimal 
implementation is something that you can live with compared to the longer gestation 
period of the more optimal implementation. Clearly, Sage believes that the performance 
difference is significant or he wouldn't have kicked off this discussion in the first 
place.

While I think we can all agree that writing a full-up KV and raw-block ObjectStore is a 
significant amount of work, I will offer the case that the "loosely coupled" 
scheme may not have as much time-to-market advantage as it appears to have. One example: 
NewStore performance is limited due to bugs in XFS that won't be fixed in the field for 
quite some time (it'll take at least a couple of years before a patched version of XFS 
will be widely deployed in customer environments).

Another example: Sage has just had to substantially rework the journaling code 
of rocksDB.

In short, as you can tell, I'm full throated in favor of going down the optimal 
route.

Internally at Sandisk, we have a KV store that is optimized for flash (it's 
called ZetaScale). We have extended it with a raw block allocator just as Sage 
is now proposing to do. Our internal performance measurements show a 
significant advantage over the current NewStore. That performance advantage 
stems primarily from two things:


Has there been any discussion regarding open-sourcing ZetaScale?



(1) ZetaScale uses a B+-tree internally rather than an LSM tree 
(levelDB/RocksDB). LSM trees experience an exponential increase in write 
amplification (cost of an insert) as the amount of data under management 
increases. B+-tree write amplification is nearly constant, independent of the 
size of data under management. As the KV database gets larger (since NewStore 
is effectively moving the per-file inode into the KV database, not to mention the 
checksums that Sage wants to add :)), this performance delta swamps all others.
(2) Having both a KV store and a file system causes a double lookup. This costs CPU time 
and disk accesses to page in data-structure indexes, and metadata efficiency 
decreases.

You can't avoid (2) as long as you're using a file system.

Yes, an LSM tree performs better on HDD than a B-tree does, which is a good 
argument for keeping the KV module pluggable.
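
(To make "pluggable" concrete, here is a minimal sketch in C of the kind of narrow
backend interface this implies. It is illustrative only, not Ceph's actual KeyValueDB
API; every name below is made up.)

/* Illustrative only: a tiny vtable-style KV backend interface.  The real
 * KeyValueDB abstraction is richer (transactions, iterators, column
 * families); all names here are hypothetical. */
#include <stddef.h>

struct kv_backend {
    const char *name;                         /* "lsm-hdd", "btree-flash", ... */
    int  (*open)(void *ctx, const char *path);
    int  (*put)(void *ctx, const void *key, size_t klen,
                const void *val, size_t vlen);
    int  (*get)(void *ctx, const void *key, size_t klen,
                void *val, size_t *vlen);
    int  (*del)(void *ctx, const void *key, size_t klen);
    /* in-order enumeration, needed e.g. for scrubbing */
    int  (*iterate)(void *ctx, const void *start_key, size_t klen,
                    int (*cb)(const void *k, size_t kl,
                              const void *v, size_t vl, void *arg),
                    void *arg);
    void (*close)(void *ctx);
};

An LSM-backed implementation for HDDs and a B+-tree-backed one for flash would then
simply be two instances of the same table.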


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler
Sent: Tuesday, October 20, 2015 11:32 AM
To: Sage Weil ; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/19/2015 03:49 PM, Sage Weil wrote:

The current design is based on two simple ideas:

   1) a key/value interface is better way to manage all of our internal
metadata (object metadata, attrs, layout, collection membership,
write-ahead logging, overlay data, etc.)

   2) a file system is well suited for storage object data (as files).

So far 1 is working out well, but I'm questioning the wisdom of #2.  A
few
things:

   - We currently write the data to the file, fsync, then commit the kv
transaction.  That's at least 3 IOs: one for the data, one for the fs
journal, one for the kv txn to commit (at least once my rocksdb
changes land... the kv commit is currently 2-3).  So two people are
managing metadata, here: the fs managing the file metadata (with its
own
journal) and the kv backend (with its journal).


If all of the fsync()'s fall into the same backing file system, are you sure 
that each fsync() takes the same time? Depending on the local FS implementation 
of course, but the order of issuing those fsync()'s can effectively make some 
of them no-ops.
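
(A minimal sketch of the sequence being discussed, in C; kv_commit() below is a
stand-in for the RocksDB transaction commit, not real Ceph code, and both files are
assumed to sit on the same local filesystem.)

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int kv_journal_fd;                  /* stand-in for the kv store's journal */

static int kv_commit(const char *rec)
{
    /* IO #3: the kv txn that makes the new object visible */
    if (write(kv_journal_fd, rec, strlen(rec)) < 0)
        return -1;
    return fsync(kv_journal_fd);
}

int main(void)
{
    char data[4096];
    memset(data, 'x', sizeof(data));
    kv_journal_fd = open("kv.journal", O_WRONLY | O_CREAT | O_APPEND, 0644);

    int fd = open("object.data", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    pwrite(fd, data, sizeof(data), 0);     /* IO #1: the object data       */
    fsync(fd);                             /* IO #2: forces the fs journal */
    close(fd);

    /* The ordering point above: if several of these are in flight on the
     * same fs, issuing all the data writes first and the fsync()s second
     * lets one fs journal commit cover them, so later fsyncs get cheap. */
    return kv_commit("object.data -> onode");
}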



   - On read we have to open files by name, which means traversing the
fs namespace.  Newstore tries to keep it as flat and simple as
possible, but at a minimum it is a couple btree lookups.  We'd love to
use open by handle (which would reduce this to 1 btree traversal), but
running the daemon as ceph and not root makes that hard...


This seems like a pretty low hurdle to overcome.
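
(For reference, the open-by-handle path does exist on Linux; a minimal sketch follows,
with most error handling elided. The catch is that open_by_handle_at() requires
CAP_DAC_READ_SEARCH, which is exactly the problem for a daemon running as the
unprivileged ceph user rather than root.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

/* Resolve a name to a handle once, then reopen later without walking the
 * namespace.  A long-running daemon would cache both the handle and the
 * mount fd. */
int reopen_by_handle(const char *dir, const char *name)
{
    struct file_handle *fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
    int mount_id, mount_fd, fd;

    fh->handle_bytes = MAX_HANDLE_SZ;
    if (name_to_handle_at(AT_FDCWD, name, fh, &mount_id, 0) < 0) {
        perror("name_to_handle_at");
        free(fh);
        return -1;
    }
    /* any fd on the same filesystem works as the mount reference */
    mount_fd = open(dir, O_RDONLY | O_DIRECTORY);
    fd = open_by_handle_at(mount_fd, fh, O_RDONLY);   /* ~1 btree traversal */
    if (fd < 0)
        perror("open_by_handle_at (needs CAP_DAC_READ_SEARCH)");
    free(fh);
    return fd;
}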



   - ...and file systems insist on updating mtime on writes, even when
it is an overwrite with no allocation changes.  (We don't care 

rgw with keystone v3

2015-10-21 Thread Luis Periquito
Hi all,

I've created a merge request in github (Rgw keystone v3 #6337) to add
integration with keystone auth v3 to ceph rgw.

We've tested the best we could, and it does seem to work as it should.

There are some notes on the merge request.

I don't know what the usual testing process is, but we would like, if
possible, to have some binaries built to make further tests - to
ensure that we're not suffering from some compiling/copying issues.

PS: sorry for any mistakes with the process. This is my first time
doing such a thing.


Re: newstore direction

2015-10-21 Thread Ric Wheeler

On 10/21/2015 09:32 AM, Sage Weil wrote:

On Tue, 20 Oct 2015, Ric Wheeler wrote:

Now:
  1 io  to write a new file
1-2 ios to sync the fs journal (commit the inode, alloc change)
(I see 2 journal IOs on XFS and only 1 on ext4...)
  1 io  to commit the rocksdb journal (currently 3, but will drop to
1 with xfs fix and my rocksdb change)

I think that might be too pessimistic - the number of discrete IOs sent down
to a spinning disk has much less impact on performance than the number of
fsync()'s, since the IOs all land in the write cache.  Some newer spinning
drives have a non-volatile write cache, so even an fsync() might not end up
doing the expensive data transfer to the platter.

True, but in XFS's case at least the file data and journal are not
colocated, so it's 2 seeks for the new file write+fdatasync and another for
the rocksdb journal commit.  Of course, with a deep queue, we're doing
lots of these so there'd be fewer journal commits on both counts, but the
lower bound on latency of a single write is still 3 seeks, and that bound
is pretty critical when you also have network round trips and replication
(worst out of 2) on top.
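
(Back-of-envelope, assuming a nominal ~10 ms average seek on a 7200 RPM drive and no
queueing: the minimum single-write latency is roughly 3 x t_seek + network RTT + the
slower of the two replica commits, i.e. about 30 ms of seek time alone before the
network and replication are counted. The 10 ms figure is an assumption, not a number
measured in this thread.)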


What are the performance goals we are looking for?

Small, synchronous writes/second?

File creates/second?

I suspect that looking at things like seeks/write is probably looking at the 
wrong level of performance challenges.  Again, when you write to a modern drive, 
you write to its write cache and it decides internally when/how to destage to 
the platter.


If you look at the performance of XFS with streaming workloads, it will tend to 
max out the bandwidth of the underlying storage.


If we need IOPS/file writes, etc., we should be clear on what we are aiming at.




It would be interesting to get the timings on the IO's you see to measure the
actual impact.

I observed this with the journaling workload for rocksdb, but I assume the
journaling behavior is the same regardless of what is being journaled.
For a 4KB append to a file + fdatasync, I saw ~30ms latency for XFS, and
blktrace showed an IO to the file, and 2 IOs to the journal.  I believe
the first one is the record for the inode update, and the second is the
journal 'commit' record (though I forget how I decided that).  My guess is
that XFS is being extremely careful about journal integrity here and not
writing the commit record until it knows that the preceding records landed
on stable storage.  For ext4, the latency was about ~20ms, and blktrace
showed the IO to the file and then a single journal IO.  When I made the
rocksdb change to overwrite an existing, prewritten file, the latency
dropped to ~10ms on ext4, and blktrace showed a single IO as expected.
(XFS still showed the 2 journal commit IOs, but Dave just posted the fix
for that on the XFS list today.)
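
(A minimal sketch of that kind of microbenchmark: one 4KB write + fdatasync against a
growing file versus against a prewritten one. Illustrative only; it is not the actual
test harness behind the numbers above.)

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static double timed_write(int fd, const char *buf)
{
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    pwrite(fd, buf, 4096, 0);
    fdatasync(fd);                      /* one 4KB write + sync, as measured above */
    clock_gettime(CLOCK_MONOTONIC, &b);
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void)
{
    char buf[4096];
    memset(buf, 0xab, sizeof(buf));

    /* Case 1: first write to an empty file -> allocation and inode updates
     * have to go through the fs journal. */
    int fd1 = open("append.log", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    printf("append+fdatasync:    %.2f ms\n", timed_write(fd1, buf));

    /* Case 2: overwrite of a prewritten file -> ideally a single data IO. */
    int fd2 = open("prewritten.log", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    pwrite(fd2, buf, 4096, 0);
    fdatasync(fd2);                     /* prewrite so the overwrite needs no allocation */
    printf("overwrite+fdatasync: %.2f ms\n", timed_write(fd2, buf));

    close(fd1);
    close(fd2);
    return 0;
}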


Right, if we want to avoid metadata-related IOs, we can preallocate a file and 
use O_DIRECT. Effectively, there should be no updates outside of the data write 
itself.  There also won't be performance optimizations, but we could avoid redoing 
allocation and defragmentation.


Normally, best practice is to use batching to avoid paying the worst-case latency 
when you do a synchronous IO. Write a batch of files or appends without fsync, 
then go back and fsync, and you will pay that latency once (not per file/op).
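
(A sketch of that batching pattern, in C, purely illustrative:)

#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Write a batch of files first, then sync them in a second pass: the
 * worst-case journal/platter latency is paid per batch, not per file. */
int write_batch(const char *const names[], const void *buf, size_t len, int n)
{
    int fds[n];

    for (int i = 0; i < n; i++) {
        fds[i] = open(names[i], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fds[i] < 0 || pwrite(fds[i], buf, len, 0) != (ssize_t)len)
            return -1;                 /* sketch: no cleanup on error */
    }
    for (int i = 0; i < n; i++) {      /* second pass: fsyncs amortize the latency */
        if (fsync(fds[i]) < 0)
            return -1;
        close(fds[i]);
    }
    return 0;
}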





Plumbing for T10 DIF/DIX already exist, what is missing is the normal block
device that handles them (not enterprise SAS/disk array class)

Yeah... which unfortunately means that unless the cheap drives
suddenly start shipping with DIF/DIX support we'll need to do the
checksums ourselves.  This is probably a good thing anyway as it doesn't
constrain our choice of checksum or checksum granularity, and will
still work with other storage devices (SSDs, NVMe, etc.).

sage
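
(A sketch of what "doing the checksums ourselves" at a chosen granularity might look
like. zlib's crc32() is used purely as a stand-in for whatever checksum is eventually
picked, and the 4KB chunk size is an arbitrary example, per the point above that both
choices stay open.)

#include <stdint.h>
#include <stddef.h>
#include <zlib.h>          /* crc32(); stand-in for crc32c or whatever gets chosen */

#define CSUM_CHUNK 4096    /* checksum granularity: an open design choice */

/* Compute one checksum per CSUM_CHUNK bytes of an object; the results would
 * be stored next to the object's other metadata in the kv store. */
size_t checksum_object(const uint8_t *data, size_t len, uint32_t *out)
{
    size_t n = 0;
    for (size_t off = 0; off < len; off += CSUM_CHUNK, n++) {
        size_t chunk = len - off < CSUM_CHUNK ? len - off : CSUM_CHUNK;
        out[n] = (uint32_t)crc32(0L, data + off, (unsigned int)chunk);
    }
    return n;              /* number of checksums written to out[] */
}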


Might be interesting to see if a device mapper target could be written to 
support DIF/DIX.  For what it's worth, XFS developers have talked loosely about 
looking at data block checksums (they could do something like btrfs does and 
store the checksums in another btree).


ric



[PATCH 6/6] KEYS: Merge the type-specific data with the payload data

2015-10-21 Thread David Howells
Merge the type-specific data with the payload data into one four-word chunk
as it seems pointless to keep them separate.

Use user_key_payload() for accessing the payloads of overloaded
user-defined keys.

Signed-off-by: David Howells 
cc: linux-c...@vger.kernel.org
cc: ecryp...@vger.kernel.org
cc: linux-e...@vger.kernel.org
cc: linux-f2fs-de...@lists.sourceforge.net
cc: linux-...@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: linux-ima-de...@lists.sourceforge.net
---

 Documentation/crypto/asymmetric-keys.txt |   27 +++--
 Documentation/security/keys.txt  |   41 ---
 crypto/asymmetric_keys/asymmetric_keys.h |5 --
 crypto/asymmetric_keys/asymmetric_type.c |   44 -
 crypto/asymmetric_keys/public_key.c  |4 +-
 crypto/asymmetric_keys/signature.c   |2 -
 crypto/asymmetric_keys/x509_parser.h |1 
 crypto/asymmetric_keys/x509_public_key.c |9 ++--
 fs/cifs/cifs_spnego.c|6 +--
 fs/cifs/cifsacl.c|   25 ++--
 fs/cifs/connect.c|9 ++--
 fs/cifs/sess.c   |2 -
 fs/cifs/smb2pdu.c|2 -
 fs/ecryptfs/ecryptfs_kernel.h|5 +-
 fs/ext4/crypto_key.c |4 +-
 fs/f2fs/crypto_key.c |4 +-
 fs/fscache/object-list.c |4 +-
 fs/nfs/nfs4idmap.c   |4 +-
 include/crypto/public_key.h  |1 
 include/keys/asymmetric-subtype.h|2 -
 include/keys/asymmetric-type.h   |   15 +++
 include/keys/user-type.h |8 
 include/linux/key-type.h |3 -
 include/linux/key.h  |   33 +++
 kernel/module_signing.c  |1 
 lib/digsig.c |7 ++-
 net/ceph/ceph_common.c   |2 -
 net/ceph/crypto.c|6 +--
 net/dns_resolver/dns_key.c   |   20 +
 net/dns_resolver/dns_query.c |7 +--
 net/dns_resolver/internal.h  |8 
 net/rxrpc/af_rxrpc.c |2 -
 net/rxrpc/ar-key.c   |   32 +++
 net/rxrpc/ar-output.c|2 -
 net/rxrpc/ar-security.c  |4 +-
 net/rxrpc/rxkad.c|   16 ---
 security/integrity/evm/evm_crypto.c  |2 -
 security/keys/big_key.c  |   47 +++---
 security/keys/encrypted-keys/encrypted.c |   18 
 security/keys/encrypted-keys/encrypted.h |4 +-
 security/keys/encrypted-keys/masterkey_trusted.c |4 +-
 security/keys/key.c  |   18 
 security/keys/keyctl.c   |4 +-
 security/keys/keyring.c  |   12 +++---
 security/keys/process_keys.c |4 +-
 security/keys/request_key.c  |4 +-
 security/keys/request_key_auth.c |   12 +++---
 security/keys/trusted.c  |6 +--
 security/keys/user_defined.c |   14 +++
 49 files changed, 286 insertions(+), 230 deletions(-)

diff --git a/Documentation/crypto/asymmetric-keys.txt 
b/Documentation/crypto/asymmetric-keys.txt
index b7675904a747..8c07e0ea6bc0 100644
--- a/Documentation/crypto/asymmetric-keys.txt
+++ b/Documentation/crypto/asymmetric-keys.txt
@@ -186,7 +186,7 @@ and looks like the following:
const struct public_key_signature *sig);
};
 
-Asymmetric keys point to this with their type_data[0] member.
+Asymmetric keys point to this with their payload[asym_subtype] member.
 
 The owner and name fields should be set to the owning module and the name of
 the subtype.  Currently, the name is only used for print statements.
@@ -269,8 +269,7 @@ mandatory:
 
struct key_preparsed_payload {
char*description;
-   void*type_data[2];
-   void*payload;
+   void*payload[4];
const void  *data;
size_t  datalen;
size_t  quotalen;
@@ -283,16 +282,18 @@ mandatory:
  not theirs.
 
  If the parser is happy with the blob, it should propose a description for
- the key and attach it to ->description, ->type_data[0] should be set to
- point to the subtype to be used, ->payload should be set to point to the
- initialised data for that 

Re: [PATCH] mark rbd requiring stable pages

2015-10-21 Thread Mike Christie
On 10/21/2015 03:57 PM, Ilya Dryomov wrote:
> On Wed, Oct 21, 2015 at 10:51 PM, Ilya Dryomov  wrote:
>> On Fri, Oct 16, 2015 at 1:09 PM, Ilya Dryomov  wrote:
>>> Hmm...  On the one hand, yes, we do compute CRCs, but that's optional,
>>> so enabling this unconditionally is probably too harsh.  OTOH we are
>>> talking to the network, which means all sorts of delays, retransmission
>>> issues, etc, so I wonder how exactly "unstable" pages behave when, say,
>>> added to an skb - you can't write anything to a page until networking
>>> is fully done with it and expect it to work.  It's particularly
>>> alarming that you've seen corruptions.
>>>
>>> Currently the only users of this flag are block integrity stuff and
>>> md-raid5, which makes me wonder what iscsi, nfs and others do in this
>>> area.  There's an old ticket on this topic somewhere on the tracker, so
>>> I'll need to research this.  Thanks for bringing this up!
>>
>> Hi Mike,
>>
>> I was hoping to grab you for a few minutes, but you weren't there...
>>
>> I spent a better part of today reading code and mailing lists on this
>> topic.  It is of course a bug that we use sendpage() which inlines
>> pages into an skb and do nothing to keep those pages stable.  We have
>> csums enabled by default, so setting BDI_CAP_STABLE_WRITES in the crc
>> case is an obvious fix.
>>
>> I looked at drbd and iscsi and I think iscsi could do the same - ditch
>> the fallback to sock_no_sendpage() in the datadgst_en case (and get rid
>> of iscsi_sw_tcp_conn::sendpage member while at it).  Using stable pages
>> rather than having a roll-your-own implementation which doesn't close
>> the race but only narrows it sounds like a win, unless copying through
>> sendmsg() is for some reason cheaper than stable-waiting?

Yeah, that is what I was saying on the call the other day, but the
reception was bad. We only have the sendmsg code path when digests are on
because that code came before stable pages. When stable pages were
created, it was on by default but did not cover all the cases, so we
left the code. It then handled most scenarios, but I just never got
around to removing the old code. However, it was set to off by default
so I left it and made this patch for iscsi to turn on stable pages:

[this patch only enables stable pages when digests/crcs are on and did
not remove the old code yet]
https://groups.google.com/forum/#!topic/open-iscsi/n4jvWK7BPYM

I did not really like the layering so I have not posted it for inclusion.



>>
>> drbd still needs the non-zero-copy version for its async protocol for
>> when they free the pages before the NIC has chance to put them on the
>> wire.  md-raid5 it turns out has an option to essentially disable most
>> of its stripe cache and so it sets BDI_CAP_STABLE_WRITES to compensate
>> if that option is enabled.
>>
>> What I'm worried about is the !crc (!datadgst_en) case.  I'm failing to
>> convince myself that mucking with sendpage()ed pages while they sit in
>> the TCP queue (or anywhere in the networking stack, really), is safe -
>> there is nothing to prevent pages from being modified after sendpage()
>> returned and Ronny reports data corruptions that pretty much went away
>> with BDI_CAP_STABLE_WRITES set.  I may be, after prolonged staring at
>> this, starting to confuse fs with block, though.  How does that work in
>> iscsi land?

This is what I was trying to ask about in the call the other day. Where
is the corruption that Ronny was seeing? Was it checksum mismatches on
data being written, or was incorrect metadata being written, etc.?

If we are just talking about the case where stable pages are not used, and someone
is re-writing data to a page after the page has already been submitted
to the block layer (I mean the page is on some bio which is on a request
which is on some request_queue scheduler list or basically anywhere in
the block layer), then I was saying this can occur with any block
driver. There is nothing that is preventing this from happening with a
FC driver or nvme or cciss or in dm or whatever. The app/user can
rewrite as late as when we are in the make_request_fn/request_fn.

I think I am misunderstanding your question because I thought this is
expected behavior, and there is nothing drivers can do if the app is not
doing a flush/sync between these types of write sequences.


>>
>> (There was/is also this [1] bug, which is kind of related and probably
>> worth looking into at some point later.  ceph shouldn't be that easily
>> affected - we've got state, but there is a ticket for it.)
>>
>> [1] http://www.spinics.net/lists/linux-nfs/msg34913.html
> 
> And now with Mike on the CC and a mention that at least one scenario of
> [1] got fixed in NFS by a6b31d18b02f ("SUNRPC: Fix a data corruption
> issue when retransmitting RPC calls").
> 

iSCSI handles timeouts/retries and sequence numbers/responses
differently so we are not affected. We go through some abort and
possibly reconnect process.


RE: newstore direction

2015-10-21 Thread Chen, Xiaoxi
We did evaluate whether NVMKV could be implemented on non-FusionIO SSDs, i.e. 
re-invent an NVMKV; the final conclusion was that it's not hard with 
persistent memory (which will be available soon).  But yes, NVMKV will not work 
if no PM is present: persisting the hashing table to SSD is not practicable.

Range queries seem not to be a very big issue, as the random read performance of 
today's SSDs is more than enough; even if we break all sequential access into 
random (typically 70-80K IOPS, which is ~300MB/s), the performance is still good 
enough.

Anyway, I think for the high-IOPS case it's hard for the consumer to play 
well with SSDs from different vendors; it would be better to leave it to the SSD vendor, 
something like OpenStack Cinder's structure, where a vendor has the responsibility 
to maintain their drivers for Ceph and take care of the performance.

> -Original Message-
> From: Mark Nelson [mailto:mnel...@redhat.com]
> Sent: Wednesday, October 21, 2015 9:36 PM
> To: Allen Samuels; Sage Weil; Chen, Xiaoxi
> Cc: James (Fei) Liu-SSI; Somnath Roy; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
> 
> Thanks Allen!  The devil is always in the details.  Know of anything else that
> looks promising?
> 
> Mark
> 
> On 10/21/2015 05:06 AM, Allen Samuels wrote:
> > I doubt that NVMKV will be useful for two reasons:
> >
> > (1) It relies on the unique sparse-mapping addressing capabilities of
> > the FusionIO VSL interface, it won't run on standard SSDs
> > (2) NVMKV doesn't provide any form of in-order enumeration (i.e., no
> range operations on keys). This is pretty much required for deep scrubbing.
> >
> >
> > Allen Samuels
> > Software Architect, Fellow, Systems and Software Solutions
> >
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com
> >
> > -Original Message-
> > From: ceph-devel-ow...@vger.kernel.org
> > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson
> > Sent: Tuesday, October 20, 2015 6:20 AM
> > To: Sage Weil ; Chen, Xiaoxi 
> > Cc: James (Fei) Liu-SSI ; Somnath Roy
> > ; ceph-devel@vger.kernel.org
> > Subject: Re: newstore direction
> >
> > On 10/20/2015 07:30 AM, Sage Weil wrote:
> >> On Tue, 20 Oct 2015, Chen, Xiaoxi wrote:
> >>> +1, nowadays K-V DB care more about very small key-value pairs, say
> >>> several bytes to a few KB, but in SSD case we only care about 4KB or
> >>> 8KB. In this way, NVMKV is a good design and seems some of the SSD
> >>> vendor are also trying to build this kind of interface, we had a
> >>> NVM-L library but still under development.
> >>
> >> Do you have an NVMKV link?  I see a paper and a stale github repo..
> >> not sure if I'm looking at the right thing.
> >>
> >> My concern with using a key/value interface for the object data is
> >> that you end up with lots of key/value pairs (e.g., $inode_$offset =
> >> $4kb_of_data) that is pretty inefficient to store and (depending on
> >> the
> >> implementation) tends to break alignment.  I don't think these
> >> interfaces are targeted toward block-sized/aligned payloads.
> >> Storing just the metadata (block allocation map) w/ the kv api and
> >> storing the data directly on a block/page interface makes more sense to
> me.
> >>
> >> sage
> >
> > I get the feeling that some of the folks that were involved with nvmkv at
> Fusion IO have left.  Nisha Talagala is now out at Parallel Systems for 
> instance.
> http://pmem.io might be a better bet, though I haven't looked closely at it.
> >
> > Mark
> >
> >>
> >>
>  -Original Message-
>  From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
>  ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
>  Sent: Tuesday, October 20, 2015 6:21 AM
>  To: Sage Weil; Somnath Roy
>  Cc: ceph-devel@vger.kernel.org
>  Subject: RE: newstore direction
> 
>  Hi Sage and Somnath,
>  In my humble opinion, There is another more aggressive
>  solution than raw block device base keyvalue store as backend for
>  objectstore. The new key value  SSD device with transaction support
> would be  ideal to solve the issues.
>  First of all, it is raw SSD device. Secondly , It provides key
>  value interface directly from SSD. Thirdly, it can provide
>  transaction support, consistency will be guaranteed by hardware
>  device. It pretty much satisfied all of objectstore needs without
>  any extra overhead since there is not any extra layer in between device
> and objectstore.
>   Either way, I strongly support to have CEPH own data format
>  instead of relying on filesystem.
> 
>  Regards,
>  James
> 
>  -Original Message-
>  From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
>  ow...@vger.kernel.org] On Behalf Of Sage Weil
>  Sent: Monday, October 19, 2015 1:55 PM
>  

Re: newstore direction

2015-10-21 Thread Ric Wheeler

On 10/21/2015 10:14 AM, Mark Nelson wrote:



On 10/21/2015 06:24 AM, Ric Wheeler wrote:



On 10/21/2015 06:06 AM, Allen Samuels wrote:

I agree that moving newStore to raw block is going to be a significant
development effort. But the current scheme of using a KV store
combined with a normal file system is always going to be problematic
(FileStore or NewStore). This is caused by the transactional
requirements of the ObjectStore interface, essentially you need to
make transactionally consistent updates to two indexes, one of which
doesn't understand transactions (File Systems) and can never be
tightly-connected to the other one.

You'll always be able to make this "loosely coupled" approach work,
but it will never be optimal. The real question is whether the
performance difference of a suboptimal implementation is something
that you can live with compared to the longer gestation period of the
more optimal implementation. Clearly, Sage believes that the
performance difference is significant or he wouldn't have kicked off
this discussion in the first place.


I think that we need to work with the existing stack - measure and do
some collaborative analysis - before we throw out decades of work.  Very
hard to understand why the local file system is a barrier for
performance in this case when it is not an issue in existing enterprise
applications.

We need some deep analysis with some local file system experts thrown in
to validate the concerns.


I think Sage has been working pretty closely with the XFS guys to uncover 
these kinds of issues.  I know if I encounter something fairly FS specific I 
try to drag Eric or Dave in.  I think the core of the problem is that we often 
find ourselves exercising filesystems in pretty unusual ways.  While it's 
probably good that we add this kind of coverage and help work out somewhat 
esoteric bugs, I think it does make our job of making Ceph perform well 
harder.  One example:  I had been telling folks for several years to favor 
dentry and inode cache due to the way our PG directory splitting works (backed 
by test results), but then Sage discovered:


http://www.spinics.net/lists/ceph-devel/msg25644.html

This is just one example of how very nuanced our performance story is. I can 
keep many users at least semi-engaged when talking about objects being laid 
out in a nested directory structure, how dentry/inode cache affects that in a 
general sense, etc.  But combine the kind of subtlety in the link above with 
the vastness of things in the data path that can hurt performance, and people 
generally just can't wrap their heads around all of it (With the exception of 
some of the very smart folks on this mailing list!)


One of my biggest concerns going forward is reducing the user-facing 
complexity of our performance story.  The question I ask myself is: Does 
keeping Ceph on a FS help us or hurt us in that regard?


The upshot of that is that this kind of micro-optimization is already handled by 
the file system, so the application's job should be easier. Better to fsync() each 
file that you care about from the application rather than to worry about using 
more obscure calls.








While I think we can all agree that writing a full-up KV and raw-block
ObjectStore is a significant amount of work. I will offer the case
that the "loosely couple" scheme may not have as much time-to-market
advantage as it appears to have. One example: NewStore performance is
limited due to bugs in XFS that won't be fixed in the field for quite
some time (it'll take at least a couple of years before a patched
version of XFS will be widely deployed at customer environments).


Not clear what bugs you are thinking of or why you think fixing bugs
will take a long time to hit the field in XFS. Red Hat has most of the
XFS developers on staff and we actively backport fixes and ship them,
other distros do as well.

Never seen a "bug" take a couple of years to hit users.


Maybe a good way to start out would be to see how quickly we can get the patch 
dchinner posted here:


http://oss.sgi.com/archives/xfs/2015-10/msg00545.html

rolled out into RHEL/CentOS/Ubuntu.  I have no idea how long these things 
typically take, but this might be a good test case.


How quickly things land in a distro is up to the interested parties making the 
case for it.


Ric





Regards,

Ric



Another example: Sage has just had to substantially rework the
journaling code of rocksDB.

In short, as you can tell, I'm full throated in favor of going down
the optimal route.

Internally at Sandisk, we have a KV store that is optimized for flash
(it's called ZetaScale). We have extended it with a raw block
allocator just as Sage is now proposing to do. Our internal
performance measurements show a significant advantage over the current
NewStore. That performance advantage stems primarily from two things:

(1) ZetaScale uses a B+-tree internally rather than an LSM tree
(levelDB/RocksDB). LSM trees experience exponential 

Re: newstore direction

2015-10-21 Thread Mark Nelson
Thanks Allen!  The devil is always in the details.  Know of anything 
else that looks promising?


Mark

On 10/21/2015 05:06 AM, Allen Samuels wrote:

I doubt that NVMKV will be useful for two reasons:

(1) It relies on the unique sparse-mapping addressing capabilities of the 
FusionIO VSL interface; it won't run on standard SSDs
(2) NVMKV doesn't provide any form of in-order enumeration (i.e., no range 
operations on keys). This is pretty much required for deep scrubbing.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Tuesday, October 20, 2015 6:20 AM
To: Sage Weil ; Chen, Xiaoxi 
Cc: James (Fei) Liu-SSI ; Somnath Roy 
; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/20/2015 07:30 AM, Sage Weil wrote:

On Tue, 20 Oct 2015, Chen, Xiaoxi wrote:

+1. Nowadays K-V DBs care more about very small key-value pairs, say
several bytes to a few KB, but in the SSD case we only care about 4KB or
8KB. In this way, NVMKV is a good design, and it seems some of the SSD
vendors are also trying to build this kind of interface; we had a
NVM-L library but it is still under development.


Do you have an NVMKV link?  I see a paper and a stale github repo..
not sure if I'm looking at the right thing.

My concern with using a key/value interface for the object data is
that you end up with lots of key/value pairs (e.g., $inode_$offset =
$4kb_of_data) that is pretty inefficient to store and (depending on
the
implementation) tends to break alignment.  I don't think these
interfaces are targeted toward block-sized/aligned payloads.  Storing
just the metadata (block allocation map) w/ the kv api and storing the
data directly on a block/page interface makes more sense to me.

sage
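
(To illustrate the split being proposed here: a sketch of a hypothetical key/value
layout, not an actual NewStore schema, in which the kv store holds only small
allocation-map records while the object bytes go straight to the raw block device.)

#include <stdint.h>

/* Hypothetical on-kv layout: one small record per extent of an object, keyed
 * by (object id, logical offset).  The data itself is written directly to the
 * block device at 'blk', block-aligned, so the kv store never has to carry
 * payloads like "$inode_$offset = $4kb_of_data". */
struct extent_key {
    uint64_t oid;          /* object id */
    uint64_t logical_off;  /* offset within the object */
};

struct extent_val {
    uint64_t blk;          /* first device block of this extent */
    uint32_t len;          /* extent length in bytes */
    uint32_t csum;         /* optional per-extent checksum */
};

/* Encode the key big-endian so the kv store's natural sort order groups
 * extents by object and then by offset (useful for range scans / scrub). */
static void encode_extent_key(const struct extent_key *k, uint8_t out[16])
{
    for (int i = 0; i < 8; i++) {
        out[i]     = (uint8_t)(k->oid         >> (56 - 8 * i));
        out[8 + i] = (uint8_t)(k->logical_off >> (56 - 8 * i));
    }
}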


I get the feeling that some of the folks that were involved with nvmkv at 
Fusion IO have left.  Nisha Talagala is now out at Parallel Systems for 
instance.  http://pmem.io might be a better bet, though I haven't looked 
closely at it.

Mark





-Original Message-
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
Sent: Tuesday, October 20, 2015 6:21 AM
To: Sage Weil; Somnath Roy
Cc: ceph-devel@vger.kernel.org
Subject: RE: newstore direction

Hi Sage and Somnath,
In my humble opinion, there is another, more aggressive solution
than a raw-block-device-based key-value store as the backend for the
objectstore: a new key-value SSD device with transaction support would be
ideal to solve these issues.
First of all, it is a raw SSD device. Secondly, it provides a key-value
interface directly from the SSD. Thirdly, it can provide transaction
support; consistency will be guaranteed by the hardware device. It
pretty much satisfies all of the objectstore's needs without any extra
overhead, since there is no extra layer in between the device and the objectstore.
 Either way, I strongly support having Ceph's own data format
instead of relying on a filesystem.

Regards,
James

-Original Message-
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Monday, October 19, 2015 1:55 PM
To: Somnath Roy
Cc: ceph-devel@vger.kernel.org
Subject: RE: newstore direction

On Mon, 19 Oct 2015, Somnath Roy wrote:

Sage,
I fully support that.  If we want to saturate SSDs, we need to get
rid of this filesystem overhead (which I am in the process of measuring).
Also, it will be good if we can eliminate the dependency on the k/v
dbs (for storing allocators and all). The reason is the unknown
write amps they cause.


My hope is to keep behind the KeyValueDB interface (and/or change it as
appropriate) so that other backends can be easily swapped in (e.g. a
btree-based one for high-end flash).

sage




Thanks & Regards
Somnath


-Original Message-
From: ceph-devel-ow...@vger.kernel.org
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Monday, October 19, 2015 12:49 PM
To: ceph-devel@vger.kernel.org
Subject: newstore direction

The current design is based on two simple ideas:

   1) a key/value interface is better way to manage all of our
internal metadata (object metadata, attrs, layout, collection
membership, write-ahead logging, overlay data, etc.)

   2) a file system is well suited for storage object data (as files).

So far 1 is working out well, but I'm questioning the wisdom of #2.
A few
things:

   - We currently write the data to the file, fsync, then commit the
kv transaction.  That's at least 3 IOs: one for the data, one for
the fs journal, one for the kv txn to commit (at least once my
rocksdb changes land... the kv commit is currently 2-3).  So two
people are 

Re: newstore direction

2015-10-21 Thread Mark Nelson



On 10/21/2015 06:24 AM, Ric Wheeler wrote:



On 10/21/2015 06:06 AM, Allen Samuels wrote:

I agree that moving newStore to raw block is going to be a significant
development effort. But the current scheme of using a KV store
combined with a normal file system is always going to be problematic
(FileStore or NewStore). This is caused by the transactional
requirements of the ObjectStore interface, essentially you need to
make transactionally consistent updates to two indexes, one of which
doesn't understand transactions (File Systems) and can never be
tightly-connected to the other one.

You'll always be able to make this "loosely coupled" approach work,
but it will never be optimal. The real question is whether the
performance difference of a suboptimal implementation is something
that you can live with compared to the longer gestation period of the
more optimal implementation. Clearly, Sage believes that the
performance difference is significant or he wouldn't have kicked off
this discussion in the first place.


I think that we need to work with the existing stack - measure and do
some collaborative analysis - before we throw out decades of work.  Very
hard to understand why the local file system is a barrier for
performance in this case when it is not an issue in existing enterprise
applications.

We need some deep analysis with some local file system experts thrown in
to validate the concerns.


I think Sage has been working pretty closely with the XFS guys to 
uncover these kinds of issues.  I know if I encounter something fairly 
FS specific I try to drag Eric or Dave in.  I think the core of the 
problem is that we often find ourselves exercising filesystems in pretty 
unusual ways.  While it's probably good that we add this kind of 
coverage and help work out somewhat esoteric bugs, I think it does make 
our job of making Ceph perform well harder.  One example:  I had been 
telling folks for several years to favor dentry and inode cache due to 
the way our PG directory splitting works (backed by test results), but 
then Sage discovered:


http://www.spinics.net/lists/ceph-devel/msg25644.html

This is just one example of how very nuanced our performance story is. 
I can keep many users at least semi-engaged when talking about objects 
being laid out in a nested directory structure, how dentry/inode cache 
affects that in a general sense, etc.  But combine the kind of subtlety 
in the link above with the vastness of things in the data path that can 
hurt performance, and people generally just can't wrap their heads 
around all of it (With the exception of some of the very smart folks on 
this mailing list!)


One of my biggest concerns going forward is reducing the user-facing 
complexity of our performance story.  The question I ask myself is: 
Does keeping Ceph on a FS help us or hurt us in that regard?






While I think we can all agree that writing a full-up KV and raw-block
ObjectStore is a significant amount of work. I will offer the case
that the "loosely couple" scheme may not have as much time-to-market
advantage as it appears to have. One example: NewStore performance is
limited due to bugs in XFS that won't be fixed in the field for quite
some time (it'll take at least a couple of years before a patched
version of XFS will be widely deployed at customer environments).


Not clear what bugs you are thinking of or why you think fixing bugs
will take a long time to hit the field in XFS. Red Hat has most of the
XFS developers on staff and we actively backport fixes and ship them,
other distros do as well.

Never seen a "bug" take a couple of years to hit users.


Maybe a good way to start out would be to see how quickly we can get the 
patch dchinner posted here:


http://oss.sgi.com/archives/xfs/2015-10/msg00545.html

rolled out into RHEL/CentOS/Ubuntu.  I have no idea how long these 
things typically take, but this might be a good test case.




Regards,

Ric



Another example: Sage has just had to substantially rework the
journaling code of rocksDB.

In short, as you can tell, I'm full throated in favor of going down
the optimal route.

Internally at Sandisk, we have a KV store that is optimized for flash
(it's called ZetaScale). We have extended it with a raw block
allocator just as Sage is now proposing to do. Our internal
performance measurements show a significant advantage over the current
NewStore. That performance advantage stems primarily from two things:

(1) ZetaScale uses a B+-tree internally rather than an LSM tree
(levelDB/RocksDB). LSM trees experience exponential increase in write
amplification (cost of an insert) as the amount of data under
management increases. B+tree write-amplification is nearly constant
independent of the size of data under management. As the KV database
gets larger (Since newStore is effectively moving the per-file inode
into the kv data base. Don't forget checksums that Sage want's to add
:)) this performance delta swamps all others.
(2) Having a 

Re: newstore direction

2015-10-21 Thread Sage Weil
On Wed, 21 Oct 2015, Ric Wheeler wrote:
> On 10/21/2015 04:22 AM, Orit Wasserman wrote:
> > On Tue, 2015-10-20 at 14:31 -0400, Ric Wheeler wrote:
> > > On 10/19/2015 03:49 PM, Sage Weil wrote:
> > > > The current design is based on two simple ideas:
> > > > 
> > > >1) a key/value interface is better way to manage all of our internal
> > > > metadata (object metadata, attrs, layout, collection membership,
> > > > write-ahead logging, overlay data, etc.)
> > > > 
> > > >2) a file system is well suited for storage object data (as files).
> > > > 
> > > > So far 1 is working out well, but I'm questioning the wisdom of #2.  A
> > > > few
> > > > things:
> > > > 
> > > >- We currently write the data to the file, fsync, then commit the kv
> > > > transaction.  That's at least 3 IOs: one for the data, one for the fs
> > > > journal, one for the kv txn to commit (at least once my rocksdb changes
> > > > land... the kv commit is currently 2-3).  So two people are managing
> > > > metadata, here: the fs managing the file metadata (with its own
> > > > journal) and the kv backend (with its journal).
> > > If all of the fsync()'s fall into the same backing file system, are you
> > > sure
> > > that each fsync() takes the same time? Depending on the local FS
> > > implementation
> > > of course, but the order of issuing those fsync()'s can effectively make
> > > some of
> > > them no-ops.
> > > 
> > > >- On read we have to open files by name, which means traversing the
> > > > fs
> > > > namespace.  Newstore tries to keep it as flat and simple as possible,
> > > > but
> > > > at a minimum it is a couple btree lookups.  We'd love to use open by
> > > > handle (which would reduce this to 1 btree traversal), but running
> > > > the daemon as ceph and not root makes that hard...
> > > This seems like a pretty low hurdle to overcome.
> > > 
> > > >- ...and file systems insist on updating mtime on writes, even when
> > > > it is
> > > > an overwrite with no allocation changes.  (We don't care about mtime.)
> > > > O_NOCMTIME patches exist but it is hard to get these past the kernel
> > > > brainfreeze.
> > > Are you using O_DIRECT? Seems like there should be some enterprisey
> > > database
> > > tricks that we can use here.
> > > 
> > > >- XFS is (probably) never going going to give us data checksums,
> > > > which we
> > > > want desperately.
> > > What is the goal of having the file system do the checksums? How strong do
> > > they
> > > need to be and what size are the chunks?
> > > 
> > > If you update this on each IO, this will certainly generate more IO (each
> > > write
> > > will possibly generate at least one other write to update that new
> > > checksum).
> > > 
> > > > But what's the alternative?  My thought is to just bite the bullet and
> > > > consume a raw block device directly.  Write an allocator, hopefully keep
> > > > it pretty simple, and manage it in kv store along with all of our other
> > > > metadata.
> > > The big problem with consuming block devices directly is that you
> > > ultimately end
> > > up recreating most of the features that you had in the file system. Even
> > > enterprise databases like Oracle and DB2 have been migrating away from
> > > running
> > > on raw block devices in favor of file systems over time.  In effect, you
> > > are
> > > looking at making a simple on disk file system which is always easier to
> > > start
> > > than it is to get back to a stable, production ready state.
> > The best performance is still on a block device (SAN).
> > File systems simplify operational tasks, which is worth the performance
> > penalty for a database. I think in a storage system this is not the
> > case.
> > In many cases they can use their own file system that is tailored for
> > the database.
> 
> You will have to trust me on this as the Red Hat person who spoke to pretty
> much all of our key customers about local file systems and storage - customers
> all have migrated over to using normal file systems under Oracle/DB2.
> Typically, they use XFS or ext4.  I don't know of any non-standard file
> systems and only have seen one account running on a raw block store in 8 years
> :)
> 
> If you have a pre-allocated file and write using O_DIRECT, your IO path is
> identical in terms of IO's sent to the device.

...except it's not.  Preallocating the file gives you contiguous space, 
but you still have to mark the extent written (not zero/prealloc).  The 
only way to get an identical IO pattern is to *pre-write* zeros (or 
whatever) to the file... which is hours on modern HDDs.
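
(A sketch of why the IO patterns differ, assuming Linux fallocate() semantics; this is
an illustration, not a measurement. Preallocation reserves contiguous space but leaves
the extents flagged unwritten, so the first O_DIRECT write into them still drags in a
metadata update to convert the extent.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    size_t len = 1 << 20;                       /* 1 MB for the example */
    int fd = open("preallocated.img", O_RDWR | O_CREAT | O_DIRECT, 0644);

    /* Reserves contiguous space, but the extents are flagged "unwritten"
     * so reads return zeros; no data goes to the platter. */
    fallocate(fd, 0, 0, len);

    void *buf;
    posix_memalign(&buf, 4096, 4096);           /* O_DIRECT needs alignment */
    memset(buf, 0, 4096);

    /* First write into the preallocated range: the data IO is contiguous,
     * but the filesystem must also journal the unwritten->written extent
     * conversion; that is the extra metadata IO described above.  Only a
     * file whose blocks were pre-written (not just pre-allocated) avoids it. */
    pwrite(fd, buf, 4096, 0);
    fdatasync(fd);

    free(buf);
    close(fd);
    return 0;
}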

Ted asked for a way to force prealloc to expose preexisting disk bits a 
couple of years back at LSF and it was shot down for security reasons (and 
rightly so, IMO).

If you're going down this path, you already have a "file system" in user 
space sitting on top of the preallocated file, and you could just as 
easily use the block device directly.

If you're not, then you're writing smaller files (e.g., 

Performance meeting next week?

2015-10-21 Thread Paul Von-Stamwitz
Hi Mark,

In light of OpenStack Summit, will we still have a meeting next week?

-Paul


Re: MDS stuck in a crash loop

2015-10-21 Thread Gregory Farnum
On Mon, Oct 19, 2015 at 8:31 AM, Milosz Tanski  wrote:
> On Wed, Oct 14, 2015 at 12:46 AM, Gregory Farnum  wrote:
>> On Sun, Oct 11, 2015 at 7:36 PM, Milosz Tanski  wrote:
>>> On Sun, Oct 11, 2015 at 6:44 PM, Milosz Tanski  wrote:
 On Sun, Oct 11, 2015 at 6:01 PM, Milosz Tanski  wrote:
> On Sun, Oct 11, 2015 at 5:33 PM, Milosz Tanski  wrote:
>> On Sun, Oct 11, 2015 at 5:24 PM, Milosz Tanski  wrote:
>>> On Sun, Oct 11, 2015 at 1:16 PM, Gregory Farnum  
>>> wrote:
 On Sun, Oct 11, 2015 at 10:09 AM, Milosz Tanski  
 wrote:
> About an hour ago my MDSs (primary and follower) started ping-pong
> crashing with this message. I've spent about 30 minutes looking into
> it but nothing yet.
>
> This is from a 0.94.3 MDS
>

>  0> 2015-10-11 17:01:23.596008 7fd4f52ad700 -1 mds/SessionMap.cc:
> In function 'virtual void C_IO_SM_Save::finish(int)' thread
> 7fd4f52ad700 time 2015-10-11 17:01:23.594089
> mds/SessionMap.cc: 120: FAILED assert(r == 0)

 These "r == 0" asserts pretty much always mean that the MDS did a
 read or write to RADOS (the OSDs) and got an error of some kind back.
 (Or in the case of the OSDs, access to the local filesystem returned
 an error, etc.) I don't think these writes include any safety checks
 which would let the MDS break it which means that probably the OSD is
 actually returning an error — odd, but not impossible.

 Notice that the assert happened in thread 7fd4f52ad700, and look for
 the stuff in that thread. You should be able to find an OSD op reply
 (on the SessionMap object) coming in and reporting an error code.
 -Greg
>>>
>>> I see only two error ops in that whole MDS session. Neither one happened
>>> on the same thread (7f5ab6000700 in this file). But it looks like the
>>> only sessionmap one is the -90 "Message too long" error.
>>>
>>> mtanski@tiny:~$ cat single_crash.log | grep 'osd_op_reply' | grep -v
>>> 'ondisk = 0'
>>>  -3946> 2015-10-11 20:51:11.013965 7f5ab20f2700  1 --
>>> 10.0.5.31:6802/27121 <== osd.25 10.0.5.57:6804/32341 6163 
>>> osd_op_reply(46349 mds0_sessionmap [writefull 0~95168363] v0'0 uv0
>>> ondisk = -90 ((90) Message too long)) v6  182+0+0 (2955408122 0 0)
>>> 0x3a55d340 con 0x3d5a3c0
>>>   -705> 2015-10-11 20:51:11.374132 7f5ab22f4700  1 --
>>> 10.0.5.31:6802/27121 <== osd.28 10.0.5.50:6801/1787 5297 
>>> osd_op_reply(48004 300.e274 [delete] v0'0 uv1349638 ondisk = -2
>>> ((2) No such file or directory)) v6  179+0+0 (1182549251 0 0)
>>> 0x66c5c80 con 0x3d5a7e0
>>>
>>> Any idea what this could be Greg?
>>
>> To follow this up I found this ticket from 9 months ago:
>> http://tracker.ceph.com/issues/10449 In there Yan says:
>>
>> "it's a kernel bug. hang request prevents mds from trimming
>> completed_requests in sessionmap. there is nothing to do with mds.
>> (maybe we should add some code to MDS to show warning when this bug
>> happens)"
>>
>> When I was debugging this I saw an OSD (not cephfs client) operation
>> stuck for a long time along with the MDS error:
>>
>> HEALTH_WARN 1 requests are blocked > 32 sec; 1 osds have slow
>> requests; mds cluster is degraded; mds0: Behind on trimming (709/30)
>> 1 ops are blocked > 16777.2 sec
>> 1 ops are blocked > 16777.2 sec on osd.28
>>
>> I did eventually bounce the OSD in question and it hasn't become stuck
>> since, but the MDS is still eating it every time with the "Message too
>> long" error on the session map.
>>
>> I'm not quite sure where to go from here.
>
> First time I had a chance to use the new recovery tools. I was able to
> replay the journal, reset it and then reset the sessionmap. The MDS
> came back to life and so far everything looks good. Yay.
>
> Triggering this bug/issue is a pretty interesting set of steps.

 Spoke too soon, a missing dir is now causing the MDS to restart itself.

 -6> 2015-10-11 22:40:47.300169 7f580c7b9700  5 -- op tracker --
 seq: 4, time: 2015-10-11 22:40:47.300168, event: finishing request,
 op: client_request(client.3597476:21480382 rmdir #100015e0be2/58
 2015-10-11 21:34:49.224905 RETRY=36)
 -5> 2015-10-11 22:40:47.300208 7f580c7b9700  5 -- op tracker --
 seq: 4, time: 2015-10-11 22:40:47.300208, event: cleaned up request,
 op: client_request(client.3597476:21480382 rmdir #100015e0be2/58
 2015-10-11 21:34:49.224905 RETRY=36)
 -4> 2015-10-11 22:40:47.300231 7f580c7b9700  5 -- op tracker --
 seq: 4, time: 2015-10-11 

librbd regression with Hammer v0.94.4 -- use caution!

2015-10-21 Thread Sage Weil
There is a regression in librbd in v0.94.4 that can cause VMs to crash.  
For now, please refrain from upgrading hypervisor nodes or other librbd 
users to v0.94.4.

http://tracker.ceph.com/issues/13559

The problem does not affect server-side daemons (ceph-mon, ceph-osd, 
etc.).

Jason's identified the bug and has a fix prepared, but it'll probably take 
a few days before we have v0.94.5 out.


https://github.com/ceph/ceph/commit/4692c330bd992a06b97b5b8975ab71952b22477a

Thanks!
sage


Re: newstore direction

2015-10-21 Thread Mark Nelson

On 10/21/2015 10:51 AM, Ric Wheeler wrote:

On 10/21/2015 10:14 AM, Mark Nelson wrote:



On 10/21/2015 06:24 AM, Ric Wheeler wrote:



On 10/21/2015 06:06 AM, Allen Samuels wrote:

I agree that moving newStore to raw block is going to be a significant
development effort. But the current scheme of using a KV store
combined with a normal file system is always going to be problematic
(FileStore or NewStore). This is caused by the transactional
requirements of the ObjectStore interface, essentially you need to
make transactionally consistent updates to two indexes, one of which
doesn't understand transactions (File Systems) and can never be
tightly-connected to the other one.

You'll always be able to make this "loosely coupled" approach work,
but it will never be optimal. The real question is whether the
performance difference of a suboptimal implementation is something
that you can live with compared to the longer gestation period of the
more optimal implementation. Clearly, Sage believes that the
performance difference is significant or he wouldn't have kicked off
this discussion in the first place.


I think that we need to work with the existing stack - measure and do
some collaborative analysis - before we throw out decades of work.  Very
hard to understand why the local file system is a barrier for
performance in this case when it is not an issue in existing enterprise
applications.

We need some deep analysis with some local file system experts thrown in
to validate the concerns.


I think Sage has been working pretty closely with the XFS guys to
uncover these kinds of issues.  I know if I encounter something fairly
FS specific I try to drag Eric or Dave in.  I think the core of the
problem is that we often find ourselves exercising filesystems in
pretty unusual ways.  While it's probably good that we add this kind
of coverage and help work out somewhat esoteric bugs, I think it does
make our job of making Ceph perform well harder.  One example:  I had
been telling folks for several years to favor dentry and inode cache
due to the way our PG directory splitting works (backed by test
results), but then Sage discovered:

http://www.spinics.net/lists/ceph-devel/msg25644.html

This is just one example of how very nuanced our performance story is.
I can keep many users at least semi-engaged when talking about objects
being laid out in a nested directory structure, how dentry/inode cache
affects that in a general sense, etc.  But combine the kind of
subtlety in the link above with the vastness of things in the data
path that can hurt performance, and people generally just can't wrap
their heads around all of it (With the exception of some of the very
smart folks on this mailing list!)

One of my biggest concerns going forward is reducing the user-facing
complexity of our performance story.  The question I ask myself is:
Does keeping Ceph on a FS help us or hurt us in that regard?


The upshot of that is that the kind of micro-optimization is already
handled by the file system, so the application job should be easier.
Better to fsync() each file from an application that you care about
rather than to worry about using more obscure calls.


I hear you, and I don't want to discount the massive amount of work and 
experience that has gone into making XFS and the other filesystems as 
amazing as they are.  I think Sage's argument that the fit isn't right 
has merit though.  There's a lot of things that we end up working 
around.  Take last winter when we ended up pushing past the 254byte 
inline xattr boundary.  We absolutely want to keep xattrs inlined so the 
idea now is we break large ones down into smaller chunks to try to work 
around the limitation while continuing to employ a 2K inode size. (which 
from my conversations with Ben sounds like it's a little controversial 
in it's own right)  All of this by itself is fairly inconsequential, but 
you add enough of this kind of thing up and it's tough not to feel like 
we're trying to pound a square peg into a round hole.










While I think we can all agree that writing a full-up KV and raw-block
ObjectStore is a significant amount of work. I will offer the case
that the "loosely couple" scheme may not have as much time-to-market
advantage as it appears to have. One example: NewStore performance is
limited due to bugs in XFS that won't be fixed in the field for quite
some time (it'll take at least a couple of years before a patched
version of XFS will be widely deployed at customer environments).


Not clear what bugs you are thinking of or why you think fixing bugs
will take a long time to hit the field in XFS. Red Hat has most of the
XFS developers on staff and we actively backport fixes and ship them,
other distros do as well.

Never seen a "bug" take a couple of years to hit users.


Maybe a good way to start out would be to see how quickly we can get
the patch dchinner posted here:

http://oss.sgi.com/archives/xfs/2015-10/msg00545.html

rolled out 

Re: [PATCH] mark rbd requiring stable pages

2015-10-21 Thread Ilya Dryomov
On Wed, Oct 21, 2015 at 10:51 PM, Ilya Dryomov  wrote:
> On Fri, Oct 16, 2015 at 1:09 PM, Ilya Dryomov  wrote:
>> Hmm...  On the one hand, yes, we do compute CRCs, but that's optional,
>> so enabling this unconditionally is probably too harsh.  OTOH we are
>> talking to the network, which means all sorts of delays, retransmission
>> issues, etc, so I wonder how exactly "unstable" pages behave when, say,
>> added to an skb - you can't write anything to a page until networking
>> is fully done with it and expect it to work.  It's particularly
>> alarming that you've seen corruptions.
>>
>> Currently the only users of this flag are block integrity stuff and
>> md-raid5, which makes me wonder what iscsi, nfs and others do in this
>> area.  There's an old ticket on this topic somewhere on the tracker, so
>> I'll need to research this.  Thanks for bringing this up!
>
> Hi Mike,
>
> I was hoping to grab you for a few minutes, but you weren't there...
>
> I spent a better part of today reading code and mailing lists on this
> topic.  It is of course a bug that we use sendpage() which inlines
> pages into an skb and do nothing to keep those pages stable.  We have
> csums enabled by default, so setting BDI_CAP_STABLE_WRITES in the crc
> case is an obvious fix.
>
> I looked at drbd and iscsi and I think iscsi could do the same - ditch
> the fallback to sock_no_sendpage() in the datadgst_en case (and get rid
> of iscsi_sw_tcp_conn::sendpage member while at it).  Using stable pages
> rather than having a roll-your-own implementation which doesn't close
> the race but only narrows it sounds like a win, unless copying through
> sendmsg() is for some reason cheaper than stable-waiting?
>
> drbd still needs the non-zero-copy version for its async protocol for
> when they free the pages before the NIC has chance to put them on the
> wire.  md-raid5 it turns out has an option to essentially disable most
> of its stripe cache and so it sets BDI_CAP_STABLE_WRITES to compensate
> if that option is enabled.
>
> What I'm worried about is the !crc (!datadgst_en) case.  I'm failing to
> convince myself that mucking with sendpage()ed pages while they sit in
> the TCP queue (or anywhere in the networking stack, really), is safe -
> there is nothing to prevent pages from being modified after sendpage()
> returned and Ronny reports data corruptions that pretty much went away
> with BDI_CAP_STABLE_WRITES set.  I may be, after prolonged staring at
> this, starting to confuse fs with block, though.  How does that work in
> iscsi land?
>
> (There was/is also this [1] bug, which is kind of related and probably
> worth looking into at some point later.  ceph shouldn't be that easily
> affected - we've got state, but there is a ticket for it.)
>
> [1] http://www.spinics.net/lists/linux-nfs/msg34913.html

And now with Mike on the CC and a mention that at least one scenario of
[1] got fixed in NFS by a6b31d18b02f ("SUNRPC: Fix a data corruption
issue when retransmitting RPC calls").

Thanks,

Ilya


Re: newstore direction

2015-10-21 Thread Martin Millnert
Adding 2c

On Wed, 2015-10-21 at 14:37 -0500, Mark Nelson wrote:
> My thought is that there is some inflection point where the userland 
> kvstore/block approach is going to be less work, for everyone I think, 
> than trying to quickly discover, understand, fix, and push upstream 
> patches that sometimes only really benefit us.  I don't know if we've 
> truly hit that point, but it's tough for me to find flaws with 
> Sage's argument.

Regarding the userland / kernel land aspect of the topic, there are
further aspects AFAIK not yet addressed in the thread:
In the networking world, there's been development on memory mapped
(multiple approaches exist) userland networking, which for packet
management has the benefit of - for very, very specific applications of
networking code - avoiding e.g. per-packet context switches etc, and
streamlining processor cache management performance. People have gone as
far as removing CPU cores from CPU scheduler to completely dedicate them
to the networking task at hand (cache optimizations). There are various
latency/throughput (bulking) optimizations applicable, but at the end of
the day, it's about keeping the CPU bus busy with "revenue" bus traffic.

Granted, storage IO operations may be much heavier in cycle counts for
context switches to ever appear as a problem in themselves, certainly
for slower SSDs and HDDs. However, when going for truly high performance
IO, *every* hurdle in the data path counts toward the total latency.
(And really, the characteristics of high-performance random IO approach
those of per-packet handling in networking.)  Now, I'm not really
suggesting memory-mapping a storage device to user space, not at all,
but having better control over the data path for a very specific use
case, reduces dependency on the code that works as best as possible for
the general case, and allows for very purpose-built code, to address a
narrow set of requirements. ("Ceph storage cluster backend" isn't a
typical FS use case.) It also removes the dependency on users having to
wait for the next distro release before they can take up improvements to
the storage code.

A random google came up with related data on where "doing something way
different" /can/ have significant benefits:
http://phunq.net/pipermail/tux3/2015-April/002147.html 

I (FWIW) certainly agree there is merit to the idea.
The scientific approach here could perhaps be to simply enumerate all
corner cases of "generic FS" that actually are cause for the experienced
issues, and assess probability of them being solved (and if so when).
That *could* improve chances of approaching consensus which wouldn't
hurt I suppose?

BR,
Martin



Re: [PATCH] mark rbd requiring stable pages

2015-10-21 Thread Ilya Dryomov
On Fri, Oct 16, 2015 at 1:09 PM, Ilya Dryomov  wrote:
> Hmm...  On the one hand, yes, we do compute CRCs, but that's optional,
> so enabling this unconditionally is probably too harsh.  OTOH we are
> talking to the network, which means all sorts of delays, retransmission
> issues, etc, so I wonder how exactly "unstable" pages behave when, say,
> added to an skb - you can't write anything to a page until networking
> is fully done with it and expect it to work.  It's particularly
> alarming that you've seen corruptions.
>
> Currently the only users of this flag are block integrity stuff and
> md-raid5, which makes me wonder what iscsi, nfs and others do in this
> area.  There's an old ticket on this topic somewhere on the tracker, so
> I'll need to research this.  Thanks for bringing this up!

Hi Mike,

I was hoping to grab you for a few minutes, but you weren't there...

I spent a better part of today reading code and mailing lists on this
topic.  It is of course a bug that we use sendpage() which inlines
pages into an skb and do nothing to keep those pages stable.  We have
csums enabled by default, so setting BDI_CAP_STABLE_WRITES in the crc
case is an obvious fix.

I looked at drbd and iscsi and I think iscsi could do the same - ditch
the fallback to sock_no_sendpage() in the datadgst_en case (and get rid
of iscsi_sw_tcp_conn::sendpage member while at it).  Using stable pages
rather than having a roll-your-own implementation which doesn't close
the race but only narrows it sounds like a win, unless copying through
sendmsg() is for some reason cheaper than stable-waiting?

drbd still needs the non-zero-copy version for its async protocol for
when they free the pages before the NIC has a chance to put them on the 
wire.  md-raid5 it turns out has an option to essentially disable most
of its stripe cache and so it sets BDI_CAP_STABLE_WRITES to compensate
if that option is enabled.

What I'm worried about is the !crc (!datadgst_en) case.  I'm failing to
convince myself that mucking with sendpage()ed pages while they sit in
the TCP queue (or anywhere in the networking stack, really), is safe -
there is nothing to prevent pages from being modified after sendpage()
returned and Ronny reports data corruptions that pretty much went away
with BDI_CAP_STABLE_WRITES set.  I may be, after prolonged staring at
this, starting to confuse fs with block, though.  How does that work in
iscsi land?
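
A small userspace sketch of the race being described here (illustration only,
not rbd or kernel networking code; the checksum and timing are made up): a
checksum is taken when the page is handed off, the page then sits "queued",
and a concurrent writer dirties it before the bytes are actually copied to
the wire, so the receiver's check fails.  Stable pages close exactly this
window.

/* Compile with: cc -pthread race.c */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SZ 4096

static unsigned char page[PAGE_SZ];     /* page handed to the transport   */
static unsigned char wire[PAGE_SZ];     /* what actually goes on the wire */

static uint32_t checksum(const unsigned char *p, size_t n)
{
        uint32_t c = 0;
        while (n--)
                c = c * 31 + *p++;
        return c;
}

static void *sender(void *arg)
{
        uint32_t crc = checksum(page, PAGE_SZ);   /* csum taken at send time */
        usleep(100 * 1000);                       /* page sits in a queue... */
        memcpy(wire, page, PAGE_SZ);              /* ...and goes out later   */
        printf("sender crc %08x, receiver crc %08x -> %s\n",
               (unsigned)crc, (unsigned)checksum(wire, PAGE_SZ),
               crc == checksum(wire, PAGE_SZ) ? "ok" : "CRC MISMATCH");
        return NULL;
}

int main(void)
{
        pthread_t t;
        pthread_create(&t, NULL, sender, NULL);
        usleep(50 * 1000);
        memset(page, 0xab, PAGE_SZ);   /* writer dirties the in-flight page */
        pthread_join(&t, NULL);
        return 0;
}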

(There was/is also this [1] bug, which is kind of related and probably
worth looking into at some point later.  ceph shouldn't be that easily
affected - we've got state, but there is a ticket for it.)

[1] http://www.spinics.net/lists/linux-nfs/msg34913.html

Thanks,

Ilya


Re: MDS stuck in a crash loop

2015-10-21 Thread John Spray
> John, I know you've got
> https://github.com/ceph/ceph-qa-suite/pull/647. I think that's
> supposed to be for this, but I'm not sure if you spotted any issues
> with it or if we need to do some more diagnosing?

That test path is just verifying that we do handle dirs without dying
in at least one case -- it passes with the existing ceph code, so it's
not reproducing this issue.


Re: MDS stuck in a crash loop

2015-10-21 Thread Gregory Farnum
On Wed, Oct 21, 2015 at 2:33 PM, John Spray  wrote:
> On Wed, Oct 21, 2015 at 10:33 PM, John Spray  wrote:
>>> John, I know you've got
>>> https://github.com/ceph/ceph-qa-suite/pull/647. I think that's
>>> supposed to be for this, but I'm not sure if you spotted any issues
>>> with it or if we need to do some more diagnosing?
>>
>> That test path is just verifying that we do handle dirs without dying
>> in at least one case -- it passes with the existing ceph code, so it's
>> not reproducing this issue.
>
> Clicked send too soon, I was about to add...
>
> Milosz mentioned that they don't have the data from the system in the
> broken state, so I don't have any bright ideas about learning more
> about what went wrong here unfortunately.

Yeah, I guess we'll just need to watch out for it in the future. :/


Re: MDS stuck in a crash loop

2015-10-21 Thread John Spray
On Wed, Oct 21, 2015 at 10:33 PM, John Spray  wrote:
>> John, I know you've got
>> https://github.com/ceph/ceph-qa-suite/pull/647. I think that's
>> supposed to be for this, but I'm not sure if you spotted any issues
>> with it or if we need to do some more diagnosing?
>
> That test path is just verifying that we do handle dirs without dying
> in at least one case -- it passes with the existing ceph code, so it's
> not reproducing this issue.

Clicked send too soon, I was about to add...

Milosz mentioned that they don't have the data from the system in the
broken state, so I don't have any bright ideas about learning more
about what went wrong here unfortunately.

John


RE: newstore direction

2015-10-21 Thread Allen Samuels
I am pushing internally to open-source ZetaScale. Recent events may or may not 
affect that trajectory -- stay tuned.

Allen Samuels
Software Architect, Fellow, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: Mark Nelson [mailto:mnel...@redhat.com] 
Sent: Wednesday, October 21, 2015 10:45 PM
To: Allen Samuels ; Ric Wheeler 
; Sage Weil ; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/21/2015 05:06 AM, Allen Samuels wrote:
> I agree that moving newStore to raw block is going to be a significant 
> development effort. But the current scheme of using a KV store combined with 
> a normal file system is always going to be problematic (FileStore or 
> NewStore). This is caused by the transactional requirements of the 
> ObjectStore interface, essentially you need to make transactionally 
> consistent updates to two indexes, one of which doesn't understand 
> transactions (File Systems) and can never be tightly-connected to the other 
> one.
>
> You'll always be able to make this "loosely coupled" approach work, but it 
> will never be optimal. The real question is whether the performance 
> difference of a suboptimal implementation is something that you can live with 
> compared to the longer gestation period of the more optimal implementation. 
> Clearly, Sage believes that the performance difference is significant or he 
> wouldn't have kicked off this discussion in the first place.
>
> While I think we can all agree that writing a full-up KV and raw-block 
> ObjectStore is a significant amount of work, I will offer the case that the 
> "loosely coupled" scheme may not have as much time-to-market advantage as it 
> appears to have. One example: NewStore performance is limited due to bugs in 
> XFS that won't be fixed in the field for quite some time (it'll take at least 
> a couple of years before a patched version of XFS will be widely deployed at 
> customer environments).
>
> Another example: Sage has just had to substantially rework the journaling 
> code of rocksDB.
>
> In short, as you can tell, I'm full throated in favor of going down the 
> optimal route.
>
> Internally at Sandisk, we have a KV store that is optimized for flash (it's 
> called ZetaScale). We have extended it with a raw block allocator just as 
> Sage is now proposing to do. Our internal performance measurements show a 
> significant advantage over the current NewStore. That performance advantage 
> stems primarily from two things:

Has there been any discussion regarding opensourcing zetascale?

>
> (1) ZetaScale uses a B+-tree internally rather than an LSM tree 
> (levelDB/RocksDB). LSM trees experience exponential increase in write 
> amplification (cost of an insert) as the amount of data under management 
> increases. B+tree write-amplification is nearly constant independent of the 
> size of data under management. As the KV database gets larger (Since newStore 
> is effectively moving the per-file inode into the kv data base. Don't forget 
> checksums that Sage wants to add :)) this performance delta swamps all 
> others.
> (2) Having a KV and a file-system causes a double lookup. This costs CPU time 
> and disk accesses to page in data structure indexes, metadata efficiency 
> decreases.
>
> You can't avoid (2) as long as you're using a file system.
>
> Yes an LSM tree performs better on HDD than does a B-tree, which is a good 
> argument for keeping the KV module pluggable.
>
>
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
>
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com
>
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler
> Sent: Tuesday, October 20, 2015 11:32 AM
> To: Sage Weil ; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> On 10/19/2015 03:49 PM, Sage Weil wrote:
>> The current design is based on two simple ideas:
>>
>>1) a key/value interface is better way to manage all of our 
>> internal metadata (object metadata, attrs, layout, collection 
>> membership, write-ahead logging, overlay data, etc.)
>>
>>2) a file system is well suited for storage object data (as files).
>>
>> So far 1 is working out well, but I'm questioning the wisdom of #2.  
>> A few
>> things:
>>
>>- We currently write the data to the file, fsync, then commit the 
>> kv transaction.  That's at least 3 IOs: one for the data, one for the 
>> fs journal, one for the kv txn to commit (at least once my rocksdb 
>> changes land... the kv commit is currently 2-3).  So two people are 
>> managing metadata, here: the fs managing the file metadata (with its 
>> own
>> journal) and the kv backend (with its journal).
>
> If all of 

RE: newstore direction

2015-10-21 Thread Allen Samuels
One of the biggest changes that flash is making in the storage world is 
the way basic trade-offs in storage management software architecture are being 
affected. In the HDD world CPU time per IOP was relatively inconsequential, 
i.e., it had little effect on overall performance which was limited by the 
physics of the hard drive. Flash is now inverting that situation. When you look 
at the performance levels being delivered in the latest generation of NVMe SSDs 
you rapidly see that the storage itself is generally no longer the bottleneck 
(speaking about BW, not latency of course) but rather it's the system sitting 
in front of the storage that is the bottleneck. Generally it's the CPU cost of 
an IOP.

When Sandisk first started working with Ceph (Dumpling), the design of librados 
and the OSD lead to the situation that the CPU cost of an IOP was dominated by 
context switches and network socket handling. Over time, much of that has been 
addressed. The socket handling code has been re-written (more than once!), and some 
of the internal queueing in the OSD (and the associated context switches) have 
been eliminated. As the CPU costs have dropped, performance on flash has 
improved accordingly.

Because we didn't want to completely re-write the OSD (time-to-market and 
stability drove that decision), we didn't move it from the current "thread per 
IOP" model into a truly asynchronous "thread per CPU core" model that 
essentially eliminates context switches in the IO path. But a fully optimized 
OSD would go down that path (at least part-way). I believe it's been proposed 
in the past. Perhaps a hybrid "fast-path" style could get most of the benefits 
while preserving much of the legacy code.

I believe this trend toward thread-per-core software development will also tend 
to support the "do it in user-space" trend. That's because most of the kernel 
and file-system interface is architected around the blocking "thread-per-IOP" 
model and is unlikely to change in the future.
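
A rough sketch of the "thread per CPU core" run-to-completion model described
above (the ring structure and names are placeholders, not OSD code): each
worker is pinned to one core and polls its own submission ring, so an IO is
handled start to finish on that core with no per-IO context switch. In
practice the submit side would shard by placement group or object hash so
each core only ever touches its own state.

/* Compile with: cc -pthread percore.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>

#define NCORES 2
#define RING   1024

struct ring {
        _Atomic unsigned head, tail;
        int slot[RING];
};

static struct ring rings[NCORES];

static void *worker(void *arg)
{
        long core = (long)arg;
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        struct ring *r = &rings[core];
        for (;;) {
                unsigned h = atomic_load(&r->head);
                if (h == atomic_load(&r->tail))
                        continue;                /* busy-poll: no sleep, no switch */
                int io = r->slot[h % RING];
                if (io < 0)
                        break;                   /* shutdown marker */
                /* ...handle the IO to completion on this core... */
                printf("core %ld handled io %d\n", core, io);
                atomic_store(&r->head, h + 1);
        }
        return NULL;
}

static void submit(int core, int io)             /* single producer per ring */
{
        struct ring *r = &rings[core];
        unsigned t = atomic_load(&r->tail);
        r->slot[t % RING] = io;
        atomic_store(&r->tail, t + 1);
}

int main(void)
{
        pthread_t tid[NCORES];
        for (long c = 0; c < NCORES; c++)
                pthread_create(&tid[c], NULL, worker, (void *)c);
        for (int io = 0; io < 4; io++)
                submit(io % NCORES, io);         /* e.g. shard by placement group */
        for (int c = 0; c < NCORES; c++)
                submit(c, -1);
        for (int c = 0; c < NCORES; c++)
                pthread_join(tid[c], NULL);
        return 0;
}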


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: Martin Millnert [mailto:mar...@millnert.se]
Sent: Thursday, October 22, 2015 6:20 AM
To: Mark Nelson 
Cc: Ric Wheeler ; Allen Samuels 
; Sage Weil ; 
ceph-devel@vger.kernel.org
Subject: Re: newstore direction

Adding 2c

On Wed, 2015-10-21 at 14:37 -0500, Mark Nelson wrote:
> My thought is that there is some inflection point where the userland
> kvstore/block approach is going to be less work, for everyone I think,
> than trying to quickly discover, understand, fix, and push upstream
> patches that sometimes only really benefit us.  I don't know if we've
> truly hit that point, but it's tough for me to find flaws with
> Sage's argument.

Regarding the userland / kernel land aspect of the topic, there are further 
aspects AFAIK not yet addressed in the thread:
In the networking world, there's been development on memory mapped (multiple 
approaches exist) userland networking, which for packet management has the 
benefit of - for very, very specific applications of networking code - avoiding 
e.g. per-packet context switches etc, and streamlining processor cache 
management performance. People have gone as far as removing CPU cores from CPU 
scheduler to completely dedicate them to the networking task at hand (cache 
optimizations). There are various latency/throughput (bulking) optimizations 
applicable, but at the end of the day, it's about keeping the CPU bus busy with 
"revenue" bus traffic.

Granted, storage IO operations may be much heavier in cycle counts for context 
switches to ever appear as a problem in themselves, certainly for slower SSDs 
and HDDs. However, when going for truly high performance IO, *every* hurdle in 
the data path counts toward the total latency.
(And really, the characteristics of high-performance random IO approach those of 
per-packet handling in networking.)  Now, I'm not really 
suggesting memory-mapping a storage device to user space, not at all, but 
having better control over the data path for a very specific use case, reduces 
dependency on the code that works as best as possible for the general case, and 
allows for very purpose-built code, to address a narrow set of requirements. 
("Ceph storage cluster backend" isn't a typical FS use case.) It also decouples 
dependencies on users i.e.
waiting for the next distro release before being able to take up the benefits 
of improvements to the storage code.

A random google came up with related data on where "doing something way 
different" /can/ have significant benefits:
http://phunq.net/pipermail/tux3/2015-April/002147.html

I (FWIW) certainly agree there is merit to the idea.
The scientific approach here could perhaps be to simply enumerate 

RE: newstore direction

2015-10-21 Thread Allen Samuels
Fixing the bug doesn't take a long time. Getting it deployed is where the delay 
is. Many companies standardize on a particular release of a particular distro. 
Getting them to switch to a new release -- even a "bug fix" point release -- is 
a major undertaking that often is a complete roadblock. Just my experience. 
YMMV. 


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: Ric Wheeler [mailto:rwhee...@redhat.com] 
Sent: Wednesday, October 21, 2015 8:24 PM
To: Allen Samuels ; Sage Weil ; 
ceph-devel@vger.kernel.org
Subject: Re: newstore direction



On 10/21/2015 06:06 AM, Allen Samuels wrote:
> I agree that moving newStore to raw block is going to be a significant 
> development effort. But the current scheme of using a KV store combined with 
> a normal file system is always going to be problematic (FileStore or 
> NewStore). This is caused by the transactional requirements of the 
> ObjectStore interface, essentially you need to make transactionally 
> consistent updates to two indexes, one of which doesn't understand 
> transactions (File Systems) and can never be tightly-connected to the other 
> one.
>
> You'll always be able to make this "loosely coupled" approach work, but it 
> will never be optimal. The real question is whether the performance 
> difference of a suboptimal implementation is something that you can live with 
> compared to the longer gestation period of the more optimal implementation. 
> Clearly, Sage believes that the performance difference is significant or he 
> wouldn't have kicked off this discussion in the first place.

I think that we need to work with the existing stack - measure and do some 
collaborative analysis - before we throw out decades of work.  Very hard to 
understand why the local file system is a barrier for performance in this case 
when it is not an issue in existing enterprise applications.

We need some deep analysis with some local file system experts thrown in to 
validate the concerns.

>
> While I think we can all agree that writing a full-up KV and raw-block 
> ObjectStore is a significant amount of work, I will offer the case that the 
> "loosely coupled" scheme may not have as much time-to-market advantage as it 
> appears to have. One example: NewStore performance is limited due to bugs in 
> XFS that won't be fixed in the field for quite some time (it'll take at least 
> a couple of years before a patched version of XFS will be widely deployed at 
> customer environments).

Not clear what bugs you are thinking of or why you think fixing bugs will take 
a long time to hit the field in XFS. Red Hat has most of the XFS developers on 
staff and we actively backport fixes and ship them, other distros do as well.

Never seen a "bug" take a couple of years to hit users.

Regards,

Ric

>
> Another example: Sage has just had to substantially rework the journaling 
> code of rocksDB.
>
> In short, as you can tell, I'm full throated in favor of going down the 
> optimal route.
>
> Internally at Sandisk, we have a KV store that is optimized for flash (it's 
> called ZetaScale). We have extended it with a raw block allocator just as 
> Sage is now proposing to do. Our internal performance measurements show a 
> significant advantage over the current NewStore. That performance advantage 
> stems primarily from two things:
>
> (1) ZetaScale uses a B+-tree internally rather than an LSM tree 
> (levelDB/RocksDB). LSM trees experience exponential increase in write 
> amplification (cost of an insert) as the amount of data under management 
> increases. B+tree write-amplification is nearly constant independent of the 
> size of data under management. As the KV database gets larger (Since newStore 
> is effectively moving the per-file inode into the kv data base. Don't forget 
> checksums that Sage wants to add :)) this performance delta swamps all 
> others.
> (2) Having a KV and a file-system causes a double lookup. This costs CPU time 
> and disk accesses to page in data structure indexes, metadata efficiency 
> decreases.
>
> You can't avoid (2) as long as you're using a file system.
>
> Yes an LSM tree performs better on HDD than does a B-tree, which is a good 
> argument for keeping the KV module pluggable.
>
>
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
>
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com
>
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler
> Sent: Tuesday, October 20, 2015 11:32 AM
> To: Sage Weil ; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> On 10/19/2015 03:49 PM, Sage Weil wrote:
>> The current design is based on 

RE: newstore direction

2015-10-21 Thread Allen Samuels
Actually, range queries are an important part of the performance story, and 
random read speed doesn't really solve the problem.

When you're doing a scrub, you need to enumerate the objects in a specific 
order on multiple nodes -- so that they can compare the contents of their 
stores in order to determine if data cleaning needs to take place.

If you don't have in-order enumeration in your basic data structure (which 
NVMKV doesn't have) then you're forced to sort the directory before you can 
respond to an enumeration. That sort will either consume huge amounts of IOPS 
OR huge amounts of DRAM. Regardless of the choice, you'll see a significant 
degradation of performance while the scrub is ongoing -- which is one of the 
biggest problems with clustered systems (expensive and extensive maintenance 
operations).
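
A toy sketch of the enumeration cost being described (illustrative keys only,
not NVMKV or Ceph code): replicas can only be compared in lockstep if both
sides walk their objects in the same order. An ordered KV store hands out
keys already sorted; a hash-based store forces a full key dump plus an
O(n log n) in-memory sort (DRAM) or an external sort (IOPS) before the scrub
can even start.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int cmp(const void *a, const void *b)
{
        return strcmp(*(const char * const *)a, *(const char * const *)b);
}

int main(void)
{
        /* Keys as a hash-based store might return them: no defined order. */
        const char *keys[] = { "obj.0007", "obj.0001", "obj.0042",
                               "obj.0013", "obj.0002" };
        size_t n = sizeof(keys) / sizeof(keys[0]);

        /* The extra pass an unordered backend forces on every scrub. */
        qsort(keys, n, sizeof(keys[0]), cmp);

        /* Now both replicas can stream keys and compare contents in order. */
        for (size_t i = 0; i < n; i++)
                printf("%s\n", keys[i]);
        return 0;
}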


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: Chen, Xiaoxi [mailto:xiaoxi.c...@intel.com]
Sent: Thursday, October 22, 2015 1:10 AM
To: Mark Nelson ; Allen Samuels 
; Sage Weil 
Cc: James (Fei) Liu-SSI ; Somnath Roy 
; ceph-devel@vger.kernel.org
Subject: RE: newstore direction

We did evaluate whether NVMKV could be implemented on non-FusionIO SSDs, i.e. 
re-inventing an NVMKV; the conclusion was that it's not hard with persistent 
memory (which will be available soon).  But NVMKV will not work if no PM is 
present -- persisting the hashing table to SSD is not practical.

Range queries don't seem like a very big issue, as the random read performance of 
today's SSDs is more than enough: even if we break all sequential access into 
random (typically 70-80K IOPS, which is ~300MB/s), the performance is still good 
enough.

Anyway, I think for the high-IOPS case it's hard for the consumer to perform 
well on SSDs from different vendors; it would be better to leave that to the SSD 
vendor, something like OpenStack Cinder's structure: the vendor has the 
responsibility to maintain their driver for Ceph and take care of its performance.

> -Original Message-
> From: Mark Nelson [mailto:mnel...@redhat.com]
> Sent: Wednesday, October 21, 2015 9:36 PM
> To: Allen Samuels; Sage Weil; Chen, Xiaoxi
> Cc: James (Fei) Liu-SSI; Somnath Roy; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> Thanks Allen!  The devil is always in the details.  Know of anything
> else that looks promising?
>
> Mark
>
> On 10/21/2015 05:06 AM, Allen Samuels wrote:
> > I doubt that NVMKV will be useful for two reasons:
> >
> > (1) It relies on the unique sparse-mapping addressing capabilities
> > of the FusionIO VSL interface, it won't run on standard SSDs
> > (2) NVMKV doesn't provide any form of in-order enumeration (i.e., no
> range operations on keys). This is pretty much required for deep scrubbing.
> >
> >
> > Allen Samuels
> > Software Architect, Fellow, Systems and Software Solutions
> >
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com
> >
> > -Original Message-
> > From: ceph-devel-ow...@vger.kernel.org
> > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson
> > Sent: Tuesday, October 20, 2015 6:20 AM
> > To: Sage Weil ; Chen, Xiaoxi
> > 
> > Cc: James (Fei) Liu-SSI ; Somnath Roy
> > ; ceph-devel@vger.kernel.org
> > Subject: Re: newstore direction
> >
> > On 10/20/2015 07:30 AM, Sage Weil wrote:
> >> On Tue, 20 Oct 2015, Chen, Xiaoxi wrote:
> >>> +1, nowadays K-V DB care more about very small key-value pairs,
> >>> +say
> >>> several bytes to a few KB, but in SSD case we only care about 4KB
> >>> or 8KB. In this way, NVMKV is a good design and seems some of the
> >>> SSD vendor are also trying to build this kind of interface, we had
> >>> a NVM-L library but still under development.
> >>
> >> Do you have an NVMKV link?  I see a paper and a stale github repo..
> >> not sure if I'm looking at the right thing.
> >>
> >> My concern with using a key/value interface for the object data is
> >> that you end up with lots of key/value pairs (e.g., $inode_$offset
> >> =
> >> $4kb_of_data) that is pretty inefficient to store and (depending on
> >> the
> >> implementation) tends to break alignment.  I don't think these
> >> interfaces are targetted toward block-sized/aligned payloads.
> >> Storing just the metadata (block allocation map) w/ the kv api and
> >> storing the data directly on a block/page interface makes more
> >> sense to
> me.
> >>
> >> sage
> >
> > I get the feeling that some of the folks that were involved with
> > nvmkv at
> Fusion IO have left.  Nisha Talagala is now out at Parallel Systems for 
> instance.
> http://pmem.io might be a better bet, though I 

Re: newstore direction

2015-10-21 Thread Ric Wheeler

On 10/21/2015 08:53 PM, Allen Samuels wrote:

Fixing the bug doesn't take a long time. Getting it deployed is where the delay is. Many 
companies standardize on a particular release of a particular distro. Getting them to 
switch to a new release -- even a "bug fix" point release -- is a major 
undertaking that often is a complete roadblock. Just my experience. YMMV.



Customers do control the pace at which they upgrade their machines, but we put out 
fixes at a very regular pace.  A lot of customers will get fixes without having 
to qualify a full new release (i.e., fixes that come out between major and minor 
releases are easy to take).


If someone is deploying a critical server for storage, then it falls back on the 
storage software team to help guide them and encourage them to update when 
needed (and no promises of success, but people move if the win is big. If it is 
not, they can wait).


ric



RE: newstore direction

2015-10-21 Thread Allen Samuels
I agree. My only point was that you still have to factor this time into the 
argument that by continuing to put NewStore on top of a file system you'll get 
to a stable system much sooner than the longer development path of doing your 
own raw storage allocator. IMO, once you factor that into the equation the "on 
top of an FS" path doesn't look like such a clear winner.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: Ric Wheeler [mailto:rwhee...@redhat.com]
Sent: Thursday, October 22, 2015 10:17 AM
To: Allen Samuels ; Sage Weil ; 
ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/21/2015 08:53 PM, Allen Samuels wrote:
> Fixing the bug doesn't take a long time. Getting it deployed is where the 
> delay is. Many companies standardize on a particular release of a particular 
> distro. Getting them to switch to a new release -- even a "bug fix" point 
> release -- is a major undertaking that often is a complete roadblock. Just my 
> experience. YMMV.
>

Customers do control the pace at which they upgrade their machines, but we put out 
fixes at a very regular pace.  A lot of customers will get fixes without having 
to qualify a full new release (i.e., fixes that come out between major and minor 
releases are easy to take).

If someone is deploying a critical server for storage, then it falls back on 
the storage software team to help guide them and encourage them to update when 
needed (and no promises of success, but people move if the win is big. If it is 
not, they can wait).

ric







Re: newstore direction

2015-10-21 Thread Sage Weil
On Tue, 20 Oct 2015, Ric Wheeler wrote:
> > Now:
> >  1 io  to write a new file
> >1-2 ios to sync the fs journal (commit the inode, alloc change)
> >(I see 2 journal IOs on XFS and only 1 on ext4...)
> >  1 io  to commit the rocksdb journal (currently 3, but will drop to
> >1 with xfs fix and my rocksdb change)
> 
> I think that might be too pessimistic - the number of discrete IO's sent down
> to a spinning disk makes much less impact on performance than the number of
> fsync()'s since the IO's all land in the write cache.  Some newer spinning
> drives have a non-volatile write cache, so even an fsync() might not end up
> doing the expensive data transfer to the platter.

True, but in XFS's case at least the file data and journal are not 
colocated, so it's 2 seeks for the new file write+fdatasync and another for 
the rocksdb journal commit.  Of course, with a deep queue, we're doing 
lots of these so there'll be fewer journal commits on both counts, but the 
lower bound on latency of a single write is still 3 seeks, and that bound 
is pretty critical when you also have network round trips and replication 
(worst out of 2) on top.

> It would be interesting to get the timings on the IO's you see to measure the
> actual impact.

I observed this with the journaling workload for rocksdb, but I assume the 
journaling behavior is the same regardless of what is being journaled.  
For a 4KB append to a file + fdatasync, I saw ~30ms latency for XFS, and 
blktrace showed an IO to the file, and 2 IOs to the journal.  I believe 
the first one is the record for the inode update, and the second is the 
journal 'commit' record (though I forget how I decided that).  My guess is 
that XFS is being extremely careful about journal integrity here and not 
writing the commit record until it knows that the preceding records landed 
on stable storage.  For ext4, the latency was about ~20ms, and blktrace 
showed the IO to the file and then a single journal IO.  When I made the 
rocksdb change to overwrite an existing, prewritten file, the latency 
dropped to ~10ms on ext4, and blktrace showed a single IO as expected.  
(XFS still showed the 2 journal commit IOs, but Dave just posted the fix 
for that on the XFS list today.)
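
For anyone who wants to reproduce the measurement, a minimal sketch (the path
and iteration count are placeholders): time a 4KB append plus fdatasync(),
the pattern the rocksdb journal generates, on XFS vs ext4 and against a
preallocated/overwritten file, while blktrace watches the device to count the
journal IOs per commit.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        const char *path = argc > 1 ? argv[1] : "/mnt/test/journal.bin";
        char buf[4096];
        struct timespec t0, t1;

        memset(buf, 0x5a, sizeof(buf));
        int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        for (int i = 0; i < 16; i++) {
                clock_gettime(CLOCK_MONOTONIC, &t0);
                if (write(fd, buf, sizeof(buf)) != sizeof(buf)) { perror("write"); return 1; }
                if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }
                clock_gettime(CLOCK_MONOTONIC, &t1);
                double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                            (t1.tv_nsec - t0.tv_nsec) / 1e6;
                printf("append+fdatasync #%d: %.2f ms\n", i, ms);
        }
        close(fd);
        return 0;
}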

> Plumbing for T10 DIF/DIX already exists; what is missing is the normal block
> device that handles them (not enterprise SAS/disk array class)

Yeah... which unfortunately means that unless the cheap drives 
suddenly start shipping with DIF/DIX support, we'll need to do the 
checksums ourselves.  This is probably a good thing anyway as it doesn't 
constrain our choice of checksum or checksum granularity, and will 
still work with other storage devices (ssds, nvme, etc.).

sage
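
A sketch of what "doing the checksums ourselves" could look like at a chosen
granularity (zlib crc32 is used purely for illustration; the checksum choice
and block size are exactly the knobs left open): per-block checksums computed
at write time, stored as small metadata next to the allocation map, and
re-verified on read.

/* Compile with: cc csum.c -lz */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define CSUM_BLOCK 4096

/* Checksum 'len' bytes of object data in CSUM_BLOCK chunks; returns the
 * number of checksums written to 'out'. */
static size_t csum_blocks(const unsigned char *data, size_t len, uint32_t *out)
{
        size_t n = 0;
        for (size_t off = 0; off < len; off += CSUM_BLOCK) {
                size_t chunk = len - off < CSUM_BLOCK ? len - off : CSUM_BLOCK;
                out[n++] = (uint32_t)crc32(0L, data + off, (uInt)chunk);
        }
        return n;
}

int main(void)
{
        unsigned char obj[3 * CSUM_BLOCK];
        uint32_t csums[3];

        memset(obj, 0x42, sizeof(obj));
        size_t n = csum_blocks(obj, sizeof(obj), csums);
        for (size_t i = 0; i < n; i++)
                printf("block %zu crc32 %08x\n", i, (unsigned)csums[i]);

        /* On read, recompute and compare against the stored values to catch
         * bit rot that a drive without DIF/DIX would pass through silently. */
        return 0;
}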


RE: newstore direction

2015-10-21 Thread Allen Samuels
I agree that moving newStore to raw block is going to be a significant 
development effort. But the current scheme of using a KV store combined with a 
normal file system is always going to be problematic (FileStore or NewStore). 
This is caused by the transactional requirements of the ObjectStore interface, 
essentially you need to make transactionally consistent updates to two indexes, 
one of which doesn't understand transactions (File Systems) and can never be 
tightly-connected to the other one.

You'll always be able to make this "loosely coupled" approach work, but it will 
never be optimal. The real question is whether the performance difference of a 
suboptimal implementation is something that you can live with compared to the 
longer gestation period of the more optimal implementation. Clearly, Sage 
believes that the performance difference is significant or he wouldn't have 
kicked off this discussion in the first place.

While I think we can all agree that writing a full-up KV and raw-block 
ObjectStore is a significant amount of work, I will offer the case that the 
"loosely coupled" scheme may not have as much time-to-market advantage as it 
appears to have. One example: NewStore performance is limited due to bugs in 
XFS that won't be fixed in the field for quite some time (it'll take at least a 
couple of years before a patched version of XFS will be widely deployed at 
customer environments).

Another example: Sage has just had to substantially rework the journaling code 
of rocksDB.

In short, as you can tell, I'm full throated in favor of going down the optimal 
route.

Internally at Sandisk, we have a KV store that is optimized for flash (it's 
called ZetaScale). We have extended it with a raw block allocator just as Sage 
is now proposing to do. Our internal performance measurements show a 
significant advantage over the current NewStore. That performance advantage 
stems primarily from two things:

(1) ZetaScale uses a B+-tree internally rather than an LSM tree 
(levelDB/RocksDB). LSM trees experience exponential increase in write 
amplification (cost of an insert) as the amount of data under management 
increases. B+tree write-amplification is nearly constant independent of the 
size of data under management. As the KV database gets larger (Since newStore 
is effectively moving the per-file inode into the kv data base. Don't forget 
checksums that Sage wants to add :)) this performance delta swamps all others.
(2) Having a KV and a file-system causes a double lookup. This costs CPU time 
and disk accesses to page in data structure indexes, metadata efficiency 
decreases.

You can't avoid (2) as long as you're using a file system.

Yes an LSM tree performs better on HDD than does a B-tree, which is a good 
argument for keeping the KV module pluggable.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler
Sent: Tuesday, October 20, 2015 11:32 AM
To: Sage Weil ; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/19/2015 03:49 PM, Sage Weil wrote:
> The current design is based on two simple ideas:
>
>   1) a key/value interface is better way to manage all of our internal
> metadata (object metadata, attrs, layout, collection membership,
> write-ahead logging, overlay data, etc.)
>
>   2) a file system is well suited for storage object data (as files).
>
> So far 1 is working out well, but I'm questioning the wisdom of #2.  A
> few
> things:
>
>   - We currently write the data to the file, fsync, then commit the kv
> transaction.  That's at least 3 IOs: one for the data, one for the fs
> journal, one for the kv txn to commit (at least once my rocksdb
> changes land... the kv commit is currently 2-3).  So two people are
> managing metadata, here: the fs managing the file metadata (with its
> own
> journal) and the kv backend (with its journal).

If all of the fsync()'s fall into the same backing file system, are you sure 
that each fsync() takes the same time? Depending on the local FS implementation 
of course, but the order of issuing those fsync()'s can effectively make some 
of them no-ops.

>
>   - On read we have to open files by name, which means traversing the
> fs namespace.  Newstore tries to keep it as flat and simple as
> possible, but at a minimum it is a couple btree lookups.  We'd love to
> use open by handle (which would reduce this to 1 btree traversal), but
> running the daemon as ceph and not root makes that hard...

This seems like a pretty low hurdle to overcome.

>
>   - ...and file systems insist on updating mtime on writes, even when
> it is a overwrite with no allocation changes.  (We don't care about
> mtime.) O_NOCMTIME patches exist but it is hard to get these past the

RE: newstore direction

2015-10-21 Thread Allen Samuels
I doubt that NVMKV will be useful for two reasons:

(1) It relies on the unique sparse-mapping addressing capabilities of the 
FusionIO VSL interface, it won't run on standard SSDs
(2) NVMKV doesn't provide any form of in-order enumeration (i.e., no range 
operations on keys). This is pretty much required for deep scrubbing.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Tuesday, October 20, 2015 6:20 AM
To: Sage Weil ; Chen, Xiaoxi 
Cc: James (Fei) Liu-SSI ; Somnath Roy 
; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/20/2015 07:30 AM, Sage Weil wrote:
> On Tue, 20 Oct 2015, Chen, Xiaoxi wrote:
>> +1, nowadays K-V DB care more about very small key-value pairs, say
>> several bytes to a few KB, but in SSD case we only care about 4KB or
>> 8KB. In this way, NVMKV is a good design and seems some of the SSD
>> vendor are also trying to build this kind of interface, we had a
>> NVM-L library but still under development.
>
> Do you have an NVMKV link?  I see a paper and a stale github repo..
> not sure if I'm looking at the right thing.
>
> My concern with using a key/value interface for the object data is
> that you end up with lots of key/value pairs (e.g., $inode_$offset =
> $4kb_of_data) that is pretty inefficient to store and (depending on
> the
> implementation) tends to break alignment.  I don't think these
> interfaces are targetted toward block-sized/aligned payloads.  Storing
> just the metadata (block allocation map) w/ the kv api and storing the
> data directly on a block/page interface makes more sense to me.
>
> sage

I get the feeling that some of the folks that were involved with nvmkv at 
Fusion IO have left.  Nisha Talagala is now out at Parallel Systems for 
instance.  http://pmem.io might be a better bet, though I haven't looked 
closely at it.

Mark

>
>
>>> -Original Message-
>>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
>>> ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
>>> Sent: Tuesday, October 20, 2015 6:21 AM
>>> To: Sage Weil; Somnath Roy
>>> Cc: ceph-devel@vger.kernel.org
>>> Subject: RE: newstore direction
>>>
>>> Hi Sage and Somnath,
>>>In my humble opinion, There is another more aggressive  solution
>>> than raw block device base keyvalue store as backend for
>>> objectstore. The new key value  SSD device with transaction support would 
>>> be  ideal to solve the issues.
>>> First of all, it is raw SSD device. Secondly , It provides key value
>>> interface directly from SSD. Thirdly, it can provide transaction
>>> support, consistency will be guaranteed by hardware device. It
>>> pretty much satisfied all of objectstore needs without any extra
>>> overhead since there is not any extra layer in between device and 
>>> objectstore.
>>> Either way, I strongly support to have CEPH own data format
>>> instead of relying on filesystem.
>>>
>>>Regards,
>>>James
>>>
>>> -Original Message-
>>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
>>> ow...@vger.kernel.org] On Behalf Of Sage Weil
>>> Sent: Monday, October 19, 2015 1:55 PM
>>> To: Somnath Roy
>>> Cc: ceph-devel@vger.kernel.org
>>> Subject: RE: newstore direction
>>>
>>> On Mon, 19 Oct 2015, Somnath Roy wrote:
 Sage,
 I fully support that.  If we want to saturate SSDs , we need to get
 rid of this filesystem overhead (which I am in process of measuring).
 Also, it will be good if we can eliminate the dependency on the k/v
 dbs (for storing allocators and all). The reason is the unknown
 write amps they causes.
>>>
>>> My hope is to keep behing the KeyValueDB interface (and/more change
>>> it as
>>> appropriate) so that other backends can be easily swapped in (e.g. a
>>> btree- based one for high-end flash).
>>>
>>> sage
>>>
>>>

 Thanks & Regards
 Somnath


 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
 Sent: Monday, October 19, 2015 12:49 PM
 To: ceph-devel@vger.kernel.org
 Subject: newstore direction

 The current design is based on two simple ideas:

   1) a key/value interface is better way to manage all of our
 internal metadata (object metadata, attrs, layout, collection
 membership, write-ahead logging, overlay data, etc.)

   2) a file system is well suited for storage object data (as files).

 So far 1 is working out well, but I'm questioning the wisdom of #2.
 A few
 things:

   - We currently write the data to the file, fsync, then commit the
 kv 

Re: newstore direction

2015-10-21 Thread Ric Wheeler



On 10/21/2015 06:06 AM, Allen Samuels wrote:

I agree that moving newStore to raw block is going to be a significant 
development effort. But the current scheme of using a KV store combined with a 
normal file system is always going to be problematic (FileStore or NewStore). 
This is caused by the transactional requirements of the ObjectStore interface, 
essentially you need to make transactionally consistent updates to two indexes, 
one of which doesn't understand transactions (File Systems) and can never be 
tightly-connected to the other one.

You'll always be able to make this "loosely coupled" approach work, but it will 
never be optimal. The real question is whether the performance difference of a suboptimal 
implementation is something that you can live with compared to the longer gestation 
period of the more optimal implementation. Clearly, Sage believes that the performance 
difference is significant or he wouldn't have kicked off this discussion in the first 
place.


I think that we need to work with the existing stack - measure and do some 
collaborative analysis - before we throw out decades of work.  Very hard to 
understand why the local file system is a barrier for performance in this case 
when it is not an issue in existing enterprise applications.


We need some deep analysis with some local file system experts thrown in to 
validate the concerns.




While I think we can all agree that writing a full-up KV and raw-block ObjectStore is a 
significant amount of work, I will offer the case that the "loosely coupled" 
scheme may not have as much time-to-market advantage as it appears to have. One example: 
NewStore performance is limited due to bugs in XFS that won't be fixed in the field for 
quite some time (it'll take at least a couple of years before a patched version of XFS 
will be widely deployed at customer environments).


Not clear what bugs you are thinking of or why you think fixing bugs will take a 
long time to hit the field in XFS. Red Hat has most of the XFS developers on 
staff and we actively backport fixes and ship them, other distros do as well.


Never seen a "bug" take a couple of years to hit users.

Regards,

Ric



Another example: Sage has just had to substantially rework the journaling code 
of rocksDB.

In short, as you can tell, I'm full throated in favor of going down the optimal 
route.

Internally at Sandisk, we have a KV store that is optimized for flash (it's 
called ZetaScale). We have extended it with a raw block allocator just as Sage 
is now proposing to do. Our internal performance measurements show a 
significant advantage over the current NewStore. That performance advantage 
stems primarily from two things:

(1) ZetaScale uses a B+-tree internally rather than an LSM tree 
(levelDB/RocksDB). LSM trees experience exponential increase in write 
amplification (cost of an insert) as the amount of data under management 
increases. B+tree write-amplification is nearly constant independent of the 
size of data under management. As the KV database gets larger (Since newStore 
is effectively moving the per-file inode into the kv data base. Don't forget 
checksums that Sage wants to add :)) this performance delta swamps all others.
(2) Having a KV and a file-system causes a double lookup. This costs CPU time 
and disk accesses to page in data structure indexes, metadata efficiency 
decreases.

You can't avoid (2) as long as you're using a file system.

Yes an LSM tree performs better on HDD than does a B-tree, which is a good 
argument for keeping the KV module pluggable.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler
Sent: Tuesday, October 20, 2015 11:32 AM
To: Sage Weil ; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/19/2015 03:49 PM, Sage Weil wrote:

The current design is based on two simple ideas:

   1) a key/value interface is better way to manage all of our internal
metadata (object metadata, attrs, layout, collection membership,
write-ahead logging, overlay data, etc.)

   2) a file system is well suited for storage object data (as files).

So far 1 is working out well, but I'm questioning the wisdom of #2.  A
few
things:

   - We currently write the data to the file, fsync, then commit the kv
transaction.  That's at least 3 IOs: one for the data, one for the fs
journal, one for the kv txn to commit (at least once my rocksdb
changes land... the kv commit is currently 2-3).  So two people are
managing metadata, here: the fs managing the file metadata (with its
own
journal) and the kv backend (with its journal).

If all of the fsync()'s fall into the same backing file system, are you sure 
that each fsync() takes the same time? Depending on the 

Re: what does ms_objecter do in OSD ?

2015-10-21 Thread Sage Weil
On Wed, 21 Oct 2015, Jaze Lee wrote:
> Hello,
>I find this messenger do not bind to any ip, so i do not know why we do 
> that.
>Does any one know what ms_object can do ? Thanks a lot.

It is the librados client that is used by the rados copy-from operation 
and for cache tiering (to read/write to other OSDs).  It doesn't bind to 
an IP because it is the client side.

sage


Re: newstore direction

2015-10-21 Thread Ric Wheeler

On 10/21/2015 04:22 AM, Orit Wasserman wrote:

On Tue, 2015-10-20 at 14:31 -0400, Ric Wheeler wrote:

On 10/19/2015 03:49 PM, Sage Weil wrote:

The current design is based on two simple ideas:

   1) a key/value interface is better way to manage all of our internal
metadata (object metadata, attrs, layout, collection membership,
write-ahead logging, overlay data, etc.)

   2) a file system is well suited for storage object data (as files).

So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
things:

   - We currently write the data to the file, fsync, then commit the kv
transaction.  That's at least 3 IOs: one for the data, one for the fs
journal, one for the kv txn to commit (at least once my rocksdb changes
land... the kv commit is currently 2-3).  So two people are managing
metadata, here: the fs managing the file metadata (with its own
journal) and the kv backend (with its journal).

If all of the fsync()'s fall into the same backing file system, are you sure
that each fsync() takes the same time? Depending on the local FS implementation
of course, but the order of issuing those fsync()'s can effectively make some of
them no-ops.


   - On read we have to open files by name, which means traversing the fs
namespace.  Newstore tries to keep it as flat and simple as possible, but
at a minimum it is a couple btree lookups.  We'd love to use open by
handle (which would reduce this to 1 btree traversal), but running
the daemon as ceph and not root makes that hard...

This seems like a pretty low hurdle to overcome.
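
For reference, a sketch of the open-by-handle path being discussed (the paths
are placeholders): resolve the name once with name_to_handle_at() when the
object file is created, persist the handle in the kv metadata, and reopen
later with open_by_handle_at(), which skips the namespace walk.  The catch
mentioned above is that open_by_handle_at() requires CAP_DAC_READ_SEARCH,
which a daemon running as 'ceph' rather than root does not normally have.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        const char *path = "/var/lib/ceph/osd/ceph-0/current/obj";   /* placeholder */
        struct file_handle *fh;
        int mount_id, mount_fd, fd;

        fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
        fh->handle_bytes = MAX_HANDLE_SZ;

        /* At write time: resolve the name once and persist fh + mount_id. */
        if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) < 0) {
                perror("name_to_handle_at");
                return 1;
        }

        /* At read time: reopen via the handle instead of a full path walk.
         * Requires CAP_DAC_READ_SEARCH (hence the "not root" problem). */
        mount_fd = open("/var/lib/ceph/osd/ceph-0", O_RDONLY | O_DIRECTORY);
        fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
        if (fd < 0)
                perror("open_by_handle_at");
        else
                close(fd);
        close(mount_fd);
        free(fh);
        return 0;
}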


   - ...and file systems insist on updating mtime on writes, even when it is
a overwrite with no allocation changes.  (We don't care about mtime.)
O_NOCMTIME patches exist but it is hard to get these past the kernel
brainfreeze.

Are you using O_DIRECT? Seems like there should be some enterprisey database
tricks that we can use here.


   - XFS is (probably) never going to give us data checksums, which we
want desperately.

What is the goal of having the file system do the checksums? How strong do they
need to be and what size are the chunks?

If you update this on each IO, this will certainly generate more IO (each write
will possibly generate at least one other write to update that new checksum).


But what's the alternative?  My thought is to just bite the bullet and
consume a raw block device directly.  Write an allocator, hopefully keep
it pretty simple, and manage it in kv store along with all of our other
metadata.

The big problem with consuming block devices directly is that you ultimately end
up recreating most of the features that you had in the file system. Even
enterprise databases like Oracle and DB2 have been migrating away from running
on raw block devices in favor of file systems over time.  In effect, you are
looking at making a simple on disk file system which is always easier to start
than it is to get back to a stable, production ready state.

The best performance is still on a block device (SAN).
File systems simplify the operational tasks, which is worth the performance
penalty for a database. I think in a storage system this is not the
case.
In many cases they can use their own file system that is tailored for
the database.


You will have to trust me on this as the Red Hat person who spoke to pretty much 
all of our key customers about local file systems and storage - customers all 
have migrated over to using normal file systems under Oracle/DB2. Typically, 
they use XFS or ext4.  I don't know of any non-standard file systems and only 
have seen one account running on a raw block store in 8 years :)


If you have a pre-allocated file and write using O_DIRECT, your IO path is 
identical in terms of IO's sent to the device.
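
A minimal sketch of that pattern (the file name and sizes are placeholders):
preallocate the file once, then do aligned O_DIRECT overwrites into it, so no
allocation or size change has to be journaled on the data path and, per Ric's
point, the writes reaching the device look like raw block writes.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define FILE_SZ  (16u << 20)     /* 16 MiB preallocated object file */
#define BLK      4096u

int main(void)
{
        void *buf;
        int fd = open("/mnt/test/prealloc.obj",
                      O_RDWR | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* One-time allocation; after this, overwrites change no block layout. */
        if (fallocate(fd, 0, 0, FILE_SZ) < 0) { perror("fallocate"); return 1; }

        /* O_DIRECT needs an aligned buffer and block-multiple sizes. */
        if (posix_memalign(&buf, BLK, BLK) != 0) {
                fprintf(stderr, "posix_memalign failed\n");
                return 1;
        }
        memset(buf, 0x5a, BLK);

        /* Overwrite block 5 of the preallocated file: a single data IO. */
        if (pwrite(fd, buf, BLK, 5 * BLK) != (ssize_t)BLK) { perror("pwrite"); return 1; }
        if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }

        free(buf);
        close(fd);
        return 0;
}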


If we are causing additional IO's, then we really need to spend some time 
talking to the local file system gurus about this in detail.  I can help with 
that conversation.





I think that it might be quicker and more maintainable to spend some time
working with the local file system people (XFS or other) to see if we can
jointly address the concerns you have.

Wins:

   - 2 IOs for most: one to write the data to unused space in the block
device, one to commit our transaction (vs 4+ before).  For overwrites,
we'd have one io to do our write-ahead log (kv journal), then do
the overwrite async (vs 4+ before).

   - No concern about mtime getting in the way

   - Faster reads (no fs lookup)

   - Similarly sized metadata for most objects.  If we assume most objects
are not fragmented, then the metadata to store the block offsets is about
the same size as the metadata to store the filenames we have now.

Problems:

   - We have to size the kv backend storage (probably still an XFS
partition) vs the block storage.  Maybe we do this anyway (put metadata on
SSD!) so it won't matter.  But what happens when we are storing gobs of
rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
a different pool and those 

Re: newstore direction

2015-10-21 Thread Orit Wasserman
On Tue, 2015-10-20 at 14:31 -0400, Ric Wheeler wrote:
> On 10/19/2015 03:49 PM, Sage Weil wrote:
> > The current design is based on two simple ideas:
> >
> >   1) a key/value interface is better way to manage all of our internal
> > metadata (object metadata, attrs, layout, collection membership,
> > write-ahead logging, overlay data, etc.)
> >
> >   2) a file system is well suited for storage object data (as files).
> >
> > So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
> > things:
> >
> >   - We currently write the data to the file, fsync, then commit the kv
> > transaction.  That's at least 3 IOs: one for the data, one for the fs
> > journal, one for the kv txn to commit (at least once my rocksdb changes
> > land... the kv commit is currently 2-3).  So two people are managing
> > metadata, here: the fs managing the file metadata (with its own
> > journal) and the kv backend (with its journal).
> 
> If all of the fsync()'s fall into the same backing file system, are you sure
> that each fsync() takes the same time? Depending on the local FS implementation
> of course, but the order of issuing those fsync()'s can effectively make some
> of them no-ops.
> 
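
As a concrete (hedged) illustration of the current count -- data write, file fsync, then the kv commit -- here is a sketch with a plain file standing in for the kv backend's journal; the paths and WAL format are made up.

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main() {
  const char* data = "object payload";

  // IO #1: the object data goes into its own file.
  int obj_fd = open("obj_foo.data", O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (obj_fd < 0) { perror("open object"); return 1; }
  if (write(obj_fd, data, strlen(data)) < 0) perror("write object");

  // IO #2 (plus whatever the fs journal does): the file must be durable
  // before the kv transaction that references it commits.
  fsync(obj_fd);

  // IO #3: the kv commit -- modelled here as an append + fsync on a plain
  // file standing in for the kv backend's journal.
  int wal_fd = open("kv.wal", O_WRONLY | O_CREAT | O_APPEND, 0644);
  if (wal_fd < 0) { perror("open wal"); return 1; }
  const char* rec = "txn: foo -> obj_foo.data\n";
  if (write(wal_fd, rec, strlen(rec)) < 0) perror("write wal");
  // Ric's point: when both files live on the same file system, the relative
  // order of these fsync()s decides how much journal work the second one
  // still has to do.
  fsync(wal_fd);

  close(obj_fd);
  close(wal_fd);
  return 0;
}
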
> >
> >   - On read we have to open files by name, which means traversing the fs
> > namespace.  Newstore tries to keep it as flat and simple as possible, but
> > at a minimum it is a couple btree lookups.  We'd love to use open by
> > handle (which would reduce this to 1 btree traversal), but running
> > the daemon as ceph and not root makes that hard...
> 
> This seems like a pretty low hurdle to overcome.
> 
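
For reference, the open-by-handle path being referred to looks roughly like this (Linux name_to_handle_at()/open_by_handle_at(); the paths are made up). The second call needs CAP_DAC_READ_SEARCH, which is exactly why running the daemon as a non-root "ceph" user makes it hard.

#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // for struct file_handle and MAX_HANDLE_SZ
#endif
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main() {
  // One-time (or cached) name -> handle translation: this is the namespace
  // traversal we would like to pay only once.
  struct file_handle* fh =
      (struct file_handle*)malloc(sizeof(*fh) + MAX_HANDLE_SZ);
  fh->handle_bytes = MAX_HANDLE_SZ;
  int mount_id = 0;
  if (name_to_handle_at(AT_FDCWD, "objects/foo", fh, &mount_id, 0) < 0) {
    perror("name_to_handle_at");
    return 1;
  }

  // Later reads: open directly from the handle -- roughly one btree lookup,
  // no path walk.  Fails with EPERM without CAP_DAC_READ_SEARCH.
  int mount_fd = open("objects", O_RDONLY | O_DIRECTORY);
  if (mount_fd < 0) { perror("open mount dir"); return 1; }
  int fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
  if (fd < 0)
    perror("open_by_handle_at (needs CAP_DAC_READ_SEARCH)");
  else
    close(fd);

  close(mount_fd);
  free(fh);
  return 0;
}
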
> >
> >   - ...and file systems insist on updating mtime on writes, even when it is
> > an overwrite with no allocation changes.  (We don't care about mtime.)
> > O_NOCMTIME patches exist but it is hard to get these past the kernel
> > brainfreeze.
> 
> Are you using O_DIRECT? Seems like there should be some enterprisey database 
> tricks that we can use here.
> 
> >
> >   - XFS is (probably) never going to give us data checksums, which we
> > want desperately.
> 
> What is the goal of having the file system do the checksums? How strong do
> they need to be and what size are the chunks?
> 
> If you update this on each IO, this will certainly generate more IO (each
> write will possibly generate at least one other write to update that new
> checksum).
> 
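
To make the question concrete, a sketch of what per-chunk checksums in the kv metadata might look like (the chunk size, key layout and plain bitwise crc32c are assumptions; a real implementation would use a hardware-accelerated crc32c).

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Plain bitwise CRC-32C (Castagnoli); slow, but enough for a sketch.
static uint32_t crc32c(const uint8_t* p, size_t n) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < n; ++i) {
    crc ^= p[i];
    for (int k = 0; k < 8; ++k)
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
  }
  return crc ^ 0xFFFFFFFFu;
}

constexpr size_t CHUNK = 4096;          // assumed checksum granularity

// Every overwrite of a chunk also dirties its entry here, i.e. one more small
// kv update rides along with the data write -- the extra IO in question.
void update_checksums(std::map<std::string, uint32_t>& kv_meta,
                      const std::string& oid, uint64_t offset,
                      const std::vector<uint8_t>& data) {
  for (size_t off = 0; off < data.size(); off += CHUNK) {
    size_t len = std::min(CHUNK, data.size() - off);
    uint64_t chunk_no = (offset + off) / CHUNK;   // assumes aligned writes
    kv_meta["csum/" + oid + "/" + std::to_string(chunk_no)] =
        crc32c(data.data() + off, len);
  }
}
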
> >
> > But what's the alternative?  My thought is to just bite the bullet and
> > consume a raw block device directly.  Write an allocator, hopefully keep
> > it pretty simple, and manage it in kv store along with all of our other
> > metadata.
> 
> The big problem with consuming block devices directly is that you ultimately
> end up recreating most of the features that you had in the file system. Even
> enterprise databases like Oracle and DB2 have been migrating away from running
> on raw block devices in favor of file systems over time.  In effect, you are
> looking at making a simple on-disk file system, which is always easier to start
> than it is to get to a stable, production-ready state.

The best performance is still on a raw block device (SAN).
For a database, a file system simplifies operational tasks, which is worth
the performance penalty; I think that in a storage system this is not the
case. In many cases databases can use their own file system that is
tailored to their needs.

> I think that it might be quicker and more maintainable to spend some time 
> working with the local file system people (XFS or other) to see if we can 
> jointly address the concerns you have.
> >
> > Wins:
> >
> >   - 2 IOs for most: one to write the data to unused space in the block
> > device, one to commit our transaction (vs 4+ before).  For overwrites,
> > we'd have one io to do our write-ahead log (kv journal), then do
> > the overwrite async (vs 4+ before).
> >
> >   - No concern about mtime getting in the way
> >
> >   - Faster reads (no fs lookup)
> >
> >   - Similarly sized metadata for most objects.  If we assume most objects
> > are not fragmented, then the metadata to store the block offsets is about
> > the same size as the metadata to store the filenames we have now.
> >
> > Problems:
> >
> >   - We have to size the kv backend storage (probably still an XFS
> > partition) vs the block storage.  Maybe we do this anyway (put metadata on
> > SSD!) so it won't matter.  But what happens when we are storing gobs of
> > rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
> > a different pool and those aren't currently fungible.
> >
> >   - We have to write and maintain an allocator.  I'm still optimistic this
> > can be reasonably simple, especially for the flash case (where
> > fragmentation isn't such an issue as long as our blocks are reasonably
> > sized).  For disk we may need to be moderately clever.  (See the sketch below.)
> >
> >   - We'll need a fsck to ensure our internal metadata is consistent.  The
> > good news is it'll just need to validate 
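
On the allocator point above, a deliberately minimal sketch: first-fit over a free-extent map, no persistence, no alignment policy -- just to give a feel for how small the simple case could be. This is not the design NewStore would use.

#include <cstdint>
#include <map>
#include <optional>

class SimpleAllocator {
  std::map<uint64_t, uint64_t> free_;   // offset -> length of each free extent

public:
  explicit SimpleAllocator(uint64_t device_size) { free_[0] = device_size; }

  // First-fit: fine for flash, where fragmentation matters less as long as
  // blocks stay reasonably sized; disk would want something smarter.
  std::optional<uint64_t> allocate(uint64_t len) {
    for (auto it = free_.begin(); it != free_.end(); ++it) {
      if (it->second < len)
        continue;
      uint64_t off = it->first;
      uint64_t remainder = it->second - len;
      free_.erase(it);
      if (remainder)
        free_[off + len] = remainder;
      return off;
    }
    return std::nullopt;                // out of space
  }

  // Free and merge with the following free extent (merging with the previous
  // one is omitted for brevity).
  void release(uint64_t off, uint64_t len) {
    auto next = free_.find(off + len);
    if (next != free_.end()) {
      len += next->second;
      free_.erase(next);
    }
    free_[off] = len;
  }
};

For a sketch like this, an fsck would mostly be a matter of rebuilding the free map from the extent metadata in the kv store and checking for overlaps.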

what does ms_objecter do in OSD ?

2015-10-21 Thread Jaze Lee
Hello,
   I find that this messenger does not bind to any IP, so I do not know why we have it.
   Does anyone know what ms_objecter does? Thanks a lot.

-- 
谦谦君子 (a modest gentleman)