Re: request_queue use-after-free - inode_detach_wb()
Hello, Ilya.

On Wed, Nov 18, 2015 at 04:12:07PM +0100, Ilya Dryomov wrote:
> > It's stinky that the bdi is going away while the inode is still there.
> >
> > Yeah, blkdev inodes are special and created early but I think it makes
> > sense to keep the underlying structures (queue and bdi) around while
> > bdev is associated with it.  Would simply moving put_disk() after
> > bdput() work?
>
> I'd think so.  struct block_device is essentially a "block device"
> pseudo-filesystem inode, and as such, may not be around during the
> entire lifetime of gendisk / queue.  It may be kicked out of the inode
> cache as soon as the device is closed, so it makes sense to put it
> before putting gendisk / queue, which will outlive it.
>
> However, I'm confused by this comment
>
>     /*
>      * ->release can cause the queue to disappear, so flush all
>      * dirty data before.
>      */
>     bdev_write_inode(bdev);
>
> It's not true, at least since your 523e1d399ce0 ("block: make gendisk
> hold a reference to its queue"), right?  (It used to say "->release can
> cause the old bdi to disappear, so must switch it out first" and was
> changed by Christoph in the middle of his backing_dev_info series.)

Right, it started with each layer going away separately, which tends to
get tricky with hotunplug, and we've been gradually moving towards a
model where the entire stack stays till the last ref is gone, so yeah
the comment isn't true anymore.

Thanks.

--
tejun
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
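[Editor's note: the teardown ordering being discussed, drop the upper layer's reference first so the lower layers outlive it, can be modeled with a toy refcount sketch. This is illustrative Python only, not kernel code; the names `bdev`, `gendisk`, and `queue` just label the layers.]

```python
class Ref:
    """Toy refcount: when the count hits zero the object is freed and
    it drops the reference it holds on the layer below it."""
    def __init__(self, name, freed, on_zero=None):
        self.name, self.count, self.freed, self.on_zero = name, 1, freed, on_zero

    def put(self):
        self.count -= 1
        if self.count == 0:
            self.freed.append(self.name)
            if self.on_zero:
                self.on_zero()

freed = []
queue = Ref("queue", freed)
# since 523e1d399ce0, gendisk holds a reference on its queue
disk = Ref("gendisk", freed, on_zero=queue.put)
# and the bdev (block_device inode) effectively pins the gendisk
bdev = Ref("bdev", freed, on_zero=disk.put)

# dropping the bdev ref frees the upper layer first; gendisk and the
# queue outlive it, so nothing under the inode disappears early
bdev.put()
assert freed == ["bdev", "gendisk", "queue"]
```

The assertion encodes "the entire stack stays till the last ref is gone": each layer is only freed after everything above it.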
Re: request_queue use-after-free - inode_detach_wb()
Hello, Ilya.

On Wed, Nov 18, 2015 at 04:48:06PM +0100, Ilya Dryomov wrote:
> Just to be clear, the bdi/wb vs inode lifetime rules are that inodes
> should always be within bdi/wb?  There's been a lot of churn in this

Yes, that's where *I* think we should be headed.  Stuff in lower layers
should stick around while upper layer things are around.

> and related areas recently, including in block drivers: 6cd18e711dd8
> ("block: destroy bdi before blockdev is unregistered"), b02176f30cd3
> ("block: don't release bdi while request_queue has live references"),
> so I want to fully get my head around this.

End-of-life issues have always been a bit of a mess in the block layer.
Thanks a lot for working on this.

--
tejun
Re: problem about pgmeta object?
On Wed, 18 Nov 2015, Ning Yao wrote:
> Hi, Sage
>
> The pgmeta object is a meta-object (like __head___2) without
> significant information.  It is created in PG::_init() when handling
> pg_create and split_coll, and always exists during the pg's life cycle
> until the pg is removed in RemoveWQ.  The real pgmeta content is
> stored in omap.  Could we just treat the pgmeta object as a logical
> object and not present it physically, so that we can avoid the
> recursive object-path lookup by checking if (hoid.pgmeta())?
> Such as:
>
> int FileStore::_omap_setkeys(coll_t cid, const ghobject_t &hoid,
>                              const map<string, bufferlist> &aset,
>                              const SequencerPosition &spos) {
>   dout(15) << __func__ << " " << cid << "/" << hoid << dendl;
>   Index index;
>   int r;
>   if (hoid.pgmeta())
>     goto out;
>   ***
>   ***
>   ***
> out:
>   r = object_map->set_keys(hoid, aset, &spos);
>   dout(20) << __func__ << " " << cid << "/" << hoid << " = " << r << dendl;
>   return r;
> }

This seems like a reasonable hack.  We never store any byte data in it.
And if/when that changes, we can change this at the same time.

sage
Re: [PATCH 2/3] net/ceph: do not define list_entry_next
On Wed, Nov 18, 2015 at 1:13 PM, Sergey Senozhatsky wrote:
> Cosmetic.
>
> Do not define list_entry_next() and use list_next_entry()
> from list.h.
>
> Signed-off-by: Sergey Senozhatsky
> ---
>  net/ceph/messenger.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index 9981039..77ccca9 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -10,6 +10,7 @@
>  #include
>  #include
>  #include
> +#include <linux/list.h>
>  #ifdef CONFIG_BLOCK
>  #include
>  #endif /* CONFIG_BLOCK */
> @@ -23,9 +24,6 @@
>  #include
>  #include
>
> -#define list_entry_next(pos, member) \
> -	list_entry(pos->member.next, typeof(*pos), member)
> -
>  /*
>   * Ceph uses the messenger to exchange ceph_msg messages with other
>   * hosts in the system.  The messenger provides ordered and reliable
> @@ -1042,7 +1040,7 @@ static bool ceph_msg_data_pagelist_advance(struct ceph_msg_data_cursor *cursor,
>  	/* Move on to the next page */
>
>  	BUG_ON(list_is_last(&cursor->page->lru, &pagelist->head));
> -	cursor->page = list_entry_next(cursor->page, lru);
> +	cursor->page = list_next_entry(cursor->page, lru);
>  	cursor->last_piece = cursor->resid <= PAGE_SIZE;
>
>  	return true;
> @@ -1166,7 +1164,7 @@ static bool ceph_msg_data_advance(struct ceph_msg_data_cursor *cursor,
>  	if (!cursor->resid && cursor->total_resid) {
>  		WARN_ON(!cursor->last_piece);
>  		BUG_ON(list_is_last(&cursor->data->links, cursor->data_head));
> -		cursor->data = list_entry_next(cursor->data, links);
> +		cursor->data = list_next_entry(cursor->data, links);
>  		__ceph_msg_data_cursor_init(cursor);
>  		new_piece = true;
>  	}

Someone beat you to it ;)

https://github.com/ceph/ceph-client/commit/76b4a27faebb369c1c50df01ef08b614a2854fc5

Thanks,

                Ilya
[PATCH 2/3] net/ceph: do not define list_entry_next
Cosmetic.

Do not define list_entry_next() and use list_next_entry()
from list.h.

Signed-off-by: Sergey Senozhatsky
---
 net/ceph/messenger.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 9981039..77ccca9 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -10,6 +10,7 @@
 #include
 #include
 #include
+#include <linux/list.h>
 #ifdef CONFIG_BLOCK
 #include
 #endif /* CONFIG_BLOCK */
@@ -23,9 +24,6 @@
 #include
 #include

-#define list_entry_next(pos, member) \
-	list_entry(pos->member.next, typeof(*pos), member)
-
 /*
  * Ceph uses the messenger to exchange ceph_msg messages with other
  * hosts in the system.  The messenger provides ordered and reliable
@@ -1042,7 +1040,7 @@ static bool ceph_msg_data_pagelist_advance(struct ceph_msg_data_cursor *cursor,
 	/* Move on to the next page */

 	BUG_ON(list_is_last(&cursor->page->lru, &pagelist->head));
-	cursor->page = list_entry_next(cursor->page, lru);
+	cursor->page = list_next_entry(cursor->page, lru);
 	cursor->last_piece = cursor->resid <= PAGE_SIZE;

 	return true;
@@ -1166,7 +1164,7 @@ static bool ceph_msg_data_advance(struct ceph_msg_data_cursor *cursor,
 	if (!cursor->resid && cursor->total_resid) {
 		WARN_ON(!cursor->last_piece);
 		BUG_ON(list_is_last(&cursor->data->links, cursor->data_head));
-		cursor->data = list_entry_next(cursor->data, links);
+		cursor->data = list_next_entry(cursor->data, links);
 		__ceph_msg_data_cursor_init(cursor);
 		new_piece = true;
 	}
--
2.6.2.280.g74301d6
Re: Crc32 Challenge
Hi,

I checked the partial crc after each iteration in google's python
implementation and found that the crc of the last iteration matches
ceph's [1]:

>>> from crc32c import crc
>>> crc('foo bar baz')
crc 1197962378
crc 3599162226
crc 2946501991
crc 2501826906
crc 3132034983
crc 3851841059
crc 2745946046
crc 1047783679
crc 767476524
crc 4269731756
crc 4119623852

After finalize (crc ^ 0xffffffff) the crc becomes: 175343443L

So it looks like ceph_crc32c just isn't doing that final XOR step:

>>> 4119623852 ^ 0xffffffff
175343443

Cheers, Dan

[1] From test_crc32c.cc:
  const char *a = "foo bar baz";
  ASSERT_EQ(4119623852u, ceph_crc32c(0, (unsigned char *)a, strlen(a)));

On Tue, Nov 17, 2015 at 5:51 PM, chris holcombe wrote:
> Hello Ceph Devs,
>
> I'm almost certain at this point that I have discovered a major bug in
> ceph's crc32c mechanism: http://tracker.ceph.com/issues/13713.  I'm
> totally open to being proven wrong and that's what this email is about.
> Can someone out there write a piece of code using an outside library
> that produces the same crc32c checksums that Ceph does?  If they can,
> I'll close my bug and stand corrected :).  I've tried 3 python
> libraries and 1 rust library so far and my conclusions are 1) they are
> all in agreement and 2) they all produce different checksums than
> ceph's:
> https://github.com/ceph/ceph/blob/83e10f7e2df0a71bd59e6ef2aa06b52b186fddaa/src/test/common/test_crc32c.cc#L21
>
> Start small and see if you can verify the "foo bar baz" checksum and
> then try some of the others.
>
> For a known good checksum to test your program against, use this:
> http://www.pdl.cmu.edu/mailinglists/ips/mail/msg04970.html  In there
> Mark Bakke says a 32-byte array of all 00h should produce a checksum of
> 8A9136AA.  Printing that with python in decimal: 2324772522
>
> The implications of this are unfortunately tricky.  If I'm right and we
> fix ceph's algorithm then it won't be able to talk to any previous
> version of ceph past the beginning protocol handshake.  There would
> have to be a mechanism introduced so that any x and older version would
> speak the previous crc and anything y and newer would speak the new
> version.  Another option is we could break ceph's crc code out into a
> library, make that available to everyone, and call it ceph-crc32c.
>
> Thanks!
> Chris
Wiping object content on removal
Hi Cephers,

Does Ceph have the ability to wipe object content on removal?  Surely
one can do that manually from the client, but I think that's inefficient
and not 100% secure.  If not, what about adding such a feature to Ceph?
I can start working on that.

Thanks,
Igor.
Re: [PATCH 2/3] net/ceph: do not define list_entry_next
On (11/18/15 14:13), Ilya Dryomov wrote:
[..]
>
> Someone beat you to it ;)
>
> https://github.com/ceph/ceph-client/commit/76b4a27faebb369c1c50df01ef08b614a2854fc5

Oh, OK then :)  Thanks!

-ss
11/18/2015 Weekly Ceph Performance Meeting IS ON!
8AM PST as usual!  I won't be there as I'm out giving a talk at SC15.
Sage will be there though, so go talk to him about newstore-block!

Here's the links:

Etherpad URL:
http://pad.ceph.com/p/performance_weekly

To join the Meeting:
https://bluejeans.com/268261044

To join via Browser:
https://bluejeans.com/268261044/browser

To join with Lync:
https://bluejeans.com/268261044/lync

To join via Room System:
Video Conferencing System: bjn.vc -or- 199.48.152.152
Meeting ID: 268261044

To join via Phone:
1) Dial:
   +1 408 740 7256
   +1 888 240 2560 (US Toll Free)
   +1 408 317 9253 (Alternate Number)
   (see all numbers - http://bluejeans.com/numbers)
2) Enter Conference ID: 268261044

Mark
Regarding op_t, local_t
Hi Sage,

I saw we now have a single transaction in submit_transaction.  But in
the replication path we still have two transactions - can't we merge
them into one there?

Thanks & Regards
Somnath
Re: Regarding op_t, local_t
Good catch.  I think it does make sense.

2015-11-19 9:54 GMT+08:00 Somnath Roy:
> Hi Sage,
> I saw we are now having single transaction in submit_transaction. But,
> in the replication path we are still having two transaction, can't we merge
> it to one there ?
>
> Thanks & Regards
> Somnath

--
Regards,
xinze
RE: Regarding op_t, local_t
It's a stack allocation there, so no 'new' overhead, but it will save
one list push_back, so the gain could be negligible.
But, since I am working on a pull request to delete these transaction
objects from the FileStore worker thread, I need to introduce heap
allocation on that path as well, and reducing one transaction would
help there.

Thanks & Regards
Somnath

-----Original Message-----
From: 池信泽 [mailto:xmdx...@gmail.com]
Sent: Wednesday, November 18, 2015 6:00 PM
To: Somnath Roy
Cc: ceph-devel@vger.kernel.org
Subject: Re: Regarding op_t, local_t

Good catch. I think it does make sense.

2015-11-19 9:54 GMT+08:00 Somnath Roy:
> Hi Sage,
> I saw we are now having single transaction in submit_transaction. But,
> in the replication path we are still having two transaction, can't we merge
> it to one there ?
>
> Thanks & Regards
> Somnath
Re: Regarding op_t, local_t
hi, Roy:

My previous performance data shows that if we keep them separate and
pass them to ObjectStore::queue_transactions(), the cpu usage of
ObjectStore::queue_transactions() goes up from 6.03% to 6.76% compared
with re-using the op_t items.  I would be happy if we could get similar
performance data.

2015-11-19 10:09 GMT+08:00 Somnath Roy:
> It's a stack allocation there , no 'new' overhead , but, will save one
> list push_back. So, gain could be negligible.
> But, since I am working on a pull request to delete these transaction
> object from FileStore worker thread , I need to introduce heap
> allocation on that path as well , reducing one transaction would help
> there.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: 池信泽 [mailto:xmdx...@gmail.com]
> Sent: Wednesday, November 18, 2015 6:00 PM
> To: Somnath Roy
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: Regarding op_t, local_t
>
> Good catch. I think it does make sense.
>
> 2015-11-19 9:54 GMT+08:00 Somnath Roy :
>> Hi Sage,
>> I saw we are now having single transaction in submit_transaction. But,
>> in the replication path we are still having two transaction, can't we merge
>> it to one there ?
>>
>> Thanks & Regards
>> Somnath

--
Regards,
xinze
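[Editor's note: the merge being discussed here is over C++ ObjectStore::Transaction objects; the toy Python sketch below uses hypothetical names and only illustrates why the merge is semantically safe: appending local_t's ops onto op_t preserves the submitted op stream while allocating and queueing a single transaction.]

```python
class Transaction:
    """Toy stand-in for ObjectStore::Transaction: just an op list."""
    def __init__(self):
        self.ops = []

    def touch(self, obj):
        self.ops.append(("touch", obj))

    def append(self, other):
        # fold another transaction's ops onto the end of this one
        self.ops.extend(other.ops)

def queue_transactions(tls):
    # stand-in for queue_transactions(): flatten the list and submit
    return [op for t in tls for op in t.ops]

# two-transaction style (current replication path): op_t plus local_t
op_t, local_t = Transaction(), Transaction()
op_t.touch("obj")
local_t.touch("pglog")
two = queue_transactions([op_t, local_t])

# merged style: one transaction object, one queue entry
merged = Transaction()
merged.touch("obj")
merged.append(local_t)
one = queue_transactions([merged])

# same op stream either way, so ordering semantics are preserved
assert one == two == [("touch", "obj"), ("touch", "pglog")]
```

The saving is the second Transaction object (and its push_back), which matters more once the objects move to heap allocation as Somnath describes.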
Re: RGW multi-tenancy APIs overview
On Mon, 9 Nov 2015 21:36:47 -0800 Yehuda Sadeh-Weinraub wrote:
> In the supported domains configuration, we can specify for each domain
> whether a subdomain of it would be a bucket (as it is now), or whether
> it would be a tenant (which implies the possibility of bucket.tenant).
> This only affects the global (a.k.a. the "empty") tenant.
>
> E.g., we can have two domains:
>
>   legacy-foo.com
>   new-foo.com
>
> We'd specify that legacy-foo.com is a global tenant endpoint.  In
> which case, when accessing buck.legacy-foo.com, it will access the
> global tenant, and bucket=buck.
> Whereas new-foo.com isn't a global tenant endpoint, in which case, if
> we'd access buck.new-foo.com, it will mean that we accessed the 'buck'
> tenant.

I think I found another issue with this.  Suppose we want a client
authenticated under an explicit tenant to access a legacy bucket.  The
only way for that to work is to use a different endpoint (in your
example above it's legacy-foo.com).  The client cannot use the
buck.new-foo.com syntax, as mentioned.  So there's a certain asymmetry
built into the system.

Oddly enough, the X-amz-copy-source: syntax always includes the bucket,
and the tenant:bucket syntax is recognized there, so miraculously we're
good there.

-- Pete
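[Editor's note: the endpoint rules Yehuda describes can be sketched as a small hostname resolver. The config dict, function name, and path-style handling below are hypothetical and purely illustrative, not RGW's actual code; they just encode the mapping discussed in the thread, including the tenant:bucket form.]

```python
# endpoint -> True means "subdomains are buckets in the global tenant",
# False means "subdomains name a tenant" (hypothetical config shape)
ENDPOINTS = {
    "legacy-foo.com": True,   # global-tenant endpoint
    "new-foo.com": False,     # subdomain names a tenant
}

def resolve(host: str, path: str = "/"):
    """Return (tenant, bucket) for a Host: header plus request path."""
    for domain, global_tenant in ENDPOINTS.items():
        if host == domain:
            # path-style request: first segment is the bucket,
            # optionally written as "tenant:bucket"
            seg = path.lstrip("/").split("/", 1)[0]
            tenant, _, bucket = seg.rpartition(":")
            return tenant, bucket
        if host.endswith("." + domain):
            sub = host[: -len(domain) - 1]
            if global_tenant:
                return "", sub  # buck.legacy-foo.com -> global tenant
            # otherwise the subdomain is the tenant; bucket from path
            return sub, path.lstrip("/").split("/", 1)[0]
    raise ValueError("unknown endpoint")

assert resolve("buck.legacy-foo.com") == ("", "buck")
assert resolve("buck.new-foo.com", "/b1") == ("buck", "b1")
assert resolve("legacy-foo.com", "/tnt:b1") == ("tnt", "b1")
```

The asymmetry Pete points out falls out of the sketch: there is no host under new-foo.com that resolves to the global ("") tenant, so a legacy bucket is reachable only through the legacy-foo.com endpoint.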