Re: request_queue use-after-free - inode_detach_wb()

2015-11-18 Thread Tejun Heo
Hello, Ilya.

On Wed, Nov 18, 2015 at 04:12:07PM +0100, Ilya Dryomov wrote:
> > It's stinky that the bdi is going away while the inode is still there.
> > Yeah, blkdev inodes are special and created early but I think it makes
> > sense to keep the underlying structures (queue and bdi) around while
> > bdev is associated with it.  Would simply moving put_disk() after
> > bdput() work?
> 
> I'd think so.  struct block_device is essentially a "block device"
> pseudo-filesystem inode, and as such, may not be around during the
> entire lifetime of gendisk / queue.  It may be kicked out of the inode
> cache as soon as the device is closed, so it makes sense to put it
> before putting gendisk / queue, which will outlive it.
> 
> However, I'm confused by this comment
> 
> /*
>  * ->release can cause the queue to disappear, so flush all
>  * dirty data before.
>  */
> bdev_write_inode(bdev);
> 
> It's not true, at least since your 523e1d399ce0 ("block: make gendisk
> hold a reference to its queue"), right?  (It used to say "->release can
> cause the old bdi to disappear, so must switch it out first" and was
> changed by Christoph in the middle of his backing_dev_info series.)

Right, it started with each layer going away separately, which tends
to get tricky with hotunplug. We've been gradually moving towards a
model where the entire stack stays around until the last ref is gone,
so yeah, the comment isn't true anymore.

Thanks.

-- 
tejun


Re: request_queue use-after-free - inode_detach_wb()

2015-11-18 Thread Tejun Heo
Hello, Ilya.

On Wed, Nov 18, 2015 at 04:48:06PM +0100, Ilya Dryomov wrote:
> Just to be clear, the bdi/wb vs inode lifetime rules are that inodes
> should always be within bdi/wb?  There's been a lot of churn in this

Yes, that's where *I* think we should be headed.  Stuff in lower
layers should stick around while upper layer things are around.

> and related areas recently, including in block drivers: 6cd18e711dd8
> ("block: destroy bdi before blockdev is unregistered"), b02176f30cd3
> ("block: don't release bdi while request_queue has live references"),
> so I want to fully get my head around this.

The end-of-life issue has always been a bit of a mess in the block layer.
Thanks a lot for working on this.

-- 
tejun


Re: problem about pgmeta object?

2015-11-18 Thread Sage Weil
On Wed, 18 Nov 2015, Ning Yao wrote:
> Hi, Sage
> 
> The pgmeta object is a meta object (like __head___2) without
> significant content. It is created in PG::_init() when handling
> pg_create and split_coll, and it always exists during the PG's life
> cycle until the PG is removed in RemoveWQ. The real pgmeta data is
> stored in omap. Could we treat the pgmeta object as a purely logical
> object and not represent it physically, so that we can skip the
> recursive object-path lookup with an if (hoid.pgmeta()) check?
> such as:
> int FileStore::_omap_setkeys(coll_t cid, const ghobject_t &hoid,
>  const map<string, bufferlist> &aset,
>  const SequencerPosition &spos) {
>   dout(15) << __func__ << " " << cid << "/" << hoid << dendl;
>   Index index;
>   int r;
>   if (hoid.pgmeta())
> goto out;
> ***
> ***
> ***
> out:
>   r = object_map->set_keys(hoid, aset, &spos);
>   dout(20) << __func__ << " " << cid << "/" << hoid << " = " << r << dendl;
>   return r;
> }

This seems like a reasonable hack. We never store any byte data in it.  
And if/when that changes, we can change this at the same time.

sage


Re: [PATCH 2/3] net/ceph: do not define list_entry_next

2015-11-18 Thread Ilya Dryomov
On Wed, Nov 18, 2015 at 1:13 PM, Sergey Senozhatsky
 wrote:
> Cosmetic.
>
> Do not define list_entry_next() and use list_next_entry()
> from list.h.
>
> Signed-off-by: Sergey Senozhatsky 
> 
> [..]

Someone beat you to it ;)

https://github.com/ceph/ceph-client/commit/76b4a27faebb369c1c50df01ef08b614a2854fc5

Thanks,

Ilya


[PATCH 2/3] net/ceph: do not define list_entry_next

2015-11-18 Thread Sergey Senozhatsky
Cosmetic.

Do not define list_entry_next() and use list_next_entry()
from list.h.

Signed-off-by: Sergey Senozhatsky 
---
 net/ceph/messenger.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 9981039..77ccca9 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -10,6 +10,7 @@
 #include <linux/slab.h>
 #include <linux/socket.h>
 #include <linux/string.h>
+#include <linux/list.h>
 #ifdef CONFIG_BLOCK
 #include <linux/blkdev.h>
 #endif /* CONFIG_BLOCK */
@@ -23,9 +24,6 @@
 #include <linux/ceph/pagelist.h>
 #include <linux/ceph/msgpool.h>
 
-#define list_entry_next(pos, member)                           \
-       list_entry(pos->member.next, typeof(*pos), member)
-
 /*
  * Ceph uses the messenger to exchange ceph_msg messages with other
  * hosts in the system.  The messenger provides ordered and reliable
@@ -1042,7 +1040,7 @@ static bool ceph_msg_data_pagelist_advance(struct ceph_msg_data_cursor *cursor,
        /* Move on to the next page */
 
        BUG_ON(list_is_last(&cursor->page->lru, &pagelist->head));
-       cursor->page = list_entry_next(cursor->page, lru);
+       cursor->page = list_next_entry(cursor->page, lru);
        cursor->last_piece = cursor->resid <= PAGE_SIZE;
 
        return true;
@@ -1166,7 +1164,7 @@ static bool ceph_msg_data_advance(struct ceph_msg_data_cursor *cursor,
        if (!cursor->resid && cursor->total_resid) {
                WARN_ON(!cursor->last_piece);
                BUG_ON(list_is_last(&cursor->data->links, cursor->data_head));
-               cursor->data = list_entry_next(cursor->data, links);
+               cursor->data = list_next_entry(cursor->data, links);
                __ceph_msg_data_cursor_init(cursor);
                new_piece = true;
        }
-- 
2.6.2.280.g74301d6



Re: Crc32 Challenge

2015-11-18 Thread Dan van der Ster
Hi,

I checked the partial crc after each iteration in google's python
implementation and found that the crc of the last iteration matches
ceph's [1]:

>>> from crc32c import crc
>>> crc('foo bar baz')
crc 1197962378
crc 3599162226
crc 2946501991
crc 2501826906
crc 3132034983
crc 3851841059
crc 2745946046
crc 1047783679
crc 767476524
crc 4269731756
crc 4119623852

After finalize (crc ^ 0xffffffff) the crc becomes:

175343443L

So it looks like ceph_crc32c just isn't doing that final XOR step.

>>> 4119623852 ^ 0xffffffff
175343443

Cheers, Dan

[1] From test_crc32c.cc

const char *a = "foo bar baz";
ASSERT_EQ(4119623852u, ceph_crc32c(0, (unsigned char *)a, strlen(a)));
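
If anyone wants to reproduce this end to end, here's a minimal sketch
(assuming the third-party "crc32c" pip package, which computes the
standard finalized CRC-32C; the package choice is my assumption, any
standard implementation should do):

import crc32c  # pip install crc32c (assumed third-party package)

# Standard (finalized) CRC-32C of the test string from test_crc32c.cc.
standard = crc32c.crc32c(b"foo bar baz")
print(standard)               # 175343443
print(standard ^ 0xffffffff)  # 4119623852, what ceph_crc32c(0, ...) returns

# The known-good iSCSI vector Chris mentioned: 32 bytes of 0x00 -> 8A9136AA.
assert crc32c.crc32c(b"\x00" * 32) == 0x8A9136AA == 2324772522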

On Tue, Nov 17, 2015 at 5:51 PM, chris holcombe
 wrote:
> Hello Ceph Devs,
>
> I'm almost certain at this point that I have discovered a major bug in
> ceph's crc32c mechanism.  http://tracker.ceph.com/issues/13713  I'm totally
> open to being proven wrong, and that's what this email is about.  Can someone
> out there write a piece of code using an outside library that produces the
> same crc32c checksums that Ceph does?  If they can, I'll close my bug and
> stand corrected :).  I've tried 3 python libraries and 1 rust library so far,
> and my conclusions are 1) they are all in agreement and 2) they all produce
> checksums different from ceph's:
> https://github.com/ceph/ceph/blob/83e10f7e2df0a71bd59e6ef2aa06b52b186fddaa/src/test/common/test_crc32c.cc#L21
>
> Start small and see if you can verify the "foo bar baz" checksum and then
> try some of the others.
>
> For a known good checksum to test your program against, use this:
> http://www.pdl.cmu.edu/mailinglists/ips/mail/msg04970.html  In there, Mark
> Bakke says a 32-byte array of all 00h should produce a checksum of
> 8A9136AA.  Printing that with python in decimal: 2324772522
>
> The implications of this are unfortunately tricky.  If I'm right and we fix
> ceph's algorithm, then it won't be able to talk to any previous version of
> ceph past the initial protocol handshake.  There would have to be a
> mechanism introduced so that versions x and older speak the previous crc
> and versions y and newer speak the new one.  Another option is to break
> ceph's crc code out into a library, make it available to everyone, and call
> it ceph-crc32c.
>
> Thanks!
> Chris


Wiping object content on removal

2015-11-18 Thread Igor Fedotov

Hi Cephers.

Does Ceph have the ability to wipe an object's content during its
removal? Surely one can do that manually from the client, but I think
that's inefficient and not 100% secure.

If not, what about adding such a feature to Ceph? I can start working
on that.

Thanks,
Igor.




Re: [PATCH 2/3] net/ceph: do not define list_entry_next

2015-11-18 Thread Sergey Senozhatsky
On (11/18/15 14:13), Ilya Dryomov wrote:
[..]
> 
> Someone beat you to it ;)
> 
> https://github.com/ceph/ceph-client/commit/76b4a27faebb369c1c50df01ef08b614a2854fc5

Oh, OK then :) Thanks!

-ss


11/18/2015 Weekly Ceph Performance Meeting IS ON!

2015-11-18 Thread Mark Nelson
8AM PST as usual! I won't be there as I'm out giving a talk at SC15. 
Sage will be there though, so go talk to him about newstore-block!


Here are the links:

Etherpad URL:
http://pad.ceph.com/p/performance_weekly

To join the Meeting:
https://bluejeans.com/268261044

To join via Browser:
https://bluejeans.com/268261044/browser

To join with Lync:
https://bluejeans.com/268261044/lync


To join via Room System:
Video Conferencing System: bjn.vc -or- 199.48.152.152
Meeting ID: 268261044

To join via Phone:
1) Dial:
  +1 408 740 7256
  +1 888 240 2560 (US Toll Free)
  +1 408 317 9253 (Alternate Number)
  (see all numbers - http://bluejeans.com/numbers)
2) Enter Conference ID: 268261044

Mark


Regarding op_t, local_t

2015-11-18 Thread Somnath Roy
Hi Sage,
I saw we now have a single transaction in submit_transaction. But in
the replication path we still have two transactions; can't we merge
them into one there?

Thanks & Regards
Somnath


Re: Regarding op_t, local_t

2015-11-18 Thread 池信泽
Good catch. I think it does make sense.

2015-11-19 9:54 GMT+08:00 Somnath Roy :
> Hi Sage,
> I saw we now have a single transaction in submit_transaction. But in
> the replication path we still have two transactions; can't we merge
> them into one there?
>
> Thanks & Regards
> Somnath



-- 
Regards,
xinze


RE: Regarding op_t, local_t

2015-11-18 Thread Somnath Roy
It's a stack allocation there, so there's no 'new' overhead, but it will
save one list push_back, so the gain could be negligible.
But since I am working on a pull request to remove these transaction
objects from the FileStore worker thread, I will need to introduce heap
allocation on that path as well, and reducing one transaction would help
there.

Thanks & Regards
Somnath

-Original Message-
From: 池信泽 [mailto:xmdx...@gmail.com] 
Sent: Wednesday, November 18, 2015 6:00 PM
To: Somnath Roy
Cc: ceph-devel@vger.kernel.org
Subject: Re: Regarding op_t, local_t

Good catch. I think it does make sense.

2015-11-19 9:54 GMT+08:00 Somnath Roy :
> Hi Sage,
> I saw we now have a single transaction in submit_transaction. But in
> the replication path we still have two transactions; can't we merge
> them into one there?
>
> Thanks & Regards
> Somnath



--
Regards,
xinze


Re: Regarding op_t, local_t

2015-11-18 Thread 池信泽
Hi, Roy:
My previous performance data shows that:

If we keep them separate and pass them to
ObjectStore::queue_transactions(), the cpu usage of
ObjectStore::queue_transactions() goes up from 6.03% to 6.76%
compared with reusing the op_t items.

I would be happy if we could get similar performance data.

2015-11-19 10:09 GMT+08:00 Somnath Roy :
> It's a stack allocation there, so there's no 'new' overhead, but it will
> save one list push_back, so the gain could be negligible.
> But since I am working on a pull request to remove these transaction
> objects from the FileStore worker thread, I will need to introduce heap
> allocation on that path as well, and reducing one transaction would help
> there.
>
> Thanks & Regards
> Somnath
>
> -Original Message-
> From: 池信泽 [mailto:xmdx...@gmail.com]
> Sent: Wednesday, November 18, 2015 6:00 PM
> To: Somnath Roy
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: Regarding op_t, local_t
>
> Good catch. I think it does make sense.
>
> 2015-11-19 9:54 GMT+08:00 Somnath Roy :
>> Hi Sage,
>> I saw we now have a single transaction in submit_transaction. But in
>> the replication path we still have two transactions; can't we merge
>> them into one there?
>>
>> Thanks & Regards
>> Somnath
>
>
>
> --
> Regards,
> xinze



-- 
Regards,
xinze


Re: RGW multi-tenancy APIs overview

2015-11-18 Thread Pete Zaitcev
On Mon, 9 Nov 2015 21:36:47 -0800
Yehuda Sadeh-Weinraub  wrote:

> In the supported domains configuration, we can specify for each domain
> whether a subdomain for it would be a bucket (as it is now), or
> whether it would be a tenant (which implies the possibility of
> bucket.tenant). This only affects the global (a.k.a the "empty")
> tenant.
> 
> E.g., we can have two domains:
> 
> legacy-foo.com
> new-foo.com
> 
> We'd specify that legacy-foo.com is a global tenant endpoint. In which
> case, when accessing buck.legacy-foo.com, it will access the global
> tenant, and bucket=buck.
> Whereas, new-foo.com isn't a global tenant endpoint, in which case, if
> we'd access buck.new-foo.com, it will mean that we accessed the 'buck'
> tenant.

I think I found another issue with this. Suppose we want a client
authenticated under an explicit tenant to access a legacy bucket. The only
way for that to work is for the client to use a different endpoint (in your
example above, legacy-foo.com). It cannot use the buck.new-foo.com syntax,
as mentioned. So there's a certain asymmetry built into the system.
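
To make that asymmetry concrete, here's a toy sketch (illustration only,
not actual RGW code; the names and structure are made up) of how the
subdomain rules described above resolve "buck.<domain>" for the two
endpoint types:

# Illustration of the proposed subdomain rules (not RGW code).
GLOBAL_TENANT_ENDPOINTS = {"legacy-foo.com"}  # subdomain names a bucket
TENANT_ENDPOINTS = {"new-foo.com"}            # subdomain names a tenant

def resolve(host):
    sub, _, domain = host.partition(".")
    if domain in GLOBAL_TENANT_ENDPOINTS:
        return ("", sub)    # global tenant, bucket comes from the subdomain
    if domain in TENANT_ENDPOINTS:
        return (sub, None)  # tenant=sub, bucket must come from the path
    raise ValueError("unknown endpoint: " + host)

assert resolve("buck.legacy-foo.com") == ("", "buck")  # bucket "buck"
assert resolve("buck.new-foo.com") == ("buck", None)   # tenant "buck"

# No host under new-foo.com can name a global-tenant (legacy) bucket; a
# client has to switch endpoints to legacy-foo.com for that.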

Oddly enough, the X-amz-copy-source: syntax always includes the bucket, and
the tenant:bucket syntax is recognized there, so miraculously we're good there.

-- Pete
