RE: handling fs errors

2013-01-22 Thread Chen, Xiaoxi
Is there any known connection with the previous discussions "Hit suicide timeout after adding new osd" or "Ceph unstable on XFS"? -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: 22 January 2013 14:06 To:

Re: handling fs errors

2013-01-22 Thread Wido den Hollander
On 01/22/2013 07:12 AM, Yehuda Sadeh wrote: On Mon, Jan 21, 2013 at 10:05 PM, Sage Weil s...@inktank.com wrote: We observed an interesting situation over the weekend. The XFS volume ceph-osd locked up (hung in xfs_ilock) for somewhere between 2 and 4 minutes. After 3 minutes (180s),

Re: questions on networks and hardware

2013-01-22 Thread Wido den Hollander
On 01/22/2013 12:15 AM, John Nielsen wrote: Thanks all for your responses! Some comments inline. On Jan 20, 2013, at 10:16 AM, Wido den Hollander w...@widodh.nl wrote: On 01/19/2013 12:34 AM, John Nielsen wrote: I'm planning a Ceph deployment which will include: 10Gbit/s

Re: questions on networks and hardware

2013-01-22 Thread Jeff Mitchell
Wido den Hollander wrote: One thing is still having multiple Varnish caches and object banning. I proposed something for this some time ago, some hook in RGW you could use to inform an upstream cache to purge something from its cache. Hopefully not Varnish-specific; something like the

Ceph Bobtail Performance: IO Scheduler Comparison Article

2013-01-22 Thread Mark Nelson
Hi Guys, We've got an article up looking at performance of CFQ, Deadline, and NOOP IO schedulers with Ceph on the SAS2208. I won't claim that these results are universally applicable to other controllers and disk setups, but they might be interesting if you've been trying to determine what

RGW object purging in upstream caches

2013-01-22 Thread Wido den Hollander
Hi, (http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/12316) Hopefully not Varnish-specific; something like the Last-Modified header would be good. Also there are tricks you can do with queries; see for instance http://forum.nginx.org/read.php?2,1047,1052 It seems like a good

Re: Consistently reading/writing rados objects via command line

2013-01-22 Thread Nick Bartos
Assuming that the clone is atomic so that the client only ever grabbed a complete old or new version of the file, that method really seems ideal. How much work/time would that be? The objects will likely average around 10-20MB, but it's possible that in some cases they may grow to a few hundred

Re: Consistently reading/writing rados objects via command line

2013-01-22 Thread Nick Bartos
I had thought about doing something like that, but I'm not sure how to do it in a race-free way. For example if I was to set 'done=yes' on a file, then check that before trying to download the file, the instant I try to download the file the writer of the file could remove the xattr and start

Re: RGW object purging in upstream caches

2013-01-22 Thread Jeff Mitchell
Wido den Hollander wrote: Now, running just one Varnish instance which does load balancing over multiple RGW instances is not a real problem. When it sees a PUT operation it can purge (called banning in Varnish) the object from its cache. When looking at the scenario where you have

Re: Consistently reading/writing rados objects via command line

2013-01-22 Thread Sage Weil
On Tue, 22 Jan 2013, Nick Bartos wrote: Assuming that the clone is atomic so that the client only ever grabbed a complete old or new version of the file, that method really seems ideal. How much work/time would that be? The objects will likely average around 10-20MB, but it's possible that

Re: Consistently reading/writing rados objects via command line

2013-01-22 Thread Sage Weil
On Tue, 22 Jan 2013, Sage Weil wrote: On Tue, 22 Jan 2013, Nick Bartos wrote: Assuming that the clone is atomic so that the client only ever grabbed a complete old or new version of the file, that method really seems ideal. How much work/time would that be? The objects will likely

Re: handling fs errors

2013-01-22 Thread Gregory Farnum
On Tuesday, January 22, 2013 at 5:12 AM, Wido den Hollander wrote: On 01/22/2013 07:12 AM, Yehuda Sadeh wrote: On Mon, Jan 21, 2013 at 10:05 PM, Sage Weil s...@inktank.com (mailto:s...@inktank.com) wrote: We observed an interesting situation over the weekend. The XFS volume ceph-osd

Re: Throttle::wait use case clarification

2013-01-22 Thread Gregory Farnum
On Monday, January 21, 2013 at 5:44 AM, Loic Dachary wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 01/21/2013 12:02 AM, Gregory Farnum wrote: On Sunday, January 20, 2013 at 5:39 AM, Loic Dachary wrote: Hi, While working on unit tests for Throttle.{cc,h} I tried to

Re: handling fs errors

2013-01-22 Thread Dimitri Maziuk
On 01/22/2013 12:05 AM, Sage Weil wrote: We observed an interesting situation over the weekend. The XFS volume ceph-osd locked up (hung in xfs_ilock) for somewhere between 2 and 4 minutes. ... FWIW I see this often enough on cheap SATA drives: they have a failure mode that makes the SATA driver

Re: Inktank team @ FOSDEM 2013 ?

2013-01-22 Thread James Page
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 Hi All On 20/01/13 11:13, Constantinos Venetsanopoulos wrote: Hello Loic, Sebastien, Patrick, that's great news! I'm sure we'll have some very interesting stuff to talk about. Saturday 14:00 @ K.3.201 also seems fine. I'm attending FOSDEM as

Re: ssh passwords

2013-01-22 Thread Gregory Farnum
On Tuesday, January 22, 2013 at 10:24 AM, Gandalf Corvotempesta wrote: Hi all, I'm trying my very first ceph installation following the 5-minute quickstart: http://ceph.com/docs/master/start/quick-start/#install-debian-ubuntu just a question: why is ceph asking me for an SSH password? Is ceph

Re: ssh passwords

2013-01-22 Thread Xing Lin
If it is the command 'mkcephfs' that asked you for an SSH password, then that is probably because that script needs to push some files (e.g., ceph.conf) to other hosts. If we open that script, we can see that it uses 'scp' to send some files. If I remember correctly, for every osd at other hosts,
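A minimal sketch of avoiding those prompts by setting up key-based SSH authentication from the admin host to each node before running mkcephfs; the host names below are placeholders, not from the thread:

    # Generate a key once on the admin host, then push it to every node
    # that mkcephfs will scp ceph.conf and keyrings to.
    ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
    for host in node1 node2 node3; do   # placeholder host names
        ssh-copy-id "root@$host"
    done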

Re: Consistently reading/writing rados objects via command line

2013-01-22 Thread Nick Bartos
Thanks! Is it safe to just apply that last commit to 0.56.1? Also, is the rados command 'clonedata' instead of 'clone'? That's what it looked like in the code. On Tue, Jan 22, 2013 at 9:27 AM, Sage Weil s...@inktank.com wrote: On Tue, 22 Jan 2013, Nick Bartos wrote: Assuming that the clone

[PATCH] net/ceph/osdmap.c: fix undefined behavior when using snprintf()

2013-01-22 Thread Cong Ding
The variable str is used as both the source and destination in function snprintf(), which is undefined behavior based on C11. The original description in C11 is: If copying takes place between objects that overlap, the behavior is undefined. And, the function of

Re: ssh passwords

2013-01-22 Thread Neil Levine
Out of interest, would people prefer that the Ceph deployment script didn't try to handle server-server file copy and just did the local setup only, or is it useful that it tries to be a mini-config management tool at the same time? Neil On Tue, Jan 22, 2013 at 10:46 AM, Xing Lin

[0.48.3] OSD memory leak when scrubbing

2013-01-22 Thread Sylvain Munaut
Hi, Since putting ceph in prod, I have experienced a memory leak in the OSDs, forcing me to restart them every 5 or 6 days. Without that, the OSD process just grows indefinitely and eventually gets killed by the OOM killer. (To make sure it wasn't legitimate, I let one grow up to 4G of RSS ...). Here's for

Questions about journals, performance and disk utilization.

2013-01-22 Thread martin
Hi list, In a mixed SSD/SATA setup (5 or 8 nodes, each holding 8x SATA and 4x SSD), would it make sense to skip having journals on SSD, or is the advantage of doing so just too great? We're looking into having 2 pools, sata and ssd, and will be creating guests belonging to either of these

Re: Consistently reading/writing rados objects via command line

2013-01-22 Thread Sage Weil
On Tue, 22 Jan 2013, Nick Bartos wrote: Thanks! Is it safe to just apply that last commit to 0.56.1? Also, is the rados command 'clonedata' instead of 'clone'? That's what it looked like in the code. Yep, and yep! s On Tue, Jan 22, 2013 at 9:27 AM, Sage Weil s...@inktank.com wrote:
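A hedged sketch of the publish pattern discussed in this thread, using the rados CLI: upload the new version to a staging object, then clonedata it over the published name so readers only ever see a complete copy. The pool name is a placeholder and the clonedata argument order (source first, destination second) is an assumption, so check rados help on your build:

    POOL=mypool                                            # placeholder pool name
    rados -p "$POOL" put myobject.staging ./new-version    # write the new copy aside
    rados -p "$POOL" clonedata myobject.staging myobject   # assumed order: <src-obj> <dst-obj>
    rados -p "$POOL" get myobject ./current                # readers always fetch 'myobject'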

Re: ssh passwords

2013-01-22 Thread Xing Lin
I like the current approach. I think it is more convenient to run commands once at one host to do all the setup work. The first time I deployed a ceph cluster with 4 hosts, I thought 'service ceph start' would start the whole ceph cluster. But as it turns out, it only starts the local osd,

Re: questions on networks and hardware

2013-01-22 Thread Dan Mick
On 01/21/2013 12:19 AM, Gandalf Corvotempesta wrote: 2013/1/21 Gregory Farnum g...@inktank.com: I'm not quite sure what you mean…the cluster network and public network are really just intended as conveniences for people with multiple NICs on their box. There's nothing preventing
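For reference, a minimal ceph.conf sketch of that public/cluster split, assuming two NICs on separate subnets (the addresses are placeholders):

    [global]
        ; client-facing traffic
        public network = 192.168.1.0/24
        ; OSD replication and heartbeat traffic
        cluster network = 10.10.10.0/24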

Re: flashcache

2013-01-22 Thread Atchley, Scott
On Jan 17, 2013, at 11:19 AM, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote: 2013/1/17 Atchley, Scott atchle...@ornl.gov: 10GbE should get close to 1.2 GB/s compared to 1 GB/s for IB SDR. Latency again depends on the Ethernet driver. 10GbE faster than IB SDR? Really ?

Re: flashcache

2013-01-22 Thread Atchley, Scott
On Jan 22, 2013, at 4:06 PM, Atchley, Scott atchle...@ornl.gov wrote: On Jan 17, 2013, at 11:19 AM, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote: 2013/1/17 Atchley, Scott atchle...@ornl.gov: 10GbE should get close to 1.2 GB/s compared to 1 GB/s for IB SDR. Latency again

Re: ssh passwords

2013-01-22 Thread Dan Mick
The '-a/--allhosts' parameter is to spread the command across the cluster...that is, service ceph -a start will start across the cluster. On 01/22/2013 01:01 PM, Xing Lin wrote: I like the current approach. I think it is more convenient to run commands once at one host to do all the setup
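For example, assuming the remote hosts are listed in ceph.conf and reachable over SSH:

    service ceph start osd.0   # acts only on daemons defined for this host
    service ceph -a start      # -a/--allhosts: starts daemons on every host in ceph.conf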

Re: Questions about journals, performance and disk utilization.

2013-01-22 Thread Mark Nelson
On 01/22/2013 01:59 PM, martin wrote: Hi list, In a mixed SSD/SATA setup (5 or 8 nodes, each holding 8x SATA and 4x SSD), would it make sense to skip having journals on SSD, or is the advantage of doing so just too great? We're looking into having 2 pools, sata and ssd, and will be creating guests
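A hedged ceph.conf sketch of the SSD-journal layout being discussed; the paths and journal size are placeholders, not recommendations from the thread:

    [osd]
        osd journal size = 10240               ; MB, used for file-based journals
    [osd.0]
        osd data = /var/lib/ceph/osd/ceph-0    ; data on a SATA-backed filesystem
        osd journal = /ssd/osd.0.journal       ; journal file on an SSD mount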

Re: ssh passwords

2013-01-22 Thread Xing Lin
I did not notice that there exists such a parameter. Thanks, Dan! Xing On 01/22/2013 02:11 PM, Dan Mick wrote: The '-a/--allhosts' parameter is to spread the command across the cluster...that is, service ceph -a start will start across the cluster. -- To unsubscribe from this list: send

Re: [0.48.3] OSD memory leak when scrubbing

2013-01-22 Thread Sébastien Han
Hi, I originally started a thread around these memory leak problems here: http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg11000.html I'm happy to see that someone supports my theory about the scrubbing process leaking memory. I only use RBD from Ceph, so your theory makes sense as

Re: Questions about journals, performance and disk utilization.

2013-01-22 Thread Jeff Mitchell
Mark Nelson wrote: It may (or may not) help to use a power-of-2 number of PGs. It's generally a good idea to do this anyway, so if you haven't set up your production cluster yet, you may want to play around with this. Basically just take whatever number you were planning on using and round it up
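As a quick illustration of that rounding (a sketch, not from the thread):

    # Round a planned PG count up to the next power of two, e.g. 1000 -> 1024.
    next_pow2() {
        local n=$1 p=1
        while [ "$p" -lt "$n" ]; do p=$((p * 2)); done
        echo "$p"
    }
    next_pow2 1000   # prints 1024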

Re: on disk encryption

2013-01-22 Thread James Page
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 On 10/12/12 09:53, Gregory Farnum wrote: [...] I love the idea of btrfs supporting encryption natively much like it does compression. It may be some time before that happens, so in the meantime, I'd love to see Ceph support dm-crypt and/or

Re: [0.48.3] OSD memory leak when scrubbing

2013-01-22 Thread Sylvain Munaut
Hi, I don't really want to try the mem profiler, I had quite a bad experience with it on a test cluster. While running the profiler some OSD crashed... The only way to fix this is to provide a heap dump. Could you provide one? I just did: ceph osd tell 0 heap start_profiler ceph osd tell 0
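The rest of the heap-profiler sequence follows the same pattern; the dump and stop_profiler subcommands below are assumed from the tcmalloc heap profiler integration rather than quoted in this thread, so verify them against your release:

    ceph osd tell 0 heap start_profiler   # as quoted above
    ceph osd tell 0 heap dump             # write a heap dump while scrubbing is running
    ceph osd tell 0 heap stop_profiler    # stop profiling once the leak has grown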

Re: [0.48.3] OSD memory leak when scrubbing

2013-01-22 Thread Sébastien Han
Well ideally you want to run the profiler during the scrubbing process when the memory leaks appear :-). -- Regards, Sébastien Han. On Tue, Jan 22, 2013 at 10:32 PM, Sylvain Munaut s.mun...@whatever-company.com wrote: Hi, I don't really want to try the mem profiler, I had quite a bad

Re: Inktank team @ FOSDEM 2013 ?

2013-01-22 Thread Loic Dachary
On 01/22/2013 07:32 PM, James Page wrote: Hi All On 20/01/13 11:13, Constantinos Venetsanopoulos wrote: Hello Loic, Sebastien, Patrick, that's great news! I'm sure we'll have some very interesting stuff to talk about. Saturday 14:00 @ K.3.201 also seems fine. I'm attending FOSDEM as

Re: Questions about journals, performance and disk utilization.

2013-01-22 Thread Jeff Mitchell
Stefan Priebe wrote: Hi, Am 22.01.2013 22:26, schrieb Jeff Mitchell: Mark Nelson wrote: It may (or may not) help to use a power-of-2 number of PGs. It's generally a good idea to do this anyway, so if you haven't set up your production cluster yet, you may want to play around with this.

Re: Questions about journals, performance and disk utilization.

2013-01-22 Thread Mark Nelson
On 01/22/2013 03:50 PM, Stefan Priebe wrote: Hi, Am 22.01.2013 22:26, schrieb Jeff Mitchell: Mark Nelson wrote: It may (or may not) help to use a power-of-2 number of PGs. It's generally a good idea to do this anyway, so if you haven't set up your production cluster yet, you may want to play

[PATCH 1/3] rbd: small changes

2013-01-22 Thread Alex Elder
A few very minor changes to the rbd code: - RBD_MAX_OPT_LEN is unused, so get rid of it - Consolidate rbd options definitions - Make rbd_segment_name() return pointer to const char Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 17 - 1 file

[PATCH 2/3] rbd: check for overflow in rbd_get_num_segments()

2013-01-22 Thread Alex Elder
The return type of rbd_get_num_segments() is int, but the values it operates on are u64. Although it's not likely, there's no guarantee the result won't exceed what can be represented in an int. The function is already designed to return -ERANGE on error, so just add this possible overflow as

[PATCH 3/3] rbd: don't retry setting up header watch

2013-01-22 Thread Alex Elder
When an rbd image is initially mapped a watch event is registered so we can do something if the header object changes. Right now if that returns ERANGE we loop back and try to initiate it again. However the code that sets up the watch event doesn't clean up after itself very well, and doing that

[PATCH 05/12] rbd: get rid of rbd_req_sync_read()

2013-01-22 Thread Alex Elder
rbd_req_sync_read() is no longer used, so get rid of it. Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 24 1 file changed, 24 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 5a8fef4..6193c69 100644 ---

[PATCH 06/12] rbd: implement watch/unwatch with new code

2013-01-22 Thread Alex Elder
Implement a new function to set up or tear down a watch event for a mapped rbd image header using the new request code. Create a new object request type nodata to handle this. And define rbd_osd_trivial_callback() which simply marks a request done. Signed-off-by: Alex Elder el...@inktank.com

[PATCH 07/12] rbd: get rid of rbd_req_sync_watch()

2013-01-22 Thread Alex Elder
Get rid of rbd_req_sync_watch(), because it is no longer used. Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 42 -- 1 file changed, 42 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 3c110b3..7dedd18

[PATCH 08/12] rbd: use new code for notify ack

2013-01-22 Thread Alex Elder
Use the new object request tracking mechanism for handling a notify_ack request. Move the callback function below the definition of this so we don't have to do a pre-declaration. This resolves: http://tracker.newdream.net/issues/3754 Signed-off-by: Alex Elder el...@inktank.com ---

[PATCH 09/12] rbd: get rid of rbd_req_sync_notify_ack()

2013-01-22 Thread Alex Elder
Get rid of rbd_req_sync_notify_ack() because it is no longer used. As a result rbd_simple_req_cb() becomes unreferenced, so get rid of that too. Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 33 - 1 file changed, 33 deletions(-) diff --git

[PATCH 10/12] rbd: send notify ack asynchronously

2013-01-22 Thread Alex Elder
When we receive notification of a change to an rbd image's header object we need to refresh our information about the image (its size and snapshot context). Once we have refreshed our rbd image we need to acknowledge the notification. This acknowledgement was previously done synchronously, but

[PATCH 11/12] rbd: implement sync method with new code

2013-01-22 Thread Alex Elder
When we receive notification of a change to an rbd image's header object we need to refresh our information about the image (its size and snapshot context). Once we have refreshed our rbd image we need to acknowledge the notification. This acknowledgement was previously done synchronously, but

[PATCH 12/12] rbd: get rid of rbd_req_sync_exec()

2013-01-22 Thread Alex Elder
Get rid of rbd_req_sync_exec() because it is no longer used. That eliminates the last use of rbd_req_sync_op(), so get rid of that too. And finally, that leaves rbd_do_request() unreferenced, so get rid of that. Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 160

Re: [PATCH 1/3] rbd: small changes

2013-01-22 Thread Dan Mick
Reviewed-by: Dan Mick dan.m...@inktank.com On 01/22/2013 01:57 PM, Alex Elder wrote: A few very minor changes to the rbd code: - RBD_MAX_OPT_LEN is unused, so get rid of it - Consolidate rbd options definitions - Make rbd_segment_name() return pointer to const char

Re: [PATCH 2/3] rbd: check for overflow in rbd_get_num_segments()

2013-01-22 Thread Dan Mick
Reviewed-by: Dan Mick dan.m...@inktank.com On 01/22/2013 01:58 PM, Alex Elder wrote: The return type of rbd_get_num_segments() is int, but the values it operates on are u64. Although it's not likely, there's no guarantee the result won't exceed what can be represented in an int. The function

Re: handling fs errors

2013-01-22 Thread Sage Weil
On Wed, 23 Jan 2013, Andrey Korolyov wrote: On Tue, Jan 22, 2013 at 10:05 AM, Sage Weil s...@inktank.com wrote: We observed an interesting situation over the weekend. The XFS volume ceph-osd locked up (hung in xfs_ilock) for somewhere between 2 and 4 minutes. After 3 minutes (180s),

Re: handling fs errors

2013-01-22 Thread Sage Weil
On Tue, 22 Jan 2013, Dimitri Maziuk wrote: On 01/22/2013 12:05 AM, Sage Weil wrote: We observed an interesting situation over the weekend. The XFS volume ceph-osd locked up (hung in xfs_ilock) for somewhere between 2 and 4 minutes. ... FWIW I see this often enough on cheap sata

Re: ssh passwords

2013-01-22 Thread Sage Weil
On Tue, 22 Jan 2013, Neil Levine wrote: Out of interest, would people prefer that the Ceph deployment script didn't try to handle server-server file copy and just did the local setup only, or is it useful that it tries to be a mini-config management tool at the same time? BTW, you can also

Re: ssh passwords

2013-01-22 Thread Travis Rhoden
On Tue, Jan 22, 2013 at 6:14 PM, Sage Weil s...@inktank.com wrote: On Tue, 22 Jan 2013, Neil Levine wrote: Out of interest, would people prefer that the Ceph deployment script didn't try to handle server-server file copy and just did the local setup only, or is it useful that it tries to be a

Re: on disk encryption

2013-01-22 Thread Sage Weil
On Tue, 22 Jan 2013, Asghar Riahi wrote: Are you familiar with Seagate's Self Encrypting Disk (SED)? Here are some links which might be useful: http://smb.media.seagate.com/tag/seagate-sed/ http://csrc.nist.gov/groups/STM/cmvp/documents/140-1/140sp/140sp1299.pdf Yeah! It would be

Re: on disk encryption

2013-01-22 Thread Sage Weil
On Tue, 22 Jan 2013, James Page wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA256 On 10/12/12 09:53, Gregory Farnum wrote: [...] I love the idea of btrfs supporting encryption natively much like it does compression. It may be some time before that happens, so in the meantime, I'd

Re: ssh passwords

2013-01-22 Thread Neil Levine
We're having a chat about ceph-deploy tomorrow. We need to strike a balance between its being a useful tool for standing up a quick cluster and its ignoring the UNIX philosophy and trying to do too much. My assumption is that for most production operations, or at the point where people decide to

Re: Questions about journals, performance and disk utilization.

2013-01-22 Thread Josh Durgin
On 01/22/2013 01:58 PM, Jeff Mitchell wrote: I'd be interested in figuring out the right way to migrate an RBD from one pool to another regardless. Each way involves copying data, since by definition a different pool will use different placement groups. You could export/import with the rbd
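A hedged sketch of that export/import route with the rbd CLI; the pool and image names are placeholders, and flags may vary by release:

    # Copy an image out of the SATA pool and recreate it in the SSD pool.
    rbd export sata/myimage /tmp/myimage.img
    rbd import /tmp/myimage.img ssd/myimage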

Re: Questions about journals, performance and disk utilization.

2013-01-22 Thread Jeff Mitchell
On Tue, Jan 22, 2013 at 7:25 PM, Josh Durgin josh.dur...@inktank.com wrote: On 01/22/2013 01:58 PM, Jeff Mitchell wrote: I'd be interested in figuring out the right way to migrate an RBD from one pool to another regardless. Each way involves copying data, since by definition a different

[PATCH 00/24] fixes for MDS cluster recovery

2013-01-22 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com Patch 1 fixes a readdir bug I introduced; I think it should be included in the next release. Patch 2 and patch 3 are non-critical fixes for my previous patches. Patch 4 modifies the EMetaBlob format to support journaling multiple root inodes. The rest

[PATCH 01/25] mds: fix end check in Server::handle_client_readdir()

2013-01-22 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com commit 1174dd3188 (don't retry readdir request after issuing caps) introduced a bug that wrongly marks 'end' in the readdir reply. The code that touches existing dentries re-uses an iterator, and the iterator is used for checking if the readdir is at the end.

[PATCH 02/25] mds: check deleted directory in Server::rdlock_path_xlock_dentry

2013-01-22 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com Commit b03eab22e4 (mds: forbid creating file in deleted directory) is not complete; mknod, mkdir and symlink were missed. Moving the check into Server::rdlock_path_xlock_dentry() fixes the issue. Signed-off-by: Yan, Zheng zheng.z@intel.com ---

[PATCH 03/25] mds: lock remote inode's primary dentry during rename

2013-01-22 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com commit 1203cd2110 (mds: allow open_remote_ino() to open xlocked dentry) makes Server::handle_client_rename() xlock remote inodes' primary dentries so the witness MDS can open the xlocked dentry. But I added the remote inodes' projected primary dentries to the xlock list.

[PATCH 04/25] mds: allow journaling multiple root inodes in EMetaBlob

2013-01-22 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com In some cases (rename, rmdir, subtree map), we may need to journal multiple root inodes (/, mdsdir) in one EMetaBlob. This patch modifies the EMetaBlob format to support journaling multiple root inodes. Signed-off-by: Yan, Zheng zheng.z@intel.com ---

[PATCH 06/25] mds: properly set error_dentry for discover reply

2013-01-22 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com If MDCache::handle_discover() receives a 'discover path' request but cannot find the base inode, it should properly set the 'error_dentry' to make sure MDCache::handle_discover_reply() checks the correct object's wait queue. Signed-off-by: Yan, Zheng

[PATCH 09/25] mds: splits rename force journal check into separate function

2013-01-22 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com The function will be used by a later patch that fixes rename rollback. Signed-off-by: Yan, Zheng zheng.z@intel.com --- src/mds/Server.cc | 74 +-- src/mds/Server.h | 1 + 2 files changed, 46

[PATCH 08/25] mds: fix had dentry linked to wrong inode warning

2013-01-22 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com The reason for the 'had dentry linked to wrong inode' warning is that Server::_rename_prepare() adds the destdir to the EMetaBlob before adding the straydir. So during MDS recovery, the destdir is replayed first. The old inode is directly replaced by the source

[PATCH 11/25] mds: don't journal non-auth rename source directory

2013-01-22 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com After replaying a slave rename, the non-auth directory that we rename out of will be trimmed, so there is no need to journal it. Signed-off-by: Yan, Zheng zheng.z@intel.com --- src/mds/Server.cc | 26 ++ 1 file changed, 10

[PATCH 12/25] mds: preserve non-auth/unlinked objects until slave commit

2013-01-22 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com The MDS should not trim objects in a non-auth subtree immediately after replaying a slave rename, because the slave rename may require rollback later and these objects are needed for rollback. Signed-off-by: Yan, Zheng zheng.z@intel.com ---

[PATCH 14/25] mds: split resolve into two sub-stages

2013-01-22 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com The resolve stage serves to disambiguate the fate of uncommitted slave updates and resolve subtree authority. The MDS sends a resolve message that claims subtree authority immediately when the resolve stage is entered. When receiving a resolve message, the MDS

[PATCH 13/25] mds: fix slave rename rollback

2013-01-22 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com The main issue with the old slave rename rollback code is that it assumes all affected objects are in the cache. The assumption is not true when the MDS does rollback in the resolve stage. This patch removes the assumption and makes Server::do_rename_rollback() check

[PATCH 15/25] mds: send resolve messages after all MDS reach resolve stage

2013-01-22 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com The current code sends resolve messages whenever the resolving MDS set changes. There is no need to send resolve messages when some MDS leaves the resolve stage. Sending messages while some MDSes are replaying is also not very useful. Signed-off-by: Yan, Zheng

[PATCH 10/25] mds: force journal straydn for rename if necessary

2013-01-22 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com A rename may overwrite an empty directory inode and move it into the stray directory. An MDS that has an auth subtree beneath the overwritten directory needs to journal the stray dentry when handling the rename slave request. Signed-off-by: Yan, Zheng zheng.z@intel.com ---

[PATCH 07/25] mds: don't early reply rename

2013-01-22 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com _rename_finish() does not send dentry link/unlink message to replicas. We should prevent dentries that are modified by the rename operation from getting new replicas when the rename operation is committing. So don't mark xlocks done and early reply for

[PATCH 20/25] mds: journal inode's projected parent when doing link rollback

2013-01-22 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com Otherwise the journal entry will revert the effect of any on-going rename operation for the inode. Signed-off-by: Yan, Zheng zheng.z@intel.com --- src/mds/Server.cc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git

[PATCH 21/25] mds: don't journal opened non-auth inode

2013-01-22 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com If we journal an opened non-auth inode, then during journal replay the corresponding entry will add non-auth objects to the cache. But the MDS does not journal all subsequent modifications (rmdir, rename) to these non-auth objects, so the code that manages cache and

[PATCH 22/25] mds: properly clear CDir::STATE_COMPLETE when replaying EImportStart

2013-01-22 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com When replaying EImportStart, we should set/clear the directory's COMPLETE flag according to the flag in the journal entry. Signed-off-by: Yan, Zheng zheng.z@intel.com --- src/mds/MDCache.cc | 5 +++-- src/mds/Migrator.cc | 4 +---

[PATCH 24/25] mds: rejoin remote wrlocks and frozen auth pin

2013-01-22 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com Include remote wrlocks and frozen authpins in the cache rejoin strong message. Signed-off-by: Yan, Zheng zheng.z@intel.com --- src/mds/Locker.cc | 4 +-- src/mds/MDCache.cc | 56 +++---

[PATCH 23/25] mds: move variables special to rename into MDRequest::more

2013-01-22 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com My previous patches add two pointers (ambiguous_auth_inode and auth_pin_freeze) to class Mutation. They are both used by cross-authority rename, and both point to the renamed inode. Later patches need to add more rename-specific state to MDRequest, so just move them

[PATCH 19/25] mds: fix for MDCache::disambiguate_imports

2013-01-22 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com In the resolve stage, if no MDS claims another MDS's ambiguous subtree import, the subtree's dir_auth is undefined. Signed-off-by: Yan, Zheng zheng.z@intel.com --- src/mds/MDCache.cc | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git

[PATCH 25/25] mds: fetch missing inodes from disk

2013-01-22 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com The problem with fetching missing inodes from replicas is that replicated inodes do not have up-to-date rstat and fragstat. So just fetch missing inodes from disk. Signed-off-by: Yan, Zheng zheng.z@intel.com --- src/mds/MDCache.cc | 83

[PATCH 18/25] mds: fix for MDCache::adjust_bounded_subtree_auth

2013-01-22 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com After swallowing extra subtrees, subtree bounds may change, so it should re-check. Signed-off-by: Yan, Zheng zheng.z@intel.com --- src/mds/MDCache.cc | 24 +--- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git

[PATCH 17/25] mds: don't replace existing slave request

2013-01-22 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com The MDS may receive a client request but find there is an existing slave request. It means another MDS is handling the same request, so we should not replace the slave request with a new client request; just forward the request. The client request may

[PATCH 16/25] mds: Always use {push,pop}_projected_linkage to change linkage

2013-01-22 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com The current code skips using {push,pop}_projected_linkage to modify a replica dentry's linkage. This confuses EMetaBlob::add_dir_context() and makes it record an out-of-date path when TO_ROOT mode is used. This patch changes the code to always use

[PATCH 05/25] mds: introduce XSYN to SYNC lock state transition

2013-01-22 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com If a lock is in the XSYN state, Locker::simple_sync() first tries changing the lock state to EXCL. If it fails to change the lock state to EXCL, it just returns. So Locker::simple_sync() does not guarantee the lock state eventually changes to SYNC. This issue can cause

Re: ssh passwords

2013-01-22 Thread Travis Rhoden
Since you are chatting about ceph-deploy tomorrow, I'll chime in with a bit more. I'm interested in ceph-deploy since it can be a lightweight, production-appropriate installer. The docs repeatedly warn that mkcephfs is not intended for production clusters, and Neil reminds us that the

Re: ssh passwords

2013-01-22 Thread Neil Levine
From my perspective, I want to ensure that we have a script that helps users get Ceph up and running as quickly as possible so they can play, explore and evaluate it. With this goal in mind, I would prefer to lean towards the KISS principle to reduce the potential failure scenarios which a) deter

Will multi-monitor speed up pg initializing?

2013-01-22 Thread Chen, Xiaoxi
Hi list, The first time I start my ceph cluster, it takes more than 15 minutes to get all the PGs active+clean. It's fast at first (say 100 PG/s) but quite slow when only hundreds of PGs are left peering. Is this a common situation? Since there is quite a bit of disk IO and network IO

/etc/init.d/ceph bug for multi-host when using -a option

2013-01-22 Thread Chen, Xiaoxi
Hi List, Here is part of /etc/init.d/ceph script: case $command in start) # Increase max_open_files, if the configuration calls for it. get_conf max_open_files 8192 max open files if [ $max_open_files != 0 ]; then # Note: Don't try
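Spelled out, the quoted fragment boils down to roughly the following pattern; this is a readability sketch, not the verbatim script, and the ulimit call is assumed from the comment about max_open_files:

    case $command in
        start)
            # Increase max_open_files, if the configuration calls for it.
            get_conf max_open_files "8192" "max open files"
            if [ "$max_open_files" != "0" ]; then
                ulimit -n "$max_open_files"   # assumed: raises the fd limit on the local host only
            fi
            ;;
    esac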