[ceph-users] Cache data consistency among multiple RGW instances
Hi list,

I'm trying to understand the RGW cache consistency model. My Ceph cluster has multiple RGW instances behind HAProxy as the load balancer, and HAProxy picks one RGW instance to serve each request (round-robin). The question is: if the RGW cache is enabled, which is the default behavior, there seems to be a cache inconsistency issue. For example, object0 is cached in RGW-0 and RGW-1 at the same time. Some time later it is updated through RGW-0. If the next read is issued to RGW-1, the outdated cached copy would be served, since RGW-1 isn't aware of the update, and the data would be inconsistent.

Is this behavior expected, or is there anything I missed?

Sincerely,
Yuan
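If you want to rule the RGW metadata cache in or out while investigating, a minimal sketch for inspecting and toggling it on a running gateway is below. The admin-socket path and instance name are assumptions and will differ per deployment; this is not a recommendation from the thread.

  # Inspect the current cache setting on one RGW instance (socket path is an assumption):
  $ ceph daemon /var/run/ceph/ceph-client.rgw.gateway-0.asok config show | grep rgw_cache_enabled

  # Temporarily turn the cache off on that instance for testing (default is true):
  $ ceph daemon /var/run/ceph/ceph-client.rgw.gateway-0.asok config set rgw_cache_enabled false

  # To make it persistent, the equivalent ceph.conf setting under that RGW's client
  # section would be:  rgw cache enabled = false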
Re: [ceph-users] MDS aborted after recovery and active, FAILED assert (r >= 0)
Hi John,

Good shot! I've increased osd_max_write_size to 1 GB (still smaller than the osd journal size) and the MDS is still running fine after an hour. Now checking whether the fs is still accessible or not. Will update from time to time. Thanks again John.

Regards,
Bazli

-----Original Message-----
From: john.sp...@inktank.com [mailto:john.sp...@inktank.com] On Behalf Of John Spray
Sent: Friday, January 16, 2015 11:58 PM
To: Mohd Bazli Ab Karim
Cc: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org
Subject: Re: MDS aborted after recovery and active, FAILED assert (r >= 0)

It has just been pointed out to me that you can also work around this issue on your existing system by increasing the osd_max_write_size setting on your OSDs (default 90 MB) to something higher, but still smaller than your osd journal size. That might get you on a path to having an accessible filesystem before you consider an upgrade.

John

On Fri, Jan 16, 2015 at 10:57 AM, John Spray john.sp...@redhat.com wrote:
Hmm, upgrading should help here, as the problematic data structure (anchortable) no longer exists in the latest version. I haven't checked, but hopefully we don't try to write it during upgrades. The bug you're hitting is more or less the same as a similar one we have with the sessiontable in the latest ceph, but you won't hit it there unless you're very unlucky!
John

On Fri, Jan 16, 2015 at 7:37 AM, Mohd Bazli Ab Karim bazli.abka...@mimos.my wrote:
Dear Ceph-Users, Ceph-Devel,

Apologies if you get a double post of this email. I am running a ceph cluster version 0.72.2 with one MDS up at the moment (in fact there are 3, but 2 are down), plus one CephFS client mounted to it. The MDS now always aborts about 4 seconds after it recovers and goes active. Some parts of the log are below:

-3> 2015-01-15 14:10:28.464706 7fbcc8226700 1 -- 10.4.118.21:6800/5390 == osd.19 10.4.118.32:6821/243161 73 osd_op_reply(3742 1000240c57e. [create 0~0,setxattr (99)] v56640'1871414 uv1871414 ondisk = 0) v6 221+0+0 (261801329 0 0) 0x7770bc80 con 0x69c7dc0
-2> 2015-01-15 14:10:28.464730 7fbcc8226700 1 -- 10.4.118.21:6800/5390 == osd.18 10.4.118.32:6818/243072 67 osd_op_reply(3645 107941c. [tmapup 0~0] v56640'1769567 uv1769567 ondisk = 0) v6 179+0+0 (3759887079 0 0) 0x7757ec80 con 0x1c6bb00
-1> 2015-01-15 14:10:28.464754 7fbcc8226700 1 -- 10.4.118.21:6800/5390 == osd.47 10.4.118.35:6809/8290 79 osd_op_reply(3419 mds_anchortable [writefull 0~94394932] v0'0 uv0 ondisk = -90 (Message too long)) v6 174+0+0 (3942056372 0 0) 0x69f94a00 con 0x1c6b9a0
0> 2015-01-15 14:10:28.471684 7fbcc8226700 -1 mds/MDSTable.cc: In function 'void MDSTable::save_2(int, version_t)' thread 7fbcc8226700 time 2015-01-15 14:10:28.46
mds/MDSTable.cc: 83: FAILED assert(r >= 0)
ceph version ()
1: (MDSTable::save_2(int, unsigned long)+0x325) [0x769e25]
2: (Context::complete(int)+0x9) [0x568d29]
3: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x1097) [0x7c15d7]
4: (MDS::handle_core_message(Message*)+0x5a0) [0x588900]
5: (MDS::_dispatch(Message*)+0x2f) [0x58908f]
6: (MDS::ms_dispatch(Message*)+0x1e3) [0x58ab93]
7: (DispatchQueue::entry()+0x549) [0x975739]
8: (DispatchQueue::DispatchThread::entry()+0xd) [0x8902dd]
9: (()+0x7e9a) [0x7fbcccb0de9a]
10: (clone()+0x6d) [0x7fbccb4ba3fd]
NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this.

Is there any workaround/patch to fix this issue? Let me know if you need to see the log at a certain debug-mds level as well. Any help would be very much appreciated.

Thanks.
Bazli
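Not part of the original exchange, but for reference, a minimal sketch of how the workaround John describes could be applied. The value 1024 MB mirrors the 1 GB Bazli mentions above; make sure it stays below your osd journal size:

  # osd_max_write_size is expressed in MB (default 90). Raise it at runtime on all OSDs:
  $ ceph tell osd.* injectargs '--osd-max-write-size 1024'

  # To persist it across OSD restarts, the matching ceph.conf entry under [osd] would be:
  #   osd max write size = 1024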
[ceph-users] rgw-agent copy file failed
When I write a file named 1234% in the master region, radosgw-agent sends a copy-object request containing x-amz-copy-source: nofilter_bucket_1/1234% to the replica region, and it fails with a 404 error. My analysis is that radosgw-agent does not URL-encode x-amz-copy-source: nofilter_bucket_1/1234%, but RGW decodes x-amz-copy-source in the function RGWCopyObj::parse_copy_location, so 1234% is decoded to 1234 and the copy fails. Can you check this?

baijia...@126.com
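As a rough illustration of the suspected problem (a sketch, not the agent's actual code path): the literal % in the object name has to be percent-encoded before it is placed in the header, because RGW decodes the header again on receipt.

  # radosgw-agent is Python; encoding the copy source before building the header
  # would look roughly like this (illustration only):
  $ python -c "import urllib; print urllib.quote('nofilter_bucket_1/1234%')"
  nofilter_bucket_1/1234%25

  # Sent as x-amz-copy-source: nofilter_bucket_1/1234%25, RGW decodes it back to 1234%.
  # Sent unencoded, RGW decodes 1234% down to 1234 and the copy returns 404.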
Re: [ceph-users] Cache pool tiering SSD journal
No. If you use cache tiering, there is no need to use an SSD journal as well.

From: Florent MONTHEL
Date: 2015-01-17 23:43
To: ceph-users
Subject: [ceph-users] Cache pool tiering SSD journal

Hi list, with the cache pool tiering (write-back mode) enhancement, should I keep using SSD journals? Can we have one big SSD pool doing the caching for all the low-cost storage pools?

Thanks
Florent Monthel
Re: [ceph-users] CEPH Expansion
Hi George,

List the available disks:
# $ ceph-deploy disk list {node-name [node-name]...}

Add an OSD using osd create:
# $ ceph-deploy osd create {node-name}:{disk}[:{path/to/journal}]

Or you can use the manual steps to prepare and activate a disk described at http://ceph.com/docs/master/start/quick-ceph-deploy/#expanding-your-cluster

Jiri

On 15/01/2015 06:36, Georgios Dimitrakakis wrote:
Hi all! I would like to expand our CEPH cluster and add a second OSD node. In this node I will have ten 4TB disks dedicated to CEPH. What is the proper way of putting them into the already available CEPH cluster? I guess that the first thing to do is to prepare them with ceph-deploy and mark them as out at preparation. I should then restart the services and add (mark as in) one of them. Afterwards, I have to wait for the rebalance to occur, and upon finishing I will add the second, and so on. Is this safe enough? How long do you expect the rebalancing procedure to take? I already have ten more 4TB disks at another node, and the amount of data is around 40GB with a 2x replication factor. The connection is over Gigabit.

Best,
George
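As a concrete illustration of the two commands above, with a hypothetical new node named osd-node2, a data disk on sdb and a journal partition on /dev/sdk1 (all names are placeholders, not taken from the thread):

  $ ceph-deploy disk list osd-node2
  $ ceph-deploy osd create osd-node2:sdb:/dev/sdk1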
Re: [ceph-users] Cache pool tiering SSD journal
On 01/17/2015 08:17 PM, lidc...@redhat.com wrote:
No. If you use cache tiering, there is no need to use an SSD journal as well.

Cache tiering and SSD journals serve somewhat different purposes. In Ceph, all of the data for every single write is written to both the journal and the data storage device. SSD journals let the data disk avoid the additional coalesced O_DSYNC sequential journal writes. In some situations this can provide up to a 2x write performance improvement on the base tier OSDs.

A cache pool tier may also provide some coalescing of writes to the base pool, but it doesn't help you avoid the additional journal write penalty on the base tier OSDs. It does, however, provide the benefit of allowing you to read hot data from the cache tier and potentially avoid read/write head seek contention if you have spinning disks on the base tier.

Mark
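For context, a minimal sketch of putting a writeback cache pool in front of a base pool; the pool names are placeholders, and the hit_set/target sizing parameters a real deployment needs are omitted here:

  $ ceph osd tier add cold-storage hot-cache
  $ ceph osd tier cache-mode hot-cache writeback
  $ ceph osd tier set-overlay cold-storage hot-cache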
Re: [ceph-users] Giant on Centos 7 with custom cluster name
Hi, I have upgraded Firefly to Giant on Debian Wheezy and it went without any problems.

Jiri

On 16/01/2015 06:49, Erik McCormick wrote:
Hello all, I've got an existing Firefly cluster on CentOS 7 which I deployed with ceph-deploy. The latest version of ceph-deploy refuses to handle commands issued with a cluster name:

[ceph_deploy.install][ERROR ] custom cluster names are not supported on sysvinit hosts

This is a production cluster. Small, but still production. Is it safe to go through manually upgrading the packages? I'd hate to do the upgrade and find out I can no longer start the cluster because it can't be called anything other than ceph.

Thanks,
Erik
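Not an answer from the thread, but a rough sketch of the manual path being asked about, assuming the usual rolling-upgrade order (monitors first, then OSDs, then MDS/RGW) on one node at a time; the package list is the stock one, and a custom cluster name may additionally require pointing the init script at /etc/ceph/<cluster>.conf:

  # On each node, upgrade the packages in place (repo already pointing at Giant):
  $ sudo yum update ceph ceph-common

  # Then restart the local daemons, monitors before OSDs, checking health in between:
  $ sudo /etc/init.d/ceph restart mon
  $ ceph health
  $ sudo /etc/init.d/ceph restart osd
  $ ceph health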
Re: [ceph-users] Cache pool tiering SSD journal
On Sun, 18 Jan 2015 10:17:50 AM lidc...@redhat.com wrote:
No. If you use cache tiering, there is no need to use an SSD journal as well.

Really? Writes are as fast as with SSD journals?
Re: [ceph-users] two mount points, two different data
Because you are not using a cluster-aware filesystem, the respective mounts don't know when changes are made to the underlying block device (RBD) by the other mount. What you are doing *will* lead to file corruption. You need to use a distributed filesystem such as GFS2 or CephFS; CephFS would probably be the easiest to set up.

Thanks for the help. I use CephFS and it's working great!
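For anyone following along, a minimal sketch of mounting CephFS with the kernel client instead of sharing one RBD image between hosts; the monitor address and secret file path are placeholders:

  $ sudo mkdir -p /mnt/cephfs
  $ sudo mount -t ceph 192.168.1.10:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret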
Re: [ceph-users] CEPH Expansion
Hi Jiri,

thanks for the feedback. My main concern is whether it's better to add each OSD one by one and wait for the cluster to rebalance every time, or to do it all together at once. Furthermore, an estimate of the time to rebalance would be great!

Regards,
George

Hi George,
List the available disks:
# $ ceph-deploy disk list {node-name [node-name]...}
Add an OSD using osd create:
# $ ceph-deploy osd create {node-name}:{disk}[:{path/to/journal}]
Or you can use the manual steps to prepare and activate a disk described at http://ceph.com/docs/master/start/quick-ceph-deploy/#expanding-your-cluster
Jiri
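Not from the thread, but one common way to stage this kind of expansion is to bring the new OSDs in at a low CRUSH weight and raise them in steps, letting backfill finish and watching the cluster in between; the OSD IDs, weights and step sizes below are only placeholders:

  # Optional: keep newly created OSDs from taking data immediately by giving them
  # a zero initial CRUSH weight in ceph.conf on the new node:
  #   [osd]
  #       osd crush initial weight = 0

  # Then raise the weight of each new OSD gradually, for example:
  $ ceph osd crush reweight osd.10 1.0
  $ ceph osd crush reweight osd.10 2.0

  # Watch recovery/backfill progress between steps:
  $ ceph -w
  $ ceph status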