Re: [ceph-users] how to fix X is an unexpected clone
Hello Stefan,

ceph-object-tool does not exist on my setup. Do you mean the command /usr/bin/ceph-objectstore-tool that is installed with the ceph-osd package?

I have the following situation here in Ceph Luminous:

2018-02-26 07:15:30.066393 7f0684acb700 -1 log_channel(cluster) log [ERR] : 5.111f shard 395 missing 5:f88e2b07:::rbd_data.8a09fb8793c74f.6dce:23152
2018-02-26 07:15:30.395189 7f0684acb700 -1 log_channel(cluster) log [ERR] : deep-scrub 5.111f 5:f88e2b07:::rbd_data.8a09fb8793c74f.6dce:23152 is an unexpected clone

I did not understand how you actually fixed the problem. Could you provide more details?

thanks

Saverio

On 08.08.17 12:02, Stefan Priebe - Profihost AG wrote:
> Hello Greg,
>
> Am 08.08.2017 um 11:56 schrieb Gregory Farnum:
>> On Mon, Aug 7, 2017 at 11:55 PM Stefan Priebe - Profihost AG
>> <s.pri...@profihost.ag <mailto:s.pri...@profihost.ag>> wrote:
>>
>> Hello,
>>
>> how can i fix this one:
>>
>> 2017-08-08 08:42:52.265321 osd.20 [ERR] repair 3.61a
>> 3:58654d3d:::rbd_data.106dd406b8b4567.018c:9d455 is an
>> unexpected clone
>> 2017-08-08 08:43:04.914640 mon.0 [INF] HEALTH_ERR; 1 pgs inconsistent; 1
>> pgs repair; 1 scrub errors
>> 2017-08-08 08:43:33.470246 osd.20 [ERR] 3.61a repair 1 errors, 0 fixed
>> 2017-08-08 08:44:04.915148 mon.0 [INF] HEALTH_ERR; 1 pgs inconsistent; 1
>> scrub errors
>>
>> If i just delete manually the relevant files ceph is crashing. rados
>> does not list those at all?
>>
>> How can i fix this?
>>
>>
>> You've sent quite a few emails that have this story spread out, and I
>> think you've tried several different steps to repair it that have been a
>> bit difficult to track.
>>
>> It would be helpful if you could put the whole story in one place and
>> explain very carefully exactly what you saw and how you responded. Stuff
>> like manually copying around the wrong files, or files without a
>> matching object info, could have done some very strange things.
>>
>> Also, basic debugging stuff like what version you're running will help. :)
>>
>> Also note that since you've said elsewhere you don't need this image, I
>> don't think it's going to hurt you to leave it like this for a bit
>> (though it will definitely mess up your monitoring).
>> -Greg
>
> i'm sorry about that. You're correct.
>
> I was able to fix this just a few minutes ago by using the
> ceph-object-tool and the remove operation to remove all left over files.
>
> I did this on all OSDs with the problematic pg. After that ceph was able
> to fix itself.
>
> A better approach might be that ceph can recover itself from an
> unexpected clone by just deleting it.
>
> Greets,
> Stefan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
SWITCH
Saverio Proto, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 1573
saverio.pr...@switch.ch, http://www.switch.ch
http://www.switch.ch/stories
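For reference, the procedure Stefan describes (remove the leftover clone objects with ceph-objectstore-tool on every OSD holding the problematic PG) maps to roughly the following sketch. The PG id and object name are taken from the Luminous log above as placeholders; the exact JSON object spec must come from the `--op list` output, and the OSD must be stopped first:

```shell
# Stop the OSD that holds a replica of the inconsistent PG (OSD id is a placeholder)
systemctl stop ceph-osd@395

# List the objects in the PG to find the unexpected clone's exact object spec
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-395 \
    --pgid 5.111f --op list

# Remove the leftover clone, passing the JSON object spec printed by --op list
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-395 \
    --pgid 5.111f '<object-spec-from-op-list>' remove

# Restart the OSD; repeat on every OSD with the problematic PG, then re-run repair
systemctl start ceph-osd@395
```

This is only a sketch of the workflow, not a verified recipe; removing objects from an OSD's store is destructive, so double-check the object spec before the `remove` step.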
Re: [ceph-users] OSPF to the host
> I'm looking at the Dell S-ON switches which we can get in a Cumulus
> version. Any pro's and con's of using Cumulus vs old school switch OS's you
> may have come across?

Nothing to report here: once configured properly, the hardware works as expected. I never used Dell; I used switches from Quanta.

Saverio
Re: [ceph-users] RadosGW - Problems running the S3 and SWIFT API at the same time
I am at the Ceph Day at CERN, and I asked Sage whether it is supported to enable both the S3 and the SWIFT API at the same time. The answer is yes, so it is meant to be supported, and what we see here is probably a bug.

I opened a bug report: http://tracker.ceph.com/issues/16293

If anyone has a chance to test it on a ceph version newer than Hammer, you can update the bug :)

thank you

Saverio

2016-05-12 15:49 GMT+02:00 Yehuda Sadeh-Weinraub <yeh...@redhat.com>:
> On Thu, May 12, 2016 at 12:29 AM, Saverio Proto <ziopr...@gmail.com> wrote:
>>> While I'm usually not fond of blaming the client application, this is
>>> really the swift command line tool issue. It tries to be smart by
>>> comparing the md5sum of the object's content with the object's etag,
>>> and it breaks with multipart objects. Multipart objects is calculated
>>> differently (md5sum of the md5sum of each part). I think the swift
>>> tool has a special handling for swift large objects (which are not the
>>> same as s3 multipart objects), so that's why it works in that specific
>>> use case.
>>
>> Well but I tried also with rclone and I have the same issue.
>>
>> Clients I tried
>> rclone (both SWIFT and S3)
>> s3cmd (S3)
>> python-swiftclient (SWIFT).
>>
>> I can reproduce the issue with different clients.
>> Once a multipart object is uploaded via S3 (with rclone or s3cmd) I
>> cannot read it anymore via SWIFT (either with rclone or
>> pythonswift-client).
>>
>> Are you saying that all SWIFT clients implementations are wrong ?
>
> Yes.
>
>>
>> Or should the radosgw be configured with only 1 API active ?
>>
>> Saverio
Re: [ceph-users] hadoop on cephfs
You can also have Hadoop talking to the Rados Gateway (SWIFT API) so that the data is in Ceph instead of HDFS. I wrote this tutorial that might help: https://github.com/zioproto/hadoop-swift-tutorial Saverio 2016-04-30 23:55 GMT+02:00 Adam Tygart: > Supposedly cephfs-hadoop worked and/or works on hadoop 2. I am in the > process of getting it working with cdh5.7.0 (based on hadoop 2.6.0). > I'm under the impression that it is/was working with 2.4.0 at some > point in time. > > At this very moment, I can use all of the DFS tools built into hadoop > to create, list, delete, rename, and concat files. What I am not able > to do (currently) is run any jobs. > > https://github.com/ceph/cephfs-hadoop > > It can be built using current (at least infernalis with my testing) > cephfs-java and libcephfs. The only thing you'll for sure need to do > is patch the file referenced here: > https://github.com/ceph/cephfs-hadoop/issues/25 When building, you'll > want to tell maven to skip tests (-Dmaven.test.skip=true). > > Like I said, I am digging into this still, and I am not entirely > convinced my issues are ceph related at the moment. > > -- > Adam > > On Sat, Apr 30, 2016 at 1:51 PM, Erik McCormick > wrote: >> I think what you are thinking of is the driver that was built to actually >> replace hdfs with rbd. As far as I know that thing had a very short lifespan >> on one version of hadoop. Very sad. >> >> As to what you proposed: >> >> 1) Don't use Cephfs in production pre-jewel. >> >> 2) running hdfs on top of ceph is a massive waste of disk and fairly >> pointless as you make replicas of replicas. >> >> -Erik >> >> On Apr 29, 2016 9:20 PM, "Bill Sharer" wrote: >>> >>> Actually this guy is already a fan of Hadoop. I was just wondering >>> whether anyone has been playing around with it on top of cephfs lately. It >>> seems like the last round of papers were from around cuttlefish. 
>>> >>> On 04/28/2016 06:21 AM, Oliver Dzombic wrote: Hi, bad idea :-) Its of course nice and important to drag developer towards a new/promising technology/software. But if the technology under the individual required specifications does not match, you will just risk to show this developer how worst this new/promising technology is. So you will just reach the opposite of what you want. So before you are doing something, usually big, like hadoop on an unstable software, maybe you should not use it. For the good of the developer, for your good and for the good of the reputation of the new/promising technology/software you wish. To force a pinguin to somehow live in the sahara, might be possible ( at least for some time ), but usually not a good idea ;-) >>> >>> ___ >>> ceph-users mailing list >>> ceph-users@lists.ceph.com >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> >> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
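Adam's build steps for cephfs-hadoop condense to roughly this sketch (the patch is the one referenced in issue #25 above; the exact file to edit is described in that issue, not reproduced here):

```shell
# Fetch the cephfs-hadoop bindings
git clone https://github.com/ceph/cephfs-hadoop
cd cephfs-hadoop

# Apply the fix described in https://github.com/ceph/cephfs-hadoop/issues/25
# (edit the file referenced in that issue before building)

# Build, skipping the test suite as Adam suggests
mvn package -Dmaven.test.skip=true
```

This assumes cephfs-java and libcephfs (infernalis or later, per Adam's testing) are already installed on the build host.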
Re: [ceph-users] OSPF to the host
> Has anybody had any experience with running the network routed down all the
> way to the host?

Hello Nick,

yes, at SWITCH.ch we run OSPF unnumbered on the switches and on the hosts. Each server has two NICs and we are able to plug the servers into any port on the fabric, and OSPF will work its magic :)

This simplifies the design when you want to expand the datacenter or when you want to add more links to existing servers that need more capacity. Remember to put a higher metric on the ToR-server links, otherwise you might end up with flows going through the servers, and that is not what you want.

We use whitebox switches with Cumulus Linux. Back when we started this project in August 2015 we built Ubuntu packages for the Quagga version available open source on the GitHub page of Cumulus Linux. It took us a bit of work to make that Quagga run on Ubuntu, but the support from Cumulus was great to sort out the problems.

Our setup is dual stack IPv4 and IPv6. On top of that we run Ceph, using IPv6 only for Ceph traffic.

Looks like at SWITCH we were not the only ones with this idea; you can find this page dated March 2016:
https://support.cumulusnetworks.com/hc/en-us/articles/216805858-Routing-on-the-Host-An-Introduction

Cheers,

Saverio
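To illustrate the idea, a Quagga ospfd configuration for routing-on-the-host looks roughly like this sketch (router-id, interface names and cost values are placeholders, not our production config):

```
! ospfd.conf sketch for OSPF unnumbered to the host
router ospf
 ospf router-id 10.0.0.1
!
interface swp1
 ip ospf network point-to-point
 ! higher cost on the ToR-server link, so transit traffic never
 ! prefers a path through a server
 ip ospf cost 100
!
interface swp2
 ip ospf network point-to-point
 ip ospf cost 100
```

The point-to-point network type and the inflated interface cost are the two knobs that matter: the first avoids DR election on server links, the second keeps servers out of the transit path.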
Re: [ceph-users] The RGW create new bucket instance then delete it at every create bucket OP
Hello, I am not sure I understood the problem. Can you post the example steps to reproduce the problem ? Also what version of Ceph RGW are you running ? Saverio 2016-05-18 10:24 GMT+02:00 fangchen sun: > Dear ALL, > > I found a problem that the RGW create a new bucket instance and delete > the bucket instance at every create bucket OP with same name > > http://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html > > According to the error code "BucketAlreadyOwnedByYou" from the above > link, shouldn't the RGW return directly or do nothing when recreate > the bucket? > Why do the RGW create a new bucket instance and then delete it? > > Thanks for the reply! > sunfch > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ACL nightmare on RadosGW for 200 TB dataset
> Can't you set the ACL on the object when you put it? What do you think of this bug ? https://github.com/s3tools/s3cmd/issues/743 Saverio ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RadosGW - Problems running the S3 and SWIFT API at the same time
> While I'm usually not fond of blaming the client application, this is
> really the swift command line tool issue. It tries to be smart by
> comparing the md5sum of the object's content with the object's etag,
> and it breaks with multipart objects. Multipart objects is calculated
> differently (md5sum of the md5sum of each part). I think the swift
> tool has a special handling for swift large objects (which are not the
> same as s3 multipart objects), so that's why it works in that specific
> use case.

Well, I also tried with rclone and I have the same issue.

Clients I tried:
rclone (both SWIFT and S3)
s3cmd (S3)
python-swiftclient (SWIFT)

I can reproduce the issue with different clients. Once a multipart object is uploaded via S3 (with rclone or s3cmd) I cannot read it anymore via SWIFT (either with rclone or python-swiftclient).

Are you saying that all SWIFT client implementations are wrong?

Or should the radosgw be configured with only 1 API active?

Saverio
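The quoted explanation of the multipart ETag ("md5sum of the md5sum of each part") can be reproduced in a few lines, which also shows why a plain md5 of the content can never match the `-N` suffixed ETag. This is a sketch using only the standard library; the part size is whatever the uploading client chose:

```python
import hashlib

def multipart_etag(data: bytes, part_size: int) -> str:
    """S3-style multipart ETag: md5 of the concatenated per-part md5
    digests, suffixed with '-<number of parts>'."""
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    digests = b"".join(hashlib.md5(p).digest() for p in parts)
    return hashlib.md5(digests).hexdigest() + "-" + str(len(parts))

data = b"x" * (10 * 1024 * 1024)               # a 10 MiB object
plain = hashlib.md5(data).hexdigest()          # what the swift client computes
multi = multipart_etag(data, 4 * 1024 * 1024)  # stored ETag for a 3-part upload

# The two values always differ, which is exactly the md5sum != etag error
assert plain != multi
assert multi.endswith("-3")
```

This matches the shape of the error in the original report, where the received ETag ended in `-2` (a two-part upload) while the client computed the plain md5 of the whole object.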
Re: [ceph-users] ACL nightmare on RadosGW for 200 TB dataset
> Can't you set the ACL on the object when you put it?

I could create two tenants: one tenant DATASETADMIN for read/write access, and a tenant DATASETUSERS for read-only access.

When I load the dataset into the object store, I need an "s3cmd put" operation and an "s3cmd setacl" operation for each object. It is slow, but we do this only once. Granting read access will then mean adding the user to the DATASETUSERS tenant, without touching the ACLs again.

Still, this is a workaround: we create ad-hoc tenants with read-only permissions and let the users in or out of these tenants. If we want to use the original user's tenant in the ACL, it does not scale for a large number of objects AFAIK. :(

Saverio
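The put-then-setacl procedure described above is a loop of this shape (a sketch: the bucket name, local dataset path and TENANT_UUID are placeholders):

```shell
# Upload each object, then immediately grant read to the read-only tenant
# (TENANT_UUID is a placeholder for the DATASETUSERS tenant UUID)
for f in dataset/*; do
    s3cmd put "$f" "s3://googlebooks-ngrams-gz/${f##*/}"
    s3cmd setacl --acl-grant=read:TENANT_UUID "s3://googlebooks-ngrams-gz/${f##*/}"
done
```

Doing the setacl at load time, as sketched here, is what avoids re-walking all objects later when a new user needs access.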
Re: [ceph-users] RadosGW - Problems running the S3 and SWIFT API at the same time
It does not work also the way around: If I upload a file with the swift client with the -S options to force swift to make multipart: swift upload -S 100 multipart 180.mp4 Then I am not able to read the file with S3 s3cmd get s3://multipart/180.mp4 download: 's3://multipart/180.mp4' -> './180.mp4' [1 of 1] download: 's3://multipart/180.mp4' -> './180.mp4' [1 of 1] 38818503 of 38818503 100% in1s27.32 MB/s done WARNING: MD5 signatures do not match: computed=961f154cc78c7bf1be3b4009c29e5a68, received=d41d8cd98f00b204e9800998ecf8427e Saverio 2016-05-11 16:07 GMT+02:00 Saverio Proto <ziopr...@gmail.com>: > Thank you. > > It is exactly a problem with multipart. > > So I tried two clients (s3cmd and rclone). When you upload a file in > S3 using multipart, you are not able to read anymore this object with > the SWIFT API because the md5 check fails. > > Saverio > > > > 2016-05-09 12:00 GMT+02:00 Xusangdi <xu.san...@h3c.com>: >> Hi, >> >> I'm not running a cluster as yours, but I don't think the issue is caused by >> you using 2 APIs at the same time. >> IIRC the dash thing is append by S3 multipart upload, with a following digit >> indicating the number of parts. >> You may want to check this reported in s3cmd community: >> https://sourceforge.net/p/s3tools/bugs/123/ >> >> and some basic info from Amazon: >> http://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html >> >> Hope this helps :D >> >> Regards, >> ---Sandy >> >>> -Original Message- >>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >>> Saverio Proto >>> Sent: Monday, May 09, 2016 4:42 PM >>> To: ceph-users@lists.ceph.com >>> Subject: Re: [ceph-users] RadosGW - Problems running the S3 and SWIFT API >>> at the same time >>> >>> I try to simplify the question to get some feedback. >>> >>> Is anyone running the RadosGW in production with S3 and SWIFT API active at >>> the same time ? >>> >>> thank you ! 
>>> >>> Saverio >>> >>> >>> 2016-05-06 11:39 GMT+02:00 Saverio Proto <ziopr...@gmail.com>: >>> > Hello, >>> > >>> > We have been running the Rados GW with the S3 API and we did not have >>> > problems for more than a year. >>> > >>> > We recently enabled also the SWIFT API for our users. >>> > >>> > radosgw --version >>> > ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403) >>> > >>> > The idea is that each user of the system is free of choosing the S3 >>> > client or the SWIFT client to access the same container/buckets. >>> > >>> > Please tell us if this is possible by design or if we are doing something >>> > wrong. >>> > >>> > We have now a problem that some files wrote in the past with S3, >>> > cannot be read with the SWIFT API because the md5sum always fails. >>> > >>> > I am able to reproduce the bug in this way: >>> > >>> > We have this file googlebooks-fre-all-2gram-20120701-ts.gz and we know >>> > the correct md5 is 1c8113d2bd21232688221ec74dccff3a You can download >>> > the same file here: >>> > https://www.dropbox.com/s/auq16vdv2maw4p7/googlebooks-fre-all-2gram-20 >>> > 120701-ts.gz?dl=0 >>> > >>> > rclone mkdir lss3:bugreproduce >>> > rclone copy googlebooks-fre-all-2gram-20120701-ts.gz lss3:bugreproduce >>> > >>> > The file is successfully uploaded. >>> > >>> > At this point I can succesfully download again the file: >>> > rclone copy lss3:bugreproduce/googlebooks-fre-all-2gram-20120701-ts.gz >>> > test.gz >>> > >>> > but not with swift: >>> > >>> > swift download googlebooks-ngrams-gz >>> > fre/googlebooks-fre-all-2gram-20120701-ts.gz >>> > Error downloading object >>> > 'googlebooks-ngrams-gz/fre/googlebooks-fre-all-2gram-20120701-ts.gz': >>> > u'Error downloading fre/googlebooks-fre-all-2gram-20120701-ts.gz: >>> > md5sum != etag, 1c8113d2bd21232688221ec74dccff3a != >>> > 1a209a31b4ac3eb923fac5e8d194d9d3-2' >>> > >>> > Also I found strange the dash character '-' at the end of the md5 that >>> > is trying to compare. 
>>> > >>> > Of course upload a file with the swift client and redownloading the >>> > same file just wor
Re: [ceph-users] RadosGW - Problems running the S3 and SWIFT API at the same time
Thank you. It is exactly a problem with multipart. So I tried two clients (s3cmd and rclone). When you upload a file in S3 using multipart, you are not able to read anymore this object with the SWIFT API because the md5 check fails. Saverio 2016-05-09 12:00 GMT+02:00 Xusangdi <xu.san...@h3c.com>: > Hi, > > I'm not running a cluster as yours, but I don't think the issue is caused by > you using 2 APIs at the same time. > IIRC the dash thing is append by S3 multipart upload, with a following digit > indicating the number of parts. > You may want to check this reported in s3cmd community: > https://sourceforge.net/p/s3tools/bugs/123/ > > and some basic info from Amazon: > http://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html > > Hope this helps :D > > Regards, > ---Sandy > >> -Original Message- >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >> Saverio Proto >> Sent: Monday, May 09, 2016 4:42 PM >> To: ceph-users@lists.ceph.com >> Subject: Re: [ceph-users] RadosGW - Problems running the S3 and SWIFT API at >> the same time >> >> I try to simplify the question to get some feedback. >> >> Is anyone running the RadosGW in production with S3 and SWIFT API active at >> the same time ? >> >> thank you ! >> >> Saverio >> >> >> 2016-05-06 11:39 GMT+02:00 Saverio Proto <ziopr...@gmail.com>: >> > Hello, >> > >> > We have been running the Rados GW with the S3 API and we did not have >> > problems for more than a year. >> > >> > We recently enabled also the SWIFT API for our users. >> > >> > radosgw --version >> > ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403) >> > >> > The idea is that each user of the system is free of choosing the S3 >> > client or the SWIFT client to access the same container/buckets. >> > >> > Please tell us if this is possible by design or if we are doing something >> > wrong. 
>> > >> > We have now a problem that some files wrote in the past with S3, >> > cannot be read with the SWIFT API because the md5sum always fails. >> > >> > I am able to reproduce the bug in this way: >> > >> > We have this file googlebooks-fre-all-2gram-20120701-ts.gz and we know >> > the correct md5 is 1c8113d2bd21232688221ec74dccff3a You can download >> > the same file here: >> > https://www.dropbox.com/s/auq16vdv2maw4p7/googlebooks-fre-all-2gram-20 >> > 120701-ts.gz?dl=0 >> > >> > rclone mkdir lss3:bugreproduce >> > rclone copy googlebooks-fre-all-2gram-20120701-ts.gz lss3:bugreproduce >> > >> > The file is successfully uploaded. >> > >> > At this point I can succesfully download again the file: >> > rclone copy lss3:bugreproduce/googlebooks-fre-all-2gram-20120701-ts.gz >> > test.gz >> > >> > but not with swift: >> > >> > swift download googlebooks-ngrams-gz >> > fre/googlebooks-fre-all-2gram-20120701-ts.gz >> > Error downloading object >> > 'googlebooks-ngrams-gz/fre/googlebooks-fre-all-2gram-20120701-ts.gz': >> > u'Error downloading fre/googlebooks-fre-all-2gram-20120701-ts.gz: >> > md5sum != etag, 1c8113d2bd21232688221ec74dccff3a != >> > 1a209a31b4ac3eb923fac5e8d194d9d3-2' >> > >> > Also I found strange the dash character '-' at the end of the md5 that >> > is trying to compare. >> > >> > Of course upload a file with the swift client and redownloading the >> > same file just works. >> > >> > Should I open a bug for the radosgw on http://tracker.ceph.com/ ? >> > >> > thank you >> > >> > Saverio >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > - > 本邮件及其附件含有杭州华三通信技术有限公司的保密信息,仅限于发送给上面地址中列出 > 的个人或群组。禁止任何其他人以任何形式使用(包括但不限于全部或部分地泄露、复制、 > 或散发)本邮件中的信息。如果您错收了本邮件,请您立即电话或邮件通知发件人并删除本 > 邮件! > This e-mail and its attachments contain confidential information from H3C, > which is > intended only for the person or entity whose address is listed above. 
Any use > of the > information contained herein in any way (including, but not limited to, total > or partial > disclosure, reproduction, or dissemination) by persons other than the intended > recipient(s) is prohibited. If you receive this e-mail in error, please > notify the sender > by phone or email immediately and delete it! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] ACL nightmare on RadosGW for 200 TB dataset
Hello there,

Our setup is with Ceph Hammer (latest release). We want to publish in our Object Storage some scientific datasets. These are collections of around 100K objects and a total size of about 200 TB. For Object Storage we use the RadosGW with the S3 API. For the initial testing we are using a smaller dataset of about 26K files and 5 TB of data.

Authentication to the radosGW is with Keystone integration. We created an Openstack tenant to manage the datasets, and with EC2 credentials we uploaded all the files. Once the bucket is full, let's look at the ACLs:

s3cmd info s3://googlebooks-ngrams-gz/
ACL: TENANTDATASET: FULL_CONTROL

So far so good. At this point we want to enable a user of a different tenant to access this dataset read-only. Given the UUID of the tenant of the user, it would be as easy as:

s3cmd setacl --acl-grant=read: s3://googlebooks-ngrams-gz/

However this is not enough: the user will be able to list the objects of the bucket, but not to read them. The read ACL is not inherited by the objects from the bucket. So we must do:

s3cmd setacl --acl-grant=read: --recursive s3://googlebooks-ngrams-gz/

But this takes ages on 26K objects. It works, but you spend several hours updating ACLs, and we cannot repeat this procedure every time a user wants read access.

Now the painful questions:

Is there a way to bulk update the "read acl" on all the objects of a bucket?

What happens to ACLs when the SWIFT and S3 APIs are used simultaneously? From my tests RadosGW ignores the swift client when we try to post ACLs; however, the swift API honors S3 ACLs when reading.

Saverio
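Short of a real bulk-ACL API, one way to shrink the hours spent in `s3cmd setacl --recursive` is to parallelize the per-object calls. A sketch, assuming GNU parallel is available, that `s3cmd ls -r` prints the object URI in the fourth column, and with TENANT_UUID as a placeholder:

```shell
# Update object ACLs 16 at a time instead of sequentially
s3cmd ls -r s3://googlebooks-ngrams-gz/ | awk '{print $4}' \
  | parallel -j 16 s3cmd setacl --acl-grant=read:TENANT_UUID {}
```

This does not change the total number of requests, it only overlaps them, so it is a workaround for the latency rather than an answer to the bulk-update question.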
[ceph-users] Mixed versions of Ceph Cluster and RadosGW
Hello,

I have a production Ceph cluster running the latest Hammer release. We are not planning the upgrade to Jewel soon. However, I would like to upgrade just the Rados Gateway to Jewel, because I want to test the new SWIFT compatibility improvements.

Is it supported to run the system with this configuration: Ceph Hammer and RadosGW Jewel?

thank you

Saverio
Re: [ceph-users] RadosGW - Problems running the S3 and SWIFT API at the same time
I try to simplify the question to get some feedback. Is anyone running the RadosGW in production with S3 and SWIFT API active at the same time ? thank you ! Saverio 2016-05-06 11:39 GMT+02:00 Saverio Proto <ziopr...@gmail.com>: > Hello, > > We have been running the Rados GW with the S3 API and we did not have > problems for more than a year. > > We recently enabled also the SWIFT API for our users. > > radosgw --version > ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403) > > The idea is that each user of the system is free of choosing the S3 > client or the SWIFT client to access the same container/buckets. > > Please tell us if this is possible by design or if we are doing something > wrong. > > We have now a problem that some files wrote in the past with S3, > cannot be read with the SWIFT API because the md5sum always fails. > > I am able to reproduce the bug in this way: > > We have this file googlebooks-fre-all-2gram-20120701-ts.gz and we know > the correct md5 is 1c8113d2bd21232688221ec74dccff3a > You can download the same file here: > https://www.dropbox.com/s/auq16vdv2maw4p7/googlebooks-fre-all-2gram-20120701-ts.gz?dl=0 > > rclone mkdir lss3:bugreproduce > rclone copy googlebooks-fre-all-2gram-20120701-ts.gz lss3:bugreproduce > > The file is successfully uploaded. > > At this point I can succesfully download again the file: > rclone copy lss3:bugreproduce/googlebooks-fre-all-2gram-20120701-ts.gz test.gz > > but not with swift: > > swift download googlebooks-ngrams-gz > fre/googlebooks-fre-all-2gram-20120701-ts.gz > Error downloading object > 'googlebooks-ngrams-gz/fre/googlebooks-fre-all-2gram-20120701-ts.gz': > u'Error downloading fre/googlebooks-fre-all-2gram-20120701-ts.gz: > md5sum != etag, 1c8113d2bd21232688221ec74dccff3a != > 1a209a31b4ac3eb923fac5e8d194d9d3-2' > > Also I found strange the dash character '-' at the end of the md5 that > is trying to compare. 
> > Of course upload a file with the swift client and redownloading the > same file just works. > > Should I open a bug for the radosgw on http://tracker.ceph.com/ ? > > thank you > > Saverio ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] RadosGW - Problems running the S3 and SWIFT API at the same time
Hello,

We have been running the Rados GW with the S3 API and we did not have problems for more than a year. We recently enabled also the SWIFT API for our users.

radosgw --version
ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)

The idea is that each user of the system is free to choose the S3 client or the SWIFT client to access the same container/buckets. Please tell us if this is possible by design or if we are doing something wrong.

We now have a problem that some files written in the past with S3 cannot be read with the SWIFT API, because the md5sum check always fails. I am able to reproduce the bug in this way:

We have this file googlebooks-fre-all-2gram-20120701-ts.gz and we know the correct md5 is 1c8113d2bd21232688221ec74dccff3a. You can download the same file here:
https://www.dropbox.com/s/auq16vdv2maw4p7/googlebooks-fre-all-2gram-20120701-ts.gz?dl=0

rclone mkdir lss3:bugreproduce
rclone copy googlebooks-fre-all-2gram-20120701-ts.gz lss3:bugreproduce

The file is successfully uploaded. At this point I can successfully download the file again:

rclone copy lss3:bugreproduce/googlebooks-fre-all-2gram-20120701-ts.gz test.gz

but not with swift:

swift download googlebooks-ngrams-gz fre/googlebooks-fre-all-2gram-20120701-ts.gz
Error downloading object 'googlebooks-ngrams-gz/fre/googlebooks-fre-all-2gram-20120701-ts.gz':
u'Error downloading fre/googlebooks-fre-all-2gram-20120701-ts.gz:
md5sum != etag, 1c8113d2bd21232688221ec74dccff3a !=
1a209a31b4ac3eb923fac5e8d194d9d3-2'

I also found strange the dash character '-' at the end of the md5 that it is trying to compare. Of course, uploading a file with the swift client and re-downloading the same file just works.

Should I open a bug for the radosgw on http://tracker.ceph.com/ ?

thank you

Saverio
Re: [ceph-users] Cannot reliably create snapshot after freezing QEMU IO
I confirm that the bug is fixed with the 0.94.6 release packages. thank you Saverio 2016-02-22 10:20 GMT+01:00 Saverio Proto <ziopr...@gmail.com>: > Hello Jason, > > from this email on ceph-dev > http://article.gmane.org/gmane.comp.file-systems.ceph.devel/29692 > > it looks like 0.94.6 is coming out very soon. We avoid testing the > unreleased packaged then and we wait for the official release. thank > you > > Saverio > > > 2016-02-19 18:53 GMT+01:00 Jason Dillaman <dilla...@redhat.com>: >> Correct -- a v0.94.6 tag on the hammer branch won't be created until the >> release. >> >> -- >> >> Jason Dillaman >> >> >> - Original Message - >>> From: "Saverio Proto" <ziopr...@gmail.com> >>> To: "Jason Dillaman" <dilla...@redhat.com> >>> Cc: ceph-users@lists.ceph.com >>> Sent: Friday, February 19, 2016 11:38:08 AM >>> Subject: Re: [ceph-users] Cannot reliably create snapshot after freezing >>> QEMU IO >>> >>> Hello, >>> >>> thanks for the pointer. Just to make sure, for dev/QE hammer release, >>> do you mean the "hammer" branch ? So following the documentation, >>> because I use Ubuntu Trusty, this should be the repository right ? >>> >>> deb http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/ref/hammer >>> trusty main >>> >>> thanks >>> >>> Saverio >>> >>> >>> >>> >>> 2016-02-19 16:41 GMT+01:00 Jason Dillaman <dilla...@redhat.com>: >>> > I believe 0.94.6 is still in testing because of a possible MDS issue [1]. >>> > You can download the interim dev/QE hammer release by following the >>> > instructions here [2] if you are in a hurry. You would only need to >>> > upgrade librbd1 (and its dependencies) to pick up the fix. When you do >>> > upgrade (either with the interim or the official release), I would >>> > appreciate it if you could update the ticket to let me know if it resolved >>> > your issue. 
>>> > >>> > [1] http://tracker.ceph.com/issues/13356 >>> > [2] http://docs.ceph.com/docs/master/install/get-packages/ >>> > >>> > -- >>> > >>> > Jason Dillaman >>> > >>> > >>> > - Original Message - >>> >> From: "Saverio Proto" <ziopr...@gmail.com> >>> >> To: ceph-users@lists.ceph.com >>> >> Sent: Friday, February 19, 2016 10:11:01 AM >>> >> Subject: [ceph-users] Cannot reliably create snapshot after freezing QEMU >>> >> IO >>> >> >>> >> Hello, >>> >> >>> >> we are hitting here Bug #14373 in our production cluster >>> >> http://tracker.ceph.com/issues/14373 >>> >> >>> >> Since we introduced the object map feature in our cinder rbd volumes, >>> >> we are not able to make snapshot the volumes, unless they pause the >>> >> VMs. >>> >> >>> >> We are running the latest Hammer and so we are really looking forward >>> >> release v0.94.6 >>> >> >>> >> Does anyone know when the release is going to happen ? >>> >> >>> >> If the release v0.94.6 is far away, we might have to build custom >>> >> packages for Ubuntu and we really would like to avoid that. >>> >> Any input ? >>> >> Anyone else sharing the same bug ? >>> >> >>> >> thank you >>> >> >>> >> Saverio >>> >> ___ >>> >> ceph-users mailing list >>> >> ceph-users@lists.ceph.com >>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>> >> >>> ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cannot reliably create snapshot after freezing QEMU IO
Hello Jason, from this email on ceph-dev http://article.gmane.org/gmane.comp.file-systems.ceph.devel/29692 it looks like 0.94.6 is coming out very soon. We avoid testing the unreleased packaged then and we wait for the official release. thank you Saverio 2016-02-19 18:53 GMT+01:00 Jason Dillaman <dilla...@redhat.com>: > Correct -- a v0.94.6 tag on the hammer branch won't be created until the > release. > > -- > > Jason Dillaman > > > - Original Message - >> From: "Saverio Proto" <ziopr...@gmail.com> >> To: "Jason Dillaman" <dilla...@redhat.com> >> Cc: ceph-users@lists.ceph.com >> Sent: Friday, February 19, 2016 11:38:08 AM >> Subject: Re: [ceph-users] Cannot reliably create snapshot after freezing >> QEMU IO >> >> Hello, >> >> thanks for the pointer. Just to make sure, for dev/QE hammer release, >> do you mean the "hammer" branch ? So following the documentation, >> because I use Ubuntu Trusty, this should be the repository right ? >> >> deb http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/ref/hammer >> trusty main >> >> thanks >> >> Saverio >> >> >> >> >> 2016-02-19 16:41 GMT+01:00 Jason Dillaman <dilla...@redhat.com>: >> > I believe 0.94.6 is still in testing because of a possible MDS issue [1]. >> > You can download the interim dev/QE hammer release by following the >> > instructions here [2] if you are in a hurry. You would only need to >> > upgrade librbd1 (and its dependencies) to pick up the fix. When you do >> > upgrade (either with the interim or the official release), I would >> > appreciate it if you could update the ticket to let me know if it resolved >> > your issue. 
>> > >> > [1] http://tracker.ceph.com/issues/13356 >> > [2] http://docs.ceph.com/docs/master/install/get-packages/ >> > >> > -- >> > >> > Jason Dillaman >> > >> > >> > - Original Message - >> >> From: "Saverio Proto" <ziopr...@gmail.com> >> >> To: ceph-users@lists.ceph.com >> >> Sent: Friday, February 19, 2016 10:11:01 AM >> >> Subject: [ceph-users] Cannot reliably create snapshot after freezing QEMU >> >> IO >> >> >> >> Hello, >> >> >> >> we are hitting here Bug #14373 in our production cluster >> >> http://tracker.ceph.com/issues/14373 >> >> >> >> Since we introduced the object map feature in our cinder rbd volumes, >> >> we are not able to make snapshot the volumes, unless they pause the >> >> VMs. >> >> >> >> We are running the latest Hammer and so we are really looking forward >> >> release v0.94.6 >> >> >> >> Does anyone know when the release is going to happen ? >> >> >> >> If the release v0.94.6 is far away, we might have to build custom >> >> packages for Ubuntu and we really would like to avoid that. >> >> Any input ? >> >> Anyone else sharing the same bug ? >> >> >> >> thank you >> >> >> >> Saverio >> >> ___ >> >> ceph-users mailing list >> >> ceph-users@lists.ceph.com >> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> >> >> ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cannot reliably create snapshot after freezing QEMU IO
Hello, thanks for the pointer. Just to make sure, for dev/QE hammer release, do you mean the "hammer" branch ? So following the documentation, because I use Ubuntu Trusty, this should be the repository right ? deb http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/ref/hammer trusty main thanks Saverio 2016-02-19 16:41 GMT+01:00 Jason Dillaman <dilla...@redhat.com>: > I believe 0.94.6 is still in testing because of a possible MDS issue [1]. > You can download the interim dev/QE hammer release by following the > instructions here [2] if you are in a hurry. You would only need to upgrade > librbd1 (and its dependencies) to pick up the fix. When you do upgrade > (either with the interim or the official release), I would appreciate it if > you could update the ticket to let me know if it resolved your issue. > > [1] http://tracker.ceph.com/issues/13356 > [2] http://docs.ceph.com/docs/master/install/get-packages/ > > -- > > Jason Dillaman > > > - Original Message - >> From: "Saverio Proto" <ziopr...@gmail.com> >> To: ceph-users@lists.ceph.com >> Sent: Friday, February 19, 2016 10:11:01 AM >> Subject: [ceph-users] Cannot reliably create snapshot after freezing QEMU IO >> >> Hello, >> >> we are hitting here Bug #14373 in our production cluster >> http://tracker.ceph.com/issues/14373 >> >> Since we introduced the object map feature in our cinder rbd volumes, >> we are not able to make snapshot the volumes, unless they pause the >> VMs. >> >> We are running the latest Hammer and so we are really looking forward >> release v0.94.6 >> >> Does anyone know when the release is going to happen ? >> >> If the release v0.94.6 is far away, we might have to build custom >> packages for Ubuntu and we really would like to avoid that. >> Any input ? >> Anyone else sharing the same bug ? 
>> >> thank you >> >> Saverio >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
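For readers on Trusty following this thread: the gitbuilder repository line discussed above can be added to apt and used to upgrade just the RBD client, roughly as sketched below. The sources file name and the exact package selection are my assumptions, not from the thread:

```
# /etc/apt/sources.list.d/ceph-hammer-gitbuilder.list  (file name is illustrative)
deb http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/ref/hammer trusty main

# then pick up only the client-side fix, as Jason suggests:
apt-get update
apt-get install --only-upgrade librbd1 librados2
```

This is a sketch of the procedure, not an officially supported upgrade path.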
[ceph-users] Cannot reliably create snapshot after freezing QEMU IO
Hello, we are hitting Bug #14373 here in our production cluster http://tracker.ceph.com/issues/14373 Since we introduced the object map feature in our cinder rbd volumes, we are not able to snapshot the volumes unless we pause the VMs. We are running the latest Hammer, so we are really looking forward to release v0.94.6 Does anyone know when the release is going to happen ? If release v0.94.6 is far away, we might have to build custom packages for Ubuntu, and we would really like to avoid that. Any input ? Anyone else seeing the same bug ? thank you Saverio ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Increasing time to save RGW objects
What kind of authentication do you use against the Rados Gateway ? We had a similar problem authenticating against our Keystone server. If the Keystone server is overloaded, the time to read/write RGW objects increases. You will not see anything wrong on the ceph side. Saverio 2016-02-08 17:49 GMT+01:00 Kris Jurka: > > I've been testing the performance of ceph by storing objects through RGW. > This is on Debian with Hammer using 40 magnetic OSDs, 5 mons, and 4 RGW > instances. Initially the storage time was holding reasonably steady, but it > has started to rise recently as shown in the attached chart. > > The test repeatedly saves 100k objects of 55 kB size using multiple threads > (50) against multiple RGW gateways (4). It uses a sequential identifier as > the object key and shards the bucket name using id % 100. The buckets have > index sharding enabled with 64 index shards per bucket. > > ceph status doesn't appear to show any issues. Is there something I should > be looking at here? > > > # ceph status > cluster 3fc86d01-cf9c-4bed-b130-7a53d7997964 > health HEALTH_OK > monmap e2: 5 mons at > {condor=192.168.188.90:6789/0,duck=192.168.188.140:6789/0,eagle=192.168.188.100:6789/0,falcon=192.168.188.110:6789/0,shark=192.168.188.118:6789/0} > election epoch 18, quorum 0,1,2,3,4 > condor,eagle,falcon,shark,duck > osdmap e674: 40 osds: 40 up, 40 in > pgmap v258756: 3128 pgs, 10 pools, 1392 GB data, 27282 kobjects > 4784 GB used, 69499 GB / 74284 GB avail > 3128 active+clean > client io 268 kB/s rd, 1100 kB/s wr, 493 op/s > > > Kris Jurka > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
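For reference, the key/bucket scheme Kris describes (sequential object ids, bucket name sharded by id % 100) can be sketched as below; the names are illustrative, not taken from his test harness:

```python
def bucket_for(obj_id: int, num_buckets: int = 100) -> str:
    # shard the bucket name using id % 100, as in the test description
    return f"testbucket-{obj_id % num_buckets}"

def object_key(obj_id: int) -> str:
    # sequential identifier used as the object key
    return f"obj-{obj_id:012d}"

# object 205 goes to bucket "testbucket-5" under key "obj-000000000205"
print(bucket_for(205), object_key(205))
```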
[ceph-users] What are linger_ops in the output of objecter_requests ?
Hello, while debugging slow-request behaviour of our Rados Gateway, I ran into this linger_ops field and I cannot understand its meaning. I would expect to find slow requests stuck in the "ops" field. Actually most of the time I have "ops": [], and it looks like ops gets empty very quickly. However linger_ops is populated, and it is always the same requests; it looks like those are there forever. Any explanation about what linger_ops are ? thanks ! Saverio r...@os.zhdk.cloud /home/proto ; ceph daemon /var/run/ceph/ceph-radosgw.gateway.asok objecter_requests { "ops": [], "linger_ops": [ { "linger_id": 8, "pg": "10.84ada7c9", "osd": 9, "object_id": "notify.7", "object_locator": "@10", "target_object_id": "notify.7", "target_object_locator": "@10", "paused": 0, "used_replica": 0, "precalc_pgid": 0, "snapid": "head", "registered": "1" }, { "linger_id": 2, "pg": "10.16dafda0", "osd": 27, "object_id": "notify.1", "object_locator": "@10", "target_object_id": "notify.1", "target_object_locator": "@10", "paused": 0, "used_replica": 0, "precalc_pgid": 0, "snapid": "head", "registered": "1" }, { "linger_id": 6, "pg": "10.31099063", "osd": 52, "object_id": "notify.5", "object_locator": "@10", "target_object_id": "notify.5", "target_object_locator": "@10", "paused": 0, "used_replica": 0, "precalc_pgid": 0, "snapid": "head", "registered": "1" }, { "linger_id": 3, "pg": "10.88aa5c95", "osd": 66, "object_id": "notify.2", "object_locator": "@10", "target_object_id": "notify.2", "target_object_locator": "@10", "paused": 0, "used_replica": 0, "precalc_pgid": 0, "snapid": "head", "registered": "1" }, { "linger_id": 5, "pg": "10.a204812d", "osd": 66, "object_id": "notify.4", "object_locator": "@10", "target_object_id": "notify.4", "target_object_locator": "@10", "paused": 0, "used_replica": 0, "precalc_pgid": 0, "snapid": "head", "registered": "1" }, { "linger_id": 4, "pg": "10.f8c99aee", "osd": 68, "object_id": "notify.3", "object_locator": "@10", "target_object_id": "notify.3",
"target_object_locator": "@10", "paused": 0, "used_replica": 0, "precalc_pgid": 0, "snapid": "head", "registered": "1" }, { "linger_id": 1, "pg": "10.4322fa9f", "osd": 82, "object_id": "notify.0", "object_locator": "@10", "target_object_id": "notify.0", "target_object_locator": "@10", "paused": 0, "used_replica": 0, "precalc_pgid": 0, "snapid": "head", "registered": "1" }, { "linger_id": 7, "pg": "10.97c520d4", "osd": 103, "object_id": "notify.6", "object_locator": "@10", "target_object_id": "notify.6", "target_object_locator": "@10", "paused": 0, "used_replica": 0, "precalc_pgid": 0, "snapid": "head", "registered": "1" } ], "pool_ops": [], "pool_stat_ops": [], "statfs_ops": [], "command_ops": [] } ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] radosgw secret_key
Look at this: https://github.com/ncw/rclone/issues/47 Because this is a JSON dump, it is encoding the / as \/. It was a source of confusion for me too. Best regards Saverio 2015-08-24 16:58 GMT+02:00 Luis Periquito: > When I create a new user using radosgw-admin most of the time the secret key > gets escaped with a backslash, making it not work. Something like > "secret_key": "xx\/\/". > > Why would the "/" need to be escaped? Why is it printing the "\/" instead of > "/" that does work? > > Usually I just remove the backslash and it works fine. I've seen this on > several different clusters. > > Is it just me? > > This may require opening a bug in the tracking tool, but just asking here > first. > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
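Saverio's point can be verified directly: "\/" is a legal JSON escape sequence for "/", so any JSON decoder recovers the usable key without hand-editing. A minimal illustration (the key material is made up):

```python
import json

# Some JSON encoders escape "/" as "\/" when printing; decoding the dump
# restores the real key, so no manual backslash removal is needed.
dump = '{"secret_key": "Abc\\/123\\/xyz"}'  # as a JSON dump might print it
secret = json.loads(dump)["secret_key"]
print(secret)  # Abc/123/xyz
```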
Re: [ceph-users] How to use cgroup to bind ceph-osd to a specific cpu core?
Hello Jan, I am testing your scripts, because we want also to test OSDs and VMs on the same server. I am new to cgroups, so this might be a very newbie question. In your script you always reference to the file /cgroup/cpuset/libvirt/cpuset.cpus but I have the file in /sys/fs/cgroup/cpuset/libvirt/cpuset.cpus I am working on Ubuntu 14.04 This difference comes from something special in your setup, or maybe because we are working on different Linux distributions ? Thanks for clarification. Saverio 2015-06-30 17:50 GMT+02:00 Jan Schermer j...@schermer.cz: Hi all, our script is available on GitHub https://github.com/prozeta/pincpus I haven’t had much time to do a proper README, but I hope the configuration is self explanatory enough for now. What it does is pin each OSD into the most “empty” cgroup assigned to a NUMA node. Let me know how it works for you! Jan On 30 Jun 2015, at 10:50, Huang Zhiteng winsto...@gmail.com wrote: On Tue, Jun 30, 2015 at 4:25 PM, Jan Schermer j...@schermer.cz wrote: Not having OSDs and KVMs compete against each other is one thing. But there are more reasons to do this 1) not moving the processes and threads between cores that much (better cache utilization) 2) aligning the processes with memory on NUMA systems (that means all modern dual socket systems) - you don’t want your OSD running on CPU1 with memory allocated to CPU2 3) the same goes for other resources like NICs or storage controllers - but that’s less important and not always practical to do 4) you can limit the scheduling domain on linux if you limit the cpuset for your OSDs (I’m not sure how important this is, just best practice) 5) you can easily limit memory or CPU usage, set priority, with much greater granularity than without cgroups 6) if you have HyperThreading enabled you get the most gain when the workloads on the threads are dissimiliar - so to have the higher throughput you have to pin OSD to thread1 and KVM to thread2 on the same core. 
We’re not doing that because latency and performance of the core can vary depending on what the other thread is doing. But it might be useful to someone. Some workloads exhibit 100% performance gain when everything aligns in a NUMA system, compared to a SMP mode on the same hardware. You likely won’t notice it on light workloads, as the interconnects (QPI) are very fast and there’s a lot of bandwidth, but for stuff like big OLAP databases or other data-manipulation workloads there’s a huge difference. And with CEPH being CPU hungy and memory intensive, we’re seeing some big gains here just by co-locating the memory with the processes…. Could you elaborate a it on this? I'm interested to learn in what situation memory locality helps Ceph to what extend. Jan On 30 Jun 2015, at 08:12, Ray Sun xiaoq...@gmail.com wrote: Sound great, any update please let me know. Best Regards -- Ray On Tue, Jun 30, 2015 at 1:46 AM, Jan Schermer j...@schermer.cz wrote: I promised you all our scripts for automatic cgroup assignment - they are in our production already and I just need to put them on github, stay tuned tomorrow :-) Jan On 29 Jun 2015, at 19:41, Somnath Roy somnath@sandisk.com wrote: Presently, you have to do it by using tool like ‘taskset’ or ‘numactl’… Thanks Regards Somnath From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Ray Sun Sent: Monday, June 29, 2015 9:19 AM To: ceph-users@lists.ceph.com Subject: [ceph-users] How to use cgroup to bind ceph-osd to a specific cpu core? Cephers, I want to bind each of my ceph-osd to a specific cpu core, but I didn't find any document to explain that, could any one can provide me some detailed information. Thanks. 
Currently, my ceph is running like this: root 28692 1 0 Jun23 ? 00:37:26 /usr/bin/ceph-mon -i seed.econe.com --pid-file /var/run/ceph/mon.seed.econe.com.pid -c /etc/ceph/ceph.conf --cluster ceph root 40063 1 1 Jun23 ? 02:13:31 /usr/bin/ceph-osd -i 0 --pid-file /var/run/ceph/osd.0.pid -c /etc/ceph/ceph.conf --cluster ceph root 42096 1 0 Jun23 ? 01:33:42 /usr/bin/ceph-osd -i 1 --pid-file /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.conf --cluster ceph root 43263 1 0 Jun23 ? 01:22:59 /usr/bin/ceph-osd -i 2 --pid-file /var/run/ceph/osd.2.pid -c /etc/ceph/ceph.conf --cluster ceph root 44527 1 0 Jun23 ? 01:16:53 /usr/bin/ceph-osd -i 3 --pid-file /var/run/ceph/osd.3.pid -c /etc/ceph/ceph.conf --cluster ceph root 45863 1 0 Jun23 ? 01:25:18 /usr/bin/ceph-osd -i 4 --pid-file /var/run/ceph/osd.4.pid -c /etc/ceph/ceph.conf --cluster ceph root 47462 1 0 Jun23 ? 01:20:36 /usr/bin/ceph-osd -i 5 --pid-file /var/run/ceph/osd.5.pid -c /etc/ceph/ceph.conf --cluster ceph Best Regards -- Ray
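The core idea of the pincpus approach discussed in this thread (pick the "most empty" NUMA-node cgroup for each OSD, then write the OSD's PID into that cgroup's tasks file) can be sketched as below. This is my own simplification, not Jan's actual script; as Saverio notes, the cpuset mount point varies by distribution (/sys/fs/cgroup/cpuset/... on Ubuntu 14.04):

```python
def assign_osds_to_numa_nodes(osd_ids, numa_nodes):
    """Greedy balance: each OSD goes to the NUMA node that currently
    holds the fewest OSDs (a stand-in for the 'most empty' cgroup)."""
    placement = {node: [] for node in numa_nodes}
    for osd in osd_ids:
        emptiest = min(placement, key=lambda n: len(placement[n]))
        placement[emptiest].append(osd)
    return placement

# 6 OSDs on a dual-socket box: 3 OSDs per NUMA node
print(assign_osds_to_numa_nodes([0, 1, 2, 3, 4, 5], ["node0", "node1"]))
# A real script would then set cpuset.cpus/cpuset.mems for each group and
# write each OSD PID into /sys/fs/cgroup/cpuset/<group>/tasks.
```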
Re: [ceph-users] Unexpected issues with simulated 'rack' outage
Hello Romero, I am still a beginner with Ceph, but as far as I understand, Ceph is not designed to lose 33% of the cluster at once and recover rapidly. What I understand is that you are losing 33% of the cluster by losing 1 rack out of 3. It will take a very long time to recover before you have HEALTH_OK status. Can you check with ceph -w how long it takes for ceph to converge to a healthy cluster after you switch off the switch in Rack-A ? Saverio 2015-06-24 14:44 GMT+02:00 Romero Junior r.jun...@global.leaseweb.com: Hi, We are setting up a test environment using Ceph as the main storage solution for my QEMU-KVM virtualization platform, and everything works fine except for the following: When I simulate a failure by powering off the switches on one of our three racks my virtual machines get into a weird state, the illustration might help you to fully understand what is going on: http://i.imgur.com/clBApzK.jpg The PGs are distributed based on racks; the crush rules are not the default ones. The number of PGs is the following: root@srv003:~# ceph osd pool ls detail pool 11 'libvirt-pool' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 16000 pgp_num 16000 last_change 14544 flags hashpspool stripe_width 0 The qemu talks directly to Ceph through librbd; the disk is configured as the following: <disk type='network' device='disk'> <driver name='qemu' type='raw' cache='writeback'/> <auth username='libvirt'> <secret type='ceph' uuid='0d32bxxxyyyzzz47073a965'/> </auth> <source protocol='rbd' name='libvirt-pool/ceph-vm-automated'> <host name='10.XX.YY.1' port='6789'/> <host name='10.XX.YY.2' port='6789'/> <host name='10.XX.YY.2' port='6789'/> </source> <target dev='vda' bus='virtio'/> <alias name='virtio-disk25'/> <address type='pci' domain='0x' bus='0x00' slot='0x04' function='0x0'/> </disk> As mentioned, it's not a real read-only state, I can touch files and even login on the affected virtual machines (by the way, all are affected) however, a simple 'dd' (count=10 bs=1MB
conv=fdatasync) hangs forever. If a 3 GB file download starts (via wget/curl), it usually crashes after the first few hundred megabytes and it resumes as soon as I power on the “failed” rack. Everything goes back to normal as soon as the rack is powered on again. For reference, each rack contains 33 nodes, each node contain 3 OSDs (1.5 TB each). On the virtual machine, after recovering the rack, I can see the following messages on /var/log/kern.log: [163800.444146] INFO: task jbd2/vda1-8:135 blocked for more than 120 seconds. [163800.444260] Not tainted 3.13.0-55-generic #94-Ubuntu [163800.444295] echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. [163800.444346] jbd2/vda1-8 D 88007fd13180 0 135 2 0x [163800.444354] 880036d3bbd8 0046 880036a4b000 880036d3bfd8 [163800.444386] 00013180 00013180 880036a4b000 88007fd13a18 [163800.444390] 88007ffc69d0 0002 811efa80 880036d3bc50 [163800.444396] Call Trace: [163800.20] [811efa80] ? generic_block_bmap+0x50/0x50 [163800.26] [817279bd] io_schedule+0x9d/0x140 [163800.32] [811efa8e] sleep_on_buffer+0xe/0x20 [163800.37] [81727e42] __wait_on_bit+0x62/0x90 [163800.42] [811efa80] ? generic_block_bmap+0x50/0x50 [163800.47] [81727ee7] out_of_line_wait_on_bit+0x77/0x90 [163800.55] [810ab300] ? autoremove_wake_function+0x40/0x40 [163800.61] [811f0dba] __wait_on_buffer+0x2a/0x30 [163800.70] [8128be4d] jbd2_journal_commit_transaction+0x185d/0x1ab0 [163800.77] [8107562f] ? try_to_del_timer_sync+0x4f/0x70 [163800.84] [8129017d] kjournald2+0xbd/0x250 [163800.90] [810ab2c0] ? prepare_to_wait_event+0x100/0x100 [163800.96] [812900c0] ? commit_timeout+0x10/0x10 [163800.444502] [8108b702] kthread+0xd2/0xf0 [163800.444507] [8108b630] ? kthread_create_on_node+0x1c0/0x1c0 [163800.444513] [81733ca8] ret_from_fork+0x58/0x90 [163800.444517] [8108b630] ? 
kthread_create_on_node+0x1c0/0x1c0 A few theories for this behavior were mention on #Ceph (OFTC): [14:09] Be-El RomeroJnr: i think the problem is the fact that you write to parts of the rbd that have not been accessed before [14:09] Be-El RomeroJnr: ceph does thin provisioning; each rbd is striped into chunks of 4 mb. each stripe is put into one pgs [14:10] Be-El RomeroJnr: if you access formerly unaccessed parts of the rbd, a new stripe is created. and this probably fails if one of the racks is down [14:10] Be-El RomeroJnr: but that's just a theory...maybe some developer can comment on this later
Re: [ceph-users] Unexpected issues with simulated 'rack' outage
You don't have to wait, but the recovery process will be very heavy and it will have an impact on performance. The impact could be catastrophic, as you are experiencing. After removing 1 rack, the CRUSH algorithm will run again on the available resources and will map the PGs to the available OSDs. You lost 33% of OSDs, so it will be a big change. This means that you will not only have to re-create the copies that were on the OSDs that are out of your cluster, but you also have to move around a lot of objects that are now misplaced. It would also be nice to see your crushmap, because you are not using the default. A conceptual bug in the crushmap could leave the cluster in a degraded state forever. For example, if you wrote a crushmap to place copies only on different racks, and you want 3 copies with 2 racks available, this is a possible conceptual bug. Saverio 2015-06-24 15:11 GMT+02:00 Romero Junior r.jun...@global.leaseweb.com: If I have a replica of each object on the other racks why should I have to wait for any recovery time? The failure should not impact my virtual machines. *From:* Saverio Proto [mailto:ziopr...@gmail.com] *Sent:* woensdag, 24 juni, 2015 14:54 *To:* Romero Junior *Cc:* ceph-users@lists.ceph.com *Subject:* Re: [ceph-users] Unexpected issues with simulated 'rack' outage Hello Romero, I am still a beginner with Ceph, but as far as I understand, Ceph is not designed to lose 33% of the cluster at once and recover rapidly. What I understand is that you are losing 33% of the cluster by losing 1 rack out of 3. It will take a very long time to recover before you have HEALTH_OK status. Can you check with ceph -w how long it takes for ceph to converge to a healthy cluster after you switch off the switch in Rack-A ?
Saverio 2015-06-24 14:44 GMT+02:00 Romero Junior r.jun...@global.leaseweb.com: Hi, We are setting up a test environment using Ceph as the main storage solution for my QEMU-KVM virtualization platform, and everything works fine except for the following: When I simulate a failure by powering off the switches on one of our three racks my virtual machines get into a weird state, the illustration might help you to fully understand what is going on: http://i.imgur.com/clBApzK.jpg The PGs are distributed based on racks, there are not default crush rules. The number of PGs is the following: root@srv003:~# ceph osd pool ls detail pool 11 'libvirt-pool' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 16000 pgp_num 16000 last_change 14544 flags hashpspool stripe_width 0 The qemu talks directly to Ceph through librdb, the disk is configured as the following: disk type='network' device='disk' driver name='qemu' type='raw' cache='writeback'/ auth username='libvirt' secret type='ceph' uuid='0d32bxxxyyyzzz47073a965'/ /auth source protocol='rbd' name='libvirt-pool/ceph-vm-automated' host name='10.XX.YY.1' port='6789'/ host name='10.XX.YY.2' port='6789'/ host name='10.XX.YY.2' port='6789'/ /source target dev='vda' bus='virtio'/ alias name='virtio-disk25'/ address type='pci' domain='0x' bus='0x00' slot='0x04' function='0x0'/ /disk As mentioned, it's not a real read-only state, I can touch files and even login on the affected virtual machines (by the way, all are affected) however, a simple 'dd' (count=10 bs=1MB conv=fdatasync) hangs forever. If a 3 GB file download starts (via wget/curl), it usually crashes after the first few hundred megabytes and it resumes as soon as I power on the “failed” rack. Everything goes back to normal as soon as the rack is powered on again. For reference, each rack contains 33 nodes, each node contain 3 OSDs (1.5 TB each). 
On the virtual machine, after recovering the rack, I can see the following messages on /var/log/kern.log: [163800.444146] INFO: task jbd2/vda1-8:135 blocked for more than 120 seconds. [163800.444260] Not tainted 3.13.0-55-generic #94-Ubuntu [163800.444295] echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. [163800.444346] jbd2/vda1-8 D 88007fd13180 0 135 2 0x [163800.444354] 880036d3bbd8 0046 880036a4b000 880036d3bfd8 [163800.444386] 00013180 00013180 880036a4b000 88007fd13a18 [163800.444390] 88007ffc69d0 0002 811efa80 880036d3bc50 [163800.444396] Call Trace: [163800.20] [811efa80] ? generic_block_bmap+0x50/0x50 [163800.26] [817279bd] io_schedule+0x9d/0x140 [163800.32] [811efa8e] sleep_on_buffer+0xe/0x20 [163800.37] [81727e42] __wait_on_bit+0x62/0x90 [163800.42] [811efa80] ? generic_block_bmap+0x50/0x50 [163800.47] [81727ee7] out_of_line_wait_on_bit+0x77/0x90 [163800.55
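On the crushmap question raised in this thread: a rack-level failure domain in a hammer-era crushmap looks roughly like the ruleset below (names and the ruleset id are placeholders, not from Romero's map). With a rule like this, requesting more copies than there are reachable racks is exactly the kind of conceptual bug Saverio warns about:

```
rule replicated_over_racks {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type rack
    step emit
}
```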
Re: [ceph-users] xfs corruption, data disaster!
Hello, I don't get it. You lost just 6 osds out of 145 and your cluster is not able to recover ? What is the status of ceph -s ? Saverio 2015-05-04 9:00 GMT+02:00 Yujian Peng pengyujian5201...@126.com: Hi, I'm encountering a data disaster. I have a ceph cluster with 145 osds. The data center had a power problem yesterday, and all of the ceph nodes were down. But now I find that 6 disks (xfs) in 4 nodes have data corruption. Some disks are unable to mount, and some disks have IO errors in syslog. mount: Structure needs cleaning xfs_log_force: error 5 returned I tried to repair one with xfs_repair -L /dev/sdx1, but the ceph-osd reported a leveldb error: Error initializing leveldb: Corruption: checksum mismatch I cannot start the 6 osds and 22 pgs are down. This is really a tragedy for me. Can you give me some idea how to recover the xfs? Thanks very much! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] xfs corruption, data disaster!
OK, I see the problem. Thanks for the explanation. However he talks about 4 hosts. So with the default CRUSHMAP losing 1 or more OSDs on the same host is irrelevant. The real problem is that he lost 4 OSDs on different hosts with pools of size 3, so he lost the PGs that were mapped to 3 failing drives. So he lost 22 pgs. But I guess the cluster has thousands of pgs, so the actual data loss is small. Is that correct ? thanks Saverio 2015-05-07 4:16 GMT+02:00 Christian Balzer ch...@gol.com: Hello, On Thu, 7 May 2015 00:34:58 +0200 Saverio Proto wrote: Hello, I don't get it. You lost just 6 osds out of 145 and your cluster is not able to recover ? He lost 6 OSDs at the same time. With 145 OSDs and standard replication of 3, losing 3 OSDs makes data loss already extremely likely; with 6 OSDs gone it is approaching certainty levels. Christian What is the status of ceph -s ? Saverio 2015-05-04 9:00 GMT+02:00 Yujian Peng pengyujian5201...@126.com: Hi, I'm encountering a data disaster. I have a ceph cluster with 145 osds. The data center had a power problem yesterday, and all of the ceph nodes were down. But now I find that 6 disks (xfs) in 4 nodes have data corruption. Some disks are unable to mount, and some disks have IO errors in syslog. mount: Structure needs cleaning xfs_log_force: error 5 returned I tried to repair one with xfs_repair -L /dev/sdx1, but the ceph-osd reported a leveldb error: Error initializing leveldb: Corruption: checksum mismatch I cannot start the 6 osds and 22 pgs are down. This is really a tragedy for me. Can you give me some idea how to recover the xfs? Thanks very much!
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Christian Balzer Network/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
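Christian's intuition can be made concrete with a rough model: under uniform random placement (which ignores CRUSH's host/rack separation, so treat the numbers as indicative only), the chance that at least one PG loses all of its replicas grows quickly with the number of simultaneous OSD failures:

```python
from math import comb

def pg_loss_odds(total_osds: int, failed_osds: int,
                 num_pgs: int, size: int = 3) -> float:
    """Probability that at least one PG had all `size` replicas on the
    failed OSDs, assuming replica sets are drawn uniformly at random.
    This ignores CRUSH failure-domain separation: a rough indication only."""
    p_single = comb(failed_osds, size) / comb(total_osds, size)
    return 1 - (1 - p_single) ** num_pgs

# 145 OSDs, 6 failed at once, a cluster with ~20000 PGs
print(pg_loss_odds(145, 6, 20000))
```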
Re: [ceph-users] Ceph migration to AWS
Why don't you use AWS S3 directly then ? Saverio 2015-04-24 17:14 GMT+02:00 Mike Travis mike.r.tra...@gmail.com: To those interested in a tricky problem, We have a Ceph cluster running at one of our data centers. One of our client's requirements is to be hosted at AWS. My question is: How do we effectively migrate our data on our internal Ceph cluster to an AWS Ceph cluster? Ideas currently on the table: 1. Build OSDs at AWS and add them to our current Ceph cluster. Build quorum at AWS, then sever the connection between AWS and our data center. 2. Build a Ceph cluster at AWS and send snapshots from our data center to our AWS cluster, allowing us to migrate to AWS. Is this a good idea? Suggestions? Has anyone done something like this before? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] All pools have size=3 but MB data and MB used ratio is 1 to 5
Do you by any chance have your OSDs placed at a local directory path rather than on an otherwise unutilized physical disk? No, I have 18 disks per server. Each OSD is mapped to a physical disk. Here is the output of one server: ansible@zrh-srv-m-cph02:~$ df -h Filesystem Size Used Avail Use% Mounted on /dev/mapper/vg01-root 28G 4.5G 22G 18% / none 4.0K 0 4.0K 0% /sys/fs/cgroup udev 48G 4.0K 48G 1% /dev tmpfs 9.5G 1.3M 9.5G 1% /run none 5.0M 0 5.0M 0% /run/lock none 48G 20K 48G 1% /run/shm none 100M 0 100M 0% /run/user /dev/mapper/vg01-tmp 4.5G 9.4M 4.3G 1% /tmp /dev/mapper/vg01-varlog 9.1G 5.1G 3.6G 59% /var/log /dev/sdf1 932G 15G 917G 2% /var/lib/ceph/osd/ceph-3 /dev/sdg1 932G 15G 917G 2% /var/lib/ceph/osd/ceph-4 /dev/sdl1 932G 13G 919G 2% /var/lib/ceph/osd/ceph-8 /dev/sdo1 932G 15G 917G 2% /var/lib/ceph/osd/ceph-11 /dev/sde1 932G 15G 917G 2% /var/lib/ceph/osd/ceph-2 /dev/sdd1 932G 15G 917G 2% /var/lib/ceph/osd/ceph-1 /dev/sdt1 932G 15G 917G 2% /var/lib/ceph/osd/ceph-15 /dev/sdq1 932G 12G 920G 2% /var/lib/ceph/osd/ceph-12 /dev/sdc1 932G 14G 918G 2% /var/lib/ceph/osd/ceph-0 /dev/sds1 932G 17G 916G 2% /var/lib/ceph/osd/ceph-14 /dev/sdu1 932G 14G 918G 2% /var/lib/ceph/osd/ceph-16 /dev/sdm1 932G 15G 917G 2% /var/lib/ceph/osd/ceph-9 /dev/sdk1 932G 17G 915G 2% /var/lib/ceph/osd/ceph-7 /dev/sdn1 932G 14G 918G 2% /var/lib/ceph/osd/ceph-10 /dev/sdr1 932G 15G 917G 2% /var/lib/ceph/osd/ceph-13 /dev/sdv1 932G 14G 918G 2% /var/lib/ceph/osd/ceph-17 /dev/sdh1 932G 17G 916G 2% /var/lib/ceph/osd/ceph-5 /dev/sdj1 932G 14G 918G 2% /var/lib/ceph/osd/ceph-30 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] advantages of multiple pools?
For example you can assign different read/write permissions and different keyrings to different pools. 2015-04-17 16:00 GMT+02:00 Chad William Seys cws...@physics.wisc.edu: Hi All, What are the advantages of having multiple ceph pools (if they use the whole cluster)? Thanks! C. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
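As a concrete example of per-pool permissions, a cephx capability can be scoped to a single pool; the client name and pool name below are placeholders:

```
ceph auth get-or-create client.glance \
    mon 'allow r' \
    osd 'allow rwx pool=images'
```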
Re: [ceph-users] Binding a pool to certain OSDs
Yes, you can. You have to write your own crushmap. At the end of the crushmap you have rulesets. Write a ruleset that selects only the OSDs you want. Then you have to assign the pool to that ruleset. I have seen examples online from people who wanted some pools only on SSD disks and other pools only on SAS disks. That should not be too far from what you want to achieve. ciao, Saverio 2015-04-13 18:26 GMT+02:00 Giuseppe Civitella giuseppe.civite...@gmail.com: Hi all, I've got a Ceph cluster which serves volumes to a Cinder installation. It runs Emperor. I'd like to be able to replace some of the disks with OPAL disks and create a new pool which uses exclusively the latter kind of disk. I'd like to have a traditional pool and a secure one coexisting on the same ceph host. I'd then use Cinder multi backend feature to serve them. My question is: how is it possible to realize such a setup? How can I bind a pool to certain OSDs? Thanks Giuseppe ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
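The workflow Saverio describes uses the standard crushmap round-trip; the commands below are the usual ones, with the ruleset id being whatever you assign in the edited map:

```
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt: add a root containing only the chosen hosts/OSDs,
# plus a ruleset whose "step take" starts at that root
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new
ceph osd pool set <pool-name> crush_ruleset <ruleset-id>
```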
Re: [ceph-users] All pools have size=3 but MB data and MB used ratio is 1 to 5
2015-03-27 18:27 GMT+01:00 Gregory Farnum g...@gregs42.com: Ceph has per-pg and per-OSD metadata overhead. You currently have 26000 PGs, suitable for use on a cluster of the order of 260 OSDs. You have placed almost 7GB of data into it (21GB replicated) and have about 7GB of additional overhead. You might try putting a suitable amount of data into the cluster before worrying about the ratio of space used to data stored. :) -Greg Hello Greg, I have now put in a suitable amount of data, and it looks like my ratio is still 1 to 5. The folder /var/lib/ceph/osd/ceph-N/current/meta/ did not grow, so it looks like that is not the problem. Do you have any hint on how to troubleshoot this issue ? ansible@zrh-srv-m-cph02:~$ ceph osd pool get .rgw.buckets size size: 3 ansible@zrh-srv-m-cph02:~$ ceph osd pool get .rgw.buckets min_size min_size: 2 ansible@zrh-srv-m-cph02:~$ ceph -w cluster 4179fcec-b336-41a1-a7fd-4a19a75420ea health HEALTH_WARN pool .rgw.buckets has too few pgs monmap e4: 4 mons at {rml-srv-m-cph01=10.120.50.20:6789/0,rml-srv-m-cph02=10.120.50.21:6789/0,rml-srv-m-stk03=10.120.50.32:6789/0,zrh-srv-m-cph02=10.120.50.2:6789/0}, election epoch 668, quorum 0,1,2,3 zrh-srv-m-cph02,rml-srv-m-cph01,rml-srv-m-cph02,rml-srv-m-stk03 osdmap e2170: 54 osds: 54 up, 54 in pgmap v619041: 28684 pgs, 15 pools, 109 GB data, 7358 kobjects 518 GB used, 49756 GB / 50275 GB avail 28684 active+clean ansible@zrh-srv-m-cph02:~$ ceph df GLOBAL: SIZE AVAIL RAW USED %RAW USED 50275G 49756G 518G 1.03 POOLS: NAME ID USED %USED MAX AVAIL OBJECTS rbd 0 155 0 16461G 2 gianfranco 7 156 0 16461G 2 images 8 257M 0 16461G 38 .rgw.root 9 840 0 16461G 3 .rgw.control 10 0 0 16461G 8 .rgw 11 21334 0 16461G 108 .rgw.gc 12 0 0 16461G 32 .users.uid 13 1575 0 16461G 6 .users 14 72 0 16461G 6 .rgw.buckets.index 15 0 0 16461G 30 .users.swift 17 36 0 16461G 3 .rgw.buckets 18 108G 0.22 16461G 7534745 .intent-log 19 0 0 16461G 0 .rgw.buckets.extra 20 0 0 16461G 0 volumes 21 512M 0 16461G 161 ansible@zrh-srv-m-cph02:~$ ___ ceph-users mailing
list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
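The ratio Saverio is describing can be sanity-checked with plain arithmetic, nothing Ceph-specific; a minimal sketch using the figures from the `ceph -w` output above (109 GB data, 518 GB used):

```python
# Sanity-check the raw-used vs. data ratio from the `ceph -w` output quoted
# above. With every pool at size=3 we would expect a ratio near 3.0.
def raw_ratio(data_gb, used_gb):
    """Ratio of raw space consumed to logical data stored."""
    return used_gb / data_gb

ratio = raw_ratio(109, 518)
print(round(ratio, 2))  # -> 4.75, well above what 3x replication alone explains
```

Anything above the replication factor is overhead of some kind (journals, PG/OSD metadata, or unrelated files on the same disks), which is exactly what the rest of this thread is chasing.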
Re: [ceph-users] Binding a pool to certain OSDs
No error message. You just exhaust the RAM and blow up the cluster because of too many PGs. Saverio 2015-04-14 18:52 GMT+02:00 Giuseppe Civitella giuseppe.civite...@gmail.com: Hi Saverio, I first made a test on my staging lab where I have only 4 OSDs. On my mon servers (which run other services) I have 16GB RAM, 15GB used but 5GB cached. On the OSD servers I have 3GB RAM, 3GB used but 2GB cached. ceph -s tells me nothing about PGs; shouldn't I get an error message in its output? Thanks Giuseppe 2015-04-14 18:20 GMT+02:00 Saverio Proto ziopr...@gmail.com: You only have 4 OSDs? How much RAM per server? I think you already have too many PGs. Check your RAM usage. Check the Ceph wiki guidelines for dimensioning the correct number of PGs. Remember that every time you create a new pool you add PGs to the system. Saverio 2015-04-14 17:58 GMT+02:00 Giuseppe Civitella giuseppe.civite...@gmail.com: Hi all, I've been following this tutorial to realize my setup: http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/ I got this CRUSH map from my test lab: http://paste.openstack.org/show/203887/ then I modified the map and uploaded it.
This is the final version: http://paste.openstack.org/show/203888/ When I applied the new CRUSH map, after some rebalancing, I got this health status:
[- avalon1 root@controller001 Ceph -] # ceph -s
    cluster af09420b-4032-415e-93fc-6b60e9db064e
     health HEALTH_WARN crush map has legacy tunables; mon.controller001 low disk space; clock skew detected on mon.controller002
     monmap e1: 3 mons at {controller001=10.235.24.127:6789/0,controller002=10.235.24.128:6789/0,controller003=10.235.24.129:6789/0}, election epoch 314, quorum 0,1,2 controller001,controller002,controller003
     osdmap e3092: 4 osds: 4 up, 4 in
      pgmap v785873: 576 pgs, 6 pools, 71548 MB data, 18095 objects
            8842 MB used, 271 GB / 279 GB avail
                 576 active+clean
and this osd tree:
[- avalon1 root@controller001 Ceph -] # ceph osd tree
# id    weight  type name       up/down reweight
-8      2       root sed
-5      1               host ceph001-sed
2       1                       osd.2   up      1
-7      1               host ceph002-sed
3       1                       osd.3   up      1
-1      2       root default
-4      1               host ceph001-sata
0       1                       osd.0   up      1
-6      1               host ceph002-sata
1       1                       osd.1   up      1
which doesn't seem like a bad situation. The problem arises when I try to create a new pool: the command ceph osd pool create sed 128 128 gets stuck and never finishes. I also noticed that my Cinder installation is no longer able to create volumes. I've been looking in the logs for errors and found nothing. Any hint about how to proceed to restore my ceph cluster? Is there something wrong with the steps I took to update the CRUSH map? Is the problem related to Emperor? Regards, Giuseppe 2015-04-13 18:26 GMT+02:00 Giuseppe Civitella giuseppe.civite...@gmail.com: Hi all, I've got a Ceph cluster which serves volumes to a Cinder installation. It runs Emperor. I'd like to be able to replace some of the disks with OPAL disks and create a new pool which uses exclusively that kind of disk. I'd like to have a traditional pool and a secure one coexisting on the same ceph host. I'd then use Cinder's multi-backend feature to serve them.
My question is: how is it possible to realize such a setup? How can I bind a pool to certain OSDs? Thanks Giuseppe
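The "too many PGs" warning Saverio gives can be checked against the commonly cited sizing rule of thumb: total PGs of roughly (OSDs x 100) / replica count, rounded up to the next power of two. A sketch of that calculation (the constant 100 is the usual target from the Ceph PG guidance, not a number from this thread):

```python
import math

def suggested_pg_count(num_osds, pgs_per_osd=100, pool_size=3):
    """Rule-of-thumb total PG count for a cluster:
    (OSDs * target PGs per OSD) / replica count,
    rounded up to the next power of two."""
    raw = num_osds * pgs_per_osd / pool_size
    return 2 ** math.ceil(math.log2(raw))

# Giuseppe's lab: 4 OSDs with size=3 pools. His 576 PGs across 6 pools
# is more than double this suggestion.
print(suggested_pg_count(4))  # -> 256
```

Each PG consumes memory and CPU on every OSD that hosts it, which is why an oversized PG count on a 4-OSD lab can exhaust RAM, as described above.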
Re: [ceph-users] All pools have size=3 but MB data and MB used ratio is 1 to 5
I will start now to push a lot of data into the cluster to see if the metadata grows a lot or stays constant. Is there a way to clean up old metadata? I pushed a lot more data into the cluster, then let the cluster sleep for the night. This morning I found these values: 6841 MB data, 25814 MB used, which is a bit more than 1 to 3. It looks like the extra space is in these folders (for N from 1 to 36): /var/lib/ceph/osd/ceph-N/current/meta/ These meta folders have a lot of data in them. I would really be happy to have pointers to understand what is in there and how to clean it up eventually. The problem is that googling for ceph meta or ceph metadata produces results for the Ceph MDS, which is completely unrelated :( thanks Saverio
[ceph-users] All pools have size=3 but MB data and MB used ratio is 1 to 5
Thanks for the answer. Now the meaning of MB data and MB used is clear, and if all the pools have size=3 I expect a ratio of 1 to 3 between the two values. I still can't understand why MB used is so big in my setup. All my pools are size=3, but the ratio of MB data to MB used is 1 to 5 instead of 1 to 3. My first guess was that I had written a wrong crushmap that was making more than 3 copies... (is it really possible to make such a mistake?) So I changed my crushmap and put in the default one, which just spreads data across hosts, but I see no change: the ratio is still 1 to 5. I thought maybe my 3 monitors had different views of the pgmap, so I tried to restart the monitors, but this also did not help. What useful information may I share here to troubleshoot this issue further? ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e) Thank you Saverio 2015-03-25 14:55 GMT+01:00 Gregory Farnum g...@gregs42.com: On Wed, Mar 25, 2015 at 1:24 AM, Saverio Proto ziopr...@gmail.com wrote: Hello there, I started to push data into my ceph cluster. There is something I cannot understand in the output of ceph -w. When I run ceph -w I get this kind of output: 2015-03-25 09:11:36.785909 mon.0 [INF] pgmap v278788: 26056 pgs: 26056 active+clean; 2379 MB data, 19788 MB used, 33497 GB / 33516 GB avail 2379MB is actually the data I pushed into the cluster; I can see it also in the ceph df output, and the numbers are consistent. What I don't understand is 19788MB used. All my pools have size 3, so I expected something like 2379 * 3. Instead this number is very big. I really need to understand how MB used grows because I need to know how many disks to buy. MB used is the summation of (the programmatic equivalent to) df across all your nodes, whereas MB data is calculated by the OSDs based on data they've written down. Depending on your configuration MB used can include things like the OSD journals, or even totally unrelated data if the disks are shared with other applications.
MB used including the space used by the OSD journals is my first guess about what you're seeing here, in which case you'll notice that it won't grow any faster than MB data does once the journal is fully allocated. -Greg
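Greg's point is that "MB used" is aggregated from filesystem statistics (what `df` reports) on every OSD, so journals, metadata, and any unrelated files on the same disks all count toward it. A rough illustration of that aggregation using `os.statvfs`; the data-dir list is a hypothetical layout, not taken from this thread:

```python
import os

def used_bytes(path):
    """Space consumed on the filesystem backing `path`, as `df` would report."""
    st = os.statvfs(path)
    return (st.f_blocks - st.f_bfree) * st.f_frsize

def cluster_mb_used(osd_data_dirs):
    """Sum df-style usage across OSD data dirs. This counts journals,
    metadata and anything else on those filesystems, unlike 'MB data',
    which only counts object payload the OSDs have written."""
    return sum(used_bytes(p) for p in osd_data_dirs) // (1024 * 1024)

# Hypothetical layout: one data dir per OSD on a node.
osd_dirs = ["/var/lib/ceph/osd/ceph-%d" % n for n in range(36)]
```

This also explains why "MB used" stops diverging from "MB data" once the journals are fully allocated: the `df` numbers already include the journal's reserved space.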
Re: [ceph-users] All pools have size=3 but MB data and MB used ratio is 1 to 5
You just need to go look at one of your OSDs and see what data is stored on it. Did you configure things so that the journals are using a file on the same storage disk? If so, *that* is why the data used is large. I followed your suggestion and this is the result of my troubleshooting. Each OSD controls a disk that is mounted in a folder with the name /var/lib/ceph/osd/ceph-N, where N is the OSD number. The journal is stored on another disk drive. I have three extra SSD drives per server, each with 6 partitions, and those partitions are journal partitions. I checked that the setup is correct because each /var/lib/ceph/osd/ceph-N/journal points correctly to another drive. With df -h I see the folders where my OSDs are mounted. The space occupation looks well distributed among all OSDs, as expected. The data is always in a folder called /var/lib/ceph/osd/ceph-N/current. I checked with the tool ncdu where the data is stored inside the current folders. In each OSD there is a folder with a lot of data called /var/lib/ceph/osd/ceph-N/current/meta. If I sum the MB for each meta folder, that is more or less the extra space that is consumed, leading to the 1 to 5 ratio. The meta folder contains a lot of binary files, unreadable, but looking at the file names it looks like it is where the versions of the osdmap are stored. But it is really a lot of metadata. I will start now to push a lot of data into the cluster to see if the metadata grows a lot or stays constant. Is there a way to clean up old metadata? thanks Saverio
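The per-OSD sums Saverio did with ncdu can be scripted to put one number on how much the meta folders account for; a small sketch (the paths assume the 36-OSD layout described above):

```python
import os

def dir_size_mb(path):
    """Total size of all files under `path`, in MB."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total // (1024 * 1024)

# Layout described above: OSDs 1..36, each with a current/meta directory.
meta_dirs = ["/var/lib/ceph/osd/ceph-%d/current/meta" % n for n in range(1, 37)]
meta_total = sum(dir_size_mb(d) for d in meta_dirs if os.path.isdir(d))
print("meta folders:", meta_total, "MB")
```

Comparing that total against the gap between "MB used" and 3x "MB data" confirms (or rules out) the meta folders as the source of the extra space.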
[ceph-users] ceph -w: Understanding MB data versus MB used
Hello there, I started to push data into my ceph cluster. There is something I cannot understand in the output of ceph -w. When I run ceph -w I get this kind of output: 2015-03-25 09:11:36.785909 mon.0 [INF] pgmap v278788: 26056 pgs: 26056 active+clean; 2379 MB data, 19788 MB used, 33497 GB / 33516 GB avail 2379MB is actually the data I pushed into the cluster; I can see it also in the ceph df output, and the numbers are consistent. What I don't understand is 19788MB used. All my pools have size 3, so I expected something like 2379 * 3. Instead this number is very big. I really need to understand how MB used grows because I need to know how many disks to buy. Any hints? thank you. Saverio
Re: [ceph-users] Ceph in Production: best practice to monitor OSD up/down status
Hello, thanks for the answers. This was exactly what I was looking for: mon_osd_down_out_interval = 900 I was not waiting long enough to see my cluster recover by itself. That's why I tried to increase min_size, because I did not understand what min_size was for. Now that I know what min_size is, I guess the best setting for me is min_size = 1, because I would like to be able to make I/O operations even if only 1 copy is left. Thanks to all for helping! Saverio 2015-03-23 14:58 GMT+01:00 Gregory Farnum g...@gregs42.com: On Sun, Mar 22, 2015 at 2:55 AM, Saverio Proto ziopr...@gmail.com wrote: Hello, I started to work with CEPH a few weeks ago. I might ask a very newbie question, but I could not find an answer in the docs or in the ml archive for this. Quick description of my setup: I have a ceph cluster with two servers. Each server has 3 SSD drives I use for journals only. To map SAS disks that keep their journal on the same SSD drive to different failure domains, I wrote my own crushmap. I now have a total of 36 OSDs. Ceph health returns HEALTH_OK. I run the cluster with a couple of pools with size=3 and min_size=3. Production operations questions: I manually stopped some OSDs to simulate a failure. As far as I understood, an OSD down condition is not enough to make CEPH start making new copies of objects. I noticed that I must mark the OSD as out to make ceph produce new copies. As far as I understood, min_size=3 puts the object in read-only mode if there are not at least 3 copies of the object available. That is correct, but the default with size 3 is 2, and you probably want to do that instead. If you have size==min_size on firefly releases and lose an OSD, it can't do recovery, so that PG is stuck without manual intervention. :( This is because of some quirks about how the OSD peering and recovery works, so you'd be forgiven for thinking it would recover nicely.
(This is changed in the upcoming Hammer release, but you probably still want to allow cluster activity when an OSD fails, unless you're very confident in their uptime and more concerned about durability than availability.) -Greg Is this behavior correct, or did I make some mistake creating the cluster? Should I expect ceph to produce a new copy automatically for objects when some OSDs are down? Is there any option to automatically mark out OSDs that go down? thanks Saverio
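The size/min_size behaviour Greg describes can be condensed into a toy decision rule; this is an illustration of the thread's explanation, not Ceph code:

```python
def pg_can_serve_io(replicas_up, min_size):
    """A PG keeps serving I/O only while at least min_size replicas are up;
    below that, I/O blocks until recovery restores enough copies."""
    return replicas_up >= min_size

# size=3, min_size=3 (Saverio's original setting): one OSD down blocks I/O.
print(pg_can_serve_io(2, 3))  # -> False
# size=3, min_size=2 (the default Greg recommends): one OSD down is tolerated.
print(pg_can_serve_io(2, 2))  # -> True
```

This is why size==min_size is fragile on Firefly: the moment one replica is lost, the PG falls below min_size and cannot even run the recovery that would bring it back.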
[ceph-users] Ceph in Production: best practice to monitor OSD up/down status
Hello, I started to work with CEPH a few weeks ago. I might ask a very newbie question, but I could not find an answer in the docs or in the ml archive for this. Quick description of my setup: I have a ceph cluster with two servers. Each server has 3 SSD drives I use for journals only. To map SAS disks that keep their journal on the same SSD drive to different failure domains, I wrote my own crushmap. I now have a total of 36 OSDs. Ceph health returns HEALTH_OK. I run the cluster with a couple of pools with size=3 and min_size=3. Production operations questions: I manually stopped some OSDs to simulate a failure. As far as I understood, an OSD down condition is not enough to make CEPH start making new copies of objects. I noticed that I must mark the OSD as out to make ceph produce new copies. As far as I understood, min_size=3 puts the object in read-only mode if there are not at least 3 copies of the object available. Is this behavior correct, or did I make some mistake creating the cluster? Should I expect ceph to produce a new copy automatically for objects when some OSDs are down? Is there any option to automatically mark out OSDs that go down? thanks Saverio
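The settings this thread converges on can be expressed as a ceph.conf fragment. A minimal sketch with the values mentioned above (900 s before a down OSD is automatically marked out, which triggers re-replication, and pools defaulting to size=3 / min_size=1); these are the thread's values, not recommendations:

```ini
[global]
# Defaults applied to newly created pools. Existing pools keep their
# values; change those with: ceph osd pool set <pool> min_size 1
osd pool default size = 3
osd pool default min_size = 1

[mon]
# Seconds to wait after an OSD is marked down before marking it out
# (900 = 15 minutes); marking out is what starts re-replication.
mon osd down out interval = 900
```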