Re: [ceph-users] how to fix X is an unexpected clone

2018-02-26 Thread Saverio Proto
Hello Stefan,

ceph-object-tool does not exist on my setup. Do you mean the command
/usr/bin/ceph-objectstore-tool that is installed with the ceph-osd package?

I have the following situation here in Ceph Luminous:

2018-02-26 07:15:30.066393 7f0684acb700 -1 log_channel(cluster) log
[ERR] : 5.111f shard 395 missing
5:f88e2b07:::rbd_data.8a09fb8793c74f.6dce:23152
2018-02-26 07:15:30.395189 7f0684acb700 -1 log_channel(cluster) log
[ERR] : deep-scrub 5.111f
5:f88e2b07:::rbd_data.8a09fb8793c74f.6dce:23152 is an
unexpected clone

I did not understand how you actually fixed the problem. Could you
provide more details ?
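
In case it helps the discussion, this is roughly what I am guessing the
procedure looks like with ceph-objectstore-tool (a sketch only; the OSD id
and PG are taken from my log above, the object spec is a placeholder,
please correct me if this is not what you did):

# stop the OSD that holds the problematic PG and list the PG's objects
systemctl stop ceph-osd@395
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-395 --op list --pgid 5.111f

# pass the JSON entry of the offending clone (as printed by --op list) to remove
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-395 '<json-entry-from-list>' remove

# restart the OSD and repair the PG
systemctl start ceph-osd@395
ceph pg repair 5.111f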

thanks

Saverio


On 08.08.17 12:02, Stefan Priebe - Profihost AG wrote:
> Hello Greg,
> 
> Am 08.08.2017 um 11:56 schrieb Gregory Farnum:
>> On Mon, Aug 7, 2017 at 11:55 PM Stefan Priebe - Profihost AG
>> <s.pri...@profihost.ag <mailto:s.pri...@profihost.ag>> wrote:
>>
>> Hello,
>>
>> how can i fix this one:
>>
>> 2017-08-08 08:42:52.265321 osd.20 [ERR] repair 3.61a
>> 3:58654d3d:::rbd_data.106dd406b8b4567.018c:9d455 is an
>> unexpected clone
>> 2017-08-08 08:43:04.914640 mon.0 [INF] HEALTH_ERR; 1 pgs inconsistent; 1
>> pgs repair; 1 scrub errors
>> 2017-08-08 08:43:33.470246 osd.20 [ERR] 3.61a repair 1 errors, 0 fixed
>> 2017-08-08 08:44:04.915148 mon.0 [INF] HEALTH_ERR; 1 pgs inconsistent; 1
>> scrub errors
>>
>> If i just delete manually the relevant files ceph is crashing. rados
>> does not list those at all?
>>
>> How can i fix this?
>>
>>
>> You've sent quite a few emails that have this story spread out, and I
>> think you've tried several different steps to repair it that have been a
>> bit difficult to track.
>>
>> It would be helpful if you could put the whole story in one place and
>> explain very carefully exactly what you saw and how you responded. Stuff
>> like manually copying around the wrong files, or files without a
>> matching object info, could have done some very strange things.
>> Also, basic debugging stuff like what version you're running will help. :)
>>
>> Also note that since you've said elsewhere you don't need this image, I
>> don't think it's going to hurt you to leave it like this for a bit
>> (though it will definitely mess up your monitoring).
>> -Greg
> 
> i'm sorry about that. You're correct.
> 
> I was able to fix this just a few minutes ago by using the
> ceph-object-tool and the remove operation to remove all left over files.
> 
> I did this on all OSDs with the problematic pg. After that ceph was able
> to fix itself.
> 
> A better approach might be that ceph can recover itself from an
> unexpected clone by just deleting it.
> 
> Greets,
> Stefan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
SWITCH
Saverio Proto, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 1573
saverio.pr...@switch.ch, http://www.switch.ch

http://www.switch.ch/stories
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSPF to the host

2016-07-11 Thread Saverio Proto
> I'm looking at the Dell S-ON switches which we can get in a Cumulus
> version. Any pro's and con's of using Cumulus vs old school switch OS's you
> may have come across?

Nothing to report here. Once configured properly, the hardware works
as expected. I have never used Dell; I used switches from Quanta.

Saverio
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RadosGW - Problems running the S3 and SWIFT API at the same time

2016-06-14 Thread Saverio Proto
I am at the Ceph Day at CERN, and I asked Sage whether it is supported
to enable both the S3 and SWIFT APIs at the same time. The answer is
yes, so it is meant to be supported, and what we see here is probably
a bug.

I opened a bug report:
http://tracker.ceph.com/issues/16293

If anyone has a chance to test it on a ceph version newer than Hammer,
you can update the bug :)

thank you

Saverio


2016-05-12 15:49 GMT+02:00 Yehuda Sadeh-Weinraub <yeh...@redhat.com>:
> On Thu, May 12, 2016 at 12:29 AM, Saverio Proto <ziopr...@gmail.com> wrote:
>>> While I'm usually not fond of blaming the client application, this is
>>> really the swift command line tool issue. It tries to be smart by
>>> comparing the md5sum of the object's content with the object's etag,
>>> and it breaks with multipart objects. Multipart objects is calculated
>>> differently (md5sum of the md5sum of each part). I think the swift
>>> tool has a special handling for swift large objects (which are not the
>>> same as s3 multipart objects), so that's why it works in that specific
>>> use case.
>>
>> Well but I tried also with rclone and I have the same issue.
>>
>> Clients I tried
>> rclone (both SWIFT and S3)
>> s3cmd (S3)
>> python-swiftclient (SWIFT).
>>
>> I can reproduce the issue with different clients.
>> Once a multipart object is uploaded via S3 (with rclone or s3cmd) I
>> cannot read it anymore via SWIFT (either with rclone or
>> pythonswift-client).
>>
>> Are you saying that all SWIFT clients implementations are wrong ?
>
> Yes.
>
>>
>> Or should the radosgw be configured with only 1 API active ?
>>
>> Saverio
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] hadoop on cephfs

2016-06-09 Thread Saverio Proto
You can also have Hadoop talking to the Rados Gateway (SWIFT API) so
that the data is in Ceph instead of HDFS.

I wrote this tutorial that might help:
https://github.com/zioproto/hadoop-swift-tutorial
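
For a quick test, once the hadoop-openstack module is on the Hadoop
classpath and the fs.swift.service.* properties in core-site.xml point at
the Rados Gateway, access looks roughly like this (a sketch; container and
service names are examples):

hadoop fs -ls swift://mycontainer.cephrgw/
hadoop distcp hdfs:///user/hadoop/input swift://mycontainer.cephrgw/input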

Saverio


2016-04-30 23:55 GMT+02:00 Adam Tygart :
> Supposedly cephfs-hadoop worked and/or works on hadoop 2. I am in the
> process of getting it working with cdh5.7.0 (based on hadoop 2.6.0).
> I'm under the impression that it is/was working with 2.4.0 at some
> point in time.
>
> At this very moment, I can use all of the DFS tools built into hadoop
> to create, list, delete, rename, and concat files. What I am not able
> to do (currently) is run any jobs.
>
> https://github.com/ceph/cephfs-hadoop
>
> It can be built using current (at least infernalis with my testing)
> cephfs-java and libcephfs. The only thing you'll for sure need to do
> is patch the file referenced here:
> https://github.com/ceph/cephfs-hadoop/issues/25 When building, you'll
> want to tell maven to skip tests (-Dmaven.test.skip=true).
>
> Like I said, I am digging into this still, and I am not entirely
> convinced my issues are ceph related at the moment.
>
> --
> Adam
>
> On Sat, Apr 30, 2016 at 1:51 PM, Erik McCormick
>  wrote:
>> I think what you are thinking of is the driver that was built to actually
>> replace hdfs with rbd. As far as I know that thing had a very short lifespan
>> on one version of hadoop. Very sad.
>>
>> As to what you proposed:
>>
>> 1) Don't use Cephfs in production pre-jewel.
>>
>> 2) running hdfs on top of ceph is a massive waste of disk and fairly
>> pointless as you make replicas of replicas.
>>
>> -Erik
>>
>> On Apr 29, 2016 9:20 PM, "Bill Sharer"  wrote:
>>>
>>> Actually this guy is already a fan of Hadoop.  I was just wondering
>>> whether anyone has been playing around with it on top of cephfs lately.  It
>>> seems like the last round of papers were from around cuttlefish.
>>>
>>> On 04/28/2016 06:21 AM, Oliver Dzombic wrote:

 Hi,

 bad idea :-)

 Its of course nice and important to drag developer towards a
 new/promising technology/software.

 But if the technology under the individual required specifications does
 not match, you will just risk to show this developer how worst this
 new/promising technology is.

 So you will just reach the opposite of what you want.

 So before you are doing something, usually big, like hadoop on an
 unstable software, maybe you should not use it.

 For the good of the developer, for your good and for the good of the
 reputation of the new/promising technology/software you wish.

 To force a pinguin to somehow live in the sahara, might be possible ( at
 least for some time ), but usually not a good idea ;-)

>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSPF to the host

2016-06-09 Thread Saverio Proto
> Has anybody had any experience with running the network routed down all the 
> way to the host?
>

Hello Nick,

Yes, at SWITCH.ch we run OSPF unnumbered on the switches and on the
hosts. Each server has two NICs, and we are able to plug the servers
into any port on the fabric and OSPF will do the magic :) This
simplifies the design when you want to expand the datacenter or when
you want to add more links to existing servers that need more capacity.

Remember to put a higher metric on the ToR-server links, otherwise you
might end up with flows going through the servers, and that is not
what you want.

We use whitebox switches with Cumulus Linux. Back when we started this
project in August 2015, we built Ubuntu packages for the Quagga version
published as open source on the Cumulus Linux GitHub page. It took us
a bit of work to get that Quagga running on Ubuntu, but the support
from Cumulus was great in sorting out the problems.
Our setup is dual-stack IPv4 and IPv6.
On top of that we run Ceph, using IPv6 only for the Ceph traffic.
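
For reference, the host-side configuration is conceptually something like
the sketch below (interface names, the loopback /32 and the cost value are
just examples, not our exact production config). The physical interfaces
borrow the loopback address (unnumbered) and OSPF treats them as
point-to-point links:

! /etc/quagga/ospfd.conf (sketch)
! higher cost on the server uplinks so transit traffic does not cross servers
interface eth0
 ip ospf network point-to-point
 ip ospf cost 100
interface eth1
 ip ospf network point-to-point
 ip ospf cost 100
router ospf
 ospf router-id 10.0.0.11
 passive-interface lo
 network 10.0.0.11/32 area 0.0.0.0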

It looks like we at SWITCH were not the only ones with this idea; you
can find this page, dated March 2016:
https://support.cumulusnetworks.com/hc/en-us/articles/216805858-Routing-on-the-Host-An-Introduction

Cheers,

Saverio
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] The RGW create new bucket instance then delete it at every create bucket OP

2016-05-18 Thread Saverio Proto
Hello,

I am not sure I understood the problem.
Can you post example steps to reproduce it?
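
For example, is it something like the following? This is just my guess at
a reproduction, with an example bucket name:

s3cmd mb s3://testbucket
s3cmd mb s3://testbucket   # create again, same name, same owner
radosgw-admin metadata list bucket.instance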

Also what version of Ceph RGW are you running ?

Saverio


2016-05-18 10:24 GMT+02:00 fangchen sun :
> Dear ALL,
>
> I found a problem that the RGW create a new bucket instance and delete
> the bucket instance at every create bucket OP with same name
>
> http://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html
>
> According to the error code "BucketAlreadyOwnedByYou" from the above
> link, shouldn't the RGW return directly or do nothing when recreate
> the bucket?
> Why do the RGW create a new bucket instance and then delete it?
>
> Thanks for the reply!
> sunfch
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ACL nightmare on RadosGW for 200 TB dataset

2016-05-12 Thread Saverio Proto
> Can't you set the ACL on the object when you put it?

What do you think of this bug ?
https://github.com/s3tools/s3cmd/issues/743

Saverio
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RadosGW - Problems running the S3 and SWIFT API at the same time

2016-05-12 Thread Saverio Proto
> While I'm usually not fond of blaming the client application, this is
> really the swift command line tool issue. It tries to be smart by
> comparing the md5sum of the object's content with the object's etag,
> and it breaks with multipart objects. Multipart objects is calculated
> differently (md5sum of the md5sum of each part). I think the swift
> tool has a special handling for swift large objects (which are not the
> same as s3 multipart objects), so that's why it works in that specific
> use case.

Well, I also tried with rclone and I have the same issue.

Clients I tried:
rclone (both SWIFT and S3)
s3cmd (S3)
python-swiftclient (SWIFT)

I can reproduce the issue with different clients.
Once a multipart object is uploaded via S3 (with rclone or s3cmd), I
can no longer read it via SWIFT (either with rclone or
python-swiftclient).

Are you saying that all SWIFT clients implementations are wrong ?

Or should the radosgw be configured with only 1 API active ?

Saverio
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ACL nightmare on RadosGW for 200 TB dataset

2016-05-12 Thread Saverio Proto
> Can't you set the ACL on the object when you put it?

I could create two tenants: one tenant DATASETADMIN for read/write
access, and a tenant DATASETUSERS for read-only access.

When I load the dataset into the object store, I need an "s3cmd put"
operation and an "s3cmd setacl" operation for each object. It is slow,
but we do this only once. Giving read access will then mean adding the
user to the DATASETUSERS tenant, without touching the ACLs again.
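
In practice the load procedure looks roughly like this for every object
(a sketch; the file name is one object of the dataset and the tenant UUID
is a placeholder):

s3cmd put googlebooks-fre-all-2gram-20120701-ts.gz s3://googlebooks-ngrams-gz/
s3cmd setacl --acl-grant=read:<DATASETUSERS-tenant-uuid> \
    s3://googlebooks-ngrams-gz/googlebooks-fre-all-2gram-20120701-ts.gz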

Still, this is a workaround. We create ad-hoc tenants with read-only
permissions, and let the users in or out of these tenants.

If we want to use the original user's tenant in the ACL, it does not
scale for a large number of objects, AFAIK. :(

Saverio
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RadosGW - Problems running the S3 and SWIFT API at the same time

2016-05-11 Thread Saverio Proto
It does not work the other way around either:

If I upload a file with the swift client using the -S option to force
a segmented (multipart) upload:

swift upload -S 100 multipart 180.mp4

then I am not able to read the file back correctly with S3:

s3cmd get s3://multipart/180.mp4
download: 's3://multipart/180.mp4' -> './180.mp4'  [1 of 1]
download: 's3://multipart/180.mp4' -> './180.mp4'  [1 of 1]
 38818503 of 38818503   100% in    1s    27.32 MB/s  done
WARNING: MD5 signatures do not match:
computed=961f154cc78c7bf1be3b4009c29e5a68,
received=d41d8cd98f00b204e9800998ecf8427e

Saverio


2016-05-11 16:07 GMT+02:00 Saverio Proto <ziopr...@gmail.com>:
> Thank you.
>
> It is exactly a problem with multipart.
>
> So I tried two clients (s3cmd and rclone). When you upload a file in
> S3 using multipart, you are not able to read anymore this object with
> the SWIFT API because the md5 check fails.
>
> Saverio
>
>
>
> 2016-05-09 12:00 GMT+02:00 Xusangdi <xu.san...@h3c.com>:
>> Hi,
>>
>> I'm not running a cluster as yours, but I don't think the issue is caused by 
>> you using 2 APIs at the same time.
>> IIRC the dash thing is append by S3 multipart upload, with a following digit 
>> indicating the number of parts.
>> You may want to check this reported in s3cmd community:
>> https://sourceforge.net/p/s3tools/bugs/123/
>>
>> and some basic info from Amazon:
>> http://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html
>>
>> Hope this helps :D
>>
>> Regards,
>> ---Sandy
>>
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
>>> Saverio Proto
>>> Sent: Monday, May 09, 2016 4:42 PM
>>> To: ceph-users@lists.ceph.com
>>> Subject: Re: [ceph-users] RadosGW - Problems running the S3 and SWIFT API 
>>> at the same time
>>>
>>> I try to simplify the question to get some feedback.
>>>
>>> Is anyone running the RadosGW in production with S3 and SWIFT API active at 
>>> the same time ?
>>>
>>> thank you !
>>>
>>> Saverio
>>>
>>>
>>> 2016-05-06 11:39 GMT+02:00 Saverio Proto <ziopr...@gmail.com>:
>>> > Hello,
>>> >
>>> > We have been running the Rados GW with the S3 API and we did not have
>>> > problems for more than a year.
>>> >
>>> > We recently enabled also the SWIFT API for our users.
>>> >
>>> > radosgw --version
>>> > ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
>>> >
>>> > The idea is that each user of the system is free of choosing the S3
>>> > client or the SWIFT client to access the same container/buckets.
>>> >
>>> > Please tell us if this is possible by design or if we are doing something 
>>> > wrong.
>>> >
>>> > We have now a problem that some files wrote in the past with S3,
>>> > cannot be read with the SWIFT API because the md5sum always fails.
>>> >
>>> > I am able to reproduce the bug in this way:
>>> >
>>> > We have this file googlebooks-fre-all-2gram-20120701-ts.gz and we know
>>> > the correct md5 is 1c8113d2bd21232688221ec74dccff3a You can download
>>> > the same file here:
>>> > https://www.dropbox.com/s/auq16vdv2maw4p7/googlebooks-fre-all-2gram-20
>>> > 120701-ts.gz?dl=0
>>> >
>>> > rclone mkdir lss3:bugreproduce
>>> > rclone copy googlebooks-fre-all-2gram-20120701-ts.gz lss3:bugreproduce
>>> >
>>> > The file is successfully uploaded.
>>> >
>>> > At this point I can succesfully download again the file:
>>> > rclone copy lss3:bugreproduce/googlebooks-fre-all-2gram-20120701-ts.gz
>>> > test.gz
>>> >
>>> > but not with swift:
>>> >
>>> > swift download googlebooks-ngrams-gz
>>> > fre/googlebooks-fre-all-2gram-20120701-ts.gz
>>> > Error downloading object
>>> > 'googlebooks-ngrams-gz/fre/googlebooks-fre-all-2gram-20120701-ts.gz':
>>> > u'Error downloading fre/googlebooks-fre-all-2gram-20120701-ts.gz:
>>> > md5sum != etag, 1c8113d2bd21232688221ec74dccff3a !=
>>> > 1a209a31b4ac3eb923fac5e8d194d9d3-2'
>>> >
>>> > Also I found strange the dash character '-' at the end of the md5 that
>>> > is trying to compare.
>>> >
>>> > Of course upload a file with the swift client and redownloading the
>>> > same file just wor

Re: [ceph-users] RadosGW - Problems running the S3 and SWIFT API at the same time

2016-05-11 Thread Saverio Proto
Thank you.

It is exactly a problem with multipart.

So I tried two clients (s3cmd and rclone). When you upload a file via
S3 using multipart, you are no longer able to read that object with
the SWIFT API because the md5 check fails.

Saverio



2016-05-09 12:00 GMT+02:00 Xusangdi <xu.san...@h3c.com>:
> Hi,
>
> I'm not running a cluster as yours, but I don't think the issue is caused by 
> you using 2 APIs at the same time.
> IIRC the dash thing is append by S3 multipart upload, with a following digit 
> indicating the number of parts.
> You may want to check this reported in s3cmd community:
> https://sourceforge.net/p/s3tools/bugs/123/
>
> and some basic info from Amazon:
> http://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html
>
> Hope this helps :D
>
> Regards,
> ---Sandy
>
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
>> Saverio Proto
>> Sent: Monday, May 09, 2016 4:42 PM
>> To: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] RadosGW - Problems running the S3 and SWIFT API at 
>> the same time
>>
>> I try to simplify the question to get some feedback.
>>
>> Is anyone running the RadosGW in production with S3 and SWIFT API active at 
>> the same time ?
>>
>> thank you !
>>
>> Saverio
>>
>>
>> 2016-05-06 11:39 GMT+02:00 Saverio Proto <ziopr...@gmail.com>:
>> > Hello,
>> >
>> > We have been running the Rados GW with the S3 API and we did not have
>> > problems for more than a year.
>> >
>> > We recently enabled also the SWIFT API for our users.
>> >
>> > radosgw --version
>> > ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
>> >
>> > The idea is that each user of the system is free of choosing the S3
>> > client or the SWIFT client to access the same container/buckets.
>> >
>> > Please tell us if this is possible by design or if we are doing something 
>> > wrong.
>> >
>> > We have now a problem that some files wrote in the past with S3,
>> > cannot be read with the SWIFT API because the md5sum always fails.
>> >
>> > I am able to reproduce the bug in this way:
>> >
>> > We have this file googlebooks-fre-all-2gram-20120701-ts.gz and we know
>> > the correct md5 is 1c8113d2bd21232688221ec74dccff3a You can download
>> > the same file here:
>> > https://www.dropbox.com/s/auq16vdv2maw4p7/googlebooks-fre-all-2gram-20
>> > 120701-ts.gz?dl=0
>> >
>> > rclone mkdir lss3:bugreproduce
>> > rclone copy googlebooks-fre-all-2gram-20120701-ts.gz lss3:bugreproduce
>> >
>> > The file is successfully uploaded.
>> >
>> > At this point I can succesfully download again the file:
>> > rclone copy lss3:bugreproduce/googlebooks-fre-all-2gram-20120701-ts.gz
>> > test.gz
>> >
>> > but not with swift:
>> >
>> > swift download googlebooks-ngrams-gz
>> > fre/googlebooks-fre-all-2gram-20120701-ts.gz
>> > Error downloading object
>> > 'googlebooks-ngrams-gz/fre/googlebooks-fre-all-2gram-20120701-ts.gz':
>> > u'Error downloading fre/googlebooks-fre-all-2gram-20120701-ts.gz:
>> > md5sum != etag, 1c8113d2bd21232688221ec74dccff3a !=
>> > 1a209a31b4ac3eb923fac5e8d194d9d3-2'
>> >
>> > Also I found strange the dash character '-' at the end of the md5 that
>> > is trying to compare.
>> >
>> > Of course upload a file with the swift client and redownloading the
>> > same file just works.
>> >
>> > Should I open a bug for the radosgw on http://tracker.ceph.com/ ?
>> >
>> > thank you
>> >
>> > Saverio
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ACL nightmare on RadosGW for 200 TB dataset

2016-05-11 Thread Saverio Proto
Hello there,

Our setup is with Ceph Hammer (latest release).

We want to publish in our Object Storage some Scientific Datasets.
These are collections of around 100K objects and total size of about
200 TB.

For Object Storage we use the RadosGW with S3 API.

For the initial testing we are using a smaller dataset of about 26K
files and 5 TB of data.

Authentication to radosGW is with Keystone integration.

We created an OpenStack tenant to manage the datasets, and with EC2
credentials we uploaded all the files.
Once the bucket is full, let's look at the ACLs:

s3cmd info s3://googlebooks-ngrams-gz/

ACL: TENANTDATASET: FULL_CONTROL

So far so good.

At this point we want to enable a user of a different tenant to access
this dataset read-only.

Given the UUID of the user's tenant, it would be as easy as:

s3cmd setacl --acl-grant=read: s3://googlebooks-ngrams-gz/

However this is not enough: the user will be able to list the objects
of the bucket, but not to read them. The read ACL set on the bucket is
not inherited by the objects. So we must do:

s3cmd setacl --acl-grant=read: --recursive s3://googlebooks-ngrams-gz/

But this takes ages on 26K objects. It works, but you spend several
hours updating ACLs, and we cannot run this procedure every time a user
wants read access.
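
A possible mitigation (not tested at this scale) could be to parallelize
the per-object setacl calls, something like the following sketch, where
the tenant UUID is a placeholder and the degree of parallelism is
arbitrary:

s3cmd ls -r s3://googlebooks-ngrams-gz/ | awk '{print $4}' | \
    xargs -P 16 -n 1 s3cmd setacl --acl-grant=read:<tenant-uuid>
# note: object keys containing spaces would need extra quoting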

Now the painful questions:

Is there a way to bulk update the read ACL on all the objects of a bucket?

What happens to ACLs when the SWIFT and S3 APIs are used simultaneously?
From my tests, RadosGW ignores the swift client when we try to post
ACLs; however, the swift API honors S3 ACLs when reading.

Saverio
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Mixed versions of Ceph Cluster and RadosGW

2016-05-11 Thread Saverio Proto
Hello,

I have a production Ceph cluster running the latest Hammer release.

We are not planning to upgrade to Jewel soon.

However, I would like to upgrade just the Rados Gateway to Jewel,
because I want to test the new SWIFT compatibility improvements.

Is it supported to run the system in this configuration, i.e. Ceph
Hammer with a Jewel RadosGW?

thank you

Saverio
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RadosGW - Problems running the S3 and SWIFT API at the same time

2016-05-09 Thread Saverio Proto
Let me simplify the question to get some feedback.

Is anyone running the RadosGW in production with the S3 and SWIFT APIs
active at the same time?

thank you !

Saverio


2016-05-06 11:39 GMT+02:00 Saverio Proto <ziopr...@gmail.com>:
> Hello,
>
> We have been running the Rados GW with the S3 API and we did not have
> problems for more than a year.
>
> We recently enabled also the SWIFT API for our users.
>
> radosgw --version
> ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
>
> The idea is that each user of the system is free of choosing the S3
> client or the SWIFT client to access the same container/buckets.
>
> Please tell us if this is possible by design or if we are doing something 
> wrong.
>
> We have now a problem that some files wrote in the past with S3,
> cannot be read with the SWIFT API because the md5sum always fails.
>
> I am able to reproduce the bug in this way:
>
> We have this file googlebooks-fre-all-2gram-20120701-ts.gz and we know
> the correct md5 is 1c8113d2bd21232688221ec74dccff3a
> You can download the same file here:
> https://www.dropbox.com/s/auq16vdv2maw4p7/googlebooks-fre-all-2gram-20120701-ts.gz?dl=0
>
> rclone mkdir lss3:bugreproduce
> rclone copy googlebooks-fre-all-2gram-20120701-ts.gz lss3:bugreproduce
>
> The file is successfully uploaded.
>
> At this point I can succesfully download again the file:
> rclone copy lss3:bugreproduce/googlebooks-fre-all-2gram-20120701-ts.gz test.gz
>
> but not with swift:
>
> swift download googlebooks-ngrams-gz
> fre/googlebooks-fre-all-2gram-20120701-ts.gz
> Error downloading object
> 'googlebooks-ngrams-gz/fre/googlebooks-fre-all-2gram-20120701-ts.gz':
> u'Error downloading fre/googlebooks-fre-all-2gram-20120701-ts.gz:
> md5sum != etag, 1c8113d2bd21232688221ec74dccff3a !=
> 1a209a31b4ac3eb923fac5e8d194d9d3-2'
>
> Also I found strange the dash character '-' at the end of the md5 that
> is trying to compare.
>
> Of course upload a file with the swift client and redownloading the
> same file just works.
>
> Should I open a bug for the radosgw on http://tracker.ceph.com/ ?
>
> thank you
>
> Saverio
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RadosGW - Problems running the S3 and SWIFT API at the same time

2016-05-06 Thread Saverio Proto
Hello,

We have been running the Rados GW with the S3 API for more than a year
without problems.

We recently also enabled the SWIFT API for our users.

radosgw --version
ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)

The idea is that each user of the system is free to choose either the S3
client or the SWIFT client to access the same containers/buckets.

Please tell us if this is possible by design or if we are doing something wrong.

We now have a problem: some files written in the past with S3
cannot be read with the SWIFT API because the md5sum check always fails.

I am able to reproduce the bug in this way:

We have this file googlebooks-fre-all-2gram-20120701-ts.gz and we know
the correct md5 is 1c8113d2bd21232688221ec74dccff3a
You can download the same file here:
https://www.dropbox.com/s/auq16vdv2maw4p7/googlebooks-fre-all-2gram-20120701-ts.gz?dl=0

rclone mkdir lss3:bugreproduce
rclone copy googlebooks-fre-all-2gram-20120701-ts.gz lss3:bugreproduce

The file is successfully uploaded.

At this point I can successfully download the file again:
rclone copy lss3:bugreproduce/googlebooks-fre-all-2gram-20120701-ts.gz test.gz

but not with swift:

swift download googlebooks-ngrams-gz
fre/googlebooks-fre-all-2gram-20120701-ts.gz
Error downloading object
'googlebooks-ngrams-gz/fre/googlebooks-fre-all-2gram-20120701-ts.gz':
u'Error downloading fre/googlebooks-fre-all-2gram-20120701-ts.gz:
md5sum != etag, 1c8113d2bd21232688221ec74dccff3a !=
1a209a31b4ac3eb923fac5e8d194d9d3-2'

I also found it strange that the md5 it is comparing against ends with
a dash and a digit ('-2').
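
For reference, my understanding is that the ETag of an S3 multipart object
is not the MD5 of the content: it is the MD5 of the concatenated binary MD5
digests of the individual parts, followed by '-<number of parts>'. Assuming
you split the file at exactly the part size the client used for the upload,
it can be reproduced with something like:

# the -b value must match the part size the S3 client actually used
split -b 100M googlebooks-fre-all-2gram-20120701-ts.gz part_
for p in part_*; do md5sum "$p" | cut -d' ' -f1; done | xxd -r -p | md5sum
# append "-<number of parts>" to the resulting digest to get the ETag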

Of course, uploading a file with the swift client and re-downloading
it just works.

Should I open a bug for the radosgw on http://tracker.ceph.com/ ?

thank you

Saverio
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot reliably create snapshot after freezing QEMU IO

2016-02-25 Thread Saverio Proto
I confirm that the bug is fixed with the 0.94.6 release packages.

thank you

Saverio


2016-02-22 10:20 GMT+01:00 Saverio Proto <ziopr...@gmail.com>:
> Hello Jason,
>
> from this email on ceph-dev
> http://article.gmane.org/gmane.comp.file-systems.ceph.devel/29692
>
> it looks like 0.94.6 is coming out very soon. We avoid testing the
> unreleased packaged then and we wait for the official release. thank
> you
>
> Saverio
>
>
> 2016-02-19 18:53 GMT+01:00 Jason Dillaman <dilla...@redhat.com>:
>> Correct -- a v0.94.6 tag on the hammer branch won't be created until the 
>> release.
>>
>> --
>>
>> Jason Dillaman
>>
>>
>> - Original Message -
>>> From: "Saverio Proto" <ziopr...@gmail.com>
>>> To: "Jason Dillaman" <dilla...@redhat.com>
>>> Cc: ceph-users@lists.ceph.com
>>> Sent: Friday, February 19, 2016 11:38:08 AM
>>> Subject: Re: [ceph-users] Cannot reliably create snapshot after freezing 
>>> QEMU IO
>>>
>>> Hello,
>>>
>>> thanks for the pointer. Just to make sure, for dev/QE hammer release,
>>> do you mean the "hammer" branch ? So following the documentation,
>>> because I use Ubuntu Trusty, this should be the repository right ?
>>>
>>> deb http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/ref/hammer
>>> trusty main
>>>
>>> thanks
>>>
>>> Saverio
>>>
>>>
>>>
>>>
>>> 2016-02-19 16:41 GMT+01:00 Jason Dillaman <dilla...@redhat.com>:
>>> > I believe 0.94.6 is still in testing because of a possible MDS issue [1].
>>> > You can download the interim dev/QE hammer release by following the
>>> > instructions here [2] if you are in a hurry.  You would only need to
>>> > upgrade librbd1 (and its dependencies) to pick up the fix.  When you do
>>> > upgrade (either with the interim or the official release), I would
>>> > appreciate it if you could update the ticket to let me know if it resolved
>>> > your issue.
>>> >
>>> > [1] http://tracker.ceph.com/issues/13356
>>> > [2] http://docs.ceph.com/docs/master/install/get-packages/
>>> >
>>> > --
>>> >
>>> > Jason Dillaman
>>> >
>>> >
>>> > - Original Message -
>>> >> From: "Saverio Proto" <ziopr...@gmail.com>
>>> >> To: ceph-users@lists.ceph.com
>>> >> Sent: Friday, February 19, 2016 10:11:01 AM
>>> >> Subject: [ceph-users] Cannot reliably create snapshot after freezing QEMU
>>> >> IO
>>> >>
>>> >> Hello,
>>> >>
>>> >> we are hitting here Bug #14373 in our production cluster
>>> >> http://tracker.ceph.com/issues/14373
>>> >>
>>> >> Since we introduced the object map feature in our cinder rbd volumes,
>>> >> we are not able to make snapshot the volumes, unless they pause the
>>> >> VMs.
>>> >>
>>> >> We are running the latest Hammer and so we are really looking forward
>>> >> release v0.94.6
>>> >>
>>> >> Does anyone know when the release is going to happen ?
>>> >>
>>> >> If the release v0.94.6 is far away, we might have to build custom
>>> >> packages for Ubuntu and we really would like to avoid that.
>>> >> Any input ?
>>> >> Anyone else sharing the same bug ?
>>> >>
>>> >> thank you
>>> >>
>>> >> Saverio
>>> >> ___
>>> >> ceph-users mailing list
>>> >> ceph-users@lists.ceph.com
>>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >>
>>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot reliably create snapshot after freezing QEMU IO

2016-02-22 Thread Saverio Proto
Hello Jason,

from this email on ceph-dev
http://article.gmane.org/gmane.comp.file-systems.ceph.devel/29692

it looks like 0.94.6 is coming out very soon. We will avoid testing the
unreleased packages then and wait for the official release. Thank you.

Saverio


2016-02-19 18:53 GMT+01:00 Jason Dillaman <dilla...@redhat.com>:
> Correct -- a v0.94.6 tag on the hammer branch won't be created until the 
> release.
>
> --
>
> Jason Dillaman
>
>
> - Original Message -
>> From: "Saverio Proto" <ziopr...@gmail.com>
>> To: "Jason Dillaman" <dilla...@redhat.com>
>> Cc: ceph-users@lists.ceph.com
>> Sent: Friday, February 19, 2016 11:38:08 AM
>> Subject: Re: [ceph-users] Cannot reliably create snapshot after freezing 
>> QEMU IO
>>
>> Hello,
>>
>> thanks for the pointer. Just to make sure, for dev/QE hammer release,
>> do you mean the "hammer" branch ? So following the documentation,
>> because I use Ubuntu Trusty, this should be the repository right ?
>>
>> deb http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/ref/hammer
>> trusty main
>>
>> thanks
>>
>> Saverio
>>
>>
>>
>>
>> 2016-02-19 16:41 GMT+01:00 Jason Dillaman <dilla...@redhat.com>:
>> > I believe 0.94.6 is still in testing because of a possible MDS issue [1].
>> > You can download the interim dev/QE hammer release by following the
>> > instructions here [2] if you are in a hurry.  You would only need to
>> > upgrade librbd1 (and its dependencies) to pick up the fix.  When you do
>> > upgrade (either with the interim or the official release), I would
>> > appreciate it if you could update the ticket to let me know if it resolved
>> > your issue.
>> >
>> > [1] http://tracker.ceph.com/issues/13356
>> > [2] http://docs.ceph.com/docs/master/install/get-packages/
>> >
>> > --
>> >
>> > Jason Dillaman
>> >
>> >
>> > - Original Message -
>> >> From: "Saverio Proto" <ziopr...@gmail.com>
>> >> To: ceph-users@lists.ceph.com
>> >> Sent: Friday, February 19, 2016 10:11:01 AM
>> >> Subject: [ceph-users] Cannot reliably create snapshot after freezing QEMU
>> >> IO
>> >>
>> >> Hello,
>> >>
>> >> we are hitting here Bug #14373 in our production cluster
>> >> http://tracker.ceph.com/issues/14373
>> >>
>> >> Since we introduced the object map feature in our cinder rbd volumes,
>> >> we are not able to make snapshot the volumes, unless they pause the
>> >> VMs.
>> >>
>> >> We are running the latest Hammer and so we are really looking forward
>> >> release v0.94.6
>> >>
>> >> Does anyone know when the release is going to happen ?
>> >>
>> >> If the release v0.94.6 is far away, we might have to build custom
>> >> packages for Ubuntu and we really would like to avoid that.
>> >> Any input ?
>> >> Anyone else sharing the same bug ?
>> >>
>> >> thank you
>> >>
>> >> Saverio
>> >> ___
>> >> ceph-users mailing list
>> >> ceph-users@lists.ceph.com
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot reliably create snapshot after freezing QEMU IO

2016-02-19 Thread Saverio Proto
Hello,

thanks for the pointer. Just to make sure, for dev/QE hammer release,
do you mean the "hammer" branch ? So following the documentation,
because I use Ubuntu Trusty, this should be the repository right ?

deb http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/ref/hammer
trusty main

thanks

Saverio




2016-02-19 16:41 GMT+01:00 Jason Dillaman <dilla...@redhat.com>:
> I believe 0.94.6 is still in testing because of a possible MDS issue [1].  
> You can download the interim dev/QE hammer release by following the 
> instructions here [2] if you are in a hurry.  You would only need to upgrade 
> librbd1 (and its dependencies) to pick up the fix.  When you do upgrade 
> (either with the interim or the official release), I would appreciate it if 
> you could update the ticket to let me know if it resolved your issue.
>
> [1] http://tracker.ceph.com/issues/13356
> [2] http://docs.ceph.com/docs/master/install/get-packages/
>
> --
>
> Jason Dillaman
>
>
> - Original Message -
>> From: "Saverio Proto" <ziopr...@gmail.com>
>> To: ceph-users@lists.ceph.com
>> Sent: Friday, February 19, 2016 10:11:01 AM
>> Subject: [ceph-users] Cannot reliably create snapshot after freezing QEMU IO
>>
>> Hello,
>>
>> we are hitting here Bug #14373 in our production cluster
>> http://tracker.ceph.com/issues/14373
>>
>> Since we introduced the object map feature in our cinder rbd volumes,
>> we are not able to make snapshot the volumes, unless they pause the
>> VMs.
>>
>> We are running the latest Hammer and so we are really looking forward
>> release v0.94.6
>>
>> Does anyone know when the release is going to happen ?
>>
>> If the release v0.94.6 is far away, we might have to build custom
>> packages for Ubuntu and we really would like to avoid that.
>> Any input ?
>> Anyone else sharing the same bug ?
>>
>> thank you
>>
>> Saverio
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cannot reliably create snapshot after freezing QEMU IO

2016-02-19 Thread Saverio Proto
Hello,

we are hitting Bug #14373 here in our production cluster:
http://tracker.ceph.com/issues/14373

Since we introduced the object map feature on our Cinder rbd volumes,
we are not able to snapshot the volumes unless we pause the VMs.
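
A possible stopgap (we have not validated it here) could be to turn the
object map off on the affected volumes until the fixed release is out,
e.g. (the image name is an example placeholder):

# check whether the image really has the object map enabled
rbd -p volumes info volume-<uuid> | grep features
# if the installed rbd CLI supports disabling features dynamically:
rbd -p volumes feature disable volume-<uuid> object-map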

We are running the latest Hammer, and so we are really looking forward
to release v0.94.6.

Does anyone know when the release is going to happen ?

If release v0.94.6 is far away, we might have to build custom
packages for Ubuntu, and we would really like to avoid that.
Any input?
Is anyone else hitting the same bug?

thank you

Saverio
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Increasing time to save RGW objects

2016-02-10 Thread Saverio Proto
What kind of authentication do you use against the Rados Gateway?

We had a similar problem authenticating against our Keystone server. If
the Keystone server is overloaded, the time to read/write RGW objects
increases. You will not see anything wrong on the Ceph side.
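
If it is Keystone, one thing worth checking is whether the RGW is caching
the validated tokens, so that not every request goes back to Keystone.
A ceph.conf sketch (section name and values are examples):

[client.radosgw.gateway]
    rgw keystone url = http://keystone.example.org:35357
    rgw keystone admin token = <admin-token>
    rgw keystone accepted roles = Member, admin
    rgw keystone token cache size = 10000
    rgw keystone revocation interval = 900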

Saverio

2016-02-08 17:49 GMT+01:00 Kris Jurka :
>
> I've been testing the performance of ceph by storing objects through RGW.
> This is on Debian with Hammer using 40 magnetic OSDs, 5 mons, and 4 RGW
> instances.  Initially the storage time was holding reasonably steady, but it
> has started to rise recently as shown in the attached chart.
>
> The test repeatedly saves 100k objects of 55 kB size using multiple threads
> (50) against multiple RGW gateways (4).  It uses a sequential identifier as
> the object key and shards the bucket name using id % 100.  The buckets have
> index sharding enabled with 64 index shards per bucket.
>
> ceph status doesn't appear to show any issues.  Is there something I should
> be looking at here?
>
>
> # ceph status
> cluster 3fc86d01-cf9c-4bed-b130-7a53d7997964
>  health HEALTH_OK
>  monmap e2: 5 mons at
> {condor=192.168.188.90:6789/0,duck=192.168.188.140:6789/0,eagle=192.168.188.100:6789/0,falcon=192.168.188.110:6789/0,shark=192.168.188.118:6789/0}
> election epoch 18, quorum 0,1,2,3,4
> condor,eagle,falcon,shark,duck
>  osdmap e674: 40 osds: 40 up, 40 in
>   pgmap v258756: 3128 pgs, 10 pools, 1392 GB data, 27282 kobjects
> 4784 GB used, 69499 GB / 74284 GB avail
> 3128 active+clean
>   client io 268 kB/s rd, 1100 kB/s wr, 493 op/s
>
>
> Kris Jurka
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] What are linger_ops in the output of objecter_requests ?

2015-10-14 Thread Saverio Proto
Hello,

While debugging the slow-request behaviour of our Rados Gateway, I ran
into this linger_ops field and I cannot understand its meaning.

I would expect to find slow requests stuck in the "ops" field.
Actually, most of the time I have "ops": [], and it looks like ops
empties very quickly.

However, linger_ops is populated, and it is always the same requests;
it looks like those are there forever.

Can anyone explain what linger_ops are?
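
My working assumption is that these entries are the watch registrations
that the gateway keeps on its notify.* control objects (pool 10 in the
dump below), which could be cross-checked with something like (assuming
the default control pool name):

rados -p .rgw.control ls
rados -p .rgw.control listwatchers notify.0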

thanks !

Saverio


r...@os.zhdk.cloud /home/proto ; ceph daemon
/var/run/ceph/ceph-radosgw.gateway.asok objecter_requests
{
"ops": [],
"linger_ops": [
{
"linger_id": 8,
"pg": "10.84ada7c9",
"osd": 9,
"object_id": "notify.7",
"object_locator": "@10",
"target_object_id": "notify.7",
"target_object_locator": "@10",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"snapid": "head",
"registered": "1"
},
{
"linger_id": 2,
"pg": "10.16dafda0",
"osd": 27,
"object_id": "notify.1",
"object_locator": "@10",
"target_object_id": "notify.1",
"target_object_locator": "@10",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"snapid": "head",
"registered": "1"
},
{
"linger_id": 6,
"pg": "10.31099063",
"osd": 52,
"object_id": "notify.5",
"object_locator": "@10",
"target_object_id": "notify.5",
"target_object_locator": "@10",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"snapid": "head",
"registered": "1"
},
{
"linger_id": 3,
"pg": "10.88aa5c95",
"osd": 66,
"object_id": "notify.2",
"object_locator": "@10",
"target_object_id": "notify.2",
"target_object_locator": "@10",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"snapid": "head",
"registered": "1"
},
{
"linger_id": 5,
"pg": "10.a204812d",
"osd": 66,
"object_id": "notify.4",
"object_locator": "@10",
"target_object_id": "notify.4",
"target_object_locator": "@10",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"snapid": "head",
"registered": "1"
},
{
"linger_id": 4,
"pg": "10.f8c99aee",
"osd": 68,
"object_id": "notify.3",
"object_locator": "@10",
"target_object_id": "notify.3",
"target_object_locator": "@10",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"snapid": "head",
"registered": "1"
},
{
"linger_id": 1,
"pg": "10.4322fa9f",
"osd": 82,
"object_id": "notify.0",
"object_locator": "@10",
"target_object_id": "notify.0",
"target_object_locator": "@10",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"snapid": "head",
"registered": "1"
},
{
"linger_id": 7,
"pg": "10.97c520d4",
"osd": 103,
"object_id": "notify.6",
"object_locator": "@10",
"target_object_id": "notify.6",
"target_object_locator": "@10",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"snapid": "head",
"registered": "1"
}
],
"pool_ops": [],
"pool_stat_ops": [],
"statfs_ops": [],
"command_ops": []
}
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw secret_key

2015-09-01 Thread Saverio Proto
Look at this:
https://github.com/ncw/rclone/issues/47

Because this is a JSON dump, the / is being encoded as \/.

It was a source of confusion for me too.
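
A JSON-aware parser removes the escaping for you; for example, assuming
the usual radosgw-admin output layout:

radosgw-admin user info --uid=myuser \
    | python -c 'import json,sys; print(json.load(sys.stdin)["keys"][0]["secret_key"])'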

Best regards

Saverio




2015-08-24 16:58 GMT+02:00 Luis Periquito :
> When I create a new user using radosgw-admin most of the time the secret key
> gets escaped with a backslash, making it not work. Something like
> "secret_key": "xx\/\/".
>
> Why would the "/" need to be escaped? Why is it printing the "\/" instead of
> "/" that does work?
>
> Usually I just remove the backslash and it works fine. I've seen this on
> several different clusters.
>
> Is it just me?
>
> This may require opening a bug in the tracking tool, but just asking here
> first.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to use cgroup to bind ceph-osd to a specific cpu core?

2015-07-27 Thread Saverio Proto
Hello Jan,

I am testing your scripts, because we also want to run OSDs and VMs
on the same server.

I am new to cgroups, so this might be a very newbie question.
In your script you always reference the file
/cgroup/cpuset/libvirt/cpuset.cpus

but I have the file in /sys/fs/cgroup/cpuset/libvirt/cpuset.cpus

I am working on Ubuntu 14.04

Does this difference come from something special in your setup, or
from the fact that we are working on different Linux distributions?
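
For reference, this is how I am checking where the cpuset controller is
mounted on my Ubuntu 14.04 nodes:

mount -t cgroup
ls /sys/fs/cgroup/cpuset/libvirt/
cat /sys/fs/cgroup/cpuset/libvirt/cpuset.cpus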

Thanks for clarification.

Saverio



2015-06-30 17:50 GMT+02:00 Jan Schermer j...@schermer.cz:
 Hi all,
 our script is available on GitHub

 https://github.com/prozeta/pincpus

 I haven’t had much time to do a proper README, but I hope the configuration
 is self explanatory enough for now.
 What it does is pin each OSD into the most “empty” cgroup assigned to a NUMA
 node.

 Let me know how it works for you!

 Jan


 On 30 Jun 2015, at 10:50, Huang Zhiteng winsto...@gmail.com wrote:



 On Tue, Jun 30, 2015 at 4:25 PM, Jan Schermer j...@schermer.cz wrote:

 Not having OSDs and KVMs compete against each other is one thing.
 But there are more reasons to do this

 1) not moving the processes and threads between cores that much (better
 cache utilization)
 2) aligning the processes with memory on NUMA systems (that means all
 modern dual socket systems) - you don’t want your OSD running on CPU1 with
 memory allocated to CPU2
 3) the same goes for other resources like NICs or storage controllers -
 but that’s less important and not always practical to do
 4) you can limit the scheduling domain on linux if you limit the cpuset
 for your OSDs (I’m not sure how important this is, just best practice)
 5) you can easily limit memory or CPU usage, set priority, with much
 greater granularity than without cgroups
 6) if you have HyperThreading enabled you get the most gain when the
 workloads on the threads are dissimiliar - so to have the higher throughput
 you have to pin OSD to thread1 and KVM to thread2 on the same core. We’re
 not doing that because latency and performance of the core can vary
 depending on what the other thread is doing. But it might be useful to
 someone.

 Some workloads exhibit 100% performance gain when everything aligns in a
 NUMA system, compared to a SMP mode on the same hardware. You likely won’t
 notice it on light workloads, as the interconnects (QPI) are very fast and
 there’s a lot of bandwidth, but for stuff like big OLAP databases or other
 data-manipulation workloads there’s a huge difference. And with CEPH being
 CPU hungy and memory intensive, we’re seeing some big gains here just by
 co-locating the memory with the processes….

 Could you elaborate a it on this?  I'm interested to learn in what situation
 memory locality helps Ceph to what extend.



 Jan



 On 30 Jun 2015, at 08:12, Ray Sun xiaoq...@gmail.com wrote:

 Sound great, any update please let me know.

 Best Regards
 -- Ray

 On Tue, Jun 30, 2015 at 1:46 AM, Jan Schermer j...@schermer.cz wrote:

 I promised you all our scripts for automatic cgroup assignment - they are
 in our production already and I just need to put them on github, stay tuned
 tomorrow :-)

 Jan


 On 29 Jun 2015, at 19:41, Somnath Roy somnath@sandisk.com wrote:

 Presently, you have to do it by using tool like ‘taskset’ or ‘numactl’…

 Thanks  Regards
 Somnath

 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Ray Sun
 Sent: Monday, June 29, 2015 9:19 AM
 To: ceph-users@lists.ceph.com
 Subject: [ceph-users] How to use cgroup to bind ceph-osd to a specific
 cpu core?

 Cephers,
 I want to bind each of my ceph-osd to a specific cpu core, but I didn't
 find any document to explain that, could any one can provide me some
 detailed information. Thanks.

 Currently, my ceph is running like this:

 oot  28692  1  0 Jun23 ?00:37:26 /usr/bin/ceph-mon -i
 seed.econe.com --pid-file /var/run/ceph/mon.seed.econe.com.pid -c
 /etc/ceph/ceph.conf --cluster ceph
 root  40063  1  1 Jun23 ?02:13:31 /usr/bin/ceph-osd -i 0
 --pid-file /var/run/ceph/osd.0.pid -c /etc/ceph/ceph.conf --cluster ceph
 root  42096  1  0 Jun23 ?01:33:42 /usr/bin/ceph-osd -i 1
 --pid-file /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.conf --cluster ceph
 root  43263  1  0 Jun23 ?01:22:59 /usr/bin/ceph-osd -i 2
 --pid-file /var/run/ceph/osd.2.pid -c /etc/ceph/ceph.conf --cluster ceph
 root  44527  1  0 Jun23 ?01:16:53 /usr/bin/ceph-osd -i 3
 --pid-file /var/run/ceph/osd.3.pid -c /etc/ceph/ceph.conf --cluster ceph
 root  45863  1  0 Jun23 ?01:25:18 /usr/bin/ceph-osd -i 4
 --pid-file /var/run/ceph/osd.4.pid -c /etc/ceph/ceph.conf --cluster ceph
 root  47462  1  0 Jun23 ?01:20:36 /usr/bin/ceph-osd -i 5
 --pid-file /var/run/ceph/osd.5.pid -c /etc/ceph/ceph.conf --cluster ceph

 Best Regards
 -- Ray

 


Re: [ceph-users] Unexpected issues with simulated 'rack' outage

2015-06-24 Thread Saverio Proto
Hello Romero,

I am still a beginner with Ceph, but as far as I understand, Ceph is not
designed to lose 33% of the cluster at once and recover rapidly, and by
losing 1 rack out of 3 you are losing 33% of the cluster. It will take a
very long time to recover before you are back to a HEALTH_OK status.

Can you check with ceph -w how long it takes for Ceph to converge to a
healthy cluster after you switch off the switch in Rack-A?

Saverio



2015-06-24 14:44 GMT+02:00 Romero Junior r.jun...@global.leaseweb.com:

  Hi,



 We are setting up a test environment using Ceph as the main storage
 solution for my QEMU-KVM virtualization platform, and everything works fine
 except for the following:



 When I simulate a failure by powering off the switches on one of our three
 racks my virtual machines get into a weird state, the illustration might
 help you to fully understand what is going on:
 http://i.imgur.com/clBApzK.jpg



 The PGs are distributed based on racks, there are not default crush rules.



 The number of PGs is the following:



 root@srv003:~# ceph osd pool ls detail

 pool 11 'libvirt-pool' replicated size 2 min_size 1 crush_ruleset 0
 object_hash rjenkins pg_num 16000 pgp_num 16000 last_change 14544 flags
 hashpspool stripe_width 0



 The qemu talks directly to Ceph through librdb, the disk is configured as
 the following:



 disk type='network' device='disk'

   driver name='qemu' type='raw' cache='writeback'/

   auth username='libvirt'

 secret type='ceph' uuid='0d32bxxxyyyzzz47073a965'/

   /auth

   source protocol='rbd' name='libvirt-pool/ceph-vm-automated'

 host name='10.XX.YY.1' port='6789'/

 host name='10.XX.YY.2' port='6789'/

 host name='10.XX.YY.2' port='6789'/

   /source

   target dev='vda' bus='virtio'/

   alias name='virtio-disk25'/

   address type='pci' domain='0x' bus='0x00' slot='0x04'
 function='0x0'/

 /disk





 As mentioned, it's not a real read-only state, I can touch files and
 even login on the affected virtual machines (by the way, all are affected)
 however, a simple 'dd' (count=10 bs=1MB conv=fdatasync) hangs forever. If a
 3 GB file download starts (via wget/curl), it usually crashes after the
 first few hundred megabytes and it resumes as soon as I power on the
 “failed” rack. Everything goes back to normal as soon as the rack is
 powered on again.



 For reference, each rack contains 33 nodes, each node contain 3 OSDs (1.5
 TB each).



 On the virtual machine, after recovering the rack, I can see the following
 messages on /var/log/kern.log:



 [163800.444146] INFO: task jbd2/vda1-8:135 blocked for more than 120
 seconds.

 [163800.444260]   Not tainted 3.13.0-55-generic #94-Ubuntu

 [163800.444295] echo 0  /proc/sys/kernel/hung_task_timeout_secs
 disables this message.

 [163800.444346] jbd2/vda1-8 D 88007fd13180 0   135  2
 0x

 [163800.444354]  880036d3bbd8 0046 880036a4b000
 880036d3bfd8

 [163800.444386]  00013180 00013180 880036a4b000
 88007fd13a18

 [163800.444390]  88007ffc69d0 0002 811efa80
 880036d3bc50

 [163800.444396] Call Trace:

 [163800.20]  [811efa80] ? generic_block_bmap+0x50/0x50

 [163800.26]  [817279bd] io_schedule+0x9d/0x140

 [163800.32]  [811efa8e] sleep_on_buffer+0xe/0x20

 [163800.37]  [81727e42] __wait_on_bit+0x62/0x90

 [163800.42]  [811efa80] ? generic_block_bmap+0x50/0x50

 [163800.47]  [81727ee7] out_of_line_wait_on_bit+0x77/0x90

 [163800.55]  [810ab300] ? autoremove_wake_function+0x40/0x40

 [163800.61]  [811f0dba] __wait_on_buffer+0x2a/0x30

 [163800.70]  [8128be4d]
 jbd2_journal_commit_transaction+0x185d/0x1ab0

 [163800.77]  [8107562f] ? try_to_del_timer_sync+0x4f/0x70

 [163800.84]  [8129017d] kjournald2+0xbd/0x250

 [163800.90]  [810ab2c0] ? prepare_to_wait_event+0x100/0x100

 [163800.96]  [812900c0] ? commit_timeout+0x10/0x10

 [163800.444502]  [8108b702] kthread+0xd2/0xf0

 [163800.444507]  [8108b630] ? kthread_create_on_node+0x1c0/0x1c0

 [163800.444513]  [81733ca8] ret_from_fork+0x58/0x90

 [163800.444517]  [8108b630] ? kthread_create_on_node+0x1c0/0x1c0



 A few theories for this behavior were mention on #Ceph (OFTC):



 [14:09] Be-El RomeroJnr: i think the problem is the fact that you write
 to parts of the rbd that have not been accessed before

 [14:09] Be-El RomeroJnr: ceph does thin provisioning; each rbd is
 striped into chunks of 4 mb. each stripe is put into one pgs

 [14:10] Be-El RomeroJnr: if you access formerly unaccessed parts of the
 rbd, a new stripe is created. and this probably fails if one of the racks
 is down

 [14:10] Be-El RomeroJnr: but that's just a theory...maybe some developer
 can comment on this later

 

Re: [ceph-users] Unexpected issues with simulated 'rack' outage

2015-06-24 Thread Saverio Proto
You don't have to wait, but the recovery process will be very heavy and it
will have an impact on performance. The impact can be catastrophic, as you
are experiencing.

After removing 1 rack, the CRUSH algorithm will run again on the available
resources and will map the PGs to the remaining OSDs. You lost 33% of your
OSDs, so it will be a big change.

This means that not only do new copies have to be created for the data that
was on the OSDs that are out of the cluster, but a lot of objects that are
now misplaced also have to be moved around.

It would also be nice to see your crushmap, because you are not using the
default. A conceptual bug in the crushmap could leave the cluster in a
degraded state forever. For example, a crushmap that places copies only on
different racks while you want 3 copies with only 2 racks available is such
a conceptual bug.
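
To check for that, it would help to see the output of something like:

ceph osd getcrushmap -o crushmap.bin && crushtool -d crushmap.bin -o crushmap.txt
ceph osd pool get libvirt-pool size
ceph osd pool get libvirt-pool min_size
ceph osd crush rule dump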

Saverio





2015-06-24 15:11 GMT+02:00 Romero Junior r.jun...@global.leaseweb.com:

  If I have a replica of each object on the other racks why should I have
 to wait for any recovery time? The failure should not impact my virtual
 machines.



 *From:* Saverio Proto [mailto:ziopr...@gmail.com]
 *Sent:* woensdag, 24 juni, 2015 14:54
 *To:* Romero Junior
 *Cc:* ceph-users@lists.ceph.com
 *Subject:* Re: [ceph-users] Unexpected issues with simulated 'rack' outage



 Hello Romero,

 I am still begineer with Ceph, but as far as I understood, ceph is not
 designed to lose the 33% of the cluster at once and recover rapidly. What I
 understand is that you are losing 33% of the cluster losing 1 rack out of
 3. It will take a very long time to recover, before you have HEALTH_OK
 status.

 can you check with ceph -w how long it takes for ceph to converge to a
 healthy cluster after you switch off the switch in Rack-A ?



 Saverio



 2015-06-24 14:44 GMT+02:00 Romero Junior r.jun...@global.leaseweb.com:

 Hi,



 We are setting up a test environment using Ceph as the main storage
 solution for my QEMU-KVM virtualization platform, and everything works fine
 except for the following:



 When I simulate a failure by powering off the switches on one of our three
 racks my virtual machines get into a weird state, the illustration might
 help you to fully understand what is going on:
 http://i.imgur.com/clBApzK.jpg



 The PGs are distributed based on racks, there are not default crush rules.



 The number of PGs is the following:



 root@srv003:~# ceph osd pool ls detail

 pool 11 'libvirt-pool' replicated size 2 min_size 1 crush_ruleset 0
 object_hash rjenkins pg_num 16000 pgp_num 16000 last_change 14544 flags
 hashpspool stripe_width 0



 The qemu talks directly to Ceph through librdb, the disk is configured as
 the following:



 <disk type='network' device='disk'>
   <driver name='qemu' type='raw' cache='writeback'/>
   <auth username='libvirt'>
     <secret type='ceph' uuid='0d32bxxxyyyzzz47073a965'/>
   </auth>
   <source protocol='rbd' name='libvirt-pool/ceph-vm-automated'>
     <host name='10.XX.YY.1' port='6789'/>
     <host name='10.XX.YY.2' port='6789'/>
     <host name='10.XX.YY.2' port='6789'/>
   </source>
   <target dev='vda' bus='virtio'/>
   <alias name='virtio-disk25'/>
   <address type='pci' domain='0x' bus='0x00' slot='0x04' function='0x0'/>
 </disk>





 As mentioned, it's not a real read-only state: I can touch files and
 even log in on the affected virtual machines (by the way, all of them are
 affected); however, a simple 'dd' (count=10 bs=1MB conv=fdatasync) hangs
 forever. If a 3 GB file download starts (via wget/curl), it usually stalls
 after the first few hundred megabytes and resumes as soon as I power on the
 “failed” rack. Everything goes back to normal as soon as the rack is
 powered on again.



 For reference, each rack contains 33 nodes, each node contain 3 OSDs (1.5
 TB each).



 On the virtual machine, after recovering the rack, I can see the following
 messages on /var/log/kern.log:



 [163800.444146] INFO: task jbd2/vda1-8:135 blocked for more than 120
 seconds.

 [163800.444260]   Not tainted 3.13.0-55-generic #94-Ubuntu

 [163800.444295] echo 0  /proc/sys/kernel/hung_task_timeout_secs
 disables this message.

 [163800.444346] jbd2/vda1-8 D 88007fd13180 0   135  2
 0x

 [163800.444354]  880036d3bbd8 0046 880036a4b000
 880036d3bfd8

 [163800.444386]  00013180 00013180 880036a4b000
 88007fd13a18

 [163800.444390]  88007ffc69d0 0002 811efa80
 880036d3bc50

 [163800.444396] Call Trace:

 [163800.20]  [811efa80] ? generic_block_bmap+0x50/0x50

 [163800.26]  [817279bd] io_schedule+0x9d/0x140

 [163800.32]  [811efa8e] sleep_on_buffer+0xe/0x20

 [163800.37]  [81727e42] __wait_on_bit+0x62/0x90

 [163800.42]  [811efa80] ? generic_block_bmap+0x50/0x50

 [163800.47]  [81727ee7] out_of_line_wait_on_bit+0x77/0x90

 [163800.55

Re: [ceph-users] xfs corruption, data disaster!

2015-05-06 Thread Saverio Proto
Hello,

I don't get it. You lost just 6 OSDs out of 145 and your cluster is not
able to recover?

What is the output of ceph -s?

Saverio


2015-05-04 9:00 GMT+02:00 Yujian Peng pengyujian5201...@126.com:
 Hi,
 I'm encountering a data disaster. I have a ceph cluster with 145 OSDs. The
 data center had a power problem yesterday, and all of the ceph nodes were
 down.
 Now I find that 6 disks (xfs) in 4 nodes have data corruption. Some disks
 are unable to mount, and some disks have IO errors in syslog:
 mount: Structure needs cleaning
 xfs_log_forece: error 5 returned
 I tried to repair one with xfs_repair -L /dev/sdx1, but the ceph-osd
 reported a leveldb error:
 Error initializing leveldb: Corruption: checksum mismatch
 I cannot start the 6 OSDs, and 22 PGs are down.
 This is really a tragedy for me. Can you give me some ideas on how to
 recover the xfs? Thanks very much!



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] xfs corruption, data disaster!

2015-05-06 Thread Saverio Proto
OK, I see the problem. Thanks for the explanation.
However, he talks about 4 hosts, so with the default CRUSH map losing 1
or more OSDs on the same host is irrelevant.

The real problem is that he lost OSDs on 4 different hosts with pools of
size 3, so he lost the PGs that were mapped to 3 of the failing drives.

So he lost 22 PGs. But I guess the cluster has thousands of PGs, so the
actual data lost is small. Is that correct?
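
To double check that, a few commands should show exactly which PGs are
affected and where they were mapped (the PG id below is just a placeholder):

  ceph health detail           # lists the down/incomplete PGs
  ceph pg dump_stuck inactive  # PGs that cannot serve I/O
  ceph pg map 5.1f             # shows which OSDs a given PG maps to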

thanks

Saverio

2015-05-07 4:16 GMT+02:00 Christian Balzer ch...@gol.com:

 Hello,

 On Thu, 7 May 2015 00:34:58 +0200 Saverio Proto wrote:

 Hello,

 I dont get it. You lost just 6 osds out of 145 and your cluster is not
 able to recover ?

 He lost 6 OSDs at the same time.
 With 145 OSDs and standard replication of 3, losing 3 OSDs makes data loss
 already extremely likely, with 6 OSDs gone it is approaching certainty
 levels.

 Christian
 what is the status of ceph -s ?

 Saverio


 2015-05-04 9:00 GMT+02:00 Yujian Peng pengyujian5201...@126.com:
  Hi,
  I'm encountering a data disaster. I have a ceph cluster with 145 osd.
  The data center had a power problem yesterday, and all of the ceph
  nodes were down. But now I find that 6 disks(xfs) in 4 nodes have data
  corruption. Some disks are unable to mount, and some disks have IO
  errors in syslog. mount: Structure needs cleaning
  xfs_log_forece: error 5 returned
  I tried to repair one with xfs_repair -L /dev/sdx1, but the ceph-osd
  reported a leveldb error:
  Error initializing leveldb: Corruption: checksum mismatch
  I cannot start the 6 osds and 22 pgs is down.
  This is really a tragedy for me. Can you give me some idea to recovery
  the xfs? Thanks very much!
 
 
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 --
 Christian BalzerNetwork/Systems Engineer
 ch...@gol.com   Global OnLine Japan/Fusion Communications
 http://www.gol.com/
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph migration to AWS

2015-05-06 Thread Saverio Proto
Why don't you just use AWS S3 directly then?

Saverio

2015-04-24 17:14 GMT+02:00 Mike Travis mike.r.tra...@gmail.com:
 To those interested in a tricky problem,

 We have a Ceph cluster running at one of our data centers. One of our
 client's requirements is to have them hosted at AWS. My question is: How do
 we effectively migrate our data on our internal Ceph cluster to an AWS Ceph
 cluster?

 Ideas currently on the table:

 1. Build OSDs at AWS and add them to our current Ceph cluster. Build quorum
 at AWS then sever the connection between AWS and our data center.

 2. Build a Ceph cluster at AWS and send snapshots from our data center to
 our AWS cluster allowing us to migrate to AWS.

 Is this a good idea? Suggestions? Has anyone done something like this
 before?

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] All pools have size=3 but MB data and MB used ratio is 1 to 5

2015-04-17 Thread Saverio Proto
 Do you by any chance have your OSDs placed at a local directory path rather
 than on a non utilized physical disk?

No, I have 18 disks per server. Each OSD is mapped to a physical disk.

Here is the output from one server:
ansible@zrh-srv-m-cph02:~$ df -h
Filesystem   Size  Used Avail Use% Mounted on
/dev/mapper/vg01-root 28G  4.5G   22G  18% /
none 4.0K 0  4.0K   0% /sys/fs/cgroup
udev  48G  4.0K   48G   1% /dev
tmpfs9.5G  1.3M  9.5G   1% /run
none 5.0M 0  5.0M   0% /run/lock
none  48G   20K   48G   1% /run/shm
none 100M 0  100M   0% /run/user
/dev/mapper/vg01-tmp 4.5G  9.4M  4.3G   1% /tmp
/dev/mapper/vg01-varlog  9.1G  5.1G  3.6G  59% /var/log
/dev/sdf1932G   15G  917G   2% /var/lib/ceph/osd/ceph-3
/dev/sdg1932G   15G  917G   2% /var/lib/ceph/osd/ceph-4
/dev/sdl1932G   13G  919G   2% /var/lib/ceph/osd/ceph-8
/dev/sdo1932G   15G  917G   2% /var/lib/ceph/osd/ceph-11
/dev/sde1932G   15G  917G   2% /var/lib/ceph/osd/ceph-2
/dev/sdd1932G   15G  917G   2% /var/lib/ceph/osd/ceph-1
/dev/sdt1932G   15G  917G   2% /var/lib/ceph/osd/ceph-15
/dev/sdq1932G   12G  920G   2% /var/lib/ceph/osd/ceph-12
/dev/sdc1932G   14G  918G   2% /var/lib/ceph/osd/ceph-0
/dev/sds1932G   17G  916G   2% /var/lib/ceph/osd/ceph-14
/dev/sdu1932G   14G  918G   2% /var/lib/ceph/osd/ceph-16
/dev/sdm1932G   15G  917G   2% /var/lib/ceph/osd/ceph-9
/dev/sdk1932G   17G  915G   2% /var/lib/ceph/osd/ceph-7
/dev/sdn1932G   14G  918G   2% /var/lib/ceph/osd/ceph-10
/dev/sdr1932G   15G  917G   2% /var/lib/ceph/osd/ceph-13
/dev/sdv1932G   14G  918G   2% /var/lib/ceph/osd/ceph-17
/dev/sdh1932G   17G  916G   2% /var/lib/ceph/osd/ceph-5
/dev/sdj1932G   14G  918G   2% /var/lib/ceph/osd/ceph-30
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] advantages of multiple pools?

2015-04-17 Thread Saverio Proto
For example, you can assign different read/write permissions and
different keyrings to different pools.
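
For example, something along these lines (client and pool names are made
up) creates one key that can read and write a single pool and another key
that can only read it:

  ceph auth get-or-create client.glance mon 'allow r' osd 'allow rwx pool=images'
  ceph auth get-or-create client.readonly mon 'allow r' osd 'allow r pool=images'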

2015-04-17 16:00 GMT+02:00 Chad William Seys cws...@physics.wisc.edu:
 Hi All,
What are the advantages of having multiple ceph pools (if they use the
 whole cluster)?
Thanks!

 C.
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Binding a pool to certain OSDs

2015-04-14 Thread Saverio Proto
Yes you can.
You have to write your own crushmap.

At the end of the crushmap you have rulesets.

Write a ruleset that selects only the OSDs you want. Then you have to
assign the pool to that ruleset.

I have seen examples online where people wanted some pools only on SSD
disks and other pools only on SAS disks. That should not be too far
from what you want to achieve.
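
A rough sketch, assuming you have already grouped the SSD hosts under a
root bucket called 'ssd' in your crushmap (all names and ids are made up):

  rule ssd_only {
          ruleset 4
          type replicated
          min_size 1
          max_size 10
          step take ssd
          step chooseleaf firstn 0 type host
          step emit
  }

  # compile and inject the edited map, then point a pool at the rule:
  crushtool -c crushmap.txt -o crushmap.bin
  ceph osd setcrushmap -i crushmap.bin
  ceph osd pool set <pool> crush_ruleset 4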

ciao,

Saverio



2015-04-13 18:26 GMT+02:00 Giuseppe Civitella giuseppe.civite...@gmail.com:
 Hi all,

 I've got a Ceph cluster which serves volumes to a Cinder installation. It
 runs Emperor.
 I'd like to be able to replace some of the disks with OPAL disks and create
 a new pool which uses exclusively the latter kind of disk. I'd like to have
 a traditional pool and a secure one coexisting on the same ceph host.
 I'd then use Cinder multi backend feature to serve them.
 My question is: how is it possible to realize such a setup? How can I bind a
 pool to certain OSDs?

 Thanks
 Giuseppe

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] All pools have size=3 but MB data and MB used ratio is 1 to 5

2015-04-14 Thread Saverio Proto
2015-03-27 18:27 GMT+01:00 Gregory Farnum g...@gregs42.com:
 Ceph has per-pg and per-OSD metadata overhead. You currently have 26000 PGs,
 suitable for use on a cluster of the order of 260 OSDs. You have placed
 almost 7GB of data into it (21GB replicated) and have about 7GB of
 additional overhead.

 You might try putting a suitable amount of data into the cluster before
 worrying about the ratio of space used to data stored. :)
 -Greg

Hello Greg,

I have now put a suitable amount of data in, and it looks like my ratio is still 1 to 5.
The folder:
/var/lib/ceph/osd/ceph-N/current/meta/
did not grow, so it looks like that is not the problem.

Do you have any hint on how to troubleshoot this issue further?


ansible@zrh-srv-m-cph02:~$ ceph osd pool get .rgw.buckets size
size: 3
ansible@zrh-srv-m-cph02:~$ ceph osd pool get .rgw.buckets min_size
min_size: 2


ansible@zrh-srv-m-cph02:~$ ceph -w
cluster 4179fcec-b336-41a1-a7fd-4a19a75420ea
 health HEALTH_WARN pool .rgw.buckets has too few pgs
 monmap e4: 4 mons at
{rml-srv-m-cph01=10.120.50.20:6789/0,rml-srv-m-cph02=10.120.50.21:6789/0,rml-srv-m-stk03=10.120.50.32:6789/0,zrh-srv-m-cph02=10.120.50.2:6789/0},
election epoch 668, quorum 0,1,2,3
zrh-srv-m-cph02,rml-srv-m-cph01,rml-srv-m-cph02,rml-srv-m-stk03
 osdmap e2170: 54 osds: 54 up, 54 in
  pgmap v619041: 28684 pgs, 15 pools, 109 GB data, 7358 kobjects
518 GB used, 49756 GB / 50275 GB avail
   28684 active+clean

ansible@zrh-srv-m-cph02:~$ ceph df
GLOBAL:
SIZE   AVAIL  RAW USED %RAW USED
50275G 49756G 518G  1.03
POOLS:
NAME                ID  USED   %USED  MAX AVAIL  OBJECTS
rbd                  0    155      0     16461G        2
gianfranco           7    156      0     16461G        2
images               8   257M      0     16461G       38
.rgw.root            9    840      0     16461G        3
.rgw.control        10      0      0     16461G        8
.rgw                11  21334      0     16461G      108
.rgw.gc             12      0      0     16461G       32
.users.uid          13   1575      0     16461G        6
.users              14     72      0     16461G        6
.rgw.buckets.index  15      0      0     16461G       30
.users.swift        17     36      0     16461G        3
.rgw.buckets        18   108G   0.22     16461G  7534745
.intent-log         19      0      0     16461G        0
.rgw.buckets.extra  20      0      0     16461G        0
volumes             21   512M      0     16461G      161
ansible@zrh-srv-m-cph02:~$
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Binding a pool to certain OSDs

2015-04-14 Thread Saverio Proto
There is no error message. You just run out of RAM and blow up the
cluster because you have too many PGs.

Saverio

2015-04-14 18:52 GMT+02:00 Giuseppe Civitella giuseppe.civite...@gmail.com:
 Hi Saverio,

 I first made a test on my test staging lab where I have only 4 OSDs.
 On my mon servers (which run other services) I have 16GB RAM, 15GB used but
 5GB cached. On the OSD servers I have 3GB RAM, 3GB used but 2GB cached.
 ceph -s tells me nothing about PGs, shouldn't I get an error message from
 its output?

 Thanks
 Giuseppe

 2015-04-14 18:20 GMT+02:00 Saverio Proto ziopr...@gmail.com:

 You only have 4 OSDs?
 How much RAM per server?
 I think you already have too many PGs. Check your RAM usage.

 Check the guidelines in the Ceph documentation to dimension the correct number of PGs.
 Remember that every time you create a new pool you add PGs to the
 system.

 Saverio


 2015-04-14 17:58 GMT+02:00 Giuseppe Civitella
 giuseppe.civite...@gmail.com:
  Hi all,
 
  I've been following this tutorial to realize my setup:
 
  http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/
 
  I got this CRUSH map from my test lab:
  http://paste.openstack.org/show/203887/
 
  then I modified the map and uploaded it. This is the final version:
  http://paste.openstack.org/show/203888/
 
  When applied the new CRUSH map, after some rebalancing, I get this
  health
  status:
  [- avalon1 root@controller001 Ceph -] # ceph -s
  cluster af09420b-4032-415e-93fc-6b60e9db064e
   health HEALTH_WARN crush map has legacy tunables; mon.controller001
  low
  disk space; clock skew detected on mon.controller002
   monmap e1: 3 mons at
 
  {controller001=10.235.24.127:6789/0,controller002=10.235.24.128:6789/0,controller003=10.235.24.129:6789/0},
  election epoch 314, quorum 0,1,2
  controller001,controller002,controller003
   osdmap e3092: 4 osds: 4 up, 4 in
pgmap v785873: 576 pgs, 6 pools, 71548 MB data, 18095 objects
  8842 MB used, 271 GB / 279 GB avail
   576 active+clean
 
  and this osd tree:
  [- avalon1 root@controller001 Ceph -] # ceph osd tree
  # idweight  type name   up/down reweight
  -8  2   root sed
  -5  1   host ceph001-sed
  2   1   osd.2   up  1
  -7  1   host ceph002-sed
  3   1   osd.3   up  1
  -1  2   root default
  -4  1   host ceph001-sata
  0   1   osd.0   up  1
  -6  1   host ceph002-sata
  1   1   osd.1   up  1
 
  which seems not a bad situation. The problem rise when I try to create a
  new
  pool, the command ceph osd pool create sed 128 128 gets stuck. It
  never
  ends.  And I noticed that my Cinder installation is not able to create
  volumes anymore.
  I've been looking in the logs for errors and found nothing.
  Any hint about how to proceed to restore my ceph cluster?
  Is there something wrong with the steps I take to update the CRUSH map?
  Is
  the problem related to Emperor?
 
  Regards,
  Giuseppe
 
 
 
 
  2015-04-13 18:26 GMT+02:00 Giuseppe Civitella
  giuseppe.civite...@gmail.com:
 
  Hi all,
 
  I've got a Ceph cluster which serves volumes to a Cinder installation.
  It
  runs Emperor.
  I'd like to be able to replace some of the disks with OPAL disks and
  create a new pool which uses exclusively the latter kind of disk. I'd
  like
  to have a traditional pool and a secure one coexisting on the same
  ceph
  host. I'd then use Cinder multi backend feature to serve them.
  My question is: how is it possible to realize such a setup? How can I
  bind
  a pool to certain OSDs?
 
  Thanks
  Giuseppe
 
 
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Binding a pool to certain OSDs

2015-04-14 Thread Saverio Proto
You only have 4 OSDs?
How much RAM per server?
I think you already have too many PGs. Check your RAM usage.

Check the guidelines in the Ceph documentation to dimension the correct
number of PGs; a rough worked example is below. Remember that every time
you create a new pool you add PGs to the system.
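
If I remember the guideline correctly, the usual rule of thumb is roughly:

  total PGs ~= (number of OSDs * 100) / replica size

so with 4 OSDs and, say, 3 replicas that is about 4 * 100 / 3 ~= 133 PGs in
total across all pools (the docs suggest 128 for clusters with fewer than 5
OSDs), while your cluster already has 576.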

Saverio


2015-04-14 17:58 GMT+02:00 Giuseppe Civitella giuseppe.civite...@gmail.com:
 Hi all,

 I've been following this tutorial to realize my setup:
 http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/

 I got this CRUSH map from my test lab:
 http://paste.openstack.org/show/203887/

 then I modified the map and uploaded it. This is the final version:
 http://paste.openstack.org/show/203888/

 When applied the new CRUSH map, after some rebalancing, I get this health
 status:
 [- avalon1 root@controller001 Ceph -] # ceph -s
 cluster af09420b-4032-415e-93fc-6b60e9db064e
  health HEALTH_WARN crush map has legacy tunables; mon.controller001 low
 disk space; clock skew detected on mon.controller002
  monmap e1: 3 mons at
 {controller001=10.235.24.127:6789/0,controller002=10.235.24.128:6789/0,controller003=10.235.24.129:6789/0},
 election epoch 314, quorum 0,1,2 controller001,controller002,controller003
  osdmap e3092: 4 osds: 4 up, 4 in
   pgmap v785873: 576 pgs, 6 pools, 71548 MB data, 18095 objects
 8842 MB used, 271 GB / 279 GB avail
  576 active+clean

 and this osd tree:
 [- avalon1 root@controller001 Ceph -] # ceph osd tree
 # idweight  type name   up/down reweight
 -8  2   root sed
 -5  1   host ceph001-sed
 2   1   osd.2   up  1
 -7  1   host ceph002-sed
 3   1   osd.3   up  1
 -1  2   root default
 -4  1   host ceph001-sata
 0   1   osd.0   up  1
 -6  1   host ceph002-sata
 1   1   osd.1   up  1

 which does not seem a bad situation. The problem arises when I try to create a new
 pool: the command ceph osd pool create sed 128 128 gets stuck and never
 ends. And I noticed that my Cinder installation is not able to create
 volumes anymore.
 I've been looking in the logs for errors and found nothing.
 Any hint about how to proceed to restore my ceph cluster?
 Is there something wrong with the steps I take to update the CRUSH map? Is
 the problem related to Emperor?

 Regards,
 Giuseppe




 2015-04-13 18:26 GMT+02:00 Giuseppe Civitella
 giuseppe.civite...@gmail.com:

 Hi all,

 I've got a Ceph cluster which serves volumes to a Cinder installation. It
 runs Emperor.
 I'd like to be able to replace some of the disks with OPAL disks and
 create a new pool which uses exclusively the latter kind of disk. I'd like
 to have a traditional pool and a secure one coexisting on the same ceph
 host. I'd then use Cinder multi backend feature to serve them.
 My question is: how is it possible to realize such a setup? How can I bind
 a pool to certain OSDs?

 Thanks
 Giuseppe



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] All pools have size=3 but MB data and MB used ratio is 1 to 5

2015-03-27 Thread Saverio Proto
 I will now start to push a lot of data into the cluster to see whether the
 metadata grows a lot or stays constant.

 Is there a way to clean up old metadata?

I pushed a lot more data into the cluster, then left the cluster idle
for the night.

This morning I found these values:

6841 MB data
25814 MB used

which is a bit more than 1 to 3.

It looks like the extra space is in these folders (for N from 1 to 36):

/var/lib/ceph/osd/ceph-N/current/meta/

These meta folders have a lot of data in them. I would really be happy
to get some pointers to understand what is in there and how to clean it
up eventually.
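
For the record, this is roughly how I compared the sizes (assuming the
usual mount points):

  du -sh /var/lib/ceph/osd/ceph-*/current/meta
  du -sh /var/lib/ceph/osd/ceph-*/current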

The problem is that googling for ceph meta or ceph metadata only
produces results for the Ceph MDS, which is completely unrelated :(

thanks

Saverio
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] All pools have size=3 but MB data and MB used ratio is 1 to 5

2015-03-26 Thread Saverio Proto
Thanks for the answer. Now the meaning of MB data and MB used is
clear, and if all the pools have size=3 I expect a ratio of 1 to 3
between the two values.

I still can't understand why MB used is so big in my setup.
All my pools have size=3, but the ratio of MB data to MB used is 1 to
5 instead of 1 to 3.

My first guess was that I wrote a wrong crushmap that was making more
than 3 copies... (is it really possible to make such a mistake?)

So I replaced my crushmap with the default one, which just spreads
data across hosts, but I see no change: the ratio is still 1 to 5.

I thought maybe my 3 monitors had different views of the pgmap, so I
tried to restart the monitors, but this also did not help.

What useful information can I share here to troubleshoot this issue further?
ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)

Thank you

Saverio



2015-03-25 14:55 GMT+01:00 Gregory Farnum g...@gregs42.com:
 On Wed, Mar 25, 2015 at 1:24 AM, Saverio Proto ziopr...@gmail.com wrote:
 Hello there,

 I started to push data into my ceph cluster. There is something I
 cannot understand in the output of ceph -w.

 When I run ceph -w I get this kind of output:

 2015-03-25 09:11:36.785909 mon.0 [INF] pgmap v278788: 26056 pgs: 26056
 active+clean; 2379 MB data, 19788 MB used, 33497 GB / 33516 GB avail


 2379MB is actually the data I pushed into the cluster, I can see it
 also in the ceph df output, and the numbers are consistent.

 What I don't understand is 19788MB used. All my pools have size 3, so I
 expected something like 2379 * 3. Instead this number is very big.

 I really need to understand how MB used grows because I need to know
 how many disks to buy.

 MB used is the summation of (the programmatic equivalent to) df
 across all your nodes, whereas MB data is calculated by the OSDs
 based on data they've written down. Depending on your configuration
 MB used can include things like the OSD journals, or even totally
 unrelated data if the disks are shared with other applications.

 MB used including the space used by the OSD journals is my first
 guess about what you're seeing here, in which case you'll notice that
 it won't grow any faster than MB data does once the journal is fully
 allocated.
 -Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] All pools have size=3 but MB data and MB used ratio is 1 to 5

2015-03-26 Thread Saverio Proto
 You just need to go look at one of your OSDs and see what data is
 stored on it. Did you configure things so that the journals are using
 a file on the same storage disk? If so, *that* is why the data used
 is large.

I followed your suggestion and this is the result of my troubleshooting.

Each OSD controls a disk that is mounted in a folder with the name:

/var/lib/ceph/osd/ceph-N

where N is the OSD number

The journal is stored on another disk drive. I have three extra SSD
drives per server, each partitioned into 6 partitions, and those
partitions are used as journal partitions.
I checked that the setup is correct because each
/var/lib/ceph/osd/ceph-N/journal points correctly to another drive.
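
A quick check on each node should confirm that, something like:

  ls -l /var/lib/ceph/osd/ceph-*/journal
  # each symlink should point to a partition on one of the SSDs, e.g. /dev/sdb1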

With df -h I see the folders where my OSDs are mounted. The space
usage looks well distributed among all OSDs, as expected.

the data is always in a folder called:

/var/lib/ceph/osd/ceph-N/current

I checked with the tool ncdu where the data is stored inside the
current folders.

in each OSD there is a folder with a lot of data called

/var/lib/ceph/osd/ceph-N/current/meta

If I sum the MB of each meta folder, that accounts for more or less the
extra space that is consumed, which leads to the 1 to 5 ratio.

The meta folder contains a lot of unreadable binary files, but judging
from the file names it looks like this is where the versions of the
osdmap are stored.

But it is really a lot of metadata.

I will now start to push a lot of data into the cluster to see whether the
metadata grows a lot or stays constant.

Is there a way to clean up old metadata?

thanks

Saverio
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph -w: Understanding MB data versus MB used

2015-03-25 Thread Saverio Proto
Hello there,

I started to push data into my ceph cluster. There is something I
cannot understand in the output of ceph -w.

When I run ceph -w I get this kind of output:

2015-03-25 09:11:36.785909 mon.0 [INF] pgmap v278788: 26056 pgs: 26056
active+clean; 2379 MB data, 19788 MB used, 33497 GB / 33516 GB avail


2379MB is actually the data I pushed into the cluster, I can see it
also in the ceph df output, and the numbers are consistent.

What I don't understand is 19788MB used. All my pools have size 3, so I
expected something like 2379 * 3. Instead this number is very big.

I really need to understand how MB used grows because I need to know
how many disks to buy.

Any hints ?

thank you.

Saverio
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph in Production: best practice to monitor OSD up/down status

2015-03-23 Thread Saverio Proto
Hello,

thanks for the answers.

This is exactly what I was looking for:

mon_osd_down_out_interval = 900

I was not waiting long enough to see my cluster recover by itself.
That's why I tried to increase min_size: I did not understand what
min_size was for.

Now that I know what min_size is, I guess the best setting for me is
min_size = 1, because I would like to be able to perform I/O operations
even if only 1 copy is left.
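
If that is what you decide, something like this should do it on an existing
pool (the pool name is just an example), together with the down/out interval
in ceph.conf:

  ceph osd pool set rbd min_size 1

  # ceph.conf, [global] (or [mon]) section:
  mon osd down out interval = 900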

Thanks to all for helping !

Saverio



2015-03-23 14:58 GMT+01:00 Gregory Farnum g...@gregs42.com:
 On Sun, Mar 22, 2015 at 2:55 AM, Saverio Proto ziopr...@gmail.com wrote:
 Hello,

 I started to work with CEPH few weeks ago, I might ask a very newbie
 question, but I could not find an answer in the docs or in the ml
 archive for this.

 Quick description of my setup:
 I have a ceph cluster with two servers. Each server has 3 SSD drives I
 use for journal only. To map to different failure domains SAS disks
 that keep a journal to the same SSD drive, I wrote my own crushmap.
 I have now a total of 36OSD. Ceph health returns HEALTH_OK.
 I run the cluster with a couple of pools with size=3 and min_size=3


 Production operations questions:
 I manually stopped some OSDs to simulate a failure.

 As far as I understood, an OSD down condition is not enough to make
 CEPH start making new copies of objects. I noticed that I must mark
 the OSD as out to make ceph produce new copies.
 As far as I understood min_size=3 puts the object in readonly if there
 are not at least 3 copies of the object available.

 That is correct, but the default with size 3 is 2 and you probably
 want to do that instead. If you have size==min_size on firefly
 releases and lose an OSD it can't do recovery so that PG is stuck
 without manual intervention. :( This is because of some quirks about
 how the OSD peering and recovery works, so you'd be forgiven for
 thinking it would recover nicely.
 (This is changed in the upcoming Hammer release, but you probably
 still want to allow cluster activity when an OSD fails, unless you're
 very confident in their uptime and more concerned about durability
 than availability.)
 -Greg


 Is this behavior correct or I made some mistake creating the cluster ?
 Should I expect ceph to produce automatically a new copy for objects
 when some OSDs are down ?
 There is any option to mark automatically out OSDs that go down ?

 thanks

 Saverio
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph in Production: best practice to monitor OSD up/down status

2015-03-22 Thread Saverio Proto
Hello,

I started to work with Ceph a few weeks ago, so I might be asking a very
newbie question, but I could not find an answer in the docs or in the ML
archive for this.

Quick description of my setup:
I have a Ceph cluster with two servers. Each server has 3 SSD drives that I
use for journals only. To map the SAS disks that keep their journal on the
same SSD drive to different failure domains, I wrote my own crushmap.
I now have a total of 36 OSDs. Ceph health returns HEALTH_OK.
I run the cluster with a couple of pools with size=3 and min_size=3.


Production operations questions:
I manually stopped some OSDs to simulate a failure.

As far as I understand, an OSD being down is not enough to make
Ceph start making new copies of objects. I noticed that I must mark
the OSD as out to make Ceph produce new copies.
As far as I understand, min_size=3 puts the objects in read-only mode if
there are not at least 3 copies of the object available.

Is this behavior correct, or did I make some mistake creating the cluster?
Should I expect Ceph to automatically produce a new copy of objects
when some OSDs are down?
Is there any option to automatically mark out OSDs that go down?

thanks

Saverio
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com