[ceph-users] Multisite not deleting old data

2019-09-30 Thread Enrico Kern
Hello,

we run a multisite setup between Berlin (master) and Amsterdam (slave) with
Mimic. We had a huge bucket of around 40 TB which was deleted a while ago.
However, the data does not seem to have been deleted on the slave:

from rados df:

POOL_NAME                   USED    OBJECTS   CLONES  COPIES    MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS      RD       WR_OPS     WR
berlin.rgw.buckets.data     32 TiB  31638448  0       94915344  0                   0        0         4936153274  989 TiB  644842251  153 TiB
amsterdam.rgw.buckets.data  70 TiB  28887118  0       86661354  0                   0        0         275985124   203 TiB  232226203  90 TiB

The bucket itself doesn't exist anymore on either the slave or the master. Any
idea what to do? Syncing of new data seems to work. I tried a manual resync of
everything, but it says everything is back in sync; it just isn't getting rid
of the data.
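
For reference, a minimal sketch of the checks I would start with on the
secondary zone (the bucket name is a placeholder, and the stale-instances
command assumes a recent enough Mimic point release):

# look for leftover bucket instance metadata belonging to the deleted bucket
radosgw-admin metadata list bucket.instance | grep bucketname
radosgw-admin reshard stale-instances list

# make sure garbage collection has actually processed the pending deletes
radosgw-admin gc list --include-all | head
radosgw-admin gc process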

-- 

*Enrico Kern*
Chief Information Officer

enrico.k...@glispa.com
+49 (0) 30 555713017 / +49 (0) 152 26814501
skype: flyersa
LinkedIn Profile <https://www.linkedin.com/in/enricokern>


<https://www.glispa.com/>

*Glispa GmbH* | Berlin Office
Stromstr. 11-17  <https://goo.gl/maps/6mwNA77gXLP2>
Berlin, Germany, 10551  <https://goo.gl/maps/6mwNA77gXLP2>

Managing Director Or Ifrah
Registered in Berlin
AG Charlottenburg | HRB 114678B
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Set existing pools to use hdd device class only

2018-08-20 Thread Enrico Kern
Hmm, then that is not really an option for me. Maybe someone from the devs can
shed some light on why it triggers a migration as long as you only have OSDs of
the same class? I have a few petabytes of storage in each cluster. When it
starts migrating everything again, that will result in a huge performance
bottleneck. My plan was to set the existing pools to use the new HDD-only crush
rule and add SSD OSDs later.
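
For reference, a sketch of how I plan to dry-run the change first, assuming
the rule IDs reported by ceph osd crush rule dump (crushtool only simulates
mappings against the exported map, nothing is applied to the cluster):

# export the current crush map and look up the rule ids
ceph osd getcrushmap -o crushmap.bin
ceph osd crush rule dump

# simulate PG-to-OSD mappings for the old rule (id 0) and the new hdd rule (id 1)
crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-mappings > old-mappings.txt
crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-mappings > new-mappings.txt

# any differing lines are PGs that would move
diff old-mappings.txt new-mappings.txt | head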

On Mon, Aug 20, 2018 at 11:22 AM Marc Roos  wrote:

>
> I just recently did the same. Take into account that everything starts
> migrating. However weird it may be, I had an hdd-only test cluster and changed
> the crush rule to hdd. It took a few days, totally unnecessary as
> far as I am concerned.
>
>
>
>
> -Original Message-
> From: Enrico Kern [mailto:enrico.k...@glispa.com]
> Sent: maandag 20 augustus 2018 11:18
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Set existing pools to use hdd device class only
>
> Hello,
>
> right now we have multiple HDD only clusters with ether filestore
> journals on SSDs or on newer installations WAL etc. on SSD.
>
> I plan to extend our ceph clusters with SSDs to provide ssd only pools.
> In luminous we have devices classes so that i should be able todo this
> without editing around crush map.
>
> In the device class doc it says i can create "new" pools to use only
> ssds as example:
>
> ceph osd crush rule create-replicated fast default host ssd
>
> what happens if i fire this on an existing pool but with hdd device
> class? I wasnt able to test thi yet in our staging cluster and wanted to
> ask whats the way todo this.
>
> I want to set an existing pool called volumes to only use osds with hdd
> class. Right now all OSDs have hdds. So in theory it should not use
> newly created SSD osds once i set them all up for hdd classes right?
>
> So for existing pool running:
>
> ceph osd crush rule create-replicated volume-hdd-only default host hdd
> ceph osd pool set volumes crush_rule volume-hdd-only
>
>
> should be the way to go right?
>
>
> Regards,
>
> Enrico
>
> --
>
>
> Enrico Kern
>
> VP IT Operations
>
>
> enrico.k...@glispa.com
> +49 (0) 30 555713017 / +49 (0) 152 26814501
>
> skype: flyersa
> LinkedIn Profile <https://www.linkedin.com/in/enricokern>
>
>
> <https://www.glispa.com/>
>
>
> Glispa GmbH | Berlin Office
>
> Stromstr. 11-17  <https://goo.gl/maps/6mwNA77gXLP2>
> Berlin, Germany, 10551  <https://goo.gl/maps/6mwNA77gXLP2>
>
> Managing Director Din Karol-Gavish
> Registered in Berlin
> AG Charlottenburg | HRB 114678B
>
>
>

-- 

*Enrico Kern*
VP IT Operations

enrico.k...@glispa.com
+49 (0) 30 555713017 / +49 (0) 152 26814501
skype: flyersa
LinkedIn Profile <https://www.linkedin.com/in/enricokern>


<https://www.glispa.com/>

*Glispa GmbH* | Berlin Office
Stromstr. 11-17  <https://goo.gl/maps/6mwNA77gXLP2>
Berlin, Germany, 10551  <https://goo.gl/maps/6mwNA77gXLP2>

Managing Director Din Karol-Gavish
Registered in Berlin
AG Charlottenburg | HRB 114678B
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Set existing pools to use hdd device class only

2018-08-20 Thread Enrico Kern
Hello,

right now we have multiple HDD-only clusters with either filestore journals
on SSDs or, on newer installations, WAL etc. on SSD.

I plan to extend our ceph clusters with SSDs to provide SSD-only pools. In
Luminous we have device classes, so I should be able to do this without
hand-editing the crush map.

In the device class docs it says I can create "new" pools that use only SSDs,
for example:

ceph osd crush rule create-replicated fast default host ssd


What happens if I fire this at an existing pool, but with the hdd device class?
I wasn't able to test this yet in our staging cluster and wanted to ask what
the way to do this is.

I want to set an existing pool called volumes to only use OSDs with the hdd
class. Right now all OSDs are HDDs. So in theory it should not use newly
created SSD OSDs once I have pinned everything to the hdd class, right?

So for the existing pool, running:

ceph osd crush rule create-replicated volume-hdd-only default host hdd
ceph osd pool set volumes crush_rule volume-hdd-only

should be the way to go, right?
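
For reference, a sketch of how I would verify the result afterwards (rule and
pool names as above; the default~hdd shadow root is what a class-restricted
rule selects in Luminous):

# confirm the pool now points at the new rule
ceph osd pool get volumes crush_rule

# inspect the rule and the shadow hierarchy it uses
ceph osd crush rule dump volume-hdd-only
ceph osd crush tree --show-shadow

# device class per OSD at a glance
ceph osd df tree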


Regards,

Enrico

-- 

*Enrico Kern*
VP IT Operations

enrico.k...@glispa.com
+49 (0) 30 555713017 / +49 (0) 152 26814501
skype: flyersa
LinkedIn Profile <https://www.linkedin.com/in/enricokern>


<https://www.glispa.com/>

*Glispa GmbH* | Berlin Office
Stromstr. 11-17  <https://goo.gl/maps/6mwNA77gXLP2>
Berlin, Germany, 10551  <https://goo.gl/maps/6mwNA77gXLP2>

Managing Director Din Karol-Gavish
Registered in Berlin
AG Charlottenburg | HRB 114678B
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw multizone not syncing large bucket completly to other zone

2018-07-11 Thread Enrico Kern
I changed the endpoints to bypass the load balancers for sync, but the
problem still remains. I will probably re-create the bucket and re-copy the
data to see if that changes anything. I can't make anything out of all the log
messages; I need to dig deeper into that.
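
For reference, a sketch of the per-bucket resync I want to try before
re-creating the bucket (bucket and zone names are placeholders; the run
variant with debug logging was suggested earlier in this thread):

# what the secondary zone currently thinks about this bucket
radosgw-admin bucket sync status --bucket=bucketname --source-zone=berlin

# run the bucket sync by hand with verbose logging
radosgw-admin bucket sync run --bucket=bucketname --source-zone=berlin --debug-rgw=20 --debug-ms=1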

On Sun, Jul 8, 2018 at 4:55 PM Enrico Kern  wrote:

> Hello,
>
> yes we are using haproxy on the secondary zone, but  A10 Hardware
> Loadbalancers on the Master zone. So i suspect there are some timeouts that
> may cause this issue then if having loadbalancers in front of it be the
> problem?
>
> Will check if changing to the ips directly will fix the issue
>
> On Sun, Jul 8, 2018 at 11:51 AM Orit Wasserman 
> wrote:
>
>> Hi Enrico,
>>
>> On Fri, Jun 29, 2018 at 7:50 PM Enrico Kern 
>> wrote:
>>
>>> hmm that also pops up right away when i restart all radosgw instances.
>>> But i will check further and see if i can find something. Maybe doing the
>>> upgrade to mimic too.
>>>
>>> That bucket is basically under load on the master zone all the time as
>>> we use it as historical storage for druid, so there is constantly data
>>> written to it. I just dont get why disabling/enabling sync on the bucket
>>> flawless syncs everything while if i just keep it enabled it stops syncing
>>> at all. For the last days i was just running disabling/enabling for the
>>> bucket in a while loop with 30 minute interval, but thats no persistent fix
>>> ;)
>>>
>>>
>> Are you using HAProxy? We have seen sync stalls with it.
>> The simplest workaround is to configure the radosgw addresses as the
>> sync endpoints instead of the haproxy's.
>>
>> Regards,
>> Orit
>>
>>
>>
>>> On Fri, Jun 29, 2018 at 6:15 PM Yehuda Sadeh-Weinraub 
>>> wrote:
>>>
>>>>
>>>>
>>>> On Fri, Jun 29, 2018 at 8:48 AM, Enrico Kern 
>>>> wrote:
>>>>
>>>>> also when i try to sync the bucket manual i get this error:
>>>>>
>>>>> ERROR: sync.run() returned ret=-16
>>>>> 2018-06-29 15:47:50.137268 7f54b7e4ecc0  0 data sync: ERROR: failed to
>>>>> read sync status for
>>>>> bucketname:6a9448d2-bdba-4bec-aad6-aba72cd8eac6.27150814.1
>>>>>
>>>>> it works flawless with all other buckets.
>>>>>
>>>>
>>>> error 16 is EBUSY: meaning it can't take a lease to do work on the
>>>> bucket. This usually happens when another entity (e.g., a running radosgw
>>>> process) is working on it at the same time. Either something took the lease
>>>> and never gave it back (leases shouldn't be indefinite, usually are being
>>>> taken for a short period but are renewed periodically), or there might be
>>>> some other bug related to the lease itself. I would start by first figuring
>>>> out whether it's the first case or the second one. On the messenger log
>>>> there should be a message prior to that that shows the operation that got
>>>> the -16 as a response (should have something like "...=-16 (Device or
>>>> resource busy)" in it). The same line would also contain the name of the
>>>> rados object that is used to manage the lease. Try to look at the running
>>>> radosgw log at the same time when this happens, and check whether there are
>>>> other operations on that object.
>>>> One thing to note is that if you run a sync on a bucket and stop it
>>>> uncleanly in the middle (e.g., like killing the process), the lease will
>>>> stay locked for a period of time (Something in the order of 1 to 2 
>>>> minutes).
>>>>
>>>> Yehuda
>>>>
>>>>>
>>>>>
>>>>> On Fri, Jun 29, 2018 at 5:39 PM Enrico Kern 
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> thanks for the reply.
>>>>>>
>>>>>> We have around 200k objects in the bucket. It is not automatic
>>>>>> resharded (is that even supported in multisite?)
>>>>>>
>>>>>> What i see when i run a complete data sync with the debug logs after
>>>>>> a while i see alot of informations that it is unable to perform some log
>>>>>> and also some device or resource busy (also with alot of different osds,
>>>>>> restarting the osds also doesnt make this error going away):
>>>>>>
>>>>>>
>>>&

Re: [ceph-users] radosgw multizone not syncing large bucket completly to other zone

2018-07-08 Thread Enrico Kern
Hello,

yes, we are using haproxy on the secondary zone, but A10 hardware load
balancers on the master zone. So I suspect there are some timeouts causing
this issue, if having load balancers in front of the gateways is the problem?

I will check whether changing the endpoints to the IPs directly fixes the issue.
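
For reference, a sketch of the change I have in mind, assuming the amsterdam
zone and placeholder radosgw addresses (the period commit distributes the new
endpoints to both sites):

# show the endpoints currently configured for the secondary zone
radosgw-admin zone get --rgw-zone=amsterdam

# point sync traffic at the radosgw hosts directly instead of the load balancer
radosgw-admin zone modify --rgw-zone=amsterdam --endpoints=http://rgw1.example.com:8080,http://rgw2.example.com:8080
radosgw-admin period update --commit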

On Sun, Jul 8, 2018 at 11:51 AM Orit Wasserman  wrote:

> Hi Enrico,
>
> On Fri, Jun 29, 2018 at 7:50 PM Enrico Kern 
> wrote:
>
>> hmm that also pops up right away when i restart all radosgw instances.
>> But i will check further and see if i can find something. Maybe doing the
>> upgrade to mimic too.
>>
>> That bucket is basically under load on the master zone all the time as we
>> use it as historical storage for druid, so there is constantly data written
>> to it. I just dont get why disabling/enabling sync on the bucket flawless
>> syncs everything while if i just keep it enabled it stops syncing at all.
>> For the last days i was just running disabling/enabling for the bucket in a
>> while loop with 30 minute interval, but thats no persistent fix ;)
>>
>>
> Are you using HAProxy? We have seen sync stalls with it.
> The simplest workaround is to configure the radosgw addresses as the
> sync endpoints instead of the haproxy's.
>
> Regards,
> Orit
>
>
>
>> On Fri, Jun 29, 2018 at 6:15 PM Yehuda Sadeh-Weinraub 
>> wrote:
>>
>>>
>>>
>>> On Fri, Jun 29, 2018 at 8:48 AM, Enrico Kern 
>>> wrote:
>>>
>>>> also when i try to sync the bucket manual i get this error:
>>>>
>>>> ERROR: sync.run() returned ret=-16
>>>> 2018-06-29 15:47:50.137268 7f54b7e4ecc0  0 data sync: ERROR: failed to
>>>> read sync status for
>>>> bucketname:6a9448d2-bdba-4bec-aad6-aba72cd8eac6.27150814.1
>>>>
>>>> it works flawless with all other buckets.
>>>>
>>>
>>> error 16 is EBUSY: meaning it can't take a lease to do work on the
>>> bucket. This usually happens when another entity (e.g., a running radosgw
>>> process) is working on it at the same time. Either something took the lease
>>> and never gave it back (leases shouldn't be indefinite, usually are being
>>> taken for a short period but are renewed periodically), or there might be
>>> some other bug related to the lease itself. I would start by first figuring
>>> out whether it's the first case or the second one. On the messenger log
>>> there should be a message prior to that that shows the operation that got
>>> the -16 as a response (should have something like "...=-16 (Device or
>>> resource busy)" in it). The same line would also contain the name of the
>>> rados object that is used to manage the lease. Try to look at the running
>>> radosgw log at the same time when this happens, and check whether there are
>>> other operations on that object.
>>> One thing to note is that if you run a sync on a bucket and stop it
>>> uncleanly in the middle (e.g., like killing the process), the lease will
>>> stay locked for a period of time (Something in the order of 1 to 2 minutes).
>>>
>>> Yehuda
>>>
>>>>
>>>>
>>>> On Fri, Jun 29, 2018 at 5:39 PM Enrico Kern 
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> thanks for the reply.
>>>>>
>>>>> We have around 200k objects in the bucket. It is not automatic
>>>>> resharded (is that even supported in multisite?)
>>>>>
>>>>> What i see when i run a complete data sync with the debug logs after a
>>>>> while i see alot of informations that it is unable to perform some log and
>>>>> also some device or resource busy (also with alot of different osds,
>>>>> restarting the osds also doesnt make this error going away):
>>>>>
>>>>>
>>>>> 018-06-29 15:18:30.391085 7f38bf882cc0 20
>>>>> cr:s=0x55de55700b20:op=0x55de55717010:20RGWContinuousLeaseCR: couldn't 
>>>>> lock
>>>>> amsterdam.rgw.log:datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.59:sync_lock:
>>>>> retcode=-16
>>>>>
>>>>> 2018-06-29 15:18:30.391094 7f38bf882cc0 20
>>>>> cr:s=0x55de55732750:op=0x55de5572d970:20RGWContinuousLeaseCR: couldn't 
>>>>> lock
>>>>> amsterdam.rgw.log:datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.10:sync_lock:
>>>>> retcode=-16
>>>>>
>>>>> 2018-06-29 

Re: [ceph-users] radosgw multizone not syncing large bucket completly to other zone

2018-06-29 Thread Enrico Kern
Hmm, that also pops up right away when I restart all radosgw instances. But
I will check further and see if I can find something. Maybe I'll do the
upgrade to Mimic too.

That bucket is basically under load on the master zone all the time, as we
use it as historical storage for Druid, so data is constantly written to it.
I just don't get why disabling/enabling sync on the bucket flawlessly syncs
everything, while if I just keep it enabled it stops syncing entirely. For the
last few days I was just running disable/enable for the bucket in a while loop
with a 30-minute interval, but that's no persistent fix ;)
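
For reference, the workaround is literally just this loop (the bucket name is
a placeholder, 1800 seconds being the 30-minute interval):

# crude workaround: periodically toggle bucket sync so replication catches up again
while true; do
    radosgw-admin bucket sync disable --bucket=bucketname
    sleep 10
    radosgw-admin bucket sync enable --bucket=bucketname
    sleep 1800
done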

On Fri, Jun 29, 2018 at 6:15 PM Yehuda Sadeh-Weinraub 
wrote:

>
>
> On Fri, Jun 29, 2018 at 8:48 AM, Enrico Kern 
> wrote:
>
>> also when i try to sync the bucket manual i get this error:
>>
>> ERROR: sync.run() returned ret=-16
>> 2018-06-29 15:47:50.137268 7f54b7e4ecc0  0 data sync: ERROR: failed to
>> read sync status for
>> bucketname:6a9448d2-bdba-4bec-aad6-aba72cd8eac6.27150814.1
>>
>> it works flawless with all other buckets.
>>
>
> error 16 is EBUSY: meaning it can't take a lease to do work on the bucket.
> This usually happens when another entity (e.g., a running radosgw process)
> is working on it at the same time. Either something took the lease and
> never gave it back (leases shouldn't be indefinite, usually are being taken
> for a short period but are renewed periodically), or there might be some
> other bug related to the lease itself. I would start by first figuring out
> whether it's the first case or the second one. On the messenger log there
> should be a message prior to that that shows the operation that got the -16
> as a response (should have something like "...=-16 (Device or resource
> busy)" in it). The same line would also contain the name of the rados
> object that is used to manage the lease. Try to look at the running radosgw
> log at the same time when this happens, and check whether there are other
> operations on that object.
> One thing to note is that if you run a sync on a bucket and stop it
> uncleanly in the middle (e.g., like killing the process), the lease will
> stay locked for a period of time (Something in the order of 1 to 2 minutes).
>
> Yehuda
>
>>
>>
>> On Fri, Jun 29, 2018 at 5:39 PM Enrico Kern 
>> wrote:
>>
>>> Hello,
>>>
>>> thanks for the reply.
>>>
>>> We have around 200k objects in the bucket. It is not automatic resharded
>>> (is that even supported in multisite?)
>>>
>>> What i see when i run a complete data sync with the debug logs after a
>>> while i see alot of informations that it is unable to perform some log and
>>> also some device or resource busy (also with alot of different osds,
>>> restarting the osds also doesnt make this error going away):
>>>
>>>
>>> 018-06-29 15:18:30.391085 7f38bf882cc0 20
>>> cr:s=0x55de55700b20:op=0x55de55717010:20RGWContinuousLeaseCR: couldn't lock
>>> amsterdam.rgw.log:datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.59:sync_lock:
>>> retcode=-16
>>>
>>> 2018-06-29 15:18:30.391094 7f38bf882cc0 20
>>> cr:s=0x55de55732750:op=0x55de5572d970:20RGWContinuousLeaseCR: couldn't lock
>>> amsterdam.rgw.log:datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.10:sync_lock:
>>> retcode=-16
>>>
>>> 2018-06-29 15:22:01.618744 7f38ad4c7700  1 -- 10.30.3.67:0/3390890604
>>> <== osd.43 10.30.3.44:6800/29982 13272  osd_op_reply(258628
>>> datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.52 [call]
>>> v14448'24265315 uv24265266 ondisk = -16 ((16) Device or resource busy)) v8
>>>  209+0+0 (2379682838 0 0) 0x7f38a8005110 con 0x7f3868003380
>>>
>>> 2018-06-29 15:22:01.618829 7f38ad4c7700  1 -- 10.30.3.67:0/3390890604
>>> <== osd.43 10.30.3.44:6800/29982 13273  osd_op_reply(258629
>>> datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.105 [call]
>>> v14448'24265316 uv24265256 ondisk = -16 ((16) Device or resource busy)) v8
>>>  210+0+0 (4086289880 0 0) 0x7f38a8005110 con 0x7f3868003380
>>>
>>>
>>> There are no issues with the OSDs all other stuff in the cluster works
>>> (rbd, images to openstack etc.)
>>>
>>>
>>> Also that command with appending debug never finishes.
>>>
>>> On Tue, Jun 26, 2018 at 5:45 PM Yehuda Sadeh-Weinraub 
>>> wrote:
>>>
>>>>
>>>>
>>>> On Sun, Jun 24, 2018 at 12:59 AM, Enrico Kern <
>>>> enrico.k...@glispamedia.com> wrote:
>

Re: [ceph-users] radosgw multizone not syncing large bucket completly to other zone

2018-06-29 Thread Enrico Kern
Also, when I try to sync the bucket manually I get this error:

ERROR: sync.run() returned ret=-16
2018-06-29 15:47:50.137268 7f54b7e4ecc0  0 data sync: ERROR: failed to read
sync status for bucketname:6a9448d2-bdba-4bec-aad6-aba72cd8eac6.27150814.1

It works flawlessly with all other buckets.


On Fri, Jun 29, 2018 at 5:39 PM Enrico Kern  wrote:

> Hello,
>
> thanks for the reply.
>
> We have around 200k objects in the bucket. It is not automatic resharded
> (is that even supported in multisite?)
>
> What i see when i run a complete data sync with the debug logs after a
> while i see alot of informations that it is unable to perform some log and
> also some device or resource busy (also with alot of different osds,
> restarting the osds also doesnt make this error going away):
>
>
> 018-06-29 15:18:30.391085 7f38bf882cc0 20
> cr:s=0x55de55700b20:op=0x55de55717010:20RGWContinuousLeaseCR: couldn't lock
> amsterdam.rgw.log:datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.59:sync_lock:
> retcode=-16
>
> 2018-06-29 15:18:30.391094 7f38bf882cc0 20
> cr:s=0x55de55732750:op=0x55de5572d970:20RGWContinuousLeaseCR: couldn't lock
> amsterdam.rgw.log:datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.10:sync_lock:
> retcode=-16
>
> 2018-06-29 15:22:01.618744 7f38ad4c7700  1 -- 10.30.3.67:0/3390890604 <==
> osd.43 10.30.3.44:6800/29982 13272  osd_op_reply(258628
> datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.52 [call]
> v14448'24265315 uv24265266 ondisk = -16 ((16) Device or resource busy)) v8
>  209+0+0 (2379682838 0 0) 0x7f38a8005110 con 0x7f3868003380
>
> 2018-06-29 15:22:01.618829 7f38ad4c7700  1 -- 10.30.3.67:0/3390890604 <==
> osd.43 10.30.3.44:6800/29982 13273  osd_op_reply(258629
> datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.105 [call]
> v14448'24265316 uv24265256 ondisk = -16 ((16) Device or resource busy)) v8
>  210+0+0 (4086289880 0 0) 0x7f38a8005110 con 0x7f3868003380
>
>
> There are no issues with the OSDs all other stuff in the cluster works
> (rbd, images to openstack etc.)
>
>
> Also that command with appending debug never finishes.
>
> On Tue, Jun 26, 2018 at 5:45 PM Yehuda Sadeh-Weinraub 
> wrote:
>
>>
>>
>> On Sun, Jun 24, 2018 at 12:59 AM, Enrico Kern <
>> enrico.k...@glispamedia.com> wrote:
>>
>>> Hello,
>>>
>>> We have two ceph luminous clusters (12.2.5).
>>>
>>> recently one of our big buckets stopped syncing properly. We have a one
>>> specific bucket which is around 30TB in size consisting of alot of
>>> directories with each one having files of 10-20MB.
>>>
>>> The secondary zone is often completly missing multiple days of data in
>>> this bucket, while all other smaller buckets sync just fine.
>>>
>>> Even with the complete data missing radosgw-admin sync status always
>>> says everything is fine.
>>>
>>> the sync error log doesnt show anything for those days.
>>>
>>> Running
>>>
>>> radosgw-admin metadata sync and data sync also doesnt solve the issue.
>>> The only way of making it sync again is to disable and re-eanble the sync.
>>> That needs to be done as often as like 10 times in an hour to make it sync
>>> properly.
>>>
>>> radosgw-admin bucket sync disable
>>> radosgw-admin bucket sync enable
>>>
>>> when i run data init i sometimes get this:
>>>
>>>  radosgw-admin data sync init --source-zone berlin
>>> 2018-06-24 07:55:46.337858 7fe7557fa700  0 ERROR: failed to distribute
>>> cache for
>>> amsterdam.rgw.log:datalog.sync-status.6a9448d2-bdba-4bec-aad6-aba72cd8eac6
>>>
>>> Sometimes when really alot of data is missing (yesterday it was more
>>> then 1 month) this helps making them get in sync again when run on the
>>> secondary zone:
>>>
>>> radosgw-admin bucket check --fix --check-objects
>>>
>>> how can i debug that problem further? We have so many requests on the
>>> cluster that is is hard to dig something out of the log files..
>>>
>>> Given all the smaller buckets are perfectly in sync i suspect some
>>> problem because of the size of the bucket.
>>>
>>
>> How many objects in the bucket? Is it getting automatically resharded?
>>
>>
>>>
>>> Any points to the right direction are greatly appreciated.
>>>
>>
>> A few things to look at that might help identify the issue.
>>
>> What does this show (I think the luminous command is as follows):
>>
>>

Re: [ceph-users] radosgw multizone not syncing large bucket completly to other zone

2018-06-29 Thread Enrico Kern
Hello,

thanks for the reply.

We have around 200k objects in the bucket. It is not automatically resharded
(is that even supported in multisite?).

When I run a complete data sync with the debug logs, after a while I see a
lot of messages saying that it cannot lock some datalog shards, and also some
"device or resource busy" errors (with a lot of different OSDs; restarting the
OSDs does not make this error go away):


2018-06-29 15:18:30.391085 7f38bf882cc0 20
cr:s=0x55de55700b20:op=0x55de55717010:20RGWContinuousLeaseCR: couldn't lock
amsterdam.rgw.log:datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.59:sync_lock:
retcode=-16

2018-06-29 15:18:30.391094 7f38bf882cc0 20
cr:s=0x55de55732750:op=0x55de5572d970:20RGWContinuousLeaseCR: couldn't lock
amsterdam.rgw.log:datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.10:sync_lock:
retcode=-16

2018-06-29 15:22:01.618744 7f38ad4c7700  1 -- 10.30.3.67:0/3390890604 <==
osd.43 10.30.3.44:6800/29982 13272  osd_op_reply(258628
datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.52 [call]
v14448'24265315 uv24265266 ondisk = -16 ((16) Device or resource busy)) v8
 209+0+0 (2379682838 0 0) 0x7f38a8005110 con 0x7f3868003380

2018-06-29 15:22:01.618829 7f38ad4c7700  1 -- 10.30.3.67:0/3390890604 <==
osd.43 10.30.3.44:6800/29982 13273  osd_op_reply(258629
datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.105 [call]
v14448'24265316 uv24265256 ondisk = -16 ((16) Device or resource busy)) v8
 210+0+0 (4086289880 0 0) 0x7f38a8005110 con 0x7f3868003380


There are no issues with the OSDs; everything else in the cluster works
(RBD, images for OpenStack, etc.).


Also, that command with the debug flags appended never finishes.
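
For reference, a sketch of how I would check who currently holds the lease on
one of those datalog shard objects (pool, object name and zone id are taken
from the log lines above; sync_lock is the lock name shown there):

# show the holder of the cls lock on one of the datalog sync-status shards
rados -p amsterdam.rgw.log lock info datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.59 sync_lock

# list watchers to see which radosgw instance has the object open
rados -p amsterdam.rgw.log listwatchers datalog.sync-status.shard.6a9448d2-bdba-4bec-aad6-aba72cd8eac6.59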

On Tue, Jun 26, 2018 at 5:45 PM Yehuda Sadeh-Weinraub 
wrote:

>
>
> On Sun, Jun 24, 2018 at 12:59 AM, Enrico Kern  > wrote:
>
>> Hello,
>>
>> We have two ceph luminous clusters (12.2.5).
>>
>> recently one of our big buckets stopped syncing properly. We have a one
>> specific bucket which is around 30TB in size consisting of alot of
>> directories with each one having files of 10-20MB.
>>
>> The secondary zone is often completly missing multiple days of data in
>> this bucket, while all other smaller buckets sync just fine.
>>
>> Even with the complete data missing radosgw-admin sync status always says
>> everything is fine.
>>
>> the sync error log doesnt show anything for those days.
>>
>> Running
>>
>> radosgw-admin metadata sync and data sync also doesnt solve the issue.
>> The only way of making it sync again is to disable and re-eanble the sync.
>> That needs to be done as often as like 10 times in an hour to make it sync
>> properly.
>>
>> radosgw-admin bucket sync disable
>> radosgw-admin bucket sync enable
>>
>> when i run data init i sometimes get this:
>>
>>  radosgw-admin data sync init --source-zone berlin
>> 2018-06-24 07:55:46.337858 7fe7557fa700  0 ERROR: failed to distribute
>> cache for
>> amsterdam.rgw.log:datalog.sync-status.6a9448d2-bdba-4bec-aad6-aba72cd8eac6
>>
>> Sometimes when really alot of data is missing (yesterday it was more then
>> 1 month) this helps making them get in sync again when run on the secondary
>> zone:
>>
>> radosgw-admin bucket check --fix --check-objects
>>
>> how can i debug that problem further? We have so many requests on the
>> cluster that is is hard to dig something out of the log files..
>>
>> Given all the smaller buckets are perfectly in sync i suspect some
>> problem because of the size of the bucket.
>>
>
> How many objects in the bucket? Is it getting automatically resharded?
>
>
>>
>> Any points to the right direction are greatly appreciated.
>>
>
> A few things to look at that might help identify the issue.
>
> What does this show (I think the luminous command is as follows):
>
> $ radosgw-admin bucket sync status --source-zone=
>
> You can try manually syncing the bucket, and get specific logs for that
> operation:
>
> $ radosgw-admin bucket sync run -source-zone= --debug-rgw=20
> --debug-ms=1
>
> And you can try getting more info from the sync trace module:
>
> $ ceph --admin-daemon  sync trace history
> 
>
> You can also try the 'sync trace show' command.
>
>
> Yehuda
>
>
>
>>
>> Regards,
>>
>> Enrico
>>
>> --
>>
>> *Enrico Kern*
>> VP IT Operations
>>
>> enrico.k...@glispa.com
>> +49 (0) 30 555713017 / +49 (0) 152 26814501
>> skype: flyersa
>> LinkedIn Profile <https://www.linkedin.com/in/enricokern>
>>
>>
>> <https://www.glispa.com/>

[ceph-users] radosgw multizone not syncing large bucket completly to other zone

2018-06-24 Thread Enrico Kern
Hello,

We have two Ceph Luminous clusters (12.2.5).

Recently one of our big buckets stopped syncing properly. We have one
specific bucket which is around 30 TB in size, consisting of a lot of
directories, each containing files of 10-20 MB.

The secondary zone is often completely missing multiple days of data in this
bucket, while all other, smaller buckets sync just fine.

Even with that data missing, radosgw-admin sync status always says
everything is fine.

The sync error log doesn't show anything for those days.

Running radosgw-admin metadata sync and data sync also doesn't solve the
issue. The only way of making it sync again is to disable and re-enable the
sync, and that needs to be done as often as 10 times an hour to keep it in
sync properly.

radosgw-admin bucket sync disable
radosgw-admin bucket sync enable

When I run data sync init I sometimes get this:

 radosgw-admin data sync init --source-zone berlin
2018-06-24 07:55:46.337858 7fe7557fa700  0 ERROR: failed to distribute
cache for
amsterdam.rgw.log:datalog.sync-status.6a9448d2-bdba-4bec-aad6-aba72cd8eac6

Sometimes, when really a lot of data is missing (yesterday it was more than 1
month), this helps to get things back in sync again when run on the secondary
zone:

radosgw-admin bucket check --fix --check-objects

How can I debug that problem further? We have so many requests on the
cluster that it is hard to dig anything out of the log files.

Given that all the smaller buckets are perfectly in sync, I suspect some
problem related to the size of the bucket.
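
For reference, a sketch of the commands I am looking at for this (the bucket
name is a placeholder; the per-bucket status variant is what I would check
next):

# zone-wide view plus any recorded sync errors
radosgw-admin sync status
radosgw-admin sync error list

# per-bucket sync state against the master zone
radosgw-admin bucket sync status --bucket=bucketname --source-zone=berlin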

Any pointers in the right direction are greatly appreciated.

Regards,

Enrico

-- 

*Enrico Kern*
VP IT Operations

enrico.k...@glispa.com
+49 (0) 30 555713017 / +49 (0) 152 26814501
skype: flyersa
LinkedIn Profile <https://www.linkedin.com/in/enricokern>


<https://www.glispa.com/>

*Glispa GmbH* | Berlin Office
Sonnenburger Straße 73, 10437 Berlin | Germany

Managing Director Din Karol-Gavish
Registered in Berlin
AG Charlottenburg | HRB 114678B
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph auth caps - make it more user error proof

2018-02-21 Thread Enrico Kern
Hey all,

I would suggest some changes to the ceph auth caps command.

Today I almost fucked up half of one of our OpenStack regions with I/O
errors because of a user error.

I tried to add osd blacklist caps to a Cinder keyring after the Luminous
upgrade.

I did so by issuing: ceph auth caps client.cinder mon 'bla'

Doing this, I forgot that it also wipes all other caps instead of only
updating the caps for mon, because you need to specify everything in one line.
The result was all of our VMs ending up with read-only filesystems after a
while because the osd caps were gone.

I suggest that if you only pass

ceph auth caps <entity> mon '...'

it only updates the caps for mon (or osd, etc.) and leaves the others
untouched, or at least prints some huge warning.

I know it is more of a PEBKAC problem, but Ceph does a great job of being
idiot-proof and this would make it even more idiot-proof ;)
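
For reference, a sketch of what re-specifying everything looks like today,
assuming a typical Cinder client (the cap strings and pool names are only
examples): read the current caps first, then repeat all of them in one call.

# dump the current caps so nothing gets lost
ceph auth get client.cinder

# caps must all be re-stated together; anything omitted here would be wiped
ceph auth caps client.cinder mon 'profile rbd' osd 'profile rbd pool=volumes, profile rbd pool=vms, profile rbd-read-only pool=images'
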
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Radosgw - ls not showing some files, invisible files

2018-02-09 Thread Enrico Kern
Hello,

We run Ceph 12.2.2. The multipart upload problem with replication to other
zones has now been fixed, but we do not see all files in a bucket when listing
via the API, Cyberduck, etc.

For example, we have a bucket with a lot of files. The file list is complete
in our second datacenter and I can see all files in the bucket.

In the main datacenter, which replicated these files to the other one, I have
some "invisible" files. I cannot see them in the API responses or with clients
like s3cmd, Cyberduck, etc. But the files are there; I can see the details with

radosgw-admin object stat

and I also verified that the file is available in the pool.

Has anyone seen this behavior before, where some files are kind of invisible?
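
For reference, a sketch of how I am comparing the bucket index with the data
pool (bucket name and pool are placeholders):

# what the bucket index returns for a listing
radosgw-admin bucket list --bucket=mybucket | grep '"name"' | wc -l

# what actually sits in the data pool under the bucket's marker prefix
MARKER=$(radosgw-admin bucket stats --bucket=mybucket | grep -oP '(?<="marker": ")[^"]+')
rados -p default.rgw.buckets.data ls | grep "$MARKER" | wc -l

# rebuild index entries for objects that exist in the pool but are missing from the listing
radosgw-admin bucket check --fix --check-objects --bucket=mybucket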

-- 

Enrico Kern
*Director of System Engineering*

*T* +49 (0) 30 555713017  | *M *+49 (0)152 26814501
*E*  enrico.k...@glispa.com |  *Skype flyersa* |  LinkedIn View my Profile
<https://www.linkedin.com/in/enricokern/>



*Glispa GmbH* - Berlin Office
Sonnenburger Str. 73 10437 Berlin, Germany
Managing Director: Dina Karol-Gavish, Registered in Berlin, AG
Charlottenburg HRB 114678B
<http://www.glispa.com/>   <cont...@glispa.com>
<https://www.linkedin.com/company-beta/143634/>
<https://plus.google.com/u/0/b/116135915389937318808/116135915389937318808>
   <https://twitter.com/glispa>   <https://www.facebook.com/glispamedia/>
<https://www.instagram.com/glispaglobalgroup/>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is the 12.2.1 really stable? Anybody have production cluster with Luminous Bluestore?

2017-11-16 Thread Enrico Kern
We upgraded from Firefly to 12.2.1. We cannot use our RadosGW S3 endpoints
anymore, since multipart uploads do not get replicated. So we are also waiting
for 12.2.2 to finally allow usage of our S3 endpoints again.

On Thu, Nov 16, 2017 at 3:33 PM, Ashley Merrick <ash...@amerrick.co.uk>
wrote:

> Currently experiencing a nasty bug http://tracker.ceph.com/issues/21142
>
> I would say wait a while for the next point release.
>
> ,Ashley
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Jack
> Sent: 16 November 2017 22:22
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Is the 12.2.1 really stable? Anybody have
> production cluster with Luminous Bluestore?
>
> My cluster (55 OSDs) runs 12.2.x since the release, and bluestore too All
> good so far
>
> On 16/11/2017 15:14, Konstantin Shalygin wrote:
> > Hi cephers.
> > Some thoughts...
> > At this time my cluster on Kraken 11.2.0 - works smooth with FileStore
> > and RBD only.
> > I want upgrade to Luminous 12.2.1 and go to Bluestore because this
> > cluster want grows double with new disks, so is best opportunity
> > migrate to Bluestore.
> >
> > In ML I was found two problems:
> > 1. Increased memory usage, should be fixed in upstream
> > (http://lists.ceph.com/pipermail/ceph-users-ceph.com/
> 2017-October/021676.html).
> >
> > 2. OSD drops and goes cluster offline
> > (http://lists.ceph.com/pipermail/ceph-users-ceph.com/
> 2017-November/022494.html).
> > Don't know if these are Bluestore or FileStore OSDs.
> >
> > If the first case I can safely survive - hosts has enough memory to go
> > to Bluestore and with the growing I can wait until the next stable
> release.
> > That second case really scares me. As I understood clusters with this
> > problem for now not in production.
> >
> > By this point I have completed all the preparations for the update and
> > now I need to figure out whether I should update to 12.2.1 or wait for
> > the next stable release, because my cluster is in production and I
> > can't fail. Or I can upgrade and use FileStore until next release,
> > this is acceptable for me.
> >
> > Thanks.
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 

Enrico Kern
*Lead System Engineer*

*T* +49 (0) 30 555713017  | *M *+49 (0)152 26814501
*E*  enrico.k...@glispa.com |  *Skype flyersa* |  LinkedIn View my Profile
<https://www.linkedin.com/in/enricokern/>



*Glispa GmbH* - Berlin Office
Sonnenburger Str. 73 10437 Berlin, Germany
Managing Director: Dina Karol-Gavish, Registered in Berlin, AG
Charlottenburg HRB 114678B
<http://www.glispa.com/>   <cont...@glispa.com>
<https://www.linkedin.com/company-beta/143634/>
<https://plus.google.com/u/0/b/116135915389937318808/116135915389937318808>
   <https://twitter.com/glispa>   <https://www.facebook.com/glispamedia/>
<https://www.instagram.com/glispaglobalgroup/>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous 12.2.1 - RadosGW Multisite doesnt replicate multipart uploads

2017-10-11 Thread Enrico Kern
or this:

   {
"shard_id": 22,
"entries": [
{
"id": "1_1507761448.758184_10459.1",
"section": "data",
"name":
"testbucket:6a9448d2-bdba-4bec-aad6-aba72cd8eac6.21344646.3/Wireshark-win64-2.2.7.exe",
"timestamp": "2017-10-11 22:37:28.758184Z",
"info": {
"source_zone": "6a9448d2-bdba-4bec-aad6-aba72cd8eac6",
"error_code": 5,
"message": "failed to sync object"
}
}
]
},



On Thu, Oct 12, 2017 at 12:39 AM, Enrico Kern <enrico.k...@glispamedia.com>
wrote:

> its 45MB, but it happens with all multipart uploads.
>
> sync error list shows
>
>{
> "shard_id": 31,
> "entries": [
> {
> "id": "1_1507761459.607008_8197.1",
> "section": "data",
> "name": "testbucket:6a9448d2-bdba-4bec-aad6-aba72cd8eac6.
> 21344646.3",
> "timestamp": "2017-10-11 22:37:39.607008Z",
> "info": {
> "source_zone": "6a9448d2-bdba-4bec-aad6-aba72cd8eac6",
> "error_code": 5,
> "message": "failed to sync bucket instance: (5)
> Input/output error"
> }
>     }
> ]
> }
>
> for multiple shards not just this one
>
>
>
> On Thu, Oct 12, 2017 at 12:31 AM, Yehuda Sadeh-Weinraub <yeh...@redhat.com
> > wrote:
>
>> What is the size of the object? Is it only this one?
>>
>> Try this command: 'radosgw-admin sync error list'. Does it show anything
>> related to that object?
>>
>> Thanks,
>> Yehuda
>>
>>
>> On Wed, Oct 11, 2017 at 3:26 PM, Enrico Kern <enrico.k...@glispamedia.com
>> > wrote:
>>
>>> if i change permissions the sync status shows that it is syncing 1
>>> shard, but no files ends up in the pool (testing with empty data pool).
>>> after a while it shows that data is back in sync but there is no file
>>>
>>> On Wed, Oct 11, 2017 at 11:26 PM, Yehuda Sadeh-Weinraub <
>>> yeh...@redhat.com> wrote:
>>>
>>>> Thanks for your report. We're looking into it. You can try to see if
>>>> touching the object (e.g., modifying its permissions) triggers the sync.
>>>>
>>>> Yehuda
>>>>
>>>> On Wed, Oct 11, 2017 at 1:36 PM, Enrico Kern <
>>>> enrico.k...@glispamedia.com> wrote:
>>>>
>>>>> Hi David,
>>>>>
>>>>> yeah seems you are right, they are stored as different filenames in
>>>>> the data bucket when using multisite upload. But anyway it stil doesnt get
>>>>> replicated. As example i have files like
>>>>>
>>>>> 6a9448d2-bdba-4bec-aad6-aba72cd8eac6.21344646.1__multipart_W
>>>>> ireshark-win64-2.2.7.exe.2~0LAfq93OMdk7hrijvyzW_EBRkVQLX37.6
>>>>>
>>>>> in the data pool on one zone. But its not replicated to the other
>>>>> zone. naming is not relevant, the other data bucket doesnt have any file
>>>>> multipart or not.
>>>>>
>>>>> im really missing the file on the other zone.
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Oct 11, 2017 at 10:25 PM, David Turner <drakonst...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Multipart is a client side setting when uploading.  Multisite in and
>>>>>> of it

Re: [ceph-users] Luminous 12.2.1 - RadosGW Multisite doesnt replicate multipart uploads

2017-10-11 Thread Enrico Kern
It's 45 MB, but it happens with all multipart uploads.

sync error list shows:

   {
"shard_id": 31,
"entries": [
{
"id": "1_1507761459.607008_8197.1",
"section": "data",
"name":
"testbucket:6a9448d2-bdba-4bec-aad6-aba72cd8eac6.21344646.3",
"timestamp": "2017-10-11 22:37:39.607008Z",
"info": {
"source_zone": "6a9448d2-bdba-4bec-aad6-aba72cd8eac6",
"error_code": 5,
"message": "failed to sync bucket instance: (5)
Input/output error"
}
}
]
}

for multiple shards not just this one



On Thu, Oct 12, 2017 at 12:31 AM, Yehuda Sadeh-Weinraub <yeh...@redhat.com>
wrote:

> What is the size of the object? Is it only this one?
>
> Try this command: 'radosgw-admin sync error list'. Does it show anything
> related to that object?
>
> Thanks,
> Yehuda
>
>
> On Wed, Oct 11, 2017 at 3:26 PM, Enrico Kern <enrico.k...@glispamedia.com>
> wrote:
>
>> if i change permissions the sync status shows that it is syncing 1 shard,
>> but no files ends up in the pool (testing with empty data pool). after a
>> while it shows that data is back in sync but there is no file
>>
>> On Wed, Oct 11, 2017 at 11:26 PM, Yehuda Sadeh-Weinraub <
>> yeh...@redhat.com> wrote:
>>
>>> Thanks for your report. We're looking into it. You can try to see if
>>> touching the object (e.g., modifying its permissions) triggers the sync.
>>>
>>> Yehuda
>>>
>>> On Wed, Oct 11, 2017 at 1:36 PM, Enrico Kern <
>>> enrico.k...@glispamedia.com> wrote:
>>>
>>>> Hi David,
>>>>
>>>> yeah seems you are right, they are stored as different filenames in the
>>>> data bucket when using multisite upload. But anyway it stil doesnt get
>>>> replicated. As example i have files like
>>>>
>>>> 6a9448d2-bdba-4bec-aad6-aba72cd8eac6.21344646.1__multipart_W
>>>> ireshark-win64-2.2.7.exe.2~0LAfq93OMdk7hrijvyzW_EBRkVQLX37.6
>>>>
>>>> in the data pool on one zone. But its not replicated to the other zone.
>>>> naming is not relevant, the other data bucket doesnt have any file
>>>> multipart or not.
>>>>
>>>> im really missing the file on the other zone.
>>>>
>>>>
>>>>
>>>> On Wed, Oct 11, 2017 at 10:25 PM, David Turner <drakonst...@gmail.com>
>>>> wrote:
>>>>
>>>>> Multipart is a client side setting when uploading.  Multisite in and
>>>>> of itself is a client and it doesn't use multipart (at least not by
>>>>> default).  I have a Jewel RGW Multisite cluster and one site has the 
>>>>> object
>>>>> as multi-part while the second site just has it as a single object.  I had
>>>>> to change from looking at the objects in the pool for monitoring to 
>>>>> looking
>>>>> at an ls of the buckets to see if they were in sync.
>>>>>
>>>>> I don't know if multisite has the option to match if an object is
>>>>> multipart between sites, but it definitely doesn't seem to be the default
>>>>> behavior.
>>>>>
>>>>> On Wed, Oct 11, 2017 at 3:56 PM Enrico Kern <
>>>>> enrico.k...@glispamedia.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> i just setup multisite replication according to the docs from
>>>>>> http://docs.ceph.com/docs/master/radosgw/multisite/ and everything
>>>>>> works except that if a client uploads via multipart the files dont get
>>>>>> replicated.
>>>>>>
>>>>>> If i in one zone rename a file that was uploaded via multipart it
>>>>>> gets replicated, but not if i left it untouched. Any ideas why? I 
>>>>>> remember
>>>>>> th

Re: [ceph-users] Luminous 12.2.1 - RadosGW Multisite doesnt replicate multipart uploads

2017-10-11 Thread Enrico Kern
In addition, I noticed that if you delete a bucket containing multipart-uploaded
files which were not replicated, those files are not deleted from the pool:
while the bucket is gone, the data still remains in the pool of the zone where
the multipart upload was initiated.
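
For reference, a sketch of how I would hunt for such leftover objects (pool
and job id are placeholders; orphans find only records candidates, it does not
delete anything by itself):

# scan the data pool for RADOS objects no longer referenced by any bucket index
radosgw-admin orphans find --pool=default.rgw.buckets.data --job-id=orphans-1

# review the scan jobs, then clean up the scan's own state when done
radosgw-admin orphans list-jobs
radosgw-admin orphans finish --job-id=orphans-1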

On Thu, Oct 12, 2017 at 12:26 AM, Enrico Kern <enrico.k...@glispamedia.com>
wrote:

> if i change permissions the sync status shows that it is syncing 1 shard,
> but no files ends up in the pool (testing with empty data pool). after a
> while it shows that data is back in sync but there is no file
>
> On Wed, Oct 11, 2017 at 11:26 PM, Yehuda Sadeh-Weinraub <yeh...@redhat.com
> > wrote:
>
>> Thanks for your report. We're looking into it. You can try to see if
>> touching the object (e.g., modifying its permissions) triggers the sync.
>>
>> Yehuda
>>
>> On Wed, Oct 11, 2017 at 1:36 PM, Enrico Kern <enrico.k...@glispamedia.com
>> > wrote:
>>
>>> Hi David,
>>>
>>> yeah seems you are right, they are stored as different filenames in the
>>> data bucket when using multisite upload. But anyway it stil doesnt get
>>> replicated. As example i have files like
>>>
>>> 6a9448d2-bdba-4bec-aad6-aba72cd8eac6.21344646.1__multipart_W
>>> ireshark-win64-2.2.7.exe.2~0LAfq93OMdk7hrijvyzW_EBRkVQLX37.6
>>>
>>> in the data pool on one zone. But its not replicated to the other zone.
>>> naming is not relevant, the other data bucket doesnt have any file
>>> multipart or not.
>>>
>>> im really missing the file on the other zone.
>>>
>>>
>>>
>>> On Wed, Oct 11, 2017 at 10:25 PM, David Turner <drakonst...@gmail.com>
>>> wrote:
>>>
>>>> Multipart is a client side setting when uploading.  Multisite in and of
>>>> itself is a client and it doesn't use multipart (at least not by default).
>>>> I have a Jewel RGW Multisite cluster and one site has the object as
>>>> multi-part while the second site just has it as a single object.  I had to
>>>> change from looking at the objects in the pool for monitoring to looking at
>>>> an ls of the buckets to see if they were in sync.
>>>>
>>>> I don't know if multisite has the option to match if an object is
>>>> multipart between sites, but it definitely doesn't seem to be the default
>>>> behavior.
>>>>
>>>> On Wed, Oct 11, 2017 at 3:56 PM Enrico Kern <
>>>> enrico.k...@glispamedia.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> i just setup multisite replication according to the docs from
>>>>> http://docs.ceph.com/docs/master/radosgw/multisite/ and everything
>>>>> works except that if a client uploads via multipart the files dont get
>>>>> replicated.
>>>>>
>>>>> If i in one zone rename a file that was uploaded via multipart it gets
>>>>> replicated, but not if i left it untouched. Any ideas why? I remember 
>>>>> there
>>>>> was a similar bug with jewel a while back.
>>>>>
>>>>> On the slave node i also permanently get this error (unrelated to the
>>>>> replication) in the radosgw log:
>>>>>
>>>>>  meta sync: ERROR: failed to read mdlog info with (2) No such file or
>>>>> directory
>>>>>
>>>>> we didnt run radosgw before the luminous upgrade of our clusters.
>>>>>
>>>>> after a finished multipart upload which is only visible at one zone
>>>>> "radosgw-admin sync status" just shows that metadata and data is caught up
>>>>> with the source.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Enrico Kern
>>>>> *Lead System Engineer*
>>>>>
>>>>> *T* +49 (0) 30 555713017 <+49%2030%20555713017>  | *M *+49 (0)152
>>>>> 26814501 <+49%201522%206814501>
>>>>> *E*  enrico.k...@glispa.com |  *Skype flyersa* |  LinkedIn View my
>>>>> Profile <https://www.linkedi

Re: [ceph-users] Luminous 12.2.1 - RadosGW Multisite doesnt replicate multipart uploads

2017-10-11 Thread Enrico Kern
If I change permissions, the sync status shows that it is syncing 1 shard,
but no file ends up in the pool (testing with an empty data pool). After a
while it shows that the data is back in sync, but there is no file.

On Wed, Oct 11, 2017 at 11:26 PM, Yehuda Sadeh-Weinraub <yeh...@redhat.com>
wrote:

> Thanks for your report. We're looking into it. You can try to see if
> touching the object (e.g., modifying its permissions) triggers the sync.
>
> Yehuda
>
> On Wed, Oct 11, 2017 at 1:36 PM, Enrico Kern <enrico.k...@glispamedia.com>
> wrote:
>
>> Hi David,
>>
>> yeah seems you are right, they are stored as different filenames in the
>> data bucket when using multisite upload. But anyway it stil doesnt get
>> replicated. As example i have files like
>>
>> 6a9448d2-bdba-4bec-aad6-aba72cd8eac6.21344646.1__multipart_
>> Wireshark-win64-2.2.7.exe.2~0LAfq93OMdk7hrijvyzW_EBRkVQLX37.6
>>
>> in the data pool on one zone. But its not replicated to the other zone.
>> naming is not relevant, the other data bucket doesnt have any file
>> multipart or not.
>>
>> im really missing the file on the other zone.
>>
>>
>>
>> On Wed, Oct 11, 2017 at 10:25 PM, David Turner <drakonst...@gmail.com>
>> wrote:
>>
>>> Multipart is a client side setting when uploading.  Multisite in and of
>>> itself is a client and it doesn't use multipart (at least not by default).
>>> I have a Jewel RGW Multisite cluster and one site has the object as
>>> multi-part while the second site just has it as a single object.  I had to
>>> change from looking at the objects in the pool for monitoring to looking at
>>> an ls of the buckets to see if they were in sync.
>>>
>>> I don't know if multisite has the option to match if an object is
>>> multipart between sites, but it definitely doesn't seem to be the default
>>> behavior.
>>>
>>> On Wed, Oct 11, 2017 at 3:56 PM Enrico Kern <enrico.k...@glispamedia.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> i just setup multisite replication according to the docs from
>>>> http://docs.ceph.com/docs/master/radosgw/multisite/ and everything
>>>> works except that if a client uploads via multipart the files dont get
>>>> replicated.
>>>>
>>>> If i in one zone rename a file that was uploaded via multipart it gets
>>>> replicated, but not if i left it untouched. Any ideas why? I remember there
>>>> was a similar bug with jewel a while back.
>>>>
>>>> On the slave node i also permanently get this error (unrelated to the
>>>> replication) in the radosgw log:
>>>>
>>>>  meta sync: ERROR: failed to read mdlog info with (2) No such file or
>>>> directory
>>>>
>>>> we didnt run radosgw before the luminous upgrade of our clusters.
>>>>
>>>> after a finished multipart upload which is only visible at one zone
>>>> "radosgw-admin sync status" just shows that metadata and data is caught up
>>>> with the source.
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Enrico Kern
>>>> *Lead System Engineer*
>>>>
>>>> *T* +49 (0) 30 555713017 <+49%2030%20555713017>  | *M *+49 (0)152
>>>> 26814501 <+49%201522%206814501>
>>>> *E*  enrico.k...@glispa.com |  *Skype flyersa* |  LinkedIn View my
>>>> Profile <https://www.linkedin.com/in/enricokern/>
>>>>
>>>>
>>>>
>>>> *Glispa GmbH* - Berlin Office
>>>> Sonnenburger Str. 73 10437 Berlin, Germany
>>>> <https://maps.google.com/?q=Sonnenburger+Str.+73+10437+Berlin,+Germany=gmail=g>
>>>>
>>>> Managing Director: David Brown, Registered in Berlin, AG
>>>> Charlottenburg HRB 114678B
>>>> <http://www.glispa.com/>   <cont...@glispa.com>
>>>> <https://www.linkedin.com/company-beta/143634/>
>>>> <https://plus.google.com/u/0/b/116135915389937318808/116135915389937318808>
>>>><https://twitter.com/glispa>
>>>> <https://www.facebook.com/glispa

Re: [ceph-users] Luminous 12.2.1 - RadosGW Multisite doesnt replicate multipart uploads

2017-10-11 Thread Enrico Kern
Hi David,

yeah, it seems you are right, they are stored under different names in the
data pool when using multipart upload. But it still doesn't get replicated.
As an example, I have files like

6a9448d2-bdba-4bec-aad6-aba72cd8eac6.21344646.1__multipart_Wireshark-win64-2.2.7.exe.2~0LAfq93OMdk7hrijvyzW_EBRkVQLX37.6

in the data pool on one zone, but it is not replicated to the other zone.
The naming is not relevant; the data pool on the other side doesn't have the
file at all, multipart or not.

I'm really missing the file on the other zone.


On Wed, Oct 11, 2017 at 10:25 PM, David Turner <drakonst...@gmail.com>
wrote:

> Multipart is a client side setting when uploading.  Multisite in and of
> itself is a client and it doesn't use multipart (at least not by default).
> I have a Jewel RGW Multisite cluster and one site has the object as
> multi-part while the second site just has it as a single object.  I had to
> change from looking at the objects in the pool for monitoring to looking at
> an ls of the buckets to see if they were in sync.
>
> I don't know if multisite has the option to match if an object is
> multipart between sites, but it definitely doesn't seem to be the default
> behavior.
>
> On Wed, Oct 11, 2017 at 3:56 PM Enrico Kern <enrico.k...@glispamedia.com>
> wrote:
>
>> Hi all,
>>
>> i just setup multisite replication according to the docs from
>> http://docs.ceph.com/docs/master/radosgw/multisite/ and everything works
>> except that if a client uploads via multipart the files dont get replicated.
>>
>> If i in one zone rename a file that was uploaded via multipart it gets
>> replicated, but not if i left it untouched. Any ideas why? I remember there
>> was a similar bug with jewel a while back.
>>
>> On the slave node i also permanently get this error (unrelated to the
>> replication) in the radosgw log:
>>
>>  meta sync: ERROR: failed to read mdlog info with (2) No such file or
>> directory
>>
>> we didnt run radosgw before the luminous upgrade of our clusters.
>>
>> after a finished multipart upload which is only visible at one zone
>> "radosgw-admin sync status" just shows that metadata and data is caught up
>> with the source.
>>
>>
>>
>> --
>>
>> Enrico Kern
>> *Lead System Engineer*
>>
>> *T* +49 (0) 30 555713017 <+49%2030%20555713017>  | *M *+49 (0)152
>> 26814501 <+49%201522%206814501>
>> *E*  enrico.k...@glispa.com |  *Skype flyersa* |  LinkedIn View my
>> Profile <https://www.linkedin.com/in/enricokern/>
>>
>>
>>
>> *Glispa GmbH* - Berlin Office
>> Sonnenburger Str. 73 10437 Berlin, Germany
>> <https://maps.google.com/?q=Sonnenburger+Str.+73+10437+Berlin,+Germany=gmail=g>
>>
>> Managing Director: David Brown, Registered in Berlin, AG Charlottenburg
>> HRB 114678B
>> <http://www.glispa.com/>   <cont...@glispa.com>
>> <https://www.linkedin.com/company-beta/143634/>
>> <https://plus.google.com/u/0/b/116135915389937318808/116135915389937318808>
>><https://twitter.com/glispa>   <https://www.facebook.com/glispamedia/>
>><https://www.instagram.com/glispaglobalgroup/>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>


-- 

Enrico Kern
*Lead System Engineer*

*T* +49 (0) 30 555713017  | *M *+49 (0)152 26814501
*E*  enrico.k...@glispa.com |  *Skype flyersa* |  LinkedIn View my Profile
<https://www.linkedin.com/in/enricokern/>



*Glispa GmbH* - Berlin Office
Sonnenburger Str. 73 10437 Berlin, Germany
Managing Director: Dina Karol-Gavish, Registered in Berlin, AG
Charlottenburg HRB 114678B
<http://www.glispa.com/>   <cont...@glispa.com>
<https://www.linkedin.com/company-beta/143634/>
<https://plus.google.com/u/0/b/116135915389937318808/116135915389937318808>
   <https://twitter.com/glispa>   <https://www.facebook.com/glispamedia/>
<https://www.instagram.com/glispaglobalgroup/>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Luminous 12.2.1 - RadosGW Multisite doesnt replicate multipart uploads

2017-10-11 Thread Enrico Kern
Hi all,

I just set up multisite replication according to the docs at
http://docs.ceph.com/docs/master/radosgw/multisite/ and everything works,
except that if a client uploads via multipart the files don't get replicated.

If I rename a file in one zone that was uploaded via multipart, it gets
replicated, but not if I leave it untouched. Any ideas why? I remember there
was a similar bug with Jewel a while back.

On the slave node I also permanently get this error (unrelated to the
replication) in the radosgw log:

 meta sync: ERROR: failed to read mdlog info with (2) No such file or
directory

We didn't run radosgw before the Luminous upgrade of our clusters.

After a finished multipart upload that is only visible in one zone,
"radosgw-admin sync status" still shows that metadata and data are caught up
with the source.
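
For reference, a sketch of the checks I run after such an upload (the pool
name is a placeholder for the secondary zone's data pool; the object name is
the test file used in this thread):

# overall multisite state and any per-object sync errors that were recorded
radosgw-admin sync status
radosgw-admin sync error list

# check whether the multipart object ever reached the secondary zone's data pool
rados -p amsterdam.rgw.buckets.data ls | grep Wireshark-win64-2.2.7.exe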



-- 

Enrico Kern
*Lead System Engineer*

*T* +49 (0) 30 555713017  | *M *+49 (0)152 26814501
*E*  enrico.k...@glispa.com |  *Skype flyersa* |  LinkedIn View my Profile
<https://www.linkedin.com/in/enricokern/>



*Glispa GmbH* - Berlin Office
Sonnenburger Str. 73 10437 Berlin, Germany
Managing Director: David Brown, Registered in Berlin, AG Charlottenburg HRB
114678B
<http://www.glispa.com/>   <cont...@glispa.com>
<https://www.linkedin.com/company-beta/143634/>
<https://plus.google.com/u/0/b/116135915389937318808/116135915389937318808>
   <https://twitter.com/glispa>   <https://www.facebook.com/glispamedia/>
<https://www.instagram.com/glispaglobalgroup/>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com