Re: [ceph-users] backfill_toofull while OSDs are not full

2019-01-27 Thread Wido den Hollander


On 1/25/19 8:33 AM, Gregory Farnum wrote:
> This doesn’t look familiar to me. Is the cluster still doing recovery so
> we can at least expect them to make progress when the “out” OSDs get
> removed from the set?

The recovery has already finished. It resolved itself, but in the
meantime I saw many PGs in the backfill_toofull state for a long time.

This is new since Mimic.

Wido
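
(For anyone who wants to cross-check a similar state, something along these lines should work; jq, the PG id and the OSD ids are only illustrative:)

# which PGs are currently flagged backfill_toofull
ceph health detail | grep backfill_toofull
# the ratios the cluster is actually using
ceph osd dump | grep -E 'full_ratio|backfillfull_ratio|nearfull_ratio'
# up/acting sets and current recovery state of one affected PG
ceph pg <pgid> query | jq '.up, .acting, .recovery_state[0].name'
# utilization of the OSDs involved
ceph osd df | egrep '^ *(145|999|1900) '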

> On Tue, Jan 22, 2019 at 2:44 PM Wido den Hollander wrote:
> 
> Hi,
> 
> I've got a couple of PGs which are stuck in backfill_toofull, but none
> of them are actually full.
> 
>   "up": [
>     999,
>     1900,
>     145
>   ],
>   "acting": [
>     701,
>     1146,
>     1880
>   ],
>   "backfill_targets": [
>     "145",
>     "999",
>     "1900"
>   ],
>   "acting_recovery_backfill": [
>     "145",
>     "701",
>     "999",
>     "1146",
>     "1880",
>     "1900"
>   ],
> 
> I checked all these OSDs, but they are all <75% utilization.
> 
> full_ratio 0.95
> backfillfull_ratio 0.9
> nearfull_ratio 0.9
> 
> So I started checking all the PGs and I've noticed that each of these
> PGs has one OSD in the 'acting_recovery_backfill' which is marked as
> out.
> 
> In this case osd.1880 is marked as out and thus its capacity is shown
> as zero.
> 
> [ceph@ceph-mgr ~]$ ceph osd df|grep 1880
> 1880   hdd 4.54599        0     0 B      0 B      0 B     0    0  27
> [ceph@ceph-mgr ~]$
> 
> This is on a Mimic 13.2.4 cluster. Is this expected or is this an unknown
> side-effect of one of the OSDs being marked as out?
> 
> Thanks,
> 
> Wido


Re: [ceph-users] Mix hardware on object storage cluster

2019-01-27 Thread Ashley Merrick
What is different about the new hosts?

Better / larger disks?
More RAM?
Higher GHz on the CPUs?
Redundant PSUs, etc.?

Knowing what is different about the hardware will help pinpoint how
you may be able to make better use of it.

On Mon, Jan 28, 2019 at 3:26 PM Félix Barbeira  wrote:

> Hi Cephers,
>
> We are managing a cluster where all machines have the same hardware. The
> cluster is used only for object storage. We are planning to increase the
> number of nodes. Those new nodes have better hardware than the old ones. If
> we only add those nodes as regular nodes to the cluster, we are not using
> their full power, right? What could be the best way to take advantage of
> this new and better hardware?
>
> After reading the docs, these are the possible options:
>
> - Change primary affinity.
> - Cache tiering: I don't really like this comment in the docs: "Cache
> tiering will degrade performance for most workloads".
> - Change OSD weight: I think this is more oriented to disk space on every
> node.
>
> Do I have some other options?
>
> --
> Félix Barbeira.


[ceph-users] Mix hardware on object storage cluster

2019-01-27 Thread Félix Barbeira
Hi Cephers,

We are managing a cluster where all machines have the same hardware. The
cluster is used only for object storage. We are planning to increase the
number of nodes. Those new nodes have better hardware than the old ones. If
we only add those nodes as regular nodes to the cluster, we are not using
their full power, right? What could be the best way to take advantage of this
new and better hardware?

After reading the docs, these are the possible options:

- Change primary affinity (see the example commands below).
- Cache tiering: I don't really like this comment in the docs: "Cache tiering
will degrade performance for most workloads".
- Change OSD weight: I think this is more oriented to disk space on every
node.
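
(For reference, primary affinity is a one-liner from the CLI, and a separate device class plus CRUSH rule is another common way to steer load onto newer hardware; a minimal sketch, where the OSD ids, class name and rule name are purely illustrative:)

# prefer the faster OSDs as primaries (value between 0.0 and 1.0)
ceph osd primary-affinity osd.12 1.0
ceph osd primary-affinity osd.3 0.5

# or tag the new hardware with its own device class and target it from a CRUSH rule
ceph osd crush set-device-class nvme osd.12
ceph osd crush rule create-replicated fast-rule default host nvme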

Do I have some other options?

-- 
Félix Barbeira.


Re: [ceph-users] RBD client hangs

2019-01-27 Thread ST Wong (ITSC)
> That doesn't appear to be an error -- that's just stating that it found a 
> dead client that was holding the exclusive-lock, so it broke the dead 
> client's lock on the image (by blacklisting the client).

As there is only 1 RBD client in this testing, does it mean the RBD client
process keeps failing?
On a freshly booted RBD client, doing some basic operations also gives the warning:

---- cut here ----
# rbd -n client.acapp1 map 4copy/foo
/dev/rbd0
# mount /dev/rbd0 /4copy
# cd /4copy; ls


# tail /var/log/messages
Jan 28 14:23:39 acapp1 kernel: Key type ceph registered
Jan 28 14:23:39 acapp1 kernel: libceph: loaded (mon/osd proto 15/24)
Jan 28 14:23:39 acapp1 kernel: rbd: loaded (major 252)
Jan 28 14:23:39 acapp1 kernel: libceph: mon2 192.168.1.156:6789 session established
Jan 28 14:23:39 acapp1 kernel: libceph: client80624 fsid cc795498-5d16-4b84-9584-1788d0458be9
Jan 28 14:23:39 acapp1 kernel: rbd: rbd0: capacity 10737418240 features 0x5
Jan 28 14:23:44 acapp1 kernel: XFS (rbd0): Mounting V5 Filesystem
Jan 28 14:23:44 acapp1 kernel: rbd: rbd0: client80621 seems dead, breaking lock   <--
Jan 28 14:23:45 acapp1 kernel: XFS (rbd0): Starting recovery (logdev: internal)
Jan 28 14:23:45 acapp1 kernel: XFS (rbd0): Ending recovery (logdev: internal)

---- cut here ----

Is this normal?
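
(If it helps, the lock owner can also be inspected by hand; the pool/image/client names are just the ones from above:)

# which client currently holds the exclusive lock
rbd -n client.acapp1 lock ls 4copy/foo
# watchers currently registered on the image header
rbd -n client.acapp1 status 4copy/foo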



Besides, I repeated the testing:
* Map and mount the rbd device, read/write ok.
* Umount all rbd devices, then reboot without problem.
* Reboot hangs if not umounting all rbd devices before reboot:

---- cut here ----
Jan 28 14:13:12 acapp1 kernel: rbd: rbd0: client80531 seems dead, breaking lock
Jan 28 14:13:13 acapp1 kernel: XFS (rbd0): Ending clean mount   <-- Reboot hangs here
Jan 28 14:14:06 acapp1 systemd: Stopping Session 1 of user root.   <-- pressing power reset
Jan 28 14:14:06 acapp1 systemd: Stopped target Multi-User System.
---- cut here ----

Is it necessary to umount all RBD devices before rebooting the client host?

Thanks a lot.
/st
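
(One way to avoid forgetting the umount is to let the rbdmap service handle the map/unmap and mount ordering across reboots; a minimal sketch, where the client id, keyring path and mount point are only illustrative:)

# /etc/ceph/rbdmap
4copy/foo id=acapp1,keyring=/etc/ceph/ceph.client.acapp1.keyring

# /etc/fstab -- noauto/_netdev so the mount does not block boot ordering
/dev/rbd/4copy/foo  /4copy  xfs  noauto,_netdev  0 0

# enable the unit so the image is unmounted and unmapped cleanly at shutdown
systemctl enable rbdmap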

-Original Message-
From: Jason Dillaman  
Sent: Friday, January 25, 2019 10:04 PM
To: ST Wong (ITSC) 
Cc: dilla...@redhat.com; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] RBD client hangs

That doesn't appear to be an error -- that's just stating that it found a dead 
client that was holding the exclusive-lock, so it broke the dead client's lock 
on the image (by blacklisting the client).

On Fri, Jan 25, 2019 at 5:09 AM ST Wong (ITSC)  wrote:
>
> Oops, while I can map and mount the filesystem, I still found errors as below,
> and while rebooting, the client machine freezes and I have to power reset it.
>
> Jan 25 17:57:30 acapp1 kernel: XFS (rbd0): Mounting V5 Filesystem
> Jan 25 17:57:30 acapp1 kernel: rbd: rbd0: client74700 seems dead, breaking lock   <--
> Jan 25 17:57:30 acapp1 kernel: XFS (rbd0): Starting recovery (logdev: internal)
> Jan 25 17:57:30 acapp1 kernel: XFS (rbd0): Ending recovery (logdev: internal)
> Jan 25 17:58:07 acapp1 kernel: rbd: rbd1: capacity 10737418240 features 0x5
> Jan 25 17:58:14 acapp1 kernel: XFS (rbd1): Mounting V5 Filesystem
> Jan 25 17:58:14 acapp1 kernel: rbd: rbd1: client74700 seems dead, breaking lock   <--
> Jan 25 17:58:15 acapp1 kernel: XFS (rbd1): Starting recovery (logdev: internal)
> Jan 25 17:58:15 acapp1 kernel: XFS (rbd1): Ending recovery (logdev: internal)
>
> Would you help ?   Thanks.
> /st
>
> -Original Message-
> From: ceph-users  On Behalf Of ST 
> Wong (ITSC)
> Sent: Friday, January 25, 2019 5:58 PM
> To: dilla...@redhat.com
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] RBD client hangs
>
> Hi,  It works.  Thanks a lot.
>
> /st
>
> -Original Message-
> From: Jason Dillaman 
> Sent: Tuesday, January 22, 2019 9:29 PM
> To: ST Wong (ITSC) 
> Cc: Ilya Dryomov ; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] RBD client hangs
>
> Your "mon" cap should be "profile rbd" instead of "allow r" [1].
>
> [1] 
> http://docs.ceph.com/docs/master/rbd/rados-rbd-cmds/#create-a-block-de
> vice-user
>
> On Mon, Jan 21, 2019 at 9:05 PM ST Wong (ITSC)  wrote:
> >
> > Hi,
> >
> > > Is this an upgraded or a fresh cluster?
> > It's a fresh cluster.
> >
> > > Does client.acapp1 have the permission to blacklist other clients?  You 
> > > can check with "ceph auth get client.acapp1".
> >
> > No,  it's our first Ceph cluster with basic setup for testing, without any 
> > blacklist implemented.
> >
> > --- cut here ---
> > # ceph auth get client.acapp1
> > exported keyring for client.acapp1
> > [client.acapp1]
> > key = 
> > caps mds = "allow r"
> > caps mgr = "allow r"
> > caps mon = "allow r"
> > caps osd = "allow rwx pool=2copy, allow rwx pool=4copy"
> > --- cut here ---
> >
> > Thanks a lot.
> > /st
> >
> >
> >
> > -Original Message-
> > From: Ilya Dryomov 
> > Sent: Monday, January 21, 2019 7:33 

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-01-27 Thread Alexandre DERUMIER
Hi,

currently I'm using telegraf + influxdb to monitor.


Note that this bug seems to occur only on writes; I don't see the latency
increase on reads.

The counters are op_latency, op_w_latency and op_w_process_latency.



SELECT non_negative_derivative(first("op_latency.sum"), 
1s)/non_negative_derivative(first("op_latency.avgcount"),1s)   FROM "ceph" 
WHERE "host" =~  /^([[host]])$/  AND "id" =~ /^([[osd]])$/ AND $timeFilter 
GROUP BY time($interval), "host", "id" fill(previous)


SELECT non_negative_derivative(first("op_w_latency.sum"), 
1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s)   FROM "ceph" 
WHERE "host" =~ /^([[host]])$/  AND collection='osd'  AND  "id" =~ 
/^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" 
fill(previous)



SELECT non_negative_derivative(first("op_w_process_latency.sum"), 
1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s)   FROM 
"ceph" WHERE "host" =~ /^([[host]])$/  AND collection='osd'  AND  "id" =~ 
/^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" 
fill(previous)


dashboard is here:

https://grafana.com/dashboards/7995
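
(The same counters can also be read straight from the OSD admin socket on the OSD's host, if you just want to spot-check a single daemon; osd.3 is only an example:)

ceph daemon osd.3 perf dump | jq '.osd | {op_latency, op_w_latency, op_w_process_latency}'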





- Mail original -
De: "Marc Roos" 
À: "aderumier" 
Cc: "ceph-users" 
Envoyé: Dimanche 27 Janvier 2019 12:11:42
Objet: RE: [ceph-users] ceph osd commit latency increase over time, until 
restart



Hi Alexandre, 

I was curious if I had a similar issue, what value are you monitoring? I 
have quite a lot to choose from. 


Bluestore.commitLat 
Bluestore.kvLat 
Bluestore.readLat 
Bluestore.readOnodeMetaLat 
Bluestore.readWaitAioLat 
Bluestore.stateAioWaitLat 
Bluestore.stateDoneLat 
Bluestore.stateIoDoneLat 
Bluestore.submitLat 
Bluestore.throttleLat 
Osd.opBeforeDequeueOpLat 
Osd.opRProcessLatency 
Osd.opWProcessLatency 
Osd.subopLatency 
Osd.subopWLatency 
Rocksdb.getLatency 
Rocksdb.submitLatency 
Rocksdb.submitSyncLatency 
RecoverystatePerf.repnotrecoveringLatency 
RecoverystatePerf.waitupthruLatency 
Osd.opRwPrepareLatency 
RecoverystatePerf.primaryLatency 
RecoverystatePerf.replicaactiveLatency 
RecoverystatePerf.startedLatency 
RecoverystatePerf.getlogLatency 
RecoverystatePerf.initialLatency 
RecoverystatePerf.recoveringLatency 
ThrottleBluestoreThrottleBytes.wait 
RecoverystatePerf.waitremoterecoveryreservedLatency 



-Original Message- 
From: Alexandre DERUMIER [mailto:aderum...@odiso.com] 
Sent: vrijdag 25 januari 2019 17:40 
To: Sage Weil 
Cc: ceph-users; ceph-devel 
Subject: Re: [ceph-users] ceph osd commit latency increase over time, 
until restart 

also, here the result of "perf diff 1mslatency.perfdata 
3mslatency.perfdata" 

http://odisoweb1.odiso.net/perf_diff_ok_vs_bad.txt 




- Mail original - 
De: "aderumier"  
À: "Sage Weil"  
Cc: "ceph-users" , "ceph-devel" 
 
Envoyé: Vendredi 25 Janvier 2019 17:32:02 
Objet: Re: [ceph-users] ceph osd commit latency increase over time, 
until restart 

Hi again, 

I was able to perf it today, 

before restart, commit latency was between 3-5ms 

after restart at 17:11, latency is around 1ms 

http://odisoweb1.odiso.net/osd3_latency_3ms_vs_1ms.png 


here some perf reports: 

with 3ms latency: 
- 
perf report by caller: http://odisoweb1.odiso.net/bad-caller.txt 
perf report by callee: http://odisoweb1.odiso.net/bad-callee.txt 


with 1ms latency 
- 
perf report by caller: http://odisoweb1.odiso.net/ok-caller.txt 
perf report by callee: http://odisoweb1.odiso.net/ok-callee.txt 



I'll retry next week, trying to have bigger latency difference. 

Alexandre 

- Mail original - 
De: "aderumier"  
À: "Sage Weil"  
Cc: "ceph-users" , "ceph-devel" 
 
Envoyé: Vendredi 25 Janvier 2019 11:06:51 
Objet: Re: ceph osd commit latency increase over time, until restart 

>>Can you capture a perf top or perf record to see where the CPU time is 
>>going on one of the OSDs with a high latency? 

Yes, sure. I'll do it next week and send result to the mailing list. 

Thanks Sage ! 

- Mail original - 
De: "Sage Weil"  
À: "aderumier"  
Cc: "ceph-users" , "ceph-devel" 
 
Envoyé: Vendredi 25 Janvier 2019 10:49:02 
Objet: Re: ceph osd commit latency increase over time, until restart 

Can you capture a perf top or perf record to see where the CPU time is 
going on one of the OSDs with a high latency? 

Thanks! 
sage 
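
(Something along these lines is usually enough to capture that; the pid selection and duration are only an example:)

# sample one slow OSD for 60 seconds, then write out the report
perf record -g -p $(pidof ceph-osd | awk '{print $1}') -- sleep 60
perf report --stdio > /tmp/osd-perf.txt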


On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 

> 
> Hi, 
> 
> I have a strange behaviour of my osds, on multiple clusters. 
> 
> All clusters are running mimic 13.2.1, bluestore, with ssd or nvme 
> drives; the workload is rbd only, with qemu-kvm vms running with librbd + 
> snapshot/rbd export-diff/snapshot delete each day for backup. 
> 
> When the osds are freshly started, the commit latency is between 0,5-1ms. 
> 
> But over time, this latency increases slowly (maybe around 1ms per day), 
> until reaching crazy values like 20-200ms. 
> 
> Some example graphs: 
> 
> http://odisoweb1.odiso.net/osdlatency1.png 
> http://odisoweb1.odiso.net/osdlatency2.png 
> 
> All osds have this behaviour, in all clusters. 

Re: [ceph-users] how to debug a stuck cephfs?

2019-01-27 Thread Sang, Oliver
Thanks! So we should not evict the client? We ran into slow requests, so we guessed
that maybe evicting all clients could help. In which cases should we, or should we not, use evict?

-Original Message-
From: Yan, Zheng [mailto:uker...@gmail.com] 
Sent: Monday, January 28, 2019 1:07 PM
To: Sang, Oliver 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] how to debug a stuck cephfs?

http://docs.ceph.com/docs/master/cephfs/troubleshooting/


For your case, it's likely the client got evicted by the MDS.

On Mon, Jan 28, 2019 at 9:50 AM Sang, Oliver  wrote:
>
> Hello,
>
>
>
> Our CephFS looks stuck. If I run some command such as ‘mkdir’ or
> ‘touch’ on a new file, it just hangs there. Any suggestion about how to debug
> this issue would be very much appreciated.
>
>
>


Re: [ceph-users] how to debug a stuck cephfs?

2019-01-27 Thread Yan, Zheng
http://docs.ceph.com/docs/master/cephfs/troubleshooting/


For your case, it's likely the client got evicted by the MDS.
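
(A couple of things that usually help to narrow this down; the daemon name and mount paths are only placeholders:)

# on the MDS host: list client sessions and any requests stuck in flight
ceph daemon mds.<name> session ls
ceph daemon mds.<name> dump_ops_in_flight

# on a kernel client: pending MDS requests, plus any eviction/blacklist messages
cat /sys/kernel/debug/ceph/*/mdsc
dmesg | grep -i ceph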

On Mon, Jan 28, 2019 at 9:50 AM Sang, Oliver  wrote:
>
> Hello,
>
>
>
> Our CephFS looks stuck. If I run some command such as ‘mkdir’ or
> ‘touch’ on a new file, it just hangs there. Any suggestion about how to debug
> this issue would be very much appreciated.
>
>
>


Re: [ceph-users] MDS performance issue

2019-01-27 Thread Yan, Zheng
On Mon, Jan 28, 2019 at 10:34 AM Albert Yue  wrote:
>
> Hi Yan Zheng,
>
> Our clients are also complaining about operations like 'du' or 'ncdu' being 
> very slow. Is there any alternative tool for that kind of operation on 
> CephFS? Thanks!
>

'du' traverses the whole directory tree to calculate the size. Ceph supports
recursive stat; google it for details.
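
(The recursive stats are exposed as virtual extended attributes on directories; the mount path here is only an example:)

getfattr -n ceph.dir.rbytes /mnt/cephfs/some/dir     # recursive size in bytes
getfattr -n ceph.dir.rentries /mnt/cephfs/some/dir   # recursive number of entries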

> Best regards,
> Albert
>
> On Wed, Jan 23, 2019 at 11:04 AM Yan, Zheng  wrote:
>>
>> On Wed, Jan 23, 2019 at 10:02 AM Albert Yue  
>> wrote:
>> >
>> > But with enough memory on MDS, I can just cache all metadata into memory. 
>> > Right now there are around 500GB metadata in the ssd. So this is not 
>> > enough?
>> >
>>
>> The mds needs to track lots of extra information for each object. For
>> 500G of metadata, the mds may need 1T or more of memory.
>>
>> > On Tue, Jan 22, 2019 at 5:48 PM Yan, Zheng  wrote:
>> >>
>> >> On Tue, Jan 22, 2019 at 10:49 AM Albert Yue  
>> >> wrote:
>> >> >
>> >> > Hi Yan Zheng,
>> >> >
>> >> > In your opinion, can we resolve this issue by move MDS to a 512GB or 
>> >> > 1TB memory machine?
>> >> >
>> >>
>> >> The problem is on the client side, especially clients with large memory.
>> >> I don't think enlarging the mds cache size is a good idea. You can
>> >> periodically check
>> >> each kernel client's /sys/kernel/debug/ceph/xxx/caps and run 'echo 2
>> >> >/proc/sys/vm/drop_caches' if a client holds too many caps (for example
>> >> 10k).
>> >>
>> >> > On Mon, Jan 21, 2019 at 10:49 PM Yan, Zheng  wrote:
>> >> >>
>> >> >> On Mon, Jan 21, 2019 at 11:16 AM Albert Yue 
>> >> >>  wrote:
>> >> >> >
>> >> >> > Dear Ceph Users,
>> >> >> >
>> >> >> > We have set up a cephFS cluster with 6 osd machines, each with 16 
>> >> >> > 8TB harddisk. Ceph version is luminous 12.2.5. We created one data 
>> >> >> > pool with these hard disks and created another meta data pool with 3 
>> >> >> > ssd. We created a MDS with 65GB cache size.
>> >> >> >
>> >> >> > But our users are keep complaining that cephFS is too slow. What we 
>> >> >> > observed is cephFS is fast when we switch to a new MDS instance, 
>> >> >> > once the cache fills up (which will happen very fast), client became 
>> >> >> > very slow when performing some basic filesystem operation such as 
>> >> >> > `ls`.
>> >> >> >
>> >> >>
>> >> >> It seems that clients hold lots of unused inodes in their icache, which
>> >> >> prevents the mds from trimming the corresponding objects from its cache.  Mimic
>> >> >> has the command "ceph daemon mds.x cache drop" to ask a client to drop its
>> >> >> cache. I'm also working on a patch that makes the kernel client release
>> >> >> unused inodes.
>> >> >>
>> >> >> For luminous,  there is not much we can do, except periodically run
>> >> >> "echo 2 > /proc/sys/vm/drop_caches"  on each client.
>> >> >>
>> >> >>
>> >> >> > What we know is our user are putting lots of small files into the 
>> >> >> > cephFS, now there are around 560 Million files. We didn't see high 
>> >> >> > CPU wait on MDS instance and meta data pool just used around 200MB 
>> >> >> > space.
>> >> >> >
>> >> >> > My question is, what is the relationship between the metadata pool 
>> >> >> > and MDS? Is this performance issue caused by the hardware behind 
>> >> >> > meta data pool? Why the meta data pool only used 200MB space, and we 
>> >> >> > saw 3k iops on each of these three ssds, why can't MDS cache all 
>> >> >> > these 200MB into memory?
>> >> >> >
>> >> >> > Thanks very much!
>> >> >> >
>> >> >> >
>> >> >> > Best Regards,
>> >> >> >
>> >> >> > Albert
>> >> >> >


Re: [ceph-users] MDS performance issue

2019-01-27 Thread Albert Yue
Hi Yan Zheng,

Our clients are also complaining about operations like 'du' or 'ncdu' being
very slow. Is there any alternative tool for that kind of operation on
CephFS? Thanks!

Best regards,
Albert

On Wed, Jan 23, 2019 at 11:04 AM Yan, Zheng  wrote:

> On Wed, Jan 23, 2019 at 10:02 AM Albert Yue 
> wrote:
> >
> > But with enough memory on MDS, I can just cache all metadata into
> memory. Right now there are around 500GB metadata in the ssd. So this is
> not enough?
> >
>
> The mds needs to track lots of extra information for each object. For
> 500G of metadata, the mds may need 1T or more of memory.
>
> > On Tue, Jan 22, 2019 at 5:48 PM Yan, Zheng  wrote:
> >>
> >> On Tue, Jan 22, 2019 at 10:49 AM Albert Yue 
> wrote:
> >> >
> >> > Hi Yan Zheng,
> >> >
> >> > In your opinion, can we resolve this issue by move MDS to a 512GB or
> 1TB memory machine?
> >> >
> >>
> >> The problem is on the client side, especially clients with large memory.
> >> I don't think enlarging the mds cache size is a good idea. You can
> >> periodically check
> >> each kernel client's /sys/kernel/debug/ceph/xxx/caps and run 'echo 2
> >> >/proc/sys/vm/drop_caches' if a client holds too many caps (for example
> >> 10k).
> >>
> >> > On Mon, Jan 21, 2019 at 10:49 PM Yan, Zheng 
> wrote:
> >> >>
> >> >> On Mon, Jan 21, 2019 at 11:16 AM Albert Yue <
> transuranium@gmail.com> wrote:
> >> >> >
> >> >> > Dear Ceph Users,
> >> >> >
> >> >> > We have set up a cephFS cluster with 6 osd machines, each with 16
> 8TB harddisk. Ceph version is luminous 12.2.5. We created one data pool
> with these hard disks and created another meta data pool with 3 ssd. We
> created a MDS with 65GB cache size.
> >> >> >
> >> >> > But our users are keep complaining that cephFS is too slow. What
> we observed is cephFS is fast when we switch to a new MDS instance, once
> the cache fills up (which will happen very fast), client became very slow
> when performing some basic filesystem operation such as `ls`.
> >> >> >
> >> >>
> >> >> It seems that clients hold lots of unused inodes in their icache, which
> >> >> prevents the mds from trimming the corresponding objects from its cache. Mimic
> >> >> has the command "ceph daemon mds.x cache drop" to ask a client to drop its
> >> >> cache. I'm also working on a patch that makes the kernel client release
> >> >> unused inodes.
> >> >>
> >> >> For luminous,  there is not much we can do, except periodically run
> >> >> "echo 2 > /proc/sys/vm/drop_caches"  on each client.
> >> >>
> >> >>
> >> >> > What we know is our user are putting lots of small files into the
> cephFS, now there are around 560 Million files. We didn't see high CPU wait
> on MDS instance and meta data pool just used around 200MB space.
> >> >> >
> >> >> > My question is, what is the relationship between the metadata pool
> and MDS? Is this performance issue caused by the hardware behind meta data
> pool? Why the meta data pool only used 200MB space, and we saw 3k iops on
> each of these three ssds, why can't MDS cache all these 200MB into memory?
> >> >> >
> >> >> > Thanks very much!
> >> >> >
> >> >> >
> >> >> > Best Regards,
> >> >> >
> >> >> > Albert
> >> >> >


Re: [ceph-users] Questions about using existing HW for PoC cluster

2019-01-27 Thread Will Dennis
Thanks for contributing your knowledge to that book, Anthony - really enjoying 
it :)

I didn't mean to use the OS SSD for Ceph - I would buy a second SSD per 
server for that... I will take a look at SATA SSD prices; hopefully the smaller 
ones (>500GB) will be at an acceptable price so that I can buy 1 (or even 2) 
for each server. I'd love to run two for the OS (md mirror) and then two more for Ceph, 
but that's probably going to add up to more money than I'd want to ask 
for. I was going to check SMART on the existing SSD; since they are Intel SSDs, 
there's also an Intel tool ( 
https://www.intel.com/content/www/us/en/support/articles/06289/memory-and-storage.html
 ) that I was going to use. In any case, I will probably re-use the existing OS 
SSD for the new OS, and add in 1-2 new SSDs for Ceph per OSD server.

I also think a new SSD per mon would be doable; maybe 500GB - 1TB OK?

Usually for a storage system I'd be using some sort of Intel DC drives, but may 
go with Samsung 8xx Pro's for this to keep the price lower.

I mean to use CephFS on this PoC; the initial use would be to back up an 
existing ZFS server with ~43TB of data (I may have to limit the backed-up data 
depending on how much capacity I can get out of the OSD servers) and then share 
it out via NFS as a read-only copy. That would give me some I/O numbers on writes 
and reads, and allow me to test different aspects of Ceph before I go pitching 
it as a primary data storage technology (it will be our org's first foray into 
SDS, and I want it to succeed.)

No way I'd go primary production storage with this motley collection of 
"pre-loved" equipment :) If it all seems to work well, I think I could get a 
reasonable budget for new production-grade gear.

From what I've read so far in the book, and in the prior list posts, I'll probably do a 
2x10G bond to the common 10G switch that serves the cluster these machines would be a 
part of. Do the mon servers need 10G NICs too? If so, I may have to scrounge 
some 10Gbase-T NICs from other servers to give to them (they only have dual 1G 
NICs on the mobo.)

Thanks again!
Will

-Original Message-
From: Anthony D'Atri [mailto:a...@dreamsnake.net] 
Sent: Sunday, January 27, 2019 6:32 PM
To: Will Dennis
Cc: ceph-users
Subject: Re: [ceph-users] Questions about using existing HW for PoC cluster

> Been reading "Learning Ceph - Second Edition”

An outstanding book, I must say ;)

> So can I get by with using a single SATA SSD (size?) per server for RocksDB / 
> WAL if I'm using Bluestore?

Depends on the rest of your setup and use-case, but I think this would be a 
bottleneck.  Some thoughts:

* You wrote that your servers have 1x 240GB SATA SSD that has the OS, and 8x 
2TB SATA OSD drives.

** Sharing the OS with journal/metadata could lead to contention between the two
** Since the OS has been doing who-knows-what with that drive, check the 
lifetime used/remaining with `smartctl -a`.
** If they’ve been significantly consumed, their lifetime with the load Ceph 
will present will be limited.
** SSDs selected for OS/boot drives often have relatively low durability (DWPD) 
and may suffer performance cliffing when given a steady load.  Look up the 
specs on your model
** 8 OSDs sharing a single SSD for metadata is a very large failure domain.  
If/when you lose that SSD, you lose all 8 OSDs and the host itself.  You would 
want to set the subtree limit to “host”, and not fill the OSDs past, say, 60% 
so that you’d have room to backfill in case of a failure not caught by the 
subtree limit
** 8 HDD OSDs sharing a single SATA SSD for metadata will be bottlenecked 
unless your workload is substantially reads.

* Single SATA HDD on the mons

** When it fails, you lose the mon
** I have personally seen issues due to HDDs not handling peak demands, 
resulting in an outage

The gear you list is fairly old and underpowered, but sure you could use it *as 
a PoC*.  For a production deployment you’d want different hardware.

> - Is putting the journal on a partition of the SATA drives a real I/O killer? 
> (this is how my Proxmox boxes are set up)

With Filestore and HDDs, absolutely.  Even worse if you were to use EC.  There 
may be some coalescing of ops, but you’re still going to get a *lot* of long 
seeks, and spinners can only do a certain number of IOPs.  I think in the book 
I described personal experience with such a setup that even tickled a design 
flaw on the part of a certain HDD vendor.  Eventually I was permitted to get 
journal devices (this was pre-BlueStore GA), which were PCIe NVMe.  Write 
performance doubled.  Then we hit a race condition / timing issue in nvme.ko, 
but I digress...

When using SATA *SSD*s for OSDs, you have no seeks of course, and colocating 
the journals/metadata is more viable.

> - If YES to the above, then is a SATA SSD acceptable for journal device, or 
> should I definitely consider PCIe SSD? (I'd have to limit to one per server, 
> which I know isn't optimal, but price prevents 

[ceph-users] how to debug a stuck cephfs?

2019-01-27 Thread Sang, Oliver
Hello,

Our CephFS looks stuck. If I run some command such as 'mkdir' or 'touch' on
a new file, it just hangs there. Any suggestion about how to debug this issue
would be very much appreciated.



Re: [ceph-users] Questions about using existing HW for PoC cluster

2019-01-27 Thread Anthony D'Atri
> Been reading "Learning Ceph - Second Edition”

An outstanding book, I must say ;)

> So can I get by with using a single SATA SSD (size?) per server for RocksDB / 
> WAL if I'm using Bluestore?

Depends on the rest of your setup and use-case, but I think this would be a 
bottleneck.  Some thoughts:

* You wrote that your servers have 1x 240GB SATA SSD that has the OS, and 8x 
2TB SATA OSD drives.

** Sharing the OS with journal/metadata could lead to contention between the two
** Since the OS has been doing who-knows-what with that drive, check the 
lifetime used/remaining with `smartctl -a`.
** If they’ve been significantly consumed, their lifetime with the load Ceph 
will present will be limited.
** SSDs selected for OS/boot drives often have relatively low durability (DWPD) 
and may suffer performance cliffing when given a steady load.  Look up the 
specs on your model
** 8 OSDs sharing a single SSD for metadata is a very large failure domain.  
If/when you lose that SSD, you lose all 8 OSDs and the host itself.  You would 
want to set the subtree limit to “host” (see the ceph.conf sketch below), and not fill the OSDs past, say, 60% 
so that you’d have room to backfill in case of a failure not caught by the 
subtree limit
** 8 HDD OSDs sharing a single SATA SSD for metadata will be bottlenecked 
unless your workload is substantially reads.

* Single SATA HDD on the mons

** When it fails, you lose the mon
** I have personally seen issues due to HDDs not handling peak demands, 
resulting in an outage

The gear you list is fairly old and underpowered, but sure you could use it *as 
a PoC*.  For a production deployment you’d want different hardware.
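
(Re the subtree limit above: it is a one-line setting; a minimal ceph.conf sketch, with the 60% figure purely a rule of thumb:)

[mon]
mon_osd_down_out_subtree_limit = host

# and keep an eye on per-OSD utilization so there is room to backfill after a failure
ceph osd df tree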

> - Is putting the journal on a partition of the SATA drives a real I/O killer? 
> (this is how my Proxmox boxes are set up)

With Filestore and HDDs, absolutely.  Even worse if you were to use EC.  There 
may be some coalescing of ops, but you’re still going to get a *lot* of long 
seeks, and spinners can only do a certain number of IOPs.  I think in the book 
I described personal experience with such a setup that even tickled a design 
flaw on the part of a certain HDD vendor.  Eventually I was permitted to get 
journal devices (this was pre-BlueStore GA), which were PCIe NVMe.  Write 
performance doubled.  Then we hit a race condition / timing issue in nvme.ko, 
but I digress...

When using SATA *SSD*s for OSDs, you have no seeks of course, and colocating 
the journals/metadata is more viable.

> - If YES to the above, then is a SATA SSD acceptable for journal device, or 
> should I definitely consider PCIe SSD? (I'd have to limit to one per server, 
> which I know isn't optimal, but price prevents otherwise…)

Optanes for these systems would be overkill.  If you would plan to have the PoC 
cluster run any appreciable load for any length of time, I might suggest 
instead adding 2x SATA SSDs per, so you could map 4x OSDs to each.  These would 
not need to be large:  upstream party line would have you allocate 80GB on 
each, though depending on your use-case you might well do fine with less, 2x 
240GB class or even 2x 120GB class should suffice for PoC service.  For 
production I would advise “enterprise” class drives with at least 1 DWPD 
durability — recently we’ve seen a certain vendor weasel their durability by 
computing it incorrectly.

Depending on what you expect out of your PoC, and especially assuming you use 
BlueStore, you might get away with colocation, but do not expect performance 
that can be extrapolated for a production deployment.  

With the NICs you have, you could keep it simple and skip a replication/back 
end network altogether, or you could bond the LOM ports and split them.  
Whatever’s simplest with the network infrastructure you have.  For production 
you’d most likely want LACP bonded NICs, but if the network tech is modern, 
skipping the replication network may be very feasible,  But I’m ahead of your 
context …

HTH
— aad



Re: [ceph-users] Ceph rbd.ko compatibility

2019-01-27 Thread Paul Emmerich
(Copying from our Ceph training material https://croit.io/training )

Feature vs. kernel version

layering 3.8
striping 3.10
exclusive-lock 4.9
object-map not supported
fast-diff not supported
deep-flatten not supported
data-pool 4.11
journaling (WIP, might be in 5.0 or later)


/sys/bus/rbd/supported_features
was added in 4.11 and contains a bitmask of the features which are
defined here: 
https://github.com/ceph/ceph/blob/v13.2.4/src/include/rbd/features.h#L4
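
(For example, decoding the 0x187 Marc mentioned, using the bit values defined in that header:)

# 0x01 layering, 0x02 striping, 0x04 exclusive-lock, 0x08 object-map,
# 0x10 fast-diff, 0x20 deep-flatten, 0x40 journaling, 0x80 data-pool, 0x100 operations
cat /sys/bus/rbd/supported_features
# 0x187 = 0x100 + 0x80 + 0x04 + 0x02 + 0x01
#       = operations + data-pool + exclusive-lock + striping + layering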


Paul



-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Sun, Jan 27, 2019 at 12:02 PM Marc Schöchlin  wrote:
>
> Hello ceph-users,
>
> we are using a low number of rbd.ko clients with our luminous cluster.
>
> Where can i get information about the following questions:
>
>   * Which features and what cluster compatibility are provided by the rbd.ko module 
> of my system?
> (/sys/module/rbd/**, "modinfo rbd" does not seem to provide useful 
> information on Ubuntu 16.04/18.04)
>   * Ubuntu 16.04/18.04 (kernel 4.15) /sys/bus/rbd/supported_features lists 
> feature compatibility 0x187.
> Is there a convenient way to find out what is provided with that kernel?
>   * Is there an overview of kernel releases vs. features and cluster 
> compatibility?
> -> i.e. which kernel version is needed to use the fast-diff feature
>   * What is the development roadmap for features and cluster compatibility in 
> rbd.ko?
>   * The available documentation http://docs.ceph.com/docs/luminous/rbd/rbd-ko/ and
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/ABI/testing/sysfs-bus-rbd?h=v4.20.5
> does not provide any significant information regarding the described 
> questions.
> Did I miss important information resources?
>
> Best regards
> Marc
>
>


Re: [ceph-users] One host with 24 OSDs is offline - best way to get it back online

2019-01-27 Thread Chris
When your node went down, you lost 100% of the copies of the objects that 
were stored on that node, so the cluster had to re-create a copy of 
everything.  When the node came back online (and particularly since your 
usage was near-zero), the cluster discovered that many objects did not 
require changes and were still identical to their counterparts.  The only 
moved objects would have been ones that had changed and ones that needed to 
be moved in order to satisfy the requirements of your crush map for the 
purposes of distribution.
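
(On raising the recovery priority when client load allows, as mentioned further down this thread, the usual knobs are these; the values are only illustrative and easy to revert:)

ceph tell 'osd.*' injectargs '--osd-max-backfills 4 --osd-recovery-max-active 8'
# back to the defaults once recovery has caught up
ceph tell 'osd.*' injectargs '--osd-max-backfills 1 --osd-recovery-max-active 3'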


On January 27, 2019 09:47:59 Götz Reinicke  
wrote:

Dear all,

thanks for your feedback; I'll try to take every suggestion into consideration.

I’ve rebooted the node in question and all 24 OSDs came online without any 
complaints.


But what makes me wonder is: during the downtime the objects got rebalanced 
and placed on the remaining nodes.


With the failed node back online, only a couple of hundred objects were 
misplaced, out of about 35 million.


The question for me is: what happens to the objects on the OSDs that went 
down, once those OSDs come back online?


Thanks for feedback



Am 27.01.2019 um 04:17 schrieb Christian Balzer :


Hello,

this is where (depending on your topology) something like:
---
mon_osd_down_out_subtree_limit = host
---
can come in very handy.

Provided you have correct monitoring, alerting and operations, recovering
a down node can often be restored long before any recovery would be
finished and you also avoid the data movement back and forth.
And if you see that recovering the node will take a long time, just
manually set things out for the time being.

Christian

On Sun, 27 Jan 2019 00:02:54 +0100 Götz Reinicke wrote:


Dear Chris,

Thanks for your feedback. The node/OSDs in question are part of an erasure 
coded pool and during the weekend the workload should be close to none.


But anyway, I could get a look at the console and at the server; the power 
is up, but I can't use the console: the login prompt is shown, but no key is 
accepted.


I’ll have to reboot the server and check what it is complaining about 
tomorrow morning, as soon as I can access the server again.


Fingers crossed and regards. Götz





Am 26.01.2019 um 23:41 schrieb Chris :

It sort of depends on your workload/use case.  Recovery operations can be 
computationally expensive.  If your load is light because its the weekend 
you should be able to turn that host back on  as soon as you resolve 
whatever the issue is with minimal impact.  You can also increase the 
priority of the recovery operation to make it go faster if you feel you can 
spare additional IO and it won't affect clients.


We do this in our cluster regularly and have yet to see an issue (given 
that we take care to do it during periods of lower client io)


On January 26, 2019 17:16:38 Götz Reinicke  
wrote:




Hi,

one host out of 10 is down for yet unknown reasons. I guess a power 
failure. I could not yet see the server.


The Cluster is recovering and remapping fine, but still has some objects to 
process.


My question: May I just switch the server back on and in best case, the 24 
OSDs get back online and recovering will do the job without problems.


Or what might be a good way to handle that host? Should I first wait till 
the recover is finished?


Thanks for feedback and suggestions - Happy Saturday Night  :) . Regards . Götz



--
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Rakuten Communications



Götz Reinicke
IT-Koordinator
IT-OfficeNet
+49 7141 969 82420
goetz.reini...@filmakademie.de

Filmakademie Baden-Württemberg GmbH
Akademiehof 10
71638 Ludwigsburg
http://www.filmakademie.de


Eintragung Amtsgericht Stuttgart HRB 205016
Vorsitzende des Aufsichtsrates:
Petra Olschowski
Staatssekretärin im Ministerium für Wissenschaft,
Forschung und Kunst Baden-Württemberg

Geschäftsführer:
Prof. Thomas Schadt

Datenschutzerklärung | Transparenzinformation
Data privacy statement | Transparency information




Re: [ceph-users] Questions about using existing HW for PoC cluster

2019-01-27 Thread Will Dennis
Been reading "Learning Ceph - Second Edition" 
(https://learning.oreilly.com/library/view/learning-ceph-/9781787127913/8f98bac7-44d4-45dc-b672-447d162ea604.xhtml)
 and in Ch. 4 I read this:

"We've noted that Ceph OSDs built with the new BlueStore back end do not 
require journals. One might reason that additional cost savings can be had by 
not having to deploy journal devices, and this can be quite true. However, 
BlueStore does still benefit from provisioning certain data components on 
faster storage, especially when OSDs are deployed on relatively slow HDDs. 
Today's investment in fast FileStore journal devices for HDD OSDs is not wasted 
when migrating to BlueStore. When repaving OSDs as BlueStore devices the former 
journal devices can be readily repurposed for BlueStore's RocksDB and WAL 
data. When using SSD-based OSDs, this BlueStore accessory data can reasonably 
be colocated with the OSD data store. For even better performance they can 
employ faster yet NVMe or other technologies for WAL and RocksDB. This approach 
is not unknown for traditional FileStore journals as well, though it is not 
inexpensive. Ceph clusters that are fortunate to exploit SSDs as primary OSD drives 
usually do not require discrete journal devices, though use cases that 
require every last bit of performance may justify NVMe journals. SSD clusters 
with NVMe journals are as uncommon as they are expensive, but they are not 
unknown."

So can I get by with using a single SATA SSD (size?) per server for RocksDB / 
WAL if I'm using Bluestore?
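
(For what it's worth, the usual layout is one DB/WAL partition per HDD OSD on the shared SSD; a minimal sketch, with device names that are only placeholders:)

# one BlueStore OSD per HDD, with RocksDB (and implicitly the WAL) on an SSD partition
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/sdj1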


> - Is putting the journal on a partition of the SATA drives a real I/O killer? 
> (this is how my Proxmox boxes are set up)
> - If YES to the above, then is a SATA SSD acceptable for journal device, or 
> should I definitely consider PCIe SSD? (I'd have to limit to one per server, 
> which I know isn't optimal, but price prevents otherwise...)


Re: [ceph-users] One host with 24 OSDs is offline - best way to get it back online

2019-01-27 Thread Götz Reinicke
Dear all,

thanks for your feedback; I'll try to take every suggestion into consideration.

I’ve rebooted the node in question and all 24 OSDs came online without any 
complaints.

But what makes me wonder is: during the downtime the objects got rebalanced and 
placed on the remaining nodes.

With the failed node back online, only a couple of hundred objects were misplaced, 
out of about 35 million.

The question for me is: what happens to the objects on the OSDs that went down, 
once those OSDs come back online?

Thanks for feedback 


> Am 27.01.2019 um 04:17 schrieb Christian Balzer :
> 
> 
> Hello,
> 
> this is where (depending on your topology) something like:
> ---
> mon_osd_down_out_subtree_limit = host
> ---
> can come in very handy.
> 
> Provided you have correct monitoring, alerting and operations, recovering
> a down node can often be restored long before any recovery would be
> finished and you also avoid the data movement back and forth.
> And if you see that recovering the node will take a long time, just
> manually set things out for the time being.
> 
> Christian
> 
> On Sun, 27 Jan 2019 00:02:54 +0100 Götz Reinicke wrote:
> 
>> Dear Chris,
>> 
>> Thanks for your feedback. The node/OSDs in question are part of an erasure 
>> coded pool and during the weekend the workload should be close to none.
>> 
>> But anyway, I could get a look at the console and at the server; the power 
>> is up, but I can't use the console: the login prompt is shown, but no key is 
>> accepted.
>> 
>> I’ll have to reboot the server and check what it is complaining about 
>> tomorrow morning, as soon as I can access the server again.
>> 
>>  Fingers crossed and regards. Götz
>> 
>> 
>> 
>>> Am 26.01.2019 um 23:41 schrieb Chris :
>>> 
>>> It sort of depends on your workload/use case.  Recovery operations can be 
>>> computationally expensive.  If your load is light because its the weekend 
>>> you should be able to turn that host back on  as soon as you resolve 
>>> whatever the issue is with minimal impact.  You can also increase the 
>>> priority of the recovery operation to make it go faster if you feel you can 
>>> spare additional IO and it won't affect clients.
>>> 
>>> We do this in our cluster regularly and have yet to see an issue (given 
>>> that we take care to do it during periods of lower client io)
>>> 
>>> On January 26, 2019 17:16:38 Götz Reinicke  
>>> wrote:
>>> 
 Hi,
 
 one host out of 10 is down for yet unknown reasons. I guess a power 
 failure. I could not yet see the server.
 
 The Cluster is recovering and remapping fine, but still has some objects 
 to process.
 
 My question: May I just switch the server back on and in best case, the 24 
 OSDs get back online and recovering will do the job without problems.
 
 Or what might be a good way to handle that host? Should I first wait till 
 the recover is finished?
 
 Thanks for feedback and suggestions - Happy Saturday Night  :) . Regards . 
 Götz  
>> 
> 
> 
> -- 
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Rakuten Communications

     
Götz Reinicke
IT-Koordinator
IT-OfficeNet
+49 7141 969 82420
goetz.reini...@filmakademie.de

Filmakademie Baden-Württemberg GmbH
Akademiehof 10
71638 Ludwigsburg
http://www.filmakademie.de

Eintragung Amtsgericht Stuttgart HRB 205016
Vorsitzende des Aufsichtsrates:
Petra Olschowski
Staatssekretärin im Ministerium für Wissenschaft,
Forschung und Kunst Baden-Württemberg
Geschäftsführer:
Prof. Thomas Schadt

Datenschutzerklärung | Transparenzinformation
Data privacy statement | Transparency information





[ceph-users] cephfs constantly strays ( num_strays)

2019-01-27 Thread Marc Roos


I constantly have strays. What are strays? Why do I have them? Is this 
bad?



[@~]# ceph daemon mds.c perf dump| grep num_stray
"num_strays": 25823,
"num_strays_delayed": 0,
"num_strays_enqueuing": 0,
[@~]#


[ceph-users] Bug in application of bucket policy s3:PutObject?

2019-01-27 Thread Marc Roos


If I want a user to only be able to put objects, and not download or delete 
them, I have to apply a secondary statement denying GetObject. Yet I did 
not specify GetObject anywhere in the Allow statement. 

This works
{
  "Sid": "put-only-objects-s2",
  "Effect": "Deny",
  "Principal": { "AWS": [ "arn:aws:iam::Company:user/user1", 
"arn:aws:iam::Company:user/user2" ] },
  "Action": [
"s3:GetObject"
  ],
  "Resource": "arn:aws:s3:::testbucket/user1/*"
},
{
  "Sid": "put-only-objects-s3",
  "Effect": "Allow",
  "Principal": { "AWS": [ "arn:aws:iam::Company:user/user1", 
"arn:aws:iam::Company:user/user2" ] },
  "Action": [
"s3:ListBucket",
"s3:HeadObject",
"s3:PutObject"
  ],
  "Resource": "arn:aws:s3:::testbucket/user1/*"
},




This does not work; you can still download the objects you upload.

{
  "Sid": "put-only-objects-s3",
  "Effect": "Allow",
  "Principal": { "AWS": [ "arn:aws:iam::Company:user/user1", 
"arn:aws:iam::Company:user/user2" ] },
  "Action": [
"s3:ListBucket",
"s3:HeadObject",
"s3:PutObject"
  ],
  "Resource": "arn:aws:s3:::testbucket/user1/*"
},








Re: [ceph-users] Radosgw s3 subuser permissions

2019-01-27 Thread Marc Roos


I tried with these, but didn't get any results

"arn:aws:iam::Company:user/testuser:testsubuser"
"arn:aws:iam::Company:subuser/testuser:testsubuser"

-Original Message-
From: Adam C. Emerson [mailto:aemer...@redhat.com] 
Sent: vrijdag 25 januari 2019 16:40
To: The Exoteric Order of the Squid Cybernetic
Subject: Re: [ceph-users] Radosgw s3 subuser permissions

On 24/01/2019, Marc Roos wrote:
>
>
> This should do it sort of.
>
> {
>   "Id": "Policy1548367105316",
>   "Version": "2012-10-17",
>   "Statement": [
> {
>   "Sid": "Stmt1548367099807",
>   "Effect": "Allow",
>   "Action": "s3:ListBucket",
>   "Principal": { "AWS": "arn:aws:iam::Company:user/testuser" },
>   "Resource": "arn:aws:s3:::archive"
> },
> {
>   "Sid": "Stmt1548369229354",
>   "Effect": "Allow",
>   "Action": [
> "s3:GetObject",
> "s3:PutObject",
> "s3:ListBucket"
>   ],
>   "Principal": { "AWS": "arn:aws:iam::Company:user/testuser" },
>   "Resource": "arn:aws:s3:::archive/folder2/*"
> }
>   ]
> }


Does this work well for sub-users? I hadn't worked on them as we were 
focusing on the tenant/user case, but if someone's been using policy 
with sub-users, I'd like to hear their experience and any problems they 
run into.

-- 
Senior Software Engineer   Red Hat Storage, Ann Arbor, MI, US
IRC: Aemerson@OFTC, Actinic@Freenode
0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C  7C12 80F7 544B 90ED BFB9 


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-01-27 Thread Marc Roos
 

Hi Alexandre, 

I was curious if I had a similar issue, what value are you monitoring? I 
have quite a lot to choose from.


Bluestore.commitLat
Bluestore.kvLat
Bluestore.readLat
Bluestore.readOnodeMetaLat
Bluestore.readWaitAioLat
Bluestore.stateAioWaitLat
Bluestore.stateDoneLat
Bluestore.stateIoDoneLat
Bluestore.submitLat
Bluestore.throttleLat
Osd.opBeforeDequeueOpLat
Osd.opRProcessLatency
Osd.opWProcessLatency
Osd.subopLatency
Osd.subopWLatency
Rocksdb.getLatency
Rocksdb.submitLatency
Rocksdb.submitSyncLatency
RecoverystatePerf.repnotrecoveringLatency
RecoverystatePerf.waitupthruLatency
Osd.opRwPrepareLatency
RecoverystatePerf.primaryLatency
RecoverystatePerf.replicaactiveLatency
RecoverystatePerf.startedLatency
RecoverystatePerf.getlogLatency
RecoverystatePerf.initialLatency
RecoverystatePerf.recoveringLatency
ThrottleBluestoreThrottleBytes.wait
RecoverystatePerf.waitremoterecoveryreservedLatency



-Original Message-
From: Alexandre DERUMIER [mailto:aderum...@odiso.com] 
Sent: vrijdag 25 januari 2019 17:40
To: Sage Weil
Cc: ceph-users; ceph-devel
Subject: Re: [ceph-users] ceph osd commit latency increase over time, 
until restart

also, here the result of "perf diff 1mslatency.perfdata  
3mslatency.perfdata"

http://odisoweb1.odiso.net/perf_diff_ok_vs_bad.txt




- Mail original -
De: "aderumier" 
À: "Sage Weil" 
Cc: "ceph-users" , "ceph-devel" 

Envoyé: Vendredi 25 Janvier 2019 17:32:02
Objet: Re: [ceph-users] ceph osd commit latency increase over time, 
until restart

Hi again, 

I was able to perf it today, 

before restart, commit latency was between 3-5ms 

after restart at 17:11, latency is around 1ms 

http://odisoweb1.odiso.net/osd3_latency_3ms_vs_1ms.png 


here some perf reports: 

with 3ms latency: 
-
perf report by caller: http://odisoweb1.odiso.net/bad-caller.txt
perf report by callee: http://odisoweb1.odiso.net/bad-callee.txt 


with 1ms latency
-
perf report by caller: http://odisoweb1.odiso.net/ok-caller.txt
perf report by callee: http://odisoweb1.odiso.net/ok-callee.txt 



I'll retry next week, trying to have bigger latency difference. 

Alexandre 

- Mail original -
De: "aderumier" 
À: "Sage Weil" 
Cc: "ceph-users" , "ceph-devel" 

Envoyé: Vendredi 25 Janvier 2019 11:06:51
Objet: Re: ceph osd commit latency increase over time, until restart 

>>Can you capture a perf top or perf record to see where the CPU time is 
>>going on one of the OSDs with a high latency?

Yes, sure. I'll do it next week and send result to the mailing list. 

Thanks Sage ! 

- Mail original -
De: "Sage Weil" 
À: "aderumier" 
Cc: "ceph-users" , "ceph-devel" 

Envoyé: Vendredi 25 Janvier 2019 10:49:02
Objet: Re: ceph osd commit latency increase over time, until restart 

Can you capture a perf top or perf record to see where the CPU time is 
going on one of the OSDs with a high latency? 

Thanks! 
sage 


On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 

> 
> Hi,
> 
> I have a strange behaviour of my osds, on multiple clusters.
> 
> All clusters are running mimic 13.2.1, bluestore, with ssd or nvme 
> drives; the workload is rbd only, with qemu-kvm vms running with librbd + 
> snapshot/rbd export-diff/snapshot delete each day for backup.
> 
> When the osds are freshly started, the commit latency is between 0,5-1ms. 
> 
> But over time, this latency increases slowly (maybe around 1ms per day), 
> until reaching crazy values like 20-200ms.
> 
> Some example graphs: 
> 
> http://odisoweb1.odiso.net/osdlatency1.png
> http://odisoweb1.odiso.net/osdlatency2.png
> 
> All osds have this behaviour, in all clusters. 
> 
> The latency of the physical disks is ok. (The clusters are far from fully 
> loaded.)
> 
> And if I restart the osd, the latency comes back to 0,5-1ms. 
> 
> That reminds me of the old tcmalloc bug, but maybe it could be a bluestore 
> memory bug? 
> 
> Any Hints for counters/logs to check ? 
> 
> 
> Regards,
> 
> Alexandre
> 
> 



[ceph-users] Ceph rbd.ko compatibility

2019-01-27 Thread Marc Schöchlin
Hello ceph-users,

we are using a low number of rbd.ko clients with our luminous cluster.

Where can i get information about the following questions:

  * Which features and what cluster compatibility are provided by the rbd.ko module 
of my system?
(/sys/module/rbd/**, "modinfo rbd" does not seem to provide useful 
information on Ubuntu 16.04/18.04)
  * Ubuntu 16.04/18.04 (kernel 4.15) /sys/bus/rbd/supported_features lists 
feature compatibility 0x187.
Is there a convenient way to find out what is provided with that kernel?
  * Is there an overview of kernel releases vs. features and cluster 
compatibility?
-> i.e. which kernel version is needed to use the fast-diff feature
  * What is the development roadmap for features and cluster compatibility in 
rbd.ko?
  * The available documentation http://docs.ceph.com/docs/luminous/rbd/rbd-ko/ and

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/ABI/testing/sysfs-bus-rbd?h=v4.20.5
does not provide any significant information regarding the described 
questions.
Did I miss important information resources?

Best regards
Marc

