Re: [ceph-users] user and group acls on cephfs mounts

2019-11-05 Thread Alex Litvak
to 777 and user sam2 cannot write either. On 11/5/2019 7:10 PM, Yan, Zheng wrote: On Wed, Nov 6, 2019 at 5:47 AM Alex Litvak wrote: Hello Cephers, I am trying to understand how uid and gid are handled on the shared cephfs mount. I am using 14.2.2 and cephfs kernel based client. I have 2

Re: [ceph-users] user and group acls on cephfs mounts

2019-11-05 Thread Alex Litvak
I changed /webcluster/data this way rwxrwxrwx dev dev /webcluster/data Still newly added users can't write to it. On 11/5/2019 7:04 PM, Alex Litvak wrote: Plot thickens. I create a new user sam2 and group sam2 both uid and gid = 1501.  User sam2 is a member of group dev.  When I switch

Re: [ceph-users] user and group acls on cephfs mounts

2019-11-05 Thread Alex Litvak
It makes no sense. Old users from the group dev (created and connected a long time ago) can write into the data dir and new ones cannot. On 11/5/2019 3:07 PM, Alex Litvak wrote: Hello Cephers, I am trying to understand how uid and gid are handled on the shared cephfs mount. I am using 14.2.2 and cephfs

[ceph-users] user and group acls on cephfs mounts

2019-11-05 Thread Alex Litvak
Hello Cephers, I am trying to understand how uid and gid are handled on the shared cephfs mount. I am using 14.2.2 and the cephfs kernel-based client. I have 2 client vms with the following uid/gid: vm1 user dev (uid=500) group dev (gid=500) vm2 user dev (uid=500) group dev (gid=500) vm1 user
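A minimal sketch of the check this usually comes down to (monitor address, keyring path and the setfacl call are assumptions, not taken from the thread): CephFS stores ownership by numeric uid/gid, so uid 500 on vm1 and vm2 should resolve to the same owner.

    mount -t ceph 192.168.0.10:6789:/ /webcluster -o name=admin,secretfile=/etc/ceph/admin.secret   # kernel client mount
    ls -ln /webcluster/data                    # -n prints the numeric uid/gid as stored by cephfs
    chown -R 500:500 /webcluster/data          # ownership follows the numbers, not the local names
    setfacl -m g:500:rwx /webcluster/data && getfacl /webcluster/data   # POSIX ACLs only apply if the client mount supports them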

[ceph-users] Replace ceph osd in a container

2019-10-22 Thread Alex Litvak
Hello cephers, So I am having trouble with new hardware systems showing strange OSD behavior, and I want to replace a disk with a brand new one to test the theory. I run all daemons in containers, and on one of the nodes I have a mon, a mgr, and 6 osds. So following
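For reference, a hedged sketch of the usual Nautilus replacement flow (the osd id and device are placeholders; in a containerized deployment the ceph-volume steps run inside the OSD container or through the deployment tooling):

    ceph osd out osd.5                                    # drain the OSD under test
    # wait for recovery to finish, then stop the OSD container
    ceph osd destroy osd.5 --yes-i-really-mean-it         # keep the id so the new disk can reuse it
    ceph-volume lvm zap /dev/sdX --destroy                # wipe the old device if it is still present
    ceph-volume lvm create --osd-id 5 --data /dev/sdX     # build the replacement OSD on the new disk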

[ceph-users] OSD crashed during the fio test

2019-10-01 Thread Alex Litvak
Hello everyone, Can you shed some light on the cause of the crash? Could a client request actually trigger it? Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]: 2019-09-30 22:52:58.867 7f093d71e700 -1 bdev(0x55b72c156000 /var/lib/ceph/osd/ceph-17/block) aio_submit retries 16 Sep 30 22:52:58

[ceph-users] Commit and Apply latency on nautilus

2019-09-29 Thread Alex Litvak
Hello everyone, I am running a number of parallel benchmark tests against the cluster that should be ready to go to production. I enabled prometheus to monitor various information and while cluster stays healthy through the tests with no errors or slow requests, I noticed an apply / commit
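For anyone reproducing this, the same latencies Prometheus graphs can be sampled directly; a small sketch (osd id is a placeholder):

    ceph osd perf                                              # per-OSD commit_latency(ms) / apply_latency(ms)
    ceph daemon osd.0 perf dump | grep -A3 '"op_w_latency"'    # finer-grained write latency counters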

Re: [ceph-users] Nautilus 14.2.3 packages appearing on the mirrors

2019-09-03 Thread Alex Litvak
If it is a release it broke my ansible installation because it is missing librados2 https://download.ceph.com/rpm-nautilus/el7/x86_64/librados2-14.2.3-0.el7.x86_64.rpm (404 Not Found). Please fix it one way or another. On 9/3/2019 8:31 PM, Sasha Litvak wrote: Is there an actual release or

Re: [ceph-users] Nautilus 14.2.1 / 14.2.2 crash

2019-07-24 Thread Alex Litvak
components rather than turning everything up to 20. Sorry for hijacking the post, I will create a new one when I have more information. On 7/23/2019 9:50 PM, Alex Litvak wrote: I just had an osd crash with no logs (debug was not enabled). It happened 24 hours after the actual upgrade from 14.2.1

Re: [ceph-users] Nautilus 14.2.1 / 14.2.2 crash

2019-07-23 Thread Alex Litvak
I just had an osd crash with no logs (debug was not enabled). It happened 24 hours after the actual upgrade from 14.2.1 to 14.2.2. Nothing else changed as far as environment or load. The disk is OK. I restarted the osd and it came back. Had the cluster up for 2 months until the upgrade without an issue.

Re: [ceph-users] Nautilus 14.2.1 / 14.2.2 crash

2019-07-19 Thread Alex Litvak
I was planning to upgrade 14.2.1 to 14.2.2 next week. Since there are a few reports of crashes, does anyone know if the upgrade somehow triggers the issue? If not, then what does? Since this has been reported by some before the upgrade, I am just wondering if upgrading to 14.2.2 makes the problem worse.

Re: [ceph-users] Multiple OSD crashes

2019-07-19 Thread Alex Litvak
The issue should have been resolved by backport https://tracker.ceph.com/issues/40424 in nautilus, was it merged into 14.2.2 ? Also do you think it is safe to upgrade from 14.2.1 to 14.2.2 ? On 7/19/2019 1:05 PM, Paul Emmerich wrote: I've also encountered a crash just like that after

Re: [ceph-users] Client admin socket for RBD

2019-06-25 Thread Alex Litvak
rek Zegar, Senior SDS Engineer, Email tze...@us.ibm.com, Mobile 630.974.7172. Alex Litvak ---06/24/2019 01:07:28 PM---Jason, Here you go: From: Alex Litvak <alexander.v.litva

Re: [ceph-users] Client admin socket for RBD

2019-06-24 Thread Alex Litvak
Jason, What are you suggesting we do? Remove this line from the config database and keep it in config files instead? On 6/24/2019 1:12 PM, Jason Dillaman wrote: On Mon, Jun 24, 2019 at 2:05 PM Alex Litvak wrote: Jason, Here you go: WHO MASK LEVEL OPTION

Re: [ceph-users] Client admin socket for RBD

2019-06-24 Thread Alex Litvak
On 6/24/2019 11:50 AM, Jason Dillaman wrote: On Sun, Jun 23, 2019 at 4:27 PM Alex Litvak wrote: Hello everyone, I encounter this with the nautilus client and not with mimic. Removing the admin socket entry from the config on the client makes no difference. Error: rbd ls -p one 2019-06-23 12:58:29.344

[ceph-users] Client admin socket for RBD

2019-06-23 Thread Alex Litvak
Hello everyone, I encounter this with the nautilus client and not with mimic. Removing the admin socket entry from the config on the client makes no difference. Error: rbd ls -p one 2019-06-23 12:58:29.344 7ff2710b0700 -1 set_mon_vals failed to set admin_socket = /var/run/ceph/$name.$pid.asok: Configuration
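The fix discussed in the replies above amounts to keeping the option out of the mon config database; a rough sketch, assuming the entry was stored there with ceph config set:

    ceph config dump | grep admin_socket   # confirm where the option is coming from
    ceph config rm client admin_socket     # drop it from the mon config database
    # if a per-client socket is still wanted, put it back in the local ceph.conf [client] section instead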

[ceph-users] Failed Disk simulation question

2019-05-21 Thread Alex Litvak
Hello cephers, I know that a similar question was posted 5 years ago. However, the answer was inconclusive for me. I installed a new Nautilus 14.2.1 cluster and started pre-production testing. I followed the RedHat document and simulated a soft disk failure by # echo 1 >
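The truncated command is presumably the sysfs removal from the RedHat procedure; a hedged sketch (device and SCSI host names are placeholders):

    echo 1 > /sys/block/sdc/device/delete            # drop the disk from the kernel, simulating a failure
    ceph osd tree | grep -w down                     # the matching OSD should go down shortly after
    echo "- - -" > /sys/class/scsi_host/host0/scan   # rescan later to bring the device back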

Re: [ceph-users] Constant Compaction on one mimic node

2019-03-18 Thread Alex Litvak
From what I see, the message is generated by a mon container on each node. Does the mon issue a manual compaction of rocksdb at some point (the debug message is a rocksdb one)? On 3/18/2019 12:33 AM, Konstantin Shalygin wrote: I am getting a huge number of messages on one out of three nodes showing Manual

Re: [ceph-users] Constant Compaction on one mimic node

2019-03-18 Thread Alex Litvak
Konstantin, I am not sure I understand. You mean something in the container does a manual compaction job sporadically? What would be doing that? I am confused. On 3/18/2019 12:33 AM, Konstantin Shalygin wrote: I am getting a huge number of messages on one out of three nodes showing Manual

Re: [ceph-users] Constant Compaction on one mimic node

2019-03-17 Thread Alex Litvak
those things. Thank you again, On 3/17/2019 4:11 AM, Alex Litvak wrote: Hello everyone, I am getting a huge number of messages on one out of three nodes showing Manual compaction starting all the time. I see no such log entries on the other nodes in the cluster. Mar 16 06:40:11 storage1n1

[ceph-users] How to lower log verbosity

2019-03-17 Thread Alex Litvak
Hello everyone, As I am troubleshooting an issue, I see logs literally littered with messages such as the ones below. I searched the documentation and couldn't find a specific debug knob to turn. I see some debugging is on by default, but I don't need to see the stuff below, especially the mgr and client messages repeating.
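A hedged example of the kind of knobs involved; the subsystem names below are illustrative and should be matched to whatever is actually flooding the log:

    ceph config set mgr debug_mgr 0/5         # quiet mgr chatter
    ceph config set client debug_client 0/5   # quiet client-side debug
    ceph config set global debug_ms 0/0       # messenger logging is a common offender
    ceph config get osd debug_rocksdb         # check a current value before lowering it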

[ceph-users] Constant Compaction on one mimic node

2019-03-17 Thread Alex Litvak
Hello everyone, I am getting a huge number of messages on one out of three nodes showing Manual compaction starting all the time. I see no such log entries on the other nodes in the cluster. Mar 16 06:40:11 storage1n1-chi docker[24502]: debug 2019-03-16 06:40:11.441 7f6967af4700 4

Re: [ceph-users] Chasing slow ops in mimic

2019-03-12 Thread Alex Litvak
"time": "2019-03-08 07:53:37.002282", "event": "done" } ] } }, It just tell me throttled, nothing else. What does throttled mean in this case? I see so

[ceph-users] Chasing slow ops in mimic

2019-03-11 Thread Alex Litvak
Hello Cephers, I am trying to find the cause of multiple slow ops that happened on my small cluster. I have a 3-node cluster with 9 OSDs, Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, 128 GB RAM. Each OSD is an Intel DC-S3710 800GB SSD. It runs mimic 13.2.2 in containers. The cluster was operating normally for 4
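For context, the per-OSD admin socket is usually the quickest way to see where a slow op spent its time; a sketch (osd id is a placeholder):

    ceph daemon osd.3 dump_historic_slow_ops           # recent slow ops with their event timelines
    ceph daemon osd.3 dump_ops_in_flight               # ops currently stuck in the pipeline
    ceph daemon osd.3 perf dump | grep -A3 throttle-   # throttle counters, relevant to the "throttled" event quoted in the reply above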

Re: [ceph-users] redirect log to syslog and disable log to stderr

2019-02-26 Thread Alex Litvak
Dear Cephers, In mimic 13.2.2 ceph tell mgr.* injectargs --log-to-stderr=false Returns an error (no valid command found ...). What is the correct way to inject mgr configuration values? The same command works on mon ceph tell mon.* injectargs --log-to-stderr=false Thank you in advance,
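A hedged sketch of alternatives that typically work on mimic; whether the mgr.* wildcard form should work at all is exactly the open question here:

    ceph config set mgr log_to_stderr false                          # persist it in the mon config database
    ceph daemon mgr.$(hostname -s) config set log_to_stderr false    # via the local admin socket (assumes the mgr id is the short hostname)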

[ceph-users] redirect log to syslog and disable log to stderr

2019-02-22 Thread Alex Litvak
Hello everyone, I am running a mimic 13.2.2 cluster in containers and noticed that docker logs ate all of my local disk space after a while. So I changed some debug levels for rocksdb, leveldb, and memdb to 1/5 (default 4/5) and changed mon logging as such ceph tell mon.* injectargs
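For the syslog half of the subject line, a rough sketch of the relevant options, set persistently rather than injected (on the assumption the containers log to stderr by default):

    ceph config set global log_to_stderr false
    ceph config set global err_to_stderr false
    ceph config set global log_to_syslog true
    ceph config set global err_to_syslog true
    # capping docker's own log growth (json-file driver with max-size/max-file) also helps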

Re: [ceph-users] Mimic 13.2.3?

2019-01-03 Thread Alex Litvak
It is true for all distros. This isn't the first time it has happened either. I think it is a bit dangerous. On 1/3/19 12:25 AM, Ashley Merrick wrote: Have just run an apt update and have noticed there are some CEPH packages now available for update on my mimic cluster / ubuntu. Have yet to install

[ceph-users] Active mds respawns itself during standby mds reboot

2018-12-19 Thread Alex Litvak
Hello everyone, I am running mds + mon on 3 nodes. Recently, due to increased cache pressure and the NUMA non-interleave effect, we decided to double the memory on the nodes from 32 G to 64 G. We wanted to upgrade a standby node first to be able to test the new memory vendor. So without much

Re: [ceph-users] How you handle failing/slow disks?

2018-11-22 Thread Alex Litvak
Sorry for hijacking the thread, but do you have an idea of what to watch for? I monitor the admin sockets of the osds, and occasionally I see a burst of both op_w_process_latency and op_w_latency to near 150 - 200 ms on 7200 RPM enterprise SAS drives. For example, load average on the node jumps up with idle at 97
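A sketch of the sampling being described, for anyone wanting to reproduce it (osd id is a placeholder; the counters live under different sections on FileStore and BlueStore):

    ceph daemon osd.7 perf dump | grep -A3 -E '"op_w_latency"|"op_w_process_latency"'
    watch -n 5 ceph osd perf   # coarser cluster-wide view of commit/apply latency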

Re: [ceph-users] Huge latency spikes

2018-11-20 Thread Alex Litvak
John, If I go with write through, shouldn't disk cache be enabled? On 11/20/2018 6:12 AM, John Petrini wrote: I would disable cache on the controller for your journals. Use write through and no read ahead. Did you make sure the disk cache is disabled? On Tuesday, November 20, 2018, Alex

Re: [ceph-users] Huge latency spikes

2018-11-19 Thread Alex Litvak
I went through a raid controller firmware update. I replaced a pair of SSDs with new ones. Nothing has changed. The controller card utility shows that no patrol read happens and the battery backup is in good shape. The cache policy is WriteBack. I am aware of the bad battery effect, but it

Re: [ceph-users] Huge latency spikes

2018-11-18 Thread Alex Litvak
you On Mon, 19 Nov 2018 at 12:28 AM, Alex Litvak mailto:alexander.v.lit...@gmail.com>> wrote: All machines state the same. /opt/MegaRAID/MegaCli/MegaCli64 -LDGetProp -DskCache -Lall -a0 Adapter 0-VD 0(target id: 0): Disk Write Cache : Disk's Default Adapter 0-VD 1(target

Re: [ceph-users] Huge latency spikes

2018-11-18 Thread Alex Litvak
check ssd disk caches. On Sun, Nov 18, 2018 at 11:40 AM Alex Litvak wrote: All 3 nodes have this status for SSD mirror. Controller cache is on for all 3. Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write
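For completeness, the MegaCli incantations being discussed (adapter and VD selectors are the ones from the thread; -LDSetProp is the usual way to pin the drive cache instead of leaving "Disk's Default"):

    /opt/MegaRAID/MegaCli/MegaCli64 -LDGetProp -DskCache -LAll -a0    # show the per-VD drive cache setting
    /opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp -DisDskCache -LAll -a0 # force the drive write cache off
    /opt/MegaRAID/MegaCli/MegaCli64 -LDGetProp -Cache -LAll -a0       # show the controller cache policy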

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Alex Litvak
write cache on SSDs enabled on three servers? Can you check them? On Sun, Nov 18, 2018 at 9:05 AM Alex Litvak wrote: Raid card for journal disks is Perc H730 (Megaraid), RAID 1, battery back cache is on Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU Current Cache

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Alex Litvak
card installed on this system? What is the raid mode? On Sun, Nov 18, 2018 at 8:25 AM Alex Litvak wrote: Here is another snapshot. I wonder if this write io wait is too big Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Alex Litvak
0.09 11.75 0.00 11.75 2.25 1.80 dm-25 0.00 0.00 0.00 12.00 0.00 160.00 26.67 0.06 5.08 0.00 5.08 1.25 1.50 On 11/17/2018 10:19 PM, Alex Litvak wrote: I stand corrected, I looked at the device iostat, but it was partitioned. Here is a more

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Alex Litvak
I stand corrected, I looked at the device iostat, but it was partitioned. Here is a more correct picture of what is going on now. Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util dm-14 0.00 0.00 0.00

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Alex Litvak
The plot thickens: I checked c-states and apparently I am operating in C1 with all CPUs on. Apparently the servers were tuned to use latency-performance: tuned-adm active Current active profile: latency-performance turbostat shows Package Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI
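A sketch of how the C-state observation can be verified; the paths and kernel parameters are the standard ones, not taken from the thread:

    tuned-adm active                                   # confirm latency-performance is applied
    cat /sys/module/intel_idle/parameters/max_cstate   # 1 here matches "operating in C1"
    turbostat sleep 10                                 # deep C-state residency columns should stay near zero
    # if deeper states still appear, pinning via the kernel cmdline is the usual fallback:
    #   intel_idle.max_cstate=1 processor.max_cstate=1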

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Alex Litvak
John, Thank you for the suggestions. I looked into the journal SSDs. They are close to 3 years old, showing 5.17% of wear (352941 GB written to disk against a 3.6 PB endurance spec over 5 years). It could be that SMART is not telling everything, but that is what I see. Vendor Specific SMART Attributes with Thresholds:

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Alex Litvak
I am evaluating bluestore on the separate cluster. Unfortunately upgrading this one is out of the question at the moment for multiple reasons. That is why I am trying to find a possible root cause. On 11/17/2018 2:14 PM, Paul Emmerich wrote: Are you running FileStore? (The config options

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Alex Litvak
the spikes, however it is still small. On 11/17/2018 1:40 PM, Kees Meijs wrote: Hi Alex, What kind of clients do you use? Is it KVM (QEMU) using the NBD driver, kernel, or...? Regards, Kees On 17-11-18 20:17, Alex Litvak wrote: Hello everyone, I am trying to troubleshoot a cluster exhibiting huge

[ceph-users] Huge latency spikes

2018-11-17 Thread Alex Litvak
Hello everyone, I am trying to troubleshoot a cluster exhibiting huge spikes of latency. I cannot quite catch it because it happens during light activity and randomly affects one osd node out of 3 in the pool. This is a FileStore cluster. I see some osds exhibit an apply latency of 400 ms, 1

Re: [ceph-users] Need advise on proper cluster reweighing

2018-10-28 Thread Alex Litvak
with upmap. If you can, it is hands down the best way to balance your cluster. On Sat, Oct 27, 2018, 9:14 PM Alex Litvak <mailto:alexander.v.lit...@gmail.com>> wrote: I have a cluster using 2 roots.  I attempted to reweigh osds under the "default" root used by pool rbd,
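A hedged sketch of turning that advice into commands (requires every client to be luminous or newer):

    ceph osd set-require-min-compat-client luminous   # upmap needs luminous+ clients
    ceph balancer mode upmap
    ceph balancer on
    ceph balancer status                              # watch it even out utilization over time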

[ceph-users] Need advise on proper cluster reweighing

2018-10-27 Thread Alex Litvak
I have a cluster using 2 roots. I attempted to reweight osds under the "default" root, used by the rbd, cephfs-data, and cephfs-meta pools, using the Cern script crush-reweight-by-utilization.py. I ran it first and it showed 4 candidates (the script default); it shows the final weight and single step

Re: [ceph-users] Don't upgrade to 13.2.2 if you use cephfs

2018-10-08 Thread Alex Litvak
, maybe less when it is not in replay mode. Anyway, we've deactivated CephFS there for now. I'll try with older versions in a test environment. On Mon., 8 Oct. 2018 at 5:21, Alex Litvak () wrote: How is this not an emergency announcement? Also I wonder if I can downgrade at all? I am

Re: [ceph-users] Don't upgrade to 13.2.2 if you use cephfs

2018-10-07 Thread Alex Litvak
How is this not an emergency announcement? Also, I wonder if I can downgrade at all? I am using ceph with docker, deployed with ceph-ansible. I wonder if I should push a downgrade or basically wait for the fix. I believe a fix needs to be provided. Thank you, On 10/7/2018 9:30 PM, Yan,

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-02 Thread Alex Litvak
gor On 10/2/2018 5:04 PM, Alex Litvak wrote: I am sorry for interrupting the thread, but my understanding always was that BlueStore on a single device should not care about the DB size, i.e. it would use the data part for all operations if the DB is full. And if it is not true, what would be sens

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-02 Thread Alex Litvak
I am sorry for interrupting the thread, but my understanding always was that BlueStore on a single device should not care about the DB size, i.e. it would use the data part for all operations if the DB is full. And if that is not true, what would be sensible defaults on an 800 GB SSD? I used

Re: [ceph-users] Strange Client admin socket error in a containerized ceph environment

2018-08-30 Thread Alex Litvak
not be modified at runtime On 08/30/2018 09:06 PM, Alex Litvak wrote: I keep getting the following error message: 2018-08-30 18:52:37.882 7fca9df7c700 -1 asok(0x7fca98000fe0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph

[ceph-users] Strange Client admin socket error in a containerized ceph environment

2018-08-30 Thread Alex Litvak
I keep getting the following error message: 2018-08-30 18:52:37.882 7fca9df7c700 -1 asok(0x7fca98000fe0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-client.admin.asok': (17) File exists Otherwise things seem
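One common way out of the (17) File exists collision is to make the socket path unique per client process; a sketch of a ceph.conf fragment (the exact metavariable choice is an assumption, any per-process one works):

    [client]
        admin socket = /var/run/ceph/$cluster-$name.$pid.asok   # $pid keeps concurrent clients from fighting over one socket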

[ceph-users] Jewel Release

2018-03-02 Thread Alex Litvak
Are there plans to release Jewel 10.2.11 before the end of support?

Re: [ceph-users] Ceph 0.94.8 Hammer released

2016-08-29 Thread Alex Litvak
Hammer RPMs for 0.94.8 are still not available for EL6. Can this please be addressed? Thank you in advance, On 08/27/2016 06:25 PM, alexander.v.lit...@gmail.com wrote: RPMs are not available on the distro side. On Fri, 26 Aug 2016 21:31:45 + (UTC), Sage Weil wrote: