Re: [ceph-users] ceph-disk: Error: No cluster conf found in /etc/ceph with fsid

2016-05-26 Thread Fulvio Galeazzi

Hallo,
as I spent the whole afternoon on a similar issue...  :-)

  Run purge (this will also remove the Ceph packages; I am assuming you
don't care much about the existing setup).


On all nodes (mon/osd/admin) remove the Ceph state directory:
  rm -rf /var/lib/ceph/

On the OSD nodes make sure you mount all data partitions, then remove
their content and fix the ownership:
  rm -rf /srv/node/*/*
  chown -R ceph.ceph /srv/node/*/

On the admin node, from the cluster-administration directory, remove
the keys:
  rm ceph.bootstrap* ceph*keyring
and only leave the old ceph.conf, which you will probably substitute
with the default one after the purge (assuming, for example, you either
spent some time playing with it and/or you want to force a specific
cluster fsid).
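
As for the purge mentioned in the first step, assuming you deploy with
ceph-deploy, a minimal sketch (run from the cluster-administration
directory on the admin node; node names are placeholders) would be:

  ceph-deploy purge node1 node2 node3       # remove ceph packages and data
  ceph-deploy purgedata node1 node2 node3   # wipe /var/lib/ceph and /etc/ceph remnants
  ceph-deploy forgetkeys                    # drop the locally cached keyrings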


  Good luck

Fulvio





[ceph-users] Questions on rbd-mirror

2017-03-24 Thread Fulvio Galeazzi
Hallo, apologies for my (silly) questions: I did try to find some 
documentation on rbd-mirror but could not find much, apart from a number 
of pages explaining how to install it.


My environment is CentOS 7 and Ceph 10.2.5.

Can anyone help me understand a few minor things:

 - is there a cleaner way to configure the user which will be used by
   rbd-mirror, other than editing the ExecStart in the file
   /usr/lib/systemd/system/ceph-rbd-mirror@.service? (A systemd drop-in,
   sketched after this list, at least avoids touching the packaged unit.)

   For example some line in ceph.conf... it looks like the username
   defaults to the cluster name, am I right?

 - is it possible to throttle mirroring? Sure, it's a crazy thing to do
   for "cinder" pools, but may make sense for slowly changing ones, like
   a "glance" pool.

 - is it possible to set per-pool default features? I read about
"rbd default features = ###"
   but this is a global setting. (Ok, I can still restrict pools to be
   mirrored with "ceph auth" for the user doing mirroring)
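
The drop-in I mentioned above would look something like this (only a
sketch: the original ExecStart line is quoted from memory and may not
match the packaged unit exactly, and "mirror-user" is a placeholder):

  # systemctl edit ceph-rbd-mirror@.service
  # creates /etc/systemd/system/ceph-rbd-mirror@.service.d/override.conf
  [Service]
  ExecStart=
  ExecStart=/usr/bin/rbd-mirror -f --cluster ${CLUSTER} --id mirror-user --setuser ceph --setgroup ceph

  # then: systemctl daemon-reload && systemctl restart ceph-rbd-mirror@<instance>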


  Thanks!

Fulvio





[ceph-users] Disk activation issue on 10.2.9, too (Re: v11.2.0 Disk activation issue while booting)

2017-07-21 Thread Fulvio Galeazzi

Hallo David, all,
sorry for hijacking the thread, but I am seeing the same issue, 
although on 10.2.7/10.2.9...



Note that I am using disks taken from a SAN, so the GUIDs in my case are 
those relevant to MPATH.

As per other messages in this thread, I modified:
 - /usr/lib/systemd/system/ceph-osd.target
   adding to the [Unit] stanza:
Before=ceph.target
 - /usr/lib/udev/rules.d/60-ceph-by-parttypeuuid.rules
   appending the string:
, SYMLINK+="disk/by-partuuid/$env{ID_PART_ENTRY_UUID}"
   at the end of this line:
ENV{ID_PART_ENTRY_SCHEME}=="gpt", ENV{ID_PART_ENTRY_TYPE}=="?*", ENV{ID_PART_ENTRY_UUID}=="?*", SYMLINK+="disk/by-parttypeuuid/$env{ID_PART_ENTRY_TYPE}.$env{ID_PART_ENTRY_UUID}"
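
To check the rule change without rebooting, something like the following
should be enough (just a sketch):

  udevadm control --reload      # pick up the edited rules file
  udevadm trigger               # re-run the rules for the existing devices
  ls -l /dev/disk/by-partuuid/  # ceph data/journal partitions should now be linked here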



df shows (picked a problematic partition and one which mounted OK)
.
/dev/mapper/3600a0980005de737095a56c510cd1   3878873588  142004  3878731584  1% /var/lib/ceph/osd/cephba1-27
/dev/mapper/3600a0980005ddf751e2558e2bac7p1  7779931116  202720  7779728396  1% /var/lib/ceph/tmp/mnt.XL7WkY


Yet, for both the GUIDs seem correct:

=== /dev/mapper/3600a0980005de737095a56c510cd
Partition GUID code: 4FBD7E29-8AE0-4982-BF9D-5A8D867AF560 (Unknown)
Partition unique GUID: B01E2E0D-9903-4F23-A5FD-FC1C1CB458C3
Partition size: 7761536991 sectors (3.6 TiB)
Partition name: 'ceph data'
Partition GUID code: 45B0969E-8AE0-4982-BF9D-5A8D867AF560 (Unknown)
Partition unique GUID: E1B3970A-FABF-4AC0-8B6A-F7526989FF36
Partition size: 4096 sectors (19.5 GiB)
Partition name: 'ceph journal'

=== /dev/mapper/3600a0980005ddf751e2558e2bac7
Partition GUID code: 4FBD7E29-8AE0-4982-BF9D-5A8D867AF560 (Unknown)
Partition unique GUID: 93A91EBF-A531-4002-A49F-B24F27E962DD
Partition size: 15564036063 sectors (7.2 TiB)
Partition name: 'ceph data'
Partition GUID code: 45B0969E-8AE0-4982-BF9D-5A8D867AF560 (Unknown)
Partition unique GUID: 2AF9B162-3398-49BD-B6EF-5D284C4A930B
Partition size: 4096 sectors (19.5 GiB)
Partition name: 'ceph journal'

  I rather suspect some sort of race condition, possibly hitting some 
timeout within systemctl... (please read the end of this message).
I am led to think this because the OSDs which are successfully mounted 
after each reboot are a "random" subset of the configured ones (total 
~40): also, after two or three mounts of /var/lib/ceph/mnt... ceph-osd 
apparently gives up.



The only workaround I found to get things going is re-running 
ceph-ansible, but it takes so long...


Have you any idea as to what is going on here? Has anybody seen (and 
solved) the same issue?


  Thanks!

Fulvio





[root@r3srv07.ba1 ~]# cat /var/lib/ceph/tmp/mnt.XL7WkY/whoami
143
[root@r3srv07.ba1 ~]# umount /var/lib/ceph/tmp/mnt.XL7WkY
[root@r3srv07.ba1 ~]# systemctl status ceph-osd@143.service
● ceph-osd@143.service - Ceph object storage daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled; 
vendor preset: disabled)
   Active: failed (Result: start-limit) since Fri 2017-07-21 11:02:23 
CEST; 1h 35min ago
  Process: 40466 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} 
--id %i --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
  Process: 40217 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh 
--cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)

 Main PID: 40466 (code=exited, status=1/FAILURE)


Jul 21 11:02:03 r3srv07.ba1.box.garr systemd[1]: ceph-osd@143.service: 
main process exited, code=exited, status=1/FAILURE
Jul 21 11:02:03 r3srv07.ba1.box.garr systemd[1]: Unit 
ceph-osd@143.service entered failed state.
Jul 21 11:02:03 r3srv07.ba1.box.garr systemd[1]: ceph-osd@143.service 
failed.
Jul 21 11:02:23 r3srv07.ba1.box.garr systemd[1]: ceph-osd@143.service 
holdoff time over, scheduling restart.
Jul 21 11:02:23 r3srv07.ba1.box.garr systemd[1]: start request repeated 
too quickly for ceph-osd@143.service
Jul 21 11:02:23 r3srv07.ba1.box.garr systemd[1]: Failed to start Ceph 
object storage daemon.
Jul 21 11:02:23 r3srv07.ba1.box.garr systemd[1]: Unit 
ceph-osd@143.service entered failed state.
Jul 21 11:02:23 r3srv07.ba1.box.garr systemd[1]: ceph-osd@143.service 
failed.

[root@r3srv07.ba1 ~]# systemctl restart ceph-osd@143.service
[root@r3srv07.ba1 ~]# systemctl status ceph-osd@143.service
● ceph-osd@143.service - Ceph object storage daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled; 
vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Fri 
2017-07-21 12:38:11 CEST; 1s ago
  Process: 74658 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} 
--id %i --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
  Process: 74644 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh 
--cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)

 Main PID: 74658 (code=exited, status=1/FAILURE)

Jul 21 12:38:11 r3srv07.ba1.box.garr systemd[1]: Unit 
ceph-osd@143.service entered failed state.
Jul 21 12:38:11 r3srv07.ba1.box.garr systemd[1]: ceph-osd@143.service 
failed.


Re: [ceph-users] Disk activation issue on 10.2.9, too (Re: v11.2.0 Disk activation issue while booting)

2017-07-21 Thread Fulvio Galeazzi
Hallo again, replying to my own message to provide some more info, and 
ask one more question.


  Not sure I mentioned, but I am on CentOS 7.3.

  I tried to insert a sleep in ExecStartPre in 
/usr/lib/systemd/system/ceph-osd@.service but apparently all ceph-osd 
are started (and retried) at the same time.


  I finally noticed that a simple
ceph-disk activate <device>
is sufficient to recover the OSD.


  Questions:

 =  why am I not able to restart the OSD via
systemctl restart ceph-osd@##.service
whereas ceph-disk activate magically works?

 = (off-topic) I also see systemd complaining about OSD## which
   at some point existed on the host but later were reassigned to
   another one. Tried to "systemctl stop/disable ceph-osd@##" but
   those seem to reappear at boot... any idea how to fix this?


  I could easily take care of the "OSD not activating at boot" with 
something simple in rc.local, but I wonder whether someone is aware of a 
cleaner solution.
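
For the record, the rc.local workaround I have in mind is nothing more
than this (a sketch, untested; the sleep value is a guess to let
multipath/udev settle on my CentOS 7.3 hosts):

  # /etc/rc.d/rc.local -- last-resort activation of OSDs that did not come up at boot
  sleep 60
  ceph-disk activate-all   # re-activates anything found under /dev/disk/by-parttypeuuid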


  Thanks!

Fulvio





Re: [ceph-users] Blocked requests

2017-12-13 Thread Fulvio Galeazzi

Hallo Matthew,
I am now facing the same issue and found this message of yours.
  Were you eventually able to figure out what the problem is with 
erasure-coded pools?


At first sight, the bugzilla page linked by Brian does not seem to 
specifically mention erasure-coded pools...


  Thanks for your help

Fulvio

 Original Message 
Subject: Re: [ceph-users] Blocked requests
From: Matthew Stroud 
To: Brian Andrus 
CC: "ceph-users@lists.ceph.com" 
Date: 09/07/2017 11:01 PM

After some troubleshooting, the issues appear to be caused by gnocchi 
using rados. I’m trying to figure out why.


Thanks,

Matthew Stroud

*From: *Brian Andrus 
*Date: *Thursday, September 7, 2017 at 1:53 PM
*To: *Matthew Stroud 
*Cc: *David Turner , "ceph-users@lists.ceph.com" 


*Subject: *Re: [ceph-users] Blocked requests

"ceph osd blocked-by" can do the same thing as that provided script.

Can you post relevant osd.10 logs and a pg dump of an affected placement 
group? Specifically interested in recovery_state section.


Hopefully you were careful in how you were rebooting OSDs, and not 
rebooting multiple in the same failure domain before recovery was able 
to occur.


On Thu, Sep 7, 2017 at 12:30 PM, Matthew Stroud wrote:


Here is the output of your snippet:

[root@mon01 ceph-conf]# bash /tmp/ceph_foo.sh

   6 osd.10

52  ops are blocked > 4194.3   sec on osd.17

9   ops are blocked > 2097.15  sec on osd.10

4   ops are blocked > 1048.58  sec on osd.10

39  ops are blocked > 262.144  sec on osd.10

19  ops are blocked > 131.072  sec on osd.10

6   ops are blocked > 65.536   sec on osd.10

2   ops are blocked > 32.768   sec on osd.10

Here is some backfilling info:

[root@mon01 ceph-conf]# ceph status

     cluster 55ebbc2d-c5b7-4beb-9688-0926cefee155

  health HEALTH_WARN

     5 pgs backfilling

     5 pgs degraded

     5 pgs stuck degraded

     5 pgs stuck unclean

     5 pgs stuck undersized

     5 pgs undersized

     122 requests are blocked > 32 sec

     recovery 2361/1097929 objects degraded (0.215%)

     recovery 5578/1097929 objects misplaced (0.508%)

  monmap e1: 3 mons at
{mon01=10.20.57.10:6789/0,mon02=10.20.57.11:6789/0,mon03=10.20.57.12:6789/0}

     election epoch 58, quorum 0,1,2 mon01,mon02,mon03

  osdmap e6511: 24 osds: 21 up, 21 in; 5 remapped pgs

     flags sortbitwise,require_jewel_osds

   pgmap v6474659: 2592 pgs, 5 pools, 333 GB data, 356 kobjects

     1005 GB used, 20283 GB / 21288 GB avail

     2361/1097929 objects degraded (0.215%)

     5578/1097929 objects misplaced (0.508%)

     2587 active+clean

    5 active+undersized+degraded+remapped+backfilling

[root@mon01 ceph-conf]# ceph pg dump_stuck unclean

ok

pg_stat  state                                             up         up_primary  acting   acting_primary
3.5c2    active+undersized+degraded+remapped+backfilling   [17,2,10]  17          [17,2]   17
3.54a    active+undersized+degraded+remapped+backfilling   [10,19,2]  10          [10,17]  10
5.3b     active+undersized+degraded+remapped+backfilling   [3,19,0]   3           [10,17]  10
5.b3     active+undersized+degraded+remapped+backfilling   [10,19,2]  10          [10,17]  10
3.180    active+undersized+degraded+remapped+backfilling   [17,10,6]  17          [22,19]  22

Most of the backfilling was caused by restarting OSDs to clear
blocked IO. Here are some of the blocked IOs:

/var/log/ceph/ceph.log:2017-09-07 13:29:36.978559 osd.10
10.20.57.15:6806/7029  9362 : cluster
[WRN] slow request 60.834494 seconds old, received at 2017-09-07
13:28:36.143920: osd_op(client.114947.0:2039090 5.e637a4b3
(undecoded) ack+read+balance_reads+skiprwlocks+known_if_redirected
e6511) currently queued_for_pg

/var/log/ceph/ceph.log:2017-09-07 13:29:36.978565 osd.10
10.20.57.15:6806/7029  9363 : cluster
[WRN] slow request 240.661052 seconds old, received at 2017-09-07
13:25:36.317363: osd_op(client.246934107.0:3 5.f69addd6 (undecoded)
ack+read+known_if_redirected e6511) currently queued_for_pg

/var/log/ceph/ceph.log:2017-09-07 13:29:36.978571 osd.10
10.20.57.15:6806/7029  9364 : cluster
[WRN] slow request 240.660763 seconds old, received at 2017-09-07
13:25:36.317651: 

Re: [ceph-users] Blocked requests

2017-12-14 Thread Fulvio Galeazzi

Hallo Matthew, thanks for your feedback!
  Please clarify one point: do you mean that you recreated the pool as an 
erasure-coded one, or that you recreated it as a regular replicated one? 
In other words, do you now have an erasure-coded pool in production as a 
gnocchi backend?


  In any case, from the instability you mention, experimenting with 
BlueStore looks like a better alternative.


  Thanks again

Fulvio

 Original Message 
Subject: Re: [ceph-users] Blocked requests
From: Matthew Stroud <mattstr...@overstock.com>
To: Fulvio Galeazzi <fulvio.galea...@garr.it>, Brian Andrus 
<brian.and...@dreamhost.com>

CC: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Date: 12/13/2017 5:05 PM


We fixed it by destroying the pool and recreating it, though this isn’t really a 
fix. Come to find out, Ceph has a weakness for small, high-change-rate objects 
(the behavior that gnocchi displays). The cluster will keep going fine until an 
event (a reboot, an OSD failure, etc.) happens. I haven’t been able to find 
another solution.

I have heard that BlueStore handles this better, but that wasn’t stable on the 
release we are on.

Thanks,
Matthew Stroud

On 12/13/17, 3:56 AM, "Fulvio Galeazzi" <fulvio.galea...@garr.it> wrote:

 Hallo Matthew,
  I am now facing the same issue and found this message of yours.
Were you eventually able to figure what the problem is, with
 erasure-coded pools?

 At first sight, the bugzilla page linked by Brian does not seem to
 specifically mention erasure-coded pools...

Thanks for your help

 Fulvio

  Original Message 
 Subject: Re: [ceph-users] Blocked requests
 From: Matthew Stroud <mattstr...@overstock.com>
 To: Brian Andrus <brian.and...@dreamhost.com>
 CC: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
 Date: 09/07/2017 11:01 PM

 > After some troubleshooting, the issues appear to be caused by gnocchi
 > using rados. I’m trying to figure out why.
 >
 > Thanks,
 >
 > Matthew Stroud
 >
 > *From: *Brian Andrus <brian.and...@dreamhost.com>
 > *Date: *Thursday, September 7, 2017 at 1:53 PM
 > *To: *Matthew Stroud <mattstr...@overstock.com>
 > *Cc: *David Turner <drakonst...@gmail.com>, "ceph-users@lists.ceph.com"
 > <ceph-users@lists.ceph.com>
 > *Subject: *Re: [ceph-users] Blocked requests
 >
 > "ceph osd blocked-by" can do the same thing as that provided script.
 >
 > Can you post relevant osd.10 logs and a pg dump of an affected placement
 > group? Specifically interested in recovery_state section.
 >
 > Hopefully you were careful in how you were rebooting OSDs, and not
 > rebooting multiple in the same failure domain before recovery was able
 > to occur.
 >
 > On Thu, Sep 7, 2017 at 12:30 PM, Matthew Stroud
 > <mattstr...@overstock.com <mailto:mattstr...@overstock.com>> wrote:
 >
 > Here is the output of your snippet:
 >
 > [root@mon01 ceph-conf]# bash /tmp/ceph_foo.sh
 >
 >6 osd.10
 >
 > 52  ops are blocked > 4194.3   sec on osd.17
 >
 > 9   ops are blocked > 2097.15  sec on osd.10
 >
 > 4   ops are blocked > 1048.58  sec on osd.10
 >
 > 39  ops are blocked > 262.144  sec on osd.10
 >
 > 19  ops are blocked > 131.072  sec on osd.10
 >
 > 6   ops are blocked > 65.536   sec on osd.10
 >
 > 2   ops are blocked > 32.768   sec on osd.10
 >
 > Here is some backfilling info:
 >
 > [root@mon01 ceph-conf]# ceph status
 >
 >  cluster 55ebbc2d-c5b7-4beb-9688-0926cefee155
 >
 >   health HEALTH_WARN
 >
 >  5 pgs backfilling
 >
 >  5 pgs degraded
 >
 >  5 pgs stuck degraded
 >
 >  5 pgs stuck unclean
 >
 >  5 pgs stuck undersized
 >
 >  5 pgs undersized
 >
 >  122 requests are blocked > 32 sec
 >
 >  recovery 2361/1097929 objects degraded (0.215%)
 >
 >  recovery 5578/1097929 objects misplaced (0.508%)
 >
 >   monmap e1: 3 mons at
 > 
{mon01=10.20.57.10:6789/0,mon02=10.20.57.11:6789/0,mon03=10.20.57.12:6789/0
 > 
<http://10.20.57.10:6789/0,mon02=10.20.57.11:6789/0,mon03=10.20.57.12:6789/0>}
 >
 >  election epoch 58, quor

[ceph-users] About "ceph balancer": typo in doc, restrict by class

2018-05-28 Thread Fulvio Galeazzi

Hallo,
I am using 12.2.4 and started using "ceph balancer". Indeed it does 
a great job, thanks!


  I have a few comments:

 - in the documentation http://docs.ceph.com/docs/master/mgr/balancer/
   I think there is an error, since
ceph config set mgr mgr/balancer/max_misplaced .07
   should be replaced by
ceph config-key set mgr/balancer/max_misplaced 0.07
   (a fuller example of the commands involved is sketched after this list)

 - when running in automatic mode, I observed that although I set
   max_misplaced to 1%, sometimes the fraction of misplaced PGs goes
   slightly above that (up to 1.5% or so): probably because a new round of
   optimization takes place as soon as there are no degraded objects,
   even though there may still be some misplaced objects around?
   Not a big deal, though, I just need to remember to set max_misplaced
   to a slightly lower value.

 - do you think it would make sense to be able to optionally restrict
   balancer to run only  on a given device-class? I have defined a
   custom device class "big" which I use for EC-backed pools, and would
   like to selectively include/exclude those units in/from the
   optimization process.
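
For reference, a minimal sequence reproducing my setup could look like
this (a sketch only; the mode and the threshold are examples, not
recommendations):

ceph config-key set mgr/balancer/max_misplaced 0.01   # keep misplaced PGs around 1%
ceph balancer mode upmap                              # needs luminous-or-later clients
ceph balancer on                                      # enable automatic mode
ceph balancer status                                  # check what it is doing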

  Thanks a lot!

Fulvio





Re: [ceph-users] SSD recommendation

2018-05-31 Thread Fulvio Galeazzi

Hallo Simon,
I am also about to buy some new hardware and for SATA ~400GB I was 
considering Micron 5200 MAX, rated at 5 DWPD, for journaling/FSmetadata.

  Is anyone using such drives, and to what degree of satisfaction?

  Thanks

Fulvio

 Original Message 
Subject: Re: [ceph-users] SSD recommendation
From: Simon Ironside 
To: ceph-users@lists.ceph.com
Date: 5/31/2018 2:36 PM

It looks like the choices available to me in the SATA ~400GB and 3 DWPD 
over 5 years range pretty much boils down to just the Intel DC S4600 and 
the Samsung SM863a options anyway. Since David Herselman's thread has 
put me off Intels I think I'll go with the Samsungs.


Regards,
Simon.

On 30/05/18 20:00, Simon Ironside wrote:

Hi Everyone,

I'm about to purchase hardware for a new production cluster. I was 
going to use 480GB Intel DC S4600 SSDs as either Journal devices for 
Filestore and/or DB/WAL for Bluestore spinning disk OSDs until I saw 
David Herselman's "Many concurrent drive failures" thread which has 
given me the fear.


What's the current go to for Journal and/or DB/WAL SSDs if not the S4600?

I'm planning on using AMD EPYC based Supermicros for OSD nodes with 3x 
10TB SAS 7.2k to each SSD with 10gig networking. Happy to provide more 
info here if it's useful.


Thanks,
Simon.






[ceph-users] Missing udev rule for FC disks (Re: mkjournal error creating journal ... : (13) Permission denied)

2018-01-19 Thread Fulvio Galeazzi

Hallo,
 apologies for reviving an old thread, but I just wasted again one 
full day as I had forgotten about this issue...


   To recap, udev rules nowadays do not (at least in my case, I am 
using disks served via Fibre Channel) create the /dev/disk/by-partuuid 
links that ceph-disk expects.


I see the "culprit" is this line in (am on CentOS, but Ubuntu has the 
same issue): /usr/lib/udev/rules.d/60-persistent-storage.rules


.
# skip rules for inappropriate block devices
KERNEL=="fd*|mtd*|nbd*|gnbd*|btibm*|dm-*|md*|zram*|mmcblk[0-9]*rpmb", 
GOTO="persistent_storage_end"

.

stating that multipath'ed devices (called dm-*) should be skipped.


I can happily live with the file mentioned below, but was wondering:

- is there any hope that newer kernels may handle multipath devices
  properly?

- as an alternative, could it be possible to update ceph-disk
  such that symlinks for journal use some other
  /dev/disk/by-?

   Thanks!

Fulvio

On 3/16/2017 5:59 AM, Gunwoo Gim wrote:
  Thank you so much Peter. The 'udevadm trigger' after 'partprobe' 
triggered the udev rules and I've found out that even before the udev 
ruleset triggers the owner is already ceph:ceph.


  I've dug into ceph-disk a little more and found out that there is a 
symbolic link of 
/dev/disk/by-partuuid/120c536d-cb30-4cea-b607-dd347022a497 at 
[/dev/mapper/vg--hdd1-lv--hdd1p1(the_filestore_osd)]/journal and the 
source doesn't exist. though it exists in /dev/disk/by-parttypeuuid 
which has been populated by /lib/udev/rules.d/60-ceph-by-parttypeuuid.rules


  So I added this in /lib/udev/rules.d/60-ceph-by-parttypeuuid.rules:
# when ceph-disk prepares a filestore osd it makes a symbolic link by 
disk/by-partuuid but LVM2 doesn't seem to populate /dev/disk/by-partuuid.
ENV{ID_PART_ENTRY_SCHEME}=="gpt", ENV{ID_PART_ENTRY_TYPE}=="?*", 
ENV{ID_PART_ENTRY_UUID}=="?*", 
SYMLINK+="disk/by-partuuid/$env{ID_PART_ENTRY_UUID}"

  And finally got the osds all up and in. :D

  Yeah, It wasn't actually a permission problem, but the link just 
wasn't existing.



~ # ceph-disk -v activate /dev/mapper/vg--hdd1-lv--hdd1p1
...
mount: Mounting /dev/mapper/vg--hdd1-lv--hdd1p1 on 
/var/lib/ceph/tmp/mnt.ECAifr with options noatime,largeio,inode64,swalloc
command_check_call: Running command: /bin/mount -t xfs -o 
noatime,largeio,inode64,swalloc -- /dev/mapper/vg--hdd1-lv--hdd1p1 
/var/lib/ceph/tmp/mnt.ECAifr

mount: DIGGIN ls -al /var/lib/ceph/tmp/mnt.ECAifr
mount: DIGGIN total 36
drwxr-xr-x 3 ceph ceph  174 Mar 14 11:51 .
drwxr-xr-x 6 ceph ceph 4096 Mar 16 11:30 ..
-rw-r--r-- 1 root root  202 Mar 16 11:19 activate.monmap
-rw-r--r-- 1 ceph ceph   37 Mar 14 11:45 ceph_fsid
drwxr-xr-x 3 ceph ceph   39 Mar 14 11:51 current
-rw-r--r-- 1 ceph ceph   37 Mar 14 11:45 fsid
lrwxrwxrwx 1 ceph ceph   58 Mar 14 11:45 journal -> 
/dev/disk/by-partuuid/120c536d-cb30-4cea-b607-dd347022a497

-rw-r--r-- 1 ceph ceph   37 Mar 14 11:45 journal_uuid
-rw-r--r-- 1 ceph ceph   21 Mar 14 11:45 magic
-rw-r--r-- 1 ceph ceph    4 Mar 14 11:51 store_version
-rw-r--r-- 1 ceph ceph   53 Mar 14 11:51 superblock
-rw-r--r-- 1 ceph ceph    2 Mar 14 11:51 whoami
...
ceph_disk.main.Error: Error: ['ceph-osd', '--cluster', 'ceph', '--mkfs', 
'--mkkey', '-i', u'0', '--monmap', 
'/var/lib/ceph/tmp/mnt.ECAifr/activate.monmap', '--osd-data', 
'/var/lib/ceph/tmp/mnt.ECAifr', '--osd-journal', 
'/var/lib/ceph/tmp/mnt.ECAifr/journal', '--osd-uuid', 
u'377c336b-278d-4caf-b2f5-592ac72cd9b6', '--keyring', 
'/var/lib/ceph/tmp/mnt.ECAifr/keyring', '--setuser', 'ceph', 
'--setgroup', 'ceph'] failed : 2017-03-16 11:30:05.238725 7f918fbc0a40 
-1 filestore(/var/lib/ceph/tmp/mnt.ECAifr) mkjournal error creating 
journal on /var/lib/ceph/tmp/mnt.ECAifr/journal: (13) Permission denied
2017-03-16 11:30:05.238756 7f918fbc0a40 -1 OSD::mkfs: ObjectStore::mkfs 
failed with error -13
2017-03-16 11:30:05.238833 7f918fbc0a40 -1  ** ERROR: error creating 
empty object store in /var/lib/ceph/tmp/mnt.ECAifr: (13) Permission denied



~ # blkid /dev/mapper/vg--*lv-*p* | grep 
'120c536d-cb30-4cea-b607-dd347022a497'
/dev/mapper/vg--ssd1-lv--ssd1p1: PARTLABEL="ceph journal" 
PARTUUID="120c536d-cb30-4cea-b607-dd347022a497"

~ # ls -al /dev/disk/by-id | grep dm-22
lrwxrwxrwx 1 root root   11 Mar 16 11:37 dm-name-vg--ssd1-lv--ssd1p1 -> 
../../dm-22
lrwxrwxrwx 1 root root   11 Mar 16 11:37 
dm-uuid-part1-LVM-n1SH1FvtfjgxJOMWN9aHurFvn2BpIsLZi89GWxA68hLmUQV6l5oyiEOPsFciRbKg 
-> ../../dm-22

~ # ls -al /dev/disk/by-parttypeuuid | grep dm-22
lrwxrwxrwx 1 root root  11 Mar 16 11:37 
45b0969e-9b03-4f30-b4c6-b4b80ceff106.120c536d-cb30-4cea-b607-dd347022a497 -> 
../../dm-22

~ # ls -al /dev/disk/by-uuid | grep dm-22
~ # ls -al /dev/disk/by-partuuid/ | grep dm-22
~ # ls -al /dev/disk/by-path | grep dm-22


Best Regards,
Nicholas Gim.

On Wed, Mar 15, 2017 at 6:46 PM Peter Maloney wrote:


On 03/15/17 08:43, Gunwoo Gim 

Re: [ceph-users] Missing udev rule for FC disks (Re: mkjournal error creating journal ... : (13) Permission denied)

2018-01-23 Thread Fulvio Galeazzi

Thanks a lot, Tom, glad this was already taken care of!
  Will keep the patch around until the official one somehow gets into 
my distribution.


  Ciao ciao

Fulvio

 Original Message 
Subject: Re: [ceph-users] Missing udev rule for FC disks (Re: mkjournal 
error creating journal ... : (13) Permission denied)

From: <tom.by...@stfc.ac.uk>
To: <fulvio.galea...@garr.it>, <ceph-users@lists.ceph.com>
Date: 1/22/2018 10:34 AM


I believe I've recently spent some time with this issue, so I hope this is 
helpful. Apologies if it's an unrelated dm/udev/ceph-disk problem.

https://lists.freedesktop.org/archives/systemd-devel/2017-July/039222.html

The above email from last July explains the situation somewhat, with the 
outcome (as I understand it) being future versions of lvm/dm will have rules to 
create the necessary partuuid symlinks for dm devices.

I'm unsure when that will make its way into various distribution lvm packages 
(I haven't checked up on this for a month or two actually). For now I've tested 
running with the new dm-disk.rules on the storage nodes that need it, which 
allowed ceph-disk to work as expected.

Cheers
Tom

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Fulvio 
Galeazzi
Sent: 19 January 2018 15:46
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Missing udev rule for FC disks (Re: mkjournal error 
creating journal ... : (13) Permission denied)

Hallo,
   apologies for reviving an old thread, but I just wasted again one full 
day as I had forgotten about this issue...

 To recap, udev rules nowadays do not (at least in my case, I am using 
disks served via FiberChannel) create the links /dev/disk/by-partuuid that 
ceph-disk expects.

I see the "culprit" is this line in (am on CentOS, but Ubuntu has the same 
issue): /usr/lib/udev/rules.d/60-persistent-storage.rules

.
# skip rules for inappropriate block devices 
KERNEL=="fd*|mtd*|nbd*|gnbd*|btibm*|dm-*|md*|zram*|mmcblk[0-9]*rpmb",
GOTO="persistent_storage_end"
.

stating that multipath'ed devices (called dm-*) should be skipped.


I can happily live with the file mentioned below, but was wondering:

- is there any hope that newer kernels may handle multipath devices
properly?

- as an alternative, could it be possible to update ceph-disk
such that symlinks for journal use some other
/dev/disk/by-?

 Thanks!

Fulvio







[ceph-users] Issue with fstrim and Nova hw_disk_discard=unmap

2018-03-12 Thread Fulvio Galeazzi

Hallo all,
I am not sure RBD discard is working in my setup, and I am asking 
for your help.

(I searched this mailing list for related messages and found one by
Nathan Harper last 29th Jan 2018 "Debugging fstrim issues" which
however mentions trimming was masked by logging... so I am not 100%
sure of what is the expected result)

I am on Ocata, and Ceph 10.2.10.
Followed the recipe: 
https://www.sebastien-han.fr/blog/2015/02/02/openstack-and-ceph-rbd-discard/

 * setup Nova adding to /etc/nova/nova.conf
...
[libvirt]
hw_disk_discard = unmap
...
 * decorated a CentOS image with 
hw_scsi_model=virtio--scsi,hw_disk_bus=scsi
   (the property-setting command is sketched after this list)

 * created a VM with a boot disk on Ceph (my default is ephemeral,
   though), and verified the XML shows my disk is scsi.
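
For completeness, the image decoration itself was done with something
along these lines (a sketch; the image name is a placeholder):

openstack image set \
    --property hw_scsi_model=virtio-scsi \
    --property hw_disk_bus=scsi \
    centos7-cloudimage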


I see that the commands:
rbd --cluster cephpa1 diff cinder-ceph/${theVol} | awk '{ SUM += $2 } 
END { print SUM/1024/1024 " MB" }' ; rados --cluster cephpa1 -p 
cinder-ceph ls | grep rbd_data.{whatever} | wc -l
show that the size increases, but it does not decrease when I delete 
the temporary file and execute

sudo fstrim -v /

  Am I missing something?

  I do see that adding/removing files created with dd does not always 
result in a global size increase, it is as if the dirty blocks are kept 
around and reused. Is this the way discard is supposed to work?


  Thanks for your help!

Fulvio





Re: [ceph-users] Issue with fstrim and Nova hw_disk_discard=unmap

2018-03-13 Thread Fulvio Galeazzi

Hallo Jason,
thanks for your feedback!

 Original Message 
>>  * decorated a CentOS image with hw_scsi_model=virtio--scsi,hw_disk_bus=scsi
>
> Is that just a typo for "hw_scsi_model"?

Yes, it was a typo when I wrote my message. The image has virtio-scsi as 
it should.



I see that commands:
rbd --cluster cephpa1 diff cinder-ceph/${theVol} | awk '{ SUM += $2 } END {
print SUM/1024/1024 " MB" }' ; rados --cluster cephpa1 -p cinder-ceph ls |
grep rbd_data.{whatever} | wc -l


That's pretty old-school -- you can just use "rbd du" now to calculate
the disk usage.


Good to know, thanks!


  show the size increases but does not decrease when I execute delete the
temporary file and execute
 sudo fstrim -v /


Have you verified that your VM is indeed using virtio-scsi? Does
blktrace show SCSI UNMAP operations being issued to the block device
when you execute "fstrim"?


Thanks for the tip, I think I need some more help, please.

Disk on my VM is indeed /dev/sda rather than /dev/vda. The XML shows 
(excerpt; most of the <disk> element got mangled in this mail):
.
  <source protocol='rbd' 
   name='cinder-ceph/volume-80838a69-e544-47eb-b981-a4786be89736'>
.
  <serial>80838a69-e544-47eb-b981-a4786be89736</serial>
.
  <address ... function='0x0'/>
.

As for blktrace, blkparse shows me tons of lines; please find below the 
first ones and one of the many groups of lines which I see:


  8,00   11 4.333917112 24677  Q FWFSM 8406583 + 4 [fstrim]
  8,00   12 4.333919649 24677  G FWFSM 8406583 + 4 [fstrim]
  8,00   13 4.333920695 24677  P   N [fstrim]
  8,00   14 4.333922965 24677  I FWFSM 8406583 + 4 [fstrim]
  8,00   15 4.333924575 24677  U   N [fstrim] 1
  8,00   20 4.340140041 24677  Q   D 986016 + 2097152 [fstrim]
  8,00   21 4.340144908 24677  G   D 986016 + 2097152 [fstrim]
  8,00   22 4.340145561 24677  P   N [fstrim]
  8,00   24 4.340147495 24677  Q   D 3083168 + 1112672 [fstrim]
  8,00   25 4.340149772 24677  G   D 3083168 + 1112672 [fstrim]
.
  8,00   50 4.340556955 24677  Q   D 665880 + 20008 [fstrim]
  8,00   51 4.340558481 24677  G   D 665880 + 20008 [fstrim]
  8,00   52 4.340558728 24677  P   N [fstrim]
  8,00   53 4.340559725 24677  I   D 665880 + 20008 [fstrim]
  8,00   54 4.340560292 24677  U   N [fstrim] 1
  8,00   55 4.340560801 24677  D   D 665880 + 20008 [fstrim]
.

Apologies for my ignorance, is the above enough to understand whether 
SCSI UNMAP operations are being issued?


  Thanks a lot!

Fulvio





Re: [ceph-users] Issue with fstrim and Nova hw_disk_discard=unmap

2018-03-13 Thread Fulvio Galeazzi

Hallo!


Discards appear like they are being sent to the device.  How big of a
temporary file did you create and then delete? Did you sync the file
to disk before deleting it? What version of qemu-kvm are you running?


I made several tests with commands like (issuing sync after each operation):

dd if=/dev/zero of=/tmp/fileTest bs=1M count=200 oflag=direct

What I see is that if I repeat the command with count<=200 the size does 
not increase.


Let's try now with count>200:

NAME                                          PROVISIONED   USED
volume-80838a69-e544-47eb-b981-a4786be89736        15360M  2284M

dd if=/dev/zero of=/tmp/fileTest bs=1M count=750 oflag=direct
dd if=/dev/zero of=/tmp/fileTest2 bs=1M count=750 oflag=direct
sync

NAME                                          PROVISIONED   USED
volume-80838a69-e544-47eb-b981-a4786be89736        15360M  2528M

rm /tmp/fileTest*
sync
sudo fstrim -v /
/: 14.1 GiB (15145271296 bytes) trimmed

NAME                                          PROVISIONED   USED
volume-80838a69-e544-47eb-b981-a4786be89736        15360M  2528M



As for qemu-kvm, the guest OS is CentOS7, with:

[centos@testcentos-deco3 tmp]$ rpm -qa | grep qemu
qemu-guest-agent-2.8.0-2.el7.x86_64

while the host is Ubuntu 16 with:

root@pa1-r2-s10:/home/ubuntu# dpkg -l | grep qemu
ii  qemu-block-extra:amd64   1:2.8+dfsg-3ubuntu2.9~cloud1 
   amd64extra block backend modules for qemu-system and 
qemu-utils
ii  qemu-kvm 1:2.8+dfsg-3ubuntu2.9~cloud1 
   amd64QEMU Full virtualization
ii  qemu-system-common   1:2.8+dfsg-3ubuntu2.9~cloud1 
   amd64QEMU full system emulation binaries (common files)
ii  qemu-system-x86  1:2.8+dfsg-3ubuntu2.9~cloud1 
   amd64QEMU full system emulation binaries (x86)
ii  qemu-utils   1:2.8+dfsg-3ubuntu2.9~cloud1 
   amd64QEMU utilities



  Thanks!

Fulvio





Re: [ceph-users] Issue with fstrim and Nova hw_disk_discard=unmap

2018-03-14 Thread Fulvio Galeazzi

Hallo Jason, sure here it is!

rbd --cluster cephpa1 -p cinder-ceph info 
volume-80838a69-e544-47eb-b981-a4786be89736

rbd image 'volume-80838a69-e544-47eb-b981-a4786be89736':
size 15360 MB in 3840 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.9e7ffe238e1f29
format: 2
features: layering, exclusive-lock, object-map, fast-diff, 
deep-flatten

flags:

  Thanks

Fulvio

 Original Message 
Subject: Re: [ceph-users] Issue with fstrim and Nova hw_disk_discard=unmap
From: Jason Dillaman <jdill...@redhat.com>
To: Fulvio Galeazzi <fulvio.galea...@garr.it>
CC: Ceph Users <ceph-users@lists.ceph.com>
Date: 03/13/2018 06:33 PM


Can you provide the output from "rbd info /volume-80838a69-e544-47eb-b981-a4786be89736"?

On Tue, Mar 13, 2018 at 12:30 PM, Fulvio Galeazzi
<fulvio.galea...@garr.it> wrote:

Hallo!


Discards appear like they are being sent to the device.  How big of a
temporary file did you create and then delete? Did you sync the file
to disk before deleting it? What version of qemu-kvm are you running?



I made several test with commands like (issuing sync after each operation):

dd if=/dev/zero of=/tmp/fileTest bs=1M count=200 oflag=direct

What I see is that if I repeat the command with count<=200 the size does not
increase.

Let's try now with count>200:

NAMEPROVISIONED  USED
volume-80838a69-e544-47eb-b981-a4786be89736  15360M 2284M

dd if=/dev/zero of=/tmp/fileTest bs=1M count=750 oflag=direct
dd if=/dev/zero of=/tmp/fileTest2 bs=1M count=750 oflag=direct
sync

NAMEPROVISIONED  USED
volume-80838a69-e544-47eb-b981-a4786be89736  15360M 2528M

rm /tmp/fileTest*
sync
sudo fstrim -v /
/: 14.1 GiB (15145271296 bytes) trimmed

NAMEPROVISIONED  USED
volume-80838a69-e544-47eb-b981-a4786be89736  15360M 2528M



As for qemu-kvm, the guest OS is CentOS7, with:

[centos@testcentos-deco3 tmp]$ rpm -qa | grep qemu
qemu-guest-agent-2.8.0-2.el7.x86_64

while the host is Ubuntu 16 with:

root@pa1-r2-s10:/home/ubuntu# dpkg -l | grep qemu
ii  qemu-block-extra:amd64   1:2.8+dfsg-3ubuntu2.9~cloud1
amd64extra block backend modules for qemu-system and qemu-utils
ii  qemu-kvm 1:2.8+dfsg-3ubuntu2.9~cloud1
amd64QEMU Full virtualization
ii  qemu-system-common   1:2.8+dfsg-3ubuntu2.9~cloud1
amd64QEMU full system emulation binaries (common files)
ii  qemu-system-x86  1:2.8+dfsg-3ubuntu2.9~cloud1
amd64QEMU full system emulation binaries (x86)
ii  qemu-utils   1:2.8+dfsg-3ubuntu2.9~cloud1
amd64QEMU utilities


   Thanks!

 Fulvio











Re: [ceph-users] Issue with fstrim and Nova hw_disk_discard=unmap

2018-03-15 Thread Fulvio Galeazzi

Hallo Jason, I am really thankful for your time!

  Changed the volume features:

rbd image 'volume-80838a69-e544-47eb-b981-a4786be89736':
.
features: layering, exclusive-lock, deep-flatten

I had to create several dummy files before seeing an increase with "rbd 
du": to me, this is some indication that dirty blocks are, at least, 
reused if not properly released.


  Then I did "rm * ; sync ; fstrim / ; sync" but the size did not go down.
  Is there a way to instruct Ceph to perform what is not currently 
happening automatically (namely, scan the object-map of a volume and 
force cleanup of released blocks)? Or is the problem exactly that such 
blocks are not seen by Ceph as reusable?


  By the way, I think I forgot to mention that underlying OSD disks are 
taken from a FibreChannel storage (DELL MD3860, which is not capable of 
presenting JBOD so I present single disks as RAID0) and XFS formatted.


  Thanks!

Fulvio

 Original Message 
Subject: Re: [ceph-users] Issue with fstrim and Nova hw_disk_discard=unmap
From: Jason Dillaman <jdill...@redhat.com>
To: Fulvio Galeazzi <fulvio.galea...@garr.it>
CC: Ceph Users <ceph-users@lists.ceph.com>
Date: 03/14/2018 02:10 PM


Hmm -- perhaps as an experiment, can you disable the object-map and
fast-diff features to see if they are incorrectly reporting the object
as in-use after a discard?

$ rbd --cluster cephpa1 -p cinder-ceph feature disable
volume-80838a69-e544-47eb-b981-a4786be89736 object-map,fast-diff

On Wed, Mar 14, 2018 at 3:29 AM, Fulvio Galeazzi
<fulvio.galea...@garr.it> wrote:

Hallo Jason, sure here it is!

rbd --cluster cephpa1 -p cinder-ceph info
volume-80838a69-e544-47eb-b981-a4786be89736
rbd image 'volume-80838a69-e544-47eb-b981-a4786be89736':
 size 15360 MB in 3840 objects
 order 22 (4096 kB objects)
 block_name_prefix: rbd_data.9e7ffe238e1f29
 format: 2
 features: layering, exclusive-lock, object-map, fast-diff,
deep-flatten
 flags:

   Thanks

 Fulvio


 Original Message 
Subject: Re: [ceph-users] Issue with fstrim and Nova hw_disk_discard=unmap
From: Jason Dillaman <jdill...@redhat.com>
To: Fulvio Galeazzi <fulvio.galea...@garr.it>
CC: Ceph Users <ceph-users@lists.ceph.com>
Date: 03/13/2018 06:33 PM


Can you provide the output from "rbd info /volume-80838a69-e544-47eb-b981-a4786be89736"?

On Tue, Mar 13, 2018 at 12:30 PM, Fulvio Galeazzi
<fulvio.galea...@garr.it> wrote:


Hallo!


Discards appear like they are being sent to the device.  How big of a
temporary file did you create and then delete? Did you sync the file
to disk before deleting it? What version of qemu-kvm are you running?




I made several test with commands like (issuing sync after each
operation):

dd if=/dev/zero of=/tmp/fileTest bs=1M count=200 oflag=direct

What I see is that if I repeat the command with count<=200 the size does
not
increase.

Let's try now with count>200:

NAMEPROVISIONED  USED
volume-80838a69-e544-47eb-b981-a4786be89736  15360M 2284M

dd if=/dev/zero of=/tmp/fileTest bs=1M count=750 oflag=direct
dd if=/dev/zero of=/tmp/fileTest2 bs=1M count=750 oflag=direct
sync

NAMEPROVISIONED  USED
volume-80838a69-e544-47eb-b981-a4786be89736  15360M 2528M

rm /tmp/fileTest*
sync
sudo fstrim -v /
/: 14.1 GiB (15145271296 bytes) trimmed

NAMEPROVISIONED  USED
volume-80838a69-e544-47eb-b981-a4786be89736  15360M 2528M



As for qemu-kvm, the guest OS is CentOS7, with:

[centos@testcentos-deco3 tmp]$ rpm -qa | grep qemu
qemu-guest-agent-2.8.0-2.el7.x86_64

while the host is Ubuntu 16 with:

root@pa1-r2-s10:/home/ubuntu# dpkg -l | grep qemu
ii  qemu-block-extra:amd64   1:2.8+dfsg-3ubuntu2.9~cloud1
amd64extra block backend modules for qemu-system and qemu-utils
ii  qemu-kvm 1:2.8+dfsg-3ubuntu2.9~cloud1
amd64QEMU Full virtualization
ii  qemu-system-common   1:2.8+dfsg-3ubuntu2.9~cloud1
amd64QEMU full system emulation binaries (common files)
ii  qemu-system-x86  1:2.8+dfsg-3ubuntu2.9~cloud1
amd64QEMU full system emulation binaries (x86)
ii  qemu-utils   1:2.8+dfsg-3ubuntu2.9~cloud1
amd64QEMU utilities


Thanks!

  Fulvio

















Re: [ceph-users] Issue with fstrim and Nova hw_disk_discard=unmap

2018-04-09 Thread Fulvio Galeazzi

Hallo Jason,
thanks again for your time, and apologies for the long silence, but I 
was busy upgrading to Luminous and converting Filestore->Bluestore.


  In the meantime, the staging cluster where I was making tests was 
upgraded both to Ceph Luminous and to OpenStack Pike: the good news 
is that fstrim now works as expected, so I think it's not worth it (and 
difficult/impossible) to investigate further.
I may post some more info once I have a maintenance window to upgrade 
the production cluster (I have to touch nova.conf, and I want to do that 
during a maintenance).


  By the way, I am unable to configure Ceph such that the admin socket 
is made available on a (pure) client node; I am going to open a separate 
thread for this.


  Thanks!

Fulvio

 Original Message 
Subject: Re: [ceph-users] Issue with fstrim and Nova hw_disk_discard=unmap
From: Jason Dillaman <jdill...@redhat.com>
To: Fulvio Galeazzi <fulvio.galea...@garr.it>
CC: Ceph Users <ceph-users@lists.ceph.com>
Date: 03/15/2018 01:35 PM


OK, last suggestion just to narrow the issue down: ensure you have a
functional admin socket and librbd log file as documented here [1].
With the VM running, before you execute "fstrim", run "ceph
--admin-daemon /path/to/the/asok/file conf set debug_rbd 20" on the
hypervisor host, execute "fstrim" within the VM, and then restore the
log settings via "ceph --admin-daemon /path/to/the/asok/file conf set
debug_rbd 0/5".  Grep the log file for "aio_discard" to verify if QEMU
is passing the discard down to librbd.


[1] http://docs.ceph.com/docs/master/rbd/rbd-openstack/

On Thu, Mar 15, 2018 at 6:53 AM, Fulvio Galeazzi
<fulvio.galea...@garr.it> wrote:

Hallo Jason, I am really thankful for your time!

   Changed the volume features:

rbd image 'volume-80838a69-e544-47eb-b981-a4786be89736':
.
 features: layering, exclusive-lock, deep-flatten

I had to create several dummy files before seeing and increase with "rbd
du": to me, this is sort of indication that dirty blocks are, at least,
reused if not properly released.

   Then I did "rm * ; sync ; fstrim / ; sync" but the size did not go down.
   Is there a way to instruct Ceph to perform what is not currently happening
automatically (namely, scan the object-map of a volume and force cleanup of
released blocks)? Or the problem is exactly that such blocks are not seen by
Ceph as reusable?

   By the way, I think I forgot to mention that underlying OSD disks are
taken from a FibreChannel storage (DELL MD3860, which is not capable of
presenting JBOD so I present single disks as RAID0) and XFS formatted.

   Thanks!

 Fulvio

 Original Message 
Subject: Re: [ceph-users] Issue with fstrim and Nova hw_disk_discard=unmap
From: Jason Dillaman <jdill...@redhat.com>
To: Fulvio Galeazzi <fulvio.galea...@garr.it>
CC: Ceph Users <ceph-users@lists.ceph.com>
Date: 03/14/2018 02:10 PM


Hmm -- perhaps as an experiment, can you disable the object-map and
fast-diff features to see if they are incorrectly reporting the object
as in-use after a discard?

$ rbd --cluster cephpa1 -p cinder-ceph feature disable
volume-80838a69-e544-47eb-b981-a4786be89736 object-map,fast-diff

On Wed, Mar 14, 2018 at 3:29 AM, Fulvio Galeazzi
<fulvio.galea...@garr.it> wrote:


Hallo Jason, sure here it is!

rbd --cluster cephpa1 -p cinder-ceph info
volume-80838a69-e544-47eb-b981-a4786be89736
rbd image 'volume-80838a69-e544-47eb-b981-a4786be89736':
  size 15360 MB in 3840 objects
  order 22 (4096 kB objects)
  block_name_prefix: rbd_data.9e7ffe238e1f29
  format: 2
  features: layering, exclusive-lock, object-map, fast-diff,
deep-flatten
  flags:

Thanks

  Fulvio


 Original Message 
Subject: Re: [ceph-users] Issue with fstrim and Nova
hw_disk_discard=unmap
From: Jason Dillaman <jdill...@redhat.com>
To: Fulvio Galeazzi <fulvio.galea...@garr.it>
CC: Ceph Users <ceph-users@lists.ceph.com>
Date: 03/13/2018 06:33 PM


Can you provide the output from "rbd info /volume-80838a69-e544-47eb-b981-a4786be89736"?

On Tue, Mar 13, 2018 at 12:30 PM, Fulvio Galeazzi
<fulvio.galea...@garr.it> wrote:



Hallo!


Discards appear like they are being sent to the device.  How big of a
temporary file did you create and then delete? Did you sync the file
to disk before deleting it? What version of qemu-kvm are you running?





I made several test with commands like (issuing sync after each
operation):

dd if=/dev/zero of=/tmp/fileTest bs=1M count=200 oflag=direct

What I see is that if I repeat the command with count<=200 the size
does
not
increase.

Let's try now with count>200:

NAMEPROVISIONED  USED
volume-80838a69-e544-47eb-b981-a4786b

[ceph-users] Admin socket on a pure client: is it possible?

2018-04-09 Thread Fulvio Galeazzi

Hallo,

  I am wondering whether I could have the admin socket functionality 
enabled on a server which is a pure Ceph client (no MDS/MON/OSD/whatever 
running on such server). Is this at all possible? How should ceph.conf 
be configured? Documentation pages led me to write something like this:


.
[client]
admin socket = /var/run/ceph/$cluster-guest.asok
log file = /var/log/ceph/client-guest.log
.
 but the .asok is absent. Please enlighten me as I must be missing 
something very basic.
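
For completeness, this is how I am testing (a sketch; the mkdir/chown
step, the client name "guest" and the keyring path are my assumptions,
and my understanding is that the socket only exists while a client
process is actually running):

# on the pure client node: the client must be able to create the socket here
mkdir -p /var/run/ceph && chown ceph:ceph /var/run/ceph

# keep a librados client busy for a while in one shell...
rados -p some-pool -n client.guest --keyring /etc/ceph/ceph.client.guest.keyring bench 30 write

# ...and, while it runs, query the socket from another shell
ceph --admin-daemon /var/run/ceph/ceph-guest.asok config show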


  The use-case would be to integrate with a piece of code (namely, a 
Juju charm) which assumes it can talk to the Ceph cluster via an 
admin socket: the problem is that such a Juju charm/bundle also assumes 
it manages its own Ceph cluster, whereas I'd like to have it interface 
to an independent, external, ceph-ansible-managed Ceph cluster.


  Would it suffice to install ceph-mgr on such client? But then, I 
don't want such ceph-mgr to form quorum with the "real" ceph-mgr(s) 
installed on my MON nodes.


  Do you think it's possible to achieve such a configuration?

  Thanks!

Fulvio





[ceph-users] SOLVED Re: Luminous "ceph-disk activate" issue

2018-03-16 Thread Fulvio Galeazzi

Hallo Paul, thanks for your tip, which guided me to success.
  I just needed to manually update via yum and restart the services: 
MONs first, then OSDs. I am happily running Luminous now, and verified 
that ceph-ansible can add new disks.


  Thanks

Fulvio

 Original Message 
Subject: Re: [ceph-users] Luminous "ceph-disk activate" issue
From: Fulvio Galeazzi <fulvio.galea...@garr.it>
To: Paul Emmerich <paul.emmer...@croit.io>
CC: Ceph Users <ceph-users@lists.ceph.com>
Date: 03/16/2018 04:58 PM


Hallo Paul,
     You're correct of course, thanks!

   Ok tried to upgrade one MON (out of 3) to Luminous by:
  - removing the MON from the cluster
  - wiping ceph-common
  - running ceph-ansible with "ceph_stable_release: luminous"

but I am now stuck at "[ceph-mon : collect admin and bootstrap keys]". 
If I execute the command in the machine I am installing I see "machine 
is not in quorum: probing".


   Am a bit confused now: should I upgrade all 3 monitors at once? What 
if anything goes wrong during the upgrade? Or should I do a manual 
upgrade rather than using ceph-ansible?


   Thanks for your time and help!

     Fulvio

 Original Message 
Subject: Re: [ceph-users] Luminous "ceph-disk activate" issue
From: Paul Emmerich <paul.emmer...@croit.io>
To: Fulvio Galeazzi <fulvio.galea...@garr.it>
CC: Ceph Users <ceph-users@lists.ceph.com>
Date: 03/16/2018 03:23 PM


Hi,

2018-03-16 15:18 GMT+01:00 Fulvio Galeazzi <fulvio.galea...@garr.it 
<mailto:fulvio.galea...@garr.it>>:


    Hallo,
     I am on Jewel 10.2.10 and willing to upgrade to Luminous. I
    thought I'd proceed same as for the upgrade to Jewel, by running
    ceph-ansible on OSD nodes one by one, then on MON nodes one by one.
         ---> Is this a sensible way to upgrade to Luminous?


no, that's the wrong order. See the Luminous release notes for the 
upgrade instructions. You'll need to start with the mons, otherwise 
the OSDs won't be able to start.



Paul


--
Paul Emmerich

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io <http://www.croit.io>
Tel: +49 89 1896585 90








[ceph-users] Luminous "ceph-disk activate" issue

2018-03-16 Thread Fulvio Galeazzi

Hallo,
I am on Jewel 10.2.10 and willing to upgrade to Luminous. I thought 
I'd proceed same as for the upgrade to Jewel, by running ceph-ansible on 
OSD nodes one by one, then on MON nodes one by one.

---> Is this a sensible way to upgrade to Luminous?

  Problem: on the first OSD node I see that "ceph-disk activate" fails 
as shown at the end of this message.


Note that I am using a slightly modified version of ceph-ansible, which is 
capable of handling my Fibre Channel devices: I just aligned it to the 
official ceph-ansible. My changes (https://github.com/fgal/ceph-ansible.git) 
merely create a "devices" list, and as long as I set
  ceph_stable_release: jewel
ceph-ansible works OK, so this should exclude both the 
/dev/disk/by-part* stuff and my changes as culprits.
When I change it to "luminous" I see the problem. I guess the behaviour 
of ceph-disk has changed in the meantime... I also tried to go back to 
12.2.1, the last release before ceph-disk was superseded by ceph-volume, 
and observed the same problem.

Looks to me that the problematic line could be (notice the '-' after -i):
ceph --cluster ceph --name client.bootstrap-osd --keyring 
/var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 
2ceb1f9f-5cf8-46fc-bf8c-2a905e5238b6


  Anyone has any idea as to what could be the problem?
  Thanks for your help!

Fulvio


[root@r3srv05.pa1 ~]# ceph-disk -v activate /dev/mapper/3600a0980005da3a2136058a22992p1
main_activate: path = /dev/mapper/3600a0980005da3a2136058a22992p1
get_dm_uuid: get_dm_uuid /dev/mapper/3600a0980005da3a2136058a22992p1 uuid path is /sys/dev/block/253:25/dm/uuid
get_dm_uuid: get_dm_uuid /dev/mapper/3600a0980005da3a2136058a22992p1 uuid is part1-mpath-3600a0980005da3a2136058a22992
get_dm_uuid: get_dm_uuid /dev/mapper/3600a0980005da3a2136058a22992p1 uuid path is /sys/dev/block/253:25/dm/uuid
get_dm_uuid: get_dm_uuid /dev/mapper/3600a0980005da3a2136058a22992p1 uuid is part1-mpath-3600a0980005da3a2136058a22992
command: Running command: /usr/sbin/blkid -o udev -p /dev/mapper/3600a0980005da3a2136058a22992p1
get_dm_uuid: get_dm_uuid /dev/mapper/3600a0980005da3a2136058a22992p1 uuid path is /sys/dev/block/253:25/dm/uuid
get_dm_uuid: get_dm_uuid /dev/mapper/3600a0980005da3a2136058a22992p1 uuid is part1-mpath-3600a0980005da3a2136058a22992
command: Running command: /sbin/blkid -p -s TYPE -o value -- /dev/mapper/3600a0980005da3a2136058a22992p1
command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mount_options_xfs
mount: Mounting /dev/mapper/3600a0980005da3a2136058a22992p1 on /var/lib/ceph/tmp/mnt.aCTRx9 with options noatime,nodiratime,largeio,inode64,swalloc,logbsize=256k,allocsize=4M
command_check_call: Running command: /usr/bin/mount -t xfs -o noatime,nodiratime,largeio,inode64,swalloc,logbsize=256k,allocsize=4M -- /dev/mapper/3600a0980005da3a2136058a22992p1 /var/lib/ceph/tmp/mnt.aCTRx9

command: Running command: /usr/sbin/restorecon /var/lib/ceph/tmp/mnt.aCTRx9
activate: Cluster uuid is 9a9eedd0-9400-488e-96de-c349fffad7c4
command: Running command: /usr/bin/ceph-osd --cluster=ceph 
--show-config-value=fsid

activate: Cluster name is ceph
activate: OSD uuid is 2ceb1f9f-5cf8-46fc-bf8c-2a905e5238b6
allocate_osd_id: Allocating OSD id...
command: Running command: /usr/bin/ceph-authtool --gen-print-key
__init__: stderr
command_with_stdin: Running command with stdin: ceph --cluster ceph 
--name client.bootstrap-osd --keyring 
/var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 
2ceb1f9f-5cf8-46fc-bf8c-2a905e5238b6

command_with_stdin:
command_with_stdin: no valid command found; 10 closest matches:
osd setmaxosd 
osd pause
osd crush rule rm 
osd crush tree
osd crush rule create-simple{firstn|indep}
osd crush rule create-erasure  {}
osd crush get-tunable straw_calc_version
osd crush show-tunables
osd crush tunables 
legacy|argonaut|bobtail|firefly|hammer|jewel|optimal|default

osd crush set-tunable straw_calc_version 
Error EINVAL: invalid command

mount_activate: Failed to activate
unmount: Unmounting /var/lib/ceph/tmp/mnt.aCTRx9
command_check_call: Running command: /bin/umount -- 
/var/lib/ceph/tmp/mnt.aCTRx9
'['ceph', '--cluster', 'ceph', '--name', 'client.bootstrap-osd', 
'--keyring', '/var/lib/ceph/bootstrap-osd/ceph.keyring', '-i', '-', 
'osd', 'new', u'2ceb1f9f-5cf8-46fc-bf8c-2a905e5238b6']' failed with 
status code 22






Re: [ceph-users] Luminous "ceph-disk activate" issue

2018-03-16 Thread Fulvio Galeazzi

Hallo Paul,
You're correct of course, thanks!

  Ok tried to upgrade one MON (out of 3) to Luminous by:
 - removing the MON from the cluster
 - wiping ceph-common
 - running ceph-ansible with "ceph_stable_release: luminous"

but I am now stuck at "[ceph-mon : collect admin and bootstrap keys]". 
If I execute the command in the machine I am installing I see "machine 
is not in quorum: probing".


  Am a bit confused now: should I upgrade all 3 monitors at once? What 
if anything goes wrong during the upgrade? Or should I do a manual 
upgrade rather than using ceph-ansible?


  Thanks for your time and help!

Fulvio

 Original Message 
Subject: Re: [ceph-users] Luminous "ceph-disk activate" issue
From: Paul Emmerich <paul.emmer...@croit.io>
To: Fulvio Galeazzi <fulvio.galea...@garr.it>
CC: Ceph Users <ceph-users@lists.ceph.com>
Date: 03/16/2018 03:23 PM


Hi,

2018-03-16 15:18 GMT+01:00 Fulvio Galeazzi <fulvio.galea...@garr.it 
<mailto:fulvio.galea...@garr.it>>:


Hallo,
     I am on Jewel 10.2.10 and willing to upgrade to Luminous. I
thought I'd proceed same as for the upgrade to Jewel, by running
ceph-ansible on OSD nodes one by one, then on MON nodes one by one.
         ---> Is this a sensible way to upgrade to Luminous?


no, that's the wrong order. See the Luminous release notes for the 
upgrade instructions. You'll need to start with the mons, otherwise the 
OSDs won't be able to start.



Paul


--
Paul Emmerich

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io <http://www.croit.io>
Tel: +49 89 1896585 90






[ceph-users] Luminous (12.2.8 on CentOS), recover or recreate incomplete PG

2018-12-18 Thread Fulvio Galeazzi

Hallo Cephers,
I am stuck with an incomplete PG and am seeking help.

  At some point I had a bad configuration for gnocchi which caused a 
flooding of tiny objects into the backend Ceph rados pool. While cleaning 
things up, the load on the OSD disks was such that 3 of them "committed 
suicide" and were marked down.
  Now that the situation is calm, I am left with one stubborn 
incomplete PG.


PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg incomplete
 pg 107.33 is incomplete, acting [41,22,156] (reducing pool 
gnocchi-ct1-cl1 min_size from 2 may help; search ceph.com/docs for 
'incomplete')

(by the way, reducing min_size did not help)

  I found this page and tried to follow the procedure outlined:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-July/019674.html

  On one of the 3 replicas, the "PG export" produced a decently 
sized file, but when I tried to import it on the acting OSD I got an error:


[root@r1srv07.ct1 ~]# ceph-objectstore-tool --data-path 
/var/lib/ceph/osd/ceph-41 --op import --file /tmp/recover.107.33 --force

pgid 107.33 already exists


Question now is: could anyone please suggest a recovery procedure? Note 
that for this specific case I would not mind wiping the PG.
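
For reference, my understanding of the procedure in that thread is 
roughly the following (a sketch; "ceph-NN" is a placeholder for the OSD 
holding a good copy, and I suppose the "pgid already exists" error means 
I skipped the removal step):

# all of this with the involved OSD daemons stopped

# on an OSD still holding a good copy of the PG:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NN \
    --pgid 107.33 --op export --file /tmp/recover.107.33

# on the acting primary (osd.41): remove the existing, incomplete copy first...
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-41 \
    --pgid 107.33 --op remove --force

# ...then import the exported copy and restart the OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-41 \
    --op import --file /tmp/recover.107.33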


  Thanks for your help!

Fulvio





Re: [ceph-users] Migrate/convert replicated pool to EC?

2019-01-10 Thread Fulvio Galeazzi


Hallo,
I have the same issue as mentioned here, namely 
converting/migrating a replicated pool to an EC-based one. I have ~20 TB 
so my problem is far easier, but I'd like to perform this operation 
without introducing any downtime (or possibly just a minimal one, to 
rename pools).

  I am using Luminous 12.2.8 on CentOS 7.5, currently.

  I am planning to use the procedure outlined in the article quoted 
below, integrated at point 2) with the trick described here (to force 
promotion of each object to the cache):

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-February/016109.html

  I don't mind some manual intervention nor the procedure taking long 
time: my only question is...

  Is the above procedure data-safe, in principle?

  Additional question: while doing the migration, people may create or 
remove objects, so I am wondering how can I make sure that the migration 
is complete? For sure, comparing numbers of objects in the old/new pools 
won't be the way, right?
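
The only naive check I can think of is diffing the object listings once 
writes are quiesced, something like (a sketch; pool names are 
placeholders, and it says nothing about objects modified in between):

rados -p old-pool ls | sort > /tmp/old.objects
rados -p new-pool ls | sort > /tmp/new.objects
diff /tmp/old.objects /tmp/new.objects && echo "same object names in both pools"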


  Thanks for your help

Fulvio


On 10/26/2018 03:37 PM, Matthew Vernon wrote:

Hi,

On 26/10/2018 12:38, Alexandru Cucu wrote:


Have a look at this article:
https://ceph.com/geen-categorie/ceph-pool-migration/


Thanks; that all looks pretty hairy especially for a large pool (ceph df
says 1353T / 428,547,935 objects)...

...so something a bit more controlled/gradual and less
manual-error-prone would make me happier!

Regards,

Matthew










Re: [ceph-users] Luminous (12.2.8 on CentOS), recover or recreate incomplete PG

2018-12-19 Thread Fulvio Galeazzi

Ciao Dan,
thanks a lot for your message!  :-)

  Indeed, the procedure you outlined did the trick and I am now back to 
healthy state.

--yes-i-really-really-love-ceph-parameter-names !!!

  Ciao ciao

Fulvio

 Original Message 
Subject: Re: [ceph-users] Luminous (12.2.8 on CentOS), recover or 
recreate incomplete PG

From: Dan van der Ster 
To: fulvio.galea...@garr.it
CC: ceph-users 
Date: 12/18/2018 11:38 AM


Hi Fulvio!

Are you able to query that pg -- which osd is it waiting for?

Also, since you're prepared for data loss anyway, you might have
success setting osd_find_best_info_ignore_history_les=true on the
relevant osds (set it in conf, restart those osds).

-- dan

On Tue, Dec 18, 2018 at 11:31 AM Fulvio Galeazzi
 wrote:


Hallo Cephers,
  I am stuck with an incomplete PG and am seeking help.

At some point I had a bad configuration for gnocchi which caused a
flooding of tiny objects to the backend Ceph rados pool. While cleaning
things up, the load on the OSD disks was such that 3 of them "commited
suicide" and were marked down.
Now that the situation is calm, I am left with one stubborn
incomplete PG.

PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg incomplete
   pg 107.33 is incomplete, acting [41,22,156] (reducing pool
gnocchi-ct1-cl1 min_size from 2 may help; search ceph.com/docs for
'incomplete')
 (by the way, reducing min_size did not help)

I found this page and tried to follow the procedure outlined:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-July/019674.html

On one of the 3 replicas, the "PG export" produced some decently
sized file, but when I tried to import it on the acting OSD I got error:

[root@r1srv07.ct1 ~]# ceph-objectstore-tool --data-path
/var/lib/ceph/osd/ceph-41 --op import --file /tmp/recover.107.33 --force
pgid 107.33 already exists


Questions now is: could anyone please suggest a recovery procedure? Note
that for this specific case I would not mind wiping the PG.

Thanks for your help!

 Fulvio







[ceph-users] What is the best way to "move" rgw.buckets.data pool to another cluster?

2019-06-28 Thread Fulvio Galeazzi

Hallo!
  Due to severe maintenance which is going to cause a prolonged 
shutdown, I need to move my RGW pools to a different cluster (and 
geographical site): my problem is with default.rgw.buckets.data pool, 
which is now 100 TB.


Moreover, I'd also like to take advantage of the move to convert from 
replicated to erasure-coded.
Initially I thought about rbd-mirror, but then realized it requires 
setting the journaling flag and I have 33M objects... (and also realized 
it's called RBD-mirror whereas I have an rgw pool).
"rados cppool" is going to be removed, if I understand it correctly? 
(apart from not being the right tool for my use-case)



What is the best strategy to copy (or rsync/mirror) an object-store pool 
to a different cluster?


  Thanks for your help!

Fulvio






Re: [ceph-users] What is the best way to "move" rgw.buckets.data pool to another cluster?

2019-06-28 Thread Fulvio Galeazzi
Hallo again, to reply to my own message... I guess the easiest will be 
to set up multisite replication.
So now I will fight a bit with this and get back to the list in case of 
troubles.
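
For the record, the rough sequence I expect to follow is something like 
the one below (only a sketch, not a tested recipe; realm/zonegroup/zone 
names, endpoints and the sync user are placeholders):

# on the existing (master) cluster
radosgw-admin realm create --rgw-realm=myrealm --default
radosgw-admin zonegroup create --rgw-zonegroup=myzg --endpoints=http://rgw-old:8080 --master --default
radosgw-admin zone create --rgw-zonegroup=myzg --rgw-zone=old-site --endpoints=http://rgw-old:8080 --master --default
radosgw-admin user create --uid=sync-user --display-name="Synchronization user" --system
radosgw-admin period update --commit

# on the new (secondary) cluster, using the system user's keys
radosgw-admin realm pull --url=http://rgw-old:8080 --access-key=<key> --secret=<secret>
radosgw-admin zone create --rgw-zonegroup=myzg --rgw-zone=new-site --endpoints=http://rgw-new:8080 \
    --access-key=<key> --secret=<secret>
radosgw-admin period update --commit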


  Sorry for the noise...

Fulvio

On 06/28/2019 10:36 AM, Fulvio Galeazzi wrote:

Hallo!
   Due to severe maintenance which is going to cause a prolonged 
shutdown, I need to move my RGW pools to a different cluster (and 
geographical site): my problem is with default.rgw.buckets.data pool, 
which is now 100 TB.


Moreover, I'd also like to take advantage of the move to convert from 
replicated to erasure-coded.
Initially I though about rbd-mirror, but then realized it requires 
setting the journaling flag and I have 33M objects... (and also realized 
it's called RBD-mirror whereas I have an rgw pool).
"rados cppool" is going to be removed, if I understand it correctly? 
(apart from not being the right tool for my use-case)



What is the best strategy to copy (or rsync/mirror) an object-store pool 
to a different cluster?


   Thanks for your help!

     Fulvio








