Re: [ceph-users] OSDs busy reading from Bluestore partition while bringing up nodes.

2019-01-11 Thread Paul Emmerich
This seems like a case of accumulating lots of new osd maps.

What might also help is setting the noup and nodown flags and waiting for
the OSDs to start up. Use the "status" daemon command to check the
current OSD state even if an OSD can't come up in the cluster map due to
noup (IIRC it also shows whether it's behind on osd maps).
Once they are all running you should be able to take them up again.
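
Roughly like this (osd.12 is just an example id; the flags and the admin
socket "status" command are the standard ones):

ceph osd set noup
ceph osd set nodown
# check an OSD directly, even while it is still marked down in the osdmap
ceph daemon osd.12 status    # shows state plus oldest_map / newest_map
# once every OSD reports a current newest_map, let them rejoin
ceph osd unset noup
ceph osd unset nodown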

This behavior got better with recent Mimic versions -- so I'd also
recommend upgrading, but only *after* everything is back to healthy.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Sat, Jan 12, 2019 at 3:56 AM Subhachandra Chandra wrote:
>
> Hi,
>
> We have a cluster with 9 hosts and 540 HDDs using Bluestore and 
> containerized OSDs running Luminous 12.2.4. While trying to add new nodes, 
> the cluster collapsed as it could not keep up with establishing enough TCP 
> connections. We fixed sysctl to be able to handle more connections and also 
> recycle TIME_WAIT (tw) sockets faster. Currently, as we are trying to restart 
> the cluster by bringing up a few OSDs at a time, some of the OSDs get very busy 
> after around 360 of them come up. iostat shows that the busy OSDs are constantly 
> reading from the Bluestore partition. The number of busy OSDs per node varies, 
> and norecover is set with no active clients. OSD logs don't show anything 
> other than cephx: verify_authorizer errors, which happen on both busy and idle 
> OSDs and don't seem to be related to the drive reads.
>
>   How can we figure out why the OSDs are busy reading from the drives? If it 
> is some kind of recovery, is there a way to track progress? Output of ceph -s 
> and logs from a busy and idle OSD are copied below.
>
> Thanks
> Chandra
>
> Uptime stats with load averages show variance across the 9 older nodes.
>
>  02:43:44 up 19:21,  0 users,  load average: 0.88, 1.03, 1.06
>
>  02:43:44 up  7:58,  0 users,  load average: 16.91, 13.49, 12.43
>
>  02:43:44 up 1 day, 14 min,  0 users,  load average: 7.67, 6.70, 6.35
>
>  02:43:45 up  7:01,  0 users,  load average: 84.40, 84.20, 83.73
>
>  02:43:45 up  6:40,  1 user,  load average: 17.08, 17.40, 20.05
>
>  02:43:45 up 19:46,  0 users,  load average: 15.58, 11.93, 11.44
>
>  02:43:45 up 20:39,  0 users,  load average: 7.88, 6.50, 5.69
>
>  02:43:46 up 1 day,  1:20,  0 users,  load average: 5.03, 3.81, 3.49
>
>  02:43:46 up 1 day, 58 min,  0 users,  load average: 0.62, 1.00, 1.38
>
>
> Ceph Config
>
> --
>
> [global]
>
> cluster network = 192.168.13.0/24
>
> fsid = <>
>
> mon host = 172.16.13.101,172.16.13.102,172.16.13.103
>
> mon initial members = ctrl1,ctrl2,ctrl3
>
> mon_max_pg_per_osd = 750
>
> mon_osd_backfillfull_ratio = 0.92
>
> mon_osd_down_out_interval = 900
>
> mon_osd_full_ratio = 0.95
>
> mon_osd_nearfull_ratio = 0.85
>
> osd_crush_chooseleaf_type = 3
>
> osd_heartbeat_grace = 900
>
> mon_osd_laggy_max_interval = 900
>
> osd_max_pg_per_osd_hard_ratio = 1.0
>
> public network = 172.16.13.0/24
>
>
> [mon]
>
> mon_compact_on_start = true
>
>
> [osd]
>
> osd_deep_scrub_interval = 2419200
>
> osd_deep_scrub_stride = 4194304
>
> osd_max_backfills = 10
>
> osd_max_object_size = 276824064
>
> osd_max_scrubs = 1
>
> osd_max_write_size = 264
>
> osd_pool_erasure_code_stripe_unit = 2097152
>
> osd_recovery_max_active = 10
>
> osd_heartbeat_interval = 15
>
>
> Data nodes Sysctl params
>
> -
>
> fs.aio-max-nr=1048576
>
> kernel.pid_max=4194303
>
> kernel.threads-max=2097152
>
> net.core.netdev_max_backlog=65536
>
> net.core.optmem_max=1048576
>
> net.core.rmem_max=8388608
>
> net.core.rmem_default=8388608
>
> net.core.somaxconn=2048
>
> net.core.wmem_max=8388608
>
> net.core.wmem_default=8388608
>
> vm.max_map_count=524288
>
> vm.min_free_kbytes=262144
>
>
> net.ipv4.tcp_tw_reuse=1
>
> net.ipv4.tcp_max_syn_backlog=16384
>
> net.ipv4.tcp_fin_timeout=10
>
> net.ipv4.tcp_slow_start_after_idle=0
>
>
>
> Ceph -s output
>
> ---
>
>
> root@ctrl1:/# ceph -s
>
>   cluster:
>
> id: 06126476-6deb-4baa-b7ca-50f5ccfacb68
>
> health: HEALTH_ERR
>
> noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub 
> flag(s) set
>
> 704 osds down
>
> 9 hosts (540 osds) down
>
> 71 nearfull osd(s)
>
> 2 pool(s) nearfull
>
> 780664/74163111 objects misplaced (1.053%)
>
> 7724/8242239 objects unfound (0.094%)
>
> 396 PGs pending on creation
>
> Reduced data availability: 32597 pgs inactive, 29764 pgs down, 
> 820 pgs peering, 74 pgs incomplete, 1 pg stale
>
> Degraded data redundancy: 679158/74163111 objects degraded 
> (0.916%), 1250 pgs degraded, 1106 pgs undersized
>
> 33 slow requests are blocked > 32 sec
>
> 9 stuck requests are blocked > 4096 sec
>
> mons ctrl1,ctrl2,ctrl3 are using a lot of disk space [...]

[ceph-users] Boot volume on OSD device

2019-01-11 Thread Brian Topping
Question about OSD sizes: I have two cluster nodes, each with 4x 800GiB SLC SSDs 
using BlueStore. They boot from SATADOM, so the OSDs are data-only, but the MLC 
SATADOMs have terrible reliability and the SLC ones are way overpriced for this 
application.

Can I carve off 64GiB from one of the four drives on a node without causing 
problems? If I understand the strategy properly, this will cause mild extra 
load on the other three drives as the weight goes down on the partitioned 
drive, but it probably won’t be a big deal.

Assuming the correct procedure is the one documented at 
http://docs.ceph.com/docs/mimic/rados/operations/add-or-rm-osds/, I would first 
remove the OSD as documented, zap it, carve off the partition on the freed 
drive, then add the remaining space back in.
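
Roughly, something like this is what I have in mind (a sketch only; the device
path and OSD id are placeholders, and I'm assuming ceph-volume is in use):

ceph osd out 3
# wait for the cluster to return to active+clean, then:
systemctl stop ceph-osd@3
ceph osd purge 3 --yes-i-really-mean-it
ceph-volume lvm zap /dev/sdd --destroy
# carve off the 64GiB boot partition and give the rest back to Ceph
sgdisk -n 1:0:+64G -n 2:0:0 /dev/sdd
ceph-volume lvm create --bluestore --data /dev/sdd2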

I’m a little nervous that BlueStore assumes it owns the partition table and 
will not be happy that a couple of primary partitions have been used. Will this 
be a problem?

Thanks, Brian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Offsite replication scenario

2019-01-11 Thread Brian Topping
Hi all,

I have a simple two-node Ceph cluster that I’m comfortable with the care and 
feeding of. Both nodes are in a single rack and captured in the attached dump; 
there is only one mon and all pools are size 2. Due to physical limitations, 
the primary location can’t grow past two nodes at the present time. As far as 
hardware goes, the two nodes are 18-core Xeons with 128GB RAM, connected with 
10GbE. 

My next goal is to add an offsite replica, and I would like to validate the plan 
I have in mind. For its part, the offsite replica can be considered read-only 
except for the occasional snapshot in order to run backups to tape. The offsite 
location is connected with a reliable and secured ~350Kbps WAN link. 

The following presuppositions bear challenge:

* There is only a single mon at the present time, which could be expanded to 
three with the offsite location. Two mons at the primary location alone would 
obviously mean a lower MTBF than one, but with a third one on the other side of 
the WAN I could create resiliency against *either* a WAN failure or a single-node 
maintenance event. 
* Because there are two mons at the primary location and one at the offsite, 
the degradation mode for a WAN loss (most likely scenario due to facility 
support) leaves the primary nodes maintaining the quorum, which is desirable. 
* It’s clear that a WAN failure and a mon failure at the primary location will 
halt cluster access.
* The CRUSH maps will be managed to reflect the topology change.

If that’s a good capture so far, I’m comfortable with it. What I don’t 
understand is what to expect in actual use:

* Is the link speed asymmetry between the two primary nodes and the offsite 
node going to create significant risk or unexpected behaviors?
* Will the performance of the two primary nodes be limited by the speed at which 
the offsite mon can participate? Or will the primary mons correctly calculate 
that they have quorum and keep moving forward under normal operation?
* In the case of an extended WAN outage (and presuming full uptime on the 
primary-site mons), would a return to full cluster health be simply a matter of 
time? Are there any limits on how long the WAN could be down if the other two 
maintain quorum?

I hope I’m asking the right questions here. Any feedback appreciated, including 
blogs and RTFM pointers.


Thanks for a great product!! I’m really excited for this next frontier!

Brian

> [root@gw01 ~]# ceph -s
>  cluster:
>id: 
>health: HEALTH_OK
> 
>  services:
>mon: 1 daemons, quorum gw01
>mgr: gw01(active)
>mds: cephfs-1/1/1 up  {0=gw01=up:active}
>osd: 8 osds: 8 up, 8 in
> 
>  data:
>pools:   3 pools, 380 pgs
>objects: 172.9 k objects, 11 GiB
>usage:   30 GiB used, 5.8 TiB / 5.8 TiB avail
>pgs: 380 active+clean
> 
>  io:
>client:   612 KiB/s wr, 0 op/s rd, 50 op/s wr
> 
> [root@gw01 ~]# ceph df
> GLOBAL:
>SIZEAVAIL   RAW USED %RAW USED 
>5.8 TiB 5.8 TiB   30 GiB  0.51 
> POOLS:
>NAMEID USED%USED MAX AVAIL OBJECTS 
>cephfs_metadata 2  264 MiB 0   2.7 TiB1085 
>cephfs_data 3  8.3 GiB  0.29   2.7 TiB  171283 
>rbd 4  2.0 GiB  0.07   2.7 TiB 542 
> [root@gw01 ~]# ceph osd tree
> ID CLASS WEIGHT  TYPE NAME STATUS REWEIGHT PRI-AFF 
> -1   5.82153 root default  
> -3   2.91077 host gw01 
> 0   ssd 0.72769 osd.0 up  1.0 1.0 
> 2   ssd 0.72769 osd.2 up  1.0 1.0 
> 4   ssd 0.72769 osd.4 up  1.0 1.0 
> 6   ssd 0.72769 osd.6 up  1.0 1.0 
> -5   2.91077 host gw02 
> 1   ssd 0.72769 osd.1 up  1.0 1.0 
> 3   ssd 0.72769 osd.3 up  1.0 1.0 
> 5   ssd 0.72769 osd.5 up  1.0 1.0 
> 7   ssd 0.72769 osd.7 up  1.0 1.0 
> [root@gw01 ~]# ceph osd df
> ID CLASS WEIGHT  REWEIGHT SIZEUSE AVAIL   %USE VAR  PGS 
> 0   ssd 0.72769  1.0 745 GiB 4.9 GiB 740 GiB 0.66 1.29 115 
> 2   ssd 0.72769  1.0 745 GiB 3.1 GiB 742 GiB 0.42 0.82  83 
> 4   ssd 0.72769  1.0 745 GiB 3.6 GiB 742 GiB 0.49 0.96  90 
> 6   ssd 0.72769  1.0 745 GiB 3.5 GiB 742 GiB 0.47 0.93  92 
> 1   ssd 0.72769  1.0 745 GiB 3.4 GiB 742 GiB 0.46 0.90  76 
> 3   ssd 0.72769  1.0 745 GiB 3.9 GiB 741 GiB 0.52 1.02 102 
> 5   ssd 0.72769  1.0 745 GiB 3.9 GiB 741 GiB 0.52 1.02  98 
> 7   ssd 0.72769  1.0 745 GiB 4.0 GiB 741 GiB 0.54 1.06 104 
>TOTAL 5.8 TiB  30 GiB 5.8 TiB 0.51  
> MIN/MAX VAR: 0.82/1.29  STDDEV: 0.07
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSDs busy reading from Bluestore partition while bringing up nodes.

2019-01-11 Thread Subhachandra Chandra
Hi,

We have a cluster with 9 hosts and 540 HDDs using Bluestore and
containerized OSDs running Luminous 12.2.4. While trying to add new nodes,
the cluster collapsed as it could not keep up with establishing
enough TCP connections. We fixed sysctl to be able to handle more
connections and also recycle TIME_WAIT (tw) sockets faster. Currently, as we
are trying to restart the cluster by bringing up a few OSDs at a time, some of
the OSDs get very busy after around 360 of them come up. iostat shows that the
busy OSDs are constantly reading from the Bluestore partition. The number
of busy OSDs per node varies, and norecover is set with no active clients. OSD
logs don't show anything other than cephx: verify_authorizer errors, which
happen on both busy and idle OSDs and don't seem to be related to the drive
reads.

  How can we figure out why the OSDs are busy reading from the drives? If
it is some kind of recovery, is there a way to track progress? Output of
ceph -s and logs from a busy and idle OSD are copied below.
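
For reference, these are the kinds of commands we can run for further diagnosis
(osd.100 is just an example id):

iostat -x 5                            # busy OSDs show constant reads on their data device
ceph daemon osd.100 status             # osdmap range the OSD holds (oldest_map / newest_map)
ceph daemon osd.100 perf dump          # internal bluestore/bluefs counters
ceph daemon osd.100 dump_historic_ops  # recent slow ops on that OSD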

Thanks
Chandra

Uptime stats with load averages show variance across the 9 older nodes.

 02:43:44 up 19:21,  0 users,  load average: 0.88, 1.03, 1.06

 02:43:44 up  7:58,  0 users,  load average: 16.91, 13.49, 12.43

 02:43:44 up 1 day, 14 min,  0 users,  load average: 7.67, 6.70, 6.35

 02:43:45 up  7:01,  0 users,  load average: 84.40, 84.20, 83.73

 02:43:45 up  6:40,  1 user,  load average: 17.08, 17.40, 20.05

 02:43:45 up 19:46,  0 users,  load average: 15.58, 11.93, 11.44

 02:43:45 up 20:39,  0 users,  load average: 7.88, 6.50, 5.69

 02:43:46 up 1 day,  1:20,  0 users,  load average: 5.03, 3.81, 3.49

 02:43:46 up 1 day, 58 min,  0 users,  load average: 0.62, 1.00, 1.38


Ceph Config

--

[global]

cluster network = 192.168.13.0/24

fsid = <>

mon host = 172.16.13.101,172.16.13.102,172.16.13.103

mon initial members = ctrl1,ctrl2,ctrl3

mon_max_pg_per_osd = 750

mon_osd_backfillfull_ratio = 0.92

mon_osd_down_out_interval = 900

mon_osd_full_ratio = 0.95

mon_osd_nearfull_ratio = 0.85

osd_crush_chooseleaf_type = 3

osd_heartbeat_grace = 900

mon_osd_laggy_max_interval = 900

osd_max_pg_per_osd_hard_ratio = 1.0

public network = 172.16.13.0/24


[mon]

mon_compact_on_start = true


[osd]

osd_deep_scrub_interval = 2419200

osd_deep_scrub_stride = 4194304

osd_max_backfills = 10

osd_max_object_size = 276824064

osd_max_scrubs = 1

osd_max_write_size = 264

osd_pool_erasure_code_stripe_unit = 2097152

osd_recovery_max_active = 10

osd_heartbeat_interval = 15


Data nodes Sysctl params

-

fs.aio-max-nr=1048576

kernel.pid_max=4194303

kernel.threads-max=2097152

net.core.netdev_max_backlog=65536

net.core.optmem_max=1048576

net.core.rmem_max=8388608

net.core.rmem_default=8388608

net.core.somaxconn=2048

net.core.wmem_max=8388608

net.core.wmem_default=8388608

vm.max_map_count=524288

vm.min_free_kbytes=262144


net.ipv4.tcp_tw_reuse=1

net.ipv4.tcp_max_syn_backlog=16384

net.ipv4.tcp_fin_timeout=10

net.ipv4.tcp_slow_start_after_idle=0
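

For completeness, this is roughly how the values above get applied and
persisted (the drop-in file name is arbitrary):

sysctl -w net.ipv4.tcp_max_syn_backlog=16384          # apply one value at runtime
echo 'net.ipv4.tcp_max_syn_backlog=16384' >> /etc/sysctl.d/90-ceph-tuning.conf
sysctl --system                                       # reload all sysctl drop-ins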



Ceph -s output

---

root@ctrl1:/# ceph -s

  cluster:

id: 06126476-6deb-4baa-b7ca-50f5ccfacb68

health: HEALTH_ERR

noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub
flag(s) set

704 osds down

9 hosts (540 osds) down

71 nearfull osd(s)

2 pool(s) nearfull

780664/74163111 objects misplaced (1.053%)

7724/8242239 objects unfound (0.094%)

396 PGs pending on creation

Reduced data availability: 32597 pgs inactive, 29764 pgs down,
820 pgs peering, 74 pgs incomplete, 1 pg stale

Degraded data redundancy: 679158/74163111 objects degraded
(0.916%), 1250 pgs degraded, 1106 pgs undersized

33 slow requests are blocked > 32 sec

9 stuck requests are blocked > 4096 sec

mons ctrl1,ctrl2,ctrl3 are using a lot of disk space



  services:

mon: 3 daemons, quorum ctrl1,ctrl2,ctrl3

mgr: ctrl1(active), standbys: ctrl2, ctrl3

osd: 1080 osds: 376 up, 1080 in; 1963 remapped pgs

 flags noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub



  data:

pools:   2 pools, 33280 pgs

objects: 8049k objects, 2073 TB

usage:   2277 TB used, 458 TB / 2736 TB avail

pgs: 3.585% pgs unknown

 94.363% pgs not active

 679158/74163111 objects degraded (0.916%)

 780664/74163111 objects misplaced (1.053%)

 7724/8242239 objects unfound (0.094%)

 29754 down

 1193  unknown

 535   peering

 496   activating+undersized+degraded+remapped

 284   remapped+peering

 258   active+undersized+degraded+remapped

 161   activating+degraded+remapped

 143   active+recovering+undersized+degraded+remapped

 89active+undersized+degraded

 76

Re: [ceph-users] centos 7.6 kernel panic caused by osd

2019-01-11 Thread Brad Hubbard
On Fri, Jan 11, 2019 at 8:58 PM Rom Freiman  wrote:
>
> Same kernel :)

Not exactly the point I had in mind, but sure ;)

>
>
> On Fri, Jan 11, 2019, 12:49 Brad Hubbard  wrote:
>>
>> Haha, in the email thread he says CentOS but the bug is opened against RHEL 
>> :P
>>
>> Is it worth recommending a fix in skb_can_coalesce() upstream so other
>> modules don't hit this?
>>
>> On Fri, Jan 11, 2019 at 7:39 PM Ilya Dryomov  wrote:
>> >
>> > On Fri, Jan 11, 2019 at 1:38 AM Brad Hubbard  wrote:
>> > >
>> > > On Fri, Jan 11, 2019 at 9:57 AM Jason Dillaman  
>> > > wrote:
>> > > >
>> > > > I think Ilya recently looked into a bug that can occur when
>> > > > CONFIG_HARDENED_USERCOPY is enabled and the IO's TCP message goes
>> > > > through the loopback interface (i.e. co-located OSDs and krbd).
>> > > > Assuming that you have the same setup, you might be hitting the same
>> > > > bug.
>> > >
>> > > Thanks for that Jason, I wasn't aware of that bug. I'm interested to
>> > > see the details.
>> >
>> > Here is Rom's BZ, it has some details:
>> >
>> > https://bugzilla.redhat.com/show_bug.cgi?id=1665248
>> >
>> > Thanks,
>> >
>> > Ilya
>>
>>
>>
>> --
>> Cheers,
>> Brad
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Meetups

2019-01-11 Thread Jason Van der Schyff
Hi All,

We wanted to let everyone know about a couple of meetups that are happening in 
the near future relating to Ceph and it was suggested we send it out to the 
list.

First of all, in Dallas on January 15th with details here: 
https://www.meetup.com/Object-Storage-Craft-Beer-Dallas/events/257825771/ 


Minneapolis on January 16th with details here: 
https://www.meetup.com/CEPH-Users-Twin-Cities/events/257929703/ 


And finally for now, London on February 6th with details here: 
https://www.meetup.com/Object-Storage-Craft-Beer-London/events/257960715/ 


We hope to see some of you there!

Jason

Jason Van der Schyff
VP, Operations | Soft Iron 
+1 650 679 0234
ja...@softiron.com 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD Mirror Proxy Support?

2019-01-11 Thread Kenneth Van Alstyne
Hello all (and maybe this would be better suited for the ceph devel mailing 
list):
I’d like to use RBD mirroring between two sites (to each other), but I have the 
following limitations:
- The clusters use the same name (“ceph”)
- The clusters share IP address space on a private, non-routed storage network

There are management servers on each side that can talk to the respective 
storage networks, but the storage networks cannot talk directly to each other.  
I recall reading, some years back, of possibly adding support for an RBD mirror 
proxy, which would potentially solve my issues.  Has anything been done in this 
regard?  If not, is my best bet perhaps a tertiary cluster that both can reach 
and do one-way replication to?
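
For the identical cluster names at least, my understanding is that the peer's 
"cluster name" is really just the basename of a local config file, so something 
like the sketch below may work (file, pool, and user names are made up, and this 
does nothing for the routing problem):

# expose the remote cluster locally under the name "remote"
scp siteb:/etc/ceph/ceph.conf /etc/ceph/remote.conf
scp siteb:/etc/ceph/ceph.client.rbd-mirror-peer.keyring \
    /etc/ceph/remote.client.rbd-mirror-peer.keyring
# then register the peer against the local pool
rbd mirror pool peer add mypool client.rbd-mirror-peer@remote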

Thanks,

--
Kenneth Van Alstyne
Systems Architect
Knight Point Systems, LLC
Service-Disabled Veteran-Owned Business
1775 Wiehle Avenue Suite 101 | Reston, VA 20190
c: 228-547-8045 f: 571-266-3106
www.knightpoint.com
DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
GSA Schedule 70 SDVOSB: GS-35F-0646S
GSA MOBIS Schedule: GS-10F-0404Y
ISO 9001 / ISO 2 / ISO 27001 / CMMI Level 3

Notice: This e-mail message, including any attachments, is for the sole use of 
the intended recipient(s) and may contain confidential and privileged 
information. Any unauthorized review, copy, use, disclosure, or distribution is 
STRICTLY prohibited. If you are not the intended recipient, please contact the 
sender by reply e-mail and destroy all copies of the original message.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mgr fails to restart after upgrade to mimic

2019-01-11 Thread Randall Smith
I was going through the permissions on the various keys in the cluster and
I think the admin capabilities look a little weird (below). Could this be
causing the ceph-mgr problems when it starts?

[client.admin]
key = [redacted]
auid = 0
caps mds = "allow"
caps mgr = "allow *"
caps mon = "allow *"
caps osd = "allow *"
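
For reference, this is roughly how I have been inspecting things, and what a
reset would look like (the caps on the second line are my guess at sane
defaults, not something I have confirmed):

ceph auth get mgr.8      # the key the mgr (id 8) authenticates with
ceph auth caps client.admin mds 'allow *' mgr 'allow *' mon 'allow *' osd 'allow *'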


On Tue, Jan 8, 2019 at 10:39 AM Randall Smith  wrote:

> Thanks to everyone who has tried to help so far. I have filed a bug report
> on this issue at http://tracker.ceph.com/issues/37835. I hope we can get
> this fixed so I can finish this upgrade.
>
> On Fri, Jan 4, 2019 at 7:26 AM Randall Smith  wrote:
>
>> Greetings,
>>
>> I'm upgrading my cluster from luminous to mimic. I've upgraded my
>> monitors and am attempting to upgrade the mgrs. Unfortunately, after an
>> upgrade the mgr daemon exits immediately with error code 1.
>>
>> I've tried running ceph-mgr in debug mode to try to see what's happening
>> but the output (below) is a bit cryptic for me. It looks like
>> authentication might be failing but it was working prior to the upgrade.
>>
>> I do have "auth supported = cephx" in the global section of ceph.conf.
>>
>> What do I need to do to fix this?
>>
>> Thanks.
>>
>> /usr/bin/ceph-mgr -f --cluster ceph --id 8 --setuser ceph --setgroup ceph
>> -d --debug_ms 5
>>
>> 2019-01-04 07:01:38.457 7f808f83f700  2 Event(0x30c42c0 nevent=5000
>> time_id=1).set_owner idx=0 owner=140190140331776
>>
>> 2019-01-04 07:01:38.457 7f808f03e700  2 Event(0x30c4500 nevent=5000
>> time_id=1).set_owner idx=1 owner=140190131939072
>>
>> 2019-01-04 07:01:38.457 7f808e83d700  2 Event(0x30c4e00 nevent=5000
>> time_id=1).set_owner idx=2 owner=140190123546368
>>
>> 2019-01-04 07:01:38.457 7f809dd5b380  1  Processor -- start
>>
>>
>> 2019-01-04 07:01:38.477 7f809dd5b380  1 -- - start start
>>
>>
>> 2019-01-04 07:01:38.481 7f809dd5b380  1 -- - --> 192.168.253.147:6789/0
>> -- auth(proto 0 26 bytes epoch 0) v1 -- 0x32a6780 con 0
>>
>> 2019-01-04 07:01:38.481 7f809dd5b380  1 -- - --> 192.168.253.148:6789/0
>> -- auth(proto 0 26 bytes epoch 0) v1 -- 0x32a6a00 con 0
>> 2019-01-04 07:01:38.481 7f808e83d700  1 -- 192.168.253.148:0/1359135487
>> learned_addr learned my addr 192.168.253.148:0/1359135487
>> 2019-01-04 07:01:38.481 7f808e83d700  2 -- 192.168.253.148:0/1359135487
>> >> 192.168.253.148:6789/0 conn(0x332d500 :-1
>> s=STATE_CONNECTING_WAIT_ACK_SEQ pgs=0 cs=0 l=0)._process_connection got
>> newly_a$
>> ked_seq 0 vs out_seq 0
>> 2019-01-04 07:01:38.481 7f808f03e700  2 -- 192.168.253.148:0/1359135487
>> >> 192.168.253.147:6789/0 conn(0x332ce00 :-1
>> s=STATE_CONNECTING_WAIT_ACK_SEQ pgs=0 cs=0 l=0)._process_connection got
>> newly_a$
>> ked_seq 0 vs out_seq 0
>> 2019-01-04 07:01:38.481 7f808f03e700  5 -- 192.168.253.148:0/1359135487
>> >> 192.168.253.147:6789/0 conn(0x332ce00 :-1
>> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=74172 cs=1 l=1). rx mon.1
>> seq
>> 1 0x30c5440 mon_map magic: 0 v1
>> 2019-01-04 07:01:38.481 7f808e83d700  5 -- 192.168.253.148:0/1359135487
>> >> 192.168.253.148:6789/0 conn(0x332d500 :-1
>> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=74275 cs=1 l=1). rx mon.2
>> seq
>> 1 0x30c5680 mon_map magic: 0 v1
>> 2019-01-04 07:01:38.481 7f808f03e700  5 -- 192.168.253.148:0/1359135487
>> >> 192.168.253.147:6789/0 conn(0x332ce00 :-1
>> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=74172 cs=1 l=1). rx mon.1
>> seq
>> 2 0x32a6780 auth_reply(proto 2 0 (0) Success) v1
>> 2019-01-04 07:01:38.481 7f808e83d700  5 -- 192.168.253.148:0/1359135487
>> >> 192.168.253.148:6789/0 conn(0x332d500 :-1
>> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=74275 cs=1 l=1). rx mon.2
>> seq
>> 2 0x32a6a00 auth_reply(proto 2 0 (0) Success) v1
>> 2019-01-04 07:01:38.481 7f808e03c700  1 -- 192.168.253.148:0/1359135487
>> <== mon.1 192.168.253.147:6789/0 1  mon_map magic: 0 v1  370+0+0
>> (3034216899 0 0) 0x30c5440 con 0x332ce00
>> 2019-01-04 07:01:38.481 7f808e03c700  1 -- 192.168.253.148:0/1359135487
>> <== mon.2 192.168.253.148:6789/0 1  mon_map magic: 0 v1  370+0+0
>> (3034216899 0 0) 0x30c5680 con 0x332d500
>> 2019-01-04 07:01:38.481 7f808e03c700  1 -- 192.168.253.148:0/1359135487
>> <== mon.1 192.168.253.147:6789/0 2  auth_reply(proto 2 0 (0)
>> Success) v1  33+0+0 (3430158761 0 0) 0x32a6780 con 0x33$
>> ce00
>> 2019-01-04 07:01:38.481 7f808e03c700  1 -- 192.168.253.148:0/1359135487
>> --> 192.168.253.147:6789/0 -- auth(proto 2 2 bytes epoch 0) v1 --
>> 0x32a6f00 con 0
>> 2019-01-04 07:01:38.481 7f808e03c700  1 -- 192.168.253.148:0/1359135487
>> <== mon.2 192.168.253.148:6789/0 2  auth_reply(proto 2 0 (0)
>> Success) v1  33+0+0 (3242503871 0 0) 0x32a6a00 con 0x33$
>> d500
>> 2019-01-04 07:01:38.481 7f808e03c700  1 -- 192.168.253.148:0/1359135487
>> --> 192.168.253.148:6789/0 -- auth(proto 2 2 bytes epoch 0) v1 --
>> 0x32a6780 con 0
>> 2019-01-04 07:01:38.481 7f808f03e700  5 -- 192.168.253.148:0/1359135487
>> >> 

Re: [ceph-users] Problems enabling automatic balancer

2019-01-11 Thread Massimo Sgaravatto
I think I found the problem myself (for the time being I am debugging the
issue on a testbed):

[root@c-mon-01 ceph]# ceph osd crush weight-set create-compat
Error EPERM: crush map contains one or more bucket(s) that are not straw2

So I issued:

[root@c-mon-01 ceph]# ceph osd crush set-all-straw-buckets-to-straw2
Error EINVAL: new crush map requires client version hammer but
require_min_compat_client is firefly


So:

[root@c-mon-01 ceph]# ceph osd set-require-min-compat-client jewel
set require_min_compat_client to jewel


Now the set-all-straw-buckets-to-straw2 should work:

[root@c-mon-01 ceph]# ceph osd crush set-all-straw-buckets-to-straw2

Indeed.

And when I start the balancer:


[root@c-mon-01 ceph]# ceph balancer on

I no longer see the problem.
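
A few commands to double-check the result afterwards (nothing cluster-specific
in them):

ceph osd crush dump | grep '"alg"'      # buckets should all report straw2 now
ceph osd dump | grep min_compat_client  # should report jewel
ceph balancer status
ceph balancer eval                      # score of the current distribution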

On Fri, Jan 11, 2019 at 3:58 PM Massimo Sgaravatto <
massimo.sgarava...@gmail.com> wrote:

> I am trying to enable the automatic balancer in my Luminous ceph cluster,
> following the documentation at:
>
>
> http://docs.ceph.com/docs/luminous/mgr/balancer/
>
> [root@ceph-mon-01 ~]# ceph balancer status
> {
> "active": true,
> "plans": [],
> "mode": "crush-compat"
> }
>
> After having issued the command:
>
>
> [root@ceph-mon-01 ~]# ceph balancer on
>
>
> in the manager log file I see:
>
>
> 2019-01-11 15:50:43.087370 7f1afd496700  0 mgr[balancer] Error creating
> compat weight-set
>
>
>
> Any hints on how to debug this?
>
> Thanks, Massimo
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osdmaps not being cleaned up in 12.2.8

2019-01-11 Thread Bryan Stillwell
I've created the following bug report to address this issue:

http://tracker.ceph.com/issues/37875

Bryan

From: ceph-users  on behalf of Bryan 
Stillwell 
Date: Friday, January 11, 2019 at 8:59 AM
To: Dan van der Ster 
Cc: ceph-users 
Subject: Re: [ceph-users] osdmaps not being cleaned up in 12.2.8

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/013060.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osdmaps not being cleaned up in 12.2.8

2019-01-11 Thread Bryan Stillwell
That thread looks like the right one.

So far I haven't needed to restart the OSDs for the churn trick to work.  I 
bet you're right that something thinks it still needs one of the old osdmaps on 
your cluster.  Last night our cluster finished another round of expansions and 
we're seeing up to 49,272 osdmaps hanging around.  The churn trick seems to be 
working again too.
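
A quick way to see how wide the retained osdmap range is (assuming jq is
installed; osd.1754 is the same example OSD as below):

ceph report 2>/dev/null | jq '.osdmap_first_committed, .osdmap_last_committed'
ceph daemon osd.1754 status | jq '.oldest_map, .newest_map'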

Bryan

From: Dan van der Ster 
Date: Thursday, January 10, 2019 at 3:13 AM
To: Bryan Stillwell 
Cc: ceph-users 
Subject: Re: [ceph-users] osdmaps not being cleaned up in 12.2.8

Hi Bryan,

I think this is the old hammer thread you refer to:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/013060.html

We also have osdmaps accumulating on v12.2.8 -- ~12000 per osd at the moment.

I'm trying to churn the osdmaps like before, but our maps are not being trimmed.

Did you need to restart the OSDs before the churn trick would work?
If so, it seems that something is holding references to old maps, just
like that old hammer issue.

Cheers, Dan


On Tue, Jan 8, 2019 at 5:39 PM Bryan Stillwell 
mailto:bstillw...@godaddy.com>> wrote:

I was able to get the osdmaps to slowly trim (maybe 50 would trim with each 
change) by making small changes to the CRUSH map like this:



for i in {1..100}; do
  ceph osd crush reweight osd.1754 4.1
  sleep 5
  ceph osd crush reweight osd.1754 4
  sleep 5
done



I believe this was the solution Dan came across back in the hammer days.  It 
works, but not ideal for sure.  Across the cluster it freed up around 50TB of 
data!



Bryan



From: ceph-users 
mailto:ceph-users-boun...@lists.ceph.com>> 
on behalf of Bryan Stillwell 
mailto:bstillw...@godaddy.com>>
Date: Monday, January 7, 2019 at 2:40 PM
To: ceph-users mailto:ceph-users@lists.ceph.com>>
Subject: [ceph-users] osdmaps not being cleaned up in 12.2.8



I have a cluster with over 1900 OSDs running Luminous (12.2.8) that isn't 
cleaning up old osdmaps after doing an expansion.  This is even after the 
cluster became 100% active+clean:



# find /var/lib/ceph/osd/ceph-1754/current/meta -name 'osdmap*' | wc -l

46181



With the osdmaps being over 600KB in size this adds up:



# du -sh /var/lib/ceph/osd/ceph-1754/current/meta

31G    /var/lib/ceph/osd/ceph-1754/current/meta



I remember running into this during the hammer days:



http://tracker.ceph.com/issues/13990



Did something change recently that may have broken this fix?



Thanks,

Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Problems enabling automatic balancer

2019-01-11 Thread Massimo Sgaravatto
I am trying to enable the automatic balancer in my Luminous ceph cluster,
following the documentation at:


http://docs.ceph.com/docs/luminous/mgr/balancer/

[root@ceph-mon-01 ~]# ceph balancer status
{
"active": true,
"plans": [],
"mode": "crush-compat"
}

After having issued the command:


[root@ceph-mon-01 ~]# ceph balancer on


in the manager log file I see:


2019-01-11 15:50:43.087370 7f1afd496700  0 mgr[balancer] Error creating
compat weight-set



Any hints on how to debug this?

Thanks, Massimo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] centos 7.6 kernel panic caused by osd

2019-01-11 Thread Rom Freiman
Done.

On Fri, Jan 11, 2019 at 3:36 PM Ilya Dryomov  wrote:

> On Fri, Jan 11, 2019 at 11:58 AM Rom Freiman  wrote:
> >
> > Same kernel :)
>
> Rom, can you update your CentOS ticket with the link to the Ceph BZ?
>
> Thanks,
>
> Ilya
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] `ceph-bluestore-tool bluefs-bdev-expand` corrupts OSDs

2019-01-11 Thread Hector Martin

Hi Igor,

On 11/01/2019 20:16, Igor Fedotov wrote:

In short - we're planning to support main device expansion for Nautilus+
and to introduce better error handling for the case in Mimic and
Luminous. Nautilus PR has been merged, M & L PRs are pending review at
the moment:


Got it. No problem then, good to know it isn't *supposed* to work yet :-)

--
Hector Martin (hec...@marcansoft.com)
Public Key: https://marcan.st/marcan.asc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD mirroring feat not supported

2019-01-11 Thread Jason Dillaman
krbd doesn't yet support several RBD features, including journaling
[1]. The only current way to use object-map, fast-diff, deep-flatten,
and/or journaling features against a block device is to use "rbd
device map --device-type nbd " (or use a TCMU loopback
device to create an librbd-backed SCSI block device).
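
For example (pool/image names are placeholders; on Luminous the equivalent
commands are "rbd-nbd map/unmap"):

rbd device map --device-type nbd mypool/myimage     # attach via rbd-nbd instead of krbd
rbd device unmap --device-type nbd mypool/myimage   # detach again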

On Fri, Jan 11, 2019 at 1:20 AM Hammad Abdullah
 wrote:
>
> Hey guys,
>
> I'm trying to mount a ceph image with journaling, layering and exclusive-lock 
> features enabled (it is a mirror image) but I keep getting the error "feature 
> not supported". I upgraded the kernel from 4.4 to 4.18 but I still get the 
> same error message. Any Idea what the issue might be?
>
> screenshot attached.
>

[1] http://docs.ceph.com/docs/master/rbd/rbd-config-ref/#rbd-features

-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] centos 7.6 kernel panic caused by osd

2019-01-11 Thread Ilya Dryomov
On Fri, Jan 11, 2019 at 11:58 AM Rom Freiman  wrote:
>
> Same kernel :)

Rom, can you update your CentOS ticket with the link to the Ceph BZ?

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrate/convert replicated pool to EC?

2019-01-11 Thread Garr
Hello again, re-reading my message I realized I need to point out an 
important detail about my use-case.


  The pool I need to migrate is an object-storage one: it is the 
storage backend for an OpenStack Swift deployment.


  Do you think that, in this case, the procedure below would be the 
correct one to use?


  Thanks!

Fulvio

On 1/10/2019 3:16 PM, Fulvio Galeazzi wrote:


Hello,
     I have the same issue as mentioned here, namely 
converting/migrating a replicated pool to an EC-based one. I have ~20 TB 
so my problem is far easier, but I'd like to perform this operation 
without introducing any downtime (or possibly just a minimal one, to 
rename pools).

   I am using Luminous 12.2.8 on CentOS 7.5, currently.

   I am planning to use the procedure outlined in the article quoted 
below, integrated with the trick described here (to force promotion of 
each object to the cache) at point 2)

 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-February/016109.html

   I don't mind some manual intervention nor the procedure taking long 
time: my only question is...

   Is the above procedure data-safe, in principle?

   Additional question: while doing the migration, people may create or 
remove objects, so I am wondering how I can make sure that the migration 
is complete? For sure, comparing the numbers of objects in the old/new pools 
won't be the way, right?


   Thanks for your help

     Fulvio


On 10/26/2018 03:37 PM, Matthew Vernon wrote:

Hi,

On 26/10/2018 12:38, Alexandru Cucu wrote:

Have a look at this article:
https://ceph.com/geen-categorie/ceph-pool-migration/


Thanks; that all looks pretty hairy especially for a large pool (ceph df
says 1353T / 428,547,935 objects)...

...so something a bit more controlled/gradual and less
manual-error-prone would make me happier!

Regards,

Matthew



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com







___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] `ceph-bluestore-tool bluefs-bdev-expand` corrupts OSDs

2019-01-11 Thread Igor Fedotov

Hi Hector,

just realized that you're trying to expand the main (and exclusive) device, 
which isn't supported in Mimic.


Here is the bluestore tool's complaint while expanding (pretty confusing, and 
it does not prevent the partial expansion):


expanding dev 1 from 0x1df2eb0 to 0x3a38120
Can't find device path for dev 1


Actually this is covered by the following ticket and its backport 
descendants:


https://tracker.ceph.com/issues/37360

https://tracker.ceph.com/issues/37494

https://tracker.ceph.com/issues/37495


In short - we're planning to support main device expansion for Nautilus+ 
and to introduce better error handling for the case in Mimic and 
Luminous. Nautilus PR has been merged, M & L PRs are pending review at 
the moment:


https://github.com/ceph/ceph/pull/25348

https://github.com/ceph/ceph/pull/25384


Thanks,

Igor


On 1/11/2019 1:18 PM, Hector Martin wrote:

Sorry for the late reply,

Here's what I did this time around. osd.0 and osd.1 should be 
identical, except osd.0 was recreated (that's the first one that 
failed) and I'm trying to expand osd.1 from its original size.


# ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-0 | 
grep size

    "size": 4000780910592,
# ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1 | 
grep size

    "size": 3957831237632,
# blockdev --getsize64 /var/lib/ceph/osd/ceph-0/block
4000780910592
# blockdev --getsize64 /var/lib/ceph/osd/ceph-1/block
4000780910592

As you can see the osd.1 block device is already resized

# ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-0
inferring bluefs devices from bluestore path
 slot 1 /var/lib/ceph/osd/ceph-0/block
1 : size 0x3a38120 : own 0x[1bf1f40~2542a0]
# ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-1
inferring bluefs devices from bluestore path
 slot 1 /var/lib/ceph/osd/ceph-1/block
1 : size 0x3a38120 : own 0x[1ba5270~24dc40]

# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-1
inferring bluefs devices from bluestore path
 slot 1 /var/lib/ceph/osd/ceph-1/block
start:
1 : size 0x3a38120 : own 0x[1ba5270~24dc40]
expanding dev 1 from 0x1df2eb0 to 0x3a38120
Can't find device path for dev 1

Unfortunately I forgot to run this with debugging enabled.

This seems like it didn't touch the first 8K (label), so unfortunately 
I cannot undo it. I guess this information is stored elsewhere.


I did notice that the size label was not updated, so I updated it 
manually:


# ceph-bluestore-tool set-label-key --dev 
/var/lib/ceph/osd/ceph-1/block --key size --value 4000780910592


# ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1 | 
grep size

    "size": 4000780910592,

This is what bluefs-bdev-sizes says:

# ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-1
inferring bluefs devices from bluestore path
 slot 1 /var/lib/ceph/osd/ceph-1/block
1 : size 0x3a38120 : own 0x[1ba5270~1e92eb0]

fsck reported "fsck succcess". The log is massive, I can host it 
somewhere if needed.


Starting the OSD fails with:

# ceph-osd --id 1
2019-01-11 18:51:00.745 7f06a72c62c0 -1 Public network was set, but 
cluster network was not set
2019-01-11 18:51:00.745 7f06a72c62c0 -1 Using public network also 
for cluster network
starting osd.1 at - osd_data /var/lib/ceph/osd/ceph-1 
/var/lib/ceph/osd/ceph-1/journal
2019-01-11 18:51:08.902 7f06a72c62c0 -1 
bluestore(/var/lib/ceph/osd/ceph-1) _reconcile_bluefs_freespace bluefs 
extra 0x[1df2eb0~1c45270]
2019-01-11 18:51:09.301 7f06a72c62c0 -1 osd.1 0 OSD:init: unable to 
mount object store
2019-01-11 18:51:09.301 7f06a72c62c0 -1  ** ERROR: osd init failed: 
(5) Input/output error


That "bluefs extra" line seems to be the issue. From a full log:

2019-01-11 18:56:00.135 7fb74a8272c0 10 
bluestore(/var/lib/ceph/osd/ceph-1) _reconcile_bluefs_freespace

2019-01-11 18:56:00.135 7fb74a8272c0 10 bluefs get_block_extents bdev 1
2019-01-11 18:56:00.135 7fb74a8272c0 10 
bluestore(/var/lib/ceph/osd/ceph-1) _reconcile_bluefs_freespace bluefs 
says 0x[1ba5270~1e92eb0]
2019-01-11 18:56:00.135 7fb74a8272c0 10 
bluestore(/var/lib/ceph/osd/ceph-1) _reconcile_bluefs_freespace super 
says  0x[1ba5270~24dc40]
2019-01-11 18:56:00.135 7fb74a8272c0 -1 
bluestore(/var/lib/ceph/osd/ceph-1) _reconcile_bluefs_freespace bluefs 
extra 0x[1df2eb0~1c45270]
2019-01-11 18:56:00.135 7fb74a8272c0 10 
bluestore(/var/lib/ceph/osd/ceph-1) _flush_cache


And that is where the -EIO is coming from:
https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L5305 



So I guess there is an inconsistency between some metadata here?

On 27/12/2018 20:46, Igor Fedotov wrote:

Hector,

One more thing to mention - after expansion please run fsck using
ceph-bluestore-tool prior to running osd daemon and collect another log
using CEPH_ARGS variable.


Thanks,

Igor

On 12/27/2018 2:41 PM, Igor Fedotov wrote:


Re: [ceph-users] centos 7.6 kernel panic caused by osd

2019-01-11 Thread Rom Freiman
Same kernel :)


On Fri, Jan 11, 2019, 12:49 Brad Hubbard  wrote:

> Haha, in the email thread he says CentOS but the bug is opened against
> RHEL :P
>
> Is it worth recommending a fix in skb_can_coalesce() upstream so other
> modules don't hit this?
>
> On Fri, Jan 11, 2019 at 7:39 PM Ilya Dryomov  wrote:
> >
> > On Fri, Jan 11, 2019 at 1:38 AM Brad Hubbard 
> wrote:
> > >
> > > On Fri, Jan 11, 2019 at 9:57 AM Jason Dillaman 
> wrote:
> > > >
> > > > I think Ilya recently looked into a bug that can occur when
> > > > CONFIG_HARDENED_USERCOPY is enabled and the IO's TCP message goes
> > > > through the loopback interface (i.e. co-located OSDs and krbd).
> > > > Assuming that you have the same setup, you might be hitting the same
> > > > bug.
> > >
> > > Thanks for that Jason, I wasn't aware of that bug. I'm interested to
> > > see the details.
> >
> > Here is Rom's BZ, it has some details:
> >
> > https://bugzilla.redhat.com/show_bug.cgi?id=1665248
> >
> > Thanks,
> >
> > Ilya
>
>
>
> --
> Cheers,
> Brad
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] centos 7.6 kernel panic caused by osd

2019-01-11 Thread Brad Hubbard
Haha, in the email thread he says CentOS but the bug is opened against RHEL :P

Is it worth recommending a fix in skb_can_coalesce() upstream so other
modules don't hit this?

On Fri, Jan 11, 2019 at 7:39 PM Ilya Dryomov  wrote:
>
> On Fri, Jan 11, 2019 at 1:38 AM Brad Hubbard  wrote:
> >
> > On Fri, Jan 11, 2019 at 9:57 AM Jason Dillaman  wrote:
> > >
> > > I think Ilya recently looked into a bug that can occur when
> > > CONFIG_HARDENED_USERCOPY is enabled and the IO's TCP message goes
> > > through the loopback interface (i.e. co-located OSDs and krbd).
> > > Assuming that you have the same setup, you might be hitting the same
> > > bug.
> >
> > Thanks for that Jason, I wasn't aware of that bug. I'm interested to
> > see the details.
>
> Here is Rom's BZ, it has some details:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1665248
>
> Thanks,
>
> Ilya



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Encryption questions

2019-01-11 Thread Sergio A. de Carvalho Jr.
Thanks for the answers, guys!

Am I right to assume msgr2 (http://docs.ceph.com/docs/mimic/dev/msgr2/)
will provide encryption between Ceph daemons as well as between clients and
daemons?

Does anybody know if it will be available in Nautilus?
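
If it does land in that form, my understanding is that the on-wire encryption
would be controlled by a handful of new options, roughly like the sketch below
(option names are taken from the msgr2 work in progress, so treat them as
tentative):

[global]
ms_cluster_mode = secure     # daemon <-> daemon traffic
ms_service_mode = secure     # what daemons offer to clients
ms_client_mode = secure      # what clients request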


On Fri, Jan 11, 2019 at 8:10 AM Tobias Florek  wrote:

> Hi,
>
> as others pointed out, traffic in ceph is unencrypted (internal traffic
> as well as client traffic).  I usually advise to set up IPSec or
> nowadays wireguard connections between all hosts.  That takes care of
> any traffic going over the wire, including ceph.
>
> Cheers,
>  Tobias Florek
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] `ceph-bluestore-tool bluefs-bdev-expand` corrupts OSDs

2019-01-11 Thread Hector Martin

Sorry for the late reply,

Here's what I did this time around. osd.0 and osd.1 should be identical, 
except osd.0 was recreated (that's the first one that failed) and I'm 
trying to expand osd.1 from its original size.


# ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-0 | grep size
"size": 4000780910592,
# ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1 | grep size
"size": 3957831237632,
# blockdev --getsize64 /var/lib/ceph/osd/ceph-0/block
4000780910592
# blockdev --getsize64 /var/lib/ceph/osd/ceph-1/block
4000780910592

As you can see the osd.1 block device is already resized

# ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-0
inferring bluefs devices from bluestore path
 slot 1 /var/lib/ceph/osd/ceph-0/block
1 : size 0x3a38120 : own 0x[1bf1f40~2542a0]
# ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-1
inferring bluefs devices from bluestore path
 slot 1 /var/lib/ceph/osd/ceph-1/block
1 : size 0x3a38120 : own 0x[1ba5270~24dc40]

# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-1
inferring bluefs devices from bluestore path
 slot 1 /var/lib/ceph/osd/ceph-1/block
start:
1 : size 0x3a38120 : own 0x[1ba5270~24dc40]
expanding dev 1 from 0x1df2eb0 to 0x3a38120
Can't find device path for dev 1

Unfortunately I forgot to run this with debugging enabled.

This seems like it didn't touch the first 8K (label), so unfortunately I 
cannot undo it. I guess this information is stored elsewhere.


I did notice that the size label was not updated, so I updated it manually:

# ceph-bluestore-tool set-label-key --dev /var/lib/ceph/osd/ceph-1/block 
--key size --value 4000780910592


# ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1 | grep size
"size": 4000780910592,

This is what bluefs-bdev-sizes says:

# ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-1
inferring bluefs devices from bluestore path
 slot 1 /var/lib/ceph/osd/ceph-1/block
1 : size 0x3a38120 : own 0x[1ba5270~1e92eb0]

fsck reported "fsck succcess". The log is massive, I can host it 
somewhere if needed.


Starting the OSD fails with:

# ceph-osd --id 1
2019-01-11 18:51:00.745 7f06a72c62c0 -1 Public network was set, but 
cluster network was not set
2019-01-11 18:51:00.745 7f06a72c62c0 -1 Using public network also 
for cluster network
starting osd.1 at - osd_data /var/lib/ceph/osd/ceph-1 
/var/lib/ceph/osd/ceph-1/journal
2019-01-11 18:51:08.902 7f06a72c62c0 -1 
bluestore(/var/lib/ceph/osd/ceph-1) _reconcile_bluefs_freespace bluefs 
extra 0x[1df2eb0~1c45270]
2019-01-11 18:51:09.301 7f06a72c62c0 -1 osd.1 0 OSD:init: unable to 
mount object store
2019-01-11 18:51:09.301 7f06a72c62c0 -1  ** ERROR: osd init failed: (5) 
Input/output error


That "bluefs extra" line seems to be the issue. From a full log:

2019-01-11 18:56:00.135 7fb74a8272c0 10 
bluestore(/var/lib/ceph/osd/ceph-1) _reconcile_bluefs_freespace

2019-01-11 18:56:00.135 7fb74a8272c0 10 bluefs get_block_extents bdev 1
2019-01-11 18:56:00.135 7fb74a8272c0 10 
bluestore(/var/lib/ceph/osd/ceph-1) _reconcile_bluefs_freespace bluefs 
says 0x[1ba5270~1e92eb0]
2019-01-11 18:56:00.135 7fb74a8272c0 10 
bluestore(/var/lib/ceph/osd/ceph-1) _reconcile_bluefs_freespace super 
says  0x[1ba5270~24dc40]
2019-01-11 18:56:00.135 7fb74a8272c0 -1 
bluestore(/var/lib/ceph/osd/ceph-1) _reconcile_bluefs_freespace bluefs 
extra 0x[1df2eb0~1c45270]
2019-01-11 18:56:00.135 7fb74a8272c0 10 
bluestore(/var/lib/ceph/osd/ceph-1) _flush_cache


And that is where the -EIO is coming from:
https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L5305

So I guess there is an inconsistency between some metadata here?

On 27/12/2018 20:46, Igor Fedotov wrote:

Hector,

One more thing to mention - after expansion please run fsck using
ceph-bluestore-tool prior to running osd daemon and collect another log
using CEPH_ARGS variable.


Thanks,

Igor

On 12/27/2018 2:41 PM, Igor Fedotov wrote:

Hi Hector,

I've never tried bluefs-bdev-expand over encrypted volumes but it
works absolutely fine for me in other cases.

So it would be nice to troubleshoot this a bit.

Suggest to do the following:

1) Backup first 8K for all OSD.1 devices (block, db and wal) using dd.
This will probably allow to recover from the failed expansion and
repeat it multiple times.

2) Collect current volume sizes with bluefs-bdev-sizes command and
actual devices sizes using 'lsblk --bytes'.

3) Do lvm volume expansion and then collect dev sizes with 'lsblk
--bytes' once again

4) Invoke bluefs-bdev-expand for osd.1 with
CEPH_ARGS="--debug-bluestore 20 --debug-bluefs 20 --log-file
bluefs-bdev-expand.log"

Perhaps it makes sense to open a ticket at ceph bug tracker to proceed...
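
Put together as commands, that sequence is roughly the following (the VG/LV
names and the size are placeholders):

# 1) back up the first 8K (label area) of each osd.1 device
dd if=/var/lib/ceph/osd/ceph-1/block of=/root/osd1-block-first8k.bak bs=4096 count=2
# 2) record current sizes
ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-1
lsblk --bytes
# 3) expand the LV, then record the device sizes again
lvextend -L +40G /dev/vg_osd/osd1_block
lsblk --bytes
# 4) run the expansion with debug logging enabled
CEPH_ARGS="--debug-bluestore 20 --debug-bluefs 20 --log-file bluefs-bdev-expand.log" \
    ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-1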


Thanks,

Igor




On 12/27/2018 12:19 PM, Hector Martin wrote:

Hi list,

I'm slightly expanding the underlying LV for two OSDs and 

Re: [ceph-users] centos 7.6 kernel panic caused by osd

2019-01-11 Thread Ilya Dryomov
On Fri, Jan 11, 2019 at 1:38 AM Brad Hubbard  wrote:
>
> On Fri, Jan 11, 2019 at 9:57 AM Jason Dillaman  wrote:
> >
> > I think Ilya recently looked into a bug that can occur when
> > CONFIG_HARDENED_USERCOPY is enabled and the IO's TCP message goes
> > through the loopback interface (i.e. co-located OSDs and krbd).
> > Assuming that you have the same setup, you might be hitting the same
> > bug.
>
> Thanks for that Jason, I wasn't aware of that bug. I'm interested to
> see the details.

Here is Rom's BZ, it has some details:

https://bugzilla.redhat.com/show_bug.cgi?id=1665248

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com