Re: [ceph-users] Using Cephfs Snapshots in Luminous

2019-02-06 Thread Nicolas Huillard
On Monday, November 12, 2018 at 15:31 +0100, Marc Roos wrote:
> > > is anybody using cephfs with snapshots on luminous? Cephfs
> > > snapshots are declared stable in mimic, but I'd like to know
> > > about the risks of using them on luminous. Do I risk a complete
> > > cephfs failure or just some non-working snapshots? It is one
> > > namespace, one fs, one data and one metadata pool.
> > > 
> > 
> > For Luminous, snapshots in a single-MDS setup basically work.
> > But snapshots are completely broken in multi-MDS setups.
> > 
> 
> Single active MDS only, then? And hardlinks are not supported with
> snapshots?

What's the final feeling on snapshots?
* Luminous 12.2.10 on Debian stretch
* ceph-fuse clients
* 1 active MDS, some standbys
* single FS, single namespace, no hardlinks
* will probably create nested snapshots, i.e. /1/.snaps/first and
/1/2/3/.snaps/nested (see the sketch right after this list)
* will use the facility through VirtFS from within VMs, where ceph-fuse 
runs on the host server
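
For reference, the way I intend to drive those snapshots (a minimal
sketch; I assume the snapshot directory really shows up as ".snaps" as
above, the stock name being ".snap" unless the client snapdir option is
changed):

# a snapshot is just a mkdir inside the hidden snapshot directory
mkdir /1/.snaps/first
mkdir /1/2/3/.snaps/nested
# snapshots are listed and removed the same way
ls /1/.snaps/
rmdir /1/.snaps/first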

What's the risk of using this experimental feature (as stated in [1])?
* losing snapshots?
* losing the main/latest contents?
* losing some directory trees, or the entire filesystem?
* other?

TIA,

[1] http://docs.ceph.com/docs/luminous/cephfs/experimental-features/#snapshots

-- 
Nicolas Huillard


Re: [ceph-users] Packages for debian in Ceph repo

2018-11-06 Thread Nicolas Huillard
On Tuesday, October 30, 2018 at 18:14 +0100, Kevin Olbrich wrote:
> Proxmox has support for rbd as they ship additional packages as well
> as
> ceph via their own repo.
> 
> I ran your command and got this:
> 
> > qemu-img version 2.8.1(Debian 1:2.8+dfsg-6+deb9u4)
> > Copyright (c) 2003-2016 Fabrice Bellard and the QEMU Project
> > developers
> > Supported formats: blkdebug blkreplay blkverify bochs cloop dmg
> > file ftp
> > ftps gluster host_cdrom host_device http https iscsi iser luks nbd
> > nfs
> > null-aio null-co parallels qcow qcow2 qed quorum raw rbd
> > replication
> > sheepdog ssh vdi vhdx vmdk vpc vvfat
> 
> 
> It lists rbd but still fails with the exact same error.

I stumbled upon the exact same error, and since there was no answer
anywhere, I figured it was a very simple problem: don't forget to
install the qemu-block-extra package (Debian stretch) along with
qemu-utils, which contains the qemu-img command.
This command is actually compiled with rbd support (hence the output
above), but needs this extra package to pull in the actual support code
and dependencies...
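
In case it helps the next person hitting this, the sequence that worked
for me on stretch (a quick sketch; <pool>/<image> is obviously a
placeholder for one of your RBD images):

apt-get install qemu-utils qemu-block-extra
# sanity check: this should now talk to the cluster instead of erroring out
qemu-img info rbd:<pool>/<image>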

-- 
Nicolas Huillard


Re: [ceph-users] [slightly OT] XFS vs. BTRFS vs. others as root/usr/var/tmp filesystems ?

2018-09-24 Thread Nicolas Huillard
On Sunday, September 23, 2018 at 20:28 +0200, mj wrote:
> XFS has *always* treated us nicely, and we have been using it for a
> VERY 
> long time, ever since the pre-2000 suse 5.2 days on pretty much all
> our 
> machines.
> 
> We have seen only very few corruptions on xfs, and the few times we 
> tried btrfs, (almost) always 'something' happened. (same for the few 
> times we tried reiserfs, btw)
> 
> So, while my story may be very anecdotal (and you will probably find
> many others here claiming the opposite) our own conclusion is very 
> clear: we love xfs, and do not like btrfs very much.

Thanks for your anecdote ;-)
Could it be that I stack too many things (XFS on LVM on md-RAID on the
SSD's FTL)?

-- 
Nicolas Huillard


Re: [ceph-users] [slightly OT] XFS vs. BTRFS vs. others as root/usr/var/tmp filesystems ?

2018-09-24 Thread Nicolas Huillard
On Sunday, September 23, 2018 at 17:49 -0700, solarflow99 wrote:
> ya, sadly it looks like btrfs will never materialize as the next
> filesystem
> of the future.  Redhat as an example even dropped it from its future,
> as
> others probably will and have too.

Too bad, since this FS has a lot of very promising features. I view it
as the single-host Ceph-like FS, and do not see any equivalent (apart
from ZFS, which will also never be included in the kernel).

-- 
Nicolas Huillard


[ceph-users] [slightly OT] XFS vs. BTRFS vs. others as root/usr/var/tmp filesystems ?

2018-09-22 Thread Nicolas Huillard
Hi all,

I don't have a good track record with XFS since I got rid of ReiserFS a
long time ago. I decided XFS was a good idea on servers, while I tested
BTRFS on various less important devices.
So far, XFS betrayed me far more often (a few times) than BTRFS
(never).
Last time was yesterday, on a root filesystem, with "Block out of range:
block 0x17b9814b0, EOFS 0x12a000" followed by "I/O Error Detected.
Shutting down filesystem" (shutting down the root filesystem is pretty
hard on the system).

Some threads on this ML discuss a similar problem, related to
partitioning and logical sectors located just after the end of the
partition. The problem here does not seem to be the same, as the
requested block is very far out of bound (2 orders of magnitude too
far), and I use a recent Debian stock kernel with every security patch.
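
For what it's worth, the read-only checks I ran to rule out that
partition/size mismatch (device names taken from the lsblk output
below; xfs_repair -n was run from a rescue environment since the
filesystem must not be mounted for it):

xfs_info /                                        # geometry as XFS sees it
blockdev --getsz /dev/mapper/oxygene_system-root  # LV size in 512-byte sectors
xfs_repair -n /dev/mapper/oxygene_system-root     # check only, no repair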

My question is: should I trust XFS for small root filesystems (/,
/tmp, /var on LVM sitting within a smallish md-RAID1 partition), or is
BTRFS finally trustworthy enough for a general-purpose cluster (still
for root et al. filesystems), or do you just use the distro-recommended
setup (typically ext4 on plain disks)?

Debian stretch with 4.9.110-3+deb9u4 kernel.
Ceph 12.2.8 on bluestore (not related to the question).

Partial output of lsblk /dev/sdc /dev/nvme0n1:
NAME  MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sdc 8:32   0 447,1G  0 disk  
├─sdc1  8:33   0  55,9G  0 part  
│ └─md0 9:00  55,9G  0 raid1 
│   ├─oxygene_system-root 253:40   9,3G  0 lvm   /
│   ├─oxygene_system-tmp  253:50   9,3G  0 lvm   /tmp
│   └─oxygene_system-var  253:60   4,7G  0 lvm   /var
└─sdc2  8:34   0  29,8G  0 part  [SWAP]
nvme0n1   259:00   477G  0 disk  
├─nvme0n1p1   259:10  55,9G  0 part  
│ └─md0 9:00  55,9G  0 raid1 
│   ├─oxygene_system-root 253:40   9,3G  0 lvm   /
│   ├─oxygene_system-tmp  253:50   9,3G  0 lvm   /tmp
│   └─oxygene_system-var  253:60   4,7G  0 lvm   /var
├─nvme0n1p2   259:20  29,8G  0 part  [SWAP]

TIA!

-- 
Nicolas Huillard


Re: [ceph-users] No announce for 12.2.8 / available in repositories

2018-09-22 Thread Nicolas Huillard
On Sunday, September 2, 2018 at 11:31 +0200, Nicolas Huillard wrote:
> I just noticed that 12.2.8 was available in the repositories, without
> any announcement. Since upgrading to the unannounced 12.2.6 was a bad
> idea, I'll wait a bit anyway ;-)
> Where can I find info on this bugfix release?
> Nothing there: http://lists.ceph.com/pipermail/ceph-announce-ceph.com/

Just to report that my upgrade was easy and successful (except for a
totally unrelated crashed root XFS filesystem).

-- 
Nicolas Huillard


Re: [ceph-users] Remotely tell an OSD to stop ?

2018-09-21 Thread Nicolas Huillard
Thanks!
I was in the process of upgrading, so "noout" was already set, probably
preventing setting "noin".
I thus just ran "ceph osd set noup", then "ceph osd down ", which
stopped activity on the disks (probably not enough to cleanly flush
everything in Bluestore, but I decided to trust its inner workings).

I now have an unbootable XFS root filesystem, some OSDs out but
probably OK with their data, and 4× redundancy. I'll pause and think
about the next steps with no urgency ;-)
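
For the archives, the full sequence I ended up with (a sketch; <osd-id>
stands for each OSD number on the dead host, and this assumes the OSD
daemons cannot be reached via systemctl because / is gone):

ceph osd set noout       # was already set here because of the upgrade
ceph osd set noup        # keep the down OSDs from being marked up again
ceph osd down <osd-id>   # repeat for each OSD on that host
# ... reset the host, repair it, then:
ceph osd unset noup
ceph osd unset noout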

On Friday, September 21, 2018 at 11:09 +0200, Patrick Nawracay wrote:
> Hi,
> 
> you'll need to set `noup` to prevent OSDs from being started
> automatically. The `noin` flag prevents the cluster from setting the
> OSD `in` again after it has been set `out`.
> 
>     `ceph osd set noup` before `ceph osd down `
> 
>     `ceph osd set noin` before `ceph osd out `
> 
> Those global flags (they prevent all OSDs from being automatically
> set up/in again) can be disabled with unset.
> 
>     `ceph osd unset `
> 
> Please note that I'm not familiar with recovery of a Ceph cluster,
> I'm
> just trying to answer the question, but don't know if that's the best
> approach in this case.
> 
> Patrick
> 
> 
> On 21.09.2018 10:49, Nicolas Huillard wrote:
> > Hi all,
> > 
> > One of my servers crashed its root filesystem, i.e. the currently
> > open shell just says "command not found" for any basic command (ls,
> > df, mount, dmesg, etc.)
> > ACPI soft power-off won't work because it needs scripts on /...
> > 
> > Before I reset the hardware, I'd like to cleanly stop the OSDs on
> > this server (which still work because they do not need /).
> > I was able to move the MGR out of that server with "ceph mgr fail
> > [hostname]".
> > Is it possible to tell the OSD on that host to stop, from another
> > host?
> > I tried "ceph osd down [osdnumber]", but the OSD just got back "in"
> > immediately.
> > 
> > Ceph 12.2.7 on Debian
> > 
> > TIA,
> > 
> 
-- 
Nicolas Huillard
Associé fondateur - Directeur Technique - Dolomède

nhuill...@dolomede.fr
Fixe : +33 9 52 31 06 10
Mobile : +33 6 50 27 69 08
http://www.dolomede.fr/

https://www.observatoire-climat-energie.fr/
https://reseauactionclimat.org/planetman/
https://350.org/fr/
https://reporterre.net/


[ceph-users] Remotely tell an OSD to stop ?

2018-09-21 Thread Nicolas Huillard
Hi all,

One of my servers crashed its root filesystem, i.e. the currently open
shell just says "command not found" for any basic command (ls, df,
mount, dmesg, etc.).
ACPI soft power-off won't work because it needs scripts on /...

Before I reset the hardware, I'd like to cleanly stop the OSDs on this
server (which still work because they do not need /).
I was able to move the MGR out of that server with "ceph mgr fail
[hostname]".
Is it possible to tell the OSD on that host to stop, from another host?
I tried "ceph osd down [osdnumber]", but the OSD just got back "in"
immediately.

Ceph 12.2.7 on Debian

TIA,

-- 
Nicolas Huillard


[ceph-users] No announce for 12.2.8 / available in repositories

2018-09-02 Thread Nicolas Huillard
Hi all,

I just noticed that 12.2.8 was available in the repositories, without
any announcement. Since upgrading to the unannounced 12.2.6 was a bad
idea, I'll wait a bit anyway ;-)
Where can I find info on this bugfix release?
Nothing there: http://lists.ceph.com/pipermail/ceph-announce-ceph.com/

TIA

-- 
Nicolas Huillard


Re: [ceph-users] Self shutdown of 1 whole system: Oops, it did it again (not yet anymore)

2018-07-31 Thread Nicolas Huillard
Hi all,

The latest hint I received (thanks!) was to replace the failing hardware.
Before that, I updated the BIOS, which included a CPU microcode fix for
Meltdown/Spectre and probably other things. Last time I had checked, the
vendor didn't have that fix yet.
Since this update, no CATERR has happened... This Intel microcode +
vendor BIOS may have mitigated the problem, and postpones the hardware
replacement...
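
For reference, how I now check which microcode is actually in use after
the BIOS update (a quick sketch; on Debian the intel-microcode package
from non-free can also load updates at boot, independently of the BIOS):

grep -m1 microcode /proc/cpuinfo   # revision currently loaded
dmesg | grep -i microcode          # whether an early update was applied
apt-get install intel-microcode    # optional, loads vendor updates at boot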

On Tuesday, July 24, 2018 at 12:18 +0200, Nicolas Huillard wrote:
> Hi all,
> 
> The same server did it again with the same CATERR exactly 3 days
> after
> rebooting (+/- 30 seconds).
> If it weren't for the exact +3 days, I would think it's a random
> event. But exactly 3 days after reboot does not seem random.
> 
> Nothing I added got me more information (mcelog, pstore, BMC video
> record, etc.)...
> 
> Thanks in advance for any hint ;-)
> 
> On Saturday, July 21, 2018 at 10:31 +0200, Nicolas Huillard wrote:
> > Hi all,
> > 
> > One of my servers silently shut down last night, with no explanation
> > whatsoever in any logs. According to the existing logs, the
> > shutdown
> > (without reboot) happened between 03:58:20.061452 (last timestamp
> > from
> > /var/log/ceph/ceph-mgr.oxygene.log) and 03:59:01.515308 (new MON
> > election called, for which oxygene didn't answer).
> > 
> > Is there any way in which Ceph could silently shutdown a server?
> > Can SMART self-test influence scrubbing or compaction?
> > 
> > The only thing I have is that smartd stated a long self-test on
> > both
> > OSD spinning drives on that host:
> > Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sda [SAT],
> > starting
> > scheduled Long Self-Test.
> > Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdb [SAT],
> > starting
> > scheduled Long Self-Test.
> > Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdc [SAT],
> > starting
> > scheduled Long Self-Test.
> > Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sda [SAT], self-
> > test in progress, 90% remaining
> > Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdb [SAT], self-
> > test in progress, 90% remaining
> > Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdc [SAT],
> > previous
> > self-test completed without error
> > 
> > ...and smartctl now says that the self-tests didn't finish (on both
> > drives) :
> > # 1  Extended offlineInterrupted (host
> > reset)  00% 10636 -
> > 
> > MON logs on oxygene talks about rockdb compaction a few minutes
> > before
> > the shutdown, and a deep-scrub finished earlier:
> > /var/log/ceph/ceph-osd.6.log
> > 2018-07-21 03:32:54.086021 7fd15d82c700  0 log_channel(cluster) log
> > [DBG] : 6.1d deep-scrub starts
> > 2018-07-21 03:34:31.185549 7fd15d82c700  0 log_channel(cluster) log
> > [DBG] : 6.1d deep-scrub ok
> > 2018-07-21 03:43:36.720707 7fd178082700  0 --
> > 172.22.0.16:6801/478362
> > > > 172.21.0.16:6800/1459922146 conn(0x556f0642b800 :6801
> > 
> > s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> > l=1).handle_connect_msg: challenging authorizer
> > 
> > /var/log/ceph/ceph-mgr.oxygene.log
> > 2018-07-21 03:58:16.060137 7fbcd300  1 mgr send_beacon standby
> > 2018-07-21 03:58:18.060733 7fbcd300  1 mgr send_beacon standby
> > 2018-07-21 03:58:20.061452 7fbcd300  1 mgr send_beacon standby
> > 
> > /var/log/ceph/ceph-mon.oxygene.log
> > 2018-07-21 03:52:27.702314 7f25b5406700  4 rocksdb: (Original Log
> > Time 2018/07/21-03:52:27.702302) [/build/ceph-
> > 12.2.7/src/rocksdb/db/db_impl_compaction_flush.cc:1392] [default]
> > Manual compaction from level-0 to level-1 from 'mgrstat .. '
> > 2018-07-21 03:52:27.702321 7f25b5406700  4 rocksdb: [/build/ceph-
> > 12.2.7/src/rocksdb/db/compaction_job.cc:1403] [default] [JOB 1746]
> > Compacting 1@0 + 1@1 files to L1, score -1.00
> > 2018-07-21 03:52:27.702329 7f25b5406700  4 rocksdb: [/build/ceph-
> > 12.2.7/src/rocksdb/db/compaction_job.cc:1407] [default] Compaction
> > start summary: Base version 1745 Base level 0, inputs:
> > [149507(602KB)], [149505(13MB)]
> > 2018-07-21 03:52:27.702348 7f25b5406700  4 rocksdb: EVENT_LOG_v1
> > {"time_micros": 1532137947702334, "job": 1746, "event":
> > "compaction_started", "files_L0": [149507], "files_L1": [149505],
> > "score": -1, "input_data_size": 14916379}
> > 2018-07-21 03:52:27.785532 7f25b5406700  4 rocksdb: [/build/ceph-
> > 12.2.7/src/rocksdb/db/compaction_job.cc:1116] [default] [JOB 1746]
> > Generat

Re: [ceph-users] Self shutdown of 1 whole system: Oops, it did it again

2018-07-24 Thread Nicolas Huillard
Hi all,

The same server did it again with the same CATERR exactly 3 days after
rebooting (+/- 30 seconds).
If it weren't for the exact +3 days, I would think it's a random event.
But exactly 3 days after reboot does not seem random.

Nothing I added got me more information (mcelog, pstore, BMC video
record, etc.)...

Thanks in advance for any hint ;-)

On Saturday, July 21, 2018 at 10:31 +0200, Nicolas Huillard wrote:
> Hi all,
> 
> One of my servers silently shut down last night, with no explanation
> whatsoever in any logs. According to the existing logs, the shutdown
> (without reboot) happened between 03:58:20.061452 (last timestamp
> from
> /var/log/ceph/ceph-mgr.oxygene.log) and 03:59:01.515308 (new MON
> election called, for which oxygene didn't answer).
> 
> Is there any way in which Ceph could silently shutdown a server?
> Can SMART self-test influence scrubbing or compaction?
> 
> The only thing I have is that smartd stated a long self-test on both
> OSD spinning drives on that host:
> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sda [SAT], starting
> scheduled Long Self-Test.
> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdb [SAT], starting
> scheduled Long Self-Test.
> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdc [SAT], starting
> scheduled Long Self-Test.
> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sda [SAT], self-
> test in progress, 90% remaining
> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdb [SAT], self-
> test in progress, 90% remaining
> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdc [SAT], previous
> self-test completed without error
> 
> ...and smartctl now says that the self-tests didn't finish (on both
> drives) :
> # 1  Extended offlineInterrupted (host
> reset)  00% 10636 -
> 
> MON logs on oxygene talks about rockdb compaction a few minutes
> before
> the shutdown, and a deep-scrub finished earlier:
> /var/log/ceph/ceph-osd.6.log
> 2018-07-21 03:32:54.086021 7fd15d82c700  0 log_channel(cluster) log
> [DBG] : 6.1d deep-scrub starts
> 2018-07-21 03:34:31.185549 7fd15d82c700  0 log_channel(cluster) log
> [DBG] : 6.1d deep-scrub ok
> 2018-07-21 03:43:36.720707 7fd178082700  0 -- 172.22.0.16:6801/478362
> >> 172.21.0.16:6800/1459922146 conn(0x556f0642b800 :6801
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> l=1).handle_connect_msg: challenging authorizer
> 
> /var/log/ceph/ceph-mgr.oxygene.log
> 2018-07-21 03:58:16.060137 7fbcd300  1 mgr send_beacon standby
> 2018-07-21 03:58:18.060733 7fbcd300  1 mgr send_beacon standby
> 2018-07-21 03:58:20.061452 7fbcd300  1 mgr send_beacon standby
> 
> /var/log/ceph/ceph-mon.oxygene.log
> 2018-07-21 03:52:27.702314 7f25b5406700  4 rocksdb: (Original Log
> Time 2018/07/21-03:52:27.702302) [/build/ceph-
> 12.2.7/src/rocksdb/db/db_impl_compaction_flush.cc:1392] [default]
> Manual compaction from level-0 to level-1 from 'mgrstat .. '
> 2018-07-21 03:52:27.702321 7f25b5406700  4 rocksdb: [/build/ceph-
> 12.2.7/src/rocksdb/db/compaction_job.cc:1403] [default] [JOB 1746]
> Compacting 1@0 + 1@1 files to L1, score -1.00
> 2018-07-21 03:52:27.702329 7f25b5406700  4 rocksdb: [/build/ceph-
> 12.2.7/src/rocksdb/db/compaction_job.cc:1407] [default] Compaction
> start summary: Base version 1745 Base level 0, inputs:
> [149507(602KB)], [149505(13MB)]
> 2018-07-21 03:52:27.702348 7f25b5406700  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1532137947702334, "job": 1746, "event":
> "compaction_started", "files_L0": [149507], "files_L1": [149505],
> "score": -1, "input_data_size": 14916379}
> 2018-07-21 03:52:27.785532 7f25b5406700  4 rocksdb: [/build/ceph-
> 12.2.7/src/rocksdb/db/compaction_job.cc:1116] [default] [JOB 1746]
> Generated table #149508: 4904 keys, 14808953 bytes
> 2018-07-21 03:52:27.785587 7f25b5406700  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1532137947785565, "cf_name": "default", "job": 1746,
> "event": "table_file_creation", "file_number": 149508, "file_size":
> 14808953, "table_properties": {"data
> 2018-07-21 03:52:27.785627 7f25b5406700  4 rocksdb: [/build/ceph-
> 12.2.7/src/rocksdb/db/compaction_job.cc:1173] [default] [JOB 1746]
> Compacted 1@0 + 1@1 files to L1 => 14808953 bytes
> 2018-07-21 03:52:27.785656 7f25b5406700  3 rocksdb: [/build/ceph-
> 12.2.7/src/rocksdb/db/version_set.cc:2087] More existing levels in DB
> than needed. max_bytes_for_level_multiplier may not be guaranteed.
> 2018-07-21 03:52:27.791640 7f25b5406700  4 rocksdb: (Original Log
> Time 2018/07/21-03:52:27.791526) [/build/ceph-
> 12.2.7/src/rocksdb/db/compac

Re: [ceph-users] Self shutdown of 1 whole system (Derbian stretch/Ceph 12.2.7/bluestore)

2018-07-23 Thread Nicolas Huillard
On Monday, July 23, 2018 at 12:40 +0200, Oliver Freyermuth wrote:
> On 23.07.2018 at 11:18, Nicolas Huillard wrote:
> > On Monday, July 23, 2018 at 18:23 +1000, Brad Hubbard wrote:
> > > Ceph doesn't shut down systems as in kill or reboot the box if
> > > that's
> > > what you're saying?
> > 
> > That's the first part of what I was saying, yes. I was pretty sure
> > Ceph
> > doesn't reboot/shutdown/reset, but now it's 100% sure, thanks.
> > Maybe systemd triggered something, but without any lasting traces.
> > The kernel didn't leave any more traces in kernel.log, and since
> > the
> > server was off, there was no oops remaining on the console...
> 
> If there was an oops, it should also be recorded in pstore. 
> If the kernel was still running and able to show a stacktrace, even
> if disk I/O has become impossible,
> it will in general dump the stacktrace to pstore (e.g. UEFI pstore if
> you boot via EFI, or ACPI pstore, if available). 

I was sure I would learn something from this thread. Thanks!
Unfortunately, those machines don't boot using UEFI, /sys/fs/pstore/ is
empty, and:
/sys/module/pstore/parameters/backend:(null)
/sys/module/pstore/parameters/update_ms:-1

I suppose this pstore is also shown in the BMC web interface as "Server
Health / System Log". This is empty too, and I wondered what would fill
it. Maybe I'll use UEFI boot next time.
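
The checks I did, in case someone wants to reproduce them (read-only;
my understanding, which may be wrong, is that booting via UEFI registers
the efi-pstore backend automatically, and that ramoops is the other
option but needs a reserved memory region):

cat /sys/module/pstore/parameters/backend   # "(null)" = no backend registered
ls -A /sys/fs/pstore/                       # would hold dmesg-* files after an oops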

-- 
Nicolas Huillard


Re: [ceph-users] Self shutdown of 1 whole system (Derbian stretch/Ceph 12.2.7/bluestore)

2018-07-23 Thread Nicolas Huillard
On Monday, July 23, 2018 at 11:40 +0100, Matthew Vernon wrote:
> > One of my servers silently shut down last night, with no explanation
> > whatsoever in any logs. According to the existing logs, the shutdown
> 
> We have seen similar things with our SuperMicro servers; our current
> best theory is that it's related to CPU power management. Disabling
> it
> in BIOS seems to have helped.

Too bad, my hardware design relies heavily on power management, and
thus on silence...

-- 
Nicolas Huillard


Re: [ceph-users] "CPU CATERR Fault" Was: Self shutdown of 1 whole system (Derbian stretch/Ceph 12.2.7/bluestore)

2018-07-23 Thread Nicolas Huillard
On Monday, July 23, 2018 at 12:43 +0200, Oliver Freyermuth wrote:
> > There ARE chassis/BMC/IPMI level events, one of which is "CPU CATERR
> > Fault", with a timestamp matching the timestamps below, and no more
> > information.
> 
> If this kind of failure (or a less severe one) also happens at
> runtime, mcelog should catch it. 

I'll install mcelog ASAP, even though it probably wouldn't have added
much in that case.

> For CATERR errors, we also found that sometimes the web interface of
> the BMC shows more information for the event log entry 
> than querying the event log via ipmitool - you may want to check
> this. 

I got that from the web interface. ipmitool does not give more
information anyway (lots of "missing" and "unknown", and no
description...):
ipmitool> sel get 118
SEL Record ID  : 0076
 Record Type   : 02
 Timestamp : 07/21/2018 01:58:48
 Generator ID  : 0020
 EvM Revision  : 04
 Sensor Type   : Unknown
 Sensor Number : 76
 Event Type: Sensor-specific Discrete
 Event Direction   : Assertion Event
 Event Data (RAW)  : 00
 Event Interpretation  : Missing
 Description   : 

Sensor ID  : CPU CATERR (0x76)
 Entity ID : 26.1
 Sensor Type (Discrete): Unknown

-- 
Nicolas Huillard


Re: [ceph-users] Why lvm is recommended method for bleustore

2018-07-23 Thread Nicolas Huillard
On Sunday, July 22, 2018 at 09:51 -0400, Satish Patel wrote:
> I read that post and that's why I open this thread for few more
> questions and clearence,
> 
> When you said OSD doesn't come up what actually that means?  After
> reboot of node or after service restart or installation of new disk?
> 
> You said we are using manual method what is that? 
> 
> I'm building new cluster and had zero prior experience so how can I
> produce this error to see lvm is really life saving tool here? I'm
> sure there are plenty of people using but I didn't find and good
> document except that mailing list which raising more questions in my
> mind. 

When I had to change a few drives manually, copying the old contents
over, I noticed that the logical volumes are tagged with lots of
information related to how they should be handled at boot time by the
OSD startup system.
These LVM tags are a good, standard way to embed that metadata within
the volumes themselves. Apparently, there is no other mechanism to
attach such metadata describing bluestore/filestore, SATA/SAS/NVMe,
whole drive or partition, etc.
They are easy to manage and fail-safe in many configurations.
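
To see those tags on an existing OSD (a read-only sketch; the second
command assumes the OSDs were created with ceph-volume):

lvs -o lv_name,vg_name,lv_tags --noheadings
ceph-volume lvm list    # same information, decoded per OSD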

> Sent from my iPhone
> 
> > On Jul 22, 2018, at 6:31 AM, Marc Roos 
> > wrote:
> > 
> > 
> > 
> > I don’t think it will get any more basic than that. Or maybe this?
> > If 
> > the doctor diagnoses you, you can either accept this, get 2nd
> > opinion, 
> > or study medicine to verify it. 
> > 
> > In short, lvm has been introduced to solve some issues related to
> > starting osd's (which I did not have, probably because of a
> > 'manual' 
> > configuration). And it opens the ability to support (more future) 
> > devices.
> > 
> > I gave you two links, did you read the whole thread?
> > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg47802.htm
> > l
> > 
> > 
> > 
> > 
> > 
> > -Original Message-
> > From: Satish Patel [mailto:satish@gmail.com] 
> > Sent: zaterdag 21 juli 2018 20:59
> > To: ceph-users
> > Subject: [ceph-users] Why lvm is recommended method for bleustore
> > 
> > Folks,
> > 
> > I think I am going to boil the ocean here: I googled a lot about why
> > lvm is the recommended method for bluestore, but didn't find any good
> > and detailed explanation, not even on the official Ceph website.
> > 
> > Can someone explain here in basic language, because I am in no way an
> > expert and just want to understand what the advantage of adding an
> > extra layer of complexity is?
> > 
> > I found this post, but I got lost reading it, and want to see what
> > other folks are suggesting and offering in their own language:
> > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg46768.htm
> > l
> > 
> > ~S
> > 
> > 
> 
-- 
Nicolas Huillard


Re: [ceph-users] "CPU CATERR Fault" Was: Self shutdown of 1 whole system (Derbian stretch/Ceph 12.2.7/bluestore)

2018-07-23 Thread Nicolas Huillard
On Monday, July 23, 2018 at 10:28 +0200, Caspar Smit wrote:
> Do you have any hardware watchdog running in the system? A watchdog
> could
> trigger a powerdown if it meets some value. Any event logs from the
> chassis
> itself?

Nice suggestions ;-)

I see some [watchdog/N] and one [watchdogd] kernel threads, along with
a "kernel: [0.116002] NMI watchdog: enabled on all CPUs,
permanently consumes one hw-PMU counter." line in the kernel log, but
no user-land watchdog daemon: I'm not sure if the watchdog is actually
active.

There ARE chassis/BMC/IPMI level events, one of which is "CPU CATERR
Fault", with a timestamp matching the timestamps below, and no more
information.
If I understand correctly, this is a signal emitted by the CPU to the
BMC upon a "catastrophic error" (worse than "fatal"), which the BMC may
respond to however it wants, Intel's suggestions including resetting
the chassis.

https://www.intel.in/content/dam/www/public/us/en/documents/white-papers/platform-level-error-strategies-paper.pdf

Does that mean that the hardware is failing, or a neutrino just crossed
some CPU register?
CPU is a Xeon D-1521 with ECC memory.

> Kind regards,

Many thanks!

> 
> Caspar
> 
> 2018-07-21 10:31 GMT+02:00 Nicolas Huillard :
> 
> > Hi all,
> > 
> > One of my servers silently shut down last night, with no explanation
> > whatsoever in any logs. According to the existing logs, the
> > shutdown
> > (without reboot) happened between 03:58:20.061452 (last timestamp
> > from
> > /var/log/ceph/ceph-mgr.oxygene.log) and 03:59:01.515308 (new MON
> > election called, for which oxygene didn't answer).
> > 
> > Is there any way in which Ceph could silently shutdown a server?
> > Can SMART self-test influence scrubbing or compaction?
> > 
> > The only thing I have is that smartd stated a long self-test on
> > both
> > OSD spinning drives on that host:
> > Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sda [SAT],
> > starting
> > scheduled Long Self-Test.
> > Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdb [SAT],
> > starting
> > scheduled Long Self-Test.
> > Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdc [SAT],
> > starting
> > scheduled Long Self-Test.
> > Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sda [SAT], self-
> > test in
> > progress, 90% remaining
> > Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdb [SAT], self-
> > test in
> > progress, 90% remaining
> > Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdc [SAT],
> > previous
> > self-test completed without error
> > 
> > ...and smartctl now says that the self-tests didn't finish (on both
> > drives) :
> > # 1  Extended offlineInterrupted (host
> > reset)  00% 10636
> > -
> > 
> > MON logs on oxygene talks about rockdb compaction a few minutes
> > before
> > the shutdown, and a deep-scrub finished earlier:
> > /var/log/ceph/ceph-osd.6.log
> > 2018-07-21 03:32:54.086021 7fd15d82c700  0 log_channel(cluster) log
> > [DBG]
> > : 6.1d deep-scrub starts
> > 2018-07-21 03:34:31.185549 7fd15d82c700  0 log_channel(cluster) log
> > [DBG]
> > : 6.1d deep-scrub ok
> > 2018-07-21 03:43:36.720707 7fd178082700  0 --
> > 172.22.0.16:6801/478362 >>
> > 172.21.0.16:6800/1459922146 conn(0x556f0642b800 :6801
> > s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> > l=1).handle_connect_msg: challenging authorizer
> > 
> > /var/log/ceph/ceph-mgr.oxygene.log
> > 2018-07-21 03:58:16.060137 7fbcd300  1 mgr send_beacon standby
> > 2018-07-21 03:58:18.060733 7fbcd300  1 mgr send_beacon standby
> > 2018-07-21 03:58:20.061452 7fbcd300  1 mgr send_beacon standby
> > 
> > /var/log/ceph/ceph-mon.oxygene.log
> > 2018-07-21 03:52:27.702314 7f25b5406700  4 rocksdb: (Original Log
> > Time
> > 2018/07/21-03:52:27.702302) [/build/ceph-12.2.7/src/
> > rocksdb/db/db_impl_compaction_flush.cc:1392] [default] Manual
> > compaction
> > from level-0 to level-1 from 'mgrstat .. '
> > 2018-07-21 03:52:27.702321 7f25b5406700  4 rocksdb:
> > [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1403]
> > [default] [JOB
> > 1746] Compacting 1@0 + 1@1 files to L1, score -1.00
> > 2018-07-21 03:52:27.702329 7f25b5406700  4 rocksdb:
> > [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1407]
> > [default]
> > Compaction start summary: Base version 1745 Base level 0, inputs:
> > [149507(602KB)], [149505(13MB)]
> > 2018-07-21 03:52:27.702348 7f25b5406700  4 rocksdb: EVENT_LOG_v1
> > {"time_micros&

Re: [ceph-users] Self shutdown of 1 whole system (Derbian stretch/Ceph 12.2.7/bluestore)

2018-07-23 Thread Nicolas Huillard
On Monday, July 23, 2018 at 18:23 +1000, Brad Hubbard wrote:
> Ceph doesn't shut down systems as in kill or reboot the box if that's
> what you're saying?

That's the first part of what I was saying, yes. I was pretty sure Ceph
doesn't reboot/shutdown/reset, but now it's 100% sure, thanks.
Maybe systemd triggered something, but without any lasting traces.
The kernel didn't leave any more traces in kernel.log, and since the
server was off, there was no oops remaining on the console...

I'm currently activating "Auto video recording" at the BMC/IPMI level,
as that may help next time this event occurs... Triggers look like
they're tuned for Windows BSOD though...

Thanks for all answers ;-)

> On Mon, Jul 23, 2018 at 5:04 PM, Nicolas Huillard wrote:
> > On Monday, July 23, 2018 at 11:07 +0700, Konstantin Shalygin wrote:
> > > > I even have no fancy kernel or device, just real standard
> > > > Debian.
> > > > The
> > > > uptime was 6 days since the upgrade from 12.2.6...
> > > 
> > > Nicolas, you should upgrade your 12.2.6 to 12.2.7 due bugs in
> > > this
> > > release.

-- 
Nicolas Huillard


Re: [ceph-users] Self shutdown of 1 whole system (Derbian stretch/Ceph 12.2.7/bluestore)

2018-07-23 Thread Nicolas Huillard
On Monday, July 23, 2018 at 11:07 +0700, Konstantin Shalygin wrote:
> > I even have no fancy kernel or device, just real standard Debian.
> > The
> > uptime was 6 days since the upgrade from 12.2.6...
> 
> Nicolas, you should upgrade your 12.2.6 to 12.2.7 due bugs in this
> release.

That was done (cf. subject).
This is happening with 12.2.7, fresh and 6 days old.

-- 
Nicolas Huillard


Re: [ceph-users] Self shutdown of 1 whole system (Derbian stretch/Ceph 12.2.7/bluestore)

2018-07-22 Thread Nicolas Huillard
On Sunday, July 22, 2018 at 02:44 +0200, Oliver Freyermuth wrote:
> Since all services are running on these machines - are you by any
> chance running low on memory? 
> Do you have a monitoring of this? 

I have Munin monitoring on all hosts, but nothing special to notice,
except for a +3°C temperature increase on the OSD spinning drives
(probably the long self-test).
Memory at 1/4th, CPU at "not noticeable", on a generally quiet cluster.
The temperature increase on the drives was not backed by some increase
of IRQ activity or anything else...

> We observe some strange issues with our servers if they run for a
> long while, and with high memory pressure (more memory is
> ordered...). 
> Then, it seems our Infiniband driver can not allocate sufficiently
> large pages anymore, communication is lost between the Ceph nodes,
> recovery starts,
> memory usage grows even higher from this, etc. 
> In some cases, it seems this may lead to a freeze / lockup (not
> reboot). My feeling is that the CentOS 7.5 kernel is not doing as
> well on memory compaction as the modern kernels do. 

I even have no fancy kernel or device, just real standard Debian. The
uptime was 6 days since the upgrade from 12.2.6...

> Right now, this is just a hunch of mine, but my recommendation would
> be to have some monitoring of the machine and see if something
> strange happens in terms of memory usage, CPU usage, or disk I/O
> (e.g. iowait)
> to further pin down the issue. It may as well be something completely
> different. 
> 
> Other options to investigate would be a potential kernel stacktrace
> in pstore, or something in mcelog. 

I'll investigate some obscure monitoring tools with stored log files,
yes.

Thanks!

> 
> Cheers,
>   Oliver
> 
> On 21.07.2018 at 14:34, Nicolas Huillard wrote:
> > I forgot to mention that this server, along with all the other Ceph
> > servers in my cluster, does not run anything other than Ceph, and
> > each runs all the Ceph daemons (mon, mgr, mds, 2×osd).
> > 
> > On Saturday, July 21, 2018 at 10:31 +0200, Nicolas Huillard wrote:
> > > Hi all,
> > > 
> > > One of my servers silently shut down last night, with no
> > > explanation
> > > whatsoever in any logs. According to the existing logs, the
> > > shutdown
> > > (without reboot) happened between 03:58:20.061452 (last timestamp
> > > from
> > > /var/log/ceph/ceph-mgr.oxygene.log) and 03:59:01.515308 (new MON
> > > election called, for which oxygene didn't answer).
> > > 
> > > Is there any way in which Ceph could silently shutdown a server?
> > > Can SMART self-test influence scrubbing or compaction?
> > > 
> > > The only thing I have is that smartd stated a long self-test on
> > > both
> > > OSD spinning drives on that host:
> > > Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sda [SAT],
> > > starting
> > > scheduled Long Self-Test.
> > > Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdb [SAT],
> > > starting
> > > scheduled Long Self-Test.
> > > Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdc [SAT],
> > > starting
> > > scheduled Long Self-Test.
> > > Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sda [SAT],
> > > self-
> > > test in progress, 90% remaining
> > > Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdb [SAT],
> > > self-
> > > test in progress, 90% remaining
> > > Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdc [SAT],
> > > previous
> > > self-test completed without error
> > > 
> > > ...and smartctl now says that the self-tests didn't finish (on
> > > both
> > > drives) :
> > > # 1  Extended offlineInterrupted (host
> > > reset)  00% 10636 -
> > > 
> > > MON logs on oxygene talks about rockdb compaction a few minutes
> > > before
> > > the shutdown, and a deep-scrub finished earlier:
> > > /var/log/ceph/ceph-osd.6.log
> > > 2018-07-21 03:32:54.086021 7fd15d82c700  0 log_channel(cluster)
> > > log
> > > [DBG] : 6.1d deep-scrub starts
> > > 2018-07-21 03:34:31.185549 7fd15d82c700  0 log_channel(cluster)
> > > log
> > > [DBG] : 6.1d deep-scrub ok
> > > 2018-07-21 03:43:36.720707 7fd178082700  0 --
> > > 172.22.0.16:6801/478362
> > > > > 172.21.0.16:6800/1459922146 conn(0x556f0642b800 :6801
> > > 
> > > s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> > > l=1).handle_connect_msg: challenging authorizer
> > > 
> > >

Re: [ceph-users] Self shutdown of 1 whole system (Derbian stretch/Ceph 12.2.7/bluestore)

2018-07-21 Thread Nicolas Huillard
I forgot to mention that this server, along with all the other Ceph
servers in my cluster, does not run anything other than Ceph, and each
runs all the Ceph daemons (mon, mgr, mds, 2×osd).

On Saturday, July 21, 2018 at 10:31 +0200, Nicolas Huillard wrote:
> Hi all,
> 
> One of my servers silently shut down last night, with no explanation
> whatsoever in any logs. According to the existing logs, the shutdown
> (without reboot) happened between 03:58:20.061452 (last timestamp
> from
> /var/log/ceph/ceph-mgr.oxygene.log) and 03:59:01.515308 (new MON
> election called, for which oxygene didn't answer).
> 
> Is there any way in which Ceph could silently shutdown a server?
> Can SMART self-test influence scrubbing or compaction?
> 
> The only thing I have is that smartd stated a long self-test on both
> OSD spinning drives on that host:
> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sda [SAT], starting
> scheduled Long Self-Test.
> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdb [SAT], starting
> scheduled Long Self-Test.
> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdc [SAT], starting
> scheduled Long Self-Test.
> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sda [SAT], self-
> test in progress, 90% remaining
> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdb [SAT], self-
> test in progress, 90% remaining
> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdc [SAT], previous
> self-test completed without error
> 
> ...and smartctl now says that the self-tests didn't finish (on both
> drives) :
> # 1  Extended offlineInterrupted (host
> reset)  00% 10636 -
> 
> MON logs on oxygene talks about rockdb compaction a few minutes
> before
> the shutdown, and a deep-scrub finished earlier:
> /var/log/ceph/ceph-osd.6.log
> 2018-07-21 03:32:54.086021 7fd15d82c700  0 log_channel(cluster) log
> [DBG] : 6.1d deep-scrub starts
> 2018-07-21 03:34:31.185549 7fd15d82c700  0 log_channel(cluster) log
> [DBG] : 6.1d deep-scrub ok
> 2018-07-21 03:43:36.720707 7fd178082700  0 -- 172.22.0.16:6801/478362
> > > 172.21.0.16:6800/1459922146 conn(0x556f0642b800 :6801
> 
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> l=1).handle_connect_msg: challenging authorizer
> 
> /var/log/ceph/ceph-mgr.oxygene.log
> 2018-07-21 03:58:16.060137 7fbcd300  1 mgr send_beacon standby
> 2018-07-21 03:58:18.060733 7fbcd300  1 mgr send_beacon standby
> 2018-07-21 03:58:20.061452 7fbcd300  1 mgr send_beacon standby
> 
> /var/log/ceph/ceph-mon.oxygene.log
> 2018-07-21 03:52:27.702314 7f25b5406700  4 rocksdb: (Original Log
> Time 2018/07/21-03:52:27.702302) [/build/ceph-
> 12.2.7/src/rocksdb/db/db_impl_compaction_flush.cc:1392] [default]
> Manual compaction from level-0 to level-1 from 'mgrstat .. '
> 2018-07-21 03:52:27.702321 7f25b5406700  4 rocksdb: [/build/ceph-
> 12.2.7/src/rocksdb/db/compaction_job.cc:1403] [default] [JOB 1746]
> Compacting 1@0 + 1@1 files to L1, score -1.00
> 2018-07-21 03:52:27.702329 7f25b5406700  4 rocksdb: [/build/ceph-
> 12.2.7/src/rocksdb/db/compaction_job.cc:1407] [default] Compaction
> start summary: Base version 1745 Base level 0, inputs:
> [149507(602KB)], [149505(13MB)]
> 2018-07-21 03:52:27.702348 7f25b5406700  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1532137947702334, "job": 1746, "event":
> "compaction_started", "files_L0": [149507], "files_L1": [149505],
> "score": -1, "input_data_size": 14916379}
> 2018-07-21 03:52:27.785532 7f25b5406700  4 rocksdb: [/build/ceph-
> 12.2.7/src/rocksdb/db/compaction_job.cc:1116] [default] [JOB 1746]
> Generated table #149508: 4904 keys, 14808953 bytes
> 2018-07-21 03:52:27.785587 7f25b5406700  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1532137947785565, "cf_name": "default", "job": 1746,
> "event": "table_file_creation", "file_number": 149508, "file_size":
> 14808953, "table_properties": {"data
> 2018-07-21 03:52:27.785627 7f25b5406700  4 rocksdb: [/build/ceph-
> 12.2.7/src/rocksdb/db/compaction_job.cc:1173] [default] [JOB 1746]
> Compacted 1@0 + 1@1 files to L1 => 14808953 bytes
> 2018-07-21 03:52:27.785656 7f25b5406700  3 rocksdb: [/build/ceph-
> 12.2.7/src/rocksdb/db/version_set.cc:2087] More existing levels in DB
> than needed. max_bytes_for_level_multiplier may not be guaranteed.
> 2018-07-21 03:52:27.791640 7f25b5406700  4 rocksdb: (Original Log
> Time 2018/07/21-03:52:27.791526) [/build/ceph-
> 12.2.7/src/rocksdb/db/compaction_job.cc:621] [default] compacted to:
> base level 1 max bytes base 26843546 files[0 1 0 0 0 0 0]
> 2018-07-21 03:52:27.791657 7f25b5406700  4 rocksdb: (Original

[ceph-users] Self shutdown of 1 whole system (Derbian stretch/Ceph 12.2.7/bluestore)

2018-07-21 Thread Nicolas Huillard
": 149505}
2018-07-21 03:52:27.796690 7f25b6408700  4 rocksdb: 
[/build/ceph-12.2.7/src/rocksdb/db/db_impl_compaction_flush.cc:839] [default] 
Manual compaction starting
...
2018-07-21 03:53:33.404428 7f25b5406700  4 rocksdb: 
[/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1173] [default] [JOB 1748] 
Compacted 1@0 + 1@1 files to L1 => 14274825 bytes
2018-07-21 03:53:33.404460 7f25b5406700  3 rocksdb: 
[/build/ceph-12.2.7/src/rocksdb/db/version_set.cc:2087] More existing levels in 
DB than needed. max_bytes_for_level_multiplier may not be guaranteed.
2018-07-21 03:53:33.408360 7f25b5406700  4 rocksdb: (Original Log Time 
2018/07/21-03:53:33.408228) 
[/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:621] [default] compacted 
to: base level 1 max bytes base 26843546 files[0 1 0 0 0 0 0]
2018-07-21 03:53:33.408381 7f25b5406700  4 rocksdb: (Original Log Time 
2018/07/21-03:53:33.408275) EVENT_LOG_v1 {"time_micros": 1532138013408255, 
"job": 1748, "event": "compaction_finished", "compaction_time_micros": 84964, 
"output_level"
2018-07-21 03:53:33.408647 7f25b5406700  4 rocksdb: EVENT_LOG_v1 
{"time_micros": 1532138013408641, "job": 1748, "event": "table_file_deletion", 
"file_number": 149510}
2018-07-21 03:53:33.413854 7f25b5406700  4 rocksdb: EVENT_LOG_v1 
{"time_micros": 1532138013413849, "job": 1748, "event": "table_file_deletion", 
"file_number": 149508}
2018-07-21 03:54:27.634782 7f25bdc17700  0 
mon.oxygene@3(peon).data_health(66142) update_stats avail 79% total 4758 MB, 
used 991 MB, avail 3766 MB
2018-07-21 03:55:27.635318 7f25bdc17700  0 
mon.oxygene@3(peon).data_health(66142) update_stats avail 79% total 4758 MB, 
used 991 MB, avail 3766 MB
2018-07-21 03:56:27.635923 7f25bdc17700  0 
mon.oxygene@3(peon).data_health(66142) update_stats avail 79% total 4758 MB, 
used 991 MB, avail 3766 MB
2018-07-21 03:57:27.636464 7f25bdc17700  0 
mon.oxygene@3(peon).data_health(66142) update_stats avail 79% total 4758 MB, 
used 991 MB, avail 3766 MB

I can see no evidence of intrusion or anything (network or physical).
I'm not even sure it was a shutdown rather than a hard reset, but there
is no evidence of any fsck replaying a journal during reboot either.
The server restarted without problem and the cluster is now HEALTH_OK.

Hardware:
* ASRock Rack mobos (the BMC/IPMI may have reset the server for no
reason)
* Western Digital ST4000VN008 OSD drives

-- 
Nicolas Huillard


Re: [ceph-users] v12.2.7 Luminous released

2018-07-18 Thread Nicolas Huillard
ill cause an availability outage for the duration of the OSD
> restarts.  If this in unacceptable, an *more risky* alternative is to
> disable RGW garbage collection (the primary known cause of these
> rados
> operations) for the duration of the upgrade::
> 
> 1. Set ``rgw_enable_gc_threads = false`` in ceph.conf
> 2. Restart all radosgw daemons
> 3. Upgrade and restart all OSDs
> 4. Remove ``rgw_enable_gc_threads = false`` from ceph.conf
> 5. Restart all radosgw daemons
> 
> Upgrading from other versions
> -
> 
> If your cluster did not run v12.2.5 or v12.2.6 then none of the above
> issues apply to you and you should upgrade normally.
> 
> v12.2.7 Changelog
> -
> 
> * mon/AuthMonitor: improve error message (issue#21765, pr#22963,
> Douglas Fuller)
> * osd/PG: do not blindly roll forward to log.head (issue#24597,
> pr#22976, Sage Weil)
> * osd/PrimaryLogPG: rebuild attrs from clients (issue#24768 ,
> pr#22962, Sage Weil)
> * osd: work around data digest problems in 12.2.6 (version 2)
> (issue#24922, pr#23055, Sage Weil)
> * rgw: objects in cache never refresh after rgw_cache_expiry_interval
> (issue#24346, pr#22369, Casey Bodley, Matt Benjamin)
> 
> Notable changes in v12.2.6 Luminous
> ===
> 
> :note: This is a broken release with serious known regressions.  Do
> not
> install it. The release notes below are to help track the changes
> that
> went in 12.2.6 and hence a part of 12.2.7
> 
> 
> - *Auth*:
> 
>   * In 12.2.4 and earlier releases, keyring caps were not checked for
> validity,
> so the caps string could be anything. As of 12.2.6, caps strings
> are
> validated and providing a keyring with an invalid caps string to,
> e.g.,
> "ceph auth add" will result in an error.
>   * CVE 2018-1128: auth: cephx authorizer subject to replay attack
> (issue#24836, Sage Weil)
>   * CVE 2018-1129: auth: cephx signature check is weak (issue#24837,
> Sage Weil)
>   * CVE 2018-10861: mon: auth checks not correct for pool ops
> (issue#24838, Jason Dillaman)
> 
> 
> - The config-key interface can store arbitrary binary blobs but JSON
>   can only express printable strings.  If binary blobs are present,
>   the 'ceph config-key dump' command will show them as something like
>   ``<<< binary blob of length N >>>``.
> 
> The full changelog for 12.2.6 is published in the release blog.
> 
> Getting ceph:
> * Git at git://github.com/ceph/ceph.git
> * Tarball at http://download.ceph.com/tarballs/ceph-12.2.7.tar.gz
> * For packages, see http://docs.ceph.com/docs/master/install/get-pack
> ages/
> * Release git sha1: 3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5
> 
-- 
Nicolas Huillard


Re: [ceph-users] resize wal/db

2018-07-17 Thread Nicolas Huillard
On Tuesday, July 17, 2018 at 16:20 +0300, Igor Fedotov wrote:
> Right, but procedure described in the blog can be pretty easily
> adjusted 
> to do a resize.

Sure, but if I remember correctly, Ceph itself cannot use the increased
size: you'll end up with a larger device with unused additional space.
Using that space may be on the TODO, though, so this may not be a
complete waste of space...
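
If/when consuming the extra space lands, I would expect it to surface
through ceph-bluestore-tool, something along these lines (untested here,
and whether bluefs-bdev-expand is present and functional in a given
Luminous build is an assumption on my part):

systemctl stop ceph-osd@<id>
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-<id>
systemctl start ceph-osd@<id>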

-- 
Nicolas Huillard


Re: [ceph-users] MDS damaged

2018-07-15 Thread Nicolas Huillard
On Sunday, July 15, 2018 at 11:01 -0500, Adam Tygart wrote:
> Check out the message titled "IMPORTANT: broken luminous 12.2.6
> release in repo, do not upgrade"
> 
> It sounds like 12.2.7 should come *soon* to fix this transparently.

Thanks. I didn't notice this one. I should monitor the ML more closely.
This means I'll just wait for the fix in 12.2.7 ;-)
Have a nice day!

-- 
Nicolas Huillard


Re: [ceph-users] MDS damaged

2018-07-15 Thread Nicolas Huillard
s
> ing-an-alternate-metadata-pool-for-recovery 
> but I'm not sure if it will work as it's apparently doing nothing at
> the 
> moment (maybe it's just very slow).
> 
> Any help is appreciated, thanks!
> 
> 
>      Alessandro
> 
-- 
Nicolas Huillard
Associé fondateur - Directeur Technique - Dolomède

nhuill...@dolomede.fr
Fixe : +33 9 52 31 06 10
Mobile : +33 6 50 27 69 08
http://www.dolomede.fr/

https://reseauactionclimat.org/planetman/
http://climat-2020.eu/
http://www.350.org/


Re: [ceph-users] Place on separate hosts?

2018-05-04 Thread Nicolas Huillard
On Friday, May 4, 2018 at 00:25 -0700, Tracy Reed wrote:
> On Fri, May 04, 2018 at 12:18:15AM PDT, Tracy Reed spake thusly:
> > https://jcftang.github.io/2012/09/06/going-from-replicating-across-
> > osds-to-replicating-across-hosts-in-a-ceph-cluster/
> 
> 
> > How can I tell which way mine is configured? I could post the whole
> > crushmap if necessary but it's a bit large to copy and paste.
> 
> To further answer my own question (sorry for the spam) the above
> linked
> doc says this should do what I want:
> 
> step chooseleaf firstn 0 type host
> 
> which is what I already have in my crush map. So it looks like the
> default is as I want it. In which case I wonder why I had the problem
> previously... I guess the only way to know for sure is to stop one
> osd
> node and see what happens.

You can test the crush rules.
See http://docs.ceph.com/docs/master/rados/operations/crush-map-edits/

Examples from my own notes:
ceph osd getcrushmap -o crushmap
crushtool -i crushmap --test --rule 0 --num-rep 4 --show-utilization
crushtool -i crushmap --test --rule 0 --num-rep 4 --show-mappings 
--show-choose-tries --show-statistics | less
etc.

This helped me validate the placement on different hosts and
datacenters.

-- 
Nicolas Huillard


Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-05-02 Thread Nicolas Huillard
On Sunday, April 8, 2018 at 20:40 +, Jens-U. Mozdzen wrote:
> sorry for bringing up that old topic again, but we just faced a  
> corresponding situation and have successfully tested two migration  
> scenarios.

Thank you very much for this update, as I needed to do exactly that,
due to an SSD crash triggering hardware replacement.
The block.db volumes on the crashed SSD were lost, so the two OSDs
depending on it were re-created.
before they failed, thus needed to effectively replace DB/WAL devices
on the live cluster (2 SSDs on 2 hosts and 4 OSDs).

> it is possible to move a separate WAL/DB to a new device, whilst  
> without changing the size. We have done this for multiple OSDs,
> using  
> only existing (mainstream :) ) tools and have documented the
> procedure  
> in  
> http://heiterbiswolkig.blogs.nde.ag/2018/04/08/migrating-bluestores-b
> lock-db/  
> . It will *not* allow to separate WAL / DB after OSD creation, nor  
> does it allow changing the DB size.

The lost OSDs were still backfilling when I did the above procedure
(data redundancy was high enough to risk losing one more node). I even
mis-typed the "ceph osd set noout" command ("ceph osd unset noout"
instead, effectively a no-op), and replaced 2 OSDs of a single host at
the same time (thus taking more time than the 10 minutes before kicking
the OSDs out, triggering even more data movement).
Everything went cleanly though, thanks to your detailed commands, which
I ran one at a time, thinking twice before each [Enter].

I dug a bit into the LVM tags:
* make a backup of all pv/vg/lv config: vgcfgbackup
* check the backed-up tags: grep tags /etc/lvm/backup/*

I then noticed that:
* there are lots of "ceph.*=" tags
* tags are still present on the old DB/WAL LVs (since I didn't remove
them)
* tags are absent from the new DB/WAL LVs (ditto, I didn't create
them), which may be a problem later on...
* I changed the ceph.db_device= tag, but there is also a ceph.db_uuid=
tag which was not changed, and may or may not trigger a problem upon
reboot (I don't know if this UUID is part of the dd'ed data); see the
tag commands right after this list
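
The commands I'd use to fix them by hand (a sketch only; <vg>/<lv> and
the device paths are placeholders, and I haven't validated this across
a reboot yet):

lvs -o lv_name,lv_tags <vg>/<lv>                                    # inspect
lvchange --deltag "ceph.db_device=/dev/old-db-partition" <vg>/<lv>  # drop the stale tag
lvchange --addtag "ceph.db_device=/dev/new-db-partition" <vg>/<lv>  # add the new one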

You effectively helped a lot! Thanks.

-- 
Nicolas Huillard


[ceph-users] ZeroDivisionError: float division by zero in /usr/lib/ceph/mgr/dashboard/module.py (12.2.4)

2018-04-15 Thread Nicolas Huillard
Hi,

I'm not sure if this has been solved since 12.2.4. The same code
occurs in a different file on GitHub:
https://github.com/ceph/ceph/blob/50412f7e9c2691ec10132c8bf9310a05a40e9f9d/src/pybind/mgr/status/module.py
The ZeroDivisionError occurs when the dashboard is open and there is a
network outage (the link between the 2 datacenters is broken). I'm not
sure about the behaviour of the actual UI in the dashboard at the same
time.

Syslog trace:

ceph-mgr[1324]: [15/Apr/2018:09:47:12] HTTP Traceback (most recent call last):
ceph-mgr[1324]:   File 
"/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
ceph-mgr[1324]: response.body = self.handler()
ceph-mgr[1324]:   File 
"/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in 
__call__
ceph-mgr[1324]: self.body = self.oldhandler(*args, **kwargs)
ceph-mgr[1324]:   File 
"/usr/lib/python2.7/dist-packages/cherrypy/lib/jsontools.py", line 63, in 
json_handler
ceph-mgr[1324]: value = cherrypy.serving.request._json_inner_handler(*args, 
**kwargs)
ceph-mgr[1324]:   File 
"/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__
ceph-mgr[1324]: return self.callable(*self.args, **self.kwargs)
ceph-mgr[1324]:   File "/usr/lib/ceph/mgr/dashboard/module.py", line 991, in 
list_data
ceph-mgr[1324]: return self._osds_by_server()
ceph-mgr[1324]:   File "/usr/lib/ceph/mgr/dashboard/module.py", line 1040, in 
_osds_by_server
ceph-mgr[1324]: osd_map.osds_by_id[osd_id])
ceph-mgr[1324]:   File "/usr/lib/ceph/mgr/dashboard/module.py", line 1007, in 
_osd_summary
ceph-mgr[1324]: result['stats'][s.split(".")[1]] = 
global_instance().get_rate('osd', osd_spec, s)
ceph-mgr[1324]:   File "/usr/lib/ceph/mgr/dashboard/module.py", line 268, in 
get_rate
ceph-mgr[1324]: return (data[-1][1] - data[-2][1]) / float(data[-1][0] - 
data[-2][0])
ceph-mgr[1324]: ZeroDivisionError: float division by zero

HTH,

-- 
Nicolas Huillard


[ceph-users] "ceph-fuse" / "mount -t fuse.ceph" do not report a failed mount on exit (Pacemaker OCF "Filesystem" resource)

2018-04-11 Thread Nicolas Huillard
Hi all,

I use Pacemaker and the "Filesystem" Resource Agent to mount/unmount my
cephfs. Depending on timing, the MDS may only become reachable a few
dozen seconds after the mount command, but the resulting mount failure
is not reported through the exit code.

Examples using mount.fuse.ceph or ceph-fuse (no MDS running at that
time) :

root@helium:~# mount -t fuse.ceph -o ceph.id=srv_root,defaults,noatime,nosuid,nodev /dev/null /srv
ceph-fuse[1432664]: starting ceph client
2018-04-11 16:40:21.234177 7f9e5b6750c0 -1 init, newargv = 0x5627ab4bfd50 newargc=11
ceph-fuse[1432664]: probably no MDS server is up?
ceph-fuse[1432664]: ceph mount failed with (4) Interrupted system call
root@helium:~# echo $?
0

root@helium:~# ceph-fuse --id=srv_root /srv
2018-04-11 16:42:05.383406 7f558e0860c0 -1 init, newargv = 0x55d2aaad7f80 newargc=9
ceph-fuse[1433043]: starting ceph client
ceph-fuse[1433043]: probably no MDS server is up?
ceph-fuse[1433043]: ceph mount failed with (4) Interrupted system call
root@helium:~# echo $?
0

The man page for mount says that an exit code of 0 denotes success, and
other codes denote various error conditions. The Filesystem RA just
reports a generic OCF failure when the exit code is != 0, so here it
wrongly believes the mount succeeded.

Is there a standard Pacemaker/OCF way to get a proper exit code, to
test for failure some other way, or anything else ?
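
In the meantime, the only reliable check I can think of is to ignore
the exit code and test the mount point itself, along these lines (a
rough sketch, not the Filesystem RA itself ; mount point and timeout
are arbitrary):

#!/bin/sh
# try the mount, then poll the mount point instead of trusting the exit code
MNT=/srv
TIMEOUT=60
mount -t fuse.ceph -o ceph.id=srv_root,defaults,noatime,nosuid,nodev /dev/null "$MNT"
i=0
while [ $i -lt $TIMEOUT ]; do
    if mountpoint -q "$MNT"; then
        exit 0          # OCF_SUCCESS
    fi
    i=$((i + 1))
    sleep 1
done
exit 1                  # OCF_ERR_GENERIC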

TIA,

-- 
Nicolas Huillard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is it possible to suggest the active MDS to move to a datacenter ?

2018-03-29 Thread Nicolas Huillard
Thanks for your answer.

Le jeudi 29 mars 2018 à 13:51 -0700, Patrick Donnelly a écrit :
> On Thu, Mar 29, 2018 at 1:02 PM, Nicolas Huillard <nhuillard@dolomede
> .fr> wrote:
> > I manage my 2 datacenters with Pacemaker and Booth. One of them is
> > the
> > publicly-known one, thanks to Booth.
> > Whatever the "public datacenter", Ceph is a single storage cluster.
> > Since most of the cephfs traffic come from this "public
> > datacenter",
> > I'd like to suggest or force the active MDS to move to the same
> > datacenter, hoping to reduce trafic on the inter-datacenter link,
> > and
> > reduce cephfs metadata operations latency.
> > 
> > Is it possible for forcefully move the active MDS using external
> > triggers ?
> 
> No and it probably wouldn't be beneficial. The MDS still needs to
> talk
> to the metadata/data pools and increasing the latency between the MDS
> and the OSDs will probably do more harm.

It wasn't clear in my first post: OSDs are already split between both
DCs, so having the MDS on either side has the same effect on MDS-OSD
traffic. It appears that my current usage profile generates load on the
MDS, but not that much on the metadata-pool OSDs.
The public DC is just the one of the two that Booth gives its ticket
to.

> One possibility for helping your situation is to put NFS-Ganesha in
> the public datacenter as a gateway to CephFS. This may help with your
> performance by (a) sharing a larger cache among multiple clients and
> (b) reducing capability conflicts between clients thereby resulting
> in
> less metadata traffic with the MDS. Be aware an HA solution doesn't
> yet exist for NFS-Ganesha+CephFS outside of Openstack Queens
> deployments.

I'll keep it stupid-simple then, just use the cephfs client, and
monitor the usage profile of things ;-)
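
(If it ever becomes necessary, the only crude lever I can see is to
fail the active MDS and let a standby take over, with no control over
which one wins ; a sketch only:)

ceph fs status                    # shows which MDS is currently active
ceph mds fail <active-mds-name>   # a standby takes over, hopefully in the right DC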

-- 
Nicolas Huillard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Is it possible to suggest the active MDS to move to a datacenter ?

2018-03-29 Thread Nicolas Huillard
Hi,

I manage my 2 datacenters with Pacemaker and Booth. One of them is the
publicly-known one, thanks to Booth.
Whatever the "public datacenter", Ceph is a single storage cluster.
Since most of the cephfs traffic come from this "public datacenter",
I'd like to suggest or force the active MDS to move to the same
datacenter, hoping to reduce trafic on the inter-datacenter link, and
reduce cephfs metadata operations latency.

Is it possible to forcefully move the active MDS using external
triggers ?

As I understand it, I can't do that for MONs, because the lowest-rank
available MON is always the leader. It's also probably less of a
problem since MON traffic is low.

-- 
Nicolas Huillard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] session lost, hunting for new mon / session established : every 30s until unmount/remount

2018-03-29 Thread Nicolas Huillard
Le mercredi 28 mars 2018 à 15:57 -0700, Jean-Charles Lopez a écrit :
> if I read you crrectly you have 3 MONs on each data center. This
> means that when the link goes down you will loose quorum making the
> cluster unavailable.
> 
> If my perception is correct, you’d have to start a 7th MON somewhere
> else accessible from both sites for your cluster to maintain quorum
> during this event.

Your perception was correct, and your diagnosis too: with the 7th MON,
things are way better.
With correct firewall rules, things improve even more (there were
DROPped TCP/6789 packets in certain cases).
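
For reference, the kind of rules involved, as a plain iptables sketch
(assuming the default Luminous ports and the two subnets used here ;
adapt to your own firewall tooling):

# monitors
iptables -A INPUT -p tcp --dport 6789 -s 172.21.0.0/16 -j ACCEPT
iptables -A INPUT -p tcp --dport 6789 -s 172.22.0.0/16 -j ACCEPT
# OSD/MDS/MGR daemons
iptables -A INPUT -p tcp --dport 6800:7300 -s 172.21.0.0/16 -j ACCEPT
iptables -A INPUT -p tcp --dport 6800:7300 -s 172.22.0.0/16 -j ACCEPT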

Sorry for the disturbance. I'll continue to test.

-- 
Nicolas Huillard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] session lost, hunting for new mon / session established : every 30s until unmount/remount

2018-03-29 Thread Nicolas Huillard
I found this message in the archive :

 Forwarded message 
De: Ilya Dryomov <idryo...@gmail.com>
À: Дмитрий Глушенок <gl...@jet.msk.su>
Cc: ceph-users@lists.ceph.com <ceph-users@lists.ceph.com>
Objet: Re: [ceph-users] How's cephfs going?
Date: Fri, 21 Jul 2017 16:25:40 +0200

On Fri, Jul 21, 2017 at 4:06 PM, Дмитрий Глушенок <gl...@jet.msk.su>
wrote:
> All three mons has value "simple".

OK, so http://tracker.ceph.com/issues/17664 is unrelated.  Open a new
kernel client ticket with all the ceph-fuse vs kernel client info and
as many log excerpts as possible.  If you've never seen the endless
hang with ceph-fuse, it's probably a fairly simple to fix (might be
hard to track down though) kernel client bug.


Here the setting is :
$ ceph daemon mon.brome config get ms_type
{
"ms_type": "async+posix"
}

Since the issue was solved more than a year ago and I use 12.2.4, I
guess that's not the issue here. I may change that value to something
else ("simple" or a more recent setting?).

TIA,

Le jeudi 29 mars 2018 à 00:40 +0200, Nicolas Huillard a écrit :
> Hi all,
> 
> I didn't find much information regarding this kernel client loop in
> the
> ML. Here are my observation, around which I'll try to investigate.
> 
> My setup:
> * 2 datacenters connected using an IPsec tunnel configured for
> routing
> (2 subnets)
> * connection to the WAN using PPPoE and the pppd kernel module
> * the PPP connection lasts exactly 7 days, after which the provider
> kills it, and my PPP client restarts it (the WAN/inter-cluster
> communication is thus disconnected during ~30s)
> * 3 MON+n×OSD+MGR+MDS on each datacenter
> * 2 client servers using cephfs/kernel module; one of them on each
> datacenter runs the pppd client and the IPSec endpoint (Pacemaker
> manages this front-end aspect of the cluster)
> * a single cephfs mount which is not managed by Pacemaker
> 
> Observations:
> * when the ppp0 connection stops, the pppd restores the default route
> from "using the PPP tunnel" to "using a virtual IP which happens to
> be
> on the same host" (but could move to the other peer)
> 
> Mar 28 19:07:09 neon pppd[5543]: restoring old default route to eth0
> [172.21.0.254]
> 
> * IPsec et al. react cleanly (remove the tunnel, recreate it when PPP
> is up again)
> 
> Mar 28 19:07:43 neon pppd[5543]: Connected to 02::42 via
> interface eth1.835
> Mar 28 19:07:43 neon pppd[5543]: CHAP authentication succeeded
> Mar 28 19:07:43 neon pppd[5543]: peer from calling number 02::42
> authorized
> Mar 28 19:07:43 neon pppd[5543]: replacing old default route to eth0
> [172.21.0.254]
> 
> * 20s after the PPP link is up and IPsec is restored, libceph starts
> to
> complain (neon is the client/gateway on 172.21.0.0/16 which lost its
> PPP, sodium is the remote side of the same IPsec tunnel) :
> 
> Mar 28 19:08:03 neon kernel: [1232455.656828] libceph: mon1
> 172.21.0.18:6789 socket closed (con state OPEN)
> Mar 28 19:08:12 neon kernel: [1232463.846633] ceph: mds0 caps stale
> Mar 28 19:08:16 neon kernel: [1232468.128577] ceph: mds0 caps went
> stale, renewing
> Mar 28 19:08:16 neon kernel: [1232468.128581] ceph: mds0 caps stale
> Mar 28 19:08:30 neon kernel: [1232482.601183] libceph: mon3
> 172.22.0.16:6789 session established
> Mar 28 19:09:01 neon kernel: [1232513.256059] libceph: mon3
> 172.22.0.16:6789 session lost, hunting for new mon
> Mar 28 19:09:01 neon kernel: [1232513.321176] libceph: mon5
> 172.22.0.20:6789 session established
> Mar 28 19:09:32 neon kernel: [1232543.977003] libceph: mon5
> 172.22.0.20:6789 session lost, hunting for new mon
> Mar 28 19:09:32 neon kernel: [1232543.979567] libceph: mon2
> 172.21.0.20:6789 session established
> Mar 28 19:09:39 neon kernel: [1232551.435001] ceph: mds0 caps renewed
> Mar 28 19:10:02 neon kernel: [1232574.697885] libceph: mon2
> 172.21.0.20:6789 session lost, hunting for new mon
> Mar 28 19:10:02 neon kernel: [1232574.763614] libceph: mon4
> 172.22.0.18:6789 session established
> Mar 28 19:10:33 neon kernel: [1232605.418776] libceph: mon4
> 172.22.0.18:6789 session lost, hunting for new mon
> Mar 28 19:10:33 neon kernel: [1232605.420896] libceph: mon0
> 172.21.0.16:6789 session established
> Mar 28 19:11:04 neon kernel: [1232636.139720] libceph: mon0
> 172.21.0.16:6789 session lost, hunting for new mon
> Mar 28 19:11:04 neon kernel: [1232636.205717] libceph: mon3
> 172.22.0.16:6789 session established
> 
> Mar 28 19:07:40 sodium kernel: [1211268.708716] libceph: mon0
> 172.21.0.16:6789 session lost, hunting for new mon
> Mar 28 19:07:44 sodium kernel: [1211272.208735] libceph: mon5
> 172.22.0.20:6789 socket closed (con state OPEN)
> Mar 

Re: [ceph-users] session lost, hunting for new mon / session established : every 30s until unmount/remount

2018-03-29 Thread Nicolas Huillard
Le mercredi 28 mars 2018 à 15:57 -0700, Jean-Charles Lopez a écrit :
> if I read you crrectly you have 3 MONs on each data center. This
> means that when the link goes down you will loose quorum making the
> cluster unavailable.

Oh yes, sure. I'm planning to add this 7th MON.
I'm not sure the problem is related, since there is very little
activity on the cluster during the event, and the "hunting for new mon"
messages persist for a very long time after the connection is up again.
There are messages on the various MONs calling for elections, and the
primary MON finally wins when the link is back:

2018-03-28 19:07:18.554186 7ffacd6df700  0 log_channel(cluster) log [INF] : mon.brome calling monitor election
2018-03-28 19:07:18.554295 7ffacd6df700  1 mon.brome@0(electing).elector(1551) init, last seen epoch 1551, mid-election, bumping
...
2018-03-28 19:08:13.703667 7ffacfee4700  0 log_channel(cluster) log [INF] : mon.brome is new leader, mons brome,chlore,fluor,soufre in quorum (ranks 0,1,2,5)
...
2018-03-28 19:08:13.804470 7ffac8ed6700  0 log_channel(cluster) log [WRN] : Health check failed: 2/6 mons down, quorum brome,chlore,fluor,soufre (MON_DOWN)
2018-03-28 19:08:13.834336 7ffac8ed6700  0 log_channel(cluster) log [WRN] : overall HEALTH_WARN 2/6 mons down, quorum brome,chlore,fluor,soufre
2018-03-28 19:08:16.243895 7ffacd6df700  0 log_channel(cluster) log [INF] : mon.brome calling monitor election
2018-03-28 19:08:16.244011 7ffacd6df700  1 mon.brome@0(electing).elector(1577) init, last seen epoch 1577, mid-election, bumping
2018-03-28 19:08:17.106483 7ffacd6df700  0 log_channel(cluster) log [INF] : mon.brome is new leader, mons brome,chlore,fluor,oxygene,phosphore,soufre in quorum (ranks 0,1,2,3,4,5)
...
2018-03-28 19:08:17.178867 7ffacd6df700  0 log_channel(cluster) log [INF] : Cluster is now healthy
2018-03-28 19:08:17.227849 7ffac8ed6700  0 log_channel(cluster) log [INF] : overall HEALTH_OK

> If my perception is correct, you’d have to start a 7th MON somewhere
> else accessible from both sites for your cluster to maintain quorum
> during this event.

The actual problem I noticed is AFTER the event.
Since my setup is low-end (especially the routing + PPP + IPsec between
datacenters), this may be a situation most people have not run into,
and I suspect the kernel routing table change is what triggers the
lasting problem for the cephfs kernel client.

> Regards
> JC
> 
> > On Mar 28, 2018, at 15:40, Nicolas Huillard <nhuill...@dolomede.fr>
> > wrote:
> > 
> > Hi all,
> > 
> > I didn't find much information regarding this kernel client loop in
> > the
> > ML. Here are my observation, around which I'll try to investigate.
> > 
> > My setup:
> > * 2 datacenters connected using an IPsec tunnel configured for
> > routing
> > (2 subnets)
> > * connection to the WAN using PPPoE and the pppd kernel module
> > * the PPP connection lasts exactly 7 days, after which the provider
> > kills it, and my PPP client restarts it (the WAN/inter-cluster
> > communication is thus disconnected during ~30s)
> > * 3 MON+n×OSD+MGR+MDS on each datacenter
> > * 2 client servers using cephfs/kernel module; one of them on each
> > datacenter runs the pppd client and the IPSec endpoint (Pacemaker
> > manages this front-end aspect of the cluster)
> > * a single cephfs mount which is not managed by Pacemaker
> > 
> > Observations:
> > * when the ppp0 connection stops, the pppd restores the default
> > route
> > from "using the PPP tunnel" to "using a virtual IP which happens to
> > be
> > on the same host" (but could move to the other peer)
> > 
> > Mar 28 19:07:09 neon pppd[5543]: restoring old default route to
> > eth0 [172.21.0.254]
> > 
> > * IPsec et al. react cleanly (remove the tunnel, recreate it when
> > PPP
> > is up again)
> > 
> > Mar 28 19:07:43 neon pppd[5543]: Connected to 02::42 via
> > interface eth1.835
> > Mar 28 19:07:43 neon pppd[5543]: CHAP authentication succeeded
> > Mar 28 19:07:43 neon pppd[5543]: peer from calling number
> > 02::42 authorized
> > Mar 28 19:07:43 neon pppd[5543]: replacing old default route to
> > eth0 [172.21.0.254]
> > 
> > * 20s after the PPP link is up and IPsec is restored, libceph
> > starts to
> > complain (neon is the client/gateway on 172.21.0.0/16 which lost
> > its
> > PPP, sodium is the remote side of the same IPsec tunnel) :
> > 
> > Mar 28 19:08:03 neon kernel: [1232455.656828] libceph: mon1
> > 172.21.0.18:6789 socket closed (con state OPEN)
> > Mar 28 19:08:12 neon kernel: [1232463.846633] ceph: mds0 caps stale
> > Mar 28 19:08:16 neon kernel: [1232468.128577] ceph: m

[ceph-users] session lost, hunting for new mon / session established : every 30s until unmount/remount

2018-03-28 Thread Nicolas Huillard
PEN)
Mar 28 19:08:16 lithium kernel: [603577.883001] ceph: mds0 caps went stale, renewing
Mar 28 19:08:16 lithium kernel: [603577.883004] ceph: mds0 caps stale
Mar 28 19:08:16 lithium kernel: [603577.883483] ceph: mds0 caps renewed
Mar 28 19:08:19 lithium kernel: [603581.559718] libceph: mon0 172.21.0.16:6789 session established
Mar 28 19:08:52 lithium kernel: [603614.261565] libceph: mon0 172.21.0.16:6789 session lost, hunting for new mon
Mar 28 19:08:52 lithium kernel: [603614.263845] libceph: mon3 172.22.0.16:6789 session established
Mar 28 19:09:23 lithium kernel: [603644.982627] libceph: mon3 172.22.0.16:6789 session lost, hunting for new mon
Mar 28 19:09:23 lithium kernel: [603644.984546] libceph: mon5 172.22.0.20:6789 session established
Mar 28 19:09:54 lithium kernel: [603675.703696] libceph: mon5 172.22.0.20:6789 session lost, hunting for new mon
Mar 28 19:09:54 lithium kernel: [603675.773835] libceph: mon2 172.21.0.20:6789 session established

* all those "hunting for" messages stop when I unmount/remount the
cephfs filesystem (supported by the kernel module, not Fuse)

Mar 28 20:38:27 neon kernel: [1237879.172749] libceph: mon2 172.21.0.20:6789 session lost, hunting for new mon
Mar 28 20:38:27 neon kernel: [1237879.238902] libceph: mon4 172.22.0.18:6789 session established
Mar 28 20:38:57 neon kernel: [1237909.893569] libceph: mon4 172.22.0.18:6789 session lost, hunting for new mon
Mar 28 20:38:57 neon kernel: [1237909.895888] libceph: mon1 172.21.0.18:6789 session established
Mar 28 20:39:31 neon kernel: [1237942.989104] libceph: mon0 172.21.0.16:6789 session established
Mar 28 20:39:31 neon kernel: [1237942.990244] libceph: client114108 fsid 819889bd-de05-4bf5-ab43-da16d93f9308

I suspect that all this is related to the kernel routing table, which is
altered by pppd, restored to its original value, then re-updated when
the PPP link re-opens. I have experienced similar problems with some
daemons like dnsmasq, ntpd, etc., where the only solution seems to be
to restart those daemons.
I may have to unmount/remount cephfs to get the same effect. I'll also
try cephfs/Fuse.
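
Something along these lines, dropped into /etc/ppp/ip-up.d/ on Debian,
is what I have in mind for the remount workaround (untested sketch ;
the mount point is an example, and a lazy unmount won't help if files
are held open):

#!/bin/sh
# remount the cephfs kernel mount once the PPP link (and the routes) are back,
# so that the client re-opens its MON sessions cleanly
MNT=/mnt/cephfs
if grep -q " $MNT ceph " /proc/mounts; then
    umount -l "$MNT" && mount "$MNT"    # needs a matching fstab entry
fi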

Did anyone dig into the cause of this flurry of messages?

TIA,

-- 
Nicolas Huillard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kernel version for Debian 9 CephFS/RBD clients

2018-03-23 Thread Nicolas Huillard
Le vendredi 23 mars 2018 à 12:14 +0100, Ilya Dryomov a écrit :
> On Fri, Mar 23, 2018 at 11:48 AM,  <c...@jack.fr.eu.org> wrote:
> > The stock kernel from Debian is perfect
> > Spectre / meltdown mitigations are worthless for a Ceph point of
> > view,
> > and should be disabled (again, strictly from a Ceph point of view)

I know that Ceph itself doesn't need this, but the ceph client
machines, especially those hosting VMs or more diverse code, should
have those mitigations.

> > If you need the luminous features, using the userspace
> > implementations
> > is required (librbd via rbd-nbd or qemu, libcephfs via fuse etc)

I'd rather use the faster kernel cephfs implementation instead of fuse,
especially with the Meltdown PTI mitigation (I guess fuse implies twice
as many userland-to-kernel transitions, which are costly with PTI).
I have no idea yet re. RBD...

> luminous cluster-wide feature bits are supported since kernel 4.13.

This means that there are differences between 4.9 and 4.14 re. Ceph
features. I know that quotas are not supported yet in any kernel, but I
don't use them...
Are there performance/stability improvements in the kernel client that
would justify using 4.14 instead of 4.9 ? I can't find any such list
anywhere...
Since I'm building a new cluster, I'd rather choose the latest software
from the start if it's justified.
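
For the record, this is what I plan to check before settling on a
kernel (Luminous commands, a sketch only):

ceph features                           # feature bits advertised by daemons and connected clients
ceph osd dump | grep min_compat_client  # what the cluster currently requires
# if old kernel clients must remain supported (at the cost of upmap, etc.):
# ceph osd set-require-min-compat-client jewel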

-- 
Nicolas Huillard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Kernel version for Debian 9 CephFS/RBD clients

2018-03-23 Thread Nicolas Huillard
Hi all,

I'm using Luminous 12.2.4 on all servers, with Debian stock kernel.

I use the kernel cephfs/rbd on the client side, and have a choice of :
* stock Debian 9 kernel 4.9 : LTS, Spectre/Meltdown mitigations in
place, field-tested, probably old libceph inside.
* backports kernel 4.14 : probably better Luminous support, no
Spectre/Meltdown mitigations yet, much less tested (I may have
experienced a kernel-related PPPoE problem lately), not long-term.

Which client kernel would you suggest re. Ceph ?
Do the cephfs/rbd clients benefit from a really newer kernel ?
I expect that the Ceph server-side kernel doesn't really matter.

TIA,

-- 
Nicolas Huillard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)

2018-03-19 Thread Nicolas Huillard
On Monday, March 19, 2018 at 18:45, Nicolas Huillard wrote:
> > Then I tried to reduce the number of MDS, from 4 to 1, 

> Le lundi 19 mars 2018 à 19:15 +0300, Sergey Malinin a écrit :
> Forgot to mention, that in my setup the issue gone when I had
> reverted back to single MDS and switched dirfrag off. 

So it appears we had the same problem, and applied the same solution ;-)
I reverted mds_log_events_per_segment back to 1024 without problems.

Bandwidth utilisation is OK, destination (single SATA disk) throughput
depends on file sizes (lots of tiny files = 1MBps ; big files = 30MBps),
and running 2 rsyncs in parallel only improves things.

Thanks!

-- 
Nicolas Huillard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)

2018-03-19 Thread Nicolas Huillard
Le lundi 19 mars 2018 à 15:30 +0300, Sergey Malinin a écrit :
> Default for mds_log_events_per_segment is 1024, in my set up I ended
> up with 8192.
> I calculated that value like IOPS / log segments * 5 seconds (afaik
> MDS performs journal maintenance once in 5 seconds by default).

I tried 4096 from the initial 1024, then 8192 at the time of your
answer, then 16384, without much improvement...

Then I tried to reduce the number of MDS, from 4 to 1, which definitely
works (sorry if my initial mail didn't make it very clear that I was
using many MDSs, even though it mentioned mds.2).
I now have a low rate of metadata writes (40-50kBps), and the inter-DC
link load reflects the size and direction of the actual data.

I'll now try to reduce mds_log_events_per_segment back to its original
value (1024), because performance is not optimal, and stutters a bit
too much.
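
For the archives, roughly the commands involved (from memory, assuming
the filesystem is simply named "cephfs" ; if I remember correctly,
Luminous still needs an explicit deactivate for each extra rank):

# runtime change, to be repeated for each MDS daemon, plus the matching
# "mds log events per segment" line under [mds] in ceph.conf
ceph tell mds.<name> injectargs '--mds_log_events_per_segment 8192'

# back to a single active MDS
ceph fs set cephfs max_mds 1
ceph mds deactivate cephfs:3
ceph mds deactivate cephfs:2
ceph mds deactivate cephfs:1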

Thanks for your advice!

-- 
Nicolas Huillard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)

2018-03-19 Thread Nicolas Huillard
Le lundi 19 mars 2018 à 10:01 +, Sergey Malinin a écrit :
> I experienced the same issue and was able to reduce metadata writes
> by raising mds_log_events_per_segment to
> it’s original value multiplied several times.

I changed it from 1024 to 4096 :
* rsync status (1 line per file) scrolls much quicker
* OSD writes on the dashboard are much lower than reads now (they were
much higher before)
* metadata pool write rate is in the 20-800kBps range now, while
metadata reads are in the 20-80kBps range
* data pool reads are in the hundreds of kBps, which still seems very
low
* destination disk write rate is a bit larger than the data pool read
rate (expected for btrfs), but still low
* inter-DC network load is now 1-50Mbps

I'll monitor the Munin graphs in the long run.

I can't find any doc about that mds_log_events_per_segment setting,
especially on how to choose a good value.
Can you elaborate on "original value multiplied several times" ?

I'm just seeing more MDS_TRIM warnings now. Maybe restarting the MDSs
just delayed re-emergence of the initial problem.

> 
> From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of
> Nicolas Huillard <nhuill...@dolomede.fr>
> Sent: Monday, March 19, 2018 12:01:09 PM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Huge amount of cephfs metadata writes while
> only reading data (rsync from storage, to single disk)
> 
> Hi all,
> 
> I'm experimenting with a new little storage cluster. I wanted to take
> advantage of the week-end to copy all data (1TB, 10M objects) from
> the
> cluster to a single SATA disk. I expected to saturate the SATA disk
> while writing to it, but the storage cluster actually saturates its
> network links, while barely writing to the destination disk (63GB
> written in 20h, that's less than 1MBps).
> 
> Setup : 2 datacenters × 3 storage servers × 2 disks/OSD each,
> Luminous
> 12.2.4 on Debian stretch, 1Gbps shared network, 200Mbps fibre link
> between datacenters (12ms latency). 4 clients using a single cephfs
> storing data + metadata on the same spinning disks with bluestore.
> 
> Test : I'm using a single rsync on one of the client servers (the
> other
> 3 are just sitting there). rsync is local to the client, copying from
> the cephfs mount (kernel client on 4.14 from stretch-backports, just
> to
> use a potentially more recent cephfs client than on stock 4.9), to
> the
> SATA disk. The rsync'ed tree consists of lots a tiny files (1-3kB) on
> deep directory branches, along with some large files (10-100MB) in a
> few directories. There is no other activity on the cluster.
> 
> Observations : I initially saw write performance on the destination
> disk from a few 100kBps (during exploration of branches with tiny
> file)
> to a few 10MBps (while copying large files), essentially seeing the
> file names scrolling at a relatively fixed rate, unrelated to their
> individual size.
> After 5 hours, the fibre link stated to saturate at 200Mbps, while
> destination disk writes is down to a few 10kBps.
> 
> Using the dashboard, I see lots of metadata writes, at 30MBps rate on
> the metadata pool, which correlates to the 200Mbps link rate.
> It also shows regular "Health check failed: 1 MDSs behind on trimming
> (MDS_TRIM)" / "MDS health message (mds.2): Behind on trimming
> (64/30)".
> 
> I wonder why cephfs would write anything to the metadata (I'm
> mounting
> on the clients with "noatime"), while I'm just reading data from
> it...
> What could I tune to reduce that write-load-while-reading-only ?
> 
> --
> Nicolas Huillard
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
-- 
Nicolas Huillard
Associé fondateur - Directeur Technique - Dolomède

nhuill...@dolomede.fr
Fixe : +33 9 52 31 06 10
Mobile : +33 6 50 27 69 08
http://www.dolomede.fr/

https://reseauactionclimat.org/planetman/
http://climat-2020.eu/
http://www.350.org/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)

2018-03-19 Thread Nicolas Huillard
Hi all,

I'm experimenting with a new little storage cluster. I wanted to take
advantage of the week-end to copy all data (1TB, 10M objects) from the
cluster to a single SATA disk. I expected to saturate the SATA disk
while writing to it, but the storage cluster actually saturates its
network links, while barely writing to the destination disk (63GB
written in 20h, that's less than 1MBps).

Setup : 2 datacenters × 3 storage servers × 2 disks/OSD each, Luminous
12.2.4 on Debian stretch, 1Gbps shared network, 200Mbps fibre link
between datacenters (12ms latency). 4 clients using a single cephfs
storing data + metadata on the same spinning disks with bluestore.

Test : I'm using a single rsync on one of the client servers (the other
3 are just sitting there). rsync is local to the client, copying from
the cephfs mount (kernel client on 4.14 from stretch-backports, just to
use a potentially more recent cephfs client than on stock 4.9), to the
SATA disk. The rsync'ed tree consists of lots of tiny files (1-3kB) on
deep directory branches, along with some large files (10-100MB) in a
few directories. There is no other activity on the cluster.

Observations : I initially saw write performance on the destination
disk range from a few 100kBps (during exploration of branches with tiny
files) to a few 10MBps (while copying large files), essentially seeing
the file names scrolling at a relatively fixed rate, unrelated to their
individual size.
After 5 hours, the fibre link started to saturate at 200Mbps, while
destination disk writes went down to a few 10kBps.

Using the dashboard, I see lots of metadata writes, at a 30MBps rate on
the metadata pool, which correlates with the 200Mbps link rate.
It also shows regular "Health check failed: 1 MDSs behind on trimming
(MDS_TRIM)" / "MDS health message (mds.2): Behind on trimming (64/30)".

I wonder why cephfs would write anything to the metadata (I'm mounting
on the clients with "noatime"), while I'm just reading data from it...
What could I tune to reduce that write-load-while-reading-only ?

-- 
Nicolas Huillard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Multiple storage sites for disaster recovery and/or active-active failover

2016-10-06 Thread Nicolas Huillard
Hello,

I'm new to Ceph, and currently evaluating a deployment strategy.

I'm planning to create a sort of home-hosting (web and compute hosting,
database, etc.), distributed across various locations (cities), extending
the "commodity hardware" concept to "commodity data-center" and
"commodity connectivity". Anything is expected to fail (disks, servers,
switches, routers, Internet fibre links, power, etc.), but the overall
service will still work.
Reaching the services from the outside relies on DNS tuning (add/remove
locations when they appear/fail, with a low TTL) and possibly proxies
(providing faster response time by routing traffic through a SPOF...).

Hardware will be recent Xeon D-1500 Mini-ITX motherboards with 2 or 4
Ethernet ports, 6 SATA and 1 NVMe ports, and no PCIe extension cards
(typically Supermicro X10SDV or AsrockRack D1541D4I). A dozen servers
are built in custom enclosures at each location with 12V redundant power
supplies, switches, monitoring, etc.
http://www.supermicro.com/products/motherboard/Xeon/D/X10SDV-F.cfm
http://www.asrockrack.com/general/productdetail.asp?Model=D1541D4I
http://www.pulspower.com/products/show/product/detail/cps20121/

I am wondering how Ceph could provide the location-to-location
fail-over, possibly adding an active-active feature. I'm planning to use
CephFS (shared storage) and RBD (VMs).
I'm not sure yet how to deal with Postgres and other specific services
replication.

Say I have a CephFS at location A, in read-write use at the moment,
serving HTTP requests via the websites/apps/etc. It should replicate
its data to location B, which could be in standby or read-only mode
(preferably), potentially serving HTTP requests (provided there are no
filesystem writes from those requests). The link between location A and
B (and potentially C, D, etc) is the same Internet fibre link from the
local ISP: not fault-tolerant, subject to latency, etc.
When the fibre link or power supply fails, the other locations should
notice, change the DNS settings (to disable the requests going to
location A), switch the CephFS to active or read-write mode, and
continue serving requests.
I can handle a few minutes of HTTP downtime, but data should always be
accessible from somewhere, possibly with a few minutes loss but no
crash.

As I read the docs, CephFS and RBD do not handle that situation. RadosGW
has a sort of data replication between clusters and/or pools.
I'm not sure if that problem is solved by the CRUSH rulesets, which
would have to be fine-tuned (say location A is a sort of "room" in the
CRUSH hierarchy; if I have 2 enclosures with 10 servers in the same
location, those enclosures are "racks", etc.)
Will CRUSH handle latency, failed links, failed power, etc?
How does it solve the CephFS need (active-standby or active-active)?
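
(As far as I understand the docs, the kind of layout I have in mind
would be declared roughly like this ; a sketch with hypothetical bucket
names, just to make the question concrete:)

ceph osd crush add-bucket location-a room
ceph osd crush add-bucket location-b room
ceph osd crush move location-a root=default
ceph osd crush move location-b root=default
ceph osd crush add-bucket enclosure-a1 rack
ceph osd crush move enclosure-a1 room=location-a
ceph osd crush move server01 rack=enclosure-a1
# plus a rule keeping replicas in distinct rooms, to be assigned to the pools
ceph osd crush rule create-simple by-room default room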

Nota: I'm also evaluating DRBD, which I know quite well and which has
evolved since my last setup; it does not solve the same low-level
problems, but may also be used in my case, obviously not at the same
scale.

Thanks in advance, for your reading patience and answers!

-- 
Nicolas Huillard
Associé fondateur - Directeur Technique - Dolomède

nhuill...@dolomede.fr
Fixe : +33 9 52 31 06 10
Mobile : +33 6 50 27 69 08
http://www.dolomede.fr/

http://www.350.org/

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com