Re: [ceph-users] deleted snap dirs are back as _origdir_1099536400705

2019-12-16 Thread Gregory Farnum
With just the one ls listing and my memory it's not totally clear, but
I believe this is the output you get when you delete a snapshot folder but
it's still referenced by a different snapshot farther up the
hierarchy.
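
If that's what happened here, the suffix on those entries is the inode number
of the directory the snapshots were actually taken on, so a rough way to track
it down from a client mount might be (assuming a mount at /mnt/cephfs; paths
and names are placeholders, so adjust to your setup):

  # locate the ancestor directory the inode suffix points at
  find /mnt/cephfs -inum 1099536400705 -type d
  # the snapshots still live in that directory's own .snap; removing them there
  # should make the _snap-N_1099536400705 entries disappear
  ls <that-directory>/.snap
  rmdir <that-directory>/.snap/snap-1
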
-Greg

On Mon, Dec 16, 2019 at 8:51 AM Marc Roos  wrote:
>
>
> Am I the only lucky one having this problem? Should I use the bugtracker
> system for this?
>
> -Original Message-
> From: Marc Roos
> Sent: 14 December 2019 10:05
> Cc: ceph-users
> Subject: Re: [ceph-users] deleted snap dirs are back as
> _origdir_1099536400705
>
>
>
> Running "ceph tell mds.a scrub start / recursive repair" did not fix this.
>
>
>
> -Original Message-
> Cc: ceph-users
> Subject: [ceph-users] deleted snap dirs are back as
> _origdir_1099536400705
>
>
> I thought I deleted snapshot dirs, but I still have them but with a
> different name. How to get rid of these?
>
> [@ .snap]# ls -1
> _snap-1_1099536400705
> _snap-2_1099536400705
> _snap-3_1099536400705
> _snap-4_1099536400705
> _snap-5_1099536400705
> _snap-6_1099536400705
> _snap-7_1099536400705
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



Re: [ceph-users] osds way ahead of gateway version?

2019-12-03 Thread Gregory Farnum
Unfortunately RGW doesn't test against extended version differences
like this and I don't think it's compatible across more than one major
release. Basically it's careful to support upgrades between long-term
stable releases but nothing else is expected to work.

That said, getting off of Giant would be good; it's quite old! :)
-Greg

On Tue, Dec 3, 2019 at 3:27 PM Philip Brown  wrote:
>
>
> I'm in a situation where it would be extremely strategically advantageous to 
> run some OSDs on luminous (so we can try out bluestore) while the gateways 
> stay on giant.
> Is this a terrible terrible thing, or can we reasonably get away with it?
>
> points of interest:
> 1. I plan to make a new pool for this and keep it all bluestore
> 2. we only use the cluster for RBDs.
>
> --
> Philip Brown| Sr. Linux System Administrator | Medata, Inc.
> 5 Peters Canyon Rd Suite 250
> Irvine CA 92606
> Office 714.918.1310| Fax 714.918.1325
> pbr...@medata.com| www.medata.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



Re: [ceph-users] Unexpected increase in the memory usage of OSDs

2019-10-09 Thread Gregory Farnum
On Wed, Oct 9, 2019 at 10:58 AM Vladimir Brik <vladimir.b...@icecube.wisc.edu> wrote:

> Best I can tell, automatic cache sizing is enabled and all related
> settings are at their default values.
>
> Looking through cache tunables, I came across
> osd_memory_expected_fragmentation, which the docs define as "estimate
> the percent of memory fragmentation". What's the formula to compute
> actual percentage of memory fragmentation?
>
> Based on /proc/buddyinfo, I suspect that our memory fragmentation is a
> lot worse than osd_memory_expected_fragmentation default of 0.15. Could
> this be related to many OSDs' RSSes far exceeding osd_memory_target?
>
> So far high memory consumption hasn't been a problem for us. (I guess
> it's possible that the kernel simply sees no need to reclaim unmapped
> memory until there is actually real memory pressure?)


Oh, that you can check on the admin socket using the “heap” family of
commands. It’ll tell you how much the daemon is actually using out of
what’s allocated, and IIRC what it’s given back to the OS but maybe hasn’t
actually been reclaimed.
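
For example, something like this against one of the affected OSDs (osd.0 is
just a placeholder, and this assumes the usual tcmalloc build):

  # tcmalloc's view of what the daemon is using vs. what it is holding on to
  ceph tell osd.0 heap stats
  # ask tcmalloc to hand freed-but-unreclaimed pages back to the OS
  ceph tell osd.0 heap release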

> It's just a little
> scary not understanding why this started happening when memory usage had
> been so stable before.


>
> Thanks,
>
> Vlad
>
>
>
> On 10/9/19 11:51 AM, Gregory Farnum wrote:
> > On Mon, Oct 7, 2019 at 7:20 AM Vladimir Brik
> >  wrote:
> >>
> >>   > Do you have statistics on the size of the OSDMaps or count of them
> >>   > which were being maintained by the OSDs?
> >> No, I don't think so. How can I find this information?
> >
> > Hmm I don't know if we directly expose the size of maps. There are
> > perfcounters which expose the range of maps being kept around but I
> > don't know their names off-hand.
> >
> > Maybe it's something else involving the bluestore cache or whatever;
> > if you're not using the newer memory limits I'd switch to those but
> > otherwise I dunno.
> > -Greg
> >
> >>
> >> Memory consumption started to climb again:
> >> https://icecube.wisc.edu/~vbrik/graph-3.png
> >>
> >> Some more info (not sure if relevant or not):
> >>
> >> I increased size of the swap on the servers to 10GB and it's being
> >> completely utilized, even though there is still quite a bit of free
> memory.
> >>
> >> It appears that memory is highly fragmented on the NUMA node 0 of all
> >> the servers. Some of the servers have no free pages higher than order 0.
> >> (Memory on NUMA node 1 of the servers appears much less fragmented.)
> >>
> >> The servers have 192GB of RAM, 2 NUMA nodes.
> >>
> >>
> >> Vlad
> >>
> >>
> >>
> >> On 10/4/19 6:09 PM, Gregory Farnum wrote:
> >>> Do you have statistics on the size of the OSDMaps or count of them
> >>> which were being maintained by the OSDs? I'm not sure why having noout
> >>> set would change that if all the nodes were alive, but that's my bet.
> >>> -Greg
> >>>
> >>> On Thu, Oct 3, 2019 at 7:04 AM Vladimir Brik
> >>>  wrote:
> >>>>
> >>>> And, just as unexpectedly, things have returned to normal overnight
> >>>> https://icecube.wisc.edu/~vbrik/graph-1.png
> >>>>
> >>>> The change seems to have coincided with the beginning of Rados Gateway
> >>>> activity (before, it was essentially zero). I can see nothing in the
> >>>> logs that would explain what happened though.
> >>>>
> >>>> Vlad
> >>>>
> >>>>
> >>>>
> >>>> On 10/2/19 3:43 PM, Vladimir Brik wrote:
> >>>>> Hello
> >>>>>
> >>>>> I am running a Ceph 14.2.2 cluster and a few days ago, memory
> >>>>> consumption of our OSDs started to unexpectedly grow on all 5 nodes,
> >>>>> after being stable for about 6 months.
> >>>>>
> >>>>> Node memory consumption: https://icecube.wisc.edu/~vbrik/graph.png
> >>>>> Average OSD resident size: https://icecube.wisc.edu/~vbrik/image.png
> >>>>>
> >>>>> I am not sure what changed to cause this. Cluster usage has been very
> >>>>> light (typically <10 iops) during this period, and the number of
> objects
> >>>>> stayed about the same.
> >>>>>
> >>>>> The only unusual occurrence was the reboot of one of

Re: [ceph-users] mon sudden crash loop - pinned map

2019-10-09 Thread Gregory Farnum
On Mon, Oct 7, 2019 at 11:11 PM Philippe D'Anjou
 wrote:
>
> Hi,
> unfortunately it's single mon, because we had major outage on this cluster 
> and it's just being used to copy off data now. We weren't able to add more 
> mons because once a second mon was added it crashed the first one (there's a 
> bug tracker ticket).
> I still have old rocksdb files before I ran a repair on it, but well it had 
> the rocksdb corruption issue (not sure why that happened, it ran fine for 
> 2months now).
>
> Any options? I mean everything still works, data is accessible, RBDs run, 
> only cephfs mount is obviously not working. For that short amount of time the 
> mon starts it reports no issues and all commands run fine.

Sounds like you actually lost some data. You'd need to manage a repair
by trying to figure out why CephFS needs that map and performing
surgery on either the monitor (to give it a fake map or fall back to
something else) or the CephFS data structures.

You might also be able to rebuild the CephFS metadata using the
disaster recovery tools to work around it, but no guarantees there
since I don't understand why CephFS is digging up OSD maps that nobody
else in the cluster cares about.
-Greg


> On Monday, October 7, 2019 at 21:59:20 OESZ, Gregory Farnum wrote:
>
>
> On Sun, Oct 6, 2019 at 1:08 AM Philippe D'Anjou
>  wrote:
> >
> > I had to use rocksdb repair tool before because the rocksdb files got 
> > corrupted, for another reason (another bug possibly). Maybe that is why now 
> > it crash loops, although it ran fine for a day.
>
> Yeah looks like it lost a bit of data. :/
>
> > What is meant with "turn it off and rebuild from remainder"?
>
> If only one monitor is crashing, you can remove it from the quorum,
> zap all the disks, and add it back so that it recovers from its
> healthy peers.
> -Greg
>
>
> >
> > On Saturday, October 5, 2019 at 02:03:44 OESZ, Gregory Farnum wrote:
> >
> >
> > Hmm, that assert means the monitor tried to grab an OSDMap it had on
> > disk but it didn't work. (In particular, a "pinned" full map which we
> > kept around after trimming the others to save on disk space.)
> >
> > That *could* be a bug where we didn't have the pinned map and should
> > have (or incorrectly thought we should have), but this code was in
> > Mimic as well as Nautilus and I haven't seen similar reports. So it
> > could also mean that something bad happened to the monitor's disk or
> > Rocksdb store. Can you turn it off and rebuild from the remainder, or
> > do they all exhibit this bug?
> >
> >
> > On Fri, Oct 4, 2019 at 5:44 AM Philippe D'Anjou
> >  wrote:
> > >
> > > Hi,
> > > our mon is acting up all of a sudden and dying in crash loop with the 
> > > following:
> > >
> > >
> > > 2019-10-04 14:00:24.339583 lease_expire=0.00 has v0 lc 4549352
> > >-3> 2019-10-04 14:00:24.335 7f6e5d461700  5 
> > > mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos active c 
> > > 4548623..4549352) is_readable = 1 - now=2019-10-04 14:00:24.339620 
> > > lease_expire=0.00 has v0 lc 4549352
> > >-2> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
> > > mon.km-fsn-1-dc4-m1-797678@0(leader).osd e257349 get_full_from_pinned_map 
> > > closest pinned map ver 252615 not available! error: (2) No such file or 
> > > directory
> > >-1> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
> > > /build/ceph-14.2.4/src/mon/OSDMonitor.cc: In function 'int 
> > > OSDMonitor::get_full_from_pinned_map(version_t, ceph::bufferlist&)' 
> > > thread 7f6e5d461700 time 2019-10-04 14:00:24.347580
> > > /build/ceph-14.2.4/src/mon/OSDMonitor.cc: 3932: FAILED ceph_assert(err == 
> > > 0)
> > >
> > >  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
> > > (stable)
> > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> > > const*)+0x152) [0x7f6e68eb064e]
> > >  2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char 
> > > const*, char const*, ...)+0) [0x7f6e68eb0829]
> > >  3: (OSDMonitor::get_full_from_pinned_map(unsigned long, 
> > > ceph::buffer::v14_2_0::list&)+0x80b) [0x72802b]
> > >  4: (OSDMonitor::get_version_full(unsigned long, unsigned long, 
> > > ceph::buffer::v14_2_0::list&)+0x3d2) [0x728c82]
> > >  5: 
> > > (OSDMonitor::encode_trim_extra(std::shared_ptr,
> > >  unsigned long)+0x8

Re: [ceph-users] Unexpected increase in the memory usage of OSDs

2019-10-09 Thread Gregory Farnum
On Mon, Oct 7, 2019 at 7:20 AM Vladimir Brik
 wrote:
>
>  > Do you have statistics on the size of the OSDMaps or count of them
>  > which were being maintained by the OSDs?
> No, I don't think so. How can I find this information?

Hmm I don't know if we directly expose the size of maps. There are
perfcounters which expose the range of maps being kept around but I
don't know their names off-hand.

Maybe it's something else involving the bluestore cache or whatever;
if you're not using the newer memory limits I'd switch to those but
otherwise I dunno.
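
If you want to poke at it anyway, a rough sketch (run on the OSD's host; osd.0
is a placeholder):

  # oldest_map/newest_map show the span of OSDMap epochs the OSD is keeping
  ceph daemon osd.0 status
  # fish for map-related perfcounters without knowing their exact names
  ceph daemon osd.0 perf dump | grep -i map
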
-Greg

>
> Memory consumption started to climb again:
> https://icecube.wisc.edu/~vbrik/graph-3.png
>
> Some more info (not sure if relevant or not):
>
> I increased size of the swap on the servers to 10GB and it's being
> completely utilized, even though there is still quite a bit of free memory.
>
> It appears that memory is highly fragmented on the NUMA node 0 of all
> the servers. Some of the servers have no free pages higher than order 0.
> (Memory on NUMA node 1 of the servers appears much less fragmented.)
>
> The servers have 192GB of RAM, 2 NUMA nodes.
>
>
> Vlad
>
>
>
> On 10/4/19 6:09 PM, Gregory Farnum wrote:
> > Do you have statistics on the size of the OSDMaps or count of them
> > which were being maintained by the OSDs? I'm not sure why having noout
> > set would change that if all the nodes were alive, but that's my bet.
> > -Greg
> >
> > On Thu, Oct 3, 2019 at 7:04 AM Vladimir Brik
> >  wrote:
> >>
> >> And, just as unexpectedly, things have returned to normal overnight
> >> https://icecube.wisc.edu/~vbrik/graph-1.png
> >>
> >> The change seems to have coincided with the beginning of Rados Gateway
> >> activity (before, it was essentially zero). I can see nothing in the
> >> logs that would explain what happened though.
> >>
> >> Vlad
> >>
> >>
> >>
> >> On 10/2/19 3:43 PM, Vladimir Brik wrote:
> >>> Hello
> >>>
> >>> I am running a Ceph 14.2.2 cluster and a few days ago, memory
> >>> consumption of our OSDs started to unexpectedly grow on all 5 nodes,
> >>> after being stable for about 6 months.
> >>>
> >>> Node memory consumption: https://icecube.wisc.edu/~vbrik/graph.png
> >>> Average OSD resident size: https://icecube.wisc.edu/~vbrik/image.png
> >>>
> >>> I am not sure what changed to cause this. Cluster usage has been very
> >>> light (typically <10 iops) during this period, and the number of objects
> >>> stayed about the same.
> >>>
> >>> The only unusual occurrence was the reboot of one of the nodes the day
> >>> before (a firmware update). For the reboot, I ran "ceph osd set noout",
> >>> but forgot to unset it until several days later. Unsetting noout did not
> >>> stop the increase in memory consumption.
> >>>
> >>> I don't see anything unusual in the logs.
> >>>
> >>> Our nodes have SSDs and HDDs. Resident set size of SSD ODSs is about
> >>> 3.7GB. Resident set size of HDD OSDs varies from about 5GB to 12GB. I
> >>> don't know why there is such a big spread. All HDDs are 10TB, 72-76%
> >>> utilized, with 101-104 PGs.
> >>>
> >>> Does anybody know what might be the problem here and how to address or
> >>> debug it?
> >>>
> >>>
> >>> Thanks very much,
> >>>
> >>> Vlad
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >



Re: [ceph-users] mon sudden crash loop - pinned map

2019-10-07 Thread Gregory Farnum
On Sun, Oct 6, 2019 at 1:08 AM Philippe D'Anjou
 wrote:
>
> I had to use rocksdb repair tool before because the rocksdb files got 
> corrupted, for another reason (another bug possibly). Maybe that is why now 
> it crash loops, although it ran fine for a day.

Yeah looks like it lost a bit of data. :/

> What is meant with "turn it off and rebuild from remainder"?

If only one monitor is crashing, you can remove it from the quorum,
zap all the disks, and add it back so that it recovers from its
healthy peers.
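
As a rough sketch of that procedure (assuming a mon id of "b", the default
cluster name and paths, and systemd units; your deployment tooling may wrap
these steps differently, so follow the add/remove-monitor docs for your setup):

  ceph mon remove b                  # drop it from the monmap
  # then, on the mon host:
  systemctl stop ceph-mon@b
  rm -rf /var/lib/ceph/mon/ceph-b    # zap its store
  ceph mon getmap -o /tmp/monmap
  ceph auth get mon. -o /tmp/keyring
  ceph-mon -i b --mkfs --monmap /tmp/monmap --keyring /tmp/keyring
  systemctl start ceph-mon@b         # it resynchronizes from the healthy peers
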
-Greg

>
> On Saturday, October 5, 2019 at 02:03:44 OESZ, Gregory Farnum wrote:
>
>
> Hmm, that assert means the monitor tried to grab an OSDMap it had on
> disk but it didn't work. (In particular, a "pinned" full map which we
> kept around after trimming the others to save on disk space.)
>
> That *could* be a bug where we didn't have the pinned map and should
> have (or incorrectly thought we should have), but this code was in
> Mimic as well as Nautilus and I haven't seen similar reports. So it
> could also mean that something bad happened to the monitor's disk or
> Rocksdb store. Can you turn it off and rebuild from the remainder, or
> do they all exhibit this bug?
>
>
> On Fri, Oct 4, 2019 at 5:44 AM Philippe D'Anjou
>  wrote:
> >
> > Hi,
> > our mon is acting up all of a sudden and dying in crash loop with the 
> > following:
> >
> >
> > 2019-10-04 14:00:24.339583 lease_expire=0.00 has v0 lc 4549352
> >-3> 2019-10-04 14:00:24.335 7f6e5d461700  5 
> > mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos active c 4548623..4549352) 
> > is_readable = 1 - now=2019-10-04 14:00:24.339620 lease_expire=0.00 has 
> > v0 lc 4549352
> >-2> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
> > mon.km-fsn-1-dc4-m1-797678@0(leader).osd e257349 get_full_from_pinned_map 
> > closest pinned map ver 252615 not available! error: (2) No such file or 
> > directory
> >-1> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
> > /build/ceph-14.2.4/src/mon/OSDMonitor.cc: In function 'int 
> > OSDMonitor::get_full_from_pinned_map(version_t, ceph::bufferlist&)' thread 
> > 7f6e5d461700 time 2019-10-04 14:00:24.347580
> > /build/ceph-14.2.4/src/mon/OSDMonitor.cc: 3932: FAILED ceph_assert(err == 0)
> >
> >  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
> > (stable)
> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> > const*)+0x152) [0x7f6e68eb064e]
> >  2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, 
> > char const*, ...)+0) [0x7f6e68eb0829]
> >  3: (OSDMonitor::get_full_from_pinned_map(unsigned long, 
> > ceph::buffer::v14_2_0::list&)+0x80b) [0x72802b]
> >  4: (OSDMonitor::get_version_full(unsigned long, unsigned long, 
> > ceph::buffer::v14_2_0::list&)+0x3d2) [0x728c82]
> >  5: 
> > (OSDMonitor::encode_trim_extra(std::shared_ptr,
> >  unsigned long)+0x8c) [0x717c3c]
> >  6: (PaxosService::maybe_trim()+0x473) [0x707443]
> >  7: (Monitor::tick()+0xa9) [0x5ecf39]
> >  8: (C_MonContext::finish(int)+0x39) [0x5c3f29]
> >  9: (Context::complete(int)+0x9) [0x6070d9]
> >  10: (SafeTimer::timer_thread()+0x190) [0x7f6e68f45580]
> >  11: (SafeTimerThread::entry()+0xd) [0x7f6e68f46e4d]
> >  12: (()+0x76ba) [0x7f6e67cab6ba]
> >  13: (clone()+0x6d) [0x7f6e674d441d]
> >
> >  0> 2019-10-04 14:00:24.347 7f6e5d461700 -1 *** Caught signal (Aborted) 
> > **
> >  in thread 7f6e5d461700 thread_name:safe_timer
> >
> >  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
> > (stable)
> >  1: (()+0x11390) [0x7f6e67cb5390]
> >  2: (gsignal()+0x38) [0x7f6e67402428]
> >  3: (abort()+0x16a) [0x7f6e6740402a]
> >  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> > const*)+0x1a3) [0x7f6e68eb069f]
> >  5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, 
> > char const*, ...)+0) [0x7f6e68eb0829]
> >  6: (OSDMonitor::get_full_from_pinned_map(unsigned long, 
> > ceph::buffer::v14_2_0::list&)+0x80b) [0x72802b]
> >  7: (OSDMonitor::get_version_full(unsigned long, unsigned long, 
> > ceph::buffer::v14_2_0::list&)+0x3d2) [0x728c82]
> >  8: 
> > (OSDMonitor::encode_trim_extra(std::shared_ptr,
> >  unsigned long)+0x8c) [0x717c3c]
> >  9: (PaxosService::maybe_trim()+0x473) [0x707443]
> >  10: (Monitor::tick()+0xa9) [0x5ecf39]
> >  11: (C_MonContext::finish(int)+0x39) [0x5c3f29]
> >  12: (Context::complete(int)+0x9) [0x6070d9]
> >  13: 

Re: [ceph-users] Unexpected increase in the memory usage of OSDs

2019-10-04 Thread Gregory Farnum
Do you have statistics on the size of the OSDMaps or count of them
which were being maintained by the OSDs? I'm not sure why having noout
set would change that if all the nodes were alive, but that's my bet.
-Greg

On Thu, Oct 3, 2019 at 7:04 AM Vladimir Brik
 wrote:
>
> And, just as unexpectedly, things have returned to normal overnight
> https://icecube.wisc.edu/~vbrik/graph-1.png
>
> The change seems to have coincided with the beginning of Rados Gateway
> activity (before, it was essentially zero). I can see nothing in the
> logs that would explain what happened though.
>
> Vlad
>
>
>
> On 10/2/19 3:43 PM, Vladimir Brik wrote:
> > Hello
> >
> > I am running a Ceph 14.2.2 cluster and a few days ago, memory
> > consumption of our OSDs started to unexpectedly grow on all 5 nodes,
> > after being stable for about 6 months.
> >
> > Node memory consumption: https://icecube.wisc.edu/~vbrik/graph.png
> > Average OSD resident size: https://icecube.wisc.edu/~vbrik/image.png
> >
> > I am not sure what changed to cause this. Cluster usage has been very
> > light (typically <10 iops) during this period, and the number of objects
> > stayed about the same.
> >
> > The only unusual occurrence was the reboot of one of the nodes the day
> > before (a firmware update). For the reboot, I ran "ceph osd set noout",
> > but forgot to unset it until several days later. Unsetting noout did not
> > stop the increase in memory consumption.
> >
> > I don't see anything unusual in the logs.
> >
> > Our nodes have SSDs and HDDs. Resident set size of SSD ODSs is about
> > 3.7GB. Resident set size of HDD OSDs varies from about 5GB to 12GB. I
> > don't know why there is such a big spread. All HDDs are 10TB, 72-76%
> > utilized, with 101-104 PGs.
> >
> > Does anybody know what might be the problem here and how to address or
> > debug it?
> >
> >
> > Thanks very much,
> >
> > Vlad
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] mon sudden crash loop - pinned map

2019-10-04 Thread Gregory Farnum
Hmm, that assert means the monitor tried to grab an OSDMap it had on
disk but it didn't work. (In particular, a "pinned" full map which we
kept around after trimming the others to save on disk space.)

That *could* be a bug where we didn't have the pinned map and should
have (or incorrectly thought we should have), but this code was in
Mimic as well as Nautilus and I haven't seen similar reports. So it
could also mean that something bad happened to the monitor's disk or
Rocksdb store. Can you turn it off and rebuild from the remainder, or
do they all exhibit this bug?


On Fri, Oct 4, 2019 at 5:44 AM Philippe D'Anjou
 wrote:
>
> Hi,
> our mon is acting up all of a sudden and dying in crash loop with the 
> following:
>
>
> 2019-10-04 14:00:24.339583 lease_expire=0.00 has v0 lc 4549352
> -3> 2019-10-04 14:00:24.335 7f6e5d461700  5 
> mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos active c 4548623..4549352) 
> is_readable = 1 - now=2019-10-04 14:00:24.339620 lease_expire=0.00 has v0 
> lc 4549352
> -2> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
> mon.km-fsn-1-dc4-m1-797678@0(leader).osd e257349 get_full_from_pinned_map 
> closest pinned map ver 252615 not available! error: (2) No such file or 
> directory
> -1> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
> /build/ceph-14.2.4/src/mon/OSDMonitor.cc: In function 'int 
> OSDMonitor::get_full_from_pinned_map(version_t, ceph::bufferlist&)' thread 
> 7f6e5d461700 time 2019-10-04 14:00:24.347580
> /build/ceph-14.2.4/src/mon/OSDMonitor.cc: 3932: FAILED ceph_assert(err == 0)
>
>  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x152) [0x7f6e68eb064e]
>  2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, 
> char const*, ...)+0) [0x7f6e68eb0829]
>  3: (OSDMonitor::get_full_from_pinned_map(unsigned long, 
> ceph::buffer::v14_2_0::list&)+0x80b) [0x72802b]
>  4: (OSDMonitor::get_version_full(unsigned long, unsigned long, 
> ceph::buffer::v14_2_0::list&)+0x3d2) [0x728c82]
>  5: 
> (OSDMonitor::encode_trim_extra(std::shared_ptr, 
> unsigned long)+0x8c) [0x717c3c]
>  6: (PaxosService::maybe_trim()+0x473) [0x707443]
>  7: (Monitor::tick()+0xa9) [0x5ecf39]
>  8: (C_MonContext::finish(int)+0x39) [0x5c3f29]
>  9: (Context::complete(int)+0x9) [0x6070d9]
>  10: (SafeTimer::timer_thread()+0x190) [0x7f6e68f45580]
>  11: (SafeTimerThread::entry()+0xd) [0x7f6e68f46e4d]
>  12: (()+0x76ba) [0x7f6e67cab6ba]
>  13: (clone()+0x6d) [0x7f6e674d441d]
>
>  0> 2019-10-04 14:00:24.347 7f6e5d461700 -1 *** Caught signal (Aborted) **
>  in thread 7f6e5d461700 thread_name:safe_timer
>
>  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
> (stable)
>  1: (()+0x11390) [0x7f6e67cb5390]
>  2: (gsignal()+0x38) [0x7f6e67402428]
>  3: (abort()+0x16a) [0x7f6e6740402a]
>  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x1a3) [0x7f6e68eb069f]
>  5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, 
> char const*, ...)+0) [0x7f6e68eb0829]
>  6: (OSDMonitor::get_full_from_pinned_map(unsigned long, 
> ceph::buffer::v14_2_0::list&)+0x80b) [0x72802b]
>  7: (OSDMonitor::get_version_full(unsigned long, unsigned long, 
> ceph::buffer::v14_2_0::list&)+0x3d2) [0x728c82]
>  8: 
> (OSDMonitor::encode_trim_extra(std::shared_ptr, 
> unsigned long)+0x8c) [0x717c3c]
>  9: (PaxosService::maybe_trim()+0x473) [0x707443]
>  10: (Monitor::tick()+0xa9) [0x5ecf39]
>  11: (C_MonContext::finish(int)+0x39) [0x5c3f29]
>  12: (Context::complete(int)+0x9) [0x6070d9]
>  13: (SafeTimer::timer_thread()+0x190) [0x7f6e68f45580]
>  14: (SafeTimerThread::entry()+0xd) [0x7f6e68f46e4d]
>  15: (()+0x76ba) [0x7f6e67cab6ba]
>  16: (clone()+0x6d) [0x7f6e674d441d]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
> interpret this.
>
>
> This was running fine for 2months now, it's a crashed cluster that is in 
> recovery.
>
> Any suggestions?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] RADOS EC: is it okay to reduce the number of commits required for reply to client?

2019-09-25 Thread Gregory Farnum
On Thu, Sep 19, 2019 at 12:06 AM Alex Xu  wrote:
>
> Hi Cephers,
>
> We are testing the write performance of Ceph EC (Luminous, 8 + 4), and
> noticed that tail latency is extremely high. Say, avgtime of the 10th
> commit is 40ms, acceptable as it's an all HDD cluster; 11th is 80ms,
> doubled; then 12th is 160ms, doubled again, which is not so good. Then
> we made a small modification and tested again, and did get a much
> better result. The patch is quite simple (for test only of course):
>
> --- a/src/osd/ECBackend.cc
> +++ b/src/osd/ECBackend.cc
> @@ -1188,7 +1188,7 @@ void ECBackend::handle_sub_write_reply(
>  i->second.on_all_applied = 0;
>  i->second.trace.event("ec write all applied");
>}
> -  if (i->second.pending_commit.empty() && i->second.on_all_commit) {
> +  if (i->second.pending_commit.size() == 2 &&
> i->second.on_all_commit) {  // 8 + 4 - 10 = 2
>  dout(10) << __func__ << " Calling on_all_commit on " << i->second << 
> dendl;
>  i->second.on_all_commit->complete(0);
>  i->second.on_all_commit = 0;
>
> As far as I can see, everything still goes well (maybe because of the
> rwlock in the primary OSD? not sure though), but I'm afraid it might break
> data consistency in some ways I'm not aware of. So I'm writing to ask if
> someone could kindly provide expertise comments on this or maybe share
> any known drawbacks. Thank you!

Unfortunately this is one of those things that will work okay in
everyday use but fail catastrophically if something else goes wrong.

Ceph assumes throughout its codebase that if a write committed to
disk, it can be retrieved as long as k OSDs are available from the
PG's acting set when that write was committed. But by letting writes
"commit" on less than the full acting set, you might lose some of the
OSDs which *did* ack the write and no longer be able to find out what
it was even though you have >k OSDs available!

This will result in a very, very confused Ceph recovery algorithm,
lots of badness, and unhappy cluster users. :(
-Greg

>
> PS: OSD is backended with filestore, not bluestore, if that matters.
>
> Regards,
> Alex



Re: [ceph-users] cephfs: apache locks up after parallel reloads on multiple nodes

2019-09-17 Thread Gregory Farnum
On Tue, Sep 17, 2019 at 8:12 AM Sander Smeenk  wrote:
>
> Quoting Paul Emmerich (paul.emmer...@croit.io):
>
> > Yeah, CephFS is much closer to POSIX semantics for a filesystem than
> > NFS. There's an experimental relaxed mode called LazyIO but I'm not
> > sure if it's applicable here.
>
> Out of curiosity, how would CephFS being more POSIX compliant cause
> this much delay in this situation? I'd understand if it would maybe
> take up to a second or maybe two, but almost fifteen minutes and then
> suddenly /all/ servers recover at the same time?
>
> Would this situation exist because we have so many open filehandles per
> server? Or could it also appear in a simpler "two servers share a
> CephFS" setup?
>
> I'm so curious to find out what /causes/ this.
> "Closer to POSIX sematics" doesn't cut it for me in this case.
> Not with the symptoms we're seeing.

Yeah this sounds weird. 15 minutes is one or two timers but I can't
think of anything that should be related here.

I'd look and see what sys calls the apache daemons are making and how
long they're taking; in particular what's different between the first
server and the rest. If they're doing a lot of the same syscalls but
just much slower on the follow-on servers, that probably indicates
they're all hammering the CephFS cluster with conflicting updates
(especially if they're writes!) that NFS simply ignored and collapsed.
If it's just one syscall that takes minutes to complete, check the mds
admin socket for ops_in_flight.
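
Something along these lines (the PID and MDS name are placeholders):

  # which syscalls are the stuck apache workers making, and how long do they take?
  strace -f -T -p <apache-worker-pid>
  # on the active MDS: are any client requests stuck in flight?
  ceph daemon mds.<name> dump_ops_in_flight
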
-Greg

>
>
> > You can debug this by dumping slow requests from the MDS servers via
> > the admin socket
>
> As far as i understood, there's not much to see on the MDS servers when
> this issue pops op. E.g. no slow ops logged during this event.
>
>
> Regards,
> -Sndr.
> --
> | I think i want a job cleaning mirrors...
> | It's just something i can really see myself doing...
> | 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7  FBD6 F3A9 9442 20CC 6CD2
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] Help understanding EC object reads

2019-09-09 Thread Gregory Farnum
On Thu, Aug 29, 2019 at 4:57 AM Thomas Byrne - UKRI STFC
 wrote:
>
> Hi all,
>
> I’m investigating an issue with our (non-Ceph) caching layers of our large EC 
> cluster. It seems to be turning users requests for whole objects into lots of 
> small byte range requests reaching the OSDs, but I’m not sure how inefficient 
> this behaviour is in reality.
>
> My limited understanding of an EC object partial read is that the entire 
> object is reconstructed on the primary OSD, and then the requested byte range 
> is sent to the client before the primary discards the reconstructed object.

Ah, it's not necessarily the case that the entire object is reconstructed, but that
any stripes covering the requested range are reconstructed. It's
changed a bit over time and there are some knobs controlling it, but I
believe this is generally efficient — if you ask for a byte range
which simply lives on the primary, it's not going to talk to the other
OSDs to provide that data.

>
> Assuming this is correct, do multiple reads for different byte ranges of the 
> same object at effectively the same time result in the entire object being 
> reconstructed once for each request, or does the primary do something clever 
> and use the same reconstructed object for multiple requests before discarding 
> it?

I'm pretty sure it's per-request; the EC pool code generally assumes
you have another cache on top of RADOS that deals with combining these
requests.
There is a small cache in the OSD but IIRC it's just for keeping stuff
consistent while writes are in progress.
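
If you want to check which of those knobs apply to your pool, a quick sketch
(pool and profile names are placeholders):

  # which EC profile (k, m, stripe_unit, plugin) the pool was created with
  ceph osd pool get <pool> erasure_code_profile
  ceph osd erasure-code-profile get <profile-name>
  # whether reads go to all shards and the primary answers from the first k to arrive
  ceph osd pool get <pool> fast_read
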
-Greg

>
> If I’m completely off the mark with what is going on under the hood here, a 
> nudge in the right direction would be appreciated!
>
>
>
> Cheers,
>
> Tom
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can kstore be used as OSD objectstore backend when deploying a Ceph Storage Cluster? If can, how to?

2019-08-07 Thread Gregory Farnum
No; KStore is not for real use AFAIK.

On Wed, Aug 7, 2019 at 12:24 AM R.R.Yuan  wrote:
>
> Hi, All,
>
>When deploying a development cluster, there are three types of OSD
> objectstore backend: filestore, bluestore and kstore.
>But there is no "--kstore" option when using the "ceph-deploy osd" command
> to deploy a real ceph cluster.
>
>Can kstore be used as the OSD objectstore backend when deploying a real ceph
> cluster? If so, how?
>
>
> Thanks a lot
> R.R.Yuan
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Recovery/Internals Questions

2019-08-04 Thread Gregory Farnum
On Fri, Aug 2, 2019 at 12:13 AM Pierre Dittes  wrote:
>
> Hi,
> we had a major mess-up with our CephFS. Long story short: no journal backup 
> and journal was truncated.
> Now..I still see a metadata pool with all objects and datapool is fine, from 
> what I know neither was corrupted. Last mount attempt showed a blank FS 
> though.
>
> What are the proper steps now to restore everything to be visible and useable 
> again? I found the documentation very confusing, many things were left 
> unexplained.
>
> Another step that was taken before truncation was the dentries summary 
> command, docs say "stored in the backing store", whatever that means.
>
> cephfs_metadata   18   136 GiB     4.63M   137 GiB    0.66   6.7 TiB
> cephfs_data       19   272 TiB   434.65M   861 TiB   72.04   111 TiB
>
> Any input is helpful

If the expert-only disaster recovery steps are confusing to you, and
yet some of them got run (since your journal was truncated), you're
going to need to be a lot clearer about the story for anyone to be
able to help you.

CephFS metadata is stored in per-directory onode objects within RADOS
(the "backing store"), but in order to aggregate IO and deal with
atomicity and other sorts of things we stream metadata updates into a
per-MDS journal before it goes into the backing store. If you have
some very hot metadata it may be that the backing store is quite stale
and there are a number of updated versions within the journal.

If the journal is gone, some inodes may have been lost if they were
never flushed to begin with. (Although perhaps they were, if you ran
the "recover_dentries summary" option?) To rebuild a working tree
you'll need to do the full backwards scrub with cephfs-data-scan
(https://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/).
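
For reference, the backward scrub in that document boils down to running the
cephfs-data-scan phases against the data pool (the pool name below is taken
from your ceph df output; these are destructive, expert-only steps, so keep
the filesystem offline and follow the linked page exactly rather than this
sketch):

  cephfs-data-scan init
  cephfs-data-scan scan_extents cephfs_data
  cephfs-data-scan scan_inodes cephfs_data
  cephfs-data-scan scan_links
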
-Greg

>
> Thanks
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adventures with large RGW buckets

2019-08-01 Thread Gregory Farnum
On Thu, Aug 1, 2019 at 12:06 PM Eric Ivancich  wrote:
>
> Hi Paul,
>
> I’ll interleave responses below.
>
> On Jul 31, 2019, at 2:02 PM, Paul Emmerich  wrote:
>
> What could the bucket deletion of the future look like? Would it be possible
> to put all objects in buckets into RADOS namespaces and implement some kind
> of efficient namespace deletion on the OSD level similar to how pool deletions
> are handled at a lower level?
>
> I’ll raise that with other RGW developers. I’m unfamiliar with how RADOS 
> namespaces are handled.

I expect RGW could do this, but unfortunately deleting namespaces at
the RADOS level is not practical. People keep asking and maybe in some
future world it will be cheaper, but a namespace is effectively just
part of the object name (and I don't think it's even the first thing
they sort by for the key entries in metadata tracking!), so deleting a
namespace would be equivalent to deleting a snapshot[1] but with the
extra cost that namespaces can be created arbitrarily on every write
operation (so our solutions for handling snapshots without it being
ludicrously expensive wouldn't apply). Deleting a namespace from the
OSD-side using map updates would require the OSD to iterate through
just about all the objects they have and examine them for deletion.

Is it cheaper than doing over the network? Sure. Is it cheap enough
we're willing to let a single user request generate that kind of
cluster IO on an unconstrained interface? Absolutely not.
-Greg
[1]: Deleting snapshots is only feasible because every OSD maintains a
sorted secondary index from snapid->set. This is only
possible because snapids are issued by the monitors and clients
cooperate in making sure they can't get reused after being deleted.
Namespaces are generated by clients and there are no constraints on
their use, reuse, or relationship to each other. We could maybe work
around these problems, but it'd be building a fundamentally different
interface than what namespaces currently are.
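
To make that last point concrete, here's a minimal librados (C++) sketch of how
a client "creates" a namespace simply by writing into it; the pool name, client
id and namespace string are placeholders, and error handling is mostly omitted:

  #include <rados/librados.hpp>

  int main() {
    librados::Rados cluster;
    cluster.init("admin");                   // client id -- placeholder
    cluster.conf_read_file(nullptr);         // default ceph.conf search path
    if (cluster.connect() < 0) return 1;

    librados::IoCtx io;
    if (cluster.ioctx_create("mypool", io) < 0) return 1;  // pool -- placeholder
    io.set_namespace("made-up-on-the-fly");  // any string; nothing is registered

    librados::bufferlist bl;
    bl.append("hello");
    int r = io.write_full("obj1", bl);       // object now lives in that namespace

    cluster.shutdown();
    return r < 0 ? 1 : 0;
  }

Nothing cluster-side records that the namespace exists until an object is
written into it, which is exactly why there's no cheap "delete everything in
this namespace" primitive.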


Re: [ceph-users] details about cloning objects using librados

2019-08-01 Thread Gregory Farnum
On Wed, Jul 31, 2019 at 10:31 PM nokia ceph  wrote:
>
> Thank you Greg,
>
> Another question: we need to give a new destination object, so that we can 
> read it separately, in parallel with the src object. This function resides in 
> Objecter.h and seems to be internal; can it be used at the interface level, 
> and can we use it in our client? Currently we use librados.h in our client 
> to communicate with the ceph cluster.

copy_from is an ObjectOperation and is exposed via the librados C++ API
like all the others. It may not be in the simple
(, , ) interfaces. It may also not be
in the C API?

> Also any equivalent librados api for the command rados -p poolname  object> 

It's using the copy_from command we're discussing here. You can look
at the source as an example:
https://github.com/ceph/ceph/blob/master/src/tools/rados/rados.cc#L497
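
For what it's worth, a hedged sketch of driving that from the C++ API, modelled
on that do_copy() code (pool/object names and the client id are placeholders,
and the copy_from() signature is the Nautilus-era one, so double-check the
librados.hpp shipped with your release):

  #include <rados/librados.hpp>

  int main() {
    librados::Rados cluster;
    cluster.init("admin");                  // client id -- placeholder
    cluster.conf_read_file(nullptr);
    if (cluster.connect() < 0) return 1;

    librados::IoCtx io;
    if (cluster.ioctx_create("mypool", io) < 0) return 1;   // pool -- placeholder

    // the op is sent to the *destination* object; it pulls "src_obj" into it
    librados::ObjectWriteOperation op;
    op.copy_from("src_obj", io, 0 /* src version; the rados tool passes 0 */,
                 0 /* src fadvise flags */);
    int r = io.operate("dst_obj", &op);     // server-side copy, no data via client

    cluster.shutdown();
    return r < 0 ? 1 : 0;
  }

Build it with something like: g++ copy.cc -o copy -lrados
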
-Greg

>
> Thanks,
> Muthu
>
> On Wed, Jul 31, 2019 at 11:13 PM Gregory Farnum  wrote:
>>
>>
>>
>> On Wed, Jul 31, 2019 at 1:32 AM nokia ceph  wrote:
>>>
>>> Hi Greg,
>>>
>>> We were trying to implement this however having issues in assigning the 
>>> destination object name with this api.
>>> There is a rados command "rados -p  cp  " , is 
>>> there any librados api equivalent to this ?
>>
>>
>> The copyfrom operation, like all other ops, is directed to a specific 
>> object. The object you run it on is the destination; it copies the specified 
>> “src” object into itself.
>> -Greg
>>
>>>
>>> Thanks,
>>> Muthu
>>>
>>> On Fri, Jul 5, 2019 at 4:00 PM nokia ceph  wrote:
>>>>
>>>> Thank you Greg, we will try this out .
>>>>
>>>> Thanks,
>>>> Muthu
>>>>
>>>> On Wed, Jul 3, 2019 at 11:12 PM Gregory Farnum  wrote:
>>>>>
>>>>> Well, the RADOS interface doesn't have a great deal of documentation
>>>>> so I don't know if I can point you at much.
>>>>>
>>>>> But if you look at Objecter.h, you see that the ObjectOperation has
>>>>> this function:
>>>>> void copy_from(object_t src, snapid_t snapid, object_locator_t
>>>>> src_oloc, version_t src_version, unsigned flags, unsigned
>>>>> src_fadvise_flags)
>>>>>
>>>>> src: the object to copy from
>>>>> snapid: if you want to copy a specific snap instead of HEAD
>>>>> src_oloc: the object locator for the object
>>>>> src_version: the version of the object to copy from (helps identify if
>>>>> it was updated in the meantime)
>>>>> flags: probably don't want to set these, but see
>>>>> PrimaryLogPG::_copy_some for the choices
>>>>> src_fadvise_flags: these are the fadvise flags we have in various
>>>>> places that let you specify things like not to cache the data.
>>>>> Probably leave them unset.
>>>>>
>>>>> -Greg
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jul 3, 2019 at 2:47 AM nokia ceph  
>>>>> wrote:
>>>>> >
>>>>> > Hi Greg,
>>>>> >
>>>>> > Can you please share the api details  for COPY_FROM or any reference 
>>>>> > document?
>>>>> >
>>>>> > Thanks ,
>>>>> > Muthu
>>>>> >
>>>>> > On Wed, Jul 3, 2019 at 4:12 AM Brad Hubbard  wrote:
>>>>> >>
>>>>> >> On Wed, Jul 3, 2019 at 4:25 AM Gregory Farnum  
>>>>> >> wrote:
>>>>> >> >
>>>>> >> > I'm not sure how or why you'd get an object class involved in doing
>>>>> >> > this in the normal course of affairs.
>>>>> >> >
>>>>> >> > There's a copy_from op that a client can send and which copies an
>>>>> >> > object from another OSD into the target object. That's probably the
>>>>> >> > primitive you want to build on. Note that the OSD doesn't do much
>>>>> >>
>>>>> >> Argh! yes, good idea. We really should document that!
>>>>> >>
>>>>> >> > consistency checking (it validates that the object version matches an
>>>>> >> > input, but if they don't it just returns an error) so the client
>>>>> >> > application i

Re: [ceph-users] details about cloning objects using librados

2019-07-31 Thread Gregory Farnum
On Wed, Jul 31, 2019 at 1:32 AM nokia ceph  wrote:

> Hi Greg,
>
> We were trying to implement this however having issues in assigning the
> destination object name with this api.
> There is a rados command "rados -p  cp  " , is
> there any librados api equivalent to this ?
>

The copyfrom operation, like all other ops, is directed to a specific
object. The object you run it on is the destination; it copies the
specified “src” object into itself.
-Greg


> Thanks,
> Muthu
>
> On Fri, Jul 5, 2019 at 4:00 PM nokia ceph 
> wrote:
>
>> Thank you Greg, we will try this out .
>>
>> Thanks,
>> Muthu
>>
>> On Wed, Jul 3, 2019 at 11:12 PM Gregory Farnum 
>> wrote:
>>
>>> Well, the RADOS interface doesn't have a great deal of documentation
>>> so I don't know if I can point you at much.
>>>
>>> But if you look at Objecter.h, you see that the ObjectOperation has
>>> this function:
>>> void copy_from(object_t src, snapid_t snapid, object_locator_t
>>> src_oloc, version_t src_version, unsigned flags, unsigned
>>> src_fadvise_flags)
>>>
>>> src: the object to copy from
>>> snapid: if you want to copy a specific snap instead of HEAD
>>> src_oloc: the object locator for the object
>>> src_version: the version of the object to copy from (helps identify if
>>> it was updated in the meantime)
>>> flags: probably don't want to set these, but see
>>> PrimaryLogPG::_copy_some for the choices
>>> src_fadvise_flags: these are the fadvise flags we have in various
>>> places that let you specify things like not to cache the data.
>>> Probably leave them unset.
>>>
>>> -Greg
>>>
>>>
>>>
>>> On Wed, Jul 3, 2019 at 2:47 AM nokia ceph 
>>> wrote:
>>> >
>>> > Hi Greg,
>>> >
>>> > Can you please share the api details  for COPY_FROM or any reference
>>> document?
>>> >
>>> > Thanks ,
>>> > Muthu
>>> >
>>> > On Wed, Jul 3, 2019 at 4:12 AM Brad Hubbard 
>>> wrote:
>>> >>
>>> >> On Wed, Jul 3, 2019 at 4:25 AM Gregory Farnum 
>>> wrote:
>>> >> >
>>> >> > I'm not sure how or why you'd get an object class involved in doing
>>> >> > this in the normal course of affairs.
>>> >> >
>>> >> > There's a copy_from op that a client can send and which copies an
>>> >> > object from another OSD into the target object. That's probably the
>>> >> > primitive you want to build on. Note that the OSD doesn't do much
>>> >>
>>> >> Argh! yes, good idea. We really should document that!
>>> >>
>>> >> > consistency checking (it validates that the object version matches
>>> an
>>> >> > input, but if they don't it just returns an error) so the client
>>> >> > application is responsible for any locking needed.
>>> >> > -Greg
>>> >> >
>>> >> > On Tue, Jul 2, 2019 at 3:49 AM Brad Hubbard 
>>> wrote:
>>> >> > >
>>> >> > > Yes, this should be possible using an object class which is also a
>>> >> > > RADOS client (via the RADOS API). You'll still have some client
>>> >> > > traffic as the machine running the object class will still need to
>>> >> > > connect to the relevant primary osd and send the write
>>> (presumably in
>>> >> > > some situations though this will be the same machine).
>>> >> > >
>>> >> > > On Tue, Jul 2, 2019 at 4:08 PM nokia ceph <
>>> nokiacephus...@gmail.com> wrote:
>>> >> > > >
>>> >> > > > Hi Brett,
>>> >> > > >
>>> >> > > > I think I was wrong here in the requirement description. It is
>>> not about data replication , we need same content stored in different
>>> object/name.
>>> >> > > > We store video contents inside the ceph cluster. And our new
>>> requirement is we need to store same content for different users , hence
>>> need same content in different object name . if client sends write request
>>> for object x and sets number of copies as 100, then cluster has to clone
>>> 100 copies of object x and store it as object x1, objectx2,etc. Current

Re: [ceph-users] OSD's won't start - thread abort

2019-07-05 Thread Gregory Farnum
On Wed, Jul 3, 2019 at 11:09 AM Austin Workman  wrote:
> Decided that if all the data was going to move, I should adjust my jerasure 
> ec profile from k=4, m=1 -> k=5, m=1 with force (is this even recommended vs. 
> just creating new pools???)
>
> Initially it unset crush-device-class=hdd to be blank
> Re-set crush-device-class
> Couldn't determine if this had any effect on the move operations.
> Changed back to k=4

You can't change the EC parameters on existing pools; Ceph has no way
of dealing with that. If it's possible to change the profile and break
the pool (which given the striping mismatch you cite later seems to be
what happened), we need to fix that.
Can you describe the exact commands you ran in that timeline?


Re: [ceph-users] details about cloning objects using librados

2019-07-03 Thread Gregory Farnum
Well, the RADOS interface doesn't have a great deal of documentation
so I don't know if I can point you at much.

But if you look at Objecter.h, you see that the ObjectOperation has
this function:
void copy_from(object_t src, snapid_t snapid, object_locator_t
src_oloc, version_t src_version, unsigned flags, unsigned
src_fadvise_flags)

src: the object to copy from
snapid: if you want to copy a specific snap instead of HEAD
src_oloc: the object locator for the object
src_version: the version of the object to copy from (helps identify if
it was updated in the meantime)
flags: probably don't want to set these, but see
PrimaryLogPG::_copy_some for the choices
src_fadvise_flags: these are the fadvise flags we have in various
places that let you specify things like not to cache the data.
Probably leave them unset.

-Greg



On Wed, Jul 3, 2019 at 2:47 AM nokia ceph  wrote:
>
> Hi Greg,
>
> Can you please share the api details  for COPY_FROM or any reference document?
>
> Thanks ,
> Muthu
>
> On Wed, Jul 3, 2019 at 4:12 AM Brad Hubbard  wrote:
>>
>> On Wed, Jul 3, 2019 at 4:25 AM Gregory Farnum  wrote:
>> >
>> > I'm not sure how or why you'd get an object class involved in doing
>> > this in the normal course of affairs.
>> >
>> > There's a copy_from op that a client can send and which copies an
>> > object from another OSD into the target object. That's probably the
>> > primitive you want to build on. Note that the OSD doesn't do much
>>
>> Argh! yes, good idea. We really should document that!
>>
>> > consistency checking (it validates that the object version matches an
>> > input, but if they don't it just returns an error) so the client
>> > application is responsible for any locking needed.
>> > -Greg
>> >
>> > On Tue, Jul 2, 2019 at 3:49 AM Brad Hubbard  wrote:
>> > >
>> > > Yes, this should be possible using an object class which is also a
>> > > RADOS client (via the RADOS API). You'll still have some client
>> > > traffic as the machine running the object class will still need to
>> > > connect to the relevant primary osd and send the write (presumably in
>> > > some situations though this will be the same machine).
>> > >
>> > > On Tue, Jul 2, 2019 at 4:08 PM nokia ceph  
>> > > wrote:
>> > > >
>> > > > Hi Brett,
>> > > >
>> > > > I think I was wrong here in the requirement description. It is not 
>> > > > about data replication , we need same content stored in different 
>> > > > object/name.
>> > > > We store video contents inside the ceph cluster. And our new 
>> > > > requirement is we need to store same content for different users , 
>> > > > hence need same content in different object name . if client sends 
>> > > > write request for object x and sets number of copies as 100, then 
>> > > > cluster has to clone 100 copies of object x and store it as object x1, 
>> > > > objectx2,etc. Currently this is done in the client side where 
>> > > > objectx1, object x2...objectx100 are cloned inside the client and 
>> > > > write request sent for all 100 objects which we want to avoid to 
>> > > > reduce network consumption.
>> > > >
>> > > > Similar usecases are rbd snapshot , radosgw copy .
>> > > >
>> > > > Is this possible in object class ?
>> > > >
>> > > > thanks,
>> > > > Muthu
>> > > >
>> > > >
>> > > > On Mon, Jul 1, 2019 at 7:58 PM Brett Chancellor 
>> > > >  wrote:
>> > > >>
>> > > >> Ceph already does this by default. For each replicated pool, you can 
>> > > >> set the 'size' which is the number of copies you want Ceph to 
>> > > >> maintain. The accepted norm for replicas is 3, but you can set it 
>> > > >> higher if you want to incur the performance penalty.
>> > > >>
>> > > >> On Mon, Jul 1, 2019, 6:01 AM nokia ceph  
>> > > >> wrote:
>> > > >>>
>> > > >>> Hi Brad,
>> > > >>>
>> > > >>> Thank you for your response , and we will check this video as well.
>> > > >>> Our requirement is while writing an object into the cluster , if we 
>> > > >>> can provide number of copies to be made , the ne

Re: [ceph-users] How does monitor know OSD is dead?

2019-07-03 Thread Gregory Farnum
On Mon, Jul 1, 2019 at 8:56 PM Bryan Henderson  wrote:
>
> > Normally in the case of a restart then somebody who used to have a
> > connection to the OSD would still be running and flag it as dead. But
> > if *all* the daemons in the cluster lose their soft state, that can't
> > happen.
>
> OK, thanks.  I guess that explains it.  But that's a pretty serious design
> flaw, isn't it?  What I experienced is a pretty common failure mode: a power
> outage caused the entire cluster to die simultaneously, then when power came
> back, some OSDs didn't (the most common time for a server to fail is at
> startup).

I am a little surprised; the peer OSDs used to detect this. But we've
re-done the heartbeat logic a few times, and the combination of losing a
whole data center's worth of daemons and not having monitoring to check
whether they come back up actually isn't that common.

Can you create a tracker ticket with the version you're seeing it on
and any non-default configuration options you've set?
-Greg

>
> I wonder if I could close this gap with additional monitoring of my own.  I
> could have a cluster bringup protocol that detects OSD processes that aren't
> running after a while and mark those OSDs down.  It would be cleaner, though,
> if I could just find out from the monitor what OSDs are in the map but not
> connected to the monitor cluster.  Is that possible?
>
> A related question: If I mark an OSD down administratively, does it stay down
> until I give a command to mark it back up, or will the monitor detect signs of
> life and declare it up again on its own?
>
> --
> Bryan Henderson   San Jose, California


Re: [ceph-users] details about cloning objects using librados

2019-07-02 Thread Gregory Farnum
I'm not sure how or why you'd get an object class involved in doing
this in the normal course of affairs.

There's a copy_from op that a client can send and which copies an
object from another OSD into the target object. That's probably the
primitive you want to build on. Note that the OSD doesn't do much
consistency checking (it validates that the object version matches an
input, but if they don't it just returns an error) so the client
application is responsible for any locking needed.
-Greg

On Tue, Jul 2, 2019 at 3:49 AM Brad Hubbard  wrote:
>
> Yes, this should be possible using an object class which is also a
> RADOS client (via the RADOS API). You'll still have some client
> traffic as the machine running the object class will still need to
> connect to the relevant primary osd and send the write (presumably in
> some situations though this will be the same machine).
>
> On Tue, Jul 2, 2019 at 4:08 PM nokia ceph  wrote:
> >
> > Hi Brett,
> >
> > I think I was wrong here in the requirement description. It is not about 
> > data replication , we need same content stored in different object/name.
> > We store video contents inside the ceph cluster. And our new requirement is 
> > we need to store same content for different users , hence need same content 
> > in different object name . if client sends write request for object x and 
> > sets number of copies as 100, then cluster has to clone 100 copies of 
> > object x and store it as object x1, objectx2,etc. Currently this is done in 
> > the client side where objectx1, object x2...objectx100 are cloned inside 
> > the client and write request sent for all 100 objects which we want to 
> > avoid to reduce network consumption.
> >
> > Similar usecases are rbd snapshot , radosgw copy .
> >
> > Is this possible in object class ?
> >
> > thanks,
> > Muthu
> >
> >
> > On Mon, Jul 1, 2019 at 7:58 PM Brett Chancellor 
> >  wrote:
> >>
> >> Ceph already does this by default. For each replicated pool, you can set 
> >> the 'size' which is the number of copies you want Ceph to maintain. The 
> >> accepted norm for replicas is 3, but you can set it higher if you want to 
> >> incur the performance penalty.
> >>
> >> On Mon, Jul 1, 2019, 6:01 AM nokia ceph  wrote:
> >>>
> >>> Hi Brad,
> >>>
> >>> Thank you for your response , and we will check this video as well.
> >>> Our requirement is while writing an object into the cluster , if we can 
> >>> provide number of copies to be made , the network consumption between 
> >>> client and cluster will be only for one object write. However , the 
> >>> cluster will clone/copy multiple objects and stores inside the cluster.
> >>>
> >>> Thanks,
> >>> Muthu
> >>>
> >>> On Fri, Jun 28, 2019 at 9:23 AM Brad Hubbard  wrote:
> 
>  On Thu, Jun 27, 2019 at 8:58 PM nokia ceph  
>  wrote:
>  >
>  > Hi Team,
>  >
>  > We have a requirement to create multiple copies of an object and 
>  > currently we are handling it in client side to write as separate 
>  > objects and this causes huge network traffic between client and 
>  > cluster.
>  > Is there possibility of cloning an object to multiple copies using 
>  > librados api?
>  > Please share the document details if it is feasible.
> 
>  It may be possible to use an object class to accomplish what you want
>  to achieve but the more we understand what you are trying to do, the
>  better the advice we can offer (at the moment your description sounds
>  like replication which is already part of RADOS as you know).
> 
>  More on object classes from Cephalocon Barcelona in May this year:
>  https://www.youtube.com/watch?v=EVrP9MXiiuU
> 
>  >
>  > Thanks,
>  > Muthu
>  > ___
>  > ceph-users mailing list
>  > ceph-users@lists.ceph.com
>  > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
>  --
>  Cheers,
>  Brad
> >>>
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Cheers,
> Brad
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating a cephfs data pool

2019-07-01 Thread Gregory Farnum
On Fri, Jun 28, 2019 at 5:41 PM Jorge Garcia  wrote:
>
> Ok, actually, the problem was somebody writing to the filesystem. So I moved 
> their files and got to 0 objects. But then I tried to remove the original 
> data pool and got an error:
>
>   # ceph fs rm_data_pool cephfs cephfs-data
>   Error EINVAL: cannot remove default data pool
>
> So it seems I will never be able to remove the original data pool. I could 
> leave it there as a ghost pool, which is not optimal, but I guess there's 
> currently not a better option.

Yeah; CephFS writes its backtrace pointers (for inode-based lookups)
to the default data pool. Unfortunately we need all of those to live
in one known pool, and CephFS doesn't have a way to migrate them.
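If the goal is just to stop storing file data in the old pool, the usual
workaround is to leave it in place as the (mostly empty) default and point
the filesystem at the new pool with a directory layout, something like this
(pool and fs names are from your earlier mails; the mount point is just an
example):

$ ceph fs add_data_pool cephfs new-ec-pool     # already done in your case
$ setfattr -n ceph.dir.layout.pool -v new-ec-pool /mnt/cephfs
$ getfattr -n ceph.dir.layout /mnt/cephfs      # verify the layout took

Only newly created files pick up the directory layout; existing files keep
whatever pool their layout already points at.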
-Greg

>
> On 6/28/19 4:04 PM, Patrick Hein wrote:
>
> Afaik MDS doesn't delete the objects immediately but defer it for later. If 
> you check that again now, how many objects does it report?
>
> Jorge Garcia  schrieb am Fr., 28. Juni 2019, 23:16:
>>
>>
>> On 6/28/19 9:02 AM, Marc Roos wrote:
>> > 3. When everything is copied-removed, you should end up with an empty
>> > datapool with zero objects.
>>
>> I copied the data to a new directory and then removed the data from the
>> old directory, but df still reports some objects in the old pool (not
>> zero). Is there a way to track down what's still in the old pool, and
>> how to delete it?
>>
>> Before delete:
>>
>> # ceph df
>> GLOBAL:
>>  SIZEAVAIL   RAW USED %RAW USED
>>  392 TiB 389 TiB  3.3 TiB  0.83
>> POOLS:
>>  NAMEID USED%USED MAX AVAIL OBJECTS
>>  cephfs-meta  6   17 MiB 0   123 TiB 27
>>  cephfs-data   7  763 GiB  0.60   123 TiB 195233
>>  new-ec-pool  8  641 GiB  0.25   245 TiB 163991
>>
>> After delete:
>>
>> # ceph df
>> GLOBAL:
>>  SIZEAVAIL   RAW USED %RAW USED
>>  392 TiB 391 TiB  1.2 TiB  0.32
>> POOLS:
>>  NAMEID USED%USED MAX AVAIL OBJECTS
>>  cephfs-meta  6   26 MiB 0   124 TiB 29
>>  cephfs-data   7   83 GiB  0.07   124 TiB 21175
>>  new-ec-pool  8  641 GiB  0.25   247 TiB 163991
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How does monitor know OSD is dead?

2019-07-01 Thread Gregory Farnum
On Sat, Jun 29, 2019 at 8:13 PM Bryan Henderson  wrote:
>
> > I'm not sure why the monitor did not mark it _out_ after 600 seconds
> > (default)
>
> Well, that part I understand.  The monitor didn't mark the OSD out because the
> monitor still considered the OSD up.  No reason to mark an up OSD out.
>
> I think the monitor should have marked the OSD down upon not hearing from it
> for 15 minutes ("mon osd report interval"), then out 10 minutes after that
> ("mon osd down out interval").

It sounds like you had the whole cluster off and turned it on, and
those servers didn't come up. This is why.

The methods of detecting an OSD as down are
1) OSD heartbeat peers. That's as Robert describes (by default).
2) When an OSD is connected to a monitor, they heartbeat each other at
very long intervals and the monitor flags the OSD down if it
disappears and isn't connected to a different monitor.

In your case, the OSD wasn't connected to any monitor, and it hadn't
set up any heartbeat peers.

Normally, in the case of a restart, some daemon that used to have a
connection to the OSD would still be running and would flag it as dead. But
if *all* the daemons in the cluster lose their soft state, that can't
happen.
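
If you know an OSD is really gone you don't have to wait for detection at
all; you can mark it yourself and check the monitor timeouts involved
(osd id 3 and mon.a are just examples):

$ ceph osd down 3
$ ceph osd out 3
$ ceph daemon mon.a config show | grep -e mon_osd_report_timeout -e mon_osd_down_out_interval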
-Greg

>
> And that's worst case.  Though details of how OSDs watch each other are vague,
> I suspect an existing OSD was supposed to detect the dead OSDs and report that
> to the monitor, which would believe it within about a minute and mark the OSDs
> down.  ("osd heartbeat interval", "mon osd min down reports", "mon osd min 
> down
> reporters", "osd reporter subtree level").
>
> --
> Bryan Henderson   San Jose, California
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs taking a long time to boot due to 'clear_temp_objects', even with fresh PGs

2019-06-26 Thread Gregory Farnum
Awesome. I made a ticket and pinged the Bluestore guys about it:
http://tracker.ceph.com/issues/40557
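
For anyone else who hits this, a manual compaction can be kicked off with
something like (osd id and path are examples):

$ ceph daemon osd.12 compact
# or, with the OSD stopped:
$ ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-12 compact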

On Tue, Jun 25, 2019 at 1:52 AM Thomas Byrne - UKRI STFC
 wrote:
>
> I hadn't tried manual compaction, but it did the trick. The db shrunk down to 
> 50MB and the OSD booted instantly. Thanks!
>
> I'm confused as to why the OSDs weren't doing this themselves, especially as 
> the operation only took a few seconds. But for now I'm happy that this is 
> easy to rectify if we run into it again.
>
> I've uploaded the log of a slow boot with debug_bluestore turned up [1], and 
> I can provide other logs/files if anyone thinks they could be useful.
>
> Cheers,
> Tom
>
> [1] ceph-post-file: 1829bf40-cce1-4f65-8b35-384935d11446
>
> -Original Message-
> From: Gregory Farnum 
> Sent: 24 June 2019 17:30
> To: Byrne, Thomas (STFC,RAL,SC) 
> Cc: ceph-users 
> Subject: Re: [ceph-users] OSDs taking a long time to boot due to 
> 'clear_temp_objects', even with fresh PGs
>
> On Mon, Jun 24, 2019 at 9:06 AM Thomas Byrne - UKRI STFC 
>  wrote:
> >
> > Hi all,
> >
> >
> >
> > Some bluestore OSDs in our Luminous test cluster have started becoming 
> > unresponsive and booting very slowly.
> >
> >
> >
> > These OSDs have been used for stress testing for hardware destined for our 
> > production cluster, so have had a number of pools on them with many, many 
> > objects in the past. All these pools have since been deleted.
> >
> >
> >
> > When booting the OSDs, they spend a few minutes *per PG* in 
> > clear_temp_objects function, even for brand new, empty PGs. The OSD is 
> > hammering the disk during the clear_temp_objects, with a constant ~30MB/s 
> > read and all available IOPS consumed. The OSD will finish booting and come 
> > up fine, but will then start hammering the disk again and fall over at some 
> > point later, causing the cluster to gradually fall apart. I'm guessing 
> > something is 'not optimal' in the rocksDB.
> >
> >
> >
> > Deleting all pools will stop this behaviour and OSDs without PGs will 
> > reboot quickly and stay up, but creating a pool will cause OSDs that get 
> > even a single PG to start exhibiting this behaviour again.
> >
> >
> >
> > These are HDD OSDs, with WAL and rocksDB on disk. I would guess they are 
> > ~1yr old. Upgrading to 12.2.12 did not change this behaviour. A blueFS 
> > export of a problematic OSD's block device reveals a 1.5GB rocksDB (L0 - 
> > 63.80 KB, L1 - 62.39 MB,  L2 - 116.46 MB,  L3 - 1.38 GB), which seems 
> > excessive for an empty OSD, but it's also the first time I've looked into 
> > this so may be normal?
> >
> >
> >
> > Destroying and recreating an OSD resolves the issue for that OSD, which is 
> > acceptable for this cluster, but I'm a little concerned a similar thing 
> > could happen on a production cluster. Ideally, I would like to try and 
> > understand what has happened before recreating the problematic OSDs.
> >
> >
> >
> > Has anyone got any thoughts on what might have happened, or tips on how to 
> > dig further into this?
>
> Have you tried a manual compaction? The only other time I see this being 
> reported was for FileStore-on-ZFS and it was just very slow at metadata 
> scanning for some reason. ("[ceph-users] Hammer to Jewel Upgrade - Extreme 
> OSD Boot Time") There has been at least one PR about object listings being 
> slow in BlueStore when there are a lot of deleted objects, which would match 
> up with your many deleted pools/objects.
>
> If you have any debug logs the BlueStore devs might be interested in them to 
> check if the most recent patches will fix it.
> -Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs taking a long time to boot due to 'clear_temp_objects', even with fresh PGs

2019-06-24 Thread Gregory Farnum
On Mon, Jun 24, 2019 at 9:06 AM Thomas Byrne - UKRI STFC
 wrote:
>
> Hi all,
>
>
>
> Some bluestore OSDs in our Luminous test cluster have started becoming 
> unresponsive and booting very slowly.
>
>
>
> These OSDs have been used for stress testing for hardware destined for our 
> production cluster, so have had a number of pools on them with many, many 
> objects in the past. All these pools have since been deleted.
>
>
>
> When booting the OSDs, they spend a few minutes *per PG* in 
> clear_temp_objects function, even for brand new, empty PGs. The OSD is 
> hammering the disk during the clear_temp_objects, with a constant ~30MB/s 
> read and all available IOPS consumed. The OSD will finish booting and come up 
> fine, but will then start hammering the disk again and fall over at some 
> point later, causing the cluster to gradually fall apart. I'm guessing 
> something is 'not optimal' in the rocksDB.
>
>
>
> Deleting all pools will stop this behaviour and OSDs without PGs will reboot 
> quickly and stay up, but creating a pool will cause OSDs that get even a 
> single PG to start exhibiting this behaviour again.
>
>
>
> These are HDD OSDs, with WAL and rocksDB on disk. I would guess they are ~1yr 
> old. Upgrading to 12.2.12 did not change this behaviour. A blueFS export of a 
> problematic OSD's block device reveals a 1.5GB rocksDB (L0 - 63.80 KB, L1 - 
> 62.39 MB,  L2 - 116.46 MB,  L3 - 1.38 GB), which seems excessive for an empty 
> OSD, but it's also the first time I've looked into this so may be normal?
>
>
>
> Destroying and recreating an OSD resolves the issue for that OSD, which is 
> acceptable for this cluster, but I'm a little concerned a similar thing could 
> happen on a production cluster. Ideally, I would like to try and understand 
> what has happened before recreating the problematic OSDs.
>
>
>
> Has anyone got any thoughts on what might have happened, or tips on how to 
> dig further into this?

Have you tried a manual compaction? The only other time I've seen this
reported was for FileStore-on-ZFS, and it was just very slow at
metadata scanning for some reason. ("[ceph-users] Hammer to Jewel
Upgrade - Extreme OSD Boot Time") There has been at least one PR about
object listings being slow in BlueStore when there are a lot of
deleted objects, which would match up with your many deleted
pools/objects.

If you have any debug logs the BlueStore devs might be interested in
them to check if the most recent patches will fix it.
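
Something like this in ceph.conf on the affected host before restarting the
slow OSD, then upload the result (osd id and path are examples):

[osd]
    debug osd = 20
    debug bluestore = 20

$ ceph-post-file /var/log/ceph/ceph-osd.12.log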
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Monitor stuck at "probing"

2019-06-20 Thread Gregory Farnum
Just nuke the monitor's store, remove it from the existing quorum, and
start over again. Injecting maps correctly is non-trivial and obviously
something went wrong, and re-syncing a monitor is pretty cheap.
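
Roughly this, using your mon name "tc" (run the remove from a host that is
still in quorum):

$ systemctl stop ceph-mon@tc
$ ceph mon remove tc
$ mv /var/lib/ceph/mon/ceph-tc /var/lib/ceph/mon/ceph-tc.bak
$ ceph auth get mon. -o /tmp/mon.keyring
$ ceph mon getmap -o /tmp/monmap
$ ceph-mon -i tc --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
$ chown -R ceph:ceph /var/lib/ceph/mon/ceph-tc
$ systemctl start ceph-mon@tc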

On Thu, Jun 20, 2019 at 6:46 AM ☣Adam  wrote:

> Anyone have any suggestions for how to troubleshoot this issue?
>
>
>  Forwarded Message 
> Subject: Monitor stuck at "probing"
> Date: Fri, 14 Jun 2019 21:40:39 -0500
> From: ☣Adam 
> To: ceph-users@lists.ceph.com
>
> I have a monitor which I just can't seem to get to join the quorum, even
> after injecting a monmap from one of the other servers.[1]  I use NTP on
> all servers and also manually verified the clocks are synchronized.
>
>
> My monitors are named: ceph0, ceph2, xe, and tc
>
> I'm transitioning away from the ceph# naming scheme, so please forgive
> the confusing [lack of a] naming convention.
>
>
> The relevant output from: ceph -s
> 1/4 mons down, quorum ceph0,ceph2,xe
> mon: 4 daemons, quorum ceph0,ceph2,xe, out of quorum: tc
>
>
> tc is up, bound to the expected IP address, and the ceph-mon service can
> be reached from xe, ceph0 and ceph2 using telnet.  The mon_host and
> mon_initial_members from `ceph daemon mon.tc config show` look correct.
>
> mon_status on tc shows the state as "probing" and the list of
> "extra_probe_peers" looks correct (correct IP addresses, and ports).
> However the monmap section looks wrong.  The "mons" has all 4 servers,
> but the addr and public_addr values are 0.0.0.0:0.  Furthermore it says
> the monmap epoch is 4.  I don't understand why because I just injected a
> monmap which has an epoch of 7.
>
> Here's the output of: monmaptool --print ./monmap
> monmaptool: monmap file ./monmap
> epoch 7
> fsid a690e404-3152-4804-a960-8b52abf3bd65
> last_changed 2019-06-02 17:38:50.161035
> created 2018-12-28 20:26:41.443339
> 0: 192.168.60.10:6789/0 mon.ceph0
> 1: 192.168.60.11:6789/0 mon.tc
> 2: 192.168.60.12:6789/0 mon.ceph2
> 3: 192.168.60.53:6789/0 mon.xe
>
> When I injected it, I stopped ceph-mon, ran:
> sudo ceph-mon -i tc --inject-monmap ./monmap
>
> and started ceph-mon again.  I then rebooted to see if it would fix this
> epoch/addr issue.  It did not.
>
> I'm attaching what I believe is the relevant section of my log file from
> the tc monitor.  I ran `ceph auth list` on tc and ceph2 and verified
> that the output is identical.  This check was based on what I saw in the
> log and what I read in a blog post.[2]
>
> What are the next steps in troubleshooting this issue?
>
>
> Thanks,
> Adam
>
>
> [1]
> http://docs.ceph.com/docs/jewel/rados/troubleshooting/troubleshooting-mon/
> [2]
>
> https://medium.com/@george.shuklin/silly-mistakes-with-ceph-mon-9ef6c9eaab54
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How does cephfs ensure client cache consistency?

2019-06-18 Thread Gregory Farnum
On Tue, Jun 18, 2019 at 2:26 AM ?? ??  wrote:
>
> Thank you very much! Can you point out where is the code of revoke?

The caps code is all over the code base as it's fundamental to the
filesystem's workings. You can get some more general background in my
recent Cephalocon talk "What are “caps”? (And Why Won’t my Client Drop
Them?)" https://www.youtube.com/watch?v=VgNI5RQJGp0, slides available
at https://static.sched.com/hosted_files/cephalocon2019/dd/CephFS%20Caps.pdf
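
If you just want to watch caps on a live cluster, the MDS admin socket is
handy (the daemon name is an example):

$ ceph daemon mds.a session ls           # clients and how many caps each holds
$ ceph daemon mds.a dump_ops_in_flight   # current requests and what they're waiting on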

For programmatic details you will probably need to go code-diving on
your own. There is some developer documentation at
http://docs.ceph.com/docs/master/dev/internals/ which may have some
hints, and some old videos floating around (search the mailing list
archives) of in-person unstructured developer introductions we've done
but there isn't any explicit written or recorded introduction.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] balancer module makes OSD distribution worse

2019-06-05 Thread Gregory Farnum
I think the mimic balancer doesn't include omap data when trying to
balance the cluster. (Because it doesn't get usable omap stats from
the cluster anyway; in Nautilus I think it does.) Are you using RGW or
CephFS?
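
Either way it's worth checking what the balancer itself thinks it can do,
and if all your clients are Luminous or newer, upmap mode usually does a
much better job than crush-compat:

$ ceph balancer eval
$ ceph osd set-require-min-compat-client luminous
$ ceph balancer mode upmap
$ ceph balancer optimize myplan
$ ceph balancer show myplan      # review it first
$ ceph balancer execute myplan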
-Greg

On Wed, Jun 5, 2019 at 1:01 PM Josh Haft  wrote:
>
> Hi everyone,
>
> On my 13.2.5 cluster, I recently enabled the ceph balancer module in
> crush-compat mode. A couple manual 'eval' and 'execute' runs showed
> the score improving, so I set the following and enabled the auto
> balancer.
>
> mgr/balancer/crush_compat_metrics:bytes # from
> https://github.com/ceph/ceph/pull/20665
> mgr/balancer/max_misplaced:0.01
> mgr/balancer/mode:crush-compat
>
> Log messages from the mgr showed lower scores with each iteration, so
> I thought things were moving in the right direction.
>
> Initially my highest-utilized OSD was at 79% and MAXVAR was 1.17. I
> let the balancer do its thing for 5 days, at which point my highest
> utilized OSD was just over 90% and MAXVAR was about 1.28.
>
> I do have pretty low PG-per-OSD counts (average of about 60 - that's
> next on my list), but I explicitly asked the balancer to use the bytes
> metric. Was I just being impatient? Is it expected that usage would go
> up overall for a time before starting to trend downward? Is my low PG
> count affecting this somehow? I would have expected things to move in
> the opposite direction pretty quickly as they do with 'ceph osd
> reweight-by-utilization'.
>
> Thoughts?
>
> Regards,
> Josh
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG scrub stamps reset to 0.000000 in 14.2.1

2019-06-05 Thread Gregory Farnum
On Wed, Jun 5, 2019 at 10:10 AM Jonas Jelten  wrote:
>
> Hi!
>
> I'm also affected by this:
>
> HEALTH_WARN 13 pgs not deep-scrubbed in time; 13 pgs not scrubbed in time
> PG_NOT_DEEP_SCRUBBED 13 pgs not deep-scrubbed in time
> pg 6.b1 not deep-scrubbed since 0.00
> pg 7.ac not deep-scrubbed since 0.00
> pg 7.a0 not deep-scrubbed since 0.00
> pg 6.96 not deep-scrubbed since 0.00
> pg 7.92 not deep-scrubbed since 0.00
> pg 6.86 not deep-scrubbed since 0.00
> pg 7.74 not deep-scrubbed since 0.00
> pg 7.75 not deep-scrubbed since 0.00
> pg 7.49 not deep-scrubbed since 0.00
> pg 7.47 not deep-scrubbed since 0.00
> pg 6.2a not deep-scrubbed since 0.00
> pg 6.26 not deep-scrubbed since 0.00
> pg 6.b not deep-scrubbed since 0.00
> PG_NOT_SCRUBBED 13 pgs not scrubbed in time
> pg 6.b1 not scrubbed since 0.00
> pg 7.ac not scrubbed since 0.00
> pg 7.a0 not scrubbed since 0.00
> pg 6.96 not scrubbed since 0.00
> pg 7.92 not scrubbed since 0.00
> pg 6.86 not scrubbed since 0.00
> pg 7.74 not scrubbed since 0.00
> pg 7.75 not scrubbed since 0.00
> pg 7.49 not scrubbed since 0.00
> pg 7.47 not scrubbed since 0.00
> pg 6.2a not scrubbed since 0.00
> pg 6.26 not scrubbed since 0.00
> pg 6.b not scrubbed since 0.00
>
>
> A week ago this status was:
>
>
> HEALTH_WARN 6 pgs not deep-scrubbed in time; 6 pgs not scrubbed in time
> PG_NOT_DEEP_SCRUBBED 6 pgs not deep-scrubbed in time
> pg 7.b1 not deep-scrubbed since 0.00
> pg 7.7e not deep-scrubbed since 0.00
> pg 6.6e not deep-scrubbed since 0.00
> pg 7.8 not deep-scrubbed since 0.00
> pg 7.40 not deep-scrubbed since 0.00
> pg 6.f5 not deep-scrubbed since 0.00
> PG_NOT_SCRUBBED 6 pgs not scrubbed in time
> pg 7.b1 not scrubbed since 0.00
> pg 7.7e not scrubbed since 0.00
> pg 6.6e not scrubbed since 0.00
> pg 7.8 not scrubbed since 0.00
> pg 7.40 not scrubbed since 0.00
> pg 6.f5 not scrubbed since 0.00
>
>
> Is this a known problem already? I can't find a bug report.

https://tracker.ceph.com/issues/40073

Fix is in progress!
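
Until it lands, manually scrubbing an affected PG should at least refresh
its stamps for a while (pg id taken from your report):

$ ceph pg deep-scrub 6.b1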

>
>
> Cheers
>
> -- Jonas
>
>
>
> On 16/05/2019 01.13, Brett Chancellor wrote:
> > After upgrading from 14.2.0 to 14.2.1, I've noticed PGs are frequently 
> > resetting their scrub and deep scrub time stamps
> > to 0.00.  It's extra strange because the peers show timestamps for deep 
> > scrubs.
> >
> > ## First entry from a pg list at 7pm
> > $ grep 11.2f2 ~/pgs-active.7pm
> > 11.2f2 6910 0   0 2897477632   0  0 
> > 2091 active+clean3h  7378'12291
> >  8048:36261[1,6,37]p1[1,6,37]p1 2019-05-14 21:01:29.172460 
> > 2019-05-14 21:01:29.172460
> >
> > ## Next Entry 3 minutes later
> > $ ceph pg ls active |grep 11.2f2
> > 11.2f2 6950 0   0 2914713600   0  0 
> > 2091 active+clean6s  7378'12291
> >  8049:36330[1,6,37]p1[1,6,37]p1   0.00  
> >  0.00
> >
> > ## PG Query
> > {
> > "state": "active+clean",
> > "snap_trimq": "[]",
> > "snap_trimq_len": 0,
> > "epoch": 8049,
> > "up": [
> > 1,
> > 6,
> > 37
> > ],
> > "acting": [
> > 1,
> > 6,
> > 37
> > ],
> > "acting_recovery_backfill": [
> > "1",
> > "6",
> > "37"
> > ],
> > "info": {
> > "pgid": "11.2f2",
> > "last_update": "7378'12291",
> > "last_complete": "7378'12291",
> > "log_tail": "1087'10200",
> > "last_user_version": 12291,
> > "last_backfill": "MAX",
> > "last_backfill_bitwise": 1,
> > "purged_snaps": [],
> > "history": {
> > "epoch_created": 1549,
> > "epoch_pool_created": 216,
> > "last_epoch_started": 6148,
> > "last_interval_started": 6147,
> > "last_epoch_clean": 6148,
> > "last_interval_clean": 6147,
> > "last_epoch_split": 6147,
> > "last_epoch_marked_full": 0,
> > "same_up_since": 6126,
> > "same_interval_since": 6147,
> > "same_primary_since": 6126,
> > "last_scrub": "7378'12291",
> > "last_scrub_stamp": "0.00",
> > "last_deep_scrub": "6103'12186",
> > "last_deep_scrub_stamp": "0.00",
> > "last_clean_scrub_stamp": "2019-05-15 23:08:17.014575"
> > },
> > "stats": {
> > "version": "7378'12291",
> > "reported_seq": "36700",
> > "reported_epoch": "8049",
> > "state": "active+clean",
> > "last_fresh": "2019-05-15 23:08:17.014609",
> > "last_change": "2019-05-15 23:08:17.014609",
> >

Re: [ceph-users] Balancer: uneven OSDs

2019-05-29 Thread Gregory Farnum
These OSDs are far too small at only 10GiB for the balancer to try and
do any work. It's not uncommon for metadata like OSDMaps to exceed
that size in error states and in any real deployment a single PG will
be at least that large.
There are probably parameters you can tweak to try and make it work,
but I wouldn't bother since the behavior will be nothing like what
you'd see in anything of size.
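
I believe the knobs behind the "max-count 100, max deviation 0.01" in your
osdmaptool output are:

$ osdmaptool om --upmap out.txt --upmap-pool rbd --upmap-max 100 --upmap-deviation 0.001

but with 10 GiB OSDs the result still won't tell you much.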
-Greg

On Wed, May 29, 2019 at 8:52 AM Tarek Zegar  wrote:
>
> Can anyone help with this? Why can't I optimize this cluster, the pg counts 
> and data distribution is way off.
> __
>
> I enabled the balancer plugin and even tried to manually invoke it but it 
> won't allow any changes. Looking at ceph osd df, it's not even at all. 
> Thoughts?
>
> root@hostadmin:~# ceph osd df
> ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
> 1 hdd 0.00980 0 0 B 0 B 0 B 0 0 0
> 3 hdd 0.00980 1.0 10 GiB 8.3 GiB 1.7 GiB 82.83 1.14 156
> 6 hdd 0.00980 1.0 10 GiB 8.4 GiB 1.6 GiB 83.77 1.15 144
> 0 hdd 0.00980 0 0 B 0 B 0 B 0 0 0
> 5 hdd 0.00980 1.0 10 GiB 9.0 GiB 1021 MiB 90.03 1.23 159
> 7 hdd 0.00980 1.0 10 GiB 7.7 GiB 2.3 GiB 76.57 1.05 141
> 2 hdd 0.00980 1.0 10 GiB 5.5 GiB 4.5 GiB 55.42 0.76 90
> 4 hdd 0.00980 1.0 10 GiB 5.9 GiB 4.1 GiB 58.78 0.81 99
> 8 hdd 0.00980 1.0 10 GiB 6.3 GiB 3.7 GiB 63.12 0.87 111
> TOTAL 90 GiB 53 GiB 37 GiB 72.93
> MIN/MAX VAR: 0.76/1.23 STDDEV: 12.67
>
>
> root@hostadmin:~# osdmaptool om --upmap out.txt --upmap-pool rbd
> osdmaptool: osdmap file 'om'
> writing upmap command output to: out.txt
> checking for upmap cleanups
> upmap, max-count 100, max deviation 0.01 <---really? It's not even close to 
> 1% across the drives
> limiting to pools rbd (1)
> no upmaps proposed
>
>
> ceph balancer optimize myplan
> Error EALREADY: Unable to find further optimization,or distribution is 
> already perfect
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent number of pools

2019-05-28 Thread Gregory Farnum
You’re the second report I’ve seen of this, and while it’s confusing, you
should be able to resolve it by restarting your active manager daemon.
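
Something like (mgr name and host are placeholders):

$ ceph mgr dump | grep active_name
$ ceph mgr fail <active-mgr-name>
# or simply: systemctl restart ceph-mgr@<host>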

On Sun, May 26, 2019 at 11:52 PM Lars Täuber  wrote:

> Fri, 24 May 2019 21:41:33 +0200
> Michel Raabe  ==> Lars Täuber ,
> ceph-users@lists.ceph.com :
> >
> > You can also try
> >
> > $ rados lspools
> > $ ceph osd pool ls
> >
> > and verify that with the pgs
> >
> > $ ceph pg ls --format=json-pretty | jq -r '.pg_stats[].pgid' | cut -d.
> > -f1 | uniq
> >
>
> Yes, now I know but I still get this:
> $ sudo ceph -s
> […]
>   data:
> pools:   5 pools, 1153 pgs
> […]
>
>
> and with all other means I get:
> $ sudo ceph osd lspools | wc -l
> 3
>
> Which is what I expect, because all other pools are removed.
> But since this has no bad side effects I can live with it.
>
> Cheers,
> Lars
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph -s finds 4 pools but ceph osd lspools says no pool which is the expected answer

2019-05-15 Thread Gregory Farnum
On Tue, May 14, 2019 at 11:03 AM Rainer Krienke  wrote:
>
> Hello,
>
> for a fresh setup ceph cluster I see a strange difference in the number
> of existing pools in the output of ceph -s and what I know that should
> be there: no pools at all.
>
> I set up a fresh Nautilus cluster with 144 OSDs on 9 hosts. Just to play
> around I created a pool named rbd with
>
> $ ceph osd pool create rbd 512 512 replicated
>
> In ceph -s I saw the pool but also saw a warning:
>
>  cluster:
> id: a-b-c-d-e
> health: HEALTH_WARN
> too few PGs per OSD (21 < min 30)
>
> So I experimented around, removed the pool (ceph osd pool remove rbd)
> and it was gone in ceph osd lspools, and created a new one with some
> more PGs and repeated this a few times with larger PG nums. In the end
> in the output of ceph -s I see that 4 pools do exist:
>
>   cluster:
> id: a-b-c-d-e
> health: HEALTH_OK
>
>   services:
> mon: 3 daemons, quorum c2,c5,c8 (age 8h)
> mgr: c2(active, since 8h)
> osd: 144 osds: 144 up (since 8h), 144 in (since 8h)
>
>   data:
> pools:   4 pools, 0 pgs
> objects: 0 objects, 0 B
> usage:   155 GiB used, 524 TiB / 524 TiB avail
> pgs:
>
> but:
>
> $ ceph osd lspools
> 
>
> Since I deleted each pool I created, 0 pools is the correct answer.
> I could add another "ghost" pool by creating another pool named rbd with
> only 512 PGs and then delete it again right away. ceph -s would then
> show me 5 pools. This is the way I came from 3 to 4 "ghost pools".
>
> This does not seem to happen if I use 2048 PGs for the new pool which I
> do delete right afterwards. In this case the pool is created and ceph -s
> shows one pool more (5) and if delete this pool again the counter in
> ceph -s goes back to 4 again.
>
> How can I fix the system so that ceph -s also understands that are
> actually no pools? There must be some inconsistency. Any ideas?
>

I don't really see how this particular error can happen and be
long-lived, but if you restart the ceph-mgr it will probably resolve
itself.
("ceph osd lspools" looks directly at the OSDMap in the monitor,
whereas the "ceph -s" data output is generated from the manager's
pgmap, but there's a tight link where the pgmap gets updated and
removes dead pools on every new OSDMap the manager sees, so I can't
see how that would go wrong.)
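
Restarting is just (c2 is the active mgr in your status output):

$ ceph mgr fail c2
# or on its host: systemctl restart ceph-mgr@c2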
-Greg


> Thanks
> Rainer
> --
> Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse  1
> 56070 Koblenz, Web: http://www.uni-koblenz.de/~krienke, Tel: +49261287 1312
> PGP: http://www.uni-koblenz.de/~krienke/mypgp.html, Fax: +49261287
> 1001312
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Prioritized pool recovery

2019-05-08 Thread Gregory Farnum
On Mon, May 6, 2019 at 6:41 PM Kyle Brantley  wrote:
>
> On 5/6/2019 6:37 PM, Gregory Farnum wrote:
> > Hmm, I didn't know we had this functionality before. It looks to be
> > changing quite a lot at the moment, so be aware this will likely
> > require reconfiguring later.
>
> Good to know, and not a problem. In any case, I'd assume it won't change 
> substantially for luminous, correct?
>
>
> > I'm not seeing this in the luminous docs, are you sure? The source
>
> You're probably right, but there are options for this in luminous:
>
> # ceph osd pool get vm
> Invalid command: missing required parameter var([...] 
> recovery_priority|recovery_op_priority [...])
>
>
> > code indicates in Luminous it's 0-254. (As I said, things have
> > changed, so in the current master build it seems to be -10 to 10 and
> > configured a bit differently.)
>
> > The 1-63 values generally apply to op priorities within the OSD, and
> > are used as part of a weighted priority queue when selecting the next
> > op to work on out of those available; you may have been looking at
> > osd_recovery_op_priority which is on that scale and should apply to
> > individual recovery messages/ops but will not work to schedule PGs
> > differently.
>
> So I was probably looking at the OSD level then.

Ah sorry, I looked at the recovery_priority option and skipped
recovery_op_priority entirely.

So recovery_op_priority sets the priority on the message dispatch
itself and is on the 0-63 scale. I wouldn't mess around with that; the
higher you put it the more of them will be dispatched compared to
client operations.

>
> >
> >> Questions:
> >> 1) If I have pools 1-4, what would I set these values to in order to 
> >> backfill pools 1, 2, 3, and then 4 in order?
> >
> > So if I'm reading the code right, they just need to be different
> > weights, and the higher value will win when trying to get a
> > reservation if there's a queue of them. (However, it's possible that
> > lower-priority pools will send off requests first and get to do one or
> > two PGs first, then the higher-priority pool will get to do all its
> > work before that pool continues.)
>
> Where higher is 0, or higher is 254? And what's the difference between 
> recovery_priority and recovery_op_priority?

For recovery_priority larger numbers are higher. When picking a PG off
the list of pending reservations, it will take the highest priority PG
it sees, and the first request to come in within that priority.

>
> In reading the docs for the OSD, _op_ is "priority set for recovery 
> operations," and non-op is "priority set for recovery work queue." For 
> someone new to ceph such as myself, this reads like the same thing at a 
> glance. Would the recovery operations not be a part of the work queue?
>
> And would this apply the same for the pools?

When a PG needs to recover, it has to acquire a reservation slot on
the local and remote nodes (to limit the total amount of work being
done). It sends off a request and when the total number of
reservations is hit, they go into a pending queue. The
recovery_priority orders that queue.
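
The per-OSD slot counts are the usual recovery/backfill throttles, if you
want to inspect or adjust them (values are just examples):

$ ceph daemon osd.0 config show | grep -e osd_max_backfills -e osd_recovery_max_active
$ ceph tell 'osd.*' injectargs '--osd-max-backfills 2 --osd-recovery-max-active 3'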

>
> >
> >> 2) Assuming this is possible, how do I ensure that backfill isn't 
> >> prioritized over client I/O?
> >
> > This is an ongoing issue but I don't think the pool prioritization
> > will change the existing mechanisms.
>
> Okay, understood. Not a huge problem, I'm primarily looking for understanding.
>
>
> >> 3) Is there a command that enumerates the weights of the current 
> >> operations (so that I can observe what's going on)?
> >
> > "ceph osd pool ls detail" will include them.
> >
>
> Perfect!
>
> Thank you very much for the information. Once I have a little more, I'm 
> probably going to work towards sending a pull request in for the docs...
>
>
> --Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mimic and samba vfs_ceph

2019-05-08 Thread Gregory Farnum
On Wed, May 8, 2019 at 10:05 AM Ansgar Jazdzewski
 wrote:
>
> hi folks,
>
> we try to build a new NAS using the vfs_ceph modul from samba 4.9.
>
> if i try to open the share i recive the error:
>
> May  8 06:58:44 nas01 smbd[375700]: 2019-05-08 06:58:44.732830
> 7ff3d5f6e700  0 -- 10.100.219.51:0/3414601814 >> 10.100.219.11:6789/0
> pipe(0x7ff3cc00c350 sd=6 :45626 s=1 pgs=0 cs=0 l=0
> c=0x7ff3cc008980).connect protocol feature mismatch, my
> 27ffefdfbfff < peer 27fddff8e
> fa4bffb missing 20
>
> so my guess is that i need to compile samba with the libcephfs from
> mimic but i'am not able to because of this compile-error:
>
> ../../source3/modules/vfs_ceph.c: In function ‘cephwrap_stat’:
> ../../source3/modules/vfs_ceph.c:835:11: warning: implicit declaration
> of function ‘ceph_stat’; did you mean ‘ceph_statx’?
> [-Wimplicit-function-declaration]
>   result = ceph_stat(handle->data, smb_fname->base_name, (struct stat
> *) &stbuf);
>^
>ceph_statx
> ../../source3/modules/vfs_ceph.c: In function ‘cephwrap_fstat’:
> ../../source3/modules/vfs_ceph.c:861:11: warning: implicit declaration
> of function ‘ceph_fstat’; did you mean ‘ceph_fstatx’?
> [-Wimplicit-function-declaration]
>   result = ceph_fstat(handle->data, fsp->fh->fd, (struct stat *) &stbuf);
>^~
>ceph_fstatx
> ../../source3/modules/vfs_ceph.c: In function ‘cephwrap_lstat’:
> ../../source3/modules/vfs_ceph.c:894:11: warning: implicit declaration
> of function ‘ceph_lstat’; did you mean ‘ceph_statx’?
> [-Wimplicit-function-declaration]
>   result = ceph_lstat(handle->data, smb_fname->base_name, &stbuf);
>^~
>ceph_statx
>
> maybe i can disable a feature in cephfs to avoid the error in the first place?

Hmmm unfortunately it looks like the public functions got changed and
so there isn't the standard ceph_stat any more.

Fixing the wiring wouldn't be that complicated if you can hack on the
code at all, but there are some other issues with the Samba VFS
implementation that have prevented anyone from prioritizing it so far.
(Namely, smb forks for every incoming client connection, which means
every smb client gets a completely independent cephfs client, which is
very inefficient.)
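
Until that's sorted out, the usual workaround is to skip vfs_ceph entirely
and just export a kernel (or ceph-fuse) mount as a normal path-based share,
e.g. (names and paths are examples):

$ mount -t ceph mon1:6789:/ /mnt/cephfs -o name=samba,secretfile=/etc/ceph/samba.secret

# smb.conf
[cephfs]
    path = /mnt/cephfs
    writeable = yes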
-Greg

>
> thanks for your help,
> Ansgar
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nautilus: significant increase in cephfs metadata pool usage

2019-05-08 Thread Gregory Farnum
On Wed, May 8, 2019 at 5:33 AM Dietmar Rieder
 wrote:
>
> On 5/8/19 1:55 PM, Paul Emmerich wrote:
> > Nautilus properly accounts metadata usage, so nothing changed it just
> > shows up correctly now ;)
>
> OK, but then I'm not sure I understand why the increase was not sudden
> (with the update) but it kept growing steadily over days.

Tracking the amount of data used by omap (ie, the internal RocksDB)
isn't really possible to do live, and in the past we haven't done it
at all. In Nautilus, omap stats are gathered whenever a deep scrub happens,
so the numbers are always somewhat stale, but they at least let us
approximate what's in use for a given PG.

So when you upgraded to Nautilus, the metadata pool scrubbed PGs over
a period of days and each time a PG scrub finished the amount of data
accounted to the pool as a whole increased. :)
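
You can see those per-PG estimates if you're curious (pool name is an
example):

$ ceph df detail
$ ceph pg ls-by-pool cephfs_metadata    # look at the OMAP_BYTES* / OMAP_KEYS* columns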
-Greg

>
> ~Dietmar
>
> --
> _
> D i e t m a r  R i e d e r, Mag.Dr.
> Innsbruck Medical University
> Biocenter - Division for Bioinformatics
> Innrain 80, 6020 Innsbruck
> Phone: +43 512 9003 71402
> Fax: +43 512 9003 73100
> Email: dietmar.rie...@i-med.ac.at
> Web:   http://www.icbi.at
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Data moved pools but didn't move osds & backfilling+remapped loop

2019-05-08 Thread Gregory Farnum
On Wed, May 8, 2019 at 2:37 AM Marco Stuurman
 wrote:
>
> Hi,
>
> I've got an issue with the data in our pool. A RBD image containing 4TB+ data 
> has moved over to a different pool after a crush rule set change, which 
> should not be possible. Besides that it loops over and over to start 
> remapping and backfilling (goes up to 377 pg active+clean then suddenly drops 
> to 361, without crashes accourding to ceph -w & ceph crash ls)
>
> First about the pools:
>
> [root@CEPH-MGMT-1 ~t]# ceph df
> RAW STORAGE:
> CLASSSIZE   AVAIL  USEDRAW USED %RAW USED
> cheaphdd 16 TiB 10 TiB 5.9 TiB  5.9 TiB 36.08
> fasthdd  33 TiB 18 TiB  16 TiB   16 TiB 47.07
> TOTAL50 TiB 28 TiB  22 TiB   22 TiB 43.44
>
> POOLS:
> POOL ID STORED  OBJECTS USED %USED 
> MAX AVAIL
> pool1  37   780 B1.33M  780 B 
>   0   3.4 TiB
> pool2  48 2.0 TiB   510.57k5.9 TiB
>   42.64   2.6 TiB
>
> All data is now in pool2 while the RBD image is created in pool1 (since pool2 
> is new).
>
> The steps it took to make ceph do this is:
>
> - Add osds with a different device class (class cheaphdd)
> - Create crushruleset for cheaphdd only called cheapdisks
> - Create pool2 with new crush rule set
> - Remove device class from the previously existing devices (remove class hdd)
> - Add class fasthdd to those devices
> - Create new crushruleset fastdisks
> - Change crushruleset for pool1 to fastdisks
>
> After this the data starts moving everything from pool1 to pool2, however, 
> the RBD image still works and the disks of pool1 are still filled with data.
>
> I've tried to reproduce this issue using virtual machines but I couldn't make 
> it happen again.
>
> Some extra information:
> ceph osd crush tree --show-shadow ==> https://fe.ax/639aa.H34539.txt
> ceph pg ls-by-pool pool1 ==> https://fe.ax/dcacd.H44900.txt (I know the PG 
> count is too low)
> ceph pg ls-by-pool pool2 ==> https://fe.ax/95a2c.H51533.txt
> ceph -s ==> https://fe.ax/aab41.H69711.txt
>
>
> Can someone shine a light on why the data looks like it's moved to another 
> pool and/or explain why the data in pool2 keeps remapping/backfilling in a 
> loop?

What version of Ceph are you running? Are the PGs active+clean
changing in any other way?

My guess is this is just the reporting getting messed up because none
of the cheaphdd disks are supposed to be reachable by pool1 now, and
so their disk usage is being assigned to pool2. In which case it will
clear up once all the data movement is done.

Can you confirm if it's getting better as PGs actually migrate?
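
You can also ask the cluster directly where the image's objects live
(the object name is an example; get the real prefix from "rbd info"):

$ rados df
$ rados -p pool1 ls | head
$ ceph osd map pool1 rbd_header.<image-id>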

>
> Thanks!
>
>
> Kind regards,
>
> Marco Stuurman
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Read-only CephFs on a k8s cluster

2019-05-07 Thread Gregory Farnum
On Tue, May 7, 2019 at 6:54 AM Ignat Zapolsky  wrote:
>
> Hi,
>
>
>
> We are looking at how to troubleshoot an issue with Ceph FS on k8s cluster.
>
>
>
> This filesystem is provisioned via rook 0.9.2 and have following behavior:
>
> If ceph fs is mounted on K8S master, then it is writeable
> If ceph fs is mounted as PV to a POD, then we can write a 0-sized file to it, 
> (or create empty file) but bigger writes do not work.

This generally means your clients have CephX permission to access the
MDS but not the RADOS pools. Check what auth caps you've given the
relevant keys ("ceph auth list"). Presumably your master node has an
admin key and the clients have a different one that's not quite right.
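
On Luminous and later the easiest way to produce a known-good CephFS client
key is (filesystem and client names here are guesses based on your output):

$ ceph fs authorize workspace-storage-fs client.workspace-user / rw
$ ceph auth get client.workspace-user

and then compare its caps against the key your pods are actually using.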
-Greg

>
> Following is reported as ceph -s :
>
>
>
> # ceph -s
>
>   cluster:
>
> id: 18f8d40e-1995-4de4-96dc-e905b097e643
>
> health: HEALTH_OK
>
>   services:
>
> mon: 3 daemons, quorum a,b,d
>
> mgr: a(active)
>
> mds: workspace-storage-fs-1/1/1 up  {0=workspace-storage-fs-a=up:active}, 
> 1 up:standby-replay
>
> osd: 3 osds: 3 up, 3 in
>
>   data:
>
> pools:   3 pools, 300 pgs
>
> objects: 212  objects, 181 MiB
>
> usage:   51 GiB used, 244 GiB / 295 GiB avail
>
> pgs: 300 active+clean
>
>   io:
>
> client:   853 B/s rd, 2.7 KiB/s wr, 1 op/s rd, 0 op/s wr
>
>
>
>
>
> I wonder what can be done for further diagnostics ?
>
>
>
> With regards,
>
> Ignat Zapolsky
>
>
>
> Sent from Mail for Windows 10
>
>
>
>
> This email and any files transmitted with it are confidential and intended 
> solely for the use of the individual or entity to whom they are addressed. If 
> you have received this email in error please notify the system manager. This 
> message contains confidential information and is intended only for the 
> individual named. If you are not the named addressee you should not 
> disseminate, distribute or copy this e-mail.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Prioritized pool recovery

2019-05-06 Thread Gregory Farnum
Hmm, I didn't know we had this functionality before. It looks to be
changing quite a lot at the moment, so be aware this will likely
require reconfiguring later.

On Sun, May 5, 2019 at 10:40 AM Kyle Brantley  wrote:
>
> I've been running luminous / ceph-12.2.11-0.el7.x86_64 on CentOS 7 for about 
> a month now, and had a few times when I've needed to recreate the OSDs on a 
> server. (no I'm not planning on routinely doing this...)
>
> What I've noticed is that the recovery will generally stagger the recovery so 
> that the pools on the cluster will finish around the same time (+/- a few 
> hours). What I'm hoping to do is prioritize specific pools over others, so 
> that ceph will recover all of pool 1 before it moves on to pool 2, for 
> example.
>
> In the docs, recovery_{,op}_priority both have roughly the same description, 
> which is "the priority set for recovery operations" as well as a valid range 
> of 1-63, default 5. This doesn't tell me if a value of 1 is considered a 
> higher priority than 63, and it doesn't tell me how it fits in line with 
> other ceph operations.

I'm not seeing this in the luminous docs, are you sure? The source
code indicates in Luminous it's 0-254. (As I said, things have
changed, so in the current master build it seems to be -10 to 10 and
configured a bit differently.)

The 1-63 values generally apply to op priorities within the OSD, and
are used as part of a weighted priority queue when selecting the next
op to work on out of those available; you may have been looking at
osd_recovery_op_priority which is on that scale and should apply to
individual recovery messages/ops but will not work to schedule PGs
differently.

> Questions:
> 1) If I have pools 1-4, what would I set these values to in order to backfill 
> pools 1, 2, 3, and then 4 in order?

So if I'm reading the code right, they just need to be different
weights, and the higher value will win when trying to get a
reservation if there's a queue of them. (However, it's possible that
lower-priority pools will send off requests first and get to do one or
two PGs first, then the higher-priority pool will get to do all its
work before that pool continues.)

> 2) Assuming this is possible, how do I ensure that backfill isn't prioritized 
> over client I/O?

This is an ongoing issue but I don't think the pool prioritization
will change the existing mechanisms.

> 3) Is there a command that enumerates the weights of the current operations 
> (so that I can observe what's going on)?

"ceph osd pool ls detail" will include them.

>
> For context, my pools are:
> 1) cephfs_metadata
> 2) vm (RBD pool, VM OS drives)
> 3) storage (RBD pool, VM data drives)
> 4) cephfs_data
>
> These are sorted by both size (smallest to largest) and criticality of 
> recovery (most to least). If there's a critique of this setup / a better way 
> of organizing this, suggestions are welcome.
>
> Thanks,
> --Kyle
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH rule device classes mystery

2019-05-06 Thread Gregory Farnum
What's the output of "ceph -s" and "ceph osd tree"?
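
It's also worth double-checking which rule each pool actually ended up
with:

$ ceph osd pool get cephfs_metadata crush_rule
$ ceph osd pool get cephfs_data crush_rule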

On Fri, May 3, 2019 at 8:58 AM Stefan Kooman  wrote:
>
> Hi List,
>
> I'm playing around with CRUSH rules and device classes and I'm puzzled
> if it's working correctly. Platform specifics: Ubuntu Bionic with Ceph 14.2.1
>
> I created two new device classes "cheaphdd" and "fasthdd". I made
> sure these device classes are applied to the right OSDs and that the
> (shadow) crush rule is correctly filtering the right classes for the
> OSDs (ceph osd crush tree --show-shadow).
>
> I then created two new crush rules:
>
> ceph osd crush rule create-replicated fastdisks default host fasthdd
> ceph osd crush rule create-replicated cheapdisks default host cheaphdd
>
> # rules
> rule replicated_rule {
> id 0
> type replicated
> min_size 1
> max_size 10
> step take default
> step chooseleaf firstn 0 type host
> step emit
> }
> rule fastdisks {
> id 1
> type replicated
> min_size 1
> max_size 10
> step take default class fasthdd
> step chooseleaf firstn 0 type host
> step emit
> }
> rule cheapdisks {
> id 2
> type replicated
> min_size 1
> max_size 10
> step take default class cheaphdd
> step chooseleaf firstn 0 type host
> step emit
> }
>
> After that I put the cephfs_metadata on the fastdisks CRUSH rule:
>
> ceph osd pool set cephfs_metadata crush_rule fastdisks
>
> Some data is moved to new osds, but strange enough there is still data on PGs
> residing on OSDs in the "cheaphdd" class. I confirmed this with:
>
> ceph pg ls-by-pool cephfs_data
>
> Testing CRUSH rule nr. 1 gives me:
>
> crushtool -i /tmp/crush_raw --test --show-mappings --rule 1 --min-x 1 --max-x 
> 4  --num-rep 3
> CRUSH rule 1 x 1 [0,3,6]
> CRUSH rule 1 x 2 [3,6,0]
> CRUSH rule 1 x 3 [0,6,3]
> CRUSH rule 1 x 4 [0,6,3]
>
> Which are indeed the OSDs in the fasthdd class.
>
> Why is not all data moved to OSDs 0,3,6, but still spread on OSDs on the
> cheaphhd class as well?
>
> Thanks,
>
> Stefan
>
>
> --
> | BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nautilus (14.2.0) OSDs crashing at startup after removing a pool containing a PG with an unrepairable error

2019-05-06 Thread Gregory Farnum
98]
> >  7: (boost::statechart::simple_state > PG::RecoveryState::ToDelete, boost::mpl::list > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_:
> > :na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
> > (boost::statechart::history_mode)0>::react_impl(boost::statec
> > hart::event_base const&, void const*)+0x16a) [0x55766fe355ca]
> >  8:
> > (boost::statechart::state_machine > PG::RecoveryState::Initial, std::allocator,
> > boost::statechart::null_exception_translator>::process_event(bo
> > ost::statechart::event_base const&)+0x5a) [0x55766fe130ca]
> >  9: (PG::do_peering_event(std::shared_ptr,
> > PG::RecoveryCtx*)+0x119) [0x55766fe02389]
> >  10: (OSD::dequeue_peering_evt(OSDShard*, PG*,
> > std::shared_ptr, ThreadPool::TPHandle&)+0x1b4)
> > [0x55766fd3c3c4]
> >  11: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int,
> > ThreadPool::TPHandle&)+0x234) [0x55766fd3c804]
> >  12: (OSD::ShardedOpWQ::_process(unsigned int,
> > ceph::heartbeat_handle_d*)+0x9f4) [0x55766fd30b44]
> >  13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433)
> > [0x55767032ae93]
> >  14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55767032df30]
> >  15: (()+0x7dd5) [0x7f8eb7162dd5]
> >  16: (clone()+0x6d) [0x7f8eb6028ead]
> >
> > 2019-05-03 21:24:05.274 7f8e96b8a700 -1 *** Caught signal (Aborted) **
> >  in thread 7f8e96b8a700 thread_name:tp_osd_tp
> >
> >  ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus
> > (stable)
> >  1: (()+0xf5d0) [0x7f8eb716a5d0]
> >  2: (gsignal()+0x37) [0x7f8eb5f61207]
> >  3: (abort()+0x148) [0x7f8eb5f628f8]
> >  4: (ceph::__ceph_abort(char const*, int, char const*, std::string
> > const&)+0x19c) [0x55766fbd6d94]
> >  5: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> > ObjectStore::Transaction*)+0x2a85) [0x5576701b6af5]
> >  6:
> > (BlueStore::queue_transactions(boost::intrusive_ptr&,
> > std::vector > std::allocator >&, boost::intrusive_
> > ptr, ThreadPool::TPHandle*)+0x526) [0x5576701b7866]
> >  7:
> > (ObjectStore::queue_transaction(boost::intrusive_ptr&,
> > ObjectStore::Transaction&&, boost::intrusive_ptr,
> > ThreadPool::TPHandle*)+0x7f) [0x55766f
> > d9274f]
> >  8: (PG::_delete_some(ObjectStore::Transaction*)+0x83d) [0x55766fdf577d]
> >  9: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x38)
> > [0x55766fdf6598]
> >  10: (boost::statechart::simple_state > PG::RecoveryState::ToDelete, boost::mpl::list > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_
> > ::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
> > (boost::statechart::history_mode)0>::react_impl(boost::state
> > chart::event_base const&, void const*)+0x16a) [0x55766fe355ca]
> >  11:
> > (boost::statechart::state_machine > PG::RecoveryState::Initial, std::allocator,
> > boost::statechart::null_exception_translator>::process_event(b
> > oost::statechart::event_base const&)+0x5a) [0x55766fe130ca]
> >  12: (PG::do_peering_event(std::shared_ptr,
> > PG::RecoveryCtx*)+0x119) [0x55766fe02389]
> >  13: (OSD::dequeue_peering_evt(OSDShard*, PG*,
> > std::shared_ptr, ThreadPool::TPHandle&)+0x1b4)
> > [0x55766fd3c3c4]
> >  14: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int,
> > ThreadPool::TPHandle&)+0x234) [0x55766fd3c804]
> >  15: (OSD::ShardedOpWQ::_process(unsigned int,
> > ceph::heartbeat_handle_d*)+0x9f4) [0x55766fd30b44]
> >  16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433)
> > [0x55767032ae93]
> >  17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55767032df30]
> >  18: (()+0x7dd5) [0x7f8eb7162dd5]
> >  19: (clone()+0x6d) [0x7f8eb6028ead]
> >  NOTE: a copy of the executable, or `objdump -rdS ` is
> > needed to interpret this.
> >
> >
> > ~Best
> >Dietmar
> >
> > On 4/29/19 11:05 PM, Gregory Farnum wrote:
> >> Glad you got it working and thanks for the logs! Looks like we've seen
> >> this once or twice before so I added them to
> >> https://tracker.ceph.com/issues/38724.
> >> -Greg
> >>
> >> On Fri, Apr 26, 2019 at 5:52 PM Elise Burke  wrote:
> >>>
> >>> Thanks for the pointer to ceph-objectstore-tool, it turns out that 
> >>> removing and exporting the PG from all three disks was enough to m

Re: [ceph-users] Cephfs on an EC Pool - What determines object size

2019-04-29 Thread Gregory Farnum
Yes, check out the file layout options:
http://docs.ceph.com/docs/master/cephfs/file-layouts/
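
The attribute you want is object_size, which you can set per directory for
newly created files (path and size are examples; this one is 16 MiB):

$ setfattr -n ceph.dir.layout.object_size -v 16777216 /mnt/cephfs/bigfiles
$ getfattr -n ceph.dir.layout /mnt/cephfs/bigfiles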

On Mon, Apr 29, 2019 at 3:32 PM Daniel Williams  wrote:
>
> Is the 4MB configurable?
>
> On Mon, Apr 29, 2019 at 4:36 PM Gregory Farnum  wrote:
>>
>> CephFS automatically chunks objects into 4MB objects by default. For
>> an EC pool, RADOS internally will further subdivide them based on the
>> erasure code and striping strategy, with a layout that can vary. But
>> by default if you have eg an 8+3 EC code, you'll end up with a bunch
>> of (4MB/8=)512KB objects within the OSD.
>> -Greg
>>
>> On Sun, Apr 28, 2019 at 12:42 PM Daniel Williams  wrote:
>> >
>> > Hey,
>> >
>> > What controls / determines object size of a purely cephfs ec (6.3) pool? I 
>> > have large file but seemingly small objects.
>> >
>> > Daniel
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nautilus (14.2.0) OSDs crashing at startup after removing a pool containing a PG with an unrepairable error

2019-04-29 Thread Gregory Farnum
st pend 
>>> 0x7f7fe5
>>>-24> 2019-04-26 19:23:05.192 7fb2667de700 20 
>>> bluestore(/var/lib/ceph/osd/ceph-2) _collection_list key 
>>> 0x7f7fff100020'P!osdmap.7114!='0x'o'
>>>  >= GHMAX
>>>-23> 2019-04-26 19:23:05.192 7fb2667de700 20 
>>> bluestore(/var/lib/ceph/osd/ceph-2) _collection_list oid 
>>> #25:head# end GHMAX
>>>-22> 2019-04-26 19:23:05.192 7fb2667de700 20 
>>> bluestore(/var/lib/ceph/osd/ceph-2) _collection_list oid 
>>> #25:b08b92bdhead# end GHMAX
>>>-21> 2019-04-26 19:23:05.192 7fb2667de700 20 
>>> bluestore(/var/lib/ceph/osd/ceph-2) _collection_list key 
>>> 0x80800c1c0021213dfffe'o' >= 
>>> GHMAX
>>>-20> 2019-04-26 19:23:05.192 7fb2667de700 10 
>>> bluestore(/var/lib/ceph/osd/ceph-2) _remove_collection oid 
>>> #25:head#
>>>-19> 2019-04-26 19:23:05.192 7fb2667de700 10 
>>> bluestore(/var/lib/ceph/osd/ceph-2) _remove_collection oid 
>>> #25:b08b92bdhead#
>>>-18> 2019-04-26 19:23:05.192 7fb2667de700 10 
>>> bluestore(/var/lib/ceph/osd/ceph-2) _remove_collection 
>>> #25:b08b92bdhead# exists in db, not present in ram
>>>-17> 2019-04-26 19:23:05.192 7fb2667de700 10 
>>> bluestore(/var/lib/ceph/osd/ceph-2) _remove_collection 25.0_head is 
>>> non-empty
>>>-16> 2019-04-26 19:23:05.192 7fb2667de700 10 
>>> bluestore(/var/lib/ceph/osd/ceph-2) _remove_collection 25.0_head = -39
>>>-15> 2019-04-26 19:23:05.192 7fb2667de700 -1 
>>> bluestore(/var/lib/ceph/osd/ceph-2) _txc_add_transaction error (39) 
>>> Directory not empty not handled on operation 21 (op 1, counting from 0)
>>>-14> 2019-04-26 19:23:05.192 7fb2667de700  0 
>>> bluestore(/var/lib/ceph/osd/ceph-2) _dump_transaction transaction dump:
>>> {
>>> "ops": [
>>> {
>>> "op_num": 0,
>>> "op_name": "remove",
>>> "collection": "25.0_head",
>>> "oid": "#25:head#"
>>> },
>>> {
>>> "op_num": 1,
>>> "op_name": "rmcoll",
>>> "collection": "25.0_head"
>>> }
>>>
>>>-13> 2019-04-26 19:23:05.199 7fb2667de700 -1 
>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.0/rpm/el7/BUILD/ceph-14.2.0/src/os/bluestore/BlueStore.cc:
>>>  In function 'void 
>>> BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
>>> ObjectStore::Transaction*)' thread 7fb2667de700 time 2019-04-26 
>>> 19:23:05.193826
>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.0/rpm/el7/BUILD/ceph-14.2.0/src/os/bluestore/BlueStore.cc:
>>>  11069: abort()
>>>
>>>  ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus 
>>> (stable)
>>>  1: (ceph::__ceph_abort(char const*, int, char const*, std::string 
>>> const&)+0xd8) [0x7c1454ee40]
>>>  2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
>>> ObjectStore::Transaction*)+0x2a85) [0x7c14b2d5f5]
>>>  3: 
>>> (BlueStore::queue_transactions(boost::intrusive_ptr&,
>>>  std::vector>> std::allocator >&, 
>>> boost::intrusive_ptr, ThreadPool::TPHandle*)+0x526) 
>>> [0x7c14b2e366]
>>>  4: 
>>> (ObjectStore::queue_transaction(boost::intrusive_ptr&,
>>>  ObjectStore::Transaction&&, boost::intrusive_ptr, 
>>> ThreadPool::TPHandle*)+0x7f) [0x7c1470a81f]
>>>  5: (PG::_delete_some(ObjectStore::Transaction*)+0x83d) [0x7c1476d70d]
>>>  6: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x38) 
>>> [0x7c1476e528]
>>>  7: (boost::statechart::simple_state>> PG::RecoveryState::ToDelete, boost::mpl::list>> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
>>> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
>>> mpl_::na, mpl_::na, mpl_::na>, 
>>> (boost::statechart::history_mode)0>::react_impl(boost::statechart::ev

Re: [ceph-users] Cephfs on an EC Pool - What determines object size

2019-04-29 Thread Gregory Farnum
CephFS automatically chunks files into 4MB objects by default. For
an EC pool, RADOS internally will further subdivide them based on the
erasure code and striping strategy, with a layout that can vary. But
by default, if you have e.g. an 8+3 EC code, you'll end up with a bunch
of (4MB/8=)512KB objects within the OSD.
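
You can check the k of your pool (and hence the expected ~4MB/k shard size)
with (pool and profile names are placeholders):

$ ceph osd pool get <your-ec-pool> erasure_code_profile
$ ceph osd erasure-code-profile get <profile-name>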
-Greg

On Sun, Apr 28, 2019 at 12:42 PM Daniel Williams  wrote:
>
> Hey,
>
> What controls / determines object size of a purely cephfs ec (6.3) pool? I 
> have large file but seemingly small objects.
>
> Daniel
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nautilus (14.2.0) OSDs crashing at startup after removing a pool containing a PG with an unrepairable error

2019-04-26 Thread Gregory Farnum
You'll probably want to generate a log with "debug osd = 20" and
"debug bluestore = 20", then share that or upload it with
ceph-post-file, to get more useful info about which PGs are breaking
(is it actually the ones that were supposed to be deleted?).

If there's a particular set of PGs you need to rescue, you can also
look at using the ceph-objectstore-tool to export them off the busted
OSD stores and import them into OSDs that still work.
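
That looks roughly like this, run with the OSD processes stopped (paths,
osd ids and the pgid are examples):

$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 \
    --op export --pgid 25.0 --file /root/pg.25.0.export
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5 \
    --op import --file /root/pg.25.0.export

and a PG you want to discard from a broken OSD can be dropped with
--op remove --pgid <pgid> --force.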


On Fri, Apr 26, 2019 at 12:01 PM Elise Burke  wrote:
>
> Hi,
>
> I upgraded to Nautilus a week or two ago and things had been mostly fine. I 
> was interested in trying the device health stats feature and enabled it. In 
> doing so it created a pool, device_health_metrics, which contained zero bytes.
>
> Unfortunately this pool developed a PG that could not be repaired with `ceph 
> pg repair`. That's okay, I thought, this pool is empty (zero bytes), so I'll 
> just remove it and discard the PG entirely.
>
> So I did: `ceph osd pool rm device_health_metrics device_health_metrics 
> --yes-i-really-really-mean-it`
>
> Within a few seconds three OSDs had gone missing (this pool was size=3) and 
> now crashloop at startup.
>
> Any assistance in getting these OSDs up (such as by discarding the errant PG) 
> would be appreciated. I'm most concerned about the other pools in the system, 
> as losing three OSDs at once has not been ideal.
>
> This is made more difficult as these are in the Bluestore configuration and 
> were set up with ceph-deploy to bare metal (using LVM mode).
>
> Here's the traceback as noted in journalctl:
>
> Apr 26 11:01:43 databox ceph-osd[1878533]: -5381> 2019-04-26 11:01:08.902 
> 7f8a00866d80 -1 Falling back to public interface
> Apr 26 11:01:43 databox ceph-osd[1878533]: -4241> 2019-04-26 11:01:41.835 
> 7f8a00866d80 -1 osd.2 7630 log_to_monitors {default=true}
> Apr 26 11:01:43 databox ceph-osd[1878533]: -3> 2019-04-26 11:01:43.203 
> 7f89dee53700 -1 bluestore(/var/lib/ceph/osd/ceph-2) _txc_add_transaction 
> error (39) Directory not empty not handled on operation 21 (op 1, counting 
> from 0)
> Apr 26 11:01:43 databox ceph-osd[1878533]: -1> 2019-04-26 11:01:43.209 
> 7f89dee53700 -1 
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14
> Apr 26 11:01:43 databox ceph-osd[1878533]: 
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.0/rpm/el7/BUILD/ceph-14.2.0/src/os/bluest
> Apr 26 11:01:43 databox ceph-osd[1878533]: ceph version 14.2.0 
> (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)
> Apr 26 11:01:43 databox ceph-osd[1878533]: 1: (ceph::__ceph_abort(char 
> const*, int, char const*, std::string const&)+0xd8) [0xfc63afe40]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 2: 
> (BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
> ObjectStore::Transaction*)+0x2a85) [0xfc698e5f5]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 3: 
> (BlueStore::queue_transactions(boost::intrusive_ptr&,
>  std::vector std::allocator >&, boost::intrusive_ptr
> Apr 26 11:01:43 databox ceph-osd[1878533]: 4: 
> (ObjectStore::queue_transaction(boost::intrusive_ptr&,
>  ObjectStore::Transaction&&, boost::intrusive_ptr, 
> ThreadPool::TPHandle*)+0x7f) [0xfc656b81f
> Apr 26 11:01:43 databox ceph-osd[1878533]: 5: 
> (PG::_delete_some(ObjectStore::Transaction*)+0x83d) [0xfc65ce70d]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 6: 
> (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x38) [0xfc65cf528]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 7: 
> (boost::statechart::simple_state PG::RecoveryState::ToDelete, boost::mpl::list mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na
> Apr 26 11:01:43 databox ceph-osd[1878533]: 8: 
> (boost::statechart::state_machine PG::RecoveryState::Initial, std::allocator, 
> boost::statechart::null_exception_translator>::process_event(boost
> Apr 26 11:01:43 databox ceph-osd[1878533]: 9: 
> (PG::do_peering_event(std::shared_ptr, 
> PG::RecoveryCtx*)+0x119) [0xfc65dac99]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 10: 
> (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr, 
> ThreadPool::TPHandle&)+0x1b4) [0xfc6515494]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 11: 
> (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, 
> ThreadPool::TPHandle&)+0x234) [0xfc65158d4]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 12: 
> (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) 
> [0xfc6509c14]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 13: 
> (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) 
> [0xfc6b01f43]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 14: 
> (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xfc6b04fe0]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 15: (()+0x7dd5) [0x7f89fd4b0dd5]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 16: (clone()+0x6d) [0x7f89fc376ead]
> Apr 26 11:01

Re: [ceph-users] PG stuck peering - OSD cephx: verify_authorizer key problem

2019-04-26 Thread Gregory Farnum
On Fri, Apr 26, 2019 at 10:55 AM Jan Pekař - Imatic  wrote:
>
> Hi,
>
> yesterday my cluster reported slow request for minutes and after restarting 
> OSDs (reporting slow requests) it stuck with peering PGs. Whole
> cluster was not responding and IO stopped.
>
> I also notice, that problem was with cephx - all OSDs were reporting the same 
> (even the same number of secret_id)
>
> cephx: verify_authorizer could not get service secret for service osd 
> secret_id=14086
> .. conn(0x559e15a5 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH 
> pgs=0 cs=0 l=1).handle_connect_msg: got bad authorizer
> auth: could not find secret_id=14086
>
> My questions are:
>
> Why happened that?
> Can I prevent cluster from stopping to work (with cephx enabled)?
> How quickly are keys rotating/expiring and can I check problems with that 
> anyhow?
>
> I'm running NTP on nodes (and also ceph monitors), so time should not be the 
> issue. I noticed, that some monitor nodes has no timezone set,
> but I hope MONs are using UTC to distribute keys to clients. Or different 
> timezone between MON and OSD can cause the problem?

Hmm yeah, it's probably not using UTC. (Despite it being good
practice, it's actually not an easy default to adhere to.) cephx
requires synchronized clocks and probably the same timezone (though I
can't swear to that.)

>
> I "fixed" the problem by restarting monitors.
>
> It happened for the second time during last 3 months, so I'm reporting it as 
> issue, that can happen.
>
> I also noticed in all OSDs logs
>
> 2019-04-25 10:06:55.652239 7faf00096700 -1 monclient: _check_auth_rotating 
> possible clock skew, rotating keys expired way too early (before
> 2019-04-25 09:06:55.65)
>
> approximately 7 hours before problem occurred. I can see, that it related to 
> the issue. But why 7 hours? Is there some timeout or grace
> period of old keys usage before they are invalidated?

7 hours shouldn't be directly related. IIRC, by default a new rotating
key is issued every hour; the monitor gives out the current and next key on
request, and daemons accept keys within a half-hour offset of what
they believe the current time to be. Something like that.
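A few quick checks that usually narrow this kind of thing down (a
sketch; the mon name is a placeholder):

$ ceph time-sync-status                                  # clock agreement between the mons
$ ceph daemon mon.a config get auth_service_ticket_ttl   # rotating-key lifetime, 3600s by default
$ timedatectl                                            # NTP and timezone state, run on each node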
-Greg

> Thank you
>
> With regards
>
> Jan Pekar
>
> --
> 
> Ing. Jan Pekař
> jan.pe...@imatic.cz
> 
> Imatic | Jagellonská 14 | Praha 3 | 130 00
> http://www.imatic.cz
> 
> --
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Were fixed CephFS lock ups when it's running on nodes with OSDs?

2019-04-22 Thread Gregory Farnum
On Sat, Apr 20, 2019 at 9:29 AM Igor Podlesny  wrote:
>
> I remember seeing reports in regards but it's being a while now.
> Can anyone tell?

No, this hasn't changed. It's unlikely it ever will; I think NFS
resolved the issue but it took a lot of ridiculous workarounds and
imposes a permanent memory cost on the client.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Object storage for physically separating tenants storage infrastructure

2019-04-15 Thread Gregory Farnum
On Sat, Apr 13, 2019 at 9:42 AM Varun Singh  wrote:
>
> Thanks Greg. A followup question. Will Zone, ZoneGroup and Realm come
> into picture? While reading the documentation, I inferred that by
> setting different Realms, I should be able to achieve the desired
> result. Is that incorrect?

I think they will come in and you're correct, but I haven't worked
with RGW in years so it's a bit out of my wheelhouse.
-Greg

>
> --
> Regards,
> Varun Singh
>
> On Sat, Apr 13, 2019 at 12:50 AM Gregory Farnum  wrote:
> >
> > Yes, you would do this by setting up separate data pools for segregated 
> > clients, giving those pools a CRUSH rule placing them on their own servers, 
> > and if using S3 assigning the clients to them using either wholly separate 
> > instances or perhaps separate zones and the S3 placement options.
> > -Greg
> >
> > On Fri, Apr 12, 2019 at 3:04 AM Varun Singh  wrote:
> >>
> >> Hi,
> >> We have a requirement to build an object storage solution with thin
> >> layer of customization on top. This is to be deployed in our own data
> >> centre. We will be using the objects stored in this system at various
> >> places in our business workflow. The solution should support
> >> multi-tenancy. Multiple tenants can come and store their objects in
> >> it. However, there is also a requirement that a tenant may want to use
> >> their own machines. In that case, their objects should be stored and
> >> replicated within their machines. But those machines should still be
> >> part of our system. This is because we will still need access to the
> >> objects for our business workflows. It's just that their data should
> >> not be stored and replicated outside of their systems. Is it something
> >> that can be achieved using Ceph? Thanks a lot in advance.
> >>
> >> --
> >> Regards,
> >> Varun Singh
> >
> >
> >
> >>
>
> --
> Confidentiality Notice and Disclaimer: This email (including any
> attachments) contains information that may be confidential, privileged
> and/or copyrighted. If you are not the intended recipient, please notify
> the sender immediately and destroy this email. Any unauthorized use of the
> contents of this email in any manner whatsoever, is strictly prohibited. If
> improper activity is suspected, all available information may be used by
> the sender for possible disciplinary action, prosecution, civil claim or
> any remedy or lawful purpose. Email transmission cannot be guaranteed to be
> secure or error-free, as information could be intercepted, lost, arrive
> late, or contain viruses. The sender is not liable whatsoever for damage
> resulting from the opening of this message and/or the use of the
> information contained in this message and/or attachments. Expressions in
> this email cannot be treated as opined by the sender company management –
> they are solely expressed by the sender unless authorized.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Default Pools

2019-04-15 Thread Gregory Farnum
On Mon, Apr 15, 2019 at 1:52 PM Brent Kennedy  wrote:
>
> I was looking around the web for the reason for some of the default pools in 
> Ceph and I cant find anything concrete.  Here is our list, some show no use 
> at all.  Can any of these be deleted ( or is there an article my googlefu 
> failed to find that covers the default pools?
>
> We only use buckets, so I took out .rgw.buckets, .users and 
> .rgw.buckets.index…
>
> Name
> .log
> .rgw.root
> .rgw.gc
> .rgw.control
> .rgw
> .users.uid
> .users.email
> .rgw.buckets.extra
> default.rgw.control
> default.rgw.meta
> default.rgw.log
> default.rgw.buckets.non-ec

All of these are created by RGW when you run it, not by the core Ceph
system. I think they're all used (although they may report sizes of 0,
as they mostly make use of omap).

> metadata

Except this one used to be created-by-default for CephFS metadata, but
that hasn't been true in many releases. So I guess you're looking at
an old cluster? (In which case it's *possible* some of those RGW pools
are also unused now but were needed in the past; I haven't kept good
track of them.)
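If you want to double-check whether a given pool really holds nothing
before removing it, something like this works (a sketch; keep in mind
omap data doesn't show up in the byte counts):

$ ceph df detail
$ rados -p metadata ls | head
$ rados -p .rgw.gc ls | head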
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Object storage for physically separating tenants storage infrastructure

2019-04-12 Thread Gregory Farnum
Yes, you would do this by setting up separate data pools for segregated
clients, giving those pools a CRUSH rule placing them on their own servers,
and if using S3 assigning the clients to them using either wholly separate
instances or perhaps separate zones and the S3 placement options.
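Roughly, the plumbing for one such tenant looks like this (a sketch with
made-up names; the RGW placement-target configuration that points at the
pool is a separate step):

# a CRUSH subtree containing only the tenant's own hosts
$ ceph osd crush add-bucket tenant-a root
$ ceph osd crush move tenant-a-host1 root=tenant-a
$ ceph osd crush rule create-replicated tenant-a-rule tenant-a host

# a data pool pinned to that rule
$ ceph osd pool create tenant-a.rgw.buckets.data 128 128 replicated tenant-a-rule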
-Greg

On Fri, Apr 12, 2019 at 3:04 AM Varun Singh  wrote:

> Hi,
> We have a requirement to build an object storage solution with thin
> layer of customization on top. This is to be deployed in our own data
> centre. We will be using the objects stored in this system at various
> places in our business workflow. The solution should support
> multi-tenancy. Multiple tenants can come and store their objects in
> it. However, there is also a requirement that a tenant may want to use
> their own machines. In that case, their objects should be stored and
> replicated within their machines. But those machines should still be
> part of our system. This is because we will still need access to the
> objects for our business workflows. It's just that their data should
> not be stored and replicated outside of their systems. Is it something
> that can be achieved using Ceph? Thanks a lot in advance.
>
> --
> Regards,
> Varun Singh
>



>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inconsistent PGs caused by omap_digest mismatch

2019-04-08 Thread Gregory Farnum
On Mon, Apr 8, 2019 at 3:19 PM Bryan Stillwell  wrote:
>
> We have two separate RGW clusters running Luminous (12.2.8) that have started 
> seeing an increase in PGs going active+clean+inconsistent with the reason 
> being caused by an omap_digest mismatch.  Both clusters are using FileStore 
> and the inconsistent PGs are happening on the .rgw.buckets.index pool which 
> was moved from HDDs to SSDs within the last few months.
>
> We've been repairing them by first making sure the odd omap_digest is not the 
> primary by setting the primary-affinity to 0 if needed, doing the repair, and 
> then setting the primary-affinity back to 1.
>
> For example PG 7.3 went inconsistent earlier today:
>
> # rados list-inconsistent-obj 7.3 -f json-pretty | jq -r '.inconsistents[] | 
> .errors, .shards'
> [
>   "omap_digest_mismatch"
> ]
> [
>   {
> "osd": 504,
> "primary": true,
> "errors": [],
> "size": 0,
> "omap_digest": "0x4c10ee76",
> "data_digest": "0x"
>   },
>   {
> "osd": 525,
> "primary": false,
> "errors": [],
> "size": 0,
> "omap_digest": "0x26a1241b",
> "data_digest": "0x"
>   },
>   {
> "osd": 556,
> "primary": false,
> "errors": [],
> "size": 0,
> "omap_digest": "0x26a1241b",
> "data_digest": "0x"
>   }
> ]
>
> Since the odd omap_digest is on osd.504 and osd.504 is the primary, we would 
> set the primary-affinity to 0 with:
>
> # ceph osd primary-affinity osd.504 0
>
> Do the repair:
>
> # ceph pg repair 7.3
>
> And then once the repair is complete we would set the primary-affinity back 
> to 1 on osd.504:
>
> # ceph osd primary-affinity osd.504 1
>
> There doesn't appear to be any correlation between the OSDs which would point 
> to a hardware issue, and since it's happening on two different clusters I'm 
> wondering if there's a race condition that has been fixed in a later version?
>
> Also, what exactly is the omap digest?  From what I can tell it appears to be 
> some kind of checksum for the omap data.  Is that correct?

Yeah; it's just a crc over the omap key-value data that's checked
during deep scrub. Same as the data digest.

I've not noticed any issues around this in Luminous, but I probably
wouldn't have, so I'll have to leave it up to others whether any
fixes have gone in since 12.2.8.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS and many small files

2019-04-04 Thread Gregory Farnum
On Mon, Apr 1, 2019 at 4:04 AM Paul Emmerich  wrote:
>
> There are no problems with mixed bluestore_min_alloc_size; that's an
> abstraction layer lower than the concept of multiple OSDs. (Also, you
> always have that when mixing SSDs and HDDs)
>
> I'm not sure about the real-world impacts of a lower min alloc size or
> the rationale behind the default values for HDDs (64) and SSDs (16kb).

The min_alloc_size in BlueStore controls which IO requests allocate
new space versus getting data-journaled in the WAL and then
read-modify-write'n over an existing block on disk.

Hard drives have the size set higher because the random IOs are
more expensive for them than the cost of streaming the data out
twice.
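For reference, the knobs are per device class and are only read when an
OSD is created; existing OSDs keep whatever they were built with (a
sketch; 4096 is just an example value, not a recommendation):

[osd]
    bluestore_min_alloc_size_hdd = 4096
    bluestore_min_alloc_size_ssd = 4096

# the value currently configured (not necessarily what an older OSD was formatted with)
$ ceph daemon osd.0 config get bluestore_min_alloc_size_hdd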
-Greg

>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
> On Mon, Apr 1, 2019 at 10:36 AM Clausen, Jörn  wrote:
> >
> > Hi Paul!
> >
> > Thanks for your answer. Yep, bluestore_min_alloc_size and your
> > calculation sounds very reasonable to me :)
> >
> > Am 29.03.2019 um 23:56 schrieb Paul Emmerich:
> > > Are you running on HDDs? The minimum allocation size is 64kb by
> > > default here. You can control that via the parameter
> > > bluestore_min_alloc_size during OSD creation.
> > > 64 kb times 8 million files is 512 GB which is the amount of usable
> > > space you reported before running the test, so that seems to add up.
> >
> > My test cluster is virtualized on vSphere, but the OSDs are reported as
> > HDDs. And our production cluster also uses HDDs only. All OSDs use the
> > default value for bluestore_min_alloc_size.
> >
> > If we should really consider tinkering with bluestore_min_alloc_size: As
> > this is probably not tunable afterwards, we would need to replace all
> > OSDs in a rolling update. Should we expect any problems while we have
> > OSDs with mixed min_alloc_sizes?
> >
> > > There's also some metadata overhead etc. You might want to consider
> > > enabling inline data in cephfs to handle small files in a
> > > store-efficient way (note that this feature is officially marked as
> > > experimental, though).
> > > http://docs.ceph.com/docs/master/cephfs/experimental-features/#inline-data
> >
> > I'll give it a try on my test cluster.
> >
> > --
> > Jörn Clausen
> > Daten- und Rechenzentrum
> > GEOMAR Helmholtz-Zentrum für Ozeanforschung Kiel
> > Düsternbrookerweg 20
> > 24105 Kiel
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Wrong certificate delivered on https://ceph.io/

2019-04-04 Thread Gregory Farnum
I believe our community manager Mike is in charge of that?

On Wed, Apr 3, 2019 at 6:49 AM Raphaël Enrici  wrote:
>
> Dear all,
>
> is there somebody in charge of the ceph hosting here, or someone who
> knows the guy who knows another guy who may know...
>
> Saw this while reading the FOSDEM 2019 presentation by Sage, I clicked
> on the link at the end which is labelled http://ceph.io/ but linked to
> https://ceph.io/.
>
> The certificate delivered when you visit https://ceph.io/ is only valid
> for ceph.com and www.ceph.com. If someone can fix this or point me to
> the right person.
>
> Best,
> Raphaël
> P.S. Thank you so much for Ceph ;)
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Disable cephx with centralized configs

2019-04-04 Thread Gregory Farnum
I think this got dealt with on irc, but for those following along at home:

I think the problem here is that you've set the central config to
disable authentication, but the client doesn't know what those config
options look like until it's connected, which it can't do, because
the cluster is still demanding cephx authentication.
(Or, possibly, the other way around if you wrote it locally but didn't
restart the monitors and other daemons after updating their configs.)

Note also that there are a number of auth settings and the only one
named is auth_client_required, but there's also auth_cluster_required
and auth_service_required for the long-running daemons...
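For completeness, the old-style way to turn cephx off entirely is to set
all three in the local ceph.conf on every node and then restart the
monitors first, followed by the other daemons (a sketch, and obviously
not a security recommendation):

[global]
    auth_cluster_required = none
    auth_service_required = none
    auth_client_required = none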

On Wed, Apr 3, 2019 at 7:44 PM Shawn Edwards  wrote:
>
> To debug another issue I'm having, I'd like to try to disable cephx auth 
> between my ceph servers.  According to the docs, the way to do this is by 
> setting 'auth_client_required' to 'none' in ceph.conf  This no longer works, 
> as you get an error message like this when running 'ceph -s':
>
> 2019-04-03 19:29:27.134 7f01772a3700 -1 monclient(hunting): 
> handle_auth_bad_method server allowed_methods [2] but i only support [1]
>
> If you try to change the setting using 'ceph config', you get both the new 
> and old setting listed in the configuration, and the same error as above.
>
> Is it possible to turn off cephx once it it turned on now?  If so, what's the 
> right method?
>
> --
>  Shawn Edwards
>  Beware programmers with screwdrivers.  They tend to spill them on their 
> keyboards.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: effects of using hard links

2019-03-22 Thread Gregory Farnum
On Thu, Mar 21, 2019 at 2:45 PM Dan van der Ster  wrote:
>
> On Thu, Mar 21, 2019 at 8:51 AM Gregory Farnum  wrote:
> >
> > On Wed, Mar 20, 2019 at 6:06 PM Dan van der Ster  
> > wrote:
> >>
> >> On Tue, Mar 19, 2019 at 9:43 AM Erwin Bogaard  
> >> wrote:
> >> >
> >> > Hi,
> >> >
> >> >
> >> >
> >> > For a number of application we use, there is a lot of file duplication. 
> >> > This wastes precious storage space, which I would like to avoid.
> >> >
> >> > When using a local disk, I can use a hard link to let all duplicate 
> >> > files point to the same inode (use “rdfind”, for example).
> >> >
> >> >
> >> >
> >> > As there isn’t any deduplication in Ceph(FS) I’m wondering if I can use 
> >> > hard links on CephFS in the same way as I use for ‘regular’ file systems 
> >> > like ext4 and xfs.
> >> >
> >> > 1. Is it advisible to use hard links on CephFS? (It isn’t in the ‘best 
> >> > practices’: http://docs.ceph.com/docs/master/cephfs/app-best-practices/)
> >> >
> >> > 2. Is there any performance (dis)advantage?
> >> >
> >> > 3. When using hard links, is there an actual space savings, or is there 
> >> > some trickery happening?
> >> >
> >> > 4. Are there any issues (other than the regular hard link ‘gotcha’s’) I 
> >> > need to keep in mind combining hard links with CephFS?
> >>
> >> The only issue we've seen is if you hardlink b to a, then rm a, then
> >> never stat b, the inode is added to the "stray" directory. By default
> >> there is a limit of 1 million stray entries -- so if you accumulate
> >> files in this state eventually users will be unable to rm any files,
> >> until you stat the `b` files.
> >
> >
> > Eek. Do you know if we have any tickets about that issue? It's easy to see 
> > how that happens but definitely isn't a good user experience!
>
> I'm not aware of a ticket -- I had thought it was just a fact of life
> with hardlinks and cephfs.

I think it is for now, but as you've demonstrated that's not really a
good situation and I'm sure we can figure out some way of
automatically merging inodes into their remaining link parents.
I've created a ticket at http://tracker.ceph.com/issues/38849

> After hitting this issue in prod, we found the explanation here in
> this old thread (with your useful post ;) ):
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-October/013621.html
>
> Our immediate workaround was to increase mds bal fragment size max
> (e.g. to 20).
> In our env we now monitor num_strays in case these get out of control again.
>
> BTW, now thinking about this more... isn't directory fragmentation
> supposed to let the stray dir grow to unlimited shards? (on our side
> it seems limited to 10 shards). Maybe this is just some configuration
> issue on our side?

Sounds like I haven't missed a change here: the stray directory is a
special system directory that doesn't get fragmented like normal ones
do. We just set it up (hard-coded even, IIRC, but maybe a config
option) so that each MDS gets 10 of them after the first time somebody
managed to make it large enough that a single stray directory object
got too large. o_0
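For anyone who wants to watch this, the relevant numbers are visible
through the admin socket (a sketch; the daemon name is a placeholder):

$ ceph daemon mds.a perf dump mds_cache | grep -i stray
$ ceph daemon mds.a config get mds_bal_fragment_size_max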
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: effects of using hard links

2019-03-21 Thread Gregory Farnum
On Wed, Mar 20, 2019 at 6:06 PM Dan van der Ster  wrote:

> On Tue, Mar 19, 2019 at 9:43 AM Erwin Bogaard 
> wrote:
> >
> > Hi,
> >
> >
> >
> > For a number of application we use, there is a lot of file duplication.
> This wastes precious storage space, which I would like to avoid.
> >
> > When using a local disk, I can use a hard link to let all duplicate
> files point to the same inode (use “rdfind”, for example).
> >
> >
> >
> > As there isn’t any deduplication in Ceph(FS) I’m wondering if I can use
> hard links on CephFS in the same way as I use for ‘regular’ file systems
> like ext4 and xfs.
> >
> > 1. Is it advisible to use hard links on CephFS? (It isn’t in the ‘best
> practices’: http://docs.ceph.com/docs/master/cephfs/app-best-practices/)
> >
> > 2. Is there any performance (dis)advantage?
> >
> > 3. When using hard links, is there an actual space savings, or is there
> some trickery happening?
> >
> > 4. Are there any issues (other than the regular hard link ‘gotcha’s’) I
> need to keep in mind combining hard links with CephFS?
>
> The only issue we've seen is if you hardlink b to a, then rm a, then
> never stat b, the inode is added to the "stray" directory. By default
> there is a limit of 1 million stray entries -- so if you accumulate
> files in this state eventually users will be unable to rm any files,
> until you stat the `b` files.
>

Eek. Do you know if we have any tickets about that issue? It's easy to see
how that happens but definitely isn't a good user experience!
-Greg


>
> -- dan
>
>
> -- dan
>
>
> >
> >
> >
> > Thanks
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: effects of using hard links

2019-03-19 Thread Gregory Farnum
On Tue, Mar 19, 2019 at 2:13 PM Erwin Bogaard 
wrote:

> Hi,
>
>
>
> For a number of application we use, there is a lot of file duplication.
> This wastes precious storage space, which I would like to avoid.
>
> When using a local disk, I can use a hard link to let all duplicate files
> point to the same inode (use “rdfind”, for example).
>
>
>
> As there isn’t any deduplication in Ceph(FS) I’m wondering if I can use
> hard links on CephFS in the same way as I use for ‘regular’ file systems
> like ext4 and xfs.
>
> 1. Is it advisible to use hard links on CephFS? (It isn’t in the ‘best
> practices’: http://docs.ceph.com/docs/master/cephfs/app-best-practices/)
>

This should be okay now. Hard links have changed a few times so Zheng can
correct me if I've gotten something wrong, but the differences from
regular files, from a user/performance perspective, are:
* if you take snapshots and have hard links, hard-linked files are special
and will be a member of *every* snapshot in the system (which only matters
if you actually write to them during all those snapshots)
* opening a hard-linked file may behave as if you were doing two file opens
instead of one, from a performance perspective. But this might have
changed? (In the past, you would need to look up the file name you open,
and then do another lookup on the authoritative location of the file.)


> 2. Is there any performance (dis)advantage?
>

Generally not once the file is open.

3. When using hard links, is there an actual space savings, or is there
> some trickery happening?
>

If you create a hard link, there is a single copy of the file data in RADOS
that all the file names refer to. I think that's what you're asking?


> 4. Are there any issues (other than the regular hard link ‘gotcha’s’) I
> need to keep in mind combining hard links with CephFS?
>

Not other than above.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS - large omap object

2019-03-18 Thread Gregory Farnum
On Mon, Mar 18, 2019 at 7:28 PM Yan, Zheng  wrote:
>
> On Mon, Mar 18, 2019 at 9:50 PM Dylan McCulloch  wrote:
> >
> >
> > >please run following command. It will show where is 4.
> > >
> > >rados -p -p hpcfs_metadata getxattr 4. parent >/tmp/parent
> > >ceph-dencoder import /tmp/parent type inode_backtrace_t decode dump_json
> > >
> >
> > $ ceph-dencoder import /tmp/parent type inode_backtrace_t decode dump_json
> > {
> > "ino": 4,
> > "ancestors": [
> > {
> > "dirino": 1,
> > "dname": "lost+found",
> > "version": 1
> > }
> > ],
> > "pool": 20,
> > "old_pools": []
> > }
> >
> > I guess it may have a very large number of files from previous recovery 
> > operations?
> >
>
> Yes, these files are created by cephfs-data-scan. If you don't want
> them, you can delete "lost+found"

This certainly makes sense, but even with that pointer I can't find
how it's picking inode 4. That should probably be documented? :)
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Running ceph status as non-root user?

2019-03-15 Thread Gregory Farnum
You will either need access to a ceph.conf, or else have some way to
pass in on the CLI:
* monitor IP addresses
* a client ID
* a client key (or keyring file)

Your ceph.conf doesn't strictly need to be the same one used for other
things on the cluster, so you could assemble it yourself. Same goes
for the client keyring. But one way or another you need to gather up
that data and provide it when you invoke the ceph tool.
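As a sketch (the monitor address, client name, and paths are placeholders):

# a dedicated read-only client for monitoring
$ ceph auth get-or-create client.telegraf mon 'allow r' \
      -o /etc/telegraf/ceph.client.telegraf.keyring

# invoke the tool with everything passed explicitly instead of via /etc/ceph/ceph.conf
$ ceph status -m 192.0.2.10:6789 --id telegraf \
      --keyring /etc/telegraf/ceph.client.telegraf.keyring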
-Greg

On Fri, Mar 15, 2019 at 1:04 PM Victor Hooi  wrote:
>
> Hi,
>
> I'm attempting to setup Telegraf on a Proxmox machine to send Ceph 
> information into InfluxDB.
>
> I had a few issues around permissions 
> (https://github.com/influxdata/telegraf/issues/5590), but we seem to be 
> nearly sorted out.
>
> However, one issue still remains around ceph status.
>
> Specifically, it seems to require being able to read /etc/ceph/ceph.conf. For 
> example, if I run sudo under the telegraf user context:
>
> root@syd1:/etc/ceph# sudo -u telegraf ceph status
> 2019-03-16 07:01:37.262708 7f7031e1e700 -1 Errors while parsing config file!
> 2019-03-16 07:01:37.262712 7f7031e1e700 -1 parse_file: cannot open 
> /etc/ceph/ceph.conf: (13) Permission denied
> Error initializing cluster client: PermissionDeniedError('error calling 
> conf_read_file',)
>
>
> However, on Proxmox, ceph.conf is a symlink to a file on their pmxcfs file 
> system - which doesn't let you set custom permissions.
>
> Is there another way around this, to get ceph status to run under a non-root 
> user?
>
> Thanks,
> Victor
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS segfaults on client connection -- brand new FS

2019-03-08 Thread Gregory Farnum
I don’t have any idea what’s going on here or why it’s not working, but you
are using v0.94.7. That release is:
1) out of date for the Hammer cycle, which reached at least 0.94.10
2) prior to the release where we declared CephFS stable (Jewel, v10.2.0)
3) way past its supported expiration date.

You will have a much better time deploying Luminous or Mimic, especially
since you want to use CephFS. :)
-Greg

On Fri, Mar 8, 2019 at 5:02 PM Kadiyska, Yana  wrote:

> Hi,
>
>
>
> I’m very much hoping someone can unblock me on this – we recently ran into
> a very odd issue – I sent an earlier email to the list
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-March/033579.html
>
>
>
> After unsuccessfully trying to repair we decided to forsake the Filesystem
>
>
>
> I marked the cluster down, failed the MDSs, removed the FS and the
> metadata and data pools.
>
>
>
> Then created a new Filesystem from scratch.
>
>
>
> However, I am still observing MDS segfaulting when a client tries to
> connect. This is quite urgent for me as we don’t have a functioning
> Filesystem – if someone can advise how I can remove any and all state
> please do so – I just want to start fresh. I am very puzzled that a brand
> new FS doesn’t work
>
>
>
> Here is the MDS log at level 20 – one odd thing I notice is that the
> client seems to start showing ? as the id well before the segfault…In any
> case, I’m just asking what needs to be done to remove all state from the
> MDS nodes:
>
>
>
> 2019-03-08 19:30:12.024535 7f25ec184700 20 mds.0.server get_session have
> 0x5477e00 client.*2160819875* :0/945029522 state open
>
> 2019-03-08 19:30:12.024537 7f25ec184700 15 mds.0.server
> oldest_client_tid=1
>
> 2019-03-08 19:30:12.024564 7f25ec184700  7 mds.0.cache request_start
> request(client.?:1 cr=0x54a8680)
>
> 2019-03-08 19:30:12.024566 7f25ec184700  7 mds.0.server
> dispatch_client_request client_request(client.?:1 getattr pAsLsXsFs #1
> 2019-03-08 19:29:15.425510 RETRY=2) v2
>
> 2019-03-08 19:30:12.024576 7f25ec184700 10 mds.0.server
> rdlock_path_pin_ref request(client.?:1 cr=0x54a8680) #1
>
> 2019-03-08 19:30:12.024577 7f25ec184700  7 mds.0.cache traverse: opening
> base ino 1 snap head
>
> 2019-03-08 19:30:12.024579 7f25ec184700 10 mds.0.cache path_traverse
> finish on snapid head
>
> 2019-03-08 19:30:12.024580 7f25ec184700 10 mds.0.server ref is [inode 1
> [...2,head] / auth v1 snaprealm=0x53b8480 f() n(v0 1=0+1) (iversion lock) |
> dirfrag=1 0x53ca968]
>
> 2019-03-08 19:30:12.024589 7f25ec184700 10 mds.0.locker acquire_locks
> request(client.?:1 cr=0x54a8680)
>
> 2019-03-08 19:30:12.024591 7f25ec184700 20 mds.0.locker  must rdlock
> (iauth sync) [inode 1 [...2,head] / auth v1 snaprealm=0x53b8480 f() n(v0
> 1=0+1) (iversion lock) | request=1 dirfrag=1 0x53ca968]
>
> 2019-03-08 19:30:12.024594 7f25ec184700 20 mds.0.locker  must rdlock
> (ilink sync) [inode 1 [...2,head] / auth v1 snaprealm=0x53b8480 f() n(v0
> 1=0+1) (iversion lock) | request=1 dirfrag=1 0x53ca968]
>
> 2019-03-08 19:30:12.024597 7f25ec184700 20 mds.0.locker  must rdlock
> (ifile sync) [inode 1 [...2,head] / auth v1 snaprealm=0x53b8480 f() n(v0
> 1=0+1) (iversion lock) | request=1 dirfrag=1 0x53ca968]
>
> 2019-03-08 19:30:12.024600 7f25ec184700 20 mds.0.locker  must rdlock
> (ixattr sync) [inode 1 [...2,head] / auth v1 snaprealm=0x53b8480 f() n(v0
> 1=0+1) (iversion lock) | request=1 dirfrag=1 0x53ca968]
>
> 2019-03-08 19:30:12.024602 7f25ec184700 20 mds.0.locker  must rdlock
> (isnap sync) [inode 1 [...2,head] / auth v1 snaprealm=0x53b8480 f() n(v0
> 1=0+1) (iversion lock) | request=1 dirfrag=1 0x53ca968]
>
> 2019-03-08 19:30:12.024605 7f25ec184700 10 mds.0.locker  must authpin
> [inode 1 [...2,head] / auth v1 snaprealm=0x53b8480 f() n(v0 1=0+1)
> (iversion lock) | request=1 dirfrag=1 0x53ca968]
>
> 2019-03-08 19:30:12.024607 7f25ec184700 10 mds.0.locker  auth_pinning
> [inode 1 [...2,head] / auth v1 snaprealm=0x53b8480 f() n(v0 1=0+1)
> (iversion lock) | request=1 dirfrag=1 0x53ca968]
>
> 2019-03-08 19:30:12.024610 7f25ec184700 10 mds.0.cache.ino(1) auth_pin by
> 0x51e5e00 on [inode 1 [...2,head] / auth v1 ap=1+0 snaprealm=0x53b8480 f()
> n(v0 1=0+1) (iversion lock) | request=1 dirfrag=1 authpin=1 0x53ca968] now
> 1+0
>
> 2019-03-08 19:30:12.024614 7f25ec184700  7 mds.0.locker rdlock_start  on
> (isnap sync) on [inode 1 [...2,head] / auth v1 ap=1+0 snaprealm=0x53b8480
> f() n(v0 1=0+1) (iversion lock) | request=1 dirfrag=1 authpin=1 0x53ca968]
>
> 2019-03-08 19:30:12.024618 7f25ec184700 10 mds.0.locker  got rdlock on
> (isnap sync r=1) [inode 1 [...2,head] / auth v1 ap=1+0 snaprealm=0x53b8480
> f() n(v0 1=0+1) (isnap sync r=1) (iversion lock) | request=1 lock=1
> dirfrag=1 authpin=1 0x53ca968]
>
> 2019-03-08 19:30:12.024621 7f25ec184700  7 mds.0.locker rdlock_start  on
> (ifile sync) on [inode 1 [...2,head] / auth v1 ap=1+0 snaprealm=0x53b8480
> f() n(v0 1=0+1) (isnap sync r=1) (iversion lock) | request=1 lock=1
> dirfrag=1 authpin=1 0x53ca968]
>
> 2019-03-08 

Re: [ceph-users] garbage in cephfs pool

2019-03-07 Thread Gregory Farnum
Are they getting cleaned up? CephFS does not instantly delete files; they
go into a "purge queue" and get cleaned up later by the MDS.
-Greg

On Thu, Mar 7, 2019 at 2:00 AM Fyodor Ustinov  wrote:

> Hi!
>
> After removing all files from cephfs I see that situation:
> #ceph df
> POOLS:
> NAME  ID  USED  %USED  MAX AVAIL  OBJECTS
> fsd   2   0 B   0      233 TiB    11527762
>
> #rados df
> POOL_NAME  USED  OBJECTS   CLONES  COPIES    MISSING_ON_PRIMARY  UNFOUND
> DEGRADED   RD_OPS     RD      WR_OPS    WR
> fsd        0 B   11527761  270     69166566  0                   0
> 0          137451347  61 TiB  46169087  63 TiB
>
> pool contain objects like that:
> 1bd3d0a.
> 1af4b02.
> 12a3b4a.
> 11a1876.
> 1bbda52.
> 1a09fcd.
> 1b54612.
>
>
> Where did these objects come from and how to get rid of them?
>
>
> WBR,
> Fyodor.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can CephFS Kernel Client Not Read & Write at the Same Time?

2019-03-07 Thread Gregory Farnum
In general, no, this is not an expected behavior.

My guess would be that something odd is happening with the other clients
you have attached to the system, and there's a weird pattern in the way the file
locks are being issued. Can you be more precise about exactly what workload
you're running, and get the output of the session list on your MDS while
doing so?
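The session list can be pulled from the active MDS like so (a sketch;
the daemon name is a placeholder):

$ ceph daemon mds.a session ls
# or, from any admin node on recent releases:
$ ceph tell mds.a session ls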
-Greg

On Wed, Mar 6, 2019 at 9:49 AM Andrew Richards <
andrew.richa...@keepertech.com> wrote:

> We discovered recently that our CephFS mount appeared to be halting reads
> when writes were being synched to the Ceph cluster to the point it was
> affecting applications.
>
> I also posted this as a Gist with embedded graph images to help
> illustrate:
> https://gist.github.com/keeperAndy/aa80d41618caa4394e028478f4ad1694
>
> The following is the plain text from the Gist.
>
> First, details about the host:
>
> 
> $ uname -r
> 4.16.13-041613-generic
>
> $ egrep 'xfs|ceph' /proc/mounts
> 192.168.1.115:6789,192.168.1.116:6789,192.168.1.117:6789:/ /cephfs
> ceph rw,noatime,name=cephfs,secret=,rbytes,acl,wsize=16777216 0 0
> /dev/mapper/tst01-lvidmt01 /rbd_xfs xfs
> rw,relatime,attr2,inode64,logbsize=256k,sunit=512,swidth=1024,noquota 0 0
>
> $ ceph -v
> ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a)
> luminous (stable)
>
> $ cat /proc/net/bonding/bond1
> Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
>
> Bonding Mode: adaptive load balancing
> Primary Slave: None
> Currently Active Slave: net6
> MII Status: up
> MII Polling Interval (ms): 100
> Up Delay (ms): 200
> Down Delay (ms): 200
>
> Slave Interface: net8
> MII Status: up
> Speed: 1 Mbps
> Duplex: full
> Link Failure Count: 2
> Permanent HW addr: e4:1d:2d:17:71:e1
> Slave queue ID: 0
>
> Slave Interface: net6
> MII Status: up
> Speed: 1 Mbps
> Duplex: full
> Link Failure Count: 1
> Permanent HW addr: e4:1d:2d:17:71:e0
> Slave queue ID: 0
>
> 
>
> We had CephFS mounted alongside an XFS filesystem made up of 16 RBD images
> aggregated under LVM as our storage targets. The link to the Ceph cluster
> from the host is a mode 6 2x10GbE bond (bond1 above).
>
> We started capturing network counters from the Ceph cluster connection
> (bond1) on the host using ifstat at its most granular setting of 0.1
> (sampling every tenth of a second). We then ran various overlapping read
> and write operations in separate shells on the same host to obtain samples
> of how our different means of accessing Ceph handled this. We converted our
> ifstat output to CSV and insterted it into a spreadsheet to visualize the
> network activity.
>
> We found that the CephFS kernel mount did indeed appear to pause ongoing
> reads when writes were being flushed from the page cache to the Ceph
> cluster.
>
> We wanted to see if we could make this more pronounced, so we added a
> 6Gb-limit tc filter to the interface and re-ran our tests. This yielded
> much lengthier delay periods in the reads while the writes were more slowly
> flushed from the page cache to the Ceph cluster.
>
> A more restrictive 2Gbit-limit tc filter produced much lengthier delays of
> our reads as the writes were synched to the cluster.
>
> When we tested the same I/O on the RBD-backed XFS file system on the same
> host, we found a very different pattern. The reads seemed to be given
> priority over the write activity, but the writes were only slowed, they
> were not halted.
>
> Finally we tested overlapping SMB client reads and writes to a Samba share
> that used the userspace libceph-based VFS_Ceph module to produce the share.
> In this case, while raw throughput was lower than that of the kernel, the
> reads and writes did not interrupt each other at all.
>
> Is this expected behavior for the CephFS kernel drivers? Can a CephFS
> kernel client really not read and write to the file system simultaneously?
>
> Thanks,
> Andrew Richards
> Senior Systems Engineer
> Keeper Technology, LLC
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] luminous 12.2.11 on debian 9 requires nscd?

2019-02-27 Thread Gregory Farnum
This is probably a build issue of some kind, but I'm not quite sure how...
The MDS (and all the Ceph code) is just invoking the getgrnam_r function,
which is part of POSIX and implemented by glibc (or whatever other libc). So
any dependency on nscd is being created "behind our backs" somewhere.
Anyone have any ideas?
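For what it's worth, error 34 is ERANGE, which getgrnam_r returns when
the caller's buffer is too small for the group entry, so the fairly
large /etc/group shown in the strace may be a factor. A couple of quick
checks (a sketch):

$ getent group ceph                 # exercises the same NSS path as getgrnam_r
$ grep '^group:' /etc/nsswitch.conf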
-Greg

On Tue, Feb 26, 2019 at 7:06 PM Chad W Seys  wrote:

> Hi all,
>I cannot get my luminous 12.2.11 mds servers to start on Debian 9(.8)
> unless nscd is also installed.
>
>Trying to start from command line:
> #  /usr/bin/ceph-mds -f --cluster ceph --id mds02.hep.wisc.edu --setuser
> ceph --setgroup ceph unable to look up group 'ceph': (34) Numerical
> result out of range
>
>Can look up ceph fine with 'id'
> # id ceph
> uid=11(ceph) gid=11(ceph) groups=11(ceph)
>
>
> If I strace, I notice that an nscd directory makes an appearance:
> [...]
> open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
> lseek(3, 0, SEEK_CUR)   = 0
> fstat(3, {st_mode=S_IFREG|0644, st_size=285846, ...}) = 0
> mmap(NULL, 285846, PROT_READ, MAP_SHARED, 3, 0) = 0x7f5970ed2000
> lseek(3, 285846, SEEK_SET)  = 285846
> munmap(0x7f5970ed2000, 285846)  = 0
> close(3)= 0
> socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
> connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) =
> -1 ENOENT (No such file or directory)
> close(3)= 0
> socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
> connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) =
> -1 ENOENT (No such file or directory)
> close(3)= 0
> open("/etc/group", O_RDONLY|O_CLOEXEC)  = 3
> lseek(3, 0, SEEK_CUR)   = 0
> fstat(3, {st_mode=S_IFREG|0644, st_size=122355, ...}) = 0
> mmap(NULL, 122355, PROT_READ, MAP_SHARED, 3, 0) = 0x7f5970efa000
> lseek(3, 122355, SEEK_SET)  = 122355
> lseek(3, 7495, SEEK_SET)= 7495
> munmap(0x7f5970efa000, 122355)  = 0
> close(3)= 0
> write(2, "unable to look up group '", 25unable to look up group ') = 25
> write(2, "ceph", 4ceph) = 4
> write(2, "'", 1')= 1
> write(2, ": ", 2: )   = 2
> write(2, "(34) Numerical result out of ran"..., 34(34) Numerical result
> out of range) = 34
> write(2, "\n", 1
>
> So I install nscd and mds starts!
>
> Shouldn't ceph be agnostic in how the ceph group is looked up?  Do I
> have some kind of config problem?
>
> My nsswitch.conf file is below.  I've tried replacing 'compat' with
> files, but there is no change.
>
> # cat /etc/nsswitch.conf
> # /etc/nsswitch.conf
> #
> # Example configuration of GNU Name Service Switch functionality.
> # If you have the `glibc-doc-reference' and `info' packages installed, try:
> # `info libc "Name Service Switch"' for information about this file.
>
> passwd: compat
> group:  compat
> shadow: compat
> gshadow:files
>
> hosts:  files dns
> networks:   files
>
> protocols:  db files
> services:   db files
> ethers: db files
> rpc:db files
>
> netgroup:   nis
>
>
> Thanks!
> Chad.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd exit common/Thread.cc: 160: FAILED assert(ret == 0)--10.2.10

2019-02-27 Thread Gregory Farnum
The OSD tried to create a new thread, and the kernel told it no. You
probably need to turn up the limits on threads and/or file descriptors.
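A few places to look (a sketch; the pid is a placeholder and the values
are examples, not recommendations):

# per-process limits of a running OSD
$ grep -iE 'processes|open files' /proc/<osd-pid>/limits

# system-wide knobs that dense OSD hosts commonly exhaust
$ sysctl kernel.pid_max kernel.threads-max vm.max_map_count
$ sysctl -w kernel.pid_max=4194304 vm.max_map_count=524288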
-Greg

On Wed, Feb 27, 2019 at 2:36 AM hnuzhoulin2  wrote:

> Hi, guys
>
> So far, there have been 10 osd service exit because of this error.
> the error messages are all the same.
>
> 2019-02-27 17:14:59.757146 7f89925ff700 0 -- 10.191.175.15:6886/192803 >>
> 10.191.175.49:6833/188731 pipe(0x55ebba819400 sd=741 :6886 s=0 pgs=0 cs=0
> l=0 c=0x55ebbb8ba900).accept connect_seq 3912 vs existing 3911 state standby
> 2019-02-27 17:15:05.858802 7f89d9856700 -1 common/Thread.cc: In function
> 'void Thread::create(const char*, size_t)' thread 7f89d9856700 time
> 2019-02-27 17:15:05.806607
> common/Thread.cc: 160: FAILED assert(ret == 0)
>
> ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x82) [0x55eb7a849e12]
>  2: (Thread::create(char const*, unsigned long)+0xba) [0x55eb7a82c14a]
>  3: (SimpleMessenger::add_accept_pipe(int)+0x6f) [0x55eb7a8203ef]
>  4: (Accepter::entry()+0x379) [0x55eb7a8f3ee9]
>  5: (()+0x8064) [0x7f89ecf76064]
>  6: (clone()+0x6d) [0x7f89eb07762d]
>  NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
> --- begin dump of recent events ---
> 1> 2019-02-27 17:14:50.999276 7f893e811700 1 - 10.191.175.15:0/192803
> < osd.850 10.191.175.46:6837/190855 6953447  osd_ping(ping_reply
> e17846 stamp 2019-02-27 17:14:50.995043) v3  2004+0+0 (3980167553 0 0)
> 0x55eba12b7400 con 0x55eb96ada600
>
> detail logs see:
> https://drive.google.com/file/d/1fZyhTj06CJlcRjmllaPQMNJknI9gAg6J/view
>
> when I restart these osd services, it looks works well. But I do not know
> if it will happen in the other osds.
> And I can not find any error log in the system except the following dmesg
> info:
>
> [三 1月 30 08:14:11 2019] megasas: Command pool (fusion) empty!
> [三 1月 30 08:14:11 2019] Couldn't build MFI pass thru cmd
> [三 1月 30 08:14:11 2019] Couldn't issue MFI pass thru cmd
> [三 1月 30 08:14:11 2019] megasas: Command pool empty!
> [三 1月 30 08:14:11 2019] megasas: Failed to get a cmd packet
> [三 1月 30 08:14:11 2019] megasas: Command pool empty!
> [三 1月 30 08:14:11 2019] megasas: Failed to get a cmd packet
> [三 1月 30 08:14:11 2019] megasas: Command pool empty!
> [三 1月 30 08:14:11 2019] megasas: Failed to get a cmd packet
> [三 1月 30 08:14:11 2019] megasas: Command pool (fusion) empty!
> [三 1月 30 08:14:11 2019] megasas: Err returned from build_and_issue_cmd
> [三 1月 30 08:14:11 2019] megasas: Command pool (fusion) empty!
>
> this cluster only used aas rbd cluster,ceph status is below:
> root@cld-osd5-44:~# ceph -s
> cluster 2bec9425-ea5f-4a48-b56a-fe88e126bced
> health HEALTH_WARN
> noout flag(s) set
> monmap e1: 3 mons at {a=
> 10.191.175.249:6789/0,b=10.191.175.250:6789/0,c=10.191.175.251:6789/0}
> election epoch 26, quorum 0,1,2 a,b,c
> osdmap e17856: 1080 osds: 1080 up, 1080 in
> flags noout,sortbitwise,require_jewel_osds
> pgmap v25160475: 90112 pgs, 3 pools, 43911 GB data, 17618 kobjects
> 139 TB used, 1579 TB / 1718 TB avail
> 90108 active+clean
> 3 active+clean+scrubbing+deep
> 1 active+clean+scrubbing
> client io 107 MB/s rd, 212 MB/s wr, 1621 op/s rd, 7555 op/s wr
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Nautilus Release T-shirt Design

2019-02-15 Thread Gregory Farnum
On Fri, Feb 15, 2019 at 1:39 AM Ilya Dryomov  wrote:

> On Fri, Feb 15, 2019 at 12:05 AM Mike Perez  wrote:
> >
> > Hi Marc,
> >
> > You can see previous designs on the Ceph store:
> >
> > https://www.proforma.com/sdscommunitystore
>
> Hi Mike,
>
> This site stopped working during DevConf and hasn't been working since.
> I think Greg has contacted some folks about this, but it would be great
> if you could follow up because it's been a couple of weeks now...


That’s odd because we thought this was resolved by Monday, but I do see
from the time stamps I was back in the USA when testing it. It must be
geographical as Dan says... :/

>
>
> Thanks,
>
> Ilya
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jewel10.2.11 EC pool out a osd, its PGs remap to the osds in the same host

2019-02-15 Thread Gregory Farnum
Actually I think I misread what this was doing, sorry.

Can you do a “ceph osd tree”? It’s hard to see the structure via the text
dumps.

On Wed, Feb 13, 2019 at 10:49 AM Gregory Farnum  wrote:

> Your CRUSH rule for EC pools is forcing that behavior with the line
>
> step chooseleaf indep 1 type ctnr
>
> If you want different behavior, you’ll need a different crush rule.
>
> On Tue, Feb 12, 2019 at 5:18 PM hnuzhoulin2  wrote:
>
>> Hi, cephers
>>
>>
>> I am building a ceph EC cluster.when a disk is error,I out it.But its all
>> PGs remap to the osds in the same host,which I think they should remap to
>> other hosts in the same rack.
>> test process is:
>>
>> ceph osd pool create .rgw.buckets.data 8192 8192 erasure ISA-4-2
>> site1_sata_erasure_ruleset 4
>> ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}'> /tmp/1
>> /etc/init.d/ceph stop osd.2
>> ceph osd out 2
>> ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}'> /tmp/2
>> diff /tmp/1 /tmp/2 -y --suppress-common-lines
>>
>> 0 1.0 1.0 118 osd.0   | 0 1.0 1.0 126 osd.0
>> 1 1.0 1.0 123 osd.1   | 1 1.0 1.0 139 osd.1
>> 2 1.0 1.0 122 osd.2   | 2 1.0 0 0 osd.2
>> 3 1.0 1.0 113 osd.3   | 3 1.0 1.0 131 osd.3
>> 4 1.0 1.0 122 osd.4   | 4 1.0 1.0 136 osd.4
>> 5 1.0 1.0 112 osd.5   | 5 1.0 1.0 127 osd.5
>> 6 1.0 1.0 114 osd.6   | 6 1.0 1.0 128 osd.6
>> 7 1.0 1.0 124 osd.7   | 7 1.0 1.0 136 osd.7
>> 8 1.0 1.0 95 osd.8   | 8 1.0 1.0 113 osd.8
>> 9 1.0 1.0 112 osd.9   | 9 1.0 1.0 119 osd.9
>> TOTAL 3073T 197G | TOTAL 3065T 197G
>> MIN/MAX VAR: 0.84/26.56 | MIN/MAX VAR: 0.84/26.52
>>
>>
>> some config info: (detail configs see:
>> https://gist.github.com/hnuzhoulin/575883dbbcb04dff448eea3b9384c125)
>> jewel 10.2.11  filestore+rocksdb
>>
>> ceph osd erasure-code-profile get ISA-4-2
>> k=4
>> m=2
>> plugin=isa
>> ruleset-failure-domain=ctnr
>> ruleset-root=site1-sata
>> technique=reed_sol_van
>>
>> part of ceph.conf is:
>>
>> [global]
>> fsid = 1CAB340D-E551-474F-B21A-399AC0F10900
>> auth cluster required = cephx
>> auth service required = cephx
>> auth client required = cephx
>> pid file = /home/ceph/var/run/$name.pid
>> log file = /home/ceph/log/$cluster-$name.log
>> mon osd nearfull ratio = 0.85
>> mon osd full ratio = 0.95
>> admin socket = /home/ceph/var/run/$cluster-$name.asok
>> osd pool default size = 3
>> osd pool default min size = 1
>> osd objectstore = filestore
>> filestore merge threshold = -10
>>
>> [mon]
>> keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring
>> mon data = /home/ceph/var/lib/$type/$cluster-$id
>> mon cluster log file = /home/ceph/log/$cluster.log
>> [osd]
>> keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring
>> osd data = /home/ceph/var/lib/$type/$cluster-$id
>> osd journal = /home/ceph/var/lib/$type/$cluster-$id/journal
>> osd journal size = 1
>> osd mkfs type = xfs
>> osd mount options xfs = rw,noatime,nodiratime,inode64,logbsize=256k
>> osd backfill full ratio = 0.92
>> osd failsafe full ratio = 0.95
>> osd failsafe nearfull ratio = 0.85
>> osd max backfills = 1
>> osd crush update on start = false
>> osd op thread timeout = 60
>> filestore split multiple = 8
>> filestore max sync interval = 15
>> filestore min sync interval = 5
>> [osd.0]
>> host = cld-osd1-56
>> addr = X
>> user = ceph
>> devs = /disk/link/osd-0/data
>> osd journal = /disk/link/osd-0/journal
>> …….
>> [osd.503]
>> host = cld-osd42-56
>> addr = 10.108.87.52
>> user = ceph
>> devs = /disk/link/osd-503/data
>> osd journal = /disk/link/osd-503/journal
>>
>>
>> crushmap is below:
>>
>> # begin crush map
>> tunable choose_local_tries 0
>> tunable choose_local_fallback_tries 0
>> tunable choose_total_tries 50
>> tunable chooseleaf_descend_once 1
>> tunable chooseleaf_vary_r 1
>> tunable straw_calc_version 1
>> tunable allowed_bucket_algs 54
>>
>> # devices
>> device 0 osd.0
>> device 1 osd.1
>> device 2 osd.2
>> 。。。
>> device 502 osd.502
>> device 503 osd.503
>>
>> # types
>> type 0 osd  # osd
>> type 1 ctnr 

Re: [ceph-users] RBD image format v1 EOL ...

2019-02-13 Thread Gregory Farnum
On Wed, Feb 13, 2019 at 10:37 AM Jason Dillaman  wrote:
>
> For the future Ceph Octopus release, I would like to remove all
> remaining support for RBD image format v1 images baring any
> substantial pushback.
>
> The image format for new images has been defaulted to the v2 image
> format since Infernalis, the v1 format was officially deprecated in
> Jewel, and creation of new v1 images was prohibited starting with
> Mimic.
>
> The forthcoming Nautilus release will add a new image migration
> feature to help provide a low-impact conversion path forward for any
> legacy images in a cluster. The ability to migrate existing images off
> the v1 image format was the last known pain point that was highlighted
> the previous time I suggested removing support.

What is the image migration path? I think if we’re going to strip out
the ability to read data then the conversion process should be
automatic-on-access, if that’s possible...
(Also, it's generally good for things like that to provide more than
one release's worth of upgrade-overlap. Especially since we will
support going straight from Mimic to Octopus.)
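If it's the feature I'm thinking of, the Nautilus flow is explicit
rather than automatic-on-access, roughly (a sketch; image names are
placeholders, and clients have to be re-pointed at the target image):

$ rbd migration prepare rbd/old-v1-image rbd/new-image
$ rbd migration execute rbd/new-image
$ rbd migration commit rbd/new-image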
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jewel10.2.11 EC pool out a osd, its PGs remap to the osds in the same host

2019-02-13 Thread Gregory Farnum
Your CRUSH rule for EC pools is forcing that behavior with the line

step chooseleaf indep 1 type ctnr

If you want different behavior, you’ll need a different crush rule.
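For example, if the intent is at most one shard per host, spread across
the whole site1-sata subtree, the usual shape is something like this (a
sketch only; the full original rule isn't visible here, so adapt the
take step and sizes to the real hierarchy):

rule site1_sata_ec_by_host {
        ruleset 1
        type erasure
        min_size 3
        max_size 6
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take site1-sata
        step chooseleaf indep 0 type ctnr
        step emit
}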

On Tue, Feb 12, 2019 at 5:18 PM hnuzhoulin2  wrote:

> Hi, cephers
>
>
> I am building a ceph EC cluster.when a disk is error,I out it.But its all
> PGs remap to the osds in the same host,which I think they should remap to
> other hosts in the same rack.
> test process is:
>
> ceph osd pool create .rgw.buckets.data 8192 8192 erasure ISA-4-2
> site1_sata_erasure_ruleset 4
> ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}'> /tmp/1
> /etc/init.d/ceph stop osd.2
> ceph osd out 2
> ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}'> /tmp/2
> diff /tmp/1 /tmp/2 -y --suppress-common-lines
>
> 0 1.0 1.0 118 osd.0   | 0 1.0 1.0 126 osd.0
> 1 1.0 1.0 123 osd.1   | 1 1.0 1.0 139 osd.1
> 2 1.0 1.0 122 osd.2   | 2 1.0 0 0 osd.2
> 3 1.0 1.0 113 osd.3   | 3 1.0 1.0 131 osd.3
> 4 1.0 1.0 122 osd.4   | 4 1.0 1.0 136 osd.4
> 5 1.0 1.0 112 osd.5   | 5 1.0 1.0 127 osd.5
> 6 1.0 1.0 114 osd.6   | 6 1.0 1.0 128 osd.6
> 7 1.0 1.0 124 osd.7   | 7 1.0 1.0 136 osd.7
> 8 1.0 1.0 95 osd.8   | 8 1.0 1.0 113 osd.8
> 9 1.0 1.0 112 osd.9   | 9 1.0 1.0 119 osd.9
> TOTAL 3073T 197G | TOTAL 3065T 197G
> MIN/MAX VAR: 0.84/26.56 | MIN/MAX VAR: 0.84/26.52
>
>
> some config info: (detail configs see:
> https://gist.github.com/hnuzhoulin/575883dbbcb04dff448eea3b9384c125)
> jewel 10.2.11  filestore+rocksdb
>
> ceph osd erasure-code-profile get ISA-4-2
> k=4
> m=2
> plugin=isa
> ruleset-failure-domain=ctnr
> ruleset-root=site1-sata
> technique=reed_sol_van
>
> part of ceph.conf is:
>
> [global]
> fsid = 1CAB340D-E551-474F-B21A-399AC0F10900
> auth cluster required = cephx
> auth service required = cephx
> auth client required = cephx
> pid file = /home/ceph/var/run/$name.pid
> log file = /home/ceph/log/$cluster-$name.log
> mon osd nearfull ratio = 0.85
> mon osd full ratio = 0.95
> admin socket = /home/ceph/var/run/$cluster-$name.asok
> osd pool default size = 3
> osd pool default min size = 1
> osd objectstore = filestore
> filestore merge threshold = -10
>
> [mon]
> keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring
> mon data = /home/ceph/var/lib/$type/$cluster-$id
> mon cluster log file = /home/ceph/log/$cluster.log
> [osd]
> keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring
> osd data = /home/ceph/var/lib/$type/$cluster-$id
> osd journal = /home/ceph/var/lib/$type/$cluster-$id/journal
> osd journal size = 1
> osd mkfs type = xfs
> osd mount options xfs = rw,noatime,nodiratime,inode64,logbsize=256k
> osd backfill full ratio = 0.92
> osd failsafe full ratio = 0.95
> osd failsafe nearfull ratio = 0.85
> osd max backfills = 1
> osd crush update on start = false
> osd op thread timeout = 60
> filestore split multiple = 8
> filestore max sync interval = 15
> filestore min sync interval = 5
> [osd.0]
> host = cld-osd1-56
> addr = X
> user = ceph
> devs = /disk/link/osd-0/data
> osd journal = /disk/link/osd-0/journal
> …….
> [osd.503]
> host = cld-osd42-56
> addr = 10.108.87.52
> user = ceph
> devs = /disk/link/osd-503/data
> osd journal = /disk/link/osd-503/journal
>
>
> crushmap is below:
>
> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
> tunable chooseleaf_vary_r 1
> tunable straw_calc_version 1
> tunable allowed_bucket_algs 54
>
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> 。。。
> device 502 osd.502
> device 503 osd.503
>
> # types
> type 0 osd  # osd
> type 1 ctnr # sata/ssd group by node, -101~1xx/-201~2xx
> type 2 media# sata/ssd group by rack, -11~1x/-21~2x
> type 3 mediagroup   # sata/ssd group by site, -5/-6
> type 4 unit # site, -2
> type 5 root # root, -1
>
> # buckets
> ctnr cld-osd1-56-sata {
> id -101  # do not change unnecessarily
> # weight 10.000
> alg straw2
> hash 0   # rjenkins1
> item osd.0 weight 1.000
> item osd.1 weight 1.000
> item osd.2 weight 1.000
> item osd.3 weight 1.000
> item osd.4 weight 1.000
> item osd.5 weight 1.000
> item osd.6 weight 1.000
> item osd.7 weight 1.000
> item osd.8 weight 1.000
> item osd.9 weight 1.000
> }
> ctnr cld-osd1-56-ssd {
> id -201  # do not change unnecessarily
> # weight 2.000
> alg straw2
> hash 0   # rjenkins1
> item osd.10 weight 1.000
> item osd.11 weight 1.000
> }
> …..
> ctnr cld-osd41-56-sata {
> id -141  # do not change unnecessarily
> # weight 10.000
> alg straw2
> hash 0   # rjenkins1
> item osd.480 weight 1.000
> item osd.481 weight 1.000
> item osd.482 weight 1.000
> item osd.483 weight 1.000
> item osd.484 weight 1.000
> it

Re: [ceph-users] CephFS overwrite/truncate performance hit

2019-02-12 Thread Gregory Farnum
On Tue, Feb 12, 2019 at 5:10 AM Hector Martin  wrote:

> On 12/02/2019 06:01, Gregory Farnum wrote:
> > Right. Truncates and renames require sending messages to the MDS, and
> > the MDS committing to RADOS (aka its disk) the change in status, before
> > they can be completed. Creating new files will generally use a
> > preallocated inode so it's just a network round-trip to the MDS.
>
> I see. Is there a fundamental reason why these kinds of metadata
> operations cannot be buffered in the client, or is this just the current
> way they're implemented?
>

It's pretty fundamental, at least to the consistency guarantees we hold
ourselves to. What happens if the client has buffered an update like that,
performs writes to the data with those updates in mind, and then fails
before they're flushed to the MDS? A local FS doesn't need to worry about a
different node having a different lifetime, and can control the write order
of its metadata and data updates on belated flush a lot more precisely than
we can. :(
-Greg


>
> e.g. on a local FS these kinds of writes can just stick around in the
> block cache unflushed. And of course for CephFS I assume file extension
> also requires updating the file size in the MDS, yet that doesn't block
> while truncation does.
>
> > Going back to your first email, if you do an overwrite that is confined
> > to a single stripe unit in RADOS (by default, a stripe unit is the size
> > of your objects which is 4MB and it's aligned from 0), it is guaranteed
> > to be atomic. CephFS can only tear writes across objects, and only if
> > your client fails before the data has been flushed.
>
> Great! I've implemented this in a backwards-compatible way, so that gets
> rid of this bottleneck. It's just a 128-byte flag file (formerly
> variable length, now I just pad it to the full 128 bytes and rewrite it
> in-place). This is good information to know for optimizing things :-)
>
> --
> Hector Martin (hec...@marcansoft.com)
> Public Key: https://mrcn.st/pub
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS overwrite/truncate performance hit

2019-02-11 Thread Gregory Farnum
On Thu, Feb 7, 2019 at 3:31 AM Hector Martin  wrote:

> On 07/02/2019 19:47, Marc Roos wrote:
> >
> > Is this difference not related to chaching? And you filling up some
> > cache/queue at some point? If you do a sync after each write, do you
> > have still the same results?
>
> No, the slow operations are slow from the very beginning. It's not about
> filling a buffer/cache somewhere. I'm guessing the slow operations
> trigger several synchronous writes to the underlying OSDs, while the
> fast ones don't. But I'd like to know more about why exactly there is
> this significant performance hit to truncation operations vs. normal
> writes.
>
> To give some more numbers:
>
> echo test | dd of=b conv=notrunc
>
> This completes extremely quickly (microseconds). The data obviously
> remains in the client cache at this point. This is what I want.
>
> echo test | dd of=b conv=notrunc,fdatasync
>
> This runs quickly until the fdatasync(), then that takes ~12ms, which is
> about what I'd expect for a synchronous write to the underlying HDDs. Or
> maybe that's two writes?


It's certainly one write, and may be two overlapping ones if you've
extended the file and need to persist its new size (via the MDS journal).


>


> echo test | dd of=b
>
> This takes ~10ms in the best case for the open() call (sometimes 30-40
> or even more), and 6-8ms for the write() call.
>
> echo test | dd of=b conv=fdatasync
>
> This takes ~10ms for the open() call, ~8ms for the write() call, and
> ~18ms for the fdatasync() call.
>
> So it seems like truncating/recreating an existing file introduces
> several disk I/Os worth of latency and forces synchronous behavior
> somewhere down the stack, while merely creating a new file or writing to
> an existing one without truncation does not.
>

Right. Truncates and renames require sending messages to the MDS, and the
MDS committing to RADOS (aka its disk) the change in status, before they
can be completed. Creating new files will generally use a preallocated
inode so it's just a network round-trip to the MDS.

Going back to your first email, if you do an overwrite that is confined to
a single stripe unit in RADOS (by default, a stripe unit is the size of
your objects which is 4MB and it's aligned from 0), it is guaranteed to be
atomic. CephFS can only tear writes across objects, and only if your client
fails before the data has been flushed.
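If you want to check which stripe unit actually applies to a given file, the
ceph.* virtual xattrs are handy (assuming your client exposes them; both the
kernel client and ceph-fuse do, and the path below is just an example):

getfattr -n ceph.file.layout /mnt/cephfs/path/to/flagfile
# ceph.file.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs_data"

The exact output varies a little between releases, but stripe_unit is the
number that bounds the single-object atomicity described above.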
-Greg

>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] faster switch to another mds

2019-02-11 Thread Gregory Farnum
You can't tell from the client log here, but probably the MDS itself was
failing over to a new instance during that interval. There's not much
experience with it, but you could experiment with faster failover by
reducing the mds beacon and grace times. This may or may not work
reliably...
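For reference, the knobs involved are mds_beacon_interval (how often the MDS
beacons to the monitors, 4s by default) and mds_beacon_grace (how long before
it is considered laggy/failed, 15s by default). A cautious experiment could
look like the following; the values are only illustrative, and the grace is
consulted by the monitors, so set it globally (too small a grace can cause
spurious failovers under load):

ceph config set global mds_beacon_grace 10
ceph config set mds mds_beacon_interval 2

On releases without the centralized config store, put the same options in
ceph.conf on the mon and MDS hosts instead.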

On Sat, Feb 9, 2019 at 10:52 AM Fyodor Ustinov  wrote:

> Hi!
>
> I have ceph cluster with 3 nodes with mon/mgr/mds servers.
> I reboot one node and see this in client log:
>
> Feb 09 20:29:14 ceph-nfs1 kernel: libceph: mon2 10.5.105.40:6789 socket
> closed (con state OPEN)
> Feb 09 20:29:14 ceph-nfs1 kernel: libceph: mon2 10.5.105.40:6789 session
> lost, hunting for new mon
> Feb 09 20:29:14 ceph-nfs1 kernel: libceph: mon0 10.5.105.34:6789 session
> established
> Feb 09 20:29:22 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket
> closed (con state OPEN)
> Feb 09 20:29:23 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket
> closed (con state CONNECTING)
> Feb 09 20:29:24 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket
> closed (con state CONNECTING)
> Feb 09 20:29:24 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket
> closed (con state CONNECTING)
> Feb 09 20:29:53 ceph-nfs1 kernel: ceph: mds0 reconnect start
> Feb 09 20:29:53 ceph-nfs1 kernel: ceph: mds0 reconnect success
> Feb 09 20:30:05 ceph-nfs1 kernel: ceph: mds0 recovery completed
>
> As I understand it, the following has happened:
> 1. Client detects - link with mon server broken and fast switches to
> another mon (less that 1 seconds).
> 2. Client detects - link with mds server broken, 3 times trying reconnect
> (unsuccessful), waiting and reconnects to the same mds after 30 seconds
> downtime.
>
> I have 2 questions:
> 1. Why?
> 2. How to reduce switching time to another mds?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS MDS journal

2019-02-04 Thread Gregory Farnum
On Mon, Feb 4, 2019 at 8:03 AM Mahmoud Ismail 
wrote:

> On Mon, Feb 4, 2019 at 4:35 PM Gregory Farnum  wrote:
>
>>
>>
>> On Mon, Feb 4, 2019 at 7:32 AM Mahmoud Ismail <
>> mahmoudahmedism...@gmail.com> wrote:
>>
>>> On Mon, Feb 4, 2019 at 4:16 PM Gregory Farnum 
>>> wrote:
>>>
>>>> On Fri, Feb 1, 2019 at 2:29 AM Mahmoud Ismail <
>>>> mahmoudahmedism...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I'm a bit confused about how the journaling actually works in the MDS.
>>>>>
>>>>> I was reading about these two configuration parameters (journal write
>>>>> head interval)  and (mds early reply). Does the MDS flush the journal
>>>>> synchronously after each operation? and by setting mds early reply to true
>>>>> it allows operations to return without flushing. If so, what the other
>>>>> parameter (journal write head interval) do or isn't it for MDS?. Also, can
>>>>> all operations return without flushing with the mds early reply or is it
>>>>> specific to a subset of operations?.
>>>>>
>>>>
>>>> In general, the MDS journal is flushed every five seconds (by default),
>>>> and client requests get an early reply when the operation is done in memory
>>>> but not yet committed to RADOS. Some operations will trigger an immediate
>>>> flush, and there may be some operations that can't get an early reply or
>>>> that need to wait for part of the operation to get committed (like renames
>>>> that move a file's authority to a different MDS).
>>>> IIRC the journal write head interval controls how often it flushes out
>>>> the journal's header, which limits how out-of-date its hints on restart can
>>>> be. (When the MDS restarts, it asks the journal head where the journal's
>>>> unfinished start and end points are, but of course more of the journaled
>>>> operations may have been fully completed since the head was written.)
>>>>
>>>
>>> Thanks for the explanation. Which operations trigger an immediate flush?
>>> Is the readdir one of these operations?. I noticed that the readdir
>>> operation latency is going higher under load when the OSDs are hitting the
>>> limit of the underlying hdd throughput. Can i assume that this is happening
>>> due to the journal flushing then?
>>>
>>
>> Not directly, but a readdir might ask to know the size of each file and
>> that will force the other clients in the system to flush their dirty data
>> in the directory (so that the readdir can return valid results).
>> -Greg
>>
>>
>
> Could it be also due to the MDS lock (operations waiting for the lock
> under load)?
>

Well that's not going to cause high OSD usage, and the MDS lock is not held
while writes are happening. But if the MDS is using 100% CPU, yes, it could
be contended.


> Also, i assume that the journal is using a different thread for flushing,
> Right?
>

Yes, that's correct.

>
>>>>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS MDS journal

2019-02-04 Thread Gregory Farnum
On Mon, Feb 4, 2019 at 7:32 AM Mahmoud Ismail 
wrote:

> On Mon, Feb 4, 2019 at 4:16 PM Gregory Farnum  wrote:
>
>> On Fri, Feb 1, 2019 at 2:29 AM Mahmoud Ismail <
>> mahmoudahmedism...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I'm a bit confused about how the journaling actually works in the MDS.
>>>
>>> I was reading about these two configuration parameters (journal write
>>> head interval)  and (mds early reply). Does the MDS flush the journal
>>> synchronously after each operation? and by setting mds early reply to true
>>> it allows operations to return without flushing. If so, what the other
>>> parameter (journal write head interval) do or isn't it for MDS?. Also, can
>>> all operations return without flushing with the mds early reply or is it
>>> specific to a subset of operations?.
>>>
>>
>> In general, the MDS journal is flushed every five seconds (by default),
>> and client requests get an early reply when the operation is done in memory
>> but not yet committed to RADOS. Some operations will trigger an immediate
>> flush, and there may be some operations that can't get an early reply or
>> that need to wait for part of the operation to get committed (like renames
>> that move a file's authority to a different MDS).
>> IIRC the journal write head interval controls how often it flushes out
>> the journal's header, which limits how out-of-date its hints on restart can
>> be. (When the MDS restarts, it asks the journal head where the journal's
>> unfinished start and end points are, but of course more of the journaled
>> operations may have been fully completed since the head was written.)
>>
>
> Thanks for the explanation. Which operations trigger an immediate flush?
> Is the readdir one of these operations?. I noticed that the readdir
> operation latency is going higher under load when the OSDs are hitting the
> limit of the underlying hdd throughput. Can i assume that this is happening
> due to the journal flushing then?
>

Not directly, but a readdir might ask to know the size of each file and
that will force the other clients in the system to flush their dirty data
in the directory (so that the readdir can return valid results).
-Greg


>
>
>>
>
>>> Another question, are open operations also written to the journal?
>>>
>>
>> Not opens per se, but we do persist when clients have permission to
>> operate on files.
>> -Greg
>>
>>
>>>
>>> Regards,
>>> Mahmoud
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS MDS journal

2019-02-04 Thread Gregory Farnum
On Fri, Feb 1, 2019 at 2:29 AM Mahmoud Ismail 
wrote:

> Hello,
>
> I'm a bit confused about how the journaling actually works in the MDS.
>
> I was reading about these two configuration parameters (journal write head
> interval)  and (mds early reply). Does the MDS flush the journal
> synchronously after each operation? and by setting mds early reply to true
> it allows operations to return without flushing. If so, what the other
> parameter (journal write head interval) do or isn't it for MDS?. Also, can
> all operations return without flushing with the mds early reply or is it
> specific to a subset of operations?.
>

In general, the MDS journal is flushed every five seconds (by default), and
client requests get an early reply when the operation is done in memory but
not yet committed to RADOS. Some operations will trigger an immediate
flush, and there may be some operations that can't get an early reply or
that need to wait for part of the operation to get committed (like renames
that move a file's authority to a different MDS).
IIRC the journal write head interval controls how often it flushes out the
journal's header, which limits how out-of-date its hints on restart can be.
(When the MDS restarts, it asks the journal head where the journal's
unfinished start and end points are, but of course more of the journaled
operations may have been fully completed since the head was written.)
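If you want to see what a given MDS is actually running with, the relevant
options can be read off its admin socket, e.g. (the daemon id is an example,
and the option spellings here are from Luminous-era code, so double-check them
on your release):

ceph daemon mds.a config show | egrep 'mds_early_reply|mds_tick_interval|journaler_write_head_interval'

journaler_write_head_interval is the "journal write head interval" being asked
about, and mds_tick_interval (5s by default) is, as far as I can tell, the
timer behind the periodic flush mentioned above.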


>
> Another question, are open operations also written to the journal?
>

Not opens per se, but we do persist when clients have permission to operate
on files.
-Greg


>
> Regards,
> Mahmoud
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Encryption questions

2019-01-24 Thread Gregory Farnum
On Fri, Jan 11, 2019 at 11:24 AM Sergio A. de Carvalho Jr. <
scarvalh...@gmail.com> wrote:

> Thanks for the answers, guys!
>
> Am I right to assume msgr2 (http://docs.ceph.com/docs/mimic/dev/msgr2/)
> will provide encryption between Ceph daemons as well as between clients and
> daemons?
>
> Does anybody know if it will be available in Nautilus?
>

That’s the intention; people are scrambling a bit to get it in soon enough
to validate before the release.
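If it lands in the currently-planned form, opting in to on-the-wire encryption
should be roughly a matter of switching the messenger modes to "secure" in
ceph.conf, something like the sketch below (option names are from the msgr2
work in progress, so verify them against the Nautilus docs once it ships):

[global]
ms_cluster_mode = secure
ms_service_mode = secure
ms_client_mode = secure

The default is expected to keep preferring crc-only, so nothing changes for
existing clusters unless they opt in.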


>
> On Fri, Jan 11, 2019 at 8:10 AM Tobias Florek  wrote:
>
>> Hi,
>>
>> as others pointed out, traffic in ceph is unencrypted (internal traffic
>> as well as client traffic).  I usually advise to set up IPSec or
>> nowadays wireguard connections between all hosts.  That takes care of
>> any traffic going over the wire, including ceph.
>>
>> Cheers,
>>  Tobias Florek
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd reboot hung

2019-01-24 Thread Gregory Farnum
Looks like your network deactivated before the rbd volume was unmounted.
This is a known issue without a good programmatic workaround and you’ll
need to adjust your configuration.
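The usual fix is to make systemd aware that the mount depends on the network
(and on rbdmap) so it gets unmounted before networking is torn down at
shutdown. A sketch, assuming the image is mapped via /etc/ceph/rbdmap and
mounted from fstab; the pool/image/mountpoint names are examples:

# /etc/ceph/rbdmap
rbd/myimage  id=admin,keyring=/etc/ceph/ceph.client.admin.keyring

# /etc/fstab
/dev/rbd/rbd/myimage  /root/test  ext4  defaults,noatime,_netdev  0 0

# make sure the rbdmap unit is enabled so its shutdown ordering applies
systemctl enable rbdmap.service

With _netdev plus the rbdmap unit ordering, the unmount should happen while the
monitors are still reachable, instead of the connect-error loop in your log.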
On Tue, Jan 22, 2019 at 9:17 AM Gao, Wenjun  wrote:

> I’m using krbd to map an rbd device to a VM. It appears that when the device is
> mounted, rebooting the OS hangs for more than 7 min; in the bare-metal case, it
> can be more than 15 min. Even with the latest kernel 5.0.0, the problem
> still occurs.
>
> Here are the console logs with the 4.15.18 kernel and a Mimic rbd client; the reboot
> seems to be stuck in the rbd unmount operation.
>
> *[  OK  ] Stopped Update UTMP about System Boot/Shutdown.*
>
> *[  OK  ] Stopped Create Volatile Files and Directories.*
>
> *[  OK  ] Stopped target Local File Systems.*
>
> * Unmounting /run/user/110281572...*
>
> * Unmounting /var/tmp...*
>
> * Unmounting /root/test...*
>
> * Unmounting /run/user/78402...*
>
> * Unmounting Configuration File System...*
>
> *[  OK  ] Stopped Configure read-only root support.*
>
> *[  OK  ] Unmounted /var/tmp.*
>
> *[  OK  ] Unmounted /run/user/78402.*
>
> *[  OK  ] Unmounted /run/user/110281572.*
>
> *[  OK  ] Stopped target Swap.*
>
> *[  OK  ] Unmounted Configuration File System.*
>
> *[  189.919062] libceph: mon4 XX.XX.XX.XX:6789 session lost, hunting for
> new mon*
>
> *[  189.950085] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  189.950764] libceph: mon3 XX.XX.XX.XX:6789 connect error*
>
> *[  190.687090] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  190.694197] libceph: mon3 XX.XX.XX.XX:6789 connect error*
>
> *[  191.711080] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  191.745254] libceph: mon3 XX.XX.XX.XX:6789 connect error*
>
> *[  193.695065] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  193.727694] libceph: mon3 XX.XX.XX.XX:6789 connect error*
>
> *[  197.087076] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  197.121077] libceph: mon4 XX.XX.XX.XX:6789 connect error*
>
> *[  197.663082] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  197.680671] libceph: mon4 XX.XX.XX.XX:6789 connect error*
>
> *[  198.687122] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  198.719253] libceph: mon4 XX.XX.XX.XX:6789 connect error*
>
> *[  200.671136] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  200.702717] libceph: mon4 XX.XX.XX.XX:6789 connect error*
>
> *[  204.703115] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  204.736586] libceph: mon4 XX.XX.XX.XX:6789 connect error*
>
> *[  209.887141] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  209.918721] libceph: mon0 XX.XX.XX.XX:6789 connect error*
>
> *[  210.719078] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  210.750378] libceph: mon0 XX.XX.XX.XX:6789 connect error*
>
> *[  211.679118] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  211.712246] libceph: mon0 XX.XX.XX.XX:6789 connect error*
>
> *[  213.663116] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  213.696943] libceph: mon0 XX.XX.XX.XX:6789 connect error*
>
> *[  217.695062] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  217.728511] libceph: mon0 XX.XX.XX.XX:6789 connect error*
>
> *[  225.759109] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  225.775869] libceph: mon0 XX.XX.XX.XX:6789 connect error*
>
> *[  233.951062] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  233.951997] libceph: mon3 XX.XX.XX.XX:6789 connect error*
>
> *[  234.719114] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  234.720083] libceph: mon3 XX.XX.XX.XX:6789 connect error*
>
> *[  235.679112] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  235.680060] libceph: mon3 XX.XX.XX.XX:6789 connect error*
>
> *[  237.663088] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  237.664121] libceph: mon3 XX.XX.XX.XX:6789 connect error*
>
> *[  241.695082] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  241.696500] libceph: mon3 XX.XX.XX.XX:6789 connect error*
>
> *[  249.823095] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  249.824101] libceph: mon3 XX.XX.XX.XX:6789 connect error*
>
> *[  264.671119] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  264.672102] libceph: mon0 XX.XX.XX.XX:6789 connect error*
>
> *[  265.695109] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  265.696106] libceph: mon0 XX.XX.XX.XX:6789 connect error*
>
> *[  266.719145] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  266.720204] libceph: mon0 XX.XX.XX.XX:6789 connect error*
>
> *[  268.703121] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  268.704110] libceph: mon0 XX.XX.XX.XX:6789 connect error*
>
> *[  272.671115] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  272.672159] libceph: mon0 XX.XX.XX.XX:6789 connect error*
>
> *[  281.055087] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  281.056577] libceph: mon0 XX.XX.XX.XX:6789 connect error*
>
> *[  294.879098] libceph: connect XX.XX.XX.XX:6789 error -101*
>
> *[  294.88

Re: [ceph-users] backfill_toofull while OSDs are not full

2019-01-24 Thread Gregory Farnum
This doesn’t look familiar to me. Is the cluster still doing recovery so we
can at least expect them to make progress when the “out” OSDs get removed
from the set?
On Tue, Jan 22, 2019 at 2:44 PM Wido den Hollander  wrote:

> Hi,
>
> I've got a couple of PGs which are stuck in backfill_toofull, but none
> of them are actually full.
>
>   "up": [
> 999,
> 1900,
> 145
>   ],
>   "acting": [
> 701,
> 1146,
> 1880
>   ],
>   "backfill_targets": [
> "145",
> "999",
> "1900"
>   ],
>   "acting_recovery_backfill": [
> "145",
> "701",
> "999",
> "1146",
> "1880",
> "1900"
>   ],
>
> I checked all these OSDs, but they are all <75% utilization.
>
> full_ratio 0.95
> backfillfull_ratio 0.9
> nearfull_ratio 0.9
>
> So I started checking all the PGs and I've noticed that each of these
> PGs has one OSD in the 'acting_recovery_backfill' which is marked as out.
>
> In this case osd.1880 is marked as out and thus its capacity is shown
> as zero.
>
> [ceph@ceph-mgr ~]$ ceph osd df|grep 1880
> 1880   hdd 4.545990 0 B  0 B  0 B 00  27
> [ceph@ceph-mgr ~]$
>
> This is on a Mimic 13.2.4 cluster. Is this expected or is this an unknown
> side-effect of one of the OSDs being marked as out?
>
> Thanks,
>
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS performance issue

2019-01-21 Thread Gregory Farnum
On Mon, Jan 21, 2019 at 12:52 AM Yan, Zheng  wrote:

> On Mon, Jan 21, 2019 at 12:12 PM Albert Yue 
> wrote:
> >
> > Hi Yan Zheng,
> >
> > 1. mds cache limit is set to 64GB
> > 2. we get the size of meta data pool by running `ceph df` and saw meta
> data pool just used 200MB space.
> >
>
> That's very strange. One file uses about 1k metadata storage. 560M
> files should use hundreds gigabytes.
>

That's presumably because OSDs still don't report LevelDB/RocksDB usage up
in that view, and all the MDS metadata is stored there?
-Greg


>
> > Thanks,
> >
> >
> > On Mon, Jan 21, 2019 at 11:35 AM Yan, Zheng  wrote:
> >>
> >> On Mon, Jan 21, 2019 at 11:16 AM Albert Yue 
> wrote:
> >> >
> >> > Dear Ceph Users,
> >> >
> >> > We have set up a cephFS cluster with 6 osd machines, each with 16 8TB
> harddisk. Ceph version is luminous 12.2.5. We created one data pool with
> these hard disks and created another meta data pool with 3 ssd. We created
> a MDS with 65GB cache size.
> >> >
> >> > But our users keep complaining that CephFS is too slow. What we
> observed is cephFS is fast when we switch to a new MDS instance, once the
> cache fills up (which will happen very fast), client became very slow when
> performing some basic filesystem operation such as `ls`.
> >> >
> >>
> >> what's your mds cache config ?
> >>
> >> > What we know is our user are putting lots of small files into the
> cephFS, now there are around 560 Million files. We didn't see high CPU wait
> on MDS instance and meta data pool just used around 200MB space.
> >>
> >> It's unlikely.  For output of 'ceph osd df', you should take both both
> >> DATA and OMAP into account.
> >>
> >> >
> >> > My question is, what is the relationship between the metadata pool
> and MDS? Is this performance issue caused by the hardware behind meta data
> pool? Why the meta data pool only used 200MB space, and we saw 3k iops on
> each of these three ssds, why can't MDS cache all these 200MB into memory?
> >> >
> >> > Thanks very much!
> >> >
> >> >
> >> > Best Regards,
> >> >
> >> > Albert
> >> >
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Offsite replication scenario

2019-01-14 Thread Gregory Farnum
On Fri, Jan 11, 2019 at 10:07 PM Brian Topping 
wrote:

> Hi all,
>
> I have a simple two-node Ceph cluster that I’m comfortable with the care
> and feeding of. Both nodes are in a single rack and captured in the
> attached dump, it has two nodes, only one mon, all pools size 2. Due to
> physical limitations, the primary location can’t move past two nodes at the
> present time. As far as hardware, those two nodes are 18-core Xeon with
> 128GB RAM and connected with 10GbE.
>
> My next goal is to add an offsite replica and would like to validate the
> plan I have in mind. For its part, the offsite replica can be considered
> read-only except for the occasional snapshot in order to run backups to
> tape. The offsite location is connected with a reliable and secured
> ~350Kbps WAN link.
>

Unfortunately this is just not going to work. All writes to a Ceph OSD are
replicated synchronously to every replica, all reads are served from the
primary OSD for any given piece of data, and unless you do some hackery on
your CRUSH map each of your 3 OSD nodes is going to be a primary for about
1/3 of the total data.

If you want to move your data off-site asynchronously, there are various
options for doing that in RBD (either periodic snapshots and export-diff,
or by maintaining a journal and streaming it out) and RGW (with the
multi-site stuff). But you're not going to be successful trying to stretch
a Ceph cluster over that link.
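For the RBD route, the usual pattern is a rolling snapshot plus
export-diff/import-diff over ssh; a sketch, with pool/image/host names made up
(the very first transfer is a full export-diff without --from-snap):

rbd snap create rbd/vm1@backup-2019-01-14
rbd export-diff --from-snap backup-2019-01-13 rbd/vm1@backup-2019-01-14 - \
  | ssh offsite rbd import-diff - rbd/vm1

The offsite cluster ends up with a crash-consistent copy as of each snapshot,
and only the changed extents have to cross the 350Kbps link.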
-Greg


>
> The following presuppositions bear challenge:
>
> * There is only a single mon at the present time, which could be expanded
> to three with the offsite location. Two mons at the primary location is
> obviously a lower MTBF than one, but  with a third one on the other side of
> the WAN, I could create resiliency against *either* a WAN failure or a
> single node maintenance event.
> * Because there are two mons at the primary location and one at the
> offsite, the degradation mode for a WAN loss (most likely scenario due to
> facility support) leaves the primary nodes maintaining the quorum, which is
> desirable.
> * It’s clear that a WAN failure and a mon failure at the primary location
> will halt cluster access.
> * The CRUSH maps will be managed to reflect the topology change.
>
> If that’s a good capture so far, I’m comfortable with it. What I don’t
> understand is what to expect in actual use:
>
> * Is the link speed asymmetry between the two primary nodes and the
> offsite node going to create significant risk or unexpected behaviors?
> * Will the performance of the two primary nodes be limited to the speed
> that the offsite mon can participate? Or will the primary mons correctly
> calculate they have quorum and keep moving forward under normal operation?
> * In the case of an extended WAN outage (and presuming full uptime on
> primary site mons), would return to full cluster health be simply a matter
> of time? Are there any limits on how long the WAN could be down if the
> other two maintain quorum?
>
> I hope I’m asking the right questions here. Any feedback appreciated,
> including blogs and RTFM pointers.
>
>
> Thanks for a great product!! I’m really excited for this next frontier!
>
> Brian
>
> > [root@gw01 ~]# ceph -s
> >  cluster:
> >id: 
> >health: HEALTH_OK
> >
> >  services:
> >mon: 1 daemons, quorum gw01
> >mgr: gw01(active)
> >mds: cephfs-1/1/1 up  {0=gw01=up:active}
> >osd: 8 osds: 8 up, 8 in
> >
> >  data:
> >pools:   3 pools, 380 pgs
> >objects: 172.9 k objects, 11 GiB
> >usage:   30 GiB used, 5.8 TiB / 5.8 TiB avail
> >pgs: 380 active+clean
> >
> >  io:
> >client:   612 KiB/s wr, 0 op/s rd, 50 op/s wr
> >
> > [root@gw01 ~]# ceph df
> > GLOBAL:
> >SIZEAVAIL   RAW USED %RAW USED
> >5.8 TiB 5.8 TiB   30 GiB  0.51
> > POOLS:
> >NAMEID USED%USED MAX AVAIL
>  OBJECTS
> >cephfs_metadata 2  264 MiB 0   2.7 TiB
> 1085
> >cephfs_data 3  8.3 GiB  0.29   2.7 TiB
> 171283
> >rbd 4  2.0 GiB  0.07   2.7 TiB
>  542
> > [root@gw01 ~]# ceph osd tree
> > ID CLASS WEIGHT  TYPE NAME STATUS REWEIGHT PRI-AFF
> > -1   5.82153 root default
> > -3   2.91077 host gw01
> > 0   ssd 0.72769 osd.0 up  1.0 1.0
> > 2   ssd 0.72769 osd.2 up  1.0 1.0
> > 4   ssd 0.72769 osd.4 up  1.0 1.0
> > 6   ssd 0.72769 osd.6 up  1.0 1.0
> > -5   2.91077 host gw02
> > 1   ssd 0.72769 osd.1 up  1.0 1.0
> > 3   ssd 0.72769 osd.3 up  1.0 1.0
> > 5   ssd 0.72769 osd.5 up  1.0 1.0
> > 7   ssd 0.72769 osd.7 up  1.0 1.0
> > [root@gw01 ~]# ceph osd df
> > ID CLASS WEIGHT  REWEIGHT SIZEUSE AVAIL   %USE VAR  PGS
> > 0   ssd 0.72769  1.0 745 GiB 4.9 GiB 740 GiB 0.66 1.29 115
> > 2   ssd 0.72769  1.0 745 GiB 3.1 G

Re: [ceph-users] ceph health JSON format has changed

2019-01-08 Thread Gregory Farnum
On Fri, Jan 4, 2019 at 1:19 PM Jan Kasprzak  wrote:
>
> Gregory Farnum wrote:
> : On Wed, Jan 2, 2019 at 5:12 AM Jan Kasprzak  wrote:
> :
> : > Thomas Byrne - UKRI STFC wrote:
> : > : I recently spent some time looking at this, I believe the 'summary' and
> : > : 'overall_status' sections are now deprecated. The 'status' and 'checks'
> : > : fields are the ones to use now.
> : >
> : > OK, thanks.
> : >
> : > : The 'status' field gives you the OK/WARN/ERR, but returning the most
> : > : severe error condition from the 'checks' section is less trivial. AFAIK
> : > : all health_warn states are treated as equally severe, and same for
> : > : health_err. We ended up formatting our single line human readable output
> : > : as something like:
> : > :
> : > : "HEALTH_ERR: 1 inconsistent pg, HEALTH_ERR: 1 scrub error, HEALTH_WARN:
> : > 20 large omap objects"
> : >
> : > Speaking of scrub errors:
> : >
> : > In previous versions of Ceph, I was able to determine which PGs 
> had
> : > scrub errors, and then a cron.hourly script ran "ceph pg repair" for them,
> : > provided that they were not already being scrubbed. In Luminous, the bad 
> PG
> : > is not visible in "ceph --status" anywhere. Should I use something like
> : > "ceph health detail -f json-pretty" instead?
> : >
> : > Also, is it possible to configure Ceph to attempt repairing
> : > the bad PGs itself, as soon as the scrub fails? I run most of my OSDs on
> : > top
> : > of a bunch of old spinning disks, and a scrub error almost always means
> : > that there is a bad sector somewhere, which can easily be fixed by
> : > rewriting the lost data using "ceph pg repair".
> : >
> :
> : It is possible. It's a lot safer than it used to be, but is still NOT
> : RECOMMENDED for replicated pools.
> :
> : But if you are very sure, you can use the options osd_scrub_auto_repair
> : (default: false) and osd_scrub_auto_repair_num_errors (default:5, which
> : will not auto-repair if scrub detects more errors than that value) to
> : configure it.
>
> OK, thanks. I just want to say that I am NOT very sure,
> but this is about the only way I am aware of, when I want to
> handle the scrub error. I have mail notification set up in smartd.conf,
> and so far the scrub errors seem to correlate with new reallocated
> or pending sectors.
>
> What are the drawbacks of running "ceph pg repair" as soon
> as the cluster enters the HEALTH_ERR state with a scrub error?

I think there are still some rare cases where it's possible that Ceph
chooses the wrong copy as the authoritative one. The windows keep
getting smaller, though, and if you're just running repair all the
time anyway then having the system do it automatically obviously isn't
worse. :)
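On the other part of your question (finding the affected PGs on Luminous):
they still show up in "ceph health detail", and the rados tooling will list
them directly; the pool name and PG id below are placeholders.

ceph health detail | grep -i inconsistent
rados list-inconsistent-pg <pool>
rados list-inconsistent-obj <pgid> --format=json-pretty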
-Greg

>
> Thanks for explanation,
>
> -Yenya
>
> --
> | Jan "Yenya" Kasprzak  |
> | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
>  This is the world we live in: the way to deal with computers is to google
>  the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Questions re mon_osd_cache_size increase

2019-01-07 Thread Gregory Farnum
The osd_map_cache_size controls the OSD’s cache of maps; the change in
13.2.3 is to the default for the monitors’.
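An easy way to confirm on a running cluster is to ask the daemons over their
admin sockets (daemon names below are examples):

ceph daemon mon.$(hostname -s) config get mon_osd_cache_size
ceph daemon osd.0 config get osd_map_cache_size

The first should report 500 on a 13.2.3 monitor; the second is the per-OSD map
cache and is unrelated to this change.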
On Mon, Jan 7, 2019 at 8:24 AM Anthony D'Atri  wrote:

>
>
> > * The default memory utilization for the mons has been increased
> >  somewhat.  Rocksdb now uses 512 MB of RAM by default, which should
> >  be sufficient for small to medium-sized clusters; large clusters
> >  should tune this up.  Also, the `mon_osd_cache_size` has been
> >  increase from 10 OSDMaps to 500, which will translate to an
> >  additional 500 MB to 1 GB of RAM for large clusters, and much less
> >  for small clusters.
>
>
> Just so I don't perseverate on this: mon_osd_cache_size is a [mon] setting
> for ceph-mon only?  Does it relate to osd_map_cache_size?  ISTR that in the
> past the latter defaulted to 500; I had seen a presentation (I think from
> Dan) at an OpenStack Summit advising its decrease and it defaults to 50
> now.
>
> I like to be very clear about where additional memory is needed,
> especially for dense systems.
>
> -- Anthony
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph blog RSS/Atom URL?

2019-01-04 Thread Gregory Farnum
Yeah I think it’s a “planet”-style feed that incorporates some other blogs.
I don’t think it’s been maintained much since being launched though.

On Fri, Jan 4, 2019 at 1:21 PM Jan Kasprzak  wrote:

> Gregory Farnum wrote:
> : It looks like ceph.com/feed is the RSS url?
>
> Close enough, thanks.
>
> Comparing the above with the blog itself, there are some
> posts in (apparently) Chinese in /feed, which are not present in
> /community/blog. The first one being
>
>
> https://ceph.com/planet/vdbench%e6%b5%8b%e8%af%95%e5%ae%9e%e6%97%b6%e5%8f%af%e8%a7%86%e5%8c%96%e6%98%be%e7%a4%ba/
>
> -Yenya
>
> : On Fri, Jan 4, 2019 at 5:52 AM Jan Kasprzak  wrote:
> : > is there any RSS or Atom source for Ceph blog? I have looked inside
> : > the https://ceph.com/community/blog/ HTML source, but there is no
> : >  or anything mentioning RSS or Atom.
>
> --
> | Jan "Yenya" Kasprzak 
> |
> | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5
> |
>  This is the world we live in: the way to deal with computers is to google
>  the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph health JSON format has changed

2019-01-04 Thread Gregory Farnum
On Wed, Jan 2, 2019 at 5:12 AM Jan Kasprzak  wrote:

> Thomas Byrne - UKRI STFC wrote:
> : I recently spent some time looking at this, I believe the 'summary' and
> : 'overall_status' sections are now deprecated. The 'status' and 'checks'
> : fields are the ones to use now.
>
> OK, thanks.
>
> : The 'status' field gives you the OK/WARN/ERR, but returning the most
> : severe error condition from the 'checks' section is less trivial. AFAIK
> : all health_warn states are treated as equally severe, and same for
> : health_err. We ended up formatting our single line human readable output
> : as something like:
> :
> : "HEALTH_ERR: 1 inconsistent pg, HEALTH_ERR: 1 scrub error, HEALTH_WARN:
> 20 large omap objects"
>
> Speaking of scrub errors:
>
> In previous versions of Ceph, I was able to determine which PGs had
> scrub errors, and then a cron.hourly script ran "ceph pg repair" for them,
> provided that they were not already being scrubbed. In Luminous, the bad PG
> is not visible in "ceph --status" anywhere. Should I use something like
> "ceph health detail -f json-pretty" instead?
>
> Also, is it possible to configure Ceph to attempt repairing
> the bad PGs itself, as soon as the scrub fails? I run most of my OSDs on
> top
> of a bunch of old spinning disks, and a scrub error almost always means
> that there is a bad sector somewhere, which can easily be fixed by
> rewriting the lost data using "ceph pg repair".
>

It is possible. It's a lot safer than it used to be, but is still NOT
RECOMMENDED for replicated pools.

But if you are very sure, you can use the options osd_scrub_auto_repair
(default: false) and osd_scrub_auto_repair_num_errors (default:5, which
will not auto-repair if scrub detects more errors than that value) to
configure it.
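On Luminous that means something like the following in ceph.conf on the OSD
hosts (or the equivalent injectargs); the values shown are just the defaults
you would be flipping:

[osd]
osd scrub auto repair = true
osd scrub auto repair num errors = 5

Scrubs that find more than that number of errors are left alone for a human to
look at.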
-Greg


>
> Thanks,
>
> -Yenya
>
> --
> | Jan "Yenya" Kasprzak 
> |
> | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5
> |
>  This is the world we live in: the way to deal with computers is to google
>  the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mgr fails to restart after upgrade to mimic

2019-01-04 Thread Gregory Farnum
You can also get more data by checking what the monitor logs for that
manager on the connect attempt (if you turn up its debug mon or debug
ms settings). If one of your managers is behaving, I'd examine its
configuration file and compare to the others. For instance, that
"Invalid argument" might mean the manager is trying to use "AUTH_NONE"
(no CephX) and the monitors aren't allowing that.
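To get those monitor-side logs without restarting anything, bump the debug
levels temporarily while you retry the mgr start (levels are just suggestions):

ceph tell mon.\* injectargs '--debug_mon 10 --debug_ms 1'
# start ceph-mgr, capture the failure in the mon log, then revert:
ceph tell mon.\* injectargs '--debug_mon 1/5 --debug_ms 0/5'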
-Greg

On Fri, Jan 4, 2019 at 6:26 AM Randall Smith  wrote:
>
> Greetings,
>
> I'm upgrading my cluster from luminous to mimic. I've upgraded my monitors 
> and am attempting to upgrade the mgrs. Unfortunately, after an upgrade the 
> mgr daemon exits immediately with error code 1.
>
> I've tried running ceph-mgr in debug mode to try to see what's happening but 
> the output (below) is a bit cryptic for me. It looks like authentication 
> might be failing but it was working prior to the upgrade.
>
> I do have "auth supported = cephx" in the global section of ceph.conf.
>
> What do I need to do to fix this?
>
> Thanks.
>
> /usr/bin/ceph-mgr -f --cluster ceph --id 8 --setuser ceph --setgroup ceph -d 
> --debug_ms 5
> 2019-01-04 07:01:38.457 7f808f83f700  2 Event(0x30c42c0 nevent=5000 
> time_id=1).set_owner idx=0 owner=140190140331776
> 2019-01-04 07:01:38.457 7f808f03e700  2 Event(0x30c4500 nevent=5000 
> time_id=1).set_owner idx=1 owner=140190131939072
> 2019-01-04 07:01:38.457 7f808e83d700  2 Event(0x30c4e00 nevent=5000 
> time_id=1).set_owner idx=2 owner=140190123546368
> 2019-01-04 07:01:38.457 7f809dd5b380  1  Processor -- start
> 2019-01-04 07:01:38.477 7f809dd5b380  1 -- - start start
> 2019-01-04 07:01:38.481 7f809dd5b380  1 -- - --> 192.168.253.147:6789/0 -- 
> auth(proto 0 26 bytes epoch 0) v1 -- 0x32a6780 con 0
> 2019-01-04 07:01:38.481 7f809dd5b380  1 -- - --> 192.168.253.148:6789/0 -- 
> auth(proto 0 26 bytes epoch 0) v1 -- 0x32a6a00 con 0
> 2019-01-04 07:01:38.481 7f808e83d700  1 -- 192.168.253.148:0/1359135487 
> learned_addr learned my addr 192.168.253.148:0/1359135487
> 2019-01-04 07:01:38.481 7f808e83d700  2 -- 192.168.253.148:0/1359135487 >> 
> 192.168.253.148:6789/0 conn(0x332d500 :-1 s=STATE_CONNECTING_WAIT_ACK_SEQ 
> pgs=0 cs=0 l=0)._process_connection got newly_a$
> ked_seq 0 vs out_seq 0
> 2019-01-04 07:01:38.481 7f808f03e700  2 -- 192.168.253.148:0/1359135487 >> 
> 192.168.253.147:6789/0 conn(0x332ce00 :-1 s=STATE_CONNECTING_WAIT_ACK_SEQ 
> pgs=0 cs=0 l=0)._process_connection got newly_a$
> ked_seq 0 vs out_seq 0
> 2019-01-04 07:01:38.481 7f808f03e700  5 -- 192.168.253.148:0/1359135487 >> 
> 192.168.253.147:6789/0 conn(0x332ce00 :-1 
> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=74172 cs=1 l=1). rx mon.1 
> seq
> 1 0x30c5440 mon_map magic: 0 v1
> 2019-01-04 07:01:38.481 7f808e83d700  5 -- 192.168.253.148:0/1359135487 >> 
> 192.168.253.148:6789/0 conn(0x332d500 :-1 
> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=74275 cs=1 l=1). rx mon.2 
> seq
> 1 0x30c5680 mon_map magic: 0 v1
> 2019-01-04 07:01:38.481 7f808f03e700  5 -- 192.168.253.148:0/1359135487 >> 
> 192.168.253.147:6789/0 conn(0x332ce00 :-1 
> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=74172 cs=1 l=1). rx mon.1 
> seq
> 2 0x32a6780 auth_reply(proto 2 0 (0) Success) v1
> 2019-01-04 07:01:38.481 7f808e83d700  5 -- 192.168.253.148:0/1359135487 >> 
> 192.168.253.148:6789/0 conn(0x332d500 :-1 
> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=74275 cs=1 l=1). rx mon.2 
> seq
> 2 0x32a6a00 auth_reply(proto 2 0 (0) Success) v1
> 2019-01-04 07:01:38.481 7f808e03c700  1 -- 192.168.253.148:0/1359135487 <== 
> mon.1 192.168.253.147:6789/0 1  mon_map magic: 0 v1  370+0+0 
> (3034216899 0 0) 0x30c5440 con 0x332ce00
> 2019-01-04 07:01:38.481 7f808e03c700  1 -- 192.168.253.148:0/1359135487 <== 
> mon.2 192.168.253.148:6789/0 1  mon_map magic: 0 v1  370+0+0 
> (3034216899 0 0) 0x30c5680 con 0x332d500
> 2019-01-04 07:01:38.481 7f808e03c700  1 -- 192.168.253.148:0/1359135487 <== 
> mon.1 192.168.253.147:6789/0 2  auth_reply(proto 2 0 (0) Success) v1  
> 33+0+0 (3430158761 0 0) 0x32a6780 con 0x33$
> ce00
> 2019-01-04 07:01:38.481 7f808e03c700  1 -- 192.168.253.148:0/1359135487 --> 
> 192.168.253.147:6789/0 -- auth(proto 2 2 bytes epoch 0) v1 -- 0x32a6f00 con 0
> 2019-01-04 07:01:38.481 7f808e03c700  1 -- 192.168.253.148:0/1359135487 <== 
> mon.2 192.168.253.148:6789/0 2  auth_reply(proto 2 0 (0) Success) v1  
> 33+0+0 (3242503871 0 0) 0x32a6a00 con 0x33$
> d500
> 2019-01-04 07:01:38.481 7f808e03c700  1 -- 192.168.253.148:0/1359135487 --> 
> 192.168.253.148:6789/0 -- auth(proto 2 2 bytes epoch 0) v1 -- 0x32a6780 con 0
> 2019-01-04 07:01:38.481 7f808f03e700  5 -- 192.168.253.148:0/1359135487 >> 
> 192.168.253.147:6789/0 conn(0x332ce00 :-1 
> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=74172 cs=1 l=1). rx mon.1 
> seq
> 3 0x32a6f00 auth_reply(proto 2 -22 (22) Invalid argument) v1
> 2019-01-04 07:01:38.481 7f808e03c700  1 -- 192.168.253.148:0/1359135487 <== 
> mon.1 192.168.253.147:6789/0 3  auth_reply(

Re: [ceph-users] Mimic 13.2.3?

2019-01-04 Thread Gregory Farnum
Regarding 13.2.3 specifically:

As Abhishek says, there are no known issues in the release. It went
through our full and proper release validation; nobody has spotted any
last-minute bugs. The release notes are available in the git
repository: 
https://github.com/ceph/ceph/blob/master/doc/releases/mimic.rst#v1323-mimic
It was not announced on the mailing list or in a blog post because the
person cutting it didn't have those steps in their checklist, but if
you install these packages, you will be fine and in good shape.
The release did leave out some non-regression fixes that were intended
to go out so we're building a 13.2.4 with a few more patches included,
so if doing an upgrade is a big deal for you then you should probably
wait for those until you go through the effort.


Regarding Ceph releases more generally:

As we scale up the number of stable releases being maintained and have
more companies involved in Ceph, we are trying as a community to get
more people and groups involved in building and running releases. In
the long term this makes for a more stable and collaborative project.
In the short term, we're dealing with issues like "nobody told me I
should post to the blog when the release was done", figuring out
processes for releases that don't rely on a few blessed people knowing
the steps to take and sharing them over irc, and dealing with the
security concerns presented by making tools like signing keys
available.

I imagine we will discuss all this in more detail after the release,
but everybody's patience is appreciated as we work through these
challenges.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph blog RSS/Atom URL?

2019-01-04 Thread Gregory Farnum
It looks like ceph.com/feed is the RSS url?

On Fri, Jan 4, 2019 at 5:52 AM Jan Kasprzak  wrote:

> Hello,
>
> is there any RSS or Atom source for Ceph blog? I have looked inside
> the https://ceph.com/community/blog/ HTML source, but there is no
>  or anything mentioning RSS or Atom.
>
> Thanks,
>
> -Yenya
>
> --
> | Jan "Yenya" Kasprzak 
> |
> | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5
> |
>  This is the world we live in: the way to deal with computers is to google
>  the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] size of inc_osdmap vs osdmap

2019-01-02 Thread Gregory Farnum
On Thu, Dec 27, 2018 at 1:20 PM Sergey Dolgov  wrote:

> We investigated the issue and set debug_mon up to 20; during a small change
> of the osdmap we get many messages for all PGs of each pool (across the whole cluster):
>
>> 2018-12-25 19:28:42.426776 7f075af7d700 20 mon.1@0(leader).osd e1373789
>> prime_pg_tempnext_up === next_acting now, clear pg_temp
>> 2018-12-25 19:28:42.426776 7f075a77c700 20 mon.1@0(leader).osd e1373789
>> prime_pg_tempnext_up === next_acting now, clear pg_temp
>> 2018-12-25 19:28:42.426777 7f075977a700 20 mon.1@0(leader).osd e1373789
>> prime_pg_tempnext_up === next_acting now, clear pg_temp
>> 2018-12-25 19:28:42.426779 7f075af7d700 20 mon.1@0(leader).osd e1373789
>> prime_pg_temp 3.1000 [97,812,841]/[] -> [97,812,841]/[97,812,841], priming
>> []
>> 2018-12-25 19:28:42.426780 7f075a77c700 20 mon.1@0(leader).osd e1373789
>> prime_pg_temp 3.0 [84,370,847]/[] -> [84,370,847]/[84,370,847], priming []
>> 2018-12-25 19:28:42.426781 7f075977a700 20 mon.1@0(leader).osd e1373789
>> prime_pg_temp 4.0 [404,857,11]/[] -> [404,857,11]/[404,857,11], priming []
>
> though no pg_temps are created as a result (not a single backfill)
>
> We suppose this behavior changed in commit
> https://github.com/ceph/ceph/pull/16530/commits/ea723fbb88c69bd00fefd32a3ee94bf5ce53569c
> because earlier function *OSDMonitor::prime_pg_temp* should return in
> https://github.com/ceph/ceph/blob/luminous/src/mon/OSDMonitor.cc#L1009
> like in jewel
> https://github.com/ceph/ceph/blob/jewel/src/mon/OSDMonitor.cc#L1214
>
> i accept that we may be mistaken
>

Well those commits made some changes, but I'm not sure what about them
you're saying is wrong?

What would probably be most helpful is if you can dump out one of those
over-large incremental osdmaps and see what's using up all the space. (You
may be able to do it through the normal Ceph CLI by querying the monitor?
Otherwise if it's something very weird you may need to get the
ceph-dencoder tool and look at it with that.)
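Concretely, something along these lines; the epoch and object names are taken
from your earlier mail, and the dencoder invocation may need tweaking for your
build:

# full map via the normal CLI
ceph osd getmap 1357883 -o osdmap.1357883
osdmaptool --print osdmap.1357883 | head -50

# the raw incremental object copied from an OSD's meta directory, decoded offline
ceph-dencoder type OSDMap::Incremental import inc_osdmap.1357882__0_B783A4EA__none decode dump_json | less

The dump_json output should make it obvious whether it is pg_temp entries, an
embedded crush map, or something else eating the space.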
-Greg


>
>
> On Wed, Dec 12, 2018 at 10:53 PM Gregory Farnum 
> wrote:
>
>> Hmm that does seem odd. How are you looking at those sizes?
>>
>> On Wed, Dec 12, 2018 at 4:38 AM Sergey Dolgov  wrote:
>>
>>> Greg, for example for our cluster of ~1000 OSDs:
>>>
>>> size osdmap.1357881__0_F7FE779D__none = 363KB (crush_version 9860,
>>> modified 2018-12-12 04:00:17.661731)
>>> size osdmap.1357882__0_F7FE772D__none = 363KB
>>> size osdmap.1357883__0_F7FE74FD__none = 363KB (crush_version 9861,
>>> modified 2018-12-12 04:00:27.385702)
>>> size inc_osdmap.1357882__0_B783A4EA__none = 1.2MB
>>>
>>> difference between epoch 1357881 and 1357883: crush weight one osd was
>>> increased by 0.01 so we get 5 new pg_temp in osdmap.1357883 but size
>>> inc_osdmap so huge
>>>
>>> чт, 6 дек. 2018 г. в 06:20, Gregory Farnum :
>>> >
>>> > On Wed, Dec 5, 2018 at 3:32 PM Sergey Dolgov 
>>> wrote:
>>> >>
>>> >> Hi guys
>>> >>
>>> >> I faced strange behavior of crushmap change. When I change crush
>>> >> weight osd I sometimes get  increment osdmap(1.2MB) which size is
>>> >> significantly bigger than size of osdmap(0.4MB)
>>> >
>>> >
>>> > This is probably because when CRUSH changes, the new primary OSDs for
>>> a PG will tend to set a "pg temp" value (in the OSDMap) that temporarily
>>> reassigns it to the old acting set, so the data can be accessed while the
>>> new OSDs get backfilled. Depending on the size of your cluster, the number
>>> of PGs on it, and the size of the CRUSH change, this can easily be larger
>>> than the rest of the map because it is data with size linear in the number
>>> of PGs affected, instead of being more normally proportional to the number
>>> of OSDs.
>>> > -Greg
>>> >
>>> >>
>>> >> I use luminois 12.2.8. Cluster was installed a long ago, I suppose
>>> >> that initially it was firefly
>>> >> How can I view content of increment osdmap or can you give me opinion
>>> >> on this problem. I think that spikes of traffic tight after change of
>>> >> crushmap relates to this crushmap behavior
>>> >> ___
>>> >> ceph-users mailing list
>>> >> ceph-users@lists.ceph.com
>>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>>
>>> --
>>> Best regards, Sergey Dolgov
>>>
>>
>
> --
> Best regards, Sergey Dolgov
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs file block size: must it be so big?

2018-12-21 Thread Gregory Farnum
On Fri, Dec 14, 2018 at 6:44 PM Bryan Henderson 
wrote:

> > Going back through the logs though it looks like the main reason we do a
> > 4MiB block size is so that we have a chance of reporting actual cluster
> > sizes to 32-bit systems,
>
> I believe you're talking about a different block size (there are so many of
> them).
>
> The 'statvfs' system call (the essence of a 'df' command) can return its
> space
> sizes in any units it wants, and tells you that unit.  The unit has
> variously
> been called block size and fragment size.  In Cephfs, it is hardcoded as 4
> MiB
> so that 32 bit fields can represent large storage sizes.  I'm not aware
> that
> anyone attempts to use that value for anything but interpreting statvfs
> results.  Not saying they don't, though.
>
> What I'm looking at, in contrast, is the block size returned by a 'stat'
> system call on a particular file.  In Cephfs, it's the stripe unit size for
> the file, which is an aspect of the file's layout.  In the default layout,
> stripe unit size is 4 MiB.
>

You are of course correct; sorry for the confusion.
It looks like this was introduced in (user space) commit
0457783f6eb0c41951b6d56a568eccaeccec8e6d, which swapped it from the
previous hard-coded 4096. Probably in the expectation that there might be
still-small stripe units that were nevertheless useful to do IO in terms of.

You might want to try and be more sophisticated than just having a mount
option to override the reported block size — perhaps forcing the reported
size within some reasonable limits, but trying to keep some relationship
between it and the stripe size? If someone deploys an erasure-coded pool
under CephFS they definitely want to be doing IO in the stripe size if
possible, rather than 4 or 8KiB.
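For anyone following along, the value in question is easy to inspect and to
change per-directory, e.g. (paths are examples, and note that changing the
layout changes the real on-disk striping of new files, not just the reported
number):

stat -c %o /mnt/cephfs/somefile
# prints the st_blksize being discussed, 4194304 with the default layout
setfattr -n ceph.dir.layout.stripe_unit -v 65536 /mnt/cephfs/smallfiles
# files created under that directory afterwards report a 64KiB block size

Whether glibc's stream buffering should follow that value is exactly the
question at hand.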
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] why libcephfs API use "struct ceph_statx" instead of "struct stat"

2018-12-20 Thread Gregory Farnum
CephFS is prepared for the statx interface that doesn’t necessarily fill in
every member of the stat structure, and allows you to make requests for
only certain pieces of information. The purpose is so that the client and
MDS can take less expensive actions than are required to satisfy a full
stat.
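The practical upside is that a caller can ask for only the fields it needs and
let the client skip the expensive ones. A minimal sketch against libcephfs
(error handling omitted; the path is an example):

#include <stdio.h>
#include <cephfs/libcephfs.h>

int main(void)
{
    struct ceph_mount_info *cmount;
    struct ceph_statx stx;

    ceph_create(&cmount, NULL);         /* default client id */
    ceph_conf_read_file(cmount, NULL);  /* read the usual ceph.conf */
    ceph_mount(cmount, "/");

    /* ask for size and mtime only; stx_mask reports what actually came back */
    ceph_statx(cmount, "/some/file", &stx,
               CEPH_STATX_SIZE | CEPH_STATX_MTIME, 0);
    if (stx.stx_mask & CEPH_STATX_SIZE)
        printf("size: %llu\n", (unsigned long long)stx.stx_size);

    ceph_unmount(cmount);
    ceph_release(cmount);
    return 0;
}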
-Greg
On Mon, Dec 17, 2018 at 5:23 PM  wrote:

> Hi, everyone
>
> I found that the libcephfs API defines "struct ceph_statx" instead of using
> "struct stat". Why not use "struct stat" directly?
>
> I think that would be easier to understand and more convenient
> for callers.
>
>
>
> struct ceph_statx {
>
> uint32_t stx_mask;
>
> uint32_t stx_blksize;
>
> uint32_t stx_nlink;
>
> uint32_t stx_uid;
>
> uint32_t stx_gid;
>
> uint16_t stx_mode;
>
> uint64_t stx_ino;
>
> uint64_t stx_size;
>
> uint64_t stx_blocks;
>
> dev_t stx_dev;
>
> dev_t stx_rdev;
>
> struct timespec stx_atime;
>
> struct timespec stx_ctime;
>
> struct timespec stx_mtime;
>
> struct timespec stx_btime;
>
> uint64_t stx_version;
>
> };
>
>
> struct stat {
>
> dev_t st_dev; /* ID of device containing file */
>
> ino_t st_ino; /* inode number */
>
> mode_tst_mode;/* protection */
>
> nlink_t   st_nlink;   /* number of hard links */
>
> uid_t st_uid; /* user ID of owner */
>
> gid_t st_gid; /* group ID of owner */
>
> dev_t st_rdev;/* device ID (if special file) */
>
> off_t st_size;/* total size, in bytes */
>
> blksize_t st_blksize; /* blocksize for file system I/O */
>
> blkcnt_t  st_blocks;  /* number of 512B blocks allocated */
>
> time_tst_atime;   /* time of last access */
>
> time_tst_mtime;   /* time of last modification */
>
> time_tst_ctime;   /* time of last status change */
>
> };
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD snapshot atomicity guarantees?

2018-12-20 Thread Gregory Farnum
On Tue, Dec 18, 2018 at 1:11 AM Hector Martin  wrote:

> Hi list,
>
> I'm running libvirt qemu guests on RBD, and currently taking backups by
> issuing a domfsfreeze, taking a snapshot, and then issuing a domfsthaw.
> This seems to be a common approach.
>
> This is safe, but it's impactful: the guest has frozen I/O for the
> duration of the snapshot. This is usually only a few seconds.
> Unfortunately, the freeze action doesn't seem to be very reliable.
> Sometimes it times out, leaving the guest in a messy situation with
> frozen I/O (thaw times out too when this happens, or returns success but
> FSes end up frozen anyway). This is clearly a bug somewhere, but I
> wonder whether the freeze is a hard requirement or not.
>
> Are there any atomicity guarantees for RBD snapshots taken *without*
> freezing the filesystem? Obviously the filesystem will be dirty and will
> require journal recovery, but that is okay; it's equivalent to a hard
> shutdown/crash. But is there any chance of corruption related to the
> snapshot being taken in a non-atomic fashion?


RBD snapshots are indeed crash-consistent. :)
-Greg
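
For reference, a rough sketch of the two approaches; the domain "vm1" and the
image "rbd/vm1-disk" below are placeholders, not from the original thread:

    # application-consistent: quiesce the guest around the snapshot
    virsh domfsfreeze vm1
    rbd snap create rbd/vm1-disk@backup-$(date +%F)
    virsh domfsthaw vm1

    # crash-consistent: snapshot only; the guest replays its journal on restore
    rbd snap create rbd/vm1-disk@backup-$(date +%F)
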

> Filesystems and
> applications these days should have no trouble with hard shutdowns, as
> long as storage writes follow ordering guarantees (no writes getting
> reordered across a barrier and such).
>
> Put another way: do RBD snapshots have ~identical atomicity guarantees
> to e.g. LVM snapshots?
>
> If we can get away without the freeze, honestly I'd rather go that
> route. If I really need to pause I/O during the snapshot creation, I
> might end up resorting to pausing the whole VM (suspend/resume), which
> has higher impact but also probably a much lower chance of messing up
> (or having excess latency), since it doesn't involve the guest OS or the
> qemu agent at all...
>
> --
> Hector Martin (hec...@marcansoft.com)
> Public Key: https://marcan.st/marcan.asc
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs file block size: must it be so big?

2018-12-13 Thread Gregory Farnum
On Thu, Dec 13, 2018 at 3:31 PM Bryan Henderson 
wrote:

> I've searched the ceph-users archives and found no discussion to speak of
> of
> Cephfs block sizes, and I wonder how much people have thought about it.
>
> The POSIX 'stat' system call reports for each file a block size, which is
> usually defined vaguely as the smallest read or write size that is
> efficient.
> It usually takes into account that small writes may require a
> read-modify-write and there may be a minimum size on reads from backing
> storage.
>
> One thing that uses this information is the stream I/O implementation
> (fopen/fclose/fread/fwrite) in GNU libc.  It always reads and usually
> writes
> full blocks, buffering as necessary.
>
> Most filesystems report this number as 4K.
>
> Ceph reports the stripe unit (stripe column size), which is the maximum
> size
> of the RADOS objects that back the file.  This is 4M by default.
>
> One result of this is that a program uses a thousand times more buffer
> space
> when running against a Ceph file as against a traditional filesystem.
>
> And a really pernicious result occurs when you have a special file in
> Cephfs.
> Block size doesn't make any sense at all for special files, and it's
> probably
> a bad idea to use stream I/O to read one, but I've seen it done.  The
> Chrony
> clock synchronizer programs use fread to read random numbers from
> /dev/urandom.  Should /dev/urandom be in a Cephfs filesystem, with
> defaults,
> it's going to generate 4M of random bits to satisfy a 4-byte request.  On
> one
> of my computers, that takes 7 seconds - and wipes out the entropy pool.
>
>
> Has stat block size been discussed much?  Is there a good reason that it's
> the RADOS object size?
>
> I'm thinking of modifying the cephfs filesystem driver to add a mount
> option
> to specify a fixed block size to be reported for all files, and using 4K or
> 64K.  Would that break something?
>

I remember this being a huge pain in the butt for a variety of reasons.
Going back through the logs, though, it looks like the main reason we report a
4MiB block size is so that we have a chance of reporting actual cluster
sizes to 32-bit systems, so obviously a mount option to change it should
work fine as long as there aren't any shortcuts in the code. (Given that
we've previously switched from 4KiB to 4MiB, I wouldn't expect that to be a
problem.) My main worry would be that we definitely want to make sure that
the block size is appropriate for anybody using EC data pools, which may be
a little more complicated than a simple 4KiB or 64KiB setting.

It was kind of fun switching though since it revealed a lot of ecosystem
tools assuming the FS' block size was the same as a page size. :D
-Greg
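
For anyone experimenting with this, the stripe unit that ends up reported as
the block size can be inspected and changed through the layout vxattrs (paths
below are placeholders); note this changes the actual on-RADOS striping, not
just the reported number:

    # show the current layout of a file (stripe_unit is in bytes)
    getfattr -n ceph.file.layout /mnt/cephfs/somefile

    # new files created under this directory get a 64 KiB stripe unit
    setfattr -n ceph.dir.layout.stripe_unit -v 65536 /mnt/cephfs/somedir
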



>
> --
> Bryan Henderson   San Jose, California
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] size of inc_osdmap vs osdmap

2018-12-12 Thread Gregory Farnum
Hmm that does seem odd. How are you looking at those sizes?

On Wed, Dec 12, 2018 at 4:38 AM Sergey Dolgov  wrote:

> Greq, for example for our cluster ~1000 osd:
>
> size osdmap.1357881__0_F7FE779D__none = 363KB (crush_version 9860,
> modified 2018-12-12 04:00:17.661731)
> size osdmap.1357882__0_F7FE772D__none = 363KB
> size osdmap.1357883__0_F7FE74FD__none = 363KB (crush_version 9861,
> modified 2018-12-12 04:00:27.385702)
> size inc_osdmap.1357882__0_B783A4EA__none = 1.2MB
>
> difference between epoch 1357881 and 1357883: the crush weight of one osd was
> increased by 0.01, so we get 5 new pg_temp entries in osdmap.1357883, yet the
> inc_osdmap is this huge
>
> чт, 6 дек. 2018 г. в 06:20, Gregory Farnum :
> >
> > On Wed, Dec 5, 2018 at 3:32 PM Sergey Dolgov  wrote:
> >>
> >> Hi guys
> >>
> >> I've run into strange behavior around crushmap changes. When I change the
> >> crush weight of an osd I sometimes get an incremental osdmap (1.2MB) which is
> >> significantly bigger than the full osdmap (0.4MB)
> >
> >
> > This is probably because when CRUSH changes, the new primary OSDs for a
> PG will tend to set a "pg temp" value (in the OSDMap) that temporarily
> reassigns it to the old acting set, so the data can be accessed while the
> new OSDs get backfilled. Depending on the size of your cluster, the number
> of PGs on it, and the size of the CRUSH change, this can easily be larger
> than the rest of the map because it is data with size linear in the number
> of PGs affected, instead of being more normally proportional to the number
> of OSDs.
> > -Greg
> >
> >>
> >> I use luminous 12.2.8. The cluster was installed long ago; I suppose
> >> that initially it was firefly.
> >> How can I view the content of an incremental osdmap, or can you give me your
> >> opinion on this problem? I think the spikes of traffic right after a change of
> >> crushmap relate to this behavior
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Best regards, Sergey Dolgov
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ERR scrub mismatch

2018-12-06 Thread Gregory Farnum
Well, it looks like you have different data in the MDSMap across your
monitors. That's not good on its face, but maybe there are extenuating
circumstances. Do you actually use CephFS, or just RBD/RGW? What's the
full output of "ceph -s"?
-Greg
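
Roughly what is being asked for here, plus a couple of related checks (a
sketch, not a prescribed procedure):

    ceph -s
    ceph versions     # confirm all mons run the same release after the upgrade
    ceph fs status    # whether CephFS is in use; the mismatching keys
                      # (mdsmap, mds_health) all belong to it
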

On Thu, Dec 6, 2018 at 1:39 PM Marco Aroldi  wrote:
>
> Sorry about this, I hate "to bump" a thread, but...
> Anyone has faced this situation?
> There is a procedure to follow?
>
> Thanks
> Marco
>
> Il giorno gio 8 nov 2018, 10:54 Marco Aroldi  ha 
> scritto:
>>
>> Hello,
>> Since upgrading from Jewel to Luminous 12.2.8, some errors related to
>> "scrub mismatch" are reported in the logs every day at the same time.
>> I have 5 mons (from mon.0 to mon.4) and I need help to identify and recover
>> from this problem.
>>
>> This is the log:
>> 2018-11-07 15:13:53.808128 [ERR]  mon.4 ScrubResult(keys 
>> {logm=46,mds_health=29,mds_metadata=1,mdsmap=24} crc 
>> {logm=1239992787,mds_health=3182263811,mds_metadata=3704185590,mdsmap=1114086003})
>> 2018-11-07 15:13:53.808095 [ERR]  mon.0 ScrubResult(keys 
>> {logm=46,mds_health=30,mds_metadata=1,mdsmap=23} crc 
>> {logm=1239992787,mds_health=1194056063,mds_metadata=3704185590,mdsmap=3259702002})
>> 2018-11-07 15:13:53.808061 [ERR]  scrub mismatch
>> 2018-11-07 15:13:53.808026 [ERR]  mon.3 ScrubResult(keys 
>> {logm=46,mds_health=31,mds_metadata=1,mdsmap=22} crc 
>> {logm=1239992787,mds_health=807938287,mds_metadata=3704185590,mdsmap=662277977})
>> 2018-11-07 15:13:53.807970 [ERR]  mon.0 ScrubResult(keys 
>> {logm=46,mds_health=30,mds_metadata=1,mdsmap=23} crc 
>> {logm=1239992787,mds_health=1194056063,mds_metadata=3704185590,mdsmap=3259702002})
>> 2018-11-07 15:13:53.807939 [ERR]  scrub mismatch
>> 2018-11-07 15:13:53.807916 [ERR]  mon.2 ScrubResult(keys 
>> {logm=46,mds_health=31,mds_metadata=1,mdsmap=22} crc 
>> {logm=1239992787,mds_health=807938287,mds_metadata=3704185590,mdsmap=662277977})
>> 2018-11-07 15:13:53.807882 [ERR]  mon.0 ScrubResult(keys 
>> {logm=46,mds_health=30,mds_metadata=1,mdsmap=23} crc 
>> {logm=1239992787,mds_health=1194056063,mds_metadata=3704185590,mdsmap=3259702002})
>> 2018-11-07 15:13:53.807844 [ERR]  scrub mismatch
>>
>> Any help will be appreciated
>> Thanks
>> Marco
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] size of inc_osdmap vs osdmap

2018-12-05 Thread Gregory Farnum
On Wed, Dec 5, 2018 at 3:32 PM Sergey Dolgov  wrote:

> Hi guys
>
> I've run into strange behavior around crushmap changes. When I change the
> crush weight of an osd I sometimes get an incremental osdmap (1.2MB) which is
> significantly bigger than the full osdmap (0.4MB)
>

This is probably because when CRUSH changes, the new primary OSDs for a PG
will tend to set a "pg temp" value (in the OSDMap) that temporarily
reassigns it to the old acting set, so the data can be accessed while the
new OSDs get backfilled. Depending on the size of your cluster, the number
of PGs on it, and the size of the CRUSH change, this can easily be larger
than the rest of the map because it is data with size linear in the number
of PGs affected, instead of being more normally proportional to the number
of OSDs.
-Greg
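
If you want to look at the maps themselves, something like this works (a
sketch; the epoch numbers and the path to the incremental are placeholders
taken from your example):

    # dump a full map for an epoch and count its pg_temp entries
    ceph osd getmap 1357883 -o osdmap.1357883
    osdmaptool --print osdmap.1357883 | grep -c pg_temp

    # decode an incremental map copied out of the mon store
    ceph-dencoder type OSDMap::Incremental import ./inc_osdmap.1357882 \
        decode dump_json | less
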


> I use luminous 12.2.8. The cluster was installed long ago; I suppose
> that initially it was firefly.
> How can I view the content of an incremental osdmap, or can you give me your
> opinion on this problem? I think the spikes of traffic right after a change of
> crushmap relate to this behavior
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [cephfs] Kernel outage / timeout

2018-12-04 Thread Gregory Farnum
Yes, this is exactly it with the "reconnect denied".
-Greg
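
Once the client hits "reconnect denied" it has been evicted and the mount has
to be cycled, roughly like this (mount point, monitor address and credentials
are placeholders):

    umount -f /mnt/cephfs     # may need umount -l if processes still hold it
    mount -t ceph 10.5.0.88:6789:/ /mnt/cephfs \
        -o name=admin,secretfile=/etc/ceph/admin.secret
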

On Tue, Dec 4, 2018 at 3:00 AM NingLi  wrote:

>
> Hi,maybe this reference can help you
>
>
> http://docs.ceph.com/docs/master/cephfs/troubleshooting/#disconnected-remounted-fs
>
>
> > On Dec 4, 2018, at 18:55, c...@jack.fr.eu.org wrote:
> >
> > Hi,
> >
> > I have some wild freezes using cephfs with the kernel driver
> > For instance:
> > [Tue Dec  4 10:57:48 2018] libceph: mon1 10.5.0.88:6789 session lost,
> > hunting for new mon
> > [Tue Dec  4 10:57:48 2018] libceph: mon2 10.5.0.89:6789 session
> established
> > [Tue Dec  4 10:58:20 2018] ceph: mds0 caps stale
> > [..] server is now frozen, filesystem accesses are stuck
> > [Tue Dec  4 11:13:02 2018] libceph: mds0 10.5.0.88:6804 socket closed
> > (con state OPEN)
> > [Tue Dec  4 11:13:03 2018] libceph: mds0 10.5.0.88:6804 connection reset
> > [Tue Dec  4 11:13:03 2018] libceph: reset on mds0
> > [Tue Dec  4 11:13:03 2018] ceph: mds0 closed our session
> > [Tue Dec  4 11:13:03 2018] ceph: mds0 reconnect start
> > [Tue Dec  4 11:13:04 2018] ceph: mds0 reconnect denied
> > [Tue Dec  4 11:13:04 2018] ceph:  dropping dirty+flushing Fw state for
> > 3f1ae609 1099692263746
> > [Tue Dec  4 11:13:04 2018] ceph:  dropping dirty+flushing Fw state for
> > ccd58b71 1099692263749
> > [Tue Dec  4 11:13:04 2018] ceph:  dropping dirty+flushing Fw state for
> > da5acf8f 1099692263750
> > [Tue Dec  4 11:13:04 2018] ceph:  dropping dirty+flushing Fw state for
> > 5ddc2fcf 1099692263751
> > [Tue Dec  4 11:13:04 2018] ceph:  dropping dirty+flushing Fw state for
> > 469a70f4 1099692263754
> > [Tue Dec  4 11:13:04 2018] ceph:  dropping dirty+flushing Fw state for
> > 5c0038f9 1099692263757
> > [Tue Dec  4 11:13:04 2018] ceph:  dropping dirty+flushing Fw state for
> > e7288aa2 1099692263758
> > [Tue Dec  4 11:13:04 2018] ceph:  dropping dirty+flushing Fw state for
> > b431209a 1099692263759
> > [Tue Dec  4 11:13:04 2018] libceph: mds0 10.5.0.88:6804 socket closed
> > (con state NEGOTIATING)
> > [Tue Dec  4 11:13:31 2018] libceph: osd12 10.5.0.89:6805 socket closed
> > (con state OPEN)
> > [Tue Dec  4 11:13:35 2018] libceph: osd17 10.5.0.89:6800 socket closed
> > (con state OPEN)
> > [Tue Dec  4 11:13:35 2018] libceph: osd9 10.5.0.88:6813 socket closed
> > (con state OPEN)
> > [Tue Dec  4 11:13:41 2018] libceph: osd0 10.5.0.87:6800 socket closed
> > (con state OPEN)
> >
> > Kernel 4.17 is used; we got the same issue with 4.18.
> > Ceph 13.2.1 is used.
> > From what I understand, the kernel hangs itself for some reason (perhaps
> > it simply cannot handle some wild event)
> >
> > Is there a fix for that ?
> >
> > Secondly, it seems that the kernel reconnects itself after 15 minutes
> > every time.
> > Where is that tunable? Could I lower that variable, so that a hang has
> > less impact?
> >
> >
> > On ceph.log, I get Health check failed: 1 MDSs report slow requests
> > (MDS_SLOW_REQUEST), but this is probably the consequence, not the cause
> >
> > Any tip ?
> >
> > Best regards,
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Customized Crush location hooks in Mimic

2018-11-30 Thread Gregory Farnum
I’m pretty sure the monitor command there won’t move intermediate buckets
like the host. This is so if an osd has incomplete metadata it doesn’t
inadvertently move 11 other OSDs into a different rack/row/whatever.

So in this case, it finds the host osd001 and matches it, but since the
crush map already knows about osd001 it doesn't pay any attention to the
datacenter field.
Whereas if you tried setting it with mynewhost, the monitor wouldn’t know
where that host exists and would look at the other fields to set it in the
specified data center.
-Greg
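
So to get an already-known host under the intended bucket, it most likely has
to be moved once by hand, something like:

    ceph osd crush move osd001 datacenter=FTD

after which the location hook only has to keep individual OSDs on that host.
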
On Fri, Nov 30, 2018 at 6:46 AM Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> Dear Cephalopodians,
>
> sorry for the spam, but I found the following in mon logs just now and am
> finally out of ideas:
>
> --
> 2018-11-30 15:43:05.207 7f9d64aac700  0 mon.mon001@0(leader) e3
> handle_command mon_command({"prefix": "osd crush set-device-class",
> "class": "hdd", "ids": ["1"]} v 0) v1
> 2018-11-30 15:43:05.207 7f9d64aac700  0 log_channel(audit) log [INF] :
> from='osd.1 10.160.12.101:6816/90528' entity='osd.1' cmd=[{"prefix": "osd
> crush set-device-class", "class": "hdd", "ids": ["1"]}]: dispatch
> 2018-11-30 15:43:05.208 7f9d64aac700  0 mon.mon001@0(leader) e3
> handle_command mon_command({"prefix": "osd crush create-or-move", "id": 1,
> "weight":3.6824, "args": ["datacenter=FTD", "host=osd001", "root=default"]}
> v 0) v1
> 2018-11-30 15:43:05.208 7f9d64aac700  0 log_channel(audit) log [INF] :
> from='osd.1 10.160.12.101:6816/90528' entity='osd.1' cmd=[{"prefix": "osd
> crush create-or-move", "id": 1, "weight":3.6824, "args": ["datacenter=FTD",
> "host=osd001", "root=default"]}]: dispatch
> 2018-11-30 15:43:05.208 7f9d64aac700  0 mon.mon001@0(leader).osd e2464
> create-or-move crush item name 'osd.1' initial_weight 3.6824 at location
> {datacenter=FTD,host=osd001,root=default}
>
> --
> So the request to move to datacenter=FTD arrives at the mon, but no action
> is taken, and the OSD is left in FTD_1.
>
> Cheers,
> Oliver
>
> Am 30.11.18 um 15:25 schrieb Oliver Freyermuth:
> > Dear Cephalopodians,
> >
> > further experiments revealed that the crush-location-hook is indeed
> called!
> > It's just my check (writing to a file in tmp from inside the hook) which
> somehow failed. Using "logger" works for debugging.
> >
> > So now, my hook outputs:
> > host=osd001 datacenter=FTD root=default
> > as explained before. I have also explicitly created the buckets
> beforehand in case that is needed.
> >
> > Tree looks like that:
> > # ceph osd tree
> > ID  CLASS WEIGHT   TYPE NAMESTATUS REWEIGHT PRI-AFF
> >   -1   55.23582 root default
> >   -9  0 datacenter FTD
> > -12   18.41194 datacenter FTD_1
> >   -3   18.41194 host osd001
> >0   hdd  3.68239 osd.0up  1.0 1.0
> >1   hdd  3.68239 osd.1up  1.0 1.0
> >2   hdd  3.68239 osd.2up  1.0 1.0
> >3   hdd  3.68239 osd.3up  1.0 1.0
> >4   hdd  3.68239 osd.4up  1.0 1.0
> > -11  0 datacenter FTD_2
> >   -5   18.41194 host osd002
> >5   hdd  3.68239 osd.5up  1.0 1.0
> >6   hdd  3.68239 osd.6up  1.0 1.0
> >7   hdd  3.68239 osd.7up  1.0 1.0
> >8   hdd  3.68239 osd.8up  1.0 1.0
> >9   hdd  3.68239 osd.9up  1.0 1.0
> >   -7   18.41194 host osd003
> >   10   hdd  3.68239 osd.10   up  1.0 1.0
> >   11   hdd  3.68239 osd.11   up  1.0 1.0
> >   12   hdd  3.68239 osd.12   up  1.0 1.0
> >   13   hdd  3.68239 osd.13   up  1.0 1.0
> >   14   hdd  3.68239 osd.14   up  1.0 1.0
> >
> > So naively, I would expect that when I restart osd.0, it should move
> itself into datacenter=FTD.
> > But that does not happen...
> >
> > Any idea what I am missing?
> >
> > Cheers,
> >  Oliver
> >
> >
> >
> > Am 30.11.18 um 11:44 schrieb Oliver Freyermuth:
> >> Dear Cephalopodians,
> >>
> >> I'm probably missing something obvious, but I am at a loss here on how
> to actually make use of a customized crush location hook.
> >>
> >> I'm currently on "ceph version 13.2.1" on CentOS 7 (i.e. the last
> version before the upgrade-preventing bugs). Here's what I did:
> >>
> >> 1. Write a script /usr/local/bin/customized-ceph-crush-location. The
> script can be executed by user "ceph":
> >># sudo -u ceph /usr/local/bin/customized-ceph-crush-location
> >>host=osd001 datacenter=FTD root=default
> >>
> >> 2. Add the following to ceph.conf:
> >>   [osd]
> >

Re: [ceph-users] What could cause mon_osd_full_ratio to be exceeded?

2018-11-26 Thread Gregory Farnum
On Mon, Nov 26, 2018 at 10:28 AM Vladimir Brik
 wrote:
>
> Hello
>
> I am doing some Ceph testing on a near-full cluster, and I noticed that,
> after I brought down a node, some OSDs' utilization reached
> osd_failsafe_full_ratio (97%). Why didn't it stop at mon_osd_full_ratio
> (90%) if mon_osd_backfillfull_ratio is 90%?

While I believe the very newest Ceph source will do this, it can be
surprisingly difficult to identify the exact size a PG will take up on
disk (thanks to omap/RocksDB data), and so for a long time we pretty
much didn't try — these ratios were checked when starting a backfill,
but we didn't try to predict where they would end up and limit
ourselves based on that.
-Greg
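
For reference, the ratios in effect can be checked and adjusted at runtime;
the values below are just the usual defaults:

    ceph osd dump | grep ratio
    ceph osd set-nearfull-ratio 0.85
    ceph osd set-backfillfull-ratio 0.90
    ceph osd set-full-ratio 0.95
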

>
>
> Thanks,
>
> Vlad
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

2018-11-26 Thread Gregory Farnum
On Tue, Nov 20, 2018 at 9:50 PM Vlad Kopylov  wrote:

> I see the point, but not for the read case:
>   no overhead for just choosing or let Mount option choose read replica.
>
> This is a simple feature that can be implemented, and it would save many
> people bandwidth in really distributed cases.
>

This is actually much more complicated than it sounds. Allowing reads from
the replica OSDs while still routing writes through a different primary OSD
introduces a great many consistency issues. We've tried adding very limited
support for this read-from-replica scenario in special cases, but have had
to roll them all back due to edge cases where they don't work.

I understand why you want it, but it's definitely not a simple feature. :(
-Greg
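
The workaround quoted below (steering reads by making the nearby OSD the
primary) can also be approached by lowering primary affinity on the remote
OSDs instead of building separate pools; the osd ids are placeholders:

    # 0 = avoid choosing this OSD as primary, 1 = default
    ceph osd primary-affinity osd.3 0
    ceph osd primary-affinity osd.4 0
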


>
> Main issue this surfaces is that RADOS maps ignore clients - they just
> see cluster. There should be the part of RADOS map unique or possibly
> unique for each client connection.
>
> Lets file feature request?
>
> p.s. honestly, I don't see why anyone would use ceph for local network
> RAID setups, there are other simple solutions out there even in your
> own RedHat shop.
> On Tue, Nov 20, 2018 at 8:38 PM Patrick Donnelly 
> wrote:
> >
> > You either need to accept that reads/writes will land on different data
> centers, primary OSD for a given pool is always in the desired data center,
> or some other non-Ceph solution which will have either expensive, eventual,
> or false consistency.
> >
> > On Fri, Nov 16, 2018, 10:07 AM Vlad Kopylov  >>
> >> This is what Jean suggested. I understand it and it works with primary.
> >> But what I need is for all clients to access same files, not separate
> sets (like red blue green)
> >>
> >> Thanks Konstantin.
> >>
> >> On Fri, Nov 16, 2018 at 3:43 AM Konstantin Shalygin 
> wrote:
> >>>
> >>> On 11/16/18 11:57 AM, Vlad Kopylov wrote:
> >>> > Exactly. But write operations should go to all nodes.
> >>>
> >>> This can be set via primary affinity [1], when a ceph client reads or
> >>> writes data, it always contacts the primary OSD in the acting set.
> >>>
> >>>
> >>> If u want to totally segregate IO, you can use device classes:
> >>>
> >>> Just create osds with different classes:
> >>>
> >>> dc1
> >>>
> >>>host1
> >>>
> >>>  red osd.0 primary
> >>>
> >>>  blue osd.1
> >>>
> >>>  green osd.2
> >>>
> >>> dc2
> >>>
> >>>host2
> >>>
> >>>  red osd.3
> >>>
> >>>  blue osd.4 primary
> >>>
> >>>  green osd.5
> >>>
> >>> dc3
> >>>
> >>>host3
> >>>
> >>>  red osd.6
> >>>
> >>>  blue osd.7
> >>>
> >>>  green osd.8 primary
> >>>
> >>>
> >>> create 3 crush rules:
> >>>
> >>> ceph osd crush rule create-replicated red default host red
> >>>
> >>> ceph osd crush rule create-replicated blue default host blue
> >>>
> >>> ceph osd crush rule create-replicated green default host green
> >>>
> >>>
> >>> and 3 pools:
> >>>
> >>> ceph osd pool create red 64 64 replicated red
> >>>
> >>> ceph osd pool create blue 64 64 replicated blue
> >>>
> >>> ceph osd pool create blue 64 64 replicated green
> >>>
> >>>
> >>> [1]
> >>>
> http://docs.ceph.com/docs/master/rados/operations/crush-map/#primary-affinity
> '
> >>>
> >>>
> >>>
> >>> k
> >>>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] will crush rule be used during object relocation in OSD failure ?

2018-11-26 Thread Gregory Farnum
On Fri, Nov 23, 2018 at 11:01 AM ST Wong (ITSC)  wrote:

> Hi all,
>
>
> We've 8 osd hosts, 4 in room 1 and 4 in room2.
>
> A pool with size = 3 using following crush map is created, to cater for
> room failure.
>
>
> rule multiroom {
> id 0
> type replicated
> min_size 2
> max_size 4
> step take default
> step choose firstn 2 type room
> step chooseleaf firstn 2 type host
> step emit
> }
>
>
>
> We're expecting:
>
> 1. for each object, there are always 2 replicas in one room and 1 replica
> in the other room, making size=3.  But we can't control which room has 1 or 2
> replicas.
>

Right.


>
> 2. in case an osd host fails, ceph will assign remaining osds to the same
> PG to hold replicas on the failed osd host.  Selection is based on crush
> rule of the pool, thus maintaining the same failure domain - won't make all
> replicas in the same room.
>

Yes, if a host fails the copies it held will be replaced by new copies in
the same room.


>
> 3. in case the entire room with 1 replica fails, the pool will remain
> degraded but won't do any replica relocation.
>

Right.


>
> 4. in case the entire room with 2 replicas fails, ceph will make use of
> osds in the surviving room and make 2 replicas.  The pool will not be
> writeable before all objects have 2 copies (unless we make the pool
> size=4?).  Then when recovery is complete, the pool will remain in a degraded
> state until the failed room recovers.
>

Hmm, I'm actually not sure if this will work out — because CRUSH is
hierarchical, it will keep trying to select hosts from the dead room and
will fill out the location vector's first two spots with -1. It could be
that Ceph will skip all those "nonexistent" entries and just pick the two
copies from slots 3 and 4, but it might not. You should test this carefully
and report back!
-Greg
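
One way to dry-run point 4 without touching data is to test the compiled crush
map offline; the rule id, replica count and osd ids below are placeholders for
your setup:

    ceph osd getcrushmap -o crush.bin
    # zero the weights of the OSDs in the "failed" room and look at the mappings
    crushtool -i crush.bin --test --rule 0 --num-rep 3 --show-mappings \
        --weight 0 0 --weight 1 0 --weight 2 0 --weight 3 0
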

>
> Is our understanding correct?  Thanks a lot.
> Will do some simulation later to verify.
>
> Regards,
> /stwong
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Monitor disks for SSD only cluster

2018-11-26 Thread Gregory Farnum
As the monitors limit their transaction rates, I would tend toward the
higher-durability drives. I don't think any monitor throughput issues have
been reported on clusters with SSDs for storage.
-Greg

On Mon, Nov 26, 2018 at 5:47 AM Valmar Kuristik  wrote:

> Hello,
>
> Can anyone say how important it is to have fast storage on monitors for an
> all-SSD deployment? We are planning on throwing SSDs into the monitors
> as well, but are at a loss about whether to go for more durability or speed.
> Higher durability drives tend to be a lot slower for the 240GB size we'd
> need on the monitors, while lower durability would net considerably more
> write speed.
>
> Any insight into this ?
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] No recovery when "norebalance" flag set

2018-11-26 Thread Gregory Farnum
On Sun, Nov 25, 2018 at 2:41 PM Stefan Kooman  wrote:

> Hi list,
>
> During cluster expansion (adding extra disks to existing hosts) some
> OSDs failed (FAILED assert(0 == "unexpected error", _txc_add_transaction
> error (39) Directory not empty not handled on operation 21 (op 1,
> counting from 0), full details: https://8n1.org/14078/c534). We had
> "norebalance", "nobackfill", and "norecover" flags set. After we unset
> nobackfill and norecover (to let Ceph fix the degraded PGs) it would
> recover all but 12 objects (2 PGs). We queried the PGs and the OSDs that
> were supposed to have a copy of them, and they were already "probed".  A
> day later (~24 hours) it would still not have recovered the degraded
> objects.  After we unset the "norebalance" flag it would start
> rebalancing, backfilling and recovering. The 12 degraded objects were
> recovered.
>
> Is this expected behaviour? I would expect Ceph to always try to fix
> degraded things first and foremost. Even "pg force-recover" and "pg
> force-backfill" could not force recovery.
>

I haven't dug into how the norebalance flag works, but I think this is
expected — it presumably prevents OSDs from creating new copies of PGs,
which is what needed to happen here.
-Greg
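
For the record, the flags and force commands mentioned above look like this
(the pg id is a placeholder); note that force-recovery/force-backfill only
raise the priority of work that is already permitted, so they would not be
expected to override a global flag:

    ceph osd unset norebalance
    ceph osd unset nobackfill
    ceph osd unset norecover
    ceph pg force-recovery 1.2a
    ceph pg force-backfill 1.2a
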


>
> Gr. Stefan
>
>
>
>
> --
> | BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688
> <+31%20318%20648%20688> / i...@bit.nl
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Degraded objects after: ceph osd in $osd

2018-11-26 Thread Gregory Farnum
On Mon, Nov 26, 2018 at 3:30 AM Janne Johansson  wrote:

> Den sön 25 nov. 2018 kl 22:10 skrev Stefan Kooman :
> >
> > Hi List,
> >
> > Another interesting and unexpected thing we observed during cluster
> > expansion is the following. After we added extra disks to the cluster,
> > while the "norebalance" flag was set, we put the new OSDs "IN". As soon as
> > we did that, a couple of hundred objects would become degraded. During
> > that time no OSD crashed or restarted. Every "ceph osd crush add $osd
> > weight host=$storage-node" would cause extra degraded objects.
> >
> > I don't expect objects to become degraded when extra OSDs are added.
> > Misplaced, yes. Degraded, no
> >
> > Someone got an explantion for this?
> >
>
> Yes, when you add a drive (or 10), some PGs decide they should have one or
> more
> replicas on the new drives, a new empty PG is created there, and
> _then_ that replica
> will make that PG get into the "degraded" mode, meaning if it had 3
> fine active+clean
> replicas before, it now has 2 active+clean and one needing backfill to
> get into shape.
>
> It is a slight mistake in reporting it in the same way as an error,
> even if it looks to the
> cluster just as if it was in error and needs fixing. This gives the
> new ceph admins a
> sense of urgency or danger whereas it should be perfectly normal to add
> space to
> a cluster. Also, it could have chosen to add a fourth PG in a repl=3
> PG and fill from
> the one going out into the new empty PG and somehow keep itself with 3
> working
> replicas, but ceph chooses to first discard one replica, then backfill
> into the empty
> one, leading to this kind of "error" report.
>

See, that's the thing: Ceph is designed *not* to reduce data reliability
this way; it shouldn't do that; and as far as I've been able to establish,
it doesn't actually do that. Which makes these degraded object
reports a bit perplexing.

What we have worked out is that sometimes objects can be degraded because
the log-based recovery takes a while after the primary juggles around PG
set membership, and I suspect that's what is turning up here. The exact
cause still eludes me a bit, but I assume it's a consequence of the
backfill and recovery throttling we've added over the years.
If a whole PG was missing then you'd expect to see very large degraded
object counts (as opposed to the 2 that Marco reported).

-Greg
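
If this comes up again, a quick way to see which PGs are carrying the degraded
objects while recovery settles (a sketch, nothing thread-specific):

    ceph health detail | grep -i degraded
    ceph pg dump_stuck degraded
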


>
> --
> May the most significant bit of your life be positive.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


  1   2   3   4   5   6   7   8   9   10   >