Re: [ceph-users] dovecot + cephfs - sdbox vs mdbox

2018-10-04 Thread Webert de Souza Lima
Hi, bringing this up again to ask one more question:

What would be the best recommended locking strategy for dovecot against
cephfs?
This is a balanced setup using independent director instances, but all
dovecot instances on each node share the same storage system (cephfs).
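
For context, this is the kind of configuration we are considering as a
starting point. It is only a sketch based on the generic Dovecot guidance for
shared/cluster filesystems, not something we have validated on cephfs yet, so
corrections to any of these assumptions are welcome:

# sketch for conf.d/10-mail.conf -- untested on cephfs
# avoid mmap on a network/cluster filesystem
mmap_disable = yes
# cephfs honours O_EXCL, so exclusive dotlock creation should be fine
dotlock_use_excl = yes
# or "always" if safety matters more than speed
mail_fsync = optimized
# POSIX locks; supported by the cephfs kernel client
lock_method = fcntl
# NFS-specific workarounds should not be needed on cephfs
mail_nfs_storage = no
mail_nfs_index = no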

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, May 16, 2018 at 5:15 PM Webert de Souza Lima 
wrote:

> Thanks Jack.
>
> That's good to know. It is definitely something to consider.
> In a distributed storage scenario we might build a dedicated pool for that
> and tune the pool as more capacity or performance is needed.
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> *Belo Horizonte - Brasil*
> *IRC NICK - WebertRLZ*
>
>
> On Wed, May 16, 2018 at 4:45 PM Jack  wrote:
>
>> On 05/16/2018 09:35 PM, Webert de Souza Lima wrote:
>> > We'll soon do benchmarks of sdbox vs mdbox over cephfs with bluestore
>> > backend.
>> > We'll have to do some work on how to simulate user traffic, for writes
>> > and reads. That seems troublesome.
>> I would appreciate seeing these results !
>>
>> > Thanks for the plugin recommendations. I'll take the chance and ask you:
>> > how is the SIS status? We have used it in the past and we've had some
>> > problems with it.
>>
>> I have been using it since Dec 2016 with mdbox, with no issue at all (I am
>> currently using Dovecot 2.2.27-3 from Debian Stretch).
>> The only config I use is mail_attachment_dir, the rest is left at defaults
>> (mail_attachment_min_size = 128k, mail_attachment_fs = sis posix,
>> mail_attachment_hash = %{sha1}).
>> The backend storage is a local filesystem, and there is only one Dovecot
>> instance
>>
>> >
>> > Regards,
>> >
>> > Webert Lima
>> > DevOps Engineer at MAV Tecnologia
>> > *Belo Horizonte - Brasil*
>> > *IRC NICK - WebertRLZ*
>> >
>> >
>> > On Wed, May 16, 2018 at 4:19 PM Jack  wrote:
>> >
>> >> Hi,
>> >>
>> >> Many (most?) filesystems do not store multiple files in the same block.
>> >>
>> >> Thus, with sdbox, every single mail (you know, that kind of mail with 10
>> >> lines in it) will eat an inode and a block (4k here).
>> >> mdbox is more compact in this way.
>> >>
>> >> Another difference: sdbox removes the message, mdbox does not: a single
>> >> metadata update is performed, which may be packed with others if many
>> >> files are deleted at once.
>> >>
>> >> That said, I do not have experience with dovecot + cephfs, nor have I
>> >> made tests of sdbox vs mdbox.
>> >>
>> >> However, and this is a bit off topic, I recommend you look at the
>> >> following dovecot features (if not already done), as they are awesome
>> >> and will help you a lot:
>> >> - Compression (classic, https://wiki.dovecot.org/Plugins/Zlib)
>> >> - Single-Instance-Storage (aka sis, aka "attachment deduplication" :
>> >> https://www.dovecot.org/list/dovecot/2013-December/094276.html)
>> >>
>> >> Regards,
>> >> On 05/16/2018 08:37 PM, Webert de Souza Lima wrote:
>> >>> I'm sending this message to both the dovecot and ceph-users MLs, so
>> >>> please don't mind if something seems too obvious for you.
>> >>>
>> >>> Hi,
>> >>>
>> >>> I have a question for both the dovecot and ceph lists, and below I'll
>> >>> explain what's going on.
>> >>>
>> >>> Regarding the dbox format (https://wiki2.dovecot.org/MailboxFormat/dbox),
>> >>> when using sdbox, a new file is stored for each email message.
>> >>> When using mdbox, multiple messages are appended to a single file until
>> >>> it reaches/passes the rotate limit.
>> >>>
>> >>> I would like to understand better how the mdbox format impacts IO
>> >>> performance.
>> >>> I think it's generally expected that fewer, larger files translate to
>> >>> less IO and more throughput when compared to many small files, but how
>> >>> does dovecot handle that with mdbox?
>> >>> If dovecot flushes data to storage when each and every new email arrives
>> >>> and is appended 

Re: [ceph-users] cephfs kernel client hangs

2018-08-08 Thread Webert de Souza Lima
You can only try to remount the cephfs dir. It will probably not work,
giving you I/O errors, so the fallback would be to use a FUSE mount.

If I recall correctly you could do a lazy umount on the current dir (umount
-fl /mountdir) and remount it using the FUSE client.
It will work for new sessions, but the currently hanging ones will still be
hanging.

With FUSE you'll only be able to mount the cephfs root dir, so if you have
multiple directories, you'll need to:
 - mount the cephfs root in another directory
 - mount each subdir (after the root is mounted) onto the desired directory
via a bind mount (see the sketch below).
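
A rough sketch of those steps; the mount points, subdirectory names and the
monitor address are made-up examples, so adjust them to your environment:

# lazily/forcibly detach the hung kernel-client mount
umount -f -l /mnt/cephfs

# mount the cephfs root with the FUSE client instead
ceph-fuse -m 192.168.0.10:6789 /mnt/cephfs-root

# expose each subdirectory at its expected location via bind mounts
mount --bind /mnt/cephfs-root/mail  /srv/mail
mount --bind /mnt/cephfs-root/index /srv/index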


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, Aug 8, 2018 at 11:46 AM Zhenshi Zhou  wrote:

> Hi,
> Is there any other way except rebooting the server when the client hangs?
> If the server is in a production environment, I can't restart it every time.
>
> Webert de Souza Lima wrote on Wed, Aug 8, 2018 at 10:33 PM:
>
>> Hi Zhenshi,
>>
>> if you still have the client mount hanging but no session is connected,
>> you probably have some PID waiting with blocked IO from cephfs mount.
>> I face that now and then and the only solution is to reboot the server,
>> as you won't be able to kill a process with pending IO.
>>
>> Regards,
>>
>> Webert Lima
>> DevOps Engineer at MAV Tecnologia
>> *Belo Horizonte - Brasil*
>> *IRC NICK - WebertRLZ*
>>
>>
>> On Wed, Aug 8, 2018 at 11:17 AM Zhenshi Zhou  wrote:
>>
>>> Hi Webert,
>>> That command shows the current sessions, whereas the server from which I
>>> got the files (osdc, mdsc, monc) has been disconnected for a long time.
>>> So I cannot get useful information from the command you provided.
>>>
>>> Thanks
>>>
>>> Webert de Souza Lima wrote on Wed, Aug 8, 2018 at 10:10 PM:
>>>
>>>> You could also see open sessions at the MDS server by issuing  `ceph
>>>> daemon mds.XX session ls`
>>>>
>>>> Regards,
>>>>
>>>> Webert Lima
>>>> DevOps Engineer at MAV Tecnologia
>>>> *Belo Horizonte - Brasil*
>>>> *IRC NICK - WebertRLZ*
>>>>
>>>>
>>>> On Wed, Aug 8, 2018 at 5:08 AM Zhenshi Zhou 
>>>> wrote:
>>>>
>>>>> Hi, I find an old server which mounted cephfs and has the debug files.
>>>>> # cat osdc
>>>>> REQUESTS 0 homeless 0
>>>>> LINGER REQUESTS
>>>>> BACKOFFS
>>>>> # cat monc
>>>>> have monmap 2 want 3+
>>>>> have osdmap 3507
>>>>> have fsmap.user 0
>>>>> have mdsmap 55 want 56+
>>>>> fs_cluster_id -1
>>>>> # cat mdsc
>>>>> 194   mds0   getattr   #1036ae3
>>>>>
>>>>> What does it mean?
>>>>>
>>>>> Zhenshi Zhou wrote on Wed, Aug 8, 2018 at 1:58 PM:
>>>>>
>>>>>> I restarted the client server so that there's no file in that
>>>>>> directory. I will take care of it if the client hangs next time.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Yan, Zheng wrote on Wed, Aug 8, 2018 at 11:23 AM:
>>>>>>
>>>>>>> On Wed, Aug 8, 2018 at 11:02 AM Zhenshi Zhou 
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > Hi,
>>>>>>> > I checked all my ceph servers and cephfs is not mounted on any of
>>>>>>> > them (maybe I umounted it after testing). As a result, the cluster didn't
>>>>>>> > encounter a memory deadlock. Besides, I checked the monitoring system and
>>>>>>> > the memory and cpu usage were at normal levels while the clients hung.
>>>>>>> > Back to my question, there must be something else causing the client
>>>>>>> > hang.
>>>>>>> >
>>>>>>>
>>>>>>> Check if there are hang requests in
>>>>>>> /sys/kernel/debug/ceph//{osdc,mdsc},
>>>>>>>
>>>>>>> > Zhenshi Zhou wrote on Wed, Aug 8, 2018 at 4:16 AM:
>>>>>>> >>
>>>>>>> >> Hi, I'm not sure if just mounting the cephfs, without using it or
>>>>>>> >> doing any operation within the mounted directory, would be affected by
>>>>>>> >> cache flushing. I mounted cephfs on osd servers only for testing and
>>>>>>> >> then left it there. Anyway I will um

Re: [ceph-users] cephfs kernel client hangs

2018-08-08 Thread Webert de Souza Lima
Hi Zhenshi,

If you still have the client mount hanging but no session is connected, you
probably have some PID waiting on blocked IO from the cephfs mount.
I face that now and then and the only solution is to reboot the server, as
you won't be able to kill a process with pending IO.
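
You can usually spot those processes because they sit in uninterruptible
sleep ("D" state). A quick, generic way to list them (nothing ceph-specific
about it):

# list processes stuck in D state and the kernel function they are waiting in
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'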

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, Aug 8, 2018 at 11:17 AM Zhenshi Zhou  wrote:

> Hi Webert,
> That command shows the current sessions, whereas the server from which I got
> the files (osdc, mdsc, monc) has been disconnected for a long time.
> So I cannot get useful information from the command you provided.
>
> Thanks
>
> Webert de Souza Lima wrote on Wed, Aug 8, 2018 at 10:10 PM:
>
>> You could also see open sessions at the MDS server by issuing  `ceph
>> daemon mds.XX session ls`
>>
>> Regards,
>>
>> Webert Lima
>> DevOps Engineer at MAV Tecnologia
>> *Belo Horizonte - Brasil*
>> *IRC NICK - WebertRLZ*
>>
>>
>> On Wed, Aug 8, 2018 at 5:08 AM Zhenshi Zhou  wrote:
>>
>>> Hi, I find an old server which mounted cephfs and has the debug files.
>>> # cat osdc
>>> REQUESTS 0 homeless 0
>>> LINGER REQUESTS
>>> BACKOFFS
>>> # cat monc
>>> have monmap 2 want 3+
>>> have osdmap 3507
>>> have fsmap.user 0
>>> have mdsmap 55 want 56+
>>> fs_cluster_id -1
>>> # cat mdsc
>>> 194   mds0   getattr   #1036ae3
>>>
>>> What does it mean?
>>>
>>> Zhenshi Zhou wrote on Wed, Aug 8, 2018 at 1:58 PM:
>>>
>>>> I restarted the client server so that there's no file in that
>>>> directory. I will take care of it if the client hangs next time.
>>>>
>>>> Thanks
>>>>
>>>> Yan, Zheng wrote on Wed, Aug 8, 2018 at 11:23 AM:
>>>>
>>>>> On Wed, Aug 8, 2018 at 11:02 AM Zhenshi Zhou 
>>>>> wrote:
>>>>> >
>>>>> > Hi,
>>>>> > I checked all my ceph servers and cephfs is not mounted on any of
>>>>> > them (maybe I umounted it after testing). As a result, the cluster didn't
>>>>> > encounter a memory deadlock. Besides, I checked the monitoring system and
>>>>> > the memory and cpu usage were at normal levels while the clients hung.
>>>>> > Back to my question, there must be something else causing the client
>>>>> > hang.
>>>>> >
>>>>>
>>>>> Check if there are hang requests in
>>>>> /sys/kernel/debug/ceph//{osdc,mdsc},
>>>>>
>>>>> > Zhenshi Zhou wrote on Wed, Aug 8, 2018 at 4:16 AM:
>>>>> >>
>>>>> >> Hi, I'm not sure if just mounting the cephfs, without using it or
>>>>> >> doing any operation within the mounted directory, would be affected by
>>>>> >> cache flushing. I mounted cephfs on osd servers only for testing and then
>>>>> >> left it there. Anyway I will umount it.
>>>>> >>
>>>>> >> Thanks
>>>>> >>
>>>>> >> John Spray wrote on Wed, Aug 8, 2018 at 03:37:
>>>>> >>>
>>>>> >>> On Tue, Aug 7, 2018 at 5:42 PM Reed Dier 
>>>>> wrote:
>>>>> >>> >
>>>>> >>> > This is the first I am hearing about this as well.
>>>>> >>>
>>>>> >>> This is not a Ceph-specific thing -- it can also affect similar
>>>>> >>> systems like Lustre.
>>>>> >>>
>>>>> >>> The classic case is when under some memory pressure, the kernel
>>>>> tries
>>>>> >>> to free memory by flushing the client's page cache, but doing the
>>>>> >>> flush means allocating more memory on the server, making the memory
>>>>> >>> pressure worse, until the whole thing just seizes up.
>>>>> >>>
>>>>> >>> John
>>>>> >>>
>>>>> >>> > Granted, I am using ceph-fuse rather than the kernel client at
>>>>> this point, but that isn’t etched in stone.
>>>>> >>> >
>>>>> >>> > Curious if there is more to share.
>>>>> >>> >
>>>>> >>> > Reed
>>>>> >>> >
>>>>> >>> > On Aug 7, 2018, at 9:47 AM, Webert de Souza Lima <
>>>>> webert.b...@gmail.com> wrote:
>>>

Re: [ceph-users] cephfs kernel client hangs

2018-08-08 Thread Webert de Souza Lima
You could also see open sessions at the MDS server by issuing  `ceph daemon
mds.XX session ls`
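
For example, assuming the MDS daemon is named after the host it runs on (the
jq filter and field names are illustrative and may vary by release):

# on the MDS host, through the admin socket
ceph daemon mds.$(hostname -s) session ls

# narrow it down to client id, address and number of caps held (needs jq)
ceph daemon mds.$(hostname -s) session ls | jq '.[] | {id, inst, num_caps}'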

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, Aug 8, 2018 at 5:08 AM Zhenshi Zhou  wrote:

> Hi, I find an old server which mounted cephfs and has the debug files.
> # cat osdc
> REQUESTS 0 homeless 0
> LINGER REQUESTS
> BACKOFFS
> # cat monc
> have monmap 2 want 3+
> have osdmap 3507
> have fsmap.user 0
> have mdsmap 55 want 56+
> fs_cluster_id -1
> # cat mdsc
> 194   mds0   getattr   #1036ae3
>
> What does it mean?
>
> Zhenshi Zhou wrote on Wed, Aug 8, 2018 at 1:58 PM:
>
>> I restarted the client server so that there's no file in that directory.
>> I will take care of it if the client hangs next time.
>>
>> Thanks
>>
>> Yan, Zheng wrote on Wed, Aug 8, 2018 at 11:23 AM:
>>
>>> On Wed, Aug 8, 2018 at 11:02 AM Zhenshi Zhou 
>>> wrote:
>>> >
>>> > Hi,
>>> > I checked all my ceph servers and cephfs is not mounted on any of
>>> > them (maybe I umounted it after testing). As a result, the cluster didn't
>>> > encounter a memory deadlock. Besides, I checked the monitoring system and the
>>> > memory and cpu usage were at normal levels while the clients hung.
>>> > Back to my question, there must be something else causing the client
>>> > hang.
>>> >
>>>
>>> Check if there are hang requests in
>>> /sys/kernel/debug/ceph//{osdc,mdsc},
>>>
>>> > Zhenshi Zhou wrote on Wed, Aug 8, 2018 at 4:16 AM:
>>> >>
>>> >> Hi, I'm not sure if just mounting the cephfs, without using it or doing
>>> >> any operation within the mounted directory, would be affected by cache
>>> >> flushing. I mounted cephfs on osd servers only for testing and then left it
>>> >> there. Anyway I will umount it.
>>> >>
>>> >> Thanks
>>> >>
>>> >> John Spray wrote on Wed, Aug 8, 2018 at 03:37:
>>> >>>
>>> >>> On Tue, Aug 7, 2018 at 5:42 PM Reed Dier 
>>> wrote:
>>> >>> >
>>> >>> > This is the first I am hearing about this as well.
>>> >>>
>>> >>> This is not a Ceph-specific thing -- it can also affect similar
>>> >>> systems like Lustre.
>>> >>>
>>> >>> The classic case is when under some memory pressure, the kernel tries
>>> >>> to free memory by flushing the client's page cache, but doing the
>>> >>> flush means allocating more memory on the server, making the memory
>>> >>> pressure worse, until the whole thing just seizes up.
>>> >>>
>>> >>> John
>>> >>>
>>> >>> > Granted, I am using ceph-fuse rather than the kernel client at
>>> this point, but that isn’t etched in stone.
>>> >>> >
>>> >>> > Curious if there is more to share.
>>> >>> >
>>> >>> > Reed
>>> >>> >
>>> >>> > On Aug 7, 2018, at 9:47 AM, Webert de Souza Lima <
>>> webert.b...@gmail.com> wrote:
>>> >>> >
>>> >>> >
>>> >>> > Yan, Zheng wrote on Tue, Aug 7, 2018 at 7:51 PM:
>>> >>> >>
>>> >>> >> On Tue, Aug 7, 2018 at 7:15 PM Zhenshi Zhou 
>>> wrote:
>>> >>> >> this can cause memory deadlock. you should avoid doing this
>>> >>> >>
>>> >>> >> > Yan, Zheng wrote on Tue, Aug 7, 2018 at 19:12:
>>> >>> >> >>
>>> >>> >> >> did you mount cephfs on the same machines that run ceph-osd?
>>> >>> >> >>
>>> >>> >
>>> >>> >
>>> >>> > I didn't know about this. I run this setup in production. :P
>>> >>> >
>>> >>> > Regards,
>>> >>> >
>>> >>> > Webert Lima
>>> >>> > DevOps Engineer at MAV Tecnologia
>>> >>> > Belo Horizonte - Brasil
>>> >>> > IRC NICK - WebertRLZ
>>> >>> >
>>> >>> > ___
>>> >>> > ceph-users mailing list
>>> >>> > ceph-users@lists.ceph.com
>>> >>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >>> >
>>> >>> >
>>> >>> > ___
>>> >>> > ceph-users mailing list
>>> >>> > ceph-users@lists.ceph.com
>>> >>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >>> ___
>>> >>> ceph-users mailing list
>>> >>> ceph-users@lists.ceph.com
>>> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Whole cluster flapping

2018-08-08 Thread Webert de Souza Lima
So your OSDs are really too busy to respond to heartbeats.
You'll be facing this for some time, until the cluster load gets lower.

I would set `ceph osd set nodeep-scrub` until the heavy disk IO stops.
Maybe you can schedule it so deep-scrubs are enabled during the night and
disabled in the morning.
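
Something along these lines; the cron times are arbitrary examples, adjust
them to your quiet hours:

# stop new deep-scrubs while the cluster is too busy
ceph osd set nodeep-scrub

# allow them again once load drops
ceph osd unset nodeep-scrub

# /etc/cron.d/ceph-deep-scrub -- allow deep-scrubs only at night
0 22 * * * root ceph osd unset nodeep-scrub
0 6  * * * root ceph osd set nodeep-scrub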

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, Aug 8, 2018 at 9:18 AM CUZA Frédéric  wrote:

> Thanks for the command line. I did take a look at it, but I don't really
> know what to search for, my bad.
>
> All this flapping is due to deep-scrub: when it starts on an OSD, things
> start to go bad.
>
>
>
> I marked out all the OSDs that were flapping the most (1 by 1, after
> rebalancing) and it looks better, even though some osds keep going down/up
> with the same message in the logs:
>
>
>
> 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdabd897700' had
> timed out after 90
>
>
>
> (I updated it to 90s instead of 15s)
>
>
>
> Regards,
>
>
>
>
>
>
>
> *From:* ceph-users  *On behalf of*
> Webert de Souza Lima
> *Sent:* 07 August 2018 16:28
> *To:* ceph-users 
> *Subject:* Re: [ceph-users] Whole cluster flapping
>
>
>
> oops, my bad, you're right.
>
>
>
> I don't know how much you can see, but maybe you can dig around the
> performance counters and see what's happening on those OSDs. Try these:
>
>
>
> ~# ceph daemonperf osd.XX
>
> ~# ceph daemon osd.XX perf dump
>
>
>
> change XX to your OSD numbers.
>
>
>
> Regards,
>
>
>
> Webert Lima
>
> DevOps Engineer at MAV Tecnologia
>
> *Belo Horizonte - Brasil*
>
> *IRC NICK - WebertRLZ*
>
>
>
>
>
> On Tue, Aug 7, 2018 at 10:47 AM CUZA Frédéric 
> wrote:
>
> Pool is already deleted and no longer present in stats.
>
>
>
> Regards,
>
>
>
> *From:* ceph-users  *On behalf of*
> Webert de Souza Lima
> *Sent:* 07 August 2018 15:08
> *To:* ceph-users 
> *Subject:* Re: [ceph-users] Whole cluster flapping
>
>
>
> Frédéric,
>
>
>
> see if the number of objects is decreasing in the pool with `ceph df
> [detail]`
>
>
>
> Regards,
>
>
>
> Webert Lima
>
> DevOps Engineer at MAV Tecnologia
>
> *Belo Horizonte - Brasil*
>
> *IRC NICK - WebertRLZ*
>
>
>
>
>
> On Tue, Aug 7, 2018 at 5:46 AM CUZA Frédéric  wrote:
>
> It’s been over a week now and the whole cluster keeps flapping, it is
> never the same OSDs that go down.
>
> Is there a way to get the progress of this recovery? (The pool that I
> deleted is no longer present (for a while now))
>
> In fact, there is a lot of i/o activity on the server where osds go down.
>
>
>
> Regards,
>
>
>
> *From:* ceph-users  *On behalf of*
> Webert de Souza Lima
> *Sent:* 31 July 2018 16:25
> *To:* ceph-users 
> *Subject:* Re: [ceph-users] Whole cluster flapping
>
>
>
> The pool deletion might have triggered a lot of IO operations on the disks
> and the process might be too busy to respond to heartbeats, so the mons mark
> them as down due to no response.
>
> Check also the OSD logs to see if they are actually crashing and
> restarting, and disk IO usage (i.e. iostat).
>
>
>
> Regards,
>
>
>
> Webert Lima
>
> DevOps Engineer at MAV Tecnologia
>
> *Belo Horizonte - Brasil*
>
> *IRC NICK - WebertRLZ*
>
>
>
>
>
> On Tue, Jul 31, 2018 at 7:23 AM CUZA Frédéric 
> wrote:
>
> Hi Everyone,
>
>
>
> I just upgrade our cluster to Luminous 12.2.7 and I delete a quite large
> pool that we had (120 TB).
>
> Our cluster is made of 14 Nodes with each composed of 12 OSDs (1 HDD -> 1
> OSD), we have SDD for journal.
>
>
>
> After I deleted the large pool my cluster started to flapping on all OSDs.
>
> Osds are marked down and then marked up as follow :
>
>
>
> 2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97
> 172.29.228.72:6800/95783 boot
>
> 2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update:
> 5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs
> degraded, 317 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update:
> 81 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:42:55.610556 mon.ceph_monitor01 [INF] osd.96

Re: [ceph-users] cephfs kernel client hangs

2018-08-07 Thread Webert de Souza Lima
That's good to know, thanks for the explanation.
Fortunately we are in the process of cluster redesign and we can definitely
fix that scenario.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Tue, Aug 7, 2018 at 4:37 PM John Spray  wrote:

> On Tue, Aug 7, 2018 at 5:42 PM Reed Dier  wrote:
> >
> > This is the first I am hearing about this as well.
>
> This is not a Ceph-specific thing -- it can also affect similar
> systems like Lustre.
>
> The classic case is when under some memory pressure, the kernel tries
> to free memory by flushing the client's page cache, but doing the
> flush means allocating more memory on the server, making the memory
> pressure worse, until the whole thing just seizes up.
>
> John
>
> > Granted, I am using ceph-fuse rather than the kernel client at this
> point, but that isn’t etched in stone.
> >
> > Curious if there is more to share.
> >
> > Reed
> >
> > On Aug 7, 2018, at 9:47 AM, Webert de Souza Lima 
> wrote:
> >
> >
> > Yan, Zheng wrote on Tue, Aug 7, 2018 at 7:51 PM:
> >>
> >> On Tue, Aug 7, 2018 at 7:15 PM Zhenshi Zhou 
> wrote:
> >> this can cause memory deadlock. you should avoid doing this
> >>
> >> > Yan, Zheng wrote on Tue, Aug 7, 2018 at 19:12:
> >> >>
> >> >> did you mount cephfs on the same machines that run ceph-osd?
> >> >>
> >
> >
> > I didn't know about this. I run this setup in production. :P
> >
> > Regards,
> >
> > Webert Lima
> > DevOps Engineer at MAV Tecnologia
> > Belo Horizonte - Brasil
> > IRC NICK - WebertRLZ
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client hangs

2018-08-07 Thread Webert de Souza Lima
Yan, Zheng wrote on Tue, Aug 7, 2018 at 7:51 PM:

> On Tue, Aug 7, 2018 at 7:15 PM Zhenshi Zhou  wrote:
> this can cause memory deadlock. you should avoid doing this
>
> > Yan, Zheng wrote on Tue, Aug 7, 2018 at 19:12:
> >>
> >> did you mount cephfs on the same machines that run ceph-osd?
> >>


I didn't know about this. I run this setup in production. :P

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Whole cluster flapping

2018-08-07 Thread Webert de Souza Lima
oops, my bad, you're right.

I don't know how much you can see, but maybe you can dig around the
performance counters and see what's happening on those OSDs. Try these:

~# ceph daemonperf osd.XX
~# ceph daemon osd.XX perf dump

change XX to your OSD numbers.
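
For example, to pull a couple of latency counters out of the full dump (the
OSD id and the jq paths are just examples; counter names can differ between
releases):

# extract specific counters from the JSON dump (needs jq)
ceph daemon osd.12 perf dump | jq '.osd | {op_r_latency, op_w_latency}'

Each latency counter typically comes as an avgcount/sum pair, so sum divided
by avgcount gives the average latency so far.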

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Tue, Aug 7, 2018 at 10:47 AM CUZA Frédéric  wrote:

> Pool is already deleted and no longer present in stats.
>
>
>
> Regards,
>
>
>
> *From:* ceph-users  *On behalf of*
> Webert de Souza Lima
> *Sent:* 07 August 2018 15:08
> *To:* ceph-users 
> *Subject:* Re: [ceph-users] Whole cluster flapping
>
>
>
> Frédéric,
>
>
>
> see if the number of objects is decreasing in the pool with `ceph df
> [detail]`
>
>
>
> Regards,
>
>
>
> Webert Lima
>
> DevOps Engineer at MAV Tecnologia
>
> *Belo Horizonte - Brasil*
>
> *IRC NICK - WebertRLZ*
>
>
>
>
>
> On Tue, Aug 7, 2018 at 5:46 AM CUZA Frédéric  wrote:
>
> It’s been over a week now and the whole cluster keeps flapping, it is
> never the same OSDs that go down.
>
> Is there a way to get the progress of this recovery? (The pool that I
> deleted is no longer present (for a while now))
>
> In fact, there is a lot of i/o activity on the server where osds go down.
>
>
>
> Regards,
>
>
>
> *From:* ceph-users  *On behalf of*
> Webert de Souza Lima
> *Sent:* 31 July 2018 16:25
> *To:* ceph-users 
> *Subject:* Re: [ceph-users] Whole cluster flapping
>
>
>
> The pool deletion might have triggered a lot of IO operations on the disks
> and the process might be too busy to respond to heartbeats, so the mons mark
> them as down due to no response.
>
> Check also the OSD logs to see if they are actually crashing and
> restarting, and disk IO usage (i.e. iostat).
>
>
>
> Regards,
>
>
>
> Webert Lima
>
> DevOps Engineer at MAV Tecnologia
>
> *Belo Horizonte - Brasil*
>
> *IRC NICK - WebertRLZ*
>
>
>
>
>
> On Tue, Jul 31, 2018 at 7:23 AM CUZA Frédéric 
> wrote:
>
> Hi Everyone,
>
>
>
> I just upgrade our cluster to Luminous 12.2.7 and I delete a quite large
> pool that we had (120 TB).
>
> Our cluster is made of 14 Nodes with each composed of 12 OSDs (1 HDD -> 1
> OSD), we have SDD for journal.
>
>
>
> After I deleted the large pool my cluster started to flapping on all OSDs.
>
> Osds are marked down and then marked up as follow :
>
>
>
> 2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97
> 172.29.228.72:6800/95783 boot
>
> 2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update:
> 5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs
> degraded, 317 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update:
> 81 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:42:55.610556 mon.ceph_monitor01 [INF] osd.96
> 172.29.228.72:6803/95830 boot
>
> 2018-07-31 10:43:00.331787 mon.ceph_monitor01 [WRN] Health check update: 5
> osds down (OSD_DOWN)
>
> 2018-07-31 10:43:00.331930 mon.ceph_monitor01 [WRN] Health check update:
> 5782/5845401 objects misplaced (0.099%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:00.331950 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 167757/5845401 objects degraded (2.870%), 77 pgs
> degraded, 223 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:00.331966 mon.ceph_monitor01 [WRN] Health check update:
> 76 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:43:01.729891 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 7 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:43:01.753867 mon.ceph_monitor01 [INF] osd.4
> 172.29.228.246:6812/3144542 boot
>
> 2018-07-31 10:43:05.332624 mon.ceph_monitor01 [WRN] Health check update: 4
> osds down (OSD_DOWN)
>
> 2018-07-31 10:43:05.332691 mon.ceph_monitor01 [WRN] Health check update:
> 5767/5845569 objects misplaced (0.099%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:05.332718 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 130565/5845569 objects degraded (2.234%), 67 pgs
> degraded, 220 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:05.332736 mon.ceph_monitor01 [WRN] Health check up

Re: [ceph-users] Whole cluster flapping

2018-08-07 Thread Webert de Souza Lima
Frédéric,

see if the number of objects is decreasing in the pool with `ceph df
[detail]`
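
For example, something like this to watch it over time (the pool name is only
a placeholder):

# refresh every 30s and watch the OBJECTS column for the pool being removed
watch -n 30 "ceph df detail | grep -E 'OBJECTS|my_deleted_pool'"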

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Tue, Aug 7, 2018 at 5:46 AM CUZA Frédéric  wrote:

> It’s been over a week now and the whole cluster keeps flapping, it is
> never the same OSDs that go down.
>
> Is there a way to get the progress of this recovery? (The pool that I
> deleted is no longer present (for a while now))
>
> In fact, there is a lot of i/o activity on the server where osds go down.
>
>
>
> Regards,
>
>
>
> *From:* ceph-users  *On behalf of*
> Webert de Souza Lima
> *Sent:* 31 July 2018 16:25
> *To:* ceph-users 
> *Subject:* Re: [ceph-users] Whole cluster flapping
>
>
>
> The pool deletion might have triggered a lot of IO operations on the disks
> and the process might be too busy to respond to heartbeats, so the mons mark
> them as down due to no response.
>
> Check also the OSD logs to see if they are actually crashing and
> restarting, and disk IO usage (i.e. iostat).
>
>
>
> Regards,
>
>
>
> Webert Lima
>
> DevOps Engineer at MAV Tecnologia
>
> *Belo Horizonte - Brasil*
>
> *IRC NICK - WebertRLZ*
>
>
>
>
>
> On Tue, Jul 31, 2018 at 7:23 AM CUZA Frédéric 
> wrote:
>
> Hi Everyone,
>
>
>
> I just upgrade our cluster to Luminous 12.2.7 and I delete a quite large
> pool that we had (120 TB).
>
> Our cluster is made of 14 Nodes with each composed of 12 OSDs (1 HDD -> 1
> OSD), we have SDD for journal.
>
>
>
> After I deleted the large pool my cluster started to flapping on all OSDs.
>
> Osds are marked down and then marked up as follow :
>
>
>
> 2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97
> 172.29.228.72:6800/95783 boot
>
> 2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update:
> 5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs
> degraded, 317 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update:
> 81 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:42:55.610556 mon.ceph_monitor01 [INF] osd.96
> 172.29.228.72:6803/95830 boot
>
> 2018-07-31 10:43:00.331787 mon.ceph_monitor01 [WRN] Health check update: 5
> osds down (OSD_DOWN)
>
> 2018-07-31 10:43:00.331930 mon.ceph_monitor01 [WRN] Health check update:
> 5782/5845401 objects misplaced (0.099%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:00.331950 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 167757/5845401 objects degraded (2.870%), 77 pgs
> degraded, 223 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:00.331966 mon.ceph_monitor01 [WRN] Health check update:
> 76 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:43:01.729891 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 7 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:43:01.753867 mon.ceph_monitor01 [INF] osd.4
> 172.29.228.246:6812/3144542 boot
>
> 2018-07-31 10:43:05.332624 mon.ceph_monitor01 [WRN] Health check update: 4
> osds down (OSD_DOWN)
>
> 2018-07-31 10:43:05.332691 mon.ceph_monitor01 [WRN] Health check update:
> 5767/5845569 objects misplaced (0.099%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:05.332718 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 130565/5845569 objects degraded (2.234%), 67 pgs
> degraded, 220 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:05.332736 mon.ceph_monitor01 [WRN] Health check update:
> 83 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:43:07.004993 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 5 pgs inactive, 5 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:43:10.333548 mon.ceph_monitor01 [WRN] Health check update:
> 5752/5845758 objects misplaced (0.098%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:10.333593 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 107805/5845758 objects degraded (1.844%), 59 pgs
> degraded, 197 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:10.333608 mon.ceph_monitor01 [WRN] Health check update:
> 95 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:43:15.334451 mon.ceph_monitor01 [WRN] Health check update:
> 5738/5845923 objects mispla

Re: [ceph-users] Whole cluster flapping

2018-07-31 Thread Webert de Souza Lima
The pool deletion might have triggered a lot of IO operations on the disks,
and the OSD processes might be too busy to respond to heartbeats, so the mons
mark them as down due to no response.
Check also the OSD logs to see if they are actually crashing and
restarting, and disk IO usage (i.e. iostat).
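
A quick way to check both; the OSD id and log paths below are examples, so
adjust them to your deployment:

# per-device utilisation, refreshed every 2 seconds
iostat -x 2

# look for crashes and heartbeat complaints in one OSD's log
grep -E 'heartbeat_map|signal|terminate' /var/log/ceph/ceph-osd.97.log | tail -n 50

# or, on systemd-based deployments
journalctl -u ceph-osd@97 --since "1 hour ago" | tail -n 50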

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Tue, Jul 31, 2018 at 7:23 AM CUZA Frédéric  wrote:

> Hi Everyone,
>
>
>
> I just upgrade our cluster to Luminous 12.2.7 and I delete a quite large
> pool that we had (120 TB).
>
> Our cluster is made of 14 Nodes with each composed of 12 OSDs (1 HDD -> 1
> OSD), we have SDD for journal.
>
>
>
> After I deleted the large pool my cluster started to flapping on all OSDs.
>
> Osds are marked down and then marked up as follow :
>
>
>
> 2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97
> 172.29.228.72:6800/95783 boot
>
> 2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update:
> 5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs
> degraded, 317 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update:
> 81 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:42:55.610556 mon.ceph_monitor01 [INF] osd.96
> 172.29.228.72:6803/95830 boot
>
> 2018-07-31 10:43:00.331787 mon.ceph_monitor01 [WRN] Health check update: 5
> osds down (OSD_DOWN)
>
> 2018-07-31 10:43:00.331930 mon.ceph_monitor01 [WRN] Health check update:
> 5782/5845401 objects misplaced (0.099%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:00.331950 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 167757/5845401 objects degraded (2.870%), 77 pgs
> degraded, 223 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:00.331966 mon.ceph_monitor01 [WRN] Health check update:
> 76 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:43:01.729891 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 7 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:43:01.753867 mon.ceph_monitor01 [INF] osd.4
> 172.29.228.246:6812/3144542 boot
>
> 2018-07-31 10:43:05.332624 mon.ceph_monitor01 [WRN] Health check update: 4
> osds down (OSD_DOWN)
>
> 2018-07-31 10:43:05.332691 mon.ceph_monitor01 [WRN] Health check update:
> 5767/5845569 objects misplaced (0.099%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:05.332718 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 130565/5845569 objects degraded (2.234%), 67 pgs
> degraded, 220 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:05.332736 mon.ceph_monitor01 [WRN] Health check update:
> 83 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:43:07.004993 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 5 pgs inactive, 5 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:43:10.333548 mon.ceph_monitor01 [WRN] Health check update:
> 5752/5845758 objects misplaced (0.098%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:10.333593 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 107805/5845758 objects degraded (1.844%), 59 pgs
> degraded, 197 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:10.333608 mon.ceph_monitor01 [WRN] Health check update:
> 95 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:43:15.334451 mon.ceph_monitor01 [WRN] Health check update:
> 5738/5845923 objects misplaced (0.098%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:15.334494 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 107807/5845923 objects degraded (1.844%), 59 pgs
> degraded, 197 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:15.334510 mon.ceph_monitor01 [WRN] Health check update:
> 98 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:43:15.334865 mon.ceph_monitor01 [INF] osd.18 failed
> (root=default,room=,host=) (8 reporters from different host after
> 54.650576 >= grace 54.300663)
>
> 2018-07-31 10:43:15.336552 mon.ceph_monitor01 [WRN] Health check update: 5
> osds down (OSD_DOWN)
>
> 2018-07-31 10:43:17.357747 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 6 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:43:20.339495 mon.ceph_monitor01 [WRN] Health check update:
> 5724/5846073 objects misplaced (0.098%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:20.339543 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 122901/5846073 objects degraded (2.102%), 65 pgs
> degraded, 201 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:20.339559 mon.ceph_monitor01 [WRN] Health check update:
> 78 slow 

Re: [ceph-users] v10.2.11 Jewel released

2018-07-11 Thread Webert de Souza Lima
Cheers!

Thanks for all the backports and fixes.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, Jul 11, 2018 at 1:46 PM Abhishek Lekshmanan 
wrote:

>
> We're glad to announce v10.2.11 release of the Jewel stable release
> series. This point releases brings a number of important bugfixes and
> has a few important security fixes. This is most likely going to be the
> final Jewel release (shine on you crazy diamond). We thank everyone in
> the community for contributing towards this release and particularly
> want to thank Nathan and Yuri for their relentless efforts in
> backporting and testing this release.
>
> We recommend that all Jewel 10.2.x users upgrade.
>
> Notable Changes
> ---
>
> * CVE 2018-1128: auth: cephx authorizer subject to replay attack
> (issue#24836 http://tracker.ceph.com/issues/24836, Sage Weil)
>
> * CVE 2018-1129: auth: cephx signature check is weak (issue#24837
> http://tracker.ceph.com/issues/24837, Sage Weil)
>
> * CVE 2018-10861: mon: auth checks not correct for pool ops (issue#24838
> http://tracker.ceph.com/issues/24838, Jason Dillaman)
>
> * The RBD C API's rbd_discard method and the C++ API's Image::discard
> method
>   now enforce a maximum length of 2GB. This restriction prevents overflow
> of
>   the result code.
>
> * New OSDs will now use rocksdb for omap data by default, rather than
>   leveldb. omap is used by RGW bucket indexes and CephFS directories,
>   and when a single leveldb grows to 10s of GB with a high write or
>   delete workload, it can lead to high latency when leveldb's
>   single-threaded compaction cannot keep up. rocksdb supports multiple
>   threads for compaction, which avoids this problem.
>
> * The CephFS client now catches failures to clear dentries during startup
>   and refuses to start as consistency and untrimmable cache issues may
>   develop. The new option client_die_on_failed_dentry_invalidate (default:
>   true) may be turned off to allow the client to proceed (dangerous!).
>
> * In 10.2.10 and earlier releases, keyring caps were not checked for
> validity,
>   so the caps string could be anything. As of 10.2.11, caps strings are
>   validated and providing a keyring with an invalid caps string to, e.g.,
>   "ceph auth add" will result in an error.
>
> The changelog and the full release notes are at the release blog entry
> at https://ceph.com/releases/v10-2-11-jewel-released/
>
> Getting Ceph
> 
> * Git at git://github.com/ceph/ceph.git
> * Tarball at http://download.ceph.com/tarballs/ceph-10.2.11.tar.gz
> * For packages, see http://docs.ceph.com/docs/master/install/get-packages/
> * Release git sha1: e4b061b47f07f583c92a050d9e84b1813a35671e
>
>
> Best,
> Abhishek
>
> --
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
> HRB 21284 (AG Nürnberg)
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD for bluestore

2018-07-09 Thread Webert de Souza Lima
bluestore doesn't have a journal like filestore does, but there is the
WAL (write-ahead log), which looks like a journal but works differently.
You can (or must, depending on your needs) have SSDs to serve this WAL (and
RocksDB).
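
As a sketch of what that separation looks like with ceph-volume (device names
are placeholders; sizing and partitioning are up to you):

# data on the HDD, RocksDB and WAL on fast SSD/NVMe partitions
ceph-volume lvm create --bluestore \
    --data /dev/sdb \
    --block.db /dev/nvme0n1p1 \
    --block.wal /dev/nvme0n1p2

If you leave out --block.wal, the WAL simply lives together with the DB, which
is usually fine when both are on the same fast device.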

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Sun, Jul 8, 2018 at 11:58 AM Satish Patel  wrote:

> Folks,
>
> I'm just reading from multiple posts that bluestore doesn't need an SSD
> journal, is that true?
>
> I'm planning to build a 5 node cluster, so depending on that I'll purchase
> SSDs for the journal.
>
> If it does require an SSD for the journal, then what would be the best vendor
> and model that lasts long? Any recommendation?
>
> Sent from my iPhone
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Minimal MDS for CephFS on OSD hosts

2018-06-19 Thread Webert de Souza Lima
Keep in mind that the MDS server is CPU-bound, so during heavy workloads it
will eat up CPU, and the OSD daemons can affect or be affected by the
MDS daemon.
But it does work well. We've been running a few clusters with MON, MDS and
OSDs sharing the same hosts for a couple of years now.
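
If you do colocate them, it helps to cap the MDS cache explicitly in
ceph.conf; the 4 GiB value below is only an example (the Luminous default is
1 GiB):

[mds]
    # 4 GiB
    mds_cache_memory_limit = 4294967296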

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Tue, Jun 19, 2018 at 11:03 AM Paul Emmerich 
wrote:

> Just co-locate them with your OSDs. You can control how much RAM the
> MDSs use with the "mds cache memory limit" option. (default 1 GB)
> Note that the cache should be large enough to keep the active working
> set in the mds cache, but 1 million files is not really a lot.
> As a rule of thumb: ~1GB of MDS cache per ~100k files.
>
> 64GB of RAM for 12 OSDs and an MDS is enough in most cases.
>
> Paul
>
> 2018-06-19 15:34 GMT+02:00 Denny Fuchs :
>
>> Hi,
>>
>> Am 19.06.2018 15:14, schrieb Stefan Kooman:
>>
>> Storage doesn't matter for MDS, as they won't use it to store ceph data
>>> (but instead use the (meta)data pool to store meta data).
>>> I would not colocate the MDS daemons with the OSDS, but instead create a
>>> couple of VMs (active / standby) and give them as much RAM as you
>>> possibly can.
>>>
>>
>> thanks a lot. I think we would start with roughly 8GB and see what
>> happens.
>>
>> cu denny
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs: bind data pool via file layout

2018-06-13 Thread Webert de Souza Lima
Got it Gregory, sounds good enough for us.

Thank you all for the help provided.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, Jun 13, 2018 at 2:20 PM Gregory Farnum  wrote:

> Nah, I would use one Filesystem unless you can’t. The backtrace does
> create another object but IIRC it’s a maximum one IO per create/rename (on
> the file).
> On Wed, Jun 13, 2018 at 1:12 PM Webert de Souza Lima <
> webert.b...@gmail.com> wrote:
>
>> Thanks for clarifying that, Gregory.
>>
>> As said before, we use the file layout to resolve the difference of
>> workloads in those 2 different directories in cephfs.
>> Would you recommend using 2 filesystems instead? By doing so, each fs
>> would have its default data pool accordingly.
>>
>>
>> Regards,
>>
>> Webert Lima
>> DevOps Engineer at MAV Tecnologia
>> *Belo Horizonte - Brasil*
>> *IRC NICK - WebertRLZ*
>>
>>
>> On Wed, Jun 13, 2018 at 11:33 AM Gregory Farnum 
>> wrote:
>>
>>> The backtrace object Zheng referred to is used only for resolving hard
>>> links or in disaster recovery scenarios. If the default data pool isn’t
>>> available you would stack up pending RADOS writes inside of your mds but
>>> the rest of the system would continue unless you manage to run the mds out
>>> of memory.
>>> -Greg
>>> On Wed, Jun 13, 2018 at 9:25 AM Webert de Souza Lima <
>>> webert.b...@gmail.com> wrote:
>>>
>>>> Thank you Zheng.
>>>>
>>>> Does that mean that, when using such feature, our data integrity relies
>>>> now on both data pools'  integrity/availability?
>>>>
>>>> We currently use such feature in production for dovecot's index files,
>>>> so we could store this directory on a pool of SSDs only. The main data pool
>>>> is made of HDDs and stores the email files themselves.
>>>>
>>>> There ain't too many files created, it's just a few files per email
>>>> user, and basically one directory per user's mailbox.
>>>> Each mailbox has a index file that is updated upon every new email
>>>> received or moved, deleted, read, etc.
>>>>
>>>> I think in this scenario the overhead may be acceptable for us.
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Webert Lima
>>>> DevOps Engineer at MAV Tecnologia
>>>> *Belo Horizonte - Brasil*
>>>> *IRC NICK - WebertRLZ*
>>>>
>>>>
>>>> On Wed, Jun 13, 2018 at 9:51 AM Yan, Zheng  wrote:
>>>>
>>>>> On Wed, Jun 13, 2018 at 3:34 AM Webert de Souza Lima
>>>>>  wrote:
>>>>> >
>>>>> > hello,
>>>>> >
>>>>> > is there any performance impact on cephfs for using file layouts to
>>>>> bind a specific directory in cephfs to a given pool? Of course, such pool
>>>>> is not the default data pool for this cephfs.
>>>>> >
>>>>>
>>>>> For each file, no matter which pool file data are stored,  mds alway
>>>>> create an object in the default data pool. The object in default data
>>>>> pool is used for storing backtrace. So files stored in non-default
>>>>> pool have extra overhead on file creation. For large file, the
>>>>> overhead can be neglect. But for lots of small files, the overhead may
>>>>> affect performance.
>>>>>
>>>>>
>>>>> > Regards,
>>>>> >
>>>>> > Webert Lima
>>>>> > DevOps Engineer at MAV Tecnologia
>>>>> > Belo Horizonte - Brasil
>>>>> > IRC NICK - WebertRLZ
>>>>> > ___
>>>>> > ceph-users mailing list
>>>>> > ceph-users@lists.ceph.com
>>>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs: bind data pool via file layout

2018-06-13 Thread Webert de Souza Lima
Thanks for clarifying that, Gregory.

As said before, we use the file layout to handle the different workloads of
those 2 directories in cephfs.
Would you recommend using 2 filesystems instead? By doing so, each fs would
have its default data pool accordingly.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, Jun 13, 2018 at 11:33 AM Gregory Farnum  wrote:

> The backtrace object Zheng referred to is used only for resolving hard
> links or in disaster recovery scenarios. If the default data pool isn’t
> available you would stack up pending RADOS writes inside of your mds but
> the rest of the system would continue unless you manage to run the mds out
> of memory.
> -Greg
> On Wed, Jun 13, 2018 at 9:25 AM Webert de Souza Lima <
> webert.b...@gmail.com> wrote:
>
>> Thank you Zheng.
>>
>> Does that mean that, when using such feature, our data integrity relies
>> now on both data pools'  integrity/availability?
>>
>> We currently use such feature in production for dovecot's index files, so
>> we could store this directory on a pool of SSDs only. The main data pool is
>> made of HDDs and stores the email files themselves.
>>
>> There ain't too many files created, it's just a few files per email user,
>> and basically one directory per user's mailbox.
>> Each mailbox has a index file that is updated upon every new email
>> received or moved, deleted, read, etc.
>>
>> I think in this scenario the overhead may be acceptable for us.
>>
>>
>> Regards,
>>
>> Webert Lima
>> DevOps Engineer at MAV Tecnologia
>> *Belo Horizonte - Brasil*
>> *IRC NICK - WebertRLZ*
>>
>>
>> On Wed, Jun 13, 2018 at 9:51 AM Yan, Zheng  wrote:
>>
>>> On Wed, Jun 13, 2018 at 3:34 AM Webert de Souza Lima
>>>  wrote:
>>> >
>>> > hello,
>>> >
>>> > is there any performance impact on cephfs for using file layouts to
>>> bind a specific directory in cephfs to a given pool? Of course, such pool
>>> is not the default data pool for this cephfs.
>>> >
>>>
>>> For each file, no matter which pool file data are stored,  mds alway
>>> create an object in the default data pool. The object in default data
>>> pool is used for storing backtrace. So files stored in non-default
>>> pool have extra overhead on file creation. For large file, the
>>> overhead can be neglect. But for lots of small files, the overhead may
>>> affect performance.
>>>
>>>
>>> > Regards,
>>> >
>>> > Webert Lima
>>> > DevOps Engineer at MAV Tecnologia
>>> > Belo Horizonte - Brasil
>>> > IRC NICK - WebertRLZ
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs: bind data pool via file layout

2018-06-13 Thread Webert de Souza Lima
Thank you Zheng.

Does that mean that, when using such a feature, our data integrity now relies
on both data pools' integrity/availability?

We currently use such a feature in production for dovecot's index files, so
we could store this directory on a pool of SSDs only. The main data pool is
made of HDDs and stores the email files themselves.

There aren't too many files created; it's just a few files per email user,
and basically one directory per user's mailbox.
Each mailbox has an index file that is updated upon every new email received
or moved, deleted, read, etc.

I think in this scenario the overhead may be acceptable for us.
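
For reference, this is roughly how such a binding is set up via the layout
vxattr; the pool and mount path names below are examples, not our actual ones:

# the extra pool must first be attached to the filesystem
ceph fs add_data_pool cephfs cephfs-index-ssd

# new files created under this directory then go to the SSD pool
setfattr -n ceph.dir.layout.pool -v cephfs-index-ssd /mnt/cephfs/index

# verify the layout
getfattr -n ceph.dir.layout /mnt/cephfs/index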

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, Jun 13, 2018 at 9:51 AM Yan, Zheng  wrote:

> On Wed, Jun 13, 2018 at 3:34 AM Webert de Souza Lima
>  wrote:
> >
> > hello,
> >
> > is there any performance impact on cephfs for using file layouts to bind
> a specific directory in cephfs to a given pool? Of course, such pool is not
> the default data pool for this cephfs.
> >
>
> For each file, no matter which pool file data are stored,  mds alway
> create an object in the default data pool. The object in default data
> pool is used for storing backtrace. So files stored in non-default
> pool have extra overhead on file creation. For large file, the
> overhead can be neglect. But for lots of small files, the overhead may
> affect performance.
>
>
> > Regards,
> >
> > Webert Lima
> > DevOps Engineer at MAV Tecnologia
> > Belo Horizonte - Brasil
> > IRC NICK - WebertRLZ
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs: bind data pool via file layout

2018-06-12 Thread Webert de Souza Lima
hello,

is there any performance impact on cephfs when using file layouts to bind a
specific directory in cephfs to a given pool? Of course, such a pool is not
the default data pool for this cephfs.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] (yet another) multi active mds advise needed

2018-05-19 Thread Webert de Souza Lima
Hi Daniel,

Thanks for clarifying.
I'll have a look at the dirfrag option.

Regards,
Webert Lima

On Sat, May 19, 2018 at 01:18, Daniel Baumann <daniel.baum...@bfh.ch>
wrote:

> On 05/19/2018 01:13 AM, Webert de Souza Lima wrote:
> > New question: will it make any difference in the balancing if instead of
> > having the MAIL directory in the root of cephfs and the domains'
> > subtrees inside it, I discard the parent dir and put all the subtrees
> > right in cephfs root?
>
> the balancing between the MDS is influenced by which directories are
> accessed, the currently accessed directory-trees are diveded between the
> MDS's (also check the dirfrag option in the docs). assuming you have the
> same access pattern, the "fragmentation" between the MDS's happens at
> these "target-directories", so it doesn't matter if these directories
> are further up or down in the same filesystem tree.
>
> in the multi-MDS scenario where the MDS serving rank 0 fails, the
> effects in the moment of the failure for any cephfs client accessing a
> directory/file are the same (as described in an earlier mail),
> regardless on which level the directory/file is within the filesystem.
>
> Regards,
> Daniel
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] (yet another) multi active mds advise needed

2018-05-18 Thread Webert de Souza Lima
Hi Patrick

On Fri, May 18, 2018 at 6:20 PM Patrick Donnelly 
wrote:

> Each MDS may have multiple subtrees they are authoritative for. Each
> MDS may also replicate metadata from another MDS as a form of load
> balancing.


Ok, it's good to know that it actually does some load balancing. Thanks.
New question: will it make any difference in the balancing if, instead of
having the MAIL directory in the root of cephfs and the domains' subtrees
inside it,
I discard the parent dir and put all the subtrees right in the cephfs root?


> standby-replay daemons are not available to take over for ranks other
> than the one it follows. So, you would want to have a standby-replay
> daemon for each rank or just have normal standbys. It will likely
> depend on the size of your MDS (cache size) and available hardware.
>
> It's best if you see if the normal balancer (especially in v12.2.6
> [1]) can handle the load for you without trying to micromanage things
> via pins. You can use pinning to isolate metadata load from other
> ranks as a stop-gap measure.
>

Ok, I will start with the simplest way. This can be changed after deployment
if it turns out to be needed.

On Fri, May 18, 2018 at 6:38 PM Daniel Baumann 
wrote:

> jftr, having 3 active mds and 3 standby-replay resulted in a longer downtime
> for us in May 2017, due to http://tracker.ceph.com/issues/21749
>
> we're not using standby-replay MDS's anymore but only "normal" standby,
> and didn't have had any problems anymore (running kraken then, upgraded
> to luminous last fall).
>

Thank you very much for your feedback Daniel. I'll go for the regular
standby daemons, then.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] (yet another) multi active mds advise needed

2018-05-18 Thread Webert de Souza Lima
Hi,

We're migrating from a Jewel / filestore based cephfs architecture to a
Luminous / bluestore based one.

One MUST HAVE is multiple active MDS daemons. I'm still lacking knowledge
of how it actually works.
After reading the docs and ML we learned that they work by sort of dividing
the responsibilities, each being authoritative for its own directory subtree
(please correct me if I'm wrong).

Question 1: I'd like to know if it is viable to have 4 MDS daemons, being 3
Active and 1 Standby (or Standby-Replay if that's still possible with
multi-mds).

Basically, what we have is 2 subtrees used by dovecot: INDEX and MAIL.
Their trees are almost identical, but INDEX stores all dovecot metadata, with
heavy IO going on, while MAIL stores the actual email files, with many more
writes than reads.

I don't yet know which one would bottleneck the MDS servers the most, so I
wonder if I can collect metrics on MDS usage per pool once it's deployed.
Question 2: if the metadata workloads turn out to be very different, I wonder
if I can isolate them, e.g. by pinning MDS servers X and Y to one of the
directories.

Cache Tier is deprecated, so:
Question 3: what can act as a read cache mechanism in Luminous with bluestore,
mainly to keep newly created files hot (emails that have just arrived and will
probably be fetched by the user a few seconds later via IMAP/POP3)?
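(As a side note, I understand bluestore keeps its own read cache inside each
OSD process, sized by settings along these lines; the values below are just
the defaults as I recall them, not a recommendation:

  [osd]
  bluestore_cache_size_ssd = 3221225472   # ~3 GiB per SSD OSD
  bluestore_cache_size_hdd = 1073741824   # ~1 GiB per HDD OSD

but that is per-OSD caching, not the tiering behaviour we had before.)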

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi-MDS Failover

2018-05-18 Thread Webert de Souza Lima
Hello,


On Mon, Apr 30, 2018 at 7:16 AM Daniel Baumann 
wrote:

> additionally: if rank 0 is lost, the whole FS stands still (no new
> client can mount the fs; no existing client can change a directory, etc.).
>
> my guess is that the root of a cephfs (/; which is always served by rank
> 0) is needed in order to do traversals/lookups of any directories on the
> top-level (which then can be served by ranks 1-n).
>

Could someone confirm if this is actually how it works? Thanks.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dovecot + cephfs - sdbox vs mdbox

2018-05-16 Thread Webert de Souza Lima
Thanks Jack.

That's good to know. It is definitely something to consider.
In a distributed storage scenario we might build a dedicated pool for that
and tune the pool as more capacity or performance is needed.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, May 16, 2018 at 4:45 PM Jack <c...@jack.fr.eu.org> wrote:

> On 05/16/2018 09:35 PM, Webert de Souza Lima wrote:
> > We'll soon do benchmarks of sdbox vs mdbox over cephfs with bluestore
> > backend.
> > We'll have to do some some work on how to simulate user traffic, for
> writes
> > and readings. That seems troublesome.
> I would appreciate seeing these results !
>
> > Thanks for the plugins recommendations. I'll take the change and ask you
> > how is the SIS status? We have used it in the past and we've had some
> > problems with it.
>
> I am using it since Dec 2016 with mdbox, with no issue at all (I am
> currently using Dovecot 2.2.27-3 from Debian Stretch)
> The only config I use is mail_attachment_dir, the rest lies as default
> (mail_attachment_min_size = 128k, mail_attachment_fs = sis posix,
> ail_attachment_hash = %{sha1})
> The backend storage is a local filesystem, and there is only one Dovecot
> instance
>
> >
> > Regards,
> >
> > Webert Lima
> > DevOps Engineer at MAV Tecnologia
> > *Belo Horizonte - Brasil*
> > *IRC NICK - WebertRLZ*
> >
> >
> > On Wed, May 16, 2018 at 4:19 PM Jack <c...@jack.fr.eu.org> wrote:
> >
> >> Hi,
> >>
> >> Many (most ?) filesystems does not store multiple files on the same
> block
> >>
> >> Thus, with sdbox, every single mail (you know, that kind of mail with 10
> >> lines in it) will eat an inode, and a block (4k here)
> >> mdbox is more compact on this way
> >>
> >> Another difference: sdbox removes the message, mdbox does not : a single
> >> metadata update is performed, which may be packed with others if many
> >> files are deleted at once
> >>
> >> That said, I do not have experience with dovecot + cephfs, nor have made
> >> tests for sdbox vs mdbox
> >>
> >> However, and this is a bit out of topic, I recommend you look at the
> >> following dovecot's features (if not already done), as they are awesome
> >> and will help you a lot:
> >> - Compression (classic, https://wiki.dovecot.org/Plugins/Zlib)
> >> - Single-Instance-Storage (aka sis, aka "attachment deduplication" :
> >> https://www.dovecot.org/list/dovecot/2013-December/094276.html)
> >>
> >> Regards,
> >> On 05/16/2018 08:37 PM, Webert de Souza Lima wrote:
> >>> I'm sending this message to both dovecot and ceph-users ML so please
> >> don't
> >>> mind if something seems too obvious for you.
> >>>
> >>> Hi,
> >>>
> >>> I have a question for both dovecot and ceph lists and below I'll
> explain
> >>> what's going on.
> >>>
> >>> Regarding dbox format (https://wiki2.dovecot.org/MailboxFormat/dbox),
> >> when
> >>> using sdbox, a new file is stored for each email message.
> >>> When using mdbox, multiple messages are appended to a single file until
> >> it
> >>> reaches/passes the rotate limit.
> >>>
> >>> I would like to understand better how the mdbox format impacts on IO
> >>> performance.
> >>> I think it's generally expected that fewer larger file translate to
> less
> >> IO
> >>> and more troughput when compared to more small files, but how does
> >> dovecot
> >>> handle that with mdbox?
> >>> If dovecot does flush data to storage upon each and every new email is
> >>> arrived and appended to the corresponding file, would that mean that it
> >>> generate the same ammount of IO as it would do with one file per
> message?
> >>> Also, if using mdbox many messages will be appended to a said file
> >> before a
> >>> new file is created. That should mean that a file descriptor is kept
> open
> >>> for sometime by dovecot process.
> >>> Using cephfs as backend, how would this impact cluster performance
> >>> regarding MDS caps and inodes cached when files from thousands of users
> >> are
> >>> opened and appended all over?
> >>>
> >>> I would like to understand this better.
> >>>
> >>> Why?
> >>> We are a small Business Email Hosting provider with bare

Re: [ceph-users] dovecot + cephfs - sdbox vs mdbox

2018-05-16 Thread Webert de Souza Lima
Hello Danny,

I actually saw that thread and I was very excited about it. I thank you all
for that idea and all the effort being put in it.
I haven't yet tried to play around with your plugin but I intend to, and to
contribute back. I think when it's ready for production it will be
unbeatable.

I have watched your talk at Cephalocon (on YouTube). I'll see your slides,
maybe they'll give me more insights on our infrastructure architecture.

As you can see, our business is still taking baby steps compared to Deutsche
Telekom's, but we have faced infrastructure challenges every day since the
beginning.
For now, I think we could still fit with cephfs, but we definitely need some
improvements.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, May 16, 2018 at 4:42 PM Danny Al-Gaaf <danny.al-g...@bisect.de>
wrote:

> Hi,
>
> some time back we had similar discussions when we, as an email provider,
> discussed to move away from traditional NAS/NFS storage to Ceph.
>
> The problem with POSIX file systems and dovecot is that e.g. with mdbox
> only around ~20% of the IO operations are READ/WRITE, the rest are
> metadata IOs. You will not change this with using CephFS since it will
> basically behave the same way as e.g. NFS.
>
> We decided to develop librmb to store emails as objects directly in
> RADOS instead of CephFS. The project is still under development, so you
> should not use it in production, but you can try it to run a POC.
>
> For more information check out my slides from Ceph Day London 2018:
> https://dalgaaf.github.io/cephday-london2018-emailstorage/#/cover-page
>
> The project can be found on github:
> https://github.com/ceph-dovecot/
>
> -Danny
>
> Am 16.05.2018 um 20:37 schrieb Webert de Souza Lima:
> > I'm sending this message to both dovecot and ceph-users ML so please
> don't
> > mind if something seems too obvious for you.
> >
> > Hi,
> >
> > I have a question for both dovecot and ceph lists and below I'll explain
> > what's going on.
> >
> > Regarding dbox format (https://wiki2.dovecot.org/MailboxFormat/dbox),
> when
> > using sdbox, a new file is stored for each email message.
> > When using mdbox, multiple messages are appended to a single file until
> it
> > reaches/passes the rotate limit.
> >
> > I would like to understand better how the mdbox format impacts on IO
> > performance.
> > I think it's generally expected that fewer larger file translate to less
> IO
> > and more troughput when compared to more small files, but how does
> dovecot
> > handle that with mdbox?
> > If dovecot does flush data to storage upon each and every new email is
> > arrived and appended to the corresponding file, would that mean that it
> > generate the same ammount of IO as it would do with one file per message?
> > Also, if using mdbox many messages will be appended to a said file
> before a
> > new file is created. That should mean that a file descriptor is kept open
> > for sometime by dovecot process.
> > Using cephfs as backend, how would this impact cluster performance
> > regarding MDS caps and inodes cached when files from thousands of users
> are
> > opened and appended all over?
> >
> > I would like to understand this better.
> >
> > Why?
> > We are a small Business Email Hosting provider with bare metal, self
> hosted
> > systems, using dovecot for servicing mailboxes and cephfs for email
> storage.
> >
> > We are currently working on dovecot and storage redesign to be in
> > production ASAP. The main objective is to serve more users with better
> > performance, high availability and scalability.
> > * high availability and load balancing is extremely important to us *
> >
> > On our current model, we're using mdbox format with dovecot, having
> > dovecot's INDEXes stored in a replicated pool of SSDs, and messages
> stored
> > in a replicated pool of HDDs (under a Cache Tier with a pool of SSDs).
> > All using cephfs / filestore backend.
> >
> > Currently there are 3 clusters running dovecot 2.2.34 and ceph Jewel
> > (10.2.9-4).
> >  - ~25K users from a few thousands of domains per cluster
> >  - ~25TB of email data per cluster
> >  - ~70GB of dovecot INDEX [meta]data per cluster
> >  - ~100MB of cephfs metadata per cluster
> >
> > Our goal is to build a single ceph cluster for storage that could expand
> in
> > capacity, be highly available and perform well enough. I know, that's
> what
> > everyone wants.
> >
> > Cephfs is an important choise because:
> >  - there can be multiple mountpoints, thus 

Re: [ceph-users] dovecot + cephfs - sdbox vs mdbox

2018-05-16 Thread Webert de Souza Lima
Hello Jack,

yes, I imagine I'll have to do some work on tuning the block size on cephfs.
Thanks for the advice.
I knew that when using mdbox, messages are not removed, but I thought that was
true for sdbox too. Thanks again.

We'll soon do benchmarks of sdbox vs mdbox over cephfs with a bluestore
backend.
We'll have to do some work on how to simulate user traffic, for writes and
reads. That seems troublesome.

Thanks for the plugin recommendations. I'll take the chance and ask you: how
is the SIS status? We used it in the past and had some problems with it.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, May 16, 2018 at 4:19 PM Jack <c...@jack.fr.eu.org> wrote:

> Hi,
>
> Many (most ?) filesystems does not store multiple files on the same block
>
> Thus, with sdbox, every single mail (you know, that kind of mail with 10
> lines in it) will eat an inode, and a block (4k here)
> mdbox is more compact on this way
>
> Another difference: sdbox removes the message, mdbox does not : a single
> metadata update is performed, which may be packed with others if many
> files are deleted at once
>
> That said, I do not have experience with dovecot + cephfs, nor have made
> tests for sdbox vs mdbox
>
> However, and this is a bit out of topic, I recommend you look at the
> following dovecot's features (if not already done), as they are awesome
> and will help you a lot:
> - Compression (classic, https://wiki.dovecot.org/Plugins/Zlib)
> - Single-Instance-Storage (aka sis, aka "attachment deduplication" :
> https://www.dovecot.org/list/dovecot/2013-December/094276.html)
>
> Regards,
> On 05/16/2018 08:37 PM, Webert de Souza Lima wrote:
> > I'm sending this message to both dovecot and ceph-users ML so please
> don't
> > mind if something seems too obvious for you.
> >
> > Hi,
> >
> > I have a question for both dovecot and ceph lists and below I'll explain
> > what's going on.
> >
> > Regarding dbox format (https://wiki2.dovecot.org/MailboxFormat/dbox),
> when
> > using sdbox, a new file is stored for each email message.
> > When using mdbox, multiple messages are appended to a single file until
> it
> > reaches/passes the rotate limit.
> >
> > I would like to understand better how the mdbox format impacts on IO
> > performance.
> > I think it's generally expected that fewer larger file translate to less
> IO
> > and more troughput when compared to more small files, but how does
> dovecot
> > handle that with mdbox?
> > If dovecot does flush data to storage upon each and every new email is
> > arrived and appended to the corresponding file, would that mean that it
> > generate the same ammount of IO as it would do with one file per message?
> > Also, if using mdbox many messages will be appended to a said file
> before a
> > new file is created. That should mean that a file descriptor is kept open
> > for sometime by dovecot process.
> > Using cephfs as backend, how would this impact cluster performance
> > regarding MDS caps and inodes cached when files from thousands of users
> are
> > opened and appended all over?
> >
> > I would like to understand this better.
> >
> > Why?
> > We are a small Business Email Hosting provider with bare metal, self
> hosted
> > systems, using dovecot for servicing mailboxes and cephfs for email
> storage.
> >
> > We are currently working on dovecot and storage redesign to be in
> > production ASAP. The main objective is to serve more users with better
> > performance, high availability and scalability.
> > * high availability and load balancing is extremely important to us *
> >
> > On our current model, we're using mdbox format with dovecot, having
> > dovecot's INDEXes stored in a replicated pool of SSDs, and messages
> stored
> > in a replicated pool of HDDs (under a Cache Tier with a pool of SSDs).
> > All using cephfs / filestore backend.
> >
> > Currently there are 3 clusters running dovecot 2.2.34 and ceph Jewel
> > (10.2.9-4).
> >  - ~25K users from a few thousands of domains per cluster
> >  - ~25TB of email data per cluster
> >  - ~70GB of dovecot INDEX [meta]data per cluster
> >  - ~100MB of cephfs metadata per cluster
> >
> > Our goal is to build a single ceph cluster for storage that could expand
> in
> > capacity, be highly available and perform well enough. I know, that's
> what
> > everyone wants.
> >
> > Cephfs is an important choise because:
> >  - there can be multiple mountpoints, thus multiple dovecot instances on
> > di

[ceph-users] dovecot + cephfs - sdbox vs mdbox

2018-05-16 Thread Webert de Souza Lima
I'm sending this message to both dovecot and ceph-users ML so please don't
mind if something seems too obvious for you.

Hi,

I have a question for both dovecot and ceph lists and below I'll explain
what's going on.

Regarding dbox format (https://wiki2.dovecot.org/MailboxFormat/dbox), when
using sdbox, a new file is stored for each email message.
When using mdbox, multiple messages are appended to a single file until it
reaches/passes the rotate limit.
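(For those not familiar with mdbox, the rotation is driven by dovecot settings
along these lines; the values are just illustrative:

  mail_location = mdbox:~/mdbox
  mdbox_rotate_size = 2M        # start a new m.* storage file past this size
  mdbox_rotate_interval = 0     # no time-based rotation
)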

I would like to better understand how the mdbox format impacts IO
performance.
I think it's generally expected that fewer, larger files translate to less IO
and more throughput when compared to many small files, but how does dovecot
handle that with mdbox?
If dovecot flushes data to storage every time a new email arrives and is
appended to the corresponding file, would that mean it generates the same
amount of IO as it would with one file per message?
Also, when using mdbox, many messages will be appended to a given file before
a new file is created. That should mean that a file descriptor is kept open
for some time by the dovecot process.
Using cephfs as backend, how would this impact cluster performance
regarding MDS caps and inodes cached when files from thousands of users are
opened and appended all over?

I would like to understand this better.

Why?
We are a small Business Email Hosting provider with bare-metal, self-hosted
systems, using dovecot to serve mailboxes and cephfs for email storage.

We are currently working on dovecot and storage redesign to be in
production ASAP. The main objective is to serve more users with better
performance, high availability and scalability.
* high availability and load balancing is extremely important to us *

On our current model, we're using mdbox format with dovecot, having
dovecot's INDEXes stored in a replicated pool of SSDs, and messages stored
in a replicated pool of HDDs (under a Cache Tier with a pool of SSDs).
All using cephfs / filestore backend.
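(The per-directory pool split is done with cephfs file layouts, something like
the following, where the pool name and path are placeholders:

  setfattr -n ceph.dir.layout.pool -v cephfs_data_ssd /mnt/cephfs/index
)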

Currently there are 3 clusters running dovecot 2.2.34 and ceph Jewel
(10.2.9-4).
 - ~25K users from a few thousands of domains per cluster
 - ~25TB of email data per cluster
 - ~70GB of dovecot INDEX [meta]data per cluster
 - ~100MB of cephfs metadata per cluster

Our goal is to build a single ceph cluster for storage that could expand in
capacity, be highly available and perform well enough. I know, that's what
everyone wants.

Cephfs is an important choice because:
 - there can be multiple mountpoints, thus multiple dovecot instances on
different hosts
 - the same storage backend is used for all dovecot instances
 - no need of sharding domains
 - dovecot is easily load balanced (with director sticking users to the
same dovecot backend)

In the upcoming upgrade we intend to:
 - upgrade ceph to 12.X (Luminous)
 - drop the SSD Cache Tier (because it's deprecated)
 - use bluestore engine

I was told on freenode/#dovecot that there are many cases where SDBOX would
perform better with NFS sharing.
In the case of cephfs, at first I wouldn't think that would be true, because
more files == more generated IO, but considering what I said at the beginning
regarding sdbox vs mdbox, that could be wrong.

Any thoughts will be highly appreciated.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Node crash, filesytem not usable

2018-05-15 Thread Webert de Souza Lima
I'm sorry, I wouldn't know, as I'm on Jewel.
Is your cluster HEALTH_OK now?

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Sun, May 13, 2018 at 6:29 AM Marc Roos <m.r...@f1-outsourcing.eu> wrote:

>
> In luminous
> osd_recovery_threads = osd_disk_threads ?
> osd_recovery_sleep = osd_recovery_sleep_hdd ?
>
> Or is this speeding up recovery, a lot different in luminous?
>
> [@~]# ceph daemon osd.0 config show | grep osd | grep thread
> "osd_command_thread_suicide_timeout": "900",
> "osd_command_thread_timeout": "600",
> "osd_disk_thread_ioprio_class": "",
> "osd_disk_thread_ioprio_priority": "-1",
> "osd_disk_threads": "1",
> "osd_op_num_threads_per_shard": "0",
> "osd_op_num_threads_per_shard_hdd": "1",
> "osd_op_num_threads_per_shard_ssd": "2",
> "osd_op_thread_suicide_timeout": "150",
> "osd_op_thread_timeout": "15",
>     "osd_peering_wq_threads": "2",
> "osd_recovery_thread_suicide_timeout": "300",
> "osd_recovery_thread_timeout": "30",
> "osd_remove_thread_suicide_timeout": "36000",
> "osd_remove_thread_timeout": "3600",
>
> -Original Message-
> From: Webert de Souza Lima [mailto:webert.b...@gmail.com]
> Sent: vrijdag 11 mei 2018 20:34
> To: ceph-users
> Subject: Re: [ceph-users] Node crash, filesytem not usable
>
> This message seems to be very concerning:
>  >mds0: Metadata damage detected
>
>
> but for the rest, the cluster seems still to be recovering. you could
> try to seep thing up with ceph tell, like:
>
> ceph tell osd.* injectargs --osd_max_backfills=10
>
> ceph tell osd.* injectargs --osd_recovery_sleep=0.0
>
> ceph tell osd.* injectargs --osd_recovery_threads=2
>
>
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> Belo Horizonte - Brasil
> IRC NICK - WebertRLZ
>
>
> On Fri, May 11, 2018 at 3:06 PM Daniel Davidson
> <dani...@igb.illinois.edu> wrote:
>
>
> Below id the information you were asking for.  I think they are
> size=2, min size=1.
>
> Dan
>
> # ceph status
> cluster 7bffce86-9d7b-4bdf-a9c9-67670e68ca77
>
>
>
>
>  health HEALTH_ERR
>
>
>
>
> 140 pgs are stuck inactive for more than 300 seconds
> 64 pgs backfill_wait
> 76 pgs backfilling
> 140 pgs degraded
> 140 pgs stuck degraded
> 140 pgs stuck inactive
> 140 pgs stuck unclean
> 140 pgs stuck undersized
> 140 pgs undersized
> 210 requests are blocked > 32 sec
> recovery 38725029/695508092 objects degraded (5.568%)
> recovery 10844554/695508092 objects misplaced (1.559%)
> mds0: Metadata damage detected
> mds0: Behind on trimming (71/30)
> noscrub,nodeep-scrub flag(s) set
>  monmap e3: 4 mons at
> {ceph-0=172.16.31.1:6789/0,ceph-1=172.16.31.2:6789/0,ceph-2=172.16.31.3:
> 6789/0,ceph-3=172.16.31.4:6789/0}
> election epoch 824, quorum 0,1,2,3
> ceph-0,ceph-1,ceph-2,ceph-3
>   fsmap e144928: 1/1/1 up {0=ceph-0=up:active}, 1 up:standby
>  osdmap e35814: 32 osds: 30 up, 30 in; 140 remapped pgs
> flags
> noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
>   pgmap v43142427: 1536 pgs, 2 pools, 762 TB data, 331 Mobjects
> 1444 TB used, 1011 TB / 2455 TB avail
> 38725029/695508092 objects degraded (5.568%)
> 10844554/695508092 objects misplaced (1.559%)
> 1396 active+clean
>   76
> undersized+degraded+remapped+backfilling+peered
>   64
> undersized+degraded+remapped+wait_backfill+peered
> recovery io 1244 MB/s, 1612 keys/s, 705 objects/s
>
> ID  WEIGHT TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
>  -1 2619.54541 root default
>  -2  163.72159 host ceph-0
>   0   81.86079 osd.0 up  1.0  1.0
>   1   81.86079 osd.1 up  1.0  1.0
>  -3  163.72159 host c

Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-14 Thread Webert de Souza Lima
On Sat, May 12, 2018 at 3:11 AM Alexandre DERUMIER 
wrote:

> The documentation (luminous) say:
>


> >mds cache size
> >
> >Description:The number of inodes to cache. A value of 0 indicates an
> unlimited number. It is recommended to use mds_cache_memory_limit to limit
> the amount of memory the MDS cache uses.
> >Type:   32-bit Integer
> >Default:0
> >

and, my mds_cache_memory_limit is currently at 5GB.


Yeah, I only suggested that because the high memory usage seemed to trouble
you and it might be a bug, so it's more of a workaround.
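For reference, that workaround would be something along these lines (the
daemon name and the value are just examples):

  ceph tell mds.mds0 injectargs '--mds_cache_size=2000000'   # cap cached inodes; 0 = unlimited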

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question: CephFS + Bluestore

2018-05-11 Thread Webert de Souza Lima
Thanks David.
Although you mentioned this was introduced with Luminous, it's working with
Jewel.

~# ceph osd pool stats

Fri May 11 17:41:39 2018

pool rbd id 5
  client io 505 kB/s rd, 3801 kB/s wr, 46 op/s rd, 27 op/s wr

pool rbd_cache id 6
  client io 2538 kB/s rd, 3070 kB/s wr, 601 op/s rd, 758 op/s wr
  cache tier io 12225 kB/s flush, 0 op/s promote, 3 PG(s) flushing

pool cephfs_metadata id 7
  client io 2233 kB/s rd, 2260 kB/s wr, 95 op/s rd, 587 op/s wr

pool cephfs_data_ssd id 8
  client io 1126 kB/s rd, 94897 B/s wr, 33 op/s rd, 42 op/s wr

pool cephfs_data id 9
  client io 0 B/s rd, 11203 kB/s wr, 12 op/s rd, 12 op/s wr

pool cephfs_data_cache id 10
  client io 4383 kB/s rd, 550 kB/s wr, 57 op/s rd, 39 op/s wr
  cache tier io 7012 kB/s flush, 4399 kB/s evict, 11 op/s promote


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Fri, May 11, 2018 at 5:14 PM David Turner <drakonst...@gmail.com> wrote:

> `ceph osd pool stats` with the option to specify the pool you are
> interested in should get you the breakdown of IO per pool.  This was
> introduced with luminous.
>
> On Fri, May 11, 2018 at 2:39 PM Webert de Souza Lima <
> webert.b...@gmail.com> wrote:
>
>> I think ceph doesn't have IO metrics will filters by pool right? I see IO
>> metrics from clients only:
>>
>> ceph_client_io_ops
>> ceph_client_io_read_bytes
>> ceph_client_io_read_ops
>> ceph_client_io_write_bytes
>> ceph_client_io_write_ops
>>
>> and pool "byte" metrics, but not "io":
>>
>> ceph_pool(write/read)_bytes(_total)
>>
>> Regards,
>>
>> Webert Lima
>> DevOps Engineer at MAV Tecnologia
>> *Belo Horizonte - Brasil*
>> *IRC NICK - WebertRLZ*
>>
>> On Wed, May 9, 2018 at 2:23 PM Webert de Souza Lima <
>> webert.b...@gmail.com> wrote:
>>
>>> Hey Jon!
>>>
>>> On Wed, May 9, 2018 at 12:11 PM, John Spray <jsp...@redhat.com> wrote:
>>>
>>>> It depends on the metadata intensity of your workload.  It might be
>>>> quite interesting to gather some drive stats on how many IOPS are
>>>> currently hitting your metadata pool over a week of normal activity.
>>>>
>>>
>>> Any ceph built-in tool for this? maybe ceph daemonperf (altoght I'm not
>>> sure what I should be looking at).
>>> My current SSD disks have 2 partitions.
>>>  - One is used for cephfs cache tier pool,
>>>  - The other is used for both:  cephfs meta-data pool and cephfs
>>> data-ssd (this is an additional cephfs data pool with only ssds with file
>>> layout for a specific direcotory to use it)
>>>
>>> Because of this, iostat shows me peaks of 12k IOPS in the metadata
>>> partition, but this could definitely be IO for the data-ssd pool.
>>>
>>>
>>>> If you are doing large file workloads, and the metadata mostly fits in
>>>> RAM, then the number of IOPS from the MDS can be very, very low.  On
>>>> the other hand, if you're doing random metadata reads from a small
>>>> file workload where the metadata does not fit in RAM, almost every
>>>> client read could generate a read operation, and each MDS could easily
>>>> generate thousands of ops per second.
>>>>
>>>
>>> I have yet to measure it the right way but I'd assume my metadata fits
>>> in RAM (a few 100s of MB only).
>>>
>>> This is an email hosting cluster with dozens of thousands of users so
>>> there are a lot of random reads and writes, but not too many small files.
>>> Email messages are concatenated together in files up to 4MB in size
>>> (when a rotation happens).
>>> Most user operations are dovecot's INDEX operations and I will keep
>>> index directory in a SSD-dedicaded pool.
>>>
>>>
>>>
>>>> Isolating metadata OSDs is useful if the data OSDs are going to be
>>>> completely saturated: metadata performance will be protected even if
>>>> clients are hitting the data OSDs hard.
>>>>
>>>
>>> This seems to be the case.
>>>
>>>
>>>> If "heavy write" means completely saturating the cluster, then sharing
>>>> the OSDs is risky.  If "heavy write" just means that there are more
>>>> writes than reads, then it may be fine if the metadata workload is not
>>>> heavy enough to make good use of SSDs.
>>>>
>>>
>>> Saturarion will only happen in peak workloads, not often. By 

Re: [ceph-users] Question: CephFS + Bluestore

2018-05-11 Thread Webert de Souza Lima
I think ceph doesn't have IO metrics with filters by pool, right? I see IO
metrics from clients only:

ceph_client_io_ops
ceph_client_io_read_bytes
ceph_client_io_read_ops
ceph_client_io_write_bytes
ceph_client_io_write_ops

and pool "byte" metrics, but not "io":

ceph_pool(write/read)_bytes(_total)

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, May 9, 2018 at 2:23 PM Webert de Souza Lima <webert.b...@gmail.com>
wrote:

> Hey Jon!
>
> On Wed, May 9, 2018 at 12:11 PM, John Spray <jsp...@redhat.com> wrote:
>
>> It depends on the metadata intensity of your workload.  It might be
>> quite interesting to gather some drive stats on how many IOPS are
>> currently hitting your metadata pool over a week of normal activity.
>>
>
> Any ceph built-in tool for this? maybe ceph daemonperf (altoght I'm not
> sure what I should be looking at).
> My current SSD disks have 2 partitions.
>  - One is used for cephfs cache tier pool,
>  - The other is used for both:  cephfs meta-data pool and cephfs data-ssd
> (this is an additional cephfs data pool with only ssds with file layout for
> a specific direcotory to use it)
>
> Because of this, iostat shows me peaks of 12k IOPS in the metadata
> partition, but this could definitely be IO for the data-ssd pool.
>
>
>> If you are doing large file workloads, and the metadata mostly fits in
>> RAM, then the number of IOPS from the MDS can be very, very low.  On
>> the other hand, if you're doing random metadata reads from a small
>> file workload where the metadata does not fit in RAM, almost every
>> client read could generate a read operation, and each MDS could easily
>> generate thousands of ops per second.
>>
>
> I have yet to measure it the right way but I'd assume my metadata fits in
> RAM (a few 100s of MB only).
>
> This is an email hosting cluster with dozens of thousands of users so
> there are a lot of random reads and writes, but not too many small files.
> Email messages are concatenated together in files up to 4MB in size (when
> a rotation happens).
> Most user operations are dovecot's INDEX operations and I will keep index
> directory in a SSD-dedicaded pool.
>
>
>
>> Isolating metadata OSDs is useful if the data OSDs are going to be
>> completely saturated: metadata performance will be protected even if
>> clients are hitting the data OSDs hard.
>>
>
> This seems to be the case.
>
>
>> If "heavy write" means completely saturating the cluster, then sharing
>> the OSDs is risky.  If "heavy write" just means that there are more
>> writes than reads, then it may be fine if the metadata workload is not
>> heavy enough to make good use of SSDs.
>>
>
> Saturarion will only happen in peak workloads, not often. By heavy write I
> mean there are much more writes than reads, yes.
> So I think I can start sharing the OSDs, if I think this is impacting
> performance I can just change the ruleset and move metadata to a SSD-only
> pool, right?
>
>
>> The way I'd summarise this is: in the general case, dedicated SSDs are
>> the safe way to go -- they're intrinsically better suited to metadata.
>> However, in some quite common special cases, the overall number of
>> metadata ops is so low that the device doesn't matter.
>
>
>
> Thank you very much John!
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> Belo Horizonte - Brasil
> IRC NICK - WebertRLZ
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Node crash, filesytem not usable

2018-05-11 Thread Webert de Souza Lima
This message seems to be very concerning:
 >mds0: Metadata damage detected

but for the rest, the cluster still seems to be recovering. You could try
to speed things up with ceph tell, like:

ceph tell osd.* injectargs --osd_max_backfills=10
ceph tell osd.* injectargs --osd_recovery_sleep=0.0
ceph tell osd.* injectargs --osd_recovery_threads=2


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Fri, May 11, 2018 at 3:06 PM Daniel Davidson 
wrote:

> Below id the information you were asking for.  I think they are size=2,
> min size=1.
>
> Dan
>
> # ceph status
> cluster
> 7bffce86-9d7b-4bdf-a9c9-67670e68ca77
>
>  health
> HEALTH_ERR
>
> 140 pgs are stuck inactive for more than 300 seconds
> 64 pgs backfill_wait
> 76 pgs backfilling
> 140 pgs degraded
> 140 pgs stuck degraded
> 140 pgs stuck inactive
> 140 pgs stuck unclean
> 140 pgs stuck undersized
> 140 pgs undersized
> 210 requests are blocked > 32 sec
> recovery 38725029/695508092 objects degraded (5.568%)
> recovery 10844554/695508092 objects misplaced (1.559%)
> mds0: Metadata damage detected
> mds0: Behind on trimming (71/30)
> noscrub,nodeep-scrub flag(s) set
>  monmap e3: 4 mons at {ceph-0=
> 172.16.31.1:6789/0,ceph-1=172.16.31.2:6789/0,ceph-2=172.16.31.3:6789/0,ceph-3=172.16.31.4:6789/0
> }
> election epoch 824, quorum 0,1,2,3 ceph-0,ceph-1,ceph-2,ceph-3
>   fsmap e144928: 1/1/1 up {0=ceph-0=up:active}, 1 up:standby
>  osdmap e35814: 32 osds: 30 up, 30 in; 140 remapped pgs
> flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
>   pgmap v43142427: 1536 pgs, 2 pools, 762 TB data, 331 Mobjects
> 1444 TB used, 1011 TB / 2455 TB avail
> 38725029/695508092 objects degraded (5.568%)
> 10844554/695508092 objects misplaced (1.559%)
> 1396 active+clean
>   76 undersized+degraded+remapped+backfilling+peered
>   64 undersized+degraded+remapped+wait_backfill+peered
> recovery io 1244 MB/s, 1612 keys/s, 705 objects/s
>
> ID  WEIGHT TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
>  -1 2619.54541 root default
>  -2  163.72159 host ceph-0
>   0   81.86079 osd.0 up  1.0  1.0
>   1   81.86079 osd.1 up  1.0  1.0
>  -3  163.72159 host ceph-1
>   2   81.86079 osd.2 up  1.0  1.0
>   3   81.86079 osd.3 up  1.0  1.0
>  -4  163.72159 host ceph-2
>   8   81.86079 osd.8 up  1.0  1.0
>   9   81.86079 osd.9 up  1.0  1.0
>  -5  163.72159 host ceph-3
>  10   81.86079 osd.10up  1.0  1.0
>  11   81.86079 osd.11up  1.0  1.0
>  -6  163.72159 host ceph-4
>   4   81.86079 osd.4 up  1.0  1.0
>   5   81.86079 osd.5 up  1.0  1.0
>  -7  163.72159 host ceph-5
>   6   81.86079 osd.6 up  1.0  1.0
>   7   81.86079 osd.7 up  1.0  1.0
>  -8  163.72159 host ceph-6
>  12   81.86079 osd.12up  0.7  1.0
>  13   81.86079 osd.13up  1.0  1.0
>  -9  163.72159 host ceph-7
>  14   81.86079 osd.14up  1.0  1.0
>  15   81.86079 osd.15up  1.0  1.0
> -10  163.72159 host ceph-8
>  16   81.86079 osd.16up  1.0  1.0
>  17   81.86079 osd.17up  1.0  1.0
> -11  163.72159 host ceph-9
>  18   81.86079 osd.18up  1.0  1.0
>  19   81.86079 osd.19up  1.0  1.0
> -12  163.72159 host ceph-10
>  20   81.86079 osd.20up  1.0  1.0
>  21   81.86079 osd.21up  1.0  1.0
> -13  163.72159 host ceph-11
>  22   81.86079 osd.22up  1.0  1.0
>  23   81.86079 osd.23up  1.0  1.0
> -14  163.72159 host ceph-12
>  24   81.86079 osd.24up  1.0  1.0
>  25   81.86079 osd.25up  1.0  1.0
> -15  163.72159 host ceph-13
>  26   81.86079 osd.26  down0  1.0
>  27   81.86079 osd.27  down0  1.0
> -16  163.72159 host ceph-14
>  28   81.86079 osd.28up  1.0  1.0
>  29   81.86079 osd.29up  1.0  1.0
> -17  163.72159 host ceph-15
>  30   

Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-11 Thread Webert de Souza Lima
You could use "mds_cache_size" to limit number of CAPS untill you have this
fixed, but I'd say for your number of caps and inodes, 20GB is normal.

this mds (jewel) here is consuming 24GB RAM:

{
"mds": {
"request": 7194867047,
"reply": 7194866688,
"reply_latency": {
"avgcount": 7194866688,
"sum": 27779142.611775008
},
"forward": 0,
"dir_fetch": 179223482,
"dir_commit": 1529387896,
"dir_split": 0,
"inode_max": 300,
"inodes": 3001264,
"inodes_top": 160517,
"inodes_bottom": 226577,
"inodes_pin_tail": 2614170,
"inodes_pinned": 2770689,
"inodes_expired": 2920014835,
"inodes_with_caps": 2743194,
"caps": 2803568,
"subtrees": 2,
"traverse": 8255083028,
"traverse_hit": 7452972311,
"traverse_forward": 0,
"traverse_discover": 0,
"traverse_dir_fetch": 180547123,
"traverse_remote_ino": 122257,
"traverse_lock": 5957156,
"load_cent": 18446743934203149911,
"q": 54,
"exported": 0,
"exported_inodes": 0,
"imported": 0,
"imported_inodes": 0
}
}


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Fri, May 11, 2018 at 3:13 PM Alexandre DERUMIER 
wrote:

> Hi,
>
> I'm still seeing memory leak with 12.2.5.
>
> seem to leak some MB each 5 minutes.
>
> I'll try to resent some stats next weekend.
>
>
> - Mail original -
> De: "Patrick Donnelly" 
> À: "Brady Deetz" 
> Cc: "Alexandre Derumier" , "ceph-users" <
> ceph-users@lists.ceph.com>
> Envoyé: Jeudi 10 Mai 2018 21:11:19
> Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?
>
> On Thu, May 10, 2018 at 12:00 PM, Brady Deetz  wrote:
> > [ceph-admin@mds0 ~]$ ps aux | grep ceph-mds
> > ceph 1841 3.5 94.3 133703308 124425384 ? Ssl Apr04 1808:32
> > /usr/bin/ceph-mds -f --cluster ceph --id mds0 --setuser ceph --setgroup
> ceph
> >
> >
> > [ceph-admin@mds0 ~]$ sudo ceph daemon mds.mds0 cache status
> > {
> > "pool": {
> > "items": 173261056,
> > "bytes": 76504108600
> > }
> > }
> >
> > So, 80GB is my configured limit for the cache and it appears the mds is
> > following that limit. But, the mds process is using over 100GB RAM in my
> > 128GB host. I thought I was playing it safe by configuring at 80. What
> other
> > things consume a lot of RAM for this process?
> >
> > Let me know if I need to create a new thread.
>
> The cache size measurement is imprecise pre-12.2.5 [1]. You should upgrade
> ASAP.
>
> [1] https://tracker.ceph.com/issues/22972
>
> --
> Patrick Donnelly
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] howto: multiple ceph filesystems

2018-05-11 Thread Webert de Souza Lima
Basically what we're trying to figure out looks like what is being done
here:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/020958.html

But instead of using LIBRADOS to store EMAILs directly in RADOS, we're still
using CEPHFS for it; we're just figuring out whether it makes sense to separate
them into different workloads.


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*

On Fri, May 11, 2018 at 2:07 AM, Marc Roos  wrote:

>
>
> If I would like to use an erasurecode pool for a cephfs directory how
> would I create these placement rules?
>
>
>
>
> -Original Message-
> From: David Turner [mailto:drakonst...@gmail.com]
> Sent: vrijdag 11 mei 2018 1:54
> To: João Paulo Sacchetto Ribeiro Bastos
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] howto: multiple ceph filesystems
>
> Another option you could do is to use a placement rule. You could create
> a general pool for most data to go to and a special pool for specific
> folders on the filesystem. Particularly I think of a pool for replica vs
> EC vs flash for specific folders in the filesystem.
>
> If the pool and OSDs wasn't the main concern for multiple filesystems
> and the mds servers are then you could have multiple active mds servers
> and pin the metadata for the indexes to one of them while the rest is
> served by the other active mds servers.
>
> I really haven't come across a need for multiple filesystems in ceph
> with the type of granularity you can achieve with mds pinning, folder
> placement rules, and cephx authentication to limit a user to a specific
> subfolder.
>
>
> On Thu, May 10, 2018, 5:10 PM João Paulo Sacchetto Ribeiro Bastos
>  wrote:
>
>
> Hey John, thanks for you answer. For sure the hardware robustness
> will be nice enough. My true concern was actually the coexistence of the two
> FS ecosystems. In fact I realized that we may not use this at all, because
> it may represent a high overhead, not to mention the fact that it's still an
> experimental feature.
>
> On Thu, 10 May 2018 at 15:48 John Spray  wrote:
>
>
> On Thu, May 10, 2018 at 7:38 PM, João Paulo Sacchetto
> Ribeiro
> Bastos
>  wrote:
> > Hello guys,
> >
> > My company is about to rebuild its whole infrastructure,
> so
> I was called in
> > order to help on the planning. We are essentially an
> corporate mail
> > provider, so we handle daily lots of clients using
> dovecot
> and roundcube and
> > in order to do so we want to design a better plant of
> our
> cluster. Today,
> > using Jewel, we have a single cephFS for both index and
> mail
> from dovecot,
> > but we want to split it into an index_FS and a mail_FS
> to
> handle the
> > workload a little better, is it profitable nowadays?
> From my
> research I
> > realized that we will need data and metadata individual
> pools for each FS
> > such as a group of MDS for each of then, also.
> >
> > The one thing that really scares me about all of this
> is: we
> are planning to
> > have four machines at full disposal to handle our MDS
> instances. We started
> > to think if an idea like the one below is valid, can
> anybody
> give a hint on
> > this? We basically want to handle two MDS instances on
> each
> machine (one for
> > each FS) and wonder if we'll be able to have them
> swapping
> between active
> > and standby simultaneously without any trouble.
> >
> > index_FS: (active={machines 1 and 3}, standby={machines
> 2
> and 4})
> > mail_FS: (active={machines 2 and 4}, standby={machines 1
> and
> 3})
>
> Nothing wrong with that setup, but remember that those
> servers
> are
> going to have to be well-resourced enough to run all four
> at
> once
> (when a failure occurs), so it might not matter very much
> exactly
> which servers are running which daemons.
>
> With a filesystem's MDS daemons (i.e. daemons with the same
> standby_for_fscid setting), Ceph will activate whichever
> daemon comes
> up first, so if it's important to you to have particular
> daemons
> active then you would need to take care of that at the
> point
> you're
> starting them up.
>
> John
>
> >
> > Regards,
> > --
> >
> > João Paulo Sacchetto Ribeiro Bastos
> > +55 31 99279-7092
> >
> >
> > 

Re: [ceph-users] Question: CephFS + Bluestore

2018-05-09 Thread Webert de Souza Lima
Hey Jon!

On Wed, May 9, 2018 at 12:11 PM, John Spray  wrote:

> It depends on the metadata intensity of your workload.  It might be
> quite interesting to gather some drive stats on how many IOPS are
> currently hitting your metadata pool over a week of normal activity.
>

Any ceph built-in tool for this? Maybe ceph daemonperf (although I'm not sure
what I should be looking at).
My current SSD disks have 2 partitions.
 - One is used for cephfs cache tier pool,
 - The other is used for both:  cephfs meta-data pool and cephfs data-ssd
(this is an additional cephfs data pool with only ssds with file layout for
a specific direcotory to use it)

Because of this, iostat shows me peaks of 12k IOPS in the metadata
partition, but this could definitely be IO for the data-ssd pool.
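(To be concrete, the kind of commands I have in mind, with placeholder names:

  ceph daemonperf mds.mds0              # live MDS perf counters
  ceph osd pool stats cephfs_metadata   # per-pool client IO
  iostat -x 5 /dev/sdb                  # raw device IOPS, but both pools mixed
)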


> If you are doing large file workloads, and the metadata mostly fits in
> RAM, then the number of IOPS from the MDS can be very, very low.  On
> the other hand, if you're doing random metadata reads from a small
> file workload where the metadata does not fit in RAM, almost every
> client read could generate a read operation, and each MDS could easily
> generate thousands of ops per second.
>

I have yet to measure it the right way but I'd assume my metadata fits in
RAM (a few 100s of MB only).

This is an email hosting cluster with tens of thousands of users, so there
are a lot of random reads and writes, but not too many small files.
Email messages are concatenated together in files up to 4MB in size (when a
rotation happens).
Most user operations are dovecot's INDEX operations and I will keep index
directory in a SSD-dedicaded pool.



> Isolating metadata OSDs is useful if the data OSDs are going to be
> completely saturated: metadata performance will be protected even if
> clients are hitting the data OSDs hard.
>

This seems to be the case.


> If "heavy write" means completely saturating the cluster, then sharing
> the OSDs is risky.  If "heavy write" just means that there are more
> writes than reads, then it may be fine if the metadata workload is not
> heavy enough to make good use of SSDs.
>

Saturation will only happen at peak workloads, not often. By heavy write I
mean there are many more writes than reads, yes.
So I think I can start by sharing the OSDs; if I see that this is impacting
performance, I can just change the ruleset and move the metadata to an
SSD-only pool, right?
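(Assuming an SSD-only crush rule already exists, I imagine that move is
roughly a one-liner on Jewel, with the rule id below being a placeholder:

  ceph osd pool set cephfs_metadata crush_ruleset 1

and the pool would then rebalance onto the SSD OSDs.)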


> The way I'd summarise this is: in the general case, dedicated SSDs are
> the safe way to go -- they're intrinsically better suited to metadata.
> However, in some quite common special cases, the overall number of
> metadata ops is so low that the device doesn't matter.



Thank you very much John!
Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can't get MDS running after a power outage

2018-03-29 Thread Webert de Souza Lima
I'd also try to boot up only one MDS until it's fully up and running, not
both of them.
Sometimes they keep switching states between each other.


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*

On Thu, Mar 29, 2018 at 7:32 AM, John Spray  wrote:

> On Thu, Mar 29, 2018 at 8:16 AM, Zhang Qiang 
> wrote:
> > Hi,
> >
> > Ceph version 10.2.3. After a power outage, I tried to start the MDS
> > deamons, but they stuck forever replaying journals, I had no idea why
> > they were taking that long, because this is just a small cluster for
> > testing purpose with only hundreds MB data. I restarted them, and the
> > error below was encountered.
>
> Usually if an MDS is stuck in replay, it's because it's waiting for
> the OSDs to service the reads of the journal.  Are all your PGs up and
> healthy?
>
> >
> > Any chance I can restore them?
> >
> > Mar 28 14:20:30 node01 systemd: Started Ceph metadata server daemon.
> > Mar 28 14:20:30 node01 systemd: Starting Ceph metadata server daemon...
> > Mar 28 14:20:30 node01 ceph-mds: 2018-03-28 14:20:30.796255
> > 7f0150c8c180 -1 deprecation warning: MDS id 'mds.0' is invalid and
> > will be forbidden in a future version.  MDS names may not start with a
> > numeric digit.
>
> If you're really using "0" as an MDS name, now would be a good time to
> fix that -- most people use a hostname or something like that.  The
> reason that numeric MDS names are invalid is that it makes commands
> like "ceph mds fail 0" ambiguous (do we mean the name 0 or the rank
> 0?).
>
> > Mar 28 14:20:30 node01 ceph-mds: starting mds.0 at :/0
> > Mar 28 14:20:30 node01 ceph-mds: ./mds/MDSMap.h: In function 'const
> > entity_inst_t MDSMap::get_inst(mds_rank_t)' thread 7f014ac6c700 time
> > 2018-03-28 14:20:30.942480
> > Mar 28 14:20:30 node01 ceph-mds: ./mds/MDSMap.h: 582: FAILED
> assert(up.count(m))
> > Mar 28 14:20:30 node01 ceph-mds: ceph version 10.2.3
> > (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
> > Mar 28 14:20:30 node01 ceph-mds: 1: (ceph::__ceph_assert_fail(char
> > const*, char const*, int, char const*)+0x85) [0x7f01512aba45]
> > Mar 28 14:20:30 node01 ceph-mds: 2: (MDSMap::get_inst(int)+0x20f)
> > [0x7f0150ee5e3f]
> > Mar 28 14:20:30 node01 ceph-mds: 3:
> > (MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x7b9)
> > [0x7f0150ed6e49]
>
> This is a weird assertion.  I can't see how it could be reached :-/
>
> John
>
> > Mar 28 14:20:30 node01 ceph-mds: 4:
> > (MDSDaemon::handle_mds_map(MMDSMap*)+0xe3d) [0x7f0150eb396d]
> > Mar 28 14:20:30 node01 ceph-mds: 5:
> > (MDSDaemon::handle_core_message(Message*)+0x7b3) [0x7f0150eb4eb3]
> > Mar 28 14:20:30 node01 ceph-mds: 6:
> > (MDSDaemon::ms_dispatch(Message*)+0xdb) [0x7f0150eb514b]
> > Mar 28 14:20:30 node01 ceph-mds: 7: (DispatchQueue::entry()+0x78a)
> > [0x7f01513ad4aa]
> > Mar 28 14:20:30 node01 ceph-mds: 8:
> > (DispatchQueue::DispatchThread::entry()+0xd) [0x7f015129098d]
> > Mar 28 14:20:30 node01 ceph-mds: 9: (()+0x7dc5) [0x7f0150095dc5]
> > Mar 28 14:20:30 node01 ceph-mds: 10: (clone()+0x6d) [0x7f014eb61ced]
> > Mar 28 14:20:30 node01 ceph-mds: NOTE: a copy of the executable, or
> > `objdump -rdS ` is needed to interpret this.
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph df shows 100% used

2018-01-22 Thread Webert de Souza Lima
Hi,

On Fri, Jan 19, 2018 at 8:31 PM, zhangbingyin 
 wrote:

> 'MAX AVAIL' in the 'ceph df' output represents the amount of data that can
> be used before the first OSD becomes full, and not the sum of all free
> space across a set of OSDs.
>

Thank you very much. I figured this out by the end of the day; that is the
answer. I'm not sure this is in the ceph.com docs, though.
Now I know the problem is indeed solved (by doing proper reweight).

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph df shows 100% used

2018-01-19 Thread Webert de Souza Lima
While it seemed to be solved yesterday, today the %USED has grown a lot
again. See:

~# ceph osd df tree
http://termbin.com/0zhk

~# ceph df detail
http://termbin.com/thox

94% USED while there is about 21TB worth of data; size = 2 means ~42TB of RAW
usage, but the OSDs in that root add up to ~70TB of available space.



Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*

On Thu, Jan 18, 2018 at 8:21 PM, Webert de Souza Lima <webert.b...@gmail.com
> wrote:

> With the help of robbat2 and llua on IRC channel I was able to solve this
> situation by taking down the 2-OSD only hosts.
> After crush reweighting OSDs 8 and 23 from host mia1-master-fe02 to 0,
> ceph df showed the expected storage capacity usage (about 70%)
>
>
> With this in mind, those guys have told me that it is due the cluster
> beeing uneven and unable to balance properly. It makes sense and it worked.
> But for me it is still a very unexpected bahaviour for ceph to say that
> the pools are 100% full and Available Space is 0.
>
> There were 3 hosts and repl. size = 2, if the host with only 2 OSDs were
> full (it wasn't), ceph could still use space from OSDs from the other hosts.
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> *Belo Horizonte - Brasil*
> *IRC NICK - WebertRLZ*
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph df shows 100% used

2018-01-18 Thread Webert de Souza Lima
With the help of robbat2 and llua on the IRC channel I was able to solve this
situation by taking down the hosts that had only 2 OSDs.
After crush reweighting OSDs 8 and 23 from host mia1-master-fe02 to 0, ceph df
showed the expected storage capacity usage (about 70%).
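(That was done with plain crush reweights, i.e. something like:

  ceph osd crush reweight osd.8 0
  ceph osd crush reweight osd.23 0
)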


With this in mind, those guys told me that it is due to the cluster being
uneven and unable to balance properly. It makes sense, and it worked.
But to me it is still very unexpected behaviour for ceph to say that the
pools are 100% full and the available space is 0.

There were 3 hosts and repl. size = 2; even if the host with only 2 OSDs were
full (it wasn't), ceph could still use space from the OSDs on the other hosts.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph df shows 100% used

2018-01-18 Thread Webert de Souza Lima
Hi David, thanks for replying.


On Thu, Jan 18, 2018 at 5:03 PM David Turner <drakonst...@gmail.com> wrote:

> You can have overall space available in your cluster because not all of
> your disks are in the same crush root.  You have multiple roots
> corresponding to multiple crush rulesets.  All pools using crush ruleset 0
> are full because all of the osds in that crush rule are full.
>


So I did check this. The usage of the OSDs that belonged to that root
(default) was about 60%.
All the pools using crush ruleset 0 were being shown as 100% full while there
was only 1 near-full OSD in that crush rule. That's what is so weird about it.

On Thu, Jan 18, 2018 at 8:05 PM, David Turner <drakonst...@gmail.com> wrote:

> `ceph osd df` is a good command for you to see what's going on.  Compare
> the osd numbers with `ceph osd tree`.
>

I am sorry, I forgot to send this output; here it is. I have added 2 OSDs to
that crush root, borrowed from the host mia1-master-ds05, to see if the
available space would get higher, but it didn't.
So adding new OSDs to it didn't have any effect.

ceph osd df tree

ID  WEIGHT   REWEIGHT SIZE   USEAVAIL  %USE  VAR  PGS TYPE NAME
 -9 13.5- 14621G  2341G 12279G 16.02 0.31   0 root
databases
 -8  6.5-  7182G   835G  6346G 11.64 0.22   0 host
mia1-master-ds05
 20  3.0  1.0  3463G   380G  3082G 10.99 0.21 260
osd.20
 17  3.5  1.0  3719G   455G  3263G 12.24 0.24 286
osd.17
-10  7.0-  7438G  1505G  5932G 20.24 0.39   0 host
mia1-master-fe01
 21  3.5  1.0  3719G   714G  3004G 19.22 0.37 269
osd.21
 22  3.5  1.0  3719G   791G  2928G 21.27 0.41 295
osd.22
 -3  2.39996-  2830G  1647G  1182G 58.22 1.12   0 root
databases-ssd
 -5  1.19998-  1415G   823G   591G 58.22 1.12   0 host
mia1-master-ds02-ssd
 24  0.3  1.0   471G   278G   193G 58.96 1.14 173
osd.24
 25  0.3  1.0   471G   276G   194G 58.68 1.13 172
osd.25
 26  0.3  1.0   471G   269G   202G 57.03 1.10 167
osd.26
 -6  1.19998-  1415G   823G   591G 58.22 1.12   0 host
mia1-master-ds03-ssd
 27  0.3  1.0   471G   244G   227G 51.87 1.00 152
osd.27
 28  0.3  1.0   471G   281G   190G 59.63 1.15 175
osd.28
 29  0.3  1.0   471G   297G   173G 63.17 1.22 185
osd.29
 -1 71.69997- 76072G 44464G 31607G 58.45 1.13   0 root default
 -2 26.59998- 29575G 17334G 12240G 58.61 1.13   0 host
mia1-master-ds01
  0  3.2  1.0  3602G  1907G  1695G 52.94 1.02  90
osd.0
  1  3.2  1.0  3630G  2721G   908G 74.97 1.45 112
osd.1
  2  3.2  1.0  3723G  2373G  1349G 63.75 1.23  98
osd.2
  3  3.2  1.0  3723G  1781G  1941G 47.85 0.92 105
osd.3
  4  3.2  1.0  3723G  1880G  1843G 50.49 0.97  95
osd.4
  5  3.2  1.0  3723G  2465G  1257G 66.22 1.28 111
osd.5
  6  3.7  1.0  3723G  1722G  2001G 46.25 0.89 109
osd.6
  7  3.7  1.0  3723G  2481G  1241G 66.65 1.29 126
osd.7
 -4  8.5-  9311G  8540G   770G 91.72 1.77   0 host
mia1-master-fe02
  8  5.5  0.7  5587G  5419G   167G 97.00 1.87 189
osd.8
 23  3.0  1.0  3724G  3120G   603G 83.79 1.62 128
osd.23
 -7 29.5- 29747G 17821G 11926G 59.91 1.16   0 host
mia1-master-ds04
  9  3.7  1.0  3718G  2493G  1224G 67.07 1.29 114
osd.9
 10  3.7  1.0  3718G  2454G  1264G 66.00 1.27  90
osd.10
 11  3.7  1.0  3718G  2202G  1516G 59.22 1.14 116
osd.11
 12  3.7  1.0  3718G  2290G  1427G 61.61 1.19 113
osd.12
 13  3.7  1.0  3718G  2015G  1703G 54.19 1.05 112
osd.13
 14  3.7  1.0  3718G  1264G  2454G 34.00 0.66 101
osd.14
 15  3.7  1.0  3718G  2195G  1522G 59.05 1.14 104
osd.15
 16  3.7  1.0  3718G  2905G   813G 78.13 1.51 130
osd.16
-11  7.0-  7438G   768G  6669G 10.33 0.20   0 host
mia1-master-ds05-borrowed-osds
 18  3.5  1.0  3719G   393G  3325G 10.59 0.20 262
osd.18
 19  3.5  1.0  3719G   374G  3344G 10.07 0.19 256
osd.19
TOTAL 93524G 48454G 45069G 51.81
MIN/MAX VAR: 0.19/1.87  STDDEV: 22.02



Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*

On Thu, Jan 18, 2018 at 8:05 PM, David Turner <drakonst...@gmail.com> wrote:

> `ceph osd df` is a good command for you to see what's going on.  Compare
> the osd numbers with `ceph osd tree`.
>
>
>>
>> On Thu, Jan 18, 2018 at 3:34 PM Webert de Souza Lima <
>> webert.b...@gmail.com> wrote:
>>
>>> Sorry I forgot, this is a ceph jewel 10.2.10
>>>
>>>
>>> Regards,
>>>
>>> Webert Lima
>>> DevOps Engineer at MAV Tecnologia
>>> *Belo Horizonte - Brasil*
>>> *IRC NICK - WebertRLZ*
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>

Re: [ceph-users] ceph df shows 100% used

2018-01-18 Thread Webert de Souza Lima
Sorry I forgot, this is a ceph jewel 10.2.10


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph df shows 100% used

2018-01-18 Thread Webert de Souza Lima
Also, there is no quota set for the pools

Here is "ceph osd pool get xxx all": http://termbin.com/ix0n


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph df shows 100% used

2018-01-18 Thread Webert de Souza Lima
Hello,

I'm running a nearly out-of-service radosgw (very slow to write new objects)
and I suspect it's because ceph df is showing 100% usage in some pools,
though I don't know where that information comes from.

Pools:
#~ ceph osd pool ls detail  -> http://termbin.com/lsd0

Crush Rules (important is rule 0)
~# ceph osd crush rule dump ->  http://termbin.com/wkpo

OSD Tree:
~# ceph osd tree -> http://termbin.com/87vt

Ceph DF, which shows 100% Usage:
~# ceph df detail -> http://termbin.com/15mz

Ceph Status, which shows 45600 GB / 93524 GB avail:
~# ceph -s -> http://termbin.com/wycq


Any thoughts?

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] luminous: HEALTH_ERR full ratio(s) out of order

2018-01-10 Thread Webert de Souza Lima
Good to know. I don't think this should trigger HEALTH_ERR though; HEALTH_WARN
makes sense.
It also makes sense to keep the backfillfull_ratio greater than the
nearfull_ratio, as one might need backfilling to avoid an OSD getting full
during reweight operations.
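If I'm not mistaken, since luminous these ratios live in the OSDMap rather
than in ceph.conf (the mon_osd_* values only apply at cluster creation), so
something along these lines should show and fix the ordering. The numbers
below are only examples, pick whatever fits your cluster:

  ceph osd dump | grep ratio
  ceph osd set-nearfull-ratio 0.85
  ceph osd set-backfillfull-ratio 0.90
  ceph osd set-full-ratio 0.95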


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*

On Wed, Jan 10, 2018 at 12:11 PM, Stefan Priebe - Profihost AG <
s.pri...@profihost.ag> wrote:

> Hello,
>
> since upgrading to luminous i get the following error:
>
> HEALTH_ERR full ratio(s) out of order
> OSD_OUT_OF_ORDER_FULL full ratio(s) out of order
> backfillfull_ratio (0.9) < nearfull_ratio (0.95), increased
>
> but ceph.conf has:
>
> mon_osd_full_ratio = .97
> mon_osd_nearfull_ratio = .95
> mon_osd_backfillfull_ratio = .96
> osd_backfill_full_ratio = .96
> osd_failsafe_full_ratio = .98
>
> Any ideas?  i already restarted:
> * all osds
> * all mons
> * all mgrs
>
> Greets,
> Stefan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 'lost' cephfs filesystem?

2018-01-10 Thread Webert de Souza Lima
On Wed, Jan 10, 2018 at 12:44 PM, Mark Schouten  wrote:

> > Thanks, that's a good suggestion. Just one question, will this affect
> RBD-
> > access from the same (client)host?


I'm sorry that this didn't help. No, it does not affect RBD clients, as the MDS
is related only to cephfs.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*

>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 'lost' cephfs filesystem?

2018-01-10 Thread Webert de Souza Lima
Try to kick out (evict) that cephfs client from the mds node; see
http://docs.ceph.com/docs/master/cephfs/eviction/
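Something along these lines should do it; the mds name and client id are
placeholders, take the id from the session listing:

  ceph daemon mds.<name> session ls
  ceph daemon mds.<name> session evict <client-id>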


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*

On Wed, Jan 10, 2018 at 12:59 AM, Mark Schouten  wrote:

> Hi,
>
> While upgrading a server with a CephFS mount tonight, it stalled on
> installing
> a new kernel, because it was waiting for `sync`.
>
> I'm pretty sure it has something to do with the CephFS filesystem which
> caused
> some issues last week. I think the kernel still has a reference to the
> probably lazy unmounted CephFS filesystem.
> Unmounting the filesystem 'works', which means it is no longer available,
> but
> the unmount-command seems to be waiting for sync() as well. Mounting the
> filesystem again doesn't work either.
>
> I know the simple solution is to just reboot the server, but the server
> holds
> quite a lot of VM's and Containers, so I'd prefer to fix this without a
> reboot.
>
> Anybody with some clever ideas? :)
>
> --
> Kerio Operator in de Cloud? https://www.kerioindecloud.nl/
> Mark Schouten  | Tuxis Internet Engineering
> KvK: 61527076  | http://www.tuxis.nl/
> T: 0318 200208 | i...@tuxis.nl
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [luminous 12.2.2] Cluster write performance degradation problem(possibly tcmalloc related)

2017-12-22 Thread Webert de Souza Lima
On Thu, Dec 21, 2017 at 12:52 PM, shadow_lin  wrote:
>
> After 18:00 suddenly the write throughput dropped and the osd latency
> increased. TCMalloc started reclaiming the page heap freelist much more
> frequently. All of this happened very fast and every osd had the identical
> pattern.
>
Could that be caused by OSD scrub?  Check your "osd_scrub_begin_hour"

  ceph daemon osd.$ID config show | grep osd_scrub
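
If scrubs do line up with the drop, you could also push them out of business
hours with something like this (the hours are just an example):

  ceph tell osd.* injectargs '--osd_scrub_begin_hour 23 --osd_scrub_end_hour 7'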


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS locations

2017-12-22 Thread Webert de Souza Lima
It depends on how you use it. For me, it runs fine on the OSD hosts, but the
MDS server consumes loads of RAM, so be aware of that.
If the system load average goes too high due to OSD disk utilization, the MDS
server might run into trouble too, as a delayed response from the host could
cause the MDS to be marked as down.
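If you do co-locate them, I'd at least cap the MDS cache so it doesn't fight
the OSDs for RAM. Something like the following, where the mds name and the
numbers are placeholders (mds_cache_memory_limit only exists on luminous and
newer):

  ceph tell mds.<name> injectargs '--mds_cache_size 1000000'
  ceph tell mds.<name> injectargs '--mds_cache_memory_limit 4294967296'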


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*

On Fri, Dec 22, 2017 at 5:24 AM, nigel davies  wrote:

> Hay all
>
> Is it ok to set up mds on the same serves that do host the osd's or should
> they be on different server's
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs mds millions of caps

2017-12-22 Thread Webert de Souza Lima
On Fri, Dec 22, 2017 at 3:20 AM, Yan, Zheng  wrote:

> idle client shouldn't hold so many caps.
>

I'll try to make it reproducible for you to test.


yes. For now, it's better to run "echo 3 >/proc/sys/vm/drop_caches"
> after cronjob finishes


Thanks. I'll adopt that for now.
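
Probably something as simple as this in /etc/cron.d; the schedule is just a
guess, shortly after our mail maintenance job ends:

  30 4 * * * root sync && echo 3 > /proc/sys/vm/drop_caches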


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs mds millions of caps

2017-12-21 Thread Webert de Souza Lima
Hello Zheng,

Thanks for opening that issue on the bug tracker.

Also thanks for that tip. Caps dropped from 1.6M to 600k for that client.
Is it safe to run in a cronjob? Let's say, once or twice a day during
production?

Thanks!


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*

On Thu, Dec 21, 2017 at 11:55 AM, Yan, Zheng <uker...@gmail.com> wrote:

> On Thu, Dec 21, 2017 at 7:33 PM, Webert de Souza Lima
> <webert.b...@gmail.com> wrote:
> > I have upgraded the kernel on a client node (one that has close-to-zero
> > traffic) used for tests.
> >
> >{
> >   "reconnecting" : false,
> >   "id" : 1620266,
> >   "num_leases" : 0,
> >   "inst" : "client.1620266 10.0.0.111:0/3921220890",
> >   "state" : "open",
> >   "completed_requests" : 0,
> >   "num_caps" : 1402490,
> >   "client_metadata" : {
> >  "kernel_version" : "4.4.0-104-generic",
> >  "hostname" : "suppressed",
> >  "entity_id" : "admin"
> >   },
> >   "replay_requests" : 0
> >},
> >
> > still 1.4M caps used.
> >
> > is upgrading the client kernel enough ?
> >
>
> See http://tracker.ceph.com/issues/22446. We haven't implemented that
> feature.  "echo 3 >/proc/sys/vm/drop_caches"  should drop most caps.
>
> >
> >
> > Regards,
> >
> > Webert Lima
> > DevOps Engineer at MAV Tecnologia
> > Belo Horizonte - Brasil
> > IRC NICK - WebertRLZ
> >
> > On Fri, Dec 15, 2017 at 11:16 AM, Webert de Souza Lima
> > <webert.b...@gmail.com> wrote:
> >>
> >> So,
> >>
> >> On Fri, Dec 15, 2017 at 10:58 AM, Yan, Zheng <uker...@gmail.com> wrote:
> >>>
> >>>
> >>> 300k are ready quite a lot. opening them requires long time. does you
> >>> mail server really open so many files?
> >>
> >>
> >> Yes, probably. It's a commercial solution. A few thousand domains,
> dozens
> >> of thousands of users and god knows how any mailboxes.
> >> From the daemonperf you can see the write workload is high, so yes, too
> >> much files opening (dovecot mdbox stores multiple e-mails per file,
> split
> >> into many files).
> >>
> >>> I checked 4.4 kernel, it includes the code that trim cache when mds
> >>> recovers.
> >>
> >>
> >> Ok, all nodes are running 4.4.0-75-generic. The fix might have been
> >> included in a newer version.
> >> I'll upgrade it asap.
> >>
> >>
> >> Regards,
> >>
> >> Webert Lima
> >> DevOps Engineer at MAV Tecnologia
> >> Belo Horizonte - Brasil
> >> IRC NICK - WebertRLZ
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs mds millions of caps

2017-12-21 Thread Webert de Souza Lima
I have upgraded the kernel on a client node (one that has close-to-zero
traffic) used for tests.

   {
  "reconnecting" : false,
  "id" : 1620266,
  "num_leases" : 0,
  "inst" : "client.1620266 10.0.0.111:0/3921220890",
  "state" : "open",
  "completed_requests" : 0,
  "num_caps" : 1402490,
  "client_metadata" : {
 "kernel_version" : "4.4.0-104-generic",
 "hostname" : "suppressed",
 "entity_id" : "admin"
  },
  "replay_requests" : 0
   },

still 1.4M caps used.

is upgrading the client kernel enough ?



Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*

On Fri, Dec 15, 2017 at 11:16 AM, Webert de Souza Lima <
webert.b...@gmail.com> wrote:

> So,
>
> On Fri, Dec 15, 2017 at 10:58 AM, Yan, Zheng <uker...@gmail.com> wrote:
>
>>
>> 300k are ready quite a lot. opening them requires long time. does you
>> mail server really open so many files?
>
>
> Yes, probably. It's a commercial solution. A few thousand domains, dozens
> of thousands of users and god knows how any mailboxes.
> From the daemonperf you can see the write workload is high, so yes, too
> much files opening (dovecot mdbox stores multiple e-mails per file, split
> into many files).
>
> I checked 4.4 kernel, it includes the code that trim cache when mds
>> recovers.
>
>
> Ok, all nodes are running 4.4.0-75-generic. The fix might have been
> included in a newer version.
> I'll upgrade it asap.
>
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> *Belo Horizonte - Brasil*
> *IRC NICK - WebertRLZ*
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs mds millions of caps

2017-12-15 Thread Webert de Souza Lima
So,

On Fri, Dec 15, 2017 at 10:58 AM, Yan, Zheng  wrote:

>
> 300k are ready quite a lot. opening them requires long time. does you
> mail server really open so many files?


Yes, probably. It's a commercial solution. A few thousand domains, tens of
thousands of users and god knows how many mailboxes.
From the daemonperf output you can see the write workload is high, so yes, too
many files being opened (dovecot mdbox stores multiple e-mails per file, split
into many files).

I checked 4.4 kernel, it includes the code that trim cache when mds
> recovers.


Ok, all nodes are running 4.4.0-75-generic. The fix might have been
included in a newer version.
I'll upgrade it asap.


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs mds millions of caps

2017-12-15 Thread Webert de Souza Lima
Thanks

On Fri, Dec 15, 2017 at 10:46 AM, Yan, Zheng  wrote:

>  recent
> version kernel client and ceph-fuse should trim they cache
> aggressively when mds recovers.
>

So the bug (not sure if I can call it a bug) is already fixed in a newer
kernel? Can I just update the kernel and expect this to be fixed?
Could you tell me which kernel version?


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*

>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs mds millions of caps

2017-12-15 Thread Webert de Souza Lima
Hello, Mr. Yan

On Thu, Dec 14, 2017 at 11:36 PM, Yan, Zheng  wrote:

>
> The client hold so many capabilities because kernel keeps lots of
> inodes in its cache. Kernel does not trim inodes by itself if it has
> no memory pressure. It seems you have set mds_cache_size config to a
> large value.


Yes, I have set mds_cache_size = 300
I usually set this value according to the number of ceph.dir.rentries in
cephfs. Isn't that a good approach?

I have 2 directories in cephfs root, sum of ceph.dir.rentries is 4670933,
for which I would set mds_cache_size to 5M (if I had enough RAM for that in
the MDS server).

# getfattr -d -m ceph.dir.* index
# file: index
ceph.dir.entries="776"
ceph.dir.files="0"
ceph.dir.rbytes="52742318965"
ceph.dir.rctime="1513334528.09909569540"
ceph.dir.rentries="709233"
ceph.dir.rfiles="459512"
ceph.dir.rsubdirs="249721"
ceph.dir.subdirs="776"


# getfattr -d -m ceph.dir.* mail
# file: mail
ceph.dir.entries="786"
ceph.dir.files="1"
ceph.dir.rbytes="15000378101390"
ceph.dir.rctime="1513334524.0993982498"
ceph.dir.rentries="3961700"
ceph.dir.rfiles="3531068"
ceph.dir.rsubdirs="430632"
ceph.dir.subdirs="785"
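
In case it helps, this is roughly how I collect those numbers and apply the
value; the mds name is a placeholder and 5000000 is just the sum above rounded
up:

  getfattr --only-values -n ceph.dir.rentries index
  getfattr --only-values -n ceph.dir.rentries mail
  ceph tell mds.<name> injectargs '--mds_cache_size 5000000'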


mds cache size isn't large enough, so mds does not ask
> the client to trim its inode cache neither. This can affect
> performance. we should make mds recognize idle client and ask idle
> client to trim its caps more aggressively
>

I think you mean that the mds cache IS large enough, right? So it doesn't
bother the clients.

This can affect performance. we should make mds recognize idle client and
> ask idle client to trim its caps more aggressively
>

One recurrent problem I have, which I guess is caused by a network issue
(ceph cluster in vrack), is that my MDS servers start switching which one is
active.
This happens after a lease_timeout occurs in the mon; then I get "dne in the
mds map" from the active MDS and it suicides.
Even though I use standby-replay, the standby takes from 15 min up to 2
hours to take over as active. I can see that it loads all inodes (by issuing
"perf dump mds" on the mds daemon).

So, the question is: if the number of caps were as low as it is supposed to be
(around 300k) instead of 5M, would the MDS become active faster in such a
failure case?

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs mds millions of caps

2017-12-14 Thread Webert de Souza Lima
Hi Patrick,

On Thu, Dec 14, 2017 at 7:52 PM, Patrick Donnelly 
 wrote:

>
> It's likely you're a victim of a kernel backport that removed a dentry
> invalidation mechanism for FUSE mounts. The result is that ceph-fuse
> can't trim dentries.
>

Even though I'm not using FUSE? I'm using kernel mounts.



> I suggest setting that config manually to false on all of your clients
>

OK, how do I do that?

Many thanks.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs mds millions of caps

2017-12-14 Thread Webert de Souza Lima
Hi,

I've been looking at the ceph mds perf counters and I saw that one of my
clusters was hugely different from the other in the number of caps:

rlat inos  caps  | hsr  hcs   hcr | writ read actv  | recd recy stry  purg
| segs evts subm
  0  3.0M 5.1M |  0 0 595 | 30440 |  0   0   13k
0  | 42 35k   893
  0  3.0M 5.1M |  0 0 165 | 1.8k   437   |  0   0   13k   0
 | 43 36k   302
16  3.0M 5.1M |  0 0 429 | 24794 |  0   0   13k
58| 38 32k   1.7k
  0  3.0M 5.1M |  0 1 213 | 1.2k   0857 |  0   0   13k   0
 | 40 33k   766
23  3.0M 5.1M |  0 0 945 | 44510 |  0   0   13k   0
 | 41 34k   1.1k
  0  3.0M 5.1M |  0 2 696 | 376   11   0 |  0   0   13k   0
 | 43 35k   1.0k
  3  2.9M 5.1M |  0 0 601 | 2.0k   60 |  0   0   13k
56| 38 29k   1.2k
  0  2.9M 5.1M |  0 0 394 | 272   11   0 |  0   0   13k   0
 | 38 30k   758

on another cluster running the same version:

-mds-- --mds_server-- ---objecter--- -mds_cache-
---mds_log
rlat inos caps  | hsr  hcs  hcr  | writ read actv | recd recy stry purg |
segs evts subm
  2  3.9M 380k |  01 266 | 1.8k   0   370  |  0   0   24k  44
 |  37  129k  1.5k


I did a perf dump on the active mds:

~# ceph daemon mds.a perf dump mds
{
"mds": {
"request": 2245276724,
"reply": 2245276366,
"reply_latency": {
"avgcount": 2245276366,
"sum": 18750003.074118977
},
"forward": 0,
"dir_fetch": 20217943,
"dir_commit": 555295668,
"dir_split": 0,
"inode_max": 300,
"inodes": 3000276,
"inodes_top": 152555,
"inodes_bottom": 279938,
"inodes_pin_tail": 2567783,
"inodes_pinned": 2782064,
"inodes_expired": 308697104,
"inodes_with_caps": 2779658,
"caps": 5147887,
"subtrees": 2,
"traverse": 2582452087,
"traverse_hit": 2338123987,
"traverse_forward": 0,
"traverse_discover": 0,
"traverse_dir_fetch": 16627249,
"traverse_remote_ino": 29276,
"traverse_lock": 2507504,
"load_cent": 18446743868740589422,
"q": 27,
"exported": 0,
"exported_inodes": 0,
"imported": 0,
"imported_inodes": 0
}
}

and then a session ls to see what clients could be holding that much:

   {
  "client_metadata" : {
 "entity_id" : "admin",
 "kernel_version" : "4.4.0-97-generic",
 "hostname" : "suppressed"
  },
  "completed_requests" : 0,
  "id" : 1165169,
  "num_leases" : 343,
  "inst" : "client.1165169 10.0.0.112:0/982172363",
  "state" : "open",
  "num_caps" : 111740,
  "reconnecting" : false,
  "replay_requests" : 0
   },
   {
  "state" : "open",
  "replay_requests" : 0,
  "reconnecting" : false,
  "num_caps" : 108125,
  "id" : 1236036,
  "completed_requests" : 0,
  "client_metadata" : {
 "hostname" : "suppressed",
 "kernel_version" : "4.4.0-97-generic",
 "entity_id" : "admin"
  },
  "num_leases" : 323,
  "inst" : "client.1236036 10.0.0.113:0/1891451616"
   },
   {
  "num_caps" : 63186,
  "reconnecting" : false,
  "replay_requests" : 0,
  "state" : "open",
  "num_leases" : 147,
  "completed_requests" : 0,
  "client_metadata" : {
 "kernel_version" : "4.4.0-75-generic",
 "entity_id" : "admin",
 "hostname" : "suppressed"
  },
  "id" : 1235930,
  "inst" : "client.1235930 10.0.0.110:0/2634585537"
   },
   {
  "num_caps" : 2476444,
  "replay_requests" : 0,
  "reconnecting" : false,
  "state" : "open",
  "num_leases" : 0,
  "completed_requests" : 0,
  "client_metadata" : {
 "entity_id" : "admin",
 "kernel_version" : "4.4.0-75-generic",
 "hostname" : "suppressed"
  },
  "id" : 1659696,
  "inst" : "client.1659696 10.0.0.101:0/4005556527"
   },
   {
  "state" : "open",
  "replay_requests" : 0,
  "reconnecting" : false,
  "num_caps" : 2386376,
  "id" : 1069714,
  "client_metadata" : {
 "hostname" : "suppressed",
 "kernel_version" : "4.4.0-75-generic",
 "entity_id" : "admin"
  },
  "completed_requests" : 0,
  "num_leases" : 0,
  "inst" : "client.1069714 10.0.0.111:0/1876172355"
   },
   {
  "replay_requests" : 0,
  "reconnecting" : false,
  "num_caps" : 1726,
  "state" : "open",
  "inst" : "client.8394 10.0.0.103:0/3970353996",
  "num_leases" : 0,
  "id" : 8394,
  "client_metadata" : {
 "entity_id" : "admin",
 "kernel_version" : "4.4.0-75-generic",
 "hostname" : "suppressed"
  },
  "completed_requests" : 0
   }


Surprisingly, the 2 hosts that were holding 2M+ caps 

Re: [ceph-users] cephfs automatic data pool cleanup

2017-12-13 Thread Webert de Souza Lima
I have experienced delayed freeing of used space before, in Jewel, but that
just stopped happening with no intervention.
Back then, unmounting all clients' filesystems would make it free the space
rapidly.
I don't know if that's related.


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deterministic naming of LVM volumes (ceph-volume)

2017-12-13 Thread Webert de Souza Lima
On Wed, Dec 13, 2017 at 11:51 AM, Stefan Kooman  wrote:

> If we want to remove the OSD (for whatever reason) and release the ID then
> we
> will use "ceph osd purge* osd.$ID" ... which basically does what you
> suggest (ceph auth del osd.$OSD_ID, crush remove osd.$OSD_ID, ceph osd
> rm osd.$OSD_ID).
>

Perfect. Thanks for clarifying.


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deterministic naming of LVM volumes (ceph-volume)

2017-12-13 Thread Webert de Souza Lima
Cool


On Wed, Dec 13, 2017 at 11:04 AM, Stefan Kooman  wrote:

> So, a "ceph osd ls" should give us a list, and we will pick the smallest
> available number as the new osd id to use. We will make a check in the
> (ansible) deployment code to see Ceph will indeed use that number.
>
> Thanks,
>
> Gr. Stefan



Take into account that if an ID is available within a gap, it means that it
might have been used before, so maybe you'll still need to include `ceph
osd rm $ID` [and/or `ceph auth del osd.$ID`] to make sure that ID will be
usable.
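
That is, something like this, assuming $ID is the gap ID you picked:

  ceph auth del osd.$ID
  ceph osd rm $ID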


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*

>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deterministic naming of LVM volumes (ceph-volume)

2017-12-13 Thread Webert de Souza Lima
Hello,

On Wed, Dec 13, 2017 at 7:36 AM, Stefan Kooman  wrote:

> Hi,
>
> Is there a way to ask Ceph which OSD_ID
> would be next up?


If I may suggest, "ceph osd create" allocates and returns an OSD ID, so you
could grab it by doing:

 ID=$(ceph osd create)

then remove it with

 ceph osd rm $ID

Now you have the $ID and you can deploy it with ceph-volume (I hope it
works)
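
For the deploy step I would expect something like the following to work,
assuming your ceph-volume build already accepts an explicit --osd-id (and
/dev/sdX is just a placeholder device):

  ceph-volume lvm create --data /dev/sdX --osd-id $ID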


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW uploaded objects integrity

2017-12-07 Thread Webert de Souza Lima
I have had many cases of corrupt objects in my radosgw cluster. Until now I
have looked at it as a software (my software) bug, still unresolved, as the
incidence has lowered a lot.



Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*

On Thu, Dec 7, 2017 at 3:11 PM, Robert Stanford 
wrote:

>
> I did some benchmarking with cosbench and found that successful uploads
> (as shown in the output report) was not 100% unless I used the
> "hashCheck=True" flag in the cosbench configuration file.  Under high load,
> the percent successful was significantly lower (say, 80%).
>
> Has anyone dealt with object integrity issues at high throughputs, when
> using RGW?  Any suggestions on ensuring what goes in to the cluster is the
> same as what comes out?  Do we have to manually check the integrity of the
> objects we upload, like cosbench does, every time?
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] luminous ubuntu 16.04 HWE (4.10 kernel). ceph-disk can't prepare a disk

2017-10-24 Thread Webert de Souza Lima
When you unmount the device, is the raised error still the same?


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Mon, Oct 23, 2017 at 4:46 AM, Wido den Hollander  wrote:

>
> > Op 22 oktober 2017 om 18:45 schreef Sean Sullivan :
> >
> >
> > On freshly installed ubuntu 16.04 servers with the HWE kernel selected
> > (4.10). I can not use ceph-deploy or ceph-disk to provision osd.
> >
> >
> >  whenever I try I get the following::
> >
> > ceph-disk -v prepare --dmcrypt --dmcrypt-key-dir /etc/ceph/dmcrypt-keys
> > --bluestore --cluster ceph --fs-type xfs -- /dev/sdy
> > command: Running command: /usr/bin/ceph-osd --cluster=ceph
> > --show-config-value=fsid
> > get_dm_uuid: get_dm_uuid /dev/sdy uuid path is
> /sys/dev/block/65:128/dm/uuid
> > set_type: Will colocate block with data on /dev/sdy
> > command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd.
> > --lookup bluestore_block_size
> > [command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd.
> > --lookup bluestore_block_db_size
> > command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd.
> > --lookup bluestore_block_size
> > command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd.
> > --lookup bluestore_block_wal_size
> > get_dm_uuid: get_dm_uuid /dev/sdy uuid path is
> /sys/dev/block/65:128/dm/uuid
> > get_dm_uuid: get_dm_uuid /dev/sdy uuid path is
> /sys/dev/block/65:128/dm/uuid
> > get_dm_uuid: get_dm_uuid /dev/sdy uuid path is
> /sys/dev/block/65:128/dm/uuid
> > Traceback (most recent call last):
> >   File "/usr/sbin/ceph-disk", line 9, in 
> > load_entry_point('ceph-disk==1.0.0', 'console_scripts',
> 'ceph-disk')()
> >   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5704,
> in
> > run
> > main(sys.argv[1:])
> >   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5655,
> in
> > main
> > args.func(args)
> >   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2091,
> in
> > main
> > Prepare.factory(args).prepare()
> >   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2080,
> in
> > prepare
> > self._prepare()
> >   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2154,
> in
> > _prepare
> > self.lockbox.prepare()
> >   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2842,
> in
> > prepare
> > verify_not_in_use(self.args.lockbox, check_partitions=True)
> >   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 950,
> in
> > verify_not_in_use
> > raise Error('Device is mounted', partition)
> > ceph_disk.main.Error: Error: Device is mounted: /dev/sdy5
> >
> > unmounting the disk does not seem to help either. I'm assuming something
> is
> > triggering too early but i'm not sure how to delay or figure that out.
> >
> > has anyone deployed on xenial with the 4.10 kernel? Am I missing
> something
> > important?
>
> Yes I have without any issues, I've did:
>
> $ ceph-disk prepare /dev/sdb
>
> Luminous default to BlueStore and that worked just fine.
>
> Yes, this is with a 4.10 HWE kernel from Ubuntu 16.04.
>
> Wido
>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help with full osd and RGW not responsive

2017-10-18 Thread Webert de Souza Lima
Hi Bryan.

I hope that solved it for you.
Another thing you can do in situations like this is to set the full_ratio
higher so you can work on the problem. Always set it back to a safe value
after the issue is solved.

*ceph pg set_full_ratio 0.98*
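
and, once the reweight or backfill brings that OSD back below the normal
threshold, put it back, e.g.:

*ceph pg set_full_ratio 0.95*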



Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Tue, Oct 17, 2017 at 6:52 PM, Bryan Banister 
wrote:

> Thanks for the response, we increased our pg count to something more
> reasonable (512 for now) and things are rebalancing.
>
>
>
> Cheers,
>
> -Bryan
>
>
>
> *From:* Andreas Calminder [mailto:andreas.calmin...@klarna.com]
> *Sent:* Tuesday, October 17, 2017 3:48 PM
> *To:* Bryan Banister 
> *Cc:* Ceph Users 
> *Subject:* Re: [ceph-users] Help with full osd and RGW not responsive
>
>
>
> *Note: External Email*
> --
>
> Hi,
>
> You should most definitely look over number of pgs, there's a pg
> calculator available here: http://ceph.com/pgcalc/
>
>
>
> You can increase pgs but not the other way around (
> http://docs.ceph.com/docs/jewel/rados/operations/placement-groups/)
>
>
>
> To solve the immediate problem with your cluster being full you can
> reweight your osds, giving the full osd a lower weight will cause writes
> going to other osds and data on that osd being migrated to other osds in
> the cluster: ceph osd reweight $OSDNUM $WEIGHT, described here
> http://docs.ceph.com/docs/master/rados/operations/control/#osd-subsystem
>
>
>
> When the osd isn't above the full threshold, default is 95%, the cluster
> will clear its full flag and your radosgw should start accepting write
> operations again, at least until another osd gets full, main problem here
> is probably the low pg count.
>
>
>
> Regards,
>
> Andreas
>
>
>
> On 17 Oct 2017 19:08, "Bryan Banister"  wrote:
>
> Hi all,
>
>
>
> Still a real novice here and we didn’t set up our initial RGW cluster very
> well.  We have 134 osds and set up our RGW pool with only 64 PGs, thus not
> all of our OSDs got data and now we have one that is 95% full.
>
>
>
> This apparently has put the cluster into a HEALTH_ERR condition:
>
> [root@carf-ceph-osd01 ~]# ceph health detail
>
> HEALTH_ERR full flag(s) set; 1 full osd(s); 1 pools have many more objects
> per pg than average; application not enabled on 6 pool(s); too few PGs per
> OSD (26 < min 30)
>
> OSDMAP_FLAGS full flag(s) set
>
> OSD_FULL 1 full osd(s)
>
> osd.5 is full
>
> MANY_OBJECTS_PER_PG 1 pools have many more objects per pg than average
>
> pool carf01.rgw.buckets.data objects per pg (602762) is more than
> 18.3752 times cluster average (32803)
>
>
>
> There is plenty of space on most of the OSDs and don’t know how to go
> about fixing this situation.  If we update the pg_num and pgp_num settings
> for this pool, can we rebalance the data across the OSDs?
>
>
>
> Also, seems like this is causing a problem with the RGWs, which was
> reporting this error in the logs:
>
> 2017-10-16 16:36:47.534461 7fffe6c5c700  1 heartbeat_map is_healthy
> 'RGWAsyncRadosProcessor::m_tp thread 0x7fffdc447700' had timed out after 600
>
>
>
> After trying to restart the RGW, we see this now:
>
> 2017-10-17 10:40:38.517002 7fffe6c5c700  1 heartbeat_map is_healthy
> 'RGWAsyncRadosProcessor::m_tp thread 0x7fffddc4a700' had timed out after 600
>
> 2017-10-17 10:40:42.124046 77fd4e00  0 deferred set uid:gid to 167:167
> (ceph:ceph)
>
> 2017-10-17 10:40:42.124162 77fd4e00  0 ceph version 12.2.0 (
> 32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process
> (unknown), pid 65313
>
> 2017-10-17 10:40:42.245259 77fd4e00  0 client.769905.objecter  FULL,
> paused modify 0x5662fb00 tid 0
>
> 2017-10-17 10:45:42.124283 7fffe7bcf700 -1 Initialization timeout, failed
> to initialize
>
> 2017-10-17 10:45:42.353496 77fd4e00  0 deferred set uid:gid to 167:167
> (ceph:ceph)
>
> 2017-10-17 10:45:42.353618 77fd4e00  0 ceph version 12.2.0 (
> 32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process
> (unknown), pid 71842
>
> 2017-10-17 10:45:42.388621 77fd4e00  0 client.769986.objecter  FULL,
> paused modify 0x5662fb00 tid 0
>
> 2017-10-17 10:50:42.353731 7fffe7bcf700 -1 Initialization timeout, failed
> to initialize
>
>
>
> Seems pretty evident that the “FULL, paused” is a problem.  So if I fix
> the first issue the RGW should be ok after?
>
>
>
> Thanks in advance,
>
> -Bryan
>
>
> --
>
>
> Note: This email is for the confidential use of the named addressee(s)
> only and may contain proprietary, confidential or privileged information.
> If you are not the intended recipient, you are hereby notified that any
> review, dissemination or copying of this email is strictly prohibited, and
> to please notify the sender immediately and destroy this email and any
> attachments. Email transmission cannot be guaranteed to be secure or
> error-free. 

Re: [ceph-users] ceph osd disk full (partition 100% used)

2017-10-11 Thread Webert de Souza Lima
That sounds like it. Thanks David.
I wonder if that behavior of ignoring the OSD full_ratio is intentional.


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Wed, Oct 11, 2017 at 12:26 PM, David Turner 
wrote:

> The full ratio is based on the max bytes.  if you say that the cache
> should have a max bytes of 1TB and that the full ratio is .8, then it will
> aim to keep it at 800GB.  Without a max bytes value set, the ratios are a
> percentage of unlimited... aka no limit themselves.  The full_ratio should
> be respected, but this is the second report of a cache tier reaching 100%
> this month so I'm guessing that the caching mechanisms might ignore those
> OSD settings in preference of the cache tier settings that were set
> improperly.
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph osd disk full (partition 100% used)

2017-10-11 Thread Webert de Souza Lima
Hi,

I have a cephfs cluster as follows:

1 15x HDD data pool (primary cephfs data pool)
1 2x SSD data pool (linked to a specific dir via xattrs)
1 2x SSD metadata pool
1 2x SSD cache tier pool

The cache tier pool consists of 2 hosts, with one SSD OSD on each host, with
size=2 replicated by host.
Last night the disks went 100% full and the cluster went down.

I know I made a mistake and set target_max_objects and target_max_bytes to
0 in the cache pool,
but isn't ceph supposed to stop writing to an OSD when it reaches
its full_ratio (default 0.95)?
And what about the cache_target_full_ratio in the cache tier pool?
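
For the record, this is more or less what I intend to set on the cache pool
once things calm down, so it never grows unbounded again; the byte value is
just the ~320GB target I had in mind and 0.8 is only an example:

  ceph osd pool set cephfs_cache target_max_bytes 343597383680
  ceph osd pool set cephfs_cache cache_target_full_ratio 0.8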

Here is the cluster:

~# ceph fs ls
name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data
cephfs_data_ssd ]

* the Metadata and the SSD data pools use the same 2 OSDs (one cephfs
directory is linked to the SSD data pool via xattrs)

~# ceph -v
ceph version 10.2.9-4-gbeaec39 (beaec397f00491079cd74f7b9e3e10660859e26b)

~# ceph osd pool ls detail
pool 1 'cephfs_data' replicated size 3 min_size 2 crush_ruleset 0
object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 136 lfor 115
flags hashpspool crash_replay_interval 45 tiers 3 read_tier 3 write_tier 3
stripe_width 0
pool 2 'cephfs_metadata' replicated size 2 min_size 1 crush_ruleset 2
object_hash rjenkins pg_num 128 pgp_num 128 last_change 617 flags
hashpspool stripe_width 0
pool 3 'cephfs_cache' replicated size 2 min_size 1 crush_ruleset 1
object_hash rjenkins pg_num 128 pgp_num 128 last_change 1493 lfor 115 flags
hashpspool,incomplete_clones tier_of 1 cache_mode writeback target_bytes
343597383680 hit_set bloom{false_positive_probability: 0.05, target_size:
0, seed: 0} 0s x0 decay_rate 0 search_last_n 0 stripe_width 0
pool 12 'cephfs_data_ssd' replicated size 2 min_size 1 crush_ruleset 2
object_hash rjenkins pg_num 128 pgp_num 128 last_change 653 flags
hashpspool stripe_width 0

~# ceph osd tree
ID  WEIGHT   TYPE NAME                        UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -8  0.17598 root default-ssd
 -9  0.09299     host bhs1-mail03-fe01-data
 17  0.09299         osd.17                       up  1.0      1.0
-10  0.08299     host bhs1-mail03-fe02-data
 18  0.08299         osd.18                       up  1.0      1.0
 -7  0.86319 root cache-ssd
 -5  0.43159     host bhs1-mail03-fe01
 15  0.43159         osd.15                       up  1.0      1.0
 -6  0.43159     host bhs1-mail03-fe02
 16  0.43159         osd.16                       up  1.0      1.0
 -1 79.95895 root default
 -2 26.65298     host bhs1-mail03-ds01
  0  5.33060         osd.0                        up  1.0      1.0
  1  5.33060         osd.1                        up  1.0      1.0
  2  5.33060         osd.2                        up  1.0      1.0
  3  5.33060         osd.3                        up  1.0      1.0
  4  5.33060         osd.4                        up  1.0      1.0
 -3 26.65298     host bhs1-mail03-ds02
  5  5.33060         osd.5                        up  1.0      1.0
  6  5.33060         osd.6                        up  1.0      1.0
  7  5.33060         osd.7                        up  1.0      1.0
  8  5.33060         osd.8                        up  1.0      1.0
  9  5.33060         osd.9                        up  1.0      1.0
 -4 26.65298     host bhs1-mail03-ds03
 10  5.33060         osd.10                       up  1.0      1.0
 12  5.33060         osd.12                       up  1.0      1.0
 13  5.33060         osd.13                       up  1.0      1.0
 14  5.33060         osd.14                       up  1.0      1.0
 19  5.33060         osd.19                       up  1.0      1.0

~# ceph osd crush rule dump
[
{
"rule_id": 0,
"rule_name": "replicated_ruleset",
"ruleset": 0,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 1,
"rule_name": "replicated_ruleset_ssd",
"ruleset": 1,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -7,
"item_name": "cache-ssd"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 2,
"rule_name": "replicated-data-ssd",
"ruleset": 2,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -8,
  

Re: [ceph-users] Ceph stuck creating pool

2017-10-03 Thread Webert de Souza Lima
This looks like something wrong with the crush rule.

What's the size, min_size and crush_rule of this pool?
 ceph osd pool get POOLNAME size
 ceph osd pool get POOLNAME min_size
 ceph osd pool get POOLNAME crush_ruleset

How is the crush rule?
 ceph osd crush rule dump


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Tue, Oct 3, 2017 at 11:22 AM, Guilherme Lima  wrote:

> Hi,
>
>
>
> I have installed a virtual Ceph Cluster lab. I using Ceph Luminous v12.2.1
>
> It consist in 3 mon + 3 osd nodes.
>
> Each node have 3 x 250GB OSD.
>
>
>
> My osd tree:
>
>
>
> ID CLASS WEIGHT  TYPE NAME  STATUS REWEIGHT PRI-AFF
>
> -1   2.19589 root default
>
> -3   0.73196 host osd1
>
> 0   hdd 0.24399 osd.0  up  1.0 1.0
>
> 6   hdd 0.24399 osd.6  up  1.0 1.0
>
> 9   hdd 0.24399 osd.9  up  1.0 1.0
>
> -5   0.73196 host osd2
>
> 1   hdd 0.24399 osd.1  up  1.0 1.0
>
> 7   hdd 0.24399 osd.7  up  1.0 1.0
>
> 10   hdd 0.24399 osd.10 up  1.0 1.0
>
> -7   0.73196 host osd3
>
> 2   hdd 0.24399 osd.2  up  1.0 1.0
>
> 8   hdd 0.24399 osd.8  up  1.0 1.0
>
> 11   hdd 0.24399 osd.11 up  1.0 1.0
>
>
>
> After create a new pool it is stuck on creating+peering and
> creating+activating.
>
>
>
>   cluster:
>
> id: d20fdc12-f8bf-45c1-a276-c36dfcc788bc
>
> health: HEALTH_WARN
>
> Reduced data availability: 256 pgs inactive, 143 pgs peering
>
> Degraded data redundancy: 256 pgs unclean
>
>
>
>   services:
>
> mon: 3 daemons, quorum mon2,mon3,mon1
>
> mgr: mon1(active), standbys: mon2, mon3
>
> osd: 9 osds: 9 up, 9 in
>
>
>
>   data:
>
> pools:   1 pools, 256 pgs
>
> objects: 0 objects, 0 bytes
>
> usage:   10202 MB used, 2239 GB / 2249 GB avail
>
> pgs: 100.000% pgs not active
>
>  143 creating+peering
>
>  113 creating+activating
>
>
>
> Can anyone help to find the issue?
>
>
>
> Thanks
>
> Guilherme
>
>
>
>
>
>
>
>
>
> This email and any files transmitted with it are confidential and intended
> solely for the use of the individual or entity to whom they are addressed.
> If you have received this email in error please notify the system manager.
> This message contains confidential information and is intended only for the
> individual named. If you are not the named addressee you should not
> disseminate, distribute or copy this e-mail. Please notify the sender
> immediately by e-mail if you have received this e-mail by mistake and
> delete this e-mail from your system. If you are not the intended recipient
> you are notified that disclosing, copying, distributing or taking any
> action in reliance on the contents of this information is strictly
> prohibited.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW how to delete orphans

2017-09-28 Thread Webert de Souza Lima
When I had to use that, I just took it for granted that it worked, so I can't
really tell you whether that's all there is to it.

:|
:|


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Thu, Sep 28, 2017 at 1:31 PM, Andreas Calminder <
andreas.calmin...@klarna.com> wrote:

> Hi,
> Yes I'm able to run these commands, however it is unclear both in man file
> and the docs what's supposed to happen with the orphans, will they be
> deleted once I run finish? Or will that just throw away the job? What will
> orphans find actually produce? At the moment it just outputs a lot of text
> saying something like putting $num in orphans.$jobid.$shardnum and listing
> objects that are not orphans?
>
> Regards,
> Andreas
>
> On 28 Sep 2017 15:10, "Webert de Souza Lima" <webert.b...@gmail.com>
> wrote:
>
> Hello,
>
> not an expert here but I think the answer is something like:
>
> radosgw-admin orphans find --pool=_DATA_POOL_ --job-id=_JOB_ID_
> radosgw-admin orphans finish --job-id=_JOB_ID_
>
> _JOB_ID_ being anything.
>
>
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> *Belo Horizonte - Brasil*
>
> On Thu, Sep 28, 2017 at 9:38 AM, Andreas Calminder <
> andreas.calmin...@klarna.com> wrote:
>
>> Hello,
>> running Jewel on some nodes with rados gateway I've managed to get a
>> lot of leaked multipart objects, most of them belonging to buckets
>> that do not even exist anymore. We estimated these objects to occupy
>> somewhere around 60TB, which would be great to reclaim. Question is
>> how, since trying to find them one by one and perform some kind of
>> sanity check if they're in use or not will take forever.
>>
>> The radosgw-admin orphans find command sounds like something I could
>> use, but it's not clear if the command also removes the orphans? If
>> not, what does it do? Can I use it to help me removing my orphan
>> objects?
>>
>> Best regards,
>> Andreas
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW how to delete orphans

2017-09-28 Thread Webert de Souza Lima
Hello,

not an expert here but I think the answer is something like:

radosgw-admin orphans find --pool=_DATA_POOL_ --job-id=_JOB_ID_
radosgw-admin orphans finish --job-id=_JOB_ID_

_JOB_ID_ being anything.
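
If I remember right there is also a subcommand to see which jobs exist:

  radosgw-admin orphans list-jobs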



Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Thu, Sep 28, 2017 at 9:38 AM, Andreas Calminder <
andreas.calmin...@klarna.com> wrote:

> Hello,
> running Jewel on some nodes with rados gateway I've managed to get a
> lot of leaked multipart objects, most of them belonging to buckets
> that do not even exist anymore. We estimated these objects to occupy
> somewhere around 60TB, which would be great to reclaim. Question is
> how, since trying to find them one by one and perform some kind of
> sanity check if they're in use or not will take forever.
>
> The radosgw-admin orphans find command sounds like something I could
> use, but it's not clear if the command also removes the orphans? If
> not, what does it do? Can I use it to help me removing my orphan
> objects?
>
> Best regards,
> Andreas
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] lease_timeout - new election

2017-08-25 Thread Webert de Souza Lima
Oh god

root@bhs1-mail03-ds03:~# zgrep "lease" /var/log/ceph/*.gz
/var/log/ceph/ceph-mon.bhs1-mail03-ds03.log.2.gz:2017-08-24 06:39:22.384112
7f44c60f1700  1 mon.bhs1-mail03-ds03@2(peon).paxos(paxos updating c
8973251..8973960) lease_timeout -- calling new election
/var/log/ceph/ceph-mon.bhs1-mail03-ds03.log.2.gz:2017-08-24 06:39:37.985416
7f44c60f1700  1 mon.bhs1-mail03-ds03@2(peon).paxos(paxos updating c
8973251..8973960) lease_timeout -- calling new election
/var/log/ceph/ceph-mon.bhs1-mail03-ds03.log.3.gz:2017-08-22 06:43:46.394519
7f44c60f1700  1 mon.bhs1-mail03-ds03@2(peon).paxos(paxos updating c
8715223..8715882) lease_timeout -- calling new election
/var/log/ceph/ceph-mon.bhs1-mail03-ds03.log.3.gz:2017-08-22 06:44:00.518659
7f44c60f1700  1 mon.bhs1-mail03-ds03@2(peon).paxos(paxos updating c
8715223..8715882) lease_timeout -- calling new election
/var/log/ceph/ceph-mon.bhs1-mail03-ds03.log.3.gz:2017-08-22 06:44:16.553480
7f44c60f1700  1 mon.bhs1-mail03-ds03@2(peon).paxos(paxos updating c
8715223..8715883) lease_timeout -- calling new election
/var/log/ceph/ceph-mon.bhs1-mail03-ds03.log.3.gz:2017-08-22 06:44:42.974001
7f44c60f1700  1 mon.bhs1-mail03-ds03@2(peon).paxos(paxos recovering c
8715223..8715890) lease_timeout -- calling new election
/var/log/ceph/ceph-mon.bhs1-mail03-ds03.log.4.gz:2017-08-22 06:40:42.051277
7f44c60f1700  1 mon.bhs1-mail03-ds03@2(peon).paxos(paxos updating c
8715223..8715842) lease_timeout -- calling new election
/var/log/ceph/ceph-mon.bhs1-mail03-ds03.log.4.gz:2017-08-22 06:41:23.900840
7f44c60f1700  1 mon.bhs1-mail03-ds03@2(peon).paxos(paxos updating c
8715223..8715851) lease_timeout -- calling new election
/var/log/ceph/ceph-mon.bhs1-mail03-ds03.log.4.gz:2017-08-22 06:42:55.159764
7f44c60f1700  1 mon.bhs1-mail03-ds03@2(peon).paxos(paxos recovering c
8715223..8715872) lease_timeout -- calling new election
/var/log/ceph/ceph-mon.bhs1-mail03-ds03.log.7.gz:2017-08-19 06:35:54.072809
7f44c60f1700  1 mon.bhs1-mail03-ds03@2(peon).paxos(paxos updating c
8317388..8318114) lease_timeout -- calling new election
/var/log/ceph/ceph-mon.bhs1-mail03-ds03.log.7.gz:2017-08-19 06:36:37.616807
7f44c60f1700  1 mon.bhs1-mail03-ds03@2(peon).paxos(paxos recovering c
8317388..8318126) lease_timeout -- calling new election
/var/log/ceph/ceph-mon.bhs1-mail03-ds03.log.7.gz:2017-08-19 06:37:03.650974
7f44c60f1700  1 mon.bhs1-mail03-ds03@2(peon).paxos(paxos recovering c
8317388..8318132) lease_timeout -- calling new election



Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Mon, Aug 21, 2017 at 10:34 AM, Webert de Souza Lima <
webert.b...@gmail.com> wrote:

> I really need some help through this.
>
> This is happening very frequently and I can't seem to figure out why.
> My services rely on cephfs and when this happens, the mds suicides.
>
> It's always the same, see the last occurrence logs:
>
> host bhs1-mail03-ds03:
>
> 2017-08-19 06:35:54.072809 7f44c60f1700  1 
> mon.bhs1-mail03-ds03@2(peon).paxos(paxos
> updating c 8317388..8318114) lease_timeout -- calling new election
> 2017-08-19 06:35:54.073777 7f44c58f0700  0 log_channel(cluster) log [INF]
> : mon.bhs1-mail03-ds03 calling new monitor election
>
> host bhs1-mail03-ds02:
>
> 2017-08-19 06:35:54.163225 7fa963d5d700  0 log_channel(cluster) log [INF]
> : mon.bhs1-mail03-ds02 calling new monitor election
> 2017-08-19 06:35:54.163373 7fa963d5d700  1 
> mon.bhs1-mail03-ds02@1(electing).elector(59)
> init, last seen epoch 59
> 2017-08-19 06:35:56.066938 7fa96455e700  0 
> mon.bhs1-mail03-ds02@1(electing).data_health(58)
> update_stats avail 76% total 19555 MB, used 3553 MB, avail 14986 MB
> 2017-08-19 06:35:59.655130 7fa96455e700  0 log_channel(cluster) log [INF]
> : mon.bhs1-mail03-ds02@1 won leader election with quorum 1,2
> 2017-08-19 06:36:00.087081 7fa96455e700  0 log_channel(cluster) log [INF]
> : HEALTH_WARN; 1 mons down, quorum 1,2 bhs1-mail03-ds02,bhs1-mail03-ds03
> 2017-08-19 06:36:00.456332 7fa965f7f700  0 log_channel(cluster) log [INF]
> : monmap e2: 3 mons at {bhs1-mail03-ds01=10.0.0.101:
> 6789/0,bhs1-mail03-ds02=10.0.0.102:6789/0,bhs1-mail03-ds03=
> 10.0.0.103:6789/0}
> 2017-08-19 06:36:00.456511 7fa965f7f700  0 log_channel(cluster) log [INF]
> : pgmap v4088049: 1664 pgs: 1 active+clean+scrubbing+deep, 1663
> active+clean; 6840 GB data, 20295 GB used, 62641 GB / 82937 GB avail; 1019
>  kB/s rd, 142 kB/s wr, 24 op/s
> 2017-08-19 06:36:00.456588 7fa965f7f700  0 log_channel(cluster) log [INF]
> : fsmap e1366: 1/1/1 up {0=bhs1-mail03-ds02=up:active}, 1
> up:standby-replay
>
>
> host bhs1-mail03-ds01:
>
> 2017-08-19 06:36:09.995951 7f94e861c700  0 log_channel(cluster) log [INF]
> : mon.bhs1-mail03-ds01 calling new monitor election
> 2017-08-19 06:36:09.996005 7f94e861c700  1 
> mon.bhs1-mail03-ds01@0(electing).elector(59)
> init, last

Re: [ceph-users] lease_timeout - new election

2017-08-21 Thread Webert de Souza Lima
I really need some help with this.

This is happening very frequently and I can't seem to figure out why.
My services rely on cephfs, and when this happens, the mds suicides.

It's always the same; see the logs from the last occurrence:

host bhs1-mail03-ds03:

2017-08-19 06:35:54.072809 7f44c60f1700  1
mon.bhs1-mail03-ds03@2(peon).paxos(paxos
updating c 8317388..8318114) lease_timeout -- calling new election
2017-08-19 06:35:54.073777 7f44c58f0700  0 log_channel(cluster) log [INF] :
mon.bhs1-mail03-ds03 calling new monitor election

host bhs1-mail03-ds02:

2017-08-19 06:35:54.163225 7fa963d5d700  0 log_channel(cluster) log [INF] :
mon.bhs1-mail03-ds02 calling new monitor election
2017-08-19 06:35:54.163373 7fa963d5d700  1
mon.bhs1-mail03-ds02@1(electing).elector(59)
init, last seen epoch 59
2017-08-19 06:35:56.066938 7fa96455e700  0
mon.bhs1-mail03-ds02@1(electing).data_health(58)
update_stats avail 76% total 19555 MB, used 3553 MB, avail 14986 MB
2017-08-19 06:35:59.655130 7fa96455e700  0 log_channel(cluster) log [INF] :
mon.bhs1-mail03-ds02@1 won leader election with quorum 1,2
2017-08-19 06:36:00.087081 7fa96455e700  0 log_channel(cluster) log [INF] :
HEALTH_WARN; 1 mons down, quorum 1,2 bhs1-mail03-ds02,bhs1-mail03-ds03
2017-08-19 06:36:00.456332 7fa965f7f700  0 log_channel(cluster) log [INF] :
monmap e2: 3 mons at {bhs1-mail03-ds01=
10.0.0.101:6789/0,bhs1-mail03-ds02=10.0.0.102:6789/0,bhs1-mail03-ds03=10.0.0.103:6789/0
}
2017-08-19 06:36:00.456511 7fa965f7f700  0 log_channel(cluster) log [INF] :
pgmap v4088049: 1664 pgs: 1 active+clean+scrubbing+deep, 1663 active+clean;
6840 GB data, 20295 GB used, 62641 GB / 82937 GB avail; 1019
 kB/s rd, 142 kB/s wr, 24 op/s
2017-08-19 06:36:00.456588 7fa965f7f700  0 log_channel(cluster) log [INF] :
fsmap e1366: 1/1/1 up {0=bhs1-mail03-ds02=up:active}, 1 up:standby-replay


host bhs1-mail03-ds01:

2017-08-19 06:36:09.995951 7f94e861c700  0 log_channel(cluster) log [INF] :
mon.bhs1-mail03-ds01 calling new monitor election
2017-08-19 06:36:09.996005 7f94e861c700  1
mon.bhs1-mail03-ds01@0(electing).elector(59)
init, last seen epoch 59
2017-08-19 06:36:17.653441 7f94e861c700  0 log_channel(cluster) log [INF] :
mon.bhs1-mail03-ds01 calling new monitor election
2017-08-19 06:36:17.653505 7f94e861c700  1
mon.bhs1-mail03-ds01@0(electing).elector(61)
init, last seen epoch 61
2017-08-19 06:36:27.603721 7f94e861c700  0 log_channel(cluster) log [INF] :
mon.bhs1-mail03-ds01@0 won leader election with quorum 0,1,2
2017-08-19 06:36:28.579178 7f94e861c700  0 log_channel(cluster) log [INF] :
HEALTH_OK

the mds host (bhs1-mail03-ds02):
2017-08-19 06:36:06.858100 7f9295a76700  0 -- 10.0.0.102:6801/267422746 >>
10.0.0.103:0/3970353996 pipe(0x55b450c7c000 sd=19 :6801 s=2 pgs=5396 cs=1
l=0 c=0x55b4526de000).fault with nothing to send, going to standby
2017-08-19 06:36:06.936091 7f9295874700  0 -- 10.0.0.102:6801/267422746 >>
10.0.0.110:0/3302883508 pipe(0x55b450c7e800 sd=20 :6801 s=2 pgs=38167 cs=1
l=0 c=0x55b4526de300).fault with nothing to send, going to standby
2017-08-19 06:36:06.936122 7f9294e6a700  0 -- 10.0.0.102:6801/267422746 >>
10.0.0.101:0/2088724837 pipe(0x55b45cef0800 sd=22 :6801 s=2 pgs=360 cs=1
l=0 c=0x55b45daf3b00).fault with nothing to send, going to standby
2017-08-19 06:36:06.936169 7f9295672700  0 -- 10.0.0.102:6801/267422746 >>
10.0.0.111:0/2802182284 pipe(0x55b450c7d400 sd=21 :6801 s=2 pgs=30804 cs=1
l=0 c=0x55b4526de180).fault with nothing to send, going to standby
2017-08-19 06:36:11.412799 7f929c788700  1 mds.bhs1-mail03-ds02
handle_mds_map i (10.0.0.102:6801/267422746) dne in the mdsmap, respawning
myself
2017-08-19 06:36:11.412808 7f929c788700  1 mds.bhs1-mail03-ds02 respawn




Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Wed, Aug 9, 2017 at 10:53 AM, Webert de Souza Lima <webert.b...@gmail.com
> wrote:

> Hi David,
>
> thanks for your feedback.
>
> With that in mind, I did rm a 15TB RBD Pool about 1 hour or so before this
> had happened.
> I wouldn't think it would be related to this because there was nothing
> different going on after I removed it. Not even high system load.
>
> But considering what you sid, I think it could have been due to OSDs
> operations related to that pool removal.
>
>
>
>
>
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> *Belo Horizonte - Brasil*
>
> On Wed, Aug 9, 2017 at 10:15 AM, David Turner <drakonst...@gmail.com>
> wrote:
>
>> I just want to point out that there are many different types of network
>> issues that don't involve entire networks. Bad nic, bad/loose cable, a
>> service on a server restarting our modifying the network stack, etc.
>>
>> That said there are other things that can prevent an mds service, or any
>> service from responding to the mons and being wrongly marked down. It happens
>> to osds enough that t

Re: [ceph-users] lease_timeout - new election

2017-08-09 Thread Webert de Souza Lima
Hi David,

thanks for your feedback.

With that in mind, I did rm a 15TB RBD pool about an hour or so before this
happened.
I wouldn't have thought it was related, because there was nothing different
going on after I removed it. Not even high system load.

But considering what you said, I think it could have been due to OSD
operations related to that pool removal.






Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Wed, Aug 9, 2017 at 10:15 AM, David Turner <drakonst...@gmail.com> wrote:

> I just want to point out that there are many different types of network
> issues that don't involve entire networks. Bad nic, bad/loose cable, a
> service on a server restarting our modifying the network stack, etc.
>
> That said there are other things that can prevent an mds service, or any
> service from responding to the mons and being wrongly marked down. It happens
> to osds enough that they even have the ability to write in their logs that
> they were wrongly marked down. That usually happens when the service is so
> busy with an operation that it can't get to the request from the mon fast
> enough and it gets marked down. This could also be environment from the mds
> server. If something else on the host is using too many resources
> preventing the mds service from having what it needs, this could easily
> happen.
>
> What level of granularity do you have in your monitoring to tell what your
> system state was when this happened? Is there a time of day it is more
> likely to happen (expect to find a Cron at that time)?
>
> On Wed, Aug 9, 2017, 8:37 AM Webert de Souza Lima <webert.b...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I recently had a mds outage beucase the mds suicided due to "dne in the
>> mds map".
>> I've asked it here before and I know that happens because the monitors
>> took out this mds from the mds map even though it was alive.
>>
>> Weird thing, there was no network related issues happening at the time,
>> which if there was, it would impact many other systems.
>>
>> I found this in the mon logs, and i'd like to understand it better:
>>  lease_timeout -- calling new election
>>
>> full logs:
>>
>> 2017-08-08 23:12:33.286908 7f2b8398d700  1 leveldb: Manual compaction at
>> level-1 from 'pgmap_pg\x009.a' @ 1830392430 : 1 .. 'paxos\x0057687834' @ 0
>> : 0; will stop at (end)
>>
>> 2017-08-08 23:12:36.885087 7f2b86f9a700  0 
>> mon.bhs1-mail02-ds03@2(peon).data_health(3524)
>> update_stats avail 81% total 19555 MB, used 2632 MB, avail 15907 MB
>> 2017-08-08 23:13:29.357625 7f2b86f9a700  1 
>> mon.bhs1-mail02-ds03@2(peon).paxos(paxos
>> updating c 57687834..57688383) lease_timeout -- calling new election
>> 2017-08-08 23:13:29.358965 7f2b86799700  0 log_channel(cluster) log [INF]
>> : mon.bhs1-mail02-ds03 calling new monitor election
>> 2017-08-08 23:13:29.359128 7f2b86799700  1 
>> mon.bhs1-mail02-ds03@2(electing).elector(3524)
>> init, last seen epoch 3524
>> 2017-08-08 23:13:35.383530 7f2b86799700  1 mon.bhs1-mail02-ds03@2(peon).osd
>> e12617 e12617: 19 osds: 19 up, 19 in
>> 2017-08-08 23:13:35.605839 7f2b86799700  0 mon.bhs1-mail02-ds03@2(peon).mds
>> e18460 print_map
>> e18460
>> enable_multiple, ever_enabled_multiple: 0,0
>> compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable
>> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
>> uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
>>
>> Filesystem 'cephfs' (2)
>> fs_name cephfs
>> epoch   18460
>> flags   0
>> created 2016-08-01 11:07:47.592124
>> modified        2017-07-03 10:32:44.426431
>> tableserver 0
>> root0
>> session_timeout 60
>> session_autoclose   300
>> max_file_size   1099511627776
>> last_failure0
>> last_failure_osd_epoch  12617
>> compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
>> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
>> uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
>> max_mds 1
>> in  0
>> up  {0=1574278}
>> failed
>> damaged
>> stopped
>> data_pools  8,9
>> metadata_pool   7
>> inline_data disabled
>> 1574278:10.0.2.4:6800/2556733458 'd' mds.0.18460 up:replay seq 1
>> laggy since 2017-08-08 23:13:35.174109 (standby for rank 0)
>>
>>
>>
>> 2017-08-08 23:13:35.606303 7f2b86799700  0 log_channel(cluster) log [INF]
>> : mon.bhs1-mail02-ds03 callin

[ceph-users] lease_timeout - new election

2017-08-09 Thread Webert de Souza Lima
Hi,

I recently had an mds outage because the mds suicided due to "dne in the mds
map".
I've asked it here before and I know that happens because the monitors took
out this mds from the mds map even though it was alive.

Weird thing, there were no network-related issues happening at the time;
if there had been, they would have impacted many other systems.

I found this in the mon logs, and I'd like to understand it better:
 lease_timeout -- calling new election

full logs:

2017-08-08 23:12:33.286908 7f2b8398d700  1 leveldb: Manual compaction at
level-1 from 'pgmap_pg\x009.a' @ 1830392430 : 1 .. 'paxos\x0057687834' @ 0
: 0; will stop at (end)

2017-08-08 23:12:36.885087 7f2b86f9a700  0
mon.bhs1-mail02-ds03@2(peon).data_health(3524)
update_stats avail 81% total 19555 MB, used 2632 MB, avail 15907 MB
2017-08-08 23:13:29.357625 7f2b86f9a700  1
mon.bhs1-mail02-ds03@2(peon).paxos(paxos
updating c 57687834..57688383) lease_timeout -- calling new election
2017-08-08 23:13:29.358965 7f2b86799700  0 log_channel(cluster) log [INF] :
mon.bhs1-mail02-ds03 calling new monitor election
2017-08-08 23:13:29.359128 7f2b86799700  1
mon.bhs1-mail02-ds03@2(electing).elector(3524)
init, last seen epoch 3524
2017-08-08 23:13:35.383530 7f2b86799700  1 mon.bhs1-mail02-ds03@2(peon).osd
e12617 e12617: 19 osds: 19 up, 19 in
2017-08-08 23:13:35.605839 7f2b86799700  0 mon.bhs1-mail02-ds03@2(peon).mds
e18460 print_map
e18460
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}

Filesystem 'cephfs' (2)
fs_name cephfs
epoch   18460
flags   0
created 2016-08-01 11:07:47.592124
modified        2017-07-03 10:32:44.426431
tableserver 0
root0
session_timeout 60
session_autoclose   300
max_file_size   1099511627776
last_failure0
last_failure_osd_epoch  12617
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
max_mds 1
in  0
up  {0=1574278}
failed
damaged
stopped
data_pools  8,9
metadata_pool   7
inline_data disabled
1574278:10.0.2.4:6800/2556733458 'd' mds.0.18460 up:replay seq 1
laggy since 2017-08-08 23:13:35.174109 (standby for rank 0)



2017-08-08 23:13:35.606303 7f2b86799700  0 log_channel(cluster) log [INF] :
mon.bhs1-mail02-ds03 calling new monitor election
2017-08-08 23:13:35.606361 7f2b86799700  1
mon.bhs1-mail02-ds03@2(electing).elector(3526)
init, last seen epoch 3526
2017-08-08 23:13:36.885540 7f2b86f9a700  0
mon.bhs1-mail02-ds03@2(peon).data_health(3528)
update_stats avail 81% total 19555 MB, used 2636 MB, avail 15903 MB
2017-08-08 23:13:38.311777 7f2b86799700  0 mon.bhs1-mail02-ds03@2(peon).mds
e18461 print_map


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph and Fscache : can you kindly share your experiences?

2017-08-02 Thread Webert de Souza Lima
Oh, I see. Your concerns are about client-side cache.
I'm sorry, I misunderstood your question.

Good luck!

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Tue, Aug 1, 2017 at 5:19 PM, Anish Gupta <anish_gu...@yahoo.com> wrote:

> Hello Webert,
>
> Thank you for your response.
>
> I am not interested in the SSD cache tier pool at all as that is on the
> Ceph Storage Cluster Server and is somewhat well documented/understood.
>
> My question is regards enabling caching at the ceph clients that talk to
> the Ceph Storage Cluster.
>
> thanks,
> Anish
>
>
>
>
> ------
> On Tuesday, August 1, 2017, 9:55:39 AM PDT, Webert de Souza Lima <
> webert.b...@gmail.com> wrote:
>
>
> Hi Anish, in case you're still interested, we've been using cephfs in
> production since jewel 10.2.1.
>
> I have a few similar clusters with some small set up variations. They're
> not so big but they're under heavy workload.
>
> - 15~20 x 6TB HDD OSDs (5 per node), ~4 x 480GB SSD OSDs (2 per node, set
> for cache tier pool)
> - About 4 mount points per cluster, so I assume it translates to 4 clients
> per cluster
> - Running 10.2.9 on Ubuntu 4.4.0-24-generic now.
>
> Cache Tiering is enabled for the CephFS on a separate pool that uses the
> SSDs as OSDs, if that's really what you wanna know.
>
>
> Cya,
>
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> *Belo Horizonte - Brasil*
>
> On Mon, Jul 24, 2017 at 3:27 PM, Anish Gupta <anish_gu...@yahoo.com>
> wrote:
>
> Hello,
>
> Can you kindly share your experience with the built-in FSCache support
> with ceph?
>
> Interested in knowing the following:
> - Are you using FSCache in production environment?
> - How large is your Ceph deployment?
> - If with CephFS, how many Ceph clients are using FSCache
> - which version of Ceph and Linux kernel
>
>
> thank you.
> Anish
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph and Fscache : can you kindly share your experiences?

2017-08-01 Thread Webert de Souza Lima
Hi Anish, in case you're still interested, we've been using cephfs in production
since jewel 10.2.1.

I have a few similar clusters with some small set up variations. They're
not so big but they're under heavy workload.

- 15~20 x 6TB HDD OSDs (5 per node), ~4 x 480GB SSD OSDs (2 per node, set
for cache tier pool)
- About 4 mount points per cluster, so I assume it translates to 4 clients
per cluster
- Running 10.2.9 on Ubuntu 4.4.0-24-generic now.

Cache Tiering is enabled for the CephFS on a separate pool that uses the
SSDs as OSDs, if that's really what you wanna know.
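For context, a minimal sketch of how a cache tier like that is attached (pool
names and values here are only illustrative, not necessarily the exact ones in
these clusters):

  # put the SSD pool in front of the cephfs data pool as a writeback cache
  ceph osd tier add cephfs_data cephfs_cache
  ceph osd tier cache-mode cephfs_cache writeback
  ceph osd tier set-overlay cephfs_data cephfs_cache
  # a hit set type is required so the tiering agent can track access
  ceph osd pool set cephfs_cache hit_set_type bloom
  # give the agent a size limit so it knows when to flush/evict
  ceph osd pool set cephfs_cache target_max_bytes 400000000000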


Cya,


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Mon, Jul 24, 2017 at 3:27 PM, Anish Gupta  wrote:

> Hello,
>
> Can you kindly share your experience with the built-in FSCache support
> with ceph?
>
> Interested in knowing the following:
> - Are you using FSCache in production environment?
> - How large is your Ceph deployment?
> - If with CephFS, how many Ceph clients are using FSCache
> - which version of Ceph and Linux kernel
>
>
> thank you.
> Anish
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New Ceph Community Manager

2017-07-21 Thread Webert de Souza Lima
Thank you for all your efforts, Patrick.
Congratulations and good luck, Leo :)


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Thu, Jul 20, 2017 at 4:21 PM, Patrick McGarry 
wrote:

> Hey cephers,
>
> As most of you know, my last day as the Ceph community lead is next
> Wed (26 July). The good news is that we now have a replacement who
> will be starting immediately!
>
> I would like to introduce you to Leo Vaz who, until recently, has been
> working as a maintenance engineer with a focus on Ceph (and Gluster).
> Leo also is a long time open source participant with efforts around
> community organization and event management, helping to promote many
> open source projects at places like FISL [0] and Latinoware [1].  Leo
> is a founder and community manager of Tchelinux Free Software Users
> group [2] and in 2009 became a Fedora Ambassador.
>
> Leo has a great mix of the “gritty, just-get-it-done” ethos needed in
> a FOSS project as well as a great deal of professional drive and
> experience. I’m sure he will be a fabulous addition to the public face
> of Ceph, and I hope you all take the time to extend a warm welcome as
> he starts getting settled.
>
> If you have any questions for me before 26 July, feel free to contact
> me directly, otherwise I wish you all the best of luck and hope you
> keep changing the world one Ceph install at a time!
>
>
> [0] http://softwarelivre.org/fisl18
> [1] http://latinoware.org/
> [2] https://tchelinux.org/
>
>
> --
>
> Best Regards,
>
> Patrick McGarry
> Director Ceph Community || Red Hat
> http://ceph.com  ||  http://community.redhat.com
> @scuttlemonkey || @ceph
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mds log: dne in the mdsmap

2017-07-11 Thread Webert de Souza Lima
Thanks John,

I got this in the mds log too:

2017-07-11 07:10:06.293219 7f1836837700  1 mds.beacon.b _send skipping
beacon, heartbeat map not healthy
2017-07-11 07:10:08.330979 7f183b942700  1 heartbeat_map is_healthy
'MDSRank' had timed out after 15

but that respawn happened 2 minutes after I got this:

2017-07-11 07:10:10.948237 7f183993e700  0 mds.beacon.b handle_mds_beacon
no longer laggy

Which makes me confused. Could it be a network issue? Local network
communication was fine at the time. It might be a bug.

When it was recovering it was stuck at rejoin_joint_start state for almost
50 minutes.
2017-07-11 07:13:36.587188 7f264a112700  1 mds.0.890528 rejoin_joint_start
[...]
2017-07-11 07:56:21.521006 7f0f78917700  1 mds.0.890537 recovery_done --
successful recovery!
2017-07-11 07:56:21.522570 7f0f78917700  1 mds.0.890537 active_start
2017-07-11 07:56:21.533507 7f0f78917700  1 mds.0.890537 cluster recovered.

I watched with "ceph daemon mds.b perf dump mds" that it was scanning the
inodes. But when this happens (quite often) I have no idea when it will
stop.
Many other times this happened because of a crash (
http://tracker.ceph.com/issues/20535), but today that was not the case.
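For reference, a minimal sketch of that kind of check (assuming the daemon id
is "b" and the admin socket is reachable on the MDS host):

  # poll the inode/cap counters while the MDS is replaying or rejoining
  watch -n 5 "ceph daemon mds.b perf dump mds | grep -E 'inode|caps'"
  # and keep an eye on the MDS state as seen by the monitors
  ceph mds stat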


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Tue, Jul 11, 2017 at 11:36 AM, John Spray <jsp...@redhat.com> wrote:

> On Tue, Jul 11, 2017 at 3:23 PM, Webert de Souza Lima
> <webert.b...@gmail.com> wrote:
> > Hello,
> >
> > today I got a MDS respawn with the following message:
> >
> > 2017-07-11 07:07:55.397645 7ffb7a1d7700  1 mds.b handle_mds_map i
> > (10.0.1.2:6822/28190) dne in the mdsmap, respawning myself
>
> "dne in the mdsmap" is what an MDS says when the monitors have
> concluded that the MDS is dead, but the MDS is really alive.  "dne"
> stands for "does not exist", so the MDS is complaining that it has
> been removed from the mdsmap.
>
> The message could definitely be better worded!
>
> You can see this happen in certain buggy cases where the MDS is
> failing to send beacon messages to the mons, even though it is really
> alive -- if you're stuck in rejoin, then that is probably related: try
> increasing the log verbosity to work out where the MDS is stuck while
> it's sitting in the rejoin state.
>
> John
>
> >
> > it happened 3 times within 5 minutes. After so, the MDS took 50 minutes
> to
> > recover.
> > I can't find what exactly that message means and how to avoid it.
> >
> > I'll be glad to provide any further information. Thanks!
> >
> >
> > Regards,
> >
> > Webert Lima
> > DevOps Engineer at MAV Tecnologia
> > Belo Horizonte - Brasil
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph mds log: dne in the mdsmap

2017-07-11 Thread Webert de Souza Lima
Hello,

today I got a MDS respawn with the following message:

2017-07-11 07:07:55.397645 7ffb7a1d7700  1 mds.b handle_mds_map i (
10.0.1.2:6822/28190) dne in the mdsmap, respawning myself

it happened 3 times within 5 minutes. After so, the MDS took 50 minutes to
recover.
I can't find what exactly that message means and how to avoid it.

I'll be glad to provide any further information. Thanks!


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Write back mode Cach-tier behavior

2017-06-07 Thread Webert de Souza Lima
That's very likely what I tried to do too. Since I can't, I'll have to live
with the "all in".
haha


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Wed, Jun 7, 2017 at 2:53 AM, Christian Balzer <ch...@gol.com> wrote:

>
> Hello,
>
> On Tue, 6 Jun 2017 08:58:07 -0300 Webert de Souza Lima wrote:
>
> > Hey Christian.
> >
> > Which settings do you mean? I played a lot
> > with hit_set_count, hit_set_period, min_read_recency_for_promote
> > and min_write_recency_for_promote.
> > They showed no effect when hit_set_count = 0.
> >
> Yes, that's what I meant, as in hit_set_count is an all or nothing setting.
>
> Quite a while ago when these things were added with Jewel I was clamoring
> for having a better control with respect to read and write promotes.
> As in having split
> osd_tier_promote_max_bytes_sec and osd_tier_promote_max_objects_sec
> into _read and _write parameters.
>
> In my use case I want all writes to go to the cache-tier all the time and
> have a mostly bored and decent backing storage, thus can get away with
> setting the cache-tier to readforward and go full blast on the writes.
>
> But that's less than elegant of course.
>
> Christian
>
> >
> > On Mon, Jun 5, 2017 at 11:54 PM, Christian Balzer <ch...@gol.com> wrote:
> >
> > >
> > > Hello,
> > >
> > > On Tue, 06 Jun 2017 02:35:25 + Webert de Souza Lima wrote:
> > >
> > > > I'd like to add that, from all tests I did, the writing of new files
> only
> > > > goes directly to the cache tier if you set hit set count = 0.
> > > >
> > > Yes, that also depends on the settings of course. (which we don't
> know, as
> > > they never got posted).
> > >
> > > I was reciting from Hammer times, where this was the default case.
> > >
> > > Christian
> >
> >
> >
> > Regards,
> >
> > Webert Lima
> > DevOps Engineer at MAV Tecnologia
> > *Belo Horizonte - Brasil*
>
>
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Rakuten Communications
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Write back mode Cach-tier behavior

2017-06-06 Thread Webert de Souza Lima
Hey Christian.

Which settings do you mean? I played a lot
with hit_set_count, hit_set_period, min_read_recency_for_promote
and min_write_recency_for_promote.
They showed no effect when hit_set_count = 0.


On Mon, Jun 5, 2017 at 11:54 PM, Christian Balzer <ch...@gol.com> wrote:

>
> Hello,
>
> On Tue, 06 Jun 2017 02:35:25 +0000 Webert de Souza Lima wrote:
>
> > I'd like to add that, from all tests I did, the writing of new files only
> > goes directly to the cache tier if you set hit set count = 0.
> >
> Yes, that also depends on the settings of course. (which we don't know, as
> they never got posted).
>
> I was reciting from Hammer times, where this was the default case.
>
> Christian



Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Write back mode Cach-tier behavior

2017-06-06 Thread Webert de Souza Lima
The hit set count/period is supposed to control whether the object will be
in the cache pool or in the cold storage pool. By setting it to 0, the object is
always promoted. This is good for writes, but in my use case, for example,
I wouldn't want every read operation to make an object get promoted, and
that is what happens. You can't adjust the warm-up. I don't know if I'm
being clear.


On Tue, Jun 6, 2017, 05:26, TYLin  wrote:

>
> On Jun 6, 2017, at 11:18 AM, jiajia zhong  wrote:
>
>>
>> it's very similar to ours.  but  is there any need to separate the osds
> for different pools ? why ?
> below's our crushmap.
>
> -98   6.29997 root tier_cache
> -94   1.3 host cephn1-ssd
>  95   0.7 osd.95   up  1.0  1.0
> 101   0.34999 osd.101  up  1.0  1.0
> 102   0.34999 osd.102  up  1.0  1.0
> -95   1.3 host cephn2-ssd
>  94   0.7 osd.94   up  1.0  1.0
> 103   0.34999 osd.103  up  1.0  1.0
> 104   0.34999 osd.104  up  1.0  1.0
> -96   1.3 host cephn3-ssd
> 105   0.34999 osd.105  up  1.0  1.0
> 106   0.34999 osd.106  up  1.0  1.0
>  93   0.7 osd.93   up  1.0  1.0
> -93   0.7 host cephn4-ssd
>  97   0.34999 osd.97   up  1.0  1.0
>  98   0.34999 osd.98   up  1.0  1.0
> -97   1.3 host cephn5-ssd
>  96   0.7 osd.96   up  1.0  1.0
>  99   0.34999 osd.99   up  1.0  1.0
> 100   0.34999 osd.100  up  1.0  1.0
>
>
> Because ceph cannot distinguish metadata requests from data requests. If we
> use the same osd sets for both the metadata cache and the data cache, the
> bandwidth for metadata requests may be occupied by data requests and lead
> to long response times.
>
> Thanks,
> Ting Yi Lin
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Write back mode Cach-tier behavior

2017-06-05 Thread Webert de Souza Lima
I'd like to add that, from all tests I did, the writing of new files only
goes directly to the cache tier if you set hit set count = 0.

On Mon, Jun 5, 2017, 23:26, TYLin  wrote:

> On Jun 5, 2017, at 6:47 PM, Christian Balzer  wrote:
>
> Personally I avoid odd numbered releases, but my needs for stability
> and low update frequency seem to be far off the scale for "normal" Ceph
> users.
>
> W/o precise numbers of files and the size of your SSDs (which type?) it is
> hard to say, but you're likely to be better off just having all metadata
> on an SSD pool instead of cache-tiering.
> 800MB/s sounds about right for your network and cluster in general (no
> telling for sure w/o SSD/HDD details of course).
>
> As I pointed out before and will try to explain again below, that speed
> difference, while pretty daunting, isn't all that surprising.
>
>
> SSD: Intel S3520 240GB
> HDD: WDC WD5003ABYZ-011FA0 500GB
> fio: bs=4m iodepth=32
> dd: bs=4m
> The test file is 20GB.
>
> No, not quite. Re-read what I wrote, there's a difference between RADOS
> object creation and actual data (contents).
>
> The devs or other people with more code familiarity will correct me, but
> essentially as I understand it this happens when a new RADOS object gets
> created in conjunction with a cache-tier:
>
> 1. Client (cephfs, rbd, whatever) talks to the cache-tier and the
> transaction causes a new object to be created.
> Since the tier is an overlay of the actual backing storage, the object
> (but not necessarily the current data in it) needs to exist on both.
> 2. Object gets created on backing storage, which involves creating the
> file (at zero length), any needed directories above and the entry in the
> OMAP leveldb. All on HDDs, all slow.
> I'm pretty sure this needs to be done and finished before the object is
> usable, no journals to speed this up.
> 3. Cache-tier pseudo-promotes the new object (it is empty after all) and
> starts accepting writes.
>
> This is leaving out any metadata stuff CephFS needs to do for new "blocks"
> and files, which may also be more involved than overwrites.
>
> Christian
>
>
> You made it clear to me! Thanks! I really appreciate your kind explanation.
>
> Thanks,
> Ting Yi Lin
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cache tiering write vs read promotion

2017-05-18 Thread Webert de Souza Lima
Hello,

I'm using cache tiering with cephfs on latest ceph jewel release.

For my use case, I wanted to make new writes go "directly" to the cache
pool, and use other logic for promoting on reads, like promoting only after
2 reads, for example.

I see that the following settings are available:

hit_set_count
hit_set_period
min_read_recency_for_promote
min_write_recency_for_promote

Playing with those values, I could see that the only way I could make the
first writes go directly to the cache pool was by setting hit_set_count = 0.
Doing that, the other options don't have any effect.
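For reference, a rough sketch of the knobs involved, set per cache pool (the
pool name and values are only examples):

  ceph osd pool set cephfs_cache hit_set_type bloom
  ceph osd pool set cephfs_cache hit_set_count 0      # 0 = new writes land in the tier directly
  ceph osd pool set cephfs_cache hit_set_period 3600
  ceph osd pool set cephfs_cache min_read_recency_for_promote 2
  ceph osd pool set cephfs_cache min_write_recency_for_promote 0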

I tried setting hit_set_count and hit_set_period to numbers above zero,
and setting min_write_recency_for_promote = 0, but that does not work as
expected.

Is that possible? Could it be arranged?


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph MDS daemonperf

2017-05-12 Thread Webert de Souza Lima
Haha, that was it.

I thought the first mds was active but it was the second one.
I issued the command on the right mds and it does show it all.
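For anyone else hitting this, a minimal way to double-check (assuming the
active daemon id is "b"):

  ceph mds stat                   # shows which daemon holds the active rank
  ceph daemon mds.b perf schema   # the "nick" fields map to the daemonperf column headers
  ceph daemonperf mds.b           # live counters labelled with those nicks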

Thank you very much.



Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Fri, May 12, 2017 at 9:03 AM, John Spray <jsp...@redhat.com> wrote:

> On Fri, May 12, 2017 at 12:47 PM, Webert de Souza Lima
> <webert.b...@gmail.com> wrote:
> > Thanks John,
> >
> > I did as yuo suggested but unfortunately I only found information
> regarding
> > the objecter nicks "writ, read and actv", any more suggestions?
>
> The daemonperf command itself is getting its list of things to display
> by calling "perf schema" and looking at which ones have a nick set, so
> they're definitely in there.  Maybe you were sending the command to
> something other than an active MDS?
>
> John
>
> >
> >
> > Regards,
> >
> > Webert Lima
> > DevOps Engineer at MAV Tecnologia
> > Belo Horizonte - Brasil
> >
> > On Wed, May 10, 2017 at 3:46 AM, John Spray <jsp...@redhat.com> wrote:
> >>
> >> On Tue, May 9, 2017 at 5:23 PM, Webert de Souza Lima
> >> <webert.b...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > by issuing `ceph daemonperf mds.x` I see the following columns:
> >> >
> >> > -mds-- --mds_server-- ---objecter--- -mds_cache-
> >> > ---mds_log
> >> > rlat inos caps|hsr  hcs  hcr |writ read actv|recd recy stry purg|segs
> >> > evts
> >> > subm|
> >> >   0   95   41 |  000 |  000 |  00   250 |  1
> >> > 628
> >> > 0
> >> >   0   95   41 |  000 |  000 |  00   250 |  1
> >> > 628
> >> > 0
> >> >   0   95   41 |  000 |  000 |  00   250 |  1
> >> > 628
> >> > 0
> >> >   0   95   41 |  000 |  000 |  00   250 |  1
> >> > 628
> >> > 0
> >> >   0   95   41 |  000 |  000 |  00   250 |  1
> >> > 628
> >> > 0
> >> >
> >> > It's not clear to me what each column means, but I can't find it
> >> > anywhere.
> >> > Also the labels are confusing. Why is there mds and mds_server?
> >>
> >> The mds, mds_server etc refer to internal subsystems within the
> >> ceph-mds process (their naming is arcane).
> >>
> >> The abbreviated names for performance counters are the "nick" item in
> >> the output of "ceph daemon  perf schema" -- for sufficiently
> >> recent code you should see a description field there too.
> >>
> >> John
> >>
> >>
> >> > Regards,
> >> >
> >> > Webert Lima
> >> > DevOps Engineer at MAV Tecnologia
> >> > Belo Horizonte - Brasil
> >> >
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph MDS daemonperf

2017-05-12 Thread Webert de Souza Lima
Thanks John,

I did as you suggested but unfortunately I only found information regarding
the objecter nicks "writ, read and actv", any more suggestions?



Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Wed, May 10, 2017 at 3:46 AM, John Spray <jsp...@redhat.com> wrote:

> On Tue, May 9, 2017 at 5:23 PM, Webert de Souza Lima
> <webert.b...@gmail.com> wrote:
> > Hi,
> >
> > by issuing `ceph daemonperf mds.x` I see the following columns:
> >
> > -mds-- --mds_server-- ---objecter--- -mds_cache-
> > ---mds_log
> > rlat inos caps|hsr  hcs  hcr |writ read actv|recd recy stry purg|segs
> evts
> > subm|
> >   0   95   41 |  000 |  000 |  00   250 |  1  628
> > 0
> >   0   95   41 |  000 |  000 |  00   250 |  1  628
> > 0
> >   0   95   41 |  000 |  000 |  00   250 |  1  628
> > 0
> >   0   95   41 |  000 |  000 |  00   250 |  1  628
> > 0
> >   0   95   41 |  000 |  000 |  00   250 |  1  628
> > 0
> >
> > It's not clear to me what each column means, but I can't find it anywhere.
> > Also the labels are confusing. Why is there mds and mds_server?
>
> The mds, mds_server etc refer to internal subsystems within the
> ceph-mds process (their naming is arcane).
>
> The abbreviated names for performance counters are the "nick" item in
> the output of "ceph daemon  perf schema" -- for sufficiently
> recent code you should see a description field there too.
>
> John
>
>
> > Regards,
> >
> > Webert Lima
> > DevOps Engineer at MAV Tecnologia
> > Belo Horizonte - Brasil
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph health warn MDS failing to respond to cache pressure

2017-05-12 Thread Webert de Souza Lima
On Wed, May 10, 2017 at 4:09 AM, gjprabu  wrote:

> Hi Webert,
>
>   Thanks for your reply , can pls suggest ceph pg value for data and
> metadata. I have set 128 for data and 128 for metadata , is this correct
>

Well I think this has nothing to do with your current problem but the PG
number depends on your use case.
Try using this tool to estimate the best for your needs:
http://ceph.com/pgcalc/

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Performance

2017-05-10 Thread Webert de Souza Lima
On Tue, May 9, 2017 at 9:07 PM, Brady Deetz  wrote:

> So with email, you're talking about lots of small reads and writes. In my
> experience with dicom data (thousands of 20KB files per directory), cephfs
> doesn't perform very well at all on platter drives. I haven't experimented
> with pure ssd configurations, so I can't comment on that.
>

Yes, that's pretty much why I'm using cache tiering on SSDs.


> Somebody may correct me here, but small block io on writes just makes
> latency all that much more important due to the need to wait for your
> replicas to be written before moving on to the next block.
>

I think that is correct. Smaller blocks = more I/O, so SSDs benefit a lot.


> Without knowing exact hardware details, my brain is immediately jumping to
> networking constraints. 2 or 3 spindle drives can pretty much saturate a
> 1gbps link. As soon as you create contention for that resource, you create
> system load for iowait and latency.
>
> You mentioned you don't control the network. Maybe you can scale down and
> out.
>

 I'm constrained to the topology I showed you for now. I had planned
another (see
https://creately.com/diagram/j1eyig9i/7wloXLNOAYjeregBGkvelMXL50%3D) but it
won't be possible at this time.
 That setup would have a 10 gig interconnection link.

On Wed, May 10, 2017 at 3:55 AM, John Spray  wrote:

>
> Hmm, to understand this better I would start by taking cache tiering
> out of the mix, it adds significant complexity.
>
> The "-direct=1" part could be significant here: when you're using RBD,
> that's getting handled by ext4, and then ext4 is potentially still
> benefiting from some caching at the ceph layer.  With CephFS on the
> other hand, it's getting handled by CephFS, and CephFS will be
> laboriously doing direct access to OSD.
>
> John


I won't be able to change that for now. I would need another testing cluster.
The point of direct=1 was to remove any caching possibility in the middle.
That fio suite was suggested by user peetaur on the IRC channel (thanks :)
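For the record, a sketch of the kind of fio job used (the mount point and
exact parameters are illustrative; the full setup and results are in the
paste linked earlier in this thread):

  # direct sequential write against a file on the cephfs mount
  fio --name=cephfs-write --filename=/mnt/cephfs/fio.test --size=20G \
      --bs=4M --iodepth=32 --ioengine=libaio --direct=1 --rw=write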

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Performance

2017-05-09 Thread Webert de Souza Lima
On Tue, May 9, 2017 at 4:40 PM, Brett Niver <bni...@redhat.com> wrote:

> What is your workload like?  Do you have a single or multiple active
> MDS ranks configured?


User traffic is heavy. I can't really say in terms of mb/s or iops but it's
an email server with 25k+ users, usually about 6k simultaneously connected
receiving and reading emails.
I have only one active MDS configured. The others are Stand-by.

On Tue, May 9, 2017 at 7:18 PM, Wido den Hollander <w...@42on.com> wrote:

>
> > Op 9 mei 2017 om 20:26 schreef Brady Deetz <bde...@gmail.com>:
> >
> >
> > If I'm reading your cluster diagram correctly, I'm seeing a 1gbps
> > interconnect, presumably cat6. Due to the additional latency of
> performing
> > metadata operations, I could see cephfs performing at those speeds. Are
> you
> > using jumbo frames? Also are you routing?
> >
> > If you're routing, the router will introduce additional latency that an
> l2
> > network wouldn't experience.
> >
>
> Partially true. I am running various Ceph clusters using L3 routing and
> with a decent router the latency for routing a packet is minimal, like 0.02
> ms or so.
>
> Ceph spends much more time in the CPU than it will take the network to
> forward that IP-packet.
>
> I wouldn't be too afraid to run Ceph over a L3 network.
>
> Wido
>
> > On May 9, 2017 12:01 PM, "Webert de Souza Lima" <webert.b...@gmail.com>
> > wrote:
> >
> > > Hello all,
> > >
> > > I've been using cephfs for a while but never really evaluated its
> > > performance.
> > > As I put up a new ceph cluster, I thought that I should run a benchmark
> to
> > > see if I'm going the right way.
> > >
> > > By the results I got, I see that RBD performs *a lot* better in
> > > comparison to cephfs.
> > >
> > > The cluster is like this:
> > >  - 2 hosts with one SSD OSD each.
> > >this hosts have 2 pools: cephfs_metadata and cephfs_cache (for
> > > cache tiering).
> > >  - 3 hosts with 5 HDD OSDs each.
> > >   this hosts have 1 pool: cephfs_data.
> > >
> > > all details, cluster set up and results can be seen here:
> > > https://justpaste.it/167fr
> > >
> > > I created the RBD pools the same way as the CEPHFS pools except for the
> > > number of PGs in the data pool.
> > >
> > > I wonder why that difference or if I'm doing something wrong.
> > >
> > > Regards,
> > >
> > > Webert Lima
> > > DevOps Engineer at MAV Tecnologia
> > > *Belo Horizonte - Brasil*
> > >
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Performance

2017-05-09 Thread Webert de Souza Lima
That 1gbps link is the only option I have for those servers, unfortunately.
It's all dedicated server rentals from OVH.
I don't have information regarding the internals of the vrack.

So by what you said, I understand that one should expect a performance drop
in comparison to ceph rbd using the same architecture, right?

Thanks.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS Performance

2017-05-09 Thread Webert de Souza Lima
Hello all,

I've been using cephfs for a while but never really evaluated its
performance.
As I put up a new ceph cluster, I thought that I should run a benchmark to
see if I'm going the right way.

By the results I got, I see that RBD performs *a lot* better in comparison
to cephfs.

The cluster is like this:
 - 2 hosts with one SSD OSD each.
   these hosts have 2 pools: cephfs_metadata and cephfs_cache (for cache
tiering).
 - 3 hosts with 5 HDD OSDs each.
  these hosts have 1 pool: cephfs_data.

all details, cluster set up and results can be seen here:
https://justpaste.it/167fr

I created the RBD pools the same way as the CEPHFS pools except for the
number of PGs in the data pool.

I wonder why that difference or if I'm doing something wrong.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph MDS daemonperf

2017-05-09 Thread Webert de Souza Lima
Hi,

by issuing `ceph daemonperf mds.x` I see the following columns:

-mds-- --mds_server-- ---objecter--- -mds_cache- ---mds_log
rlat inos caps|hsr  hcs  hcr |writ read actv|recd recy stry purg|segs evts subm|
  0   95   41 |  000 |  000 |  00   250 |  1  628  0
  0   95   41 |  000 |  000 |  00   250 |  1  628  0
  0   95   41 |  000 |  000 |  00   250 |  1  628  0
  0   95   41 |  000 |  000 |  00   250 |  1  628  0
  0   95   41 |  000 |  000 |  00   250 |  1  628  0

It's not clear to me what each column means, but I can't find it documented anywhere.
Also the labels are confusing. Why is there mds and mds_server?

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph health warn MDS failing to respond to cache pressure

2017-05-04 Thread Webert de Souza Lima
I have faced the same problem many times. Usually it doesn't cause anything
bad, but I had a 30 min system outage twice because of this.
It might be because of the number of inodes on your ceph filesystem. Go to
the MDS server and do (supposing your mds server id is intcfs-osd1):

 ceph daemon mds.intcfs-osd1 perf dump mds

Look for the inode_max and inodes information.
inode_max is the maximum number of inodes to cache and inodes is the number
of inodes currently in the cache.

If it is full, mount the cephfs with the "-o dirstat" option, and cat the
mountpoint, for example:

 mount -t ceph  10.0.0.1:6789:/ /mnt -o
dirstat,name=admin,secretfile=/etc/ceph/admin.secret
 cat /mnt

Look for the rentries number. If it is larger than inode_max, raise
the mds cache size option in ceph.conf to a number that fits and restart
the mds (beware: this will cause cephfs to stall for a while; do it at your
own risk).
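A minimal sketch of that change (the value is only an example; the default is
100000):

  # in ceph.conf on the MDS host:
  [mds]
      mds cache size = 500000

  # or, on some versions, injected at runtime without editing ceph.conf:
  ceph tell mds.intcfs-osd1 injectargs '--mds-cache-size 500000'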

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Thu, May 4, 2017 at 3:28 AM, gjprabu  wrote:

> Hi Team,
>
>   We are running cephfs with 5 OSD and 3 Mon and 1 MDS. There is
> Heath Warn "*failing to respond to cache pressure*" . Kindly advise to
> fix this issue.
>
>
> cluster b466e09c-f7ae-4e89-99a7-99d30eba0a13
>  health HEALTH_WARN
> mds0: Client integ-hm8-1.csez.zohocorpin.com failing to
> respond to cache pressure
> mds0: Client integ-hm5 failing to respond to cache pressure
> mds0: Client integ-hm9 failing to respond to cache pressure
> mds0: Client integ-hm2 failing to respond to cache pressure
>  monmap e2: 3 mons at {intcfs-mon1=192.168.113.113:6
> 789/0,intcfs-mon2=192.168.113.114:6789/0,intcfs-mon3=192.168.113.72:6789/0
> }
> election epoch 16, quorum 0,1,2 intcfs-mon3,intcfs-mon1,intcfs
> -mon2
>   fsmap e79409: 1/1/1 up {0=intcfs-osd1=up:active}, 1 up:standby
>  osdmap e3343: 5 osds: 5 up, 5 in
> flags sortbitwise
>   pgmap v13065759: 564 pgs, 3 pools, 5691 GB data, 12134 kobjects
> 11567 GB used, 5145 GB / 16713 GB avail
>  562 active+clean
>2 active+clean+scrubbing+deep
>   client io 8090 kB/s rd, 29032 kB/s wr, 25 op/s rd, 129 op/s wr
>
>
> Regards
> Prabu GJ
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs cache tiering - hitset

2017-03-17 Thread Webert de Souza Lima
Hello everyone,

I'm deploying a ceph cluster with cephfs and I'd like to tune ceph cache
tiering, and I'm
a little bit confused about the settings hit_set_count, hit_set_period and
min_read_recency_for_promote. The docs are very lean and I can't find any
more detailed explanation anywhere.

Could someone provide me a better understanding of this?

Thanks in advance!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs mds failing to respond to capability release

2016-11-23 Thread Webert de Souza Lima
is it possible to count open file descriptors in cephfs only?
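For reference, two rough approximations (assuming a ceph-fuse client and
access to the MDS admin socket; the daemon id is illustrative), though
neither is exactly a per-filesystem fd count:

  # file descriptors currently held by the ceph-fuse process on a client
  ls /proc/$(pidof ceph-fuse)/fd | wc -l
  # capabilities held per client session, as seen by the MDS
  ceph daemon mds.b session ls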

On Wed, Nov 16, 2016 at 2:12 PM Webert de Souza Lima <webert.b...@gmail.com>
wrote:

> I'm sorry, by server, I meant cluster.
> On one cluster the rate of files created and read is about 5 per second.
> On another cluster it's from 25 to 30 files created and read per second.
>
> On Wed, Nov 16, 2016 at 2:03 PM Webert de Souza Lima <
> webert.b...@gmail.com> wrote:
>
> Hello John.
>
> I'm sorry for the lack of information at the first post.
> The same version is in use for servers and clients.
>
> About the workload, it varies.
> On one server it's about *5 files created/written and then fully read per
> second*.
> On the other server it's about *5 to 6 times that number*, so a lot more,
> but the problem does not escalate at the same proportion.
>
> *~# ceph -v*
> ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>
> *~#dpkg -l | grep ceph*
> ii  ceph-fuse10.2.2-1trusty
> amd64FUSE-based client for the Ceph distributed file system
>
> Some things are worth mentioning:
> The service(1) that creates the file sends an async request to another
> service(2) that reads it.
> The service(1) that creates the file also deletes it when its client
> closes the connection, so it can do so while the other service(2) is
trying to read it. I'm not sure what would happen here.
>
>
>
> On Wed, Nov 16, 2016 at 1:42 PM John Spray <jsp...@redhat.com> wrote:
>
> On Wed, Nov 16, 2016 at 3:15 PM, Webert de Souza Lima
> <webert.b...@gmail.com> wrote:
> > hi,
> >
> > I have many clusters running cephfs, and in the last 45 days or so, 2 of
> > them started giving me the following message in ceph health:
> > mds0: Client dc1-mx02-fe02:guest failing to respond to capability release
> >
> > When this happens, cephfs stops responding. It will only get back after I
> > restart the failing mds.
> >
> > Also, I get the following logs from ceph.log
> > https://paste.debian.net/896236/
> >
> > There was no change made that I can relate to this and I can't figure out
> > what is happening.
>
> I have the usual questions: what ceph versions, what clients etc
> (http://docs.ceph.com/docs/jewel/cephfs/early-adopters/#reporting-issues)
>
> Clients failing to respond to capability release are either buggy (old
> kernels?) or it's also possible that you have a workload that is
> holding an excessive number of files open.
>
> Cheers,
> John
>
>
>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs mds failing to respond to capability release

2016-11-16 Thread Webert de Souza Lima
Hello John.

I'm sorry for the lack of information at the first post.
The same version is in use for servers and clients.

About the workload, it varies.
On one server it's about *5 files created/written and then fully read per
second*.
On the other server it's about *5 to 6 times that number*, so a lot more,
but the problem does not escalate at the same proportion.

*~# ceph -v*
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)

*~#dpkg -l | grep ceph*
ii  ceph-fuse10.2.2-1trusty
amd64FUSE-based client for the Ceph distributed file system

Some things are worth mentioning:
The service(1) that creates the file sends an async request to another
service(2) that reads it.
The service(1) that creates the file also deletes it when its client closes
the connection, so it can do so while the other service(2) is trying to
read it. I'm not sure what would happen here.



On Wed, Nov 16, 2016 at 1:42 PM John Spray <jsp...@redhat.com> wrote:

> On Wed, Nov 16, 2016 at 3:15 PM, Webert de Souza Lima
> <webert.b...@gmail.com> wrote:
> > hi,
> >
> > I have many clusters running cephfs, and in the last 45 days or so, 2 of
> > them started giving me the following message in ceph health:
> > mds0: Client dc1-mx02-fe02:guest failing to respond to capability release
> >
> > When this happens, cephfs stops responding. It will only get back after I
> > restart the failing mds.
> >
> > Also, I get the following logs from ceph.log
> > https://paste.debian.net/896236/
> >
> > There was no change made that I can relate to this and I can't figure out
> > what is happening.
>
> I have the usual questions: what ceph versions, what clients etc
> (http://docs.ceph.com/docs/jewel/cephfs/early-adopters/#reporting-issues)
>
> Clients failing to respond to capability release are either buggy (old
> kernels?) or it's also possible that you have a workload that is
> holding an excessive number of files open.
>
> Cheers,
> John
>
>
>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs mds failing to respond to capability release

2016-11-16 Thread Webert de Souza Lima
I'm sorry, by server, I meant cluster.
On one cluster the rate of files created and read is about 5 per second.
On another cluster it's from 25 to 30 files created and read per second.

On Wed, Nov 16, 2016 at 2:03 PM Webert de Souza Lima <webert.b...@gmail.com>
wrote:

> Hello John.
>
> I'm sorry for the lack of information at the first post.
> The same version is in use for servers and clients.
>
> About the workload, it varies.
> On one server it's about *5 files created/written and then fully read per
> second*.
> On the other server it's about *5 to 6 times that number*, so a lot more,
> but the problem does not escalate at the same proportion.
>
> *~# ceph -v*
> ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>
> *~#dpkg -l | grep ceph*
> ii  ceph-fuse10.2.2-1trusty
> amd64FUSE-based client for the Ceph distributed file system
>
> Some things are worth mentioning:
> The service(1) that creates the file sends an async request to another
> service(2) that reads it.
> The service(1) that creates the file also deletes it when its client
> closes the connection, so it can do so while the other service(2) is
> trying to read it. I'm not sure what would happen here.
>
>
>
> On Wed, Nov 16, 2016 at 1:42 PM John Spray <jsp...@redhat.com> wrote:
>
> On Wed, Nov 16, 2016 at 3:15 PM, Webert de Souza Lima
> <webert.b...@gmail.com> wrote:
> > hi,
> >
> > I have many clusters running cephfs, and in the last 45 days or so, 2 of
> > them started giving me the following message in ceph health:
> > mds0: Client dc1-mx02-fe02:guest failing to respond to capability release
> >
> > When this happens, cephfs stops responding. It will only get back after I
> > restart the failing mds.
> >
> > Also, I get the following logs from ceph.log
> > https://paste.debian.net/896236/
> >
> > There was no change made that I can relate to this and I can't figure out
> > what is happening.
>
> I have the usual questions: what ceph versions, what clients etc
> (http://docs.ceph.com/docs/jewel/cephfs/early-adopters/#reporting-issues)
>
> Clients failing to respond to capability release are either buggy (old
> kernels?) or it's also possible that you have a workload that is
> holding an excessive number of files open.
>
> Cheers,
> John
>
>
>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs mds failing to respond to capability release

2016-11-16 Thread Webert de Souza Lima
hi,

I have many clusters running cephfs, and in the last 45 days or so, 2 of
them started giving me the following message in *ceph health:*
*mds0: Client dc1-mx02-fe02:guest failing to respond to capability release*

When this happens, cephfs stops responding. It will only get back
after I *restart
the failing mds*.

Also, I get the following logs from *ceph.log*
https://paste.debian.net/896236/

There was no change made that I can relate to this and I can't figure out
what is happening.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can't recover pgs degraded/stuck unclean/undersized

2016-11-15 Thread Webert de Souza Lima
I removed cephfs and its pools, created everything again using the default
crush ruleset, which is for the HDD, and now ceph health is OK.
I appreciate your help. Thank you very much.
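For anyone hitting the same thing, a rough sketch of the steps involved (names
match the pools above; this is only an outline, not an exact transcript):

  # fail the MDS and remove the filesystem first
  ceph mds fail 0
  ceph fs rm cephfs --yes-i-really-mean-it
  # drop the old pools
  ceph osd pool delete cephfs_metadata cephfs_metadata --yes-i-really-really-mean-it
  ceph osd pool delete cephfs_data cephfs_data --yes-i-really-really-mean-it
  # recreate them; without a crush_ruleset override they use the default (HDD) rule
  ceph osd pool create cephfs_metadata 128 128
  ceph osd pool create cephfs_data 128 128
  ceph fs new cephfs cephfs_metadata cephfs_data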

On Tue, Nov 15, 2016 at 11:48 AM Webert de Souza Lima <webert.b...@gmail.com>
wrote:

> Right, thank you.
>
> On this particular cluster it would be Ok to have everything on the HDD.
> No big traffic here.
> In order to do that, do I need to delete this cephfs, delete its pools and
> create them again?
>
> After that I assume I would run ceph osd pool set cephfs_metadata
> crush_ruleset 0, as 0 is the id of the hdd crush rule.
>
> On Tue, Nov 15, 2016 at 11:09 AM Burkhard Linke <
> burkhard.li...@computational.bio.uni-giessen.de> wrote:
>
> Hi,
>
> On 11/15/2016 01:55 PM, Webert de Souza Lima wrote:
>
> sure, as requested:
>
> *cephfs* was created using the following command:
>
> ceph osd pool create cephfs_metadata 128 128
> ceph osd pool create cephfs_data 128 128
> ceph fs new cephfs cephfs_metadata cephfs_data
>
> *ceph.conf:*
> https://paste.debian.net/895841/
>
>
> *# ceph osd crush tree *https://paste.debian.net/895839/
>
> *# ceph osd crush rule list*
> [
> "replicated_ruleset",
> "replicated_ruleset_ssd"
> ]
>
> *# ceph osd crush rule dump*
> https://paste.debian.net/895842/
>
>
> I assume that you want the cephfs_metadata pool to be located on the SSD.
>
> The crush rule uses host based distribution, but there are only two hosts
> available. The default replicated rule uses osd based distribution, that's
> why the other pools aren't affected.
>
> You have configured the default number of replicates to 3, so the ssd rule
> cannot be satisfied with two host. You either need to put the metadata pool
> on the HDD, too, or use a pool size of 2 (which is not recommended).
>
> Regards,
> Burkhard
>
> --
> Dr. rer. nat. Burkhard Linke
> Bioinformatics and Systems Biology
> Justus-Liebig-University Giessen
> 35392 Giessen, Germany
> Phone: (+49) (0)641 9935810
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can't recover pgs degraded/stuck unclean/undersized

2016-11-15 Thread Webert de Souza Lima
Right, thank you.

On this particular cluster it would be Ok to have everything on the HDD. No
big traffic here.
In order to do that, do I need to delete this cephfs, delete its pools and
create them again?

After that I assume I would run ceph osd pool set cephfs_metadata
crush_ruleset 0, as 0 is the id of the hdd crush rule.
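For reference, roughly (a minimal sketch; the rule id assumes the crush rule
dump above):

  ceph osd crush rule dump replicated_ruleset     # confirm the rule id (0 here)
  ceph osd pool set cephfs_metadata crush_ruleset 0
  ceph osd pool get cephfs_metadata crush_ruleset # verify it took effect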

On Tue, Nov 15, 2016 at 11:09 AM Burkhard Linke <
burkhard.li...@computational.bio.uni-giessen.de> wrote:

> Hi,
>
> On 11/15/2016 01:55 PM, Webert de Souza Lima wrote:
>
> sure, as requested:
>
> *cephfs* was created using the following command:
>
> ceph osd pool create cephfs_metadata 128 128
> ceph osd pool create cephfs_data 128 128
> ceph fs new cephfs cephfs_metadata cephfs_data
>
> *ceph.conf:*
> https://paste.debian.net/895841/
>
>
> *# ceph osd crush tree *https://paste.debian.net/895839/
>
> *# ceph osd crush rule list*
> [
> "replicated_ruleset",
> "replicated_ruleset_ssd"
> ]
>
> *# ceph osd crush rule dump*
> https://paste.debian.net/895842/
>
>
> I assume that you want the cephfs_metadata pool to be located on the SSD.
>
> The crush rule uses host based distribution, but there are only two hosts
> available. The default replicated rule uses osd based distribution, that's
> why the other pools aren't affected.
>
> You have configured the default number of replicates to 3, so the ssd rule
> cannot be satisfied with two host. You either need to put the metadata pool
> on the HDD, too, or use a pool size of 2 (which is not recommended).
>
> Regards,
> Burkhard
>
> --
> Dr. rer. nat. Burkhard Linke
> Bioinformatics and Systems Biology
> Justus-Liebig-University Giessen
> 35392 Giessen, Germany
> Phone: (+49) (0)641 9935810
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can't recover pgs degraded/stuck unclean/undersized

2016-11-15 Thread Webert de Souza Lima
sure, as requested:

*cephfs* was created using the following command:

ceph osd pool create cephfs_metadata 128 128
ceph osd pool create cephfs_data 128 128
ceph fs new cephfs cephfs_metadata cephfs_data

*ceph.conf:*
https://paste.debian.net/895841/


*# ceph osd crush tree*https://paste.debian.net/895839/

*# ceph osd crush rule list*
[
"replicated_ruleset",
"replicated_ruleset_ssd"
]

*# ceph osd crush rule dump*
https://paste.debian.net/895842/

*# ceph osd tree*
ID WEIGHT   TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-3  0.07999 root default-ssd
-5  0.03999 host dc1-master-ds02-ssd
11  0.03999 osd.11 up  1.0  1.0
-6  0.03999 host dc1-master-ds03-ssd
13  0.03999 osd.13 up  1.0  1.0
-1 31.3 root default
-2 31.3 host dc1-master-ds01
 0  3.7 osd.0  up  1.0  1.0
 1  3.7 osd.1  up  1.0  1.0
 2  4.0 osd.2  up  1.0  1.0
 3  4.0 osd.3  up  1.0  1.0
 4  4.0 osd.4  up  1.0  1.0
 5  4.0 osd.5  up  1.0  1.0
 6  4.0 osd.6  up  1.0  1.0
 7  4.0 osd.7  up  1.0  1.0


*# ceph osd pool ls*
.rgw.root
master.rgw.control
master.rgw.data.root
master.rgw.gc
master.rgw.log
master.rgw.intent-log
master.rgw.usage
master.rgw.users.keys
master.rgw.users.email
master.rgw.users.swift
master.rgw.users.uid
master.rgw.buckets.index
master.rgw.buckets.data
master.rgw.meta
master.rgw.buckets.non-ec
rbd
cephfs_metadata
cephfs_data


*# ceph osd pool stats*
https://paste.debian.net/895840/




On Tue, Nov 15, 2016 at 10:33 AM Burkhard Linke <
burkhard.li...@computational.bio.uni-giessen.de> wrote:

> Hi,
>
>
> On 11/15/2016 01:27 PM, Webert de Souza Lima wrote:
> > Not that I know of. On 5 other clusters it works just fine and the
> > configuration is the same for all.
> > On this cluster I was using only radosgw; cephfs was not in use,
> > but it had already been created following our procedures.
> >
> > This happened right after mounting it.
> Do you use any different setup for one of the pools?
> active+undersized+degraded means that the crush rules for a PG cannot be
> satisfied, and 128 PGs sounds like the default setup for the number of PGs.
>
> With 10 OSDs I would suspect that you do not have enough host to satisfy
> all crush requirements. Can you post your crush tree, the crush rules
> and the detailed pool configuration?
>
> Regards,
> Burkhard
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can't recover pgs degraded/stuck unclean/undersized

2016-11-15 Thread Webert de Souza Lima
Not that I know of. On 5 other clusters it works just fine and the
configuration is the same for all.
On this cluster I was using only radosgw; cephfs was not in use, but it
had already been created following our procedures.

This happened right after mounting it.

On Tue, Nov 15, 2016 at 10:24 AM John Spray <jsp...@redhat.com> wrote:

> On Tue, Nov 15, 2016 at 12:14 PM, Webert de Souza Lima
> <webert.b...@gmail.com> wrote:
> > Hey John.
> >
> > Just to be sure; by "deleting the pools" you mean the cephfs_metadata and
> > cephfs_metadata pools, right?
> > Does it have any impact over radosgw? Thanks.
>
> Yes, I meant the cephfs pools.  It doesn't affect rgw (assuming your
> pool names correspond to what you're using them for).
>
> By the way, it would be interesting to see what actually went wrong
> with your cephfs_metadata pool.  Did you do something different with
> it like trying to use difference crush rules?
>
> John
>
> >
> > On Tue, Nov 15, 2016 at 10:10 AM John Spray <jsp...@redhat.com> wrote:
> >>
> >> On Tue, Nov 15, 2016 at 11:58 AM, Webert de Souza Lima
> >> <webert.b...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > after running a cephfs on my ceph cluster I got stuck with the
> following
> >> > heath status:
> >> >
> >> > # ceph status
> >> > cluster ac482f5b-dce7-410d-bcc9-7b8584bd58f5
> >> >  health HEALTH_WARN
> >> > 128 pgs degraded
> >> > 128 pgs stuck unclean
> >> > 128 pgs undersized
> >> > recovery 24/40282627 objects degraded (0.000%)
> >> >  monmap e3: 3 mons at
> >> >
> >> > {dc1-master-ds01=
> 10.2.0.1:6789/0,dc1-master-ds02=10.2.0.2:6789/0,dc1-master-ds03=10.2.0.3:6789/0
> }
> >> > election epoch 140, quorum 0,1,2
> >> > dc1-master-ds01,dc1-master-ds02,dc1-master-ds03
> >> >   fsmap e18: 1/1/1 up {0=b=up:active}, 1 up:standby
> >> >  osdmap e15851: 10 osds: 10 up, 10 in
> >> > flags sortbitwise
> >> >   pgmap v11924989: 1088 pgs, 18 pools, 11496 GB data, 19669
> kobjects
> >> > 23325 GB used, 6349 GB / 29675 GB avail
> >> > 24/40282627 objects degraded (0.000%)
> >> >  958 active+clean
> >> >  128 active+undersized+degraded
> >> >2 active+clean+scrubbing
> >> >   client io 1968 B/s rd, 1 op/s rd, 0 op/s wr
> >> >
> >> > # ceph health detail
> >> > -> https://paste.debian.net/895825/
> >> >
> >> > # ceph osd lspools
> >> > 2 .rgw.root,3 master.rgw.control,4 master.rgw.data.root,5
> >> > master.rgw.gc,6
> >> > master.rgw.log,7 master.rgw.intent-log,8 master.rgw.usage,9
> >> > master.rgw.users.keys,10 master.rgw.users.email,11
> >> > master.rgw.users.swift,12
> >> > master.rgw.users.uid,13 master.rgw.buckets.index,14
> >> > master.rgw.buckets.data,15 master.rgw.meta,16
> >> > master.rgw.buckets.non-ec,22
> >> > rbd,23 cephfs_metadata,24 cephfs_data,
> >> >
> >> > on this cluster I run cephfs, which is empty atm, and a radosgw
> service.
> >> > How can I clean this?
> >>
> >> Stop your MDS daemons
> >> Run "ceph mds fail " for each MDS daemon
> >> Use "ceph fs rm "
> >> Then you can delete the pools.
> >>
> >> John
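(Put together, those steps look roughly like the sketch below. The systemd unit name
and the placeholder ids are assumptions; the pool names come from the lspools output
above, and both removal commands require their explicit confirmation flags.)

# systemctl stop ceph-mds@<id>        # on each MDS host; the unit/init script name depends on the deployment
# ceph mds fail 0                     # repeat per active MDS rank
# ceph fs rm <fs name> --yes-i-really-mean-it
# ceph osd pool delete cephfs_data cephfs_data --yes-i-really-really-mean-it
# ceph osd pool delete cephfs_metadata cephfs_metadata --yes-i-really-really-mean-it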
> >>
> >> >
> >> >
>


Re: [ceph-users] Can't recover pgs degraded/stuck unclean/undersized

2016-11-15 Thread Webert de Souza Lima
I'm sorry, I meant *cephfs_data* and *cephfs_metadata*

On Tue, Nov 15, 2016 at 10:15 AM Webert de Souza Lima <webert.b...@gmail.com>
wrote:

> Hey John.
>
> Just to be sure; by "deleting the pools" you mean the *cephfs_metadata*
>  and *cephfs_metadata* pools, right?
> Does it have any impact over radosgw? Thanks.
>
> On Tue, Nov 15, 2016 at 10:10 AM John Spray <jsp...@redhat.com> wrote:
>
> On Tue, Nov 15, 2016 at 11:58 AM, Webert de Souza Lima
> <webert.b...@gmail.com> wrote:
> > Hi,
> >
> > after running cephfs on my ceph cluster I got stuck with the following
> > health status:
> >
> > # ceph status
> > cluster ac482f5b-dce7-410d-bcc9-7b8584bd58f5
> >  health HEALTH_WARN
> > 128 pgs degraded
> > 128 pgs stuck unclean
> > 128 pgs undersized
> > recovery 24/40282627 objects degraded (0.000%)
> >  monmap e3: 3 mons at
> > {dc1-master-ds01=10.2.0.1:6789/0,dc1-master-ds02=10.2.0.2:6789/0,dc1-master-ds03=10.2.0.3:6789/0}
> > election epoch 140, quorum 0,1,2
> > dc1-master-ds01,dc1-master-ds02,dc1-master-ds03
> >   fsmap e18: 1/1/1 up {0=b=up:active}, 1 up:standby
> >  osdmap e15851: 10 osds: 10 up, 10 in
> > flags sortbitwise
> >   pgmap v11924989: 1088 pgs, 18 pools, 11496 GB data, 19669 kobjects
> > 23325 GB used, 6349 GB / 29675 GB avail
> > 24/40282627 objects degraded (0.000%)
> >  958 active+clean
> >  128 active+undersized+degraded
> >    2 active+clean+scrubbing
> >   client io 1968 B/s rd, 1 op/s rd, 0 op/s wr
> >
> > # ceph health detail
> > -> https://paste.debian.net/895825/
> >
> > # ceph osd lspools
> > 2 .rgw.root,3 master.rgw.control,4 master.rgw.data.root,5 master.rgw.gc,6
> > master.rgw.log,7 master.rgw.intent-log,8 master.rgw.usage,9
> > master.rgw.users.keys,10 master.rgw.users.email,11
> master.rgw.users.swift,12
> > master.rgw.users.uid,13 master.rgw.buckets.index,14
> > master.rgw.buckets.data,15 master.rgw.meta,16
> master.rgw.buckets.non-ec,22
> > rbd,23 cephfs_metadata,24 cephfs_data,
> >
> > on this cluster I run cephfs, which is empty atm, and a radosgw service.
> > How can I clean this?
>
> Stop your MDS daemons
> Run "ceph mds fail " for each MDS daemon
> Use "ceph fs rm "
> Then you can delete the pools.
>
> John
>
> >
> >
>
>


Re: [ceph-users] Can't recover pgs degraded/stuck unclean/undersized

2016-11-15 Thread Webert de Souza Lima
Hey John.

Just to be sure; by "deleting the pools" you mean the *cephfs_metadata* and
*cephfs_metadata* pools, right?
Does it have any impact over radosgw? Thanks.

On Tue, Nov 15, 2016 at 10:10 AM John Spray <jsp...@redhat.com> wrote:

> On Tue, Nov 15, 2016 at 11:58 AM, Webert de Souza Lima
> <webert.b...@gmail.com> wrote:
> > Hi,
> >
> > after running cephfs on my ceph cluster I got stuck with the following
> > health status:
> >
> > # ceph status
> > cluster ac482f5b-dce7-410d-bcc9-7b8584bd58f5
> >  health HEALTH_WARN
> > 128 pgs degraded
> > 128 pgs stuck unclean
> > 128 pgs undersized
> > recovery 24/40282627 objects degraded (0.000%)
> >  monmap e3: 3 mons at
> > {dc1-master-ds01=10.2.0.1:6789/0,dc1-master-ds02=10.2.0.2:6789/0,dc1-master-ds03=10.2.0.3:6789/0}
> > election epoch 140, quorum 0,1,2
> > dc1-master-ds01,dc1-master-ds02,dc1-master-ds03
> >   fsmap e18: 1/1/1 up {0=b=up:active}, 1 up:standby
> >  osdmap e15851: 10 osds: 10 up, 10 in
> > flags sortbitwise
> >   pgmap v11924989: 1088 pgs, 18 pools, 11496 GB data, 19669 kobjects
> > 23325 GB used, 6349 GB / 29675 GB avail
> > 24/40282627 objects degraded (0.000%)
> >  958 active+clean
> >  128 active+undersized+degraded
> >    2 active+clean+scrubbing
> >   client io 1968 B/s rd, 1 op/s rd, 0 op/s wr
> >
> > # ceph health detail
> > -> https://paste.debian.net/895825/
> >
> > # ceph osd lspools
> > 2 .rgw.root,3 master.rgw.control,4 master.rgw.data.root,5 master.rgw.gc,6
> > master.rgw.log,7 master.rgw.intent-log,8 master.rgw.usage,9
> > master.rgw.users.keys,10 master.rgw.users.email,11
> master.rgw.users.swift,12
> > master.rgw.users.uid,13 master.rgw.buckets.index,14
> > master.rgw.buckets.data,15 master.rgw.meta,16
> master.rgw.buckets.non-ec,22
> > rbd,23 cephfs_metadata,24 cephfs_data,
> >
> > on this cluster I run cephfs, which is empty atm, and a radosgw service.
> > How can I clean this?
>
> Stop your MDS daemons
> Run "ceph mds fail " for each MDS daemon
> Use "ceph fs rm "
> Then you can delete the pools.
>
> John
>
> >
> >
>

