Re: [ceph-users] Ceph OOM Killer Luminous

2018-12-21 Thread Brad Hubbard
Can you provide the complete OOM message from the dmesg log?
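Something along these lines will pull it out (a sketch: the exact message text varies by kernel version, and the sample line below stands in for real `dmesg -T` output, which you would pipe in instead):

```shell
# Sketch: isolate OOM-killer lines from kernel log text.
# In practice, feed this `dmesg -T` output (with -B/-A context to capture
# the per-process memory table); here a sample line stands in for the log.
log='[Fri Dec 21 15:02:11 2018] Out of memory: Kill process 1234 (ceph-osd) score 905 or sacrifice child'
echo "$log" | grep -iE 'out of memory|oom-killer'
```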

On Sat, Dec 22, 2018 at 7:53 AM Pardhiv Karri  wrote:



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph OOM Killer Luminous

2018-12-21 Thread Pardhiv Karri
Thank You for the quick response Dyweni!

We are using FileStore, as this cluster was upgraded from
Hammer-->Jewel-->Luminous 12.2.8. Every node has 16 x 2 TB HDDs; the R730xd
nodes have 128 GB of RAM and the R740xd nodes have 96 GB. Everything else is
the same.

Thanks,
Pardhiv Karri

On Fri, Dec 21, 2018 at 1:43 PM Dyweni - Ceph-Users <6exbab4fy...@dyweni.com>
wrote:

> Hi,
>
>
> You could be running out of memory due to the default Bluestore cache
> sizes.
>
>
> How many disks/OSDs in the R730xd versus the R740xd?  How much memory in
> each server type?  How many are HDD versus SSD?  Are you running Bluestore?
>
>
> OSD's in Luminous, which run Bluestore, allocate memory to use as a
> "cache", since the kernel-provided page-cache is not available to
> Bluestore.  Bluestore, by default, will use 1GB of memory for each HDD, and
> 3GB of memory for each SSD.  OSD's do not allocate all that memory up
> front, but grow into it as it is used.  This cache is in addition to any
> other memory the OSD uses.
>
>
> Check out the bluestore_cache_* values (these are specified in bytes) in
> the manual cache sizing section of the docs (
> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/).
>  Note that the automatic cache sizing feature wasn't added until 12.2.9.
>
>
>
> As an example, I have OSD's running on 32bit/armhf nodes.  These nodes
> have 2GB of memory.  I run 1 Bluestore OSD on each node.  In my ceph.conf
> file, I have 'bluestore cache size = 536870912' and 'bluestore cache kv max
> = 268435456'.  I see aprox 1.35-1.4 GB used by each OSD.
>
>
>
>
> On 2018-12-21 15:19, Pardhiv Karri wrote:
>
> Hi,
>
> We have a luminous cluster which was upgraded from Hammer --> Jewel -->
> Luminous 12.2.8 recently. Post upgrade we are seeing issue with a few nodes
> where they are running out of memory and dying. In the logs we are seeing
> OOM killer. We don't have this issue before upgrade. The only difference is
> the nodes without any issue are R730xd and the ones with the memory leak
> are R740xd. The hardware vendor don't see anything wrong with the hardware.
> From Ceph end we are not seeing any issue when it comes to running the
> cluster, only issue is with memory leak. Right now we are actively
> rebooting the nodes in timely manner to avoid crashes. One R740xd node we
> set all the OSDs to 0.0 and there is no memory leak there. Any pointers to
> fix the issue would be helpful.
>
> Thanks,
> *Pardhiv Karri*
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>

-- 
*Pardhiv Karri*
"Rise and Rise again until LAMBS become LIONS"


Re: [ceph-users] Ceph OOM Killer Luminous

2018-12-21 Thread Dyweni - Ceph-Users
Hi, 

You could be running out of memory due to the default Bluestore cache
sizes. 

How many disks/OSDs in the R730xd versus the R740xd?  How much memory in
each server type?  How many are HDD versus SSD?  Are you running
Bluestore? 

OSDs in Luminous which run Bluestore allocate memory to use as a
"cache", since the kernel-provided page cache is not available to
Bluestore.  By default, Bluestore will use 1 GB of memory for each HDD
and 3 GB of memory for each SSD.  OSDs do not allocate all of that memory
up front, but grow into it as it is used.  This cache is in addition to
any other memory the OSD uses.
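As a rough sanity check (my own back-of-the-envelope sketch, not something from the Ceph docs), that means the cache alone can claim a sizable slice of node RAM under the defaults:

```python
GIB = 2 ** 30  # one gibibyte in bytes

def expected_cache_bytes(hdd_osds: int, ssd_osds: int) -> int:
    """Rough upper bound on total Bluestore cache memory per node under
    the default settings: 1 GiB per HDD OSD plus 3 GiB per SSD OSD.
    This excludes each OSD daemon's own baseline memory usage."""
    return hdd_osds * 1 * GIB + ssd_osds * 3 * GIB

# The poster's nodes: 16 HDD OSDs, no SSDs -> 16 GiB of cache alone.
print(expected_cache_bytes(16, 0) / GIB)  # 16.0
```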

Check out the bluestore_cache_* values (these are specified in bytes) in
the manual cache sizing section of the docs
(http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/).
Note that the automatic cache sizing feature wasn't added until 12.2.9.

As an example, I have OSDs running on 32-bit/armhf nodes.  These nodes
have 2 GB of memory, and I run 1 Bluestore OSD on each.  In my
ceph.conf file, I have 'bluestore cache size = 536870912' and 'bluestore
cache kv max = 268435456'.  I see approx. 1.35-1.4 GB used by each OSD.
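In ceph.conf form, that looks like the following (the same values as above; the 512 MiB / 256 MiB split is just what fits my 2 GB nodes, not a general recommendation):

```ini
[osd]
# Cap the Bluestore cache well below node RAM (values are in bytes).
bluestore cache size = 536870912    # 512 MiB total cache per OSD
bluestore cache kv max = 268435456  # at most 256 MiB of that for RocksDB
```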

On 2018-12-21 15:19, Pardhiv Karri wrote:

> Hi, 
> 
> We have a luminous cluster which was upgraded from Hammer --> Jewel --> 
> Luminous 12.2.8 recently. Post upgrade we are seeing issue with a few nodes 
> where they are running out of memory and dying. In the logs we are seeing OOM 
> killer. We don't have this issue before upgrade. The only difference is the 
> nodes without any issue are R730xd and the ones with the memory leak are 
> R740xd. The hardware vendor don't see anything wrong with the hardware. From 
> Ceph end we are not seeing any issue when it comes to running the cluster, 
> only issue is with memory leak. Right now we are actively rebooting the nodes 
> in timely manner to avoid crashes. One R740xd node we set all the OSDs to 0.0 
> and there is no memory leak there. Any pointers to fix the issue would be 
> helpful. 
> 
> Thanks, 
> PARDHIV KARRI 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph OOM Killer Luminous

2018-12-21 Thread Pardhiv Karri
Hi,

We have a Luminous cluster which was recently upgraded from Hammer --> Jewel
--> Luminous 12.2.8. Post-upgrade we are seeing an issue with a few nodes
where they run out of memory and die; in the logs we see the OOM killer.
We didn't have this issue before the upgrade. The only difference is that
the nodes without any issue are R730xd and the ones with the memory leak
are R740xd. The hardware vendor doesn't see anything wrong with the hardware.
From the Ceph end we are not seeing any issue running the cluster; the
only issue is the memory leak. Right now we are proactively rebooting the
nodes in a timely manner to avoid crashes. On one R740xd node we set all
the OSD weights to 0.0, and there is no memory leak there. Any pointers to
fix the issue would be helpful.

Thanks,
*Pardhiv Karri*