Re: [ceph-users] kraken-bluestore 11.2.0 memory leak issue

2017-02-19 Thread Jay Linux
Hello Shinobu,

We have already raised a ticket for this issue. FYI:
http://tracker.ceph.com/issues/18924

Thanks
Jayaram


On Mon, Feb 20, 2017 at 12:36 AM, Shinobu Kinjo  wrote:

> Please open a ticket at http://tracker.ceph.com if you haven't yet.
>

Re: [ceph-users] kraken-bluestore 11.2.0 memory leak issue

2017-02-19 Thread Shinobu Kinjo
Please open a ticket at http://tracker.ceph.com if you haven't yet.

On Thu, Feb 16, 2017 at 6:07 PM, Muthusamy Muthiah
 wrote:
> Hi Wido,
>
> Thanks for the information; please let us know if this turns out to be a bug.
> As a workaround we will set bluestore_cache_size to a small value (100 MB).
>
> Thanks,
> Muthu

Re: [ceph-users] kraken-bluestore 11.2.0 memory leak issue

2017-02-16 Thread Muthusamy Muthiah
Hi Wido,

Thanks for the information; please let us know if this turns out to be a bug.
As a workaround we will set bluestore_cache_size to a small value (100 MB).
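
For reference, a minimal sketch of how the workaround could look; the [osd]
section placement, the osd.3 unit name and the runtime injection are
assumptions to adapt to your deployment:

    # ceph.conf on each OSD host: cap the BlueStore cache at 100 MB
    [osd]
    bluestore_cache_size = 104857600

followed by an OSD restart (e.g. systemctl restart ceph-osd@3) so the new
size takes effect. The value can also be injected at runtime with
"ceph tell osd.* injectargs '--bluestore_cache_size=104857600'", though a
cache-size change may only fully apply after a restart.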

Thanks,
Muthu

On 16 February 2017 at 14:04, Wido den Hollander  wrote:

>
> > On 16 February 2017 at 7:19, Muthusamy Muthiah <muthiah.muthus...@gmail.com> wrote:
> >
> >
> > Thanks Ilya Letkowski for the information; we will change this value
> > accordingly.
> >
>
> What I understood from yesterday's performance meeting is that this seems
> like a bug. Lowering this buffer reduces memory usage, but the root cause
> seems to be memory that is not being freed: a few bytes of a larger
> allocation remain in use, which keeps the whole buffer from being freed.
>
> Tried:
>
> debug_mempools = true
>
> $ ceph daemon osd.X dump_mempools
>
> Might want to view the YouTube video of yesterday when it's online:
> https://www.youtube.com/channel/UCno-Fry25FJ7B4RycCxOtfw/videos
>
> Wido
>
> > Thanks,
> > Muthu
> >
> > On 15 February 2017 at 17:03, Ilya Letkowski 
> > wrote:
> >
> > > Hi, Muthusamy Muthiah
> > >
> > > I'm not totally sure that this is a memory leak.
> > > We had same problems with bluestore on ceph v11.2.0.
> > > Reduce bluestore cache helped us to solve it and stabilize OSD memory
> > > consumption on the 3GB level.
> > >
> > > Perhaps this will help you:
> > >
> > > bluestore_cache_size = 104857600
> > >
> > >
> > >
> > > On Tue, Feb 14, 2017 at 11:52 AM, Muthusamy Muthiah <
> > > muthiah.muthus...@gmail.com> wrote:
> > >
> > >> Hi All,
> > >>
> > >> On all our 5 node cluster with ceph 11.2.0 we encounter memory leak
> > >> issues.
> > >>
> > >> Cluster details : 5 node with 24/68 disk per node , EC : 4+1 , RHEL
> 7.2
> > >>
> > >> Some traces using sar are below and attached the memory utilisation
> graph
> > >> .
> > >>
> > >> (16:54:42)[cn2.c1 sa] # sar -r
> > >> 07:50:01 kbmemfree kbmemused %memused kbbuffers kbcached kbcommit
> %commit
> > >> kbactive kbinact kbdirty
> > >> 10:20:01 32077264 132754368 80.54 16176 3040244 77767024 47.18
> 51991692
> > >> 2676468 260
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> *10:30:01 32208384 132623248 80.46 16176 3048536 77832312 47.22
> 51851512
> > >> 2684552 1210:40:01 32067244 132764388 80.55 16176 3059076 77832316
> 47.22
> > >> 51983332 2694708 26410:50:01 30626144 134205488 81.42 16176 3064340
> > >> 78177232 47.43 53414144 2693712 411:00:01 28927656 135903976 82.45
> 16176
> > >> 3074064 78958568 47.90 55114284 2702892 1211:10:01 27158548 137673084
> 83.52
> > >> 16176 3080600 80553936 48.87 56873664 2708904 1211:20:01 2646
> 138376076
> > >> 83.95 16176 3080436 81991036 49.74 57570280 2708500 811:30:01 26002252
> > >> 138829380 84.22 16176 3090556 82223840 49.88 58015048 2718036
> 1611:40:01
> > >> 25965924 138865708 84.25 16176 3089708 83734584 50.80 58049980 2716740
> > >> 1211:50:01 26142888 138688744 84.14 16176 3089544 83800100 50.84
> 57869628
> > >> 2715400 16*
> > >>
> > >> ...
> > >> ...
> > >>
> > >> In the attached graph, there is increase in memory utilisation by
> > >> ceph-osd during soak test. And when it reaches the system limit of
> 128GB
> > >> RAM , we could able to see the below dmesg logs related to memory out
> when
> > >> the system reaches close to 128GB RAM. OSD.3 killed due to Out of
> memory
> > >> and started again.
> > >>
> > >> [Tue Feb 14 03:51:02 2017] *tp_osd_tp invoked oom-killer:
> > >> gfp_mask=0x280da, order=0, oom_score_adj=0*
> > >> [Tue Feb 14 03:51:02 2017] tp_osd_tp cpuset=/ mems_allowed=0-1
> > >> [Tue Feb 14 03:51:02 2017] CPU: 20 PID: 11864 Comm: tp_osd_tp Not
> tainted
> > >> 3.10.0-327.13.1.el7.x86_64 #1
> > >> [Tue Feb 14 03:51:02 2017] Hardware name: HP ProLiant XL420
> Gen9/ProLiant
> > >> XL420 Gen9, BIOS U19 09/12/2016
> > >> [Tue Feb 14 03:51:02 2017]  8819ccd7a280 30e84036
> > >> 881fa58f7528 816356f4
> > >> [Tue Feb 14 03:51:02 2017]  881fa58f75b8 8163068f
> > >> 881fa3478360 881fa3478378
> > >> [Tue Feb 14 03:51:02 2017]  881fa58f75e8 8819ccd7a280
> > >> 0001 0001f65f
> > >> [Tue Feb 14 03:51:02 2017] Call Trace:
> > >> [Tue Feb 14 03:51:02 2017]  [] dump_stack+0x19/0x1b
> > >> [Tue Feb 14 03:51:02 2017]  []
> dump_header+0x8e/0x214
> > >> [Tue Feb 14 03:51:02 2017]  []
> > >> oom_kill_process+0x24e/0x3b0
> > >> [Tue Feb 14 03:51:02 2017]  [] ?
> > >> find_lock_task_mm+0x56/0xc0
> > >> [Tue Feb 14 03:51:02 2017]  []
> > >> *out_of_memory+0x4b6/0x4f0*
> > >> [Tue Feb 14 03:51:02 2017]  []
> > >> __alloc_pages_nodemask+0xa95/0xb90
> > >> [Tue Feb 14 03:51:02 2017]  []
> > >> alloc_pages_vma+0x9a/0x140
> > >> [Tue Feb 14 03:51:02 2017]  []
> > >> handle_mm_fault+0xb85/0xf50
> > >> [Tue Feb 14 03:51:02 2017]  [] ?
> > >> follow_page_mask+0xbb/0x5c0
> > >> [Tue Feb 14 03:51:02 2017]  []
> > >> __get_user_pages+0x19b/0x640
> > >> [Tue Feb 14 03:51:02 2017]  []
> > >> get_user_pages_unlocked+0x15d/0x1f0
> > >> [Tue Feb 14 03:51:02 2017]  []
> > >> 

Re: [ceph-users] kraken-bluestore 11.2.0 memory leak issue

2017-02-16 Thread Wido den Hollander

> On 16 February 2017 at 7:19, Muthusamy Muthiah wrote:
> 
> 
> Thanks Ilya Letkowski for the information; we will change this value
> accordingly.
> 

What I understood from yesterday's performance meeting is that this seems like
a bug. Lowering this buffer reduces memory usage, but the root cause seems to
be memory that is not being freed: a few bytes of a larger allocation remain
in use, which keeps the whole buffer from being freed.
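
A quick way to check whether that memory is still referenced or merely held by
the allocator is tcmalloc's heap introspection; this assumes the OSDs are
linked against tcmalloc (the usual build) and osd.3 is only an example ID:

    ceph tell osd.3 heap stats      # allocator view: in-use vs. freed-but-retained bytes
    ceph tell osd.3 heap release    # ask tcmalloc to return free pages to the OS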

Tried:

debug_mempools = true

$ ceph daemon osd.X dump_mempools
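
A rough sketch of capturing timestamped snapshots so per-pool growth can be
compared over time; osd.3 and the log path are placeholders:

    # take a dump_mempools snapshot every 10 minutes
    while true; do
        date '+%F %T'                   >> /var/log/ceph/osd.3-mempools.log
        ceph daemon osd.3 dump_mempools >> /var/log/ceph/osd.3-mempools.log
        sleep 600
    done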

Might want to view the YouTube video of yesterday when it's online: 
https://www.youtube.com/channel/UCno-Fry25FJ7B4RycCxOtfw/videos

Wido


Re: [ceph-users] kraken-bluestore 11.2.0 memory leak issue

2017-02-15 Thread Muthusamy Muthiah
Thanks Ilya Letkowski for the information; we will change this value
accordingly.

Thanks,
Muthu

On 15 February 2017 at 17:03, Ilya Letkowski 
wrote:

> Hi, Muthusamy Muthiah
>
> I'm not totally sure that this is a memory leak.
> We had the same problem with BlueStore on Ceph v11.2.0.
> Reducing the bluestore cache helped us to solve it and stabilize OSD memory
> consumption at around the 3 GB level.
>
> Perhaps this will help you:
>
> bluestore_cache_size = 104857600
>
>
>
> On Tue, Feb 14, 2017 at 11:52 AM, Muthusamy Muthiah <
> muthiah.muthus...@gmail.com> wrote:
>
>> Hi All,
>>
> >> On all five nodes of our cluster running Ceph 11.2.0 we are encountering
> >> memory leak issues.
> >>
> >> Cluster details: 5 nodes with 24/68 disks per node, EC 4+1, RHEL 7.2.
> >>
> >> Some traces captured with sar are below; the memory utilisation graph is
> >> attached.
>>
>> (16:54:42)[cn2.c1 sa] # sar -r
>> 07:50:01 kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit
>> kbactive kbinact kbdirty
>> 10:20:01 32077264 132754368 80.54 16176 3040244 77767024 47.18 51991692
>> 2676468 260
>>
>>
>>
>>
>>
>>
>>
>>
>> *10:30:01 32208384 132623248 80.46 16176 3048536 77832312 47.22 51851512
>> 2684552 1210:40:01 32067244 132764388 80.55 16176 3059076 77832316 47.22
>> 51983332 2694708 26410:50:01 30626144 134205488 81.42 16176 3064340
>> 78177232 47.43 53414144 2693712 411:00:01 28927656 135903976 82.45 16176
>> 3074064 78958568 47.90 55114284 2702892 1211:10:01 27158548 137673084 83.52
>> 16176 3080600 80553936 48.87 56873664 2708904 1211:20:01 2646 138376076
>> 83.95 16176 3080436 81991036 49.74 57570280 2708500 811:30:01 26002252
>> 138829380 84.22 16176 3090556 82223840 49.88 58015048 2718036 1611:40:01
>> 25965924 138865708 84.25 16176 3089708 83734584 50.80 58049980 2716740
>> 1211:50:01 26142888 138688744 84.14 16176 3089544 83800100 50.84 57869628
>> 2715400 16*
>>
>> ...
>> ...
>>
> >> The attached graph shows ceph-osd memory utilisation increasing during the
> >> soak test. When usage approaches the system limit of 128 GB RAM, we see the
> >> dmesg output below reporting an out-of-memory condition: osd.3 was killed
> >> by the OOM killer and then started again.
>>
>> [Tue Feb 14 03:51:02 2017] *tp_osd_tp invoked oom-killer:
>> gfp_mask=0x280da, order=0, oom_score_adj=0*
>> [Tue Feb 14 03:51:02 2017] tp_osd_tp cpuset=/ mems_allowed=0-1
>> [Tue Feb 14 03:51:02 2017] CPU: 20 PID: 11864 Comm: tp_osd_tp Not tainted
>> 3.10.0-327.13.1.el7.x86_64 #1
>> [Tue Feb 14 03:51:02 2017] Hardware name: HP ProLiant XL420 Gen9/ProLiant
>> XL420 Gen9, BIOS U19 09/12/2016
>> [Tue Feb 14 03:51:02 2017]  8819ccd7a280 30e84036
>> 881fa58f7528 816356f4
>> [Tue Feb 14 03:51:02 2017]  881fa58f75b8 8163068f
>> 881fa3478360 881fa3478378
>> [Tue Feb 14 03:51:02 2017]  881fa58f75e8 8819ccd7a280
>> 0001 0001f65f
>> [Tue Feb 14 03:51:02 2017] Call Trace:
>> [Tue Feb 14 03:51:02 2017]  [] dump_stack+0x19/0x1b
>> [Tue Feb 14 03:51:02 2017]  [] dump_header+0x8e/0x214
>> [Tue Feb 14 03:51:02 2017]  []
>> oom_kill_process+0x24e/0x3b0
>> [Tue Feb 14 03:51:02 2017]  [] ?
>> find_lock_task_mm+0x56/0xc0
>> [Tue Feb 14 03:51:02 2017]  []
>> *out_of_memory+0x4b6/0x4f0*
>> [Tue Feb 14 03:51:02 2017]  []
>> __alloc_pages_nodemask+0xa95/0xb90
>> [Tue Feb 14 03:51:02 2017]  []
>> alloc_pages_vma+0x9a/0x140
>> [Tue Feb 14 03:51:02 2017]  []
>> handle_mm_fault+0xb85/0xf50
>> [Tue Feb 14 03:51:02 2017]  [] ?
>> follow_page_mask+0xbb/0x5c0
>> [Tue Feb 14 03:51:02 2017]  []
>> __get_user_pages+0x19b/0x640
>> [Tue Feb 14 03:51:02 2017]  []
>> get_user_pages_unlocked+0x15d/0x1f0
>> [Tue Feb 14 03:51:02 2017]  []
>> get_user_pages_fast+0x9f/0x1a0
>> [Tue Feb 14 03:51:02 2017]  []
>> do_blockdev_direct_IO+0x1a78/0x2610
>> [Tue Feb 14 03:51:02 2017]  [] ? I_BDEV+0x10/0x10
>> [Tue Feb 14 03:51:02 2017]  []
>> __blockdev_direct_IO+0x55/0x60
>> [Tue Feb 14 03:51:02 2017]  [] ? I_BDEV+0x10/0x10
>> [Tue Feb 14 03:51:02 2017]  []
>> blkdev_direct_IO+0x57/0x60
>> [Tue Feb 14 03:51:02 2017]  [] ? I_BDEV+0x10/0x10
>> [Tue Feb 14 03:51:02 2017]  []
>> generic_file_aio_read+0x6d3/0x750
>> [Tue Feb 14 03:51:02 2017]  [] ?
>> xfs_iunlock+0x11c/0x130 [xfs]
>> [Tue Feb 14 03:51:02 2017]  [] ? unlock_page+0x2b/0x30
>> [Tue Feb 14 03:51:02 2017]  [] ? __do_fault+0x401/0x510
>> [Tue Feb 14 03:51:02 2017]  [] blkdev_aio_read+0x4c/0x70
>> [Tue Feb 14 03:51:02 2017]  [] do_sync_read+0x8d/0xd0
>> [Tue Feb 14 03:51:02 2017]  [] vfs_read+0x9c/0x170
>> [Tue Feb 14 03:51:02 2017]  [] SyS_pread64+0x92/0xc0
>> [Tue Feb 14 03:51:02 2017]  []
>> system_call_fastpath+0x16/0x1b
>>
>>
>> Feb 14 03:51:40 fr-paris kernel: *Out of memory: Kill process 7657
>> (ceph-osd) score 45 or sacrifice child*
>> Feb 14 03:51:40 fr-paris kernel: Killed process 7657 (ceph-osd)
>> total-vm:8650208kB, anon-rss:6124660kB, file-rss:1560kB
>> Feb 14 03:51:41 fr-paris systemd:* ceph-osd@3.service: main process
>> exited, 

Re: [ceph-users] kraken-bluestore 11.2.0 memory leak issue

2017-02-15 Thread Ilya Letkouski
Hi, Muthusamy Muthiah

I'm not totally sure that this is a memory leak.
We had the same problem with BlueStore on Ceph v11.2.0.
Reducing the bluestore cache helped us to solve it and stabilize OSD memory
consumption at around the 3 GB level.

Perhaps this will help you:

bluestore_cache_size = 104857600
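
A minimal sketch of how the active value and the pool usage can then be
checked on a running OSD over the admin socket; osd.0 is just an example ID:

    ceph daemon osd.0 config show | grep bluestore_cache_size
    ceph daemon osd.0 dump_mempools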
