Jan,
The nodes each have 62G of memory and 32G of swap space.
At the time of the crash the kernel reports that beam.smp is consuming 62G of
memory plus 32G of swap.
The OS settings I've changed (which might be related) are listed below; where each one lives is sketched after the list:
Soft/Hard nofile limit = 4096
net.core.somaxconn = 1024
net.core.netdev_max_backlog = 2000
net.ipv4.tcp_max_syn_backlog = 2048
ERL_MAX_PORTS = 4096
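For reference, these were applied in the usual places, roughly:

  # /etc/security/limits.conf -- nofile limits for the couchdb user
  couchdb  soft  nofile  4096
  couchdb  hard  nofile  4096

  # /etc/sysctl.conf, reloaded with sysctl -p
  net.core.somaxconn = 1024
  net.core.netdev_max_backlog = 2000
  net.ipv4.tcp_max_syn_backlog = 2048

  # ERL_MAX_PORTS exported in the environment CouchDB starts under
  export ERL_MAX_PORTS=4096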
In default.ini the changes from the base file are (all under [cluster]):
q=12
r=2
w=2
n=3
In local.ini the changes from the base file are:
credentials = true
max_connections = 1024
uuid = 712c6af6adce6e9ea43868cd7f78b35f
max_dbs_open = 500
allow_jsonp = true
delayed_commits = false
enable_cors = true
nodejs = /usr/local/bin/node /home/couchdb/couchdb/share/server/main.js
os_process_limit = 900
require_valid_user = true
[cors]
credentials = true
max_connections = 1024
[log]
file = /home/couchdb/couchdb/var/log/couchdb.log
level = debug
writer = file
[compaction]
_default = [{db_fragmentation, "40%"}, {view_fragmentation, "40%"}, {from, "22:00"}, {to, "06:00"}]
This is consistent across all 12 nodes that we have running.
I've tried running it without the OS changes as well, and it still crashes with
an OOM.
-Tayven
________________________________
From: Joan Touzet <[email protected]>
Sent: Tuesday, January 31, 2017 11:12:12 AM
To: [email protected]
Cc: Tayven Bigelow; Nick Becker
Subject: Re: Crashing due to memory use
Tayven,
Thanks for the info.
How much RAM is in this node? Do you know approximately how much RAM the
beam.smp process is consuming when the oom-killer takes action? Have you
changed any settings in default.ini/local.ini?
-Joan
----- Original Message -----
> From: "Tayven Bigelow" <[email protected]>
> To: "Jan Lehnardt" <[email protected]>, [email protected]
> Cc: "Nick Becker" <[email protected]>
> Sent: Tuesday, January 31, 2017 12:49:11 PM
> Subject: Re: Crashing due to memory use
>
> Hey Jan!
>
>
> You'd be correct on the multiple postings; we weren't sure they were
> being posted.
>
> We currently run this in production on Cloudant and were hoping to
> have a backup utilizing the new CouchDB 2.0. We are able to
> consistently replicate.
>
> The memory leak happens when we kick off a new view.
> beam.smp is killed by the kernel's OOM killer.
>
> Checking /var/log/syslog shows:
> Jan 31 18:32:44 couchdb7 kernel: [594086.565577] Out of memory: Kill
> process 23731 (beam.smp) score 961 or sacrifice child
> Jan 31 18:32:44 couchdb7 kernel: [594086.565622] Killed process 23773
> (memsup) total-vm:4228kB, anon-rss:12kB, file-rss:0kB
> Jan 31 18:32:44 couchdb7 kernel: [594086.569327] Out of memory: Kill
> process 23731 (beam.smp) score 961 or sacrifice child
> Jan 31 18:32:44 couchdb7 kernel: [594086.569392] Killed process 23731
> (beam.smp) total-vm:126594220kB, anon-rss:64708732kB, file-rss:0kB
> Jan 31 18:32:56 couchdb7 monit[9113]: 'couchdb' process is not
> running
>
> The couchdb.log file at the time of crash contains:
>
> 1981936-[debug] 2017-01-31T17:16:35.355774Z
> [email protected] <0.9036.262> -------- OS Process
> #Port<0.63437> Input ::
> ["map_doc",{"_id":"bill-4690221d-fc07-4278-abdf-cabf1018ecb6","_rev":"5-b90c6c87a0a48e647528a1b3c5bfe12b","MetaData":{"PollId":"147402","Car
> rierId":"25504","UserPollStateId":"3362564708"},"UserId":"1002449829201","CreateDate":"2015-11-23T06:42:40.0285675Z","LastModifiedDate":"2015-11-23T06:43:07.5474967Z","SystemSource":"GeoPoll","AttemptCount":1,"BillingIdentifier":"bill-4690221d-fc07-4278-abdf-cabf1018ecb6
> ","CallbackUri":"http://de-geopoll-1:8645/billingcallback","CallbackSent":true,"Activities":[{"MetaData":{},"CreateDate":"2015-11-23T06:42:59.0297329Z","State":"PROCESSING"},{"MetaData":{},"CreateDate":"2015-11-23T06:42:59.0307329Z","State":"SUCCESS"}],"Currency":"US_Dol
> lar_USD","ConsumerIdentifier":"250025308","ToBeBilledIdentifier":"255763398389","BillType":"Carrier","BillProcessingStateAsString":"SUCCESS","Value":0.11,"BillProcessingState":"SUCCESS","BillingProvider":"TRANSFERTO","NextProcessingTime":"0001-01-01T00:00:00","NextProces
> singTimeAsLong":0,"Id":"bill-4690221d-fc07-4278-abdf-cabf1018ecb6","CreatedDate":"2015-11-23T06:42:40.0285675Z","ModifiedDate":"2015-11-23T06:43:07.5474967Z","Type":"Bill"}]
> 1981937-[debug] 2017-01-31T17:16:35.355856Z
> [email protected] <0.11910.262> -------- OS Process
> #Port<0.63508> Output ::
> [[[["GeoPoll","8921801"],null]],[[["77802","PRETUPS"],null]],[[["77802","PRETUPS","SUCCESS","2014","03","05"],null],[["ALL","PRETUPS","SUCC
> ESS","2014","03","05"],null],[["77802","ALL","SUCCESS","2014","03","05"],null],[["77802","PRETUPS","ALL","2014","03","05"],null],[["ALL","ALL","SUCCESS","2014","03","05"],null],[["ALL","PRETUPS","ALL","2014","03","05"],null],[["77802","ALL","ALL","2014","03","05"],null],
> [["ALL","ALL","ALL","2014","03","05"],null]],[[["77802","2014","3","05"],null]],[["254788760292",null]],[[["PRETUPS","25402","2014-03-05T12:48:59.5664722Z"],43]],[[["PRETUPS","2014-03-05T12:48:59.5664722Z"],43]],[[["PRETUPS","SUCCESS","2014-03-05T12:48:59.5664722Z"],null
> ]],[[["PRETUPS","25402","SUCCESS","2014-03-05T12:48:59.5664722Z"],null]],[[["PRETUPS","25402","2014-03-05T12:48:59.5664722Z"],null]],[[["PRETUPS","2014-03-05T12:48:59.5664722Z"],null]],[[["PRETUPS"],null]],[["254788760292",null]],[["1000374925501",null]],[[[2014,3,5,"PRE
> TUPS","SUCCESS"],null]]]
> 1981938-[debug] 2017-01-31T17:16:35.356012Z
> [email protected] <0.9036.262> -------- OS Process
> #Port<0.63437> Output ::
> [[[["147402","TRANSFERTO","SUCCESS"],null]],[[["TRANSFERTO","SUCCESS","2015-11-23T06:43:07.5474967Z"],null]],[[["TRANSFERTO","SUCCESS","0001
> -01-01T00:00:00"],null]]]
> 1981939-[debug] 2017-01-31T17:16:35.356108Z
> [email protected] <0.11910.262> -------- OS Process
> #Port<0.63508> Input ::
> ["map_doc",{"_id":"bill-197d71d3-3091-47ef-9efe-b154161fcbfb","_rev":"3-832e63f45b45d5e3008b7e7bbe2b7392","MetaData":{"PollId":"77802","CarrierId":"25402","UserPollStateId":"3256532401","CarrierName":"Airtel-Kenya","Pretups.Version":"5.1","Pretups.Uri":"https://41.223.56.108:8093/pretups/C2SReceiver","Auth.Login":"pretups","Auth.Password":"0971500a350af5c3d1c0b12221a0558c","Auth.GatewayCode":"EXTGW","Auth.GatewayType":"EXTGW","Auth.ServicePort":"190","Auth.SourceType":"EXT","Cmd.ExtNwCode":"KE","Cmd.Msisdn":"732810086","Cmd.Pin":"2549","Cmd.Login":"","Cmd.Password":"","Cmd.ExtCode":"2468","CountryCode":"254","MobilePhoneLength":"9","TestMobileNumber":"254733621719","Currency":"KES"},"UserId":"1000277123401","CreateDate":"2014-03-05T13:45:49.6889321Z","LastModifiedDate":"2014-03-05T13:46:14.8050931Z","SystemSource":"GeoPoll","AttemptCount":1,"BillingIdentifier":"bill-197d71d3-3091-47ef-9efe-b154161fcbfb","CallbackUri":"http://uk-app-3:8645/billingcallback","Activities":[{"CreateDate":"2014-03-05T13:46:14.2902898Z","State":"PROCESSING"},{"MetaData":{"Type":"EXRCTRFRESP","Txnid":"R140305.1648.210003","Txnstatus":"200","Date":"05/03/2014
> 16:48:40","Extrefnum":"","Data":null},"CreateDate":"2014-03-05T13:46:14.2912898Z","State":"SUCCESS"}],"Currency":"Kenyan_Shilling_KES","ConsumerIdentifier":"8963201","ToBeBilledIdentifier":"254735960469","BillType":"Carrier","BillProcessingStateAsString":"SUCCESS","Value":43.0,"BillProcessingState":"SUCCESS","BillingProvider":"PRETUPS","NextProcessingTime":"0001-01-01T00:00:00","NextProcessingTimeAsLong":0,"Id":"bill-197d71d3-3091-47ef-9efe-b154161fcbfb","CreatedDate":"2014-03-05T13:45:49.6889321Z","ModifiedDate":"2014-03-05T13:46:14.8050931Z","Type":"Bill"}]
> 1981940:[debug] 2017-01-31T17:32:57.300061Z
> [email protected] <0.111.0> -------- Supervisor
> couch_log_sup started couch_log_monitor:start_link() at pid
> <0.114.0>
> 1981941:[debug] 2017-01-31T17:32:57.301585Z
> [email protected] <0.111.0> -------- Supervisor
> couch_log_sup started config_listener_mon:start_link(couch_log_sup,
> nil) at pid <0.115.0>
> 1981942:[info] 2017-01-31T17:32:57.301605Z
> [email protected] <0.7.0> -------- Application couch_log
> started on node '[email protected]'
> 1981943:[debug] 2017-01-31T17:32:57.302447Z
> [email protected] <0.119.0> -------- Supervisor
> folsom_sup started folsom_sample_slide_sup:start_link() at pid
> <0.120.0>
> 1981944:[debug] 2017-01-31T17:32:57.303229Z
> [email protected] <0.119.0> -------- Supervisor
> folsom_sup started folsom_meter_timer_server:start_link() at pid
> <0.121.0>
> 1981945:[debug] 2017-01-31T17:32:57.303979Z
> [email protected] <0.119.0> -------- Supervisor
> folsom_sup started folsom_metrics_histogram_ets:start_link() at pid
> <0.122.0>
> 1981946:[info] 2017-01-31T17:32:57.304074Z
> [email protected] <0.7.0> -------- Application folsom
> started on node '[email protected]'
> 1981947:[debug] 2017-01-31T17:32:57.325716Z
> [email protected] <0.126.0> -------- Supervisor
> couch_stats_sup started couch_stats_aggregator:start_link() at pid
> <0.127.0>
> 1981948:[debug] 2017-01-31T17:32:57.326519Z
> [email protected] <0.126.0> -------- Supervisor
> couch_stats_sup started couch_stats_process_tracker:start_link() at
> pid <0.177.0>
> 1981949:[info] 2017-01-31T17:32:57.326595Z
> [email protected] <0.7.0> -------- Application
> couch_stats started on node '[email protected]'
> 1981950:[info] 2017-01-31T17:32:57.326673Z
> [email protected] <0.7.0> -------- Application khash
> started on node '[email protected]'
> 1981951:[debug] 2017-01-31T17:32:57.330327Z
> [email protected] <0.182.0> -------- Supervisor
> couch_event_sup2 started couch_event_server:start_link() at pid
> <0.183.0>
> 1981952:[debug] 2017-01-31T17:32:57.331211Z
> [email protected] <0.185.0> -------- Supervisor
> couch_event_os_sup started
> config_listener_mon:start_link(couch_event_os_sup, nil) at pid
> <0.186.0>
> 1981953:[debug] 2017-01-31T17:32:57.331268Z
> [email protected] <0.182.0> -------- Supervisor
> couch_event_sup2 started couch_event_os_sup:start_link() at pid
> <0.185.0>
> 1981954:[info] 2017-01-31T17:32:57.331367Z
> [email protected] <0.7.0> -------- Application
> couch_event started on node '[email protected]'
> 1981955:[debug] 2017-01-31T17:32:57.334167Z
> [email protected] <0.190.0> -------- Supervisor
> ibrowse_sup started ibrowse:start_link() at pid <0.191.0>
> 1981956:[info] 2017-01-31T17:32:57.334239Z
> [email protected] <0.7.0> -------- Application ibrowse
> started on node '[email protected]'
> 1981957:[debug] 2017-01-31T17:32:57.335727Z
> [email protected] <0.196.0> -------- Supervisor ioq_sup
> started config_listener_mon:start_link(ioq_sup, nil) at pid
> <0.197.0>
> 1981958:[debug] 2017-01-31T17:32:57.336685Z
> [email protected] <0.196.0> -------- Supervisor ioq_sup
> started ioq:start_link() at pid <0.198.0>
> 1981959:[info] 2017-01-31T17:32:57.336756Z
> [email protected] <0.7.0> -------- Application ioq
> started on node '[email protected]'
> 1981960:[info] 2017-01-31T17:32:57.336829Z
> [email protected] <0.7.0> -------- Application mochiweb
> started on node '[email protected]'
> 1981961:[info] 2017-01-31T17:32:57.336899Z
> [email protected] <0.7.0> -------- Application oauth
> started on node '[email protected]'
> 1981962:[info] 2017-01-31T17:32:57.340965Z
> [email protected] <0.204.0> -------- Apache CouchDB 2.0.0
> is starting.
>
>
>
> For the large database it would happen when we kicked off 1 out of the
> 39 views on the database; on the smaller database I would have to kick
> off all 5 views within the database (an example view request is shown
> below).
> The large database has 9 design documents, with the smaller database
> having only 1.
> The views are all JS.
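>
> (For reference, a view build is kicked off just by querying the view;
> e.g., roughly:
>
>   curl 'http://127.0.0.1:5984/<dbname>/_design/bills/_view/by_bill_date_and_bill_provider?limit=1'
>
> with <dbname> standing in for the actual database name.)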
> Other than Fail2Ban, UFW, Logwatch, LogRotate, Monit and Zabbix-Agent,
> there is nothing else running on the server, except when we build it
> with Dreyfus and Clouseau.
>
> Example of one of the larger Design documents:
> {
> "_id": "_design/bills",
> "_rev": "4-b0ed6cf8f871391add5004f7e67bc3a8",
> "language": "javascript",
> "auto_update": true,
> "views": {
> "by_bill_date_and_bill_provider": {
> "map": "function(doc) {\n if (doc._id.indexOf(\"bill-\") ===
> 0){\n var date = new
> Date(doc.CreatedDate?doc.CreatedDate:doc.CreateDate);\n
> var year = date.getFullYear();\n var month =
> (date.getMonth() + 1);\n var day = date.getDate();\n
> emit([year, month, day, doc.BillingProvider,
> doc.BillProcessingState], null);\n }\n}",
> "reduce": "_count"
> },
> "by_poll_id_and_bill_date": {
> "map": "function(doc) {\n if ((doc._id.indexOf(\"bill-\") ===
> 0) && doc.MetaData.PollId){\n var date = new
> Date(doc.CreateDate);\n var year =
> date.getFullYear().toString();\n var month =
> (date.getMonth() + 1).toString();\n var day =
> date.getDate().toString();\n if (day.length == 1){\n
> day = \"0\" + day;\n }\n\n
> emit([doc.MetaData.PollId, year, month, day], null);\n
> }\n}",
> "reduce": "_count"
> }
> }
> }
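>
> For readability, here is the first map function above with the JSON
> string escaping removed (same logic, just unescaped):
>
> function(doc) {
>   if (doc._id.indexOf("bill-") === 0) {
>     // use CreatedDate when present, otherwise fall back to CreateDate
>     var date = new Date(doc.CreatedDate ? doc.CreatedDate : doc.CreateDate);
>     var year = date.getFullYear();
>     var month = (date.getMonth() + 1);
>     var day = date.getDate();
>     emit([year, month, day, doc.BillingProvider, doc.BillProcessingState], null);
>   }
> }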
>
> Example of a doc within the larger database:
> {
> "_id": "bill-e2a5a7d1-3d9f-4f9b-b526-13b80b9e6947",
> "_rev": "5-b40e00a54059c6c79004c0afd584fc60",
> "MetaData": {
> "PollId": "1844608",
> "CarrierId": "2701",
> "UserPollStateId": "12614468108"
> },
> "UserId": "1002196088104",
> "CreateDate": "2017-01-31T07:20:58",
> "LastModifiedDate": "2017-01-31T07:21:14.2473555Z",
> "SystemSource": "GeoPoll",
> "AttemptCount": 1,
> "BillingIdentifier": "bill-e2a5a7d1-3d9f-4f9b-b526-13b80b9e6947",
> "CallbackUri": "http://XXXXXXXXXXX:8645/billingcallback",
> "CallbackSent": true,
> "Activities": [
> {
> "MetaData": {},
> "CreateDate": "2017-01-31T07:21:11.182049Z",
> "State": "PROCESSING"
> },
> {
> "MetaData": {
> "VoucherPin": "",
> "OrderRef": "113234210",
> "TicketNumber": "",
> "BoxNumber": "",
> "BatchNumber": "",
> "ProcessingTime": "3064.3064"
> },
> "CreateDate": "2017-01-31T07:21:11.1820491Z",
> "State": "SUCCESS"
> }
> ],
> "Currency": "South_African_Rand_ZAR",
> "ConsumerIdentifier": "XXXXXXXXXXXX",
> "ToBeBilledIdentifier": "XXXXXXXXXXXX",
> "BillType": "Carrier",
> "BillProcessingStateAsString": "SUCCESS",
> "Value": 2,
> "BillProcessingState": "SUCCESS",
> "BillingProvider": "VODACOMSA",
> "NextProcessingTime": "0001-01-01T00:00:00",
> "NextProcessingTimeAsLong": 0,
> "FinalProcessingTime": 0,
> "LastSubmittedDate": "0001-01-01T00:00:00",
> "Id": "bill-e2a5a7d1-3d9f-4f9b-b526-13b80b9e6947",
> "CreatedDate": "2017-01-31T07:20:58",
> "ModifiedDate": "2017-01-31T07:21:14.2473555Z",
> "Type": "Bill"
> }
>
> Docs usually go through 4-5 updates before they are finalized.
> Within the larger database we have 16,201,998 docs totaling 23 GB. No
> attachments.
>
> There is no other traffic besides a single user (me), and no replication
> traffic either. No other patterns stand out (to me at least). The memory
> usage grows and grows before eventually consuming the swap space and
> running into an OOM kill.
>
> The other 11 nodes are affected as well.
>
> Thanks for your assistance!!
>
> -Tayven
>
> ________________________________
> From: Jan Lehnardt <[email protected]>
> Sent: Tuesday, January 31, 2017 4:38 AM
> To: [email protected]
> Cc: Tayven Bigelow; Nick Becker
> Subject: Re: Crashing due to memory use
>
> Heya Nick and Tayven,
>
> I assume you posted multiple times because your mails didn't show up
> immediately due to mailing list moderation.
>
> You are correct that the database size and hardware configuration
> should not cause any issues.
>
> Can you explain the scenario a little better?
>
> Is the memory leak happening when building your views for the first
> time?
>
> Does beam.smp terminate on its own or is it an OOM kill from the
> kernel?
>
> How many views do you have?
>
> How many design docs?
>
> JS views or Erlang views?
>
> Is there anything else running on these nodes?
>
> Can you share your view code?
>
> Can you share your couch.log?
>
> Can you explain your document structure (total bytes, number of fields,
> attachments etc.)?
>
> Can you describe your traffic pattern?
>
> Can you describe any other pattern that leads up to the memory leak?
>
> Does this happen on all nodes? If not, is there anything special
> about the affected nodes?
>
>
> (shameless plug, if you require professional assistance, my email
> footer has contact information)
>
>
> > On 31 Jan 2017, at 00:15, Tayven Bigelow
> > <[email protected]> wrote:
> >
> > Hey Guys!
> >
> >
> > We've been using a 12-server CouchDB 2.0 cluster for a while now and
> > have noticed a memory leak that causes beam.smp to crash while
> > populating views.
> >
> > The q/r/w/n is set up as:
> >
> > [cluster]
> > q=12
> > r=2
> > w=2
> > n=3
> >
> > As far as I know the server should be able to handle the load, as it
> > has 64GB of RAM and a Core i7 6700. We are running Ubuntu 16.04.1.
> >
> > The Database is 16.5 GB in size.
> >
> >
> > I've also attempted to run 2.0 with Dreyfus and Clouseau and ran
> > into the same issue with a database size of 7.8 MB.
> >
> >
> > I've noted that in previous releases some people have run into similar
> > memory issues with beam.smp, and that increasing the open file limit
> > was part of the resolution. We've increased the nofile limit for the
> > couchdb user to 4096 (as found here:
> > https://wiki.apache.org/couchdb/Performance ) with no luck.
>
>
>
> >
> >
> > Nothing out of the ordinary is thrown in the logs. The only way to
> > catch it is by watching memory use.
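> > (e.g. watching something along the lines of:
> >   watch -n 5 'ps -o rss,vsz,cmd -C beam.smp; free -m'
> > on the affected node.)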
> >
> >
> > I'm wondering if there's a configuration or setting somewhere that I'm
> > missing that could be causing this issue.
> >
> >
> > Thanks!
> >
> > Tayven
> >
> >
> >
>
> --
> Professional Support for Apache CouchDB:
> https://neighbourhood.ie/couchdb-support/
>
>
>
> Email: [email protected]
>
>
>
All information in this message is confidential and may be legally privileged.
If you are not the intended recipient, notify the sender immediately and
destroy this email.