A clean log file (i.e. stop BigCouch, delete the log file, restart BigCouch, run the replication, wait for the failure, stop BigCouch) from the node that failed this time around can be found at:

http://pastebin.com/embed_js.php?i=s52rYwwy

Mike
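For anyone reproducing this, Mike's capture procedure translates to a short shell script along these lines. This is a sketch only: the init-script path, log location, host names, and database names below are assumptions, not details from the thread, so adjust them to your install.

    #!/bin/sh
    # Hypothetical paths and names; BigCouch installs vary.
    sudo /etc/init.d/bigcouch stop
    sudo rm -f /opt/bigcouch/var/log/bigcouch.log
    sudo /etc/init.d/bigcouch start

    # Kick off the replication via the cluster's HTTP port, then wait
    # for the eheap_alloc failure before stopping the node again.
    curl -X POST http://node1:5984/_replicate \
         -H 'Content-Type: application/json' \
         -d '{"source":"http://source-host:5984/sourcedb","target":"targetdb"}'

    # ...after the failure:
    sudo /etc/init.d/bigcouch stop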
-----Original Message-----
From: Robert Newson [mailto:[email protected]]
Sent: 13 April 2012 19:28
To: [email protected]
Subject: Re: BigCouch - Replication failing with Cannot Allocate memory

Mike,

Do you have couch.logs from around that time?

B.

On 13 April 2012 17:54, Mike Kimber <[email protected]> wrote:
> Sorry, forgot to say that I have already upped it to N=3 and still get the
> same issue.
>
> I ran it again with the 6GB of RAM on each of the servers, ran vmstat, and
> got the following:
>
>  r b swpd    free  buff  cache si so bi bo   in  cs us sy  id wa st
>  3 0    0 2067468 31816 302204  0  0  0  5 1820 360 63 32   5  0  0
>  2 0    0 2457728 31816 302212  0  0  0  2 2188 322 70 25   4  0  0
>  2 0    0 1936092 31816 302212  0  0  0  0 3020 200 73 24   3  0  0
>  2 0    0  687428 31816 302212  0  0  0  1 1958 368 56 42   2  0  0
>  2 0    0 2128192 31824 302212  0  0  0  2 2779 243 64 29   7  0  0
>  1 0    0 1829848 31824 302216  0  0  0  0 1734 280 68 29   3  0  0
>  1 0    0 1200300 31832 302216  0  0  0  8 1841 231 43 13  44  0  0
>  2 0    0 1638752 31840 302208  0  0  0  5 2625 350 71 20   8  0  0
>  3 0    0 1670856 31848 302216  0  0  0  3 2150 492 40 21  39  0  0
>  2 0    0 1020848 31848 302216  0  0  0  0 2307 644 67 22  11  0  0
>  1 0    0  271640 31848 302216  0  0  0  6 1995 280 54 42   4  0  0
>  1 0    0  455408 31848 302216  0  0  0  1 1879 238 64 33   3  0  0
>  2 0    0 1240616 25584 193044  0  0  0  2 2408 232 59 34   8  0  0
>  2 0    0  611280 25592 193036  0  0  0  3 2286 246 72 25   2  0  0
>  2 0    0  679548 25592 193044  0  0  0  2 3038 175 78 21   2  0  0
>  2 0    0  786360 25600 193044  0  0  0  3 1679 269 74 23   3  0  0
>  2 0    0  568632 25600 193044  0  0  0  0 2796 274 74 24   2  0  0
>
> eheap_alloc: Cannot allocate 1824525600 bytes of memory (of type "heap").
>
>  0 0    0 5749480 25600 193044  0  0  0  0 1389 160 33 15  52  0  0
>  0 0    0 5749956 25608 193044  0  0  0 10 1007  82  0  0 100  0  0
>  0 0    0 5749988 25616 193036  0  0  0  3 1016  85  0  0 100  0  0
>  0 0    0 5750020 25616 193044  0  0  0  0  998  79  0  0 100  0  0
>  0 0    0 5750168 25620 193040  0  0  0  1 1007  87  0  0 100  0  0
>  0 0    0 5750308 25620 193044  0  0  0  0 1008  82  0  0 100  0  0
>
> I really need to work out what each process is doing with respect to memory
> at the time of failure. I had top running, but not on the node that failed
> this time - sod's law :-)
>
> Mike
>
> -----Original Message-----
> From: Robert Newson [mailto:[email protected]]
> Sent: 13 April 2012 17:31
> To: [email protected]
> Subject: Re: BigCouch - Replication failing with Cannot Allocate memory
>
> I should note that BigCouch is tested much more often with N=3.
> Perhaps there's something about N=1 that exacerbates the issue. For a
> test, could you try with N=3?
>
> B.
>
> On 13 April 2012 16:24, Mike Kimber <[email protected]> wrote:
>> "1. Try to replicate the database in another CouchDB."
>>
>> I have done this to a CouchDB 1.2 database successfully. FYI, the source
>> DB is a CouchDB 1.1.1.
>>
>> I haven't done the other tests, but I have tested replicating from the
>> CouchDB 1.2 database to the BigCouch install and got the same issue.
>>
>> Mike
>>
>> -----Original Message-----
>> From: CGS [mailto:[email protected]]
>> Sent: 13 April 2012 15:01
>> To: [email protected]
>> Subject: Re: BigCouch - Replication failing with Cannot Allocate memory
>>
>> If you say so, Robert, I won't argue with you on that. I meant no offence,
>> so please accept my apologies if I crossed the line. It's all yours from
>> now on.
>>
>> Mike, please ignore my suggestion. Sorry for interfering.
>>
>> Good luck!
>>
>> CGS
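On Mike's point further up about working out what each process is doing with memory at the time of failure: one option is to attach a remote shell to the running Erlang node and sort live processes by memory. A minimal sketch, assuming the stock BigCouch node name and cookie (check vm.args for the real values):

    erl -name [email protected] -setcookie monster \
        -remsh [email protected]

    %% In the attached shell: the ten largest processes by memory, in bytes.
    %% The {memory, M} pattern in the generator quietly skips any process
    %% that exits between the processes() call and the process_info/2 call.
    1> Mem = [{P, M} || P <- erlang:processes(),
                        {memory, M} <- [erlang:process_info(P, memory)]],
       lists:sublist(lists:reverse(lists:keysort(2, Mem)), 10).

Run just before the crash (or from a watchdog loop), this would show whether a single replication-related process is accumulating the 1.8GB heap or whether the memory is spread across many processes.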
>>
>> On Fri, Apr 13, 2012 at 3:19 PM, Robert Newson <[email protected]> wrote:
>>> I think you should point out that "My idea behind these tests is that it
>>> may be that your database may be corrupted (or seen as corrupted by
>>> BigCouch at the second test) and what you get is just garbage at a
>>> certain document." is based on no evidence. Nor, if it were true, would
>>> it necessarily explain the observed behaviour.
>>>
>>> It would be useful if we could all stick to asserting only things we
>>> know to be true or have reasonable grounds to believe are true.
>>> Unfounded speculation, though offered sincerely, is not helpful on a
>>> mailing list intended to provide assistance.
>>>
>>> Thanks,
>>> B.
>>>
>>> On 13 April 2012 13:55, CGS <[email protected]> wrote:
>>>> Hi Mike,
>>>>
>>>> I haven't used BigCouch until now, which is why I haven't said anything
>>>> so far. Still, having given some thought to what may be happening there,
>>>> I propose a few tests if you have time:
>>>> 1. Try to replicate the database in another CouchDB.
>>>> 2. If 1 passes, try to replicate to only one node at a time.
>>>> 3. If 2 passes, increase the pool of nodes by one and repeat the
>>>> replication (for sure it will fail with all 3 nodes at once).
>>>>
>>>> My idea behind these tests is that your database may be corrupted (or
>>>> seen as corrupted by BigCouch at the second test) and what you get is
>>>> just garbage at a certain document; that's why I proposed the first
>>>> test. The second test is to see whether any of the nodes has a problem
>>>> in its configuration (or whether there is any incompatibility between
>>>> your CouchDB and BigCouch in manipulating your docs). Finally, the
>>>> third test is to see whether server/node resources limit the number of
>>>> replications (and at how many it starts to fail).
>>>>
>>>> Can you also check the size of the shards at tests 2 and 3?
>>>>
>>>> If you consider these tests irrelevant, please ignore my suggestion.
>>>>
>>>> CGS
>>>>
>>>> On Fri, Apr 13, 2012 at 1:27 PM, Mike Kimber <[email protected]> wrote:
>>>>> I upped the memory to 6GB on each of the nodes and got exactly the
>>>>> same issue in the same time frame, i.e. the increased RAM did not seem
>>>>> to buy me any additional time.
>>>>>
>>>>> Mike
>>>>>
>>>>> -----Original Message-----
>>>>> From: Robert Newson [mailto:[email protected]]
>>>>> Sent: 12 April 2012 19:34
>>>>> To: [email protected]
>>>>> Subject: Re: BigCouch - Replication failing with Cannot Allocate memory
>>>>>
>>>>> 2GB total RAM does sound tight; I can only compare to high-volume
>>>>> production clusters, which have much more RAM than this. Given that
>>>>> beam.smp wanted 1.4GB and you have 2GB total, do you know where the
>>>>> rest went? To couchjs processes, by chance? If so, you can reduce the
>>>>> maximum size of that pool in config; I think the default is 50.
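For reference, the pool Robert mentions is the couchjs (view/query server) pool; in CouchDB-lineage servers it is capped by os_process_limit under [query_server_config]. Treat the key name as something to verify against your BigCouch version, and the value below as an arbitrary illustration rather than a recommendation:

    ; local.ini - cap the couchjs process pool
    [query_server_config]
    os_process_limit = 10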
>>>>>
>>>>> On 12 April 2012 18:32, Mike Kimber <[email protected]> wrote:
>>>>>> Ok, I have 3 nodes, all load balanced with HAProxy:
>>>>>>
>>>>>> CentOS 5.8 (virtualised)
>>>>>> 2 cores
>>>>>> 2GB RAM
>>>>>>
>>>>>> I'm trying to replicate about 75K documents, which total 6GB when
>>>>>> compacted (on CouchDB 1.2, which has compression turned on). I'm told
>>>>>> they are fairly large documents.
>>>>>>
>>>>>> When it goes pear-shaped, vmstat shows memory use climbing:
>>>>>>
>>>>>> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>>>>>>  r b   swpd   free buff  cache    si    so    bi    bo   in   cs us sy id wa st
>>>>>>  1 2 570576   8808  140   7208  2998  2249  3154  2249 1234  569  1  6  2 91  0
>>>>>>  0 2 569656   9156  156   7504  2330  1899  2405  1904 1246  595  1  5  9 85  0
>>>>>>  1 1 575412   9516  236  14928  1549  2261  3242  2261 1237  593  1  7  1 91  0
>>>>>>  0 2 607092  13220  168   8156  3772  9012  3871  9017 1284  714  1 10  4 85  0
>>>>>>  1 0 444336 857004  220  10212  5781     0  6202     0 1574 1010 13  7 33 47  0
>>>>>>  1 0 442176 870684  428  11052  2049     0  2208   140 2561 1541 17  8 49 26  0
>>>>>>  0 0 442176 813140  460  11968   170     0   348     0 2672 1565 25  9 61  4  0
>>>>>>  0 1 442176 744972  484  12224  5440     0  5493     7 2432  900  8  4 49 40  0
>>>>>>  0 1 442176 714048  484  12296  4547     0  4547     0 1799  827  4  2 50 44  0
>>>>>>  0 1 442176 686304  496  12688  5128     0  5222     0 1696  999  9  2 50 40  0
>>>>>>  0 3 444000   8712  444  12876   299   368   331   380 1294  188 22 20 36 23  0
>>>>>>  0 3 469340  10040  116   7336    29  5087    74  5090 1232  268  3 22  0 75  0
>>>>>>  1 2 584356  10220  124   6744 11367 28722 11370 28722 1643 1300  5 19 17 59  0
>>>>>>  0 1 624908  10640  132   7036  6518 12879  6590 12884 1296  717  3 10 29 58  0
>>>>>>  0 2 652556  10948  252  14776  3799  9494  5459  9494 1294  646  2  9 32 57  0
>>>>>>  0 2 677784  10648  244  14528  3819  8196  3819  8201 1274  588  2  7 30 61  0
>>>>>>  0 2 688460   9512  212   8224  3013  4522  3125  4522 1379  519  2  7  6 84  0
>>>>>>  0 3 699164   9888  208   8468  2192  4014  2228  4014 1302  495  1  6 11 83  0
>>>>>>  2 0 713104   9004  144   9192  2606  4490  2848  4490 1350  487  1  8 16 75  0
>>>>>>
>>>>>> It only ever takes out one node at a time, and the other nodes seem
>>>>>> to be doing very little while the one node is running out of memory.
>>>>>>
>>>>>> If I kick it off again, it processes some more, then spikes the
>>>>>> memory and fails.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>> PS: hope you enjoyed your CouchDB get-together!
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Robert Newson [mailto:[email protected]]
>>>>>> Sent: 12 April 2012 17:28
>>>>>> To: [email protected]
>>>>>> Subject: Re: BigCouch - Replication failing with Cannot Allocate memory
>>>>>>
>>>>>> What kind of load were you putting on the machine?
>>>>>>
>>>>>> On 12 April 2012 17:24, Robert Newson <[email protected]> wrote:
>>>>>>> Could you show your vm.args file?
>>>>>>>
>>>>>>> On 12 April 2012 17:23, Robert Newson <[email protected]> wrote:
>>>>>>>> Unfortunately your request for help coincided with the two-day
>>>>>>>> CouchDB Summit. #cloudant and the Issues tab on cloudant/bigcouch
>>>>>>>> are other ways to get BigCouch support, but we happily answer
>>>>>>>> queries here too, when not at the Model UN of CouchDB. :D
>>>>>>>>
>>>>>>>> B.
>>>>>>>>
>>>>>>>> On 12 April 2012 17:10, Mike Kimber <[email protected]> wrote:
>>>>>>>>> Looks like this isn't the right place, based on the responses so
>>>>>>>>> far. Shame; I hoped this was going to help solve our index/view
>>>>>>>>> rebuild times etc.
>>>>>>>>>
>>>>>>>>> Mike
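Robert's request for the vm.args file goes unanswered in the thread. For context, a stock BigCouch vm.args of that era looked roughly like the sketch below; the node name, cookie, and flag values are assumptions based on BigCouch's shipped defaults, so check the actual file on the failing node rather than relying on this:

    # vm.args (illustrative; values are assumptions)
    # Erlang node name; must resolve between cluster nodes
    -name [email protected]
    # shared Erlang cookie across the cluster
    -setcookie monster
    # enable kernel poll and set the async thread pool
    +K true
    +A 4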
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Mike Kimber [mailto:[email protected]]
>>>>>>>>> Sent: 10 April 2012 09:20
>>>>>>>>> To: [email protected]
>>>>>>>>> Subject: BigCouch - Replication failing with Cannot Allocate memory
>>>>>>>>>
>>>>>>>>> I'm not sure if this is the correct place to raise an issue I'm
>>>>>>>>> having with replicating a standalone CouchDB 1.1.1 to a 3-node
>>>>>>>>> BigCouch cluster. If it isn't, please point me in the right
>>>>>>>>> direction; if it is, does anyone have any ideas why I keep getting
>>>>>>>>> the following error message when I kick off a replication:
>>>>>>>>>
>>>>>>>>> eheap_alloc: Cannot allocate 1459620480 bytes of memory (of type
>>>>>>>>> "heap").
>>>>>>>>>
>>>>>>>>> My set-up is:
>>>>>>>>>
>>>>>>>>> Standalone CouchDB 1.1.1 running on CentOS 5.7
>>>>>>>>>
>>>>>>>>> 3-node BigCouch cluster running on CentOS 5.8, with the following
>>>>>>>>> local.ini overrides, pulling from the standalone CouchDB (78K
>>>>>>>>> documents):
>>>>>>>>>
>>>>>>>>> [httpd]
>>>>>>>>> bind_address = XXX.XX.X.XX
>>>>>>>>>
>>>>>>>>> [cluster]
>>>>>>>>> ; number of shards for a new database
>>>>>>>>> q = 9
>>>>>>>>> ; number of copies of each shard
>>>>>>>>> n = 1
>>>>>>>>>
>>>>>>>>> [couchdb]
>>>>>>>>> database_dir = /other/bigcouch/database
>>>>>>>>> view_index_dir = /other/bigcouch/view
>>>>>>>>>
>>>>>>>>> The error is always generated on the third node in the cluster, and
>>>>>>>>> that server basically maxes out on memory beforehand. The other
>>>>>>>>> nodes seem to be doing very little, but they are getting data, i.e.
>>>>>>>>> the shard sizes are growing. I've put the copies per shard down to
>>>>>>>>> 1 because, for now, I'm not interested in resilience.
>>>>>>>>>
>>>>>>>>> Any help would be greatly appreciated.
>>>>>>>>>
>>>>>>>>> Mike
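A side note on the [cluster] block above: those values only set the defaults for newly created databases. BigCouch also accepted per-database overrides as query parameters at creation time, which makes it easy to test Robert's N=3 suggestion without editing local.ini. A sketch, with hypothetical host and database names:

    # create the target with the settings from this thread (9 shards, 1 copy):
    curl -X PUT 'http://node1:5984/targetdb?q=9&n=1'

    # and an N=3 variant for Robert's suggested test:
    curl -X PUT 'http://node1:5984/targetdb_n3?q=9&n=3'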
