A clean log file (i.e. stop BigCouch, delete the log file, restart BigCouch, run the replication, wait for the failure, stop BigCouch) from the node that failed this time around can be found at:

http://pastebin.com/embed_js.php?i=s52rYwwy

Mike
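For anyone reproducing this, Mike's capture procedure translates to a short shell script along these lines. This is a sketch only: the init-script path, log location, host names, and database names below are assumptions, not details from the thread, so adjust them to your install.

    #!/bin/sh
    # Hypothetical paths and names; BigCouch installs vary.
    sudo /etc/init.d/bigcouch stop
    sudo rm -f /opt/bigcouch/var/log/bigcouch.log
    sudo /etc/init.d/bigcouch start

    # Kick off the replication via the cluster's HTTP port, then wait
    # for the eheap_alloc failure before stopping the node again.
    curl -X POST http://node1:5984/_replicate \
         -H 'Content-Type: application/json' \
         -d '{"source":"http://source-host:5984/sourcedb","target":"targetdb"}'

    # ...after the failure:
    sudo /etc/init.d/bigcouch stop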
-----Original Message-----
From: Robert Newson [mailto:[email protected]]
Sent: 13 April 2012 19:28
To: [email protected]
Subject: Re: BigCouch - Replication failing with Cannot Allocate memory

Mike,

Do you have couch.logs from around that time?

B.

On 13 April 2012 17:54, Mike Kimber <[email protected]> wrote:
> Sorry, forgot to say that I have already upped it to N=3 and still get the
> same issue.
>
> I ran it again with the 6GB of RAM on each of the servers, ran vmstat, and
> got the following:
>
>  r b swpd    free  buff  cache si so bi bo   in  cs us sy  id wa st
>  3 0    0 2067468 31816 302204  0  0  0  5 1820 360 63 32   5  0  0
>  2 0    0 2457728 31816 302212  0  0  0  2 2188 322 70 25   4  0  0
>  2 0    0 1936092 31816 302212  0  0  0  0 3020 200 73 24   3  0  0
>  2 0    0  687428 31816 302212  0  0  0  1 1958 368 56 42   2  0  0
>  2 0    0 2128192 31824 302212  0  0  0  2 2779 243 64 29   7  0  0
>  1 0    0 1829848 31824 302216  0  0  0  0 1734 280 68 29   3  0  0
>  1 0    0 1200300 31832 302216  0  0  0  8 1841 231 43 13  44  0  0
>  2 0    0 1638752 31840 302208  0  0  0  5 2625 350 71 20   8  0  0
>  3 0    0 1670856 31848 302216  0  0  0  3 2150 492 40 21  39  0  0
>  2 0    0 1020848 31848 302216  0  0  0  0 2307 644 67 22  11  0  0
>  1 0    0  271640 31848 302216  0  0  0  6 1995 280 54 42   4  0  0
>  1 0    0  455408 31848 302216  0  0  0  1 1879 238 64 33   3  0  0
>  2 0    0 1240616 25584 193044  0  0  0  2 2408 232 59 34   8  0  0
>  2 0    0  611280 25592 193036  0  0  0  3 2286 246 72 25   2  0  0
>  2 0    0  679548 25592 193044  0  0  0  2 3038 175 78 21   2  0  0
>  2 0    0  786360 25600 193044  0  0  0  3 1679 269 74 23   3  0  0
>  2 0    0  568632 25600 193044  0  0  0  0 2796 274 74 24   2  0  0
>
> eheap_alloc: Cannot allocate 1824525600 bytes of memory (of type "heap").
>
>  0 0    0 5749480 25600 193044  0  0  0  0 1389 160 33 15  52  0  0
>  0 0    0 5749956 25608 193044  0  0  0 10 1007  82  0  0 100  0  0
>  0 0    0 5749988 25616 193036  0  0  0  3 1016  85  0  0 100  0  0
>  0 0    0 5750020 25616 193044  0  0  0  0  998  79  0  0 100  0  0
>  0 0    0 5750168 25620 193040  0  0  0  1 1007  87  0  0 100  0  0
>  0 0    0 5750308 25620 193044  0  0  0  0 1008  82  0  0 100  0  0
>
> I really need to work out what each process is doing with respect to memory
> at the time of failure. I had top running, but not on the node that failed
> this time - sod's law :-)
>
> Mike
>
> -----Original Message-----
> From: Robert Newson [mailto:[email protected]]
> Sent: 13 April 2012 17:31
> To: [email protected]
> Subject: Re: BigCouch - Replication failing with Cannot Allocate memory
>
> I should note that BigCouch is tested much more often with N=3.
> Perhaps there's something about N=1 that exacerbates the issue. For a
> test, could you try with N=3?
>
> B.
>
> On 13 April 2012 16:24, Mike Kimber <[email protected]> wrote:
>> "1. Try to replicate the database in another CouchDB."
>>
>> I have done this to a CouchDB 1.2 database successfully. FYI, the source
>> DB is a CouchDB 1.1.1.
>>
>> I haven't done the other tests, but I have tested replicating from the
>> CouchDB 1.2 database to the BigCouch install and got the same issue.
>>
>> Mike
>>
>> -----Original Message-----
>> From: CGS [mailto:[email protected]]
>> Sent: 13 April 2012 15:01
>> To: [email protected]
>> Subject: Re: BigCouch - Replication failing with Cannot Allocate memory
>>
>> If you say so, Robert, I won't argue with you on that. I meant no offence,
>> so please accept my apologies if I crossed the line. It's all yours from
>> now on.
>>
>> Mike, please ignore my suggestion. Sorry for interfering.
>>
>> Good luck!
>>
>> CGS
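On Mike's point further up about working out what each process is doing with memory at the time of failure: one option is to attach a remote shell to the running Erlang node and sort live processes by memory. A minimal sketch, assuming the stock BigCouch node name and cookie (check vm.args for the real values):

    erl -name [email protected] -setcookie monster \
        -remsh [email protected]

    %% In the attached shell: the ten largest processes by memory, in bytes.
    %% The {memory, M} pattern in the generator quietly skips any process
    %% that exits between the processes() call and the process_info/2 call.
    1> Mem = [{P, M} || P <- erlang:processes(),
                        {memory, M} <- [erlang:process_info(P, memory)]],
       lists:sublist(lists:reverse(lists:keysort(2, Mem)), 10).

Run just before the crash (or from a watchdog loop), this would show whether a single replication-related process is accumulating the 1.8GB heap or whether the memory is spread across many processes.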
>>
>> On Fri, Apr 13, 2012 at 3:19 PM, Robert Newson <[email protected]> wrote:
>>> I think you should point out that "My idea behind these tests is that it
>>> may be that your database may be corrupted (or seen as corrupted by
>>> BigCouch at the second test) and what you get is just garbage at a
>>> certain document." is based on no evidence. Nor, if it were true, would
>>> it necessarily explain the observed behaviour.
>>>
>>> It would be useful if we could all stick to asserting only things we
>>> know to be true or have reasonable grounds to believe are true.
>>> Unfounded speculation, though offered sincerely, is not helpful on a
>>> mailing list intended to provide assistance.
>>>
>>> Thanks,
>>> B.
>>>
>>> On 13 April 2012 13:55, CGS <[email protected]> wrote:
>>>> Hi Mike,
>>>>
>>>> I haven't used BigCouch until now, which is why I haven't said anything
>>>> so far. Still, having given some thought to what may be happening there,
>>>> I propose a few tests if you have time:
>>>> 1. Try to replicate the database in another CouchDB.
>>>> 2. If 1 passes, try to replicate to only one node at a time.
>>>> 3. If 2 passes, increase the pool of nodes by one and repeat the
>>>> replication (for sure it will fail with all 3 nodes at once).
>>>>
>>>> My idea behind these tests is that your database may be corrupted (or
>>>> seen as corrupted by BigCouch at the second test) and what you get is
>>>> just garbage at a certain document; that's why I proposed the first
>>>> test. The second test is to see whether any of the nodes has a problem
>>>> in its configuration (or whether there is any incompatibility between
>>>> your CouchDB and BigCouch in manipulating your docs). Finally, the
>>>> third test is to see whether server/node resources limit the number of
>>>> replications (and at how many it starts to fail).
>>>>
>>>> Can you also check the size of the shards at tests 2 and 3?
>>>>
>>>> If you consider these tests irrelevant, please ignore my suggestion.
>>>>
>>>> CGS
>>>>
>>>> On Fri, Apr 13, 2012 at 1:27 PM, Mike Kimber <[email protected]> wrote:
>>>>> I upped the memory to 6GB on each of the nodes and got exactly the
>>>>> same issue in the same time frame, i.e. the increased RAM did not seem
>>>>> to buy me any additional time.
>>>>>
>>>>> Mike
>>>>>
>>>>> -----Original Message-----
>>>>> From: Robert Newson [mailto:[email protected]]
>>>>> Sent: 12 April 2012 19:34
>>>>> To: [email protected]
>>>>> Subject: Re: BigCouch - Replication failing with Cannot Allocate memory
>>>>>
>>>>> 2GB total RAM does sound tight; I can only compare to high-volume
>>>>> production clusters, which have much more RAM than this. Given that
>>>>> beam.smp wanted 1.4GB and you have 2GB total, do you know where the
>>>>> rest went? To couchjs processes, by chance? If so, you can reduce the
>>>>> maximum size of that pool in config; I think the default is 50.
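For reference, the pool Robert mentions is the couchjs (view/query server) pool; in CouchDB-lineage servers it is capped by os_process_limit under [query_server_config]. Treat the key name as something to verify against your BigCouch version, and the value below as an arbitrary illustration rather than a recommendation:

    ; local.ini - cap the couchjs process pool
    [query_server_config]
    os_process_limit = 10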
>>>>>
>>>>> On 12 April 2012 18:32, Mike Kimber <[email protected]> wrote:
>>>>>> Ok, I have 3 nodes, all load balanced with HAProxy:
>>>>>>
>>>>>> CentOS 5.8 (virtualised)
>>>>>> 2 cores
>>>>>> 2GB RAM
>>>>>>
>>>>>> I'm trying to replicate about 75K documents, which total 6GB when
>>>>>> compacted (on CouchDB 1.2, which has compression turned on). I'm told
>>>>>> they are fairly large documents.
>>>>>>
>>>>>> When it goes pear-shaped, vmstat shows memory use climbing:
>>>>>>
>>>>>> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>>>>>>  r b   swpd   free buff  cache    si    so    bi    bo   in   cs us sy id wa st
>>>>>>  1 2 570576   8808  140   7208  2998  2249  3154  2249 1234  569  1  6  2 91  0
>>>>>>  0 2 569656   9156  156   7504  2330  1899  2405  1904 1246  595  1  5  9 85  0
>>>>>>  1 1 575412   9516  236  14928  1549  2261  3242  2261 1237  593  1  7  1 91  0
>>>>>>  0 2 607092  13220  168   8156  3772  9012  3871  9017 1284  714  1 10  4 85  0
>>>>>>  1 0 444336 857004  220  10212  5781     0  6202     0 1574 1010 13  7 33 47  0
>>>>>>  1 0 442176 870684  428  11052  2049     0  2208   140 2561 1541 17  8 49 26  0
>>>>>>  0 0 442176 813140  460  11968   170     0   348     0 2672 1565 25  9 61  4  0
>>>>>>  0 1 442176 744972  484  12224  5440     0  5493     7 2432  900  8  4 49 40  0
>>>>>>  0 1 442176 714048  484  12296  4547     0  4547     0 1799  827  4  2 50 44  0
>>>>>>  0 1 442176 686304  496  12688  5128     0  5222     0 1696  999  9  2 50 40  0
>>>>>>  0 3 444000   8712  444  12876   299   368   331   380 1294  188 22 20 36 23  0
>>>>>>  0 3 469340  10040  116   7336    29  5087    74  5090 1232  268  3 22  0 75  0
>>>>>>  1 2 584356  10220  124   6744 11367 28722 11370 28722 1643 1300  5 19 17 59  0
>>>>>>  0 1 624908  10640  132   7036  6518 12879  6590 12884 1296  717  3 10 29 58  0
>>>>>>  0 2 652556  10948  252  14776  3799  9494  5459  9494 1294  646  2  9 32 57  0
>>>>>>  0 2 677784  10648  244  14528  3819  8196  3819  8201 1274  588  2  7 30 61  0
>>>>>>  0 2 688460   9512  212   8224  3013  4522  3125  4522 1379  519  2  7  6 84  0
>>>>>>  0 3 699164   9888  208   8468  2192  4014  2228  4014 1302  495  1  6 11 83  0
>>>>>>  2 0 713104   9004  144   9192  2606  4490  2848  4490 1350  487  1  8 16 75  0
>>>>>>
>>>>>> It only ever takes out one node at a time, and the other nodes seem
>>>>>> to be doing very little while the one node is running out of memory.
>>>>>>
>>>>>> If I kick it off again, it processes some more, then spikes the
>>>>>> memory and fails.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>> PS: hope you enjoyed your CouchDB get-together!
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Robert Newson [mailto:[email protected]]
>>>>>> Sent: 12 April 2012 17:28
>>>>>> To: [email protected]
>>>>>> Subject: Re: BigCouch - Replication failing with Cannot Allocate memory
>>>>>>
>>>>>> What kind of load were you putting on the machine?
>>>>>>
>>>>>> On 12 April 2012 17:24, Robert Newson <[email protected]> wrote:
>>>>>>> Could you show your vm.args file?
>>>>>>>
>>>>>>> On 12 April 2012 17:23, Robert Newson <[email protected]> wrote:
>>>>>>>> Unfortunately your request for help coincided with the two-day
>>>>>>>> CouchDB Summit. #cloudant and the Issues tab on cloudant/bigcouch
>>>>>>>> are other ways to get BigCouch support, but we happily answer
>>>>>>>> queries here too, when not at the Model UN of CouchDB. :D
>>>>>>>>
>>>>>>>> B.
>>>>>>>>
>>>>>>>> On 12 April 2012 17:10, Mike Kimber <[email protected]> wrote:
>>>>>>>>> Looks like this isn't the right place, based on the responses so
>>>>>>>>> far. Shame; I hoped this was going to help solve our index/view
>>>>>>>>> rebuild times etc.
>>>>>>>>>
>>>>>>>>> Mike
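Robert's request for the vm.args file goes unanswered in the thread. For context, a stock BigCouch vm.args of that era looked roughly like the sketch below; the node name, cookie, and flag values are assumptions based on BigCouch's shipped defaults, so check the actual file on the failing node rather than relying on this:

    # vm.args (illustrative; values are assumptions)
    # Erlang node name; must resolve between cluster nodes
    -name [email protected]
    # shared Erlang cookie across the cluster
    -setcookie monster
    # enable kernel poll and set the async thread pool
    +K true
    +A 4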
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Mike Kimber [mailto:[email protected]]
>>>>>>>>> Sent: 10 April 2012 09:20
>>>>>>>>> To: [email protected]
>>>>>>>>> Subject: BigCouch - Replication failing with Cannot Allocate memory
>>>>>>>>>
>>>>>>>>> I'm not sure if this is the correct place to raise an issue I'm
>>>>>>>>> having with replicating a standalone CouchDB 1.1.1 to a 3-node
>>>>>>>>> BigCouch cluster. If it isn't, please point me in the right
>>>>>>>>> direction; if it is, does anyone have any ideas why I keep getting
>>>>>>>>> the following error message when I kick off a replication:
>>>>>>>>>
>>>>>>>>> eheap_alloc: Cannot allocate 1459620480 bytes of memory (of type
>>>>>>>>> "heap").
>>>>>>>>>
>>>>>>>>> My set-up is:
>>>>>>>>>
>>>>>>>>> Standalone CouchDB 1.1.1 running on CentOS 5.7
>>>>>>>>>
>>>>>>>>> 3-node BigCouch cluster running on CentOS 5.8, with the following
>>>>>>>>> local.ini overrides, pulling from the standalone CouchDB (78K
>>>>>>>>> documents):
>>>>>>>>>
>>>>>>>>> [httpd]
>>>>>>>>> bind_address = XXX.XX.X.XX
>>>>>>>>>
>>>>>>>>> [cluster]
>>>>>>>>> ; number of shards for a new database
>>>>>>>>> q = 9
>>>>>>>>> ; number of copies of each shard
>>>>>>>>> n = 1
>>>>>>>>>
>>>>>>>>> [couchdb]
>>>>>>>>> database_dir = /other/bigcouch/database
>>>>>>>>> view_index_dir = /other/bigcouch/view
>>>>>>>>>
>>>>>>>>> The error is always generated on the third node in the cluster, and
>>>>>>>>> that server basically maxes out on memory beforehand. The other
>>>>>>>>> nodes seem to be doing very little, but they are getting data, i.e.
>>>>>>>>> the shard sizes are growing. I've put the copies per shard down to
>>>>>>>>> 1 because, for now, I'm not interested in resilience.
>>>>>>>>>
>>>>>>>>> Any help would be greatly appreciated.
>>>>>>>>>
>>>>>>>>> Mike
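A side note on the [cluster] block above: those values only set the defaults for newly created databases. BigCouch also accepted per-database overrides as query parameters at creation time, which makes it easy to test Robert's N=3 suggestion without editing local.ini. A sketch, with hypothetical host and database names:

    # create the target with the settings from this thread (9 shards, 1 copy):
    curl -X PUT 'http://node1:5984/targetdb?q=9&n=1'

    # and an N=3 variant for Robert's suggested test:
    curl -X PUT 'http://node1:5984/targetdb_n3?q=9&n=3'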
