Yes, currently it's 1.5.0. I meant to run _changes and _all_docs last night, but there were some other maintenance tasks scheduled. I'll definitely do it during the weekend and report back.

Best regards,
Igor Klimer
________________________________________
From: Robert Samuel Newson [[email protected]]
Sent: 23 January 2014 19:28
To: user
Subject: Re: Error during compaction

Understood on access to the file but I figured I should ask. The simplest thing will be a small patch to print out the values that CouchDB expected to match but did not. Remind me which exact CouchDB version you are running? I think you said you’re up to 1.5.0 now.

B.

On 23 Jan 2014, at 15:37, Igor Klimer <[email protected]> wrote:

> Well, if you put it that way, then yes, (potential) database corruption does
> seem like something you might be interested in ;)
>
> As far as I can tell, there weren't any options set that could affect
> fsyncing; in fact there was hardly any customization done at all. Here are
> the config files from the Windows machine (since that's where the corruption
> occurred):
> default.ini: http://pastebin.com/kUz0qyNk
> local.ini: http://pastebin.com/srZUMwzB
>
> I've checked http://localhost:5984/_utils/config.html and delayed_commits is
> set to true (since that's the default).
>
> Parity checking the RAID seems like a good idea, but from what the admins are
> telling me, they'd need to take it offline, so I want to exhaust other
> possibilities before that.
> So for now, I'll try checking the DB in the way you suggested:
> curl localhost:5984/dbname/_changes?include_docs=true
> curl localhost:5984/dbname/_all_docs?include_docs=true
>
> And then (if it succeeds) I'll try to recover the database using the
> recover-couchdb script.
>
> As for receiving a copy of the database - as a programmer I understand how
> that'd help you investigate this issue, but it seems that "the powers that
> be" aren't so understanding. Even if your software is at the core of their
> enterprise... But I'm more than willing to compile/run any debug builds you
> throw at me or learn some basic Erlang debugging if that'd help in finding
> the core of this issue.
>
> Best regards,
> Igor Klimer
> ________________________________________
> From: Robert Samuel Newson [[email protected]]
> Sent: 23 January 2014 15:40
> To: user
> Subject: Re: Error during compaction
>
> Database corruption is not esoteric, we take it very seriously. :)
>
> Yes, strictly append only (opened with O_APPEND in posix land, not sure what
> the equivalent is for Win). This doesn’t stop hardware failure from mangling
> earlier bits, of course. You have RAID-5, so you have a parity stripe;
> perhaps you could force a parity check? Abrupt shutdown or hitting end of
> disk should do no more than cause unflushed updates to be lost; nothing
> fsync()’ed prior to that should be at risk, by design. Are you running with
> delayed_commits set to true or false? The difference is that we fsync
> immediately or at up to a one second delay. In neither case do we omit to
> make fsync() calls before and after writing each new database footer. There
> are config options to disable those fsync calls; please let us know if you
> did that (your local.ini file will tell you).
>
> So, the two causes of true corruption would be 1) a bug in our code and 2)
> disk corruption or failure to honor fsync() and/or write ordering. Obviously
> we’re very interested in any occurrence of type 1.
>
> A few things to try to confirm that your database is ok but not compactable
> (which would be hard to explain):
>
> curl localhost:5984/dbname/_changes?include_docs=true
> curl localhost:5984/dbname/_all_docs?include_docs=true
>
> Let us know if those complete without error (or the errors if there are any).
>
> Finally, would it be possible for the CouchDB development team (or a subset)
> to receive a copy of the database file for forensic investigation? I believe
> we can sign NDAs and the like if that helps.
>
> B.
>
> On 23 Jan 2014, at 13:54, Igor Klimer <[email protected]> wrote:
>
>> No problem - I'm sure that you have more than enough on your plate without
>> users pestering you with such esoteric problems ;)
>>
>> As for the I/O subsystem - on the lowest level it's a couple of RAID5
>> arrays. Then, there is a Windows Server 2008 managing that and some
>> virtual servers (via Hyper-V), including both of the machines that we had
>> the database on (Windows Server 2008 R2 Enterprise and now Ubuntu 12 LTS).
>> I was assured that the arrays have been running rock-solid since the
>> moment they were set up and didn't exhibit any hardware failures. However,
>> the servers could have been abruptly shut down, for example when they ran
>> out of free disk space. While NTFS should recover from something like that, I
>> wonder if there isn't a corner case that would cause some database
>> corruption (like you're suggesting).
>>
>> Thanks for the recover-couchdb script - I'll try to give it a go ASAP. We do
>> have backups, but I don't think they are helpful in this case - the database
>> as a whole is working, we don't have any reports of missing files or
>> broken documents. It would be interesting to inspect the backups to see
>> when and, more interestingly, what happened. You mentioned that the database
>> format is strictly append only, so the corruption shouldn't "move" and the
>> "old"/early parts of the database file shouldn't change? Still, that seems
>> like a really time consuming task, so for now I'll try to give the recovery
>> script a go.
>>
>> Thanks again for your help and best regards,
>> Igor Klimer
>>
>> ________________________________________
>> From: Robert Samuel Newson [[email protected]]
>> Sent: 23 January 2014 13:52
>> To: user
>> Subject: Re: Error during compaction
>>
>> Hi,
>>
>> Sorry, in turn, for not replying sooner.
>> I’m not really sure what to suggest; it does sound like the database file
>> is corrupt, which is quite hard to do with our strictly append only format.
>> The only oddities here are
>> the use of Windows (and, presumably, NTFS?) and the fact that you did hit
>> end of disk. I’ve not observed corruption when hitting end of disk on other
>> OS/FS combinations, though. Do you happen to know any details of your I/O
>> subsystem that might provide a hint? Could any of the disks themselves have
>> suffered block failures that couldn’t be corrected by disk firmware? Bit
>> flips? Would your disk controller lie about fsync() calls?
>>
>> Those questions will help us figure out how the corruption occurs, but they
>> obviously don’t help you fix it. For that, beyond hoping you have backups,
>> I suggest trying https://github.com/jhs/recover-couchdb (perhaps Jason can
>> chip in with how likely this is to still work, given the age). If that
>> doesn’t extract all your data, then the only other suggestion I have is to
>> truncate the database file until it can compact, but this necessarily means
>> losing data.
>>
>> Do other developers have other suggestions here?
>>
>> B.
>>
>> On 21 Jan 2014, at 09:08, Igor Klimer <[email protected]> wrote:
>>
>>> Hi,
>>> I'm extremely sorry for not replying sooner, however I was on sick leave
>>> last week.
>>> I've tried your suggestion with an empty .compact file, however the results
>>> seem to be the same...
>>> Log: http://pastebin.com/MJCgGM8C
>>>
>>> Started with an empty ecrepo.couch.compact file (touch
>>> ecrepo.couch.compact), then after about 3 hours, the error was printed in
>>> the logs and the compaction failed:
>>> -rw-r--r-- 1 couchdb couchdb 137502523517 Jan 21 09:51 ecrepo.couch
>>> -rw-r--r-- 1 couchdb couchdb 51692612367 Jan 21 02:07 ecrepo.couch.compact
>>>
>>> There's over 100GB free space available on the disk.
>>>
>>> At least I think I know what the number 51692471440 in the log means ;) But I
>>> don't know if there's a way to check which document resides at that
>>> position in the file.
>>>
>>> Best regards,
>>> Igor Klimer
>>>
>>> ________________________________________
>>> From: Robert Samuel Newson [[email protected]]
>>> Sent: 10 January 2014 18:45
>>> To: user
>>> Subject: Re: Error during compaction
>>>
>>> Yes, I understood. The empty .compact file will trigger more checking in
>>> the compaction process; I’m hoping it gets us past the problem.
>>>
>>> B.
>>>
>>> On 10 Jan 2014, at 13:34, Igor Klimer <[email protected]> wrote:
>>>
>>>> :)
>>>> Just to clarify - the .compact file is getting created and then the
>>>> compaction fails after some time (an hour or more):
>>>> 1) The first attempt, on Windows with CouchDB 1.2.0, failed because of
>>>> insufficient disk space. The .compact file had at least 10GB, unfortunately, I don't
>>>> remember how much (and whether it was bigger than the one produced in the
>>>> later attempts). There was no free disk space when it failed, so I'm
>>>> assuming that was the cause.
>>>> 2) The second attempt, on Windows with CouchDB 1.2.0, failed with the error
>>>> mentioned below. The .compact file had around 50GB, there was plenty of
>>>> free space left on the disk.
>>>> 3) The third attempt, on Ubuntu with CouchDB 1.5.0, failed with the error mentioned
>>>> below. The .compact file had around 50GB, there was plenty of free space
>>>> left on the disk, and judging from the numbers present in the log (ids?
>>>> node numbers?) it failed at the same moment as attempt #2.
>>>>
>>>> Just wanted to make sure we're on the same page :) Do you still want me to
>>>> try it with an empty .compact file?
>>>> (I can do this only during night
>>>> hours, since I don't want to put too much load on the server during
>>>> working hours)
>>>>
>>>> Best regards,
>>>> Igor Klimer
>>>> ________________________________________
>>>> From: Robert Samuel Newson [[email protected]]
>>>> Sent: 10 January 2014 14:03
>>>> To: user
>>>> Subject: Re: Error during compaction
>>>>
>>>> Hrm, strike one. Ok. Next thing to try is subtly different. Stop couchdb,
>>>> delete the .compact file, but then make a new, empty .compact file (so
>>>> 'touch /path/to/dbname.compact'), start couchdb and compact.
>>>>
>>>> B.
>>>>
>>>> On 10 Jan 2014, at 12:42, Igor Klimer <[email protected]> wrote:
>>>>
>>>>> Yes, I've already done that after the very first attempt at compaction
>>>>> (the one that failed because of lack of disk space). And it resulted in
>>>>> the second fail (on Windows), then the same on Linux - I always deleted
>>>>> the incomplete (about 50% of the database, around 50GB) .compact file
>>>>> before running the compaction again. So I was always doing compaction
>>>>> from scratch.
>>>>>
>>>>> Best regards,
>>>>> Igor Klimer
>>>>> ________________________________________
>>>>> From: Robert Samuel Newson [[email protected]]
>>>>> Sent: 10 January 2014 13:08
>>>>> To: user
>>>>> Subject: Re: Error during compaction
>>>>>
>>>>> Thanks! That’s very useful. Hitting end of disk certainly feels like a
>>>>> cause here. Since the compaction has never completed, I suggest we redo
>>>>> compaction from scratch.
>>>>>
>>>>> 1) stop couchdb
>>>>> 2) delete (or move aside) the dbname.compact file for this database
>>>>> 3) start couchdb
>>>>> 4) compact the db
>>>>>
>>>>> Whether it works or not, please let us know.
>>>>>
>>>>> B.
>>>>>
>>>>> On 10 Jan 2014, at 08:25, Igor Klimer <[email protected]> wrote:
>>>>>
>>>>>>> Given that you’re at 100Gb and compacting for the first time, can you
>>>>>>> tell us if you were running on older couchdb versions than 1.2.0
>>>>>>> between db creation and today?
>>>>>>
>>>>>> No, we've been running 1.2.0 from the start (around Oct 2012), then
>>>>>> switched to Ubuntu and 1.5.0.
>>>>>>
>>>>>>> Do you have free disk space?
>>>>>> Yes, there's about 150% of the DB's size worth of free space :) I forgot
>>>>>> to mention ("OK, here we go, the user will confess to some sin he
>>>>>> committed and is ashamed of and is most likely the reason for this
>>>>>> failure") that we've run the compaction once before the error on Windows
>>>>>> I mentioned below, but it failed because of insufficient disk space - so
>>>>>> I double checked before running the compaction again that there's enough
>>>>>> space. Here's the log, in case it's helpful: http://pastebin.com/S1URXN0p
>>>>>> Do you think it could have left the database in some corrupted state? It
>>>>>> seems it failed at a different part than the two next attempts (and, as
>>>>>> far as I understand, compaction is just copying over the database while
>>>>>> pruning the old revisions and deleted documents).
>>>>>>
>>>>>> Thank you for your time and help and best regards,
>>>>>> Igor Klimer
>>>>>> ________________________________________
>>>>>> From: Robert Samuel Newson [[email protected]]
>>>>>> Sent: 9 January 2014 17:13
>>>>>> To: user
>>>>>> Subject: Re: Error during compaction
>>>>>>
>>>>>> Do you have free disk space?
>>>>>>
>>>>>> On 9 Jan 2014, at 15:25, Robert Samuel Newson <[email protected]> wrote:
>>>>>>
>>>>>>>
>>>>>>> Given that you’re at 100Gb and compacting for the first time, can you
>>>>>>> tell us if you were running on older couchdb versions than 1.2.0
>>>>>>> between db creation and today?
>>>>>>>
>>>>>>> B.
>>>>>>>
>>>>>>> On 9 Jan 2014, at 14:39, Igor Klimer <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>> I've stumbled upon a peculiar problem while trying to compact (for the
>>>>>>>> first time) a large(-ish) database (~100GB at that time).
>>>>>>>> At about 50%
>>>>>>>> it failed with this error: http://pastebin.com/qeaZNHMj
>>>>>>>> This is from Windows Server 2008 R2 Enterprise with CouchDB 1.2.0.
>>>>>>>> I figured that it might be a bug in the Windows build (Erlang on
>>>>>>>> Windows? C'mon, that can't be good ;)) or already fixed in a newer
>>>>>>>> version. Some time later we migrated the server to a Linux box running
>>>>>>>> Ubuntu 12.04.3 LTS (GNU/Linux 3.8.0-33-generic x86_64) and updated
>>>>>>>> CouchDB to 1.5.0.
>>>>>>>> Unfortunately, the same error occurred: http://pastebin.com/feJWu7bN
>>>>>>>>
>>>>>>>> I've tried wrapping my head around that error, googling it, checking
>>>>>>>> this mailing list but to no avail :) So if anyone can give me any
>>>>>>>> pointers as to what might be causing this problem, I'd be very
>>>>>>>> grateful.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Igor Klimer
>>>>>>>>
>>>>>>>> (sorry for the footer that will probably follow, unfortunately it's
>>>>>>>> added for all outgoing external mail...)
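
For anyone following the thread later, here is a minimal sketch of the two read checks suggested above (a full read of _changes and of _all_docs with include_docs=true). The COUCH_URL variable and the function names full_read_url, check_full_read and check_db are illustrative assumptions, not part of CouchDB. The sketch treats any non-zero curl exit status as a failure; whether a read error in the middle of the response always surfaces that way is also an assumption, so the CouchDB log is still worth checking.

```sh
#!/bin/sh
# Sketch only: COUCH_URL and all function names here are hypothetical helpers.
COUCH_URL="${COUCH_URL:-http://localhost:5984}"

# Build the URL for a full read of one endpoint (_changes or _all_docs).
# $1 = database name, $2 = endpoint
full_read_url() {
    printf '%s/%s/%s?include_docs=true\n' "$COUCH_URL" "$1" "$2"
}

# Stream the whole response, discard the body, and report curl's exit status.
# --fail makes curl exit non-zero on HTTP errors (status 400 and above).
check_full_read() {
    url=$(full_read_url "$1" "$2")
    if curl --silent --show-error --fail --output /dev/null "$url"; then
        echo "OK   $url"
    else
        status=$?
        echo "FAIL $url (curl exit $status)"
        return 1
    fi
}

# Run both checks for one database, even if the first one fails.
check_db() {
    rc=0
    check_full_read "$1" _changes || rc=1
    check_full_read "$1" _all_docs || rc=1
    return "$rc"
}
```

After sourcing the file, `check_db dbname` runs both reads against the local server. The URL is quoted inside the helper so the shell never tries to expand the `?` in the query string.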
