Re: Error during compaction

Igor Klimer Thu, 23 Jan 2014 23:59:21 -0800

Yes, currently it's 1.5.0.
I meant to run _changes and _all_docs last night, but there were some other 
maintenance tasks scheduled. I'll definitely do it during the weekend and 
report back.
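
For reference, here is a minimal Python sketch of that check. It streams both endpoints without holding the response in memory and reports whether each one ran to completion. The address and the database name "dbname" are placeholders taken from the curl commands quoted below, and the closing-brace test is only a weak heuristic of mine, not something CouchDB guarantees; a dropped connection normally surfaces as an exception first.

```python
import sys
import urllib.request

BASE_URL = "http://localhost:5984"  # assumption: default address, as in the curl commands
DB_NAME = "dbname"                  # placeholder database name, as in the thread


def drain(stream, chunk_size=1 << 20):
    """Read a file-like object to the end without keeping it in memory.

    Returns (total_bytes, tail), where tail holds the last few bytes seen.
    """
    total = 0
    tail = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        tail = (tail + chunk)[-16:]
    return total, tail


def looks_complete(tail):
    """Weak heuristic: a complete JSON response body ends with a closing brace."""
    return tail.rstrip().endswith(b"}")


def check(path):
    """Stream one endpoint with include_docs=true and report the outcome."""
    url = f"{BASE_URL}/{DB_NAME}/{path}?include_docs=true"
    try:
        with urllib.request.urlopen(url) as resp:
            total, tail = drain(resp)
    except Exception as exc:  # HTTP errors, dropped connections, truncated chunks
        print(f"{path}: FAILED ({exc!r})")
        return False
    ok = looks_complete(tail)
    print(f"{path}: read {total} bytes, {'complete' if ok else 'TRUNCATED'}")
    return ok


if __name__ == "__main__":
    results = [check("_changes"), check("_all_docs")]
    sys.exit(0 if all(results) else 1)
```

Run as a script, it exits non-zero if either request fails or looks truncated, so it can be left running overnight and checked in the morning.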


Best regards,
Igor Klimer
________________________________________
From: Robert Samuel Newson [[email protected]]
Sent: 23 January 2014 19:28
To: user
Subject: Re: Error during compaction

Understood on access to the file but I figured I should ask.

The simplest thing will be a small patch to print out the values that CouchDB 
expected to match but did not. Remind me which exact CouchDB version you are 
running? I think you said you’re up to 1.5.0 now.
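
If it is easier than digging through package versions, the server reports its own version in the JSON document returned by the root URL. A minimal Python sketch, assuming the default localhost:5984 address (the function names are just illustrative):

```python
import json
import urllib.request


def parse_version(body):
    """Extract the version string from the JSON returned by GET /."""
    return json.loads(body)["version"]


def fetch_version(base_url="http://localhost:5984"):
    """Ask a running CouchDB for its version (base_url is an assumption)."""
    with urllib.request.urlopen(base_url + "/") as resp:
        return parse_version(resp.read())


if __name__ == "__main__":
    print(fetch_version())
```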

B.

On 23 Jan 2014, at 15:37, Igor Klimer <[email protected]> wrote:

> Well, if you put it that way, then yes, (potential) database corruption does 
> seem like something you might be interested in ;)
>
> As far as I can tell, there weren't any options set that could affect 
> fsyncing, in fact there was hardly any customization done at all. Here are 
> the config files from the Windows machine (since that's where the corruption 
> occurred):
> default.ini: http://pastebin.com/kUz0qyNk
> local.ini: http://pastebin.com/srZUMwzB
>
> I've checked http://localhost:5984/_utils/config.html and delayed_commits is 
> set to true (since that's the default).
>
> Parity checking the RAID seems like a good idea, but from what the admins are 
> telling me, they'd need to take it offline, so I want to exhaust other 
> possibilities before that.
> So for now, I'll try checking the DB in the way you suggested:
> curl localhost:5984/dbname/_changes?include_docs=true
> curl localhost:5984/dbname/_all_docs?include_docs=true
>
> And then (if it succeeds) I'll try to recover the database using the 
> recover-couchdb script.
>
> As for receiving a copy of the database - as a programmer I understand how 
> that'd help you investigate this issue, but it seems that "the powers that 
> be" aren't so understanding. Even if your software is at the core of their 
> enterprise... But I'm more than willing to compile/run any debug builds you 
> throw at me or learn some basic Erlang debugging if that'd help finding the 
> core of this issue.
>
> Best regards,
> Igor Klimer
> ________________________________________
> From: Robert Samuel Newson [[email protected]]
> Sent: 23 January 2014 15:40
> To: user
> Subject: Re: Error during compaction
>
> Database corruption is not esoteric, we take it very seriously. :)
>
> Yes, strictly append only (opened with O_APPEND in posix land, not sure what 
> the equivalent is for Win). This doesn’t stop hardware failure from mangling 
> earlier bits, of course. You have RAID-5, so you have a parity stripe, 
> perhaps you could force a parity check? Abrupt shutdown or hitting end of 
> disk should do no more than cause unflushed updates to be lost, nothing 
> fsync’ed() prior to that should be at risk, by design. Are you running with 
> delayed_commits set to true or false? The difference is whether we fsync 
> immediately or after a delay of up to one second. In neither case do we skip 
> the fsync() calls before and after writing each new database footer. There 
> are config options to disable those fsync calls, please let us know if you 
> did that (your local.ini file will tell you).
>
> So, the two causes of true corruption would be 1) a bug in our code and 2) 
> disk corruption or failure to honor fsync() and/or write ordering. Obviously 
> we’re very interested in any occurrence of type 1.
>
> A few things to try to confirm that your database is ok but not compactable 
> (which would be hard to explain):
>
> curl localhost:5984/dbname/_changes?include_docs=true
> curl localhost:5984/dbname/_all_docs?include_docs=true
>
> Let us know if those complete without error (or the errors if there are any).
>
> Finally, would it be possible for the CouchDB development team (or a subset) 
> to receive a copy of the database file for forensic investigation? I believe 
> we can sign NDAs and the like if that helps.
>
> B.
>
> On 23 Jan 2014, at 13:54, Igor Klimer <[email protected]> wrote:
>
>> No problem - I'm sure that you have more than enough on your plate without 
>> users pestering you with such esoteric problems ;)
>>
>> As for the I/O subsystem - on the lowest level it's a couple of RAID5 
>> arrays. Then, there is a Windows Server 2008 managing that and some 
>> virtual servers (via Hyper-V), including both of the machines that we had 
>> the database on (Windows Server 2008 R2 Enterprise and now Ubuntu 12 LTS). 
>> I was assured that the arrays have been running rock-solid since the 
>> moment they were set up and haven't exhibited any hardware failures. However, 
>> the servers could have been abruptly shut down, for example when they ran out 
>> of free disk space. While NTFS should recover from something like that, I 
>> wonder if there isn't a corner case that would cause some database 
>> corruption (like you're suggesting).
>>
>> Thanks for the recover-couchdb script - I'll try to give it a go ASAP. We do 
>> have backups, but I don't think they are helpful in this case - the database 
>> as a whole is working, we don't have any reports of missing files or 
>> broken documents. It would be interesting to inspect the backups to see 
>> when and, more interestingly, what happened. You mentioned that the database 
>> format is strictly append only, so the corruption shouldn't "move" and the 
>> "old"/early parts of the database file shouldn't change? Still, that seems 
>> like a really time consuming task, so for now I'll try to give the recovery 
>> script a go.
>>
>> Thanks again for your help and best regards,
>> Igor Klimer
>>
>> ________________________________________
>> From: Robert Samuel Newson [[email protected]]
>> Sent: 23 January 2014 13:52
>> To: user
>> Subject: Re: Error during compaction
>>
>> Hi,
>>
>> Sorry, in turn, for not replying sooner. I’m not really sure what to 
>> suggest, it does sound like the database file is corrupt, which is quite 
>> hard to do with our strictly append only format. The only oddities here are 
>> the use of Windows (and, presumably, NTFS?) and the fact that you did hit 
>> end of disk. I’ve not observed corruption when hitting end of disk on other 
>> OS/FS combinations, though. Do you happen to know any details of your I/O 
>> subsystem that might provide a hint? Could any of the disks themselves have 
>> suffered block failures that couldn’t be corrected by disk firmware? Bit 
>> flips? Would your disk controller lie about fsync() calls?
>>
>> Those questions will help us figure out how the corruption occurs, but they 
>> obviously don’t help you fix it. For that, beyond hoping you have backups, 
>> I suggest trying https://github.com/jhs/recover-couchdb (perhaps Jason can 
>> chip in with how likely this is to still work, given the age). If that 
>> doesn’t extract all your data, then the only other suggestion I have is to 
>> truncate the database file until it can compact, but this necessarily means 
>> losing data.
>>
>> Do other developers have other suggestions here?
>>
>> B.
>>
>> On 21 Jan 2014, at 09:08, Igor Klimer <[email protected]> wrote:
>>
>>> Hi,
>>> I'm extremely sorry for not replying sooner, however I was on sick leave 
>>> last week.
>>> I've tried your suggestion with an empty .compact file, however the results 
>>> seem to be the same...
>>> Log: http://pastebin.com/MJCgGM8C
>>>
>>> Started with an empty ecrepo.couch.compact file (touch 
>>> ecrepo.couch.compact), then after about 3 hours, the error was printed in 
>>> the logs and the compaction failed:
>>> -rw-r--r-- 1 couchdb couchdb 137502523517 Jan 21 09:51 ecrepo.couch
>>> -rw-r--r-- 1 couchdb couchdb  51692612367 Jan 21 02:07 ecrepo.couch.compact
>>>
>>> There's over 100GB free space available on the disk.
>>>
>>> At least I think I know what the number 51692471440 in log means ;) But I 
>>> don't know if there's a way to check which document resides at that 
>>> position in file.
>>>
>>> Best regards,
>>> Igor Klimer
>>>
>>> ________________________________________
>>> From: Robert Samuel Newson [[email protected]]
>>> Sent: 10 January 2014 18:45
>>> To: user
>>> Subject: Re: Error during compaction
>>>
>>> Yes, I understood. The empty .compact file will trigger more checking in 
>>> the compaction process, I’m hoping it gets us past the problem.
>>>
>>> B.
>>>
>>> On 10 Jan 2014, at 13:34, Igor Klimer <[email protected]> wrote:
>>>
>>>> :)
>>>> Just to clarify - the .compact file is getting created and then the 
>>>> compaction fails after some time (an hour or more):
>>>> 1) The first attempt, on Windows with CouchDB 1.2.0, failed because of 
>>>> insufficient disk space. The .compact file was at least 10GB; 
>>>> unfortunately, I don't remember exactly how big (or whether it was bigger 
>>>> than the one produced in the later attempts). There was no free disk 
>>>> space when it failed, so I'm assuming that was the cause.
>>>> 2) The second attempt, on Windows with CouchDB 1.2.0, failed with the 
>>>> error mentioned below. The .compact file was around 50GB, and there was 
>>>> plenty of free space left on the disk.
>>>> 3) The third attempt, on Ubuntu with CouchDB 1.5.0, failed with the error 
>>>> mentioned below. The .compact file was around 50GB, there was plenty of 
>>>> free space left on the disk, and judging from the numbers present in the 
>>>> log (ids? node numbers?) it failed at the same point as attempt #2.
>>>>
>>>> Just wanted to make sure we're on the same page :) Do you still want me to 
>>>> try it with an empty .compact file? (I can do this only during night 
>>>> hours, since I don't want to put too much load on the server during 
>>>> working hours)
>>>>
>>>> Best regards,
>>>> Igor Klimer
>>>> ________________________________________
>>>> From: Robert Samuel Newson [[email protected]]
>>>> Sent: 10 January 2014 14:03
>>>> To: user
>>>> Subject: Re: Error during compaction
>>>>
>>>> Hrm, strike one. Ok. Next thing to try is subtly different. Stop couchdb, 
>>>> delete the .compact file, but then make a new, empty .compact file (so 
>>>> 'touch /path/to/dbname.compact'), start couchdb and compact.
>>>>
>>>> B.
>>>>
>>>> On 10 Jan 2014, at 12:42, Igor Klimer <[email protected]> wrote:
>>>>
>>>>> Yes, I've already done that after the very first attempt at compaction 
>>>>> (the one that failed because of lack of disk space). And it resulted in 
>>>>> the second fail (on Windows), then the same on Linux - I always deleted 
>>>>> the incomplete (about 50% of the database, around 50GB) .compact file 
>>>>> before running the compaction again. So I was always doing compaction 
>>>>> from scratch.
>>>>>
>>>>> Best regards,
>>>>> Igor Klimer
>>>>> ________________________________________
>>>>> From: Robert Samuel Newson [[email protected]]
>>>>> Sent: 10 January 2014 13:08
>>>>> To: user
>>>>> Subject: Re: Error during compaction
>>>>>
>>>>> Thanks! That’s very useful. Hitting end of disk certainly feels like a 
>>>>> cause here. Since the compaction has never completed, I suggest we redo 
>>>>> compaction from scratch.
>>>>>
>>>>> 1) stop couchdb
>>>>> 2) delete (or move aside) the dbname.compact file for this database
>>>>> 3) start couchdb
>>>>> 4) compact the db
>>>>>
>>>>> Whether it works or not, please let us know.
>>>>>
>>>>> B.
>>>>>
>>>>> On 10 Jan 2014, at 08:25, Igor Klimer <[email protected]> wrote:
>>>>>
>>>>>>> Given that you’re at 100Gb and compacting for the first time, can you 
>>>>>>> tell us if you were running on older couchdb versions than 1.2.0 
>>>>>>> between db creation and today?
>>>>>>
>>>>>> No, we've been running 1.2.0 from the start (around Oct 2012), then 
>>>>>> switched to Ubuntu and 1.5.0.
>>>>>>
>>>>>>> Do you have free disk space?
>>>>>> Yes, there's about 150% of the DB's size worth of free space :) I forgot 
>>>>>> to mention ("OK, here we go, the user will confess to some sin he 
>>>>>> committed and is ashamed of and is most likely the reason for this 
>>>>>> failure") that we've run the compaction once before the error on Windows 
>>>>>> I mentioned below, but it failed because of insufficient disk space - so 
>>>>>> I double checked before running the compaction again if there's enough 
>>>>>> space. Here's the log, if it's of any help: http://pastebin.com/S1URXN0p
>>>>>> Do you think it could have left the database in some corrupted state? It 
>>>>>> seems it failed at a different part than the two next attempts (and, as 
>>>>>> far as I understand, compaction is just copying over the database while 
>>>>>> pruning the old revisions and deleted documents).
>>>>>>
>>>>>> Thank you for your time and help and best regards,
>>>>>> Igor Klimer
>>>>>> ________________________________________
>>>>>> From: Robert Samuel Newson [[email protected]]
>>>>>> Sent: 9 January 2014 17:13
>>>>>> To: user
>>>>>> Subject: Re: Error during compaction
>>>>>>
>>>>>> Do you have free disk space?
>>>>>>
>>>>>> On 9 Jan 2014, at 15:25, Robert Samuel Newson <[email protected]> wrote:
>>>>>>
>>>>>>>
>>>>>>> Given that you’re at 100Gb and compacting for the first time, can you 
>>>>>>> tell us if you were running on older couchdb versions than 1.2.0 
>>>>>>> between db creation and today?
>>>>>>>
>>>>>>> B.
>>>>>>>
>>>>>>> On 9 Jan 2014, at 14:39, Igor Klimer <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>> I've stumbled upon a peculiar problem while trying to compact (for the 
>>>>>>>> first time) a large(-ish) database (~100GB at that time). At about 50% 
>>>>>>>> it failed with this error: http://pastebin.com/qeaZNHMj
>>>>>>>> This is from Windows Server 2008 R2 Enterprise with Couchdb 1.2.0.
>>>>>>>> I figured that it might be a bug in the Windows build (Erlang on 
>>>>>>>> Windows? C'mon, that can't be good ;)) or already fixed in a newer 
>>>>>>>> version. Some time later we migrated the server to a Linux box running 
>>>>>>>> Ubuntu 12.04.3 LTS (GNU/Linux 3.8.0-33-generic x86_64) and updated 
>>>>>>>> CouchDB to 1.5.0.
>>>>>>>> Unfortunately, the same error occurred: http://pastebin.com/feJWu7bN
>>>>>>>>
>>>>>>>> I've tried wrapping my head around that error, googling it, checking 
>>>>>>>> this mailing list but to no avail :) So if anyone can give me any 
>>>>>>>> pointers as to what might be causing this problem, I'd be very 
>>>>>>>> grateful.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Igor Klimer
>>>>>>>>
>>>>>>>> (sorry for the footer that will probably follow, unfortunately it's 
>>>>>>>> added for all outgoing external mail...)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>