Re: dead disk
I always thought that the hardest part for an OpenBSD developer was the coding. Looking at this thread and also considering the donation thread, I have changed my perception: the hardest part is to avoid the huge amount of shit delivered by users in multiple ways...
Re: dead disk
On Sun, Jan 26, 2014 at 5:07 PM, Philip Guenther guent...@gmail.com wrote:
> On Sun, Jan 26, 2014 at 11:40 AM, emigrant emig...@gmail.com wrote:
> > My Master machine is dead, exactly HDD(thank you God for CARP+pfsync) :).
> >
> > root@master[/etc]wd0(pciide0:0:0): timeout
> > type: ata
> > c_bcount: 16384
> > c_skip: 0
> > ...
> > /: got error 5 while accessing filesystem
> > panic: softdep_deallocate_dependencies: unrecovered I/O error
> > Stopped at Debugger+0x4: popl %ebp
> > RUN AT LEAST 'trace' AND 'ps' AND INCLUDE OUTPUT WHEN REPORTING THIS PANIC!
> > DO NOT EVEN BOTHER REPORTING THIS WITHOUT INCLUDING THAT INFORMATION!
> > ddb
>
> This is a fundamental problem of softdeps: it can delay an operation to a point where other operations depend on it in such a way that if the I/O for that first operation fails, the dependent operations cannot be undone and the failure propagated up safely. Rather than live a lie, it'll panic the system and die.

the way the decision to panic() was stated implies that the course of action is justified, when detaching the disk/hub, or forcefully mounting it read only, are alternatives that could be explored

the other day I unplugged the power connector from a malfunctioning DVD-RW drive that was being too darn noisy. the kernel proceeded to detach the ahci device hosting the aforementioned drive and an sd with mounted ffs partitions

i could've unplugged the power connector in the middle of a string of metadata writes to the sd. does that entail that the system should panic? hopefully the current outcome will remain, because it's way more useful than my pc throwing a hissy fit.

> I don't know exactly which operations can lead to that; if you need to know that you should go read the softdeps papers on Kirk McKusick's site.
>
> Philip Guenther
Re: dead disk
On Tue, Jan 28, 2014 at 12:27 AM, Andres Perera andre...@zoho.com wrote:
> On Sun, Jan 26, 2014 at 5:07 PM, Philip Guenther guent...@gmail.com wrote:
> > On Sun, Jan 26, 2014 at 11:40 AM, emigrant emig...@gmail.com wrote:
> > > My Master machine is dead, exactly HDD(thank you God for CARP+pfsync) :).
> > >
> > > root@master[/etc]wd0(pciide0:0:0): timeout
> > > type: ata
> > > c_bcount: 16384
> > > c_skip: 0
> > > ...
> > > /: got error 5 while accessing filesystem
> > > panic: softdep_deallocate_dependencies: unrecovered I/O error
> > > Stopped at Debugger+0x4: popl %ebp
> > > RUN AT LEAST 'trace' AND 'ps' AND INCLUDE OUTPUT WHEN REPORTING THIS PANIC!
> > > DO NOT EVEN BOTHER REPORTING THIS WITHOUT INCLUDING THAT INFORMATION!
> > > ddb
> >
> > This is a fundamental problem of softdeps: it can delay an operation to a point where other operations depend on it in such a way that if the I/O for that first operation fails, the dependent operations cannot be undone and the failure propagated up safely. Rather than live a lie, it'll panic the system and die.
>
> the way the decision to panic() was stated implies that the course of action is justified, when detaching the disk/hub, or forcefully mounting it read only, are alternatives that could be explored.

How do those alternative actions, which can only fail in-progress and future operations, satisfactorily resolve the case of operations WHICH HAVE ALREADY RETURNED SUCCESS but whose effects will actually be lost and not durable?

I'm no expert on softdeps, so maybe you have a better explanation for why Kirk made the choice he did to have it panic in some cases?

Philip Guenther
Re: dead disk
On Tue, Jan 28, 2014 at 4:55 AM, Philip Guenther guent...@gmail.com wrote:
> On Tue, Jan 28, 2014 at 12:27 AM, Andres Perera andre...@zoho.com wrote:
> > On Sun, Jan 26, 2014 at 5:07 PM, Philip Guenther guent...@gmail.com wrote:
> > > On Sun, Jan 26, 2014 at 11:40 AM, emigrant emig...@gmail.com wrote:
> > > > My Master machine is dead, exactly HDD(thank you God for CARP+pfsync) :).
> > > >
> > > > root@master[/etc]wd0(pciide0:0:0): timeout
> > > > type: ata
> > > > c_bcount: 16384
> > > > c_skip: 0
> > > > ...
> > > > /: got error 5 while accessing filesystem
> > > > panic: softdep_deallocate_dependencies: unrecovered I/O error
> > > > Stopped at Debugger+0x4: popl %ebp
> > > > RUN AT LEAST 'trace' AND 'ps' AND INCLUDE OUTPUT WHEN REPORTING THIS PANIC!
> > > > DO NOT EVEN BOTHER REPORTING THIS WITHOUT INCLUDING THAT INFORMATION!
> > > > ddb
> > >
> > > This is a fundamental problem of softdeps: it can delay an operation to a point where other operations depend on it in such a way that if the I/O for that first operation fails, the dependent operations cannot be undone and the failure propagated up safely. Rather than live a lie, it'll panic the system and die.
> >
> > the way the decision to panic() was stated implies that the course of action is justified, when detaching the disk/hub, or forcefully mounting it read only, are alternatives that could be explored.
>
> How do those alternative actions, which can only fail in-progress and future operations, satisfactorily resolve the case of operations WHICH HAVE ALREADY RETURNED SUCCESS but whose effects will actually be lost and not durable?
>
> I'm no expert on softdeps, so maybe you have a better explanation for why Kirk made the choice he did to have it panic in some cases?

well, i'm no expert either. now that we have presented our credentials, let's go back to what was already conjecture

do you understand that disks have write caches that don't give a hoot about posix mkdir() rename() and so on? can bit rot change an inode type from directory to file, and vice versa?

do you want the kernel to figure these out after the fact and retroactively panic() for each occurrence, neatly queueing them boot after boot

or do you want to grow a pair of balls instead?

> Philip Guenther
Re: dead disk
On Tue, Jan 28, 2014 at 2:03 AM, Andres Perera andre...@zoho.com wrote:
> On Tue, Jan 28, 2014 at 4:55 AM, Philip Guenther guent...@gmail.com wrote:
> > ...
> > I'm no expert on softdeps, so maybe you have a better explanation for why Kirk made the choice he did to have it panic in some cases?
>
> well, i'm no expert either. now that we have presented our credentials, let's go back to what was already conjecture
> ...
> do you want the kernel to figure these out after the fact and retroactively panic() for each occurrence, neatly queueing them boot after boot
>
> or do you want to grow a pair of balls instead?

You ignore my pointer to the actual engineering and logic in this area and prefer to expand upon the conjecture. I cannot help in that area and am unwilling to have you reorder my TODO list to suit your pleasure.

Instead I look forward to your diff fixing this bug in softdeps. Please send that diff to the list and not me directly, as I find your submissions uninteresting.

Philip Guenther
Re: dead disk
On Tue, Jan 28, 2014 at 6:12 AM, Philip Guenther guent...@gmail.com wrote:
> On Tue, Jan 28, 2014 at 2:03 AM, Andres Perera andre...@zoho.com wrote:
> > On Tue, Jan 28, 2014 at 4:55 AM, Philip Guenther guent...@gmail.com wrote:
> > > ...
> > > I'm no expert on softdeps, so maybe you have a better explanation for why Kirk made the choice he did to have it panic in some cases?
> >
> > well, i'm no expert either. now that we have presented our credentials, let's go back to what was already conjecture
> > ...
> > do you want the kernel to figure these out after the fact and retroactively panic() for each occurrence, neatly queueing them boot after boot
> >
> > or do you want to grow a pair of balls instead?
>
> You ignore my pointer to the actual engineering and logic in this area and prefer to expand upon the conjecture. I cannot help in that area and am unwilling to have you reorder my TODO list to suit your pleasure.

the comments pertain to your misrepresentation of McKusick's softdep paper. this being a public forum, your todo list is your personal business, and in any case not for others to be shoehorned into when blatant mistakes need correction

the paper does not support the notion that metadata cache flushing failures lead to complete system instability meriting a panic. quote the relevant text or stop pretending that it's there.

meanwhile, there are cases where synchronous writing of metadata can also allow the unavailability and corruption of a previously successful system call's pervasive effects. the onus is on you, or in your imaginary representation of the paper, to prove that halting the system is justifiable in BOTH circumstances.

the paper does not discuss alternatives, eg, mounting read only and preserving references to unflushed data until unmount... so thread along if you find looking for a solution uninteresting. that's better than lying.

> Instead I look forward to your diff fixing this bug in softdeps. Please send that diff to the list and not me directly, as I find your submissions uninteresting.
>
> Philip Guenther
Re: dead disk
On Tue, Jan 28, 2014 at 05:33, Andres Perera wrote:
> do you understand that disks have write caches that don't give a hoot about posix mkdir() rename() and so on? can bit rot change an inode type from directory to file, and vice versa?
>
> do you want the kernel to figure these out after the fact and retroactively panic() for each occurrence, neatly queueing them boot after boot
>
> or do you want to grow a pair of balls instead?

./ffs/ffs_alloc.c: panic(ffs_alloc: bad size);
./ffs/ffs_alloc.c: panic(ffs_alloc: missing credential);
./ffs/ffs_alloc.c: panic(ffs_realloccg: bad size);
./ffs/ffs_alloc.c: panic(ffs_realloccg: missing credential);
./ffs/ffs_alloc.c: panic(ffs_realloccg: bad bprev);
./ffs/ffs_alloc.c: panic(ffs_realloccg: bad blockno);
./ffs/ffs_alloc.c: panic(ffs_realloccg: small buf);
./ffs/ffs_alloc.c: panic(ffs_realloccg: bad optim);
./ffs/ffs_alloc.c: panic(ffs_realloccg: small buf 2);
./ffs/ffs_alloc.c: panic(ffs1_reallocblks: unallocated block 1);
./ffs/ffs_alloc.c: panic(ffs1_reallocblks: non-logical cluster);
./ffs/ffs_alloc.c: panic(ffs1_reallocblks: non-physical cluster %d, i);
./ffs/ffs_alloc.c: panic(ffs1_reallocblk: start == end);
./ffs/ffs_alloc.c: panic(ffs1_reallocblks: unallocated block 2);
./ffs/ffs_alloc.c: panic(ffs1_reallocblks: alloc mismatch);
./ffs/ffs_alloc.c: panic(ffs1_reallocblks: unallocated block 3);
./ffs/ffs_alloc.c: panic(ffs2_reallocblks: unallocated block 1);
./ffs/ffs_alloc.c: panic(ffs2_reallocblks: non-logical cluster);
./ffs/ffs_alloc.c: panic(ffs2_reallocblks: non-physical cluster %d, i);
./ffs/ffs_alloc.c: panic(ffs2_reallocblk: start == end);
./ffs/ffs_alloc.c: panic(ffs2_reallocblks: unallocated block 2);
./ffs/ffs_alloc.c: panic(ffs2_reallocblks: alloc mismatch);
./ffs/ffs_alloc.c: panic(ffs2_reallocblks: unallocated block 3);
./ffs/ffs_alloc.c: panic(ffs_valloc: dup alloc);
./ffs/ffs_alloc.c: panic(ffs_clusteralloc: map mismatch);
./ffs/ffs_alloc.c: panic(ffs_clusteralloc: allocated out of group);
./ffs/ffs_alloc.c: panic(ffs_clusteralloc: lost block);
./ffs/ffs_alloc.c: panic(ffs_nodealloccg: map corrupted);
./ffs/ffs_alloc.c: panic(ffs_nodealloccg: block not in map);
./ffs/ffs_alloc.c: panic(ffs_blkfree: bad size);
./ffs/ffs_alloc.c: panic(ffs_blkfree: freeing free block);
./ffs/ffs_alloc.c: panic(ffs_blkfree: freeing free frag);
./ffs/ffs_alloc.c: panic(ffs_freefile: range: dev = 0x%x, ino = %d, fs = %s,
./ffs/ffs_alloc.c: panic(ffs_freefile: freeing free inode);
./ffs/ffs_alloc.c: panic(ffs_checkblk: bad size);
./ffs/ffs_alloc.c: panic(ffs_checkblk: bad block %lld, (long long)bno);
./ffs/ffs_alloc.c: panic(ffs_checkblk: partially free fragment);
./ffs/ffs_alloc.c: * It is a panic if a request is made to find a block if none are
./ffs/ffs_alloc.c: panic(ffs_alloccg: map corrupted);
./ffs/ffs_alloc.c: panic(ffs_alloccg: block not in map);
./ffs/ffs_balloc.c: panic(ffs1_balloc: blk too big);
./ffs/ffs_balloc.c: panic (ffs1_balloc: ufs_bmaparray returned indirect block);
./ffs/ffs_balloc.c: panic(Could not unwind indirect block, error %d, r);
./ffs/ffs_balloc.c: panic(ffs2_balloc: block too big);
./ffs/ffs_balloc.c: panic(ffs2_balloc: ufs_bmaparray returned indirect block);
./ffs/ffs_balloc.c: panic(ffs2_balloc: unwind failed);
./ffs/ffs_inode.c: panic(ffs_update: bad link cnt);
./ffs/ffs_inode.c: panic(ffs_truncate: partial truncate of symlink);
./ffs/ffs_inode.c: panic(ffs_truncate: newspace);
./ffs/ffs_inode.c: panic(ffs_truncate1);
./ffs/ffs_inode.c: panic(ffs_truncate2);
./ffs/ffs_inode.c: panic(ffs_indirtrunc: bad buffer size);
./ffs/ffs_subr.c:__dead void panic(const char *, ...);
./ffs/ffs_subr.c: panic(Disk buffer overlap);
./ffs/ffs_vfsops.c: panic(ffs_reload: dirty2);
./ffs/ffs_vfsops.c: panic(ffs_reload: dirty1);
./ffs/ffs_vfsops.c: panic(ffs_statfs);
./ffs/ffs_vfsops.c: panic(ffs_statfs);
./ffs/ffs_vfsops.c: panic(update: rofs mod);
./ffs/ffs_vfsops.c: panic(ffs_vget: alien ino_t %llu, (unsigned long long)ino);
dead disk
Hi,

My Master machine is dead, exactly HDD(thank you God for CARP+pfsync) :).

root@master[/etc]wd0(pciide0:0:0): timeout
type: ata
c_bcount: 16384
c_skip: 0
pciide0:0:0: bus-master DMA error: missing interrupt, status=0x20
pciide0 channel 0: reset failed for drive 0
wd0a: device timeout writing fsbn 48851424 of 48851424-48851455 (wd0 bn 48851488; cn 3040 tn 220 sn 28), retrying
pciide0:0:0: not ready, st=0xd0BSY,DRDY,DSC, err=0x00
pciide0 channel 0: reset failed for drive 0
wd0a: device timeout writing fsbn 48851424 of 48851424-48851455 (wd0 bn 48851488; cn 3040 tn 220 sn 28), retrying
pciide0:0:0: not ready, st=0xd0BSY,DRDY,DSC, err=0x00
pciide0 channel 0: reset failed for drive 0
wd0a: device timeout writing fsbn 48851424 of 48851424-48851455 (wd0 bn 48851488; cn 3040 tn 220 sn 28)
/: got error 5 while accessing filesystem
panic: softdep_deallocate_dependencies: unrecovered I/O error
Stopped at Debugger+0x4: popl %ebp
RUN AT LEAST 'trace' AND 'ps' AND INCLUDE OUTPUT WHEN REPORTING THIS PANIC!
DO NOT EVEN BOTHER REPORTING THIS WITHOUT INCLUDING THAT INFORMATION!
ddb
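For anyone reading the log above: "error 5" is the errno value EIO (I/O error), and the failed write covers fsbn 48851424 through 48851455. A small Python sketch to decode both follows. The 512-byte sector size is an assumption, not something the log states, and the helper names are made up for illustration.

```python
import errno
import os

SECTOR_BYTES = 512  # assumption: 512-byte sectors, not stated in the log


def describe_error(code):
    """Return the symbolic errno name and the platform's message for it."""
    return errno.errorcode.get(code, "unknown"), os.strerror(code)


def transfer_bytes(first_block, last_block, sector_bytes=SECTOR_BYTES):
    """Size in bytes of a transfer covering an inclusive block range."""
    return (last_block - first_block + 1) * sector_bytes


# Values copied from the log: "got error 5" and "fsbn ... 48851424-48851455".
name, text = describe_error(5)
size = transfer_bytes(48851424, 48851455)
print(name, text, size)
```

Under that assumption the 32-block range works out to 16384 bytes, which agrees with the c_bcount value printed in the timeout message, so the panic appears to concern that single failed 16 KB write.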
Re: dead disk
On Sun, Jan 26, 2014 at 11:40 AM, emigrant emig...@gmail.com wrote:
> My Master machine is dead, exactly HDD(thank you God for CARP+pfsync) :).
>
> root@master[/etc]wd0(pciide0:0:0): timeout
> type: ata
> c_bcount: 16384
> c_skip: 0
> ...
> /: got error 5 while accessing filesystem
> panic: softdep_deallocate_dependencies: unrecovered I/O error
> Stopped at Debugger+0x4: popl %ebp
> RUN AT LEAST 'trace' AND 'ps' AND INCLUDE OUTPUT WHEN REPORTING THIS PANIC!
> DO NOT EVEN BOTHER REPORTING THIS WITHOUT INCLUDING THAT INFORMATION!
> ddb

This is a fundamental problem of softdeps: it can delay an operation to a point where other operations depend on it in such a way that if the I/O for that first operation fails, the dependent operations cannot be undone and the failure propagated up safely. Rather than live a lie, it'll panic the system and die.

I don't know exactly which operations can lead to that; if you need to know that you should go read the softdeps papers on Kirk McKusick's site.

Philip Guenther
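To make the point above concrete, here is a toy Python model of delayed, ordered writes. It is only a sketch and not the real softdep code: FailingDisk, DelayedWriteCache, and the block names are all hypothetical, and real soft updates track dependencies between buffers rather than a simple queue. It only illustrates why an I/O error that shows up at flush time has no caller left to report to.

```python
import errno


class FailingDisk:
    """Hypothetical device that accepts a fixed number of writes, then fails."""

    def __init__(self, writes_before_failure):
        self.remaining = writes_before_failure
        self.blocks = {}

    def write(self, name, data):
        if self.remaining <= 0:
            raise OSError(errno.EIO, "device timeout")
        self.remaining -= 1
        self.blocks[name] = data


class DelayedWriteCache:
    """Toy model: operations succeed at once, their writes happen later."""

    def __init__(self, disk):
        self.disk = disk
        self.pending = []        # ordered (block name, data) writes not yet on disk
        self.acknowledged = []   # operations already reported as successful

    def operation(self, label, writes):
        # Queue the writes in memory and tell the caller it worked.
        self.pending.extend(writes)
        self.acknowledged.append(label)
        return 0

    def flush(self):
        # Writes go out in order; the first failure stops everything after it.
        while self.pending:
            name, data = self.pending[0]
            self.disk.write(name, data)  # may raise OSError(EIO)
            self.pending.pop(0)


def run_demo():
    cache = DelayedWriteCache(FailingDisk(writes_before_failure=1))
    cache.operation("mkdir a", [("inode 10", "dir"), ("dirent a", "inode 10")])
    cache.operation("create a/f", [("inode 11", "file"), ("dirent a/f", "inode 11")])
    error = None
    try:
        cache.flush()
    except OSError as exc:
        error = exc.errno
    return {
        "acknowledged": cache.acknowledged,
        "on_disk": sorted(cache.disk.blocks),
        "lost_writes": len(cache.pending),
        "error": error,
    }


print(run_demo())
```

In this model both operations have already returned 0 by the time the second write fails, so one inode is on disk with no directory entry pointing at it and three writes never land. Mounting read-only at that point would stop new operations but would not change what the two callers were told.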