Re: dead disk

2014-01-30 Thread Mihai Popescu
I always thought that the hardest part for an OpenBSD developer is the
coding one. Looking at this thread and also considering the donation
thread, I have changed my perception: the hardest part is to avoid the huge
amount of shit delivered by users in multiple ways...



Re: dead disk

2014-01-28 Thread Andres Perera
On Sun, Jan 26, 2014 at 5:07 PM, Philip Guenther guent...@gmail.com wrote:
 On Sun, Jan 26, 2014 at 11:40 AM, emigrant emig...@gmail.com wrote:
 My Master machine is dead, exactly HDD(thank you God for CARP+pfsync) :).

 root@master[/etc]wd0(pciide0:0:0): timeout
 type: ata
 c_bcount: 16384
 c_skip: 0
 ...
 /: got error 5 while accessing filesystem
 panic: softdep_deallocate_dependencies: unrecovered I/O error
 Stopped at  Debugger+0x4:   popl    %ebp
 RUN AT LEAST 'trace' AND 'ps' AND INCLUDE OUTPUT WHEN REPORTING THIS PANIC!
 DO NOT EVEN BOTHER REPORTING THIS WITHOUT INCLUDING THAT INFORMATION!
 ddb

 This is a fundamental problem of softdeps: it can delay an operation to
 a point where other operations depend on it in such a way that if
 the I/O for that first operation fails, the dependent operations
 cannot be undone and the failure propagated up safely.  Rather than
 live a lie, it'll panic the system and die.

the way the decision to panic() was stated implies that the course of
action is justified, when detaching the disk/hub, or forcefully
mounting it read only, are alternatives that could be explored

the other day I unplugged the power connector from a malfunctioning
DVD-RW drive that was being too darn noisy. the kernel proceeded to
detach the ahci device hosting the aforementioned drive and an sd with
mounted ffs partitions

i could've unplugged the power connector in the middle of a string of
metadata writes to the sd. does that entail that the system should
panic?

hopefully the current outcome will remain, because it's way more
useful than my pc throwing a hissy fit.


 I don't know exactly which operations can lead to that; if you need to
 know that you should go read the softdeps papers on Kirk McKusick's
 site.


 Philip Guenther



Re: dead disk

2014-01-28 Thread Philip Guenther
On Tue, Jan 28, 2014 at 12:27 AM, Andres Perera andre...@zoho.com wrote:
 On Sun, Jan 26, 2014 at 5:07 PM, Philip Guenther guent...@gmail.com wrote:
 On Sun, Jan 26, 2014 at 11:40 AM, emigrant emig...@gmail.com wrote:
 My Master machine is dead, exactly HDD(thank you God for CARP+pfsync) :).

 root@master[/etc]wd0(pciide0:0:0): timeout
 type: ata
 c_bcount: 16384
 c_skip: 0
 ...
 /: got error 5 while accessing filesystem
 panic: softdep_deallocate_dependencies: unrecovered I/O error
 Stopped at  Debugger+0x4:   popl    %ebp
 RUN AT LEAST 'trace' AND 'ps' AND INCLUDE OUTPUT WHEN REPORTING THIS PANIC!
 DO NOT EVEN BOTHER REPORTING THIS WITHOUT INCLUDING THAT INFORMATION!
 ddb

 This is a fundamental problem of softdeps: it can delay an operation to
 a point where other operations depend on it in such a way that if
 the I/O for that first operation fails, the dependent operations
 cannot be undone and the failure propagated up safely.  Rather than
 live a lie, it'll panic the system and die.

 the way the decision to panic() was stated implies that the course of
 action is justified, when detaching the disk/hub, or forcefully
 mounting it read only, are alternatives that could be explored.

How do those alternative actions, which can only fail in-progress and
future operations, satisfactorily resolve the case of operations WHICH
HAVE ALREADY RETURNED SUCCESS but whose effects will actually be lost
and not durable?

I'm no expert on softdeps, so maybe you have a better explanation for
why Kirk made the choice he did to have it panic in some cases?


Philip Guenther



Re: dead disk

2014-01-28 Thread Andres Perera
On Tue, Jan 28, 2014 at 4:55 AM, Philip Guenther guent...@gmail.com wrote:
 On Tue, Jan 28, 2014 at 12:27 AM, Andres Perera andre...@zoho.com wrote:
 On Sun, Jan 26, 2014 at 5:07 PM, Philip Guenther guent...@gmail.com wrote:
 On Sun, Jan 26, 2014 at 11:40 AM, emigrant emig...@gmail.com wrote:
 My Master machine is dead, exactly HDD(thank you God for CARP+pfsync) :).

 root@master[/etc]wd0(pciide0:0:0): timeout
 type: ata
 c_bcount: 16384
 c_skip: 0
 ...
 /: got error 5 while accessing filesystem
 panic: softdep_deallocate_dependencies: unrecovered I/O error
 Stopped at  Debugger+0x4:   popl    %ebp
 RUN AT LEAST 'trace' AND 'ps' AND INCLUDE OUTPUT WHEN REPORTING THIS PANIC!
 DO NOT EVEN BOTHER REPORTING THIS WITHOUT INCLUDING THAT INFORMATION!
 ddb

 This is a fundamental problem of softdeps: it can delay an operation to
 a point where other operations depend on it in such a way that if
 the I/O for that first operation fails, the dependent operations
 cannot be undone and the failure propagated up safely.  Rather than
 live a lie, it'll panic the system and die.

 the way the decision to panic() was stated implies that the course of
 action is justified, when detaching the disk/hub, or forcefully
 mounting it read only, are alternatives that could be explored.

 How do those alternative actions, which can only fail in-progress and
 future operations, satisfactorily resolve the case of operations WHICH
 HAVE ALREADY RETURNED SUCCESS but whose effects will actually be lost
 and not durable?

 I'm no expert on softdeps, so maybe you have a better explanation for
 why Kirk made the choice he did to have it panic in some cases?

well, i'm no expert either. now that we have presented our
credentials, let's go back to what was already conjecture

do you understand that disks have write caches that don't give a hoot
about posix mkdir() rename() and so on?

can bit rot change an inode type from directory to file, and vice versa?

do you want the kernel to figure these out after the fact and
retroactively panic() for each occurrence, neatly queueing them boot
after boot or do you want to grow a pair of balls instead?



 Philip Guenther



Re: dead disk

2014-01-28 Thread Philip Guenther
On Tue, Jan 28, 2014 at 2:03 AM, Andres Perera andre...@zoho.com wrote:
 On Tue, Jan 28, 2014 at 4:55 AM, Philip Guenther guent...@gmail.com wrote:
...
 I'm no expert on softdeps, so maybe you have a better explanation for
 why Kirk made the choice he did to have it panic in some cases?

 well, i'm no expert either. now that we have presented our
 credentials, let's go back to what was already conjecture
...
 do you want the kernel to figure these out after the fact and
 retroactively panic() for each occurrence, neatly queueing them boot
 after boot or do you want to grow a pair of balls instead?

You ignore my pointer to the actual engineering and logic in this area
and prefer to expand upon the conjecture.  I cannot help in that area
and am unwilling to have you reorder my TODO list to suit your
pleasure.

Instead I look forward to your diff fixing this bug in softdeps.
Please send that diff to the list and not me directly, as I find your
submissions uninteresting.


Philip Guenther



Re: dead disk

2014-01-28 Thread Andres Perera
On Tue, Jan 28, 2014 at 6:12 AM, Philip Guenther guent...@gmail.com wrote:
 On Tue, Jan 28, 2014 at 2:03 AM, Andres Perera andre...@zoho.com wrote:
 On Tue, Jan 28, 2014 at 4:55 AM, Philip Guenther guent...@gmail.com wrote:
 ...
 I'm no expert on softdeps, so maybe you have a better explanation for
 why Kirk made the choice he did to have it panic in some cases?

 well, i'm no expert either. now that we have presented our
 credentials, let's go back to what was already conjecture
 ...
 do you want the kernel to figure these out after the fact and
 retroactively panic() for each occurrence, neatly queueing them boot
 after boot or do you want to grow a pair of balls instead?

 You ignore my pointer to the actual engineering and logic in this area
 and prefer to expand upon the conjecture.  I cannot help in that area
 and am unwilling to have you reorder my TODO list to suit your
 pleasure.


the comments pertain to your misrepresentation of McKusick's softdep
paper. this being a public forum, your todo list is your personal
business, and in any case not for others to be shoehorned into when
blatant mistakes need correction

the paper does not support the notion that metadata cache flushing
failures lead to complete system instability meriting a panic. quote
the relevant text or stop pretending that it's there.

meanwhile, there are cases where synchronous writing of metadata can
also allow the unavailability and corruption of a previously successful
system call's pervasive effects.

the onus is on you, or in your imaginary representation of the paper,
to prove that halting the system is justifiable in BOTH circumstances.

the paper does not discuss alternatives, eg, mounting read only and
preserving references to unflushed data until unmount...

so thread along if you find looking for a solution uninteresting.
that's better than lying.


 Instead I look forward to your diff fixing this bug in softdeps.
 Please send that diff to the list and not me directly, as I find your
 submissions uninteresting.


 Philip Guenther



Re: dead disk

2014-01-28 Thread Ted Unangst
On Tue, Jan 28, 2014 at 05:33, Andres Perera wrote:

 do you understand that disks have write caches that don't give a hoot
 about posix mkdir() rename() and so on?
 
 can bit rot change an inode type from directory to file, and vice versa?
 
 do you want the kernel to figure these out after the fact and
 retroactively panic() for each occurrence, neatly queueing them boot
 after boot or do you want to grow a pair of balls instead?

./ffs/ffs_alloc.c:  panic("ffs_alloc: bad size");
./ffs/ffs_alloc.c:  panic("ffs_alloc: missing credential");
./ffs/ffs_alloc.c:  panic("ffs_realloccg: bad size");
./ffs/ffs_alloc.c:  panic("ffs_realloccg: missing credential");
./ffs/ffs_alloc.c:  panic("ffs_realloccg: bad bprev");
./ffs/ffs_alloc.c:  panic("ffs_realloccg: bad blockno");
./ffs/ffs_alloc.c:  panic("ffs_realloccg: small buf");
./ffs/ffs_alloc.c:  panic("ffs_realloccg: bad optim");
./ffs/ffs_alloc.c:  panic("ffs_realloccg: small buf 2");
./ffs/ffs_alloc.c:  panic("ffs1_reallocblks: unallocated block 1");
./ffs/ffs_alloc.c:  panic("ffs1_reallocblks: non-logical cluster");
./ffs/ffs_alloc.c:  panic("ffs1_reallocblks: non-physical cluster %d", i);
./ffs/ffs_alloc.c:  panic("ffs1_reallocblk: start == end");
./ffs/ffs_alloc.c:  panic("ffs1_reallocblks: unallocated block 2");
./ffs/ffs_alloc.c:  panic("ffs1_reallocblks: alloc mismatch");
./ffs/ffs_alloc.c:  panic("ffs1_reallocblks: unallocated block 3");
./ffs/ffs_alloc.c:  panic("ffs2_reallocblks: unallocated block 1");
./ffs/ffs_alloc.c:  panic("ffs2_reallocblks: non-logical cluster");
./ffs/ffs_alloc.c:  panic("ffs2_reallocblks: non-physical cluster %d", i);
./ffs/ffs_alloc.c:  panic("ffs2_reallocblk: start == end");
./ffs/ffs_alloc.c:  panic("ffs2_reallocblks: unallocated block 2");
./ffs/ffs_alloc.c:  panic("ffs2_reallocblks: alloc mismatch");
./ffs/ffs_alloc.c:  panic("ffs2_reallocblks: unallocated block 3");
./ffs/ffs_alloc.c:  panic("ffs_valloc: dup alloc");
./ffs/ffs_alloc.c:  panic("ffs_clusteralloc: map mismatch");
./ffs/ffs_alloc.c:  panic("ffs_clusteralloc: allocated out of group");
./ffs/ffs_alloc.c:  panic("ffs_clusteralloc: lost block");
./ffs/ffs_alloc.c:  panic("ffs_nodealloccg: map corrupted");
./ffs/ffs_alloc.c:  panic("ffs_nodealloccg: block not in map");
./ffs/ffs_alloc.c:  panic("ffs_blkfree: bad size");
./ffs/ffs_alloc.c:  panic("ffs_blkfree: freeing free block");
./ffs/ffs_alloc.c:  panic("ffs_blkfree: freeing free frag");
./ffs/ffs_alloc.c:  panic("ffs_freefile: range: dev = 0x%x, ino = %d, fs = %s",
./ffs/ffs_alloc.c:  panic("ffs_freefile: freeing free inode");
./ffs/ffs_alloc.c:  panic("ffs_checkblk: bad size");
./ffs/ffs_alloc.c:  panic("ffs_checkblk: bad block %lld", (long long)bno);
./ffs/ffs_alloc.c:  panic("ffs_checkblk: partially free fragment");
./ffs/ffs_alloc.c: * It is a panic if a request is made to find a block if none are
./ffs/ffs_alloc.c:  panic("ffs_alloccg: map corrupted");
./ffs/ffs_alloc.c:  panic("ffs_alloccg: block not in map");
./ffs/ffs_balloc.c: panic("ffs1_balloc: blk too big");
./ffs/ffs_balloc.c: panic ("ffs1_balloc: ufs_bmaparray returned indirect block");
./ffs/ffs_balloc.c: panic("Could not unwind indirect block, error %d", r);
./ffs/ffs_balloc.c: panic("ffs2_balloc: block too big");
./ffs/ffs_balloc.c: panic("ffs2_balloc: ufs_bmaparray returned indirect block");
./ffs/ffs_balloc.c: panic("ffs2_balloc: unwind failed");
./ffs/ffs_inode.c:  panic("ffs_update: bad link cnt");
./ffs/ffs_inode.c:  panic("ffs_truncate: partial truncate of symlink");
./ffs/ffs_inode.c:  panic("ffs_truncate: newspace");
./ffs/ffs_inode.c:  panic("ffs_truncate1");
./ffs/ffs_inode.c:  panic("ffs_truncate2");
./ffs/ffs_inode.c:  panic("ffs_indirtrunc: bad buffer size");
./ffs/ffs_subr.c:__dead void panic(const char *, ...);
./ffs/ffs_subr.c:   panic("Disk buffer overlap");
./ffs/ffs_vfsops.c: panic("ffs_reload: dirty2");
./ffs/ffs_vfsops.c: panic("ffs_reload: dirty1");
./ffs/ffs_vfsops.c: panic("ffs_statfs");
./ffs/ffs_vfsops.c: panic("ffs_statfs");
./ffs/ffs_vfsops.c: panic("update: rofs mod");
./ffs/ffs_vfsops.c: panic("ffs_vget: alien ino_t %llu", (unsigned long long)ino);

dead disk

2014-01-26 Thread emigrant
Hi,

My Master machine is dead, exactly HDD(thank you God for CARP+pfsync) :).


root@master[/etc]wd0(pciide0:0:0): timeout
type: ata
c_bcount: 16384
c_skip: 0
pciide0:0:0: bus-master DMA error: missing interrupt, status=0x20
pciide0 channel 0: reset failed for drive 0
wd0a: device timeout writing fsbn 48851424 of 48851424-48851455 (wd0 bn
48851488; cn 3040 tn 220 sn 28), retrying
pciide0:0:0: not ready, st=0xd0<BSY,DRDY,DSC>, err=0x00
pciide0 channel 0: reset failed for drive 0
wd0a: device timeout writing fsbn 48851424 of 48851424-48851455 (wd0 bn
48851488; cn 3040 tn 220 sn 28), retrying
pciide0:0:0: not ready, st=0xd0<BSY,DRDY,DSC>, err=0x00
pciide0 channel 0: reset failed for drive 0
wd0a: device timeout writing fsbn 48851424 of 48851424-48851455 (wd0 bn
48851488; cn 3040 tn 220 sn 28)
/: got error 5 while accessing filesystem
panic: softdep_deallocate_dependencies: unrecovered I/O error
Stopped at  Debugger+0x4:   popl    %ebp
RUN AT LEAST 'trace' AND 'ps' AND INCLUDE OUTPUT WHEN REPORTING THIS PANIC!
DO NOT EVEN BOTHER REPORTING THIS WITHOUT INCLUDING THAT INFORMATION!
ddb
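
For readers decoding the log above: "error 5" is errno EIO (input/output
error), and the st=0xd0 value is the ATA status register, whose set bits are
the BSY, DRDY and DSC flags printed next to it. The sketch below is only an
illustration in Python, not kernel code. The bit table follows the classic ATA
status register layout, and the helper names decode_ata_status and
describe_errno are made up for this example.

```python
import errno
import os

# Classic ATA status register bits. The names are the conventional ones;
# bit 0x20 is called DF (device fault) or DWF (write fault) depending on
# the specification revision.
ATA_STATUS_BITS = [
    (0x80, "BSY"),   # device busy
    (0x40, "DRDY"),  # device ready
    (0x20, "DF"),    # device fault
    (0x10, "DSC"),   # device seek complete
    (0x08, "DRQ"),   # data request
    (0x04, "CORR"),  # corrected data (obsolete in later revisions)
    (0x02, "IDX"),   # index (obsolete in later revisions)
    (0x01, "ERR"),   # error
]


def decode_ata_status(status):
    """Return the names of the bits set in an ATA status byte."""
    return [name for mask, name in ATA_STATUS_BITS if status & mask]


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno value."""
    return errno.errorcode.get(code, "unknown"), os.strerror(code)


print(decode_ata_status(0xD0))
print(describe_errno(5))
```

A drive that keeps reporting BSY after a channel reset never accepts the
retried write, which is consistent with the repeated timeouts in the log.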



Re: dead disk

2014-01-26 Thread Philip Guenther
On Sun, Jan 26, 2014 at 11:40 AM, emigrant emig...@gmail.com wrote:
 My Master machine is dead, exactly HDD(thank you God for CARP+pfsync) :).

 root@master[/etc]wd0(pciide0:0:0): timeout
 type: ata
 c_bcount: 16384
 c_skip: 0
...
 /: got error 5 while accessing filesystem
 panic: softdep_deallocate_dependencies: unrecovered I/O error
 Stopped at  Debugger+0x4:   popl    %ebp
 RUN AT LEAST 'trace' AND 'ps' AND INCLUDE OUTPUT WHEN REPORTING THIS PANIC!
 DO NOT EVEN BOTHER REPORTING THIS WITHOUT INCLUDING THAT INFORMATION!
 ddb

This is a fundamental problem of softdeps: it can delay an operation to
a point where other operations depend on it in such a way that if
the I/O for that first operation fails, the dependent operations
cannot be undone and the failure propagated up safely.  Rather than
live a lie, it'll panic the system and die.

I don't know exactly which operations can lead to that; if you need to
know that you should go read the softdeps papers on Kirk McKusick's
site.


Philip Guenther
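


The failure mode described in the message above can be shown with a toy model.
This is a rough sketch in Python under loud assumptions: it is not the FFS
softdep code, and the class and method names (Disk, DelayedMetadata, update,
flush, FailedDisk) are invented for illustration. It shows only the ordering
problem: callers are told an update succeeded before it reaches the disk, so
when the later write fails there is no caller left to hand the error to.

```python
import errno


class FailedDisk(Exception):
    """Stands in for the kernel panic in this toy model."""


class Disk:
    """A fake disk that can be switched into a failing state."""

    def __init__(self):
        self.blocks = {}
        self.broken = False

    def write(self, key, value):
        if self.broken:
            raise OSError(errno.EIO, "device timeout")
        self.blocks[key] = value


class DelayedMetadata:
    """Acknowledge metadata updates first, write them to disk later."""

    def __init__(self, disk):
        self.disk = disk
        self.pending = []       # queued writes, in dependency order
        self.acknowledged = []  # operations whose callers already saw success

    def update(self, op, key, value):
        self.pending.append((key, value))
        self.acknowledged.append(op)
        return 0  # the caller is told the operation succeeded

    def flush(self):
        while self.pending:
            key, value = self.pending[0]
            try:
                self.disk.write(key, value)
            except OSError as err:
                # The callers listed in self.acknowledged returned long ago,
                # so the error cannot be propagated back to them.
                raise FailedDisk(
                    "unrecovered I/O error; already acknowledged: "
                    + ", ".join(self.acknowledged)
                ) from err
            self.pending.pop(0)
            self.acknowledged.pop(0)


disk = Disk()
fs = DelayedMetadata(disk)
fs.update("mkdir /a", "/a", "dir")
fs.update("create /a/f", "/a/f", "file")  # depends on /a existing
disk.broken = True  # the disk dies before anything was flushed
try:
    fs.flush()
except FailedDisk as exc:
    print(exc)
```

Whether the right reaction to that state is a panic, a forced read-only mount,
or a detach is exactly what the thread argues about; the sketch takes no side
and only raises an exception where the real code calls panic().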