[zfs-discuss] Zpool error in metadata:0x0

2012-12-08 Thread Jim Klimov

I've had this error on my pool for over a year now, since I first
posted and asked about it. The general consensus was that it is only
fixable by recreating the pool, and that if things don't die right
away, the problem may be benign (i.e. it lives in some of the first
blocks of the MOS which are in practice written once and not really
used nor relied upon afterwards).

In the detailed zpool status output this error shows up as:
  metadata:0x0

By analogy with other errors reported for unnamed files, this was
deemed to be the MOS dataset, object number 0.

Anyway, now that I am digging deeper into ZFS bowels (as detailed
in my other current thread), I've made a tool which can request
sectors which pertain to a given DVA and verify the XOR parity.
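
The parity check itself is nothing fancy; for the XOR (P) parity it
boils down to roughly the illustration below (a rough sketch, not the
actual tool - and note that raidz2's second (Q) parity uses entirely
different math, so it is not covered here):

  #!/usr/bin/perl
  # Rough illustration only: check the raidz P-parity (plain XOR) over
  # a set of data sectors already read from the member disks. Reading
  # the sectors and mapping a DVA onto disks/offsets is out of scope.
  use strict;
  use warnings;

  sub xor_parity_ok {
      my ($p_sector, @data_sectors) = @_;
      my $acc = "\0" x length($p_sector);
      $acc ^= $_ for @data_sectors;   # bitwise XOR of whole sector strings
      return $acc eq $p_sector;
  }

  # Tiny self-test with fake two-byte "sectors":
  my @data = ("\x01\x02", "\x04\x08");
  my $p    = "\x05\x0a";              # XOR of the two data "sectors"
  print xor_parity_ok($p, @data) ? "parity OK\n" : "parity MISMATCH\n";

(Perl's string XOR conveniently processes a whole sector at once.)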

With ZDB I've extracted what I believe to be the block-pointer tree
for this object. Since ZDB tries to dump the whole pool when no child
dataset is specified (I saw recently on-list that someone else ran
into this ZDB bug as well), I used a bit of perl magic:

# time zdb -d -bb -e 1601233584937321596 0 | \
  perl -e '$a=0; while (<>) { chomp; if ( /^Dataset mos/ ) { $a=1; }
  elsif ( /^Dataset / ) { $a=2; exit 0; }
  if ( $a == 1 ) { print "$_\n"; } }' > mos.txt

This gives me everything ZDB thinks is part of the MOS, up to the
start of the next Dataset dump:

Dataset mos [META], ID 0, cr_txg 4, 50.5G, 76355 objects,
rootbp DVA[0]=0:590df6a4000:3000 DVA[1]=0:8e4c636000:3000
DVA[2]=0:8107426b000:3000 [L0 DMU objset] fletcher4 lzjb LE
contiguous unique triple size=800L/200P birth=326429440L/326429440P
fill=76355 cksum=1042f7ae8a:63ab010a1de:138cbe92583cd:29e4cd03f544fe

Object  lvl   iblk   dblk  dsize  lsize   %full  type
     0    3    16K    16K  84.1M  80.2M   46.49  DMU dnode
dnode flags: USED_BYTES
dnode maxblkid: 5132
Indirect blocks:

   0 L2   DVA[0]=0:590df6a1000:3000 DVA[1]=
0:8e4c63:3000 DVA[2]=0:81074268000:3000 [L2 DMU dnode]
fletcher4 lzjb LE contiguous unique triple size=4000L/e00P
birth=326429440L/326429440P fill=76355
cksum=128bfcb12fe:237fe2ec55891:29135030da5c326:36973942bee30ba3

   0  L1  DVA[0]=0:590df69b000:6000 DVA[1]=
0:8fd76b8000:6000 DVA[2]=0:81074262000:6000 [L1 DMU dnode]
fletcher4 lzjb LE contiguous unique triple size=4000L/1200P
birth=326429440L/326429440P fill=1155
cksum=18d8d8f3e6c:3ab2b45afba95:57ad6e7efb1cb00:216c4680d8cb9644

   0   L0 DVA[0]=0:590df695000:3000 DVA[1]=
0:8e4c61e000:3000 DVA[2]=0:8107425c000:3000 [L0 DMU dnode]
fletcher4 lzjb LE contiguous unique triple size=4000L/c00P
birth=326429440L/326429440P fill=31
cksum=da94d97873:15b87afcb5388:15ac58fbe7745d6:2e083d8ef9f3c90
...
(for a total of 3572 block pointers)

I fed this list into my new verification tool, testing all the DVA
ditto copies, and it found no blocks with bad sectors - for every
block, the XOR parities and the checksums matched the sector or two
of data they cover.
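
For anyone wanting to repeat the exercise, scraping the DVAs out of
the saved dump is about as simple as the rough sketch below (an
illustration, not the actual tool; the script name is arbitrary):

  #!/usr/bin/perl
  # Rough illustration: pull DVA triplets (vdev:offset:asize, all hex)
  # out of a saved "zdb -d -bb" dump such as mos.txt, one per line,
  # skipping repeated references to the same ditto copy.
  use strict;
  use warnings;

  my %seen;
  while (my $line = <>) {
      while ($line =~ /DVA\[\d\]=([0-9a-fA-F]+):([0-9a-fA-F]+):([0-9a-fA-F]+)/g) {
          my $dva = "$1:$2:$3";
          print "$dva\n" unless $seen{$dva}++;
      }
  }

Saved as, say, extract-dvas.pl, it would be run as:
  perl extract-dvas.pl mos.txt > dva-list.txt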

So, given that there are no on-disk errors in Dataset mos [META],
ID 0, Object #0 - what does zpool scrub find time after time and
report as an error in metadata:0x0?

Thanks,
//Jim Klimov



Re: [zfs-discuss] Userdata physical allocation in rows vs. columns WAS Digging in the bowels of ZFS

2012-12-08 Thread Jim Klimov

For those who have work to do and can't be bothered to read the
detailed context, please do scroll down to the APPLIED QUESTION below
about a possible project to implement a better on-disk layout of
blocks. The busy experts' opinions are highly regarded here. Thanks ;) //Jim

CONTEXT AND SPECULATION

Well, now that I've mostly completed building my tool to locate,
extract from disk and verify the sectors related to any particular
block, I can state with certainty: data sector numbering is columnar,
as depicted in my recent mails (quoted below), not row-based as I had
believed earlier - even though rows would be more compact to store.
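
To make the difference concrete, here is a toy sketch (my own
notation, not ZFS code) of where the Nth userdata sector of a block
lands under each ordering, for 4 data disks; compare with the
p1/p2/d1..d4 diagram quoted further below:

  #!/usr/bin/perl
  # Toy illustration (not ZFS code): columnar vs. row-major placement
  # of userdata sectors across 4 data disks; parity columns omitted.
  use strict;
  use warnings;

  my $ndata = 4;                                     # data disks in the set
  my $nsect = 14;                                    # data sectors in the block
  my $rows  = int(($nsect + $ndata - 1) / $ndata);   # 4 rows for 14 sectors

  for my $i (0 .. $nsect - 1) {
      my ($c_disk, $c_row) = (int($i / $rows), $i % $rows);    # columnar
      my ($r_disk, $r_row) = ($i % $ndata, int($i / $ndata));  # row-major
      printf "sector %2d: columnar d%d/row%d   row-major d%d/row%d\n",
          $i + 1, $c_disk + 1, $c_row, $r_disk + 1, $r_row;
  }

The columnar mapping reproduces the d1..d4 columns of the quoted
diagram exactly; the row-major one is what I had assumed before.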

Columns do make certain sense, but they also lead to more wasted
space than would otherwise be possible - and I'm not sure that
allocation in rows would really be slower to write or read, especially
since the HDD cache would coalesce requests to neighboring sectors -
be they a contiguous quarter of my block's physical data or a series
of every fourth sector of it. Row allocation would likely be more
complex to code and comprehend, and might even require more CPU cycles
to account for sizes properly (IF today we just quickly allocate
columns of the same size - I skimmed over vdev_raidz.c, but did not
look into this detail).
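
From that skimming, my understanding of the size accounting is roughly
the paraphrase below (an approximation from memory, not the
authoritative code); the final round-up to a multiple of nparity+1
sectors would also explain the divisible-by-3 allocations I mention
in the quote below:

  #!/usr/bin/perl
  # Paraphrase (from memory, not authoritative) of the asize arithmetic
  # in vdev_raidz.c: psize becomes a sector count, parity sectors are
  # added per data row, and the total is rounded up to a multiple of
  # (nparity + 1) sectors.
  use strict;
  use warnings;

  sub raidz_asize {
      my ($psize, $ashift, $ncols, $nparity) = @_;
      my $ndata   = $ncols - $nparity;
      my $sectors = int(($psize - 1) / (1 << $ashift)) + 1;          # data sectors
      $sectors += $nparity * int(($sectors + $ndata - 1) / $ndata);  # parity rows
      my $r = $nparity + 1;
      $sectors = int(($sectors + $r - 1) / $r) * $r;                 # pad allocation
      return $sectors * (1 << $ashift);
  }

  # 14 data sectors (0xe000 bytes) on a 6-wide raidz2 with 4KB sectors:
  printf "asize = 0x%x (%d sectors)\n",
      raidz_asize(0xe000, 12, 6, 2), raidz_asize(0xe000, 12, 6, 2) >> 12;

For the 14-sector example in the quoted diagram below this yields
24 sectors - 14 data + 8 parity + 2 padding - i.e. exactly four full
6-wide rows.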

Saving 1-2 sectors from allocations which are some 10-30 sectors
long altogether is IMHO a percentage of savings worth worrying and
bothering about, especially with the compression-related paradigm
that our CPUs are slackers with nothing better to do. ZFS overhead
on 4K-sectored disks is pretty expensive already, so I see little
need to feed it extra desserts too ;)

APPLIED QUESTION:

If one were to implement a different sector allocator (rows with a
more precise cutoff vs. columns as they are today) and expose it as
a zfs property that can be set by users (or testing developers),
would it make sense to call it a compression mode (in current terms)
and use a bit from that field? Or would the GRID bits be more
properly used for this?

I am not sure that feature flags are the proper mechanism for this,
except to protect against import and interpretation of such datasets
and pools on incompatible (older) implementations - the allocation
layout is likely going to be an attribute applied to each block at
write time and noted in blkptr_t, like the checksums and compression,
but it would only apply to raidzN.

AFAIK, the contents of userdata sectors and their ordering don't even
matter to ZFS layers until decompression - parities and checksums just
apply to prepared bulk data...


//Jim Klimov

On 2012-12-06 02:08, Jim Klimov wrote:

On 2012-12-05 05:52, Jim Klimov wrote:

For undersized allocations, i.e. of compressed data, it is possible
to see P-sizes (in 4KB sectors) not divisible by 4 (the number of
data disks); however, some sectors do apparently get wasted, because
the A-size in the DVA is divisible by 6*4KB. With columnar allocation
across the disks, it is easier to see why full stripes have to be used:

p1 p2 d1 d2 d3 d4
.  ,  1  5  9   13
.  ,  2  6  10  14
.  ,  3  7  11  x
.  ,  4  8  12  x

In this illustration a 14-sector-long block is saved, with 'x' being
the empty leftover sectors, on which we can't really save (as would
be possible with the other, row-based allocation - which is likely
less efficient for CPU and IOs).


Getting more and more puzzled by this... I have seen DVA values
matching both theories now...

Interestingly, all the allocations I looked over involved a number
of sectors divisible by 3 - rounding to half of my 6-disk RAID set.
Is that merely a coincidence, or some means of balancing IOs?

...

I did not yet research where exactly the unused sectors are
allocated - vertically on the last strip, as in yesterday's
depiction quoted above, or horizontally across several disks -
but now that I know about this, it really bothers me as wasted
space with no apparent gain. I mean, the raidz code does tricks
to ensure that parities are located on different disks, so that
in normal conditions the userdata sector reads land on all disks
in a uniform manner. Why forfeit that natural rotation just because
P-sizes are smaller than a multiple of the number of data disks?

...

In short: can someone explain the rationale - why are allocations
done the way they are now, and should this be discussed as a bug,
or rationalized as a feature?





Re: [zfs-discuss] Zpool error in metadata:0x0

2012-12-08 Thread Ian Collins

Jim Klimov wrote:

I've had this error on my pool for over a year now, since I first
posted and asked about it. The general consensus was that it is only
fixable by recreating the pool, and that if things don't die right
away, the problem may be benign (i.e. it lives in some of the first
blocks of the MOS which are in practice written once and not really
used nor relied upon afterwards).

In the detailed zpool status output this error shows up as:
metadata:0x0

By analogy with other errors reported for unnamed files, this was
deemed to be the MOS dataset, object number 0.


Unlike you, I haven't had the time or patience to dig deeper into
this!  The only times I have seen this error are in iSCSI pools, when
the target machine's pool became full, causing bizarre errors in the
iSCSI client pools.


Once the underlying problem was fixed and the pools imported and 
exported, the error went away.


This might enable you to recreate the error for testing.

--
Ian.