Re: [zfs-discuss] Sonnet Tempo SSD supported?

2012-12-04 Thread Eugen Leitl
On Mon, Dec 03, 2012 at 06:28:17PM -0500, Peter Tripp wrote:
 Hi Eugen,
 
 Whether it's compatible entirely depends on the chipset of the SATA 
 controller.

This is what I was trying to find out. I guess I just have to 
test it empirically.
 
 Basically that card is just a dual-port 6 Gbps PCIe SATA controller with
 space to mount one ($149) or two ($299) 2.5-inch disks.  Sonnet, a Mac-focused
 company, offers it as a way to better utilize existing Mac Pros already in
 the field without an external box.  Mac Pros only have 3 Gbps SATA II and a
 4x 3.5-inch drive backplane, but nearly all have a free full-length PCIe slot.
 This product only makes sense if you're trying to run OpenIndiana on a Mac
 Pro, which in my experience is more trouble than it's worth, but to each
 their own, I guess.

My application is to put 2x SSDs into a SunFire X2100 M2 without
resorting to splicing into power cables and mounting the SSDs in
random locations with double-sided sticky tape. Depending on
hardware support I'll either run OpenIndiana or Linux with a ZFS
hybrid pool (the 2x SATA drives as a mirrored pool).
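
One plausible layout for such a hybrid pool, as a rough sketch only (the
device names are placeholders; the actual controller/target numbers on an
X2100 M2 with the Sonnet card will differ):

zpool create tank mirror c1t0d0 c1t1d0   # the 2x SATA drives as the mirrored main pool
zpool add tank cache c2t0d0              # one SSD as L2ARC read cache
zpool add tank log c2t1d0                # the other SSD as a separate ZIL (log) device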
 
 If you can confirm the chipset you might get lucky and have it be a supported 
 chip.  The big chip is labelled PLX, but I can't read the markings and wasn't 
 aware PLX made any PCIe SATA controllers (PCIe and USB/SATA bridges sure, but 
 not straight controllers) so that may not even be the chip we care about. 
 http://www.profil-marketing.com/uploads/tx_lipresscenter/Sonnet_Tempo_SSD_Pro_01.jpg

Either way, I'll know the hardware support situation soon
enough. 


Re: [zfs-discuss] How can I copy a ZFS filesystem back and forth

2012-12-04 Thread Anonymous
Thanks for the help, Chris!

Cheers,

Fritz

You wrote:

  original and rename the new one, or zfs send, or ?? Can I do a send and
  receive into a filesystem with attributes set as I want, or does the receive
  keep the same attributes as well? Thank you.
 
 That will work. Just create the new filesystem with the attributes you
 want and send/recv the latest snapshot. As the data is received, the
 gzip compression will be applied. Since the new filesystem already
 exists, you will have to do a zfs receive -Fv to force it.
 
 --chris
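
A minimal sketch of the steps Chris describes, using hypothetical dataset
names ('tank/old' is the existing filesystem, 'tank/new' the pre-created
gzip-compressed target):

zfs create -o compression=gzip tank/new        # new fs with the attributes you want
zfs snapshot tank/old@migrate                  # latest snapshot to transfer
zfs send tank/old@migrate | zfs receive -Fv tank/new   # -F needed since tank/new already exists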


Re: [zfs-discuss] Sonnet Tempo SSD supported?

2012-12-04 Thread Gary Driggs
On Dec 4, 2012, Eugen Leitl wrote:

 Either way I'll know the hardware support situation soon
 enough.

Have you tried contacting Sonnet?

-Gary


Re: [zfs-discuss] Sonnet Tempo SSD supported?

2012-12-04 Thread Eugen Leitl
On Tue, Dec 04, 2012 at 03:38:07AM -0800, Gary Driggs wrote:
 On Dec 4, 2012, Eugen Leitl wrote:
 
  Either way I'll know the hardware support situation soon
  enough.
 
 Have you tried contacting Sonnet?

No, but I did some digging. It *might* be a Marvell 88SX7042,
which would then be supported by Linux, but not by Solaris:
http://www.nexentastor.org/boards/1/topics/2383


Re: [zfs-discuss] Sonnet Tempo SSD supported?

2012-12-04 Thread Eugen Leitl
On Tue, Dec 04, 2012 at 11:07:17AM +0100, Eugen Leitl wrote:
 On Mon, Dec 03, 2012 at 06:28:17PM -0500, Peter Tripp wrote:
  Hi Eugen,
  
  Whether it's compatible entirely depends on the chipset of the SATA 
  controller.
 
 This is what I was trying to find out. I guess I just have to 
 test it empirically.
  
  Basically that card is just a dual-port 6 Gbps PCIe SATA controller with
  space to mount one ($149) or two ($299) 2.5-inch disks.  Sonnet, a
  Mac-focused company, offers it as a way to better utilize existing Mac Pros
  already in the field without an external box.  Mac Pros only have 3 Gbps
  SATA II and a 4x 3.5-inch drive backplane, but nearly all have a free
  full-length PCIe slot.  This product only makes sense if you're trying to
  run OpenIndiana on a Mac Pro, which in my experience is more trouble than
  it's worth, but to each their own, I guess.
 
 My application is to put 2x SSDs into a SunFire X2100 M2 without
 resorting to splicing into power cables and mounting the SSDs in
 random locations with double-sided sticky tape. Depending on
 hardware support I'll either run OpenIndiana or Linux with a ZFS
 hybrid pool (the 2x SATA drives as a mirrored pool).
  
  If you can confirm the chipset you might get lucky and have it be a 
  supported chip.  The big chip is labelled PLX, but I can't read the 
  markings and wasn't aware PLX made any PCIe SATA controllers (PCIe and 
  USB/SATA bridges sure, but not straight controllers) so that may not even 
  be the chip we care about. 
  http://www.profil-marketing.com/uploads/tx_lipresscenter/Sonnet_Tempo_SSD_Pro_01.jpg
 
 Either way, I'll know the hardware support situation soon
 enough.

I see a Marvell 88SE9182 on that Sonnet.


Re: [zfs-discuss] ZFS QoS and priorities

2012-12-04 Thread Richard Elling
On Nov 29, 2012, at 1:56 AM, Jim Klimov jimkli...@cos.ru wrote:

 I've heard a claim that ZFS relies too much on RAM caching, but
 implements no sort of priorities (indeed, I've seen no knobs to
 tune those) - so that if the storage box receives many different
 types of IO requests with different administrative weights in
 the view of the admins, it cannot really throttle some IOs to
 boost others when such IOs have to hit the pool's spindles.

Caching has nothing to do with QoS in this context. *All* modern
filesystems cache to RAM, otherwise they are unusable.

 
 For example, I might want to have corporate webshop-related
 databases and appservers to be the fastest storage citizens,
 then some corporate CRM and email, then various lower priority
 zones and VMs, and at the bottom of the list - backups.

Please read the papers on the ARC and how it deals with MFU and
MRU cache types. You can adjust these policies using the primarycache
and secondarycache properties at the dataset level.
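
For example (with a hypothetical dataset name), to keep a low-priority
dataset from competing for cache space:

zfs set primarycache=metadata tank/backups    # cache only its metadata in the ARC
zfs set secondarycache=none tank/backups      # keep its data out of the L2ARC entirely
zfs get primarycache,secondarycache tank/backups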

 
 AFAIK, now such requests would hit the ARC, then the disks if
 needed - in no particular order. Well, can the order be made
 particular with current ZFS architecture, i.e. by setting
 some datasets to have a certain NICEness or another priority
 mechanism?

ZFS has a priority-based I/O scheduler that works at the DMU level.
However, there is no system call interface in UNIX that transfers
priority or QoS information (e.g. read() or write()) into the file system VFS
interface. So the granularity of priority control is by zone or dataset.
 -- richard

--
richard.ell...@richardelling.com
+1-760-896-4422


Re: [zfs-discuss] Digging in the bowels of ZFS

2012-12-04 Thread Jim Klimov

On 2012-12-03 18:23, Jim Klimov wrote:

On 2012-12-02 05:42, Jim Klimov wrote:

So... here are some applied questions:


Well, I am ready to reply to a few of my own questions now :)


Continuing the desecration of my deceased files' resting grounds...


2) Do I understand correctly that for the offset definition, sectors
in a top-level VDEV (which is all of my pool) are numbered in rows
per-component disk? Like this:
  0  1  2  3  4  5
  6  7  8  9  10 11...

That is, offset % setsize = disknum?

If true, does such numbering scheme apply all over the TLVDEV,
so as for my block on a 6-disk raidz2 disk set - its sectors
start at (roughly rounded) offset_from_DVA / 6 on each disk,
right?

3) Then, if I read the ZFS on-disk spec correctly, the sectors of
the first disk holding anything from this block would contain the
raid-algo1 permutations of the four data sectors, sectors of
the second disk contain the raid-algo2 for those 4 sectors,
and the remaining 4 disks contain the data sectors?


My understanding was correct. For posterity: in the example set up
earlier I had an uncompressed 128KB block residing at the address
DVA[0]=0:590002c1000:3. Counting in my disks' 4KB sectors, this is
0x590002c1000/0x1000 = 0x590002C1, or logical offset 1493172929 into
TLVDEV number 0 (the only one in this pool).

Given that this TLVDEV is a 6-disk raidz2 set, my expected offset
on each component drive is 1493172929/6 = 248862154.83 (.83 = 5/6),
starting after the ZFS header (2 labels and a reservation, amounting
to 4MB = 1024 * 4KB sectors). So this block's allocation covers eight
4KB sectors starting at 248862154+1024 on disk 5 and at 248862155+1024
on disks 0, 1, 2, 3, 4.
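
The same arithmetic in shell form (the values match the skip= offsets
used in the dd commands below):

echo $((0x590002c1000 / 0x1000))   # 1493172929 - logical 4KB sector within the TLVDEV
echo $((1493172929 / 6))           # 248862154  - row within the 6-wide raidz2 set
echo $((1493172929 % 6))           # 5          - starting disk/column
echo $((248862154 + 1024))         # 248863178  - on-disk sector on disk 5 (after 4MB of labels)
echo $((248862155 + 1024))         # 248863179  - on-disk sector on disks 0-4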

As my further tests showed, the sector-columns (not rows, as I had
expected after reading the docs) from disks 1,2,3,4 do recombine into
the original userdata (the sha256 checksum matches), so disks 5 and 0
should hold the two parities - however those are calculated:

# for D in 1 2 3 4; do dd bs=4096 count=8 conv=noerror,sync \
  if=/dev/dsk/c7t${D}d0s0 of=b1d${D}.img skip=248863179; done
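
(Here skip=248863179 is the row offset 248862155 plus the 1024 label
sectors computed above; disk 5 would start one sector earlier, at 248863178.)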

# for D in 1 2 3 4; do for R in 0 1 2 3 4 5 6 7; do \
  dd if=/pool/test3/b1d${D}.img bs=4096 skip=$R count=1; \
  done; done > /tmp/d

Note that the latter can be greatly simplified into a cat, which
works to the same effect and is faster:
# cat /pool/test3/b1d?.img > /tmp/d
However, I kept the more verbose notation for use in later experiments.

That is, the original 128KB block was cut into 4 pieces (my 4 data
drives in the 6-disk raidz2 set), and each 32KB strip was stored
on a separate drive. Nice descriptive pictures in some presentations
had suggested to me that the original block is stored sector by sector,
rotating onto the next disk - the set of 4 data sectors plus 2 parity
sectors in my case being a single stripe for RAID purposes.
This directly suggested that incomplete stripes, such as the ends
of files or whole small files, would still have the two parity
sectors and just a handful of data sectors.

Reality differs.

For undersized allocations, i.e. of compressed data, it is possible
to see P-sizes (in 4KB sectors) that are not divisible by 4 (the number
of data disks); however, some sectors apparently do get wasted, because
the A-size in the DVA is divisible by 6*4KB. With columnar allocation
across the disks, it is easier to see why full stripes have to be used:

p1 p2 d1 d2 d3 d4
.  ,  1  5  9   13
.  ,  2  6  10  14
.  ,  3  7  11  x
.  ,  4  8  12  x

In this illustration a 14-sector-long block is saved, with X marking
the empty leftovers, which we can't really save on (as we could with
the other allocation scheme, which is likely less efficient in terms
of CPU and I/O).

The metadata blocks do have A-sizes of 0x3000 (2 parity + 1 data),
at least, which on 4KB-sectored disks is still quite a lot for these
miniature data objects - but not as sad as 6*4KB would have been ;)

It also seems that the instinctive desire to have raidzN sets of
4*M+N disks (i.e. 6-disk raidz2, 11-disk raidz3, etc.), which was
discussed over and over on the list a couple of years ago, may
still be valid with typical block sizes being powers of two -
even though the gurus said this should not matter much.
For IOPS - maybe not. For wasted space - likely...
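
A back-of-the-envelope illustration (not an exact accounting of ZFS's
allocation rounding): a 128KB block is 32 data sectors of 4KB, so

echo $((32 % 4))   # 0 - on a 6-disk raidz2 (4 data columns) the columns fill evenly
echo $((32 % 5))   # 2 - on a 7-disk raidz2 (5 data columns) a short row is left over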



I'm almost ready to go and test Q2 and Q3; however, the questions
regarding usable tools (and what data should be fed into such
tools?) are still on the table.


 Some OLD questions remain raised, just in case anyone answers them.

 3b) The redundancy algos should in fact cover other redundancy disks
 too (in order to sustain loss of any 2 disks), correct? (...)

 4) Where are the redundancy algorithms specified? Is there any simple
 tool that would recombine a given algo-N redundancy sector with
 some other 4 sectors from a 6-sector stripe in order to try and
 recalculate the sixth sector's contents? (Perhaps part of some
 unit tests?)

 7) Is there a command-line tool to do lzjb compressions and