Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 and 1TB Seagate Barracuda ES.2

2010-12-21 Thread Marc Bevand
Richard Jacobsen richard at unixboxen.net writes:
 
 Hi all,
 
 I'm getting a very strange problem with a recent OpenSolaris b134 install.
 
 System is:
 Supermicro X5DP8-G2 BIOS 1.6a
 2x Supermicro AOC-SAT2-MV8 1.0b

As Richard pointed out, this is a bug in the AOC-SAT2-MV8 firmware 1.0b.
The firmware incorrectly associates itself with other Marvell SATA controllers
in the system, based on PCI IDs, causing all sorts of strange issues:

http://opensolaris.org/jive/message.jspa?messageID=254150#254150

I reverse engineered the firmware and patched 2 bytes to fix the
bug in my case. But I would recommend simply exchanging your cards
for newer ones running a firmware labelled 3.x.x.x. I don't think
Richard's suggestions will work (disabling AHCI or PnP): even though
that may disable onboard SATA controllers contributing to the PCI
ID confusion, you have 2 x AOC-SAT2-MV8, so I suspect the two firmware
instances may not correctly associate themselves with their own card
(e.g. both might attempt to initialize the same card during POST, leaving
the other uninitialized).
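
For anyone who wants to confirm this kind of PCI ID clash before swapping
cards, a quick check (purely illustrative, using stock Solaris tools) is to
list every Marvell device in the system and compare its device ID against the
list the 1.0b firmware scans for (5040 5041 5080 5081 6041 6042 6081 7042):

  $ /usr/X11/bin/scanpci | grep -i 0x11ab     # 0x11ab = Marvell vendor ID
  $ prtconf -pv | grep -i pci11ab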

-mrb

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] OCZ Vertex 2 Pro performance numbers

2010-09-12 Thread Marc Bevand
(I am aware I am replying to an old post...)

Arne Jansen sensille at gmx.net writes:
 
 Now the test for the Vertex 2 Pro. This was fun.
 For more explanation please see the thread "Crucial RealSSD C300 and cache
 flush?"
 This time I made sure the device is attached via 3GBit SATA. This is also
 only a short test. I'll retest after some weeks of usage.
 
 cache enabled, 32 buffers, 64k blocks
 linear write, random data: 96 MB/s
 linear read, random data: 206 MB/s
 linear write, zero data: 234 MB/s
 linear read, zero data: 255 MB/s

This discrepancy between tests with random data and zero data is puzzling
to me. Does this suggest that the SSD does transparent compression between
its Sandforce SF-1500 controller and the NAND flash chips?

 cache enabled, 32 buffers, 4k blocks
 random write, random data: 41 MB/s (10300 ops/s)
 random read, random data: 76 MB/s (19000 ops/s)
 random write, zero data: 54 MB/s (13800 ops/s)
 random read, zero data: 91 MB/s (22800 ops/s)

These IOPS numbers are significantly below the 50k IOPS announced
by OCZ. But I suppose this is due to your benchmark tool not aligning
the ops to 4K boundaries, correct?

-mrb

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] OCZ Vertex 2 Pro performance numbers

2010-09-12 Thread Marc Bevand
Marc Bevand m.bevand at gmail.com writes:
 
 This discrepancy between tests with random data and zero data is puzzling
 to me. Does this suggest that the SSD does transparent compression between
 its Sandforce SF-1500 controller and the NAND flash chips?

Replying to myself: yes, SF-1500 does transparent deduplication
and compression to reduce write-amplification. Wow.

http://www.semiaccurate.com/2010/05/03/sandforce-ssds-break-tpc-c-records/

-mrb


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SATA 6G controller for OSOL

2010-07-10 Thread Marc Bevand
Graham McArdle graham.mcardle at ccfe.ac.uk writes:
 
 This thread from Marc Bevand and his blog linked therein might have some
 useful alternative suggestions.
 http://opensolaris.org/jive/thread.jspa?messageID=480925
 I've bookmarked it because it's quite a handy summary and I hope he keeps
 updating it with new info

Yes I will!

-mrb

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Ideal SATA/SAS Controllers for ZFS

2010-05-27 Thread Marc Bevand
On Wed, May 26, 2010 at 6:09 PM, Giovanni Tirloni gtirl...@sysdroid.com wrote:
 On Wed, May 26, 2010 at 9:22 PM, Brandon High bh...@freaks.com wrote:

 I'd wager it's the PCIe x4. That's about 1000MB/s raw bandwidth, about
 800MB/s after overhead.

 Makes perfect sense. I was calculating the bottlenecks using the
 full-duplex bandwidth and it wasn't apparent the one-way bottleneck.

Actually both of you guys are wrong :-)

The Supermicro X8DTi mobo and LSISAS9211-4i HBA are both PCIe 2.0 compatible,
so the max theoretical PCIe x4 throughput is 4GB/s aggregate, or 2GB/s in each
direction, well above the 800MB/s bottleneck observed by Giovanni.

This bottleneck is actually caused by the backplane: Supermicro E1 chassis
like Giovanni's (SC846E1) use a SAS expander backplane that degrades
performance by putting 6 disks behind each 3Gbps link.

A single 3Gbps link provides in theory 300MB/s usable after 8b-10b encoding,
but practical throughput numbers are closer to 90% of this figure, or 270MB/s.
6 disks per link means that each disk gets allocated 270/6 = 45MB/s.

So with 18 disks striped, this gives a max usable throughput of 18*45 = 810MB/s,
which matches exactly what Giovanni observed. QED!
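
If you want to see this ceiling for yourself, a simple sketch (file and device
names are placeholders) is to stream from the pool while watching per-disk
throughput:

  $ dd if=/tank/somebigfile of=/dev/null bs=1024k &
  $ iostat -Mnx 5
  (each disk behind a congested link should plateau around 45MB/s in the
  Mr/s column instead of the ~100MB/s the drive can do on its own)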

-mrb
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Ideal SATA/SAS Controllers for ZFS

2010-05-26 Thread Marc Bevand
Hi,

Brandon High bhigh at freaks.com writes:
 
 I only looked at the Megaraid  that he mentioned, which has a PCIe
 1.0 4x interface, or 1000MB/s.

You mean x8 interface (theoretically plugged into that x4 slot below...)

 The board also has a PCIe 1.0 4x electrical slot, which is 8x
 physical. If the card was in the PCIe slot furthest from the CPUs,
 then it was only running 4x.

If Giovanni had put the Megaraid  in this slot, he would have seen
an even lower throughput, around 600MB/s:

This slot is provided by the ICH10R which, as you can see in
http://www.supermicro.com/manuals/motherboard/5500/MNL-1062.pdf
is connected to the northbridge through a DMI link, an Intel-
proprietary PCIe 1.0 x4 link. The ICH10R supports a Max_Payload_Size
of only 128 bytes on the DMI link:
http://www.intel.com/Assets/PDF/datasheet/320838.pdf
And as per my experience:
http://opensolaris.org/jive/thread.jspa?threadID=54481&tstart=45
a 128-byte MPS allows using just about 60% of the theoretical PCIe
throughput, that is, for the DMI link: 250MB/s * 4 lanes * 60% = 600MB/s.
Note that the PCIe x4 slot supports a larger, 256-byte MPS but this is
irrelevant as the DMI link will be the bottleneck anyway due to the
smaller MPS.

  A single 3Gbps link provides in theory 300MB/s usable after 8b-10b encoding,
  but practical throughput numbers are closer to 90% of this figure, or 270MB/s.
  6 disks per link means that each disk gets allocated 270/6 = 45MB/s.
 
 ... except that a SFF-8087 connector contains four 3Gbps connections.

Yes, four 3Gbps links, but 24 disks per SFF-8087 connector. That's
still 6 disks per 3Gbps link (according to Giovanni, his LSI HBA was
connected to the backplane with a single SFF-8087 cable).

 It may depend on how the drives were connected to the expander. You're
 assuming that all 18 are on 3 channels, in which case moving drives
 around could help performance a bit.

True, I assumed this and, frankly, this is probably what he did by
using adjacent drive bays... A more optimal solution would be to spread
the 18 drives in a 5+5+4+4 config so that the 2 most congested 3Gbps
links are shared by only 5 drives, instead of 6, which would boost the
throughput by 6/5 = 1.2x. That would change my first overall 810MB/s
estimate to 810*1.2 = 972MB/s.

PS: it was not my intention to start a pissing contest. Peace!

-mrb

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Ideal SATA/SAS Controllers for ZFS

2010-05-19 Thread Marc Bevand
Deon Cui deon.cui at gmail.com writes:
 
 So I had a bunch of them lying around. We've bought a 16x SAS hotswap
 case and I've put in an AMD X4 955 BE with an ASUS M4A89GTD Pro as
 the mobo.
 
 In the two 16x PCI-E slots I've put in the 1068E controllers I had
 lying around. Everything is still being put together and I still
 haven't even installed opensolaris yet but I'll see if I can get
 you some numbers on the controllers when I am done.

This is a well-architected config with no bottlenecks on the PCIe
links to the 890GX northbridge or on the HT link to the CPU. If you
run 16 concurrent "dd if=/dev/rdsk/c?t?d?p0 of=/dev/null bs=1024k" and
assuming your drives can do ~100MB/s sustained reads at the
beginning of the platter, you should literally see an aggregate
throughput of ~1.6GB/s...
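
For instance, something along these lines (a sketch only; the device names are
placeholders for your 16 disks) launches the reads in parallel while you watch
the aggregate with iostat -Mnx 5 in another terminal:

  for d in c1t0d0 c1t1d0 c1t2d0 c1t3d0     # ...list all 16 disks here
  do
      dd if=/dev/rdsk/${d}p0 of=/dev/null bs=1024k count=4096 &
  done
  wait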

-mrb

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Ideal SATA/SAS Controllers for ZFS

2010-05-18 Thread Marc Bevand
The LSI SAS1064E slipped through the cracks when I built the list.
This is a 4-port PCIe x8 HBA with very good Solaris (and Linux)
support. I don't remember having seen it mentioned on zfs-discuss@
before, even though many were looking for 4-port controllers. Perhaps
the fact that it is priced too close to 8-port models explains why it is
relatively overlooked. That said, the wide x8 PCIe link makes it the
*cheapest* controller able to feed 300-350MB/s to at least 4 ports
concurrently. Now added to my list.

-mrb

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Ideal SATA/SAS Controllers for ZFS

2010-05-18 Thread Marc Bevand
Marc Nicholas geekything at gmail.com writes:
 
 Nice write-up, Marc. Aren't the SuperMicro cards their funny UIO form
 factor? Wouldn't want someone buying a card that won't work in a standard
 chassis.

Yes, 4 of the 6 Supermicro cards are UIO cards. I added a warning about it.
Thanks.

-mrb

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Ideal SATA/SAS Controllers for ZFS

2010-05-18 Thread Marc Bevand
Thomas Burgess wonslung at gmail.com writes:
 
 A really great alternative to the UIO cards for those who don't want the
 headache of modifying the brackets or cases is the Intel SASUC8I
 
 This is a rebranded LSI SAS3081E-R
 
 It can be flashed with the LSI IT firmware from the LSI website and
 is physically identical to the LSI card.  It is really the exact same
 card, and typically around 140-160 dollars.

The SASUC8I is already in my list. In fact I bought one last week. I
did not need to flash its firmware though - drives were used in JBOD
mode by default.

-mrb

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Ideal SATA/SAS Controllers for ZFS

2010-05-15 Thread Marc Bevand
I have done quite a bit of research over the past few years on the best (ie. 
simple, robust, inexpensive, and performant) SATA/SAS controllers for ZFS, 
especially in terms of throughput analysis (many of them are designed with an 
insufficient PCIe link width). I have seen many questions on this list about 
which one to buy, so I thought I would share my knowledge: 
http://blog.zorinaq.com/?e=10 Very briefly:

- The best 16-port one is probably the LSI SAS2116, 6Gbps, PCIe (gen2) x8. 
Because it is quite pricey, it's probably better to buy 2 8-port controllers.
- The best 8-port is the LSI SAS2008 (faster, more expensive) or SAS1068E 
(150MB/s/port should be sufficient).
- The best 2-port is the Marvell 88SE9128 or 88SE9125 or 88SE9120, because PCIe 
gen2 allows a throughput of at least 300MB/s on the PCIe link even with 
Max_Payload_Size=128. And this one is particularly cheap ($35). AFAIK this is 
the _only_ controller on the market that allows 2 drives to not bottleneck 
an x1 link.

I hope this helps ZFS users here!

-mrb

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] mirror resilver @500k/s

2010-05-14 Thread Marc Bevand
Oliver Seidel osol at os1.net writes:
 
 Hello,
 
 I'm a grown-up and willing to read, but I can't find where to read.
 Please point me to the place that explains how I can diagnose this
 situation: adding a mirror to a disk fills the mirror with an
 apparent rate of 500k per second.

I don't know where to point you, but I know that iostat -nx 1
(not to be confused with zpool iostat) can often give you enough
information. Send us its output over a period of at least 10 sec.
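
For example (a minimal sketch), capture a dozen samples while the slow
attach/resilver is in progress and attach the resulting file:

  $ iostat -nx 1 12 > iostat-resilver.txt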

-mrb

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Intel SASUC8I - worth every penny

2010-03-12 Thread Marc Bevand
Russ Price rjp_sun at fubegra.net writes:
 
 I had recently started setting up a homegrown OpenSolaris NAS with
 a large RAIDZ2 pool, and had found its RAIDZ2 performance severely
 lacking - more like downright atrocious. As originally set up:
 
 * Asus M4A785-M motherboard
 * Phenom II X2 550 Black CPU
 * JMB363-based PCIe X1 SATA card (2 ports)
 * SII3132-based PCIe X1 SATA card (2 ports)
 * Six on-board SATA ports

Did you enable AHCI mode on _every_ SATA controller?

I have the exact opposite experience with 2 of your 3 types of
controllers. I have built various ZFS storage servers with 6-12 drives
each, using onboard SB600/SB700 and SiI3132 controllers and have
always succeeded in getting outstanding I/O throughput by enabling
AHCI mode. For example one of my machines gets 400+MB/s sequential
read throughput from a 7-drive raidz pool (2 drives on SiI3132, 1 on
SiI3124, 4 on onboard SB700).

I have never tested the JMB363 though, so maybe it was the culprit
in your setup?

-mrb

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Intel SASUC8I - worth every penny

2010-03-12 Thread Marc Bevand
Russ Price rjp_sun at fubegra.net writes:
 
  Did you enable AHCI mode on _every_ SATA controller?
  
  I have the exact opposite experience with 2 of your 3
  types of controllers.
 
 It wasn't possible to do so, and that also made me think that a real HBA
 would work better. First off, with the AMD SB700/SB800 on-board ports, if I
 set the last two ports to AHCI mode, the BIOS doesn't even see drives there,
 and neither does OpenSolaris; the first four ports work fine in AHCI. The
 JMicron board came up in AHCI mode; it never, ever presents a BIOS of its
 own to change configuration. The Silicon Image board (one from SIIG) doesn't
 have an AHCI mode in its BIOS.

Ok, so the lack of AHCI on the onboard SBxxx ports is very likely what was
causing your performance issues. Legacy IDE mode is significantly slower. Sounds
like you hit bugs in your motherboard BIOS that prevented you from detecting
drives while in AHCI mode...

(You are right that the SiI3132 doesn't support AHCI; however it is a FIS-based
controller with a hardware interface very similar in design to AHCI, so it
does offer great performance out of the box.)

IMHO the best 2-port PCIe x1 controller is the Marvell 88SE9128, which is
AHCI compliant. I like it not because it supports SATA 6.0Gbps, but because it
supports PCIe 5GT/s. People often believe that a PCIe 2.5GT/s x1 device can do
250MB/s, but this is only achievable with a large Max_Payload_Size. In practice
MPS is often 128 bytes, which limits such devices to about 60% of the max
throughput, or 150MB/s. Given that 2 drives can easily sustain a read throughput
of 200-250MB/s, PCIe 5GT/s comes in handy by allowing about 300MB/s with MPS=128
(500MB/s theoretical).

-mrb

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread Marc Bevand
Bob Friesenhahn bfriesen at simple.dallas.tx.us writes:
 [...]
 X25-E's write cache is volatile), the X25-E has been found to offer a 
 bit more than 1000 write IOPS.

I think this is incorrect. On paper the X25-E offers 3300 random write
4kB IOPS (and Intel is known to be very conservative about the IOPS perf 
numbers they publish). Dumb storage IOPS benchmark tools that don't issue 
parallel I/O ops to the drive tend to report numbers less than half the 
theoretical IOPS. This would explain why you see only 1000 IOPS.

I have direct evidence of this with Intel's MLC line of SSD drives, the 
X25-M: 35000 random read 4kB IOPS theoretical; 1 instance of a private 
benchmarking tool measures 6000, while 10+ instances of this tool measure 
37000 IOPS (slightly better than the theoretical max!)

-mrb

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread Marc Bevand
Bob Friesenhahn bfriesen at simple.dallas.tx.us writes:
 
 The Intel specified random write IOPS are with the cache enabled and 
 without cache flushing.

For random write I/O, caching improves I/O latency, not sustained I/O 
throughput (which is what random write IOPS usually refers to). So Intel can't 
cheat with caching. However they can cheat by benchmarking a brand new drive 
instead of an aged one.

 They also carefully only use a limited span 
 of the device, which fits most perfectly with how the device is built. 

AFAIK, for the X25-E series, they benchmark random write IOPS on a 100% span. 
You may be confusing it with the X25-M series, for which they actually clearly 
disclose two performance numbers: 3.3k random write IOPS on an 8GB span, and 
350 on a 100% span. See 
http://www.intel.com/cd/channel/reseller/asmo-na/eng/products/nand/tech/425265.htm

I agree with the rest of your email.

-mrb

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Would ZFS work for a high-bandwidth video SAN?

2009-09-30 Thread Marc Bevand
Frank Middleton f.middleton at apogeect.com writes:
 
 As noted in another thread, 6GB is way too small. Based on
 actual experience, an upgradable rpool must be more than
 20GB.

It depends on how minimal your install is.

The OpenSolaris install instructions recommend 8GB minimum, and I have
one OpenSolaris 2009.06 server using about 4GB, so I thought 6GB
would be sufficient. That said, I have never upgraded the rpool of
this server, but based on your comments I would recommend an rpool
of 15GB to the original poster.

-mrb

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Would ZFS work for a high-bandwidth video SAN?

2009-09-29 Thread Marc Bevand
Richard Connamacher rich at indieimage.com writes:
 
 I was thinking of custom building a server, which I think I can do for
 around $10,000 of hardware (using 45 SATA drives and a custom enclosure),
 and putting OpenSolaris on it. It's a bit of a risk compared to buying a
 $30,000 server, but would be a fun experiment.

Do you have a $2k budget to perform a cheap experiment?

Because for this amount of money you can build the following server that has
10TB of usable storage capacity, and that would be roughly able to sustain
sequential reads between 500MByte/s and 1000MByte/s over NFS over a Myricom
10GbE NIC. This is my estimation. I am less sure about sequential writes:
I think this server would be capable of at least 250-500 MByte/s.

$150 - Mobo with onboard 4-port AHCI SATA controller (eg. any AMD 700
  chipset), and at least two x8 electrical PCI-E slots
$200 - Quad-core Phenom II X4 CPU + 4GB RAM
$150 - LSISAS1068E 8-port SAS/SATA HBA, PCI-E x8
$500 - Myri-10G NIC (10G-PCIE-8B-C), PCI-E x8
$1000 - 12 x 1TB SATA drives (4 on onboard AHCI, 8 on LSISAS1068E)

- It is important to choose an AMD platform because the PCI-E lanes
  will always come from the northbridge chipset which is connected
  to the CPU via an HT 3.0 link. On Intel platforms, the DMI link
  between the ICH and MCH will be a bottleneck if the mobo gives
  you PCI-E lanes from the MCH (in my experience, this is the case
  of most desktop mobos).
- Make sure you enable AHCI in the BIOS.
- Configure the 12 drives as striped raidz vdevs:
  zpool create mytank raidz d0 d1 d2 d3 d4 d5 raidz d6 d7 d8 d9 d10 d11
- Buy drives able to sustain 120-130 MByte/s of sequential reads at the
  beginning of the platter (my recommendation: Seagate 7200.12); this
  way your 4Gbit/s requirement will be met even in the worst case when
  reading from the end of the platters.
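
To make the pool-creation step concrete, a hedged sketch (the cXtYdZ names are
placeholders for however your 12 drives enumerate):

  $ zpool create mytank \
        raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c2t0d0 c2t1d0 \
        raidz c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0
  $ zfs set sharenfs=on mytank     # export it over NFS to the 10GbE clients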

Thank me for saving you $28k :-) The above experiment would be a way
to validate some of your ideas before building a 45-drive server...

-mrb


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Would ZFS work for a high-bandwidth video SAN?

2009-09-29 Thread Marc Bevand
Richard Connamacher rich at indieimage.com writes:

 
 Also, one of those drives will need to be the boot drive.
 (Even if it's possible I don't want to boot from the
 data dive, need to keep it focused on video storage.)

But why?

By allocating 11 drives instead of 12 to your data pool, you will reduce the
max sequential I/O throughput by approximately 10%, which is significant...
If I were you I would format every 1.5TB drive like this:
* 6GB slice for the root fs
* 1494GB slice for the data fs
And create an N-way mirror for the root fs with N in [2..12].

I would rather lose 6/1500 = 0.4% of storage capacity than lose 10% of
I/O throughput.

I/O activity on the root fs will be insignificant and will have zero
perf impact on your data pool.
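
A hedged sketch of what this could look like once each drive carries a small
s0 slice and a large s1 slice (slice and device names are placeholders, and the
root pool is assumed to already live on c1t0d0s0):

  $ zpool create videotank \
        raidz c1t0d0s1 c1t1d0s1 c1t2d0s1 c1t3d0s1 c1t4d0s1 c1t5d0s1 \
        raidz c1t6d0s1 c1t7d0s1 c1t8d0s1 c1t9d0s1 c1t10d0s1 c1t11d0s1
  $ zpool attach rpool c1t0d0s0 c1t1d0s0   # add root mirror sides one at a time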

-mrb


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Lightning SSD with 180,000 IOPs, 320MB/s writes

2009-09-15 Thread Marc Bevand
Neal Pollack Neal.Pollack at Sun.COM writes:
 
 Pliant Technologies just released two Lightning high performance
 enterprise SSDs that threaten to blow away the competition.

One can build an SSD-based storage device that gives you:
o 320GB of storage capacity (2.1x better than their 2.5" model: 150GB)
o 1000 MB/s sequential reads (2.4x better than their 2.5" model: 420MB/s)
o 280 MB/s sequential writes (1.3x better than their 2.5" model: 220MB/s)
o 140k random 4kB read IOPS (1.2x better than their 2.5" model: 120k)
o 26k random 4kB write IOPS (Pliant doesn't document it)
o at a price of $920 (half the MINIMUM price hinted at by Pliant)

This device is a ZFS stripe of 4 80GB Intel 34nm MLC devices ($230 each).
Now, the acute reader will observe that:
o Pliant's device is SLC, mine is MLC (shorter life - but so cheap it can be
  replaced cheaply)
o The Pliant specs I quote above are from their website; some press releases
  quote slightly higher numbers
o Pliant's device fits in a single 2.5" bay, mine requires 4
o Pliant doesn't quote random 4kB *write* IOPS performance - if I were a
  potential buyer, I would ask them before buying

As a side note, I personally measure 15k random 4kB write IOPS on my Intel 34nm
MLC 80GB drive whereas Intel's official number is 6.6k - they probably give a
pessimistic number representing the performance of the drive after having been
aged.
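
For reference, the "device" above is nothing more exotic than a plain 4-way
stripe (device names are placeholders; no redundancy, so any single SSD failure
loses the pool):

  $ zpool create fastpool c3t0d0 c3t1d0 c3t2d0 c3t3d0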

-mrb


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Petabytes on a budget - blog

2009-09-04 Thread Marc Bevand
Bill Moore Bill.Moore at sun.com writes:
 
 Moving on, modern high-capacity SATA drives are in the 100-120MB/s
 range.  Let's call it 125MB/s for easier math.  A 5-port port multiplier
 (PM) has 5 links to the drives, and 1 uplink.  SATA-II speed is 3Gb/s,
 which after all the framing overhead, can get you 300MB/s on a good day.
 So 3 drives can more than saturate a PM.  45 disks (9 backplanes at 5
 disks + PM each) in the box won't get you more than about 21 drives
 worth of performance, tops.  So you leave at least half the available
 drive bandwidth on the table, in the best of circumstances.  That also
 assumes that the SiI controllers can push 100% of the bandwidth coming
 into them, which would be 300MB/s * 2 ports = 600MB/s, which is getting
 close to a 4x PCIe-gen2 slot.

Wrong. The theoretical bandwidth of an x4 PCI-E v2.0 slot is 2GB/s per
direction (5Gbit/s before 8b-10b encoding per lane, times 0.8, times 4),
amply sufficient to deal with 600MB/s.

However they don't have this kind of slot, they have x2 PCI-E v1.0
slots (500MB/s per direction). Moreover the SiI3132 defaults to a
MAX_PAYLOAD_SIZE of 128 bytes, therefore my guess is that each 2-port
SATA card is only able to provide 60% of the theoretical throughput[1],
or about 300MB/s.

Then they have 3 such cards: total throughput of 900MB/s.

Finally the 4th SATA card (with 4 ports) is in a 32-bit 33MHz PCI slot
(not PCI-E). In practice such a bus can only provide a usable throughput
of about 100MB/s (out of 133MB/s theoretical).

All the bottlenecks are obviously the PCI-E links and the PCI bus.
So in conclusion, my SBNSWAG (scientific but not so wild-ass guess)
is that the max I/O throughput when reading from all the disks on
1 of their storage pods is about 1000MB/s. This is poor compared to 
a Thumper for example, but the most important factor for them was
GB/$, not GB/sec. And they did a terrific job at that!

 And I'd re-iterate what myself and others have observed about SiI and
 silent data corruption over the years.

Irrelevant, because it seems they have built fault-tolerance higher in
the stack, à la Google. Commodity hardware + reliable software = great
combo.

[1] 
http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/

-mrb

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Petabytes on a budget - blog

2009-09-04 Thread Marc Bevand
Marc Bevand m.bevand at gmail.com writes:
 
 So in conclusion, my SBNSWAG (scientific but not so wild-ass guess)
 is that the max I/O throughput when reading from all the disks on
 1 of their storage pod is about 1000MB/s.

Correction: the SiI3132 are on x1 (not x2) links, so my guess as to
the aggregate throughput when reading from all the disks is:
3*150+100 = 550MB/s.
(150MB/s is 60% of the max theoretical 250MB/s bandwidth of an x1 link)

And if they tuned MAX_PAYLOAD_SIZE to allow the 3 PCI-E SATA cards
to exploit closer to the max theoretical bandwidth of an x1 PCI-E
link, it would be:
3*250+100 = 850MB/s.

-mrb

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Petabytes on a budget - blog

2009-09-04 Thread Marc Bevand
Tim Cook tim at cook.ms writes:
 
 Whats the point of arguing what the back-end can do anyways?  This is bulk 
data storage.  Their MAX input is ~100MB/sec.  The backend can more than 
satisfy that.  Who cares at that point whether it can push 500MB/s or 
5000MB/s?  It's not a database processing transactions.  It only needs to be 
able to push as fast as the front-end can go.  --Tim

True, what they have is sufficient to match GbE speed. But internal I/O 
throughput matters for resilvering RAID arrays, scrubbing, local data 
analysis/processing, etc. In their case they have 3 15-drive RAID6 arrays per 
pod. If their layout is optimal they put 5 drives on the PCI bus (to minimize 
this number) and 10 drives behind PCI-E links per array, so this means the PCI 
bus's ~100MB/s practical bandwidth is shared by 5 drives, so 20MB/s per 
(1.5TB-)drive, so it is going to take a minimum of 20.8 hours to resilver one of 
their arrays.

-mrb

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] The importance of ECC RAM for ZFS

2009-07-27 Thread Marc Bevand
dick hoogendijk dick at nagual.nl writes:
 
 Than why is it that most AMD MoBo's in the shops clearly state that ECC
 Ram is not supported on the MoBo?

To restate what Erik explained: *all* AMD CPUs support ECC RAM, however poorly 
written motherboard specs often make the mistake of confusing non-ECC vs. ECC
with unbuffered vs. registered (these are 2 completely unrelated technical
characteristics). So, don't blindly trust manuals saying ECC RAM is not
supported.

-mrb

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] The importance of ECC RAM for ZFS

2009-07-25 Thread Marc Bevand
dick hoogendijk dick at nagual.nl writes:
 
 I live in Holland and it is not easy to find motherboards that (a)
 truly support ECC ram and (b) are (Open)Solaris compatible.

Virtually all motherboards for AMD processors support ECC RAM because the 
memory controller is in the CPU and all AMD CPUs support ECC RAM.

I have heard of a few BIOSes that refuse to POST if ECC RAM is detected, but 
this is often an attempt to segment markets, rather than a real lack of 
ability to support ECC RAM.

-mrb

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] reboot when copying large amounts of data

2009-03-11 Thread Marc Bevand
The copy operation will make all the disks start seeking at the same time and 
will make your CPU activity jump to a significant percentage to compute the 
ZFS checksum and RAIDZ parity. I think you could be overloading your PSU 
because of the sudden increase in power consumption...

However if you are *not* using SATA staggered spin-up, then the above theory 
is unlikely because spinning up consumes much more power than when seeking. 
So, in a sense, a successful boot proves your PSU is powerful enough.

Try reproducing the problem by copying data on a smaller number of disks. 
You tried 2 and 16. Try 8.

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS vs HardWare raid - data integrity?

2009-01-01 Thread Marc Bevand
Mattias Pantzare pantzare at gmail.com writes:
 On Tue, Dec 30, 2008 at 11:30, Carsten Aulbert wrote:
  [...]
  where we wrote data to the RAID, powered the system down, pulled out one
  disk, inserted it into another computer and changed the sector checksum
  of a few sectors (using hdparm's utility makebadsector).
 
 You are talking about diffrent types of errors. You tested errors that
 the disk can detect. That is not a problem on any RAID, that is what
 it is designed for.

Mattias pointed out to me in a private email that I missed Carsten's mention of 
hdparm --make-bad-sector. Duh!

So Carsten: Mattias is right, you did not simulate a silent data corruption 
error. hdparm --make-bad-sector just introduces a regular media error that 
*any* RAID level can detect and fix.

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS vs HardWare raid - data integrity?

2008-12-31 Thread Marc Bevand
Mattias Pantzare pantzare at gmail.com writes:
 
 He was talking about errors that the disk can't detect (errors
 introduced by other parts of the system, writes to the wrong sector or
 very bad luck). You can simulate that by writing diffrent data to the
 sector,

Well yes you can. Carsten and I are both talking about silent data corruption 
errors, and the way to simulate them is to do what Carsten did. However I 
pointed out that he may have tested only easy corruption cases (affecting the 
P or Q parity only) -- it is tricky to simulate hard-to-recover corruption 
errors...

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS vs HardWare raid - data integrity?

2008-12-30 Thread Marc Bevand
Carsten Aulbert carsten.aulbert at aei.mpg.de writes:
 
 In RAID6 you have redundant parity, thus the controller can find out
 if the parity was correct or not. At least I think that to be true
 for Areca controllers :)

Are you sure about that ? The latest research I know of [1] says that 
although an algorithm does exist to theoretically recover from
single-disk corruption in the case of RAID-6, it is *not* possible to
detect dual-disk corruption with 100% certainty. And blindly running
the said algorithm in such a case would even introduce corruption on a
third disk.

This is the reason why, AFAIK, no RAID-6 implementation actually
attempts to recover from single-disk corruption (someone correct me if
I am wrong).

The exception is ZFS of course, but it accomplishes single and
dual-disk corruption self-healing by using its own checksum, which is
one layer above RAID-6 (therefore unrelated to it).

[1] http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS vs HardWare raid - data integrity?

2008-12-30 Thread Marc Bevand
Carsten Aulbert carsten.aulbert at aei.mpg.de writes:

 Well, I probably need to wade through the paper (and recall Galois field
 theory) before answering this. We did a few tests in a 16 disk RAID6
 where we wrote data to the RAID, powered the system down, pulled out one
 disk, inserted it into another computer and changed the sector checksum
 of a few sectors (using hdparm's utility makebadsector). The we
 reinserted this into the original box, powered it up and ran a volume
 check and the controller did indeed find the corrupted sector and
 repaired the correct one without destroying data on another disk (as far
 as we know and tested).

Note that there are cases of single-disk corruption that are trivially
recoverable (for example if the corruption affects the P or Q parity 
block, as opposed to the data blocks). Maybe that's what you
inadvertently tested? Overwrite a number of contiguous sectors to
span 3 stripes on a single disk to be sure to correctly stress-test
the self-healing mechanism.
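
One generic way to do that (a sketch only; the device name is a placeholder,
the command destroys data in that region, and the offset/count should be sized
to your chunk size so that at least 3 full stripes on that one disk are hit):

  $ dd if=/dev/urandom of=/dev/rdsk/c2t4d0p0 bs=512 oseek=10000000 count=3072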

 For the other point: dual-disk corruption can (to my understanding)
 never be healed by the controller since there is no redundant
 information available to check against. I don't recall if we performed
 some tests on that part as well, but maybe we should do that to learn
 how the controller will behave. As a matter of fact at that point it
 should just start crying out loud and tell me, that it cannot recover
 for that. 

The paper explains that the best RAID-6 can do is use probabilistic 
methods to distinguish between single and dual-disk corruption, eg. 
"there are 95% chances it is single-disk corruption so I am going to 
fix it assuming that, but there are 5% chances I am going to actually 
corrupt more data, I just can't tell." I wouldn't want to rely on a 
RAID controller that takes gambles :-)

-marc


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Asymmetric zpool load

2008-12-03 Thread Marc Bevand
Carsten Aulbert carsten.aulbert at aei.mpg.de writes:
 
 Put some stress on the system with bonnie and other tools and try to
 find slow disks

Just run iostat -Mnx 2 (not zpool iostat) while ls is slow to find the slow 
disks. Look at the %b (busy) values.

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Hardware Raid Vs ZFS implementation on Su n X4150/X4450

2008-12-03 Thread Marc Bevand
Aaron Blew aaronblew at gmail.com writes:
 
 I've done some basic testing with a X4150 machine using 6 disks in a
 RAID 5 and RAID Z configuration.  They perform very similarly, but RAIDZ
 definitely has more system overhead.

Since hardware RAID 5 implementations usually do not checksum data (they only 
compute the parity, which is not the same thing), for an apples-to-apples 
performance comparison you should have benchmarked raidz with checksum=off. Is 
that what you did?
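
Something like this (pool/dataset names are placeholders) levels the playing
field for the duration of the benchmark:

  $ zfs set checksum=off tank/bench
    ... run the benchmark against tank/bench ...
  $ zfs inherit checksum tank/bench   # restore the inherited default afterwards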

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help with simple (?) reconfigure of zpool

2008-11-01 Thread Marc Bevand
Ross myxiplx at googlemail.com writes:
 Now this is risky if you don't have backups, but one possible approach
 might be:
 - Take one of the 1TB drives off your raid-z pool
 - Use your 3 1TB drives, plus two sparse 1TB files and create a 5 drive
   raid-z2
 - disconnect the sparse files.  You now have a 3TB raid-z2 volume in a
   degraded state
 - use zfs send / receive to migrate your data over
 - destroy your original pool and use zpool replace to add those drives to
   the new pool in place of the sparse files

This would work but it would give the original poster a raidz2 with only 3TB 
of usable space when he really wants a 4TB raidz1.

Fortunately, Robert, a similar procedure exists to end up with exactly the 
pool config you want without requiring any other temporary drives. Before I go 
further, let me tell you there is a real risk of losing your data, because the 
procedure I describe below uses temporary striped pools (equivalent to raid0) 
to copy data around, and as you know raid0 is the least reliable raid 
mechanism. Also, the procedure involves lots of manual steps.

So, let me first represent your current pool config in compact form using 
drive names describing their capacity:
  pool (2.6TB usable):  raidz a-1t b-1t c-1t  raidz d-320g e-400g f-400g

Export the 1st pool, create a 2nd temporary striped pool made of your 2 new 
drives plus f-400g, reimport the 1st pool (f-400g should show up as missing in 
the 1st one):
  1st pool (2.6TB usable):  raidz a-1t b-1t c-1t  raidz d-320g e-400g missing
  2nd pool (2.4TB usable):  g-1t h-1t f-400g

Copy your data to the 2nd pool, destroy the 1st one and create a 3rd temporary 
striped pool made of the 2 smallest drives:
  1st pool (destroyed): (unused drives: a-1t b-1t c-1t)
  2nd pool (2.4TB usable):  g-1t h-1t f-400g
  3rd pool (0.7TB usable):  d-320g e-400g

Create 2 sparse files x-1t and y-1t of 1 TB each on the 3rd pool (mkfile -n 
932g x-1t y-1t, 1TB is about 932GiB), and recreate the 1st pool with a raidz 
vdev made of 3 physical 1TB drives and the 2 sparse files:
  1st pool (4.0TB usable(*)):  raidz a-1t b-1t c-1t x-1t y-1t
  2nd pool (2.4TB usable): g-1t h-1t f-400g
  3rd pool (0.7TB usable): d-320g e-400g

(*) 4.0TB virtually; in practice the sparse files won't be able to allocate 
1TB of disk blocks because they are backed by the 3rd pool which is much 
smaller.

Offline one of the sparse files (zpool offline) of the 1st pool to prevent 
at least one of them from allocating disk blocks:
  1st pool (4.0TB usable(**)):  raidz a-1t b-1t c-1t x-1t offlined
  2nd pool (2.4TB usable):  g-1t h-1t f-400g
  3rd pool (0.7TB usable):  d-320g e-400g

(**) At that point x-1t can grow to at least 0.7 TB because it is the only 
consumer of disk blocks on the 3rd pool; which means the 1st pool can now hold 
at least 0.7*4 = 2.8 TB in practice.

Now you should be able to copy all your data from the 2nd pool back to the 1st 
one. When done, destroy the 2nd pool:
  1st pool (4.0TB usable):  raidz a-1t b-1t c-1t x-1t offlined
  2nd pool (destroyed): (unused drives: g-1t h-1t f-400g)
  3rd pool (0.7TB usable):  d-320g e-400g

Finally, replace x-1t and the other, offlined, sparse file with g-1t and h-1t 
(zpool replace):
  1st pool (4.0TB usable):  raidz a-1t b-1t c-1t g-1t h-1t
  2nd pool (destroyed): (unused drives: f-400g)
  3rd pool (0.7TB usable):  d-320g e-400g

And destroy the 3rd pool.
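
In command form, the whole procedure boils down to something like the sketch
below. None of it is copy-paste ready: every name is a placeholder that must
be mapped to the drives/aliases above, and each copy should be verified before
the next destroy.

  $ zpool export pool1
  $ zpool create pool2 <g-1t> <h-1t> <f-400g>         # striped, no redundancy
  $ zpool import pool1                                # f-400g shows up as missing
    ... copy pool1 -> pool2 (zfs send/recv or rsync), verify ...
  $ zpool destroy pool1
  $ zpool create pool3 <d-320g> <e-400g>
  $ mkfile -n 932g /pool3/x-1t /pool3/y-1t
  $ zpool create -f pool1 raidz <a-1t> <b-1t> <c-1t> /pool3/x-1t /pool3/y-1t
  $ zpool offline pool1 /pool3/y-1t
    ... copy pool2 -> pool1, verify ...
  $ zpool destroy pool2
  $ zpool replace pool1 /pool3/x-1t <g-1t>
  $ zpool replace pool1 /pool3/y-1t <h-1t>
  $ zpool destroy pool3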

-marc


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help with simple (?) reconfigure of zpool

2008-11-01 Thread Marc Bevand
Robert Rodriguez robertro at comcast.net writes:
 
 A couple of follow up question, have you done anything similar before?

I have done similar manipulations to experiment with ZFS
(using files instead of drives).

 Can you assess the risk involved here?

If any one of your 8 drives dies during the procedure, you are going
to lose some data, plain and simple. I would especially be worried
about the 2 brand new drives that were just bought. You are probably
the best person to estimate the probability of them dying, as you
know their history (have they been running 24/7 for 1-2 years with
periodical scrubs and not a single problem? Then they are probably ok).

IMHO you can reduce the risk a lot by scrubbing everything:
- before you start, scrub your existing pool (pool #1)
- scrub pool #2 after copying data to it and before destroying pool #1
- scrub pool #1 (made of sparse files) and pool #3 (backing the sparse
  files) after copying from pool #2 to #1
- rescrub pool #1 after replacing the sparse files with real drives
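
In command form (pool names are placeholders), each of these checkpoints is
simply:

  $ zpool scrub tank
  $ zpool status -v tank   # proceed only once the scrub completed with 0 errors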

 Does the fact that the pool is currently at 90% usage change this
 in any way.

Nope.

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dribbling checksums

2008-10-30 Thread Marc Bevand
Charles Menser charles.menser at gmail.com writes:
 
 Nearly every time I scrub a pool I get small numbers of checksum
 errors on random drives on either controller.

These are the typical symptoms of bad RAM/CPU/Mobo. Run memtest for 24h+.

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Success Stories

2008-10-21 Thread Marc Bevand
About 2 years ago I used to run snv_55b with a raidz on top of 5 500GB SATA 
drives. After 10 months I ran out of space and added a mirror of 2 250GB 
drives to my pool with "zpool add". No problem. I scrubbed it weekly. I only saw 1 
CKSUM error one day (ZFS self-healed itself automatically of course). Never 
had any pb with that server.

After running again out of space I replaced it with a new system running 
snv_82, configured with a raidz on top of 7 750GB drives. To burn in the 
machine, I wrote a python script that read random sectors from the drives. I 
let it run for 48 hours to subject each disk to 10+ million I/O operations. 
After it passed this test, I created the pool and ran some more scripts to 
create/delete files off it continuously. To test disk failures (and SATA 
hotplug), I disconnected and reconnected a drive at random while the scripts 
were running. The system was always able to redetect the drive immediately 
after being plugged in (you need "set sata:sata_auto_online=1" for this to 
work). Depending on how long the drive had been disconnected, I either needed 
to do a zpool replace or nothing at all, for the system to re-add the disk 
to the pool and initiate a resilver. After these tests, I trusted the system 
enough to move all my data to it, so I rsync'd everything and double-checked 
it with MD5 sums.
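
The script itself was never posted; as a rough illustration of the idea, a
shell equivalent (device names, sector count and run time are placeholders)
could look like this:

  #!/bin/ksh
  # Fire random single-sector reads at each raw disk, in parallel.
  # Let it run for ~48 hours, then interrupt it with Ctrl-C.
  SECTORS=1465149168                      # ~750GB drive; adjust to your disks
  for d in c1t0d0 c1t1d0 c1t2d0           # ...list all 7 drives here
  do
      while :; do
          # RANDOM is 0..32767, so this covers roughly the first 1e9 sectors
          off=$(( (RANDOM * 32768 + RANDOM) % SECTORS ))
          dd if=/dev/rdsk/${d}p0 of=/dev/null bs=512 iseek=$off count=1 2>/dev/null
      done &
  done
  wait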

I have another ZFS server, at work, on which 1 disk one day started acting 
weirdly (timeouts). I physically replaced it, and ran zpool replace. The 
resilver completed successfully. On this server, we have seen 2 CKSUM errors 
over the last 18 months or so. We read about 3 TB of data every day from it 
(daily rsync), that amounts to about 1.5 PB over 18 months. I guess 2 silent 
data corruptions while reading that quantity of data is about the expected 
error rate of modern SATA drives. (Again ZFS self-healed itself, so this was 
completely transparent to us.)

-marc


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-02 Thread Marc Bevand
Marc Bevand m.bevand at gmail.com writes:
 
 Well let's look at a concrete example:
 - cheapest 15k SAS drive (73GB): $180 [1]
 - cheapest 7.2k SATA drive (160GB): $40 [2] (not counting a 80GB at $37)
 The SAS drive most likely offers 2x-3x the IOPS/$. Certainly not 180/40=4.5x

Doh! I said the opposite of what I meant. Let me rephrase: The SAS drive 
offers at most 2x-3x the IOPS (optimistic), but at 180/40=4.5x the price. 
Therefore the SATA drive has better IOPS/$.

(Joerg: I am on your side of the debate !)

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-02 Thread Marc Bevand
Erik Trimble Erik.Trimble at Sun.COM writes:
 
 Bottom line here is that when it comes to making statements about SATA
 vs SAS, there are ONLY two statements which are currently absolute:
 
 (1)  a SATA drive has better GB/$ than a SAS drive
 (2)  a SAS drive has better throughput and IOPs than a SATA drive

Yes, and to represent statements (1) and (2) in a more exhaustive table:

  Best X per Y | Dollar    Watt    Rack Unit (or per drive)
  -------------+------------------------------------------
  Capacity     | SATA(1)   SATA    SATA
  Throughput   | SATA      SAS     SAS(2)
  IOPS         | SATA      SAS     SAS(2)

If (a) people understood that each of these 9 performance numbers can be
measured independently from each other, and (b) knew which of these numbers
matter for a given workload (very often multiple of them do, so a
compromise has to be made), then there would be no more circular SATA vs.
SAS debates.

-marc


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Marc Bevand
Tim tim at tcsac.net writes:
 
 That's because the faster SATA drives cost just as much money as
 their SAS counterparts for less performance and none of the
 advantages SAS brings such as dual ports.

SAS drives are far from always being the best choice, because absolute IOPS or 
throughput numbers do not matter. What only matters in the end is (TB, 
throughput, or IOPS) per (dollar, Watt, or Rack Unit).

7500rpm (SATA) drives clearly provide the best TB/$, throughput/$, and IOPS/$. 
You can't argue against that. To paraphrase what was said earlier in this 
thread, to get the best IOPS out of $1000, spend your money on 10 7500rpm 
(SATA) drives instead of 3 or 4 15000rpm (SAS) drives. Similarly, for the best 
IOPS/RU, 15000rpm drives have the advantage. Etc.

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Kernel panic at zpool import

2008-08-14 Thread Marc Bevand
Borys Saulyak borys.saulyak at eumetsat.int writes:
 
  Your pools have no redundancy...

 Box is connected to two fabric switches via different HBAs, storage is
 RAID5, MPxIP is ON, and all after that my pools have no redundancy?!?! 

As Darren said: no, there is no redundancy that ZFS can use. It is important 
to understand that your setup _prevents_ ZFS from self-healing itself. You 
need a ZFS-redundant pool (mirror, raidz or raidz2) or an fs with the 
attribute copies=2 to enable self-healing.

I would recommend making multiple LUNs visible to ZFS, and creating 
redundant pools out of them. Browse the past 2 years or so of the zfs-discuss@ 
archives to get an idea of how others with the same kind of hardware 
as yours are doing it. For example, export each disk as a LUN, and create 
multiple raidz vdevs. Or create 2 hardware raid5 arrays and mirror them with 
ZFS, etc.

  ...and got corrupted, therefore there is nothing ZFS
 This is exactly what I would like to know. HOW this could happened? 

Ask your hardware vendor. The hardware corrupted your data, not ZFS.

 I'm just questioning myself. Is it really reliable filesystem as presented,
 or it's better to keep away from it on production environment.

Consider yourself lucky that the corruption was reported by ZFS. Other 
filesystems would have returned silently corrupted data and it would have 
maybe taken you days/weeks to troubleshoot it. As for myself, I use ZFS in 
production to back up 10+ million files, have seen occurrences of hardware 
causing data corruption, and have seen ZFS self-heal itself. So yes, I trust it.

-marc


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Kernel panic at zpool import

2008-08-07 Thread Marc Bevand
Borys Saulyak borys.saulyak at eumetsat.int writes:
 root at omases11:~[8]#zpool import 
 [...]
 pool: private 
 id: 3180576189687249855 
 state: ONLINE 
 action: The pool can be imported using its name or numeric identifier. 
 config: 
 
   private ONLINE 
 c7t60060160CBA21000A6D22553CA91DC11d0 ONLINE 

Your pools have no redundancy...

 root at omases11:~[8]#zpool import private 
 
 panic[cpu3]/thread=fe8001223c80: ZFS: bad checksum

...and got corrupted, therefore there is nothing ZFS can do. This is precisely 
why best practices recommend pools to be configured with some level of 
redundancy (mirror, raidz, etc). See:

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Additional_Cautions_for_Storage_Pools

Restore your data from backup.

-marc


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS on 32bit.

2008-08-06 Thread Marc Bevand
Bryan, Thomas: these hangs of 32-bit Solaris under heavy (fs, I/O) loads are a 
well known problem. They are caused by memory contention in the kernel heap. 
Check 'kstat vmem::heap'. The usual recommendation is to change the 
kernelbase. It worked for me. See:

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-March/046710.html
http://mail.opensolaris.org/pipermail/zfs-discuss/2008-March/046715.html
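
For illustration only (the value below is just an example; the right setting
depends on your workload, see the threads above): check the heap pressure with
kstat, then lower kernelbase via eeprom and reboot:

  $ kstat -p vmem::heap:mem_inuse vmem::heap:mem_total
  $ eeprom kernelbase=0x80000000   # example value; takes effect after a reboot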

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Block unification in ZFS

2008-08-05 Thread Marc Bevand
Alan alan at peak.org writes:
 
 I was just thinking of a similar feature request: one of the things
 I'm doing is hosting vm's.  I build a base vm with standard setup in a
 dedicated filesystem, then when I need a new instance zfs clone and voila!
 ready to start tweaking for the needs of the new instance, using a fraction
 of the space.

This is OT but FYI some virtualization apps have built-in support for exactly 
what you want: you can create disk images that share identical blocks between 
themselves.

In Qemu/KVM this feature is called copy-on-write disk images:
$ qemu-img create -b base_image -f qcow2 new_image

In Microsoft Virtual Server there is also an equivalent feature, but I can't 
recall what it is called.

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz parity drives

2008-07-30 Thread Marc Bevand
Vanja vanjab at gmail.com writes:
 
 And finally, if this is the case, is it possible to make an array with
 3 drives, and then add the mirror later?

I assume you are asking if it is possible to create a temporary 3-way raidz, 
then transfer your data to it, then convert it to a 4-way raidz? No, it is not 
possible. However, here is one solution (let's call your 4 drives A, B, C and 
D -- your current data is on D).

1. Slice up A, B, C in 2 halves each: A1 and A2, B1 and B2, C1 and C2.
2. Create a 3-way raidz on A2, B2 and C2.
3. Copy your data from D to the 3-way raidz.
4. Slice up D in 2 halves: D1 and D2.
5. Create a 4-way raidz on A1, B1, C1, D1.
6. Copy your data from the 3-way raidz to the 4-way raidz.
7. Destroy the A2, B2, C2, D2 slices.
8. Grow A1, B1, C1, D1 to extend over the remaining disk space.
9. Export and reimport the 4-way raidz to make it register the extra space.

Yes that's a lot of steps. But that's also the safest solution as your data 
would never transit on a (temporarily) degraded raidz (as it is possible for 
example on Linux by creating a raid5 MD with one missing component).
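
A hedged sketch of the command flow (slice and device names are placeholders;
the slices themselves are created, and later grown, with format(1M)):

  $ zpool create tmppool raidz c1t0d0s1 c1t1d0s1 c1t2d0s1          # A2 B2 C2
    ... copy the data from D to tmppool ...
  $ zpool create tank raidz c1t0d0s0 c1t1d0s0 c1t2d0s0 c1t3d0s0    # A1 B1 C1 D1
    ... copy the data from tmppool to tank ...
  $ zpool destroy tmppool
    ... delete the s1 slices and grow each s0 slice with format ...
  $ zpool export tank && zpool import tank    # pick up the larger devices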

-marc


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Errors in ZFS/NFSv4 ACL Documentation

2008-07-29 Thread Marc Bevand
I noticed some errors in ls(1), acl(5) and the ZFS Admin Guide about ZFS/NFSv4 
ACLs:

ls(1): "read_acl (r)  Permission  to  read  the ACL of a file." The compact 
representation of read_acl is 'c', not 'r'.

ls(1): "-c | -v  The same as -l, and in addition displays the [...]" The 
options are in fact "-/ c" or "-/ v".

ls(1): "The display in verbose mode (/ v) uses full attribute [...]."  This 
should read "(-/ v)".

acl(5): "execute (X)". The x should be lowercase: (x).

acl(5) does not document 3 ACEs: success access (S), failed access (F), 
inherited (I).

The ZFS Admin Guide does not document the same 3 ACEs.

The ZFS Admin Guide gives examples listing a compact representation of ACLs 
containing only 6 inheritance flags instead of 7. For example in the 
section "Setting and Displaying ACLs on ZFS Files in Compact Format":
# ls -V file.1
-rw-r--r-- 1 root   root  206663 Feb 16 11:00 file.1
owner@:--x-----------:------:deny
                      ^^^^^^
                      7th position for flag 'I' missing

By the way, where can I find the latest version of the ls(1) manpage online ? 
I cannot find it, neither on src.opensolaris.org, nor in the manpage 
consolidation download center [1]. I'd like to check whether the errors I 
found in ls(1) are fixed before submitting a bug report.

[1] http://opensolaris.org/os/downloads/manpages/

-marc


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool replace not working

2008-07-27 Thread Marc Bevand
It looks like you *think* you are trying to add the new drive, when you are in 
fact re-adding the old (failing) one. A new drive should never show up as 
ONLINE in a pool with no action on your part, if only because it contains no 
partition and no vdev label with the right pool GUID.

If I am right, try to add the other drive.

If I am wrong, you somehow managed to confuse ZFS. You can prevent ZFS from 
thinking c2d1 is already part of the pool by deleting the partition table on 
it:
  $ dd if=/dev/zero of=/dev/rdsk/c2d1p0 bs=512 count=1
  $ zpool import
  (it should show you the pool is now ready to be imported)
  $ zpool import tank
  $ zpool replace tank c2d1

At this point it should be resilvering...

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] planning for upgrades

2008-07-08 Thread Marc Bevand
Matt Harrison iwasinnamuknow at genestate.com writes:
 
 Aah, excellent, just did an export/import and its now showing the
 expected capacity increase. Thanks for that, I should've at least tried
 a reboot  :)

More recent OpenSolaris builds don't even need the export/import anymore when 
expanding a raidz this way (I tested with build 82).

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large zpool design considerations

2008-07-04 Thread Marc Bevand
Chris Cosby ccosby+zfs at gmail.com writes:
 
 
 You're backing up 40TB+ of data, increasing at 20-25% per year.
 That's insane.

Over time, backing up his data will require _fewer_ and fewer disks.
Disk sizes increase by about 40% every year.

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Problem with AOC-SAT2-MV8

2008-07-01 Thread Marc Bevand
I remember a similar problem with an AOC-SAT2-MV8 controller in a system of mine: 
Solaris rebooted each time the marvell88sx driver tried to detect the disks 
attached to it. I don't remember if it happened during installation, or during 
the first boot after a successful install. I ended up spending a night reverse 
engineering the controller's firmware/BIOS to find and fix the bug. The system 
has been running fine since I reflashed the controller with my patched 
firmware.

To make a long story short, a lot of these controllers in the wild use a buggy 
firmware, version 1.0b [1]. During POST the controller's firmware scans the 
PCI bus to find the device it is supposed to initialize, ie the controller's 
Marvell 88SX6081 chip. It incorrectly assumes that the *first* device with one 
of these PCI device IDs is the 88SX6081: 5040 5041 5080 5081 6041 6042 6081 
7042 (the firmware is generic and supposed to support different chips). My 
system's motherboard happened to have an Marvell chip 88SX5041 onboard (device 
ID 5041) which was found first. So during POST the AOC-SAT2-MV8 firmware was 
initializing disks connected to the 5041, leaving the 6081 disks in an 
uninitialized stat. Then after POST when Solaris was booting, I guess the 
marvell88sx barfed on this unexpected state and was causing the kernel to 
reboot.

To fix the bug, I simply patched the firmware to remove 5041 from the device 
ID list. I used the Supermicro-provided tool to reflash the firmware [1].

You said your motherboard is a Supermicro H8DM8E-2. There is no such model, do 
you mean H8DM8-2 or H8DME-2? To determine whether one of your PCI devices 
has one of the device IDs I mentioned, run:
  $ /usr/X11/bin/scanpci
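
To narrow that output down to the device IDs in question, something like this
should do (the pattern is simply the ID list above):

  $ /usr/X11/bin/scanpci | egrep -i '5040|5041|5080|5081|6041|6042|6081|7042'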

I have recently had to replace this AOC-SAT2-MV8 controller with another one 
(we accidentally broke a SATA connector during a maintenance operation). Its 
firmware version is using a totally different numbering scheme (it's probably 
more recent) and it worked right out-of-the-box on the same motherboard. So it 
looks like Marvell or Supermicro fixed the bug in at least some later 
revisions of the AOC-SAT2-MV8. But they don't distribute this newer firmware 
on their FTP site.

Do you know if yours is using firmware 1.0b (displayed during POST) ?

[1] ftp://ftp.supermicro.com/Firmware/AOC-SAT2-MV8


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS configuration for VMware

2008-07-01 Thread Marc Bevand
Erik Trimble Erik.Trimble at Sun.COM writes:
 
 * Huge RAM drive in a 1U small case (ala Cisco 2500-series routers), 
 with SAS or FC attachment.

Almost what you want:
http://www.superssd.com/products/ramsan-400/
128 GB RAM-based device, 3U chassis, FC and Infiniband connectivity.

However as a commenter pointed out [1] you would be basically buying RAM at 
~20x its street price... Plus the density sucks and they could strip down this 
device much more (remove the backup drives, etc.)

[1] 
http://storagemojo.com/2008/03/07/flash-talking-and-a-wee-dram-with-texas-memory-systems/

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Problem with AOC-SAT2-MV8

2008-07-01 Thread Marc Bevand
Marc Bevand m.bevand at gmail.com writes:
 
 I have recently had to replace this AOC-SAT2-MV8 controller with another one 
 (we accidentally broke a SATA connector during a maintainance operation). Its 
 firmware version is using a totally different numbering scheme (it's probably 
 more recent) and it worked right out-of-the-box on the same motherboard.

I found the time to reboot the aforementioned system today, and the firmware
version displayed during POST by the newer AOC-SAT2-MV8 is Driver Version
3.2.1.3.

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Cannot delete errored file

2008-06-07 Thread Marc Bevand
Weird. I have no idea how you could remove that file (besides destroying the 
entire filesystem)...

One other thing I noticed:

NAMESTATE READ WRITE CKSUM
rpool   ONLINE   0 0 8
  raidz1ONLINE   0 0 8
c0t7d0  ONLINE   0 0 0
c0t1d0  ONLINE   0 0 0
c0t2d0  ONLINE   0 0 0

When you see non-zero CKSUM error counters at the pool or raidz1/z2 vdev 
level, but no errors on the individual devices like this, it means that ZFS 
couldn't correct the corruption errors after multiple attempts at reconstructing 
the stripes, each time assuming a different device was corrupting data. IOW it 
means that 2+ (in a raidz1) or 3+ (in a raidz2) devices returned corrupted 
data in the same stripe. Since it is statistically improbable to have that 
many silent data corruptions in the same stripe, most likely this condition 
indicates a hardware problem. I suggest running memtest to stress-test your 
cpu/mem/mobo.

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SATA controller suggestion

2008-06-06 Thread Marc Bevand
Buy a 2-port SATA II PCI-E x1 SiI3132 controller ($20). The solaris driver is 
very stable.

Or, a solution I would personally prefer, don't use a 7th disk.  Partition 
each of your 6 disks with a small ~7-GB slice at the beginning and the rest of 
the disk for ZFS. Install the OS in one of the small slices. This will only 
reduce your usable ZFS storage space by 1% (and you may have to manually 
enable write cache because ZFS won't be given entire disks, only slices) but: 
(1) you save a disk and a controller and money and related hassles (the reason 
why you post here :P), (2) you can mirror your OS on the other small slices 
using SVM or a ZFS mirror to improve reliability, and (3) this setup allows 
you to easily experiment with parallel installs of different opensolaris 
versions in the other slices.
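
A minimal sketch of what I mean, assuming hypothetical devices c0t0d0..c0t5d0
and the ZFS slice being s7 on each disk (the pool type is just an example):

  # per disk: s0 = ~7 GB for the OS, s7 = the remainder for ZFS (set up with format)
  $ zpool create tank raidz c0t0d0s7 c0t1d0s7 c0t2d0s7 c0t3d0s7 c0t4d0s7 c0t5d0s7
  $ format -e      # then: cache -> write_cache -> enable, for each disk, since
                   # ZFS only enables the write cache itself when given whole disks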

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SATA controller suggestion

2008-06-06 Thread Marc Bevand
Richard L. Hamilton rlhamil at smart.net writes:
 But I suspect to some extent you get what you pay for; the throughput on the
 higher-end boards may well be a good bit higher.

Not really. Nowadays, even the cheapest controllers, processors  mobos are 
EASILY capable of handling the platter-speed throughput of up to 8-10 disks.

http://opensolaris.org/jive/thread.jspa?threadID=54481

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Cannot delete errored file

2008-06-05 Thread Marc Bevand
Ben Middleton ben at drn.org writes:
 
 [...]
 But that simply had the effect of transferring the issue to the new drive:

When you see this behavior, it most likely means it's not your drive
which is failing, but instead it indicates a bad SATA/SAS cable, or
port on the disk controller.

PS: have you tried ': > xxx.mp3' to truncate your corrupted file?
(the colon is a shell builtin that does nothing). If I were you I
would also try removing the directory containing the corrupted file.

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] can anyone help me?

2008-06-02 Thread Marc Bevand
Hernan Freschi hjf at hjf.com.ar writes:
 
 Here's the output. Numbers may be a little off because I'm doing a nightly  
 build and compressing a crashdump with bzip2 at the same time.

Thanks. Your disks look healthy. But one question: why is it
c5t0/c5t1/c6t0/c6t1 when in another post you referred to the 4 disks
as c[1234]d0 ?

Did you change the hardware ?

AFAIK ZFS doesn't always like it when the device names change... There
have been problems/bugs exposed by this in the past.

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] can anyone help me?

2008-06-01 Thread Marc Bevand
So you are experiencing slow I/O which is making the deletion of this clone 
and the replay of the ZIL take forever. It could be because of random I/O ops, 
or one of your disks which is dying (not reporting any errors, but very slow 
to execute every single ATA command). You provided the output of 'zpool 
iostat' while an import was hanging, what about 'iostat -Mnx 3 20' (not to be 
confused with zpool iostat). Please let the command complete, it will run for 
3*20 = 60 secs.

Also, to validate the slowly-dying-disk theory, reboot the box, do NOT import 
the pool, and run 4 of these commands (in parallel in the background) with 
c[1234]d0p0:
  $ dd bs=1024k of=/dev/null if=/dev/rdsk/cXd0p0
Then 'iostat -Mnx 2 5'

Also, are you using non-default settings in /etc/system (other than 
zfs_arc_max) ? Are you passing any particular kernel parameters via GRUB or 
via 'eeprom' ?

On a side note, what is the version of your pool and the version of your 
filesystems ? If you don't know run 'zpool upgrade' and 'zfs upgrade' with no 
argument.

What is your SATA controller ? I didn't see you run dmesg.

-marc


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Project Hardware

2008-05-31 Thread Marc Bevand
Marc Bevand m.bevand at gmail.com writes:
 
 What I hate about mobos with no onboard video is that these days it is 
 impossible to find cheap fanless video cards. So usually I just go headless.

Didn't finish my sentence: ...fanless and *power-efficient*.
Most cards consume 20+W when idle. This alone is a half or a
third of the idle power consumption of a small NAS.

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Project Hardware

2008-05-30 Thread Marc Bevand
Brandon High bhigh at freaks.com writes:
 
 I'm going to be putting together a home NAS
 based on OpenSolaris using the following:
 1 SUPERMICRO CSE-743T-645B Black Chassis  
 1 ASUS M2N-LR AM2 NVIDIA nForce Professional 3600 ATX Server Motherboard  
 1 SUPERMICRO AOC-SAT2-MV8 64-bit PCI-X133MHz SATA Controller Card 
 1 AMD Athlon X2 4850e 2.5GHz Socket AM2 45W Dual-Core Processor Model
 ADH4850DOBOX
 1 Crucial 4GB (2 x 2GB) 240-Pin DDR2 SDRAM DDR2 667 (PC2 5300) ECC
 Unbuffered Dual Channel Kit Server Memory Model CT2KIT25672AA667
 8 Western Digital Caviar GP WD10EACS 1TB 5400 to 7200 RPM SATA
 3.0Gb/s Hard Drive
 
 Subtotal: $2,386.88

You could get a $200 cheaper, more power-efficient, and more performant config 
by buying a high SATA port-count desktop-class mobo instead of a server one + 
AOC-SAT2-MV8. 

For example the Abit AB9 Pro (about $80-90) comes with 10 SATA ports (9 
internal + 1 external): 6 from the ICH8R chipset (driver: ahci), 2 from a 
JMB363 chip (driver: ahci in snv_82 and above, see bug 6645543), and 2 from a 
SiI3132 chip (driver: si3124).

All these drivers should be rock-solid. Performance-wise you should be able to 
max out your 8 disks' max read/write throughput at the same time (but see 
http://opensolaris.org/jive/thread.jspa?threadID=54481 there is usually a 
bottleneck of 150 MB/s per PCI-E lane, this apply to the JMB363 and SiI3132).

Downside: loss of upgradability by having onboard SATA controllers. No onboard 
video. And it's an Intel mobo. Intel's prices for low-power processors (below 
~50W) are higher than AMD's, especially for dual-core ones. But something only 
slightly more power-hungry than your 45W AMD is the Pentium E2220 (2.4GHz 
dual-core 65W). Most likely your NAS will spend 90+% of its time idle so there 
wouldn't be a constant 20W power diff between the 2 configs.

What I hate about mobos with no onboard video is that these days it is 
impossible to find cheap fanless video cards. So usually I just go headless.

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Project Hardware

2008-05-25 Thread Marc Bevand
Tim tim at tcsac.net writes:
 
 So we're still stuck the same place we were a year ago.  No high port
 count pci-E compatible non-raid sata cards.  You'd think with all the
 demand SOMEONE would've stepped up to the plate by now.  Marvell, cmon ;)

Here is a 6-port SATA PCI-Express x1 controller for $70: [1]. I don't know who 
makes this card, but from the picture it is apparently based on a SiI3114 chip 
behind a PCI-E to PCI bridge. I also don't know how they get 6 ports total 
when this chip is known to only provide 4 ports.  Downsides: SATA 1.5 Gbps 
only; 4 of the ports are external (eSATA cables required); and don't expect to 
break throughput records because the bottleneck will be the internal PCI bus 
(33 MHz or 66 MHz: 133 or 266 MB/s theoretical hence 100 or 200 MB/s practical 
peak throughput shared between the 6 drives).

I also know Lycom, who is selling a 4-port PCI-E x8 card based on the Silicon 
Image SiI3124 chip and a PCI-E to PCI-X bridge [2]. I am unable to find a 
vendor for this card though. I heard about Lycom through the vendor list on 
sata-io.org.

Regarding Marvell, their website is completely useless as they provide almost 
no tech info regarding their SATA products, but according to a wikipedia 
article [3] they have three PCI-E to SATA 3.0 Gbps host controllers:

  o 88SE6141: 4-port (AHCI ?)
  o 88SE6145: 4-port (AHCI according to the Linux driver source code)
  o 88SX7042: 4-port (non-AHCI)

The 6141 and 6145 appear to be mostly used as onboard SATA controllers 
according to [3]. The 7042 can be found on some Adaptec and Highpoint cards 
according to [4], but they are probably expensive and come with this thing 
called hardware RAID that most of us don't need :)

Overall, like you I am frustrated by the lack of non-RAID inexpensive native 
PCI-E SATA controllers.

-marc

[1] http://cooldrives.com/ss42chesrapc.html
[2] http://www.lycom.com.tw/PE124R5.htm
[3] http://en.wikipedia.org/wiki/List_of_Marvell_Technology_Group_chipsets
[4] 
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/ata/sata_mv.c;hb=HEAD


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Project Hardware

2008-05-25 Thread Marc Bevand
Kyle McDonald KMcDonald at Egenera.COM writes:
 Marc Bevand wrote:
 
  Overall, like you I am frustrated by the lack of non-RAID inexpensive
  native PCI-E SATA controllers.

 Why non-raid? Is it cost?

Primarily cost, reliability (less complex hw = less hw that can fail),
and serviceability (no need to rebuy the exact same raid card model
when it fails, any SATA controller will do).

If you want good write performance, instead of adding N GB of cache memory
to a disk controller, add N*5 or N*10 GB of system memory (DDR2 costs maybe
1/5th or 1/10th as much per GB, and the OS already uses main memory to
cache disk writes).

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS still crashing after patch

2008-05-02 Thread Marc Bevand
Rustam rustam at code.az writes:
 
 Didn't help. Keeps crashing.
 The worst thing is that I don't know where's the problem. More ideas on
 how to find problem?

Lots of CKSUM errors like you see are often indicative of bad hardware. Run 
memtest for 24-48 hours.

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS raidz write performance:what to expect from SATA drives on ICH9R

2008-04-19 Thread Marc Bevand
Pascal Vandeputte pascal_vdp at hotmail.com writes:
 
 I'm at a loss, I'm thinking about just settling for the 20MB/s write
 speeds with a 3-drive raidz and enjoy life...

As Richard Elling pointed out, the ~10ms per IO operation implies
seeking, or hardware/firmware problems. The mere fact you observed
a low 27 MB/s sequential write throughput on c1t0d0s0 indicates this
is not a ZFS pb.

Test other disks, another SATA controller, mobo, BIOS/firmware, etc.
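
To isolate the hardware, a raw sequential read of each disk on its own
(device names are just examples) is usually revealing:

  $ dd if=/dev/rdsk/c1t0d0p0 of=/dev/null bs=1024k count=1024
  $ iostat -Mnx 2      # in another terminal, watch asvc_t and the MB/s column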

As you pointed out, these disks should normally be capable of a
80-90 MB/s write throughput. Like you I would also expect ~100 MB/s
writes on a 3-drive raidz pool. As a datapoint, I see 150 MB/s writes
on a 4-drive raidz on a similar config (750GB SATA Samsung HD753LJ
disks, SB600 AHCI controller, low-end CPU).

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Setting 'kernelbase' to 0xd0000000 causes init(1M) to segfault

2008-03-30 Thread Marc Bevand
(Keywords: solaris hang zfs scrub heap space kernelbase marvell 88sx6081)

I am experiencing system hangs on a 32-bit x86 box with 1.5 GB RAM
running Solaris 10 Update 4 (with only patch 125205-07) during ZFS
scrubs of an almost full 3 TB zpool (6 disks on a AOC-SAT2-MV8
controller). I found out they are caused by memory contention in the
kernel heap: 'kstat vmem::heap' shows it is 97% full and 'echo
::threadlist -v | mdb -k' shows most threads are blocked in memory
allocation routines.
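
For reference, the kind of commands involved (roughly):

  $ kstat -p vmem::heap:mem_inuse vmem::heap:mem_total
  $ echo "::threadlist -v" | mdb -k | less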

When trying to give more memory to the kernel by passing '-B
kernelbase=0x80000000' to the kernel, it fails to boot up. The console
is flooded with this line repeating over and over:

  WARNING: init(1M) exited on fatal signal 9: restarting automatically

init segfaults the same way with any kernelbase value less than the 
default of 0xd0000000 (I tried 0x50000000, 0x80000000, 0x90000000,
0xc0000000, 0xcf000000). It works fine with values greater than or equal
to the default (I only tried 0xd0000000 and 0xd1000000).

How can I troubleshoot this crash ? Is it caused by the system being
unable to access / (standard UFS partition on a disk connected to a SATA
controller supported by the marvell88sx driver) ? Could some drivers
such as marvell88sx not support non-standard kernelbase values ?

Alternatively, can I make ZFS use less heap space ? I don't think the
ARC cache use heap space, or does it ? None of my other ZFS servers
have this heap space restriction because they are 64-bit.
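
(For what it's worth, the only ZFS knob I know of in that area is capping the
ARC via /etc/system -- the value below is only an example:)

  * cap the ARC at 512 MB (example value)
  set zfs:zfs_arc_max = 0x20000000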

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Setting kernelbase=0x80000000 only works under snv_83

2008-03-30 Thread Marc Bevand
For the record a parallel install of snv_83 on the same machine allows me to
set kernelbase to 0x80000000 with no pb, no init crash. This increased the
kernel heap size to 1912 MB (up from 632 MB with kernelbase=0xd0000000 in
sol10u4) and the system doesn't hang anymore. The max heap usage I have seen
so far is 1220 MB.

Is the init(1M) segfault pb known in sol10u4 ? Has/will it be fixed in u5 ?

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Backup-ing up ZFS configurations

2008-03-22 Thread Marc Bevand
Sachin Palav palavsachin27 at indiatimes.com writes:
 
 3. Currently there no command that prints the entire configuration of ZFS.

Well there _is_ a command to show all (and only) the dataset properties
that have been manually zfs set:

  $ zfs get -s local all

For the pool properties, zpool has no -s local option but you can
emulate the same behavior with grep:

  $ zpool get all $POOLNAME | egrep -v ' default$| -$'

These two commands plus zpool status output everything you need to 
restore a particular ZFS config from scratch.
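
So a crude way to back up the whole config is just to capture their output
(pool name and destination are only examples):

  $ (zfs get -s local all; zpool get all tank | egrep -v ' default$| -$'; \
     zpool status tank) > /backup/zfs-config.txt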

-marc


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] aclinherit property changes fast track

2008-03-20 Thread Marc Bevand
Mark Shellenbaum Mark.Shellenbaum at Sun.COM writes:
   # ls -V a
   -rw-r--r--+  1 root root   0 Mar 19 13:04 a
 owner@:--:--I:allow
 group@:--:--I:allow
  everyone@:--:--I:allow

The ls(1) manpage (as of snv_82) seems incorrect because it
says the last inheritance flag is F (Failed access):
  who:rwxpdDaARWcCos:fdinSF:allow|deny

Whereas your example shows I (Inherited).

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 7-disk raidz achieves 430 MB/s reads and 220 MB/s writes on a $1320 box

2008-03-17 Thread Marc Bevand
Brandon High bhigh at freaks.com writes:
 Do you have access to a Sil3726 port multiplier?

Nope. But AFAIK OpenSolaris doesn't support port multipliers yet. Maybe
FreeBSD does.

Keep in mind that three modern drives (334GB/platter) are all it takes to
saturate a SATA 3.0Gbps link.
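
Rough arithmetic behind that statement (the ~100 MB/s figure is a ballpark for
334GB/platter drives on their outer tracks):

  3.0 Gbps x 8/10 (8b/10b encoding) = ~300 MB/s usable per link
  3 drives x ~100 MB/s              = ~300 MB/s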

 It's also easier to use an external disk box like the CFI 8-drive eSATA tower
 than find a reasonable server case that can hold that many drives.

If you are willing to go cheap you can get something that holds 8 drives for
$70: buy a standard tower case with five internal 3.5 bays ($50), and one of
these enclosures that fit in two 5.25 bays but give you three 3.5 bays ($20).

-marc


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris Drivers (was: 7-disk raidz ac hieves 430 MB/s reads and 220 MB/s writes on a $1320 box)

2008-03-17 Thread Marc Bevand
Brandon High bhigh at freaks.com writes:
 [...]
 The lack of documentation for supported devices is a general complaint
 of mine with Solaris x86, perhaps better taken to the opensolaris-discuss
 list however.

I replied to all your questions in opensolaris-discuss.

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Max_Payload_Size

2008-03-15 Thread Marc Bevand
Anton B. Rang rang at acm.org writes:
 Looking at the AMD 690 series manual (well, the family
 register guide), the max payload size value is deliberately
 set to 0 to indicate that the chip only supports 128-byte
 transfers. There is a bit in another register which can be
 set to ignore max-payload errors.  Perhaps that's being set?

Perhaps. I briefly tried looking for the AMD 690 series manual or
datasheet, but they don't seem to be available to the public.

I think I'll go back to the 128-byte setting. I wouldn't want to
see errors happening under heavy usage even though my stress
tests were all successful (aggregate data rate of 610 MB/s
generated by reading the disks for 24+ hours, 6 million head
seeks performed by each disk, etc).

Thanks for your much appreciated comments.

-- 
Marc Bevand


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Max_Payload_Size (was Re: 7-disk raidz achieves 430 MB/s reads and...)

2008-03-14 Thread Marc Bevand
Anton B. Rang rang at acm.org writes:
 
 Be careful of changing the Max_Payload_Size parameter. It needs to match,
 and be supported, between all PCI-E components which might communicate with
 each other. You can tell what values are supported by reading the Device
 Capabilities Register and checking the Max_Payload_Size Supported bits.

Yes, the DevCap register of the SiI3132 indicates the maximum
supported payload size is 1024 bytes. This is confirmed by its
datasheet.

However I compiled lspci for Solaris and running it with -vv shows
only 2 PCI-E devices (other than the SiI3132 and an Ethernet
controller), which represent the AMD690G chipset's root PCI-E
ports for my 2 PCI-E slots (I think):

00:06.0 PCI bridge: ATI Technologies Inc RS690 PCI to PCI Bridge (PCI Express 
Port 2) (prog-if 00 [Normal decode])
[...]
Capabilities: [58] Express (v1) Root Port (Slot-), MSI 00
00:07.0 PCI bridge: ATI Technologies Inc RS690 PCI to PCI Bridge (PCI Express 
Port 3) (prog-if 00 [Normal decode])
[...]
Capabilities: [58] Express (v1) Root Port (Slot-), MSI 00

But each shows a Max_Payload_Size of 128 bytes in both the DevCap and
DevCtl registers. Clearly they are accepting 256-byte payloads, else I
wouldn't notice the big perf improvement when reading data from the
disks. Could it be possible that (1) an errata in the AMD690G makes its
DevCap register incorrectly report Max_Payload_Size=128 even though it
supports larger ones, and that (2) the AMD690G implements PCI-E
leniently and always accepts large payloads even when it is not
supposed to when DevCtl defines Max_Payload_Size=128 ?

 If you set a size which is too large, you might see PCI-E errors, data
 corruption, or hangs.

Ouch!

 The operating system is supposed to set this register properly for you.
 A quick glance at OpenSolaris code suggests that, while
 PCIE_DEVCAP_MAX_PAYLOAD_MASK is defined in pcie.h, it's not actually
 referenced yet, and in fact PCIE_DEVCAP seems to only be used for debugging.

I came to the same conclusion as you after grepping through the code.

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] 7-disk raidz achieves 430 MB/s reads and 220 MB/s writes on a $1320 box

2008-03-13 Thread Marc Bevand
I figured the following ZFS 'success story' may interest some readers here.

I was interested to see how much sequential read/write performance it would be 
possible to obtain from ZFS running on commodity hardware with modern features 
such as PCI-E busses, SATA disks, well-designed SATA controllers (AHCI, 
SiI3132/SiI3124). So I made this experiment of building a fileserver by 
picking each component to be as cheap as possible while not sacrificing 
performance too much.

I ended up spending $270 on the server itself and $1050 on seven 750GB SATA 
disks. After installing snv_82, a 7-disk raidz pool on this $1320 box is 
capable of:

- 220-250 MByte/s sequential write throughput (dd if=/dev/zero of=file 
bs=1024k)
- 430-440 MByte/s sequential read throughput (dd if=file of=/dev/null 
bs=1024k)

I did a quick test with a 7-disk striped pool too:

- 330-390 MByte/s seq. writes
- 560-570 MByte/s seq. reads (what's really interesting here is that the 
bottleneck is the platter speed of one of the disks at 81 MB/s: 81*7=567, ZFS 
truly runs at platter speed, as advertised, wow)

I used disks with 250GB-platter (Samsung HD753LJ; they have even higher 
density 640GB and 1TB models with 334GB/platter but they are respectively 
impossible to find or too expensive). I put 4 disks on the motherboard's 
integrated AHCI controller (SB600 chipset), 2 disks on a 2-port $20 PCI-E 1x 
SiI3132 controller, and the 7th disk on a $65 4-port PCI-X SiI3124 controller 
that I scavenged from another server (it's in a PCI slot, what a waste for a 
PCI-X card...). The rest is also dirty cheap: $65 Asus M2A-VM motherboard, $60 
dual-core Athlon 64 X2 4000+, with 1GB of DDR2 800, and a 400W PSU.

When testing the read throughput of individual disks with dd (roughly 81 to 97 
MB/s at the beginning of the platter -- I don't know why it varies so much 
between different units of the same model, additional seeks caused by 
reallocated sectors perhaps) I found out that an important factor influencing 
the max bandwidth of a PCI Express device such as the SiI3132 is the 
Max_Payload_Size setting, which can be set from 128 to 4096 bytes by writing 
to bits 7:5 of the Device Control Register (offset 08h) in the PCI Express 
Capability Structure (starting at offset 70h on the SiI3132):

  $ /usr/X11/bin/pcitweak -r 2:0 -h 0x78 # read the register
  0x2007

Bits 7:5 of 0x2007 are 000, which indicates a 128 bytes max payload size 
(000=128B, 001=256B, ..., 101=4096B, 110=reserved, 111=reserved). All OSes and 
drivers seem to keep it to this default value of 128 bytes. However in my 
tests, this payload size only allowed a practical unidirectional bandwidth of 
about 147 MB/s (59% of the 250 MB/s peak theoretical of PCI-E 1x). I changed 
it to 256 bytes:

  $ /usr/X11/bin/pcitweak -w 2:0 -h 0x78 0x2027
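
For reference, the new value is just the old one with bits 7:5 changed from
000 to 001:

  (0x2007 & ~0xE0) | (0x1 << 5) = 0x2027    # 001 = 256-byte max payload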

This increased the bandwidth to 175 MB/s. Better. At 512 bytes or above 
something strange happens: the bandwidth ridiculously drops below 5 or 50 MB/s 
depending on the PCI-E slot I use... Weird, I have no idea why. Anyway 175 
MB/s or even 145 MB/s is good enough for this 2-port SATA controller because 
the I/O bandwidth consumed by ZFS in my case is never above 62-63 MB/s per 
disk.

I wanted to share this Max_Payload_Size tidbit here because I didn't find any 
mention of anybody manually tuning this parameter on the Net. So in case some 
of you wonder why PCI-E devices seem limited to 60% of their peak theoretical 
bandwidth, now you know why.

Speaking of another bottleneck, my SiI3124 has a bottleneck of 87 MB/s per 
SATA port.

Back on the main topic, here are some system stats during 430-440 MB/s 
sequential reads from the ZFS raidz pool with dd (c0 is the AHCI controller, 
c1 = SiI3124, c2 = SiI3132).

zpool iostat -v 2
                 capacity     operations    bandwidth
pool           used  avail   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
tank          2.54T  2.17T  3.38K      0   433M      0
  raidz1      2.54T  2.17T  3.38K      0   433M      0
    c0t0d0s7      -      -  1.02K      0  61.9M      0
    c0t1d0s7      -      -  1.02K      0  61.9M      0
    c0t2d0s7      -      -  1.02K      0  62.0M      0
    c0t3d0s7      -      -  1.02K      0  62.0M      0
    c1t0d0s7      -      -  1.01K      0  61.9M      0
    c2t0d0s7      -      -  1.02K      0  62.0M      0
    c2t1d0s7      -      -  1.02K      0  61.9M      0
------------  -----  -----  -----  -----  -----  -----

iostat -Mnx 2
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
 1044.8    0.5   61.7    0.0  0.1 14.3    0.1   13.6   4  81 c0t0d0
 1043.3    0.0   61.7    0.0  0.1 15.4    0.1   14.7   5  84 c0t1d0
 1043.3    0.0   61.7    0.0  0.1 14.7    0.1   14.1   5  82 c0t2d0
 1044.8    0.0   61.8    0.0  0.1 13.0    0.1   12.5   4  76 c0t3d0
 1042.3    0.0   61.7    0.0 13.9  0.8   13.3    0.8  83  83 c1t0d0
 1041.8    0.0   61.7    0.0 11.5  0.7   11.1

Re: [zfs-discuss] 'zfs create' hanging

2008-03-11 Thread Marc Bevand
Lida Horn Lida.Horn at Sun.COM writes:
 
 I think you jumped to a conclusion that is probably not warranted.

You are right. I read his error message too hastily and thought I
recognized a pattern -- I have been a victim of bug 6587133 myself.
And to top this off I gave him the wrong patch number.

To answer Paul's question about how to upgrade to snv_73 (if you
still want to upgrade for another reason): actually I would recommend
the latest SXDE (Solaris Express Developer Edition 1/08, based
on build 79). Boot from the install disc, and choose the Upgrade
Install option.

-- 
Marc Bevand

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 'zfs create' hanging

2008-03-09 Thread Marc Bevand
Paul Raines raines at nmr.mgh.harvard.edu writes:
 
 Mar  9 03:22:16 raidsrv03 sata: NOTICE: 
 /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
 Mar  9 03:22:16 raidsrv03  port 6: device reset
 [...]
 
 The above repeated a few times but now seems to have stopped.
 Running 'hd -c' shows all disks as ok.  But it seems like I do have
 a disk problem.  But since everything is redundant (zraid) why a
 failed disk should lock up the machine like I saw I don't understand
 unless there is a some bigger issue.

It looks like your Solaris 10U4 install on a Thumper is affected by:
http://bugs.opensolaris.org/view_bug.do?bug_id=6587133
Which was discussed here:
http://opensolaris.org/jive/thread.jspa?messageID=189256
http://opensolaris.org/jive/thread.jspa?messageID=163460

Apply T-PATCH 127871-02, or upgrade to snv_73, or wait for 10U5.

-- 
Marc Bevand

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Performance Issue

2008-02-06 Thread Marc Bevand
William Fretts-Saxton william.fretts.saxton at sun.com writes:
 
 I disabled file prefetch and there was no effect.
 
 Here are some performance numbers.  Note that, when the application server
 used a ZFS file system to save its data, the transaction took TWICE as long.
 For some reason, though, iostat is showing 5x as much disk
 writing (to the physical disks) on the ZFS partition.  Can anyone see a
 problem here?

Possible explanation: the Glassfish applications are using synchronous
writes, causing the ZIL (ZFS Intent Log) to be intensively used, which
leads to a lot of extra I/O. Try to disable it:

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29

Since disabling it is not recommended, if you find out it is the cause of your
perf problems, you should instead try to use a SLOG (separate intent log, see
above link). Unfortunately your OS version (Solaris 10 8/07) doesn't support
SLOGs, they have only been added to OpenSolaris build snv_68:

http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on
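
For completeness, the usual way to test that theory on this vintage of bits is
the (unsupported) zil_disable tunable, and the proper fix on snv_68+ is a
dedicated log device -- the device and pool names below are made up:

  * /etc/system -- experiment only, do not leave this in place:
  set zfs:zil_disable = 1

  $ zpool add mypool log c2t0d0     # snv_68 or later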

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Performance Issue

2008-02-06 Thread Marc Bevand
Neil Perrin Neil.Perrin at Sun.COM writes:
 
 The ZIL doesn't do a lot of extra IO. It usually just does one write per 
 synchronous request and will batch up multiple writes into the same log
 block if possible.

Ok. I was wrong then. Well, William, I think Marion Hakanson has the
most plausible explanation. As he suggests, experiment with zfs set
recordsize=XXX to force the filesystem to use small records. See
the zfs(1) manpage.
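
Something along these lines (dataset name and value are just placeholders;
recordsize only affects files written after the change):

  $ zfs set recordsize=8k tank/glassfish-data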

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Performance Issue

2008-02-05 Thread Marc Bevand
William Fretts-Saxton william.fretts.saxton at sun.com writes:
 
 Some more information about the system.  NOTE: Cpu utilization never
 goes above 10%.
 
 Sun Fire v40z
 4 x 2.4 GHz proc
 8 GB memory
 3 x 146 GB Seagate Drives (10k RPM)
 1 x 146 GB Fujitsu Drive (10k RPM)

And what version of Solaris or what build of OpenSolaris are you using ?
Do you know if your application uses synchronous I/O transactions ?
Have you tried disabling ZFS file-level prefetching (just as an
experiment) ? See:

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#File-Level_Prefetching

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS problem after disk faliure

2008-01-10 Thread Marc Bevand
Robert slask at telia.com writes:
 
 I simply need to rename/remove one of the erronous c2d0 entries/disks in
 the pool so that I can use it in full again, since at this time I can't
 reconnect the 10th disk in my raid and if one more disk fails all my
 data would be lost (4 TB is a lot of disk to waste!)

You see an erroneous c2d0 device that you claim is in reality c3d0...
If I were you I would try:

  $ zpool replace [-f] rz2pool c2d0 c3d0

The -f option may or may not be necessary.
Also, what disk devices does this command display ?:

  $ format

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-11-15 Thread Marc Bevand
can you guess? billtodd at metrocast.net writes:
 
 You really ought to read a post before responding to it:  the CERN study
 did encounter bad RAM (and my post mentioned that) - but ZFS usually can't
 do a damn thing about bad RAM, because errors tend to arise either
 before ZFS ever gets the data or after it has already returned and checked
 it (and in both cases, ZFS will think that everything's just fine).

According to the memtest86 author, corruption most often occurs at the moment 
memory cells are written to, by causing bitflips in adjacent cells. So if a 
disk DMAs data into RAM, the corruption occurs as the DMA operation writes to 
the memory cells, and ZFS then verifies the checksum, ZFS will detect the 
corruption.

Therefore ZFS is perfectly capable of detecting (and is even likely to detect) 
memory corruption during simple read operations from a ZFS pool.

Of course there are other cases where neither ZFS nor any other checksumming 
filesystem is capable of detecting anything (e.g. the sequence of events: data 
is corrupted, checksummed, written to disk).

-- 
Marc Bevand

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS 60 second pause times to read 1K

2007-10-09 Thread Marc Bevand
Michael m.kucharski at bigfoot.com writes:
 
 Excellent. 
 
 Oct  9 13:36:01 zeta1 scsi: [ID 107833 kern.warning] WARNING:
 /pci@2,0/pci1022,7458@8/pci11ab,11ab@1/disk@2,0 (sd13):
 Oct  9 13:36:01 zeta1   Error for Command: read               Error Level: Retryable
 
 Scrubbing now.

This is only a part of the complete error message. Look a few lines above this
one. If you see something like:

  sata: [ID 801593 kern.notice] NOTICE: /[EMAIL PROTECTED],0/pci8086,[EMAIL 
PROTECTED]/pci11ab,[EMAIL PROTECTED]:
   port 1: device reset
  sata: [ID 801593 kern.notice] NOTICE: /[EMAIL PROTECTED],0/pci8086,[EMAIL 
PROTECTED]/pci11ab,[EMAIL PROTECTED]:
   port 1: link lost
  sata: [ID 801593 kern.notice] NOTICE: /[EMAIL PROTECTED],0/pci8086,[EMAIL 
PROTECTED]/pci11ab,[EMAIL PROTECTED]:
   port 1: link established
  marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 1:
  marvell88sx: [ID 517869 kern.info]   device disconnected
  marvell88sx: [ID 517869 kern.info]   device connected

Then it means you are probably affected by bug
http://bugs.opensolaris.org/view_bug.do?bug_id=6587133

This bug is fixed in Solaris Express build 73 and above, and will likely be
fixed in Solaris 10 Update 5. The workaround is to disable SATA NCQ and queuing
by adding set sata:sata_func_enable = 0x4 to /etc/system.

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] About bug 6486493 (ZFS boot incompatible with the SATA framework)

2007-10-03 Thread Marc Bevand
I would like to test ZFS boot on my home server, but according to bug 
6486493 ZFS boot cannot be used if the disks are attached to a SATA
controller handled by a driver using the new SATA framework (which
is my case: driver si3124). I have never heard of someone having
successfully used ZFS boot with the SATA framework, so I assume this
bug is real and everybody out there playing with ZFS boot is doing so
with PATA controllers, or SATA controllers operating in compatibility
mode, or SCSI controllers, right ?

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Rule of Thumb for zfs server sizing with (192) 500 GB SATA disks?

2007-09-26 Thread Marc Bevand
David Runyon david.runyon at sun.com writes:

 I'm trying to get maybe 200 MB/sec over NFS for large movie files (need

(I assume you meant 200 Mb/sec with a lower case b.)

 large capacity to hold all of them). Are there any rules of thumb on how 
 much RAM is needed to handle this (probably RAIDZ for all the disks) with
 zfs, and how large a server should be used ? 

If you have a handful of users streaming large movie files over NFS,
RAM is not going to be a bottleneck. One of my ultra low-end server
(Turion MT-37 2.0 GHz, 512 MB RAM, five 500-GB SATA disk in a raidz1,
consumer-grade Nvidia GbE NIC) running an old Nevada b55 install can
serve large files at about 650-670 Mb/sec over NFS. CPU is the
bottleneck at this level. The same box with a slightly better CPU
or a better NIC (with a less CPU-intensive driver that doesn't generate
45k interrupt/sec) would be capable of maxing out the GbE link.

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-12 Thread Marc Bevand
Pawel Jakub Dawidek pjd at FreeBSD.org writes:
 
 This is how RAIDZ fills the disks (follow the numbers):
 
   Disk0   Disk1   Disk2   Disk3
 
   D0  D1  D2  P3
   D4  D5  D6  P7
   D8  D9  D10 P11
   D12 D13 D14 P15
   D16 D17 D18 P19
   D20 D21 D22 P23
 
 D is data, P is parity.

This layout assumes of course that large stripes have been written to
the RAIDZ vdev. As you know, the stripe width is dynamic, so it is
possible for a single logical block to span only 2 disks (for those who
don't know what I am talking about, see the red block occupying LBAs
D3 and E3 on page 13 of these ZFS slides [1]).

To read this logical block (and validate its checksum), only D_0 needs 
to be read (LBA E3). So in this very specific case, a RAIDZ read
operation is as cheap as a RAID5 read operation. The existence of these
small stripes could explain why RAIDZ doesn't perform as bad as RAID5
in Pawel's benchmark...
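
To picture it, a narrow stripe looks roughly like this (same notation as the
diagram above; which disks end up holding the parity and data sectors depends
on the block's offset):

    Disk0   Disk1   Disk2   Disk3

    P       D0      .       .       <- a logical block needing only 1 data sector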

[1] http://br.sun.com/sunnews/events/2007/techdaysbrazil/pdf/eric_zfs.pdf

-marc


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 7zip compression?

2007-07-29 Thread Marc Bevand
MC rac at eastlink.ca writes:
 
 Obviously 7zip is far more CPU-intensive than anything in use with ZFS
 today.  But maybe with all these processor cores coming down the road,
 a high-end compression system is just the thing for ZFS to use.

I am not sure you realize the scale of things here. Assuming the worst case: 
that lzjb (the default ZFS compression algorithm) performs as badly as lha in 
[1], 7zip would compress your data only 20-30% better at the cost of being 
4x-5x slower!

Also, in most cases, the bottleneck in data compression is the CPU, so 
switching to 7zip would reduce the I/O throughput by about 4x.
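
Concretely, using those worst-case numbers (and arbitrary round figures for
the baseline):

  - a dataset that lzjb shrinks to 500 GB would come out around 350-400 GB
    with a 7zip-class algorithm (20-30% better), but
  - a pool that streams 200 MB/s with lzjb would drop to roughly 40-50 MB/s
    (4x-5x slower), with the CPU pegged the whole time.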

[1] http://warp.povusers.org/ArchiverComparison

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mysterious corruption with raidz2 vdev (1 checksum err on disk, 2 on vdev?)

2007-07-27 Thread Marc Bevand
Matthew Ahrens Matthew.Ahrens at sun.com writes:
 
 So the errors on the raidz2 vdev indeed indicate that at least 3 disks below 
 it gave the wrong data for a those 2 blocks; we just couldn't tell which 3+ 
 disks they were.

Something must be seriously wrong with this server. This is the first time I 
see an uncorrectable checksum error in a raidz2 vdev. I would suggest that 
Kevin run memtest86 or similar. It is more likely that bad data was written to 
the disks in the first place (due to flaky RAM/CPU/mobo/cables) than that 3+ 
disks corrupted data in the same stripe!

-marc


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: zfs send -i A B with B older than A

2007-06-19 Thread Marc Bevand
Matthew Ahrens Matthew.Ahrens at sun.com writes:

 True, but presumably restoring the snapshots is a rare event.

You are right, this would only happen in case of disaster and total
loss of the backup server.

 I thought that your onsite and offsite pools were the same size?  If so then 
 you should be able to fit the entire contents of the onsite pool in one of 
 the offsite ones.

Well, I simplified the example. In reality, the offsite pool is slightly
smaller due to a different number of disks and different sizes. 

 Also, if you can afford to waste some space, you could do something like:
 
 zfs send onsite@T-100 | ...
 zfs send -i T-100 onsite@t-0 | ...
 zfs send -i T-100 onsite@t-99 | ...
 zfs send -i T-99 onsite@t-98 | ...
 [...]

Yes, I thought about it. I might do this if the delta between T-100 and
T-0 is reasonable.

Oh, and while I am thinking about it, besides zfs send | gzip | gpg, and
zfs-crypto, a 3rd option would be to use zfs on top of a loficc device
(lofi compression & cryptography). I went to the project page, only to
realize that they haven't shipped anything yet.

Do you know how hard it would be to implement zfs send -i A B with B
older than A ? Or why hasn't this been done in the first place ? I am 
just being curious here, I can't wait for this feature anyway (even
though it would make my life soo much simpler).

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs send -i A B with B older than A

2007-06-18 Thread Marc Bevand
It occured to me that there are scenarios where it would be useful to be
able to zfs send -i A B where B is a snapshot older than A. I am
trying to design an encrypted disk-based off-site backup solution on top
of ZFS, where budget is the primary constraint, and I wish zfs send/recv
would allow me to do that. Here is why.

I have a server with 12 hot-swap disk bays. An onsite pool has been
created on 6 disks, where snapshots of the data to be backed up are
periodically taken. Two other offsite pools have been created on two
other sets of 6 disks, let's give them the names offsite-blue and
offsite-red (for use on blue/red, or even/odd, weeks). At least one of
the offsite pools is always at the off-site location, while the other
one is either in transit or in the server. Every week a script is
basically compressing and encrypting the last few snapshots (T-2, T-1,
T-0) from onsite to offsite-XXX. Here is an example:

  $ rm /offsite-blue/*
  $ zfs send onsite@T-2 | gzip | gpg -c > /offsite-blue/T-2.full.gz.gpg
  $ zfs send -i T-2 onsite@T-1 | gzip | gpg -c > /offsite-blue/T-1.incr.gz.gpg
  $ zfs send -i T-1 onsite@T-0 | gzip | gpg -c > /offsite-blue/T-0.incr.gz.gpg

Then offsite-blue is zfs export'ed, sent to the off-site location,
offsite-red is retrieved from the off-site location, sent back on-site,
ready to be used for the next week. My proof-of-concept tests show it
works OK, but 2 details are annoying:

  o In order to restore the latest snapshot T-0, all the zfs streams,
T-2, T-1 and T-0, have to be decrypted, then zfs receive'd. It is
slow and inconvenient.
  o My example only backs up the last 3 snapshots, but ideally I would
like to fit as many as possible in the offsite pool. However, because
of the unpredictable compression efficiency, I can't tell which
snapshot I should start from when creating the first full stream.

These 2 problems would be non-existent if one could zfs send -i A B
with B older than A:

  $ zfs send onsite@T-0 | gzip | gpg -c > /offsite-blue/T-0.full.gz.gpg
  $ zfs send -i T-0 onsite@T-1 | gzip | gpg -c > /offsite-blue/T-1.incr.gz.gpg
  $ zfs send -i T-1 onsite@T-2 | gzip | gpg -c > /offsite-blue/T-2.incr.gz.gpg
  $ ... # continue forever, kill zfs(1m) when offsite-blue is 90% full

I have looked at the code and the restriction "B must be earlier than A"
is enforced in dmu_send.c:dmu_sendbackup() [1]. It looks like the code 
could be reworked to remove it.

Of course, when zfs-crypto ships, it will simplify a lot of things.
I could just always send incremental streams and receive them directly
on the encrypted pool, and directly manage the snapshots rotation by
zfs destroy'ing the old ones, etc.

[1] 
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/dmu_send.c#232

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: 6 disk raidz2 or 3 stripe 2 way mirror

2007-06-17 Thread Marc Bevand
Joe S js.lists at gmail.com writes:
 
 I'm going to create 3x 2-way mirrors. I guess I don't really *need* the
 raidz at this point. My biggest concern with raidz is getting locked into
 a configuration i can't grow out of. I like the idea of adding more
 2 way mirrors to a pool.

The raidz2 option will *not* restrict your possibilities of expansion.

For example, it is perfectly possible to add a mirror to a pool consisting
of a single raidz2 vdev.
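
(With made-up device names:)

  $ zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0
  ... and later, when more space is needed:
  $ zpool add tank mirror c1t0d0 c1t1d0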

Plus, compared to 3 2-disk mirrors, a 6-disk raidz2 offers more usable space 
and is more reliable.

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss