Re: [zfs-discuss] A resilver record?

2011-03-20 Thread taemun
769G resilvered on a 500G drive? I'm guessing there was a whole bunch of
activity (and probably snapshot creation) happening alongside the resilver.

On 20 March 2011 18:57, Ian Collins i...@ianshome.com wrote:

  Has anyone seen a resilver longer than this for a 500G drive in a raidz2
 vdev?

 scrub: resilver completed after 169h25m with 0 errors on Sun Mar 20
 19:57:37 2011
  c0t0d0  ONLINE   0 0 0  769G resilvered

 and I told the client it would take 3 to 4 days!

 :)

 --
 Ian.

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] External SATA drive enclosures + ZFS?

2011-02-27 Thread taemun
On 28 February 2011 02:06, Edward Ned Harvey 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

 Take that a step further.  Anything external is unreliable.  I have used
 USB, eSATA, and Firewire external devices.  They all work.  The only
 question is for how long.


eSATA has no need for any interposer chips between a modern SATA chipset on
the motherboard and a SATA hard drive. You can buy cables with appropriate
ends for this. There is no reason why the data side of an eSATA drive should
be any more likely to fail than internal SATA (within bounds for cable length,
etc.). At least you can be assured that the drive will receive a flush request
at appropriate times.

I can't argue about the external power supplies, other than to say that many
external cases these days use a single +12V rail, and have a +5V regulator
on board. These are a lot better because they allow for easy replacement of
the power supply. External units which use a combined +12V/+5V power supply
are often rendered useless by a power supply failure.

Cheers,
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] deduplication requirements

2011-02-07 Thread taemun
On 6 February 2011 01:34, Michael michael.armstr...@gmail.com wrote:

 Hi guys,

 I'm currently running 2 zpools each in a raidz1 configuration, totally
 around 16TB usable data. I'm running it all on an OpenSolaris based box with
 2gb memory and an old Athlon 64 3700 CPU, I understand this is very poor and
 underpowered for deduplication, so I'm looking at building a new system, but
 wanted some advice first, here is what i've planned so far:

 Core i7 2600 CPU
 16gb DDR3 Memory
 64GB SSD for ZIL (optional)


http://ark.intel.com/Product.aspx?id=52213
The desktop Core i* range doesn't support ECC RAM at all, which could
potentially be a pool breaker if you get a flipped bit in the wrong place (a
significant metadata block).
Just something to keep in mind. Also, Intel have issued a recall (ish) for
all of the 6-series chipsets released so far: the PLL unit for the 3Gbit
SATA ports on the chipset is driven too hard and will likely degrade over
time (5~15% failure rate over three years). They are talking about a
March~April timeframe for fixed parts in the channel. If you don't plan on
using the 3Gbit SATA ports, then you're fine.

Intel will make socket 1155 Xeons at some point, i.e.
http://en.wikipedia.org/wiki/List_of_future_Intel_microprocessors#.22Sandy_Bridge.22_.2832_nm.29_8
They support ECC (just check for a specific QVL after launch, DDR3 ECC
isn't necessarily the only thing you need to look for). I think the Feb 20
release date may have been pushed for the chipset respin.

Cheers,
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Replace block devices to increase pool size

2011-02-06 Thread taemun
If autoexpand = on, then yes.
zpool get autoexpand pool
zpool set autoexpand=on pool

The expansion is vdev specific, so if you replaced the mirror first, you'd
get that much (the extra 2TB) without touching the raidz.
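
In full, the cycle looks something like this (pool and device names below are
placeholders, so adjust to your layout):

zpool set autoexpand=on pool       # grow each vdev automatically once all its members are bigger
zpool replace pool c1t1d0 c1t5d0   # swap one old disk for a new, larger one
zpool status pool                  # wait for the resilver to finish before the next replace
zpool list pool                    # after the last replace, the extra space should show up

If autoexpand was left off, newer builds can also grow an already-replaced
disk with zpool online -e pool device.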

Cheers,

On 7 February 2011 01:41, Achim Wolpers achim...@googlemail.com wrote:

 Hi!

 I have a zpool built up from two vdevs (one mirror and one raidz). The
 raidz is built up from 4x1TB HDs. When I successively replace each 1TB
 drive with a 2TB drive will the capacity of the raidz double after the
 last block device is replaced?

 Achim



 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and spindle speed (7.2k / 10k / 15k)

2011-02-02 Thread taemun
Uhm. Higher RPM = higher linear speed of the head above the platter = higher
throughput. If the bit pitch (i.e. the size of each bit on the platter) is the
same, then surely a higher linear speed corresponds with a larger number of
bits per second?

So if "all other things being equal" includes the bit density and the radius
to the edge of the media, then surely higher RPM = higher throughput?
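
As a rough back-of-the-envelope (the per-revolution figure below is made up
purely for illustration, not from any datasheet):

awk 'BEGIN {
  kbytes_per_rev = 1024;                      # hypothetical outer-track capacity at fixed bit density
  split("5400 7200 10000 15000", rpm, " ");
  for (i = 1; i <= 4; i++)
    printf("%5d rpm -> %5.0f MB/s sequential at the outer edge\n",
           rpm[i], kbytes_per_rev * rpm[i] / 60 / 1024);
}'

Double the spindle speed and, at the same bit density, the sequential rate
doubles with it.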

Cheers,

On 3 February 2011 14:10, Mark Sandrock mark.sandr...@oracle.com wrote:


 On Feb 2, 2011, at 8:10 PM, Eric D. Mudama wrote:

   All other
  things being equal, the 15k and the 7200 drive, which share
  electronics, will have the same max transfer rate at the OD.

 Is that true? So the only difference is in the access time?

 Mark
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache

2011-01-28 Thread taemun
Comments below.

On 29 January 2011 00:25, Edward Ned Harvey 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

 This was something interesting I found recently.  Apparently for flash
 manufacturers, flash hard drives are like the pimple on the butt of the
 elephant. A vast majority of the flash production in the world goes into
 devices like smartphones, cameras, tablets, etc.  Only a slim minority goes
 into hard drives.

http://www.eetimes.com/electronics-news/4206361/SSDs--Still-not-a--solid-state--business
~6.1 percent for 2010, from that estimate (first thing that Google turned
up). Not denying what you said, I just like real figures rather than random
hearsay.


 As a result, they optimize for these other devices, and
 one of the important side effects is that standard flash chips use an 8K
 page size.  But hard drives use either 4K or 512B.

http://www.anandtech.com/Show/Index/2738?cPage=19&all=False&sort=0&page=5
Terms: page means the smallest data size that can be read or programmed
(written). Block means the smallest data size that can be erased. SSDs
commonly have a page size of 4KiB and a block size of 512KiB. I'd take
Anandtech's word on it.

There is probably some variance across the market, but for the vast
majority, this is true. Wikipedia's
http://en.wikipedia.org/wiki/Flash_memory#NAND_memories says that common
page sizes are 512B, 2KiB, and 4KiB.

 The SSD controller secretly remaps blocks internally, and aggregates small
 writes into a single 8K write, so there's really no way for the OS to know
 if it's writing to a 4K block which happens to be shared with another 4K
 block in the 8K page.  So it's unavoidable, and whenever it happens, the
 drive can't simply write.  It must read modify write, which is obviously
 much slower.

This is true, but for 512B-to-4KiB aggregation, as the 8KiB page doesn't
exist. As for writing when everything is full and you need to do an
erase: well, this is where TRIM is helpful.

Also if you look up the specs of a SSD, both for IOPS and/or sustainable
 throughput...  They lie.  Well, technically they're not lying because
 technically it is *possible* to reach whatever they say.  Optimize your
 usage patterns and only use blank drives which are new from box, or have
 been fully TRIM'd.  Pt...  But in my experience, reality is about 50%
 of
 whatever they say.

 Presently, the only way to deal with all this is via the TRIM command,
 which
 cannot eliminate the read/modify/write, but can reduce their occurrence.
 Make sure your OS supports TRIM.  I'm not sure at what point ZFS added
 TRIM,
 or to what extent...  Can't really measure the effectiveness myself.

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6957655

 Long story short, in the real world, you can expect the DDRDrive to crush
 and shame the performance of any SSD you can find.  It's mostly a question
 of PCIe slot versus SAS/SATA slot, and other characteristics you might care
 about, like external power, etc.

Sure, DDR RAM will have a much quicker sync write time. This isn't really a
surprising result.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Migrating zpool to new drives with 4K Sectors

2011-01-06 Thread taemun
zpool replace will copy across onto the new disk with the same old ashift=9,
whereas you want ashift=12 for 4KB-sector drives (sector size = 2^ashift).

You'd need to make a new pool (or add a vdev to an existing pool) with the
modified tools in order to get proper performance out of 4KB drives.
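
For reference, you can see what an existing pool's vdevs are using with zdb,
and on a zpool build whose create accepts an ashift property (the stock
binaries of this era don't, hence the modified tools above) you can ask for
4KB alignment up front (pool and device names are placeholders):

zdb -C tank | grep ashift        # ashift recorded for each top-level vdev
zpool create -o ashift=12 newtank raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0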

On 7 January 2011 17:43, Matthew Angelo bang...@gmail.com wrote:

 Hi ZFS Discuss,

 I have a 8x 1TB RAIDZ running on Samsung 1TB 5400rpm drives with 512b
 sectors.

 I will be replacing all of these with 8x Western Digital 2TB drives
 with support for 4K sectors.  The replacement plan will be to swap out
 each of the 8 drives until all are replaced and the new size (~16TB)
 is available with a `zfs scrub`.

 My question is, how do I do this and also factor in the new 4k sector
 size?  or should I find a 2TB drive that still uses 512b sectors?


 Thanks
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] very slow boot: stuck at mounting zfs filesystems

2010-12-08 Thread taemun
Dedup? Taking a long time to boot after a hard reboot after a lockup?

I'll bet that it hard locked whilst deleting some files or a dataset that
was dedup'd. After the delete is started, it spends *ages* cleaning up the
DDT (the table containing a list of dedup'd blocks). If you hard lock in the
middle of this cleanup, then the DDT isn't valid to anything. The next
mount attempt on that pool will do this operation for you, which will take
an inordinate amount of time. My pool spent *eight days* (iirc) in limbo,
waiting for the DDT cleanup to finish. Once it did, it wrote out a shedload
of blocks and then everything was fine. This was for a zfs destroy of a
900GB, 64KiB-block dataset, over 2x 8-wide raidz vdevs.
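
If you want to gauge how bad it is likely to be before issuing the destroy,
something like this gives a rough idea of how many DDT entries will have to
be walked ("tank" is a placeholder):

zdb -DD tank                   # per-pool dedup table histogram and entry counts
zdb -DD tank | awk '/entries/ { sum += $2 } END { print sum, "DDT entries in total" }'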

Unfortunately, raidz is of course slower for random reads than a set of
mirrors. The raidz/mirror hybrid allocator available in snv_148+ is somewhat
of a workaround for this, although I've not seen comprehensive figures for
the gain it gives -
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6977913
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 3TB HDD in ZFS

2010-12-06 Thread taemun
On 6 December 2010 21:43, Fred Liu fred_...@issi.com wrote:

 3TB HDD needs UEFI not the traditional BIOS and OS support.



 Fred


Fred:
http://www.anandtech.com/show/3858/the-worlds-first-3tb-hdd-seagate-goflex-desk-3tb-review/2

Namely:
"a feature of GPT is 64-bit LBA support. With 64-bit LBAs the largest
512-byte sector drive we can address is 9.4ZB"

"GPT drives are supported as data drives in all x64 versions of Windows as
well as Mac OS X and Linux. You’ll note that I said data and not boot drives.
In order to boot to a GPT partition, you need hardware support. I just
mentioned that your PC’s BIOS looks at LBA 0 for the MBR. Your BIOS does not
support booting to GPT partitioned drives. GPT is however supported by
systems that implement a newer BIOS alternative: Intel’s Extensible Firmware
Interface (EFI)."

I would imagine that anyone looking at this list didn't want the 3TB drive
as a boot drive (rpool), but as a data drive.

Cheers,
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 3TB HDD in ZFS

2010-12-06 Thread taemun
On 7 December 2010 13:25, Brandon High bh...@freaks.com wrote:

 There shouldn't be any problems using a 3TB drive with Solaris, so
 long as you're using a 64-bit kernel. Recent versions of zfs should
 properly recognize the 4k sector size as well.


I think you'll find that these 3TB, 4KiB physical sector drives are still
exporting logical sectors of 512B (this is what Anandtech has indicated,
anyway). ZFS assumes that the drive's logical sectors are directly mapped to
physical sectors, and will create an ashift=9 vdev for the drives.

Hence why enthusiasts are making their own zpool binaries with a hardcoded
ashift=12 so they can create pools that actually function beyond 20 random
writes per second with these drives:
http://digitaldj.net/2010/11/03/zfs-zpool-v28-openindiana-b147-4k-drives-and-you/

Cheers,
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 3TB HDD in ZFS

2010-12-06 Thread taemun
On 7 December 2010 13:55, Tim Cook t...@cook.ms wrote:

 It's based on a jumper on most new drives.

Can you back that up with anything? I've never seen anything but requests
for a jumper that forces the firmware to export 4KiB sectors.

WD EARS at launch provided the ability to force the requested LBA to be
written to disk as LBA + 1 (a workaround to get Windows XP to make aligned
partitions), as per http://www.anandtech.com/show/2888/2

On 7 December 2010 13:57, Brandon High bh...@freaks.com wrote:

 It depends on the drive. According to Anandtech, the WD drives use 4k
 internally but report 512b sectors.

And hence, will incorrectly create an ashift=9 vdev.

 They also report that the Seagate
 GoFlex uses 512b sectors internally but reports 4k sectors through
 its desktop dock.

Sorry, you're right. If they're using 512B internally, this is a non-event
here. I think that most folks talking about 3TB drives in this list are
looking for internal drives. That the desktop dock (USB, I presume)
coalesces blocks doesn't really make any difference.

Waiting for a 3TB drive that properly reports it capabilities to become
 available is probably the best course of action.


Buying 4KiB physical sector drives which export 512B sectors is fine, as
long as you use a modified binary which has a hardcoded ashift=12 value.
Otherwise, you're asking for trouble (and terrible performance).

Cheers,
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Seagate ST32000542AS and ZFS perf

2010-12-01 Thread taemun
On 2 December 2010 16:17, Miles Nordin car...@ivy.net wrote:

  t == taemun  tae...@gmail.com writes:

 t I would note that the Seagate 2TB LP has a 0.32% Annualised
 t Failure Rate.

 bullshit.


Apologies, should have read: Specified Annualised Failure Rate.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Seagate ST32000542AS and ZFS perf

2010-11-29 Thread taemun
On 29 November 2010 20:39, GMAIL piotr.jasiukaj...@gmail.com wrote:

 Does anyone use Seagate ST32000542AS disks with ZFS?

 I wonder if the performance is not that ugly as with WD Green WD20EARS
 disks.


I'm using these drives for one of the vdevs in my pool. The pool was created
with ashift=12 (zpool binary from
http://digitaldj.net/2010/11/03/zfs-zpool-v28-openindiana-b147-4k-drives-and-you/),
which limits the minimum block size to 4KB, the same as the physical block
size on these drives. I haven't noticed any performance issues. These
obviously aren't 7200rpm drives, so you can't expect them to match those in
random IOPS.

I'm also using a set of Samsung HD204UI's in the pool.

I would urge you to consider a 2^n + p number of disks. For raidz, p = 1, so
an acceptable number of total drives is 3, 5 or 9.  raidz2 has two parity
drives, hence 4, 6 or 10. These vdev widths ensure that the data blocks are
divided into nicer sizes. A 128KB block in a 9-wide raidz vdev will be split
into 128/(9-1) = 16KB chunks.

Cheers,
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Recomandations

2010-11-29 Thread taemun
On 29 November 2010 15:03, Erik Trimble erik.trim...@oracle.com wrote:

 I'd have to re-look at the ZFS Best Practices Guide, but I'm pretty sure
 the recommendation of 7, 9, or 11 disks was for a raidz1, NOT a raidz2.  Due
 to #5 above, best performance comes with an EVEN number of data disks in any
 raidZ, so a write to any disks is always a full portion of the chunk, rather
 than a partial one (that sounds funny, but trust me).  The best balance of
 size, IOPs, and throughput is found in the mid-size raidZ(n) configs, where
 there are 4, 6 or 8 data disks.


Let the maximum block size of 128KiB = s

If the number of disks in a raidz vdev = n, p = number of parity disks used
and d = data drives.

Hence, n = d + p

So, for some given numbers of d (s/d in KiB):

  d    s/d
  1    128
  2    64
  3    42.67
  4    32
  5    25.6
  6    21.33
  7    18.29
  8    16
  9    14.22
 10    12.8

Hence, for a raidz vdev with a width of 7, d = 6; s/d = 21.33KiB. This isn't
an ideal block size by any stretch of the imagination. Same thing for a
width of 11, d = 10, s/d = 12.8KiB.
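
The table falls straight out of s/d; a quick sketch if you want to try other
widths, parities or block sizes:

awk 'BEGIN {
  s = 128;                        # maximum recordsize, KiB
  p = 1;                          # parity disks: 1 for raidz, 2 for raidz2
  for (n = 3; n <= 11; n++) {     # total vdev width
    d = n - p;
    printf("width %2d -> %2d data disks -> %6.2f KiB per disk per block\n", n, d, s / d);
  }
}'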

What you were aiming for: for ideal performance, one should keep the vdev
width to the form 2^x + p. So, for raidz: 2, 3, 5, 9, 17. raidz2: 3, 4, 6,
10, 18, etc.

Cheers,
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Seagate ST32000542AS and ZFS perf

2010-11-29 Thread taemun
On 30 November 2010 03:09, Krunal Desai mov...@gmail.com wrote:

  I assume it either:

 1. does a really good job of 512-byte emulation that results in little
 to no performance degradation
 (
 http://consumer.media.seagate.com/2010/06/the-digital-den/advanced-format-drives-with-smartalign/
 references test data)

2. dynamically looks to see if it even needs to do anything; if the
 host OS is sending it requests that are all 4k-aware/aligned, all is well.

My understanding is that this is merely saying that it will *align* the data
correctly, with Windows XP, regardless of where Windows XP asks for the
first sector to be. This has nothing to do with 512B random writes.


 Though, the power-on hours count seems rather low
 for me...8760 hours, or just 1 year of 24/7 operation.

Not sure where you got this figure from; the Barracuda Green (
http://www.seagate.com/docs/pdf/datasheet/disc/ds1720_barracuda_green.pdf) is
a different drive to the one we've been talking about in this thread (
http://www.seagate.com/docs/pdf/datasheet/disc/ds_barracuda_lp.pdf).

I would note that the Seagate 2TB LP has a 0.32% Annualised Failure Rate,
i.e. in a given sample (which aren't overheating, etc.) 32 from every 10,000
should fail. I *believe* that the Power-On Hours figure on the Barracuda Green
is simply saying that it is designed for 24/7 usage. It's a per-year number. I
couldn't imagine them specifying the number of hours before failure like
that, just below an AFR of 0.43%.

Cheers,
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ashift and vdevs

2010-11-26 Thread taemun
On 27 November 2010 08:05, Krunal Desai mov...@gmail.com wrote:

 One new thought occurred to me; I know some of the 4K drives emulate 512
 byte sectors, so to the host OS, they appear to be no different than other
 512b drives. With this additional layer of emulation, I would assume that
 ashift wouldn't be needed, though I have read reports of this affecting
 performance. I think I'll need to confirm what drives do what exactly and
 then decide on an ashift if needed.


Consider that for a drive with 4KB internal sectors and a 512B external
interface, a request for a 512B write will result in the drive reading 4KB,
modifying it (putting the new 512B in) and then writing the 4KB out again.
This is terrible from a latency perspective. I recall seeing 20 IOPS on a WD
EARS 2TB drive (i.e. 50ms latency for random 512B writes).
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ashift and vdevs

2010-11-23 Thread taemun
zdb -C shows an ashift value on each vdev in my pool; I was just wondering if
it is vdev specific, or pool wide. Google didn't seem to know.

I'm considering a mixed pool with some advanced format (4KB sector)
drives, and some normal 512B sector drives, and was wondering if the ashift
can be set per vdev, or only per pool. Theoretically, this would save me
some size on metadata on the 512B sector drives.

Cheers,
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ashift and vdevs

2010-11-23 Thread taemun
Cheers for the links David, but you'll note that I've commented on the blog
you linked (ie, was aware of it). The zpool-12 binary linked from
http://digitaldj.net/2010/11/03/zfs-zpool-v28-openindiana-b147-4k-drives-and-you/
worked
perfectly on my SX11 installation. (It threw some error on b134, so it
relies on some external code, to some extent.)

I'd note for those who are going to try, that that binary produces a pool of
as high a version as the system supports. I was surprised that it was higher
than the code for which it was compiled (ie, b147 = zpool v28).

I'm currently populating a pool with a 9-wide raidz vdev of Samsung HD204UI
2TB (5400rpm, 4KB sector) and a 9-wide raidz vdev of Seagate LP ST32000542AS
2TB (5900 rpm, 4KB sector) which was created with that binary, and haven't
seen any of the performance issues I've had in the past with WD EARS drives.

It would be lovely if Oracle could see fit to implement correct detection
of these drives! Or, at the very least, an -o ashift=12 parameter in the
zpool create function.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] vdev failure - pool loss ?

2010-10-19 Thread taemun
Tuomas:

My understanding is that the copies functionality doesn't guarantee that
the extra copies will be kept on a different vdev. So that isn't entirely
true. Unfortunately.
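
For reference, the property is set per dataset rather than per pool, e.g.:

zfs set copies=2 tank/important
zfs get copies tank/important

but, as above, there's no guarantee the extra copies land on different vdevs,
so it isn't a substitute for redundancy at the vdev level.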

On 20 October 2010 07:33, Tuomas Leikola tuomas.leik...@gmail.com wrote:

 On Mon, Oct 18, 2010 at 8:18 PM, Simon Breden sbre...@gmail.com wrote:
  So are we all agreed then, that a vdev failure will cause pool loss ?
  --

 unless you use copies=2 or 3, in which case your data is still safe
 for those datasets that have this option set.
 --
 - Tuomas
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help - Deleting files from a large pool results in less free space!

2010-10-07 Thread taemun
Forgive me, but isn't this incorrect:

---
mv   /pool1/000   /pool1/000d
---
rm   -rf   /pool1/000

Shouldn't that last line be
rm   -rf   /pool1/000d
??

On 8 October 2010 04:32, Remco Lengers re...@lengers.com wrote:

  any snapshots?

 *zfs list -t snapshot*

 ..Remco



 On 10/7/10 7:24 PM, Jim Sloey wrote:

 I have a 20Tb pool on a mount point that is made up of 42 disks from an EMC 
 SAN. We were running out of space and down to 40Gb left (loading 8Gb/day) and 
 have not received disk for our SAN. Using df -h results in:
 Filesystem   size   used  avail  capacity  Mounted on
 pool1         20T    20T    55G      100%  /pool1
 pool2        9.1T   8.0T   497G       95%  /pool2
 The idea was to temporarily move a group of big directories to another zfs
 pool that had space available and link from the old location to the new.
 cp   -r    /pool1/000   /pool2/
 mv   /pool1/000   /pool1/000d
 ln   -s    /pool2/000   /pool1/000
 rm   -rf   /pool1/000
 Using df -h after the relocation results in:
 Filesystem   size   used  avail  capacity  Mounted on
 pool1         20T    19T    15G      100%  /pool1
 pool2        9.1T   8.3T   221G       98%  /pool2
 Using zpool list says:
 NAME     SIZE    USED   AVAIL   CAP
 pool1   19.9T   19.6T    333G   98%
 pool2   9.25T   8.89T    369G   96%
 Using zfs get all pool1 produces:
 NAME   PROPERTY            VALUE                  SOURCE
 pool1  type                filesystem             -
 pool1  creation            Tue Dec 18 11:37 2007  -
 pool1  used                19.6T                  -
 pool1  available           15.3G                  -
 pool1  referenced          19.5T                  -
 pool1  compressratio       1.00x                  -
 pool1  mounted             yes                    -
 pool1  quota               none                   default
 pool1  reservation         none                   default
 pool1  recordsize          128K                   default
 pool1  mountpoint          /pool1                 default
 pool1  sharenfs            on                     local
 pool1  checksum            on                     default
 pool1  compression         off                    default
 pool1  atime               on                     default
 pool1  devices             on                     default
 pool1  exec                on                     default
 pool1  setuid              on                     default
 pool1  readonly            off                    default
 pool1  zoned               off                    default
 pool1  snapdir             hidden                 default
 pool1  aclmode             groupmask              default
 pool1  aclinherit          secure                 default
 pool1  canmount            on                     default
 pool1  shareiscsi          off                    default
 pool1  xattr               on                     default
 pool1  replication:locked  true                   local

 Has anyone experienced this or know where to look for a solution to 
 recovering space?


 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver that never finishes

2010-09-18 Thread taemun
But all of which have newer code, today, than onnv-134.

On 18 September 2010 22:20, Tom Bird t...@marmot.org.uk wrote:

 On 18/09/10 13:06, Edho P Arief wrote:

 On Sat, Sep 18, 2010 at 7:01 PM, Tom Birdt...@marmot.org.uk  wrote:

 All said and done though, we will have to live with snv_134's bugs from
 now
 on, or perhaps I could try Sol 10.


 or OpenIllumos. Or Nexenta. Or FreeBSD. Or [insert osol distro name].


 ... none of which will receive ZFS code updates unless Oracle deigns to
 bestow them upon the community, this or ZFS dev is taken over by said
 community, in which case we end up with diverging code bases that would be a
 sisyphean task to try and merge.

 Tom

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New SSD options

2010-05-22 Thread taemun
Basic electronics, go!

The linked capacitor from Elna (
http://www.elna.co.jp/en/capacitor/double_layer/catalog/pdf/dk_e.pdf) has an
internal resistance of 30 ohms.

Intel rate their 32GB X25-E at 2.4W active (we aren't interested in idle
power usage, if its idle, we don't need the capacitor in the first place) on
the +5V rail, thats 0.48A. (P=VI)

V=IR, supply is 5V, current through load is 480mA, hence R=10.4 ohms.
The resistance of the X25-E under load is 10.4 ohms.

Now if you have a capacitor discharge circuit with the charged Elna
DK-6R3D105T - the largest and most suitable from that datasheet - you have
40.4 ohms around the loop (cap and load). +5V over 40.4 ohms. The maximum
current you can pull from that is I = V/R = 124mA, around a quarter of what
the X25-E wants in order to write.

The setup won't work.
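
Working shown, so anyone can check the arithmetic (figures as per the
datasheets above):

awk 'BEGIN {
  v    = 5.0;              # +5V rail
  p    = 2.4;              # X25-E active power, watts
  esr  = 30.0;             # internal resistance of the Elna DK cap, ohms
  i    = p / v;            # current the drive wants while writing: 0.48 A
  r    = v / i;            # equivalent load resistance: ~10.4 ohms
  imax = v / (r + esr);    # best-case current through cap + load
  printf("drive wants %.2f A, the cap circuit can supply about %.3f A\n", i, imax);
}'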

I'd suggest something more along the lines of:
http://www.cap-xx.com/products/products.htm
Which have an ESR around 3 orders of magnitude lower.

t

On 22 May 2010 18:58, Ragnar Sundblad ra...@csc.kth.se wrote:


 On 22 maj 2010, at 07.40, Don wrote:

  The SATA power connector supplies 3.3, 5 and 12v. A complete
  solution will have all three. Most drives use just the 5v, so you can
  probably ignore 3.3v and 12v.
  I'm not interested in building something that's going to work for every
 possible drive config- just my config :) Both the Intel X25-e and the OCZ
 only uses the 5V rail.
 
  You'll need to use a step up DC-DC converter and be able to supply ~
  100mA at 5v.
  It's actually easier/cheaper to use a LiPoly battery  charger and get a
  few minutes of power than to use an ultracap for a few seconds of
  power. Most ultracaps are ~ 2.5v and LiPoly is 3.7v, so you'll need a
  step up converter in either case.
  Ultracapacitors are available in voltage ratings beyond 12volts so there
 is no reason to use a boost converter with them. That eliminates high
 frequency switching transients right next to our SSD which is always
 helpful.
 
  In this case- we have lots of room. We have a 3.5 x 1 drive bay, but a
 2.5 x 1/4 hard drive. There is ample room for several of the 6.3V ELNA 1F
 capacitors (and our SATA power rail is a 5V regulated rail so they should
 suffice)- either in series or parallel (Depending on voltage or runtime
 requirements).
  http://www.elna.co.jp/en/capacitor/double_layer/catalog/pdf/dk_e.pdf
 
  You could 2 caps in series for better voltage tolerance or in parallel
 for longer runtimes. Either way you probably don't need a charge controller,
 a boost or buck converter, or in fact any IC's at all. It's just a small
 board with some caps on it.

 I know they have a certain internal resistance, but I am not familiar
 with the characteristics; is it high enough so you don't need to
 limit the inrush current, and is it low enough so that you don't need
 a voltage booster for output?

  Cost for a 5v only system should be $30 - $35 in one-off
  prototype-ready components with a 1100mAH battery (using prices from
  Sparkfun.com),
  You could literally split a sata cable and add in some capacitors for
 just the cost of the caps themselves. The issue there is whether the caps
 would present too large a current drain on initial charge up- If they do
 then you need to add in charge controllers and you've got the same problems
 as with a LiPo battery- although without the shorter service life.
 
  At the end of the day the real problem is whether we believe the drives
 themselves will actually use the quiet period on the now dead bus to write
 out their caches. This is something we should ask the manufacturers, and
 test for ourselves.

 Indeed!

 /ragge

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Understanding ZFS performance.

2010-05-22 Thread taemun
iostat -xen 1 will provide the same device names as the rest of the system
(as well as show error columns).

zpool status will show you which drive is in which pool.

As for the controllers, cfgadm -al groups them nicely.
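
Roughly, the whole mapping exercise looks like this:

iostat -xen 1 2          # error columns, plus cXtYdZ names instead of cmdk0/sd7
zpool status             # which cXtYdZ device sits in which pool and vdev
cfgadm -al               # which controller/port each device hangs off
format < /dev/null       # lists every disk together with its controller path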

t

On 23 May 2010 03:50, Brian broco...@vt.edu wrote:

 I am new to OSOL/ZFS but have just finished building my first system.

 I detailed the system setup here:
 http://opensolaris.org/jive/thread.jspa?threadID=128986tstart=15

 I ended up having to add an additional controller card as two ports on the
 motherboard did not work as standard Sata port.  Luckily I was able to
 salvage an LSI SAS card from an old system.

 Things seem to be working OK for the most part.. But I am trying to dig a
 bit deeper into the performance.  I have done some searching and it seems
 that the iostat -x can help you better understand your performance.

 I have 8 drives in the system.  2 are in a mirrored boot pool and the other
 6 are in a single raidz2 pool.  All 6 are the same.  Samsung 1TB Spinpoints.

 Here is what my output looks like from iostat -x 30 during a scrub of the
 raidz2 pool:

 device     r/s    w/s     kr/s   kw/s  wait  actv  svc_t  %w  %b
 cmdk0    299.7    1.5  37080.9    1.6   7.7   2.0   32.3  98  99
 cmdk1    300.2    1.3  37083.0    1.5   7.7   2.0   32.2  98  99
 cmdk2   1018.6    1.6  37141.3    1.7   0.5   0.7    1.2  22  43
 cmdk3      0.0    1.8      0.0    5.2   0.0   0.0   33.7   1   2
 cmdk4   1045.6    2.1  37124.3    1.4   0.7   0.7    1.3  21  41
 sd6        0.0    1.8      0.0    5.2   0.0   0.0   25.1   0   1
 sd7     1033.4    2.5  37128.5    1.8   0.0   1.0    1.0   3  38
 sd8     1044.5    2.5  37129.4    1.8   0.0   0.9    0.9   3  36
                  extended device statistics
 device     r/s    w/s     kr/s   kw/s  wait  actv  svc_t  %w  %b
 cmdk0    301.9    1.3  37339.0    1.7   7.8   2.0   32.1  99  99
 cmdk1    302.1    1.4  37341.0    1.8   7.7   2.0   32.0  99  99
 cmdk2   1048.1    1.5  37400.4    1.6   0.5   0.7    1.1  20  42
 cmdk3      0.0    1.5      0.0    5.1   0.0   0.0   36.5   1   2
 cmdk4   1054.4    1.6  37363.1    1.5   0.7   0.6    1.2  20  40
 sd6        0.0    1.5      0.0    5.1   0.0   0.0   30.4   0   1
 sd7     1044.4    2.1  37404.2    1.7   0.0   0.9    0.9   3  38
 sd8     1050.5    2.1  37382.8    1.9   0.0   0.9    0.9   3  36
                  extended device statistics
 device     r/s    w/s     kr/s   kw/s  wait  actv  svc_t  %w  %b
 cmdk0    296.3    1.5  36195.4    1.7   7.8   2.0   32.7  99  99
 cmdk1    295.2    1.5  36230.1    1.8   7.7   2.0   32.5  98  98
 cmdk2    987.5    2.0  36171.5    1.7   0.6   0.7    1.3  22  43
 cmdk3      0.0    1.5      0.0    5.1   0.0   0.0   37.7   1   2
 cmdk4   1018.3    2.0  36160.8    1.6   0.7   0.6    1.4  21  41
 sd6        0.0    1.5      0.0    5.1   0.0   0.1   40.3   0   2
 sd7     1005.3    2.6  36300.6    1.8   0.0   1.1    1.1   3  39
 sd8     1016.0    2.5  36260.1    2.0   0.0   1.0    1.0   3  36


 I think cmdk3 and sd6 are in my rpool.  I tried to split the pools across
 the controllers for better performance.

 It seems to me that cmdk0 and cmdk1 are much slower than the others..  But
 I am not sure why or what to check next...  In fact I am not even sure how I
 can trace back that device name to figure out which controller it is
 connected to.

 Any ideas or next steps would be appreciated.

 Thanks.
 --
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS on-disk DDT block arrangement

2010-04-06 Thread taemun
I was wondering if someone could explain why the DDT is seemingly
(from empirical observation) kept in a huge number of individual blocks,
randomly written across the pool, rather than just a large binary chunk
somewhere.

Having been victim of the really long times it takes to destroy a dataset
that has dedup=on, I was wondering why that was. From memory, when the
destroy process was running, something like iopattern -r showed constant 99%
random reads. This seems like a very wasteful approach to allocating blocks
for the DDT.

Having deleted the 900GB dataset, finally, I now only have around 152GB
(allocated PSIZE) left deduped on that pool.
# zdb -DD tank
DDT-sha256-zap-duplicate: 310684 entries, size 578 on disk, 380 in core
DDT-sha256-zap-unique: 1155817 entries, size 2438 on disk, 1783 in core

So 1466501 DDT entries. For 152GB of deduped data, that's around 108KB per
referenced block on average, which seems sane.
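
(That average is just the zdb numbers above divided out; for example:

awk 'BEGIN {
  entries = 310684 + 1155817;              # duplicate + unique DDT entries
  bytes   = 152 * 1024 * 1024 * 1024;      # ~152GB allocated PSIZE
  printf("%d entries -> %.1f KB per referenced block\n", entries, bytes / entries / 1024);
}'

which prints roughly 108.7 KB.)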

To destroy the dataset holding the files which reference the DDT, I'm
looking at 1.46 million random reads to complete the operation (less those
elements in ARC or L2ARC). That's a lot of read operations for my poor
spindles.

I've seen some people saying that the DDT entries are around 270 bytes each,
but does it really matter, if the smallest block that zfs can read/write
(for obvious reasons) is 512 bytes? Clearly 2x 270B > 512B, but couldn't
there be some way of grouping DDT elements together (in, say, 1MB blocks)?

Thoughts?

(side note: can someone explain the size xxx on disk, xxx in core
statements in that zdb output for me? The numbers never seem related to the
number of entries or  anything.)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and 4kb sector Drives (All new western digital GREEN Drives?)

2010-03-27 Thread taemun
I'm not entirely convinced there is no problem here I had a WD EADS
1.5TB die, the warranty replacement drive was a EARS. So, first foray into
4k sectors.

I had 8x EADS in a raidz set, had replaced the broken one with a 1.5TB
Seagate 7200rpm - which was obviously faster.

Just replacing back, and here is the iostat for the new EARS drive:
http://pastie.org/889572

Those asvc_t's are atrocious. As is the op/s
throughput. All the other drives spend the vast majority of the time idle,
waiting for the new EARS drive to write out data.

This is after isolating another issue to my Dell PERC 5/i's - they
apparently don't talk nicely with the EARS drives either. Streaming writes
would push data for two seconds and pause for ten. Random writes ... give
up.
On the Intel chipset's SATA - streaming writes are acceptable, but random
writes are as per the above url.

Format tells me that the partition starts at sector 256. But given that ZFS
writes variable size blocks, that really shouldn't matter.

When I plugged the EARS into a P45-based motherboard running Windows, HDTune
presents a normal looking streaming writes graph, and the average seek time
is 14ms - the drive seems healthy.

Any thoughts?

On 27 March 2010 21:05, Svein Skogen sv...@stillbilde.net wrote:

 On 27.03.2010 11:01, Daniel Carosone wrote:
  On Sat, Mar 27, 2010 at 08:47:26PM +1100, Daniel Carosone wrote:
  On Fri, Mar 26, 2010 at 05:57:31PM -0700, Darren Mackay wrote:
  not sure if 32bit BSD supports 48bit LBA
 
  Solaris is the only otherwise-modern OS with this daft limitation.
 
  Ok, it's not due to LBA48, but the 1Tb limitation is still daft.
 
  There are some limits you'll encounter in BSD, such as if you use the
  wrong disklabel format - but not in the basic disk drivers.

 And, if you use the cheaper cards containing SiI-chips, you may find
 that _SOME_ manufacturers saved a penny per year by not connecting all
 the wires, so LBA48 simply don't work. Don't ask how many gray hairs
 tracking down THAT one caused me. (disks that worked absolutely fine
 until you passed a certain block. Then *wham!* corruptions-galore.)

 //Svein

 --
 +---+---
  /\   |Svein Skogen   | sv...@d80.iso100.no
  \ /   |Solberg Østli 9| PGP Key:  0xE5E76831
   X|2020 Skedsmokorset | sv...@jernhuset.no
  / \   |Norway | PGP Key:  0xCE96CE13
|   | sv...@stillbilde.net
  ascii  |   | PGP Key:  0x58CD33B6
  ribbon |System Admin   | svein-listm...@stillbilde.net
 Campaign|stillbilde.net | PGP Key:  0x22D494A4
+---+---
|msn messenger: | Mobile Phone: +47 907 03 575
|sv...@jernhuset.no | RIPE handle:SS16503-RIPE
 +---+---
 If you really are in a hurry, mail me at
   svein-mob...@stillbilde.net
  This mailbox goes directly to my cellphone and is checked
even when I'm not in front of my computer.
 
 Picture Gallery:
  https://gallery.stillbilde.net/v/svein/
 


 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Q : recommendations for zpool configuration

2010-03-19 Thread taemun
A pool with a 4-wide raidz2 is a completely nonsensical idea. It has the
same amount of accessible storage as two striped mirrors, would be
slower in terms of IOPS, and would be harder to upgrade in the future (you'd
need to keep adding four drives for every expansion with raidz2; with mirrors
you only need to add another two drives to the pool).

Just my $0.02

On 19 March 2010 18:28, homerun petri.j.kunn...@gmail.com wrote:

 Greetings

 I would like to get your recommendation how setup new pool.

 I have 4 new 1.5TB disks reserved to new zpool.
 I planned to grow/replace the existing small 4-disk (raidz) setup with a new
 bigger one.

 As new pool will be bigger and will have more personally important data to
 be stored long time, i like to ask your recommendations should i create
 recreate pool or just replace existing devices.

 I have noted there is now raidz2 and been thinking which would be better:
 a pool with 2 mirrors or one pool with 4 disks in raidz2.

 So at least, could someone explain these new raidz configurations?

 Thanks
 --
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Reading ZFS config for an extended period

2010-02-15 Thread taemun
Just thought I'd chime in for anyone who had read this - the import
operation completed this time, after 60 hours of disk grinding.

:)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Reading ZFS config for an extended period

2010-02-15 Thread taemun
The system in question has 8GB of ram. It never paged during the
import (unless I was asleep at that point, but anyway).

It ran for 52 hours, then started doing 47% kernel cpu usage. At this
stage, dtrace stopped responding, and so iopattern died, as did
iostat. It was also increasing ram usage rapidly (15mb / minute).
After an hour of that, the cpu went up to 76%. An hour later, CPU
usage stopped. Hard drives were churning throughout all of this
(albeit at a rate that looks like each vdev is being controlled by a
single-threaded operation).

I'm guessing that if you don't have enough ram, it gets stuck on the
use-lots-of-cpu phase, and just dies from too much paging. Of course,
I have absolutely nothing to back that up.

Personally, I think that if L2ARC devices were persistent, we would already
have the mechanism in place for storing the DDT on a separate vdev.
The problem is, there is nothing you can run at boot time to populate
the L2ARC, so the dedup writes are ridiculously slow until the cache
is warm. If the cache stayed warm, or there was an option to forcibly
warm up the cache, this could be somewhat alleviated.

Cheers
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Reading ZFS config for an extended period

2010-02-13 Thread taemun
After around four days the process appeared to have stalled (no
audible hard drive activity). I restarted with milestone=none; deleted
/etc/zfs/zpool.cache, restarted, and went zpool import tank. (also
allowed root login to ssh, so I could make new ssh sessions if
required.) Now I can watch the process from on the machine.

My present question is how is the DDT stored? I believe the DDT to
have around 10M entries for this dataset, as per:
DDT-sha256-zap-duplicate: 400478 entries, size 490 on disk, 295 in core
DDT-sha256-zap-unique: 10965661 entries, size 381 on disk, 187 in core
(taken just previous to the attempt to destroy the dataset)

A sample from iopattern shows:
%RAN  %SEQ  COUNT   MIN    MAX   AVG   KR
 100     0    195   512    512   512    97
 100     0    414   512  65536   895   362
 100     0    261   512    512   512   130
 100     0    273   512    512   512   136
 100     0    247   512    512   512   123
 100     0    297   512    512   512   148
 100     0    292   512    512   512   146
 100     0    250   512    512   512   125
 100     0    274   512    512   512   137
 100     0    302   512    512   512   151
 100     0    294   512    512   512   147
 100     0    308   512    512   512   154
  98     2    286   512    512   512   143
 100     0    270   512    512   512   135
 100     0    390   512    512   512   195
 100     0    269   512    512   512   134
 100     0    251   512    512   512   125
 100     0    254   512    512   512   127
 100     0    265   512    512   512   132
 100     0    283   512    512   512   141

As the pool is comprised of 2x 8-disk raidz vdevs, I presume that each
element is stored twice (for the raidz redundancy). So around 280 512b
read op/s, that's 140 entries per second.
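
To put a (very rough) number on it: if every DDT entry ends up needing its
own random read, then at the observed rate:

awk 'BEGIN {
  entries = 10965661 + 400478;   # unique + duplicate entries from the zdb -DD output above
  per_sec = 140;                 # entries per second, from the iopattern sample
  printf("~%.0f hours to walk the whole DDT at this rate\n", entries / per_sec / 3600);
}'

which works out to somewhere over 22 hours, before anything else the import
has to do.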

Is the import of a semi-broken pool:
1 Reading all the DDT markers for the dataset; or
2 Reading all the DDT markers for the pool; or
3 Reading all of the block markers for the dataset; or
4 Reading all of the block markers for the pool
Prior to actually finalising what it needs to do to fix the pool? I'd
like to be able to estimate the length of time likely before the
import finishes.

Or should I tell it to roll back to the last valid txg - ie before the
zfs destroy dataset command was issued? (by zpool import -F.) Or is
this likely to take as long/longer than the present import/fix?

Cheers.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Reading ZFS config for an extended period

2010-02-11 Thread taemun
Can anyone comment about whether the on-boot "Reading ZFS config" step is
any slower/better/whatever than deleting zpool.cache, rebooting and
manually importing?

I've been waiting more than 30 hours for this system to come up. There
is a pool with 13TB of data attached. The system locked up whilst
destroying a 934GB dedup'd dataset, and I was forced to reboot it. I
can hear hard drive activity presently - i.e. it's doing
*something* - but am really hoping there is a better way :)

Thanks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Reading ZFS config for an extended period

2010-02-11 Thread taemun
Do you think that more RAM would help this progress faster? We've just
hit 48 hours. No visible progress (although that doesn't really mean
much).

It is presently in a system with 8GB of ram; I could try to move the
pool across to a system with 20GB of ram, if that is likely to
expedite the process. Of course, if it isn't going to make any
difference, I'd rather not restart this process.

Thanks

On 12 February 2010 06:08, Bill Sommerfeld sommerf...@sun.com wrote:
 On 02/11/10 10:33, Lori Alt wrote:

 This bug is closed as a dup of another bug which is not readable from
 the opensolaris site, (I'm not clear what makes some bugs readable and
 some not).

 the other bug in question was opened yesterday and probably hasn't had time
 to propagate.

                                        - Bill



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss