Re: [zfs-discuss] A resilver record?
769G resilvered on a 500G drive? I'm guessing there was a whole bunch of activity (and probably snapshot creation) happening alongside the resilver.

On 20 March 2011 18:57, Ian Collins i...@ianshome.com wrote:
> Has anyone seen a resilver longer than this for a 500G drive in a raidz2 vdev?
>
>   scrub: resilver completed after 169h25m with 0 errors on Sun Mar 20 19:57:37 2011
>   c0t0d0  ONLINE  0  0  0  769G resilvered
>
> And I told the client it would take 3 to 4 days! :)
>
> --
> Ian.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] External SATA drive enclosures + ZFS?
On 28 February 2011 02:06, Edward Ned Harvey opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
> Take that a step further. Anything external is unreliable. I have used USB, eSATA, and Firewire external devices. They all work. The only question is for how long.

eSATA has no need for any interposer chips between a modern SATA chipset on the motherboard and a SATA hard drive. You can buy cables with appropriate ends for this. There is no reason why the data side of an eSATA drive should be any more likely to fail than SATA (within bounds, for cable lengths, etc.). At least you can be assured that the drive will receive a flush request at appropriate times.

I can't argue about the external power supplies, other than to say that many external cases these days use a single +12V rail and have a +5V regulator on board. These are a lot better, because they allow for easy replacement of the power supply. External units which use a combined +12V/+5V power supply are often rendered useless by a power supply failure.

Cheers,

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] deduplication requirements
On 6 February 2011 01:34, Michael michael.armstr...@gmail.com wrote:
> Hi guys, I'm currently running 2 zpools each in a raidz1 configuration, totalling around 16TB usable data. I'm running it all on an OpenSolaris based box with 2GB memory and an old Athlon 64 3700 CPU. I understand this is very poor and underpowered for deduplication, so I'm looking at building a new system, but wanted some advice first. Here is what I've planned so far:
> Core i7 2600 CPU
> 16GB DDR3 Memory
> 64GB SSD for ZIL (optional)
> http://ark.intel.com/Product.aspx?id=52213

The desktop Core i* range doesn't support ECC RAM at all. This could potentially be a pool breaker if you get a flipped bit in the wrong place (a significant metadata block). Just something to keep in mind.

Also, Intel have issued a recall (ish) for all of the 6-series chipsets released so far: the PLL unit for the 3Gbit/s SATA ports on the chipset is driven too hard and will likely degrade over time (5~15% failure rate over three years). They are talking about a March~April time to fix in the channel. If you don't plan on using the 3Gbit/s SATA ports, then you're fine.

Intel will make LGA1155 Xeons at some point, i.e. http://en.wikipedia.org/wiki/List_of_future_Intel_microprocessors#.22Sandy_Bridge.22_.2832_nm.29_8 They support ECC (just check for a specific QVL after launch; DDR3 ECC isn't necessarily the only thing you need to look for). I think the Feb 20 release date may have been pushed back for the chipset respin.

Cheers,

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Replace block devices to increase pool size
If autoexpand = on, then yes:

  zpool get autoexpand pool
  zpool set autoexpand=on pool

The expansion is vdev specific, so if you replaced the mirror first, you'd get that much (the extra 2TB) without touching the raidz.

Cheers,

On 7 February 2011 01:41, Achim Wolpers achim...@googlemail.com wrote:
> Hi! I have a zpool built up from two vdevs (one mirror and one raidz). The raidz is built up from 4x 1TB HDs. When I successively replace each 1TB drive with a 2TB drive, will the capacity of the raidz double after the last block device is replaced?
> Achim

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
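As a minimal sketch of the whole replace-and-grow cycle described above (the pool and device names here - tank, c1t0d0 through c1t3d0 - are placeholders, not Achim's actual layout):

  # enable automatic expansion once, before starting the swaps
  zpool set autoexpand=on tank
  zpool get autoexpand tank

  # swap each 1TB disk for a 2TB disk in turn; wait for each resilver to finish
  zpool replace tank c1t0d0
  zpool status tank        # repeat for c1t1d0, c1t2d0, c1t3d0 once resilvered

  # after the last resilver completes, the raidz vdev grows
  zpool list tank

Note that `zpool replace tank c1t0d0` assumes the new disk is attached in place of the old one; with both disks connected at once you would name the new device explicitly, e.g. `zpool replace tank c1t0d0 c2t0d0`.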
Re: [zfs-discuss] ZFS and spindle speed (7.2k / 10k / 15k)
Uhm. Higher RPM = higher linear speed of the head above the platter = higher throughput. If the bit pitch (i.e. the size of each bit on the platter) is the same, then surely a higher linear speed corresponds to a larger number of bits per second?

So if "all other things being equal" includes the bit density and the radius to the edge of the media, then ... surely higher RPM = higher throughput?

Cheers,

On 3 February 2011 14:10, Mark Sandrock mark.sandr...@oracle.com wrote:
> On Feb 2, 2011, at 8:10 PM, Eric D. Mudama wrote:
>> All other things being equal, the 15k and the 7200 drive, which share electronics, will have the same max transfer rate at the OD.
>
> Is that true? So the only difference is in the access time?
>
> Mark

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache
Comments below.

On 29 January 2011 00:25, Edward Ned Harvey opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
> This was something interesting I found recently. Apparently for flash manufacturers, flash hard drives are like the pimple on the butt of the elephant. A vast majority of the flash production in the world goes into devices like smartphones, cameras, tablets, etc. Only a slim minority goes into hard drives.

http://www.eetimes.com/electronics-news/4206361/SSDs--Still-not-a--solid-state--business
~6.1 percent for 2010, from that estimate (the first thing that Google turned up). Not denying what you said, I just like real figures rather than random hearsay.

> As a result, they optimize for these other devices, and one of the important side effects is that standard flash chips use an 8K page size. But hard drives use either 4K or 512B.

http://www.anandtech.com/Show/Index/2738?cPage=19&all=False&sort=0&page=5
Terms: "page" means the smallest data size that can be read or programmed (written). "Block" means the smallest data size that can be erased. SSDs commonly have a page size of 4KiB and a block size of 512KiB. I'd take Anandtech's word on it. There is probably some variance across the market, but for the vast majority, this is true. Wikipedia's http://en.wikipedia.org/wiki/Flash_memory#NAND_memories says that common page sizes are 512B, 2KiB, and 4KiB.

> The SSD controller secretly remaps blocks internally, and aggregates small writes into a single 8K write, so there's really no way for the OS to know if it's writing to a 4K block which happens to be shared with another 4K block in the 8K page. So it's unavoidable, and whenever it happens, the drive can't simply write. It must read modify write, which is obviously much slower.

This is true, but for 512B-to-4KiB aggregation, as the 8KiB page doesn't exist. As for writing when everything is full and you need to do an erase: well, this is where TRIM is helpful.

> Also if you look up the specs of a SSD, both for IOPS and/or sustainable throughput... They lie. Well, technically they're not lying, because technically it is *possible* to reach whatever they say. Optimize your usage patterns and only use blank drives which are new from box, or have been fully TRIM'd.

Pt...

> But in my experience, reality is about 50% of whatever they say. Presently, the only way to deal with all this is via the TRIM command, which cannot eliminate the read/modify/write, but can reduce their occurrence. Make sure your OS supports TRIM.

I'm not sure at what point ZFS added TRIM, or to what extent... Can't really measure the effectiveness myself. http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6957655

> Long story short, in the real world, you can expect the DDRDrive to crush and shame the performance of any SSD you can find. It's mostly a question of PCIe slot versus SAS/SATA slot, and other characteristics you might care about, like external power, etc.

Sure, DDR RAM will have a much quicker sync write time. This isn't really a surprising result.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Migrating zpool to new drives with 4K Sectors
zpool replace will copy across onto the new disk with the same old ashift=9, whereas you want ashift=12 for 4KB drives (sector size = 2^ashift). You'd need to make a new pool (or add a vdev to an existing pool) with the modified tools in order to get proper performance out of 4KB drives.

On 7 January 2011 17:43, Matthew Angelo bang...@gmail.com wrote:
> Hi ZFS Discuss, I have an 8x 1TB RAIDZ running on Samsung 1TB 5400rpm drives with 512B sectors. I will be replacing all of these with 8x Western Digital 2TB drives with support for 4K sectors. The replacement plan will be to swap out each of the 8 drives until all are replaced and the new size (~16TB) is available with a `zfs scrub`. My question is, how do I do this and also factor in the new 4K sector size? Or should I find a 2TB drive that still uses 512B sectors?
> Thanks

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
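A quick way to check what an existing pool was created with, before deciding between replacing in place and building a new pool (a sketch; the pool name tank is a placeholder):

  # ashift is recorded per top-level vdev in the pool configuration
  zdb -C tank | grep ashift
  #   ashift: 9    -> allocations aligned to 512B sectors
  #   ashift: 12   -> allocations aligned to 4KiB sectors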
Re: [zfs-discuss] very slow boot: stuck at mounting zfs filesystems
Dedup? Taking a long time to boot after a hard reboot following a lockup?

I'll bet that it hard locked whilst deleting some files or a dataset that was dedup'd. After the delete is started, it spends *ages* cleaning up the DDT (the table containing a list of dedup'd blocks). If you hard lock in the middle of this clean-up, then the DDT isn't valid, to anything. The next mount attempt on that pool will do this operation for you, which will take an inordinate amount of time.

My pool spent *eight days* (iirc) in limbo, waiting for the DDT cleanup to finish. Once it did, it wrote out a shedload of blocks and then everything was fine. This was for a zfs destroy of a 900GB, 64KiB-block dataset, over 2x 8-wide raidz vdevs.

Unfortunately, raidz is of course slower for random reads than a set of mirrors. The raidz/mirror hybrid allocator available in snv_148+ is somewhat of a workaround for this, although I've not seen comprehensive figures for the gain it gives - http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6977913

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 3TB HDD in ZFS
On 6 December 2010 21:43, Fred Liu fred_...@issi.com wrote:
> 3TB HDD needs UEFI, not the traditional BIOS, and OS support.
> Fred

Fred: http://www.anandtech.com/show/3858/the-worlds-first-3tb-hdd-seagate-goflex-desk-3tb-review/2

Namely: "a feature of GPT is 64-bit LBA support. With 64-bit LBAs the largest 512-byte sector drive we can address is 9.4ZB. GPT drives are supported as data drives in all x64 versions of Windows as well as Mac OS X and Linux. You'll note that I said data and not boot drives. In order to boot to a GPT partition, you need hardware support. I just mentioned that your PC's BIOS looks at LBA 0 for the MBR. Your BIOS does not support booting to GPT partitioned drives. GPT is however supported by systems that implement a newer BIOS alternative: Intel's Extensible Firmware Interface (EFI)."

I would imagine that anyone looking at this list didn't want the 3TB drive as a boot drive (rpool), but as a data drive.

Cheers,

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 3TB HDD in ZFS
On 7 December 2010 13:25, Brandon High bh...@freaks.com wrote:
> There shouldn't be any problems using a 3TB drive with Solaris, so long as you're using a 64-bit kernel. Recent versions of zfs should properly recognize the 4k sector size as well.

I think you'll find that these 3TB, 4KiB-physical-sector drives are still exporting 512B logical sectors (this is what Anandtech has indicated, anyway). ZFS assumes that the drive's logical sectors are directly mapped to physical sectors, and will create an ashift=9 vdev for the drives. Hence why enthusiasts are making their own zpool binaries with a hardcoded ashift=12, so they can create pools that actually function beyond 20 random writes per second with these drives: http://digitaldj.net/2010/11/03/zfs-zpool-v28-openindiana-b147-4k-drives-and-you/

Cheers,

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 3TB HDD in ZFS
On 7 December 2010 13:55, Tim Cook t...@cook.ms wrote:
> It's based on a jumper on most new drives.

Can you back that up with anything? I've never seen anything but requests for a jumper that forces the firmware to export 4KiB sectors. The WD EARS at launch provided the ability to force the requested LBA to be written to disk as LBA + 1 (a workaround to get Windows XP to make aligned partitions), as per http://www.anandtech.com/show/2888/2

On 7 December 2010 13:57, Brandon High bh...@freaks.com wrote:
> It depends on the drive. According to Anandtech, the WD drives use 4k internally but report 512b sectors.

And hence, will incorrectly create an ashift=9 vdev.

> They also report that the Seagate GoFlex uses 512b sectors internally but reports 4k sectors through its desktop dock.

Sorry, you're right. If they're using 512B internally, this is a non-event here. I think that most folks talking about 3TB drives on this list are looking for internal drives. That the desktop dock (USB, I presume) coalesces blocks doesn't really make any difference.

Waiting for a 3TB drive that properly reports its capabilities to become available is probably the best course of action. Buying 4KiB-physical-sector drives which export 512B sectors is fine, as long as you use a modified binary which has a hardcoded ashift=12 value. Otherwise, you're asking for trouble (and terrible performance).

Cheers,

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Seagate ST32000542AS and ZFS perf
On 2 December 2010 16:17, Miles Nordin car...@ivy.net wrote:
> t == taemun tae...@gmail.com writes:
>
>  t> I would note that the Seagate 2TB LP has a 0.32% Annualised
>  t> Failure Rate.
>
> bullshit.

Apologies, should have read: "Specified Annualised Failure Rate".

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Seagate ST32000542AS and ZFS perf
On 29 November 2010 20:39, GMAIL piotr.jasiukaj...@gmail.com wrote:
> Does anyone use Seagate ST32000542AS disks with ZFS? I wonder if the performance is not as ugly as with WD Green WD20EARS disks.

I'm using these drives for one of the vdevs in my pool. The pool was created with ashift=12 (zpool binary from http://digitaldj.net/2010/11/03/zfs-zpool-v28-openindiana-b147-4k-drives-and-you/), which limits the minimum block size to 4KB, the same as the physical block size on these drives. I haven't noticed any performance issues. These obviously aren't 7200rpm drives, so you can't expect them to match those in random IOPS. I'm also using a set of Samsung HD204UIs in the pool.

I would urge you to consider a 2^n + p number of disks. For raidz, p = 1, so an acceptable number of total drives is 3, 5 or 9. raidz2 has two parity drives, hence 4, 6 or 10. These vdev widths ensure that the data blocks are divided into nicer sizes. A 128KB block in a 9-wide raidz vdev will be split into 128/(9-1) = 16KB chunks.

Cheers,

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Recomandations
On 29 November 2010 15:03, Erik Trimble erik.trim...@oracle.com wrote:
> I'd have to re-look at the ZFS Best Practices Guide, but I'm pretty sure the recommendation of 7, 9, or 11 disks was for a raidz1, NOT a raidz2. Due to #5 above, best performance comes with an EVEN number of data disks in any raidZ, so a write to any disk is always a full portion of the chunk, rather than a partial one (that sounds funny, but trust me). The best balance of size, IOPS, and throughput is found in the mid-size raidZ(n) configs, where there are 4, 6 or 8 data disks.

Let s = the maximum block size of 128KiB. If n = the number of disks in a raidz vdev, p = the number of parity disks used and d = the number of data drives, then n = d + p.

So, for some given numbers of d:

  d    s/d (KiB)
  1    128
  2    64
  3    42.67
  4    32
  5    25.6
  6    21.33
  7    18.29
  8    16
  9    14.22
  10   12.8

Hence, for a raidz vdev with a width of 7, d = 6 and s/d = 21.33KiB. This isn't an ideal block size by any stretch of the imagination. Same thing for a width of 11: d = 10, s/d = 12.8KiB.

What you were aiming for: for ideal performance, one should keep the vdev width to the form 2^x + p. So, for raidz: 2, 3, 5, 9, 17. For raidz2: 3, 4, 6, 10, 18, etc.

Cheers,

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
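As a concrete sketch of the 2^x + p shape: a 6-wide raidz2 has d = 4, so a full 128KiB block splits into clean 32KiB chunks per data disk. Device names below are hypothetical:

  # 6 disks = 4 data + 2 parity; 128KiB / 4 = 32KiB per data disk
  zpool create tank raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0
  zpool status tank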
Re: [zfs-discuss] Seagate ST32000542AS and ZFS perf
On 30 November 2010 03:09, Krunal Desai mov...@gmail.com wrote:
> I assume it either:
> 1. does a really good job of 512-byte emulation that results in little to no performance degradation (http://consumer.media.seagate.com/2010/06/the-digital-den/advanced-format-drives-with-smartalign/ references test data)
> 2. dynamically looks to see if it even needs to do anything; if the host OS is sending it requests that are all 4k-aware/aligned, all is well.

My understanding is that this is merely saying that it will *align* the data correctly with Windows XP, regardless of where Windows XP asks for the first sector to be. This has nothing to do with 512B random writes.

> Though, the power-on hours count seems rather low for me... 8760 hours, or just 1 year of 24/7 operation.

Not sure where you got this figure from; the Barracuda Green (http://www.seagate.com/docs/pdf/datasheet/disc/ds1720_barracuda_green.pdf) is a different drive to the one we've been talking about in this thread (http://www.seagate.com/docs/pdf/datasheet/disc/ds_barracuda_lp.pdf).

I would note that the Seagate 2TB LP has a 0.32% Annualised Failure Rate. I.e., in a given sample (which aren't overheating, etc.) 32 from every 10,000 should fail. I *believe* that the Power-On Hours figure on the Barracuda Green is simply saying that it is designed for 24/7 usage. It's a per-year number. I couldn't imagine them specifying the number of hours before failure like that, just below an AFR of 0.43.

Cheers,

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ashift and vdevs
On 27 November 2010 08:05, Krunal Desai mov...@gmail.com wrote:
> One new thought occurred to me; I know some of the 4K drives emulate 512 byte sectors, so to the host OS, they appear to be no different than other 512b drives. With this additional layer of emulation, I would assume that ashift wouldn't be needed, though I have read reports of this affecting performance. I think I'll need to confirm what drives do what exactly and then decide on an ashift if needed.

Consider that for a 4KB-internal drive with a 512B external interface, a request for a 512B write will result in the drive reading 4KB, modifying it (putting the new 512B in) and then writing the 4KB out again. This is terrible from a latency perspective. I recall seeing 20 IOPS on a WD EARS 2TB drive (i.e., 50ms latency for random 512B writes).

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ashift and vdevs
zdb -C shows an ashift value on each vdev in my pool; I was just wondering if it is vdev specific, or pool wide. Google didn't seem to know.

I'm considering a mixed pool with some Advanced Format (4KB sector) drives and some normal 512B sector drives, and was wondering if the ashift can be set per vdev, or only per pool. Theoretically, this would save me some size on metadata on the 512B sector drives.

Cheers,

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ashift and vdevs
Cheers for the links David, but you'll note that I've commented on the blog you linked (i.e., I was aware of it).

The zpool-12 binary linked from http://digitaldj.net/2010/11/03/zfs-zpool-v28-openindiana-b147-4k-drives-and-you/ worked perfectly on my SX11 installation. (It threw some error on b134, so it relies on some external code, to some extent.) I'd note, for those who are going to try it, that the binary produces a pool of as high a version as the system supports. I was surprised that it was higher than the code for which it was compiled (i.e., b147 = zpool v28).

I'm currently populating a pool with a 9-wide raidz vdev of Samsung HD204UI 2TB (5400rpm, 4KB sector) and a 9-wide raidz vdev of Seagate LP ST32000542AS 2TB (5900rpm, 4KB sector), which was created with that binary, and haven't seen any of the performance issues I've had in the past with WD EARS drives.

It would be lovely if Oracle could see fit to implement correct detection of these drives! Or, at the very least, an -o ashift=12 parameter for zpool create.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
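For reference, the creation step with the patched binary looks the same as with the stock tool; only the binary differs. A sketch, assuming the replacement zpool binary has been unpacked into the current directory and the nine devices are c3t0d0 through c3t8d0 (hypothetical names):

  # the patched binary is only needed at creation time; the stock tools manage the pool afterwards
  ./zpool create tank raidz c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0 c3t7d0 c3t8d0

  # confirm the vdev really came out 4KiB-aligned
  zdb -C tank | grep ashift     # expect "ashift: 12"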
Re: [zfs-discuss] vdev failure - pool loss ?
Tuomas: My understanding is that the copies functionality doesn't guarantee that the extra copies will be kept on a different vdev. So that isn't entirely true. Unfortunately.

On 20 October 2010 07:33, Tuomas Leikola tuomas.leik...@gmail.com wrote:
> On Mon, Oct 18, 2010 at 8:18 PM, Simon Breden sbre...@gmail.com wrote:
>> So are we all agreed then, that a vdev failure will cause pool loss?
>
> -- unless you use copies=2 or 3, in which case your data is still safe for those datasets that have this option set.
>
> - Tuomas

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Help - Deleting files from a large pool results in less free space!
Forgive me, but isn't this incorrect:

  mv /pool1/000 /pool1/000d
  rm -rf /pool1/000

Shouldn't that last line be rm -rf /pool1/000d ??

On 8 October 2010 04:32, Remco Lengers re...@lengers.com wrote:
> any snapshots?
>
>   zfs list -t snapshot
>
> ..Remco
>
> On 10/7/10 7:24 PM, Jim Sloey wrote:
>> I have a 20Tb pool on a mount point that is made up of 42 disks from an EMC SAN. We were running out of space and down to 40Gb left (loading 8Gb/day) and have not received disk for our SAN. Using df -h results in:
>>
>>   Filesystem  size  used  avail  capacity  Mounted on
>>   pool1       20T   20T   55G    100%      /pool1
>>   pool2       9.1T  8.0T  497G   95%       /pool2
>>
>> The idea was to temporarily move a group of big directories to another zfs pool that had space available and link from the old location to the new.
>>
>>   cp -r /pool1/000 /pool2/
>>   mv /pool1/000 /pool1/000d
>>   ln -s /pool2/000 /pool1/000
>>   rm -rf /pool1/000
>>
>> Using df -h after the relocation results in:
>>
>>   Filesystem  size  used  avail  capacity  Mounted on
>>   pool1       20T   19T   15G    100%      /pool1
>>   pool2       9.1T  8.3T  221G   98%       /pool2
>>
>> Using zpool list says:
>>
>>   NAME   SIZE   USED   AVAIL  CAP
>>   pool1  19.9T  19.6T  333G   98%
>>   pool2  9.25T  8.89T  369G   96%
>>
>> Using zfs get all pool1 produces:
>>
>>   NAME   PROPERTY            VALUE                  SOURCE
>>   pool1  type                filesystem             -
>>   pool1  creation            Tue Dec 18 11:37 2007  -
>>   pool1  used                19.6T                  -
>>   pool1  available           15.3G                  -
>>   pool1  referenced          19.5T                  -
>>   pool1  compressratio       1.00x                  -
>>   pool1  mounted             yes                    -
>>   pool1  quota               none                   default
>>   pool1  reservation         none                   default
>>   pool1  recordsize          128K                   default
>>   pool1  mountpoint          /pool1                 default
>>   pool1  sharenfs            on                     local
>>   pool1  checksum            on                     default
>>   pool1  compression         off                    default
>>   pool1  atime               on                     default
>>   pool1  devices             on                     default
>>   pool1  exec                on                     default
>>   pool1  setuid              on                     default
>>   pool1  readonly            off                    default
>>   pool1  zoned               off                    default
>>   pool1  snapdir             hidden                 default
>>   pool1  aclmode             groupmask              default
>>   pool1  aclinherit          secure                 default
>>   pool1  canmount            on                     default
>>   pool1  shareiscsi          off                    default
>>   pool1  xattr               on                     default
>>   pool1  replication:locked  true                   local
>>
>> Has anyone experienced this or know where to look for a solution to recovering space?

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] resilver that never finishes
But all of which have newer code, today, than onnv-134.

On 18 September 2010 22:20, Tom Bird t...@marmot.org.uk wrote:
> On 18/09/10 13:06, Edho P Arief wrote:
>> On Sat, Sep 18, 2010 at 7:01 PM, Tom Bird t...@marmot.org.uk wrote:
>>> All said and done though, we will have to live with snv_134's bugs from now on, or perhaps I could try Sol 10.
>>
>> or OpenIllumos. Or Nexenta. Or FreeBSD. Or <insert osol distro name>.
>
> ... none of which will receive ZFS code updates unless Oracle deigns to bestow them upon the community, this or ZFS dev is taken over by said community, in which case we end up with diverging code bases that would be a Sisyphean task to try and merge.
>
> Tom

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New SSD options
Basic electronics, go!

The linked capacitor from Elna (http://www.elna.co.jp/en/capacitor/double_layer/catalog/pdf/dk_e.pdf) has an internal resistance of 30 ohms.

Intel rate their 32GB X25-E at 2.4W active (we aren't interested in idle power usage; if it's idle, we don't need the capacitor in the first place) on the +5V rail; that's 0.48A (P=VI). V=IR, supply is 5V, current through the load is 480mA, hence R=10.4 ohms. The resistance of the X25-E under load is 10.4 ohms.

Now if you have a capacitor discharge circuit with the charged Elna DK-6R3D105T - the largest and most suitable from that datasheet - you have 40.4 ohms around the loop (cap and load). +5V over 40.4 ohms: the maximum current you can pull from that is I=V/R = 124mA. Around a quarter of what the X25-E wants in order to write. The setup won't work. (See the worked figures after the quoted message below.)

I'd suggest something more along the lines of http://www.cap-xx.com/products/products.htm which have an ESR around three orders of magnitude lower.

t

On 22 May 2010 18:58, Ragnar Sundblad ra...@csc.kth.se wrote:
> On 22 maj 2010, at 07.40, Don wrote:
>
> The SATA power connector supplies 3.3, 5 and 12v. A complete solution will have all three. Most drives use just the 5v, so you can probably ignore 3.3v and 12v.
>
> I'm not interested in building something that's going to work for every possible drive config - just my config :)
>
> Both the Intel X25-e and the OCZ only use the 5V rail.
>
> You'll need to use a step-up DC-DC converter and be able to supply ~100mA at 5v. It's actually easier/cheaper to use a LiPoly battery charger and get a few minutes of power than to use an ultracap for a few seconds of power. Most ultracaps are ~2.5v and LiPoly is 3.7v, so you'll need a step-up converter in either case.
>
> Ultracapacitors are available in voltage ratings beyond 12 volts, so there is no reason to use a boost converter with them. That eliminates high frequency switching transients right next to our SSD, which is always helpful.
>
> In this case we have lots of room. We have a 3.5 x 1 drive bay, but a 2.5 x 1/4 hard drive. There is ample room for several of the 6.3V ELNA 1F capacitors (and our SATA power rail is a 5V regulated rail so they should suffice) - either in series or parallel (depending on voltage or runtime requirements). http://www.elna.co.jp/en/capacitor/double_layer/catalog/pdf/dk_e.pdf
>
> You could put 2 caps in series for better voltage tolerance or in parallel for longer runtimes. Either way you probably don't need a charge controller, a boost or buck converter, or in fact any ICs at all. It's just a small board with some caps on it.
>
> I know they have a certain internal resistance, but I am not familiar with the characteristics; is it high enough so you don't need to limit the inrush current, and is it low enough so that you don't need a voltage booster for output?
>
> Cost for a 5v-only system should be $30 - $35 in one-off prototype-ready components with a 1100mAh battery (using prices from Sparkfun.com). You could literally split a SATA cable and add in some capacitors for just the cost of the caps themselves. The issue there is whether the caps would present too large a current drain on initial charge-up. If they do, then you need to add in charge controllers and you've got the same problems as with a LiPo battery - although without the shorter service life.
>
> At the end of the day the real problem is whether we believe the drives themselves will actually use the quiet period on the now dead bus to write out their caches. This is something we should ask the manufacturers, and test for ourselves.
>
> Indeed!
> /ragge

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
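For anyone retracing the figures at the top of this post, the chain is just P = VI and Ohm's law. A rough sketch, ignoring the capacitor's voltage droop as it discharges:

  I_load = P / V              = 2.4 W / 5 V            = 0.48 A
  R_load = V / I_load         = 5 V / 0.48 A           ~ 10.4 ohms
  I_max  = V / (ESR + R_load) = 5 V / (30 + 10.4) ohms ~ 0.124 A

124mA is roughly a quarter of the 480mA the X25-E draws while writing, hence the conclusion that a single 30-ohm-ESR cap can't hold it up.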
Re: [zfs-discuss] Understanding ZFS performance.
iostat -xen 1 will provide the same device names as the rest of the system (as well as show error columns). zpool status will show you which drive is in which pool. As for the controllers, cfgadm -al groups them nicely.

t

On 23 May 2010 03:50, Brian broco...@vt.edu wrote:
> I am new to OSOL/ZFS but have just finished building my first system. I detailed the system setup here: http://opensolaris.org/jive/thread.jspa?threadID=128986&tstart=15
>
> I ended up having to add an additional controller card as two ports on the motherboard did not work as standard SATA ports. Luckily I was able to salvage an LSI SAS card from an old system. Things seem to be working OK for the most part, but I am trying to dig a bit deeper into the performance. I have done some searching and it seems that iostat -x can help you better understand your performance.
>
> I have 8 drives in the system. 2 are in a mirrored boot pool and the other 6 are in a single raidz2 pool. All 6 are the same: Samsung 1TB Spinpoints. Here is what my output looks like from iostat -x 30 during a scrub of the raidz2 pool:
>
>   device    r/s    w/s     kr/s   kw/s  wait  actv  svc_t  %w  %b
>   cmdk0   299.7    1.5  37080.9    1.6   7.7   2.0   32.3  98  99
>   cmdk1   300.2    1.3  37083.0    1.5   7.7   2.0   32.2  98  99
>   cmdk2  1018.6    1.6  37141.3    1.7   0.5   0.7    1.2  22  43
>   cmdk3     0.0    1.8      0.0    5.2   0.0   0.0   33.7   1   2
>   cmdk4  1045.6    2.1  37124.3    1.4   0.7   0.7    1.3  21  41
>   sd6       0.0    1.8      0.0    5.2   0.0   0.0   25.1   0   1
>   sd7    1033.4    2.5  37128.5    1.8   0.0   1.0    1.0   3  38
>   sd8    1044.5    2.5  37129.4    1.8   0.0   0.9    0.9   3  36
>                      extended device statistics
>   device    r/s    w/s     kr/s   kw/s  wait  actv  svc_t  %w  %b
>   cmdk0   301.9    1.3  37339.0    1.7   7.8   2.0   32.1  99  99
>   cmdk1   302.1    1.4  37341.0    1.8   7.7   2.0   32.0  99  99
>   cmdk2  1048.1    1.5  37400.4    1.6   0.5   0.7    1.1  20  42
>   cmdk3     0.0    1.5      0.0    5.1   0.0   0.0   36.5   1   2
>   cmdk4  1054.4    1.6  37363.1    1.5   0.7   0.6    1.2  20  40
>   sd6       0.0    1.5      0.0    5.1   0.0   0.0   30.4   0   1
>   sd7    1044.4    2.1  37404.2    1.7   0.0   0.9    0.9   3  38
>   sd8    1050.5    2.1  37382.8    1.9   0.0   0.9    0.9   3  36
>                      extended device statistics
>   device    r/s    w/s     kr/s   kw/s  wait  actv  svc_t  %w  %b
>   cmdk0   296.3    1.5  36195.4    1.7   7.8   2.0   32.7  99  99
>   cmdk1   295.2    1.5  36230.1    1.8   7.7   2.0   32.5  98  98
>   cmdk2   987.5    2.0  36171.5    1.7   0.6   0.7    1.3  22  43
>   cmdk3     0.0    1.5      0.0    5.1   0.0   0.0   37.7   1   2
>   cmdk4  1018.3    2.0  36160.8    1.6   0.7   0.6    1.4  21  41
>   sd6       0.0    1.5      0.0    5.1   0.0   0.1   40.3   0   2
>   sd7    1005.3    2.6  36300.6    1.8   0.0   1.1    1.1   3  39
>   sd8    1016.0    2.5  36260.1    2.0   0.0   1.0    1.0   3  36
>
> I think cmdk3 and sd6 are in my rpool. I tried to split the pools across the controllers for better performance. It seems to me that cmdk0 and cmdk1 are much slower than the others, but I am not sure why or what to check next... In fact I am not even sure how I can trace back that device name to figure out which controller it is connected to. Any ideas or next steps would be appreciated. Thanks.
>
> --
> This message posted from opensolaris.org

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
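A minimal sketch of the mapping exercise described above; nothing here is specific to Brian's box, and the device names in the output will be whatever your system reports:

  # per-device stats with error counters and cXtYdZ-style names, refreshed every second
  iostat -xen 1

  # lists each pool's member devices, so the busy devices can be matched to a pool
  zpool status

  # shows which controller/attachment point each disk hangs off
  cfgadm -al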
[zfs-discuss] ZFS on-disk DDT block arrangement
I was wondering if someone could explain why the DDT is seemingly (from empirical observation) kept in a huge number of individual blocks, randomly written across the pool, rather than just a large binary chunk somewhere.

Having been the victim of the really long times it takes to destroy a dataset that has dedup=on, I was wondering why that was. From memory, when the destroy process was running, something like iopattern -r showed constant 99% random reads. This seems like a very wasteful approach to allocating blocks for the DDT.

Having deleted the 900GB dataset, finally, I now only have around 152GB (allocated PSIZE) left deduped on that pool.

  # zdb -DD tank
  DDT-sha256-zap-duplicate: 310684 entries, size 578 on disk, 380 in core
  DDT-sha256-zap-unique: 1155817 entries, size 2438 on disk, 1783 in core

So 1466501 DDT blocks. For 152GB of data, that's around 108KB/block on average, which seems sane. To destroy the dataset holding the files which reference the DDT, I'm looking at 1.46 million random reads to complete the operation (less those elements in the ARC or L2ARC). That's a lot of read operations for my poor spindles.

I've seen some people saying that the DDT blocks are around 270 bytes each, but does it really matter, if the smallest block that zfs can read/write (for obvious reasons) is 512 bytes? Clearly 2x 270B > 512B, but couldn't there be some way of grouping DDT elements together (in, say, 1MB blocks)? Thoughts?

(Side note: can someone explain the "size xxx on disk, xxx in core" statements in that zdb output for me? The numbers never seem related to the number of entries or anything.)

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and 4kb sector Drives (All new western digital GREEN Drives?)
I'm not entirely convinced there is no problem here.

I had a WD EADS 1.5TB die; the warranty replacement drive was an EARS. So, first foray into 4K sectors. I had 8x EADS in a raidz set, and had replaced the broken one with a 1.5TB Seagate 7200rpm - which was obviously faster. Just replacing back, and here is the iostat for the new EARS drive: http://pastie.org/889572

Those asvc_t's are atrocious. As is the op/s throughput. All the other drives spend the vast majority of the time idle, waiting for the new EARS drive to write out data.

This is after isolating another issue to my Dell PERC 5/i's - they apparently don't talk nicely with the EARS drives either. Streaming writes would push data for two seconds and pause for ten. Random writes ... give up. On the Intel chipset's SATA, streaming writes are acceptable, but random writes are as per the above URL.

format tells me that the partition starts at sector 256. But given that ZFS writes variable size blocks, that really shouldn't matter. When I plugged the EARS into a P45-based motherboard running Windows, HDTune presents a normal looking streaming writes graph, and the average seek time is 14ms - the drive seems healthy.

Any thoughts?

On 27 March 2010 21:05, Svein Skogen sv...@stillbilde.net wrote:
> On 27.03.2010 11:01, Daniel Carosone wrote:
>> On Sat, Mar 27, 2010 at 08:47:26PM +1100, Daniel Carosone wrote:
>>> On Fri, Mar 26, 2010 at 05:57:31PM -0700, Darren Mackay wrote:
>>>> not sure if 32bit BSD supports 48bit LBA
>>>
>>> Solaris is the only otherwise-modern OS with this daft limitation.
>>
>> Ok, it's not due to LBA48, but the 1Tb limitation is still daft. There are some limits you'll encounter in BSD, such as if you use the wrong disklabel format - but not in the basic disk drivers.
>
> And, if you use the cheaper cards containing SiI chips, you may find that _SOME_ manufacturers saved a penny per year by not connecting all the wires, so LBA48 simply doesn't work. Don't ask how many gray hairs tracking down THAT one caused me. (Disks that worked absolutely fine until you passed a certain block. Then *wham!* corruptions galore.)
>
> //Svein

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Q : recommendations for zpool configuration
A pool with a 4-wide raidz2 is a completely nonsensical idea. It has the same amount of accessible storage as two striped mirrors, would be slower in terms of IOPS, and would be harder to upgrade in the future (you'd need to keep adding four drives for every expansion with raidz2 - with mirrors you only need to add another two drives to the pool).

Just my $0.02.

On 19 March 2010 18:28, homerun petri.j.kunn...@gmail.com wrote:
> Greetings. I would like to get your recommendation on how to set up a new pool. I have 4 new 1.5TB disks reserved for the new zpool. I planned to grow/replace the existing small 4-disk (raidz) setup with a new, bigger one. As the new pool will be bigger and will hold more personally important data to be stored for a long time, I'd like to ask your recommendation: should I recreate the pool, or just replace the existing devices? I have noted there is now raidz2 and have been thinking which would be better: a pool with 2 mirrors, or one pool with 4 disks in raidz2. So at least could someone explain these new raidz configurations?
> Thanks
>
> --
> This message posted from opensolaris.org

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Reading ZFS config for an extended period
Just thought I'd chime in for anyone who had read this - the import operation completed this time, after 60 hours of disk grinding. :) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Reading ZFS config for an extended period
The system in question has 8GB of RAM. It never paged during the import (unless I was asleep at that point, but anyway). It ran for 52 hours, then started doing 47% kernel CPU usage. At this stage, dtrace stopped responding, and so iopattern died, as did iostat. It was also increasing RAM usage rapidly (15MB/minute). After an hour of that, the CPU went up to 76%. An hour later, CPU usage stopped. Hard drives were churning throughout all of this (albeit at a rate that looks like each vdev is being controlled by a single-threaded operation).

I'm guessing that if you don't have enough RAM, it gets stuck on the use-lots-of-CPU phase, and just dies from too much paging. Of course, I have absolutely nothing to back that up.

Personally, I think that if L2ARC devices were persistent, we would already have the mechanism in place for storing the DDT as a separate vdev. The problem is, there is nothing you can run at boot time to populate the L2ARC, so the dedup writes are ridiculously slow until the cache is warm. If the cache stayed warm, or there was an option to forcibly warm up the cache, this could be somewhat alleviated.

Cheers

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Reading ZFS config for an extended period
After around four days the process appeared to have stalled (no audible hard drive activity). I restarted with milestone=none; deleted /etc/zfs/zpool.cache, restarted, and went zpool import tank. (Also allowed root login to ssh, so I could make new ssh sessions if required.) Now I can watch the process from on the machine.

My present question is: how is the DDT stored? I believe the DDT to have around 10M entries for this dataset, as per:

  DDT-sha256-zap-duplicate: 400478 entries, size 490 on disk, 295 in core
  DDT-sha256-zap-unique: 10965661 entries, size 381 on disk, 187 in core

(taken just previous to the attempt to destroy the dataset)

A sample from iopattern shows:

  %RAN  %SEQ  COUNT  MIN    MAX    AVG   KR
   100     0    195  512    512    512    97
   100     0    414  512  65536    895   362
   100     0    261  512    512    512   130
   100     0    273  512    512    512   136
   100     0    247  512    512    512   123
   100     0    297  512    512    512   148
   100     0    292  512    512    512   146
   100     0    250  512    512    512   125
   100     0    274  512    512    512   137
   100     0    302  512    512    512   151
   100     0    294  512    512    512   147
   100     0    308  512    512    512   154
    98     2    286  512    512    512   143
   100     0    270  512    512    512   135
   100     0    390  512    512    512   195
   100     0    269  512    512    512   134
   100     0    251  512    512    512   125
   100     0    254  512    512    512   127
   100     0    265  512    512    512   132
   100     0    283  512    512    512   141

As the pool is comprised of 2x 8-disk raidz vdevs, I presume that each element is stored twice (for the raidz redundancy). So at around 280 512B read op/s, that's 140 entries per second.

Is the import of a semi-broken pool:
1. Reading all the DDT markers for the dataset; or
2. Reading all the DDT markers for the pool; or
3. Reading all of the block markers for the dataset; or
4. Reading all of the block markers for the pool
prior to actually finalising what it needs to do to fix the pool?

I'd like to be able to estimate the length of time likely before the import finishes. Or should I tell it to roll back to the last valid txg - i.e. before the zfs destroy dataset command was issued (by zpool import -F)? Or is this likely to take as long/longer than the present import/fix?

Cheers.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
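Purely as an illustration of the kind of estimate being asked for - assuming the import really does have to touch every DDT entry once, and that the ~140 entries/second rate above holds - the arithmetic would be roughly:

  400,478 + 10,965,661 ~ 11.37 million entries
  11,366,139 entries / 140 entries per second ~ 81,000 s ~ 22.5 hours

Whether an import actually walks every entry exactly once is precisely the open question in this post, so treat that as arithmetic, not a prediction.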
[zfs-discuss] Reading ZFS config for an extended period
Can anyone comment on whether the on-boot "Reading ZFS config" is any slower/better/whatever than deleting zpool.cache, rebooting and manually importing?

I've been waiting more than 30 hours for this system to come up. There is a pool with 13TB of data attached. The system locked up whilst destroying a 934GB dedup'd dataset, and I was forced to reboot it. I can hear hard drive activity presently - i.e. it's doing *something* - but am really hoping there is a better way :)

Thanks

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Reading ZFS config for an extended period
Do you think that more RAM would help this progress faster? We've just hit 48 hours. No visible progress (although that doesn't really mean much).

It is presently in a system with 8GB of RAM; I could try to move the pool across to a system with 20GB of RAM, if that is likely to expedite the process. Of course, if it isn't going to make any difference, I'd rather not restart this process.

Thanks

On 12 February 2010 06:08, Bill Sommerfeld sommerf...@sun.com wrote:
> On 02/11/10 10:33, Lori Alt wrote:
>> This bug is closed as a dup of another bug which is not readable from the opensolaris site. (I'm not clear what makes some bugs readable and some not.)
>
> The other bug in question was opened yesterday and probably hasn't had time to propagate.
>
> - Bill

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss