Re: [zfs-discuss] best migration path from Solaris 10
On Sun, Mar 20, 2011 at 4:05 AM, Pawel Jakub Dawidek <p...@freebsd.org> wrote:
> On Fri, Mar 18, 2011 at 06:22:01PM -0700, Garrett D'Amore wrote:
>> Newer versions of FreeBSD have newer ZFS code.
>
> Yes, we are at v28 at this point (the latest open-source version).
>
>> That said, ZFS on FreeBSD is kind of a 2nd class citizen still. [...]
>
> That's actually not true. There are more FreeBSD committers working on ZFS than on UFS.

How is the performance of ZFS under FreeBSD? Is it comparable to that in Solaris, or still slower due to some needed compatibility layer?

--
Fajar
[zfs-discuss] A resilver record?
Has anyone seen a resilver longer than this for a 500G drive in a raidz2 vdev?

  scrub: resilver completed after 169h25m with 0 errors on Sun Mar 20 19:57:37 2011
    c0t0d0  ONLINE  0  0  0  769G resilvered

and I told the client it would take 3 to 4 days! :)

--
Ian.
Re: [zfs-discuss] A resilver record?
> Has anyone seen a resilver longer than this for a 500G drive in a raidz2 vdev?
>
>   scrub: resilver completed after 169h25m with 0 errors on Sun Mar 20 19:57:37 2011
>     c0t0d0  ONLINE  0  0  0  769G resilvered
>
> and I told the client it would take 3 to 4 days!

It all depends on the number of drives in the VDEV(s), traffic patterns during the resilver, the speed of the drives, how full the VDEVs are, etc. Still, close to 6 days is a lot. Can you detail your configuration?

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
Re: [zfs-discuss] A resilver record?
769G resilvered on a 500G drive? I'm guessing there was a whole bunch of activity (and probably snapshot creation) happening alongside the resilver.

On 20 March 2011 18:57, Ian Collins <i...@ianshome.com> wrote:
> Has anyone seen a resilver longer than this for a 500G drive in a raidz2 vdev?
>
>   scrub: resilver completed after 169h25m with 0 errors on Sun Mar 20 19:57:37 2011
>     c0t0d0  ONLINE  0  0  0  769G resilvered
>
> and I told the client it would take 3 to 4 days! :)
>
> --
> Ian.
Re: [zfs-discuss] best migration path from Solaris 10
Probably we need to place a tag before ZFS -- Opensource-ZFS or Oracle-ZFS -- after the Solaris 11 release. If that happens, these two ZFSes will definitely evolve in different directions.

BTW, did Oracle unveil the actual release date? We are also at a crossroads...

Thanks.

Fred

-----Original Message-----
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Fajar A. Nugraha
Sent: Sunday, March 20, 2011 14:55
To: Pawel Jakub Dawidek
Cc: openindiana-disc...@openindiana.org; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] best migration path from Solaris 10

On Sun, Mar 20, 2011 at 4:05 AM, Pawel Jakub Dawidek <p...@freebsd.org> wrote:
> On Fri, Mar 18, 2011 at 06:22:01PM -0700, Garrett D'Amore wrote:
>> Newer versions of FreeBSD have newer ZFS code.
>
> Yes, we are at v28 at this point (the latest open-source version).
>
>> That said, ZFS on FreeBSD is kind of a 2nd class citizen still. [...]
>
> That's actually not true. There are more FreeBSD committers working on ZFS than on UFS.

How is the performance of ZFS under FreeBSD? Is it comparable to that in Solaris, or still slower due to some needed compatibility layer?

--
Fajar
Re: [zfs-discuss] best migration path from Solaris 10
Fred Liu <fred_...@issi.com> wrote:
> Probably we need to place a tag before ZFS -- Opensource-ZFS or Oracle-ZFS -- after the Solaris 11 release. If that happens, these two ZFSes will definitely evolve in different directions.
>
> BTW, did Oracle unveil the actual release date? We are also at a crossroads...

The long-term acceptance of ZFS depends on how Oracle behaves once the announced Solaris 11 is released. If they don't open-source the related ZFS code, they will harm the future of ZFS. If they do open-source it again, there is still a problem with syncing the ZFS versions from the OSS OpenSolaris continuation projects.

The revision number introduced by Sun is only useful as long as no more than a single entity introduces new features. For a reliable future with distributed ZFS development, we would need something like the POSIX method for introducing tar extensions: a combination of a textual name for the entity that introduced the feature and a textual name for the feature itself, e.g. SCHILY-zfs-encryption.

Jörg

--
EMail: jo...@schily.net (home) Jörg Schilling D-13353 Berlin
       j...@cs.tu-berlin.de (uni)
       joerg.schill...@fokus.fraunhofer.de (work)
Blog: http://schily.blogspot.com/
URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
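To make the idea concrete, here is a small, purely hypothetical sketch (in Python, not anything that exists in ZFS) of how vendor-prefixed feature tags could replace a single global version number; every tag name in it is invented for illustration:

# Hypothetical sketch of vendor-prefixed feature tags as proposed above.
# None of these tags are real ZFS features; the point is only that a
# (vendor, feature) pair can be checked independently of a version number.

class FeatureRegistry:
    def __init__(self):
        self.required = set()

    def register(self, vendor, feature):
        """Record that the pool uses the tag <VENDOR>-<feature>."""
        self.required.add(f"{vendor}-{feature}")

    def can_import(self, supported_tags):
        """The pool is importable only if every required tag is supported."""
        return self.required <= set(supported_tags)


pool = FeatureRegistry()
pool.register("SCHILY", "zfs-encryption")    # example tag from the post
pool.register("FOO", "zfs-compress-xyz")     # made-up tag for illustration

print(pool.can_import(["SCHILY-zfs-encryption"]))                          # False
print(pool.can_import(["SCHILY-zfs-encryption", "FOO-zfs-compress-xyz"]))  # True

Under such a scheme, two implementations that never exchange features simply ignore each other's tags instead of fighting over the next integer version number.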
Re: [zfs-discuss] best migration path from Solaris 10
On Mar 20, 2011, at 09:26, Joerg Schilling wrote:
> The long-term acceptance of ZFS depends on how Oracle behaves once the announced Solaris 11 is released. If they don't open-source the related ZFS code, they will harm the future of ZFS. If they do open-source it again, there is still a problem with syncing the ZFS versions from the OSS OpenSolaris continuation projects.

For a while Apple was considering it, and if Ellison and Jobs can come to an agreement, it would certainly become very popular very quickly. Apple probably ships more UNIX(tm) devices than any other vendor (often over 3M units in a quarter). Using revenue as a metric gives similar results.

And who says the Unix workstation market is dead? :)
Re: [zfs-discuss] A resilver record?
>> It all depends on the number of drives in the VDEV(s), traffic patterns during the resilver, the speed of the drives, how full the VDEVs are, etc. Still, close to 6 days is a lot. Can you detail your configuration?
>
> How many times do we have to rehash this? The speed of resilver is dependent on the amount of data, the distribution of data on the resilvering device, speed of the resilvering device, and the throttle. It is NOT dependent on the number of drives in the vdev.

Thanks for clearing this up - I've been told large VDEVs lead to long resilver times, but then, I guess that was wrong. Btw, after replacing some 2TB drives with 3TB ones in three VDEVs that were 95% full at the time, resilver times dropped by 30%, so I guess very full VDEVs aren't much fun on the resilver side either.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
Re: [zfs-discuss] best migration path from Solaris 10
On Mar 20, 2011, at 14:33, Garrett D'Amore wrote:
> I hear from reliable sources that Apple is not doing anything with ZFS, so I would not look there for leadership.

Given that one of the prominent (?) file system guys at Apple left to form his own ZFS company, I figured that was the case even before you stated the above:

http://tinyurl.com/4jznw48
http://arstechnica.com/apple/news/2011/03/how-zfs-is-slowly-making-its-way-to-mac-os-x.ars

The ZFS Working Group is awesome news. I hope to hear of a bright future for ZFS on all operating systems.
Re: [zfs-discuss] A resilver record?
On Mar 20, 2011, at 14:24, Roy Sigurd Karlsbakk wrote:
>>> It all depends on the number of drives in the VDEV(s), traffic patterns during the resilver, the speed of the drives, how full the VDEVs are, etc. Still, close to 6 days is a lot. Can you detail your configuration?
>>
>> How many times do we have to rehash this? The speed of resilver is dependent on the amount of data, the distribution of data on the resilvering device, speed of the resilvering device, and the throttle. It is NOT dependent on the number of drives in the vdev.
>
> Thanks for clearing this up - I've been told large VDEVs lead to long resilver times, but then, I guess that was wrong.

There was a thread ("Suggested RaidZ configuration...") a little while back where the topic of IOPS and resilver time came up:

http://mail.opensolaris.org/pipermail/zfs-discuss/2010-September/thread.html#44633

I think this message by Erik Trimble is a good summary:

  Scenario 1: I have 5 1TB disks in a raidz1, and I assume I have 128k slab sizes. Thus, I have 32k of data for each slab written to each disk (4 x 32k data + 32k parity for a 128k slab). So, each IOPS gets to reconstruct 32k of data on the failed drive. It thus takes about 1TB/32k = 31e6 IOPS to reconstruct the full 1TB drive.

  Scenario 2: I have 10 1TB drives in a raidz1, with the same 128k slab sizes. In this case, there's only about 14k of data on each drive for a slab. This means each IOPS to the failed drive only writes 14k. So, it takes 1TB/14k = 71e6 IOPS to complete.

  From this, it can be pretty easy to see that the number of required IOPS to the resilvered disk goes up linearly with the number of data drives in a vdev. Since you're always going to be IOPS bound by the single disk resilvering, you have a fixed limit.

http://mail.opensolaris.org/pipermail/zfs-discuss/2010-September/044660.html

Also, a post by Jeff Bonwick on resilvering:

http://blogs.sun.com/bonwick/entry/smokin_mirrors

Between Richard's and Erik's statements, I would say that while resilver time is not dependent on the number of drives in the vdev, the pool configuration can affect the IOPS rate, and /that/ can affect the time it takes to finish a resilver. Is that a decent summary?

I think maybe the number of drives in the vdev comes into play because when people have a lot of disks, they often put them into RAIDZ[123] configurations. So it's just a matter of confusing the (IOPS-limiting) configuration with the fact that one may have many disks.
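A back-of-the-envelope sketch of Erik's arithmetic above (a rough model only, assuming 128k slabs, 1 TB of data to rebuild, and roughly 100 random write IOPS for a 7200 RPM drive; none of these numbers come from Ian's actual system):

# Rough estimate of I/O count and duration for the two raidz1 scenarios.
# Assumptions, not measurements: 128 KiB slabs, 1 TB rebuilt, ~100 write IOPS.

TB = 1e12
KIB = 1024

def resilver_estimate(data_drives, slab_kib=128, rebuild_bytes=TB, iops=100):
    chunk = slab_kib * KIB / data_drives   # bytes written to the new disk per I/O
    total_ios = rebuild_bytes / chunk      # I/Os needed to rebuild the disk
    hours = total_ios / iops / 3600        # duration if purely IOPS bound
    return total_ios, hours

for n in (4, 9):                           # data drives in 5-wide and 10-wide raidz1
    ios, hours = resilver_estimate(n)
    print(f"{n} data drives: {ios:.1e} I/Os, ~{hours:.0f} h if IOPS bound")

With those assumptions the 10-wide case works out to roughly 190 hours, which at least shows how a week-long resilver is arithmetically plausible when the rebuilding device is I/O-limited.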
Re: [zfs-discuss] A resilver record?
> I think maybe the number of drives in the vdev comes into play because when people have a lot of disks, they often put them into RAIDZ[123] configurations. So it's just a matter of confusing the (IOPS-limiting) configuration with the fact that one may have many disks.

My answer was not meant to be a generic one, but was based on the original question, which was about a raidz2 VDEV. But then, thanks for the info.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
Re: [zfs-discuss] A resilver record?
On Mar 20, 2011, at 12:48 PM, David Magda wrote:
> On Mar 20, 2011, at 14:24, Roy Sigurd Karlsbakk wrote:
>>>> It all depends on the number of drives in the VDEV(s), traffic patterns during the resilver, the speed of the drives, how full the VDEVs are, etc. Still, close to 6 days is a lot. Can you detail your configuration?
>>>
>>> How many times do we have to rehash this? The speed of resilver is dependent on the amount of data, the distribution of data on the resilvering device, speed of the resilvering device, and the throttle. It is NOT dependent on the number of drives in the vdev.
>>
>> Thanks for clearing this up - I've been told large VDEVs lead to long resilver times, but then, I guess that was wrong.
>
> There was a thread ("Suggested RaidZ configuration...") a little while back where the topic of IOPS and resilver time came up:
>
> http://mail.opensolaris.org/pipermail/zfs-discuss/2010-September/thread.html#44633
>
> I think this message by Erik Trimble is a good summary:

hmmm... I must've missed that one, otherwise I would have said...

> Scenario 1: I have 5 1TB disks in a raidz1, and I assume I have 128k slab sizes. Thus, I have 32k of data for each slab written to each disk (4 x 32k data + 32k parity for a 128k slab). So, each IOPS gets to reconstruct 32k of data on the failed drive. It thus takes about 1TB/32k = 31e6 IOPS to reconstruct the full 1TB drive.

Here, the IOPS doesn't matter because the limit will be the media write speed of the resilvering disk -- bandwidth.

> Scenario 2: I have 10 1TB drives in a raidz1, with the same 128k slab sizes. In this case, there's only about 14k of data on each drive for a slab. This means each IOPS to the failed drive only writes 14k. So, it takes 1TB/14k = 71e6 IOPS to complete.

Here, IOPS might matter, but I doubt it. Where we see IOPS matter is when the block sizes are small (e.g. metadata). In some cases you can see widely varying resilver times when the data is large versus small. These changes follow the temporal distribution of the original data. For example, if a pool's life begins with someone loading their MP3 collection (large blocks, mostly sequential) and then working on source code (small blocks, more random, lots of creates/unlinks), then the resilver will be bandwidth bound as it resilvers the MP3s and then IOPS bound as it resilvers the source. Hence, the prediction of when a resilver will finish is not very accurate.

> From this, it can be pretty easy to see that the number of required IOPS to the resilvered disk goes up linearly with the number of data drives in a vdev. Since you're always going to be IOPS bound by the single disk resilvering, you have a fixed limit.

You will not always be IOPS bound by the resilvering disk. You will be speed bound by the resilvering disk, where speed is either write bandwidth or random write IOPS.

-- richard
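One way to picture Richard's "speed bound" distinction is a toy model in which the rebuild rate of the resilvering disk is capped by both its media write bandwidth and its random write IOPS; the 80 MB/s and 100 IOPS figures below are assumptions for a generic 7200 RPM drive, not measurements from this thread:

# Toy model: effective rebuild rate is the lesser of media bandwidth and
# (per-write payload x random write IOPS). Illustrative numbers only.

KIB = 1024

def rebuild_rate_mb_s(chunk_bytes, write_mb_s=80.0, write_iops=100):
    return min(write_mb_s, chunk_bytes * write_iops / 1e6)

for chunk_kib in (4, 32, 128, 1024):
    rate = rebuild_rate_mb_s(chunk_kib * KIB)
    hours = 500e9 / (rate * 1e6) / 3600    # time to rebuild 500 GB of data
    print(f"{chunk_kib:5d} KiB per write: {rate:5.1f} MB/s, ~{hours:6.1f} h")

With these assumptions, large (or coalesced) writes hit the bandwidth ceiling and 500 GB finishes in a few hours, while small-block data drops the rate to well under 1 MB/s and the same 500 GB takes hundreds of hours -- the MP3-then-source-code effect Richard describes.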
Re: [zfs-discuss] A resilver record?
On 03/20/11 08:57 PM, Ian Collins wrote:
> Has anyone seen a resilver longer than this for a 500G drive in a raidz2 vdev?
>
>   scrub: resilver completed after 169h25m with 0 errors on Sun Mar 20 19:57:37 2011
>     c0t0d0  ONLINE  0  0  0  769G resilvered

I didn't intend to start an argument, I was just very surprised the resilver took so long.

This box is a backup staging server (Solaris 10u8), so it does receive a lot of data. However, it has lost a number of drives in the past and the resilver took around 100 hours, hence my surprise.

The drive is part of an 8 drive raidz2 vdev, not overly full:

  raidz2  3.40T  227G

--
Ian.
Re: [zfs-discuss] A resilver record?
On Mar 20, 2011, at 18:02, Ian Collins wrote:
> I didn't intend to start an argument, I was just very surprised the resilver took so long.

ZFS is a relatively young file system, and it does a lot of things differently than what has been done in the past. Personally I think arguments / debates / discussions like this thread help people understand how things work and bring out any misconceptions they may have, which can then be corrected.
Re: [zfs-discuss] A resilver record?
On Mar 20, 2011, at 3:02 PM, Ian Collins wrote:
> On 03/20/11 08:57 PM, Ian Collins wrote:
>> Has anyone seen a resilver longer than this for a 500G drive in a raidz2 vdev?
>>
>>   scrub: resilver completed after 169h25m with 0 errors on Sun Mar 20 19:57:37 2011
>>     c0t0d0  ONLINE  0  0  0  769G resilvered
>
> I didn't intend to start an argument, I was just very surprised the resilver took so long.

I'd describe the thread as critical analysis, not argument. There are many facets of ZFS resilver and scrub that many people have never experienced, so it makes sense to explore the issue.

Expect ZFS resilvers to take longer in the future for HDDs. Expect ZFS resilvers to remain quite fast for SSDs. Why? Because HDDs are getting bigger, but not faster, while SSDs are getting bigger and faster. I've done a number of studies of this and have a lot of data describing what happens. I also work through performance analysis of resilver cases in my ZFS tutorials.

> This box is a backup staging server (Solaris 10u8), so it does receive a lot of data. However, it has lost a number of drives in the past and the resilver took around 100 hours, hence my surprise.

We've thought about how to provide some sort of feedback on the progress of resilvers. It is relatively simple to know what has already been resilvered and how much throttling is currently active. But that info does not make future predictions more accurate.

-- richard
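As a rough illustration of the "bigger but not faster" point, here is the time just to rewrite an entire drive at its sequential media speed, a lower bound for resilvering a completely full vdev; the capacities and speeds are assumed round numbers, not benchmarks of any particular drive:

# Full-drive rewrite time at assumed sequential media speeds (illustration only).

drives = [
    ("500 GB HDD at ~60 MB/s",  500e9, 60e6),
    ("2 TB HDD at ~120 MB/s",   2e12,  120e6),
    ("256 GB SSD at ~200 MB/s", 256e9, 200e6),
]
for name, capacity_bytes, bytes_per_sec in drives:
    print(f"{name}: ~{capacity_bytes / bytes_per_sec / 3600:.1f} h")

Capacity has grown much faster than HDD write bandwidth, so even this best-case floor keeps rising for HDDs, while the SSD floor stays short and SSD random-write IOPS remove most of the worst-case penalty as well.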
Re: [zfs-discuss] A resilver record?
On 03/21/11 12:20 PM, Richard Elling wrote:
> On Mar 20, 2011, at 3:02 PM, Ian Collins wrote:
>> On 03/20/11 08:57 PM, Ian Collins wrote:
>>> Has anyone seen a resilver longer than this for a 500G drive in a raidz2 vdev?
>>>
>>>   scrub: resilver completed after 169h25m with 0 errors on Sun Mar 20 19:57:37 2011
>>>     c0t0d0  ONLINE  0  0  0  769G resilvered
>>
>> I didn't intend to start an argument, I was just very surprised the resilver took so long.
>
> I'd describe the thread as critical analysis, not argument. There are many facets of ZFS resilver and scrub that many people have never experienced, so it makes sense to explore the issue.
>
> Expect ZFS resilvers to take longer in the future for HDDs. Expect ZFS resilvers to remain quite fast for SSDs. Why? Because HDDs are getting bigger, but not faster, while SSDs are getting bigger and faster. I've done a number of studies of this and have a lot of data describing what happens. I also work through performance analysis of resilver cases in my ZFS tutorials.

Does the throttling improve receive latency? The 30+ second latency I see on this system during a resilver renders it pretty useless as a staging server (lots of small snapshots).

--
Ian.
Re: [zfs-discuss] A resilver record?
On 3/20/2011 2:23 PM, Richard Elling wrote:
> On Mar 20, 2011, at 12:48 PM, David Magda wrote:
>> On Mar 20, 2011, at 14:24, Roy Sigurd Karlsbakk wrote:
>>>>> It all depends on the number of drives in the VDEV(s), traffic patterns during the resilver, the speed of the drives, how full the VDEVs are, etc. Still, close to 6 days is a lot. Can you detail your configuration?
>>>>
>>>> How many times do we have to rehash this? The speed of resilver is dependent on the amount of data, the distribution of data on the resilvering device, speed of the resilvering device, and the throttle. It is NOT dependent on the number of drives in the vdev.
>>>
>>> Thanks for clearing this up - I've been told large VDEVs lead to long resilver times, but then, I guess that was wrong.
>>
>> There was a thread ("Suggested RaidZ configuration...") a little while back where the topic of IOPS and resilver time came up:
>>
>> http://mail.opensolaris.org/pipermail/zfs-discuss/2010-September/thread.html#44633
>>
>> I think this message by Erik Trimble is a good summary:
>
> hmmm... I must've missed that one, otherwise I would have said...
>
>> Scenario 1: I have 5 1TB disks in a raidz1, and I assume I have 128k slab sizes. Thus, I have 32k of data for each slab written to each disk (4 x 32k data + 32k parity for a 128k slab). So, each IOPS gets to reconstruct 32k of data on the failed drive. It thus takes about 1TB/32k = 31e6 IOPS to reconstruct the full 1TB drive.
>
> Here, the IOPS doesn't matter because the limit will be the media write speed of the resilvering disk -- bandwidth.
>
>> Scenario 2: I have 10 1TB drives in a raidz1, with the same 128k slab sizes. In this case, there's only about 14k of data on each drive for a slab. This means each IOPS to the failed drive only writes 14k. So, it takes 1TB/14k = 71e6 IOPS to complete.
>
> Here, IOPS might matter, but I doubt it. Where we see IOPS matter is when the block sizes are small (e.g. metadata). In some cases you can see widely varying resilver times when the data is large versus small. These changes follow the temporal distribution of the original data. For example, if a pool's life begins with someone loading their MP3 collection (large blocks, mostly sequential) and then working on source code (small blocks, more random, lots of creates/unlinks), then the resilver will be bandwidth bound as it resilvers the MP3s and then IOPS bound as it resilvers the source. Hence, the prediction of when a resilver will finish is not very accurate.
>
>> From this, it can be pretty easy to see that the number of required IOPS to the resilvered disk goes up linearly with the number of data drives in a vdev. Since you're always going to be IOPS bound by the single disk resilvering, you have a fixed limit.
>
> You will not always be IOPS bound by the resilvering disk. You will be speed bound by the resilvering disk, where speed is either write bandwidth or random write IOPS.
>
> -- richard

Really? Can you really be bandwidth limited on a (typical) RAIDZ resilver?

I can see where you might be on a mirror, with large slabs and essentially sequential read/write - that is, since the drivers can queue up several read/write requests at a time, you have the potential to be reading/writing several (let's say 4) 128k slabs per single IOPS. That means you read/write at 512k per IOPS for a mirror (best case scenario). For a 7200 RPM drive, that's 100 IOPS x 0.5MB/IOPS = 50MB/s, which is lower than the maximum throughput of a modern SATA drive. For one of the 15k SAS drives able to do 300 IOPS, you get 150MB/s, which indeed exceeds a SAS drive's write bandwidth.
For RAIDZn configs, however, you're going to be limited by the size of an individual read/write. As Roy pointed out before, the max size of an individual portion of a slab is 128k/X, where X = the number of data drives in the RAIDZn. So, for a typical 4-data-drive RAIDZn, even in the best case scenario where I can queue multiple slab requests (say 4) into a single IOPS, I'm likely to top out at about 128k of data to write to the resilvered drive per IOPS. That leads to 12MB/s for the 7200 RPM drive and 36MB/s for the 15k drive, both well under their respective bandwidth capability.

Even with large slab sizes, I really can't see any place where a RAIDZ resilver isn't going to be IOPS bound when using HDs as backing store. Mirrors are more likely, but still, even in that case, I think you're going to hit the IOPS barrier far more often than the bandwidth barrier.

Now, with SSDs as backing store, yes, you become bandwidth limited, because the IOPS values of SSDs are at least an order of magnitude greater than HDs, though both have the same max bandwidth characteristics.

Now, the *total* time it takes to resilver either a mirror or a RAIDZ is indeed primarily dependent on the number of allocated slabs in the vdev, and the level of fragmentation of those slabs. That essentially defines the total amount of work that needs to be done. The above discussion compares