[zfs-discuss] Building big cheap storage system. What hardware to use?
Hello. We need big, cheap storage and are looking at Supermicro systems, something based on the SC846E1-R900 case http://www.supermicro.com/products/chassis/4U/846/SC846E1-R900.cfm with 24 disk bays. This case comes with a 3 Gbit LSI SASX36 expander, but the problem of LSI-based HBA timeouts really concerns me. Should I get a newer motherboard with a 6 Gbit LSI SAS 2008 HBA, like http://www.supermicro.com/products/motherboard/QPI/5500/X8DT6-F.cfm?IPMI=Y&SAS=Y, or an older motherboard with the LSI 1068 HBA? Can anyone post good working configurations based on Supermicro hardware? Planning to use 2 TB Hitachi SATA drives (any thoughts on HDD choice?). -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zero out block / sectors
On 2010-01-25 at 08:31 -0600 Mike Gerdts sent off: You are missing the point. Compression and dedup will make it so that the blocks in the devices are not overwritten with zeroes. The goal is to overwrite the blocks so that a back-end storage device or back-end virtualization platform can recognize that the blocks are not in use and as such can reclaim the space. A filesystem that is able to do that fast would have to implement something like unwritten extents. Some days ago I experimented with creating and allocating huge files on ZFS on top of OpenSolaris using fcntl and F_ALLOCSP, which is basically the same thing that you want to do when you zero out space. It takes ages because it actually writes zeroes to the disk. A filesystem that knows the concept of unwritten extents finishes the job immediately. There are no real zeroes on the disk, but the extent is tagged as unwritten (you get zeroes when you read it). Are there any plans to add unwritten extent support to ZFS, or any reason why not? Björn
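For comparison, POSIX exposes this kind of preallocation as posix_fallocate(). A minimal sketch of the two behaviours described above, zero-writing versus extent tagging (hypothetical demo paths, not ZFS-specific; on filesystems without unwritten-extent support the libc call itself falls back to writing zeroes):

```python
import os

def allocate_with_zeroes(path, size, chunk=1 << 20):
    # What F_ALLOCSP on ZFS effectively does today: write real zeroes.
    with open(path, "wb") as f:
        buf = b"\0" * chunk
        remaining = size
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(buf[:n])
            remaining -= n
    return os.path.getsize(path)

def allocate_unwritten(path, size):
    # On a filesystem with unwritten extents (e.g. ext4, XFS) this returns
    # almost immediately: the extent is merely tagged unwritten, and any
    # read of it returns zeroes.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)
    return os.path.getsize(path)
```

Both calls leave a file that reads back as all zeroes; the difference is only in how long the allocation takes and whether zeroes physically hit the disk.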
[zfs-discuss] zfs destroy hangs machine if snapshot exists- workaround found
Hi, I was suffering for weeks from the following problem: a zfs dataset contained an automatic snapshot (monthly) that used 2.8 TB of data. The dataset was deprecated, so I chose to destroy it after I had deleted some files; eventually it was completely blank apart from the snapshot, which still locked 2.8 TB on the pool. 'zfs destroy -r pool/dataset' hung the machine within seconds, leaving it completely unresponsive. No relevant messages could be found in the logs. The issue was reproducible. The same happened for 'zfs destroy pool/data...@snapshot'. Thus, the conclusion was that the snapshot was indeed the problem. Solution: After trying several things, including updating the system to snv_130 and snv_131, I had the idea to roll the dataset back to the snapshot before making another zfs destroy attempt. 'zfs rollback pool/data...@snapshot' 'zfs unmount -f pool/dataset' 'zfs destroy -r pool/dataset' Et voilà! It worked. Conclusion: I guess there is something wrong in how zfs handles snapshots during a recursive dataset destruction. As it seems, the destruction is only successful if the dataset is consistent with the snapshot. Even if the workaround seems viable, a fix for the issue would be appreciated. Regards, Tonmaus
Re: [zfs-discuss] zfs destroy hangs machine if snapshot exists- workaround found
On 27 janv. 2010, at 12:10, Georg S. Duck wrote: Hi, I was suffering for weeks from the following problem: a zfs dataset contained an automatic snapshot (monthly) that used 2.8 TB of data. The dataset was deprecated, so I chose to destroy it after I had deleted some files; eventually it was completely blank besides the snapshot that still locked 2.8 TB on the pool. 'zfs destroy -r pool/dataset' hung the machine within seconds to be completely unresponsive. No respective messages could be found in logs. The issue was reproducible. The same happened for 'zfs destroy pool/data...@snapshot' Thus, the conclusion was that the snapshot was indeed the problem. For info, I have exactly the same situation here with a snapshot that cannot be deleted that results in the same symptoms. Total freeze, even on the console. Server responds to pings, but that's it. All iSCSI, NFS and ssh connections are cut. Currently running b130. I'll try the workaround once I get some spare space to migrate the contents. Erik
[zfs-discuss] Strange random errors getting automatically repaired
Hello, Has anyone ever seen vdevs getting removed and added back to the pool very quickly? That seems to be what's happening here. This has started to happen on dozens of machines at different locations since a few days ago. They are running OpenSolaris b111 and a few b126. Could this be bit rot and/or silent corruption getting detected and fixed?

Jan 27 01:18:01 hostname fmd: [ID 441519 daemon.notice] SUNW-MSG-ID: FMD-8000-4M, TYPE: Repair, VER: 1, SEVERITY: Minor
Jan 27 01:18:01 hostname EVENT-TIME: Thu Dec 24 08:50:34 BRST 2009
Jan 27 01:18:01 hostname PLATFORM: X7DB8, CSN: 0123456789, HOSTNAME: hostname
Jan 27 01:18:01 hostname SOURCE: fmd, REV: 1.2
Jan 27 01:18:01 hostname EVENT-ID: 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd
Jan 27 01:18:01 hostname DESC: All faults associated with an event id have been addressed.
Jan 27 01:18:01 hostname Refer to http://sun.com/msg/FMD-8000-4M for more information.
Jan 27 01:18:01 hostname AUTO-RESPONSE: Some system components offlined because of the original fault may have been brought back online.
Jan 27 01:18:01 hostname IMPACT: Performance degradation of the system due to the original fault may have been recovered.
Jan 27 01:18:01 hostname REC-ACTION: Use fmdump -v -u EVENT-ID to identify the repaired components.
Jan 27 01:18:01 hostname fmd: [ID 441519 daemon.notice] SUNW-MSG-ID: FMD-8000-6U, TYPE: Resolved, VER: 1, SEVERITY: Minor
Jan 27 01:18:01 hostname EVENT-TIME: Thu Dec 24 08:50:34 BRST 2009
Jan 27 01:18:01 hostname PLATFORM: X7DB8, CSN: 0123456789, HOSTNAME: hostname
Jan 27 01:18:01 hostname SOURCE: fmd, REV: 1.2
Jan 27 01:18:01 hostname EVENT-ID: 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd
Jan 27 01:18:01 hostname DESC: All faults associated with an event id have been addressed.
Jan 27 01:18:01 hostname Refer to http://sun.com/msg/FMD-8000-6U for more information.
Jan 27 01:18:01 hostname AUTO-RESPONSE: All system components offlined because of the original fault have been brought back online. 
Jan 27 01:18:01 hostname IMPACT: Performance degradation of the system due to the original fault has been recovered.
Jan 27 01:18:01 hostname REC-ACTION: Use fmdump -v -u EVENT-ID to identify the repaired components.

# fmdump -e -t 23Jan2010
TIME CLASS
#
# fmdump
TIME UUID SUNW-MSG-ID
Jan 27 01:18:01.2372 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd FMD-8000-4M Repaired
Jan 27 01:18:01.2391 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd FMD-8000-6U Resolved
# fmdump -V
TIME UUID SUNW-MSG-ID
Jan 27 01:18:01.2372 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd FMD-8000-4M Repaired
TIME CLASS ENA
Dec 24 08:50:34.4470 ereport.fs.zfs.vdev.corrupt_data 0x533bf0e964a01801
Dec 23 16:08:42.0738 ereport.fs.zfs.probe_failure 0xe87b448c8ba00c01
Dec 23 16:08:42.0739 ereport.fs.zfs.io 0xe87b446b04f1
Dec 23 16:08:42.0739 ereport.fs.zfs.io 0xe87b44664b300401
Dec 23 16:08:42.0738 ereport.fs.zfs.io 0xe87b445710a01001
Dec 23 16:08:42.0739 ereport.fs.zfs.io 0xe87b4461a4d00c01
nvlist version: 0
  version = 0x0
  class = list.repaired
  uuid = 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd
  code = FMD-8000-4M
  diag-time = 1261651834 766268
  de = (embedded nvlist)
    nvlist version: 0
    version = 0x0
    scheme = fmd
    authority = (embedded nvlist)
      nvlist version: 0
      version = 0x0
      product-id = X7DB8
      chassis-id = 0123456789
      server-id = hostname
    (end authority)
    mod-name = fmd
    mod-version = 1.2
  (end de)
  fault-list-sz = 0x1
  fault-list = (array of embedded nvlists)
  (start fault-list[0])
  nvlist version: 0
    version = 0x0
    class = fault.fs.zfs.device
    certainty = 0x64
    asru = (embedded nvlist)
      nvlist version: 0
      version = 0x0
      scheme = zfs
      pool = 0x9f4842f183c4c7cc
      vdev = 0xd207014426714df9
    (end asru)
    resource = (embedded nvlist)
      nvlist version: 0
      version = 0x0
      scheme = zfs
      pool = 0x9f4842f183c4c7cc
      vdev = 0xd207014426714df9
    (end resource)
  (end fault-list[0])
  fault-status = 0x6
  __ttl = 0x1
  __tod = 0x4b5fb069 0xe23eb38
TIME UUID SUNW-MSG-ID
Jan 27 01:18:01.2391 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd FMD-8000-6U Resolved
TIME CLASS
Re: [zfs-discuss] zfs destroy hangs machine if snapshot exists- workaround found
Server responds to pings, but that's it. All iSCSI, NFS and ssh connections are cut. That's consistent with my findings, adding that SMB is cut as well. During one vain attempt to destroy the data...@snapshot I got a [ID 224711 kern.warning] WARNING: Memory pressure: TCP defensive mode on. If I had a separate ssh session open with 'top' running, I could watch the CPU load go through the roof before that session died along with everything else. For info, I have exactly the same situation here with a snapshot that cannot be deleted that results in the same symptoms. That would rule out an empty dataset being a relevant side condition. I'll try the workaround once I get some spare space to migrate the contents. If your final aim isn't the destruction of the dataset, that exacerbates the situation. After I had understood the issue with snapshots, my choice was to deactivate all automatic snapshots on non-rpools. Specifically, I have different backup protocols in place anyhow. Automatic snapshots are on by default. Regards, Tonmaus
Re: [zfs-discuss] zero out block / sectors
On 2010-01-27 at 09:50 + Darren J Moffat sent off: The whole point of the original question wasn't about consumers of ZFS but where ZFS is the consumer of block storage provided by something else that expects to see zeros on disk. This thread is about thin provisioning *to* ZFS not *on* it. You're right; indeed the original question is a different problem, one that unwritten-extent support wouldn't address. Björn
Re: [zfs-discuss] Going from 6 to 8 disks on ASUS M2N-SLI Deluxe motherboard
If you choose the AOC-USAS-L8i controller route, don't worry too much about the exotic-looking nature of these SAS/SATA controllers. These controllers drive SAS drives and also SATA drives. As you will be using SATA drives, you'll just get cables that plug into the card. The card has 2 ports; you buy a cable that plugs into a port and fans out into 4 SATA connectors. Just buy 2 cables if you need to drive 8 drives, or at least more than 4. Supermicro sells a few different lengths for these cables, so once you've measured, you can choose. Take a look at this post of mine and look for the card, cables and text where I also remarked on the scariness factor of dealing with 'exotic' hardware. http://breden.org.uk/2009/08/29/home-fileserver-mirrored-ssd-zfs-root-boot/ And cables are here: http://supermicro.com/products/accessories/index.cfm http://64.174.237.178/products/accessories/index.cfm (DNS failed so I gave the IP address version too) Then select 'cables' from the list. From the cables listed, search for 'IPASS to 4 SATA Cable' and you will find they have a 23cm version (CBL-0118L-02) and a 50cm version (CBL-0097L-02). Sounds like your larger case will probably need the 50cm version. Cheers, Simon http://breden.org.uk/2008/03/02/a-home-fileserver-using-zfs/
Re: [zfs-discuss] Going from 6 to 8 disks on ASUS M2N-SLI Deluxe motherboard
Hi David, On Mon, Jan 25, 2010 at 11:16 AM, David Dyer-Bennet d...@dd-b.net wrote: My current home fileserver (running Open Solaris 111b and ZFS) has an ASUS M2N-SLI DELUXE motherboard. This has 6 SATA connections, which are currently all in use (mirrored pair of 80GB for system zfs pool, two mirrors of 400GB both in my data pool). I've got two more hot-swap drive bays. And I'm getting up towards 90% full on the data pool. So, it's time to expand, right? I have two approaches in contention: #1, I can just swap drives for bigger drives, waiting for resilver and taking the risk that the other drive will fail during the resilver (I do have backups, plus I've got the old removed drive as well, so I could recover from a failure during resilver with some downtime). #2, I can find or install two additional SATA ports and put two more drives in the open bays. I've even got two 400GB drives sitting available; that's a 50% increase on current storage, so I'm not inclined to spend money for new drives yet, even though these are quite small. (I picked up a pile of free Sun-badged Hitachi 400GB drives when the project I was on at the time decided they were too small to use and put them out for people to take home. I grabbed two right away, and very conscientiously stayed away for a while to give other people a good shot too. But I took another drive every hour, and left with 7 of them. There were still some there when I left, so I feel virtuous rather than greedy.) I prefer approach two. Three pair gives me more flexibility and more performance than two, plus I don't have to pay for new drives right away since I've got spare 400GB drives around. Plus it probably bothers me more than it should that I'm wasting two of the fairly expensive hot-swap bays. So, with regard to option #2, I have two questions. First, there's some sign that this motherboard has an integral raid controller. Can it also be used to drive bare drives? 
If I could just find two more usable controller ports (with good drivers and hot-swap support), I'd be happy without spending any money. Anybody understand this motherboard? Second, if I have to buy an additional controller, what should I buy for driving two (or at most 4; I suppose it might make sense to reduce the load on the motherboard controller) SATA drives from this motherboard? I believe I have a free PCI-Express x16 slot and two x1 slots (and don't understand these new-fangled ports very well). I want stability, +- 10% performance is not at all important. Cheap is good :-) (paying my own money here!). (Obvious additional choices like replacing the whole box are not interesting; its performance is fine for my needs, and it can easily handle increased disk capacity.) Also, I probably should upgrade to more recent code than snv_111b, eh? What's a demonstrated-to-be-stable code level I could upgrade to? I'm not desperately missing any of the newer features, but I'm looking for bug fixes, especially any that relate to zfs send-receive, which I'm attempting to use to transfer incremental backups to an external USB drive (set up as a single-disk pool). Also I will put more memory in while I've got it open, but I can figure out what memory it takes for myself :-). I'd greatly appreciate motherboard expertise, controller advice, and code version advice from people with experience. Thanks! -- David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss I have the same motherboard. Though I haven't used all of the SATA ports, I also have an ST Lab PCIe SATA II 300 RAID Card, 2+2 (uses a PCIe 1X port, has Sil3132 chip). I've had the card for almost 2 years now, so I'm not sure if you can still buy these. 
The key thing: the Sil3132 is supported in OpenSolaris. Hope this helps!
Re: [zfs-discuss] Instructions for ignoring ZFS write cache flushing on intelligent arrays
Cindy, It does not list our SAN (LSI/STK/NetApp)... I'm confused about disabling cache from the wiki entries. Should we disable it globally by turning off ZFS cache syncs via echo zfs_nocacheflush/W0t1 | mdb -kw, or specify it per storage device via the sd.conf method, where the array ignores cache flushes from ZFS? Brad
[zfs-discuss] raidz using partitions
hi there, maybe this is a stupid question, yet I haven't found an answer anywhere ;) Let's say I've got 3x 1.5 TB HDDs: can I create equal partitions out of each and make a raid5 out of them? Sure, the safety would drop, but that is not that important to me. With roughly 500 GB partitions and the raid5 formula of (n-1) * smallest drive, I should be able to get 4 TB of storage instead of the 3 TB I'd get when using 3x 1.5 TB in a normal raid5. Thanks for your answers, greetings
Re: [zfs-discuss] raidz using partitions
On Wed, Jan 27, 2010 at 1:55 PM, Albert Frenz y...@zockbar.de wrote: hi there, maybe this is a stupid question, yet i haven't found an answer anywhere ;) let say i got 3x 1,5tb hdds, can i create equal partitions out of each and make a raid5 out of it? sure the safety would drop, but that is not that important to me. with roughly 500gb partitions and the raid5 forumla of n-1*smallest drive i should be able to get 4tb storage instead of 3tb when using 3x 1,5tb in a normal raid5. thanks for you answers greetings -- 3 drives is enough to make a raidz already, but yes, you can use slices. I have a friend who did that: he had 2x 1.5 TB drives, 2x 1 TB drives and a 2 TB drive, so he made 2 raidzs, one with 5x 1 TB slices and one with 3x 500 GB slices.
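To sanity-check the (n-1) arithmetic from the question, here is a hypothetical back-of-envelope helper (not a ZFS tool). Bear in mind the safety caveat: with several slices of the same physical disk in one raidz, a single disk failure takes out several members at once, defeating single parity.

```python
def raidz1_usable_gb(members_gb):
    # Single-parity raidz: usable space is (n - 1) * smallest member.
    n = len(members_gb)
    if n < 3:
        raise ValueError("raidz1 wants at least 3 members")
    return (n - 1) * min(members_gb)

# Whole-disk raidz1 of 3 x 1500 GB drives:
whole_disks = raidz1_usable_gb([1500, 1500, 1500])   # 3000 GB
# Nine 500 GB slices (three per 1.5 TB disk) in one raidz1:
sliced = raidz1_usable_gb([500] * 9)                 # 4000 GB
```

This reproduces the 4 TB-versus-3 TB figure from the question, and also the friend's mixed-drive layouts (five 1 TB slices give 4 TB usable).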
[zfs-discuss] Performance of partition based SWAP vs. ZFS zvol SWAP
Has anyone done research into the performance of swap on a traditional partition-based swap device as compared to a swap area set up on a ZFS zvol? I can find no best practices for this issue. In the old days it was considered important to separate the swap devices onto individual disks (controllers) and select the outer cylinder groups for the partition (to gain some read speed). How does this compare to creating a single swap zvol within a rootpool and then mirroring the rootpool across two separate disks?
Re: [zfs-discuss] raidz using partitions
ok nice to know :) thank you very much for your quick answer
Re: [zfs-discuss] Performance of partition based SWAP vs. ZFS zvol SWAP
RayLicon wrote: Has anyone done research into the performance of SWAP on the traditional partitioned based SWAP device as compared to a SWAP area set up on ZFS with a zvol? I can find no best practices for this issue. In the old days it was considered important to separate the swap devices onto individual disks (controllers) and select the outer cylinder groups for the partition (to gain some read speed). How does this compare to creating a single SWAP zvol within a rootpool and then mirroring the rootpool across two separate disks? Best practice nowadays is to design a system so it doesn't need to swap. Then it doesn't matter what the performance of the swap device is. -- Andrew Gabriel
Re: [zfs-discuss] zfs destroy hangs machine if snapshot exists- workaround found
This sounds like yet another instance of 6910767 deleting large holey objects hangs other I/Os I have a module based on 130 that includes this fix if you would like to try it. -tim Hi Tim, 6910767 seems to be about ZVOLs. The dataset here was not a ZVOL. I had a 1.4 TB ZVOL on the same pool that also wasn't easy to kill. It hung the machine as well, but only once: it was gone after a forced reboot. Regards, Tonmaus
Re: [zfs-discuss] backing this up
Hello All, I read through the attached threads and found a solution by a poster and decided to try it. The solution was to use 3 files (in my case I made them sparse); I then created a raidz2 pool across these 3 files and started a zfs send | recv. The performance is horrible: around 5.62 MB/s. When I am backing up the other system to this failover system over a network connection I can get around 40 MB/s. Is it because I am backing it up onto files rather than physical disks? Am I doing this all wrong? This pool is temporary, as it will be sent to tape, deleted and recreated. Is it possible to zfs send to two destinations simultaneously? Or am I stuck? Any pointers would be great! I am using OpenSolaris snv_129 and the disks are SATA WD 1 TB 7200 rpm disks. Thanks All! Greg On Mon, Jan 25, 2010 at 3:41 PM, Gregory Durham gregory.dur...@gmail.com wrote: Well I guess I am glad I am not the only one. Thanks for the heads up! On Mon, Jan 25, 2010 at 3:39 PM, David Magda dma...@ee.ryerson.ca wrote: On Jan 25, 2010, at 18:28, Gregory Durham wrote: One option I have seen is zfs send zfs_s...@1 /some_dir/some_file_name. Then I can back this up to tape. This seems easy, as I have already created a script that does just this, but I am worried that this is not the best or most secure way to do it. Does anyone have a better solution? We've been talking about this for the last week and a half. :) http://mail.opensolaris.org/pipermail/zfs-discuss/2010-January/thread.html#35929 http://opensolaris.org/jive/thread.jspa?threadID=121797 (They're the same thread, just different interfaces.) I was thinking about then gzip'ing this but that would take an enormous amount of time... If you have a decent amount of CPU, you can parallelize compression: http://www.zlib.net/pigz/ http://blogs.sun.com/timc/entry/tamp_a_lightweight_multi_threaded The LZMA algorithm (as used in 7-Zip) is supposed to beat gzip in many benchmarks, and supposedly parallelizes well. 
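For reference, sparse backing files like the ones described above can be created without writing any data at all. A small sketch with hypothetical paths and sizes (truncate() extends the logical size, but blocks are only allocated as the pool actually writes to them):

```python
import os

def make_sparse(path, size_bytes):
    # Create a file with logical size size_bytes that occupies (almost)
    # no space on disk until data is written into it.
    with open(path, "wb") as f:
        f.truncate(size_bytes)
    st = os.stat(path)
    # st_blocks counts 512-byte blocks actually allocated on disk
    return st.st_size, st.st_blocks * 512

# e.g. three 100 GB backing files for a pool of files (assumed layout):
# for i in range(3):
#     make_sparse("/staging/backing%d" % i, 100 * 2**30)
# followed by something like:
#     zpool create backuppool raidz2 /staging/backing0 /staging/backing1 ...
```

The returned pair (logical size, allocated bytes) makes the sparseness visible: immediately after creation, the allocated figure is near zero.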
Re: [zfs-discuss] Instructions for ignoring ZFS write cache flushing on intelligent arrays
Brad, It depends on the Solaris release. What Solaris release are you running? Thanks, Cindy On 01/27/10 11:43, Brad wrote: Cindy, It does not list our SAN (LSI/STK/NetApp)...I'm confused about disabling cache from the wiki entries. Should we disable it by turning off zfs cache syncs via echo zfs_nocacheflush/W0t1 | mdb -kw or specify it by storage device via the sd.conf method where the array ignores cache flushes from zfs? Brad
Re: [zfs-discuss] ARC not using all available RAM?
I am interested in this as well. My machine has 5 GB RAM, and will soon have an 80 GB SSD device. My free memory hovers around 750 MB, and the ARC around 3 GB. This machine doesn't do anything other than iSCSI/CIFS; I wouldn't mind using the extra 500 MB for caching. And this becomes especially important if the kernel will need to consume such large amounts of memory for managing the L2ARC. CPU cache thrashing, although an important topic, is of no importance in such cases IMO. I.e., I don't mind my CPU caches being thrashed if I fire up a GNOME desktop occasionally. But I do mind having 750 MB of RAM sitting unused.
Re: [zfs-discuss] Performance of partition based SWAP vs. ZFS zvol SWAP
Ok ... Given that ... yes, we all know that swapping is bad (thanks for the enlightenment). To swap or not to swap isn't related to this question, and besides, even if you don't page swap, other mechanisms can still claim swap space, such as the tmp filesystem. The question is simple: IF you have to swap (for whatever reason), which of the two alternatives is better (separate disk partitions on multiple disks, or zvols on ZFS stripes or mirrors), and why? If no one has any data on this issue then fine, but I didn't waste my time posting to this site to get responses that simply say don't swap.
Re: [zfs-discuss] Going from 6 to 8 disks on ASUS M2N-SLI Deluxe motherboard
On 1/25/2010 6:23 PM, Simon Breden wrote: By mixing randomly purchased drives of unknown quality, people are taking unnecessary chances. But often, they refuse to see that, thinking that all drives are the same and they will all fail one day anyway... I would say, though, that buying different drives isn't inherently either random or drives of unknown quality. Most of the time, I know no reason other than price to prefer one major manufacturer to another. And, over and over again, I've heard of bad batches of drives: small manufacturing or design or component-sourcing errors. Given how the resilvering process can be quite long (on modern large drives) and quite stressful (when the system remains in production use during resilvering, so that load is on top of the normal load), I'd rather not have all the drives in the set be from the same bad batch! Google works heavily from the philosophy that things WILL fail, so they plan for it and have enough redundancy to survive it -- and then save lots of money by not paying for premium components. I like that approach. -- David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info
Re: [zfs-discuss] Performance of partition based SWAP vs. ZFS zvol SWAP
LICON, RAY (ATTPB) wrote: Thanks for the reply. In many situations, the hardware design isn't up to me and budgets tend to dictate everything these days. True, nobody wants to swap, but the question is if you had to -- what design serves you best: independent swap slices, or putting it all under control of zfs? It depends why you need to swap, i.e. why are you using more memory than you have, and is your working set size bigger than memory (thrashing), or is swapping likely to be just a once-off event or infrequently repeated? You probably need to forget most of what you learned about swapping 25 years ago, when systems routinely swapped, and technology was very different. Disks have got faster over that period, probably of the order of 100 times faster. However, CPUs have got 100,000 times faster, so in reality a disk looks to be 1000 times slower from the CPU's standpoint than it did 25 years ago. This means that CPU cycles lost due to swapping will appear to have a proportionally much more dire effect on performance than they did many years back. There are lots more options available today than there were when systems routinely swapped. A couple of examples that spring to mind... ZFS has been explicitly designed to swap its own cache data, only we don't call it swapping - we call it an L2ARC or ReadZilla. So if you have a system where the application is going to struggle with main memory, you might configure ZFS to significantly reduce its memory buffer (ARC), and instead give it an L2ARC on a fast solid state disk. This might result in less performance degradation in some systems where memory is short, depending heavily on the behaviour of the application. If you do have to go with brute-force old-style swapping, then you might want to invest in solid state disk swap devices, which will go some way towards reducing the factor of 1000 I mentioned above. (Take note of aligning swap to the 4k flash i/o boundaries.) 
Probably lots of other possibilities too, given more than a couple of minutes thought. -- Andrew
Re: [zfs-discuss] Going from 6 to 8 disks on ASUS M2N-SLI Deluxe motherboard
On 1/27/2010 7:29 AM, Simon Breden wrote: And cables are here: http://supermicro.com/products/accessories/index.cfm http://64.174.237.178/products/accessories/index.cfm (DNS failed so I gave IP address version too) Then select 'cables' from the list. From the cables listed, search for 'IPASS to 4 SATA Cable' and you will find they have a 23cm version (CBL-0118L-02) and a 50cm version (CBL-0097L-02). Sounds like your larger case will probably need the 50cm version. And those seem to be half the price of the others I've found. I'll still have to check the length first, though. And they're listed on Amazon. (Supermicro either doesn't let you buy direct from their web site, or even check a price, or at least makes it very hard.) (This is a big Chenbro case; I think it's really a rack 4U system being used as a tower.) -- David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info
Re: [zfs-discuss] backing this up
On Wed, Jan 27, 2010 at 12:01:36PM -0800, Gregory Durham wrote: Hello All, I read through the attached threads and found a solution by a poster and decided to try it. That may have been mine - good to know it helped, or at least started to. The solution was to use 3 files (in my case I made them sparse) yep - writes to allocate space for them up front are pointless with CoW. I then created a raidz2 pool across these 3 files Really? If you want one tape's worth of space, written to 3 tapes, you might as well just write the same file to three tapes, I think. (I'm assuming here the files are the size you expect to write to a single tape - otherwise I'm even more confused about this bit). Perhaps it's easier to let zfs cope with repairing small media errors here and there, but the main idea of using a redundant pool of files was to cope with loss or damage to whole tapes, for a backup that already needed to span multiple tapes. If you want this three-way copy of a single tape, plus easy recovery from bad spots by reading back multiple tapes, then use a 3-way mirror. But consider the error-recovery mode of whatever you're using to write to tape - some skip to the next file on a read error. I expect similar ratios of data to parity files/tapes as would be used in typical disk setups, at least for wide stripes. Say raidz2 in sets of 10, 8+2, or so. (As an aside, I like this for disks, too - since striping 128k blocks to a power-of-two wide data stripe has to be more efficient) and started a zfs send | recv. The performance is horrible There can be several reasons for this, and we'd need to know more about your setup. The first critical thing is going to be the setup of the staging filesystem that holds your pool files. If this is itself a raidz, perhaps you're iops limited - you're expecting 3 disk-files worth of concurrency from a pool that may not have it, though it should be a write-mostly workload so less sensitive. You'll be seeking a lot either way, though. 
If this is purely staging to tape, consider making the staging pool out of non-redundant single-disk vdevs. Alternately, if the staging pool is safe, there's another trick you might consider: create the pool, then offline 2 files while you recv, leaving the pool-of-files degraded. Then when you're done, you can let the pool resilver and fill in the redundancy. This might change the IO pattern enough to take less time overall, or at least allow you some flexibility with windows to schedule backup and tapes. Next is dedup - make sure you have the memory and l2arc capacity to dedup the incoming write stream. Dedup within the pool of files if you want and can (because this will dedup your tapes), but don't dedup under it as well. I've found this to produce completely pathological disk thrashing, in a related configuration (pool on lofi crypto file). Stacking dedup like this doubles the performance cliff under memory pressure we've been talking about recently. (If you really do want 3-way-mirror files, then by all means dedup them in the staging pool.) Related to this is arc usage - I haven't investigated this carefully myself, but you may well be double-caching: the backup pool's data, as well as the staging pool's view of the files. Again, since it's a write mostly workload zfs should hopefully figure out that few blocks are being re-read, but you might experiment with primarycache=metadata for the staging pool holding the files. Perhaps zpool-on-files is smart enough to use direct io bypassing cache anyway, I'm not sure. How's your cpu usage? Check that you're not trying to double-compress the files (again, within the backup pool but not outside) and consider using a lightweight checksum rather than sha256 outside. Then there's streaming and concurrency - try piping through buffer and using bigger socket and tcp buffers. TCP stalls and slow-start will amplify latency many-fold. 
A good zil device on the staging pool might also help, the backup pool will be doing sync writes to close its txgs, though probably not too many others. I haven't experimented here, either. This pool is temporary as it will be sent to tape, deleted and recreated. I tend not to do that, since I can incrementally update the pool contents before rewriting tapes. This helps hide the performance issues dramatically since much less data is transferred and written to the files, after the first time. Is it possible to zfs send to two destinations simultaneously? Yes, though it's less convenient than using -R on the top of the pool, since you have to solve any dependencies (including clone renames) yourself. Whether this helps or hurts depends on your bottleneck: it will help with network and buffering issues, but hurt (badly) if you're limited by thrashing seeks (at the writer, since you already know the reader can sustain higher rates). Or am I stuck. Any pointers would be great! Never. Always! :-) -- Dan.
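The "bigger socket and tcp buffers" suggestion can be tried from the sending side without touching system-wide tunables. A minimal Python sketch (the 1 MiB figure is just an illustrative starting point, not a recommendation; the kernel is free to clamp or round whatever you request):

```python
import socket

def bump_buffers(sock, size=1 << 20):
    # Ask for larger send/receive buffers on the socket that carries the
    # zfs send | recv stream; the kernel may clamp or adjust the value.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    # Read back what the kernel actually granted.
    return (sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
            sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
snd, rcv = bump_buffers(s)
s.close()
```

The same effect is more usually had by inserting a userland buffer program in the pipeline between send and recv; the point either way is to keep the sender from stalling on TCP round trips.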
Re: [zfs-discuss] Going from 6 to 8 disks on ASUS M2N-SLI Deluxe motherboa
On Jan 27, 2010, at 12:34 PM, David Dyer-Bennet wrote: Google is working heavily with the philosophy that things WILL fail, so they plan for it, and have enough redundance to survive it -- and then save lots of money by not paying for premium components. I like that approach. Yes, it does work reasonably well. But many people on this forum complain that mirroring disks is too expensive, so they would never consider mirroring the whole box, let alone triple or quadruple mirroring the whole box :-) -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Going from 6 to 8 disks on ASUS M2N-SLI Deluxe motherboa
On Wed, Jan 27, 2010 at 02:34:29PM -0600, David Dyer-Bennet wrote: Google is working heavily with the philosophy that things WILL fail, so they plan for it, and have enough redundance to survive it -- and then save lots of money by not paying for premium components. I like that approach. So do I, and most other zfs fans. Google, unlike most of us, is also big enough to buy a whole pallet of disks at a time, and still spread them around to avoid common faults taking out all copies. -- Dan. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance of partition based SWAP vs. ZFS zvol SWAP
ag == Andrew Gabriel andrew.gabr...@sun.com writes: ag is your working set size bigger than memory (thrashing), n...no, not...not exactly. :) ag or is swapping likely to be just a once-off event or ag infrequently repeated? once-off! or...well...repeated, every time the garbage collector runs. ag You probably need to forget most of what you learned about ag swapping 25 years ago, when systems routinely swapped, and ag technology was very different. yes, some Lisp machines had integrated swapper/garbagecollectors. Now we have sbrk() + gc. dumb! We used to not worry about overcommitting because refusing to overcommit just meant some of the allocated swap space would never get written. It was a little bit foolish because the threat of thrashing means, whenever swap's involved, you're basically overcommitted, but it let us feel better. Now that we're not using swap, failure to overcommit seems rather wasteful. At the very least you should allow the ARC cache to grow into memory reserved for an allocation, then boot the ARC out of it if the process actually writes to more than you thought it would and you need to keep a commitment you thought you wouldn't. ag solid state disk swap devices, smart! it might turn out to be good for ebooks and other power-constrained devices, too, because DRAM uses battery: swapping to conserve energy rather than RAM. It might be worth tracking pages in a more complicated way than we're now doing if the goal is to evacuate RAM and power it down, so maybe holding onto ancient swap wisdom and code isn't as helpful to this as it might seem. The point, keep swap on ZFS so you can grow/shrink/delete it as fashion changes, is good. But the OP's question still stands: does ZFS swap perform almost as well as raw device swap, or is it worth partitioning disks if you insist on actually using swap? I guess no one knows. 
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] cannot attach c5d0s0 to c4d0s0: device is too small
cannot attach c5d0s0 to c4d0s0: device is too small So I guess I installed OpenSolaris onto the smallest disk. Now I cannot create a mirrored root, because the device is smaller. What is the best way to correct this except starting all over with two disks of the same size (which I don't have)? Do I zfs send the stream to the smallest disk and will the bigger one attach itself? Or is there another way? I need redundancy, so I hope to get answers soon. ;-) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ARC Ghost lists, why have them and how much ram is used to keep track of them? [long]
I have the exact same questions. I am very interested in the answers to those. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cannot attach c5d0s0 to c4d0s0: device is too small
Hi Dick, Based on this message: cannot attach c5d0s0 to c4d0s0: device is too small c5d0s0 is the disk you are trying to attach, so it must be smaller than c4d0s0. Is it possible that c5d0s0 is just partitioned so that its s0 is smaller than s0 on c4d0s0? On some disks, the default partitioning is not optimal and you have to modify it so that the bulk of the disk space is in slice 0. I would confirm this first as it's the easiest solution by far. Another thought is that a recent improvement was that you can attach a disk that is an equivalent size, but not exactly the same geometry. Which OpenSolaris release is this? Thanks, Cindy On 01/27/10 15:26, dick hoogendijk wrote: cannot attach c5d0s0 to c4d0s0: device is too small So I guess I installed OpenSolaris onto the smallest disk. Now I cannot create a mirrored root, because the device is smaller. What is the best way to correct this except starting all over with two disks of the same size (which I don't have)? Do I zfs send the stream to the smallest disk and will the bigger one attach itself? Or is there another way? I need redundancy, so I hope to get answers soon. ;-) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Instructions for ignoring ZFS write cache flushing on intelligent arrays
We're running 10/09 on the dev box but 11/06 is prodqa. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] primarycache=off, secondarycache=all
In the case of a ZVOL with the following settings: primarycache=off, secondarycache=all How does the L2ARC get populated if the data never makes it to ARC? Is this even a valid configuration? The reason I ask is I have iSCSI volumes for NTFS, and I intend to use an SSD for l2arc. If something is read from the iSCSI device, then chances are Windows (or whatever OS) will cache it for a while in its own cache. It is unlikely that the data will be needed soon (under normal circumstances). Thus I would like to avoid polluting the ARC with non-relevant data, but then the question is, how will that data make it to the L2ARC? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Instructions for ignoring ZFS write cache flushing on intelligent arrays
Hi Brad, You should see better performance on the dev box running 10/09 with the sd and ssd drivers as is, because they should properly handle the SYNC_NV bit in this release. If you have determined that the 11/06 system is affected by this issue, then the best method is to set this parameter in the /kernel/drv/*conf file. I'm unclear whether you understand all the implications of disabling this parameter because we're discussing this over email. Someone with more experience with tuning this parameter should weigh in. Brad is using a SAN (LSI/STK/NetApp). Thanks, Cindy On 01/27/10 15:47, Brad wrote: We're running 10/09 on the dev box but 11/06 is prodqa. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zero out block / sectors
On 27 jan 2010, at 10.44, Björn JACKE wrote: On 2010-01-25 at 08:31 -0600 Mike Gerdts sent off: You are missing the point. Compression and dedup will make it so that the blocks in the devices are not overwritten with zeroes. The goal is to overwrite the blocks so that a back-end storage device or back-end virtualization platform can recognize that the blocks are not in use and as such can reclaim the space. a filesystem that is able to do that fast would have to implement something like unwritten extents. Rather, what is needed is files with holes, as what is expected here is more free space in the file system when the unused parts of the file are punched out. With F_ALLOCSP, you would still not be able to use the space and there would be no gain. Some days ago I experimented to create and allocate huge files on ZFS on top of OpenSolaris using fcntl and F_ALLOCSP, which is basically the same thing that you want to do when you zero out space. It takes ages because it actually writes zeroes to the disk. A filesystem that knows the concept of unwritten extents finishes the job immediately. There are no real zeros on the disk but the extent is tagged as unwritten (you get zeros when you read it). Files with holes are implemented, and as far as I know they are fast too:

-bash-4.0$ cat hole.py
f = open('foo', 'w')
f.write('x')
f.seek(2**62)
f.write('y')
f.close()
-bash-4.0$ time python hole.py

real    0m0.019s
user    0m0.010s
sys     0m0.009s

-bash-4.0$ ls -la foo
-rw-r--r--   1 ragge  staff  4611686018427387905 Jan 28 00:26 foo

Are there any plans to add unwritten extent support into ZFS or any reason why not? I have no idea, but just out of curiosity - when do you want that? /ragge ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
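The claim above - huge apparent size, almost nothing allocated - can be checked by comparing st_size against st_blocks. A small sketch along the same lines as hole.py (POSIX semantics; how many blocks actually get allocated is filesystem-dependent):

```python
import os
import tempfile

# Create a file with a large hole: seek far past EOF and write one byte.
# On filesystems that support holes, almost no blocks are allocated.
path = os.path.join(tempfile.mkdtemp(), 'sparse')
with open(path, 'wb') as f:
    f.seek(2**30)          # leave a 1 GiB hole
    f.write(b'y')

st = os.stat(path)
apparent = st.st_size            # 2**30 + 1 bytes, like the ls -la above
allocated = st.st_blocks * 512   # typically a few KiB when the hole is real
```

Reading back through the hole returns zeros, which is exactly the unwritten-extent behaviour being asked about - except that here no space is reserved for the hole at all.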
Re: [zfs-discuss] primarycache=off, secondarycache=all
On Wed, Jan 27, 2010 at 02:47:47PM -0800, Christo Kutrovsky wrote: In the case of a ZVOL with the following settings: primarycache=off, secondarycache=all How does the L2ARC get populated if the data never makes it to ARC ? Is this even a valid configuration? It's valid, I assume, in the sense that it can be set. However, I've also assumed that if the data never gets into primary cache, it will never be evicted into L2. That's glossing over the details, which may be important - for example, I don't think ZFS is structured to work with data that's *not* in ARC, so it may be that primarycache=off basically marks data for immediate eviction - where it still may be a candidate for l2. The reason I ask is I have iSCSI volumes for NTFS, I intend to use an SSD for l2arc. If something is read from the iSCSI device, then chances are Windows (or whatever OS) will cache it for a while in its own cache. It is unlikely that the data will be needed soon (under normal circumstances). Thus I would like it to avoid polluting the ARC with non-relevant data, but then the question is, how will that data make it to the L2ARC. With the setup above, I suspect it won't. It would be nice to get an authoritative confirmation of that, of course. Regardless, to your original requirement, it sounds like you're looking for a tuning knob to give further hints to the ARC algorithm, about which pages to evict first. More knobs are not always better. ARC should in theory already do a good job of telling the difference between accessed recently and accessed frequently. Evictees from both states can go to l2arc. Look at it another way: If the client cache in the windows machine works as you expect (and I expect it would, at least for some data), the best hint you can give to ARC that these blocks are not needed is to access *other* data. So, measure and analyse. -- Dan. 
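To illustrate Dan's last point - that touching *other* data is itself the eviction hint - here is a toy recency-only cache with an "L2" spillover. This is purely a hypothetical sketch (the real ARC balances recency and frequency across four lists and is far more subtle); the class and its names are invented for illustration:

```python
from collections import OrderedDict

class ToyCache:
    """Toy LRU cache whose evictees spill into a second-level dict."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.arc = OrderedDict()   # key -> value, least-recently-used first
        self.l2 = {}               # evicted entries land here

    def access(self, key, value=None):
        if key in self.arc:
            self.arc.move_to_end(key)      # refresh recency
            return self.arc[key]
        if key in self.l2:
            value = self.l2.pop(key)       # promote back from "L2"
        self.arc[key] = value
        if len(self.arc) > self.capacity:
            old, v = self.arc.popitem(last=False)
            self.l2[old] = v               # evictee goes to L2, per the thread
        return value
```

With primarycache=off there is nothing to evict, so (on the reasoning above) nothing ever reaches L2 - which is the suspected behaviour of the configuration in question.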
pgpmFuXepzig7.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raidz using partitions
On Wed, Jan 27, 2010 at 10:55:21AM -0800, Albert Frenz wrote: hi there, maybe this is a stupid question, yet i haven't found an answer anywhere ;) let say i got 3x 1.5tb hdds, can i create equal partitions out of each and make a raid5 out of it? sure the safety would drop, but that is not that important to me. with roughly 500gb partitions and the raid5 formula of (n-1) * smallest drive i should be able to get 4tb storage instead of 3tb when using 3x 1.5tb in a normal raid5. The only way you can use more than 3TB is if your RAID5 is not protecting data on different disks. By saying 500gb partitions, it sounds like you want to create a 9-column raid on 3 disks. The safety would definitely drop. It would drop so much that it's not really buying you anything. The failure of any drive would mean loss of the data. So if that's already true, why not just put all the disks in a pool and not mess with a raid? You'd get 4.5TB. Partitioning it into pieces and trying to put them all into a single RAID set just makes the setup more complex, probably slower, and adds almost no extra protection. -- Darren ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
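Darren's arithmetic can be made explicit. With the classic raid5 formula the partitioned layout does look bigger on paper, which is exactly the trap - a back-of-envelope sketch (function name is mine, not anything in ZFS):

```python
def raid5_usable(members, member_size_tb):
    # classic raid5 capacity: (n - 1) * smallest member
    return (members - 1) * member_size_tb

# 3 whole 1.5 TB disks, one raid5:
whole_disks = raid5_usable(3, 1.5)    # 3.0 TB usable

# the same 3 disks cut into 9 x 0.5 TB partitions, one 9-column raid5:
partitions = raid5_usable(9, 0.5)     # 4.0 TB on paper

# ...but each physical disk now backs 3 of the 9 columns, so one disk
# failure removes 3 members at once - more than the single column of
# parity can cover, and the array is lost anyway.
columns_lost_per_disk = 9 // 3
```

So the extra terabyte buys layouts that a single disk failure destroys, which is Darren's point: at that level of protection, a plain striped pool of all three disks (4.5 TB) is strictly simpler.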
Re: [zfs-discuss] Strange random errors getting automatically repaired
Hi Giovanni, I have seen these while testing the mpt timeout issue, and on other systems during resilvering of failed disks and while running a scrub. Once so far on this test scrub, and several on yesterday's. I checked the iostat errors, and they weren't that high on that device, compared to other disks.

c2t34d0  ONLINE  0  0  1  25.5K repaired

errors ---
s/w  h/w  trn  tot  device
  0    8   61   69  c2t30d0
  0    2   17   19  c2t31d0
  0    5   41   46  c2t32d0
  0    5   33   38  c2t33d0
  0    3   31   34  c2t34d0
  0   10   81   91  c2t35d0
  0    4   22   26  c2t36d0
  0    6   44   50  c2t37d0
  0    3   21   24  c2t38d0
  0    5   49   54  c2t39d0
  0    9   77   86  c2t40d0
  0    6   58   64  c2t41d0
  0    5   50   55  c2t42d0
  0    4   34   38  c2t43d0
  0    6   37   43  c2t44d0
  0    9   75   84  c2t45d0
  0   13   82   95  c2t46d0
  0    7   57   64  c2t47d0

-- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Going from 6 to 8 disks on ASUS M2N-SLI Deluxe motherboa
On 1/25/2010 6:23 PM, Simon Breden wrote: By mixing randomly purchased drives of unknown quality, people are taking unnecessary chances. But often, they refuse to see that, thinking that all drives are the same and they will all fail one day anyway... My use of the word random was a little joke to refer to drives that are bought without checking basic failure reports made by users, and then the purchaser later says 'oh no, these drives are c**p'. A little checking goes a long way IMO. But each to his own. I would say, though, that buying different drives isn't inherently either random or drives of unknown quality. Most of the time, I know no reason other than price to prefer one major manufacturer to another. Price is an important choice driver I think we all use. But the 'drives of unknown quality' bit is still possible to mitigate by checking, if one is willing to spend the time and knows where to look. We're never going to be 100% certain, but if I read widely of numerous reports that drives of a particular revision number are seriously substandard then I am going to take that info onboard to help me steer away from purchasing them. That's all. And, over and over again, I've heard of bad batches of drives. Small manufacturing or design or component sourcing errors. Given how the resilvering process can be quite long (on modern large drives) and quite stressful (when the system remains in production use during resilvering, so that load is on top of the normal load), I'd rather not have all my drives in the set be from the same bad batch! Indeed. This is why it's good to research, buy what you think is a good drive revision, then load your data onto them and test them out over a period of time. But one has to keep original data safely backed up. Google is working heavily with the philosophy that things WILL fail, so they plan for it, and have enough redundance to survive it -- and then save lots of money by not paying for premium components. I like that approach. 
Yep, as mentioned elsewhere, Google have enormous resources to be hugely redundant and safe. And yes, we all try to use our common sense to build in as much redundancy as we deem necessary and we are able to reasonably afford. And we have backups. Cheers, Simon http://breden.org.uk/2008/03/02/a-home-fileserver-using-zfs/ -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs streams
On 01/25/10 16:08, Daniel Carosone wrote: On Mon, Jan 25, 2010 at 05:42:59PM -0500, Miles Nordin wrote: et You cannot import a stream into a zpool of earlier revision, et though the reverse is possible. This is very bad, because it means if your backup server is pool version 22, then you cannot use it to back up pool version 15 clients: you can backup, but then you can never restore. It would be, yes. Correct. It would be bad if it were true, but it's not. What matters when doing receives of streams is that the version of the dataset (which can differ between datasets on the same system and between datasets in the same replication stream) be less than or equal to the version of the zfs filesystem supported on the receiving system. The zfs filesystem version supported on a system can be displayed with the command zfs upgrade (with no further arguments). The zfs filesystem version is different than the zpool version (displayed by `zpool get version poolname`). You can send a stream from one system to another even if the zpool version is lower on the receiving system or pool. I verified that this works by replicating a dataset from a system running build 129 (zpool version 22 and zfs version 4) to a system running S10 update (zpool version 15 and zfs version 4). Since they agree on the file system version, it works. But when I try to send a stream from build 120 to S10 U6 (zfs version = 3), I get: # zfs recv rpool/new < /net/x4200-brm-16/export/out.zar Jan 27 17:44:36 v20z-brm-03 zfs: Mismatched versions: File system is version 4 on-disk format, which is incompatible with this software version 3! The version of a zfs dataset (i.e. filesystem or zvol) is preserved unless modified. So, I just did zfs send from S10 U6 (zfs version 3) to S10 U8 (zfs version 4). This created a dataset and its snapshot on the build 129 system. 
Then I checked the version of the dataset and snapshot that was created:

# zfs get -r version rpool/new
NAME          PROPERTY  VALUE  SOURCE
rpool/new     version   3      -
rpool/new@s1  version   3      -

So even though the current version of the zfs filesystem on the target system is 4, the dataset created by the receive is 3, because that's the version that was sent. Then I tried sending that dataset back to the U6 system, and it worked. So as long as the version of the *filesystem* is compatible with the target system, you can do sends from, say, S10U8 to S10U6, even though U8 has a higher zfs filesystem version number than U6. Also, as someone pointed out, the stream version has to match too. So if you use dedup (the -D option), that sets the dedup feature flag in the stream header, which makes the stream only receivable on systems that support stream dedup. But if you don't use dedup, the stream can still be read on earlier versions of zfs. Lori

For backup to work the zfs send format needs to depend on the zfs version only, not the pool version in which it's stored nor the kernel version doing the sending. I can send from b130 to b111, zpool 22 to 14. (Though not with the new dedup send -D option, of course). I don't have S10 to test. -- Dan. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
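Lori's rules reduce to a couple of comparisons. A hypothetical sketch of the compatibility check - not actual ZFS code, just the logic of the thread spelled out:

```python
def can_receive(dataset_version, target_fs_version,
                stream_uses_dedup=False, target_reads_dedup_streams=False):
    # A receive works when the dataset (filesystem) version carried in the
    # stream is <= the zfs filesystem version the receiver supports; pool
    # versions don't enter into it at all.  A -D (dedup) stream additionally
    # needs a receiver that understands the dedup stream feature.
    if stream_uses_dedup and not target_reads_dedup_streams:
        return False
    return dataset_version <= target_fs_version

s10u6_to_u8 = can_receive(3, 4)   # dataset v3 into fs v4: works
b120_to_u6 = can_receive(4, 3)    # the "Mismatched versions" failure above
dedup_to_old = can_receive(3, 4, stream_uses_dedup=True)  # blocked by -D flag
```

The asymmetry Miles worried about thus doesn't exist at the pool level; what matters is keeping dataset versions no newer than the oldest receiver you care about.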
Re: [zfs-discuss] Building big cheap storage system. What hardware to use?
I have a Supermicro 936E1 (X28 expander chip) and LSI 1068 HBA. I never got the timeout issue, but I'm using Seagate 15K.7 SAS. SATA might be different, as it handles errors and I/O timeouts differently. If you still want volume, you may take a look at the 7200 RPM SAS versions. SAS disks are more expensive. Besides, there are no 2 TB 7200 RPM SAS drives on the market yet. If you can wait, better to wait for a 6 Gbit SAS expander based product. Do you think it makes sense if we use SATA II (3 Gbit/s) disks? I heard there were problems with SAS1 expanders in Supermicro chassis after they came out. I don't want to debug a new product. BTW, I'd get the Supermicro X8DTH-6F motherboard, as this gives enough expansion slots. Thanks for the tip about the motherboard. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zvol being charged for double space
In a thread elsewhere, trying to analyse why the zfs auto-snapshot cleanup code was cleaning up more aggressively than expected, I discovered some interesting properties of a zvol. http://mail.opensolaris.org/pipermail/zfs-auto-snapshot/2010-January/000232.html The zvol is not thin-provisioned. The entire volume has been written to (it was dd'd off a physical disk), and: volsize = refreservation referenced = usedbydataset = (volsize + a little overhead) This is as expected. Not expected is that: usedbyrefreservation = refreservation I would expect this to be 0, since all the reserved space has been allocated. As a result, used is over twice the size of the volume (+ a few small snapshots as well). I think others may have seen similar problems; it may be the root cause behind several other complaints that time-slider-cleanup deleted snapshots to free up space, when the pool still had plenty free. A quick followup test shows that usedbyrefreservation behaves as expected, for a new test zvol. http://mail.opensolaris.org/pipermail/zfs-auto-snapshot/2010-January/000233.html So apparently it may be a problem picked up along the upgrade path through many zpool version upgrades. The pool, and the zvol, would first have been created on b111 or shortly after. It has been used with both xvm kernels, and native kernels running virtualbox, in that time. Who can help me figure out what's going on with the older zvol? Any useful zdb info I can dump out? I could fix it by copying and replacing the zvol, getting compression and dedup in the process, but before I do I don't want to destroy what may be useful debug info. I'll check later whether the send|recv snapshots of this zvol on my backup server show similar problems, but I doubt they will. -- Dan. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZPOOL somehow got same physical drive assigned twice
Guys, Need your help. My DEV131 OSOL build with my 21TB disk system somehow got really screwed. This is what my zpool status looks like:

NAME             STATE     READ WRITE CKSUM
rzpool2          DEGRADED     0     0     0
  raidz2-0       DEGRADED     0     0     0
    replacing-0  DEGRADED     0     0     0
      c6t1d0     OFFLINE      0     0     0
      c6t16d0    ONLINE       0     0     0  256M resilvered
    c6t2d0s2     ONLINE       0     0     0
    c6t3d0p0     ONLINE       0     0     0
    c6t4d0p0     ONLINE       0     0     0
    c6t5d0p0     ONLINE       0     0     0
    c6t6d0p0     ONLINE       0     0     0
    c6t7d0p0     ONLINE       0     0     0
    c6t8d0p0     ONLINE       0     0     0
    c6t9d0       ONLINE       0     0     0
  raidz2-1       DEGRADED     0     0     0
    c6t0d0       ONLINE       0     0     0
    c6t1d0       UNAVAIL      0     0     0  cannot open
    c6t10d0      ONLINE       0     0     0
    c6t11d0      ONLINE       0     0     0
    c6t12d0      ONLINE       0     0     0
    c6t13d0      ONLINE       0     0     0
    c6t14d0      ONLINE       0     0     0
    c6t15d0      ONLINE       0     0     0

Check drive c6t1d0 - it appears in both raidz2-0 and raidz2-1! How do I *remove* the drive from raidz2-1 (with edit/hexedit or anything else)? It is clearly a bug in ZFS that allowed me to assign the drive twice. Again: running DEV131 OSOL. Please HELP me. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Building big cheap storage system. What hardware to use?
We use the following for our storage servers:

Chenbro 5U chassis (24 hot-swap drive bays)
1350 watt 4-way redundant PSU
Tyan h200M motherboard (S3992)
2x dual-core AMD Opteron 2200-series CPUs
8 GB ECC DDR2-SDRAM
4-port Intel PRO/1000MT NIC (PCIe)
3Ware 9550SXU PCI-X RAID controller (12-port, multi-lane)
3Ware 9650SE PCIe RAID controller (12-port, multi-lane)
24x 500 GB harddrive (either Seagate ES2 or Western Digital RE2)

Comes out to under $10,000 CDN, and gives 10 TB of disk space (3x 8-drive raidz2). If you use multiple 8-port SATA/SAS controllers instead of the RAID controllers, and a 3-way PSU, it should come out to under $8,000 CDN. Fully supported by FreeBSD, so everything should work with OpenSolaris. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zvol being charged for double space
On 01/27/10 21:17, Daniel Carosone wrote: This is as expected. Not expected is that: usedbyrefreservation = refreservation I would expect this to be 0, since all the reserved space has been allocated. This would be the case if the volume had no snapshots. As a result, used is over twice the size of the volume (+ a few small snapshots as well). I'm seeing essentially the same thing with a recently-created zvol with snapshots that I export via iscsi for time machine backups on a mac.

% zfs list -r -o name,refer,used,usedbyrefreservation,refreservation,volsize z/tm/mcgarrett
NAME            REFER   USED  USEDREFRESERV  REFRESERV  VOLSIZE
z/tm/mcgarrett  26.7G  88.2G            60G        60G      60G

The actual volume footprint is a bit less than half of the volume size, but the refreservation ensures that there is enough free space in the pool to allow me to overwrite every block of the zvol with incompressible data without any writes failing due to the pool being out of space. If you were to disable time-based snapshots and then overwrite a measurable fraction of the zvol, I'd expect USEDBYREFRESERVATION to shrink as the reserved blocks were actually used. If you want to allow for overcommit, you need to delete the refreservation. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Building big cheap storage system. What hardware to use?
On Wed, Jan 27, 2010 at 08:25:48PM -0800, borov wrote: SAS disks more expensive. Besides, there is no 2Tb SAS 7200 drives on market yet. Seagate released a 2 TB SAS drive last year. http://www.seagate.com/ww/v/index.jsp?locale=en-USvgnextoid=c7712f655373f110VgnVCM10f5ee0a0aRCRD -- Jason Fortezzo forte...@mechanicalism.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] backing this up
Yep Dan, Thank you very much for the idea, and helping me with my implementation issues. haha. I can see that raidz2 is not needed in this case. My question now lies as to full system recovery. Say all hell brakes loose and all is lost except tapes. If I use what you said and just add snapshots to a already standing zfs filesystem. I guess in this case I can do full backups to tapes as well as partial backups, what is the best way to accomplish this if data is all standing on a file. Note I will be using bacula (hopefully) unless a better is recommended. And finally, should I tar this file prior to sending it to tape or is this not needed in this case? Just a note, all of this data will fit on the tapes currently but what if it doesn't in the future? Thanks and sorry for all of the questions... Greg On Wed, Jan 27, 2010 at 1:08 PM, Daniel Carosone d...@geek.com.au wrote: On Wed, Jan 27, 2010 at 12:01:36PM -0800, Gregory Durham wrote: Hello All, I read through the attached threads and found a solution by a poster and decided to try it. That may have been mine - good to know it helped, or at least started to. The solution was to use 3 files (in my case I made them sparse) yep - writes to allocate space for them up front are pointless with CoW. I then created a raidz2 pool across these 3 files Really? If you want one tape's worth of space, written to 3 tapes, you might as well just write the same file to three tapes, I think. (I'm assuming here the files are the size you expect to write to a single tape - otherwise I'm even more confused about this bit). Perhaps it's easier to let zfs cope with repairing small media errors here and there, but the main idea of using a redundant pool of files was to cope with loss or damage to whole tapes, for a backup that already needed to span multiple tapes. If you want this three-way copy of a single tape, plus easy recovery from bad spots by reading back multiple tapes, then use a 3-way mirror. 
But consider the error-recovery mode of whatever you're using to write to tape - some skip to the next file on a read error. I expect similar ratios of data to parity files/tapes as would be used in typical disk setups, at least for wide stripes. Say raidz2 in sets of 10, 8+2, or so. (As an aside, I like this for disks, too - since striping 128k blocks to a power-of-two wide data stripe has to be more efficient) and started a zfs send | recv. The performance is horrible There can be several reasons for this, and we'd need to know more about your setup. The first critical thing is going to be the setup of the staging filesystem tha holds your pool files. If this is itself a raidz, perhaps you're iops limited - you're expecting 3 disk-files worth of concurrency from a pool that may not have it, though it should be a write-mostly workload so less sensitive. You'll be seeking a lot either way, though. If this is purely staging to tape, consider making the staging pool out of non-redundant single-disk vdevs. Alternately, if the staging pool is safe, there's another trick you might consider: create the pool, then offline 2 files while you recv, leaving the pool-of-files degraded. Then when you're done, you can let the pool resilver and fill in the redundancy. This might change the IO pattern enough to take less time overall, or at least allow you some flexibility with windows to schedule backup and tapes. Next is dedup - make sure you have the memory and l2arc capacity to dedup the incoming write stream. Dedup within the pool of files if you want and can (because this will dedup your tapes), but don't dedup under it as well. I've found this to produce completely pathological disk thrashing, in a related configuration (pool on lofi crypto file). Stacking dedup like this doubles the performance cliff under memory pressure we've been talking about recently. (If you really do want 3-way-mirror files, then by all means dedup them in the staging pool.) 
Related to this is ARC usage - I haven't investigated this carefully myself, but you may well be double-caching: the backup pool's data, as well as the staging pool's view of the files. Again, since it's a write-mostly workload, zfs should hopefully figure out that few blocks are being re-read, but you might experiment with primarycache=metadata for the staging pool holding the files. Perhaps zpool-on-files is smart enough to use direct IO, bypassing the cache anyway; I'm not sure.

How's your cpu usage? Check that you're not trying to double-compress the files (again, compress within the backup pool but not outside), and consider using a lightweight checksum rather than sha256 outside.

Then there's streaming and concurrency - try piping through buffer and using bigger socket and TCP buffers. TCP stalls and slow-start will amplify latency many-fold. A good zil device on the staging pool might also help; the backup pool will be doing sync writes to close
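Pulling the caching, compression, checksum and dedup advice above together, the property settings might look like this sketch (dataset names are hypothetical; the principle is to do the expensive work once, inside the backup pool where it lands on tape, and keep the staging layer cheap):

```shell
# Staging filesystem that holds the pool files - keep it lightweight:
zfs set primarycache=metadata staging/files   # avoid double-caching file data
zfs set compression=off       staging/files   # compress inside backuppool instead
zfs set checksum=fletcher4    staging/files   # lighter than sha256 on the outside
zfs set dedup=off             staging/files   # never stack dedup on both layers

# Backup pool built on those files - this is what reaches the tapes:
zfs set compression=on backuppool
zfs set dedup=on       backuppool   # only with enough RAM/L2ARC for the DDT
```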
Re: [zfs-discuss] zvol being charged for double space
On Wed, Jan 27, 2010 at 09:57:08PM -0800, Bill Sommerfeld wrote: Hi Bill! :-)

On 01/27/10 21:17, Daniel Carosone wrote: This is as expected. Not expected is that: usedbyrefreservation = refreservation. I would expect this to be 0, since all the reserved space has been allocated.

This would be the case if the volume had no snapshots.

Hmm.

The actual volume footprint is a bit less than half of the volume size, but the refreservation ensures that there is enough free space in the pool to allow me to overwrite every block of the zvol with uncompressible data without any writes failing due to the pool being out of space.

Hmm, this is new (to me) and undescribed (in the manpage) behaviour, but it does explain what I observed. In other words, usedbyrefreservation includes blocks currently shared with snapshots, representing a reservation for potential future CoW of those blocks. Does this happen for filesystems, or only volumes? I hope it's both, just more commonly encountered because refreservation is more commonly used with volumes.

If you were to disable time-based snapshots and then overwrite a measurable fraction of the zvol, I'd expect USEDBYREFRESERVATION to shrink as the reserved blocks were actually used.

Right. If I repeat my quick test with snapshots, then when the first snapshot is taken I should see usedbyrefreservation jump back up to the full size of the volume; at that point the whole volume is shared with the snapshot. As data is overwritten, the space for the retained copy would be added to usedbysnapshots, and the space that's now unique to the dataset would come off usedbyrefreservation, with the used total staying constant - until another snapshot is taken. I'll do that for my own interest; it now makes perfect sense and is quite reasonable. The trouble is that the documentation doesn't point this out, so it's surprising and unexpected.
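The quick test being described can be sketched as follows (pool and volume names and the 10g size are hypothetical; zfs create -V sets a refreservation equal to the volume size by default):

```shell
# Create a zvol; the refreservation defaults to the full volume size:
zfs create -V 10g tank/vol
zfs get used,usedbydataset,usedbysnapshots,usedbyrefreservation tank/vol

# Take a snapshot: every block of the zvol is now shared with the
# snapshot and may need CoW space on overwrite, so
# usedbyrefreservation jumps back up toward the full volume size:
zfs snapshot tank/vol@snap1
zfs get usedbyrefreservation tank/vol

# As blocks are overwritten, the retained copies accrue to
# usedbysnapshots and the newly unique blocks come off
# usedbyrefreservation; 'used' stays roughly constant until the
# next snapshot.
```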
There's text in the description of the refreservation property saying that snapshots will only be allowed if there is enough free space. What needs to be made clear is that this is achieved through the behaviour of usedbyrefreservation - partly by additional text in the description of that property (that it includes space shared with snapshots), and partly by improving the wording about free space here. I'll see if I can knock together some better wording later.

If you want to allow for overcommit, you need to delete the refreservation.

Of course. I just wasn't thinking of taking a snapshot as having this cost, though of course it does. -- Dan.
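Deleting the refreservation to allow overcommit is a one-liner (names hypothetical). The trade-off is the one discussed above: with the reservation gone, snapshots are cheap to take, but writes to the zvol can fail with ENOSPC if the pool fills.

```shell
# Allow overcommit / thin provisioning by dropping the reservation:
zfs set refreservation=none tank/vol
```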