Re: [zfs-discuss] Slow zfs writes

2013-02-12 Thread Ian Collins

Ram Chander wrote:


> Hi Roy,
> You are right. So it looks like a re-distribution issue. Initially
> there were two vdevs with 24 disks (disks 0-23) for close to a year,
> after which we added 24 more disks and created additional vdevs. The
> initial vdevs are filled up, so the write speed declined. Now, how do I
> find which files are present on a given vdev or disk? That way I can
> remove them and copy them back to redistribute the data. Is there any
> other way to solve this?



The only way is to avoid the problem in the first place by not mixing
vdev sizes in a pool.

--
Ian.



Re: [zfs-discuss] ZFS monitoring

2013-02-12 Thread Pawel Jakub Dawidek
On Mon, Feb 11, 2013 at 05:39:27PM +0100, Jim Klimov wrote:
> On 2013-02-11 17:14, Borja Marcos wrote:
>
>> On Feb 11, 2013, at 4:56 PM, Tim Cook wrote:
>>
>>> The zpool iostat output has all sorts of statistics I think would be
>>> useful/interesting to record over time.
>>
>> Yes, thanks :) I think I will add them; I just started with the esoteric
>> ones.
>>
>> Anyway, there's still no better way to read it than running zpool iostat
>> and parsing the output, right?
 
 
> I believe in this case you'd have to run it as a continuous process
> and parse the outputs after the first one (which is the overall
> since-boot statistic, IIRC). Also note that on problems with the ZFS
> engine itself, zpool may lock up and thus halt your program - so have
> it ready to abort an outstanding statistics read after a timeout and
> perhaps log an error.
>
> And if pools are imported or exported while it runs, the zpool iostat
> output changes dynamically, so you basically need to re-parse its text
> structure every time.
 
> zpool iostat -v might be even more interesting, though, as it lets
> you see per-vdev statistics and perhaps notice imbalances, etc.
 
> All that said, I don't know whether this data isn't also available as
> some set of kstats - that would probably serve your cause a lot better.
> Inspect the zpool source to see where it gets its numbers from...
> and perhaps make and RTI the relevant kstats, if they aren't there yet ;)
 
> On the other hand, I am not certain how Solaris-style kstats interact
> with or correspond to structures in FreeBSD (or Linux, for that matter).

I made kstat data available on FreeBSD via the 'kstat' sysctl tree:

# sysctl kstat
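
For example (an illustrative command, not part of the original message), the
ARC counters show up under the kstat.zfs.misc.arcstats branch of that tree:

  # sysctl kstat.zfs.misc.arcstats | head

Other ZFS kstats exported under the same tree can be browsed the same way.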

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://tupytaj.pl




Re: [zfs-discuss] Slow zfs writes

2013-02-12 Thread Jim Klimov

On 2013-02-12 10:32, Ian Collins wrote:

> Ram Chander wrote:
>
>> Hi Roy,
>> You are right. So it looks like a re-distribution issue. Initially
>> there were two vdevs with 24 disks (disks 0-23) for close to a year,
>> after which we added 24 more disks and created additional vdevs. The
>> initial vdevs are filled up, so the write speed declined. Now, how do I
>> find which files are present on a given vdev or disk? That way I can
>> remove them and copy them back to redistribute the data. Is there any
>> other way to solve this?
>
> The only way is to avoid the problem in the first place by not mixing
> vdev sizes in a pool.





Well, that imbalance is there - in the zpool status printout we see
raidz1 top-level vdevs of 5, 5, 12, 7, 7 and 7 disks plus some 5 spares
- which seems to sum up to 48 ;)


Depending on the disk sizes, it is possible that the tlvdev sizes in
gigabytes were kept the same (i.e. a raidz set with twice as many
disks of half the size), but we have no information on that detail and
it is unlikely. With the disk sets in one pool, this would still leave
the load quite unbalanced across spindles and IO buses.

Besides all that, with the older tlvdevs being fuller than the newer
ones, there is an imbalance that wouldn't have been avoided even by
not mixing vdev sizes: writes into the newer ones are more likely
to quickly find available holes, while writes into the older ones are
more fragmented and a longer search is needed to find a hole -
if not outright gang-block fragmentation. These two effects are, I
believe, the basis for the performance drop on full pools, with the
actual impact depending on the mix of IO patterns and the
fragmentation of data and holes.

I think there were developments in illumos ZFS to direct more of the
writes onto devices with more available space; I am not sure whether
the average write latency of a tlvdev was monitored and taken
into account in the write-targeting decisions (which would also
cover the case of failing devices that take longer to respond).
I am not sure which portions have been completed and integrated
into common illumos-gate.

As was suggested, you can use zpool iostat -v 5 to monitor IOs
to the pool with a fan-out per tlvdev and per disk, and watch for
possible patterns there. Do keep in mind, however, that for a
non-failed raidz set you should see reads only from the disks holding
the data of a particular stripe, while the parity portions are not
read unless a checksum mismatch occurs. On average the data is spread
so that there is no dedicated parity disk, but with small IOs you are
likely to notice this effect.
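
For instance, a minimal one-shot reading could look like this (illustrative
only; the pool name "tank" is an assumption):

  # two 5-second samples: the first is the since-boot average and should be
  # discarded, the second is a live per-tlvdev/per-disk reading
  zpool iostat -v tank 5 2

A monitoring script would repeat this in a loop and, as noted in the ZFS
monitoring thread, guard each invocation with a timeout in case zpool itself
hangs.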

If the budget permits, I'd suggest building (or leasing) another
system with balanced disk sets and replicating all data onto it,
then repurposing the older system - for example, to be a backup
of the newer box (also after remaking the disk layout).

As for the question of which files are on the older disks -
as a rule of thumb you can compare a file's creation/modification
time with the date when you expanded the pool ;)
Closer inspection could be done with a ZDB walk to print out
the DVA block addresses for the blocks of a file (the DVA includes
the number of the top-level vdev), but that would take some
time - first to determine which files you want to inspect (likely
some band of sizes) and then to do those zdb walks.
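
A rough sketch of both approaches (illustrative only; the dataset name
tank/data, the file name and the cut-off date are assumptions):

  # rule of thumb: files not modified since the pool was expanded
  touch -t 201201010000 /tmp/pool-expanded
  find /tank/data -type f ! -newer /tmp/pool-expanded

  # closer look at one file: ls -i prints the ZFS object number, and the
  # first field of each DVA in the zdb dump is the top-level vdev number
  zdb -dddddd tank/data $(ls -i /tank/data/somefile | awk '{print $1}')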

Good luck,
//Jim


Re: [zfs-discuss] Freeing unused space in thin provisioned zvols

2013-02-12 Thread Darren J Moffat



On 02/10/13 12:01, Koopmann, Jan-Peter wrote:

> Why should it?
>
> Unless you do a shrink on the VMDK and use a ZFS variant with SCSI UNMAP
> support (I believe currently only Nexenta, but correct me if I am wrong),
> the blocks will not be freed, will they?


Solaris 11.1 has ZFS with SCSI UNMAP support.

--
Darren J Moffat


Re: [zfs-discuss] ZFS monitoring

2013-02-12 Thread Borja Marcos

On Feb 12, 2013, at 11:25 AM, Pawel Jakub Dawidek wrote:

> I made kstat data available on FreeBSD via the 'kstat' sysctl tree:

Yes, I am using the data. I wasn't sure how to get something meaningful
out of it, but I've found the arcstats.pl script and I am using it as a model.

Suggestions are always welcome, though :)

(The sample pages I put on devilator.froblua.com aren't using the
better-organized graphs yet, though; it's just a crude parameter dump.)





Borja.



Re: [zfs-discuss] Freeing unused space in thin provisioned zvols

2013-02-12 Thread Stefan Ring
>> Unless you do a shrink on the VMDK and use a ZFS variant with SCSI UNMAP
>> support (I believe currently only Nexenta, but correct me if I am wrong),
>> the blocks will not be freed, will they?
>
> Solaris 11.1 has ZFS with SCSI UNMAP support.

Freeing unused blocks works perfectly well with fstrim (Linux)
consuming an iSCSI zvol served up by oi151a6.
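
For reference, the client side of that is just (an illustrative command; the
mount point is an assumption):

  # on the Linux initiator, after deleting data on the filesystem that
  # sits on the iSCSI LUN
  fstrim -v /mnt/zvol-backed-fs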


Re: [zfs-discuss] Freeing unused space in thin provisioned zvols

2013-02-12 Thread Thomas Nau

Darren

On 02/12/2013 11:25 AM, Darren J Moffat wrote:



> On 02/10/13 12:01, Koopmann, Jan-Peter wrote:
>
>> Why should it?
>>
>> Unless you do a shrink on the VMDK and use a ZFS variant with SCSI
>> UNMAP support (I believe currently only Nexenta, but correct me if I
>> am wrong), the blocks will not be freed, will they?
>
> Solaris 11.1 has ZFS with SCSI UNMAP support.



I seem to have skipped that one... Are there any related tools, e.g. to
release all-zero blocks or the like? Of course it's then up to the admin
to know what all this is about - or to wreck the data.


Thomas



Re: [zfs-discuss] Freeing unused space in thin provisioned zvols

2013-02-12 Thread Darren J Moffat



On 02/12/13 15:07, Thomas Nau wrote:

> Darren
>
> On 02/12/2013 11:25 AM, Darren J Moffat wrote:
>
>> On 02/10/13 12:01, Koopmann, Jan-Peter wrote:
>>
>>> Why should it?
>>>
>>> Unless you do a shrink on the VMDK and use a ZFS variant with SCSI
>>> UNMAP support (I believe currently only Nexenta, but correct me if I
>>> am wrong), the blocks will not be freed, will they?
>>
>> Solaris 11.1 has ZFS with SCSI UNMAP support.
>
> I seem to have skipped that one... Are there any related tools, e.g. to
> release all-zero blocks or the like? Of course it's then up to the admin
> to know what all this is about - or to wreck the data.


No tools; ZFS does it automatically when freeing blocks, provided the
underlying device advertises the functionality.


ZFS ZVOLs shared over COMSTAR advertise SCSI UNMAP as well.

--
Darren J Moffat


Re: [zfs-discuss] Freeing unused space in thin provisioned zvols

2013-02-12 Thread Casper . Dik


> No tools; ZFS does it automatically when freeing blocks, provided the
> underlying device advertises the functionality.
>
> ZFS ZVOLs shared over COMSTAR advertise SCSI UNMAP as well.


If a system was running something older, e.g. Solaris 11, the free
blocks will not be marked as such on the server, even after the system
is upgraded to Solaris 11.1.

There might be a way to force that by disabling compression, creating a
large file full of NULs and then removing it. But check first that this
actually has an effect before you try it.
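
A rough sketch of that workaround (illustrative only; the dataset name is an
assumption, and filling a filesystem to the brim carries its own risks, so
test on something small first as suggested above):

  zfs set compression=off tank/fs      # dataset sitting on the thin LUN
  dd if=/dev/zero of=/tank/fs/zerofill bs=1M   # runs until the fs is (nearly) full
  rm /tank/fs/zerofill
  zfs inherit compression tank/fs      # restore compression (adjust if it was set locally)

The idea is that the all-zero blocks overwrite the previously freed space and
can then be discarded or compressed away on the storage side.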

Casper



Re: [zfs-discuss] Freeing unused space in thin provisioned zvols

2013-02-12 Thread Sašo Kiselkov
On 02/10/2013 01:01 PM, Koopmann, Jan-Peter wrote:
> Why should it?
>
> I believe currently only Nexenta but correct me if I am wrong

The code was mainlined a while ago; see:

https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/io/comstar/lu/stmf_sbd/sbd.c#L3702-L3730
https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/fs/zfs/zvol.c#L1697-L1754

Thanks should go to the guys at Nexenta for contributing this to the
open-source effort.

Cheers,
--
Saso


Re: [zfs-discuss] Slow zfs writes

2013-02-12 Thread Ian Collins

Jim Klimov wrote:

> On 2013-02-12 10:32, Ian Collins wrote:
>
>> Ram Chander wrote:
>>
>>> Hi Roy,
>>> You are right. So it looks like a re-distribution issue. Initially
>>> there were two vdevs with 24 disks (disks 0-23) for close to a year,
>>> after which we added 24 more disks and created additional vdevs. The
>>> initial vdevs are filled up, so the write speed declined. Now, how do I
>>> find which files are present on a given vdev or disk? That way I can
>>> remove them and copy them back to redistribute the data. Is there any
>>> other way to solve this?
>>
>> The only way is to avoid the problem in the first place by not mixing
>> vdev sizes in a pool.



I was a bit quick off the mark there; I didn't notice that some vdevs
were older than others.



> Well, that imbalance is there - in the zpool status printout we see
> raidz1 top-level vdevs of 5, 5, 12, 7, 7 and 7 disks plus some 5 spares
> - which seems to sum up to 48 ;)


The vdev sizes are about (including parity space) 14, 14, 22, 19, 19, 
19TB respectively and 127TB total.  So even if the data is balanced, the 
performance of this pool will still start to degrade once ~84TB (about 
2/3 full) are used.


So the only viable long term solution is a rebuild, or putting bigger 
drives in the two smallest vdevs.


In the short term, when I've had similar issues I used zfs send to copy 
a large filesystem within the pool then renamed the copy to the original 
name and deleted the original.  This can be repeated until you have an 
acceptable distribution.
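
A minimal sketch of that shuffle (illustrative only; the pool and filesystem
names are assumptions, and writers to the original filesystem must be
quiesced before the final rename):

  zfs snapshot tank/data@rebalance
  zfs send tank/data@rebalance | zfs receive tank/data-new
  zfs destroy -r tank/data
  zfs rename tank/data-new tank/data
  zfs destroy tank/data@rebalance

Because the receive writes fresh blocks, the copy lands according to the
pool's current allocation policy, which tends to favour the less-full vdevs.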


One last thing: unless this is some form of backup pool, or the data on 
it isn't important, avoid raidz vdevs in such a large pool!


--
Ian.
