Re: [zfs-discuss] Scrub works in parallel?

2012-06-12 Thread Roch Bourbonnais

Scrubs are run at very low priority and yield very quickly in the presence of
other work.
So I really would not expect to see a scrub create any impact on any other type
of storage activity.
Resilvering will push forward more aggressively on what it has to do, but
resilvering does not need to
read any of the data blocks on the non-resilvering vdevs.

-r

On 11 June 2012, at 09:05, Jim Klimov wrote:

 2012-06-11 5:37, Edward Ned Harvey wrote:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Kalle Anka
 
 Assume we have 100 disks in one zpool. Assume it takes 5 hours to scrub one
 disk. If I scrub the zpool, how long will it take?
 
 Will it scrub one disk at a time, so it will take 500 hours, i.e. in
 sequence, just serial? Or is it possible to run the scrub in parallel, so
 it takes 5h no matter how many disks?
 
 It will be approximately parallel, because it's actually scrubbing only the
 used blocks, and the order it scrubs in will be approximately the order they
 were written, which was intentionally parallel.
 
 What the other posters said, plus: 100 disks is quite a lot
 of contention on the bus(es), so even if it is all parallel,
 the bus and CPU bottlenecks would raise the scrubbing time
 somewhat above the single-disk scrub time.
 
 Roughly, if all else is ideal (i.e. no/few random seeks and
 a fast scrub at 100 MByte/s per disk), a shared SATA3 interface at 6 Gbit/s
 (on the order of ~600 MByte/s) will be maxed out by about
 6 disks. If your disks are co-located on one HBA receptacle
 (i.e. via a backplane), this may be an issue for many disks
 in an enclosure (a 4-lane link will sustain about 24 drives
 at such a speed, and 100 MByte/s is not the fastest drive on the market).
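 (Rough arithmetic behind those figures, assuming the ~100 MByte/s of
 sequential throughput per disk mentioned above:
     1 x 6 Gbit/s SATA link ~  600 MByte/s  ->  saturated by ~6 such disks
     4 x 6 Gbit/s SAS lanes ~ 2400 MByte/s  ->  saturated by ~24 such disks
 so a single 4-lane uplink to a daisy-chain of 100 disks would leave each
 disk only about 24 MByte/s of scrub bandwidth, even with no other load.)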
 
 Further on, the PCI buses will become a bottleneck and the
 CPU processing power might become one too, and for a box
 with 100 disks this may be noticeable, depending on the other
 architectural choices, components and their specs.
 
 HTH,
 //Jim
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Scrub works in parallel?

2012-06-12 Thread Jim Klimov

2012-06-12 16:20, Roch Bourbonnais wrote:


Scrubs are run at very low priority and yield very quickly in the presence of
other work.
So I really would not expect to see a scrub create any impact on any other type
of storage activity.
Resilvering will push forward more aggressively on what it has to do, but
resilvering does not need to
read any of the data blocks on the non-resilvering vdevs.


Thanks, I agree - and that's important to notice, at least
on the current versions of ZFS :)

What I meant to stress is that if a scrub of one disk takes
5 hours (however that measurement is made, for example by
building a 1-disk pool with the same data distribution), then
there are physical reasons why a 100-disk pool would probably
take somewhat more than 5 hours to scrub; or at least there
are bottlenecks that deserve attention in order to minimize
such an increase in scrub time.

Also, yes, the presence of other pool activity would likely
delay scrub completion, perhaps even more noticeably.

Thanks,
//Jim Klimov
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Scrub works in parallel?

2012-06-12 Thread Roch Bourbonnais

The process should be scalable.
Scrub all of the data on one disk using one disk's worth of IOPS.
Scrub all of the data on N disks using N disks' worth of IOPS.

That will take ~ the same total time.
-r

On 12 June 2012, at 08:28, Jim Klimov wrote:

 2012-06-12 16:20, Roch Bourbonnais wrote:
 
 Scrubs are run at very low priority and yield very quickly in the presence
 of other work.
 So I really would not expect to see a scrub create any impact on any other
 type of storage activity.
 Resilvering will push forward more aggressively on what it has to do, but
 resilvering does not need to
 read any of the data blocks on the non-resilvering vdevs.
 
 Thanks, I agree - and that's important to notice, at least
 on the current versions of ZFS :)
 
 What I meant to stress is that if a scrub of one disk takes
 5 hours (however that measurement is made, for example by
 building a 1-disk pool with the same data distribution), then
 there are physical reasons why a 100-disk pool would probably
 take somewhat more than 5 hours to scrub; or at least there
 are bottlenecks that deserve attention in order to minimize
 such an increase in scrub time.
 
 Also, yes, the presence of other pool activity would likely
 delay scrub completion, perhaps even more noticeably.
 
 Thanks,
 //Jim Klimov
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Scrub works in parallel?

2012-06-12 Thread Jim Klimov

2012-06-12 16:45, Roch Bourbonnais wrote:


The process should be scalable.
Scrub all of the data on one disk using one disk's worth of IOPS.
Scrub all of the data on N disks using N disks' worth of IOPS.

That will take ~ the same total time.


IF the uplink or processing power or some other bottleneck does
not limit that (e.g. a single 4-lane SAS link to a daisy-chain
of 100 or 200 disks would likely impose a bandwidth bottleneck).

I know that well-engineered servers spec'ed by a vendor/integrator
for the customer's tasks and environment, such as those from Sun,
are built to avoid such apparent bottlenecks. But people who
construct their own storage should know of (and try to avoid)
such possible problem-makers ;)

Thanks, Roch,
//Jim Klimov
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Scrub works in parallel?

2012-06-12 Thread Richard Elling
On Jun 11, 2012, at 6:05 AM, Jim Klimov wrote:

 2012-06-11 5:37, Edward Ned Harvey wrote:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Kalle Anka
 
 Assume we have 100 disks in one zpool. Assume it takes 5 hours to scrub one
 disk. If I scrub the zpool, how long will it take?
 
 Will it scrub one disk at a time, so it will take 500 hours, i.e. in
 sequence, just serial? Or is it possible to run the scrub in parallel, so
 it takes 5h no matter how many disks?
 
 It will be approximately parallel, because it's actually scrubbing only the
 used blocks, and the order it scrubs in will be approximately the order they
 were written, which was intentionally parallel.
 
 What the other posters said, plus: 100 disks is quite a lot
 of contention on the bus(es), so even if it is all parallel,
 the bus and CPU bottlenecks would raise the scrubbing time
 somewhat above the single-disk scrub time.

In general, this is not true for HDDs or modern CPUs. Modern systems
are overprovisioned for bandwidth. In fact, bandwidth has been a poor
design point for storage for a long time. Dave Patterson has some
interesting observations on this, now 8 years old:
http://www.ll.mit.edu/HPEC/agendas/proc04/invited/patterson_keynote.pdf

SSDs tend to be a different story, and there is some interesting work being
done in this area, both on the systems side as well as the SSD side. This is
where the fun work is progressing :-)
 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-12 Thread Sašo Kiselkov
Seems the problem is somewhat more egregious than I thought. The xcall
storm causes my network drivers to stop receiving IP multicast packets
and subsequently my recording applications record bad data, so
ultimately, this kind of isn't workable... I need to somehow resolve
this... I'm running four on-board Broadcom NICs in an LACP
aggregation. Any ideas on why this might be a side-effect? I'm really
kind of out of ideas here...

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-12 Thread Sašo Kiselkov
On 06/12/2012 03:57 PM, Sašo Kiselkov wrote:
 Seems the problem is somewhat more egregious than I thought. The xcall
 storm causes my network drivers to stop receiving IP multicast packets
 and subsequently my recording applications record bad data, so
 ultimately, this kind of isn't workable... I need to somehow resolve
 this... I'm running four on-board Broadcom NICs in an LACP
 aggregation. Any ideas on why this might be a side-effect? I'm really
 kind of out of ideas here...
 
 Cheers,
 --
 Saso

Just as another datapoint, though I'm not sure if it's going to be much
use: I found (via arcstat.pl) that the storms always start when ARC
downsizing starts. E.g. I would see the following in
./arcstat.pl 1:

    Time  read  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
16:29:45    21     0    0     0    0     0    0   111G  111G
16:29:46     0     0    0     0    0     0    0   111G  111G
16:29:47     1     0    0     0    0     0    0   111G  111G
16:29:48     0     0    0     0    0     0    0   111G  111G
16:29:49    5K     0    0     0    0     0    0   111G  111G
  (this is where the problem starts)
16:29:50    36     0    0     0    0     0    0   109G  107G
16:29:51    51     0    0     0    0     0    0   107G  107G
16:29:52    10     0    0     0    0     0    0   107G  107G
16:29:53   148     0    0     0    0     0    0   107G  107G
16:29:54    5K     0    0     0    0     0    0   107G  107G
  (and after a while, around 10-15 seconds, it stops)

(I omitted the miss and miss% columns to make the rows fit).

During that time, the network stack is dropping incoming IP multicast UDP
packets like crazy, so I see my network input drop by about 30-40%.
Truly strange behavior...

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-12 Thread Matt Breitbach
I saw this _exact_ problem after I bumped RAM from 48GB to 192GB.  Low
memory pressure seemed to be the culprit.  It usually happened during storage
vMotions or similar operations that effectively invalidated the data in the
ARC (sometimes 50GB of data would be purged from the ARC).  The system was
so busy that it would drop 10Gbit LACP portchannels from our Nexus 5k stack.
I never got a good solution to this other than to set arc_min_c to something
that was close to what I wanted the system to use - I settled on setting it
at ~160GB.  It still dropped the arcsz, but it didn't try to adjust arc_c
and resulted in significantly fewer xcalls.
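(In case it helps anyone trying the same workaround: the exact symbol name
varies between builds, but on illumos-derived kernels an ARC floor is normally
set through the zfs_arc_min tunable in /etc/system. The value below is only an
illustration of ~160GB expressed in bytes; adjust it to your RAM size.)

* hypothetical example only - pick a value that suits your system
set zfs:zfs_arc_min=0x2800000000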

-Matt Breitbach

-Original Message-
From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Sašo Kiselkov
Sent: Tuesday, June 12, 2012 10:14 AM
To: Richard Elling
Cc: zfs-discuss
Subject: Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

On 06/12/2012 03:57 PM, Sašo Kiselkov wrote:
 Seems the problem is somewhat more egregious than I thought. The xcall
 storm causes my network drivers to stop receiving IP multicast packets
 and subsequently my recording applications record bad data, so
 ultimately, this kind of isn't workable... I need to somehow resolve
 this... I'm running four on-board Broadcom NICs in an LACP
 aggregation. Any ideas on why this might be a side-effect? I'm really
 kind of out of ideas here...
 
 Cheers,
 --
 Saso

Just as another datapoint, though I'm not sure if it's going to be much
use: I found (via arcstat.pl) that the storms always start when ARC
downsizing starts. E.g. I would see the following in
./arcstat.pl 1:

    Time  read  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
16:29:45    21     0    0     0    0     0    0   111G  111G
16:29:46     0     0    0     0    0     0    0   111G  111G
16:29:47     1     0    0     0    0     0    0   111G  111G
16:29:48     0     0    0     0    0     0    0   111G  111G
16:29:49    5K     0    0     0    0     0    0   111G  111G
  (this is where the problem starts)
16:29:50    36     0    0     0    0     0    0   109G  107G
16:29:51    51     0    0     0    0     0    0   107G  107G
16:29:52    10     0    0     0    0     0    0   107G  107G
16:29:53   148     0    0     0    0     0    0   107G  107G
16:29:54    5K     0    0     0    0     0    0   107G  107G
  (and after a while, around 10-15 seconds, it stops)

(I omitted the miss and miss% columns to make the rows fit).

During that time, the network stack is dropping incoming IP multicast UDP
packets like crazy, so I see my network input drop by about 30-40%.
Truly strange behavior...

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-12 Thread Roch Bourbonnais

So the xcalls are a necessary part of memory reclaiming: when memory is
reclaimed, the TLB entries mapping the physical pages have to be torn down
(so that the memory can be repurposed from then on).
The xcalls are just part of this. They should not cause trouble, but they do:
they consume a CPU for some time.

That in turn can cause infrequent latency bubbles on the network. One likely
root cause of these latency bubbles is that network threads are bound to CPUs
by default, so if the xcall storm ends up on the CPU that a network thread is
bound to, that thread will wait for the storm to pass.

So try unbinding the mac threads; it may help you here.
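(For anyone wanting to confirm where the xcalls originate while a storm is in
progress, a generic DTrace one-liner along these lines should show the kernel
stacks issuing them; nothing here is specific to this particular box:)

dtrace -n 'sysinfo:::xcalls { @[stack()] = count(); } tick-10s { exit(0); }'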

On 12 June 2012, at 09:57, Sašo Kiselkov wrote:

 Seems the problem is somewhat more egregious than I thought. The xcall
 storm causes my network drivers to stop receiving IP multicast packets
 and subsequently my recording applications record bad data, so
 ultimately, this kind of isn't workable... I need to somehow resolve
 this... I'm running four on-board Broadcom NICs in an LACP
 aggregation. Any ideas on why this might be a side-effect? I'm really
 kind of out of ideas here...
 
 Cheers,
 --
 Saso
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-12 Thread Sašo Kiselkov
On 06/12/2012 05:21 PM, Matt Breitbach wrote:
 I saw this _exact_ problem after I bumped RAM from 48GB to 192GB.  Low
 memory pressure seemed to be the culprit.  It usually happened during storage
 vMotions or similar operations that effectively invalidated the data in the
 ARC (sometimes 50GB of data would be purged from the ARC).  The system was
 so busy that it would drop 10Gbit LACP portchannels from our Nexus 5k stack.
 I never got a good solution to this other than to set arc_min_c to something
 that was close to what I wanted the system to use - I settled on setting it
 at ~160GB.  It still dropped the arcsz, but it didn't try to adjust arc_c
 and resulted in significantly fewer xcalls.

Hmm, how do I do that? I don't have that kind of symbol in the kernel.
I'm running OpenIndiana build 151a. My system indeed runs at low memory
pressure; I'm simply running a bunch of writers writing files linearly
with data they receive over IP/UDP multicast sockets.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-12 Thread Sašo Kiselkov
On 06/12/2012 05:37 PM, Roch Bourbonnais wrote:
 
 So the xcalls are a necessary part of memory reclaiming: when memory is
 reclaimed, the TLB entries mapping the physical pages have to be torn down
 (so that the memory can be repurposed from then on).
 The xcalls are just part of this. They should not cause trouble, but they do:
 they consume a CPU for some time.
 
 That in turn can cause infrequent latency bubbles on the network. One likely
 root cause of these latency bubbles is that network threads are bound to CPUs
 by default, so if the xcall storm ends up on the CPU that a network thread is
 bound to, that thread will wait for the storm to pass.

I understand, but the xcall storm only eats up a single core out
of a total of 32, plus it's not a single specific one, it tends to
change, so what are the odds of it hitting the same core as the one on
which the mac thread is running?

 So try unbinding the mac threads; it may help you here.

How do I do that? All I can find on interrupt fencing and the like is to
simply set certain processors to no-intr, which moves all of the
interrupts elsewhere and doesn't prevent the xcall storm from choosing to
affect these CPUs either...

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-12 Thread Jim Klimov

2012-06-12 19:52, Sašo Kiselkov wrote:

So try unbinding the mac threads; it may help you here.


How do I do that? All I can find on interrupt fencing and the like is to
simply set certain processors to no-intr, which moves all of the
interrupts elsewhere and doesn't prevent the xcall storm from choosing to
affect these CPUs either...


I wonder if creating a CPU set not assigned to any active user
processes, and dedicating that CPU set to servicing the (networking)
interrupts, would work or help (see psrset, psradm)?
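(Purely as an illustration of that idea: the CPU IDs below are made up and
would have to match whatever ::interrupts reports for the NICs on the box.)

# find which CPUs the NIC interrupts are bound to
echo ::interrupts | mdb -k
# fence those CPUs off from ordinary threads by giving them their own set;
# psrset prints something like "created processor set 1"
psrset -c 4 5 6 7
# threads only run there if explicitly bound, e.g. psrset -b 1 <pid>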

My 2c,
//Jim Klimov
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-12 Thread Jim Mauro

 
 So try unbinding the mac threads; it may help you here.
 
 How do I do that? All I can find on interrupt fencing and the like is to
 simply set certain processors to no-intr, which moves all of the
 interrupts elsewhere and doesn't prevent the xcall storm from choosing to
 affect these CPUs either…

In /etc/system:

set mac:mac_soft_ring_thread_bind=0
set mac:mac_srs_thread_bind=0

Reboot required. Verify after reboot with mdb:

echo mac_soft_ring_thread_bind/D | mdb -k
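(Presumably the second tunable can be checked the same way; after the change
both should print a value of 0:)

echo mac_srs_thread_bind/D | mdb -k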


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-12 Thread Sašo Kiselkov
On 06/12/2012 06:06 PM, Jim Mauro wrote:
 

 So try unbinding the mac threads; it may help you here.

 How do I do that? All I can find on interrupt fencing and the like is to
 simply set certain processors to no-intr, which moves all of the
 interrupts elsewhere and doesn't prevent the xcall storm from choosing to
 affect these CPUs either…
 
 In /etc/system:
 
 set mac:mac_soft_ring_thread_bind=0
 set mac:mac_srs_thread_bind=0
 
 Reboot required. Verify after reboot with mdb:
 
 echo mac_soft_ring_thread_bind/D | mdb -k

Trying that right now... thanks!

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-12 Thread Sašo Kiselkov
On 06/12/2012 05:58 PM, Andy Bowers - Performance Engineering wrote:
 find where your nics are bound to
 
 mdb -k
 ::interrupts
 
 create a processor set including those cpus [ so just the nic code will
 run there ]
 
 andy

Tried and didn't help, unfortunately. I'm still seeing drops. What's
even funnier is that I'm seeing drops when the machine is sync'ing the
txg to the zpool. So looking at a little UDP receiver I can see the
following input stream bandwidth (the stream is constant bitrate, so
this shouldn't happen):

4.396151 Mbit/s   - drop
5.217205 Mbit/s
5.144323 Mbit/s
5.150227 Mbit/s
5.144150 Mbit/s
4.663824 Mbit/s   - drop
5.178603 Mbit/s
5.148681 Mbit/s
5.153835 Mbit/s
5.141116 Mbit/s
4.532479 Mbit/s   - drop
5.197381 Mbit/s
5.158436 Mbit/s
5.141881 Mbit/s
5.145433 Mbit/s
4.605852 Mbit/s   - drop
5.183006 Mbit/s
5.150526 Mbit/s
5.149324 Mbit/s
5.142306 Mbit/s
4.749443 Mbit/s   - drop

(txg timeout on my system is the default 5s)

It isn't just a slight delay in the arrival of the packets, because then
I should be seeing a rebound on the bitrate, sort of like this:

 ^ |--, ,^-, ,^-, ,^-, ,^--
 B |   v    v    v    v
   |
   +-----------------------
                      t ->

Instead, what I'm seeing is simply:

 ^ |--, ,--, ,--, ,--, ,---
 B |   v    v    v    v
   |
   +-----------------------
                      t ->

(The missing spikes after the drops mean that there were lost packets
on the NIC.)

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-12 Thread Mike Gerdts
On Tue, Jun 12, 2012 at 11:17 AM, Sašo Kiselkov skiselkov...@gmail.com wrote:
 On 06/12/2012 05:58 PM, Andy Bowers - Performance Engineering wrote:
 find where your nics are bound too

 mdb -k
 ::interrupts

 create a processor set including those cpus [ so just the nic code will
 run there ]

 andy

 Tried and didn't help, unfortunately. I'm still seeing drops. What's
 even funnier is that I'm seeing drops when the machine is sync'ing the
 txg to the zpool. So looking at a little UDP receiver I can see the
 following input stream bandwidth (the stream is constant bitrate, so
 this shouldn't happen):

If processing in interrupt context (use intrstat) is dominating cpu
usage, you may be able to use pcitool to cause the device generating
all of those expensive interrupts to be moved to another CPU.
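(As an illustration of that first step: intrstat with an interval argument
reports per-device interrupt activity broken down by CPU, which makes it easy
to spot a CPU dominated by interrupt handling. The 5-second interval is
arbitrary.)

intrstat 5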

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-12 Thread Sašo Kiselkov
On 06/12/2012 07:19 PM, Roch Bourbonnais wrote:
 
 Try with these /etc/system tunings:
 
 set mac:mac_soft_ring_thread_bind=0
 set mac:mac_srs_thread_bind=0
 set zfs:zio_taskq_batch_pct=50
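 
 (Presumably the ZFS tunable can be verified after the reboot in the same way
 as the mac ones earlier in the thread, e.g.:)
 
 echo zio_taskq_batch_pct/D | mdb -k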
 

Thanks for the recommendations, I'll try them and see whether they help, but
this is going to take me a while (especially since the reboot means
I'll have a cleared ARC and will need to record around 120G of data again,
which takes a while to accumulate).

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss