On 02/02/10 22:49, Tim Cook wrote:
On Tue, Feb 2, 2010 at 3:25 PM, Richard Elling <richard.ell...@gmail.com> wrote:
On Feb 2, 2010, at 12:05 PM, Arnaud Brand wrote:
> Hi folks,
>
> I'm having (as the title suggests) a problem with zfs send/receive.
> Command line is like this:
> pfexec zfs send -Rp tank/t...@snapshot | ssh remotehost pfexec zfs recv -v -F -d tank
>
> This works like a charm as long as the snapshot is small enough.
>
> When it gets too big (meaning somewhere between 17G and 900G), I get ssh errors (can't read from remote host).
>
> I tried various encryption options (the fastest in my case being arcfour) with no better results.
> I tried to set up a script to insert dd on the sending and receiving sides to buffer the flow; still read errors.
> I tried with mbuffer (which gives better performance), but it didn't get better.
> Today I tried with netcat (and mbuffer) and got better throughput, but it failed at 269GB transferred.
>
> The two machines are connected to the switch with 2x1GbE (Intel) joined together with LACP.
LACP is spawned from the devil to plague mankind. It won't
help your ssh transfer at all. It will cause your hair to turn grey and
get pulled out by the roots. Try turning it off or using a separate
network for your transfer.
-- richard
That's a bit harsh :)
To further what Richard said though, LACP isn't going to help with your
issue. LACP is NOT round-robin load balancing. Think of it more like
source-destination. You need to have multiple connections going out to
different source/destination mac/ip/whatever addresses. Typically it
works great for something like a fileserver that has 50 clients hitting
it. Then those clients will be balanced across the multiple links.
When you've got one server talking to one other server, it isn't going
to buy you much of anything 99% of the time.
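Tim's point can be illustrated with a toy sketch of LACP-style link selection. This is not any real switch's hash (vendors hash various combinations of MAC, IP, and port fields); the XOR of two made-up address octets is just a stand-in to show why one fixed source/destination pair always lands on the same physical link:

```shell
#!/bin/sh
# Toy illustration: each flow is pinned to one link by hashing its
# addresses, so a single src/dst pair never spreads across both 1GbE
# links. The octet values here are arbitrary examples.
pick_link() {
    src_octet=$1; dst_octet=$2; nlinks=$3
    echo $(( (src_octet ^ dst_octet) % nlinks ))
}

pick_link 10 27 2   # → 1
pick_link 10 28 2   # → 0  (a different client can land on the other link)
pick_link 10 27 2   # → 1  (the same pair always maps to the same link)
```

With 50 clients the hashes spread across both links; with one sender and one receiver there is exactly one hash value, hence one link.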
Also, depending on your switch, it can actually hamper you quite a
bit. If you've got a good cisco/hp/brocade/extreme
networks/force10/etc switch, it's fine. If you've got a $50 soho
netgear, you typically are going to get what you paid for :)
--Tim
I'll remove LACP when I get back to work tomorrow (that's in a few
hours).
I already knew its principles (doesn't hurt to repeat them though), but since we have at least two machines connecting simultaneously to this server, plus occasional clients, plus the replication stream, I thought I could win some bandwidth.
I think I should have stuck to the rule: first make it work, then make it fast.
In the meantime, I've launched the same command with a dd to a local file instead of a zfs recv (i.e. something along the lines of: pfexec zfs send -Rp tank/t...@snapshot | ssh remotehost dd of=/tank/repl.zfs bs=128k).
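For reference, the netcat/mbuffer-style pipeline mentioned above looks roughly like this. The host name, dataset, port, and mbuffer buffer sizes below are placeholder/example values, not the exact ones from my runs:

```shell
#!/bin/sh
# Sketch of a zfs send/recv pipeline over mbuffer instead of ssh.
# HOST, SNAP, the port, and the buffer sizes are placeholders.
HOST=remotehost
SNAP="tank/mydataset@snapshot"

# Receiver side: listen on TCP 9090, buffer up to 1 GB in RAM,
# then feed the stream into zfs recv.
RECV="mbuffer -s 128k -m 1G -I 9090 | pfexec zfs recv -v -F -d tank"

# Sender side: stream the replication package into mbuffer over TCP,
# bypassing ssh (and its encryption overhead) entirely.
SEND="pfexec zfs send -Rp $SNAP | mbuffer -s 128k -m 1G -O $HOST:9090"

# Printed rather than executed, since each half runs on a different host.
echo "on $HOST: $RECV"
echo "locally:  $SEND"
```

The buffering decouples the bursty zfs send output from the network, which is why mbuffer tends to give better throughput than a bare ssh pipe.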
I hope I'm not running into the issues related to e1000g problems under
load (zfs recv eats up all the CPU when it flushes and then the
transfer almost stalls for a second or two).
For the switch, it's an HP4208 with reasonably up-to-date firmware (less than six months old; the next update of our switches is scheduled for Feb 20th).
Strange thing is that the connection is lost on the sending side, but the receiving side shows it's still "established" (in netstat -an).
I could try changing the network cables too, maybe one of them has a
problem.
Thanks,
Arnaud