Increasing GELI performance

2007-07-28 Thread Dominic Bishop
I've just been testing GELI performance on top of a RAID array on a 3ware
9550SXU-12 controller, running RELENG_6 as of yesterday, and I seem to be
hitting a performance bottleneck, but I can't see where it is coming from.

To give an idea of the capability of the disk device itself: testing with an
unencrypted 100GB GPT partition (/dev/da0p1) gives me around 200-250MB/s read
and write speeds.
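
(For reference, the baseline above came from plain dd runs against the raw
partition; the exact commands and counts here are my best guess rather than a
record of what was run:)

dd if=/dev/da0p1 of=/dev/null bs=1m count=10000    # sequential read, count chosen arbitrarily
dd if=/dev/zero of=/dev/da0p1 bs=1m count=10000    # sequential write (destroys data on the partition)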

Using GELI with a default 128-bit AES key seems to be limited to ~50MB/s, and
changing the sector size all the way up to 128KB makes no difference
whatsoever to the performance. If I use the threads sysctl in loader.conf
and drop the geli threads to 1 (instead of the usual 3 it spawns on this
system) the performance still does not change at all. Monitoring during
writes with systat confirms that it really is spawning 1 or 3 threads
correctly in these cases.
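
(For reference, the tunable referred to is kern.geom.eli.threads, set in
/boot/loader.conf roughly like this for the single-thread test:)

# /boot/loader.conf
kern.geom.eli.threads=1    # force geli to spawn a single worker thread per provider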

Here is a uname -a from the machine:

FreeBSD 004 6.2-STABLE FreeBSD 6.2-STABLE #2: Fri Jul 27 20:10:05 CEST 2007
[EMAIL PROTECTED]:/u1/obj/u1/src/sys/004  amd64

The kernel is a copy of GENERIC with the GELI option added.
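
(For reference, that option is GEOM_ELI, which also needs the in-kernel
crypto framework; the relevant kernel config lines look like this:)

options GEOM_ELI    # GELI GEOM class
device  crypto      # cryptographic framework used by GELI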

Encrypted partition created using: geli init -s 65536 /dev/da0p1
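
(The provider obviously has to be attached before the .eli device exists;
purely for illustration, and not necessarily what was done here, something
like:)

geli attach /dev/da0p1              # prompts for the passphrase, creates /dev/da0p1.eli
# or, for a throwaway benchmark with no key material to manage:
geli onetime -s 65536 /dev/da0p1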

Simple write test done with: dd if=/dev/zero of=/dev/da0p1.eli bs=1m
count=1 (the same as I did on the unencrypted partition; a full test with
bonnie++ shows similar speeds).
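
(The corresponding read test would simply be the reverse direction, something
along the lines of:)

dd if=/dev/da0p1.eli of=/dev/null bs=1m count=10000    # count chosen arbitrarily for the example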

Systat output whilst writing, showing 3 threads:


/0   /1   /2   /3   /4   /5   /6   /7   /8   /9   /10
 Load Average   

/0   /10  /20  /30  /40  /50  /60  /70  /80  /90  /100
root idle: cpu3 X  
root idle: cpu1   
 idle 
root idle: cpu0 XXX 
root idle: cpu2 XX
root g_eli[2] d XXX 
root g_eli[0] d XXX
root g_eli[1] d X 
root   g_up
root dd

Output from vmstat -w 5
 procs      memory       page                     disks     faults        cpu
 r b w     avm     fre   flt  re  pi  po    fr  sr ad4 da0   in   sy   cs us sy id
 0 1 0   38124 3924428   208   0   1   0  9052   0   0   0 1758  451 6354  1 15 84
 0 1 0   38124 3924428     0   0   0   0 13642   0   0 411 2613  128 9483  0 22 78
 0 1 0   38124 3924428     0   0   0   0 13649   0   0 411 2614  130 9483  0 22 78
 0 1 0   38124 3924428     0   0   0   0 13642   0   0 411 2612  128 9477  0 22 78
 0 1 0   38124 3924428     0   0   0   0 13642   0   0 411 2611  128 9474  0 23 77

Output from iostat -x 5
                 extended device statistics
device     r/s    w/s   kr/s    kw/s wait svc_t  %b
ad4        2.2    0.7   31.6     8.1    0   3.4   1
da0        0.2  287.8    2.3 36841.5    0   0.4  10
pass0      0.0    0.0    0.0     0.0    0   0.0   0
                 extended device statistics
device     r/s    w/s   kr/s    kw/s wait svc_t  %b
ad4        0.0    0.0    0.0     0.0    0   0.0   0
da0        0.0  411.1    0.0 52622.1    0   0.4  15
pass0      0.0    0.0    0.0     0.0    0   0.0   0
                 extended device statistics
device     r/s    w/s   kr/s    kw/s wait svc_t  %b
ad4        0.0    0.0    0.0     0.0    0   0.0   0
da0        0.0  411.1    0.0 52616.2    0   0.4  15
pass0      0.0    0.0    0.0     0.0    0   0.0   0


Looking at these results myself, I cannot see where the bottleneck is. Since
changing the sector size or the number of geli threads does not affect
performance, I would assume there is some other single-threaded part limiting
it, but I don't know enough about how it works to say what.
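
(One way to sanity-check whether a single core can even push more than
~50MB/s of AES, independently of GEOM and the disk, would be OpenSSL's
built-in benchmark; just an idea, not something tested here:)

openssl speed -evp aes-128-cbc    # per-core software AES-128-CBC throughput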

The CPUs in the machine are a pair of these:
CPU: Intel(R) Xeon(R) CPU 5110 @ 1.60GHz (1603.92-MHz K8-class CPU)

I've also come across some other strange issues on other machines which have
identical arrays but only a pair of 32-bit 3.0GHz Xeons in them (also running
RELENG_6 as of yesterday, just i386 rather than amd64). On those, geli
launches a single thread by default (cores - 1 seems to be the default), but
I cannot force it to launch 2 using the sysctl, although on the 4-core
machine I can successfully use it to launch 4. It would be nice to be able to
use both cores on the 32-bit machines for geli, but given the results I've
shown here I'm not sure it would gain me much at the moment.
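
(For reference, the way I would expect to apply a new thread count on those
machines is to set the tunable and re-attach the provider, since the worker
threads seem to be created at attach time; that last detail is an assumption
on my part:)

geli detach da0p1.eli
sysctl kern.geom.eli.threads=2     # or set it in /boot/loader.conf and reboot
geli attach /dev/da0p1
ps ax | grep g_eli                 # check how many worker threads actually started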

Another problem I've found is that if I use a GELI sector size larger than
8192 bytes then I'm unable to newfs the encrypted partition afterwards; it
fails immediately with this error:

newfs /dev/da0p1.eli
increasing block size from 16384 to fragment size (65536)
/dev/da0p1.eli: 62499.9MB (127999872 sectors) block size 65536, fragment
size 65536
using 5 cylinder groups of 14514.56MB, 232233 blks, 58112 inodes.
newfs: can't read old UFS1 superblock: read error from block device: Invalid
argument

The underlying device itself is readable and writable, however, since dd can
read from and write to it without any errors.
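
(Purely as an illustration of the knobs involved, and not a confirmed fix,
telling newfs explicitly what sector, fragment and block size to use would
look like this:)

newfs -S 65536 -f 65536 -b 65536 /dev/da0p1.eli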

If anyone has any suggestions or thoughts on any of these points it would be
much appreciated; these machines will be performing backups over a 1Gbit LAN,
so more speed than I can currently get would be preferable.
