On 25/09/2009, at 11:49 AM, Bob Friesenhahn wrote:

The commentary says that normally the COMMIT operations occur during the close(2) or fsync(2) system calls, or when the client encounters memory pressure. If the problem is slow copying of many small files, this COMMIT approach does not help very much, since very little data is sent per file and most of the time is spent creating directories and files.
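To make that concrete, here is a minimal C sketch (my own illustration, with a placeholder path on an NFSv3 mount) of the client-side sequence being described: the plain write() calls go out as WRITE (UNSTABLE) RPCs, and the only COMMIT should be the one triggered at fsync() or close(), or under memory pressure.

/*
 * Minimal sketch of the client-side write pattern described above.
 * The path is a placeholder for any file on an NFSv3 mount.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        char buf[64 * 1024];
        memset(buf, 'x', sizeof (buf));

        /* placeholder path on an NFS-mounted filesystem */
        int fd = open("/net/server/export/testfile",
            O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
                perror("open");
                return (1);
        }

        /* ~8.4MB of data, sent to the server as WRITE (UNSTABLE) RPCs */
        for (int i = 0; i < 134; i++) {
                if (write(fd, buf, sizeof (buf)) < 0) {
                        perror("write");
                        return (1);
                }
        }

        /*
         * COMMIT point: the server must push the data to stable storage
         * here, which on the ZFS side means a zil_commit() and hence,
         * with a slog configured, a burst of slog writes.
         */
        if (fsync(fd) < 0)
                perror("fsync");

        /* close() is the other usual COMMIT point */
        (void) close(fd);
        return (0);
}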

The problem appears to be slog bandwidth exhaustion: all of the data is being sent via the slog, creating contention for all subsequent NFS or locally synchronous writes. The NFS writes do not appear to be synchronous in nature - only a single COMMIT is issued at the very end - yet all of that data appears to be going via the slog, and it appears to be inflated to roughly twice its original size.

As a test, I copied a relatively small file (8.4MB). Looking at a tcpdump capture in Wireshark, there is a SETATTR at the start, the transfer ends with a single V3 COMMIT, and there are no COMMIT messages during the transfer itself.

iostat output that matches looks like this:

First, the slog write of the data (about 17MB appears to hit the slog):

Friday, 25 September 2009  1:01:00 PM EST
                    extended device statistics              ---- errors ---
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0  135.0    0.0  17154.5  0.0  0.8    0.0    6.0   0   3   0   0   0   0 c7t2d0

Then, a few seconds later, the transaction group gets flushed to primary storage, writing nearly 11.4MB, which is in line with RAID-Z2 (expect around 10.5MB; 8.4/8*10):

Friday, 25 September 2009  1:01:13 PM EST
                    extended device statistics              ---- errors ---
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0   91.0    0.0   1170.4  0.0  0.1    0.0    1.3   0   2   0   0   0   0 c11t0d0
    0.0   84.0    0.0   1171.4  0.0  0.1    0.0    1.2   0   2   0   0   0   0 c11t1d0
    0.0   92.0    0.0   1172.4  0.0  0.1    0.0    1.2   0   2   0   0   0   0 c11t2d0
    0.0   84.0    0.0   1172.4  0.0  0.1    0.0    1.3   0   2   0   0   0   0 c11t3d0
    0.0   81.0    0.0   1176.4  0.0  0.1    0.0    1.4   0   2   0   0   0   0 c11t4d0
    0.0   86.0    0.0   1176.4  0.0  0.1    0.0    1.4   0   2   0   0   0   0 c11t5d0
    0.0   89.0    0.0   1175.4  0.0  0.1    0.0    1.4   0   2   0   0   0   0 c11t6d0
    0.0   84.0    0.0   1175.4  0.0  0.1    0.0    1.3   0   2   0   0   0   0 c11t7d0
    0.0   91.0    0.0   1168.9  0.0  0.1    0.0    1.3   0   2   0   0   0   0 c11t8d0
    0.0   89.0    0.0   1170.9  0.0  0.1    0.0    1.4   0   2   0   0   0   0 c11t9d0

So I performed the same test with a much larger file (533MB), larger than the NVRAM cache in front of the SSD, to see what it would do. Note that after the second one-second interval of activity the NVRAM is full and only accepts data at about the sequential write speed of the SSD (~70MB/s).

Friday, 25 September 2009  1:13:14 PM EST
                    extended device statistics              ---- errors ---
    r/s     w/s   kr/s      kw/s wait  actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0   640.9    0.0   81782.9  0.0   4.2    0.0    6.5   1  14   0   0   0   0 c7t2d0
    0.0  1065.7    0.0  136408.1  0.0  18.6    0.0   17.5   1  78   0   0   0   0 c7t2d0
    0.0   579.0    0.0   74113.3  0.0  30.7    0.0   53.1   1 100   0   0   0   0 c7t2d0
    0.0   588.7    0.0   75357.0  0.0  33.2    0.0   56.3   1 100   0   0   0   0 c7t2d0
    0.0   532.0    0.0   68096.3  0.0  31.5    0.0   59.1   1 100   0   0   0   0 c7t2d0
    0.0   559.0    0.0   71428.0  0.0  32.5    0.0   58.1   1 100   0   0   0   0 c7t2d0
    0.0   542.0    0.0   68755.9  0.0  25.1    0.0   46.4   1 100   0   0   0   0 c7t2d0
    0.0   542.0    0.0   69376.4  0.0  35.0    0.0   64.6   1 100   0   0   0   0 c7t2d0
    0.0   581.0    0.0   74368.0  0.0  30.6    0.0   52.6   1 100   0   0   0   0 c7t2d0
    0.0   567.0    0.0   72574.1  0.0  33.2    0.0   58.6   1 100   0   0   0   0 c7t2d0
    0.0   564.0    0.0   72194.1  0.0  31.1    0.0   55.2   1 100   0   0   0   0 c7t2d0
    0.0   573.0    0.0   73343.5  0.0  33.2    0.0   57.9   1 100   0   0   0   0 c7t2d0
    0.0   536.3    0.0   68640.5  0.0  33.1    0.0   61.7   1 100   0   0   0   0 c7t2d0
    0.0   121.9    0.0   15608.9  0.0   2.7    0.0   22.1   0  22   0   0   0   0 c7t2d0

Again, the slog wrote about double the file size (1022.6MB), and a few seconds later the data was pushed to primary storage (684.9MB, against an expectation of 666MB = 533MB/8*10), so again roughly the right amount hit the spinning platters.
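As a rough sanity check on those numbers, here is a quick back-of-the-envelope sketch in C (assuming a single 10-wide RAID-Z2 vdev with 8 data + 2 parity disks, and taking the summed kw/s figures from the iostat output above as the observed totals):

/*
 * Back-of-the-envelope check of the figures above.  Assumptions: a
 * single 10-wide RAID-Z2 vdev (8 data + 2 parity disks); "observed"
 * totals are the kw/s columns from the iostat samples summed up.
 */
#include <stdio.h>

static void
check(const char *name, double file_mb, double slog_mb, double pool_mb)
{
        /* RAID-Z2 parity expansion: 10 disks carry 8 disks' worth of data */
        double expect_pool_mb = file_mb / 8.0 * 10.0;

        printf("%s:\n", name);
        printf("  slog wrote %7.1f MB (%.1fx the file size)\n",
            slog_mb, slog_mb / file_mb);
        printf("  pool wrote %7.1f MB (expected ~%.1f MB for RAID-Z2)\n",
            pool_mb, expect_pool_mb);
}

int
main(void)
{
        /* 8.4MB copy: slog ~17154.5 KB, pool ~11730 KB summed over 10 disks */
        check("small file (8.4MB)", 8.4, 17154.5 / 1024.0, 11730.0 / 1024.0);

        /* 533MB copy: slog ~1022.6 MB, pool ~684.9 MB */
        check("large file (533MB)", 533.0, 1022.6, 684.9);

        return (0);
}

In both cases the pool writes come out close to the RAID-Z2 expectation, while the slog writes come out at roughly 1.9-2.0x the file size.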

Friday, 25 September 2009  1:13:43 PM EST
                    extended device statistics              ---- errors ---
    r/s    w/s   kr/s     kw/s wait  actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0  338.3    0.0  32794.4  0.0  13.7    0.0   40.6   1  47   0   0   0   0 c11t0d0
    0.0  325.3    0.0  31399.8  0.0  13.7    0.0   42.0   1  47   0   0   0   0 c11t1d0
    0.0  339.3    0.0  33273.3  0.0  13.7    0.0   40.3   1  47   0   0   0   0 c11t2d0
    0.0  332.3    0.0  32009.0  0.0  13.7    0.0   41.4   0  47   0   0   0   0 c11t3d0
    0.0  352.3    0.0  34364.0  0.0  13.7    0.0   39.0   1  47   0   0   0   0 c11t4d0
    0.0  355.2    0.0  33788.7  0.0  13.7    0.0   38.6   1  47   0   0   0   0 c11t5d0
    0.0  352.3    0.0  33452.3  0.0  13.8    0.0   39.3   1  47   0   0   0   0 c11t6d0
    0.0  339.3    0.0  32873.5  0.0  13.7    0.0   40.4   1  47   0   0   0   0 c11t7d0
    0.0  337.3    0.0  32889.0  0.0  13.5    0.0   40.0   1  47   0   0   0   0 c11t8d0
    0.0  336.3    0.0  32441.9  0.0  13.7    0.0   40.9   1  47   0   0   0   0 c11t9d0

Friday, 25 September 2009  1:13:44 PM EST
                    extended device statistics              ---- errors ---
    r/s    w/s   kr/s     kw/s wait  actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0  349.7    0.0  35677.0  0.0  16.1    0.0   45.9   0  48   0   0   0   0 c11t0d0
    0.0  367.7    0.0  37078.3  0.0  16.1    0.0   43.8   0  49   0   0   0   0 c11t1d0
    0.0  348.7    0.0  35197.1  0.0  16.3    0.0   46.9   0  49   0   0   0   0 c11t2d0
    0.0  360.7    0.0  36467.7  0.0  15.9    0.0   44.1   0  48   0   0   0   0 c11t3d0
    0.0  342.7    0.0  34103.9  0.0  16.2    0.0   47.2   0  48   0   0   0   0 c11t4d0
    0.0  347.7    0.0  34682.1  0.0  16.0    0.0   46.0   0  48   0   0   0   0 c11t5d0
    0.0  349.7    0.0  35018.3  0.0  16.3    0.0   46.7   0  49   0   0   0   0 c11t6d0
    0.0  353.7    0.0  35600.5  0.0  16.1    0.0   45.6   0  49   0   0   0   0 c11t7d0
    0.0  350.7    0.0  35580.5  0.0  16.2    0.0   46.1   0  49   0   0   0   0 c11t8d0
    0.0  356.7    0.0  36031.0  0.0  15.9    0.0   44.4   0  48   0   0   0   0 c11t9d0

Can anybody explain what is going on with the slog device - why all the data is being shunted via it, and why about double the data size is being written to it per transaction?

cheers,
James

