Re: [PATCH] btrfs: fix write_dev_supers
At 20:25 09/06/09, Chris Mason wrote:
> On Tue, Jun 09, 2009 at 10:46:55AM +0900, Hisashi Hifumi wrote:
>> Hi.
>>
>> I got the following BUG trace. This is a violation of the
>> BUG_ON(!buffer_locked(bh)) check in submit_bh().
>> In write_dev_supers(), if the wait parameter is set and the
>> buffer_uptodate() check fails, submit_bh() is executed and hits the
>> above BUG_ON. So I fixed this issue.
>
> Thanks for finding this bug and sending the patch. This function is
> very confusing. If the wait parameter is set, it isn't supposed to do
> any IO at all.
>
> The caller first does write_dev_supers with wait == 0, and that sends
> all the supers down on all the devices. Then it calls again with
> wait == 1, which is supposed to make sure all the supers actually got
> to disk.
>
> We should change the wait == 0 behavior to leave a reference held on
> all the buffers, and wait == 1 to drop that reference. That way the
> buffer won't disappear while we are waiting, and we can return an
> error if the buffer wasn't up to date when wait == 1.

Like this?
I changed the wait == 0 case to take an extra reference. In the
wait == 1 case, if the buffer is uptodate, the extra reference is
released; otherwise the buffer is locked so that it can proceed to
submit_bh().

Thanks.

Signed-off-by: Hisashi Hifumi <hifumi.hisa...@oss.ntt.co.jp>

diff -Nrup linux-2.6.30-rc8.org/fs/btrfs/disk-io.c linux-2.6.30-rc8.btrfs/fs/btrfs/disk-io.c
--- linux-2.6.30-rc8.org/fs/btrfs/disk-io.c	2009-06-04 16:26:25.0 +0900
+++ linux-2.6.30-rc8.btrfs/fs/btrfs/disk-io.c	2009-06-10 15:41:03.0 +0900
@@ -2044,8 +2044,10 @@ static int write_dev_supers(struct btrfs
 			wait_on_buffer(bh);
 			if (buffer_uptodate(bh)) {
 				brelse(bh);
+				brelse(bh);
 				continue;
-			}
+			} else
+				lock_buffer(bh);
 		} else {
 			btrfs_set_super_bytenr(sb, bytenr);
 
@@ -2062,6 +2064,7 @@ static int write_dev_supers(struct btrfs
 			set_buffer_uptodate(bh);
 
 			get_bh(bh);
+			get_bh(bh);
 			lock_buffer(bh);
 			bh->b_end_io = btrfs_end_buffer_write_sync;
 		}
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
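The two-pass scheme proposed in the quoted reply (take an extra reference when wait == 0, drop it when wait == 1) can be sketched as a small userspace model. This is only an illustration under stated assumptions: struct model_bh and the model_* functions are hypothetical stand-ins, not kernel code, and the reference taken by the buffer lookup in the real wait == 1 path is left out. The model follows the proposal of returning an error in the wait pass, whereas the patch above resubmits the buffer instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct buffer_head (illustration only). */
struct model_bh {
	int  refcount;	/* models b_count */
	bool uptodate;	/* models buffer_uptodate() */
	bool locked;	/* models buffer_locked() */
};

/* Pass 1 (wait == 0): mark uptodate, take two references, lock, submit. */
static void model_submit_pass(struct model_bh *bh)
{
	bh->uptodate = true;	/* set_buffer_uptodate() */
	bh->refcount++;		/* get_bh(): dropped by the completion handler */
	bh->refcount++;		/* get_bh(): extra reference kept for pass 2 */
	bh->locked = true;	/* lock_buffer() before submit_bh() */
}

/* I/O completion: record the result, unlock, drop the I/O reference. */
static void model_end_io(struct model_bh *bh, bool ok)
{
	assert(bh->refcount > 0);
	bh->uptodate = ok;
	bh->locked = false;	/* unlock_buffer() */
	bh->refcount--;		/* put_bh() */
}

/* Pass 2 (wait == 1): no new I/O; drop the extra reference, report errors. */
static int model_wait_pass(struct model_bh *bh)
{
	/* The real code would call wait_on_buffer() here. */
	int err = bh->uptodate ? 0 : -1;

	assert(bh->refcount > 0);
	bh->refcount--;		/* brelse() of the extra reference */
	return err;
}
```

Because the extra reference is held from the first pass until the second pass releases it, the buffer cannot be freed in between, which is the property the quoted reply asks for.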
Re: [PATCH] btrfs: fix write_dev_supers
At 20:25 09/06/09, Chris Mason wrote:
> On Tue, Jun 09, 2009 at 10:46:55AM +0900, Hisashi Hifumi wrote:
>> Hi.
>>
>> I got the following BUG trace. This is a violation of the
>> BUG_ON(!buffer_locked(bh)) check in submit_bh().
>> In write_dev_supers(), if the wait parameter is set and the
>> buffer_uptodate() check fails, submit_bh() is executed and hits the
>> above BUG_ON. So I fixed this issue.
>
> Thanks for finding this bug and sending the patch. This function is
> very confusing. If the wait parameter is set, it isn't supposed to do
> any IO at all.
>
> The caller first does write_dev_supers with wait == 0, and that sends
> all the supers down on all the devices. Then it calls again with
> wait == 1, which is supposed to make sure all the supers actually got
> to disk.
>
> We should change the wait == 0 behavior to leave a reference held on
> all the buffers, and wait == 1 to drop that reference. That way the
> buffer won't disappear while we are waiting, and we can return an
> error if the buffer wasn't up to date when wait == 1.
>
> Are you interested in fixing this?

Yes, I want to fix this.
[PATCH] btrfs: fix write_dev_supers
Hi.

I got the following BUG trace. This is a violation of the
BUG_ON(!buffer_locked(bh)) check in submit_bh().
In write_dev_supers(), if the wait parameter is set and the
buffer_uptodate() check fails, submit_bh() is executed and hits the
above BUG_ON. So I fixed this issue.

Thanks.

Jun 9 00:41:32 dl580 kernel: ------------[ cut here ]------------
Jun 9 00:41:32 dl580 kernel: kernel BUG at fs/buffer.c:2933!
Jun 9 00:41:32 dl580 kernel: invalid opcode: [#1] SMP
Jun 9 00:41:32 dl580 kernel: last sysfs file: /sys/devices/system/cpu/cpu7/cache/index1/shared_cpu_map
Jun 9 00:41:32 dl580 kernel: CPU 3
Jun 9 00:41:32 dl580 kernel: Modules linked in: btrfs zlib_deflate ext4 jbd2 crc16 sg qla2xxx scsi_transport_fc autofs4 i2c_dev i2c_core sunrpc ipv6 serio_raw tg3 libphy ata_piix libata shpchp rtc_cmos rtc_core rtc_lib cciss sd_mod scsi_mod ext3 jbd [last unloaded: scsi_transport_fc]
Jun 9 00:41:32 dl580 kernel: Pid: 5207, comm: umount Tainted: GW 2.6.30-rc6 #1 ProLiant DL580 G3
Jun 9 00:41:32 dl580 kernel: RIP: 0010:[802c458b] [802c458b] submit_bh+0x1a/0x105
Jun 9 00:41:32 dl580 kernel: RSP: 0018:8801f46e5bf8 EFLAGS: 00010246
Jun 9 00:41:32 dl580 kernel: RAX: 0028 RBX: 88018a7ea420 RCX: 0 000
Jun 9 00:41:32 dl580 kernel: RDX: 88018a7ea420 RSI: 88018a7ea420 RDI: 0 419
Jun 9 00:41:32 dl580 kernel: RBP: 8801f46e5c18 R08: 802c533d R09: 0 000
Jun 9 00:41:32 dl580 kernel: R10: 0001 R11: 0088 R12: 88021d448248
Jun 9 00:41:32 dl580 kernel: R13: 0419 R14: 8802191dacbb R15: 0 000
Jun 9 00:41:32 dl580 kernel: FS: 7fd64fef3760() GS:88002815() knlGS:00 00
Jun 9 00:41:32 dl580 kernel: CS: 0010 DS: ES: CR0: 8005003b
Jun 9 00:41:32 dl580 kernel: CR2: 0044ef40 CR3: 000104287000 CR4: 0 6e0
Jun 9 00:41:32 dl580 kernel: DR0: DR1: DR2: 0 000
Jun 9 00:41:32 dl580 kernel: DR3: DR6: 0ff0 DR7: 0 400
Jun 9 00:41:32 dl580 kernel: Process umount (pid: 5207, threadinfo 8801f46e4000, task ffff8801e1168000)
Jun 9 00:41:32 dl580 kernel: Stack:
Jun 9 00:41:32 dl580 kernel: 0003 88018a7ea420 88021d448248 00 03
Jun 9 00:41:32 dl580 kernel: 8801f46e5c68 a02d9979 000100 01
Jun 9 00:41:32 dl580 kernel: 0001 88021d448248 8802191dacbb
Jun 9 00:41:32 dl580 kernel: Call Trace:
Jun 9 00:41:33 dl580 kernel: [a02d9979] write_dev_supers+0x1eb/0x258 [btrfs]
Jun 9 00:41:33 dl580 kernel: [a02d9b6d] write_all_supers+0x187/0x1c8 [btrfs]
Jun 9 00:41:33 dl580 kernel: [a02d9bbc] write_ctree_super+0xe/0x10 [btrfs]
Jun 9 00:41:33 dl580 kernel: [a02de39f] btrfs_commit_transaction+0x6bb/0x841 [btrfs]
Jun 9 00:41:33 dl580 kernel: [80246914] ? autoremove_wake_function+0x0/0x38
Jun 9 00:41:33 dl580 kernel: [a02c14ed] btrfs_sync_fs+0x67/0x72 [btrfs]
Jun 9 00:41:33 dl580 kernel: [802e6e3a] quota_sync_sb+0x42/0xf3
Jun 9 00:41:33 dl580 kernel: [802e6f14] sync_dquots+0x29/0x138
Jun 9 00:41:33 dl580 kernel: [802a8c29] __fsync_super+0x1e/0x7b
Jun 9 00:41:33 dl580 kernel: [802a8c97] fsync_super+0x11/0x22
Jun 9 00:41:33 dl580 kernel: [802a8ea9] generic_shutdown_super+0x26/0xe2
Jun 9 00:41:33 dl580 kernel: [802a8fb6] kill_anon_super+0x17/0x3b
Jun 9 00:41:33 dl580 kernel: [802a92e8] deactivate_super+0x62/0x77
Jun 9 00:41:33 dl580 kernel: [802bb7ae] mntput_no_expire+0xec/0x12c
Jun 9 00:41:33 dl580 kernel: [802bbcff] sys_umount+0x2c5/0x31c
Jun 9 00:41:33 dl580 kernel: [8020aeeb] system_call_fastpath+0x16/0x
Jun 9 00:41:33 dl580 kernel: Code: e0 eb ec 44 89 e8 48 83 c4 18 5b 41 5c 41 5d 5d c3 55 48 89 e5 41 55 41 54 53 48 83 ec 08 41 89 fd 48 89 f3 48 8b 06 a8 04 75 04 0f 0b eb fe a8 20 75 04 0f 0b eb fe 48 83 7e 38 00 75 04 0f 0b
Jun 9 00:41:33 dl580 kernel: RIP [802c458b] submit_bh+0x1a/0x105
Jun 9 00:41:33 dl580 kernel: RSP 8801f46e5bf8
Jun 9 00:41:33 dl580 kernel: ---[ end trace 4eaa2a86a8e2da24 ]---

Signed-off-by: Hisashi Hifumi <hifumi.hisa...@oss.ntt.co.jp>

--- linux-2.6.30-rc8.org/fs/btrfs/disk-io.c	2009-06-04 16:26:25.0 +0900
+++ linux-2.6.30-rc8.btrfs/fs/btrfs/disk-io.c	2009-06-08 18:42:46.0 +0900
@@ -2045,6 +2045,9 @@ static int write_dev_supers(struct btrfs
 			if (buffer_uptodate(bh)) {
 				brelse(bh);
 				continue;
+			} else {
+				get_bh(bh);
+				lock_buffer(bh);
 			}
 		} else
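The failure reported above comes down to submit_bh() being handed a buffer that was never locked. A minimal userspace sketch of that contract follows, assuming a hypothetical struct toy_bh and a toy_submit_bh() that returns -1 where the kernel check would trigger BUG. None of these names exist in the kernel; the two wait_branch_* functions only model the wait == 1 branch before and after this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct buffer_head (illustration only). */
struct toy_bh {
	bool locked;
	bool uptodate;
	int  refcount;
};

/* Models BUG_ON(!buffer_locked(bh)): -1 marks the spot where the kernel BUGs. */
static int toy_submit_bh(const struct toy_bh *bh)
{
	return bh->locked ? 0 : -1;
}

/* wait == 1 branch before the patch: a stale buffer reaches submit unlocked. */
static int wait_branch_unpatched(struct toy_bh *bh)
{
	if (bh->uptodate) {
		assert(bh->refcount > 0);
		bh->refcount--;		/* brelse() */
		return 0;
	}
	return toy_submit_bh(bh);	/* bh->locked is still false here */
}

/* wait == 1 branch with the patch: reference taken and buffer locked first. */
static int wait_branch_patched(struct toy_bh *bh)
{
	if (bh->uptodate) {
		assert(bh->refcount > 0);
		bh->refcount--;		/* brelse() */
		return 0;
	}
	bh->refcount++;			/* get_bh() */
	bh->locked = true;		/* lock_buffer() */
	return toy_submit_bh(bh);
}
```

With a buffer that is neither locked nor uptodate, the unpatched branch returns -1 (the BUG case in this model), while the patched branch locks the buffer first and submits cleanly.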
Re: [RFC] [PATCH] Btrfs: improve fsync/osync write performance
At 20:27 09/03/31, Chris Mason wrote:
> On Tue, 2009-03-31 at 14:18 +0900, Hisashi Hifumi wrote:
>> Hi Chris.
>>
>> I noticed that the performance of fsync() and write() with the O_SYNC
>> flag on Btrfs is very slow compared to ext3/4. I used blktrace to
>> investigate the cause. One cause is that the unplug is done by
>> kblockd even if the I/O is issued through fsync() or write() with the
>> O_SYNC flag. kblockd's unplug timeout is 3msec, so unplugging via
>> kblockd can slow down I/O response. To increase fsync/osync write
>> performance, the unplug should be sped up here.
>>
>> Btrfs's write I/O is issued via a kernel thread, not via the user
>> application context that calls fsync(). While waiting for page
>> writeback, wait_on_page_writeback() sometimes cannot unplug I/O on
>> Btrfs because submit_bio is not called from the user application
>> context, so when submit_bio is called from a kernel thread,
>> wait_on_page_writeback() sleeps in io_schedule().
>
> This is exactly right, and one of the uglier side effects of the async
> helper kernel threads. I've been thinking for a while about a clean
> way to fix it.
>
>> I introduced btrfs_wait_on_page_writeback() in the following patch;
>> this is a replacement for wait_on_page_writeback() for Btrfs. It does
>> an unplug every 1 tick while waiting for page writeback.
>>
>> I did a performance test using sysbench.
>>
>> # sysbench --num-threads=4 --max-requests=1 --test=fileio --file-num=1 --file-block-size=4K --file-total-size=128M --file-test-mode=rndwr --file-fsync-freq=5 run
>>
>> The result was:
>> -2.6.29
>>
>> Test execution summary:
>>     total time:                          628.1047s
>>     total number of events:              1
>>     total time taken by event execution: 413.0834
>>     per-request statistics:
>>          min:                            0.s
>>          avg:                            0.0413s
>>          max:                            1.9075s
>>          approx. 95 percentile:          0.3712s
>>
>> Threads fairness:
>>     events (avg/stddev):           2500./29.21
>>     execution time (avg/stddev):   103.2708/4.04
>>
>> -2.6.29-patched
>>
>> Test execution summary:
>>     total time:                          579.8049s
>>     total number of events:              10004
>>     total time taken by event execution: 355.3098
>>     per-request statistics:
>>          min:                            0.s
>>          avg:                            0.0355s
>>          max:                            1.7670s
>>          approx. 95 percentile:          0.3154s
>>
>> Threads fairness:
>>     events (avg/stddev):           2501./8.03
>>     execution time (avg/stddev):   88.8274/1.94
>>
>> This patch gives some performance improvement. I think there are
>> other reasons, which should also be fixed, why fsync() or write()
>> with the O_SYNC flag is slow on Btrfs.
>
> Very nice. Could I trouble you to try one more experiment? The other
> way to fix this is to use WRITE_SYNC instead of WRITE. Could you
> please hardcode WRITE_SYNC in the btrfs submit_bio paths and benchmark
> that?
>
> It doesn't cover as many cases as your patch, but it might have a
> lower overall impact.

Hi.
I wrote a patch that hardcodes WRITE_SYNC in the btrfs submit_bio paths,
as shown below, and I ran the sysbench test.
Later, I will try your unplug patch.

diff -Nrup linux-2.6.29.org/fs/btrfs/disk-io.c linux-2.6.29.btrfs_sync/fs/btrfs/disk-io.c
--- linux-2.6.29.org/fs/btrfs/disk-io.c	2009-03-24 08:12:14.0 +0900
+++ linux-2.6.29.btrfs_sync/fs/btrfs/disk-io.c	2009-04-01 16:26:56.0 +0900
@@ -2068,7 +2068,7 @@ static int write_dev_supers(struct btrfs
 		}
 
 		if (i == last_barrier && do_barriers && device->barriers) {
-			ret = submit_bh(WRITE_BARRIER, bh);
+			ret = submit_bh(WRITE_BARRIER|WRITE_SYNC, bh);
 			if (ret == -EOPNOTSUPP) {
 				printk("btrfs: disabling barriers on dev %s\n",
 				       device->name);
@@ -2076,10 +2076,10 @@ static int write_dev_supers(struct btrfs
 				device->barriers = 0;
 				get_bh(bh);
 				lock_buffer(bh);
-				ret = submit_bh(WRITE, bh);
+				ret = submit_bh(WRITE_SYNC, bh);
 			}
 		} else {
-			ret = submit_bh(WRITE, bh);
+			ret = submit_bh(WRITE_SYNC, bh);
 		}
 
 		if (!ret && wait) {
diff -Nrup linux-2.6.29.org/fs/btrfs/extent_io.c linux-2.6.29.btrfs_sync/fs/btrfs/extent_io.c
--- linux-2.6.29.org/fs/btrfs/extent_io.c	2009-03-24 08:12:14.0 +0900
+++ linux-2.6.29.btrfs_sync/fs/btrfs/extent_io.c	2009-04-01 14:48:08.0 +0900
@@ -1851,8 +1851,11 @@ static int submit_one_bio(int rw, struct
 	if (tree->ops && tree->ops->submit_bio_hook)
 		tree->ops->submit_bio_hook(page->mapping->host, rw, bio
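To put the quoted sysbench numbers for 2.6.29 and 2.6.29-patched side by side, the relative change can be computed directly. The sketch below is only arithmetic on the figures quoted in this message; the struct metric type and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* One row of the comparison; values are the seconds reported by sysbench. */
struct metric {
	const char *name;
	double before;	/* 2.6.29 */
	double after;	/* 2.6.29-patched */
};

static const struct metric reported[] = {
	{ "total time",            628.1047, 579.8049 },
	{ "avg per request",         0.0413,   0.0355 },
	{ "max per request",         1.9075,   1.7670 },
	{ "approx. 95 percentile",   0.3712,   0.3154 },
};

/* Relative reduction in percent; positive means the patched run was faster. */
static double percent_reduction(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}

/* Print one line per metric with its relative reduction. */
static void print_report(void)
{
	size_t n = sizeof(reported) / sizeof(reported[0]);

	for (size_t i = 0; i < n; i++)
		printf("%-22s %9.4f -> %9.4f  (%.1f%% lower)\n",
		       reported[i].name, reported[i].before, reported[i].after,
		       percent_reduction(reported[i].before, reported[i].after));
}
```

Calling print_report() from a main() shows a total-time reduction of a little under 8%, and reductions of roughly 14% and 15% in the average and 95th-percentile per-request times.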
Re: [PATCH] btrfs: call mark_inode_dirty when i_size is updated
At 23:12 09/02/02, Chris Mason wrote:
> On Mon, 2009-02-02 at 20:00 +0900, Hisashi Hifumi wrote:
>> Hi Chris.
>>
>> I think mark_inode_dirty() needs to be called when the file size
>> expands, in order to flush metadata updates to HDD through the sync()
>> syscall or background_writeout().
>
> Thanks for reading through this code and sending the patch. I find
> the I_DIRTY flags one of the more confusing parts of the generic fs
> writeback code.
>
> But, I think what happens is the btrfs_set_page_dirty function calls
> __set_page_dirty_nobuffers() which does:
>
>	if (mapping->host) {
>		/* !PageAnon && !swapper_space */
>		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>	}
>
> This should be enough to make sure the btrfs inodes are processed by
> background writeout and sync(). Please let me know if I'm misreading
> things.

Surely, as you pointed out, btrfs_set_page_dirty calls

	if (mapping->host) {
		/* !PageAnon && !swapper_space */
		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
	}

through __set_page_dirty_nobuffers(). But I_DIRTY_PAGES is not
sufficient. To flush metadata updates to HDD through sync(), the
I_DIRTY_SYNC or I_DIRTY_DATASYNC flag is needed (see
__sync_single_inode).

Thanks.
Re: [PATCH] btrfs: call mark_inode_dirty when i_size is updated
At 10:04 09/02/03, Chris Mason wrote:
> On Tue, 2009-02-03 at 09:36 +0900, Hisashi Hifumi wrote:
>> At 23:12 09/02/02, Chris Mason wrote:
>>> On Mon, 2009-02-02 at 20:00 +0900, Hisashi Hifumi wrote:
>>>> Hi Chris.
>>>>
>>>> I think mark_inode_dirty() needs to be called when the file size
>>>> expands, in order to flush metadata updates to HDD through the
>>>> sync() syscall or background_writeout().
>>>
>>> Thanks for reading through this code and sending the patch. I find
>>> the I_DIRTY flags one of the more confusing parts of the generic fs
>>> writeback code.
>>>
>>> But, I think what happens is the btrfs_set_page_dirty function calls
>>> __set_page_dirty_nobuffers() which does:
>>>
>>>	if (mapping->host) {
>>>		/* !PageAnon && !swapper_space */
>>>		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>>>	}
>>>
>>> This should be enough to make sure the btrfs inodes are processed by
>>> background writeout and sync(). Please let me know if I'm
>>> misreading things.
>>
>> Surely, as you pointed out, btrfs_set_page_dirty calls
>>
>>	if (mapping->host) {
>>		/* !PageAnon && !swapper_space */
>>		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>>	}
>>
>> through __set_page_dirty_nobuffers(). But I_DIRTY_PAGES is not
>> sufficient. To flush metadata updates to HDD through sync(), the
>> I_DIRTY_SYNC or I_DIRTY_DATASYNC flag is needed (see
>> __sync_single_inode).
>
> Since btrfs uses a dirty_inode callback, our inodes are never really
> dirty. The btree metadata always has the same information as the
> in-core inode does. The extra transaction commit steps taken at sync
> time are enough to get all the relevant metadata on disk.
>
> So, I think what happens is that I_DIRTY_PAGES is enough to get the
> data pages on disk and the transaction commit gets the metadata on
> disk.

The metadata update transaction is made through the dirty_inode
callback, but to run the dirty_inode callback the I_DIRTY_SYNC or
I_DIRTY_DATASYNC flag is needed (see __mark_inode_dirty).
Also, to commit the transaction to disk through the write_inode
callback, the I_DIRTY_SYNC or I_DIRTY_DATASYNC flag is needed (see
__sync_single_inode).
So I think my patch fixes this issue.
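The flag argument made in this reply can be illustrated with a small userspace model. It is a sketch under assumptions, not the kernel implementation: struct toy_inode and the toy_* functions are hypothetical, the counters stand in for the dirty_inode and write_inode callbacks, and the flag values are meant to mirror the era's include/linux/fs.h but should be treated as illustrative.

```c
#include <assert.h>

/* Flag values intended to mirror include/linux/fs.h of that era (illustrative). */
#define I_DIRTY_SYNC		1
#define I_DIRTY_DATASYNC	2
#define I_DIRTY_PAGES		4
#define I_DIRTY	(I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)

/* Hypothetical stand-in for struct inode; counters replace the callbacks. */
struct toy_inode {
	unsigned int state;
	int dirty_inode_calls;	/* times the dirty_inode callback would run */
	int write_inode_calls;	/* times the write_inode callback would run */
};

/* Models the check described for __mark_inode_dirty(). */
static void toy_mark_inode_dirty(struct toy_inode *inode, unsigned int flags)
{
	assert(inode);
	if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC))
		inode->dirty_inode_calls++;
	inode->state |= flags;
}

/* Models the check described for __sync_single_inode(). */
static void toy_sync_single_inode(struct toy_inode *inode)
{
	unsigned int dirty;

	assert(inode);
	dirty = inode->state & I_DIRTY;
	inode->state &= ~I_DIRTY;
	/* Data page writeback would happen here in the real function. */
	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC))
		inode->write_inode_calls++;
}
```

In this model an inode marked only with I_DIRTY_PAGES never reaches either callback, whereas marking it with I_DIRTY (which is what a plain mark_inode_dirty() call passes) triggers both, matching the argument made in the reply.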