Re: elevator: Fix a race about elevator switching.
On 02/21/2013 04:42 PM, majianpeng wrote:
> There's a race between elevator switching and normal I/O operation,
> because struct elevator_queue and struct elevator_data are not allocated
> in one atomic operation. So there is a chance of using a NULL
> ->elevator_data.
> For example:
>     Thread A:                Thread B
>     blk_flush_plug_list      elevator_switch
>     __elv_add_request
>     blk_peek_request         elevator_alloc
>     noop_dispatch            elevator_init_fn
>
> Because elevator_alloc is called, queue_lock can't be held, and
> ->elevator_data is still NULL right after the allocation. If thread A
> calls elv_merge at that moment and needs some info from elevator_data,
> the crash happens.
>
> [ 196.125709] BUG: unable to handle kernel
> [ 196.126046] Modules linked in: netconsole configfs nfsd lockd auth_rpcgss sunrpc exportfs raid1 btrfs zlib_deflate libcrc32c
> [ 196.126046] CPU 2
> [ 196.126046] Pid: 6747, comm: dd Not tainted 3.8.0+ #107 To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M.
> [ 196.126046] RIP: 0010:[812a8c63] [812a8c63] noop_dispatch+0x13/0x40
> [ 196.126046] RSP: 0018:8800a6fef838 EFLAGS: 00010046
> [ 196.126046] RAX: RBX: 8800b53abf20 RCX:
> [ 196.126046] RDX: RSI: RDI: 8800b53abf20
> [ 196.126046] RBP: 8800a6fef838 R08: R09:
> [ 196.126046] R10: 0001 R11: 0001 R12: 8800b53abf20
> [ 196.126046] R13: 8800a6feffd8 R14: 8800a6feffd8 R15: 8800b3c68090
> [ 196.126046] FS: 7f6ef48d6700() GS:8800ba00() knlGS:
> [ 196.126046] CS: 0010 DS: ES: CR0: 8005003b
> [ 196.126046] CR2: CR3: a83a1000 CR4: 000407e0
> [ 196.126046] DR0: DR1: DR2:
> [ 196.126046] DR3: DR6: 0ff0 DR7: 0400
> [ 196.126046] Process dd (pid: 6747, threadinfo 8800a6fee000, task 88009e2a2840)
> [ 196.126046] Stack:
> [ 196.126046] 8800a6fef888 81297a54 8800b50de7b0 8800b3c68090
> [ 196.126046] 8800a6fef888 8800b3c68000 8800b53abf20
> [ 196.126046] 8800b50de7b0 8800b3c68090 8800a6fef8f8 8140fe5a
> [ 196.126046] Call Trace:
> [ 196.126046] [81297a54] blk_peek_request+0x194/0x250
> [ 196.126046] [8140fe5a] scsi_request_fn+0x4a/0x4f0
> [ 196.126046] [810995ef] ? __lock_is_held+0x5f/0x80
> [ 196.126046] [81290d37] __blk_run_queue+0x37/0x50
> [ 196.126046] [812904cd] __elv_add_request+0xad/0x2d0
> [ 196.126046] [81297ecc] blk_flush_plug_list+0x1bc/0x260
> [ 196.126046] [81297f88] blk_finish_plug+0x18/0x50
> [ 196.126046] [81195f1e] do_blockdev_direct_IO+0x18be/0x20e0
> [ 196.126046] [81077cd8] ? sched_clock_cpu+0xa8/0x120
> [ 196.126046] [811913d0] ? I_BDEV+0x10/0x10
> [ 196.126046] [810d0868] ? rcu_irq_exit+0x68/0xb0
> [ 196.126046] [81196795] __blockdev_direct_IO+0x55/0x60
> [ 196.126046] [811913d0] ? I_BDEV+0x10/0x10
> [ 196.126046] [81191c67] blkdev_direct_IO+0x57/0x60
> [ 196.126046] [811913d0] ? I_BDEV+0x10/0x10
> [ 196.126046] [81106633] generic_file_aio_read+0x703/0x770
> [ 196.126046] [811915c1] blkdev_aio_read+0x51/0x80
> [ 196.126046] [81095d45] ? lock_release_holdtime.part.23+0x15/0x1a0
> [ 196.126046] [81157bb3] do_sync_read+0xa3/0xe0
> [ 196.126046] [81158343] vfs_read+0xb3/0x180
> [ 196.126046] [81158465] sys_read+0x55/0xa0
> [ 196.126046] [816f4242] system_call_fastpath+0x16/0x1b
> [ 196.126046] Code: 48 83 c4 08 5b 5d c3 90 b8 10 00 00 00 eb e0 b8 f4 ff ff ff eb ea 66 90 66 66 66 66 90 48 8b 47 18 55 48 89 e5 48 8b 50 08 31 c0 48 8b 32 48 39 f2 74 1f 48 8b 46 08 48 8b 16 48 89 42 08 48 89
> [ 196.126046] RIP [812a8c63] noop_dispatch+0x13/0x40
> [ 196.126046] RSP 8800a6fef838
> [ 196.126046] CR2:
>
> Move elevator_alloc into elevator_init_fn so that both allocations
> happen in one atomic operation.
>
> The bug can be reproduced easily with:
> 1: dd if=/dev/sdb of=/dev/null
> 2: while true; do echo noop > scheduler; echo deadline > scheduler; done
>
> The fix was tested the same way.
>
> Signed-off-by: Jianpeng Ma majianp...@gmail.com

Reviewed-by: Gu Zheng guz.f...@cn.fujitsu.com
Tested-by: Gu Zheng guz.f...@cn.fujitsu.com

Thanks,
Gu

> ---
>  block/cfq-iosched.c | 17 ++---
>  block
[PATCH]fs/block_dev.c: fix the inaccurate judgement in function blkdev_aio_read
In blkdev_aio_read(), the check on 'size' is inaccurate: if size is equal to
or greater than the count we requested (iocb->ki_left), there is no need to
call iov_shorten() to reduce the number of segments and the iovec's length.
So the check should be changed to 'if (size < iocb->ki_left)' instead.

Signed-off-by: Jianpeng Ma majianp...@gmail.com
Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com
---
 fs/block_dev.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index aae187a..f0328f1 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1559,7 +1559,7 @@ static ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,
 		return 0;

 	size -= pos;
-	if (size < INT_MAX)
+	if (size < iocb->ki_left)
 		nr_segs = iov_shorten((struct iovec *)iov, nr_segs, size);
 	return generic_file_aio_read(iocb, iov, nr_segs, pos);
 }
-- 
1.7.7
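
To make the intent of the changed check concrete, here is a minimal userspace sketch. It assumes the kernel's iov_shorten() behaves like the loop below (trim the vector so it describes at most a given number of bytes); shorten_iov(), iov_total() and clamp_to_remaining() are hypothetical names used only for illustration, and "remaining" stands in for the bytes left on the device after pos.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Userspace model of iov_shorten(): trim the vector so that it describes
 * at most `to` bytes, and return the number of segments still in use. */
static unsigned long shorten_iov(struct iovec *iov, unsigned long nr_segs,
				 size_t to)
{
	unsigned long seg = 0;
	size_t len = 0;

	assert(iov != NULL);
	while (seg < nr_segs) {
		seg++;
		if (len + iov->iov_len >= to) {
			iov->iov_len = to - len;	/* cut the last segment */
			break;
		}
		len += iov->iov_len;
		iov++;
	}
	return seg;
}

/* Total number of bytes requested by the vector (the role of ki_left). */
static size_t iov_total(const struct iovec *iov, unsigned long nr_segs)
{
	size_t total = 0;
	unsigned long i;

	for (i = 0; i < nr_segs; i++)
		total += iov[i].iov_len;
	return total;
}

/* Hypothetical helper mirroring the patched check: shorten only when the
 * request is larger than what is left on the device. */
static unsigned long clamp_to_remaining(struct iovec *iov,
					unsigned long nr_segs,
					size_t remaining)
{
	if (remaining < iov_total(iov, nr_segs))
		nr_segs = shorten_iov(iov, nr_segs, remaining);
	return nr_segs;
}
```

With two 4096-byte segments and 5000 bytes remaining, the second segment is cut to 904 bytes; with 100000 bytes remaining the vector is left untouched, which is the case the old 'size < INT_MAX' check handled needlessly.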
Re: [PATCH 2/3] resource: Add release_mem_region_adjustable()
On 04/03/2013 12:17 AM, Toshi Kani wrote:
> Added release_mem_region_adjustable(), which releases a requested
> region from a currently busy memory resource. This interface
> adjusts the matched memory resource accordingly if the requested
> region does not match exactly but still fits into it.
>
> This new interface is intended for memory hot-delete. During
> bootup, memory resources are inserted from the boot descriptor
> table, such as the EFI Memory Table and e820. Each memory resource
> entry usually covers the whole contiguous memory range. A memory
> hot-delete request, on the other hand, may target a particular
> range of a memory resource, and its size can be much smaller than
> the whole contiguous memory. Since the existing release interfaces
> like __release_region() require a requested region to match a
> resource entry exactly, they do not allow a partial resource to be
> released.
>
> There is no change to the existing interfaces since their restriction
> is valid for I/O resources.
>
> Signed-off-by: Toshi Kani toshi.k...@hp.com
> ---
>  include/linux/ioport.h |  2 +
>  kernel/resource.c      | 87
>  2 files changed, 89 insertions(+)
>
> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> index 85ac9b9b..0fe1a82 100644
> --- a/include/linux/ioport.h
> +++ b/include/linux/ioport.h
> @@ -192,6 +192,8 @@ extern struct resource * __request_region(struct resource *,
>  extern int __check_region(struct resource *, resource_size_t, resource_size_t);
>  extern void __release_region(struct resource *, resource_size_t, resource_size_t);
> +extern int release_mem_region_adjustable(struct resource *, resource_size_t,
> +				resource_size_t);
>
>  static inline int __deprecated check_region(resource_size_t s,
>  						resource_size_t n)
> diff --git a/kernel/resource.c b/kernel/resource.c
> index ae246f9..789f160 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -1021,6 +1021,93 @@ void __release_region(struct resource *parent, resource_size_t start,
>  }
>  EXPORT_SYMBOL(__release_region);
>
> +/**
> + * release_mem_region_adjustable - release a previously reserved memory region
> + * @parent: parent resource descriptor
> + * @start: resource start address
> + * @size: resource region size
> + *
> + * The requested region is released from a currently busy memory resource.
> + * It adjusts the matched busy memory resource accordingly if the requested
> + * region does not match exactly but still fits into. Existing children of
> + * the busy memory resource must be immutable in this request.
> + *
> + * Note, when the busy memory resource gets split into two entries, the code
> + * assumes that all children remain in the lower address entry for simplicity.
> + * Enhance this logic when necessary.
> + */
> +int release_mem_region_adjustable(struct resource *parent,
> +			resource_size_t start, resource_size_t size)
> +{
> +	struct resource **p;
> +	struct resource *res, *new;
> +	resource_size_t end;
> +	int ret = 0;
> +
> +	p = &parent->child;
> +	end = start + size - 1;
> +
> +	write_lock(&resource_lock);
> +
> +	while ((res = *p)) {
> +		if (res->start > start || res->end < end) {
> +			p = &res->sibling;
> +			continue;
> +		}
> +
> +		if (!(res->flags & IORESOURCE_MEM)) {
> +			ret = -EINVAL;
> +			break;
> +		}
> +
> +		if (!(res->flags & IORESOURCE_BUSY)) {
> +			p = &res->child;
> +			continue;
> +		}
> +
> +		if (res->start == start && res->end == end) {
> +			/* free the whole entry */
> +			*p = res->sibling;
> +			kfree(res);
> +		} else if (res->start == start && res->end != end) {
> +			/* adjust the start */
> +			ret = __adjust_resource(res, end+1,
> +						res->end - end);
> +		} else if (res->start != start && res->end == end) {
> +			/* adjust the end */
> +			ret = __adjust_resource(res, res->start,
> +						start - res->start);
> +		} else {
> +			/* split into two entries */
> +			new = kzalloc(sizeof(struct resource), GFP_KERNEL);
> +			if (!new) {
> +				ret = -ENOMEM;
> +				break;
> +			}
> +			new->name = res->name;
> +			new->start = end + 1;
> +			new->end = res->end;
> +			new->flags = res->flags;
> +			new->parent = res->parent;
> +			new->sibling = res->sibling;
> +			new->child = NULL;
> +
> +			ret =
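
The four outcomes handled by the quoted function (free the whole entry, adjust the start, adjust the end, split in two) can be sketched on a plain inclusive interval. This is only a userspace model under simplifying assumptions: no resource tree, no locking and no allocation; struct range, enum release_action and release_range() are hypothetical names, not kernel API.

```c
#include <assert.h>
#include <stdint.h>

/* Which of the cases from the patch description applied. */
enum release_action {
	REL_NOFIT,		/* requested region does not fit into res */
	REL_FREE_WHOLE,		/* exact match: the entry goes away */
	REL_ADJUST_START,	/* released the low part */
	REL_ADJUST_END,		/* released the high part */
	REL_SPLIT		/* released the middle: two entries remain */
};

struct range {
	uint64_t start;
	uint64_t end;		/* inclusive, like struct resource */
};

/* Release [start, start + size - 1] from *res. On REL_SPLIT, *res keeps the
 * lower part and *upper receives the part above the released region. */
static enum release_action release_range(struct range *res,
					 struct range *upper,
					 uint64_t start, uint64_t size)
{
	uint64_t end;

	assert(size > 0);
	end = start + size - 1;

	if (res->start > start || res->end < end)
		return REL_NOFIT;

	if (res->start == start && res->end == end)
		return REL_FREE_WHOLE;

	if (res->start == start) {
		res->start = end + 1;
		return REL_ADJUST_START;
	}

	if (res->end == end) {
		res->end = start - 1;
		return REL_ADJUST_END;
	}

	upper->start = end + 1;
	upper->end = res->end;
	res->end = start - 1;
	return REL_SPLIT;
}
```

For example, releasing 0x1000 bytes at 0x2000 from [0x1000, 0x4fff] leaves [0x1000, 0x1fff] and a new upper entry [0x3000, 0x4fff], which corresponds to the "split into two entries" branch.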
pci-sysfs: queue sysfs rescan routine into workqueue to avoid potential deadlock situation
] pci_stop_bus_device+0x94/0xa0
[8127ad90] pci_stop_bus_device+0x40/0xa0
[8127ad90] pci_stop_bus_device+0x40/0xa0
[8127ad90] pci_stop_bus_device+0x40/0xa0
[8127af66] pci_stop_and_remove_bus_device+0x16/0x30
[81282359] remove_callback+0x29/0x40
[811e4344] sysfs_schedule_callback_work+0x24/0x70
[81070009] process_one_work+0x179/0x4b0
[8107210e] worker_thread+0x12e/0x330
[81071fe0] ? manage_workers+0x110/0x110
[8107705e] kthread+0x9e/0xb0
[81525bc4] kernel_thread_helper+0x4/0x10
[81076fc0] ? kthread_freezable_should_stop+0x70/0x70
[81525bc0] ? gs_change+0x13/0x13

Signed-off-by: Yinghai Lu ying...@kernel.org
Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com
Signed-off-by: Lin Feng linf...@cn.fujitsu.com
---
 drivers/pci/pci-sysfs.c | 92 +--
 1 files changed, 65 insertions(+), 27 deletions(-)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 9c6e9bb..e66b498 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -285,21 +285,34 @@ msi_bus_store(struct device *dev, struct device_attribute *attr,
 }

 static DEFINE_MUTEX(pci_remove_rescan_mutex);
+
+static void bus_rescan_callback(struct device *dev)
+{
+	struct pci_bus *b = NULL;
+
+	mutex_lock(&pci_remove_rescan_mutex);
+	while ((b = pci_find_next_bus(b)) != NULL)
+		pci_rescan_bus(b);
+	mutex_unlock(&pci_remove_rescan_mutex);
+}
+
 static ssize_t bus_rescan_store(struct bus_type *bus, const char *buf,
 				size_t count)
 {
+	int err;
 	unsigned long val;
-	struct pci_bus *b = NULL;
+	struct device *dev = bus->dev_root;

 	if (strict_strtoul(buf, 0, &val) < 0)
 		return -EINVAL;

-	if (val) {
-		mutex_lock(&pci_remove_rescan_mutex);
-		while ((b = pci_find_next_bus(b)) != NULL)
-			pci_rescan_bus(b);
-		mutex_unlock(&pci_remove_rescan_mutex);
-	}
+	if (!val)
+		return count;
+
+	err = device_schedule_callback(dev, bus_rescan_callback);
+	if (err)
+		return err;
+
 	return count;
 }

@@ -308,21 +321,32 @@ struct bus_attribute pci_bus_attrs[] = {
 	__ATTR_NULL
 };

+static void dev_rescan_callback(struct device *dev)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	if (pdev->is_added) {
+		mutex_lock(&pci_remove_rescan_mutex);
+		pci_rescan_bus(pdev->bus);
+		mutex_unlock(&pci_remove_rescan_mutex);
+	}
+}
+
 static ssize_t
 dev_rescan_store(struct device *dev, struct device_attribute *attr,
 		 const char *buf, size_t count)
 {
+	int err;
 	unsigned long val;
-	struct pci_dev *pdev = to_pci_dev(dev);

 	if (strict_strtoul(buf, 0, &val) < 0)
 		return -EINVAL;

-	if (val) {
-		mutex_lock(&pci_remove_rescan_mutex);
-		pci_rescan_bus(pdev->bus);
-		mutex_unlock(&pci_remove_rescan_mutex);
-	}
+	if (!val)
+		return count;
+	err = device_schedule_callback(dev, dev_rescan_callback);
+	if (err)
+		return err;
 	return count;
 }

@@ -339,7 +363,7 @@ static ssize_t
 remove_store(struct device *dev, struct device_attribute *dummy,
 	     const char *buf, size_t count)
 {
-	int ret = 0;
+	int err;
 	unsigned long val;

 	if (strict_strtoul(buf, 0, &val) < 0)
@@ -348,31 +372,45 @@ remove_store(struct device *dev, struct device_attribute *dummy,
 	/* An attribute cannot be unregistered by one of its own methods,
 	 * so we have to use this roundabout approach.
 	 */
-	if (val)
-		ret = device_schedule_callback(dev, remove_callback);
-	if (ret)
-		count = ret;
+	if (!val)
+		return count;
+
+	err = device_schedule_callback(dev, remove_callback);
+	if (err)
+		return err;
+
 	return count;
 }

+static void dev_bus_rescan_callback(struct device *dev)
+{
+	struct pci_bus *bus = to_pci_bus(dev);
+
+	mutex_lock(&pci_remove_rescan_mutex);
+	if (!pci_is_root_bus(bus) && list_empty(&bus->devices))
+		pci_rescan_bus_bridge_resize(bus->self);
+	else
+		pci_rescan_bus(bus);
+	mutex_unlock(&pci_remove_rescan_mutex);
+}
+
 static ssize_t
 dev_bus_rescan_store(struct device *dev, struct device_attribute *attr,
 		     const char *buf, size_t count)
 {
+	int err;
 	unsigned long val;
-	struct pci_bus *bus = to_pci_bus(dev);

 	if (strict_strtoul(buf, 0, &val) < 0)
 		return -EINVAL;

-	if (val) {
-		mutex_lock(&pci_remove_rescan_mutex);
-		if (!pci_is_root_bus(bus) && list_empty(&bus->devices
Re: [PATCH] pci-sysfs: replace mutex_lock with mutex_trylock to avoid potential deadlock situation
Hi Bjorn,

Thanks for your review and comments!
Please refer to the inline comments below.

On 01/25/2013 07:12 AM, Bjorn Helgaas wrote:
> On Thu, Dec 27, 2012 at 12:42 AM, Lin Feng linf...@cn.fujitsu.com wrote:
>> There is a potential deadlock situation when we manipulate the pci-sysfs user
>> interfaces from different bus hierarchies simultaneously, described as follows:
>>
>> path1: sysfs remove device:               | path2: sysfs rescan device:
>> sysfs_schedule_callback_work()            | sysfs_write_file()
>>   remove_callback()                       |   flush_write_buffer()
>> *1* mutex_lock(&pci_remove_rescan_mutex)  |*2* sysfs_get_active(attr_sd)
>>   ...                                     |   dev_attr_store()
>>   device_remove_file()                    |   dev_rescan_store()
>>   ...                                     |*4* mutex_lock(&pci_remove_rescan_mutex)
>> *3* sysfs_deactivate(sd)                  |   ...
>>   wait_for_completion()                   |*5* sysfs_put_active(attr_sd)
>> *6* mutex_unlock(&pci_remove_rescan_mutex)
>>
>> ...snip...
>>
>> Reported-by: Taku Izumi izumi.t...@jp.fujitsu.com
>> Signed-off-by: Lin Feng linf...@cn.fujitsu.com
>> Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com
>> ---
>>  drivers/pci/pci-sysfs.c | 42 ++
>>  1 files changed, 26 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
>> index 05b78b1..d2efbb0 100644
>> --- a/drivers/pci/pci-sysfs.c
>> +++ b/drivers/pci/pci-sysfs.c
>> @@ -295,10 +295,13 @@ static ssize_t bus_rescan_store(struct bus_type *bus, const char *buf,
>>  		return -EINVAL;
>>
>>  	if (val) {
>> -		mutex_lock(&pci_remove_rescan_mutex);
>> -		while ((b = pci_find_next_bus(b)) != NULL)
>> -			pci_rescan_bus(b);
>> -		mutex_unlock(&pci_remove_rescan_mutex);
>> +		if (mutex_trylock(&pci_remove_rescan_mutex)) {
>> +			while ((b = pci_find_next_bus(b)) != NULL)
>> +				pci_rescan_bus(b);
>> +			mutex_unlock(&pci_remove_rescan_mutex);
>> +		} else {
>> +			return 0;
>
> What are the semantics of returning 0 from a sysfs store function?
> Does the user's write just get dropped? I would think we'd return
> "count" for that case.

Oh, yes, returning count seems suitable here. Although we did not achieve what
the user wanted (rescanning the bus), the user's write has already been flushed.
But the user still cannot tell from the return value alone whether
pci_rescan_bus() was actually done. Shall we return a suitable error here to
tell the user that the write was accepted but pci_rescan_bus() was not done?

> Is there some sort of automatic retry in libc or something if we return zero?

No, libc indeed does nothing extra if we return zero.

> Are you relying on the user code to notice that nothing was written and do
> its own retry?

Yes, but it seems impractical.

> The last seems most likely, but that seems like it
> complicates the user's life unnecessarily.
>
>> +		}
>>  	}
>>  	return count;
>>  }
>> @@ -319,9 +322,12 @@ dev_rescan_store(struct device *dev, struct device_attribute *attr,
>>  		return -EINVAL;
>>
>>  	if (val) {
>> -		mutex_lock(&pci_remove_rescan_mutex);
>> -		pci_rescan_bus(pdev->bus);
>> -		mutex_unlock(&pci_remove_rescan_mutex);
>> +		if (mutex_trylock(&pci_remove_rescan_mutex)) {
>> +			pci_rescan_bus(pdev->bus);
>> +			mutex_unlock(&pci_remove_rescan_mutex);
>> +		} else {
>> +			return 0;
>> +		}
>>  	}
>>  	return count;
>>  }
>> @@ -330,9 +336,10 @@ static void remove_callback(struct device *dev)
>>  {
>>  	struct pci_dev *pdev = to_pci_dev(dev);
>>
>> -	mutex_lock(&pci_remove_rescan_mutex);
>> -	pci_stop_and_remove_bus_device(pdev);
>> -	mutex_unlock(&pci_remove_rescan_mutex);
>> +	if (mutex_trylock(&pci_remove_rescan_mutex)) {
>> +		pci_stop_and_remove_bus_device(pdev);
>> +		mutex_unlock(&pci_remove_rescan_mutex);
>> +	}
>
> In the other cases, I think the user will at least get some
> indication, e.g., a write() that returns zero, when we abort. But
> here, we silently skip the pci_stop_and_remove_bus_device(). That
> sounds wrong to me. What actually happens here, and why is it OK to
> skip it?

Yeah, skipping it silently is not suitable. We should report something here if
we cannot do pci_stop_and_remove_bus_device().

> Can we avoid the deadlock by queuing these in a workqueue instead of
> using the mutex_trylock() approach?

No, I don't think queuing the rescan routine into a workqueue, as is done for
remove, is suitable. After we queue the scan-bus work into the workqueue, the
rescan routine can either return directly (case1) or wait until the work is
completed (case2).
case1: If we return directly after we queue the scan-bus work
[PATCH RESEND] pci-sysfs: replace mutex_lock with mutex_trylock to avoid potential deadlock situation
] pci_stop_and_remove_bus_device+0x16/0x30
[81282359] remove_callback+0x29/0x40
[811e4344] sysfs_schedule_callback_work+0x24/0x70
[81070009] process_one_work+0x179/0x4b0
[8107210e] worker_thread+0x12e/0x330
[81071fe0] ? manage_workers+0x110/0x110
[8107705e] kthread+0x9e/0xb0
[81525bc4] kernel_thread_helper+0x4/0x10
[81076fc0] ? kthread_freezable_should_stop+0x70/0x70
[81525bc0] ? gs_change+0x13/0x13

Reported-by: Taku Izumi izumi.t...@jp.fujitsu.com
Signed-off-by: Lin Feng linf...@cn.fujitsu.com
Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com
---
 drivers/pci/pci-sysfs.c | 42 ++
 1 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 05b78b1..d2efbb0 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -295,10 +295,13 @@ static ssize_t bus_rescan_store(struct bus_type *bus, const char *buf,
 		return -EINVAL;

 	if (val) {
-		mutex_lock(&pci_remove_rescan_mutex);
-		while ((b = pci_find_next_bus(b)) != NULL)
-			pci_rescan_bus(b);
-		mutex_unlock(&pci_remove_rescan_mutex);
+		if (mutex_trylock(&pci_remove_rescan_mutex)) {
+			while ((b = pci_find_next_bus(b)) != NULL)
+				pci_rescan_bus(b);
+			mutex_unlock(&pci_remove_rescan_mutex);
+		} else {
+			return 0;
+		}
 	}
 	return count;
 }
@@ -319,9 +322,12 @@ dev_rescan_store(struct device *dev, struct device_attribute *attr,
 		return -EINVAL;

 	if (val) {
-		mutex_lock(&pci_remove_rescan_mutex);
-		pci_rescan_bus(pdev->bus);
-		mutex_unlock(&pci_remove_rescan_mutex);
+		if (mutex_trylock(&pci_remove_rescan_mutex)) {
+			pci_rescan_bus(pdev->bus);
+			mutex_unlock(&pci_remove_rescan_mutex);
+		} else {
+			return 0;
+		}
 	}
 	return count;
 }
@@ -330,9 +336,10 @@ static void remove_callback(struct device *dev)
 {
 	struct pci_dev *pdev = to_pci_dev(dev);

-	mutex_lock(&pci_remove_rescan_mutex);
-	pci_stop_and_remove_bus_device(pdev);
-	mutex_unlock(&pci_remove_rescan_mutex);
+	if (mutex_trylock(&pci_remove_rescan_mutex)) {
+		pci_stop_and_remove_bus_device(pdev);
+		mutex_unlock(&pci_remove_rescan_mutex);
+	}
 }

 static ssize_t
@@ -366,12 +373,15 @@ dev_bus_rescan_store(struct device *dev, struct device_attribute *attr,
 		return -EINVAL;

 	if (val) {
-		mutex_lock(&pci_remove_rescan_mutex);
-		if (!pci_is_root_bus(bus) && list_empty(&bus->devices))
-			pci_rescan_bus_bridge_resize(bus->self);
-		else
-			pci_rescan_bus(bus);
-		mutex_unlock(&pci_remove_rescan_mutex);
+		if (mutex_trylock(&pci_remove_rescan_mutex)) {
+			if (!pci_is_root_bus(bus) && list_empty(&bus->devices))
+				pci_rescan_bus_bridge_resize(bus->self);
+			else
+				pci_rescan_bus(bus);
+			mutex_unlock(&pci_remove_rescan_mutex);
+		} else {
+			return 0;
+		}
 	}
 	return count;
 }
-- 
1.7.1
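
The try-lock pattern used by the store handlers in this patch can be modelled in userspace roughly as follows. This is only a sketch: a C11 atomic_flag stands in for pci_remove_rescan_mutex, a counter stands in for pci_rescan_bus(), and struct rescan_state, try_lock(), unlock() and rescan_store() are hypothetical names. As in the patch, the handler returns 0 when the lock is contended, so the caller sees a write that consumed nothing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/types.h>

struct rescan_state {
	atomic_flag busy;	/* stands in for pci_remove_rescan_mutex */
	unsigned int rescans;	/* stands in for the work done by a rescan */
};

/* Returns nonzero if the lock was taken, like mutex_trylock(). */
static int try_lock(struct rescan_state *s)
{
	return !atomic_flag_test_and_set(&s->busy);
}

static void unlock(struct rescan_state *s)
{
	atomic_flag_clear(&s->busy);
}

/* Model of a sysfs store handler: never blocks on the lock. Returns the
 * number of bytes consumed, or 0 if the lock was held by someone else. */
static ssize_t rescan_store(struct rescan_state *s, size_t count)
{
	assert(s != NULL);

	if (!try_lock(s))
		return 0;	/* contended: nothing was done */

	s->rescans++;		/* the rescan itself */
	unlock(s);
	return (ssize_t)count;
}
```

A caller that gets 0 back cannot distinguish "try again" from "nothing to do" without extra knowledge, which is why the choice of return value on contention matters for user space.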
Re: [PATCH 2/3] f2fs: add sysfs entries to select the gc policy
On 08/04/2013 10:10 PM, Namjae Jeon wrote:
> From: Namjae Jeon namjae.j...@samsung.com
>
> Add sysfs entry gc_idle to control the gc policy. gc_idle = 1 selects
> the cost benefit approach, while gc_idle = 2 selects the greedy
> approach to garbage collection. The selection is mutually exclusive:
> only one approach is in effect at any point. If gc_idle = 0, this
> option is disabled.
>
> Cc: Gu Zheng guz.f...@cn.fujitsu.com
> Signed-off-by: Namjae Jeon namjae.j...@samsung.com
> Signed-off-by: Pankaj Kumar pankaj...@samsung.com

Reviewed-by: Gu Zheng guz.f...@cn.fujitsu.com

> ---
>  Documentation/ABI/testing/sysfs-fs-f2fs |  6 +-
>  Documentation/filesystems/f2fs.txt      |  6 ++
>  fs/f2fs/gc.c                            | 24 +---
>  fs/f2fs/gc.h                            |  3 +++
>  fs/f2fs/super.c                         |  2 ++
>  5 files changed, 37 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-fs-f2fs b/Documentation/ABI/testing/sysfs-fs-f2fs
> index 5f44095..31942ef 100644
> --- a/Documentation/ABI/testing/sysfs-fs-f2fs
> +++ b/Documentation/ABI/testing/sysfs-fs-f2fs
> @@ -19,4 +19,8 @@ Description:
>  		 Controls the default sleep time for gc_thread. Time
>  		 is in milliseconds.
>
> -
> +What:		/sys/fs/f2fs/<disk>/gc_idle
> +Date:		July 2013
> +Contact:	Namjae Jeon namjae.j...@samsung.com
> +Description:
> +		 Controls the victim selection policy for garbage collection.
> diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.txt
> index 5daf3bb..3cd27be 100644
> --- a/Documentation/filesystems/f2fs.txt
> +++ b/Documentation/filesystems/f2fs.txt
> @@ -158,6 +158,12 @@ Files in /sys/fs/f2fs/<devname>
>                                time for the garbage collection thread. Time is
>                                in milliseconds.
>
> + gc_idle                      This parameter controls the selection of victim
> +                              policy for garbage collection. Setting gc_idle = 0
> +                              (default) will disable this option. Setting
> +                              gc_idle = 1 will select the Cost Benefit approach
> +                              setting gc_idle = 2 will select the greedy aproach.
> +
>  USAGE
> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
> index 60d4f67..2c0c8ad 100644
> --- a/fs/f2fs/gc.c
> +++ b/fs/f2fs/gc.c
> @@ -106,6 +106,8 @@ int start_gc_thread(struct f2fs_sb_info *sbi)
>  	gc_th->max_sleep_time = DEF_GC_THREAD_MAX_SLEEP_TIME;
>  	gc_th->no_gc_sleep_time = DEF_GC_THREAD_NOGC_SLEEP_TIME;
>
> +	gc_th->gc_idle = 0;
> +
>  	sbi->gc_thread = gc_th;
>  	init_waitqueue_head(&sbi->gc_thread->gc_wait_queue_head);
>  	sbi->gc_thread->f2fs_gc_task = kthread_run(gc_thread_func, sbi,
> @@ -130,9 +132,25 @@ void stop_gc_thread(struct f2fs_sb_info *sbi)
>  	sbi->gc_thread = NULL;
>  }
>
> -static int select_gc_type(int gc_type)
> +static int select_gc_type(struct f2fs_gc_kthread *gc_th, int gc_type)
>  {
> -	return (gc_type == BG_GC) ? GC_CB : GC_GREEDY;
> +	int gc_mode;
> +
> +	if (gc_th && gc_th->gc_idle) {
> +		/* Cost Benefit Policy */
> +		if (gc_th->gc_idle == 1) {
> +			gc_mode = GC_CB;
> +			goto out;
> +		} else if (gc_th->gc_idle == 2) {
> +			/* Greedy Policy */
> +			gc_mode = GC_GREEDY;
> +			goto out;
> +		}
> +	}
> +
> +	gc_mode = (gc_type == BG_GC) ? GC_CB : GC_GREEDY;
> +out:
> +	return gc_mode;
>  }
>
>  static void select_policy(struct f2fs_sb_info *sbi, int gc_type,
> @@ -145,7 +163,7 @@ static void select_policy(struct f2fs_sb_info *sbi, int gc_type,
>  		p->dirty_segmap = dirty_i->dirty_segmap[type];
>  		p->ofs_unit = 1;
>  	} else {
> -		p->gc_mode = select_gc_type(gc_type);
> +		p->gc_mode = select_gc_type(sbi->gc_thread, gc_type);
>  		p->dirty_segmap = dirty_i->dirty_segmap[DIRTY];
>  		p->ofs_unit = sbi->segs_per_sec;
>  	}
> diff --git a/fs/f2fs/gc.h b/fs/f2fs/gc.h
> index f4bf44c..c22dee9 100644
> --- a/fs/f2fs/gc.h
> +++ b/fs/f2fs/gc.h
> @@ -30,6 +30,9 @@ struct f2fs_gc_kthread {
>  	unsigned int min_sleep_time;
>  	unsigned int max_sleep_time;
>  	unsigned int no_gc_sleep_time;
> +
> +	/* for changing gc mode */
> +	unsigned int gc_idle;
>  };
>
>  struct inode_entry {
> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
> index 0a3e88f..f9c6c0b 100644
> --- a/fs/f2fs/super.c
> +++ b/fs/f2fs/super.c
> @@ -148,12 +148,14 @@ static struct f2fs_attr f2fs_attr_##_name = { \
>  F2FS_RW_ATTR(gc_min_sleep_time, min_sleep_time);
>  F2FS_RW_ATTR(gc_max_sleep_time, max_sleep_time);
>  F2FS_RW_ATTR
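
The selection rule that gc_idle introduces can be sketched in isolation. The following is a userspace model, not the f2fs code: the enum values are stand-ins for the kernel's constants of the same names, and struct gc_tunables and select_gc_mode() are hypothetical names. It shows that gc_idle = 1 or 2 overrides the policy, while 0 (or any other value) falls back to the choice based on the GC type.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the f2fs constants of the same names. */
enum { GC_CB, GC_GREEDY };	/* victim selection policies */
enum { BG_GC, FG_GC };		/* background vs. foreground GC */

/* Hypothetical container for the sysfs-controlled setting. */
struct gc_tunables {
	unsigned int gc_idle;	/* 0: disabled, 1: cost benefit, 2: greedy */
};

static int select_gc_mode(const struct gc_tunables *t, int gc_type)
{
	assert(gc_type == BG_GC || gc_type == FG_GC);

	if (t) {
		if (t->gc_idle == 1)
			return GC_CB;		/* forced cost benefit */
		if (t->gc_idle == 2)
			return GC_GREEDY;	/* forced greedy */
	}

	/* Default: background GC uses cost benefit, foreground is greedy. */
	return (gc_type == BG_GC) ? GC_CB : GC_GREEDY;
}
```

The early returns express the same logic as the goto-based version in the quoted hunk; the NULL check mirrors the case where no GC thread structure exists.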
Re: [PATCH 1/3] f2fs: add sysfs support for controlling the gc_thread
On 08/04/2013 10:09 PM, Namjae Jeon wrote:
> From: Namjae Jeon namjae.j...@samsung.com
>
> Add sysfs entries to control the timing parameters for
> f2fs gc thread.
>
> Various Sysfs options introduced are:
> gc_min_sleep_time: Min Sleep time for GC in ms
> gc_max_sleep_time: Max Sleep time for GC in ms
> gc_no_gc_sleep_time: Default Sleep time for GC in ms
>
> Cc: Gu Zheng guz.f...@cn.fujitsu.com
> Signed-off-by: Namjae Jeon namjae.j...@samsung.com
> Signed-off-by: Pankaj Kumar pankaj...@samsung.com

Reviewed-by: Gu Zheng guz.f...@cn.fujitsu.com

> ---
>  Documentation/ABI/testing/sysfs-fs-f2fs |  22 ++
>  Documentation/filesystems/f2fs.txt      |  26 +++
>  fs/f2fs/f2fs.h                          |   4 +
>  fs/f2fs/gc.c                            |  17 +++--
>  fs/f2fs/gc.h                            |  33 +
>  fs/f2fs/super.c                         | 122 +++
>  6 files changed, 204 insertions(+), 20 deletions(-)
>  create mode 100644 Documentation/ABI/testing/sysfs-fs-f2fs
>
> diff --git a/Documentation/ABI/testing/sysfs-fs-f2fs b/Documentation/ABI/testing/sysfs-fs-f2fs
> new file mode 100644
> index 000..5f44095
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-fs-f2fs
> @@ -0,0 +1,22 @@
> +What:		/sys/fs/f2fs/<disk>/gc_max_sleep_time
> +Date:		July 2013
> +Contact:	Namjae Jeon namjae.j...@samsung.com
> +Description:
> +		 Controls the maximun sleep time for gc_thread. Time
> +		 is in milliseconds.
> +
> +What:		/sys/fs/f2fs/<disk>/gc_min_sleep_time
> +Date:		July 2013
> +Contact:	Namjae Jeon namjae.j...@samsung.com
> +Description:
> +		 Controls the minimum sleep time for gc_thread. Time
> +		 is in milliseconds.
> +
> +What:		/sys/fs/f2fs/<disk>/gc_no_gc_sleep_time
> +Date:		July 2013
> +Contact:	Namjae Jeon namjae.j...@samsung.com
> +Description:
> +		 Controls the default sleep time for gc_thread. Time
> +		 is in milliseconds.
> +
> +
> diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.txt
> index 0500c19..5daf3bb 100644
> --- a/Documentation/filesystems/f2fs.txt
> +++ b/Documentation/filesystems/f2fs.txt
> @@ -133,6 +133,32 @@ f2fs. Each file shows the whole f2fs information.
>   - current memory footprint consumed by f2fs.
>
> +SYSFS ENTRIES
> +
> +
> +Information about mounted f2f2 file systems can be found in
> +/sys/fs/f2fs. Each mounted filesystem will have a directory in
> +/sys/fs/f2fs based on its device name (i.e., /sys/fs/f2fs/sda).
> +The files in each per-device directory are shown in table below.
> +
> +Files in /sys/fs/f2fs/<devname>
> +(see also Documentation/ABI/testing/sysfs-fs-f2fs)
> +..
> + File                         Content
> +
> + gc_max_sleep_time            This tuning parameter controls the maximum sleep
> +                              time for the garbage collection thread. Time is
> +                              in milliseconds.
> +
> + gc_min_sleep_time            This tuning parameter controls the minimum sleep
> +                              time for the garbage collection thread. Time is
> +                              in milliseconds.
> +
> + gc_no_gc_sleep_time          This tuning parameter controls the default sleep
> +                              time for the garbage collection thread. Time is
> +                              in milliseconds.
> +
> +
>  USAGE
> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> index 78777cd..63813be 100644
> --- a/fs/f2fs/f2fs.h
> +++ b/fs/f2fs/f2fs.h
> @@ -430,6 +430,10 @@ struct f2fs_sb_info {
>  #endif
>  	unsigned int last_victim[2];		/* last victim segment # */
>  	spinlock_t stat_lock;			/* lock for stat operations */
> +
> +	/* For sysfs suppport */
> +	struct kobject s_kobj;
> +	struct completion s_kobj_unregister;
>  };
>
>  /*
> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
> index 35f9b1a..60d4f67 100644
> --- a/fs/f2fs/gc.c
> +++ b/fs/f2fs/gc.c
> @@ -29,10 +29,11 @@ static struct kmem_cache *winode_slab;
>  static int gc_thread_func(void *data)
>  {
>  	struct f2fs_sb_info *sbi = data;
> +	struct f2fs_gc_kthread *gc_th = sbi->gc_thread;
>  	wait_queue_head_t *wq = &sbi->gc_thread->gc_wait_queue_head;
>  	long wait_ms;
>
> -	wait_ms = GC_THREAD_MIN_SLEEP_TIME;
> +	wait_ms = gc_th->min_sleep_time;
>
>  	do {
>  		if (try_to_freeze())
> @@ -45,7 +46,7 @@ static int gc_thread_func(void *data)
>  			break;
>
>  		if (sbi->sb->s_writers.frozen >= SB_FREEZE_WRITE
[PATCH] f2fs: move bio_private allocation out of f2fs_bio_alloc()
bio->bi_private is not always needed. In the read path, for example,
read_end_io does not use bio_private at all. So move the bio_private
allocation out of f2fs_bio_alloc(): allocate it in submit_write_page(),
and skip it in f2fs_readpage().

Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com
---
 fs/f2fs/data.c    |  1 -
 fs/f2fs/segment.c | 19 +++
 2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index c73c394..19cd7c6 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -365,7 +365,6 @@ static void read_end_io(struct bio *bio, int err)
 		}
 		unlock_page(page);
 	} while (bvec >= bio->bi_io_vec);
-	kfree(bio->bi_private);
 	bio_put(bio);
 }

diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index a86d125..9b74ae2 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -611,18 +611,12 @@ static void f2fs_end_io_write(struct bio *bio, int err)
 struct bio *f2fs_bio_alloc(struct block_device *bdev, int npages)
 {
 	struct bio *bio;
-	struct bio_private *priv;
-retry:
-	priv = kmalloc(sizeof(struct bio_private), GFP_NOFS);
-	if (!priv) {
-		cond_resched();
-		goto retry;
-	}

 	/* No failure on bio allocation */
 	bio = bio_alloc(GFP_NOIO, npages);
 	bio->bi_bdev = bdev;
-	bio->bi_private = priv;
+	bio->bi_private = NULL;
+
 	return bio;
 }

@@ -681,8 +675,17 @@ static void submit_write_page(struct f2fs_sb_info *sbi, struct page *page,
 		do_submit_bio(sbi, type, false);
 alloc_new:
 	if (sbi->bio[type] == NULL) {
+		struct bio_private *priv;
+retry:
+		priv = kmalloc(sizeof(struct bio_private), GFP_NOFS);
+		if (!priv) {
+			cond_resched();
+			goto retry;
+		}
+
 		sbi->bio[type] = f2fs_bio_alloc(bdev, max_hw_blocks(sbi));
 		sbi->bio[type]->bi_sector = SECTOR_FROM_BLOCK(sbi, blk_addr);
+		sbi->bio[type]->bi_private = priv;
 		/*
 		 * The end_io will be assigned at the sumbission phase.
 		 * Until then, let bio_add_page() merge consecutive IOs as much
-- 
1.7.7
[PATCH] driver/vga16fb.c: remove the unused variable dev of function vga16fb_destroy()
Commit e21d2170f36602ae2708 removed the unnecessary
platform_set_drvdata(), but left the variable dev unused. Delete it.

Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com
---
 drivers/video/vga16fb.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/video/vga16fb.c b/drivers/video/vga16fb.c
index 830ded4..2827333 100644
--- a/drivers/video/vga16fb.c
+++ b/drivers/video/vga16fb.c
@@ -1265,7 +1265,6 @@ static void vga16fb_imageblit(struct fb_info *info, const struct fb_image *image

 static void vga16fb_destroy(struct fb_info *info)
 {
-	struct platform_device *dev = container_of(info->device, struct platform_device, dev);
 	iounmap(info->screen_base);
 	fb_dealloc_cmap(&info->cmap);
 	/* XXX unshare VGA regions */
-- 
1.7.7
Re: [PATCH] driver/vga16fb.c: remove the unused variable dev of function vga16fb_destroy()
On 07/25/2013 05:58 PM, Geert Uytterhoeven wrote:
> On Thu, Jul 25, 2013 at 5:37 AM, Gu Zheng guz.f...@cn.fujitsu.com wrote:
>> Commit e21d2170f36602ae2708 removed the unnecessary
>> platform_set_drvdata(), but left the variable dev unused,
>> delete it.
>
> When referring to another commit, please also include the oneline summary
> of the commit, to make it easier for people to see what it's about.

Got it, thanks for the reminder. :)

> E.g.
>
> Commit e21d2170f36602ae2708 (video: remove unnecessary
> platform_set_drvdata()) removed the unnecessary
> platform_set_drvdata(), but left the variable dev unused,
> delete it.

This is easier to read. I'll update it.

Regards,
Gu

> Thanks!
>
> Gr{oetje,eeting}s,
>
> Geert
>
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org
>
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say programmer or something like that.
>    -- Linus Torvalds
[PATCH V2] driver/vga16fb.c: remove the unused variable dev of function vga16fb_destroy()
Commit e21d2170f36602ae2708 (video: remove unnecessary platform_set_drvdata())
removed the unnecessary platform_set_drvdata(), but left the variable dev
unused, delete it.

v2: Following Geert's suggestion to make change log easier reading.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
---
 drivers/video/vga16fb.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/video/vga16fb.c b/drivers/video/vga16fb.c
index 830ded4..2827333 100644
--- a/drivers/video/vga16fb.c
+++ b/drivers/video/vga16fb.c
@@ -1265,7 +1265,6 @@ static void vga16fb_imageblit(struct fb_info *info, const struct fb_image *image
 
 static void vga16fb_destroy(struct fb_info *info)
 {
-	struct platform_device *dev = container_of(info->device, struct platform_device, dev);
 	iounmap(info->screen_base);
 	fb_dealloc_cmap(&info->cmap);
 	/* XXX unshare VGA regions */
--
1.7.7
Re: question about splice
Hi Jianpeng,

On 07/26/2013 03:08 PM, majianpeng wrote:
> Hi all,
> I used splice and found a problem (at least I call it one).
> The demo is:
> A: splice(regularfileA ---> pipe);
> B: splice(pipe ---> regularfileB)
> If, before doing B, we modify the data of regA that is now in the pipe,
> the data written to regularfileB changes as well.
> If we use a buffer instead:
> A: read(regA, buff);
> B: write(buff, regB);
> then after A, the contents of regA can't affect the buffer.
> Reviewing the splice code, I see that the pipe shares the page cache of regA.

Right. This is also splice's original design intention: using a shared
mapping rather than copy_to_user/copy_from_user in order to achieve zero-copy.

Thanks,
Gu

> Maybe this is not a problem, or am I missing something?
>
> Thanks!
> Jianpeng Ma
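The two steps of the demo (file to pipe, then pipe to file) can be sketched as a small userspace helper. This is only a minimal sketch and not code from the thread: splice_copy() is a hypothetical name, and the 64 KiB chunk size is an assumption chosen to match the usual default pipe capacity on Linux.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy up to len bytes from in_fd to out_fd through
 * a pipe with splice(2). Returns the number of bytes copied, or -1 on error.
 */
static ssize_t splice_copy(int in_fd, int out_fd, size_t len)
{
	int p[2];
	ssize_t total = 0;

	if (pipe(p) < 0)
		return -1;

	while (len > 0) {
		/* Assumed chunk size: stay within a default-sized pipe. */
		size_t chunk = len < 65536 ? len : 65536;

		/* Step A: regular file -> pipe (the pipe references the pages). */
		ssize_t n = splice(in_fd, NULL, p[1], NULL, chunk, SPLICE_F_MOVE);
		if (n < 0) {
			total = -1;
			break;
		}
		if (n == 0)
			break;	/* end of input file */

		/* Step B: pipe -> regular file; drain what step A queued. */
		ssize_t left = n;
		while (left > 0) {
			ssize_t m = splice(p[0], NULL, out_fd, NULL,
					   (size_t)left, SPLICE_F_MOVE);
			if (m <= 0) {
				total = -1;
				goto out;
			}
			assert(m <= left);
			left -= m;
		}
		total += n;
		len -= (size_t)n;
	}
out:
	close(p[0]);
	close(p[1]);
	return total;
}
```

Because step A queues references to regularfileA's page-cache pages rather than copies, a write to regularfileA between step A and step B may show up in regularfileB, which is the behaviour discussed above. A read()/write() pair through a private buffer does not have this property.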
[PATCH RESEND] fs/bio-integrity: fix a potential mem leak
Free the bio_integrity_pool in the fail path of biovec_create_pool in
function bioset_integrity_create().

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
---
 fs/bio-integrity.c |    9 +++++----
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 8fb4291..6025084 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -716,13 +716,14 @@ int bioset_integrity_create(struct bio_set *bs, int pool_size)
 		return 0;
 
 	bs->bio_integrity_pool = mempool_create_slab_pool(pool_size, bip_slab);
-
-	bs->bvec_integrity_pool = biovec_create_pool(bs, pool_size);
-	if (!bs->bvec_integrity_pool)
+	if (!bs->bio_integrity_pool)
 		return -1;
 
-	if (!bs->bio_integrity_pool)
+	bs->bvec_integrity_pool = biovec_create_pool(bs, pool_size);
+	if (!bs->bvec_integrity_pool) {
+		mempool_destroy(bs->bio_integrity_pool);
 		return -1;
+	}
 
 	return 0;
 }
--
1.7.7
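The fix follows a common pattern: check each allocation right after it is made, and undo the earlier one when a later one fails. Below is a minimal userspace sketch of that pattern. It is not the kernel code: struct two_pools, pool_alloc(), pool_free(), the fail_second switch and the live_allocs counter are all hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an object that owns two pools. */
struct two_pools {
	void *first;
	void *second;
};

static int fail_second;	/* illustration only: force the 2nd allocation to fail */
static int live_allocs;	/* illustration only: allocations not yet freed */

static void *pool_alloc(int fail)
{
	void *p = fail ? NULL : malloc(16);

	if (p)
		live_allocs++;
	return p;
}

static void pool_free(void *p)
{
	if (p) {
		free(p);
		live_allocs--;
	}
}

static int two_pools_create(struct two_pools *tp)
{
	tp->first = pool_alloc(0);
	if (!tp->first)
		return -1;

	tp->second = pool_alloc(fail_second);
	if (!tp->second) {
		/* Undo the first allocation so nothing leaks on failure. */
		pool_free(tp->first);
		tp->first = NULL;
		return -1;
	}

	return 0;
}
```

Checking the first allocation before attempting the second also keeps the failure path simple: at each point there is exactly one earlier resource to release.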
[PATCH RESEND] f2fs: move bio_private allocation out of f2fs_bio_alloc()
bio->bi_private is not always needed. In the read path, for example,
end_read_io does not use bio_private at all, so move the bio_private
allocation out of f2fs_bio_alloc(). Allocate it in submit_write_page(),
and skip it in f2fs_readpage().

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
---
 fs/f2fs/data.c    |    1 -
 fs/f2fs/segment.c |   19 +++++++++++--------
 2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index c73c394..19cd7c6 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -365,7 +365,6 @@ static void read_end_io(struct bio *bio, int err)
 		}
 		unlock_page(page);
 	} while (bvec >= bio->bi_io_vec);
-	kfree(bio->bi_private);
 	bio_put(bio);
 }
 
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index a86d125..9b74ae2 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -611,18 +611,12 @@ static void f2fs_end_io_write(struct bio *bio, int err)
 struct bio *f2fs_bio_alloc(struct block_device *bdev, int npages)
 {
 	struct bio *bio;
-	struct bio_private *priv;
-retry:
-	priv = kmalloc(sizeof(struct bio_private), GFP_NOFS);
-	if (!priv) {
-		cond_resched();
-		goto retry;
-	}
 
 	/* No failure on bio allocation */
 	bio = bio_alloc(GFP_NOIO, npages);
 	bio->bi_bdev = bdev;
-	bio->bi_private = priv;
+	bio->bi_private = NULL;
+
 	return bio;
 }
 
@@ -681,8 +675,17 @@ static void submit_write_page(struct f2fs_sb_info *sbi, struct page *page,
 		do_submit_bio(sbi, type, false);
 alloc_new:
 	if (sbi->bio[type] == NULL) {
+		struct bio_private *priv;
+retry:
+		priv = kmalloc(sizeof(struct bio_private), GFP_NOFS);
+		if (!priv) {
+			cond_resched();
+			goto retry;
+		}
+
 		sbi->bio[type] = f2fs_bio_alloc(bdev, max_hw_blocks(sbi));
 		sbi->bio[type]->bi_sector = SECTOR_FROM_BLOCK(sbi, blk_addr);
+		sbi->bio[type]->bi_private = priv;
 		/*
 		 * The end_io will be assigned at the sumbission phase.
 		 * Until then, let bio_add_page() merge consecutive IOs as much
--
1.7.7
Re: [PATCH 0/9] Add namespace support for syslog v2
Hi Rui,

On 07/29/2013 10:31 AM, Rui Xiang wrote:
> This patchset introduces a system log namespace.
>
> It is the 2nd version. The link of the 1st version is
> http://lwn.net/Articles/525728/. In that version, syslog_namespace was
> added into nsproxy and created through a new clone flag CLONE_SYSLOG when
> cloning a process.
>
> There was some discussion about the 1st version last November. This
> version takes that advice into account and refers to Serge's patch
> (http://lwn.net/Articles/525629/).
>
> Unlike the 1st version, in this patchset the syslog namespace is tied to a
> user namespace. And we must create a new user ns before creating a new
> syslog ns, because users have full capabilities in the new userns after
> cloning it. The syslog namespace can be created through a new command (11)
> of the __NR_syslog syscall, thanks to a new syslog flag,
> SYSLOG_ACTION_NEW_NS.
>
> In syslog_namespace, some necessary identifiers for handling the syslog
> buf are containerized. When a container creates a new syslog ns, an
> individual buf is allocated to store the logs owned by this container.
>
> A new interface ns_printk is added to print the logs which we want to see
> in the container. Through ns_printk, we can get more logs related to a
> specific net ns, for instance, iptables. Here we use it to report iptables
> logs per container.
>
> The default printk, targeted at init_syslog_ns, will continue to print
> most kernel logs to the host.
>
> A task in a new syslog ns can affect only the current container through
> dmesg, dmesg -c and /dev/kmsg actions. The read/write interfaces such as
> /dev/kmsg, /proc/kmsg and the syslog syscall continue to be useful for
> container users.
>
> This patchset is based on linus' linux tree.

A detailed changelog between V1 and V2 is seriously needed; the inline
description is not easy for others to read.

> Rui Xiang (9):
>   syslog_ns: add syslog_namespace and put/get_syslog_ns
>   syslog_ns: add syslog_ns into user_namespace
>   syslog_ns: add init syslog_ns for global syslog
>   syslog_ns: make syslog handling per namespace
>   syslog_ns: make permisiion check per user namespace
>   syslog_ns: use init syslog_ns for console action
>   syslog_ns: implement function for creating syslog ns
>   syslog_ns: implement ns_printk for specific syslog_ns
>   netfilter: use ns_printk in iptable context
>
>  fs/proc/kmsg.c                 |  17 +-
>  include/linux/printk.h         |   5 +-
>  include/linux/syslog.h         |  79 -
>  include/linux/user_namespace.h |   2 +
>  include/net/netfilter/xt_log.h |   6 +-
>  kernel/printk.c                | 642 -
>  kernel/sysctl.c                |   3 +-
>  kernel/user.c                  |   3 +
>  kernel/user_namespace.c        |   4 +
>  net/netfilter/xt_LOG.c         |   4 +-
>  10 files changed, 493 insertions(+), 272 deletions(-)
Re: [PATCH 1/9] syslog_ns: add syslog_namespace and put/get_syslog_ns
Hi Rui,

Refer to inline :).

On 07/29/2013 10:31 AM, Rui Xiang wrote:
> Add a struct syslog_namespace which contains the necessary
> members for hanlding syslog and realize get_syslog_ns and
> put_syslog_ns API.
>
> Signed-off-by: Rui Xiang <rui.xi...@huawei.com>
> ---
>  include/linux/syslog.h | 68 ++
>  kernel/printk.c        |  7 --
>  2 files changed, 68 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/syslog.h b/include/linux/syslog.h
> index 98a3153..425fafe 100644
> --- a/include/linux/syslog.h
> +++ b/include/linux/syslog.h
> @@ -21,6 +21,9 @@
>  #ifndef _LINUX_SYSLOG_H
>  #define _LINUX_SYSLOG_H
>  
> +#include <linux/slab.h>
> +#include <linux/kref.h>
> +
>  /* Close the log.  Currently a NOP. */
>  #define SYSLOG_ACTION_CLOSE          0
>  /* Open the log. Currently a NOP. */
> @@ -47,6 +50,71 @@
>  #define SYSLOG_FROM_READER           0
>  #define SYSLOG_FROM_PROC             1
>  
> +enum log_flags {
> +	LOG_NOCONS	= 1,	/* already flushed, do not print to console */
> +	LOG_NEWLINE	= 2,	/* text ended with a newline */
> +	LOG_PREFIX	= 4,	/* text started with a prefix */
> +	LOG_CONT	= 8,	/* text is a fragment of a continuation line */
> +};
> +
> +struct syslog_namespace {
> +	struct kref kref;	/* syslog_ns reference count  control */
> +
> +	raw_spinlock_t logbuf_lock;	/* access conflict locker */
> +	/* cpu currently holding logbuf_lock of ns */
> +	unsigned int logbuf_cpu;
> +
> +	/* index and sequence number of the first record stored in the buffer */
> +	u64 log_first_seq;
> +	u32 log_first_idx;
> +
> +	/* index and sequence number of the next record stored in the buffer */
> +	u64 log_next_seq;
> +	u32 log_next_idx;
> +
> +	/* the next printk record to read after the last 'clear' command */
> +	u64 clear_seq;
> +	u32 clear_idx;
> +
> +	char *log_buf;
> +	u32 log_buf_len;
> +
> +	/* the next printk record to write to the console */
> +	u64 console_seq;
> +	u32 console_idx;
> +
> +	/* the next printk record to read by syslog(READ) or /proc/kmsg */
> +	u64 syslog_seq;
> +	u32 syslog_idx;
> +	enum log_flags syslog_prev;
> +	size_t syslog_partial;
> +
> +	int dmesg_restrict;
> +};
> +
> +static inline struct syslog_namespace *get_syslog_ns(
> +				struct syslog_namespace *ns)
> +{
> +	if (ns)
> +		kref_get(&ns->kref);
> +	return ns;
> +}
> +
> +static inline void free_syslog_ns(struct kref *kref)
> +{
> +	struct syslog_namespace *ns;
> +	ns = container_of(kref, struct syslog_namespace, kref);
> +
> +	kfree(ns->log_buf);
> +	kfree(ns);
> +}

This interface seems a bit ugly, why not use the format like put_syslog_ns()?

static inline void free_syslog_ns(struct syslog_namespace *ns)

> +
> +static inline void put_syslog_ns(struct syslog_namespace *ns)
> +{
> +	if (ns)
> +		kref_put(&ns->kref, free_syslog_ns);
> +}
> +
>  int do_syslog(int type, char __user *buf, int count, bool from_file);
>  
>  #endif /* _LINUX_SYSLOG_H */
> diff --git a/kernel/printk.c b/kernel/printk.c
> index d37d45c..7e544bf 100644
> --- a/kernel/printk.c
> +++ b/kernel/printk.c
> @@ -193,13 +193,6 @@ static int console_may_schedule;
>   * separated by ',', and find the message after the ';' character.
>   */
>  
> -enum log_flags {
> -	LOG_NOCONS	= 1,	/* already flushed, do not print to console */
> -	LOG_NEWLINE	= 2,	/* text ended with a newline */
> -	LOG_PREFIX	= 4,	/* text started with a prefix */
> -	LOG_CONT	= 8,	/* text is a fragment of a continuation line */
> -};
> -
>  struct log {
>  	u64 ts_nsec;		/* timestamp in nanoseconds */
>  	u16 len;		/* length of entire record */
Re: [PATCH 2/9] syslog_ns: add syslog_ns into user_namespace
Hi Rui,

On 07/29/2013 10:31 AM, Rui Xiang wrote:
> Add a syslog_ns pointer to user_namespace, and make syslog_ns
> per user_namespace, not global.
>
> Since syslog_ns is assigned to user_ns, we can have full
> capabilities in new user_ns to create a new syslog_ns.
>
> Signed-off-by: Rui Xiang <rui.xi...@huawei.com>
> ---
>  include/linux/syslog.h         | 5 +
>  include/linux/user_namespace.h | 1 +
>  2 files changed, 6 insertions(+)
>
> diff --git a/include/linux/syslog.h b/include/linux/syslog.h
> index 425fafe..62ce47f 100644
> --- a/include/linux/syslog.h
> +++ b/include/linux/syslog.h
> @@ -90,6 +90,11 @@ struct syslog_namespace {
>  	size_t syslog_partial;
>  
>  	int dmesg_restrict;
> +
> +	/*
> +	 * user namespace which owns this syslog ns.
> +	 */
> +	struct user_namespace *owner;
>  };
>  
>  static inline struct syslog_namespace *get_syslog_ns(
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index b6b215f..ce2de5b 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -28,6 +28,7 @@ struct user_namespace {
>  	unsigned int		proc_inum;
>  	bool			may_mount_sysfs;
>  	bool			may_mount_proc;
> +	struct syslog_namespace *syslog_ns;

As we add a syslog_ns pointer to user_namespace to make syslog_ns per
user_namespace and the caps check. But why also add a pointer to
syslog_namespace in user_namespace? Am I missing something? :)

Thanks,
Gu

>  };
>  
>  extern struct user_namespace init_user_ns;
Re: [PATCH 4/9] syslog_ns: make syslog handling per namespace
Hi Rui,

On 07/29/2013 10:31 AM, Rui Xiang wrote:
> This patch makes syslog buf and other fields per
> namespace.
>
> Here use ns->log_buf(log_buf_len, logbuf_lock,
> log_first_seq, logbuf_lock, and so on) fields
> instead of global ones to handle syslog.
>
> Syslog interfaces such as /dev/kmsg, /proc/kmsg,
> and syslog syscall are all containerized for
> container users.
>
> Signed-off-by: Rui Xiang <rui.xi...@huawei.com>
> ---
>  fs/proc/kmsg.c         |  17 +-
>  include/linux/printk.h |   1 -
>  include/linux/syslog.h |   3 +-
>  kernel/printk.c        | 507 +
>  kernel/sysctl.c        |   3 +-
>  5 files changed, 273 insertions(+), 258 deletions(-)
>
> diff --git a/fs/proc/kmsg.c b/fs/proc/kmsg.c
> index bdfabda..cb98431 100644
> --- a/fs/proc/kmsg.c
> +++ b/fs/proc/kmsg.c
> @@ -13,6 +13,8 @@
>  #include <linux/proc_fs.h>
>  #include <linux/fs.h>
>  #include <linux/syslog.h>
> +#include <linux/cred.h>
> +#include <linux/user_namespace.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/io.h>
> @@ -21,12 +23,14 @@ extern wait_queue_head_t log_wait;
>  
>  static int kmsg_open(struct inode * inode, struct file * file)
>  {
> -	return do_syslog(SYSLOG_ACTION_OPEN, NULL, 0, SYSLOG_FROM_PROC);
> +	return do_syslog(SYSLOG_ACTION_OPEN, NULL, 0, SYSLOG_FROM_PROC,
> +			file->f_cred->user_ns->syslog_ns);

How about adding a help function to get the syslog_ns that file belongs to?

>  }
>  
>  static int kmsg_release(struct inode * inode, struct file * file)
>  {
> -	(void) do_syslog(SYSLOG_ACTION_CLOSE, NULL, 0, SYSLOG_FROM_PROC);
> +	(void) do_syslog(SYSLOG_ACTION_CLOSE, NULL, 0, SYSLOG_FROM_PROC,
> +			file->f_cred->user_ns->syslog_ns);
>  	return 0;
>  }
>  
> @@ -34,15 +38,18 @@ static ssize_t kmsg_read(struct file *file, char __user *buf,
>  			 size_t count, loff_t *ppos)
>  {
>  	if ((file->f_flags & O_NONBLOCK) &&
> -	    !do_syslog(SYSLOG_ACTION_SIZE_UNREAD, NULL, 0, SYSLOG_FROM_PROC))
> +	    !do_syslog(SYSLOG_ACTION_SIZE_UNREAD, NULL, 0, SYSLOG_FROM_PROC,
> +			file->f_cred->user_ns->syslog_ns))
>  		return -EAGAIN;
> -	return do_syslog(SYSLOG_ACTION_READ, buf, count, SYSLOG_FROM_PROC);
> +	return do_syslog(SYSLOG_ACTION_READ, buf, count, SYSLOG_FROM_PROC,
> +			file->f_cred->user_ns->syslog_ns);
>  }
>  
>  static unsigned int kmsg_poll(struct file *file, poll_table *wait)
>  {
>  	poll_wait(file, &log_wait, wait);
> -	if (do_syslog(SYSLOG_ACTION_SIZE_UNREAD, NULL, 0, SYSLOG_FROM_PROC))
> +	if (do_syslog(SYSLOG_ACTION_SIZE_UNREAD, NULL, 0, SYSLOG_FROM_PROC,
> +			file->f_cred->user_ns->syslog_ns))
>  		return POLLIN | POLLRDNORM;
>  	return 0;
>  }
> diff --git a/include/linux/printk.h b/include/linux/printk.h
> index 22c7052..29e3f85 100644
> --- a/include/linux/printk.h
> +++ b/include/linux/printk.h
> @@ -139,7 +139,6 @@ extern bool printk_timed_ratelimit(unsigned long *caller_jiffies,
>  				   unsigned int interval_msec);
>  
>  extern int printk_delay_msec;
> -extern int dmesg_restrict;
>  extern int kptr_restrict;
>  
>  extern void wake_up_klogd(void);
> diff --git a/include/linux/syslog.h b/include/linux/syslog.h
> index 363bc56..fbf0cb6 100644
> --- a/include/linux/syslog.h
> +++ b/include/linux/syslog.h
> @@ -120,7 +120,8 @@ static inline void put_syslog_ns(struct syslog_namespace *ns)
>  		kref_put(&ns->kref, free_syslog_ns);
>  }
>  
> -int do_syslog(int type, char __user *buf, int count, bool from_file);
> +int do_syslog(int type, char __user *buf, int count, bool from_file,
> +			struct syslog_namespace *ns);
>  extern struct syslog_namespace init_syslog_ns;
>  
>  #endif /* _LINUX_SYSLOG_H */
> diff --git a/kernel/printk.c b/kernel/printk.c
> index fd83ec1..846fef5 100644
> --- a/kernel/printk.c
> +++ b/kernel/printk.c
> @@ -213,29 +213,8 @@ static DEFINE_RAW_SPINLOCK(logbuf_lock);
>  
>  #ifdef CONFIG_PRINTK
>  DECLARE_WAIT_QUEUE_HEAD(log_wait);
> -/* the next printk record to read by syslog(READ) or /proc/kmsg */
> -static u64 syslog_seq;
> -static u32 syslog_idx;
> -static enum log_flags syslog_prev;
> -static size_t syslog_partial;
> -
> -/* index and sequence number of the first record stored in the buffer */
> -static u64 log_first_seq;
> -static u32 log_first_idx;
> -
> -/* index and sequence number of the next record to store in the buffer */
> -static u64 log_next_seq;
> -static u32 log_next_idx;
> -
> -/* the next printk record to write to the console */
> -static u64 console_seq;
> -static u32 console_idx;
>  static enum log_flags console_prev;
>  
> -/* the next printk record to read after the last 'clear' command */
> -static u64 clear_seq;
> -static u32 clear_idx;
> -
>  #define PREFIX_MAX		32
>  #define LOG_LINE_MAX		1024 - PREFIX_MAX
>  
> @@ -246,12 +225,8 @@ static u32 clear_idx;
>  #define LOG_ALIGN __alignof__(struct log)
>  #endif
>  #define
Re: [PATCH 2/9] syslog_ns: add syslog_ns into user_namespace
On 07/29/2013 05:54 PM, Gao feng wrote:
> On 07/29/2013 05:46 PM, Gu Zheng wrote:
>> Hi Rui,
>>
>> On 07/29/2013 10:31 AM, Rui Xiang wrote:
>>> Add a syslog_ns pointer to user_namespace, and make syslog_ns
>>> per user_namespace, not global.
>>>
>>> Since syslog_ns is assigned to user_ns, we can have full
>>> capabilities in new user_ns to create a new syslog_ns.
>>>
>>> Signed-off-by: Rui Xiang <rui.xi...@huawei.com>
>>> ---
>>>  include/linux/syslog.h         | 5 +
>>>  include/linux/user_namespace.h | 1 +
>>>  2 files changed, 6 insertions(+)
>>>
>>> diff --git a/include/linux/syslog.h b/include/linux/syslog.h
>>> index 425fafe..62ce47f 100644
>>> --- a/include/linux/syslog.h
>>> +++ b/include/linux/syslog.h
>>> @@ -90,6 +90,11 @@ struct syslog_namespace {
>>>  	size_t syslog_partial;
>>>  
>>>  	int dmesg_restrict;
>>> +
>>> +	/*
>>> +	 * user namespace which owns this syslog ns.
>>> +	 */
>>> +	struct user_namespace *owner;
>>>  };
>>>  
>>>  static inline struct syslog_namespace *get_syslog_ns(
>>> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
>>> index b6b215f..ce2de5b 100644
>>> --- a/include/linux/user_namespace.h
>>> +++ b/include/linux/user_namespace.h
>>> @@ -28,6 +28,7 @@ struct user_namespace {
>>>  	unsigned int		proc_inum;
>>>  	bool			may_mount_sysfs;
>>>  	bool			may_mount_proc;
>>> +	struct syslog_namespace *syslog_ns;
>>
>> As we add a syslog_ns pointer to user_namespace to make syslog_ns per
>> user_namespace and the caps check. But why also add a pointer to
>> syslog_namespace in user_namespace? Am I missing something? :)
>
> yep, with this we can make sure all the other types of namespace such as
> mount, net, pid can access syslog_ns through user namespace.

Got it. :)

Thanks,
Gu
Re: [PATCH 7/9] syslog_ns: implement function for creating syslog ns
Hi Rui,

On 07/29/2013 10:31 AM, Rui Xiang wrote:
> Add create_syslog_ns function to create a new ns. We
> must create a user_ns before create a new syslog ns.
> And then tie the new syslog_ns to current user_ns
> instead of original syslog_ns which comes from
> parent user_ns.
>
> Add a new syslog flag SYSLOG_ACTION_NEW_NS to
> implement a new command(11) of __NR_syslog system
> call. Through that command, we can create a new
> syslog ns in user space.
>
> Signed-off-by: Rui Xiang <rui.xi...@huawei.com>
> ---
>  include/linux/syslog.h |  2 ++
>  kernel/printk.c        | 52 ++
>  2 files changed, 54 insertions(+)
>
> diff --git a/include/linux/syslog.h b/include/linux/syslog.h
> index fbf0cb6..df57c21 100644
> --- a/include/linux/syslog.h
> +++ b/include/linux/syslog.h
> @@ -46,6 +46,8 @@
>  #define SYSLOG_ACTION_SIZE_UNREAD    9
>  /* Return size of the log buffer */
>  #define SYSLOG_ACTION_SIZE_BUFFER   10
> +/* Create a new syslog ns */
> +#define SYSLOG_ACTION_NEW_NS        11
>  
>  #define SYSLOG_FROM_READER           0
>  #define SYSLOG_FROM_PROC             1
> diff --git a/kernel/printk.c b/kernel/printk.c
> index fd2d600..6b561db 100644
> --- a/kernel/printk.c
> +++ b/kernel/printk.c
> @@ -384,6 +384,10 @@ static int check_syslog_permissions(int type, bool from_file,
>  			|| type == SYSLOG_ACTION_CONSOLE_LEVEL)
>  		ns = &init_syslog_ns;
>  
> +	/* create a new syslog ns */
> +	if (type == SYSLOG_ACTION_NEW_NS)
> +		return 0;
> +

Don't we need further permission or caps check here? Return success
directly seems sloppy.

Thanks,
Gu

>  	if (syslog_action_restricted(type, ns)) {
>  		if (ns_capable(ns->owner, CAP_SYSLOG))
>  			return 0;
> @@ -1131,6 +1135,51 @@ static int syslog_print_all(char __user *buf, int size, bool clear,
>  	return len;
>  }
>  
> +static int create_syslog_ns(void)
> +{
> +	struct user_namespace *userns = current_user_ns();
> +	struct syslog_namespace *oldns, *newns;
> +	int err;
> +
> +	/*
> +	 * syslog ns belongs to a user ns. So you can only unshare your
> +	 * user_ns if you share a user_ns with your parent userns
> +	 */
> +	if (userns == &init_user_ns ||
> +	    userns->syslog_ns != userns->parent->syslog_ns)
> +		return -EINVAL;
> +
> +	if (!ns_capable(userns, CAP_SYSLOG))
> +		return -EPERM;
> +
> +	err = -ENOMEM;
> +	oldns = userns->syslog_ns;
> +	newns = kzalloc(sizeof(*newns), GFP_ATOMIC);
> +	if (!newns)
> +		goto out;
> +	newns->log_buf_len = __LOG_BUF_LEN;
> +	newns->log_buf = kzalloc(newns->log_buf_len, GFP_ATOMIC);
> +	if (!newns->log_buf)
> +		goto out;
> +
> +	newns->owner = get_user_ns(userns);
> +	raw_spin_lock_init(&(newns->logbuf_lock));
> +	newns->logbuf_cpu = UINT_MAX;
> +	newns->dmesg_restrict = oldns->dmesg_restrict;
> +	put_syslog_ns(oldns);
> +	kref_init(&newns->kref);
> +	userns->syslog_ns = newns;
> +	newns = NULL;
> +
> +	err = 0;
> +out:
> +	if (newns) {
> +		kfree(newns->log_buf);
> +		kfree(newns);
> +	}
> +	return err;
> +}
> +
>  int do_syslog(int type, char __user *buf, int len, bool from_file,
>  			struct syslog_namespace *ns)
>  {
> @@ -1254,6 +1303,9 @@ int do_syslog(int type, char __user *buf, int len, bool from_file,
>  	case SYSLOG_ACTION_SIZE_BUFFER:
>  		error = ns->log_buf_len;
>  		break;
> +	case SYSLOG_ACTION_NEW_NS:
> +		error = create_syslog_ns();
> +		break;
>  	default:
>  		error = -EINVAL;
>  		break;
Re: [PATCH 8/9] syslog_ns: implement ns_printk for specific syslog_ns
Hi Rui,

On 07/29/2013 10:31 AM, Rui Xiang wrote:
> Add a new interface named ns_printk, and assign a
> parameter ns. Log which belong to a container can
> be printed by ns_printk.

One question: with syslog_ns in use, does the log we print with *printk* in
the host contain the logs of each syslog_ns (printed with ns_printk) or not?

Thanks,
Gu

> Signed-off-by: Rui Xiang <rui.xi...@huawei.com>
> ---
>  include/linux/printk.h |  4
>  kernel/printk.c        | 53 ++
>  2 files changed, 53 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/printk.h b/include/linux/printk.h
> index 29e3f85..bf83ad9 100644
> --- a/include/linux/printk.h
> +++ b/include/linux/printk.h
> @@ -6,6 +6,7 @@
>  #include <linux/kern_levels.h>
>  #include <linux/linkage.h>
>  
> +struct syslog_namespace;
>  extern const char linux_banner[];
>  extern const char linux_proc_banner[];
>  
> @@ -123,6 +124,9 @@ asmlinkage int printk_emit(int facility, int level,
>  asmlinkage __printf(1, 2) __cold
>  int printk(const char *fmt, ...);
>  
> +asmlinkage __printf(2, 3) __cold
> +int ns_printk(struct syslog_namespace *ns, const char *fmt, ...);
> +
>  /*
>   * Special printk facility for scheduler use only, _DO_NOT_USE_ !
>   */
> diff --git a/kernel/printk.c b/kernel/printk.c
> index 6b561db..56a8b27 100644
> --- a/kernel/printk.c
> +++ b/kernel/printk.c
> @@ -1554,9 +1554,10 @@ static size_t cont_print_text(char *text, size_t size)
>  	return textlen;
>  }
>  
> -asmlinkage int vprintk_emit(int facility, int level,
> -			    const char *dict, size_t dictlen,
> -			    const char *fmt, va_list args)
> +static int ns_vprintk_emit(int facility, int level,
> +			const char *dict, size_t dictlen,
> +			const char *fmt, va_list args,
> +			struct syslog_namespace *ns)
>  {
>  	static int recursion_bug;
>  	static char textbuf[LOG_LINE_MAX];
> @@ -1566,7 +1567,6 @@ asmlinkage int vprintk_emit(int facility, int level,
>  	unsigned long flags;
>  	int this_cpu;
>  	int printed_len = 0;
> -	struct syslog_namespace *ns = &init_syslog_ns;
>  
>  	boot_delay_msec(level);
>  	printk_delay();
> @@ -1697,6 +1697,14 @@ out_restore_irqs:
>  	return printed_len;
>  }
> +
> +asmlinkage int vprintk_emit(int facility, int level,
> +			const char *dict, size_t dictlen,
> +			const char *fmt, va_list args)
> +{
> +	return ns_vprintk_emit(facility, level, dict, dictlen, fmt, args,
> +			&init_syslog_ns);
> +}
>  EXPORT_SYMBOL(vprintk_emit);
>  
>  asmlinkage int vprintk(const char *fmt, va_list args)
> @@ -1762,6 +1770,43 @@ asmlinkage int printk(const char *fmt, ...)
>  }
>  EXPORT_SYMBOL(printk);
>  
> +/**
> + * ns_printk - print a kernel message in syslog_ns
> + * @ns: syslog namespace
> + * @fmt: format string
> + *
> + * This is ns_printk().
> + * It can be called from container context. We add a param
> + * ns to record current syslog namespace, because we need to
> + * print some log which are not generated by host, but contaner.
> + *
> + * See the vsnprintf() documentation for format string extensions over C99.
> + **/
> +asmlinkage int ns_printk(struct syslog_namespace *ns,
> +			const char *fmt, ...)
> +{
> +	va_list args;
> +	int r;
> +
> +	if (!ns)
> +		ns = current_user_ns()->syslog_ns;
> +
> +#ifdef CONFIG_KGDB_KDB
> +	if (unlikely(kdb_trap_printk)) {
> +		va_start(args, fmt);
> +		r = vkdb_printf(fmt, args);
> +		va_end(args);
> +		return r;
> +	}
> +#endif
> +	va_start(args, fmt);
> +	r = ns_vprintk_emit(0, -1, NULL, 0, fmt, args, ns);
> +	va_end(args);
> +
> +	return r;
> +}
> +EXPORT_SYMBOL(ns_printk);
> +

Can we clean up printk here by implementing it on top of ns_printk?

>  #else /* CONFIG_PRINTK */
>  
>  #define LOG_LINE_MAX		0
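The wrapper structure in this patch, where the existing entry point forwards to a namespace-aware core with the init namespace, can be modelled in userspace. The following is a minimal sketch under stated assumptions, not the kernel implementation: the fixed 128-byte buffer, ns_vprintk() and host_printk() are hypothetical, and a NULL ns falls back to init_syslog_ns here, whereas the patch uses current_user_ns()->syslog_ns.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, heavily simplified namespace: one small text buffer. */
struct syslog_namespace {
	char buf[128];
	size_t len;	/* bytes used, always < sizeof(buf) */
};

static struct syslog_namespace init_syslog_ns;

/* Core routine: append one formatted message to the given namespace. */
static int ns_vprintk(struct syslog_namespace *ns, const char *fmt, va_list args)
{
	size_t room = sizeof(ns->buf) - ns->len;	/* includes the NUL */
	int n = vsnprintf(ns->buf + ns->len, room, fmt, args);

	if (n < 0)
		return n;
	if ((size_t)n >= room)
		n = (int)(room - 1);	/* message was truncated */
	ns->len += (size_t)n;
	return n;
}

/* Namespace-aware variant; NULL selects the init namespace in this sketch. */
static int ns_printk(struct syslog_namespace *ns, const char *fmt, ...)
{
	va_list args;
	int r;

	if (!ns)
		ns = &init_syslog_ns;

	va_start(args, fmt);
	r = ns_vprintk(ns, fmt, args);
	va_end(args);
	return r;
}

/* Host-side variant: same core routine, always the init namespace. */
static int host_printk(const char *fmt, ...)
{
	va_list args;
	int r;

	va_start(args, fmt);
	r = ns_vprintk(&init_syslog_ns, fmt, args);
	va_end(args);
	return r;
}
```

With both entry points sharing one va_list-based core, the host variant reduces to a thin wrapper, which is the kind of cleanup the last review comment asks about. In this model a message written to a container namespace does not appear in init_syslog_ns; whether the real patch set behaves that way is exactly the question raised in the review.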
Re: [PATCH 1/9] syslog_ns: add syslog_namespace and put/get_syslog_ns
On 07/29/2013 07:47 PM, Rui Xiang wrote:
> On 2013/7/29 17:40, Gu Zheng wrote:
>> Hi Rui,
>>
>> Refer to inline :).
>
> Hi Gu,
>
> Thanks for your attention.
>
>> On 07/29/2013 10:31 AM, Rui Xiang wrote:
>>> Add a struct syslog_namespace which contains the necessary
>>> members for hanlding syslog and realize get_syslog_ns and
>>> put_syslog_ns API.
>>>
>>> Signed-off-by: Rui Xiang <rui.xi...@huawei.com>
>>> ---
>>>  include/linux/syslog.h | 68 ++
>>>  kernel/printk.c        |  7 --
>>>  2 files changed, 68 insertions(+), 7 deletions(-)
>>>
> ...
>>> +
>>> +static inline void free_syslog_ns(struct kref *kref)
>>> +{
>>> +	struct syslog_namespace *ns;
>>> +	ns = container_of(kref, struct syslog_namespace, kref);
>>> +
>>> +	kfree(ns->log_buf);
>>> +	kfree(ns);
>>> +}
>>
>> This interface seems a bit ugly, why not use the format like put_syslog_ns()?
>>
>> static inline void free_syslog_ns(struct syslog_namespace *ns)
>
> free_syslog_ns is used in put_syslog_ns, and kref_put passes the kref to
> its release function. You can see that from
> static inline int kref_put(struct kref *kref,
>			     void (*release)(struct kref *kref)).

Got it.

Regards,
Gu

> Thanks.
>
>>> +
>>> +static inline void put_syslog_ns(struct syslog_namespace *ns)
>>> +{
>>> +	if (ns)
>>> +		kref_put(&ns->kref, free_syslog_ns);
>>> +}
>>> +
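The point of this exchange, that the release callback receives the embedded kref and must recover the enclosing object with container_of(), can be shown with a small userspace model. This is a sketch only: the struct kref below is a simplified, non-atomic stand-in for the kernel's, syslog_namespace is reduced to two members, and freed_count exists purely so the behaviour can be observed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified, non-atomic stand-in for the kernel's struct kref. */
struct kref {
	int refcount;
};

static void kref_init(struct kref *k) { k->refcount = 1; }
static void kref_get(struct kref *k) { k->refcount++; }

/* Calls release() on the last put; returns 1 if the object was released. */
static int kref_put(struct kref *k, void (*release)(struct kref *k))
{
	if (--k->refcount == 0) {
		release(k);
		return 1;
	}
	return 0;
}

/* Reduced to the members this sketch needs. */
struct syslog_namespace {
	struct kref kref;
	char *log_buf;
};

static int freed_count;	/* illustration only */

/* The callback only gets the kref, so it maps back to the namespace. */
static void free_syslog_ns(struct kref *kref)
{
	struct syslog_namespace *ns =
		container_of(kref, struct syslog_namespace, kref);

	free(ns->log_buf);
	free(ns);
	freed_count++;
}

static struct syslog_namespace *get_syslog_ns(struct syslog_namespace *ns)
{
	if (ns)
		kref_get(&ns->kref);
	return ns;
}

static void put_syslog_ns(struct syslog_namespace *ns)
{
	if (ns)
		kref_put(&ns->kref, free_syslog_ns);
}
```

Callers only ever see get_syslog_ns()/put_syslog_ns(), which take the namespace pointer; the kref-based signature of free_syslog_ns() is an internal detail imposed by kref_put().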
Re: [PATCH 7/9] syslog_ns: implement function for creating syslog ns
On 07/30/2013 11:39 AM, Rui Xiang wrote:
> On 2013/7/29 18:25, Gu Zheng wrote:
>> Hi Rui,
>>
>> On 07/29/2013 10:31 AM, Rui Xiang wrote:
>>> Add create_syslog_ns function to create a new ns. We
>>> must create a user_ns before create a new syslog ns.
>>> And then tie the new syslog_ns to current user_ns
>>> instead of original syslog_ns which comes from
>>> parent user_ns.
>>>
> ...
>>> diff --git a/kernel/printk.c b/kernel/printk.c
>>> index fd2d600..6b561db 100644
>>> --- a/kernel/printk.c
>>> +++ b/kernel/printk.c
>>> @@ -384,6 +384,10 @@ static int check_syslog_permissions(int type, bool from_file,
>>>  			|| type == SYSLOG_ACTION_CONSOLE_LEVEL)
>>>  		ns = &init_syslog_ns;
>>>  
>>> +	/* create a new syslog ns */
>>> +	if (type == SYSLOG_ACTION_NEW_NS)
>>> +		return 0;
>>> +
>>
>> Don't we need further permission or caps check here? Return success
>> directly seems sloppy.
>
> CAP_SYSLOG is checked in create_syslog_ns, so I think we can return 0
> temporarily.

If so, why not move the check here? IMO, the earlier permission checking
is done, the better. What's your opinion?

Regards,
Gu
Re: [PATCH] driver core / ACPI: Avoid device removal locking problems
Hi Rafael,

On 08/26/2013 04:09 AM, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
>
> There are two mutexes, device_hotplug_lock and acpi_scan_lock, held
> around the acpi_bus_trim() call in acpi_scan_hot_remove() which
> generally removes devices (it removes ACPI device objects at least,
> but it may also remove physical device objects through .detach()
> callbacks of ACPI scan handlers). Thus, potentially, device sysfs
> attributes are removed under these locks and to remove those
> attributes it is necessary to hold the s_active references of their
> directory entries for writing.
>
> On the other hand, the execution of a .show() or .store() callback
> from a sysfs attribute is carried out with that attribute's s_active
> reference held for reading. Consequently, if any device sysfs
> attribute that may be removed from within acpi_scan_hot_remove()
> through acpi_bus_trim() has a .store() or .show() callback which
> acquires either acpi_scan_lock or device_hotplug_lock, the execution
> of that callback may deadlock with the removal of the attribute.
> [Unfortunately, the "online" device attribute of CPUs and memory
> blocks and the "eject" attribute of ACPI device objects are affected
> by this issue.]
>
> To avoid those deadlocks introduce a new protection mechanism that
> can be used by the device sysfs attributes in question. Namely,
> if a device sysfs attribute's .store() or .show() callback routine
> is about to acquire device_hotplug_lock or acpi_scan_lock, it can
> first execute read_lock_device_remove() and return an error code if
> that function returns false. If true is returned, the lock in
> question may be acquired and read_unlock_device_remove() must be
> called. [This mechanism is implemented by means of an additional
> rwsem in drivers/base/core.c.]
>
> Make the affected sysfs attributes in the driver core and ACPI core
> use read_lock_device_remove() and read_unlock_device_remove() as
> described above.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
> Reported-by: Gu Zheng <guz.f...@cn.fujitsu.com>

I'm sorry, I forgot to mention that the original reporter is
Yasuaki Ishimatsu <isimatu.yasu...@jp.fujitsu.com>. I continued
the investigation and found more issues.

We tested this patch on kernel 3.11-rc6, but it seems that the
issue is still there. Detailed info follows.

Thanks,
Gu

==
[ INFO: possible circular locking dependency detected ]
3.11.0-rc6-lockdebug-refea+ #162 Tainted: GF
---
kworker/0:2/754 is trying to acquire lock:
 (s_active#73){.+}, at: [8121062b] sysfs_addrm_finish+0x3b/0x70

but task is already holding lock:
 (mem_sysfs_mutex){+.+.+.}, at: [813b949d] remove_memory_block+0x1d/0xa0

which lock already depends on the new lock.

the existing dependency chain (in reverse order
[PATCH] drivers/base/memory.c: introduce help macro to_memory_block
Introduce the helper macro to_memory_block to hide the conversion
(device --> memory_block). This is just a cleanup.

Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com
---
 drivers/base/memory.c |   27 ++++++++++++---------------
 1 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 2b7813e..4a874c6 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -30,6 +30,8 @@ static DEFINE_MUTEX(mem_sysfs_mutex);
 
 #define MEMORY_CLASS_NAME	"memory"
 
+#define to_memory_block(dev) container_of(dev, struct memory_block, dev)
+
 static int sections_per_block;
 
 static inline int base_memory_block_id(int section_nr)
@@ -77,7 +79,7 @@ EXPORT_SYMBOL(unregister_memory_isolate_notifier);
 
 static void memory_block_release(struct device *dev)
 {
-	struct memory_block *mem = container_of(dev, struct memory_block, dev);
+	struct memory_block *mem = to_memory_block(dev);
 
 	kfree(mem);
 }
@@ -110,8 +112,7 @@ static unsigned long get_memory_block_size(void)
 static ssize_t show_mem_start_phys_index(struct device *dev,
 			struct device_attribute *attr, char *buf)
 {
-	struct memory_block *mem =
-		container_of(dev, struct memory_block, dev);
+	struct memory_block *mem = to_memory_block(dev);
 	unsigned long phys_index;
 
 	phys_index = mem->start_section_nr / sections_per_block;
@@ -121,8 +122,7 @@ static ssize_t show_mem_start_phys_index(struct device *dev,
 static ssize_t show_mem_end_phys_index(struct device *dev,
 			struct device_attribute *attr, char *buf)
 {
-	struct memory_block *mem =
-		container_of(dev, struct memory_block, dev);
+	struct memory_block *mem = to_memory_block(dev);
 	unsigned long phys_index;
 
 	phys_index = mem->end_section_nr / sections_per_block;
@@ -137,8 +137,7 @@ static ssize_t show_mem_removable(struct device *dev,
 {
 	unsigned long i, pfn;
 	int ret = 1;
-	struct memory_block *mem =
-		container_of(dev, struct memory_block, dev);
+	struct memory_block *mem = to_memory_block(dev);
 
 	for (i = 0; i < sections_per_block; i++) {
 		pfn = section_nr_to_pfn(mem->start_section_nr + i);
@@ -154,8 +153,7 @@ static ssize_t show_mem_removable(struct device *dev,
 static ssize_t show_mem_state(struct device *dev,
 			struct device_attribute *attr, char *buf)
 {
-	struct memory_block *mem =
-		container_of(dev, struct memory_block, dev);
+	struct memory_block *mem = to_memory_block(dev);
 	ssize_t len = 0;
 
 	/*
@@ -280,7 +278,7 @@ static int __memory_block_change_state(struct memory_block *mem,
 
 static int memory_subsys_online(struct device *dev)
 {
-	struct memory_block *mem = container_of(dev, struct memory_block, dev);
+	struct memory_block *mem = to_memory_block(dev);
 	int ret;
 
 	mutex_lock(&mem->state_mutex);
@@ -295,7 +293,7 @@ static int memory_subsys_online(struct device *dev)
 
 static int memory_subsys_offline(struct device *dev)
 {
-	struct memory_block *mem = container_of(dev, struct memory_block, dev);
+	struct memory_block *mem = to_memory_block(dev);
 	int ret;
 
 	mutex_lock(&mem->state_mutex);
@@ -349,7 +347,7 @@ store_mem_state(struct device *dev,
 	bool offline;
 	int ret = -EINVAL;
 
-	mem = container_of(dev, struct memory_block, dev);
+	mem = to_memory_block(dev);
 
 	lock_device_hotplug();
 
@@ -392,8 +390,7 @@ store_mem_state(struct device *dev,
 static ssize_t show_phys_device(struct device *dev,
 				struct device_attribute *attr, char *buf)
 {
-	struct memory_block *mem =
-		container_of(dev, struct memory_block, dev);
+	struct memory_block *mem = to_memory_block(dev);
 	return sprintf(buf, "%d\n", mem->phys_device);
 }
 
@@ -525,7 +522,7 @@ struct memory_block *find_memory_block_hinted(struct mem_section *section,
 		put_device(&hint->dev);
 	if (!dev)
 		return NULL;
-	return container_of(dev, struct memory_block, dev);
+	return to_memory_block(dev);
 }
 
 /*
-- 
1.7.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
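
Editorial sketch: for readers unfamiliar with container_of(), the following self-contained userspace C shows what the to_memory_block() helper does. It is a minimal illustration under assumptions, not the kernel code: the two structs are cut-down stand-ins, container_of is a simplified version without the kernel's type checking, and block_start_section() is a hypothetical function. The macro parameter is deliberately named d rather than dev, because the preprocessor substitutes every occurrence of the parameter name in the replacement list, including the member-name position; with the parameter named dev, the macro only expands correctly when the argument is itself spelled dev.

```c
#include <assert.h>
#include <stddef.h>	/* offsetof */

/* Cut-down stand-ins for the kernel structures (illustrative only). */
struct device {
	int id;
};

struct memory_block {
	unsigned long start_section_nr;
	struct device dev;	/* embedded, as in drivers/base/memory.c */
};

/* Simplified userspace container_of; the kernel version adds type checks. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Same idea as the patch's helper; parameter renamed to d (see note above). */
#define to_memory_block(d) container_of(d, struct memory_block, dev)

/* Mimics a sysfs show() callback that only receives the embedded device. */
unsigned long block_start_section(struct device *dev)
{
	struct memory_block *mem;

	assert(dev != NULL);
	mem = to_memory_block(dev);
	return mem->start_section_nr;
}
```

Wrapping the conversion in one macro means a later change to how struct device is embedded touches a single line rather than every callback.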
Re: [PATCH] driver core / ACPI: Avoid device removal locking problems
Hi Rafael,

On 08/26/2013 10:43 PM, Rafael J. Wysocki wrote:

> On Monday, August 26, 2013 02:42:09 PM Rafael J. Wysocki wrote:
> > On Monday, August 26, 2013 11:13:13 AM Gu Zheng wrote:
> > > Hi Rafael,
> >
> > Hi,
> >
> > > On 08/26/2013 04:09 AM, Rafael J. Wysocki wrote:
> > > > From: Rafael J. Wysocki rafael.j.wyso...@intel.com
> > > >
> > > > There are two mutexes, device_hotplug_lock and acpi_scan_lock,
> > > > held around the acpi_bus_trim() call in acpi_scan_hot_remove()
> > > > which generally removes devices (it removes ACPI device objects
> > > > at least, but it may also remove physical device objects through
> > > > .detach() callbacks of ACPI scan handlers). Thus, potentially,
> > > > device sysfs attributes are removed under these locks and to
> > > > remove those attributes it is necessary to hold the s_active
> > > > references of their directory entries for writing.
> > > >
> > > > On the other hand, the execution of a .show() or .store()
> > > > callback from a sysfs attribute is carried out with that
> > > > attribute's s_active reference held for reading. Consequently,
> > > > if any device sysfs attribute that may be removed from within
> > > > acpi_scan_hot_remove() through acpi_bus_trim() has a .store() or
> > > > .show() callback which acquires either acpi_scan_lock or
> > > > device_hotplug_lock, the execution of that callback may deadlock
> > > > with the removal of the attribute. [Unfortunately, the online
> > > > device attribute of CPUs and memory blocks and the eject
> > > > attribute of ACPI device objects are affected by this issue.]
> > > >
> > > > To avoid those deadlocks introduce a new protection mechanism
> > > > that can be used by the device sysfs attributes in question.
> > > > Namely, if a device sysfs attribute's .store() or .show()
> > > > callback routine is about to acquire device_hotplug_lock or
> > > > acpi_scan_lock, it can first execute read_lock_device_remove()
> > > > and return an error code if that function returns false. If
> > > > true is returned, the lock in question may be acquired and
> > > > read_unlock_device_remove() must be called. [This mechanism is
> > > > implemented by means of an additional rwsem in
> > > > drivers/base/core.c.]
> > > >
> > > > Make the affected sysfs attributes in the driver core and ACPI
> > > > core use read_lock_device_remove() and
> > > > read_unlock_device_remove() as described above.
> > > >
> > > > Signed-off-by: Rafael J. Wysocki rafael.j.wyso...@intel.com
> > > > Reported-by: Gu Zheng guz.f...@cn.fujitsu.com
> > >
> > > Sorry, I forgot to mention that the original reporter is
> > > Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com. I continued the
> > > investigation and found more issues.
> > >
> > > We tested this patch on kernel 3.11-rc6, but it seems that the
> > > issue is still there. Details follow.
> >
> > Well, taking pm_mutex under acpi_scan_lock (trace #2) is a bad idea
> > anyway, because we'll need to take acpi_scan_lock during system
> > suspend for PCI hot remove to work and that's under pm_mutex. So I
> > wonder if we can simply drop the system sleep locking from
> > lock/unlock_memory_hotplug(). But that's a side note, because
> > dropping it won't help here.
> >
> > Now -
> >
> > > ======================================================
> > > [ INFO: possible circular locking dependency detected ]
> > > 3.11.0-rc6-lockdebug-refea+ #162 Tainted: GF
> > > -------------------------------------------------------
> > > kworker/0:2/754 is trying to acquire lock:
> > >  (s_active#73){.+}, at: [8121062b] sysfs_addrm_finish+0x3b/0x70
> > >
> > > but task is already holding lock:
> > >  (mem_sysfs_mutex){+.+.+.}, at: [813b949d] remove_memory_block+0x1d/0xa0
> > >
> > > which lock already depends on the new lock
Re: [PATCH] driver core / ACPI: Avoid device removal locking problems
Hi Rafael,

On 08/26/2013 11:02 PM, Rafael J. Wysocki wrote:

> On Monday, August 26, 2013 04:43:26 PM Rafael J. Wysocki wrote:
> > On Monday, August 26, 2013 02:42:09 PM Rafael J. Wysocki wrote:
> > > On Monday, August 26, 2013 11:13:13 AM Gu Zheng wrote:
> > > > Hi Rafael,
> >
> > [...]
> >
> > OK, so the patch below is quick and dirty and overkill, but it
> > should make the splat go away at least.
>
> And if this patch does make the splat go away for you, please also
> test the appended one (Tejun, thanks for the hint!).

Yes, this one works too, and as expected, the ACPI part is still there.

Thanks,
Gu

======================================================
[ INFO: possible circular locking dependency detected ]
3.11.0-rc6-fix-refeal-fix-01+ #171 Tainted: GF
-------------------------------------------------------
kworker/0:1/96 is trying to acquire lock:
 (s_active#245){.+}, at: [8121062b] sysfs_addrm_finish+0x3b/0x70

but task is already holding lock:
 (device_hotplug_lock){+.+.+.}, at: [813a16b7] lock_device_hotplug+0x17/0x20

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (device_hotplug_lock){+.+.+.}:
       [810ba88c] validate_chain+0x70c/0x870
       [810bad5f] __lock_acquire+0x36f/0x5f0
       [810bb080] lock_acquire+0xa0/0x130
       [8159779b] mutex_lock_nested+0x7b/0x3b0
       [813a16b7] lock_device_hotplug+0x17/0x20
       [8131c131] acpi_scan_bus_device_check+0x33/0x10f
       [8131c220] acpi_scan_device_check+0x13/0x15
       [81315dac] acpi_os_execute_deferred+0x27/0x34
       [8106bec8] process_one_work+0x1e8/0x560
       [8106d0a0] worker_thread+0x120/0x3a0
       [81073b5e] kthread+0xee/0x100
       [815a5fdc] ret_from_fork+0x7c/0xb0

-> #1 (acpi_scan_lock){+.+.+.}:
       [810ba88c] validate_chain+0x70c/0x870
       [810bad5f] __lock_acquire+0x36f/0x5f0
       [810bb080] lock_acquire+0xa0/0x130
       [8159779b] mutex_lock_nested+0x7b/0x3b0
       [8131a58a] acpi_eject_store+0x88/0x170
       [813a0f40] dev_attr_store+0x20/0x30
       [8120ed96] sysfs_write_file+0xe6/0x170
       [81195bc8] vfs_write+0xc8/0x170
Re: [PATCH] f2fs: fix omitting to update inode page
On 08/26/2013 08:28 PM, Jaegeuk Kim wrote:

> The f2fs_set_link updates its parent inode number, so we should sync
> this to the inode block. Otherwise, the data can be lost after
> sudden-power-off.
>
> Signed-off-by: Jaegeuk Kim jaegeuk@samsung.com
> ---
>  fs/f2fs/namei.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c
> index 4e47518..9e90d31 100644
> --- a/fs/f2fs/namei.c
> +++ b/fs/f2fs/namei.c
> @@ -447,6 +447,7 @@ static int f2fs_rename(struct inode *old_dir, struct dentry *old_dentry,
>  			else
>  				release_orphan_inode(sbi);
>  
> +			update_inode_page(old_inode):

':' --> ';'

>  			update_inode_page(new_inode);
>  		} else {
>  			err = f2fs_add_link(new_dentry, old_inode);
Re: [PATCH] driver core / ACPI: Avoid device removal locking problems
Hi Rafael,

On 08/26/2013 11:02 PM, Rafael J. Wysocki wrote:

> On Monday, August 26, 2013 04:43:26 PM Rafael J. Wysocki wrote:
> > On Monday, August 26, 2013 02:42:09 PM Rafael J. Wysocki wrote:
> > > On Monday, August 26, 2013 11:13:13 AM Gu Zheng wrote:
> > > > Hi Rafael,
> >
> > [...]
> >
> > OK, so the patch below is quick and dirty and overkill, but it
> > should make the splat go away at least.
>
> And if this patch does make the splat go away for you, please also
> test the appended one (Tejun, thanks for the hint!).
>
> I'll address the ACPI part differently later.

What about changing device_hotplug_lock and acpi_scan_lock to rwsem?
like the attached one(With a preliminary test, it also can make the
splat go away).:)

Regards,
Gu

> [...]

From f1682ceaef4105f75f4d6a0bb8e77c8a5dde365b Mon Sep 17 00:00:00 2001
From: Gu Zheng guz.f...@cn.fujitsu.com
Date: Tue, 27 Aug 2013 17:59:55 +0900
Subject: [PATCH] acpi: fix removal lock dep

Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com
---
 drivers/acpi/scan.c    |   43 ++-
 drivers/acpi/sysfs.c   |    7 +--
 drivers/base/core.c    |   45 -
 drivers/base/memory.c  |    5 +++--
 include/linux/device.h |    8 ++--
 5 files changed, 72 insertions(+), 36 deletions(-)

diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
index 8a46c92..bb41760 100644
--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -36,7 +36,7 @@ bool acpi_force_hot_remove;
 
 static const char *dummy_hid = "device";
 static LIST_HEAD(acpi_bus_id_list);
-static DEFINE_MUTEX(acpi_scan_lock);
+static DECLARE_RWSEM(acpi_scan_rwsem);
 static LIST_HEAD(acpi_scan_handlers_list);
 DEFINE_MUTEX(acpi_device_lock);
 LIST_HEAD(acpi_wakeup_device_list);
@@ -49,13 +49,13 @@ struct acpi_device_bus_id{
 
 void acpi_scan_lock_acquire(void)
 {
-	mutex_lock(&acpi_scan_lock);
+	down_write(&acpi_scan_rwsem);
 }
 EXPORT_SYMBOL_GPL(acpi_scan_lock_acquire);
 
 void acpi_scan_lock_release(void)
 {
-	mutex_unlock(&acpi_scan_lock);
+	up_write(&acpi_scan_rwsem);
 }
 EXPORT_SYMBOL_GPL(acpi_scan_lock_release);
 
@@ -207,7 +207,7 @@ static int acpi_scan_hot_remove(struct acpi_device *device)
 		return -EINVAL;
 	}
 
-	lock_device_hotplug();
+	device_hotplug_begin();
 
 	/*
 	 * Carry out two passes here and ignore errors in the first pass,
@@ -240,7 +240,7 @@ static int acpi_scan_hot_remove(struct acpi_device *device)
 					    acpi_bus_online_companions, NULL,
 					    NULL, NULL);
 
-			unlock_device_hotplug();
+			device_hotplug_end();
 
 			put_device(&device->dev);
 			return -EBUSY;
@@ -252,7 +252,7 @@ static int acpi_scan_hot_remove(struct acpi_device *device)
 
 	acpi_bus_trim(device);
 
-	unlock_device_hotplug();
+	device_hotplug_end();
 
 	/* Device node has been unregistered. */
 	put_device(&device->dev);
@@ -308,7 +308,7 @@ static void acpi_bus_device_eject(void *context)
 	struct acpi_scan_handler *handler;
 	u32 ost_code = ACPI_OST_SC_NON_SPECIFIC_FAILURE;
 
-	mutex_lock(&acpi_scan_lock);
+	acpi_scan_lock_acquire();
 
 	acpi_bus_get_device(handle, &device);
 	if (!device)
@@ -334,7 +334,7 @@ static void acpi_bus_device_eject(void *context)
 	}
 
  out:
-	mutex_unlock(&acpi_scan_lock);
+	acpi_scan_lock_release();
 	return;
 
  err_out:
@@ -349,8 +349,8 @@ static void acpi_scan_bus_device_check(acpi_handle handle, u32 ost_source)
 	u32 ost_code = ACPI_OST_SC_NON_SPECIFIC_FAILURE;
 	int error;
 
-	mutex_lock(&acpi_scan_lock);
-	lock_device_hotplug();
+	acpi_scan_lock_acquire();
+	device_hotplug_begin();
 
 	if (ost_source != ACPI_NOTIFY_BUS_CHECK) {
 		acpi_bus_get_device(handle, &device);
@@ -376,9 +376,9 @@ static void acpi_scan_bus_device_check(acpi_handle handle, u32 ost_source)
 		kobject_uevent(&device->dev.kobj, KOBJ_ONLINE);
 
  out:
-	unlock_device_hotplug();
+	device_hotplug_end();
 	acpi_evaluate_hotplug_ost(handle, ost_source, ost_code, NULL);
-	mutex_unlock(&acpi_scan_lock);
+	acpi_scan_lock_release();
 }
 
 static void acpi_scan_bus_check(void *context)
@@ -469,15 +469,14 @@ void acpi_bus_hot_remove_device(void *context)
 	acpi_handle handle = device->handle;
 	int error;
 
-	mutex_lock(&acpi_scan_lock);
+	acpi_scan_lock_acquire();
 
 	error = acpi_scan_hot_remove(device);
 	if (error && handle)
 		acpi_evaluate_hotplug_ost(handle, ej_event->event
Re: [PATCH] driver core / ACPI: Avoid device removal locking problems
Hi Toshi,

On 08/28/2013 05:38 AM, Toshi Kani wrote:

> On Tue, 2013-08-27 at 17:21 +0800, Gu Zheng wrote:
> > Hi Rafael,
> >
> > On 08/26/2013 11:02 PM, Rafael J. Wysocki wrote:
> > > On Monday, August 26, 2013 04:43:26 PM Rafael J. Wysocki wrote:
> > > > On Monday, August 26, 2013 02:42:09 PM Rafael J. Wysocki wrote:
> > > > > On Monday, August 26, 2013 11:13:13 AM Gu Zheng wrote:
> > > > > > Hi Rafael,
> > > >
> > > > [...]
> > > >
> > > > OK, so the patch below is quick and dirty and overkill, but it
> > > > should make the splat go away at least.
> > >
> > > And if this patch does make the splat go away for you, please
> > > also test the appended one (Tejun, thanks for the hint!).
> > >
> > > I'll address the ACPI part differently later.
> >
> > What about changing device_hotplug_lock and acpi_scan_lock to rwsem?
> > like the attached one(With a preliminary test, it also can make the
> > splat go away).:)
>
> I am curious how msleep(10) & restart_syscall() work in the change
> below. Doesn't the msleep() make s_active held longer time, which can
> lead the thread holding device_hotplug_lock to wait it for deletion?

Yes, but it avoids busy waiting.

> Also, does restart_syscall() release s_active and reopen this file
> again?

Sure. It just sets the TIF_SIGPENDING flag and returns -ERESTARTNOINTR;
s_active and the file are released/closed on the failure path. When
do_signal() sees -ERESTARTNOINTR, it adjusts the registers to restart
the syscall.

Thanks,
Gu

> > @@ -408,9 +408,13 @@ static ssize_t show_online(struct device *dev, struct device_attribute *attr,
> >  {
> >  	bool val;
> >  
> > -	lock_device_hotplug();
> > +	if (!read_lock_device_hotplug()) {
> > +		msleep(10);
> > +		return restart_syscall();
> > +	}
> > +
>
> Thanks,
> -Toshi
[PATCH RESEND] drivers/base/memory.c: introduce help macro to_memory_block
Introduce the helper macro to_memory_block to hide the conversion
(device --> memory_block). This is just a cleanup.

Reviewed-by: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com
Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com
---
 drivers/base/memory.c |   29 ++++++++++++-----------------
 1 files changed, 12 insertions(+), 17 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 2a38cd2..69e09a1 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -29,6 +29,8 @@ static DEFINE_MUTEX(mem_sysfs_mutex);
 
 #define MEMORY_CLASS_NAME	"memory"
 
+#define to_memory_block(dev) container_of(dev, struct memory_block, dev)
+
 static int sections_per_block;
 
 static inline int base_memory_block_id(int section_nr)
@@ -76,7 +78,7 @@ EXPORT_SYMBOL(unregister_memory_isolate_notifier);
 
 static void memory_block_release(struct device *dev)
 {
-	struct memory_block *mem = container_of(dev, struct memory_block, dev);
+	struct memory_block *mem = to_memory_block(dev);
 
 	kfree(mem);
 }
@@ -109,8 +111,7 @@ static unsigned long get_memory_block_size(void)
 static ssize_t show_mem_start_phys_index(struct device *dev,
 			struct device_attribute *attr, char *buf)
 {
-	struct memory_block *mem =
-		container_of(dev, struct memory_block, dev);
+	struct memory_block *mem = to_memory_block(dev);
 	unsigned long phys_index;
 
 	phys_index = mem->start_section_nr / sections_per_block;
@@ -120,8 +121,7 @@ static ssize_t show_mem_start_phys_index(struct device *dev,
 static ssize_t show_mem_end_phys_index(struct device *dev,
 			struct device_attribute *attr, char *buf)
 {
-	struct memory_block *mem =
-		container_of(dev, struct memory_block, dev);
+	struct memory_block *mem = to_memory_block(dev);
 	unsigned long phys_index;
 
 	phys_index = mem->end_section_nr / sections_per_block;
@@ -136,8 +136,7 @@ static ssize_t show_mem_removable(struct device *dev,
 {
 	unsigned long i, pfn;
 	int ret = 1;
-	struct memory_block *mem =
-		container_of(dev, struct memory_block, dev);
+	struct memory_block *mem = to_memory_block(dev);
 
 	for (i = 0; i < sections_per_block; i++) {
 		pfn = section_nr_to_pfn(mem->start_section_nr + i);
@@ -153,8 +152,7 @@ static ssize_t show_mem_removable(struct device *dev,
 static ssize_t show_mem_state(struct device *dev,
 			struct device_attribute *attr, char *buf)
 {
-	struct memory_block *mem =
-		container_of(dev, struct memory_block, dev);
+	struct memory_block *mem = to_memory_block(dev);
 	ssize_t len = 0;
 
 	/*
@@ -282,7 +280,7 @@ static int memory_block_change_state(struct memory_block *mem,
 /* The device lock serializes operations on memory_subsys_[online|offline] */
 static int memory_subsys_online(struct device *dev)
 {
-	struct memory_block *mem = container_of(dev, struct memory_block, dev);
+	struct memory_block *mem = to_memory_block(dev);
 	int ret;
 
 	if (mem->state == MEM_ONLINE)
@@ -306,7 +304,7 @@ static int memory_subsys_online(struct device *dev)
 
 static int memory_subsys_offline(struct device *dev)
 {
-	struct memory_block *mem = container_of(dev, struct memory_block, dev);
+	struct memory_block *mem = to_memory_block(dev);
 
 	if (mem->state == MEM_OFFLINE)
 		return 0;
@@ -318,11 +316,9 @@ static ssize_t
 store_mem_state(struct device *dev,
 		struct device_attribute *attr, const char *buf, size_t count)
 {
-	struct memory_block *mem;
+	struct memory_block *mem = to_memory_block(dev);
 	int ret, online_type;
 
-	mem = container_of(dev, struct memory_block, dev);
-
 	lock_device_hotplug();
 
 	if (!strncmp(buf, "online_kernel", min_t(int, count, 13)))
@@ -376,8 +372,7 @@ store_mem_state(struct device *dev,
 static ssize_t show_phys_device(struct device *dev,
 				struct device_attribute *attr, char *buf)
 {
-	struct memory_block *mem =
-		container_of(dev, struct memory_block, dev);
+	struct memory_block *mem = to_memory_block(dev);
 	return sprintf(buf, "%d\n", mem->phys_device);
 }
 
@@ -509,7 +504,7 @@ struct memory_block *find_memory_block_hinted(struct mem_section *section,
 		put_device(&hint->dev);
 	if (!dev)
 		return NULL;
-	return container_of(dev, struct memory_block, dev);
+	return to_memory_block(dev);
 }
 
 /*
-- 
1.7.7
Re: [PATCH] driver core / ACPI: Avoid device removal locking problems
Hi Rafael,

On 08/28/2013 05:45 AM, Rafael J. Wysocki wrote:

> On Tuesday, August 27, 2013 02:36:44 PM Tejun Heo wrote:
> > Hello,
>
> [...]
>
> I've thought about that a bit over the last several hours and I'm
> still thinking that that patch is a bit overkill, because it will
> trigger the restart_syscall() for all cases when device_hotplug_lock
> is locked, even if they can't lead to any deadlocks. The only
> deadlockish situation is when device *removal* is in progress when
> store_online(), for example, is called.
>
> So to address that particular situation without adding too much
> overhead for other cases, I've come up with the appended patch
> (untested for now).
>
> This is how it is supposed to work.
>
> There are three lock levels for device hotplug, normal, remove and
> weak. The difference is related to how __lock_device_hotplug() works.
> Namely, if device hotplug is currently locked, that function will
> either block or return false, depending on the current lock level and
> its argument (the new lock level). The rules here are that false is
> returned immediately if the current lock level is remove and the new
> lock level is weak. The function blocks for all other combinations
> of the two.
>
> There are two functions supposed to use device hotplug lock levels
> other than normal: store_online() and acpi_scan_hot_remove().
> Everybody else is supposed to use normal (well, there are more
> potential users of weak in drivers/base/memory.c).
>
> acpi_scan_hot_remove() uses the remove lock level to indicate that it
> is going to remove devices while holding device hotplug locked. In
> turn, store_online() uses the weak lock level so that it doesn't
> block when devices are being removed with device hotplug locked,
> because that may lead to a deadlock.
>
> show_online() actually doesn't need to lock device hotplug, but it is
> useful to serialize it with respect to device_offline() and
> device_online() (in case user space attempts to run them
> concurrently).

Yeah. I tested this one on the latest kernel tree, and it does make the
splat go away. Looking forward to the ACPI part.:)

Regards,
Gu

> ---
>  drivers/acpi/scan.c    |    4 +-
>  drivers/base/core.c    |   72 ++---
>  include/linux/device.h |   25 -
>  3 files changed, 83 insertions(+), 18 deletions(-)
>
> Index: linux-pm/drivers/base/core.c
> ===================================================================
> --- linux-pm.orig/drivers/base/core.c
> +++ linux-pm/drivers/base/core.c
> @@ -49,6 +49,55 @@ static struct kobject *dev_kobj;
>  struct kobject *sysfs_dev_char_kobj;
>  struct kobject *sysfs_dev_block_kobj;
>  
> +static struct {
> +	struct task_struct *holder;
> +	enum dev_hotplug_lock_type type;
> +	struct mutex lock; /* Synchronizes accesses to holder and type */
> +	wait_queue_head_t wait_queue;
> +} device_hotplug = {
> +	.holder = NULL,
> +	.type = DEV_HOTPLUG_LOCK_NONE,
> +	.lock = __MUTEX_INITIALIZER(device_hotplug.lock),
> +	.wait_queue = __WAIT_QUEUE_HEAD_INITIALIZER(device_hotplug.wait_queue),
> +};
> +
> +bool __lock_device_hotplug(enum dev_hotplug_lock_type type)
> +{
> +	DEFINE_WAIT(wait);
> +	bool ret = true;
> +
> +	mutex_lock(&device_hotplug.lock);
> +	for (;;) {
> +		prepare_to_wait(&device_hotplug.wait_queue, &wait,
> +				TASK_UNINTERRUPTIBLE);
> +		if (!device_hotplug.holder) {
> +			device_hotplug.holder = current;
> +			device_hotplug.type = type;
> +			break;
> +		} else if (type == DEV_HOTPLUG_LOCK_WEAK &&
> +			   device_hotplug.type == DEV_HOTPLUG_LOCK_REMOVE) {
> +			ret = false;
> +			break;
> +		}
> +		mutex_unlock(&device_hotplug.lock);
> +		schedule();
> +		mutex_lock(&device_hotplug.lock);
> +	}
> +	finish_wait(&device_hotplug.wait_queue, &wait);
> +	mutex_unlock(&device_hotplug.lock);
> +	return ret;
> +}
> +
> +void unlock_device_hotplug(void)
> +{
> +	mutex_lock(&device_hotplug.lock);
> +	BUG_ON(device_hotplug.holder != current);
> +	device_hotplug.holder = NULL;
> +	device_hotplug.type = DEV_HOTPLUG_LOCK_NONE;
> +	wake_up(&device_hotplug.wait_queue);
> +	mutex_unlock(&device_hotplug.lock);
> +}
> +
>  #ifdef CONFIG_BLOCK
>  static inline int device_is_not_partition(struct device *dev)
>  {
> @@ -408,9 +457,10 @@ static ssize_t show_online(struct device
>  {
>  	bool val;
>  
> -	lock_device_hotplug();
> +	/* Serialize against device_online() and device_offline(). */
> +	device_lock(dev);
>  	val = !dev->offline;
> -	unlock_device_hotplug();
> +	device_unlock(dev);
>  	return sprintf(buf, "%u\n", val);
>  }
>  
> @@ -424,7 +474,11 @@ static ssize_t store_online(struct devic
>  	if (ret < 0)
>  		return ret;
>  
> -	lock_device_hotplug();
> +	if (!__lock_device_hotplug(DEV_HOTPLUG_LOCK_WEAK)) {
> +		/* Avoid
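
Editorial sketch: the lock-level rule from the message above (a weak request fails immediately while a remove-level holder exists, and every other combination blocks) can be mimicked in userspace with a mutex and a condition variable. This is a rough analogy under assumptions, not the posted kernel patch: struct hotplug_lock, the LEVEL_* names, and both functions are hypothetical, and pthread primitives replace the kernel's mutex and wait queue.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical names; they loosely mirror the levels in the message. */
enum lock_level { LEVEL_NONE, LEVEL_NORMAL, LEVEL_REMOVE, LEVEL_WEAK };

struct hotplug_lock {
	pthread_mutex_t mtx;		/* protects held and level */
	pthread_cond_t released;	/* signalled when the lock is dropped */
	bool held;
	enum lock_level level;
};

#define HOTPLUG_LOCK_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false, LEVEL_NONE }

/* Returns false only for a weak request while a remove-level holder exists. */
bool hotplug_lock_acquire(struct hotplug_lock *l, enum lock_level want)
{
	bool ok = true;

	pthread_mutex_lock(&l->mtx);
	while (l->held) {
		if (want == LEVEL_WEAK && l->level == LEVEL_REMOVE) {
			ok = false;	/* caller should back off and retry later */
			break;
		}
		/* Any other combination waits for the current holder. */
		pthread_cond_wait(&l->released, &l->mtx);
	}
	if (ok) {
		l->held = true;
		l->level = want;
	}
	pthread_mutex_unlock(&l->mtx);
	return ok;
}

void hotplug_lock_release(struct hotplug_lock *l)
{
	pthread_mutex_lock(&l->mtx);
	assert(l->held);
	l->held = false;
	l->level = LEVEL_NONE;
	pthread_cond_broadcast(&l->released);
	pthread_mutex_unlock(&l->mtx);
}
```

In this sketch only the sysfs-style caller (weak) ever sees a failure, which matches the stated goal of not penalizing callers that cannot deadlock.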
[PATCH] f2fs: fix a compound statement label error
From 685b72b66cb8ce019429b1958c91f346b260bc65 Mon Sep 17 00:00:00 2001
From: Gu Zheng guz.f...@cn.fujitsu.com
Date: Mon, 19 Aug 2013 09:41:15 +0800
Subject: [PATCH] f2fs: fix a compound statement label error

A "label at end of compound statement" error occurs if
CONFIG_F2FS_STAT_FS is disabled:

fs/f2fs/segment.c:556:1: error: label at end of compound statement

So clean up the 'out' label to fix it.

Reported-by: Fengguang Wu fengguang...@intel.com
Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com
---
 fs/f2fs/segment.c |    8 ++------
 1 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 9c45b8e..09af9c7 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -540,12 +540,9 @@ static void allocate_segment_by_default(struct f2fs_sb_info *sbi,
 {
 	struct curseg_info *curseg = CURSEG_I(sbi, type);
 
-	if (force) {
+	if (force)
 		new_curseg(sbi, type, true);
-		goto out;
-	}
-
-	if (type == CURSEG_WARM_NODE)
+	else if (type == CURSEG_WARM_NODE)
 		new_curseg(sbi, type, false);
 	else if (curseg->alloc_type == LFS && is_next_segment_free(sbi, type))
 		new_curseg(sbi, type, false);
@@ -553,7 +550,6 @@ static void allocate_segment_by_default(struct f2fs_sb_info *sbi,
 		change_curseg(sbi, type, true);
 	else
 		new_curseg(sbi, type, false);
-out:
 #ifdef CONFIG_F2FS_STAT_FS
 	sbi->segment_count[curseg->alloc_type]++;
 #endif
-- 
1.7.7
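
Editorial sketch: the build error arises because C (before C23) requires a label to be followed by a statement, and with the #ifdef body compiled out the label ends up directly before the closing brace. The self-contained example below shows the restructured shape in miniature. It is a simplified illustration, not the f2fs code: STAT_ENABLED stands in for CONFIG_F2FS_STAT_FS, and pick_action(), enum action, and stat_count are hypothetical.

```c
#include <assert.h>

/* Hypothetical stand-in for CONFIG_F2FS_STAT_FS; left undefined here to
 * mimic the configuration that triggered the build error. */
/* #define STAT_ENABLED 1 */

enum action { NEW_FORCED, NEW_SEGMENT, CHANGE_SEGMENT };

int stat_count;

/*
 * Problematic shape (rejected before C23 when STAT_ENABLED is unset):
 *
 *	if (force) { ...; goto out; }
 *	...
 * out:
 * #ifdef STAT_ENABLED
 *	stat_count++;
 * #endif
 * }
 *
 * Folding the early exit into the if/else chain removes the label.
 */
enum action pick_action(int force, int warm_node, int can_reuse)
{
	enum action act;

	if (force)
		act = NEW_FORCED;
	else if (warm_node)
		act = NEW_SEGMENT;
	else if (can_reuse)
		act = CHANGE_SEGMENT;
	else
		act = NEW_SEGMENT;

	/* A forced request must always take the first branch. */
	assert(!force || act == NEW_FORCED);

#ifdef STAT_ENABLED
	stat_count++;
#endif
	return act;
}
```

The function compiles the same way whether or not STAT_ENABLED is defined, which is the property the patch restores.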
[PATCH] f2fs: use strncasecmp() simplify the string comparison
Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com
---
 fs/f2fs/namei.c |   12 +-----------
 1 files changed, 1 insertions(+), 11 deletions(-)

diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c
index 4e47518..106c0b4 100644
--- a/fs/f2fs/namei.c
+++ b/fs/f2fs/namei.c
@@ -83,21 +83,11 @@ static int is_multimedia_file(const unsigned char *s, const char *sub)
 {
 	size_t slen = strlen(s);
 	size_t sublen = strlen(sub);
-	int ret;
 
 	if (sublen > slen)
 		return 0;
 
-	ret = memcmp(s + slen - sublen, sub, sublen);
-	if (ret) {	/* compare upper case */
-		int i;
-		char upper_sub[8];
-		for (i = 0; i < sublen && i < sizeof(upper_sub); i++)
-			upper_sub[i] = toupper(sub[i]);
-		return !memcmp(s + slen - sublen, upper_sub, sublen);
-	}
-
-	return !ret;
+	return !strncasecmp(s + slen - sublen, sub, sublen);
 }
 
 /*
-- 
1.7.7
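
Editorial sketch: the same suffix comparison can be tried in userspace with the POSIX strncasecmp() from <strings.h>. This is a minimal illustration, not the f2fs function: has_suffix_nocase() is a hypothetical name and plain char strings are assumed.

```c
#include <assert.h>
#include <string.h>	/* strlen */
#include <strings.h>	/* strncasecmp (POSIX) */

/* Returns 1 if s ends with sub, ignoring ASCII case; 0 otherwise. */
int has_suffix_nocase(const char *s, const char *sub)
{
	size_t slen, sublen;

	assert(s != NULL && sub != NULL);
	slen = strlen(s);
	sublen = strlen(sub);

	/* A suffix longer than the string can never match. */
	if (sublen > slen)
		return 0;

	return !strncasecmp(s + slen - sublen, sub, sublen);
}
```

One behavioural difference is worth noting: the removed code matched only the extension as given or fully upper-cased, while a case-insensitive compare also accepts mixed case such as "Mp4".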
Re: [PATCH] ocfs2/refcounttree: add the missing NULL check of the return value of find_or_create_page()
On 07/10/2013 06:11 AM, Joel Becker wrote:

> On Mon, Jul 08, 2013 at 03:52:53PM +0800, Gu Zheng wrote:
>> Add the missing NULL check of the return value of find_or_create_page() in
>> function ocfs2_duplicate_clusters_by_page().
>>
>> Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com
>> ---
>>  fs/ocfs2/refcounttree.c |    6 +++++-
>>  1 files changed, 5 insertions(+), 1 deletions(-)
>>
>> diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
>> index 998b17e..456d0e4 100644
>> --- a/fs/ocfs2/refcounttree.c
>> +++ b/fs/ocfs2/refcounttree.c
>> @@ -2965,7 +2965,11 @@ int ocfs2_duplicate_clusters_by_page(handle_t *handle,
>>  			to = map_end & (PAGE_CACHE_SIZE - 1);
>>  
>>  		page = find_or_create_page(mapping, page_index, GFP_NOFS);
>> -
>> +		if (!page) {
>> +			ret = -ENOMEM;
>> +			mlog_errno(ret);
>> +			break;
>> +		}
>>  		/*
>>  		 * In case PAGE_CACHE_SIZE <= CLUSTER_SIZE, This page
>>  		 * can't be dirtied before we CoW it out.
>
> Put a blank line between the closing brace and the comment.

Got it.:)

> Otherwise,
> Acked-by: Joel Becker jl...@evilplan.org

Thanks~

Regards,
Gu

> Joel
>
>> --
>> 1.7.7
Re: [PATCH 1/2] fs/anon_inode: Introduce a new lib function, anon_inode_getfile_private()
ping... On 07/08/2013 06:38 PM, Gu Zheng wrote: Introduce a new lib function anon_inode_getfile_private(), it creates a new file instance by hooking it up to an anonymous inode, and a dentry that describe the class of the file, similar to anon_inode_getfile(), but each file holds a single inode. Furthermore, anyone who wants to create a private anon file will benefit from this change. Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com Signed-off-by: Benjamin LaHaise b...@kvack.org --- fs/anon_inodes.c| 66 +++ include/linux/anon_inodes.h |3 ++ 2 files changed, 69 insertions(+), 0 deletions(-) diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c index 47a65df..85c9618 100644 --- a/fs/anon_inodes.c +++ b/fs/anon_inodes.c @@ -109,6 +109,72 @@ static struct file_system_type anon_inode_fs_type = { }; /** + * anon_inode_getfile_private - creates a new file instance by hooking it up to an + * anonymous inode, and a dentry that describe the class + * of the file + * + * @name:[in]name of the class of the new file + * @fops:[in]file operations for the new file + * @priv:[in]private data for the new file (will be file's private_data) + * @flags: [in]flags + * + * + * Similar to anon_inode_getfile, but each file holds a single inode. + * + */ +struct file *anon_inode_getfile_private(const char *name, + const struct file_operations *fops, + void *priv, int flags) +{ + struct qstr this; + struct path path; + struct file *file; + struct inode *inode; + + if (fops-owner !try_module_get(fops-owner)) + return ERR_PTR(-ENOENT); + + inode = anon_inode_mkinode(anon_inode_mnt-mnt_sb); + if (IS_ERR(inode)) { + file = ERR_PTR(-ENOMEM); + goto err_module; + } + + /* + * Link the inode to a directory entry by creating a unique name + * using the inode sequence number. 
+ */ + file = ERR_PTR(-ENOMEM); + this.name = name; + this.len = strlen(name); + this.hash = 0; + path.dentry = d_alloc_pseudo(anon_inode_mnt-mnt_sb, this); + if (!path.dentry) + goto err_module; + + path.mnt = mntget(anon_inode_mnt); + + d_instantiate(path.dentry, inode); + + file = alloc_file(path, OPEN_FMODE(flags), fops); + if (IS_ERR(file)) + goto err_dput; + + file-f_mapping = inode-i_mapping; + file-f_flags = flags (O_ACCMODE | O_NONBLOCK); + file-private_data = priv; + + return file; + +err_dput: + path_put(path); +err_module: + module_put(fops-owner); + return file; +} +EXPORT_SYMBOL_GPL(anon_inode_getfile_private); + +/** * anon_inode_getfile - creates a new file instance by hooking it up to an * anonymous inode, and a dentry that describe the class * of the file diff --git a/include/linux/anon_inodes.h b/include/linux/anon_inodes.h index 8013a45..cf573c2 100644 --- a/include/linux/anon_inodes.h +++ b/include/linux/anon_inodes.h @@ -13,6 +13,9 @@ struct file_operations; struct file *anon_inode_getfile(const char *name, const struct file_operations *fops, void *priv, int flags); +struct file *anon_inode_getfile_private(const char *name, + const struct file_operations *fops, + void *priv, int flags); int anon_inode_getfd(const char *name, const struct file_operations *fops, void *priv, int flags); -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] Add support to aio ring pages migration
ping...

On 07/08/2013 06:38 PM, Gu Zheng wrote:

> Currently aio ring pages are obtained with get_user_pages() from the movable
> zone. As discussed in the thread https://lkml.org/lkml/2012/11/29/69, user
> pages pinned this way can stay pinned for a long time, which is fatal for the
> memory hotplug/remove framework.
>
> As Mel Gorman suggested, this can be solved by implementing a migration
> callback that unpins the pages, holds off operations until migration
> completes, and then pins the new pfns. The best place to hold such a callback
> is the address space operations, which can be found via page->mapping.
>
> However, the current aio ring pages are anonymous pages and have no
> address_space_operations. So we use an anon inode file as the aio ring file
> to manage the aio ring pages, which lets us implement the callback and
> register it as page->mapping->a_ops->migratepage.
>
> There is a problem, though: all files created by anon_inode_getfile() share
> the same inode, so multiple aio contexts would share the same aio ring pages,
> which would mix up their io events. To solve this issue, we introduce a new
> function, anon_inode_getfile_private(), which is similar to
> anon_inode_getfile() except that each new file has its own anon inode.
>
> This work is based on Benjamin's patch:
> http://www.spinics.net/lists/linux-fsdevel/msg66014.html
>
> Gu Zheng (2):
>   fs/anon_inode: Introduce a new lib function anon_inode_getfile_private()
>   fs/aio: Add support to aio ring pages migration
>
>  fs/aio.c                    |  120 +++
>  fs/anon_inodes.c            |   66 +++
>  include/linux/anon_inodes.h |    3 +
>  include/linux/migrate.h     |    3 +
>  mm/migrate.c                |    2 +-
>  5 files changed, 182 insertions(+), 12 deletions(-)
Re: [PATCH 2/2] fs/aio: Add support to aio ring pages migration
ping... On 07/08/2013 06:38 PM, Gu Zheng wrote: As the aio job will pin the ring pages, that will lead to mem migrated failed. In order to fix this problem we use an anon inode to manage the aio ring pages, and setup the migratepage callback in the anon inode's address space, so that when mem migrating the aio ring pages will be moved to other mem node safely. Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com Signed-off-by: Benjamin LaHaise b...@kvack.org --- fs/aio.c| 120 ++ include/linux/migrate.h |3 + mm/migrate.c|2 +- 3 files changed, 113 insertions(+), 12 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 9b5ca11..d10f956 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -35,6 +35,9 @@ #include linux/eventfd.h #include linux/blkdev.h #include linux/compat.h +#include linux/anon_inodes.h +#include linux/migrate.h +#include linux/ramfs.h #include asm/kmap_types.h #include asm/uaccess.h @@ -110,6 +113,7 @@ struct kioctx { } cacheline_aligned_in_smp; struct page *internal_pages[AIO_RING_PAGES]; + struct file *aio_ring_file; }; /*-- sysctl variables*/ @@ -138,15 +142,78 @@ __initcall(aio_setup); static void aio_free_ring(struct kioctx *ctx) { - long i; + int i; + struct file *aio_ring_file = ctx-aio_ring_file; - for (i = 0; i ctx-nr_pages; i++) + for (i = 0; i ctx-nr_pages; i++) { + pr_debug(pid(%d) [%d] page-count=%d\n, current-pid, i, + page_count(ctx-ring_pages[i])); put_page(ctx-ring_pages[i]); + } if (ctx-ring_pages ctx-ring_pages != ctx-internal_pages) kfree(ctx-ring_pages); + + if (aio_ring_file) { + truncate_setsize(aio_ring_file-f_inode, 0); + pr_debug(pid(%d) i_nlink=%u d_count=%d d_unhashed=%d i_count=%d\n, + current-pid, aio_ring_file-f_inode-i_nlink, + aio_ring_file-f_path.dentry-d_count, + d_unhashed(aio_ring_file-f_path.dentry), + atomic_read(aio_ring_file-f_inode-i_count)); + fput(aio_ring_file); + ctx-aio_ring_file = NULL; + } +} + +static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma) +{ + vma-vm_ops = generic_file_vm_ops; + return 0; +} + 
+static const struct file_operations aio_ring_fops = { + .mmap = aio_ring_mmap, +}; + +static int aio_set_page_dirty(struct page *page) +{ + return 0; } +static int aio_migratepage(struct address_space *mapping, struct page *new, + struct page *old, enum migrate_mode mode) +{ + struct kioctx *ctx = mapping-private_data; + unsigned long flags; + unsigned idx = old-index; + int rc; + + /*Writeback must be complete*/ + BUG_ON(PageWriteback(old)); + put_page(old); + + rc = migrate_page_move_mapping(mapping, new, old, NULL, mode); + if (rc != MIGRATEPAGE_SUCCESS) { + get_page(old); + return rc; + } + + get_page(new); + + spin_lock_irqsave(ctx-completion_lock, flags); + migrate_page_copy(new, old); + ctx-ring_pages[idx] = new; + spin_unlock_irqrestore(ctx-completion_lock, flags); + + return rc; +} + +static const struct address_space_operations aio_ctx_aops = { + .set_page_dirty = aio_set_page_dirty, + .migratepage= aio_migratepage, +}; + static int aio_setup_ring(struct kioctx *ctx) { struct aio_ring *ring; @@ -154,20 +221,45 @@ static int aio_setup_ring(struct kioctx *ctx) struct mm_struct *mm = current-mm; unsigned long size, populate; int nr_pages; + int i; + struct file *file; /* Compensate for the ring buffer's head/tail overlap entry */ nr_events += 2; /* 1 is required, 2 for good luck */ size = sizeof(struct aio_ring); size += sizeof(struct io_event) * nr_events; - nr_pages = (size + PAGE_SIZE-1) PAGE_SHIFT; + nr_pages = (size + PAGE_SIZE-1) PAGE_SHIFT; if (nr_pages 0) return -EINVAL; - nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring)) / sizeof(struct io_event); + file = anon_inode_getfile_private([aio], aio_ring_fops, ctx, O_RDWR); + if (IS_ERR(file)) { + ctx-aio_ring_file = NULL; + return -EAGAIN; + } + + file-f_inode-i_mapping-a_ops = aio_ctx_aops; + file-f_inode-i_mapping-private_data = ctx; + file-f_inode-i_size = PAGE_SIZE * (loff_t)nr_pages; + + for (i = 0; i nr_pages; i++) { + struct page *page; + page = 
find_or_create_page(file-f_inode-i_mapping, +i, GFP_HIGHUSER | __GFP_ZERO); + if (!page) + break
[PATCH] f2fs: introduce helper function F2FS_NODE()
Introduce help function F2FS_NODE() to simplify the conversion of node_page to f2fs_node. Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com --- fs/f2fs/data.c |2 +- fs/f2fs/dir.c |2 +- fs/f2fs/f2fs.h |9 +++-- fs/f2fs/file.c |2 +- fs/f2fs/inode.c|4 ++-- fs/f2fs/node.c | 10 +- fs/f2fs/node.h | 40 fs/f2fs/recovery.c |6 ++ 8 files changed, 35 insertions(+), 40 deletions(-) diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c index 035f9a3..c73c394 100644 --- a/fs/f2fs/data.c +++ b/fs/f2fs/data.c @@ -39,7 +39,7 @@ static void __set_data_blkaddr(struct dnode_of_data *dn, block_t new_addr) wait_on_page_writeback(node_page); - rn = (struct f2fs_node *)page_address(node_page); + rn = F2FS_NODE(node_page); /* Get physical address of data block */ addr_array = blkaddr_in_node(rn); diff --git a/fs/f2fs/dir.c b/fs/f2fs/dir.c index 62f0d59..89ecb37 100644 --- a/fs/f2fs/dir.c +++ b/fs/f2fs/dir.c @@ -270,7 +270,7 @@ static void init_dent_inode(const struct qstr *name, struct page *ipage) struct f2fs_node *rn; /* copy name info. 
to this inode page */ - rn = (struct f2fs_node *)page_address(ipage); + rn = F2FS_NODE(ipage); rn-i.i_namelen = cpu_to_le32(name-len); memcpy(rn-i.i_name, name-name, name-len); set_page_dirty(ipage); diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h index c7620b9..ffa34f4 100644 --- a/fs/f2fs/f2fs.h +++ b/fs/f2fs/f2fs.h @@ -455,6 +455,11 @@ static inline struct f2fs_checkpoint *F2FS_CKPT(struct f2fs_sb_info *sbi) return (struct f2fs_checkpoint *)(sbi-ckpt); } +static inline struct f2fs_node *F2FS_NODE(struct page *page) +{ + return (struct f2fs_node *)page_address(page); +} + static inline struct f2fs_nm_info *NM_I(struct f2fs_sb_info *sbi) { return (struct f2fs_nm_info *)(sbi-nm_info); @@ -813,7 +818,7 @@ static inline struct kmem_cache *f2fs_kmem_cache_create(const char *name, static inline bool IS_INODE(struct page *page) { - struct f2fs_node *p = (struct f2fs_node *)page_address(page); + struct f2fs_node *p = F2FS_NODE(page); return RAW_IS_INODE(p); } @@ -827,7 +832,7 @@ static inline block_t datablock_addr(struct page *node_page, { struct f2fs_node *raw_node; __le32 *addr_array; - raw_node = (struct f2fs_node *)page_address(node_page); + raw_node = F2FS_NODE(node_page); addr_array = blkaddr_in_node(raw_node); return le32_to_cpu(addr_array[offset]); } diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c index 157a635..65ca3b3 100644 --- a/fs/f2fs/file.c +++ b/fs/f2fs/file.c @@ -206,7 +206,7 @@ int truncate_data_blocks_range(struct dnode_of_data *dn, int count) struct f2fs_node *raw_node; __le32 *addr; - raw_node = page_address(dn-node_page); + raw_node = F2FS_NODE(dn-node_page); addr = blkaddr_in_node(raw_node) + ofs; for ( ; count 0; count--, addr++, dn-ofs_in_node++) { diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c index 2b2d45d1..debf743 100644 --- a/fs/f2fs/inode.c +++ b/fs/f2fs/inode.c @@ -56,7 +56,7 @@ static int do_read_inode(struct inode *inode) if (IS_ERR(node_page)) return PTR_ERR(node_page); - rn = page_address(node_page); + rn = F2FS_NODE(node_page); ri = 
(rn-i); inode-i_mode = le16_to_cpu(ri-i_mode); @@ -153,7 +153,7 @@ void update_inode(struct inode *inode, struct page *node_page) wait_on_page_writeback(node_page); - rn = page_address(node_page); + rn = F2FS_NODE(node_page); ri = (rn-i); ri-i_mode = cpu_to_le16(inode-i_mode); diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c index b418aee..f5172e2 100644 --- a/fs/f2fs/node.c +++ b/fs/f2fs/node.c @@ -565,7 +565,7 @@ static int truncate_nodes(struct dnode_of_data *dn, unsigned int nofs, return PTR_ERR(page); } - rn = (struct f2fs_node *)page_address(page); + rn = F2FS_NODE(page); if (depth 3) { for (i = ofs; i NIDS_PER_BLOCK; i++, freed++) { child_nid = le32_to_cpu(rn-in.nid[i]); @@ -698,7 +698,7 @@ restart: set_new_dnode(dn, inode, page, NULL, 0); unlock_page(page); - rn = page_address(page); + rn = F2FS_NODE(page); switch (level) { case 0: case 1: @@ -1484,8 +1484,8 @@ int recover_inode_page(struct f2fs_sb_info *sbi, struct page *page) SetPageUptodate(ipage); fill_node_footer(ipage, ino, ino, 0, true); - src = (struct f2fs_node *)page_address(page); - dst = (struct f2fs_node *)page_address(ipage); + src = F2FS_NODE(page); + dst = F2FS_NODE(ipage); memcpy(dst, src, (unsigned long)src-i.i_ext - (unsigned long)src-i); dst-i.i_size = 0; @@ -1535,7 +1535,7 @@ int restore_node_summary(struct f2fs_sb_info *sbi
[PATCH] fs/f2fs: Code cleanup and simplify in func {find/add}_gc_inode
Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com
---
 fs/f2fs/gc.c |   17 +++++------------
 1 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
index 1496159..0b8b439 100644
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -314,28 +314,21 @@ static const struct victim_selection default_v_ops = {
 
 static struct inode *find_gc_inode(nid_t ino, struct list_head *ilist)
 {
-	struct list_head *this;
 	struct inode_entry *ie;
 
-	list_for_each(this, ilist) {
-		ie = list_entry(this, struct inode_entry, list);
+	list_for_each_entry(ie, ilist, list)
 		if (ie->inode->i_ino == ino)
 			return ie->inode;
-	}
 	return NULL;
 }
 
 static void add_gc_inode(struct inode *inode, struct list_head *ilist)
 {
-	struct list_head *this;
-	struct inode_entry *new_ie, *ie;
+	struct inode_entry *new_ie;
 
-	list_for_each(this, ilist) {
-		ie = list_entry(this, struct inode_entry, list);
-		if (ie->inode == inode) {
-			iput(inode);
-			return;
-		}
+	if (inode == find_gc_inode(inode->i_ino, ilist)) {
+		iput(inode);
+		return;
 	}
 repeat:
 	new_ie = kmem_cache_alloc(winode_slab, GFP_NOFS);
-- 
1.7.7
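For readers without the kernel headers at hand, the iteration idiom the patch switches to can be sketched in userspace. The definitions below are simplified stand-ins for <linux/list.h>, not the kernel's own; they assume a GCC/Clang compiler for __typeof__, and struct toy_entry and toy_find() are hypothetical names.

```c
#include <stddef.h>

/* Simplified userspace stand-ins for the kernel's list primitives. */
struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Walks the containing structures directly; no separate cursor needed. */
#define list_for_each_entry(pos, head, member)				\
	for (pos = container_of((head)->next, __typeof__(*pos), member); \
	     &pos->member != (head);					\
	     pos = container_of(pos->member.next, __typeof__(*pos), member))

static void list_add_tail(struct list_head *new, struct list_head *head)
{
	new->prev = head->prev;
	new->next = head;
	head->prev->next = new;
	head->prev = new;
}

/* Hypothetical entry type; plays the role of struct inode_entry. */
struct toy_entry {
	struct list_head list;
	unsigned long ino;
};

/* Same shape as the patched find_gc_inode(): one loop, one variable. */
static struct toy_entry *toy_find(unsigned long ino, struct list_head *ilist)
{
	struct toy_entry *e;

	list_for_each_entry(e, ilist, list)
		if (e->ino == ino)
			return e;
	return NULL;
}
```

Compared with list_for_each() plus list_entry(), this drops the extra struct list_head cursor and the braces around the loop body, which is where the deleted lines in the diffstat come from.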
[PATCH] fs/jffs2: remove the unused parameters of function jffs2_{compress,decompress}
Remove the unused paramters of function jffs2_{compress,decompress}. Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com --- fs/jffs2/compr.c | 12 ++-- fs/jffs2/compr.h | 12 ++-- fs/jffs2/gc.c|2 +- fs/jffs2/read.c |2 +- fs/jffs2/write.c |2 +- 5 files changed, 15 insertions(+), 15 deletions(-) diff --git a/fs/jffs2/compr.c b/fs/jffs2/compr.c index 4849a4c..6fcb426 100644 --- a/fs/jffs2/compr.c +++ b/fs/jffs2/compr.c @@ -145,9 +145,9 @@ static int jffs2_selected_compress(u8 compr, unsigned char *data_in, * jffs2_compress should compress as much as will fit, and should set * *datalen accordingly to show the amount of data which were compressed. */ -uint16_t jffs2_compress(struct jffs2_sb_info *c, struct jffs2_inode_info *f, - unsigned char *data_in, unsigned char **cpage_out, - uint32_t *datalen, uint32_t *cdatalen) +uint16_t jffs2_compress(struct jffs2_sb_info *c, unsigned char *data_in, + unsigned char **cpage_out, uint32_t *datalen, + uint32_t *cdatalen) { int ret = JFFS2_COMPR_NONE; int mode, compr_ret; @@ -250,9 +250,9 @@ uint16_t jffs2_compress(struct jffs2_sb_info *c, struct jffs2_inode_info *f, return ret; } -int jffs2_decompress(struct jffs2_sb_info *c, struct jffs2_inode_info *f, -uint16_t comprtype, unsigned char *cdata_in, -unsigned char *data_out, uint32_t cdatalen, uint32_t datalen) +int jffs2_decompress(uint16_t comprtype, unsigned char *cdata_in, +unsigned char *data_out, uint32_t cdatalen, +uint32_t datalen) { struct jffs2_compressor *this; int ret; diff --git a/fs/jffs2/compr.h b/fs/jffs2/compr.h index 5e91d57..092089a 100644 --- a/fs/jffs2/compr.h +++ b/fs/jffs2/compr.h @@ -70,13 +70,13 @@ int jffs2_unregister_compressor(struct jffs2_compressor *comp); int jffs2_compressors_init(void); int jffs2_compressors_exit(void); -uint16_t jffs2_compress(struct jffs2_sb_info *c, struct jffs2_inode_info *f, - unsigned char *data_in, unsigned char **cpage_out, - uint32_t *datalen, uint32_t *cdatalen); +uint16_t jffs2_compress(struct jffs2_sb_info *c, unsigned char 
*data_in, + unsigned char **cpage_out, uint32_t *datalen, + uint32_t *cdatalen); -int jffs2_decompress(struct jffs2_sb_info *c, struct jffs2_inode_info *f, -uint16_t comprtype, unsigned char *cdata_in, -unsigned char *data_out, uint32_t cdatalen, uint32_t datalen); +int jffs2_decompress(uint16_t comprtype, unsigned char *cdata_in, +unsigned char *data_out, uint32_t cdatalen, +uint32_t datalen); void jffs2_free_comprbuf(unsigned char *comprbuf, unsigned char *orig); diff --git a/fs/jffs2/gc.c b/fs/jffs2/gc.c index 5a2dec2..8dc85aa 100644 --- a/fs/jffs2/gc.c +++ b/fs/jffs2/gc.c @@ -1330,7 +1330,7 @@ static int jffs2_garbage_collect_dnode(struct jffs2_sb_info *c, struct jffs2_era writebuf = pg_ptr + (offset (PAGE_CACHE_SIZE -1)); - comprtype = jffs2_compress(c, f, writebuf, comprbuf, datalen, cdatalen); + comprtype = jffs2_compress(c, writebuf, comprbuf, datalen, cdatalen); ri.magic = cpu_to_je16(JFFS2_MAGIC_BITMASK); ri.nodetype = cpu_to_je16(JFFS2_NODETYPE_INODE); diff --git a/fs/jffs2/read.c b/fs/jffs2/read.c index 0b042b1..6395f41 100644 --- a/fs/jffs2/read.c +++ b/fs/jffs2/read.c @@ -132,7 +132,7 @@ int jffs2_read_dnode(struct jffs2_sb_info *c, struct jffs2_inode_info *f, jffs2_dbg(2, Decompress %d bytes from %p to %d bytes at %p\n, je32_to_cpu(ri-csize), readbuf, je32_to_cpu(ri-dsize), decomprbuf); - ret = jffs2_decompress(c, f, ri-compr | (ri-usercompr 8), readbuf, decomprbuf, je32_to_cpu(ri-csize), je32_to_cpu(ri-dsize)); + ret = jffs2_decompress(ri-compr | (ri-usercompr 8), readbuf, decomprbuf, je32_to_cpu(ri-csize), je32_to_cpu(ri-dsize)); if (ret) { pr_warn(Error: jffs2_decompress returned %d\n, ret); goto out_decomprbuf; diff --git a/fs/jffs2/write.c b/fs/jffs2/write.c index b634de4..dbc26de 100644 --- a/fs/jffs2/write.c +++ b/fs/jffs2/write.c @@ -369,7 +369,7 @@ int jffs2_write_inode_range(struct jffs2_sb_info *c, struct jffs2_inode_info *f, datalen = min_t(uint32_t, writelen, PAGE_CACHE_SIZE - (offset (PAGE_CACHE_SIZE-1))); cdatalen = min_t(uint32_t, 
alloclen - sizeof(*ri), datalen); - comprtype = jffs2_compress(c, f, buf, comprbuf, datalen, cdatalen); + comprtype = jffs2_compress(c, buf, comprbuf, datalen, cdatalen); ri-magic = cpu_to_je16(JFFS2_MAGIC_BITMASK); ri
[PATCH RESEND 0/2] Add support to aio ring pages migration
Currently aio ring pages are obtained with get_user_pages() from the movable
zone. As discussed in the thread https://lkml.org/lkml/2012/11/29/69, user
pages pinned this way can stay pinned for a long time, which is fatal for the
memory hotplug/remove framework.

As Mel Gorman suggested, this can be solved by implementing a migration
callback that unpins the pages, holds off operations until migration
completes, and then pins the new pfns. The best place to hold such a callback
is the address space operations, which can be found via page->mapping.

However, the current aio ring pages are anonymous pages and have no
address_space_operations. So we use an anon inode file as the aio ring file
to manage the aio ring pages, which lets us implement the callback and
register it as page->mapping->a_ops->migratepage.

There is a problem, though: all files created by anon_inode_getfile() share
the same inode, so multiple aio contexts would share the same aio ring pages,
which would mix up their io events. To solve this issue, we introduce a new
function, anon_inode_getfile_private(), which is similar to
anon_inode_getfile() except that each new file has its own anon inode.

This work is based on Benjamin's patch:
http://www.spinics.net/lists/linux-fsdevel/msg66014.html

Gu Zheng (2):
  fs/anon_inode: Introduce a new lib function anon_inode_getfile_private()
  fs/aio: Add support to aio ring pages migration

 fs/aio.c                    |  120 +++
 fs/anon_inodes.c            |   66 +++
 include/linux/anon_inodes.h |    3 +
 include/linux/migrate.h     |    3 +
 mm/migrate.c                |    2 +-
 5 files changed, 182 insertions(+), 12 deletions(-)

-- 
1.7.7
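The core of the migration callback described above (copy the page, then repoint the context's slot while holding a lock) can be illustrated with a toy userspace model. This is only a sketch under heavy assumptions: struct toy_ctx, toy_migrate() and the C11 atomic_flag spinlock are hypothetical stand-ins for struct kioctx, the migratepage callback and completion_lock, and none of the real page reference counting is modelled.

```c
#include <stdatomic.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096	/* assumption: stands in for PAGE_SIZE */
#define TOY_NR_PAGES  4

/* Hypothetical stand-in for struct kioctx. */
struct toy_ctx {
	atomic_flag lock;			/* plays the role of completion_lock */
	unsigned char *ring_pages[TOY_NR_PAGES];
};

static void toy_lock(struct toy_ctx *ctx)
{
	while (atomic_flag_test_and_set_explicit(&ctx->lock, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void toy_unlock(struct toy_ctx *ctx)
{
	atomic_flag_clear_explicit(&ctx->lock, memory_order_release);
}

/*
 * Toy "migratepage": copy the old page into the new one and repoint the
 * slot under the lock, so a writer that takes the same lock never sees a
 * half-moved page. Returns 0 on success, -1 if idx is out of range.
 * The caller remains responsible for releasing the old page.
 */
static int toy_migrate(struct toy_ctx *ctx, unsigned int idx,
		       unsigned char *new_page)
{
	if (idx >= TOY_NR_PAGES)
		return -1;

	toy_lock(ctx);
	memcpy(new_page, ctx->ring_pages[idx], TOY_PAGE_SIZE);
	ctx->ring_pages[idx] = new_page;
	toy_unlock(ctx);
	return 0;
}
```

The reason for doing both steps under one lock is that anything appending events to the ring looks up the page through the same table, so the copy and the pointer update have to appear as a single step to it.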
[PATCH RESEND 2/2] fs/aio: Add support to aio ring pages migration
As the aio job will pin the ring pages, memory migration of those pages fails.
To fix this, we use an anon inode to manage the aio ring pages and set up the
migratepage callback in the anon inode's address space, so that during memory
migration the aio ring pages can be moved to another memory node safely.

Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com
Signed-off-by: Benjamin LaHaise b...@kvack.org
---
 fs/aio.c                |  120 ++
 include/linux/migrate.h |    3 +
 mm/migrate.c            |    2 +-
 3 files changed, 113 insertions(+), 12 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 9b5ca11..d10f956 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -35,6 +35,9 @@
 #include <linux/eventfd.h>
 #include <linux/blkdev.h>
 #include <linux/compat.h>
+#include <linux/anon_inodes.h>
+#include <linux/migrate.h>
+#include <linux/ramfs.h>
 
 #include <asm/kmap_types.h>
 #include <asm/uaccess.h>
@@ -110,6 +113,7 @@ struct kioctx {
 	} ____cacheline_aligned_in_smp;
 
 	struct page		*internal_pages[AIO_RING_PAGES];
+	struct file		*aio_ring_file;
 };
 
 /*-- sysctl variables*/
@@ -138,15 +142,78 @@ __initcall(aio_setup);
 
 static void aio_free_ring(struct kioctx *ctx)
 {
-	long i;
+	int i;
+	struct file *aio_ring_file = ctx->aio_ring_file;
 
-	for (i = 0; i < ctx->nr_pages; i++)
+	for (i = 0; i < ctx->nr_pages; i++) {
+		pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i,
+			page_count(ctx->ring_pages[i]));
 		put_page(ctx->ring_pages[i]);
+	}
 
 	if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages)
 		kfree(ctx->ring_pages);
+
+	if (aio_ring_file) {
+		truncate_setsize(aio_ring_file->f_inode, 0);
+		pr_debug("pid(%d) i_nlink=%u d_count=%d d_unhashed=%d i_count=%d\n",
+			current->pid, aio_ring_file->f_inode->i_nlink,
+			aio_ring_file->f_path.dentry->d_count,
+			d_unhashed(aio_ring_file->f_path.dentry),
+			atomic_read(&aio_ring_file->f_inode->i_count));
+		fput(aio_ring_file);
+		ctx->aio_ring_file = NULL;
+	}
+}
+
+static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	vma->vm_ops = &generic_file_vm_ops;
+	return 0;
+}
+
+static const struct file_operations aio_ring_fops = {
+	.mmap = aio_ring_mmap,
+};
+
+static int aio_set_page_dirty(struct page *page)
+{
+	return 0;
 }
 
+static int aio_migratepage(struct address_space *mapping, struct page *new,
+			struct page *old, enum migrate_mode mode)
+{
+	struct kioctx *ctx = mapping->private_data;
+	unsigned long flags;
+	unsigned idx = old->index;
+	int rc;
+
+	/*Writeback must be complete*/
+	BUG_ON(PageWriteback(old));
+	put_page(old);
+
+	rc = migrate_page_move_mapping(mapping, new, old, NULL, mode);
+	if (rc != MIGRATEPAGE_SUCCESS) {
+		get_page(old);
+		return rc;
+	}
+
+	get_page(new);
+
+	spin_lock_irqsave(&ctx->completion_lock, flags);
+	migrate_page_copy(new, old);
+	ctx->ring_pages[idx] = new;
+	spin_unlock_irqrestore(&ctx->completion_lock, flags);
+
+	return rc;
+}
+
+static const struct address_space_operations aio_ctx_aops = {
+	.set_page_dirty = aio_set_page_dirty,
+	.migratepage	= aio_migratepage,
+};
+
 static int aio_setup_ring(struct kioctx *ctx)
 {
 	struct aio_ring *ring;
@@ -154,20 +221,45 @@ static int aio_setup_ring(struct kioctx *ctx)
 	struct mm_struct *mm = current->mm;
 	unsigned long size, populate;
 	int nr_pages;
+	int i;
+	struct file *file;
 
 	/* Compensate for the ring buffer's head/tail overlap entry */
 	nr_events += 2;	/* 1 is required, 2 for good luck */
 
 	size = sizeof(struct aio_ring);
 	size += sizeof(struct io_event) * nr_events;
-	nr_pages = (size + PAGE_SIZE-1) >> PAGE_SHIFT;
 
+	nr_pages = (size + PAGE_SIZE-1) >> PAGE_SHIFT;
 	if (nr_pages < 0)
 		return -EINVAL;
 
-	nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring)) / sizeof(struct io_event);
+	file = anon_inode_getfile_private("[aio]", &aio_ring_fops, ctx, O_RDWR);
+	if (IS_ERR(file)) {
+		ctx->aio_ring_file = NULL;
+		return -EAGAIN;
+	}
+
+	file->f_inode->i_mapping->a_ops = &aio_ctx_aops;
+	file->f_inode->i_mapping->private_data = ctx;
+	file->f_inode->i_size = PAGE_SIZE * (loff_t)nr_pages;
+
+	for (i = 0; i < nr_pages; i++) {
+		struct page *page;
+		page = find_or_create_page(file->f_inode->i_mapping,
+					   i, GFP_HIGHUSER | __GFP_ZERO);
+		if (!page)
+			break;
+		pr_debug("pid(%d) page[%d
[PATCH RESEND 1/2] fs/anon_inode: Introduce a new lib function anon_inode_getfile_private()
Introduce a new lib function, anon_inode_getfile_private(). It creates a new
file instance by hooking it up to an anonymous inode and a dentry that
describes the class of the file. It is similar to anon_inode_getfile(), but
each file holds a single inode of its own. Furthermore, anyone who wants to
create a private anon file will benefit from this change.

Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com
Signed-off-by: Benjamin LaHaise b...@kvack.org
---
 fs/anon_inodes.c            |   66 +++
 include/linux/anon_inodes.h |    3 ++
 2 files changed, 69 insertions(+), 0 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 47a65df..85c9618 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -109,6 +109,72 @@ static struct file_system_type anon_inode_fs_type = {
 };
 
 /**
+ * anon_inode_getfile_private - creates a new file instance by hooking it up to an
+ *                      anonymous inode, and a dentry that describe the class
+ *                      of the file
+ *
+ * @name:    [in]    name of the class of the new file
+ * @fops:    [in]    file operations for the new file
+ * @priv:    [in]    private data for the new file (will be file's private_data)
+ * @flags:   [in]    flags
+ *
+ *
+ * Similar to anon_inode_getfile, but each file holds a single inode.
+ *
+ */
+struct file *anon_inode_getfile_private(const char *name,
+					const struct file_operations *fops,
+					void *priv, int flags)
+{
+	struct qstr this;
+	struct path path;
+	struct file *file;
+	struct inode *inode;
+
+	if (fops->owner && !try_module_get(fops->owner))
+		return ERR_PTR(-ENOENT);
+
+	inode = anon_inode_mkinode(anon_inode_mnt->mnt_sb);
+	if (IS_ERR(inode)) {
+		file = ERR_PTR(-ENOMEM);
+		goto err_module;
+	}
+
+	/*
+	 * Link the inode to a directory entry by creating a unique name
+	 * using the inode sequence number.
+	 */
+	file = ERR_PTR(-ENOMEM);
+	this.name = name;
+	this.len = strlen(name);
+	this.hash = 0;
+	path.dentry = d_alloc_pseudo(anon_inode_mnt->mnt_sb, &this);
+	if (!path.dentry)
+		goto err_module;
+
+	path.mnt = mntget(anon_inode_mnt);
+
+	d_instantiate(path.dentry, inode);
+
+	file = alloc_file(&path, OPEN_FMODE(flags), fops);
+	if (IS_ERR(file))
+		goto err_dput;
+
+	file->f_mapping = inode->i_mapping;
+	file->f_flags = flags & (O_ACCMODE | O_NONBLOCK);
+	file->private_data = priv;
+
+	return file;
+
+err_dput:
+	path_put(&path);
+err_module:
+	module_put(fops->owner);
+	return file;
+}
+EXPORT_SYMBOL_GPL(anon_inode_getfile_private);
+
+/**
  * anon_inode_getfile - creates a new file instance by hooking it up to an
  *                      anonymous inode, and a dentry that describe the class
  *                      of the file
diff --git a/include/linux/anon_inodes.h b/include/linux/anon_inodes.h
index 8013a45..cf573c2 100644
--- a/include/linux/anon_inodes.h
+++ b/include/linux/anon_inodes.h
@@ -13,6 +13,9 @@ struct file_operations;
 struct file *anon_inode_getfile(const char *name,
 				const struct file_operations *fops,
 				void *priv, int flags);
+struct file *anon_inode_getfile_private(const char *name,
+					const struct file_operations *fops,
+					void *priv, int flags);
 int anon_inode_getfd(const char *name, const struct file_operations *fops,
 		     void *priv, int flags);
 
-- 
1.7.7
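The err_dput/err_module labels in the patch follow the usual kernel pattern of unwinding with goto in reverse order of acquisition. A minimal userspace sketch of that pattern is shown here; it is not the anon_inodes code, and struct toy_file, toy_getfile(), toy_refs and the fail_step fault-injection parameter are all hypothetical.

```c
#include <stdlib.h>
#include <string.h>

struct toy_file {		/* hypothetical stand-in for struct file */
	char *name;
	void *priv;
};

static int toy_refs;		/* stands in for the module reference count */

/*
 * Returns a new toy_file, or NULL on failure with every earlier step undone.
 * fail_step (2 or 3) forces the matching step to fail, for demonstration.
 */
static struct toy_file *toy_getfile(const char *name, void *priv, int fail_step)
{
	struct toy_file *file;
	char *copy;
	size_t len = strlen(name) + 1;

	toy_refs++;				/* step 1: take a reference */

	copy = (fail_step == 2) ? NULL : malloc(len);	/* step 2: name copy */
	if (!copy)
		goto err_ref;
	memcpy(copy, name, len);

	file = (fail_step == 3) ? NULL : malloc(sizeof(*file));	/* step 3 */
	if (!file)
		goto err_copy;

	file->name = copy;
	file->priv = priv;
	return file;

	/* Labels are ordered so that each one undoes one more step. */
err_copy:
	free(copy);
err_ref:
	toy_refs--;
	return NULL;
}
```

Each failure jumps to the label that releases exactly what has been acquired so far, and falling through the later labels releases the rest, so no cleanup code is duplicated.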
Re: [PATCH RESEND 1/2] fs/anon_inode: Introduce a new lib function anon_inode_getfile_private()
Hi Ben,

On 07/16/2013 09:16 PM, Benjamin LaHaise wrote:

> On Tue, Jul 16, 2013 at 05:56:12PM +0800, Gu Zheng wrote:
>> Introduce a new lib function anon_inode_getfile_private(), it creates a new
>> file instance by hooking it up to an anonymous inode, and a dentry that
>> describe the class of the file, similar to anon_inode_getfile(), but each
>> file holds a single inode. Furthermore, anyone who wants to create a private
>> anon file will benefit from this change.
>>
>> Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com
>> Signed-off-by: Benjamin LaHaise b...@kvack.org
>
> Please don't add my Signed-off-by when I have never even seen or reviewed a
> patch -- that is completely unacceptable.

Sorry for my reckless action, I'll remember your reminder.:)

> Second, I don't think this patch is suitable for 3.11, as it has not seen
> much testing outside of one test program I had written. It's a long
> standing bug, so it isn't urgent to get the fix into the tree. That said,
> it did pass a few tests I ran last night, so it is probably suitable for
> the -next tree.

Thanks for your test.:)

Regards,
Gu

> As for patch 1, it looks okay to me, but will need Al Viro's signoff.
>
> -ben
Re: [PATCH RESEND 2/2] fs/aio: Add support to aio ring pages migration
Hi Ben,

On 07/16/2013 09:34 PM, Benjamin LaHaise wrote:

> On Tue, Jul 16, 2013 at 05:56:16PM +0800, Gu Zheng wrote:
>> As the aio job will pin the ring pages, that will lead to mem migrated
>> failed. In order to fix this problem we use an anon inode to manage the aio
>> ring pages, and setup the migratepage callback in the anon inode's address
>> space, so that when mem migrating the aio ring pages will be moved to other
>> mem node safely.
>
> There are a few minor issues that needed to be fixed -- see below. I've
> made these changes and added them to git://git.kvack.org/~bcrl/aio-next.git ,
> and will ask for that tree to be included in linux-next.

Thanks very much for the changes and for your review.
Stephen reported a build failure when this patch was merged into the -next
tree from your aio-next. This is because we use migrate_page_move_mapping(),
which is only available under CONFIG_MIGRATION. I'll fix this issue in the
next version.

Best regards,
Gu

> mm folks: can someone familiar with page migration / hot plug memory please
> review the migration changes?
>
>> Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com
>> Signed-off-by: Benjamin LaHaise b...@kvack.org
>
> Again, I had not provided my Signed-off-by on this patch previously, so don't
> add it for me.
Sorry again. :)

>> ---
>>  fs/aio.c                |  120 ++
>>  include/linux/migrate.h |    3 +
>>  mm/migrate.c            |    2 +-
>>  3 files changed, 113 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/aio.c b/fs/aio.c
>> index 9b5ca11..d10f956 100644
>> --- a/fs/aio.c
>> +++ b/fs/aio.c
>> @@ -35,6 +35,9 @@
>>  #include <linux/eventfd.h>
>>  #include <linux/blkdev.h>
>>  #include <linux/compat.h>
>> +#include <linux/anon_inodes.h>
>> +#include <linux/migrate.h>
>> +#include <linux/ramfs.h>
>>
>>  #include <asm/kmap_types.h>
>>  #include <asm/uaccess.h>
>> @@ -110,6 +113,7 @@ struct kioctx {
>>  	} ____cacheline_aligned_in_smp;
>>
>>  	struct page		*internal_pages[AIO_RING_PAGES];
>> +	struct file		*aio_ring_file;
>>  };
>>
>>  /*------ sysctl variables----*/
>> @@ -138,15 +142,78 @@ __initcall(aio_setup);
>>
>>  static void aio_free_ring(struct kioctx *ctx)
>>  {
>> -	long i;
>> +	int i;
>> +	struct file *aio_ring_file = ctx->aio_ring_file;
>>
>> -	for (i = 0; i < ctx->nr_pages; i++)
>> +	for (i = 0; i < ctx->nr_pages; i++) {
>> +		pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i,
>> +			 page_count(ctx->ring_pages[i]));
>>  		put_page(ctx->ring_pages[i]);
>> +	}
>>
>>  	if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages)
>>  		kfree(ctx->ring_pages);
>> +
>> +	if (aio_ring_file) {
>> +		truncate_setsize(aio_ring_file->f_inode, 0);
>> +		pr_debug("pid(%d) i_nlink=%u d_count=%d d_unhashed=%d i_count=%d\n",
>> +			 current->pid, aio_ring_file->f_inode->i_nlink,
>> +			 aio_ring_file->f_path.dentry->d_count,
>> +			 d_unhashed(aio_ring_file->f_path.dentry),
>> +			 atomic_read(&aio_ring_file->f_inode->i_count));
>> +		fput(aio_ring_file);
>> +		ctx->aio_ring_file = NULL;
>> +	}
>> +}
>> +
>> +static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
>> +{
>> +	vma->vm_ops = &generic_file_vm_ops;
>> +	return 0;
>> +}
>> +
>> +static const struct file_operations aio_ring_fops = {
>> +	.mmap = aio_ring_mmap,
>> +};
>> +
>> +static int aio_set_page_dirty(struct page *page)
>> +{
>> +	return 0;
>>  }
>>
>> +static int aio_migratepage(struct address_space *mapping, struct page *new,
>> +			struct page *old, enum migrate_mode mode)
>> +{
>> +	struct kioctx *ctx = mapping->private_data;
>> +	unsigned long flags;
>> +	unsigned idx = old->index;
>> +	int rc;
>> +
>> +	/*Writeback must be complete*/
>
> Missing spaces before/after beginning and end of comment.
>
>> +	BUG_ON(PageWriteback(old));
>> +	put_page(old);
>> +
>> +	rc = migrate_page_move_mapping(mapping, new, old, NULL, mode);
>> +	if (rc != MIGRATEPAGE_SUCCESS) {
>> +		get_page(old);
>> +		return rc;
>> +	}
>> +
>> +	get_page(new);
>> +
>> +	spin_lock_irqsave(&ctx->completion_lock, flags);
>> +	migrate_page_copy(new, old);
>> +	ctx->ring_pages[idx] = new;
>> +	spin_unlock_irqrestore(&ctx->completion_lock, flags);
>> +
>> +	return rc;
>> +}
>> +
>> +static const struct address_space_operations aio_ctx_aops = {
>> +	.set_page_dirty = aio_set_page_dirty,
>> +	.migratepage	= aio_migratepage,
>> +};
>> +
>>  static int aio_setup_ring(struct kioctx *ctx)
>>  {
>>  	struct aio_ring *ring;
>> @@ -154,20 +221,45 @@ static int aio_setup_ring(struct kioctx *ctx)
>>  	struct mm_struct *mm = current->mm;
>>  	unsigned long size, populate;
>>  	int nr_pages;
>> +	int i;
>> +	struct file *file;
>>
>>  	/* Compensate for the ring buffer's head/tail overlap entry */
>>  	nr_events += 2;	/* 1 is required, 2 for good luck */
>>
>>  	size = sizeof(struct aio_ring);
>>  	size += sizeof(struct io_event) * nr_events;
>> -	nr_pages = (size + PAGE_SIZE-1) >> PAGE_SHIFT;
>> +	nr_pages
[PATCH V2 2/2] fs/aio: Add support to aio ring pages migration
As the aio job pins the ring pages, memory migration of those pages fails. To fix this problem we use an anon inode to manage the aio ring pages, and set up the migratepage callback in the anon inode's address space, so that during memory migration the aio ring pages can be moved to another memory node safely.

v1->v2:
  Fix the build failure when CONFIG_MIGRATION is disabled.
  Fix some minor issues per Benjamin's comments.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
---
 fs/aio.c                |  116 +++
 include/linux/migrate.h |    9
 mm/migrate.c            |    2 +-
 3 files changed, 116 insertions(+), 11 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 2bbcacf..15e8a13 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -35,6 +35,9 @@
 #include <linux/eventfd.h>
 #include <linux/blkdev.h>
 #include <linux/compat.h>
+#include <linux/anon_inodes.h>
+#include <linux/migrate.h>
+#include <linux/ramfs.h>

 #include <asm/kmap_types.h>
 #include <asm/uaccess.h>
@@ -108,6 +111,7 @@ struct kioctx {
 	} ____cacheline_aligned_in_smp;

 	struct page		*internal_pages[AIO_RING_PAGES];
+	struct file		*aio_ring_file;
 };

 /*------ sysctl variables----*/
@@ -136,15 +140,78 @@ __initcall(aio_setup);

 static void aio_free_ring(struct kioctx *ctx)
 {
-	long i;
-
-	for (i = 0; i < ctx->nr_pages; i++)
+	int i;
+	struct file *aio_ring_file = ctx->aio_ring_file;

+	for (i = 0; i < ctx->nr_pages; i++) {
+		pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i,
+			 page_count(ctx->ring_pages[i]));
 		put_page(ctx->ring_pages[i]);
+	}

 	if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages)
 		kfree(ctx->ring_pages);
+
+	if (aio_ring_file) {
+		truncate_setsize(aio_ring_file->f_inode, 0);
+		pr_debug("pid(%d) i_nlink=%u d_count=%d d_unhashed=%d i_count=%d\n",
+			 current->pid, aio_ring_file->f_inode->i_nlink,
+			 aio_ring_file->f_path.dentry->d_count,
+			 d_unhashed(aio_ring_file->f_path.dentry),
+			 atomic_read(&aio_ring_file->f_inode->i_count));
+		fput(aio_ring_file);
+		ctx->aio_ring_file = NULL;
+	}
+}
+
+static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	vma->vm_ops = &generic_file_vm_ops;
+	return 0;
+}
+
+static const struct file_operations aio_ring_fops = {
+	.mmap = aio_ring_mmap,
+};
+
+static int aio_set_page_dirty(struct page *page)
+{
+	return 0;
 }

+static int aio_migratepage(struct address_space *mapping, struct page *new,
+			struct page *old, enum migrate_mode mode)
+{
+	struct kioctx *ctx = mapping->private_data;
+	unsigned long flags;
+	unsigned idx = old->index;
+	int rc;
+
+	/* Writeback must be complete */
+	BUG_ON(PageWriteback(old));
+
+	put_page(old);
+
+	rc = migrate_page_move_mapping(mapping, new, old, NULL, mode);
+	if (rc != MIGRATEPAGE_SUCCESS) {
+		get_page(old);
+		return rc;
+	}
+
+	get_page(new);
+
+	spin_lock_irqsave(&ctx->completion_lock, flags);
+	migrate_page_copy(new, old);
+	ctx->ring_pages[idx] = new;
+	spin_unlock_irqrestore(&ctx->completion_lock, flags);
+
+	return rc;
+}
+
+static const struct address_space_operations aio_ctx_aops = {
+	.set_page_dirty = aio_set_page_dirty,
+	.migratepage	= aio_migratepage,
+};
+
 static int aio_setup_ring(struct kioctx *ctx)
 {
 	struct aio_ring *ring;
@@ -152,18 +219,42 @@ static int aio_setup_ring(struct kioctx *ctx)
 	struct mm_struct *mm = current->mm;
 	unsigned long size, populate;
 	int nr_pages;
+	int i;
+	struct file *file;

 	/* Compensate for the ring buffer's head/tail overlap entry */
 	nr_events += 2;	/* 1 is required, 2 for good luck */

 	size = sizeof(struct aio_ring);
 	size += sizeof(struct io_event) * nr_events;
-	nr_pages = (size + PAGE_SIZE-1) >> PAGE_SHIFT;

+	nr_pages = PFN_UP(size);
 	if (nr_pages < 0)
 		return -EINVAL;

+	file = anon_inode_getfile_private("[aio]", &aio_ring_fops, ctx, O_RDWR);
+	if (IS_ERR(file)) {
+		ctx->aio_ring_file = NULL;
+		return -EAGAIN;
+	}
+	file->f_inode->i_mapping->a_ops = &aio_ctx_aops;
+	file->f_inode->i_mapping->private_data = ctx;
+	file->f_inode->i_size = PAGE_SIZE * (loff_t)nr_pages;

-	nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring)) / sizeof(struct io_event);
+	for (i = 0; i < nr_pages; i++) {
+		struct page *page;
+		page = find_or_create_page(file->f_inode->i_mapping,
+					   i, GFP_HIGHUSER | __GFP_ZERO);
+		if (!page
Re: [PATCH V2 2/2] fs/aio: Add support to aio ring pages migration
Hi Ben,

On 07/17/2013 09:44 PM, Benjamin LaHaise wrote:
> On Wed, Jul 17, 2013 at 05:22:30PM +0800, Gu Zheng wrote:
>> As the aio job will pin the ring pages, that will lead to mem migrated
>> failed. In order to fix this problem we use an anon inode to manage the
>> aio ring pages, and setup the migratepage callback in the anon inode's
>> address space, so that when mem migrating the aio ring pages will be
>> moved to other mem node safely.
>>
>> v1->v2:
>>   Fix build failed issue if CONFIG_MIGRATION disabled.
>>   Fix some minor issues under Benjamin's comments.
>
> I don't know what you did with this patch, but it doesn't apply to any of
> the trees I can find, and interdiff isn't able to compare it against your
> original patch. Since the first version of the patch was already applied
> it is generally more appropriate to provide an incremental fix. I've added
> the following to my tree (git://git.kvack.org/~bcrl/aio-next.git/) to fix
> the build issue. I've tested this with CONFIG_MIGRATION enabled and
> disabled on x86.

My patch is based on the 3.10 release. I'm sorry, but my department blocks access to all URLs that use the git protocol, so I cannot make the patch against your aio-next tree. Is aio-next also available over http/https?

Your fix looks good.
IMHO, because we export (*extern*) migrate_page_move_mapping(), we have a duty to make sure it works everywhere. If someone later uses migrate_page_move_mapping() without the protection of CONFIG_MIGRATION, the build will fail when CONFIG_MIGRATION is disabled. So I think the following change (return -ENOSYS if CONFIG_MIGRATION is disabled) is still needed.
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index c407d88..3d0a486 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -88,6 +88,13 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 	return -ENOSYS;
 }

+static inline int migrate_page_move_mapping(struct address_space *mapping,
+		struct page *newpage, struct page *page,
+		struct buffer_head *head, enum migrate_mode mode)
+{
+	return -ENOSYS;
+}
+
 /* Possible settings for the migrate_page() method in address_operations */
 #define migrate_page NULL
 #define fail_migrate_page NULL

Best regards,
Gu

> -ben
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index c407d88..3d0a486 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -88,6 +88,13 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
>  	return -ENOSYS;
>  }
>
> +static inline int migrate_page_move_mapping(struct address_space *mapping,
> +		struct page *newpage, struct page *page,
> +		struct buffer_head *head, enum migrate_mode mode)
> +{
> +	return -ENOSYS;
> +}
> +
>  /* Possible settings for the migrate_page() method in address_operations */
>  #define migrate_page NULL
>  #define fail_migrate_page NULL
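
The stub discussed above follows a common pattern: when a feature is compiled out, callers still build against a static inline fallback that reports -ENOSYS at run time. A minimal userspace sketch of that pattern might look like this; the DEMO_HAVE_MIGRATION switch, struct demo_page and the demo_* functions are hypothetical stand-ins for illustration, not kernel symbols.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; not the kernel type. */
struct demo_page {
	int id;
};

#ifdef DEMO_HAVE_MIGRATION
/* Feature built in: the real implementation is provided elsewhere. */
int demo_move_mapping(struct demo_page *newpage, struct demo_page *page);
#else
/* Feature compiled out: callers still build, and get -ENOSYS at run time. */
static inline int demo_move_mapping(struct demo_page *newpage,
				    struct demo_page *page)
{
	(void)newpage;
	(void)page;
	return -ENOSYS;
}
#endif

/* A caller needs no #ifdef of its own; it only checks the return code. */
static int demo_migrate(struct demo_page *newpage, struct demo_page *page)
{
	assert(newpage != NULL && page != NULL);
	return demo_move_mapping(newpage, page);
}
```

The design choice here is that the #ifdef lives in one header rather than at every call site, which is the property Gu argues for in the message above.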
[PATCH RESEND] fs/jffs2: remove the unused parameters of function jffs2_{compress,decompress}
Remove the unused parameters of function jffs2_{compress,decompress}.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
---
 fs/jffs2/compr.c |   12 ++--
 fs/jffs2/compr.h |   12 ++--
 fs/jffs2/gc.c    |    2 +-
 fs/jffs2/read.c  |    4 +++-
 fs/jffs2/write.c |    2 +-
 5 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/fs/jffs2/compr.c b/fs/jffs2/compr.c
index 4849a4c..6fcb426 100644
--- a/fs/jffs2/compr.c
+++ b/fs/jffs2/compr.c
@@ -145,9 +145,9 @@ static int jffs2_selected_compress(u8 compr, unsigned char *data_in,
  * jffs2_compress should compress as much as will fit, and should set
  * *datalen accordingly to show the amount of data which were compressed.
  */
-uint16_t jffs2_compress(struct jffs2_sb_info *c, struct jffs2_inode_info *f,
-			unsigned char *data_in, unsigned char **cpage_out,
-			uint32_t *datalen, uint32_t *cdatalen)
+uint16_t jffs2_compress(struct jffs2_sb_info *c, unsigned char *data_in,
+			unsigned char **cpage_out, uint32_t *datalen,
+			uint32_t *cdatalen)
 {
 	int ret = JFFS2_COMPR_NONE;
 	int mode, compr_ret;
@@ -250,9 +250,9 @@ uint16_t jffs2_compress(struct jffs2_sb_info *c, struct jffs2_inode_info *f,
 	return ret;
 }

-int jffs2_decompress(struct jffs2_sb_info *c, struct jffs2_inode_info *f,
-		     uint16_t comprtype, unsigned char *cdata_in,
-		     unsigned char *data_out, uint32_t cdatalen, uint32_t datalen)
+int jffs2_decompress(uint16_t comprtype, unsigned char *cdata_in,
+		     unsigned char *data_out, uint32_t cdatalen,
+		     uint32_t datalen)
 {
 	struct jffs2_compressor *this;
 	int ret;
diff --git a/fs/jffs2/compr.h b/fs/jffs2/compr.h
index 5e91d57..092089a 100644
--- a/fs/jffs2/compr.h
+++ b/fs/jffs2/compr.h
@@ -70,13 +70,13 @@ int jffs2_unregister_compressor(struct jffs2_compressor *comp);
 int jffs2_compressors_init(void);
 int jffs2_compressors_exit(void);

-uint16_t jffs2_compress(struct jffs2_sb_info *c, struct jffs2_inode_info *f,
-			unsigned char *data_in, unsigned char **cpage_out,
-			uint32_t *datalen, uint32_t *cdatalen);
+uint16_t jffs2_compress(struct jffs2_sb_info *c, unsigned char *data_in,
+			unsigned char **cpage_out, uint32_t *datalen,
+			uint32_t *cdatalen);

-int jffs2_decompress(struct jffs2_sb_info *c, struct jffs2_inode_info *f,
-		     uint16_t comprtype, unsigned char *cdata_in,
-		     unsigned char *data_out, uint32_t cdatalen, uint32_t datalen);
+int jffs2_decompress(uint16_t comprtype, unsigned char *cdata_in,
+		     unsigned char *data_out, uint32_t cdatalen,
+		     uint32_t datalen);

 void jffs2_free_comprbuf(unsigned char *comprbuf, unsigned char *orig);

diff --git a/fs/jffs2/gc.c b/fs/jffs2/gc.c
index 5a2dec2..8dc85aa 100644
--- a/fs/jffs2/gc.c
+++ b/fs/jffs2/gc.c
@@ -1330,7 +1330,7 @@ static int jffs2_garbage_collect_dnode(struct jffs2_sb_info *c, struct jffs2_era

 		writebuf = pg_ptr + (offset & (PAGE_CACHE_SIZE -1));

-		comprtype = jffs2_compress(c, f, writebuf, &comprbuf, &datalen, &cdatalen);
+		comprtype = jffs2_compress(c, writebuf, &comprbuf, &datalen, &cdatalen);

 		ri.magic = cpu_to_je16(JFFS2_MAGIC_BITMASK);
 		ri.nodetype = cpu_to_je16(JFFS2_NODETYPE_INODE);
diff --git a/fs/jffs2/read.c b/fs/jffs2/read.c
index 0b042b1..aed9183 100644
--- a/fs/jffs2/read.c
+++ b/fs/jffs2/read.c
@@ -132,7 +132,9 @@ int jffs2_read_dnode(struct jffs2_sb_info *c, struct jffs2_inode_info *f,
 	jffs2_dbg(2, "Decompress %d bytes from %p to %d bytes at %p\n",
 		  je32_to_cpu(ri->csize), readbuf,
 		  je32_to_cpu(ri->dsize), decomprbuf);
-	ret = jffs2_decompress(c, f, ri->compr | (ri->usercompr << 8), readbuf, decomprbuf, je32_to_cpu(ri->csize), je32_to_cpu(ri->dsize));
+	ret = jffs2_decompress(ri->compr | (ri->usercompr << 8),
+			       readbuf, decomprbuf, je32_to_cpu(ri->csize),
+			       je32_to_cpu(ri->dsize));
 	if (ret) {
 		pr_warn("Error: jffs2_decompress returned %d\n", ret);
 		goto out_decomprbuf;
diff --git a/fs/jffs2/write.c b/fs/jffs2/write.c
index b634de4..dbc26de 100644
--- a/fs/jffs2/write.c
+++ b/fs/jffs2/write.c
@@ -369,7 +369,7 @@ int jffs2_write_inode_range(struct jffs2_sb_info *c, struct jffs2_inode_info *f,
 		datalen = min_t(uint32_t, writelen, PAGE_CACHE_SIZE - (offset & (PAGE_CACHE_SIZE-1)));
 		cdatalen = min_t(uint32_t, alloclen - sizeof(*ri), datalen);

-		comprtype = jffs2_compress(c, f, buf, &comprbuf, &datalen, &cdatalen);
+		comprtype = jffs2_compress(c, buf, &comprbuf, &datalen, &cdatalen
[PATCH] lib/crc32: update the comments of crc32_{be,le}_generic()
Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
---
 lib/crc32.c |   15 ++-
 1 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/lib/crc32.c b/lib/crc32.c
index 072fbd8..4722659 100644
--- a/lib/crc32.c
+++ b/lib/crc32.c
@@ -131,11 +131,14 @@ crc32_body(u32 crc, unsigned char const *buf, size_t len, const u32 (*tab)[256])
 #endif

 /**
- * crc32_le() - Calculate bitwise little-endian Ethernet AUTODIN II CRC32
+ * crc32_le_generic() - Calculate bitwise little-endian Ethernet AUTODIN II
+ *			CRC32/CRC32C
  * @crc: seed value for computation.  ~0 for Ethernet, sometimes 0 for
- *	other uses, or the previous crc32 value if computing incrementally.
- * @p: pointer to buffer over which CRC is run
+ *	 other uses, or the previous crc32/crc32c value if computing incrementally.
+ * @p: pointer to buffer over which CRC32/CRC32C is run
  * @len: length of buffer @p
+ * @tab: little-endian Ethernet table
+ * @polynomial: CRC32/CRC32c LE polynomial
  */
 static inline u32 __pure crc32_le_generic(u32 crc, unsigned char const *p,
 					  size_t len, const u32 (*tab)[256],
@@ -201,11 +204,13 @@ EXPORT_SYMBOL(crc32_le);
 EXPORT_SYMBOL(__crc32c_le);

 /**
- * crc32_be() - Calculate bitwise big-endian Ethernet AUTODIN II CRC32
+ * crc32_be_generic() - Calculate bitwise big-endian Ethernet AUTODIN II CRC32
  * @crc: seed value for computation.  ~0 for Ethernet, sometimes 0 for
  *	other uses, or the previous crc32 value if computing incrementally.
- * @p: pointer to buffer over which CRC is run
+ * @p: pointer to buffer over which CRC32 is run
  * @len: length of buffer @p
+ * @tab: big-endian Ethernet table
+ * @polynomial: CRC32 BE polynomial
  */
 static inline u32 __pure crc32_be_generic(u32 crc, unsigned char const *p,
 					  size_t len, const u32 (*tab)[256],
--
1.7.7
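
To make the documented parameters concrete (seed, buffer, length, polynomial), here is a minimal userspace sketch of a bit-at-a-time little-endian CRC. It is not the kernel implementation: it omits the lookup table entirely, and the function and macro names are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reflected (little-endian) polynomials; the macro names are illustrative. */
#define DEMO_CRC32_POLY_LE	0xedb88320u
#define DEMO_CRC32C_POLY_LE	0x82f63b78u

/*
 * Bit-at-a-time little-endian CRC. No pre- or post-inversion is applied
 * here, so the caller chooses the seed and any final XOR. Passing the
 * previous return value as the seed continues an incremental computation.
 */
static uint32_t demo_crc_le(uint32_t crc, const unsigned char *p, size_t len,
			    uint32_t polynomial)
{
	assert(p != NULL || len == 0);

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ ((crc & 1) ? polynomial : 0);
	}
	return crc;
}
```

With a seed of ~0 and a final inversion, the standard check string "123456789" yields 0xcbf43926 for CRC32 and 0xe3069283 for CRC32C, which is a convenient sanity check when swapping the polynomial argument.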
[PATCH] f2fs: add the missing deletion of orphan inode entry in write_orphan_inodes()
After writing an orphan inode entry into the journal block, we need to delete each entry from the orphan entry list and release it.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
---
 fs/f2fs/checkpoint.c |    4
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index 66a6b85..290db04 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -337,6 +337,10 @@ static void write_orphan_inodes(struct f2fs_sb_info *sbi, block_t start_blk)
 		memset(orphan_blk, 0, sizeof(*orphan_blk));
 page_exist:
 		orphan_blk->ino[nentries++] = cpu_to_le32(orphan->ino);
+
+		list_del(&orphan->list);
+		kmem_cache_free(orphan_entry_slab, orphan);
+		sbi->n_orphans--;
 	}
 	if (!page)
 		goto end;
--
1.7.7
[PATCH] f2fs: use list_for_each rather than list_for_each_safe in remove_orphan_inode()
As we remove only the single target node, list_for_each is enough; to clean up further, we use list_for_each_entry instead.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
---
 fs/f2fs/checkpoint.c |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index 290db04..87f7bc2 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -237,13 +237,12 @@ out:

 void remove_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino)
 {
-	struct list_head *this, *next, *head;
+	struct list_head *head;
 	struct orphan_inode_entry *orphan;

 	mutex_lock(&sbi->orphan_inode_mutex);
 	head = &sbi->orphan_inode_list;
-	list_for_each_safe(this, next, head) {
-		orphan = list_entry(this, struct orphan_inode_entry, list);
+	list_for_each_entry(orphan, head, list) {
 		if (orphan->ino == ino) {
 			list_del(&orphan->list);
 			kmem_cache_free(orphan_entry_slab, orphan);
--
1.7.7
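
The reasoning above (a plain iterator is enough when exactly one node is removed and the loop then stops) can be illustrated with a small userspace sketch. The list type and every demo_* name below are assumptions written for this example; they only mimic the shape of the kernel's list_head helpers and are not the kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal circular doubly linked list, modelled loosely on list_head. */
struct demo_list {
	struct demo_list *prev, *next;
};

struct demo_orphan {
	unsigned int ino;
	struct demo_list list;
};

/* Recover the containing entry from its embedded list node. */
#define demo_entry(ptr) \
	((struct demo_orphan *)((char *)(ptr) - offsetof(struct demo_orphan, list)))

static void demo_list_init(struct demo_list *head)
{
	head->prev = head->next = head;
}

static void demo_list_del(struct demo_list *node)
{
	assert(node->prev != NULL && node->next != NULL);
	node->prev->next = node->next;
	node->next->prev = node->prev;
}

/* Append a new entry; returns 0 on success, -1 if allocation fails. */
static int demo_add_orphan(struct demo_list *head, unsigned int ino)
{
	struct demo_orphan *o = malloc(sizeof(*o));

	if (!o)
		return -1;
	o->ino = ino;
	o->list.prev = head->prev;
	o->list.next = head;
	head->prev->next = &o->list;
	head->prev = &o->list;
	return 0;
}

/*
 * Remove one entry. Plain iteration is fine only because we return right
 * after freeing the node and never read pos->next again.
 */
static int demo_remove_orphan(struct demo_list *head, unsigned int ino)
{
	for (struct demo_list *pos = head->next; pos != head; pos = pos->next) {
		struct demo_orphan *o = demo_entry(pos);

		if (o->ino == ino) {
			demo_list_del(pos);
			free(o);
			return 1;
		}
	}
	return 0;
}

/*
 * Remove every entry. Here the successor must be saved before the node is
 * freed, which is what the "safe" iterator variants do.
 */
static unsigned int demo_drain_orphans(struct demo_list *head)
{
	unsigned int freed = 0;
	struct demo_list *pos = head->next;

	while (pos != head) {
		struct demo_list *next = pos->next;

		demo_list_del(pos);
		free(demo_entry(pos));
		freed++;
		pos = next;
	}
	return freed;
}
```

If a loop unlinks and frees each entry as it goes and then keeps iterating, the second form is the one to reach for; the first form relies entirely on leaving the loop immediately after the removal.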
Re: [PATCH aio-next] aio: fix race in ring buffer page lookup introduced by page migration support
Hi Ben, Al,

On 09/10/2013 12:02 AM, Benjamin LaHaise wrote:
> Hi Al, Gu,
>
> I've added this patch to my tree at git://git.kvack.org/~bcrl/aio-next.git
> to fix the get_user_pages() issue introduced by Gu's changes in the page
> migration patch. Thanks Al for spotting this.

Thanks very much for spotting and fixing this issue.

Best regards,
Gu

> -ben
>
> commit d6c355c7dabcd753a75bc77d150d36328a355267
> Author: Benjamin LaHaise <b...@kvack.org>
> Date:   Mon Sep 9 11:57:59 2013 -0400
>
>     aio: fix race in ring buffer page lookup introduced by page migration support
>
>     Prior to the introduction of page migration support in "fs/aio: Add
>     support to aio ring pages migration" / 36bc08cc01709b4a9bb563b35aa530241ddc63e3,
>     mapping of the ring buffer pages was done via get_user_pages() while
>     retaining mmap_sem held for write. This avoided possible races with
>     userland racing an munmap() or mremap(). The page migration patch,
>     however, switched to using mm_populate() to prime the page mapping.
>     mm_populate() cannot be called with mmap_sem held.
>
>     Instead of dropping the mmap_sem, revert to the old behaviour and
>     simply drop the use of mm_populate() since get_user_pages() will cause
>     the pages to get mapped anyways. Thanks to Al Viro for spotting this
>     issue.
>
>     Signed-off-by: Benjamin LaHaise <b...@kvack.org>
>
> diff --git a/fs/aio.c b/fs/aio.c
> index 6e26755..f4a27af 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -307,16 +307,25 @@ static int aio_setup_ring(struct kioctx *ctx)
>  		aio_free_ring(ctx);
>  		return -EAGAIN;
>  	}
> -	up_write(&mm->mmap_sem);
> -
> -	mm_populate(ctx->mmap_base, populate);
>
>  	pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base);
> +
> +	/* We must do this while still holding mmap_sem for write, as we
> +	 * need to be protected against userspace attempting to mremap()
> +	 * or munmap() the ring buffer.
> +	 */
>  	ctx->nr_pages = get_user_pages(current, mm, ctx->mmap_base, nr_pages,
>  				       1, 0, ctx->ring_pages, NULL);
> +
> +	/* Dropping the reference here is safe as the page cache will hold
> +	 * onto the pages for us.  It is also required so that page migration
> +	 * can unmap the pages and get the right reference count.
> +	 */
>  	for (i = 0; i < ctx->nr_pages; i++)
>  		put_page(ctx->ring_pages[i]);
>
> +	up_write(&mm->mmap_sem);
> +
>  	if (unlikely(ctx->nr_pages != nr_pages)) {
>  		aio_free_ring(ctx);
>  		return -EAGAIN;
Re: [f2fs-dev] [PATCH] f2fs: optimize fs_lock for better performance
Hi Jaegeuk,

On 09/10/2013 08:59 AM, Jaegeuk Kim wrote:
> Hi,
>
> 2013-09-07 (Sat), 08:00 +, Chao Yu:
>> Hi Knize,
>>
>> Thanks for your reply, I think it's actually meaningless that it's being
>> named after spin_lock, it's better to rename this spinlock to
>> round_robin_lock.
>>
>> This patch can only resolve the issue of unbalanced fs_lock usage, it can
>> not fix the deadlock issue. can we fix deadlock issue through this method:
>>
>> - vfs_create()
>>  - f2fs_create() - takes an fs_lock and save current thread info into
>>    thread_info[NR_GLOBAL_LOCKS]
>>   - f2fs_add_link()
>>    - __f2fs_add_link()
>>     - init_inode_metadata()
>>      - f2fs_init_security()
>>       - security_inode_init_security()
>>        - f2fs_initxattrs()
>>         - f2fs_setxattr() - get fs_lock only if there is no current
>>           thread info in thread_info
>>
>> So it keeps one thread can only hold one fs_lock to avoid deadlock.
>> Can we use this solution?
>
> It could be.
> But, I think we can avoid to grab the fs_lock at the f2fs_initxattrs()

Agree. The fs_lock here is used to protect the xattr from parallel modification, but in the initxattrs routine parallel modification cannot happen. And in the normal setxattr routine, inode->i_mutex (VFS layer) is used to avoid parallel modification. So I think this fs_lock is needless.
Am I missing something?

Regards,
Gu

> level, since this case only happens when f2fs_initxattrs() is called.
> Let's think about ut in more detail.
> Thanks,
>
>> thanks again!
>>
>> --- Original Message ---
>> Sender : Russ Knize <russ.kn...@motorola.com>
>> Date : September 07, 2013 04:25 (GMT+09:00)
>> Title : Re: [f2fs-dev] [PATCH] f2fs: optimize fs_lock for better performance
>>
>> I encountered this same issue recently and solved it in much the same
>> way. Can we rename spin_lock to something more meaningful?
>>
>> This race actually exposed a potential deadlock between f2fs_create() and
>> f2fs_initxattrs():
>>
>> - vfs_create()
>>  - f2fs_create() - takes an fs_lock
>>   - f2fs_add_link()
>>    - __f2fs_add_link()
>>     - init_inode_metadata()
>>      - f2fs_init_security()
>>       - security_inode_init_security()
>>        - f2fs_initxattrs()
>>         - f2fs_setxattr() - also takes an fs_lock
>>
>> If another CPU happens to have the same lock that f2fs_setxattr() was
>> trying to take because of the race around next_lock_num, we can get into
>> a deadlock situation if the two threads are also contending over another
>> resource (like bdi).
>>
>> Another scenario is if the above happens while another thread is in the
>> middle of grabbing all of the locks via mutex_lock_all(). f2fs_create()
>> is holding a lock that mutex_lock_all() is waiting for and
>> mutex_lock_all() is holding a lock that f2fs_setxattr() is waiting for.
>>
>> Russ
>>
>> On Fri, Sep 6, 2013 at 4:48 AM, Chao Yu <chao2...@samsung.com> wrote:
>>> Hi Kim:
>>>
>>> I think there is a performance problem: when all sbi->fs_lock is holded,
>>> then all other threads may get the same next_lock value from
>>> sbi->next_lock_num in function mutex_lock_op, and wait to get the same
>>> lock at position fs_lock[next_lock], it unbalance the fs_lock usage.
>>> It may lost performance when we do the multithread test.
>>>
>>> Here is the patch to fix this problem:
>>>
>>> Signed-off-by: Yu Chao <chao2...@samsung.com>
>>>
>>> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
>>> old mode 100644
>>> new mode 100755
>>> index 467d42d..983bb45
>>> --- a/fs/f2fs/f2fs.h
>>> +++ b/fs/f2fs/f2fs.h
>>> @@ -371,6 +371,7 @@ struct f2fs_sb_info {
>>>  	struct mutex fs_lock[NR_GLOBAL_LOCKS];	/* blocking FS operations */
>>>  	struct mutex node_write;		/* locking node writes */
>>>  	struct mutex writepages;		/* mutex for writepages() */
>>> +	spinlock_t spin_lock;			/* lock for next_lock_num */
>>>  	unsigned char next_lock_num;		/* round-robin global locks */
>>>  	int por_doing;				/* recovery is doing or not */
>>>  	int on_build_free_nids;			/* build_free_nids is doing */
>>> @@ -533,15 +534,19 @@ static inline void mutex_unlock_all(struct f2fs_sb_info *sbi)
>>>
>>>  static inline int mutex_lock_op(struct f2fs_sb_info *sbi)
>>>  {
>>> -	unsigned char next_lock = sbi->next_lock_num % NR_GLOBAL_LOCKS;
>>> +	unsigned char next_lock;
>>>  	int i = 0;
>>>
>>>  	for (; i < NR_GLOBAL_LOCKS; i++)
Re: [f2fs-dev][PATCH] f2fs: optimize fs_lock for better performance
Hi Jaegeuk,

On 09/10/2013 08:52 AM, Jaegeuk Kim wrote:
> Hi,
>
> At first, thank you for the report and please follow the email writing
> rules. :)
>
> Anyway, I agree to the below issue.
> One thing that I can think of is that we don't need to use the spin_lock,
> since we don't care about the exact lock number, but just need to get any
> not-collided number.

Agree, but if all the locks are held, IMO we need to spread the following threads so that each waits on a different lock number, though a perfect balance is unreachable.

> So, how about removing the spin_lock?

Yeah, in this case spin_lock is a bit heavy.

> And how about using a random number?

NR_GLOBAL_LOCKS is currently 8, so it seems that a random number cannot offer the balanced distribution we expect.

Regards,
Gu

> Thanks,
>
> 2013-09-06 (Fri), 09:48 +, Chao Yu:
>> Hi Kim:
>>
>> I think there is a performance problem: when all sbi->fs_lock is holded,
>> then all other threads may get the same next_lock value from
>> sbi->next_lock_num in function mutex_lock_op, and wait to get the same
>> lock at position fs_lock[next_lock], it unbalance the fs_lock usage.
>> It may lost performance when we do the multithread test.
>>
>> Here is the patch to fix this problem:
>>
>> Signed-off-by: Yu Chao <chao2...@samsung.com>
>>
>> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
>> old mode 100644
>> new mode 100755
>> index 467d42d..983bb45
>> --- a/fs/f2fs/f2fs.h
>> +++ b/fs/f2fs/f2fs.h
>> @@ -371,6 +371,7 @@ struct f2fs_sb_info {
>>  	struct mutex fs_lock[NR_GLOBAL_LOCKS];	/* blocking FS operations */
>>  	struct mutex node_write;		/* locking node writes */
>>  	struct mutex writepages;		/* mutex for writepages() */
>> +	spinlock_t spin_lock;			/* lock for next_lock_num */
>>  	unsigned char next_lock_num;		/* round-robin global locks */
>>  	int por_doing;				/* recovery is doing or not */
>>  	int on_build_free_nids;			/* build_free_nids is doing */
>> @@ -533,15 +534,19 @@ static inline void mutex_unlock_all(struct f2fs_sb_info *sbi)
>>
>>  static inline int mutex_lock_op(struct f2fs_sb_info *sbi)
>>  {
>> -	unsigned char next_lock = sbi->next_lock_num % NR_GLOBAL_LOCKS;
>> +	unsigned char next_lock;
>>  	int i = 0;
>>
>>  	for (; i < NR_GLOBAL_LOCKS; i++)
>>  		if (mutex_trylock(&sbi->fs_lock[i]))
>>  			return i;
>>
>> -	mutex_lock(&sbi->fs_lock[next_lock]);
>> +	spin_lock(&sbi->spin_lock);
>> +	next_lock = sbi->next_lock_num % NR_GLOBAL_LOCKS;
>>  	sbi->next_lock_num++;
>> +	spin_unlock(&sbi->spin_lock);
>> +
>> +	mutex_lock(&sbi->fs_lock[next_lock]);
>>  	return next_lock;
>>  }
>>
>> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
>> old mode 100644
>> new mode 100755
>> index 75c7dc3..4f27596
>> --- a/fs/f2fs/super.c
>> +++ b/fs/f2fs/super.c
>> @@ -657,6 +657,7 @@ static int f2fs_fill_super(struct super_block *sb, void *data, int silent)
>>  	mutex_init(&sbi->cp_mutex);
>>  	for (i = 0; i < NR_GLOBAL_LOCKS; i++)
>>  		mutex_init(&sbi->fs_lock[i]);
>> +	spin_lock_init(&sbi->spin_lock);
>>  	mutex_init(&sbi->node_write);
>>  	sbi->por_doing = 0;
>>  	spin_lock_init(&sbi->stat_lock);
>> (END)
Re: [f2fs-dev][PATCH] f2fs: optimize fs_lock for better performance
Hi Jaegeuk, Chao,

On 09/10/2013 08:52 AM, Jaegeuk Kim wrote:
> Hi,
>
> At first, thank you for the report and please follow the email writing
> rules. :)
>
> Anyway, I agree to the below issue.
> One thing that I can think of is that we don't need to use the spin_lock,
> since we don't care about the exact lock number, but just need to get any
> not-collided number.

IMHO, just moving sbi->next_lock_num++ before mutex_lock(&sbi->fs_lock[next_lock]) avoids most of the imbalance.
IMO, the case where two or more threads increment sbi->next_lock_num at the same time is very rare. If you think that is not rigorous enough, changing next_lock_num to an atomic one can fix it.
What's your opinion?

Regards,
Gu

> So, how about removing the spin_lock?
> And how about using a random number?
> Thanks,
>
> 2013-09-06 (Fri), 09:48 +, Chao Yu:
>> Hi Kim:
>>
>> I think there is a performance problem: when all sbi->fs_lock is holded,
>> then all other threads may get the same next_lock value from
>> sbi->next_lock_num in function mutex_lock_op, and wait to get the same
>> lock at position fs_lock[next_lock], it unbalance the fs_lock usage.
>> It may lost performance when we do the multithread test.
>>
>> Here is the patch to fix this problem:
>>
>> Signed-off-by: Yu Chao <chao2...@samsung.com>
>>
>> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
>> old mode 100644
>> new mode 100755
>> index 467d42d..983bb45
>> --- a/fs/f2fs/f2fs.h
>> +++ b/fs/f2fs/f2fs.h
>> @@ -371,6 +371,7 @@ struct f2fs_sb_info {
>>  	struct mutex fs_lock[NR_GLOBAL_LOCKS];	/* blocking FS operations */
>>  	struct mutex node_write;		/* locking node writes */
>>  	struct mutex writepages;		/* mutex for writepages() */
>> +	spinlock_t spin_lock;			/* lock for next_lock_num */
>>  	unsigned char next_lock_num;		/* round-robin global locks */
>>  	int por_doing;				/* recovery is doing or not */
>>  	int on_build_free_nids;			/* build_free_nids is doing */
>> @@ -533,15 +534,19 @@ static inline void mutex_unlock_all(struct f2fs_sb_info *sbi)
>>
>>  static inline int mutex_lock_op(struct f2fs_sb_info *sbi)
>>  {
>> -	unsigned char next_lock = sbi->next_lock_num % NR_GLOBAL_LOCKS;
>> +	unsigned char next_lock;
>>  	int i = 0;
>>
>>  	for (; i < NR_GLOBAL_LOCKS; i++)
>>  		if (mutex_trylock(&sbi->fs_lock[i]))
>>  			return i;
>>
>> -	mutex_lock(&sbi->fs_lock[next_lock]);
>> +	spin_lock(&sbi->spin_lock);
>> +	next_lock = sbi->next_lock_num % NR_GLOBAL_LOCKS;
>>  	sbi->next_lock_num++;
>> +	spin_unlock(&sbi->spin_lock);
>> +
>> +	mutex_lock(&sbi->fs_lock[next_lock]);
>>  	return next_lock;
>>  }
>>
>> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
>> old mode 100644
>> new mode 100755
>> index 75c7dc3..4f27596
>> --- a/fs/f2fs/super.c
>> +++ b/fs/f2fs/super.c
>> @@ -657,6 +657,7 @@ static int f2fs_fill_super(struct super_block *sb, void *data, int silent)
>>  	mutex_init(&sbi->cp_mutex);
>>  	for (i = 0; i < NR_GLOBAL_LOCKS; i++)
>>  		mutex_init(&sbi->fs_lock[i]);
>> +	spin_lock_init(&sbi->spin_lock);
>>  	mutex_init(&sbi->node_write);
>>  	sbi->por_doing = 0;
>>  	spin_lock_init(&sbi->stat_lock);
>> (END)
Re: [f2fs-dev][PATCH] f2fs: optimize fs_lock for better performance
Hi Chao,

On 09/12/2013 10:40 AM, 俞超 wrote:
> Hi Gu
>
>> -----Original Message-----
>> From: Gu Zheng [mailto:guz.f...@cn.fujitsu.com]
>> Sent: Wednesday, September 11, 2013 1:38 PM
>> To: jaegeuk@samsung.com
>> Cc: chao2...@samsung.com; shu@samsung.com;
>> linux-fsde...@vger.kernel.org; linux-kernel@vger.kernel.org;
>> linux-f2fs-de...@lists.sourceforge.net
>> Subject: Re: [f2fs-dev][PATCH] f2fs: optimize fs_lock for better performance
>>
>> Hi Jaegeuk, Chao,
>>
>> On 09/10/2013 08:52 AM, Jaegeuk Kim wrote:
>>> Hi,
>>>
>>> At first, thank you for the report and please follow the email writing
>>> rules. :)
>>>
>>> Anyway, I agree to the below issue.
>>> One thing that I can think of is that we don't need to use the
>>> spin_lock, since we don't care about the exact lock number, but just
>>> need to get any not-collided number.
>>
>> IMHO, just moving sbi->next_lock_num++ before
>> mutex_lock(&sbi->fs_lock[next_lock]) can avoid unbalance issue mostly.
>> IMO, the case two or more threads increase sbi->next_lock_num in the same
>> time is really very very little. If you think it is not rigorous, change
>> next_lock_num to atomic one can fix it.
>> What's your opinion?
>>
>> Regards,
>> Gu
>
> I did the test sbi->next_lock_num++ compare with the atomic one,
> And I found performance of them is almost the same under a small number
> thread racing.
> So as your and Kim's opinion, it's enough to use sbi->next_lock_num++ to
> fix this issue.

Good, but it seems that your reply patch is badly formatted, and it's hard for Jaegeuk to merge. I'll format it; see the following thread.

Thanks,
Gu

> Thanks for the advice.
>
>>> So, how about removing the spin_lock?
>>> And how about using a random number?
>>> Thanks,
>>>
>>> 2013-09-06 (Fri), 09:48 +, Chao Yu:
>>>> Hi Kim:
>>>>
>>>> I think there is a performance problem: when all sbi->fs_lock is
>>>> holded, then all other threads may get the same next_lock value from
>>>> sbi->next_lock_num in function mutex_lock_op, and wait to get the same
>>>> lock at position fs_lock[next_lock], it unbalance the fs_lock usage.
>>>> It may lost performance when we do the multithread test.
>>>>
>>>> Here is the patch to fix this problem:
>>>>
>>>> Signed-off-by: Yu Chao <chao2...@samsung.com>
>>>>
>>>> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
>>>> old mode 100644
>>>> new mode 100755
>>>> index 467d42d..983bb45
>>>> --- a/fs/f2fs/f2fs.h
>>>> +++ b/fs/f2fs/f2fs.h
>>>> @@ -371,6 +371,7 @@ struct f2fs_sb_info {
>>>>  	struct mutex fs_lock[NR_GLOBAL_LOCKS];	/* blocking FS operations */
>>>>  	struct mutex node_write;		/* locking node writes */
>>>>  	struct mutex writepages;		/* mutex for writepages() */
>>>> +	spinlock_t spin_lock;			/* lock for next_lock_num */
>>>>  	unsigned char next_lock_num;		/* round-robin global locks */
>>>>  	int por_doing;				/* recovery is doing or not */
>>>>  	int on_build_free_nids;			/* build_free_nids is doing */
>>>> @@ -533,15 +534,19 @@ static inline void mutex_unlock_all(struct f2fs_sb_info *sbi)
>>>>
>>>>  static inline int mutex_lock_op(struct f2fs_sb_info *sbi)
>>>>  {
>>>> -	unsigned char next_lock = sbi->next_lock_num % NR_GLOBAL_LOCKS;
>>>> +	unsigned char next_lock;
>>>>  	int i = 0;
>>>>
>>>>  	for (; i < NR_GLOBAL_LOCKS; i++)
>>>>  		if (mutex_trylock(&sbi->fs_lock[i]))
>>>>  			return i;
>>>>
>>>> -	mutex_lock(&sbi->fs_lock[next_lock]);
>>>> +	spin_lock(&sbi->spin_lock);
>>>> +	next_lock = sbi->next_lock_num % NR_GLOBAL_LOCKS;
>>>>  	sbi->next_lock_num++;
>>>> +	spin_unlock(&sbi->spin_lock);
>>>> +
>>>> +	mutex_lock(&sbi->fs_lock[next_lock]);
>>>>  	return next_lock;
>>>>  }
>>>>
>>>> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
>>>> old mode 100644
>>>> new mode 100755
>>>> index 75c7dc3..4f27596
>>>> --- a/fs/f2fs/super.c
>>>> +++ b/fs/f2fs/super.c
>>>> @@ -657,6 +657,7 @@ static int f2fs_fill_super(struct super_block *sb, void *data, int silent)
>>>>  	mutex_init(&sbi->cp_mutex);
>>>>  	for (i = 0; i < NR_GLOBAL_LOCKS; i++)
>>>>  		mutex_init(&sbi->fs_lock[i]);
>>>> +	spin_lock_init(&sbi->spin_lock);
>>>>  	mutex_init(&sbi->node_write);
>>>>  	sbi->por_doing = 0;
>>>>  	spin_lock_init(&sbi->stat_lock);
>>>> (END)
[f2fs-dev][PATCH V2] f2fs: optimize fs_lock for better performance
From: Yu Chao <chao2...@samsung.com>

There is a performance problem: when all sbi->fs_lock are held, all the
following threads may get the same next_lock value from sbi->next_lock_num
in function mutex_lock_op and wait for the same lock (fs_lock[next_lock]),
which may reduce performance. So we move the sbi->next_lock_num++ before
taking the lock; this spreads the following threads across the locks when
all sbi->fs_lock are held.

v1-->v2:
  Drop the needless spin_lock as Jaegeuk suggested.

Suggested-by: Jaegeuk Kim <jaegeuk@samsung.com>
Signed-off-by: Yu Chao <chao2...@samsung.com>
Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
---
 fs/f2fs/f2fs.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 608f0df..7fd99d8 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -544,15 +544,15 @@ static inline void mutex_unlock_all(struct f2fs_sb_info *sbi)
 
 static inline int mutex_lock_op(struct f2fs_sb_info *sbi)
 {
-	unsigned char next_lock = sbi->next_lock_num % NR_GLOBAL_LOCKS;
+	unsigned char next_lock;
 	int i = 0;
 
 	for (; i < NR_GLOBAL_LOCKS; i++)
 		if (mutex_trylock(&sbi->fs_lock[i]))
 			return i;
 
+	next_lock = sbi->next_lock_num++ % NR_GLOBAL_LOCKS;
 	mutex_lock(&sbi->fs_lock[next_lock]);
-	sbi->next_lock_num++;
 	return next_lock;
 }
 
-- 
1.7.7
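For readers who want to try the slot selection from this thread outside the kernel, the following is a minimal userspace sketch, not f2fs code. It assumes POSIX threads and C11 atomics; `lock_pool`, `pool_next_slot`, `pool_lock`, `pool_unlock` and `NR_LOCKS` are hypothetical names. It uses the atomic-counter variant mentioned in the discussion, since an unsynchronized increment is a data race in ISO C.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NR_LOCKS 4	/* hypothetical stand-in for NR_GLOBAL_LOCKS */

struct lock_pool {
	pthread_mutex_t lock[NR_LOCKS];
	atomic_uint next;	/* round-robin hint, like next_lock_num */
};

static void pool_init(struct lock_pool *p)
{
	for (int i = 0; i < NR_LOCKS; i++)
		pthread_mutex_init(&p->lock[i], NULL);
	atomic_init(&p->next, 0);
}

/* Pick the slot to block on and advance the hint in one step. */
static int pool_next_slot(struct lock_pool *p)
{
	return (int)(atomic_fetch_add_explicit(&p->next, 1u,
					       memory_order_relaxed) % NR_LOCKS);
}

/* Returns the index of the lock now held by the caller. */
static int pool_lock(struct lock_pool *p)
{
	/* Fast path: take any lock that is currently free. */
	for (int i = 0; i < NR_LOCKS; i++)
		if (pthread_mutex_trylock(&p->lock[i]) == 0)
			return i;

	/*
	 * Slow path: the hint is advanced before blocking, so the next
	 * waiter is steered to a different slot instead of queueing on
	 * the same mutex.
	 */
	int slot = pool_next_slot(p);
	pthread_mutex_lock(&p->lock[slot]);
	return slot;
}

static void pool_unlock(struct lock_pool *p, int slot)
{
	assert(slot >= 0 && slot < NR_LOCKS);
	pthread_mutex_unlock(&p->lock[slot]);
}
```

Note that `pthread_mutex_trylock()` returns 0 on success, whereas the kernel's `mutex_trylock()` returns 1, so the fast-path test is inverted relative to the patch.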
[PATCH] f2fs: add a wait step when submit bio with {READ,WRITE}_SYNC
When we submit bio with READ_SYNC or WRITE_SYNC, we need to wait a moment
for the io completion. The current code follows this rule only in
find_data_page(); other places miss this step, so add it.

Furthermore, move the PageUptodate check into f2fs_readpage() to clean up
the code.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
---
 fs/f2fs/checkpoint.c |    1 -
 fs/f2fs/data.c       |   39 +++++++++++++++++----------------------
 fs/f2fs/node.c       |    1 -
 fs/f2fs/recovery.c   |    2 --
 fs/f2fs/segment.c    |    2 +-
 5 files changed, 18 insertions(+), 27 deletions(-)

diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index fe91773..e376a42 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -64,7 +64,6 @@ repeat:
 	if (f2fs_readpage(sbi, page, index, READ_SYNC))
 		goto repeat;
 
-	lock_page(page);
 	if (page->mapping != mapping) {
 		f2fs_put_page(page, 1);
 		goto repeat;
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 19cd7c6..b048936 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -216,13 +216,11 @@ struct page *find_data_page(struct inode *inode, pgoff_t index, bool sync)
 
 	err = f2fs_readpage(sbi, page, dn.data_blkaddr,
 					sync ? READ_SYNC : READA);
-	if (sync) {
-		wait_on_page_locked(page);
-		if (!PageUptodate(page)) {
-			f2fs_put_page(page, 0);
-			return ERR_PTR(-EIO);
-		}
-	}
+	if (err)
+		return ERR_PTR(err);
+
+	if (sync)
+		unlock_page(page);
 	return page;
 }
 
@@ -267,11 +265,6 @@ repeat:
 	if (err)
 		return ERR_PTR(err);
 
-	lock_page(page);
-	if (!PageUptodate(page)) {
-		f2fs_put_page(page, 1);
-		return ERR_PTR(-EIO);
-	}
 	if (page->mapping != mapping) {
 		f2fs_put_page(page, 1);
 		goto repeat;
@@ -325,11 +318,7 @@ repeat:
 	err = f2fs_readpage(sbi, page, dn.data_blkaddr, READ_SYNC);
 	if (err)
 		return ERR_PTR(err);
-	lock_page(page);
-	if (!PageUptodate(page)) {
-		f2fs_put_page(page, 1);
-		return ERR_PTR(-EIO);
-	}
+
 	if (page->mapping != mapping) {
 		f2fs_put_page(page, 1);
 		goto repeat;
@@ -399,6 +388,16 @@ int f2fs_readpage(struct f2fs_sb_info *sbi, struct page *page,
 
 	submit_bio(type, bio);
 	up_read(&sbi->bio_sem);
+
+	if (type == READ_SYNC) {
+		wait_on_page_locked(page);
+		lock_page(page);
+		if (!PageUptodate(page)) {
+			f2fs_put_page(page, 1);
+			return -EIO;
+		}
+	}
+
 	return 0;
 }
 
@@ -679,11 +678,7 @@ repeat:
 		err = f2fs_readpage(sbi, page, dn.data_blkaddr, READ_SYNC);
 		if (err)
 			return err;
-		lock_page(page);
-		if (!PageUptodate(page)) {
-			f2fs_put_page(page, 1);
-			return -EIO;
-		}
+
 		if (page->mapping != mapping) {
 			f2fs_put_page(page, 1);
 			goto repeat;
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index f5172e2..f061554 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1534,7 +1534,6 @@ int restore_node_summary(struct f2fs_sb_info *sbi,
 		if (f2fs_readpage(sbi, page, addr, READ_SYNC))
 			goto out;
 
-		lock_page(page);
 		rn = F2FS_NODE(page);
 		sum_entry->nid = rn->footer.nid;
 		sum_entry->version = 0;
diff --git a/fs/f2fs/recovery.c b/fs/f2fs/recovery.c
index 639eb34..ec68183 100644
--- a/fs/f2fs/recovery.c
+++ b/fs/f2fs/recovery.c
@@ -140,8 +140,6 @@ static int find_fsync_dnodes(struct f2fs_sb_info *sbi, struct list_head *head)
 		if (err)
 			goto out;
 
-		lock_page(page);
-
 		if (cp_ver != cpver_of_node(page))
 			break;
 
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 9b74ae2..bcd19db 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -639,7 +639,7 @@ static void do_submit_bio(struct f2fs_sb_info *sbi,
 
 		trace_f2fs_do_submit_bio(sbi->sb, btype, sync, sbi->bio[btype]);
 
-		if (type == META_FLUSH) {
+		if ((type == META_FLUSH) || (rw & WRITE_SYNC)) {
 			DECLARE_COMPLETION_ONSTACK(wait);
 			p->is_sync = true;
 			p->wait = &wait;
-- 
1.7.7
Re: [PATCH] f2fs: add a wait step when submit bio with {READ,WRITE}_SYNC
Hi Kim,

On 07/30/2013 08:29 PM, Jaegeuk Kim wrote:
> Hi Gu,
>
> The original read flow was to avoid redundant lock/unlock_page() calls.

Right, this can gain better read performance.
But is the wait step after submitting bio with READ_SYNC needless too?

> And we should not wait for WRITE_SYNC, since it is just for write
> priority, not for synchronization of the file system.

Got it, thanks for your explanation. :)

Best regards,
Gu

> Thanks,
>
> 2013-07-30 (Tue), 18:06 +0800, Gu Zheng:
>> When we submit bio with READ_SYNC or WRITE_SYNC, we need to wait a moment
>> for the io completion, current codes only find_data_page() follows the
>> rule, other places missing this step, so add it.
>>
>> Further more, moving the PageUptodate check into f2fs_readpage() to clean
>> up the codes.
>>
>> Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
>> [...]
Re: [PATCH] fbdev: fix build warning in vga16fb.c
Hoho, Tomi has applied the patch from Lius to fix this warning.
And this is the sixth patch to fix the same issue since last week.

Thanks,
Gu

On 07/31/2013 11:21 AM, Xishi Qiu wrote:
> When building v3.11-rc3, I get the following warning:
> ...
> drivers/video/vga16fb.c: In function ‘vga16fb_destroy’:
> drivers/video/vga16fb.c:1268: warning: unused variable ‘dev’
> ...
>
> Signed-off-by: Xishi Qiu <qiuxi...@huawei.com>
> ---
>  drivers/video/vga16fb.c |    1 -
>  1 files changed, 0 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/video/vga16fb.c b/drivers/video/vga16fb.c
> index 830ded4..2827333 100644
> --- a/drivers/video/vga16fb.c
> +++ b/drivers/video/vga16fb.c
> @@ -1265,7 +1265,6 @@ static void vga16fb_imageblit(struct fb_info *info, const struct fb_image *image
>
>  static void vga16fb_destroy(struct fb_info *info)
>  {
> -	struct platform_device *dev = container_of(info->device, struct platform_device, dev);
>  	iounmap(info->screen_base);
>  	fb_dealloc_cmap(&info->cmap);
>  	/* XXX unshare VGA regions */
Re: [PATCH RESEND] fs/bio-integrity: fix a potential mem leak
cc akpm

On 07/29/2013 09:49 AM, Gu Zheng wrote:
> Free the bio_integrity_pool in the fail path of biovec_create_pool
> in function bioset_integrity_create().
>
> Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
> ---
>  fs/bio-integrity.c |    9 +++++----
>  1 files changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
> index 8fb4291..6025084 100644
> --- a/fs/bio-integrity.c
> +++ b/fs/bio-integrity.c
> @@ -716,13 +716,14 @@ int bioset_integrity_create(struct bio_set *bs, int pool_size)
>  		return 0;
>
>  	bs->bio_integrity_pool = mempool_create_slab_pool(pool_size, bip_slab);
> -
> -	bs->bvec_integrity_pool = biovec_create_pool(bs, pool_size);
> -	if (!bs->bvec_integrity_pool)
> +	if (!bs->bio_integrity_pool)
>  		return -1;
>
> -	if (!bs->bio_integrity_pool)
> +	bs->bvec_integrity_pool = biovec_create_pool(bs, pool_size);
> +	if (!bs->bvec_integrity_pool) {
> +		mempool_destroy(bs->bio_integrity_pool);
>  		return -1;
> +	}
>
>  	return 0;
>  }
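The error-path ordering this patch establishes (check the first allocation right away, and free it if the second one fails) can be illustrated with a small userspace sketch. This is only an analogy using `malloc`/`free`; `struct pools`, `pools_create`, `alloc_fn` and `failing_alloc` are hypothetical names introduced for illustration and do not exist in the kernel.

```c
#include <assert.h>
#include <stdlib.h>

struct pools {
	void *bio_pool;		/* stands in for bio_integrity_pool */
	void *bvec_pool;	/* stands in for bvec_integrity_pool */
};

/* Allocator hook, so that a failure of either step can be simulated. */
typedef void *(*alloc_fn)(size_t);

static void *failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}

static int pools_create(struct pools *p, alloc_fn first, alloc_fn second)
{
	assert(p != NULL);

	p->bio_pool = first(64);
	if (!p->bio_pool)
		return -1;	/* nothing allocated yet, nothing to undo */

	p->bvec_pool = second(64);
	if (!p->bvec_pool) {
		/* Undo the first allocation so that it is not leaked. */
		free(p->bio_pool);
		p->bio_pool = NULL;
		return -1;
	}

	return 0;
}
```

Calling `pools_create(&p, malloc, failing_alloc)` exercises the path that the original code leaked on: the first pool is allocated, the second fails, and the first is released before returning.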
Re: [PATCH] f2fs: add a wait step when submit bio with {READ,WRITE}_SYNC
On 07/31/2013 06:06 PM, Jaegeuk Kim wrote:
> 2013-07-31 (Wed), 09:59 +0800, Gu Zheng:
>> Hi Kim,
>>
>> On 07/30/2013 08:29 PM, Jaegeuk Kim wrote:
>>> Hi Gu,
>>>
>>> The original read flow was to avoid redundant lock/unlock_page() calls.
>>
>> Right, this can gain better read performance.
>> But is the wait step after submitting bio with READ_SYNC needless too?
>
> Correct, the READ_SYNC is also used for IO priority.
> The basic read policy here is that the caller should lock the page only
> when it wants to manipulate the data therein.
> Otherwise, we don't need unnecessary locks and unlocks.

Got it. It seems that I misread the code originally; it's clear now.
Thanks very much for your explanation. :)

Regards,
Gu

> Thanks,
>
>>> And we should not wait for WRITE_SYNC, since it is just for write
>>> priority, not for synchronization of the file system.
>>
>> Got it, thanks for your explanation. :)
>>
>> Best regards,
>> Gu
>>
>>> Thanks,
>>>
>>> 2013-07-30 (Tue), 18:06 +0800, Gu Zheng:
>>>> When we submit bio with READ_SYNC or WRITE_SYNC, we need to wait a
>>>> moment for the io completion, current codes only find_data_page()
>>>> follows the rule, other places missing this step, so add it.
>>>>
>>>> Further more, moving the PageUptodate check into f2fs_readpage() to
>>>> clean up the codes.
>>>>
>>>> Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
>>>> [...]
Re: [PATCH 1/2] f2fs: add sysfs support for controlling the gc_thread
Hi Jeon, On 07/31/2013 10:33 PM, Namjae Jeon wrote: From: Namjae Jeon namjae.j...@samsung.com Add sysfs entries to control the timing parameters for f2fs gc thread. Various Sysfs options introduced are: gc_min_sleep_time: Min Sleep time for GC in ms gc_max_sleep_time: Max Sleep time for GC in ms gc_no_gc_sleep_time: Default Sleep time for GC in ms Signed-off-by: Namjae Jeon namjae.j...@samsung.com Signed-off-by: Pankaj Kumar pankaj...@samsung.com --- Documentation/ABI/testing/sysfs-fs-f2fs | 22 ++ Documentation/filesystems/f2fs.txt | 26 +++ fs/f2fs/f2fs.h |4 + fs/f2fs/gc.c| 17 +++-- fs/f2fs/gc.h| 33 fs/f2fs/super.c | 124 +++ 6 files changed, 206 insertions(+), 20 deletions(-) create mode 100644 Documentation/ABI/testing/sysfs-fs-f2fs diff --git a/Documentation/ABI/testing/sysfs-fs-f2fs b/Documentation/ABI/testing/sysfs-fs-f2fs new file mode 100644 index 000..5f44095 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-fs-f2fs @@ -0,0 +1,22 @@ +What:/sys/fs/f2fs/disk/gc_max_sleep_time +Date:July 2013 +Contact: Namjae Jeon namjae.j...@samsung.com +Description: + Controls the maximun sleep time for gc_thread. Time + is in milliseconds. + +What:/sys/fs/f2fs/disk/gc_min_sleep_time +Date:July 2013 +Contact: Namjae Jeon namjae.j...@samsung.com +Description: + Controls the minimum sleep time for gc_thread. Time + is in milliseconds. + +What:/sys/fs/f2fs/disk/gc_no_gc_sleep_time +Date:July 2013 +Contact: Namjae Jeon namjae.j...@samsung.com +Description: + Controls the default sleep time for gc_thread. Time + is in milliseconds. + + diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.txt index 0500c19..2e9e873 100644 --- a/Documentation/filesystems/f2fs.txt +++ b/Documentation/filesystems/f2fs.txt @@ -133,6 +133,32 @@ f2fs. Each file shows the whole f2fs information. - current memory footprint consumed by f2fs. +SYSFS ENTRIES + + +Information about mounted f2fs file systems can be found in +/sys/fs/f2fs. 
Each mounted filesystem will have a directory in +/sys/fs/f2fs based on its device name (i.e., /sys/fs/f2fs/sda). +The files in each per-device directory are shown in table below. + +Files in /sys/fs/f2fs/devname +(see also Documentation/ABI/testing/sysfs-fs-f2fs) +.. + File Content + + gc_max_sleep_timeThis tuning parameter controls the maximum sleep + time for the garbage collection thread. Time is + in milliseconds. + + gc_min_sleep_timeThis tuning parameter controls the minimum sleep + time for the garbage collection thread. Time is + in milliseconds. + + gc_no_gc_sleep_time This tuning parameter controls the default sleep + time for the garbage collection thread. Time is + in milliseconds. + + USAGE diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h index 78777cd..63813be 100644 --- a/fs/f2fs/f2fs.h +++ b/fs/f2fs/f2fs.h @@ -430,6 +430,10 @@ struct f2fs_sb_info { #endif unsigned int last_victim[2];/* last victim segment # */ spinlock_t stat_lock; /* lock for stat operations */ + + /* For sysfs suppport */ + struct kobject s_kobj; + struct completion s_kobj_unregister; What is this completion used for? Or it's an ahead design? I do not find synchronization routines use it. Am I missing something? }; /* diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c index 35f9b1a..60d4f67 100644 --- a/fs/f2fs/gc.c +++ b/fs/f2fs/gc.c @@ -29,10 +29,11 @@ static struct kmem_cache *winode_slab; static int gc_thread_func(void *data) { struct f2fs_sb_info *sbi = data; + struct f2fs_gc_kthread *gc_th = sbi-gc_thread; wait_queue_head_t *wq = sbi-gc_thread-gc_wait_queue_head; long wait_ms; - wait_ms = GC_THREAD_MIN_SLEEP_TIME; + wait_ms = gc_th-min_sleep_time; do { if (try_to_freeze()) @@ -45,7 +46,7 @@ static int gc_thread_func(void *data) break;
Re: [PATCH 2/2] f2fs: add sysfs entries to select the gc policy
Hi Jeon,

On 07/31/2013 10:33 PM, Namjae Jeon wrote:
> From: Namjae Jeon <namjae.j...@samsung.com>
>
> Add sysfs entries namely gc_long_idle and gc_short_idle to control the
> gc policy. Where long idle corresponds to selecting a cost benefit
> approach, while short idle corresponds to selecting a greedy approach
> to garbage collection. The selection is mutually exclusive one approach
> will work at any point.
>
> Signed-off-by: Namjae Jeon <namjae.j...@samsung.com>
> Signed-off-by: Pankaj Kumar <pankaj...@samsung.com>
> ---
>  Documentation/ABI/testing/sysfs-fs-f2fs |   12 +++
>  Documentation/filesystems/f2fs.txt      |    8 +
>  fs/f2fs/gc.c                            |   22 ++--
>  fs/f2fs/gc.h                            |    4 +++
>  fs/f2fs/super.c                         |   59 +--
>  5 files changed, 99 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-fs-f2fs b/Documentation/ABI/testing/sysfs-fs-f2fs
> index 5f44095..96b62ea 100644
> --- a/Documentation/ABI/testing/sysfs-fs-f2fs
> +++ b/Documentation/ABI/testing/sysfs-fs-f2fs
> @@ -19,4 +19,16 @@ Description:
>  		 Controls the default sleep time for gc_thread. Time
>  		 is in milliseconds.
>
> +What:		/sys/fs/f2fs/<disk>/gc_long_idle
> +Date:		July 2013
> +Contact:	Namjae Jeon <namjae.j...@samsung.com>
> +Description:
> +		 Controls the selection of gc policy. long_idle is used
> +		 to select the cost benefit approach for garbage collection.
> +What:		/sys/fs/f2fs/<disk>/gc_short_idle
> +Date:		July 2013
> +Contact:	Namjae Jeon <namjae.j...@samsung.com>
> +Description:
> +		 Controls the selection of gc policy. short_idle is used
> +		 to select the greedy approach for garbage collection.
> diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.txt
> index 2e9e873..06dd5d7 100644
> --- a/Documentation/filesystems/f2fs.txt
> +++ b/Documentation/filesystems/f2fs.txt
> @@ -158,6 +158,14 @@ Files in /sys/fs/f2fs/<devname>
>                                time for the garbage collection thread. Time is
>                                in milliseconds.
>
> + gc_long_idle                 This parameter controls the selection of cost
> +                              benefit approach for garbage collectoin. Writing
> +                              1 to this file will select the cost benefit policy.
> +
> + gc_short_idle                This parameter controls the selection of greedy
> +                              approach for the garbage collection. Writing 1
> +                              to this file will select the greedy policy.

Why introduce two opposite attributes? It will cause confusion if both
are enabled, or both disabled, at the same time.

> +
>
>  USAGE
> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
> index 60d4f67..af2d9d7 100644
> --- a/fs/f2fs/gc.c
> +++ b/fs/f2fs/gc.c
> @@ -106,6 +106,8 @@ int start_gc_thread(struct f2fs_sb_info *sbi)
>  	gc_th->max_sleep_time = DEF_GC_THREAD_MAX_SLEEP_TIME;
>  	gc_th->no_gc_sleep_time = DEF_GC_THREAD_NOGC_SLEEP_TIME;
>
> +	gc_th->long_idle = gc_th->short_idle = 0;
> +
>  	sbi->gc_thread = gc_th;
>  	init_waitqueue_head(&sbi->gc_thread->gc_wait_queue_head);
>  	sbi->gc_thread->f2fs_gc_task = kthread_run(gc_thread_func, sbi,
> @@ -130,9 +132,23 @@ void stop_gc_thread(struct f2fs_sb_info *sbi)
>  	sbi->gc_thread = NULL;
>  }
>
> -static int select_gc_type(int gc_type)
> +static int select_gc_type(struct f2fs_gc_kthread *gc_th, int gc_type)
>  {
> -	return (gc_type == BG_GC) ? GC_CB : GC_GREEDY;
> +	int gc_mode;
> +
> +	if (gc_th) {
> +		if (gc_th->long_idle) {
> +			gc_mode = GC_CB;
> +			goto out;
> +		} else if (gc_th->short_idle) {
> +			gc_mode = GC_GREEDY;
> +			goto out;
> +		}
> +	}
> +
> +	gc_mode = (gc_type == BG_GC) ? GC_CB : GC_GREEDY;
> +out:
> +	return gc_mode;
>  }
>
>  static void select_policy(struct f2fs_sb_info *sbi, int gc_type,
> @@ -145,7 +161,7 @@ static void select_policy(struct f2fs_sb_info *sbi, int gc_type,
>  		p->dirty_segmap = dirty_i->dirty_segmap[type];
>  		p->ofs_unit = 1;
>  	} else {
> -		p->gc_mode = select_gc_type(gc_type);
> +		p->gc_mode = select_gc_type(sbi->gc_thread, gc_type);
>  		p->dirty_segmap = dirty_i->dirty_segmap[DIRTY];
>  		p->ofs_unit = sbi->segs_per_sec;
>  	}
> diff --git a/fs/f2fs/gc.h b/fs/f2fs/gc.h
> index f4bf44c..b2faae5 100644
> --- a/fs/f2fs/gc.h
> +++ b/fs/f2fs/gc.h
> @@ -30,6 +30,10 @@ struct f2fs_gc_kthread {
>  	unsigned int min_sleep_time;
>  	unsigned int max_sleep_time;
>  	unsigned int no_gc_sleep_time;
> +
> +	/* for changing gc
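The review question about two opposite attributes can be made concrete with a sketch. One possible way to rule out the "both set" state, shown purely as an illustration and not as what the patch does, is a single tri-state value. The enum `gc_idle` and its constants are hypothetical; `BG_GC`, `GC_CB` and `GC_GREEDY` mirror the names in the quoted diff but are redefined locally so the snippet is self-contained.

```c
#include <assert.h>

enum gc_type { BG_GC, FG_GC };
enum gc_mode { GC_CB, GC_GREEDY };

/* Hypothetical single knob replacing long_idle/short_idle. */
enum gc_idle {
	GC_IDLE_DEFAULT = 0,	/* let the gc type decide */
	GC_IDLE_CB = 1,		/* force cost-benefit */
	GC_IDLE_GREEDY = 2	/* force greedy */
};

static enum gc_mode select_gc_mode(enum gc_idle idle, enum gc_type type)
{
	assert(idle >= GC_IDLE_DEFAULT && idle <= GC_IDLE_GREEDY);

	switch (idle) {
	case GC_IDLE_CB:
		return GC_CB;
	case GC_IDLE_GREEDY:
		return GC_GREEDY;
	default:
		/* Same fallback as the original select_gc_type(). */
		return (type == BG_GC) ? GC_CB : GC_GREEDY;
	}
}
```

With one value there is no precedence rule to document: in the two-flag version of the quoted diff, `long_idle` silently wins when both flags are set.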
Re: [PATCH] f2fs: fix handling orphan inodes
On 08/01/2013 03:58 PM, Jaegeuk Kim wrote:
> This patch fixes mishandling of the sbi->n_orphans variable.
>
> If users request lots of f2fs_unlink(), check_orphan_space() could be
> contended. In such a case, sbi->n_orphans can be read incorrectly so that
> f2fs_unlink() would fall into the wrong state which results in the failure
> of add_orphan_inode().
>
> So, let's increment sbi->n_orphans virtually prior to the actual orphan
> inode stuffs. After that, let's release sbi->n_orphans by calling
> release_orphan_inode or remove_orphan_inode.

Hi Kim,

The key point is that we did not decrement sbi->n_orphans when we
release/remove an orphan inode, so just adding the decrement step can fix
this issue. But why move the increment of sbi->n_orphans before we add the
orphan inode? It seems that we get no benefit from it, and it makes the
procedure a bit complex, because we have to decrement sbi->n_orphans in
some failure paths before we really add the orphan inode.

Thanks,
Gu

> Signed-off-by: Jaegeuk Kim <jaegeuk@samsung.com>
> ---
>  fs/f2fs/checkpoint.c | 13 ++++++++++---
>  fs/f2fs/dir.c        |  2 ++
>  fs/f2fs/f2fs.h       |  3 ++-
>  fs/f2fs/namei.c      | 19 ++++++++++++++-----
>  4 files changed, 28 insertions(+), 9 deletions(-)
>
> diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
> index fe91773..c5a5c39 100644
> --- a/fs/f2fs/checkpoint.c
> +++ b/fs/f2fs/checkpoint.c
> @@ -182,7 +182,7 @@ const struct address_space_operations f2fs_meta_aops = {
>  	.set_page_dirty	= f2fs_set_meta_page_dirty,
>  };
>
> -int check_orphan_space(struct f2fs_sb_info *sbi)
> +int acquire_orphan_inode(struct f2fs_sb_info *sbi)
>  {
>  	unsigned int max_orphans;
>  	int err = 0;
> @@ -197,10 +197,19 @@ int check_orphan_space(struct f2fs_sb_info *sbi)
>  	mutex_lock(&sbi->orphan_inode_mutex);
>  	if (sbi->n_orphans >= max_orphans)
>  		err = -ENOSPC;
> +	else
> +		sbi->n_orphans++;
>  	mutex_unlock(&sbi->orphan_inode_mutex);
>  	return err;
>  }
>
> +void release_orphan_inode(struct f2fs_sb_info *sbi)
> +{
> +	mutex_lock(&sbi->orphan_inode_mutex);
> +	sbi->n_orphans--;
> +	mutex_unlock(&sbi->orphan_inode_mutex);
> +}
> +
>  void add_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino)
>  {
>  	struct list_head *head, *this;
> @@ -229,8 +238,6 @@ retry:
>  		list_add(&new->list, this->prev);
>  	else
>  		list_add_tail(&new->list, head);
> -
> -	sbi->n_orphans++;
>  out:
>  	mutex_unlock(&sbi->orphan_inode_mutex);
>  }
> diff --git a/fs/f2fs/dir.c b/fs/f2fs/dir.c
> index d1bb260..384c6da 100644
> --- a/fs/f2fs/dir.c
> +++ b/fs/f2fs/dir.c
> @@ -572,6 +572,8 @@ void f2fs_delete_entry(struct f2fs_dir_entry *dentry, struct page *page,
>
>  		if (inode->i_nlink == 0)
>  			add_orphan_inode(sbi, inode->i_ino);
> +		else
> +			release_orphan_inode(sbi);
>  	}
>
>  	if (bit_pos == NR_DENTRY_IN_BLOCK) {
> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> index a6858c7..78777cd 100644
> --- a/fs/f2fs/f2fs.h
> +++ b/fs/f2fs/f2fs.h
> @@ -1044,7 +1044,8 @@ void destroy_segment_manager(struct f2fs_sb_info *);
>  struct page *grab_meta_page(struct f2fs_sb_info *, pgoff_t);
>  struct page *get_meta_page(struct f2fs_sb_info *, pgoff_t);
>  long sync_meta_pages(struct f2fs_sb_info *, enum page_type, long);
> -int check_orphan_space(struct f2fs_sb_info *);
> +int acquire_orphan_inode(struct f2fs_sb_info *);
> +void release_orphan_inode(struct f2fs_sb_info *);
>  void add_orphan_inode(struct f2fs_sb_info *, nid_t);
>  void remove_orphan_inode(struct f2fs_sb_info *, nid_t);
>  int recover_orphan_inodes(struct f2fs_sb_info *);
> diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c
> index 3297278..4e47518 100644
> --- a/fs/f2fs/namei.c
> +++ b/fs/f2fs/namei.c
> @@ -239,7 +239,7 @@ static int f2fs_unlink(struct inode *dir, struct dentry *dentry)
>  	if (!de)
>  		goto fail;
>
> -	err = check_orphan_space(sbi);
> +	err = acquire_orphan_inode(sbi);
>  	if (err) {
>  		kunmap(page);
>  		f2fs_put_page(page, 0);
> @@ -393,7 +393,7 @@ static int f2fs_rename(struct inode *old_dir, struct dentry *old_dentry,
>  	struct inode *old_inode = old_dentry->d_inode;
>  	struct inode *new_inode = new_dentry->d_inode;
>  	struct page *old_dir_page;
> -	struct page *old_page;
> +	struct page *old_page, *new_page;
>  	struct f2fs_dir_entry *old_dir_entry = NULL;
>  	struct f2fs_dir_entry *old_entry;
>  	struct f2fs_dir_entry *new_entry;
> @@ -415,7 +415,6 @@ static int f2fs_rename(struct inode *old_dir, struct dentry *old_dentry,
>  	ilock = mutex_lock_op(sbi);
>
>  	if (new_inode) {
> -		struct page *new_page;
>
>  		err = -ENOTEMPTY;
>  		if (old_dir_entry && !f2fs_empty_dir(new_inode))
> @@ -427,9 +426,13 @@ static int f2fs_rename(struct inode *old_dir, struct dentry *old_dentry,
>  		if (!new_entry)
>  			goto out_dir;
>
> +		err = acquire_orphan_inode(sbi);
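The reserve-then-release counting that the patch description and the review discuss can be sketched in userspace as follows. This is an illustration under assumptions, not the f2fs implementation: `orphan_ctr`, `orphan_init`, `orphan_acquire` and `orphan_release` are hypothetical names, and a pthread mutex stands in for `orphan_inode_mutex`.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

struct orphan_ctr {
	pthread_mutex_t lock;	/* stands in for orphan_inode_mutex */
	unsigned int n;		/* reserved plus recorded orphans */
	unsigned int max;	/* capacity limit */
};

static void orphan_init(struct orphan_ctr *c, unsigned int max)
{
	pthread_mutex_init(&c->lock, NULL);
	c->n = 0;
	c->max = max;
}

/*
 * Check and reserve in one critical section, so two callers can never
 * both pass the capacity check for the last free slot.
 */
static int orphan_acquire(struct orphan_ctr *c)
{
	int err = 0;

	pthread_mutex_lock(&c->lock);
	if (c->n >= c->max)
		err = -ENOSPC;
	else
		c->n++;
	pthread_mutex_unlock(&c->lock);
	return err;
}

/* Give the reservation back when the slot turns out not to be needed. */
static void orphan_release(struct orphan_ctr *c)
{
	pthread_mutex_lock(&c->lock);
	assert(c->n > 0);
	c->n--;
	pthread_mutex_unlock(&c->lock);
}
```

The cost raised in the review is visible here as well: every path that calls `orphan_acquire()` and then bails out must remember to call `orphan_release()`.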
Re: [PATCH 1/2] f2fs: add sysfs support for controlling the gc_thread
On 08/02/2013 09:19 AM, Namjae Jeon wrote: diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h index 78777cd..63813be 100644 --- a/fs/f2fs/f2fs.h +++ b/fs/f2fs/f2fs.h @@ -430,6 +430,10 @@ struct f2fs_sb_info { #endif unsigned int last_victim[2];/* last victim segment # */ spinlock_t stat_lock; /* lock for stat operations */ + + /* For sysfs suppport */ + struct kobject s_kobj; + struct completion s_kobj_unregister; Hi. Gu. What is this completion used for? Or it's an ahead design? I do not find synchronization routines use it. Am I missing something? You're right. it is my mistake. I will update it on next version patch. }; /* diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c index 35f9b1a..60d4f67 100644 --- a/fs/f2fs/gc.c +++ b/fs/f2fs/gc.c @@ -29,10 +29,11 @@ static struct kmem_cache *winode_slab; static int gc_thread_func(void *data) { struct f2fs_sb_info *sbi = data; + struct f2fs_gc_kthread *gc_th = sbi-gc_thread; wait_queue_head_t *wq = sbi-gc_thread-gc_wait_queue_head; long wait_ms; - wait_ms = GC_THREAD_MIN_SLEEP_TIME; + wait_ms = gc_th-min_sleep_time; do { if (try_to_freeze()) @@ -45,7 +46,7 @@ static int gc_thread_func(void *data) break; if (sbi-sb-s_writers.frozen = SB_FREEZE_WRITE) { - wait_ms = GC_THREAD_MAX_SLEEP_TIME; + wait_ms = increase_sleep_time(gc_th, wait_ms); continue; } @@ -66,15 +67,15 @@ static int gc_thread_func(void *data) continue; if (!is_idle(sbi)) { - wait_ms = increase_sleep_time(wait_ms); + wait_ms = increase_sleep_time(gc_th, wait_ms); mutex_unlock(sbi-gc_mutex); continue; } if (has_enough_invalid_blocks(sbi)) - wait_ms = decrease_sleep_time(wait_ms); + wait_ms = decrease_sleep_time(gc_th, wait_ms); else - wait_ms = increase_sleep_time(wait_ms); + wait_ms = increase_sleep_time(gc_th, wait_ms); #ifdef CONFIG_F2FS_STAT_FS sbi-bg_gc++; @@ -82,7 +83,7 @@ static int gc_thread_func(void *data) /* if return value is not zero, no victim was selected */ if (f2fs_gc(sbi)) - wait_ms = GC_THREAD_NOGC_SLEEP_TIME; + wait_ms = gc_th-no_gc_sleep_time; } while 
(!kthread_should_stop()); return 0; } @@ -101,6 +102,10 @@ int start_gc_thread(struct f2fs_sb_info *sbi) goto out; } + gc_th-min_sleep_time = DEF_GC_THREAD_MIN_SLEEP_TIME; + gc_th-max_sleep_time = DEF_GC_THREAD_MAX_SLEEP_TIME; + gc_th-no_gc_sleep_time = DEF_GC_THREAD_NOGC_SLEEP_TIME; + sbi-gc_thread = gc_th; init_waitqueue_head(sbi-gc_thread-gc_wait_queue_head); sbi-gc_thread-f2fs_gc_task = kthread_run(gc_thread_func, sbi, diff --git a/fs/f2fs/gc.h b/fs/f2fs/gc.h index 2c6a6bd..f4bf44c 100644 --- a/fs/f2fs/gc.h +++ b/fs/f2fs/gc.h @@ -13,9 +13,9 @@ * whether IO subsystem is idle * or not */ -#define GC_THREAD_MIN_SLEEP_TIME 3 /* milliseconds */ -#define GC_THREAD_MAX_SLEEP_TIME 6 -#define GC_THREAD_NOGC_SLEEP_TIME 30 /* wait 5 min */ +#define DEF_GC_THREAD_MIN_SLEEP_TIME 3 /* milliseconds */ +#define DEF_GC_THREAD_MAX_SLEEP_TIME 6 +#define DEF_GC_THREAD_NOGC_SLEEP_TIME 30 /* wait 5 min */ #define LIMIT_INVALID_BLOCK40 /* percentage over total user space */ #define LIMIT_FREE_BLOCK 40 /* percentage over invalid + free space */ @@ -25,6 +25,11 @@ struct f2fs_gc_kthread { struct task_struct *f2fs_gc_task; wait_queue_head_t gc_wait_queue_head; + + /* for gc sleep time */ + unsigned int min_sleep_time; + unsigned int max_sleep_time; + unsigned int no_gc_sleep_time; Though these attributes are used for gc thread, and in current design gc_thread is always singleton per f2fs_sb, but thare're in fact f2fs sb infos. So I think it's to attach these to f2fs_sb_info. What's your opinion? It does not matter wherever it is. but I think that these gc time are for gc thread. So I put gc time to gc thread. Yeah, in fact it's also OK. 
:) Regards, Gu Thanks for review :) Thanks, Gu }; -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RESEND PATCH 2/2] staging/olpc_docn: reorder the lock sequence to avoid potential dead lock
The lock sequence of dcon_blank_fb() (fb_info->lock ---> console_lock) is
the reverse of the one in console_callback() (console_lock ---> fb_info->lock),
which can lead to a deadlock. Reorder the lock sequence of dcon_blank_fb()
to avoid the potential deadlock.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
---
 drivers/staging/olpc_dcon/olpc_dcon.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/olpc_dcon/olpc_dcon.c b/drivers/staging/olpc_dcon/olpc_dcon.c
index 198595e..9db88d9 100644
--- a/drivers/staging/olpc_dcon/olpc_dcon.c
+++ b/drivers/staging/olpc_dcon/olpc_dcon.c
@@ -255,17 +255,19 @@ static bool dcon_blank_fb(struct dcon_priv *dcon, bool blank)
 {
 	int err;
 
+	console_lock();
 	if (!lock_fb_info(dcon->fbinfo)) {
+		console_unlock();
 		dev_err(&dcon->client->dev, "unable to lock framebuffer\n");
 		return false;
 	}
-	console_lock();
+
 	dcon->ignore_fb_events = true;
 	err = fb_blank(dcon->fbinfo,
 			blank ? FB_BLANK_POWERDOWN : FB_BLANK_UNBLANK);
 	dcon->ignore_fb_events = false;
-	console_unlock();
 	unlock_fb_info(dcon->fbinfo);
+	console_unlock();
 
 	if (err) {
 		dev_err(&dcon->client->dev, "couldn't %sblank framebuffer\n",
-- 
1.7.7
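To illustrate why the two acquisition orders conflict, here is a minimal userspace sketch of a rank-based lock-order check. It is not lockdep and not kernel code: the ranks, struct held_locks, order_acquire() and order_release() are hypothetical names invented for this example, and no real locking takes place. The only assumption taken from the patch is that console_lock must come before fb_info->lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ranks: a lower rank must be taken before a higher one. */
enum lock_rank {
	RANK_CONSOLE = 0,	/* stands in for console_lock */
	RANK_FB_INFO = 1,	/* stands in for fb_info->lock */
};

#define MAX_HELD 8

/* Stack of ranks currently "held" by one thread of execution. */
struct held_locks {
	enum lock_rank stack[MAX_HELD];
	size_t depth;
};

/* Returns 0 if taking 'rank' respects the order, -1 on an inversion. */
static int order_acquire(struct held_locks *h, enum lock_rank rank)
{
	assert(h->depth < MAX_HELD);
	if (h->depth > 0 && h->stack[h->depth - 1] > rank)
		return -1;
	h->stack[h->depth++] = rank;
	return 0;
}

/* Releases the most recently taken rank. */
static void order_release(struct held_locks *h, enum lock_rank rank)
{
	assert(h->depth > 0 && h->stack[h->depth - 1] == rank);
	h->depth--;
}

/* Pre-patch dcon_blank_fb() order: fb_info->lock, then console_lock. */
int old_order_ok(void)
{
	struct held_locks h = { .depth = 0 };

	if (order_acquire(&h, RANK_FB_INFO))
		return 0;
	if (order_acquire(&h, RANK_CONSOLE))
		return 0;	/* the inversion is reported here */
	return 1;
}

/* Post-patch order: console_lock, then fb_info->lock. */
int new_order_ok(void)
{
	struct held_locks h = { .depth = 0 };

	if (order_acquire(&h, RANK_CONSOLE))
		return 0;
	if (order_acquire(&h, RANK_FB_INFO))
		return 0;
	order_release(&h, RANK_FB_INFO);
	order_release(&h, RANK_CONSOLE);
	return 1;
}
```

In this model old_order_ok() reports the inversion while new_order_ok() passes, which mirrors the reordering the patch performs.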
[RESEND PATCH 1/2] fb: reorder the lock sequence to fix potential dead lock
Following commits: 50e244cc79 fb: rework locking to fix lock ordering on takeover e93a9a8687 fb: Yet another band-aid for fixing lockdep mess 054430e773 fbcon: fix locking harder reworked locking to fix related lock ordering on takeover, and introduced console_lock into fbmem, but it seems that the new lock sequence(fb_info-lock --- console_lock) is against with the one in console_callback(console_lock --- fb_info-lock), and leads to a potential dead lock as following: [ 601.079000] == [ 601.079000] [ INFO: possible circular locking dependency detected ] [ 601.079000] 3.11.0 #189 Not tainted [ 601.079000] --- [ 601.079000] kworker/0:3/619 is trying to acquire lock: [ 601.079000] (fb_info-lock){+.+.+.}, at: [81397566] lock_fb_info+0x26/0x60 [ 601.079000] but task is already holding lock: [ 601.079000] (console_lock){+.+.+.}, at: [8141aae3] console_callback+0x13/0x160 [ 601.079000] which lock already depends on the new lock. [ 601.079000] the existing dependency chain (in reverse order) is: [ 601.079000] - #1 (console_lock){+.+.+.}: [ 601.079000][810dc971] lock_acquire+0xa1/0x140 [ 601.079000][810c6267] console_lock+0x77/0x80 [ 601.079000][81399448] register_framebuffer+0x1d8/0x320 [ 601.079000][81cfb4c8] efifb_probe+0x408/0x48f [ 601.079000][8144a963] platform_drv_probe+0x43/0x80 [ 601.079000][8144853b] driver_probe_device+0x8b/0x390 [ 601.079000][814488eb] __driver_attach+0xab/0xb0 [ 601.079000][814463bd] bus_for_each_dev+0x5d/0xa0 [ 601.079000][81447e6e] driver_attach+0x1e/0x20 [ 601.079000][81447a07] bus_add_driver+0x117/0x290 [ 601.079000][81448fea] driver_register+0x7a/0x170 [ 601.079000][8144a10a] __platform_driver_register+0x4a/0x50 [ 601.079000][8144a12d] platform_driver_probe+0x1d/0xb0 [ 601.079000][81cfb0a1] efifb_init+0x273/0x292 [ 601.079000][81002132] do_one_initcall+0x102/0x1c0 [ 601.079000][81cb80a6] kernel_init_freeable+0x15d/0x1ef [ 601.079000][8166d2de] kernel_init+0xe/0xf0 [ 601.079000][816914ec] ret_from_fork+0x7c/0xb0 [ 601.079000] - #0 
(fb_info-lock){+.+.+.}: [ 601.079000][810dc1d8] __lock_acquire+0x1e18/0x1f10 [ 601.079000][810dc971] lock_acquire+0xa1/0x140 [ 601.079000][816835ca] mutex_lock_nested+0x7a/0x3b0 [ 601.079000][81397566] lock_fb_info+0x26/0x60 [ 601.079000][813a4aeb] fbcon_blank+0x29b/0x2e0 [ 601.079000][81418658] do_blank_screen+0x1d8/0x280 [ 601.079000][8141ab34] console_callback+0x64/0x160 [ 601.079000][8108d855] process_one_work+0x1f5/0x540 [ 601.079000][8108e04c] worker_thread+0x11c/0x370 [ 601.079000][81095fbd] kthread+0xed/0x100 [ 601.079000][816914ec] ret_from_fork+0x7c/0xb0 [ 601.079000] other info that might help us debug this: [ 601.079000] Possible unsafe locking scenario: [ 601.079000]CPU0CPU1 [ 601.079000] [ 601.079000] lock(console_lock); [ 601.079000]lock(fb_info-lock); [ 601.079000]lock(console_lock); [ 601.079000] lock(fb_info-lock); [ 601.079000] *** DEADLOCK *** so we reorder the lock sequence the same as it in console_callback() to avoid this issue. And following Tomi's suggestion, fix these similar issues all in fb subsystem. Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com --- drivers/video/fbmem.c| 50 - drivers/video/fbsysfs.c | 19 ++ drivers/video/sh_mobile_lcdcfb.c | 10 --- 3 files changed, 51 insertions(+), 28 deletions(-) diff --git a/drivers/video/fbmem.c b/drivers/video/fbmem.c index dacaf74..010d191 100644 --- a/drivers/video/fbmem.c +++ b/drivers/video/fbmem.c @@ -1108,14 +1108,16 @@ static long do_fb_ioctl(struct fb_info *info, unsigned int cmd, case FBIOPUT_VSCREENINFO: if (copy_from_user(var, argp, sizeof(var))) return -EFAULT; - if (!lock_fb_info(info)) - return -ENODEV; console_lock(); + if (!lock_fb_info(info)) { + console_unlock(); + return -ENODEV; + } info-flags |= FBINFO_MISC_USEREVENT; ret = fb_set_var(info, var); info-flags = ~FBINFO_MISC_USEREVENT; - console_unlock(); unlock_fb_info(info); + console_unlock
[RESEND PATCH] fs/buffer.c: exit if already confirmed page has dirty and writeback buffers
Stop iterating over the buffer heads once we have confirmed that the page
has both dirty and writeback buffers.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
---
 fs/buffer.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 6024877..519cc5c 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -112,7 +112,7 @@ void buffer_check_dirty_writeback(struct page *page,
 			*dirty = true;
 
 		bh = bh->b_this_page;
-	} while (bh != head);
+	} while ((bh != head) && !(*writeback && *dirty));
 }
 EXPORT_SYMBOL(buffer_check_dirty_writeback);
 
-- 
1.7.7
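The effect of the extra loop condition can be sketched in userspace. The following is a toy model, not the kernel function: struct toy_bh and both function names are hypothetical, and the mapping of "locked" to writeback and "dirty" to dirty is an assumption about the surrounding code that the hunk does not show.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct buffer_head on a circular per-page ring. */
struct toy_bh {
	bool locked;		/* assumed to model buffer_locked(bh) */
	bool dirty;		/* assumed to model buffer_dirty(bh) */
	struct toy_bh *next;	/* models bh->b_this_page */
};

/* Walks the ring with the patched condition; returns buffers visited. */
size_t toy_check_dirty_writeback(struct toy_bh *head, bool *dirty,
				 bool *writeback)
{
	struct toy_bh *bh = head;
	size_t visited = 0;

	*dirty = false;
	*writeback = false;
	do {
		if (bh->locked)
			*writeback = true;
		if (bh->dirty)
			*dirty = true;
		visited++;
		bh = bh->next;
	} while (bh != head && !(*writeback && *dirty));

	return visited;
}

/* Builds a 4-buffer ring (think 1k blocks in a 4k page) and walks it. */
size_t toy_visit_count(bool first_locked_and_dirty)
{
	struct toy_bh ring[4];
	bool dirty, writeback;
	size_t i;

	for (i = 0; i < 4; i++) {
		ring[i].locked = false;
		ring[i].dirty = false;
		ring[i].next = &ring[(i + 1) % 4];
	}
	ring[0].locked = first_locked_and_dirty;
	ring[0].dirty = first_locked_and_dirty;

	return toy_check_dirty_writeback(&ring[0], &dirty, &writeback);
}
```

When the first buffer is both locked and dirty the walk stops after one buffer; otherwise all four are visited, so the saving only exists when a page carries more than one buffer.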
Re: [RESEND PATCH 2/2] staging/olpc_docn: reorder the lock sequence to avoid potential dead lock
Hi Dan,

On 11/05/2013 07:02 PM, Dan Carpenter wrote:
> On Tue, Nov 05, 2013 at 06:01:00PM +0800, Gu Zheng wrote:
>> The lock sequence of dcon_blank_fb(fb_info->lock ---> console_lock) is
>> against with the one of console_callback(console_lock ---> fb_info->lock),
>> it'll lead to a potential dead lock, so reorder the lock sequence of
>> dcon_blank_fb to avoid the potential dead lock.
>>
>> Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
>
> Relax, Greg isn't taking new patches for another three weeks because
> the merge window is open.

Got it, I just wanted to get some comments on this patch.

> Also what happened to [PATCH 1/2]?

It fixes a similar issue in the fb subsystem:
https://patchwork.kernel.org/patch/3140121/

Regards,
Gu

> regards,
> dan carpenter
Re: [f2fs-dev] [PATCH] f2fs: avoid to use a NULL point in destroy_segment_manager
On 11/06/2013 09:12 AM, Chao Yu wrote:
> A NULL point should avoid to be used in destroy_segment_manager after
> allocating memory fail for f2fs_sm_info.

Though without this patch it can still work well: if it failed to
allocate f2fs_sm_info, then sit_info, free_info... are all NULL, and the
destroy path (e.g. destroy_dirty_segmap) can deal with them well.
IMO, this patch is still a good catch.

Regards,
Gu

> Signed-off-by: Chao Yu <chao2...@samsung.com>
> ---
>  fs/f2fs/segment.c |    2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
> index 3d4d5fc..ff363e6
> --- a/fs/f2fs/segment.c
> +++ b/fs/f2fs/segment.c
> @@ -1744,6 +1744,8 @@ static void destroy_sit_info(struct f2fs_sb_info *sbi)
>  void destroy_segment_manager(struct f2fs_sb_info *sbi)
>  {
>  	struct f2fs_sm_info *sm_info = SM_I(sbi);
> +	if (!sm_info)
> +		return;
>  	destroy_dirty_segmap(sbi);
>  	destroy_curseg(sbi);
>  	destroy_free_segmap(sbi);
Re: [f2fs-dev] [PATCH] f2fs: avoid to use a NULL point in destroy_segment_manager
On 11/06/2013 01:10 PM, Chao Yu wrote:
> Hi Gu,
>
>> -----Original Message-----
>> From: Gu Zheng [mailto:guz.f...@cn.fujitsu.com]
>> Sent: Wednesday, November 06, 2013 11:41 AM
>> To: Chao Yu
>> Cc: ???; linux-fsde...@vger.kernel.org; linux-kernel@vger.kernel.org;
>> linux-f2fs-de...@lists.sourceforge.net; 谭姝
>> Subject: Re: [f2fs-dev] [PATCH] f2fs: avoid to use a NULL point in
>> destroy_segment_manager
>>
>> On 11/06/2013 09:12 AM, Chao Yu wrote:
>>> A NULL point should avoid to be used in destroy_segment_manager after
>>> allocating memory fail for f2fs_sm_info.
>>
>> Though without this patch it still can work well, because if it failed
>> to allocate f2fs_sm_info, the sit_info, free_info... all were NULL, and
>> the destory path(e.g. destroy_dirty_segmap) can deal with them well.
>
> I think it could not work well.
> Without this patch we may got a segment fault in DIRTY_I(sbi) at the
> following code if it failed to allocate f2fs_sm_info
> memory(sbi->sm_info). Right?

Yes, you're right. SIT_I derives sit_info from f2fs_sm_info.
Sorry for my mistake. :(

Regards,
Gu

> static void destroy_dirty_segmap(struct f2fs_sb_info *sbi)
> {
> 	struct dirty_seglist_info *dirty_i = DIRTY_I(sbi);
>
>> IMO, this patch is still a good catch.
>>
>> Regards,
>> Gu
>>
>>> Signed-off-by: Chao Yu <chao2...@samsung.com>
>>> ---
>>>  fs/f2fs/segment.c |    2 ++
>>>  1 file changed, 2 insertions(+)
>>>
>>> diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
>>> index 3d4d5fc..ff363e6
>>> --- a/fs/f2fs/segment.c
>>> +++ b/fs/f2fs/segment.c
>>> @@ -1744,6 +1744,8 @@ static void destroy_sit_info(struct f2fs_sb_info *sbi)
>>>  void destroy_segment_manager(struct f2fs_sb_info *sbi)
>>>  {
>>>  	struct f2fs_sm_info *sm_info = SM_I(sbi);
>>> +	if (!sm_info)
>>> +		return;
>>>  	destroy_dirty_segmap(sbi);
>>>  	destroy_curseg(sbi);
>>>  	destroy_free_segmap(sbi);
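The point Chao Yu makes in this thread can be sketched in userspace: an accessor macro that dereferences the parent pointer unconditionally makes the teardown path unsafe unless the parent is checked first. All toy_* names below are hypothetical and only mirror the shape of SM_I()/DIRTY_I(); the real f2fs structures are not reproduced here.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_dirty_info {
	int nr_dirty;
};

struct toy_sm_info {
	struct toy_dirty_info *dirty_info;
};

struct toy_sb_info {
	struct toy_sm_info *sm_info;	/* NULL if its allocation failed */
};

/* Like DIRTY_I() in the thread: dereferences sm_info unconditionally. */
#define TOY_DIRTY_I(sbi) ((sbi)->sm_info->dirty_info)

/* Safe only when sbi->sm_info is non-NULL. */
static void toy_destroy_dirty(struct toy_sb_info *sbi)
{
	struct toy_dirty_info *dirty_i = TOY_DIRTY_I(sbi);

	free(dirty_i);			/* free(NULL) is a no-op */
	sbi->sm_info->dirty_info = NULL;
}

/* Returns 1 if something was torn down, 0 if there was nothing to do. */
int toy_destroy_segment_manager(struct toy_sb_info *sbi)
{
	struct toy_sm_info *sm_info = sbi->sm_info;

	if (!sm_info)
		return 0;	/* the guard: never reach TOY_DIRTY_I() with NULL */

	toy_destroy_dirty(sbi);
	sbi->sm_info = NULL;
	free(sm_info);
	return 1;
}

/* Simulates mount failure (alloc_failed != 0) or success, then teardown. */
int toy_teardown_demo(int alloc_failed)
{
	struct toy_sb_info sbi = { .sm_info = NULL };

	if (!alloc_failed) {
		sbi.sm_info = calloc(1, sizeof(*sbi.sm_info));
		if (!sbi.sm_info)
			return -1;
	}
	return toy_destroy_segment_manager(&sbi);
}
```

Without the early return, toy_teardown_demo(1) would dereference a NULL sm_info inside TOY_DIRTY_I(), which is the same failure mode described for DIRTY_I(sbi).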
Re: [RESEND PATCH] fs/buffer.c: exit if already confirmed page has dirty and writeback buffers
Hi Jan,

On 11/07/2013 07:44 PM, Jan Kara wrote:
> On Tue 05-11-13 18:02:03, Gu Zheng wrote:
>> Stop the loop of iterating bh if we have confirmed page has dirty and
>> writeback buffers.
>   Thanks for the patch. What I'm somewhat missing here is a motivation of
> the patch. For the common case where blocksize == pagesize this is a noop
> (only adds some code).

Yes, you're right.

> For the case where blocksize < pagesize we can possibly
> save checking some buffers but how common is that going be?

It's really hard to say. :( But many file systems support small block
sizes.

> Does that minimal speed up outweight the cost of additional check / code
> complication?

In fact, I have not done a complete test. But I think the speedup can
outweigh the cost if the block size is small enough. For example, with
blocksize 1k and pagesize 4k, we can skip 6 bh checks (3 dirty,
3 writeback) in the best case.

Best regards,
Gu

> 								Honza
>
>> Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
>> ---
>>  fs/buffer.c |    2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/fs/buffer.c b/fs/buffer.c
>> index 6024877..519cc5c 100644
>> --- a/fs/buffer.c
>> +++ b/fs/buffer.c
>> @@ -112,7 +112,7 @@ void buffer_check_dirty_writeback(struct page *page,
>>  			*dirty = true;
>>
>>  		bh = bh->b_this_page;
>> -	} while (bh != head);
>> +	} while ((bh != head) && !(*writeback && *dirty));
>>  }
>>  EXPORT_SYMBOL(buffer_check_dirty_writeback);
>>
>> --
>> 1.7.7
Re: [f2fs-dev][PATCH RESEND] f2fs: avoid allocating failure in bio_alloc
Hi Chao, On 09/13/2013 09:27 PM, Chao Yu wrote: This patch add macro MAX_BIO_BLOCKS to limit value of npages in f2fs_bio_alloc, it can avoid allocating failure in bio_alloc caused by npages is larger than UIO_MAXIOV. As I know bio_alloc is based of *fs_bio_set* pool, without the limitation of UIO_MAXIOV, am I missing something? Thanks, Gu Signed-off-by: Yu Chao chao2...@samsung.com --- fs/f2fs/segment.c |4 +++- fs/f2fs/segment.h |3 +++ 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c index 09af9c7..bd79bbe 100644 --- a/fs/f2fs/segment.c +++ b/fs/f2fs/segment.c @@ -657,6 +657,7 @@ static void submit_write_page(struct f2fs_sb_info *sbi, struct page *page, block_t blk_addr, enum page_type type) { struct block_device *bdev = sbi-sb-s_bdev; + int bio_blocks; verify_block_addr(sbi, blk_addr); @@ -676,7 +677,8 @@ retry: goto retry; } - sbi-bio[type] = f2fs_bio_alloc(bdev, max_hw_blocks(sbi)); + bio_blocks = MAX_BIO_BLOCKS(max_hw_blocks(sbi)); + sbi-bio[type] = f2fs_bio_alloc(bdev, bio_blocks); sbi-bio[type]-bi_sector = SECTOR_FROM_BLOCK(sbi, blk_addr); sbi-bio[type]-bi_private = priv; /* diff --git a/fs/f2fs/segment.h b/fs/f2fs/segment.h index bdd10ea..6352af1 100644 --- a/fs/f2fs/segment.h +++ b/fs/f2fs/segment.h @@ -9,6 +9,7 @@ * published by the Free Software Foundation. 
*/ #include linux/blkdev.h +#include linux/uio.h /* constant macro */ #define NULL_SEGNO ((unsigned int)(~0)) @@ -90,6 +91,8 @@ (blk_addr ((sbi)-log_blocksize - F2FS_LOG_SECTOR_SIZE)) #define SECTOR_TO_BLOCK(sbi, sectors) \ (sectors ((sbi)-log_blocksize - F2FS_LOG_SECTOR_SIZE)) +#define MAX_BIO_BLOCKS(max_hw_blocks) \ + (min((int)max_hw_blocks, UIO_MAXIOV)) /* during checkpoint, bio_private is used to synchronize the last bio */ struct bio_private { --- -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [f2fs-dev][PATCH RESEND] f2fs: avoid allocating failure in bio_alloc
Hi Chao, On 09/16/2013 11:26 AM, Chao Yu wrote: Hi Gu -Original Message- From: Gu Zheng [mailto:guz.f...@cn.fujitsu.com] Sent: Monday, September 16, 2013 10:09 AM To: Chao Yu Cc: Kim Jaegeuk; linux-f2fs-de...@lists.sourceforge.net; linux-fsde...@vger.kernel.org; linux-kernel@vger.kernel.org; 谭姝 Subject: Re: [f2fs-dev][PATCH RESEND] f2fs: avoid allocating failure in bio_alloc Hi Chao, On 09/13/2013 09:27 PM, Chao Yu wrote: This patch add macro MAX_BIO_BLOCKS to limit value of npages in f2fs_bio_alloc, it can avoid allocating failure in bio_alloc caused by npages is larger than UIO_MAXIOV. As I know bio_alloc is based of *fs_bio_set* pool, without the limitation of UIO_MAXIOV, am I missing something? Here is the code in bio.c, fs_bio_set is as the actual parameter pass to bs without being inited. fs_bio_set was initiated early in the bio subsystem init. So it may have opportunity to return NULL in this function. It may be, but may not be the thread you mentioned below. --- Bio.c struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs) { .. if (!bs) { if (nr_iovecs UIO_MAXIOV) return NULL; --- I did the abnormal test: modify the max_sectors_kb in /sys/block/sdx/queue to 32767 for a disk with f2fs format, and I got a segfualt in f2fs_bio_alloc after the img mounted. Is there anyting I missed? Hmm, this change will also trigger bvec_alloc failed, did you add some traces to debug this? 
Regards, Gu Thanks, Gu Signed-off-by: Yu Chao chao2...@samsung.com --- fs/f2fs/segment.c |4 +++- fs/f2fs/segment.h |3 +++ 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c index 09af9c7..bd79bbe 100644 --- a/fs/f2fs/segment.c +++ b/fs/f2fs/segment.c @@ -657,6 +657,7 @@ static void submit_write_page(struct f2fs_sb_info *sbi, struct page *page, block_t blk_addr, enum page_type type) { struct block_device *bdev = sbi-sb-s_bdev; + int bio_blocks; verify_block_addr(sbi, blk_addr); @@ -676,7 +677,8 @@ retry: goto retry; } - sbi-bio[type] = f2fs_bio_alloc(bdev, max_hw_blocks(sbi)); + bio_blocks = MAX_BIO_BLOCKS(max_hw_blocks(sbi)); + sbi-bio[type] = f2fs_bio_alloc(bdev, bio_blocks); sbi-bio[type]-bi_sector = SECTOR_FROM_BLOCK(sbi, blk_addr); sbi-bio[type]-bi_private = priv; /* diff --git a/fs/f2fs/segment.h b/fs/f2fs/segment.h index bdd10ea..6352af1 100644 --- a/fs/f2fs/segment.h +++ b/fs/f2fs/segment.h @@ -9,6 +9,7 @@ * published by the Free Software Foundation. 
*/ #include linux/blkdev.h +#include linux/uio.h /* constant macro */ #define NULL_SEGNO ((unsigned int)(~0)) @@ -90,6 +91,8 @@ (blk_addr ((sbi)-log_blocksize - F2FS_LOG_SECTOR_SIZE)) #define SECTOR_TO_BLOCK(sbi, sectors) \ (sectors ((sbi)-log_blocksize - F2FS_LOG_SECTOR_SIZE)) +#define MAX_BIO_BLOCKS(max_hw_blocks) \ + (min((int)max_hw_blocks, UIO_MAXIOV)) /* during checkpoint, bio_private is used to synchronize the last bio */ struct bio_private { --- -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
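The clamp that MAX_BIO_BLOCKS introduces can be sketched as a plain function. This is only an illustration: toy_max_bio_blocks() and TOY_MAX_VECS are hypothetical names, and the value 1024 is assumed here as a stand-in for the UIO_MAXIOV limit discussed in the thread.

```c
#include <assert.h>

/* Assumed stand-in for the per-bio vector limit (UIO_MAXIOV in the patch). */
#define TOY_MAX_VECS 1024

/* Clamps the hardware-derived block count to the allocation limit. */
int toy_max_bio_blocks(unsigned int max_hw_blocks)
{
	/* Compare as unsigned so a huge value cannot turn negative first. */
	return max_hw_blocks < TOY_MAX_VECS ? (int)max_hw_blocks : TOY_MAX_VECS;
}
```

For the test described above (max_sectors_kb raised to 32767 with 4k blocks, roughly 8191 blocks), the sketch would return the limit rather than the raw hardware value; a small value such as 128 passes through unchanged.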
[RFC PATCH] fb: reorder the lock sequence to fix a potential lockdep
Following commits: 50e244cc79 fb: rework locking to fix lock ordering on takeover e93a9a8687 fb: Yet another band-aid for fixing lockdep mess 054430e773 fbcon: fix locking harder reworked locking to fix related lock ordering on takeover, and introduced console_lock into fbmem, but it seems that the new lock sequence(fb_info-lock --- console_lock) is against with the one in console_callback(console_lock --- fb_info-lock), and leads to a potential deadlock as following: [ 601.079000] == [ 601.079000] [ INFO: possible circular locking dependency detected ] [ 601.079000] 3.11.0 #189 Not tainted [ 601.079000] --- [ 601.079000] kworker/0:3/619 is trying to acquire lock: [ 601.079000] (fb_info-lock){+.+.+.}, at: [81397566] lock_fb_info+0x26/0x60 [ 601.079000] but task is already holding lock: [ 601.079000] (console_lock){+.+.+.}, at: [8141aae3] console_callback+0x13/0x160 [ 601.079000] which lock already depends on the new lock. [ 601.079000] the existing dependency chain (in reverse order) is: [ 601.079000] - #1 (console_lock){+.+.+.}: [ 601.079000][810dc971] lock_acquire+0xa1/0x140 [ 601.079000][810c6267] console_lock+0x77/0x80 [ 601.079000][81399448] register_framebuffer+0x1d8/0x320 [ 601.079000][81cfb4c8] efifb_probe+0x408/0x48f [ 601.079000][8144a963] platform_drv_probe+0x43/0x80 [ 601.079000][8144853b] driver_probe_device+0x8b/0x390 [ 601.079000][814488eb] __driver_attach+0xab/0xb0 [ 601.079000][814463bd] bus_for_each_dev+0x5d/0xa0 [ 601.079000][81447e6e] driver_attach+0x1e/0x20 [ 601.079000][81447a07] bus_add_driver+0x117/0x290 [ 601.079000][81448fea] driver_register+0x7a/0x170 [ 601.079000][8144a10a] __platform_driver_register+0x4a/0x50 [ 601.079000][8144a12d] platform_driver_probe+0x1d/0xb0 [ 601.079000][81cfb0a1] efifb_init+0x273/0x292 [ 601.079000][81002132] do_one_initcall+0x102/0x1c0 [ 601.079000][81cb80a6] kernel_init_freeable+0x15d/0x1ef [ 601.079000][8166d2de] kernel_init+0xe/0xf0 [ 601.079000][816914ec] ret_from_fork+0x7c/0xb0 [ 601.079000] - #0 
(fb_info-lock){+.+.+.}: [ 601.079000][810dc1d8] __lock_acquire+0x1e18/0x1f10 [ 601.079000][810dc971] lock_acquire+0xa1/0x140 [ 601.079000][816835ca] mutex_lock_nested+0x7a/0x3b0 [ 601.079000][81397566] lock_fb_info+0x26/0x60 [ 601.079000][813a4aeb] fbcon_blank+0x29b/0x2e0 [ 601.079000][81418658] do_blank_screen+0x1d8/0x280 [ 601.079000][8141ab34] console_callback+0x64/0x160 [ 601.079000][8108d855] process_one_work+0x1f5/0x540 [ 601.079000][8108e04c] worker_thread+0x11c/0x370 [ 601.079000][81095fbd] kthread+0xed/0x100 [ 601.079000][816914ec] ret_from_fork+0x7c/0xb0 [ 601.079000] other info that might help us debug this: [ 601.079000] Possible unsafe locking scenario: [ 601.079000]CPU0CPU1 [ 601.079000] [ 601.079000] lock(console_lock); [ 601.079000]lock(fb_info-lock); [ 601.079000]lock(console_lock); [ 601.079000] lock(fb_info-lock); [ 601.079000] *** DEADLOCK *** so we reorder the lock sequence the same as it in console_callback() to avoid this issue. Not very sure this change is suitable, any comments is welcome. Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com --- drivers/video/fbmem.c | 50 +++- 1 files changed, 32 insertions(+), 18 deletions(-) diff --git a/drivers/video/fbmem.c b/drivers/video/fbmem.c index dacaf74..010d191 100644 --- a/drivers/video/fbmem.c +++ b/drivers/video/fbmem.c @@ -1108,14 +1108,16 @@ static long do_fb_ioctl(struct fb_info *info, unsigned int cmd, case FBIOPUT_VSCREENINFO: if (copy_from_user(var, argp, sizeof(var))) return -EFAULT; - if (!lock_fb_info(info)) - return -ENODEV; console_lock(); + if (!lock_fb_info(info)) { + console_unlock(); + return -ENODEV; + } info-flags |= FBINFO_MISC_USEREVENT; ret = fb_set_var(info, var); info-flags = ~FBINFO_MISC_USEREVENT; - console_unlock(); unlock_fb_info(info); + console_unlock(); if (!ret copy_to_user(argp, var, sizeof(var))) ret = -EFAULT; break
[PATCH] f2fs: introduce f2fs_kmem_cache_alloc to hide the unfailed kmem cache allocation
Introduce the unfailed version of kmem_cache_alloc named f2fs_kmem_cache_alloc to hide the retry routine and make the code a bit cleaner. Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com --- fs/f2fs/checkpoint.c | 26 +++--- fs/f2fs/f2fs.h | 13 + fs/f2fs/gc.c |8 ++-- fs/f2fs/node.c |6 +- 4 files changed, 23 insertions(+), 30 deletions(-) diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c index 8d16071..6fb484c 100644 --- a/fs/f2fs/checkpoint.c +++ b/fs/f2fs/checkpoint.c @@ -226,12 +226,8 @@ void add_orphan_inode(struct f2fs_sb_info *sbi, nid_t ino) break; orphan = NULL; } -retry: - new = kmem_cache_alloc(orphan_entry_slab, GFP_ATOMIC); - if (!new) { - cond_resched(); - goto retry; - } + + new = f2fs_kmem_cache_alloc(orphan_entry_slab, GFP_ATOMIC); new-ino = ino; /* add new_oentry into list which is sorted by inode number */ @@ -484,12 +480,8 @@ void set_dirty_dir_page(struct inode *inode, struct page *page) if (!S_ISDIR(inode-i_mode)) return; -retry: - new = kmem_cache_alloc(inode_entry_slab, GFP_NOFS); - if (!new) { - cond_resched(); - goto retry; - } + + new = f2fs_kmem_cache_alloc(inode_entry_slab, GFP_NOFS); new-inode = inode; INIT_LIST_HEAD(new-list); @@ -506,13 +498,9 @@ retry: void add_dirty_dir_inode(struct inode *inode) { struct f2fs_sb_info *sbi = F2FS_SB(inode-i_sb); - struct dir_inode_entry *new; -retry: - new = kmem_cache_alloc(inode_entry_slab, GFP_NOFS); - if (!new) { - cond_resched(); - goto retry; - } + struct dir_inode_entry *new = + f2fs_kmem_cache_alloc(inode_entry_slab, GFP_NOFS); + new-inode = inode; INIT_LIST_HEAD(new-list); diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h index 171c52f..fa9ad03 100644 --- a/fs/f2fs/f2fs.h +++ b/fs/f2fs/f2fs.h @@ -787,6 +787,19 @@ static inline struct kmem_cache *f2fs_kmem_cache_create(const char *name, return kmem_cache_create(name, size, 0, SLAB_RECLAIM_ACCOUNT, ctor); } +static inline void *f2fs_kmem_cache_alloc(struct kmem_cache *cachep, + gfp_t flags) +{ + void *entry = kmem_cache_alloc(cachep, flags); 
+retry: + if (!entry) { + cond_resched(); + goto retry; + } + + return entry; +} + #define RAW_IS_INODE(p)((p)-footer.nid == (p)-footer.ino) static inline bool IS_INODE(struct page *page) diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c index fbad968..7914b92 100644 --- a/fs/f2fs/gc.c +++ b/fs/f2fs/gc.c @@ -361,12 +361,8 @@ static void add_gc_inode(struct inode *inode, struct list_head *ilist) iput(inode); return; } -repeat: - new_ie = kmem_cache_alloc(winode_slab, GFP_NOFS); - if (!new_ie) { - cond_resched(); - goto repeat; - } + + new_ie = f2fs_kmem_cache_alloc(winode_slab, GFP_NOFS); new_ie-inode = inode; list_add_tail(new_ie-list, ilist); } diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c index ef80f79..fe3cf8e 100644 --- a/fs/f2fs/node.c +++ b/fs/f2fs/node.c @@ -1308,11 +1308,7 @@ static int add_free_nid(struct f2fs_nm_info *nm_i, nid_t nid, bool build) if (allocated) return 0; retry: - i = kmem_cache_alloc(free_nid_slab, GFP_NOFS); - if (!i) { - cond_resched(); - goto retry; - } + i = f2fs_kmem_cache_alloc(free_nid_slab, GFP_NOFS); i-nid = nid; i-state = NID_NEW; -- 1.7.7 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
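Read literally, the f2fs_kmem_cache_alloc() hunk in this patch places the retry: label after the kmem_cache_alloc() call, so a failed first attempt would keep calling cond_resched() without ever allocating again. A minimal userspace sketch of the intended never-fail pattern, with the allocation inside the loop, might look like this. struct flaky_alloc, flaky_try_alloc() and alloc_until_success() are hypothetical names made up for illustration, and malloc() merely stands in for kmem_cache_alloc().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical allocator that fails a fixed number of times, then works. */
struct flaky_alloc {
	int failures_left;
	int calls;
};

static void *flaky_try_alloc(struct flaky_alloc *fa, size_t size)
{
	fa->calls++;
	if (fa->failures_left > 0) {
		fa->failures_left--;
		return NULL;
	}
	return malloc(size);
}

/* The allocation sits inside the loop, so each retry really tries again. */
void *alloc_until_success(struct flaky_alloc *fa, size_t size)
{
	void *entry;

	for (;;) {
		entry = flaky_try_alloc(fa, size);
		if (entry)
			return entry;
		/* kernel code would yield here, e.g. via cond_resched() */
	}
}

/* Returns how many allocation attempts were needed. */
int retry_demo_calls(int failures)
{
	struct flaky_alloc fa = { .failures_left = failures, .calls = 0 };
	void *p = alloc_until_success(&fa, 16);

	free(p);
	return fa.calls;
}
```

With three injected failures the helper makes four attempts, whereas a loop whose label follows the allocation would never make a second one.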
[PATCH] f2fs: delete and free dirty dir freeing inode entry when sync dirty dir inodes
In sync_dirty_dir_inodes(), remove_dirty_dir_inode() is called from the
filemap_flush() callback to delete and free the dirty dir inode entry.
For a freeing inode entry, however, this step is missed after the data
bio is submitted, which may lead to an endless loop if there is a freeing
inode entry in dir_inode_list. Add the delete and free step to fix it.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
---
 fs/f2fs/checkpoint.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index 8d16071..f61838f 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -600,7 +600,16 @@ retry:
 		 * wribacking dentry pages in the freeing inode.
 		 */
 		f2fs_submit_bio(sbi, DATA, true);
+
+		spin_lock(&sbi->dir_inode_lock);
+		list_del(&entry->list);
+#ifdef CONFIG_F2FS_STAT_FS
+		sbi->n_dirty_dirs--;
+#endif
+		spin_unlock(&sbi->dir_inode_lock);
+		kmem_cache_free(inode_entry_slab, entry);
 	}
+
 	goto retry;
 }
 
-- 
1.7.7
[f2fs-dev][PATCH] f2fs: fix a potential out of range issue
Fix a potential out of range issue introduced by commit:
22fb72225a f2fs: simplify write_orphan_inodes for better readable

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
---
 fs/f2fs/checkpoint.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index 7fe69ff..3e62987 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -323,9 +323,9 @@ static void write_orphan_inodes(struct f2fs_sb_info *sbi, block_t start_blk)
 			memset(orphan_blk, 0, sizeof(*orphan_blk));
 		}
 
-		orphan_blk->ino[nentries] = cpu_to_le32(orphan->ino);
+		orphan_blk->ino[nentries++] = cpu_to_le32(orphan->ino);
 
-		if (nentries++ == F2FS_ORPHANS_PER_BLOCK) {
+		if (nentries == F2FS_ORPHANS_PER_BLOCK) {
 			/*
 			 * an orphan block is full of 1020 entries,
 			 * then we need to flush current orphan blocks
-- 
1.7.7
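The off-by-one can be traced with a small userspace model. In the old code the comparison uses the value of nentries before the increment, so the flush only triggers after slot F2FS_ORPHANS_PER_BLOCK itself has been written. The sketch below records indices instead of storing into an array so that it stays in bounds; TOY_ORPHANS_PER_BLOCK, struct fill_stats and both fill functions are hypothetical, and resetting nentries to 0 after a flush is an assumption about code outside the hunk.

```c
#include <assert.h>
#include <stddef.h>

/* Small stand-in; the patch comment says the real block holds 1020 entries. */
#define TOY_ORPHANS_PER_BLOCK 4

struct fill_stats {
	size_t max_index;	/* highest slot index the loop tried to write */
	size_t flushes;		/* number of times a full block was flushed */
};

/* Pre-patch logic: write slot, then compare the old value while incrementing. */
struct fill_stats fill_buggy(size_t count)
{
	struct fill_stats st = { 0, 0 };
	size_t nentries = 0, i;

	for (i = 0; i < count; i++) {
		if (nentries > st.max_index)	/* models ino[nentries] = ... */
			st.max_index = nentries;
		if (nentries++ == TOY_ORPHANS_PER_BLOCK) {
			st.flushes++;
			nentries = 0;		/* assumed reset after flush */
		}
	}
	return st;
}

/* Post-patch logic: increment with the write, then compare. */
struct fill_stats fill_fixed(size_t count)
{
	struct fill_stats st = { 0, 0 };
	size_t nentries = 0, i;

	for (i = 0; i < count; i++) {
		if (nentries > st.max_index)	/* models ino[nentries++] = ... */
			st.max_index = nentries;
		nentries++;
		if (nentries == TOY_ORPHANS_PER_BLOCK) {
			st.flushes++;
			nentries = 0;		/* assumed reset after flush */
		}
	}
	return st;
}
```

In this model the pre-patch loop reaches index TOY_ORPHANS_PER_BLOCK, one past the last valid slot, while the patched loop never exceeds TOY_ORPHANS_PER_BLOCK - 1.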
Re: [PATCH] f2fs: remove the own bi_private allocation
On 11/30/2013 09:48 AM, Jaegeuk Kim wrote: Previously f2fs allocates its own bi_private data structure all the time even though we don't use it. But, can we remove this bi_private allocation? This patch removes such the additional bi_private allocation. 1. Retrieve f2fs_sb_info from its page-mapping-host-i_sb. - This removes the usecases of bi_private in end_io. 2. Use bi_private only when we really need it. - The bi_private is used only when the checkpoint procedure is conducted. - When conducting the checkpoint, f2fs submits a META_FLUSH bio to wait its bio completion. - Since we have no dependancies to remove bi_private now, let's just use bi_private pointer as the completion pointer. Cool, looks good to me.:) Signed-off-by: Jaegeuk Kim jaegeuk@samsung.com Reviewed-by: Gu Zheng guz.f...@cn.fujitsu.com --- fs/f2fs/segment.c | 43 --- fs/f2fs/segment.h | 7 --- 2 files changed, 16 insertions(+), 34 deletions(-) diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c index 0387863..0db4027 100644 --- a/fs/f2fs/segment.c +++ b/fs/f2fs/segment.c @@ -791,7 +791,7 @@ static void f2fs_end_io_write(struct bio *bio, int err) { const int uptodate = test_bit(BIO_UPTODATE, bio-bi_flags); struct bio_vec *bvec = bio-bi_io_vec + bio-bi_vcnt - 1; - struct bio_private *p = bio-bi_private; + struct f2fs_sb_info *sbi = F2FS_SB(bvec-bv_page-mapping-host-i_sb); do { struct page *page = bvec-bv_page; @@ -802,21 +802,21 @@ static void f2fs_end_io_write(struct bio *bio, int err) SetPageError(page); if (page-mapping) set_bit(AS_EIO, page-mapping-flags); - set_ckpt_flags(p-sbi-ckpt, CP_ERROR_FLAG); - p-sbi-sb-s_flags |= MS_RDONLY; + + set_ckpt_flags(sbi-ckpt, CP_ERROR_FLAG); + sbi-sb-s_flags |= MS_RDONLY; } end_page_writeback(page); - dec_page_count(p-sbi, F2FS_WRITEBACK); + dec_page_count(sbi, F2FS_WRITEBACK); } while (bvec = bio-bi_io_vec); - if (p-is_sync) - complete(p-wait); + if (bio-bi_private) + complete(bio-bi_private); - if (!get_pages(p-sbi, F2FS_WRITEBACK) - 
!list_empty(p-sbi-cp_wait.task_list)) - wake_up(p-sbi-cp_wait); + if (!get_pages(sbi, F2FS_WRITEBACK) + !list_empty(sbi-cp_wait.task_list)) + wake_up(sbi-cp_wait); - kfree(p); bio_put(bio); } @@ -838,7 +838,6 @@ static void do_submit_bio(struct f2fs_sb_info *sbi, int rw = sync ? WRITE_SYNC : WRITE; enum page_type btype = PAGE_TYPE_OF_BIO(type); struct f2fs_bio_info *io = sbi-write_io[btype]; - struct bio_private *p; if (!io-bio) return; @@ -851,18 +850,16 @@ static void do_submit_bio(struct f2fs_sb_info *sbi, trace_f2fs_submit_write_bio(sbi-sb, rw, btype, io-bio); - p = io-bio-bi_private; - p-sbi = sbi; - io-bio-bi_end_io = f2fs_end_io_write; - + /* + * META_FLUSH is only from the checkpoint procedure, and we should wait + * this metadata bio for FS consistency. + */ if (type == META_FLUSH) { DECLARE_COMPLETION_ONSTACK(wait); - p-is_sync = true; - p-wait = wait; + io-bio-bi_private = wait; submit_bio(rw, io-bio); wait_for_completion(wait); } else { - p-is_sync = false; submit_bio(rw, io-bio); } io-bio = NULL; @@ -897,18 +894,10 @@ static void submit_write_page(struct f2fs_sb_info *sbi, struct page *page, do_submit_bio(sbi, type, false); alloc_new: if (io-bio == NULL) { - struct bio_private *priv; -retry: - priv = kmalloc(sizeof(struct bio_private), GFP_NOFS); - if (!priv) { - cond_resched(); - goto retry; - } - bio_blocks = MAX_BIO_BLOCKS(max_hw_blocks(sbi)); io-bio = f2fs_bio_alloc(bdev, bio_blocks); io-bio-bi_sector = SECTOR_FROM_BLOCK(sbi, blk_addr); - io-bio-bi_private = priv; + io-bio-bi_end_io = f2fs_end_io_write; /* * The end_io will be assigned at the sumbission phase. * Until then, let bio_add_page() merge consecutive IOs as much diff --git a/fs/f2fs/segment.h b/fs/f2fs/segment.h index 7fea2ee..26812fc 100644 --- a/fs/f2fs/segment.h +++ b/fs/f2fs/segment.h @@ -92,13 +92,6 @@ #define MAX_BIO_BLOCKS(max_hw_blocks) \ (min((int)max_hw_blocks, BIO_MAX_PAGES)) -/* during checkpoint, bio_private is used to synchronize the last
Re: [PATCH] f2fs: refactor bio-related operations
On 11/30/2013 02:25 PM, Jaegeuk Kim wrote: This patch integrates redundant bio operations on read and write IOs. 1. Move bio-related codes to the top of data.c. 2. Replace f2fs_submit_bio with f2fs_submit_merged_bio, which handles read bios additionally. 3. Introduce __submit_merged_bio to submit the merged bio. 4. Change f2fs_readpage to f2fs_submit_page_bio. 5. Introduce f2fs_submit_page_mbio to integrate previous submit_read_page and submit_write_page. Signed-off-by: Jaegeuk Kim jaegeuk@samsung.com Reviewed-by: Gu Zheng guz.f...@cn.fujitsu.com --- fs/f2fs/checkpoint.c| 14 +- fs/f2fs/data.c | 317 +--- fs/f2fs/f2fs.h | 13 +- fs/f2fs/gc.c| 2 +- fs/f2fs/node.c | 14 +- fs/f2fs/recovery.c | 4 +- fs/f2fs/segment.c | 164 +++ include/trace/events/f2fs.h | 30 ++--- 8 files changed, 259 insertions(+), 299 deletions(-) diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c index 40eea42..38f4a224 100644 --- a/fs/f2fs/checkpoint.c +++ b/fs/f2fs/checkpoint.c @@ -61,7 +61,8 @@ repeat: if (PageUptodate(page)) goto out; - if (f2fs_readpage(sbi, page, index, READ_SYNC | REQ_META | REQ_PRIO)) + if (f2fs_submit_page_bio(sbi, page, index, + READ_SYNC | REQ_META | REQ_PRIO)) goto repeat; lock_page(page); @@ -157,7 +158,8 @@ long sync_meta_pages(struct f2fs_sb_info *sbi, enum page_type type, } if (nwritten) - f2fs_submit_bio(sbi, type, nr_to_write == LONG_MAX); + f2fs_submit_merged_bio(sbi, type, nr_to_write == LONG_MAX, + WRITE); return nwritten; } @@ -590,7 +592,7 @@ retry: * We should submit bio, since it exists several * wribacking dentry pages in the freeing inode. 
*/ - f2fs_submit_bio(sbi, DATA, true); + f2fs_submit_merged_bio(sbi, DATA, true, WRITE); } goto retry; } @@ -796,9 +798,9 @@ void write_checkpoint(struct f2fs_sb_info *sbi, bool is_umount) trace_f2fs_write_checkpoint(sbi-sb, is_umount, finish block_ops); - f2fs_submit_bio(sbi, DATA, true); - f2fs_submit_bio(sbi, NODE, true); - f2fs_submit_bio(sbi, META, true); + f2fs_submit_merged_bio(sbi, DATA, true, WRITE); + f2fs_submit_merged_bio(sbi, NODE, true, WRITE); + f2fs_submit_merged_bio(sbi, META, true, WRITE); /* * update checkpoint pack index diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c index c9a76f8..53e3bbb 100644 --- a/fs/f2fs/data.c +++ b/fs/f2fs/data.c @@ -25,6 +25,205 @@ #include trace/events/f2fs.h /* + * Low-level block read/write IO operations. + */ +static struct bio *__bio_alloc(struct block_device *bdev, int npages) +{ + struct bio *bio; + + /* No failure on bio allocation */ + bio = bio_alloc(GFP_NOIO, npages); + bio-bi_bdev = bdev; + bio-bi_private = NULL; + return bio; +} + +static void f2fs_read_end_io(struct bio *bio, int err) +{ + const int uptodate = test_bit(BIO_UPTODATE, bio-bi_flags); + struct bio_vec *bvec = bio-bi_io_vec + bio-bi_vcnt - 1; + + do { + struct page *page = bvec-bv_page; + + if (--bvec = bio-bi_io_vec) + prefetchw(bvec-bv_page-flags); + + if (uptodate) { + SetPageUptodate(page); + } else { + ClearPageUptodate(page); + SetPageError(page); + } + unlock_page(page); + } while (bvec = bio-bi_io_vec); + + bio_put(bio); +} + +static void f2fs_write_end_io(struct bio *bio, int err) +{ + const int uptodate = test_bit(BIO_UPTODATE, bio-bi_flags); + struct bio_vec *bvec = bio-bi_io_vec + bio-bi_vcnt - 1; + struct f2fs_sb_info *sbi = F2FS_SB(bvec-bv_page-mapping-host-i_sb); + + do { + struct page *page = bvec-bv_page; + + if (--bvec = bio-bi_io_vec) + prefetchw(bvec-bv_page-flags); + + if (!uptodate) { + SetPageError(page); + set_bit(AS_EIO, page-mapping-flags); + set_ckpt_flags(sbi-ckpt, CP_ERROR_FLAG); + sbi-sb-s_flags |= MS_RDONLY; + } + 
end_page_writeback(page); + dec_page_count(sbi, F2FS_WRITEBACK); + } while (bvec = bio-bi_io_vec); + + if (bio-bi_private) + complete(bio-bi_private); + + if (!get_pages(sbi, F2FS_WRITEBACK) + !list_empty(sbi-cp_wait.task_list)) + wake_up(sbi-cp_wait); + + bio_put(bio); +} + +static void __submit_merged_bio(struct
Re: GPF in aio_migratepage
Hi Kristian, Dave, Could you please help to check whether the following patch can fix this issue? Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com --- fs/aio.c | 28 ++-- 1 files changed, 10 insertions(+), 18 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 08159ed..fc1fd0a 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -223,33 +223,25 @@ static int __init aio_setup(void) } __initcall(aio_setup); -static void put_aio_ring_file(struct kioctx *ctx) -{ - struct file *aio_ring_file = ctx-aio_ring_file; - if (aio_ring_file) { - truncate_setsize(aio_ring_file-f_inode, 0); - - /* Prevent further access to the kioctx from migratepages */ - spin_lock(aio_ring_file-f_inode-i_mapping-private_lock); - aio_ring_file-f_inode-i_mapping-private_data = NULL; - ctx-aio_ring_file = NULL; - spin_unlock(aio_ring_file-f_inode-i_mapping-private_lock); - - fput(aio_ring_file); - } -} - static void aio_free_ring(struct kioctx *ctx) { + struct file *aio_ring_file = ctx-aio_ring_file; int i; + BUG_ON(!aio_ring_file); + + spin_lock(aio_ring_file-f_inode-i_mapping-private_lock); for (i = 0; i ctx-nr_pages; i++) { pr_debug(pid(%d) [%d] page-count=%d\n, current-pid, i, page_count(ctx-ring_pages[i])); put_page(ctx-ring_pages[i]); } - - put_aio_ring_file(ctx); + truncate_setsize(aio_ring_file-f_inode, 0); + /* Prevent further access to the kioctx from migratepages */ + aio_ring_file-f_inode-i_mapping-private_data = NULL; + ctx-aio_ring_file = NULL; + spin_unlock(aio_ring_file-f_inode-i_mapping-private_lock); + fput(aio_ring_file); if (ctx-ring_pages ctx-ring_pages != ctx-internal_pages) { kfree(ctx-ring_pages); -- 1.7.7 On 11/30/2013 11:28 PM, Kristian Nielsen wrote: Benjamin LaHaise b...@kvack.org writes: For Dave: what line is this bug on? Is it the dereference of ctx when doing spin_lock_irqsave(ctx-completion_lock, flags); or is the ctx-ring_pages[idx] = new; ? 
From the 64 bit splat, I'm thinking the former, which is quite strange given that the clearing of mapping-private_data is protected by mapping-private_lock. If it's the latter, we might well need to check if ctx-ring_pages is NULL during setup. I think I got the same BUG (at least it looks very similar, full details below). The bug is on this line: ctx-ring_pages[idx] = new; Disassembly: af7: 48 89 2c d1mov%rbp,(%rcx,%rdx,8) ctx-ring_pages is 0x (this is x86_64). idx is 13. RCX: RDX: 000d BUG: unable to handle kernel NULL pointer dereference at 0067 So we are de-referencing a pointer that is (page **)-1, causing the crash. If you look closer at the 32-bit dump that Dave gave, you can see that it is similar: 7a2: 89 34 82mov%esi,(%edx,%eax,4) RAX: 6b6b6b6b6b6b6b6b RDX: Though in this case ctx-ring_pages seems to be NULL and idx=old-index seems to be 6b6b6b6b6b6b6b6b, so not completely the same (or maybe I read his dump incorrectly). This is 3.13-rc1. Unfortunately, I do not have a way to reproduce (so far I only saw it this once). But I can see if it turns up again, or should I install -rc2 and see if it goes away? I was not doing anything special at the time, normal desktop load (I was using the evince pdf viewer). Let me know if there is anything else I can do to help track this down? - Kristian. 
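The fault address in the BUG line follows directly from the addressing mode in the disassembly. A minimal sketch of that arithmetic, assuming the values described above (ctx->ring_pages read back as (page **)-1, idx 13, 8-byte pointers); the helper name is made up for illustration and is not kernel code:

```c
#include <stdint.h>

/*
 * Effective address of an x86 "(base,index,scale)" memory operand,
 * as in "mov %rbp,(%rcx,%rdx,8)". The arithmetic wraps modulo 2^64.
 * Hypothetical helper for illustration only.
 */
static uint64_t effective_address(uint64_t base, uint64_t index, uint64_t scale)
{
	return base + index * scale;
}
```

With base = UINT64_MAX (that is, -1), index = 13 and scale = 8, this yields 0x67, which matches the address reported in the oops.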
Full details: I put my .config here: http://knielsen-hq.org/config-3.13-rc1-gpf-in-aio-migratepage.txt BUG output: BUG: unable to handle kernel NULL pointer dereference at 0067 IP: [8113d73f] aio_migratepage+0xb3/0xe4 PGD 0 Oops: 0002 [#1] SMP Modules linked in: tun parport_pc ppdev lp parport bnep rfcomm bluetooth cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative binfmt_misc uinput fuse nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc ext3 jbd loop snd_hda_codec_hdmi hid_generic usbhid hid joydev ums_realtek usb_storage snd_hda_codec_realtek iTCO_wdt iTCO_vendor_support arc4 brcmsmac cordic brcmutil b43 mac80211 cfg80211 ssb mmc_core rfkill rng_core pcmcia pcmcia_core nouveau mxm_wmi wmi x86_pkg_temp_thermal coretemp snd_hda_intel kvm_intel snd_hda_codec snd_hwdep snd_pcm_oss kvm snd_mixer_oss snd_seq_midi snd_seq_midi_event snd_pcm crc32c_intel snd_rawmidi snd_page_alloc snd_seq ghash_clmulni_intel snd_timer snd_seq_device lpc_ich aesni_intel mfd_core ttm battery aes_x86_64 ablk_helper drm_kms_helper cryptd lrw gf128mul drm glue_helper psmouse snd pcspkr serio_raw i2c_i801 evdev ehci_pci soundcore ehci_hcd bcma ac acpi_cpufreq video button processor
[PATCH] f2fs: avoid wait if IO end up when do_checkpoint for better performance
Previously, do_checkpoint() called congestion_wait() to wait for the
previously submitted node/meta/data pages to be written back.
congestion_wait() sleeps for a fixed period (e.g. HZ / 50), and there is no
additional wake-up mechanism if the IO completes before that period expires.
Yuan Zhong found a situation where the pages had already been written back
but the checkpoint thread was still waiting for congestion_wait() to return.
So here we store the checkpoint task in f2fs_sb_info when doing a
checkpoint: it waits for IO completion if there is IO in flight, and the
end-IO path wakes up the checkpoint task when the IO finishes.

Thanks to Yuan Zhong for the preliminary work on this problem.

Reported-by: Yuan Zhong <yuan.mark.zh...@samsung.com>
Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
---
 fs/f2fs/checkpoint.c | 11 +--
 fs/f2fs/f2fs.h       |  1 +
 fs/f2fs/segment.c    |  4
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index d808827..2a5999d 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -757,8 +757,15 @@ static void do_checkpoint(struct f2fs_sb_info *sbi, bool is_umount)
 	f2fs_put_page(cp_page, 1);
 
 	/* wait for previous submitted node/meta pages writeback */
-	while (get_pages(sbi, F2FS_WRITEBACK))
-		congestion_wait(BLK_RW_ASYNC, HZ / 50);
+	sbi->cp_task = current;
+	while (get_pages(sbi, F2FS_WRITEBACK)) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		if (!get_pages(sbi, F2FS_WRITEBACK))
+			break;
+		io_schedule();
+	}
+	__set_current_state(TASK_RUNNING);
+	sbi->cp_task = NULL;
 
 	filemap_fdatawait_range(sbi->node_inode->i_mapping, 0, LONG_MAX);
 	filemap_fdatawait_range(sbi->meta_inode->i_mapping, 0, LONG_MAX);
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 308967b..171c52f 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -362,6 +362,7 @@ struct f2fs_sb_info {
 	struct mutex writepages;		/* mutex for writepages() */
 	int por_doing;				/* recovery is doing or not */
 	int on_build_free_nids;			/* build_free_nids is doing */
+	struct task_struct *cp_task;		/* checkpoint task */
 
 	/* for orphan inode management */
 	struct list_head orphan_inode_list;	/* orphan inode list */
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index bd79bbe..3b20359 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -597,6 +597,10 @@ static void f2fs_end_io_write(struct bio *bio, int err)
 
 	if (p->is_sync)
 		complete(p->wait);
+
+	if (!get_pages(p->sbi, F2FS_WRITEBACK) && p->sbi->cp_task)
+		wake_up_process(p->sbi->cp_task);
+
 	kfree(p);
 	bio_put(bio);
 }
-- 
1.7.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
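To see why a fixed-period wait costs latency, here is a small model (not kernel code). It assumes the waiter re-checks the writeback count only when each period expires and is never woken early, which is the situation the changelog above describes; the function names and the millisecond units are illustrative.

```c
#include <assert.h>

/*
 * Time at which a waiter that polls every period_ms notices that IO
 * finished at done_ms (both measured from the start of the wait).
 */
static unsigned int polled_wakeup_ms(unsigned int done_ms, unsigned int period_ms)
{
	assert(period_ms > 0);
	if (done_ms == 0)
		return 0;	/* nothing in flight: the loop is never entered */
	/* round up to the next period boundary */
	return ((done_ms + period_ms - 1) / period_ms) * period_ms;
}

/* Extra latency compared with being woken directly from the end-IO path. */
static unsigned int wasted_ms(unsigned int done_ms, unsigned int period_ms)
{
	return polled_wakeup_ms(done_ms, period_ms) - done_ms;
}
```

With a 20 ms period (HZ / 50), IO that finishes after 3 ms is only noticed at 20 ms in this model, so 17 ms are wasted; waking the checkpoint task from the end-IO path removes that gap.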
[PATCH RESEND] f2fs: introduce function read_raw_super_block()
Introduce function read_raw_super_block() to hide reading raw super block
and the retry routine if the first sb is invalid.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
---
 fs/f2fs/super.c | 54 +-
 1 files changed, 33 insertions(+), 21 deletions(-)

diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 3b786c8..5e913de 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -746,30 +746,46 @@ static void init_sb_info(struct f2fs_sb_info *sbi)
 		atomic_set(&sbi->nr_pages[i], 0);
 }
 
-static int validate_superblock(struct super_block *sb,
-		struct f2fs_super_block **raw_super,
-		struct buffer_head **raw_super_buf, sector_t block)
+/* Read f2fs raw super block.
+ * Because we have two copies of super block, so read the first one at first,
+ * if the first one is invalid, move to read the second one.
+ */
+static int read_raw_super_block(struct super_block *sb,
+			struct f2fs_super_block **raw_super,
+			struct buffer_head **raw_super_buf)
 {
-	const char *super = (block == 0 ? "first" : "second");
+	int block = 0;
 
-	/* read f2fs raw super block */
+retry:
 	*raw_super_buf = sb_bread(sb, block);
 	if (!*raw_super_buf) {
-		f2fs_msg(sb, KERN_ERR, "unable to read %s superblock",
-								super);
-		return -EIO;
+		f2fs_msg(sb, KERN_ERR, "Unable to read %dth superblock",
+								block + 1);
+		if (block == 0) {
+			block++;
+			goto retry;
+		} else {
+			return -EIO;
+		}
 	}
 
 	*raw_super = (struct f2fs_super_block *)
 		((char *)(*raw_super_buf)->b_data + F2FS_SUPER_OFFSET);
 
 	/* sanity checking of raw super */
-	if (!sanity_check_raw_super(sb, *raw_super))
-		return 0;
+	if (sanity_check_raw_super(sb, *raw_super)) {
+		brelse(*raw_super_buf);
+		f2fs_msg(sb, KERN_ERR, "Can't find a valid F2FS filesystem "
+				"in %dth superblock", block + 1);
+		if(block == 0) {
+			block++;
+			goto retry;
+		} else {
+			return -EINVAL;
+		}
+	}
 
-	f2fs_msg(sb, KERN_ERR, "Can't find a valid F2FS filesystem "
-				"in %s superblock", super);
-	return -EINVAL;
+	return 0;
 }
 
 static int f2fs_fill_super(struct super_block *sb, void *data, int silent)
@@ -791,14 +807,10 @@ static int f2fs_fill_super(struct super_block *sb, void *data, int silent)
 		goto free_sbi;
 	}
 
-	err = validate_superblock(sb, &raw_super, &raw_super_buf, 0);
-	if (err) {
-		brelse(raw_super_buf);
-		/* check secondary superblock when primary failed */
-		err = validate_superblock(sb, &raw_super, &raw_super_buf, 1);
-		if (err)
-			goto free_sb_buf;
-	}
+	err = read_raw_super_block(sb, &raw_super, &raw_super_buf);
+	if (err)
+		goto free_sbi;
+
 	sb->s_fs_info = sbi;
 	/* init some FS parameters */
 	sbi->active_logs = NR_CURSEG_TYPE;
-- 
1.7.7
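The retry logic in the patch above boils down to "use the first copy that can be read and passes the sanity check". A minimal userspace sketch of that idea, assuming a validator that returns 0 for a sane copy (as sanity_check_raw_super() does in the patch); first_valid_copy(), sb_check_fn and check_magic() are hypothetical names, not f2fs functions:

```c
#include <stddef.h>

/* Hypothetical validator: returns 0 when the copy looks sane. */
typedef int (*sb_check_fn)(const void *copy);

/*
 * Try each superblock copy in order and return the index of the first
 * one that could be read (non-NULL) and passes the check, or -1 if
 * none does.
 */
static int first_valid_copy(const void *const copies[], size_t ncopies,
			    sb_check_fn check)
{
	size_t i;

	for (i = 0; i < ncopies; i++) {
		if (!copies[i])
			continue;	/* read failed, try the next copy */
		if (check(copies[i]) == 0)
			return (int)i;
	}
	return -1;
}

/* Example check: a copy is valid when its first byte is a magic value. */
static int check_magic(const void *copy)
{
	return *(const unsigned char *)copy == 0xF2 ? 0 : -1;
}
```

A loop like this covers both failure modes handled by the goto-based retry (unreadable block, failed sanity check) and generalizes to more than two copies, at the cost of not reporting which of the two errors (-EIO or -EINVAL) occurred last.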
Re: [RESEND PATCH 1/2] fb: reorder the lock sequence to fix potential dead lock
Hi Tomi,

On 11/11/2013 09:59 PM, Tomi Valkeinen wrote:
> On 2013-11-05 12:00, Gu Zheng wrote:
>> Following commits:
>> 50e244cc79 fb: rework locking to fix lock ordering on takeover
>> e93a9a8687 fb: Yet another band-aid for fixing lockdep mess
>> 054430e773 fbcon: fix locking harder
>> reworked locking to fix related lock ordering on takeover, and introduced
>> console_lock into fbmem, but it seems that the new lock sequence
>> (fb_info->lock ---> console_lock) conflicts with the one in
>> console_callback() (console_lock ---> fb_info->lock), and leads to a
>> potential deadlock as follows:
>
> <snip>
>
>> so we reorder the lock sequence to match the one in console_callback()
>> to avoid this issue. And following Tomi's suggestion, fix all of these
>> similar issues in the fb subsystem.
>>
>> Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
>> ---
>>  drivers/video/fbmem.c            | 50 -
>>  drivers/video/fbsysfs.c          | 19 ++
>>  drivers/video/sh_mobile_lcdcfb.c | 10 ---
>>  3 files changed, 51 insertions(+), 28 deletions(-)
>
> I'll apply this for 3.13. It's a bit difficult to verify if the locking
> is now correct, but looks fine to me. And we can revert this easily if
> things break badly.

Thanks very much. :)

Regards,
Gu

>
>  Tomi
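The deadlock described in the quoted changelog is the classic AB-BA pattern: two code paths nest the same pair of locks in opposite order. A minimal sketch of that check, as a model only; the enum values, struct lock_path and is_abba() are illustrative names and do not exist in the fb code:

```c
#include <stdbool.h>

/* Illustrative lock identifiers; only their identity matters. */
enum lock_id { FB_INFO_LOCK, CONSOLE_LOCK };

/* A code path that nests two locks: outer is taken first, then inner. */
struct lock_path {
	enum lock_id outer;
	enum lock_id inner;
};

/* True when the two paths nest the same pair of locks in opposite order. */
static bool is_abba(struct lock_path a, struct lock_path b)
{
	return a.outer != a.inner &&
	       a.outer == b.inner && a.inner == b.outer;
}
```

In this model the pre-patch fbmem path (fb_info->lock, then console_lock) and console_callback() (console_lock, then fb_info->lock) form an inversion, while the reordered path no longer does.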
Re: [f2fs-dev] [PATCH 2/2] f2fs: read contiguous sit entry pages by merging for mount performance
Hi Yu, On 11/12/2013 01:18 PM, Chao Yu wrote: Previously we read sit entries page one by one, this method lost the chance of reading contiguous page together. So we read pages as contiguous as possible for better mount performance. Signed-off-by: Chao Yu chao2...@samsung.com --- fs/f2fs/f2fs.h|2 ++ fs/f2fs/segment.c | 65 ++--- fs/f2fs/segment.h |2 ++ 3 files changed, 66 insertions(+), 3 deletions(-) diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h index 0afdcec..bfe9d87 100644 --- a/fs/f2fs/f2fs.h +++ b/fs/f2fs/f2fs.h @@ -1113,6 +1113,8 @@ struct page *find_data_page(struct inode *, pgoff_t, bool); struct page *get_lock_data_page(struct inode *, pgoff_t); struct page *get_new_data_page(struct inode *, struct page *, pgoff_t, bool); int f2fs_readpage(struct f2fs_sb_info *, struct page *, block_t, int); +void f2fs_submit_read_bio(struct f2fs_sb_info *, int); +void submit_read_page(struct f2fs_sb_info *, struct page *, block_t, int); Better to move these declarations into PATCH 1/2. int do_write_data_page(struct page *); /* diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c index 86dc289..414c351 100644 --- a/fs/f2fs/segment.c +++ b/fs/f2fs/segment.c @@ -1474,19 +1474,72 @@ static int build_curseg(struct f2fs_sb_info *sbi) return restore_curseg_summaries(sbi); } +static int ra_sit_pages(struct f2fs_sb_info *sbi, int start, + int nrpages, bool *is_order) +{ + struct address_space *mapping = sbi-meta_inode-i_mapping; + struct sit_info *sit_i = SIT_I(sbi); + struct page *page; + block_t blk_addr; + int blkno, readcnt = 0; + int sit_blk_cnt = SIT_BLK_CNT(sbi); + + for (blkno = start; blkno start + nrpages; blkno++) { + + if (blkno = sit_blk_cnt) Merge these two judgements: for (blkno = start; blkno start + nrpages blkno sit_blk_cnt; blkno++) + goto out; + if ((!f2fs_test_bit(blkno, sit_i-sit_bitmap) ^ !*is_order)) { + *is_order = !*is_order; + goto out; 'Break' seems more suitable. 
+ } + + blk_addr = sit_i-sit_base_addr + blkno; + if (*is_order) + blk_addr += sit_i-sit_blocks; +repeat: + page = grab_cache_page(mapping, blk_addr); + if (!page) { + cond_resched(); + goto repeat; + } + if (PageUptodate(page)) { + f2fs_put_page(page, 1); + readcnt++; + goto out; Here may be 'Continue'. + } + + submit_read_page(sbi, page, blk_addr, READ_SYNC); + + page_cache_release(page); Put page here seems not a good idea, otherwise all your work may be in vain. + readcnt++; + } +out: + f2fs_submit_read_bio(sbi, READ_SYNC); + return readcnt; +} + static void build_sit_entries(struct f2fs_sb_info *sbi) { struct sit_info *sit_i = SIT_I(sbi); struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_COLD_DATA); struct f2fs_summary_block *sum = curseg-sum_blk; - unsigned int start; + bool is_order = f2fs_test_bit(0, sit_i-sit_bitmap) ? true : false; + int sit_blk_cnt = SIT_BLK_CNT(sbi); + int bio_blocks = MAX_BIO_BLOCKS(max_hw_blocks(sbi)); + unsigned int i, start, end; + unsigned int readed, start_blk = 0; - for (start = 0; start TOTAL_SEGS(sbi); start++) { +next: + readed = ra_sit_pages(sbi, start_blk, bio_blocks, is_order); In fact, you know how many blocks that you want to read(SIT_BLK_CNT(sbi)), so here sit_blk_cnt is more suitable than a MAX one, and it also can make the logic of ra_sit_pages more simple. 
+ + start = start_blk * sit_i-sents_per_block; + end = (start_blk + readed) * sit_i-sents_per_block; + + for (; start end start TOTAL_SEGS(sbi); start++) { struct seg_entry *se = sit_i-sentries[start]; struct f2fs_sit_block *sit_blk; struct f2fs_sit_entry sit; struct page *page; - int i; mutex_lock(curseg-curseg_mutex); for (i = 0; i sits_in_cursum(sum); i++) { @@ -1497,6 +1550,7 @@ static void build_sit_entries(struct f2fs_sb_info *sbi) } } mutex_unlock(curseg-curseg_mutex); + page = get_current_sit_page(sbi, start); sit_blk = (struct f2fs_sit_block *)page_address(page); sit = sit_blk-entries[SIT_ENTRY_OFFSET(sit_i, start)]; @@ -1509,6 +1563,11 @@ got_it: e-valid_blocks += se-valid_blocks; } } + + start_blk += readed; + if (start_blk = sit_blk_cnt) + return; + goto next; Using do {...}
Re: [f2fs-dev] [PATCH 1/2] f2fs: add a new function to support for merging contiguous read
On 11/12/2013 01:15 PM, Chao Yu wrote:
> For better read performance, we add a new function to support for merging
> contiguous read as the one for write.

Nice shot!

> Signed-off-by: Chao Yu <chao2...@samsung.com>

Acked-by: Gu Zheng <guz.f...@cn.fujitsu.com>

> ---
>  fs/f2fs/data.c | 45 +
>  fs/f2fs/f2fs.h |  2 ++
>  2 files changed, 47 insertions(+)
>
> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> index aa3438c..f30060b 100644
> --- a/fs/f2fs/data.c
> +++ b/fs/f2fs/data.c
> @@ -404,6 +404,51 @@ int f2fs_readpage(struct f2fs_sb_info *sbi, struct page *page,
>  	return 0;
>  }
>  
> +void f2fs_submit_read_bio(struct f2fs_sb_info *sbi, int rw)
> +{
> +	down_read(&sbi->bio_sem);
> +	if (sbi->read_bio) {
> +		submit_bio(rw, sbi->read_bio);
> +		sbi->read_bio = NULL;
> +	}
> +	up_read(&sbi->bio_sem);
> +}
> +
> +void submit_read_page(struct f2fs_sb_info *sbi, struct page *page,
> +					block_t blk_addr, int rw)
> +{
> +	struct block_device *bdev = sbi->sb->s_bdev;
> +	int bio_blocks;
> +
> +	verify_block_addr(sbi, blk_addr);
> +
> +	down_read(&sbi->bio_sem);
> +
> +	if (sbi->read_bio && sbi->last_read_block != blk_addr - 1) {
> +		submit_bio(rw, sbi->read_bio);
> +		sbi->read_bio = NULL;
> +	}
> +
> +alloc_new:
> +	if (sbi->read_bio == NULL) {
> +		bio_blocks = MAX_BIO_BLOCKS(max_hw_blocks(sbi));
> +		sbi->read_bio = f2fs_bio_alloc(bdev, bio_blocks);
> +		sbi->read_bio->bi_sector = SECTOR_FROM_BLOCK(sbi, blk_addr);
> +		sbi->read_bio->bi_end_io = read_end_io;
> +	}
> +
> +	if (bio_add_page(sbi->read_bio, page, PAGE_CACHE_SIZE, 0) <
> +							PAGE_CACHE_SIZE) {
> +		submit_bio(rw, sbi->read_bio);
> +		sbi->read_bio = NULL;
> +		goto alloc_new;
> +	}
> +
> +	sbi->last_read_block = blk_addr;
> +
> +	up_read(&sbi->bio_sem);
> +}
> +
>  /*
>   * This function should be used by the data read flow only where it
>   * does not check the create flag that indicates block allocation.
> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> index 89dc750..0afdcec 100644
> --- a/fs/f2fs/f2fs.h
> +++ b/fs/f2fs/f2fs.h
> @@ -359,6 +359,8 @@ struct f2fs_sb_info {
>  
>  	/* for segment-related operations */
>  	struct f2fs_sm_info *sm_info;		/* segment manager */
> +	struct bio *read_bio;			/* read bios to merge */
> +	sector_t last_read_block;		/* last read block number */
>  	struct bio *bio[NR_PAGE_TYPE];		/* bios to merge */
>  	sector_t last_block_in_bio[NR_PAGE_TYPE];	/* last block number */
>  	struct rw_semaphore bio_sem;		/* IO semaphore */
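The merging rule in submit_read_page() above can be modelled without any block layer: a block joins the pending bio only if it directly follows the previous one and the bio still has room. The sketch below counts how many bios such a submitter would issue. It is a simplified model that assumes a fixed per-bio capacity of at least one block, whereas the patch detects a full bio through bio_add_page(); count_merged_bios() is a hypothetical name.

```c
#include <stddef.h>

/*
 * Count how many bios a merging submitter would issue for a sequence of
 * block addresses. A block joins the pending bio only if it directly
 * follows the previous block and the bio is not yet full; otherwise the
 * pending bio is submitted and a new one is started.
 * Assumes max_blocks_per_bio >= 1.
 */
static size_t count_merged_bios(const unsigned long *blocks, size_t n,
				size_t max_blocks_per_bio)
{
	size_t bios = 0, in_bio = 0, i;

	for (i = 0; i < n; i++) {
		/* in_bio is 0 for i == 0, so blocks[i - 1] is never read then */
		int contiguous = in_bio && blocks[i] == blocks[i - 1] + 1;

		if (!contiguous || in_bio == max_blocks_per_bio) {
			if (in_bio)
				bios++;		/* submit the pending bio */
			in_bio = 0;		/* start a new one */
		}
		in_bio++;
	}
	if (in_bio)
		bios++;		/* final flush, like f2fs_submit_read_bio() */
	return bios;
}
```

For the addresses 10, 11, 12, 20, 21 this model issues two bios when a bio can hold eight blocks and three when it can hold only two, compared with five when every page is read individually.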
Re: [f2fs-dev] [PATCH 2/2] f2fs: read contiguous sit entry pages by merging for mount performance
Hi Yu, On 11/13/2013 04:10 PM, Chao Yu wrote: Hi Gu, -Original Message- From: Gu Zheng [mailto:guz.f...@cn.fujitsu.com] Sent: Wednesday, November 13, 2013 11:39 AM To: Chao Yu Cc: ???; linux-fsde...@vger.kernel.org; linux-kernel@vger.kernel.org; linux-f2fs-de...@lists.sourceforge.net; 谭姝 Subject: Re: [f2fs-dev] [PATCH 2/2] f2fs: read contiguous sit entry pages by merging for mount performance Hi Yu, On 11/12/2013 01:18 PM, Chao Yu wrote: Previously we read sit entries page one by one, this method lost the chance of reading contiguous page together. So we read pages as contiguous as possible for better mount performance. Signed-off-by: Chao Yu chao2...@samsung.com --- fs/f2fs/f2fs.h|2 ++ fs/f2fs/segment.c | 65 ++--- fs/f2fs/segment.h |2 ++ 3 files changed, 66 insertions(+), 3 deletions(-) diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h index 0afdcec..bfe9d87 100644 --- a/fs/f2fs/f2fs.h +++ b/fs/f2fs/f2fs.h @@ -1113,6 +1113,8 @@ struct page *find_data_page(struct inode *, pgoff_t, bool); struct page *get_lock_data_page(struct inode *, pgoff_t); struct page *get_new_data_page(struct inode *, struct page *, pgoff_t, bool); int f2fs_readpage(struct f2fs_sb_info *, struct page *, block_t, int); +void f2fs_submit_read_bio(struct f2fs_sb_info *, int); +void submit_read_page(struct f2fs_sb_info *, struct page *, block_t, int); Better to move these declarations into PATCH 1/2. Okay, I will move it to the right place. 
int do_write_data_page(struct page *); /* diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c index 86dc289..414c351 100644 --- a/fs/f2fs/segment.c +++ b/fs/f2fs/segment.c @@ -1474,19 +1474,72 @@ static int build_curseg(struct f2fs_sb_info *sbi) return restore_curseg_summaries(sbi); } +static int ra_sit_pages(struct f2fs_sb_info *sbi, int start, + int nrpages, bool *is_order) +{ + struct address_space *mapping = sbi-meta_inode-i_mapping; + struct sit_info *sit_i = SIT_I(sbi); + struct page *page; + block_t blk_addr; + int blkno, readcnt = 0; + int sit_blk_cnt = SIT_BLK_CNT(sbi); + + for (blkno = start; blkno start + nrpages; blkno++) { + + if (blkno = sit_blk_cnt) Merge these two judgements: for (blkno = start; blkno start + nrpages blkno sit_blk_cnt; blkno++) Right, but the line may over 80 characters, if we split this line, it seems not suitable. So how about this? int blkno = start, readcnt = 0; int sit_blk_cnt = SIT_BLK_CNT(sbi); for (; blkno start + nrpages blkno sit_blk_cnt; blkno++) { More neat! + goto out; + if ((!f2fs_test_bit(blkno, sit_i-sit_bitmap) ^ !*is_order)) { + *is_order = !*is_order; + goto out; 'Break' seems more suitable. Yes, you are right. + } + + blk_addr = sit_i-sit_base_addr + blkno; + if (*is_order) + blk_addr += sit_i-sit_blocks; +repeat: + page = grab_cache_page(mapping, blk_addr); + if (!page) { + cond_resched(); + goto repeat; + } + if (PageUptodate(page)) { + f2fs_put_page(page, 1); + readcnt++; + goto out; Here may be 'Continue'. 'Out' label could be removed after this modification. It seems more neat. Right. + } + + submit_read_page(sbi, page, blk_addr, READ_SYNC); + + page_cache_release(page); Put page here seems not a good idea, otherwise all your work may be in vain. You mean that pages could be reclaimed by VM when out of memory? IMO, it is designed more like VM read ahead because we should concern memory state of system, and still we have second chance to read these pages. 
Yes, but we can avoid to read the same page secondly, that's a serious waste, if we still reread the page in get_current_sit_page(), all the improvement will disappear. Could we use mark_page_accessed () to delay VM reclaimed them? IMO, this is the right way. + readcnt++; + } +out: + f2fs_submit_read_bio(sbi, READ_SYNC); + return readcnt; +} + static void build_sit_entries(struct f2fs_sb_info *sbi) { struct sit_info *sit_i = SIT_I(sbi); struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_COLD_DATA); struct f2fs_summary_block *sum = curseg-sum_blk; - unsigned int start; + bool is_order = f2fs_test_bit(0, sit_i-sit_bitmap) ? true : false; + int sit_blk_cnt = SIT_BLK_CNT(sbi); + int bio_blocks = MAX_BIO_BLOCKS(max_hw_blocks(sbi)); + unsigned int i, start, end; + unsigned int readed, start_blk = 0; - for (start = 0; start TOTAL_SEGS(sbi); start++) { +next: + readed = ra_sit_pages(sbi, start_blk, bio_blocks, is_order); In fact, you know how many
[PATCH] f2fs: use mutex rather than the rw_sem
Use mutex rather than the rw_sem to protect bio related fields, because
it's needless to take the read_sem in the read path.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
---
 fs/f2fs/data.c    | 4
 fs/f2fs/f2fs.h    | 2 +-
 fs/f2fs/segment.c | 8
 fs/f2fs/super.c   | 2 +-
 4 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index aa3438c..b4e4c7e 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -383,8 +383,6 @@ int f2fs_readpage(struct f2fs_sb_info *sbi, struct page *page,
 
 	trace_f2fs_readpage(page, blk_addr, type);
 
-	down_read(&sbi->bio_sem);
-
 	/* Allocate a new bio */
 	bio = f2fs_bio_alloc(bdev, 1);
 
@@ -394,13 +392,11 @@ int f2fs_readpage(struct f2fs_sb_info *sbi, struct page *page,
 
 	if (bio_add_page(bio, page, PAGE_CACHE_SIZE, 0) < PAGE_CACHE_SIZE) {
 		bio_put(bio);
-		up_read(&sbi->bio_sem);
 		f2fs_put_page(page, 1);
 		return -EFAULT;
 	}
 
 	submit_bio(type, bio);
-	up_read(&sbi->bio_sem);
 	return 0;
 }
 
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 89dc750..78a0054 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -361,7 +361,7 @@ struct f2fs_sb_info {
 	struct f2fs_sm_info *sm_info;		/* segment manager */
 	struct bio *bio[NR_PAGE_TYPE];		/* bios to merge */
 	sector_t last_block_in_bio[NR_PAGE_TYPE];	/* last block number */
-	struct rw_semaphore bio_sem;		/* IO semaphore */
+	struct mutex bio_mutex;			/* IO write mutex */
 
 	/* for checkpoint */
 	struct f2fs_checkpoint *ckpt;		/* raw checkpoint pointer */
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index fa284d3..e91f65c 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -653,9 +653,9 @@ static void do_submit_bio(struct f2fs_sb_info *sbi,
 
 void f2fs_submit_bio(struct f2fs_sb_info *sbi, enum page_type type, bool sync)
 {
-	down_write(&sbi->bio_sem);
+	mutex_lock(&sbi->bio_mutex);
 	do_submit_bio(sbi, type, sync);
-	up_write(&sbi->bio_sem);
+	mutex_unlock(&sbi->bio_mutex);
 }
 
 static void submit_write_page(struct f2fs_sb_info *sbi, struct page *page,
@@ -666,7 +666,7 @@ static void submit_write_page(struct f2fs_sb_info *sbi, struct page *page,
 
 	verify_block_addr(sbi, blk_addr);
 
-	down_write(&sbi->bio_sem);
+	mutex_lock(&sbi->bio_mutex);
 
 	inc_page_count(sbi, F2FS_WRITEBACK);
 
@@ -701,7 +701,7 @@ retry:
 
 	sbi->last_block_in_bio[type] = blk_addr;
 
-	up_write(&sbi->bio_sem);
+	mutex_unlock(&sbi->bio_mutex);
 	trace_f2fs_submit_write_page(page, blk_addr, type);
 }
 
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index bafff72..fab3550 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -874,7 +874,7 @@ static int f2fs_fill_super(struct super_block *sb, void *data, int silent)
 	mutex_init(&sbi->node_write);
 	sbi->por_doing = false;
 	spin_lock_init(&sbi->stat_lock);
-	init_rwsem(&sbi->bio_sem);
+	mutex_init(&sbi->bio_mutex);
 	init_rwsem(&sbi->cp_rwsem);
 	init_waitqueue_head(&sbi->cp_wait);
 	init_sb_info(sbi);
-- 
1.7.7
Re: [PATCH 2/2] f2fs: use sbi-wr_mutex for write bios
Hi Kim,

On 11/18/2013 05:12 PM, Jaegeuk Kim wrote:
> This patch removes an unnecessary semaphore (i.e., sbi->bio_sem).
> There is no reason to use the semaphore when f2fs submits read and write IOs.
> Instead, let's use a write mutex and cover the sbi->bio[] by the lock.

My god, I just sent out almost the same patch. Do we have telepathy? :)

Regards,
Gu

> Signed-off-by: Jaegeuk Kim <jaegeuk@samsung.com>
> ---
>  fs/f2fs/data.c    |  4
>  fs/f2fs/f2fs.h    |  2 +-
>  fs/f2fs/segment.c | 13 +
>  fs/f2fs/super.c   |  2 +-
>  4 files changed, 11 insertions(+), 10 deletions(-)
>
> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> index 84867dc..7550026 100644
> --- a/fs/f2fs/data.c
> +++ b/fs/f2fs/data.c
> @@ -390,8 +390,6 @@ int f2fs_readpage(struct f2fs_sb_info *sbi, struct page *page,
>  
>  	trace_f2fs_readpage(page, blk_addr, type);
>  
> -	down_read(&sbi->bio_sem);
> -
>  	/* Allocate a new bio */
>  	bio = f2fs_bio_alloc(bdev, 1);
>  
> @@ -401,13 +399,11 @@ int f2fs_readpage(struct f2fs_sb_info *sbi, struct page *page,
>  
>  	if (bio_add_page(bio, page, PAGE_CACHE_SIZE, 0) < PAGE_CACHE_SIZE) {
>  		bio_put(bio);
> -		up_read(&sbi->bio_sem);
>  		f2fs_put_page(page, 1);
>  		return -EFAULT;
>  	}
>  
>  	submit_bio(type, bio);
> -	up_read(&sbi->bio_sem);
>  	return 0;
>  }
>  
> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> index 1c783fd..76f5586 100644
> --- a/fs/f2fs/f2fs.h
> +++ b/fs/f2fs/f2fs.h
> @@ -375,7 +375,7 @@ struct f2fs_sb_info {
>  	struct f2fs_sm_info *sm_info;		/* segment manager */
>  	struct bio *bio[NR_PAGE_TYPE];		/* bios to merge */
>  	sector_t last_block_in_bio[NR_PAGE_TYPE];	/* last block number */
> -	struct rw_semaphore bio_sem;		/* IO semaphore */
> +	struct mutex write_mutex;		/* mutex for writing IOs */
>  
>  	/* for checkpoint */
>  	struct f2fs_checkpoint *ckpt;		/* raw checkpoint pointer */
> diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
> index dad5f1a..893d489 100644
> --- a/fs/f2fs/segment.c
> +++ b/fs/f2fs/segment.c
> @@ -871,9 +871,14 @@ static void do_submit_bio(struct f2fs_sb_info *sbi,
>  
>  void f2fs_submit_bio(struct f2fs_sb_info *sbi, enum page_type type, bool sync)
>  {
> -	down_write(&sbi->bio_sem);
> +	enum page_type btype = PAGE_TYPE_OF_BIO(type);
> +
> +	if (!sbi->bio[btype])
> +		return;
> +
> +	mutex_lock(&sbi->write_mutex);
>  	do_submit_bio(sbi, type, sync);
> -	up_write(&sbi->bio_sem);
> +	mutex_unlock(&sbi->write_mutex);
>  }
>  
>  static void submit_write_page(struct f2fs_sb_info *sbi, struct page *page,
> @@ -884,7 +889,7 @@ static void submit_write_page(struct f2fs_sb_info *sbi, struct page *page,
>  
>  	verify_block_addr(sbi, blk_addr);
>  
> -	down_write(&sbi->bio_sem);
> +	mutex_lock(&sbi->write_mutex);
>  
>  	inc_page_count(sbi, F2FS_WRITEBACK);
>  
> @@ -919,7 +924,7 @@ retry:
>  
>  	sbi->last_block_in_bio[type] = blk_addr;
>  
> -	up_write(&sbi->bio_sem);
> +	mutex_unlock(&sbi->write_mutex);
>  	trace_f2fs_submit_write_page(page, blk_addr, type);
>  }
>  
> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
> index 2c52527..c7b6300 100644
> --- a/fs/f2fs/super.c
> +++ b/fs/f2fs/super.c
> @@ -882,7 +882,7 @@ static int f2fs_fill_super(struct super_block *sb, void *data, int silent)
>  	mutex_init(&sbi->node_write);
>  	sbi->por_doing = false;
>  	spin_lock_init(&sbi->stat_lock);
> -	init_rwsem(&sbi->bio_sem);
> +	mutex_init(&sbi->write_mutex);
>  	init_rwsem(&sbi->cp_rwsem);
>  	init_waitqueue_head(&sbi->cp_wait);
>  	init_sb_info(sbi);
Re: [f2fs-dev] [PATCH V2 1/2] f2fs: add a new function to support for merging contiguous read
On 11/18/2013 05:11 PM, Jaegeuk Kim wrote:
> Hi,
>
> 2013-11-18 (Mon), 09:37 +0800, Chao Yu:
> > Hi Kim,
> >
> > > -----Original Message-----
> > > From: Jaegeuk Kim [mailto:jaegeuk@samsung.com]
> > > Sent: Monday, November 18, 2013 8:29 AM
> > > To: Chao Yu
> > > Cc: linux-fsde...@vger.kernel.org; linux-kernel@vger.kernel.org; linux-f2fs-de...@lists.sourceforge.net; 谭姝
> > > Subject: Re: [f2fs-dev] [PATCH V2 1/2] f2fs: add a new function to support for merging contiguous read
> > >
> > > Hi Chao,
> > >
> > > 2013-11-16 (Sat), 14:14 +0800, Chao Yu:
> > > > For better read performance, we add a new function to support for merging contiguous read as the one for write.
> > >
> > > Please consider 80 columns for the description.
> > > I cannot fix this every time, though. :(
> >
> > Got it, sorry about my carelessness in the previous patch.
> >
> > > >
> > > > v1-->v2:
> > > >  o add declarations here as Gu Zheng suggested.
> > > >
> > > > Signed-off-by: Chao Yu <chao2...@samsung.com>
> > > > Acked-by: Gu Zheng <guz.f...@cn.fujitsu.com>
> > > > ---
> > > >  fs/f2fs/data.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
> > > >  fs/f2fs/f2fs.h |  4 ++++
> > > >  2 files changed, 49 insertions(+)
> > > >
> > > > diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> > > > index aa3438c..18107cb 100644
> > > > --- a/fs/f2fs/data.c
> > > > +++ b/fs/f2fs/data.c
> > > > @@ -404,6 +404,51 @@ int f2fs_readpage(struct f2fs_sb_info *sbi, struct page *page,
> > > >  	return 0;
> > > >  }
> > > >
> > > > +void f2fs_submit_read_bio(struct f2fs_sb_info *sbi, int rw)
> > > > +{
> > > > +	down_read(&sbi->bio_sem);
> > >
> > > Is there any reason to use down_read()?
> >
> > Isn't it that we use bio_sem to make w/r or w/w submission mutually exclusive?
>
> As I examined the bio_sem, I think we don't need to use a semaphore for
> read and write IOs.
> It is enough to use a mutex for writes only.

Agree. A mutex is more suitable here: we just want to protect the write-bio
related fields in the write path, which has nothing to do with reads.

> > > It seems that we need to declare sbi->bio_read and sbi->bio_write instead of sbi->bio_sem.
> > > In addition to that, we need to use down_write(&sbi->bio_read) here.
> >
> > If so, (struct rw_semaphore) sbi->bio_read and (struct bio *) sbi->read_bio
> > look too similar.
> > How about using read_bio_sem/rbio_sem to differentiate from sbi->read_bio?
>
> I think sbi->write_mutex and sbi->read_mutex are much better.

It's more reasonable and readable.

Thanks,
Gu

> Could you refer to the following patches?
> Thanks,
Re: [f2fs-dev] [PATCH V2 2/2] f2fs: read contiguous sit entry pages by merging for mount performance
Hi Yu,

One more comment, please refer to inline.

On 11/16/2013 02:15 PM, Chao Yu wrote:
> Previously we read sit entries page one by one, this method lost the chance of reading contiguous page together.
> So we read pages as contiguous as possible for better mount performance.
>
> v1-->v2:
>  o merge judgements/use 'Continue' or 'Break' instead of 'Goto' as Gu Zheng suggested.
>  o add mark_page_accessed () before release page to delay VM reclaiming them.
>
> Signed-off-by: Chao Yu <chao2...@samsung.com>
> ---
>  fs/f2fs/segment.c | 108
>  fs/f2fs/segment.h |   2 ++
>  2 files changed, 84 insertions(+), 26 deletions(-)
>
> diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
> index fa284d3..656fe40 100644
> --- a/fs/f2fs/segment.c
> +++ b/fs/f2fs/segment.c
> @@ -14,6 +14,7 @@
>  #include <linux/blkdev.h>
>  #include <linux/prefetch.h>
>  #include <linux/vmalloc.h>
> +#include <linux/swap.h>
>
>  #include "f2fs.h"
>  #include "segment.h"
> @@ -1480,41 +1481,96 @@ static int build_curseg(struct f2fs_sb_info *sbi)
>  	return restore_curseg_summaries(sbi);
>  }
>
> +static int ra_sit_pages(struct f2fs_sb_info *sbi, int start,
> +					int nrpages, bool *is_order)
> +{
> +	struct address_space *mapping = sbi->meta_inode->i_mapping;
> +	struct sit_info *sit_i = SIT_I(sbi);
> +	struct page *page;
> +	block_t blk_addr;
> +	int blkno = start, readcnt = 0;
> +	int sit_blk_cnt = SIT_BLK_CNT(sbi);
> +
> +	for (; blkno < start + nrpages && blkno < sit_blk_cnt; blkno++) {
> +
> +		if ((!f2fs_test_bit(blkno, sit_i->sit_bitmap) ^ !*is_order)) {
> +			*is_order = !*is_order;
> +			break;
> +		}
> +
> +		blk_addr = sit_i->sit_base_addr + blkno;
> +		if (*is_order)
> +			blk_addr += sit_i->sit_blocks;
> +repeat:
> +		page = grab_cache_page(mapping, blk_addr);
> +		if (!page) {
> +			cond_resched();
> +			goto repeat;
> +		}
> +		if (PageUptodate(page)) {
> +			mark_page_accessed(page);
> +			f2fs_put_page(page, 1);
> +			readcnt++;
> +			continue;
> +		}
> +
> +		submit_read_page(sbi, page, blk_addr, READ_SYNC);
> +
> +		mark_page_accessed(page);
> +		f2fs_put_page(page, 0);
> +		readcnt++;
> +	}
> +
> +	f2fs_submit_read_bio(sbi, READ_SYNC);
> +	return readcnt;
> +}
> +
>  static void build_sit_entries(struct f2fs_sb_info *sbi)
>  {
>  	struct sit_info *sit_i = SIT_I(sbi);
>  	struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_COLD_DATA);
>  	struct f2fs_summary_block *sum = curseg->sum_blk;
> -	unsigned int start;
> -
> -	for (start = 0; start < TOTAL_SEGS(sbi); start++) {
> -		struct seg_entry *se = &sit_i->sentries[start];
> -		struct f2fs_sit_block *sit_blk;
> -		struct f2fs_sit_entry sit;
> -		struct page *page;
> -		int i;
> +	bool is_order = f2fs_test_bit(0, sit_i->sit_bitmap) ? true : false;
> +	int sit_blk_cnt = SIT_BLK_CNT(sbi);
> +	unsigned int i, start, end;
> +	unsigned int readed, start_blk = 0;
>
> -		mutex_lock(&curseg->curseg_mutex);
> -		for (i = 0; i < sits_in_cursum(sum); i++) {
> -			if (le32_to_cpu(segno_in_journal(sum, i)) == start) {
> -				sit = sit_in_journal(sum, i);
> -				mutex_unlock(&curseg->curseg_mutex);
> -				goto got_it;
> +	do {

How about using find_next_bit to get the suitable start_blk if the next blk is
not ordered here? And it also can simplify the logic of ra_sit_pages().

Thanks,
Gu

> +		readed = ra_sit_pages(sbi, start_blk, sit_blk_cnt, &is_order);
> +
> +		start = start_blk * sit_i->sents_per_block;
> +		end = (start_blk + readed) * sit_i->sents_per_block;
> +
> +		for (; start < end && start < TOTAL_SEGS(sbi); start++) {
> +			struct seg_entry *se = &sit_i->sentries[start];
> +			struct f2fs_sit_block *sit_blk;
> +			struct f2fs_sit_entry sit;
> +			struct page *page;
> +
> +			mutex_lock(&curseg->curseg_mutex);
> +			for (i = 0; i < sits_in_cursum(sum); i++) {
> +				if (le32_to_cpu(segno_in_journal(sum, i)) == start) {
> +					sit = sit_in_journal(sum, i);
> +					mutex_unlock(&curseg->curseg_mutex);
> +					goto got_it;
> +				}
>  			}
> -		}
> -		mutex_unlock(&curseg->curseg_mutex);
> -		page = get_current_sit_page(sbi, start);
> -		sit_blk = (struct f2fs_sit_block *)page_address(page);
> -		sit = sit_blk->entries
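To make the find_next_bit suggestion above a little more concrete, here is a small userspace sketch. It is not kernel code and not the helper anyone in this thread wrote: the sketch_ names are hypothetical, and it assumes a bitmap whose bits are numbered from the most significant bit of each byte, which is my understanding of how f2fs_test_bit numbers them. Given such a bitmap, a caller can ask where the current run of equal-valued bits ends and read that many SIT blocks in one batch, instead of testing the bit for every block inside the read-ahead loop.

```c
#include <assert.h>

/* Test bit nr, counting from the most significant bit of each byte. */
static int sketch_test_bit(unsigned int nr, const unsigned char *addr)
{
	return (addr[nr >> 3] >> (7 - (nr & 7))) & 1;
}

/*
 * Return the index of the first bit at or after offset whose value
 * equals want (0 or 1), or size if there is no such bit.
 */
static unsigned int sketch_find_next(const unsigned char *addr,
				     unsigned int size, unsigned int offset,
				     int want)
{
	assert(want == 0 || want == 1);
	while (offset < size && sketch_test_bit(offset, addr) != want)
		offset++;
	return offset;
}

/* Length of the run of equal-valued bits that begins at start. */
static unsigned int sketch_run_length(const unsigned char *addr,
				      unsigned int size, unsigned int start)
{
	int cur;

	if (start >= size)
		return 0;
	cur = sketch_test_bit(start, addr);
	return sketch_find_next(addr, size, start, !cur) - start;
}
```

With the two-byte bitmap { 0xF0, 0x0F }, bits 0 to 3 are set, bits 4 to 11 are clear, and bits 12 to 15 are set, so sketch_run_length() returns 4, 8, and 4 when started at bits 0, 4, and 12. The bit-at-a-time loop is chosen for clarity; a real helper would presumably scan a word at a time.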
[PATCH 0/5] f2fs: some minor cleanups and logic fixes
Gu Zheng (5):
  f2fs: convert remove_inode_page to void
  f2fs: convert dev_valid_block_count to void
  f2fs: convert inc/dec_valid_node_count to inc/dec one count
  f2fs: simplify write_orphan_inodes for better readable
  f2fs: move the list_head initialization into the lock protection region

 fs/f2fs/checkpoint.c | 53 ++---
 fs/f2fs/f2fs.h       | 37 --
 fs/f2fs/node.c       | 18 ++--
 3 files changed, 52 insertions(+), 56 deletions(-)

--
1.7.7
[PATCH 4/5] f2fs: simplify write_orphan_inodes for better readable
Simplify write_orphan_inodes for better readability. Because we hold the
orphan_inode_mutex, it is safe to use list_for_each_entry instead of
list_for_each_safe.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
---
 fs/f2fs/checkpoint.c |   38 ++++++++++++++++++--------------------
 1 files changed, 18 insertions(+), 20 deletions(-)

diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index 5716e5e..f884589 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -300,12 +300,13 @@ int recover_orphan_inodes(struct f2fs_sb_info *sbi)

 static void write_orphan_inodes(struct f2fs_sb_info *sbi, block_t start_blk)
 {
-	struct list_head *head, *this, *next;
+	struct list_head *head;
 	struct f2fs_orphan_block *orphan_blk = NULL;
 	struct page *page = NULL;
 	unsigned int nentries = 0;
 	unsigned short index = 1;
 	unsigned short orphan_blocks;
+	struct orphan_inode_entry *orphan = NULL;

 	orphan_blocks = (unsigned short)((sbi->n_orphans +
 		(F2FS_ORPHANS_PER_BLOCK - 1)) / F2FS_ORPHANS_PER_BLOCK);
@@ -314,12 +315,17 @@ static void write_orphan_inodes(struct f2fs_sb_info *sbi, block_t start_blk)
 	head = &sbi->orphan_inode_list;

 	/* loop for each orphan inode entry and write them in Jornal block */
-	list_for_each_safe(this, next, head) {
-		struct orphan_inode_entry *orphan;
+	list_for_each_entry(orphan, head, list) {
+		if (!page) {
+			page = grab_meta_page(sbi, start_blk);
+			orphan_blk =
+				(struct f2fs_orphan_block *)page_address(page);
+			memset(orphan_blk, 0, sizeof(*orphan_blk));
+		}

-		orphan = list_entry(this, struct orphan_inode_entry, list);
+		orphan_blk->ino[nentries] = cpu_to_le32(orphan->ino);

-		if (nentries == F2FS_ORPHANS_PER_BLOCK) {
+		if (nentries++ == F2FS_ORPHANS_PER_BLOCK) {
 			/*
 			 * an orphan block is full of 1020 entries,
 			 * then we need to flush current orphan blocks
@@ -335,24 +341,16 @@ static void write_orphan_inodes(struct f2fs_sb_info *sbi, block_t start_blk)
 			nentries = 0;
 			page = NULL;
 		}
-		if (page)
-			goto page_exist;
+	}

-		page = grab_meta_page(sbi, start_blk);
-		orphan_blk = (struct f2fs_orphan_block *)page_address(page);
-		memset(orphan_blk, 0, sizeof(*orphan_blk));
-page_exist:
-		orphan_blk->ino[nentries++] = cpu_to_le32(orphan->ino);
+	if (page) {
+		orphan_blk->blk_addr = cpu_to_le16(index);
+		orphan_blk->blk_count = cpu_to_le16(orphan_blocks);
+		orphan_blk->entry_count = cpu_to_le32(nentries);
+		set_page_dirty(page);
+		f2fs_put_page(page, 1);
 	}
-	if (!page)
-		goto end;

-	orphan_blk->blk_addr = cpu_to_le16(index);
-	orphan_blk->blk_count = cpu_to_le16(orphan_blocks);
-	orphan_blk->entry_count = cpu_to_le32(nentries);
-	set_page_dirty(page);
-	f2fs_put_page(page, 1);
-end:
 	mutex_unlock(&sbi->orphan_inode_mutex);
 }
--
1.7.7
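The loop shape this patch moves to (grab a block lazily, append an entry, flush when the block fills, then finalize a trailing partial block) can be tried out in userspace with the sketch below. It is only an illustration under assumptions: plain arrays replace the kernel list and meta pages, SKETCH_PER_BLOCK is a tiny stand-in for F2FS_ORPHANS_PER_BLOCK, and all sketch_ names are hypothetical. The sketch stores with nentries++ and compares afterwards, so a block is closed exactly when it holds SKETCH_PER_BLOCK entries. In the hunk as posted, the store uses ino[nentries] and the comparison is nentries++ == F2FS_ORPHANS_PER_BLOCK, so it may be worth double-checking whether a full block could receive one entry too many there.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PER_BLOCK 4	/* stand-in for F2FS_ORPHANS_PER_BLOCK */

struct sketch_block {
	unsigned int ino[SKETCH_PER_BLOCK];
	unsigned int entry_count;
	unsigned short blk_addr;	/* 1-based index of this block */
};

/*
 * Pack n inode numbers into out[], SKETCH_PER_BLOCK per block.
 * Returns the number of blocks used. The caller must provide room for
 * (n + SKETCH_PER_BLOCK - 1) / SKETCH_PER_BLOCK blocks in out[].
 */
static size_t sketch_pack(const unsigned int *inos, size_t n,
			  struct sketch_block *out)
{
	struct sketch_block *cur = NULL;
	unsigned int nentries = 0;
	size_t used = 0;

	assert(inos != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (!cur) {
			/* Lazily start a fresh, zeroed block. */
			cur = &out[used];
			memset(cur, 0, sizeof(*cur));
		}

		/* Store first, then advance the count. */
		cur->ino[nentries++] = inos[i];

		if (nentries == SKETCH_PER_BLOCK) {
			/* Block is exactly full: finalize it, start anew. */
			cur->entry_count = nentries;
			cur->blk_addr = (unsigned short)++used;
			cur = NULL;
			nentries = 0;
		}
	}

	if (cur) {
		/* Finalize the trailing, partially filled block. */
		cur->entry_count = nentries;
		cur->blk_addr = (unsigned short)++used;
	}
	return used;
}
```

Packing nine inode numbers with four entries per block yields three blocks holding 4, 4, and 1 entries; packing eight yields exactly two full blocks and no empty trailing block.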
[PATCH 2/5] f2fs: convert dev_valid_block_count to void
Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
---
 fs/f2fs/f2fs.h |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 94fbec3..d0c6738 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -585,7 +585,7 @@ static inline bool inc_valid_block_count(struct f2fs_sb_info *sbi,
 	return true;
 }

-static inline int dec_valid_block_count(struct f2fs_sb_info *sbi,
+static inline void dec_valid_block_count(struct f2fs_sb_info *sbi,
 						struct inode *inode,
 						blkcnt_t count)
 {
@@ -595,7 +595,6 @@ static inline int dec_valid_block_count(struct f2fs_sb_info *sbi,
 	inode->i_blocks -= count;
 	sbi->total_valid_block_count -= (block_t)count;
 	spin_unlock(&sbi->stat_lock);
-	return 0;
 }

 static inline void inc_page_count(struct f2fs_sb_info *sbi, int count_type)
--
1.7.7
[PATCH 5/5] f2fs: move the list_head initialization into the lock protection region
Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
---
 fs/f2fs/checkpoint.c |   15 ++++++++++-----
 1 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index f884589..1de70cc 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -511,8 +511,8 @@ void add_dirty_dir_inode(struct inode *inode)
 void remove_dirty_dir_inode(struct inode *inode)
 {
 	struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
-	struct list_head *head = &sbi->dir_inode_list;
-	struct list_head *this;
+
+	struct list_head *this, *head;

 	if (!S_ISDIR(inode->i_mode))
 		return;
@@ -523,6 +523,7 @@ void remove_dirty_dir_inode(struct inode *inode)
 		return;
 	}

+	head = &sbi->dir_inode_list;
 	list_for_each(this, head) {
 		struct dir_inode_entry *entry;
 		entry = list_entry(this, struct dir_inode_entry, list);
@@ -544,11 +545,13 @@ void remove_dirty_dir_inode(struct inode *inode)

 struct inode *check_dirty_dir_inode(struct f2fs_sb_info *sbi, nid_t ino)
 {
-	struct list_head *head = &sbi->dir_inode_list;
-	struct list_head *this;
+
+	struct list_head *this, *head;
 	struct inode *inode = NULL;

 	spin_lock(&sbi->dir_inode_lock);
+
+	head = &sbi->dir_inode_list;
 	list_for_each(this, head) {
 		struct dir_inode_entry *entry;
 		entry = list_entry(this, struct dir_inode_entry, list);
@@ -563,11 +566,13 @@ struct inode *check_dirty_dir_inode(struct f2fs_sb_info *sbi, nid_t ino)

 void sync_dirty_dir_inodes(struct f2fs_sb_info *sbi)
 {
-	struct list_head *head = &sbi->dir_inode_list;
+	struct list_head *head;
 	struct dir_inode_entry *entry;
 	struct inode *inode;
 retry:
 	spin_lock(&sbi->dir_inode_lock);
+
+	head = &sbi->dir_inode_list;
 	if (list_empty(head)) {
 		spin_unlock(&sbi->dir_inode_lock);
 		return;
--
1.7.7
Re: [f2fs-dev] [PATCH V2 2/2] f2fs: read contiguous sit entry pages by merging for mount performance
Hi Yu,

On 11/20/2013 01:37 PM, Chao Yu wrote:
> Hi Gu,
>
> > -----Original Message-----
> > From: Gu Zheng [mailto:guz.f...@cn.fujitsu.com]
> > Sent: Monday, November 18, 2013 7:16 PM
> > To: Chao Yu
> > Cc: '???'; linux-fsde...@vger.kernel.org; linux-kernel@vger.kernel.org; linux-f2fs-de...@lists.sourceforge.net; 谭姝
> > Subject: Re: [f2fs-dev] [PATCH V2 2/2] f2fs: read contiguous sit entry pages by merging for mount performance
> >
> > Hi Yu,
> >
> > One more comment, please refer to inline.
> >
> > On 11/16/2013 02:15 PM, Chao Yu wrote:
> > > Previously we read sit entries page one by one, this method lost the chance of reading contiguous page together.
> > > So we read pages as contiguous as possible for better mount performance.
> > >
> > > v1-->v2:
> > >  o merge judgements/use 'Continue' or 'Break' instead of 'Goto' as Gu Zheng suggested.
> > >  o add mark_page_accessed () before release page to delay VM reclaiming them.
> > >
> > > Signed-off-by: Chao Yu <chao2...@samsung.com>
> > > ---
> > >  fs/f2fs/segment.c | 108
> > >  fs/f2fs/segment.h |   2 ++
> > >  2 files changed, 84 insertions(+), 26 deletions(-)
> > >
> > > diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
> > > index fa284d3..656fe40 100644
> > > --- a/fs/f2fs/segment.c
> > > +++ b/fs/f2fs/segment.c
> > > @@ -14,6 +14,7 @@
> > >  #include <linux/blkdev.h>
> > >  #include <linux/prefetch.h>
> > >  #include <linux/vmalloc.h>
> > > +#include <linux/swap.h>
> > >
> > >  #include "f2fs.h"
> > >  #include "segment.h"
> > > @@ -1480,41 +1481,96 @@ static int build_curseg(struct f2fs_sb_info *sbi)
> > >  	return restore_curseg_summaries(sbi);
> > >  }
> > >
> > > +static int ra_sit_pages(struct f2fs_sb_info *sbi, int start,
> > > +					int nrpages, bool *is_order)
> > > +{
> > > +	struct address_space *mapping = sbi->meta_inode->i_mapping;
> > > +	struct sit_info *sit_i = SIT_I(sbi);
> > > +	struct page *page;
> > > +	block_t blk_addr;
> > > +	int blkno = start, readcnt = 0;
> > > +	int sit_blk_cnt = SIT_BLK_CNT(sbi);
> > > +
> > > +	for (; blkno < start + nrpages && blkno < sit_blk_cnt; blkno++) {
> > > +
> > > +		if ((!f2fs_test_bit(blkno, sit_i->sit_bitmap) ^ !*is_order)) {
> > > +			*is_order = !*is_order;
> > > +			break;
> > > +		}
> > > +
> > > +		blk_addr = sit_i->sit_base_addr + blkno;
> > > +		if (*is_order)
> > > +			blk_addr += sit_i->sit_blocks;
> > > +repeat:
> > > +		page = grab_cache_page(mapping, blk_addr);
> > > +		if (!page) {
> > > +			cond_resched();
> > > +			goto repeat;
> > > +		}
> > > +		if (PageUptodate(page)) {
> > > +			mark_page_accessed(page);
> > > +			f2fs_put_page(page, 1);
> > > +			readcnt++;
> > > +			continue;
> > > +		}
> > > +
> > > +		submit_read_page(sbi, page, blk_addr, READ_SYNC);
> > > +
> > > +		mark_page_accessed(page);
> > > +		f2fs_put_page(page, 0);
> > > +		readcnt++;
> > > +	}
> > > +
> > > +	f2fs_submit_read_bio(sbi, READ_SYNC);
> > > +	return readcnt;
> > > +}
> > > +
> > >  static void build_sit_entries(struct f2fs_sb_info *sbi)
> > >  {
> > >  	struct sit_info *sit_i = SIT_I(sbi);
> > >  	struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_COLD_DATA);
> > >  	struct f2fs_summary_block *sum = curseg->sum_blk;
> > > -	unsigned int start;
> > > -
> > > -	for (start = 0; start < TOTAL_SEGS(sbi); start++) {
> > > -		struct seg_entry *se = &sit_i->sentries[start];
> > > -		struct f2fs_sit_block *sit_blk;
> > > -		struct f2fs_sit_entry sit;
> > > -		struct page *page;
> > > -		int i;
> > > +	bool is_order = f2fs_test_bit(0, sit_i->sit_bitmap) ? true : false;
> > > +	int sit_blk_cnt = SIT_BLK_CNT(sbi);
> > > +	unsigned int i, start, end;
> > > +	unsigned int readed, start_blk = 0;
> > >
> > > -		mutex_lock(&curseg->curseg_mutex);
> > > -		for (i = 0; i < sits_in_cursum(sum); i++) {
> > > -			if (le32_to_cpu(segno_in_journal(sum, i)) == start) {
> > > -				sit = sit_in_journal(sum, i);
> > > -				mutex_unlock(&curseg->curseg_mutex);
> > > -				goto got_it;
> > > +	do {
> >
> > How about using find_next_bit to get the suitable start_blk if the next blk is
> > not ordered here? And it also can simplify the logic of ra_sit_pages().
>
> That's a good idea.
> But I thought there may be an endianness problem between test_bit and
> f2fs_test_bit, so find_next_bit may get a wrong result.
> Am I right?

IMO, find_next_bit can handle the endianness issue internally; if it cannot,
that may be a weakness.
On the other hand, why not introduce an 'f2fs_find_next_bit' if it is
seriously needed? :)

Regards,
Gu

>
> Thanks,
> Yu
>
> > Thanks,
> > Gu
> >
> > > +		readed = ra_sit_pages(sbi, start_blk, sit_blk_cnt, &is_order);
> > > +
> > > +		start = start_blk * sit_i->sents_per_block;
> > > +		end = (start_blk + readed) * sit_i->sents_per_block;
> > > +
> > > +		for (; start < end && start < TOTAL_SEGS(sbi); start++) {
> > > +			struct seg_entry *se = &sit_i->sentries[start];
> > > +			struct f2fs_sit_block *sit_blk;
> > > +			struct f2fs_sit_entry sit;
> > > +			struct page *page;
> > > +
> > > +			mutex_lock