pci-sysfs: queue sysfs rescan routine into workqueue to avoid potential deadlock situation

2013-02-06 Thread Gu Zheng
There is a potential deadlock when the pci-sysfs user interfaces are
manipulated from different bus hierarchies simultaneously, as described below:

path1: sysfs remove device:                 | path2: sysfs rescan device:
sysfs_schedule_callback_work()              | sysfs_write_file()
  remove_callback()                         |   flush_write_buffer()
*1* mutex_lock(&pci_remove_rescan_mutex)    |*2*  sysfs_get_active(attr_sd)
      ...                                   |     dev_attr_store()
    device_remove_file()                    |       dev_rescan_store()
      ...                                   |*4*      mutex_lock(&pci_remove_rescan_mutex)
*3*   sysfs_deactivate(sd)                  |         ...
        wait_for_completion()               |*5*  sysfs_put_active(attr_sd)
*6* mutex_unlock(&pci_remove_rescan_mutex)

If path1 first takes pci_remove_rescan_mutex at *1*, and path2 then becomes
active and runs to *2* before path1 reaches *3*, we run into a deadlock:
path1 holds the mutex and waits for path2 to decrease the sysfs_dirent's
s_active counter at *5*, but path2 is blocked at *4* trying to take
pci_remove_rescan_mutex. path1 won't release the mutex until it reaches
*6*, but it is blocked at *3*.

At Bjorn's suggestion, and based on Yinghai Lu's patch:
http://git.kernel.org/?p=linux/kernel/git/yinghai/linux-yinghai.git;a=commitdiff;h=277df390baeab7ba6aa136356b677a096c890c0c

The workaround is to queue the sysfs rescan routine into a workqueue, just
like removal, so that the PCI tree is never removed and scanned at the same time.
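
The idea can be sketched in user space (this is an illustration only, not
the kernel patch): both "remove" and "rescan" are queued to one ordered
worker instead of being run by the caller, so the caller never sleeps on a
mutex while it still holds the attribute's active reference. All names
below (struct workqueue, wq_queue(), do_remove(), do_rescan()) are
hypothetical stand-ins, and pthreads replace the kernel workqueue API.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical user-space stand-in for an ordered, single-threaded workqueue. */
struct work {
	void (*fn)(void);
	struct work *next;
};

struct workqueue {
	pthread_mutex_t lock;
	pthread_cond_t more;
	struct work *head, *tail;
	int stopping;
	pthread_t thread;
};

static char event_log[16];	/* order in which work items ran */

static void log_event(char c)
{
	size_t n = strlen(event_log);

	assert(n + 1 < sizeof(event_log));
	event_log[n] = c;
}

/* Stand-ins for the real remove and rescan routines. */
static void do_remove(void) { log_event('R'); }
static void do_rescan(void) { log_event('S'); }

static void *worker(void *p)
{
	struct workqueue *wq = p;

	for (;;) {
		pthread_mutex_lock(&wq->lock);
		while (!wq->head && !wq->stopping)
			pthread_cond_wait(&wq->more, &wq->lock);
		struct work *w = wq->head;
		if (!w) {	/* stopping and fully drained */
			pthread_mutex_unlock(&wq->lock);
			return NULL;
		}
		wq->head = w->next;
		if (!wq->head)
			wq->tail = NULL;
		pthread_mutex_unlock(&wq->lock);

		w->fn();	/* items run one at a time, in queue order */
		free(w);
	}
}

static int wq_start(struct workqueue *wq)
{
	pthread_mutex_init(&wq->lock, NULL);
	pthread_cond_init(&wq->more, NULL);
	wq->head = wq->tail = NULL;
	wq->stopping = 0;
	return pthread_create(&wq->thread, NULL, worker, wq);
}

/* Called from the "store" path: queue the work and return immediately. */
static int wq_queue(struct workqueue *wq, void (*fn)(void))
{
	struct work *w = malloc(sizeof(*w));

	if (!w)
		return -1;
	w->fn = fn;
	w->next = NULL;
	pthread_mutex_lock(&wq->lock);
	if (wq->tail)
		wq->tail->next = w;
	else
		wq->head = w;
	wq->tail = w;
	pthread_cond_signal(&wq->more);
	pthread_mutex_unlock(&wq->lock);
	return 0;
}

/* Let the worker finish everything already queued, then join it. */
static void wq_stop(struct workqueue *wq)
{
	pthread_mutex_lock(&wq->lock);
	wq->stopping = 1;
	pthread_cond_signal(&wq->more);
	pthread_mutex_unlock(&wq->lock);
	pthread_join(wq->thread, NULL);
}
```

Queuing do_remove and then do_rescan and calling wq_stop() leaves the two
events in event_log in submission order; neither caller ever blocks on the
other's work.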


*dmesg info*:
(snip)
1000e :1c:00.0: eth9: Intel(R) PRO/1000 Network Connection
sd 13:2:0:0: [sdb] Attached SCSI disk
e1000e :1c:00.0: eth9: MAC: 0, PHY: 4, PBA No: D50228-005
e1000e :1c:00.1: Disabling ASPM  L1
e1000e :1c:00.1: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
e1000e :1c:00.1: irq 143 for MSI/MSI-X
e1000e :1c:00.1: eth10: (PCI Express:2.5GT/s:Width x4) 00:15:17:cd:96:bf
e1000e :1c:00.1: eth10: Intel(R) PRO/1000 Network Connection
e1000e :1c:00.1: eth10: MAC: 0, PHY: 4, PBA No: D50228-005
INFO: task bash:62982 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bashD  0 62982  62978 0x0080
 88038b277db8 0082 88038b277fd8 00013940
 88038b276010 00013940 00013940 00013940
 88038b277fd8 00013940 880377449e30 8806e822c670
Call Trace:
 [] schedule+0x29/0x70
 [] schedule_preempt_disabled+0xe/0x10
 [] __mutex_lock_slowpath+0xd3/0x150
 [] mutex_lock+0x2b/0x50
 [] dev_rescan_store+0x5c/0x80
 [] dev_attr_store+0x20/0x30
 [] sysfs_write_file+0xef/0x170
 [] vfs_write+0xc8/0x190
 [] sys_write+0x51/0x90
 [] system_call_fastpath+0x16/0x1b
INFO: task bash:64141 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bashD 81610460 0 64141  64136 0x0080
 8803540e9db8 0086 8803540e9fd8 00013940
 8803540e8010 00013940 00013940 00013940
 8803540e9fd8 00013940 8807db338a10 8806f09abc60
Call Trace:
 [] schedule+0x29/0x70
 [] schedule_preempt_disabled+0xe/0x10
 [] __mutex_lock_slowpath+0xd3/0x150
 [] mutex_lock+0x2b/0x50
 [] dev_rescan_store+0x5c/0x80
 [] dev_attr_store+0x20/0x30
 [] sysfs_write_file+0xef/0x170
 [] vfs_write+0xc8/0x190
 [] sys_write+0x51/0x90
 [] system_call_fastpath+0x16/0x1b
INFO: task kworker/u:3:64451 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u:3 D 81610460 0 64451  2 0x0080
 8807d51b7a30 0046 8807d51b7fd8 00013940
 8807d51b6010 00013940 00013940 00013940
 8807d51b7fd8 00013940 8807db339420 88037744b250
Call Trace:
 [] schedule+0x29/0x70
 [] schedule_timeout+0x19d/0x220
 [] ? __slab_free+0x1f2/0x2f0
 [] wait_for_common+0x11e/0x190
 [] ? try_to_wake_up+0x2c0/0x2c0
 [] wait_for_completion+0x1d/0x20
 [] sysfs_addrm_finish+0xb8/0xd0
 [] ? sysfs_schedule_callback+0x1e0/0x1e0
 [] sysfs_hash_and_remove+0x60/0xb0
 [] sysfs_remove_file+0x39/0x50
 [] device_remove_file+0x17/0x20
 [] bus_remove_device+0xdc/0x180
 [] device_del+0x120/0x1d0
 [] device_unregister+0x22/0x60
 [] pci_stop_bus_device+0x94/0xa0
 [] pci_stop_bus_device+0x40/0xa0
 [] pci_stop_bus_device+0x40/0xa0
 [] pci_stop_bus_device+0x40/0xa0
 [] pci_stop_and_remove_bus_device+0x16/0x30
 [] remove_callback+0x29/0x40
 [] sysfs_schedule_callback_work+0x24/0x70
 [] process_one_work+0x179/0x4b0
 [] worker_thread+0x12e/0x330
 [] ? manage_workers+0x110/0x110
 [] kthread+0x9e/0xb0
 [] kernel_thread_helper+0x4/0x10
 [] ? kthread_freezable_should_stop+0x70/0x70
 [] ? gs_change+0x13/0x13


Sign

Re: [PATCH 2/3] resource: Add release_mem_region_adjustable()

2013-04-03 Thread Gu Zheng
On 04/03/2013 12:17 AM, Toshi Kani wrote:

> Added release_mem_region_adjustable(), which releases a requested
> region from a currently busy memory resource.  This interface
> adjusts the matched memory resource accordingly if the requested
> region does not match exactly but still fits into.
> 
> This new interface is intended for memory hot-delete.  During
> bootup, memory resources are inserted from the boot descriptor
> table, such as EFI Memory Table and e820.  Each memory resource
> entry usually covers the whole contiguous memory range.  Memory
> hot-delete request, on the other hand, may target to a particular
> range of memory resource, and its size can be much smaller than
> the whole contiguous memory.  Since the existing release interfaces
> like __release_region() require a requested region to be exactly
> matched to a resource entry, they do not allow a partial resource
> to be released.
> 
> There is no change to the existing interfaces since their restriction
> is valid for I/O resources.
> 
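
The four outcomes described above (exact match, trimmed start, trimmed end,
split) come down to simple interval arithmetic. A minimal sketch under
stated assumptions: struct range, enum release_kind and release_range() are
hypothetical names for illustration and are not part of the patch; ranges
are inclusive, and the resource tree, locking and children are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct range {
	uint64_t start, end;	/* inclusive bounds */
};

enum release_kind {
	REL_NO_MATCH,	/* request does not fit inside the resource */
	REL_WHOLE,	/* exact match: the whole entry goes away */
	REL_TRIM_START,	/* request is at the front: start moves up */
	REL_TRIM_END,	/* request is at the back: end moves down */
	REL_SPLIT,	/* request is in the middle: two entries remain */
};

/*
 * Work out what is left of 'res' after releasing [start, start + size - 1].
 * One remaining range is returned in *low; for REL_SPLIT the upper
 * remainder is returned in *high.
 */
static enum release_kind release_range(struct range res, uint64_t start,
				       uint64_t size, struct range *low,
				       struct range *high)
{
	uint64_t end;

	assert(low != NULL && high != NULL);
	if (size == 0)
		return REL_NO_MATCH;
	end = start + size - 1;
	if (res.start > start || res.end < end)
		return REL_NO_MATCH;

	if (res.start == start && res.end == end)
		return REL_WHOLE;
	if (res.start == start) {
		low->start = end + 1;
		low->end = res.end;
		return REL_TRIM_START;
	}
	if (res.end == end) {
		low->start = res.start;
		low->end = start - 1;
		return REL_TRIM_END;
	}
	low->start = res.start;
	low->end = start - 1;
	high->start = end + 1;
	high->end = res.end;
	return REL_SPLIT;
}
```

For a resource covering 0x1000-0x4fff, releasing 0x1000 bytes at 0x2000
gives REL_SPLIT with 0x1000-0x1fff and 0x3000-0x4fff remaining.
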
> Signed-off-by: Toshi Kani 
> ---
>  include/linux/ioport.h |2 +
>  kernel/resource.c  |   87 
> 
>  2 files changed, 89 insertions(+)
> 
> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> index 85ac9b9b..0fe1a82 100644
> --- a/include/linux/ioport.h
> +++ b/include/linux/ioport.h
> @@ -192,6 +192,8 @@ extern struct resource * __request_region(struct resource *,
>  extern int __check_region(struct resource *, resource_size_t, resource_size_t);
>  extern void __release_region(struct resource *, resource_size_t,
>   resource_size_t);
> +extern int release_mem_region_adjustable(struct resource *, resource_size_t,
> + resource_size_t);
>  
>  static inline int __deprecated check_region(resource_size_t s,
>   resource_size_t n)
> diff --git a/kernel/resource.c b/kernel/resource.c
> index ae246f9..789f160 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -1021,6 +1021,93 @@ void __release_region(struct resource *parent, resource_size_t start,
>  }
>  EXPORT_SYMBOL(__release_region);
>  
> +/**
> + * release_mem_region_adjustable - release a previously reserved memory region
> + * @parent: parent resource descriptor
> + * @start: resource start address
> + * @size: resource region size
> + *
> + * The requested region is released from a currently busy memory resource.
> + * It adjusts the matched busy memory resource accordingly if the requested
> + * region does not match exactly but still fits into.  Existing children of
> + * the busy memory resource must be immutable in this request.
> + *
> + * Note, when the busy memory resource gets split into two entries, the code
> + * assumes that all children remain in the lower address entry for simplicity.
> + * Enhance this logic when necessary.
> + */
> +int release_mem_region_adjustable(struct resource *parent,
> + resource_size_t start, resource_size_t size)
> +{
> + struct resource **p;
> + struct resource *res, *new;
> + resource_size_t end;
> + int ret = 0;
> +
> + p = &parent->child;
> + end = start + size - 1;
> +
> + write_lock(&resource_lock);
> +
> + while ((res = *p)) {
> + if (res->start > start || res->end < end) {
> + p = &res->sibling;
> + continue;
> + }
> +
> + if (!(res->flags & IORESOURCE_MEM)) {
> + ret = -EINVAL;
> + break;
> + }
> +
> + if (!(res->flags & IORESOURCE_BUSY)) {
> + p = &res->child;
> + continue;
> + }
> +
> + if (res->start == start && res->end == end) {
> + /* free the whole entry */
> + *p = res->sibling;
> + kfree(res);
> + } else if (res->start == start && res->end != end) {
> + /* adjust the start */
> + ret = __adjust_resource(res, end+1,
> + res->end - end);
> + } else if (res->start != start && res->end == end) {
> + /* adjust the end */
> + ret = __adjust_resource(res, res->start,
> + start - res->start);
> + } else {
> + /* split into two entries */
> + new = kzalloc(sizeof(struct resource), GFP_KERNEL);
> + if (!new) {
> + ret = -ENOMEM;
> + break;
> + }
> + new->name = res->name;
> + new->start = end + 1;
> + new->end = res->end;
> + new->flags = res->flags;
> + new->parent = r

Re: [PATCH] pci-sysfs: replace mutex_lock with mutex_trylock to avoid potential deadlock situation

2013-01-25 Thread Gu Zheng
Hi Bjorn,
Thanks for your review and comments! Please see my inline comments below.

On 01/25/2013 07:12 AM, Bjorn Helgaas wrote:

> On Thu, Dec 27, 2012 at 12:42 AM, Lin Feng  wrote:
>> There is a potential deadlock situation when we manipulate the pci-sysfs user
>> interfaces from different bus hierarchy simultaneously, described as following:
>>
>> path1: sysfs remove device:                 | path2: sysfs rescan device:
>> sysfs_schedule_callback_work()              | sysfs_write_file()
>>   remove_callback()                         |   flush_write_buffer()
>> *1* mutex_lock(&pci_remove_rescan_mutex)    |*2*  sysfs_get_active(attr_sd)
>>       ...                                   |     dev_attr_store()
>>     device_remove_file()                    |       dev_rescan_store()
>>       ...                                   |*4*      mutex_lock(&pci_remove_rescan_mutex)
>> *3*   sysfs_deactivate(sd)                  |         ...
>>         wait_for_completion()               |*5*  sysfs_put_active(attr_sd)
>> *6* mutex_unlock(&pci_remove_rescan_mutex)
...snip...
>> Reported-by: Taku Izumi 
>> Signed-off-by: Lin Feng 
>> Signed-off-by: Gu Zheng 
>> ---
>>  drivers/pci/pci-sysfs.c |   42 ++
>>  1 files changed, 26 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
>> index 05b78b1..d2efbb0 100644
>> --- a/drivers/pci/pci-sysfs.c
>> +++ b/drivers/pci/pci-sysfs.c
>> @@ -295,10 +295,13 @@ static ssize_t bus_rescan_store(struct bus_type *bus, const char *buf,
>> return -EINVAL;
>>
>> if (val) {
>> -   mutex_lock(&pci_remove_rescan_mutex);
>> -   while ((b = pci_find_next_bus(b)) != NULL)
>> -   pci_rescan_bus(b);
>> -   mutex_unlock(&pci_remove_rescan_mutex);
>> +   if (mutex_trylock(&pci_remove_rescan_mutex)) {
>> +   while ((b = pci_find_next_bus(b)) != NULL)
>> +   pci_rescan_bus(b);
>> +   mutex_unlock(&pci_remove_rescan_mutex);
>> +   } else {
>> +   return 0;
> What are the semantics of returning 0 from a sysfs store function?
> Does the user's write just get dropped?  I would think we'd return
> "count" for that case.

Oh, yes, returning "count" seems suitable here: although we did not achieve
the user's goal (rescanning the bus), the user's write has already been flushed.
But the user still cannot tell from the return value alone whether
pci_rescan_bus() actually ran. Should we return a suitable error here to tell
the user that the write was accepted but pci_rescan_bus() was not done?

> Is there some sort of automatic retry in libc
> or something if we return zero?

No, libc indeed does not do anything extra if we return zero.

> Are you relying on the user code to
> notice that nothing was written and do its own retry?
>

Yes, but it seems impractical.
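
To make the cases discussed above concrete, here is a small user-space
sketch of how a caller might interpret the value it gets back from write()
on such an attribute. It is only an illustration: enum store_result and
classify_store() are hypothetical names, not an existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

enum store_result {
	STORE_DONE,	/* ret == count: the whole write was consumed */
	STORE_PARTIAL,	/* 0 < ret < count: caller must write the rest */
	STORE_NOTHING,	/* ret == 0: nothing consumed, no error reported */
	STORE_ERROR,	/* ret < 0: errno tells the caller what went wrong */
};

/* Classify the return value of a write() of 'count' bytes. */
static enum store_result classify_store(ssize_t ret, size_t count)
{
	/* A write can never report more bytes than were requested. */
	assert(ret < 0 || (size_t)ret <= count);

	if (ret < 0)
		return STORE_ERROR;
	if (ret == 0)
		return STORE_NOTHING;
	if ((size_t)ret < count)
		return STORE_PARTIAL;
	return STORE_DONE;
}
```

Only STORE_ERROR gives the caller a reason for the failure; with
STORE_NOTHING the caller has to guess whether to retry, and with STORE_DONE
it cannot tell whether the rescan actually happened.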

> The last seems most likely, but that seems like it complicates the
> user's life unnecessarily.
>
>> +   }
>> }
>> return count;
>>  }
>> @@ -319,9 +322,12 @@ dev_rescan_store(struct device *dev, struct device_attribute *attr,
>> return -EINVAL;
>>
>> if (val) {
>> -   mutex_lock(&pci_remove_rescan_mutex);
>> -   pci_rescan_bus(pdev->bus);
>> -   mutex_unlock(&pci_remove_rescan_mutex);
>> +   if (mutex_trylock(&pci_remove_rescan_mutex)) {
>> +   pci_rescan_bus(pdev->bus);
>> +   mutex_unlock(&pci_remove_rescan_mutex);
>> +   } else {
>> +   return 0;
>> +   }
>> }
>> return count;
>>  }
>> @@ -330,9 +336,10 @@ static void remove_callback(struct device *dev)
>>  {
>> struct pci_dev *pdev = to_pci_dev(dev);
>>
>> -   mutex_lock(&pci_remove_rescan_mutex);
>> -   pci_stop_and_remove_bus_device(pdev);
>> -   mutex_unlock(&pci_remove_rescan_mutex);
>> +   if (mutex_trylock(&pci_remove_rescan_mutex)) {
>> +   pci_stop_and_remove_bus_device(pdev);
>> +   mutex_unlock(&pci_remove_rescan_mutex);
>> +   }
> In the other cases, I think the user will at least get some
> indication, e.g., a write() that returns zero, when we abort.  But
> here, we silently sk

Re: [PATCH] driver core / ACPI: Avoid device removal locking problems

2013-08-25 Thread Gu Zheng
Hi Rafael,

On 08/26/2013 04:09 AM, Rafael J. Wysocki wrote:

> From: Rafael J. Wysocki 
> 
> There are two mutexes, device_hotplug_lock and acpi_scan_lock, held
> around the acpi_bus_trim() call in acpi_scan_hot_remove() which
> generally removes devices (it removes ACPI device objects at least,
> but it may also remove "physical" device objects through .detach()
> callbacks of ACPI scan handlers).  Thus, potentially, device sysfs
> attributes are removed under these locks and to remove those
> attributes it is necessary to hold the s_active references of their
> directory entries for writing.
> 
> On the other hand, the execution of a .show() or .store() callback
> from a sysfs attribute is carried out with that attribute's s_active
> reference held for reading.  Consequently, if any device sysfs
> attribute that may be removed from within acpi_scan_hot_remove()
> through acpi_bus_trim() has a .store() or .show() callback which
> acquires either acpi_scan_lock or device_hotplug_lock, the execution
> of that callback may deadlock with the removal of the attribute.
> [Unfortunately, the "online" device attribute of CPUs and memory
> blocks and the "eject" attribute of ACPI device objects are affected
> by this issue.]
> 
> To avoid those deadlocks introduce a new protection mechanism that
> can be used by the device sysfs attributes in question.  Namely,
> if a device sysfs attribute's .store() or .show() callback routine
> is about to acquire device_hotplug_lock or acpi_scan_lock, it can
> first execute read_lock_device_remove() and return an error code if
> that function returns false.  If true is returned, the lock in
> question may be acquired and read_unlock_device_remove() must be
> called.  [This mechanism is implemented by means of an additional
> rwsem in drivers/base/core.c.]
> 
> Make the affected sysfs attributes in the driver core and ACPI core
> use read_lock_device_remove() and read_unlock_device_remove() as
> described above.
> 
> Signed-off-by: Rafael J. Wysocki 
> Reported-by: Gu Zheng 

I'm sorry, I forgot to mention that the original reporter is
Yasuaki Ishimatsu. I continued the investigation and found more issues.

We tested this patch on kernel 3.11-rc6, but the issue still seems to be
there. Details follow.

Thanks,
Gu

==
[ INFO: possible circular locking dependency detected ]
3.11.0-rc6-lockdebug-refea+ #162 Tainted: GF
---
kworker/0:2/754 is trying to acquire lock:
 (s_active#73){.+}, at: [] sysfs_addrm_finish+0x3b/0x70

but task is already holding lock:
 (mem_sysfs_mutex){+.+.+.}, at: [] remove_memory_block+0x1d/0xa0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

[PATCH] drivers/base/memory.c: introduce help macro to_memory_block

2013-08-26 Thread Gu Zheng
Introduce the helper macro to_memory_block to hide the conversion
(device --> memory_block). This is just a cleanup.

Signed-off-by: Gu Zheng 
---
 drivers/base/memory.c |   27 ---
 1 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 2b7813e..4a874c6 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -30,6 +30,8 @@ static DEFINE_MUTEX(mem_sysfs_mutex);
 
 #define MEMORY_CLASS_NAME  "memory"
 
+#define to_memory_block(dev) container_of(dev, struct memory_block, dev)
+
 static int sections_per_block;
 
 static inline int base_memory_block_id(int section_nr)
@@ -77,7 +79,7 @@ EXPORT_SYMBOL(unregister_memory_isolate_notifier);
 
 static void memory_block_release(struct device *dev)
 {
-   struct memory_block *mem = container_of(dev, struct memory_block, dev);
+   struct memory_block *mem = to_memory_block(dev);
 
kfree(mem);
 }
@@ -110,8 +112,7 @@ static unsigned long get_memory_block_size(void)
 static ssize_t show_mem_start_phys_index(struct device *dev,
struct device_attribute *attr, char *buf)
 {
-   struct memory_block *mem =
-   container_of(dev, struct memory_block, dev);
+   struct memory_block *mem = to_memory_block(dev);
unsigned long phys_index;
 
phys_index = mem->start_section_nr / sections_per_block;
@@ -121,8 +122,7 @@ static ssize_t show_mem_start_phys_index(struct device *dev,
 static ssize_t show_mem_end_phys_index(struct device *dev,
struct device_attribute *attr, char *buf)
 {
-   struct memory_block *mem =
-   container_of(dev, struct memory_block, dev);
+   struct memory_block *mem = to_memory_block(dev);
unsigned long phys_index;
 
phys_index = mem->end_section_nr / sections_per_block;
@@ -137,8 +137,7 @@ static ssize_t show_mem_removable(struct device *dev,
 {
unsigned long i, pfn;
int ret = 1;
-   struct memory_block *mem =
-   container_of(dev, struct memory_block, dev);
+   struct memory_block *mem = to_memory_block(dev);
 
for (i = 0; i < sections_per_block; i++) {
pfn = section_nr_to_pfn(mem->start_section_nr + i);
@@ -154,8 +153,7 @@ static ssize_t show_mem_removable(struct device *dev,
 static ssize_t show_mem_state(struct device *dev,
struct device_attribute *attr, char *buf)
 {
-   struct memory_block *mem =
-   container_of(dev, struct memory_block, dev);
+   struct memory_block *mem = to_memory_block(dev);
ssize_t len = 0;
 
/*
@@ -280,7 +278,7 @@ static int __memory_block_change_state(struct memory_block *mem,
 
 static int memory_subsys_online(struct device *dev)
 {
-   struct memory_block *mem = container_of(dev, struct memory_block, dev);
+   struct memory_block *mem = to_memory_block(dev);
int ret;
 
mutex_lock(&mem->state_mutex);
@@ -295,7 +293,7 @@ static int memory_subsys_online(struct device *dev)
 
 static int memory_subsys_offline(struct device *dev)
 {
-   struct memory_block *mem = container_of(dev, struct memory_block, dev);
+   struct memory_block *mem = to_memory_block(dev);
int ret;
 
mutex_lock(&mem->state_mutex);
@@ -349,7 +347,7 @@ store_mem_state(struct device *dev,
bool offline;
int ret = -EINVAL;
 
-   mem = container_of(dev, struct memory_block, dev);
+   mem = to_memory_block(dev);
 
lock_device_hotplug();
 
@@ -392,8 +390,7 @@ store_mem_state(struct device *dev,
 static ssize_t show_phys_device(struct device *dev,
struct device_attribute *attr, char *buf)
 {
-   struct memory_block *mem =
-   container_of(dev, struct memory_block, dev);
+   struct memory_block *mem = to_memory_block(dev);
return sprintf(buf, "%d\n", mem->phys_device);
 }
 
@@ -525,7 +522,7 @@ struct memory_block *find_memory_block_hinted(struct mem_section *section,
put_device(&hint->dev);
if (!dev)
return NULL;
-   return container_of(dev, struct memory_block, dev);
+   return to_memory_block(dev);
 }
 
 /*
-- 
1.7.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] driver core / ACPI: Avoid device removal locking problems

2013-08-26 Thread Gu Zheng
Hi Rafael,

On 08/26/2013 10:43 PM, Rafael J. Wysocki wrote:

> On Monday, August 26, 2013 02:42:09 PM Rafael J. Wysocki wrote:
>> On Monday, August 26, 2013 11:13:13 AM Gu Zheng wrote:
>>> Hi Rafael,
>>
>> Hi,
>>
>>> On 08/26/2013 04:09 AM, Rafael J. Wysocki wrote:
>>>
>>>> From: Rafael J. Wysocki 
>>>>
>>>> There are two mutexes, device_hotplug_lock and acpi_scan_lock, held
>>>> around the acpi_bus_trim() call in acpi_scan_hot_remove() which
>>>> generally removes devices (it removes ACPI device objects at least,
>>>> but it may also remove "physical" device objects through .detach()
>>>> callbacks of ACPI scan handlers).  Thus, potentially, device sysfs
>>>> attributes are removed under these locks and to remove those
>>>> attributes it is necessary to hold the s_active references of their
>>>> directory entries for writing.
>>>>
>>>> On the other hand, the execution of a .show() or .store() callback
>>>> from a sysfs attribute is carried out with that attribute's s_active
>>>> reference held for reading.  Consequently, if any device sysfs
>>>> attribute that may be removed from within acpi_scan_hot_remove()
>>>> through acpi_bus_trim() has a .store() or .show() callback which
>>>> acquires either acpi_scan_lock or device_hotplug_lock, the execution
>>>> of that callback may deadlock with the removal of the attribute.
>>>> [Unfortunately, the "online" device attribute of CPUs and memory
>>>> blocks and the "eject" attribute of ACPI device objects are affected
>>>> by this issue.]
>>>>
>>>> To avoid those deadlocks introduce a new protection mechanism that
>>>> can be used by the device sysfs attributes in question.  Namely,
>>>> if a device sysfs attribute's .store() or .show() callback routine
>>>> is about to acquire device_hotplug_lock or acpi_scan_lock, it can
>>>> first execute read_lock_device_remove() and return an error code if
>>>> that function returns false.  If true is returned, the lock in
>>>> question may be acquired and read_unlock_device_remove() must be
>>>> called.  [This mechanism is implemented by means of an additional
>>>> rwsem in drivers/base/core.c.]
>>>>
>>>> Make the affected sysfs attributes in the driver core and ACPI core
>>>> use read_lock_device_remove() and read_unlock_device_remove() as
>>>> described above.
>>>>
>>>> Signed-off-by: Rafael J. Wysocki 
>>>> Reported-by: Gu Zheng 
>>>
>>> I'm sorry to forget to mention that the original reporter is
>>> Yasuaki Ishimatsu . I continued
>>> the investigation and found more issues.
>>>
>>> We tested this patch on kernel 3.11-rc6, but it seems that the
>>> issue is still there. Detail info as following.
>>
>> Well, taking pm_mutex under acpi_scan_lock (trace #2) is a bad idea anyway,
>> because we'll need to take acpi_scan_lock during system suspend for PCI hot
>> remove to work and that's under pm_mutex.  So I wonder if we can simply
>> drop the system sleep locking from lock/unlock_memory_hotplug().  But that's
>> a side note, because dropping it won't help here.
>>
>> Now ->
>>
>>> ==
>>> [ INFO: possible circular locking dependency detected ]
>>> 3.11.0-rc6-lockdebug-refea+ #162 Tainted: GF
>>> ---
>>> kworker/0:2/754 is trying to acquire lock:
>>>  (s_active#73){.+}, at: [] sysfs_addrm_finish+0x3b/0x70
>&

Re: [PATCH] driver core / ACPI: Avoid device removal locking problems

2013-08-26 Thread Gu Zheng
Hi Rafael,

On 08/26/2013 11:02 PM, Rafael J. Wysocki wrote:

> On Monday, August 26, 2013 04:43:26 PM Rafael J. Wysocki wrote:
>> On Monday, August 26, 2013 02:42:09 PM Rafael J. Wysocki wrote:
>>> On Monday, August 26, 2013 11:13:13 AM Gu Zheng wrote:
>>>> Hi Rafael,
> 
> [...]
> 
>>
>> OK, so the patch below is quick and dirty and overkill, but it should make
>> the splat go away at least.
> 
> And if this patch does make the splat go away for you, please also test the
> appended one (Tejun, thanks for the hint!).

Yes, this one works too, and as expected, the ACPI part is still there.

Thanks,
Gu

==
[ INFO: possible circular locking dependency detected ]
3.11.0-rc6-fix-refeal-fix-01+ #171 Tainted: GF
---
kworker/0:1/96 is trying to acquire lock:
 (s_active#245){.+}, at: [] sysfs_addrm_finish+0x3b/0x70

but task is already holding lock:
 (device_hotplug_lock){+.+.+.}, at: [] lock_device_hotplug+0x17/0x20

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (device_hotplug_lock){+.+.+.}:
   [] validate_chain+0x70c/0x870
   [] __lock_acquire+0x36f/0x5f0
   [] lock_acquire+0xa0/0x130
   [] mutex_lock_nested+0x7b/0x3b0
   [] lock_device_hotplug+0x17/0x20
   [] acpi_scan_bus_device_check+0x33/0x10f
   [] acpi_scan_device_check+0x13/0x15
   [] acpi_os_execute_deferred+0x27/0x34
   [] process_one_work+0x1e8/0x560
   [] worker_thread+0x120/0x3a0
   [] kthread+0xee/0x100
   [] ret_from_fork+0x7c/0xb0

-> #1 (acpi_scan_lock){+.+.+.}:
   [] validate_chain+0x70c/0x870
   [] __lock_acquire+0x36f/0x5f0
   [] lock_acquire+0xa0/0x130
   [] mutex_lock_nested+0x7b/0x3b0
   [] acpi_eject_store+0x88/0x170
   [] dev_attr_store+0x20/0x30
   [] sysfs_write_file+0xe6/0x170
   [] vfs_write+0xc8/0x170
   [] SyS_write+0x62/0xb0

Re: [PATCH] f2fs: fix omitting to update inode page

2013-08-26 Thread Gu Zheng
On 08/26/2013 08:28 PM, Jaegeuk Kim wrote:

> The f2fs_set_link updates its parent inode number, so we should sync this to
> the inode block.
> Otherwise, the data can be lost after sudden-power-off.
> 
> Signed-off-by: Jaegeuk Kim 
> ---
>  fs/f2fs/namei.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c
> index 4e47518..9e90d31 100644
> --- a/fs/f2fs/namei.c
> +++ b/fs/f2fs/namei.c
> @@ -447,6 +447,7 @@ static int f2fs_rename(struct inode *old_dir, struct dentry *old_dentry,
>   else
>   release_orphan_inode(sbi);
>  
> + update_inode_page(old_inode):

':' --> ';'

>   update_inode_page(new_inode);
>   } else {
>   err = f2fs_add_link(new_dentry, old_inode);



Re: [PATCH] driver core / ACPI: Avoid device removal locking problems

2013-08-27 Thread Gu Zheng
Hi Rafael,

On 08/26/2013 11:02 PM, Rafael J. Wysocki wrote:

> On Monday, August 26, 2013 04:43:26 PM Rafael J. Wysocki wrote:
>> On Monday, August 26, 2013 02:42:09 PM Rafael J. Wysocki wrote:
>>> On Monday, August 26, 2013 11:13:13 AM Gu Zheng wrote:
>>>> Hi Rafael,
> 
> [...]
> 
>>
>> OK, so the patch below is quick and dirty and overkill, but it should make
>> the splat go away at least.
> 
> And if this patch does make the splat go away for you, please also test the
> appended one (Tejun, thanks for the hint!).
> 
> I'll address the ACPI part differently later.

What about changing device_hotplug_lock and acpi_scan_lock to rwsems, as in
the attached patch? (In a preliminary test, it also makes the splat go away.) :)

Regards,
Gu

> 
[...]


From f1682ceaef4105f75f4d6a0bb8e77c8a5dde365b Mon Sep 17 00:00:00 2001
From: Gu Zheng 
Date: Tue, 27 Aug 2013 17:59:55 +0900
Subject: [PATCH] acpi: fix removal lock dep


Signed-off-by: Gu Zheng 
---
 drivers/acpi/scan.c|   43 ++-
 drivers/acpi/sysfs.c   |7 +--
 drivers/base/core.c|   45 -
 drivers/base/memory.c  |5 +++--
 include/linux/device.h |8 ++--
 5 files changed, 72 insertions(+), 36 deletions(-)

diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
index 8a46c92..bb41760 100644
--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -36,7 +36,7 @@ bool acpi_force_hot_remove;
 static const char *dummy_hid = "device";
 
 static LIST_HEAD(acpi_bus_id_list);
-static DEFINE_MUTEX(acpi_scan_lock);
+static DECLARE_RWSEM(acpi_scan_rwsem);
 static LIST_HEAD(acpi_scan_handlers_list);
 DEFINE_MUTEX(acpi_device_lock);
 LIST_HEAD(acpi_wakeup_device_list);
@@ -49,13 +49,13 @@ struct acpi_device_bus_id{
 
 void acpi_scan_lock_acquire(void)
 {
-   mutex_lock(&acpi_scan_lock);
+   down_write(&acpi_scan_rwsem);
 }
 EXPORT_SYMBOL_GPL(acpi_scan_lock_acquire);
 
 void acpi_scan_lock_release(void)
 {
-   mutex_unlock(&acpi_scan_lock);
+   up_write(&acpi_scan_rwsem);
 }
 EXPORT_SYMBOL_GPL(acpi_scan_lock_release);
 
@@ -207,7 +207,7 @@ static int acpi_scan_hot_remove(struct acpi_device *device)
return -EINVAL;
}
 
-   lock_device_hotplug();
+   device_hotplug_begin();
 
/*
 * Carry out two passes here and ignore errors in the first pass,
@@ -240,7 +240,7 @@ static int acpi_scan_hot_remove(struct acpi_device *device)
acpi_bus_online_companions, NULL,
NULL, NULL);
 
-   unlock_device_hotplug();
+   device_hotplug_end();
 
put_device(&device->dev);
return -EBUSY;
@@ -252,7 +252,7 @@ static int acpi_scan_hot_remove(struct acpi_device *device)
 
acpi_bus_trim(device);
 
-   unlock_device_hotplug();
+   device_hotplug_end();
 
/* Device node has been unregistered. */
put_device(&device->dev);
@@ -308,7 +308,7 @@ static void acpi_bus_device_eject(void *context)
struct acpi_scan_handler *handler;
u32 ost_code = ACPI_OST_SC_NON_SPECIFIC_FAILURE;
 
-   mutex_lock(&acpi_scan_lock);
+   acpi_scan_lock_acquire();
 
acpi_bus_get_device(handle, &device);
if (!device)
@@ -334,7 +334,7 @@ static void acpi_bus_device_eject(void *context)
}
 
  out:
-   mutex_unlock(&acpi_scan_lock);
+   acpi_scan_lock_release();
return;
 
  err_out:
@@ -349,8 +349,8 @@ static void acpi_scan_bus_device_check(acpi_handle handle, u32 ost_source)
u32 ost_code = ACPI_OST_SC_NON_SPECIFIC_FAILURE;
int error;
 
-   mutex_lock(&acpi_scan_lock);
-   lock_device_hotplug();
+   acpi_scan_lock_acquire();
+   device_hotplug_begin();
 
if (ost_source != ACPI_NOTIFY_BUS_CHECK) {
acpi_bus_get_device(handle, &device);
@@ -376,9 +376,9 @@ static void acpi_scan_bus_device_check(acpi_handle handle, u32 ost_source)
kobject_uevent(&device->dev.kobj, KOBJ_ONLINE);
 
  out:
-   unlock_device_hotplug();
+   device_hotplug_end();
acpi_evaluate_hotplug_ost(handle, ost_source, ost_code, NULL);
-   mutex_unlock(&acpi_scan_lock);
+   acpi_scan_lock_release();
 }
 
 static void acpi_scan_bus_check(void *context)
@@ -469,15 +469,14 @@ void acpi_bus_hot_remove_device(void *context)
acpi_handle handle = device->handle;
int error;
 
-   mutex_l

Re: [PATCH] driver core / ACPI: Avoid device removal locking problems

2013-08-27 Thread Gu Zheng
Hi Toshi,

On 08/28/2013 05:38 AM, Toshi Kani wrote:

> On Tue, 2013-08-27 at 17:21 +0800, Gu Zheng wrote:
>> Hi Rafael,
>>
>> On 08/26/2013 11:02 PM, Rafael J. Wysocki wrote:
>>
>>> On Monday, August 26, 2013 04:43:26 PM Rafael J. Wysocki wrote:
>>>> On Monday, August 26, 2013 02:42:09 PM Rafael J. Wysocki wrote:
>>>>> On Monday, August 26, 2013 11:13:13 AM Gu Zheng wrote:
>>>>>> Hi Rafael,
>>>
>>> [...]
>>>
>>>>
>>>> OK, so the patch below is quick and dirty and overkill, but it should make
>>>> the splat go away at least.
>>>
>>> And if this patch does make the splat go away for you, please also test the
>>> appended one (Tejun, thanks for the hint!).
>>>
>>> I'll address the ACPI part differently later.
>>
>> What about changing device_hotplug_lock and acpi_scan_lock to rwsem? like the
>> attached one(With a preliminary test, it also can make the splat go away).:)
> 
> I am curious how msleep(10) & restart_syscall() work in the change
> below.  Doesn't the msleep() keep s_active held for a longer time, which
> can make the thread holding device_hotplug_lock wait for it on deletion?

Yes, but it can avoid busy waiting. 

> Also, does restart_syscall() release s_active and reopen this file
> again?

Sure, it just sets the TIF_SIGPENDING flag and returns an -ERESTARTNOINTR error;
s_active/file will be released/closed in the failure path. Then, when do_signal()
catches the -ERESTARTNOINTR, it will change the regs to restart the syscall.

Thanks,
Gu

> 
> @@ -408,9 +408,13 @@ static ssize_t show_online(struct device *dev,
> struct device_attribute *attr,
>  {
> bool val;
> 
> -   lock_device_hotplug();
> +   if (!read_lock_device_hotplug()) {
> +   msleep(10);
> +   return restart_syscall();
> +   }
> +
> 
> Thanks,
> -Toshi
> 


[PATCH RESEND] drivers/base/memory.c: introduce help macro to_memory_block

2013-08-27 Thread Gu Zheng
Introduce a helper macro, to_memory_block(), to hide the device --> memory_block
conversion. This is just a cleanup.
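
For readers unfamiliar with the idiom, the following userspace sketch shows what
such a helper does. The container_of() here is a simplified stand-in for the
kernel macro (no type checking), and the two structs are cut-down placeholders,
not the real definitions. Note one deliberate difference from the patch: the
macro parameter is called d rather than dev, so that the member name dev is not
itself macro-substituted when the argument is something like &mb.dev.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Cut-down placeholders for the real kernel structures. */
struct device {
	int id;
};

struct memory_block {
	unsigned long start_section_nr;
	struct device dev;	/* embedded device, as in drivers/base/memory.c */
};

/* Parameter named "d" so the member name "dev" below stays untouched. */
#define to_memory_block(d) container_of(d, struct memory_block, dev)

/* Example user: recover the enclosing memory_block from its device. */
static unsigned long start_section_of(struct device *dev)
{
	struct memory_block *mem;

	assert(dev != NULL);
	mem = to_memory_block(dev);
	return mem->start_section_nr;
}
```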

Reviewed-by: Yasuaki Ishimatsu  
Signed-off-by: Gu Zheng 
---
 drivers/base/memory.c |   29 -
 1 files changed, 12 insertions(+), 17 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 2a38cd2..69e09a1 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -29,6 +29,8 @@ static DEFINE_MUTEX(mem_sysfs_mutex);
 
 #define MEMORY_CLASS_NAME  "memory"
 
+#define to_memory_block(dev) container_of(dev, struct memory_block, dev)
+
 static int sections_per_block;
 
 static inline int base_memory_block_id(int section_nr)
@@ -76,7 +78,7 @@ EXPORT_SYMBOL(unregister_memory_isolate_notifier);
 
 static void memory_block_release(struct device *dev)
 {
-   struct memory_block *mem = container_of(dev, struct memory_block, dev);
+   struct memory_block *mem = to_memory_block(dev);
 
kfree(mem);
 }
@@ -109,8 +111,7 @@ static unsigned long get_memory_block_size(void)
 static ssize_t show_mem_start_phys_index(struct device *dev,
struct device_attribute *attr, char *buf)
 {
-   struct memory_block *mem =
-   container_of(dev, struct memory_block, dev);
+   struct memory_block *mem = to_memory_block(dev);
unsigned long phys_index;
 
phys_index = mem->start_section_nr / sections_per_block;
@@ -120,8 +121,7 @@ static ssize_t show_mem_start_phys_index(struct device *dev,
 static ssize_t show_mem_end_phys_index(struct device *dev,
struct device_attribute *attr, char *buf)
 {
-   struct memory_block *mem =
-   container_of(dev, struct memory_block, dev);
+   struct memory_block *mem = to_memory_block(dev);
unsigned long phys_index;
 
phys_index = mem->end_section_nr / sections_per_block;
@@ -136,8 +136,7 @@ static ssize_t show_mem_removable(struct device *dev,
 {
unsigned long i, pfn;
int ret = 1;
-   struct memory_block *mem =
-   container_of(dev, struct memory_block, dev);
+   struct memory_block *mem = to_memory_block(dev);
 
for (i = 0; i < sections_per_block; i++) {
pfn = section_nr_to_pfn(mem->start_section_nr + i);
@@ -153,8 +152,7 @@ static ssize_t show_mem_removable(struct device *dev,
 static ssize_t show_mem_state(struct device *dev,
struct device_attribute *attr, char *buf)
 {
-   struct memory_block *mem =
-   container_of(dev, struct memory_block, dev);
+   struct memory_block *mem = to_memory_block(dev);
ssize_t len = 0;
 
/*
@@ -282,7 +280,7 @@ static int memory_block_change_state(struct memory_block *mem,
 /* The device lock serializes operations on memory_subsys_[online|offline] */
 static int memory_subsys_online(struct device *dev)
 {
-   struct memory_block *mem = container_of(dev, struct memory_block, dev);
+   struct memory_block *mem = to_memory_block(dev);
int ret;
 
if (mem->state == MEM_ONLINE)
@@ -306,7 +304,7 @@ static int memory_subsys_online(struct device *dev)
 
 static int memory_subsys_offline(struct device *dev)
 {
-   struct memory_block *mem = container_of(dev, struct memory_block, dev);
+   struct memory_block *mem = to_memory_block(dev);
 
if (mem->state == MEM_OFFLINE)
return 0;
@@ -318,11 +316,9 @@ static ssize_t
 store_mem_state(struct device *dev,
struct device_attribute *attr, const char *buf, size_t count)
 {
-   struct memory_block *mem;
+   struct memory_block *mem = to_memory_block(dev);
int ret, online_type;
 
-   mem = container_of(dev, struct memory_block, dev);
-
lock_device_hotplug();
 
if (!strncmp(buf, "online_kernel", min_t(int, count, 13)))
@@ -376,8 +372,7 @@ store_mem_state(struct device *dev,
 static ssize_t show_phys_device(struct device *dev,
struct device_attribute *attr, char *buf)
 {
-   struct memory_block *mem =
-   container_of(dev, struct memory_block, dev);
+   struct memory_block *mem = to_memory_block(dev);
return sprintf(buf, "%d\n", mem->phys_device);
 }
 
@@ -509,7 +504,7 @@ struct memory_block *find_memory_block_hinted(struct mem_section *section,
put_device(&hint->dev);
if (!dev)
return NULL;
-   return container_of(dev, struct memory_block, dev);
+   return to_memory_block(dev);
 }
 
 /*
-- 
1.7.7


Re: [PATCH] driver core / ACPI: Avoid device removal locking problems

2013-08-28 Thread Gu Zheng
Hi Rafael,

On 08/28/2013 05:45 AM, Rafael J. Wysocki wrote:

> On Tuesday, August 27, 2013 02:36:44 PM Tejun Heo wrote:
>> Hello,
>>
[...]
> 
> I've thought about that a bit over the last several hours and I'm still
> thinking that that patch is a bit overkill, because it will trigger the
> restart_syscall() for all cases when device_hotplug_lock is locked, even
> if they can't lead to any deadlocks.  The only deadlockish situation is
> when device *removal* is in progress when store_online(), for example,
> is called.
> 
> So to address that particular situation without adding too much overhead for
> other cases, I've come up with the appended patch (untested for now).
> 
> This is how it is supposed to work.
> 
> There are three "lock levels" for device hotplug, "normal", "remove"
> and "weak".  The difference is related to how __lock_device_hotplug()
> works.  Namely, if device hotplug is currently locked, that function
> will either block or return false, depending on the "current lock
> level" and its argument (the "new lock level").  The rules here are
> that false is returned immediately if the "current lock level" is
> "remove" and the "new lock level" is "weak".  The function blocks
> for all other combinations of the two.
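
The rules in the paragraph above can be written down as a small decision
function. This is only a restatement of the quoted description, not code from
the patch: the enum names and hotplug_lock_outcome() are invented for this
sketch, and "blocks" is reported as a result instead of actually sleeping.

```c
#include <assert.h>

/* Lock levels as described in the text; names are hypothetical. */
enum lock_level { LEVEL_NONE, LEVEL_NORMAL, LEVEL_REMOVE, LEVEL_WEAK };

/* What a lock attempt would do under the described rules. */
enum lock_outcome { OUTCOME_ACQUIRED, OUTCOME_BLOCKS, OUTCOME_FAILS };

static enum lock_outcome hotplug_lock_outcome(enum lock_level current_level,
					      enum lock_level new_level)
{
	/* A caller always asks for a real level, never "none". */
	assert(new_level != LEVEL_NONE);

	/* Nobody holds the lock: the caller gets it. */
	if (current_level == LEVEL_NONE)
		return OUTCOME_ACQUIRED;

	/* Removal in progress and a "weak" request: fail immediately. */
	if (current_level == LEVEL_REMOVE && new_level == LEVEL_WEAK)
		return OUTCOME_FAILS;

	/* Every other combination waits for the holder. */
	return OUTCOME_BLOCKS;
}
```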
> 
> There are two functions supposed to use device hotplug "lock levels"
> other than "normal": store_online() and acpi_scan_hot_remove().
> Everybody else is supposed to use "normal" (well, there are more
> potential users of "weak" in drivers/base/memory.c).
> 
> acpi_scan_hot_remove() uses the "remove" lock level to indicate
> that it is going to remove devices while holding device hotplug
> locked.  In turn, store_online() uses the "weak" lock level so
> that it doesn't block when devices are being removed with device
> hotplug locked, because that may lead to a deadlock.
> 
> show_online() actually doesn't need to lock device hotplug, but
> it is useful to serialize it with respect to device_offline()
> and device_online() (in case user space attempts to run them
> concurrently).

Yeah. I tested this one on the latest kernel tree, and it does make the splat go
away. Looking forward to the ACPI part. :)

Regards,
Gu

> 
> ---
>  drivers/acpi/scan.c|4 +-
>  drivers/base/core.c|   72 ++---
>  include/linux/device.h |   25 -
>  3 files changed, 83 insertions(+), 18 deletions(-)
> 
> Index: linux-pm/drivers/base/core.c
> ===
> --- linux-pm.orig/drivers/base/core.c
> +++ linux-pm/drivers/base/core.c
> @@ -49,6 +49,55 @@ static struct kobject *dev_kobj;
>  struct kobject *sysfs_dev_char_kobj;
>  struct kobject *sysfs_dev_block_kobj;
>  
> +static struct {
> + struct task_struct *holder;
> + enum dev_hotplug_lock_type type;
> + struct mutex lock; /* Synchronizes accesses to holder and type */
> + wait_queue_head_t wait_queue;
> +} device_hotplug = {
> + .holder = NULL,
> + .type = DEV_HOTPLUG_LOCK_NONE,
> + .lock = __MUTEX_INITIALIZER(device_hotplug.lock),
> + .wait_queue = __WAIT_QUEUE_HEAD_INITIALIZER(device_hotplug.wait_queue),
> +};
> +
> +bool __lock_device_hotplug(enum dev_hotplug_lock_type type)
> +{
> + DEFINE_WAIT(wait);
> + bool ret = true;
> +
> + mutex_lock(&device_hotplug.lock);
> + for (;;) {
> + prepare_to_wait(&device_hotplug.wait_queue, &wait,
> + TASK_UNINTERRUPTIBLE);
> + if (!device_hotplug.holder) {
> + device_hotplug.holder = current;
> + device_hotplug.type = type;
> + break;
> + } else if (type == DEV_HOTPLUG_LOCK_WEAK
> + && device_hotplug.type == DEV_HOTPLUG_LOCK_REMOVE) {
> + ret = false;
> + break;
> + }
> + mutex_unlock(&device_hotplug.lock);
> + schedule();
> + mutex_lock(&device_hotplug.lock);
> + }
> + finish_wait(&device_hotplug.wait_queue, &wait);
> + mutex_unlock(&device_hotplug.lock);
> + return ret;
> +}
> +
> +void unlock_device_hotplug(void)
> +{
> + mutex_lock(&device_hotplug.lock);
> + BUG_ON(device_hotplug.holder != current);
> + device_hotplug.holder = NULL;
> + device_hotplug.type = DEV_HOTPLUG_LOCK_NONE;
> + wake_up(&device_hotplug.wait_queue);
> + mutex_unlock(&device_hotplug.lock);
> +}
> +
>  #ifdef CONFIG_BLOCK
>  static inline int device_is_not_partition(struct device *dev)
>  {
> @@ -408,9 +457,10 @@ static ssize_t show_online(struct device
>  {
>   bool val;
>  
> - lock_device_hotplug();
> + /* Serialize against device_online() and device_offline(). */
> + device_lock(dev);
>   val = !dev->offline;
> - unlock_device_hotplug();
> + device_unlock(dev);
>   return sprintf(buf, "%u\n", val);
>  }
>  
> @@ -424,7 +474,11 @@ static ssize_t store_online(struct d

Re: [PATCH 0/2] driver core / ACPI: Avoid device removal locking problems

2013-08-28 Thread Gu Zheng
Hi Rafael,

On 08/28/2013 09:45 PM, Rafael J. Wysocki wrote:

> Hi All,
> 
> The following two patches are to address possible deadlocks related to
> device removal and device sysfs attribute access.  In short, some device
> sysfs attribute callbacks need to acquire locks that are also held around
> device removal and that may lead to deadlocks with s_active draining in
> sysfs_deactivate().
> 
> [1/2] Avoid possible device removal deadlocks related to device_hotplug_lock.
> [2/2] Rework the handling of containers by ACPI hotplug (which makes a
>   possible device removal deadlock related to acpi_scan_lock go away).
> 

This version is concise and friendly. It works well on the latest kernel tree,
and all the splats go away. :)

Best regards,
Gu

> On top of linux-next, for v3.12.

> 
> Thanks,
> Rafael
> 
> 


Re: [PATCH 2/2] ACPI / hotplug: Remove containers synchronously

2013-08-28 Thread Gu Zheng
On 08/28/2013 09:51 PM, Rafael J. Wysocki wrote:

> From: Rafael J. Wysocki 
> 
> The current protocol for handling hot remove of containers is very
> fragile and causes acpi_eject_store() to acquire acpi_scan_lock
> which may deadlock with the removal of the device that it is called
> for (the reason is that device sysfs attributes cannot be removed
> while their callbacks are being executed and ACPI device objects
> are removed under acpi_scan_lock).
> 
> The problem is related to the fact that containers are handled by
> acpi_bus_device_eject() in a special way, which is to emit an
> offline uevent instead of just removing the container.  Then, user
> space is expected to handle that uevent and use the container's
> "eject" attribute to actually remove it.  That is fragile, because
> user space may fail to complete the ejection (for example, by not
> using the container's "eject" attribute at all) leaving the BIOS
> kind of in a limbo.  Moreover, if the eject event is not signaled
> for a container itself, but for its parent device object (or
> generally, for an ancestor above it in the ACPI namespace), the
> container will be removed straight away without doing that whole
> dance.
> 
> For this reason, modify acpi_bus_device_eject() to remove containers
> synchronously like any other objects (user space will get its uevent
> anyway in case it does some other things in response to it) and
> remove the eject_pending ACPI device flag that is not used any more.
> This way acpi_eject_store() doesn't have a reason to acquire
> acpi_scan_lock any more and one possible deadlock scenario goes
> away (plus the code is simplified a bit).
> 
> Signed-off-by: Rafael J. Wysocki 
> Reported-by: Gu Zheng 


Tested-by: Gu Zheng 

> ---
>  drivers/acpi/scan.c |   49 ++--
>  include/acpi/acpi_bus.h |3 --
>  2 files changed, 16 insertions(+), 36 deletions(-)
> 
> Index: linux-pm/drivers/acpi/scan.c
> ===
> --- linux-pm.orig/drivers/acpi/scan.c
> +++ linux-pm/drivers/acpi/scan.c
> @@ -287,6 +287,7 @@ static void acpi_bus_device_eject(void *
>   struct acpi_device *device = NULL;
>   struct acpi_scan_handler *handler;
>   u32 ost_code = ACPI_OST_SC_NON_SPECIFIC_FAILURE;
> + int error;
>  
>   mutex_lock(&acpi_scan_lock);
>  
> @@ -301,17 +302,13 @@ static void acpi_bus_device_eject(void *
>   }
>   acpi_evaluate_hotplug_ost(handle, ACPI_NOTIFY_EJECT_REQUEST,
> ACPI_OST_SC_EJECT_IN_PROGRESS, NULL);
> - if (handler->hotplug.mode == AHM_CONTAINER) {
> - device->flags.eject_pending = true;
> + if (handler->hotplug.mode == AHM_CONTAINER)
>   kobject_uevent(&device->dev.kobj, KOBJ_OFFLINE);
> - } else {
> - int error;
>  
> - get_device(&device->dev);
> - error = acpi_scan_hot_remove(device);
> - if (error)
> - goto err_out;
> - }
> + get_device(&device->dev);
> + error = acpi_scan_hot_remove(device);
> + if (error)
> + goto err_out;
>  
>   out:
>   mutex_unlock(&acpi_scan_lock);
> @@ -496,7 +493,6 @@ acpi_eject_store(struct device *d, struc
>   struct acpi_eject_event *ej_event;
>   acpi_object_type not_used;
>   acpi_status status;
> - u32 ost_source;
>   int ret;
>  
>   if (!count || buf[0] != '1')
> @@ -510,43 +506,28 @@ acpi_eject_store(struct device *d, struc
>   if (ACPI_FAILURE(status) || !acpi_device->flags.ejectable)
>   return -ENODEV;
>  
> - mutex_lock(&acpi_scan_lock);
> -
> - if (acpi_device->flags.eject_pending) {
> - /* ACPI eject notification event. */
> - ost_source = ACPI_NOTIFY_EJECT_REQUEST;
> - acpi_device->flags.eject_pending = 0;
> - } else {
> - /* Eject initiated by user space. */
> - ost_source = ACPI_OST_EC_OSPM_EJECT;
> - }
>   ej_event = kmalloc(sizeof(*ej_event), GFP_KERNEL);
>   if (!ej_event) {
>   ret = -ENOMEM;
>   goto err_out;
>   }
> - acpi_evaluate_hotplug_ost(acpi_device->handle, ost_source,
> + acpi_evaluate_hotplug_ost(acpi_device->handle, ACPI_OST_EC_OSPM_EJECT,
> ACPI_OST_SC_EJECT_IN_PROGRESS, NULL);
>   ej_event->device = acpi_device;
> - ej_event->event = ost_source;
> + ej_event->event = ACPI_OST_EC_OSPM_EJECT;
>   get_device(&acp

Re: [PATCH 1/2] driver core / ACPI: Avoid device hot remove locking issues

2013-08-28 Thread Gu Zheng
On 08/28/2013 09:48 PM, Rafael J. Wysocki wrote:

> From: Rafael J. Wysocki 
> 
> device_hotplug_lock is held around the acpi_bus_trim() call in
> acpi_scan_hot_remove() which generally removes devices (it removes
> ACPI device objects at least, but it may also remove "physical"
> device objects through .detach() callbacks of ACPI scan handlers).
> Thus, potentially, device sysfs attributes are removed under that
> lock and to remove those attributes it is necessary to hold the
> s_active references of their directory entries for writing.
> 
> On the other hand, the execution of a .show() or .store() callback
> from a sysfs attribute is carried out with that attribute's s_active
> reference held for reading.  Consequently, if any device sysfs
> attribute that may be removed from within acpi_scan_hot_remove()
> through acpi_bus_trim() has a .store() or .show() callback which
> acquires device_hotplug_lock, the execution of that callback may
> deadlock with the removal of the attribute.  [Unfortunately, the
> "online" device attribute of CPUs and memory blocks is one of them.]
> 
> To avoid such deadlocks, make all of the sysfs attribute callbacks
> that need to lock device hotplug, for example store_online(), use
> a special function, lock_device_hotplug_sysfs(), to lock device
> hotplug and return the result of that function immediately if it is
> not zero.  This will cause the s_active reference of the directory
> entry in question to be released and the syscall to be restarted
> if device_hotplug_lock cannot be acquired.
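
The "try the lock, otherwise back out and retry" pattern described above can be
modelled in plain C. This is a single-threaded toy, not kernel code: toy_mutex,
toy_lock_sysfs(), toy_store_online() and TOY_RESTART are hypothetical names, and
TOY_RESTART merely stands in for whatever restart_syscall() returns. The sleep
before the retry is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the value returned by restart_syscall(). */
#define TOY_RESTART (-1)

struct toy_mutex {
	bool locked;
};

static bool toy_trylock(struct toy_mutex *m)
{
	if (m->locked)
		return false;
	m->locked = true;
	return true;
}

static void toy_unlock(struct toy_mutex *m)
{
	assert(m->locked);
	m->locked = false;
}

/* Same shape as lock_device_hotplug_sysfs(): 0 on success, else "retry". */
static int toy_lock_sysfs(struct toy_mutex *m)
{
	if (toy_trylock(m))
		return 0;
	return TOY_RESTART;
}

/* A store-style callback that never blocks on the hotplug lock. */
static int toy_store_online(struct toy_mutex *m, int *state, int val)
{
	int ret = toy_lock_sysfs(m);

	if (ret)
		return ret;	/* caller drops its reference and retries */

	*state = val;
	toy_unlock(m);
	return 0;
}
```

The point of the design is that the callback returns instead of sleeping on the
lock, so the attribute's active reference is dropped before the next attempt.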
> 
> [show_online() actually doesn't need to lock device hotplug, but
> it is useful to serialize it with respect to device_offline() and
> device_online() for the same device (in case user space attempts to
> run them concurrently) which can be done with the help of
> device_lock().]
> 
> Signed-off-by: Rafael J. Wysocki 
> Reported-by: Yasuaki Ishimatsu 
> Reported-by: Gu Zheng 


Tested-by: Gu Zheng 

> ---
>  drivers/acpi/sysfs.c   |5 -
>  drivers/base/core.c|   43 ---
>  drivers/base/memory.c  |4 +++-
>  include/linux/device.h |1 +
>  4 files changed, 36 insertions(+), 17 deletions(-)
> 
> Index: linux-pm/drivers/base/core.c
> ===
> --- linux-pm.orig/drivers/base/core.c
> +++ linux-pm/drivers/base/core.c
> @@ -49,6 +49,28 @@ static struct kobject *dev_kobj;
>  struct kobject *sysfs_dev_char_kobj;
>  struct kobject *sysfs_dev_block_kobj;
>  
> +static DEFINE_MUTEX(device_hotplug_lock);
> +
> +void lock_device_hotplug(void)
> +{
> + mutex_lock(&device_hotplug_lock);
> +}
> +
> +void unlock_device_hotplug(void)
> +{
> + mutex_unlock(&device_hotplug_lock);
> +}
> +
> +int lock_device_hotplug_sysfs(void)
> +{
> + if (mutex_trylock(&device_hotplug_lock))
> + return 0;
> +
> + /* Avoid busy looping (5 ms of sleep should do). */
> + msleep(5);
> + return restart_syscall();
> +}
> +
>  #ifdef CONFIG_BLOCK
>  static inline int device_is_not_partition(struct device *dev)
>  {
> @@ -408,9 +430,9 @@ static ssize_t show_online(struct device
>  {
>   bool val;
>  
> - lock_device_hotplug();
> + device_lock(dev);
>   val = !dev->offline;
> - unlock_device_hotplug();
> + device_unlock(dev);
>   return sprintf(buf, "%u\n", val);
>  }
>  
> @@ -424,7 +446,10 @@ static ssize_t store_online(struct devic
>   if (ret < 0)
>   return ret;
>  
> - lock_device_hotplug();
> + ret = lock_device_hotplug_sysfs();
> + if (ret)
> + return ret;
> +
>   ret = val ? device_online(dev) : device_offline(dev);
>   unlock_device_hotplug();
>   return ret < 0 ? ret : count;
> @@ -1479,18 +1504,6 @@ EXPORT_SYMBOL_GPL(put_device);
>  EXPORT_SYMBOL_GPL(device_create_file);
>  EXPORT_SYMBOL_GPL(device_remove_file);
>  
> -static DEFINE_MUTEX(device_hotplug_lock);
> -
> -void lock_device_hotplug(void)
> -{
> - mutex_lock(&device_hotplug_lock);
> -}
> -
> -void unlock_device_hotplug(void)
> -{
> - mutex_unlock(&device_hotplug_lock);
> -}
> -
>  static int device_check_offline(struct device *dev, void *not_used)
>  {
>   int ret;
> Index: linux-pm/drivers/base/memory.c
> ===
> --- linux-pm.orig/drivers/base/memory.c
> +++ linux-pm/drivers/base/memory.c
> @@ -351,7 +351,9 @@ store_mem_state(struct device *dev,
>  
>   mem = container_of(dev, struct memory_block, dev);
> 

[PATCH] ocfs2/refcounttree: add the missing NULL check of the return value of find_or_create_page()

2013-07-08 Thread Gu Zheng
Add the missing NULL check of the return value of find_or_create_page() in
function ocfs2_duplicate_clusters_by_page().
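
The shape of the fix, checking an allocation inside a loop and leaving the loop
with -ENOMEM, can be sketched in userspace as follows. duplicate_pages() and the
page_alloc_fn hook are hypothetical and exist only so that the failure path can
be exercised; they are not ocfs2 functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical allocator hook so that a failing allocator can be injected. */
typedef void *(*page_alloc_fn)(size_t size);

/*
 * Process npages pages. Stop at the first failed allocation and report
 * -ENOMEM. *done counts the pages that were handled successfully.
 */
static int duplicate_pages(size_t npages, size_t page_size,
			   page_alloc_fn alloc, size_t *done)
{
	int ret = 0;
	size_t i;

	assert(alloc != NULL && done != NULL);
	*done = 0;

	for (i = 0; i < npages; i++) {
		void *page = alloc(page_size);

		if (!page) {		/* the check the patch adds */
			ret = -ENOMEM;
			break;
		}

		memset(page, 0, page_size);	/* placeholder for real work */
		free(page);
		(*done)++;
	}

	return ret;
}

/* Allocator that always fails, for exercising the error path. */
static void *failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}
```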

Signed-off-by: Gu Zheng 
---
 fs/ocfs2/refcounttree.c |6 +-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 998b17e..456d0e4 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -2965,7 +2965,11 @@ int ocfs2_duplicate_clusters_by_page(handle_t *handle,
to = map_end & (PAGE_CACHE_SIZE - 1);

page = find_or_create_page(mapping, page_index, GFP_NOFS);
-
+   if (!page) {
+   ret = -ENOMEM;
+   mlog_errno(ret);
+   break;
+   }
/*
 * In case PAGE_CACHE_SIZE <= CLUSTER_SIZE, This page
 * can't be dirtied before we CoW it out.
-- 
1.7.7


[PATCH 2/2] fs/aio: Add support to aio ring pages migration

2013-07-08 Thread Gu Zheng
An aio job pins its ring pages, which causes memory migration to fail. To fix
this problem, use an anon inode to manage the aio ring pages and set up the
migratepage callback in the anon inode's address space, so that during memory
migration the aio ring pages can be moved safely to another memory node.
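
As a rough userspace analogy of what a migration callback has to do for the
ring, the sketch below copies the old page into the new one and repoints the
context's slot. toy_ctx and toy_migrate_page() are invented for illustration;
reference counting is omitted, and the locking that the patch does with
ctx->completion_lock is only marked by a comment.

```c
#include <assert.h>
#include <string.h>

#define TOY_PAGE_SIZE 64
#define TOY_NR_PAGES 2

/* Toy context holding pointers to its ring pages. */
struct toy_ctx {
	unsigned char *ring_pages[TOY_NR_PAGES];
};

/*
 * Move slot idx from old_page to new_page. Returns 0 on success, or -1 if
 * idx is out of range or old_page is not the page currently in that slot.
 */
static int toy_migrate_page(struct toy_ctx *ctx, unsigned int idx,
			    unsigned char *new_page, unsigned char *old_page)
{
	assert(ctx != NULL && new_page != NULL && old_page != NULL);

	if (idx >= TOY_NR_PAGES || ctx->ring_pages[idx] != old_page)
		return -1;

	/* In the patch, copy and pointer update happen under a spinlock. */
	memcpy(new_page, old_page, TOY_PAGE_SIZE);
	ctx->ring_pages[idx] = new_page;
	return 0;
}
```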

Signed-off-by: Gu Zheng 
Signed-off-by: Benjamin LaHaise 
---
 fs/aio.c|  120 ++
 include/linux/migrate.h |3 +
 mm/migrate.c|2 +-
 3 files changed, 113 insertions(+), 12 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 9b5ca11..d10f956 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -35,6 +35,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 

 #include 
 #include 
@@ -110,6 +113,7 @@ struct kioctx {
} cacheline_aligned_in_smp;

struct page *internal_pages[AIO_RING_PAGES];
+   struct file *aio_ring_file;
 };

 /*-- sysctl variables*/
@@ -138,15 +142,78 @@ __initcall(aio_setup);

 static void aio_free_ring(struct kioctx *ctx)
 {
-   long i;
+   int i;
+   struct file *aio_ring_file = ctx->aio_ring_file;

-   for (i = 0; i < ctx->nr_pages; i++)
+   for (i = 0; i < ctx->nr_pages; i++) {
+   pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i,
+   page_count(ctx->ring_pages[i]));
put_page(ctx->ring_pages[i]);
+   }

if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages)
kfree(ctx->ring_pages);
+
+   if (aio_ring_file) {
+   truncate_setsize(aio_ring_file->f_inode, 0);
+   pr_debug("pid(%d) i_nlink=%u d_count=%d d_unhashed=%d i_count=%d\n",
+   current->pid, aio_ring_file->f_inode->i_nlink,
+   aio_ring_file->f_path.dentry->d_count,
+   d_unhashed(aio_ring_file->f_path.dentry),
+   atomic_read(&aio_ring_file->f_inode->i_count));
+   fput(aio_ring_file);
+   ctx->aio_ring_file = NULL;
+   }
+}
+
+static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
+{
+   vma->vm_ops = &generic_file_vm_ops;
+   return 0;
+}
+
+static const struct file_operations aio_ring_fops = {
+   .mmap = aio_ring_mmap,
+};
+
+static int aio_set_page_dirty(struct page *page)
+{
+   return 0;
 }

+static int aio_migratepage(struct address_space *mapping, struct page *new,
+   struct page *old, enum migrate_mode mode)
+{
+   struct kioctx *ctx = mapping->private_data;
+   unsigned long flags;
+   unsigned idx = old->index;
+   int rc;
+
+   /*Writeback must be complete*/
+   BUG_ON(PageWriteback(old));
+   put_page(old);
+
+   rc = migrate_page_move_mapping(mapping, new, old, NULL, mode);
+   if (rc != MIGRATEPAGE_SUCCESS) {
+   get_page(old);
+   return rc;
+   }
+
+   get_page(new);
+
+   spin_lock_irqsave(&ctx->completion_lock, flags);
+   migrate_page_copy(new, old);
+   ctx->ring_pages[idx] = new;
+   spin_unlock_irqrestore(&ctx->completion_lock, flags);
+
+   return rc;
+}
+
+static const struct address_space_operations aio_ctx_aops = {
+   .set_page_dirty = aio_set_page_dirty,
+   .migratepage= aio_migratepage,
+};
+
 static int aio_setup_ring(struct kioctx *ctx)
 {
struct aio_ring *ring;
@@ -154,20 +221,45 @@ static int aio_setup_ring(struct kioctx *ctx)
struct mm_struct *mm = current->mm;
unsigned long size, populate;
int nr_pages;
+   int i;
+   struct file *file;

/* Compensate for the ring buffer's head/tail overlap entry */
nr_events += 2; /* 1 is required, 2 for good luck */

size = sizeof(struct aio_ring);
size += sizeof(struct io_event) * nr_events;
-   nr_pages = (size + PAGE_SIZE-1) >> PAGE_SHIFT;

+   nr_pages = (size + PAGE_SIZE-1) >> PAGE_SHIFT;
if (nr_pages < 0)
return -EINVAL;

-   nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring)) / sizeof(struct io_event);
+   file = anon_inode_getfile_private("[aio]", &aio_ring_fops, ctx, O_RDWR);
+   if (IS_ERR(file)) {
+   ctx->aio_ring_file = NULL;
+   return -EAGAIN;
+   }
+
+   file->f_inode->i_mapping->a_ops = &aio_ctx_aops;
+   file->f_inode->i_mapping->private_data = ctx;
+   file->f_inode->i_size = PAGE_SIZE * (loff_t)nr_pages;
+
+   for (i = 0; i < nr_pages; i++) {
+   struct page *page;
+   page = find_or_create_page(file->f_inode->i_mapping,
+  i, GFP_HIGHUSER | __GFP_Z

[PATCH 0/2] Add support to aio ring pages migration

2013-07-08 Thread Gu Zheng
Currently the aio ring pages are obtained with get_user_pages() from the movable
zone. As discussed in the thread https://lkml.org/lkml/2012/11/29/69, this makes
it easy to pin user pages for a long time, which is fatal for the memory
hotplug/remove framework.

As Mel Gorman suggested, "Implement a callback for migration to unpin pages,
barrier operations until migration completes and pin the new pfns" can solve
this issue. The best place to hold the callbacks is the address space
operations, which can be found via page->mapping.

But the current aio ring pages are anonymous pages and have no
address_space_operations, so we use an anon inode file as the aio ring file to
manage the aio ring pages. That lets us implement the callback and register it
as page->mapping->a_ops->migratepage.

But there is a problem: all files created by anon_inode_getfile() share the same
inode, so multiple aio contexts would share the same aio ring pages, which would
lead to io event chaos. To solve this issue, we introduce a new function,
anon_inode_getfile_private(), which is similar to anon_inode_getfile() except
that each new file has its own anon inode.
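
The difference between a shared and a private backing inode can be illustrated
with a toy model. toy_inode, toy_file, toy_getfile() and toy_getfile_private()
are made-up names and only mimic the sharing behaviour described above, not the
real VFS objects.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_inode {
	int data;		/* stands in for the page cache contents */
};

struct toy_file {
	struct toy_inode *inode;
	int owns_inode;		/* non-zero if the inode is private */
};

/* One inode shared by every file from toy_getfile(). */
static struct toy_inode shared_inode;

/* Like anon_inode_getfile() in spirit: all files share one inode. */
static struct toy_file *toy_getfile(void)
{
	struct toy_file *f = malloc(sizeof(*f));

	if (!f)
		return NULL;
	f->inode = &shared_inode;
	f->owns_inode = 0;
	return f;
}

/* Like the proposed private variant: each file gets its own inode. */
static struct toy_file *toy_getfile_private(void)
{
	struct toy_file *f = malloc(sizeof(*f));

	if (!f)
		return NULL;
	f->inode = calloc(1, sizeof(*f->inode));
	if (!f->inode) {
		free(f);
		return NULL;
	}
	f->owns_inode = 1;
	return f;
}

static void toy_putfile(struct toy_file *f)
{
	if (!f)
		return;
	assert(f->inode != NULL);
	if (f->owns_inode)
		free(f->inode);
	free(f);
}
```

With the shared variant, data written through one file is visible through every
other file, which is exactly the cross-context mixing the cover letter warns of.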

This work is based on Benjamin's patch,
http://www.spinics.net/lists/linux-fsdevel/msg66014.html

Gu Zheng (2):
  fs/anon_inode: Introduce a new lib function
anon_inode_getfile_private()
  fs/aio: Add support to aio ring pages migration

 fs/aio.c|  120 +++
 fs/anon_inodes.c|   66 +++
 include/linux/anon_inodes.h |3 +
 include/linux/migrate.h |3 +
 mm/migrate.c|2 +-
 5 files changed, 182 insertions(+), 12 deletions(-)

-- 
1.7.7


[PATCH 1/2] fs/anon_inode: Introduce a new lib function, anon_inode_getfile_private()

2013-07-08 Thread Gu Zheng
Introduce a new library function, anon_inode_getfile_private(). It creates a new
file instance by hooking it up to an anonymous inode and a dentry that describes
the "class" of the file. It is similar to anon_inode_getfile(), but each file
gets its own inode. Anyone else who wants to create a private anon file can also
benefit from this change.

Signed-off-by: Gu Zheng 
Signed-off-by: Benjamin LaHaise 
---
 fs/anon_inodes.c|   66 +++
 include/linux/anon_inodes.h |3 ++
 2 files changed, 69 insertions(+), 0 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 47a65df..85c9618 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -109,6 +109,72 @@ static struct file_system_type anon_inode_fs_type = {
 };

 /**
+ * anon_inode_getfile_private - creates a new file instance by hooking it up to an
+ *  anonymous inode, and a dentry that describe the "class"
+ *  of the file
+ *
+ * @name:[in]name of the "class" of the new file
+ * @fops:[in]file operations for the new file
+ * @priv:[in]private data for the new file (will be file's private_data)
+ * @flags:   [in]flags
+ *
+ *
+ * Similar to anon_inode_getfile, but each file holds a single inode.
+ *
+ */
+struct file *anon_inode_getfile_private(const char *name,
+   const struct file_operations *fops,
+   void *priv, int flags)
+{
+   struct qstr this;
+   struct path path;
+   struct file *file;
+   struct inode *inode;
+
+   if (fops->owner && !try_module_get(fops->owner))
+   return ERR_PTR(-ENOENT);
+
+   inode = anon_inode_mkinode(anon_inode_mnt->mnt_sb);
+   if (IS_ERR(inode)) {
+   file = ERR_PTR(-ENOMEM);
+   goto err_module;
+   }
+
+   /*
+* Link the inode to a directory entry by creating a unique name
+* using the inode sequence number.
+*/
+   file = ERR_PTR(-ENOMEM);
+   this.name = name;
+   this.len = strlen(name);
+   this.hash = 0;
+   path.dentry = d_alloc_pseudo(anon_inode_mnt->mnt_sb, &this);
+   if (!path.dentry)
+   goto err_module;
+
+   path.mnt = mntget(anon_inode_mnt);
+
+   d_instantiate(path.dentry, inode);
+
+   file = alloc_file(&path, OPEN_FMODE(flags), fops);
+   if (IS_ERR(file))
+   goto err_dput;
+
+   file->f_mapping = inode->i_mapping;
+   file->f_flags = flags & (O_ACCMODE | O_NONBLOCK);
+   file->private_data = priv;
+
+   return file;
+
+err_dput:
+   path_put(&path);
+err_module:
+   module_put(fops->owner);
+   return file;
+}
+EXPORT_SYMBOL_GPL(anon_inode_getfile_private);
+
+/**
  * anon_inode_getfile - creates a new file instance by hooking it up to an
  *  anonymous inode, and a dentry that describe the "class"
  *  of the file
diff --git a/include/linux/anon_inodes.h b/include/linux/anon_inodes.h
index 8013a45..cf573c2 100644
--- a/include/linux/anon_inodes.h
+++ b/include/linux/anon_inodes.h
@@ -13,6 +13,9 @@ struct file_operations;
 struct file *anon_inode_getfile(const char *name,
const struct file_operations *fops,
void *priv, int flags);
+struct file *anon_inode_getfile_private(const char *name,
+   const struct file_operations *fops,
+   void *priv, int flags);
 int anon_inode_getfd(const char *name, const struct file_operations *fops,
 void *priv, int flags);

-- 
1.7.7
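
The error paths in anon_inode_getfile_private() above use the usual kernel goto-unwind idiom, where each label releases what was acquired before the failing step. A minimal user-space sketch of that idiom follows. The names (`fake_inode`, `fake_dentry`, `fake_file`, `make_file`, `tracked_alloc`) are hypothetical stand-ins and not kernel APIs, and unlike the patch the sketch also releases the inode when the dentry allocation fails.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel objects; not real kernel types. */
struct fake_inode  { int unused; };
struct fake_dentry { struct fake_inode *inode; };
struct fake_file   { struct fake_dentry *dentry; void *private_data; };

/* Counts live allocations so a caller can check that nothing leaked. */
static int live_objects;

static void *tracked_alloc(size_t size, int fail)
{
	void *p = fail ? NULL : calloc(1, size);

	if (p)
		live_objects++;
	return p;
}

static void tracked_free(void *p)
{
	if (p) {
		live_objects--;
		free(p);
	}
}

/*
 * Build inode -> dentry -> file. fail_step (1..3) forces that allocation
 * to fail; 0 means every step succeeds. Returns NULL on failure.
 */
static struct fake_file *make_file(void *priv, int fail_step)
{
	struct fake_inode *inode;
	struct fake_dentry *dentry;
	struct fake_file *file;

	inode = tracked_alloc(sizeof(*inode), fail_step == 1);
	if (!inode)
		goto out;

	dentry = tracked_alloc(sizeof(*dentry), fail_step == 2);
	if (!dentry)
		goto err_inode;
	dentry->inode = inode;

	file = tracked_alloc(sizeof(*file), fail_step == 3);
	if (!file)
		goto err_dentry;
	file->dentry = dentry;
	file->private_data = priv;
	return file;

	/* Unwind in reverse order of acquisition. */
err_dentry:
	tracked_free(dentry);
err_inode:
	tracked_free(inode);
out:
	return NULL;
}

/* Release a fully constructed file and everything hanging off it. */
static void destroy_file(struct fake_file *file)
{
	if (!file)
		return;
	tracked_free(file->dentry->inode);
	tracked_free(file->dentry);
	tracked_free(file);
}
```

Forcing a failure at any step should leave `live_objects` at zero, which is an easy way to check that no label was skipped.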

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] f2fs: Modify do_garbage_collect() to collect all the segs in once

2013-07-08 Thread Gu Zheng
Currently do_garbage_collect() collects one segment per call. If a section
contains more than one segment, we need to call do_garbage_collect() many times
to collect all of them (currently done in a for loop). We can move the loop into
do_garbage_collect() so that all the segments of a section are collected in one
call.

Signed-off-by: Gu Zheng 
---
 fs/f2fs/gc.c |   59 -
 1 files changed, 33 insertions(+), 26 deletions(-)

diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
index 35f9b1a..ccde9f7 100644
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -634,42 +634,50 @@ static int __get_victim(struct f2fs_sb_info *sbi, unsigned int *victim,
return ret;
 }

-static void do_garbage_collect(struct f2fs_sb_info *sbi, unsigned int segno,
-   struct list_head *ilist, int gc_type)
+static void do_garbage_collect(struct f2fs_sb_info *sbi,
+   unsigned int start_segno, struct list_head *ilist, int gc_type)
 {
-   struct page *sum_page;
-   struct f2fs_summary_block *sum;
-   struct blk_plug plug;
+   unsigned int segno = start_segno;

-   /* read segment summary of victim */
-   sum_page = get_sum_page(sbi, segno);
-   if (IS_ERR(sum_page))
-   return;
+   for (; sbi->segs_per_sec--; segno++) {
+   struct page *sum_page;
+   struct f2fs_summary_block *sum;
+   struct blk_plug plug;

-   blk_start_plug(&plug);
+   /* read segment summary of victim */
+   sum_page = get_sum_page(sbi, segno);
+   if (IS_ERR(sum_page))
+   continue;

-   sum = page_address(sum_page);
+   blk_start_plug(&plug);

-   switch (GET_SUM_TYPE((&sum->footer))) {
-   case SUM_TYPE_NODE:
-   gc_node_segment(sbi, sum->entries, segno, gc_type);
-   break;
-   case SUM_TYPE_DATA:
-   gc_data_segment(sbi, sum->entries, ilist, segno, gc_type);
-   break;
-   }
-   blk_finish_plug(&plug);
+   sum = page_address(sum_page);

-   stat_inc_seg_count(sbi, GET_SUM_TYPE((&sum->footer)));
-   stat_inc_call_count(sbi->stat_info);
+   switch (GET_SUM_TYPE((&sum->footer))) {
+   case SUM_TYPE_NODE:
+   gc_node_segment(sbi, sum->entries,
+   segno, gc_type);
+   break;
+   case SUM_TYPE_DATA:
+   gc_data_segment(sbi, sum->entries, ilist,
+   segno, gc_type);
+   break;
+   default:
+   BUG();
+   }
+   blk_finish_plug(&plug);
+
+   stat_inc_seg_count(sbi, GET_SUM_TYPE((&sum->footer)));
+   stat_inc_call_count(sbi->stat_info);

-   f2fs_put_page(sum_page, 1);
+   f2fs_put_page(sum_page, 1);
+   }
 }

 int f2fs_gc(struct f2fs_sb_info *sbi)
 {
struct list_head ilist;
-   unsigned int segno, i;
+   unsigned int segno;
int gc_type = BG_GC;
int nfree = 0;
int ret = -1;
@@ -688,8 +696,7 @@ gc_more:
goto stop;
ret = 0;

-   for (i = 0; i < sbi->segs_per_sec; i++)
-   do_garbage_collect(sbi, segno + i, &ilist, gc_type);
+   do_garbage_collect(sbi, segno, &ilist, gc_type);

if (gc_type == FG_GC) {
sbi->cur_victim_sec = NULL_SEGNO;
-- 
1.7.7
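
One detail of the loop form used above: `for (; sbi->segs_per_sec--; segno++)` decrements the shared `segs_per_sec` field itself. A minimal sketch of the same per-section walk driven by a local index instead is shown here. `struct sb_info`, `collect_segment()` and `collect_section()` are hypothetical stand-ins for illustration, not the f2fs structures or functions.

```c
#include <assert.h>

/* Hypothetical stand-in for the superblock info; not the f2fs structure. */
struct sb_info {
	unsigned int segs_per_sec;	/* segments in one section */
	unsigned int collected;		/* how many segments were visited */
	unsigned int last_segno;	/* last segment number visited */
};

/* Placeholder for the per-segment work (summary read, node/data GC). */
static void collect_segment(struct sb_info *sbi, unsigned int segno)
{
	sbi->collected++;
	sbi->last_segno = segno;
}

/* Visit every segment of the section that starts at start_segno. */
static void collect_section(struct sb_info *sbi, unsigned int start_segno)
{
	unsigned int i;

	/* A local counter leaves sbi->segs_per_sec untouched for later calls. */
	for (i = 0; i < sbi->segs_per_sec; i++)
		collect_segment(sbi, start_segno + i);
}
```

With this form a second call to `collect_section()` still sees the original `segs_per_sec` value.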



Re: [PATCH 1/2] PCI: introduce PCIe Device Serial NUmber Capability support

2013-07-09 Thread Gu Zheng
On 07/09/2013 03:55 PM, Yijing Wang wrote:

> Introduce support for the PCIe Device Serial Number extended capability,
> so we can use the unique device serial number to identify
> the physical device. If a PCIe device is removed during system
> suspend and a new device of the same type is inserted, there is
> no good way to tell them apart after resume; the Device Serial
> Number is a good choice when the device supports it.

Nice idea!

Regards,
Gu

> 
> Signed-off-by: Yijing Wang 


Reviewed-by: Gu Zheng 

> Cc: "Rafael J. Wysocki" 
> Cc: Oliver Neukum 
> Cc: Paul Bolle 
> Cc: Gu Zheng 
> Cc: linux-...@vger.kernel.org
> ---
>  drivers/pci/pci.c   |   33 +
>  drivers/pci/pci.h   |1 +
>  drivers/pci/probe.c |3 +++
>  include/linux/pci.h |4 
>  4 files changed, 41 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e37fea6..d08df2b 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -2048,6 +2048,39 @@ void pci_free_cap_save_buffers(struct pci_dev *dev)
>  }
>  
>  /**
> + * pci_get_dsn - get device serial number
> + * @dev: the PCI device
> + * @sn: saved device serial number
> + */
> +void pci_get_dsn(struct pci_dev *dev, u64 *sn)
> +{
> + int pos;
> + u32 lo, hi;
> +
> + if (!pci_is_pcie(dev))
> + goto out;
> +
> + pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DSN);
> + if (!pos)
> + goto out;
> +
> + pci_read_config_dword(dev, pos + 4, &lo);
> + pci_read_config_dword(dev, pos + 8, &hi);
> + *sn = ((u64)hi << 32) | lo;
> + return;
> +
> +out:
> + *sn = 0;
> + return;
> +}
> +EXPORT_SYMBOL(pci_get_dsn);
> +
> +void pci_dsn_init(struct pci_dev *dev)
> +{
> + return pci_get_dsn(dev, &dev->sn);
> +}
> +
> +/**
>   * pci_configure_ari - enable or disable ARI forwarding
>   * @dev: the PCI device
>   *
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 68678ed..f626006 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -202,6 +202,7 @@ int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
>   struct resource *res, unsigned int reg);
>  int pci_resource_bar(struct pci_dev *dev, int resno, enum pci_bar_type *type);
>  void pci_configure_ari(struct pci_dev *dev);
> +void pci_dsn_init(struct pci_dev *dev);
>  
>  /**
>   * pci_ari_enabled - query ARI forwarding status
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 46ada5c..d4c6e7e 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1322,6 +1322,9 @@ static void pci_init_capabilities(struct pci_dev *dev)
>   /* Power Management */
>   pci_pm_init(dev);
>  
> + /* Device Serial Number */
> + pci_dsn_init(dev);
> +
>   /* Vital Product Data */
>   pci_vpd_pci22_init(dev);
>  
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 0fd1f15..59cd205 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -342,6 +342,7 @@ struct pci_dev {
>   struct list_head msi_list;
>   struct kset *msi_kset;
>  #endif
> + u64 sn; /* device serieal number, 0 if not support */
>   struct pci_vpd *vpd;
>  #ifdef CONFIG_PCI_ATS
>   union {
> @@ -995,6 +996,9 @@ ssize_t pci_read_vpd(struct pci_dev *dev, loff_t pos, size_t count, void *buf);
>  ssize_t pci_write_vpd(struct pci_dev *dev, loff_t pos, size_t count, const void *buf);
>  int pci_vpd_truncate(struct pci_dev *dev, size_t size);
>  
> +/* Device Serial Number */
> +void pci_get_dsn(struct pci_dev *dev, u64 *sn);
> +
>  /* Helper functions for low-level code (drivers/pci/setup-[bus,res].c) */
>  resource_size_t pcibios_retrieve_fw_addr(struct pci_dev *dev, int idx);
>  void pci_bus_assign_resources(const struct pci_bus *bus);
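
For reference, pci_get_dsn() in the quoted patch assembles the serial number from two 32-bit config reads, the lower dword at `pos + 4` and the upper dword at `pos + 8`. A minimal user-space sketch of that combination follows; `dsn_from_dwords()` is a hypothetical helper used only for illustration, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Combine the two dwords of a Device Serial Number into one 64-bit value,
 * with the upper dword in bits 63:32 and the lower dword in bits 31:0.
 */
static uint64_t dsn_from_dwords(uint32_t lo, uint32_t hi)
{
	/* Widen before shifting; shifting a 32-bit value by 32 is undefined. */
	return ((uint64_t)hi << 32) | lo;
}
```

The cast to a 64-bit type before the shift mirrors the `(u64)hi << 32` in the patch.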




Re: [PATCH] ocfs2/refcounttree: add the missing NULL check of the return value of find_or_create_page()

2013-07-09 Thread Gu Zheng
On 07/10/2013 06:11 AM, Joel Becker wrote:

> On Mon, Jul 08, 2013 at 03:52:53PM +0800, Gu Zheng wrote:
>> Add the missing NULL check of the return value of find_or_create_page() in
>> function ocfs2_duplicate_clusters_by_page().
>>
>> Signed-off-by: Gu Zheng 
>> ---
>>  fs/ocfs2/refcounttree.c |6 +-
>>  1 files changed, 5 insertions(+), 1 deletions(-)
>>
>> diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
>> index 998b17e..456d0e4 100644
>> --- a/fs/ocfs2/refcounttree.c
>> +++ b/fs/ocfs2/refcounttree.c
>> @@ -2965,7 +2965,11 @@ int ocfs2_duplicate_clusters_by_page(handle_t *handle,
>>  to = map_end & (PAGE_CACHE_SIZE - 1);
>>
>>  page = find_or_create_page(mapping, page_index, GFP_NOFS);
>> -
>> +if (!page) {
>> +ret = -ENOMEM;
>> +mlog_errno(ret);
>> +break;
>> +}
>>  /*
>>   * In case PAGE_CACHE_SIZE <= CLUSTER_SIZE, This page
>>   * can't be dirtied before we CoW it out.
> 
> Put a blank line between the closing brace and the comment.  Otherwise,

Got it.:)

> Acked-by: Joel Becker 

Thanks~

Regards,
Gu

> 
> Joel
>> -- 
>> 1.7.7
>>
> 




Re: [PATCH 1/2] fs/anon_inode: Introduce a new lib function, anon_inode_getfile_private()

2013-07-11 Thread Gu Zheng
ping...


On 07/08/2013 06:38 PM, Gu Zheng wrote:

> Introduce a new library function, anon_inode_getfile_private(). Like
> anon_inode_getfile(), it creates a new file instance by hooking it up to an
> anonymous inode and a dentry that describes the "class" of the file, but each
> file gets its own inode instead of sharing a single one. Anyone who needs a
> private anon file can benefit from this change.
> 
> Signed-off-by: Gu Zheng 
> Signed-off-by: Benjamin LaHaise 
> ---
>  fs/anon_inodes.c            |   66 +++
>  include/linux/anon_inodes.h |    3 ++
>  2 files changed, 69 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
> index 47a65df..85c9618 100644
> --- a/fs/anon_inodes.c
> +++ b/fs/anon_inodes.c
> @@ -109,6 +109,72 @@ static struct file_system_type anon_inode_fs_type = {
>  };
> 
>  /**
> + * anon_inode_getfile_private - creates a new file instance by hooking it up to an
> + *  anonymous inode, and a dentry that describe the "class"
> + *  of the file
> + *
> + * @name:    [in]    name of the "class" of the new file
> + * @fops:    [in]    file operations for the new file
> + * @priv:    [in]    private data for the new file (will be file's private_data)
> + * @flags:   [in]    flags
> + *
> + *
> + * Similar to anon_inode_getfile, but each file holds a single inode.
> + *
> + */
> +struct file *anon_inode_getfile_private(const char *name,
> + const struct file_operations *fops,
> + void *priv, int flags)
> +{
> + struct qstr this;
> + struct path path;
> + struct file *file;
> + struct inode *inode;
> +
> + if (fops->owner && !try_module_get(fops->owner))
> + return ERR_PTR(-ENOENT);
> +
> + inode = anon_inode_mkinode(anon_inode_mnt->mnt_sb);
> + if (IS_ERR(inode)) {
> + file = ERR_PTR(-ENOMEM);
> + goto err_module;
> + }
> +
> + /*
> +  * Link the inode to a directory entry by creating a unique name
> +  * using the inode sequence number.
> +  */
> + file = ERR_PTR(-ENOMEM);
> + this.name = name;
> + this.len = strlen(name);
> + this.hash = 0;
> + path.dentry = d_alloc_pseudo(anon_inode_mnt->mnt_sb, &this);
> + if (!path.dentry)
> + goto err_module;
> +
> + path.mnt = mntget(anon_inode_mnt);
> +
> + d_instantiate(path.dentry, inode);
> +
> + file = alloc_file(&path, OPEN_FMODE(flags), fops);
> + if (IS_ERR(file))
> + goto err_dput;
> +
> + file->f_mapping = inode->i_mapping;
> + file->f_flags = flags & (O_ACCMODE | O_NONBLOCK);
> + file->private_data = priv;
> +
> + return file;
> +
> +err_dput:
> + path_put(&path);
> +err_module:
> + module_put(fops->owner);
> + return file;
> +}
> +EXPORT_SYMBOL_GPL(anon_inode_getfile_private);
> +
> +/**
>   * anon_inode_getfile - creates a new file instance by hooking it up to an
>   *  anonymous inode, and a dentry that describe the "class"
>   *  of the file
> diff --git a/include/linux/anon_inodes.h b/include/linux/anon_inodes.h
> index 8013a45..cf573c2 100644
> --- a/include/linux/anon_inodes.h
> +++ b/include/linux/anon_inodes.h
> @@ -13,6 +13,9 @@ struct file_operations;
>  struct file *anon_inode_getfile(const char *name,
>   const struct file_operations *fops,
>   void *priv, int flags);
> +struct file *anon_inode_getfile_private(const char *name,
> + const struct file_operations *fops,
> + void *priv, int flags);
>  int anon_inode_getfd(const char *name, const struct file_operations *fops,
>void *priv, int flags);
> 




Re: [PATCH 0/2] Add support to aio ring pages migration

2013-07-11 Thread Gu Zheng
ping...

On 07/08/2013 06:38 PM, Gu Zheng wrote:

> Currently aio ring pages use get_user_pages() to allocate pages from the
> movable zone. As discussed in the thread https://lkml.org/lkml/2012/11/29/69,
> this makes it easy to pin user pages for a long time, which is fatal for the
> memory hotplug/remove framework.
> 
> As Mel Gorman suggested, "Implement a callback for migration to unpin pages,
> barrier operations until migration completes and pin the new pfns" can solve
> this issue. The best place to hold the callbacks is the address space
> operations, which can be found via page->mapping.
> 
> But the current aio ring pages are anonymous pages and have no
> address_space_operations, so we use an anon inode file as the aio ring file to
> manage the aio ring pages. That lets us implement the callback and register it
> as page->mapping->a_ops->migratepage.
> 
> But there is a problem: all files created by anon_inode_getfile() share the
> same inode, so multiple aio contexts would share the same aio ring pages, which
> leads to I/O event chaos. To solve this issue, we introduce a new function,
> anon_inode_getfile_private(), which is similar to anon_inode_getfile() except
> that each new file has its own anon inode.
> 
> This work is based on Benjamin's patch,
> http://www.spinics.net/lists/linux-fsdevel/msg66014.html
> 
> Gu Zheng (2):
>   fs/anon_inode: Introduce a new lib function
> anon_inode_getfile_private()
>   fs/aio: Add support to aio ring pages migration
> 
>  fs/aio.c                    |  120 +++
>  fs/anon_inodes.c            |   66 +++
>  include/linux/anon_inodes.h |    3 +
>  include/linux/migrate.h     |    3 +
>  mm/migrate.c                |    2 +-
>  5 files changed, 182 insertions(+), 12 deletions(-)
> 
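
The shared-inode problem described in the cover letter can be modelled in user space. This is only a sketch under stated assumptions: `model_inode`, `model_file`, `getfile_shared()` and `getfile_private()` are hypothetical names that imitate the difference between anon_inode_getfile() and anon_inode_getfile_private(); they are not kernel code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical user-space stand-ins; not the kernel's inode or file. */
struct model_inode { int ring_pages; };
struct model_file  { struct model_inode *inode; };

/* One inode shared by every file that comes from getfile_shared(). */
static struct model_inode shared_inode;

/* Models anon_inode_getfile(): all files point at the same inode. */
static struct model_file *getfile_shared(void)
{
	struct model_file *f = calloc(1, sizeof(*f));

	if (f)
		f->inode = &shared_inode;
	return f;
}

/* Models anon_inode_getfile_private(): each file gets a fresh inode. */
static struct model_file *getfile_private(void)
{
	struct model_file *f = calloc(1, sizeof(*f));

	if (!f)
		return NULL;
	f->inode = calloc(1, sizeof(*f->inode));
	if (!f->inode) {
		free(f);
		return NULL;
	}
	return f;
}

/* Free a file; the shared inode is static and must not be freed. */
static void putfile(struct model_file *f)
{
	if (!f)
		return;
	if (f->inode != &shared_inode)
		free(f->inode);
	free(f);
}
```

In this model, state written through one shared file is visible through every other shared file, while private files stay isolated, which is the property the aio ring needs.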




Re: [PATCH 2/2] fs/aio: Add support to aio ring pages migration

2013-07-11 Thread Gu Zheng
ping...

On 07/08/2013 06:38 PM, Gu Zheng wrote:

> The aio job pins the ring pages, which causes memory migration to fail. To
> fix this problem we use an anon inode to manage the aio ring pages and set up
> the migratepage callback in the anon inode's address space, so that during
> memory migration the aio ring pages can be moved to another memory node
> safely.
> 
> Signed-off-by: Gu Zheng 
> Signed-off-by: Benjamin LaHaise 
> ---
>  fs/aio.c                |  120 ++
>  include/linux/migrate.h |    3 +
>  mm/migrate.c            |    2 +-
>  3 files changed, 113 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index 9b5ca11..d10f956 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -35,6 +35,9 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> +#include 
> 
>  #include 
>  #include 
> @@ -110,6 +113,7 @@ struct kioctx {
>   } cacheline_aligned_in_smp;
> 
>   struct page *internal_pages[AIO_RING_PAGES];
> + struct file *aio_ring_file;
>  };
> 
>  /*-- sysctl variables*/
> @@ -138,15 +142,78 @@ __initcall(aio_setup);
> 
>  static void aio_free_ring(struct kioctx *ctx)
>  {
> - long i;
> + int i;
> + struct file *aio_ring_file = ctx->aio_ring_file;
> 
> - for (i = 0; i < ctx->nr_pages; i++)
> + for (i = 0; i < ctx->nr_pages; i++) {
> + pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i,
> + page_count(ctx->ring_pages[i]));
>   put_page(ctx->ring_pages[i]);
> + }
> 
>   if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages)
>   kfree(ctx->ring_pages);
> +
> + if (aio_ring_file) {
> + truncate_setsize(aio_ring_file->f_inode, 0);
> + pr_debug("pid(%d) i_nlink=%u d_count=%d d_unhashed=%d i_count=%d\n",
> + current->pid, aio_ring_file->f_inode->i_nlink,
> + aio_ring_file->f_path.dentry->d_count,
> + d_unhashed(aio_ring_file->f_path.dentry),
> + atomic_read(&aio_ring_file->f_inode->i_count));
> + fput(aio_ring_file);
> + ctx->aio_ring_file = NULL;
> + }
> +}
> +
> +static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + vma->vm_ops = &generic_file_vm_ops;
> + return 0;
> +}
> +
> +static const struct file_operations aio_ring_fops = {
> + .mmap = aio_ring_mmap,
> +};
> +
> +static int aio_set_page_dirty(struct page *page)
> +{
> + return 0;
>  }
> 
> +static int aio_migratepage(struct address_space *mapping, struct page *new,
> + struct page *old, enum migrate_mode mode)
> +{
> + struct kioctx *ctx = mapping->private_data;
> + unsigned long flags;
> + unsigned idx = old->index;
> + int rc;
> +
> + /*Writeback must be complete*/
> + BUG_ON(PageWriteback(old));
> + put_page(old);
> +
> + rc = migrate_page_move_mapping(mapping, new, old, NULL, mode);
> + if (rc != MIGRATEPAGE_SUCCESS) {
> + get_page(old);
> + return rc;
> + }
> +
> + get_page(new);
> +
> + spin_lock_irqsave(&ctx->completion_lock, flags);
> + migrate_page_copy(new, old);
> + ctx->ring_pages[idx] = new;
> + spin_unlock_irqrestore(&ctx->completion_lock, flags);
> +
> + return rc;
> +}
> +
> +static const struct address_space_operations aio_ctx_aops = {
> + .set_page_dirty = aio_set_page_dirty,
> + .migratepage= aio_migratepage,
> +};
> +
>  static int aio_setup_ring(struct kioctx *ctx)
>  {
>   struct aio_ring *ring;
> @@ -154,20 +221,45 @@ static int aio_setup_ring(struct kioctx *ctx)
>   struct mm_struct *mm = current->mm;
>   unsigned long size, populate;
>   int nr_pages;
> + int i;
> + struct file *file;
> 
>   /* Compensate for the ring buffer's head/tail overlap entry */
>   nr_events += 2; /* 1 is required, 2 for good luck */
> 
>   size = sizeof(struct aio_ring);
>   size += sizeof(struct io_event) * nr_events;
> - nr_pages = (size + PAGE_SIZE-1) >> PAGE_SHIFT;
> 
> + nr_pages = (size + PAGE_SIZE-1) >> PAGE_SHIFT;
>   if (nr_pages < 0)
>   return -EINVAL;
> 
> - nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring)) / sizeof(struct io_event);
> +   

Re: [f2fs-dev][PATCH RESEND] f2fs: avoid allocating failure in bio_alloc

2013-09-15 Thread Gu Zheng
Hi Chao,

On 09/13/2013 09:27 PM, Chao Yu wrote:

> This patch adds the macro MAX_BIO_BLOCKS to limit the value of npages in
> f2fs_bio_alloc(). This avoids an allocation failure in bio_alloc() caused by
> npages being larger than UIO_MAXIOV.

As far as I know, bio_alloc() is backed by the *fs_bio_set* pool and is not
subject to the UIO_MAXIOV limit. Am I missing something?

Thanks,
Gu

> 
> Signed-off-by: Yu Chao 
>  ---
>  fs/f2fs/segment.c |4 +++-
>  fs/f2fs/segment.h |3 +++
>  2 files changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
> index 09af9c7..bd79bbe 100644
> --- a/fs/f2fs/segment.c
> +++ b/fs/f2fs/segment.c
> @@ -657,6 +657,7 @@ static void submit_write_page(struct f2fs_sb_info *sbi, struct page *page,
> block_t blk_addr, enum page_type type)
>  {
> struct block_device *bdev = sbi->sb->s_bdev;
> +   int bio_blocks;
>  
> verify_block_addr(sbi, blk_addr);
>  
> @@ -676,7 +677,8 @@ retry:
> goto retry;
> }
>  
> -   sbi->bio[type] = f2fs_bio_alloc(bdev, max_hw_blocks(sbi));
> +   bio_blocks = MAX_BIO_BLOCKS(max_hw_blocks(sbi));
> +   sbi->bio[type] = f2fs_bio_alloc(bdev, bio_blocks);
> sbi->bio[type]->bi_sector = SECTOR_FROM_BLOCK(sbi, blk_addr);
> sbi->bio[type]->bi_private = priv;
> /*
> diff --git a/fs/f2fs/segment.h b/fs/f2fs/segment.h
> index bdd10ea..6352af1 100644
> --- a/fs/f2fs/segment.h
> +++ b/fs/f2fs/segment.h
> @@ -9,6 +9,7 @@
>   * published by the Free Software Foundation.
>   */
>  #include 
> +#include 
>  
>  /* constant macro */
>  #define NULL_SEGNO ((unsigned int)(~0))
> @@ -90,6 +91,8 @@
> (blk_addr << ((sbi)->log_blocksize - F2FS_LOG_SECTOR_SIZE))
>  #define SECTOR_TO_BLOCK(sbi, sectors)  \
> (sectors >> ((sbi)->log_blocksize - F2FS_LOG_SECTOR_SIZE))
> +#define MAX_BIO_BLOCKS(max_hw_blocks)  \
> +   (min((int)max_hw_blocks, UIO_MAXIOV))
>  
>  /* during checkpoint, bio_private is used to synchronize the last bio */
>  struct bio_private {
> ---
> 
> 




Re: [f2fs-dev][PATCH RESEND] f2fs: avoid allocating failure in bio_alloc

2013-09-15 Thread Gu Zheng
Hi Chao,

On 09/16/2013 11:26 AM, Chao Yu wrote:

> Hi Gu
> 
>> -----Original Message-----
>> From: Gu Zheng [mailto:guz.f...@cn.fujitsu.com]
>> Sent: Monday, September 16, 2013 10:09 AM
>> To: Chao Yu
>> Cc: Kim Jaegeuk; linux-f2fs-de...@lists.sourceforge.net;
>> linux-fsde...@vger.kernel.org; linux-kernel@vger.kernel.org; 谭姝
>> Subject: Re: [f2fs-dev][PATCH RESEND] f2fs: avoid allocating failure in bio_alloc
>>
>> Hi Chao,
>>
>> On 09/13/2013 09:27 PM, Chao Yu wrote:
>>
>>> This patch adds the macro MAX_BIO_BLOCKS to limit the value of npages in
>>> f2fs_bio_alloc(). This avoids an allocation failure in bio_alloc() caused by
>>> npages being larger than UIO_MAXIOV.
>>
>> As far as I know, bio_alloc() is backed by the *fs_bio_set* pool and is not
>> subject to the UIO_MAXIOV limit. Am I missing something?
> 
> Here is the code in bio.c; fs_bio_set is passed as the actual argument for bs
> without being initialized.

fs_bio_set is initialized early, during bio subsystem init.

> So this function may return NULL.

That may be so, but maybe not through the path you mention below.

> ---
> bio.c
> struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> {
> ..
>   if (!bs) {
>   if (nr_iovecs > UIO_MAXIOV)
>   return NULL;
> ---
> I ran an abnormal test: I set max_sectors_kb in /sys/block/sdx/queue
> to 32767 for a disk formatted with f2fs,
> and I got a segfault in f2fs_bio_alloc after the image was mounted.
> Is there anything I missed?

Hmm, this change would also make bvec_alloc() fail. Did you add some traces
to debug this?

Regards,
Gu

> 
>>
>> Thanks,
>> Gu
>>
>>>
>>> Signed-off-by: Yu Chao 
>>>  ---
>>>  fs/f2fs/segment.c |4 +++-
>>>  fs/f2fs/segment.h |3 +++
>>>  2 files changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
>>> index 09af9c7..bd79bbe 100644
>>> --- a/fs/f2fs/segment.c
>>> +++ b/fs/f2fs/segment.c
>>> @@ -657,6 +657,7 @@ static void submit_write_page(struct f2fs_sb_info *sbi, struct page *page,
>>> block_t blk_addr, enum page_type type)
>>> {
>>> struct block_device *bdev = sbi->sb->s_bdev;
>>> +   int bio_blocks;
>>>
>>> verify_block_addr(sbi, blk_addr);
>>>
>>> @@ -676,7 +677,8 @@ retry:
>>> goto retry;
>>> }
>>>
>>> -   sbi->bio[type] = f2fs_bio_alloc(bdev, max_hw_blocks(sbi));
>>> +   bio_blocks = MAX_BIO_BLOCKS(max_hw_blocks(sbi));
>>> +   sbi->bio[type] = f2fs_bio_alloc(bdev, bio_blocks);
>>> sbi->bio[type]->bi_sector = SECTOR_FROM_BLOCK(sbi, blk_addr);
>>> sbi->bio[type]->bi_private = priv;
>>> /*
>>> diff --git a/fs/f2fs/segment.h b/fs/f2fs/segment.h
>>> index bdd10ea..6352af1 100644
>>> --- a/fs/f2fs/segment.h
>>> +++ b/fs/f2fs/segment.h
>>> @@ -9,6 +9,7 @@
>>>   * published by the Free Software Foundation.
>>>   */
>>>  #include 
>>> +#include 
>>>
>>>  /* constant macro */
>>>  #define NULL_SEGNO ((unsigned int)(~0))
>>> @@ -90,6 +91,8 @@
>>> (blk_addr << ((sbi)->log_blocksize - F2FS_LOG_SECTOR_SIZE))
>>>  #define SECTOR_TO_BLOCK(sbi, sectors)  \
>>> (sectors >> ((sbi)->log_blocksize - F2FS_LOG_SECTOR_SIZE))
>>> +#define MAX_BIO_BLOCKS(max_hw_blocks)  \
>>> +   (min((int)max_hw_blocks, UIO_MAXIOV))
>>>
>>>  /* during checkpoint, bio_private is used to synchronize the last bio */
>>>  struct bio_private {
>>> ---
>>>
>>>
> 
> 
> 
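
The MAX_BIO_BLOCKS macro under discussion is a plain clamp of the requested block count. A minimal user-space sketch follows, assuming the value 1024 that the Linux uapi headers give UIO_MAXIOV. `max_bio_blocks()` and `MODEL_UIO_MAXIOV` are illustrative names only and are not part of f2fs.

```c
#include <assert.h>

/* UIO_MAXIOV is 1024 in the Linux headers; mirrored here under another name. */
#define MODEL_UIO_MAXIOV 1024

/* Clamp a requested bio size (in blocks) to the limit, as MAX_BIO_BLOCKS does. */
static int max_bio_blocks(unsigned int max_hw_blocks)
{
	int blocks = (int)max_hw_blocks;

	return blocks < MODEL_UIO_MAXIOV ? blocks : MODEL_UIO_MAXIOV;
}
```

Whether the clamp belongs at UIO_MAXIOV or at whatever limit bvec_alloc() enforces is exactly the open question in this thread; the sketch only shows the arithmetic.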




[PATCH] fs/bio-integrity: remove duplicate code

2013-09-20 Thread Gu Zheng
Most of the code in bio_integrity_verify() and bio_integrity_generate()
is the same, so introduce a common function, bio_integrity_generate_verify(),
to remove the duplicate code.

Signed-off-by: Gu Zheng 
---
 fs/bio-integrity.c |   84 ++-
 1 files changed, 36 insertions(+), 48 deletions(-)

diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 6025084..f096aec 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -287,24 +287,25 @@ int bio_integrity_get_tag(struct bio *bio, void *tag_buf, unsigned int len)
 EXPORT_SYMBOL(bio_integrity_get_tag);
 
 /**
- * bio_integrity_generate - Generate integrity metadata for a bio
- * @bio:   bio to generate integrity metadata for
- *
- * Description: Generates integrity metadata for a bio by calling the
- * block device's generation callback function.  The bio must have a
- * bip attached with enough room to accommodate the generated
- * integrity metadata.
+ * bio_integrity_generate_verify - Generate/verify integrity metadata for a bio
+ * @bio:   bio to generate/verify integrity metadata for
+ * @operate:   operate number, 1 for generate, 0 for verify
  */
-static void bio_integrity_generate(struct bio *bio)
+static int bio_integrity_generate_verify(struct bio *bio, int operate)
 {
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
struct blk_integrity_exchg bix;
struct bio_vec *bv;
-   sector_t sector = bio->bi_sector;
-   unsigned int i, sectors, total;
+   sector_t sector;
+   unsigned int i, sectors, total, ret;
void *prot_buf = bio->bi_integrity->bip_buf;
 
-   total = 0;
+   if (operate)
+   sector = bio->bi_sector;
+   else
+   sector = bio->bi_integrity->bip_sector;
+
+   ret = total = 0;
bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
bix.sector_size = bi->sector_size;
 
@@ -314,8 +315,13 @@ static void bio_integrity_generate(struct bio *bio)
bix.data_size = bv->bv_len;
bix.prot_buf = prot_buf;
bix.sector = sector;
-
-   bi->generate_fn(&bix);
+   if (operate) {
+   bi->generate_fn(&bix);
+   } else {
+   ret = bi->verify_fn(&bix);
+   kunmap_atomic(kaddr);
+   return ret;
+   }
 
sectors = bv->bv_len / bi->sector_size;
sector += sectors;
@@ -325,6 +331,22 @@ static void bio_integrity_generate(struct bio *bio)
 
kunmap_atomic(kaddr);
}
+
+   return ret;
+}
+
+/**
+ * bio_integrity_generate - Generate integrity metadata for a bio
+ * @bio:   bio to generate integrity metadata for
+ *
+ * Description: Generates integrity metadata for a bio by calling the
+ * block device's generation callback function.  The bio must have a
+ * bip attached with enough room to accommodate the generated
+ * integrity metadata.
+ */
+static void bio_integrity_generate(struct bio *bio)
+{
+   bio_integrity_generate_verify(bio, 1);
 }
 
 static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi)
@@ -439,41 +461,7 @@ EXPORT_SYMBOL(bio_integrity_prep);
  */
 static int bio_integrity_verify(struct bio *bio)
 {
-   struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
-   struct blk_integrity_exchg bix;
-   struct bio_vec *bv;
-   sector_t sector = bio->bi_integrity->bip_sector;
-   unsigned int i, sectors, total, ret;
-   void *prot_buf = bio->bi_integrity->bip_buf;
-
-   ret = total = 0;
-   bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
-   bix.sector_size = bi->sector_size;
-
-   bio_for_each_segment(bv, bio, i) {
-   void *kaddr = kmap_atomic(bv->bv_page);
-   bix.data_buf = kaddr + bv->bv_offset;
-   bix.data_size = bv->bv_len;
-   bix.prot_buf = prot_buf;
-   bix.sector = sector;
-
-   ret = bi->verify_fn(&bix);
-
-   if (ret) {
-   kunmap_atomic(kaddr);
-   return ret;
-   }
-
-   sectors = bv->bv_len / bi->sector_size;
-   sector += sectors;
-   prot_buf += sectors * bi->tuple_size;
-   total += sectors * bi->tuple_size;
-   BUG_ON(total > bio->bi_integrity->bip_size);
-
-   kunmap_atomic(kaddr);
-   }
-
-   return ret;
+   return bio_integrity_generate_verify(bio, 0);
 }
 
 /**
-- 
1.7.7
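
A minimal user-space sketch of a combined generate/verify walker is given here for illustration. The removed bio_integrity_verify() loop returned early only `if (ret)`, so this sketch keeps walking segments until a verification error is actually reported. `struct segment`, `generate_seg()`, `verify_seg()` and `generate_verify()` are hypothetical names, not the block layer API.

```c
#include <assert.h>
#include <stddef.h>

enum integrity_op { OP_VERIFY = 0, OP_GENERATE = 1 };

/* Hypothetical segment: 'bad' marks data that fails verification. */
struct segment {
	int bad;	/* set by the caller to simulate corrupt data */
	int tagged;	/* set once metadata has been generated */
	int checked;	/* set once the segment has been verified */
};

static void generate_seg(struct segment *seg)
{
	seg->tagged = 1;
}

static int verify_seg(struct segment *seg)
{
	seg->checked = 1;
	return seg->bad ? -1 : 0;
}

/*
 * Walk all segments once. For OP_GENERATE every segment is processed; for
 * OP_VERIFY the walk stops at the first segment that reports an error.
 * Returns 0 on success or the first verification error.
 */
static int generate_verify(struct segment *segs, size_t n, enum integrity_op op)
{
	size_t i;

	for (i = 0; i < n; i++) {
		int ret;

		if (op == OP_GENERATE) {
			generate_seg(&segs[i]);
			continue;
		}

		ret = verify_seg(&segs[i]);
		if (ret)
			return ret;	/* stop only when verification fails */
	}

	return 0;
}
```

An enum is used for the operation here instead of the 1/0 integer in the patch, purely to make call sites easier to read.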



Re: [PATCH] aio: rcu_read_lock protection for new rcu_dereference calls

2013-09-08 Thread Gu Zheng
On 09/08/2013 10:10 PM, Artem Savkov wrote:

> Patch "aio: fix rcu sparse warnings introduced by ioctx table lookup patch"
> (77d30b14d24e557f89c41980011d72428514d729 in linux-next.git) introduced a
> couple of new rcu_dereference calls which are not protected by rcu_read_lock
> and result in following warnings during syscall fuzzing(trinity):
> 
> [  471.646379] ===
> [  471.649727] [ INFO: suspicious RCU usage. ]
> [  471.653919] 3.11.0-next-20130906+ #496 Not tainted
> [  471.657792] ---
> [  471.661235] fs/aio.c:503 suspicious rcu_dereference_check() usage!
> [  471.665968]
> [  471.665968] other info that might help us debug this:
> [  471.665968]
> [  471.672141]
> [  471.672141] rcu_scheduler_active = 1, debug_locks = 1
> [  471.677549] 1 lock held by trinity-child0/3774:
> [  471.681675]  #0:  (&(&mm->ioctx_lock)->rlock){+.+...}, at: [] SyS_io_setup+0x63a/0xc70
> [  471.688721]
> [  471.688721] stack backtrace:
> [  471.692488] CPU: 1 PID: 3774 Comm: trinity-child0 Not tainted 3.11.0-next-20130906+ #496
> [  471.698437] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [  471.703151]    c58bbf30 c18a814b de2234c0 c58bbf58 c10a4ec6 c1b0d824
> [  471.709544]  c1b0f60e 0001 0001 c1af61b0  cb670ac0 c3aca000 c58bbfac
> [  471.716251]  c119bc7c 0002 0001  c119b8dd  c10cf684 c58bbfb4
> [  471.722902] Call Trace:
> [  471.724859]  [] dump_stack+0x4b/0x66
> [  471.728772]  [] lockdep_rcu_suspicious+0xc6/0x100
> [  471.733716]  [] SyS_io_setup+0x89c/0xc70
> [  471.737806]  [] ? SyS_io_setup+0x4fd/0xc70
> [  471.741689]  [] ? __audit_syscall_entry+0x94/0xe0
> [  471.746080]  [] syscall_call+0x7/0xb
> [  471.749723]  [] ? task_fork_fair+0x240/0x260
> 
> Signed-off-by: Artem Savkov 


Reviewed-by: Gu Zheng 

> ---
>  fs/aio.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index 9b8f9fa..2fb9160 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -470,6 +470,7 @@ static int ioctx_add_table(struct kioctx *ctx, struct 
> mm_struct *mm)
>   struct aio_ring *ring;
>  
>   spin_lock(&mm->ioctx_lock);
> + rcu_read_lock();
>   table = rcu_dereference(mm->ioctx_table);
>  
>   while (1) {
> @@ -478,6 +479,7 @@ static int ioctx_add_table(struct kioctx *ctx, struct 
> mm_struct *mm)
>   if (!table->table[i]) {
>   ctx->id = i;
>   table->table[i] = ctx;
> + rcu_read_unlock();
>   spin_unlock(&mm->ioctx_lock);
>  
>   ring = kmap_atomic(ctx->ring_pages[0]);
> @@ -488,6 +490,7 @@ static int ioctx_add_table(struct kioctx *ctx, struct 
> mm_struct *mm)
>  
>   new_nr = (table ? table->nr : 1) * 4;
>  
> + rcu_read_unlock();
>   spin_unlock(&mm->ioctx_lock);
>  
>   table = kzalloc(sizeof(*table) + sizeof(struct kioctx *) *
> @@ -498,6 +501,7 @@ static int ioctx_add_table(struct kioctx *ctx, struct 
> mm_struct *mm)
>   table->nr = new_nr;
>  
>   spin_lock(&mm->ioctx_lock);
> + rcu_read_lock();
>   old = rcu_dereference(mm->ioctx_table);
>  
>   if (!old) {
> @@ -622,10 +626,12 @@ static void kill_ioctx(struct mm_struct *mm, struct 
> kioctx *ctx)
>   struct kioctx_table *table;
>  
>   spin_lock(&mm->ioctx_lock);
> + rcu_read_lock();
>   table = rcu_dereference(mm->ioctx_table);
>  
>   WARN_ON(ctx != table->table[ctx->id]);
>   table->table[ctx->id] = NULL;
> + rcu_read_unlock();
>   spin_unlock(&mm->ioctx_lock);
>  
>   /* percpu_ref_kill() will do the necessary call_rcu() */




Re: [PATCH aio-next] aio: fix race in ring buffer page lookup introduced by page migration support

2013-09-09 Thread Gu Zheng
Hi Ben, Al,

On 09/10/2013 12:02 AM, Benjamin LaHaise wrote:

> Hi Al, Gu,
> 
> I've added this patch to my tree at git://git.kvack.org/~bcrl/aio-next.git 
> to fix the get_user_pages() issue introduced by Gu's changes in the page 
> migration patch.  Thanks Al for spotting this.

Thanks very much for spotting and fixing this issue.

Best regards,
Gu

> 
>   -ben
> 
> commit d6c355c7dabcd753a75bc77d150d36328a355267
> Author: Benjamin LaHaise 
> Date:   Mon Sep 9 11:57:59 2013 -0400
> 
> aio: fix race in ring buffer page lookup introduced by page migration 
> support
> 
> Prior to the introduction of page migration support in "fs/aio: Add 
> support
> to aio ring pages migration" / 36bc08cc01709b4a9bb563b35aa530241ddc63e3,
> mapping of the ring buffer pages was done via get_user_pages() while
> retaining mmap_sem held for write.  This avoided possible races with 
> userland
> racing an munmap() or mremap().  The page migration patch, however, 
> switched
> to using mm_populate() to prime the page mapping.  mm_populate() cannot be
> called with mmap_sem held.
> 
> Instead of dropping the mmap_sem, revert to the old behaviour and simply
> drop the use of mm_populate() since get_user_pages() will cause the pages 
> to
> get mapped anyways.  Thanks to Al Viro for spotting this issue.
> 
> Signed-off-by: Benjamin LaHaise 
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index 6e26755..f4a27af 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -307,16 +307,25 @@ static int aio_setup_ring(struct kioctx *ctx)
>   aio_free_ring(ctx);
>   return -EAGAIN;
>   }
> - up_write(&mm->mmap_sem);
> -
> - mm_populate(ctx->mmap_base, populate);
>  
>   pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base);
> +
> + /* We must do this while still holding mmap_sem for write, as we
> +  * need to be protected against userspace attempting to mremap()
> +  * or munmap() the ring buffer.
> +  */
>   ctx->nr_pages = get_user_pages(current, mm, ctx->mmap_base, nr_pages,
>  1, 0, ctx->ring_pages, NULL);
> +
> + /* Dropping the reference here is safe as the page cache will hold
> +  * onto the pages for us.  It is also required so that page migration
> +  * can unmap the pages and get the right reference count.
> +  */
>   for (i = 0; i < ctx->nr_pages; i++)
>   put_page(ctx->ring_pages[i]);
>  
> + up_write(&mm->mmap_sem);
> +
>   if (unlikely(ctx->nr_pages != nr_pages)) {
>   aio_free_ring(ctx);
>   return -EAGAIN;




Re: [f2fs-dev] [PATCH] f2fs: optimize fs_lock for better performance

2013-09-10 Thread Gu Zheng
Hi Jaegeuk,
On 09/10/2013 08:59 AM, Jaegeuk Kim wrote:

> Hi,
> 
> 2013-09-07 (Sat), 08:00 +, Chao Yu:
>> Hi Knize,
>>
>> Thanks for your reply, I think it's actually meaningless that it's
>> being named after "spin_lock",
>> it's better to rename this spinlock to "round_robin_lock".
>>
>> This patch can only resolve the issue of unbalanced fs_lock usage,
>> it can not fix the deadlock issue.
>> can we fix deadlock issue through this method:
>>
>> - vfs_create()
>>  - f2fs_create() - takes an fs_lock and save current thread info into
>> thread_info[NR_GLOBAL_LOCKS]
>>   - f2fs_add_link()
>>- __f2fs_add_link()
>> - init_inode_metadata()
>>  - f2fs_init_security()
>>   - security_inode_init_security()
>>- f2fs_initxattrs()
>> - f2fs_setxattr() - get fs_lock only if there is no current
>> thread info in thread_info
>> 
>> So it keeps one thread can only hold one fs_lock to avoid deadlock.
>> Can we use this solution?
> 
> It could be.
> But, I think we can avoid to grab the fs_lock at the f2fs_initxattrs()

Agreed. The fs_lock here is meant to protect the xattr from parallel
modification, but in the initxattrs routine parallel modification cannot
happen. In the normal setxattr routine, inode->i_mutex (VFS layer) already
prevents parallel modification. So I think this fs_lock is unnecessary.
Am I missing something?

Regards,
Gu

> level, since this case only happens when f2fs_initxattrs() is called.
> Let's think about ut in more detail.
> Thanks,
> 
>>
>>  
>>
>> thanks again!
>>
>>  
>>
>> --- Original Message ---
>>
>> Sender : Russ Knize
>>
>> Date : September 07, 2013 04:25 (GMT+09:00)
>>
>> Title : Re: [f2fs-dev] [PATCH] f2fs: optimize fs_lock for better
>> performance
>>
>>  
>>
>> I encountered this same issue recently and solved it in much the same
>> way.  Can we rename "spin_lock" to something more meaningful? 
>>
>>
>> This race actually exposed a potential deadlock between f2fs_create()
>> and f2fs_initxattrs(): 
>>
>>
>> - vfs_create()
>>  - f2fs_create() - takes an fs_lock
>>   - f2fs_add_link()
>>- __f2fs_add_link()
>> - init_inode_metadata()
>>  - f2fs_init_security()
>>   - security_inode_init_security()
>>- f2fs_initxattrs()
>> - f2fs_setxattr() - also takes an fs_lock
>>
>>
>> If another CPU happens to have the same lock that f2fs_setxattr() was
>> trying to take because of the race around next_lock_num, we can get
>> into a deadlock situation if the two threads are also contending over
>> another resource (like bdi).
>>
>>
>> Another scenario is if the above happens while another thread is in
>> the middle of grabbing all of the locks via mutex_lock_all().
>>  f2fs_create() is holding a lock that mutex_lock_all() is waiting for
>> and mutex_lock_all() is holding a lock that f2fs_setxattr() is waiting
>> for.
>>
>>
>> Russ
>>
>>
>> On Fri, Sep 6, 2013 at 4:48 AM, Chao Yu  wrote:
>> Hi Kim:
>> 
>>  I think there is a performance problem: when all
>> sbi->fs_lock is holded, 
>> 
>> then all other threads may get the same next_lock value from
>> sbi->next_lock_num in function mutex_lock_op, 
>> 
>> and wait to get the same lock at position fs_lock[next_lock],
>> it unbalance the fs_lock usage. 
>> 
>> It may lost performance when we do the multithread test.
>> 
>>  
>> 
>> Here is the patch to fix this problem:
>> 
>>  
>> 
>> Signed-off-by: Yu Chao 
>> 
>> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
>> 
>> old mode 100644
>> 
>> new mode 100755
>> 
>> index 467d42d..983bb45
>> 
>> --- a/fs/f2fs/f2fs.h
>> 
>> +++ b/fs/f2fs/f2fs.h
>> 
>> @@ -371,6 +371,7 @@ struct f2fs_sb_info {
>> 
>> struct mutex fs_lock[NR_GLOBAL_LOCKS];  /* blocking FS
>> operations */
>> 
>> struct mutex node_write;/* locking
>> node writes */
>> 
>> struct mutex writepages;/* mutex for
>> writepages() */
>> 
>> +   spinlock_t spin_lock;   /* lock for
>> next_lock_num */
>> 
>> unsigned char next_lock_num;/* round-robin
>> global locks */
>> 
>> int por_doing;  /* recovery is
>> doing or not */
>> 
>> int on_build_free_nids; /*
>> build_free_nids is doing */
>> 
>> @@ -533,15 +534,19 @@ static inline void
>> mutex_unlock_all(struct f2fs_sb_info *sbi)
>> 
>>  
>> 
>>  static inline int mutex_lock_op(struct f2fs_sb_info *sbi)
>> 
>>  {
>> 
>> -   unsigne

Re: [f2fs-dev][PATCH] f2fs: optimize fs_lock for better performance

2013-09-10 Thread Gu Zheng
Hi Jaegeuk,

On 09/10/2013 08:52 AM, Jaegeuk Kim wrote:

> Hi,
> 
> At first, thank you for the report and please follow the email writing
> rules. :)
> 
> Anyway, I agree to the below issue.
> One thing that I can think of is that we don't need to use the
> spin_lock, since we don't care about the exact lock number, but just
> need to get any not-collided number.

Agreed, but if all the locks are held, IMO we still need to spread the
subsequent threads across the locks so that they do not all wait on the same
one, even though perfect balance is unreachable.

> 
> So, how about removing the spin_lock?

Yeah, in this case spin_lock is a bit too costly.

> And how about using a random number?

NR_GLOBAL_LOCKS is currently 8, so it seems a random number cannot give us the
balanced distribution we expect.

Regards,
Gu 

> Thanks,
> 
> 2013-09-06 (Fri), 09:48 +, Chao Yu:
>> Hi Kim:
>>
>>  I think there is a performance problem: when all sbi->fs_lock is
>> holded, 
>>
>> then all other threads may get the same next_lock value from
>> sbi->next_lock_num in function mutex_lock_op, 
>>
>> and wait to get the same lock at position fs_lock[next_lock], it
>> unbalance the fs_lock usage. 
>>
>> It may lost performance when we do the multithread test.
>>
>>  
>>
>> Here is the patch to fix this problem:
>>
>>  
>>
>> Signed-off-by: Yu Chao 
>>
>> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
>>
>> old mode 100644
>>
>> new mode 100755
>>
>> index 467d42d..983bb45
>>
>> --- a/fs/f2fs/f2fs.h
>>
>> +++ b/fs/f2fs/f2fs.h
>>
>> @@ -371,6 +371,7 @@ struct f2fs_sb_info {
>>
>> struct mutex fs_lock[NR_GLOBAL_LOCKS];  /* blocking FS
>> operations */
>>
>> struct mutex node_write;/* locking node writes
>> */
>>
>> struct mutex writepages;/* mutex for
>> writepages() */
>>
>> +   spinlock_t spin_lock;   /* lock for
>> next_lock_num */
>>
>> unsigned char next_lock_num;/* round-robin global
>> locks */
>>
>> int por_doing;  /* recovery is doing
>> or not */
>>
>> int on_build_free_nids; /* build_free_nids is
>> doing */
>>
>> @@ -533,15 +534,19 @@ static inline void mutex_unlock_all(struct
>> f2fs_sb_info *sbi)
>>
>>  
>>
>>  static inline int mutex_lock_op(struct f2fs_sb_info *sbi)
>>
>>  {
>>
>> -   unsigned char next_lock = sbi->next_lock_num %
>> NR_GLOBAL_LOCKS;
>>
>> +   unsigned char next_lock;
>>
>> int i = 0;
>>
>>  
>>
>> for (; i < NR_GLOBAL_LOCKS; i++)
>>
>> if (mutex_trylock(&sbi->fs_lock[i]))
>>
>> return i;
>>
>>  
>>
>> -   mutex_lock(&sbi->fs_lock[next_lock]);
>>
>> +   spin_lock(&sbi->spin_lock);
>>
>> +   next_lock = sbi->next_lock_num % NR_GLOBAL_LOCKS;
>>
>> sbi->next_lock_num++;
>>
>> +   spin_unlock(&sbi->spin_lock);
>>
>> +
>>
>> +   mutex_lock(&sbi->fs_lock[next_lock]);
>>
>> return next_lock;
>>
>>  }
>>
>>  
>>
>> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
>>
>> old mode 100644
>>
>> new mode 100755
>>
>> index 75c7dc3..4f27596
>>
>> --- a/fs/f2fs/super.c
>>
>> +++ b/fs/f2fs/super.c
>>
>> @@ -657,6 +657,7 @@ static int f2fs_fill_super(struct super_block *sb,
>> void *data, int silent)
>>
>> mutex_init(&sbi->cp_mutex);
>>
>> for (i = 0; i < NR_GLOBAL_LOCKS; i++)
>>
>> mutex_init(&sbi->fs_lock[i]);
>>
>> +   spin_lock_init(&sbi->spin_lock);
>>
>> mutex_init(&sbi->node_write);
>>
>> sbi->por_doing = 0;
>>
>> spin_lock_init(&sbi->stat_lock);
>>
>> (END)
>>
>>  
>>
>>
>>
>>
> 




Re: [f2fs-dev][PATCH] f2fs: optimize fs_lock for better performance

2013-09-10 Thread Gu Zheng
Hi Jaegeuk, Chao,

On 09/10/2013 08:52 AM, Jaegeuk Kim wrote:

> Hi,
> 
> At first, thank you for the report and please follow the email writing
> rules. :)
> 
> Anyway, I agree to the below issue.
> One thing that I can think of is that we don't need to use the
> spin_lock, since we don't care about the exact lock number, but just
> need to get any not-collided number.

IMHO, just moving sbi->next_lock_num++ before
mutex_lock(&sbi->fs_lock[next_lock]) avoids most of the imbalance.
IMO, the chance of two or more threads incrementing sbi->next_lock_num at the
same time is very small. If you think that is not rigorous enough, changing
next_lock_num to an atomic type would fix it.
What's your opinion?
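
To illustrate the atomic variant, here is a minimal userspace sketch using
pthreads and C11 atomics. It is not kernel code and not the actual f2fs
change: struct lock_pool, NR_LOCKS and the pool_* helpers are hypothetical
names standing in for sbi->fs_lock[], NR_GLOBAL_LOCKS and mutex_lock_op().
The point is only that the slot is claimed with a fetch-and-add before
blocking, so concurrent waiters get different slots.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NR_LOCKS 8	/* hypothetical stand-in for NR_GLOBAL_LOCKS */

struct lock_pool {
	pthread_mutex_t lock[NR_LOCKS];
	atomic_uint next;	/* round-robin cursor, like next_lock_num */
};

static void pool_init(struct lock_pool *p)
{
	for (int i = 0; i < NR_LOCKS; i++) {
		int rc = pthread_mutex_init(&p->lock[i], NULL);

		assert(rc == 0);
		(void)rc;
	}
	atomic_init(&p->next, 0);
}

/* Claim a slot; fetch-add hands each concurrent caller a distinct ticket. */
static unsigned int pool_next_slot(struct lock_pool *p)
{
	return atomic_fetch_add(&p->next, 1) % NR_LOCKS;
}

static int pool_lock_op(struct lock_pool *p)
{
	unsigned int slot;

	/* Fast path: take the first lock that is currently free. */
	for (int i = 0; i < NR_LOCKS; i++)
		if (pthread_mutex_trylock(&p->lock[i]) == 0)
			return i;

	/* Slow path: advance the cursor first, then block on that slot. */
	slot = pool_next_slot(p);
	pthread_mutex_lock(&p->lock[slot]);
	return (int)slot;
}

static void pool_unlock_op(struct lock_pool *p, int idx)
{
	pthread_mutex_unlock(&p->lock[idx]);
}
```

A caller would pair idx = pool_lock_op(&pool) with pool_unlock_op(&pool, idx).
A plain, non-atomic increment placed at the same point behaves the same way
except when two threads race on the cursor, in which case they may pick the
same slot.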

Regards,
Gu

> 
> So, how about removing the spin_lock?
> And how about using a random number?

> Thanks,
> 




Re: [f2fs-dev][PATCH] f2fs: optimize fs_lock for better performance

2013-09-11 Thread Gu Zheng
Hi Chao,
On 09/12/2013 10:40 AM, 俞超 wrote:

> Hi Gu
> 
>> -Original Message-----
>> From: Gu Zheng [mailto:guz.f...@cn.fujitsu.com]
>> Sent: Wednesday, September 11, 2013 1:38 PM
>> To: jaegeuk@samsung.com
>> Cc: chao2...@samsung.com; shu@samsung.com;
>> linux-fsde...@vger.kernel.org; linux-kernel@vger.kernel.org;
>> linux-f2fs-de...@lists.sourceforge.net
>> Subject: Re: [f2fs-dev][PATCH] f2fs: optimize fs_lock for better performance
>>
>> Hi Jaegeuk, Chao,
>>
>> On 09/10/2013 08:52 AM, Jaegeuk Kim wrote:
>>
>>> Hi,
>>>
>>> At first, thank you for the report and please follow the email writing
>>> rules. :)
>>>
>>> Anyway, I agree to the below issue.
>>> One thing that I can think of is that we don't need to use the
>>> spin_lock, since we don't care about the exact lock number, but just
>>> need to get any not-collided number.
>>
>> IMHO, just moving sbi->next_lock_num++ before
>> mutex_lock(&sbi->fs_lock[next_lock])
>> can avoid unbalance issue mostly.
>> IMO, the case two or more threads increase sbi->next_lock_num in the same
>> time is really very very little. If you think it is not rigorous, change
>> next_lock_num to atomic one can fix it.
>> What's your opinion?
>>
>> Regards,
>> Gu
> 
> I did the test sbi->next_lock_num++ compare with the atomic one,
> And I found performance of them is almost the same under a small number 
> thread racing.
> So as your and Kim's opinion, it's enough to use "sbi->next_lock_num++" to 
> fix this issue.

Good, but your reply patch seems to be badly formatted, which makes it hard
for Jaegeuk to merge.
I'll reformat it; see the following thread.

Thanks,
Gu

> 
> Thanks for the advice.
>>
>>>
>>> So, how about removing the spin_lock?
>>> And how about using a random number?
>>
>>> Thanks,
>>>
>>> 2013-09-06 (Fri), 09:48 +, Chao Yu:
>>>> Hi Kim:
>>>>
>>>>  I think there is a performance problem: when all sbi->fs_lock is
>>>> holded,
>>>>
>>>> then all other threads may get the same next_lock value from
>>>> sbi->next_lock_num in function mutex_lock_op,
>>>>
>>>> and wait to get the same lock at position fs_lock[next_lock], it
>>>> unbalance the fs_lock usage.
>>>>
>>>> It may lost performance when we do the multithread test.
>>>>
>>>>
>>>>
>>>> Here is the patch to fix this problem:
>>>>
>>>>
>>>>
>>>> Signed-off-by: Yu Chao 
>>>>
>>>> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
>>>>
>>>> old mode 100644
>>>>
>>>> new mode 100755
>>>>
>>>> index 467d42d..983bb45
>>>>
>>>> --- a/fs/f2fs/f2fs.h
>>>>
>>>> +++ b/fs/f2fs/f2fs.h
>>>>
>>>> @@ -371,6 +371,7 @@ struct f2fs_sb_info {
>>>>
>>>> struct mutex fs_lock[NR_GLOBAL_LOCKS];  /* blocking FS
>>>> operations */
>>>>
>>>> struct mutex node_write;/* locking node
>> writes
>>>> */
>>>>
>>>> struct mutex writepages;/* mutex for
>>>> writepages() */
>>>>
>>>> +   spinlock_t spin_lock;   /* lock for
>>>> next_lock_num */
>>>>
>>>> unsigned char next_lock_num;/* round-robin
>> global
>>>> locks */
>>>>
>>>> int por_doing;  /* recovery is doing
>>>> or not */
>>>>
>>>> int on_build_free_nids; /* build_free_nids is
>>>> doing */
>>>>
>>>> @@ -533,15 +534,19 @@ static inline void mutex_unlock_all(struct
>>>> f2fs_sb_info *sbi)
>>>>
>>>>
>>>>
>>>>  static inline int mutex_lock_op(struct f2fs_sb_info *sbi)
>>>>
>>>>  {
>>>>
>>>> -   unsigned char next_lock = sbi->next_lock_num %
>>>> NR_GLOBAL_LOCKS;
>>>>
>>>> +   unsigned char next_lock;
>>>>
>>>> int i = 0;
>>>>
>>>>
>>>>
>>>> for (; i < NR_GLOBAL_LOCKS; i++)
>>>>
>>>> if

[f2fs-dev][PATCH V2] f2fs: optimize fs_lock for better performance

2013-09-11 Thread Gu Zheng
From: Yu Chao 

There is a performance problem: when all sbi->fs_lock mutexes are held, all
subsequent threads may read the same next_lock value from sbi->next_lock_num
in mutex_lock_op() and wait on the same lock (fs_lock[next_lock]), which may
hurt performance.
So move the sbi->next_lock_num++ before taking the lock; this spreads the
waiting threads across the locks when all sbi->fs_lock mutexes are held.

v1-->v2:
Drop the needless spin_lock as Jaegeuk suggested.

Suggested-by: Jaegeuk Kim 
Signed-off-by: Yu Chao 
Signed-off-by: Gu Zheng 
---
 fs/f2fs/f2fs.h |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 608f0df..7fd99d8 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -544,15 +544,15 @@ static inline void mutex_unlock_all(struct f2fs_sb_info *sbi)
 
 static inline int mutex_lock_op(struct f2fs_sb_info *sbi)
 {
-   unsigned char next_lock = sbi->next_lock_num % NR_GLOBAL_LOCKS;
+   unsigned char next_lock;
int i = 0;
 
for (; i < NR_GLOBAL_LOCKS; i++)
if (mutex_trylock(&sbi->fs_lock[i]))
return i;
 
+   next_lock = sbi->next_lock_num++ % NR_GLOBAL_LOCKS;
mutex_lock(&sbi->fs_lock[next_lock]);
-   sbi->next_lock_num++;
return next_lock;
 }
 
-- 
1.7.7




Re: [f2fs-dev] [PATCH v2] f2fs: avoid congestion_wait when do_checkpoint for better performance

2013-10-08 Thread Gu Zheng
Hi Yuan,
On 10/08/2013 04:30 PM, Yuan Zhong wrote:

> Previously, do_checkpoint() will call congestion_wait() for waiting the pages 
> (previous submitted node/meta/data pages) to be written back.
> Because congestion_wait() will set a regular period (e.g. HZ / 50 ) for 
> waiting.
> For this reason, there is a situation that after the pages have been written 
> back, but the checkpoint thread still wait for congestion_wait to exit.

How did you confirm this issue? I suspect that the block core has no wake-up
mechanism for when the backing device becomes uncongested.

> This is a problem here, especially, when sync a large number of small files 
> or dirs.
> In order to avoid this, a wait_list is introduced, the checkpoint thread will 
> be dropped into the wait_list if the pages have not been written back, and 
> will be waked up by contrast.

Please pay attention to the mail format; this mail is badly formatted in my
mail client.

Regards,
Gu

> 
> Signed-off-by: Yuan Zhong 
> ---  
>  fs/f2fs/checkpoint.c |3 +--
>  fs/f2fs/f2fs.h   |   19 +++
>  fs/f2fs/segment.c|1 +
>  fs/f2fs/super.c  |1 +
>  4 files changed, 22 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
> index ca39442..5d69ae0 100644
> --- a/fs/f2fs/checkpoint.c
> +++ b/fs/f2fs/checkpoint.c
> @@ -758,8 +758,7 @@ static void do_checkpoint(struct f2fs_sb_info *sbi, bool 
> is_umount)
>   f2fs_put_page(cp_page, 1);
>  
>   /* wait for previous submitted node/meta pages writeback */
> - while (get_pages(sbi, F2FS_WRITEBACK))
> - congestion_wait(BLK_RW_ASYNC, HZ / 50);
> + f2fs_writeback_wait(sbi);
>  
>   filemap_fdatawait_range(sbi->node_inode->i_mapping, 0, LONG_MAX);
>   filemap_fdatawait_range(sbi->meta_inode->i_mapping, 0, LONG_MAX);
> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> index 7fd99d8..4b0d70e 100644
> --- a/fs/f2fs/f2fs.h
> +++ b/fs/f2fs/f2fs.h
> @@ -18,6 +18,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  /*
>   * For mount options
> @@ -368,6 +370,7 @@ struct f2fs_sb_info {
>   struct mutex fs_lock[NR_GLOBAL_LOCKS];  /* blocking FS operations */
>   struct mutex node_write;/* locking node writes */
>   struct mutex writepages;/* mutex for writepages() */
> + wait_queue_head_t writeback_wqh;/* wait_queue for writeback */
>   unsigned char next_lock_num;/* round-robin global locks */
>   int por_doing;  /* recovery is doing or not */
>   int on_build_free_nids; /* build_free_nids is doing */
> @@ -961,6 +964,22 @@ static inline int f2fs_readonly(struct super_block *sb)
>   return sb->s_flags & MS_RDONLY;
>  }
>  
> +static inline void f2fs_writeback_wait(struct f2fs_sb_info *sbi)
> +{
> + DEFINE_WAIT(wait);
> +
> + prepare_to_wait(&sbi->writeback_wqh, &wait, TASK_UNINTERRUPTIBLE);
> + if (get_pages(sbi, F2FS_WRITEBACK))
> + io_schedule();
> + finish_wait(&sbi->writeback_wqh, &wait);
> +}
> +
> +static inline void f2fs_writeback_wake(struct f2fs_sb_info *sbi)
> +{
> + if (!get_pages(sbi, F2FS_WRITEBACK))
> + wake_up_all(&sbi->writeback_wqh);
> +}
> +
>  /*
>   * file.c
>   */
> diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
> index bd79bbe..0708aa9 100644
> --- a/fs/f2fs/segment.c
> +++ b/fs/f2fs/segment.c
> @@ -597,6 +597,7 @@ static void f2fs_end_io_write(struct bio *bio, int err)
>  
>   if (p->is_sync)
>   complete(p->wait);
> + f2fs_writeback_wake(p->sbi);
>   kfree(p);
>   bio_put(bio);
>  }
> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
> index 094ccc6..3ac6d85 100644
> --- a/fs/f2fs/super.c
> +++ b/fs/f2fs/super.c
> @@ -835,6 +835,7 @@ static int f2fs_fill_super(struct super_block *sb, void 
> *data, int silent)
>   mutex_init(&sbi->gc_mutex);
>   mutex_init(&sbi->writepages);
>   mutex_init(&sbi->cp_mutex);
> + init_waitqueue_head(&sbi->writeback_wqh);
>   for (i = 0; i < NR_GLOBAL_LOCKS; i++)
>   mutex_init(&sbi->fs_lock[i]);
>   
>   mutex_init(&sbi->node_write);




Re: [f2fs-dev] [PATCH v2] f2fs: avoid congestion_wait when do_checkpoint for better performance

2013-10-08 Thread Gu Zheng
Hi Yuan,
On 10/08/2013 07:30 PM, Yuan Zhong wrote:

> Hi Gu,
> 
>> Hi Yuan,
>> On 10/08/2013 04:30 PM, Yuan Zhong wrote:
> 
>>> Previously, do_checkpoint() will call congestion_wait() for waiting the 
>>> pages (previous submitted node/meta/data pages) to be written back.
>>> Because congestion_wait() will set a regular period (e.g. HZ / 50 ) for 
>>> waiting.
>>> For this reason, there is a situation that after the pages have been 
>>> written back, 
>>> but the checkpoint thread still wait for congestion_wait to exit.
> 
>> How do you confirm this issue? 
> 
>   I traced the execution path.
>   In f2fs_end_io_write, dec_page_count(p->sbi, F2FS_WRITEBACK) will be called.
>   And I found that, when pages of F2FS_WRITEBACK has been zero, but
>   checkpoint thread still congestion_wait for pages of F2FS_WRITEBACK to be 
> zero.

Yes, that may be. congestion_wait() adds the task to a global wait queue
shared by all backing devices, so even when F2FS_WRITEBACK has dropped to
zero, other I/O may still be going on.
Anyway, using a private wait queue is the better choice. :)
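
As a rough userspace analogy of waiting on a private queue instead of polling
with a fixed timeout, here is a minimal pthreads sketch. It is not the f2fs
code: struct wb_counter and the wb_* helpers are hypothetical names, with the
pages field standing in for the F2FS_WRITEBACK page count and wb_end_io()
for the completion side in f2fs_end_io_write().

```c
#include <assert.h>
#include <pthread.h>

struct wb_counter {
	pthread_mutex_t lock;
	pthread_cond_t zero;	/* signalled when pages drops to 0 */
	int pages;		/* pages still under writeback */
};

static void wb_init(struct wb_counter *c, int pages)
{
	int rc = pthread_mutex_init(&c->lock, NULL);

	assert(rc == 0);
	rc = pthread_cond_init(&c->zero, NULL);
	assert(rc == 0);
	(void)rc;
	c->pages = pages;
}

static int wb_pending(struct wb_counter *c)
{
	int n;

	pthread_mutex_lock(&c->lock);
	n = c->pages;
	pthread_mutex_unlock(&c->lock);
	return n;
}

/* Completion side: one page finished; wake waiters on the last one. */
static void wb_end_io(struct wb_counter *c)
{
	pthread_mutex_lock(&c->lock);
	assert(c->pages > 0);
	if (--c->pages == 0)
		pthread_cond_broadcast(&c->zero);
	pthread_mutex_unlock(&c->lock);
}

/* Waiter side: sleep until the count is zero, re-checking after wake-up. */
static void wb_wait(struct wb_counter *c)
{
	pthread_mutex_lock(&c->lock);
	while (c->pages > 0)
		pthread_cond_wait(&c->zero, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

/* Demo thread body: completes every page that is outstanding at start. */
static void *wb_completer(void *arg)
{
	struct wb_counter *c = arg;
	int n = wb_pending(c);

	while (n-- > 0)
		wb_end_io(c);
	return NULL;
}
```

The waiter returns as soon as the last completion arrives rather than at the
end of a fixed sleep period, and unrelated activity never wakes it because
the condition variable belongs to this counter only.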


>   So, I think this point could be improved.
>   And I wrote a simple test case and tested on Micro-SD card, the steps as 
> following:
>   (a) create a fixed-size file (4KB)
>   (b) go on to sync the file 
>   (c) go back to step #a (fixed numbers of cycling:1024)  
>The results indicated that the execution time is reduced greatly by using 
> this patch.

Yes, the change is an improvement if the issue exists.

  
> 
> 
>> I suspect that the block-core does not have a wake-up mechanism
>> when the back device is uncongested.
> 
> 
>   Yes, you are right.
>   So I wake up the checkpoint thread myself when the number of
>   F2FS_WRITEBACK pages drops to zero.
>   In f2fs_end_io_write, f2fs_writeback_wake is called.
>   You could find this code in my patch.

Saw it. :)
But one problem is that the checkpoint routine is always a singleton, so the
wait queue would only ever serve one waiter, which hardly seems worth it. How
about just scheduling out and waking the task up directly? See the following
patch.

Signed-off-by: Gu Zheng 
---
 fs/f2fs/checkpoint.c |   11 +--
 fs/f2fs/f2fs.h   |1 +
 fs/f2fs/segment.c|4 
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index d808827..2a5999d 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -757,8 +757,15 @@ static void do_checkpoint(struct f2fs_sb_info *sbi, bool is_umount)
f2fs_put_page(cp_page, 1);
 
/* wait for previous submitted node/meta pages writeback */
-   while (get_pages(sbi, F2FS_WRITEBACK))
-   congestion_wait(BLK_RW_ASYNC, HZ / 50);
+   sbi->cp_task = current;
+   while (get_pages(sbi, F2FS_WRITEBACK)) {
+   set_current_state(TASK_UNINTERRUPTIBLE);
+   if (!get_pages(sbi, F2FS_WRITEBACK))
+   break;
+   io_schedule();
+   }
+   __set_current_state(TASK_RUNNING);
+   sbi->cp_task = NULL;
 
filemap_fdatawait_range(sbi->node_inode->i_mapping, 0, LONG_MAX);
filemap_fdatawait_range(sbi->meta_inode->i_mapping, 0, LONG_MAX);
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index a955a59..408ace7 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -365,6 +365,7 @@ struct f2fs_sb_info {
struct mutex writepages;/* mutex for writepages() */
int por_doing;  /* recovery is doing or not */
int on_build_free_nids; /* build_free_nids is doing */
+   struct task_struct *cp_task;/* checkpoint task */
 
/* for orphan inode management */
struct list_head orphan_inode_list; /* orphan inode list */
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index bd79bbe..3b20359 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -597,6 +597,10 @@ static void f2fs_end_io_write(struct bio *bio, int err)
 
if (p->is_sync)
complete(p->wait);
+
+   if (!get_pages(p->sbi, F2FS_WRITEBACK) && p->sbi->cp_task)
+   wake_up_process(p->sbi->cp_task);
+
kfree(p);
bio_put(bio);
 }
-- 
1.7.7

Regards,
Gu 

> 
> 
>>> This is a problem here, especially, when sync a large number of small files 
>>> or dirs.
>>> In order to avoid this, a wait_list is introduced, 
>>> the checkpoint thread will be dropped into the wait_list if the pages have 
>>> not been written back, 
>>> and will be waked up by contrast.
> 
>> Please pay some attention to the mail form, this mail is out of format in my 
>> mail client.
> 
>> Regards,
>> Gu
> 
> Regar

Re: [RFC/query] kvm async_pf anon pined pages migration

2013-10-10 Thread Gu Zheng
Hi Gleb,

On 10/10/2013 03:15 PM, Gleb Natapov wrote:

> On Thu, Oct 10, 2013 at 03:05:58PM +0800, chai wen wrote:
>> On 10/08/2013 03:39 PM, Gleb Natapov wrote:
>>> On Tue, Oct 08, 2013 at 02:58:22PM +0800, chai wen wrote:
>>>> On 10/02/2013 12:04 AM, chaiwen wrote:
>>>>> On 09/30/2013 08:51 PM, Gleb Natapov wrote:
>>>>>> On Mon, Sep 30, 2013 at 06:03:07PM +0800, chai wen wrote:
>>>>>>> Hi all
>>>>>>>
>>>>>>> Async page fault in kvm currently pins user pages via get_user_pages.
>>>>>>> When doing page migration, the method can be found via
>>>>>>> page->mapping->a_ops->migratepage to offline old pages and migrate to
>>>>>>> new pages. As to an anonymous page, there is no file mapping but an
>>>>>>> anon_vma. So the migration will fall back to some *default* migration
>>>>>>> method. Anon pages that have been pinned in memory for some reason
>>>>>>> could fail in the migration processing because of things like
>>>>>>> ref-count checking.
>>>>>>> (or do I misunderstand something?)
>>>>>>>
>>>>>>> Now we want to make these anon pages in async_pf migratable. I tried
>>>>>>> some ways, but there are still many problems. The following is one
>>>>>>> that replaces the mapping of the anon page arbitrarily and does some
>>>>>>> things based on it.
>>>>>>> A kvm-based virtual machine works with this patch, but I have no
>>>>>>> experience of offlining pages because of the limitation of resources.
>>>>>>> I'll check it later.
>>>>>>>
>>>>>>> I don't know whether it is the right direction for this issue.
>>>>>>> All comments/criticism are welcome.
>>>>>> The pinning is not mandatory and can (and probably should) be dropped,
>>>>>> but
>>>>>> pinning that is done by async page faults is short lived. What problems
>>>>>> are you seeing that warrant the complexity of handling their migration?
>>>> Hi Gleb
>>>>
>>>> As to this issue, I still have something that is not very clear.
>>>> Suppose the page pinning is successfully held (although not mandatory)
>>>> by async page fault,
>>>> and at the same time page migration happens because of a memory
>>>> hot-remove action.
>>>> There is a 120*HZ timeout setting in common page offline processing;
>>>> could it fail with
>>>> the migration of these async_pf pinned pages?
>>>> What's your opinion about this? If it may fail under this
>>>> circumstance, should we do
>>>> something about it?

>>> 120 seconds is more than enough time for pinning to go away, but as I
>>> said the pinning is not even necessary. Patch to remove it is welcomed.
>> Thank you for your clarification !  I've got it. we will still work on it.
>>
> Should be extremely easy. Drop FOLL_GET from GUP in async_pf_execute().

One more basic question: why is pinning the page not necessary here?

Thanks,
Gu

> 
> --
>   Gleb.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 




Re: [RFC/query] kvm async_pf anon pined pages migration

2013-10-10 Thread Gu Zheng
On 10/10/2013 04:01 PM, Gleb Natapov wrote:

> On Thu, Oct 10, 2013 at 03:53:08PM +0800, Gu Zheng wrote:
>> Hi Gleb,
>>
>> On 10/10/2013 03:15 PM, Gleb Natapov wrote:
>>
>>> On Thu, Oct 10, 2013 at 03:05:58PM +0800, chai wen wrote:
>>>> On 10/08/2013 03:39 PM, Gleb Natapov wrote:
>>>>> On Tue, Oct 08, 2013 at 02:58:22PM +0800, chai wen wrote:
>>>>>> On 10/02/2013 12:04 AM, chaiwen wrote:
>>>>>>> On 09/30/2013 08:51 PM, Gleb Natapov wrote:
>>>>>>>> On Mon, Sep 30, 2013 at 06:03:07PM +0800, chai wen wrote:
>>>>>>>>> Hi all
>>>>>>>>>
>>>>>>>>> Async page fault in kvm currently pin user pages via get_user_pages.
>>>>>>>>> when doing page migration,the method can be found via
>>>>>>>>> page->mmapping->a_ops->migratepage to offline old pages and migrate to
>>>>>>>>> new pages. As to anonymous page there is no file mapping but a 
>>>>>>>>> anon_vma.So
>>>>>>>>> the migration will fall back to some *default* migration method.Anon 
>>>>>>>>> pages
>>>>>>>>> that have been pined in memory by some reasons could be failed in the 
>>>>>>>>> migration
>>>>>>>>> processing because of some reasons like ref-count checking.
>>>>>>>>> (or I misunderstand some thing?)
>>>>>>>>>
>>>>>>>>> Now we want to make these anon pages in async_pf can be migrated, I 
>>>>>>>>> try some
>>>>>>>>> ways.But there are still many problems. The following is one that 
>>>>>>>>> replaceing
>>>>>>>>> the mapping of anon page arbitrarily and doing some thing based on it.
>>>>>>>>> Kvm-based virtual machine can works on this patch,but have no 
>>>>>>>>> experience of
>>>>>>>>> offline pages because of the limitaion of resouces.I'll check it 
>>>>>>>>> later.
>>>>>>>>>
>>>>>>>>> I don't know weather it is a right direction of this issue.
>>>>>>>>> All comments/criticize are welcomed.
>>>>>>>> The pinning is not mandatory and can (and probably should) be dropped, 
>>>>>>>> but
>>>>>>>> pinning that is done by async page faults is short lived. What problems
>>>>>>>> are you seeing that warrant the complexity of handling their migration?
>>>>>> Hi Gleb
>>>>>>
>>>>>> As to this issue, I still have some thing not very clear.
>>>>>> If pages pinning is successfully holding (although not mandatory) by
>>>>>> async page fault.
>>>>>> And at the same time page migration happens because of memory
>>>>>> hot-remove action.
>>>>>> It has 120*hz timeout setting in common page offline processing,
>>>>>> could it fail with
>>>>>> these async_pf pined pages migration ?
>>>>>> What's your opinion about this ?   If it may fail under this
>>>>>> circumstance, should we do
>>>>>> some thing on it ?
>>>>>>
>>>>> 120 seconds is more than enough time for pinning to go away, but as I
>>>>> said the pinning is not even necessary. Patch to remove it is welcomed.
>>>> Thank you for your clarification !  I've got it. we will still work on it.
>>>>
>>> Should be extremely easy. Drop FOLL_GET from GUP in async_pf_execute().
>>
>> One lower question, why pinning page is not necessary here?
>>
> The purpose of GUP here is to bring page from swap, the page itself is
> never used directly by async pf code. The page is used when guest
> accesses it next time, but that code path does its own GUP.

Got it, thanks for your explanation. :)

Regards,
Gu

> 
> --
>   Gleb.
> 




Re: [f2fs-dev] [PATCH v2] f2fs: avoid congestion_wait when do_checkpoint for better performance

2013-10-10 Thread Gu Zheng
Hi Jin,

On 10/10/2013 04:09 PM, Jin Xu wrote:

> Hi Gu,
> 
> I have a comment below.
> 
>> Date: Wed, 9 Oct 2013 12:04:09 +0800
>> From: guz.f...@cn.fujitsu.com
>> To: yuan.mark.zh...@samsung.com
>> CC: jaegeuk@samsung.com; linux-f2fs-de...@lists.sourceforge.net; 
>> linux-kernel@vger.kernel.org; linux-fsde...@vger.kernel.org; 
>> shu@samsung.com
>> Subject: Re: [f2fs-dev] [PATCH v2] f2fs: avoid congestion_wait when 
>> do_checkpoint for better performance
>>
>> Hi Yuan,
>> On 10/08/2013 07:30 PM, Yuan Zhong wrote:
>>
> ...
>>
>> Signed-off-by: Gu Zheng 
>> ---
>> fs/f2fs/checkpoint.c | 11 +--
>> fs/f2fs/f2fs.h | 1 +
>> fs/f2fs/segment.c | 4 
>> 3 files changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
>> index d808827..2a5999d 100644
>> --- a/fs/f2fs/checkpoint.c
>> +++ b/fs/f2fs/checkpoint.c
>> @@ -757,8 +757,15 @@ static void do_checkpoint(struct f2fs_sb_info *sbi, 
>> bool is_umount)
>> f2fs_put_page(cp_page, 1);
>>
>> /* wait for previous submitted node/meta pages writeback */
>> - while (get_pages(sbi, F2FS_WRITEBACK))
>> - congestion_wait(BLK_RW_ASYNC, HZ / 50);
>> + sbi->cp_task = current;
>> + while (get_pages(sbi, F2FS_WRITEBACK)) {
>> + set_current_state(TASK_UNINTERRUPTIBLE);
>> + if (!get_pages(sbi, F2FS_WRITEBACK))
>> + break;
>> + io_schedule();
>> + }
>> + __set_current_state(TASK_RUNNING);
>> + sbi->cp_task = NULL;
>>
>> filemap_fdatawait_range(sbi->node_inode->i_mapping, 0, LONG_MAX);
>> filemap_fdatawait_range(sbi->meta_inode->i_mapping, 0, LONG_MAX);
>> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
>> index a955a59..408ace7 100644
>> --- a/fs/f2fs/f2fs.h
>> +++ b/fs/f2fs/f2fs.h
>> @@ -365,6 +365,7 @@ struct f2fs_sb_info {
>> struct mutex writepages; /* mutex for writepages() */
>> int por_doing; /* recovery is doing or not */
>> int on_build_free_nids; /* build_free_nids is doing */
>> + struct task_struct *cp_task; /* checkpoint task */
>>
>> /* for orphan inode management */
>> struct list_head orphan_inode_list; /* orphan inode list */
>> diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
>> index bd79bbe..3b20359 100644
>> --- a/fs/f2fs/segment.c
>> +++ b/fs/f2fs/segment.c
>> @@ -597,6 +597,10 @@ static void f2fs_end_io_write(struct bio *bio, int err)
>>
>> if (p->is_sync)
>> complete(p->wait);
>> +
>> + if (!get_pages(p->sbi, F2FS_WRITEBACK) && p->sbi->cp_task)
>> + wake_up_process(p->sbi->cp_task);
> 
> There is a risk of dereferencing a NULL pointer because here simply comparing 
> the
> cp_task against NULL is not enough to avoid race in multi-thread environment.
> Another thread could have assigned it to NULL in the window between the 
> comparison
> and waking up.

That cannot happen: the checkpoint routine is always a singleton, protected
by cp_mutex and cp_rwsem.
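
For reference, the window Jin describes is the generic check-then-use pattern:
the pointer is read once for the NULL test and again for the call. A minimal
userspace sketch of closing that window by loading the pointer a single time
follows. It assumes C11 atomics in place of the kernel's accessors; struct
task, wake_task() and wake_cp_task() are hypothetical stand-ins, and the
sketch does not model task lifetime or the cp_mutex/cp_rwsem protection
mentioned above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct task_struct. */
struct task {
    atomic_int wakeups;
};

/* Hypothetical stand-in for the relevant part of struct f2fs_sb_info. */
struct sb_info {
    _Atomic(struct task *) cp_task;   /* NULL when no checkpoint is waiting */
};

/* Stand-in for wake_up_process(); like the real one, it must not get NULL. */
void wake_task(struct task *t)
{
    assert(t != NULL);
    atomic_fetch_add(&t->wakeups, 1);
}

/*
 * Load cp_task exactly once and use only the local copy, so a concurrent
 * store of NULL cannot land between the test and the use.
 * Returns true if a waiter was woken.
 */
bool wake_cp_task(struct sb_info *sbi)
{
    struct task *t = atomic_load(&sbi->cp_task);

    if (t == NULL)
        return false;
    wake_task(t);
    return true;
}
```

With the single load, a racing store of NULL can at worst cause one spurious
wakeup of a task that was valid when it was read; it can no longer pass NULL
to the wake function.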

Thanks,
Gu

> 
> Regards,
> Jin
>> +
>> kfree(p);
>> bio_put(bio);
>> }
>> --
>> 1.7.7
>>
>> Regards,
>> Gu
>>
>> >
>> >
>> >>> This is a problem here, especially, when sync a large number of small 
>> >>> files or dirs.
>> >>> In order to avoid this, a wait_list is introduced,
>> >>> the checkpoint thread will be dropped into the wait_list if the pages 
>> >>> have not been written back,
>> >>> and will be waked up by contrast.
>> >
>> >> Please pay some attention to the mail form, this mail is out of format in 
>> >> my mail client.
>> >
>> >> Regards,
>> >> Gu
>> >
>> > Regards,
>> > Yuan
>> >
>> >>>
>> >>> Signed-off-by: Yuan Zhong 
>> >>> ---
>> >>> fs/f2fs/checkpoint.c | 3 +--
>> >>> fs/f2fs/f2fs.h | 19 +++
>> >>> fs/f2fs/segment.c | 1 +
>> >>> fs/f2fs/super.c | 1 +
>> >>> 4 files changed, 22 insertions(+), 2 deletions(-)
>> >>>
>> >>> diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
>> >>> index ca39442..5d69ae0 100644
>> >>> --- a/fs/f2fs/checkpoint.c
>> >>> +++ b/fs/f2fs/checkpoint.c
>> >>> @@ -758,8 +758,7 @@ static void do_checkpoint(struct f2fs_sb_info *sbi, 
>> >>> bool is_umount)
>> >>> f2fs_put_page(cp_page, 1);
>> >

Re: [f2fs-dev] [PATCH v2] f2fs: avoid congestion_wait when do_checkpoint for better performance

2013-10-10 Thread Gu Zheng
Hi Jin,
On 10/11/2013 07:54 AM, Jin Xu wrote:

>> Date: Thu, 10 Oct 2013 16:11:53 +0800
>> From: guz.f...@cn.fujitsu.com
>> To: jinuxst...@live.com
>> CC: yuan.mark.zh...@samsung.com; jaegeuk@samsung.com; 
>> linux-f2fs-de...@lists.sourceforge.net; linux-kernel@vger.kernel.org; 
>> linux-fsde...@vger.kernel.org; shu@samsung.com
>> Subject: Re: [f2fs-dev] [PATCH v2] f2fs: avoid congestion_wait when 
>> do_checkpoint for better performance
>>
>> Hi Jin,
>>
>> On 10/10/2013 04:09 PM, Jin Xu wrote:
>>
>> > Hi Gu,
>> >
>> > I have a comment below.
>> >
>> >> Date: Wed, 9 Oct 2013 12:04:09 +0800
>> >> From: guz.f...@cn.fujitsu.com
>> >> To: yuan.mark.zh...@samsung.com
>> >> CC: jaegeuk@samsung.com; linux-f2fs-de...@lists.sourceforge.net; 
>> >> linux-kernel@vger.kernel.org; linux-fsde...@vger.kernel.org; 
>> >> shu@samsung.com
>> >> Subject: Re: [f2fs-dev] [PATCH v2] f2fs: avoid congestion_wait when 
>> >> do_checkpoint for better performance
>> >>
>> >> Hi Yuan,
>> >> On 10/08/2013 07:30 PM, Yuan Zhong wrote:
>> >>
>> > ...
>> >>
>> >> Signed-off-by: Gu Zheng 
>> >> ---
>> >> fs/f2fs/checkpoint.c | 11 +--
>> >> fs/f2fs/f2fs.h | 1 +
>> >> fs/f2fs/segment.c | 4 
>> >> 3 files changed, 14 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
>> >> index d808827..2a5999d 100644
>> >> --- a/fs/f2fs/checkpoint.c
>> >> +++ b/fs/f2fs/checkpoint.c
>> >> @@ -757,8 +757,15 @@ static void do_checkpoint(struct f2fs_sb_info *sbi, 
>> >> bool is_umount)
>> >> f2fs_put_page(cp_page, 1);
>> >>
>> >> /* wait for previous submitted node/meta pages writeback */
>> >> - while (get_pages(sbi, F2FS_WRITEBACK))
>> >> - congestion_wait(BLK_RW_ASYNC, HZ / 50);
>> >> + sbi->cp_task = current;
>> >> + while (get_pages(sbi, F2FS_WRITEBACK)) {
>> >> + set_current_state(TASK_UNINTERRUPTIBLE);
>> >> + if (!get_pages(sbi, F2FS_WRITEBACK))
>> >> + break;
>> >> + io_schedule();
>> >> + }
>> >> + __set_current_state(TASK_RUNNING);
>> >> + sbi->cp_task = NULL;
>> >>
>> >> filemap_fdatawait_range(sbi->node_inode->i_mapping, 0, LONG_MAX);
>> >> filemap_fdatawait_range(sbi->meta_inode->i_mapping, 0, LONG_MAX);
>> >> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
>> >> index a955a59..408ace7 100644
>> >> --- a/fs/f2fs/f2fs.h
>> >> +++ b/fs/f2fs/f2fs.h
>> >> @@ -365,6 +365,7 @@ struct f2fs_sb_info {
>> >> struct mutex writepages; /* mutex for writepages() */
>> >> int por_doing; /* recovery is doing or not */
>> >> int on_build_free_nids; /* build_free_nids is doing */
>> >> + struct task_struct *cp_task; /* checkpoint task */
>> >>
>> >> /* for orphan inode management */
>> >> struct list_head orphan_inode_list; /* orphan inode list */
>> >> diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
>> >> index bd79bbe..3b20359 100644
>> >> --- a/fs/f2fs/segment.c
>> >> +++ b/fs/f2fs/segment.c
>> >> @@ -597,6 +597,10 @@ static void f2fs_end_io_write(struct bio *bio, int 
>> >> err)
>> >>
>> >> if (p->is_sync)
>> >> complete(p->wait);
>> >> +
>> >> + if (!get_pages(p->sbi, F2FS_WRITEBACK) && p->sbi->cp_task)
>> >> + wake_up_process(p->sbi->cp_task);
>> >
>> > There is a risk of dereferencing a NULL pointer because here simply 
>> > comparing the
>> > cp_task against NULL is not enough to avoid race in multi-thread 
>> > environment.
>> > Another thread could have assigned it to NULL in the window between the 
>> > comparison
>> > and waking up.
>>
>> Can not be that, checkpoint routine is always singleton and protected by 
>> cp_mutex and cp_rwsem.
>> 
> 
> The race could happen like this for example:
> On a SMP environment, thread 1 wakes up the checkpoint thread, then
> thread 2 comes to the f2fs_end_io_write, compared the cp_task as not NULL,
> but at the same time, the checkpoint thread just assigned the cp_task to NULL.
> When thread 2 gets to the wake_up_process, dereferencing to NULL pointer
> happens.

The case 

Re: [Bug report] Warning when hot-add an ACPI0004 device.

2013-09-25 Thread Gu Zheng
Hi Toshi,

On 09/26/2013 06:24 AM, Toshi Kani wrote:

> On Wed, 2013-09-25 at 10:31 +0000, Gu Zheng wrote:
>> Hi Toshi,
>>
>> On 09/12/2013 11:11 PM, Toshi Kani wrote:
>>
>>> On Thu, 2013-09-12 at 13:00 +0800, Tang Chen wrote:
>>>> Hi Rafael, Toshi,
>>>>
>>>> When we hot-add an ACPI0004 device, we got the following warning:
>>>>
>>>>acpi ACPI0004:01: Attempt to re-insert
>>>>
>>>> The ACPI0004 device is a System Board in Fujitsu server, which has two
>>>> numa nodes (processors and memory).
>>>>
>>>> It seems that we reserved the ACPI_NOTIFY_DEVICE_CHECK event twice in
>>>> acpi_hotplug_notify_cb().
>>>>
>>>>
>>>> According to bisect, this happens after the following commit:
>>>>
>>>>  From 68a67f6c78b80525d9b3c6672e7782de95e56a83 Mon Sep 17 00:00:00 2001
>>>> From: "Rafael J. Wysocki" 
>>>> Date: Sun, 3 Mar 2013 23:05:55 +0100
>>>> Subject: [PATCH 1/1] ACPI / container: Use common hotplug code
>>>>
>>>> Switch the ACPI container driver to using common device hotplug code
>>>> introduced previously.  This reduces the driver down to a trivial
>>>> definition and registration of a struct acpi_scan_handler object.
>>>>
>>>> Signed-off-by: Rafael J. Wysocki 
>>>> Acked-by: Toshi Kani 
>>>> Tested-by: Toshi Kani 
>>>> ---
>>>>   drivers/acpi/container.c | 146 
>>>> ---
>>>>   1 file changed, 10 insertions(+), 136 deletions(-)
>>>>
>>>>
>>>> I'm now investigating this problem. If you have any idea about why this
>>>> happens, please let me know.
>>>
>>> With the above change, container devices use the common notify handler,
>>> which logs the warning message in question when it receives device check
>>> twice on a same device.  Before the change, the container-specific
>>> notify handler did not log this message in the same case (but considered
>>> it as an eject request).
>>>
>>> So, I suspect that you are getting device check twice regardless of the
>>> kernel change.  Can you check KERN_DEBUG messages to see if that is the
>>> case?  The notify handler logs all events with KERN_DEBUG.
>>
>> Follow your suggestion, we confirm that it really received ACPI_NOTIFY_
>> DEVICE_CHECK event*twice*, but the original ACPI container driver only
>> received once, does the common device hotplug code introduce another device
>> check? any idea?
>>
>> Container uses common device hotplug code:
>> [  142.937724] IPv6: ADDRCONF(NETDEV_CHANGE): eth8: link becomes ready
>> [  674.975575] ACPI: \_SB_.LSB1: ACPI_NOTIFY_DEVICE_CHECK event  <<<<
> 
> acpi_hotplug_notify_cb() calls acpi_os_hotplug_execute() to schedule to
> run acpi_scan_device_check() asynchronously and returns immediately.
> This leads acpi_ev_asynch_enable_gpe() to run next, which clears this
> GPE (if level triggered) and re-enable GPE.

Thanks for your explanation; it really is the sequence you described above.

> 
>> [  674.991604] ACPI: \_SB_.LSB1: ACPI_NOTIFY_DEVICE_CHECK event  
>>  <<<<   
> 
> It appears that re-enabling GPE caused this GPE to show up again as a
> spurious interrupt.

Yes, it is.

> 
>> [  675.613990] ACPI: PCI Root Bridge [UNC2] (domain  [bus fd])
>> [  675.684970] acpi PNP0A03:01: ACPI _OSC support notification failed, 
>> disabling PCIe ASPM
>> [  675.780957] acpi PNP0A03:01: Unable to request _OSC control (_OSC support 
>> mask: 0x08)
>> [  675.874806] ACPI _OSC control for PCIe not granted, disabling ASPM
>> [  675.949005] pci_bus :fd: Allocating resources
>> [  675.960145] ACPI: PCI Root Bridge [UNC3] (domain  [bus fc])
>> [  676.031176] acpi PNP0A03:02: ACPI _OSC support notification failed, 
>> disabling PCIe ASPM
>> [  676.127129] acpi PNP0A03:02: Unable to request _OSC control (_OSC support 
>> mask: 0x08)
>> [  676.220943] ACPI _OSC control for PCIe not granted, disabling ASPM
>> [  676.295019] pci_bus :fc: Allocating resources
>>
>> Original ACPI container driver:
>> [ 1526.122933] Container driver received ACPI_NOTIFY_DEVICE_CHECK event <<<<
> 
> In the original code, container_notify_cb() proceeds the device check
> handling and then calls _OST on the same thread.  It then re-enable GPE.

According to our debug, the whole routine was executed on the 

Re: [Bug report] Warning when hot-add an ACPI0004 device.

2013-09-25 Thread Gu Zheng
Hi Rafael,

On 09/25/2013 10:36 PM, Rafael J. Wysocki wrote:

> On Wednesday, September 25, 2013 06:31:09 PM Gu Zheng wrote:
>> Hi Toshi,
>>
>> On 09/12/2013 11:11 PM, Toshi Kani wrote:
>>
>>> On Thu, 2013-09-12 at 13:00 +0800, Tang Chen wrote:
>>>> Hi Rafael, Toshi,
>>>>
>>>> When we hot-add an ACPI0004 device, we got the following warning:
>>>>
>>>>acpi ACPI0004:01: Attempt to re-insert
<...>
>>>>
>>>>
>>>> I'm now investigating this problem. If you have any idea about why this
>>>> happens, please let me know.
>>>
>>> With the above change, container devices use the common notify handler,
>>> which logs the warning message in question when it receives device check
>>> twice on a same device.  Before the change, the container-specific
>>> notify handler did not log this message in the same case (but considered
>>> it as an eject request).
>>>
>>> So, I suspect that you are getting device check twice regardless of the
>>> kernel change.  Can you check KERN_DEBUG messages to see if that is the
>>> case?  The notify handler logs all events with KERN_DEBUG.
>>
>> Follow your suggestion, we confirm that it really received ACPI_NOTIFY_
>> DEVICE_CHECK event*twice*, but the original ACPI container driver only
>> received once, does the common device hotplug code introduce another device
>> check? any idea?
> 
> Well, we couldn't possibly make the BIOS generate the event twice unless
> there's an _OST response missing somewhere or similar.
> 
> In any case the second event should be harmless.

Yes, it's harmless, but the message is not very friendly.

> 
>> Container uses common device hotplug code:
>> [  142.937724] IPv6: ADDRCONF(NETDEV_CHANGE): eth8: link becomes ready
>> [  674.975575] ACPI: \_SB_.LSB1: ACPI_NOTIFY_DEVICE_CHECK event  <<<<
>> [  674.991604] ACPI: \_SB_.LSB1: ACPI_NOTIFY_DEVICE_CHECK event  
>>  <<<<   
> 
> Where exactly did you put that printk()?

It's the acpi_handle_debug() call in acpi_hotplug_notify_cb():
	case ACPI_NOTIFY_DEVICE_CHECK:
		acpi_handle_debug(handle, "ACPI_NOTIFY_DEVICE_CHECK event\n");
		callback = acpi_scan_device_check;
		break;

> 
>> [  675.613990] ACPI: PCI Root Bridge [UNC2] (domain  [bus fd])
>> [  675.684970] acpi PNP0A03:01: ACPI _OSC support notification failed, 
>> disabling PCIe ASPM
>> [  675.780957] acpi PNP0A03:01: Unable to request _OSC control (_OSC support 
>> mask: 0x08)
>> [  675.874806] ACPI _OSC control for PCIe not granted, disabling ASPM
>> [  675.949005] pci_bus :fd: Allocating resources
>> [  675.960145] ACPI: PCI Root Bridge [UNC3] (domain  [bus fc])
>> [  676.031176] acpi PNP0A03:02: ACPI _OSC support notification failed, 
>> disabling PCIe ASPM
>> [  676.127129] acpi PNP0A03:02: Unable to request _OSC control (_OSC support 
>> mask: 0x08)
>> [  676.220943] ACPI _OSC control for PCIe not granted, disabling ASPM
>> [  676.295019] pci_bus :fc: Allocating resources
>>
>> Original ACPI container driver:
>> [ 1526.122933] Container driver received ACPI_NOTIFY_DEVICE_CHECK event <<<<
> 
> And that?

It seems that the original ACPI container driver can avoid the second event.

The debug printk is added in the original container_notify_cb():
	case ACPI_NOTIFY_DEVICE_CHECK:
		printk("Container driver received %s event\n",
		       (type == ACPI_NOTIFY_BUS_CHECK) ?
		       "ACPI_NOTIFY_BUS_CHECK" : "ACPI_NOTIFY_DEVICE_CHECK");

		present = is_device_present(handle);
		status = acpi_bus_get_device(handle, &device);
		if (device)
			printk("===attemp to reinsert!\n");
		if (!present) {

Best regards,
Gu

> 
>> [ 1526.800646] ACPI: PCI Root Bridge [UNC2] (domain  [bus fd])
>> [ 1526.871682] acpi PNP0A03:01: ACPI _OSC support notification failed, 
>> disabling PCIe ASPM
>> [ 1526.967878] acpi PNP0A03:01: Unable to request _OSC control (_OSC support 
>> mask: 0x08)
>> [ 1527.061891] ACPI _OSC control for PCIe not granted, disabling ASPM
>> [ 1527.136036] pci_bus :fd: Allocating resources
>> [ 1527.150747] ACPI: PCI Root Bridge [UNC3] (domain  [bus fc])
>> [ 1527.221821] acpi PNP0A03:02: ACPI _OSC support notification failed, 
>> disabling PCIe ASPM
>> [ 1527.317738] acpi PNP0A03:02: Unable to request _OSC control (_OSC support 
>> mask: 0x08)
>> [ 1527.411795] ACPI _OSC control for PCIe not granted, disabling ASPM
>> [ 1527.485917] pci_bus :fc: Allocating resources
> 
> Thanks,
> Rafael
> 
> 




[f2fs-dev] [PATCH] f2fs: use rw_sem instead of fs_lock(locks mutex)

2013-09-26 Thread Gu Zheng
The fs_locks are used to block other operations (e.g. recovery) while a
checkpoint is in progress, and every operation other than checkpoint has to
acquire one of them. The problem is that when too many threads try to
acquire an fs_lock concurrently, they block each other, which can hurt
performance.
Several optimization patches have already improved the usage of fs_lock,
but the thorough solution is to replace fs_lock with a *rw_sem*.
The checkpoint routine takes the semaphore for write and all other
operations take it for read. Checkpoint still blocks the other operations
(e.g. recovery), but those operations no longer disturb each other, which
avoids the problem described above completely.

Thanks to Kim for the review and testing; tests from others are also welcome.

Reviewed-by: Jaegeuk Kim 
Tested-by: Jaegeuk Kim 
Signed-off-by: Gu Zheng 
---
 fs/f2fs/checkpoint.c |7 ++---
 fs/f2fs/data.c   |   11 -
 fs/f2fs/f2fs.h   |   52 -
 fs/f2fs/file.c   |   37 ---
 fs/f2fs/inode.c  |   11 -
 fs/f2fs/namei.c  |   50 +++-
 fs/f2fs/recovery.c   |7 ++---
 fs/f2fs/super.c  |4 +--
 fs/f2fs/xattr.c  |7 +
 9 files changed, 69 insertions(+), 117 deletions(-)

diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index ca39442..1f3bd47 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -619,11 +619,10 @@ static void block_operations(struct f2fs_sb_info *sbi)
blk_start_plug(&plug);
 
 retry_flush_dents:
-   mutex_lock_all(sbi);
-
+   write_lock_cp_rwsem(sbi);
/* write all the dirty dentry pages */
if (get_pages(sbi, F2FS_DIRTY_DENTS)) {
-   mutex_unlock_all(sbi);
+   write_unlock_cp_rwsem(sbi);
sync_dirty_dir_inodes(sbi);
goto retry_flush_dents;
}
@@ -646,7 +645,7 @@ retry_flush_nodes:
 static void unblock_operations(struct f2fs_sb_info *sbi)
 {
mutex_unlock(&sbi->node_write);
-   mutex_unlock_all(sbi);
+   write_unlock_cp_rwsem(sbi);
 }
 
 static void do_checkpoint(struct f2fs_sb_info *sbi, bool is_umount)
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 941f9b9..5f9ddc1 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -560,9 +560,9 @@ write:
inode_dec_dirty_dents(inode);
err = do_write_data_page(page);
} else {
-   int ilock = mutex_lock_op(sbi);
+   read_lock_cp_rwsem(sbi);
err = do_write_data_page(page);
-   mutex_unlock_op(sbi, ilock);
+   read_unlock_cp_rwsem(sbi);
need_balance_fs = true;
}
if (err == -ENOENT)
@@ -641,7 +641,6 @@ static int f2fs_write_begin(struct file *file, struct 
address_space *mapping,
pgoff_t index = ((unsigned long long) pos) >> PAGE_CACHE_SHIFT;
struct dnode_of_data dn;
int err = 0;
-   int ilock;
 
f2fs_balance_fs(sbi);
 repeat:
@@ -650,7 +649,7 @@ repeat:
return -ENOMEM;
*pagep = page;
 
-   ilock = mutex_lock_op(sbi);
+   read_lock_cp_rwsem(sbi);
 
set_new_dnode(&dn, inode, NULL, NULL, 0);
err = get_dnode_of_data(&dn, index, ALLOC_NODE);
@@ -664,7 +663,7 @@ repeat:
if (err)
goto err;
 
-   mutex_unlock_op(sbi, ilock);
+   read_unlock_cp_rwsem(sbi);
 
if ((len == PAGE_CACHE_SIZE) || PageUptodate(page))
return 0;
@@ -700,7 +699,7 @@ out:
return 0;
 
 err:
-   mutex_unlock_op(sbi, ilock);
+   read_unlock_cp_rwsem(sbi);
f2fs_put_page(page, 1);
return err;
 }
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 7fd99d8..0836e98 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -318,14 +318,6 @@ enum count_type {
 };
 
 /*
- * Uses as sbi->fs_lock[NR_GLOBAL_LOCKS].
- * The checkpoint procedure blocks all the locks in this fs_lock array.
- * Some FS operations grab free locks, and if there is no free lock,
- * then wait to grab a lock in a round-robin manner.
- */
-#define NR_GLOBAL_LOCKS8
-
-/*
  * The below are the page types of bios used in submti_bio().
  * The available types are:
  * DATAUser data pages. It operates as async mode.
@@ -365,10 +357,9 @@ struct f2fs_sb_info {
struct f2fs_checkpoint *ckpt;   /* raw checkpoint pointer */
struct inode *meta_inode;   /* cache meta blocks */
struct mutex cp_mutex;  /* checkpoint procedure lock */
-   struct mutex fs_lock[NR_GLOBAL_LOCKS];  /* blocking FS operations */
+   struct rw_semaphore cp_rwsem;   /* blocking FS operations */
struct mutex node_write;/* locking node writes */
struct mutex w

Re: [f2fs-dev] [PATCH] f2fs: use rw_sem instead of fs_lock(locks mutex)

2013-09-26 Thread Gu Zheng
Hi Jin,
Thanks for your comments.
On 09/26/2013 08:31 PM, Jin Xu wrote:

> Great to see fs_locks is to be replaced. :)
> 
> There is a potential problem with using r/w semaphore this way. The
> thread doing checkpoint might get starved if other threads are
> intensively locking the read semaphore for I/O. 

Yes, that is a weakness of r/w semaphores.

> I noticed that Josef
> introduced a rwsem_is_contended for solving similar issue happening
> in btrfs recently. Maybe it's not a problem for f2fs. I haven't
> verified that yet.

Every user of r/w semaphores faces this problem, not only btrfs and f2fs.
IMHO it's easy to fix, along the lines of Josef's rwsem_is_contended; I'll
fix this in the next version. If rwsem_is_contended is merged into the
mainline kernel, we can consider using the common function.

> 
> In addition, maybe we should avoid using the "rwsem" or "mutex" as
> part of the routine name to make the name independent of actual lock
> mechanism. It seems better using:
> f2fs_lock/unlock_all instead of write_lock/unlock_cp_rwsem
> f2fs_lock/unlock_op instead of read_lock/unlock_cp_rwsem

Got it, the original name seems more suitable.

Best regards,
Gu

> 
> Regards,
> Jin
> 
> On 26/09/2013 17:40, Gu Zheng wrote:
>> The fs_locks is used to block other ops(ex, recovery) when doing checkpoint.
>> And each other operate routine(besides checkpoint) needs to acquire a 
>> fs_lock,
>> there is a terrible problem here, if these are too many concurrency threads 
>> acquiring
>> fs_lock, so that they will block each other and may lead to some performance 
>> problem.
>> But this is not the phenomenon we want to see.
>> Though there are some optimization patches introduced to enhance the usage 
>> of fs_lock,
>> but the thorough solution is using a *rw_sem* to replace the fs_lock.
>> Checkpoint routine takes write_sem, and other ops take read_sem, so that we 
>> can block
>> other ops(ex, recovery) when doing checkpoint, and other ops will not 
>> disturb each other,
>> this can avoid the problem described above completely.
>>
>> Thanks to Kim's review and test, and other guys' test is also welcome.
>>
>> Reviewed-by: Jaegeuk Kim 
>> Tested-by: Jaegeuk Kim 
>> Signed-off-by: Gu Zheng 
>> ---
>>   fs/f2fs/checkpoint.c |7 ++---
>>   fs/f2fs/data.c   |   11 -
>>   fs/f2fs/f2fs.h   |   52 
>> -
>>   fs/f2fs/file.c   |   37 ---
>>   fs/f2fs/inode.c  |   11 -
>>   fs/f2fs/namei.c  |   50 
>> +++-
>>   fs/f2fs/recovery.c   |7 ++---
>>   fs/f2fs/super.c  |4 +--
>>   fs/f2fs/xattr.c  |7 +
>>   9 files changed, 69 insertions(+), 117 deletions(-)
>>
>> diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
>> index ca39442..1f3bd47 100644
>> --- a/fs/f2fs/checkpoint.c
>> +++ b/fs/f2fs/checkpoint.c
>> @@ -619,11 +619,10 @@ static void block_operations(struct f2fs_sb_info *sbi)
>>   blk_start_plug(&plug);
>>
>>   retry_flush_dents:
>> -mutex_lock_all(sbi);
>> -
>> +write_lock_cp_rwsem(sbi);
> 
> It seems better to use name f2fs_lock_all instead of
> write_lock_cp_rwsem.
> 
>>   /* write all the dirty dentry pages */
>>   if (get_pages(sbi, F2FS_DIRTY_DENTS)) {
>> -mutex_unlock_all(sbi);
>> +write_unlock_cp_rwsem(sbi);
> 
> f2fs_lock_
>>   sync_dirty_dir_inodes(sbi);
>>   goto retry_flush_dents;
>>   }
>> @@ -646,7 +645,7 @@ retry_flush_nodes:
>>   static void unblock_operations(struct f2fs_sb_info *sbi)
>>   {
>>   mutex_unlock(&sbi->node_write);
>> -mutex_unlock_all(sbi);
>> +write_unlock_cp_rwsem(sbi);
>>   }
>>
>>   static void do_checkpoint(struct f2fs_sb_info *sbi, bool is_umount)
>> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
>> index 941f9b9..5f9ddc1 100644
>> --- a/fs/f2fs/data.c
>> +++ b/fs/f2fs/data.c
>> @@ -560,9 +560,9 @@ write:
>>   inode_dec_dirty_dents(inode);
>>   err = do_write_data_page(page);
>>   } else {
>> -int ilock = mutex_lock_op(sbi);
>> +read_lock_cp_rwsem(sbi);
>>   err = do_write_data_page(page);
>> -mutex_unlock_op(sbi, ilock);
>> +read_unlock_cp_rwsem(sbi);
>>   need_balance_fs = true;
>>   }
>>   if (err == -ENOENT)
>> @@ -641,7 +641,6 @@ static int f2f

[f2fs-dev][PATCH V2] f2fs: use rw_sem instead of fs_lock(locks mutex)

2013-09-27 Thread Gu Zheng
The fs_locks are used to block other operations (e.g. recovery) while a
checkpoint is in progress, and every other operation (besides checkpoint)
needs to acquire an fs_lock. The problem is that when too many threads try
to acquire an fs_lock concurrently, they block each other, which may hurt
performance.
Some optimization patches have already improved the usage of fs_lock, but
the thorough solution is to replace fs_lock with a *rw_sem*.
The checkpoint routine takes the write semaphore and other operations take
the read semaphore, so a checkpoint still blocks other operations (e.g.
recovery) while those operations no longer disturb each other. This avoids
the problem described above completely.
Because of a weakness of rw_sem, this change may introduce a potential
problem: the checkpoint thread might get starved if other threads are
intensively locking the read semaphore for I/O. (Pointed out by Xu Jin)
To avoid this, a wait_list is introduced: pending read semaphore operations
are put on the wait_list while the checkpoint thread is waiting for the
write semaphore, and are woken up when the checkpoint thread releases the
write semaphore.
Thanks to Kim for the previous review and testing. Performance test results
for this patch from others would be very welcome.
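
The starvation fix described above can be illustrated outside the kernel.
The following is a minimal userspace sketch using POSIX threads, not the
code from this patch: struct cp_lock and the cp_* functions are hypothetical
names, and a condition variable stands in for the wait_list. New readers are
held back as soon as a writer (the checkpoint) is waiting.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical userspace stand-in for the checkpoint lock. */
struct cp_lock {
	pthread_mutex_t mtx;
	pthread_cond_t  cond;            /* plays the role of the wait_list */
	int             readers;         /* active read-side holders */
	int             writers_waiting; /* checkpoint threads waiting to lock */
	int             writer_active;   /* 1 while a checkpoint holds the lock */
};

static void cp_lock_init(struct cp_lock *l)
{
	pthread_mutex_init(&l->mtx, NULL);
	pthread_cond_init(&l->cond, NULL);
	l->readers = l->writers_waiting = l->writer_active = 0;
}

/* Read side (ordinary operation): wait while a writer holds or wants the lock. */
static void cp_lock_op(struct cp_lock *l)
{
	pthread_mutex_lock(&l->mtx);
	while (l->writer_active || l->writers_waiting > 0)
		pthread_cond_wait(&l->cond, &l->mtx);
	l->readers++;
	pthread_mutex_unlock(&l->mtx);
}

static void cp_unlock_op(struct cp_lock *l)
{
	pthread_mutex_lock(&l->mtx);
	assert(l->readers > 0);
	if (--l->readers == 0)
		pthread_cond_broadcast(&l->cond); /* let a waiting writer in */
	pthread_mutex_unlock(&l->mtx);
}

/* Write side (checkpoint): announce intent first, then wait for readers to drain. */
static void cp_lock_all(struct cp_lock *l)
{
	pthread_mutex_lock(&l->mtx);
	l->writers_waiting++;
	while (l->writer_active || l->readers > 0)
		pthread_cond_wait(&l->cond, &l->mtx);
	l->writers_waiting--;
	l->writer_active = 1;
	pthread_mutex_unlock(&l->mtx);
}

static void cp_unlock_all(struct cp_lock *l)
{
	pthread_mutex_lock(&l->mtx);
	l->writer_active = 0;
	pthread_cond_broadcast(&l->cond); /* wake the held-back readers */
	pthread_mutex_unlock(&l->mtx);
}
```

The point of the sketch is the writers_waiting counter: without it, a steady
stream of readers could keep the reader count above zero forever, which is
the starvation case described above.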

V2:
  -fix the potential starvation problem.
  -use more suitable func name suggested by Xu Jin.

Signed-off-by: Gu Zheng 
---
 fs/f2fs/checkpoint.c |7 ++---
 fs/f2fs/data.c   |   11 
 fs/f2fs/f2fs.h   |   65 -
 fs/f2fs/file.c   |   37 +---
 fs/f2fs/inode.c  |   11 
 fs/f2fs/namei.c  |   50 ++
 fs/f2fs/recovery.c   |7 ++---
 fs/f2fs/super.c  |5 +--
 fs/f2fs/xattr.c  |7 +
 9 files changed, 82 insertions(+), 118 deletions(-)

diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index ca39442..d808827 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -619,11 +619,10 @@ static void block_operations(struct f2fs_sb_info *sbi)
blk_start_plug(&plug);
 
 retry_flush_dents:
-   mutex_lock_all(sbi);
-
+   f2fs_lock_all(sbi);
/* write all the dirty dentry pages */
if (get_pages(sbi, F2FS_DIRTY_DENTS)) {
-   mutex_unlock_all(sbi);
+   f2fs_unlock_all(sbi);
sync_dirty_dir_inodes(sbi);
goto retry_flush_dents;
}
@@ -646,7 +645,7 @@ retry_flush_nodes:
 static void unblock_operations(struct f2fs_sb_info *sbi)
 {
mutex_unlock(&sbi->node_write);
-   mutex_unlock_all(sbi);
+   f2fs_unlock_all(sbi);
 }
 
 static void do_checkpoint(struct f2fs_sb_info *sbi, bool is_umount)
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 941f9b9..2535d3b 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -560,9 +560,9 @@ write:
inode_dec_dirty_dents(inode);
err = do_write_data_page(page);
} else {
-   int ilock = mutex_lock_op(sbi);
+   f2fs_lock_op(sbi);
err = do_write_data_page(page);
-   mutex_unlock_op(sbi, ilock);
+   f2fs_unlock_op(sbi);
need_balance_fs = true;
}
if (err == -ENOENT)
@@ -641,7 +641,6 @@ static int f2fs_write_begin(struct file *file, struct address_space *mapping,
pgoff_t index = ((unsigned long long) pos) >> PAGE_CACHE_SHIFT;
struct dnode_of_data dn;
int err = 0;
-   int ilock;
 
f2fs_balance_fs(sbi);
 repeat:
@@ -650,7 +649,7 @@ repeat:
return -ENOMEM;
*pagep = page;
 
-   ilock = mutex_lock_op(sbi);
+   f2fs_lock_op(sbi);
 
set_new_dnode(&dn, inode, NULL, NULL, 0);
err = get_dnode_of_data(&dn, index, ALLOC_NODE);
@@ -664,7 +663,7 @@ repeat:
if (err)
goto err;
 
-   mutex_unlock_op(sbi, ilock);
+   f2fs_unlock_op(sbi);
 
if ((len == PAGE_CACHE_SIZE) || PageUptodate(page))
return 0;
@@ -700,7 +699,7 @@ out:
return 0;
 
 err:
-   mutex_unlock_op(sbi, ilock);
+   f2fs_unlock_op(sbi);
f2fs_put_page(page, 1);
return err;
 }
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 7fd99d8..fd8add7 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -18,7 +18,8 @@
 #include 
 #include 
 #include 
-
+#include 
+#include 
 /*
  * For mount options
  */
@@ -318,14 +319,6 @@ enum count_type {
 };
 
 /*
- * Uses as sbi->fs_lock[NR_GLOBAL_LOCKS].
- * The checkpoint procedure blocks all the locks in this fs_lock array.
- * Some FS operations grab free locks, and if there is no free lock,
- * then wait to grab a lock in a round-robin manner.
- */
-#define NR_GLOBAL_LOCKS 8
-
-/*
  * The below are the page types of bios used in submti_bio().

Re: [f2fs-dev] [PATCH] f2fs: avoid allocating failure in bio_alloc

2013-09-22 Thread Gu Zheng
On 09/22/2013 03:50 PM, Chao Yu wrote:

> This patch adds the macro MAX_BIO_BLOCKS to limit the value of npages in
> f2fs_bio_alloc. This avoids an allocation failure in bio_alloc when
> npages is larger than BIO_MAX_PAGES.
> 
> Signed-off-by: Yu Chao 


Reviewed-by: Gu Zheng 
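
For readers following along, the clamping that the macro performs can be
sketched in plain C. This is only an illustration under assumptions:
SKETCH_BIO_MAX_PAGES is a stand-in constant (the real BIO_MAX_PAGES comes
from the kernel headers) and max_bio_blocks() is a hypothetical function
name.

```c
#include <assert.h>

/* Assumed stand-in for the kernel's BIO_MAX_PAGES limit. */
#define SKETCH_BIO_MAX_PAGES 256

/* Clamp the number of blocks requested for one bio to the allocator's limit. */
static int max_bio_blocks(int max_hw_blocks)
{
	assert(max_hw_blocks >= 0);
	return max_hw_blocks < SKETCH_BIO_MAX_PAGES ? max_hw_blocks
						      : SKETCH_BIO_MAX_PAGES;
}
```

A request at or below the limit passes through unchanged; anything larger is
cut down to the limit so the allocation is never asked for more vectors than
it can provide.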

> ---
>  fs/f2fs/segment.c |4 +++-
>  fs/f2fs/segment.h |2 ++
>  2 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
> index 09af9c7..bd79bbe 100644
> --- a/fs/f2fs/segment.c
> +++ b/fs/f2fs/segment.c
> @@ -657,6 +657,7 @@ static void submit_write_page(struct f2fs_sb_info *sbi, struct page *page,
>   block_t blk_addr, enum page_type type)
>  {
>   struct block_device *bdev = sbi->sb->s_bdev;
> + int bio_blocks;
>  
>   verify_block_addr(sbi, blk_addr);
>  
> @@ -676,7 +677,8 @@ retry:
>   goto retry;
>   }
>  
> - sbi->bio[type] = f2fs_bio_alloc(bdev, max_hw_blocks(sbi));
> + bio_blocks = MAX_BIO_BLOCKS(max_hw_blocks(sbi));
> + sbi->bio[type] = f2fs_bio_alloc(bdev, bio_blocks);
>   sbi->bio[type]->bi_sector = SECTOR_FROM_BLOCK(sbi, blk_addr);
>   sbi->bio[type]->bi_private = priv;
>   /*
> diff --git a/fs/f2fs/segment.h b/fs/f2fs/segment.h
> index bdd10ea..7f94d78 100644
> --- a/fs/f2fs/segment.h
> +++ b/fs/f2fs/segment.h
> @@ -90,6 +90,8 @@
>   (blk_addr << ((sbi)->log_blocksize - F2FS_LOG_SECTOR_SIZE))
>  #define SECTOR_TO_BLOCK(sbi, sectors) \
>   (sectors >> ((sbi)->log_blocksize - F2FS_LOG_SECTOR_SIZE))
> +#define MAX_BIO_BLOCKS(max_hw_blocks) \
> + (min((int)max_hw_blocks, BIO_MAX_PAGES))
>  
>  /* during checkpoint, bio_private is used to synchronize the last bio */
>  struct bio_private {
> ---
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [f2fs-dev] [PATCH] f2fs: remove unneeded write checkpoint in recover_fsync_data

2013-09-22 Thread Gu Zheng
On 09/22/2013 03:51 PM, Chao Yu wrote:

> Previously, recover_fsync_data still wrote a checkpoint even when there was
> nothing to recover from a normally unmounted image.
> This may reduce mount performance and flash memory lifetime, so let's remove
> it.
> 
> Signed-off-by: Tan Shu 
> Signed-off-by: Yu Chao 
> ---
>  fs/f2fs/recovery.c |5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/f2fs/recovery.c b/fs/f2fs/recovery.c
> index 51ef5ee..6988e1b 100644
> --- a/fs/f2fs/recovery.c
> +++ b/fs/f2fs/recovery.c
> @@ -419,6 +419,7 @@ int recover_fsync_data(struct f2fs_sb_info *sbi)
>  {
>   struct list_head inode_list;
>   int err;
> + int is_writecp = 0;

"need_writecp" may be more suitable.

Thanks,
Gu 

>  
>   fsync_entry_slab = f2fs_kmem_cache_create("f2fs_fsync_inode_entry",
>   sizeof(struct fsync_inode_entry), NULL);
> @@ -436,6 +437,8 @@ int recover_fsync_data(struct f2fs_sb_info *sbi)
>   if (list_empty(&inode_list))
>   goto out;
>  
> + is_writecp = 1;
> +
>   /* step #2: recover data */
>   err = recover_data(sbi, &inode_list, CURSEG_WARM_NODE);
>   BUG_ON(!list_empty(&inode_list));
> @@ -443,7 +446,7 @@ out:
>   destroy_fsync_dnodes(&inode_list);
>   kmem_cache_destroy(fsync_entry_slab);
>   sbi->por_doing = 0;
> - if (!err)
> + if (!err && is_writecp)
>   write_checkpoint(sbi, false);
>   return err;
>  }
> ---
> 
> 




[PATCH V2] fs/bio-integrity: remove duplicate code

2013-09-23 Thread Gu Zheng
Most of the code in bio_integrity_verify() and bio_integrity_generate()
is the same, so introduce a common function, bio_integrity_generate_verify(),
to remove the duplicated code.

v2:
  fix a minor logic mistake.

Signed-off-by: Gu Zheng 
---
 fs/bio-integrity.c |   86 +++-
 1 files changed, 38 insertions(+), 48 deletions(-)

diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 6025084..4773ab2 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -287,24 +287,25 @@ int bio_integrity_get_tag(struct bio *bio, void *tag_buf, unsigned int len)
 EXPORT_SYMBOL(bio_integrity_get_tag);
 
 /**
- * bio_integrity_generate - Generate integrity metadata for a bio
- * @bio:   bio to generate integrity metadata for
- *
- * Description: Generates integrity metadata for a bio by calling the
- * block device's generation callback function.  The bio must have a
- * bip attached with enough room to accommodate the generated
- * integrity metadata.
+ * bio_integrity_generate_verify - Generate/verify integrity metadata for a bio
+ * @bio:   bio to generate/verify integrity metadata for
+ * @operate:   operate number, 1 for generate, 0 for verify
  */
-static void bio_integrity_generate(struct bio *bio)
+static int bio_integrity_generate_verify(struct bio *bio, int operate)
 {
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
struct blk_integrity_exchg bix;
struct bio_vec *bv;
-   sector_t sector = bio->bi_sector;
-   unsigned int i, sectors, total;
+   sector_t sector;
+   unsigned int i, sectors, total, ret;
void *prot_buf = bio->bi_integrity->bip_buf;
 
-   total = 0;
+   if (operate)
+   sector = bio->bi_sector;
+   else
+   sector = bio->bi_integrity->bip_sector;
+
+   ret = total = 0;
bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
bix.sector_size = bi->sector_size;
 
@@ -314,8 +315,15 @@ static void bio_integrity_generate(struct bio *bio)
bix.data_size = bv->bv_len;
bix.prot_buf = prot_buf;
bix.sector = sector;
-
-   bi->generate_fn(&bix);
+   if (operate) {
+   bi->generate_fn(&bix);
+   } else {
+   ret = bi->verify_fn(&bix);
+   if (ret) {
+   kunmap_atomic(kaddr);
+   return ret;
+   }
+   }
 
sectors = bv->bv_len / bi->sector_size;
sector += sectors;
@@ -325,6 +333,22 @@ static void bio_integrity_generate(struct bio *bio)
 
kunmap_atomic(kaddr);
}
+
+   return ret;
+}
+
+/**
+ * bio_integrity_generate - Generate integrity metadata for a bio
+ * @bio:   bio to generate integrity metadata for
+ *
+ * Description: Generates integrity metadata for a bio by calling the
+ * block device's generation callback function.  The bio must have a
+ * bip attached with enough room to accommodate the generated
+ * integrity metadata.
+ */
+static void bio_integrity_generate(struct bio *bio)
+{
+   bio_integrity_generate_verify(bio, 1);
 }
 
 static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi)
@@ -439,41 +463,7 @@ EXPORT_SYMBOL(bio_integrity_prep);
  */
 static int bio_integrity_verify(struct bio *bio)
 {
-   struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
-   struct blk_integrity_exchg bix;
-   struct bio_vec *bv;
-   sector_t sector = bio->bi_integrity->bip_sector;
-   unsigned int i, sectors, total, ret;
-   void *prot_buf = bio->bi_integrity->bip_buf;
-
-   ret = total = 0;
-   bix.disk_name = bio->bi_bdev->bd_disk->disk_name;
-   bix.sector_size = bi->sector_size;
-
-   bio_for_each_segment(bv, bio, i) {
-   void *kaddr = kmap_atomic(bv->bv_page);
-   bix.data_buf = kaddr + bv->bv_offset;
-   bix.data_size = bv->bv_len;
-   bix.prot_buf = prot_buf;
-   bix.sector = sector;
-
-   ret = bi->verify_fn(&bix);
-
-   if (ret) {
-   kunmap_atomic(kaddr);
-   return ret;
-   }
-
-   sectors = bv->bv_len / bi->sector_size;
-   sector += sectors;
-   prot_buf += sectors * bi->tuple_size;
-   total += sectors * bi->tuple_size;
-   BUG_ON(total > bio->bi_integrity->bip_size);
-
-   kunmap_atomic(kaddr);
-   }
-
-   return ret;
+   return bio_integrity_generate_verify(bio, 0);
 }
 
 /**
-- 
1.7.7




Re: [f2fs-dev] [PATCH RESEND] f2fs: remove unneeded write checkpoint in recover_fsync_data

2013-09-24 Thread Gu Zheng
On 09/24/2013 09:26 AM, Chao Yu wrote:

> Previously, recover_fsync_data still wrote a checkpoint even when there was
> nothing to recover from a normally unmounted image.
> This may reduce mount performance and flash memory lifetime, so let's remove
> it.
> 
> Signed-off-by: Tan Shu 
> Signed-off-by: Yu Chao 

Reviewed-by: Gu Zheng 

> ---
> fs/f2fs/recovery.c |5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/f2fs/recovery.c b/fs/f2fs/recovery.c
> index 51ef5ee..d43e4cd 100644
> --- a/fs/f2fs/recovery.c
> +++ b/fs/f2fs/recovery.c
> @@ -419,6 +419,7 @@ int recover_fsync_data(struct f2fs_sb_info *sbi)
>  {
>   struct list_head inode_list;
>   int err;
> + int need_writecp = 0;
>  
>   fsync_entry_slab = f2fs_kmem_cache_create("f2fs_fsync_inode_entry",
>   sizeof(struct fsync_inode_entry), NULL);
> @@ -436,6 +437,8 @@ int recover_fsync_data(struct f2fs_sb_info *sbi)
>   if (list_empty(&inode_list))
>   goto out;
>  
> + need_writecp = 1;
> +
>   /* step #2: recover data */
>   err = recover_data(sbi, &inode_list, CURSEG_WARM_NODE);
>   BUG_ON(!list_empty(&inode_list));
> @@ -443,7 +446,7 @@ out:
>   destroy_fsync_dnodes(&inode_list);
>   kmem_cache_destroy(fsync_entry_slab);
>   sbi->por_doing = 0;
> - if (!err)
> + if (!err && need_writecp)
>   write_checkpoint(sbi, false);
>   return err;
>  }
> ---
> 
> 




Re: [RFC 1/1] f2fs: don't GC or take an fs_lock from f2fs_initxattrs()

2013-09-24 Thread Gu Zheng
On 09/25/2013 04:49 AM, Russ Knize wrote:

> From: Russ Knize 
> 
> f2fs_initxattrs() is called internally from within F2FS and should
> not call functions that are used by VFS handlers.  This avoids
> certain deadlocks:
> 
> - vfs_create()
>  - f2fs_create() <-- takes an fs_lock
>   - f2fs_add_link()
>- __f2fs_add_link()
> - init_inode_metadata()
>  - f2fs_init_security()
>   - security_inode_init_security()
>- f2fs_initxattrs()
> - f2fs_setxattr() <-- also takes an fs_lock
> 
> If the caller happens to grab the same fs_lock from the pool in both
> places, they will deadlock.  There are also deadlocks involving
> multiple threads and mutexes:
> 
> - f2fs_write_begin()
>  - f2fs_balance_fs() <-- takes gc_mutex
>   - f2fs_gc()
>- write_checkpoint()
> - block_operations()
>  - mutex_lock_all() <-- blocks trying to grab all fs_locks
> 
> - f2fs_mkdir() <-- takes an fs_lock
>  - __f2fs_add_link()
>   - f2fs_init_security()
>- security_inode_init_security()
> - f2fs_initxattrs()
>  - f2fs_setxattr()
>   - f2fs_balance_fs() <-- blocks trying to take gc_mutex
> 
> Signed-off-by: Russ Knize 

This solution is more thorough.

Reviewed-by: Gu Zheng 
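
The shape of the fix, a public entry point that takes the lock and an
internal helper that assumes the lock is already held, can be sketched in
userspace. This is an illustration only, not the f2fs code: struct store and
the store_* functions are hypothetical, and a plain pthread mutex stands in
for the fs_lock.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* Hypothetical object protected by a non-recursive lock. */
struct store {
	pthread_mutex_t lock;
	char value[32];
};

static void store_init(struct store *s)
{
	pthread_mutex_init(&s->lock, NULL);
	s->value[0] = '\0';
}

/* Unlocked helper: the caller must already hold s->lock. */
static int store_set_locked(struct store *s, const char *v)
{
	assert(s != NULL && v != NULL);
	if (strlen(v) >= sizeof(s->value))
		return -1;
	strcpy(s->value, v);
	return 0;
}

/* Public entry point: takes the lock, then calls the helper. */
static int store_set(struct store *s, const char *v)
{
	int err;

	pthread_mutex_lock(&s->lock);
	err = store_set_locked(s, v);
	pthread_mutex_unlock(&s->lock);
	return err;
}

/*
 * Internal path that already holds the lock. It must use the helper:
 * calling store_set() here would try to take s->lock a second time.
 */
static int store_create(struct store *s, const char *initial)
{
	int err;

	pthread_mutex_lock(&s->lock);
	err = store_set_locked(s, initial);
	pthread_mutex_unlock(&s->lock);
	return err;
}
```

In the patch, f2fs_initxattrs() plays the role of the internal path and
__f2fs_setxattr() the role of the unlocked helper.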

> ---
>  fs/f2fs/xattr.c |   35 +--
>  1 file changed, 25 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/f2fs/xattr.c b/fs/f2fs/xattr.c
> index 1ac8a5f..3d900ea 100644
> --- a/fs/f2fs/xattr.c
> +++ b/fs/f2fs/xattr.c
> @@ -154,6 +154,9 @@ static int f2fs_xattr_advise_set(struct dentry *dentry, const char *name,
>  }
>  
>  #ifdef CONFIG_F2FS_FS_SECURITY
> +static int __f2fs_setxattr(struct inode *inode, int name_index,
> + const char *name, const void *value, size_t value_len,
> + struct page *ipage);
>  static int f2fs_initxattrs(struct inode *inode, const struct xattr *xattr_array,
>   void *page)
>  {
> @@ -161,7 +164,7 @@ static int f2fs_initxattrs(struct inode *inode, const struct xattr *xattr_array,
>   int err = 0;
>  
>   for (xattr = xattr_array; xattr->name != NULL; xattr++) {
> - err = f2fs_setxattr(inode, F2FS_XATTR_INDEX_SECURITY,
> + err = __f2fs_setxattr(inode, F2FS_XATTR_INDEX_SECURITY,
>   xattr->name, xattr->value,
>   xattr->value_len, (struct page *)page);
>   if (err < 0)
> @@ -469,16 +472,15 @@ cleanup:
>   return error;
>  }
>  
> -int f2fs_setxattr(struct inode *inode, int name_index, const char *name,
> - const void *value, size_t value_len, struct page *ipage)
> +static int __f2fs_setxattr(struct inode *inode, int name_index,
> + const char *name, const void *value, size_t value_len,
> + struct page *ipage)
>  {
> - struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
>   struct f2fs_inode_info *fi = F2FS_I(inode);
>   struct f2fs_xattr_entry *here, *last;
>   void *base_addr;
>   int found, newsize;
>   size_t name_len;
> - int ilock;
>   __u32 new_hsize;
>   int error = -ENOMEM;
>  
> @@ -493,10 +495,6 @@ int f2fs_setxattr(struct inode *inode, int name_index, const char *name,
>   if (name_len > F2FS_NAME_LEN || value_len > MAX_VALUE_LEN(inode))
>   return -ERANGE;
>  
> - f2fs_balance_fs(sbi);
> -
> - ilock = mutex_lock_op(sbi);
> -
>   base_addr = read_all_xattrs(inode, ipage);
>   if (!base_addr)
>   goto exit;
> @@ -578,7 +576,24 @@ int f2fs_setxattr(struct inode *inode, int name_index, const char *name,
>   else
>   update_inode_page(inode);
>  exit:
> - mutex_unlock_op(sbi, ilock);
>   kzfree(base_addr);
>   return error;
>  }
> +
> +int f2fs_setxattr(struct inode *inode, int name_index, const char *name,
> + const void *value, size_t value_len, struct page *ipage)
> +{
> + struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
> + int ilock;
> + int err;
> +
> + f2fs_balance_fs(sbi);
> +
> + ilock = mutex_lock_op(sbi);
> +
> + err = __f2fs_setxattr(inode, name_index, name, value, value_len, ipage);
> +
> + mutex_unlock_op(sbi, ilock);
> +
> + return err;
> +}




Re: [Bug report] Warning when hot-add an ACPI0004 device.

2013-09-25 Thread Gu Zheng
Hi Toshi,

On 09/12/2013 11:11 PM, Toshi Kani wrote:

> On Thu, 2013-09-12 at 13:00 +0800, Tang Chen wrote:
>> Hi Rafael, Toshi,
>>
>> When we hot-add an ACPI0004 device, we got the following warning:
>>
>>  acpi ACPI0004:01: Attempt to re-insert
>>
>> The ACPI0004 device is a System Board in Fujitsu server, which has two
>> numa nodes (processors and memory).
>>
>> It seems that we received the ACPI_NOTIFY_DEVICE_CHECK event twice in
>> acpi_hotplug_notify_cb().
>>
>>
>> According to bisect, this happens after the following commit:
>>
>>  From 68a67f6c78b80525d9b3c6672e7782de95e56a83 Mon Sep 17 00:00:00 2001
>> From: "Rafael J. Wysocki" 
>> Date: Sun, 3 Mar 2013 23:05:55 +0100
>> Subject: [PATCH 1/1] ACPI / container: Use common hotplug code
>>
>> Switch the ACPI container driver to using common device hotplug code
>> introduced previously.  This reduces the driver down to a trivial
>> definition and registration of a struct acpi_scan_handler object.
>>
>> Signed-off-by: Rafael J. Wysocki 
>> Acked-by: Toshi Kani 
>> Tested-by: Toshi Kani 
>> ---
>>   drivers/acpi/container.c | 146 
>> ---
>>   1 file changed, 10 insertions(+), 136 deletions(-)
>>
>>
>> I'm now investigating this problem. If you have any idea about why this
>> happens, please let me know.
> 
> With the above change, container devices use the common notify handler,
> which logs the warning message in question when it receives device check
> twice on a same device.  Before the change, the container-specific
> notify handler did not log this message in the same case (but considered
> it as an eject request).
> 
> So, I suspect that you are getting device check twice regardless of the
> kernel change.  Can you check KERN_DEBUG messages to see if that is the
> case?  The notify handler logs all events with KERN_DEBUG.

Following your suggestion, we confirmed that the ACPI_NOTIFY_DEVICE_CHECK
event really is received *twice*, whereas the original ACPI container driver
received it only once. Does the common device hotplug code introduce another
device check? Any ideas?

Container uses common device hotplug code:
[  142.937724] IPv6: ADDRCONF(NETDEV_CHANGE): eth8: link becomes ready
[  674.975575] ACPI: \_SB_.LSB1: ACPI_NOTIFY_DEVICE_CHECK event
[  674.991604] ACPI: \_SB_.LSB1: ACPI_NOTIFY_DEVICE_CHECK event
[  675.613990] ACPI: PCI Root Bridge [UNC2] (domain  [bus fd])
[  675.684970] acpi PNP0A03:01: ACPI _OSC support notification failed, disabling PCIe ASPM
[  675.780957] acpi PNP0A03:01: Unable to request _OSC control (_OSC support mask: 0x08)
[  675.874806] ACPI _OSC control for PCIe not granted, disabling ASPM
[  675.949005] pci_bus :fd: Allocating resources
[  675.960145] ACPI: PCI Root Bridge [UNC3] (domain  [bus fc])
[  676.031176] acpi PNP0A03:02: ACPI _OSC support notification failed, disabling PCIe ASPM
[  676.127129] acpi PNP0A03:02: Unable to request _OSC control (_OSC support mask: 0x08)
[  676.220943] ACPI _OSC control for PCIe not granted, disabling ASPM
[  676.295019] pci_bus :fc: Allocating resources

Original ACPI container driver:
[ 1526.122933] Container driver received ACPI_NOTIFY_DEVICE_CHECK event
[ 1526.800646] ACPI: PCI Root Bridge [UNC2] (domain  [bus fd])
[ 1526.871682] acpi PNP0A03:01: ACPI _OSC support notification failed, disabling PCIe ASPM
[ 1526.967878] acpi PNP0A03:01: Unable to request _OSC control (_OSC support mask: 0x08)
[ 1527.061891] ACPI _OSC control for PCIe not granted, disabling ASPM
[ 1527.136036] pci_bus :fd: Allocating resources
[ 1527.150747] ACPI: PCI Root Bridge [UNC3] (domain  [bus fc])
[ 1527.221821] acpi PNP0A03:02: ACPI _OSC support notification failed, disabling PCIe ASPM
[ 1527.317738] acpi PNP0A03:02: Unable to request _OSC control (_OSC support mask: 0x08)
[ 1527.411795] ACPI _OSC control for PCIe not granted, disabling ASPM
[ 1527.485917] pci_bus :fc: Allocating resources


Thanks,
Gu

> 
> Thanks,
> -Toshi
> 
> 




[PATCH]fs/block_dev.c: fix the inaccurate judgement in function blkdev_aio_read

2013-04-18 Thread Gu Zheng
In blkdev_aio_read(), if 'size' is equal to or greater than the count we
request (iocb->ki_left), there is no need to call iov_shorten() to reduce
the number of segments and the iovec's length.
So the check should be changed to 'if (size < iocb->ki_left)' instead.
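
The reasoning can be tried out in userspace. The following is a sketch under
assumptions, not the kernel implementation: shorten_iov() is a hypothetical
helper modeled on what iov_shorten() is described as doing (trimming an iovec
array to a byte limit), and clamp_segs() is a hypothetical caller that only
shortens when fewer bytes are available than were requested.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/*
 * Trim an iovec array so that it describes at most `to` bytes.
 * Returns the new number of segments. Hypothetical helper.
 */
static unsigned long shorten_iov(struct iovec *iov, unsigned long nr_segs,
				 size_t to)
{
	unsigned long seg = 0;
	size_t len = 0;

	assert(nr_segs == 0 || iov != NULL);
	while (seg < nr_segs) {
		seg++;
		if (len + iov->iov_len >= to) {
			iov->iov_len = to - len; /* cut the last segment */
			break;
		}
		len += iov->iov_len;
		iov++;
	}
	return seg;
}

/* Only shorten when fewer bytes are available than were requested. */
static unsigned long clamp_segs(struct iovec *iov, unsigned long nr_segs,
				size_t requested, size_t available)
{
	if (available < requested)
		nr_segs = shorten_iov(iov, nr_segs, available);
	return nr_segs;
}
```

When the available size is at least the requested size, the iovec is left
untouched, which is the case the new check skips.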

Signed-off-by: Jianpeng Ma 
Signed-off-by: Gu Zheng 
---
 fs/block_dev.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index aae187a..f0328f1 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1559,7 +1559,7 @@ static ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,
return 0;
 
size -= pos;
-   if (size < INT_MAX)
+   if (size < iocb->ki_left)
nr_segs = iov_shorten((struct iovec *)iov, nr_segs, size);
return generic_file_aio_read(iocb, iov, nr_segs, pos);
 }
-- 
1.7.7




[PATCH] mm/filemap.c: fix criteria of calling iov_shorten() in generic_file_direct_write()

2013-04-23 Thread Gu Zheng
From 35947e6535d92c54cf523470cc8811e8b5fee3e5 Mon Sep 17 00:00:00 2001
From: Gu Zheng 
Date: Tue, 23 Apr 2013 16:09:04 +0800
Subject: [PATCH] mm/filemap.c: fix criteria of calling iov_shorten() in generic_file_direct_write()

generic_file_direct_write() compares 'count' (the maximum count we can
actually write) with 'ocount' (the count we requested to write) to see
whether iov_shorten() needs to be called to reduce the number of segments
and the iovec's length. If 'count' is equal to or greater than 'ocount',
there is no need to call iov_shorten(). So the check should be changed:
'if (count != ocount)' --> 'if (count < ocount)'

Signed-off-by: Gu Zheng 
---
 mm/filemap.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index e1979fd..c566b9c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2183,7 +2183,7 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
size_t  write_len;
pgoff_t end;
 
-   if (count != ocount)
+   if (count < ocount)
*nr_segs = iov_shorten((struct iovec *)iov, *nr_segs, count);
 
write_len = iov_length(iov, *nr_segs);
-- 
1.7.7



Re: [PATCH 1/2] f2fs: add remount_fs callback support

2013-06-16 Thread Gu Zheng
On 06/16/2013 08:48 AM, Namjae Jeon wrote:

> From: Namjae Jeon 
> 
> Add the f2fs_remount function call which will be used
> during the filesystem remounting. This function
> will help us to change the mount options specific to
> f2fs.
> 
> Also modify the f2fs background_gc mount option, which
> will allow the user to dynamically turn on/off the
> garbage collection in f2fs based on the background_gc
> value. If background_gc=on, garbage collection will
> be turned on & if background_gc=off, garbage collection
> will be turned off.
> 
> By default the garbage collection is on in f2fs.
> 
> Change Log:
> v2: Incorporated the review comments by Gu Zheng.
> Removing the restore part for VFS flags
> Updating comments with proper flag conditions
> Display GC background option as ON/OFF
> Revised conditions to stop GC in case of remount
> 
> v1: Initial changes for adding remount_fs callback
> support.
> 
> Cc: Gu Zheng 
> Signed-off-by: Namjae Jeon 
> Signed-off-by: Pankaj Kumar 


Reviewed-by: Gu Zheng 

Thanks,
Gu

> ---
>  Documentation/filesystems/f2fs.txt |9 +-
>  fs/f2fs/super.c|  235 
> +++-
>  2 files changed, 160 insertions(+), 84 deletions(-)
> 
> diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.txt
> index bd3c56c..b91e2f2 100644
> --- a/Documentation/filesystems/f2fs.txt
> +++ b/Documentation/filesystems/f2fs.txt
> @@ -98,8 +98,13 @@ Cleaning Overhead
>  MOUNT OPTIONS
>  
> 
>  
> -background_gc_off  Turn off cleaning operations, namely garbage collection,
> -triggered in background when I/O subsystem is idle.
> +background_gc=%s   Turn on/off cleaning operations, namely garbage
> +   collection, triggered in background when I/O subsystem is
> +   idle. If background_gc=on, it will turn on the garbage
> +   collection and if background_gc=off, garbage collection
> +   will be turned off.
> +   Default value for this option is on. So garbage
> +   collection is on by default.
>  disable_roll_forward   Disable the roll-forward recovery routine
>  discardIssue discard/TRIM commands when a segment is cleaned.
>  no_heapDisable heap-style segment allocation which finds free
> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
> index ba56549..5a11484 100644
> --- a/fs/f2fs/super.c
> +++ b/fs/f2fs/super.c
> @@ -34,7 +34,7 @@
>  static struct kmem_cache *f2fs_inode_cachep;
>  
>  enum {
> - Opt_gc_background_off,
> + Opt_gc_background,
>   Opt_disable_roll_forward,
>   Opt_discard,
>   Opt_noheap,
> @@ -46,7 +46,7 @@ enum {
>  };
>  
>  static match_table_t f2fs_tokens = {
> - {Opt_gc_background_off, "background_gc_off"},
> + {Opt_gc_background, "background_gc=%s"},
>   {Opt_disable_roll_forward, "disable_roll_forward"},
>   {Opt_discard, "discard"},
>   {Opt_noheap, "no_heap"},
> @@ -76,6 +76,91 @@ static void init_once(void *foo)
>   inode_init_once(&fi->vfs_inode);
>  }
>  
> +static int parse_options(struct super_block *sb, char *options)
> +{
> + struct f2fs_sb_info *sbi = F2FS_SB(sb);
> + substring_t args[MAX_OPT_ARGS];
> + char *p, *name;
> + int arg = 0;
> +
> + if (!options)
> + return 0;
> +
> + while ((p = strsep(&options, ",")) != NULL) {
> + int token;
> + if (!*p)
> + continue;
> + /*
> +  * Initialize args struct so we know whether arg was
> +  * found; some options take optional arguments.
> +  */
> + args[0].to = args[0].from = NULL;
> + token = match_token(p, f2fs_tokens, args);
> +
> + switch (token) {
> + case Opt_gc_background:
> + name = match_strdup(&args[0]);
> +
> + if (!name)
> + return -ENOMEM;
> + if (!strncmp(name, "on", 2))
> + set_opt(sbi, BG_GC);
> + else if (!strncmp(name, "off", 3))
> + clear_opt(sbi, BG_GC);
> + else {
> + kfree(name);
> + return -EINVAL;
> + }
> +   

Re: [PATCH 1/2] f2fs: add remount_fs callback support

2013-06-03 Thread Gu Zheng
On 06/01/2013 03:20 PM, Namjae Jeon wrote:

> From: Namjae Jeon 
> 
> Add the f2fs_remount function call which will be used
> during the filesystem remounting. This function
> will help us to change the mount options specific to
> f2fs.
> 
> Also modify the f2fs background_gc mount option, which
> will allow the user to dynamically turn on/off the
> garbage collection in f2fs based on the background_gc
> value. If background_gc=0, garbage collection will
> be turned off & if background_gc=1, garbage collection
> will be turned on.


Hi Namjae,
  I think it would be better to split these two changes into separate patches.
Please refer to the inline comments.

Thanks,
Gu

> 
> By default the garbage collection is on in f2fs.
> 
> Signed-off-by: Namjae Jeon 
> Signed-off-by: Pankaj Kumar 

> ---
>  Documentation/filesystems/f2fs.txt |8 +-
>  fs/f2fs/super.c|  223 
> +++-
>  2 files changed, 148 insertions(+), 83 deletions(-)
> 
> diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.txt
> index bd3c56c..6a036ce 100644
> --- a/Documentation/filesystems/f2fs.txt
> +++ b/Documentation/filesystems/f2fs.txt
> @@ -98,8 +98,12 @@ Cleaning Overhead
>  MOUNT OPTIONS
>  
> 
>  
> -background_gc_off  Turn off cleaning operations, namely garbage collection,
> -triggered in background when I/O subsystem is idle.
> +background_gc=%u   Turn on/off cleaning operations, namely garbage collection,
> +   triggered in background when I/O subsystem is idle. If
> +   background_gc=1, it will turn on the garbage collection &
> +   if background_gc=0, garbage collection will be turned off.
> +   Default value for this option is 1. So garbage collection
> +   is on by default.
>  disable_roll_forward   Disable the roll-forward recovery routine
>  discardIssue discard/TRIM commands when a segment is cleaned.
>  no_heapDisable heap-style segment allocation which finds free
> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
> index 3ac305d..bcd68aa 100644
> --- a/fs/f2fs/super.c
> +++ b/fs/f2fs/super.c
> @@ -34,7 +34,7 @@
>  static struct kmem_cache *f2fs_inode_cachep;
>  
>  enum {
> - Opt_gc_background_off,
> + Opt_gc_background,
>   Opt_disable_roll_forward,
>   Opt_discard,
>   Opt_noheap,
> @@ -46,7 +46,7 @@ enum {
>  };
>  
>  static match_table_t f2fs_tokens = {
> - {Opt_gc_background_off, "background_gc_off"},
> + {Opt_gc_background, "background_gc=%u"},
>   {Opt_disable_roll_forward, "disable_roll_forward"},
>   {Opt_discard, "discard"},
>   {Opt_noheap, "no_heap"},
> @@ -76,6 +76,86 @@ static void init_once(void *foo)
>   inode_init_once(&fi->vfs_inode);
>  }
>  
> +static int parse_options(struct super_block *sb, struct f2fs_sb_info *sbi,
> + char *options)
> +{
> + substring_t args[MAX_OPT_ARGS];
> + char *p;
> + int arg = 0;
> +
> + if (!options)
> + return 0;
> +
> + while ((p = strsep(&options, ",")) != NULL) {
> + int token;
> + if (!*p)
> + continue;
> + /*
> +  * Initialize args struct so we know whether arg was
> +  * found; some options take optional arguments.
> +  */
> + args[0].to = args[0].from = NULL;
> + token = match_token(p, f2fs_tokens, args);
> +
> + switch (token) {
> + case Opt_gc_background:
> + if (args->from && match_int(args, &arg))
> + return -EINVAL;
> + if (arg != 0 && arg != 1)
> + return -EINVAL;
> + if (arg == 0)
> + clear_opt(sbi, BG_GC);
> + else
> + set_opt(sbi, BG_GC);
> + break;
> + case Opt_disable_roll_forward:
> + set_opt(sbi, DISABLE_ROLL_FORWARD);
> + break;
> + case Opt_discard:
> + set_opt(sbi, DISCARD);
> + break;
> + case Opt_noheap:
> + set_opt(sbi, NOHEAP);
> + break;
> +#ifdef CONFIG_F2FS_FS_XATTR
> + case Opt_nouser_xattr:
> + clear_opt(sbi, XATTR_USER);
> + break;
> +#else
> + case Opt_nouser_xattr:
> + f2fs_msg(sb, KERN_INFO,
> + "nouser_xattr options not supported");
> + break;
> +#endif
> +#ifdef CONFIG_F2FS_FS_POSIX_ACL
> + case Opt_noacl:
> + clear_opt(sbi, POSIX_ACL);
> + br

Re: [PATCH] scsi: Introduce a help function local_time_seconds() to simplify the getting time stamp operation

2013-06-04 Thread Gu Zheng
ping...
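
For anyone skimming the thread, the arithmetic being factored out is small enough to check in userspace. Below is a minimal sketch, not the patch itself: local_time_seconds_from() and scheduler_time_from() are hypothetical names, and the UTC seconds and tz_minuteswest are passed in as parameters (an assumption made so the code can run outside the kernel, where the real helper would read the clock and sys_tz itself). The 3-day offset works because the epoch fell on a Thursday, so subtracting three days (equivalently adding four, modulo a week) makes the weekly remainder count from Sunday 12:00AM.

```c
#include <assert.h>
#include <stdint.h>

#define SECONDS_PER_DAY  86400u
#define SECONDS_PER_WEEK 604800u

/*
 * Userspace stand-in for the helper (hypothetical name). The clock value
 * and the timezone offset are parameters here so the arithmetic can be
 * checked in isolation.
 */
static uint32_t local_time_seconds_from(int64_t utc_sec, int minuteswest)
{
	return (uint32_t)(utc_sec - (int64_t)minuteswest * 60);
}

/* Local-time seconds since the last Sunday 12:00AM. */
static uint32_t scheduler_time_from(uint32_t local_sec)
{
	uint32_t t = (local_sec - 3 * SECONDS_PER_DAY) % SECONDS_PER_WEEK;

	assert(t < SECONDS_PER_WEEK);
	return t;
}
```

With tz_minuteswest = -540 (UTC+9), a UTC value of 1000000 becomes a local value of 1032400, and the scheduler time derived from it is 168400.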

On 05/29/2013 05:33 PM, Gu Zheng wrote:

>>From 4d4caa16f3886ae910ad6dfe13353fc836f546cc Mon Sep 17 00:00:00 2001
> From: Gu Zheng 
> Date: Wed, 29 May 2013 17:34:22 +0900
> Subject: [PATCH] driver/scsi: Introduce a help function local_time_seconds() 
> to simplify the getting time stamp operation
> 
> Signed-off-by: Gu Zheng 
> ---
>  drivers/scsi/3w-9xxx.c |   14 ++
>  drivers/scsi/3w-sas.c  |   14 ++
>  include/scsi/scsi.h|9 +
>  3 files changed, 13 insertions(+), 24 deletions(-)
> 
> diff --git a/drivers/scsi/3w-9xxx.c b/drivers/scsi/3w-9xxx.c
> index 5e1e12c..44b3ea8 100644
> --- a/drivers/scsi/3w-9xxx.c
> +++ b/drivers/scsi/3w-9xxx.c
> @@ -374,8 +374,6 @@ out:
>  /* This function will queue an event */
>  static void twa_aen_queue_event(TW_Device_Extension *tw_dev, 
> TW_Command_Apache_Header *header)
>  {
> - u32 local_time;
> - struct timeval time;
>   TW_Event *event;
>   unsigned short aen;
>   char host[16];
> @@ -398,9 +396,7 @@ static void twa_aen_queue_event(TW_Device_Extension 
> *tw_dev, TW_Command_Apache_H
>   memset(event, 0, sizeof(TW_Event));
>  
>   event->severity = TW_SEV_OUT(header->status_block.severity__reserved);
> - do_gettimeofday(&time);
> - local_time = (u32)(time.tv_sec - (sys_tz.tz_minuteswest * 60));
> - event->time_stamp_sec = local_time;
> + event->time_stamp_sec = local_time_seconds();
>   event->aen_code = aen;
>   event->retrieved = TW_AEN_NOT_RETRIEVED;
>   event->sequence_id = tw_dev->error_sequence_id;
> @@ -479,11 +475,9 @@ out:
>  static void twa_aen_sync_time(TW_Device_Extension *tw_dev, int request_id)
>  {
>   u32 schedulertime;
> - struct timeval utc;
>   TW_Command_Full *full_command_packet;
>   TW_Command *command_packet;
>   TW_Param_Apache *param;
> - u32 local_time;
>  
>   /* Fill out the command packet */
>   full_command_packet = tw_dev->command_packet_virt[request_id];
> @@ -503,11 +497,7 @@ static void twa_aen_sync_time(TW_Device_Extension 
> *tw_dev, int request_id)
>   param->parameter_id = cpu_to_le16(0x3); /* SchedulerTime */
>   param->parameter_size_bytes = cpu_to_le16(4);
>  
> - /* Convert system time in UTC to local time seconds since last 
> -   Sunday 12:00AM */
> - do_gettimeofday(&utc);
> - local_time = (u32)(utc.tv_sec - (sys_tz.tz_minuteswest * 60));
> - schedulertime = local_time - (3 * 86400);
> + schedulertime = local_time_seconds() - (3 * 86400);
>   schedulertime = cpu_to_le32(schedulertime % 604800);
>  
>   memcpy(param->data, &schedulertime, sizeof(u32));
> diff --git a/drivers/scsi/3w-sas.c b/drivers/scsi/3w-sas.c
> index c845bdb..69f1d8a 100644
> --- a/drivers/scsi/3w-sas.c
> +++ b/drivers/scsi/3w-sas.c
> @@ -236,8 +236,6 @@ out:
>  /* This function will queue an event */
>  static void twl_aen_queue_event(TW_Device_Extension *tw_dev, 
> TW_Command_Apache_Header *header)
>  {
> - u32 local_time;
> - struct timeval time;
>   TW_Event *event;
>   unsigned short aen;
>   char host[16];
> @@ -256,9 +254,7 @@ static void twl_aen_queue_event(TW_Device_Extension 
> *tw_dev, TW_Command_Apache_H
>   memset(event, 0, sizeof(TW_Event));
>  
>   event->severity = TW_SEV_OUT(header->status_block.severity__reserved);
> - do_gettimeofday(&time);
> - local_time = (u32)(time.tv_sec - (sys_tz.tz_minuteswest * 60));
> - event->time_stamp_sec = local_time;
> + event->time_stamp_sec = local_time_seconds();
>   event->aen_code = aen;
>   event->retrieved = TW_AEN_NOT_RETRIEVED;
>   event->sequence_id = tw_dev->error_sequence_id;
> @@ -444,11 +440,9 @@ out:
>  static void twl_aen_sync_time(TW_Device_Extension *tw_dev, int request_id)
>  {
>   u32 schedulertime;
> - struct timeval utc;
>   TW_Command_Full *full_command_packet;
>   TW_Command *command_packet;
>   TW_Param_Apache *param;
> - u32 local_time;
>  
>   /* Fill out the command packet */
>   full_command_packet = tw_dev->command_packet_virt[request_id];
> @@ -468,11 +462,7 @@ static void twl_aen_sync_time(TW_Device_Extension 
> *tw_dev, int request_id)
>   param->parameter_id = cpu_to_le16(0x3); /* SchedulerTime */
>   param->parameter_size_bytes = cpu_to_le16(4);
>  
> - /* Convert system time in UTC to local time seconds since last 
> -   Sunday 12:00AM */
> - do_gettimeofday(&utc);
> - local_time = (u32)(utc.tv_sec - (sys_tz.tz_minuteswest * 60));
> - sch

Re: [PATCH RFC v2 1/3] drivers/platform/x86: add cpu physically hotplug driver

2013-06-05 Thread Gu Zheng
On 06/06/2013 09:40 AM, liguang wrote:

> this driver will support cpu phyical add/removal automatically
> after online/offline. if cpu hotpluged, cpu will not
> online automatically, and for cpu offline, we try to
> do actually eject if allowed for cpu like
> "echo 1 > /sys/bus/acpi/devices/LNXCPU\:0X/eject"
> this "echo ..." is only present for recent kernel
> (sorry, can't figure out since when), for a little
> older kernel, there's not such approach AFAICS.
> 
> Signed-off-by: liguang 
> ---
>  drivers/platform/x86/Kconfig  |8 
>  drivers/platform/x86/Makefile |1 +
>  drivers/platform/x86/cpu_physic_hotplug.c |   60 
> +
>  3 files changed, 69 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/platform/x86/cpu_physic_hotplug.c
> 
> diff --git a/drivers/platform/x86/Kconfig b/drivers/platform/x86/Kconfig
> index 8577261..39b2392 100644
> --- a/drivers/platform/x86/Kconfig
> +++ b/drivers/platform/x86/Kconfig
> @@ -789,4 +789,12 @@ config PVPANIC
> a paravirtualized device provided by QEMU; it lets a virtual machine
> (guest) communicate panic events to the host.
>  
> +config QEMU_CPU_PHYSIC_HOTPLUG
> + tristate "physically add/remove cpu after cpu onlined/offlined"
> + depends on ACPI_HOTPLUG_CPU
> + ---help---
> +   This driver will support physically remove a cpu after
> +   it offlined for QEMU automatically. someone may require this feature
> +   to do a physically removal for a cpu.
> +
>  endif # X86_PLATFORM_DEVICES
> diff --git a/drivers/platform/x86/Makefile b/drivers/platform/x86/Makefile
> index ef0ec74..2e669b0 100644
> --- a/drivers/platform/x86/Makefile
> +++ b/drivers/platform/x86/Makefile
> @@ -53,3 +53,4 @@ obj-$(CONFIG_APPLE_GMUX)+= apple-gmux.o
>  obj-$(CONFIG_CHROMEOS_LAPTOP)+= chromeos_laptop.o
>  
>  obj-$(CONFIG_PVPANIC)   += pvpanic.o
> +obj-$(CONFIG_QEMU_CPU_PHYSIC_HOTPLUG)+= cpu_physic_hotplug.o
> diff --git a/drivers/platform/x86/cpu_physic_hotplug.c 
> b/drivers/platform/x86/cpu_physic_hotplug.c
> new file mode 100644
> index 000..a52c042
> --- /dev/null
> +++ b/drivers/platform/x86/cpu_physic_hotplug.c
> @@ -0,0 +1,60 @@
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +MODULE_AUTHOR("Li Guang");
> +MODULE_DESCRIPTION("CPU physically hot-plug/unplug Driver");
> +MODULE_LICENSE("GPL");
> +
> +static int cpu_logic_hotplug_notify(struct notifier_block *nfb,
> + unsigned long action, void *hcpu)
> +{
> + unsigned int cpu = (unsigned long)hcpu;
> + struct acpi_processor *pr = per_cpu(processors, cpu);
> +
> + if (pr) {
> + switch (action) {
> + case CPU_ONLINE:
> + break;
> + case CPU_DEAD:
> + break;
> + default:
> + break;
> + }
> + }
> + return NOTIFY_OK;
> +}
> +
> +static struct notifier_block cpu_logic_hotplug_notifier =
> +{
> + .notifier_call = cpu_logic_hotplug_notify,
> +};
> +
> +static int cpu_physic_hotplug_notify(struct notifier_block *nfb,
> +  unsigned char *s)
> +{
> +}

Hi guang,
Maybe you should define the callback function with the right prototype from
the beginning; then there is no need to correct it later. :)
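
To illustrate what I mean by the right format: a notifier callback takes the notifier block, an action code, and an opaque data pointer. The following is only a userspace sketch of that shape. DEMO_EVENT_EJECT, demo_hotplug_notify() and demo_call_chain() are made-up names for illustration, and the chain walk is a simplification, not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>

#define NOTIFY_DONE 0x0000
#define NOTIFY_OK   0x0001

struct notifier_block;

/* Same shape as the kernel's notifier callback type. */
typedef int (*notifier_fn_t)(struct notifier_block *nb,
			     unsigned long action, void *data);

struct notifier_block {
	notifier_fn_t notifier_call;
	struct notifier_block *next;
};

/* Hypothetical event code, for illustration only. */
#define DEMO_EVENT_EJECT 1UL

/* A callback with the expected three-argument prototype. */
static int demo_hotplug_notify(struct notifier_block *nb,
			       unsigned long action, void *data)
{
	unsigned int *ejected = data;

	(void)nb;
	if (action != DEMO_EVENT_EJECT)
		return NOTIFY_DONE;
	(*ejected)++;
	return NOTIFY_OK;
}

/* Simplified chain walk: call every registered callback in order. */
static int demo_call_chain(struct notifier_block *head,
			   unsigned long action, void *data)
{
	int ret = NOTIFY_DONE;

	for (; head; head = head->next) {
		assert(head->notifier_call != NULL);
		ret = head->notifier_call(head, action, data);
	}
	return ret;
}
```

A callback declared as (struct notifier_block *, unsigned char *) cannot be assigned to .notifier_call without a type mismatch, which is why the stub in the patch would have to be rewritten later anyway.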

Thanks,
Gu


> +
> +static struct notifier_block cpu_physic_hotplug_notifier =
> +{
> + .notifier_call = cpu_physic_hotplug_notify,
> +};
> +
> +static int __init cpu_qemu_hotplug_init(void)
> +{
> + register_hotcpu_notifier(&cpu_logic_hotplug_notifier);
> + register_ec_gpe_notifier(&cpu_physic_hotplug_notifier);


As [PATCH 2/3] does not depend on this one, you can make [PATCH 2/3] the new
[PATCH 1/3] and this one [PATCH 2/3]. Then you can use the
xxx_ec_space_notifier directly here.

> + return 0;
> +}
> +
> +static void __exit cpu_qemu_hotplug_exit(void)
> +{
> + unregister_hotcpu_notifier(&cpu_logic_hotplug_notifier);
> + unregister_ec_gpe_notifier(&cpu_physic_hotplug_notifier);
> +}
> +
> +module_init(cpu_qemu_hotplug_init);
> +module_exit(cpu_qemu_hotplug_exit);


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] f2fs: add remount_fs callback support

2013-06-06 Thread Gu Zheng
Hi Namjae,

On 06/05/2013 12:34 PM, Namjae Jeon wrote:

> 2013/6/4 Gu Zheng :
>> On 06/01/2013 03:20 PM, Namjae Jeon wrote:
>>
>>> From: Namjae Jeon 
>>>
>>> Add the f2fs_remount function call which will be used
>>> during the filesystem remounting. This function
>>> will help us to change the mount options specific to
>>> f2fs.
>>>
>>> Also modify the f2fs background_gc mount option, which
>>> will allow the user to dynamically turn on/off the
>>> garbage collection in f2fs based on the background_gc
>>> value. If background_gc=0, Garbage collection will
>>> be turned off & if background_gc=1, Garbage collection
>>> will be turned on.
>>
>>
>> Hi Namjae,
> Hi. Gu.
> 
>>   I think splitting these two changes into single ones seems better.
>> Refer to the inline comments.
> I don't think so. Mount option background_gc is changed to make
> remount_fs working in the correct way.

Yes, I know. Maybe you slightly misread my words.
Although remount_fs depends on the background_gc option change, changing the
background_gc option and adding remount_fs support are two different changes.
To keep each patch simple and clear, maybe you could split them into separate
patches, such as:
[PATCH 1/3] f2fs: Modify the f2fs background_gc mount option
[PATCH 2/3] f2fs: add remount_fs callback support
[PATCH 3/3] f2fs: reorganise the function get_victim_by_default

Just a personal suggestion; if you think it is not worthwhile, please ignore it. :)


> 
>>
>> Thanks,
>> Gu
>>
>>
>> Though simply option show is enough, but I think the "background_gc=on/off" 
>> is more friendly.
> Yes, Agree. I will update.
> 
>>
>>> +
>>> + /**
>>> +  * We stop the GC thread if FS is mounted as RO
>>> +  * or if background_gc = 0 is passed in mount
>>> +  * option. Also sync the filesystem.
>>> +  */
>>> + if ((*flags & MS_RDONLY) || !test_opt(sbi, BG_GC)) {
>>
>>
>> Another condition: The old mount is not RO.
> I don't think that it is needed. I think current condition check can
> be covered about all cases.
> Am I missing something ?

Maybe. If the old mount is RO and the remount is RO too, it still passes the
check here, right?
Though the following stop_gc_thread() and f2fs_sync_fs() handle this case
well, the work is unnecessary. Adding a check that the old mount is not RO
avoids it.
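
To make the case concrete, here is a rough userspace model of the decision, written only to show the RO -> RO shortcut. It is my reading of the logic, not code from the patch: enum gc_action and remount_gc_action() are hypothetical names, and the flags are plain booleans instead of *flags, sb->s_flags and test_opt().

```c
#include <assert.h>
#include <stdbool.h>

enum gc_action { GC_NOTHING, GC_STOP_AND_SYNC, GC_START };

/*
 * Decide what a remount should do with the background GC thread.
 * old_ro/new_ro: read-only state before and after the remount.
 * bg_gc: background_gc option after parsing the new options.
 * gc_running: whether a GC thread currently exists.
 */
static enum gc_action remount_gc_action(bool old_ro, bool new_ro,
					bool bg_gc, bool gc_running)
{
	/* A read-only mount never started the GC thread. */
	assert(!(old_ro && gc_running));

	if (new_ro || !bg_gc) {
		/* Only a previously writable mount has anything to stop or sync. */
		return old_ro ? GC_NOTHING : GC_STOP_AND_SYNC;
	}

	return gc_running ? GC_NOTHING : GC_START;
}
```

Without the old_ro test, an RO -> RO remount would still take the stop-and-sync path even though there is no thread to stop and nothing dirty to write.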

Thanks,
Gu

> 
>>
>>> + stop_gc_thread(sbi);
>>> + f2fs_sync_fs(sb, 1);
>>> + } else if (test_opt(sbi, BG_GC) && !sbi->gc_thread) {
>>> + err = start_gc_thread(sbi);
>>> + if (err)
>>> + goto restore_opts;
>>> + }
>>> +
>>> + /* Update the POSIXACL Flag */
>>> +  sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
>>> + (test_opt(sbi, POSIX_ACL) ? MS_POSIXACL : 0);
>>
>>
>> Maybe you forget to update the flags with MS_RDONLY or ~MS_RDONLY, if the 
>> flags changed.
> No, we don't need to check this flags. sb-s_flags will be updated by
> MS_REMOUNT of vfs.(do_remount_sb)
> 
>>
>>> + return 0;
>>> +
>>> +restore_opts:
>>> + sb->s_flags = old_sb_flags;
>>
>>
>> There is no need to restore sb->s_flags, parse_options() did not change it.
>> no need to store the old sb->s_flags too.
> Yes, right, I will update.
> 
>>
>>>
>>> - /* After POR, we can run background GC thread */
>>> - err = start_gc_thread(sbi);
>>> - if (err)
>>> - goto fail;
>>> + /* After POR, we can run background GC thread.*/
>>> + if (!(sb->s_flags & MS_RDONLY)) {
>>> + /**
>>> +  * If filesystem is mounted as read-only then
>>> +  * do not start the gc_thread.
>>> +  */
>>
>> It seems that the comment here is against with the logic.
> hum.. Okay, I will update comment to avoid some confusion.
> 
> Thanks for review :)
> I will post v2 patch including your opinion soon.
> 




[PATCH] f2fs: Set sb->s_fs_info before calling parse_options()

2013-06-06 Thread Gu Zheng
In f2fs_fill_super(), set sb->s_fs_info before calling parse_options(), so that
parse_options() can get the f2fs_sb_info via F2FS_SB(sb).
The second argument "sbi" of parse_options() is then no longer needed.
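
The ordering requirement can be modelled in userspace as below. This is only a sketch under simplifying assumptions: the structs are cut down to one field each, DEMO_OPT_DISCARD and demo_fill_super() are made-up names, and parse_options() here handles a single option string rather than the real token table.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Cut-down stand-ins for the kernel structures. */
struct f2fs_sb_info { unsigned int mount_opt; };
struct super_block { void *s_fs_info; };

#define DEMO_OPT_DISCARD 0x1u	/* hypothetical flag value */

static struct f2fs_sb_info *F2FS_SB(struct super_block *sb)
{
	return sb->s_fs_info;
}

/* Simplified parser: looks up sbi through the super block. */
static int parse_options(struct super_block *sb, const char *options)
{
	struct f2fs_sb_info *sbi = F2FS_SB(sb);

	if (!sbi)
		return -EINVAL;	/* s_fs_info was not set up yet */
	if (options && strcmp(options, "discard") == 0)
		sbi->mount_opt |= DEMO_OPT_DISCARD;
	return 0;
}

static int demo_fill_super(struct super_block *sb, struct f2fs_sb_info *sbi,
			   const char *options)
{
	sb->s_fs_info = sbi;	/* must happen before parse_options() */
	assert(F2FS_SB(sb) == sbi);
	return parse_options(sb, options);
}
```

Calling the parser before s_fs_info is assigned fails in this model; in the kernel it would dereference a NULL pointer instead, which is why the assignment moves up.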

Signed-off-by: Gu Zheng 
---
 fs/f2fs/super.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 8555f7d..58516b5 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -303,9 +303,9 @@ static const struct export_operations f2fs_export_ops = {
.get_parent = f2fs_get_parent,
 };
 
-static int parse_options(struct super_block *sb, struct f2fs_sb_info *sbi,
-   char *options)
+static int parse_options(struct super_block *sb, char *options)
 {
+   struct f2fs_sb_info *sbi = F2FS_SB(sb);
substring_t args[MAX_OPT_ARGS];
char *p;
int arg = 0;
@@ -541,6 +541,7 @@ static int f2fs_fill_super(struct super_block *sb, void 
*data, int silent)
if (err)
goto free_sb_buf;
}
+   sb->s_fs_info = sbi;
/* init some FS parameters */
sbi->active_logs = NR_CURSEG_TYPE;
 
@@ -553,7 +554,7 @@ static int f2fs_fill_super(struct super_block *sb, void 
*data, int silent)
set_opt(sbi, POSIX_ACL);
 #endif
/* parse mount options */
-   err = parse_options(sb, sbi, (char *)data);
+   err = parse_options(sb, (char *)data);
if (err)
goto free_sb_buf;
 
@@ -565,7 +566,6 @@ static int f2fs_fill_super(struct super_block *sb, void 
*data, int silent)
sb->s_xattr = f2fs_xattr_handlers;
sb->s_export_op = &f2fs_export_ops;
sb->s_magic = F2FS_SUPER_MAGIC;
-   sb->s_fs_info = sbi;
sb->s_time_gran = 1;
sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
(test_opt(sbi, POSIX_ACL) ? MS_POSIXACL : 0);
-- 
1.7.7


[PATCH] fs/f2fs: Code cleanup and simplify in func {find/add}_gc_inode

2013-06-20 Thread Gu Zheng
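
The cleanup replaces the open-coded list_for_each() + list_entry() pair with list_for_each_entry(). The userspace sketch below shows that the two forms walk the list the same way. It is an illustration under assumptions: the list macros are minimal re-implementations (using the GNU __typeof__ extension), struct inode is reduced to i_ino, and find_open_coded() is a made-up name for the old style.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

/* Minimal userspace versions of the kernel list helpers. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define list_entry(ptr, type, member) container_of(ptr, type, member)
#define list_for_each(pos, head) \
	for (pos = (head)->next; pos != (head); pos = pos->next)
#define list_for_each_entry(pos, head, member)				\
	for (pos = list_entry((head)->next, __typeof__(*pos), member);	\
	     &pos->member != (head);					\
	     pos = list_entry(pos->member.next, __typeof__(*pos), member))

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	assert(head->next && head->prev);	/* head must be initialised */
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

struct inode { unsigned long i_ino; };
struct inode_entry { struct list_head list; struct inode *inode; };

/* Old style: separate cursor plus list_entry() on every step. */
static struct inode *find_open_coded(unsigned long ino, struct list_head *ilist)
{
	struct list_head *this;
	struct inode_entry *ie;

	list_for_each(this, ilist) {
		ie = list_entry(this, struct inode_entry, list);
		if (ie->inode->i_ino == ino)
			return ie->inode;
	}
	return NULL;
}

/* New style: the entry pointer is the cursor. */
static struct inode *find_gc_inode(unsigned long ino, struct list_head *ilist)
{
	struct inode_entry *ie;

	list_for_each_entry(ie, ilist, list)
		if (ie->inode->i_ino == ino)
			return ie->inode;
	return NULL;
}
```

Note that reusing find_gc_inode() in add_gc_inode() first looks the entry up by inode number and then compares the returned pointer with the inode being added, rather than comparing pointers on every entry.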

Signed-off-by: Gu Zheng 
---
 fs/f2fs/gc.c |   17 +
 1 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
index 1496159..0b8b439 100644
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -314,28 +314,21 @@ static const struct victim_selection default_v_ops = {

 static struct inode *find_gc_inode(nid_t ino, struct list_head *ilist)
 {
-   struct list_head *this;
struct inode_entry *ie;

-   list_for_each(this, ilist) {
-   ie = list_entry(this, struct inode_entry, list);
+   list_for_each_entry(ie, ilist, list)
if (ie->inode->i_ino == ino)
return ie->inode;
-   }
return NULL;
 }

 static void add_gc_inode(struct inode *inode, struct list_head *ilist)
 {
-   struct list_head *this;
-   struct inode_entry *new_ie, *ie;
+   struct inode_entry *new_ie;

-   list_for_each(this, ilist) {
-   ie = list_entry(this, struct inode_entry, list);
-   if (ie->inode == inode) {
-   iput(inode);
-   return;
-   }
+   if (inode == find_gc_inode(inode->i_ino, ilist)) {
+   iput(inode);
+   return;
}
 repeat:
new_ie = kmem_cache_alloc(winode_slab, GFP_NOFS);
-- 
1.7.7


Re: [PATCH] scsi: Introduce a help function local_time_seconds() to simplify the getting time stamp operation

2013-05-29 Thread Gu Zheng
On 05/30/2013 11:00 AM, Libo Chen wrote:

> On 2013/5/29 17:33, Gu Zheng wrote:
>> >From 4d4caa16f3886ae910ad6dfe13353fc836f546cc Mon Sep 17 00:00:00 2001
>> From: Gu Zheng 
>> Date: Wed, 29 May 2013 17:34:22 +0900
>> Subject: [PATCH] driver/scsi: Introduce a help function local_time_seconds() 
>> to simplify the getting time stamp operation
>>
> hi gu,
> 
> next time, you can remove above info.

Ah~, I forgot to clean up the unnecessary header info, thanks for the reminder.

Regards,
Gu

> 
> 
> thanks,
> 
> Libo
> 
> 
>> Signed-off-by: Gu Zheng 
>> ---
>>  drivers/scsi/3w-9xxx.c |   14 ++
>>  drivers/scsi/3w-sas.c  |   14 ++
>>  include/scsi/scsi.h|9 +
>>  3 files changed, 13 insertions(+), 24 deletions(-)
>>
>> diff --git a/drivers/scsi/3w-9xxx.c b/drivers/scsi/3w-9xxx.c
>> index 5e1e12c..44b3ea8 100644
>> --- a/drivers/scsi/3w-9xxx.c
>> +++ b/drivers/scsi/3w-9xxx.c
>> @@ -374,8 +374,6 @@ out:
>>  /* This function will queue an event */
>>  static void twa_aen_queue_event(TW_Device_Extension *tw_dev, 
>> TW_Command_Apache_Header *header)
>>  {
>> -u32 local_time;
>> -struct timeval time;
>>  TW_Event *event;
>>  unsigned short aen;
>>  char host[16];
>> @@ -398,9 +396,7 @@ static void twa_aen_queue_event(TW_Device_Extension 
>> *tw_dev, TW_Command_Apache_H
>>  memset(event, 0, sizeof(TW_Event));
>>  
>>  event->severity = TW_SEV_OUT(header->status_block.severity__reserved);
>> -do_gettimeofday(&time);
>> -local_time = (u32)(time.tv_sec - (sys_tz.tz_minuteswest * 60));
>> -event->time_stamp_sec = local_time;
>> +event->time_stamp_sec = local_time_seconds();
>>  event->aen_code = aen;
>>  event->retrieved = TW_AEN_NOT_RETRIEVED;
>>  event->sequence_id = tw_dev->error_sequence_id;
>> @@ -479,11 +475,9 @@ out:
>>  static void twa_aen_sync_time(TW_Device_Extension *tw_dev, int request_id)
>>  {
>>  u32 schedulertime;
>> -struct timeval utc;
>>  TW_Command_Full *full_command_packet;
>>  TW_Command *command_packet;
>>  TW_Param_Apache *param;
>> -u32 local_time;
>>  
>>  /* Fill out the command packet */
>>  full_command_packet = tw_dev->command_packet_virt[request_id];
>> @@ -503,11 +497,7 @@ static void twa_aen_sync_time(TW_Device_Extension 
>> *tw_dev, int request_id)
>>  param->parameter_id = cpu_to_le16(0x3); /* SchedulerTime */
>>  param->parameter_size_bytes = cpu_to_le16(4);
>>  
>> -/* Convert system time in UTC to local time seconds since last 
>> -   Sunday 12:00AM */
>> -do_gettimeofday(&utc);
>> -local_time = (u32)(utc.tv_sec - (sys_tz.tz_minuteswest * 60));
>> -schedulertime = local_time - (3 * 86400);
>> +schedulertime = local_time_seconds() - (3 * 86400);
>>  schedulertime = cpu_to_le32(schedulertime % 604800);
>>  
>>  memcpy(param->data, &schedulertime, sizeof(u32));
>> diff --git a/drivers/scsi/3w-sas.c b/drivers/scsi/3w-sas.c
>> index c845bdb..69f1d8a 100644
>> --- a/drivers/scsi/3w-sas.c
>> +++ b/drivers/scsi/3w-sas.c
>> @@ -236,8 +236,6 @@ out:
>>  /* This function will queue an event */
>>  static void twl_aen_queue_event(TW_Device_Extension *tw_dev, 
>> TW_Command_Apache_Header *header)
>>  {
>> -u32 local_time;
>> -struct timeval time;
>>  TW_Event *event;
>>  unsigned short aen;
>>  char host[16];
>> @@ -256,9 +254,7 @@ static void twl_aen_queue_event(TW_Device_Extension 
>> *tw_dev, TW_Command_Apache_H
>>  memset(event, 0, sizeof(TW_Event));
>>  
>>  event->severity = TW_SEV_OUT(header->status_block.severity__reserved);
>> -do_gettimeofday(&time);
>> -local_time = (u32)(time.tv_sec - (sys_tz.tz_minuteswest * 60));
>> -event->time_stamp_sec = local_time;
>> +event->time_stamp_sec = local_time_seconds();
>>  event->aen_code = aen;
>>  event->retrieved = TW_AEN_NOT_RETRIEVED;
>>  event->sequence_id = tw_dev->error_sequence_id;
>> @@ -444,11 +440,9 @@ out:
>>  static void twl_aen_sync_time(TW_Device_Extension *tw_dev, int request_id)
>>  {
>>  u32 schedulertime;
>> -struct timeval utc;
>>  TW_Command_Full *full_command_packet;
>>  TW_Command *command_packet;
>>  TW_Param_Apache *param;
>> -u32 local_time;
>>  
>>  /* Fill 

Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier

2013-06-24 Thread Gu Zheng
On 06/19/2013 01:10 AM, Vasilis Liaskovitis wrote:

> Hi,
> 
> On Thu, Jun 13, 2013 at 09:02:47PM +0800, Tang Chen wrote:
>> From: Yinghai Lu 
>>
>> No offence, just rebase and resend the patches from Yinghai to help
>> to push this functionality faster.
>> Also improve the comments in the patches' log.
>>
>>
>> One commit that tried to parse SRAT early get reverted before v3.9-rc1.
>>
>> | commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
>> | Author: Tang Chen 
>> | Date:   Fri Feb 22 16:33:44 2013 -0800
>> |
>> |acpi, memory-hotplug: parse SRAT before memblock is ready
>>
>> It broke several things, like acpi override and fall back path etc.
>>
>> This patchset is clean implementation that will parse numa info early.
>> 1. keep the acpi table initrd override working by split finding with copying.
>>finding is done at head_32.S and head64.c stage,
>> in head_32.S, initrd is accessed in 32bit flat mode with phys addr.
>> in head64.c, initrd is accessed via kernel low mapping address
>> with help of #PF set page table.
>>copying is done with early_ioremap just after memblock is setup.
>> 2. keep fallback path working. numaq and ACPI and amd_nmua and dummy.
>>seperate initmem_init to two stages.
>>early_initmem_init will only extract numa info early into numa_meminfo.
>>initmem_init will keep slit and emulation handling.
>> 3. keep other old code flow untouched like relocate_initrd and initmem_init.
>>early_initmem_init will take old init_mem_mapping position.
>>it call early_x86_numa_init and init_mem_mapping for every nodes.
>>For 64bit, we avoid having size limit on initrd, as relocate_initrd
>>is still after init_mem_mapping for all memory.
>> 4. last patch will try to put page table on local node, so that memory
>>hotplug will be happy.
>>
>> In short, early_initmem_init will parse numa info early and call
>> init_mem_mapping to set page table for every nodes's mem.
>>
>> could be found at:
>> 
>> git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git 
>> for-x86-mm
>>
>> and it is based on today's Linus tree.
>>
> 
> Has this patchset been tested on various numa configs?
> I am using linux-next next-20130607 + part1 with qemu/kvm/seabios VMs. The 
> kernel
> boots successfully in many numa configs but while trying different memory 
> sizes
> for a 2 numa node VM, I noticed that booting does not complete in all cases
> (bootup screen appears to hang but there is no output indicating an early 
> panic)
> 
> node0   node1   boots
> 1G      1G      yes
> 1G      2G      yes
> 1G      0.5G    yes
> 3G      2.5G    yes
> 3G      3G      yes
> 4G      0G      yes
> 4G      4G      yes
> 1.5G    1G      no
> 2G      1G      no
> 2G      2G      no
> 2.5G    2G      no
> 2.5G    2.5G    no
> 
> linux-next next-20130607 boots all of these configs fine.
> 
> Looks odd, perhaps I have something wrong in my setup or maybe there is a
> seabios/qemu interaction with this patchset. I will update if I find 
> something.

Hi Vasilis,
   This patchset works well with all the numa configs you mentioned, on the
latest kernel tree (3.10-rc7) on our box.

Host OS: RHEL 6.4 Beta
qemu-kvm: 0.12.1.2 (Released with RHEL 6.4 Beta)
Guest OS: RHEL 6.3 
Guest kernel: 3.10-rc7 + [Part1 PATCH v5 ] x86, ACPI, numa: Parse numa info earlier
Cmd:

/usr/libexec/qemu-kvm -name rhel_6.3 -S -M rhel6.4.0 -enable-kvm \
  -m 5120 -smp 4,sockets=4,cores=1,threads=1 \
  -numa node,nodeid=0,cpus=0-1,mem=2560 \
  -numa node,nodeid=1,cpus=2-3,mem=2560 \
  -uuid fa11164c-1a09-280b-eae4-e2c40c631767 -nodefconfig -nodefaults \
  -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/rhel_6.3.monitor,server,nowait \
  -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown \
  -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
  -drive file=/home/hut-rhel6.3.img,if=none,id=drive-virtio-disk0,format=qcow2,cache=none \
  -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
  -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=27 \
  -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:28:6e:29,bus=pci.0,addr=0x3 \
  -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 \
  -device usb-tablet,id=input0 -vnc 127.0.0.1:0 -vga cirrus \
  -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5


Result:
node0   node1   boots
1G      1G      yes
1G      2G      yes
1G      0.5G    yes
3G      2.5G    yes
3G      3G      yes
4G      0G      yes
4G      4G      yes
1.5G    1G      yes
2G      1G      yes
2G      2G      yes
2.5G    2G      yes
2.5G    2.5G    yes

Thanks,

Gu


> 
> thanks,
> 
> - Vasilis
> 
> 
> 



Subject: [PATCH 1/2] fs/v9fs: Remove the unused variable "err" in v9fs_vfs_getattr()

2013-06-25 Thread Gu Zheng
Delete the unused variable "err" in v9fs_vfs_getattr()

Signed-off-by: Gu Zheng 
---
 fs/9p/vfs_inode.c |2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index d86edc8..25b018e 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -1054,13 +1054,11 @@ static int
 v9fs_vfs_getattr(struct vfsmount *mnt, struct dentry *dentry,
 struct kstat *stat)
 {
-   int err;
struct v9fs_session_info *v9ses;
struct p9_fid *fid;
struct p9_wstat *st;
 
p9_debug(P9_DEBUG_VFS, "dentry: %p\n", dentry);
-   err = -EPERM;
v9ses = v9fs_dentry2v9ses(dentry);
if (v9ses->cache == CACHE_LOOSE || v9ses->cache == CACHE_FSCACHE) {
generic_fillattr(dentry->d_inode, stat);
-- 
1.7.7


Subject: [PATCH] fs/v9fs: Remove the unused variable "err" in v9fs_vfs_getattr()

2013-06-25 Thread Gu Zheng
Delete the unused variable "err" in v9fs_vfs_getattr()

Signed-off-by: Gu Zheng 
---
 fs/9p/vfs_inode.c |2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index d86edc8..25b018e 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -1054,13 +1054,11 @@ static int
 v9fs_vfs_getattr(struct vfsmount *mnt, struct dentry *dentry,
 struct kstat *stat)
 {
-   int err;
struct v9fs_session_info *v9ses;
struct p9_fid *fid;
struct p9_wstat *st;
 
p9_debug(P9_DEBUG_VFS, "dentry: %p\n", dentry);
-   err = -EPERM;
v9ses = v9fs_dentry2v9ses(dentry);
if (v9ses->cache == CACHE_LOOSE || v9ses->cache == CACHE_FSCACHE) {
generic_fillattr(dentry->d_inode, stat);
-- 
1.7.7


Re: [patch v8 6/9] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

2013-06-09 Thread Gu Zheng
On 06/07/2013 03:20 PM, Alex Shi wrote:

> They are the base values in load balance, update them with rq runnable
> load average, then the load balance will consider runnable load avg
> naturally.
> 
> We also try to include the blocked_load_avg as cpu load in balancing,
> but that cause kbuild performance drop 6% on every Intel machine, and
> aim7/oltp drop on some of 4 CPU sockets machines.

Hi Alex,
   Could you explain why including the blocked_load_avg causes the performance
drop?

Thanks,
Gu

> 
> Signed-off-by: Alex Shi 


Reviewed-by: Gu Zheng 

> ---
>  kernel/sched/fair.c |  5 +++--
>  kernel/sched/proc.c | 17 +++--
>  2 files changed, 18 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 42c7be0..eadd2e7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2962,7 +2962,7 @@ static void dequeue_task_fair(struct rq *rq, struct 
> task_struct *p, int flags)
>  /* Used instead of source_load when we know the type == 0 */
>  static unsigned long weighted_cpuload(const int cpu)
>  {
> - return cpu_rq(cpu)->load.weight;
> + return cpu_rq(cpu)->cfs.runnable_load_avg;
>  }
>  
>  /*
> @@ -3007,9 +3007,10 @@ static unsigned long cpu_avg_load_per_task(int cpu)
>  {
>   struct rq *rq = cpu_rq(cpu);
>   unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
> + unsigned long load_avg = rq->cfs.runnable_load_avg;
>  
>   if (nr_running)
> - return rq->load.weight / nr_running;
> + return load_avg / nr_running;
>  
>   return 0;
>  }
> diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
> index bb3a6a0..ce5cd48 100644
> --- a/kernel/sched/proc.c
> +++ b/kernel/sched/proc.c
> @@ -501,6 +501,18 @@ static void __update_cpu_load(struct rq *this_rq, 
> unsigned long this_load,
>   sched_avg_update(this_rq);
>  }
>  
> +#ifdef CONFIG_SMP
> +unsigned long get_rq_runnable_load(struct rq *rq)
> +{
> + return rq->cfs.runnable_load_avg;
> +}
> +#else
> +unsigned long get_rq_runnable_load(struct rq *rq)
> +{
> + return rq->load.weight;
> +}
> +#endif
> +
>  #ifdef CONFIG_NO_HZ_COMMON
>  /*
>   * There is no sane way to deal with nohz on smp when using jiffies because 
> the
> @@ -522,7 +534,7 @@ static void __update_cpu_load(struct rq *this_rq, 
> unsigned long this_load,
>  void update_idle_cpu_load(struct rq *this_rq)
>  {
>   unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
> - unsigned long load = this_rq->load.weight;
> + unsigned long load = get_rq_runnable_load(this_rq);
>   unsigned long pending_updates;
>  
>   /*
> @@ -568,11 +580,12 @@ void update_cpu_load_nohz(void)
>   */
>  void update_cpu_load_active(struct rq *this_rq)
>  {
> + unsigned long load = get_rq_runnable_load(this_rq);
>   /*
>* See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
>*/
>   this_rq->last_load_update_tick = jiffies;
> - __update_cpu_load(this_rq, this_rq->load.weight, 1);
> + __update_cpu_load(this_rq, load, 1);
>  
>   calc_load_account_active(this_rq);
>  }




Re: [patch v8 3/9] sched: set initial value of runnable avg for new forked task

2013-06-09 Thread Gu Zheng
On 06/07/2013 03:20 PM, Alex Shi wrote:

> We need initialize the se.avg.{decay_count, load_avg_contrib} for a
> new forked task.
> Otherwise random values of above variables cause mess when do new task
> enqueue:
> enqueue_task_fair
> enqueue_entity
> enqueue_entity_load_avg
> 
> and make forking balancing imbalance since incorrect load_avg_contrib.
> 
> Further more, Morten Rasmussen notice some tasks were not launched at
> once after created. So Paul and Peter suggest giving a start value for
> new task runnable avg time same as sched_slice().
> 
> Signed-off-by: Alex Shi 


Reviewed-by: Gu Zheng 
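
A small userspace sketch of what this initialisation works out to, for
reference. It assumes the contribution is computed as
runnable_avg_sum * weight / (runnable_avg_period + 1), which is how I read the
3.10-era fair.c, and it uses a hypothetical 6 ms slice for a nice-0 task
(weight 1024). The toy_* names are stand-ins, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the fields of struct sched_avg that the patch seeds. */
struct toy_avg {
	uint32_t runnable_avg_sum;
	uint32_t runnable_avg_period;
	unsigned long load_avg_contrib;
};

/*
 * Assumed contribution formula: sum * weight / (period + 1).
 * The "+ 1" keeps the division defined for a brand-new entity.
 */
static unsigned long toy_contrib(const struct toy_avg *avg, unsigned long weight)
{
	uint64_t contrib = (uint64_t)avg->runnable_avg_sum * weight;

	return (unsigned long)(contrib / ((uint64_t)avg->runnable_avg_period + 1));
}

/* Mirrors the patch: seed sum and period with the slice, scaled by >> 10. */
static void toy_set_task_runnable_avg(struct toy_avg *avg, uint64_t slice_ns,
				      unsigned long weight)
{
	uint32_t slice;

	assert((slice_ns >> 10) <= UINT32_MAX);
	slice = (uint32_t)(slice_ns >> 10);	/* ns -> roughly us */

	avg->runnable_avg_sum = slice;
	avg->runnable_avg_period = slice;
	avg->load_avg_contrib = toy_contrib(avg, weight);
}
```

Under these assumptions the seeded contribution lands just below the full
weight, so the new task looks almost fully runnable at its first enqueue
instead of carrying whatever value happened to be in the struct.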

> ---
>  kernel/sched/core.c  |  6 ++
>  kernel/sched/fair.c  | 23 +++
>  kernel/sched/sched.h |  2 ++
>  3 files changed, 27 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b9e7036..6f226c2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1598,10 +1598,6 @@ static void __sched_fork(struct task_struct *p)
>   p->se.vruntime  = 0;
>   INIT_LIST_HEAD(&p->se.group_node);
>  
> -#ifdef CONFIG_SMP
> - p->se.avg.runnable_avg_period = 0;
> - p->se.avg.runnable_avg_sum = 0;
> -#endif
>  #ifdef CONFIG_SCHEDSTATS
>   memset(&p->se.statistics, 0, sizeof(p->se.statistics));
>  #endif
> @@ -1745,6 +1741,8 @@ void wake_up_new_task(struct task_struct *p)
>   set_task_cpu(p, select_task_rq(p, SD_BALANCE_FORK, 0));
>  #endif
>  
> + /* Give new task start runnable values */
> + set_task_runnable_avg(p);
>   rq = __task_rq_lock(p);
>   activate_task(rq, p, 0);
>   p->on_rq = 1;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f404468..1fc30b9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -680,6 +680,26 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct 
> sched_entity *se)
>   return calc_delta_fair(sched_slice(cfs_rq, se), se);
>  }
>  
> +#ifdef CONFIG_SMP
> +static inline void __update_task_entity_contrib(struct sched_entity *se);
> +
> +/* Give new task start runnable values to heavy its load in infant time */
> +void set_task_runnable_avg(struct task_struct *p)
> +{
> + u32 slice;
> +
> + p->se.avg.decay_count = 0;
> + slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
> + p->se.avg.runnable_avg_sum = slice;
> + p->se.avg.runnable_avg_period = slice;
> + __update_task_entity_contrib(&p->se);
> +}
> +#else
> +void set_task_runnable_avg(struct task_struct *p)
> +{
> +}
> +#endif
> +
>  /*
>   * Update the current task's runtime statistics. Skip current tasks that
>   * are not in our scheduling class.
> @@ -1527,6 +1547,9 @@ static inline void enqueue_entity_load_avg(struct 
> cfs_rq *cfs_rq,
>* We track migrations using entity decay_count <= 0, on a wake-up
>* migration we use a negative decay count to track the remote decays
>* accumulated while sleeping.
> +  *
> +  * When enqueue a new forked task, the se->avg.decay_count == 0, so
> +  * we bypass update_entity_load_avg(), use avg.load_avg_contrib direct.
>*/
>   if (unlikely(se->avg.decay_count <= 0)) {
>   se->avg.last_runnable_update = rq_clock_task(rq_of(cfs_rq));
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 24b1503..8bc66c6 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1058,6 +1058,8 @@ extern void init_rt_bandwidth(struct rt_bandwidth 
> *rt_b, u64 period, u64 runtime
>  
>  extern void update_idle_cpu_load(struct rq *this_rq);
>  
> +extern void set_task_runnable_avg(struct task_struct *p);
> +
>  #ifdef CONFIG_PARAVIRT
>  static inline u64 steal_ticks(u64 steal)
>  {




[PATCH RESEND] scsi: Introduce a help function local_time_seconds() to simplify the getting time stamp operation

2013-06-09 Thread Gu Zheng
Four places in the SCSI subsystem convert system time in UTC to local time in
seconds for use as a time stamp, so introduce a helper function,
local_time_seconds(), to simplify these operations.
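
The arithmetic involved can be sketched in userspace as follows. This is only
an illustration: the toy_* functions are hypothetical stand-ins that take the
UTC seconds and the minutes-west offset as parameters instead of reading
do_gettimeofday() and sys_tz.

```c
#include <assert.h>
#include <stdint.h>

#define SECS_PER_DAY	86400u
#define SECS_PER_WEEK	604800u

static_assert(SECS_PER_WEEK == 7 * SECS_PER_DAY, "a week is seven days");

/*
 * Stand-in for the helper: shift UTC seconds by the timezone offset,
 * expressed in minutes west of Greenwich as in sys_tz.tz_minuteswest.
 */
static uint32_t toy_local_time_seconds(int64_t utc_sec, int minuteswest)
{
	return (uint32_t)(utc_sec - (int64_t)minuteswest * 60);
}

/*
 * Seconds since the last Sunday 12:00AM, as in the *_aen_sync_time() callers.
 * The epoch (1970-01-01) fell on a Thursday, so subtracting three days
 * (equivalently, adding four modulo a week) aligns the remainder with Sunday.
 */
static uint32_t toy_scheduler_time(uint32_t local_time)
{
	return (local_time - 3 * SECS_PER_DAY) % SECS_PER_WEEK;
}
```

For example, 2013-06-09 00:00:00 UTC (a Sunday) is 1370736000 seconds since the
epoch; with minuteswest = -480 (UTC+8) the local value is 1370764800, and the
scheduler time for that value is 28800, i.e. eight hours into Sunday.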


Signed-off-by: Gu Zheng 
---
 drivers/scsi/3w-9xxx.c |   14 ++
 drivers/scsi/3w-sas.c  |   14 ++
 include/scsi/scsi.h|9 +
 3 files changed, 13 insertions(+), 24 deletions(-)

diff --git a/drivers/scsi/3w-9xxx.c b/drivers/scsi/3w-9xxx.c
index 5e1e12c..44b3ea8 100644
--- a/drivers/scsi/3w-9xxx.c
+++ b/drivers/scsi/3w-9xxx.c
@@ -374,8 +374,6 @@ out:
 /* This function will queue an event */
 static void twa_aen_queue_event(TW_Device_Extension *tw_dev, 
TW_Command_Apache_Header *header)
 {
-   u32 local_time;
-   struct timeval time;
TW_Event *event;
unsigned short aen;
char host[16];
@@ -398,9 +396,7 @@ static void twa_aen_queue_event(TW_Device_Extension 
*tw_dev, TW_Command_Apache_H
memset(event, 0, sizeof(TW_Event));
 
event->severity = TW_SEV_OUT(header->status_block.severity__reserved);
-   do_gettimeofday(&time);
-   local_time = (u32)(time.tv_sec - (sys_tz.tz_minuteswest * 60));
-   event->time_stamp_sec = local_time;
+   event->time_stamp_sec = local_time_seconds();
event->aen_code = aen;
event->retrieved = TW_AEN_NOT_RETRIEVED;
event->sequence_id = tw_dev->error_sequence_id;
@@ -479,11 +475,9 @@ out:
 static void twa_aen_sync_time(TW_Device_Extension *tw_dev, int request_id)
 {
u32 schedulertime;
-   struct timeval utc;
TW_Command_Full *full_command_packet;
TW_Command *command_packet;
TW_Param_Apache *param;
-   u32 local_time;
 
/* Fill out the command packet */
full_command_packet = tw_dev->command_packet_virt[request_id];
@@ -503,11 +497,7 @@ static void twa_aen_sync_time(TW_Device_Extension *tw_dev, 
int request_id)
param->parameter_id = cpu_to_le16(0x3); /* SchedulerTime */
param->parameter_size_bytes = cpu_to_le16(4);
 
-   /* Convert system time in UTC to local time seconds since last 
-   Sunday 12:00AM */
-   do_gettimeofday(&utc);
-   local_time = (u32)(utc.tv_sec - (sys_tz.tz_minuteswest * 60));
-   schedulertime = local_time - (3 * 86400);
+   schedulertime = local_time_seconds() - (3 * 86400);
schedulertime = cpu_to_le32(schedulertime % 604800);
 
memcpy(param->data, &schedulertime, sizeof(u32));
diff --git a/drivers/scsi/3w-sas.c b/drivers/scsi/3w-sas.c
index c845bdb..69f1d8a 100644
--- a/drivers/scsi/3w-sas.c
+++ b/drivers/scsi/3w-sas.c
@@ -236,8 +236,6 @@ out:
 /* This function will queue an event */
 static void twl_aen_queue_event(TW_Device_Extension *tw_dev, 
TW_Command_Apache_Header *header)
 {
-   u32 local_time;
-   struct timeval time;
TW_Event *event;
unsigned short aen;
char host[16];
@@ -256,9 +254,7 @@ static void twl_aen_queue_event(TW_Device_Extension 
*tw_dev, TW_Command_Apache_H
memset(event, 0, sizeof(TW_Event));
 
event->severity = TW_SEV_OUT(header->status_block.severity__reserved);
-   do_gettimeofday(&time);
-   local_time = (u32)(time.tv_sec - (sys_tz.tz_minuteswest * 60));
-   event->time_stamp_sec = local_time;
+   event->time_stamp_sec = local_time_seconds();
event->aen_code = aen;
event->retrieved = TW_AEN_NOT_RETRIEVED;
event->sequence_id = tw_dev->error_sequence_id;
@@ -444,11 +440,9 @@ out:
 static void twl_aen_sync_time(TW_Device_Extension *tw_dev, int request_id)
 {
u32 schedulertime;
-   struct timeval utc;
TW_Command_Full *full_command_packet;
TW_Command *command_packet;
TW_Param_Apache *param;
-   u32 local_time;
 
/* Fill out the command packet */
full_command_packet = tw_dev->command_packet_virt[request_id];
@@ -468,11 +462,7 @@ static void twl_aen_sync_time(TW_Device_Extension *tw_dev, 
int request_id)
param->parameter_id = cpu_to_le16(0x3); /* SchedulerTime */
param->parameter_size_bytes = cpu_to_le16(4);
 
-   /* Convert system time in UTC to local time seconds since last 
-   Sunday 12:00AM */
-   do_gettimeofday(&utc);
-   local_time = (u32)(utc.tv_sec - (sys_tz.tz_minuteswest * 60));
-   schedulertime = local_time - (3 * 86400);
+   schedulertime = local_time_seconds() - (3 * 86400);
schedulertime = cpu_to_le32(schedulertime % 604800);
 
memcpy(param->data, &schedulertime, sizeof(u32));
diff --git a/include/scsi/scsi.h b/include/scsi/scsi.h
index 66216c1..f3377ca 100644
--- a/include/scsi/scsi.h
+++ b/include/scsi/scsi.h
@@ -574,4 +574,13 @@ static inline __u32 scsi_to_u32(__u8 *ptr)
return (ptr[0]<<24) + (ptr[1]<<16) + (ptr[2]<<8) + ptr[3];
 }
 
+/*
+ * Convert system time in UTC to local tim

Re: [patch v8 6/9] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

2013-06-09 Thread Gu Zheng
On 06/10/2013 10:01 AM, Alex Shi wrote:

> On 06/10/2013 09:49 AM, Gu Zheng wrote:
>> On 06/07/2013 03:20 PM, Alex Shi wrote:
>>
>>>> They are the base values in load balance, update them with rq runnable
>>>> load average, then the load balance will consider runnable load avg
>>>> naturally.
>>>>
>>>> We also try to include the blocked_load_avg as cpu load in balancing,
>>>> but that cause kbuild performance drop 6% on every Intel machine, and
>>>> aim7/oltp drop on some of 4 CPU sockets machines.
>> Hi Alex,
>>Could you explain me why including the blocked_load_avg causes 
>> performance drop ?
> 
> 
> Thanks for review!
> 
> the 9th patch has few explanation. like, after the only task got into
> sleep in a CPU, there is only blocked_load_avg left, it looks quite big
> in short time. that, block it get tasks before sleep, drive task to
> other cpu in periodic balance. So, it cause clear load imbalance.
> 

Got it. Thanks very much for your explanation.:)

Best regards,
Gu


Re: [PATCH RESEND] scsi: Introduce a help function local_time_seconds() to simplify the getting time stamp operation

2013-06-10 Thread Gu Zheng
On 06/11/2013 01:47 AM, James Bottomley wrote:

> On Mon, 2013-06-10 at 09:57 +0800, Gu Zheng wrote:
>> diff --git a/include/scsi/scsi.h b/include/scsi/scsi.h
>> index 66216c1..f3377ca 100644
>> --- a/include/scsi/scsi.h
>> +++ b/include/scsi/scsi.h
>> @@ -574,4 +574,13 @@ static inline __u32 scsi_to_u32(__u8 *ptr)
>> return (ptr[0]<<24) + (ptr[1]<<16) + (ptr[2]<<8) + ptr[3];
>>  }
>>  
>> +/*
>> + * Convert system time in UTC to local time seconds.
>> + */
>> +static inline u32 local_time_seconds(void)
>> +{
>> +   struct timeval utc;
>> +   do_gettimeofday(&utc);
>> +   return (u32)(utc.tv_sec - (sys_tz.tz_minuteswest * 60));
>> +}
>>  #endif /* _SCSI_SCSI_H */
> 
> This doesn't belong in SCSI.
> 
> It's not a common pattern, so just leave it open coded in the 3ware
> drivers.  If there's a need for it to be a common pattern, John Stultz
> will add it to the timer code, but at the moment, he doesn't seem to see
> the need.

Hi James,
OK, thanks for the reminder.
As you mentioned in an old thread, what about using "jiffies to seconds" to
replace the existing timestamps?

Best regards,
Gu



> 
> James
> 
> 
> 




[PATCH] f2fs: Remove the second argument "sbi" of func parse_options()

2013-06-10 Thread Gu Zheng
We can get f2fs_sb_info via F2FS_SB(sb), so remove the second argument "sbi" of
parse_options().
f2fs_fill_super(), currently the only caller of parse_options(), now sets
sb->s_fs_info = sbi before calling parse_options().
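
The ordering requirement can be illustrated with a reduced userspace model.
Everything named toy_* or TOY_* below is a hypothetical stand-in for the real
super_block / f2fs_sb_info types, and the single "active_logs=2" option is only
there to give the parser something to do.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for struct f2fs_sb_info and struct super_block. */
struct toy_sb_info {
	int active_logs;
};

struct toy_super_block {
	void *s_fs_info;	/* filesystem-private data */
};

/* Same idea as F2FS_SB(): recover the private info from the super block. */
static struct toy_sb_info *TOY_SB(struct toy_super_block *sb)
{
	assert(sb != NULL);
	return sb->s_fs_info;
}

/* One argument fewer: sbi is derived from sb instead of being passed in. */
static int toy_parse_options(struct toy_super_block *sb, const char *options)
{
	struct toy_sb_info *sbi = TOY_SB(sb);

	if (!sbi)
		return -1;	/* s_fs_info was not set before the call */
	if (options && strcmp(options, "active_logs=2") == 0)
		sbi->active_logs = 2;
	return 0;
}

static int toy_fill_super(struct toy_super_block *sb, struct toy_sb_info *sbi,
			  const char *data)
{
	sb->s_fs_info = sbi;	/* must happen before toy_parse_options() */
	sbi->active_logs = 6;	/* default, may be overridden by options */
	return toy_parse_options(sb, data);
}
```

In this model, calling toy_parse_options() on a super block whose s_fs_info is
still NULL fails, which is why the assignment has to move ahead of the parse
step once the explicit sbi argument is gone.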

Signed-off-by: Gu Zheng 
---
 fs/f2fs/super.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 8555f7d..58516b5 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -303,9 +303,9 @@ static const struct export_operations f2fs_export_ops = {
.get_parent = f2fs_get_parent,
 };
 
-static int parse_options(struct super_block *sb, struct f2fs_sb_info *sbi,
-   char *options)
+static int parse_options(struct super_block *sb, char *options)
 {
+   struct f2fs_sb_info *sbi = F2FS_SB(sb);
substring_t args[MAX_OPT_ARGS];
char *p;
int arg = 0;
@@ -541,6 +541,7 @@ static int f2fs_fill_super(struct super_block *sb, void 
*data, int silent)
if (err)
goto free_sb_buf;
}
+   sb->s_fs_info = sbi;
/* init some FS parameters */
sbi->active_logs = NR_CURSEG_TYPE;
 
@@ -553,7 +554,7 @@ static int f2fs_fill_super(struct super_block *sb, void 
*data, int silent)
set_opt(sbi, POSIX_ACL);
 #endif
/* parse mount options */
-   err = parse_options(sb, sbi, (char *)data);
+   err = parse_options(sb, (char *)data);
if (err)
goto free_sb_buf;
 
@@ -565,7 +566,6 @@ static int f2fs_fill_super(struct super_block *sb, void 
*data, int silent)
sb->s_xattr = f2fs_xattr_handlers;
sb->s_export_op = &f2fs_export_ops;
sb->s_magic = F2FS_SUPER_MAGIC;
-   sb->s_fs_info = sbi;
sb->s_time_gran = 1;
sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
(test_opt(sbi, POSIX_ACL) ? MS_POSIXACL : 0);
-- 
1.7.7


Re: [PATCH] vfs: remove the unnecessrary code of fs/inode.c

2013-07-01 Thread Gu Zheng
On 07/01/2013 08:19 PM, Dong Fang wrote:

> These functions, such as find_inode_fast() and find_inode(), iget_lock() and
> iget5_lock(), insert_inode_locked() and insert_inode_locked4(), almost have
> the same code.

Maybe the title "[PATCH] vfs: remove the duplicate code in fs/inode.c" would be
more suitable.

> 
> Signed-off-by: Dong Fang 


Reviewed-by: Gu Zheng 

Thanks,
Gu
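
As a rough userspace illustration of the consolidation being reviewed: a
generic lookup takes a match callback, and the "fast" by-number lookup becomes
a thin wrapper around it. The toy_* types and functions are hypothetical and
leave out everything the real code has to handle (i_lock, the i_sb check,
I_FREEING/I_WILL_FREE and the retry loop).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much-reduced stand-in for struct inode on a hash chain. */
struct toy_inode {
	unsigned long i_ino;
	struct toy_inode *next;
};

typedef int (*toy_test_fn)(struct toy_inode *inode, void *data);

/* Generic lookup: the caller supplies the match predicate. */
static struct toy_inode *toy_find_inode(struct toy_inode *head,
					toy_test_fn test, void *data)
{
	struct toy_inode *inode;

	assert(test != NULL);
	for (inode = head; inode; inode = inode->next)
		if (test(inode, data))
			return inode;
	return NULL;
}

/* Predicate comparing the inode number, written as a single expression. */
static int toy_test_ino(struct toy_inode *inode, void *data)
{
	return inode->i_ino == *(unsigned long *)data;
}

/* The by-number variant reduces to a call into the generic lookup. */
static struct toy_inode *toy_find_inode_fast(struct toy_inode *head,
					     unsigned long ino)
{
	return toy_find_inode(head, toy_test_ino, &ino);
}
```

The trade-off in the sketch is the same as in the patch: less duplicated code,
at the cost of an indirect call per candidate inode.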

> ---
>  fs/inode.c |  134 
> 
>  1 files changed, 26 insertions(+), 108 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 00d5fc3..847eee9 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -790,6 +790,22 @@ void prune_icache_sb(struct super_block *sb, int 
> nr_to_scan)
>  }
>  
>  static void __wait_on_freeing_inode(struct inode *inode);
> +
> +
> +static int test_ino(struct inode *inode, void *data)
> +{
> + unsigned long ino = *(unsigned long *) data;
> + return inode->i_ino == ino;

This can be more concise:
    return inode->i_ino == *(unsigned long *) data;
The same applies to the new insert_inode_locked():

> +}
> +
> +static int set_ino(struct inode *inode, void *data)
> +{
> + inode->i_ino = *(unsigned long *) data;
> + return 0;
> +}
> +
> +
> +
>  /*
>   * Called with the inode lock held.
>   */
> @@ -829,28 +845,7 @@ repeat:
>  static struct inode *find_inode_fast(struct super_block *sb,
>   struct hlist_head *head, unsigned long ino)
>  {
> - struct inode *inode = NULL;
> -
> -repeat:
> - hlist_for_each_entry(inode, head, i_hash) {
> - spin_lock(&inode->i_lock);
> - if (inode->i_ino != ino) {
> - spin_unlock(&inode->i_lock);
> - continue;
> - }
> - if (inode->i_sb != sb) {
> - spin_unlock(&inode->i_lock);
> - continue;
> - }
> - if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
> - __wait_on_freeing_inode(inode);
> - goto repeat;
> - }
> - __iget(inode);
> - spin_unlock(&inode->i_lock);
> - return inode;
> - }
> - return NULL;
> + return find_inode(sb, head, test_ino, (void *)&ino);
>  }
>  
>  /*
> @@ -1073,50 +1068,7 @@ EXPORT_SYMBOL(iget5_locked);
>   */
>  struct inode *iget_locked(struct super_block *sb, unsigned long ino)
>  {
> - struct hlist_head *head = inode_hashtable + hash(sb, ino);
> - struct inode *inode;
> -
> - spin_lock(&inode_hash_lock);
> - inode = find_inode_fast(sb, head, ino);
> - spin_unlock(&inode_hash_lock);
> - if (inode) {
> - wait_on_inode(inode);
> - return inode;
> - }
> -
> - inode = alloc_inode(sb);
> - if (inode) {
> - struct inode *old;
> -
> - spin_lock(&inode_hash_lock);
> - /* We released the lock, so.. */
> - old = find_inode_fast(sb, head, ino);
> - if (!old) {
> - inode->i_ino = ino;
> - spin_lock(&inode->i_lock);
> - inode->i_state = I_NEW;
> - hlist_add_head(&inode->i_hash, head);
> - spin_unlock(&inode->i_lock);
> - inode_sb_list_add(inode);
> - spin_unlock(&inode_hash_lock);
> -
> - /* Return the locked inode with I_NEW set, the
> -  * caller is responsible for filling in the contents
> -  */
> - return inode;
> - }
> -
> - /*
> -  * Uhhuh, somebody else created the same inode under
> -  * us. Use the old inode instead of the one we just
> -  * allocated.
> -  */
> - spin_unlock(&inode_hash_lock);
> - destroy_inode(inode);
> - inode = old;
> - wait_on_inode(inode);
> - }
> - return inode;
> + return iget5_locked(sb, ino, test_ino, set_ino, (void *)&ino);
>  }
>  EXPORT_SYMBOL(iget_locked);
>  
> @@ -1281,48 +1233,6 @@ struct inode *ilookup(struct super_block *sb, unsigned 
> long ino)
>  }
>  EXPORT_SYMBOL(ilookup);
>  
> -int insert_inode_locked(struct inode *inode)
> -{
> - struct super_block *sb = inode->i_sb;
> - ino_t ino = inode->i_ino;
> - struct hlist_head *head = inode_hashtable + hash(sb, ino);
> -
>

Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())

2013-07-02 Thread Gu Zheng
On 07/03/2013 02:00 AM, Benjamin LaHaise wrote:

> On Mon, Jul 01, 2013 at 03:23:39PM +0800, Gu Zheng wrote:
>> Hi Ben,
>> Are you still working on this patch?
>> As you know, using the current anon inode will lead to more than one 
>> instance of
>> aio can not work. Have you found a way to fix this issue? Or can we use some
>> other ones to replace the anon inode?
> 
> This patch hasn't been a high priority for me.  I would really appreciate 
> it if someone could confirm that this patch does indeed fix the hotplug 
> page migration issue by testing it in a system that hits the bug.  Removing 
> the anon_inode bits isn't too much work, but I'd just like to have some 
> confirmation that this fix is considered to be "good enough" for the 
> problem at hand before spending any further time on it.  There was talk of 
> using another approach, but it's not clear if there was any progress.

Yeah, we have not seen anyone try to fix this issue using the other approach we
talked about. I'm not sure whether your patch does indeed fix the problem, but
I'll carry out a complete test to confirm it, and I'll be very glad to continue
this work based on your patch if you do not have enough time to work on it. :)

Thanks,
Gu

> 
>   -ben




Re: [PATCH] seq_file:update file->f_pos when lseek() to m->read_pos

2013-07-03 Thread Gu Zheng
Hi Jiaxing,
Please see my inline comment. :)
On 07/02/2013 08:43 AM, Jiaxing Wang wrote:

> On 07/01/2013 08:41 PM, fangdong wrote:
>> On 06/29/2013 05:11 AM, Jiaxing Wang wrote:
>>> After pread(), file->f_pos and m->read_pos get different,
>>> and lseek() to m->read_pos did not update file->f_pos, then
>>> a subsequent read may read from a wrong position, the following
>>> program shows the problem:
>>>
>>>  char str1[32] = { 0 };
>>>  char str2[32] = { 0 };
>>>  int poffset = 10;
>>>  int count = 20;
>>>
>>>  /*open any seq file*/
>>>  int fd = open("/proc/modules", O_RDONLY);
>>>
>>>  pread(fd, str1, count, poffset);
>>>  printf("pread:%s\n", str1);
>>>
>>>  /*seek to where m->read_pos is*/
>>>  lseek(fd, poffset+count, SEEK_SET);
>>>
>>>  /*supposed to read from poffset+count, but this read from position 0*/
>>>  read(fd, str2, count);
>>>  printf("read:%s\n", str2);
>>>
>>> Signed-off-by: Jiaxing Wang 
>>> ---
>>>   fs/seq_file.c | 3 ++-
>>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/seq_file.c b/fs/seq_file.c
>>> index 774c1eb..4b22b26 100644
>>> --- a/fs/seq_file.c
>>> +++ b/fs/seq_file.c
>>> @@ -328,7 +328,8 @@ loff_t seq_lseek(struct file *file, loff_t offset, int 
>>> whence)
>>>   m->read_pos = offset;
>>>   retval = file->f_pos = offset;
>>>   }
>>> -}
>>> +} else
>>> +file->f_pos = offset;
>>>   }
>>>   file->f_version = m->version;
>>>   mutex_unlock(&m->lock);
>>>
>> This does not appear to be a problem, in linux man page, the behaver seems 
>> clearly defined:
>>
>> DESCRIPTION
>>pread() reads up to count bytes from file descriptor fd at offset  
>> off-set  (from the start of the file) into the buffer starting at buf.  The 
>> file offset is not changed.
>>
>>pwrite() writes up to count bytes from the buffer starting at buf  to 
>> the  file  descriptor  fd  at  offset  offset.   The file offset is not 
>> changed.
>>
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
> There's no problem that pread() don't change file->f_pos, but I think lseek() 
> should have changed it.

No, the way seq_lseek() deals with pos is right. Remember that a seq file is an
iterator interface: for example, it provides the info for each element of a
list, so we should make sure that what we read from a seq file is the complete
info of each entry. Therefore seq_lseek() cannot set pos to an arbitrary point,
only to points that fall on an entry boundary. If pos cannot be set to the
requested position, setting it to 0 is reasonable, I think.
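
To make the case under discussion concrete, here is a toy userspace model of
the SEEK_SET branch in the hunk quoted above. It is only a sketch: the toy_*
names are hypothetical, traversal is assumed to always succeed, and only the
two positions (f_pos and read_pos) are modelled. It shows what each variant
does and makes no claim about which one is correct.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for file->f_pos and m->read_pos. */
struct toy_seq_file {
	long f_pos;	/* position used by a plain read() */
	long read_pos;	/* position the seq_file buffer has reached */
};

/* pread()-like access: advances read_pos but leaves f_pos alone. */
static void toy_pread(struct toy_seq_file *f, long count, long offset)
{
	assert(count >= 0 && offset >= 0);
	f->read_pos = offset + count;
}

/*
 * SEEK_SET branch of the quoted hunk, with traversal assumed to succeed.
 * With patched == false, a seek to exactly read_pos skips the f_pos update;
 * with patched == true, the added "else" branch performs it.
 */
static long toy_seq_lseek(struct toy_seq_file *f, long offset, bool patched)
{
	if (offset < 0)
		return -1;
	if (offset != f->read_pos) {
		f->read_pos = offset;
		f->f_pos = offset;
	} else if (patched) {
		f->f_pos = offset;
	}
	return offset;
}
```

In the reporter's sequence (pread of 20 bytes at offset 10, then a seek to 30),
the model leaves f_pos at 0 without the extra branch and at 30 with it.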

Thanks,
Gu  

> Any comments from other people, and Alexander?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 




Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())

2013-07-03 Thread Gu Zheng
On 07/03/2013 02:00 AM, Benjamin LaHaise wrote:

> On Mon, Jul 01, 2013 at 03:23:39PM +0800, Gu Zheng wrote:
>> Hi Ben,
>> Are you still working on this patch?
>> As you know, using the current anon inode will lead to more than one 
>> instance of
>> aio can not work. Have you found a way to fix this issue? Or can we use some
>> other ones to replace the anon inode?
> 
> This patch hasn't been a high priority for me.  I would really appreciate 
> it if someone could confirm that this patch does indeed fix the hotplug 
> page migration issue by testing it in a system that hits the bug.  Removing 
> the anon_inode bits isn't too much work, but I'd just like to have some 
> confirmation that this fix is considered to be "good enough" for the 
> problem at hand before spending any further time on it.  There was talk of 
> using another approach, but it's not clear if there was any progress.

Hi Ben,
  When I tested your patch on kernel 3.10, the kernel panicked when an aio job
completed or exited, specifically in aio_free_ring(). The following is part of
the dmesg output.

Thanks,
Gu

kernel BUG at mm/swap.c:163!

invalid opcode:  [#1] SMP

Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4
nf_nat xt_CHECKSUM iptable_mangle bridge stp llc autofs4 sunrpc cpufreq_ondemand
ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip6t_REJECT
nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter
ip6_tables ipv6 vfat fat dm_mirror dm_region_hash dm_log dm_mod vhost_net
macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support acpi_cpufreq freq_table
mperf coretemp kvm_intel kvm crc32c_intel microcode pcspkr sg i2c_i801 lpc_ich
mfd_core ioatdma i7core_edac edac_core e1000e igb dca i2c_algo_bit i2c_core ptp
pps_core ext4(F) jbd2(F) mbcache(F) sd_mod(F) crc_t10dif(F) megaraid_sas(F)
mptsas(F) mptscsih(F) mptbase(F) scsi_transport_sas(F)

CPU: 4 PID: 100 Comm: kworker/4:1 Tainted: GF3.10.0-aio-migrate+
#107
Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS
Version 89.32 DP Proto 08/16/2012
Workqueue: events kill_ioctx_work

task: 8807dda974e0 ti: 8807dda98000 task.ti: 8807dda98000

RIP: 0010:[]  [] put_page+0x48/0x60

RSP: 0018:8807dda99cd8  EFLAGS: 00010246

RAX:  RBX: 8807be1f1e00 RCX: 0001

RDX:  RSI:  RDI: ea001b196c80

RBP: 8807dda99cd8 R08:  R09: 

R10: 8807ffbb5f00 R11: 005a R12: 0001

R13:  R14: 8807dda974e0 R15: 8807be1f1ec8

FS:  () GS:8807fd68() knlGS:

CS:  0010 DS:  ES:  CR0: 8005003b

CR2: 003b826dc7d0 CR3: 01a0b000 CR4: 07e0

DR0:  DR1:  DR2: 

DR3:  DR6: 0ff0 DR7: 0400

Stack:

 8807dda99d18 811b11f6  0002

 8807be1f1e00 8807be1f1e80 000c 

 8807dda99dc8 811b21a2 0001000438ec 8807fd692d00

Call Trace:

 [] aio_free_ring+0x96/0x1c0

 [] free_ioctx+0x1f2/0x250

 [] ? idle_balance+0xed/0x140

 [] put_ioctx+0x1a/0x30

 [] kill_ioctx_work+0x2f/0x40

 [] process_one_work+0x183/0x490

 [] worker_thread+0x120/0x3a0

 [] ? manage_workers+0x160/0x160

 [] kthread+0xce/0xe0

 [] ? kthread_freezable_should_stop+0x70/0x70

 [] ret_from_fork+0x7c/0xb0

 [] ? kthread_freezable_should_stop+0x70/0x70

Code: 07 00 c0 75 1f f0 ff 4f 1c 0f 94 c0 84 c0 75 0b c9 66 90 c3 0f 1f 80 00 00
00 00 e8 53 fe ff ff c9 66 90 c3 e8 7a fe ff ff c9 c3 <0f> 0b 66 0f 1f 44 00 00
eb f8 48 8b 47 30 eb bc 0f 1f 84 00 00
RIP  [] put_page+0x48/0x60

 RSP 

---[ end trace b5e2c17407c840d8 ]---

Jul  4 15:49:50 BUG: unable to handle kernel paging request at ffd8

IP: [] kthread_data+0x10/0x20

PGD 1a0c067 PUD 1a0e067 PMD 0

Oops:  [#2] SMP

Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4
nf_nat xt_CHECKSUM iptable_mangle bridge stp llc autofs4 sunrpc cpufreq_ondemand
ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip6t_REJECT
nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter
ip6_tables ipv6 vfat fat dm_mirror dm_region_hash dm_log dm_mod vhost_net
macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support acpi_cpufreq freq_table
mperf coretemp kvm_intel kvm crc32c_intel microcode pcspkr sg i2c_i801 lpc_ich
mfd_core ioatdma i7core_edac edac_core e1000e igb dca i2c_algo_bit i2c_core ptp
pps_core ext4(F) jbd2(F) mbcache(F) sd_mod(F) crc_t10dif(F) megaraid_sas(F)
mptsas(F) mptscsih(F) mptbase(F) scsi_transport_sas(F)

CPU: 4 PID: 100 Comm: kworker/4:1 Tainted: GF D  3.10.0-aio-migrate+
#107
Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 100

Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())

2013-07-04 Thread Gu Zheng
On 07/04/2013 07:41 PM, Benjamin LaHaise wrote:

> On Thu, Jul 04, 2013 at 02:51:18PM +0800, Gu Zheng wrote:
>> Hi Ben,
>>   When I test your patch on kernel 3.10, the kernel panic when aio job
>> complete or exit, exactly in aio_free_ring(), the following is a part of 
>> dmesg.
> 
> What is your test case?

Just the one you mentioned in the previous mail:
 http://www.kvack.org/~bcrl/aio/aio-numa-test.c

Thanks,
Gu

> 
>   -ben
> 




Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())

2013-06-28 Thread Gu Zheng
On 06/11/2013 10:45 PM, Benjamin LaHaise wrote:

> Hi Tang,
> 
> On Tue, Jun 11, 2013 at 05:42:31PM +0800, Tang Chen wrote:
>> Hi Benjamin,
>>
>> Are you still working on this problem ?
>>
>> Thanks. :)
> 
> Below is a copy of the most recent version of this patch I have worked 
> on.  This version works and stands up to my testing using move_pages() to 
> force the migration of the aio ring buffer.  A test program is available 
> at http://www.kvack.org/~bcrl/aio/aio-numa-test.c .  Please note that 
> this version is not suitable for mainline as the modifactions to the 

> anon inode code are undesirable, so that part needs reworking.

Hi Ben,
Are you still working on this patch?
As you know, with the current anon inode, more than one instance of aio cannot
work. Have you found a way to fix this issue? Or can we use something else to
replace the anon inode?

Thanks,
Gu

> 
>   -ben
> 
> 
>  fs/aio.c|  113 
> 
>  fs/anon_inodes.c|   14 -
>  include/linux/migrate.h |3 +
>  mm/migrate.c|2 
>  mm/swap.c   |1 
>  5 files changed, 121 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index c5b1a8c..a951690 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -35,6 +35,9 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> +#include 
>  
>  #include 
>  #include 
> @@ -108,6 +111,7 @@ struct kioctx {
>   } cacheline_aligned_in_smp;
>  
>   struct page *internal_pages[AIO_RING_PAGES];
> + struct file *ctx_file;
>  };
>  
>  /*-- sysctl variables*/
> @@ -136,18 +140,80 @@ __initcall(aio_setup);
>  
>  static void aio_free_ring(struct kioctx *ctx)
>  {
> - long i;
> -
> - for (i = 0; i < ctx->nr_pages; i++)
> - put_page(ctx->ring_pages[i]);
> + int i;
>  
>   if (ctx->mmap_size)
>   vm_munmap(ctx->mmap_base, ctx->mmap_size);
>  
> + if (ctx->ctx_file)
> + truncate_setsize(ctx->ctx_file->f_inode, 0);
> +
> + for (i = 0; i < ctx->nr_pages; i++) {
> + pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i,
> +  page_count(ctx->ring_pages[i]));
> + put_page(ctx->ring_pages[i]);
> + }
> +
>   if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages)
>   kfree(ctx->ring_pages);
> +
> + if (ctx->ctx_file) {
> + truncate_setsize(ctx->ctx_file->f_inode, 0);
> + pr_debug("pid(%d) i_nlink=%u d_count=%d, d_unhashed=%d 
> i_count=%d\n",
> +  current->pid, ctx->ctx_file->f_inode->i_nlink,
> +  ctx->ctx_file->f_path.dentry->d_count,
> +  d_unhashed(ctx->ctx_file->f_path.dentry),
> +  
> atomic_read(&ctx->ctx_file->f_path.dentry->d_inode->i_count));
> + fput(ctx->ctx_file);
> + ctx->ctx_file = NULL;
> + }
> +}
> +
> +static int aio_ctx_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + vma->vm_ops = &generic_file_vm_ops;
> + return 0;
> +}
> +
> +static const struct file_operations aio_ctx_fops = {
> + .mmap   = aio_ctx_mmap,
> +};
> +
> +static int aio_set_page_dirty(struct page *page)
> +{
> + return 0;
> +}
> +
> +static int aio_migratepage(struct address_space *mapping, struct page *new,
> +struct page *old, enum migrate_mode mode)
> +{
> + struct kioctx *ctx = mapping->private_data;
> + unsigned long flags;
> + unsigned idx = old->index;
> + int rc;
> +
> + BUG_ON(PageWriteback(old));/* Writeback must be complete */
> + put_page(old);
> + rc = migrate_page_move_mapping(mapping, new, old, NULL, mode);
> + if (rc != MIGRATEPAGE_SUCCESS) {
> + get_page(old);
> + return rc;
> + }
> + get_page(new);
> +
> + spin_lock_irqsave(&ctx->completion_lock, flags);
> + migrate_page_copy(new, old);
> + ctx->ring_pages[idx] = new;
> + spin_unlock_irqrestore(&ctx->completion_lock, flags);
> +
> + return MIGRATEPAGE_SUCCESS;
>  }
>  
> +static const struct address_space_operations aio_ctx_aops = {
> + .set_page_dirty = aio_set_page_dirty,
> + .migratepage= aio_migratepage,
> +};
> +
>  static int aio_setup_ring(struct kioctx *ctx)
>  {
>   struct aio_ring *ring;
> @@ -155,6 +221,7 @@ static int aio_setup_ring(struct kioctx *ctx)
>   struct mm_struct *mm = current->mm;
>   unsigned long size, populate;
>   int nr_pages;
> + int i;
>  
>   /* Compensate for the ring buffer's head/tail overlap entry */
>   nr_events += 2; /* 1 is required, 2 for good luck */
> @@ -166,6 +233,28 @@ static int aio_setup_ring(struct kioctx *ctx)
>   if (nr_pages < 0)
>   return -EINVAL;
>  
> + ctx->ctx_file = anon_inode_getfile("[aio]", &aio_ctx_fops, ctx, O_RDWR);
> + if (IS_ERR(ctx->ctx_fi

Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())

2013-07-01 Thread Gu Zheng
On 06/11/2013 10:45 PM, Benjamin LaHaise wrote:

> Hi Tang,
> 
> On Tue, Jun 11, 2013 at 05:42:31PM +0800, Tang Chen wrote:
>> Hi Benjamin,
>>
>> Are you still working on this problem ?
>>
>> Thanks. :)
> 
> Below is a copy of the most recent version of this patch I have worked 
> on.  This version works and stands up to my testing using move_pages() to 
> force the migration of the aio ring buffer.  A test program is available 
> at http://www.kvack.org/~bcrl/aio/aio-numa-test.c .  Please note that 
> this version is not suitable for mainline as the modifications to the 
> anon inode code are undesirable, so that part needs reworking.



Hi Ben,
Are you still working on this patch?
As you know, with the current anon inode code, more than one aio instance cannot
work correctly at the same time. Have you found a way to fix this issue? Or can
we use something else in place of the anon inode?

Thanks,
Gu

> 
>   -ben
> 
> 
>  fs/aio.c|  113 
> 
>  fs/anon_inodes.c|   14 -
>  include/linux/migrate.h |3 +
>  mm/migrate.c|2 
>  mm/swap.c   |1 
>  5 files changed, 121 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index c5b1a8c..a951690 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -35,6 +35,9 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> +#include 
>  
>  #include 
>  #include 
> @@ -108,6 +111,7 @@ struct kioctx {
>   } cacheline_aligned_in_smp;
>  
>   struct page *internal_pages[AIO_RING_PAGES];
> + struct file *ctx_file;
>  };
>  
>  /*-- sysctl variables*/
> @@ -136,18 +140,80 @@ __initcall(aio_setup);
>  
>  static void aio_free_ring(struct kioctx *ctx)
>  {
> - long i;
> -
> - for (i = 0; i < ctx->nr_pages; i++)
> - put_page(ctx->ring_pages[i]);
> + int i;
>  
>   if (ctx->mmap_size)
>   vm_munmap(ctx->mmap_base, ctx->mmap_size);
>  
> + if (ctx->ctx_file)
> + truncate_setsize(ctx->ctx_file->f_inode, 0);
> +
> + for (i = 0; i < ctx->nr_pages; i++) {
> + pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i,
> +  page_count(ctx->ring_pages[i]));
> + put_page(ctx->ring_pages[i]);
> + }
> +
>   if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages)
>   kfree(ctx->ring_pages);
> +
> + if (ctx->ctx_file) {
> + truncate_setsize(ctx->ctx_file->f_inode, 0);
> + pr_debug("pid(%d) i_nlink=%u d_count=%d, d_unhashed=%d 
> i_count=%d\n",
> +  current->pid, ctx->ctx_file->f_inode->i_nlink,
> +  ctx->ctx_file->f_path.dentry->d_count,
> +  d_unhashed(ctx->ctx_file->f_path.dentry),
> +  
> atomic_read(&ctx->ctx_file->f_path.dentry->d_inode->i_count));
> + fput(ctx->ctx_file);
> + ctx->ctx_file = NULL;
> + }
> +}
> +
> +static int aio_ctx_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + vma->vm_ops = &generic_file_vm_ops;
> + return 0;
> +}
> +
> +static const struct file_operations aio_ctx_fops = {
> + .mmap   = aio_ctx_mmap,
> +};
> +
> +static int aio_set_page_dirty(struct page *page)
> +{
> + return 0;
> +}
> +
> +static int aio_migratepage(struct address_space *mapping, struct page *new,
> +struct page *old, enum migrate_mode mode)
> +{
> + struct kioctx *ctx = mapping->private_data;
> + unsigned long flags;
> + unsigned idx = old->index;
> + int rc;
> +
> + BUG_ON(PageWriteback(old));/* Writeback must be complete */
> + put_page(old);
> + rc = migrate_page_move_mapping(mapping, new, old, NULL, mode);
> + if (rc != MIGRATEPAGE_SUCCESS) {
> + get_page(old);
> + return rc;
> + }
> + get_page(new);
> +
> + spin_lock_irqsave(&ctx->completion_lock, flags);
> + migrate_page_copy(new, old);
> + ctx->ring_pages[idx] = new;
> + spin_unlock_irqrestore(&ctx->completion_lock, flags);
> +
> + return MIGRATEPAGE_SUCCESS;
>  }
>  
> +static const struct address_space_operations aio_ctx_aops = {
> + .set_page_dirty = aio_set_page_dirty,
> + .migratepage= aio_migratepage,
> +};
> +
>  static int aio_setup_ring(struct kioctx *ctx)
>  {
>   struct aio_ring *ring;
> @@ -155,6 +221,7 @@ static int aio_setup_ring(struct kioctx *ctx)
>   struct mm_struct *mm = current->mm;
>   unsigned long size, populate;
>   int nr_pages;
> + int i;
>  
>   /* Compensate for the ring buffer's head/tail overlap entry */
>   nr_events += 2; /* 1 is required, 2 for good luck */
> @@ -166,6 +233,28 @@ static int aio_setup_ring(struct kioctx *ctx)
>   if (nr_pages < 0)
>   return -EINVAL;
>  
> + ctx->ctx_file = anon_inode_getfile("[aio]", &aio_ctx_fops, ctx, O_RDWR);
> + if (IS_ERR(ctx->ctx_f

[PATCH] f2fs: fix a compound statement label error

2013-08-18 Thread Gu Zheng
From 685b72b66cb8ce019429b1958c91f346b260bc65 Mon Sep 17 00:00:00 2001
From: Gu Zheng 
Date: Mon, 19 Aug 2013 09:41:15 +0800
Subject: [PATCH] f2fs: fix a compound statement label error
An error "label at end of compound statement" occurs if CONFIG_F2FS_STAT_FS is
disabled:
fs/f2fs/segment.c:556:1: error: label at end of compound statement
So remove the 'out' label to fix it.

Reported-by: Fengguang Wu 
Signed-off-by: Gu Zheng 
---
 fs/f2fs/segment.c |8 ++--
 1 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 9c45b8e..09af9c7 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -540,12 +540,9 @@ static void allocate_segment_by_default(struct f2fs_sb_info *sbi,
 {
struct curseg_info *curseg = CURSEG_I(sbi, type);
 
-   if (force) {
+   if (force)
new_curseg(sbi, type, true);
-   goto out;
-   }
-
-   if (type == CURSEG_WARM_NODE)
+   else if (type == CURSEG_WARM_NODE)
new_curseg(sbi, type, false);
else if (curseg->alloc_type == LFS && is_next_segment_free(sbi, type))
new_curseg(sbi, type, false);
@@ -553,7 +550,6 @@ static void allocate_segment_by_default(struct f2fs_sb_info *sbi,
change_curseg(sbi, type, true);
else
new_curseg(sbi, type, false);
-out:
 #ifdef CONFIG_F2FS_STAT_FS
sbi->segment_count[curseg->alloc_type]++;
 #endif
-- 
1.7.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
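
The compiler error fixed by the f2fs patch above comes from a C rule: a label must be followed by a statement, so when the preprocessor removes the only code after `out:`, the label ends up directly before the closing brace. Below is a minimal userspace sketch of the restructured control flow. All names (`pick_action`, `DEMO_STAT`, `demo_stat_count`, the enums) are hypothetical stand-ins, not f2fs code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the f2fs segment types and actions. */
enum demo_type { DEMO_HOT, DEMO_WARM_NODE };
enum demo_action { DEMO_NEW_FORCED, DEMO_NEW, DEMO_CHANGE };

int demo_stat_count; /* only updated when DEMO_STAT is defined */

/*
 * A single if/else chain needs no `out:` label, so nothing is left
 * dangling before the closing brace when DEMO_STAT is undefined.
 */
static enum demo_action pick_action(bool force, enum demo_type type,
                                    bool can_reuse)
{
    enum demo_action action;

    if (force)
        action = DEMO_NEW_FORCED;
    else if (type == DEMO_WARM_NODE)
        action = DEMO_NEW;
    else if (can_reuse)
        action = DEMO_CHANGE;
    else
        action = DEMO_NEW;

#ifdef DEMO_STAT
    demo_stat_count++;
#endif
    return action;
}
```

Folding the forced case into the chain keeps a single exit path, which is why the optional statistics block can stay at the end of the function without a jump target.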


[PATCH] f2fs: introduce help function F2FS_NODE()

2013-07-15 Thread Gu Zheng
Introduce a helper function F2FS_NODE() to simplify the conversion of node_page
to f2fs_node.


Signed-off-by: Gu Zheng 
---
 fs/f2fs/data.c |2 +-
 fs/f2fs/dir.c  |2 +-
 fs/f2fs/f2fs.h |9 +++--
 fs/f2fs/file.c |2 +-
 fs/f2fs/inode.c|4 ++--
 fs/f2fs/node.c |   10 +-
 fs/f2fs/node.h |   40 
 fs/f2fs/recovery.c |6 ++
 8 files changed, 35 insertions(+), 40 deletions(-)

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 035f9a3..c73c394 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -39,7 +39,7 @@ static void __set_data_blkaddr(struct dnode_of_data *dn, block_t new_addr)

wait_on_page_writeback(node_page);

-   rn = (struct f2fs_node *)page_address(node_page);
+   rn = F2FS_NODE(node_page);

/* Get physical address of data block */
addr_array = blkaddr_in_node(rn);
diff --git a/fs/f2fs/dir.c b/fs/f2fs/dir.c
index 62f0d59..89ecb37 100644
--- a/fs/f2fs/dir.c
+++ b/fs/f2fs/dir.c
@@ -270,7 +270,7 @@ static void init_dent_inode(const struct qstr *name, struct page *ipage)
struct f2fs_node *rn;

/* copy name info. to this inode page */
-   rn = (struct f2fs_node *)page_address(ipage);
+   rn = F2FS_NODE(ipage);
rn->i.i_namelen = cpu_to_le32(name->len);
memcpy(rn->i.i_name, name->name, name->len);
set_page_dirty(ipage);
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index c7620b9..ffa34f4 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -455,6 +455,11 @@ static inline struct f2fs_checkpoint *F2FS_CKPT(struct f2fs_sb_info *sbi)
return (struct f2fs_checkpoint *)(sbi->ckpt);
 }

+static inline struct f2fs_node *F2FS_NODE(struct page *page)
+{
+   return (struct f2fs_node *)page_address(page);
+}
+
 static inline struct f2fs_nm_info *NM_I(struct f2fs_sb_info *sbi)
 {
return (struct f2fs_nm_info *)(sbi->nm_info);
@@ -813,7 +818,7 @@ static inline struct kmem_cache *f2fs_kmem_cache_create(const char *name,

 static inline bool IS_INODE(struct page *page)
 {
-   struct f2fs_node *p = (struct f2fs_node *)page_address(page);
+   struct f2fs_node *p = F2FS_NODE(page);
return RAW_IS_INODE(p);
 }

@@ -827,7 +832,7 @@ static inline block_t datablock_addr(struct page *node_page,
 {
struct f2fs_node *raw_node;
__le32 *addr_array;
-   raw_node = (struct f2fs_node *)page_address(node_page);
+   raw_node = F2FS_NODE(node_page);
addr_array = blkaddr_in_node(raw_node);
return le32_to_cpu(addr_array[offset]);
 }
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 157a635..65ca3b3 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -206,7 +206,7 @@ int truncate_data_blocks_range(struct dnode_of_data *dn, int count)
struct f2fs_node *raw_node;
__le32 *addr;

-   raw_node = page_address(dn->node_page);
+   raw_node = F2FS_NODE(dn->node_page);
addr = blkaddr_in_node(raw_node) + ofs;

for ( ; count > 0; count--, addr++, dn->ofs_in_node++) {
diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
index 2b2d45d1..debf743 100644
--- a/fs/f2fs/inode.c
+++ b/fs/f2fs/inode.c
@@ -56,7 +56,7 @@ static int do_read_inode(struct inode *inode)
if (IS_ERR(node_page))
return PTR_ERR(node_page);

-   rn = page_address(node_page);
+   rn = F2FS_NODE(node_page);
ri = &(rn->i);

inode->i_mode = le16_to_cpu(ri->i_mode);
@@ -153,7 +153,7 @@ void update_inode(struct inode *inode, struct page *node_page)

wait_on_page_writeback(node_page);

-   rn = page_address(node_page);
+   rn = F2FS_NODE(node_page);
ri = &(rn->i);

ri->i_mode = cpu_to_le16(inode->i_mode);
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index b418aee..f5172e2 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -565,7 +565,7 @@ static int truncate_nodes(struct dnode_of_data *dn, unsigned int nofs,
return PTR_ERR(page);
}

-   rn = (struct f2fs_node *)page_address(page);
+   rn = F2FS_NODE(page);
if (depth < 3) {
for (i = ofs; i < NIDS_PER_BLOCK; i++, freed++) {
child_nid = le32_to_cpu(rn->in.nid[i]);
@@ -698,7 +698,7 @@ restart:
set_new_dnode(&dn, inode, page, NULL, 0);
unlock_page(page);

-   rn = page_address(page);
+   rn = F2FS_NODE(page);
switch (level) {
case 0:
case 1:
@@ -1484,8 +1484,8 @@ int recover_inode_page(struct f2fs_sb_info *sbi, struct page *page)
SetPageUptodate(ipage);
fill_node_footer(ipage, ino, ino, 0, true);

-   src = (struct f2fs_node *)page_address(page);
-   dst = (struct f2fs_node *)page_address(ipage);
+   src = F2FS_NODE(page);
+   dst = F2FS_NODE(ipage);

memcpy(dst, src, (unsigned long)&src->i.i_ext - (unsigned long)&src->i);
dst-
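
The pattern behind F2FS_NODE() is a typed accessor that hides a repeated cast. A minimal userspace sketch follows, assuming hypothetical stand-ins (`demo_page`, `demo_node`, `demo_page_address`, `DEMO_NODE`) for `struct page`, `struct f2fs_node`, `page_address()` and the new helper; it is an illustration, not the kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins for struct page and struct f2fs_node. */
struct demo_node { uint32_t nid; };
struct demo_page { void *addr; };

/* Stand-in for page_address(): returns the untyped buffer behind a page. */
static inline void *demo_page_address(struct demo_page *page)
{
    return page->addr;
}

/* One typed accessor replaces the cast repeated at every call site. */
static inline struct demo_node *DEMO_NODE(struct demo_page *page)
{
    return (struct demo_node *)demo_page_address(page);
}
```

Call sites then read `rn = DEMO_NODE(page);`, and a later change to how the node is located only has to touch the helper.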

[PATCH] fs/jffs2: remove the unused paramters of function jffs2_{compress,decompress}

2013-07-16 Thread Gu Zheng
Remove the unused parameters of the functions jffs2_{compress,decompress}.


Signed-off-by: Gu Zheng 
---
 fs/jffs2/compr.c |   12 ++--
 fs/jffs2/compr.h |   12 ++--
 fs/jffs2/gc.c|2 +-
 fs/jffs2/read.c  |2 +-
 fs/jffs2/write.c |2 +-
 5 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/fs/jffs2/compr.c b/fs/jffs2/compr.c
index 4849a4c..6fcb426 100644
--- a/fs/jffs2/compr.c
+++ b/fs/jffs2/compr.c
@@ -145,9 +145,9 @@ static int jffs2_selected_compress(u8 compr, unsigned char *data_in,
  * jffs2_compress should compress as much as will fit, and should set
  * *datalen accordingly to show the amount of data which were compressed.
  */
-uint16_t jffs2_compress(struct jffs2_sb_info *c, struct jffs2_inode_info *f,
-   unsigned char *data_in, unsigned char **cpage_out,
-   uint32_t *datalen, uint32_t *cdatalen)
+uint16_t jffs2_compress(struct jffs2_sb_info *c, unsigned char *data_in,
+   unsigned char **cpage_out, uint32_t *datalen,
+   uint32_t *cdatalen)
 {
int ret = JFFS2_COMPR_NONE;
int mode, compr_ret;
@@ -250,9 +250,9 @@ uint16_t jffs2_compress(struct jffs2_sb_info *c, struct jffs2_inode_info *f,
return ret;
 }

-int jffs2_decompress(struct jffs2_sb_info *c, struct jffs2_inode_info *f,
-uint16_t comprtype, unsigned char *cdata_in,
-unsigned char *data_out, uint32_t cdatalen, uint32_t datalen)
+int jffs2_decompress(uint16_t comprtype, unsigned char *cdata_in,
+unsigned char *data_out, uint32_t cdatalen,
+uint32_t datalen)
 {
struct jffs2_compressor *this;
int ret;
diff --git a/fs/jffs2/compr.h b/fs/jffs2/compr.h
index 5e91d57..092089a 100644
--- a/fs/jffs2/compr.h
+++ b/fs/jffs2/compr.h
@@ -70,13 +70,13 @@ int jffs2_unregister_compressor(struct jffs2_compressor *comp);
 int jffs2_compressors_init(void);
 int jffs2_compressors_exit(void);

-uint16_t jffs2_compress(struct jffs2_sb_info *c, struct jffs2_inode_info *f,
-   unsigned char *data_in, unsigned char **cpage_out,
-   uint32_t *datalen, uint32_t *cdatalen);
+uint16_t jffs2_compress(struct jffs2_sb_info *c, unsigned char *data_in,
+   unsigned char **cpage_out, uint32_t *datalen,
+   uint32_t *cdatalen);

-int jffs2_decompress(struct jffs2_sb_info *c, struct jffs2_inode_info *f,
-uint16_t comprtype, unsigned char *cdata_in,
-unsigned char *data_out, uint32_t cdatalen, uint32_t datalen);
+int jffs2_decompress(uint16_t comprtype, unsigned char *cdata_in,
+unsigned char *data_out, uint32_t cdatalen,
+uint32_t datalen);

 void jffs2_free_comprbuf(unsigned char *comprbuf, unsigned char *orig);

diff --git a/fs/jffs2/gc.c b/fs/jffs2/gc.c
index 5a2dec2..8dc85aa 100644
--- a/fs/jffs2/gc.c
+++ b/fs/jffs2/gc.c
@@ -1330,7 +1330,7 @@ static int jffs2_garbage_collect_dnode(struct jffs2_sb_info *c, struct jffs2_era

writebuf = pg_ptr + (offset & (PAGE_CACHE_SIZE -1));

-   comprtype = jffs2_compress(c, f, writebuf, &comprbuf, &datalen, &cdatalen);
+   comprtype = jffs2_compress(c, writebuf, &comprbuf, &datalen, &cdatalen);

ri.magic = cpu_to_je16(JFFS2_MAGIC_BITMASK);
ri.nodetype = cpu_to_je16(JFFS2_NODETYPE_INODE);
diff --git a/fs/jffs2/read.c b/fs/jffs2/read.c
index 0b042b1..6395f41 100644
--- a/fs/jffs2/read.c
+++ b/fs/jffs2/read.c
@@ -132,7 +132,7 @@ int jffs2_read_dnode(struct jffs2_sb_info *c, struct jffs2_inode_info *f,
jffs2_dbg(2, "Decompress %d bytes from %p to %d bytes at %p\n",
  je32_to_cpu(ri->csize), readbuf,
  je32_to_cpu(ri->dsize), decomprbuf);
-   ret = jffs2_decompress(c, f, ri->compr | (ri->usercompr << 8), 
readbuf,
decomprbuf, je32_to_cpu(ri->csize), je32_to_cpu(ri->dsize));
+   ret = jffs2_decompress(ri->compr | (ri->usercompr << 8), 
readbuf, decomprbuf,
je32_to_cpu(ri->csize), je32_to_cpu(ri->dsize));
if (ret) {
pr_warn("Error: jffs2_decompress returned %d\n", ret);
goto out_decomprbuf;
diff --git a/fs/jffs2/write.c b/fs/jffs2/write.c
index b634de4..dbc26de 100644
--- a/fs/jffs2/write.c
+++ b/fs/jffs2/write.c
@@ -369,7 +369,7 @@ int jffs2_write_inode_range(struct jffs2_sb_info *c, struct jffs2_inode_info *f,
datalen = min_t(uint32_t, writelen, PAGE_CACHE_SIZE - (offset &
(PAGE_CACHE_SIZE-1)));
cdatalen = min_t(uint32_t, alloclen - sizeof(*ri), datalen);

-   comprtype = jffs2_compress(c, f, buf, &comprbuf, &datalen, 
&cdatalen);
+   comprtype = jffs2_compr

[PATCH RESEND 0/2] Add support to aio ring pages migration

2013-07-16 Thread Gu Zheng
Currently the aio ring pages are obtained with get_user_pages() from the movable
zone. As discussed in the thread https://lkml.org/lkml/2012/11/29/69, this can
easily pin user pages for a long time, which is fatal for the memory
hotplug/remove framework.

As Mel Gorman suggested, "Implement a callback for migration to unpin pages,
barrier operations until migration completes and pin the new pfns" can solve
this issue. The best place to hold the callback is the address space operations,
which can be found via page->mapping.

However, the current aio ring pages are anonymous pages and have no
address_space_operations. So we use an anon inode file as the aio ring file to
manage the aio ring pages, which lets us implement the callback and register it
as page->mapping->a_ops->migratepage.

There is one problem: all files created by anon_inode_getfile() share the same
inode, so multiple aio contexts would share the same aio ring pages, which would
scramble their io events. To solve this issue, we introduce a new function,
anon_inode_getfile_private(), which is similar to anon_inode_getfile() except
that each new file has its own anon inode.

This work is based on Benjamin's patch,
http://www.spinics.net/lists/linux-fsdevel/msg66014.html

Gu Zheng (2):
  fs/anon_inode: Introduce a new lib function anon_inode_getfile_private()
  fs/aio: Add support to aio ring pages migration

 fs/aio.c|  120 +++
 fs/anon_inodes.c|   66 +++
 include/linux/anon_inodes.h |3 +
 include/linux/migrate.h |3 +
 mm/migrate.c|2 +-
 5 files changed, 182 insertions(+), 12 deletions(-)

-- 
1.7.7
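
The lookup chain described in the cover letter (page, then its mapping, then the mapping's operations table, then the migratepage callback) can be modelled in userspace. The sketch below is a simplified assumption-laden model: every type and function name (`demo_page`, `demo_mapping`, `demo_aops`, `demo_ctx`, `demo_ring_migrate`, `demo_migrate`) is hypothetical, and a NULL mapping merely stands for "no owner callback available".

```c
#include <assert.h>
#include <stddef.h>

struct demo_page;
struct demo_mapping;

/* Hypothetical, reduced analogue of address_space_operations. */
struct demo_aops {
    int (*migratepage)(struct demo_mapping *mapping,
                       struct demo_page *newpage, struct demo_page *oldpage);
};

struct demo_mapping {
    const struct demo_aops *a_ops;
    void *private_data;            /* owner context, e.g. the aio ctx */
};

struct demo_page {
    struct demo_mapping *mapping;  /* NULL models "no owner callback" */
    int data;
};

/* Owner of a one-slot "ring" that caches a pointer to its page. */
struct demo_ctx { struct demo_page *ring_page; };

/* Owner callback: copy the contents and repoint the cached pointer. */
static int demo_ring_migrate(struct demo_mapping *mapping,
                             struct demo_page *newpage,
                             struct demo_page *oldpage)
{
    struct demo_ctx *ctx = mapping->private_data;

    newpage->data = oldpage->data;
    newpage->mapping = mapping;
    ctx->ring_page = newpage;
    return 0;
}

/* Migration core: only pages whose mapping offers a callback can move. */
static int demo_migrate(struct demo_page *newpage, struct demo_page *oldpage)
{
    struct demo_mapping *mapping = oldpage->mapping;

    if (!mapping || !mapping->a_ops || !mapping->a_ops->migratepage)
        return -1; /* nobody to fix up the owner's cached pointer */
    return mapping->a_ops->migratepage(mapping, newpage, oldpage);
}
```

The point of routing through the mapping is that the migration core never needs to know about the ring; the owner that cached the page pointer is the one that updates it.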




[PATCH RESEND 2/2] fs/aio: Add support to aio ring pages migration

2013-07-16 Thread Gu Zheng
An aio job pins the ring pages, which causes memory migration to fail. To fix
this problem, we use an anon inode to manage the aio ring pages and set up the
migratepage callback in the anon inode's address space, so that during memory
migration the aio ring pages can be moved to another memory node safely.

Signed-off-by: Gu Zheng 
Signed-off-by: Benjamin LaHaise 
---
 fs/aio.c|  120 ++
 include/linux/migrate.h |3 +
 mm/migrate.c|2 +-
 3 files changed, 113 insertions(+), 12 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 9b5ca11..d10f956 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -35,6 +35,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 #include 
 #include 
@@ -110,6 +113,7 @@ struct kioctx {
} cacheline_aligned_in_smp;
 
struct page *internal_pages[AIO_RING_PAGES];
+   struct file *aio_ring_file;
 };
 
 /*-- sysctl variables*/
@@ -138,15 +142,78 @@ __initcall(aio_setup);
 
 static void aio_free_ring(struct kioctx *ctx)
 {
-   long i;
+   int i;
+   struct file *aio_ring_file = ctx->aio_ring_file;
 
-   for (i = 0; i < ctx->nr_pages; i++)
+   for (i = 0; i < ctx->nr_pages; i++) {
+   pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i,
+   page_count(ctx->ring_pages[i]));
put_page(ctx->ring_pages[i]);
+   }
 
if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages)
kfree(ctx->ring_pages);
+
+   if (aio_ring_file) {
+   truncate_setsize(aio_ring_file->f_inode, 0);
+   pr_debug("pid(%d) i_nlink=%u d_count=%d d_unhashed=%d i_count=%d\n",
+   current->pid, aio_ring_file->f_inode->i_nlink,
+   aio_ring_file->f_path.dentry->d_count,
+   d_unhashed(aio_ring_file->f_path.dentry),
+   atomic_read(&aio_ring_file->f_inode->i_count));
+   fput(aio_ring_file);
+   ctx->aio_ring_file = NULL;
+   }
+}
+
+static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
+{
+   vma->vm_ops = &generic_file_vm_ops;
+   return 0;
+}
+
+static const struct file_operations aio_ring_fops = {
+   .mmap = aio_ring_mmap,
+};
+
+static int aio_set_page_dirty(struct page *page)
+{
+   return 0;
 }
 
+static int aio_migratepage(struct address_space *mapping, struct page *new,
+   struct page *old, enum migrate_mode mode)
+{
+   struct kioctx *ctx = mapping->private_data;
+   unsigned long flags;
+   unsigned idx = old->index;
+   int rc;
+
+   /*Writeback must be complete*/
+   BUG_ON(PageWriteback(old));
+   put_page(old);
+
+   rc = migrate_page_move_mapping(mapping, new, old, NULL, mode);
+   if (rc != MIGRATEPAGE_SUCCESS) {
+   get_page(old);
+   return rc;
+   }
+
+   get_page(new);
+
+   spin_lock_irqsave(&ctx->completion_lock, flags);
+   migrate_page_copy(new, old);
+   ctx->ring_pages[idx] = new;
+   spin_unlock_irqrestore(&ctx->completion_lock, flags);
+
+   return rc;
+}
+
+static const struct address_space_operations aio_ctx_aops = {
+   .set_page_dirty = aio_set_page_dirty,
+   .migratepage= aio_migratepage,
+};
+
 static int aio_setup_ring(struct kioctx *ctx)
 {
struct aio_ring *ring;
@@ -154,20 +221,45 @@ static int aio_setup_ring(struct kioctx *ctx)
struct mm_struct *mm = current->mm;
unsigned long size, populate;
int nr_pages;
+   int i;
+   struct file *file;
 
/* Compensate for the ring buffer's head/tail overlap entry */
nr_events += 2; /* 1 is required, 2 for good luck */
 
size = sizeof(struct aio_ring);
size += sizeof(struct io_event) * nr_events;
-   nr_pages = (size + PAGE_SIZE-1) >> PAGE_SHIFT;
 
+   nr_pages = (size + PAGE_SIZE-1) >> PAGE_SHIFT;
if (nr_pages < 0)
return -EINVAL;
 
-   nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring)) / sizeof(struct io_event);
+   file = anon_inode_getfile_private("[aio]", &aio_ring_fops, ctx, O_RDWR);
+   if (IS_ERR(file)) {
+   ctx->aio_ring_file = NULL;
+   return -EAGAIN;
+   }
+
+   file->f_inode->i_mapping->a_ops = &aio_ctx_aops;
+   file->f_inode->i_mapping->private_data = ctx;
+   file->f_inode->i_size = PAGE_SIZE * (loff_t)nr_pages;
+
+   for (i = 0; i < nr_pages; i++) {
+   struct page *page;
+   page = find_or_create_page(file->f_inode->i_mapping,
+
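
The reference hand-over in aio_migratepage() above can be followed more easily in a small userspace model. This is a sketch under stated assumptions: `demo_page`, `demo_ctx`, `demo_move_mapping` and `demo_migratepage` are hypothetical, reference counts are plain integers, and the stand-in for migrate_page_move_mapping() simply succeeds when only one reference (the migration core's) remains on the old page.

```c
#include <assert.h>
#include <string.h>

#define DEMO_PAGE_SIZE 16

/* Hypothetical page with a plain (non-atomic) reference count. */
struct demo_page {
    int refcount;
    char data[DEMO_PAGE_SIZE];
};

struct demo_ctx { struct demo_page *ring_pages[1]; };

/*
 * Stand-in for migrate_page_move_mapping(): succeeds only when the old
 * page has exactly the expected number of remaining references.
 */
static int demo_move_mapping(struct demo_page *oldpage, int expected)
{
    return oldpage->refcount == expected ? 0 : -1;
}

static int demo_migratepage(struct demo_ctx *ctx, unsigned idx,
                            struct demo_page *newpage,
                            struct demo_page *oldpage)
{
    int rc;

    oldpage->refcount--;                /* drop the ring's reference */
    rc = demo_move_mapping(oldpage, 1); /* 1 = the migration core's ref */
    if (rc != 0) {
        oldpage->refcount++;            /* failed: ring keeps the old page */
        return rc;
    }
    newpage->refcount++;                /* ring now references the new page */

    /* In the patch this copy-and-swap runs under ctx->completion_lock. */
    memcpy(newpage->data, oldpage->data, DEMO_PAGE_SIZE);
    ctx->ring_pages[idx] = newpage;
    return 0;
}
```

Dropping the ring's own reference before the move, and restoring it on failure, is what lets the count check pass without ever leaving the ring pointing at a page it does not hold.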

[PATCH RESEND 1/2] fs/anon_inode: Introduce a new lib function anon_inode_getfile_private()

2013-07-16 Thread Gu Zheng

Introduce a new lib function, anon_inode_getfile_private(). It creates a new file
instance by hooking it up to an anonymous inode and a dentry that describes the
"class" of the file, similar to anon_inode_getfile(), but each file gets its own
inode. Furthermore, anyone who wants to create a private anon file will
benefit from this change.

Signed-off-by: Gu Zheng 
Signed-off-by: Benjamin LaHaise 
---
 fs/anon_inodes.c|   66 +++
 include/linux/anon_inodes.h |3 ++
 2 files changed, 69 insertions(+), 0 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 47a65df..85c9618 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -109,6 +109,72 @@ static struct file_system_type anon_inode_fs_type = {
 };
 
 /**
+ * anon_inode_getfile_private - creates a new file instance by hooking it up to an
+ *  anonymous inode, and a dentry that describe the "class"
+ *  of the file
+ *
+ * @name:[in]name of the "class" of the new file
+ * @fops:[in]file operations for the new file
+ * @priv:[in]private data for the new file (will be file's private_data)
+ * @flags:   [in]flags
+ *
+ *
+ * Similar to anon_inode_getfile, but each file holds a single inode.
+ *
+ */
+struct file *anon_inode_getfile_private(const char *name,
+   const struct file_operations *fops,
+   void *priv, int flags)
+{
+   struct qstr this;
+   struct path path;
+   struct file *file;
+   struct inode *inode;
+
+   if (fops->owner && !try_module_get(fops->owner))
+   return ERR_PTR(-ENOENT);
+
+   inode = anon_inode_mkinode(anon_inode_mnt->mnt_sb);
+   if (IS_ERR(inode)) {
+   file = ERR_PTR(-ENOMEM);
+   goto err_module;
+   }
+
+   /*
+* Link the inode to a directory entry by creating a unique name
+* using the inode sequence number.
+*/
+   file = ERR_PTR(-ENOMEM);
+   this.name = name;
+   this.len = strlen(name);
+   this.hash = 0;
+   path.dentry = d_alloc_pseudo(anon_inode_mnt->mnt_sb, &this);
+   if (!path.dentry)
+   goto err_module;
+
+   path.mnt = mntget(anon_inode_mnt);
+
+   d_instantiate(path.dentry, inode);
+
+   file = alloc_file(&path, OPEN_FMODE(flags), fops);
+   if (IS_ERR(file))
+   goto err_dput;
+
+   file->f_mapping = inode->i_mapping;
+   file->f_flags = flags & (O_ACCMODE | O_NONBLOCK);
+   file->private_data = priv;
+
+   return file;
+
+err_dput:
+   path_put(&path);
+err_module:
+   module_put(fops->owner);
+   return file;
+}
+EXPORT_SYMBOL_GPL(anon_inode_getfile_private);
+
+/**
  * anon_inode_getfile - creates a new file instance by hooking it up to an
  *  anonymous inode, and a dentry that describe the "class"
  *  of the file
diff --git a/include/linux/anon_inodes.h b/include/linux/anon_inodes.h
index 8013a45..cf573c2 100644
--- a/include/linux/anon_inodes.h
+++ b/include/linux/anon_inodes.h
@@ -13,6 +13,9 @@ struct file_operations;
 struct file *anon_inode_getfile(const char *name,
const struct file_operations *fops,
void *priv, int flags);
+struct file *anon_inode_getfile_private(const char *name,
+   const struct file_operations *fops,
+   void *priv, int flags);
 int anon_inode_getfd(const char *name, const struct file_operations *fops,
 void *priv, int flags);
 
-- 
1.7.7
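
The difference between a shared and a private anon inode can be shown with a toy userspace model. In this sketch, `demo_inode`, `demo_file`, `demo_getfile`, `demo_getfile_private` and `demo_putfile` are hypothetical names; the model only captures the one property that matters here, namely that files sharing an inode see the same pages.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model: an inode owns the page cache its files see. */
struct demo_inode { int page0; };
struct demo_file { struct demo_inode *inode; };

static struct demo_inode demo_shared_inode; /* one instance for everyone */

/* Models anon_inode_getfile(): every file points at the same inode. */
static struct demo_file demo_getfile(void)
{
    struct demo_file f = { .inode = &demo_shared_inode };
    return f;
}

/* Models anon_inode_getfile_private(): a fresh inode per file. */
static struct demo_file demo_getfile_private(void)
{
    /* inode is NULL if the allocation fails; callers must check. */
    struct demo_file f = { .inode = calloc(1, sizeof(struct demo_inode)) };
    return f;
}

/* Release a private inode; the shared one is static and never freed. */
static void demo_putfile(struct demo_file *f)
{
    if (f->inode != &demo_shared_inode)
        free(f->inode);
    f->inode = NULL;
}
```

With the shared variant, a write through one file is visible through every other file, which is the cross-context interference the private variant is meant to avoid.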



Re: [PATCH RESEND 1/2] fs/anon_inode: Introduce a new lib function anon_inode_getfile_private()

2013-07-16 Thread Gu Zheng
Hi Ben,

On 07/16/2013 09:16 PM, Benjamin LaHaise wrote:

> On Tue, Jul 16, 2013 at 05:56:12PM +0800, Gu Zheng wrote:
>>
>> Introduce a new lib function anon_inode_getfile_private(), it creates a new file
>> instance by hooking it up to an anonymous inode, and a dentry that describe the
>> "class" of the file, similar to anon_inode_getfile(), but each file holds a
>> single inode. Furthermore, anyone who wants to create a private anon file will
>> benefit from this change.
>>
>> Signed-off-by: Gu Zheng 
>> Signed-off-by: Benjamin LaHaise 
> 
> Please don't add my Signed-off-by when I have never even seen or reviewed 
> a patch -- that is completely unacceptable.  

Sorry for my carelessness; I'll keep your reminder in mind. :)

> Second, I don't think this 
> patch is suitable for 3.11, as it has not seen much testing outside of one 
> test program I had written.  It's a long standing bug, so it isn't urgent 
> to get the fix into the tree.  That said, it did pass a few tests I ran 
> last night, so it is probably suitable for the -next tree.

Thanks for testing. :)

Regards,
Gu

> 
> As for patch 1, it looks okay to me, but will need Al Viro's signoff.
> 
>   -ben




Re: [PATCH RESEND 2/2] fs/aio: Add support to aio ring pages migration

2013-07-16 Thread Gu Zheng
Hi Ben,

On 07/16/2013 09:34 PM, Benjamin LaHaise wrote:

> On Tue, Jul 16, 2013 at 05:56:16PM +0800, Gu Zheng wrote:
>> As the aio job will pin the ring pages, that will lead to mem migrated
>> failed. In order to fix this problem we use an anon inode to manage the aio ring
>> pages, and  setup the migratepage callback in the anon inode's address space, so
>> that when mem migrating the aio ring pages will be moved to other mem node
>> safely.
> There are a few minor issues that needed to be fixed -- see below.  I've 
> made these changes and added them to git://git.kvack.org/~bcrl/aio-next.git ,
> and will ask for that tree to be included in linux-next.

Thanks very much for your review.
Stephen reported a build failure when merging this patch into the -next tree from
your aio-next tree. This is because we use migrate_page_move_mapping(), which is
only available under CONFIG_MIGRATION. I'll fix this issue in the next version.

Best regards,
Gu
  

> 
> mm folks: can someone familiar with page migration / hot plug memory please 
> review the migration changes?
> 
>>
>> Signed-off-by: Gu Zheng 
>> Signed-off-by: Benjamin LaHaise 
> 
> Again, I had not provided my Signed-off-by on this patch previously, so 
> don't add it for me.

Sorry again.:)

> 
>> ---
>>  fs/aio.c|  120 
>> ++
>>  include/linux/migrate.h |3 +
>>  mm/migrate.c|2 +-
>>  3 files changed, 113 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/aio.c b/fs/aio.c
>> index 9b5ca11..d10f956 100644
>> --- a/fs/aio.c
>> +++ b/fs/aio.c
>> @@ -35,6 +35,9 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>> +#include 
>> +#include 
>>  
>>  #include 
>>  #include 
>> @@ -110,6 +113,7 @@ struct kioctx {
>>  } cacheline_aligned_in_smp;
>>  
>>  struct page *internal_pages[AIO_RING_PAGES];
>> +struct file *aio_ring_file;
>>  };
>>  
>>  /*-- sysctl variables*/
>> @@ -138,15 +142,78 @@ __initcall(aio_setup);
>>  
>>  static void aio_free_ring(struct kioctx *ctx)
>>  {
>> -long i;
>> +int i;
>> +struct file *aio_ring_file = ctx->aio_ring_file;
>>  
>> -for (i = 0; i < ctx->nr_pages; i++)
>> +for (i = 0; i < ctx->nr_pages; i++) {
>> +pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i,
>> +page_count(ctx->ring_pages[i]));
>>  put_page(ctx->ring_pages[i]);
>> +}
>>  
>>  if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages)
>>  kfree(ctx->ring_pages);
>> +
>> +if (aio_ring_file) {
>> +truncate_setsize(aio_ring_file->f_inode, 0);
>> +pr_debug("pid(%d) i_nlink=%u d_count=%d d_unhashed=%d 
>> i_count=%d\n",
>> +current->pid, aio_ring_file->f_inode->i_nlink,
>> +aio_ring_file->f_path.dentry->d_count,
>> +d_unhashed(aio_ring_file->f_path.dentry),
>> +atomic_read(&aio_ring_file->f_inode->i_count));
>> +fput(aio_ring_file);
>> +ctx->aio_ring_file = NULL;
>> +}
>> +}
>> +
>> +static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
>> +{
>> +vma->vm_ops = &generic_file_vm_ops;
>> +return 0;
>> +}
>> +
>> +static const struct file_operations aio_ring_fops = {
>> +.mmap = aio_ring_mmap,
>> +};
>> +
>> +static int aio_set_page_dirty(struct page *page)
>> +{
>> +return 0;
>>  }
>>  
>> +static int aio_migratepage(struct address_space *mapping, struct page *new,
>> +struct page *old, enum migrate_mode mode)
>> +{
>> +struct kioctx *ctx = mapping->private_data;
>> +unsigned long flags;
>> +unsigned idx = old->index;
>> +int rc;
>> +
>> +/*Writeback must be complete*/
> 
> Missing spaces before/after beginning and end of comment.

> 
>> +BUG_ON(PageWriteback(old));
>> +put_page(old);
>> +
>> +rc = migrate_page_move_mapping(mapping, new, old, NULL, mode);
>> +if (rc != MIGRATEPAGE_SUCCESS) {
>> +get_page(old);
>> +return rc;
>> +}
>> +
>> +get_page(new);
>> +
>> +spin_
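
The build failure mentioned in the reply above (a function that exists only under CONFIG_MIGRATION being called unconditionally) is commonly handled with a config-guarded stub. The sketch below shows that general pattern in userspace; it is not necessarily how the follow-up patch resolves it. `DEMO_CONFIG_MIGRATION`, `demo_move_mapping` and `demo_caller` are hypothetical names.

```c
#include <assert.h>
#include <errno.h>

/* DEMO_CONFIG_MIGRATION is a hypothetical stand-in for CONFIG_MIGRATION. */
#ifdef DEMO_CONFIG_MIGRATION
/* Real implementation, built only when the feature is enabled. */
static inline int demo_move_mapping(int *dst, const int *src)
{
    *dst = *src;
    return 0;
}
#else
/* Stub with the same signature so callers compile either way. */
static inline int demo_move_mapping(int *dst, const int *src)
{
    (void)dst;
    (void)src;
    return -ENOSYS;
}
#endif

/* Caller needs no #ifdef of its own; it just propagates the result. */
static int demo_caller(int *dst, const int *src)
{
    return demo_move_mapping(dst, src);
}
```

Keeping the conditional in one place (next to the declaration) means every caller builds in both configurations and only has to handle an error return.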

[PATCH V2 2/2] fs/aio: Add support to aio ring pages migration

2013-07-17 Thread Gu Zheng
An aio job pins the ring pages, which causes memory migration to fail. To fix
this problem, we use an anon inode to manage the aio ring pages and set up the
migratepage callback in the anon inode's address space, so that during memory
migration the aio ring pages can be moved to another memory node safely.

v1->v2:
Fix the build failure when CONFIG_MIGRATION is disabled.
Fix some minor issues following Benjamin's comments.

Signed-off-by: Gu Zheng 
---
 fs/aio.c|  116 +++
 include/linux/migrate.h |9 
 mm/migrate.c|2 +-
 3 files changed, 116 insertions(+), 11 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 2bbcacf..15e8a13 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -35,6 +35,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 #include 
 #include 
@@ -108,6 +111,7 @@ struct kioctx {
} cacheline_aligned_in_smp;
 
struct page *internal_pages[AIO_RING_PAGES];
+   struct file *aio_ring_file;
 };
 
 /*-- sysctl variables*/
@@ -136,15 +140,78 @@ __initcall(aio_setup);
 
 static void aio_free_ring(struct kioctx *ctx)
 {
-   long i;
-
-   for (i = 0; i < ctx->nr_pages; i++)
+   int i;
+   struct file *aio_ring_file = ctx->aio_ring_file;
+   for (i = 0; i < ctx->nr_pages; i++) {
+   pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i,
+   page_count(ctx->ring_pages[i]));
put_page(ctx->ring_pages[i]);
+   }
 
if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages)
kfree(ctx->ring_pages);
+
+   if (aio_ring_file) {
+   truncate_setsize(aio_ring_file->f_inode, 0);
+   pr_debug("pid(%d) i_nlink=%u d_count=%d d_unhashed=%d i_count=%d\n",
+   current->pid, aio_ring_file->f_inode->i_nlink,
+   aio_ring_file->f_path.dentry->d_count,
+   d_unhashed(aio_ring_file->f_path.dentry),
+   atomic_read(&aio_ring_file->f_inode->i_count));
+   fput(aio_ring_file);
+   ctx->aio_ring_file = NULL;
+   }
+}
+
+static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
+{
+   vma->vm_ops = &generic_file_vm_ops;
+   return 0;
+}
+
+static const struct file_operations aio_ring_fops = {
+   .mmap = aio_ring_mmap,
+};
+
+static int aio_set_page_dirty(struct page *page)
+{
+   return 0;
 }
 
+static int aio_migratepage(struct address_space *mapping, struct page *new,
+   struct page *old, enum migrate_mode mode)
+{
+   struct kioctx *ctx = mapping->private_data;
+   unsigned long flags;
+   unsigned idx = old->index;
+   int rc;
+
+   /* Writeback must be complete */
+   BUG_ON(PageWriteback(old));
+
+   put_page(old);
+
+   rc = migrate_page_move_mapping(mapping, new, old, NULL, mode);
+   if (rc != MIGRATEPAGE_SUCCESS) {
+   get_page(old);
+   return rc;
+   }
+
+   get_page(new);
+
+   spin_lock_irqsave(&ctx->completion_lock, flags);
+   migrate_page_copy(new, old);
+   ctx->ring_pages[idx] = new;
+   spin_unlock_irqrestore(&ctx->completion_lock, flags);
+
+   return rc;
+}
+
+static const struct address_space_operations aio_ctx_aops = {
+   .set_page_dirty = aio_set_page_dirty,
+   .migratepage= aio_migratepage,
+};
+
 static int aio_setup_ring(struct kioctx *ctx)
 {
struct aio_ring *ring;
@@ -152,18 +219,42 @@ static int aio_setup_ring(struct kioctx *ctx)
struct mm_struct *mm = current->mm;
unsigned long size, populate;
int nr_pages;
+   int i;
+   struct file *file;
 
/* Compensate for the ring buffer's head/tail overlap entry */
nr_events += 2; /* 1 is required, 2 for good luck */
 
size = sizeof(struct aio_ring);
size += sizeof(struct io_event) * nr_events;
-   nr_pages = (size + PAGE_SIZE-1) >> PAGE_SHIFT;
+   nr_pages = PFN_UP(size);
 
if (nr_pages < 0)
return -EINVAL;
+   file = anon_inode_getfile_private("[aio]", &aio_ring_fops, ctx, O_RDWR);
+   if (IS_ERR(file)) {
+   ctx->aio_ring_file = NULL;
+   return -EAGAIN;
+   }
+   file->f_inode->i_mapping->a_ops = &aio_ctx_aops;
+   file->f_inode->i_mapping->private_data = ctx;
+   file->f_inode->i_size = PAGE_SIZE * (loff_t)nr_pages;
 
-   nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring)) / sizeof(struct io_event);
+   for (i = 0; i < nr_pages; i++) {
+   struct page *page;
+   page = find_or_create_page(fil

Re: [PATCH V2 2/2] fs/aio: Add support to aio ring pages migration

2013-07-17 Thread Gu Zheng
Hi Ben,

On 07/17/2013 09:44 PM, Benjamin LaHaise wrote:

> On Wed, Jul 17, 2013 at 05:22:30PM +0800, Gu Zheng wrote:
>> As the aio job will pin the ring pages, that will lead to mem migrated
>> failed. In order to fix this problem we use an anon inode to manage the aio
>> ring pages, and setup the migratepage callback in the anon inode's address
>> space, so that when mem migrating the aio ring pages will be moved to other
>> mem node safely.
>>
>> v1->v2:
>>  Fix build failed issue if CONFIG_MIGRATION disabled.
>>  Fix some minor issues under Benjamin's comments.
> 
> I don't know what you did with this patch, but it doesn't apply to any of 
> the trees I can find, and interdiff isn't able to compare it against your 
> original patch.  Since the first version of the patch was already applied 
> it is generally more appropriate to provide an incremental fix.  I've 
> added the following to my tree (git://git.kvack.org/~bcrl/aio-next.git/) 
> to fix the build issue.  I've tested this with CONFIG_MIGRATION enabled 
> and disabled on x86.

My patch applies to the 3.10 release. I'm sorry, but my department blocks
access to all URLs that use the git protocol, so I cannot make a patch against
your aio-next tree. Is aio-next available over http/https?

Your fix looks good.
IMHO, because we make migrate_page_move_mapping() *extern*, we have a duty to
make sure it works everywhere. If someone later uses
migrate_page_move_mapping() without guarding it with CONFIG_MIGRATION, the
build will fail when CONFIG_MIGRATION is disabled. So I think the following
change (return -ENOSYS if CONFIG_MIGRATION is disabled) is still needed.

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index c407d88..3d0a486 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -88,6 +88,13 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
return -ENOSYS;
 }
 
+static inline int migrate_page_move_mapping(struct address_space *mapping,
+   struct page *newpage, struct page *page,
+   struct buffer_head *head, enum migrate_mode mode)
+{
+   return -ENOSYS;
+}
+
 /* Possible settings for the migrate_page() method in address_operations */
 #define migrate_page NULL
 #define fail_migrate_page NULL
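
The stub pattern in the hunk above can be tried outside the kernel. The
following is only a userspace sketch under stated assumptions: HAVE_MIGRATION
stands in for CONFIG_MIGRATION, and move_mapping() and migrate_one() are
hypothetical names, not kernel functions.

```c
#include <errno.h>

/* HAVE_MIGRATION is a stand-in for CONFIG_MIGRATION (assumption). */
#ifdef HAVE_MIGRATION
/* Feature built in: the real implementation lives in another file. */
int move_mapping(int *dst, const int *src);
#else
/* Feature compiled out: callers still build and get -ENOSYS at run time. */
static inline int move_mapping(int *dst, const int *src)
{
	(void)dst;
	(void)src;
	return -ENOSYS;
}
#endif

/* A caller that needs no #ifdef of its own (hypothetical). */
static inline int migrate_one(int *dst, const int *src)
{
	return move_mapping(dst, src);
}
```

With HAVE_MIGRATION undefined, migrate_one() compiles and simply reports
-ENOSYS, which is the behaviour argued for above.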



Best regards,
Gu

> 
>   -ben




[PATCH RESEND] fs/bio-integrity: fix a potential mem leak

2013-07-28 Thread Gu Zheng
Free bio_integrity_pool on the failure path of biovec_create_pool()
in bioset_integrity_create().

Signed-off-by: Gu Zheng 
---
 fs/bio-integrity.c |9 +
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 8fb4291..6025084 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -716,13 +716,14 @@ int bioset_integrity_create(struct bio_set *bs, int pool_size)
return 0;
 
bs->bio_integrity_pool = mempool_create_slab_pool(pool_size, bip_slab);
-
-   bs->bvec_integrity_pool = biovec_create_pool(bs, pool_size);
-   if (!bs->bvec_integrity_pool)
+   if (!bs->bio_integrity_pool)
return -1;
 
-   if (!bs->bio_integrity_pool)
+   bs->bvec_integrity_pool = biovec_create_pool(bs, pool_size);
+   if (!bs->bvec_integrity_pool) {
+   mempool_destroy(bs->bio_integrity_pool);
return -1;
+   }
 
return 0;
 }
-- 
1.7.7
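
The ordering fix in this patch (check the first allocation at once, and undo
it if the second one fails) can be sketched in userspace roughly as follows.
All names here (struct pools, pools_create(), alloc_fn, always_fail()) are
hypothetical stand-ins, and malloc()/free() replace the mempool calls.

```c
#include <stdlib.h>

struct pools {
	void *first;	/* plays the role of bio_integrity_pool */
	void *second;	/* plays the role of bvec_integrity_pool */
};

/* Hypothetical allocator hook so the failure path can be exercised. */
typedef void *(*alloc_fn)(size_t);

static int pools_create(struct pools *p, alloc_fn alloc_first,
			alloc_fn alloc_second)
{
	p->first = alloc_first(64);
	if (!p->first)
		return -1;	/* nothing allocated yet, nothing to undo */

	p->second = alloc_second(64);
	if (!p->second) {
		/* Undo the first allocation before reporting failure. */
		free(p->first);
		p->first = NULL;
		return -1;
	}

	return 0;
}

/* Test allocator that always fails. */
static void *always_fail(size_t n)
{
	(void)n;
	return NULL;
}
```

Unlike the patch, the sketch also resets the first pointer to NULL after
freeing it; that is a choice made for the example, not something the patch
does.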

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[PATCH RESEND] f2fs: move bio_private allocation out of f2fs_bio_alloc()

2013-07-28 Thread Gu Zheng
bio->bi_private is not always needed. In the read path, read_end_io() has no
use for bio_private, so move the bio_private allocation out of
f2fs_bio_alloc(): allocate it in submit_write_page() and leave it unset in
f2fs_readpage().

Signed-off-by: Gu Zheng 
---
 fs/f2fs/data.c|1 -
 fs/f2fs/segment.c |   19 +++
 2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index c73c394..19cd7c6 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -365,7 +365,6 @@ static void read_end_io(struct bio *bio, int err)
}
unlock_page(page);
} while (bvec >= bio->bi_io_vec);
-   kfree(bio->bi_private);
bio_put(bio);
 }
 
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index a86d125..9b74ae2 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -611,18 +611,12 @@ static void f2fs_end_io_write(struct bio *bio, int err)
 struct bio *f2fs_bio_alloc(struct block_device *bdev, int npages)
 {
struct bio *bio;
-   struct bio_private *priv;
-retry:
-   priv = kmalloc(sizeof(struct bio_private), GFP_NOFS);
-   if (!priv) {
-   cond_resched();
-   goto retry;
-   }
 
/* No failure on bio allocation */
bio = bio_alloc(GFP_NOIO, npages);
bio->bi_bdev = bdev;
-   bio->bi_private = priv;
+   bio->bi_private = NULL;
+
return bio;
 }
 
@@ -681,8 +675,17 @@ static void submit_write_page(struct f2fs_sb_info *sbi, struct page *page,
do_submit_bio(sbi, type, false);
 alloc_new:
if (sbi->bio[type] == NULL) {
+   struct bio_private *priv;
+retry:
+   priv = kmalloc(sizeof(struct bio_private), GFP_NOFS);
+   if (!priv) {
+   cond_resched();
+   goto retry;
+   }
+
sbi->bio[type] = f2fs_bio_alloc(bdev, max_hw_blocks(sbi));
sbi->bio[type]->bi_sector = SECTOR_FROM_BLOCK(sbi, blk_addr);
+   sbi->bio[type]->bi_private = priv;
/*
 * The end_io will be assigned at the sumbission phase.
 * Until then, let bio_add_page() merge consecutive IOs as much
-- 
1.7.7
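
The idea of the patch (the generic allocator attaches no private data, and
only the write path allocates it) can be sketched in userspace as below. This
is an illustration only: struct request, struct req_private and all function
names are hypothetical, and the sketch returns NULL on allocation failure
where the patch retries with cond_resched().

```c
#include <stdlib.h>

struct req_private {		/* stand-in for struct bio_private */
	int is_sync;
};

struct request {		/* stand-in for struct bio */
	void *priv_data;	/* stand-in for bio->bi_private */
};

/* Generic allocation: no private data attached. */
static struct request *request_alloc(void)
{
	struct request *rq = malloc(sizeof(*rq));

	if (rq)
		rq->priv_data = NULL;
	return rq;
}

/* Write path only: allocate the private data and attach it. */
static struct request *write_request_alloc(void)
{
	struct req_private *priv = malloc(sizeof(*priv));
	struct request *rq;

	if (!priv)
		return NULL;

	rq = request_alloc();
	if (!rq) {
		free(priv);
		return NULL;
	}

	priv->is_sync = 0;
	rq->priv_data = priv;
	return rq;
}

static void request_free(struct request *rq)
{
	if (!rq)
		return;
	free(rq->priv_data);	/* free(NULL) is a no-op for read requests */
	free(rq);
}
```

A read request built with request_alloc() never pays for the private
allocation, which is the saving the commit message describes.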



Re: [PATCH 0/9] Add namespace support for syslog v2

2013-07-29 Thread Gu Zheng
Hi Rui,

On 07/29/2013 10:31 AM, Rui Xiang wrote:

> This patchset introduces a system log namespace.
> 
> It is the 2nd version. The link of the 1st version is 
> http://lwn.net/Articles/525728/. In that version, syslog_
> namespace was added into nsproxy and created through a new
> clone flag CLONE_SYSLOG when cloning a process. 
> 
> There were some discussion in last November about the 1st 
> version. This version used these important advice, and 
> referred to Serge's patch(http://lwn.net/Articles/525629/).
> 
> Unlike the 1st version, in this patchset, syslog namespace 
> is tied to a user namespace. Add we must create a new user 
> ns before create a new syslog ns, because that will make 
> users have full capabilities in this new userns after 
> cloning a new user ns. The syslog namespace can be created 
> through a new command(11) to __NR_syslog syscall. That owe 
> to a new syslog flag SYSLOG_ACTION_NEW_NS.
> 
> In syslog_namespace, some necessary identifiers for handling 
> syslog buf are containerized. When one container creates a
> new syslog ns, individual buf will be allocated to store log
> ownned this container. 
> 
> A new interface ns_printk is added to print the logs which 
> we want to see in the container. Through ns_printk, we can 
> get more logs related to a specific net ns, for instance, 
> iptables. Here we use it to report iptable logs per 
> contianer.
> 
> Then default printk targeted at the init_syslog_ns will 
> continue to print out most kernel log to host.
> 
> One task in a new syslog ns could affect only current 
> container through "dmesg", "dmesg -c" and /dev/kmsg 
> actions. The read/write interface such as /dev/kmsg, 
> /pro/kmsg and syslog syscall continue to be useful for 
> container users.
> 
> This patchset is based on linus' linux tree.

A detailed changelog between V1 and V2 is really needed; the inline
description is not easy for others to read.

> 
> Rui Xiang (9):
>   syslog_ns: add syslog_namespace and put/get_syslog_ns
>   syslog_ns: add syslog_ns into user_namespace
>   syslog_ns: add init syslog_ns for global syslog
>   syslog_ns: make syslog handling per namespace
>   syslog_ns: make permisiion check per user namespace
>   syslog_ns: use init syslog_ns for console action
>   syslog_ns: implement function for creating syslog ns
>   syslog_ns: implement ns_printk for specific syslog_ns
>   netfilter: use ns_printk in iptable context
> 
>  fs/proc/kmsg.c |  17 +-
>  include/linux/printk.h |   5 +-
>  include/linux/syslog.h |  79 -
>  include/linux/user_namespace.h |   2 +
>  include/net/netfilter/xt_log.h |   6 +-
>  kernel/printk.c| 642 
> -
>  kernel/sysctl.c|   3 +-
>  kernel/user.c  |   3 +
>  kernel/user_namespace.c|   4 +
>  net/netfilter/xt_LOG.c |   4 +-
>  10 files changed, 493 insertions(+), 272 deletions(-)
> 




Re: [PATCH 1/9] syslog_ns: add syslog_namespace and put/get_syslog_ns

2013-07-29 Thread Gu Zheng
Hi Rui,
Please see my comments inline :).

On 07/29/2013 10:31 AM, Rui Xiang wrote:

> Add a struct syslog_namespace which contains the necessary
> members for hanlding syslog and realize get_syslog_ns and
> put_syslog_ns API.
> 
> Signed-off-by: Rui Xiang 
> ---
>  include/linux/syslog.h | 68 
> ++
>  kernel/printk.c|  7 --
>  2 files changed, 68 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/syslog.h b/include/linux/syslog.h
> index 98a3153..425fafe 100644
> --- a/include/linux/syslog.h
> +++ b/include/linux/syslog.h
> @@ -21,6 +21,9 @@
>  #ifndef _LINUX_SYSLOG_H
>  #define _LINUX_SYSLOG_H
>  
> +#include 
> +#include 
> +
>  /* Close the log.  Currently a NOP. */
>  #define SYSLOG_ACTION_CLOSE  0
>  /* Open the log. Currently a NOP. */
> @@ -47,6 +50,71 @@
>  #define SYSLOG_FROM_READER   0
>  #define SYSLOG_FROM_PROC 1
>  
> +enum log_flags {
> + LOG_NOCONS  = 1,/* already flushed, do not print to console */
> + LOG_NEWLINE = 2,/* text ended with a newline */
> + LOG_PREFIX  = 4,/* text started with a prefix */
> + LOG_CONT= 8,/* text is a fragment of a continuation line */
> +};
> +
> +struct syslog_namespace {
> + struct kref kref;   /* syslog_ns reference count & control */
> +
> + raw_spinlock_t logbuf_lock; /* access conflict locker */
> + /* cpu currently holding logbuf_lock of ns */
> + unsigned int logbuf_cpu;
> +
> + /* index and sequence number of the first record stored in the buffer */
> + u64 log_first_seq;
> + u32 log_first_idx;
> +
> + /* index and sequence number of the next record stored in the buffer */
> + u64 log_next_seq;
> + u32 log_next_idx;
> +
> + /* the next printk record to read after the last 'clear' command */
> + u64 clear_seq;
> + u32 clear_idx;
> +
> + char *log_buf;
> + u32 log_buf_len;
> +
> + /* the next printk record to write to the console */
> + u64 console_seq;
> + u32 console_idx;
> +
> + /* the next printk record to read by syslog(READ) or /proc/kmsg */
> + u64 syslog_seq;
> + u32 syslog_idx;
> + enum log_flags syslog_prev;
> + size_t syslog_partial;
> +
> + int dmesg_restrict;
> +};
> +
> +static inline struct syslog_namespace *get_syslog_ns(
> + struct syslog_namespace *ns)
> +{
> + if (ns)
> + kref_get(&ns->kref);
> + return ns;
> +}
> +
> +static inline void free_syslog_ns(struct kref *kref)
> +{
> + struct syslog_namespace *ns;
> + ns = container_of(kref, struct syslog_namespace, kref);
> +
> + kfree(ns->log_buf);
> + kfree(ns);
> +}

This interface seems a bit ugly. Why not use the same form as put_syslog_ns()?

static inline void free_syslog_ns(struct syslog_namespace *ns)

> +
> +static inline void put_syslog_ns(struct syslog_namespace *ns)
> +{
> + if (ns)
> + kref_put(&ns->kref, free_syslog_ns);
> +}
> +
>  int do_syslog(int type, char __user *buf, int count, bool from_file);
>  
>  #endif /* _LINUX_SYSLOG_H */
> diff --git a/kernel/printk.c b/kernel/printk.c
> index d37d45c..7e544bf 100644
> --- a/kernel/printk.c
> +++ b/kernel/printk.c
> @@ -193,13 +193,6 @@ static int console_may_schedule;
>   * separated by ',', and find the message after the ';' character.
>   */
>  
> -enum log_flags {
> - LOG_NOCONS  = 1,/* already flushed, do not print to console */
> - LOG_NEWLINE = 2,/* text ended with a newline */
> - LOG_PREFIX  = 4,/* text started with a prefix */
> - LOG_CONT= 8,/* text is a fragment of a continuation line */
> -};
> -
>  struct log {
>   u64 ts_nsec;/* timestamp in nanoseconds */
>   u16 len;/* length of entire record */




Re: [PATCH 2/9] syslog_ns: add syslog_ns into user_namespace

2013-07-29 Thread Gu Zheng
Hi Rui,

On 07/29/2013 10:31 AM, Rui Xiang wrote:

> Add a syslog_ns pointer to user_namespace, and make
> syslog_ns per user_namespace, not global.
> 
> Since syslog_ns is assigned to user_ns, we can have
> full capabilities in new user_ns to create a new syslog_ns.
> 
> Signed-off-by: Rui Xiang 
> ---
>  include/linux/syslog.h | 5 +
>  include/linux/user_namespace.h | 1 +
>  2 files changed, 6 insertions(+)
> 
> diff --git a/include/linux/syslog.h b/include/linux/syslog.h
> index 425fafe..62ce47f 100644
> --- a/include/linux/syslog.h
> +++ b/include/linux/syslog.h
> @@ -90,6 +90,11 @@ struct syslog_namespace {
>   size_t syslog_partial;
>  
>   int dmesg_restrict;
> +
> + /*
> +  * user namespace which owns this syslog ns.
> +  */
> + struct user_namespace *owner;
>  };
>  
>  static inline struct syslog_namespace *get_syslog_ns(
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index b6b215f..ce2de5b 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -28,6 +28,7 @@ struct user_namespace {
>   unsigned intproc_inum;
>   boolmay_mount_sysfs;
>   boolmay_mount_proc;
> + struct syslog_namespace *syslog_ns;

We already add a user_namespace pointer (owner) to syslog_namespace
to make syslog_ns per user_namespace and to support the caps check.
But why also add a pointer to syslog_namespace in
user_namespace? Am I missing something? :)

Thanks,
Gu

>  };
>  
>  extern struct user_namespace init_user_ns;




Re: [PATCH 4/9] syslog_ns: make syslog handling per namespace

2013-07-29 Thread Gu Zheng
Hi Rui,

On 07/29/2013 10:31 AM, Rui Xiang wrote:

> This patch makes syslog buf and other fields per
> namespace.
> 
> Here use ns->log_buf(log_buf_len, logbuf_lock,
> log_first_seq, logbuf_lock, and so on) fields
> instead of global ones to handle syslog.
> 
> Syslog interfaces such as /dev/kmsg, /proc/kmsg,
> and syslog syscall are all containerized for
> container users.
> 
> Signed-off-by: Rui Xiang 
> ---
>  fs/proc/kmsg.c |  17 +-
>  include/linux/printk.h |   1 -
>  include/linux/syslog.h |   3 +-
>  kernel/printk.c| 507 
> +
>  kernel/sysctl.c|   3 +-
>  5 files changed, 273 insertions(+), 258 deletions(-)
> 
> diff --git a/fs/proc/kmsg.c b/fs/proc/kmsg.c
> index bdfabda..cb98431 100644
> --- a/fs/proc/kmsg.c
> +++ b/fs/proc/kmsg.c
> @@ -13,6 +13,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  #include 
>  #include 
> @@ -21,12 +23,14 @@ extern wait_queue_head_t log_wait;
>  
>  static int kmsg_open(struct inode * inode, struct file * file)
>  {
> - return do_syslog(SYSLOG_ACTION_OPEN, NULL, 0, SYSLOG_FROM_PROC);
> + return do_syslog(SYSLOG_ACTION_OPEN, NULL, 0, SYSLOG_FROM_PROC,
> + file->f_cred->user_ns->syslog_ns);

How about adding a helper function to get the syslog_ns that the file belongs to?
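
Something along these lines, as a rough sketch. The name file_syslog_ns() is
only a suggestion, and the structs below are minimal userspace stand-ins that
model just the fields in the f_cred->user_ns->syslog_ns chain.

```c
/* Minimal stand-ins for the kernel structures (assumption: only the
 * fields used by the helper are modelled). */
struct syslog_namespace {
	int id;
};

struct user_namespace {
	struct syslog_namespace *syslog_ns;
};

struct cred {
	struct user_namespace *user_ns;
};

struct file {
	const struct cred *f_cred;
};

/* Hypothetical helper: syslog namespace of the task that opened @file. */
static inline struct syslog_namespace *file_syslog_ns(const struct file *file)
{
	return file->f_cred->user_ns->syslog_ns;
}
```

Each do_syslog() call in this file could then pass file_syslog_ns(file)
instead of repeating the pointer chain.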
 

>  }
>  
>  static int kmsg_release(struct inode * inode, struct file * file)
>  {
> - (void) do_syslog(SYSLOG_ACTION_CLOSE, NULL, 0, SYSLOG_FROM_PROC);
> + (void) do_syslog(SYSLOG_ACTION_CLOSE, NULL, 0, SYSLOG_FROM_PROC,
> + file->f_cred->user_ns->syslog_ns);
>   return 0;
>  }
>  
> @@ -34,15 +38,18 @@ static ssize_t kmsg_read(struct file *file, char __user 
> *buf,
>size_t count, loff_t *ppos)
>  {
>   if ((file->f_flags & O_NONBLOCK) &&
> - !do_syslog(SYSLOG_ACTION_SIZE_UNREAD, NULL, 0, SYSLOG_FROM_PROC))
> + !do_syslog(SYSLOG_ACTION_SIZE_UNREAD, NULL, 0, SYSLOG_FROM_PROC,
> + file->f_cred->user_ns->syslog_ns))
>   return -EAGAIN;
> - return do_syslog(SYSLOG_ACTION_READ, buf, count, SYSLOG_FROM_PROC);
> + return do_syslog(SYSLOG_ACTION_READ, buf, count, SYSLOG_FROM_PROC,
> + file->f_cred->user_ns->syslog_ns);
>  }
>  
>  static unsigned int kmsg_poll(struct file *file, poll_table *wait)
>  {
>   poll_wait(file, &log_wait, wait);
> - if (do_syslog(SYSLOG_ACTION_SIZE_UNREAD, NULL, 0, SYSLOG_FROM_PROC))
> + if (do_syslog(SYSLOG_ACTION_SIZE_UNREAD, NULL, 0, SYSLOG_FROM_PROC,
> + file->f_cred->user_ns->syslog_ns))
>   return POLLIN | POLLRDNORM;
>   return 0;
>  }
> diff --git a/include/linux/printk.h b/include/linux/printk.h
> index 22c7052..29e3f85 100644
> --- a/include/linux/printk.h
> +++ b/include/linux/printk.h
> @@ -139,7 +139,6 @@ extern bool printk_timed_ratelimit(unsigned long 
> *caller_jiffies,
>  unsigned int interval_msec);
>  
>  extern int printk_delay_msec;
> -extern int dmesg_restrict;
>  extern int kptr_restrict;
>  
>  extern void wake_up_klogd(void);
> diff --git a/include/linux/syslog.h b/include/linux/syslog.h
> index 363bc56..fbf0cb6 100644
> --- a/include/linux/syslog.h
> +++ b/include/linux/syslog.h
> @@ -120,7 +120,8 @@ static inline void put_syslog_ns(struct syslog_namespace 
> *ns)
>   kref_put(&ns->kref, free_syslog_ns);
>  }
>  
> -int do_syslog(int type, char __user *buf, int count, bool from_file);
> +int do_syslog(int type, char __user *buf, int count, bool from_file,
> + struct syslog_namespace *ns);
>  
>  extern struct syslog_namespace init_syslog_ns;
>  #endif /* _LINUX_SYSLOG_H */
> diff --git a/kernel/printk.c b/kernel/printk.c
> index fd83ec1..846fef5 100644
> --- a/kernel/printk.c
> +++ b/kernel/printk.c
> @@ -213,29 +213,8 @@ static DEFINE_RAW_SPINLOCK(logbuf_lock);
>  
>  #ifdef CONFIG_PRINTK
>  DECLARE_WAIT_QUEUE_HEAD(log_wait);
> -/* the next printk record to read by syslog(READ) or /proc/kmsg */
> -static u64 syslog_seq;
> -static u32 syslog_idx;
> -static enum log_flags syslog_prev;
> -static size_t syslog_partial;
> -
> -/* index and sequence number of the first record stored in the buffer */
> -static u64 log_first_seq;
> -static u32 log_first_idx;
> -
> -/* index and sequence number of the next record to store in the buffer */
> -static u64 log_next_seq;
> -static u32 log_next_idx;
> -
> -/* the next printk record to write to the console */
> -static u64 console_seq;
> -static u32 console_idx;
>  static enum log_flags console_prev;
>  
> -/* the next printk record to read after the last 'clear' command */
> -static u64 clear_seq;
> -static u32 clear_idx;
> -
>  #define PREFIX_MAX   32
>  #define LOG_LINE_MAX 1024 - PREFIX_MAX
>  
> @@ -246,12 +225,8 @@ static u32 clear_idx;
>  #define LOG_ALIGN __al

Re: [PATCH 2/9] syslog_ns: add syslog_ns into user_namespace

2013-07-29 Thread Gu Zheng
On 07/29/2013 05:54 PM, Gao feng wrote:

> On 07/29/2013 05:46 PM, Gu Zheng wrote:
>> Hi Rui,
>>
>> On 07/29/2013 10:31 AM, Rui Xiang wrote:
>>
>>> Add a syslog_ns pointer to user_namespace, and make
>>> syslog_ns per user_namespace, not global.
>>>
>>> Since syslog_ns is assigned to user_ns, we can have
>>> full capabilities in new user_ns to create a new syslog_ns.
>>>
>>> Signed-off-by: Rui Xiang 
>>> ---
>>>  include/linux/syslog.h | 5 +
>>>  include/linux/user_namespace.h | 1 +
>>>  2 files changed, 6 insertions(+)
>>>
>>> diff --git a/include/linux/syslog.h b/include/linux/syslog.h
>>> index 425fafe..62ce47f 100644
>>> --- a/include/linux/syslog.h
>>> +++ b/include/linux/syslog.h
>>> @@ -90,6 +90,11 @@ struct syslog_namespace {
>>> size_t syslog_partial;
>>>  
>>> int dmesg_restrict;
>>> +
>>> +   /*
>>> +* user namespace which owns this syslog ns.
>>> +*/
>>> +   struct user_namespace *owner;
>>>  };
>>>  
>>>  static inline struct syslog_namespace *get_syslog_ns(
>>> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
>>> index b6b215f..ce2de5b 100644
>>> --- a/include/linux/user_namespace.h
>>> +++ b/include/linux/user_namespace.h
>>> @@ -28,6 +28,7 @@ struct user_namespace {
>>> unsigned intproc_inum;
>>> boolmay_mount_sysfs;
>>> boolmay_mount_proc;
>>> +   struct syslog_namespace *syslog_ns;
>>
>> As we add a syslog_ns pointer to user_namespace to make
>> syslog_ns per user_namespace and the caps check.
>> But why also add a point to syslog_namespace in
>> user_namespace? Am I missing something?:)
>>
> 
> yep, with this we can make sure all the other types of namespace such as
> mount, net, pid can access syslog_ns through user namespace.

Got it.:)

Thanks,
Gu

> 
> 




Re: [PATCH 7/9] syslog_ns: implement function for creating syslog ns

2013-07-29 Thread Gu Zheng
Hi Rui,

On 07/29/2013 10:31 AM, Rui Xiang wrote:

> Add create_syslog_ns function to create a new ns. We
> must create a user_ns before create a new syslog ns.
> And then tie the new syslog_ns to current user_ns
> instead of original syslog_ns which comes from
> parent user_ns.
> 
> Add a new syslog flag SYSLOG_ACTION_NEW_NS to implement
> a new command(11) of __NR_syslog system call. Through
> that command, we can create a new syslog ns in user
> space.
> 
> Signed-off-by: Rui Xiang 
> ---
>  include/linux/syslog.h |  2 ++
>  kernel/printk.c| 52 
> ++
>  2 files changed, 54 insertions(+)
> 
> diff --git a/include/linux/syslog.h b/include/linux/syslog.h
> index fbf0cb6..df57c21 100644
> --- a/include/linux/syslog.h
> +++ b/include/linux/syslog.h
> @@ -46,6 +46,8 @@
>  #define SYSLOG_ACTION_SIZE_UNREAD9
>  /* Return size of the log buffer */
>  #define SYSLOG_ACTION_SIZE_BUFFER   10
> +/* Create a new syslog ns */
> +#define SYSLOG_ACTION_NEW_NS11
>  
>  #define SYSLOG_FROM_READER   0
>  #define SYSLOG_FROM_PROC 1
> diff --git a/kernel/printk.c b/kernel/printk.c
> index fd2d600..6b561db 100644
> --- a/kernel/printk.c
> +++ b/kernel/printk.c
> @@ -384,6 +384,10 @@ static int check_syslog_permissions(int type, bool 
> from_file,
>   || type == SYSLOG_ACTION_CONSOLE_LEVEL)
>   ns = &init_syslog_ns;
>  
> + /* create a new syslog ns */
> + if (type == SYSLOG_ACTION_NEW_NS)
> + return 0;
> +

Don't we need a further permission or capability check here? Returning
success directly seems sloppy.

Thanks,
Gu

>   if (syslog_action_restricted(type, ns)) {
>   if (ns_capable(ns->owner, CAP_SYSLOG))
>   return 0;
> @@ -1131,6 +1135,51 @@ static int syslog_print_all(char __user *buf, int 
> size, bool clear,
>   return len;
>  }
>  
> +static int create_syslog_ns(void)
> +{
> + struct user_namespace *userns = current_user_ns();
> + struct syslog_namespace *oldns, *newns;
> + int err;
> +
> + /*
> +  * syslog ns belongs to a user ns.  So you can only unshare your
> +  * user_ns if you share a user_ns with your parent userns
> +  */
> + if (userns == &init_user_ns ||
> + userns->syslog_ns != userns->parent->syslog_ns)
> + return -EINVAL;
> +
> + if (!ns_capable(userns, CAP_SYSLOG))
> + return -EPERM;
> +
> + err = -ENOMEM;
> + oldns = userns->syslog_ns;
> + newns = kzalloc(sizeof(*newns), GFP_ATOMIC);
> + if (!newns)
> + goto out;
> + newns->log_buf_len = __LOG_BUF_LEN;
> + newns->log_buf = kzalloc(newns->log_buf_len, GFP_ATOMIC);
> + if (!newns->log_buf)
> + goto out;
> +
> + newns->owner = get_user_ns(userns);
> + raw_spin_lock_init(&(newns->logbuf_lock));
> + newns->logbuf_cpu = UINT_MAX;
> + newns->dmesg_restrict = oldns->dmesg_restrict;
> + put_syslog_ns(oldns);
> + kref_init(&newns->kref);
> + userns->syslog_ns = newns;
> + newns = NULL;
> +
> + err = 0;
> +out:
> + if (newns) {
> + kfree(newns->log_buf);
> + kfree(newns);
> + }
> + return err;
> +}
> +
>  int do_syslog(int type, char __user *buf, int len, bool from_file,
>   struct syslog_namespace *ns)
>  {
> @@ -1254,6 +1303,9 @@ int do_syslog(int type, char __user *buf, int len, bool 
> from_file,
>   case SYSLOG_ACTION_SIZE_BUFFER:
>   error = ns->log_buf_len;
>   break;
> + case SYSLOG_ACTION_NEW_NS:
> + error = create_syslog_ns();
> + break;
>   default:
>   error = -EINVAL;
>   break;




Re: [PATCH 8/9] syslog_ns: implement ns_printk for specific syslog_ns

2013-07-29 Thread Gu Zheng
Hi Rui,

On 07/29/2013 10:31 AM, Rui Xiang wrote:

> Add a new interface named ns_printk, and assign an
> patamater ns. Log which belong to a container can
> be printed by ns_printk.

One question: with syslog_ns in use, does the log printed by *printk* on the
host also contain the logs of each syslog_ns (those printed with ns_printk)?

Thanks,
Gu

> 
> Signed-off-by: Rui Xiang 
> ---
>  include/linux/printk.h |  4 
>  kernel/printk.c| 53 
> ++
>  2 files changed, 53 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/printk.h b/include/linux/printk.h
> index 29e3f85..bf83ad9 100644
> --- a/include/linux/printk.h
> +++ b/include/linux/printk.h
> @@ -6,6 +6,7 @@
>  #include 
>  #include 
>  
> +struct syslog_namespace;
>  extern const char linux_banner[];
>  extern const char linux_proc_banner[];
>  
> @@ -123,6 +124,9 @@ asmlinkage int printk_emit(int facility, int level,
>  asmlinkage __printf(1, 2) __cold
>  int printk(const char *fmt, ...);
>  
> +asmlinkage __printf(2, 3) __cold
> +int ns_printk(struct syslog_namespace *ns, const char *fmt, ...);
> +
>  /*
>   * Special printk facility for scheduler use only, _DO_NOT_USE_ !
>   */
> diff --git a/kernel/printk.c b/kernel/printk.c
> index 6b561db..56a8b27 100644
> --- a/kernel/printk.c
> +++ b/kernel/printk.c
> @@ -1554,9 +1554,10 @@ static size_t cont_print_text(char *text, size_t size)
>   return textlen;
>  }
>  
> -asmlinkage int vprintk_emit(int facility, int level,
> - const char *dict, size_t dictlen,
> - const char *fmt, va_list args)
> +static int ns_vprintk_emit(int facility, int level,
> + const char *dict, size_t dictlen,
> + const char *fmt, va_list args,
> + struct syslog_namespace *ns)
>  {
>   static int recursion_bug;
>   static char textbuf[LOG_LINE_MAX];
> @@ -1566,7 +1567,6 @@ asmlinkage int vprintk_emit(int facility, int level,
>   unsigned long flags;
>   int this_cpu;
>   int printed_len = 0;
> - struct syslog_namespace *ns = &init_syslog_ns;
>  
>   boot_delay_msec(level);
>   printk_delay();
> @@ -1697,6 +1697,14 @@ out_restore_irqs:
>  
>   return printed_len;
>  }
> +
> +asmlinkage int vprintk_emit(int facility, int level,
> + const char *dict, size_t dictlen,
> + const char *fmt, va_list args)
> +{
> + return ns_vprintk_emit(facility, level, dict, dictlen, fmt, args,
> + &init_syslog_ns);
> +}
>  EXPORT_SYMBOL(vprintk_emit);
>  
>  asmlinkage int vprintk(const char *fmt, va_list args)
> @@ -1762,6 +1770,43 @@ asmlinkage int printk(const char *fmt, ...)
>  }
>  EXPORT_SYMBOL(printk);
>  
> +/**
> + * ns_printk - print a kernel message in syslog_ns
> + * @ns: syslog namespace
> + * @fmt: format string
> + *
> + * This is ns_printk().
> + * It can be called from container context. We add a param
> + * ns to record current syslog namespace, because we need to
> + * print some log which are not generated by host, but contaner.
> + *
> + * See the vsnprintf() documentation for format string extensions over C99.
> + **/
> +asmlinkage int ns_printk(struct syslog_namespace *ns,
> + const char *fmt, ...)
> +{
> + va_list args;
> + int r;
> +
> + if (!ns)
> + ns = current_user_ns()->syslog_ns;
> +
> +#ifdef CONFIG_KGDB_KDB
> + if (unlikely(kdb_trap_printk)) {
> + va_start(args, fmt);
> + r = vkdb_printf(fmt, args);
> + va_end(args);
> + return r;
> + }
> +#endif
> + va_start(args, fmt);
> + r = ns_vprintk_emit(0, -1, NULL, 0, fmt, args, ns);
> + va_end(args);
> +
> + return r;
> +}
> +EXPORT_SYMBOL(ns_printk);
> +

Could we clean up printk() here by building it on the same path as ns_printk()?
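
What I have in mind is roughly the shape below, shown as a userspace sketch
with hypothetical names (struct log_ns, ns_vlog(), ns_log(), host_log()) and
vsnprintf() in place of the real emit path. Since a variadic function cannot
forward its "..." to another variadic function, both front ends share one
va_list worker.

```c
#include <stdarg.h>
#include <stdio.h>

struct log_ns {			/* stand-in for struct syslog_namespace */
	char buf[128];
};

static struct log_ns init_ns;	/* stand-in for init_syslog_ns */

/* Single va_list worker shared by both front ends. */
static int ns_vlog(struct log_ns *ns, const char *fmt, va_list args)
{
	return vsnprintf(ns->buf, sizeof(ns->buf), fmt, args);
}

/* Namespace-aware front end (the role ns_printk plays). */
static int ns_log(struct log_ns *ns, const char *fmt, ...)
{
	va_list args;
	int r;

	va_start(args, fmt);
	r = ns_vlog(ns, fmt, args);
	va_end(args);
	return r;
}

/* Plain front end (the role printk plays): always the initial namespace. */
static int host_log(const char *fmt, ...)
{
	va_list args;
	int r;

	va_start(args, fmt);
	r = ns_vlog(&init_ns, fmt, args);
	va_end(args);
	return r;
}
```

The sketch leaves out the NULL-namespace fallback and the kdb handling that
ns_printk() has in the patch.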

>  #else /* CONFIG_PRINTK */
>  
>  #define LOG_LINE_MAX 0




Re: [PATCH 1/9] syslog_ns: add syslog_namespace and put/get_syslog_ns

2013-07-29 Thread Gu Zheng
On 07/29/2013 07:47 PM, Rui Xiang wrote:

> On 2013/7/29 17:40, Gu Zheng wrote:
>> Hi Rui,
>>  Please see my comments inline :).
>>
> Hi Gu,
> 
> Thanks for your attention.
> 
>> On 07/29/2013 10:31 AM, Rui Xiang wrote:
>>
>>> Add a struct syslog_namespace which contains the necessary
>>> members for handling syslog, and implement the get_syslog_ns
>>> and put_syslog_ns APIs.
>>>
>>> Signed-off-by: Rui Xiang 
>>> ---
>>>  include/linux/syslog.h | 68 ++
>>>  kernel/printk.c        |  7 --
>>>  2 files changed, 68 insertions(+), 7 deletions(-)
>>>
> 
> ...
> 
>>> +
>>> +static inline void free_syslog_ns(struct kref *kref)
>>> +{
>>> +   struct syslog_namespace *ns;
>>> +   ns = container_of(kref, struct syslog_namespace, kref);
>>> +
>>> +   kfree(ns->log_buf);
>>> +   kfree(ns);
>>> +}
>>
>> This interface seems a bit ugly. Why not use the same form as put_syslog_ns()?
>>
>> static inline void free_syslog_ns(struct syslog_namespace *ns)
>>
> 
> free_syslog_ns is used in put_syslog_ns, and kref_put passes the kref to its
> release function as the parameter. You can see that from its prototype:
> static inline int kref_put(struct kref *kref, void (*release)(struct kref *kref)).

Got it.
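
For reference, the pattern can be sketched in plain userspace C. This is only an illustration under stated assumptions: every name ending in _sketch is hypothetical, and the refcount here is a plain int rather than the kernel's struct kref. The point is that the release callback only receives the embedded counter, so it has to recover the enclosing object itself, as container_of() does in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's struct kref (illustrative only). */
struct kref_sketch {
	int refcount;
};

/* Same idea as container_of(): member pointer -> enclosing object. */
#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct syslog_ns_sketch {
	char *log_buf;
	struct kref_sketch kref;
};

static int released_count;	/* counts release calls, for demonstration */

static void kref_get_sketch(struct kref_sketch *k)
{
	k->refcount++;
}

/* Calls release() when the last reference is dropped; returns 1 then. */
static int kref_put_sketch(struct kref_sketch *k,
			   void (*release)(struct kref_sketch *k))
{
	assert(k->refcount > 0);
	if (--k->refcount == 0) {
		release(k);
		return 1;
	}
	return 0;
}

/* The release callback sees only the kref, hence the kref parameter. */
static void free_ns_sketch(struct kref_sketch *kref)
{
	struct syslog_ns_sketch *ns =
		container_of_sketch(kref, struct syslog_ns_sketch, kref);

	free(ns->log_buf);
	free(ns);
	released_count++;
}

/* Callers keep the friendlier "pointer to namespace" interface. */
static void put_ns_sketch(struct syslog_ns_sketch *ns)
{
	if (ns)
		kref_put_sketch(&ns->kref, free_ns_sketch);
}

/* Allocates a namespace holding one initial reference. */
static struct syslog_ns_sketch *alloc_ns_sketch(size_t len)
{
	struct syslog_ns_sketch *ns = calloc(1, sizeof(*ns));

	if (!ns)
		return NULL;
	ns->log_buf = malloc(len);
	if (!ns->log_buf) {
		free(ns);
		return NULL;
	}
	ns->kref.refcount = 1;
	return ns;
}
```

So the kref-typed signature is confined to the release function, while put_syslog_ns() remains the interface that callers use.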

Regards,
Gu

> 
> Thanks.
> 
>>> +
>>> +static inline void put_syslog_ns(struct syslog_namespace *ns)
>>> +{
>>> +   if (ns)
>>> +           kref_put(&ns->kref, free_syslog_ns);
>>> +}
>>> +
>>>
> 
> 




Re: [PATCH 7/9] syslog_ns: implement function for creating syslog ns

2013-07-29 Thread Gu Zheng
On 07/30/2013 11:39 AM, Rui Xiang wrote:

> On 2013/7/29 18:25, Gu Zheng wrote:
>> Hi Rui,
>>
>> On 07/29/2013 10:31 AM, Rui Xiang wrote:
>>
>>> Add create_syslog_ns function to create a new ns. We
>>> must create a user_ns before create a new syslog ns.
>>> And then tie the new syslog_ns to current user_ns
>>> instead of original syslog_ns which comes from
>>> parent user_ns.
> 
> ...
> 
>>> diff --git a/kernel/printk.c b/kernel/printk.c
>>> index fd2d600..6b561db 100644
>>> --- a/kernel/printk.c
>>> +++ b/kernel/printk.c
>>> @@ -384,6 +384,10 @@ static int check_syslog_permissions(int type, bool from_file,
>>> || type == SYSLOG_ACTION_CONSOLE_LEVEL)
>>> ns = &init_syslog_ns;
>>>  
>>> +   /* create a new syslog ns */
>>> +   if (type == SYSLOG_ACTION_NEW_NS)
>>> +           return 0;
>>> +
>>
>> Don't we need a further permission or capability check here? Returning success
>> directly seems sloppy.
>>
> CAP_SYSLOG is checked in create_syslog_ns, so I think we can return 0 here
> for now.

If so, why not move the check here? IMO, the earlier permission checking
happens, the better. What's your opinion?
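
A rough userspace sketch of the suggestion, under stated assumptions: has_cap_syslog stands in for capable(CAP_SYSLOG), the numeric value of the action constant is made up, and the _sketch functions are hypothetical, not the real kernel/printk.c code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical value; the real constant comes from the patch series. */
#define SYSLOG_ACTION_NEW_NS_SKETCH 11

/* Stand-in for capable(CAP_SYSLOG) in this sketch. */
static bool has_cap_syslog;

/* Counts how many namespaces the "creation" path actually built. */
static int creations;

/* The capability is checked up front instead of returning 0 blindly. */
static int check_permissions_sketch(int type)
{
	if (type == SYSLOG_ACTION_NEW_NS_SKETCH)
		return has_cap_syslog ? 0 : -EPERM;
	return 0;
}

/* The creation path runs only after the permission check has passed. */
static int do_syslog_sketch(int type)
{
	int err;

	assert(type >= 0);
	err = check_permissions_sketch(type);
	if (err)
		return err;
	if (type == SYSLOG_ACTION_NEW_NS_SKETCH)
		creations++;
	return 0;
}
```

Checked this way, an unprivileged caller is rejected before any of the creation code runs, and the permission rules stay in one function.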

Regards,
Gu




