Re: [PATCH] brd: fix integer overflow in brd_check_and_reset_par

2021-04-01 Thread Zhiqiang Liu
friendly ping.

On 2021/3/25 19:45, lixiaokeng wrote:
> The max_part may overflow. For example,
>
> modprobe brd rd_nr=3 rd_size=102400 max_part=1073741824(2^30)
>
> Expected result
> NAME MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> ram0   1:00   100M  0 disk
> ram1   1:256  0   100M  0 disk
> ram2   1:512  0   100M  0 disk
>
> Actual result
> NAME MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> ram0 259:00   100M  0 disk
> ram1 259:10   100M  0 disk
> ram2 259:20   100M  0 disk
>
> Fix it.
>
> Signed-off-by: Lixiaokeng 
> Signed-off-by: Zhiqiang Liu 
> ---
>  drivers/block/brd.c | 8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/block/brd.c b/drivers/block/brd.c
> index c43a6ab4b1f3..c91831cd5d2a 100644
> --- a/drivers/block/brd.c
> +++ b/drivers/block/brd.c
> @@ -457,21 +457,19 @@ static void brd_del_one(struct brd_device *brd)
>
>  static inline void brd_check_and_reset_par(void)
>  {
> - if (unlikely(!max_part))
> + if (unlikely(max_part <= 0))
>   max_part = 1;
>
>   /*
>* make sure 'max_part' can be divided exactly by (1U << MINORBITS),
>* otherwise, it is possiable to get same dev_t when adding partitions.
>*/
> - if ((1U << MINORBITS) % max_part != 0)
> - max_part = 1UL << fls(max_part);
> -
>   if (max_part > DISK_MAX_PARTS) {
>   pr_info("brd: max_part can't be larger than %d, reset max_part 
> = %d.\n",
>   DISK_MAX_PARTS, DISK_MAX_PARTS);
>   max_part = DISK_MAX_PARTS;
> - }
> + } else if ((1U << MINORBITS) % max_part != 0)
> + max_part = 1UL << fls(max_part);
>  }
>
>  static int __init brd_init(void)



[PATCH v2] ACPI / hotplug / PCI: fix memory leak in enable_slot()

2021-03-25 Thread Zhiqiang Liu
From: Feilong Lin 

In enable_slot() in drivers/pci/hotplug/acpiphp_glue.c, if pci_get_slot()
will return NULL, we will do not set SLOT_ENABLED flag of slot. if one
device is found by calling pci_get_slot(), its reference count will be
increased. In this case, we did not call pci_dev_put() to decrement the
its reference count, the memory of the device (struct pci_dev type) will
leak.

Fix it by calling pci_dev_put() to decrement its reference count after that
pci_get_slot() returns a PCI device.

Signed-off-by: Feilong Lin 
Signed-off-by: Zhiqiang Liu 
--
v2: rewrite subject and commit log as suggested by Bjorn Helgaas.
---
 drivers/pci/hotplug/acpiphp_glue.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/pci/hotplug/acpiphp_glue.c 
b/drivers/pci/hotplug/acpiphp_glue.c
index 3365c93abf0e..f031302ad401 100644
--- a/drivers/pci/hotplug/acpiphp_glue.c
+++ b/drivers/pci/hotplug/acpiphp_glue.c
@@ -533,6 +533,7 @@ static void enable_slot(struct acpiphp_slot *slot, bool 
bridge)
slot->flags &= ~SLOT_ENABLED;
continue;
}
+   pci_dev_put(dev);
}
 }

-- 
2.19.1



Re: md/dm-mpath: check whether all pgpaths have same uuid in multipath_ctr()

2021-03-24 Thread Zhiqiang Liu



On 2021/3/22 22:22, Mike Snitzer wrote:
> On Mon, Mar 22 2021 at  4:11am -0400,
> Christoph Hellwig  wrote:
> 
>> On Sat, Mar 20, 2021 at 03:19:23PM +0800, Zhiqiang Liu wrote:
>>> From: Zhiqiang Liu 
>>>
>>> When we make IO stress test on multipath device, there will
>>> be a metadata err because of wrong path. In the test, we
>>> concurrent execute 'iscsi device login|logout' and
>>> 'multipath -r' command with IO stress on multipath device.
>>> In some case, systemd-udevd may have not time to process
>>> uevents of iscsi device logout|login, and then 'multipath -r'
>>> command triggers multipathd daemon calls ioctl to load table
>>> with incorrect old device info from systemd-udevd.
>>> Then, one iscsi path may be incorrectly attached to another
>>> multipath which has different uuid. Finally, the metadata err
>>> occurs when umounting filesystem to down write metadata on
>>> the iscsi device which is actually not owned by the multipath
>>> device.
>>>
>>> So we need to check whether all pgpaths of one multipath have
>>> the same uuid, if not, we should throw a error.
>>>
>>> Signed-off-by: Zhiqiang Liu 
>>> Signed-off-by: lixiaokeng 
>>> Signed-off-by: linfeilong 
>>> Signed-off-by: Wubo 
>>> ---
>>>  drivers/md/dm-mpath.c   | 52 +
>>>  drivers/scsi/scsi_lib.c |  1 +
>>>  2 files changed, 53 insertions(+)
>>>
>>> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
>>> index bced42f082b0..f0b995784b53 100644
>>> --- a/drivers/md/dm-mpath.c
>>> +++ b/drivers/md/dm-mpath.c
>>> @@ -24,6 +24,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>  #include 
>>>  #include 
>>>
>>> @@ -1169,6 +1170,45 @@ static int parse_features(struct dm_arg_set *as, 
>>> struct multipath *m)
>>> return r;
>>>  }
>>>
>>> +#define SCSI_VPD_LUN_ID_PREFIX_LEN 4
>>> +#define MPATH_UUID_PREFIX_LEN 7
>>> +static int check_pg_uuid(struct priority_group *pg, char *md_uuid)
>>> +{
>>> +   char pgpath_uuid[DM_UUID_LEN] = {0};
>>> +   struct request_queue *q;
>>> +   struct pgpath *pgpath;
>>> +   struct scsi_device *sdev;
>>> +   ssize_t count;
>>> +   int r = 0;
>>> +
>>> +   list_for_each_entry(pgpath, >pgpaths, list) {
>>> +   q = bdev_get_queue(pgpath->path.dev->bdev);
>>> +   sdev = scsi_device_from_queue(q);
>>
>> Common dm-multipath code should never poke into scsi internals.  This
>> is something for the device handler to check.  It probably also won't
>> work for all older devices.
> 
> Definitely.
> 
> But that aside, userspace (multipathd) _should_ be able to do extra
> validation, _before_ pushing down a new table to the kernel, rather than
> forcing the kernel to do it.

As your said, it is better to do extra validation in userspace (multipathd).
However, in some cases, the userspace cannot see the real-time present devices
info as Martin (committer of multipath-tools) said.
In addition, the kernel can see right device info in the table at any time,
so the uuid check in kernel can ensure one multipath is composed with paths 
mapped to
the same device.

Considering the severity of the wrong path in multipath, I think it worths more
checking.

Regards
Zhiqiang Liu.


> 
> .
> 



Re: md/dm-mpath: check whether all pgpaths have same uuid in multipath_ctr()

2021-03-24 Thread Zhiqiang Liu


On 2021/3/24 1:11, Ewan D. Milne wrote:
> On Tue, 2021-03-23 at 15:47 +0800, lixiaokeng wrote:
>>
>> On 2021/3/22 22:22, Mike Snitzer wrote:
>>> On Mon, Mar 22 2021 at  4:11am -0400,
>>> Christoph Hellwig  wrote:
>>>
>>>> On Sat, Mar 20, 2021 at 03:19:23PM +0800, Zhiqiang Liu wrote:
>>>>> From: Zhiqiang Liu 
>>>>>
>>>>> When we make IO stress test on multipath device, there will
>>>>> be a metadata err because of wrong path. In the test, we
>>>>> concurrent execute 'iscsi device login|logout' and
>>>>> 'multipath -r' command with IO stress on multipath device.
>>>>> In some case, systemd-udevd may have not time to process
>>>>> uevents of iscsi device logout|login, and then 'multipath -r'
>>>>> command triggers multipathd daemon calls ioctl to load table
>>>>> with incorrect old device info from systemd-udevd.
>>>>> Then, one iscsi path may be incorrectly attached to another
>>>>> multipath which has different uuid. Finally, the metadata err
>>>>> occurs when umounting filesystem to down write metadata on
>>>>> the iscsi device which is actually not owned by the multipath
>>>>> device.
>>>>>
>>>>> So we need to check whether all pgpaths of one multipath have
>>>>> the same uuid, if not, we should throw a error.
>>>>>
>>>>> Signed-off-by: Zhiqiang Liu 
>>>>> Signed-off-by: lixiaokeng 
>>>>> Signed-off-by: linfeilong 
>>>>> Signed-off-by: Wubo 
>>>>> ---
>>>>>  drivers/md/dm-mpath.c   | 52
>>>>> +
>>>>>  drivers/scsi/scsi_lib.c |  1 +
>>>>>  2 files changed, 53 insertions(+)
>>>>>
>>>>> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
>>>>> index bced42f082b0..f0b995784b53 100644
>>>>> --- a/drivers/md/dm-mpath.c
>>>>> +++ b/drivers/md/dm-mpath.c
>>>>> @@ -24,6 +24,7 @@
>>>>>  #include 
>>>>>  #include 
>>>>>  #include 
>>>>> +#include 
>>>>>  #include 
>>>>>  #include 
>>>>>
>>>>> @@ -1169,6 +1170,45 @@ static int parse_features(struct
>>>>> dm_arg_set *as, struct multipath *m)
>>>>>   return r;
>>>>>  }
>>>>>
>>>>> +#define SCSI_VPD_LUN_ID_PREFIX_LEN 4
>>>>> +#define MPATH_UUID_PREFIX_LEN 7
>>>>> +static int check_pg_uuid(struct priority_group *pg, char
>>>>> *md_uuid)
>>>>> +{
>>>>> + char pgpath_uuid[DM_UUID_LEN] = {0};
>>>>> + struct request_queue *q;
>>>>> + struct pgpath *pgpath;
>>>>> + struct scsi_device *sdev;
>>>>> + ssize_t count;
>>>>> + int r = 0;
>>>>> +
>>>>> + list_for_each_entry(pgpath, >pgpaths, list) {
>>>>> + q = bdev_get_queue(pgpath->path.dev->bdev);
>>>>> + sdev = scsi_device_from_queue(q);
>>>>
>>>> Common dm-multipath code should never poke into scsi
>>>> internals.  This
>>>> is something for the device handler to check.  It probably also
>>>> won't
>>>> work for all older devices.
>>>
>>> Definitely.
>>>
>>> But that aside, userspace (multipathd) _should_ be able to do extra
>>> validation, _before_ pushing down a new table to the kernel, rather
>>> than
>>> forcing the kernel to do it.
>>>
>>
>> Martin (committer of multipath-tools) said that:
>> "Don't get me wrong, I don't argue against tough testing. But we
>> should
>> be aware that there are always time intervals during which
>> multipathd's
>> picture of the present devices is different from what the kernel
>> sees."
>>
>> It is difficult to solve this in multipathd.
>>
>> Regards,
>> Lixiaokeng
>>
> 
> I think the patch is no good.  There are plenty of devices that don't
> support VPD page 83h:
> 
> int scsi_vpd_lun_id(struct scsi_device *sdev, char *id, size_t id_len)
> {
> u8 cur_id_type = 0xff;
> u8 cur_id_size = 0;
> unsigned char *d, *cur_id_str;
> unsigned char __rcu *vpd_pg83;
> int id_size = -EINVAL;
> 
> rcu_read_lock();
> vpd_pg83 = rcu_dereference(sdev->vpd_pg83);
> if (!vpd_pg83) {
> rcu_read_unlock();
> return -ENXIO;
> }
> 
> and the DM layer should not be looking at the properties of the
> underlying devices in this way anyway.  It should be pushed down
> to the table.
> 
Thanks for your suggestion.
I will have a try to modify the patch as your advice.


Regards
Zhiqiang Liu.
> -Ewan
> 
> 
> 
> 
> .
> 



Re: [PATCH] pci: fix memory leak when virtio pci hotplug

2021-03-24 Thread Zhiqiang Liu
Thanks for your suggestion.

I will rewrite the commit log and send the v2 patch.


On 2021/3/24 2:24, Bjorn Helgaas wrote:
> On Sun, Mar 21, 2021 at 11:29:30PM +0800, Zhiqiang Liu wrote:
>> From: Feilong Lin 
>>
>> Repeated hot-plugging of pci devices for a virtual
>> machine driven by virtio, we found that there is a
>> leak in kmalloc-4k, which was confirmed as the memory
>> of the pci_device structure. Then we found out that
>> it was missing pci_dev_put() after pci_get_slot() in
>> enable_slot() of acpiphp_glue.c.
>>
>> Signed-off-by: Feilong Lin 
>> Reviewed-by: Zhiqiang Liu 
> Since this came from you, Zhiqiang, it needs a signed-off-by (not just
> a reviewed-by) from you.  See
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?id=v5.11#n361
>
> Also see
> https://lore.kernel.org/r/20171026223701.ga25...@bhelgaas-glaptop.roam.corp.google.com
> and
>
>   - Wrap commit log to fill 80 columns
>   - s/pci/PCI/ (subject and commit log)
>   - Run "git log --oneline drivers/pci/hotplug/acpiphp_glue.c".  It's
> not completely consistent, but at least match the style of one of
> them.
>
> There is no "pci_device" structure.  I think you mean the "struct
> pci_dev".
>
> The commit log doesn't actually say what the patch does.  It's obvious
> from the patch, but it should say in the commit log.  Look at previous
> commit logs to see how they do it.
>
>> ---
>>  drivers/pci/hotplug/acpiphp_glue.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/pci/hotplug/acpiphp_glue.c 
>> b/drivers/pci/hotplug/acpiphp_glue.c
>> index 3365c93abf0e..f031302ad401 100644
>> --- a/drivers/pci/hotplug/acpiphp_glue.c
>> +++ b/drivers/pci/hotplug/acpiphp_glue.c
>> @@ -533,6 +533,7 @@ static void enable_slot(struct acpiphp_slot *slot, bool 
>> bridge)
>>  slot->flags &= ~SLOT_ENABLED;
>>  continue;
>>  }
>> +pci_dev_put(dev);
>>  }
>>  }
>>
>> -- 
>> 2.19.1
>>
>>
> .



Re: [PATCH] pci: fix memory leak when virtio pci hotplug

2021-03-23 Thread Zhiqiang Liu
friendly ping..

On 2021/3/22 10:43, Wu Bo wrote:
> On 2021/3/21 23:29, Zhiqiang Liu wrote:
>> From: Feilong Lin 
>>
>> Repeated hot-plugging of pci devices for a virtual
>> machine driven by virtio, we found that there is a
>> leak in kmalloc-4k, which was confirmed as the memory
>> of the pci_device structure. Then we found out that
>> it was missing pci_dev_put() after pci_get_slot() in
>> enable_slot() of acpiphp_glue.c.
>>
>> Signed-off-by: Feilong Lin 
>> Reviewed-by: Zhiqiang Liu 
>> ---
>>   drivers/pci/hotplug/acpiphp_glue.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/pci/hotplug/acpiphp_glue.c 
>> b/drivers/pci/hotplug/acpiphp_glue.c
>> index 3365c93abf0e..f031302ad401 100644
>> --- a/drivers/pci/hotplug/acpiphp_glue.c
>> +++ b/drivers/pci/hotplug/acpiphp_glue.c
>> @@ -533,6 +533,7 @@ static void enable_slot(struct acpiphp_slot *slot, bool 
>> bridge)
>>   slot->flags &= ~SLOT_ENABLED;
>>   continue;
>>   }
>> +    pci_dev_put(dev);
>>   }
>>   }
>>
> Reviewed-by: Wu Bo 
>
> .



[PATCH] pci/hotplug: fix potential memory leak in disable_slot()

2021-03-21 Thread Zhiqiang Liu


In disable_slot(), if we obtain desired PCI device
successfully by calling pci_get_slot(), we should
call pci_dev_put() to release its reference.

Signed-off-by: Zhiqiang Liu 
Signed-off-by: Feilong Lin 
---
 drivers/pci/hotplug/s390_pci_hpc.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/hotplug/s390_pci_hpc.c 
b/drivers/pci/hotplug/s390_pci_hpc.c
index c9e790c74051..999a34b6fd50 100644
--- a/drivers/pci/hotplug/s390_pci_hpc.c
+++ b/drivers/pci/hotplug/s390_pci_hpc.c
@@ -89,9 +89,11 @@ static int disable_slot(struct hotplug_slot *hotplug_slot)
return -EIO;

pdev = pci_get_slot(zdev->zbus->bus, zdev->devfn);
-   if (pdev && pci_num_vf(pdev)) {
+   if (pdev) {
+   rc = pci_num_vf(pdev);
pci_dev_put(pdev);
-   return -EBUSY;
+   if (rc)
+   return -EBUSY;
}

zpci_remove_device(zdev);
-- 
2.19.1




[PATCH] pci: fix memory leak when virtio pci hotplug

2021-03-21 Thread Zhiqiang Liu
From: Feilong Lin 

Repeated hot-plugging of pci devices for a virtual
machine driven by virtio, we found that there is a
leak in kmalloc-4k, which was confirmed as the memory
of the pci_device structure. Then we found out that
it was missing pci_dev_put() after pci_get_slot() in
enable_slot() of acpiphp_glue.c.

Signed-off-by: Feilong Lin 
Reviewed-by: Zhiqiang Liu 
---
 drivers/pci/hotplug/acpiphp_glue.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/pci/hotplug/acpiphp_glue.c 
b/drivers/pci/hotplug/acpiphp_glue.c
index 3365c93abf0e..f031302ad401 100644
--- a/drivers/pci/hotplug/acpiphp_glue.c
+++ b/drivers/pci/hotplug/acpiphp_glue.c
@@ -533,6 +533,7 @@ static void enable_slot(struct acpiphp_slot *slot, bool 
bridge)
slot->flags &= ~SLOT_ENABLED;
continue;
}
+   pci_dev_put(dev);
}
 }

-- 
2.19.1




[PATCH] md/dm-mpath: check whether all pgpaths have same uuid in multipath_ctr()

2021-03-20 Thread Zhiqiang Liu
From: Zhiqiang Liu 

When we make IO stress test on multipath device, there will
be a metadata err because of wrong path. In the test, we
concurrent execute 'iscsi device login|logout' and
'multipath -r' command with IO stress on multipath device.
In some case, systemd-udevd may have not time to process
uevents of iscsi device logout|login, and then 'multipath -r'
command triggers multipathd daemon calls ioctl to load table
with incorrect old device info from systemd-udevd.
Then, one iscsi path may be incorrectly attached to another
multipath which has different uuid. Finally, the metadata err
occurs when umounting filesystem to down write metadata on
the iscsi device which is actually not owned by the multipath
device.

So we need to check whether all pgpaths of one multipath have
the same uuid, if not, we should throw a error.

Signed-off-by: Zhiqiang Liu 
Signed-off-by: lixiaokeng 
Signed-off-by: linfeilong 
Signed-off-by: Wubo 
---
 drivers/md/dm-mpath.c   | 52 +
 drivers/scsi/scsi_lib.c |  1 +
 2 files changed, 53 insertions(+)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index bced42f082b0..f0b995784b53 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 

@@ -1169,6 +1170,45 @@ static int parse_features(struct dm_arg_set *as, struct 
multipath *m)
return r;
 }

+#define SCSI_VPD_LUN_ID_PREFIX_LEN 4
+#define MPATH_UUID_PREFIX_LEN 7
+static int check_pg_uuid(struct priority_group *pg, char *md_uuid)
+{
+   char pgpath_uuid[DM_UUID_LEN] = {0};
+   struct request_queue *q;
+   struct pgpath *pgpath;
+   struct scsi_device *sdev;
+   ssize_t count;
+   int r = 0;
+
+   list_for_each_entry(pgpath, >pgpaths, list) {
+   q = bdev_get_queue(pgpath->path.dev->bdev);
+   sdev = scsi_device_from_queue(q);
+   if (!sdev) {
+   r = -EINVAL;
+   goto out;
+   }
+
+   count = scsi_vpd_lun_id(sdev, pgpath_uuid, DM_UUID_LEN);
+   if (count <= SCSI_VPD_LUN_ID_PREFIX_LEN) {
+   r = -EINVAL;
+   put_device(>sdev_gendev);
+   goto out;
+   }
+
+   if (strcmp(md_uuid + MPATH_UUID_PREFIX_LEN,
+  pgpath_uuid + SCSI_VPD_LUN_ID_PREFIX_LEN)) {
+   r = -EINVAL;
+   put_device(>sdev_gendev);
+   goto out;
+   }
+   put_device(>sdev_gendev);
+   }
+
+out:
+   return r;
+}
+
 static int multipath_ctr(struct dm_target *ti, unsigned argc, char **argv)
 {
/* target arguments */
@@ -1183,6 +1223,7 @@ static int multipath_ctr(struct dm_target *ti, unsigned 
argc, char **argv)
unsigned pg_count = 0;
unsigned next_pg_num;
unsigned long flags;
+   char md_uuid[DM_UUID_LEN] = {0};

as.argc = argc;
as.argv = argv;
@@ -1220,6 +1261,11 @@ static int multipath_ctr(struct dm_target *ti, unsigned 
argc, char **argv)
goto bad;
}

+   if (dm_copy_name_and_uuid(dm_table_get_md(ti->table), NULL, md_uuid)) {
+   r = -ENXIO;
+   goto bad;
+   }
+
/* parse the priority groups */
while (as.argc) {
struct priority_group *pg;
@@ -1231,6 +1277,12 @@ static int multipath_ctr(struct dm_target *ti, unsigned 
argc, char **argv)
goto bad;
}

+   if (check_pg_uuid(pg, md_uuid)) {
+   ti->error = "uuid of pgpaths mismatch";
+   r = -EINVAL;
+   goto bad;
+   }
+
nr_valid_paths += pg->nr_pgpaths;
atomic_set(>nr_valid_paths, nr_valid_paths);

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 7d52a11e1b61..fee82262a227 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1953,6 +1953,7 @@ struct scsi_device *scsi_device_from_queue(struct 
request_queue *q)

return sdev;
 }
+EXPORT_SYMBOL(scsi_device_from_queue);

 /**
  * scsi_block_requests - Utility function used by low-level drivers to prevent
-- 
2.19.1




[PATCH] bcache: reduce redundant code in bch_cached_dev_run()

2021-03-05 Thread Zhiqiang Liu


In bch_cached_dev_run(), free(env[1])|free(env[2])|free(buf)
show up three times. This patch introduce out tag in
which free(env[1])|free(env[2])|free(buf) are only called
one time. If we need to call free() when errors occur,
we can set error code to ret, and then goto out tag directly.

Signed-off-by: Zhiqiang Liu 
---
 drivers/md/bcache/super.c | 25 -
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 71691f32959b..1a8748050087 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1052,6 +1052,7 @@ static int cached_dev_status_update(void *arg)

 int bch_cached_dev_run(struct cached_dev *dc)
 {
+   int ret = 0;
struct bcache_device *d = >disk;
char *buf = kmemdup_nul(dc->sb.label, SB_LABEL_SIZE, GFP_KERNEL);
char *env[] = {
@@ -1064,19 +1065,15 @@ int bch_cached_dev_run(struct cached_dev *dc)
if (dc->io_disable) {
pr_err("I/O disabled on cached dev %s\n",
   dc->backing_dev_name);
-   kfree(env[1]);
-   kfree(env[2]);
-   kfree(buf);
-   return -EIO;
+   ret = -EIO;
+   goto out;
}

if (atomic_xchg(>running, 1)) {
-   kfree(env[1]);
-   kfree(env[2]);
-   kfree(buf);
pr_info("cached dev %s is running already\n",
   dc->backing_dev_name);
-   return -EBUSY;
+   ret = -EBUSY;
+   goto out;
}

if (!d->c &&
@@ -1097,15 +1094,13 @@ int bch_cached_dev_run(struct cached_dev *dc)
 * only class / kset properties are persistent
 */
kobject_uevent_env(_to_dev(d->disk)->kobj, KOBJ_CHANGE, env);
-   kfree(env[1]);
-   kfree(env[2]);
-   kfree(buf);

if (sysfs_create_link(>kobj, _to_dev(d->disk)->kobj, "dev") ||
sysfs_create_link(_to_dev(d->disk)->kobj,
  >kobj, "bcache")) {
pr_err("Couldn't create bcache dev <-> disk sysfs symlinks\n");
-   return -ENOMEM;
+   ret = -ENOMEM;
+   goto out;
}

dc->status_update_thread = kthread_run(cached_dev_status_update,
@@ -1114,7 +1109,11 @@ int bch_cached_dev_run(struct cached_dev *dc)
pr_warn("failed to create bcache_status_update kthread, 
continue to run without monitoring backing device status\n");
}

-   return 0;
+out:
+   kfree(env[1]);
+   kfree(env[2]);
+   kfree(buf);
+   return ret;
 }

 /*
-- 
2.19.1




Re: [PATCH] fuse: check whether fuse_request_alloc returns NULL in fuse_simple_request

2020-11-04 Thread Zhiqiang Liu
Sorry for ignoring __GFP_NOFAIL flag.
Please ignore this patch.


On 2020/11/5 9:33, Zhiqiang Liu wrote:
> ping ...
> 
> On 2020/10/22 21:13, Zhiqiang Liu wrote:
>>
>> In fuse_simple_request func, we will call fuse_request_alloc func to alloc
>> one request from fuse_req_cachep when args->force is true. However, the
>> return value of fuse_request_alloc func is not checked whether it is NULL.
>> If allocating request fails, access-NULL-pointer problem will occur.
>>
>> Here, we check the return value of fuse_request_alloc func.
>>
>> Fixes: 7213394c4e18 ("fuse: simplify request allocation")
>> Signed-off-by: Zhiqiang Liu 
>> Signed-off-by: Haotian Li 
>> ---
>>  fs/fuse/dev.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index 02b3c36b3676..f7dd33ae8e31 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -481,6 +481,8 @@ ssize_t fuse_simple_request(struct fuse_conn *fc, struct 
>> fuse_args *args)
>>  if (args->force) {
>>  atomic_inc(>num_waiting);
>>  req = fuse_request_alloc(GFP_KERNEL | __GFP_NOFAIL);
>> +if (!req)
>> +return -ENOMEM;
>>
>>  if (!args->nocreds)
>>  fuse_force_creds(fc, req);
>>
> 
> 
> .
> 



Re: [PATCH] fuse: check whether fuse_request_alloc returns NULL in fuse_simple_request

2020-11-04 Thread Zhiqiang Liu
ping ...

On 2020/10/22 21:13, Zhiqiang Liu wrote:
> 
> In fuse_simple_request func, we will call fuse_request_alloc func to alloc
> one request from fuse_req_cachep when args->force is true. However, the
> return value of fuse_request_alloc func is not checked whether it is NULL.
> If allocating request fails, access-NULL-pointer problem will occur.
> 
> Here, we check the return value of fuse_request_alloc func.
> 
> Fixes: 7213394c4e18 ("fuse: simplify request allocation")
> Signed-off-by: Zhiqiang Liu 
> Signed-off-by: Haotian Li 
> ---
>  fs/fuse/dev.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 02b3c36b3676..f7dd33ae8e31 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -481,6 +481,8 @@ ssize_t fuse_simple_request(struct fuse_conn *fc, struct 
> fuse_args *args)
>   if (args->force) {
>   atomic_inc(>num_waiting);
>   req = fuse_request_alloc(GFP_KERNEL | __GFP_NOFAIL);
> + if (!req)
> + return -ENOMEM;
> 
>   if (!args->nocreds)
>   fuse_force_creds(fc, req);
> 



Re: [PATCH] fuse: fix potential accessing NULL pointer problem in fuse_send_init()

2020-10-29 Thread Zhiqiang Liu



On 2020/10/29 23:25, Miklos Szeredi wrote:
> On Thu, Oct 22, 2020 at 4:52 PM Zhiqiang Liu  wrote:
>>
>>
>> In fuse_send_init func, ia is allocated by calling kzalloc func, and
>> we donot check whether ia is NULL before using it. Thus, if allocating
>> ia fails, accessing NULL pointer problem will occur.
> 
> Note the __GFP_NOFAIL flag for kzalloc(), which ensures that it will not fail.

Thanks for your reply.
Please ignore this patch.



Re: [PATCH] pipe: fix potential inode leak in create_pipe_files()

2020-10-28 Thread Zhiqiang Liu



On 2020/10/28 11:54, Al Viro wrote:
> On Wed, Oct 28, 2020 at 11:03:52AM +0800, Zhiqiang Liu wrote:
>>
>> In create_pipe_files(), if alloc_file_clone() fails, we will call
>> put_pipe_info to release pipe, and call fput() to release f.
>> However, we donot call iput() to free inode.
> 
> Huh?  Have you actually tried to trigger that failure exit?
>
>> Signed-off-by: Zhiqiang Liu 
>> Signed-off-by: Feilong Lin 
>> ---
>>  fs/pipe.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/fs/pipe.c b/fs/pipe.c
>> index 0ac197658a2d..8856607fde65 100644
>> --- a/fs/pipe.c
>> +++ b/fs/pipe.c
>> @@ -924,6 +924,7 @@ int create_pipe_files(struct file **res, int flags)
>>  if (IS_ERR(res[0])) {
>>  put_pipe_info(inode, inode->i_pipe);
>>  fput(f);
>> +iput(inode);
>>  return PTR_ERR(res[0]);
> 
> No.  That inode is created with refcount 1.  If alloc_file_pseudo()
> succeeds, the reference we'd been holding has been transferred into
> dentry allocated by alloc_file_pseudo() (and attached to f).
>>From that point on we do *NOT* own a reference to inode and no
> subsequent failure exits have any business releasing it.
> 
> In particular, alloc_file_clone() DOES NOT create extra references
> to inode, whether it succeeds or fails.  Dropping the reference
> to f will take care of everything.
> 
> If you tried to trigger that failure exit with your patch applied,
> you would've seen double iput(), as soon as you return from sys_pipe()
> to userland and task_work is processed (which is where the real
> destructor of struct file will happen).
> 
> NAK.
> 

Thanks for your patient response. Learned a lot from your reply.
Please ignore the patch.

> .
> 



[PATCH] pipe: fix potential inode leak in create_pipe_files()

2020-10-28 Thread Zhiqiang Liu


In create_pipe_files(), if alloc_file_clone() fails, we will call
put_pipe_info to release pipe, and call fput() to release f.
However, we donot call iput() to free inode.

Signed-off-by: Zhiqiang Liu 
Signed-off-by: Feilong Lin 
---
 fs/pipe.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/pipe.c b/fs/pipe.c
index 0ac197658a2d..8856607fde65 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -924,6 +924,7 @@ int create_pipe_files(struct file **res, int flags)
if (IS_ERR(res[0])) {
put_pipe_info(inode, inode->i_pipe);
fput(f);
+   iput(inode);
return PTR_ERR(res[0]);
}
res[0]->private_data = inode->i_pipe;
-- 
2.19.1




Re: [PATCH] fuse: check whether fuse_request_alloc returns NULL in fuse_simple_request

2020-10-27 Thread Zhiqiang Liu
friendly ping...

On 2020/10/22 21:13, Zhiqiang Liu wrote:
> 
> In fuse_simple_request func, we will call fuse_request_alloc func to alloc
> one request from fuse_req_cachep when args->force is true. However, the
> return value of fuse_request_alloc func is not checked whether it is NULL.
> If allocating request fails, access-NULL-pointer problem will occur.
> 
> Here, we check the return value of fuse_request_alloc func.
> 
> Fixes: 7213394c4e18 ("fuse: simplify request allocation")
> Signed-off-by: Zhiqiang Liu 
> Signed-off-by: Haotian Li 
> ---
>  fs/fuse/dev.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 02b3c36b3676..f7dd33ae8e31 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -481,6 +481,8 @@ ssize_t fuse_simple_request(struct fuse_conn *fc, struct 
> fuse_args *args)
>   if (args->force) {
>   atomic_inc(>num_waiting);
>   req = fuse_request_alloc(GFP_KERNEL | __GFP_NOFAIL);
> + if (!req)
> + return -ENOMEM;
> 
>   if (!args->nocreds)
>   fuse_force_creds(fc, req);
> 



Re: [PATCH] fuse: fix potential accessing NULL pointer problem in fuse_send_init()

2020-10-27 Thread Zhiqiang Liu
friendly ping...

On 2020/10/22 22:51, Zhiqiang Liu wrote:
> 
> In fuse_send_init func, ia is allocated by calling kzalloc func, and
> we donot check whether ia is NULL before using it. Thus, if allocating
> ia fails, accessing NULL pointer problem will occur.
> 
> Here, we will call process_init_reply func if ia is NULL.
> 
> Fixes: 615047eff108 ("fuse: convert init to simple api")
> Signed-off-by: Zhiqiang Liu 
> Signed-off-by: Haotian Li 
> ---
>  fs/fuse/inode.c | 161 ++--
>  1 file changed, 87 insertions(+), 74 deletions(-)
> 
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 581329203d68..bb526d8cf5b0 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -898,88 +898,97 @@ struct fuse_init_args {
>  static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
>  int error)
>  {
> - struct fuse_init_args *ia = container_of(args, typeof(*ia), args);
> - struct fuse_init_out *arg = >out;
> + struct fuse_init_args *ia;
> + struct fuse_init_out *arg;
> + unsigned long ra_pages;
> 
> - if (error || arg->major != FUSE_KERNEL_VERSION)
> + if (!args) {
>   fc->conn_error = 1;
> - else {
> - unsigned long ra_pages;
> + goto out;
> + }
> 
> - process_init_limits(fc, arg);
> + ia = container_of(args, typeof(*ia), args);
> + arg = >out;
> + if (error || arg->major != FUSE_KERNEL_VERSION) {
> + fc->conn_error = 1;
> + goto out_free_ia;
> + }
> 
> - if (arg->minor >= 6) {
> - ra_pages = arg->max_readahead / PAGE_SIZE;
> - if (arg->flags & FUSE_ASYNC_READ)
> - fc->async_read = 1;
> - if (!(arg->flags & FUSE_POSIX_LOCKS))
> - fc->no_lock = 1;
> - if (arg->minor >= 17) {
> - if (!(arg->flags & FUSE_FLOCK_LOCKS))
> - fc->no_flock = 1;
> - } else {
> - if (!(arg->flags & FUSE_POSIX_LOCKS))
> - fc->no_flock = 1;
> - }
> - if (arg->flags & FUSE_ATOMIC_O_TRUNC)
> - fc->atomic_o_trunc = 1;
> - if (arg->minor >= 9) {
> - /* LOOKUP has dependency on proto version */
> - if (arg->flags & FUSE_EXPORT_SUPPORT)
> - fc->export_support = 1;
> - }
> - if (arg->flags & FUSE_BIG_WRITES)
> - fc->big_writes = 1;
> - if (arg->flags & FUSE_DONT_MASK)
> - fc->dont_mask = 1;
> - if (arg->flags & FUSE_AUTO_INVAL_DATA)
> - fc->auto_inval_data = 1;
> - else if (arg->flags & FUSE_EXPLICIT_INVAL_DATA)
> - fc->explicit_inval_data = 1;
> - if (arg->flags & FUSE_DO_READDIRPLUS) {
> - fc->do_readdirplus = 1;
> - if (arg->flags & FUSE_READDIRPLUS_AUTO)
> - fc->readdirplus_auto = 1;
> - }
> - if (arg->flags & FUSE_ASYNC_DIO)
> - fc->async_dio = 1;
> - if (arg->flags & FUSE_WRITEBACK_CACHE)
> - fc->writeback_cache = 1;
> - if (arg->flags & FUSE_PARALLEL_DIROPS)
> - fc->parallel_dirops = 1;
> - if (arg->flags & FUSE_HANDLE_KILLPRIV)
> - fc->handle_killpriv = 1;
> - if (arg->time_gran && arg->time_gran <= 10)
> - fc->sb->s_time_gran = arg->time_gran;
> - if ((arg->flags & FUSE_POSIX_ACL)) {
> - fc->default_permissions = 1;
> - fc->posix_acl = 1;
> - fc->sb->s_xattr = fuse_acl_xattr_handlers;
> - }
> - if (arg->flags & FUSE_CACHE_SYMLINKS)
> - fc->cache_symlinks = 1;
> -   

[PATCH] fuse: fix potential accessing NULL pointer problem in fuse_send_init()

2020-10-22 Thread Zhiqiang Liu


In fuse_send_init func, ia is allocated by calling kzalloc func, and
we donot check whether ia is NULL before using it. Thus, if allocating
ia fails, accessing NULL pointer problem will occur.

Here, we will call process_init_reply func if ia is NULL.

Fixes: 615047eff108 ("fuse: convert init to simple api")
Signed-off-by: Zhiqiang Liu 
Signed-off-by: Haotian Li 
---
 fs/fuse/inode.c | 161 ++--
 1 file changed, 87 insertions(+), 74 deletions(-)

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 581329203d68..bb526d8cf5b0 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -898,88 +898,97 @@ struct fuse_init_args {
 static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
   int error)
 {
-   struct fuse_init_args *ia = container_of(args, typeof(*ia), args);
-   struct fuse_init_out *arg = >out;
+   struct fuse_init_args *ia;
+   struct fuse_init_out *arg;
+   unsigned long ra_pages;

-   if (error || arg->major != FUSE_KERNEL_VERSION)
+   if (!args) {
fc->conn_error = 1;
-   else {
-   unsigned long ra_pages;
+   goto out;
+   }

-   process_init_limits(fc, arg);
+   ia = container_of(args, typeof(*ia), args);
+   arg = >out;
+   if (error || arg->major != FUSE_KERNEL_VERSION) {
+   fc->conn_error = 1;
+   goto out_free_ia;
+   }

-   if (arg->minor >= 6) {
-   ra_pages = arg->max_readahead / PAGE_SIZE;
-   if (arg->flags & FUSE_ASYNC_READ)
-   fc->async_read = 1;
-   if (!(arg->flags & FUSE_POSIX_LOCKS))
-   fc->no_lock = 1;
-   if (arg->minor >= 17) {
-   if (!(arg->flags & FUSE_FLOCK_LOCKS))
-   fc->no_flock = 1;
-   } else {
-   if (!(arg->flags & FUSE_POSIX_LOCKS))
-   fc->no_flock = 1;
-   }
-   if (arg->flags & FUSE_ATOMIC_O_TRUNC)
-   fc->atomic_o_trunc = 1;
-   if (arg->minor >= 9) {
-   /* LOOKUP has dependency on proto version */
-   if (arg->flags & FUSE_EXPORT_SUPPORT)
-   fc->export_support = 1;
-   }
-   if (arg->flags & FUSE_BIG_WRITES)
-   fc->big_writes = 1;
-   if (arg->flags & FUSE_DONT_MASK)
-   fc->dont_mask = 1;
-   if (arg->flags & FUSE_AUTO_INVAL_DATA)
-   fc->auto_inval_data = 1;
-   else if (arg->flags & FUSE_EXPLICIT_INVAL_DATA)
-   fc->explicit_inval_data = 1;
-   if (arg->flags & FUSE_DO_READDIRPLUS) {
-   fc->do_readdirplus = 1;
-   if (arg->flags & FUSE_READDIRPLUS_AUTO)
-   fc->readdirplus_auto = 1;
-   }
-   if (arg->flags & FUSE_ASYNC_DIO)
-   fc->async_dio = 1;
-   if (arg->flags & FUSE_WRITEBACK_CACHE)
-   fc->writeback_cache = 1;
-   if (arg->flags & FUSE_PARALLEL_DIROPS)
-   fc->parallel_dirops = 1;
-   if (arg->flags & FUSE_HANDLE_KILLPRIV)
-   fc->handle_killpriv = 1;
-   if (arg->time_gran && arg->time_gran <= 10)
-   fc->sb->s_time_gran = arg->time_gran;
-   if ((arg->flags & FUSE_POSIX_ACL)) {
-   fc->default_permissions = 1;
-   fc->posix_acl = 1;
-   fc->sb->s_xattr = fuse_acl_xattr_handlers;
-   }
-   if (arg->flags & FUSE_CACHE_SYMLINKS)
-   fc->cache_symlinks = 1;
-   if (arg->flags & FUSE_ABORT_ERROR)
-   fc->abort_err = 1;
-   if (arg->flags & FUSE_MAX_PAGES) {
-   fc->max_pages =
-   min_t(unsigned int, FUSE_MAX_MAX_PAGES,
-   max_t(unsigned int, arg->m

[PATCH] fuse: check whether fuse_request_alloc returns NULL in fuse_simple_request

2020-10-22 Thread Zhiqiang Liu


In fuse_simple_request func, we will call fuse_request_alloc func to alloc
one request from fuse_req_cachep when args->force is true. However, the
return value of fuse_request_alloc func is not checked whether it is NULL.
If allocating request fails, access-NULL-pointer problem will occur.

Here, we check the return value of fuse_request_alloc func.

Fixes: 7213394c4e18 ("fuse: simplify request allocation")
Signed-off-by: Zhiqiang Liu 
Signed-off-by: Haotian Li 
---
 fs/fuse/dev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 02b3c36b3676..f7dd33ae8e31 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -481,6 +481,8 @@ ssize_t fuse_simple_request(struct fuse_conn *fc, struct 
fuse_args *args)
if (args->force) {
atomic_inc(>num_waiting);
req = fuse_request_alloc(GFP_KERNEL | __GFP_NOFAIL);
+   if (!req)
+   return -ENOMEM;

if (!args->nocreds)
fuse_force_creds(fc, req);
-- 
2.19.1



[PATCH V3] bcache: fix potential deadlock problem in btree_gc_coalesce

2020-06-11 Thread Zhiqiang Liu


coccicheck reports:
  drivers/md//bcache/btree.c:1538:1-7: preceding lock on line 1417

In btree_gc_coalesce func, if the coalescing process fails, we will goto
to out_nocoalesce tag directly without releasing new_nodes[i]->write_lock.
Then, it will cause a deadlock when trying to acquire new_nodes[i]->write_lock
for freeing new_nodes[i] before return.

btree_gc_coalesce func details as follows:
if alloc new_nodes[i] fails:
goto out_nocoalesce;
mutex_lock(_nodes[i]->write_lock) // obtain new_nodes[i]->write_lock
for (i = nodes - 1; i > 0; --i)   // main coalescing process
……
if coalescing process fails:
goto out_nocoalesce;  // Here, directly goto 
out_nocoalesce
  // tag will cause a deadlock
……
mutex_unlock(_nodes[i]->write_lock) // release 
new_nodes[i]->write_lock
return; // coalesing succ, return
out_nocoalesce:
btree_node_free(new_nodes[i])   // free new_nodes[i]
mutex_lock(_nodes[i]->write_lock);  // obtain 
new_nodes[i]->write_lock
clear_bit(BTREE_NODE_dirty, _nodes[i]->flags); // 
set flag for reuse
mutex_unlock(_nodes[i]->write_lock); // release 
new_nodes[i]->write_lock

To fix the problem, we add a new tag 'out_unlock_nocoalesce' for releasing
new_nodes[i]->write_lock before out_nocoalesce tag. If coalescing process fails,
we will go to out_unlock_nocoalesce tag for releasing new_nodes[i]->write_lock
before free new_nodes[i] in out_nocoalesce tag.

Fixes: 2a285686c109816 ("bcache: btree locking rework")
Signed-off-by: Zhiqiang Liu 
---
Changelog:
V3: improve commit log again (suggested by Markus and Coly)
V2: improve commit log
V1: initial version

 drivers/md/bcache/btree.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 72856e5f23a3..fd1f288fd801 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1389,7 +1389,7 @@ static int btree_gc_coalesce(struct btree *b, struct 
btree_op *op,
if (__set_blocks(n1, n1->keys + n2->keys,
 block_bytes(b->c)) >
btree_blocks(new_nodes[i]))
-   goto out_nocoalesce;
+   goto out_unlock_nocoalesce;

keys = n2->keys;
/* Take the key of the node we're getting rid of */
@@ -1418,7 +1418,7 @@ static int btree_gc_coalesce(struct btree *b, struct 
btree_op *op,

if (__bch_keylist_realloc(,
  bkey_u64s(_nodes[i]->key)))
-   goto out_nocoalesce;
+   goto out_unlock_nocoalesce;

bch_btree_node_write(new_nodes[i], );
bch_keylist_add(, _nodes[i]->key);
@@ -1464,6 +1464,10 @@ static int btree_gc_coalesce(struct btree *b, struct 
btree_op *op,
/* Invalidated our iterator */
return -EINTR;

+out_unlock_nocoalesce:
+   for (i = 0; i < nodes; i++)
+   mutex_unlock(_nodes[i]->write_lock);
+
 out_nocoalesce:
closure_sync();

-- 
2.19.1




Re: [PATCH V2] bcache: fix potential deadlock problem in btree_gc_coalesce

2020-06-11 Thread Zhiqiang Liu
Sorry for late reply. I will improve the commit log as your suggestion in v3 
patch.
Thanks for your suggestion.

On 2020/4/27 2:16, Markus Elfring wrote:
>> --
>> V1->V2: rewrite commit log (suggested by Coly Li) and rename the patch
> 
> * The patch version description should be placed behind triple dashes.
> 
> * I would find a shorter version identification (without the arrow)
>   also sufficient.
> 
> 
>> Fixes: 2a285686c1 ("bcache: btree locking rework")>
> 
> Will a longer commit identifier be safer for the final change description?
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?id=b2768df24ec400dd4f7fa79542f797e904812053#n183
> 
> Regards,
> Markus
> 



Re: [PATCH v3] module: add link_flag pram in ref_module func to decide whether add usage link

2019-09-10 Thread Zhiqiang Liu
Friendly Ping...

On 2019/7/20 22:40, Zhiqiang Liu wrote:
> Users can call ref_module func in their modules to construct
> relationships with other modules. However, the holders
> '/sys/module//holders' of the target module donot include
> the users` module. So lsmod command misses detailed info of 'Used by'.
> 
> When load module, the process is given as follows,
> load_module()
>   -> mod_sysfs_setup()
>   -> add_usage_links
>   -> do_init_module
>   -> mod->init()
> 
> add_usage_links func creates holders of target modules linking to
> this module. If ref_module is called in mod->init() func, the usage
> links cannot be added.
> 
> Consider that add_module_usage and add usage_link may separate, the
> link_flag pram is added in ref_module func to decide whether add usage
> link after add_module_usage. If link_flag is true, it means usage link
> of a to b's holder_dir should be created immediately after add_module_usage.
> 
> V2->V3:
> - add link_flag pram in ref_module func to decide whether add usage link
> 
> V1->V2:
> - remove incorrect Fixes tag
> - fix error handling of sysfs_create_link as suggested by Jessica Yu
> 
> Signed-off-by: Zhiqiang Liu 
> Suggested-by: Jessica Yu 
> Reviewed-by: Kang Zhou 
> ---
>  include/linux/module.h |  2 +-
>  kernel/module.c| 27 ---
>  2 files changed, 21 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/module.h b/include/linux/module.h
> index 188998d3dca9..9ec04b9e93e8 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -632,7 +632,7 @@ static inline void __module_get(struct module *module)
>  #define symbol_put_addr(p) do { } while (0)
> 
>  #endif /* CONFIG_MODULE_UNLOAD */
> -int ref_module(struct module *a, struct module *b);
> +int ref_module(struct module *a, struct module *b, bool link_flag);
> 
>  /* This is a #define so the string doesn't get put in every .o file */
>  #define module_name(mod) \
> diff --git a/kernel/module.c b/kernel/module.c
> index 80c7c09584cf..00e4862a8ef7 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -837,25 +837,26 @@ static int already_uses(struct module *a, struct module 
> *b)
>   *'b' can walk the list to see who sourced them), and of 'a'
>   *targets (so 'a' can see what modules it targets).
>   */
> -static int add_module_usage(struct module *a, struct module *b)
> +static struct module_use *add_module_usage(struct module *a, struct module 
> *b)
>  {
>   struct module_use *use;
> 
>   pr_debug("Allocating new usage for %s.\n", a->name);
>   use = kmalloc(sizeof(*use), GFP_ATOMIC);
>   if (!use)
> - return -ENOMEM;
> + return NULL;
> 
>   use->source = a;
>   use->target = b;
>   list_add(>source_list, >source_list);
>   list_add(>target_list, >target_list);
> - return 0;
> + return use;
>  }
> 
>  /* Module a uses b: caller needs module_mutex() */
> -int ref_module(struct module *a, struct module *b)
> +int ref_module(struct module *a, struct module *b, bool link_flag)
>  {
> + struct module_use *use;
>   int err;
> 
>   if (b == NULL || already_uses(a, b))
> @@ -866,9 +867,21 @@ int ref_module(struct module *a, struct module *b)
>   if (err)
>   return err;
> 
> - err = add_module_usage(a, b);
> + use = add_module_usage(a, b);
> + if (!use) {
> + module_put(b);
> + return -ENOMEM;
> + }
> +
> + if (!link_flag)
> + return 0;
> +
> + err = sysfs_create_link(b->holders_dir, >mkobj.kobj, a->name);
>   if (err) {
>   module_put(b);
> + list_del(>source_list);
> + list_del(>target_list);
> + kfree(use);
>   return err;
>   }
>   return 0;
> @@ -1152,7 +1165,7 @@ static inline void module_unload_free(struct module 
> *mod)
>  {
>  }
> 
> -int ref_module(struct module *a, struct module *b)
> +int ref_module(struct module *a, struct module *b, bool link_flag)
>  {
>   return strong_try_module_get(b);
>  }
> @@ -1407,7 +1420,7 @@ static const struct kernel_symbol 
> *resolve_symbol(struct module *mod,
>   goto getname;
>   }
> 
> - err = ref_module(mod, owner);
> + err = ref_module(mod, owner, false);
>   if (err) {
>   sym = ERR_PTR(err);
>   goto getname;
> 



[PATCH v3] module: add link_flag pram in ref_module func to decide whether add usage link

2019-07-20 Thread Zhiqiang Liu
Users can call ref_module func in their modules to construct
relationships with other modules. However, the holders
'/sys/module//holders' of the target module donot include
the users` module. So lsmod command misses detailed info of 'Used by'.

When load module, the process is given as follows,
load_module()
-> mod_sysfs_setup()
-> add_usage_links
-> do_init_module
-> mod->init()

add_usage_links func creates holders of target modules linking to
this module. If ref_module is called in mod->init() func, the usage
links cannot be added.

Consider that add_module_usage and add usage_link may separate, the
link_flag pram is added in ref_module func to decide whether add usage
link after add_module_usage. If link_flag is true, it means usage link
of a to b's holder_dir should be created immediately after add_module_usage.

V2->V3:
- add link_flag pram in ref_module func to decide whether add usage link

V1->V2:
- remove incorrect Fixes tag
- fix error handling of sysfs_create_link as suggested by Jessica Yu

Signed-off-by: Zhiqiang Liu 
Suggested-by: Jessica Yu 
Reviewed-by: Kang Zhou 
---
 include/linux/module.h |  2 +-
 kernel/module.c| 27 ---
 2 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/include/linux/module.h b/include/linux/module.h
index 188998d3dca9..9ec04b9e93e8 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -632,7 +632,7 @@ static inline void __module_get(struct module *module)
 #define symbol_put_addr(p) do { } while (0)

 #endif /* CONFIG_MODULE_UNLOAD */
-int ref_module(struct module *a, struct module *b);
+int ref_module(struct module *a, struct module *b, bool link_flag);

 /* This is a #define so the string doesn't get put in every .o file */
 #define module_name(mod)   \
diff --git a/kernel/module.c b/kernel/module.c
index 80c7c09584cf..00e4862a8ef7 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -837,25 +837,26 @@ static int already_uses(struct module *a, struct module 
*b)
  *'b' can walk the list to see who sourced them), and of 'a'
  *targets (so 'a' can see what modules it targets).
  */
-static int add_module_usage(struct module *a, struct module *b)
+static struct module_use *add_module_usage(struct module *a, struct module *b)
 {
struct module_use *use;

pr_debug("Allocating new usage for %s.\n", a->name);
use = kmalloc(sizeof(*use), GFP_ATOMIC);
if (!use)
-   return -ENOMEM;
+   return NULL;

use->source = a;
use->target = b;
list_add(>source_list, >source_list);
list_add(>target_list, >target_list);
-   return 0;
+   return use;
 }

 /* Module a uses b: caller needs module_mutex() */
-int ref_module(struct module *a, struct module *b)
+int ref_module(struct module *a, struct module *b, bool link_flag)
 {
+   struct module_use *use;
int err;

if (b == NULL || already_uses(a, b))
@@ -866,9 +867,21 @@ int ref_module(struct module *a, struct module *b)
if (err)
return err;

-   err = add_module_usage(a, b);
+   use = add_module_usage(a, b);
+   if (!use) {
+   module_put(b);
+   return -ENOMEM;
+   }
+
+   if (!link_flag)
+   return 0;
+
+   err = sysfs_create_link(b->holders_dir, >mkobj.kobj, a->name);
if (err) {
module_put(b);
+   list_del(>source_list);
+   list_del(>target_list);
+   kfree(use);
return err;
}
return 0;
@@ -1152,7 +1165,7 @@ static inline void module_unload_free(struct module *mod)
 {
 }

-int ref_module(struct module *a, struct module *b)
+int ref_module(struct module *a, struct module *b, bool link_flag)
 {
return strong_try_module_get(b);
 }
@@ -1407,7 +1420,7 @@ static const struct kernel_symbol *resolve_symbol(struct 
module *mod,
goto getname;
}

-   err = ref_module(mod, owner);
+   err = ref_module(mod, owner, false);
if (err) {
sym = ERR_PTR(err);
goto getname;
-- 
2.19.1



Re: [PATCH v2] module: add usage links when calling ref_module func

2019-07-11 Thread Zhiqiang Liu



On 2019/7/10 0:10, Jessica Yu wrote:
> +++ Zhiqiang Liu [03/07/19 10:09 +0800]:
>> From: Zhiqiang Liu >
>> V1->V2:
>> - remove incorrect Fixes tag
>> - fix error handling of sysfs_create_link as suggested by Jessica Yu
>>
>> Signed-off-by: Zhiqiang Liu 
>> Suggested-by: Jessica Yu 
>> Reviewed-by: Kang Zhou 
>> ---
>> kernel/module.c | 18 ++
>> 1 file changed, 14 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/module.c b/kernel/module.c
>> index 80c7c09584cf..672abcec 100644
>> --- a/kernel/module.c
>> +++ b/kernel/module.c
>> @@ -837,25 +837,26 @@ static int already_uses(struct module *a, struct 
>> module *b)
>> ?0?2*?0?2?0?2?0?2 'b' can walk the list to see who sourced them), and of 'a'
>> ?0?2*?0?2?0?2?0?2 targets (so 'a' can see what modules it targets).
>> ?0?2*/
>> /* Module a uses b: caller needs module_mutex() */
>> int ref_module(struct module *a, struct module *b)
>> {
>> +?0?2?0?2?0?2 struct module_use *use;
>> ?0?2?0?2?0?2?0?2int err;
>>
>> ?0?2?0?2?0?2?0?2if (b == NULL || already_uses(a, b))
>> @@ -866,9 +867,18 @@ int ref_module(struct module *a, struct module *b)
>> ?0?2?0?2?0?2?0?2if (err)
>> ?0?2?0?2?0?2?0?2?0?2?0?2?0?2 return err;
>>
>> -?0?2?0?2?0?2 err = add_module_usage(a, b);
>> +?0?2?0?2?0?2 use = add_module_usage(a, b);
>> +?0?2?0?2?0?2 if (!use) {
>> +?0?2?0?2?0?2?0?2?0?2?0?2?0?2 module_put(b);
>> +?0?2?0?2?0?2?0?2?0?2?0?2?0?2 return -ENOMEM;
>> +?0?2?0?2?0?2 }
>> +
>> +?0?2?0?2?0?2 err = sysfs_create_link(b->holders_dir, >mkobj.kobj, 
>> a->name);
> 
> Sigh. This ultimately doesn't work because in load_module(), we use
> ref_module() in resolve_symbol(), and mod->mkobj.kobj doesn't get
> initialized until mod_sysfs_init(), which happens much later in
> load_module(). So what happens is that the ref_module(mod, owner) call
> in resolve_symbol() returns an error because sysfs_create_link() fails here.
> We could *maybe* move sysfs initialization earlier in load_module()
> but that is an entirely untested idea and I would need to think about
> that more.

Thank you for the reply.
I have tested the patch through livepatch. Maybe I miss somethings.
I will rewrite the patch and test it entirely before sending the v3 patch.

Thanks again.




Re: [PATCH v2] module: add usage links when calling ref_module func

2019-07-06 Thread Zhiqiang Liu
Friendly ping ...

On 2019/7/3 10:09, Zhiqiang Liu wrote:
> From: Zhiqiang Liu 
> 
> Users can call ref_module func in their modules to construct
> relationships with other modules. However, the holders
> '/sys/module//holders' of the target module donot include
> the users` module. So lsmod command misses detailed info of 'Used by'.
> 
> When load module, the process is given as follows,
> load_module()
>   -> mod_sysfs_setup()
>   -> add_usage_links
>   -> do_init_module
>   -> mod->init()
> 
> add_usage_links func creates holders of target modules linking to
> this module. If ref_module is called in mod->init() func, the usage
> links cannot be added.
> 
> Here, we will add usage link of a to b's holder_dir.
> 
> V1->V2:
> - remove incorrect Fixes tag
> - fix error handling of sysfs_create_link as suggested by Jessica Yu
> 
> Signed-off-by: Zhiqiang Liu 
> Suggested-by: Jessica Yu 
> Reviewed-by: Kang Zhou 
> ---
>  kernel/module.c | 18 ++
>  1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/module.c b/kernel/module.c
> index 80c7c09584cf..672abcec 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -837,25 +837,26 @@ static int already_uses(struct module *a, struct module 
> *b)
>   *'b' can walk the list to see who sourced them), and of 'a'
>   *targets (so 'a' can see what modules it targets).
>   */
> -static int add_module_usage(struct module *a, struct module *b)
> +static struct module_use *add_module_usage(struct module *a, struct module 
> *b)
>  {
>   struct module_use *use;
> 
>   pr_debug("Allocating new usage for %s.\n", a->name);
>   use = kmalloc(sizeof(*use), GFP_ATOMIC);
>   if (!use)
> - return -ENOMEM;
> + return NULL;
> 
>   use->source = a;
>   use->target = b;
>   list_add(>source_list, >source_list);
>   list_add(>target_list, >target_list);
> - return 0;
> + return use;
>  }
> 
>  /* Module a uses b: caller needs module_mutex() */
>  int ref_module(struct module *a, struct module *b)
>  {
> + struct module_use *use;
>   int err;
> 
>   if (b == NULL || already_uses(a, b))
> @@ -866,9 +867,18 @@ int ref_module(struct module *a, struct module *b)
>   if (err)
>   return err;
> 
> - err = add_module_usage(a, b);
> + use = add_module_usage(a, b);
> + if (!use) {
> + module_put(b);
> + return -ENOMEM;
> + }
> +
> + err = sysfs_create_link(b->holders_dir, >mkobj.kobj, a->name);
>   if (err) {
>   module_put(b);
> + list_del(>source_list);
> + list_del(>target_list);
> + kfree(use);
>   return err;
>   }
>   return 0;
> 



[PATCH v2] module: add usage links when calling ref_module func

2019-07-02 Thread Zhiqiang Liu
From: Zhiqiang Liu 

Users can call ref_module func in their modules to construct
relationships with other modules. However, the holders
'/sys/module//holders' of the target module donot include
the users` module. So lsmod command misses detailed info of 'Used by'.

When load module, the process is given as follows,
load_module()
-> mod_sysfs_setup()
-> add_usage_links
-> do_init_module
-> mod->init()

add_usage_links func creates holders of target modules linking to
this module. If ref_module is called in mod->init() func, the usage
links cannot be added.

Here, we will add usage link of a to b's holder_dir.

V1->V2:
- remove incorrect Fixes tag
- fix error handling of sysfs_create_link as suggested by Jessica Yu

Signed-off-by: Zhiqiang Liu 
Suggested-by: Jessica Yu 
Reviewed-by: Kang Zhou 
---
 kernel/module.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/kernel/module.c b/kernel/module.c
index 80c7c09584cf..672abcec 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -837,25 +837,26 @@ static int already_uses(struct module *a, struct module 
*b)
  *'b' can walk the list to see who sourced them), and of 'a'
  *targets (so 'a' can see what modules it targets).
  */
-static int add_module_usage(struct module *a, struct module *b)
+static struct module_use *add_module_usage(struct module *a, struct module *b)
 {
struct module_use *use;

pr_debug("Allocating new usage for %s.\n", a->name);
use = kmalloc(sizeof(*use), GFP_ATOMIC);
if (!use)
-   return -ENOMEM;
+   return NULL;

use->source = a;
use->target = b;
list_add(>source_list, >source_list);
list_add(>target_list, >target_list);
-   return 0;
+   return use;
 }

 /* Module a uses b: caller needs module_mutex() */
 int ref_module(struct module *a, struct module *b)
 {
+   struct module_use *use;
int err;

if (b == NULL || already_uses(a, b))
@@ -866,9 +867,18 @@ int ref_module(struct module *a, struct module *b)
if (err)
return err;

-   err = add_module_usage(a, b);
+   use = add_module_usage(a, b);
+   if (!use) {
+   module_put(b);
+   return -ENOMEM;
+   }
+
+   err = sysfs_create_link(b->holders_dir, >mkobj.kobj, a->name);
if (err) {
module_put(b);
+   list_del(>source_list);
+   list_del(>target_list);
+   kfree(use);
return err;
}
return 0;
-- 
2.19.1



Re: [PATCH] module: add usage links when calling ref_module func

2019-07-02 Thread Zhiqiang Liu
On 2019/7/1 21:55, Jessica Yu wrote:
> +++ Zhiqiang Liu [28/06/19 20:32 +0800]:
>> From: Zhiqiang Liu 
>>
>> Problem: Users can call ref_module func in their modules to construct
>> relationships with other modules. However, the holders
>> '/sys/module//holders' of the target module donot include
>> the users` module. So lsmod command misses detailed info of 'Used by'.
>>
>> Here, we will add usage link of a to b's holder_dir.
>>
>> Fixes: 9bea7f239 ("module: fix bne2 "gave up waiting for init of module 
>> libcrc32c")
> 
> I think we can drop this tag; it doesn't fix a bug specifically
> introduced by that particular commit.
> 
Thanks for your reply.
I will remove the Fixes tag in v2 patch.

>> Signed-off-by: Zhiqiang Liu 
>> ---
>> kernel/module.c | 5 +
>> 1 file changed, 5 insertions(+)
>>
>> diff --git a/kernel/module.c b/kernel/module.c
>> index 80c7c09584cf..11c6aff37b1f 100644
>> --- a/kernel/module.c
>> +++ b/kernel/module.c
>> @@ -871,6 +871,11 @@ int ref_module(struct module *a, struct module *b)
>> ?0?2?0?2?0?2?0?2?0?2?0?2?0?2 module_put(b);
>> ?0?2?0?2?0?2?0?2?0?2?0?2?0?2 return err;
>> ?0?2?0?2?0?2?0?2}
>> +
>> +?0?2?0?2?0?2 err = sysfs_create_link(b->holders_dir, >mkobj.kobj, 
>> a->name);
>> +?0?2?0?2?0?2 if (err)
>> +?0?2?0?2?0?2?0?2?0?2?0?2?0?2 return err;
> 
> We need to fix the error handling here - the module_use struct
> allocated in the call to add_module_usage() needs to be freed (you
> could just modify add_module_usage() to return the use pointer so that
> it's easier to free from within ref_module()), module_put() needs to
> be called, and the use struct should be removed from its respective
> lists (see module_unload_free()).
> 
Thanks again for your advice.
We will modify add_module_usage func to return the use pointer as your 
suggestion.
In the error handling, We will call module_put() and call list_del() to remove 
the use.

> Thanks,
> 
> Jessica
> 
> 
> .
> 



Re: [PATCH next] sysctl: add proc_dointvec_jiffies_minmax to limit the min/max write value

2019-07-01 Thread Zhiqiang Liu
friendly ping ...

On 2019/6/4 23:27, Zhiqiang Liu wrote:
>> On Wed, May 15, 2019 at 10:53:55PM +0800, Zhiqiang Liu wrote:
>>
>> (Please include akpm on CC for next versions of this, as he's likely
>> the person to take this patch.)
> Thanks for your advice. And sorry to reply you so late.
> 
>>>>> In proc_dointvec_jiffies func, the write value is only checked
>>>>> whether it is larger than INT_MAX. If the write value is less
>>>>> than zero, it can also be successfully writen in the data.
>>
>> This appears to be "be design", but I see many "unsigned int" users
>> that might be tricked into giant values... (for example, see
>> net/netfilter/nf_conntrack_standalone.c)
>>
>> Should proc_dointvec_jiffies() just be fixed to disallow negative values
>> entirely? Looking at the implementation, it seems to be very intentional
>> about accepting negative values.
>>
>> However, when I looked through a handful of proc_dointvec_jiffies()
>> users, it looks like they're all expecting a positive value. Many in the
>> networking subsystem are, in fact, writing to unsigned long variables,
>> as I mentioned.
>>
> I totally agree with you. And I also cannot find an scenario that expects
> negative values. Consideing the "negative" scenario may be exist, I add the
> proc_dointvec_jiffies_minmax like proc_dointvec_minmax.
> 
>> Are there real-world cases of wanting to set a negative jiffie value
>> via proc_dointvec_jiffies()?
> Until now, I do not find such cases.
> 
>>>>>
>>>>> Here, we add a new func, proc_dointvec_jiffies_minmax, to limit the
>>>>> min/max write value, which is similar to the proc_dointvec_minmax func.
>>>>>
>>
>> If proc_dointvec_jiffies() can't just be fixed, where will the new
>> function get used? It seems all the "unsigned int" users could benefit.
>>
> I tend to add the proc_dointvec_jiffies_minmax func to provide more choices 
> and
> not change the previous use of proc_dointvec_jiffies func.
> 
> Thanks for your reply again.
> 
> 
> .
> 



[PATCH] module: add usage links when calling ref_module func

2019-06-28 Thread Zhiqiang Liu
From: Zhiqiang Liu 

Problem: Users can call ref_module func in their modules to construct
relationships with other modules. However, the holders
'/sys/module//holders' of the target module donot include
the users` module. So lsmod command misses detailed info of 'Used by'.

When load module, the process is given as follows,
load_module()
-> mod_sysfs_setup()
-> add_usage_links
-> do_init_module
-> mod->init()

add_usage_links func creates holders of target modules linking to
this module. If ref_module is called in mod->init() func, the usage
links cannot be added.

Here, we will add usage link of a to b's holder_dir.

Fixes: 9bea7f239 ("module: fix bne2 "gave up waiting for init of module 
libcrc32c")
Signed-off-by: Zhiqiang Liu 
---
 kernel/module.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/kernel/module.c b/kernel/module.c
index 80c7c09584cf..11c6aff37b1f 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -871,6 +871,11 @@ int ref_module(struct module *a, struct module *b)
module_put(b);
return err;
}
+
+   err = sysfs_create_link(b->holders_dir, >mkobj.kobj, a->name);
+   if (err)
+   return err;
+
return 0;
 }
 EXPORT_SYMBOL_GPL(ref_module);
-- 
2.19.1



Re: [PATCH next] softirq: enable MAX_SOFTIRQ_TIME tuning with sysctl max_softirq_time_usecs

2019-06-23 Thread Zhiqiang Liu


在 2019/6/24 0:38, Thomas Gleixner 写道:
> Zhiqiang,
>> controlled by sysadmins to copy with hardware changes over time.
> 
> So much for the theory. See below.

Thanks for your reply.
> 
>> Correspondingly, the MAX_SOFTIRQ_TIME should be able to be tunned by 
>> sysadmins,
>> who knows best about hardware performance, for excepted tradeoff between 
>> latence
>> and fairness.
>>
>> Here, we add sysctl variable max_softirq_time_usecs to replace 
>> MAX_SOFTIRQ_TIME
>> with 2ms default value.
> 
> ...
> 
>>   */
>> -#define MAX_SOFTIRQ_TIME  msecs_to_jiffies(2)
>> +unsigned int __read_mostly max_softirq_time_usecs = 2000;
>>  #define MAX_SOFTIRQ_RESTART 10
>>
>>  #ifdef CONFIG_TRACE_IRQFLAGS
>> @@ -248,7 +249,8 @@ static inline void lockdep_softirq_end(bool in_hardirq) 
>> { }
>>
>>  asmlinkage __visible void __softirq_entry __do_softirq(void)
>>  {
>> -unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
>> +unsigned long end = jiffies +
>> +usecs_to_jiffies(max_softirq_time_usecs);
> 
> That's still jiffies based and therefore depends on CONFIG_HZ. Any budget
> value will be rounded up to the next jiffie. So in case of HZ=100 and
> time=1000us this will still result in 10ms of allowed loop time.
> 
> I'm not saying that we must use a more fine grained time source, but both
> the changelog and the sysctl documentation are misleading.
> 
> If we keep it jiffies based, then microseconds do not make any sense. They
> just give a false sense of controlability.
> 
> Keep also in mind that with jiffies the accuracy depends also on the
> distance to the next tick when 'end' is evaluated. The next tick might be
> imminent.
> 
> That's all information which needs to be in the documentation.
> 

Thanks again for your detailed advice.
As your said, the max_softirq_time_usecs setting without explaining the
relationship with CONFIG_HZ will give a false sense of controlability. And
the time accuracy of jiffies will result in a certain difference between the
max_softirq_time_usecs set value and the actual value, which is in one jiffies
range.

I will add these infomation in the sysctl documentation and changelog in v2 
patch.

>> +{
>> +.procname   = "max_softirq_time_usecs",
>> +.data   = _softirq_time_usecs,
>> +.maxlen = sizeof(unsigned int),
>> +.mode   = 0644,
>> +.proc_handler   = proc_dointvec_minmax,
>> +.extra1 = ,
>> +},
> 
> Zero as the lower limit? That means it allows a single loop. Fine, but
> needs to be documented as well.
> 
> Thanks,
> 
>   tglx
> 
> .
> 



Re: [PATCH net] tcp: avoid creating multiple req socks with the same tuples

2019-06-05 Thread Zhiqiang Liu



在 2019/6/4 23:24, Eric Dumazet 写道:
> On Tue, Jun 4, 2019 at 7:47 AM Mao Wenan  wrote:
>>
>> There is one issue about bonding mode BOND_MODE_BROADCAST, and
>> two slaves with diffierent affinity, so packets will be handled
>> by different cpu. These are two pre-conditions in this case.

>> Signed-off-by: Mao Wenan 
>> --
> 
> This issue has been discussed last year.
> 
> I am afraid your patch does not solve all races.
> 
> The lookup you add is lockless, so this is racy.
> 
> Really the only way to solve this is to make sure that _when_ the
> bucket lock is held,
> we do not insert a request socket if the 4-tuple is already in the
> chain (probably in inet_ehash_insert())
> 
> This needs more tricky changes than your patch.
> 

This kind case is rarely used, and the condition of the issue is strict.
If we add the "lookup" before or in inet_ehash_insert func for each reqsk,
overall performance will be affected.

We may solve the small probability issue with a trick in the tcp_v4_rcv.
If the ACK is invalid checked by tcp_check_req func, the req could be dropped,
and then goto the lookup for searching another avaliable reqsk. In this way,
the performance will not be affected in the normal process.

The patch is given as following:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a2896944aa37..9d0491587ed2 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1874,8 +1874,10 @@ int tcp_v4_rcv(struct sk_buff *skb)
goto discard_and_relse;
}
if (nsk == sk) {
-   reqsk_put(req);
+   inet_csk_reqsk_queue_drop_and_put(sk, req);
tcp_v4_restore_cb(skb);
+   sock_put(sk);
+   goto lookup;
} else if (tcp_child_process(sk, nsk, skb)) {
tcp_v4_send_reset(nsk, skb);
goto discard_and_relse;








Re: [PATCH next] sysctl: add proc_dointvec_jiffies_minmax to limit the min/max write value

2019-06-04 Thread Zhiqiang Liu
> On Wed, May 15, 2019 at 10:53:55PM +0800, Zhiqiang Liu wrote:
> 
> (Please include akpm on CC for next versions of this, as he's likely
> the person to take this patch.)
Thanks for your advice. And sorry to reply you so late.

>>>> In proc_dointvec_jiffies func, the write value is only checked
>>>> whether it is larger than INT_MAX. If the write value is less
>>>> than zero, it can also be successfully writen in the data.
> 
> This appears to be "be design", but I see many "unsigned int" users
> that might be tricked into giant values... (for example, see
> net/netfilter/nf_conntrack_standalone.c)
> 
> Should proc_dointvec_jiffies() just be fixed to disallow negative values
> entirely? Looking at the implementation, it seems to be very intentional
> about accepting negative values.
> 
> However, when I looked through a handful of proc_dointvec_jiffies()
> users, it looks like they're all expecting a positive value. Many in the
> networking subsystem are, in fact, writing to unsigned long variables,
> as I mentioned.
> 
I totally agree with you. And I also cannot find an scenario that expects
negative values. Consideing the "negative" scenario may be exist, I add the
proc_dointvec_jiffies_minmax like proc_dointvec_minmax.

> Are there real-world cases of wanting to set a negative jiffie value
> via proc_dointvec_jiffies()?
Until now, I do not find such cases.

>>>>
>>>> Here, we add a new func, proc_dointvec_jiffies_minmax, to limit the
>>>> min/max write value, which is similar to the proc_dointvec_minmax func.
>>>>
> 
> If proc_dointvec_jiffies() can't just be fixed, where will the new
> function get used? It seems all the "unsigned int" users could benefit.
> 
I tend to add the proc_dointvec_jiffies_minmax func to provide more choices and
not change the previous use of proc_dointvec_jiffies func.

Thanks for your reply again.



Re: [PATCH next] sysctl: add proc_dointvec_jiffies_minmax to limit the min/max write value

2019-05-15 Thread Zhiqiang Liu
Friendly ping...

在 2019/4/24 12:04, Zhiqiang Liu 写道:
> 
> Friendly ping...
> 
>> From: Zhiqiang Liu 
>>
>> In proc_dointvec_jiffies func, the write value is only checked
>> whether it is larger than INT_MAX. If the write value is less
>> than zero, it can also be successfully writen in the data.
>>
>> However, in some scenarios, users would adopt the data to
>> set timers or check whether time is expired. Generally, the data
>> will be cast to an unsigned type variable, then the negative data
>> becomes a very large unsigned value, which leads to long waits
>> or other unpredictable problems.
>>
>> Here, we add a new func, proc_dointvec_jiffies_minmax, to limit the
>> min/max write value, which is similar to the proc_dointvec_minmax func.
>>
>> Signed-off-by: Zhiqiang Liu 
>> Reported-by: Qiang Ning 
>> Reviewed-by: Jie Liu 
>> ---
>>  include/linux/sysctl.h |  2 ++
>>  kernel/sysctl.c| 44 +++-
>>  2 files changed, 45 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
>> index b769ecf..8bde8a0 100644
>> --- a/include/linux/sysctl.h
>> +++ b/include/linux/sysctl.h
>> @@ -53,6 +53,8 @@ extern int proc_douintvec_minmax(struct ctl_table *table, 
>> int write,
>>   loff_t *ppos);
>>  extern int proc_dointvec_jiffies(struct ctl_table *, int,
>>   void __user *, size_t *, loff_t *);
>> +extern int proc_dointvec_jiffies_minmax(struct ctl_table *, int,
>> + void __user *, size_t *, loff_t *);
>>  extern int proc_dointvec_userhz_jiffies(struct ctl_table *, int,
>>  void __user *, size_t *, loff_t *);
>>  extern int proc_dointvec_ms_jiffies(struct ctl_table *, int,
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index c9ec050..8e1eb59 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -2967,10 +2967,15 @@ static int do_proc_dointvec_jiffies_conv(bool *negp, 
>> unsigned long *lvalp,
>>   int *valp,
>>   int write, void *data)
>>  {
>> +struct do_proc_dointvec_minmax_conv_param *param = data;
>> +
>>  if (write) {
>>  if (*lvalp > INT_MAX / HZ)
>>  return 1;
>>  *valp = *negp ? -(*lvalp*HZ) : (*lvalp*HZ);
>> +if ((param->min && (*param->min)*HZ > *valp) ||
>> +(param->max && (*param->max)*HZ < *valp))
>> +return -EINVAL;
>>  } else {
>>  int val = *valp;
>>  unsigned long lval;
>> @@ -3053,7 +3058,37 @@ int proc_dointvec_jiffies(struct ctl_table *table, 
>> int write,
>>void __user *buffer, size_t *lenp, loff_t *ppos)
>>  {
>>  return do_proc_dointvec(table,write,buffer,lenp,ppos,
>> -do_proc_dointvec_jiffies_conv,NULL);
>> +do_proc_dointvec_jiffies_conv, NULL);
>> +}
>> +
>> +/**
>> + * proc_dointvec_jiffies_minmax - read a vector of integers as seconds with 
>> min/max values
>> + * @table: the sysctl table
>> + * @write: %TRUE if this is a write to the sysctl file
>> + * @buffer: the user buffer
>> + * @lenp: the size of the user buffer
>> + * @ppos: file position
>> + *
>> + * Reads/writes up to table->maxlen/sizeof(unsigned int) integer
>> + * values from/to the user buffer, treated as an ASCII string.
>> + * The values read are assumed to be in seconds, and are converted into
>> + * jiffies.
>> + *
>> + * This routine will ensure the values are within the range specified by
>> + * table->extra1 (min) and table->extra2 (max).
>> + *
>> + * Returns 0 on success or -EINVAL on write when the range check fails.
>> + */
>> +int proc_dointvec_jiffies_minmax(struct ctl_table *table, int write,
>> +  void __user *buffer, size_t *lenp, loff_t *ppos)
>> +{
>> +struct do_proc_dointvec_minmax_conv_param param = {
>> +.min = (int *) table->extra1,
>> +.max = (int *) table->extra2,
>> +};
>> +
>> +return do_proc_dointvec(table, write, buffer, lenp, ppos,
>> +do_proc_dointvec_jiffies_conv, );
>>  }
>>
>>  /**
>> @@ -3301,6 +3336,12 @@ int proc_dointvec_jiffies(struct ctl_table *table, 
&

Re: [PATCH v2] mm/hugetlb: Don't put_page in lock of hugetlb_lock

2019-05-08 Thread Zhiqiang Liu
> On 5/6/19 7:06 AM, Zhiqiang Liu wrote:
>> From: Kai Shen 
>>
>> spinlock recursion happened when do LTP test:
>> #!/bin/bash
>> ./runltp -p -f hugetlb &
>> ./runltp -p -f hugetlb &
>> ./runltp -p -f hugetlb &
>> ./runltp -p -f hugetlb &
>> ./runltp -p -f hugetlb &
>>
>> The dtor returned by get_compound_page_dtor in __put_compound_page
>> may be the function of free_huge_page which will lock the hugetlb_lock,
>> so don't put_page in lock of hugetlb_lock.
>>
>>  BUG: spinlock recursion on CPU#0, hugemmap05/1079
>>   lock: hugetlb_lock+0x0/0x18, .magic: dead4ead, .owner: hugemmap05/1079, 
>> .owner_cpu: 0
>>  Call trace:
>>   dump_backtrace+0x0/0x198
>>   show_stack+0x24/0x30
>>   dump_stack+0xa4/0xcc
>>   spin_dump+0x84/0xa8
>>   do_raw_spin_lock+0xd0/0x108
>>   _raw_spin_lock+0x20/0x30
>>   free_huge_page+0x9c/0x260
>>   __put_compound_page+0x44/0x50
>>   __put_page+0x2c/0x60
>>   alloc_surplus_huge_page.constprop.19+0xf0/0x140
>>   hugetlb_acct_memory+0x104/0x378
>>   hugetlb_reserve_pages+0xe0/0x250
>>   hugetlbfs_file_mmap+0xc0/0x140
>>   mmap_region+0x3e8/0x5b0
>>   do_mmap+0x280/0x460
>>   vm_mmap_pgoff+0xf4/0x128
>>   ksys_mmap_pgoff+0xb4/0x258
>>   __arm64_sys_mmap+0x34/0x48
>>   el0_svc_common+0x78/0x130
>>   el0_svc_handler+0x38/0x78
>>   el0_svc+0x8/0xc
>>
>> Fixes: 9980d744a0 ("mm, hugetlb: get rid of surplus page accounting tricks")
>> Signed-off-by: Kai Shen 
>> Signed-off-by: Feilong Lin 
>> Reported-by: Wang Wang 
>> Acked-by: Michal Hocko 
> 
> Good catch.  Sorry, for the late reply.
> 
> Reviewed-by: Mike Kravetz 
> 

Thank your for the reply.
Friendly ping ...




Re: [PATCH v2] mm/hugetlb: Don't put_page in lock of hugetlb_lock

2019-05-06 Thread Zhiqiang Liu
> On Mon 06-05-19 23:22:08, Zhiqiang Liu wrote:
> [...]
>> Does adding Cc: stable mean adding Cc: 
>> tag in the patch or Ccing sta...@vger.kernel.org when sending the new mail?
> 
> The former. See Documentation/process/stable-kernel-rules.rst for more.
> 
> Thanks!

Thank you, Oscar Salvador and Mike Kravetz again.
> 



Re: [PATCH v2] mm/hugetlb: Don't put_page in lock of hugetlb_lock

2019-05-06 Thread Zhiqiang Liu
> On Mon 06-05-19 22:06:38, Zhiqiang Liu wrote:
>> From: Kai Shen 
>>
>> spinlock recursion happened when do LTP test:
>> #!/bin/bash
>> ./runltp -p -f hugetlb &
>> ./runltp -p -f hugetlb &
>> ./runltp -p -f hugetlb &
>> ./runltp -p -f hugetlb &
>> ./runltp -p -f hugetlb &
>>
>> The dtor returned by get_compound_page_dtor in __put_compound_page
>> may be the function of free_huge_page which will lock the hugetlb_lock,
>> so don't put_page in lock of hugetlb_lock.
>>
>>  BUG: spinlock recursion on CPU#0, hugemmap05/1079
>>   lock: hugetlb_lock+0x0/0x18, .magic: dead4ead, .owner: hugemmap05/1079, 
>> .owner_cpu: 0
>>  Call trace:
>>   dump_backtrace+0x0/0x198
>>   show_stack+0x24/0x30
>>   dump_stack+0xa4/0xcc
>>   spin_dump+0x84/0xa8
>>   do_raw_spin_lock+0xd0/0x108
>>   _raw_spin_lock+0x20/0x30
>>   free_huge_page+0x9c/0x260
>>   __put_compound_page+0x44/0x50
>>   __put_page+0x2c/0x60
>>   alloc_surplus_huge_page.constprop.19+0xf0/0x140
>>   hugetlb_acct_memory+0x104/0x378
>>   hugetlb_reserve_pages+0xe0/0x250
>>   hugetlbfs_file_mmap+0xc0/0x140
>>   mmap_region+0x3e8/0x5b0
>>   do_mmap+0x280/0x460
>>   vm_mmap_pgoff+0xf4/0x128
>>   ksys_mmap_pgoff+0xb4/0x258
>>   __arm64_sys_mmap+0x34/0x48
>>   el0_svc_common+0x78/0x130
>>   el0_svc_handler+0x38/0x78
>>   el0_svc+0x8/0xc
>>
>> Fixes: 9980d744a0 ("mm, hugetlb: get rid of surplus page accounting tricks")
>> Signed-off-by: Kai Shen 
>> Signed-off-by: Feilong Lin 
>> Reported-by: Wang Wang 
>> Acked-by: Michal Hocko 
>> ---
>> v1->v2: add Acked-by: Michal Hocko 
> 
> A new version for single ack is usually an overkill and only makes the
> situation more confusing. You have also didn't add Cc: stable as
> suggested during the review. That part is arguably more important.
> 
> You also haven't CCed Andrew (now done) and your patch will not get
> merged without him applying it. Anyway, let's wait for Andrew to pick
> this patch up.
> 
Thank you for your patience. I am sorry for misunderstanding your advice
in your last mail.
Does adding Cc: stable mean adding Cc: 
tag in the patch or Ccing sta...@vger.kernel.org when sending the new mail?

You are very nice. Thanks again.



>>  mm/hugetlb.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 6cdc7b2..c1e7b81 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1574,8 +1574,9 @@ static struct page *alloc_surplus_huge_page(struct 
>> hstate *h, gfp_t gfp_mask,
>>   */
>>  if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
>>  SetPageHugeTemporary(page);
>> +spin_unlock(_lock);
>>  put_page(page);
>> -page = NULL;
>> +return NULL;
>>  } else {
>>  h->surplus_huge_pages++;
>>  h->surplus_huge_pages_node[page_to_nid(page)]++;
>> -- 
>> 1.8.3.1
>>
> 



[PATCH v2] mm/hugetlb: Don't put_page in lock of hugetlb_lock

2019-05-06 Thread Zhiqiang Liu
From: Kai Shen 

spinlock recursion happened when do LTP test:
#!/bin/bash
./runltp -p -f hugetlb &
./runltp -p -f hugetlb &
./runltp -p -f hugetlb &
./runltp -p -f hugetlb &
./runltp -p -f hugetlb &

The dtor returned by get_compound_page_dtor in __put_compound_page
may be the function of free_huge_page which will lock the hugetlb_lock,
so don't put_page in lock of hugetlb_lock.

 BUG: spinlock recursion on CPU#0, hugemmap05/1079
  lock: hugetlb_lock+0x0/0x18, .magic: dead4ead, .owner: hugemmap05/1079, 
.owner_cpu: 0
 Call trace:
  dump_backtrace+0x0/0x198
  show_stack+0x24/0x30
  dump_stack+0xa4/0xcc
  spin_dump+0x84/0xa8
  do_raw_spin_lock+0xd0/0x108
  _raw_spin_lock+0x20/0x30
  free_huge_page+0x9c/0x260
  __put_compound_page+0x44/0x50
  __put_page+0x2c/0x60
  alloc_surplus_huge_page.constprop.19+0xf0/0x140
  hugetlb_acct_memory+0x104/0x378
  hugetlb_reserve_pages+0xe0/0x250
  hugetlbfs_file_mmap+0xc0/0x140
  mmap_region+0x3e8/0x5b0
  do_mmap+0x280/0x460
  vm_mmap_pgoff+0xf4/0x128
  ksys_mmap_pgoff+0xb4/0x258
  __arm64_sys_mmap+0x34/0x48
  el0_svc_common+0x78/0x130
  el0_svc_handler+0x38/0x78
  el0_svc+0x8/0xc

Fixes: 9980d744a0 ("mm, hugetlb: get rid of surplus page accounting tricks")
Signed-off-by: Kai Shen 
Signed-off-by: Feilong Lin 
Reported-by: Wang Wang 
Acked-by: Michal Hocko 
---
v1->v2: add Acked-by: Michal Hocko 

 mm/hugetlb.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6cdc7b2..c1e7b81 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1574,8 +1574,9 @@ static struct page *alloc_surplus_huge_page(struct hstate 
*h, gfp_t gfp_mask,
 */
if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
SetPageHugeTemporary(page);
+   spin_unlock(_lock);
put_page(page);
-   page = NULL;
+   return NULL;
} else {
h->surplus_huge_pages++;
h->surplus_huge_pages_node[page_to_nid(page)]++;
-- 
1.8.3.1



Re: [PATCH] mm/hugetlb: Don't put_page in lock of hugetlb_lock

2019-05-06 Thread Zhiqiang Liu
> On Sat 04-05-19 20:28:24, Zhiqiang Liu wrote:
>> From: Kai Shen 
>>
>> spinlock recursion happened when do LTP test:
>> #!/bin/bash
>> ./runltp -p -f hugetlb &
>> ./runltp -p -f hugetlb &
>> ./runltp -p -f hugetlb &
>> ./runltp -p -f hugetlb &
>> ./runltp -p -f hugetlb &
>>

>> Fixes: 9980d744a0 ("mm, hugetlb: get rid of surplus page accounting tricks")
>> Signed-off-by: Kai Shen 
>> Signed-off-by: Feilong Lin 
>> Reported-by: Wang Wang 
> 
> You are right. I must have completely missed that put_page path
> unconditionally takes the hugetlb_lock for hugetlb pages.
> 
> Thanks for fixing this. I think this should be marked for stable
> because it is not hard to imagine a regular user might trigger this.
> 
> Acked-by: Michal Hocko 

Thank you for your reply.
I will add Acked-by: Michal Hocko  in the v2 patch.




[PATCH] mm/hugetlb: Don't put_page in lock of hugetlb_lock

2019-05-04 Thread Zhiqiang Liu
From: Kai Shen 

spinlock recursion happened when do LTP test:
#!/bin/bash
./runltp -p -f hugetlb &
./runltp -p -f hugetlb &
./runltp -p -f hugetlb &
./runltp -p -f hugetlb &
./runltp -p -f hugetlb &

The dtor returned by get_compound_page_dtor in __put_compound_page
may be the function of free_huge_page which will lock the hugetlb_lock,
so don't put_page in lock of hugetlb_lock.

 BUG: spinlock recursion on CPU#0, hugemmap05/1079
  lock: hugetlb_lock+0x0/0x18, .magic: dead4ead, .owner: hugemmap05/1079, 
.owner_cpu: 0
 Call trace:
  dump_backtrace+0x0/0x198
  show_stack+0x24/0x30
  dump_stack+0xa4/0xcc
  spin_dump+0x84/0xa8
  do_raw_spin_lock+0xd0/0x108
  _raw_spin_lock+0x20/0x30
  free_huge_page+0x9c/0x260
  __put_compound_page+0x44/0x50
  __put_page+0x2c/0x60
  alloc_surplus_huge_page.constprop.19+0xf0/0x140
  hugetlb_acct_memory+0x104/0x378
  hugetlb_reserve_pages+0xe0/0x250
  hugetlbfs_file_mmap+0xc0/0x140
  mmap_region+0x3e8/0x5b0
  do_mmap+0x280/0x460
  vm_mmap_pgoff+0xf4/0x128
  ksys_mmap_pgoff+0xb4/0x258
  __arm64_sys_mmap+0x34/0x48
  el0_svc_common+0x78/0x130
  el0_svc_handler+0x38/0x78
  el0_svc+0x8/0xc

Fixes: 9980d744a0 ("mm, hugetlb: get rid of surplus page accounting tricks")
Signed-off-by: Kai Shen 
Signed-off-by: Feilong Lin 
Reported-by: Wang Wang 
---
 mm/hugetlb.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6cdc7b2..c1e7b81 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1574,8 +1574,9 @@ static struct page *alloc_surplus_huge_page(struct hstate 
*h, gfp_t gfp_mask,
 */
if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
SetPageHugeTemporary(page);
+   spin_unlock(_lock);
put_page(page);
-   page = NULL;
+   return NULL;
} else {
h->surplus_huge_pages++;
h->surplus_huge_pages_node[page_to_nid(page)]++;
-- 
1.8.3.1




Re: [PATCH next] sysctl: add proc_dointvec_jiffies_minmax to limit the min/max write value

2019-04-23 Thread Zhiqiang Liu


Friendly ping...

> From: Zhiqiang Liu 
> 
> In proc_dointvec_jiffies func, the write value is only checked
> whether it is larger than INT_MAX. If the write value is less
> than zero, it can also be successfully writen in the data.
> 
> However, in some scenarios, users would adopt the data to
> set timers or check whether time is expired. Generally, the data
> will be cast to an unsigned type variable, then the negative data
> becomes a very large unsigned value, which leads to long waits
> or other unpredictable problems.
> 
> Here, we add a new func, proc_dointvec_jiffies_minmax, to limit the
> min/max write value, which is similar to the proc_dointvec_minmax func.
> 
> Signed-off-by: Zhiqiang Liu 
> Reported-by: Qiang Ning 
> Reviewed-by: Jie Liu 
> ---
>  include/linux/sysctl.h |  2 ++
>  kernel/sysctl.c| 44 +++-
>  2 files changed, 45 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
> index b769ecf..8bde8a0 100644
> --- a/include/linux/sysctl.h
> +++ b/include/linux/sysctl.h
> @@ -53,6 +53,8 @@ extern int proc_douintvec_minmax(struct ctl_table *table, 
> int write,
>loff_t *ppos);
>  extern int proc_dointvec_jiffies(struct ctl_table *, int,
>void __user *, size_t *, loff_t *);
> +extern int proc_dointvec_jiffies_minmax(struct ctl_table *, int,
> +  void __user *, size_t *, loff_t *);
>  extern int proc_dointvec_userhz_jiffies(struct ctl_table *, int,
>   void __user *, size_t *, loff_t *);
>  extern int proc_dointvec_ms_jiffies(struct ctl_table *, int,
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index c9ec050..8e1eb59 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -2967,10 +2967,15 @@ static int do_proc_dointvec_jiffies_conv(bool *negp, 
> unsigned long *lvalp,
>int *valp,
>int write, void *data)
>  {
> + struct do_proc_dointvec_minmax_conv_param *param = data;
> +
>   if (write) {
>   if (*lvalp > INT_MAX / HZ)
>   return 1;
>   *valp = *negp ? -(*lvalp*HZ) : (*lvalp*HZ);
> + if ((param->min && (*param->min)*HZ > *valp) ||
> + (param->max && (*param->max)*HZ < *valp))
> + return -EINVAL;
>   } else {
>   int val = *valp;
>   unsigned long lval;
> @@ -3053,7 +3058,37 @@ int proc_dointvec_jiffies(struct ctl_table *table, int 
> write,
> void __user *buffer, size_t *lenp, loff_t *ppos)
>  {
>  return do_proc_dointvec(table,write,buffer,lenp,ppos,
> - do_proc_dointvec_jiffies_conv,NULL);
> + do_proc_dointvec_jiffies_conv, NULL);
> +}
> +
> +/**
> + * proc_dointvec_jiffies_minmax - read a vector of integers as seconds with 
> min/max values
> + * @table: the sysctl table
> + * @write: %TRUE if this is a write to the sysctl file
> + * @buffer: the user buffer
> + * @lenp: the size of the user buffer
> + * @ppos: file position
> + *
> + * Reads/writes up to table->maxlen/sizeof(unsigned int) integer
> + * values from/to the user buffer, treated as an ASCII string.
> + * The values read are assumed to be in seconds, and are converted into
> + * jiffies.
> + *
> + * This routine will ensure the values are within the range specified by
> + * table->extra1 (min) and table->extra2 (max).
> + *
> + * Returns 0 on success or -EINVAL on write when the range check fails.
> + */
> +int proc_dointvec_jiffies_minmax(struct ctl_table *table, int write,
> +   void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> + struct do_proc_dointvec_minmax_conv_param param = {
> + .min = (int *) table->extra1,
> + .max = (int *) table->extra2,
> + };
> +
> + return do_proc_dointvec(table, write, buffer, lenp, ppos,
> + do_proc_dointvec_jiffies_conv, );
>  }
> 
>  /**
> @@ -3301,6 +3336,12 @@ int proc_dointvec_jiffies(struct ctl_table *table, int 
> write,
>   return -ENOSYS;
>  }
> 
> +int proc_dointvec_jiffies_minmax(struct ctl_table *table, int write,
> + void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> + return -ENOSYS;
> +}
> +
>  int proc_dointvec_userhz_jiffies(struct ctl_table *table, int write,
>   void __user *buffer, size_t *lenp, loff_t *ppos)
>  {
> @@ -3359,6 +3400,7 @@ static int proc_dointvec_minmax_bpf_stats(struct 
> ctl_table *table, int write,
>  EXPORT_SYMBOL(proc_dointvec);
>  EXPORT_SYMBOL(proc_douintvec);
>  EXPORT_SYMBOL(proc_dointvec_jiffies);
> +EXPORT_SYMBOL(proc_dointvec_jiffies_minmax);
>  EXPORT_SYMBOL(proc_dointvec_minmax);
>  EXPORT_SYMBOL_GPL(proc_douintvec_minmax);
>  EXPORT_SYMBOL(proc_dointvec_userhz_jiffies);
> 



[PATCH next] sysctl: add proc_dointvec_jiffies_minmax to limit the min/max write value

2019-04-17 Thread Zhiqiang Liu
From: Zhiqiang Liu 

In proc_dointvec_jiffies func, the write value is only checked
whether it is larger than INT_MAX. If the write value is less
than zero, it can also be successfully writen in the data.

However, in some scenarios, users would adopt the data to
set timers or check whether time is expired. Generally, the data
will be cast to an unsigned type variable, then the negative data
becomes a very large unsigned value, which leads to long waits
or other unpredictable problems.

Here, we add a new func, proc_dointvec_jiffies_minmax, to limit the
min/max write value, which is similar to the proc_dointvec_minmax func.

Signed-off-by: Zhiqiang Liu 
Reported-by: Qiang Ning 
Reviewed-by: Jie Liu 
---
 include/linux/sysctl.h |  2 ++
 kernel/sysctl.c| 44 +++-
 2 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index b769ecf..8bde8a0 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -53,6 +53,8 @@ extern int proc_douintvec_minmax(struct ctl_table *table, int 
write,
 loff_t *ppos);
 extern int proc_dointvec_jiffies(struct ctl_table *, int,
 void __user *, size_t *, loff_t *);
+extern int proc_dointvec_jiffies_minmax(struct ctl_table *, int,
+void __user *, size_t *, loff_t *);
 extern int proc_dointvec_userhz_jiffies(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
 extern int proc_dointvec_ms_jiffies(struct ctl_table *, int,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c9ec050..8e1eb59 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2967,10 +2967,15 @@ static int do_proc_dointvec_jiffies_conv(bool *negp, 
unsigned long *lvalp,
 int *valp,
 int write, void *data)
 {
+   struct do_proc_dointvec_minmax_conv_param *param = data;
+
if (write) {
if (*lvalp > INT_MAX / HZ)
return 1;
*valp = *negp ? -(*lvalp*HZ) : (*lvalp*HZ);
+   if ((param->min && (*param->min)*HZ > *valp) ||
+   (param->max && (*param->max)*HZ < *valp))
+   return -EINVAL;
} else {
int val = *valp;
unsigned long lval;
@@ -3053,7 +3058,37 @@ int proc_dointvec_jiffies(struct ctl_table *table, int 
write,
  void __user *buffer, size_t *lenp, loff_t *ppos)
 {
 return do_proc_dointvec(table,write,buffer,lenp,ppos,
-   do_proc_dointvec_jiffies_conv,NULL);
+   do_proc_dointvec_jiffies_conv, NULL);
+}
+
+/**
+ * proc_dointvec_jiffies_minmax - read a vector of integers as seconds with 
min/max values
+ * @table: the sysctl table
+ * @write: %TRUE if this is a write to the sysctl file
+ * @buffer: the user buffer
+ * @lenp: the size of the user buffer
+ * @ppos: file position
+ *
+ * Reads/writes up to table->maxlen/sizeof(unsigned int) integer
+ * values from/to the user buffer, treated as an ASCII string.
+ * The values read are assumed to be in seconds, and are converted into
+ * jiffies.
+ *
+ * This routine will ensure the values are within the range specified by
+ * table->extra1 (min) and table->extra2 (max).
+ *
+ * Returns 0 on success or -EINVAL on write when the range check fails.
+ */
+int proc_dointvec_jiffies_minmax(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+   struct do_proc_dointvec_minmax_conv_param param = {
+   .min = (int *) table->extra1,
+   .max = (int *) table->extra2,
+   };
+
+   return do_proc_dointvec(table, write, buffer, lenp, ppos,
+   do_proc_dointvec_jiffies_conv, );
 }

 /**
@@ -3301,6 +3336,12 @@ int proc_dointvec_jiffies(struct ctl_table *table, int 
write,
return -ENOSYS;
 }

+int proc_dointvec_jiffies_minmax(struct ctl_table *table, int write,
+   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+   return -ENOSYS;
+}
+
 int proc_dointvec_userhz_jiffies(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp, loff_t *ppos)
 {
@@ -3359,6 +3400,7 @@ static int proc_dointvec_minmax_bpf_stats(struct 
ctl_table *table, int write,
 EXPORT_SYMBOL(proc_dointvec);
 EXPORT_SYMBOL(proc_douintvec);
 EXPORT_SYMBOL(proc_dointvec_jiffies);
+EXPORT_SYMBOL(proc_dointvec_jiffies_minmax);
 EXPORT_SYMBOL(proc_dointvec_minmax);
 EXPORT_SYMBOL_GPL(proc_douintvec_minmax);
 EXPORT_SYMBOL(proc_dointvec_userhz_jiffies);
-- 
1.8.3.1