Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
Hi Donald,

On 2/8/21 19:41, Donald Buczek wrote:
Dear Guoqing,
On 08.02.21 15:53, Guoqing Jiang wrote:
On 2/8/21 12:38, Donald Buczek wrote:

5. maybe don't hold reconfig_mutex when trying to unregister sync_thread, like this.

    /* resync has finished, collect result */
    mddev_unlock(mddev);
    md_unregister_thread(&mddev->sync_thread);
    mddev_lock(mddev);

As above: While we wait for the sync thread to terminate, wouldn't it be a problem if another user space operation takes the mutex?

I don't think other places can be blocked while holding the mutex, otherwise those places could cause a potential deadlock. Please try the above two-line change. And perhaps others have a better idea.

Yes, this works. No deadlock after >11000 seconds. (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265, 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820)

Great. I will send a formal patch with your reported-by and tested-by.

Thanks,
Guoqing
Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
On 2/8/21 12:38, Donald Buczek wrote:

5. maybe don't hold reconfig_mutex when trying to unregister sync_thread, like this.

    /* resync has finished, collect result */
    mddev_unlock(mddev);
    md_unregister_thread(&mddev->sync_thread);
    mddev_lock(mddev);

As above: While we wait for the sync thread to terminate, wouldn't it be a problem if another user space operation takes the mutex?

I don't think other places can be blocked while holding the mutex, otherwise those places could cause a potential deadlock. Please try the above two-line change. And perhaps others have a better idea.

Thanks,
Guoqing
Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
Hi Donald,

On 1/26/21 17:05, Donald Buczek wrote:
Dear Guoqing,
On 26.01.21 15:06, Guoqing Jiang wrote:
On 1/26/21 13:58, Donald Buczek wrote:

Hmm, how about waking the waiter up in the while loop of raid5d?

    @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
                            md_check_recovery(mddev);
                            spin_lock_irq(&conf->device_lock);
                    }
    +
    +               if ((atomic_read(&conf->active_stripes)
    +                    < (conf->max_nr_stripes * 3 / 4) ||
    +                   test_bit(MD_RECOVERY_INTR, &mddev->recovery)))
    +                       wake_up(&conf->wait_for_stripe);
            }
            pr_debug("%d stripes handled\n", handled);

Hmm... With this patch on top of your other one, we still have the basic symptoms (md3_raid6 busy looping), but the sync thread is now hanging at

    root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
    [<0>] md_do_sync.cold+0x8ec/0x97c
    [<0>] md_thread+0xab/0x160
    [<0>] kthread+0x11b/0x140
    [<0>] ret_from_fork+0x22/0x30

instead, which is https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963

Not sure why recovery_active is not zero, because it is set to 0 before blk_start_plug, and raid5_sync_request returns 0 and skipped is also set to 1. Perhaps handle_stripe calls md_done_sync.

Could you double check the value of recovery_active? Or just don't wait if the resync thread is interrupted:

    wait_event(mddev->recovery_wait,
               test_bit(MD_RECOVERY_INTR, &mddev->recovery) ||
               !atomic_read(&mddev->recovery_active));

With that added, md3_resync goes into a loop, too. Not 100% busy, though.
    root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
    [<0>] raid5_get_active_stripe+0x1e7/0x6b0  # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
    [<0>] raid5_sync_request+0x2a7/0x3d0       # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L6274
    [<0>] md_do_sync.cold+0x3ee/0x97c          # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/md.c#L8883
    [<0>] md_thread+0xab/0x160
    [<0>] kthread+0x11b/0x140
    [<0>] ret_from_fork+0x22/0x30

Sometimes the top of stack is raid5_get_active_stripe+0x1ef/0x6b0 instead of raid5_get_active_stripe+0x1e7/0x6b0, so I guess it sleeps, it's woken, but the condition doesn't match, so it sleeps again.

I don't know why the condition was not true after the change, since MD_RECOVERY_INTR is set and the caller is raid5_sync_request.

    wait_event_lock_irq(conf->wait_for_stripe,
                        (test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) ||
                        /* the previous condition */,
                        *(conf->hash_locks + hash));

BTW, I think there are some possible ways:

1. let "echo idle" give up the reconfig_mutex if there is only a limited number of active stripes, but I feel it is ugly to check the sh number from action_store (kind of a layer violation).

2. make raid5_sync_request -> raid5_get_active_stripe able to quit from the current situation (this is what we tried, though it doesn't work so far).

3. set MD_ALLOW_SB_UPDATE as you said, though I am not sure about the safety (but maybe I am wrong).

4. given the write IO keeps coming from the upper layer, which decreases the available stripes, maybe we need to call grow_one_stripe at the beginning of raid5_make_request for this case, then call drop_one_stripe at the end of make_request.

5. maybe don't hold reconfig_mutex when trying to unregister sync_thread, like this.

    /* resync has finished, collect result */
    mddev_unlock(mddev);
    md_unregister_thread(&mddev->sync_thread);
    mddev_lock(mddev);

My suggestion would be to try 2 + 4 together, since the reproducer triggers both sync io and write io. Or try 5.
Perhaps there is a better way which I just can't find.

Thanks,
Guoqing
Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
On 1/26/21 13:58, Donald Buczek wrote:

Hmm, how about waking the waiter up in the while loop of raid5d?

    @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
                            md_check_recovery(mddev);
                            spin_lock_irq(&conf->device_lock);
                    }
    +
    +               if ((atomic_read(&conf->active_stripes)
    +                    < (conf->max_nr_stripes * 3 / 4) ||
    +                   test_bit(MD_RECOVERY_INTR, &mddev->recovery)))
    +                       wake_up(&conf->wait_for_stripe);
            }
            pr_debug("%d stripes handled\n", handled);

Hmm... With this patch on top of your other one, we still have the basic symptoms (md3_raid6 busy looping), but the sync thread is now hanging at

    root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
    [<0>] md_do_sync.cold+0x8ec/0x97c
    [<0>] md_thread+0xab/0x160
    [<0>] kthread+0x11b/0x140
    [<0>] ret_from_fork+0x22/0x30

instead, which is https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963

Not sure why recovery_active is not zero, because it is set to 0 before blk_start_plug, and raid5_sync_request returns 0 and skipped is also set to 1. Perhaps handle_stripe calls md_done_sync.

Could you double check the value of recovery_active? Or just don't wait if the resync thread is interrupted:

    wait_event(mddev->recovery_wait,
               test_bit(MD_RECOVERY_INTR, &mddev->recovery) ||
               !atomic_read(&mddev->recovery_active));

And, unlike before, "md: md3: data-check interrupted." from the pr_info two lines above appears in dmesg.

Yes, that is intentional, since MD_RECOVERY_INTR is set by write idle. Anyway, I will try the script and investigate more about the issue.

Thanks,
Guoqing
Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
Hi Donald,

On 1/26/21 10:50, Donald Buczek wrote:
[...]

    diff --git a/drivers/md/md.c b/drivers/md/md.c
    index 2d21c298ffa7..f40429843906 100644
    --- a/drivers/md/md.c
    +++ b/drivers/md/md.c
    @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char *page, size_t len)
                    clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
                    if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
                        mddev_lock(mddev) == 0) {
    +                       set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
                            flush_workqueue(md_misc_wq);
                            if (mddev->sync_thread) {
                                    set_bit(MD_RECOVERY_INTR, &mddev->recovery);
                                    md_reap_sync_thread(mddev);
                            }
    +                       clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
                            mddev_unlock(mddev);
                    }
            } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))

Yes, it could break the deadlock issue, but I am not sure if it is the right way, given we only set ALLOW_SB_UPDATE in suspend, which makes sense since the io will be quiesced, but the write idle action can't guarantee a similar thing.

Thinking of (and documenting) MD_ALLOW_SB_UPDATE as "the holder of reconfig_mutex promises not to make any changes which would exclude superblocks from being written" might make it easier to accept the usage.

I am not sure it is safe to set the flag here, since write idle can't prevent IO from the fs, while mddev_suspend can guarantee that. I prefer to make the resync thread not wait forever here.

[...]

    -           sh = raid5_get_active_stripe(conf, new_sector, previous,
    +           sh = raid5_get_active_stripe(conf, new_sector, previous, 0,

Mistake here (fourth argument added instead of third).

Thanks for checking.

[...]

Unfortunately, this patch did not fix the issue.
    root@sloth:~/linux# cat /proc/$(pgrep md3_resync)/stack
    [<0>] raid5_get_active_stripe+0x1e7/0x6b0
    [<0>] raid5_sync_request+0x2a7/0x3d0
    [<0>] md_do_sync.cold+0x3ee/0x97c
    [<0>] md_thread+0xab/0x160
    [<0>] kthread+0x11b/0x140
    [<0>] ret_from_fork+0x22/0x30

which is (according to objdump -d -Sl drivers/md/raid5.o) at https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735

Isn't it still the case that the superblocks are not written, therefore stripes are not processed, therefore the number of active stripes is not decreasing? Who is expected to wake up conf->wait_for_stripe waiters?

Hmm, how about waking the waiter up in the while loop of raid5d?

    @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
                            md_check_recovery(mddev);
                            spin_lock_irq(&conf->device_lock);
                    }
    +
    +               if ((atomic_read(&conf->active_stripes)
    +                    < (conf->max_nr_stripes * 3 / 4) ||
    +                   test_bit(MD_RECOVERY_INTR, &mddev->recovery)))
    +                       wake_up(&conf->wait_for_stripe);
            }
            pr_debug("%d stripes handled\n", handled);

If the issue still appears, then I will change the waiter to break out just when MD_RECOVERY_INTR is set, something like:

    wait_event_lock_irq(conf->wait_for_stripe,
                        (test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) ||
                        /* the previous condition */,
                        *(conf->hash_locks + hash));

Thanks,
Guoqing
Re: [PATCH v2 08/10] md: Implement ->corrupted_range()
On 1/25/21 23:55, Shiyang Ruan wrote:

With the support of ->rmap(), it is possible to obtain the superblock on a mapped device. If a pmem device is used as one target of a mapped device, we cannot obtain its superblock directly. With the help of SYSFS, the mapped device can be found on the target devices. So, we iterate the bdev->bd_holder_disks to obtain its mapped device.

Signed-off-by: Shiyang Ruan

    drivers/md/dm.c       | 61 +++
    drivers/nvdimm/pmem.c | 11 +++-
    fs/block_dev.c        | 42 -
    include/linux/genhd.h |  2 ++
    4 files changed, 107 insertions(+), 9 deletions(-)

I can't see that md raid is involved here; perhaps dm-devel needs to be cc'ed instead of the raid list. And the subject needs to be changed as well.

Thanks,
Guoqing
Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
Hi Donald,

On 1/25/21 22:32, Donald Buczek wrote:
On 25.01.21 09:54, Donald Buczek wrote:
Dear Guoqing,

a colleague of mine was able to reproduce the issue inside a vm and was able to find a procedure to run the vm into the issue within minutes (not unreliably after hours on a physical system as before). This of course helped to pinpoint the problem.

My current theory of what is happening is:

- MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING are set by md_write_start() when file-system I/O wants to do a write and the array transitions from "clean" to "active". (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308)

- Before raid5d gets to write the superblock (it is busy processing active stripes because of the sync activity), userspace wants to pause the check by `echo idle > /sys/block/mdX/md/sync_action`.

- action_store() takes the reconfig_mutex before trying to stop the sync thread. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689) The dump of struct mddev in the email of 1/19/21 confirms reconfig_mutex is non-zero.

- raid5d is running in its main loop. raid5d()->handle_active_stripes() returns a positive batch size (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329), although raid5d()->handle_active_stripes()->handle_stripe() doesn't process any stripe because of MD_SB_CHANGE_PENDING (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729). This is the reason raid5d is busy looping.

- raid5d()->md_check_recovery() is called by the raid5d main loop. One of its duties is to write the superblock if a change is pending. However, to do so, it needs either MD_ALLOW_SB_UPDATE or must be able to take the reconfig_mutex. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967, https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006) Neither is true, so the superblock is not written and MD_SB_CHANGE_PENDING is not cleared.
- (as discussed previously) the sync thread is waiting for the number of active stripes to go down and doesn't terminate. The userspace thread is waiting for the sync thread to terminate.

Does this make sense?

Yes, exactly! That was my thought too, the scenario is twisted.

The resync thread is blocked because there are too many active stripes, since raid5d is in a busy loop while SB_CHANGE_PENDING is set, which means the active stripes can't be decreased; and the echo idle cmd can't make progress given the resync thread is blocked, while the cmd still holds reconfig_mutex, which keeps raid5d in the busy loop, unable to clear the SB_CHANGE_PENDING flag. And raid5 could suffer from the same issue, I think.

Just for reference, I add the procedure which triggers the issue on the vm (with /dev/md3 mounted on /mnt/raid_md3) and some debug output:

```
#! /bin/bash
(
        while true; do
                echo "start check"
                echo check > /sys/block/md3/md/sync_action
                sleep 10
                echo "stop check"
                echo idle > /sys/block/md3/md/sync_action
                sleep 2
        done
) &
(
        while true; do
                dd bs=1k count=$((5*1024*1024)) if=/dev/zero of=/mnt/raid_md3/bigfile status=none
                sync /mnt/raid_md3/bigfile
                rm /mnt/raid_md3/bigfile
                sleep .1
        done
) &
start="$(date +%s)"
cd /sys/block/md3/md
wp_count=0
while true; do
        array_state=$(cat array_state)
        if [ "$array_state" = write-pending ]; then
                wp_count=$(($wp_count+1))
        else
                wp_count=0
        fi
        echo $(($(date +%s)-$start)) $(cat sync_action) $(cat sync_completed) $array_state $(cat stripe_cache_active)
        if [ $wp_count -ge 3 ]; then
                kill -- -$$
                exit
        fi
        sleep 1
done
```

The time this needs to trigger the issue varies from under a minute to one hour, with 5 minutes being typical.
The output ends like this:

    309 check 6283872 / 8378368 active-idle 4144
    310 check 6283872 / 8378368 active 1702
    311 check 6807528 / 8378368 active 4152
    312 check 7331184 / 8378368 clean 3021
    stop check
    313 check 7331184 / 8378368 write-pending 3905
    314 check 7331184 / 8378368 write-pending 3905
    315 check 7331184 / 8378368 write-pending 3905
    Terminated

If I add

    pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n",
             batch_size, released, mddev->sb_flags);

in raid5d after the call to handle_active_stripes and enable the debug location after the deadlock occurred, I get

    [ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
    [ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
    [ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
    [ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0
Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
Hi Donald,

On 1/19/21 12:30, Donald Buczek wrote:
Dear md-raid people,

I've reported a problem in this thread in December: "We are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions."

It was clear that this is related to "mdcheck" running, because we found that the command which should stop the scrubbing in the morning (`echo idle > /sys/devices/virtual/block/md1/md/sync_action`) is also blocked.

On 12/21/20, I've reported that the problem might be caused by a failure of the underlying block device driver, because I've found "inflight" counters of the block devices not being zero. However, this is not the case. We were able to run into the mdX_raid6 looping condition a few times again, but the non-zero inflight counters have not been observed again.

I was able to collect a lot of additional information from a blocked system.

- The `echo idle > /sys/devices/virtual/block/md1/md/sync_action` command is waiting at kthread_stop to stop the sync thread. [https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L7989]

- The sync thread ("md1_resync") does not finish, because it's blocked at

    [<0>] raid5_get_active_stripe+0x4c4/0x660  # [1]
    [<0>] raid5_sync_request+0x364/0x390
    [<0>] md_do_sync+0xb41/0x1030
    [<0>] md_thread+0x122/0x160
    [<0>] kthread+0x118/0x130
    [<0>] ret_from_fork+0x22/0x30

    [1] https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L735

- yes, gdb confirms that `conf->cache_state` is 0x03 (R5_INACTIVE_BLOCKED + R5_ALLOC_MORE)

The resync thread is blocked since it can't get a sh from the inactive list, so R5_ALLOC_MORE and R5_INACTIVE_BLOCKED are set; that is why `echo idle > /sys/devices/virtual/block/md1/md/sync_action` can't stop the sync thread.
- We have lots of active stripes:

    root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/stripe_cache_active
    27534

There are too many active stripes, so this is false:

    atomic_read(&conf->active_stripes) < (conf->max_nr_stripes * 3 / 4)

so raid5_get_active_stripe has to wait till it becomes true; either increase max_nr_stripes or decrease active_stripes.

1. Increase max_nr_stripes: since the "mdX_raid6 process seems to go into a busy loop" and R5_ALLOC_MORE is set, if there is enough memory, I suppose mdX_raid6 (raid5d) could alloc a new stripe in grow_one_stripe and increase max_nr_stripes. So please check the memory usage of your system. Another thing you can try is to increase the number of sh manually by writing a new number to stripe_cache_size if there is enough memory.

2. Or decrease active_stripes: this is supposed to be done by do_release_stripe, but STRIPE_HANDLE is set.

- handle_stripe() doesn't make progress:

    echo "func handle_stripe =pflt" > /sys/kernel/debug/dynamic_debug/control

In dmesg, we see the debug output from https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4925 but never from https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4958:

    [171908.896651] [1388] handle_stripe:4929: handling stripe 4947089856, state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0
    [171908.896657] [1388] handle_stripe:4929: handling stripe 4947089872, state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0
    [171908.896669] [1388] handle_stripe:4929: handling stripe 4947089912, state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0

The sector numbers repeat after some time.
We have only the following variants of stripe state and "check":

    state=0x2031 cnt=1, check:0  # ACTIVE        +INSYNC+REPLACED+IO_STARTED, check_state_idle
    state=0x2029 cnt=1, check:4  # ACTIVE+SYNCING       +REPLACED+IO_STARTED, check_state_check_result
    state=0x2009 cnt=1, check:0  # ACTIVE+SYNCING                +IO_STARTED, check_state_idle

- We have MD_SB_CHANGE_PENDING set:

Because of the MD_SB_CHANGE_PENDING flag, do_release_stripe can't call atomic_dec(&conf->active_stripes). Normally, SB_CHANGE_PENDING is cleared from md_update_sb, which could be called by md_reap_sync_thread. But md_reap_sync_thread is stuck unregistering the sync_thread (which is blocked in raid5_get_active_stripe).

Still, I don't understand well why mdX_raid6 is in a busy loop; maybe raid5d can't break from the while(1) loop because "if (!batch_size && !released)" is false. I would assume released is '0', since

    released_stripes = {
        first = 0x0
    }

And __get_priority_stripe fetches a sh from conf->handle_list and deletes it from handle_list, handle_stripe marks the state of the sh with STRIPE_HANDLE, then do_release_stripe puts the sh back onto handle_list. So batch_size can't be '0', and handle_active_stripes in the loop repeats the process in the busy loop. This is my best guess to explain the busy loop situation.
Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
Hi Donald,

On 12/2/20 18:28, Donald Buczek wrote:
Dear Guoqing,

unfortunately, the patch didn't fix the problem (unless I messed it up with my logging). This is what I used:

    --- a/drivers/md/md.c
    +++ b/drivers/md/md.c
    @@ -9305,6 +9305,14 @@ void md_check_recovery(struct mddev *mddev)
                            clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
                            goto unlock;
                    }

I think you can add the check of RECOVERY_CHECK in the above part instead of adding a new part.

    +               if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
    +                   (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
    +                    test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
    +                       /* resync/recovery still happening */
    +                       pr_info("md: XXX BUGFIX applied\n");
    +                       clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
    +                       goto unlock;
    +               }
                    if (mddev->sync_thread) {
                            md_reap_sync_thread(mddev);
                            goto unlock;

With pausing and continuing the check four times an hour, I could trigger the problem after about 48 hours. This time, the other device (md0) has locked up on `echo idle > /sys/devices/virtual/block/md0/md/sync_action`, while the check of md1 is still ongoing:

Without the patch, md0 was good while md1 was locked. So the patch switches the status of the two arrays, a little weird ... What is the stack of the process? I guess it is the same as the stack of 2 in your previous mail, but just to confirm.

    Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
    md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
          109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
          [=>...................]  check =  8.5% (666852112/7813894144) finish=1271.2min speed=93701K/sec
          bitmap: 0/59 pages [0KB], 65536KB chunk

    md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
          109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
          [>....................]  check =  0.2% (19510348/7813894144) finish=253779.6min speed=511K/sec
          bitmap: 0/59 pages [0KB], 65536KB chunk

after 1 minute:

    Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
    md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
          109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
          [=>...................]  check =  8.6% (674914560/7813894144) finish=941.1min speed=126418K/sec
          bitmap: 0/59 pages [0KB], 65536KB chunk

    md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
          109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
          [>....................]  check =  0.2% (19510348/7813894144) finish=256805.0min speed=505K/sec
          bitmap: 0/59 pages [0KB], 65536KB chunk

A data point I didn't mention in my previous mail is that the mdX_resync thread is not gone when the problem occurs:

    buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ ps -Af | fgrep [md
    root       134     2  0 Nov30 ?        00:00:00 [md]
    root      1321     2 27 Nov30 ?        12:57:14 [md0_raid6]
    root      1454     2 26 Nov30 ?        12:37:23 [md1_raid6]
    root      5845     2  0 16:20 ?        00:00:30 [md0_resync]
    root      5855     2 13 16:20 ?        00:14:11 [md1_resync]
    buczek    9880  9072  0 18:05 pts/0    00:00:00 grep -F [md
    buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ sudo cat /proc/5845/stack
    [<0>] md_bitmap_cond_end_sync+0x12d/0x170
    [<0>] raid5_sync_request+0x24b/0x390
    [<0>] md_do_sync+0xb41/0x1030
    [<0>] md_thread+0x122/0x160
    [<0>] kthread+0x118/0x130
    [<0>] ret_from_fork+0x1f/0x30

I guess md_bitmap_cond_end_sync+0x12d is the `wait_event(bitmap->mddev->recovery_wait, atomic_read(&bitmap->mddev->recovery_active) == 0);` in md-bitmap.c.

Could be. If so, then I think md_done_sync was not triggered by the path md0_raid6 -> ... -> handle_stripe. I'd suggest comparing the stacks between md0 and md1 to find the difference.

Thanks,
Guoqing
Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
On 11/28/20 13:25, Donald Buczek wrote:
Dear Linux mdraid people,

we are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions.

The last time this happened (in this case with Linux 5.10.0-rc4), I took some data.

The triggering event seems to be the md_check cron job trying to pause the ongoing check operation in the morning with

    echo idle > /sys/devices/virtual/block/md1/md/sync_action

This doesn't complete. Here's /proc/<pid>/stack of this process:

    root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# ps -fp 2
    UID          PID    PPID  C STIME TTY          TIME CMD
    root           2   23331  0 02:00 ?        00:00:00 /bin/bash /usr/bin/mdcheck --continue --duration 06:00
    root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# cat /proc/2/stack
    [<0>] kthread_stop+0x6e/0x150
    [<0>] md_unregister_thread+0x3e/0x70
    [<0>] md_reap_sync_thread+0x1f/0x1e0
    [<0>] action_store+0x141/0x2b0
    [<0>] md_attr_store+0x71/0xb0
    [<0>] kernfs_fop_write+0x113/0x1a0
    [<0>] vfs_write+0xbc/0x250
    [<0>] ksys_write+0xa1/0xe0
    [<0>] do_syscall_64+0x33/0x40
    [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Note that md0 has been paused successfully just before.

What is the personality of md0? Is it also raid6?

    2020-11-27T02:00:01+01:00 done CROND[2]: (root) CMD (/usr/bin/mdcheck --continue --duration "06:00")
    2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md0 from 10623180920
    2020-11-27T02:00:01.382994+01:00 done kernel: [378596.606381] md: data-check of RAID array md0
    2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md1 from 11582849320
    2020-11-27T02:00:01.437999+01:00 done kernel: [378596.661559] md: data-check of RAID array md1
    2020-11-27T06:00:01.842003+01:00 done kernel: [392996.625147] md: md0: data-check interrupted.
    2020-11-27T06:00:02+01:00 done root: pause checking /dev/md0 at 13351127680
    2020-11-27T06:00:02.338989+01:00 done kernel: [392997.122520] md: md1: data-check interrupted.

[nothing related following]

After that, we see md1_raid6 in a busy loop:

     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    2376 root      20   0       0      0      0 R 100.0   0.0   1387:38 md1_raid6

Seems the reap sync thread was trying to stop md1_raid6 while md1_raid6 was triggered again and again.

Also, all processes doing I/O to the md device block. This is /proc/mdstat:

    Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
    md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
          109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
          [==================>..]  check = 94.0% (7350290348/7813894144) finish=57189.3min speed=135K/sec
          bitmap: 0/59 pages [0KB], 65536KB chunk

    md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
          109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
          bitmap: 0/59 pages [0KB], 65536KB chunk

So the RECOVERY_CHECK flag should be set; not sure if this simple change helps, but you may give it a try.

    diff --git a/drivers/md/md.c b/drivers/md/md.c
    index 98bac4f..e2697d0 100644
    --- a/drivers/md/md.c
    +++ b/drivers/md/md.c
    @@ -9300,7 +9300,8 @@ void md_check_recovery(struct mddev *mddev)
                            md_update_sb(mddev, 0);

                    if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
    -                   !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
    +                   (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
    +                    test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
                            /* resync/recovery still happening */
                            clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
                            goto unlock;

Thanks,
Guoqing
Re: decruft the early init / initrd / initramfs code v2
On 7/15/20 8:51 AM, Christoph Hellwig wrote:
On Tue, Jul 14, 2020 at 12:34:45PM -0700, Linus Torvalds wrote:
On Tue, Jul 14, 2020 at 12:06 PM Christoph Hellwig wrote:

this series starts to move the early init code away from requiring KERNEL_DS to be implicitly set during early startup. It does so by first removing legacy unused cruft, and then switches the code away from struct file based APIs to our more usual in-kernel APIs.

Looks good to me, with the added note on the utimes cruft too as a further cleanup (separate patch). So you can add my acked-by. I _would_ like the md parts to get a few more acks. I see the one from Song Liu; anybody else in md land willing to go through those patches? They were the bulk of it, and the least obvious to me, because I don't know that code at all.

Song is the maintainer. Neil is the only person I could think of that also knows the old md code pretty well. Guoqing has contributed a lot lately, but the code touched here is rather historic (and not used very much at all these days, as people use modular md and initramfs based detection).

Hi Christoph,

I just cloned the tree; it seems there is a compile issue that you need to resolve.

    hch-misc$ make -j8
      DESCEND  objtool
      CALL     scripts/atomic/check-atomics.sh
      CALL     scripts/checksyscalls.sh
      CHK      include/generated/compile.h
      CC       drivers/md/md.o
      CC       drivers/md/md-bitmap.o
      CC       drivers/md/md-autodetect.o
      AR       drivers/perf/built-in.a
      CC       drivers/md/dm.o
      AR       drivers/hwtracing/intel_th/built-in.a
      CC       drivers/nvmem/core.o
    drivers/md/md.c:7809:45: error: static declaration of ‘md_fops’ follows non-static declaration
     static const struct block_device_operations md_fops =
                                                 ^~~
    drivers/md/md.c:329:38: note: previous declaration of ‘md_fops’ was here
     const struct block_device_operations md_fops;
                                          ^~~
    scripts/Makefile.build:280: recipe for target 'drivers/md/md.o' failed
    make[2]: *** [drivers/md/md.o] Error 1
    make[2]: *** Waiting for unfinished jobs

And for the changes to md, feel free to add my Acked-by if it could help.
Thanks, Guoqing
Re: [PATCH 1/6] md: switch to ->check_events for media change notifications
On 7/8/20 3:23 PM, Matthew Wilcox wrote:
On Wed, Jul 08, 2020 at 03:17:31PM +0200, Guoqing Jiang wrote:
On 7/8/20 2:25 PM, Christoph Hellwig wrote:

    -static int md_media_changed(struct gendisk *disk)
    -{
    -       struct mddev *mddev = disk->private_data;
    -
    -       return mddev->changed;
    -}

Maybe we can remove "changed" from struct mddev since no one reads it after the change.

You missed this hunk:

    +static unsigned int md_check_events(struct gendisk *disk, unsigned int clearing)
     {
            struct mddev *mddev = disk->private_data;
    +       unsigned int ret = 0;

    +       if (mddev->changed)
    +               ret = DISK_EVENT_MEDIA_CHANGE;
            mddev->changed = 0;
    -       return 0;
    +       return ret;
     }

Yes, sorry for the noise.

Thanks,
Guoqing
Re: [PATCH 1/6] md: switch to ->check_events for media change notifications
On 7/8/20 2:25 PM, Christoph Hellwig wrote:

    -static int md_media_changed(struct gendisk *disk)
    -{
    -       struct mddev *mddev = disk->private_data;
    -
    -       return mddev->changed;
    -}

Maybe we can remove "changed" from struct mddev since no one reads it after the change.

Thanks,
Guoqing
Re: [PATCH 08/10] orangefs: use attach/detach_page_private
On 5/26/20 11:54 PM, Mike Marshall wrote: I apologize for not mentioning that I ran this patch set through orangefs xfstests at 5.7 rc5 with no problems or regressions. Glad to hear that, thanks for your effort. Thanks, Guoqing
[PATCH] mm_types.h: change set_page_private to inline function
Change it to an inline function to make callers use the proper argument type. And there is no need for it to be a macro, per Andrew's comment [1].

[1] https://lore.kernel.org/lkml/20200518221235.1fa32c38e5766113f78e3...@linux-foundation.org/

Signed-off-by: Guoqing Jiang

 include/linux/mm_types.h | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

    diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
    index e88dc5c65c01..64ede5f150dc 100644
    --- a/include/linux/mm_types.h
    +++ b/include/linux/mm_types.h
    @@ -240,7 +240,11 @@ static inline atomic_t *compound_pincount_ptr(struct page *page)
     #define PAGE_FRAG_CACHE_MAX_ORDER      get_order(PAGE_FRAG_CACHE_MAX_SIZE)

     #define page_private(page)             ((page)->private)
    -#define set_page_private(page, v)      ((page)->private = (v))
    +
    +static inline void set_page_private(struct page *page, unsigned long private)
    +{
    +       page->private = private;
    +}

     struct page_frag_cache {
            void * va;
    --
    2.17.1
Re: [PATCH 10/10] mm/migrate.c: call detach_page_private to cleanup code
On 5/23/20 1:53 AM, Andrew Morton wrote:
On Fri, 22 May 2020 09:18:25 +0200 Guoqing Jiang wrote:

    -       ClearPagePrivate(page);
    -       set_page_private(newpage, page_private(page));
    -       set_page_private(page, 0);
    -       put_page(page);
    +       set_page_private(newpage, detach_page_private(page));

attach_page_private(newpage, detach_page_private(page));

Matthew had suggested it as follows, but I am not sure if we can reorder the setup of the bh and setting PagePrivate, so I didn't want to break the original syntax.

    @@ -797,11 +797,7 @@ static int __buffer_migrate_page(struct address_space *mapping,
            if (rc != MIGRATEPAGE_SUCCESS)
                    goto unlock_buffers;

    -       ClearPagePrivate(page);
    -       set_page_private(newpage, page_private(page));
    -       set_page_private(page, 0);
    -       put_page(page);
    -       get_page(newpage);
    +       attach_page_private(newpage, detach_page_private(page));

            bh = head;
            do {
    @@ -810,8 +806,6 @@ static int __buffer_migrate_page(struct address_space *mapping,
            } while (bh != head);

    -       SetPagePrivate(newpage);
    -
            if (mode != MIGRATE_SYNC_NO_COPY)

This is OK - coherency between PG_private and the page's buffer ring is maintained by holding lock_page().

Appreciate your explanation.

Thanks,
Guoqing

I have (effectively) applied the above change.
Re: [PATCH 10/10] mm/migrate.c: call detach_page_private to cleanup code
Hi Dave,

On 5/22/20 12:52 AM, Dave Chinner wrote:
On Sun, May 17, 2020 at 11:47:18PM +0200, Guoqing Jiang wrote:
We can clean up the code a little by calling detach_page_private here.
Signed-off-by: Guoqing Jiang
---
No change since RFC V3.
mm/migrate.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 5fed0305d2ec..f99502bc113c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -804,10 +804,7 @@ static int __buffer_migrate_page(struct address_space *mapping,
if (rc != MIGRATEPAGE_SUCCESS)
goto unlock_buffers;
- ClearPagePrivate(page);
- set_page_private(newpage, page_private(page));
- set_page_private(page, 0);
- put_page(page);
+ set_page_private(newpage, detach_page_private(page));
attach_page_private(newpage, detach_page_private(page));
Matthew had suggested it as follows, but I was not sure if we can reorder the setup of the bh and the setting of PagePrivate, so I didn't want to break the original syntax.
@@ -797,11 +797,7 @@ static int __buffer_migrate_page(struct address_space *mapping,
if (rc != MIGRATEPAGE_SUCCESS)
goto unlock_buffers;
- ClearPagePrivate(page);
- set_page_private(newpage, page_private(page));
- set_page_private(page, 0);
- put_page(page);
- get_page(newpage);
+ attach_page_private(newpage, detach_page_private(page));
bh = head;
do {
@@ -810,8 +806,6 @@ static int __buffer_migrate_page(struct address_space *mapping,
} while (bh != head);
- SetPagePrivate(newpage);
-
if (mode != MIGRATE_SYNC_NO_COPY)
[1]. https://lore.kernel.org/lkml/20200502004158.gd29...@bombadil.infradead.org/
[2]. https://lore.kernel.org/lkml/e4d5ddc0-877f-6499-f697-2b7c0ddbf...@cloud.ionos.com/
Thanks,
Guoqing
[UPDATE PATCH 10/10] mm/migrate.c: call detach_page_private to cleanup code
We can clean up the code a little by calling detach_page_private here.

Signed-off-by: Guoqing Jiang
---
Add the cast to fix the type mismatch warning, sorry for the mistake.
mm/migrate.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 7160c1556f79..44546d407e40 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -797,10 +797,7 @@ static int __buffer_migrate_page(struct address_space *mapping,
if (rc != MIGRATEPAGE_SUCCESS)
goto unlock_buffers;
- ClearPagePrivate(page);
- set_page_private(newpage, page_private(page));
- set_page_private(page, 0);
- put_page(page);
+ set_page_private(newpage, (unsigned long)detach_page_private(page));
get_page(newpage);
bh = head;
--
2.17.1
Re: [PATCH 10/10] mm/migrate.c: call detach_page_private to cleanup code
Hi Andrew,

On 5/19/20 9:35 AM, Guoqing Jiang wrote:
On 5/19/20 7:12 AM, Andrew Morton wrote:
On Sun, 17 May 2020 23:47:18 +0200 Guoqing Jiang wrote:
We can clean up the code a little by calling detach_page_private here.
...
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -804,10 +804,7 @@ static int __buffer_migrate_page(struct address_space *mapping,
if (rc != MIGRATEPAGE_SUCCESS)
goto unlock_buffers;
- ClearPagePrivate(page);
- set_page_private(newpage, page_private(page));
- set_page_private(page, 0);
- put_page(page);
+ set_page_private(newpage, detach_page_private(page));
get_page(newpage);
bh = head;
mm/migrate.c: In function '__buffer_migrate_page':
./include/linux/mm_types.h:243:52: warning: assignment makes integer from pointer without a cast [-Wint-conversion]
#define set_page_private(page, v) ((page)->private = (v))
^
mm/migrate.c:800:2: note: in expansion of macro 'set_page_private'
set_page_private(newpage, detach_page_private(page));
^~~~
The fact that set_page_private(detach_page_private()) generates a type mismatch warning seems deeply wrong, surely. Please let's get the types sorted out - either unsigned long or void *, not half-one and half-the-other. Whatever needs the least typecasting at callsites, I suggest.
Sorry about that, I should have noticed the warning before. I will double check whether other places need the typecast or not, then send a new version.
Only this patch missed the typecast. I guess I just need to send an updated patch to replace this one (I am fine with sending a new patch set if you want), sorry again for the trouble.
And can we please implement set_page_private() and page_private() with inlined C code? There is no need for these to be macros.
Just did a quick change.
-#define page_private(page) ((page)->private)
-#define set_page_private(page, v) ((page)->private = (v))
+static inline unsigned long page_private(struct page *page)
+{
+ return page->private;
+}
+
+static inline void set_page_private(struct page *page, unsigned long priv_data)
+{
+ page->private = priv_data;
+}
Then I get an error like:
fs/erofs/zdata.h: In function ‘z_erofs_onlinepage_index’:
fs/erofs/zdata.h:126:8: error: lvalue required as unary ‘&’ operand
u.v = &page_private(page);
^
I guess it is better to keep page_private as a macro, please correct me in case I missed something.
Lots of problems need to be fixed if we change page_private to an inline function, so I think it is better to keep it and only convert set_page_private.
mm/compaction.c: In function ‘isolate_migratepages_block’:
./include/linux/compiler.h:287:20: error: lvalue required as unary ‘&’ operand
__read_once_size(&(x), __u.__c, sizeof(x)); \
^
./include/linux/compiler.h:293:22: note: in expansion of macro ‘__READ_ONCE’
#define READ_ONCE(x) __READ_ONCE(x, 1)
^~~
mm/internal.h:293:34: note: in expansion of macro ‘READ_ONCE’
#define page_order_unsafe(page) READ_ONCE(page_private(page))
Thanks,
Guoqing
Re: [PATCH 10/10] mm/migrate.c: call detach_page_private to cleanup code
On 5/19/20 12:06 PM, Gao Xiang wrote:
On Tue, May 19, 2020 at 09:35:59AM +0200, Guoqing Jiang wrote:
On 5/19/20 7:12 AM, Andrew Morton wrote:
On Sun, 17 May 2020 23:47:18 +0200 Guoqing Jiang wrote:
We can clean up the code a little by calling detach_page_private here.
...
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -804,10 +804,7 @@ static int __buffer_migrate_page(struct address_space *mapping,
if (rc != MIGRATEPAGE_SUCCESS)
goto unlock_buffers;
- ClearPagePrivate(page);
- set_page_private(newpage, page_private(page));
- set_page_private(page, 0);
- put_page(page);
+ set_page_private(newpage, detach_page_private(page));
get_page(newpage);
bh = head;
mm/migrate.c: In function '__buffer_migrate_page':
./include/linux/mm_types.h:243:52: warning: assignment makes integer from pointer without a cast [-Wint-conversion]
#define set_page_private(page, v) ((page)->private = (v))
^
mm/migrate.c:800:2: note: in expansion of macro 'set_page_private'
set_page_private(newpage, detach_page_private(page));
^~~~
The fact that set_page_private(detach_page_private()) generates a type mismatch warning seems deeply wrong, surely. Please let's get the types sorted out - either unsigned long or void *, not half-one and half-the-other. Whatever needs the least typecasting at callsites, I suggest.
Sorry about that, I should have noticed the warning before. I will double check whether other places need the typecast or not, then send a new version.
And can we please implement set_page_private() and page_private() with inlined C code? There is no need for these to be macros.
Just did a quick change.
-#define page_private(page) ((page)->private)
-#define set_page_private(page, v) ((page)->private = (v))
+static inline unsigned long page_private(struct page *page)
+{
+ return page->private;
+}
+
+static inline void set_page_private(struct page *page, unsigned long priv_data)
+{
+ page->private = priv_data;
+}
Then I get an error like:
fs/erofs/zdata.h: In function ‘z_erofs_onlinepage_index’:
fs/erofs/zdata.h:126:8: error: lvalue required as unary ‘&’ operand
u.v = &page_private(page);
^
I guess it is better to keep page_private as a macro, please correct me in case I missed something.
I guess that you could Cc me in the reply.
Sorry for not doing that ...
In that case, EROFS uses page->private as an atomic integer to trace 2 partial subpages in one page. I think that you could also use &page->private directly to replace &page_private(page) here, since I didn't find any hint to pick between &page_private(page) and &page->private.
Thanks for the input, I just did a quick test, so I need to investigate more. And I think it is better to start another thread to change those macros to inline functions, then fix the related issues due to the change.
In addition, I found some limitations of the new {attach,detach}_page_private helpers (that is why I was interested in this series at that time [1] [2], but I gave up finally), since many patterns (not all) in EROFS are:
io_submit (origin, page locked):
attach_page_private(page);
...
put_page(page);
end_io (page locked):
SetPageUptodate(page);
unlock_page(page);
Since the page is always locked, io_submit could be simplified as
set_page_private(page, ...);
SetPagePrivate(page);
which saves both one temporary get_page(page) and one put_page(page), since it can be regarded as safe with the page locked.
SetPageUptodate is not called inside {attach,detach}_page_private, so I could have misunderstood your point; maybe you want the new pair to handle locked pages, care to elaborate more?
btw, I noticed the patchset versions are PATCH [3], RFC PATCH [4], RFC PATCH v2 [5], RFC PATCH v3 [6], PATCH [7]. I also noticed the patchset title was changed once, so it could be somewhat harder to trace the whole history of the discussion.
[1] https://lore.kernel.org/linux-fsdevel/20200419051404.GA30986@hsiangkao-HP-ZHAN-66-Pro-G1/
[2] https://lore.kernel.org/linux-fsdevel/20200427025752.GA3979@hsiangkao-HP-ZHAN-66-Pro-G1/
[3] https://lore.kernel.org/linux-fsdevel/20200418225123.31850-1-guoqing.ji...@cloud.ionos.com/
[4] https://lore.kernel.org/linux-fsdevel/20200426214925.10970-1-guoqing.ji...@cloud.ionos.com/
[5] https://lore.kernel.org/linux-fsdevel/20200430214450.10662-1-guoqing.ji...@cloud.ionos.com/
[6] https://lore.kernel.org/linux-fsdevel/20200507214400.15785-1-guoqing.ji...@cloud.ionos.com/
[7] https://lore.kernel.org/linux-fsdevel/20200517214718.468-1-guoqing.ji...@cloud.ionos.com/
All the cover letters of those series are here:
RFC V3: https://lore.kernel.org/lkml/20200507214400.15785-1-guoqing.ji...@cloud.ionos.com/
RFC V2: https://lore.kernel.org/lkml/20200430214450.10662-1-guoqing.ji...@cloud.ionos.com/
RFC: https://lore.kernel.org/lkml/20200
Re: [PATCH 10/10] mm/migrate.c: call detach_page_private to cleanup code
On 5/19/20 7:12 AM, Andrew Morton wrote:
On Sun, 17 May 2020 23:47:18 +0200 Guoqing Jiang wrote:
We can clean up the code a little by calling detach_page_private here.
...
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -804,10 +804,7 @@ static int __buffer_migrate_page(struct address_space *mapping,
if (rc != MIGRATEPAGE_SUCCESS)
goto unlock_buffers;
- ClearPagePrivate(page);
- set_page_private(newpage, page_private(page));
- set_page_private(page, 0);
- put_page(page);
+ set_page_private(newpage, detach_page_private(page));
get_page(newpage);
bh = head;
mm/migrate.c: In function '__buffer_migrate_page':
./include/linux/mm_types.h:243:52: warning: assignment makes integer from pointer without a cast [-Wint-conversion]
#define set_page_private(page, v) ((page)->private = (v))
^
mm/migrate.c:800:2: note: in expansion of macro 'set_page_private'
set_page_private(newpage, detach_page_private(page));
^~~~
The fact that set_page_private(detach_page_private()) generates a type mismatch warning seems deeply wrong, surely. Please let's get the types sorted out - either unsigned long or void *, not half-one and half-the-other. Whatever needs the least typecasting at callsites, I suggest.
Sorry about that, I should have noticed the warning before. I will double check whether other places need the typecast or not, then send a new version.
And can we please implement set_page_private() and page_private() with inlined C code? There is no need for these to be macros.
Just did a quick change.
-#define page_private(page) ((page)->private)
-#define set_page_private(page, v) ((page)->private = (v))
+static inline unsigned long page_private(struct page *page)
+{
+ return page->private;
+}
+
+static inline void set_page_private(struct page *page, unsigned long priv_data)
+{
+ page->private = priv_data;
+}
Then I get an error like:
fs/erofs/zdata.h: In function ‘z_erofs_onlinepage_index’:
fs/erofs/zdata.h:126:8: error: lvalue required as unary ‘&’ operand
u.v = &page_private(page);
^
I guess it is better to keep page_private as a macro, please correct me in case I missed something.
Thanks,
Guoqing
[PATCH 03/10] btrfs: use attach/detach_page_private
Since the new pair function is introduced, we can call them to clean the code in btrfs. Cc: Chris Mason Cc: Josef Bacik Cc: David Sterba Cc: linux-bt...@vger.kernel.org Signed-off-by: Guoqing Jiang --- No change since RFC V3. RFC V2 -> RFC V3 1. rename clear_page_private to detach_page_private. RFC -> RFC V2 1. change the name of new functions to attach/clear_page_private. 2. call attach_page_private(newpage, clear_page_private(page)) to cleanup code further as suggested by Dave Chinner. fs/btrfs/disk-io.c | 4 +--- fs/btrfs/extent_io.c | 21 ++--- fs/btrfs/inode.c | 23 +-- 3 files changed, 12 insertions(+), 36 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 714b57553ed6..44745d5e05cf 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -976,9 +976,7 @@ static void btree_invalidatepage(struct page *page, unsigned int offset, btrfs_warn(BTRFS_I(page->mapping->host)->root->fs_info, "page private not zero on page %llu", (unsigned long long)page_offset(page)); - ClearPagePrivate(page); - set_page_private(page, 0); - put_page(page); + detach_page_private(page); } } diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 451d17bfafd8..704239546093 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -3099,22 +3099,16 @@ static int submit_extent_page(unsigned int opf, static void attach_extent_buffer_page(struct extent_buffer *eb, struct page *page) { - if (!PagePrivate(page)) { - SetPagePrivate(page); - get_page(page); - set_page_private(page, (unsigned long)eb); - } else { + if (!PagePrivate(page)) + attach_page_private(page, eb); + else WARN_ON(page->private != (unsigned long)eb); - } } void set_page_extent_mapped(struct page *page) { - if (!PagePrivate(page)) { - SetPagePrivate(page); - get_page(page); - set_page_private(page, EXTENT_PAGE_PRIVATE); - } + if (!PagePrivate(page)) + attach_page_private(page, (void *)EXTENT_PAGE_PRIVATE); } static struct extent_map * @@ -4935,10 +4929,7 @@ static void 
btrfs_release_extent_buffer_pages(struct extent_buffer *eb) * We need to make sure we haven't be attached * to a new eb. */ - ClearPagePrivate(page); - set_page_private(page, 0); - /* One for the page private */ - put_page(page); + detach_page_private(page); } if (mapped) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 037efc25d993..6c2d6ecedd6a 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7927,11 +7927,8 @@ static void btrfs_readahead(struct readahead_control *rac) static int __btrfs_releasepage(struct page *page, gfp_t gfp_flags) { int ret = try_release_extent_mapping(page, gfp_flags); - if (ret == 1) { - ClearPagePrivate(page); - set_page_private(page, 0); - put_page(page); - } + if (ret == 1) + detach_page_private(page); return ret; } @@ -7953,14 +7950,8 @@ static int btrfs_migratepage(struct address_space *mapping, if (ret != MIGRATEPAGE_SUCCESS) return ret; - if (page_has_private(page)) { - ClearPagePrivate(page); - get_page(newpage); - set_page_private(newpage, page_private(page)); - set_page_private(page, 0); - put_page(page); - SetPagePrivate(newpage); - } + if (page_has_private(page)) + attach_page_private(newpage, detach_page_private(page)); if (PagePrivate2(page)) { ClearPagePrivate2(page); @@ -8082,11 +8073,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset, } ClearPageChecked(page); - if (PagePrivate(page)) { - ClearPagePrivate(page); - set_page_private(page, 0); - put_page(page); - } + detach_page_private(page); } /* -- 2.17.1
[PATCH 09/10] buffer_head.h: remove attach_page_buffers
All the callers have replaced attach_page_buffers with the new function attach_page_private, so remove it.

Cc: Thomas Gleixner
Cc: Sebastian Andrzej Siewior
Cc: Roman Gushchin
Cc: Andreas Dilger
Signed-off-by: Guoqing Jiang
---
No change since RFC.
include/linux/buffer_head.h | 8 --------
1 file changed, 8 deletions(-)
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 15b765a181b8..22fb11e2d2e0 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -272,14 +272,6 @@ void buffer_init(void);
* inline definitions
*/
-static inline void attach_page_buffers(struct page *page,
- struct buffer_head *head)
-{
- get_page(page);
- SetPagePrivate(page);
- set_page_private(page, (unsigned long)head);
-}
-
static inline void get_bh(struct buffer_head *bh)
{
atomic_inc(&bh->b_count);
--
2.17.1
[PATCH 10/10] mm/migrate.c: call detach_page_private to cleanup code
We can clean up the code a little by calling detach_page_private here.

Signed-off-by: Guoqing Jiang
---
No change since RFC V3.
mm/migrate.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 5fed0305d2ec..f99502bc113c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -804,10 +804,7 @@ static int __buffer_migrate_page(struct address_space *mapping,
if (rc != MIGRATEPAGE_SUCCESS)
goto unlock_buffers;
- ClearPagePrivate(page);
- set_page_private(newpage, page_private(page));
- set_page_private(page, 0);
- put_page(page);
+ set_page_private(newpage, detach_page_private(page));
get_page(newpage);
bh = head;
--
2.17.1
[PATCH 08/10] orangefs: use attach/detach_page_private
Since the new pair function is introduced, we can call them to clean the code in orangefs. Cc: Mike Marshall Cc: Martin Brandenburg Cc: de...@lists.orangefs.org Signed-off-by: Guoqing Jiang --- No change since RFC V3. RFC V2 -> RFC V3 1. rename clear_page_private to detach_page_private. RFC -> RFC V2 1. change the name of new functions to attach/clear_page_private. 2. avoid potential use-after-free as suggested by Dave Chinner. fs/orangefs/inode.c | 32 ++-- 1 file changed, 6 insertions(+), 26 deletions(-) diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c index 12ae630fbed7..48f0547d4850 100644 --- a/fs/orangefs/inode.c +++ b/fs/orangefs/inode.c @@ -62,12 +62,7 @@ static int orangefs_writepage_locked(struct page *page, } else { ret = 0; } - if (wr) { - kfree(wr); - set_page_private(page, 0); - ClearPagePrivate(page); - put_page(page); - } + kfree(detach_page_private(page)); return ret; } @@ -409,9 +404,7 @@ static int orangefs_write_begin(struct file *file, wr->len = len; wr->uid = current_fsuid(); wr->gid = current_fsgid(); - SetPagePrivate(page); - set_page_private(page, (unsigned long)wr); - get_page(page); + attach_page_private(page, wr); okay: return 0; } @@ -459,18 +452,12 @@ static void orangefs_invalidatepage(struct page *page, wr = (struct orangefs_write_range *)page_private(page); if (offset == 0 && length == PAGE_SIZE) { - kfree((struct orangefs_write_range *)page_private(page)); - set_page_private(page, 0); - ClearPagePrivate(page); - put_page(page); + kfree(detach_page_private(page)); return; /* write range entirely within invalidate range (or equal) */ } else if (page_offset(page) + offset <= wr->pos && wr->pos + wr->len <= page_offset(page) + offset + length) { - kfree((struct orangefs_write_range *)page_private(page)); - set_page_private(page, 0); - ClearPagePrivate(page); - put_page(page); + kfree(detach_page_private(page)); /* XXX is this right? 
only caller in fs */ cancel_dirty_page(page); return; @@ -535,12 +522,7 @@ static int orangefs_releasepage(struct page *page, gfp_t foo) static void orangefs_freepage(struct page *page) { - if (PagePrivate(page)) { - kfree((struct orangefs_write_range *)page_private(page)); - set_page_private(page, 0); - ClearPagePrivate(page); - put_page(page); - } + kfree(detach_page_private(page)); } static int orangefs_launder_page(struct page *page) @@ -740,9 +722,7 @@ vm_fault_t orangefs_page_mkwrite(struct vm_fault *vmf) wr->len = PAGE_SIZE; wr->uid = current_fsuid(); wr->gid = current_fsgid(); - SetPagePrivate(page); - set_page_private(page, (unsigned long)wr); - get_page(page); + attach_page_private(page, wr); okay: file_update_time(vmf->vma->vm_file); -- 2.17.1
[PATCH 02/10] md: remove __clear_page_buffers and use attach/detach_page_private
After introducing attach/detach_page_private in pagemap.h, we can remove the duplicated code and call the new functions.

Cc: Song Liu
Cc: linux-r...@vger.kernel.org
Acked-by: Song Liu
Signed-off-by: Guoqing Jiang
---
No change since RFC V3.
RFC V2 -> RFC V3
1. rename clear_page_private to detach_page_private.
RFC -> RFC V2
1. change the name of new functions to attach/clear_page_private.
drivers/md/md-bitmap.c | 12 ++----------
1 file changed, 2 insertions(+), 10 deletions(-)
diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index b952bd45bd6a..95a5f3757fa3 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -324,14 +324,6 @@ static void end_bitmap_write(struct buffer_head *bh, int uptodate)
wake_up(&bitmap->write_wait);
}
-/* copied from buffer.c */
-static void
-__clear_page_buffers(struct page *page)
-{
- ClearPagePrivate(page);
- set_page_private(page, 0);
- put_page(page);
-}
static void free_buffers(struct page *page)
{
struct buffer_head *bh;
@@ -345,7 +337,7 @@ static void free_buffers(struct page *page)
free_buffer_head(bh);
bh = next;
}
- __clear_page_buffers(page);
+ detach_page_private(page);
put_page(page);
}
@@ -374,7 +366,7 @@ static int read_page(struct file *file, unsigned long index,
ret = -ENOMEM;
goto out;
}
- attach_page_buffers(page, bh);
+ attach_page_private(page, bh);
blk_cur = index << (PAGE_SHIFT - inode->i_blkbits);
while (bh) {
block = blk_cur;
--
2.17.1
[PATCH 05/10] f2fs: use attach/detach_page_private
Since the new pair of functions is introduced, we can call them to clean up the code in f2fs.h.

Cc: Jaegeuk Kim
Cc: Chao Yu
Cc: linux-f2fs-de...@lists.sourceforge.net
Acked-by: Chao Yu
Signed-off-by: Guoqing Jiang
---
No change since RFC V3.
RFC V2 -> RFC V3
1. rename clear_page_private to detach_page_private.
RFC -> RFC V2
1. change the name of new functions to attach/clear_page_private.
fs/f2fs/f2fs.h | 11 ++---------
1 file changed, 2 insertions(+), 9 deletions(-)
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 3574629b75ba..a4d4a947f603 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -3128,19 +3128,12 @@ static inline void f2fs_set_page_private(struct page *page,
if (PagePrivate(page))
return;
- get_page(page);
- SetPagePrivate(page);
- set_page_private(page, data);
+ attach_page_private(page, (void *)data);
}
static inline void f2fs_clear_page_private(struct page *page)
{
- if (!PagePrivate(page))
- return;
-
- set_page_private(page, 0);
- ClearPagePrivate(page);
- f2fs_put_page(page, 0);
+ detach_page_private(page);
}
/*
--
2.17.1
[PATCH 07/10] ntfs: replace attach_page_buffers with attach_page_private
Call the new function since attach_page_buffers will be removed. Cc: Anton Altaparmakov Cc: linux-ntfs-...@lists.sourceforge.net Signed-off-by: Guoqing Jiang --- No change since RFC V2. RFC -> RFC V2 1. change the name of new function to attach_page_private. fs/ntfs/aops.c | 2 +- fs/ntfs/mft.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/ntfs/aops.c b/fs/ntfs/aops.c index 554b744f41bf..bb0a43860ad2 100644 --- a/fs/ntfs/aops.c +++ b/fs/ntfs/aops.c @@ -1732,7 +1732,7 @@ void mark_ntfs_record_dirty(struct page *page, const unsigned int ofs) { bh = bh->b_this_page; } while (bh); tail->b_this_page = head; - attach_page_buffers(page, head); + attach_page_private(page, head); } else buffers_to_free = bh; } diff --git a/fs/ntfs/mft.c b/fs/ntfs/mft.c index 3aac5c917afe..fbb9f1bc623d 100644 --- a/fs/ntfs/mft.c +++ b/fs/ntfs/mft.c @@ -504,7 +504,7 @@ int ntfs_sync_mft_mirror(ntfs_volume *vol, const unsigned long mft_no, bh = bh->b_this_page; } while (bh); tail->b_this_page = head; - attach_page_buffers(page, head); + attach_page_private(page, head); } bh = head = page_buffers(page); BUG_ON(!bh); -- 2.17.1
[PATCH 06/10] iomap: use attach/detach_page_private
Since the new pair of functions is introduced, we can call them to clean up the code in iomap.

Cc: Christoph Hellwig
Cc: "Darrick J. Wong"
Cc: linux-...@vger.kernel.org
Signed-off-by: Guoqing Jiang
---
No change since RFC V3.
RFC V2 -> RFC V3
1. rename clear_page_private to detach_page_private.
RFC -> RFC V2
1. change the name of new functions to attach/clear_page_private.
2. call attach_page_private(newpage, clear_page_private(page)) to cleanup code further as suggested by Matthew Wilcox.
3. don't return attach_page_private in iomap_page_create per the comment from Christoph Hellwig.
fs/iomap/buffered-io.c | 19 ++++---------------
1 file changed, 4 insertions(+), 15 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 890c8fcda4f3..a1ed7620fbac 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -59,24 +59,19 @@ iomap_page_create(struct inode *inode, struct page *page)
* migrate_page_move_mapping() assumes that pages with private data have
* their count elevated by 1.
*/
- get_page(page);
- set_page_private(page, (unsigned long)iop);
- SetPagePrivate(page);
+ attach_page_private(page, iop);
return iop;
}
static void
iomap_page_release(struct page *page)
{
- struct iomap_page *iop = to_iomap_page(page);
+ struct iomap_page *iop = detach_page_private(page);
if (!iop)
return;
WARN_ON_ONCE(atomic_read(&iop->read_count));
WARN_ON_ONCE(atomic_read(&iop->write_count));
- ClearPagePrivate(page);
- set_page_private(page, 0);
- put_page(page);
kfree(iop);
}
@@ -526,14 +521,8 @@ iomap_migrate_page(struct address_space *mapping, struct page *newpage,
if (ret != MIGRATEPAGE_SUCCESS)
return ret;
- if (page_has_private(page)) {
- ClearPagePrivate(page);
- get_page(newpage);
- set_page_private(newpage, page_private(page));
- set_page_private(page, 0);
- put_page(page);
- SetPagePrivate(newpage);
- }
+ if (page_has_private(page))
+ attach_page_private(newpage, detach_page_private(page));
if (mode != MIGRATE_SYNC_NO_COPY)
migrate_page_copy(newpage, page);
--
2.17.1
[PATCH 00/10] Introduce attach/detach_page_private to cleanup code
Hello Andrew and Al,

Since there has been no more feedback since RFC V3, I suppose the series could be ready for merge. And Song has already acked the md patch; all the rest of the patches are for fs and mm. So, if there are no further comments on the patchset, could you consider queueing them through either of your trees to avoid potential build issues? Thank you!

RFC V3: https://lore.kernel.org/lkml/20200507214400.15785-1-guoqing.ji...@cloud.ionos.com/
RFC V2: https://lore.kernel.org/lkml/20200430214450.10662-1-guoqing.ji...@cloud.ionos.com/
RFC: https://lore.kernel.org/lkml/20200426214925.10970-1-guoqing.ji...@cloud.ionos.com/

Thanks,
Guoqing

No change since RFC V3.
RFC V2 -> RFC V3:
1. rename clear_page_private to detach_page_private.
2. Update the comments for attach/detach_page_private from Matthew.
3. add one patch to call the new function in mm/migrate.c as suggested by Matthew, but use the conservative way to keep the original semantics [2].
RFC -> RFC V2:
1. rename the new functions and add comments for them.
2. change the return type of attach_page_private.
3. call attach_page_private(newpage, clear_page_private(page)) to cleanup code further.
4. avoid potential use-after-free in orangefs.
[1]. https://lore.kernel.org/linux-fsdevel/20200420221424.gh5...@bombadil.infradead.org/
[2].
https://lore.kernel.org/lkml/e4d5ddc0-877f-6499-f697-2b7c0ddbf...@cloud.ionos.com/ Guoqing Jiang (10): include/linux/pagemap.h: introduce attach/detach_page_private md: remove __clear_page_buffers and use attach/detach_page_private btrfs: use attach/detach_page_private fs/buffer.c: use attach/detach_page_private f2fs: use attach/detach_page_private iomap: use attach/detach_page_private ntfs: replace attach_page_buffers with attach_page_private orangefs: use attach/detach_page_private buffer_head.h: remove attach_page_buffers mm/migrate.c: call detach_page_private to cleanup code drivers/md/md-bitmap.c | 12 ++-- fs/btrfs/disk-io.c | 4 +--- fs/btrfs/extent_io.c| 21 ++--- fs/btrfs/inode.c| 23 +-- fs/buffer.c | 16 fs/f2fs/f2fs.h | 11 ++- fs/iomap/buffered-io.c | 19 --- fs/ntfs/aops.c | 2 +- fs/ntfs/mft.c | 2 +- fs/orangefs/inode.c | 32 ++-- include/linux/buffer_head.h | 8 include/linux/pagemap.h | 37 + mm/migrate.c| 5 + 13 files changed, 70 insertions(+), 122 deletions(-) -- 2.17.1
[PATCH 01/10] include/linux/pagemap.h: introduce attach/detach_page_private
The logic in attach_page_buffers and __clear_page_buffers is quite paired, but:
1. they are located in different files.
2. attach_page_buffers is implemented in buffer_head.h, so it can be used by other files. But __clear_page_buffers is a static function in buffer.c, so other potential users can't call it; md-bitmap even copied the function.

So, introduce the new attach/detach_page_private pair to replace them. With the new pair of functions, we will remove the usage of attach_page_buffers and __clear_page_buffers in the next patches. Thanks for the suggestions about the function names from Alexander Viro, Andreas Grünbacher, Christoph Hellwig and Matthew Wilcox.

Suggested-by: Matthew Wilcox
Cc: Andrew Morton
Cc: "Darrick J. Wong"
Cc: William Kucharski
Cc: "Kirill A. Shutemov"
Cc: Andreas Gruenbacher
Cc: Yang Shi
Cc: Yafang Shao
Cc: Song Liu
Cc: linux-r...@vger.kernel.org
Cc: Chris Mason
Cc: Josef Bacik
Cc: David Sterba
Cc: linux-bt...@vger.kernel.org
Cc: Alexander Viro
Cc: Jaegeuk Kim
Cc: Chao Yu
Cc: linux-f2fs-de...@lists.sourceforge.net
Cc: Christoph Hellwig
Cc: linux-...@vger.kernel.org
Cc: Anton Altaparmakov
Cc: linux-ntfs-...@lists.sourceforge.net
Cc: Mike Marshall
Cc: Martin Brandenburg
Cc: de...@lists.orangefs.org
Cc: Thomas Gleixner
Cc: Sebastian Andrzej Siewior
Cc: Roman Gushchin
Cc: Andreas Dilger
Signed-off-by: Guoqing Jiang
---
No change since RFC V3.
RFC V2 -> RFC V3:
1. rename clear_page_private to detach_page_private.
2. updated the comments for the two functions.
RFC -> RFC V2:
Address the comments from Christoph Hellwig
1. change function names to attach/clear_page_private and add comments.
2. change the return type of attach_page_private.
include/linux/pagemap.h | 37 + 1 file changed, 37 insertions(+) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index c6348c50136f..8e085713150c 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -208,6 +208,43 @@ static inline int page_cache_add_speculative(struct page *page, int count) return __page_cache_add_speculative(page, count); } +/** + * attach_page_private - Attach private data to a page. + * @page: Page to attach data to. + * @data: Data to attach to page. + * + * Attaching private data to a page increments the page's reference count. + * The data must be detached before the page will be freed. + */ +static inline void attach_page_private(struct page *page, void *data) +{ + get_page(page); + set_page_private(page, (unsigned long)data); + SetPagePrivate(page); +} + +/** + * detach_page_private - Detach private data from a page. + * @page: Page to detach data from. + * + * Removes the data that was previously attached to the page and decrements + * the refcount on the page. + * + * Return: Data that was attached to the page. + */ +static inline void *detach_page_private(struct page *page) +{ + void *data = (void *)page_private(page); + + if (!PagePrivate(page)) + return NULL; + ClearPagePrivate(page); + set_page_private(page, 0); + put_page(page); + + return data; +} + #ifdef CONFIG_NUMA extern struct page *__page_cache_alloc(gfp_t gfp); #else -- 2.17.1
[PATCH 04/10] fs/buffer.c: use attach/detach_page_private
Since the new pair of functions is introduced, we can call them to clean up the code in buffer.c.

Cc: Alexander Viro
Signed-off-by: Guoqing Jiang
---
No change since RFC V3.
RFC V2 -> RFC V3
1. rename clear_page_private to detach_page_private.
RFC -> RFC V2
1. change the name of new functions to attach/clear_page_private.
fs/buffer.c | 16 ++++------------
1 file changed, 4 insertions(+), 12 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 85b4be1939ce..fc8831c392d7 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -123,14 +123,6 @@ void __wait_on_buffer(struct buffer_head * bh)
}
EXPORT_SYMBOL(__wait_on_buffer);
-static void
-__clear_page_buffers(struct page *page)
-{
- ClearPagePrivate(page);
- set_page_private(page, 0);
- put_page(page);
-}
-
static void buffer_io_error(struct buffer_head *bh, char *msg)
{
if (!test_bit(BH_Quiet, &bh->b_state))
@@ -906,7 +898,7 @@ link_dev_buffers(struct page *page, struct buffer_head *head)
bh = bh->b_this_page;
} while (bh);
tail->b_this_page = head;
- attach_page_buffers(page, head);
+ attach_page_private(page, head);
}
static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size)
@@ -1624,7 +1616,7 @@ void create_empty_buffers(struct page *page,
bh = bh->b_this_page;
} while (bh != head);
}
- attach_page_buffers(page, head);
+ attach_page_private(page, head);
spin_unlock(&page->mapping->private_lock);
}
EXPORT_SYMBOL(create_empty_buffers);
@@ -2611,7 +2603,7 @@ static void attach_nobh_buffers(struct page *page, struct buffer_head *head)
bh->b_this_page = head;
bh = bh->b_this_page;
} while (bh != head);
- attach_page_buffers(page, head);
+ attach_page_private(page, head);
spin_unlock(&page->mapping->private_lock);
}
@@ -3276,7 +3268,7 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
bh = next;
} while (bh != head);
*buffers_to_free = head;
- __clear_page_buffers(page);
+ detach_page_private(page);
return 1;
failed:
return 0;
--
2.17.1
[RFC PATCH V3 07/10] ntfs: replace attach_page_buffers with attach_page_private
Call the new function since attach_page_buffers will be removed. Cc: Anton Altaparmakov Cc: linux-ntfs-...@lists.sourceforge.net Signed-off-by: Guoqing Jiang --- RFC V2 -> RFC V3: no change RFC -> RFC V2 1. change the name of new function to attach_page_private. fs/ntfs/aops.c | 2 +- fs/ntfs/mft.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/ntfs/aops.c b/fs/ntfs/aops.c index 554b744f41bf..bb0a43860ad2 100644 --- a/fs/ntfs/aops.c +++ b/fs/ntfs/aops.c @@ -1732,7 +1732,7 @@ void mark_ntfs_record_dirty(struct page *page, const unsigned int ofs) { bh = bh->b_this_page; } while (bh); tail->b_this_page = head; - attach_page_buffers(page, head); + attach_page_private(page, head); } else buffers_to_free = bh; } diff --git a/fs/ntfs/mft.c b/fs/ntfs/mft.c index 3aac5c917afe..fbb9f1bc623d 100644 --- a/fs/ntfs/mft.c +++ b/fs/ntfs/mft.c @@ -504,7 +504,7 @@ int ntfs_sync_mft_mirror(ntfs_volume *vol, const unsigned long mft_no, bh = bh->b_this_page; } while (bh); tail->b_this_page = head; - attach_page_buffers(page, head); + attach_page_private(page, head); } bh = head = page_buffers(page); BUG_ON(!bh); -- 2.17.1
[RFC PATCH V3 02/10] md: remove __clear_page_buffers and use attach/detach_page_private
After introducing attach/detach_page_private in pagemap.h, we can remove the duplicated code and call the new functions.

Cc: Song Liu
Cc: linux-r...@vger.kernel.org
Signed-off-by: Guoqing Jiang
---
RFC V2 -> RFC V3
1. rename clear_page_private to detach_page_private.

RFC -> RFC V2
1. change the name of new functions to attach/clear_page_private.

 drivers/md/md-bitmap.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index b952bd45bd6a..95a5f3757fa3 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -324,14 +324,6 @@ static void end_bitmap_write(struct buffer_head *bh, int uptodate)
         wake_up(&bitmap->write_wait);
 }
 
-/* copied from buffer.c */
-static void
-__clear_page_buffers(struct page *page)
-{
-        ClearPagePrivate(page);
-        set_page_private(page, 0);
-        put_page(page);
-}
 static void free_buffers(struct page *page)
 {
         struct buffer_head *bh;
@@ -345,7 +337,7 @@ static void free_buffers(struct page *page)
                 free_buffer_head(bh);
                 bh = next;
         }
-        __clear_page_buffers(page);
+        detach_page_private(page);
         put_page(page);
 }
 
@@ -374,7 +366,7 @@ static int read_page(struct file *file, unsigned long index,
                 ret = -ENOMEM;
                 goto out;
         }
-        attach_page_buffers(page, bh);
+        attach_page_private(page, bh);
         blk_cur = index << (PAGE_SHIFT - inode->i_blkbits);
         while (bh) {
                 block = blk_cur;
--
2.17.1
[RFC PATCH V3 03/10] btrfs: use attach/detach_page_private
Since the new pair function is introduced, we can call them to clean the code in btrfs. Cc: Chris Mason Cc: Josef Bacik Cc: David Sterba Cc: linux-bt...@vger.kernel.org Signed-off-by: Guoqing Jiang --- RFC V2 -> RFC V3 1. rename clear_page_private to detach_page_private. RFC -> RFC V2 1. change the name of new functions to attach/clear_page_private. 2. call attach_page_private(newpage, clear_page_private(page)) to cleanup code further as suggested by Dave Chinner. fs/btrfs/disk-io.c | 4 +--- fs/btrfs/extent_io.c | 21 ++--- fs/btrfs/inode.c | 23 +-- 3 files changed, 12 insertions(+), 36 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index d10c7be10f3b..7278789ff8a7 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -980,9 +980,7 @@ static void btree_invalidatepage(struct page *page, unsigned int offset, btrfs_warn(BTRFS_I(page->mapping->host)->root->fs_info, "page private not zero on page %llu", (unsigned long long)page_offset(page)); - ClearPagePrivate(page); - set_page_private(page, 0); - put_page(page); + detach_page_private(page); } } diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 39e45b8a5031..bf816387715b 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -3076,22 +3076,16 @@ static int submit_extent_page(unsigned int opf, static void attach_extent_buffer_page(struct extent_buffer *eb, struct page *page) { - if (!PagePrivate(page)) { - SetPagePrivate(page); - get_page(page); - set_page_private(page, (unsigned long)eb); - } else { + if (!PagePrivate(page)) + attach_page_private(page, eb); + else WARN_ON(page->private != (unsigned long)eb); - } } void set_page_extent_mapped(struct page *page) { - if (!PagePrivate(page)) { - SetPagePrivate(page); - get_page(page); - set_page_private(page, EXTENT_PAGE_PRIVATE); - } + if (!PagePrivate(page)) + attach_page_private(page, (void *)EXTENT_PAGE_PRIVATE); } static struct extent_map * @@ -4929,10 +4923,7 @@ static void btrfs_release_extent_buffer_pages(struct 
extent_buffer *eb) * We need to make sure we haven't be attached * to a new eb. */ - ClearPagePrivate(page); - set_page_private(page, 0); - /* One for the page private */ - put_page(page); + detach_page_private(page); } if (mapped) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 320d1062068d..a7f7ff0a8b7c 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -8303,11 +8303,8 @@ btrfs_readpages(struct file *file, struct address_space *mapping, static int __btrfs_releasepage(struct page *page, gfp_t gfp_flags) { int ret = try_release_extent_mapping(page, gfp_flags); - if (ret == 1) { - ClearPagePrivate(page); - set_page_private(page, 0); - put_page(page); - } + if (ret == 1) + detach_page_private(page); return ret; } @@ -8329,14 +8326,8 @@ static int btrfs_migratepage(struct address_space *mapping, if (ret != MIGRATEPAGE_SUCCESS) return ret; - if (page_has_private(page)) { - ClearPagePrivate(page); - get_page(newpage); - set_page_private(newpage, page_private(page)); - set_page_private(page, 0); - put_page(page); - SetPagePrivate(newpage); - } + if (page_has_private(page)) + attach_page_private(newpage, detach_page_private(page)); if (PagePrivate2(page)) { ClearPagePrivate2(page); @@ -8458,11 +8449,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset, } ClearPageChecked(page); - if (PagePrivate(page)) { - ClearPagePrivate(page); - set_page_private(page, 0); - put_page(page); - } + detach_page_private(page); } /* -- 2.17.1
[RFC PATCH V3 05/10] f2fs: use attach/detach_page_private
Since the new pair of functions is introduced, we can call them to clean up the code in f2fs.h.

Cc: Jaegeuk Kim
Cc: Chao Yu
Cc: linux-f2fs-de...@lists.sourceforge.net
Acked-by: Chao Yu
Signed-off-by: Guoqing Jiang
---
RFC V2 -> RFC V3
1. rename clear_page_private to detach_page_private.

RFC -> RFC V2
1. change the name of new functions to attach/clear_page_private.

 fs/f2fs/f2fs.h | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index ba470d5687fe..6920d1a88289 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -3051,19 +3051,12 @@ static inline void f2fs_set_page_private(struct page *page,
         if (PagePrivate(page))
                 return;
 
-        get_page(page);
-        SetPagePrivate(page);
-        set_page_private(page, data);
+        attach_page_private(page, (void *)data);
 }
 
 static inline void f2fs_clear_page_private(struct page *page)
 {
-        if (!PagePrivate(page))
-                return;
-
-        set_page_private(page, 0);
-        ClearPagePrivate(page);
-        f2fs_put_page(page, 0);
+        detach_page_private(page);
 }
 
 /*
--
2.17.1
[RFC PATCH V3 08/10] orangefs: use attach/detach_page_private
Since the new pair function is introduced, we can call them to clean the code in orangefs. Cc: Mike Marshall Cc: Martin Brandenburg Cc: de...@lists.orangefs.org Signed-off-by: Guoqing Jiang --- RFC V2 -> RFC V3 1. rename clear_page_private to detach_page_private. RFC -> RFC V2 1. change the name of new functions to attach/clear_page_private. 2. avoid potential use-after-free as suggested by Dave Chinner. fs/orangefs/inode.c | 32 ++-- 1 file changed, 6 insertions(+), 26 deletions(-) diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c index 12ae630fbed7..48f0547d4850 100644 --- a/fs/orangefs/inode.c +++ b/fs/orangefs/inode.c @@ -62,12 +62,7 @@ static int orangefs_writepage_locked(struct page *page, } else { ret = 0; } - if (wr) { - kfree(wr); - set_page_private(page, 0); - ClearPagePrivate(page); - put_page(page); - } + kfree(detach_page_private(page)); return ret; } @@ -409,9 +404,7 @@ static int orangefs_write_begin(struct file *file, wr->len = len; wr->uid = current_fsuid(); wr->gid = current_fsgid(); - SetPagePrivate(page); - set_page_private(page, (unsigned long)wr); - get_page(page); + attach_page_private(page, wr); okay: return 0; } @@ -459,18 +452,12 @@ static void orangefs_invalidatepage(struct page *page, wr = (struct orangefs_write_range *)page_private(page); if (offset == 0 && length == PAGE_SIZE) { - kfree((struct orangefs_write_range *)page_private(page)); - set_page_private(page, 0); - ClearPagePrivate(page); - put_page(page); + kfree(detach_page_private(page)); return; /* write range entirely within invalidate range (or equal) */ } else if (page_offset(page) + offset <= wr->pos && wr->pos + wr->len <= page_offset(page) + offset + length) { - kfree((struct orangefs_write_range *)page_private(page)); - set_page_private(page, 0); - ClearPagePrivate(page); - put_page(page); + kfree(detach_page_private(page)); /* XXX is this right? 
only caller in fs */ cancel_dirty_page(page); return; @@ -535,12 +522,7 @@ static int orangefs_releasepage(struct page *page, gfp_t foo) static void orangefs_freepage(struct page *page) { - if (PagePrivate(page)) { - kfree((struct orangefs_write_range *)page_private(page)); - set_page_private(page, 0); - ClearPagePrivate(page); - put_page(page); - } + kfree(detach_page_private(page)); } static int orangefs_launder_page(struct page *page) @@ -740,9 +722,7 @@ vm_fault_t orangefs_page_mkwrite(struct vm_fault *vmf) wr->len = PAGE_SIZE; wr->uid = current_fsuid(); wr->gid = current_fsgid(); - SetPagePrivate(page); - set_page_private(page, (unsigned long)wr); - get_page(page); + attach_page_private(page, wr); okay: file_update_time(vmf->vma->vm_file); -- 2.17.1
[RFC PATCH V3 01/10] include/linux/pagemap.h: introduce attach/detach_page_private
attach_page_buffers and __clear_page_buffers are logically a matched pair, but:

1. they are located in different files.

2. attach_page_buffers is implemented in buffer_head.h, so it can be used by other files. But __clear_page_buffers is a static function in buffer.c, so other potential users can't call it; md-bitmap even copied the function.

So, introduce the new attach/detach_page_private to replace them. With the new pair of functions, we will remove the usage of attach_page_buffers and __clear_page_buffers in the next patches.

Thanks for the suggestions about the function names from Alexander Viro, Andreas Grünbacher, Christoph Hellwig and Matthew Wilcox.

Suggested-by: Matthew Wilcox
Cc: Andrew Morton
Cc: linux...@kvack.org
Cc: "Darrick J. Wong"
Cc: William Kucharski
Cc: "Kirill A. Shutemov"
Cc: Andreas Gruenbacher
Cc: Yang Shi
Cc: Yafang Shao
Cc: Song Liu
Cc: linux-r...@vger.kernel.org
Cc: Chris Mason
Cc: Josef Bacik
Cc: David Sterba
Cc: linux-bt...@vger.kernel.org
Cc: Alexander Viro
Cc: Jaegeuk Kim
Cc: Chao Yu
Cc: linux-f2fs-de...@lists.sourceforge.net
Cc: Christoph Hellwig
Cc: linux-...@vger.kernel.org
Cc: Anton Altaparmakov
Cc: linux-ntfs-...@lists.sourceforge.net
Cc: Mike Marshall
Cc: Martin Brandenburg
Cc: de...@lists.orangefs.org
Cc: Thomas Gleixner
Cc: Sebastian Andrzej Siewior
Cc: Roman Gushchin
Cc: Andreas Dilger
Signed-off-by: Guoqing Jiang
---
RFC V2 -> RFC V3:
1. rename clear_page_private to detach_page_private.
2. updated the comments for the two functions.

RFC -> RFC V2: Address the comments from Christoph Hellwig
1. change function names to attach/clear_page_private and add comments.
2. change the return type of attach_page_private.
include/linux/pagemap.h | 37 + 1 file changed, 37 insertions(+) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index a8f7bd8ea1c6..99dd93188a5e 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -205,6 +205,43 @@ static inline int page_cache_add_speculative(struct page *page, int count) return __page_cache_add_speculative(page, count); } +/** + * attach_page_private - Attach private data to a page. + * @page: Page to attach data to. + * @data: Data to attach to page. + * + * Attaching private data to a page increments the page's reference count. + * The data must be detached before the page will be freed. + */ +static inline void attach_page_private(struct page *page, void *data) +{ + get_page(page); + set_page_private(page, (unsigned long)data); + SetPagePrivate(page); +} + +/** + * detach_page_private - Detach private data from a page. + * @page: Page to detach data from. + * + * Removes the data that was previously attached to the page and decrements + * the refcount on the page. + * + * Return: Data that was attached to the page. + */ +static inline void *detach_page_private(struct page *page) +{ + void *data = (void *)page_private(page); + + if (!PagePrivate(page)) + return NULL; + ClearPagePrivate(page); + set_page_private(page, 0); + put_page(page); + + return data; +} + #ifdef CONFIG_NUMA extern struct page *__page_cache_alloc(gfp_t gfp); #else -- 2.17.1
[RFC PATCH V3 04/10] fs/buffer.c: use attach/detach_page_private
Since the new pair of functions is introduced, we can call them to clean up the code in buffer.c.

Cc: Alexander Viro
Signed-off-by: Guoqing Jiang
---
RFC V2 -> RFC V3
1. rename clear_page_private to detach_page_private.

RFC -> RFC V2
1. change the name of new functions to attach/clear_page_private.

 fs/buffer.c | 16 ++++------------
 1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index a60f60396cfa..059404658d5d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -123,14 +123,6 @@ void __wait_on_buffer(struct buffer_head * bh)
 }
 EXPORT_SYMBOL(__wait_on_buffer);
 
-static void
-__clear_page_buffers(struct page *page)
-{
-        ClearPagePrivate(page);
-        set_page_private(page, 0);
-        put_page(page);
-}
-
 static void buffer_io_error(struct buffer_head *bh, char *msg)
 {
         if (!test_bit(BH_Quiet, &bh->b_state))
@@ -906,7 +898,7 @@ link_dev_buffers(struct page *page, struct buffer_head *head)
                 bh = bh->b_this_page;
         } while (bh);
         tail->b_this_page = head;
-        attach_page_buffers(page, head);
+        attach_page_private(page, head);
 }
 
 static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size)
@@ -1580,7 +1572,7 @@ void create_empty_buffers(struct page *page,
                         bh = bh->b_this_page;
                 } while (bh != head);
         }
-        attach_page_buffers(page, head);
+        attach_page_private(page, head);
         spin_unlock(&page->mapping->private_lock);
 }
 EXPORT_SYMBOL(create_empty_buffers);
@@ -2567,7 +2559,7 @@ static void attach_nobh_buffers(struct page *page, struct buffer_head *head)
                         bh->b_this_page = head;
                 bh = bh->b_this_page;
         } while (bh != head);
-        attach_page_buffers(page, head);
+        attach_page_private(page, head);
         spin_unlock(&page->mapping->private_lock);
 }
 
@@ -3227,7 +3219,7 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
                 bh = next;
         } while (bh != head);
         *buffers_to_free = head;
-        __clear_page_buffers(page);
+        detach_page_private(page);
         return 1;
 failed:
         return 0;
--
2.17.1
[RFC PATCH V3 10/10] mm/migrate.c: call detach_page_private to cleanup code
We can clean up the code a little by calling detach_page_private here.

Cc: Andrew Morton
Cc: linux...@kvack.org
Signed-off-by: Guoqing Jiang
---
Added in RFC V3.

 mm/migrate.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 7160c1556f79..f214adfb3fa4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -797,10 +797,7 @@ static int __buffer_migrate_page(struct address_space *mapping,
         if (rc != MIGRATEPAGE_SUCCESS)
                 goto unlock_buffers;
 
-        ClearPagePrivate(page);
-        set_page_private(newpage, page_private(page));
-        set_page_private(page, 0);
-        put_page(page);
+        set_page_private(newpage, detach_page_private(page));
         get_page(newpage);
 
         bh = head;
--
2.17.1
[RFC PATCH V3 06/10] iomap: use attach/detach_page_private
Since the new pair of functions is introduced, we can call them to clean up the code in iomap.

Cc: Christoph Hellwig
Cc: "Darrick J. Wong"
Cc: linux-...@vger.kernel.org
Signed-off-by: Guoqing Jiang
---
RFC V2 -> RFC V3
1. rename clear_page_private to detach_page_private.

RFC -> RFC V2
1. change the name of new functions to attach/clear_page_private.
2. call attach_page_private(newpage, clear_page_private(page)) to cleanup code further as suggested by Matthew Wilcox.
3. don't return attach_page_private in iomap_page_create per the comment from Christoph Hellwig.

 fs/iomap/buffered-io.c | 19 ++++---------------
 1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 89e21961d1ad..e3031007b4ae 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -59,24 +59,19 @@ iomap_page_create(struct inode *inode, struct page *page)
          * migrate_page_move_mapping() assumes that pages with private data have
          * their count elevated by 1.
          */
-        get_page(page);
-        set_page_private(page, (unsigned long)iop);
-        SetPagePrivate(page);
+        attach_page_private(page, iop);
         return iop;
 }
 
 static void
 iomap_page_release(struct page *page)
 {
-        struct iomap_page *iop = to_iomap_page(page);
+        struct iomap_page *iop = detach_page_private(page);
 
         if (!iop)
                 return;
         WARN_ON_ONCE(atomic_read(&iop->read_count));
         WARN_ON_ONCE(atomic_read(&iop->write_count));
-        ClearPagePrivate(page);
-        set_page_private(page, 0);
-        put_page(page);
         kfree(iop);
 }
 
@@ -554,14 +549,8 @@ iomap_migrate_page(struct address_space *mapping, struct page *newpage,
         if (ret != MIGRATEPAGE_SUCCESS)
                 return ret;
 
-        if (page_has_private(page)) {
-                ClearPagePrivate(page);
-                get_page(newpage);
-                set_page_private(newpage, page_private(page));
-                set_page_private(page, 0);
-                put_page(page);
-                SetPagePrivate(newpage);
-        }
+        if (page_has_private(page))
+                attach_page_private(newpage, detach_page_private(page));
 
         if (mode != MIGRATE_SYNC_NO_COPY)
                 migrate_page_copy(newpage, page);
--
2.17.1
[RFC PATCH V3 09/10] buffer_head.h: remove attach_page_buffers
All the callers have replaced attach_page_buffers with the new function attach_page_private, so remove it.

Cc: Thomas Gleixner
Cc: Sebastian Andrzej Siewior
Cc: Roman Gushchin
Cc: Andreas Dilger
Signed-off-by: Guoqing Jiang
---
No change since RFC.

 include/linux/buffer_head.h | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 15b765a181b8..22fb11e2d2e0 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -272,14 +272,6 @@ void buffer_init(void);
  * inline definitions
  */
 
-static inline void attach_page_buffers(struct page *page,
-                struct buffer_head *head)
-{
-        get_page(page);
-        SetPagePrivate(page);
-        set_page_private(page, (unsigned long)head);
-}
-
 static inline void get_bh(struct buffer_head *bh)
 {
         atomic_inc(&bh->b_count);
--
2.17.1
[RFC PATCH V3 0/10] Introduce attach/detach_page_private to cleanup code
Hi,

Based on the previous thread [1], this patchset introduces attach_page_private and detach_page_private to replace attach_page_buffers and __clear_page_buffers. Thanks a lot for the constructive suggestions and comments from Christoph, Matthew and Dave. And sorry for cross-posting to different lists, since it modifies different subsystems.

RFC V2 -> RFC V3:
1. rename clear_page_private to detach_page_private.
2. update the comments for attach/detach_page_private from Matthew.
3. add one patch to call the new function in mm/migrate.c as suggested by Matthew, but use the conservative way to keep the original semantics [2].

RFC -> RFC V2:
1. rename the new functions and add comments for them.
2. change the return type of attach_page_private.
3. call attach_page_private(newpage, clear_page_private(page)) to cleanup code further.
4. avoid potential use-after-free in orangefs.

[1]. https://lore.kernel.org/linux-fsdevel/20200420221424.gh5...@bombadil.infradead.org/
[2]. https://lore.kernel.org/lkml/e4d5ddc0-877f-6499-f697-2b7c0ddbf...@cloud.ionos.com/

Thanks,
Guoqing

Guoqing Jiang (10):
  include/linux/pagemap.h: introduce attach/detach_page_private
  md: remove __clear_page_buffers and use attach/detach_page_private
  btrfs: use attach/detach_page_private
  fs/buffer.c: use attach/detach_page_private
  f2fs: use attach/detach_page_private
  iomap: use attach/detach_page_private
  ntfs: replace attach_page_buffers with attach_page_private
  orangefs: use attach/detach_page_private
  buffer_head.h: remove attach_page_buffers
  mm/migrate.c: call detach_page_private to cleanup code

 drivers/md/md-bitmap.c      | 12 ++----------
 fs/btrfs/disk-io.c          |  4 +---
 fs/btrfs/extent_io.c        | 21 ++++++---------------
 fs/btrfs/inode.c            | 23 +++--------------------
 fs/buffer.c                 | 16 ++++------------
 fs/f2fs/f2fs.h              | 11 ++---------
 fs/iomap/buffered-io.c      | 19 ++++---------------
 fs/ntfs/aops.c              |  2 +-
 fs/ntfs/mft.c               |  2 +-
 fs/orangefs/inode.c         | 32 ++++----------------------------
 include/linux/buffer_head.h |  8 --------
 include/linux/pagemap.h     | 37 +++++++++++++++++++++++++++++++++++++
 mm/migrate.c                |  5 +----
 13 files changed, 70 insertions(+), 122 deletions(-)

--
2.17.1
Re: [RFC PATCH V2 0/9] Introduce attach/clear_page_private to cleanup code
On 5/2/20 2:41 AM, Matthew Wilcox wrote: On Sat, May 02, 2020 at 12:42:15AM +0200, Guoqing Jiang wrote: On 5/2/20 12:16 AM, Matthew Wilcox wrote: On Thu, Apr 30, 2020 at 11:44:41PM +0200, Guoqing Jiang wrote: include/linux/pagemap.h: introduce attach/clear_page_private md: remove __clear_page_buffers and use attach/clear_page_private btrfs: use attach/clear_page_private fs/buffer.c: use attach/clear_page_private f2fs: use attach/clear_page_private iomap: use attach/clear_page_private ntfs: replace attach_page_buffers with attach_page_private orangefs: use attach/clear_page_private buffer_head.h: remove attach_page_buffers I think mm/migrate.c could also use this: ClearPagePrivate(page); set_page_private(newpage, page_private(page)); set_page_private(page, 0); put_page(page); get_page(newpage); Thanks for checking! Assume the below change is appropriate. diff --git a/mm/migrate.c b/mm/migrate.c index 7160c1556f79..f214adfb3fa4 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -797,10 +797,7 @@ static int __buffer_migrate_page(struct address_space *mapping, if (rc != MIGRATEPAGE_SUCCESS) goto unlock_buffers; - ClearPagePrivate(page); - set_page_private(newpage, page_private(page)); - set_page_private(page, 0); - put_page(page); + set_page_private(newpage, detach_page_private(page)); get_page(newpage); I think you can do: @@ -797,11 +797,7 @@ static int __buffer_migrate_page(struct address_space *mapping, if (rc != MIGRATEPAGE_SUCCESS) goto unlock_buffers; - ClearPagePrivate(page); - set_page_private(newpage, page_private(page)); - set_page_private(page, 0); - put_page(page); - get_page(newpage); + attach_page_private(newpage, detach_page_private(page)); bh = head; do { @@ -810,8 +806,6 @@ static int __buffer_migrate_page(struct address_space *mapping, } while (bh != head); - SetPagePrivate(newpage); - if (mode != MIGRATE_SYNC_NO_COPY) ... 
but maybe there's a subtlety to the ordering of the setup of the bh and setting PagePrivate that means what you have there is a better patch.

Yes, it is better, but I am not sure if the order can be changed here. And it seems the original commit is this one:

commit e965f9630c651fa4249039fd4b80c9392d07a856
Author: Christoph Lameter
Date: Wed Feb 1 03:05:41 2006 -0800

    [PATCH] Direct Migration V9: Avoid writeback / page_migrate() method

    Migrate a page with buffers without requiring writeback

    This introduces a new address space operation migratepage() that may be
    used by a filesystem to implement its own version of page migration.

    A version is provided that migrates buffers attached to pages. Some
    filesystems (ext2, ext3, xfs) are modified to utilize this feature.

    The swapper address space operation are modified so that a regular
    migrate_page() will occur for anonymous pages without writeback
    (migrate_pages forces every anonymous page to have a swap entry).

I hope mm experts could take a look, so I am CCing more people and the mm list. The question is whether we can set PagePrivate before setting up the buffer heads in __buffer_migrate_page. Thanks for any further input.

Thanks,
Guoqing
Re: [RFC PATCH V2 0/9] Introduce attach/clear_page_private to cleanup code
On 5/2/20 12:16 AM, Matthew Wilcox wrote: On Thu, Apr 30, 2020 at 11:44:41PM +0200, Guoqing Jiang wrote: include/linux/pagemap.h: introduce attach/clear_page_private md: remove __clear_page_buffers and use attach/clear_page_private btrfs: use attach/clear_page_private fs/buffer.c: use attach/clear_page_private f2fs: use attach/clear_page_private iomap: use attach/clear_page_private ntfs: replace attach_page_buffers with attach_page_private orangefs: use attach/clear_page_private buffer_head.h: remove attach_page_buffers I think mm/migrate.c could also use this: ClearPagePrivate(page); set_page_private(newpage, page_private(page)); set_page_private(page, 0); put_page(page); get_page(newpage); Thanks for checking! Assume the below change is appropriate. diff --git a/mm/migrate.c b/mm/migrate.c index 7160c1556f79..f214adfb3fa4 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -797,10 +797,7 @@ static int __buffer_migrate_page(struct address_space *mapping, if (rc != MIGRATEPAGE_SUCCESS) goto unlock_buffers; - ClearPagePrivate(page); - set_page_private(newpage, page_private(page)); - set_page_private(page, 0); - put_page(page); + set_page_private(newpage, detach_page_private(page)); get_page(newpage); bh = head; Cheers, Guoqing
Re: [RFC PATCH V2 1/9] include/linux/pagemap.h: introduce attach/clear_page_private
On 5/1/20 3:49 AM, Al Viro wrote:
On Fri, May 01, 2020 at 02:42:29AM +0100, Al Viro wrote:
On Thu, Apr 30, 2020 at 03:13:38PM -0700, Matthew Wilcox wrote:

+/**
+ * clear_page_private - clear page's private field and PG_private.
+ * @page: page to be cleared.
+ *
+ * The counterpart function of attach_page_private.
+ * Return: private data of page or NULL if page doesn't have private data.
+ */

Seems to me that the opposite of "attach" is "detach", not "clear". Or "remove", perhaps...

Actually, "detach" is better - neither "clear" nor "remove" imply "... and give me what used to be attached there", as this thing is doing.

OK, it seems we have reached agreement on the new name ;-), so I will follow the suggestion.

Thanks & Regards,
Guoqing
Re: [RFC PATCH V2 1/9] include/linux/pagemap.h: introduce attach/clear_page_private
On 5/1/20 12:13 AM, Matthew Wilcox wrote:
On Thu, Apr 30, 2020 at 11:44:42PM +0200, Guoqing Jiang wrote:

+/**
+ * attach_page_private - attach data to page's private field and set PG_private.
+ * @page: page to be attached and set flag.
+ * @data: data to attach to page's private field.
+ *
+ * Need to take reference as mm.h said "Setting PG_private should also increment
+ * the refcount".
+ */

I don't think this will read well when added to the API documentation. Try this:

/**
 * attach_page_private - Attach private data to a page.
 * @page: Page to attach data to.
 * @data: Data to attach to page.
 *
 * Attaching private data to a page increments the page's reference count.
 * The data must be detached before the page will be freed.
 */

+/**
+ * clear_page_private - clear page's private field and PG_private.
+ * @page: page to be cleared.
+ *
+ * The counterpart function of attach_page_private.
+ * Return: private data of page or NULL if page doesn't have private data.
+ */

Seems to me that the opposite of "attach" is "detach", not "clear".

/**
 * detach_page_private - Detach private data from a page.
 * @page: Page to detach data from.
 *
 * Removes the data that was previously attached to the page and decrements
 * the refcount on the page.
 *
 * Return: Data that was attached to the page.
 */

Thank you very much, Matthew! Will change them in the next version.

Best Regards,
Guoqing
Re: [RFC PATCH V2 1/9] include/linux/pagemap.h: introduce attach/clear_page_private
On 5/1/20 12:10 AM, Andreas Grünbacher wrote: Hi, Am Do., 30. Apr. 2020 um 23:56 Uhr schrieb Guoqing Jiang : The logic in attach_page_buffers and __clear_page_buffers are quite paired, but 1. they are located in different files. 2. attach_page_buffers is implemented in buffer_head.h, so it could be used by other files. But __clear_page_buffers is static function in buffer.c and other potential users can't call the function, md-bitmap even copied the function. So, introduce the new attach/clear_page_private to replace them. With the new pair of function, we will remove the usage of attach_page_buffers and __clear_page_buffers in next patches. Thanks for the new names from Christoph Hellwig. Suggested-by: Matthew Wilcox Cc: Andrew Morton Cc: "Darrick J. Wong" Cc: William Kucharski Cc: "Kirill A. Shutemov" Cc: Andreas Gruenbacher Cc: Yang Shi Cc: Yafang Shao Cc: Song Liu Cc: linux-r...@vger.kernel.org Cc: Chris Mason Cc: Josef Bacik Cc: David Sterba Cc: linux-bt...@vger.kernel.org Cc: Alexander Viro Cc: Jaegeuk Kim Cc: Chao Yu Cc: linux-f2fs-de...@lists.sourceforge.net Cc: Christoph Hellwig Cc: linux-...@vger.kernel.org Cc: Anton Altaparmakov Cc: linux-ntfs-...@lists.sourceforge.net Cc: Mike Marshall Cc: Martin Brandenburg Cc: de...@lists.orangefs.org Cc: Thomas Gleixner Cc: Sebastian Andrzej Siewior Cc: Roman Gushchin Cc: Andreas Dilger Signed-off-by: Guoqing Jiang --- RFC -> RFC V2: Address the comments from Christoph Hellwig 1. change function names to attach/clear_page_private and add comments. 2. change the return type of attach_page_private. 
include/linux/pagemap.h | 35 +++ 1 file changed, 35 insertions(+) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index a8f7bd8ea1c6..2e515f210b18 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -205,6 +205,41 @@ static inline int page_cache_add_speculative(struct page *page, int count) return __page_cache_add_speculative(page, count); } +/** + * attach_page_private - attach data to page's private field and set PG_private. + * @page: page to be attached and set flag. + * @data: data to attach to page's private field. + * + * Need to take reference as mm.h said "Setting PG_private should also increment + * the refcount". + */ +static inline void attach_page_private(struct page *page, void *data) +{ + get_page(page); + set_page_private(page, (unsigned long)data); + SetPagePrivate(page); +} + +/** + * clear_page_private - clear page's private field and PG_private. + * @page: page to be cleared. + * + * The counterpart function of attach_page_private. + * Return: private data of page or NULL if page doesn't have private data. + */ +static inline void *clear_page_private(struct page *page) +{ + void *data = (void *)page_private(page); + + if (!PagePrivate(page)) + return NULL; + ClearPagePrivate(page); + set_page_private(page, 0); + put_page(page); + + return data; +} + I like this in general, but the name clear_page_private suggests that this might be the inverse operation of set_page_private, which it is not. So maybe this can be renamed to detach_page_private to more clearly indicate that it pairs with attach_page_private? Yes, the new name is better, thank you! Cheers, Guoqing
[RFC PATCH V2 1/9] include/linux/pagemap.h: introduce attach/clear_page_private
The logic in attach_page_buffers and __clear_page_buffers are quite paired, but 1. they are located in different files. 2. attach_page_buffers is implemented in buffer_head.h, so it could be used by other files. But __clear_page_buffers is static function in buffer.c and other potential users can't call the function, md-bitmap even copied the function. So, introduce the new attach/clear_page_private to replace them. With the new pair of function, we will remove the usage of attach_page_buffers and __clear_page_buffers in next patches. Thanks for the new names from Christoph Hellwig. Suggested-by: Matthew Wilcox Cc: Andrew Morton Cc: "Darrick J. Wong" Cc: William Kucharski Cc: "Kirill A. Shutemov" Cc: Andreas Gruenbacher Cc: Yang Shi Cc: Yafang Shao Cc: Song Liu Cc: linux-r...@vger.kernel.org Cc: Chris Mason Cc: Josef Bacik Cc: David Sterba Cc: linux-bt...@vger.kernel.org Cc: Alexander Viro Cc: Jaegeuk Kim Cc: Chao Yu Cc: linux-f2fs-de...@lists.sourceforge.net Cc: Christoph Hellwig Cc: linux-...@vger.kernel.org Cc: Anton Altaparmakov Cc: linux-ntfs-...@lists.sourceforge.net Cc: Mike Marshall Cc: Martin Brandenburg Cc: de...@lists.orangefs.org Cc: Thomas Gleixner Cc: Sebastian Andrzej Siewior Cc: Roman Gushchin Cc: Andreas Dilger Signed-off-by: Guoqing Jiang --- RFC -> RFC V2: Address the comments from Christoph Hellwig 1. change function names to attach/clear_page_private and add comments. 2. change the return type of attach_page_private. include/linux/pagemap.h | 35 +++ 1 file changed, 35 insertions(+) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index a8f7bd8ea1c6..2e515f210b18 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -205,6 +205,41 @@ static inline int page_cache_add_speculative(struct page *page, int count) return __page_cache_add_speculative(page, count); } +/** + * attach_page_private - attach data to page's private field and set PG_private. + * @page: page to be attached and set flag. 
+ * @data: data to attach to page's private field. + * + * Need to take reference as mm.h said "Setting PG_private should also increment + * the refcount". + */ +static inline void attach_page_private(struct page *page, void *data) +{ + get_page(page); + set_page_private(page, (unsigned long)data); + SetPagePrivate(page); +} + +/** + * clear_page_private - clear page's private field and PG_private. + * @page: page to be cleared. + * + * The counterpart function of attach_page_private. + * Return: private data of page or NULL if page doesn't have private data. + */ +static inline void *clear_page_private(struct page *page) +{ + void *data = (void *)page_private(page); + + if (!PagePrivate(page)) + return NULL; + ClearPagePrivate(page); + set_page_private(page, 0); + put_page(page); + + return data; +} + #ifdef CONFIG_NUMA extern struct page *__page_cache_alloc(gfp_t gfp); #else -- 2.17.1
[RFC PATCH V2 3/9] btrfs: use attach/clear_page_private
Since the new pair of functions has been introduced, we can call them to clean up the code in btrfs. Cc: Chris Mason Cc: Josef Bacik Cc: David Sterba Cc: linux-bt...@vger.kernel.org Signed-off-by: Guoqing Jiang --- RFC -> RFC V2 1. change the name of new functions to attach/clear_page_private. 2. call attach_page_private(newpage, clear_page_private(page)) to cleanup code further as suggested by Dave Chinner. fs/btrfs/disk-io.c | 4 +--- fs/btrfs/extent_io.c | 21 ++--- fs/btrfs/inode.c | 23 +-- 3 files changed, 12 insertions(+), 36 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index a6cb5cbbdb9f..fe4acf821110 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -980,9 +980,7 @@ static void btree_invalidatepage(struct page *page, unsigned int offset, btrfs_warn(BTRFS_I(page->mapping->host)->root->fs_info, "page private not zero on page %llu", (unsigned long long)page_offset(page)); - ClearPagePrivate(page); - set_page_private(page, 0); - put_page(page); + clear_page_private(page); } } diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 39e45b8a5031..095a5e83e660 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -3076,22 +3076,16 @@ static int submit_extent_page(unsigned int opf, static void attach_extent_buffer_page(struct extent_buffer *eb, struct page *page) { - if (!PagePrivate(page)) { - SetPagePrivate(page); - get_page(page); - set_page_private(page, (unsigned long)eb); - } else { + if (!PagePrivate(page)) + attach_page_private(page, eb); + else WARN_ON(page->private != (unsigned long)eb); - } } void set_page_extent_mapped(struct page *page) { - if (!PagePrivate(page)) { - SetPagePrivate(page); - get_page(page); - set_page_private(page, EXTENT_PAGE_PRIVATE); - } + if (!PagePrivate(page)) + attach_page_private(page, (void *)EXTENT_PAGE_PRIVATE); } static struct extent_map * @@ -4929,10 +4923,7 @@ static void btrfs_release_extent_buffer_pages(struct extent_buffer *eb) * We need to make sure we haven't be attached * to a new 
eb. */ - ClearPagePrivate(page); - set_page_private(page, 0); - /* One for the page private */ - put_page(page); + clear_page_private(page); } if (mapped) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 320d1062068d..34b09ab2c32a 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -8303,11 +8303,8 @@ btrfs_readpages(struct file *file, struct address_space *mapping, static int __btrfs_releasepage(struct page *page, gfp_t gfp_flags) { int ret = try_release_extent_mapping(page, gfp_flags); - if (ret == 1) { - ClearPagePrivate(page); - set_page_private(page, 0); - put_page(page); - } + if (ret == 1) + clear_page_private(page); return ret; } @@ -8329,14 +8326,8 @@ static int btrfs_migratepage(struct address_space *mapping, if (ret != MIGRATEPAGE_SUCCESS) return ret; - if (page_has_private(page)) { - ClearPagePrivate(page); - get_page(newpage); - set_page_private(newpage, page_private(page)); - set_page_private(page, 0); - put_page(page); - SetPagePrivate(newpage); - } + if (page_has_private(page)) + attach_page_private(newpage, clear_page_private(page)); if (PagePrivate2(page)) { ClearPagePrivate2(page); @@ -8458,11 +8449,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset, } ClearPageChecked(page); - if (PagePrivate(page)) { - ClearPagePrivate(page); - set_page_private(page, 0); - put_page(page); - } + clear_page_private(page); } /* -- 2.17.1
[RFC PATCH V2 8/9] orangefs: use attach/clear_page_private
Since the new pair of functions has been introduced, we can call them to clean up the code in orangefs. Cc: Mike Marshall Cc: Martin Brandenburg Cc: de...@lists.orangefs.org Signed-off-by: Guoqing Jiang --- RFC -> RFC V2 1. change the name of new functions to attach/clear_page_private. 2. avoid potential use-after-free as suggested by Dave Chinner. fs/orangefs/inode.c | 32 ++-- 1 file changed, 6 insertions(+), 26 deletions(-) diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c index 12ae630fbed7..139c450aca68 100644 --- a/fs/orangefs/inode.c +++ b/fs/orangefs/inode.c @@ -62,12 +62,7 @@ static int orangefs_writepage_locked(struct page *page, } else { ret = 0; } - if (wr) { - kfree(wr); - set_page_private(page, 0); - ClearPagePrivate(page); - put_page(page); - } + kfree(clear_page_private(page)); return ret; } @@ -409,9 +404,7 @@ static int orangefs_write_begin(struct file *file, wr->len = len; wr->uid = current_fsuid(); wr->gid = current_fsgid(); - SetPagePrivate(page); - set_page_private(page, (unsigned long)wr); - get_page(page); + attach_page_private(page, wr); okay: return 0; } @@ -459,18 +452,12 @@ static void orangefs_invalidatepage(struct page *page, wr = (struct orangefs_write_range *)page_private(page); if (offset == 0 && length == PAGE_SIZE) { - kfree((struct orangefs_write_range *)page_private(page)); - set_page_private(page, 0); - ClearPagePrivate(page); - put_page(page); + kfree(clear_page_private(page)); return; /* write range entirely within invalidate range (or equal) */ } else if (page_offset(page) + offset <= wr->pos && wr->pos + wr->len <= page_offset(page) + offset + length) { - kfree((struct orangefs_write_range *)page_private(page)); - set_page_private(page, 0); - ClearPagePrivate(page); - put_page(page); + kfree(clear_page_private(page)); /* XXX is this right? 
only caller in fs */ cancel_dirty_page(page); return; @@ -535,12 +522,7 @@ static int orangefs_releasepage(struct page *page, gfp_t foo) static void orangefs_freepage(struct page *page) { - if (PagePrivate(page)) { - kfree((struct orangefs_write_range *)page_private(page)); - set_page_private(page, 0); - ClearPagePrivate(page); - put_page(page); - } + kfree(clear_page_private(page)); } static int orangefs_launder_page(struct page *page) @@ -740,9 +722,7 @@ vm_fault_t orangefs_page_mkwrite(struct vm_fault *vmf) wr->len = PAGE_SIZE; wr->uid = current_fsuid(); wr->gid = current_fsgid(); - SetPagePrivate(page); - set_page_private(page, (unsigned long)wr); - get_page(page); + attach_page_private(page, wr); okay: file_update_time(vmf->vma->vm_file); -- 2.17.1
[RFC PATCH V2 0/9] Introduce attach/clear_page_private to cleanup code
Hi, Based on the previous thread [1], this patchset introduces attach_page_private and clear_page_private to replace attach_page_buffers and __clear_page_buffers. Thanks a lot for the constructive suggestions and comments from Christoph, Matthew and Dave. And sorry for cross-posting to different lists, since the series modifies different subsystems. RFC -> RFC V2: 1. rename the new functions and add comments for them. 2. change the return type of attach_page_private. 3. call attach_page_private(newpage, clear_page_private(page)) to cleanup code further. 4. avoid potential use-after-free in orangefs. [1]. https://lore.kernel.org/linux-fsdevel/20200420221424.gh5...@bombadil.infradead.org/ Thanks, Guoqing Guoqing Jiang (9): include/linux/pagemap.h: introduce attach/clear_page_private md: remove __clear_page_buffers and use attach/clear_page_private btrfs: use attach/clear_page_private fs/buffer.c: use attach/clear_page_private f2fs: use attach/clear_page_private iomap: use attach/clear_page_private ntfs: replace attach_page_buffers with attach_page_private orangefs: use attach/clear_page_private buffer_head.h: remove attach_page_buffers drivers/md/md-bitmap.c | 12 ++-- fs/btrfs/disk-io.c | 4 +--- fs/btrfs/extent_io.c| 21 ++--- fs/btrfs/inode.c| 23 +-- fs/buffer.c | 16 fs/f2fs/f2fs.h | 11 ++- fs/iomap/buffered-io.c | 19 --- fs/ntfs/aops.c | 2 +- fs/ntfs/mft.c | 2 +- fs/orangefs/inode.c | 32 ++-- include/linux/buffer_head.h | 8 include/linux/pagemap.h | 35 +++ 12 files changed, 67 insertions(+), 118 deletions(-) -- 2.17.1
[RFC PATCH V2 2/9] md: remove __clear_page_buffers and use attach/clear_page_private
After introducing attach/clear_page_private in pagemap.h, we can remove the duplicated code and call the new functions. Cc: Song Liu Cc: linux-r...@vger.kernel.org Signed-off-by: Guoqing Jiang --- RFC -> RFC V2 1. change the name of new functions to attach/clear_page_private. drivers/md/md-bitmap.c | 12 ++-- 1 file changed, 2 insertions(+), 10 deletions(-) diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c index b952bd45bd6a..033d12063600 100644 --- a/drivers/md/md-bitmap.c +++ b/drivers/md/md-bitmap.c @@ -324,14 +324,6 @@ static void end_bitmap_write(struct buffer_head *bh, int uptodate) wake_up(>write_wait); } -/* copied from buffer.c */ -static void -__clear_page_buffers(struct page *page) -{ - ClearPagePrivate(page); - set_page_private(page, 0); - put_page(page); -} static void free_buffers(struct page *page) { struct buffer_head *bh; @@ -345,7 +337,7 @@ static void free_buffers(struct page *page) free_buffer_head(bh); bh = next; } - __clear_page_buffers(page); + clear_page_private(page); put_page(page); } @@ -374,7 +366,7 @@ static int read_page(struct file *file, unsigned long index, ret = -ENOMEM; goto out; } - attach_page_buffers(page, bh); + attach_page_private(page, bh); blk_cur = index << (PAGE_SHIFT - inode->i_blkbits); while (bh) { block = blk_cur; -- 2.17.1
[RFC PATCH V2 4/9] fs/buffer.c: use attach/clear_page_private
Since the new pair of functions has been introduced, we can call them to clean up the code in buffer.c. Cc: Alexander Viro Signed-off-by: Guoqing Jiang --- RFC -> RFC V2 1. change the name of new functions to attach/clear_page_private. fs/buffer.c | 16 1 file changed, 4 insertions(+), 12 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index a60f60396cfa..60dd61384b13 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -123,14 +123,6 @@ void __wait_on_buffer(struct buffer_head * bh) } EXPORT_SYMBOL(__wait_on_buffer); -static void -__clear_page_buffers(struct page *page) -{ - ClearPagePrivate(page); - set_page_private(page, 0); - put_page(page); -} - static void buffer_io_error(struct buffer_head *bh, char *msg) { if (!test_bit(BH_Quiet, >b_state)) @@ -906,7 +898,7 @@ link_dev_buffers(struct page *page, struct buffer_head *head) bh = bh->b_this_page; } while (bh); tail->b_this_page = head; - attach_page_buffers(page, head); + attach_page_private(page, head); } static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size) @@ -1580,7 +1572,7 @@ void create_empty_buffers(struct page *page, bh = bh->b_this_page; } while (bh != head); } - attach_page_buffers(page, head); + attach_page_private(page, head); spin_unlock(>mapping->private_lock); } EXPORT_SYMBOL(create_empty_buffers); @@ -2567,7 +2559,7 @@ static void attach_nobh_buffers(struct page *page, struct buffer_head *head) bh->b_this_page = head; bh = bh->b_this_page; } while (bh != head); - attach_page_buffers(page, head); + attach_page_private(page, head); spin_unlock(>mapping->private_lock); } @@ -3227,7 +3219,7 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free) bh = next; } while (bh != head); *buffers_to_free = head; - __clear_page_buffers(page); + clear_page_private(page); return 1; failed: return 0; -- 2.17.1
[RFC PATCH V2 6/9] iomap: use attach/clear_page_private
Since the new pair of functions has been introduced, we can call them to clean up the code in iomap. Cc: Christoph Hellwig Cc: "Darrick J. Wong" Cc: linux-...@vger.kernel.org Signed-off-by: Guoqing Jiang --- RFC -> RFC V2 1. change the name of new functions to attach/clear_page_private. 2. call attach_page_private(newpage, clear_page_private(page)) to cleanup code further as suggested by Matthew Wilcox. 3. don't return attach_page_private in iomap_page_create per the comment from Christoph Hellwig. fs/iomap/buffered-io.c | 19 --- 1 file changed, 4 insertions(+), 15 deletions(-) diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index 89e21961d1ad..cf4c1b02a9d8 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -59,24 +59,19 @@ iomap_page_create(struct inode *inode, struct page *page) * migrate_page_move_mapping() assumes that pages with private data have * their count elevated by 1. */ - get_page(page); - set_page_private(page, (unsigned long)iop); - SetPagePrivate(page); + attach_page_private(page, iop); return iop; } static void iomap_page_release(struct page *page) { - struct iomap_page *iop = to_iomap_page(page); + struct iomap_page *iop = clear_page_private(page); if (!iop) return; WARN_ON_ONCE(atomic_read(>read_count)); WARN_ON_ONCE(atomic_read(>write_count)); - ClearPagePrivate(page); - set_page_private(page, 0); - put_page(page); kfree(iop); } @@ -554,14 +549,8 @@ iomap_migrate_page(struct address_space *mapping, struct page *newpage, if (ret != MIGRATEPAGE_SUCCESS) return ret; - if (page_has_private(page)) { - ClearPagePrivate(page); - get_page(newpage); - set_page_private(newpage, page_private(page)); - set_page_private(page, 0); - put_page(page); - SetPagePrivate(newpage); - } + if (page_has_private(page)) + attach_page_private(newpage, clear_page_private(page)); if (mode != MIGRATE_SYNC_NO_COPY) migrate_page_copy(newpage, page); -- 2.17.1
[RFC PATCH V2 7/9] ntfs: replace attach_page_buffers with attach_page_private
Call the new function since attach_page_buffers will be removed. Cc: Anton Altaparmakov Cc: linux-ntfs-...@lists.sourceforge.net Signed-off-by: Guoqing Jiang --- RFC -> RFC V2 1. change the name of new function to attach_page_private. fs/ntfs/aops.c | 2 +- fs/ntfs/mft.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/ntfs/aops.c b/fs/ntfs/aops.c index 554b744f41bf..bb0a43860ad2 100644 --- a/fs/ntfs/aops.c +++ b/fs/ntfs/aops.c @@ -1732,7 +1732,7 @@ void mark_ntfs_record_dirty(struct page *page, const unsigned int ofs) { bh = bh->b_this_page; } while (bh); tail->b_this_page = head; - attach_page_buffers(page, head); + attach_page_private(page, head); } else buffers_to_free = bh; } diff --git a/fs/ntfs/mft.c b/fs/ntfs/mft.c index 3aac5c917afe..fbb9f1bc623d 100644 --- a/fs/ntfs/mft.c +++ b/fs/ntfs/mft.c @@ -504,7 +504,7 @@ int ntfs_sync_mft_mirror(ntfs_volume *vol, const unsigned long mft_no, bh = bh->b_this_page; } while (bh); tail->b_this_page = head; - attach_page_buffers(page, head); + attach_page_private(page, head); } bh = head = page_buffers(page); BUG_ON(!bh); -- 2.17.1
[RFC PATCH V2 5/9] f2fs: use attach/clear_page_private
Since the new pair of functions has been introduced, we can call them to clean up the code in f2fs.h. Cc: Jaegeuk Kim Cc: linux-f2fs-de...@lists.sourceforge.net Acked-by: Chao Yu Signed-off-by: Guoqing Jiang --- RFC -> RFC V2 1. change the name of new functions to attach/clear_page_private. fs/f2fs/f2fs.h | 11 ++- 1 file changed, 2 insertions(+), 9 deletions(-) diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h index ba470d5687fe..24d22bd7352d 100644 --- a/fs/f2fs/f2fs.h +++ b/fs/f2fs/f2fs.h @@ -3051,19 +3051,12 @@ static inline void f2fs_set_page_private(struct page *page, if (PagePrivate(page)) return; - get_page(page); - SetPagePrivate(page); - set_page_private(page, data); + attach_page_private(page, (void *)data); } static inline void f2fs_clear_page_private(struct page *page) { - if (!PagePrivate(page)) - return; - - set_page_private(page, 0); - ClearPagePrivate(page); - f2fs_put_page(page, 0); + clear_page_private(page); } /* -- 2.17.1
[RFC PATCH V2 9/9] buffer_head.h: remove attach_page_buffers
All the callers have replaced attach_page_buffers with the new function attach_page_private, so remove it. Cc: Thomas Gleixner Cc: Sebastian Andrzej Siewior Cc: Roman Gushchin Cc: Andreas Dilger Signed-off-by: Guoqing Jiang --- include/linux/buffer_head.h | 8 1 file changed, 8 deletions(-) diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h index 15b765a181b8..22fb11e2d2e0 100644 --- a/include/linux/buffer_head.h +++ b/include/linux/buffer_head.h @@ -272,14 +272,6 @@ void buffer_init(void); * inline definitions */ -static inline void attach_page_buffers(struct page *page, - struct buffer_head *head) -{ - get_page(page); - SetPagePrivate(page); - set_page_private(page, (unsigned long)head); -} - static inline void get_bh(struct buffer_head *bh) { atomic_inc(>b_count); -- 2.17.1
Re: [PATCH] md: Fix failed allocation of md_register_thread
On 3/5/19 6:48 AM, Aditya Pakki wrote: mddev->sync_thread can be set to NULL on kzalloc failure downstream. The patch checks for such a scenario and frees allocated resources. Signed-off-by: Aditya Pakki --- drivers/md/raid10.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index abb5d382f64d..f52b4d9bcd24 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -3939,6 +3939,8 @@ static int raid10_run(struct mddev *mddev) set_bit(MD_RECOVERY_RUNNING, >recovery); mddev->sync_thread = md_register_thread(md_do_sync, mddev, "reshape"); + if (!mddev->sync_thread) + goto out_free_conf; } return 0; Could you make the change for raid5 as well? It also doesn't check the failure in raid5_run. Thanks, Guoqing
Re: linux 4.19.19: md0_raid:1317 blocked for more than 120 seconds.
On 2/14/19 11:27 PM, Wolfgang Walter wrote: Am Donnerstag, 14. Februar 2019, 10:09:56 schrieb Guoqing Jiang: On 2/12/19 7:20 PM, Wolfgang Walter wrote: Am Dienstag, 12. Februar 2019, 16:20:11 schrieb Guoqing Jiang: On 2/11/19 11:12 PM, Wolfgang Walter wrote: With 4.19.19 we see sometimes the following issue (practically only with blk_mq, though): Feb 4 20:04:46 tettnang kernel: [252300.060165] INFO: task md0_raid1:317 blocked for more than 120 seconds. Feb 4 20:04:46 tettnang kernel: [252300.060188] Not tainted 4.19.19-debian64.all+1.1 #1 Feb 4 20:04:46 tettnang kernel: [252300.060197] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Feb 4 20:04:46 tettnang kernel: [252300.060207] md0_raid1 D0 317 2 0x8000 Feb 4 20:04:46 tettnang kernel: [252300.060211] Call Trace: Feb 4 20:04:46 tettnang kernel: [252300.060222] ? __schedule+0x2a2/0x8c0 Feb 4 20:04:46 tettnang kernel: [252300.060226] ? _raw_spin_unlock_irqrestore+0x20/0x40 Feb 4 20:04:46 tettnang kernel: [252300.060229] schedule+0x32/0x90 Feb 4 20:04:46 tettnang kernel: [252300.060241] md_super_wait+0x69/0xa0 [md_mod] Feb 4 20:04:46 tettnang kernel: [252300.060247] ? finish_wait+0x80/0x80 Feb 4 20:04:46 tettnang kernel: [252300.060255] md_bitmap_wait_writes+0x8e/0xa0 [md_mod] Feb 4 20:04:46 tettnang kernel: [252300.060263] ? md_bitmap_get_counter+0x42/0xd0 [md_mod] Feb 4 20:04:46 tettnang kernel: [252300.060271] md_bitmap_daemon_work+0x1e8/0x380 [md_mod] Feb 4 20:04:46 tettnang kernel: [252300.060278] ? md_rdev_init+0xb0/0xb0 [md_mod] Feb 4 20:04:46 tettnang kernel: [252300.060285] md_check_recovery+0x26/0x540 [md_mod] Feb 4 20:04:46 tettnang kernel: [252300.060290] raid1d+0x5c/0xf00 [raid1] Feb 4 20:04:46 tettnang kernel: [252300.060294] ? preempt_count_add+0x79/0xb0 Feb 4 20:04:46 tettnang kernel: [252300.060298] ? lock_timer_base+0x67/0x80 Feb 4 20:04:46 tettnang kernel: [252300.060302] ? _raw_spin_unlock_irqrestore+0x20/0x40 Feb 4 20:04:46 tettnang kernel: [252300.060304] ? 
try_to_del_timer_sync+0x4d/0x80 Feb 4 20:04:46 tettnang kernel: [252300.060306] ? del_timer_sync+0x35/0x40 Feb 4 20:04:46 tettnang kernel: [252300.060309] ? schedule_timeout+0x17a/0x3b0 Feb 4 20:04:46 tettnang kernel: [252300.060312] ? preempt_count_add+0x79/0xb0 Feb 4 20:04:46 tettnang kernel: [252300.060315] ? _raw_spin_lock_irqsave+0x25/0x50 Feb 4 20:04:46 tettnang kernel: [252300.060321] ? md_rdev_init+0xb0/0xb0 [md_mod] Feb 4 20:04:46 tettnang kernel: [252300.060327] ? md_thread+0xf9/0x160 [md_mod] Feb 4 20:04:46 tettnang kernel: [252300.060330] ? r1bio_pool_alloc+0x20/0x20 [raid1] Feb 4 20:04:46 tettnang kernel: [252300.060336] md_thread+0xf9/0x160 [md_mod] Feb 4 20:04:46 tettnang kernel: [252300.060340] ? finish_wait+0x80/0x80 Feb 4 20:04:46 tettnang kernel: [252300.060344] kthread+0x112/0x130 Feb 4 20:04:46 tettnang kernel: [252300.060346] ? kthread_create_worker_on_cpu+0x70/0x70 Feb 4 20:04:46 tettnang kernel: [252300.060350] ret_from_fork+0x35/0x40 I saw that there was a similar problem with raid10 and an upstream patch e820d55cb99dd93ac2dc949cf486bb187e5cd70d md: fix raid10 hang issue caused by barrier by Guoqing Jiang I wonder if there is a similar fix needed for raid1? Seems not, the calltrace tells the previous write superblock IO was not finish as expected, there is a report for raid5 which has similar problem with md_super_wait in the link [1]. Maybe you can disable blk-mq to narrow down the issue as well. I already did for 4 weeks. I didn't saw this with blk-mq disabled (for scsi and md), though this may be by luck. Then I guess it maybe related to blk-mq, which scheduler are you used with blk-mq? And maybe you can switch it to see if it is caused by specified scheduler or not. mq-deadline for SCSI and none for md and dm. Can you try with the patch [1]? In case the block was caused by flush. [1]: https://patchwork.kernel.org/patch/10787903/ Thanks, Guoqing
Re: linux 4.19.19: md0_raid:1317 blocked for more than 120 seconds.
On 2/12/19 7:20 PM, Wolfgang Walter wrote: Am Dienstag, 12. Februar 2019, 16:20:11 schrieb Guoqing Jiang: On 2/11/19 11:12 PM, Wolfgang Walter wrote: With 4.19.19 we see sometimes the following issue (practically only with blk_mq, though): Feb 4 20:04:46 tettnang kernel: [252300.060165] INFO: task md0_raid1:317 blocked for more than 120 seconds. Feb 4 20:04:46 tettnang kernel: [252300.060188] Not tainted 4.19.19-debian64.all+1.1 #1 Feb 4 20:04:46 tettnang kernel: [252300.060197] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Feb 4 20:04:46 tettnang kernel: [252300.060207] md0_raid1 D0 317 2 0x8000 Feb 4 20:04:46 tettnang kernel: [252300.060211] Call Trace: Feb 4 20:04:46 tettnang kernel: [252300.060222] ? __schedule+0x2a2/0x8c0 Feb 4 20:04:46 tettnang kernel: [252300.060226] ? _raw_spin_unlock_irqrestore+0x20/0x40 Feb 4 20:04:46 tettnang kernel: [252300.060229] schedule+0x32/0x90 Feb 4 20:04:46 tettnang kernel: [252300.060241] md_super_wait+0x69/0xa0 [md_mod] Feb 4 20:04:46 tettnang kernel: [252300.060247] ? finish_wait+0x80/0x80 Feb 4 20:04:46 tettnang kernel: [252300.060255] md_bitmap_wait_writes+0x8e/0xa0 [md_mod] Feb 4 20:04:46 tettnang kernel: [252300.060263] ? md_bitmap_get_counter+0x42/0xd0 [md_mod] Feb 4 20:04:46 tettnang kernel: [252300.060271] md_bitmap_daemon_work+0x1e8/0x380 [md_mod] Feb 4 20:04:46 tettnang kernel: [252300.060278] ? md_rdev_init+0xb0/0xb0 [md_mod] Feb 4 20:04:46 tettnang kernel: [252300.060285] md_check_recovery+0x26/0x540 [md_mod] Feb 4 20:04:46 tettnang kernel: [252300.060290] raid1d+0x5c/0xf00 [raid1] Feb 4 20:04:46 tettnang kernel: [252300.060294] ? preempt_count_add+0x79/0xb0 Feb 4 20:04:46 tettnang kernel: [252300.060298] ? lock_timer_base+0x67/0x80 Feb 4 20:04:46 tettnang kernel: [252300.060302] ? _raw_spin_unlock_irqrestore+0x20/0x40 Feb 4 20:04:46 tettnang kernel: [252300.060304] ? try_to_del_timer_sync+0x4d/0x80 Feb 4 20:04:46 tettnang kernel: [252300.060306] ? 
del_timer_sync+0x35/0x40 Feb 4 20:04:46 tettnang kernel: [252300.060309] ? schedule_timeout+0x17a/0x3b0 Feb 4 20:04:46 tettnang kernel: [252300.060312] ? preempt_count_add+0x79/0xb0 Feb 4 20:04:46 tettnang kernel: [252300.060315] ? _raw_spin_lock_irqsave+0x25/0x50 Feb 4 20:04:46 tettnang kernel: [252300.060321] ? md_rdev_init+0xb0/0xb0 [md_mod] Feb 4 20:04:46 tettnang kernel: [252300.060327] ? md_thread+0xf9/0x160 [md_mod] Feb 4 20:04:46 tettnang kernel: [252300.060330] ? r1bio_pool_alloc+0x20/0x20 [raid1] Feb 4 20:04:46 tettnang kernel: [252300.060336] md_thread+0xf9/0x160 [md_mod] Feb 4 20:04:46 tettnang kernel: [252300.060340] ? finish_wait+0x80/0x80 Feb 4 20:04:46 tettnang kernel: [252300.060344] kthread+0x112/0x130 Feb 4 20:04:46 tettnang kernel: [252300.060346] ? kthread_create_worker_on_cpu+0x70/0x70 Feb 4 20:04:46 tettnang kernel: [252300.060350] ret_from_fork+0x35/0x40 I saw that there was a similar problem with raid10 and an upstream patch e820d55cb99dd93ac2dc949cf486bb187e5cd70d md: fix raid10 hang issue caused by barrier by Guoqing Jiang I wonder if there is a similar fix needed for raid1? Seems not, the calltrace tells the previous write superblock IO was not finish as expected, there is a report for raid5 which has similar problem with md_super_wait in the link [1]. Maybe you can disable blk-mq to narrow down the issue as well. I already did for 4 weeks. I didn't saw this with blk-mq disabled (for scsi and md), though this may be by luck. Then I guess it maybe related to blk-mq, which scheduler are you used with blk-mq? And maybe you can switch it to see if it is caused by specified scheduler or not. [1] |https://bbs.archlinux.org/viewtopic.php?id=243520 I found this bug report in debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=904822 Thanks, the bug report also said it didn't happen after disable blk-mq. Regards, Guoqing
Re: [PATCH] drivers/md.c: Make bio_alloc_mddev return bio_alloc_bioset
On 1/11/19 10:17 AM, Marcos Paulo de Souza wrote: ping? On Sat, Dec 22, 2018 at 08:08:45AM -0200, Marcos Paulo de Souza wrote: bio_alloc_bioset returns a bio pointer or NULL, so we can avoid storing the returned value in a temporary variable. Signed-off-by: Marcos Paulo de Souza Add Jens to cc list, anyway, Acked-by: Guoqing Jiang Thanks, Guoqing --- drivers/md/md.c | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index fc488cb30a94..42e018f014cb 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -207,15 +207,10 @@ static bool create_on_open = true; struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs, struct mddev *mddev) { - struct bio *b; - if (!mddev || !bioset_initialized(>bio_set)) return bio_alloc(gfp_mask, nr_iovecs); - b = bio_alloc_bioset(gfp_mask, nr_iovecs, >bio_set); - if (!b) - return NULL; - return b; + return bio_alloc_bioset(gfp_mask, nr_iovecs, >bio_set); } EXPORT_SYMBOL_GPL(bio_alloc_mddev); -- 2.16.4
Re: Shaohua Li
On 1/3/19 1:13 AM, Jens Axboe wrote: Hi, I've got some very sad news to share with you - over Christmas, Shaohua Li passed away after battling cancer for most of last year. It is really sad news and a big loss for the community, considering Shaohua's great contributions! As you know, Shaohua acted as the maintainer for md. He remained dedicated to that through the difficult times. With his passing, we obviously have a void that needs to be filled. It would be better to have a small team maintain md in the future, since nobody has knowledge as broad and deep as Neil or Shaohua, and the people in the team could cover for and back up each other, see [1]. For now, I'll act as interim maintainer for md until we can find a good permanent solution. I'll queue up reviewed patches in a separate branch, in the block tree, and help review fixes. I'm subscribed to linux-raid, but please do CC me as well. Thanks, I think there are some patches in shli/md/for-next which are supposed to be merged for 4.21; could you please help merge them into your branch? And I suppose you will send a patch to CREDITS for Shaohua. [1]. https://lkml.org/lkml/2015/12/21/9 Regards, Guoqing
Re: [PATCH BUGFIX V2 1/2] block, bfq: put async queues for root bfq groups too
On 01/10/2018 02:13 PM, Paolo Valente wrote: On 10 Jan 2018, at 02:41, Guoqing Jiang <gqji...@suse.com> wrote: On 01/09/2018 05:27 PM, Paolo Valente wrote: For each pair [device for which bfq is selected as I/O scheduler, group in blkio/io], bfq maintains a corresponding bfq group. Each such bfq group contains a set of async queues, with each async queue created on demand, i.e., when some I/O request arrives for it. On creation, an async queue gets an extra reference, to make sure that the queue is not freed as long as its bfq group exists. Accordingly, to allow the queue to be freed after the group exited, this extra reference must be released on group exit. The above holds also for a bfq root group, i.e., for the bfq group corresponding to the root blkio/io root for a given device. Yet, by mistake, the references to the existing async queues of a root group are not released when the latter exits. This causes a memory leak when the instance of bfq for a given device exits. In a similar vein, bfqg_stats_xfer_dead is not executed for a root group. This commit fixes bfq_pd_offline so that the latter executes the above missing operations for a root group too. Reported-by: Holger Hoffstätte <hol...@applied-asynchrony.com> Reported-by: Guoqing Jiang <gqji...@suse.com> Signed-off-by: Davide Ferrari <davideferra...@gmail.com> Signed-off-by: Paolo Valente <paolo.vale...@linaro.org> --- block/bfq-cgroup.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c index da1525ec4c87..d819dc77fe65 100644 --- a/block/bfq-cgroup.c +++ b/block/bfq-cgroup.c @@ -775,10 +775,11 @@ static void bfq_pd_offline(struct blkg_policy_data *pd) unsigned long flags; int i; + spin_lock_irqsave(>lock, flags); + if (!entity) /* root group */ - return; + goto put_async_queues; - spin_lock_irqsave(>lock, flags); /* * Empty all service_trees belonging to this group before * deactivating the group itself. 
@@ -809,6 +810,8 @@ static void bfq_pd_offline(struct blkg_policy_data *pd) } __bfq_deactivate_entity(entity, false); + +put_async_queues: bfq_put_async_queues(bfqd, bfqg); spin_unlock_irqrestore(>lock, flags); With this change, bfqg_stats_xfer_dead will be called even if the entity does not exist, Actually, the entity associated with the bfq group being offlined exists even if the local variable entity is NULL here. Simply, that variable gets NULL if the bfq group is the bfq root group for a device. is it necessary? No, I opted for this solution to not shake the code too much, and considering that - bfqg_stats_xfer_dead simply does nothing if applied to a root group - who knows, in the future that function may need to be invoked for a root group too. Yet, if you guys think that it would be cleaner to not invoke bfqg_stats_xfer_dead at all for a root group, I'll change the code accordingly (this would introduce a little asymmetry with cfq_pd_offline, which invokes cfqg_stats_xfer_dead unconditionally). Thanks a lot for the explanation, I am fine with it. Acked-by: Guoqing Jiang <gqji...@suse.com> Regards, Guoqing
Re: [PATCH BUGFIX V2 1/2] block, bfq: put async queues for root bfq groups too
On 01/09/2018 05:27 PM, Paolo Valente wrote:

For each pair [device for which bfq is selected as I/O scheduler, group in blkio/io], bfq maintains a corresponding bfq group. Each such bfq group contains a set of async queues, with each async queue created on demand, i.e., when some I/O request arrives for it. On creation, an async queue gets an extra reference, to make sure that the queue is not freed as long as its bfq group exists. Accordingly, to allow the queue to be freed after the group exits, this extra reference must be released on group exit.

The above holds also for a bfq root group, i.e., for the bfq group corresponding to the root blkio/io group for a given device. Yet, by mistake, the references to the existing async queues of a root group are not released when the latter exits. This causes a memory leak when the instance of bfq for a given device exits. In a similar vein, bfqg_stats_xfer_dead is not executed for a root group.

This commit fixes bfq_pd_offline so that the latter executes the above missing operations for a root group too.

Reported-by: Holger Hoffstätte <hol...@applied-asynchrony.com>
Reported-by: Guoqing Jiang <gqji...@suse.com>
Signed-off-by: Davide Ferrari <davideferra...@gmail.com>
Signed-off-by: Paolo Valente <paolo.vale...@linaro.org>
---
 block/bfq-cgroup.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index da1525ec4c87..d819dc77fe65 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -775,10 +775,11 @@ static void bfq_pd_offline(struct blkg_policy_data *pd)
 	unsigned long flags;
 	int i;
 
+	spin_lock_irqsave(&bfqd->lock, flags);
+
 	if (!entity) /* root group */
-		return;
+		goto put_async_queues;
 
-	spin_lock_irqsave(&bfqd->lock, flags);
 	/*
 	 * Empty all service_trees belonging to this group before
 	 * deactivating the group itself.
@@ -809,6 +810,8 @@ static void bfq_pd_offline(struct blkg_policy_data *pd)
 	}
 
 	__bfq_deactivate_entity(entity, false);
+
+put_async_queues:
 	bfq_put_async_queues(bfqd, bfqg);
 	spin_unlock_irqrestore(&bfqd->lock, flags);

With this change, bfqg_stats_xfer_dead will be called even when entity does not exist; is it necessary?

Thanks.

Regards,
Guoqing
[PATCH] ppdev: remove unused ROUND_UP macro
This macro is not used after commit 3b9ab374a1e6 ("ppdev: convert to y2038 safe"), so let's remove it.

Signed-off-by: Guoqing Jiang <gqji...@suse.com>
---
 drivers/char/ppdev.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/char/ppdev.c b/drivers/char/ppdev.c
index 3e73bcd..d256110 100644
--- a/drivers/char/ppdev.c
+++ b/drivers/char/ppdev.c
@@ -101,9 +101,6 @@ static DEFINE_IDA(ida_index);
 #define PP_BUFFER_SIZE 1024
 #define PARDEVICE_MAX 8
 
-/* ROUND_UP macro from fs/select.c */
-#define ROUND_UP(x,y) (((x)+(y)-1)/(y))
-
 static DEFINE_MUTEX(pp_do_mutex);
 
 /* define fixed sized ioctl cmd for y2038 migration */
-- 
2.6.6
Re: [PATCH v2 11/51] md: raid1: initialize bvec table via bio_add_page()
On 06/28/2017 12:22 AM, Ming Lei wrote:

Seems the above section is similar to reset_bvec_table introduced in the next patch; why is there a difference between raid1 and raid10? Maybe adding reset_bvec_table into md.c, then calling it in raid1 or raid10, is better; just my 2 cents.

Hi Guoqing,

I think it is a good idea, and even the two patches can be put into one, so how about the following patch?

Looks good.

Acked-by: Guoqing Jiang <gqji...@suse.com>

Thanks,
Guoqing

Shaohua, what do you think of this one?

---
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 3d957ac1e109..7ffc622dd7fa 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9126,6 +9126,24 @@ void md_reload_sb(struct mddev *mddev, int nr)
 }
 EXPORT_SYMBOL(md_reload_sb);
 
+/* generally called after bio_reset() for resetting bvec */
+void reset_bvec_table(struct bio *bio, struct resync_pages *rp, int size)
+{
+	/* initialize bvec table again */
+	rp->idx = 0;
+	do {
+		struct page *page = resync_fetch_page(rp, rp->idx++);
+		int len = min_t(int, size, PAGE_SIZE);
+
+		/*
+		 * won't fail because the vec table is big
+		 * enough to hold all these pages
+		 */
+		bio_add_page(bio, page, len, 0);
+		size -= len;
+	} while (rp->idx < RESYNC_PAGES && size > 0);
+}
+
 #ifndef MODULE
 
 /*
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 991f0fe2dcc6..f569831b1de9 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -783,4 +783,6 @@ static inline struct page *resync_fetch_page(struct resync_pages *rp,
 		return NULL;
 	return rp->pages[idx];
 }
+
+void reset_bvec_table(struct bio *bio, struct resync_pages *rp, int size);
 #endif /* _MD_MD_H */
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 3febfc8391fb..6ae2665ecd58 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2086,10 +2086,7 @@ static void process_checks(struct r1bio *r1_bio)
 	/* Fix variable parts of all bios */
 	vcnt = (r1_bio->sectors + PAGE_SIZE / 512 - 1) >> (PAGE_SHIFT - 9);
 	for (i = 0; i < conf->raid_disks * 2; i++) {
-		int j;
-		int size;
 		blk_status_t status;
-		struct bio_vec *bi;
 		struct bio *b = r1_bio->bios[i];
 		struct resync_pages *rp = get_resync_pages(b);
 		if (b->bi_end_io != end_sync_read)
@@ -2098,8 +2095,6 @@ static void process_checks(struct r1bio *r1_bio)
 		status = b->bi_status;
 		bio_reset(b);
 		b->bi_status = status;
-		b->bi_vcnt = vcnt;
-		b->bi_iter.bi_size = r1_bio->sectors << 9;
 		b->bi_iter.bi_sector = r1_bio->sector +
 			conf->mirrors[i].rdev->data_offset;
 		b->bi_bdev = conf->mirrors[i].rdev->bdev;
@@ -2107,15 +2102,8 @@ static void process_checks(struct r1bio *r1_bio)
 		rp->raid_bio = r1_bio;
 		b->bi_private = rp;
-		size = b->bi_iter.bi_size;
-		bio_for_each_segment_all(bi, b, j) {
-			bi->bv_offset = 0;
-			if (size > PAGE_SIZE)
-				bi->bv_len = PAGE_SIZE;
-			else
-				bi->bv_len = size;
-			size -= PAGE_SIZE;
-		}
+		/* initialize bvec table again */
+		reset_bvec_table(b, rp, r1_bio->sectors << 9);
 	}
 	for (primary = 0; primary < conf->raid_disks * 2; primary++)
 		if (r1_bio->bios[primary]->bi_end_io == end_sync_read &&
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 5026e7ad51d3..72f4fa07449b 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2087,8 +2087,8 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
 		rp = get_resync_pages(tbio);
 		bio_reset(tbio);
 
-		tbio->bi_vcnt = vcnt;
-		tbio->bi_iter.bi_size = fbio->bi_iter.bi_size;
+		reset_bvec_table(tbio, rp, fbio->bi_iter.bi_size);
+
 		rp->raid_bio = r10_bio;
 		tbio->bi_private = rp;
 		tbio->bi_iter.bi_sector = r10_bio->devs[i].addr;

Thanks,
Ming
Re: [PATCH v2 11/51] md: raid1: initialize bvec table via bio_add_page()
On 06/26/2017 08:09 PM, Ming Lei wrote:

We will support multipage bvec soon, so initialize the bvec table using the standard way instead of writing the table directly. Otherwise it won't work any more once multipage bvec is enabled.

Cc: Shaohua Li
Cc: linux-r...@vger.kernel.org
Signed-off-by: Ming Lei
---
 drivers/md/raid1.c | 27 ++++++++++++++-------------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 3febfc8391fb..835c42396861 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2086,10 +2086,8 @@ static void process_checks(struct r1bio *r1_bio)
 	/* Fix variable parts of all bios */
 	vcnt = (r1_bio->sectors + PAGE_SIZE / 512 - 1) >> (PAGE_SHIFT - 9);
 	for (i = 0; i < conf->raid_disks * 2; i++) {
-		int j;
 		int size;
 		blk_status_t status;
-		struct bio_vec *bi;
 		struct bio *b = r1_bio->bios[i];
 		struct resync_pages *rp = get_resync_pages(b);
 		if (b->bi_end_io != end_sync_read)
@@ -2098,8 +2096,6 @@ static void process_checks(struct r1bio *r1_bio)
 		status = b->bi_status;
 		bio_reset(b);
 		b->bi_status = status;
-		b->bi_vcnt = vcnt;
-		b->bi_iter.bi_size = r1_bio->sectors << 9;
 		b->bi_iter.bi_sector = r1_bio->sector +
 			conf->mirrors[i].rdev->data_offset;
 		b->bi_bdev = conf->mirrors[i].rdev->bdev;
@@ -2107,15 +2103,20 @@ static void process_checks(struct r1bio *r1_bio)
 		rp->raid_bio = r1_bio;
 		b->bi_private = rp;
-		size = b->bi_iter.bi_size;
-		bio_for_each_segment_all(bi, b, j) {
-			bi->bv_offset = 0;
-			if (size > PAGE_SIZE)
-				bi->bv_len = PAGE_SIZE;
-			else
-				bi->bv_len = size;
-			size -= PAGE_SIZE;
-		}
+		/* initialize bvec table again */
+		rp->idx = 0;
+		size = r1_bio->sectors << 9;
+		do {
+			struct page *page = resync_fetch_page(rp, rp->idx++);
+			int len = min_t(int, size, PAGE_SIZE);
+
+			/*
+			 * won't fail because the vec table is big
+			 * enough to hold all these pages
+			 */
+			bio_add_page(b, page, len, 0);
+			size -= len;
+		} while (rp->idx < RESYNC_PAGES && size > 0);
 	}

Seems the above section is similar to reset_bvec_table introduced in the next patch; why is there a difference between raid1 and raid10? Maybe adding reset_bvec_table into md.c, then calling it in raid1 or raid10, is better; just my 2 cents.

Thanks,
Guoqing
Re: [PATCH] md-cluster: Fix a memleak in an error handling path
On 04/14/2017 02:08 PM, Christophe JAILLET wrote:

We know that 'bm_lockres' is NULL here, so 'lockres_free(bm_lockres)' is a no-op. According to resource handling in case of error a few lines below, it is likely that 'bitmap_free(bitmap)' was expected instead.

Fixes: b98938d16a10 ("md-cluster: introduce cluster_check_sync_size")

Reviewed-by: Guoqing Jiang <gqji...@suse.com>

Thanks,
Guoqing

Signed-off-by: Christophe JAILLET <christophe.jail...@wanadoo.fr>
---
 drivers/md/md-cluster.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index b21ef58819f6..7299ce2f08a8 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -1127,7 +1127,7 @@ int cluster_check_sync_size(struct mddev *mddev)
 		bm_lockres = lockres_init(mddev, str, NULL, 1);
 		if (!bm_lockres) {
 			pr_err("md-cluster: Cannot initialize %s\n", str);
-			lockres_free(bm_lockres);
+			bitmap_free(bitmap);
 			return -1;
 		}
 		bm_lockres->flags |= DLM_LKF_NOQUEUE;
Re: [LSF/MM TOPIC] [LSF/MM ATTEND] md raid general discussion
On 01/10/2017 12:38 AM, Coly Li wrote:

Hi Folks,

I'd like to propose a general md raid discussion; it is quite necessary for most of the active md raid developers to sit together and discuss the current challenges of Linux software raid and development trends. In the last few years we have had many development activities in md raid, e.g. raid5 cache, raid1 clustering, partial parity log, fast fail upstreaming, and some effort on raid1 & raid0 performance improvement. I see some functionality overlap between r5cache (raid5 cache) and PPL (partial parity log); currently I have no idea where we will go with these two development activities. Also, I receive reports from users that better raid1 performance is desired when it is built on NVMe SSDs as a cache (maybe bcache or dm-cache). I am working on some raid1 performance improvements (e.g. a new raid1 I/O barrier and lockless raid1 I/O submit), and have some more ideas to discuss. Therefore, if md raid developers have a chance to sit together and discuss how to collaborate efficiently next year, it will be much more productive than communicating on the mailing list.

I would like to attend the raid discussion. Besides the above topics, I think we can talk about improving the test suite of mdadm to make it more robust (I can share a related test suite which is used for clustered raid). And I could share the status of clustered raid: what we have done and what we can do in the future. Finally, I'd like to know/discuss the roadmap of RAID.

Thanks a lot!
Guoqing
Re: ANNOUNCE: mdadm 4.0 - A tool for managing md Soft RAID under Linux
On 01/12/2017 12:59 AM, Jes Sorensen wrote:

On 01/11/17 11:52, Shaohua Li wrote:

On Tue, Jan 10, 2017 at 11:49:04AM -0600, Bruce Dubbs wrote:

Jes Sorensen wrote:

I am pleased to announce the availability of mdadm version 4.0. It is available at the usual places:
http://www.kernel.org/pub/linux/utils/raid/mdadm/
and via git at
git://git.kernel.org/pub/scm/utils/mdadm/mdadm.git
http://git.kernel.org/cgit/utils/mdadm/

The update in major version number primarily indicates this is a release by its new maintainer. In addition it contains a large number of fixes, in particular for IMSM RAID and clustered RAID support. In addition this release includes support for IMSM 4k sector drives, failfast, and better documentation for journaled RAID.

Thank you for the new release. Unfortunately I get 9 failures running the test suite:

tests/00raid1... FAILED
tests/07autoassemble... FAILED
tests/07changelevels... FAILED
tests/07revert-grow... FAILED
tests/07revert-inplace... FAILED
tests/07testreshape5... FAILED
tests/10ddf-fail-twice... FAILED
tests/20raid5journal... FAILED
tests/10ddf-incremental-wrong-order... FAILED

Yep, several tests usually fail. It appears some checks aren't always good. At least the 'check' function for reshape/resync isn't reliable in my test; I saw 07changelevelintr fail frequently.

That is my experience as well - some of them are affected by the kernel version too. We probably need to look into making them more reliable.

If possible, it could be a potential topic for the lsf/mm raid discussion, as Coly suggested in a previous mail.

Can the current test suite run tests for a specific raid level? Say, "./test --raidtype=raid1" could execute all the *r1* tests. Does it make sense to add that if we don't support it now?

Thanks,
Guoqing
Re: [PATCH 08/16] md/bitmap: Rename a jump label in location_store()
On 09/28/2016 03:55 PM, Jes Sorensen wrote:

SF Markus Elfring writes:

From: Markus Elfring
Date: Tue, 27 Sep 2016 15:46:22 +0200

Adjust jump labels according to the current Linux coding style convention.

Signed-off-by: Markus Elfring
---
 drivers/md/bitmap.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

Sorry but this patch is just plain ridiculous. It does not improve the code in any shape or form. 'out' as a label is perfectly legitimate and just as good as 'unlock'.

Agree. I am also curious which document records this coding style convention.

Thanks,
Guoqing
Re: Question about commit f9a67b1182e5 ("md/bitmap: clear bitmap if bitmap_create failed").
On 09/14/2016 04:39 PM, Marion & Christophe JAILLET wrote:

I don't share your feeling. bitmap_create() can return ERR_PTR(-ENOMEM) or ERR_PTR(-EINVAL). In such cases 'if (!bitmap)' will not be helpful. Maybe it should be turned into 'if (IS_ERR_OR_NULL(bitmap))' to handle errors returned by bitmap_create. Maybe just removing the call to 'bitmap_free(bitmap)' is enough.

I agree we can remove it; if so, it seems we are not consistent with the previous comment of bitmap_create:

/*
 * initialize the bitmap structure
 * if this returns an error, bitmap_destroy must be called to do clean up
 */

What about reverting it and re-using the v1 patch? See http://www.spinics.net/lists/raid/msg51819.html.

Thanks,
Guoqing
Re: Question about commit f9a67b1182e5 ("md/bitmap: clear bitmap if bitmap_create failed").
On 09/13/2016 01:24 PM, Shaohua Li wrote:

On Mon, Sep 12, 2016 at 09:09:48PM +0200, Christophe JAILLET wrote:

Hi,

I'm puzzled by commit f9a67b1182e5 ("md/bitmap: clear bitmap if bitmap_create failed").

Hi Christophe,

Thank you very much for helping to check this!

Part of the commit is:

@@ -1865,8 +1866,10 @@ int bitmap_copy_from_slot(struct mddev *mddev, int slot,
 	struct bitmap_counts *counts;
 	struct bitmap *bitmap = bitmap_create(mddev, slot);
 
-	if (IS_ERR(bitmap))
+	if (IS_ERR(bitmap)) {
+		bitmap_free(bitmap);
 		return PTR_ERR(bitmap);
+	}

but if 'bitmap' is an error, I think that bad things will happen in 'bitmap_free()' when, at the beginning of the function, we will execute:

	if (bitmap->sysfs_can_clear)	<-
		sysfs_put(bitmap->sysfs_can_clear);

I guess it is safe, since the below part is at the beginning of bitmap_free:

	if (!bitmap) /* there was no bitmap */
		return;

Add Guoqing.

Yeah, you are right, this bitmap_free isn't required. This must be something that slipped in in the v2 patch. I'll delete that line.

However, the commit log message is really explicit and adding this call to 'bitmap_free' has really been done on purpose. ("If bitmap_create returns an error, we need to call either bitmap_destroy or bitmap_free to do clean up, ...")

This log is a little confusing; I thought it really means the bitmap_free called in bitmap_create.

The V1 patch calls bitmap_destroy in bitmap_create.

I double checked the v1 patch: it called bitmap_destroy once bitmap_create returned an error inside bitmap_copy_from_slot; bitmap_destroy is also not called in location_store once it fails to create a bitmap. But since the bitmap_free within bitmap_create is used to handle the related failure, it seems we don't need the patch, and maybe we also don't need the second line of the below comment (the patch was motivated by the comment, IIRC):

/*
 * initialize the bitmap structure
 * if this returns an error, bitmap_destroy must be called to do clean up
 */

Thanks,
Guoqing
[PATCH 1/3] md: set MD_CHANGE_PENDING in an atomic region
Some code waits for a metadata update by: 1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN) 2. setting MD_CHANGE_PENDING and waking the management thread 3. waiting for MD_CHANGE_PENDING to be cleared If the first two are done without locking, the code in md_update_sb() which checks if it needs to repeat might test if an update is needed before step 1, then clear MD_CHANGE_PENDING after step 2, resulting in the wait returning early. So make sure all places that set MD_CHANGE_PENDING are atomic, and bit_clear_unless (suggested by Neil) is introduced for the purpose. Cc: Martin Kepplinger <mart...@posteo.de> Cc: Andrew Morton <a...@linux-foundation.org> Cc: Denys Vlasenko <dvlas...@redhat.com> Cc: Sasha Levin <sasha.le...@oracle.com> Cc: <linux-kernel@vger.kernel.org> Reviewed-by: NeilBrown <ne...@suse.com> Signed-off-by: Guoqing Jiang <gqji...@suse.com> --- drivers/md/md.c | 27 ++- drivers/md/raid1.c | 4 ++-- drivers/md/raid10.c | 8 drivers/md/raid5-cache.c | 4 ++-- drivers/md/raid5.c | 4 ++-- include/linux/bitops.h | 16 6 files changed, 40 insertions(+), 23 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index 06f6e81..892d639 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -2295,12 +2295,16 @@ repeat: if (mddev_is_clustered(mddev)) { if (test_and_clear_bit(MD_CHANGE_DEVS, &mddev->flags)) force_change = 1; + if (test_and_clear_bit(MD_CHANGE_CLEAN, &mddev->flags)) + nospares = 1; ret = md_cluster_ops->metadata_update_start(mddev); /* Has someone else updated the sb */ if (!does_sb_need_changing(mddev)) { if (ret == 0) md_cluster_ops->metadata_update_cancel(mddev); - clear_bit(MD_CHANGE_PENDING, &mddev->flags); + bit_clear_unless(&mddev->flags, BIT(MD_CHANGE_PENDING), +BIT(MD_CHANGE_DEVS) | +BIT(MD_CHANGE_CLEAN)); return; } } @@ -2434,15 +2438,11 @@ repeat: if (mddev_is_clustered(mddev) && ret == 0) md_cluster_ops->metadata_update_finish(mddev); - spin_lock(&mddev->lock); if (mddev->in_sync != sync_req || - test_bit(MD_CHANGE_DEVS, &mddev->flags)) { + !bit_clear_unless(&mddev->flags, BIT(MD_CHANGE_PENDING), + BIT(MD_CHANGE_DEVS) | BIT(MD_CHANGE_CLEAN))) /* have to write it out again */ - spin_unlock(&mddev->lock); goto repeat; - } - clear_bit(MD_CHANGE_PENDING, &mddev->flags); - spin_unlock(&mddev->lock); wake_up(&mddev->sb_wait); if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) sysfs_notify(&mddev->kobj, NULL, "sync_completed"); @@ -8147,18 +8147,18 @@ void md_do_sync(struct md_thread *thread) } } skip: - set_bit(MD_CHANGE_DEVS, &mddev->flags); - if (mddev_is_clustered(mddev) && ret == 0) { /* set CHANGE_PENDING here since maybe another * update is needed, so other nodes are informed */ - set_bit(MD_CHANGE_PENDING, &mddev->flags); + set_mask_bits(&mddev->flags, 0, + BIT(MD_CHANGE_PENDING) | BIT(MD_CHANGE_DEVS)); md_wakeup_thread(mddev->thread); wait_event(mddev->sb_wait, !test_bit(MD_CHANGE_PENDING, &mddev->flags)); md_cluster_ops->resync_finish(mddev); - } + } else + set_bit(MD_CHANGE_DEVS, &mddev->flags); spin_lock(&mddev->lock); if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) { @@ -8550,6 +8550,7 @@ EXPORT_SYMBOL(md_finish_reshape); int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors, int is_new) { + struct mddev *mddev = rdev->mddev; int rv; if (is_new) s += rdev->new_data_offset; @@ -8559,8 +8560,8 @@ int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors, if (rv == 0) { /* Make sure they get written out promptly */ sysfs_notify_dirent_safe(rdev->sysfs_state); - set_bit(MD_CHANGE_CLEAN, &rdev->mddev->flags); - set_bit(MD_CHANGE_PENDING, &rdev->mddev->flags); + set_mask_bits(&mddev->flags, 0, + BIT(MD_CHANGE_CLEAN) | BIT(MD_CHANGE_PENDING)); md_wakeup_thread(rdev->mddev->thread); return 1; } else diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index a7f2b9c..c7c8cde 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -1474,8 +1474,8 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) * if recovery is running, make sure it aborts.
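The bit_clear_unless() helper this patch adds is the heart of the fix: clear the PENDING bit only if no new DEVS/CLEAN request was flagged in the meantime, as one atomic step. Below is a hedged userspace sketch of those semantics using C11 atomics — the kernel version is a cmpxchg-based macro in include/linux/bitops.h; this function models its behavior, not its exact code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of bit_clear_unless() semantics: atomically clear the bits in
 * "clear" unless any bit in "test" is set.  Returns true iff the clear
 * actually happened (no "test" bit was observed). */
static bool bit_clear_unless(_Atomic unsigned long *flags,
                             unsigned long clear, unsigned long test)
{
        unsigned long old = atomic_load(flags);

        do {
                if (old & test)
                        return false;   /* a new update was flagged: keep PENDING set */
        } while (!atomic_compare_exchange_weak(flags, &old, old & ~clear));
        return true;
}
```

In md_update_sb() a false return (a DEVS or CLEAN bit raced in between the check and the clear) translates into `goto repeat`, so the superblock is written out again instead of a waiter being woken early.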
Re: [Cluster-devel] [PATCH] dlm: remove unnecessary error check
Hi David, David Teigland wrote: > On Thu, Jun 11, 2015 at 05:47:28PM +0800, Guoqing Jiang wrote: >> Would you consider taking the following cleanup? If yes, I will send a >> formal patch; otherwise please ignore it. > On first glance, the old and new code do not appear to do the same thing, > so let's leave it as it is. >> - to_nodeid = dlm_dir_nodeid(r); Sorry, it seems that is the only difference; if the previous change is combined with the modification below, then the behavior is the same. @@ -3644,7 +3644,10 @@ static int send_common(struct dlm_rsb *r, struct dlm_lkb *lkb, int mstype) struct dlm_mhandle *mh; int to_nodeid, error; - to_nodeid = r->res_nodeid; + if (mstype == DLM_MSG_LOOKUP) + to_nodeid = dlm_dir_nodeid(r); + else + to_nodeid = r->res_nodeid; And for create_message, the second parameter (lkb) has no effect when creating the three message types (REQUEST/LOOKUP/REMOVE). Thanks, Guoqing -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
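The consolidation proposed in the hunk above hinges on one behavioral difference: a LOOKUP message is addressed to the resource's directory node, while every other message goes to the resource master. A hypothetical sketch of that destination selection — the struct fields, enum values, and pick_to_nodeid() are illustrative stand-ins, not the real fs/dlm definitions:

```c
/* Illustrative stand-ins for the DLM types; field and constant names
 * mimic the real ones but the layout and values are hypothetical. */
enum msg_type { DLM_MSG_REQUEST, DLM_MSG_LOOKUP };

struct dlm_rsb {
        int res_nodeid;         /* resource master node */
        int res_dir_nodeid;     /* directory node for the resource name */
};

static int dlm_dir_nodeid(const struct dlm_rsb *r)
{
        return r->res_dir_nodeid;
}

/* The one difference between send_lookup() and the other send_* paths:
 * LOOKUP goes to the directory node, everything else to the master. */
static int pick_to_nodeid(const struct dlm_rsb *r, enum msg_type mstype)
{
        if (mstype == DLM_MSG_LOOKUP)
                return dlm_dir_nodeid(r);
        return r->res_nodeid;
}
```

With the destination folded into send_common() this way, send_lookup() can collapse to a one-line wrapper, which is the cleanup being discussed.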
Re: [Cluster-devel] [PATCH] dlm: remove unnecessary error check
Hi David, David Teigland wrote: > On Wed, Jun 10, 2015 at 11:10:44AM +0800, Guoqing Jiang wrote: >> The remove_from_waiters would only be invoked after create_message fails, >> right? Since send_message always returns 0, this patch doesn't touch anything >> about the failure path, and it also doesn't change the original semantics. > I'm not inclined to take any patches unless there's a problem identified. Would you consider taking the following cleanup? If yes, I will send a formal patch; otherwise please ignore it. diff --git a/fs/dlm/lock.c b/fs/dlm/lock.c index 35502d4..7c822f7 100644 --- a/fs/dlm/lock.c +++ b/fs/dlm/lock.c @@ -3747,30 +3747,7 @@ static int send_bast(struct dlm_rsb *r, struct dlm_lkb *lkb, int mode) static int send_lookup(struct dlm_rsb *r, struct dlm_lkb *lkb) { - struct dlm_message *ms; - struct dlm_mhandle *mh; - int to_nodeid, error; - - to_nodeid = dlm_dir_nodeid(r); - - error = add_to_waiters(lkb, DLM_MSG_LOOKUP, to_nodeid); - if (error) - return error; - - error = create_message(r, NULL, to_nodeid, DLM_MSG_LOOKUP, &ms, &mh); - if (error) - goto fail; - - send_args(r, lkb, ms); - - error = send_message(mh, ms); - if (error) - goto fail; - return 0; - - fail: - remove_from_waiters(lkb, DLM_MSG_LOOKUP_REPLY); - return error; + return send_common(r, lkb, DLM_MSG_LOOKUP); } Thanks, Guoqing
Re: [Cluster-devel] [PATCH] dlm: remove unnecessary error check
Bob Peterson wrote: >> - Original Message - >>> We don't need the redundant logic since send_message always returns 0. >>> Signed-off-by: Guoqing Jiang >>> --- >>> fs/dlm/lock.c | 10 ++ >>> 1 file changed, 2 insertions(+), 8 deletions(-) >>> diff --git a/fs/dlm/lock.c b/fs/dlm/lock.c >>> index 35502d4..6fc3de9 100644 >>> --- a/fs/dlm/lock.c >>> +++ b/fs/dlm/lock.c >>> @@ -3656,10 +3656,7 @@ static int send_common(struct dlm_rsb *r, struct >>> dlm_lkb *lkb, int mstype) >>> send_args(r, lkb, ms); >>> -error = send_message(mh, ms); >>> -if (error) >>> -goto fail; >>> -return 0; >>> +return send_message(mh, ms); > Hi Guoqing, > Sorry, I was momentarily confused. I think you misunderstood what I was saying. > What I meant was: Instead of doing: > + return send_message(mh, ms); > ...where send_message returns 0, it might be better to have: > static void send_message(struct dlm_mhandle *mh, struct dlm_message *ms) > { > dlm_message_out(ms); > dlm_lowcomms_commit_buffer(mh); > } > ...And in send_common, do (in both places): > + send_message(mh, ms); > + return 0; > Since it's so short, it might even be better to code send_message as a macro, > or at least an "inline" function. Hi Bob, Got it, thanks. It is a better solution, but it is not a bug fix or anything similar, so maybe just leave it as it is. Regards, Guoqing
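Bob's alternative — make send_message() void since it cannot fail, and have the callers return 0 explicitly — can be sketched as below. The body is a stand-in (a counter instead of dlm_message_out()/dlm_lowcomms_commit_buffer()); names are illustrative, not the real fs/dlm code.

```c
/* Stand-in for the buffer-commit side effect of the real send_message(). */
static int committed_buffers;

/* A void send_message(): there is no failure case to report, so no
 * return value for readers to chase back through the call chain. */
static void send_message(void)
{
        /* dlm_message_out(ms); dlm_lowcomms_commit_buffer(mh); */
        committed_buffers++;
}

static int send_common(void)
{
        send_message();
        return 0;       /* explicit: success is the only outcome here */
}
```

The design point is readability: `return 0;` at the call site documents that the send path cannot fail, whereas `return send_message(...);` forces a reader to verify that the callee always returns 0.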
Re: [Cluster-devel] [PATCH] dlm: remove unnecessary error check
Bob Peterson wrote: > - Original Message - >> Hi Bob, >> Bob Peterson wrote: >>> - Original Message - >>>> We don't need the redundant logic since send_message always returns 0. >>>> Signed-off-by: Guoqing Jiang >>>> --- >>>> fs/dlm/lock.c | 10 ++ >>>> 1 file changed, 2 insertions(+), 8 deletions(-) >>>> diff --git a/fs/dlm/lock.c b/fs/dlm/lock.c >>>> index 35502d4..6fc3de9 100644 >>>> --- a/fs/dlm/lock.c >>>> +++ b/fs/dlm/lock.c >>>> @@ -3656,10 +3656,7 @@ static int send_common(struct dlm_rsb *r, struct >>>> dlm_lkb *lkb, int mstype) >>>> send_args(r, lkb, ms); >>>> - error = send_message(mh, ms); >>>> - if (error) >>>> - goto fail; >>>> - return 0; >>>> + return send_message(mh, ms); >>>> fail: >>>> remove_from_waiters(lkb, msg_reply_type(mstype)); >>>> @@ -3763,10 +3760,7 @@ static int send_lookup(struct dlm_rsb *r, struct >>>> dlm_lkb *lkb) >>>> send_args(r, lkb, ms); >>>> - error = send_message(mh, ms); >>>> - if (error) >>>> - goto fail; >>>> - return 0; >>>> + return send_message(mh, ms); >>>> fail: >>>> remove_from_waiters(lkb, DLM_MSG_LOOKUP_REPLY); >>>> -- >>>> 1.7.12.4 >>> Hi, >>> The patch looks okay, but if remove_from_waiters() always returns 0, >>> wouldn't it be better to change the function from int to void and >>> return 0 here? The advantage is that code spelunkers wouldn't need >>> to back-track one more level (not to mention the instruction or two >>> it might save). >> It seems remove_from_waiters does not always return 0; the return value >> could be -1 or 0, depending on _remove_from_waiters. >> BTW, I found that there is no big difference between send_common >> and send_lookup; since send_common can also be used to send a >> lookup message, I guess send_lookup can be removed as well. >> Thanks, >> Guoqing > Hi Guoqing, > If remove_from_waiters can return -1, then the patch would prevent the > code from calling remove_from_waiters. So the patch still doesn't look > right to me. Hi Bob, The remove_from_waiters would only be invoked after create_message fails, right? Since send_message always returns 0, this patch doesn't touch anything about the failure path, and it also doesn't change the original semantics. Thanks, Guoqing
Re: [Cluster-devel] [PATCH] dlm: remove unnecessary error check
Hi Bob, Bob Peterson wrote: > - Original Message - >> We don't need the redundant logic since send_message always returns 0. >> Signed-off-by: Guoqing Jiang >> --- >> fs/dlm/lock.c | 10 ++ >> 1 file changed, 2 insertions(+), 8 deletions(-) >> diff --git a/fs/dlm/lock.c b/fs/dlm/lock.c >> index 35502d4..6fc3de9 100644 >> --- a/fs/dlm/lock.c >> +++ b/fs/dlm/lock.c >> @@ -3656,10 +3656,7 @@ static int send_common(struct dlm_rsb *r, struct >> dlm_lkb *lkb, int mstype) >> send_args(r, lkb, ms); >> -error = send_message(mh, ms); >> -if (error) >> -goto fail; >> -return 0; >> +return send_message(mh, ms); >> fail: >> remove_from_waiters(lkb, msg_reply_type(mstype)); >> @@ -3763,10 +3760,7 @@ static int send_lookup(struct dlm_rsb *r, struct >> dlm_lkb *lkb) >> send_args(r, lkb, ms); >> -error = send_message(mh, ms); >> -if (error) >> -goto fail; >> -return 0; >> +return send_message(mh, ms); >> fail: >> remove_from_waiters(lkb, DLM_MSG_LOOKUP_REPLY); >> -- >> 1.7.12.4 > Hi, > The patch looks okay, but if remove_from_waiters() always returns 0, > wouldn't it be better to change the function from int to void and > return 0 here? The advantage is that code spelunkers wouldn't need > to back-track one more level (not to mention the instruction or two > it might save). It seems remove_from_waiters does not always return 0; the return value could be -1 or 0, depending on _remove_from_waiters. BTW, I found that there is no big difference between send_common and send_lookup; since send_common can also be used to send a lookup message, I guess send_lookup can be removed as well. Thanks, Guoqing
[PATCH] dlm: remove unnecessary error check
We don't need the redundant logic since send_message always returns 0. Signed-off-by: Guoqing Jiang --- fs/dlm/lock.c | 10 ++ 1 file changed, 2 insertions(+), 8 deletions(-) diff --git a/fs/dlm/lock.c b/fs/dlm/lock.c index 35502d4..6fc3de9 100644 --- a/fs/dlm/lock.c +++ b/fs/dlm/lock.c @@ -3656,10 +3656,7 @@ static int send_common(struct dlm_rsb *r, struct dlm_lkb *lkb, int mstype) send_args(r, lkb, ms); - error = send_message(mh, ms); - if (error) - goto fail; - return 0; + return send_message(mh, ms); fail: remove_from_waiters(lkb, msg_reply_type(mstype)); @@ -3763,10 +3760,7 @@ static int send_lookup(struct dlm_rsb *r, struct dlm_lkb *lkb) send_args(r, lkb, ms); - error = send_message(mh, ms); - if (error) - goto fail; - return 0; + return send_message(mh, ms); fail: remove_from_waiters(lkb, DLM_MSG_LOOKUP_REPLY); -- 1.7.12.4
Re: linux-next: build failure after merge of the md tree
Stephen Rothwell wrote: > Hi Neil, > On Mon, 2 Mar 2015 17:11:49 +1100 Stephen Rothwell > wrote: >> On Mon, 2 Mar 2015 17:03:45 +1100 NeilBrown wrote: >>> I think >>> + bm_blocks = DIV_ROUND_UP(bm_blocks, 4096); >>> needs DIV_ROUND_UP_SECTOR_T() >> I tried that and it was not sufficient. >>> The first patch you identified adds that line. The second relocates it. >> The second also changes this: >> bm_blocks = sector_div(bitmap->mddev->resync_max_sectors, (chunksize >> 9)); >> (added by the first) to this: >> bm_blocks = bitmap->mddev->resync_max_sectors / >> (bitmap->mddev->bitmap_info.chunksize >> 9); >> where bitmap->mddev->resync_max_sectors is a sector_t ... > So I applied this patch for today: > From: Stephen Rothwell > Date: Tue, 3 Mar 2015 13:30:26 +1100 > Subject: [PATCH] md/bitmap: use sector_div for sector_t divisions > Signed-off-by: Stephen Rothwell > --- > drivers/md/bitmap.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c > index 23f575f0cd92..d40398404ab6 100644 > --- a/drivers/md/bitmap.c > +++ b/drivers/md/bitmap.c > @@ -573,7 +573,8 @@ re_read: > if (bitmap->cluster_slot >= 0) { > long long bm_blocks; > - bm_blocks = bitmap->mddev->resync_max_sectors / > (bitmap->mddev->bitmap_info.chunksize >> 9); > + bm_blocks = sector_div(bitmap->mddev->resync_max_sectors, > + bitmap->mddev->bitmap_info.chunksize >> 9); One simple question: isn't sector_div used to replace the '%' operator? The original line is: bm_blocks = bitmap->mddev->resync_max_sectors / (bitmap->mddev->bitmap_info.chunksize >> 9); But the modified line is: bm_blocks = sector_div(bitmap->mddev->resync_max_sectors, bitmap->mddev->bitmap_info.chunksize >> 9); sector_div returns: bitmap->mddev->resync_max_sectors % (bitmap->mddev->bitmap_info.chunksize >> 9) So it basically means: bm_blocks = bitmap->mddev->resync_max_sectors % (bitmap->mddev->bitmap_info.chunksize >> 9) And maybe the current next tree should add the following change to keep the original semantics. diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c index 501f83f..ea9c685 100644 --- a/drivers/md/bitmap.c +++ b/drivers/md/bitmap.c @@ -571,11 +571,10 @@ static int bitmap_read_sb(struct bitmap *bitmap) re_read: /* If cluster_slot is set, the cluster is setup */ if (bitmap->cluster_slot >= 0) { - sector_t bm_blocks; - sector_t resync_sectors = bitmap->mddev->resync_max_sectors; + sector_t bm_blocks = bitmap->mddev->resync_max_sectors; - bm_blocks = sector_div(resync_sectors, - bitmap->mddev->bitmap_info.chunksize >> 9); + sector_div(bm_blocks, + bitmap->mddev->bitmap_info.chunksize >> 9); bm_blocks = bm_blocks << 3; bm_blocks = DIV_ROUND_UP_SECTOR_T(bm_blocks, 4096); bitmap->mddev->bitmap_info.offset += bitmap->cluster_slot * (bm_blocks << 3); Thanks, Guoqing > bm_blocks = bm_blocks << 3; > bm_blocks = DIV_ROUND_UP_SECTOR_T(bm_blocks, 4096); > bitmap->mddev->bitmap_info.offset += bitmap->cluster_slot * > (bm_blocks << 3);
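The confusion in this thread is exactly the sector_div() contract: it divides its first argument in place and returns the remainder, so assigning its return value keeps the remainder where a quotient was wanted. A userspace model of the issue — sector_div here is a plain statement-expression macro standing in for the kernel's 32/64-bit implementation:

```c
#include <stdint.h>

typedef uint64_t sector_t;

/* Userspace model of sector_div(): divide "n" in place and return the
 * remainder (the shape of what the kernel macro does on 64-bit). */
#define sector_div(n, base)                     \
({                                              \
        sector_t _rem = (n) % (base);           \
        (n) /= (base);                          \
        _rem;                                   \
})

/* The line applied in next keeps the remainder ... */
static sector_t bm_blocks_buggy(sector_t resync_max_sectors,
                                unsigned int chunksize)
{
        return sector_div(resync_max_sectors, chunksize >> 9);
}

/* ... while the intent was to keep the quotient, as in the fix above. */
static sector_t bm_blocks_fixed(sector_t resync_max_sectors,
                                unsigned int chunksize)
{
        sector_t bm_blocks = resync_max_sectors;

        sector_div(bm_blocks, chunksize >> 9);  /* quotient lands in bm_blocks */
        return bm_blocks;
}
```

With resync_max_sectors = 1000 and a 64 KiB chunk (128 sectors), the buggy form yields 1000 % 128 while the intended value is 1000 / 128, which is why the suggested follow-up discards sector_div's return value and reads the quotient from the divided variable instead.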