Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
On 11/6/18 12:41 PM, Dave Chinner wrote:
> On Tue, Nov 06, 2018 at 12:00:06PM +0100, Jan Kara wrote:
>> On Tue 06-11-18 13:47:15, Dave Chinner wrote:
>>> On Mon, Nov 05, 2018 at 04:26:04PM -0800, John Hubbard wrote:
>>>> On 11/5/18 1:54 AM, Jan Kara wrote:
>>>>> Hmm, have you tried larger buffer sizes? Because synchronous 8k IO isn't
>>>>> going to max-out NVME iops by far. Can I suggest you install fio [1] (it
>>>>> has the advantage that it is pretty much standard for a test like this so
>>>>> everyone knows what the test does from a glimpse) and run with it something
>>>>> like the following workfile:
>>>>>
>>>>> [reader]
>>>>> direct=1
>>>>> ioengine=libaio
>>>>> blocksize=4096
>>>>> size=1g
>>>>> numjobs=1
>>>>> rw=read
>>>>> iodepth=64
>>>>>
>>>>> And see how the numbers with and without your patches compare?
>>>>>
>>>>> Honza
>>>>>
>>>>> [1] https://github.com/axboe/fio
>>>>
>>>> That program is *very* good to have. Whew. Anyway, it looks like read
>>>> bandwidth is approximately 74 MiB/s with my patch (it varies a bit, run
>>>> to run), as compared to around 85 without the patch, so still showing
>>>> about a 20% performance degradation, assuming I'm reading this correctly.
>>>>
>>>> Raw data follows, using the fio options you listed above:
>>>>
>>>> Baseline (without my patch):
>>>>
>>>> lat (usec): min=179, max=14003, avg=2913.65, stdev=1241.75
>>>> clat percentiles (usec):
>>>> |  1.00th=[ 2311],  5.00th=[ 2343], 10.00th=[ 2343], 20.00th=[ 2343],
>>>> | 30.00th=[ 2343], 40.00th=[ 2376], 50.00th=[ 2376], 60.00th=[ 2376],
>>>> | 70.00th=[ 2409], 80.00th=[ 2933], 90.00th=[ 4359], 95.00th=[ 5276],
>>>> | 99.00th=[ 8291], 99.50th=[ 9110], 99.90th=[10945], 99.95th=[11469],
>>>> | 99.99th=[12256]
>>>> .
>>>> Modified (with my patch):
>>>> .
>>>> lat (usec): min=81, max=15766, avg=3496.57, stdev=1450.21
>>>> clat percentiles (usec):
>>>> |  1.00th=[ 2835],  5.00th=[ 2835], 10.00th=[ 2835], 20.00th=[ 2868],
>>>> | 30.00th=[ 2868], 40.00th=[ 2868], 50.00th=[ 2868], 60.00th=[ 2900],
>>>> | 70.00th=[ 2933], 80.00th=[ 3425], 90.00th=[ 5080], 95.00th=[ 6259],
>>>> | 99.00th=[10159], 99.50th=[11076], 99.90th=[12649], 99.95th=[13435],
>>>> | 99.99th=[14484]
>>>
>>> So it's adding at least 500us of completion latency to every IO?
>>> I'd argue that the IO latency impact is far worse than a 20%
>>> throughput drop.
>>
>> Hum, right. So for each IO we have to remove the page from LRU on submit
>
> Which cost us less than 10us on average:
>
> slat (usec): min=13, max=3855, avg=44.17, stdev=61.18
> vs
> slat (usec): min=18, max=4378, avg=52.59, stdev=63.66
>
>> and then put it back on IO completion (which is going to race with new
>> submits so LRU lock contention might be an issue).
>
> Removal has to take the same LRU lock, so I don't think contention
> is the problem here. More likely the overhead is in selecting the
> LRU to put it on. e.g. list_lru_from_kmem() which may well be doing
> a memcg lookup.
>
>> Spending 500 us on that
>> is not unthinkable when the lock is contended but it is more expensive than
>> I'd have thought. John, could you perhaps profile where the time is spent?

OK, some updates on that:

1. First of all, I fixed a direct-io-related call site (it was still calling
put_page instead of put_user_page), and that not only got rid of a problem,
it also changed performance: it makes the impact of the patch a bit less.
(Sorry about that! I was hoping that I could get away with temporarily
ignoring that failure, but no.) The bandwidth numbers in particular look
much closer to each other.

2. Second, note that these fio results are noisy. The standard deviation is
large enough that some of this could be noise. In order to highlight that,
I did 5 runs each, with and without the patch, and while there is definitely
a performance drop on average, it's also true that there is overlap in the
results. In other words, the best "with patch" run is about the same as the
worst "without patch" run.

3. Finally, initial profiling shows that we're adding about 1% total to
this particular test run...I think. I may have to narrow this down some
more, but I don't seem to see any real lock contention. Hints or ideas on
measurement methods are welcome, btw.

-- 0.59% in put_user_page
-- 0.2% (or 0.54%, depending on how you read the perf output below) in
   get_user_pages_fast:

     1.36%--iov_iter_get_pages
            |
             --1.27%--get_user_pages_fast
                       |
                        --0.54%--pin_page_for_dma

     0.59%--put_user_page

     1.34%  0.03%  fio  [kernel.vmlinux]  [k] _raw_spin_lock
     0.95%  0.55%  fio  [kernel.vmlinux]  [k] do_raw_spin_lock
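For anyone wanting to repeat the measurement, the fio job quoted in this thread can be written out as a small script. This is only a convenience sketch: the file name reader.fio is an arbitrary choice (not from the original mails), and fio and perf must be installed separately.

```shell
# Write out the fio job file quoted in the thread.
cat > reader.fio <<'EOF'
[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64
EOF

# To run the benchmark:              fio reader.fio
# To profile it (as Jan asked), one common route is:
#   perf record -g -- fio reader.fio
#   perf report --stdio
grep -c '=' reader.fio    # counts the 7 parameter lines in the job file
```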
Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
On Tue, Nov 06, 2018 at 12:00:06PM +0100, Jan Kara wrote:
> On Tue 06-11-18 13:47:15, Dave Chinner wrote:
> > On Mon, Nov 05, 2018 at 04:26:04PM -0800, John Hubbard wrote:
> > > On 11/5/18 1:54 AM, Jan Kara wrote:
> > > > Hmm, have you tried larger buffer sizes? Because synchronous 8k IO isn't
> > > > going to max-out NVME iops by far. Can I suggest you install fio [1] (it
> > > > has the advantage that it is pretty much standard for a test like this so
> > > > everyone knows what the test does from a glimpse) and run with it something
> > > > like the following workfile:
> > > >
> > > > [reader]
> > > > direct=1
> > > > ioengine=libaio
> > > > blocksize=4096
> > > > size=1g
> > > > numjobs=1
> > > > rw=read
> > > > iodepth=64
> > > >
> > > > And see how the numbers with and without your patches compare?
> > > >
> > > > Honza
> > > >
> > > > [1] https://github.com/axboe/fio
> > >
> > > That program is *very* good to have. Whew. Anyway, it looks like read
> > > bandwidth is approximately 74 MiB/s with my patch (it varies a bit, run
> > > to run), as compared to around 85 without the patch, so still showing
> > > about a 20% performance degradation, assuming I'm reading this correctly.
> > >
> > > Raw data follows, using the fio options you listed above:
> > >
> > > Baseline (without my patch):
> > >
> > > lat (usec): min=179, max=14003, avg=2913.65, stdev=1241.75
> > > clat percentiles (usec):
> > > |  1.00th=[ 2311],  5.00th=[ 2343], 10.00th=[ 2343], 20.00th=[ 2343],
> > > | 30.00th=[ 2343], 40.00th=[ 2376], 50.00th=[ 2376], 60.00th=[ 2376],
> > > | 70.00th=[ 2409], 80.00th=[ 2933], 90.00th=[ 4359], 95.00th=[ 5276],
> > > | 99.00th=[ 8291], 99.50th=[ 9110], 99.90th=[10945], 99.95th=[11469],
> > > | 99.99th=[12256]
> > > .
> > > Modified (with my patch):
> > > .
> > > lat (usec): min=81, max=15766, avg=3496.57, stdev=1450.21
> > > clat percentiles (usec):
> > > |  1.00th=[ 2835],  5.00th=[ 2835], 10.00th=[ 2835], 20.00th=[ 2868],
> > > | 30.00th=[ 2868], 40.00th=[ 2868], 50.00th=[ 2868], 60.00th=[ 2900],
> > > | 70.00th=[ 2933], 80.00th=[ 3425], 90.00th=[ 5080], 95.00th=[ 6259],
> > > | 99.00th=[10159], 99.50th=[11076], 99.90th=[12649], 99.95th=[13435],
> > > | 99.99th=[14484]
> >
> > So it's adding at least 500us of completion latency to every IO?
> > I'd argue that the IO latency impact is far worse than a 20%
> > throughput drop.
>
> Hum, right. So for each IO we have to remove the page from LRU on submit

Which cost us less than 10us on average:

slat (usec): min=13, max=3855, avg=44.17, stdev=61.18
vs
slat (usec): min=18, max=4378, avg=52.59, stdev=63.66

> and then put it back on IO completion (which is going to race with new
> submits so LRU lock contention might be an issue).

Removal has to take the same LRU lock, so I don't think contention
is the problem here. More likely the overhead is in selecting the
LRU to put it on. e.g. list_lru_from_kmem() which may well be doing
a memcg lookup.

> Spending 500 us on that
> is not unthinkable when the lock is contended but it is more expensive than
> I'd have thought. John, could you perhaps profile where the time is spent?

That'll tell us for sure :)

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
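Dave's point — submission got less than 10us slower per IO while total latency grew by roughly 500us — follows from plain arithmetic on the averages quoted above. The values below are copied from the fio output in the thread; this is only a sanity check, not new data:

```python
# Averages quoted in the thread (microseconds), baseline vs. patched.
slat_base, slat_patched = 44.17, 52.59      # submission latency (slat)
lat_base, lat_patched = 2913.65, 3496.57    # total latency (lat)

extra_submit = slat_patched - slat_base     # cost added on the submit side
extra_total = lat_patched - lat_base        # cost added end to end

# The remainder of the added time is on the completion side, which is
# exactly where the "put the page back on the LRU" work happens.
extra_completion = extra_total - extra_submit
print(f"+{extra_submit:.2f} us submit, +{extra_completion:.2f} us completion")
```

So only about 8us of the roughly 583us of added average latency is submit-side; the rest lands on completion.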
Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
On Tue 06-11-18 13:47:15, Dave Chinner wrote:
> On Mon, Nov 05, 2018 at 04:26:04PM -0800, John Hubbard wrote:
> > On 11/5/18 1:54 AM, Jan Kara wrote:
> > > Hmm, have you tried larger buffer sizes? Because synchronous 8k IO isn't
> > > going to max-out NVME iops by far. Can I suggest you install fio [1] (it
> > > has the advantage that it is pretty much standard for a test like this so
> > > everyone knows what the test does from a glimpse) and run with it
> > > something like the following workfile:
> > >
> > > [reader]
> > > direct=1
> > > ioengine=libaio
> > > blocksize=4096
> > > size=1g
> > > numjobs=1
> > > rw=read
> > > iodepth=64
> > >
> > > And see how the numbers with and without your patches compare?
> > >
> > > Honza
> > >
> > > [1] https://github.com/axboe/fio
> >
> > That program is *very* good to have. Whew. Anyway, it looks like read
> > bandwidth is approximately 74 MiB/s with my patch (it varies a bit, run
> > to run), as compared to around 85 without the patch, so still showing
> > about a 20% performance degradation, assuming I'm reading this correctly.
> >
> > Raw data follows, using the fio options you listed above:
> >
> > Baseline (without my patch):
> >
> > lat (usec): min=179, max=14003, avg=2913.65, stdev=1241.75
> > clat percentiles (usec):
> > |  1.00th=[ 2311],  5.00th=[ 2343], 10.00th=[ 2343], 20.00th=[ 2343],
> > | 30.00th=[ 2343], 40.00th=[ 2376], 50.00th=[ 2376], 60.00th=[ 2376],
> > | 70.00th=[ 2409], 80.00th=[ 2933], 90.00th=[ 4359], 95.00th=[ 5276],
> > | 99.00th=[ 8291], 99.50th=[ 9110], 99.90th=[10945], 99.95th=[11469],
> > | 99.99th=[12256]
> > .
> > Modified (with my patch):
> > .
> > lat (usec): min=81, max=15766, avg=3496.57, stdev=1450.21
> > clat percentiles (usec):
> > |  1.00th=[ 2835],  5.00th=[ 2835], 10.00th=[ 2835], 20.00th=[ 2868],
> > | 30.00th=[ 2868], 40.00th=[ 2868], 50.00th=[ 2868], 60.00th=[ 2900],
> > | 70.00th=[ 2933], 80.00th=[ 3425], 90.00th=[ 5080], 95.00th=[ 6259],
> > | 99.00th=[10159], 99.50th=[11076], 99.90th=[12649], 99.95th=[13435],
> > | 99.99th=[14484]
>
> So it's adding at least 500us of completion latency to every IO?
> I'd argue that the IO latency impact is far worse than a 20%
> throughput drop.

Hum, right. So for each IO we have to remove the page from LRU on submit
and then put it back on IO completion (which is going to race with new
submits so LRU lock contention might be an issue). Spending 500 us on that
is not unthinkable when the lock is contended but it is more expensive than
I'd have thought. John, could you perhaps profile where the time is spent?

Honza
--
Jan Kara
SUSE Labs, CR
Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
On Mon, Nov 05, 2018 at 04:26:04PM -0800, John Hubbard wrote:
> On 11/5/18 1:54 AM, Jan Kara wrote:
> > Hmm, have you tried larger buffer sizes? Because synchronous 8k IO isn't
> > going to max-out NVME iops by far. Can I suggest you install fio [1] (it
> > has the advantage that it is pretty much standard for a test like this so
> > everyone knows what the test does from a glimpse) and run with it something
> > like the following workfile:
> >
> > [reader]
> > direct=1
> > ioengine=libaio
> > blocksize=4096
> > size=1g
> > numjobs=1
> > rw=read
> > iodepth=64
> >
> > And see how the numbers with and without your patches compare?
> >
> > Honza
> >
> > [1] https://github.com/axboe/fio
>
> That program is *very* good to have. Whew. Anyway, it looks like read
> bandwidth is approximately 74 MiB/s with my patch (it varies a bit, run to
> run), as compared to around 85 without the patch, so still showing about a
> 20% performance degradation, assuming I'm reading this correctly.
>
> Raw data follows, using the fio options you listed above:
>
> Baseline (without my patch):
>
> lat (usec): min=179, max=14003, avg=2913.65, stdev=1241.75
> clat percentiles (usec):
> |  1.00th=[ 2311],  5.00th=[ 2343], 10.00th=[ 2343], 20.00th=[ 2343],
> | 30.00th=[ 2343], 40.00th=[ 2376], 50.00th=[ 2376], 60.00th=[ 2376],
> | 70.00th=[ 2409], 80.00th=[ 2933], 90.00th=[ 4359], 95.00th=[ 5276],
> | 99.00th=[ 8291], 99.50th=[ 9110], 99.90th=[10945], 99.95th=[11469],
> | 99.99th=[12256]
> .
> Modified (with my patch):
> .
> lat (usec): min=81, max=15766, avg=3496.57, stdev=1450.21
> clat percentiles (usec):
> |  1.00th=[ 2835],  5.00th=[ 2835], 10.00th=[ 2835], 20.00th=[ 2868],
> | 30.00th=[ 2868], 40.00th=[ 2868], 50.00th=[ 2868], 60.00th=[ 2900],
> | 70.00th=[ 2933], 80.00th=[ 3425], 90.00th=[ 5080], 95.00th=[ 6259],
> | 99.00th=[10159], 99.50th=[11076], 99.90th=[12649], 99.95th=[13435],
> | 99.99th=[14484]

So it's adding at least 500us of completion latency to every IO?
I'd argue that the IO latency impact is far worse than a 20%
throughput drop. i.e. You can make up for throughput drops by
running a deeper queue/more dispatch threads, but you can't reduce
IO latency at all...

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
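Dave's queue-depth argument can be sanity-checked with Little's law (concurrency = throughput x latency): at a fixed iodepth, sustained IOPS is pinned to depth / average latency, so added per-IO latency translates directly into lost throughput unless the queue is made deeper. The numbers below are the average latencies quoted in the thread; the model itself is a back-of-the-envelope sketch, not something from the original mails:

```python
def iops(queue_depth: int, avg_lat_us: float) -> float:
    """Little's law: sustained IOPS at a fixed queue depth."""
    return queue_depth / (avg_lat_us * 1e-6)

# iodepth=64 from the fio job; average 'lat' values from the two runs.
baseline = iops(64, 2913.65)   # comes out near the ~21.9k IOPS fio reported
patched = iops(64, 3496.57)
drop_pct = 100 * (1 - patched / baseline)
print(f"{baseline:.0f} -> {patched:.0f} IOPS ({drop_pct:.1f}% drop)")
```

The model lands close to fio's measured 21.9k baseline IOPS, which is why adding ~500us of completion latency at a fixed queue depth shows up directly as a throughput drop.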
Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
On 11/5/18 1:54 AM, Jan Kara wrote:
> On Sun 04-11-18 23:10:12, John Hubbard wrote:
>> On 10/13/18 9:47 AM, Christoph Hellwig wrote:
>>> On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote:
>>>> In patch 6/6, pin_page_for_dma(), which is called at the end of
>>>> get_user_pages(), unceremoniously rips the pages out of the LRU, as a
>>>> prerequisite to using either of the page->dma_pinned_* fields.
>>>>
>>>> The idea is that LRU is not especially useful for this situation anyway,
>>>> so we'll just make it one or the other: either a page is dma-pinned, and
>>>> just hanging out doing RDMA most likely (and LRU is less meaningful
>>>> during that time), or it's possibly on an LRU list.
>>>
>>> Have you done any benchmarking what this does to direct I/O performance,
>>> especially for small I/O directly to a (fast) block device?
>>
>> Hi Christoph,
>>
>> I'm seeing about 20% slower in one case: lots of reads and writes of size
>> 8192 B, on a fast NVMe device. My put_page() --> put_user_page()
>> conversions are incomplete and buggy yet, but I've got enough of them done
>> to briefly run the test.
>>
>> One thing that occurs to me is that jumping on and off the LRU takes time,
>> and if we limited this to 64-bit platforms, maybe we could use a real page
>> flag? I know that leaves 32-bit out in the cold, but...maybe use this
>> slower approach for 32-bit, and the pure page flag for 64-bit? uggh, we
>> shouldn't slow down anything by 20%.
>>
>> Test program is below. I hope I didn't overlook something obvious, but
>> it's definitely possible, given my lack of experience with direct IO.
>>
>> I'm preparing to send an updated RFC this week, that contains the feedback
>> to date, and also many converted call sites as well, so that everyone can
>> see what the whole (proposed) story would look like in its latest
>> incarnation.
>
> Hmm, have you tried larger buffer sizes? Because synchronous 8k IO isn't
> going to max-out NVME iops by far. Can I suggest you install fio [1] (it
> has the advantage that it is pretty much standard for a test like this so
> everyone knows what the test does from a glimpse) and run with it something
> like the following workfile:
>
> [reader]
> direct=1
> ioengine=libaio
> blocksize=4096
> size=1g
> numjobs=1
> rw=read
> iodepth=64
>
> And see how the numbers with and without your patches compare?
>
> Honza
>
> [1] https://github.com/axboe/fio

That program is *very* good to have. Whew. Anyway, it looks like read
bandwidth is approximately 74 MiB/s with my patch (it varies a bit, run to
run), as compared to around 85 without the patch, so still showing about a
20% performance degradation, assuming I'm reading this correctly.

Raw data follows, using the fio options you listed above:

Baseline (without my patch):

reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=87.2MiB/s,w=0KiB/s][r=22.3k,w=0 IOPS][eta 00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=1775: Mon Nov 5 12:08:45 2018
   read: IOPS=21.9k, BW=85.7MiB/s (89.9MB/s)(1024MiB/11945msec)
    slat (usec): min=13, max=3855, avg=44.17, stdev=61.18
    clat (usec): min=71, max=13093, avg=2869.40, stdev=1225.23
     lat (usec): min=179, max=14003, avg=2913.65, stdev=1241.75
    clat percentiles (usec):
     |  1.00th=[ 2311],  5.00th=[ 2343], 10.00th=[ 2343], 20.00th=[ 2343],
     | 30.00th=[ 2343], 40.00th=[ 2376], 50.00th=[ 2376], 60.00th=[ 2376],
     | 70.00th=[ 2409], 80.00th=[ 2933], 90.00th=[ 4359], 95.00th=[ 5276],
     | 99.00th=[ 8291], 99.50th=[ 9110], 99.90th=[10945], 99.95th=[11469],
     | 99.99th=[12256]
   bw (  KiB/s): min=80648, max=93288, per=99.80%, avg=87608.57, stdev=3201.35, samples=23
   iops        : min=20162, max=23322, avg=21902.09, stdev=800.37, samples=23
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.02%, 4=88.47%, 10=11.27%, 20=0.25%
  cpu          : usr=2.68%, sys=94.68%, ctx=408, majf=0, minf=73
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=85.7MiB/s (89.9MB/s), 85.7MiB/s-85.7MiB/s (89.9MB/s-89.9MB/s), io=1024MiB (1074MB), run=11945-11945msec

Disk stats (read/write):
  nvme0n1: ios=260906/3, merge=0/1, ticks=14618/4, in_queue=17670, util=100.00%

Modified (with my patch):

reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1
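As a side note on the patch title itself, the dma_pinned_flags/_count pair amounts to a flag plus a reference count for DMA pins: the first pin marks the page (the point at which it leaves the LRU in the patch's scheme), and the last put_user_page() unmarks it. The toy model below is purely illustrative — the bit layout, widths, and function names are invented for this sketch, not the kernel's actual struct page fields:

```python
# Invented layout for illustration only: top bit = "DMA pinned" flag,
# low 31 bits = pin count. Not the kernel's real field layout.
PINNED_FLAG = 1 << 31
COUNT_MASK = PINNED_FLAG - 1

def pin(word: int) -> int:
    # Any pin sets the flag and bumps the count; the first pin is the
    # moment the page would leave the LRU in the patch's scheme.
    return (word | PINNED_FLAG) + 1

def unpin(word: int) -> int:
    word -= 1
    if (word & COUNT_MASK) == 0:    # last unpin clears the flag
        word &= ~PINNED_FLAG
    return word

w = pin(pin(0))                     # two overlapping gup() pins
assert (w & PINNED_FLAG) and (w & COUNT_MASK) == 2
w = unpin(unpin(w))                 # both put_user_page() calls done
assert w == 0                       # page is eligible for the LRU again
```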
Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
On 11/5/18 1:54 AM, Jan Kara wrote: > On Sun 04-11-18 23:10:12, John Hubbard wrote: >> On 10/13/18 9:47 AM, Christoph Hellwig wrote: >>> On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote: In patch 6/6, pin_page_for_dma(), which is called at the end of get_user_pages(), unceremoniously rips the pages out of the LRU, as a prerequisite to using either of the page->dma_pinned_* fields. The idea is that LRU is not especially useful for this situation anyway, so we'll just make it one or the other: either a page is dma-pinned, and just hanging out doing RDMA most likely (and LRU is less meaningful during that time), or it's possibly on an LRU list. >>> >>> Have you done any benchmarking what this does to direct I/O performance, >>> especially for small I/O directly to a (fast) block device? >>> >> >> Hi Christoph, >> >> I'm seeing about 20% slower in one case: lots of reads and writes of size >> 8192 B, >> on a fast NVMe device. My put_page() --> put_user_page() conversions are >> incomplete >> and buggy yet, but I've got enough of them done to briefly run the test. >> >> One thing that occurs to me is that jumping on and off the LRU takes time, >> and >> if we limited this to 64-bit platforms, maybe we could use a real page flag? >> I >> know that leaves 32-bit out in the cold, but...maybe use this slower approach >> for 32-bit, and the pure page flag for 64-bit? uggh, we shouldn't slow down >> anything >> by 20%. >> >> Test program is below. I hope I didn't overlook something obvious, but it's >> definitely possible, given my lack of experience with direct IO. >> >> I'm preparing to send an updated RFC this week, that contains the feedback >> to date, >> and also many converted call sites as well, so that everyone can see what >> the whole >> (proposed) story would look like in its latest incarnation. > > Hmm, have you tried larger buffer sizes? Because synchronous 8k IO isn't > going to max-out NVME iops by far. 
> Can I suggest you install fio [1] (it
> has the advantage that it is pretty much standard for a test like this so
> everyone knows what the test does from a glimpse) and run with it something
> like the following workfile:
>
> [reader]
> direct=1
> ioengine=libaio
> blocksize=4096
> size=1g
> numjobs=1
> rw=read
> iodepth=64
>
> And see how the numbers with and without your patches compare?
>
> Honza
>
> [1] https://github.com/axboe/fio

That program is *very* good to have. Whew. Anyway, it looks like read bandwidth is approximately 74 MiB/s with my patch (it varies a bit, run to run), as compared to around 85 without the patch, so still showing about a 20% performance degradation, assuming I'm reading this correctly. Raw data follows, using the fio options you listed above:

Baseline (without my patch):

reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=87.2MiB/s,w=0KiB/s][r=22.3k,w=0 IOPS][eta 00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=1775: Mon Nov 5 12:08:45 2018
  read: IOPS=21.9k, BW=85.7MiB/s (89.9MB/s)(1024MiB/11945msec)
    slat (usec): min=13, max=3855, avg=44.17, stdev=61.18
    clat (usec): min=71, max=13093, avg=2869.40, stdev=1225.23
     lat (usec): min=179, max=14003, avg=2913.65, stdev=1241.75
    clat percentiles (usec):
     |  1.00th=[ 2311],  5.00th=[ 2343], 10.00th=[ 2343], 20.00th=[ 2343],
     | 30.00th=[ 2343], 40.00th=[ 2376], 50.00th=[ 2376], 60.00th=[ 2376],
     | 70.00th=[ 2409], 80.00th=[ 2933], 90.00th=[ 4359], 95.00th=[ 5276],
     | 99.00th=[ 8291], 99.50th=[ 9110], 99.90th=[10945], 99.95th=[11469],
     | 99.99th=[12256]
   bw (  KiB/s): min=80648, max=93288, per=99.80%, avg=87608.57, stdev=3201.35, samples=23
   iops        : min=20162, max=23322, avg=21902.09, stdev=800.37, samples=23
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.02%, 4=88.47%, 10=11.27%, 20=0.25%
  cpu          : usr=2.68%, sys=94.68%, ctx=408, majf=0, minf=73
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=85.7MiB/s (89.9MB/s), 85.7MiB/s-85.7MiB/s (89.9MB/s-89.9MB/s), io=1024MiB (1074MB), run=11945-11945msec

Disk stats (read/write):
  nvme0n1: ios=260906/3, merge=0/1, ticks=14618/4, in_queue=17670, util=100.00%

Modified (with my patch):

reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1
Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
On Sun 04-11-18 23:10:12, John Hubbard wrote: > On 10/13/18 9:47 AM, Christoph Hellwig wrote: > > On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote: > >> In patch 6/6, pin_page_for_dma(), which is called at the end of > >> get_user_pages(), > >> unceremoniously rips the pages out of the LRU, as a prerequisite to using > >> either of the page->dma_pinned_* fields. > >> > >> The idea is that LRU is not especially useful for this situation anyway, > >> so we'll just make it one or the other: either a page is dma-pinned, and > >> just hanging out doing RDMA most likely (and LRU is less meaningful during > >> that > >> time), or it's possibly on an LRU list. > > > > Have you done any benchmarking what this does to direct I/O performance, > > especially for small I/O directly to a (fast) block device? > > > > Hi Christoph, > > I'm seeing about 20% slower in one case: lots of reads and writes of size > 8192 B, > on a fast NVMe device. My put_page() --> put_user_page() conversions are > incomplete > and buggy yet, but I've got enough of them done to briefly run the test. > > One thing that occurs to me is that jumping on and off the LRU takes time, and > if we limited this to 64-bit platforms, maybe we could use a real page flag? > I > know that leaves 32-bit out in the cold, but...maybe use this slower approach > for 32-bit, and the pure page flag for 64-bit? uggh, we shouldn't slow down > anything > by 20%. > > Test program is below. I hope I didn't overlook something obvious, but it's > definitely possible, given my lack of experience with direct IO. > > I'm preparing to send an updated RFC this week, that contains the feedback to > date, > and also many converted call sites as well, so that everyone can see what the > whole > (proposed) story would look like in its latest incarnation. Hmm, have you tried larger buffer sizes? Because synchronous 8k IO isn't going to max-out NVME iops by far. 
Can I suggest you install fio [1] (it has the advantage that it is pretty much standard for a test like this so everyone knows what the test does from a glimpse) and run with it something like the following workfile:

[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64

And see how the numbers with and without your patches compare?

Honza

[1] https://github.com/axboe/fio

>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <errno.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
>
> const static unsigned BUF_SIZE = 4096;
> static const unsigned FULL_DATA_SIZE = 2 * BUF_SIZE;
>
> void read_from_file(int fd, size_t how_much, char * buf)
> {
>     size_t bytes_read;
>
>     for (size_t index = 0; index < how_much; index += BUF_SIZE) {
>         bytes_read = read(fd, buf, BUF_SIZE);
>         if (bytes_read != BUF_SIZE) {
>             printf("reading file failed: %m\n");
>             exit(3);
>         }
>     }
> }
>
> void seek_to_start(int fd, char *caller)
> {
>     off_t result = lseek(fd, 0, SEEK_SET);
>     if (result == -1) {
>         printf("%s: lseek failed: %m\n", caller);
>         exit(4);
>     }
> }
>
> void write_to_file(int fd, size_t how_much, char * buf)
> {
>     int result;
>     for (size_t index = 0; index < how_much; index += BUF_SIZE) {
>         result = write(fd, buf, BUF_SIZE);
>         if (result < 0) {
>             printf("writing file failed: %m\n");
>             exit(3);
>         }
>     }
> }
>
> void read_and_write(int fd, size_t how_much, char * buf)
> {
>     seek_to_start(fd, "About to read");
>     read_from_file(fd, how_much, buf);
>
>     memset(buf, 'a', BUF_SIZE);
>
>     seek_to_start(fd, "About to write");
>     write_to_file(fd, how_much, buf);
> }
>
> int main(int argc, char *argv[])
> {
>     void *buf;
>     /*
>      * O_DIRECT requires at least 512 B alignment, but runs faster
>      * (2.8 sec, vs. 3.5 sec) with 4096 B alignment.
>      */
>     unsigned align = 4096;
>     posix_memalign(&buf, align, BUF_SIZE);
>
>     if (argc < 3) {
>         printf("Usage: %s <filename> <iterations>\n", argv[0]);
>         return 1;
>     }
>     char *filename = argv[1];
>     unsigned iterations = strtoul(argv[2], 0, 0);
>
>     /* Not using O_SYNC for now, anyway. */
>     int fd = open(filename, O_DIRECT | O_RDWR);
>     if (fd < 0) {
>         printf("Failed to open %s: %m\n", filename);
>         return 2;
>     }
>
>     printf("File: %s, data size: %u, iterations: %u\n",
>            filename, FULL_DATA_SIZE, iterations);
>
>     for (int count = 0; count < iterations; count++) {
>         read_and_write(fd, FULL_DATA_SIZE, buf);
>     }
>
>     close(fd);
>     return 0;
> }
>
>
> thanks,
> --
> John Hubbard
> NVIDIA

--
Jan Kara
SUSE
Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
On 10/13/18 9:47 AM, Christoph Hellwig wrote: > On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote: >> In patch 6/6, pin_page_for_dma(), which is called at the end of >> get_user_pages(), >> unceremoniously rips the pages out of the LRU, as a prerequisite to using >> either of the page->dma_pinned_* fields. >> >> The idea is that LRU is not especially useful for this situation anyway, >> so we'll just make it one or the other: either a page is dma-pinned, and >> just hanging out doing RDMA most likely (and LRU is less meaningful during >> that >> time), or it's possibly on an LRU list. > > Have you done any benchmarking what this does to direct I/O performance, > especially for small I/O directly to a (fast) block device? > Hi Christoph, I'm seeing about 20% slower in one case: lots of reads and writes of size 8192 B, on a fast NVMe device. My put_page() --> put_user_page() conversions are incomplete and buggy yet, but I've got enough of them done to briefly run the test. One thing that occurs to me is that jumping on and off the LRU takes time, and if we limited this to 64-bit platforms, maybe we could use a real page flag? I know that leaves 32-bit out in the cold, but...maybe use this slower approach for 32-bit, and the pure page flag for 64-bit? uggh, we shouldn't slow down anything by 20%. Test program is below. I hope I didn't overlook something obvious, but it's definitely possible, given my lack of experience with direct IO. I'm preparing to send an updated RFC this week, that contains the feedback to date, and also many converted call sites as well, so that everyone can see what the whole (proposed) story would look like in its latest incarnation. 
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

const static unsigned BUF_SIZE = 4096;
static const unsigned FULL_DATA_SIZE = 2 * BUF_SIZE;

void read_from_file(int fd, size_t how_much, char * buf)
{
    size_t bytes_read;

    for (size_t index = 0; index < how_much; index += BUF_SIZE) {
        bytes_read = read(fd, buf, BUF_SIZE);
        if (bytes_read != BUF_SIZE) {
            printf("reading file failed: %m\n");
            exit(3);
        }
    }
}

void seek_to_start(int fd, char *caller)
{
    off_t result = lseek(fd, 0, SEEK_SET);
    if (result == -1) {
        printf("%s: lseek failed: %m\n", caller);
        exit(4);
    }
}

void write_to_file(int fd, size_t how_much, char * buf)
{
    int result;
    for (size_t index = 0; index < how_much; index += BUF_SIZE) {
        result = write(fd, buf, BUF_SIZE);
        if (result < 0) {
            printf("writing file failed: %m\n");
            exit(3);
        }
    }
}

void read_and_write(int fd, size_t how_much, char * buf)
{
    seek_to_start(fd, "About to read");
    read_from_file(fd, how_much, buf);

    memset(buf, 'a', BUF_SIZE);

    seek_to_start(fd, "About to write");
    write_to_file(fd, how_much, buf);
}

int main(int argc, char *argv[])
{
    void *buf;
    /*
     * O_DIRECT requires at least 512 B alignment, but runs faster
     * (2.8 sec, vs. 3.5 sec) with 4096 B alignment.
     */
    unsigned align = 4096;
    posix_memalign(&buf, align, BUF_SIZE);

    if (argc < 3) {
        printf("Usage: %s <filename> <iterations>\n", argv[0]);
        return 1;
    }
    char *filename = argv[1];
    unsigned iterations = strtoul(argv[2], 0, 0);

    /* Not using O_SYNC for now, anyway. */
    int fd = open(filename, O_DIRECT | O_RDWR);
    if (fd < 0) {
        printf("Failed to open %s: %m\n", filename);
        return 2;
    }

    printf("File: %s, data size: %u, iterations: %u\n",
           filename, FULL_DATA_SIZE, iterations);

    for (int count = 0; count < iterations; count++) {
        read_and_write(fd, FULL_DATA_SIZE, buf);
    }

    close(fd);
    return 0;
}

thanks,
--
John Hubbard
NVIDIA
Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
On 10/24/18 4:00 AM, Balbir Singh wrote: > On Fri, Oct 12, 2018 at 05:15:51PM -0700, John Hubbard wrote: >> On 10/12/18 3:56 AM, Balbir Singh wrote: >>> On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubb...@gmail.com wrote: From: John Hubbard >> [...] + * Because page->dma_pinned_flags is unioned with page->lru, any page that + * uses these flags must NOT be on an LRU. That's partly enforced by + * ClearPageDmaPinned, which gives the page back to LRU. + * + * PageDmaPinned also corresponds to PageTail (the 0th bit in the first union + * of struct page), and this flag is checked without knowing whether it is a + * tail page or a PageDmaPinned page. Therefore, start the flags at bit 1 (0x2), + * rather than bit 0. + */ +#define PAGE_DMA_PINNED 0x2 +#define PAGE_DMA_PINNED_FLAGS (PAGE_DMA_PINNED) + >>> >>> This is really subtle, additional changes to compound_head will need to >>> coordinate >>> with these flags? Also doesn't this bit need to be unique across all >>> structs in >>> the union? I guess that is guaranteed by the fact that page == >>> compound_head(page) >>> as per your assertion, but I've forgotten why that is true. Could you please >>> add some commentary on that >>> >> >> Yes, agreed. I've rewritten and augmented that comment block, plus removed >> the >> PAGE_DMA_PINNED_FLAGS (there are no more bits available, so it's just >> misleading >> to even have it). So now it looks like this: >> >> /* >> * Because page->dma_pinned_flags is unioned with page->lru, any page that >> * uses these flags must NOT be on an LRU. That's partly enforced by >> * ClearPageDmaPinned, which gives the page back to LRU. >> * >> * PageDmaPinned is checked without knowing whether it is a tail page or a >> * PageDmaPinned page. For that reason, PageDmaPinned avoids PageTail (the >> 0th >> * bit in the first union of struct page), and instead uses bit 1 (0x2), >> * rather than bit 0. 
>> * >> * PageDmaPinned can only be used if no other systems are using the same bit >> * across the first struct page union. In this regard, it is similar to >> * PageTail, and in fact, because of PageTail's constraint that bit 0 be left >> * alone, bit 1 is also left alone so far: other union elements (ignoring >> tail >> * pages) put pointers there, and pointer alignment leaves the lower two bits >> * available. >> * >> * So, constraints include: >> * >> * -- Only use PageDmaPinned on non-tail pages. >> * -- Remove the page from any LRU list first. >> */ >> #define PAGE_DMA_PINNED 0x2 >> >> /* >> * Because these flags are read outside of a lock, ensure visibility between >> * different threads, by using READ|WRITE_ONCE. >> */ >> static __always_inline int PageDmaPinned(struct page *page) >> { >> VM_BUG_ON(page != compound_head(page)); >> return (READ_ONCE(page->dma_pinned_flags) & PAGE_DMA_PINNED) != 0; >> } >> >> [...] +static __always_inline void SetPageDmaPinned(struct page *page) +{ + VM_BUG_ON(page != compound_head(page)); >>> >>> VM_BUG_ON(!list_empty(&page->lru)) >> >> >> There is only one place where we set this flag, and that is when (in patch >> 6/6) >> transitioning from a page that might (or might not) have been >> on an LRU. In that case, the calling code has already corrupted page->lru, by >> writing to page->dma_pinned_count, which is unioned with page->lru: >> >> atomic_set(&page->dma_pinned_count, 1); >> SetPageDmaPinned(page); >> >> ...so it would be inappropriate to call a list function, such as >> list_empty(), on that field. Let's just leave it as-is. >> >> >>> + WRITE_ONCE(page->dma_pinned_flags, PAGE_DMA_PINNED); +} + +static __always_inline void ClearPageDmaPinned(struct page *page) +{ + VM_BUG_ON(page != compound_head(page)); + VM_BUG_ON_PAGE(!PageDmaPinnedFlags(page), page); + + /* This does a WRITE_ONCE to the lru.next, which is also the + * page->dma_pinned_flags field. 
So in addition to restoring page->lru, + * this provides visibility to other threads. + */ + INIT_LIST_HEAD(&page->lru); >>> >>> This assumes certain things about list_head, why not use the correct >>> initialization bits. >>> >> >> Yes, OK, changed to: >> >> static __always_inline void ClearPageDmaPinned(struct page *page) >> { >> VM_BUG_ON(page != compound_head(page)); >> VM_BUG_ON_PAGE(!PageDmaPinned(page), page); >> >> /* Provide visibility to other threads: */ >> WRITE_ONCE(page->dma_pinned_flags, 0); >> >> /* >> * Safety precaution: restore the list head, before possibly returning >> * the page to other subsystems. >> */ >> INIT_LIST_HEAD(&page->lru); >> } >> >> > > Sorry, I've been distracted with other things > > This looks better, do we still need the INIT_LIST_HEAD? > Good point. I guess not. I was getting a little too fancy, and
Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
On Fri, Oct 12, 2018 at 05:15:51PM -0700, John Hubbard wrote: > On 10/12/18 3:56 AM, Balbir Singh wrote: > > On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubb...@gmail.com wrote: > >> From: John Hubbard > [...] > >> + * Because page->dma_pinned_flags is unioned with page->lru, any page that > >> + * uses these flags must NOT be on an LRU. That's partly enforced by > >> + * ClearPageDmaPinned, which gives the page back to LRU. > >> + * > >> + * PageDmaPinned also corresponds to PageTail (the 0th bit in the first > >> union > >> + * of struct page), and this flag is checked without knowing whether it > >> is a > >> + * tail page or a PageDmaPinned page. Therefore, start the flags at bit 1 > >> (0x2), > >> + * rather than bit 0. > >> + */ > >> +#define PAGE_DMA_PINNED 0x2 > >> +#define PAGE_DMA_PINNED_FLAGS (PAGE_DMA_PINNED) > >> + > > > > This is really subtle, additional changes to compound_head will need to > > coordinate > > with these flags? Also doesn't this bit need to be unique across all > > structs in > > the union? I guess that is guaranteed by the fact that page == > > compound_head(page) > > as per your assertion, but I've forgotten why that is true. Could you please > > add some commentary on that > > > > Yes, agreed. I've rewritten and augmented that comment block, plus removed > the > PAGE_DMA_PINNED_FLAGS (there are no more bits available, so it's just > misleading > to even have it). So now it looks like this: > > /* > * Because page->dma_pinned_flags is unioned with page->lru, any page that > * uses these flags must NOT be on an LRU. That's partly enforced by > * ClearPageDmaPinned, which gives the page back to LRU. > * > * PageDmaPinned is checked without knowing whether it is a tail page or a > * PageDmaPinned page. For that reason, PageDmaPinned avoids PageTail (the 0th > * bit in the first union of struct page), and instead uses bit 1 (0x2), > * rather than bit 0. 
> * > * PageDmaPinned can only be used if no other systems are using the same bit > * across the first struct page union. In this regard, it is similar to > * PageTail, and in fact, because of PageTail's constraint that bit 0 be left > * alone, bit 1 is also left alone so far: other union elements (ignoring tail > * pages) put pointers there, and pointer alignment leaves the lower two bits > * available. > * > * So, constraints include: > * > * -- Only use PageDmaPinned on non-tail pages. > * -- Remove the page from any LRU list first. > */ > #define PAGE_DMA_PINNED 0x2 > > /* > * Because these flags are read outside of a lock, ensure visibility between > * different threads, by using READ|WRITE_ONCE. > */ > static __always_inline int PageDmaPinned(struct page *page) > { > VM_BUG_ON(page != compound_head(page)); > return (READ_ONCE(page->dma_pinned_flags) & PAGE_DMA_PINNED) != 0; > } > > [...] > >> +static __always_inline void SetPageDmaPinned(struct page *page) > >> +{ > >> + VM_BUG_ON(page != compound_head(page)); > > > > VM_BUG_ON(!list_empty(&page->lru)) > > > There is only one place where we set this flag, and that is when (in patch > 6/6) > transitioning from a page that might (or might not) have been > on an LRU. In that case, the calling code has already corrupted page->lru, by > writing to page->dma_pinned_count, which is unioned with page->lru: > > atomic_set(&page->dma_pinned_count, 1); > SetPageDmaPinned(page); > > ...so it would be inappropriate to call a list function, such as > list_empty(), on that field. Let's just leave it as-is. > > > > > >> + WRITE_ONCE(page->dma_pinned_flags, PAGE_DMA_PINNED); > >> +} > >> + > >> +static __always_inline void ClearPageDmaPinned(struct page *page) > >> +{ > >> + VM_BUG_ON(page != compound_head(page)); > >> + VM_BUG_ON_PAGE(!PageDmaPinnedFlags(page), page); > >> + > >> + /* This does a WRITE_ONCE to the lru.next, which is also the > >> + * page->dma_pinned_flags field. 
So in addition to restoring page->lru, > >> + * this provides visibility to other threads. > >> + */ > >> + INIT_LIST_HEAD(&page->lru); > > > > This assumes certain things about list_head, why not use the correct > > initialization bits. > > > > Yes, OK, changed to: > > static __always_inline void ClearPageDmaPinned(struct page *page) > { > VM_BUG_ON(page != compound_head(page)); > VM_BUG_ON_PAGE(!PageDmaPinned(page), page); > > /* Provide visibility to other threads: */ > WRITE_ONCE(page->dma_pinned_flags, 0); > > /* > * Safety precaution: restore the list head, before possibly returning > * the page to other subsystems. > */ > INIT_LIST_HEAD(&page->lru); > } > > Sorry, I've been distracted with other things. This looks better; do we still need the INIT_LIST_HEAD? Balbir Singh. > > -- > thanks, > John Hubbard > NVIDIA
Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
On Wed 17-10-18 17:03:03, John Hubbard wrote: > On 10/17/18 4:09 AM, Michal Hocko wrote: > > On Tue 16-10-18 18:48:23, John Hubbard wrote: > > [...] > >> It's hard to say exactly what the active/inactive/unevictable list should > >> be when DMA is done and put_user_page*() is called, because we don't know > >> if some device read, wrote, or ignored any of those pages. Although if > >> put_user_pages_dirty() is called, that's an argument for "active", at > >> least. > > > > Any reason to not use putback_lru_page? > > That does help with which LRU to use. I guess I'd still need to track whether > a page was on an LRU when get_user_pages() was called, because it seems > that that is not necessarily always the case. And putback_lru_page() > definitely > wants to deal with a page that *was* previously on an LRU. Well, if you ever g-u-p pages which are never going to go to LRU then sure (e.g. hugetlb pages). -- Michal Hocko SUSE Labs
Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
On 10/17/18 4:09 AM, Michal Hocko wrote: > On Tue 16-10-18 18:48:23, John Hubbard wrote: > [...] >> It's hard to say exactly what the active/inactive/unevictable list should >> be when DMA is done and put_user_page*() is called, because we don't know >> if some device read, wrote, or ignored any of those pages. Although if >> put_user_pages_dirty() is called, that's an argument for "active", at least. > > Any reason to not use putback_lru_page? That does help with which LRU to use. I guess I'd still need to track whether a page was on an LRU when get_user_pages() was called, because it seems that that is not necessarily always the case. And putback_lru_page() definitely wants to deal with a page that *was* previously on an LRU. > > Please note I haven't really got through your patches to have a wider > picture of the change so this is just a hint for the LRU part of the > issue. > Understood, and the hints are much appreciated. -- thanks, John Hubbard NVIDIA
Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
On Tue 16-10-18 18:48:23, John Hubbard wrote: [...] > It's hard to say exactly what the active/inactive/unevictable list should > be when DMA is done and put_user_page*() is called, because we don't know > if some device read, wrote, or ignored any of those pages. Although if > put_user_pages_dirty() is called, that's an argument for "active", at least. Any reason to not use putback_lru_page? Please note I haven't really got through your patches to have a wider picture of the change so this is just a hint for the LRU part of the issue. -- Michal Hocko SUSE Labs
Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
On 10/16/18 1:51 AM, Jan Kara wrote: > On Sun 14-10-18 10:01:24, Dave Chinner wrote: >> On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote: >>> On 10/12/18 8:55 PM, Dave Chinner wrote: On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubb...@gmail.com wrote: > From: John Hubbard >>> [...] > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 5ed8f6292a53..017ab82e36ca 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -78,12 +78,22 @@ struct page { >*/ > union { > struct {/* Page cache and anonymous pages */ > - /** > - * @lru: Pageout list, eg. active_list protected by > - * zone_lru_lock. Sometimes used as a generic list > - * by the page owner. > - */ > - struct list_head lru; > + union { > + /** > + * @lru: Pageout list, eg. active_list protected > + * by zone_lru_lock. Sometimes used as a > + * generic list by the page owner. > + */ > + struct list_head lru; > + /* Used by get_user_pages*(). Pages may not be > + * on an LRU while these dma_pinned_* fields > + * are in use. > + */ > + struct { > + unsigned long dma_pinned_flags; > + atomic_t dma_pinned_count; > + }; > + }; Isn't this broken for mapped file-backed pages? i.e. they may be passed as the user buffer to read/write direct IO and so the pages passed to gup will be on the active/inactive LRUs. hence I can't see how you can have dual use of the LRU list head like this What am I missing here? >>> >>> Hi Dave, >>> >>> In patch 6/6, pin_page_for_dma(), which is called at the end of >>> get_user_pages(), >>> unceremoniously rips the pages out of the LRU, as a prerequisite to using >>> either of the page->dma_pinned_* fields. >> >> How is that safe? If you've ripped the page out of the LRU, it's no >> longer being tracked by the page cache aging and reclaim algorithms. >> Patch 6 doesn't appear to put these pages back in the LRU, either, >> so it looks to me like this just dumps them on the ground after the >> gup reference is dropped. 
How do we reclaim these page cache pages >> when there is memory pressure if they aren't in the LRU? > > Yeah, that's a bug in patch 6/6 (possibly in ClearPageDmaPinned). It should > return the page to the LRU from put_user_page(). > Yes. Ugh, the LRU handling in this series is definitely not all there yet: probably need to track (in the page->dma_pinned_flags) which LRU (if any) each page was taken from. It's hard to say exactly what the active/inactive/unevictable list should be when DMA is done and put_user_page*() is called, because we don't know if some device read, wrote, or ignored any of those pages. Although if put_user_pages_dirty() is called, that's an argument for "active", at least. And maybe this will all be pointless if the DIRECT_IO performance test, that Christoph requested, shows that LRU operations are too expensive here, anyway. I wonder if we should just limit this to 64-bit arches and find a real page flag...well, let's see what the testing shows first I suppose. -- thanks, John Hubbard NVIDIA
Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
On Sun 14-10-18 10:01:24, Dave Chinner wrote: > On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote: > > On 10/12/18 8:55 PM, Dave Chinner wrote: > > > On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubb...@gmail.com wrote: > > >> From: John Hubbard > > [...] > > >> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > > >> index 5ed8f6292a53..017ab82e36ca 100644 > > >> --- a/include/linux/mm_types.h > > >> +++ b/include/linux/mm_types.h > > >> @@ -78,12 +78,22 @@ struct page { > > >> */ > > >> union { > > >> struct {/* Page cache and anonymous pages */ > > >> -/** > > >> - * @lru: Pageout list, eg. active_list > > >> protected by > > >> - * zone_lru_lock. Sometimes used as a generic > > >> list > > >> - * by the page owner. > > >> - */ > > >> -struct list_head lru; > > >> +union { > > >> +/** > > >> + * @lru: Pageout list, eg. active_list > > >> protected > > >> + * by zone_lru_lock. Sometimes used as > > >> a > > >> + * generic list by the page owner. > > >> + */ > > >> +struct list_head lru; > > >> +/* Used by get_user_pages*(). Pages may > > >> not be > > >> + * on an LRU while these dma_pinned_* > > >> fields > > >> + * are in use. > > >> + */ > > >> +struct { > > >> +unsigned long dma_pinned_flags; > > >> +atomic_t dma_pinned_count; > > >> +}; > > >> +}; > > > > > > Isn't this broken for mapped file-backed pages? i.e. they may be > > > passed as the user buffer to read/write direct IO and so the pages > > > passed to gup will be on the active/inactive LRUs. hence I can't see > > > how you can have dual use of the LRU list head like this > > > > > > What am I missing here? > > > > Hi Dave, > > > > In patch 6/6, pin_page_for_dma(), which is called at the end of > > get_user_pages(), > > unceremoniously rips the pages out of the LRU, as a prerequisite to using > > either of the page->dma_pinned_* fields. > > How is that safe? 
If you've ripped the page out of the LRU, it's no > longer being tracked by the page cache aging and reclaim algorithms. > Patch 6 doesn't appear to put these pages back in the LRU, either, > so it looks to me like this just dumps them on the ground after the > gup reference is dropped. How do we reclaim these page cache pages > when there is memory pressure if they aren't in the LRU? Yeah, that's a bug in patch 6/6 (possibly in ClearPageDmaPinned). It should return the page to the LRU from put_user_page(). Honza -- Jan Kara SUSE Labs, CR
Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote: > On 10/12/18 8:55 PM, Dave Chinner wrote: > > On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubb...@gmail.com wrote: > >> From: John Hubbard > [...] > >> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > >> index 5ed8f6292a53..017ab82e36ca 100644 > >> --- a/include/linux/mm_types.h > >> +++ b/include/linux/mm_types.h > >> @@ -78,12 +78,22 @@ struct page { > >> */ > >>union { > >>struct {/* Page cache and anonymous pages */ > >> - /** > >> - * @lru: Pageout list, eg. active_list protected by > >> - * zone_lru_lock. Sometimes used as a generic list > >> - * by the page owner. > >> - */ > >> - struct list_head lru; > >> + union { > >> + /** > >> + * @lru: Pageout list, eg. active_list protected > >> + * by zone_lru_lock. Sometimes used as a > >> + * generic list by the page owner. > >> + */ > >> + struct list_head lru; > >> + /* Used by get_user_pages*(). Pages may not be > >> + * on an LRU while these dma_pinned_* fields > >> + * are in use. > >> + */ > >> + struct { > >> + unsigned long dma_pinned_flags; > >> + atomic_t dma_pinned_count; > >> + }; > >> + }; > > > > Isn't this broken for mapped file-backed pages? i.e. they may be > > passed as the user buffer to read/write direct IO and so the pages > > passed to gup will be on the active/inactive LRUs. hence I can't see > > how you can have dual use of the LRU list head like this > > > > What am I missing here? > > Hi Dave, > > In patch 6/6, pin_page_for_dma(), which is called at the end of > get_user_pages(), > unceremoniously rips the pages out of the LRU, as a prerequisite to using > either of the page->dma_pinned_* fields. How is that safe? If you've ripped the page out of the LRU, it's no longer being tracked by the page cache aging and reclaim algorithms. Patch 6 doesn't appear to put these pages back in the LRU, either, so it looks to me like this just dumps them on the ground after the gup reference is dropped. 
How do we reclaim these page cache pages when there is memory pressure if they aren't in the LRU? > The idea is that LRU is not especially useful for this situation anyway, > so we'll just make it one or the other: either a page is dma-pinned, and > just hanging out doing RDMA most likely (and LRU is less meaningful during > that > time), or it's possibly on an LRU list. gup isn't just used for RDMA. It's used by direct IO in far, far more situations and machines than RDMA is. Please explain why ripping pages out of the LRU and not putting them back is safe, has no side effects, doesn't adversely impact page cache reclaim, etc. Indeed, I'd love to see a description of all the page references and where they come and go so we know the changes aren't just leaking these pages until the filesystem invalidates them at unmount. Maybe I'm not seeing why this is safe yet, but seeing as you haven't explained why it is safe then, at minimum, the patch descriptions are incomplete. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
On 10/13/18 9:47 AM, Christoph Hellwig wrote: > On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote: >> In patch 6/6, pin_page_for_dma(), which is called at the end of >> get_user_pages(), >> unceremoniously rips the pages out of the LRU, as a prerequisite to using >> either of the page->dma_pinned_* fields. >> >> The idea is that LRU is not especially useful for this situation anyway, >> so we'll just make it one or the other: either a page is dma-pinned, and >> just hanging out doing RDMA most likely (and LRU is less meaningful during >> that >> time), or it's possibly on an LRU list. > > Have you done any benchmarking what this does to direct I/O performance, > especially for small I/O directly to a (fast) block device? Not yet. I can go do that now. If you have any particular test suites, benchmarks, or just programs to recommend, please let me know. So far, I see tools/testing/selftests/vm/gup_benchmark.c -- thanks, John Hubbard NVIDIA
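For anyone reproducing the benchmark request above: besides gup_benchmark, the fio job that ended up being used later in this thread looks like the following. The target is not part of the job file; point fio at the device or file under test by adding a `filename=` line under `[reader]` (e.g. `filename=/dev/nvme0n1`, adjusted for your setup).

```ini
# 4 KiB synchronous-queue-depth-64 direct reads via libaio.
[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64
```

Comparing `bw`, `slat`, and `clat` percentiles between a patched and an unpatched kernel on the same device is what surfaced the latency regression discussed later in the thread.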
Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote: > In patch 6/6, pin_page_for_dma(), which is called at the end of > get_user_pages(), > unceremoniously rips the pages out of the LRU, as a prerequisite to using > either of the page->dma_pinned_* fields. > > The idea is that LRU is not especially useful for this situation anyway, > so we'll just make it one or the other: either a page is dma-pinned, and > just hanging out doing RDMA most likely (and LRU is less meaningful during > that > time), or it's possibly on an LRU list. Have you done any benchmarking what this does to direct I/O performance, especially for small I/O directly to a (fast) block device?
Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
On 10/12/18 8:55 PM, Dave Chinner wrote: > On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubb...@gmail.com wrote: >> From: John Hubbard [...] >> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h >> index 5ed8f6292a53..017ab82e36ca 100644 >> --- a/include/linux/mm_types.h >> +++ b/include/linux/mm_types.h >> @@ -78,12 +78,22 @@ struct page { >> */ >> union { >> struct {/* Page cache and anonymous pages */ >> -/** >> - * @lru: Pageout list, eg. active_list protected by >> - * zone_lru_lock. Sometimes used as a generic list >> - * by the page owner. >> - */ >> -struct list_head lru; >> +union { >> +/** >> + * @lru: Pageout list, eg. active_list protected >> + * by zone_lru_lock. Sometimes used as a >> + * generic list by the page owner. >> + */ >> +struct list_head lru; >> +/* Used by get_user_pages*(). Pages may not be >> + * on an LRU while these dma_pinned_* fields >> + * are in use. >> + */ >> +struct { >> +unsigned long dma_pinned_flags; >> +atomic_t dma_pinned_count; >> +}; >> +}; > > Isn't this broken for mapped file-backed pages? i.e. they may be > passed as the user buffer to read/write direct IO and so the pages > passed to gup will be on the active/inactive LRUs. hence I can't see > how you can have dual use of the LRU list head like this > > What am I missing here? Hi Dave, In patch 6/6, pin_page_for_dma(), which is called at the end of get_user_pages(), unceremoniously rips the pages out of the LRU, as a prerequisite to using either of the page->dma_pinned_* fields. The idea is that LRU is not especially useful for this situation anyway, so we'll just make it one or the other: either a page is dma-pinned, and just hanging out doing RDMA most likely (and LRU is less meaningful during that time), or it's possibly on an LRU list. -- thanks, John Hubbard NVIDIA
Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubb...@gmail.com wrote: > From: John Hubbard > > Add two struct page fields that, combined, are unioned with > struct page->lru. There is no change in the size of > struct page. These new fields are for type safety and clarity. > > Also add page flag accessors to test, set and clear the new > page->dma_pinned_flags field. > > The page->dma_pinned_count field will be used in upcoming > patches > > Signed-off-by: John Hubbard > --- > include/linux/mm_types.h | 22 +- > include/linux/page-flags.h | 47 ++ > 2 files changed, 63 insertions(+), 6 deletions(-) > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 5ed8f6292a53..017ab82e36ca 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -78,12 +78,22 @@ struct page { >*/ > union { > struct {/* Page cache and anonymous pages */ > - /** > - * @lru: Pageout list, eg. active_list protected by > - * zone_lru_lock. Sometimes used as a generic list > - * by the page owner. > - */ > - struct list_head lru; > + union { > + /** > + * @lru: Pageout list, eg. active_list protected > + * by zone_lru_lock. Sometimes used as a > + * generic list by the page owner. > + */ > + struct list_head lru; > + /* Used by get_user_pages*(). Pages may not be > + * on an LRU while these dma_pinned_* fields > + * are in use. > + */ > + struct { > + unsigned long dma_pinned_flags; > + atomic_t dma_pinned_count; > + }; > + }; Isn't this broken for mapped file-backed pages? i.e. they may be passed as the user buffer to read/write direct IO and so the pages passed to gup will be on the active/inactive LRUs. hence I can't see how you can have dual use of the LRU list head like this What am I missing here? Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
On 10/12/18 3:56 AM, Balbir Singh wrote: > On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubb...@gmail.com wrote: >> From: John Hubbard [...] >> + * Because page->dma_pinned_flags is unioned with page->lru, any page that >> + * uses these flags must NOT be on an LRU. That's partly enforced by >> + * ClearPageDmaPinned, which gives the page back to LRU. >> + * >> + * PageDmaPinned also corresponds to PageTail (the 0th bit in the first >> union >> + * of struct page), and this flag is checked without knowing whether it is a >> + * tail page or a PageDmaPinned page. Therefore, start the flags at bit 1 >> (0x2), >> + * rather than bit 0. >> + */ >> +#define PAGE_DMA_PINNED 0x2 >> +#define PAGE_DMA_PINNED_FLAGS (PAGE_DMA_PINNED) >> + > > This is really subtle, additional changes to compound_head will need to > coordinate > with these flags? Also doesn't this bit need to be unique across all structs > in > the union? I guess that is guaranteed by the fact that page == > compound_head(page) > as per your assertion, but I've forgotten why that is true. Could you please > add some commentary on that > Yes, agreed. I've rewritten and augmented that comment block, plus removed the PAGE_DMA_PINNED_FLAGS (there are no more bits available, so it's just misleading to even have it). So now it looks like this: /* * Because page->dma_pinned_flags is unioned with page->lru, any page that * uses these flags must NOT be on an LRU. That's partly enforced by * ClearPageDmaPinned, which gives the page back to LRU. * * PageDmaPinned is checked without knowing whether it is a tail page or a * PageDmaPinned page. For that reason, PageDmaPinned avoids PageTail (the 0th * bit in the first union of struct page), and instead uses bit 1 (0x2), * rather than bit 0. * * PageDmaPinned can only be used if no other systems are using the same bit * across the first struct page union. 
In this regard, it is similar to * PageTail, and in fact, because of PageTail's constraint that bit 0 be left * alone, bit 1 is also left alone so far: other union elements (ignoring tail * pages) put pointers there, and pointer alignment leaves the lower two bits * available. * * So, constraints include: * * -- Only use PageDmaPinned on non-tail pages. * -- Remove the page from any LRU list first. */ #define PAGE_DMA_PINNED 0x2 /* * Because these flags are read outside of a lock, ensure visibility between * different threads, by using READ|WRITE_ONCE. */ static __always_inline int PageDmaPinned(struct page *page) { VM_BUG_ON(page != compound_head(page)); return (READ_ONCE(page->dma_pinned_flags) & PAGE_DMA_PINNED) != 0; } [...] >> +static __always_inline void SetPageDmaPinned(struct page *page) >> +{ >> +VM_BUG_ON(page != compound_head(page)); > > VM_BUG_ON(!list_empty(&page->lru)) There is only one place where we set this flag, and that is when (in patch 6/6) transitioning from a page that might (or might not) have been on an LRU. In that case, the calling code has already corrupted page->lru, by writing to page->dma_pinned_count, which is unioned with page->lru: atomic_set(&page->dma_pinned_count, 1); SetPageDmaPinned(page); ...so it would be inappropriate to call a list function, such as list_empty(), on that field. Let's just leave it as-is. > >> +WRITE_ONCE(page->dma_pinned_flags, PAGE_DMA_PINNED); >> +} >> + >> +static __always_inline void ClearPageDmaPinned(struct page *page) >> +{ >> +VM_BUG_ON(page != compound_head(page)); >> +VM_BUG_ON_PAGE(!PageDmaPinnedFlags(page), page); >> + >> +/* This does a WRITE_ONCE to the lru.next, which is also the >> + * page->dma_pinned_flags field. So in addition to restoring page->lru, >> + * this provides visibility to other threads. >> + */ >> +INIT_LIST_HEAD(&page->lru); > > This assumes certain things about list_head, why not use the correct > initialization bits.
> Yes, OK, changed to: static __always_inline void ClearPageDmaPinned(struct page *page) { VM_BUG_ON(page != compound_head(page)); VM_BUG_ON_PAGE(!PageDmaPinned(page), page); /* Provide visibility to other threads: */ WRITE_ONCE(page->dma_pinned_flags, 0); /* * Safety precaution: restore the list head, before possibly returning * the page to other subsystems. */ INIT_LIST_HEAD(&page->lru); } -- thanks, John Hubbard NVIDIA
Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubb...@gmail.com wrote: > From: John Hubbard > > Add two struct page fields that, combined, are unioned with > struct page->lru. There is no change in the size of > struct page. These new fields are for type safety and clarity. > > Also add page flag accessors to test, set and clear the new > page->dma_pinned_flags field. > > The page->dma_pinned_count field will be used in upcoming > patches > > Signed-off-by: John Hubbard > --- > include/linux/mm_types.h | 22 +- > include/linux/page-flags.h | 47 ++ > 2 files changed, 63 insertions(+), 6 deletions(-) > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 5ed8f6292a53..017ab82e36ca 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -78,12 +78,22 @@ struct page { >*/ > union { > struct {/* Page cache and anonymous pages */ > - /** > - * @lru: Pageout list, eg. active_list protected by > - * zone_lru_lock. Sometimes used as a generic list > - * by the page owner. > - */ > - struct list_head lru; > + union { > + /** > + * @lru: Pageout list, eg. active_list protected > + * by zone_lru_lock. Sometimes used as a > + * generic list by the page owner. > + */ > + struct list_head lru; > + /* Used by get_user_pages*(). Pages may not be > + * on an LRU while these dma_pinned_* fields > + * are in use. > + */ > + struct { > + unsigned long dma_pinned_flags; > + atomic_t dma_pinned_count; > + }; > + }; > /* See page-flags.h for PAGE_MAPPING_FLAGS */ > struct address_space *mapping; > pgoff_t index; /* Our offset within mapping. 
*/ > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index 74bee8cecf4c..81ed52c3caae 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -425,6 +425,53 @@ static __always_inline int __PageMovable(struct page > *page) > PAGE_MAPPING_MOVABLE; > } > > +/* > + * Because page->dma_pinned_flags is unioned with page->lru, any page that > + * uses these flags must NOT be on an LRU. That's partly enforced by > + * ClearPageDmaPinned, which gives the page back to LRU. > + * > + * PageDmaPinned also corresponds to PageTail (the 0th bit in the first union > + * of struct page), and this flag is checked without knowing whether it is a > + * tail page or a PageDmaPinned page. Therefore, start the flags at bit 1 > (0x2), > + * rather than bit 0. > + */ > +#define PAGE_DMA_PINNED 0x2 > +#define PAGE_DMA_PINNED_FLAGS (PAGE_DMA_PINNED) > + This is really subtle, additional changes to compound_head will need to coordinate with these flags? Also doesn't this bit need to be unique across all structs in the union? I guess that is guaranteed by the fact that page == compound_head(page) as per your assertion, but I've forgotten why that is true. Could you please add some commentary on that > +/* > + * Because these flags are read outside of a lock, ensure visibility between > + * different threads, by using READ|WRITE_ONCE.
> + */ > +static __always_inline int PageDmaPinnedFlags(struct page *page) > +{ > + VM_BUG_ON(page != compound_head(page)); > + return (READ_ONCE(page->dma_pinned_flags) & PAGE_DMA_PINNED_FLAGS) != 0; > +} > + > +static __always_inline int PageDmaPinned(struct page *page) > +{ > + VM_BUG_ON(page != compound_head(page)); > + return (READ_ONCE(page->dma_pinned_flags) & PAGE_DMA_PINNED) != 0; > +} > + > +static __always_inline void SetPageDmaPinned(struct page *page) > +{ > + VM_BUG_ON(page != compound_head(page)); VM_BUG_ON(!list_empty(&page->lru)) > + WRITE_ONCE(page->dma_pinned_flags, PAGE_DMA_PINNED); > +} > + > +static __always_inline void ClearPageDmaPinned(struct page *page) > +{ > + VM_BUG_ON(page != compound_head(page)); > + VM_BUG_ON_PAGE(!PageDmaPinnedFlags(page), page); > + > + /* This does a WRITE_ONCE to the lru.next, which is also the > + * page->dma_pinned_flags field. So in addition to restoring page->lru, > + * this provides visibility to other threads. > + */ > + INIT_LIST_HEAD(&page->lru); This assumes certain things about list_head, why not use the correct initialization bits. > +} > + > #ifdef CONFIG_KSM > /* > * A KSM page is