Re: [PATCH V4 15/15] blk-throttle: add latency target support

2016-11-29 Thread Shaohua Li
On Tue, Nov 29, 2016 at 05:54:46PM -0500, Tejun Heo wrote: > Hello, > > On Tue, Nov 29, 2016 at 10:14:03AM -0800, Shaohua Li wrote: > > What the patches do doesn't conflict what you are talking about. We need a > > way > > to detect if cgroups are idle or active. I think the problem is how to >

Re: [PATCH V4 15/15] blk-throttle: add latency target support

2016-11-29 Thread Tejun Heo
Hello, On Tue, Nov 29, 2016 at 10:14:03AM -0800, Shaohua Li wrote: > What the patches do doesn't conflict what you are talking about. We need a way > to detect if cgroups are idle or active. I think the problem is how to define > 'active' and 'idle'. We must quantify the state. We could use: > 1.

Re: [PATCH V4 13/15] blk-throttle: add a mechanism to estimate IO latency

2016-11-29 Thread Tejun Heo
Hello, On Tue, Nov 29, 2016 at 10:30:44AM -0800, Shaohua Li wrote: > > As discussed separately, it might make more sense to just use the avg > > of the closest bucket instead of trying to line-fit the buckets, but > > it's an implementation detail and whatever which works is fine. > > that is sti
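
The "avg of the closest bucket" alternative mentioned in the message above could look roughly like the following userspace sketch. The struct, the function names and the microsecond unit are hypothetical and do not come from the patch; ties are resolved in favour of the first bucket, which is also an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bucket: average latency recorded for one request size. */
struct lat_bucket {
	unsigned int size_kb;        /* request size this bucket represents */
	unsigned int avg_latency_us; /* average latency (unit assumed: us) */
};

static unsigned int abs_diff(unsigned int a, unsigned int b)
{
	return a > b ? a - b : b - a;
}

/*
 * Return the average latency of the bucket whose size is closest to
 * size_kb. On a tie the earlier bucket wins.
 */
static unsigned int closest_bucket_latency(const struct lat_bucket *b,
					   size_t n, unsigned int size_kb)
{
	size_t i, best = 0;
	unsigned int best_dist, dist;

	assert(b != NULL && n > 0);
	best_dist = abs_diff(b[0].size_kb, size_kb);
	for (i = 1; i < n; i++) {
		dist = abs_diff(b[i].size_kb, size_kb);
		if (dist < best_dist) {
			best = i;
			best_dist = dist;
		}
	}
	return b[best].avg_latency_us;
}
```

Compared with a line fit, this needs no arithmetic beyond comparisons, at the cost of a coarser estimate between buckets.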

[PATCH v2 0/4] SED OPAL Library

2016-11-29 Thread Scott Bauer
Changes from v1->v2 1) Removed work queues and call backs. The code now operates in a normal call chain fashion. Each opal command provides a series of commands it needs to run. next() iterates through the functions only calling the subsequent function once the current has finished a
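
The call-chain idea described above (a command supplies a list of functions, and each one runs only after the previous one has finished) can be sketched in plain C as follows. This is not the patch's next() implementation: step_fn, run_steps(), the demo steps and the stop-on-first-error behaviour are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical step signature: returns 0 on success, non-zero on error. */
typedef int (*step_fn)(void *ctx);

/* Hypothetical context shared by all steps of one command. */
struct demo_ctx {
	int steps_run;
};

/*
 * Run the steps in order. A step is only called once the previous one
 * has returned; the chain stops at the first error (an assumption).
 */
static int run_steps(const step_fn *steps, size_t n, void *ctx)
{
	size_t i;
	int ret;

	assert(steps != NULL);
	for (i = 0; i < n; i++) {
		ret = steps[i](ctx);
		if (ret)
			return ret;
	}
	return 0;
}

/* Demo steps standing in for real commands. */
static int step_ok(void *ctx)
{
	((struct demo_ctx *)ctx)->steps_run++;
	return 0;
}

static int step_fail(void *ctx)
{
	((struct demo_ctx *)ctx)->steps_run++;
	return -5;
}
```

A command would then be expressed as an array such as `{ step_ok, step_ok, step_ok }` and handed to run_steps() together with its context.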

[PATCH v2 1/4] include: Add definitions for sed

2016-11-29 Thread Scott Bauer
This patch adds the definitions and structures for the SED Opal code. Signed-off-by: Scott Bauer Signed-off-by: Rafael Antognolli --- include/linux/sed-opal.h | 57 ++ include/linux/sed.h | 85 + include/uapi/linux/sed-opal.h

[PATCH v2 3/4] nvme: Implement resume_from_suspend and sed block ioctl

2016-11-29 Thread Scott Bauer
This patch implements the necessary logic to unlock a SED enabled device coming back from an S3. The patch also implements the ioctl handling from the block layer. Signed-off-by: Scott Bauer Signed-off-by: Rafael Antognolli --- drivers/nvme/host/core.c | 76

[PATCH v2 4/4] Maintainers: Add Information for SED Opal library

2016-11-29 Thread Scott Bauer
Signed-off-by: Scott Bauer Signed-off-by: Rafael Antognolli --- MAINTAINERS | 10 ++ 1 file changed, 10 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index 8d414840..929eba3 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -10846,6 +10846,16 @@ L: linux-...@vger.kernel.org S:

[PATCH v2 2/4] block: Add Sed-opal library

2016-11-29 Thread Scott Bauer
This patch implements the necessary logic to bring an Opal enabled drive out of a factory-enabled into a working Opal state. This patch set also enables logic to save a password to be replayed during a resume from suspend. The key can be saved in the driver or in the Kernel's Key management. Signe

Re: [PATCH 00/23] LightNVM patches for 4.10

2016-11-29 Thread Jens Axboe
On 11/28/2016 02:38 PM, Matias Bjørling wrote: Hi Jens, A bunch of patches for 4.10 have been prepared. Javier has been busy eliminating abstractions in the LightNVM interface. Mainly killing generic nvm_block and nvm_lun, which simplifies the locking mechanism within targets. He also added a c

Re: [PATCH V4 13/15] blk-throttle: add a mechanism to estimate IO latency

2016-11-29 Thread Shaohua Li
On Tue, Nov 29, 2016 at 12:24:35PM -0500, Tejun Heo wrote: > Hello, Shaohua. > > On Mon, Nov 14, 2016 at 02:22:20PM -0800, Shaohua Li wrote: > > To do this, we sample some data, eg, average latency for request size > > 4k, 8k, 16k, 32k, 64k. We then use an equation f(x) = a * x + b to fit > > the

Re: [PATCH V4 15/15] blk-throttle: add latency target support

2016-11-29 Thread Shaohua Li
On Tue, Nov 29, 2016 at 12:31:08PM -0500, Tejun Heo wrote: > Hello, > > On Mon, Nov 14, 2016 at 02:22:22PM -0800, Shaohua Li wrote: > > One hard problem adding .high limit is to detect idle cgroup. If one > > cgroup doesn't dispatch enough IO against its high limit, we must have a > > mechanism to

Re: [PATCH V4 15/15] blk-throttle: add latency target support

2016-11-29 Thread Tejun Heo
Hello, On Mon, Nov 14, 2016 at 02:22:22PM -0800, Shaohua Li wrote: > One hard problem adding .high limit is to detect idle cgroup. If one > cgroup doesn't dispatch enough IO against its high limit, we must have a > mechanism to determine if other cgroups dispatch more IO. We added the > think time

Re: [PATCH V4 13/15] blk-throttle: add a mechanism to estimate IO latency

2016-11-29 Thread Tejun Heo
Hello, Shaohua. On Mon, Nov 14, 2016 at 02:22:20PM -0800, Shaohua Li wrote: > To do this, we sample some data, eg, average latency for request size > 4k, 8k, 16k, 32k, 64k. We then use an equation f(x) = a * x + b to fit > the data (x is request size in KB, f(x) is the latency). Then we can use >
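
The fit f(x) = a * x + b quoted above can be illustrated with an ordinary least-squares fit over the sampled sizes. This is a userspace sketch, not the patch's code: the struct, the function names and the microsecond unit are hypothetical, and it uses double for readability, whereas kernel code would typically stick to integer arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sample: average latency observed for one request size. */
struct lat_sample {
	double size_kb;    /* x: request size in KB */
	double latency_us; /* f(x): average latency (unit assumed: us) */
};

/*
 * Least-squares fit of f(x) = a * x + b.
 * Returns 0 on success, -1 if the fit is undefined (fewer than two
 * samples, or all samples share the same x).
 */
static int fit_line(const struct lat_sample *s, size_t n, double *a, double *b)
{
	double sx = 0, sy = 0, sxx = 0, sxy = 0, denom;
	size_t i;

	assert(s != NULL && a != NULL && b != NULL);
	if (n < 2)
		return -1;
	for (i = 0; i < n; i++) {
		sx += s[i].size_kb;
		sy += s[i].latency_us;
		sxx += s[i].size_kb * s[i].size_kb;
		sxy += s[i].size_kb * s[i].latency_us;
	}
	denom = (double)n * sxx - sx * sx;
	if (denom == 0.0)
		return -1;
	*a = ((double)n * sxy - sx * sy) / denom;
	*b = (sy - *a * sx) / (double)n;
	return 0;
}

/* Estimate the latency of an arbitrary request size from the fit. */
static double estimate_latency(double a, double b, double size_kb)
{
	return a * size_kb + b;
}
```

With samples at 4k, 8k, 16k, 32k and 64k, the fitted a and b can then be used to estimate the latency of request sizes that were never sampled directly.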

Re: [PATCH V4 10/15] blk-throttle: add a simple idle detection

2016-11-29 Thread Tejun Heo
Hello, Shaohua. On Mon, Nov 28, 2016 at 03:10:18PM -0800, Shaohua Li wrote: > > But we can increase sharing by upping the target latency. That should > > be the main knob - if low, the user wants stricter service guarantee > > at the cost of lower overall utilization; if high, the workload can >

Re: [PATCH] blk-mq: Drop explicit timeout sync in hotplug

2016-11-29 Thread Jens Axboe
On 11/28/2016 10:01 AM, Gabriel Krisman Bertazi wrote: Sorry for the dup. Missed linux-block address. After commit 287922eb0b18 ("block: defer timeouts to a workqueue"), deleting the timeout work after freezing the queue shouldn't be necessary, since the synchronization is already enforced

[PATCHv5 02/36] Revert "radix-tree: implement radix_tree_maybe_preload_order()"

2016-11-29 Thread Kirill A. Shutemov
This reverts commit 356e1c23292a4f63cfdf1daf0e0ddada51f32de8. After conversion of huge tmpfs to multi-order entries, we don't need this anymore. Signed-off-by: Kirill A. Shutemov --- include/linux/radix-tree.h | 1 - lib/radix-tree.c | 74 -

[PATCHv5 15/36] thp: do not threat slab pages as huge in hpage_{nr_pages,size,mask}

2016-11-29 Thread Kirill A. Shutemov
Slab pages can be compound, but we shouldn't treat them as THP for the purpose of hpage_* helpers, otherwise it would lead to confusing results. For instance, ext4 uses slab pages for journal pages and we shouldn't confuse them with THPs. The easiest way is to exclude them in hpage_* helpers. Signed-

[PATCHv5 12/36] brd: make it handle huge pages

2016-11-29 Thread Kirill A. Shutemov
Do not assume length of bio segment is never larger than PAGE_SIZE. With huge pages it's HPAGE_PMD_SIZE (2M on x86-64). Signed-off-by: Kirill A. Shutemov --- drivers/block/brd.c | 17 - 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/drivers/block/brd.c b/drivers/b

[PATCHv5 14/36] thp: introduce hpage_size() and hpage_mask()

2016-11-29 Thread Kirill A. Shutemov
Introduce new helpers which return size/mask of the page: HPAGE_PMD_SIZE/HPAGE_PMD_MASK if the page is PageTransHuge() and PAGE_SIZE/PAGE_MASK otherwise. Signed-off-by: Kirill A. Shutemov --- include/linux/huge_mm.h | 16 1 file changed, 16 insertions(+) diff --git a/include/li

[PATCHv5 04/36] mm, rmap: account file thp pages

2016-11-29 Thread Kirill A. Shutemov
Let's add FileHugePages and FilePmdMapped fields into meminfo and smaps. It indicates how many times we allocate and map file THP. Signed-off-by: Kirill A. Shutemov --- drivers/base/node.c| 6 ++ fs/proc/meminfo.c | 4 fs/proc/task_mmu.c | 5 - include/linux/mmzone.h

[PATCHv5 06/36] thp: handle write-protection faults for file THP

2016-11-29 Thread Kirill A. Shutemov
For filesystems that want to be write-notified (have mkwrite), we will encounter write-protection faults for huge PMDs in shared mappings. The easiest way to handle them is to clear the PMD and let it refault as writable. Signed-off-by: Kirill A. Shutemov Reviewed-by: Jan Kara --- mm/memory.c | 1

[PATCHv5 21/36] truncate: make invalidate_inode_pages2_range() aware about huge pages

2016-11-29 Thread Kirill A. Shutemov
For huge pages we need to unmap whole range covered by the huge page. Signed-off-by: Kirill A. Shutemov --- mm/truncate.c | 23 ++- 1 file changed, 14 insertions(+), 9 deletions(-) diff --git a/mm/truncate.c b/mm/truncate.c index d2d95f283ec3..6df4b06a190f 100644 --- a/mm/tr

[PATCHv5 19/36] fs: make block_page_mkwrite() aware about huge pages

2016-11-29 Thread Kirill A. Shutemov
Adjust check on whether part of the page beyond file size and apply compound_head() and page_mapping() where appropriate. Signed-off-by: Kirill A. Shutemov --- fs/buffer.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index 7d333621ccfb.

[PATCHv5 30/36] ext4: make ext4_da_page_release_reservation() aware about huge pages

2016-11-29 Thread Kirill A. Shutemov
For huge pages 'stop' must be within HPAGE_PMD_SIZE. Let's use hpage_size() in the BUG_ON(). We also need to change how we calculate lblk for cluster deallocation. Signed-off-by: Kirill A. Shutemov --- fs/ext4/inode.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/fs/e

[PATCHv5 01/36] mm, shmem: swich huge tmpfs to multi-order radix-tree entries

2016-11-29 Thread Kirill A. Shutemov
We would need to use multi-order radix-tree entries for ext4 and other filesystems to have a coherent view on tags (dirty/towrite) in the tree. This patch converts huge tmpfs implementation to multi-order entries, so we will be able to use the same code path for all filesystems. We also change int

[PATCHv5 18/36] fs: make block_write_{begin,end}() be able to handle huge pages

2016-11-29 Thread Kirill A. Shutemov
It's more or less straightforward. Most changes are around getting offset/len within page right and zeroing out the desired part of the page. Signed-off-by: Kirill A. Shutemov --- fs/buffer.c | 70 +++-- 1 file changed, 40 insertions(+), 30 del

[PATCHv5 24/36] ext4: make ext4_mpage_readpages() hugepage-aware

2016-11-29 Thread Kirill A. Shutemov
As BIO_MAX_PAGES is smaller (on x86) than HPAGE_PMD_NR, we cannot use the optimization ext4_mpage_readpages() provides. So, for huge pages, we fallback directly to block_read_full_page(). This should be re-visited once we get multipage bvec upstream. Signed-off-by: Kirill A. Shutemov --- fs/ex

[PATCHv5 26/36] ext4: handle huge pages in ext4_page_mkwrite()

2016-11-29 Thread Kirill A. Shutemov
Trivial: remove assumption on page size. Signed-off-by: Kirill A. Shutemov --- fs/ext4/inode.c | 13 +++-- 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index fa4467e4b129..387aa857770b 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @

[PATCHv5 27/36] ext4: handle huge pages in __ext4_block_zero_page_range()

2016-11-29 Thread Kirill A. Shutemov
As the function handles zeroing range only within one block, the required changes are trivial, just remove assumption on page size. Signed-off-by: Kirill A. Shutemov --- fs/ext4/inode.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c in

[PATCHv5 17/36] fs: make block_read_full_page() be able to read huge page

2016-11-29 Thread Kirill A. Shutemov
The approach is straight-forward: for compound pages we read out whole huge page. For huge page we cannot have array of buffer head pointers on stack -- it's 4096 pointers on x86-64 -- 'arr' is allocated with kmalloc() for huge pages. Signed-off-by: Kirill A. Shutemov --- fs/buffer.c
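
The "4096 pointers" figure above can be reproduced with a little arithmetic, assuming the array needs one buffer head pointer per smallest possible block. The 512-byte minimum block size, the 8-byte pointer size used in the example, and all names below are assumptions for illustration rather than values taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MIN_BLOCK_SIZE  512UL       /* assumed smallest block size */
#define DEMO_PAGE_SIZE       (1UL << 12) /* 4 KiB base page */
#define DEMO_HPAGE_PMD_SIZE  (1UL << 21) /* 2 MiB huge page on x86-64 */

/* Worst-case number of buffers (one per smallest block) in a page. */
static size_t max_buffers_per_page(unsigned long page_bytes)
{
	assert(page_bytes % DEMO_MIN_BLOCK_SIZE == 0);
	return page_bytes / DEMO_MIN_BLOCK_SIZE;
}

/* Bytes needed for an array holding one pointer per buffer. */
static size_t pointer_array_bytes(unsigned long page_bytes, size_t ptr_size)
{
	return max_buffers_per_page(page_bytes) * ptr_size;
}
```

Under these assumptions a base page needs at most 8 pointers, while a huge page needs 4096, which with 8-byte pointers is 32 KiB and far too large for an on-stack array.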

[PATCHv5 35/36] mm, fs, ext4: expand use of page_mapping() and page_to_pgoff()

2016-11-29 Thread Kirill A. Shutemov
With huge pages in page cache we see tail pages in more code paths. This patch replaces direct access to struct page fields with macros which can handle tail pages properly. Signed-off-by: Kirill A. Shutemov --- fs/buffer.c | 2 +- fs/ext4/inode.c | 4 ++-- mm/filemap.c| 24

[PATCHv5 32/36] ext4: make EXT4_IOC_MOVE_EXT work with huge pages

2016-11-29 Thread Kirill A. Shutemov
Adjust how we find relevant block within page and how we clear the required part of the page. Signed-off-by: Kirill A. Shutemov --- fs/ext4/move_extent.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c index 6fc14def0

[PATCHv5 31/36] ext4: handle writeback with huge pages

2016-11-29 Thread Kirill A. Shutemov
Modify mpage_map_and_submit_buffers() and mpage_release_unused_pages() to deal with huge pages. Mostly the result of trial and error. A critical review would be appreciated. Signed-off-by: Kirill A. Shutemov --- fs/ext4/inode.c | 61 - 1 file change

[PATCHv5 03/36] page-flags: relax page flag policy for few flags

2016-11-29 Thread Kirill A. Shutemov
These flags are in use for filesystems with backing storage: PG_error, PG_writeback and PG_readahead. Signed-off-by: Kirill A. Shutemov --- include/linux/page-flags.h | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-fl

[PATCHv5 33/36] ext4: fix SEEK_DATA/SEEK_HOLE for huge pages

2016-11-29 Thread Kirill A. Shutemov
ext4_find_unwritten_pgoff() needs a few tweaks to work with huge pages. Mostly trivial page_mapping()/page_to_pgoff() and adjustment to how we find relevant block. Signed-off-by: Kirill A. Shutemov --- fs/ext4/file.c | 18 ++ 1 file changed, 14 insertions(+), 4 deletions(-) diff --

[PATCHv5 34/36] ext4: make fallocate() operations work with huge pages

2016-11-29 Thread Kirill A. Shutemov
__ext4_block_zero_page_range() adjusted to calculate starting iblock correctly for huge pages. ext4_{collapse,insert}_range() requires page cache invalidation. We need the invalidation to be aligned to huge page border if huge pages are possible in page cache. Signed-off-by: Kirill A. Shutemov

[PATCHv5 16/36] thp: make thp_get_unmapped_area() respect S_HUGE_MODE

2016-11-29 Thread Kirill A. Shutemov
We want mmap(NULL) to return PMD-aligned address if the inode can have huge pages in page cache. Signed-off-by: Kirill A. Shutemov --- mm/huge_memory.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index a15d566b14f6..9c6ba124ba50 1006

[PATCHv5 07/36] filemap: allocate huge page in page_cache_read(), if allowed

2016-11-29 Thread Kirill A. Shutemov
This patch adds basic functionality to put huge page into page cache. At the moment we only put huge pages into radix-tree if the range covered by the huge page is empty. We ignore shadow entries for now, just remove them from the tree before inserting huge page. Later we can add logic to accumu

[PATCHv5 22/36] mm, hugetlb: switch hugetlbfs to multi-order radix-tree entries

2016-11-29 Thread Kirill A. Shutemov
From: Naoya Horiguchi Currently, hugetlb pages are linked to page cache on the basis of hugepage offset (derived from vma_hugecache_offset()) for historical reasons, which doesn't match the generic usage of page cache and requires some routines to convert page offset <=> hugepage offset in commo

[PATCHv5 20/36] truncate: make truncate_inode_pages_range() aware about huge pages

2016-11-29 Thread Kirill A. Shutemov
As with shmem_undo_range(), truncate_inode_pages_range() removes a huge page if it is fully within the range. A partial truncate of a huge page zeroes out that part of the THP. Unlike with shmem, this doesn't prevent us from having holes in the middle of a huge page: we can still skip writeback of untouched buffers. With

[PATCHv5 36/36] ext4, vfs: add huge= mount option

2016-11-29 Thread Kirill A. Shutemov
The same four values as in tmpfs case. Encryption code is not yet ready to handle huge page, so we disable huge pages support if the inode has EXT4_INODE_ENCRYPT. Signed-off-by: Kirill A. Shutemov --- fs/ext4/ext4.h | 5 + fs/ext4/inode.c | 30 +++--- fs/ext4/super.

[PATCHv5 11/36] HACK: readahead: alloc huge pages, if allowed

2016-11-29 Thread Kirill A. Shutemov
Most page cache allocation happens via readahead (sync or async), so if we want to have a significant number of huge pages in page cache we need to find a way to allocate them from readahead. Unfortunately, huge pages don't fit into current readahead design: 128 max readahead window, assumption o

[PATCHv5 28/36] ext4: make ext4_block_write_begin() aware about huge pages

2016-11-29 Thread Kirill A. Shutemov
It simply matches changes to __block_write_begin_int(). Signed-off-by: Kirill A. Shutemov --- fs/ext4/inode.c | 35 +-- 1 file changed, 21 insertions(+), 14 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index d3143dfe9962..21662bcbbbcb 100644 --- a/

[PATCHv5 25/36] ext4: make ext4_writepage() work on huge pages

2016-11-29 Thread Kirill A. Shutemov
Change ext4_writepage() and underlying ext4_bio_write_page(). It basically removes assumption on page size, infer it from struct page instead. Signed-off-by: Kirill A. Shutemov --- fs/ext4/inode.c | 10 +- fs/ext4/page-io.c | 11 +-- 2 files changed, 14 insertions(+), 7 deleti

[PATCHv5 10/36] filemap: handle huge pages in filemap_fdatawait_range()

2016-11-29 Thread Kirill A. Shutemov
We write back a whole huge page at a time. Signed-off-by: Kirill A. Shutemov --- mm/filemap.c | 5 + 1 file changed, 5 insertions(+) diff --git a/mm/filemap.c b/mm/filemap.c index ec976ddcb88a..52be2b457208 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -405,9 +405,14 @@ static int __filemap_fda

[PATCHv5 23/36] mm: account huge pages to dirty, writaback, reclaimable, etc.

2016-11-29 Thread Kirill A. Shutemov
We need to account huge pages according to their size to get background writeback to work properly. Signed-off-by: Kirill A. Shutemov --- fs/fs-writeback.c | 10 +++--- include/linux/backing-dev.h | 10 ++ include/linux/memcontrol.h | 22 ++--- mm/migrate.c| 1

[PATCHv5 29/36] ext4: handle huge pages in ext4_da_write_end()

2016-11-29 Thread Kirill A. Shutemov
Call ext4_da_should_update_i_disksize() for head page with offset relative to head page. Signed-off-by: Kirill A. Shutemov --- fs/ext4/inode.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 21662bcbbbcb..e89249c03d2f 100644 ---

[PATCHv5 08/36] filemap: handle huge pages in do_generic_file_read()

2016-11-29 Thread Kirill A. Shutemov
Most of the work happens on the head page. Only when we need to copy data to userspace do we find the relevant subpage. We are still limited by PAGE_SIZE per iteration. Lifting this limitation would require some more work. Signed-off-by: Kirill A. Shutemov --- mm/filemap.c | 5 - 1 file changed, 4 inse

[PATCHv5 05/36] thp: try to free page's buffers before attempt split

2016-11-29 Thread Kirill A. Shutemov
We want page to be isolated from the rest of the system before splitting it. We rely on page count to be 2 for file pages to make sure nobody uses the page: one pin to caller, one to radix-tree. Filesystems with backing storage can have page count increased if it has buffers. Let's try to free the

[PATCHv5 00/36] ext4: support of huge pages

2016-11-29 Thread Kirill A. Shutemov
Here's respin of my huge ext4 patchset on top of Matthew's patchset with few changes and fixes (see below). Please review and consider applying. I don't see any xfstests regressions with huge pages enabled. Patch with new configurations for xfstests-bld is below. The basics are the same as with

[PATCHv5 13/36] mm: make write_cache_pages() work on huge pages

2016-11-29 Thread Kirill A. Shutemov
We write back a whole huge page at a time. Let's adjust iteration this way. Signed-off-by: Kirill A. Shutemov --- include/linux/mm.h | 1 + include/linux/pagemap.h | 1 + mm/page-writeback.c | 17 - 3 files changed, 14 insertions(+), 5 deletions(-) diff --git a/include/linu

[PATCHv5 09/36] filemap: allocate huge page in pagecache_get_page(), if allowed

2016-11-29 Thread Kirill A. Shutemov
The write path allocates pages using pagecache_get_page(). We should be able to allocate huge pages there, if it's allowed. As usual, fall back to small pages if that fails. Signed-off-by: Kirill A. Shutemov --- mm/filemap.c | 17 +++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff