Re: THP-enabled filesystem vs. FALLOC_FL_PUNCH_HOLE

2016-03-06 Thread Kirill A. Shutemov
On Mon, Mar 07, 2016 at 10:03:36AM +1100, Dave Chinner wrote:
> On Sun, Mar 06, 2016 at 03:30:34AM +0300, Kirill A. Shutemov wrote:
> > On Sun, Mar 06, 2016 at 09:38:11AM +1100, Dave Chinner wrote:
> > > And it's not just hole punching that has this problem. Direct IO is
> > > going to have the same issue with invalidation of the mapped ranges
> > > over the IO being done. XFS already WARNs when page cache
> > > invalidation fails with EBUSY in direct IO, because that is
> > > indicative of an application with a potential data corruption vector
> > > and there's nothing we can do in the kernel code to prevent it.
> > 
> > My current understanding is that for filesystems with persistent storage,
> > in order to make THP any useful, we would need to implement writeback
> > without splitting the huge page.
> 
> Algorithmically it is no different to filesystem block size < page
> size writeback.
> 
> > At the moment, I have no idea how hard it would be..
> 
> THP support would effectively require us to remove PAGE_CACHE_SIZE
> assumptions from all of the filesystem and buffer code. That's a
> large chunk of work e.g.  fs/buffer.c and any filesystem that uses
> bufferheads for tracking filesystem block state through the page
> cache.

I'll try to learn more about the code before the summit.
I guess it's something worth discussing in person.

-- 
 Kirill A. Shutemov


Re: THP-enabled filesystem vs. FALLOC_FL_PUNCH_HOLE

2016-03-06 Thread Dave Chinner
On Sun, Mar 06, 2016 at 03:30:34AM +0300, Kirill A. Shutemov wrote:
> On Sun, Mar 06, 2016 at 09:38:11AM +1100, Dave Chinner wrote:
> > On Sat, Mar 05, 2016 at 02:24:12AM +0300, Kirill A. Shutemov wrote:
> > > Would it be acceptable for fallocate(FALLOC_FL_PUNCH_HOLE) to return
> > > -EBUSY (or other errno of your choice), if we cannot split the page
> > > right away?
> > 
> > Which means THP are not transparent any more. What does an
> > application do when it gets an EBUSY, anyway?
> 
> I guess it's reasonable to expect an application to handle EOPNOTSUPP,
> as FALLOC_FL_PUNCH_HOLE is not supported by some filesystems.

Yes, but this is usually done as a check at the program
initialisation to determine whether to issue hole punches at all.
It's not supposed to be a dynamic error.

> Although, non-consistent result from the same fd can be confusing.

Exactly.

> > And it's not just hole punching that has this problem. Direct IO is
> > going to have the same issue with invalidation of the mapped ranges
> > over the IO being done. XFS already WARNs when page cache
> > invalidation fails with EBUSY in direct IO, because that is
> > indicative of an application with a potential data corruption vector
> > and there's nothing we can do in the kernel code to prevent it.
> 
> My current understanding is that for filesystems with persistent storage,
> in order to make THP any useful, we would need to implement writeback
> without splitting the huge page.

Algorithmically it is no different to filesystem block size < page
size writeback.
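
To make the comparison concrete, here is a minimal userspace sketch of the
block size < page size case: dirty state is kept per filesystem block inside
one page, and writeback issues one I/O per contiguous run of dirty blocks.
The names (struct io_range, dirty_runs) and the flag-array representation are
assumptions made for this example and are not kernel interfaces. The same loop
would apply if the "page" were a 2M huge page covering 512 4k blocks.

```c
#include <assert.h>
#include <stddef.h>

/* One contiguous run of dirty blocks, in bytes relative to the page start. */
struct io_range {
    size_t offset;
    size_t length;
};

/*
 * Hypothetical helper: scan the per-block dirty flags of one page (small or
 * huge) and coalesce adjacent dirty blocks into I/O ranges.
 * Returns the number of ranges stored in out[] (at most max_out).
 */
size_t dirty_runs(const unsigned char *dirty, size_t nr_blocks,
                  size_t block_size, struct io_range *out, size_t max_out)
{
    size_t n = 0;
    size_t i = 0;

    assert(block_size > 0);
    while (i < nr_blocks && n < max_out) {
        size_t start;

        if (!dirty[i]) {
            i++;
            continue;
        }
        /* Extend the run over every adjacent dirty block. */
        start = i;
        while (i < nr_blocks && dirty[i])
            i++;
        out[n].offset = start * block_size;
        out[n].length = (i - start) * block_size;
        n++;
    }
    return n;
}
```

The point of the sketch is only that writeback granularity is decoupled from
the size of the cached page; it says nothing about how much existing code
assumes otherwise.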

> At the moment, I have no idea how hard it would be..

THP support would effectively require us to remove PAGE_CACHE_SIZE
assumptions from all of the filesystem and buffer code. That's a
large chunk of work e.g.  fs/buffer.c and any filesystem that uses
bufferheads for tracking filesystem block state through the page
cache.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: THP-enabled filesystem vs. FALLOC_FL_PUNCH_HOLE

2016-03-05 Thread Kirill A. Shutemov
On Sun, Mar 06, 2016 at 09:38:11AM +1100, Dave Chinner wrote:
> On Sat, Mar 05, 2016 at 02:24:12AM +0300, Kirill A. Shutemov wrote:
> > On Sat, Mar 05, 2016 at 10:05:48AM +1100, Dave Chinner wrote:
> > > On Fri, Mar 04, 2016 at 11:38:47AM -0800, Hugh Dickins wrote:
> > > > On Fri, 4 Mar 2016, Dave Hansen wrote:
> > > > > On 03/04/2016 03:26 AM, Kirill A. Shutemov wrote:
> > > > > > On Thu, Mar 03, 2016 at 07:51:50PM +0300, Kirill A. Shutemov wrote:
> > > > > > >> Truncate and punch hole that only cover part of THP range is implemented
> > > > > > >> by zero out this part of THP.
> > > > > > >>
> > > > > > >> This have visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour.
> > > > > > >> As we don't really create hole in this case, lseek(SEEK_HOLE) may have
> > > > > > >> inconsistent results depending what pages happened to be allocated.
> > > > > > >> Not sure if it should be considered ABI break or not.
> > > > > > 
> > > > > > Looks like this shouldn't be a problem. man 2 fallocate:
> > > > > > 
> > > > > > > Within the specified range, partial filesystem blocks are zeroed,
> > > > > > > and whole filesystem blocks are removed from the file.  After a
> > > > > > > successful call, subsequent reads from this range will return
> > > > > > > zeroes.
> > > > > > 
> > > > > > It means we effectively have 2M filesystem block size.
> > > > > 
> > > > > > The question is still whether this will cause problems for apps.
> > > > > > 
> > > > > > Isn't 2MB a quite unusual block size?  Wouldn't some files on a tmpfs
> > > > > filesystem act like they have a 2M blocksize and others like they have
> > > > > 4k?  Would that confuse apps?
> > > > 
> > > > At risk of addressing the tip of an iceberg, before diving down to
> > > > scope out the rest of the iceberg...
> > > 
> > > 
> > > > (Though in the case of my huge tmpfs, it's the reverse: the small hole
> > > > punch splits the hugepage; but it's natural that Kirill's way would try
> > > > to hold on to its compound pages for longer than I do, and that's fine
> > > > so long as it's all consistent.)
> > > 
> > > > Ah, but suppose someone holepunches out most of each 2M page: they would
> > > > expect the memcg not to be charged for those holes (just as when they
> > > > > munmap most of an anonymous THP) - that does suggest splitting is needed.
> > > 
> > > I think filesystems will expect splitting to happen. They call
> > > truncate_pagecache_range() on the region that the hole is being
> > > punched out of, and they expect page cache pages over this range to
> > > be unmapped, invalidated and then removed from the mapping tree as a
> > > result. Also, most filesystems think the page cache only contains
> > > PAGE_CACHE_SIZE mappings, so they are completely unaware of the
> > > limitations THP might have when it comes to invalidation.
> > > 
> > > IOWs, if this range is not aligned to huge page boundaries, then it
> > > implies the huge page is either split into PAGE_SIZE mappings and
> > > then the range is invalidated as expected, or it is completely
> > > invalidated and then refaulted on future accesses which determine if
> > > THP or normal pages are used for the page being faulted
> > 
> > The filesystem in question is tmpfs and complete invalidation is not
> > always an option.
> 
> Then your two options are: splitting the page and rerunning the hole
> punch, or simply zeroing the sections of the THP rather than trying
> to punch out the backing store.

The second option is implemented at the moment as splitting can fail.

> > For other filesystems it also can be unavailable
> > immediately if the page is dirty (the dirty flag is tracked on per-THP
> > basis at the moment).
> 
> Filesystems with persistent storage flush the range being punched
> first to ensure that partial blocks are correctly written before we
> start freeing the backing store. This is needed on XFS to ensure
> hole punch plays nicely with delayed allocation and other extent
> based operations. Hence we know that we have clean pages over the
> hole we are about to punch and so there is no reason the
> invalidation should *ever* fail.

Okay. It means we have another option to consider when enabling THP for a
filesystem with persistent storage.

> tmpfs is a special snowflake when it comes to these fallocate based
> filesystem layout manipulation functions - it does not have
> persistent storage, so you have to do things very differently to
> ensure that data is not lost.
> 
> > Would it be acceptable for fallocate(FALLOC_FL_PUNCH_HOLE) to return
> > -EBUSY (or other errno of your choice), if we cannot split the page
> > right away?
> 
> Which means THP are not transparent any more. What does an
> application do when it gets an EBUSY, anyway?

I guess it's reasonable to expect an application to handle EOPNOTSUPP,
as FALLOC_FL_PUNCH_HOLE is not supported by some filesystems.
Although, non-consistent result from the same fd 

Re: THP-enabled filesystem vs. FALLOC_FL_PUNCH_HOLE

2016-03-05 Thread Dave Chinner
On Sat, Mar 05, 2016 at 02:24:12AM +0300, Kirill A. Shutemov wrote:
> On Sat, Mar 05, 2016 at 10:05:48AM +1100, Dave Chinner wrote:
> > On Fri, Mar 04, 2016 at 11:38:47AM -0800, Hugh Dickins wrote:
> > > On Fri, 4 Mar 2016, Dave Hansen wrote:
> > > > On 03/04/2016 03:26 AM, Kirill A. Shutemov wrote:
> > > > > On Thu, Mar 03, 2016 at 07:51:50PM +0300, Kirill A. Shutemov wrote:
> > > > > >> Truncate and punch hole that only cover part of THP range is implemented
> > > > > >> by zero out this part of THP.
> > > > > >>
> > > > > >> This have visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour.
> > > > > >> As we don't really create hole in this case, lseek(SEEK_HOLE) may have
> > > > > >> inconsistent results depending what pages happened to be allocated.
> > > > > >> Not sure if it should be considered ABI break or not.
> > > > > 
> > > > > Looks like this shouldn't be a problem. man 2 fallocate:
> > > > > 
> > > > > >   Within the specified range, partial filesystem blocks are zeroed,
> > > > > >   and whole filesystem blocks are removed from the file.  After a
> > > > > >   successful call, subsequent reads from this range will return
> > > > > >   zeroes.
> > > > > 
> > > > > It means we effectively have 2M filesystem block size.
> > > > 
> > > > > The question is still whether this will cause problems for apps.
> > > > > 
> > > > > Isn't 2MB a quite unusual block size?  Wouldn't some files on a tmpfs
> > > > filesystem act like they have a 2M blocksize and others like they have
> > > > 4k?  Would that confuse apps?
> > > 
> > > At risk of addressing the tip of an iceberg, before diving down to
> > > scope out the rest of the iceberg...
> > 
> > 
> > > (Though in the case of my huge tmpfs, it's the reverse: the small hole
> > > punch splits the hugepage; but it's natural that Kirill's way would try
> > > to hold on to its compound pages for longer than I do, and that's fine
> > > so long as it's all consistent.)
> > 
> > > Ah, but suppose someone holepunches out most of each 2M page: they would
> > > expect the memcg not to be charged for those holes (just as when they
> > > munmap most of an anonymous THP) - that does suggest splitting is needed.
> > 
> > I think filesystems will expect splitting to happen. They call
> > truncate_pagecache_range() on the region that the hole is being
> > punched out of, and they expect page cache pages over this range to
> > be unmapped, invalidated and then removed from the mapping tree as a
> > result. Also, most filesystems think the page cache only contains
> > PAGE_CACHE_SIZE mappings, so they are completely unaware of the
> > limitations THP might have when it comes to invalidation.
> > 
> > IOWs, if this range is not aligned to huge page boundaries, then it
> > implies the huge page is either split into PAGE_SIZE mappings and
> > then the range is invalidated as expected, or it is completely
> > invalidated and then refaulted on future accesses which determine if
> > THP or normal pages are used for the page being faulted
> 
> The filesystem in question is tmpfs and complete invalidation is not
> always an option.

Then your two options are: splitting the page and rerunning the hole
punch, or simply zeroing the sections of the THP rather than trying
to punch out the backing store.
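
The difference between the two options can be sketched as plain arithmetic:
for a punch range and an allocation unit, which bytes can only be zeroed in
place and which whole units could actually be freed. This is a userspace
illustration under stated assumptions; struct punch_plan and plan_punch() are
hypothetical names, and the unit is assumed to be a power of two (4k after a
split, 2M if the huge page is kept).

```c
#include <assert.h>
#include <stdint.h>

struct punch_plan {
    uint64_t zero_head_len; /* bytes zeroed before the first whole unit */
    uint64_t free_start;    /* start of the whole units that can be freed */
    uint64_t free_len;      /* bytes of whole units freed (0 if none) */
    uint64_t zero_tail_len; /* bytes zeroed after the last whole unit */
};

/* Hypothetical helper: split [offset, offset + len) by allocation unit. */
struct punch_plan plan_punch(uint64_t offset, uint64_t len, uint64_t unit)
{
    struct punch_plan p = {0, 0, 0, 0};
    uint64_t end = offset + len;
    uint64_t first, last;

    assert(unit != 0 && (unit & (unit - 1)) == 0);
    first = (offset + unit - 1) & ~(unit - 1); /* round start up */
    last = end & ~(unit - 1);                  /* round end down */

    if (first >= last) {
        /* No whole unit lies inside the range: it is zeroed in place. */
        p.zero_head_len = len;
        return p;
    }
    p.zero_head_len = first - offset;
    p.free_start = first;
    p.free_len = last - first;
    p.zero_tail_len = end - last;
    return p;
}
```

For a 1M punch at offset 1000, a 4k unit frees all but the two partial pages
at the edges, while a 2M unit frees nothing and zeroes the whole range.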

> For other filesystems it also can be unavailable
> immediately if the page is dirty (the dirty flag is tracked on per-THP
> basis at the moment).

Filesystems with persistent storage flush the range being punched
first to ensure that partial blocks are correctly written before we
start freeing the backing store. This is needed on XFS to ensure
hole punch plays nicely with delayed allocation and other extent
based operations. Hence we know that we have clean pages over the
hole we are about to punch and so there is no reason the
invalidation should *ever* fail.

tmpfs is a special snowflake when it comes to these fallocate based
filesystem layout manipulation functions - it does not have
persistent storage, so you have to do things very differently to
ensure that data is not lost.

> Would it be acceptable for fallocate(FALLOC_FL_PUNCH_HOLE) to return
> -EBUSY (or other errno of your choice), if we cannot split the page
> right away?

Which means THP are not transparent any more. What does an
application do when it gets an EBUSY, anyway? It needs to punch a
hole, and failure to do so could result in data corruption or stale
data exposure if the hole isn't punched and the data purged from the
range.

And it's not just hole punching that has this problem. Direct IO is
going to have the same issue with invalidation of the mapped ranges
over the IO being done. XFS already WARNs when page cache
invalidation fails with EBUSY in direct IO, because that is
indicative of an application with a potential data corruption vector
and there's nothing we can do in the kernel code to prevent it.

I think the same issues also exist with DAX using 

Re: THP-enabled filesystem vs. FALLOC_FL_PUNCH_HOLE

2016-03-04 Thread Kirill A. Shutemov
On Sat, Mar 05, 2016 at 10:05:48AM +1100, Dave Chinner wrote:
> On Fri, Mar 04, 2016 at 11:38:47AM -0800, Hugh Dickins wrote:
> > On Fri, 4 Mar 2016, Dave Hansen wrote:
> > > On 03/04/2016 03:26 AM, Kirill A. Shutemov wrote:
> > > > On Thu, Mar 03, 2016 at 07:51:50PM +0300, Kirill A. Shutemov wrote:
> > > > >> Truncate and punch hole that only cover part of THP range is implemented
> > > >> by zero out this part of THP.
> > > >>
> > > >> This have visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour.
> > > >> As we don't really create hole in this case, lseek(SEEK_HOLE) may have
> > > >> inconsistent results depending what pages happened to be allocated.
> > > >> Not sure if it should be considered ABI break or not.
> > > > 
> > > > Looks like this shouldn't be a problem. man 2 fallocate:
> > > > 
> > > > > Within the specified range, partial filesystem blocks are zeroed,
> > > > > and whole filesystem blocks are removed from the file.  After a
> > > > > successful call, subsequent reads from this range will return
> > > > > zeroes.
> > > > 
> > > > It means we effectively have 2M filesystem block size.
> > > 
> > > > The question is still whether this will cause problems for apps.
> > > > 
> > > > Isn't 2MB a quite unusual block size?  Wouldn't some files on a tmpfs
> > > filesystem act like they have a 2M blocksize and others like they have
> > > 4k?  Would that confuse apps?
> > 
> > At risk of addressing the tip of an iceberg, before diving down to
> > scope out the rest of the iceberg...
> 
> 
> > (Though in the case of my huge tmpfs, it's the reverse: the small hole
> > punch splits the hugepage; but it's natural that Kirill's way would try
> > to hold on to its compound pages for longer than I do, and that's fine
> > so long as it's all consistent.)
> 
> > Ah, but suppose someone holepunches out most of each 2M page: they would
> > expect the memcg not to be charged for those holes (just as when they
> > munmap most of an anonymous THP) - that does suggest splitting is needed.
> 
> I think filesystems will expect splitting to happen. They call
> truncate_pagecache_range() on the region that the hole is being
> punched out of, and they expect page cache pages over this range to
> be unmapped, invalidated and then removed from the mapping tree as a
> result. Also, most filesystems think the page cache only contains
> PAGE_CACHE_SIZE mappings, so they are completely unaware of the
> limitations THP might have when it comes to invalidation.
> 
> IOWs, if this range is not aligned to huge page boundaries, then it
> implies the huge page is either split into PAGE_SIZE mappings and
> then the range is invalidated as expected, or it is completely
> invalidated and then refaulted on future accesses which determine if
> THP or normal pages are used for the page being faulted

The filesystem in question is tmpfs and complete invalidation is not
always an option. For other filesystems it also can be unavailable
immediately if the page is dirty (the dirty flag is tracked on per-THP
basis at the moment).

Would it be acceptable for fallocate(FALLOC_FL_PUNCH_HOLE) to return
-EBUSY (or other errno of your choice), if we cannot split the page
right away?

> Just to complicate things, keep in mind that some filesystems may
> have a PAGE_SIZE block size, but can be convinced to only
> allocate/punch/truncate/etc extents on larger alignments on a
> per-inode basis. IOWs, THP vs hole punch behaviour is not actually
> a filesystem type specific behaviour - it's per-inode specific...

There is also a similar question about THP vs. i_size vs. SIGBUS.

For small pages, an application will not get SIGBUS on an mmap()ed file as
long as it does not try to access beyond round_up(i_size, PAGE_CACHE_SIZE) - 1.

For THP it would be round_up(i_size, HPAGE_PMD_SIZE) - 1.

Is it a problem?
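
As a worked example of the two bounds, here is a small userspace sketch. The
helper names (round_up_pow2, last_valid_offset) are made up for this
illustration, and the sizes 4096 and 2097152 are assumed values standing in
for PAGE_CACHE_SIZE and HPAGE_PMD_SIZE on x86-64.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to a multiple of align, which must be a power of two. */
uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (x + align - 1) & ~(align - 1);
}

/*
 * Last file offset that can be touched through a mapping without SIGBUS,
 * for a non-empty file whose pages are mapped in units of map_unit bytes.
 */
uint64_t last_valid_offset(uint64_t i_size, uint64_t map_unit)
{
    assert(i_size > 0);
    return round_up_pow2(i_size, map_unit) - 1;
}
```

For a 5000-byte file, the bound is offset 8191 with 4k pages and offset
2097151 with a 2M huge page, so far more of the range past i_size becomes
accessible without a fault.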

-- 
 Kirill A. Shutemov


Re: THP-enabled filesystem vs. FALLOC_FL_PUNCH_HOLE

2016-03-04 Thread Dave Chinner
On Fri, Mar 04, 2016 at 11:38:47AM -0800, Hugh Dickins wrote:
> On Fri, 4 Mar 2016, Dave Hansen wrote:
> > On 03/04/2016 03:26 AM, Kirill A. Shutemov wrote:
> > > On Thu, Mar 03, 2016 at 07:51:50PM +0300, Kirill A. Shutemov wrote:
> > > >> Truncate and punch hole that only cover part of a THP range are
> > > >> implemented by zeroing out that part of the THP.
> > > >>
> > > >> This has a visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour.
> > > >> As we don't really create a hole in this case, lseek(SEEK_HOLE) may have
> > > >> inconsistent results depending on what pages happened to be allocated.
> > > >> Not sure if it should be considered an ABI break or not.
> > > 
> > > Looks like this shouldn't be a problem. man 2 fallocate:
> > > 
> > >   Within the specified range, partial filesystem blocks are zeroed,
> > >   and whole filesystem blocks are removed from the file.  After a
> > >   successful call, subsequent reads from this range will return
> > >   zeroes.
> > > 
> > > > It means we effectively have a 2M filesystem block size.
> > > 
> > > The question is still whether this will cause problems for apps.
> > > 
> > > Isn't 2MB a quite unusual block size?  Wouldn't some files on a tmpfs
> > filesystem act like they have a 2M blocksize and others like they have
> > 4k?  Would that confuse apps?
> 
> At risk of addressing the tip of an iceberg, before diving down to
> scope out the rest of the iceberg...


> (Though in the case of my huge tmpfs, it's the reverse: the small hole
> punch splits the hugepage; but it's natural that Kirill's way would try
> to hold on to its compound pages for longer than I do, and that's fine
> so long as it's all consistent.)

> Ah, but suppose someone holepunches out most of each 2M page: they would
> expect the memcg not to be charged for those holes (just as when they
> munmap most of an anonymous THP) - that does suggest splitting is needed.

I think filesystems will expect splitting to happen. They call
truncate_pagecache_range() on the region that the hole is being
punched out of, and they expect page cache pages over this range to
be unmapped, invalidated and then removed from the mapping tree as a
result. Also, most filesystems think the page cache only contains
PAGE_CACHE_SIZE mappings, so they are completely unaware of the
limitations THP might have when it comes to invalidation.

IOWs, if this range is not aligned to huge page boundaries, then it
implies the huge page is either split into PAGE_SIZE mappings and
then the range is invalidated as expected, or it is completely
invalidated and then refaulted on future accesses which determine if
THP or normal pages are used for the page being faulted.
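
As a rough illustration of the alignment case, a punched range can be
classified as below. This is a userspace sketch, not kernel code:
struct punch_plan, plan_punch() and the 2M HPAGE_SIZE are assumptions made
for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HPAGE_SIZE (2ULL * 1024 * 1024) /* assumed huge page size: 2 MiB */

/* Hypothetical description of how a punched range relates to huge pages. */
struct punch_plan {
    uint64_t whole_start; /* start of the huge pages fully inside the range */
    uint64_t whole_end;   /* end (exclusive); equals whole_start if none */
    bool unaligned;       /* true if some huge page is only partly covered */
};

struct punch_plan plan_punch(uint64_t offset, uint64_t len)
{
    struct punch_plan p;
    uint64_t end = offset + len;

    assert(len > 0);

    /* Round the start up and the end down to huge page boundaries. */
    p.whole_start = (offset + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
    p.whole_end = end & ~(HPAGE_SIZE - 1);
    if (p.whole_end < p.whole_start)
        p.whole_end = p.whole_start; /* range sits inside one huge page */

    /* An unaligned edge means a huge page must be split or dropped whole. */
    p.unaligned = (offset % HPAGE_SIZE) != 0 || (end % HPAGE_SIZE) != 0;
    return p;
}
```

Huge pages in [whole_start, whole_end) can simply be removed; whenever
unaligned is set, at least one huge page needs the split-or-invalidate
decision described above.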

Just to complicate things, keep in mind that some filesystems may
have a PAGE_SIZE block size, but can be convinced to only
allocate/punch/truncate/etc extents on larger alignments on a
per-inode basis. IOWs, THP vs hole punch behaviour is not actually
a filesystem type specific behaviour - it's per-inode specific...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: THP-enabled filesystem vs. FALLOC_FL_PUNCH_HOLE

2016-03-04 Thread Kirill A. Shutemov
On Fri, Mar 04, 2016 at 11:38:47AM -0800, Hugh Dickins wrote:
> On Fri, 4 Mar 2016, Dave Hansen wrote:
> > On 03/04/2016 03:26 AM, Kirill A. Shutemov wrote:
> > > On Thu, Mar 03, 2016 at 07:51:50PM +0300, Kirill A. Shutemov wrote:
> > > >> Truncate and punch hole that only cover part of a THP range are
> > > >> implemented by zeroing out that part of the THP.
> > > >>
> > > >> This has a visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour.
> > > >> As we don't really create a hole in this case, lseek(SEEK_HOLE) may have
> > > >> inconsistent results depending on what pages happened to be allocated.
> > > >> Not sure if it should be considered an ABI break or not.
> > > 
> > > Looks like this shouldn't be a problem. man 2 fallocate:
> > > 
> > >   Within the specified range, partial filesystem blocks are zeroed,
> > >   and whole filesystem blocks are removed from the file.  After a
> > >   successful call, subsequent reads from this range will return
> > >   zeroes.
> > > 
> > > > It means we effectively have a 2M filesystem block size.
> > > 
> > > The question is still whether this will cause problems for apps.
> > > 
> > > Isn't 2MB a quite unusual block size?  Wouldn't some files on a tmpfs
> > filesystem act like they have a 2M blocksize and others like they have
> > 4k?  Would that confuse apps?
> 
> At risk of addressing the tip of an iceberg, before diving down to
> scope out the rest of the iceberg...
> 
> So far as the behaviour of lseek(,,SEEK_HOLE) goes, I agree with Kirill:
> I don't think it matters to anyone if it skips some zeroed small pages
> within a hugepage.  It may cause some artificial tests of holepunch and
> SEEK_HOLE to fail, and it ought to be documented as a limitation from
> choosing to enable THP (Kirill's way) on a filesystem, but I don't think
> it's an ABI break to worry about: anyone who cares just shouldn't enable.
> 
> (Though in the case of my huge tmpfs, it's the reverse: the small hole
> punch splits the hugepage; but it's natural that Kirill's way would try
> to hold on to its compound pages for longer than I do, and that's fine
> so long as it's all consistent.)
> 
> But I may disagree with "we effectively have 2M filesystem block size",
> beyond the SEEK_HOLE case.  If we're emulating hugetlbfs in tmpfs, sure,
> we would have 2M filesystem block size.  But if we're enabling THP
> (emphasis on T for Transparent) in tmpfs (or another filesystem), then
> when it matters it must act as if the block size is the 4k (or whatever)
> it usually is.  When it matters?  Approaching memcg limit or ENOSPC
> spring to mind.
> 
> Ah, but suppose someone holepunches out most of each 2M page: they would
> expect the memcg not to be charged for those holes (just as when they
> munmap most of an anonymous THP) - that does suggest splitting is needed.

Hmm.. As split_huge_pages() can fail, we would need to propagate this
error to userspace. This potentially triggers some other user-visible
effect. EBUSY is not on the list of fallocate(2) error codes.

I think we can invent a way to track whether a THP has punch-holed subpages,
and prevent the compound page from being mapped with a PMD, or these
subpages from being mapped at all.
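
One way such tracking could look, as a userspace model only. Nothing like
struct thp_holes exists in the kernel; the structure, the function names and
the 512 subpages per THP (2M / 4k) are assumptions for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SUBPAGES_PER_THP 512 /* assumed: 2 MiB huge page / 4 KiB subpages */

/* Hypothetical per-THP state: one bit per punched subpage. */
struct thp_holes {
    uint64_t punched[SUBPAGES_PER_THP / 64];
};

void thp_holes_init(struct thp_holes *h)
{
    memset(h, 0, sizeof(*h));
}

/* Record that subpages first..last (inclusive) have been punched out. */
void thp_holes_punch(struct thp_holes *h, unsigned first, unsigned last)
{
    assert(first <= last && last < SUBPAGES_PER_THP);
    for (unsigned i = first; i <= last; i++)
        h->punched[i / 64] |= UINT64_C(1) << (i % 64);
}

/* A punched subpage must not be mapped into userspace. */
bool thp_subpage_mappable(const struct thp_holes *h, unsigned idx)
{
    assert(idx < SUBPAGES_PER_THP);
    return !(h->punched[idx / 64] & (UINT64_C(1) << (idx % 64)));
}

/* A PMD mapping is only allowed while no subpage has been punched. */
bool thp_pmd_mappable(const struct thp_holes *h)
{
    for (unsigned i = 0; i < SUBPAGES_PER_THP / 64; i++)
        if (h->punched[i])
            return false;
    return true;
}
```

The cost is 64 bytes of state per huge page in this model, plus a check on
every fault path that wants to map the page.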

But I'm reluctant to do it upfront, before real users emerge.

I would propose waiting to see what the user demands will be. Maybe we are
overthinking the situation.

-- 
 Kirill A. Shutemov


Re: THP-enabled filesystem vs. FALLOC_FL_PUNCH_HOLE

2016-03-04 Thread Hugh Dickins
On Fri, 4 Mar 2016, Dave Hansen wrote:
> On 03/04/2016 03:26 AM, Kirill A. Shutemov wrote:
> > On Thu, Mar 03, 2016 at 07:51:50PM +0300, Kirill A. Shutemov wrote:
> >> Truncate and punch hole that only cover part of a THP range are
> >> implemented by zeroing out that part of the THP.
> >>
> >> This has a visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour.
> >> As we don't really create a hole in this case, lseek(SEEK_HOLE) may have
> >> inconsistent results depending on what pages happened to be allocated.
> >> Not sure if it should be considered an ABI break or not.
> > 
> > Looks like this shouldn't be a problem. man 2 fallocate:
> > 
> >   Within the specified range, partial filesystem blocks are zeroed,
> >   and whole filesystem blocks are removed from the file.  After a
> >   successful call, subsequent reads from this range will return
> >   zeroes.
> > 
> > It means we effectively have a 2M filesystem block size.
> 
> The question is still whether this will cause problems for apps.
> 
> Isn't 2MB a quite unusual block size?  Wouldn't some files on a tmpfs
> filesystem act like they have a 2M blocksize and others like they have
> 4k?  Would that confuse apps?

At risk of addressing the tip of an iceberg, before diving down to
scope out the rest of the iceberg...

So far as the behaviour of lseek(,,SEEK_HOLE) goes, I agree with Kirill:
I don't think it matters to anyone if it skips some zeroed small pages
within a hugepage.  It may cause some artificial tests of holepunch and
SEEK_HOLE to fail, and it ought to be documented as a limitation from
choosing to enable THP (Kirill's way) on a filesystem, but I don't think
it's an ABI break to worry about: anyone who cares just shouldn't enable.

(Though in the case of my huge tmpfs, it's the reverse: the small hole
punch splits the hugepage; but it's natural that Kirill's way would try
to hold on to its compound pages for longer than I do, and that's fine
so long as it's all consistent.)

But I may disagree with "we effectively have 2M filesystem block size",
beyond the SEEK_HOLE case.  If we're emulating hugetlbfs in tmpfs, sure,
we would have 2M filesystem block size.  But if we're enabling THP
(emphasis on T for Transparent) in tmpfs (or another filesystem), then
when it matters it must act as if the block size is the 4k (or whatever)
it usually is.  When it matters?  Approaching memcg limit or ENOSPC
spring to mind.

Ah, but suppose someone holepunches out most of each 2M page: they would
expect the memcg not to be charged for those holes (just as when they
munmap most of an anonymous THP) - that does suggest splitting is needed.
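
A back-of-the-envelope model of that expectation, assuming 4k base pages,
2M huge pages, and that a split lets the punched subpages be freed and
uncharged. The function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

#define BASE_PAGE_SIZE 4096ULL
#define HPAGE_SIZE     (2ULL * 1024 * 1024)
#define SUBPAGES       (HPAGE_SIZE / BASE_PAGE_SIZE) /* 512 */

/* Zero in place: the compound page stays whole, so all 2M stays charged. */
uint64_t charged_if_zeroed(unsigned punched_subpages)
{
    assert(punched_subpages <= SUBPAGES);
    return HPAGE_SIZE;
}

/* Split first: only the subpages that were not punched remain charged. */
uint64_t charged_if_split(unsigned punched_subpages)
{
    assert(punched_subpages <= SUBPAGES);
    return (SUBPAGES - punched_subpages) * BASE_PAGE_SIZE;
}
```

Punching 500 of the 512 subpages leaves 2M charged in the first model and
48k in the second.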

Hugh


Re: THP-enabled filesystem vs. FALLOC_FL_PUNCH_HOLE

2016-03-04 Thread Dave Hansen
On 03/04/2016 03:26 AM, Kirill A. Shutemov wrote:
> On Thu, Mar 03, 2016 at 07:51:50PM +0300, Kirill A. Shutemov wrote:
>> Truncate and punch hole that only cover part of a THP range are
>> implemented by zeroing out that part of the THP.
>>
>> This has a visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour.
>> As we don't really create a hole in this case, lseek(SEEK_HOLE) may have
>> inconsistent results depending on what pages happened to be allocated.
>> Not sure if it should be considered an ABI break or not.
> 
> Looks like this shouldn't be a problem. man 2 fallocate:
> 
>   Within the specified range, partial filesystem blocks are zeroed,
>   and whole filesystem blocks are removed from the file.  After a
>   successful call, subsequent reads from this range will return
>   zeroes.
> 
> It means we effectively have a 2M filesystem block size.

The question is still whether this will cause problems for apps.

Isn't 2MB a quite unusual block size?  Wouldn't some files on a tmpfs
filesystem act like they have a 2M blocksize and others like they have
4k?  Would that confuse apps?



THP-enabled filesystem vs. FALLOC_FL_PUNCH_HOLE

2016-03-04 Thread Kirill A. Shutemov
On Thu, Mar 03, 2016 at 07:51:50PM +0300, Kirill A. Shutemov wrote:
> Truncate and punch hole that only cover part of a THP range are
> implemented by zeroing out that part of the THP.
> 
> This has a visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour.
> As we don't really create a hole in this case, lseek(SEEK_HOLE) may have
> inconsistent results depending on what pages happened to be allocated.
> Not sure if it should be considered an ABI break or not.

Looks like this shouldn't be a problem. man 2 fallocate:

  Within the specified range, partial filesystem blocks are zeroed,
  and whole filesystem blocks are removed from the file.  After a
  successful call, subsequent reads from this range will return
  zeroes.

It means we effectively have a 2M filesystem block size.

And I don't see any guarantee about subsequent lseek(SEEK_HOLE) behaviour.
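
For anyone who wants to observe this from userspace, a minimal sketch
follows. punch_and_seek_hole() is a made-up helper; fallocate(2) and
lseek(2) with SEEK_HOLE are the real interfaces. It assumes Linux with glibc
and a filesystem that supports both operations.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Punch [offset, offset + len) out of the file and report where
 * lseek(SEEK_HOLE) finds the first hole at or after offset.
 * Returns that offset, or -1 with errno set if either call fails.
 */
off_t punch_and_seek_hole(int fd, off_t offset, off_t len)
{
    assert(len > 0);

    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  offset, len) != 0)
        return -1;

    /* Moves the file offset as a side effect, like any lseek() call. */
    return lseek(fd, offset, SEEK_HOLE);
}
```

When a block-aligned range is really deallocated, the returned offset would
normally be the start of the punched range. If the range was only zeroed
inside a huge page, it may instead be some later offset, or end of file.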

-- 
 Kirill A. Shutemov

