from:"Andreas Dilger"

Re: [PATCH] ext4: add a configurable parameter to prevent endless loop in ext4_mb_discard_group_p

2021-04-08 Thread Andreas Dilger

On Apr 8, 2021, at 12:50 PM, Wen Yang  wrote:
> 
> Hi Ritesh and Andreas,
> 
> Thank you for your reply. Since there is still a faulty machine, we have 
> analyzed it again and found it is indeed a very special case:
> 
> 
> crash> struct ext4_group_info 8813bb5f72d0
> struct ext4_group_info {
>  bb_state = 0,
>  bb_free_root = {
>rb_node = 0x0
>  },
>  bb_first_free = 1681,
>  bb_free = 0,
>  bb_fragments = 0,
>  bb_largest_free_order = -1,
>  bb_prealloc_list = {
>next = 0x880268291d78,
>prev = 0x880268291d78 ---> *** The list is empty
>  },
>  alloc_sem = {
>count = {
>  counter = 0
>},
>wait_list = {
>  next = 0x8813bb5f7308,
>  prev = 0x8813bb5f7308
>},
>wait_lock = {
>  raw_lock = {
>{
>  val = {
>counter = 0
>  },
>  {
>locked = 0 '\000',
>pending = 0 '\000'
>  },
>  {
>locked_pending = 0,
>tail = 0
>  }
>}
>  }
>},
>osq = {
>  tail = {
>counter = 0
>  }
>},
>owner = 0x0
>  },
>  bb_counters = 0x8813bb5f7328
> }
> crash>
> 
> 
> crash> list 0x880268291d78  -l ext4_prealloc_space.pa_group_list -s 
> ext4_prealloc_space.pa_count
> 880268291d78
>  pa_count = {
>counter = 1  ---> pa->pa_count
>  }
> 8813bb5f72f0
>  pa_count = {
>counter = -30701
>  }
> 
> 
> crash> struct -xo  ext4_prealloc_space
> struct ext4_prealloc_space {
>   [0x0] struct list_head pa_inode_list;
>  [0x10] struct list_head pa_group_list;
> union {
> struct list_head pa_tmp_list;
> struct callback_head pa_rcu;
>  [0x20] } u;
>  [0x30] spinlock_t pa_lock;
>  [0x34] atomic_t pa_count;
>  [0x38] unsigned int pa_deleted;
>  [0x40] ext4_fsblk_t pa_pstart;
>  [0x48] ext4_lblk_t pa_lstart;
>  [0x4c] ext4_grpblk_t pa_len;
>  [0x50] ext4_grpblk_t pa_free;
>  [0x54] unsigned short pa_type;
>  [0x58] spinlock_t *pa_obj_lock;
>  [0x60] struct inode *pa_inode;
> }
> SIZE: 0x68
> 
> 
> Before "goto repeat", it is necessary to check whether grp->bb_prealloc_list 
> is empty.  This patch may fix it.
> Please kindly give us your comments and suggestions.
> Thanks.

This patch definitely looks more reasonable than the previous one, but
I don't think it is correct?

It isn't clear how the system could get into this state, or stay there.

If bb_prealloc_list is empty, then "busy" should be 0, since busy = 1
is only set inside "list_for_each_entry_safe(bb_prealloc_list)", and
that implies the list is *not* empty?  The ext4_lock_group() is held
over this code, so the list should not be changing in this case, and
even if the list changed, it _should_ only affect the loop one time.

> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 99bf091..8082e2d 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -4290,7 +4290,7 @@ static void ext4_mb_new_preallocation(struct 
> ext4_allocation_context *ac)
>free_total += free;
> 
>/* if we still need more blocks and some PAs were used, try again */
> -   if (free_total < needed && busy) {
> +   if (free_total < needed && busy && 
> !list_empty(>bb_prealloc_list)) {
>ext4_unlock_group(sb, group);
>cond_resched();
>busy = 0;
>goto repeat;

Minor suggestion - moving "busy = 0" right after "repeat:" and removing
the "int busy = 0" initializer would make this code a bit more clear.

Cheers, Andreas





signature.asc
Description: Message signed with OpenPGP

Re: [PATCH] ext4: add a configurable parameter to prevent endless loop in ext4_mb_discard_group_preallocations

2021-04-07 Thread Andreas Dilger

On Apr 7, 2021, at 5:16 AM, riteshh  wrote:
> 
> On 21/04/07 03:01PM, Wen Yang wrote:
>> From: Wen Yang 
>> 
>> The kworker has occupied 100% of the CPU for several days:
>> PID USER  PR  NI VIRT RES SHR S  %CPU  %MEM TIME+  COMMAND
>> 68086 root 20 0  00   0   R  100.0 0.0  9718:18 kworker/u64:11
>> 
>> And the stack obtained through sysrq is as follows:
>> [20613144.850426] task: 8800b5e08000 task.stack: c9001342c000
>> [20613144.850438] Call Trace:
>> [20613144.850439]  [] ext4_mb_new_blocks+0x429/0x550 [ext4]
>> [20613144.850439]  [] ext4_ext_map_blocks+0xb5e/0xf30 
>> [ext4]
>> [20613144.850441]  [] ext4_map_blocks+0x172/0x620 [ext4]
>> [20613144.850442]  [] ext4_writepages+0x7e5/0xf00 [ext4]
>> [20613144.850443]  [] do_writepages+0x1e/0x30
>> [20613144.850444]  [] __writeback_single_inode+0x45/0x320
>> [20613144.850444]  [] writeback_sb_inodes+0x272/0x600
>> [20613144.850445]  [] __writeback_inodes_wb+0x92/0xc0
>> [20613144.850445]  [] wb_writeback+0x268/0x300
>> [20613144.850446]  [] wb_workfn+0xb4/0x380
>> [20613144.850447]  [] process_one_work+0x189/0x420
>> [20613144.850447]  [] worker_thread+0x4e/0x4b0
>> 
>> The cpu resources of the cloud server are precious, and the server
>> cannot be restarted after running for a long time, so a configuration
>> parameter is added to prevent this endless loop.
> 
> Strange, if there is a endless loop here. Then I would definitely see
> if there is any accounting problem in pa->pa_count. Otherwise busy=1
> should not be set everytime. ext4_mb_show_pa() function may help debug this.
> 
> If yes, then that means there always exists either a file preallocation
> or a group preallocation. Maybe it is possible, in some use case.
> Others may know of such use case, if any.

If this code is broken, then it doesn't make sense to me that we would
leave it in the "run forever" state after the patch, and require a sysfs
tunable to be set to have a properly working system?

Is there anything particularly strange about the workload/system that
might cause this?  Filesystem is very full, memory is very low, etc?


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH v2] ext4: Fix ext4_error_err save negative errno into superblock

2021-04-01 Thread Andreas Dilger

On Apr 1, 2021, at 1:40 AM, Ye Bin  wrote:
> 
> As read_mmp_block return 1 when failed. read_mmp_block return -EIO when buffer
> isn't uptodate.

Thank you for this second patch.  Unfortunately, the commit message
is still confusing/incorrect because it references read_mmp_block()
in the first usage but is actually changing write_mmp_block().

With that change you could add a Reviewed-by label from me.

Cheers, Andreas

> Fixes: 54d3adbc29f0 ("ext4: save all error info in save_error_info() and
> drop ext4_set_errno()")
> Reported-by: Liu Zhi Qiang 
> Signed-off-by: Ye Bin 
> ---
> fs/ext4/mmp.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
> index 795c3ff2907c..68fbeedd627b 100644
> --- a/fs/ext4/mmp.c
> +++ b/fs/ext4/mmp.c
> @@ -56,7 +56,7 @@ static int write_mmp_block(struct super_block *sb, struct 
> buffer_head *bh)
>   wait_on_buffer(bh);
>   sb_end_write(sb);
>   if (unlikely(!buffer_uptodate(bh)))
> - return 1;
> + return -EIO;
> 
>   return 0;
> }
> --
> 2.25.4
> 


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH] ext4: Fix ext4_error_err save negative errno into superblock

2021-04-01 Thread Andreas Dilger

On Apr 1, 2021, at 1:22 AM, Ye Bin  wrote:
> 
> As read_mmp_block return 1 when failed, so just pass retval to
> save_error_info.

Thank you for submitting this patch, but it should not be accepted.

The commit message is confusing, since the code being changed relates
to retval from write_mmp_block().  That currently returns 1, but
only until your next patch is applied.

I think it is better to fix write_mmp_block() as in your next patch
to return a negative value to be more consistent with other code.

Cheers, Andreas

> Fixes: 54d3adbc29f0 ("ext4: save all error info in save_error_info() and
> drop ext4_set_errno()")
> Reported-by: Liu Zhi Qiang 
> Signed-off-by: Ye Bin 
> ---
> fs/ext4/mmp.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
> index 795c3ff2907c..bb8353e25841 100644
> --- a/fs/ext4/mmp.c
> +++ b/fs/ext4/mmp.c
> @@ -171,7 +171,7 @@ static int kmmpd(void *data)
>*/
>   if (retval) {
>   if ((failed_writes % 60) == 0) {
> - ext4_error_err(sb, -retval,
> + ext4_error_err(sb, retval,
>  "Error writing to MMP block");
>   }
>   failed_writes++;
> --
> 2.25.4
> 

Cheers, Andreas

signature.asc
Description: Message signed with OpenPGP

Re: [PATCH v2 1/2] ext4: Handle casefolding with encryption

2021-03-20 Thread Andreas Dilger

On Mar 19, 2021, at 1:34 AM, Daniel Rosenberg  wrote:
> 
> This adds support for encryption with casefolding.
> 
> Since the name on disk is case preserving, and also encrypted, we can no
> longer just recompute the hash on the fly. Additionally, to avoid
> leaking extra information from the hash of the unencrypted name, we use
> siphash via an fscrypt v2 policy.
> 
> The hash is stored at the end of the directory entry for all entries
> inside of an encrypted and casefolded directory apart from those that
> deal with '.' and '..'. This way, the change is backwards compatible
> with existing ext4 filesystems.
> 
> Signed-off-by: Daniel Rosenberg 

Reviewed-by: Andreas Dilger 

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH 1/2] ext4: Handle casefolding with encryption

2021-02-25 Thread Andreas Dilger

On Feb 18, 2021, at 4:21 PM, Daniel Rosenberg  wrote:
> 
> On Wed, Feb 17, 2021 at 2:48 PM Andreas Dilger  wrote:
>> 
>> On Feb 17, 2021, at 9:08 AM, Theodore Ts'o  wrote:
>>> 
>>> The problem is in how the space after the filename in a directory is
>>> encoded.  The dirdata format is (mildly) expandable, supporting up to
>>> 4 different metadata chunks after the filename, using a very
>>> compatctly encoded TLV (or moral equivalent) scheme.  For directory
>>> inodes that have both the encyption and compression flags set, we have
>>> a single blob which gets used as the IV for the crypto.
>>> 
>>> So it's the difference between a simple blob that is only used for one
>>> thing in this particular case, and something which is the moral
>>> equivalent of simple ASN.1 or protobuf encoding.
>>> 
>>> Currently, datadata has defined uses for 2 of the 4 "chunks", which is
>>> used in Lustre servers.  The proposal which Andreas has suggested is
>>> if the dirdata feature is supported, then the 3rd dirdata chunk would
>>> be used for the case where we currently used by the
>>> encrypted-casefolded extension, and the 4th would get reserved for a
>>> to-be-defined extension mechanism.
>>> 
>>> If there ext4 encrypted/casefold is not yet in use, and we can get the
>>> changes out to all potential users before they release products out
>>> into the field, then one approach would be to only support
>>> encrypted/casefold when dirdata is also enabled.
>>> 
>>> If ext4 encrypted/casefold is in use, my suggestion is that we support
>>> both encrypted/casefold && !dirdata as you have currently implemented
>>> it, and encrypted/casefold && dirdata as Andreas has proposed.
>>> 
>>> IIRC, supporting that Andreas's scheme essentially means that we use
>>> the top four bits in the rec_len field to indicate which chunks are
>>> present, and then for each chunk which is present, there is a 1 byte
>>> length followed by payload.  So that means in the case where it's
>>> encrypted/casefold && dirdata, the required storage of the directory
>>> entry would take one additional byte, plus setting a bit indicating
>>> that the encrypted/casefold dirdata chunk was present.
>> 
>> I think your email already covers pretty much all of the points.
>> 
>> One small difference between current "raw" encrypted/casefold hash vs.
>> dirdata is that the former is 4-byte aligned within the dirent, while
>> dirdata is packed.  So in 3/4 cases dirdata would take the same amount
>> of space (the 1-byte length would use one of the 1-3 bytes of padding
>> vs. the raw format), since the next dirent needs to be aligned anyway.
>> 
>> The other implication here is that the 8-byte hash may need to be
>> copied out of the dirent into a local variable before use, due to
>> alignment issues, but I'm not sure if that is actually needed or not.
>> 
>>> So, no, they aren't incompatible ultimatly, but it might require a
>>> tiny bit more work to integrate the combined support for dirdata plus
>>> encrypted/casefold.  One way we can do this, if we have to support the
>>> current encrypted/casefold format because it's out there in deployed
>>> implementations already, is to integrate encrypted/casefold &&
>>> !dirdata first upstream, and then when we integrate dirdata into
>>> upstream, we'll have to add support for the encrypted/casefold &&
>>> dirdata case.  This means that we'll have two variants of the on-disk
>>> format to test and support, but I don't think it's the going to be
>>> that difficult.
>> 
>> It would be possible to detect if the encrypted/casefold+dirdata
>> variant is in use, because the dirdata variant would have the 0x40
>> bit set in the file_type byte.  It isn't possible to positively
>> identify the "raw" non-dirdata variant, but the assumption would be
>> if (rec_len >= round_up(name_len, 4) + 8) in an encrypted+casefold
>> directory that the "raw" hash must be present in the dirent.
> 
> So sounds like we're going with the combined version. Andreas, do you
> have any suggestions for changes to the casefolding patch to ease the
> eventual merging with dirdata? A bunch of the changes are already
> pretty similar, so some of it is just calling essentially the same
> functions different things.

One thing I would suggest is to change the "is_fake_entry()" from using
offsets in the leaf bloc

Re: [RFC] Better page cache error handling

2021-02-24 Thread Andreas Dilger

On Feb 24, 2021, at 6:41 AM, Matthew Wilcox  wrote:
> 
> On Wed, Feb 24, 2021 at 01:38:48PM +0100, Jan Kara wrote:
>>> We allocate a page and try to read it.  29 threads pile up waiting
>>> for the page lock in filemap_update_page().  The error returned by the
>>> original I/O is shared between all 29 waiters as well as being returned
>>> to the requesting thread.  The next request for index.html will send
>>> another I/O, and more waiters will pile up trying to get the page lock,
>>> but at no time will more than 30 threads be waiting for the I/O to fail.
>> 
>> Interesting idea. It certainly improves current behavior. I just wonder
>> whether this isn't a partial solution to a problem and a full solution of
>> it would have to go in a different direction? I mean it just seems
>> wrong that each reader (let's assume they just won't overlap) has to retry
>> the failed IO and wait for the HW to figure out it's not going to work.
>> Shouldn't we cache the error state with the page? And I understand that we
>> then also have to deal with the problem how to invalidate the error state
>> when the block might eventually become readable (for stuff like temporary
>> IO failures). That would need some signalling from the driver to the page
>> cache, maybe in a form of some error recovery sequence counter or something
>> like that. For stuff like iSCSI, multipath, or NBD it could be doable I
>> believe...
> 
> That felt like a larger change than I wanted to make.  I already have
> a few big projects on my plate!
> 
> Also, it's not clear to me that the host can necessarily figure out when
> a device has fixed an error -- certainly for the three cases you list
> it can be done.  I think we'd want a timer to indicate that it's worth
> retrying instead of returning the error.
> 
> Anyway, that seems like a lot of data to cram into a struct page.  So I
> think my proposal is still worth pursuing while waiting for someone to
> come up with a perfect solution.

Since you would know that the page is bad at this point (not uptodate,
does not contain valid data) you could potentially re-use some other
fields in struct page, or potentially store something in the page itself?
That would avoid bloating struct page with fields that are only rarely
needed.  Userspace shouldn't be able to read the page at that point if
it is not marked uptodate, but they could overwrite it, so you wouldn't
want to store any kind of complex data structure there, but you _could_
store a magic, an error value, and a timeout, that are only valid if
!uptodate (cleared if the page were totally overwritten by userspace).

Yes, it's nasty, but better than growing struct page, and better than
blocking userspace threads for tens of minutes when a block is bad.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH v2] vfs: prevent copy_file_range to copy across devices

2021-02-17 Thread Andreas Dilger

On Feb 17, 2021, at 1:08 AM, Amir Goldstein  wrote:
> 
> You are missing my point.
> Never mind which server. The server does not *need* to rely on
> vfs_copy_file_range() to copy files from XFS to ext4.
> The server is very capable of implementing the fallback generic copy
> in case source/target fs do not support native {copy,remap}_file_range().
> 
> w.r.t semantics of copy_file_range() syscall vs. the fallback to userespace
> 'cp' tool (check source file size before copy or not), please note that the
> semantics of CIFS_IOC_COPYCHUNK_FILE are that of the former:
> 
>rc = cifs_file_copychunk_range(xid, src_file.file, 0, dst_file, 0,
>src_inode->i_size, 0);
> 
> It will copy zero bytes if advertised source file size if zero.
> 
> NFS server side copy semantics are currently de-facto the same
> because both the client and the server will have to pass through this
> line in vfs_copy_file_range():
> 
>if (len == 0)
>return 0;
> 
> IMO, and this opinion was voiced by several other filesystem developers,
> the shortend copy semantics are the correct semantics for copy_file_range()
> syscall as well as for vfs_copy_file_range() for internal kernel users.
> 
> I guess what this means is that if the 'cp' tool ever tries an opportunistic
> copy_file_range() syscall (e.g. --cfr=auto), it may result in zero size copy.

Having a syscall that does the "wrong thing" when called on two files
doesn't make sense.  Expecting userspace to check whether source/target
files supports CFR is also not practical.  This is trivial for the
kernel to determine and return -EOPNOTSUPP to the caller if the source
file (procfs/sysfs/etc) does not work with CFR properly.

Applications must already handle -EOPNOTSUPP with a fallback, but
expecting all applications that may call copy_file_range() to be
properly coded to handle corner cases is just asking for trouble.
That is doubly true given that an existing widely-used tool like
cp and mv are using this syscall if it is available in the kernel.

Cheers, Andreas

signature.asc
Description: Message signed with OpenPGP

Re: [PATCH 1/2] ext4: Handle casefolding with encryption

2021-02-17 Thread Andreas Dilger

On Feb 17, 2021, at 9:08 AM, Theodore Ts'o  wrote:
> 
> On Tue, Feb 16, 2021 at 08:01:11PM -0800, Daniel Rosenberg wrote:
>> I'm not sure what the conflict is, at least format-wise. Naturally,
>> there would need to be some work to reconcile the two patches, but my
>> patch only alters the format for directories which are encrypted and
>> casefolded, which always must have the additional hash field. In the
>> case of dirdata along with encryption and casefolding, couldn't we
>> have the dirdata simply follow after the existing data? Since we
>> always already know the length, it'd be unambiguous where that would
>> start. Casefolding can only be altered on an empty directory, and you
>> can only enable encryption for an empty directory, so I'm not too
>> concerned there. I feel like having it swapping between the different
>> methods makes it more prone to bugs, although it would be doable. I've
>> started rebasing the dirdata patch on my end to see how easy it is to
>> mix the two. At a glance, they touch a lot of the same areas in
>> similar ways, so it shouldn't be too hard. It's more of a question of
>> which way we want to resolve that, and which patch goes first.
>> 
>> I've been trying to figure out how many devices in the field are using
>> casefolded encryption, but haven't found out yet. The code is
>> definitely available though, so I would not be surprised if it's being
>> used, or is about to be.
> 
> The problem is in how the space after the filename in a directory is
> encoded.  The dirdata format is (mildly) expandable, supporting up to
> 4 different metadata chunks after the filename, using a very
> compatctly encoded TLV (or moral equivalent) scheme.  For directory
> inodes that have both the encyption and compression flags set, we have
> a single blob which gets used as the IV for the crypto.
> 
> So it's the difference between a simple blob that is only used for one
> thing in this particular case, and something which is the moral
> equivalent of simple ASN.1 or protobuf encoding.
> 
> Currently, datadata has defined uses for 2 of the 4 "chunks", which is
> used in Lustre servers.  The proposal which Andreas has suggested is
> if the dirdata feature is supported, then the 3rd dirdata chunk would
> be used for the case where we currently used by the
> encrypted-casefolded extension, and the 4th would get reserved for a
> to-be-defined extension mechanism.
> 
> If there ext4 encrypted/casefold is not yet in use, and we can get the
> changes out to all potential users before they release products out
> into the field, then one approach would be to only support
> encrypted/casefold when dirdata is also enabled.
> 
> If ext4 encrypted/casefold is in use, my suggestion is that we support
> both encrypted/casefold && !dirdata as you have currently implemented
> it, and encrypted/casefold && dirdata as Andreas has proposed.
> 
> IIRC, supporting that Andreas's scheme essentially means that we use
> the top four bits in the rec_len field to indicate which chunks are
> present, and then for each chunk which is present, there is a 1 byte
> length followed by payload.  So that means in the case where it's
> encrypted/casefold && dirdata, the required storage of the directory
> entry would take one additional byte, plus setting a bit indicating
> that the encrypted/casefold dirdata chunk was present.

I think your email already covers pretty much all of the points.

One small difference between current "raw" encrypted/casefold hash vs.
dirdata is that the former is 4-byte aligned within the dirent, while
dirdata is packed.  So in 3/4 cases dirdata would take the same amount
of space (the 1-byte length would use one of the 1-3 bytes of padding
vs. the raw format), since the next dirent needs to be aligned anyway.

The other implication here is that the 8-byte hash may need to be
copied out of the dirent into a local variable before use, due to
alignment issues, but I'm not sure if that is actually needed or not.

> So, no, they aren't incompatible ultimatly, but it might require a
> tiny bit more work to integrate the combined support for dirdata plus
> encrypted/casefold.  One way we can do this, if we have to support the
> current encrypted/casefold format because it's out there in deployed
> implementations already, is to integrate encrypted/casefold &&
> !dirdata first upstream, and then when we integrate dirdata into
> upstream, we'll have to add support for the encrypted/casefold &&
> dirdata case.  This means that we'll have two variants of the on-disk
> format to test and support, but I don't think it's the going to be
> that difficult.

It would be possible to detect if the encrypted/casefold+dirdata
variant is in use, because the dirdata variant would have the 0x40
bit set in the file_type byte.  It isn't possible to positively
identify the "raw" non-dirdata variant, but the assumption would be
if (rec_len >= round_up(name_len, 4) + 8) in an encrypted+casefold
directory that the "raw" hash

Re: [PATCH v2] vfs: prevent copy_file_range to copy across devices

2021-02-16 Thread Andreas Dilger

On Feb 16, 2021, at 6:51 AM, Amir Goldstein  wrote:
>> 
>>> This is easy to solve with a flag COPY_FILE_SPLICE (or something) that
>>> is internal to kernel users.
>>> 
>>> FWIW, you may want to look at the loop in ovl_copy_up_data()
>>> for improvements to nfsd_copy_file_range().
>>> 
>>> We can move the check out to copy_file_range syscall:
>>> 
>>>if (flags != 0)
>>>return -EINVAL;
>>> 
>>> Leave the fallback from all filesystems and check for the
>>> COPY_FILE_SPLICE flag inside generic_copy_file_range().
>> 
>> Ok, the diff bellow is just to make sure I understood your suggestion.
>> 
>> The patch will also need to:
>> 
>> - change nfs and overlayfs calls to vfs_copy_file_range() so that they
>>   use the new flag.
>> 
>> - check flags in generic_copy_file_checks() to make sure only valid flags
>>   are used (COPY_FILE_SPLICE at the moment).
>> 
>> Also, where should this flag be defined?  include/uapi/linux/fs.h?
>> 
>> Cheers,
>> --
>> Luis
>> 
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 75f764b43418..341d315d2a96 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -1383,6 +1383,13 @@ ssize_t generic_copy_file_range(struct file *file_in, 
>> loff_t pos_in,
>>struct file *file_out, loff_t pos_out,
>>size_t len, unsigned int flags)
>> {
>> +   if (!(flags & COPY_FILE_SPLICE)) {
>> +   if (!file_out->f_op->copy_file_range)
>> +   return -EOPNOTSUPP;
>> +   else if (file_out->f_op->copy_file_range !=
>> +file_in->f_op->copy_file_range)
>> +   return -EXDEV;
>> +   }
> 
> That looks strange, because you are duplicating the logic in
> do_copy_file_range(). Maybe better:
> 
> if (WARN_ON_ONCE(flags & ~COPY_FILE_SPLICE))
>return -EINVAL;
> if (flags & COPY_FILE_SPLICE)
>   return do_splice_direct(file_in, _in, file_out, _out,
> len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
> if (!file_out->f_op->copy_file_range)
>return -EOPNOTSUPP;
> return -EXDEV;

This shouldn't return -EINVAL to userspace if the flag is not set.

That implies there *is* some valid way for userspace to call this
function, which is AFAICS not possible if COPY_FILE_SPLICE is only
available to in-kernel callers.  Instead, it should continue
to return -EOPNOTSUPP to userspace if copy_file_range() is not valid
for this combination of file descriptors, so that applications will
fall back to the non-CFR implementation.

The WARN_ON_ONCE(ret == -EOPNOTSUPP) in vfs_copy_file_range() would
also need to be removed if this will be triggered from userspace.


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH 1/2] ext4: Handle casefolding with encryption

2021-02-09 Thread Andreas Dilger

On Feb 9, 2021, at 4:22 PM, Theodore Ts'o  wrote:
> 
> On Wed, Feb 03, 2021 at 11:31:28AM -0500, Theodore Ts'o wrote:
>> On Wed, Feb 03, 2021 at 03:55:06AM -0700, Andreas Dilger wrote:
>>> 
>>> It looks like this change will break the dirdata feature, which is similarly
>>> storing a data field beyond the end of the dirent. However, that feature 
>>> also
>>> provides for flags stored in the high bits of the type field to indicate
>>> which of the fields are in use there.
>>> The first byte of each field stores
>>> the length, so it can be skipped even if the content is not understood.
>> 
>> Daniel, for context, the dirdata field is an out-of-tree feature which
>> is used by Lustre, and so has fairly large deployed base.  So if there
>> is a way that we can accomodate not breaking dirdata, that would be
>> good.
>> 
>> Did the ext4 casefold+encryption implementation escape out to any
>> Android handsets?
> 
> So from an OOB chat with Daniel, it appears that the ext4
> casefold+encryption implementation did in fact escape out to Android
> handsets.  So I think what we will need to do, ultiumately, is support
> one way of supporting the casefold IV in the case where "encryption &&
> casefold", and another way when "encryption && casefold && dirdata".
> 
> That's going to be a bit sucky, but I don't think it should be that
> complex.  Daniel, Andreas, does that make sense to you?

I was just going to ping you about this, whether it made sense to remove
this feature addition from the "maint" branch (i.e. make a 1.45.8 without
it), and keep it only in 1.46 or "next" to reduce its spread?

Depending on the size of the "escape", it probably makes sense to move
toward having e2fsck migrate from the current mechanism to using dirdata
for all deployments.  In the current implementation, tools don't really
know for sure if there is data beyond the filename in the dirent or not.

I guess it is implicit with the casefold+encryption case for dirents in
directories that have the encryption flag set in a filesystem that also
has casefold enabled, but it's definitely not friendly to these features
being enabled on an existing filesystem.

For example, what if casefold is enabled on an existing filesystem that
already has an encrypted directory?  Does the code _assume_ that there is
a hash beyond the name if the rec_len is long enough for this?  There will
definitely be some pre-existing dirents that will have a large rec_len
(e.g. those at the end of the block, or with deleted entries immediately
following), that do *not* have the proper hash stored in them.  There may
be random garbage at the end of the dirent, and since every value in the
hash is valid, there is no way to know whether it is good or bad.

With the dirdata mechanism, there would be a bit set in the "file_type"
field that will indicate if the hash was present, as well as a length
field (0x08) that is a second confirmation that this field is valid.

Cheers, Andreas

signature.asc
Description: Message signed with OpenPGP

Re: [PATCH v2] ext4: Enable code path when DX_DEBUG is set

2021-02-01 Thread Andreas Dilger

On Feb 1, 2021, at 11:41 AM, Vinicius Tinti  wrote:
> 
> On Mon, Feb 1, 2021 at 2:13 PM Theodore Ts'o  wrote:
>> 
>> On Mon, Feb 01, 2021 at 01:15:29PM -0300, Vinicius Tinti wrote:
>>> On Mon, Feb 1, 2021 at 9:49 AM Christoph Hellwig  wrote:
 
 DX_DEBUG is completely dead code, so either kill it off or make it an
 actual CONFIG_* symbol through Kconfig if it seems useful.
>>> 
>>> About the unreachable code in "if (0)" I think it could be removed.
>>> It seems to be doing an extra check.
>>> 
>>> About the DX_DEBUG I think I can do another patch adding it to Kconfig
>>> as you and Nathan suggested.
>> 
>> Yes, it's doing another check which is useful in terms of early
>> detection of bugs when a developer has the code open for
>> modifications.  It slows down performance under normal circumstances,
>> and assuming the code is bug-free(tm), it's entirely unnecessary ---
>> which is why it's under an "if (0)".
> 
> My goal is to avoid having a dead code. Three options come to mind.
> 
> The first would be to add another #ifdef SOMETHING (suggest a name).
> But this doesn't remove the code and someone could enable it by accident.

I don't see anything wrong with the original suggestion to use "DX_DEBUG".
If we want to get rid of "if (0)" in the code it could be changed like:

#define DX_DEBUG 0
#if DX_DEBUG
#define dxtrace(command) command
#else
#define dxtrace(command)
#endif

Then in the code check this directly (and fix the //-style comment also):

if (DX_DEBUG) { /* linear search cross check */
:
}

That will hopefully avoid the "dead code" warning, while keeping the code
visible for syntax checking by the compiler (unlike if it was under #ifdef).

Alternately, if we want to keep the "#ifdef DX_DEBUG" check intact:

#ifdef DX_DEBUG
#define dxtrace(command) command
#define DX_DEBUG_CROSSCHECK true
#else
#define dxtrace(command)
#define DX_DEBUG_CROSSCHECK false
#endif

if (DX_DEBUG_CROSSCHECK) { /* linear search cross check */
:
}

Cheers, Andreas

> 
> The second would be to add the code in a block comment. Then document
> that it is for checking an invariant. This will make it harder to cause
> problems.
> 
> The third would be to remove the code and explain the invariant in a block
> comment. Maybe add a pseudo code too in the comment.
> 
> What do you think?
> 
>> However, if there *is* a bug, having an early detection that the
>> representation invariant of the data structure has been violated can
>> be useful in root causing a bug.  This would probably be clearer if
>> the code was pulled out into a separate function with comments
>> explaining that this is a rep invariant check.
> 
> Good idea. I will do that too.
> 
>> The main thing about DX_DEBUG right now is that it is **super**
>> verbose.  Unwary users who enable it will be sorry.  If we want to
>> make it to be a first-class feature enabled via CONFIG_EXT4_DEBUG, we
>> should convert all of the dx_trace calls to use pr_debug so they are
>> enabled only if dynamic debug enables those pr_debug() statements.
>> And this should absolutely be a separate patch.
> 
> Agreed. For now I only want to focus on the "if (0)".
> 
> Regards,
> Vinicius
> 
>> Cheers,
>> 
>>- Ted


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH] ext4: Remove unreachable code

2021-01-30 Thread Andreas Dilger

On Jan 29, 2021, at 11:58 AM, Vinicius Tinti  wrote:
> 
> By enabling -Wunreachable-code-aggressive on Clang the following code
> paths are unreachable.
> 
> Commit dd73b5d5cb67 ("ext4: convert dx_probe() to use the ERR_PTR
> convention")
> Commit ac27a0ec112a ("[PATCH] ext4: initial copy of files from ext3")
> 
> Clang warns:
> 
> fs/ext4/namei.c:831:17: warning: code will never be executed
> [-Wunreachable-code]
>unsigned n = count - 1;
> ^
> fs/ext4/namei.c:830:7: note: silence by adding parentheses to mark code as
> explicitly dead
>if (0) { // linear search cross check
>^
>/* DISABLES CODE */ ( )
> 
> Signed-off-by: Vinicius Tinti 
> ---
> fs/ext4/namei.c | 15 ---
> 1 file changed, 15 deletions(-)
> 
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index cf652ba3e74d..1f64dbd7237b 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -827,21 +827,6 @@ dx_probe(struct ext4_filename *fname, struct inode *dir,
>   p = m + 1;
>   }
> 
> - if (0) { // linear search cross check

I would rather put this block under "#ifdef DX_DEBUG" so that it is available
in the future for debugging problems with hashed directories.

> - unsigned n = count - 1;
> - at = entries;
> - while (n--)
> - {
> - dxtrace(printk(KERN_CONT ","));
> - if (dx_get_hash(++at) > hash)
> - {
> - at--;
> - break;
> - }
> - }
> - ASSERT(at == p - 1);
> - }
> -
>   at = p - 1;
>   dxtrace(printk(KERN_CONT " %x->%u\n",
>  at == entries ? 0 : dx_get_hash(at),
> --
> 2.25.1
> 


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: Getting a new fs in the kernel

2021-01-26 Thread Andreas Dilger

On Jan 26, 2021, at 09:25, Amy Parker  wrote:
> 
> Kernel development newcomer here. I've begun creating a concept for a
> new filesystem, and ideally once it's completed, rich, and stable I'd
> try to get it into the kernel.

Hello Amy, and welcome. 

I guess the first question to ask is what would be unique and valuable
about the new filesystem compared to existing filesystems in the kernel?

Given that maintaining a new filesystem adds ongoing maintenance
efforts, there has to be some added value before it would be accepted.
Also, given that filesystems are storing critical data for users, and
problems in the code can lead to data loss that can't be fixed by a reboot,
like many other software bugs, it takes a very long time for filesystems
to become stable enough for general use (the general rule of thumb is 10
years before a new filesystem is stable enough for general use). 

Often, if you have ideas for new functionality, it makes more sense to
add this into the framework of an existing filesystem (e.g. data verity,
data encryption, metadata checksum, DAX, etc. were all added to ext4).

Not only does this simplify efforts in terms of ongoing maintenance, but
it also means many more users will benefit from your development effort
in a much shorter timeframe. Otherwise, users would have to stop
using their existing filesystem before
they started using yours, and that is
a very slow process, because your filesystem would have to be much
better at *something* before they would make that switch. 

> What would be the process for this? I'd assume a patch sequence, but
> they'd all be dependent on each other, and sending in tons of
> dependent patches doesn't sound like a great idea. I've seen requests
> for pulls, but since I'm new here I don't really know what to do.

Probably the first step, before submitting any patches, would be to
provide a description of your work, and how it improves on the current
filesystems in the kernel. It may be that there are existing projects that
duplicate this effort, and combining resources will result in a 100% done
filesystem rather than two 80% done
projects...

Note that I don't want to discourage you from participating in the Linux
filesystem development community, but there are definitely considerations
going both ways wrt. accepting a new filesystem into the kernel. It may be
that your ideas, time, and efforts are better spent in contributing to an
exiting project.  It may also be that you have something groundbreaking
work, and I look forward to reading about what that is. 

Cheers, Andreas

Re: [PATCH] ext4: Remove expensive flush on fast commit

2021-01-07 Thread Andreas Dilger

On Tue, Jan 5, 2021 at 5:32 PM Daejun Park  wrote:
> 
> In the fast commit, it adds REQ_FUA and REQ_PREFLUSH on each fast commit
> block when barrier is enabled. However, in recovery phase, ext4 compares
> CRC value in the tail. So it is sufficient adds REQ_FUA and REQ_PREFLUSH
> on the block that has tail.

Does the tail block *always* contain a CRC, or is that dependent on
EXT4_FEATURE_RO_COMPAT_METADATA_CSUM, JBD2_FEATURE_INCOMPAT_CSUM_V2,
or JBD2_FEATURE_INCOMPAT_CSUM_V3 being enabled?

If one of those features is *required* before the FAST_COMMIT feature
can be used, then this patch looks OK.  Otherwise, the CSUM feature
should be checked before the FUA is skipped for non-tail blocks.

Cheers, Andreas

> Signed-off-by: Daejun Park 
> ---
> fs/ext4/fast_commit.c | 10 +-
> 1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> index 4fcc21c25e79..e66507be334c 100644
> --- a/fs/ext4/fast_commit.c
> +++ b/fs/ext4/fast_commit.c
> @@ -604,13 +604,13 @@ void ext4_fc_track_range(handle_t *handle, struct inode 
> *inode, ext4_lblk_t star
>trace_ext4_fc_track_range(inode, start, end, ret);
> }
> 
> -static void ext4_fc_submit_bh(struct super_block *sb)
> +static void ext4_fc_submit_bh(struct super_block *sb, bool is_tail)
> {
>int write_flags = REQ_SYNC;
>struct buffer_head *bh = EXT4_SB(sb)->s_fc_bh;
> 
> -   /* TODO: REQ_FUA | REQ_PREFLUSH is unnecessarily expensive. */
> -   if (test_opt(sb, BARRIER))
> +   /* Add REQ_FUA | REQ_PREFLUSH only its tail */
> +   if (test_opt(sb, BARRIER) && is_tail)
>write_flags |= REQ_FUA | REQ_PREFLUSH;
>lock_buffer(bh);
>set_buffer_dirty(bh);
> @@ -684,7 +684,7 @@ static u8 *ext4_fc_reserve_space(struct super_block *sb, 
> int len, u32 *crc)
>*crc = ext4_chksum(sbi, *crc, tl, sizeof(*tl));
>if (pad_len > 0)
>ext4_fc_memzero(sb, tl + 1, pad_len, crc);
> -   ext4_fc_submit_bh(sb);
> +   ext4_fc_submit_bh(sb, false);
> 
>ret = jbd2_fc_get_buf(EXT4_SB(sb)->s_journal, );
>if (ret)
> @@ -741,7 +741,7 @@ static int ext4_fc_write_tail(struct super_block *sb, u32 
> crc)
>tail.fc_crc = cpu_to_le32(crc);
>ext4_fc_memcpy(sb, dst, _crc, sizeof(tail.fc_crc), NULL);
> 
> -   ext4_fc_submit_bh(sb);
> +   ext4_fc_submit_bh(sb, true);
> 
>return 0;
> }
> --
> 2.25.1


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH] ext4: Don't leak old mountpoint samples

2020-12-17 Thread Andreas Dilger

On Dec 17, 2020, at 11:27 AM, Theodore Y. Ts'o  wrote:
> 
> On Tue, Dec 01, 2020 at 04:13:01PM +0100, Richard Weinberger wrote:
>> As soon the first file is opened, ext4 samples the mountpoint
>> of the filesystem in 64 bytes of the super block.
>> It does so using strlcpy(), this means that the remaining bytes
>> in the super block string buffer are untouched.
>> If the mount point before had a longer path than the current one,
>> it can be reconstructed.
>> 
>> Consider the case where the fs was mounted to "/media/johnjdeveloper"
>> and later to "/".
>> The the super block buffer then contains "/\x00edia/johnjdeveloper".
>> 
>> This case was seen in the wild and caused confusion how the name
>> of a developer ands up on the super block of a filesystem used
>> in production...
>> 
>> Fix this by clearing the string buffer before writing to it,
>> 
>> Signed-off-by: Richard Weinberger 
> 
> Thank for reporting this issue.  In fact, the better fix is to use
> strncpy().  See my revised patch for an explanation of why
> 
> commit cdc9ad7d3f201a77749432878fb4caa490862de6
> Author: Theodore Ts'o 
> Date:   Thu Dec 17 13:24:15 2020 -0500
> 
>ext4: don't leak old mountpoint samples
> 
>When the first file is opened, ext4 samples the mountpoint of the
>filesystem in 64 bytes of the super block.  It does so using
>strlcpy(), this means that the remaining bytes in the super block
>string buffer are untouched.  If the mount point before had a longer
>path than the current one, it can be reconstructed.
> 
>Consider the case where the fs was mounted to "/media/johnjdeveloper"
>and later to "/".  The super block buffer then contains
>"/\x00edia/johnjdeveloper".
> 
>This case was seen in the wild and caused confusion how the name
>of a developer ands up on the super block of a filesystem used
>in production...
> 
>Fix this by using strncpy() instead of strlcpy().  The superblock
>field is defined to be a fixed-size char array, and it is already
>marked using __nonstring in fs/ext4/ext4.h.  The consumer of the field
>in e2fsprogs already assumes that in the case of a 64+ byte mount
>path, that s_last_mounted will not be NUL terminated.
> 
>Reported-by: Richard Weinberger 
>Signed-off-by: Theodore Ts'o 

Color me confused, but I don't see how this change makes any difference?
If "cp" is only "/" then it will *still* not overwrite "edia/johnjdeveloper"
at the end of the s_last_mounted array.  To my mind, the only difference
between using strlcpy() and strncpy() would be whether the last byte in
the array can be used or not, but doesn't affect the remaining bytes.

> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 1cd3d26e3217..349b27f0dda0 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -810,7 +810,7 @@ static int ext4_sample_last_mounted(struct super_block 
> *sb,
>   if (err)
>   goto out_journal;
>   lock_buffer(sbi->s_sbh);
> - strlcpy(sbi->s_es->s_last_mounted, cp,
> + strncpy(sbi->s_es->s_last_mounted, cp,
>   sizeof(sbi->s_es->s_last_mounted));
>   ext4_superblock_csum_set(sb);
>   unlock_buffer(sbi->s_sbh);


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH 1/2] uapi: fix statx attribute value overlap for DAX & MOUNT_ROOT

2020-12-01 Thread Andreas Dilger

On Dec 1, 2020, at 10:44 AM, Eric Sandeen  wrote:
> 
> On 12/1/20 11:32 AM, Darrick J. Wong wrote:
>> On Tue, Dec 01, 2020 at 10:57:11AM -0600, Eric Sandeen wrote:
>>> STATX_ATTR_MOUNT_ROOT and STATX_ATTR_DAX got merged with the same value,
>>> so one of them needs fixing. Move STATX_ATTR_DAX.
>>> 
>>> While we're in here, clarify the value-matching scheme for some of the
>>> attributes, and explain why the value for DAX does not match.
>>> 
>>> Signed-off-by: Eric Sandeen 
>>> ---
>>> include/uapi/linux/stat.h | 7 ---
>>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>> 
>>> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
>>> index 82cc58fe9368..9ad19eb9bbbf 100644
>>> --- a/include/uapi/linux/stat.h
>>> +++ b/include/uapi/linux/stat.h
>>> @@ -171,9 +171,10 @@ struct statx {
>>>  * be of use to ordinary userspace programs such as GUIs or ls rather than
>>>  * specialised tools.
>>>  *
>>> - * Note that the flags marked [I] correspond to generic FS_IOC_FLAGS
>>> + * Note that the flags marked [I] correspond to the FS_IOC_SETFLAGS flags
>>>  * semantically.  Where possible, the numerical value is picked to 
>>> correspond
>>> - * also.
>>> + * also. Note that the DAX attribute indicates that the inode is currently
>>> + * DAX-enabled, not simply that the per-inode flag has been set.
>> 
>> I don't really like using "DAX-enabled" to define "DAX attribute".  How
>> about cribbing from the statx manpage?
>> 
>> "Note that the DAX attribute indicates that the file is in the CPU
>> direct access state.  It does not correspond to the per-inode flag that
>> some filesystems support."
> 
> Sure.  Consistency and specificity is good, I'll change that.
> 
>>>  */
>>> #define STATX_ATTR_COMPRESSED   0x0004 /* [I] File is 
>>> compressed by the fs */
>>> #define STATX_ATTR_IMMUTABLE0x0010 /* [I] File is 
>>> marked immutable */
>>> @@ -183,7 +184,7 @@ struct statx {
>>> #define STATX_ATTR_AUTOMOUNT0x1000 /* Dir: Automount 
>>> trigger */
>>> #define STATX_ATTR_MOUNT_ROOT   0x2000 /* Root of a mount */
>>> #define STATX_ATTR_VERITY   0x0010 /* [I] Verity protected file 
>>> */
>>> -#define STATX_ATTR_DAX 0x2000 /* [I] File is DAX */
>>> +#define STATX_ATTR_DAX 0x0040 /* File is currently 
>>> DAX-enabled */
>> 
>> Why not use the next bit in the series (0x20)?  Did someone already
>> claim it in for-next?
> 
> Since it didn't match the FS_IOC_SETFLAGS flag, I was trying to pick one that
> seemed unlikely to ever gain a corresponding statx flag, and since 0x0040 
> is
> "reserved for ext4" it seemed like a decent choice.
> 
> But 0x20 corresponds to FS_EA_INODE_FL/EXT4_EA_INODE_FL which is 
> ext4-specific
> as well, so sure, I'll change to that.

If you look a few lines up in the context, this is supposed to be using the
same value as the other inode flags:

 * Note that the flags marked [I] correspond to generic FS_IOC_FLAGS
 * semantically.  Where possible, the numerical value is picked to correspond
 * also.

#define FS_DAX_FL   0x0200 /* Inode is DAX */
#define EXT4_DAX_FL 0x0200 /* Inode is DAX */

(FS_DAX_FL also used by XFS) so this should really be "0x0200" instead
of some other value.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: UAPI value collision: STATX_ATTR_MOUNT_ROOT vs STATX_ATTR_DAX

2020-11-25 Thread Andreas Dilger

On Nov 25, 2020, at 12:26 PM, Miklos Szeredi  wrote:
> 
> On Wed, Nov 25, 2020 at 5:57 PM David Howells  wrote:
>> 
>> Hi Linus, Miklos, Ira,
>> 
>> It seems that two patches that got merged in the 5.8 merge window collided 
>> and
>> no one noticed until now:
>> 
>> 80340fe3605c0 (Miklos Szeredi 2020-05-14 184) #define 
>> STATX_ATTR_MOUNT_ROOT 0x2000 /* Root of a mount */
>> ...
>> 712b2698e4c02 (Ira Weiny  2020-04-30 186) #define STATX_ATTR_DAX 
>>0x2000 /* [I] File is DAX */
>> 
>> The question is, what do we do about it?  Renumber one or both of the
>> constants?
> 
> :
> * Note that the flags marked [I] correspond to generic FS_IOC_FLAGS
> * semantically.  Where possible, the numerical value is picked to correspond
> * also.
> 
> :
> #define FS_DAX_FL 0x0200 /* Inode is DAX */
> 
> The DAX one can be the same value as FS_DAX_FL, the placement (after
> STATX_ATTR_VERITY, instead of before) seems to confirm this intention.

Yes, this looks like a bug in the STATX_ATTR_DAX value.  It should be the same
as FS_DAX_FL, like all of the other STATX_ATTR_* [I] values are.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH v2] fs/aio.c: Cosmetic

2020-11-02 Thread Andreas Dilger

On Nov 2, 2020, at 2:58 PM, Alejandro Colomar  wrote:
> Changes:
> - Consistently use 'unsigned int', instead of 'unsigned'.
> - Add a blank line after variable declarations.
> - Move variable declarations to the top of functions.
> - Add a blank line at the top of functions if there are no declarations.

I'd agree that the other changes are following kernel coding style, but
I've never heard of leaving a blank line at the start of functions without
any local variables.  I don't see anything in process/coding-style.rst to
support this change, nor are the majority of variable-less functions
formatted this way, and it seems to just be a waste of vertical space.

Cheers, Andreas

signature.asc
Description: Message signed with OpenPGP

Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

2020-10-08 Thread Andreas Dilger

On Oct 8, 2020, at 1:12 PM, Josh Triplett  wrote:
> 
> On Wed, Oct 07, 2020 at 08:57:12PM -0600, Andreas Dilger wrote:
>> On Oct 7, 2020, at 2:14 PM, Josh Triplett  wrote:
>>> If those aren't the right way to express that, I could potentially
>>> adapt. I had a similar such conversation on linux-ext4 already (about
>>> inline data with 128-bit inodes), which led to me choosing to abandon
>>> 128-byte inodes rather than try to get ext4 to support what I wanted
>>> with them, because I didn't want to be disruptive to ext4 for a niche
>>> use case. In the particular case that motivated this thread, what I was
>>> doing already worked in previous kernels, and it seemed reasonable to
>>> ask for it to continue to work in new kernels, while preserving the
>>> newly added checks in the new kernels.
>> 
>> This was discussed in the "Inline data with 128-byte inodes?" thread
>> back in May.  While Jan was not necessarily in favour of this, I was
>> actually OK with improving the ext4 code to handle this case better,
>> since it would (at minimum) clean up ext4 to make a clear separation
>> of how it is detecting data in the i_block[] array and the system.data
>> xattr, and I don't think it added any complexity to the code.
>> 
>> I even posted a WIP patch to that effect, but didn't get a response back:
>> https://marc.info/?l=linux-ext4=158863275019187
> 
> My apologies, I thought I responded to that. It looks promising to me,
> though I wouldn't have the bandwidth to take it to completion anytime
> soon.

NP, I don't have bandwidth to work on it right now either.

>> I *do* think that inline_data is an under-appreciated feature that I
>> would be happy to see some improvements with.  I don't think that small
>> files are a niche use case, and if we can clean up the inline_data code
>> to work with 128-byte inodes I'm not against that, even though I'm not
>> going to use that combination of features myself.
> 
> I'd love to see that happen. At the time, it seemed like too large of a
> change to block on, which is why I ended up deciding to switch to
> 256-byte inodes.

Does that mean you are using inline_data with 256-byte inodes?  That would
also be good to know, since there haven't been any well-known users of
this feature so far (AFAIK).  Since you are using this in a read-only
manner, you won't hit the one know issue when an inline_data inode is
extended to use an external block that may temporarily leave the inode
in an inconsistent state.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

2020-10-07 Thread Andreas Dilger

On Oct 7, 2020, at 2:14 PM, Josh Triplett  wrote:
> If those aren't the right way to express that, I could potentially
> adapt. I had a similar such conversation on linux-ext4 already (about
> inline data with 128-bit inodes), which led to me choosing to abandon
> 128-byte inodes rather than try to get ext4 to support what I wanted
> with them, because I didn't want to be disruptive to ext4 for a niche
> use case. In the particular case that motivated this thread, what I was
> doing already worked in previous kernels, and it seemed reasonable to
> ask for it to continue to work in new kernels, while preserving the
> newly added checks in the new kernels.

This was discussed in the "Inline data with 128-byte inodes?" thread
back in May.  While Jan was not necessarily in favour of this, I was
actually OK with improving the ext4 code to handle this case better,
since it would (at minimum) clean up ext4 to make a clear separation
of how it is detecting data in the i_block[] array and the system.data
xattr, and I don't think it added any complexity to the code.

I even posted a WIP patch to that effect, but didn't get a response back:
https://marc.info/?l=linux-ext4=158863275019187

I *do* think that inline_data is an under-appreciated feature that I
would be happy to see some improvements with.  I don't think that small
files are a niche use case, and if we can clean up the inline_data code
to work with 128-byte inodes I'm not against that, even though I'm not
going to use that combination of features myself.

Cheers, Andreas

signature.asc
Description: Message signed with OpenPGP

Re: [PATCH] [v2] ext4: Fix error handling code in add_new_gdb

2020-08-29 Thread Andreas Dilger

On Aug 28, 2020, at 8:54 PM, Dinghao Liu  wrote:
> 
> When ext4_journal_get_write_access() fails, we should
> terminate the execution flow and release n_group_desc,
> iloc.bh, dind and gdb_bh.
> 
> Signed-off-by: Dinghao Liu 

Looks good.  I also reviewed the other error conditions in this function
and didn't see any similar problems.

Reviewed-by: Andreas Dilger 

> ---
> 
> Changelog:
> 
> v2: - Remove changes to ext4_handle_dirty_super()'s
>  error handling path.
> ---
> fs/ext4/resize.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
> index a50b51270ea9..71bf600e5b42 100644
> --- a/fs/ext4/resize.c
> +++ b/fs/ext4/resize.c
> @@ -843,8 +843,10 @@ static int add_new_gdb(handle_t *handle, struct inode 
> *inode,
> 
>   BUFFER_TRACE(dind, "get_write_access");
>   err = ext4_journal_get_write_access(handle, dind);
> - if (unlikely(err))
> + if (unlikely(err)) {
>   ext4_std_error(sb, err);
> + goto errout;
> + }
> 
>   /* ext4_reserve_inode_write() gets a reference on the iloc */
>   err = ext4_reserve_inode_write(handle, inode, );
> --
> 2.17.1
> 


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH v4 1/3] io_uring: use an enumeration for io_uring_register(2) opcodes

2020-08-26 Thread Andreas Dilger

On Aug 26, 2020, at 1:43 PM, Kees Cook  wrote:
> 
> On Thu, Aug 13, 2020 at 05:32:52PM +0200, Stefano Garzarella wrote:
>> The enumeration allows us to keep track of the last
>> io_uring_register(2) opcode available.
>> 
>> Behaviour and opcodes names don't change.
>> 
>> Signed-off-by: Stefano Garzarella 
>> ---
>> include/uapi/linux/io_uring.h | 27 ---
>> 1 file changed, 16 insertions(+), 11 deletions(-)
>> 
>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>> index d65fde732518..cdc98afbacc3 100644
>> --- a/include/uapi/linux/io_uring.h
>> +++ b/include/uapi/linux/io_uring.h
>> @@ -255,17 +255,22 @@ struct io_uring_params {
>> /*
>>  * io_uring_register(2) opcodes and arguments
>>  */
>> -#define IORING_REGISTER_BUFFERS 0
>> -#define IORING_UNREGISTER_BUFFERS   1
>> -#define IORING_REGISTER_FILES   2
>> -#define IORING_UNREGISTER_FILES 3
>> -#define IORING_REGISTER_EVENTFD 4
>> -#define IORING_UNREGISTER_EVENTFD   5
>> -#define IORING_REGISTER_FILES_UPDATE6
>> -#define IORING_REGISTER_EVENTFD_ASYNC   7
>> -#define IORING_REGISTER_PROBE   8
>> -#define IORING_REGISTER_PERSONALITY 9
>> -#define IORING_UNREGISTER_PERSONALITY   10
>> +enum {
>> +IORING_REGISTER_BUFFERS,
> 
> Actually, one *tiny* thought. Since this is UAPI, do we want to be extra
> careful here and explicitly assign values? We can't change the meaning
> of a number (UAPI) but we can add new ones, etc? This would help if an
> OP were removed (to stop from triggering a cascade of changed values)...
> 
> for example:
> 
> enum {
>   IORING_REGISTER_BUFFERS = 0,
>   IORING_UNREGISTER_BUFFERS = 1,
>   ...

Definitely that is preferred, IMHO, for enums used as part of UAPI,
as it avoids accidental changes to the values, and it also makes it
easier to see what the actual values are.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH 9/9] iomap: Change calling convention for zeroing

2020-08-24 Thread Andreas Dilger

On Aug 24, 2020, at 9:26 PM, Matthew Wilcox  wrote:
> 
> On Tue, Aug 25, 2020 at 10:27:35AM +1000, Dave Chinner wrote:
>>> do {
>>> -   unsigned offset, bytes;
>>> -
>>> -   offset = offset_in_page(pos);
>>> -   bytes = min_t(loff_t, PAGE_SIZE - offset, count);
>>> +   loff_t bytes;
>>> 
>>> if (IS_DAX(inode))
>>> -   status = dax_iomap_zero(pos, offset, bytes, iomap);
>>> +   bytes = dax_iomap_zero(pos, length, iomap);
>> 
>> Hmmm. everything is loff_t here, but the callers are defining length
>> as u64, not loff_t. Is there a potential sign conversion problem
>> here? (sure 64 bit is way beyond anything we'll pass here, but...)
> 
> I've gone back and forth on the correct type for 'length' a few times.
> size_t is too small (not for zeroing, but for seek()).  An unsigned type
> seems right -- a length can't be negative, and we don't want to give
> the impression that it can.  But the return value from these functions
> definitely needs to be signed so we can represent an error.  So a u64
> length with an loff_t return type feels like the best solution.  And
> the upper layers have to promise not to pass in a length that's more
> than 2^63-1.

The problem with allowing a u64 as the length is that it leads to the
possibility of an argument value that cannot be returned.  Checking
length < 0 is not worse than checking length > 0x7ff,
and has the benefit of consistency with the other argument types and
signs...

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH] ext4: move buffer_mapped() to proper position

2020-08-07 Thread Andreas Dilger


> On Aug 7, 2020, at 2:02 PM, ty...@mit.edu wrote:
> 
> Thanks, applied, although I rewrote the commit description to make it
> be a bit more clearer:
> 
>fs: prevent BUG_ON in submit_bh_wbc()
> 
>If a device is hot-removed --- for example, when a physical device is
>unplugged from pcie slot or a nbd device's network is shutdown ---
>this can result in a BUG_ON() crash in submit_bh_wbc().  This is
>because the when the block device dies, the buffer heads will have
>their Buffer_Mapped flag get cleared, leading to the crash in
>submit_bh_wbc.
> 
>We had attempted to work around this problem in commit a17712c8

Should this get a "Fixes:" label with this info, rather than embedding
it in the commit message, so that it could be picked up by stable?

Cheers, Andreas

>("ext4: check superblock mapped prior to committing").  Unfortunately,
>it's still possible to hit the BUG_ON(!buffer_mapped(bh)) if the
>device dies between when the work-around check in ext4_commit_super()
>and when submit_bh_wbh() is finally called:
> 
>Code path:
>ext4_commit_super
>judge if 'buffer_mapped(sbh)' is false, return <== commit a17712c8
>  lock_buffer(sbh)
>  ...
>  unlock_buffer(sbh)
>   __sync_dirty_buffer(sbh,...
>lock_buffer(sbh)
>judge if 'buffer_mapped(sbh))' is false, return 
> <== added by this patch
>submit_bh(...,sbh)
>submit_bh_wbc(...,sbh,...)
> 
>[100722.966497] kernel BUG at fs/buffer.c:3095! <== 
> BUG_ON(!buffer_mapped(bh))' in submit_bh_wbc()
>[100722.966503] invalid opcode:  [#1] SMP
>[100722.966566] task: 8817e15a9e40 task.stack: c90024744000
>[100722.966574] RIP: 0010:submit_bh_wbc+0x180/0x190
>[100722.966575] RSP: 0018:c90024747a90 EFLAGS: 00010246
>[100722.966576] RAX: 00620005 RBX: 8818a80603a8 RCX: 
> 
>[100722.966576] RDX: 8818a80603a8 RSI: 00020800 RDI: 
> 0001
>[100722.966577] RBP: c90024747ac0 R08:  R09: 
> 88207f94170d
>[100722.966578] R10: 000437c8 R11: 0001 R12: 
> 00020800
>[100722.966578] R13: 0001 R14: 0bf9a438 R15: 
> 88195f333000
>[100722.966580] FS:  7fa2eee27700() GS:88203d84() 
> knlGS:
>[100722.966580] CS:  0010 DS:  ES:  CR0: 80050033
>[100722.966581] CR2: 00f0b008 CR3: 00201a622003 CR4: 
> 007606e0
>[100722.966582] DR0:  DR1:  DR2: 
> 
>[100722.966583] DR3:  DR6: fffe0ff0 DR7: 
> 0400
>[100722.966583] PKRU: 5554
>[100722.966583] Call Trace:
>[100722.966588]  __sync_dirty_buffer+0x6e/0xd0
>[100722.966614]  ext4_commit_super+0x1d8/0x290 [ext4]
>[100722.966626]  __ext4_std_error+0x78/0x100 [ext4]
>[100722.966635]  ? __ext4_journal_get_write_access+0xca/0x120 [ext4]
>[100722.966646]  ext4_reserve_inode_write+0x58/0xb0 [ext4]
>[100722.966655]  ? ext4_dirty_inode+0x48/0x70 [ext4]
>[100722.93]  ext4_mark_inode_dirty+0x53/0x1e0 [ext4]
>[100722.966671]  ? __ext4_journal_start_sb+0x6d/0xf0 [ext4]
>[100722.966679]  ext4_dirty_inode+0x48/0x70 [ext4]
>[100722.966682]  __mark_inode_dirty+0x17f/0x350
>[100722.966686]  generic_update_time+0x87/0xd0
>[100722.966687]  touch_atime+0xa9/0xd0
>[100722.966690]  generic_file_read_iter+0xa09/0xcd0
>[100722.966694]  ? page_cache_tree_insert+0xb0/0xb0
>[100722.966704]  ext4_file_read_iter+0x4a/0x100 [ext4]
>[100722.966707]  ? __inode_security_revalidate+0x4f/0x60
>[100722.966709]  __vfs_read+0xec/0x160
>[100722.966711]  vfs_read+0x8c/0x130
>[100722.966712]  SyS_pread64+0x87/0xb0
>[100722.966716]  do_syscall_64+0x67/0x1b0
>[100722.966719]  entry_SYSCALL64_slow_path+0x25/0x25
> 
>To address this, add the check of 'buffer_mapped(bh)' to
>__sync_dirty_buffer().  This also has the benefit of fixing this for
>other file systems.
> 
>With this addition, we can drop the workaround in ext4_commit_supper().
> 
>[ Commit description rewritten by tytso. ]
> 
>Signed-off-by: Xianting Tian 
>Link: 
> https://lore.kernel.org/r/1596211825-8750-1-git-send-email-xianting_t...@126.com
>Signed-off-by: Theodore Ts'o 
> 
>   - Ted


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH v2 3/3] ext4: add needed paramter to ext4_mb_discard_preallocations trace

2020-08-04 Thread Andreas Dilger

On Aug 4, 2020, at 7:02 PM, brookxu  wrote:
> 
> Add the needed value to ext4_mb_discard_preallocations trace, so
> we can more easily observe the requested number of trim.
> 
> Signed-off-by: Chunguang Xu 

IMHO, this should be part of the previous patch that is changing the
API for ext4_discard_preallocations().

Cheers, Andreas

> ---
>  include/trace/events/ext4.h | 14 --
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> index cc41d69..61736d8 100644
> --- a/include/trace/events/ext4.h
> +++ b/include/trace/events/ext4.h
> @@ -746,24 +746,26 @@
>  );
> 
>  TRACE_EVENT(ext4_discard_preallocations,
> -TP_PROTO(struct inode *inode),
> +TP_PROTO(struct inode *inode, unsigned int needed),
> 
> -TP_ARGS(inode),
> +TP_ARGS(inode, needed),
> 
>  TP_STRUCT__entry(
> -__field(dev_t,dev)
> -__field(ino_t,ino)
> +__field(dev_t,dev)
> +__field(ino_t,ino)
> +__field(unsigned int,needed)
> 
>  ),
> 
>  TP_fast_assign(
>  __entry->dev= inode->i_sb->s_dev;
>  __entry->ino= inode->i_ino;
> +__entry->needed= needed;
>  ),
> 
> -TP_printk("dev %d,%d ino %lu",
> +TP_printk("dev %d,%d ino %lu needed %u",
>MAJOR(__entry->dev), MINOR(__entry->dev),
> -  (unsigned long) __entry->ino)
> +  (unsigned long) __entry->ino, __entry->needed)
>  );
> 
>  TRACE_EVENT(ext4_mb_discard_preallocations,
> 
> --
> 1.8.3.1
> 


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH v2 2/3] ext4: limit the length of per-inode prealloc list

2020-08-04 Thread Andreas Dilger

On Aug 4, 2020, at 7:02 PM, brookxu  wrote:
> 
> In the scenario of writing sparse files, the Per-inode prealloc list may
> be very long, resulting in high overhead for ext4_mb_use_preallocated().
> To circumvent this problem, we limit the maximum length of per-inode
> prealloc list to 512 and allow users to modify it.
> 
> Signed-off-by: Chunguang Xu 

Do you have any kind of measurements that show the benefit of this patch?
For example performance improvement, memory or CPU usage before and after?
How long is "very long"?

Cheers, Andreas

> ---
>  fs/ext4/ext4.h|  3 ++-
>  fs/ext4/extents.c | 10 -
>  fs/ext4/file.c|  2 +-
>  fs/ext4/indirect.c|  2 +-
>  fs/ext4/inode.c   |  6 +++---
>  fs/ext4/ioctl.c   |  2 +-
>  fs/ext4/mballoc.c | 57 
> +++
>  fs/ext4/mballoc.h |  4 
>  fs/ext4/move_extent.c |  4 ++--
>  fs/ext4/super.c   |  2 +-
>  fs/ext4/sysfs.c   |  2 ++
>  11 files changed, 75 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 42f5060..68e0ebe 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1501,6 +1501,7 @@ struct ext4_sb_info {
>  unsigned int s_mb_stats;
>  unsigned int s_mb_order2_reqs;
>  unsigned int s_mb_group_prealloc;
> +unsigned int s_mb_max_inode_prealloc;
>  unsigned int s_max_dir_size_kb;
>  /* where last allocation was done - for stream allocation */
>  unsigned long s_mb_last_group;
> @@ -2651,7 +2652,7 @@ extern int ext4_init_inode_table(struct super_block *sb,
>  extern ext4_fsblk_t ext4_mb_new_blocks(handle_t *,
>  struct ext4_allocation_request *, int *);
>  extern int ext4_mb_reserve_blocks(struct super_block *, int);
> -extern void ext4_discard_preallocations(struct inode *);
> +extern void ext4_discard_preallocations(struct inode *, unsigned int);
>  extern int __init ext4_init_mballoc(void);
>  extern void ext4_exit_mballoc(void);
>  extern void ext4_free_blocks(handle_t *handle, struct inode *inode,
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 221f240..a40f928 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -100,7 +100,7 @@ static int ext4_ext_trunc_restart_fn(struct inode *inode, 
> int *dropped)
>   * i_mutex. So we can safely drop the i_data_sem here.
>   */
>  BUG_ON(EXT4_JOURNAL(inode) == NULL);
> -ext4_discard_preallocations(inode);
> +ext4_discard_preallocations(inode, 0);
>  up_write(_I(inode)->i_data_sem);
>  *dropped = 1;
>  return 0;
> @@ -4272,7 +4272,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode 
> *inode,
>   * not a good idea to call discard here directly,
>   * but otherwise we'd need to call it every free().
>   */
> -ext4_discard_preallocations(inode);
> +ext4_discard_preallocations(inode, 0);
>  if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
>  fb_flags = EXT4_FREE_BLOCKS_NO_QUOT_UPDATE;
>  ext4_free_blocks(handle, inode, NULL, newblock,
> @@ -5299,7 +5299,7 @@ static int ext4_collapse_range(struct inode *inode, 
> loff_t offset, loff_t len)
>  }
> 
>  down_write(_I(inode)->i_data_sem);
> -ext4_discard_preallocations(inode);
> +ext4_discard_preallocations(inode, 0);
> 
>  ret = ext4_es_remove_extent(inode, punch_start,
>  EXT_MAX_BLOCKS - punch_start);
> @@ -5313,7 +5313,7 @@ static int ext4_collapse_range(struct inode *inode, 
> loff_t offset, loff_t len)
>  up_write(_I(inode)->i_data_sem);
>  goto out_stop;
>  }
> -ext4_discard_preallocations(inode);
> +ext4_discard_preallocations(inode, 0);
> 
>  ret = ext4_ext_shift_extents(inode, handle, punch_stop,
>   punch_stop - punch_start, SHIFT_LEFT);
> @@ -5445,7 +5445,7 @@ static int ext4_insert_range(struct inode *inode, 
> loff_t offset, loff_t len)
>  goto out_stop;
> 
>  down_write(_I(inode)->i_data_sem);
> -ext4_discard_preallocations(inode);
> +ext4_discard_preallocations(inode, 0);
> 
>  path = ext4_find_extent(inode, offset_lblk, NULL, 0);
>  if (IS_ERR(path)) {
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 2a01e31..e3ab8ea 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -148,7 +148,7 @@ static int ext4_release_file(struct inode *inode, struct 
> file *filp)
>  !EXT4_I(inode)->i_reserved_data_blocks)
>  {
>  down_write(_I(inode)->i_data_sem);
> -ext4_discard_preallocations(inode);
> +ext4_discard_preallocations(inode, 0);
>  up_write(_I(inode)->i_data_sem);
>  }
>  if (is_dx(inode) && filp->private_data)
> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> index be2b66e..ec6b930 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c
> @@ -696,7 +696,7 @@ static int ext4_ind_trunc_restart_fn(handle_t *handle, 
> struct

Re: [PATCH v2 1/3] ext4: reorganize if statement of ext4_mb_release_context()

2020-08-04 Thread Andreas Dilger

On Aug 4, 2020, at 7:01 PM, brookxu  wrote:
> 
> Reorganize the if statement of ext4_mb_release_context(), make it
> easier to read.
> 
> Signed-off-by: Chunguang Xu 

Reviewed-by: Andreas Dilger 

> ---
>  fs/ext4/mballoc.c | 27 +--
>  1 file changed, 13 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index c0a331e..4f21f34 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -4564,20 +4564,19 @@ static int ext4_mb_release_context(struct 
> ext4_allocation_context *ac)
>  pa->pa_free -= ac->ac_b_ex.fe_len;
>  pa->pa_len -= ac->ac_b_ex.fe_len;
>  spin_unlock(>pa_lock);
> -}
> -}
> -if (pa) {
> -/*
> - * We want to add the pa to the right bucket.
> - * Remove it from the list and while adding
> - * make sure the list to which we are adding
> - * doesn't grow big.
> - */
> -if ((pa->pa_type == MB_GROUP_PA) && likely(pa->pa_free)) {
> -spin_lock(pa->pa_obj_lock);
> -list_del_rcu(>pa_inode_list);
> -spin_unlock(pa->pa_obj_lock);
> -ext4_mb_add_n_trim(ac);
> +
> +/*
> + * We want to add the pa to the right bucket.
> + * Remove it from the list and while adding
> + * make sure the list to which we are adding
> + * doesn't grow big.
> + */
> +if (likely(pa->pa_free)) {
> +spin_lock(pa->pa_obj_lock);
> +list_del_rcu(>pa_inode_list);
> +spin_unlock(pa->pa_obj_lock);
> +ext4_mb_add_n_trim(ac);
> +}
>  }
>  ext4_mb_put_pa(ac, ac->ac_sb, pa);
>  }
> 
> --
> 1.8.3.1
> 


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close faster

2020-07-05 Thread Andreas Dilger

On Jul 4, 2020, at 8:46 PM, Jan Ziak <0xe2.0x9a.0...@gmail.com> wrote:
> 
> On Sun, Jul 5, 2020 at 4:16 AM Matthew Wilcox  wrote:
>> 
>> On Sun, Jul 05, 2020 at 04:06:22AM +0200, Jan Ziak wrote:
>>> Hello
>>> 
>>> At first, I thought that the proposed system call is capable of
>>> reading *multiple* small files using a single system call - which
>>> would help increase HDD/SSD queue utilization and increase IOPS (I/O
>>> operations per second) - but that isn't the case and the proposed
>>> system call can read just a single file.
>>> 
>>> Without the ability to read multiple small files using a single system
>>> call, it is impossible to increase IOPS (unless an application is
>>> using multiple reader threads or somehow instructs the kernel to
>>> prefetch multiple files into memory).
>> 
>> What API would you use for this?
>> 
>> ssize_t readfiles(int dfd, char **files, void **bufs, size_t *lens);
>> 
>> I pretty much hate this interface, so I hope you have something better
>> in mind.
> 
> I am proposing the following:
> 
> struct readfile_t {
>  int dirfd;
>  const char *pathname;
>  void *buf;
>  size_t count;
>  int flags;
>  ssize_t retval; // set by kernel
>  int reserved; // not used by kernel
> };

If you are going to pass a struct from userspace to the kernel, it
should not mix int and pointer types (which may be 64-bit values,
so that there are not structure packing issues, like:

struct readfile {
int dirfd;
int flags;
const char *pathname;
void*buf;
size_t  count;
ssize_t retval;
};

It would be better if "retval" was returned in "count", so that
the structure fits nicely into 32 bytes on a 64-bit system, instead
of being 40 bytes per entry, which adds up over many entries, like.

struct readfile {
int dirfd;
int flags;
const char *pathname;
void*buf;
ssize_t count;  /* input: bytes requested, output: bytes read or -errno 
*/
};


However, there is still an issue with passing pointers from userspace,
since they may be 32-bit userspace pointers on a 64-bit kernel.

> int readfiles(struct readfile_t *requests, size_t count);

It's not clear why count is a "size_t" since it is not a size.
An unsigned int is fine here, since it should never be negative.

> Returns zero if all requests succeeded, otherwise the returned value
> is non-zero (glibc wrapper: -1) and user-space is expected to check
> which requests have succeeded and which have failed. retval in
> readfile_t is set to what the single-file readfile syscall would
> return if it was called with the contents of the corresponding
> readfile_t struct.
> 
> The glibc library wrapper of this system call is expected to store the
> errno in the "reserved" field. Thus, a programmer using glibc sees:
> 
> struct readfile_t {
>  int dirfd;
>  const char *pathname;
>  void *buf;
>  size_t count;
>  int flags;
>  ssize_t retval; // set by glibc (-1 on error)
>  int errno; // set by glibc if retval is -1
> };

Why not just return the errno directly in "retval", or in "count" as
proposed?  That avoids further bloating the structure by another field.

> retval and errno in glibc's readfile_t are set to what the single-file
> glibc readfile would return (retval) and set (errno) if it was called
> with the contents of the corresponding readfile_t struct. In case of
> an error, glibc will pick one readfile_t which failed (such as: the
> 1st failed one) and use it to set glibc's errno.


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH] checkpatch/coding-style: Allow 100 column lines

2020-05-30 Thread Andreas Dilger

On May 29, 2020, at 5:12 PM, Joe Perches  wrote:
> 
> Change the maximum allowed line length to 100 from 80.

What is the benefit/motivation for changing this?  The vast majority
of files are wrapped at 80 columns, and if some files start being
wrapped at 100 columns they will either display poorly on 80-column
terminals, or a lot of dead space will show in 100-column terminals.

> Miscellanea:
> 
> o to avoid unnecessary whitespace changes in files,
>  checkpatch will no longer emit a warning about line length
>  when scanning files unless --strict is also used
> o Add a bit to coding-style about alignment to open parenthesis
> 
> Signed-off-by: Joe Perches 
> ---
> Documentation/process/coding-style.rst | 25 -
> scripts/checkpatch.pl  | 14 +-
> 2 files changed, 25 insertions(+), 14 deletions(-)
> 
> diff --git a/Documentation/process/coding-style.rst 
> b/Documentation/process/coding-style.rst
> index acb2f1b36350..55b148e9c6b8 100644
> --- a/Documentation/process/coding-style.rst
> +++ b/Documentation/process/coding-style.rst
> @@ -84,15 +84,22 @@ Get a decent editor and don't leave whitespace at the end 
> of lines.
> Coding style is all about readability and maintainability using commonly
> available tools.
> 
> -The limit on the length of lines is 80 columns and this is a strongly
> -preferred limit.
> -
> -Statements longer than 80 columns will be broken into sensible chunks, unless
> -exceeding 80 columns significantly increases readability and does not hide
> -information. Descendants are always substantially shorter than the parent and
> -are placed substantially to the right. The same applies to function headers
> -with a long argument list. However, never break user-visible strings such as
> -printk messages, because that breaks the ability to grep for them.
> +The preferred limit on the length of a single line is 80 columns.
> +
> +Statements longer than 80 columns should be broken into sensible chunks,
> +unless exceeding 80 columns significantly increases readability and does
> +not hide information.
> +
> +Statements may be up to 100 columns when appropriate.
> +
> +Descendants are always substantially shorter than the parent and are
> +are placed substantially to the right.  A very commonly used style
> +is to align descendants to a function open parenthesis.
> +
> +These same rules are applied to function headers with a long argument list.
> +
> +However, never break user-visible strings such as printk messages because
> +that breaks the ability to grep for them.
> 
> 
> 3) Placing Braces and Spaces
> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> index dd750241958b..5f00df2c3f59 100755
> --- a/scripts/checkpatch.pl
> +++ b/scripts/checkpatch.pl
> @@ -53,7 +53,7 @@ my %ignore_type = ();
> my @ignore = ();
> my $help = 0;
> my $configuration_file = ".checkpatch.conf";
> -my $max_line_length = 80;
> +my $max_line_length = 100;
> my $ignore_perl_version = 0;
> my $minimum_perl_version = 5.10.0;
> my $min_conf_desc_length = 4;
> @@ -99,9 +99,11 @@ Options:
>   --types TYPE(,TYPE2...)show only these comma separated message types
>   --ignore TYPE(,TYPE2...)   ignore various comma separated message types
>   --show-types   show the specific message type in the output
> -  --max-line-length=nset the maximum line length, if exceeded, warn
> +  --max-line-length=nset the maximum line length, (default 
> $max_line_length)
> + if exceeded, warn on patches
> + requires --strict for use with --file
>   --min-conf-desc-length=n   set the min description length, if shorter, warn
> -  --tab-size=n   set the number of spaces for tab (default 8)
> +  --tab-size=n   set the number of spaces for tab (default 
> $tabsize)
>   --root=PATHPATH to the kernel tree root
>   --no-summary   suppress the per-file summary
>   --mailback only produce a report in case of warnings/errors
> @@ -3282,8 +3284,10 @@ sub process {
> 
>   if ($msg_type ne "" &&
>   (show_type("LONG_LINE") || show_type($msg_type))) {
> - WARN($msg_type,
> -  "line over $max_line_length characters\n" 
> . $herecurr);
> + my $msg_level = \
> + $msg_level = \ if ($file);
> + &{$msg_level}($msg_type,
> +   "line length of $length exceeds 
> $max_line_length columns\n" . $herecurr);
>   }
>   }
> 
> 


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH V4 7/8] fs/ext4: Introduce DAX inode flag

2020-05-22 Thread Andreas Dilger

On May 21, 2020, at 1:13 PM, ira.we...@intel.com wrote:
> 
> From: Ira Weiny 
> 
> Add a flag to preserve FS_XFLAG_DAX in the ext4 inode.
> 
> Set the flag to be user visible and changeable.  Set the flag to be
> inherited.  Allow applications to change the flag at any time with the
> exception of if VERITY or ENCRYPT is set.
> 
> Disallow setting VERITY or ENCRYPT if DAX is set.
> 
> Finally, on regular files, flag the inode to not be cached to facilitate
> changing S_DAX on the next creation of the inode.
> 
> Signed-off-by: Ira Weiny 

Reviewed-by: Andreas Dilger 

> 
> ---
> Changes from V3:
>   Move bit to bit25 per Andreas
> 
> Change from V2:
>   Add in making verity and DAX exclusive.
>   'Squash' in making encryption and DAX exclusive.
>   Add in EXT4_INODE_DAX flag definition to be compatible with
>   ext4_[set|test]_inode_flag() bit operations
>   Use ext4_[set|test]_inode_flag() bit operations to be consistent
>   with other code.
> 
> Change from V0:
>   Add FS_DAX_FL to include/uapi/linux/fs.h
>   to be consistent
>   Move ext4_dax_dontcache() to ext4_ioctl_setflags()
>   This ensures that it is only set when the flags are going to be
>   set and not if there is an error
>   Also this sets don't cache in the FS_IOC_SETFLAGS case
> 
> Change from RFC:
>   use new d_mark_dontcache()
>   Allow caching if ALWAYS/NEVER is set
>   Rebased to latest Linus master
>   Change flag to unused 0x0100
>   update ext4_should_enable_dax()
> ---
> fs/ext4/ext4.h  | 14 ++
> fs/ext4/inode.c |  2 +-
> fs/ext4/ioctl.c | 34 +-
> fs/ext4/super.c |  3 +++
> fs/ext4/verity.c|  2 +-
> include/uapi/linux/fs.h |  1 +
> 6 files changed, 49 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 65ffb831b2b9..09b8906568d2 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -415,13 +415,16 @@ struct flex_groups {
> #define EXT4_VERITY_FL0x0010 /* Verity protected 
> inode */
> #define EXT4_EA_INODE_FL  0x0020 /* Inode used for large EA */
> /* 0x0040 was formerly EXT4_EOFBLOCKS_FL */
> +
> +#define EXT4_DAX_FL  0x0200 /* Inode is DAX */
> +
> #define EXT4_INLINE_DATA_FL   0x1000 /* Inode has inline data. */
> #define EXT4_PROJINHERIT_FL   0x2000 /* Create with parents 
> projid */
> #define EXT4_CASEFOLD_FL  0x4000 /* Casefolded file */
> #define EXT4_RESERVED_FL  0x8000 /* reserved for ext4 lib */
> 
> -#define EXT4_FL_USER_VISIBLE 0x705BDFFF /* User visible flags */
> -#define EXT4_FL_USER_MODIFIABLE  0x604BC0FF /* User modifiable 
> flags */
> +#define EXT4_FL_USER_VISIBLE 0x725BDFFF /* User visible flags */
> +#define EXT4_FL_USER_MODIFIABLE  0x624BC0FF /* User modifiable 
> flags */
> 
> /* Flags we can manipulate with through EXT4_IOC_FSSETXATTR */
> #define EXT4_FL_XFLAG_VISIBLE (EXT4_SYNC_FL | \
> @@ -429,14 +432,16 @@ struct flex_groups {
>EXT4_APPEND_FL | \
>EXT4_NODUMP_FL | \
>EXT4_NOATIME_FL | \
> -  EXT4_PROJINHERIT_FL)
> +  EXT4_PROJINHERIT_FL | \
> +  EXT4_DAX_FL)
> 
> /* Flags that should be inherited by new inodes from their parent. */
> #define EXT4_FL_INHERITED (EXT4_SECRM_FL | EXT4_UNRM_FL | EXT4_COMPR_FL |\
>  EXT4_SYNC_FL | EXT4_NODUMP_FL | EXT4_NOATIME_FL |\
>  EXT4_NOCOMPR_FL | EXT4_JOURNAL_DATA_FL |\
>  EXT4_NOTAIL_FL | EXT4_DIRSYNC_FL |\
> -EXT4_PROJINHERIT_FL | EXT4_CASEFOLD_FL)
> +EXT4_PROJINHERIT_FL | EXT4_CASEFOLD_FL |\
> +EXT4_DAX_FL)
> 
> /* Flags that are appropriate for regular files (all but dir-specific ones). 
> */
> #define EXT4_REG_FLMASK (~(EXT4_DIRSYNC_FL | EXT4_TOPDIR_FL | 
> EXT4_CASEFOLD_FL |\
> @@ -488,6 +493,7 @@ enum {
>   EXT4_INODE_VERITY   = 20,   /* Verity protected inode */
>   EXT4_INODE_EA_INODE = 21,   /* Inode used for large EA */
> /* 22 was formerly EXT4_INODE_EOFBLOCKS */
> + EXT4_INODE_DAX  = 25,   /* Inode is DAX */
>   EXT4_INODE_INLINE_DATA  = 28,   /* Data in inode. */
>   EXT4_INODE_PROJINHERIT  = 29,   /* Create with parents proji

Re: [PATCH V3 7/8] fs/ext4: Introduce DAX inode flag

2020-05-20 Thread Andreas Dilger

On May 20, 2020, at 2:55 PM, Darrick J. Wong  wrote:
> On Wed, May 20, 2020 at 01:02:42PM -0700, Ira Weiny wrote:
>> On Wed, May 20, 2020 at 01:26:44PM -0600, Andreas Dilger wrote:
>>> On May 19, 2020, at 11:57 PM, ira.we...@intel.com wrote:
>>>> 
>>>> From: Ira Weiny 
>>>> 
>>>> Add a flag to preserve FS_XFLAG_DAX in the ext4 inode.
>>>> 
>>>> Set the flag to be user visible and changeable.  Set the flag to be
>>>> inherited.  Allow applications to change the flag at any time with the
>>>> exception of if VERITY or ENCRYPT is set.
>>>> 
>>>> Disallow setting VERITY or ENCRYPT if DAX is set.
>>>> 
>>>> Finally, on regular files, flag the inode to not be cached to facilitate
>>>> changing S_DAX on the next creation of the inode.
>>>> 
>>>> Signed-off-by: Ira Weiny 
>>>> 
>>>> ---
>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>> index 6235440e4c39..467c30a789b6 100644
>>>> --- a/fs/ext4/ext4.h
>>>> +++ b/fs/ext4/ext4.h
>>>> @@ -415,13 +415,16 @@ struct flex_groups {
>>>> #define EXT4_VERITY_FL 0x0010 /* Verity protected 
>>>> inode */
>>>> #define EXT4_EA_INODE_FL   0x0020 /* Inode used for large EA */
>>>> /* 0x0040 was formerly EXT4_EOFBLOCKS_FL */
>>>> +
>>>> +#define EXT4_DAX_FL   0x0100 /* Inode is DAX */
>>>> +
>>>> #define EXT4_INLINE_DATA_FL0x1000 /* Inode has inline 
>>>> data. */
>>>> #define EXT4_PROJINHERIT_FL0x2000 /* Create with 
>>>> parents projid */
>>>> #define EXT4_CASEFOLD_FL   0x4000 /* Casefolded file */
>>>> #define EXT4_RESERVED_FL   0x8000 /* reserved for ext4 lib */
>>> 
>>> Hi Ira,
>>> This flag value conflicts with the reserved flag in e2fsprogs for snapshots:
>>> 
>>> #define EXT4_SNAPFILE_FL0x0100  /* Inode is a snapshot 
>>> */
>> 
>> Sure NP but is that new?  I'm building off of 5.7-rc4.
>> 
>> Just curious if I completely missed something.
> 
> Yeah, you missed that ... for some reason the kernel ext4 driver is
> missing flags that are in e2fsprogs.  (huh??)

It's no different than ext2 not having the full set of bits defined or
in use.

> I would say you could probably just take over the flag because the 2010s
> called and they don't want next3 back.  I guess that leaves 0x0200
> as the sole unclaimed bit, but this seriously needs some cleaning.

Darrick,
we are in the process of updating the snapshot code for ext4, so need to
keep the 0x0100 bit for snapshots.  Since 0x0200 has never been
used for anything, there is no reason not to use it instead.

If we need to reclaim flags, it would be better to look at "COMPR" flags:

/* Reserved for compression usage... */
#define FS_COMPR_FL   0x0004 /* Compress file */
#define FS_DIRTY_FL   0x0100
#define FS_COMPRBLK_FL0x0200 /* One or more compressed clusters */
#define FS_NOCOMP_FL  0x0400 /* Don't compress */

since I don't think they have ever been used.  I don't think we need 4x
on-disk state flags for that, especially not as part of the API.  It is
enough to have FS_COMPR_FL for the API, and then handle internal state
separately (e.g. compress into a separate on-disk extent and then swap
extents atomically instead of storing transient state on disk).

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH V3 7/8] fs/ext4: Introduce DAX inode flag

2020-05-20 Thread Andreas Dilger

On May 19, 2020, at 11:57 PM, ira.we...@intel.com wrote:
> 
> From: Ira Weiny 
> 
> Add a flag to preserve FS_XFLAG_DAX in the ext4 inode.
> 
> Set the flag to be user visible and changeable.  Set the flag to be
> inherited.  Allow applications to change the flag at any time with the
> exception of if VERITY or ENCRYPT is set.
> 
> Disallow setting VERITY or ENCRYPT if DAX is set.
> 
> Finally, on regular files, flag the inode to not be cached to facilitate
> changing S_DAX on the next creation of the inode.
> 
> Signed-off-by: Ira Weiny 
> 
> ---
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 6235440e4c39..467c30a789b6 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -415,13 +415,16 @@ struct flex_groups {
> #define EXT4_VERITY_FL0x0010 /* Verity protected 
> inode */
> #define EXT4_EA_INODE_FL  0x0020 /* Inode used for large EA */
> /* 0x0040 was formerly EXT4_EOFBLOCKS_FL */
> +
> +#define EXT4_DAX_FL  0x0100 /* Inode is DAX */
> +
> #define EXT4_INLINE_DATA_FL   0x1000 /* Inode has inline data. */
> #define EXT4_PROJINHERIT_FL   0x2000 /* Create with parents 
> projid */
> #define EXT4_CASEFOLD_FL  0x4000 /* Casefolded file */
> #define EXT4_RESERVED_FL  0x8000 /* reserved for ext4 lib */

Hi Ira,
This flag value conflicts with the reserved flag in e2fsprogs for snapshots:

#define EXT4_SNAPFILE_FL0x0100  /* Inode is a snapshot */

Please change EXT4_DAX_FL and FS_DAX_FL to use 0x0200, which is not used
for anything in either case.

Cheers, Andreas


> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 379a612f8f1d..7c5f6eb51e2d 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -262,6 +262,7 @@ struct fsxattr {
> #define FS_EA_INODE_FL0x0020 /* Inode used for 
> large EA */
> #define FS_EOFBLOCKS_FL   0x0040 /* Reserved for ext4 
> */
> #define FS_NOCOW_FL   0x0080 /* Do not cow file */
> +#define FS_DAX_FL0x0100 /* Inode is DAX */
> #define FS_INLINE_DATA_FL 0x1000 /* Reserved for ext4 */
> #define FS_PROJINHERIT_FL 0x2000 /* Create with parents 
> projid */
> #define FS_CASEFOLD_FL0x4000 /* Folder is case 
> insensitive */
> --
> 2.25.1
> 


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH] ext4: Reduce ext4 timestamp warnings

2019-09-04 Thread Andreas Dilger

On Sep 4, 2019, at 09:02, Deepa Dinamani  wrote:
> 
> When ext4 file systems were created intentionally with 128 byte inodes,
> the rate-limited warning of eventual possible timestamp overflow are
> still emitted rather frequently.  Remove the warning for now.
> 
> Discussion for whether any warning is needed,
> and where it should be emitted, can be found at
> https://lore.kernel.org/lkml/1567523922.5576.57.ca...@lca.pw/.
> I can post a separate follow-up patch after the conclusion.
> 
> Reported-by: Qian Cai 
> Signed-off-by: Deepa Dinamani 

I'd be in favor of a severely rare-limited warning in the actual case
that Y2038 timestamps cannot be stored, but the current message is
too verbose for now and I agree it is better to remove it while discussions
on the best solution are underway. 

Reviewed-by: Andreas Dilger 

> ---
> fs/ext4/ext4.h | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 9e3ae3be3de9..24b14bd3feab 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -833,10 +833,8 @@ do {\
>(raw_inode)->xtime ## _extra =\
>ext4_encode_extra_time(&(inode)->xtime);\
>}\
> -else{\
> +else\
>(raw_inode)->xtime = cpu_to_le32(clamp_t(int32_t, 
> (inode)->xtime.tv_sec, S32_MIN, S32_MAX));\
> -ext4_warning_inode(inode, "inode does not support timestamps beyond 
> 2038"); \
> -} \
> } while (0)
> 
> #define EXT4_EINODE_SET_XTIME(xtime, einode, raw_inode)   \
> -- 
> 2.17.1
>

Re: "beyond 2038" warnings from loopback mount is noisy

2019-09-03 Thread Andreas Dilger

On Sep 3, 2019, at 12:15 PM, Qian Cai  wrote:
> 
> On Tue, 2019-09-03 at 09:36 -0700, Deepa Dinamani wrote:
>> We might also want to consider updating the file system the LTP is
>> being run on here.
> 
> It simply format (mkfs.ext4) a loop back device on ext4 with the kernel.
> 
> CONFIG_EXT4_FS=m
> # CONFIG_EXT4_USE_FOR_EXT2 is not set
> # CONFIG_EXT4_FS_POSIX_ACL is not set
> # CONFIG_EXT4_FS_SECURITY is not set
> # CONFIG_EXT4_DEBUG is not set
> 
> using e2fsprogs-1.44.6. Do you mean people now need to update the kernel to
> enable additional config to avoid the spam of warnings now?

Strange.  The defaults for mkfs.ext4 _should_ default to use options that
allow enough space for the extra timestamps.

Can you please provide "dumpe2fs -h" output for your filesystem, and the
formatting options that you used when creating this filesystem.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH 09/20] ext4: Initialize timestamps limits

2019-08-07 Thread Andreas Dilger

On Aug 3, 2019, at 2:24 PM, Arnd Bergmann  wrote:
> 
> On Sat, Aug 3, 2019 at 6:03 PM Theodore Y. Ts'o  wrote:
>> 
>> On Sat, Aug 03, 2019 at 11:30:22AM +0200, Arnd Bergmann wrote:
>>> 
>>> I see in the ext4 code that we always try to expand i_extra_size
>>> to s_want_extra_isize in ext4_mark_inode_dirty(), and that
>>> s_want_extra_isize is always at least  s_min_extra_isize, so
>>> we constantly try to expand the inode to fit.
>> 
>> Yes, we *try*.  But we may not succeed.  There may actually be a
>> problem here if the cause is due to there simply is no space in the
>> external xattr block, so we might try and try every time we try to
>> modify that inode, and it would be a performance mess.  If it's due to
>> there being no room in the current transaction, then it's highly
>> likely it will succeed the next time.
>> 
>>> Did older versions of ext4 or ext3 ignore s_min_extra_isize
>>> when creating inodes despite
>>> EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE,
>>> or is there another possibility I'm missing?
>> 
>> s_min_extra_isize could get changed in order to make room for some new
>> file system feature --- such as extended timestamps.
> 
> Ok, that explains it. I assumed s_min_extra_isize was meant to
> not be modifiable, and did not find a way to change it using the
> kernel or tune2fs, but now I can see that debugfs can set it.
> 
>> If you want to pretend that file systems never get upgraded, then life
>> is much simpler.  The general approach is that for less-sophisticated
>> customers (e.g., most people running enterprise distros) file system
>> upgrades are not a thing.  But for sophisticated users, we do try to
>> make thing work for people who are aware of the risks / caveats /
>> rough edges.  Google won't have been able to upgrade thousands and
>> thousands of servers in data centers all over the world if we limited
>> ourselves to Red Hat's support restrictions.  Backup / reformat /
>> restore really isn't a practical rollout strategy for many exabytes of
>> file systems.
>> 
>> It sounds like your safety checks / warnings are mostly targeted at
>> low-information customers, no?
> 
> Yes, that seems like a reasonable compromise: just warn based
> on s_min_extra_isize, and assume that anyone who used debugfs
> to set s_min_extra_isize to a higher value from an ext3 file system
> during the migration to ext4 was aware of the risks already.
> 
> That leaves the question of what we should set the s_time_gran
> and s_time_max to on a superblock with s_min_extra_isize<16
> and s_want_extra_isize>=16.
> 
> If we base it on s_min_extra_isize, we never try to set a timestamp
> later than 2038 and so will never fail, but anyone with a grandfathered
> s_min_extra_isize from ext3 won't be able to set extended
> timestamps on any files any more. Based on s_want_extra_isize
> we would keep the current behavior, but could add a custom
> warning in the ext4 code about the small s_min_extra_isize
> indicating a theoretical problem.

I think it makes the most sense to always try to set timestamps on
inodes that have enough space for them.  The chance of running into
a filesystem with 256-byte inode size but *no* space in the inode to
store an extended timestamp, but is *also* being modified by a new
kernel after 2038 is vanishingly small.  This would require formatting
the filesystem with non-default mke2fs for ext3, using the filesystem
and storing enough xattrs on inodes that there isn't space for 12 bytes
of extra isize, and using it for 30+ years without upgrading to use
ext4 (which will also try to expand the inode to store the nsec
timestamps) and then modifying the inode after 2038.

Rather than printing a warning at mount time (which may be confusing
to users for a problem they may never see), it makes sense to only
print such a warning in the vanishingly small case that someone actually
tries to modify the inode timestamp but it doesn't fit, rather than on
the theoretical case that may never happen.


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH] ext4: Fix deadlock on page reclaim

2019-07-29 Thread Andreas Dilger

On Jul 26, 2019, at 8:59 PM, Damien Le Moal  wrote:
> 
> On 2019/07/27 7:55, Theodore Y. Ts'o wrote:
>> On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote:
 
 This looks like something that could hit every file systems, so
 shouldn't we fix this in common code?  We could also look into
 just using memalloc_nofs_save for the page cache allocation path
 instead of the per-mapping gfp_mask.
>>> 
>>> I think it has to be the entire IO path - any allocation from the
>>> underlying filesystem could recurse into the top level filesystem
>>> and then deadlock if the memory reclaim submits IO or blocks on
>>> IO completion from the upper filesystem. That's a bloody big hammer
>>> for something that is only necessary when there are stacked
>>> filesystems like this
>> 
>> Yeah that's why using memalloc_nofs_save() probably makes the most
>> sense, and dm_zoned should use that before it calls into ext4.
> 
> Unfortunately, with this particular setup, that will not solve the problem.
> dm-zoned submit BIOs to its backend drive in response to XFS activity. The
> requests for these BIOs are passed along to the kernel tcmu HBA and end up in
> that HBA command ring. The commands themselves are read from the ring and
> executed by the tcmu-runner user process which executes them doing
> pread()/pwrite() to the ext4 file. The tcmu-runner process being a different
> context than the dm-zoned worker thread issuing the BIO,
> memalloc_nofs_save/restore() calls in dm-zoned will have no effect.

One way to handle this is to pass on PF_MEMALLOC/memalloc_nofs_save state
in the BIO so that the worker thread will also get the correct behaviour
when it is processing this IO submission.

> We tried a simpler setup using loopback mount (XFS used directly in an ext4
> file) and running the same workload. We failed to recreate a similar deadlock 
> in
> this case, but I am strongly suspecting that it can happen too. It is simply
> much harder to hit because the IO path from XFS to ext4 is all in-kernel and
> asynchronous, whereas tcmu-runner ZBC handler is a synchronous QD=1 path for 
> IOs
> which makes it relatively easy to get inter-dependent writes or read+write
> queued back-to-back and create the deadlock.
> 
> So back to Dave's point, we may be needing the big-hammer solution in the case
> of stacked file systems, while a non-stack setups do not necessarily need it
> (that is for the FS to decide). But I do not see how to implement this big
> hammer conditionally. How can a file system tell if it is at the top of the
> stack (big hammer not needed) or lower than the top level (big hammer needed) 
> ?
> 
> One simple hack would be an fcntl() or mount option to tell the FS to use
> GFP_NOFS unconditionally, but avoiding the bug would mean making sure that the
> applications or system setup is correct. So not so safe.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH] mbcache: Speed up cache entry creation

2019-07-25 Thread Andreas Dilger

On Jul 23, 2019, at 10:01 PM, Sultan Alsawaf  wrote:
> 
> On Tue, Jul 23, 2019 at 10:56:05AM -0600, Andreas Dilger wrote:
>> Do you have any kind of performance metrics that show this is an actual
>> improvement in performance?  This would be either macro-level benchmarks
>> (e.g. fio, but this seems unlikely to show any benefit), or micro-level
>> measurements (e.g. flame graph) that show a net reduction in CPU cycles,
>> lock contention, etc. in this part of the code.
> 
> Hi Andreas,
> 
> Here are some basic micro-benchmark results:
> 
> Before:
> [3.162896] mb_cache_entry_create: AVG cycles: 75
> [3.054701] mb_cache_entry_create: AVG cycles: 78
> [3.152321] mb_cache_entry_create: AVG cycles: 77
> 
> After:
> [3.043380] mb_cache_entry_create: AVG cycles: 68
> [3.194321] mb_cache_entry_create: AVG cycles: 71
> [3.038100] mb_cache_entry_create: AVG cycles: 69

This information should be included in the patch description, since that
allows making a decision on whether the patch is worthwhile to land or not.

> The performance difference is probably more drastic when free memory is low,
> since an unnecessary call to kmem_cache_alloc() can result in a long wait for
> pages to be freed.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH] ext4: Fix deadlock on page reclaim

2019-07-25 Thread Andreas Dilger

On Jul 25, 2019, at 5:54 AM, Christoph Hellwig  wrote:
> 
> On Thu, Jul 25, 2019 at 06:33:58PM +0900, Damien Le Moal wrote:
>> +gfp_t gfp_mask;
>> +
>>  switch (ext4_inode_journal_mode(inode)) {
>>  case EXT4_INODE_ORDERED_DATA_MODE:
>>  case EXT4_INODE_WRITEBACK_DATA_MODE:
>> @@ -4019,6 +4019,14 @@ void ext4_set_aops(struct inode *inode)
>>  inode->i_mapping->a_ops = _da_aops;
>>  else
>>  inode->i_mapping->a_ops = _aops;
>> +
>> +/*
>> + * Ensure all page cache allocations are done from GFP_NOFS context to
>> + * prevent direct reclaim recursion back into the filesystem and blowing
>> + * stacks or deadlocking.
>> + */
>> +gfp_mask = mapping_gfp_mask(inode->i_mapping);
>> +mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS)));
> 
> This looks like something that could hit every file systems, so
> shouldn't we fix this in common code?

It also has the drawback that it prevents __GFP_FS reclaim when ext4
is *not* at the bottom of the IO stack.

> We could also look into just using memalloc_nofs_save for the page
> cache allocation path instead of the per-mapping gfp_mask.

That makes more sense.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH] mbcache: Speed up cache entry creation

2019-07-23 Thread Andreas Dilger

On Jul 22, 2019, at 11:35 PM, Sultan Alsawaf  wrote:
> 
> From: Sultan Alsawaf 
> 
> In order to prevent redundant entry creation by racing against itself,
> mb_cache_entry_create scans through a hash-list of all current entries
> in order to see if another allocation for the requested new entry has
> been made. Furthermore, it allocates memory for a new entry before
> scanning through this hash-list, which results in that allocated memory
> being discarded when the requested new entry is already present.
> 
> Speed up cache entry creation by keeping a small linked list of
> requested new entries in progress, and scanning through that first
> instead of the large hash-list. And don't allocate memory for a new
> entry until it's known that the allocated memory will be used.

Do you have any kind of performance metrics that show this is an actual
improvement in performance?  This would be either macro-level benchmarks
(e.g. fio, but this seems unlikely to show any benefit), or micro-level
measurements (e.g. flame graph) that show a net reduction in CPU cycles,
lock contention, etc. in this part of the code.

Cheers, Andreas

> Signed-off-by: Sultan Alsawaf 
> ---
> fs/mbcache.c | 82 
> 1 file changed, 57 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/mbcache.c b/fs/mbcache.c
> index 97c54d3a2227..289f3664061e 100644
> --- a/fs/mbcache.c
> +++ b/fs/mbcache.c
> @@ -25,9 +25,14 @@
>  * size hash table is used for fast key lookups.
>  */
> 
> +struct mb_bucket {
> + struct hlist_bl_head hash;
> + struct list_head req_list;
> +};
> +
> struct mb_cache {
>   /* Hash table of entries */
> - struct hlist_bl_head*c_hash;
> + struct mb_bucket*c_bucket;
>   /* log2 of hash table size */
>   int c_bucket_bits;
>   /* Maximum entries in cache to avoid degrading hash too much */
> @@ -42,15 +47,21 @@ struct mb_cache {
>   struct work_struct  c_shrink_work;
> };
> 
> +struct mb_cache_req {
> + struct list_head lnode;
> + u32 key;
> + u64 value;
> +};
> +
> static struct kmem_cache *mb_entry_cache;
> 
> static unsigned long mb_cache_shrink(struct mb_cache *cache,
>unsigned long nr_to_scan);
> 
> -static inline struct hlist_bl_head *mb_cache_entry_head(struct mb_cache 
> *cache,
> - u32 key)
> +static inline struct mb_bucket *mb_cache_entry_bucket(struct mb_cache *cache,
> +   u32 key)
> {
> - return >c_hash[hash_32(key, cache->c_bucket_bits)];
> + return >c_bucket[hash_32(key, cache->c_bucket_bits)];
> }
> 
> /*
> @@ -77,6 +88,8 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t 
> mask, u32 key,
>   struct mb_cache_entry *entry, *dup;
>   struct hlist_bl_node *dup_node;
>   struct hlist_bl_head *head;
> + struct mb_cache_req *tmp_req, req;
> + struct mb_bucket *bucket;
> 
>   /* Schedule background reclaim if there are too many entries */
>   if (cache->c_entry_count >= cache->c_max_entries)
> @@ -85,9 +98,33 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t 
> mask, u32 key,
>   if (cache->c_entry_count >= 2*cache->c_max_entries)
>   mb_cache_shrink(cache, SYNC_SHRINK_BATCH);
> 
> + bucket = mb_cache_entry_bucket(cache, key);
> + head = >hash;
> + hlist_bl_lock(head);
> + list_for_each_entry(tmp_req, >req_list, lnode) {
> + if (tmp_req->key == key && tmp_req->value == value) {
> + hlist_bl_unlock(head);
> + return -EBUSY;
> + }
> + }
> + hlist_bl_for_each_entry(dup, dup_node, head, e_hash_list) {
> + if (dup->e_key == key && dup->e_value == value) {
> + hlist_bl_unlock(head);
> + return -EBUSY;
> + }
> + }
> + req.key = key;
> + req.value = value;
> + list_add(, >req_list);
> + hlist_bl_unlock(head);
> +
>   entry = kmem_cache_alloc(mb_entry_cache, mask);
> - if (!entry)
> + if (!entry) {
> + hlist_bl_lock(head);
> + list_del();
> + hlist_bl_unlock(head);
>   return -ENOMEM;
> + }
> 
>   INIT_LIST_HEAD(>e_list);
>   /* One ref for hash, one ref returned */
> @@ -96,15 +133,9 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t 
> mask, u32 key,
>   entry->e_value = value;
>   entry->e_reusable = reusable;
>   entry->e_referenced = 0;
> - head = mb_cache_entry_head(cache, key);
> +
>   hlist_bl_lock(head);
> - hlist_bl_for_each_entry(dup, dup_node, head, e_hash_list) {
> - if (dup->e_key == key && dup->e_value == value) {
> - hlist_bl_unlock(head);
> - kmem_cache_free(mb_entry_cache, entry);
> - return -EBUSY;
> - }
> - }
> +

Re: [PATCH v4 0/7] vfs: make immutable files actually immutable

2019-06-25 Thread Andreas Dilger

On Jun 25, 2019, at 12:03 PM, Darrick J. Wong  wrote:
> 
> On Tue, Jun 25, 2019 at 03:36:31AM -0700, Christoph Hellwig wrote:
>> On Fri, Jun 21, 2019 at 04:56:50PM -0700, Darrick J. Wong wrote:
>>> Hi all,
>>> 
>>> The chattr(1) manpage has this to say about the immutable bit that
>>> system administrators can set on files:
>>> 
>>> "A file with the 'i' attribute cannot be modified: it cannot be deleted
>>> or renamed, no link can be created to this file, most of the file's
>>> metadata can not be modified, and the file can not be opened in write
>>> mode."
>>> 
>>> Given the clause about how the file 'cannot be modified', it is
>>> surprising that programs holding writable file descriptors can continue
>>> to write to and truncate files after the immutable flag has been set,
>>> but they cannot call other things such as utimes, fallocate, unlink,
>>> link, setxattr, or reflink.
>> 
>> I still think living code beats documentation.  And as far as I can
>> tell the immutable bit never behaved as documented or implemented
>> in this series on Linux, and it originated on Linux.
> 
> The behavior has never been consistent -- since the beginning you can
> keep write()ing to a fd after the file becomes immutable, but you can't
> ftruncate() it.  I would really like to make the behavior consistent.
> Since the authors of nearly every new system call and ioctl since the
> late 1990s have interpreted S_IMMUTABLE to mean "immutable takes effect
> everywhere immediately" I resolved the inconsistency in favor of that
> interpretation.
> 
> I asked Ted what he thought that that userspace having the ability to
> continue writing to an immutable file, and he thought it was an
> implementation bug that had been there for 25 years.  Even he thought
> that immutable should take effect immediately everywhere.
> 
>> If you want  hard cut off style immutable flag it should really be a
>> new API, but I don't really see the point.  It isn't like the usual
>> workload is to set the flag on a file actively in use.
> 
> FWIW Ted also thought that since it's rare for admins to set +i on a
> file actively in use we could just change it without forcing everyone
> onto a new api.

On the flip side, it is possible to continue to write to an open fd
after removing the write permission, and this is a problem we've hit
in the real world with NFS export, so real applications do this.

It may be the same case with immutable files, where an application sets
the immutable flag immediately after creation, but continues to write
until it closes the file, so that the file can't be modified by other
processes, and there isn't a risk that the file is missing the immutable
flag if the writing process dies before setting it at the end.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH] ext4: remove unnecessary gotos in ext4_xattr_set_entry

2019-05-31 Thread Andreas Dilger

On May 31, 2019, at 6:10 AM, Pavel Tikhomirov  wrote:
> 
> In the "out" label we only iput old/new_ea_inode-s, in all these places
> these variables are always NULL so there is no point in goto to "out".
> 
> Signed-off-by: Pavel Tikhomirov 

I'm not a fan of changes like this, since it adds potential complexity/bugs
if the error handling path is changed in the future.  That is one of the major
benefits of the "goto out_*" model of error handling is that you only need to
add one new label to the end of the function when some new state is added that
needs to be cleaned up, compared to having to check each individual error to
see if something needs to be cleaned up.

Cheers, Andreas

> ---
> fs/ext4/xattr.c | 9 +++--
> 1 file changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index 491f9ee4040e..ac2ddd4446b3 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -1601,8 +1601,7 @@ static int ext4_xattr_set_entry(struct ext4_xattr_info 
> *i,
>   next = EXT4_XATTR_NEXT(last);
>   if ((void *)next >= s->end) {
>   EXT4_ERROR_INODE(inode, "corrupted xattr entries");
> - ret = -EFSCORRUPTED;
> - goto out;
> + return -EFSCORRUPTED;
>   }
>   if (!last->e_value_inum && last->e_value_size) {
>   size_t offs = le16_to_cpu(last->e_value_offs);
> @@ -1620,8 +1619,7 @@ static int ext4_xattr_set_entry(struct ext4_xattr_info 
> *i,
>   free += EXT4_XATTR_LEN(name_len) + old_size;
> 
>   if (free < EXT4_XATTR_LEN(name_len) + new_size) {
> - ret = -ENOSPC;
> - goto out;
> + return -ENOSPC;
>   }
> 
>   /*
> @@ -1634,8 +1632,7 @@ static int ext4_xattr_set_entry(struct ext4_xattr_info 
> *i,
>   new_size && is_block &&
>   (min_offs + old_size - new_size) <
>   EXT4_XATTR_BLOCK_RESERVE(inode)) {
> - ret = -ENOSPC;
> - goto out;
> + return -ENOSPC;
>   }
>   }
> 
> --
> 2.20.1
> 


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH v2] ext4: fix use-after-free in dx_release()

2019-05-08 Thread Andreas Dilger

On May 8, 2019, at 2:34 AM, Sahitya Tummala  wrote:
> 
> The buffer_head (frames[0].bh) and it's corresping page can be
> potentially free'd once brelse() is done inside the for loop
> but before the for loop exits in dx_release(). It can be free'd
> in another context, when the page cache is flushed via
> drop_caches_sysctl_handler(). This results into below data abort
> when accessing info->indirect_levels in dx_release().
> 
> Unable to handle kernel paging request at virtual address ffc17ac3e01e
> Call trace:
> dx_release+0x70/0x90
> ext4_htree_fill_tree+0x2d4/0x300
> ext4_readdir+0x244/0x6f8
> iterate_dir+0xbc/0x160
> SyS_getdents64+0x94/0x174
> 
> Signed-off-by: Sahitya Tummala 

Reviewed-by: Andreas Dilger 

> ---
> v2:
> add a comment in dx_release()
> 
> fs/ext4/namei.c | 5 -
> 1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 980166a..5d9ffa8 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -871,12 +871,15 @@ static void dx_release(struct dx_frame *frames)
> {
>   struct dx_root_info *info;
>   int i;
> + unsigned int indirect_levels;
> 
>   if (frames[0].bh == NULL)
>   return;
> 
>   info = &((struct dx_root *)frames[0].bh->b_data)->info;
> - for (i = 0; i <= info->indirect_levels; i++) {
> + /* save local copy, "info" may be freed after brelse() */
> + indirect_levels = info->indirect_levels;
> + for (i = 0; i <= indirect_levels; i++) {
>   if (frames[i].bh == NULL)
>   break;
>   brelse(frames[i].bh);
> --
> Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
> Foundation Collaborative Project.
> 


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH] ext4: fix use-after-free in dx_release()

2019-05-08 Thread Andreas Dilger

On May 8, 2019, at 12:13 AM, Sahitya Tummala  wrote:
> 
> The buffer_head (frames[0].bh) and it's corresping page can be
> potentially free'd once brelse() is done inside the for loop
> but before the for loop exits in dx_release(). It can be free'd
> in another context, when the page cache is flushed via
> drop_caches_sysctl_handler(). This results into below data abort
> when accessing info->indirect_levels in dx_release().
> 
> Unable to handle kernel paging request at virtual address ffc17ac3e01e
> Call trace:
> dx_release+0x70/0x90
> ext4_htree_fill_tree+0x2d4/0x300
> ext4_readdir+0x244/0x6f8
> iterate_dir+0xbc/0x160
> SyS_getdents64+0x94/0x174
> 
> Signed-off-by: Sahitya Tummala 

The patch looks reasonable, but there is a danger that it may be
"optimized" back to the pre-patch form again.  It probably makes
sense to include a comment like:

/* save local copy, "info" may be freed after brelse() */

Looks fine otherwise.

Reviewed-by: Andreas Dilger 

> ---
> fs/ext4/namei.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 4181c9c..7e6c298 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -871,12 +871,14 @@ static void dx_release(struct dx_frame *frames)
> {
>   struct dx_root_info *info;
>   int i;
> + unsigned int indirect_levels;
> 
>   if (frames[0].bh == NULL)
>   return;
> 
>   info = &((struct dx_root *)frames[0].bh->b_data)->info;
> - for (i = 0; i <= info->indirect_levels; i++) {
> + indirect_levels = info->indirect_levels;
> + for (i = 0; i <= indirect_levels; i++) {
>   if (frames[i].bh == NULL)
>   break;
>   brelse(frames[i].bh);
> --
> Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
> Foundation Collaborative Project.
> 


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [RFC][PATCHSET] sorting out RCU-delayed stuff in ->destroy_inode()

2019-04-29 Thread Andreas Dilger


> On Apr 29, 2019, at 10:26 PM, Al Viro  wrote:
> 
> On Mon, Apr 29, 2019 at 10:18:04PM -0600, Andreas Dilger wrote:
>>> 
>>> void*i_private; /* fs or device private pointer */
>>> +   void (*free_inode)(struct inode *);
>> 
>> It seems like a waste to increase the size of every struct inode just to 
>> access
>> a static pointer.  Is this the only place that ->free_inode() is called?  Why
>> not move the ->free_inode() pointer into inode->i_fop->free_inode() so that 
>> it
>> is still directly accessible at this point.
> 
> i_op, surely?

Yes, i_op is what I was thinking.

> In any case, increasing sizeof(struct inode) is not a problem -

> if anything, I'd turn ->i_fop into an anon union with that.  As in,
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index fb45590d284e..627e1766503a 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -211,8 +211,8 @@ EXPORT_SYMBOL(free_inode_nonrcu);
> static void i_callback(struct rcu_head *head)
> {
>   struct inode *inode = container_of(head, struct inode, i_rcu);
> - if (inode->i_sb->s_op->free_inode)
> - inode->i_sb->s_op->free_inode(inode);
> + if (inode->free_inode)
> + inode->free_inode(inode);
>   else
>   free_inode_nonrcu(inode);
> }
> @@ -236,6 +236,7 @@ static struct inode *alloc_inode(struct super_block *sb)
>   if (!ops->free_inode)
>   return NULL;
>   }
> + inode->free_inode = ops->free_inode;
>   i_callback(>i_rcu);
>   return NULL;
>   }

> @@ -276,6 +277,7 @@ static void destroy_inode(struct inode *inode)
>   if (!ops->free_inode)
>   return;
>   }
> + inode->free_inode = ops->free_inode;
>   call_rcu(>i_rcu, i_callback);
> }

This seems like kind of a hack.  I guess your goal is to have ->free_inode
accessible regardless of whether the filesystem has installed its own ->i_op
methods or not, and i_fop is no longer used by this point.

That said, this seems better than increasing the size of struct inode.

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 2e9b9f87caca..92732286b748 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -694,7 +694,10 @@ struct inode {
> #ifdef CONFIG_IMA
>   atomic_ti_readcount; /* struct files open RO */
> #endif
> - const struct file_operations*i_fop; /* former 
> ->i_op->default_file_ops */
> + union {
> + const struct file_operations*i_fop; /* former 
> ->i_op->default_file_ops */
> + void (*free_inode)(struct inode *);
> + };


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [RFC][PATCHSET] sorting out RCU-delayed stuff in ->destroy_inode()

2019-04-29 Thread Andreas Dilger

On Apr 29, 2019, at 9:09 PM, Al Viro  wrote:
> 
> On Tue, Apr 16, 2019 at 11:01:16AM -0700, Linus Torvalds wrote:
>> 
>> I only skimmed through the actual filesystem (and one networking)
>> patches, but they looked like trivial conversions to a better
>> interface.
> 
> ... except that this callback can (and always could) get executed after
> freeing struct super_block.  So we can't just dereference ->i_sb->s_op
> and expect to survive; the table ->s_op pointed to will still be there,
> but ->i_sb might very well have been freed, with all its contents overwritten.
> We need to copy the callback into struct inode itself, unfortunately.
> The following incremental fixes it; I'm going to fold it into the first
> commit in there.
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index fb45590d284e..855dad43b11d 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -164,6 +164,7 @@ int inode_init_always(struct super_block *sb, struct 
> inode *inode)
>   inode->i_wb_frn_avg_time = 0;
>   inode->i_wb_frn_history = 0;
> #endif
> + inode->free_inode = sb->s_op->free_inode;
> 
>   if (security_inode_alloc(inode))
>   goto out;
> @@ -211,8 +212,8 @@ EXPORT_SYMBOL(free_inode_nonrcu);
> static void i_callback(struct rcu_head *head)
> {
>   struct inode *inode = container_of(head, struct inode, i_rcu);
> - if (inode->i_sb->s_op->free_inode)
> - inode->i_sb->s_op->free_inode(inode);
> + if (inode->free_inode)
> + inode->free_inode(inode);
>   else
>   free_inode_nonrcu(inode);
> }
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 2e9b9f87caca..5ed6b39e588e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -718,6 +718,7 @@ struct inode {
> #endif
> 
>   void*i_private; /* fs or device private pointer */
> + void (*free_inode)(struct inode *);

It seems like a waste to increase the size of every struct inode just to access
a static pointer.  Is this the only place that ->free_inode() is called?  Why
not move the ->free_inode() pointer into inode->i_fop->free_inode() so that it
is still directly accessible at this point.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH v2] ext4: cond_resched in work-heavy group loops

2019-04-24 Thread Andreas Dilger

On Apr 23, 2019, at 11:13 PM, Khazhismel Kumykov  wrote:
> 
> These functions may take a long time looping over many groups, which
> may cause issues for non-preempt kernels.
> ext4_mb_init_backend()
> ext4_setup_system_zone()
> ext4_mb_release()
> 
> Signed-off-by: Khazhismel Kumykov 

Reviewed-by: Andreas Dilger 

> ---
> v2:
> - a few other places that in testing showed to be slow
> 
> fs/ext4/block_validity.c | 1 +
> fs/ext4/mballoc.c| 2 ++
> 2 files changed, 3 insertions(+)
> 
> diff --git a/fs/ext4/block_validity.c b/fs/ext4/block_validity.c
> index 913061c0de1b..16134469ea3c 100644
> --- a/fs/ext4/block_validity.c
> +++ b/fs/ext4/block_validity.c
> @@ -155,6 +155,7 @@ int ext4_setup_system_zone(struct super_block *sb)
>   return 0;
> 
>   for (i=0; i < ngroups; i++) {
> + cond_resched();
>   if (ext4_bg_has_super(sb, i) &&
>   ((i < 5) || ((i % flex_size) == 0)))
>   add_system_zone(sbi, ext4_group_first_block_no(sb, i),
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 8ef5f12bbee2..99ba720dbb7a 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2490,6 +2490,7 @@ static int ext4_mb_init_backend(struct super_block *sb)
>   sbi->s_buddy_cache->i_ino = EXT4_BAD_INO;
>   EXT4_I(sbi->s_buddy_cache)->i_disksize = 0;
>   for (i = 0; i < ngroups; i++) {
> + cond_resched();
>   desc = ext4_get_group_desc(sb, i, NULL);
>   if (desc == NULL) {
>   ext4_msg(sb, KERN_ERR, "can't read descriptor %u", i);
> @@ -2705,6 +2706,7 @@ int ext4_mb_release(struct super_block *sb)
> 
>   if (sbi->s_group_info) {
>   for (i = 0; i < ngroups; i++) {
> + cond_resched();
>   grinfo = ext4_get_group_info(sb, i);
> #ifdef DOUBLE_CHECK
>   kfree(grinfo->bb_bitmap);
> --
> 2.21.0.593.g511ec345e18-goog
> 


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH v2] ext4: fix use-after-free race with debug_want_extra_isize

2019-04-18 Thread Andreas Dilger

On Apr 18, 2019, at 9:59 AM, Barret Rhoden  wrote:
> 
> When remounting with debug_want_extra_isize, we were not performing the
> same checks that we do during a normal mount.  That allowed us to set a
> value for s_want_extra_isize that reached outside the s_inode_size.
> 
> Reported-by: syzbot+f584efa0ac7213c22...@syzkaller.appspotmail.com
> Reviewed-by: Jan Kara 
> Signed-off-by: Barret Rhoden 
> Cc: sta...@vger.kernel.org # 4.14.111

Reviewed-by: Andreas Dilger 

> ---
> 
> - Updated tags
> 
> Thanks for the review!
> 
> fs/ext4/super.c | 58 +
> 1 file changed, 34 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 6ed4eb81e674..184944d4d8d1 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -3513,6 +3513,37 @@ int ext4_calculate_overhead(struct super_block *sb)
>   return 0;
> }
> 
> +static void ext4_clamp_want_extra_isize(struct super_block *sb)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(sb);
> + struct ext4_super_block *es = sbi->s_es;
> +
> + /* determine the minimum size of new large inodes, if present */
> + if (sbi->s_inode_size > EXT4_GOOD_OLD_INODE_SIZE &&
> + sbi->s_want_extra_isize == 0) {
> + sbi->s_want_extra_isize = sizeof(struct ext4_inode) -
> +  EXT4_GOOD_OLD_INODE_SIZE;
> + if (ext4_has_feature_extra_isize(sb)) {
> + if (sbi->s_want_extra_isize <
> + le16_to_cpu(es->s_want_extra_isize))
> + sbi->s_want_extra_isize =
> + le16_to_cpu(es->s_want_extra_isize);
> + if (sbi->s_want_extra_isize <
> + le16_to_cpu(es->s_min_extra_isize))
> + sbi->s_want_extra_isize =
> + le16_to_cpu(es->s_min_extra_isize);
> + }
> + }
> + /* Check if enough inode space is available */
> + if (EXT4_GOOD_OLD_INODE_SIZE + sbi->s_want_extra_isize >
> + sbi->s_inode_size) {
> + sbi->s_want_extra_isize = sizeof(struct ext4_inode) -
> +EXT4_GOOD_OLD_INODE_SIZE;
> + ext4_msg(sb, KERN_INFO,
> +  "required extra inode space not available");
> + }
> +}
> +
> static void ext4_set_resv_clusters(struct super_block *sb)
> {
>   ext4_fsblk_t resv_clusters;
> @@ -4387,30 +4418,7 @@ static int ext4_fill_super(struct super_block *sb, 
> void *data, int silent)
>   } else if (ret)
>   goto failed_mount4a;
> 
> - /* determine the minimum size of new large inodes, if present */
> - if (sbi->s_inode_size > EXT4_GOOD_OLD_INODE_SIZE &&
> - sbi->s_want_extra_isize == 0) {
> - sbi->s_want_extra_isize = sizeof(struct ext4_inode) -
> -  EXT4_GOOD_OLD_INODE_SIZE;
> - if (ext4_has_feature_extra_isize(sb)) {
> - if (sbi->s_want_extra_isize <
> - le16_to_cpu(es->s_want_extra_isize))
> - sbi->s_want_extra_isize =
> - le16_to_cpu(es->s_want_extra_isize);
> - if (sbi->s_want_extra_isize <
> - le16_to_cpu(es->s_min_extra_isize))
> - sbi->s_want_extra_isize =
> - le16_to_cpu(es->s_min_extra_isize);
> - }
> - }
> - /* Check if enough inode space is available */
> - if (EXT4_GOOD_OLD_INODE_SIZE + sbi->s_want_extra_isize >
> - sbi->s_inode_size) {
> - sbi->s_want_extra_isize = sizeof(struct ext4_inode) -
> -EXT4_GOOD_OLD_INODE_SIZE;
> - ext4_msg(sb, KERN_INFO, "required extra inode space not"
> -  "available");
> - }
> + ext4_clamp_want_extra_isize(sb);
> 
>   ext4_set_resv_clusters(sb);
> 
> @@ -5194,6 +5202,8 @@ static int ext4_remount(struct super_block *sb, int 
> *flags, char *data)
>   goto restore_opts;
>   }
> 
> + ext4_clamp_want_extra_isize(sb);
> +
>   if ((old_opts.s_mount_opt & EXT4_MOUNT_JOURNAL_CHECKSUM) ^
>   test_opt(sb, JOURNAL_CHECKSUM)) {
>   ext4_msg(sb, KERN_ERR, "changing journal_checksum "
> --
> 2.21.0.392.gf8f6787159e-goog
> 


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH] jbd2: do not start commit when t_updates does not back to zero

2019-03-29 Thread Andreas Dilger

On Mar 29, 2019, at 3:25 PM, Jan Kara  wrote:
> 
> On Sun 24-03-19 11:38:35, Liu Song wrote:
>> When t_updates back to zero, it guaranteed wake up process which
>> waiting on j_wait_updates. If we triggered a commit start without
>> considered t_updates, the commit thread wakes up and find t_updates
>> is not zero, it have to wait on it once again. So, add checking code
>> to avoid this happen.
>> 
>> Signed-off-by: Liu Song 
> 
> Do I understand correctly that this is a performance improvement? If yes,
> did you measure any benefit of the patch? Because I have some doubts that
> t_updates == 0 case is very common.

In the past there have been times when periodic journal commits keep a disk
from spinning down, but I don't know if that is the case here...

Cheers, Andreas

>> ---
>> fs/jbd2/transaction.c | 5 +++--
>> 1 file changed, 3 insertions(+), 2 deletions(-)
>> 
>> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
>> index 79a028a7a579..e0499fd73b1e 100644
>> --- a/fs/jbd2/transaction.c
>> +++ b/fs/jbd2/transaction.c
>> @@ -144,12 +144,13 @@ static void wait_transaction_locked(journal_t *journal)
>>  __releases(journal->j_state_lock)
>> {
>>  DEFINE_WAIT(wait);
>> -int need_to_start;
>> +int need_to_start = 0;
>>  tid_t tid = journal->j_running_transaction->t_tid;
>> 
>>  prepare_to_wait(>j_wait_transaction_locked, ,
>>  TASK_UNINTERRUPTIBLE);
>> -need_to_start = !tid_geq(journal->j_commit_request, tid);
>> +if (!atomic_read(>j_running_transaction->t_updates))
>> +need_to_start = !tid_geq(journal->j_commit_request, tid);
>>  read_unlock(>j_state_lock);
>>  if (need_to_start)
>>  jbd2_log_start_commit(journal, tid);
>> --
>> 2.19.1
>> 
>> 
> --
> Jan Kara 
> SUSE Labs, CR


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH] ext4: use BUG() instead of BUG_ON(1)

2019-03-25 Thread Andreas Dilger

On Mar 25, 2019, at 12:14 PM, Darrick J. Wong  wrote:
> 
> On Mon, Mar 25, 2019 at 02:00:25PM +0100, Arnd Bergmann wrote:
>> BUG_ON(1) leads to bogus warnings from clang when
>> CONFIG_PROFILE_ANNOTATED_BRANCHES is set:
>> 
>> fs/ext4/inode.c:544:4: error: variable 'retval' is used uninitialized 
>> whenever 'if' condition is false
>>  [-Werror,-Wsometimes-uninitialized]
>>BUG_ON(1);
>>^
>> include/asm-generic/bug.h:61:36: note: expanded from macro 'BUG_ON'
>>   ^~~
>> include/linux/compiler.h:48:23: note: expanded from macro 'unlikely'
>>^
>> fs/ext4/inode.c:591:6: note: uninitialized use occurs here
>>if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
>>^~
>> fs/ext4/inode.c:544:4: note: remove the 'if' if its condition is always true
>>BUG_ON(1);
>>^
>> include/asm-generic/bug.h:61:32: note: expanded from macro 'BUG_ON'
>>   ^
>> fs/ext4/inode.c:502:12: note: initialize the variable 'retval' to silence 
>> this warning
>> 
>> Change it to BUG() so clang can see that this code path can never
>> continue.
> 
> I grok that most of these look like "should never get here" assertions,
> but shouldn't we be converting these BUG*() calls to "shut down fs,
> bounce error back to userspace" instead of killing the whole kernel?
> 
> (He says knowing that ripping all of those out is its own project... :P)

We definitely shouldn't be adding "BUG()" or "BUG_ON()" to active code, but
rather call ext4_error() and let the admin to set "errors=panic" if they
want this action, or leave "errors=remount-ro" (which is the default these
days) and just return an error to userspace.

Looking at the nearby code, it is using pr_warn("ES insert assertion failed..."
instead of using an actual assertion for the failure cases.

I guess a saving grace is that this code is only enabled if ES_AGGRESSIVE_TEST
is defined, which it is not by default, so probably it is only when some
developer is testing this code.  So using BUG() is probably OK in this case.

Cheers, Andreas

> 
>> Signed-off-by: Arnd Bergmann 
>> ---
>> fs/ext4/extents_status.c | 4 ++--
>> fs/ext4/inode.c  | 4 ++--
>> 2 files changed, 4 insertions(+), 4 deletions(-)
>> 
>> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
>> index 2b439afafe13..023a3eb3afa3 100644
>> --- a/fs/ext4/extents_status.c
>> +++ b/fs/ext4/extents_status.c
>> @@ -711,7 +711,7 @@ static void ext4_es_insert_extent_ind_check(struct inode 
>> *inode,
>>   * We don't need to check unwritten extent because
>>   * indirect-based file doesn't have it.
>>   */
>> -BUG_ON(1);
>> +BUG();
>>  }
>>  } else if (retval == 0) {
>>  if (ext4_es_is_written(es)) {
>> @@ -780,7 +780,7 @@ static int __es_insert_extent(struct inode *inode, 
>> struct extent_status *newes)
>>  }
>>  p = &(*p)->rb_right;
>>  } else {
>> -BUG_ON(1);
>> +BUG();
>>  return -EINVAL;
>>  }
>>  }
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index b32a57bc5d5d..190f0478582a 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -541,7 +541,7 @@ int ext4_map_blocks(handle_t *handle, struct inode 
>> *inode,
>>  map->m_len = retval;
>>  retval = 0;
>>  } else {
>> -BUG_ON(1);
>> +BUG();
>>  }
>> #ifdef ES_AGGRESSIVE_TEST
>>  ext4_map_blocks_es_recheck(handle, inode, map,
>> @@ -1876,7 +1876,7 @@ static int ext4_da_map_blocks(struct inode *inode, 
>> sector_t iblock,
>>  else if (ext4_es_is_unwritten())
>>  map->m_flags |= EXT4_MAP_UNWRITTEN;
>>  else
>> -BUG_ON(1);
>> +BUG();
>> 
>> #ifdef ES_AGGRESSIVE_TEST
>>  ext4_map_blocks_es_recheck(NULL, inode, map, _map, 0);
>> --
>> 2.20.0


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH 2/4] Expose O_PATHSTATIC to userspace

2019-02-12 Thread Andreas Dilger

On Feb 12, 2019, at 7:54 AM, demioben...@gmail.com wrote:
> 
> From: "Demi M. Obenour" 
> 
> This adds the file open flag O_PATHSTATIC, which ensures that symbolic
> links are *never* followed, even in path components other than the last.
> This is distinct from O_NOFOLLOW, which only prevents symlinks in the
> *last* component from being followed.
> 
> This is useful for avoiding race conditions in userspace code that
> should expose only a subset of the filesystem to clients.  This includes
> FTP and SFTP servers, QEMU, and others.
> 
> Currently, O_NOFOLLOW must be set if O_PATHSTATIC is set.  Otherwise,
> open() fails with -EINVAL.

I don't want to bikeshed (discard suggestion if you disagree), but why not
name the flag "O_NEVER_FOLLOW" so that users can see it is also related to
"O_NOFOLLOW"?  Otherwise it seems like they are two completely different
things from looking at the names, when in fact they are closely related.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH 2/2] ext4: annotate implicit fall throughs

2019-01-17 Thread Andreas Dilger


> On Jan 14, 2019, at 1:39 PM, Mathieu Malaterre  wrote:
> 
> There is a plan to build the kernel with -Wimplicit-fallthrough and
> these places in the code produced warnings (W=1). Fix them up.
> 
> This commit remove the following warnings:
> 
>  fs/ext4/indirect.c:1182:6: warning: this statement may fall through 
> [-Wimplicit-fallthrough=]
>  fs/ext4/indirect.c:1188:6: warning: this statement may fall through 
> [-Wimplicit-fallthrough=]
>  fs/ext4/indirect.c:1432:6: warning: this statement may fall through 
> [-Wimplicit-fallthrough=]
>  fs/ext4/indirect.c:1440:6: warning: this statement may fall through 
> [-Wimplicit-fallthrough=]
> 
> Signed-off-by: Mathieu Malaterre 

Reviewed-by: Andreas Dilger 

> ---
> fs/ext4/indirect.c | 6 ++
> 1 file changed, 6 insertions(+)
> 
> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> index bf7fa1507e81..c2225f0d31b5 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c
> @@ -1183,18 +1183,21 @@ void ext4_ind_truncate(handle_t *handle, struct inode 
> *inode)
>   ext4_free_branches(handle, inode, NULL, , +1, 1);
>   i_data[EXT4_IND_BLOCK] = 0;
>   }
> + /* fall through */
>   case EXT4_IND_BLOCK:
>   nr = i_data[EXT4_DIND_BLOCK];
>   if (nr) {
>   ext4_free_branches(handle, inode, NULL, , +1, 2);
>   i_data[EXT4_DIND_BLOCK] = 0;
>   }
> + /* fall through */
>   case EXT4_DIND_BLOCK:
>   nr = i_data[EXT4_TIND_BLOCK];
>   if (nr) {
>   ext4_free_branches(handle, inode, NULL, , +1, 3);
>   i_data[EXT4_TIND_BLOCK] = 0;
>   }
> + /* fall through */
>   case EXT4_TIND_BLOCK:
>   ;
>   }
> @@ -1433,6 +1436,7 @@ int ext4_ind_remove_space(handle_t *handle, struct 
> inode *inode,
>   ext4_free_branches(handle, inode, NULL, , +1, 1);
>   i_data[EXT4_IND_BLOCK] = 0;
>   }
> + /* fall through */
>   case EXT4_IND_BLOCK:
>   if (++n >= n2)
>   return 0;
> @@ -1441,6 +1445,7 @@ int ext4_ind_remove_space(handle_t *handle, struct 
> inode *inode,
>   ext4_free_branches(handle, inode, NULL, , +1, 2);
>   i_data[EXT4_DIND_BLOCK] = 0;
>   }
> + /* fall through */
>   case EXT4_DIND_BLOCK:
>   if (++n >= n2)
>   return 0;
> @@ -1449,6 +1454,7 @@ int ext4_ind_remove_space(handle_t *handle, struct 
> inode *inode,
>   ext4_free_branches(handle, inode, NULL, , +1, 3);
>   i_data[EXT4_TIND_BLOCK] = 0;
>   }
> + /* fall through */
>   case EXT4_TIND_BLOCK:
>   ;
>   }
> --
> 2.19.2
> 


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH 1/2] ext4: annotate implicit fall throughs

2019-01-17 Thread Andreas Dilger


> On Jan 14, 2019, at 1:39 PM, Mathieu Malaterre  wrote:
> 
> There is a plan to build the kernel with -Wimplicit-fallthrough and
> these places in the code produced warnings (W=1). Fix them up.
> 
> This commit remove the following warnings:
> 
>  fs/ext4/hash.c:233:15: warning: this statement may fall through 
> [-Wimplicit-fallthrough=]
>  fs/ext4/hash.c:246:15: warning: this statement may fall through 
> [-Wimplicit-fallthrough=]
> 
> Signed-off-by: Mathieu Malaterre 

Reviewed-by: Andreas Dilger 

> ---
> fs/ext4/hash.c | 2 ++
> 1 file changed, 2 insertions(+)
> 
> diff --git a/fs/ext4/hash.c b/fs/ext4/hash.c
> index e22dcfab308b..46b24da33a28 100644
> --- a/fs/ext4/hash.c
> +++ b/fs/ext4/hash.c
> @@ -231,6 +231,7 @@ int ext4fs_dirhash(const char *name, int len, struct 
> dx_hash_info *hinfo)
>   break;
>   case DX_HASH_HALF_MD4_UNSIGNED:
>   str2hashbuf = str2hashbuf_unsigned;
> + /* fall through */
>   case DX_HASH_HALF_MD4:
>   p = name;
>   while (len > 0) {
> @@ -244,6 +245,7 @@ int ext4fs_dirhash(const char *name, int len, struct 
> dx_hash_info *hinfo)
>   break;
>   case DX_HASH_TEA_UNSIGNED:
>   str2hashbuf = str2hashbuf_unsigned;
> + /* fall through */
>   case DX_HASH_TEA:
>   p = name;
>   while (len > 0) {
> --
> 2.19.2
> 


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: Preserving a rev 0.0 ext2 filesystem

2019-01-12 Thread Andreas Dilger

On Jan 12, 2019, at 2:43 AM, Geert Uytterhoeven  wrote:
> 
> Hi Ted,
> 
> I'm still regularly using a Linux rev 0.0 ext2 filesystem as a  ramdisk
> on m68k, containing mid-90's binaries, from right after the a.out-to-ELF
> transition, so I notice if someone breaks old syscall support.
> 
> Recently I wanted to change /dev/console on that ramdisk from a symlink
> to chardev 5 0.  Unfortunately I cannot mount it, without it being
> upgraded automatically to a rev 1.0 ext2 filesystem.  Apparently the
> kernel has been doing that upgrade for ages.
> 
> Do you have a suggestion how to make that change, while preserving the
> ext2 revision?

It looks like there is a "gratuitous" call to ext4_update_dynamic_rev() in the
mount path that is left over from when ext3 always set the INCOMPAT_RECOVER and
COMPAT_HAS_JOURNAL features on every mount.  With the "nojournal" mount option
these are no longer needed.  You could probably just remove this call with no 
ill
effects, or better yet apply the patch that I'll push shortly to this effect.
It's been there since the start of git history (2.6.12-rc2) in the ext3 code
that was later copied to become ext4, but digging through my email archives it
looks like adding the call to "ext3_update_fs_rev()" was actually one of my 
first
patches against the original ext3 code "[Ext2-devel] Re: RECOVER INCOMPAT flag",
dated Nov 18, 2000.

The only thing ext4_update_dynamic_rev() does is update s_rev_level, and set
s_first_ino and s_inode_size on disk to the values that the kernel already
assumes to be true (newer kernels can use different values).  Those fields are
also set in memory later during the mount, so the rest of the code should work
OK (I haven't tested). This will probably only affect your filesysem (everything
else in the last 20 years uses EXT2_DYNAMIC_REV) so it won't be much of a 
change.

The "real" ext2 code looks like it does not call ext2_update_dynamic_rev() for
every filesystem during mount, only if you write a file > 2GB in size, so it 
would
also be possible to mount with the ext2.ko driver (if your kernel provides it).
There looks like a minor bug in ext2 that it doesn't call 
ext2_update_dynamic_rev()
if an xattr is stored on the filesystem and it sets EXT2_FEATURE_COMPAT_EXT_ATTR
in ext2_xattr_update_super_block().  However, nobody noticed in since xattrs 
landed.

In any case, it would be fun to give my patch a try to see if it allows your old
filesystem to be mounted and modified with a modern kernel and then mounted on
the old kernel again, as a testament to ext2/ext4 feature compatibility.

Cheers, Andreas

signature.asc
Description: Message signed with OpenPGP

Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation

2018-12-28 Thread Andreas Dilger

On Dec 28, 2018, at 4:18 AM, Peter Maydell  wrote:
> 
> On Fri, 28 Dec 2018 at 00:23, Andreas Dilger  wrote:
>> On Dec 27, 2018, at 10:41 AM, Peter Maydell  wrote:
>>> As you note, this causes breakage for userspace programs which
>>> need to implement an API/ABI with 32-bit offset but which only
>>> have access to the kernel's 64-bit offset API/ABI.
>> 
>> This is (IMHO) a bit of an oxymoron, isn't it?  Applications using
>> the 64-bit API, but storing the value in a 32-bit field?
> 
> I didn't say "which choose to store the value in a 32-bit field",
> I said "which have to implement an API/ABI which has 32-bit fields".
> In QEMU's case, we use the host kernel's ABI, which has 64-bit
> offset fields. We implement a syscall ABI for the guest binary
> we are running under emulation, which may have 32-bit offset fields
> (for instance if we are running a 32-bit Arm binary.) Both of
> these ABIs are fixed -- QEMU doesn't have a choice here, it
> just has to make the best effort it can with what the host kernel
> provides it, to provide the semantics the guest binary needs.
> My suggestion in this thread is that the host kernel provides
> a wider range of facilities so that QEMU can do the job it's
> trying to do.
> 
>> The same
>> problem would exist for filesystems with 64-bit inodes or 64-bit
>> file offsets trying to store these values in 32-bit variables.
>> It might work most of the time, but it can also break randomly.
> 
> In general inodes and offsets start from 0 and work up --
> so almost all of the time they don't actually overflow.
> The problem with ext4 directory hash "offsets" is that they
> overflow all the time and immediately, so instead of "works
> unless you have a weird edge case" like all the other filesystems,
> it's "never works".
> 
>>> I think the best fix for this would be for the kernel to either
>>> (a) consistently use a 32-bit hash or (b) to provide an API
>>> so that userspace can use the FMODE_32BITHASH flag the way
>>> that kernel-internal users already can.
>> 
>> It would be relatively straight forward to add a "32bitapi" mount
>> option to return a 32-bit directory hash to userspace for operations
>> on that mountpoint (ext4 doesn't have 64-bit inode numbers yet).
>> However, I can't think of an easy way to do this on a per-process
>> basis without just having it call the 32-bit API directly.
> 
> The problem is that there is no 32-bit API in some cases
> (unless I have misunderstood the kernel code) -- not all
> host architectures implement compat syscalls or allow them
> to be called from 64-bit processes or implement all the older
> syscall variants that had smaller offets. If there was a guaranteed
> "this syscall always exists and always gives me 32-bit offsets"
> we could use it.

The "32bitapi" mount option would use 32-bit hash for seekdir
and telldir, regardless of what kernel API was used.  That would
just set the FMODE_32BITHASH flag in the file->f_mode for all files.

Using 32-bit directory hash values is not necessarily harmful, but
it returns the possibility to hit the problem with hash collisions
that previously existed before the move to 64-bit hash values.
This becomes more of a problem as directory sizes increase.

>> For ext4 at least, you could just shift the high 32-bit part of
>> the 64-bit hash down into a 32-bit value in telldir(), and
>> shift it back up when seekdir() is called.
> 
> Yes, that has been suggested, but it seemed a bit dubious
> to bake in knowledge of ext4's internal implementation details.
> Can we rely on this as an ABI promise that will always work
> for all versions of all file systems going forwards?

Well, the directory cookies need to be relatively stable over
time because they are exported to applications and possibly
remote nodes via NFS, so it can't be changed very much.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation

2018-12-27 Thread Andreas Dilger

On Dec 27, 2018, at 10:41 AM, Peter Maydell  wrote:
> 
> On Thu, 27 Dec 2018 at 17:19, Florian Weimer  wrote:
>> We have a bit of an interesting problem with respect to the d_off
>> field in struct dirent.
>> 
>> When running a 64-bit kernel on certain file systems, notably ext4,
>> this field uses the full 63 bits even for small directories (strace -v
>> output, wrapped here for readability):
>> 
>> getdents(3, [
>>  {d_ino=1494304, d_off=3901177228673045825, d_reclen=40, 
>> d_name="authorized_keys", d_type=DT_REG},
>>  {d_ino=1494277, d_off=7491915799041650922, d_reclen=24, d_name=".", 
>> d_type=DT_DIR},
>>  {d_ino=1314655, d_off=9223372036854775807, d_reclen=24, d_name="..", 
>> d_type=DT_DIR}
>> ], 32768) = 88
>> 
>> When running in 32-bit compat mode, this value is somehow truncated to
>> 31 bits, for both the getdents and the getdents64 (!) system call (at
>> least on i386).
> 
> Yes -- look for hash2pos() and friends in fs/ext4/dir.c.
> The ext4 code in the kernel uses a 32 bit hash if (a) the kernel
> is 32 bit (b) this is a compat syscall (b) some other bit of
> the kernel asked it to via the FMODE_32BITHASH flag (currently only
> NFS does that I think).
> 
> As you note, this causes breakage for userspace programs which
> need to implement an API/ABI with 32-bit offset but which only
> have access to the kernel's 64-bit offset API/ABI.

This is (IMHO) a bit of an oxymoron, isn't it?  Applications using
the 64-bit API, but storing the value in a 32-bit field?  The same
problem would exist for filesystems with 64-bit inodes or 64-bit
file offsets trying to store these values in 32-bit variables.
It might work most of the time, but it can also break randomly.

> I think the best fix for this would be for the kernel to either
> (a) consistently use a 32-bit hash or (b) to provide an API
> so that userspace can use the FMODE_32BITHASH flag the way
> that kernel-internal users already can.

It would be relatively straight forward to add a "32bitapi" mount
option to return a 32-bit directory hash to userspace for operations
on that mountpoint (ext4 doesn't have 64-bit inode numbers yet).
However, I can't think of an easy way to do this on a per-process
basis without just having it call the 32-bit API directly.

> I couldn't think of or find any existing way for userspace
> to get the right results here, which is why
> 32-bit-guest-on-64-bit-host QEMU doesn't work on these filesystems
> (depending on what exactly the guest's libc etc do).
> 
>> the 32-bit getdents system call emulation in a 64-bit qemu-user
>> process would just silently truncate the d_off field as part of
>> the translation, not reporting an error.
>> [...]
>> This truncation has always been a bug; it breaks telldir/seekdir
>> at least in some cases.
> 
> Yes; you can't fit a quart into a pint pot, so if the guest
> only handles 32-bit offsets then truncation is about all we
> can do. This works fine if offsets are offsets, assuming the
> directory isn't so enormous it would have broken the guest
> anyway. I'm not aware of any issues with this other than the
> oddball ext4 offsets-are-hashes situation -- could you expand
> on the telldir/seekdir issue? (I suppose we should probably
> make QEMU's syscall emulation layer return "no more entries"
> rather than entries with truncated hashes.)

For ext4 at least, you could just shift the high 32-bit part of
the 64-bit hash down into a 32-bit value in telldir(), and
shift it back up when seekdir() is called.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH v2 5/5] nfs: don't clear STATX_ATIME from result_mask

2018-11-05 Thread Andreas Dilger

On Oct 20, 2018, at 11:46 AM, Trond Myklebust  wrote:
> 
> On Fri, 2018-10-19 at 22:48 +0200, Miklos Szeredi wrote:
>> On Fri, Oct 19, 2018 at 8:14 PM, Trond Myklebust
>>  wrote:
>>> On Fri, 2018-10-19 at 19:46 +0200, Miklos Szeredi wrote:
 How is it then that only STATX_ATIME is cleared and not the other
 fields?
>>> 
>>> It isn't just the atime. We can also fail to revalidate the ctime
>>> and mtime if they are not being requested by the user.
>>> 
 Note: junk != stale.  The statx definition doesn't talk about the
 fields being up-to-date, except for AT_STATX_FORCE_SYNC, so stale
 attributes are okay, and do not warrant clearing the result_mask.
>>> 
>>> I disagree. stale == junk here, because the default of
>>> AT_STATX_SYNC_AS_STAT is described by the manpage as "Do whatever
>>> stat(2) does." which this is not.
>> 
>> Ah, you are talking about this:
>> 
>>  /* Is the user requesting attributes that might need revalidation? */
>>  if (!(request_mask & (STATX_MODE|STATX_NLINK|STATX_ATIME|STATX_CTIME|
>>STATX_MTIME|STATX_UID|STATX_GID|
>>STATX_SIZE|STATX_BLOCKS)))
>>goto out_no_update;
>> 
>> Well, if this is triggered for statx(...,  STATX_ATIME,
>> AT_STATX_SYNC_AS_STAT) and MNT_NOATIME, then yes, result will be
>> junk. Which means that the code is wrong, it shouldn't do that.
> 
> The problem is that vfs_getattr_nosec() populates stat->result_mask
> with a default of STATX_BASIC_STATS, which makes no sense unless you
> assume that the user will always ask for a superset of
> STATX_BASIC_STATS (or you assume that those attributes never need
> revalidation, which is obviously braindead).

I guess the assumption in the VFS code is that statx is mostly called
by local filesystems, for which STATX_BASIC_STATS is usually right,
so the basic VFS helper is OK to set those stats.  It should also be
possible for the filesystem to clear flags out of result_mask for
attributes that it doesn't want to return.

For filesystems that know what they are doing, it might just be best
to always clear stat->result_mask and fill in what they want, based
on the available attributes and request_mask rather than assuming
something is set by the caller.

Cheers, Andreas

>> Otherwise (if something other than STATX_ATIME or STATX_INO or
>> STATX_TYPE is given as well) it *will* do the same thing as what
>> stat(2) does, so in that case STATX_ATIME should not  be cleared (yet
>> it is cleared).
> 
> As far as I'm concerned, we can definitely get rid of the
> 
>/*
> * We may force a getattr if the user cares about atime.
> *
> * Note that we only have to check the vfsmount flags here:
> *  - NFS always sets S_NOATIME by so checking it would give a
> *bogus result
> *  - NFS never sets SB_NOATIME or SB_NODIRATIME so there is
> *no point in checking those.
> */
>if ((path->mnt->mnt_flags & MNT_NOATIME) ||
>((path->mnt->mnt_flags & MNT_NODIRATIME) && 
> S_ISDIR(inode->i_mode)))
>request_mask &= ~STATX_ATIME;
> 
> 
> however the rest needs to stay, or there is no way we can use statx()
> to allow optimised retrieval of only those attributes that your
> application cares about.


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH v2 5/5] nfs: don't clear STATX_ATIME from result_mask

2018-11-05 Thread Andreas Dilger

On Oct 20, 2018, at 11:46 AM, Trond Myklebust  wrote:
> 
> On Fri, 2018-10-19 at 22:48 +0200, Miklos Szeredi wrote:
>> On Fri, Oct 19, 2018 at 8:14 PM, Trond Myklebust
>>  wrote:
>>> On Fri, 2018-10-19 at 19:46 +0200, Miklos Szeredi wrote:
 How is it then that only STATX_ATIME is cleared and not the other
 fields?
>>> 
>>> It isn't just the atime. We can also fail to revalidate the ctime
>>> and mtime if they are not being requested by the user.
>>> 
 Note: junk != stale.  The statx definition doesn't talk about the
 fields being up-to-date, except for AT_STATX_FORCE_SYNC, so stale
 attributes are okay, and do not warrant clearing the result_mask.
>>> 
>>> I disagree. stale == junk here, because the default of
>>> AT_STATX_SYNC_AS_STAT is described by the manpage as "Do whatever
>>> stat(2) does." which this is not.
>> 
>> Ah, you are talking about this:
>> 
>>  /* Is the user requesting attributes that might need revalidation? */
>>  if (!(request_mask & (STATX_MODE|STATX_NLINK|STATX_ATIME|STATX_CTIME|
>>STATX_MTIME|STATX_UID|STATX_GID|
>>STATX_SIZE|STATX_BLOCKS)))
>>goto out_no_update;
>> 
>> Well, if this is triggered for statx(...,  STATX_ATIME,
>> AT_STATX_SYNC_AS_STAT) and MNT_NOATIME, then yes, result will be
>> junk. Which means that the code is wrong, it shouldn't do that.
> 
> The problem is that vfs_getattr_nosec() populates stat->result_mask
> with a default of STATX_BASIC_STATS, which makes no sense unless you
> assume that the user will always ask for a superset of
> STATX_BASIC_STATS (or you assume that those attributes never need
> revalidation, which is obviously braindead).

I guess the assumption in the VFS code is that statx is mostly called
by local filesystems, for which STATX_BASIC_STATS is usually right,
so the basic VFS helper is OK to set those stats.  It should also be
possible for the filesystem to clear flags out of result_mask for
attributes that it doesn't want to return.

For filesystems that know what they are doing, it might just be best
to always clear stat->result_mask and fill in what they want, based
on the available attributes and request_mask rather than assuming
something is set by the caller.

Cheers, Andreas

>> Otherwise (if something other than STATX_ATIME or STATX_INO or
>> STATX_TYPE is given as well) it *will* do the same thing as what
>> stat(2) does, so in that case STATX_ATIME should not  be cleared (yet
>> it is cleared).
> 
> As far as I'm concerned, we can definitely get rid of the
> 
>/*
> * We may force a getattr if the user cares about atime.
> *
> * Note that we only have to check the vfsmount flags here:
> *  - NFS always sets S_NOATIME by so checking it would give a
> *bogus result
> *  - NFS never sets SB_NOATIME or SB_NODIRATIME so there is
> *no point in checking those.
> */
>if ((path->mnt->mnt_flags & MNT_NOATIME) ||
>((path->mnt->mnt_flags & MNT_NODIRATIME) && 
> S_ISDIR(inode->i_mode)))
>request_mask &= ~STATX_ATIME;
> 
> 
> however the rest needs to stay, or there is no way we can use statx()
> to allow optimised retrieval of only those attributes that your
> application cares about.


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH] ext4: remove code duplication in update_ind/bind/tind_extent_range

2018-11-04 Thread Andreas Dilger

On Nov 4, 2018, at 9:41 AM, Vasily Averin  wrote:
> 
> update_ind/bind/tind_extent_page() differs by one variable and can be
> replaced by unified function. These functions are called by similar way
> and their caller function can be simplified too.

The patch looks quite reasonable, though I'd suggest a couple of very
minor style changes (inline).  You can add to the resubmitted patch:

Reviewed-by: Andreas Dilger 

> Signed-off-by: Vasily Averin 
> ---
> fs/ext4/migrate.c | 111 ++
> 1 file changed, 23 insertions(+), 88 deletions(-)
> 
> diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
> index 61a9d1927817..02ce99ba32be 100644
> --- a/fs/ext4/migrate.c
> +++ b/fs/ext4/migrate.c
> @@ -109,12 +109,13 @@ static int update_extent_range(handle_t *handle, struct 
> inode *inode,
> 
> static int update_ind_extent_range(handle_t *handle, struct inode *inode,
>  ext4_fsblk_t pblock,
> -struct migrate_struct *lb)
> +struct migrate_struct *lb,
> +ext4_lblk_t inc)
> {
>   struct buffer_head *bh;
>   __le32 *i_data;
>   int i, retval = 0;
> - unsigned long max_entries = inode->i_sb->s_blocksize >> 2;
> + ext4_lblk_t max_entries = inode->i_sb->s_blocksize >> 2;
> 
>   bh = sb_bread(inode->i_sb, pblock);
>   if (!bh)
> @@ -523,34 +464,28 @@ int ext4_ext_migrate(struct inode *inode)
> 
>   /* 32 bit block address 4 bytes */
>   max_entries = inode->i_sb->s_blocksize >> 2;
> - for (i = 0; i < EXT4_NDIR_BLOCKS; i++) {
> +
> + inc = 1; mult = 1;
> + for (i = 0; i < EXT4_N_BLOCKS; i++) {
> + if (i == EXT4_IND_BLOCK)
> + mult = max_entries;
> + else if (i > EXT4_IND_BLOCK)
> + inc = inc * mult;

This should be "inc *= mult".

>   if (i_data[i]) {
> + if (i < EXT4_IND_BLOCK)
> + retval = update_extent_range(handle, tmp_inode,
>   le32_to_cpu(i_data[i]), );
> + else
> + retval = update_ind_extent_range(handle,
> + tmp_inode,
> + le32_to_cpu(i_data[i]),
> + , inc);
>   if (retval)
>   goto err_out;
> +
> + } else if (i < EXT4_TIND_BLOCK)
> + lb.curr_block += inc * mult;

There should be {} around the else-block to match the if-block.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH] ext4: remove code duplication in update_ind/bind/tind_extent_range

2018-11-04 Thread Andreas Dilger

On Nov 4, 2018, at 9:41 AM, Vasily Averin  wrote:
> 
> update_ind/bind/tind_extent_page() differs by one variable and can be
> replaced by unified function. These functions are called by similar way
> and their caller function can be simplified too.

The patch looks quite reasonable, though I'd suggest a couple of very
minor style changes (inline).  You can add to the resubmitted patch:

Reviewed-by: Andreas Dilger 

> Signed-off-by: Vasily Averin 
> ---
> fs/ext4/migrate.c | 111 ++
> 1 file changed, 23 insertions(+), 88 deletions(-)
> 
> diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
> index 61a9d1927817..02ce99ba32be 100644
> --- a/fs/ext4/migrate.c
> +++ b/fs/ext4/migrate.c
> @@ -109,12 +109,13 @@ static int update_extent_range(handle_t *handle, struct 
> inode *inode,
> 
> static int update_ind_extent_range(handle_t *handle, struct inode *inode,
>  ext4_fsblk_t pblock,
> -struct migrate_struct *lb)
> +struct migrate_struct *lb,
> +ext4_lblk_t inc)
> {
>   struct buffer_head *bh;
>   __le32 *i_data;
>   int i, retval = 0;
> - unsigned long max_entries = inode->i_sb->s_blocksize >> 2;
> + ext4_lblk_t max_entries = inode->i_sb->s_blocksize >> 2;
> 
>   bh = sb_bread(inode->i_sb, pblock);
>   if (!bh)
> @@ -523,34 +464,28 @@ int ext4_ext_migrate(struct inode *inode)
> 
>   /* 32 bit block address 4 bytes */
>   max_entries = inode->i_sb->s_blocksize >> 2;
> - for (i = 0; i < EXT4_NDIR_BLOCKS; i++) {
> +
> + inc = 1; mult = 1;
> + for (i = 0; i < EXT4_N_BLOCKS; i++) {
> + if (i == EXT4_IND_BLOCK)
> + mult = max_entries;
> + else if (i > EXT4_IND_BLOCK)
> + inc = inc * mult;

This should be "inc *= mult".

>   if (i_data[i]) {
> + if (i < EXT4_IND_BLOCK)
> + retval = update_extent_range(handle, tmp_inode,
>   le32_to_cpu(i_data[i]), );
> + else
> + retval = update_ind_extent_range(handle,
> + tmp_inode,
> + le32_to_cpu(i_data[i]),
> + , inc);
>   if (retval)
>   goto err_out;
> +
> + } else if (i < EXT4_TIND_BLOCK)
> + lb.curr_block += inc * mult;

There should be {} around the else-block to match the if-block.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH v2 11/11] ext4: access to uninitialized bh fields in ext4_xattr_set_handle()

2018-10-31 Thread Andreas Dilger

On Oct 30, 2018, at 9:39 PM, Vasily Averin  wrote:
> 
> On 10/31/2018 04:30 AM, Andreas Dilger wrote:
>> Could you please explain your statement below that on-stack
>> initialization does not zero unspecified fields?  According
>> to documents I found, for example:
>> 
>> https://gcc.gnu.org/onlinedocs/gcc/Designated-Inits.html
>> 
>> they *are* initialized to zero.
> 
> I did not know it,
> I re-checked it in generated assembler code and found that you
> are right and I was wrong.
> 
> Please drop this patch,
> should I resend of rest of this patch set once again?

I don't think it is necessary to resend the whole patch series.
Ted should notice these emails on this patch and not apply it.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH v2 11/11] ext4: access to uninitialized bh fields in ext4_xattr_set_handle()

2018-10-31 Thread Andreas Dilger

On Oct 30, 2018, at 9:39 PM, Vasily Averin  wrote:
> 
> On 10/31/2018 04:30 AM, Andreas Dilger wrote:
>> Could you please explain your statement below that on-stack
>> initialization does not zero unspecified fields?  According
>> to documents I found, for example:
>> 
>> https://gcc.gnu.org/onlinedocs/gcc/Designated-Inits.html
>> 
>> they *are* initialized to zero.
> 
> I did not know it,
> I re-checked it in generated assembler code and found that you
> are right and I was wrong.
> 
> Please drop this patch,
> should I resend of rest of this patch set once again?

I don't think it is necessary to resend the whole patch series.
Ted should notice these emails on this patch and not apply it.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: in_compat_syscall() returns from kernel thread for X86_32.

2018-10-20 Thread Andreas Dilger

On Oct 18, 2018, at 11:26 AM, Andy Lutomirski  wrote:
> 
> On Wed, Oct 17, 2018 at 9:36 PM NeilBrown  wrote:
>> 
>> On Wed, Oct 17 2018, Andy Lutomirski wrote:
>> 
>>> On Wed, Oct 17, 2018 at 6:48 PM NeilBrown  wrote:
 
 
 Was: Re: [tip:x86/asm] x86/entry: Rename is_{ia32,x32}_task() to 
 in_{ia32,x32}_syscall()
 On Tue, Apr 19 2016, tip-bot for Dmitry Safonov wrote:
 
> Commit-ID:  abfb9498ee1327f534df92a7ecaea81a85913bae
> Gitweb: 
> http://git.kernel.org/tip/abfb9498ee1327f534df92a7ecaea81a85913bae
> Author: Dmitry Safonov 
> AuthorDate: Mon, 18 Apr 2016 16:43:43 +0300
> Committer:  Ingo Molnar 
> CommitDate: Tue, 19 Apr 2016 10:44:52 +0200
> 
> x86/entry: Rename is_{ia32,x32}_task() to in_{ia32,x32}_syscall()
> 
 ...
> @@ -318,7 +318,7 @@ static inline bool is_x32_task(void)
> 
> static inline bool in_compat_syscall(void)
> {
> - return is_ia32_task() || is_x32_task();
> + return in_ia32_syscall() || in_x32_syscall();
> }
 
 Hi,
 I'm reply to this patch largely to make sure I get the right people
 .
 
 This test is always true when CONFIG_X86_32 is set, as that forces
 in_ia32_syscall() to true.
 However we might not be in a syscall at all - we might be running a
 kernel thread which is always in 64 mode.
 Every other implementation of in_compat_syscall() that I found is
 dependant on a thread flag or syscall register flag, and so returns
 "false" in a kernel thread.
 
 Might something like this be appropriate?
 
 diff --git a/arch/x86/include/asm/thread_info.h 
 b/arch/x86/include/asm/thread_info.h
 index 2ff2a30a264f..c265b40a78f2 100644
 --- a/arch/x86/include/asm/thread_info.h
 +++ b/arch/x86/include/asm/thread_info.h
 @@ -219,7 +219,7 @@ static inline int arch_within_stack_frames(const void 
 * const stack,
 #ifndef __ASSEMBLY__
 
 #ifdef CONFIG_X86_32
 -#define in_ia32_syscall() true
 +#define in_ia32_syscall() (!(current->flags & PF_KTHREAD))
 #else
 #define in_ia32_syscall() (IS_ENABLED(CONFIG_IA32_EMULATION) && \
   current_thread_info()->status & TS_COMPAT)
 
 This came up in the (no out-of-tree) lustre filesystem where some code
 needs to assume 32-bit mode in X86_32 syscalls, and 64-bit mode in kernel
 threads.
 
>>> 
>>> I could get on board with:
>>> 
>>> ({WARN_ON_ONCE(current->flags & PF_KTHREAD); true})
>>> 
>>> The point of these accessors is to be used *in a syscall*.
>>> 
>>> What on Earth is Lustre doing that makes it have this problem?
>> 
>> Lustre uses it in the ->getattr method to make sure ->ino, ->dev and
>> ->rdev are appropriately sized.  This isn't very different from the
>> usage in ext4 to ensure the seek offset for directories is suitable.
>> 
>> These interfaces can be used both from systemcalls and from kernel
>> threads, such as via nfsd.
>> 
>> I don't *know* if nfsd is the particular kthread that causes problems
>> for lustre.  All I know is that ->getattr returns 32bit squashed inode
>> numbers in kthread context where 64 bit numbers would be expected.
>> 
> 
> Well, that looks like Lustre is copying an ext4 bug.
> 
> Hi ext4 people-
> 
> ext4's is_32bit_api() function is bogus.  You can't use
> in_compat_syscall() unless you know you're in a syscall
> 
> The buggy code was introduced in:
> 
> commit d1f5273e9adb40724a85272f248f210dc4ce919a
> Author: Fan Yong 
> Date:   Sun Mar 18 22:44:40 2012 -0400
> 
>ext4: return 32/64-bit dir name hash according to usage type
> 
> I don't know what the right solution is.  Al, is it legit at all for
> fops->llseek to care about the caller's bitness?  If what ext4 is
> doing is legit, then ISTM the VFS needs to gain a new API to tell
> ->llseek what to do.  But I'm wondering why FMODE_64BITHASH by itself
> isn't sufficient,
> 
> I'm quite tempted to add a warning to the x86 arch code to try to
> catch this type of bug.  Fortunately, a bit of grepping suggests that
> ext4 is the only filesystem with this problem.

We need to know whether the readdir cookie returned to userspace
should be a 32-bit cookie or a 64-bit cookie.  Trying to return
a 64-bit value will result in -EOVERFLOW for a 32-bit application,
but is preferable (if possible) because it reduces the chance of
hash collisions causing readdir to have problems.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: in_compat_syscall() returns from kernel thread for X86_32.

2018-10-20 Thread Andreas Dilger

On Oct 18, 2018, at 11:26 AM, Andy Lutomirski  wrote:
> 
> On Wed, Oct 17, 2018 at 9:36 PM NeilBrown  wrote:
>> 
>> On Wed, Oct 17 2018, Andy Lutomirski wrote:
>> 
>>> On Wed, Oct 17, 2018 at 6:48 PM NeilBrown  wrote:
 
 
 Was: Re: [tip:x86/asm] x86/entry: Rename is_{ia32,x32}_task() to 
 in_{ia32,x32}_syscall()
 On Tue, Apr 19 2016, tip-bot for Dmitry Safonov wrote:
 
> Commit-ID:  abfb9498ee1327f534df92a7ecaea81a85913bae
> Gitweb: 
> http://git.kernel.org/tip/abfb9498ee1327f534df92a7ecaea81a85913bae
> Author: Dmitry Safonov 
> AuthorDate: Mon, 18 Apr 2016 16:43:43 +0300
> Committer:  Ingo Molnar 
> CommitDate: Tue, 19 Apr 2016 10:44:52 +0200
> 
> x86/entry: Rename is_{ia32,x32}_task() to in_{ia32,x32}_syscall()
> 
 ...
> @@ -318,7 +318,7 @@ static inline bool is_x32_task(void)
> 
> static inline bool in_compat_syscall(void)
> {
> - return is_ia32_task() || is_x32_task();
> + return in_ia32_syscall() || in_x32_syscall();
> }
 
 Hi,
 I'm reply to this patch largely to make sure I get the right people
 .
 
 This test is always true when CONFIG_X86_32 is set, as that forces
 in_ia32_syscall() to true.
 However we might not be in a syscall at all - we might be running a
 kernel thread which is always in 64 mode.
 Every other implementation of in_compat_syscall() that I found is
 dependant on a thread flag or syscall register flag, and so returns
 "false" in a kernel thread.
 
 Might something like this be appropriate?
 
 diff --git a/arch/x86/include/asm/thread_info.h 
 b/arch/x86/include/asm/thread_info.h
 index 2ff2a30a264f..c265b40a78f2 100644
 --- a/arch/x86/include/asm/thread_info.h
 +++ b/arch/x86/include/asm/thread_info.h
 @@ -219,7 +219,7 @@ static inline int arch_within_stack_frames(const void 
 * const stack,
 #ifndef __ASSEMBLY__
 
 #ifdef CONFIG_X86_32
 -#define in_ia32_syscall() true
 +#define in_ia32_syscall() (!(current->flags & PF_KTHREAD))
 #else
 #define in_ia32_syscall() (IS_ENABLED(CONFIG_IA32_EMULATION) && \
   current_thread_info()->status & TS_COMPAT)
 
 This came up in the (no out-of-tree) lustre filesystem where some code
 needs to assume 32-bit mode in X86_32 syscalls, and 64-bit mode in kernel
 threads.
 
>>> 
>>> I could get on board with:
>>> 
>>> ({WARN_ON_ONCE(current->flags & PF_KTHREAD); true})
>>> 
>>> The point of these accessors is to be used *in a syscall*.
>>> 
>>> What on Earth is Lustre doing that makes it have this problem?
>> 
>> Lustre uses it in the ->getattr method to make sure ->ino, ->dev and
>> ->rdev are appropriately sized.  This isn't very different from the
>> usage in ext4 to ensure the seek offset for directories is suitable.
>> 
>> These interfaces can be used both from systemcalls and from kernel
>> threads, such as via nfsd.
>> 
>> I don't *know* if nfsd is the particular kthread that causes problems
>> for lustre.  All I know is that ->getattr returns 32bit squashed inode
>> numbers in kthread context where 64 bit numbers would be expected.
>> 
> 
> Well, that looks like Lustre is copying an ext4 bug.
> 
> Hi ext4 people-
> 
> ext4's is_32bit_api() function is bogus.  You can't use
> in_compat_syscall() unless you know you're in a syscall
> 
> The buggy code was introduced in:
> 
> commit d1f5273e9adb40724a85272f248f210dc4ce919a
> Author: Fan Yong 
> Date:   Sun Mar 18 22:44:40 2012 -0400
> 
>ext4: return 32/64-bit dir name hash according to usage type
> 
> I don't know what the right solution is.  Al, is it legit at all for
> fops->llseek to care about the caller's bitness?  If what ext4 is
> doing is legit, then ISTM the VFS needs to gain a new API to tell
> ->llseek what to do.  But I'm wondering why FMODE_64BITHASH by itself
> isn't sufficient,
> 
> I'm quite tempted to add a warning to the x86 arch code to try to
> catch this type of bug.  Fortunately, a bit of grepping suggests that
> ext4 is the only filesystem with this problem.

We need to know whether the readdir cookie returned to userspace
should be a 32-bit cookie or a 64-bit cookie.  Trying to return
a 64-bit value will result in -EOVERFLOW for a 32-bit application,
but is preferable (if possible) because it reduces the chance of
hash collisions causing readdir to have problems.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH v2 6/5] statx: add STATX_RESULT_MASK flag

2018-10-19 Thread Andreas Dilger

On Oct 19, 2018, at 11:42 AM, Miklos Szeredi  wrote:
>>> +#define STATX_RESULT_MASK STATX__RESERVED
>> 
>> Please don't use that bit.
> 
> Using it internally is perfectly harmless.   If we'll need to extend
> statx in the future and make use of this flag externally, then we can
> easily move the internal flag somewhere else (e.g. extend request_mask
> to 64bit, which we'll probably need to do anyway in that case).

I was thinking about this - what is the point of returning an error
if STATX__RESERVED is set?  If this is used to indicate the presence
of e.g. stx_mask2, then newer applications trying to request any of the
flags encoded into stx_mask2 will get an error, rather than the expected
behaviour of "ignore flags you don't understand, and don't set them in
the return stx_mask".

Essentially, this will make STATX__RESERVED useless in the future, since
no application will be able to use it without getting an error if they
are running on an old kernel.

Cheers, Andreas

signature.asc
Description: Message signed with OpenPGP

Re: [PATCH v2 6/5] statx: add STATX_RESULT_MASK flag

2018-10-19 Thread Andreas Dilger

On Oct 19, 2018, at 11:42 AM, Miklos Szeredi  wrote:
>>> +#define STATX_RESULT_MASK STATX__RESERVED
>> 
>> Please don't use that bit.
> 
> Using it internally is perfectly harmless.   If we'll need to extend
> statx in the future and make use of this flag externally, then we can
> easily move the internal flag somewhere else (e.g. extend request_mask
> to 64bit, which we'll probably need to do anyway in that case).

I was thinking about this - what is the point of returning an error
if STATX__RESERVED is set?  If this is used to indicate the presence
of e.g. stx_mask2, then newer applications trying to request any of the
flags encoded into stx_mask2 will get an error, rather than the expected
behaviour of "ignore flags you don't understand, and don't set them in
the return stx_mask".

Essentially, this will make STATX__RESERVED useless in the future, since
no application will be able to use it without getting an error if they
are running on an old kernel.

Cheers, Andreas

signature.asc
Description: Message signed with OpenPGP

Re: [PATCH v2 1/5] orangefs: fix request_mask misuse

2018-10-19 Thread Andreas Dilger

On Oct 19, 2018, at 6:20 AM, Miklos Szeredi  wrote:
> 
> Orangefs only handles STATX_BASIC_STATS in its getattr implementation, so
> mask off all other flags.  Not doing so results in statx(2) forcing a
> refresh of cached attributes on any other requested flag (i.e. STATX_BTIME
> currently) due to the following test in orangefs_inode_getattr():
> 
> (request_mask & orangefs_inode->getattr_mask) == request_mask
> 
> Also clean up gratuitous uses of STATX_ALL.
> 
> Signed-off-by: Miklos Szeredi 
> Reviewed-by: Martin Brandenburg 
> Cc: Mike Marshall 
> ---
> fs/orangefs/inode.c | 7 ---
> 1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
> index 31932879b716..bd7f15a831dc 100644
> --- a/fs/orangefs/inode.c
> +++ b/fs/orangefs/inode.c
> @@ -256,7 +256,8 @@ int orangefs_getattr(const struct path *path, struct 
> kstat *stat,
>"orangefs_getattr: called on %pd\n",
>path->dentry);
> 
> - ret = orangefs_inode_getattr(inode, 0, 0, request_mask);
> + ret = orangefs_inode_getattr(inode, 0, 0,
> +  request_mask & STATX_BASIC_STATS);

Does it make sense to mask this off at the caller, rather than within
orangefs_inode_getattr()?  Otherwise, orangefs_inode_getattr() will
never see additional stats passed in, even if it is enhanced to return
other values.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH v2 1/5] orangefs: fix request_mask misuse

2018-10-19 Thread Andreas Dilger

On Oct 19, 2018, at 6:20 AM, Miklos Szeredi  wrote:
> 
> Orangefs only handles STATX_BASIC_STATS in its getattr implementation, so
> mask off all other flags.  Not doing so results in statx(2) forcing a
> refresh of cached attributes on any other requested flag (i.e. STATX_BTIME
> currently) due to the following test in orangefs_inode_getattr():
> 
> (request_mask & orangefs_inode->getattr_mask) == request_mask
> 
> Also clean up gratuitous uses of STATX_ALL.
> 
> Signed-off-by: Miklos Szeredi 
> Reviewed-by: Martin Brandenburg 
> Cc: Mike Marshall 
> ---
> fs/orangefs/inode.c | 7 ---
> 1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
> index 31932879b716..bd7f15a831dc 100644
> --- a/fs/orangefs/inode.c
> +++ b/fs/orangefs/inode.c
> @@ -256,7 +256,8 @@ int orangefs_getattr(const struct path *path, struct 
> kstat *stat,
>"orangefs_getattr: called on %pd\n",
>path->dentry);
> 
> - ret = orangefs_inode_getattr(inode, 0, 0, request_mask);
> + ret = orangefs_inode_getattr(inode, 0, 0,
> +  request_mask & STATX_BASIC_STATS);

Does it make sense to mask this off at the caller, rather than within
orangefs_inode_getattr()?  Otherwise, orangefs_inode_getattr() will
never see additional stats passed in, even if it is enhanced to return
other values.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH 2/3] uapi: get rid of STATX_ALL

2018-10-18 Thread Andreas Dilger

On Oct 18, 2018, at 7:11 AM, Miklos Szeredi  wrote:
> 
> Constants of the *_ALL type can be actively harmful due to the fact that
> developers will usually fail to consider the possible effects of future
> changes to the definition.
> 
> Remove STATX_ALL from the uapi, while no damage has been done yet.
> 
> We could keep something like this around in the kernel, but there's
> actually no point, since all filesystems should be explicitly checking
> flags that they support and not rely on the VFS masking unknown ones out: a
> flag could be known to the VFS, yet not known to the filesystem (see
> orangefs bug fixed in the previous patch).

What is actually strange is that the vfs_getattr_nosec() code is setting

stat->result_mask = STATX_BASIC_STATS;

in the code before any of the filesystem ->getattr methods are called.
That means it is up to the filesystems to actively *clear* flags from
the result_mask as opposed to only *setting* flags for fields that it
is filling in.  That seems a bit backward to me?

It looks like NFS is _probably_ doing the right thing, though it is still
using the userspace-supplied request_mask as a mask for the bits being
returned,  but the saving grace is that result_mask is STATX_BASIC_STATS
set by vfs_getattr_nosec() AFAICS.

Looking at OCFS2, CIFS, GFS2, they are doing a full inode revalidation
and returning the basic stats without looking at flags, request_mask,
or result_mask at all, so I'd expect they could be more efficient
(i.e. not revalidating the inode and possibly doing an RPC at all if
only some minimal flags are requested, or if AT_STATX_FORCE_SYNC is
not set).

It looks like overlayfs, nfsd, and fuse at least (as statx callers)
are often requesting a small subset of flags (e.g. STATX_INO,
STATX_NLINK, STATX_ATIME | STATX_MTIME, etc.) so they could be more
efficient if the underlying filesystems did only what was asked.

Cheers, Andreas

> 
> Signed-off-by: Miklos Szeredi 
> Cc: David Howells 
> Cc: Michael Kerrisk 
> ---
> fs/stat.c   | 1 -
> include/uapi/linux/stat.h   | 2 +-
> samples/statx/test-statx.c  | 2 +-
> tools/include/uapi/linux/stat.h | 2 +-
> 4 files changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/stat.c b/fs/stat.c
> index f8e6fb2c3657..8d297a279991 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -73,7 +73,6 @@ int vfs_getattr_nosec(const struct path *path, struct kstat 
> *stat,
> 
>   memset(stat, 0, sizeof(*stat));
>   stat->result_mask |= STATX_BASIC_STATS;
> - request_mask &= STATX_ALL;
>   query_flags &= KSTAT_QUERY_FLAGS;
>   if (inode->i_op->getattr)
>   return inode->i_op->getattr(path, stat, request_mask,
> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> index 7b35e98d3c58..370f09d92fa6 100644
> --- a/include/uapi/linux/stat.h
> +++ b/include/uapi/linux/stat.h
> @@ -148,7 +148,7 @@ struct statx {
> #define STATX_BLOCKS  0x0400U /* Want/got stx_blocks */
> #define STATX_BASIC_STATS 0x07ffU /* The stuff in the normal stat 
> struct */
> #define STATX_BTIME   0x0800U /* Want/got stx_btime */
> -#define STATX_ALL0x0fffU /* All currently supported 
> flags */
> +
> #define STATX__RESERVED   0x8000U /* Reserved for future 
> struct statx expansion */
> 
> /*
> diff --git a/samples/statx/test-statx.c b/samples/statx/test-statx.c
> index d4d77b09412c..e354048dea3c 100644
> --- a/samples/statx/test-statx.c
> +++ b/samples/statx/test-statx.c
> @@ -211,7 +211,7 @@ int main(int argc, char **argv)
>   struct statx stx;
>   int ret, raw = 0, atflag = AT_SYMLINK_NOFOLLOW;
> 
> - unsigned int mask = STATX_ALL;
> + unsigned int mask = STATX_BASIC_STATS | STATX_BTIME;
> 
>   for (argv++; *argv; argv++) {
>   if (strcmp(*argv, "-F") == 0) {
> diff --git a/tools/include/uapi/linux/stat.h b/tools/include/uapi/linux/stat.h
> index 7b35e98d3c58..370f09d92fa6 100644
> --- a/tools/include/uapi/linux/stat.h
> +++ b/tools/include/uapi/linux/stat.h
> @@ -148,7 +148,7 @@ struct statx {
> #define STATX_BLOCKS  0x0400U /* Want/got stx_blocks */
> #define STATX_BASIC_STATS 0x07ffU /* The stuff in the normal stat 
> struct */
> #define STATX_BTIME   0x0800U /* Want/got stx_btime */
> -#define STATX_ALL0x0fffU /* All currently supported 
> flags */
> +
> #define STATX__RESERVED   0x8000U /* Reserved for future 
> struct statx expansion */
> 
> /*
> --
> 2.14.3
> 


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH 2/3] uapi: get rid of STATX_ALL

2018-10-18 Thread Andreas Dilger

On Oct 18, 2018, at 7:11 AM, Miklos Szeredi  wrote:
> 
> Constants of the *_ALL type can be actively harmful due to the fact that
> developers will usually fail to consider the possible effects of future
> changes to the definition.
> 
> Remove STATX_ALL from the uapi, while no damage has been done yet.
> 
> We could keep something like this around in the kernel, but there's
> actually no point, since all filesystems should be explicitly checking
> flags that they support and not rely on the VFS masking unknown ones out: a
> flag could be known to the VFS, yet not known to the filesystem (see
> orangefs bug fixed in the previous patch).

What is actually strange is that the vfs_getattr_nosec() code is setting

stat->result_mask = STATX_BASIC_STATS;

in the code before any of the filesystem ->getattr methods are called.
That means it is up to the filesystems to actively *clear* flags from
the result_mask as opposed to only *setting* flags for fields that it
is filling in.  That seems a bit backward to me?

It looks like NFS is _probably_ doing the right thing, though it is still
using the userspace-supplied request_mask as a mask for the bits being
returned,  but the saving grace is that result_mask is STATX_BASIC_STATS
set by vfs_getattr_nosec() AFAICS.

Looking at OCFS2, CIFS, GFS2, they are doing a full inode revalidation
and returning the basic stats without looking at flags, request_mask,
or result_mask at all, so I'd expect they could be more efficient
(i.e. not revalidating the inode and possibly doing an RPC at all if
only some minimal flags are requested, or if AT_STATX_FORCE_SYNC is
not set).

It looks like overlayfs, nfsd, and fuse at least (as statx callers)
are often requesting a small subset of flags (e.g. STATX_INO,
STATX_NLINK, STATX_ATIME | STATX_MTIME, etc.) so they could be more
efficient if the underlying filesystems did only what was asked.

Cheers, Andreas

> 
> Signed-off-by: Miklos Szeredi 
> Cc: David Howells 
> Cc: Michael Kerrisk 
> ---
> fs/stat.c   | 1 -
> include/uapi/linux/stat.h   | 2 +-
> samples/statx/test-statx.c  | 2 +-
> tools/include/uapi/linux/stat.h | 2 +-
> 4 files changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/stat.c b/fs/stat.c
> index f8e6fb2c3657..8d297a279991 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -73,7 +73,6 @@ int vfs_getattr_nosec(const struct path *path, struct kstat 
> *stat,
> 
>   memset(stat, 0, sizeof(*stat));
>   stat->result_mask |= STATX_BASIC_STATS;
> - request_mask &= STATX_ALL;
>   query_flags &= KSTAT_QUERY_FLAGS;
>   if (inode->i_op->getattr)
>   return inode->i_op->getattr(path, stat, request_mask,
> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> index 7b35e98d3c58..370f09d92fa6 100644
> --- a/include/uapi/linux/stat.h
> +++ b/include/uapi/linux/stat.h
> @@ -148,7 +148,7 @@ struct statx {
> #define STATX_BLOCKS  0x0400U /* Want/got stx_blocks */
> #define STATX_BASIC_STATS 0x07ffU /* The stuff in the normal stat 
> struct */
> #define STATX_BTIME   0x0800U /* Want/got stx_btime */
> -#define STATX_ALL0x0fffU /* All currently supported 
> flags */
> +
> #define STATX__RESERVED   0x8000U /* Reserved for future 
> struct statx expansion */
> 
> /*
> diff --git a/samples/statx/test-statx.c b/samples/statx/test-statx.c
> index d4d77b09412c..e354048dea3c 100644
> --- a/samples/statx/test-statx.c
> +++ b/samples/statx/test-statx.c
> @@ -211,7 +211,7 @@ int main(int argc, char **argv)
>   struct statx stx;
>   int ret, raw = 0, atflag = AT_SYMLINK_NOFOLLOW;
> 
> - unsigned int mask = STATX_ALL;
> + unsigned int mask = STATX_BASIC_STATS | STATX_BTIME;
> 
>   for (argv++; *argv; argv++) {
>   if (strcmp(*argv, "-F") == 0) {
> diff --git a/tools/include/uapi/linux/stat.h b/tools/include/uapi/linux/stat.h
> index 7b35e98d3c58..370f09d92fa6 100644
> --- a/tools/include/uapi/linux/stat.h
> +++ b/tools/include/uapi/linux/stat.h
> @@ -148,7 +148,7 @@ struct statx {
> #define STATX_BLOCKS  0x0400U /* Want/got stx_blocks */
> #define STATX_BASIC_STATS 0x07ffU /* The stuff in the normal stat 
> struct */
> #define STATX_BTIME   0x0800U /* Want/got stx_btime */
> -#define STATX_ALL0x0fffU /* All currently supported 
> flags */
> +
> #define STATX__RESERVED   0x8000U /* Reserved for future 
> struct statx expansion */
> 
> /*
> --
> 2.14.3
> 


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH 3/3] statx: add STATX_ATTRIBUTES flag

2018-10-18 Thread Andreas Dilger

On Oct 18, 2018, at 7:11 AM, Miklos Szeredi  wrote:
> 
> FUSE will want to know if stx_attributes is interesting or not, because
> there's a non-zero cost of retreiving it.
> 
> This is more of a "want" flag, since stx_attributes_mask already indicates
> whether we "got" stx_attributes or not.  So for now we can just deal with
> this flag in the generic code.
> 
> Signed-off-by: Miklos Szeredi 
> Cc: David Howells 
> Cc: Michael Kerrisk 

Reviewed-by: Andreas Dilger 

> ---
> fs/stat.c   | 3 +++
> include/uapi/linux/stat.h   | 1 +
> samples/statx/test-statx.c  | 2 +-
> tools/include/uapi/linux/stat.h | 1 +
> 4 files changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/stat.c b/fs/stat.c
> index 8d297a279991..6bf86d57e2e3 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -535,6 +535,9 @@ cp_statx(const struct kstat *stat, struct statx __user 
> *buffer)
>   tmp.stx_size = stat->size;
>   tmp.stx_blocks = stat->blocks;
>   tmp.stx_attributes_mask = stat->attributes_mask;
> + /* Having anything in attributes_mask means attributes are valid. */
> + if (tmp.stx_attributes_mask)
> + tmp.stx_mask |= STATX_ATTRIBUTES;
>   tmp.stx_atime.tv_sec = stat->atime.tv_sec;
>   tmp.stx_atime.tv_nsec = stat->atime.tv_nsec;
>   tmp.stx_btime.tv_sec = stat->btime.tv_sec;
> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> index 370f09d92fa6..aef0aba5dfe7 100644
> --- a/include/uapi/linux/stat.h
> +++ b/include/uapi/linux/stat.h
> @@ -148,6 +148,7 @@ struct statx {
> #define STATX_BLOCKS  0x0400U /* Want/got stx_blocks */
> #define STATX_BASIC_STATS 0x07ffU /* The stuff in the normal stat 
> struct */
> #define STATX_BTIME   0x0800U /* Want/got stx_btime */
> +#define STATX_ATTRIBUTES 0x1000U /* Want/got stx_attributes */
> 
> #define STATX__RESERVED   0x8000U /* Reserved for future 
> struct statx expansion */
> 
> diff --git a/samples/statx/test-statx.c b/samples/statx/test-statx.c
> index e354048dea3c..deef9a68ff68 100644
> --- a/samples/statx/test-statx.c
> +++ b/samples/statx/test-statx.c
> @@ -211,7 +211,7 @@ int main(int argc, char **argv)
>   struct statx stx;
>   int ret, raw = 0, atflag = AT_SYMLINK_NOFOLLOW;
> 
> - unsigned int mask = STATX_BASIC_STATS | STATX_BTIME;
> + unsigned int mask = STATX_BASIC_STATS | STATX_BTIME | STATX_ATTRIBUTES;
> 
>   for (argv++; *argv; argv++) {
>   if (strcmp(*argv, "-F") == 0) {
> diff --git a/tools/include/uapi/linux/stat.h b/tools/include/uapi/linux/stat.h
> index 370f09d92fa6..aef0aba5dfe7 100644
> --- a/tools/include/uapi/linux/stat.h
> +++ b/tools/include/uapi/linux/stat.h
> @@ -148,6 +148,7 @@ struct statx {
> #define STATX_BLOCKS  0x0400U /* Want/got stx_blocks */
> #define STATX_BASIC_STATS 0x07ffU /* The stuff in the normal stat 
> struct */
> #define STATX_BTIME   0x0800U /* Want/got stx_btime */
> +#define STATX_ATTRIBUTES 0x1000U /* Want/got stx_attributes */
> 
> #define STATX__RESERVED   0x8000U /* Reserved for future 
> struct statx expansion */
> 
> --
> 2.14.3
> 


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH 3/3] statx: add STATX_ATTRIBUTES flag

2018-10-18 Thread Andreas Dilger

On Oct 18, 2018, at 7:11 AM, Miklos Szeredi  wrote:
> 
> FUSE will want to know if stx_attributes is interesting or not, because
> there's a non-zero cost of retreiving it.
> 
> This is more of a "want" flag, since stx_attributes_mask already indicates
> whether we "got" stx_attributes or not.  So for now we can just deal with
> this flag in the generic code.
> 
> Signed-off-by: Miklos Szeredi 
> Cc: David Howells 
> Cc: Michael Kerrisk 

Reviewed-by: Andreas Dilger 

> ---
> fs/stat.c   | 3 +++
> include/uapi/linux/stat.h   | 1 +
> samples/statx/test-statx.c  | 2 +-
> tools/include/uapi/linux/stat.h | 1 +
> 4 files changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/stat.c b/fs/stat.c
> index 8d297a279991..6bf86d57e2e3 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -535,6 +535,9 @@ cp_statx(const struct kstat *stat, struct statx __user 
> *buffer)
>   tmp.stx_size = stat->size;
>   tmp.stx_blocks = stat->blocks;
>   tmp.stx_attributes_mask = stat->attributes_mask;
> + /* Having anything in attributes_mask means attributes are valid. */
> + if (tmp.stx_attributes_mask)
> + tmp.stx_mask |= STATX_ATTRIBUTES;
>   tmp.stx_atime.tv_sec = stat->atime.tv_sec;
>   tmp.stx_atime.tv_nsec = stat->atime.tv_nsec;
>   tmp.stx_btime.tv_sec = stat->btime.tv_sec;
> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> index 370f09d92fa6..aef0aba5dfe7 100644
> --- a/include/uapi/linux/stat.h
> +++ b/include/uapi/linux/stat.h
> @@ -148,6 +148,7 @@ struct statx {
> #define STATX_BLOCKS  0x0400U /* Want/got stx_blocks */
> #define STATX_BASIC_STATS 0x07ffU /* The stuff in the normal stat 
> struct */
> #define STATX_BTIME   0x0800U /* Want/got stx_btime */
> +#define STATX_ATTRIBUTES 0x1000U /* Want/got stx_attributes */
> 
> #define STATX__RESERVED   0x8000U /* Reserved for future 
> struct statx expansion */
> 
> diff --git a/samples/statx/test-statx.c b/samples/statx/test-statx.c
> index e354048dea3c..deef9a68ff68 100644
> --- a/samples/statx/test-statx.c
> +++ b/samples/statx/test-statx.c
> @@ -211,7 +211,7 @@ int main(int argc, char **argv)
>   struct statx stx;
>   int ret, raw = 0, atflag = AT_SYMLINK_NOFOLLOW;
> 
> - unsigned int mask = STATX_BASIC_STATS | STATX_BTIME;
> + unsigned int mask = STATX_BASIC_STATS | STATX_BTIME | STATX_ATTRIBUTES;
> 
>   for (argv++; *argv; argv++) {
>   if (strcmp(*argv, "-F") == 0) {
> diff --git a/tools/include/uapi/linux/stat.h b/tools/include/uapi/linux/stat.h
> index 370f09d92fa6..aef0aba5dfe7 100644
> --- a/tools/include/uapi/linux/stat.h
> +++ b/tools/include/uapi/linux/stat.h
> @@ -148,6 +148,7 @@ struct statx {
> #define STATX_BLOCKS  0x0400U /* Want/got stx_blocks */
> #define STATX_BASIC_STATS 0x07ffU /* The stuff in the normal stat 
> struct */
> #define STATX_BTIME   0x0800U /* Want/got stx_btime */
> +#define STATX_ATTRIBUTES 0x1000U /* Want/got stx_attributes */
> 
> #define STATX__RESERVED   0x8000U /* Reserved for future 
> struct statx expansion */
> 
> --
> 2.14.3
> 


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH 2/3] uapi: get rid of STATX_ALL

2018-10-18 Thread Andreas Dilger

> On Oct 18, 2018, at 7:15 AM, Florian Weimer  wrote:
> 
> * Miklos Szeredi:
> 
>> #define STATX__RESERVED  0x8000U /* Reserved for future 
>> struct statx expansion */
> 
> What about this?  Isn't it similar to STATX_ALL in the sense that we
> don't know yet what it will mean?

No, this means that this last bit will be used for increasing the size of the
stx_mask field at some point in the future.  IMHO, this is present as a reminder
to any developer who is adding fields in the future not to use the last flag.

Cheers, Andreas

signature.asc
Description: Message signed with OpenPGP

Re: [PATCH 2/3] uapi: get rid of STATX_ALL

2018-10-18 Thread Andreas Dilger

> On Oct 18, 2018, at 7:15 AM, Florian Weimer  wrote:
> 
> * Miklos Szeredi:
> 
>> #define STATX__RESERVED  0x8000U /* Reserved for future 
>> struct statx expansion */
> 
> What about this?  Isn't it similar to STATX_ALL in the sense that we
> don't know yet what it will mean?

No, this means that this last bit will be used for increasing the size of the
stx_mask field at some point in the future.  IMHO, this is present as a reminder
to any developer who is adding fields in the future not to use the last flag.

Cheers, Andreas

signature.asc
Description: Message signed with OpenPGP

Re: [PATCH] ext4: direct return when jinode allocate failed

2018-10-17 Thread Andreas Dilger


> On Oct 16, 2018, at 8:26 PM, liu.son...@zte.com.cn wrote:
> 
>> On Tue, Oct 16, 2018 at 10:55:26PM +0800, fishland wrote:
>>> The jinode does not need protected by *i_lock*, we can return
>>> directly if memory allocation fails.
>>> 
>> 
>> I don't see anything wrong with this patch, but at the same time, I'm
>> not sure I see the benefit, either.  Checking for the allocation
>> failure is cheap, and moving it out spinlock doesn't buy much; not to
>> mention that the allocation failure is going to be highly uncommon.
>> 
>> What inspired this commit?
>> 
>> - Ted
> 
> Hi, Ted
> I found this during code reading. All code logic is very smooth, when
> reading here, need to spin_unlock before "return -ENOMEM", which I
> feel this could more close to the allocate. I refer to other processing
> logic for this situation in "inode.c", found that they all return immediately
> after memory application failed. I agree with you that the benefit of this
> patch is almost none, however it can slightly improve the code in rare case.

Looking at the patch there are two effects that it has:
- optimize a very rare case where there is an allocation failure before locking
- return an unnecessary error if "ei->jinode" is allocated before locking

I don't think it is worthwhile to optimize this case, since allocation failures
will have a serious impact on the application, and I'd rather avoid the rare
case where we don't return an unnecessary error than make the error case faster.

Cheers, Andreas

> 
>>> Signed-off-by: Liu Song 
>>> Reviewed-by: Wang Yi 
>>> ---
>>> fs/ext4/inode.c | 7 +++
>>> 1 file changed, 3 insertions(+), 4 deletions(-)
>>> 
>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>> index d767e993591d..67ba6f062de5 100644
>>> --- a/fs/ext4/inode.c
>>> +++ b/fs/ext4/inode.c
>>> @@ -4384,12 +4384,11 @@ int ext4_inode_attach_jinode(struct inode *inode)
>>> return 0;
>>> 
>>> jinode = jbd2_alloc_inode(GFP_KERNEL);
>>> +   if (!jinode)
>>> +   return -ENOMEM;
>>> +
>>> spin_lock(>i_lock);
>>> if (!ei->jinode) {
>>> -   if (!jinode) {
>>> -   spin_unlock(>i_lock);
>>> -   return -ENOMEM;
>>> -   }
>>> ei->jinode = jinode;
>>> jbd2_journal_init_jbd_inode(ei->jinode, inode);
>>> jinode = NULL;
>>> --
>>> 2.17.1


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH] ext4: direct return when jinode allocate failed

2018-10-17 Thread Andreas Dilger


> On Oct 16, 2018, at 8:26 PM, liu.son...@zte.com.cn wrote:
> 
>> On Tue, Oct 16, 2018 at 10:55:26PM +0800, fishland wrote:
>>> The jinode does not need protected by *i_lock*, we can return
>>> directly if memory allocation fails.
>>> 
>> 
>> I don't see anything wrong with this patch, but at the same time, I'm
>> not sure I see the benefit, either.  Checking for the allocation
>> failure is cheap, and moving it out spinlock doesn't buy much; not to
>> mention that the allocation failure is going to be highly uncommon.
>> 
>> What inspired this commit?
>> 
>> - Ted
> 
> Hi, Ted
> I found this during code reading. All code logic is very smooth, when
> reading here, need to spin_unlock before "return -ENOMEM", which I
> feel this could more close to the allocate. I refer to other processing
> logic for this situation in "inode.c", found that they all return immediately
> after memory application failed. I agree with you that the benefit of this
> patch is almost none, however it can slightly improve the code in rare case.

Looking at the patch there are two effects that it has:
- optimize a very rare case where there is an allocation failure before locking
- return an unnecessary error if "ei->jinode" is allocated before locking

I don't think it is worthwhile to optimize this case, since allocation failures
will have a serious impact on the application, and I'd rather avoid the rare
case where we don't return an unnecessary error than make the error case faster.

Cheers, Andreas

> 
>>> Signed-off-by: Liu Song 
>>> Reviewed-by: Wang Yi 
>>> ---
>>> fs/ext4/inode.c | 7 +++
>>> 1 file changed, 3 insertions(+), 4 deletions(-)
>>> 
>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>> index d767e993591d..67ba6f062de5 100644
>>> --- a/fs/ext4/inode.c
>>> +++ b/fs/ext4/inode.c
>>> @@ -4384,12 +4384,11 @@ int ext4_inode_attach_jinode(struct inode *inode)
>>> return 0;
>>> 
>>> jinode = jbd2_alloc_inode(GFP_KERNEL);
>>> +   if (!jinode)
>>> +   return -ENOMEM;
>>> +
>>> spin_lock(>i_lock);
>>> if (!ei->jinode) {
>>> -   if (!jinode) {
>>> -   spin_unlock(>i_lock);
>>> -   return -ENOMEM;
>>> -   }
>>> ei->jinode = jinode;
>>> jbd2_journal_init_jbd_inode(ei->jinode, inode);
>>> jinode = NULL;
>>> --
>>> 2.17.1


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: statx(2) API and documentation

2018-10-17 Thread Andreas Dilger

On Oct 17, 2018, at 1:04 PM, Miklos Szeredi  wrote:
> 
> On Wed, Oct 17, 2018 at 8:45 PM, Andreas Dilger  wrote:
>> On Oct 17, 2018, at 12:24 PM, Miklos Szeredi  wrote:
>>> 
>>> I'm trying to implement statx for fuse and ran into the following issues:
>>> 
>>> - Need a STATX_ATTRIBUTES bit, so that userspace can explicitly ask
>>> for stx_attribute; otherwise if querying has non-zero cost, then
>>> filesystem cannot do it without regressing performance.
>> 
>> Seems reasonable.
>> 
>>> - STATX_ALL definition is unclear, can this change, or is it fixed?
>>> If it's the former, than that's a backward compatibility nightmare.
>>> If it's the latter, then what's the point?
>> 
>> The value can change over time.  It is intended to reflect the current
>> state of affairs at the time the userspace program and kernel are compiled.
>> The value sent from userspace lets the kernel know what fields are in
>> the userspace struct, so it doesn't try to set fields that aren't there.
> 
> What's the point of a userspace program specifying STATX_ALL?  Without
> a way to programmatically query the interface definition it's useless:
> there's no way to guess which mask bit corresponds to which field, and
> what that field represents.
> 
> And there will be programs out there which specify STATX_ALL without
> considering that in the future it may become slower as it is now due
> to a recompile.

The whole point of statx is that applications should only be requesting
fields that they care about, so "ls --color=never" shouldn't be asking for
the file mode/size/etc., and size/blocks should only be requested with
"ls -ls", etc.

> So what's the point exactly?

Ah, I see your point...  STATX_ALL seems to be mostly useful for the kernel
to mask off flags that it doesn't currently understand.  It doesn't make
much sense for applications to specify STATX_ALL, since they don't have any
way to know what each flag means unless they are hard-coded to check each of
the STATX_* flags individually.  They should build up a mask of STATX_* flags
based on what they care about (e.g. "find" should only request attributes
based on the command-line options given).

>> The value in the kernel allows masking off new fields from userspace that
>> it doesn't understand.
> 
> Okay, but that has nothing to do with the UAPI.  Even as an internal
> flag we should be careful, as it might grow uses which can have
> similar issues as the userspace one above.
> 
>>> - STATX_ATIME is cleared from stx_mask on SB_RDONLY, and on NFS it is
>>> also cleared on MNT_NOATIME, but not on MNT_RDONLY.  We need some sort
>>> of guideline in the documentation about what constitutes
>>> "unsupported": does atime become unsupported because filesystem is
>>> remounted r/o?  If so, why isn't this case handled consistently in the
>>> VFS and filesystems?
>> 
>> Strange.  I'd think that if userspace is requesting atime, it should
>> get an atime value.  The fact that the kernel is not updating atime
>> due to mount options just means that atime might be old.  That doesn't
>> mean (IMHO) that atime doesn't exist.
> 
> Right, OTOH I sort of see the value in  NFS: no roundtrip to server if
> atime value is useless anyway.

Well, "noatime" might be a client-only option, which doesn't mean that the
server or some other client isn't updating the atime.  Returning an old
atime isn't the same as not returning anything at all.

>>> - What about fields that are not cached when statx() is called with
>>> AT_STATX_DONT_SYNC?  E.g. stx_btime is supported by the filesystem,
>>> but getting it requires a roundtrip to the server.  Requesting
>>> STATX_BTIME in the mask and adding  AT_STATX_DONT_SYNC to the flags
>>> means the filesystem has to decide which it will honor.   My feeling
>>> is that it should honor AT_STATX_DONT_SYNC and clear STATX_BTIME in
>>> stx_mask.   Documentation has no word about this case.
>> 
>> The btime value shouldn't change over the lifetime of a file, so
>> DONT_SYNC shouldn't have any effect on its validity?
> 
> Not validity, but presence in the cache.  Network filesystem or fuse
> may or may not have it available in the cache at the time the inode is
> first initialized on loookup.  So when statx(..., AT_STATX_DONT_SYNC,
> STATX_BTIME) comes along, it may need a roundtrip to the server.  What
> should it do?

To my thinking, if the user/app requests btime/ctime/size/blocks the kernel
should return these fields.  If DONT_SYNC is set, it means the kernel can
return potentially stale values, but if the kernel doesn't have *any*
values for these fields at all it still should query the server instead of
just returning random garbage.

That said, it seems that it isn't even possible to get into nfs_getattr()
without the VFS having instantiated a dentry and inode (i.e. queried the
server via nfs_lookup() at some point), so the inode fields have already
been filled with *something*.

Cheers, Andreas

signature.asc
Description: Message signed with OpenPGP

Re: statx(2) API and documentation

2018-10-17 Thread Andreas Dilger

On Oct 17, 2018, at 1:04 PM, Miklos Szeredi  wrote:
> 
> On Wed, Oct 17, 2018 at 8:45 PM, Andreas Dilger  wrote:
>> On Oct 17, 2018, at 12:24 PM, Miklos Szeredi  wrote:
>>> 
>>> I'm trying to implement statx for fuse and ran into the following issues:
>>> 
>>> - Need a STATX_ATTRIBUTES bit, so that userspace can explicitly ask
>>> for stx_attribute; otherwise if querying has non-zero cost, then
>>> filesystem cannot do it without regressing performance.
>> 
>> Seems reasonable.
>> 
>>> - STATX_ALL definition is unclear, can this change, or is it fixed?
>>> If it's the former, than that's a backward compatibility nightmare.
>>> If it's the latter, then what's the point?
>> 
>> The value can change over time.  It is intended to reflect the current
>> state of affairs at the time the userspace program and kernel are compiled.
>> The value sent from userspace lets the kernel know what fields are in
>> the userspace struct, so it doesn't try to set fields that aren't there.
> 
> What's the point of a userspace program specifying STATX_ALL?  Without
> a way to programmatically query the interface definition it's useless:
> there's no way to guess which mask bit corresponds to which field, and
> what that field represents.
> 
> And there will be programs out there which specify STATX_ALL without
> considering that in the future it may become slower as it is now due
> to a recompile.

The whole point of statx is that applications should only be requesting
fields that they care about, so "ls --color=never" shouldn't be asking for
the file mode/size/etc., and size/blocks should only be requested with
"ls -ls", etc.

> So what's the point exactly?

Ah, I see your point...  STATX_ALL seems to be mostly useful for the kernel
to mask off flags that it doesn't currently understand.  It doesn't make
much sense for applications to specify STATX_ALL, since they don't have any
way to know what each flag means unless they are hard-coded to check each of
the STATX_* flags individually.  They should build up a mask of STATX_* flags
based on what they care about (e.g. "find" should only request attributes
based on the command-line options given).

>> The value in the kernel allows masking off new fields from userspace that
>> it doesn't understand.
> 
> Okay, but that has nothing to do with the UAPI.  Even as an internal
> flag we should be careful, as it might grow uses which can have
> similar issues as the userspace one above.
> 
>>> - STATX_ATIME is cleared from stx_mask on SB_RDONLY, and on NFS it is
>>> also cleared on MNT_NOATIME, but not on MNT_RDONLY.  We need some sort
>>> of guideline in the documentation about what constitutes
>>> "unsupported": does atime become unsupported because filesystem is
>>> remounted r/o?  If so, why isn't this case handled consistently in the
>>> VFS and filesystems?
>> 
>> Strange.  I'd think that if userspace is requesting atime, it should
>> get an atime value.  The fact that the kernel is not updating atime
>> due to mount options just means that atime might be old.  That doesn't
>> mean (IMHO) that atime doesn't exist.
> 
> Right, OTOH I sort of see the value in  NFS: no roundtrip to server if
> atime value is useless anyway.

Well, "noatime" might be a client-only option, which doesn't mean that the
server or some other client isn't updating the atime.  Returning an old
atime isn't the same as not returning anything at all.

>>> - What about fields that are not cached when statx() is called with
>>> AT_STATX_DONT_SYNC?  E.g. stx_btime is supported by the filesystem,
>>> but getting it requires a roundtrip to the server.  Requesting
>>> STATX_BTIME in the mask and adding  AT_STATX_DONT_SYNC to the flags
>>> means the filesystem has to decide which it will honor.   My feeling
>>> is that it should honor AT_STATX_DONT_SYNC and clear STATX_BTIME in
>>> stx_mask.   Documentation has no word about this case.
>> 
>> The btime value shouldn't change over the lifetime of a file, so
>> DONT_SYNC shouldn't have any effect on its validity?
> 
> Not validity, but presence in the cache.  Network filesystem or fuse
> may or may not have it available in the cache at the time the inode is
> first initialized on loookup.  So when statx(..., AT_STATX_DONT_SYNC,
> STATX_BTIME) comes along, it may need a roundtrip to the server.  What
> should it do?

To my thinking, if the user/app requests btime/ctime/size/blocks the kernel
should return these fields.  If DONT_SYNC is set, it means the kernel can
return potentially stale values, but if the kernel doesn't have *any*
values for these fields at all it still should query the server instead of
just returning random garbage.

That said, it seems that it isn't even possible to get into nfs_getattr()
without the VFS having instantiated a dentry and inode (i.e. queried the
server via nfs_lookup() at some point), so the inode fields have already
been filled with *something*.

Cheers, Andreas

signature.asc
Description: Message signed with OpenPGP

Re: statx(2) API and documentation

2018-10-17 Thread Andreas Dilger

On Oct 17, 2018, at 12:24 PM, Miklos Szeredi  wrote:
> 
> I'm trying to implement statx for fuse and ran into the following issues:
> 
> - Need a STATX_ATTRIBUTES bit, so that userspace can explicitly ask
> for stx_attribute; otherwise if querying has non-zero cost, then
> filesystem cannot do it without regressing performance.

Seems reasonable.

> - STATX_ALL definition is unclear, can this change, or is it fixed?
> If it's the former, than that's a backward compatibility nightmare.
> If it's the latter, then what's the point?

The value can change over time.  It is intended to reflect the current
state of affairs at the time the userspace program and kernel are compiled.
The value sent from userspace lets the kernel know what fields are in
the userspace struct, so it doesn't try to set fields that aren't there.
The value in the kernel allows masking off new fields from userspace that
it doesn't understand.

> - STATX_ATIME is cleared from stx_mask on SB_RDONLY, and on NFS it is
> also cleared on MNT_NOATIME, but not on MNT_RDONLY.  We need some sort
> of guideline in the documentation about what constitutes
> "unsupported": does atime become unsupported because filesystem is
> remounted r/o?  If so, why isn't this case handled consistently in the
> VFS and filesystems?

Strange.  I'd think that if userspace is requesting atime, it should
get an atime value.  The fact that the kernel is not updating atime
due to mount options just means that atime might be old.  That doesn't
mean (IMHO) that atime doesn't exist.

> - What about fields that are not cached when statx() is called with
> AT_STATX_DONT_SYNC?  E.g. stx_btime is supported by the filesystem,
> but getting it requires a roundtrip to the server.  Requesting
> STATX_BTIME in the mask and adding  AT_STATX_DONT_SYNC to the flags
> means the filesystem has to decide which it will honor.   My feeling
> is that it should honor AT_STATX_DONT_SYNC and clear STATX_BTIME in
> stx_mask.   Documentation has no word about this case.

The btime value shouldn't change over the lifetime of a file, so
DONT_SYNC shouldn't have any effect on its validity?

I think DONT_SYNC applies to values like size, blocks, mtime, ctime
that may change rapidly over the lifetime of the file, so a totally
up-to-date value is not needed from the server in that case.

Cheers, Andreas

signature.asc
Description: Message signed with OpenPGP

Re: statx(2) API and documentation

2018-10-17 Thread Andreas Dilger

On Oct 17, 2018, at 12:24 PM, Miklos Szeredi  wrote:
> 
> I'm trying to implement statx for fuse and ran into the following issues:
> 
> - Need a STATX_ATTRIBUTES bit, so that userspace can explicitly ask
> for stx_attribute; otherwise if querying has non-zero cost, then
> filesystem cannot do it without regressing performance.

Seems reasonable.

> - STATX_ALL definition is unclear, can this change, or is it fixed?
> If it's the former, than that's a backward compatibility nightmare.
> If it's the latter, then what's the point?

The value can change over time.  It is intended to reflect the current
state of affairs at the time the userspace program and kernel are compiled.
The value sent from userspace lets the kernel know what fields are in
the userspace struct, so it doesn't try to set fields that aren't there.
The value in the kernel allows masking off new fields from userspace that
it doesn't understand.

> - STATX_ATIME is cleared from stx_mask on SB_RDONLY, and on NFS it is
> also cleared on MNT_NOATIME, but not on MNT_RDONLY.  We need some sort
> of guideline in the documentation about what constitutes
> "unsupported": does atime become unsupported because filesystem is
> remounted r/o?  If so, why isn't this case handled consistently in the
> VFS and filesystems?

Strange.  I'd think that if userspace is requesting atime, it should
get an atime value.  The fact that the kernel is not updating atime
due to mount options just means that atime might be old.  That doesn't
mean (IMHO) that atime doesn't exist.

> - What about fields that are not cached when statx() is called with
> AT_STATX_DONT_SYNC?  E.g. stx_btime is supported by the filesystem,
> but getting it requires a roundtrip to the server.  Requesting
> STATX_BTIME in the mask and adding  AT_STATX_DONT_SYNC to the flags
> means the filesystem has to decide which it will honor.   My feeling
> is that it should honor AT_STATX_DONT_SYNC and clear STATX_BTIME in
> stx_mask.   Documentation has no word about this case.

The btime value shouldn't change over the lifetime of a file, so
DONT_SYNC shouldn't have any effect on its validity?

I think DONT_SYNC applies to values like size, blocks, mtime, ctime
that may change rapidly over the lifetime of the file, so a totally
up-to-date value is not needed from the server in that case.

Cheers, Andreas

signature.asc
Description: Message signed with OpenPGP

Re: Future of dosfstools project (FAT)

2018-10-01 Thread Andreas Dilger

On Sep 29, 2018, at 2:40 AM, Pali Rohár  wrote:
> 
> Hi!
> 
> Last year I did some research how Windows and Linux tools handle FAT
> labels (boot sector vs root directory) and proposed some unification.
> More in thread: https://www.spinics.net/lists/kernel/msg2640891.html
> 
> My proposed change for manipulating with FAT labels is now implemented
> in util-linux v2.33, prepared new mtools version, and also in dosfstools
> git (unreleased) https://www.spinics.net/lists/kernel/msg2645732.html
> 
> Project dosfstools contains de-facto standard Linux userspace tools for
> FAT filesystems (mkfs.fat, fsck.fat, fatlabel) and so it is quite
> important because of high usage of FAT.
> 
> But there was no new release of dosfstools since Jan 2017 and for more
> then month I'm unsuccessfully try to contact maintainer of dosfstools
> project (Andreas) about future of project itself.
> 
> I do not know what happened, if either Andreas do not have time or
> something else...
> 
> So does somebody know what is state of dosfstools project?
> 
> Also there are lot of proposed changes (with patches) for this project
> https://github.com/dosfstools/dosfstools/pulls and also more reported
> bugs https://github.com/dosfstools/dosfstools/issues
> 
> If Andreas does not have time, can somebody else at least look at pull
> requests and do some code review?
> 
> I have some other unpublished/unfinished changes for dosfstools, but I
> do not know now if it make sense to invest more time in this project
> when maintainer does not respond... if this project is active or going
> to be dead...
> 
> If dosfstools is going to be inactive/dead, there is still mtools
> project with provides FAT tools too and maintainer of it is now
> preparing new bugfix version. So at least some alternative exists.

If the current dosfstools maintainer is non-responsive, you could always
fork the project in GitHub, land the critical patches into your branch,
 and make a release on your own.  If the maintainer surfaces again, then
they can pull in your patches.  If not, then you are the new maintainer.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: Future of dosfstools project (FAT)

2018-10-01 Thread Andreas Dilger

On Sep 29, 2018, at 2:40 AM, Pali Rohár  wrote:
> 
> Hi!
> 
> Last year I did some research how Windows and Linux tools handle FAT
> labels (boot sector vs root directory) and proposed some unification.
> More in thread: https://www.spinics.net/lists/kernel/msg2640891.html
> 
> My proposed change for manipulating with FAT labels is now implemented
> in util-linux v2.33, prepared new mtools version, and also in dosfstools
> git (unreleased) https://www.spinics.net/lists/kernel/msg2645732.html
> 
> Project dosfstools contains de-facto standard Linux userspace tools for
> FAT filesystems (mkfs.fat, fsck.fat, fatlabel) and so it is quite
> important because of high usage of FAT.
> 
> But there was no new release of dosfstools since Jan 2017 and for more
> then month I'm unsuccessfully try to contact maintainer of dosfstools
> project (Andreas) about future of project itself.
> 
> I do not know what happened, if either Andreas do not have time or
> something else...
> 
> So does somebody know what is state of dosfstools project?
> 
> Also there are lot of proposed changes (with patches) for this project
> https://github.com/dosfstools/dosfstools/pulls and also more reported
> bugs https://github.com/dosfstools/dosfstools/issues
> 
> If Andreas does not have time, can somebody else at least look at pull
> requests and do some code review?
> 
> I have some other unpublished/unfinished changes for dosfstools, but I
> do not know now if it make sense to invest more time in this project
> when maintainer does not respond... if this project is active or going
> to be dead...
> 
> If dosfstools is going to be inactive/dead, there is still mtools
> project with provides FAT tools too and maintainer of it is now
> preparing new bugfix version. So at least some alternative exists.

If the current dosfstools maintainer is non-responsive, you could always
fork the project in GitHub, land the critical patches into your branch,
 and make a release on your own.  If the maintainer surfaces again, then
they can pull in your patches.  If not, then you are the new maintainer.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH 1/3] ext4: super: Fix spectre gadget in ext4_quota_on

2018-07-31 Thread Andreas Dilger


> On Jul 27, 2018, at 11:46 AM, Josh Poimboeuf  wrote:
> 
> On Fri, Jul 27, 2018 at 04:23:55PM +, Jeremy Cline wrote:
>> 'type' is a user-controlled value used to index into 's_qf_names', which
>> can be used in a Spectre v1 attack. Clamp 'type' to the size of the
>> array to avoid a speculative out-of-bounds read.
>> 
>> Cc: Josh Poimboeuf 
>> Cc: sta...@vger.kernel.org
>> Signed-off-by: Jeremy Cline 
>> ---
>> fs/ext4/super.c | 2 ++
>> 1 file changed, 2 insertions(+)
>> 
>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> index 6480e763080f..c04a09b51742 100644
>> --- a/fs/ext4/super.c
>> +++ b/fs/ext4/super.c
>> @@ -40,6 +40,7 @@
>> #include 
>> #include 
>> #include 
>> +#include 
>> #include 
>> #include 
>> 
>> @@ -5559,6 +5560,7 @@ static int ext4_quota_on(struct super_block *sb, int 
>> type, int format_id,
>>  if (path->dentry->d_sb != sb)
>>  return -EXDEV;
>>  /* Journaling quota? */
>> +type = array_index_nospec(type, EXT4_MAXQUOTAS);

This check just papers over the issue, but AFAICS doesn't actually
solve any problems.  It ends up squashing an invalid value to be
the same as EXT4_MAXQUOTAS, rather than returning an error to the
caller as it should.

>>  if (EXT4_SB(sb)->s_qf_names[type]) {
>>  /* Quotafile not in fs root? */
>>  if (path->dentry->d_parent != sb->s_root)
> 
> Generally we try to put the array_index_nospec() close to the bounds
> check for which it's trying to prevent speculation past.
> 
> In this case, I'd expect the EXT4_MAXQUOTAS bounds check to be in
> do_quotactl(), but it seems to be missing:
> 
>   if (type >= (XQM_COMMAND(cmd) ? XQM_MAXQUOTAS : MAXQUOTAS))
>   return -EINVAL;

Agreed that this should be checked at the highest layer possible.
IMHO, this means one check in the VFS/quota layer, and a separate
check in the filesystem layer.

> Also it looks like XQM_MAXQUOTAS, MAXQUOTAS, and EXT4_MAXQUOTAS all have the 
> same value (3).  Maybe they can be consolidated to just use
> MAXQUOTAS everywhere?

No, the filesystem-specific MAXQUOTAS values were separated from
the kernel MAXQUOTAS value for a good reason.  This allows some
filesystems to support new quota types (e.g. project quotas) that
not all other filesystems can handle.  This may potentially change
again in the future, so they shouldn't be tightly coupled.

>  Then the nospec would be simple:
> 
>   if (type >= MAXQUOTAS)
>   return -EINVAL;
>   type = array_index_nospec(type, MAXQUOTAS);
> 
> Otherwise I think we may need to disperse the array_index_nospec calls
> deeper in the callchain.
> 
> --
> Josh


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH 1/3] ext4: super: Fix spectre gadget in ext4_quota_on

2018-07-31 Thread Andreas Dilger


> On Jul 27, 2018, at 11:46 AM, Josh Poimboeuf  wrote:
> 
> On Fri, Jul 27, 2018 at 04:23:55PM +, Jeremy Cline wrote:
>> 'type' is a user-controlled value used to index into 's_qf_names', which
>> can be used in a Spectre v1 attack. Clamp 'type' to the size of the
>> array to avoid a speculative out-of-bounds read.
>> 
>> Cc: Josh Poimboeuf 
>> Cc: sta...@vger.kernel.org
>> Signed-off-by: Jeremy Cline 
>> ---
>> fs/ext4/super.c | 2 ++
>> 1 file changed, 2 insertions(+)
>> 
>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> index 6480e763080f..c04a09b51742 100644
>> --- a/fs/ext4/super.c
>> +++ b/fs/ext4/super.c
>> @@ -40,6 +40,7 @@
>> #include 
>> #include 
>> #include 
>> +#include 
>> #include 
>> #include 
>> 
>> @@ -5559,6 +5560,7 @@ static int ext4_quota_on(struct super_block *sb, int 
>> type, int format_id,
>>  if (path->dentry->d_sb != sb)
>>  return -EXDEV;
>>  /* Journaling quota? */
>> +type = array_index_nospec(type, EXT4_MAXQUOTAS);

This check just papers over the issue, but AFAICS doesn't actually
solve any problems.  It ends up squashing an invalid value to be
the same as EXT4_MAXQUOTAS, rather than returning an error to the
caller as it should.

>>  if (EXT4_SB(sb)->s_qf_names[type]) {
>>  /* Quotafile not in fs root? */
>>  if (path->dentry->d_parent != sb->s_root)
> 
> Generally we try to put the array_index_nospec() close to the bounds
> check for which it's trying to prevent speculation past.
> 
> In this case, I'd expect the EXT4_MAXQUOTAS bounds check to be in
> do_quotactl(), but it seems to be missing:
> 
>   if (type >= (XQM_COMMAND(cmd) ? XQM_MAXQUOTAS : MAXQUOTAS))
>   return -EINVAL;

Agreed that this should be checked at the highest layer possible.
IMHO, this means one check in the VFS/quota layer, and a separate
check in the filesystem layer.

> Also it looks like XQM_MAXQUOTAS, MAXQUOTAS, and EXT4_MAXQUOTAS all have the 
> same value (3).  Maybe they can be consolidated to just use
> MAXQUOTAS everywhere?

No, the filesystem-specific MAXQUOTAS values were separated from
the kernel MAXQUOTAS value for a good reason.  This allows some
filesystems to support new quota types (e.g. project quotas) that
not all other filesystems can handle.  This may potentially change
again in the future, so they shouldn't be tightly coupled.

>  Then the nospec would be simple:
> 
>   if (type >= MAXQUOTAS)
>   return -EINVAL;
>   type = array_index_nospec(type, MAXQUOTAS);
> 
> Otherwise I think we may need to disperse the array_index_nospec calls
> deeper in the callchain.
> 
> --
> Josh


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: Why inode number is zero in writeback?

2018-07-25 Thread Andreas Dilger

On Jul 25, 2018, at 6:01 AM, Ilya Plenne  wrote:
> 
> I'm researching linux kernel. Right now only for v3.10.61, it's just
> proof of concept.
> 
> I need to pass-through some hints to hardware about what kind of data
> in particular WRITE\READ operation. E.g. read inodes bitmap or write
> journal block or write user data or something else...
> I'm assuming that at driver level I can reach bio struct from
> request_queue struct just for now. For testing purposes I'm using
> virtio_blk driver under qemu.

Please see the following articles/threads for similar developments
that are already present in Linux:

 https://lwn.net/Articles/685499/
 https://www.usenix.org/system/files/conference/fast18/fast18-rho.pdf
 https://lwn.net/Articles/726477/

commit c75b1d9421f80f4143e389d2d50ddfc8a28c8c35
Author: Jens Axboe 
AuthorDate: Tue Jun 27 11:47:04 2017 -0600
Commit: Jens Axboe 
CommitDate: Tue Jun 27 12:05:22 2017 -0600

fs: add fcntl() interface for setting/getting write life time hints

This patch was landed in kernel 4.12, so you would be well advised to
update your testing to at least that kernel.  If there is a specific
reason to use the 3.10 kernel (e.g. RHEL7 requirement) then it would
be best to backport this patch series to the older kernel, rather than
making a different set of interfaces that will conflict with newer
kernels.

If you want to add additional streams (e.g. for different types of
filesystem metadata), you should modify nvme_configure_directives()
to remove the BLK_MAX_WRITE_HINTS limit to nr_streams, and then use
stream IDs > BLK_MAX_WRITE_HINTS - 1 for the filesystem-specific IDs.
Alternately, add a separate value BLK_MIN_WRITE_HITS = 5, and check
that against ctrl->nssa, and increase the value of BLK_MAX_WRITE_HITS
to 16 or similar (not too large, as there is an array in the queue
based on this value to track stream usage stats).

You still wouldn't be able to pass larger hints (stream IDs) from
userspace, but it would be enough to pass hints directly from ext4.
Using higher stream IDs would also avoid conflicts with user supplied
IDs, and doesn't affect operation otherwise.

There would need to be some way to map the higher IDs to lower ones
if there weren't enough distinct IDs in the device.  One option is to
encode the filesystem stream ID into the high 8 bits of the fields,
and the fallback hint into the low 8 (really 3) bits of the fields.
The fallback hint would be set manually (e.g. journal and bitmap use
the WRITE_LIFE_SHORT hint, inode uses WRITE_LIFE_MEDIUM, etc.).
It isn't clear whether "user data" has a good fallback hint or not,
but that is always true even if you segregate it into its own class
(it will intermingle short- and long-term files), so it could just
fall back to WRITE_LIFE_NONE.

Alternately, you could expand enum rw_hint to have a few generic
internal IDs not accessible from userspace, like WRITE_LIFE_INODE,
WRITE_LIFE_BITMAP, WRITE_LIFE_JOURNAL, WRITE_LIFE_DIRECTORY, and
WRITE_LIFE_TREE, etc. that could be used by multiple filesystems.

Cheers, Andreas

> Here what I have done:
> 
> 1. I've created FS with this command: mkfs.ext4 -b 4096 -E
> lazy_itable_init=0,lazy_journal_init=0 -m 0 /dev/vda
> 2. I've mounted it using this command: mount -o
> rw,nosuid,nodev,discard,noauto_da_alloc,data=ordered /dev/vda /mnt
> 3. Added breakpoint in virtio_blk.c:377 and will skip writes related to 
> journal
> 4. Execute this dd command: dd if=/dev/urandom of=/mnt/foo2 bs=764 count=34
> 5. Wait for breakpoint hit and will analyze backtrace under writeback 
> subsystem.
> 
> Here is the backtrace:
> 
> #0  virtblk_request (q=0x87ab8000) at drivers/block/virtio_blk.c:377
> #1  0x801c0b4c in __blk_run_queue_uncond (q=) at
> block/blk-core.c:312
> #2  __blk_run_queue (q=0x87ab8000) at block/blk-core.c:329
> #3  0x801c0c94 in queue_unplugged (q=0x87ab8000, depth= out>, from_schedule=) at block/blk-core.c:2920
> #4  0x801c3a98 in blk_flush_plug_list (plug=,
> from_schedule=false) at block/blk-core.c:3030
> #5  0x801c3d9c in blk_finish_plug (plug=0x8785fd8c) at block/blk-core.c:3037
> #6  0x80091644 in generic_writepages (mapping=,
> wbc=0x8785fde0) at mm/page-writeback.c:1910
> #7  0x80092a88 in do_writepages (mapping=,
> wbc=) at mm/page-writeback.c:1923
> #8  0x800e7084 in __writeback_single_inode (inode=0x8740c290,
> wbc=0x8785fde0) at fs/fs-writeback.c:454
> #9  0x800e7360 in writeback_sb_inodes (sb=0x87811000, wb=0x87ab81b0,
> work=0x8785fea4) at fs/fs-writeback.c:678
> #10 0x800e757c in __writeback_inodes_wb (wb=0x1 <__vectors_start>,
> work=0x3ff) at fs/fs-writeback.c:723
> #11 0x800e7760 in wb_writeback (wb=0x87ab81b0, work=0x8785fea4) at
> fs/fs-writeback.c:854
> #12 0x800e838c in wb_check_old_data_flush (wb=) at
> fs/fs-writeback.c:969
> #13 wb_do_writeback (wb=0x87ab81b0, force_wait=0) at fs/fs-writeback.c:1010
> #14 0x800e848c in bdi_writeback_workfn (work=0x87ab81bc) at
> fs/fs-writeback.c:1040
> #15 0x80039d34 in process_one_work

Re: Why inode number is zero in writeback?

2018-07-25 Thread Andreas Dilger

On Jul 25, 2018, at 6:01 AM, Ilya Plenne  wrote:
> 
> I'm researching linux kernel. Right now only for v3.10.61, it's just
> proof of concept.
> 
> I need to pass-through some hints to hardware about what kind of data
> in particular WRITE\READ operation. E.g. read inodes bitmap or write
> journal block or write user data or something else...
> I'm assuming that at driver level I can reach bio struct from
> request_queue struct just for now. For testing purposes I'm using
> virtio_blk driver under qemu.

Please see the following articles/threads for similar developments
that are already present in Linux:

 https://lwn.net/Articles/685499/
 https://www.usenix.org/system/files/conference/fast18/fast18-rho.pdf
 https://lwn.net/Articles/726477/

commit c75b1d9421f80f4143e389d2d50ddfc8a28c8c35
Author: Jens Axboe 
AuthorDate: Tue Jun 27 11:47:04 2017 -0600
Commit: Jens Axboe 
CommitDate: Tue Jun 27 12:05:22 2017 -0600

fs: add fcntl() interface for setting/getting write life time hints

This patch was landed in kernel 4.12, so you would be well advised to
update your testing to at least that kernel.  If there is a specific
reason to use the 3.10 kernel (e.g. RHEL7 requirement) then it would
be best to backport this patch series to the older kernel, rather than
making a different set of interfaces that will conflict with newer
kernels.

If you want to add additional streams (e.g. for different types of
filesystem metadata), you should modify nvme_configure_directives()
to remove the BLK_MAX_WRITE_HINTS limit to nr_streams, and then use
stream IDs > BLK_MAX_WRITE_HINTS - 1 for the filesystem-specific IDs.
Alternately, add a separate value BLK_MIN_WRITE_HITS = 5, and check
that against ctrl->nssa, and increase the value of BLK_MAX_WRITE_HITS
to 16 or similar (not too large, as there is an array in the queue
based on this value to track stream usage stats).

You still wouldn't be able to pass larger hints (stream IDs) from
userspace, but it would be enough to pass hints directly from ext4.
Using higher stream IDs would also avoid conflicts with user supplied
IDs, and doesn't affect operation otherwise.

There would need to be some way to map the higher IDs to lower ones
if there weren't enough distinct IDs in the device.  One option is to
encode the filesystem stream ID into the high 8 bits of the fields,
and the fallback hint into the low 8 (really 3) bits of the fields.
The fallback hint would be set manually (e.g. journal and bitmap use
the WRITE_LIFE_SHORT hint, inode uses WRITE_LIFE_MEDIUM, etc.).
It isn't clear whether "user data" has a good fallback hint or not,
but that is always true even if you segregate it into its own class
(it will intermingle short- and long-term files), so it could just
fall back to WRITE_LIFE_NONE.

Alternately, you could expand enum rw_hint to have a few generic
internal IDs not accessible from userspace, like WRITE_LIFE_INODE,
WRITE_LIFE_BITMAP, WRITE_LIFE_JOURNAL, WRITE_LIFE_DIRECTORY, and
WRITE_LIFE_TREE, etc. that could be used by multiple filesystems.

Cheers, Andreas

> Here what I have done:
> 
> 1. I've created FS with this command: mkfs.ext4 -b 4096 -E
> lazy_itable_init=0,lazy_journal_init=0 -m 0 /dev/vda
> 2. I've mounted it using this command: mount -o
> rw,nosuid,nodev,discard,noauto_da_alloc,data=ordered /dev/vda /mnt
> 3. Added breakpoint in virtio_blk.c:377 and will skip writes related to 
> journal
> 4. Execute this dd command: dd if=/dev/urandom of=/mnt/foo2 bs=764 count=34
> 5. Wait for breakpoint hit and will analyze backtrace under writeback 
> subsystem.
> 
> Here is the backtrace:
> 
> #0  virtblk_request (q=0x87ab8000) at drivers/block/virtio_blk.c:377
> #1  0x801c0b4c in __blk_run_queue_uncond (q=) at
> block/blk-core.c:312
> #2  __blk_run_queue (q=0x87ab8000) at block/blk-core.c:329
> #3  0x801c0c94 in queue_unplugged (q=0x87ab8000, depth= out>, from_schedule=) at block/blk-core.c:2920
> #4  0x801c3a98 in blk_flush_plug_list (plug=,
> from_schedule=false) at block/blk-core.c:3030
> #5  0x801c3d9c in blk_finish_plug (plug=0x8785fd8c) at block/blk-core.c:3037
> #6  0x80091644 in generic_writepages (mapping=,
> wbc=0x8785fde0) at mm/page-writeback.c:1910
> #7  0x80092a88 in do_writepages (mapping=,
> wbc=) at mm/page-writeback.c:1923
> #8  0x800e7084 in __writeback_single_inode (inode=0x8740c290,
> wbc=0x8785fde0) at fs/fs-writeback.c:454
> #9  0x800e7360 in writeback_sb_inodes (sb=0x87811000, wb=0x87ab81b0,
> work=0x8785fea4) at fs/fs-writeback.c:678
> #10 0x800e757c in __writeback_inodes_wb (wb=0x1 <__vectors_start>,
> work=0x3ff) at fs/fs-writeback.c:723
> #11 0x800e7760 in wb_writeback (wb=0x87ab81b0, work=0x8785fea4) at
> fs/fs-writeback.c:854
> #12 0x800e838c in wb_check_old_data_flush (wb=) at
> fs/fs-writeback.c:969
> #13 wb_do_writeback (wb=0x87ab81b0, force_wait=0) at fs/fs-writeback.c:1010
> #14 0x800e848c in bdi_writeback_workfn (work=0x87ab81bc) at
> fs/fs-writeback.c:1040
> #15 0x80039d34 in process_one_work

Re: [PATCH] ext4: Check superblock mapped prior to committing

2018-06-29 Thread Andreas Dilger

On Jun 29, 2018, at 1:36 PM, Jon Derrick  wrote:
> 
> This patch attempts to close a hole leading to a BUG seen with hot
> removals during writes [1].
> 
> A block device (NVME namespace in this test case) is formatted to EXT4
> without partitions. It's mounted and write I/O is run to a file, then
> the device is hot removed from the slot. The superblock attempts to be
> written to the drive which is no longer present.
> 
> The typical chain of events leading to the BUG:
> ext4_commit_super()
>  __sync_dirty_buffer()
>submit_bh()
>  submit_bh_wbc()
>BUG_ON(!buffer_mapped(bh));
> 
> This fix checks for the superblock's buffer head being mapped prior to
> syncing.
> 
> [1] https://www.spinics.net/lists/linux-ext4/msg56527.html
> 
> Signed-off-by: Jon Derrick 
> ---
> fs/ext4/super.c | 8 
> 1 file changed, 8 insertions(+)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 0c4c220..ee33233 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4736,6 +4736,14 @@ static int ext4_commit_super(struct super_block *sb, 
> int sync)
> 
>   if (!sbh || block_device_ejected(sb))
>   return error;
> +
> + /*
> +  * The superblock bh should be mapped, but it might not be if the
> +  * device was hot-removed. Not much we can do but fail the I/O.
> +  */
> + if (!buffer_mapped(sbh))
> + return error;

This still looks a bit racy, based on the stack trace you posted.
There is already a "block_device_ejected()" check a line above,
which makes me think that the PCI device removal should be handled
like an ejected device, so that it is also handled elsewhere.

Even so, the check here in ext4_commit_super() can pass, and the
PCI card can be removed on the next instruction and still trigger
the BUG_ON().

That said, this is probably still an improvement on the existing
situation.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH] ext4: Check superblock mapped prior to committing

2018-06-29 Thread Andreas Dilger

On Jun 29, 2018, at 1:36 PM, Jon Derrick  wrote:
> 
> This patch attempts to close a hole leading to a BUG seen with hot
> removals during writes [1].
> 
> A block device (NVME namespace in this test case) is formatted to EXT4
> without partitions. It's mounted and write I/O is run to a file, then
> the device is hot removed from the slot. The superblock attempts to be
> written to the drive which is no longer present.
> 
> The typical chain of events leading to the BUG:
> ext4_commit_super()
>  __sync_dirty_buffer()
>submit_bh()
>  submit_bh_wbc()
>BUG_ON(!buffer_mapped(bh));
> 
> This fix checks for the superblock's buffer head being mapped prior to
> syncing.
> 
> [1] https://www.spinics.net/lists/linux-ext4/msg56527.html
> 
> Signed-off-by: Jon Derrick 
> ---
> fs/ext4/super.c | 8 
> 1 file changed, 8 insertions(+)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 0c4c220..ee33233 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4736,6 +4736,14 @@ static int ext4_commit_super(struct super_block *sb, 
> int sync)
> 
>   if (!sbh || block_device_ejected(sb))
>   return error;
> +
> + /*
> +  * The superblock bh should be mapped, but it might not be if the
> +  * device was hot-removed. Not much we can do but fail the I/O.
> +  */
> + if (!buffer_mapped(sbh))
> + return error;

This still looks a bit racy, based on the stack trace you posted.
There is already a "block_device_ejected()" check a line above,
which makes me think that the PCI device removal should be handled
like an ejected device, so that it is also handled elsewhere.

Even so, the check here in ext4_commit_super() can pass, and the
PCI card can be removed on the next instruction and still trigger
the BUG_ON().

That said, this is probably still an improvement on the existing
situation.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH 10/10] Dynamic fault injection

2018-05-18 Thread Andreas Dilger

On May 18, 2018, at 1:10 PM, Kent Overstreet <kent.overstr...@gmail.com> wrote:
> 
> On Fri, May 18, 2018 at 01:05:20PM -0600, Andreas Dilger wrote:
>> On May 18, 2018, at 1:49 AM, Kent Overstreet <kent.overstr...@gmail.com> 
>> wrote:
>>> 
>>> Signed-off-by: Kent Overstreet <kent.overstr...@gmail.com>
>> 
>> I agree with Christoph that even if there was some explanation in the cover
>> letter, there should be something at least as good in the patch itself.  The
>> cover letter is not saved, but the commit stays around forever, and should
>> explain how this should be added to code, and how to use it from userspace.
>> 
>> 
>> That said, I think this is a useful functionality.  We have something similar
>> in Lustre (OBD_FAIL_CHECK() and friends) that is necessary for being able to
>> test a distributed filesystem, which is just a CPP macro with an unlikely()
>> branch, while this looks more sophisticated.  This looks like it has some
>> added functionality like having more than one fault enabled at a time.
>> If this lands we could likely switch our code over to using this.
> 
> This is pretty much what I was looking for, I just wanted to know if this
> patch was interesting enough to anyone that I should spend more time on it
> or just drop it :) Agreed on documentation. I think it's also worth
> factoring out the functionality for the elf section trick that dynamic
> debug uses too.
> 
>> Some things that are missing from this patch that is in our code:
>> 
>> - in addition to the basic "enabled" and "oneshot" mechanisms, we have:
>>  - timeout: sleep for N msec to simulate network/disk/locking delays
>>  - race: wait with one thread until a second thread hits matching check
>> 
>> We also have a "fail_val" that allows making the check conditional (e.g.
>> only operation on server "N" should fail, only RPC opcode "N", etc).
> 
> Those all sound like good ideas... fail_val especially, I think with that
> we'd have all the functionality the existing fault injection framework has
> (which is way too heavyweight to actually get used, imo)

The other thing that we have that is slightly orthogonal to your modes,
which is possible because we just have a __u32 for the fault location,
is that the "oneshot" mode is just a mask added to the fault location
together with "fail_val" is that we can add other masks "fail N times",
"fail randomly 1/N times", or "pass N times before failure".  The other
mask is set in the kernel when the fault was actually hit, so that test
scripts can poll until that happens, and then continue running.

The "fail randomly 1/N times" was useful for detecting memory allocation
failure handling under load, but that has been superseded by the same
functionality in kmalloc(), and it sounds like your fault injection can
do this deterministically for every allocation?

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH 10/10] Dynamic fault injection

2018-05-18 Thread Andreas Dilger

On May 18, 2018, at 1:10 PM, Kent Overstreet  wrote:
> 
> On Fri, May 18, 2018 at 01:05:20PM -0600, Andreas Dilger wrote:
>> On May 18, 2018, at 1:49 AM, Kent Overstreet  
>> wrote:
>>> 
>>> Signed-off-by: Kent Overstreet 
>> 
>> I agree with Christoph that even if there was some explanation in the cover
>> letter, there should be something at least as good in the patch itself.  The
>> cover letter is not saved, but the commit stays around forever, and should
>> explain how this should be added to code, and how to use it from userspace.
>> 
>> 
>> That said, I think this is a useful functionality.  We have something similar
>> in Lustre (OBD_FAIL_CHECK() and friends) that is necessary for being able to
>> test a distributed filesystem, which is just a CPP macro with an unlikely()
>> branch, while this looks more sophisticated.  This looks like it has some
>> added functionality like having more than one fault enabled at a time.
>> If this lands we could likely switch our code over to using this.
> 
> This is pretty much what I was looking for, I just wanted to know if this
> patch was interesting enough to anyone that I should spend more time on it
> or just drop it :) Agreed on documentation. I think it's also worth
> factoring out the functionality for the elf section trick that dynamic
> debug uses too.
> 
>> Some things that are missing from this patch that is in our code:
>> 
>> - in addition to the basic "enabled" and "oneshot" mechanisms, we have:
>>  - timeout: sleep for N msec to simulate network/disk/locking delays
>>  - race: wait with one thread until a second thread hits matching check
>> 
>> We also have a "fail_val" that allows making the check conditional (e.g.
>> only operation on server "N" should fail, only RPC opcode "N", etc).
> 
> Those all sound like good ideas... fail_val especially, I think with that
> we'd have all the functionality the existing fault injection framework has
> (which is way too heavyweight to actually get used, imo)

The other thing that we have that is slightly orthogonal to your modes,
which is possible because we just have a __u32 for the fault location,
is that the "oneshot" mode is just a mask added to the fault location
together with "fail_val" is that we can add other masks "fail N times",
"fail randomly 1/N times", or "pass N times before failure".  The other
mask is set in the kernel when the fault was actually hit, so that test
scripts can poll until that happens, and then continue running.

The "fail randomly 1/N times" was useful for detecting memory allocation
failure handling under load, but that has been superseded by the same
functionality in kmalloc(), and it sounds like your fault injection can
do this deterministically for every allocation?

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH 10/10] Dynamic fault injection

2018-05-18 Thread Andreas Dilger

On May 18, 2018, at 1:49 AM, Kent Overstreet  wrote:
> 
> Signed-off-by: Kent Overstreet 

I agree with Christoph that even if there was some explanation in the cover
letter, there should be something at least as good in the patch itself.  The
cover letter is not saved, but the commit stays around forever, and should
explain how this should be added to code, and how to use it from userspace.


That said, I think this is a useful functionality.  We have something similar
in Lustre (OBD_FAIL_CHECK() and friends) that is necessary for being able to
test a distributed filesystem, which is just a CPP macro with an unlikely()
branch, while this looks more sophisticated.  This looks like it has some
added functionality like having more than one fault enabled at a time.
If this lands we could likely switch our code over to using this.


Some things that are missing from this patch that is in our code:

- in addition to the basic "enabled" and "oneshot" mechanisms, we have:
  - timeout: sleep for N msec to simulate network/disk/locking delays
  - race: wait with one thread until a second thread hits matching check

We also have a "fail_val" that allows making the check conditional (e.g.
only operation on server "N" should fail, only RPC opcode "N", etc).

Cheers, Andreas

> ---
> include/asm-generic/vmlinux.lds.h |   4 +
> include/linux/dynamic_fault.h | 117 +
> lib/Kconfig.debug |   5 +
> lib/Makefile  |   2 +
> lib/dynamic_fault.c   | 760 ++
> 5 files changed, 888 insertions(+)
> create mode 100644 include/linux/dynamic_fault.h
> create mode 100644 lib/dynamic_fault.c
> 
> diff --git a/include/asm-generic/vmlinux.lds.h 
> b/include/asm-generic/vmlinux.lds.h
> index 1ab0e520d6..a4c9dfcbbd 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -246,6 +246,10 @@
>   VMLINUX_SYMBOL(__start___verbose) = .;  \
>   KEEP(*(__verbose))  \
>   VMLINUX_SYMBOL(__stop___verbose) = .;   \
> + . = ALIGN(8);   \
> + VMLINUX_SYMBOL(__start___faults) = .;   \
> + *(__faults) \
> + VMLINUX_SYMBOL(__stop___faults) = .;\
>   LIKELY_PROFILE()\
>   BRANCH_PROFILE()\
>   TRACE_PRINTKS() \
> diff --git a/include/linux/dynamic_fault.h b/include/linux/dynamic_fault.h
> new file mode 100644
> index 00..6e7bb56ae8
> --- /dev/null
> +++ b/include/linux/dynamic_fault.h
> @@ -0,0 +1,117 @@
> +#ifndef _DYNAMIC_FAULT_H
> +#define _DYNAMIC_FAULT_H
> +
> +#include 
> +#include 
> +#include 
> +
> +enum dfault_enabled {
> + DFAULT_DISABLED,
> + DFAULT_ENABLED,
> + DFAULT_ONESHOT,
> +};
> +
> +union dfault_state {
> + struct {
> + unsignedenabled:2;
> + unsignedcount:30;
> + };
> +
> + struct {
> + unsignedv;
> + };
> +};
> +
> +/*
> + * An instance of this structure is created in a special
> + * ELF section at every dynamic fault callsite.  At runtime,
> + * the special section is treated as an array of these.
> + */
> +struct _dfault {
> + const char  *modname;
> + const char  *function;
> + const char  *filename;
> + const char  *class;
> +
> + const u16   line;
> +
> + unsignedfrequency;
> + union dfault_state  state;
> +
> + struct static_key   enabled;
> +} __aligned(8);
> +
> +
> +#ifdef CONFIG_DYNAMIC_FAULT
> +
> +int dfault_add_module(struct _dfault *tab, unsigned int n, const char *mod);
> +int dfault_remove_module(char *mod_name);
> +bool __dynamic_fault_enabled(struct _dfault *);
> +
> +#define dynamic_fault(_class)
> \
> +({   \
> + static struct _dfault descriptor\
> + __used __aligned(8) __attribute__((section("__faults"))) = {\
> + .modname= KBUILD_MODNAME,   \
> + .function   = __func__, \
> + .filename   = __FILE__, \
> + .line   = __LINE__, \
> + .class  = _class,   \
> + };  \
> + \
> +

Re: [PATCH 10/10] Dynamic fault injection

2018-05-18 Thread Andreas Dilger

On May 18, 2018, at 1:49 AM, Kent Overstreet  wrote:
> 
> Signed-off-by: Kent Overstreet 

I agree with Christoph that even if there was some explanation in the cover
letter, there should be something at least as good in the patch itself.  The
cover letter is not saved, but the commit stays around forever, and should
explain how this should be added to code, and how to use it from userspace.


That said, I think this is a useful functionality.  We have something similar
in Lustre (OBD_FAIL_CHECK() and friends) that is necessary for being able to
test a distributed filesystem, which is just a CPP macro with an unlikely()
branch, while this looks more sophisticated.  This looks like it has some
added functionality like having more than one fault enabled at a time.
If this lands we could likely switch our code over to using this.


Some things that are missing from this patch that is in our code:

- in addition to the basic "enabled" and "oneshot" mechanisms, we have:
  - timeout: sleep for N msec to simulate network/disk/locking delays
  - race: wait with one thread until a second thread hits matching check

We also have a "fail_val" that allows making the check conditional (e.g.
only operation on server "N" should fail, only RPC opcode "N", etc).

Cheers, Andreas

> ---
> include/asm-generic/vmlinux.lds.h |   4 +
> include/linux/dynamic_fault.h | 117 +
> lib/Kconfig.debug |   5 +
> lib/Makefile  |   2 +
> lib/dynamic_fault.c   | 760 ++
> 5 files changed, 888 insertions(+)
> create mode 100644 include/linux/dynamic_fault.h
> create mode 100644 lib/dynamic_fault.c
> 
> diff --git a/include/asm-generic/vmlinux.lds.h 
> b/include/asm-generic/vmlinux.lds.h
> index 1ab0e520d6..a4c9dfcbbd 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -246,6 +246,10 @@
>   VMLINUX_SYMBOL(__start___verbose) = .;  \
>   KEEP(*(__verbose))  \
>   VMLINUX_SYMBOL(__stop___verbose) = .;   \
> + . = ALIGN(8);   \
> + VMLINUX_SYMBOL(__start___faults) = .;   \
> + *(__faults) \
> + VMLINUX_SYMBOL(__stop___faults) = .;\
>   LIKELY_PROFILE()\
>   BRANCH_PROFILE()\
>   TRACE_PRINTKS() \
> diff --git a/include/linux/dynamic_fault.h b/include/linux/dynamic_fault.h
> new file mode 100644
> index 00..6e7bb56ae8
> --- /dev/null
> +++ b/include/linux/dynamic_fault.h
> @@ -0,0 +1,117 @@
> +#ifndef _DYNAMIC_FAULT_H
> +#define _DYNAMIC_FAULT_H
> +
> +#include 
> +#include 
> +#include 
> +
> +enum dfault_enabled {
> + DFAULT_DISABLED,
> + DFAULT_ENABLED,
> + DFAULT_ONESHOT,
> +};
> +
> +union dfault_state {
> + struct {
> + unsignedenabled:2;
> + unsignedcount:30;
> + };
> +
> + struct {
> + unsignedv;
> + };
> +};
> +
> +/*
> + * An instance of this structure is created in a special
> + * ELF section at every dynamic fault callsite.  At runtime,
> + * the special section is treated as an array of these.
> + */
> +struct _dfault {
> + const char  *modname;
> + const char  *function;
> + const char  *filename;
> + const char  *class;
> +
> + const u16   line;
> +
> + unsignedfrequency;
> + union dfault_state  state;
> +
> + struct static_key   enabled;
> +} __aligned(8);
> +
> +
> +#ifdef CONFIG_DYNAMIC_FAULT
> +
> +int dfault_add_module(struct _dfault *tab, unsigned int n, const char *mod);
> +int dfault_remove_module(char *mod_name);
> +bool __dynamic_fault_enabled(struct _dfault *);
> +
> +#define dynamic_fault(_class)
> \
> +({   \
> + static struct _dfault descriptor\
> + __used __aligned(8) __attribute__((section("__faults"))) = {\
> + .modname= KBUILD_MODNAME,   \
> + .function   = __func__, \
> + .filename   = __FILE__, \
> + .line   = __LINE__, \
> + .class  = _class,   \
> + };  \
> + \
> + static_key_false() &&\
> +

Re: copy_file_range and user space tools to do copy fastest

2018-04-27 Thread Andreas Dilger

On Apr 27, 2018, at 5:41 PM, Eric Biggers <ebigge...@gmail.com> wrote:
> 
> On Fri, Apr 27, 2018 at 01:45:40PM -0600, Andreas Dilger wrote:
>> On Apr 27, 2018, at 12:25 PM, Steve French <smfre...@gmail.com> wrote:
>>> 
>>> Are there any user space tools (other than our test tools and xfs_io
>>> etc.) that support copy_file_range?  Looks like at least cp and rsync
>>> and dd don't.  That syscall which now has been around a couple years,
>>> and was reminded about at the LSF/MM summit a few days ago, presumably
>>> is the 'best' way to copy a file fast since it tries all the
>>> mechanisms (reflink etc.) in order.
>>> 
>>> Since copy_file_range syscall can be 100x or more faster for network
>>> file systems than the alternative, was surprised when I noticed that
>>> cp and rsync didn't support it.  It doesn't look like rsync even
>>> supports reflink either(although presumably if you call
>>> copy_file_range you don't have to worry about that), and reads/writes
>>> are 8K. See copy_file() in rsync/util.c
>>> 
>>> In the cp command it looks like it can call the FICLONE IOCTL (see
>>> clone_file() in coreutils/src/copy.c) but doesn't call the expected
>>> "copy_file_range" syscall.
>>> 
>>> In the dd command it doesn't call either - see dd_copy in corutils/src/dd.c
>>> 
>>> Since it can be 100x or more faster in some cases to call
>>> copy_file_range than do reads/writes back and forth to do a copy
>>> (especially if network or clustered backend or cloud), what tools are
>>> the best to recommend?
>>> 
>>> Would rsync or cp be likely to take patches to call the standard
>>> "copy_file_range" syscall
>>> (http://man7.org/linux/man-pages/man2/copy_file_range.2.html)?
>>> Presumably not if it has been two+ years ... but would be interested
>>> what copy tools to recommend to use instead.
>> 
>> I would start with submitting a patch to coreutils, if you can figure
>> out that code enough to do so (I find it quite opaque).  Since it has
>> been in the kernel for a while already, it should be acceptable to the
>> upstream coreutils maintainers to use this interface.  Doubly so if you
>> include some benchmarks with CIFS/NFS clients avoiding network overhead
>> during the copy.
>> 
> 
> For cp (coreutils), apparently there was a concern that copy_file_range()
> expands holes; see the thread at
> https://lists.gnu.org/archive/html/bug-coreutils/2016-09/msg00020.html.
> Though, I'd think it could just be used on non-holes only.  And I don't think
> the size_t type of 'len' is a problem either, since it's the copy length, not
> the file size.  You just call it multiple times if the file is larger.

I think cp is already using SEEK_HOLE/SEEK_DATA and/or FIEMAP to determine
the mapped and sparse segments of the file, so it should be practical to
use copy_file_range() in conjunction with these to copy only the allocated
parts of the file.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: copy_file_range and user space tools to do copy fastest

2018-04-27 Thread Andreas Dilger

On Apr 27, 2018, at 5:41 PM, Eric Biggers  wrote:
> 
> On Fri, Apr 27, 2018 at 01:45:40PM -0600, Andreas Dilger wrote:
>> On Apr 27, 2018, at 12:25 PM, Steve French  wrote:
>>> 
>>> Are there any user space tools (other than our test tools and xfs_io
>>> etc.) that support copy_file_range?  Looks like at least cp and rsync
>>> and dd don't.  That syscall which now has been around a couple years,
>>> and was reminded about at the LSF/MM summit a few days ago, presumably
>>> is the 'best' way to copy a file fast since it tries all the
>>> mechanisms (reflink etc.) in order.
>>> 
>>> Since copy_file_range syscall can be 100x or more faster for network
>>> file systems than the alternative, was surprised when I noticed that
>>> cp and rsync didn't support it.  It doesn't look like rsync even
>>> supports reflink either(although presumably if you call
>>> copy_file_range you don't have to worry about that), and reads/writes
>>> are 8K. See copy_file() in rsync/util.c
>>> 
>>> In the cp command it looks like it can call the FICLONE IOCTL (see
>>> clone_file() in coreutils/src/copy.c) but doesn't call the expected
>>> "copy_file_range" syscall.
>>> 
>>> In the dd command it doesn't call either - see dd_copy in corutils/src/dd.c
>>> 
>>> Since it can be 100x or more faster in some cases to call
>>> copy_file_range than do reads/writes back and forth to do a copy
>>> (especially if network or clustered backend or cloud), what tools are
>>> the best to recommend?
>>> 
>>> Would rsync or cp be likely to take patches to call the standard
>>> "copy_file_range" syscall
>>> (http://man7.org/linux/man-pages/man2/copy_file_range.2.html)?
>>> Presumably not if it has been two+ years ... but would be interested
>>> what copy tools to recommend to use instead.
>> 
>> I would start with submitting a patch to coreutils, if you can figure
>> out that code enough to do so (I find it quite opaque).  Since it has
>> been in the kernel for a while already, it should be acceptable to the
>> upstream coreutils maintainers to use this interface.  Doubly so if you
>> include some benchmarks with CIFS/NFS clients avoiding network overhead
>> during the copy.
>> 
> 
> For cp (coreutils), apparently there was a concern that copy_file_range()
> expands holes; see the thread at
> https://lists.gnu.org/archive/html/bug-coreutils/2016-09/msg00020.html.
> Though, I'd think it could just be used on non-holes only.  And I don't think
> the size_t type of 'len' is a problem either, since it's the copy length, not
> the file size.  You just call it multiple times if the file is larger.

I think cp is already using SEEK_HOLE/SEEK_DATA and/or FIEMAP to determine
the mapped and sparse segments of the file, so it should be practical to
use copy_file_range() in conjunction with these to copy only the allocated
parts of the file.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: copy_file_range and user space tools to do copy fastest

2018-04-27 Thread Andreas Dilger

On Apr 27, 2018, at 12:25 PM, Steve French  wrote:
> 
> Are there any user space tools (other than our test tools and xfs_io
> etc.) that support copy_file_range?  Looks like at least cp and rsync
> and dd don't.  That syscall which now has been around a couple years,
> and was reminded about at the LSF/MM summit a few days ago, presumably
> is the 'best' way to copy a file fast since it tries all the
> mechanisms (reflink etc.) in order.
> 
> Since copy_file_range syscall can be 100x or more faster for network
> file systems than the alternative, was surprised when I noticed that
> cp and rsync didn't support it.  It doesn't look like rsync even
> supports reflink either(although presumably if you call
> copy_file_range you don't have to worry about that), and reads/writes
> are 8K. See copy_file() in rsync/util.c
> 
> In the cp command it looks like it can call the FICLONE IOCTL (see
> clone_file() in coreutils/src/copy.c) but doesn't call the expected
> "copy_file_range" syscall.
> 
> In the dd command it doesn't call either - see dd_copy in corutils/src/dd.c
> 
> Since it can be 100x or more faster in some cases to call
> copy_file_range than do reads/writes back and forth to do a copy
> (especially if network or clustered backend or cloud), what tools are
> the best to recommend?
> 
> Would rsync or cp be likely to take patches to call the standard
> "copy_file_range" syscall
> (http://man7.org/linux/man-pages/man2/copy_file_range.2.html)?
> Presumably not if it has been two+ years ... but would be interested
> what copy tools to recommend to use instead.

I would start with submitting a patch to coreutils, if you can figure
out that code enough to do so (I find it quite opaque).  Since it has
been in the kernel for a while already, it should be acceptable to the
upstream coreutils maintainers to use this interface.  Doubly so if you
include some benchmarks with CIFS/NFS clients avoiding network overhead
during the copy.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: copy_file_range and user space tools to do copy fastest

2018-04-27 Thread Andreas Dilger

On Apr 27, 2018, at 12:25 PM, Steve French  wrote:
> 
> Are there any user space tools (other than our test tools and xfs_io
> etc.) that support copy_file_range?  Looks like at least cp and rsync
> and dd don't.  That syscall which now has been around a couple years,
> and was reminded about at the LSF/MM summit a few days ago, presumably
> is the 'best' way to copy a file fast since it tries all the
> mechanisms (reflink etc.) in order.
> 
> Since copy_file_range syscall can be 100x or more faster for network
> file systems than the alternative, was surprised when I noticed that
> cp and rsync didn't support it.  It doesn't look like rsync even
> supports reflink either(although presumably if you call
> copy_file_range you don't have to worry about that), and reads/writes
> are 8K. See copy_file() in rsync/util.c
> 
> In the cp command it looks like it can call the FICLONE IOCTL (see
> clone_file() in coreutils/src/copy.c) but doesn't call the expected
> "copy_file_range" syscall.
> 
> In the dd command it doesn't call either - see dd_copy in corutils/src/dd.c
> 
> Since it can be 100x or more faster in some cases to call
> copy_file_range than do reads/writes back and forth to do a copy
> (especially if network or clustered backend or cloud), what tools are
> the best to recommend?
> 
> Would rsync or cp be likely to take patches to call the standard
> "copy_file_range" syscall
> (http://man7.org/linux/man-pages/man2/copy_file_range.2.html)?
> Presumably not if it has been two+ years ... but would be interested
> what copy tools to recommend to use instead.

I would start with submitting a patch to coreutils, if you can figure
out that code enough to do so (I find it quite opaque).  Since it has
been in the kernel for a while already, it should be acceptable to the
upstream coreutils maintainers to use this interface.  Doubly so if you
include some benchmarks with CIFS/NFS clients avoiding network overhead
during the copy.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH] vfs: Undo an overly zealous MS_RDONLY -> SB_RDONLY conversion

2018-04-20 Thread Andreas Dilger

On Apr 20, 2018, at 11:03 AM, Linus Torvalds  
wrote:
> 
> On Fri, Apr 20, 2018 at 5:35 AM, David Howells  wrote:
>> In do_mount() when the MS_* flags are being converted to MNT_* flags,
>> MS_RDONLY got accidentally convered to SB_RDONLY.
> 
> Applied.
> 
> I guess they have the same value (1). How did you notice? Do you have
> some patches that turn the kernel flags into a bitwise type? Or did
> you just happen on it manually?

Making s_flags and mnt_flags differently-named enums would help make
this problem more obvious (and some compilers might complain if used
incorrectly).  That also makes it more clear what kind of flags are
being passed to do_mount(), rather than "unsigned long flags".

Even while looking at the code for this, I see that s_flags is being
used incorrectly with MS_* flags all over the place:

static inline bool sb_rdonly(const struct super_block *sb) {
return sb->s_flags & MS_RDONLY;
}

void inode_add_lru(struct inode *inode)
{
if (!(inode->i_state & (I_DIRTY_ALL | I_SYNC |
I_FREEING | I_WILL_FREE)) &&
!atomic_read(>i_count) && inode->i_sb->s_flags & MS_ACTIVE)
inode_lru_list_add(inode);
}

A quick grep showed 243 lines matching "->s_flags.*MS_".

Cheers, Andreas

signature.asc
Description: Message signed with OpenPGP

Re: [PATCH] vfs: Undo an overly zealous MS_RDONLY -> SB_RDONLY conversion

2018-04-20 Thread Andreas Dilger

On Apr 20, 2018, at 11:03 AM, Linus Torvalds  
wrote:
> 
> On Fri, Apr 20, 2018 at 5:35 AM, David Howells  wrote:
>> In do_mount() when the MS_* flags are being converted to MNT_* flags,
>> MS_RDONLY got accidentally convered to SB_RDONLY.
> 
> Applied.
> 
> I guess they have the same value (1). How did you notice? Do you have
> some patches that turn the kernel flags into a bitwise type? Or did
> you just happen on it manually?

Making s_flags and mnt_flags differently-named enums would help make
this problem more obvious (and some compilers might complain if used
incorrectly).  That also makes it more clear what kind of flags are
being passed to do_mount(), rather than "unsigned long flags".

Even while looking at the code for this, I see that s_flags is being
used incorrectly with MS_* flags all over the place:

static inline bool sb_rdonly(const struct super_block *sb) {
return sb->s_flags & MS_RDONLY;
}

void inode_add_lru(struct inode *inode)
{
if (!(inode->i_state & (I_DIRTY_ALL | I_SYNC |
I_FREEING | I_WILL_FREE)) &&
!atomic_read(>i_count) && inode->i_sb->s_flags & MS_ACTIVE)
inode_lru_list_add(inode);
}

A quick grep showed 243 lines matching "->s_flags.*MS_".

Cheers, Andreas

signature.asc
Description: Message signed with OpenPGP

Re: [PATCH 2/6] statfs: use << to align with fs header

2018-04-13 Thread Andreas Dilger

On Apr 13, 2018, at 10:11 AM, Christian Brauner  
wrote:
> 
> Consistenly use << to define ST_* constants. This also aligns them with
> their MS_* counterparts in fs.h

IMHO, using (1 << 10) makes the code harder to debug.  If you see a field
in a structure like 0x8354, it is non-trivial to map this to the ST_*
flags if they are declared in the form (1 << 10) or BIT(10).  If they are
declared in the form 0x100 (as they are now) then it is trivial that the
ST_APPEND flag is set in 0x8354, and easy to understand the other flags.

So, my preference would be to NOT land this or the previous patch.

Cheers, Andreas

> Signed-off-by: Christian Brauner 
> ---
> include/linux/statfs.h | 26 +-
> 1 file changed, 13 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/statfs.h b/include/linux/statfs.h
> index 3142e98546ac..b336c04e793c 100644
> --- a/include/linux/statfs.h
> +++ b/include/linux/statfs.h
> @@ -27,18 +27,18 @@ struct kstatfs {
>  * ABI.  The exception is ST_VALID which has the same value as MS_REMOUNT
>  * which doesn't make any sense for statfs.
>  */
> -#define ST_RDONLY0x0001  /* mount read-only */
> -#define ST_NOSUID0x0002  /* ignore suid and sgid bits */
> -#define ST_NODEV 0x0004  /* disallow access to device special files */
> -#define ST_NOEXEC0x0008  /* disallow program execution */
> -#define ST_SYNCHRONOUS   0x0010  /* writes are synced at once */
> -#define ST_VALID 0x0020  /* f_flags support is implemented */
> -#define ST_MANDLOCK  0x0040  /* allow mandatory locks on an FS */
> -/* 0x0080 used for ST_WRITE in glibc */
> -/* 0x0100 used for ST_APPEND in glibc */
> -/* 0x0200 used for ST_IMMUTABLE in glibc */
> -#define ST_NOATIME   0x0400  /* do not update access times */
> -#define ST_NODIRATIME0x0800  /* do not update directory access times 
> */
> -#define ST_RELATIME  0x1000  /* update atime relative to mtime/ctime */
> +#define ST_RDONLY(1<<0) /* mount read-only */
> +#define ST_NOSUID(1<<1) /* ignore suid and sgid bits */
> +#define ST_NODEV (1<<2) /* disallow access to device special files */
> +#define ST_NOEXEC(1<<3) /* disallow program execution */
> +#define ST_SYNCHRONOUS   (1<<4) /* writes are synced at once */
> +#define ST_VALID (1<<5) /* f_flags support is implemented */
> +#define ST_MANDLOCK  (1<<6) /* allow mandatory locks on an FS */
> +/* (1<<7) used for ST_WRITE in glibc */
> +/* (1<<8) used for ST_APPEND in glibc */
> +/* (1<<9) used for ST_IMMUTABLE in glibc */
> +#define ST_NOATIME   (1<<10) /* do not update access times */
> +#define ST_NODIRATIME(1<<11) /* do not update directory access times 
> */
> +#define ST_RELATIME  (1<<12) /* update atime relative to mtime/ctime */
> 
> #endif
> --
> 2.17.0
> 


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

Re: [PATCH 2/6] statfs: use << to align with fs header

2018-04-13 Thread Andreas Dilger

On Apr 13, 2018, at 10:11 AM, Christian Brauner  
wrote:
> 
> Consistenly use << to define ST_* constants. This also aligns them with
> their MS_* counterparts in fs.h

IMHO, using (1 << 10) makes the code harder to debug.  If you see a field
in a structure like 0x8354, it is non-trivial to map this to the ST_*
flags if they are declared in the form (1 << 10) or BIT(10).  If they are
declared in the form 0x100 (as they are now) then it is trivial that the
ST_APPEND flag is set in 0x8354, and easy to understand the other flags.

So, my preference would be to NOT land this or the previous patch.

Cheers, Andreas

> Signed-off-by: Christian Brauner 
> ---
> include/linux/statfs.h | 26 +-
> 1 file changed, 13 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/statfs.h b/include/linux/statfs.h
> index 3142e98546ac..b336c04e793c 100644
> --- a/include/linux/statfs.h
> +++ b/include/linux/statfs.h
> @@ -27,18 +27,18 @@ struct kstatfs {
>  * ABI.  The exception is ST_VALID which has the same value as MS_REMOUNT
>  * which doesn't make any sense for statfs.
>  */
> -#define ST_RDONLY0x0001  /* mount read-only */
> -#define ST_NOSUID0x0002  /* ignore suid and sgid bits */
> -#define ST_NODEV 0x0004  /* disallow access to device special files */
> -#define ST_NOEXEC0x0008  /* disallow program execution */
> -#define ST_SYNCHRONOUS   0x0010  /* writes are synced at once */
> -#define ST_VALID 0x0020  /* f_flags support is implemented */
> -#define ST_MANDLOCK  0x0040  /* allow mandatory locks on an FS */
> -/* 0x0080 used for ST_WRITE in glibc */
> -/* 0x0100 used for ST_APPEND in glibc */
> -/* 0x0200 used for ST_IMMUTABLE in glibc */
> -#define ST_NOATIME   0x0400  /* do not update access times */
> -#define ST_NODIRATIME0x0800  /* do not update directory access times 
> */
> -#define ST_RELATIME  0x1000  /* update atime relative to mtime/ctime */
> +#define ST_RDONLY(1<<0) /* mount read-only */
> +#define ST_NOSUID(1<<1) /* ignore suid and sgid bits */
> +#define ST_NODEV (1<<2) /* disallow access to device special files */
> +#define ST_NOEXEC(1<<3) /* disallow program execution */
> +#define ST_SYNCHRONOUS   (1<<4) /* writes are synced at once */
> +#define ST_VALID (1<<5) /* f_flags support is implemented */
> +#define ST_MANDLOCK  (1<<6) /* allow mandatory locks on an FS */
> +/* (1<<7) used for ST_WRITE in glibc */
> +/* (1<<8) used for ST_APPEND in glibc */
> +/* (1<<9) used for ST_IMMUTABLE in glibc */
> +#define ST_NOATIME   (1<<10) /* do not update access times */
> +#define ST_NODIRATIME(1<<11) /* do not update directory access times 
> */
> +#define ST_RELATIME  (1<<12) /* update atime relative to mtime/ctime */
> 
> #endif
> --
> 2.17.0
> 


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 1223 matches

Mail list logo