Re: [BUG] copy_file_range with sysfs file as input
On Wed, Feb 3, 2021 at 5:04 PM Greg KH wrote: > > On Mon, Jan 25, 2021 at 03:54:31PM +0800, Nicolas Boichat wrote: > > Hi copy_file_range experts, > > > > We hit this interesting issue when upgrading Go compiler from 1.13 to > > 1.15 [1]. Basically we use Go's `io.Copy` to copy the content of > > `/sys/kernel/debug/tracing/trace` to a temporary file. > > Nit, the above file is NOT a sysfs file. Odds are it is either a > debugfs, or a tracefs file, please check your mounts to determine which > it is, as that matters a lot on the kernel backend side for trying to > figure out what is going on here :) Yes yes it's tracefs ,-) But, from the other thread https://lkml.org/lkml/2021/1/26/2029 sysfs (and any other fs that generates files dynamically like procfs) would likely hit issues as well (but maybe in more rare circumstances). Thanks! > > thanks, > > greg k-h
Re: [BUG] copy_file_range with sysfs file as input
On Mon, Jan 25, 2021 at 03:54:31PM +0800, Nicolas Boichat wrote: > Hi copy_file_range experts, > > We hit this interesting issue when upgrading Go compiler from 1.13 to > 1.15 [1]. Basically we use Go's `io.Copy` to copy the content of > `/sys/kernel/debug/tracing/trace` to a temporary file. Nit, the above file is NOT a sysfs file. Odds are it is either a debugfs, or a tracefs file, please check your mounts to determine which it is, as that matters a lot on the kernel backend side for trying to figure out what is going on here :) thanks, greg k-h
Re: [BUG] copy_file_range with sysfs file as input
On Tue, Jan 26, 2021 at 11:50:50AM +0800, Nicolas Boichat wrote: > On Tue, Jan 26, 2021 at 9:34 AM Dave Chinner wrote: > > > > On Mon, Jan 25, 2021 at 03:54:31PM +0800, Nicolas Boichat wrote: > > > Hi copy_file_range experts, > > > > > > We hit this interesting issue when upgrading Go compiler from 1.13 to > > > 1.15 [1]. Basically we use Go's `io.Copy` to copy the content of > > > `/sys/kernel/debug/tracing/trace` to a temporary file. > > > > > > Under the hood, Go now uses `copy_file_range` syscall to optimize the > > > copy operation. However, that fails to copy any content when the input > > > file is from sysfs/tracefs, with an apparent size of 0 (but there is > > > still content when you `cat` it, of course). > > > > > > A repro case is available in comment7 (adapted from the man page), > > > also copied below [2]. > > > > > > Output looks like this (on kernels 5.4.89 (chromeos), 5.7.17 and > > > 5.10.3 (chromeos)) > > > $ ./copyfrom /sys/kernel/debug/tracing/trace x > > > 0 bytes copied > > > > That's basically telling you that copy_file_range() was unable to > > copy anything. The man page says: > > > > RETURN VALUE > >Upon successful completion, copy_file_range() will return > >the number of bytes copied between files. This could be less > >than the length originally requested. If the file offset > >of fd_in is at or past the end of file, no bytes are copied, > >and copy_file_range() returns zero. > > > > THe man page explains it perfectly. > > I'm not that confident the explanation is perfect ,-) > > How does one define "EOF"? The read manpage > (https://man7.org/linux/man-pages/man2/read.2.html) defines it as a > zero return value. And so does copy_file_range(). That's the -API definition-, it does not define the kernel implementation of how to decide when the file is at EOF. > I don't think using the inode file size is > standard. It is the standard VFS filesystem definition of EOF. Indeed: copy_file_range() vfs_copy_file_range() generic_copy_file_checks() . /* Shorten the copy to EOF */ size_in = i_size_read(inode_in); if (pos_in >= size_in) count = 0; else count = min(count, size_in - (uint64_t)pos_in); That inode size check for EOF is -exactly- what is triggering here, and a copy of zero length returns 0 bytes having done nothing. The page cache read path does similar things in generic_file_buffered_read() to avoid truncate races exposing stale/bad data to userspace: /* * i_size must be checked after we know the pages are Uptodate. * * Checking i_size after the check allows us to calculate * the correct value for "nr", which means the zero-filled * part of the page is not copied back to userspace (unless * another truncate extends the file - this is desired though). */ isize = i_size_read(inode); if (unlikely(iocb->ki_pos >= isize)) goto put_pages; > > 'cat' "works" in this situation because it doesn't check the file > > size and just attempts to read unconditionally from the file. Hence > > it happily returns non-existent stale data from busted filesystem > > implementations that allow data to be read from beyond EOF... > > `cp` also works, so does `dd` and basically any other file operation. They do not use a syscall interface that can offload work to filesystems, low level block layer software, hardware and/or remote systems. copy_file_range() is restricted to regular files and does complex stuff that read() and friends will never do, so we have strictly enforced rules to prevent people from playing fast and loose and silently corrupting user data with it Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [BUG] copy_file_range with sysfs file as input
On Tue, Jan 26, 2021 at 9:34 AM Dave Chinner wrote: > > On Mon, Jan 25, 2021 at 03:54:31PM +0800, Nicolas Boichat wrote: > > Hi copy_file_range experts, > > > > We hit this interesting issue when upgrading Go compiler from 1.13 to > > 1.15 [1]. Basically we use Go's `io.Copy` to copy the content of > > `/sys/kernel/debug/tracing/trace` to a temporary file. > > > > Under the hood, Go now uses `copy_file_range` syscall to optimize the > > copy operation. However, that fails to copy any content when the input > > file is from sysfs/tracefs, with an apparent size of 0 (but there is > > still content when you `cat` it, of course). > > > > A repro case is available in comment7 (adapted from the man page), > > also copied below [2]. > > > > Output looks like this (on kernels 5.4.89 (chromeos), 5.7.17 and > > 5.10.3 (chromeos)) > > $ ./copyfrom /sys/kernel/debug/tracing/trace x > > 0 bytes copied > > That's basically telling you that copy_file_range() was unable to > copy anything. The man page says: > > RETURN VALUE >Upon successful completion, copy_file_range() will return >the number of bytes copied between files. This could be less >than the length originally requested. If the file offset >of fd_in is at or past the end of file, no bytes are copied, >and copy_file_range() returns zero. > > THe man page explains it perfectly. I'm not that confident the explanation is perfect ,-) How does one define "EOF"? The read manpage (https://man7.org/linux/man-pages/man2/read.2.html) defines it as a zero return value. I don't think using the inode file size is standard. Seems like the kernel is not even trying to read from the source file here. In any case, I can fix this issue by dropping the count check here: https://elixir.bootlin.com/linux/latest/source/fs/read_write.c#L1445 . I'll send a patch so that we can discuss based on that. > Look at the trace file you are > trying to copy: > > $ ls -l /sys/kernel/debug/tracing/trace > -rw-r--r-- 1 root root 0 Jan 19 12:17 /sys/kernel/debug/tracing/trace > $ cat /sys/kernel/debug/tracing/trace > tracer: nop > # > # entries-in-buffer/entries-written: 0/0 #P:8 > # > # _-=> irqs-off > # / _=> need-resched > #| / _---=> hardirq/softirq > #|| / _--=> preempt-depth > #||| / delay > # TASK-PID CPU# TIMESTAMP FUNCTION > # | | | | | > > Yup, the sysfs file reports it's size as zero length, so the CFR > syscall is saying "there's nothing to copy from this empty file" and > so correctly is returning zero without even trying to copy anything > because the file offset is at EOF... > > IOWs, there's no copy_file_range() bug here - it's behaving as > documented. > > 'cat' "works" in this situation because it doesn't check the file > size and just attempts to read unconditionally from the file. Hence > it happily returns non-existent stale data from busted filesystem > implementations that allow data to be read from beyond EOF... `cp` also works, so does `dd` and basically any other file operation. (and I wouldn't call procfs, sysfs, debugfs and friends "busted", they are just... special) > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com
Re: [BUG] copy_file_range with sysfs file as input
On Mon, Jan 25, 2021 at 03:54:31PM +0800, Nicolas Boichat wrote: > Hi copy_file_range experts, > > We hit this interesting issue when upgrading Go compiler from 1.13 to > 1.15 [1]. Basically we use Go's `io.Copy` to copy the content of > `/sys/kernel/debug/tracing/trace` to a temporary file. > > Under the hood, Go now uses `copy_file_range` syscall to optimize the > copy operation. However, that fails to copy any content when the input > file is from sysfs/tracefs, with an apparent size of 0 (but there is > still content when you `cat` it, of course). > > A repro case is available in comment7 (adapted from the man page), > also copied below [2]. > > Output looks like this (on kernels 5.4.89 (chromeos), 5.7.17 and > 5.10.3 (chromeos)) > $ ./copyfrom /sys/kernel/debug/tracing/trace x > 0 bytes copied That's basically telling you that copy_file_range() was unable to copy anything. The man page says: RETURN VALUE Upon successful completion, copy_file_range() will return the number of bytes copied between files. This could be less than the length originally requested. If the file offset of fd_in is at or past the end of file, no bytes are copied, and copy_file_range() returns zero. THe man page explains it perfectly. Look at the trace file you are trying to copy: $ ls -l /sys/kernel/debug/tracing/trace -rw-r--r-- 1 root root 0 Jan 19 12:17 /sys/kernel/debug/tracing/trace $ cat /sys/kernel/debug/tracing/trace tracer: nop # # entries-in-buffer/entries-written: 0/0 #P:8 # # _-=> irqs-off # / _=> need-resched #| / _---=> hardirq/softirq #|| / _--=> preempt-depth #||| / delay # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | Yup, the sysfs file reports it's size as zero length, so the CFR syscall is saying "there's nothing to copy from this empty file" and so correctly is returning zero without even trying to copy anything because the file offset is at EOF... IOWs, there's no copy_file_range() bug here - it's behaving as documented. 'cat' "works" in this situation because it doesn't check the file size and just attempts to read unconditionally from the file. Hence it happily returns non-existent stale data from busted filesystem implementations that allow data to be read from beyond EOF... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [BUG] copy_file_range with sysfs file as input
Thanks for the note. I'm not a kernel developer, but to me this sounds like a kernel bug. It seems particularly unfortunate that copy_file_range returns 0 in this case. From the perspective of the Go standard library, what we would need is some mechanism to detect when the copy_file_range system call will not or did not work correctly. As the biggest hammer, we currently only call copy_file_range on kernel versions 5.3 and newer. We can bump that requirement if necessary. Please feel free to open a bug about this at https://golang.org/issue, but we'll need guidance as to what we should do to avoid the problem. Thanks. Ian On Sun, Jan 24, 2021 at 11:54 PM Nicolas Boichat wrote: > > Hi copy_file_range experts, > > We hit this interesting issue when upgrading Go compiler from 1.13 to > 1.15 [1]. Basically we use Go's `io.Copy` to copy the content of > `/sys/kernel/debug/tracing/trace` to a temporary file. > > Under the hood, Go now uses `copy_file_range` syscall to optimize the > copy operation. However, that fails to copy any content when the input > file is from sysfs/tracefs, with an apparent size of 0 (but there is > still content when you `cat` it, of course). > > A repro case is available in comment7 (adapted from the man page), > also copied below [2]. > > Output looks like this (on kernels 5.4.89 (chromeos), 5.7.17 and > 5.10.3 (chromeos)) > $ ./copyfrom /sys/kernel/debug/tracing/trace x > 0 bytes copied > $ cat x > $ cat /sys/kernel/debug/tracing/trace > # tracer: nop > # > # entries-in-buffer/entries-written: 0/0 #P:8 > # > #_-=> irqs-off > # / _=> need-resched > # | / _---=> hardirq/softirq > # || / _--=> preempt-depth > # ||| / delay > # TASK-PID CPU# TIMESTAMP FUNCTION > # | | | | | > > I can try to dig further, but thought you'd like to get a bug report > as soon as possible. > > Thanks, > > Nicolas > > [1] http://issuetracker.google.com/issues/178332739 > [2] > #define _GNU_SOURCE > #include > #include > #include > #include > #include > #include > > int > main(int argc, char **argv) > { > int fd_in, fd_out; > loff_t ret; > > if (argc != 3) { > fprintf(stderr, "Usage: %s \n", > argv[0]); > exit(EXIT_FAILURE); > } > > fd_in = open(argv[1], O_RDONLY); > if (fd_in == -1) { > perror("open (argv[1])"); > exit(EXIT_FAILURE); > } > > fd_out = open(argv[2], O_CREAT | O_WRONLY | O_TRUNC, 0644); > if (fd_out == -1) { > perror("open (argv[2])"); > exit(EXIT_FAILURE); > } > > ret = copy_file_range(fd_in, NULL, fd_out, NULL, 1024, 0); > if (ret == -1) { > perror("copy_file_range"); > exit(EXIT_FAILURE); > } > printf("%d bytes copied\n", (int)ret); > > close(fd_in); > close(fd_out); > exit(EXIT_SUCCESS); > }
[BUG] copy_file_range with sysfs file as input
Hi copy_file_range experts, We hit this interesting issue when upgrading Go compiler from 1.13 to 1.15 [1]. Basically we use Go's `io.Copy` to copy the content of `/sys/kernel/debug/tracing/trace` to a temporary file. Under the hood, Go now uses `copy_file_range` syscall to optimize the copy operation. However, that fails to copy any content when the input file is from sysfs/tracefs, with an apparent size of 0 (but there is still content when you `cat` it, of course). A repro case is available in comment7 (adapted from the man page), also copied below [2]. Output looks like this (on kernels 5.4.89 (chromeos), 5.7.17 and 5.10.3 (chromeos)) $ ./copyfrom /sys/kernel/debug/tracing/trace x 0 bytes copied $ cat x $ cat /sys/kernel/debug/tracing/trace # tracer: nop # # entries-in-buffer/entries-written: 0/0 #P:8 # #_-=> irqs-off # / _=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | I can try to dig further, but thought you'd like to get a bug report as soon as possible. Thanks, Nicolas [1] http://issuetracker.google.com/issues/178332739 [2] #define _GNU_SOURCE #include #include #include #include #include #include int main(int argc, char **argv) { int fd_in, fd_out; loff_t ret; if (argc != 3) { fprintf(stderr, "Usage: %s \n", argv[0]); exit(EXIT_FAILURE); } fd_in = open(argv[1], O_RDONLY); if (fd_in == -1) { perror("open (argv[1])"); exit(EXIT_FAILURE); } fd_out = open(argv[2], O_CREAT | O_WRONLY | O_TRUNC, 0644); if (fd_out == -1) { perror("open (argv[2])"); exit(EXIT_FAILURE); } ret = copy_file_range(fd_in, NULL, fd_out, NULL, 1024, 0); if (ret == -1) { perror("copy_file_range"); exit(EXIT_FAILURE); } printf("%d bytes copied\n", (int)ret); close(fd_in); close(fd_out); exit(EXIT_SUCCESS); }