bug#59382: cp(1) tries to allocate too much memory if filesystem blocksizes are unusual
> The block size for filesystems can also be quite large (currently, up
> to 16M).

It seems ZFS tries to "help" apps by reporting misinformation (namely a
smaller block size than is actually preferred) when the file is small.
This is unfortunate, since it messes up cp and similar programs that
need to juggle multiple block sizes. It also messes up any program that
assumes st_blksize is constant for the life of a file descriptor, which
"cp" does assume elsewhere.

GNU cp doesn't need ZFS's "help", as it's already smart enough not to
over-allocate a buffer when the input file is small but its blocksize is
large. Instead, this "help" from ZFS causes GNU cp to over-allocate
because it naively trusts the blocksize that ZFS reports.

> The proposed patch attached removes the use of buffer_lcm() and just
> picks the largest st_blksize, which would be 4MiB in your case. It
> also limits the max buffer size to 32MiB in the edge case where
> st_blksize returns a larger value than this.

I suppose this could break cp if st_blksize is not a power of 2, the
file is not a regular file, and reads must be a multiple of the block
size. POSIX allows such things, though I expect nowadays they'd be
limited to weird devices. Although we inadvertently removed support for
weird devices in 2009 with commit
55efc5f3ee485b3e31a91c331f07c89aeccc4e89, and nobody seems to care
(because people use dd or whatever to deal with weird devices), I think
it'd be better to limit the fix to regular files. And while we're at it
we might as well resurrect support for weird devices.

> +#include <assert.h>

No need for this, as static_assert works without it in C23, and Gnulib's
assert-h module supports this.

> +/* Set a max constraint to avoid excessive mem usage or type overflow.  */
> +enum { IO_BUFSIZE_MAX = 128 * IO_BUFSIZE };
> +static_assert (IO_BUFSIZE_MAX <= MIN (IDX_MAX, SIZE_MAX) / 2 + 1);

I'm leery of putting in a maximum as low as 16 MiB.
Although that's OK now (it matches OpenZFS's current maximum), cp in the
future will surely deal with bigger block sizes. Instead, how about if
we stick with GNU's "no arbitrary limits" policy and work around the ZFS
bug instead? Something like the attached patch, perhaps?

From 551f3f55180669ab0bfd6c5d9e3e0f38cb035172 Mon Sep 17 00:00:00 2001
From: Paul Eggert
Date: Sat, 19 Nov 2022 19:04:36 -0800
Subject: [PATCH] cp: work around ZFS misinformation

Problem reported by Korn Andras (Bug#59382).
* bootstrap.conf (gnulib_modules): Add count-leading-zeros,
which was already an indirect dependency, since ioblksize.h
now uses it directly.
* src/ioblksize.h: Include count-leading-zeros.h.
(io_blksize): Treat impossible blocksizes as IO_BUFSIZE.
When growing a blocksize to IO_BUFSIZE, keep it a multiple of
the stated blocksize.  Work around the ZFS performance bug.
---
 NEWS            |  3 +++
 bootstrap.conf  |  1 +
 src/ioblksize.h | 28 +++++++++++++++++++++++++++-
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/NEWS b/NEWS
index b6b5201e7..9282352c8 100644
--- a/NEWS
+++ b/NEWS
@@ -13,6 +13,9 @@ GNU coreutils NEWS                                  -*- outline -*-
   'cp -rx / /mnt' no longer complains "cannot create directory /mnt/".
   [bug introduced in coreutils-9.1]

+  cp, mv, and install no longer use overly large I/O buffers when ZFS
+  misinforms them about IO block sizes.
+
   'mv --backup=simple f d/' no longer mistakenly backs up d/f to f~.
   [bug introduced in coreutils-9.1]

diff --git a/bootstrap.conf b/bootstrap.conf
index 8e257a254..f8715068e 100644
--- a/bootstrap.conf
+++ b/bootstrap.conf
@@ -59,6 +59,7 @@ gnulib_modules="
   config-h
   configmake
   copy-file-range
+  count-leading-zeros
   crypto/md5
   crypto/sha1
   crypto/sha256

diff --git a/src/ioblksize.h b/src/ioblksize.h
index 8bd18ba05..aa367aa4e 100644
--- a/src/ioblksize.h
+++ b/src/ioblksize.h
@@ -18,6 +18,7 @@
 /* sys/stat.h and minmax.h will already have been included by system.h.  */
 #include "idx.h"
+#include "count-leading-zeros.h"
 #include "stat-size.h"

@@ -75,8 +76,33 @@ enum { IO_BUFSIZE = 128 * 1024 };
 static inline idx_t
 io_blksize (struct stat sb)
 {
+  /* Treat impossible blocksizes as if they were IO_BUFSIZE.  */
+  idx_t blocksize = ST_BLKSIZE (sb) <= 0 ? IO_BUFSIZE : ST_BLKSIZE (sb);
+
+  /* Use a blocksize of at least IO_BUFSIZE bytes, keeping it a
+     multiple of the original blocksize.  */
+  blocksize += (IO_BUFSIZE - 1) - (IO_BUFSIZE - 1) % blocksize;
+
+  /* For regular files we can ignore the blocksize if we think we know better.
+     ZFS sometimes understates the blocksize, because it thinks
+     apps stupidly allocate a block that large even for small files.
+     This misinformation can cause coreutils to use wrong-sized blocks.
+     Work around some of the performance bug by substituting the next
+     power of two when the reported blocksize is not a power of two.  */
+  if (S_ISREG (sb.st_mode)
+      && blocksize & (blocksize - 1))
+    {
+      int
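The patch text is cut off just after the S_ISREG test above. As a rough illustration only (this is not the committed coreutils code, which uses Gnulib's count-leading-zeros module and idx_t overflow checks), the two adjustments the patch describes, growing the buffer to at least IO_BUFSIZE while keeping a multiple of the stated blocksize, then substituting the next power of two for regular files, can be sketched as follows; io_blksize_sketch and next_pow2 are hypothetical helper names:

```c
#include <assert.h>

enum { IO_BUFSIZE = 128 * 1024 };

/* Portable stand-in for the count-leading-zeros trick:
   round a positive value up to the next power of two.  */
static unsigned long long
next_pow2 (unsigned long long x)
{
  unsigned long long p = 1;
  while (p < x)
    p <<= 1;
  return p;
}

/* Sketch of the adjusted io_blksize logic for a reported st_blksize;
   is_regular stands in for S_ISREG (sb.st_mode).  */
static unsigned long long
io_blksize_sketch (long long st_blksize, int is_regular)
{
  /* Treat impossible blocksizes as if they were IO_BUFSIZE.  */
  unsigned long long blocksize = st_blksize <= 0 ? IO_BUFSIZE : st_blksize;

  /* Use at least IO_BUFSIZE bytes, keeping a multiple of the
     stated blocksize.  */
  blocksize += (IO_BUFSIZE - 1) - (IO_BUFSIZE - 1) % blocksize;

  /* For regular files, work around ZFS's misreporting by rounding a
     non-power-of-two blocksize up to the next power of two.  */
  if (is_regular && (blocksize & (blocksize - 1)))
    blocksize = next_pow2 (blocksize);

  return blocksize;
}
```

With the reporter's source file, io_blksize_sketch (2647552, 1) yields 4194304 (4 MiB): the understated ZFS blocksize becomes a power of two instead of feeding an awkward odd multiple of 512 into the buffer-size computation, while non-regular files keep their stated blocksize multiple.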
On 19/11/2022 08:14, Korn Andras wrote:
> Hi,
>
> on zfs, newfstatat() can return an st_blksize that is approximately
> equal to the file size in bytes (if the file fit into a single zfs
> record). The block size for filesystems can also be quite large
> (currently, up to 16M).
>
> The code at
> https://github.com/coreutils/coreutils/blob/master/src/copy.c#L1343
> will try to compute the least common multiple of the input and output
> block sizes, then allocate a buffer of that size using mmap(). With
> such unusual block sizes, the least common multiple can be very
> large, causing the mmap() to return ENOMEM and cp to abort with
> "memory exhausted".
>
> Example:
>
> openat(AT_FDCWD, "tmp", O_RDONLY|O_PATH|O_DIRECTORY) = 3
> newfstatat(AT_FDCWD, "usr/src/zfs-2.1.6/configure", {st_dev=makedev(0, 0x30), st_ino=29267, st_mode=S_IFREG|0755, st_nlink=1, st_uid=0, st_gid=0, st_blksize=2647552, st_blocks=5177, st_size=2647046, st_atime=1668786119 /* 2022-11-18T16:41:59.760703544+0100 */, st_atime_nsec=760703544, st_mtime=1667705386 /* 2022-11-06T04:29:46+0100 */, st_mtime_nsec=0, st_ctime=1668785062 /* 2022-11-18T16:24:22.866737598+0100 */, st_ctime_nsec=866737598}, AT_SYMLINK_NOFOLLOW) = 0
> newfstatat(3, "configure", {st_dev=makedev(0, 0x32), st_ino=3838, st_mode=S_IFREG|0700, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4194304, st_blocks=1, st_size=0, st_atime=1668788205 /* 2022-11-18T17:16:45.416328293+0100 */, st_atime_nsec=416328293, st_mtime=1668788255 /* 2022-11-18T17:17:35.809758585+0100 */, st_mtime_nsec=809758585, st_ctime=1668788255 /* 2022-11-18T17:17:35.809758585+0100 */, st_ctime_nsec=809758585}, 0) = 0
> openat(AT_FDCWD, "usr/src/zfs-2.1.6/configure", O_RDONLY|O_NOFOLLOW) = 4
> newfstatat(4, "", {st_dev=makedev(0, 0x30), st_ino=29267, st_mode=S_IFREG|0755, st_nlink=1, st_uid=0, st_gid=0, st_blksize=2647552, st_blocks=5177, st_size=2647046, st_atime=1668786119 /* 2022-11-18T16:41:59.760703544+0100 */, st_atime_nsec=760703544, st_mtime=1667705386 /* 2022-11-06T04:29:46+0100 */, st_mtime_nsec=0, st_ctime=1668785062 /* 2022-11-18T16:24:22.866737598+0100 */, st_ctime_nsec=866737598}, AT_EMPTY_PATH) = 0
> openat(3, "configure", O_WRONLY|O_TRUNC) = 5
> ioctl(5, BTRFS_IOC_CLONE or FICLONE, 4) = -1 EXDEV (Invalid cross-device link)
> newfstatat(5, "", {st_dev=makedev(0, 0x32), st_ino=3838, st_mode=S_IFREG|0700, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4194304, st_blocks=1, st_size=0, st_atime=1668788205 /* 2022-11-18T17:16:45.416328293+0100 */, st_atime_nsec=416328293, st_mtime=1668788266 /* 2022-11-18T17:17:46.326056957+0100 */, st_mtime_nsec=326056957, st_ctime=1668788266 /* 2022-11-18T17:17:46.326056957+0100 */, st_ctime_nsec=326056957}, AT_EMPTY_PATH) = 0
> fadvise64(4, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
> copy_file_range(4, NULL, 5, NULL, 9223372035781033984, 0) = -1 EXDEV (Invalid cross-device link)
> mmap(NULL, 21688754176, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
> brk(0x55a7fa4b) = 0x55a2ed8ab000
> mmap(NULL, 21689794560, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
> openat(AT_FDCWD, "/usr/share/locale/en_US.UTF-8/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
> openat(AT_FDCWD, "/usr/share/locale/en_US.utf8/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
> openat(AT_FDCWD, "/usr/share/locale/en_US/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
> openat(AT_FDCWD, "/usr/share/locale/en.UTF-8/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
> openat(AT_FDCWD, "/usr/share/locale/en.utf8/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
> openat(AT_FDCWD, "/usr/share/locale/en/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
> write(2, "cp: ", 4) = 4
> write(2, "memory exhausted", 16) = 16
> write(2, "\n", 1) = 1
> lseek(0, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
> close(0) = 0
> close(1) = 0
> close(2) = 0
> exit_group(1) = ?
> I think cp(1) shouldn't insist on using the lcm if it's very large;
> additionally, it should never try to allocate a buffer that is (much)
> larger than the source file. Perhaps you could also try to use
> successively smaller buffers if allocating a large one fails?

Thank you for the detailed report. This buffer_lcm code has been present
for a long time, but I agree that in the presence of disparate
st_blksize values it can result in inappropriate buffer sizes. In your
case:

  src st_blksize = 2647552 (0x286600)  (st_size = 2647046)
  dst st_blksize = 4194304 (0x400000)  (st_size = 0)

which implies a least common multiple of 21688745984. Once st_blksize is
based on st_size we can get very large values like this. In the ZFS case
it seems to set st_blksize to st_size rounded up to the next multiple of
512.

The proposed patch attached removes the use of buffer_lcm() and just
picks the largest st_blksize.
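The blow-up is easy to reproduce numerically. A minimal sketch (plain gcd/lcm arithmetic, not coreutils' actual buffer_lcm code) using the two st_blksize values from the strace:

```c
#include <assert.h>

/* Greatest common divisor by Euclid's algorithm.  */
static unsigned long long
gcd (unsigned long long a, unsigned long long b)
{
  while (b != 0)
    {
      unsigned long long t = a % b;
      a = b;
      b = t;
    }
  return a;
}

/* Least common multiple; divide before multiplying to delay overflow.  */
static unsigned long long
lcm (unsigned long long a, unsigned long long b)
{
  return a / gcd (a, b) * b;
}
```

Here lcm (2647552, 4194304) is 21688745984, roughly 20 GiB, matching the failed ~21.7 GB mmap in the strace: the two sizes share only a factor of 512, so the lcm is nearly their full product.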