bug#61386: [PATCH] cp,mv,install: Disable sparse copy on macOS

2023-03-07 Thread Pádraig Brady

On 07/03/2023 21:53, Paul Eggert wrote:

On 3/6/23 16:31, Pádraig Brady wrote:


syntax check now looks good thanks.


You're welcome. I'm getting a failure on tests/du/threshold.sh, though,
on Fedora 37 x86-64. Are you seeing that? See attached (compressed) log.



I see this as a more fundamental operational thing than a limit issue.
`split -n` needing the size up front alludes to the operation,


To ameliorate a bit, we can document this (done in the attached patch).

I don't see this as being significantly more serious than the other
utilities that read all their input (possibly from a pipe) before
outputting anything. They all need to deal with exhausting temporary
space, and 'split' is merely joining the company of 'sort' and 'tac'.



given split is often used with large data,
explicit control is often desired over where and how it's persisted.


Anybody needing that control can copy the data themselves to a temp
file, before calling 'split'. (Like 'sort' and 'tac'.)

Since the change doesn't invalidate existing uses, it sounds like you're
worried that users will want 'split' to work even on enormous pipe
inputs, inputs so large that /tmp fills up. I think this unlikely in
practice, but if it turns into a real concern I can work around most of
the problem by using the first output file as the temporary. However,
I'd prefer to avoid doing this unless it's necessary, as it seems overkill.

I installed the attached patch so that 'split' uses a temp file in this
situation, so it is no longer limited to the 128 KiB size that it was
before the attached patch. Hope this is good enough; if not please let
me know.


Thanks for doing that Paul.
The implementation looks good.

I'll look at the du test later.

cheers,
Pádraig





bug#61386: [PATCH] cp,mv,install: Disable sparse copy on macOS

2023-03-07 Thread Paul Eggert

On 3/6/23 16:31, Pádraig Brady wrote:


syntax check now looks good thanks.


You're welcome. I'm getting a failure on tests/du/threshold.sh, though, 
on Fedora 37 x86-64. Are you seeing that? See attached (compressed) log.




I see this as a more fundamental operational thing than a limit issue.
`split -n` needing the size up front alludes to the operation,


To ameliorate a bit, we can document this (done in the attached patch).

I don't see this as being significantly more serious than the other 
utilities that read all their input (possibly from a pipe) before 
outputting anything. They all need to deal with exhausting temporary 
space, and 'split' is merely joining the company of 'sort' and 'tac'.




given split is often used with large data,
explicit control is often desired over where and how it's persisted.


Anybody needing that control can copy the data themselves to a temp 
file, before calling 'split'. (Like 'sort' and 'tac'.)


Since the change doesn't invalidate existing uses, it sounds like you're 
worried that users will want 'split' to work even on enormous pipe 
inputs, inputs so large that /tmp fills up. I think this unlikely in 
practice, but if it turns into a real concern I can work around most of 
the problem by using the first output file as the temporary. However, 
I'd prefer to avoid doing this unless it's necessary, as it seems overkill.


I installed the attached patch so that 'split' uses a temp file in this 
situation, so it is no longer limited to the 128 KiB size that it was 
before the attached patch. Hope this is good enough; if not please let 
me know.

make-check-log.txt.gz
Description: application/gzip
From bb9dbcbbfd5c3159f975f39336a33319c6c5df04 Mon Sep 17 00:00:00 2001
From: Paul Eggert 
Date: Tue, 7 Mar 2023 12:58:12 -0800
Subject: [PATCH] split: support split -n on larger pipe input

* bootstrap.conf (gnulib_modules): Add free-posix, tmpfile.
* src/split.c (copy_to_tmpfile): New function.
(input_file_size): Use it to split larger files when sizes cannot
easily be determined via fstat or lseek.  See Bug#61386#235.
* tests/split/l-chunk.sh: Mark tests of /dev/zero as
very expensive since they exhaust /tmp.
---
 NEWS   |  3 ++
 bootstrap.conf |  2 +
 doc/coreutils.texi |  6 ++-
 src/split.c| 89 +-
 tests/split/l-chunk.sh | 15 ---
 5 files changed, 72 insertions(+), 43 deletions(-)

diff --git a/NEWS b/NEWS
index 5b0dc939c..3df17d3b3 100644
--- a/NEWS
+++ b/NEWS
@@ -145,6 +145,9 @@ GNU coreutils NEWS-*- outline -*-
   split now accepts options like '-n SIZE' that exceed machine integer
   range, when they can be implemented as if they were infinity.
 
+  split -n now accepts piped input even when not in round-robin mode,
+  by first copying input to a temporary file to determine its size.
+
   wc now accepts the --total={auto,never,always,only} option
   to give explicit control over when the total is output.
 
diff --git a/bootstrap.conf b/bootstrap.conf
index 01430429b..c122354a1 100644
--- a/bootstrap.conf
+++ b/bootstrap.conf
@@ -103,6 +103,7 @@ gnulib_modules="
   fnmatch-gnu
   fopen-safer
   fprintftime
+  free-posix
   freopen
   freopen-safer
   fseeko
@@ -270,6 +271,7 @@ gnulib_modules="
   time_rz
   timer-time
   timespec
+  tmpfile
   tzset
   uname
   unicodeio
diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index f0e46b9ee..7852e9f8a 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -3409,8 +3409,10 @@ are not split even if they overlap a partition, the files written
 can be larger or smaller than the partition size, and even empty
 if a line/record is so long as to completely overlap the partition.
 
-For @samp{r} mode, the size of @var{input} is irrelevant,
-and so can be a pipe for example.
+When the input is a pipe or some other special file where the size
+cannot easily be determined, there is no trouble for @samp{r} mode
+because the size of the input is irrelevant.  For other modes, such an
+input is first copied to a temporary to determine its size.
 
 @item -a @var{length}
 @itemx --suffix-length=@var{length}
diff --git a/src/split.c b/src/split.c
index 95d174a8b..d872ec56a 100644
--- a/src/split.c
+++ b/src/split.c
@@ -275,6 +275,39 @@ CHUNKS may be:\n\
   exit (status);
 }
 
+/* Copy the data in FD to a temporary file, then make that file FD.
+   Use BUF, of size BUFSIZE, to copy.  Return the number of
+   bytes copied, or -1 (setting errno) on error.  */
+static off_t
+copy_to_tmpfile (int fd, char *buf, idx_t bufsize)
+{
+  FILE *tmp = tmpfile ();
+  if (!tmp)
+return -1;
+  off_t copied = 0;
+  off_t r;
+
+  while (0 < (r = read (fd, buf, bufsize)))
+{
+  if (fwrite (buf, 1, r, tmp) != r)
+return -1;
+  if (INT_ADD_WRAPV (copied, r, ))
+{
+  errno = EOVERFLOW;
+  return -1;
+}
+}
+
+  if (r < 0)
+return r;
+  r = dup2 (fileno (tmp), fd);
+