bug#46048: split -n K/N loses data, sum of output files is smaller than input file.
On 25/01/2021 14:21, Pádraig Brady wrote:
> On 24/01/2021 19:55, Paul Eggert wrote:
>> On 1/24/21 8:52 AM, Pádraig Brady wrote:
>>> -      if (lseek (STDIN_FILENO, start, SEEK_CUR) < 0)
>>> +      if (lseek (STDIN_FILENO, start, SEEK_SET) < 0)
>>
>> Dumb question: will this handle the case where you're splitting from
>> stdin and stdin is a seekable file and its initial file offset is nonzero?
>
> Right. Following on the logic from input_file_size(),
> I'm going with the attached, which I'll push later.
>
> Marking this as done.

Note this fix has now propagated to Fedora builds,
and is in the process of propagating to RHEL/Centos.

I've just logged a debian bug also:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=982300

cheers,
Pádraig
bug#46048: split -n K/N loses data, sum of output files is smaller than input file.
On 24/01/2021 19:55, Paul Eggert wrote:
> On 1/24/21 8:52 AM, Pádraig Brady wrote:
>> -      if (lseek (STDIN_FILENO, start, SEEK_CUR) < 0)
>> +      if (lseek (STDIN_FILENO, start, SEEK_SET) < 0)
>
> Dumb question: will this handle the case where you're splitting from
> stdin and stdin is a seekable file and its initial file offset is nonzero?

Right. Following on the logic from input_file_size(),
I'm going with the attached, which I'll push later.

Marking this as done.

thanks,
Pádraig

>From 8741d726327bddce3271de23af4aae4cfc185774 Mon Sep 17 00:00:00 2001
From: Pádraig Brady
Date: Mon, 25 Jan 2021 14:12:48 +
Subject: [PATCH] split: fix --number=K/N to output correct part of file

This functionality regressed with the adjustments
in commit v8.25-4-g62e7af032

* src/split.c (bytes_chunk_extract): Account for already read data
when seeking into the file.
* tests/split/b-chunk.sh: Use the hidden ---io-blksize option,
to test this functionality.
* NEWS: Mention the bug fix.
Fixes https://bugs.gnu.org/46048
---
 NEWS                   |  4 ++++
 src/split.c            |  2 +-
 tests/split/b-chunk.sh | 45 ++++++++++++++++++++++++++-------------------
 3 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/NEWS b/NEWS
index c2474fee3..e7fbde8ed 100644
--- a/NEWS
+++ b/NEWS
@@ -27,6 +27,10 @@ GNU coreutils NEWS                                    -*- outline -*-
   rm no longer skips an extra file when the removal of an empty directory fails.
   [bug introduced by the rewrite to use fts in coreutils-8.0]
 
+  split --number=K/N will again correctly split chunk K of N to stdout.
+  Previously a chunk starting after 128KiB, output the wrong part of the file.
+  [bug introduced in coreutils-8.26]
+
   tr no longer crashes when using --complement with certain
   invalid combinations of case character classes.
   [bug introduced in coreutils-8.6]
diff --git a/src/split.c b/src/split.c
index 0660da13f..59c234c12 100644
--- a/src/split.c
+++ b/src/split.c
@@ -1001,7 +1001,7 @@ bytes_chunk_extract (uintmax_t k, uintmax_t n, char *buf, size_t bufsize,
     }
   else
     {
-      if (lseek (STDIN_FILENO, start, SEEK_CUR) < 0)
+      if (lseek (STDIN_FILENO, start - initial_read, SEEK_CUR) < 0)
         die (EXIT_FAILURE, errno, "%s", quotef (infile));
       initial_read = SIZE_MAX;
     }
diff --git a/tests/split/b-chunk.sh b/tests/split/b-chunk.sh
index 8238dcb6d..dbed681f7 100755
--- a/tests/split/b-chunk.sh
+++ b/tests/split/b-chunk.sh
@@ -35,32 +35,39 @@ split -e -n 10 /dev/null || fail=1
 returns_ 1 stat x?? 2>/dev/null || fail=1
 
 printf '1\n2\n3\n4\n5\n' > input || framework_failure_
+printf '1\n2' > exp-1 || framework_failure_
+printf '\n3\n' > exp-2 || framework_failure_
+printf '4\n5\n' > exp-3 || framework_failure_
 
 for file in input /proc/version /sys/kernel/profiling; do
   test -f $file || continue
 
-  split -n 3 $file > out || fail=1
-  split -n 1/3 $file > b1 || fail=1
-  split -n 2/3 $file > b2 || fail=1
-  split -n 3/3 $file > b3 || fail=1
+  for blksize in 1 2 4096; do
+    if ! test "$file" = 'input'; then
+      # For /proc like files we must be able to read all
+      # into the internal buffer to be able to determine size.
+      test "$blksize" = 4096 || continue
+    fi
 
-  case $file in
-    input)
-      printf '1\n2' > exp-1
-      printf '\n3\n' > exp-2
-      printf '4\n5\n' > exp-3
+    split -n 3 ---io-blksize=$blksize $file > out || fail=1
+    split -n 1/3 ---io-blksize=$blksize $file > b1 || fail=1
+    split -n 2/3 ---io-blksize=$blksize $file > b2 || fail=1
+    split -n 3/3 ---io-blksize=$blksize $file > b3 || fail=1
 
-      compare exp-1 xaa || fail=1
-      compare exp-2 xab || fail=1
-      compare exp-3 xac || fail=1
-      ;;
-  esac
+    case $file in
+      input)
+        compare exp-1 xaa || fail=1
+        compare exp-2 xab || fail=1
+        compare exp-3 xac || fail=1
+        ;;
+    esac
 
-  compare xaa b1 || fail=1
-  compare xab b2 || fail=1
-  compare xac b3 || fail=1
-  cat xaa xab xac | compare - $file || fail=1
-  test -f xad && fail=1
+    compare xaa b1 || fail=1
+    compare xab b2 || fail=1
+    compare xac b3 || fail=1
+    cat xaa xab xac | compare - $file || fail=1
+    test -f xad && fail=1
+  done
 done
 
 Exit $fail
-- 
2.26.2
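For readers following the three lseek() variants discussed in this thread, here is a minimal standalone sketch of where each one leaves the file offset. It is an illustration only, not code from split.c: the function position_after_seek(), the enum names, and the FILE_SIZE and INITIAL_READ constants are all hypothetical, and a scratch file from tmpfile() stands in for stdin. The parameter initial_offset models the case raised in the quoted question, where stdin is a seekable file that was already positioned past byte 0 before split started; start is then taken to be relative to that position.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical sizes, chosen small so the sketch is easy to follow. */
enum { FILE_SIZE = 1000, INITIAL_READ = 100 };

enum seek_variant {
    VARIANT_REGRESSED, /* lseek (fd, start, SEEK_CUR)                 */
    VARIANT_ABSOLUTE,  /* lseek (fd, start, SEEK_SET)                 */
    VARIANT_ADJUSTED   /* lseek (fd, start - initial_read, SEEK_CUR)  */
};

/* Position a scratch file at initial_offset, consume INITIAL_READ bytes
   with read() (standing in for the initial buffered read), then seek
   towards a chunk that begins `start` bytes after initial_offset.
   Returns the resulting absolute file offset, or -1 on error. */
off_t position_after_seek(off_t initial_offset, off_t start,
                          enum seek_variant variant)
{
    assert(initial_offset >= 0 && initial_offset + INITIAL_READ <= FILE_SIZE);
    assert(start >= INITIAL_READ);

    FILE *tmp = tmpfile();
    if (!tmp)
        return -1;
    int fd = fileno(tmp);
    char buf[FILE_SIZE] = {0};
    off_t result = -1;

    if (write(fd, buf, FILE_SIZE) == FILE_SIZE
        && lseek(fd, initial_offset, SEEK_SET) == initial_offset
        && read(fd, buf, INITIAL_READ) == INITIAL_READ) {
        switch (variant) {
        case VARIANT_REGRESSED:
            result = lseek(fd, start, SEEK_CUR);
            break;
        case VARIANT_ABSOLUTE:
            result = lseek(fd, start, SEEK_SET);
            break;
        case VARIANT_ADJUSTED:
            result = lseek(fd, start - INITIAL_READ, SEEK_CUR);
            break;
        }
    }
    fclose(tmp);
    return result;
}
```

Under these assumptions the wanted offset is initial_offset + start. The regressed form overshoots it by INITIAL_READ, the SEEK_SET form is only right when initial_offset is 0, and the adjusted relative seek reaches it in both cases.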
bug#46048: split -n K/N loses data, sum of output files is smaller than input file.
On 1/24/21 8:52 AM, Pádraig Brady wrote:
> -      if (lseek (STDIN_FILENO, start, SEEK_CUR) < 0)
> +      if (lseek (STDIN_FILENO, start, SEEK_SET) < 0)

Dumb question: will this handle the case where you're splitting from
stdin and stdin is a seekable file and its initial file offset is nonzero?
bug#46048: split -n K/N loses data, sum of output files is smaller than input file.
On 24/01/2021 16:52, Pádraig Brady wrote:
> diff --git a/src/split.c b/src/split.c
> index 0660da13f..6aa8d50e9 100644
> --- a/src/split.c
> +++ b/src/split.c
> @@ -1001,7 +1001,7 @@ bytes_chunk_extract (uintmax_t k, uintmax_t n, char *buf, size_t bufsize,
>      }
>    else
>      {
> -      if (lseek (STDIN_FILENO, start, SEEK_CUR) < 0)
> +      if (lseek (STDIN_FILENO, start, SEEK_SET) < 0)
>          die (EXIT_FAILURE, errno, "%s", quotef (infile));
>        initial_read = SIZE_MAX;
>      }

The same adjustment is needed in lines_chunk_split().
I'll add a test also.

cheers,
Pádraig
bug#46048: split -n K/N loses data, sum of output files is smaller than input file.
On 23/01/2021 04:58, Paul Hirst wrote:
> split --number K/N appears to lose data, with the sum of the sizes of the
> output files being smaller than the original input file by 131072 bytes.
>
> $ split --version
> split (GNU coreutils) 8.30
> ...
>
> $ head -c 1000000 < /dev/urandom > test.dat
> $ split --number=1/4 test.dat > t1
> $ split --number=2/4 test.dat > t2
> $ split --number=3/4 test.dat > t3
> $ split --number=4/4 test.dat > t4
>
> $ ls -l
> -rw-r--r-- 1 user user  250000 Jan 22 18:36 t1
> -rw-r--r-- 1 user user  250000 Jan 22 18:36 t2
> -rw-r--r-- 1 user user  250000 Jan 22 18:36 t3
> -rw-r--r-- 1 user user  118928 Jan 22 18:36 t4
> -rw-r--r-- 1 user user 1000000 Jan 22 18:33 test.dat
>
> Surely this should not be the case?

Ugh. This functionality was broken for all files > 128KiB
due to adjustments for handling /dev/zero

  $ truncate -s 1000000 test.dat
  $ split --number=4/4 test.dat | wc -c
  118928

The following patch fixes it here.
I need to do some more testing, before committing.

thanks!

diff --git a/src/split.c b/src/split.c
index 0660da13f..6aa8d50e9 100644
--- a/src/split.c
+++ b/src/split.c
@@ -1001,7 +1001,7 @@ bytes_chunk_extract (uintmax_t k, uintmax_t n, char *buf, size_t bufsize,
     }
   else
     {
-      if (lseek (STDIN_FILENO, start, SEEK_CUR) < 0)
+      if (lseek (STDIN_FILENO, start, SEEK_SET) < 0)
         die (EXIT_FAILURE, errno, "%s", quotef (infile));
       initial_read = SIZE_MAX;
     }
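The 118928-byte last chunk and the 131072-byte shortfall can be reproduced with a little offset arithmetic. The sketch below is a model under stated assumptions, not code from split.c: it assumes a 1000000-byte input (the size consistent with the reported shortfall and last-chunk size), a 131072-byte (128 KiB) initial read, equal chunks of size/N with the last chunk running to end of file, and that a chunk starting inside the initially read data is served correctly. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Offset at which chunk k of n (1-based) begins. */
uintmax_t chunk_start(uintmax_t k, uintmax_t n, uintmax_t file_size)
{
    assert(n > 0 && k >= 1 && k <= n);
    return (k - 1) * (file_size / n);
}

/* Offset just past chunk k; the last chunk extends to end of file. */
uintmax_t chunk_end(uintmax_t k, uintmax_t n, uintmax_t file_size)
{
    assert(n > 0 && k >= 1 && k <= n);
    return k == n ? file_size : k * (file_size / n);
}

/* Where the regressed code ends up: initial_read bytes were already
   consumed, then it seeks a further `start` bytes relative to that.
   A chunk starting inside the initial read is assumed unaffected. */
uintmax_t regressed_position(uintmax_t start, uintmax_t initial_read)
{
    return start < initial_read ? start : initial_read + start;
}

/* Bytes the regressed code would emit for chunk k: it copies the wanted
   number of bytes from the wrong position and stops early at EOF. */
uintmax_t regressed_chunk_size(uintmax_t k, uintmax_t n,
                               uintmax_t file_size, uintmax_t initial_read)
{
    uintmax_t start = chunk_start(k, n, file_size);
    uintmax_t wanted = chunk_end(k, n, file_size) - start;
    uintmax_t pos = regressed_position(start, initial_read);
    if (pos >= file_size)
        return 0;
    uintmax_t available = file_size - pos;
    return wanted < available ? wanted : available;
}
```

With file_size = 1000000 and initial_read = 131072, chunks 1 to 3 come out at 250000 bytes each (chunks 2 and 3 from the wrong part of the file), and chunk 4 starts at 881072 and hits EOF after 118928 bytes. The four sizes sum to 868928, which is 131072 short of the input.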
bug#46048: split -n K/N loses data, sum of output files is smaller than input file.
split --number K/N appears to lose data, with the sum of the sizes of the
output files being smaller than the original input file by 131072 bytes.

$ split --version
split (GNU coreutils) 8.30
...

$ head -c 1000000 < /dev/urandom > test.dat
$ split --number=1/4 test.dat > t1
$ split --number=2/4 test.dat > t2
$ split --number=3/4 test.dat > t3
$ split --number=4/4 test.dat > t4

$ ls -l
-rw-r--r-- 1 user user  250000 Jan 22 18:36 t1
-rw-r--r-- 1 user user  250000 Jan 22 18:36 t2
-rw-r--r-- 1 user user  250000 Jan 22 18:36 t3
-rw-r--r-- 1 user user  118928 Jan 22 18:36 t4
-rw-r--r-- 1 user user 1000000 Jan 22 18:33 test.dat

Surely this should not be the case?

Paul