bug#46048: split -n K/N loses data, sum of output files is smaller than input file.

2021-02-08 Thread Pádraig Brady

On 25/01/2021 14:21, Pádraig Brady wrote:
> On 24/01/2021 19:55, Paul Eggert wrote:
>> On 1/24/21 8:52 AM, Pádraig Brady wrote:
>>
>>> -      if (lseek (STDIN_FILENO, start, SEEK_CUR) < 0)
>>> +      if (lseek (STDIN_FILENO, start, SEEK_SET) < 0)
>>
>> Dumb question: will this handle the case where you're splitting from
>> stdin and stdin is a seekable file and its initial file offset is nonzero?
>
> Right. Following on the logic from input_file_size(),
> I'm going with the attached, which I'll push later.
> Marking this as done.


Note this fix has now propagated to Fedora builds,
and is in the process of propagating to RHEL/CentOS.

I've just logged a Debian bug also:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=982300

cheers,
Pádraig





bug#46048: split -n K/N loses data, sum of output files is smaller than input file.

2021-01-25 Thread Pádraig Brady

On 24/01/2021 19:55, Paul Eggert wrote:
> On 1/24/21 8:52 AM, Pádraig Brady wrote:
>
>> -      if (lseek (STDIN_FILENO, start, SEEK_CUR) < 0)
>> +      if (lseek (STDIN_FILENO, start, SEEK_SET) < 0)
>
> Dumb question: will this handle the case where you're splitting from
> stdin and stdin is a seekable file and its initial file offset is nonzero?

Right. Following on the logic from input_file_size(),
I'm going with the attached, which I'll push later.
Marking this as done.

thanks,
Pádraig
From 8741d726327bddce3271de23af4aae4cfc185774 Mon Sep 17 00:00:00 2001
From: Pádraig Brady
Date: Mon, 25 Jan 2021 14:12:48 +0000
Subject: [PATCH] split: fix --number=K/N to output correct part of file

This functionality regressed with the adjustments
in commit v8.25-4-g62e7af032

* src/split.c (bytes_chunk_extract): Account for already read data
when seeking into the file.
* tests/split/b-chunk.sh: Use the hidden ---io-blksize option,
to test this functionality.
* NEWS: Mention the bug fix.
Fixes https://bugs.gnu.org/46048
---
 NEWS                   |  4 ++++
 src/split.c            |  2 +-
 tests/split/b-chunk.sh | 45 ++++++++++++++++++++++++------------------
 3 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/NEWS b/NEWS
index c2474fee3..e7fbde8ed 100644
--- a/NEWS
+++ b/NEWS
@@ -27,6 +27,10 @@ GNU coreutils NEWS                                    -*- outline -*-
   rm no longer skips an extra file when the removal of an empty directory fails.
   [bug introduced by the rewrite to use fts in coreutils-8.0]
 
+  split --number=K/N will again correctly split chunk K of N to stdout.
+  Previously a chunk starting after 128KiB, output the wrong part of the file.
+  [bug introduced in coreutils-8.26]
+
   tr no longer crashes when using --complement with certain
   invalid combinations of case character classes.
   [bug introduced in coreutils-8.6]
diff --git a/src/split.c b/src/split.c
index 0660da13f..59c234c12 100644
--- a/src/split.c
+++ b/src/split.c
@@ -1001,7 +1001,7 @@ bytes_chunk_extract (uintmax_t k, uintmax_t n, char *buf, size_t bufsize,
     }
   else
     {
-      if (lseek (STDIN_FILENO, start, SEEK_CUR) < 0)
+      if (lseek (STDIN_FILENO, start - initial_read, SEEK_CUR) < 0)
         die (EXIT_FAILURE, errno, "%s", quotef (infile));
       initial_read = SIZE_MAX;
     }
diff --git a/tests/split/b-chunk.sh b/tests/split/b-chunk.sh
index 8238dcb6d..dbed681f7 100755
--- a/tests/split/b-chunk.sh
+++ b/tests/split/b-chunk.sh
@@ -35,32 +35,39 @@ split -e -n 10 /dev/null || fail=1
 returns_ 1 stat x?? 2>/dev/null || fail=1
 
 printf '1\n2\n3\n4\n5\n' > input || framework_failure_
+printf '1\n2' > exp-1 || framework_failure_
+printf '\n3\n' > exp-2 || framework_failure_
+printf '4\n5\n' > exp-3 || framework_failure_
 
 for file in input /proc/version /sys/kernel/profiling; do
   test -f $file || continue
 
-  split -n 3 $file > out || fail=1
-  split -n 1/3 $file > b1 || fail=1
-  split -n 2/3 $file > b2 || fail=1
-  split -n 3/3 $file > b3 || fail=1
+  for blksize in 1 2 4096; do
+    if ! test "$file" = 'input'; then
+      # For /proc like files we must be able to read all
+      # into the internal buffer to be able to determine size.
+      test "$blksize" = 4096 || continue
+    fi
 
-  case $file in
-    input)
-      printf '1\n2' > exp-1
-      printf '\n3\n' > exp-2
-      printf '4\n5\n' > exp-3
+    split -n 3 ---io-blksize=$blksize $file > out || fail=1
+    split -n 1/3 ---io-blksize=$blksize $file > b1 || fail=1
+    split -n 2/3 ---io-blksize=$blksize $file > b2 || fail=1
+    split -n 3/3 ---io-blksize=$blksize $file > b3 || fail=1
 
-      compare exp-1 xaa || fail=1
-      compare exp-2 xab || fail=1
-      compare exp-3 xac || fail=1
-      ;;
-  esac
+    case $file in
+      input)
+        compare exp-1 xaa || fail=1
+        compare exp-2 xab || fail=1
+        compare exp-3 xac || fail=1
+        ;;
+    esac
 
-  compare xaa b1 || fail=1
-  compare xab b2 || fail=1
-  compare xac b3 || fail=1
-  cat xaa xab xac | compare - $file || fail=1
-  test -f xad && fail=1
+    compare xaa b1 || fail=1
+    compare xab b2 || fail=1
+    compare xac b3 || fail=1
+    cat xaa xab xac | compare - $file || fail=1
+    test -f xad && fail=1
+  done
 done
 
 Exit $fail
-- 
2.26.2
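
The question about a nonzero initial offset can be explored without split itself. The following is only a sketch that emulates the three seek strategies discussed in this thread with dd on a tiny file. It assumes that chunk offsets are counted from the offset stdin had when split started, which the reference to input_file_size() suggests but the thread does not spell out. The file contents, the values of p, initial_read, start and len, and the consume helper are all hypothetical and chosen for illustration.

```shell
#!/bin/sh
# Emulate the three seek strategies on a tiny file with dd.
# File contents and all numeric values are illustrative assumptions.
tmp=$(mktemp) || exit 1
printf '0123456789abcdefghijklmnopqrstuvwxyzABCD' > "$tmp"

p=10             # assumed offset of stdin when split starts
initial_read=8   # assumed bytes already read into the internal buffer
start=20         # chunk start, counted from p (assumption)
len=5            # chunk length

# Read and discard exactly $1 bytes from the shared stdin.
consume() { dd bs=1 count="$1" of=/dev/null 2>/dev/null; }

# Relative seek by start - initial_read, as in the committed patch.
fixed=$( { consume "$p"; consume "$initial_read"
           dd bs=1 skip=$(( start - initial_read )) count="$len" 2>/dev/null
         } < "$tmp" )

# Relative seek by start, as before the patch: overshoots by initial_read.
buggy=$( { consume "$p"; consume "$initial_read"
           dd bs=1 skip="$start" count="$len" 2>/dev/null
         } < "$tmp" )

# Absolute seek to start, as in the first proposed patch, emulated by
# reopening the file so that dd's skip is counted from offset 0.
absolute=$( dd bs=1 skip="$start" count="$len" 2>/dev/null < "$tmp" )

rm -f "$tmp"
echo "fixed=$fixed buggy=$buggy absolute=$absolute"
```

The wanted chunk begins at file offset p + start = 30, so only the first strategy yields "uvwxy". The second lands 8 bytes too far and returns just the last two bytes, "CD". The third ignores p and returns "klmno".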



bug#46048: split -n K/N loses data, sum of output files is smaller than input file.

2021-01-24 Thread Paul Eggert

On 1/24/21 8:52 AM, Pádraig Brady wrote:

> -      if (lseek (STDIN_FILENO, start, SEEK_CUR) < 0)
> +      if (lseek (STDIN_FILENO, start, SEEK_SET) < 0)

Dumb question: will this handle the case where you're splitting from
stdin and stdin is a seekable file and its initial file offset is nonzero?






bug#46048: split -n K/N loses data, sum of output files is smaller than input file.

2021-01-24 Thread Pádraig Brady

On 24/01/2021 16:52, Pádraig Brady wrote:
> diff --git a/src/split.c b/src/split.c
> index 0660da13f..6aa8d50e9 100644
> --- a/src/split.c
> +++ b/src/split.c
> @@ -1001,7 +1001,7 @@ bytes_chunk_extract (uintmax_t k, uintmax_t n, char *buf, size_t bufsize,
>      }
>    else
>      {
> -      if (lseek (STDIN_FILENO, start, SEEK_CUR) < 0)
> +      if (lseek (STDIN_FILENO, start, SEEK_SET) < 0)
>          die (EXIT_FAILURE, errno, "%s", quotef (infile));
>        initial_read = SIZE_MAX;
>      }


The same adjustment is needed in lines_chunk_split()
I'll add a test also.

cheers,
Pádraig






bug#46048: split -n K/N loses data, sum of output files is smaller than input file.

2021-01-24 Thread Pádraig Brady

On 23/01/2021 04:58, Paul Hirst wrote:
> split --number K/N appears to lose data, with the sum of the sizes of
> the output files being smaller than the original input file by 131072 bytes.
>
> $ split --version
> split (GNU coreutils) 8.30
> ...
>
> $ head -c 1000000 < /dev/urandom > test.dat
> $ split --number=1/4 test.dat > t1
> $ split --number=2/4 test.dat > t2
> $ split --number=3/4 test.dat > t3
> $ split --number=4/4 test.dat > t4
>
> $ ls -l
> -rw-r--r-- 1 user user  250000 Jan 22 18:36 t1
> -rw-r--r-- 1 user user  250000 Jan 22 18:36 t2
> -rw-r--r-- 1 user user  250000 Jan 22 18:36 t3
> -rw-r--r-- 1 user user  118928 Jan 22 18:36 t4
> -rw-r--r-- 1 user user 1000000 Jan 22 18:33 test.dat
>
> Surely this should not be the case?


Ugh. This functionality was broken for all files > 128KiB
due to adjustments for handling /dev/zero.

$ truncate -s 1000000 test.dat
$ split --number=4/4 test.dat | wc -c
118928

The following patch fixes it here.
I need to do some more testing, before committing.

thanks!

diff --git a/src/split.c b/src/split.c
index 0660da13f..6aa8d50e9 100644
--- a/src/split.c
+++ b/src/split.c
@@ -1001,7 +1001,7 @@ bytes_chunk_extract (uintmax_t k, uintmax_t n, char *buf, size_t bufsize,
     }
   else
     {
-      if (lseek (STDIN_FILENO, start, SEEK_CUR) < 0)
+      if (lseek (STDIN_FILENO, start, SEEK_SET) < 0)
         die (EXIT_FAILURE, errno, "%s", quotef (infile));
       initial_read = SIZE_MAX;
     }
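
The 118928-byte result follows from simple offset arithmetic, which the sketch below works through. It is illustrative only. It assumes a 1000000-byte input (a size chosen because it reproduces both the 118928-byte final chunk and the 131072-byte shortfall), an initial buffered read of 131072 bytes (128KiB), and a chunk start computed as (k - 1) * (size / n). The variable names are hypothetical and are not taken from split.c.

```shell
#!/bin/sh
# Offset arithmetic for "split --number=K/N" on a regular file.
# All values below are illustrative assumptions.
size=1000000          # assumed input size in bytes
n=4                   # number of chunks
k=4                   # chunk to extract
initial_read=131072   # assumed bytes already read into the buffer (128KiB)

chunk=$(( size / n ))
start=$(( (k - 1) * chunk ))   # where chunk k should begin

# Seeking forward by 'start' from the current position, which is
# already 'initial_read' bytes into the file, overshoots the target.
buggy_offset=$(( initial_read + start ))
buggy_bytes=$(( size - buggy_offset ))

# Seeking forward by 'start - initial_read' lands exactly on 'start'.
fixed_offset=$(( initial_read + (start - initial_read) ))
fixed_bytes=$(( size - fixed_offset ))

echo "buggy: offset=$buggy_offset bytes=$buggy_bytes"
echo "fixed: offset=$fixed_offset bytes=$fixed_bytes"
```

Under these assumptions the relative seek by start ends up at offset 881072 and leaves 118928 bytes for the last chunk, while a seek that accounts for the bytes already read ends up at 750000 and leaves the expected 250000.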





bug#46048: split -n K/N loses data, sum of output files is smaller than input file.

2021-01-23 Thread Paul Hirst
split --number K/N appears to lose data, with the sum of the sizes of
the output files being smaller than the original input file by 131072 bytes.

$ split --version
split (GNU coreutils) 8.30
...

$ head -c 1000000 < /dev/urandom > test.dat
$ split --number=1/4 test.dat > t1
$ split --number=2/4 test.dat > t2
$ split --number=3/4 test.dat > t3
$ split --number=4/4 test.dat > t4

$ ls -l
-rw-r--r-- 1 user user  250000 Jan 22 18:36 t1
-rw-r--r-- 1 user user  250000 Jan 22 18:36 t2
-rw-r--r-- 1 user user  250000 Jan 22 18:36 t3
-rw-r--r-- 1 user user  118928 Jan 22 18:36 t4
-rw-r--r-- 1 user user 1000000 Jan 22 18:33 test.dat

Surely this should not be the case?

Paul