Re: [coreutils] join feature: auto-format

2010-10-07 Thread Assaf Gordon
Pádraig Brady wrote, On 10/07/2010 06:22 AM:
 On 07/10/10 01:03, Pádraig Brady wrote:
 On 06/10/10 21:41, Assaf Gordon wrote:

 The --auto-format feature simply builds the -o format line 
 automatically, based on the number of columns from both input files.

 Thanks for persisting with this and presenting a concise example.
 I agree that this is useful and can't think of a simple workaround.
 Perhaps the interface would be better as:

 -o {all (default), padded, FORMAT}

 where padded is the functionality you're suggesting?
 
 Thinking more about it, we mightn't need any new options at all.
 Currently -e is redundant if -o is not specified.
 So how about changing that so that if -e is specified
 we operate as above by auto inserting empty fields?
 Also I wouldn't base on the number of fields in the first line,
 instead auto padding to the biggest number of fields
 on the current lines under consideration.

My concern is the principle of least surprise - if there are existing 
scripts/programs that specify -e without -o (which doesn't make sense, but is 
still possible) - this change will alter their behavior.

Also, implying/forcing 'auto-format' when -e is used without -o might be a 
bit confusing.
I prefer to have the user explicitly ask for auto-format - at least they will 
know what the output will look like.
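For concreteness, the padded behavior under discussion corresponds to the `-o auto` spelling that GNU join eventually adopted (coreutils 8.12 and newer) - a sketch, assuming a join new enough to support it:

```shell
# Two files with different field counts; join infers each file's width
# from its first line and pads missing fields with the -e string.
printf 'a 1 2\nb 3 4\n' > f1
printf 'a x\nc y\n'     > f2

# -a1 -a2 keeps unpairable lines from both files;
# -e NA fills the fields the other file cannot supply.
join -a1 -a2 -e NA -o auto f1 f2
# a 1 2 x
# b 3 4 NA
# c NA NA y
rm -f f1 f2
```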

That being said,
I can send a new patch with one of the new methods (implicit auto-format or -o 
padded) - which one is preferred?

Thanks,
 -gordon




[coreutils] du/bigtime fail ( was: new snapshot available: coreutils-8.7.66-561f8)

2010-12-17 Thread Assaf Gordon
Jim Meyering wrote, On 12/17/2010 05:07 AM:
 Here's a preview of what should soon appear as coreutils-8.8. [...]
 Any testing you can perform over the weekend would be most welcome.
 
On CentOS 5.4, du/bigtime fails (in a reproducible manner).

$ uname -a
Linux XX 2.6.18-164.11.1.el5 #1 SMP Wed Jan 20 07:32:21 EST 2010 x86_64 
GNU/Linux

==
   GNU coreutils 8.7.66-561f8: tests/test-suite.log   
==

1 of 1 test failed.  

.. contents:: :depth: 2


FAIL: du/bigtime (exit: 1)
==

--- out 2010-12-17 15:29:06.0 +
+++ exp 2010-12-17 15:29:06.0 +
@@ -1 +1 @@
-4  9223372036854775807 future
+0  9223372036854775807 future


-gordon



Re: [coreutils] join feature: auto-format

2011-01-12 Thread Assaf Gordon
Pádraig Brady wrote, On 01/11/2011 07:35 AM:
 
 Spending another few minutes on this, I realized
 that we should not be trying to homogenize the number
 of fields from each file, but rather the fields used
 for a particular file in each line. The only sensible
 basis for that is the first line as previously suggested.
 
 The interface would be a little different for that.
 I was thinking of:
 
   -o 'header'  Infer the format from the first line of each file
 
I second the idea of using the first line as the basis for the auto-formatting,
but have reservations about the wording: '-o header' somewhat implies that the 
first line has to be an actual header line (with column names or similar), 
while it can just be the first line of actual data if the file doesn't have a 
header line.

Something like '-o auto' might be less confusing.

Just my 2 cents,
 -gordon




bug#7961: sort

2011-02-02 Thread Assaf Gordon
On a somewhat off-topic note,

Francesco Bettella wrote, On 02/02/2011 07:42 AM:
 
 I'm issuing the following sort commands (see attached files):
 [prompt1]  sort -k 1.4,1n asd1 > asd1.sorted
 [prompt2]  sort -k 2.4,2n asd2 > asd2.sorted
 
 the first one works as I would expect, the second one doesn't.

When sorting chromosome names, the version sort option (-V, introduced in 
coreutils 7.0) sorts as you would expect,
saving you the need to skip three characters in the sort key, and also 
accommodating mixing letters and numbers.

Example:

$ cat chrom.txt
chr1
chrUn_gl000232
chrY
chr2
chr13
chrM
chrUn_gl000218
chr6_hap
chr2R
chr16
chr10
chr6_dbb_hap3
chr4
chr3L
chr4_ctg9_hap1
chr3R
chr3
chrX

$ sort -k1,1V chrom.txt
chr1
chr2
chr2R
chr3
chr3L
chr3R
chr4
chr4_ctg9_hap1
chr6_dbb_hap3
chr6_hap
chr10
chr13
chr16
chrM
chrUn_gl000218
chrUn_gl000232
chrX
chrY


-gordon






sort parameters question: -V and -f

2011-04-06 Thread Assaf Gordon
Hello,

I'm wondering if this is a bug (where -f is ignored when using version sort):

=
$ sort --debug -f -k2,2V 
sort: using simple byte comparison
sort: leading blanks are significant in key 1; consider also specifying `b'
sort: option `-f' is ignored
==

My assumption is that using -f as stand alone parameter should have the same 
effect as using it in a specific key (for that key). e.g. the following two 
commands are equivalent:
 sort -f -k1,1
 sort -k1f,1

But the following two commands are not equivalent (because the standalone -f 
is ignored):
 sort -f -k1V,1
 sort -k1Vf,1

Example:
=

## This works
$ printf 'a\nB\nc\n' | sort -k1f,1
a
B
c
$ printf 'a\nB\nc\n' | sort -f -k1,1
a
B
c

## This doesn't work
$ printf 'a13\nA5\na1\n' | sort -k1Vf,1
a1
A5
a13

$ printf 'a13\nA5\na1\n' | sort -f -k1V,1
A5
a1
a13
===

I'm using coreutils 8.10.

-gordon



Re: sort parameters question: -V and -f

2011-04-07 Thread Assaf Gordon
Eric Blake wrote, On 04/06/2011 06:36 PM:
 On 04/06/2011 04:04 PM, Pádraig Brady wrote:
 On 06/04/11 22:26, Assaf Gordon wrote:
 I'm wondering if this is a bug (where -f is ignored when using version 
 sort):

 The same happens for any ordering option.
 If any is specified for the key, then all global options are ignored.
 This is specified by POSIX and here it's demonstrated on solaris:

 Not only that, but --debug would have told you the same:
 

--debug did tell me that, but I thought it was a bug, not a feature.
I assumed -f was cumulative, not overridden when specifying a per-key sort 
order - I should have read the docs more carefully.

Thanks for the quick and detailed response.

-gordon




Re: uniq --accumulate

2012-02-07 Thread Assaf Gordon
Pádraig Brady wrote, On 02/07/2012 11:00 AM:
 On 02/07/2012 03:56 PM, Peng Yu wrote:

 Suppose that I have a table of the following, where the last column is
 a number. I'd like to accumulate the number of rows that are the same
 for all the remaining columns.

 
 Thanks for the suggestion,
 but this is too specialized for coreutils I think.

Slightly off-topic for coreutils,

but a package called BEDTools ( http://code.google.com/p/bedtools/ ) provides 
a program called groupBy, which does exactly that, and more.

Akin to SQL's group by command, the program can group a text file by a 
specified column, and perform operations (count,sum,mean,median,etc.) on 
another column.
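For readers without BEDTools, the same group-and-accumulate step can be sketched with plain awk and sort (the column numbers here are illustrative, not tied to the original poster's data):

```shell
# Sum the last column over rows that agree on the remaining columns,
# akin to SQL's GROUP BY ... SUM(); the trailing sort makes the
# (unordered) awk array output deterministic.
printf 'x a 1\nx a 2\ny b 5\n' |
  awk '{ key = $1 FS $2; sum[key] += $3 }
       END { for (k in sum) print k, sum[k] }' |
  sort
# x a 3
# y b 5
```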

-gordon




sort: new feature: use environment variable to set buffer size

2012-08-29 Thread Assaf Gordon
Hello,

I'd like to suggest a new feature to sort: the ability to set the buffer size 
(-S/--buffer-size X) using an environment variable.

In summary:
 $ export SORT_BUFFER_SIZE=20G 
 $ someprogram | sort -k1,1 > output.txt
 # sort will use 20G of RAM, as if --buffer-size 20G was specified.


The rationale:
recent commits improved the guessed buffer size when sort is given an input 
file,
but these don't apply if sort is used as part of a pipeline, with a pipe as 
input, e.g.
  some | program | sort | other | programs > file 

(Tested with v8.19 on linux 2.6.32, sort consumes a few MB of RAM, even though 
many GBs are available).
This results in many small temporary files being created.

The script (which uses sort) is not under my direct control, but even if it was,
I don't want to hard-code the amount of memory used, to keep it portable to 
different servers.

AFAIK, there are four aspects of sort that affect performance:
1. number of threads:
changeable with --parallel=X and with environment variable OMP_NUM_THREADS.

2. temporary files location:
changeable with --temporary-directory=DIR and with environment variable 
TMPDIR.

3. memory usage:
changeable with --buffer-size=SIZE but not with environment variable.

4. compression program:
changeable with --compression-program=PROG but not with environment variable.
(but at the moment, I do not address this aspect).
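For comparison, the existing knobs can already be combined explicitly on one invocation (SORT_BUFFER_SIZE itself is only the proposal in this mail, not a released feature):

```shell
# Thread count, temp-file directory and buffer size set per invocation;
# of these, only --buffer-size has no environment-variable equivalent.
printf '3\n1\n2\n' |
  TMPDIR=/tmp sort --parallel=2 --buffer-size=1M -n
# 1
# 2
# 3
```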


With the attached patch, sort will read an environment variable named 
SORT_BUFFER_SIZE, and will treat it as if --buffer-size was specified (but 
only if --buffer-size wasn't used on the command line).

If this is conceptually acceptable, I'll prepare a proper patch (with NEWS, 
help, docs, etc.).

Regards,
 -gordon
From db8f1c319d772c5b13df51894f279c3a7276416e Mon Sep 17 00:00:00 2001
From: A. Gordon <gor...@cshl.edu>
Date: Wed, 29 Aug 2012 16:42:31 -0400
Subject: [PATCH] sort: accept buffer size from environment variable.

---
 src/sort.c |7 +++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index 9dbfee1..1505a6d 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -4648,6 +4648,13 @@ main (int argc, char **argv)
   files = minus;
 }
 
+  if (sort_size == 0)
+{
+  char const *buffer_size = getenv ("SORT_BUFFER_SIZE");
+  if (buffer_size)
+specify_sort_size (-1, 'S', buffer_size);
+}
+
   /* Need to re-check that we meet the minimum requirement for memory
  usage with the final value for NMERGE. */
   if (0 < sort_size)
-- 
1.7.9.1



physmem: a new program to report memory information

2012-08-30 Thread Assaf Gordon
Hello,

Related to the previous sort+memory envvar usage thread: 
http://thread.gmane.org/gmane.comp.gnu.coreutils.general/3028/focus=3090 .

Attached is a suggestion for a tiny command-line program, physmem, that, 
similarly to nproc, exposes the gnulib functions physmem_total() and 
physmem_available().

The code is closely modeled after nproc, and the recommended memory usage is 
calculated using sort's default_sort_size() .

The program works like this:
===
$ ./src/physmem --help
Usage: ./src/physmem [OPTION]...
Prints information about physical memory.

  -t, --total   print the total physical memory.
  -a, --available   print the available physical memory.
  -r, --recommended print a safe recommended amount of usable memory.
  -h, --human-readable  print sizes in human readable format (e.g., 1K 234M 2G)
  --si  like -h, but use powers of 1000 not 1024
  --help display this help and exit
  --version  output version information and exit

Report physmem bugs to bug-coreut...@gnu.org
GNU coreutils home page: http://www.gnu.org/software/coreutils/
General help using GNU software: http://www.gnu.org/gethelp/
Report physmem translation bugs to http://translationproject.org/team/
For complete documentation, run: info coreutils 'physmem invocation'
===

The actual working code (at the bottom of physmem.c) is:
===
  switch(memory_report_type)
{
case total:
  memory = physmem_total();
  break;

case available:
  memory = physmem_available();
  break;

case recommended:
  memory = default_sort_size();
  break;
}

  char buf[LONGEST_HUMAN_READABLE + 1];
  fputs (human_readable (memory, buf, human_output_opts, 1, 1), stdout);
  fputs ("\n", stdout);
===

So it's very simple, and relies on existing coreutils code.
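On Linux/glibc, the total that physmem_total() reports can be approximated from the same sysconf values via getconf - a portability-limited sketch only; the physmem program itself is just the proposal in this mail:

```shell
# Total physical memory in bytes = page count * page size,
# mirroring what gnulib's physmem_total() computes from sysconf.
echo $(( $(getconf _PHYS_PAGES) * $(getconf PAGE_SIZE) ))
```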

Please let me know if this is something you'd be willing to include in 
coreutils.

Thanks,
 -gordon
From 1eccf56a49bc0aa3f167a0fce1a65c91a92ed468 Mon Sep 17 00:00:00 2001
From: Assaf Gordon <assafgor...@gmail.com>
Date: Thu, 30 Aug 2012 11:21:57 -0400
Subject: [PATCH] physmem: A new program to report mem information.

---
 src/Makefile.am |2 +
 src/physmem.c   |  215 +++
 2 files changed, 217 insertions(+), 0 deletions(-)
 create mode 100644 src/physmem.c

diff --git a/src/Makefile.am b/src/Makefile.am
index 896c902..ae0c20c7 100644
--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -90,6 +90,7 @@ EXTRA_PROGRAMS = \
   od		\
   paste		\
   pathchk	\
+  physmem	\
   pr		\
   printenv	\
   printf	\
@@ -198,6 +199,7 @@ chroot_LDADD = $(LDADD)
 cksum_LDADD = $(LDADD)
 comm_LDADD = $(LDADD)
 nproc_LDADD = $(LDADD)
+physmem_LDADD = $(LDADD)
 cp_LDADD = $(LDADD)
 csplit_LDADD = $(LDADD)
 cut_LDADD = $(LDADD)
diff --git a/src/physmem.c b/src/physmem.c
new file mode 100644
index 000..b990503
--- /dev/null
+++ b/src/physmem.c
@@ -0,0 +1,215 @@
+/* physmem - report the total/available/recommended memory
+   Copyright (C) 2009-2012 Free Software Foundation, Inc.
+
+   This program is free software: you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation, either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see http://www.gnu.org/licenses/.  */
+
+/* Written by Assaf Gordon.  */
+
+#include <config.h>
+#include <getopt.h>
+#include <stdio.h>
+#include <sys/types.h>
+
+#include "system.h"
+#include "error.h"
+#include "xstrtol.h"
+#include "physmem.h"
+#include "human.h"
+
+#ifndef RLIMIT_DATA
+struct rlimit { size_t rlim_cur; };
+# define getrlimit(Resource, Rlp) (-1)
+#endif
+
+/* The official name of this program (e.g., no 'g' prefix).  */
+#define PROGRAM_NAME "physmem"
+
+#define AUTHORS proper_name ("Assaf Gordon")
+
+/* Human-readable options for output.  */
+static int human_output_opts;
+
+enum memory_report_type
+  {
+total,			/* default */
+available,
+recommended
+  };
+
+static enum memory_report_type memory_report_type = total;
+
+/* For long options that have no equivalent short option, use a
+   non-character as a pseudo short option, starting with CHAR_MAX + 1.  */
+enum
+{
+  HUMAN_SI_OPTION= CHAR_MAX + 1
+};
+
+static struct option const longopts[] =
+{
+  {"total", no_argument, NULL, 't'},
+  {"available", no_argument, NULL, 'a'},
+  {"recommended", no_argument, NULL, 'r'},
+  {"human", no_argument, NULL, 'h'},
+  {"si", no_argument, NULL, HUMAN_SI_OPTION},
+  {GETOPT_HELP_OPTION_DECL},
+  {GETOPT_VERSION_OPTION_DECL},
+  {NULL, 0, NULL, 0}
+};
+
+/* Return the default sort size.
+   FIXME: this function was copied from

Command-line program to convert 'human' sizes?

2012-12-04 Thread Assaf Gordon
Hello,

Is there a command-line program (or a recommended way) to expose the coreutils' 
common functionality of converting raw sizes to 'human' sizes and vice versa?

I'm referring to the -h parameter that du/df/sort accept for reporting 
human sizes, but also the reverse (where sort's -G accepts "40M" 
as valid input).

I found two relevant threads, but no resolution:
http://lists.gnu.org/archive/html/coreutils/2011-08/msg00035.html
http://lists.gnu.org/archive/html/coreutils/2012-02/msg00088.html

Thanks,
 -gordon



Re: Command-line program to convert 'human' sizes?

2012-12-04 Thread Assaf Gordon
Hello,

Pádraig Brady wrote, On 12/04/2012 11:30 AM:
 On 12/04/2012 04:25 PM, Assaf Gordon wrote:
 
 Nothing yet. The plan is to make a numfmt command available with this 
 interface:
 http://lists.gnu.org/archive/html/coreutils/2012-02/msg00085.html
 

Attached is a stub for such a program (mostly command-line processing, no 
actual conversion yet).

Please let me know if you're willing to eventually include this program (and 
I'll add more functionality, tests, docs, etc.).

I tried to follow the existing code conventions in other programs, but all 
comments and suggestions are welcome.

-gordon

From bb5162a7521aee6b95c902acc65c1d3800ba4f30 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Tue, 4 Dec 2012 15:32:05 -0500
Subject: [PATCH] numfmt: stub code for new program

---
 build-aux/gen-lists-of-programs.sh |1 +
 src/.gitignore |1 +
 src/numfmt.c   |  298 
 3 files changed, 300 insertions(+), 0 deletions(-)
 create mode 100644 src/numfmt.c

diff --git a/build-aux/gen-lists-of-programs.sh b/build-aux/gen-lists-of-programs.sh
index 212ce02..bf63ee3 100755
--- a/build-aux/gen-lists-of-programs.sh
+++ b/build-aux/gen-lists-of-programs.sh
@@ -85,6 +85,7 @@ normal_progs='
 nl
 nproc
 nohup
+numfmt
 od
 paste
 pathchk
diff --git a/src/.gitignore b/src/.gitignore
index 181..25573df 100644
--- a/src/.gitignore
+++ b/src/.gitignore
@@ -59,6 +59,7 @@ nice
 nl
 nohup
 nproc
+numfmt
 od
 paste
 pathchk
diff --git a/src/numfmt.c b/src/numfmt.c
new file mode 100644
index 000..e513194
--- /dev/null
+++ b/src/numfmt.c
@@ -0,0 +1,298 @@
+/* Reformat numbers like 11505426432 to the more human-readable 11G
+   Copyright (C) 2012 Free Software Foundation, Inc.
+
+   This program is free software: you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation, either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see http://www.gnu.org/licenses/.  */
+
+#include <config.h>
+#include <getopt.h>
+#include <stdio.h>
+#include <sys/types.h>
+
+#include "argmatch.h"
+#include "error.h"
+#include "system.h"
+#include "xstrtol.h"
+
+/* The official name of this program (e.g., no 'g' prefix).  */
+#define PROGRAM_NAME "numfmt"
+
+#define AUTHORS proper_name ()
+
+#define BUFFER_SIZE (16 * 1024)
+
+enum
+{
+  FROM_OPTION = CHAR_MAX + 1,
+  FROM_UNIT_OPTION,
+  TO_OPTION,
+  TO_UNIT_OPTION,
+  ROUND_OPTION,
+  SUFFIX_OPTION
+};
+
+enum scale_type
+{
+scale_none, /* the default: no scaling */
+scale_auto, /* --from only */
+scale_SI,
+scale_IEC,
+scale_custom  /* --to only, custom scale */
+};
+
+static char const *const scale_from_args[] =
+{
+"auto", "SI", "IEC", NULL
+};
+static enum scale_type const scale_from_types[] =
+{
+scale_auto, scale_SI, scale_IEC
+};
+
+static char const *const scale_to_args[] =
+{
+"SI", "IEC", NULL
+};
+static enum scale_type const scale_to_types[] =
+{
+scale_SI, scale_IEC
+};
+
+
+enum round_type
+{
+round_ceiling,
+round_floor,
+round_nearest
+};
+
+static char const *const round_args[] =
+{
+"ceiling", "floor", "nearest", NULL
+};
+
+static enum round_type const round_types[] =
+{
+round_ceiling, round_floor, round_nearest
+};
+
+static struct option const longopts[] =
+{
+  {"from", required_argument, NULL, FROM_OPTION},
+  {"from-unit", required_argument, NULL, FROM_UNIT_OPTION},
+  {"to", required_argument, NULL, TO_OPTION},
+  {"to-unit", required_argument, NULL, TO_UNIT_OPTION},
+  {"round", required_argument, NULL, ROUND_OPTION},
+  {"format", required_argument, NULL, 'f'},
+  {"suffix", required_argument, NULL, SUFFIX_OPTION},
+  {GETOPT_HELP_OPTION_DECL},
+  {GETOPT_VERSION_OPTION_DECL},
+  {NULL, 0, NULL, 0}
+};
+
+
+enum scale_type scale_from = scale_none;
+enum scale_type scale_to = scale_none;
+enum round_type _round = round_ceiling;
+char const *format_str = NULL;
+const char *suffix = NULL;
+uintmax_t from_unit_size = 1;
+uintmax_t to_unit_size = 1;
+
+/* Convert a string of decimal digits, N_STRING, with an optional suffix
+   to an integral value.  Upon successful conversion,
+   return that value.  If it cannot be converted, give a diagnostic and exit.
+*/
+static uintmax_t
+string_to_integer (const char *n_string)
+{
+  strtol_error s_err;
+  uintmax_t n;
+
+  s_err = xstrtoumax (n_string, NULL, 10, &n, "bkKmMGTPEZY0");
+
+  if (s_err == LONGINT_OVERFLOW)
+{
+  error (EXIT_FAILURE, 0,
+ _("%s: unit size is so large that it is not representable"),
+n_string);
+}
+
+  if (s_err != LONGINT_OK

Re: Command-line program to convert 'human' sizes?

2012-12-04 Thread Assaf Gordon
Pádraig Brady wrote, On 12/04/2012 06:11 PM:
 On 12/04/2012 10:55 PM, Assaf Gordon wrote:
 Hello,

 Pádraig Brady wrote, On 12/04/2012 11:30 AM:
 Nothing yet. The plan is to make a numfmt command available with this 
 interface:
 http://lists.gnu.org/archive/html/coreutils/2012-02/msg00085.html


 Attached is a patch, with a proof-of-concept working 'numfmt'.

 
 Thanks a lot for working on this.
 All I'll say at this stage is to take it
 as far as you can as per the interface specified
 at the above URL with a mind to reusing stuff from
 lib/human.c if possible.
 
 We'll review it then with a view to including it ASAP.

Thanks!

Input-wise, I had to copy and modify the xstrtol implementation, because the 
original function doesn't allow the caller to force SI or IEC or AUTO (it has 
internal logic to deduce it, based on parameters and user input).

Output-wise, human_readable() from lib/human.c is called as-is (no code 
modification).

Regarding the advanced options:
1. I'm wondering what is the reason/need for --to=NUMBER ? A base 
different than 1024/1000 would result in values like "4K" that are very 
unintuitive (since they don't mean 4096/4000).

2. FORMAT: is the only use-case adding spaces before/after the number, and 
grouping?
human_readable() already has support for grouping, and padding might be added 
with different parameters?

I'm asking about #1 and #2, because if we forgo them, human_readable() could 
be used as-is. Otherwise, it will require copy-pasting and some modifications.

3. SUFFIX - is the purpose of this simply to print a string following the 
number? or are there some more complications?

4. Should non-suffix characters following a parsed number cause errors, or be 
ignored? e.g. "4KQO" 








Re: Command-line program to convert 'human' sizes?

2012-12-04 Thread Assaf Gordon
Pádraig Brady wrote, On 12/04/2012 07:31 PM:
 On 12/05/2012 12:19 AM, Jim Meyering wrote:
 Pádraig Brady wrote:
 On 12/04/2012 11:35 PM, Assaf Gordon wrote:
 Pádraig Brady wrote, On 12/04/2012 06:11 PM:
 On 12/04/2012 10:55 PM, Assaf Gordon wrote:
 Pádraig Brady wrote, On 12/04/2012 11:30 AM:

 snip long discussion 

Would the following be acceptable:
1. remove --to=NUMBER option
2. surplus characters immediately following a converted number trigger a 
warning (error?), 
  except if the following characters match exactly the suffix parameter.


Regarding --format:
The implementation doesn't really use printf, so %d isn't directly usable.
One option is to tell the user to use %s (instead of %d), and we'll simply 
put the result of human_readable() as the string parameter in vasnprintf - 
this will be flexible in terms of alignment.
Another option is to remove the --format option, and replace it with --padding 
or similar.

Regarding grouping (thousands separator):
This only has an effect when not using --to=SI or --to=IEC, right?
Perhaps we can add a separate option --grouping, and simply turn on the 
human_grouping flag? (easy to implement).
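The locale dependence of grouping is easy to demonstrate with coreutils printf's apostrophe flag, which draws on the same LC_NUMERIC grouping information that human_readable() consults (env printf is used here to force the coreutils binary rather than a shell builtin):

```shell
# Grouping comes from the locale.  The C locale defines no thousands
# separator, so the apostrophe flag is a no-op there:
LC_ALL=C env printf "%'d\n" 1234567        # 1234567
# A locale that defines grouping would print 1,234,567 instead, e.g.:
#   LC_ALL=en_US.UTF-8 env printf "%'d\n" 1234567
```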







Re: Command-line program to convert 'human' sizes?

2012-12-04 Thread Assaf Gordon
Hello,

 Pádraig Brady wrote, On 12/04/2012 11:30 AM:
 Nothing yet. The plan is to make a numfmt command available with this 
 interface:
 http://lists.gnu.org/archive/html/coreutils/2012-02/msg00085.html


Attached is a patch, with a proof-of-concept working 'numfmt'.

Works: from=SI/IEC/AUTO, to=SI/IEC, from-units, to-units, suffix, round.
Doesn't work: format, to=NUMBER,field=N .

The code isn't clean and can be improved.
Currently, either every (non-option) command-line parameter is expected to be 
a number, or every line on stdin is expected to start with a number.

Comments are welcome,
 -gordon


Examples:

$ ./src/numfmt --from=auto 2K
2000
$ ./src/numfmt --from=auto 2Ki
2048
$ ./src/numfmt --from=SI 2K
2000
$ ./src/numfmt --from=SI 2Ki
2000
$ ./src/numfmt --from=IEC  2Ki
2048
$ ./src/numfmt --from=SI --to=IEC 2Ki
2.0K
$ ./src/numfmt --from=IEC --to=SI 2K 
2.1k
$ ./src/numfmt --from=IEC 1M
1048576
$ ./src/numfmt --from=IEC --to=SI 1M
1.1M
$ ./src/numfmt --from=IEC --to-unit=20 1M
52429
$ ./src/numfmt --from-unit=512 --to=IEC 4
2.0K
$ ./src/numfmt --round=ceiling --to=IEC 2000
2.0K
$ ./src/numfmt --round=floor --to=IEC 2000
1.9K


Help screen
===
$ ./src/numfmt --help 
Usage: ./src/numfmt [OPTIONS] [NUMBER]
Reformats NUMBER(s) to/from human-readable values.
Numbers can be processed either from stdin or command arguments.

  --from=UNIT Auto-scale input numbers (auto, SI, IEC)
  If not specified, input suffixes are ignored.
  --from-unit=N   Specify the input unit size (instead of the default 1).
  --to=UNIT   Auto-scale output numbers (SI, IEC, N).
  If not specified, 
  --to-unit=N Specify the output unit size (instead of the default 1).
  --round=METHOD  Round input numbers. METHOD can be:
  ceiling (the default), floor, nearest
  -f, --format=FORMAT   use printf style output FORMAT.
Default output format is "%d".
  --suffix=SUFFIX   
  
  --help display this help and exit
  --version  output version information and exit

UNIT options:
 auto ('--from' only):
  1K  = 1000
  1Ki = 1024
  1G  = 1000000000
  1Gi = 1073741824
 SI:
  1K* = 1000
  (additional suffixes after K/G/T do not alter the scale)
 IEC:
  1K* = 1024
  (additional suffixes after K/G/T do not alter the scale)
 N ('--to' only):
  Use number N as the scale.


Examples:
  ./src/numfmt --to=SI 1000   -> 1K
  echo 1K | ./src/numfmt --from=SI    -> 1000
  echo 1K | ./src/numfmt --from=IEC   -> 1024

Report numfmt bugs to bug-coreut...@gnu.org
GNU coreutils home page: http://www.gnu.org/software/coreutils/
General help using GNU software: http://www.gnu.org/gethelp/
Report numfmt translation bugs to http://translationproject.org/team/
For complete documentation, run: info coreutils 'numfmt invocation'
===

 build-aux/gen-lists-of-programs.sh |1 +
 src/.gitignore |1 +
 src/numfmt.c   |  549 
 3 files changed, 551 insertions(+), 0 deletions(-)

diff --git a/build-aux/gen-lists-of-programs.sh b/build-aux/gen-lists-of-programs.sh
index 212ce02..bf63ee3 100755
--- a/build-aux/gen-lists-of-programs.sh
+++ b/build-aux/gen-lists-of-programs.sh
@@ -85,6 +85,7 @@ normal_progs='
 nl
 nproc
 nohup
+numfmt
 od
 paste
 pathchk
diff --git a/src/.gitignore b/src/.gitignore
index 181..25573df 100644
--- a/src/.gitignore
+++ b/src/.gitignore
@@ -59,6 +59,7 @@ nice
 nl
 nohup
 nproc
+numfmt
 od
 paste
 pathchk
diff --git a/src/numfmt.c b/src/numfmt.c
new file mode 100644
index 000..99b1450
--- /dev/null
+++ b/src/numfmt.c
@@ -0,0 +1,549 @@
+/* Reformat numbers like 11505426432 to the more human-readable 11G
+   Copyright (C) 2012 Free Software Foundation, Inc.
+
+   This program is free software: you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation, either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see http://www.gnu.org/licenses/.  */
+
+#include <config.h>
+#include <getopt.h>
+#include <stdio.h>
+#include <sys/types.h>
+
+#include "argmatch.h"
+#include "error.h"
+#include "system.h"
+#include "xstrtol.h"
+#include "human.h"
+
+/* The official name of this program (e.g., no 'g' prefix).  */
+#define PROGRAM_NAME "numfmt"
+
+#define AUTHORS proper_name ()
+
+#define BUFFER_SIZE (16 * 1024)
+
+enum
+{
+  FROM_OPTION = CHAR_MAX + 1,
+  FROM_UNIT_OPTION,
+  TO_OPTION,
+  TO_UNIT_OPTION,
+  ROUND_OPTION,
+  SUFFIX_OPTION
+};
+

Re: Command-line program to convert 'human' sizes?

2012-12-05 Thread Assaf Gordon
Hello,

Attached is a working version of numfmt.

The following are implemented:
===
Usage: ./src/numfmt [OPTIONS] [NUMBER]
Reformats NUMBER(s) to/from human-readable values.
Numbers can be processed either from stdin or command arguments.

  --from=UNIT Auto-scale input numbers to UNITs. Default is 'none'.
  See UNIT below.
  --from-unit=N   Specify the input unit size (instead of the default 1).
  --to=UNIT   Auto-scale output numbers to UNITs.
  See UNIT below.
  --to-unit=N Specify the output unit size (instead of the default 1).
  --round=METHOD  Round input numbers. METHOD can be:
  ceiling (the default), floor, nearest
  --suffix=SUFFIX Add SUFFIX to output numbers, and accept optional SUFFIX
  in input numbers.
  --padding=N Pad the output to N characters.
  Default is right-aligned. Negative N will left-align.
  Note: if N is too small, the output will be truncated,
  and a warning will be printed to stderr.
  --grouping  Group digits together (e.g. 1,000,000).
  Uses the locale-defined grouping (i.e. it has no effect
  in C/POSIX locales).
  --field N   Replace the number in input field N (default is 1)
  -d, --delimiter=X  use X instead of whitespace for field delimiter
===

Also included in the patch is a test file, testing all sorts of combination of 
the parameters (hopefully catches most of the corner cases).

There's also an undocumented option --debug that will show what's going on:
===
$ ./src/numfmt --debug --field 2 --suffix=Foo --from=SI --to=IEC "Hello 70MFoo World"
Extracting Fields:
  input: 'Hello 70MFoo World'
  field: 2
  prefix: 'Hello'
  number: '70MFoo'
  suffix: 'World'
Trimming suffix 'Foo'
Parsing number:
  input string: '70M'
  remaining characters: ''
  numeric value: 7000
Formatting output:
  value: 7000
  humanized: '67M'
Hello 67MFoo World
===

Comments are welcomed,
 -gordon



numfmt3.patch.gz
Description: GNU Zip compressed data


Re: Command-line program to convert 'human' sizes?

2012-12-05 Thread Assaf Gordon
Assaf Gordon wrote, On 12/05/2012 06:13 PM:
 Attached is a working version of numfmt.

Somewhat related:
How do I add a rule to build the man page (I'm working on passing 'make 
syntax-check').

I've added the following line to 'man/local.mk':
man/numfmt.1:src/numfmt 

But it doesn't get built (after bootstrap+configure+make).

Thanks,
 -gordon






Re: Command-line program to convert 'human' sizes?

2012-12-05 Thread Assaf Gordon


On 12/05/12 19:58, Jim Meyering wrote:

Assaf Gordon wrote:

Somewhat related:
How do I add a rule to build the man page (I'm working on passing 'make 
syntax-check').

I've added the following line to 'man/local.mk':
 man/numfmt.1:src/numfmt

But it doesn't get built (after bootstrap+configure+make).

You'll have to add numfmt to the list of programs in
build-aux/gen-lists-of-programs.sh

Then, be sure to run make syntax-check, and it'll cross-check
a few other things that have to be synced with that list.

I already added it to gen-lists-of-programs.sh (under "normal_progs") - and 
the compilation works fine.
I've also added a line to tests/local.mk and make check works fine.
But the man page part seems to be ignored.

The strange thing is that make doesn't complain about the job, it simply 
ignores it:
===
$ grep numfmt man/local.mk
man/numfmt.1:src/numfmt
$ ls man/numfmt.1
ls: cannot access man/numfmt.1: No such file or directory
$ make man/numfmt.1
gordon@tango:~/projects/coreutils$ ls man/numfmt.1
ls: cannot access man/numfmt.1: No such file or directory
===

When I add -d to make, it ends with these messages:
===
$ make -d man/numfmt.1
 snip 
   Prerequisite `src/numfmt.o' is older than target `src/numfmt'.
   Prerequisite `src/libver.a' is older than target `src/numfmt'.
   Prerequisite `lib/libcoreutils.a' is older than target `src/numfmt'.
   Prerequisite `lib/libcoreutils.a' is older than target `src/numfmt'.
   Prerequisite `src/.dirstamp' is older than target `src/numfmt'.
  No need to remake target `src/numfmt'.
 Finished prerequisites of target file `man/numfmt.1'.
Must remake target `man/numfmt.1'.
Successfully remade target file `man/numfmt.1'.
===

But the man file is not created.

Thanks,
 -gordon




Re: Command-line program to convert 'human' sizes?

2012-12-05 Thread Assaf Gordon

On 12/05/12 20:49, Assaf Gordon wrote:


On 12/05/12 19:58, Jim Meyering wrote:

Assaf Gordon wrote:

Somewhat related:
How do I add a rule to build the man page? (I'm working on passing 'make 
syntax-check'.)

You'll have to add numfmt to the list of programs in
build-aux/gen-lists-of-programs.sh


I already added it to gen-lists-of-programs.sh (under normal_progs) - and 
the compilation works fine.
I've also added a line to tests/local.mk and make check works fine.
But the man page part seems to be ignored.

Nevermind - figured it out:
A stub man/numfmt.x file is required.
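For anyone hitting the same wall: coreutils generates each man page with help2man, and man/local.mk only builds man/PROG.1 when a corresponding man/PROG.x include file exists. A minimal stub, modelled on the other man/*.x files (the one-line description here is illustrative):

```
[NAME]
numfmt \- Convert numbers from/to human-readable strings
```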




Re: Command-line program to convert 'human' sizes?

2012-12-05 Thread Assaf Gordon

Hello,

With the attached patch, numfmt passes make syntax-check and almost passes make check 
and make distcheck.

Regarding the checks: tests/misc/numfmt.pl passes all tests successfully.

But:

1. When running make check, tests/df/total-verify.sh fails, so the check 
isn't complete.

2. When running make check TESTS=tests/misc/numfmt VERBOSE=yes, the test 
script passes, but the process later fails with this error:

$ make check TESTS=tests/misc/numfmt VERBOSE=yes
  GEN    public-submodule-commit
make  check-recursive
make[1]: Entering directory `/home/gordon/projects/coreutils'
Making check in po
make[2]: Entering directory `/home/gordon/projects/coreutils/po'
make[2]: Leaving directory `/home/gordon/projects/coreutils/po'
Making check in .
make[2]: Entering directory `/home/gordon/projects/coreutils'
make  check-TESTS check-local
make[3]: Entering directory `/home/gordon/projects/coreutils'
make[4]: Entering directory `/home/gordon/projects/coreutils'
PASS: tests/misc/numfmt.pl
=
1 test passed
=
make[4]: Leaving directory `/home/gordon/projects/coreutils'
  GEN    check-README
  GEN    check-duplicate-no-install
  GEN    sc-avoid-builtin
  GEN    sc-avoid-io
  GEN    sc-avoid-non-zero
  GEN    sc-avoid-path
  GEN    sc-avoid-timezone
  GEN    sc-avoid-zeroes
  GEN    sc-exponent-grouping
  GEN    sc-lower-case-var
  GEN    sc-use-small-caps-NUL
  GEN    check-texinfo
make[3]: Leaving directory `/home/gordon/projects/coreutils'
make[2]: Leaving directory `/home/gordon/projects/coreutils'
Making check in gnulib-tests
make[2]: Entering directory `/home/gordon/projects/coreutils/gnulib-tests'
make  check-recursive
make[3]: Entering directory `/home/gordon/projects/coreutils/gnulib-tests'

 snip 

make[5]: Leaving directory `/home/gordon/projects/coreutils/gnulib-tests'
make  check-TESTS
make[5]: Entering directory `/home/gordon/projects/coreutils/gnulib-tests'
make[6]: Entering directory `/home/gordon/projects/coreutils/gnulib-tests'
make[6]: *** No rule to make target `tests/misc/numfmt.log', needed by 
`test-suite.log'.  Stop.
make[6]: Leaving directory `/home/gordon/projects/coreutils/gnulib-tests'
make[5]: *** [check-TESTS] Error 2
make[5]: Leaving directory `/home/gordon/projects/coreutils/gnulib-tests'
make[4]: *** [check-am] Error 2
make[4]: Leaving directory `/home/gordon/projects/coreutils/gnulib-tests'
make[3]: *** [check-recursive] Error 1
make[3]: Leaving directory `/home/gordon/projects/coreutils/gnulib-tests'
make[2]: *** [check] Error 2
make[2]: Leaving directory `/home/gordon/projects/coreutils/gnulib-tests'
make[1]: *** [check-recursive] Error 1
make[1]: Leaving directory `/home/gordon/projects/coreutils'
make: *** [check] Error 2

## Strangely, the log file does exist:
$ ls -l tests/misc/numfmt.log
-rw-r--r-- 1 gordon gordon 1069 Dec  5 21:51 tests/misc/numfmt.log



Any advice is appreciated,
 -gordon



numfmt4.patch.gz
Description: GNU Zip compressed data


Suggestion: update README/HACKING regarding tests

2012-12-05 Thread Assaf Gordon

As per: http://lists.gnu.org/archive/html/coreutils/2012-09/msg00144.html ,

Perhaps you'll agree to update README/HACKING about how to run individual tests:
===
diff --git a/HACKING b/HACKING
index de8cd7b..01e7605 100644
--- a/HACKING
+++ b/HACKING
@@ -438,9 +438,11 @@ Nearly every significant change must be accompanied by a 
test suite
 addition that exercises it.  If you fix a bug, add at least one test that
 fails without the patch, but that succeeds once your patch is applied.
 If you add a feature, add tests to exercise as much of the new code
-as possible. Note to run tests/misc/new-test in isolation you can do:
+as possible. If you add a new test file (as opposed to adding a test to an
+existing test file) add the new test file to 'tests/local.mk'.
+Note to run tests/misc/new-test in isolation you can do:
 
-  (cd tests && make check TESTS=misc/new-test VERBOSE=yes)

+  make TESTS=tests/misc/new-test SUBDIRS=. VERBOSE=yes
 
 Variables that are significant for tests with their default values are:
 
diff --git a/README b/README

index 21c9b03..15ed29b 100644
--- a/README
+++ b/README
@@ -176,7 +176,7 @@ in verbose mode for each failing test.  For example,
 if the test that fails is tests/misc/df, then you would
 run this command:
 
-  (cd tests && make check TESTS=misc/df VERBOSE=yes) > log 2>&1

+  make check TESTS=tests/misc/df SUBDIRS=. VERBOSE=yes > log 2>&1
 
 For some tests, you can get even more detail by adding DEBUG=yes.

 Then include the contents of the file 'log' in your bug report.
===


Regards,
-gordon



Re: Command-line program to convert 'human' sizes?

2012-12-07 Thread Assaf Gordon
Thank you for your feedback.
I'm working on fixing those issues.


Some comments/questions:

Pádraig Brady wrote, On 12/06/2012 06:59 PM:
 I noticed this command will core dump:
 $ /bin/ls -l | src/numfmt --to-unit=1 --field=5
 snip
 so I'm thinking `numfmt` should support --header too.
 
I'll add --header.


 The following should essentially be a noop with this data,
 but notice how the original spacing wasn't taken
 into account, and thus the alignment is broken:
 
 $ /bin/ls -l | tail -n+2 | head -n3 | src/numfmt --to-unit=1 --field=5
 -rw-rw-r--.  1 padraig padraig 93787 Aug 23  2011 ABOUT-NLS
 -rw-rw-r--.  1 padraig padraig 49630 Dec  6 22:32 aclocal.m4
 -rw-rw-r--.  1 padraig padraig 3669 Dec  6 22:29 AUTHORS

I'm a bit wary of adding automatic/heuristic padding - it could lead to 
some weird outputs,
and also (when combined with a header) will not produce proper output (because 
the header will be skipped, but the lines would be re-padded?).

Wouldn't it be better to either force the user to specify '--padding', or 
switch from 'white-space' to an explicit delimiter, and then let expand 
handle the expanding correctly?

e.g.
===
$ cat white-space-data.txt | \
sed 's/  */\t/g' | \
numfmt --field=5 --delimiter=$'\t' --to=SI | \
expand > output
===

A bit more convoluted, but more reliable?

 
 With this the alignment is broken as before,
 but I also notice the differing width output of each number.
 
 $ /bin/ls -l | tail -n+2 | head -n3 | src/numfmt --to=SI --field=5
 -rw-rw-r--.  1 padraig padraig 94k Aug 23  2011 ABOUT-NLS
 -rw-rw-r--.  1 padraig padraig 50k Dec  6 22:32 aclocal.m4
 -rw-rw-r--.  1 padraig padraig 3.7k Dec  6 22:29 AUTHORS
 

Again this is the automatic padding issue -
For example 94K vs 3.7K - should we always pad SI/IEC output to 5 
characters (e.g.  94K) even if the user didn't specify padding?
This would conflict with non-whitespace delimiters... e.g.:

Hello:94000:world

Would be converted to:

Hello:  94K:world

Which is not intuitive at all

Or perhaps the whole 'auto' padding should be enabled IFF the delimiter is not 
specified (and defaults to white-space)?
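For what it's worth, the numfmt that eventually shipped in coreutils went with an explicit --padding option rather than delimiter-dependent auto-padding; a quick sketch (assumes a coreutils numfmt on PATH):

```sh
# Explicit padding: a positive width pads on the left,
# a negative width pads on the right.
numfmt --to=si --padding=6 94000    # -> "   94K"
numfmt --to=si --padding=-6 94000   # -> "94K   "
```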

 
 Notice in the above I've used capital K for SI.
 I think human() from gnulib may be using k for 1000 and K for 1024.
 That's non standard and ambiguous and I see no need to do that.

 So for IEC we'd have:
 
 $ /bin/ls -l | tail -n+2 | head -n3 | src/numfmt --to=IEC --field=5
 -rw-rw-r--.  1 padraig padraig  3.6Ki Dec  6 22:29 AUTHORS
 

I tried to use 'human_readable()' as-is, but I guess this is not sufficient.
I'll duplicate the code, and modify it to avoid this issue (lower/upper case K, 
and the i suffix)

 
 Another thing I thought of there, was it would be
 good to be able to parse number formats that it can generate:

Sounds like two separate (but related) issues:

 $ echo '1,234' | src/numfmt --from=auto
 src/numfmt: invalid suffix in input '1,234': ',234'

1. Is there already a gnulib function that can accept locale-grouped values? 
can the xstrtoXXX functions handle that?

 $ echo '3.7K' | src/numfmt --from=auto
 src/numfmt: invalid suffix in input '3.7K': '.7K'

2. Would you recommend switching the internal representation to doubles (from the 
current uintmax_t),
 or just adding special code to detect the decimal point (which, as Bernhard 
mentioned, is also locale dependent)?

 While I said before it would be better to error rather than warn
 on parse error, on consideration it's probably best to write a
 warning to stderr on parse error, and leave the original number in place.

I'll change the code accordingly. 


Regarding Bernhard's comments (from a different email):

Bernhard Voelker wrote, On 12/07/2012 03:25 AM:
 On 12/07/2012 12:59 AM, Pádraig Brady wrote:
 
 Therefore this is my first test:
   $ echo 11505426432 | src/numfmt
   11505426432
 Hmm, shouldn't it be converting that to a human-readable
 number then? ;-)

From Pádraig's original specification ( 
http://lists.gnu.org/archive/html/coreutils/2012-02/msg00085.html ) I assumed 
that the default of both --from and --to is not to scale - So one needs to 
explicitly use --to or --from.

But those defaults can be changed, if you prefer.

 Looking at scale_from_args: I'd favor lower-case arguments,
 i.e. si and iec instead of SI and IEC.
 WDYT?

I'll change those.


Regarding the help text and documentation:
I copied many of the texts from previous emails (the "Reformat numbers like 
11505426432 to the more human-readable 11G" line comes verbatim from one of Jim 
Meyering's emails) - all of them will require better phrasing later.


Thanks,
 -gordon







git format-patch question

2012-12-11 Thread Assaf Gordon
Hello,

(picking up from a different thread)

Pádraig Brady wrote, On 12/06/2012 06:59 PM:
 Generally it's best to get git to send email
 or send around formats that git can apply directly,
 which includes commit messages and references new files etc.
 The handiest way to do that is:
 
   git format-patch --stdout -1 | gzip > numfmt.5.patch.gz

While working on my development branch, I commit small, specific changes, as so:
 [PATCH 1/6] numfmt: a new command to format numbers
 [PATCH 2/6] numfmt: change SI/IEC parameters to lowercase.
 [PATCH 3/6] numfmt: separate debug/devdebug options.
 [PATCH 4/6] numfmt: fix segfault when no numbers are found.
 [PATCH 5/6] numfmt: improve --field, add more tests.
 [PATCH 6/6] numfmt: add --header option.

Each commit can be just few lines.

When I send a patch to the mailing list, I want to send one 'nice', 'clean' 
patch with my changes, compared to the master branch.

When I use the following command:

   git diff -p --stat master..HEAD > my.patch

And all the changes (multiple commits) I made on my branch compared to master 
are represented as one coherent change in my.patch - but this is not 
convenient for you to apply.


However, when I use

git format-patch --stdout -1 > my.patch

Only the last commit appears.

The alternative:

git format-patch --stdout master..HEAD > my.patch

Generates a file which will create multiple commits when imported with 'git am'.
 
What is the recommended way to generate a clean patch which will consolidate 
all my small commits into one?
Or is there another way?
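One answer that stands on its own: squash the branch into a single commit on a throwaway branch, then format-patch that one commit. A self-contained sketch in a temporary repository (all branch, file, and commit names here are hypothetical):

```sh
#!/bin/sh
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name dev
echo base > file; git add file; git commit -qm 'initial'
base=$(git symbolic-ref --short HEAD)    # master or main
git checkout -qb feature
echo one >> file; git commit -qam 'numfmt: step 1'
echo two >> file; git commit -qam 'numfmt: step 2'
# Replay the whole feature branch as ONE commit:
git checkout -qb for-upstream "$base"
git merge -q --squash feature
git commit -qm 'numfmt: a new command to format numbers'
git format-patch --stdout -1 > my.patch
grep -c '^Subject:' my.patch    # one Subject line => one commit
```

`git rebase -i` with squash/fixup achieves the same result interactively.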

Thanks,
 -gordon







Adding tests for non-C locales

2012-12-11 Thread Assaf Gordon
Hello,

I want to add tests for non-C locales (to check grouping in numfmt).

My test script is written in Perl, based on tests/misc/wc.pl .

It starts with:
===
   @ENV{qw(LANGUAGE LANG LC_ALL)} = ('C') x 3;
===

Which is fine for most of the tests.

How do I add tests for a non-C locale in a safe manner? (I need a locale whose 
thousands-group separator character I know, but I can't know in 
advance whether it's installed on the testing machine.)
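One common pattern is to probe for the locale at runtime and skip the test, rather than fail, when it is absent; a minimal shell sketch of the idea:

```sh
# Probe for a locale before running locale-dependent assertions;
# report SKIP instead of failing when it is not installed.
have_locale() { locale -a 2>/dev/null | grep -qiE "^$1$"; }

if have_locale 'fr_FR\.utf-?8'; then
  echo 'fr_FR.utf8 available - run the grouping tests'
else
  echo 'SKIP: fr_FR.utf8 not installed'
fi
```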

Thanks,
 -gordon



numfmt: locale/grouping input issue

2012-12-11 Thread Assaf Gordon
Hello,

(Continuing a previously discussed issue - accepting input values with locale 
grouping separators)

Pádraig Brady wrote, On 12/07/2012 01:09 PM:
 On 12/07/2012 03:07 PM, Assaf Gordon wrote:
 Another thing I thought of there, was it would be
 good to be able to parse number formats that it can generate:

 Sounds like two separate (but related) issues:

 $ echo '1,234' | src/numfmt --from=auto
 src/numfmt: invalid suffix in input '1,234': ',234'

 1. Is there already a gnulib function that can accept locale-grouped values? 
 can the xstrtoXXX functions handle that?
 
 I was thinking you would just strip out
 localeconv()->thousands_sep before parsing.

I couldn't find an example of a coreutils program that readily accepts locale'd 
input.
While dots and commas (US/DE locales) are relatively easy to handle, in the 
French locale the separator is a space - causing a conflict when assuming the 
default field separator is also white space.

Another complication is that just stripping out the 'thousands_sep' character 
would treat text such as '1,3,4,5,6' as the valid number 13456.

I would suggest at first not to accept locale'd input, or only offer partial 
support.
WDYT ?
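A possible middle ground between full support and none: accept the separator only where the digit groups are well-formed, so malformed strings are rejected instead of silently collapsed. A sketch of the idea in shell (the 3-digit grouping rule is hard-coded here; real locales can use other group sizes):

```sh
# Accept "1,234,567" but reject "1,3,4,5,6": after the first group,
# every comma-separated group must be exactly 3 digits.
is_grouped() {
  printf '%s\n' "$1" | grep -qE '^[0-9]{1,3}(,[0-9]{3})*$'
}

is_grouped '1,234,567' && echo 'valid:   1,234,567'
is_grouped '1,3,4,5,6' || echo 'invalid: 1,3,4,5,6'
```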

Thanks,
 -gordon


Couple of examples:

   # Output is OK
   $ LC_ALL=fr_FR.utf8 ./src/printf "%'d\n" 1000
   1 000

   # Input is not valid
   $ LC_ALL=fr_FR.utf8 ./src/printf "%'d\n" "1 000"
   ./src/printf: « 1 000 »: valeur non complètement convertie
   1

   # Sort can't handle locale'd input, treats the white-space as separator,
   #  not as thousand separator.
   $ printf "1 123\n1 000\n" | LC_ALL=fr_FR.utf8 sort --debug -k1,1
   sort: utilise les règles de tri « fr_FR.utf8 »
   sort: leading blanks are significant in key 1; consider also specifying 'b'
   1 000
   _
   _
   1 123
   _
   _






numfmt (=print 'human' sizes) updates

2012-12-12 Thread Assaf Gordon

Hello,

Attached is an updated version of 'numfmt' .
(The patch should be compatible with git am).

Most of the previously raised issues have been addressed, except handling 
locale'd grouping in the input numbers (locale'd decimal-point is handled 
correctly).

Added support for header, auto-whitespace-padding, and floating-point input.
Internally, all values are now stored as long double (instead of the previous 
uintmax_t) - this enables working with Yotta-scale values.

The following should now 'just work' :
df | ./src/numfmt --header --field 2 --to=si
ls -l | ./src/numfmt --header --field 5 --to=iec
ls -lh | ./src/numfmt --header --field 5 --from=iec --padding=10

The --debug option now behaves more like sort's --debug: prints messages to 
STDERR about possible bad combinations and inputs (which are not fatal errors):

$./src/numfmt --debug 6
./src/numfmt: no conversion option specified
6

The --devdebug option can be used to show internal states (perhaps will be 
removed once the program is finalized?).

The test file 'tests/misc/numfmt.pl' contains many more tests and details about 
possible inputs/outputs.

If the functionality is acceptable, the next steps are cleaner code and better 
documentations.

Comments are welcomed,
  -gordon



numfmt.7.patch.gz
Description: GNU Zip compressed data


[PATCH] two minor tweaks to HACKING

2012-12-13 Thread Assaf Gordon
The first mentions 'git stash' in a relevant paragraph.
The second changes parameters for 'lcov' example - the current parameters 
produce wrong output (the source files are not found, with LCOV version 1.9 ).

-gordon

From e1ece5ff278258a18a078cad1d8fbf65c7e4fe71 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Thu, 13 Dec 2012 11:42:01 -0500
Subject: [PATCH 1/2] doc: mention git stash in HACKING

---
 HACKING |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/HACKING b/HACKING
index f3f961a..84e9707 100644
--- a/HACKING
+++ b/HACKING
@@ -120,6 +120,8 @@ Note 2:
 sometimes the checkout will fail, telling you that your local
 modifications conflict with changes required to switch branches.
 However, in any case, you will *not* lose your uncommitted changes.
+Run "git stash" to temporarily hide uncommitted changes in your
+local directory, restoring a clean working directory.
 
 Anyhow, get back onto your just-created branch:
 
-- 
1.7.7.4


From 8cd8f40882daa165ced8091697c158c7afb479d6 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Thu, 13 Dec 2012 14:20:47 -0500
Subject: [PATCH 2/2] doc: tweak 'lcov' in HACKING

Use the correct -b (--base-directory) parameter.
---
 HACKING |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/HACKING b/HACKING
index 84e9707..8e4243f 100644
--- a/HACKING
+++ b/HACKING
@@ -610,8 +610,8 @@ to generate HTML coverage reports.  Follow these steps:
   # run whatever tests you want, i.e.:
   make check
   # run lcov
-  lcov -t coreutils -q -d lib -b lib -o lib.lcov -c
-  lcov -t coreutils -q -d src -b src -o src.lcov -c
+  lcov -t coreutils -q -d lib -b `pwd` -o lib.lcov -c
+  lcov -t coreutils -q -d src -b `pwd` -o src.lcov -c
   # generate HTML from the output
   genhtml -p `pwd` -t coreutils -q --output-directory lcov-html *.lcov
 
-- 
1.7.7.4



Re: numfmt (=print 'human' sizes) updates

2012-12-13 Thread Assaf Gordon
Hello,

Attached is a slightly improved patch - minor code changes, and many more tests.
Line coverage is 98% and branch coverage is now 93%; most of the 
non-covered branches are simply unreachable (I'm checking the reachable ones).

The comments below still apply.

- gordon


Assaf Gordon wrote, On 12/13/2012 01:02 AM:
 
 Most of the previously raised issues have been addressed, except handling 
 locale'd grouping in the input numbers (locale'd decimal-point is handled 
 correctly).
 
 Added support for header, auto-whitespace-padding, floating-point input .
 Internally, all values are now stored as long double (instead of previously 
 uintmax_t) - enables working with Yotta-scale values.
 
 The following should now 'just work' :
 df | ./src/numfmt --header --field 2 --to=si
 ls -l | ./src/numfmt --header --field 5 --to=iec
 ls -lh | ./src/numfmt --header --field 5 --from=iec --padding=10
 
 The --debug option now behaves more like sort's --debug: prints messages 
 to STDERR about possible bad combinations and inputs (which are not fatal 
 errors):
 
 $./src/numfmt --debug 6
 ./src/numfmt: no conversion option specified
 6
 
 The --devdebug option can be used to show internal states (perhaps will be 
 removed once the program is finalized?).
 
 The test file 'tests/misc/numfmt.pl' contains many more tests and details 
 about possible inputs/outputs.
 
 If the functionality is acceptable, the next steps are cleaner code and 
 better documentations.
 



numfmt.8.patch.xz
Description: application/xz


Re: [PATCH 2/2] doc: tweak 'lcov' in HACKING

2012-12-14 Thread Assaf Gordon
Hello Bernhard,

Bernhard Voelker wrote, On 12/14/2012 03:29 AM:
 splitting the discussion about the 2 patches ...
 
 On 12/13/2012 08:29 PM, Assaf Gordon wrote:
 [...]
 The second changes parameters for 'lcov' example - the current parameters
 produce wrong output (the source files are not found, with LCOV version 1.9 
 ).
 
 Thanks.
 
 [PATCH 2/2] doc: tweak 'lcov' in HACKING
 
 I also noticed the lcov issue recently, but didn't find the time
 to fix HACKING. Furthermore, I'm not sure if lcov-1.9 is the reason
 for the problem - I think it worked some time ago ... and according
 to 'rpm -q --changelog lcov', I already have 1.9 since about Jan or
 Feb 2011.
 
 It think the reason might be the new non-recursive build system.
 
 And furthermore, the second lcov call still fails here:
 
   $ lcov -t coreutils -q -d src -b `pwd` -o src.lcov -c
   built-in:cannot open source file
   geninfo: ERROR: cannot read built-in.gcov!
 
 Don't you get that, too?
 

The following commands work on my system, generating a coverage report from a 
clean repository:

   git clone git://git.sv.gnu.org/coreutils.git coreutils_test
   cd coreutils_test/
   ./bootstrap
   ./configure CFLAGS="-g -fprofile-arcs -ftest-coverage"
   make -j 4
   make -j 4 check
   lcov -t coreutils -q -d lib -b `pwd` -o lib.lcov -c
   lcov -t coreutils -q -d src -b `pwd` -o src.lcov -c
   genhtml -p `pwd` -t coreutils -q --output-directory lcov-html *.lcov   

The two lcov invocations do produce some warnings, like so:

$ lcov -t coreutils -q -d lib -b `pwd` -o lib.lcov -c
geninfo: WARNING: no data found for 
/home/gordon/temp/coreutils_test/lib/mbuiter.h
 snip 
Cannot open source file parse-datetime.y
Cannot open source file parse-datetime.c

$ lcov -t coreutils -q -d src -b `pwd` -o src.lcov -c
geninfo: WARNING: no data found for 
/home/gordon/temp/coreutils_test/lib/stat-time.h
geninfo: WARNING: no data found for 
/home/gordon/temp/coreutils_test/lib/mbchar.h
 snip 
geninfo: WARNING: no data found for 
/home/gordon/temp/coreutils_test/lib/openat.h
geninfo: WARNING: no data found for /usr/include/gmp-x86_64.h
geninfo: WARNING: no data found for 
/home/gordon/temp/coreutils_test/lib/stat-time.h
geninfo: WARNING: no data found for 
/home/gordon/temp/coreutils_test/lib/timespec.h

But I assume this is normal/acceptable (if those files weren't covered in the 
tests).

If this flow doesn't work reliably on all systems, then the HACKING needs 
more tweaking...

Thanks,
 -gordon



[PATCH] maint: ignore GCC coverage files

2012-12-14 Thread Assaf Gordon
Hello,

Related to the updated coverage documentation, perhaps update the .gitignore 
to ignore the generated coverage files?
Another possible addition is ignoring src.lcov, lib.lcov and lcov-html/* 
- but those file names are not fixed, just the recommended file names in 
HACKING.

-gordon
From eb54c8adf123481f3231aeb40e1b4ff38288b9af Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Fri, 14 Dec 2012 13:27:26 -0500
Subject: [PATCH] maint: update gitignore entries

* .gitignore: ignore GCC coverage data files.
---
 .gitignore |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/.gitignore b/.gitignore
index 5ce2361..15b77e9 100644
--- a/.gitignore
+++ b/.gitignore
@@ -170,3 +170,5 @@ Makefile.in
 TAGS
 THANKS
 THANKS-to-translators
+*.gcno
+*.gcda
-- 
1.7.7.4



Re: [PATCH] doc: mention git stash in HACKING

2012-12-14 Thread Assaf Gordon
Bernhard Voelker wrote, On 12/14/2012 03:29 AM:
 On 12/13/2012 08:29 PM, Assaf Gordon wrote:
 
 [PATCH 1/2] doc: mention git stash in HACKING
 
 I tweaked the commit message a bit: even if the change is trivial
 and the subject is HACKING, it's good practice to mention it in a
 line below which describes the change in detail.
 

Thanks!
I keep on learning...

-gordon



Re: enhancement suggestions for sort and text editor

2012-12-14 Thread Assaf Gordon
Hello John,

Eric Blake wrote, On 12/14/2012 04:19 PM:
 On 12/14/2012 02:02 PM, john wrote:

 In particular I wish to enter text into predefined (fixed location)
 fields in a record as opposed to variably delimited fields.  In other
 words emulate the punched card record where card columns are assigned to
 particular data character columns.  Those card columns then just become
 a text column range of a single record.  If I could just set perhaps 10
 arbitrary tab stops (in any simple editor), it would be sufficient for
 this purpose.  The tab key would just advance to the next stop in
 succession, tho not necessarily regularly spaced.
 
 Sounds like you are talking in part about 'expand -t', which lets you
 re-expand tabs according to your choice of pre-defined stops.  Beyond
 that, your question is out of scope for coreutils, and better directed
 to the editor of your choice (I'm quite sure that emacs is probably
 going to have something that does what you want, although I don't use
 arbitrary tab stops enough to be able to tell you off-hand how to get at
 that feature)
 

(Slightly off-topic for coreutils, but for completeness:)

Recent versions of GNU awk (gawk) support exactly this kind of processing:

$ printf "xx3\n1234yy98765\n" | gawk -v FIELDWIDTHS="4 4 2 5" '{print $1,$2,$3,$4}'
  xx 3
1234  yy 98765

Or with Tab-separated output:

$ printf "xx3\n1234yy98765\n" | gawk -v FIELDWIDTHS="4 4 2 5" -v OFS="\t" '{print $1,$2,$3,$4}'
xx  3
1234yy  98765

More information here:
 http://www.gnu.org/software/gawk/manual/html_node/Constant-Size.html

-gordon




Re: numfmt (=print 'human' sizes) updates

2012-12-18 Thread Assaf Gordon
Hello,

Attached is a first shot at documenting 'numfmt' .

Comments are welcomed,
-gordon


numfmt_doc.patch.gz
Description: GNU Zip compressed data


Re: numfmt (=print 'human' sizes) updates

2012-12-21 Thread Assaf Gordon
Hello Pádraig,

Thanks for the review and the feedback.

Pádraig Brady wrote, On 12/21/2012 12:42 PM:
 
 It's looking like a really useful cohesive command.
 

The attached patch addresses the following issues:

1. changed the ---devdebug option
2. incorporated most of 'indent -nut' recommendations. 
3. improved the usage string, to generate somewhat better man page.
  (help2man is a bit finicky about formatting, so it's not optimal). 

4. i suffix with iec input/output:

I understand the reason for always adding i, as it is The Right Thing to do.
But I think some (most?) people still treat single-letter suffix (K/M/G/T/etc.) 
as a valid suffix for both SI and IEC, and deduce the scale from the context of 
whatever they're working on. Forcing them to use Ki might be too obtrusive.
It could also mess up automated scripts when the IEC scale is needed but only 
single-letter suffixes are produced (and there are many programs like that).

As a compromise, I've added yet another scale option: ieci .
When used with --from=ieci, a two-letter suffix is required.
When used with --to=ieci, i will always be appended.

Examples:

$ ./src/numfmt --to=iec 4096
4.0K
$ ./src/numfmt --to=ieci 4096
4.0Ki
$ ./src/numfmt --from=iec 4K
4096
$ ./src/numfmt --from=ieci 4Ki
4096
$ ./src/numfmt --from=auto 4Ki
4096
$ ./src/numfmt --from=auto 4K
4000

# 'ieci' requires the 'i':
$ ./src/numfmt --from=ieci 4K 
./src/numfmt: missing 'i' suffix in input: '4K' (e.g Ki/Mi/Gi)

# 'iec' does not accept 'i':
$ ./src/numfmt --from=iec 4Ki
./src/numfmt: invalid suffix in input '4Ki': 'i'


I hope this cover all the options, while maintaining consistent and expected 
behavior.
(Optionally, we can change iec to behave like ieci, and rename ieci to 
something else).


I will send --format and error message updates soon.

Regards,
 -gordon


numfmt.9.patch.xz
Description: application/xz


Re: numfmt (=print 'human' sizes) updates

2012-12-26 Thread Assaf Gordon
Hello, 

Pádraig Brady wrote, On 12/21/2012 12:42 PM:
 
 I'm starting to think the original idea of having a --format option
 would be more a more concise, well known and extendible interface
 rather than having --padding, --grouping, --base, ...
 It wouldn't have to be passed directly to printf, and could
 be parsed and preprocessed something like we do in seq(1).
 

Regarding 'format' option, there are some intricacies that are worth discussing:

1. Depending on the requested conversion, the output can be a string (e.g. 
1.4Ki) or a long double (e.g. 140).

2. Internally, the program uses long doubles - so the real format is %Lf - 
regardless of what the user will give (e.g. %f).

3. printf accepts all sorts of options, some of which aren't relevant to 
numfmt, or only relevant when printing non-humanized values.
e.g.:
$ LC_ALL=en_US.utf8 seq -f "%0'14.5f" 1000 1001
0001,000.00000
0001,001.00000

4. The assumption was that humanized numbers are always at most 4 characters in 
SI/IEC (e.g. 1024 or 4.5M) or 5 characters with iec-i (e.g. 999Ti).
With the new 'format', if given %'2.9f - should the output still be 4 
characters (e.g. 4.5T), or respect the .9 precision (e.g. 4.500000000T)?
And does the suffix character count against the 2.9 width?


My preference is to keep things simple, and accept just a limited subset of the 
format syntax:
1. grouping (the ' character)
2. padding (the number after '%' and before the 'f')
3. alignment (optional '-' after '%')
4. Any prefix/suffix before/after the '%' option.
5. Accept just %f, but internally treat it as '%s' or '%Lf', depending on the 
output.

All other options will be silently ignored, or trigger errors.

Example:
$ numfmt --format "xx%20fxx" --to=si 5000
[[ internally, treats as --padding 20 ]]
xx                5.0Kxx

$ numfmt --format "xx%'-10fxx" 5000
[[ internally, treats as --padding -10 --grouping ]]
xx5,000     xx

$ numfmt --format "xx%0#'+010llfxx" 5000
[[ reject as 'too complicated' / unsupported printf options ]]


WDYT?
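For comparison, this is essentially the subset that numfmt's --format ended up supporting in coreutils (a single %f-style directive; optional width, '-' alignment, grouping; literal prefix/suffix). A quick check, assuming a coreutils numfmt on PATH:

```sh
numfmt --format '%10f' 5000     # -> "      5000" (pad to width 10)
numfmt --format '%-10f' 5000    # -> "5000      " (left-aligned)
```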

-gordon





Re: numfmt (=print 'human' sizes) updates

2012-12-27 Thread Assaf Gordon
Hello,

Assaf Gordon wrote, On 12/26/2012 05:40 PM:
 
 Attached is an updated numfmt, with the following two changes:
 1. --format support
 2. optionally ignoring input errors.

The attached patch (incremental to the above full patch) adds few more tests, 
and fixes 4 issues found with the clang static analyzer.

There's no change in functionality.

-gordon



numfmt.11.patch.xz
Description: application/xz


Re: numfmt (=print 'human' sizes) updates

2013-01-08 Thread Assaf Gordon
Hello,

The attached patch adds 'numfmt' to the coreutils documentation.

Regards, 
 -gordon


numfmt.12.patch.xz
Description: application/xz


Re: Sort with header/skip-lines support

2013-01-11 Thread Assaf Gordon
Pádraig Brady wrote, On 01/10/2013 07:11 PM:
 On 01/10/2013 09:57 PM, Assaf Gordon wrote:

 I'd like to re-visit an old issue: adding header-line/skip-lines support to 
 'sort'.

 [...]

 [2] - no pipe support: 
 http://lists.gnu.org/archive/html/bug-coreutils/2007-07/msg00215.html
 
 But recent sed can be used for this like: `sed -u 1q`
 http://git.sv.gnu.org/gitweb/?p=sed.git;a=commit;h=737ca5e
 Note that commit is 4 years old, but only recently released sed 4.2.2 
 contains it.
 
Thanks for the tip.

The following indeed works with sed 4.2.2 ( on linux 3.2 ):
   $ ( echo 99 ; seq 10 ) | ( sed -u 1q ; sort -n )

But I'm wondering (as per link [2] above) if this is POSIX compliant and 
stable (i.e. can this be trusted to work every time, even on non-Linux 
machines?).

 [3] - Jim's patch: 
 http://lists.gnu.org/archive/html/coreutils/2010-11/msg00091.html
 
 Thanks for collating the previous threads on this subject.
 
 I'm on the fence on how warranted this is TBH.
 We'd need stronger arguments for it I think.
 

I'll collate the arguments as well :)

If the sed method works reliably, that leaves error checking: how to reliably 
check for errors in such a pipe (inside a POSIX shell script)?
The closest code I found is this: https://github.com/cheusov/pipestatus which 
seems very long.

So additional arguments are:
1. robust error checking
2. simplicity of use: if 'sort' had this option built-in, the following use 
cases would just work. with sed+sort, it will require different invocations 
(and probably different pitfalls):
  a. one input file
  b. one input pipe
  c. multiple input files (without resorting to a pipe, as that will cause 'sort' 
to use a different amount of memory)
  d. specifying output file (with -o)

Thanks,
 -gordon

As a side note, I have a hackish Perl script that wraps sort and consumes the 
first line. It's basically a works-for-me kind of script - I just wish it 
weren't necessary:
https://github.com/agordon/bin_scripts/blob/master/scripts/sort-header.in
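For the record, the wrapper idea can also be expressed in a few lines of shell, relying on the fact that the shell's read builtin consumes exactly one line from a pipe (it reads unbuffered), leaving the rest for sort; a sketch:

```sh
# Sort stdin while keeping the first (header) line in place.
sort_with_header() {
  IFS= read -r header
  printf '%s\n' "$header"
  sort "$@"
}

printf 'count\n3\n1\n2\n' | sort_with_header -n
# header line first, then the sorted body
```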





Re: Sort with header/skip-lines support

2013-01-11 Thread Assaf Gordon
Hello Pádraig,

Your suggestions work for all the cases I needed, so essentially there's 
already a way to do sort+header - much appreciated!

Pádraig Brady wrote, On 01/11/2013 01:13 PM:
 On 01/11/2013 04:10 PM, Assaf Gordon wrote:
 Pádraig Brady wrote, On 01/10/2013 07:11 PM:

 The following indeed works with sed 4.2.2 ( on linux 3.2 ):
 $ ( echo 99 ; seq 10 ) | ( sed -u 1q ; sort -n )

 [2] - no pipe support: 
 http://lists.gnu.org/archive/html/bug-coreutils/2007-07/msg00215.html 
 But I'm wondering (as per the link above [2]) if this is posix compliant and 
 stable (i.e. can this be trusted to work everytime, even on non-linux 
 machines?).
 
 No `sed -u` with this functionality is not portable.
 Though it's more portable than `sort --header`
 given that it already exists :)

Sorry for nitpicking, but just to verify:
sed -u is a GNU extension, hence not portable by definition.
But what I meant to ask:
If I install GNU sed + GNU sort on any machine (e.g. MAC OSX), would it work in 
a reliable way?
Eric Blake's email seemed to suggest this will never be guaranteed to work 
(even if it works in practice) due to sharing pipes between processes.
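For the record, the construct under discussion can be tried directly (this assumes GNU sed, since '-u' is a GNU extension, and as noted above POSIX does not guarantee how much input sed may consume):

```shell
# The header line is consumed unbuffered by sed before sort reads from the
# shared pipe, so only the body gets sorted.
( echo header; printf '3\n1\n2\n' ) | ( sed -u 1q; sort -n )
```

On GNU/Linux this prints 'header' followed by 1, 2, 3.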


 For completeness, showing the current options for such cases...

Thanks for taking the time to write these - very helpful.

-gordon




Re: Enhancement suggestion for expand

2013-01-14 Thread Assaf Gordon
Hello,

Anoop Sharma wrote, On 01/14/2013 04:58 AM:
 On Mon, Jan 14, 2013 at 5:04 AM, Pádraig Brady p...@draigbrady.com 
 mailto:p...@draigbrady.com wrote:
 On 09/18/2012 03:18 PM, CoreUtils subscribtion for PLC wrote:
 
I often use expand to format scripts' output by manually setting tab 
 stops. The idea would be to add an option to expand to be able to auto-set 
 tabstops by analyzing the first n lines of text (0 for analyzing the whole 
 stream) so that the TS would be set to the minimum number of spaces to 
 obtain clean columns.
 
 
 This feature is already provided by a separate utility named 'column', 
 dedicated to columnization, which is available under a BSD license.
 
 What is not provided there is ability to analyze only first n lines. 
 
 So unless it is about licensing, it may be better to enhance column instead 
 of expand. If licensing is an issue then it may be better to add a utility 
 dedicated to columnization to Coreutils, instead of enhancing expand.
 

For a possible work-around, I'm using Perl+shell wrapper scripts that do 
exactly what you're asking for.

'detect_tab_stops' reads a single text file and prints a comma-separated list 
of tab stops, based on the first N lines (default 100).
https://github.com/agordon/bin_scripts/blob/master/scripts/detect_tab_stops.in

'atexpand' uses 'detect_tab_stops' to run 'expand' with auto-tabbing.
https://github.com/agordon/bin_scripts/blob/master/scripts/atexpand.in

'atless' uses 'detect_tab_stops' to run 'less -S -x' with proper auto-tabbing.
https://github.com/agordon/bin_scripts/blob/master/scripts/atless.in
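The core idea of 'detect_tab_stops' can be sketched in a few lines of shell+awk (an approximation, not the actual script; 'demo.txt' and the widest-entry-plus-one-space stop rule are illustrative assumptions):

```shell
# Compute cumulative tab stops from the first N lines (default 100) of a
# file with whitespace-separated columns, then feed them to expand(1).
printf 'id\tlonger_name\tx\n1\taa\ty\n' > demo.txt

detect_stops() {
  head -n "${2:-100}" "$1" |
  awk '{ for (i = 1; i <= NF; i++)
           if (length($i) > w[i]) w[i] = length($i)
         if (NF > n) n = NF }
       END { pos = 0
             for (i = 1; i < n; i++) {
               pos += w[i] + 1            # widest entry plus one space
               printf "%s%d", sep, pos; sep = "," }
             print "" }'
}

stops=$(detect_stops demo.txt)
echo "$stops"              # -> 3,15 for demo.txt
expand -t "$stops" demo.txt
```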


The quickest way to install those is probably by taking the entire package ( 
https://github.com/agordon/bin_scripts ) and running configure, but they are 
AGPL and you are welcome to take them and modify them.

-gordon




Sort: optimal memory usage with multithreaded sort

2013-01-15 Thread Assaf Gordon
Hello,

Sort's memory usage (specifically, sort_buffer_size() ) has been discussed a few 
times before, but I couldn't find any mention of the following issue:

If given a regular input file, sort tries to guesstimate the optimal buffer 
size based on the file size.
But this value is calculated for one thread (dating from before sort became multi-threaded).
The default --parallel value is 8 (or less, if fewer cores are available) - 
which requires more memory.

The result is that for a somewhat powerful machine (e.g. 128GB RAM, 32 cores - 
not uncommon for a computer cluster),
sorting a big file (e.g. 10GB) will always allocate too little memory, and will 
always resort to saving temporary files in /tmp.
The disk activity will result in slower sorting times than what could be done 
in an all-memory sort.

Based on this: 
http://lists.gnu.org/archive/html/coreutils/2010-12/msg00084.html ,
perhaps it would be beneficial to consider the number of threads in the memory 
allocation ?
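Until then, the allocation can be pinned down explicitly from the command line; a workaround sketch (the file names are illustrative, and the tiny input stands in for the real multi-GB file):

```shell
# Give sort an explicit buffer size (-S) instead of relying on the
# single-threaded guesstimate, and set the thread count explicitly.
printf '3\n1\n2\n' > big.txt        # tiny stand-in for the real input
sort --parallel="$(nproc)" -S 75% -n -o sorted.txt big.txt
cat sorted.txt
```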

Regards,
 -gordon



[PATCH] numfmt: fix help section typo

2013-02-05 Thread Assaf Gordon
Hello Pádraig,

Thank you for all the recent work on numfmt!

I noticed a typo in the help section, in my original program (the suffix says 
'G' but the values are mega). The patch also removes an extra space. Attached is a patch.

Thanks!
 -gordon


From b5b9e3281298fa14d7752579a63bfe3956d982f4 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Tue, 5 Feb 2013 11:04:41 -0500
Subject: [PATCH] numfmt: fix help section typo

* src/numfmt.c: change erroneous G to M.
---
 src/numfmt.c |   10 +-
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/numfmt.c b/src/numfmt.c
index cccd1d1..b37724e 100644
--- a/src/numfmt.c
+++ b/src/numfmt.c
@@ -856,19 +856,19 @@ UNIT options:\n\
   auto   Accept optional single-letter/two-letter suffix:\n\
  1K  = 1000\n\
  1Ki = 1024\n\
- 1G  = 1000000
- 1Gi = 1048576
+ 1M  = 1000000
+ 1Mi = 1048576
   si Accept optional single letter suffix:\n\
  1K = 1000\n\
- 1G  = 1000000
+ 1M = 1000000
  ...\n\
   iecAccept optional single letter suffix:\n\
  1K = 1024\n\
- 1G = 1048576\n\
+ 1M = 1048576\n\
  ...\n\
   iec-i  Accept optional two-letter suffix:\n\
  1Ki = 1024\n\
- 1Gi = 1048576\n\
+ 1Mi = 1048576\n\
  ...\n\
 \n\
 ), stdout);
-- 
1.7.7.4



csplit - split by content of field

2013-02-06 Thread Assaf Gordon
Hello,

Attached is a patch that gives 'csplit' the ability to split files by content of 
a field.
A typical usage is:

## the @1 pattern means start a new file when field 1 changes
$ printf 'A\nA\nB\nB\nB\nC\n' | csplit - '@1' '{*}'
$ wc -l xx*
2 xx00
3 xx01
1 xx02
6 total
$ head xx*
==> xx00 <==
A
A

==> xx01 <==
B
B
B

==> xx02 <==
C



This is just a proof of concept, and the pattern specification can be changed 
(I think @N doesn't conflict with any existing pattern).

The same can probably be achieved using other programs (awk comes to mind), but 
it won't be as simple and clean (with all of csplit's output features).
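For comparison, an awk near-equivalent of the proposed '@1' pattern (a sketch; it reproduces only the basic splitting, with none of csplit's numbering, suffix, or repeat-count options):

```shell
# Start a new output file xxNN whenever field 1 changes.
printf 'A\nA\nB\nB\nB\nC\n' |
awk 'NR == 1 || $1 != prev { if (out) close(out)
                             out = sprintf("xx%02d", n++) }
     { print > out; prev = $1 }'
wc -l xx*
```

This produces the same 2/3/1-line xx00..xx02 files as the example above.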

Let me know if you're willing to consider such addition.

Thanks,
 -gordon


From 074614c0764c278e8abd9d41af4ce626fefd6cfc Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Wed, 6 Feb 2013 16:40:00 -0500
Subject: [PATCH] csplit: split files by field-change

src/csplit.c: create a new output file whenever field content changes.
---
 src/csplit.c |  237 --
 1 files changed, 230 insertions(+), 7 deletions(-)

diff --git a/src/csplit.c b/src/csplit.c
index 22f3ad4..ec725d2 100644
--- a/src/csplit.c
+++ b/src/csplit.c
@@ -44,6 +44,13 @@
 /* The default prefix for output file names. */
 #define DEFAULT_PREFIX	"xx"
 
+enum csplit_type
+  {
+CSPLIT_LINE,
+CSPLIT_REGEXPR,
+CSPLIT_FIELD_CHANGE
+  };
+
 /* A compiled pattern arg. */
 struct control
 {
@@ -53,8 +60,9 @@ struct control
   int argnum;			/* ARGV index. */
   bool repeat_forever;		/* True if '*' used as a repeat count. */
   bool ignore;			/* If true, produce no output (for regexp). */
-  bool regexpr;			/* True if regular expression was used. */
+  enum csplit_type type;		/* Split type: line/regex/field */
   struct re_pattern_buffer re_compiled;	/* Compiled regular expression. */
+  uintmax_t field;  /* Field to monitor for change */
 };
 
 /* Initial size of data area in buffers. */
@@ -176,6 +184,16 @@ static size_t control_used;
 /* The set of signals that are caught.  */
 static sigset_t caught_signals;
 
+/* If delimiter has this value, blanks separate fields.  */
+enum { DELIMITER_DEFAULT = CHAR_MAX + 1 };
+
+/* The delimiter to use for field extraction */
+static int delimiter = DELIMITER_DEFAULT;
+
+/* The content of the field from the last line, to be compared with the
+ * current line */
+static struct cstring last_field;
+
 static struct option const longopts[] =
 {
  {"digits", required_argument, NULL, 'n'},
@@ -185,6 +203,7 @@ static struct option const longopts[] =
  {"elide-empty-files", no_argument, NULL, 'z'},
  {"prefix", required_argument, NULL, 'f'},
  {"suffix-format", required_argument, NULL, 'b'},
+  {"delimiter", required_argument, NULL, 'd'},
   {GETOPT_HELP_OPTION_DECL},
   {GETOPT_VERSION_OPTION_DECL},
   {NULL, 0, NULL, 0}
@@ -867,6 +886,169 @@ process_regexp (struct control *p, uintmax_t repetition)
 current_line = break_line;
 }
 
+/* Skip the requested number of fields in the input string.
+   Returns a pointer to the *delimiter* of the requested field,
+   or a pointer to NUL (if reached the end of the string).
+
+   NOTE: buf is *not* expected to be a NULL-terminated string.
+ The end of the string is determined by 'len' */
+static inline char *
+__attribute ((pure))
+skip_fields (char *buf, int len, int fields)
+{
+  static char null_str[] = "";
+
+  char *ptr = buf;
+  if (delimiter != DELIMITER_DEFAULT)
+    {
+      if (*ptr == delimiter)
+        fields--;
+      while (len && fields--)
+        {
+          while (len && *ptr == delimiter)
+            {
+              ++ptr;
+              --len;
+            }
+          while (len && *ptr != delimiter)
+            {
+              ++ptr;
+              --len;
+            }
+        }
+    }
+  else
+    while (len && fields--)
+      {
+        while (len && isblank (*ptr))
+          {
+            --len;
+            ++ptr;
+          }
+        while (len && !isblank (*ptr))
+          {
+            ++ptr;
+            --len;
+          }
+      }
+
+  if (len == 0)
+    return null_str;
+
+  return ptr;
+}
+
+static void
+set_last_field (const char* str, size_t len)
+{
+  last_field.len = len ;
+  last_field.str = xrealloc (last_field.str, len);
+  memcpy (last_field.str, str, len);
+}
+
+static void
+reset_last_field (void)
+{
+  last_field.len = 0 ;
+}
+
+static void
+free_last_field (void)
+{
+  last_field.len = 0;
+  free (last_field.str);
+  last_field.str=NULL;
+}
+
+/* Prints the input lines until a field changes its value */
+static void
+process_field_change (struct control *p)
+{
+  struct cstring *line;		/* From input file. */
+  char *field_start = NULL;
+  char *field_end = NULL ;
+  size_t field_len;
+  size_t line_len;
+  size_t eol_len; /* length from field_start to EOL */
+
+  create_output_file ();
+
+  reset_last_field ();
+
+  while (true)
+{
+  line

uniq - check specific fields

2013-02-07 Thread Assaf Gordon
Hello,

Attached is a proof-of-concept patch to add --check-fields=N to uniq, 
allowing uniq'ing by specific fields.
(Trying a different approach at promoting csplit-by-field [1] :) ).

It works just like 'check-chars' but on fields, and if not used, it does not 
affect the program flow.
===
# input file, every whole-line is uniq
$ cat input.txt 
A 1 z
A 1 y
A 2 x
B 2 w
B 3 w
C 3 w
C 4 w

# regular uniq
$ uniq -c input.txt 
  1 A 1 z
  1 A 1 y
  1 A 2 x
  1 B 2 w
  1 B 3 w
  1 C 3 w
  1 C 4 w
  
# Stop after 1 field
$ uniq -c --check-fields 1 input.txt 
  3 A 1 z
  2 B 2 w
  2 C 3 w

# Stop after 2 fields
$ uniq -c --check-fields 2 input.txt 
  2 A 1 z
  1 A 2 x
  1 B 2 w
  1 B 3 w
  1 C 3 w
  1 C 4 w

# Skip the first field and check 1 field (effectively, uniq on field 2)
$ uniq -c  --skip-fields 1 --check-fields 1 input.txt 
  2 A 1 z
  2 A 2 x
  2 B 3 w
  1 C 4 w

# --field is a convenience shortcut for skip & check fields 
$ uniq -c --field 2 input.txt 
  2 A 1 z
  2 A 2 x
  2 B 3 w
  1 C 4 w
$ uniq -c --field 3 input.txt 
  1 A 1 z
  1 A 1 y
  1 A 2 x
  4 B 2 w
===
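Until something like this lands, a rough awk approximation of 'uniq -c --check-fields 1' looks like the following (a sketch; like uniq, it keeps the first line of each run):

```shell
# Count runs of lines whose first field is identical, printing the count
# and the first line of each run.
out=$(printf 'A 1 z\nA 1 y\nA 2 x\nB 2 w\nB 3 w\nC 3 w\nC 4 w\n' |
      awk '$1 != k { if (NR > 1) print c, first
                     k = $1; first = $0; c = 0 }
           { c++ }
           END { print c, first }')
echo "$out"
```

For the input above this prints the same '3 A 1 z / 2 B 2 w / 2 C 3 w' runs as the proposed option, minus uniq's count padding.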

What do you think ?

 -gordon

[1] http://lists.gnu.org/archive/html/coreutils/2013-02/msg00015.html
From 08ee89a89d6912c5872a1785b9079d943ad71623 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Thu, 7 Feb 2013 11:46:22 -0500
Subject: [PATCH] uniq: support uniq-by-field

src/uniq.c: add --field and --check-fields=N support
---
 src/uniq.c |   68 +++-
 1 files changed, 67 insertions(+), 1 deletions(-)

diff --git a/src/uniq.c b/src/uniq.c
index 5efdad7..b7c3dc8 100644
--- a/src/uniq.c
+++ b/src/uniq.c
@@ -63,6 +63,9 @@ static size_t skip_chars;
 /* Number of chars to compare. */
 static size_t check_chars;
 
+/* Number of fields to compare */
+static size_t check_fields;
+
 enum countmode
 {
   count_occurrences,		/* -c Print count before output lines. */
@@ -108,6 +111,13 @@ static enum delimit_method const delimit_method_map[] =
 /* Select whether/how to delimit groups of duplicate lines.  */
 static enum delimit_method delimit_groups;
 
+/* For long options that have no equivalent short option, use a
+   non-character as a pseudo short option, starting with CHAR_MAX + 1.  */
+enum
+{
+  UNIQ_FIELD = CHAR_MAX + 1,
+};
+
 static struct option const longopts[] =
 {
  {"count", no_argument, NULL, 'c'},
@@ -118,6 +128,8 @@ static struct option const longopts[] =
  {"skip-fields", required_argument, NULL, 'f'},
  {"skip-chars", required_argument, NULL, 's'},
  {"check-chars", required_argument, NULL, 'w'},
+  {"check-fields", required_argument, NULL, 'y'},
+  {"field", required_argument, NULL, UNIQ_FIELD},
  {"zero-terminated", no_argument, NULL, 'z'},
   {GETOPT_HELP_OPTION_DECL},
   {GETOPT_VERSION_OPTION_DECL},
@@ -153,6 +165,8 @@ With no options, matching lines are merged to the first occurrence.\n\
 delimit-method={none(default),prepend,separate}\n\
 Delimiting is done with blank lines\n\
   -f, --skip-fields=N   avoid comparing the first N fields\n\
+  --field=N check only field N.\n\
+equivalent to '-f (N-1) -y 1'\n\
   -i, --ignore-case ignore differences in case when comparing\n\
   -s, --skip-chars=Navoid comparing the first N characters\n\
   -u, --unique  only print unique lines\n\
@@ -160,6 +174,7 @@ With no options, matching lines are merged to the first occurrence.\n\
 ), stdout);
  fputs (_(\
   -w, --check-chars=N   compare no more than N characters in lines\n\
+  -y, --check-fields=N  compare no more than N fields in lines\n\
 ), stdout);
  fputs (HELP_OPTION_DESCRIPTION, stdout);
  fputs (VERSION_OPTION_DESCRIPTION, stdout);
@@ -225,6 +240,34 @@ find_field (struct linebuffer const *line)
   return line->buffer + i;
 }
 
+/* Given a string and maximum length,
+ * returns the position after skipping 'check_fields' fields,
+ * or maximum length (if not enough fields on the input string) */
+static size_t _GL_ATTRIBUTE_PURE
+check_fields_length (const char* str, size_t maxlen)
+{
+  size_t count;
+  size_t i = 0;
+
+/*  fputs("check_fields_length(str='",stderr);
+  fwrite(str,sizeof(char),maxlen,stderr);
+  fprintf(stderr,"' len=%zu, check_fields=%zu)\n",maxlen,check_fields);*/
+
+  for (count = 0; count < check_fields && i < maxlen; count++)
+    {
+      while (i < maxlen && isblank (to_uchar (str[i])))
+        i++;
+      while (i < maxlen && !isblank (to_uchar (str[i])))
+        i++;
+    }
+
+/*  fprintf(stderr,"  result= '");
+  fwrite(str,sizeof(char),i,stderr);
+  fputs("'\n",stderr);*/
+
+  return i;
+}
+
 /* Return false if two strings

Re: new snapshot available: coreutils-8.20.113-1f1f4

2013-02-08 Thread Assaf Gordon
Hello Bernhard,

Bernhard Voelker wrote, On 02/08/2013 09:53 AM:
 On February 7, 2013 at 8:57 PM Pádraig Brady p...@draigbrady.com wrote:
 coreutils snapshot:
http://pixelbeat.org/cu/coreutils-8.20.113-1f1f4.tar.xz
 
 Hi Padraig,
 
 * SLES-10.4 (x86_64):
   gcc (GCC) 4.1.2 20070115 (SUSE Linux)
 
   FAIL: tests/misc/numfmt.pl
 

Regarding the 'numfmt' failures - these are locale-related problems (in both 
cases).
Perhaps I wrote the tests incorrectly.

May I ask you to try the followings on those systems, and send the output (or 
compare with this expected output):
  
 # The French locale is used for locale testing - if it doesn't exist, 
those tests should not run at all.
 $ locale -a | grep -i fr
 fr_FR.utf8

 # First try without locale (this is test 'lcl-grp-1' which succeeded) 
 $ LC_ALL=C ./src/numfmt --debug --grouping  --from=si 7M
 ./src/numfmt: grouping has no effect in this locale
 7000000

 # Try grouping, the expected output should have a space as 
thousands-separator
 # this is test lcl-grp-3 which failed; on your system the result was 
7000000
 $ LC_ALL=fr_FR.utf8 ./src/numfmt --debug --grouping  --from=si 7M
 7 000 000
 

Thanks!
 -gordon




Re: new snapshot available: coreutils-8.20.113-1f1f4

2013-02-08 Thread Assaf Gordon
Thanks for the quick fix.

Bernhard Voelker wrote, On 02/08/2013 11:02 AM:
 On February 8, 2013 at 4:56 PM Pádraig Brady p...@draigbrady.com wrote:
 OK so we can't assume the locale will behave as we want.
 Therefore we can gate the test on the output of the independent
 printf like:
 
 PASS: tests/misc/numfmt.pl

I don't know if this is a SUSE bug or not.
The closest thing I've found is a Debian bug report from 2004:

  locales: Wrong thousands_sep value in fr_FR locale
  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=248377
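One way to check what a locale actually defines, independently of numfmt (a sketch; the fr_FR query only works where that locale is installed):

```shell
# Show the numeric-formatting keywords a locale defines.  In a correct
# fr_FR locale, thousands_sep should be a (possibly non-breaking) space.
LC_ALL=C locale -k thousands_sep decimal_point
```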

-gordon




Re: new snapshot available: coreutils-8.20.119-54cdb0

2013-02-11 Thread Assaf Gordon
Follow-up:

Assaf Gordon wrote, On 02/11/2013 12:27 PM:
 Strange failure with numfmt on an eccentric system (Mac OS X 10.6.8): some 
 errors are not reported correctly.
 
 [ ... ]
 
 In the source code, it seems to be related to this part, in 
 parse_format_string(), line 972:
  970   i += strspn (fmt + i,  );
  971   errno = 0;
  972   pad = strtol (fmt + i, endptr, 10);
  973   if (errno != 0) 
  974 error (EXIT_FAILURE, 0,
 975 _("invalid format %s (width overflow)"), quote (fmt));
  976
 
 On this system (Mac OS X):
fmt = 'hello%'
i = 6
fmt+i = ''
 
 And 'strtol' returns errno=EINVAL (22) instead of 0 - causing the incorrect 
 error message.
 

This is likely the reason, 'man strtol' has this to say (on this computer):
===
  ERRORS
 [EINVAL]   The value of base is not supported or no conversion
could be performed (the last feature is not portable
across all platforms).
===

Would it be better to explicitly check for this case, or to replace the call with xstrtol?

-gordon



Re: new snapshot available: coreutils-8.20.119-54cdb0

2013-02-11 Thread Assaf Gordon
Assaf Gordon wrote, On 02/11/2013 12:35 PM:
 
 Assaf Gordon wrote, On 02/11/2013 12:27 PM:
 Strange failure with numfmt on an eccentric system (Mac OS X 10.6.8): some 
 errors are not reported correctly.

 [ ... ]


 And 'strtol' returns errno=EINVAL (22) instead of 0 - causing the incorrect 
 error message.


The attached patch fixes the problem (tested on Mac OS 10.6.8 and Debian/Linux 
3.2).

-gordon
From 68ff89d497fcaffe054f0ca619fd747db8fb4574 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Mon, 11 Feb 2013 15:39:42 -0500
Subject: [PATCH] numfmt: fix strtol() bug

src/numfmt.c: on some systems, strtol() returns EINVAL if no conversion
was performed. Ignore and continue if so.
---
 src/numfmt.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/src/numfmt.c b/src/numfmt.c
index d87d8ef..6e7cf2f 100644
--- a/src/numfmt.c
+++ b/src/numfmt.c
@@ -970,7 +970,10 @@ parse_format_string (char const *fmt)
   i += strspn (fmt + i,  );
   errno = 0;
   pad = strtol (fmt + i, endptr, 10);
-  if (errno != 0)
+  /* EINVAL can happen if 'base' is invalid (hardcoded as 10, so can't happen),
+ or if no conversion was performed (on some platforms). Ignore & continue
+ if no conversion was performed */
+  if (errno != 0 && (errno != EINVAL))
 error (EXIT_FAILURE, 0,
_("invalid format %s (width overflow)"), quote (fmt));
 
-- 
1.7.7.4



uniq with sort-like --key support

2013-02-11 Thread Assaf Gordon
Hello,

I'd like to offer a proof-of-concept patch for adding sort-like --key support 
for the 'uniq' program, as discussed here:
  http://lists.gnu.org/archive/html/bug-coreutils/2006-06/msg00211.html
and in several other threads.

The patch involves few core changes:
1. All key-related functions were copied as-is from sort.c, and put in a 
separate file (uniq_sort_common.h). In theory, those could be extracted later on 
to a file that will be used by both sort and uniq. At the moment, it's a 
hodge-podge of copy & paste, including code that's not relevant to uniq (like 
'reverse').

2. The function 'check_files' was modified to convert 'struct linebuffer' (used 
by uniq) to 'struct line' (used by sort's functions)

and then

3. The 'different' function was modified to call sort's 'keycompare' function.

4. In main(), the key argument passing was copied from 'sort', and some code 
was added to adapt the previous options (e.g. skip-fields/skip-chars/check-chars) 
to the internal 'struct keyfield'.

The result is that uniq can now do:
===
$ printf "A 1\nA 2\nB 2\n" | ./src/uniq -k1,1
A 1
B 2
$ printf "A 1\nA 2\nB 2\n" | ./src/uniq -k2,2
A 1
A 2
===

Most (but not all) of the existing tests pass.
New tests to demonstrate the new possibilities have been added to 
'tests/misc/uniq-key.pl', try with:
 make check TESTS=tests/misc/uniq-key SUBDIRS=.

I think that most of the keycomparison functions (like 
numeric/general-numeric/month/version/skip-blanks) would just work, though I 
haven't tested it thoroughly yet.


Comments are welcomed, 
-gordon


0001-uniq-support-sort-like-key-specification.patch.xz
Description: application/xz


[PATCH]: uniq: add tests for --ignore-case

2013-02-12 Thread Assaf Gordon
Hello,

Attached are three small tests for uniq with --ignore-case (they pass, the 
option was simply not tested before).

Also,
I noticed that by running the default test suite (make check SUBDIRS=.), the 
majority of uniq tests are skipped:
  uniq: skipping this test -- no appropriate locale
  SKIP: tests/misc/uniq.pl
  PASS: tests/misc/uniq-perf.sh

This is due to tests/misc/uniq.pl line 83:
 83 # I've only ever triggered the problem in a non-C locale.   

 84 my $locale = $ENV{LOCALE_FR};   

 85 ! defined $locale || $locale eq 'none'  

 86   and CuSkip::skip "$prog: skipping this test -- no appropriate locale\n";


which skips the entire suite if there's no French locale defined, even though 
only one test actually sets the locale.

I can have a patch for it, if that's acceptable.

-gordon

From c8cec42eee16f3824635a3ba93b9360b2e7b236d Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Tue, 12 Feb 2013 10:30:25 -0500
Subject: [PATCH] tests: add '--ignore-case' tests for uniq.

* tests/misc/uniq.pl: add tests for --ignore-case.
---
 tests/misc/uniq.pl |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/tests/misc/uniq.pl b/tests/misc/uniq.pl
index 140a49b..e3873b5 100755
--- a/tests/misc/uniq.pl
+++ b/tests/misc/uniq.pl
@@ -199,6 +199,10 @@ my @Tests =
  # Check that --zero-terminated is synonymous with -z.
  ['123', '--zero-terminated', {IN=a\na\nb}, {OUT=a\na\nb\0}],
  ['124', '--zero-terminated', {IN=a\0a\0b}, {OUT=a\0b\0}],
+ # Check ignore-case
+ ['125', '',  {IN=A\na\n}, {OUT=A\na\n}],
+ ['126', '-i',{IN=A\na\n}, {OUT=A\n}],
+ ['127', '--ignore-case', {IN=A\na\n}, {OUT=A\n}],
 );
 
 # Set _POSIX2_VERSION=199209 in the environment of each obs-plus* test.
-- 
1.7.7.4



Re: uniq with sort-like --key support

2013-02-12 Thread Assaf Gordon
Pádraig Brady wrote, On 02/11/2013 08:50 PM:
 On 02/12/2013 01:31 AM, Assaf Gordon wrote:

 I'd like to offer a proof-of-concept patch for adding sort-like --key 
 support for the 'uniq' program, as discussed here:
http://lists.gnu.org/archive/html/bug-coreutils/2006-06/msg00211.html
 and in several other threads.

 I'm not going to look at it this week, but thank you!
 Consolidating the field processing in a central place
 is really good, and it can then be enhanced in future
 to support multibyte chars etc.
 

I'll continue in the meantime - the attached version passes all tests, and 
includes many new ones.
It also supports --field-separator=SEP (like sort) and multiple keys,
and I've tested uniquifying by month/fast-numeric/general-numeric/case-insensitive keys.

-gordon




uniq_keys1.patch.xz
Description: application/xz


Re: uniq with sort-like --key support

2013-02-13 Thread Assaf Gordon
 On 02/12/2013 01:31 AM, Assaf Gordon wrote:

 I'd like to offer a proof-of-concept patch for adding sort-like --key 
 support for the 'uniq' program, as discussed here:
http://lists.gnu.org/archive/html/bug-coreutils/2006-06/msg00211.html
 and in several other threads.


One more update with two changes:

1. re-arranged src/uniq_sort_common.h to have the functions in the same order 
as in src/sort.c,
making diff src/uniq_sort_common.h src/sort.c much easier to view (and seeing 
that the functions were not modified at all).

2. when specifying an explicit field separator and using -c, report the counts 
without space-padded, right-aligned numbers (and use the separator after the count).
This might be controversial, but I always needed that :) (I used to wrap every 
'uniq -c' with sed 's/^  *// ; s/ /\t/' ) 
==
## Existing:
$ printf "a\tx\na\tx\nb\ty\n" | uniq -c
  2 a   x
  1 b   y

## New:
$ printf "a\tx\na\tx\nb\ty\n" | ./src/uniq -t $'\t' -c
2   a   x
1   b   y
==
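For reference, the sed unwrapping mentioned above in runnable form (the counts come from standard 'uniq -c' padding):

```shell
# uniq -c pads counts with leading spaces; strip them and turn the count
# into a tab-separated first column - the workaround the patch removes.
printf 'a\tx\na\tx\nb\ty\n' |
uniq -c |
sed 's/^  *//; s/ /\t/'
```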


Also, I'm wondering what exactly is the effect of the following statement
( from http://lists.gnu.org/archive/html/bug-coreutils/2006-06/msg00217.html ):
  This point was addressed in IEEE Std 1003.1-2001/Cor 1-2002, item
  XCU/TC1/D6/40, and it's why the current Posix spec says that the
  behavior of uniq depends on LC_COLLATE.

I also wonder whether sort's keycompare functions fulfill this requirement, and 
whether the current 'uniq' tests cover this situation; 
otherwise my changes are not backwards-compatible.

Thanks,
 -gordon



Re: uniq with sort-like --key support

2013-02-13 Thread Assaf Gordon
Assaf Gordon wrote, On 02/13/2013 11:45 AM:
 On 02/12/2013 01:31 AM, Assaf Gordon wrote:

 I'd like to offer a proof-of-concept patch for adding sort-like --key 
 support for the 'uniq' program, as discussed here:
http://lists.gnu.org/archive/html/bug-coreutils/2006-06/msg00211.html
 and in several other threads.

 
 One more update with two changes:
 

Sorry, forgot to attach the file in the previous email.

-gordon



uniq_key3.patch.xz
Description: application/xz


Re: uniq with sort-like --key support

2013-02-13 Thread Assaf Gordon
Hello Jim,

Jim Meyering wrote, On 02/13/2013 12:05 PM:
 Assaf Gordon wrote:
 Assaf Gordon wrote, On 02/13/2013 11:45 AM:
 ...
 One more update with two changes:
  ...
  src/uniq_sort_common.h | 1096 
 
 
 Hi Gordon.
 Thanks a lot for working on this long-requested change.
 I don't have time to review it, but please change the name of that
 new header file.  First, we use hyphens (not underscores) in file names.
 Did you consider any names that evoke key spec parsing?
 Then, the name would still be apropos if someday it's used by a program
 other than sort and uniq.

This was just a proof-of-concept, so I wanted to have minimal changes that 
would just work.

What would be the recommended way to compartmentalize this functionality?
1. put it in src/key-spec-parsing.h, and have each program (e.g. uniq.c) do 
#include ?
or 
2. split it into src/key-spec-parsing.h and src/key-spec-parsing.c (with 
all the src/local.mk associated changes) - but removing the static from all 
the variables/functions?

or something else?

-gordon



Re: uniq with sort-like --key support

2013-02-13 Thread Assaf Gordon
 Jim Meyering wrote, On 02/13/2013 12:05 PM:
 [...] but please change the name of that
 new header file.  First, we use hyphens (not underscores) in file names.
 Did you consider any names that evoke key spec parsing?
 Then, the name would still be apropos if someday it's used by a program
 other than sort and uniq.
 
 What would be the recommended way to compartmentalize this functionality?
 1. put it in src/key-spec-parsing.h, and have each program (e.g. uniq.c) do 
 #include ?
 or 
 2. split it into src/key-spec-parsing.h and src/key-spec-parsing.c (with 
 all the src/local.mk associated changes) - but removing the static from 
 all the variables/functions?
 
 or something else?

I'm leaning towards option #1 (just a header file) - this will allow including or 
removing functionality using #ifdefs (e.g. uniq doesn't need to support 
random/reverse/human/version key comparisons, and in the far future - perhaps 
'join' will use it and wouldn't need them either).

src/system.h is already used in the same fashion (and has 'static' 
functions), although it's much smaller in scope.

Thoughts?
 -gordon





Re: uniq with sort-like --key support

2013-02-13 Thread Assaf Gordon
Pádraig Brady wrote, On 02/13/2013 12:54 PM:
 On 02/13/2013 05:34 PM, Assaf Gordon wrote:

 What would be the recommended way to compartmentalize this functionality?
 1. put it in src/key-spec-parsing.h, and have each program (e.g. uniq.c) 
 do #include ?
 or
 2. split it into src/key-spec-parsing.h and src/key-spec-parsing.c (with 
 all the src/local.mk associated changes) - but removing the static from 
 all the variables/functions?
 
 2 is more standard/flexible.
 

Evidently, leaning towards option #1 was the wrong choice :)


This update splits the code into the two files (src/key-spec-parsing.{c,h}), 
and adds conditional compilation of supported keys, using per-file CFLAGS in 
local.mk:
src_uniq_SOURCES = src/uniq.c src/key-spec-parsing.c
src_uniq_CPPFLAGS = $(AM_CPPFLAGS)

Another program that needs all the keys might define:
src_sort_SOURCES = src/sort.c src/key-spec-parsing.c
src_sort_CPPFLAGS = -DKEY_SPEC_RANDOM -DKEY_SPEC_REVERSE -DKEY_SPEC_VERSION 
-DKEY_SPEC_HUMAN_NUMERIC $(AM_CPPFLAGS)

These are explained in 'src/key-spec-parsing.c':
  /* define the following to enable extra key options:
  KEY_SPEC_RANDOM   - sort by random order (-k1R,1)
  KEY_SPEC_REVERSE  - reverse sort order   (-k1r,1)
  KEY_SPEC_VERSION  - Version sort order   (-k1V,1)
  KEY_SPEC_HUMAN_NUMERIC- Human sizes order(-k1h,1)

If these are not defined, specifying them will generate an error.
  
See 'set_ordering()' and 'key_to_opts()' in this file,
and src_uniq_CPPFLAGS in src/local.mk for usage examples.
   */


-gordon


uniq_key5.patch.xz
Description: application/xz


[PATCH] join: Add -z option

2013-02-14 Thread Assaf Gordon
Hello,

This patch add -z to join, supporting joining zero-terminated lines.
The patch is heavily based on James Youngman's patch of adding -z to uniq 
(commit e062524).

-gordon

P.S.
This patch is independent of the key-comparison patches discussed recently, 
though I'm also adding it there.
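A quick usage sketch of the proposed option (assumes a join built with this patch - or coreutils >= 8.22, where -z was eventually released; the file names are illustrative):

```shell
# Two NUL-terminated inputs, each sorted on the join field as join
# requires; tr makes the NUL-terminated result readable.
printf 'a 1\0b 2\0' > f1
printf 'a X\0c Y\0' > f2
join -z f1 f2 | tr '\0' '\n'
```

This prints the single paired line 'a 1 X'.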
From 525eb72b150ed34d3bfcfe453d1494fe28a824b7 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Thu, 14 Feb 2013 15:29:08 -0500
Subject: [PATCH] join: Add -z option

* NEWS: Mention join's new option: --zero-terminated (-z).
* src/join.c: Add new option, --zero-terminated (-z), to make
join use the NUL byte as separator/delimiter rather than newline.
(get_line): Use readlinebuffer_delim in place of readlinebuffer.
(main): Handle the new option.
(usage): Describe new option the same way sort does.
* doc/coreutils.texi (join invocation): Describe the new option.
* tests/misc/join.pl: add tests for -z option.
---
 NEWS   |6 ++
 doc/coreutils.texi |   17 +
 src/join.c |   19 +++
 tests/misc/join.pl |   20 
 4 files changed, 58 insertions(+), 4 deletions(-)

diff --git a/NEWS b/NEWS
index 37bcdf7..618c1da 100644
--- a/NEWS
+++ b/NEWS
@@ -2,6 +2,12 @@ GNU coreutils NEWS-*- outline -*-
 
 * Noteworthy changes in release ?.? (-??-??) [?]
 
+** New features
+
+  join accepts a new option: --zero-terminated (-z). As with the sort,uniq
+  option of the same name, this makes join consume and produce NUL-terminated
+  lines rather than newline-terminated lines.
+
 
 * Noteworthy changes in release 8.21 (2013-02-14) [stable]
 
diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index 2c16dc4..a72d9ce 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -6059,6 +6059,10 @@ available; the sort order can be any order that considers two fields
 to be equal if and only if the sort comparison described above
 considers them to be equal.  For example:
 
+Input and output lines are terminated with a newline character unless the
+@option{--zero-terminated} (@option{-z}) is used, in which case lines are
+@sc{nul} terminated.
+
 @example
 $ cat file1
 a a1
@@ -6181,6 +6185,19 @@ character is used to delimit the fields.
 Print a line for each unpairable line in file @var{file-number}
 (either @samp{1} or @samp{2}), instead of the normal output.
 
+@item -z
+@itemx --zero-terminated
+@opindex -z
+@opindex --zero-terminated
+@cindex join zero-terminated lines
+Treat the input as a set of lines, each terminated by a null character
+(ASCII @sc{nul}) instead of a line feed
+(ASCII @sc{lf}).
+This option can be useful in conjunction with @samp{sort -z}, @samp{uniq -z},
+@samp{perl -0} or @samp{find -print0} and @samp{xargs -0} which do the same in
+order to reliably handle arbitrary file names (even those containing blanks
+or other special characters).
+
 @end table
 
 @exitstatus
diff --git a/src/join.c b/src/join.c
index 11e647c..1810ac2 100644
--- a/src/join.c
+++ b/src/join.c
@@ -161,6 +161,7 @@ static struct option const longopts[] =
  {"ignore-case", no_argument, NULL, 'i'},
  {"check-order", no_argument, NULL, CHECK_ORDER_OPTION},
  {"nocheck-order", no_argument, NULL, NOCHECK_ORDER_OPTION},
+  {"zero-terminated", no_argument, NULL, 'z'},
  {"header", no_argument, NULL, HEADER_LINE_OPTION},
   {GETOPT_HELP_OPTION_DECL},
   {GETOPT_VERSION_OPTION_DECL},
@@ -177,6 +178,9 @@ static bool ignore_case;
join them without checking for ordering */
 static bool join_header_lines;
 
+/* The character marking end of line. Default to \n. */
+static char eolchar = '\n';
+
 void
 usage (int status)
 {
@@ -213,6 +217,9 @@ by whitespace.  When FILE1 or FILE2 (not both) is -, read standard input.\n\
   --header  treat the first line in each file as field headers,\n\
   print them without trying to pair them\n\
 "), stdout);
+  fputs (_("\
+  -z, --zero-terminated end lines with 0 byte, not newline\n\
+"), stdout);
   fputs (HELP_OPTION_DESCRIPTION, stdout);
   fputs (VERSION_OPTION_DESCRIPTION, stdout);
  fputs (_("\
@@ -445,7 +452,7 @@ get_line (FILE *fp, struct line **linep, int which)
   else
     line = init_linep (linep);
 
-  if (! readlinebuffer (&line->buf, fp))
+  if (! readlinebuffer_delim (&line->buf, fp, eolchar))
     {
       if (ferror (fp))
         error (EXIT_FAILURE, errno, _("read error"));
@@ -614,7 +621,7 @@ prjoin (struct line const *line1, struct line const *line2)
 break;
   putchar (output_separator);
 }
-  putchar ('\n');
+  putchar (eolchar);
 }
   else
 {
@@ -636,7 +643,7 @@ prjoin (struct line const *line1, struct line const *line2)
   prfields (line1, join_field_1, autocount_1);
   prfields (line2, join_field_2, autocount_2);
 
-  putchar ('\n');
+  putchar (eolchar);
 }
 }
 
@@ -1017,7 +1024,7 @@ main (int argc, char **argv)
   issued_disorder_warning[0] = issued_disorder_warning
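
For reference, a quick sketch of the option this patch adds (assumes a join
built with -z support, as shipped in later coreutils releases; the file names
are just for illustration):

```shell
# NUL-terminated records: join on field 1, then make the output readable.
dir=$(mktemp -d)
printf 'a 1\0b 2\0' > "$dir/f1"
printf 'a x\0b y\0' > "$dir/f2"
join -z "$dir/f1" "$dir/f2" | tr '\0' '\n'
# a 1 x
# b 2 y
rm -rf "$dir"
```

Because records are NUL-delimited, keys and fields may safely contain
embedded newlines.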

sort/uniq/join: key-comparison code consolidation

2013-02-14 Thread Assaf Gordon
Hello,

( new thread for previous topic 
http://lists.gnu.org/archive/html/coreutils/2013-02/msg00082.html ) .

The attached patch contains:

1. src/key-spec-parsing.{h,c} - key comparison code, previously in sort.c

2. uniq - now supports --key (multiple keys, too).
Same as before, but rebased against 8.21.
Supported orders:
  -k1,1  = ascii
  -k1b,1 = ignore-blanks
  -k1d,1 = dictionary
  -k1i,1 = non-printing
  -k1f,1 = ignore-case
  -k1n,1 = fast-numeric
  -k1g,1 = general-numeric
  -k1M,1 = month
also supports user-specified delimiter (default: white-space).
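
The proposed uniq key flags reuse sort's existing -k syntax; today's sort
already illustrates the semantics, e.g. a numeric key restricted to field 1:

```shell
# sort -kF[,F][OPTS]: here 'n' makes field 1 compare numerically.
printf '10 b\n2 a\n' | sort -k1,1n
# 2 a
# 10 b
```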

Related discussions:
  http://debbugs.gnu.org/cgi/bugreport.cgi?bug=5832
  http://debbugs.gnu.org/cgi/bugreport.cgi?bug=7068
  http://lists.gnu.org/archive/html/bug-coreutils/2006-06/msg00211.html

3. sort - same functionality as before, but key-comparison code extracted to a 
different file.

4. join - internally uses the key-comparison code.
Does not support the --key parameter (uses the standard -j/-1/-2),
but accepts new arguments that affect joining order:
 -r --reverse
 -n --numeric-sort
 -d --dictionary-order
 -g --general-numeric

Related discussions:
 http://debbugs.gnu.org/cgi/bugreport.cgi?bug=6903
 http://debbugs.gnu.org/cgi/bugreport.cgi?bug=6366

As an option, perhaps we could support a new -k that would be like -j but 
accept specificity flags
(e.g. -k1nr would be equivalent to -j 1 --numeric --reverse).
 

It'll be easy to add human-numeric-sort/version-sort to join/uniq, but I'm not 
sure if they make sense.


Regards,
 -gordon




key_compare7.patch.xz
Description: application/xz


Re: [PATCH]: uniq: add --group option

2013-02-21 Thread Assaf Gordon
Hello Pádraig,

Pádraig Brady wrote, On 02/20/2013 08:47 PM:
 On 02/20/2013 06:44 PM, Assaf Gordon wrote:
 Hello,

 Attached is a suggestion for --group option in uniq, as discussed here:
http://lists.gnu.org/archive/html/coreutils/2011-03/msg0.html
http://lists.gnu.org/archive/html/coreutils/2012-03/msg00052.html

 The patch adds two parameters:
--group=[method]  separate each unique line (whether duplicated or 
 not)
  with a marker.
  method={none,separate(default),prepend,append,both}
--group-separator=SEP   with --group, separates group using SEP
  (default: empty line)

 
 --group-sep is probably overkill.
 I'd just use \n or \0 if -z specified.
 
OK.

 As for separation methods I'd just go with what we have for
 --all-repeated (but remove 'none' which wouldn't be useful with --group),
 as we've never had requests for anything else. so:
 --group={prepend, separate(default)}
 

I'd like to have at least append or both, for the added convenience of 
downstream analysis.
It's obviously a nice-to-have and not must-have feature, and can be 
implemented in other ways, but knowing that there will always be a terminating 
marker *after* a group (even the last group) makes downstream processing code 
simpler.

Typical example:
 $ cat INPUT | uniq --group=append | \
  awk '$0!="" { ## item in the group, collect it }
   $0=="" { ## end of group, do something }'

Without the final group marker, any downstream code will require two points of 
group processing: when a marker is found, and at EOF.
Something like:

 $ cat INPUT | uniq --group=append | \
  awk '$0!="" { ## item in the group, collect it }
   $0=="" { ## end of group, do something }
   END { ## end of last group, do something, duplicated code }'

A similar reason applies to both: it ensures I can put any special 
initialization code in the group-marker case, without duplicating it in a 
separate 'BEGIN{}' clause (of course, this doesn't have to be awk - it can be 
perl/python/ruby/whatever does the downstream processing).

I realize it's not a make-or-break feature - but if we're trying to make text 
processing easier, I believe append/both makes it even easier.
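
A runnable sketch of the workflow described above, using the --group=append
behavior as it was later merged into coreutils (>= 8.22): every group,
including the last one, is followed by a blank-line marker, so a single awk
rule closes each group.

```shell
# Count the size of each group; no END rule needed for the last group.
printf 'a\na\nb\nc\nc\nc\n' | uniq --group=append |
  awk '$0 != "" { n++ }  $0 == "" { print n; n = 0 }'
# 2
# 1
# 3
```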


 So on to operation...
 
 And it behaves as expected:
 ===
  $ printf "a\na\na\nb\nc\nc\n" | ./src/uniq --group-sep=-- --group=separate
 
 The above isn't that useful and could be done with sed.
 
I assume you're specifically referring to the group-sep part - then OK.


 Supporting -u or -d with --group wouldn't be useful either really.
 It's probably most consistent to just disallow those combinations.
 

Just to be clear on the reasoning: because with -u and -d, each *line* is 
implicitly a separate group, there's no apparent utility for an end-of-group 
marker.

I guess it's true from a technical POV - but again, for downstream analysis 
convenience it's nice to have a fixed end-of-group marker.
I could use the same downstream script (which expects end-of-group markers) 
with uniq, whether I used -d or -u or nothing at all.

What do you think?
 -gordon







Re: [PATCH]: uniq: add --group option

2013-02-21 Thread Assaf Gordon
Pádraig Brady wrote, On 02/21/2013 11:11 AM:
 On 02/21/2013 03:42 PM, Assaf Gordon wrote:
 Hello Pádraig,
 
 Pádraig Brady wrote, On 02/20/2013 08:47 PM:
 On 02/20/2013 06:44 PM, Assaf Gordon wrote:
 Hello,
 
 Attached is a suggestion for --group option in uniq, as
 discussed here: 
 http://lists.gnu.org/archive/html/coreutils/2011-03/msg0.html
 

[ ... ]

 So on to operation...
 
  And it behaves as expected: === $ printf "a\na\na\nb\nc\nc\n"
  | ./src/uniq --group-sep=-- --group=separate
 
 The above isn't that useful and could be done with sed.
 
 I assume you're specifically referring to the group-sep part -
 then OK.
 
 
 Actually I was referring to the fact that in your example --group
 didn't output all entries by default. If it only output unique
 entries then you can separate with:
 
  uniq | sed 'G'     # (note sed also supports -z)
  uniq | sed '$q;G'
 
 So `uniq --group` should output all items by default I think.
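
Spelled out, the two sed alternatives suggested above (runnable with GNU or
POSIX sed):

```shell
# 'G' appends the (empty) hold space after every line: marker after each group.
printf 'a\na\nb\n' | uniq | sed 'G'      # 4 output lines
# '$q' quits on the last line before 'G' runs: marker between groups only.
printf 'a\na\nb\n' | uniq | sed '$q;G'   # 3 output lines
```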
 
[ ... ]
 
 I guess it's true from a technical POV - but again, for downstream
 analysis convenience it's nice to have a fixed end-of-group
 marker. I could use the same downstream script (which expects
 end-of-group markers) with uniq, whether I used -d or -u or
 nothing at all.
 
 But what's the point in such processing if there is only ever going 
 to be a single line in each group?

I see now,

I was thinking of --group as simply an output modifier (i.e. add a group 
marker to whatever uniq is outputting), allowing combination of --group with 
-u/-d/-D or any other option (whether it made useful sense or not).

You were planning on --group to mean explicitly output all input lines, and 
add group-markers for unique groups (meaning -u/-d/-D and --group are mutually 
exclusive).

I can go along with your definition. I'll send an update soon.

-gordon






Re: [PATCH]: uniq: add --group option

2013-02-21 Thread Assaf Gordon
Assaf Gordon wrote, On 02/21/2013 11:37 AM:
 
 You were planning on --group to mean explicitly output all input lines, 
 and add group-markers for unique groups (meaning -u/-d/-D and --group are 
 mutually exclusive).
 

Attached is a version that behaves as previously discussed.
--group can't be used with -c/-d/-D/-u.

Since it's a completely separate behavior, I found it easier to create a whole 
new code path in check_file() for the special case of grouping.

Comments are welcomed,
 -gordon

From 072ffee0f45a67465607cde3d984e6fd7e37a1af Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Wed, 20 Feb 2013 13:31:22 -0500
Subject: [PATCH] uniq: add --group option

* src/uniq.c: implement --group options.
* tests/misc/uniq.pl: add tests.
---
 src/uniq.c |  125 +---
 tests/misc/uniq.pl |   40 +
 2 files changed, 159 insertions(+), 6 deletions(-)

diff --git a/src/uniq.c b/src/uniq.c
index 5efdad7..598c62d 100644
--- a/src/uniq.c
+++ b/src/uniq.c
@@ -108,11 +108,47 @@ static enum delimit_method const delimit_method_map[] =
 /* Select whether/how to delimit groups of duplicate lines.  */
 static enum delimit_method delimit_groups;
 
+enum grouping_method
+{
+  /* No grouping, when --group isn't used */
+  GM_NONE,
+
+  /* Delimiter precedes all groups.  --group=prepend */
+  GM_PREPEND,
+
+  /* Delimiter follows all groups.   --group=append */
+  GM_APPEND,
+
+  /* Delimiter between groups.       --group[=separate] */
+  GM_SEPARATE,
+
+  /* Delimiter before and after each group. --group=both */
+  GM_BOTH
+};
+
+static char const *const grouping_method_string[] =
+{
+  "prepend", "append", "separate", "both", NULL
+};
+
+static enum grouping_method const grouping_method_map[] =
+{
+  GM_PREPEND, GM_APPEND, GM_SEPARATE, GM_BOTH
+};
+
+static enum grouping_method grouping = GM_NONE;
+
+enum
+{
+  GROUP_OPTION = CHAR_MAX + 1
+};
+
 static struct option const longopts[] =
 {
  {"count", no_argument, NULL, 'c'},
  {"repeated", no_argument, NULL, 'd'},
  {"all-repeated", optional_argument, NULL, 'D'},
+  {"group", optional_argument, NULL, GROUP_OPTION},
  {"ignore-case", no_argument, NULL, 'i'},
  {"unique", no_argument, NULL, 'u'},
  {"skip-fields", required_argument, NULL, 'f'},
@@ -159,6 +195,11 @@ With no options, matching lines are merged to the first occurrence.\n\
   -z, --zero-terminated  end lines with 0 byte, not newline\n\
 "), stdout);
+ fputs (_("\
+  --group=[method]  separate each unique group (whether duplicated or not)\n\
+                    with an empty line.\n\
+                    method={separate(default),prepend,append,both}\n\
+"), stdout);
  fputs (_("\
   -w, --check-chars=N   compare no more than N characters in lines\n\
 "), stdout);
  fputs (HELP_OPTION_DESCRIPTION, stdout);
@@ -293,13 +334,57 @@ check_file (const char *infile, const char *outfile, char delimiter)
   initbuffer (prevline);
 
   /* The duplication in the following 'if' and 'else' blocks is an
- optimization to distinguish the common case (in which none of
- the following options has been specified: --count, --repeated,
- --all-repeated, --unique) from the others.  In the common case,
- this optimization lets uniq output each different line right away,
- without waiting to see if the next one is different.  */
+ optimization to distinguish several cases:
 
-  if (output_unique && output_first_repeated && countmode == count_none)
+ 1. grouping (--group=X) - all input lines are printed.
+checking for unique/duplicated lines is used only for printing
+group separators.
+
+ 2. The common case -
+In which none of the following options has been specified:
+  --count, --repeated,  --all-repeated, --unique
+In the common case, this optimization lets uniq output each different
+line right away, without waiting to see if the next one is different.
+
+ 3. All other cases.
+  */
+  if (grouping != GM_NONE)
+{
+  char *prevfield IF_LINT ( = NULL);
+  size_t prevlen IF_LINT ( = 0);
+  bool first_group_printed = false;
+
+  while (!feof (stdin))
+{
+  char *thisfield;
+  size_t thislen;
+  bool new_group;
+  if (readlinebuffer_delim (thisline, stdin, delimiter) == 0)
+break;
+          thisfield = find_field (thisline);
+          thislen = thisline->length - 1 - (thisfield - thisline->buffer);
+
+          new_group = (prevline->length == 0
+                       || different (thisfield, prevfield, thislen, prevlen));
+
+          if (new_group && (grouping == GM_PREPEND || grouping == GM_BOTH
+                            || (first_group_printed
+                                && (grouping == GM_APPEND
+                                    || grouping == GM_SEPARATE))))
+            putchar (delimiter);
+
+          fwrite (thisline->buffer, sizeof (char), thisline->length, stdout);
+  SWAP_LINES

[PATCH] improve 'autotools-install'

2013-02-21 Thread Assaf Gordon
Hello,

Trying to use 'scripts/autotools-install' on a problematic system (Mac OS X 
10.6.8, already has few other related bugs), building pkg-config fails.

Two patches attached:

1. When ./configure or make fail, use die() to print an error, pointing 
the user to the error log file. 
This helps when troubleshooting errors, because the script has set -e and 
simply exits on errors.
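
A minimal, self-contained sketch of the pattern the patch introduces (the
die() name comes from the script; the log path here is invented for the
demo).  Under 'set -e' a failing step aborts silently; adding '|| die ...'
reports where to look first:

```shell
log=$(mktemp)
die() { printf '%s\n' "$*" >&2; exit 1; }
# Run the failing step in a subshell so the demo itself exits cleanly.
( set -e
  false || die "configuring failed; check 'makerr-config' for details"
  echo "not reached" ) 2>"$log" || true
cat "$log"
# configuring failed; check 'makerr-config' for details
rm -f "$log"
```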

2. Recent pkg-config has a cyclic requirement of glib, explained in the 
README [1]:
   pkg-config depends on glib.  Note that glib build-depends on pkg-config,
   but you can just set the corresponding environment variables (ZLIB_LIBS,
   ZLIB_CFLAGS are the only needed ones when this is written) to build it.

   If this requirement is too cumbersome, a bundled copy of a recent glib
   stable release is included. Pass --with-internal-glib to configure to
   use this copy.

The second patch adds this --with-internal-glib flag when configuring 
pkg-config .

Sadly, autotools-install still doesn't complete, because gettext 0.18.1 fails 
to compile with a stpncpy() related problem (exactly as solved in coreutils 
[2]), but that is not a coreutils bug.

-gordon

[1] http://cgit.freedesktop.org/pkg-config/tree/README?id=pkg-config-0.27.1
[2] http://bugs.gnu.org/13495

From ba2c30e47e808c60bd5e899caca1207dae5aa95a Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Thu, 21 Feb 2013 17:50:28 -0500
Subject: [PATCH 1/2] maint: print errors when autotools-install fails

* scripts/autotools-install: call die() when configure/make fail. Point
the user to the relevant error log file.
---
 scripts/autotools-install |8 ++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/scripts/autotools-install b/scripts/autotools-install
index bd49664..419806d 100755
--- a/scripts/autotools-install
+++ b/scripts/autotools-install
@@ -148,8 +148,12 @@ for pkg in $pkgs; do
   rm -rf $dir
   gzip -dc $pkg | tar xf -
   cd $dir
-  ./configure CFLAGS=-O2 LDFLAGS=-s --prefix=$prefix > makerr-config 2>&1
-  $MAKE > makerr-build 2>&1
+  ./configure CFLAGS=-O2 LDFLAGS=-s --prefix=$prefix > makerr-config 2>&1 \
+    || die "configuring package $dir failed.  \
+check '$tmpdir/$dir/makerr-config' for possible details."
+  $MAKE > makerr-build 2>&1 \
+    || die "building package $dir failed.  \
+check '$tmpdir/$dir/makerr-build' for possible details."
   if test $make_check = yes; then
 case $pkg in
   # FIXME: these are out of date and very system-sensitive
-- 
1.7.7.4


From c3d135c51e20ceb72d5b453081bea1e1899f9ef1 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Thu, 21 Feb 2013 17:58:55 -0500
Subject: [PATCH 2/2] maint: add special config flags for pkg-config

* scripts/autotools-install: force pkg-config to use internal 'glib'
files when compiling from source.
---
 scripts/autotools-install |4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/scripts/autotools-install b/scripts/autotools-install
index 419806d..2b626ff 100755
--- a/scripts/autotools-install
+++ b/scripts/autotools-install
@@ -144,11 +144,13 @@ pkgs=`get_sources`
 export PATH=$prefix/bin:$PATH
 for pkg in $pkgs; do
   echo building/installing $pkg...
+  extra=
+  case $pkg in pkg-config*) extra=--with-internal-glib;; esac
   dir=`basename $pkg .tar.gz`
   rm -rf $dir
   gzip -dc $pkg | tar xf -
   cd $dir
-  ./configure CFLAGS=-O2 LDFLAGS=-s --prefix=$prefix > makerr-config 2>&1 \
+  ./configure CFLAGS=-O2 LDFLAGS=-s --prefix=$prefix $extra > makerr-config 2>&1 \
     || die "configuring package $dir failed.  \
 check '$tmpdir/$dir/makerr-config' for possible details."
   $MAKE > makerr-build 2>&1 \
-- 
1.7.7.4



Re: [PATCH] improve 'autotools-install'

2013-02-22 Thread Assaf Gordon
Hello Stefano,

Stefano Lattarini wrote, On 02/22/2013 02:30 AM:
 On 02/22/2013 12:08 AM, Assaf Gordon wrote:

 I think this explanation should go in the commit message of the second patch,
 as it makes clear why such patch is needed.
 

Good idea, attached an improved patch.


  Sadly, autotools-install still doesn't complete, because gettext 0.18.1
  fails to compile with a stpncpy() related problem (exactly as solved in
  coreutils [2]) but that's not a coreutils bug.

 Is the issue still present with the latest gettext version (0.18.2)?  If not,
 you could update the '$tarballs' definition to point to that instead.

No, 0.18.2 doesn't compile either.
Eric Blake already found the fix for this, I'll just send the gettext people a 
bug report.

 
 Also, I see that the Automake version referenced by '$tarballs' is still
 1.12.3; I think it should be updated to the latest available version
 (1.13.2 at the moment of writing).
 

I can send a separate patch for that, but perhaps others would chime in as to 
whether this should be done?
I assume changing version (1.12 vs 1.13) should be done when it's explicitly 
needed?

-gordon
From ba2c30e47e808c60bd5e899caca1207dae5aa95a Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Thu, 21 Feb 2013 17:50:28 -0500
Subject: [PATCH 1/2] maint: print errors when autotools-install fails

* scripts/autotools-install: call die() when configure/make fail. Point
the user to the relevant error log file.
---
 scripts/autotools-install |8 ++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/scripts/autotools-install b/scripts/autotools-install
index bd49664..419806d 100755
--- a/scripts/autotools-install
+++ b/scripts/autotools-install
@@ -148,8 +148,12 @@ for pkg in $pkgs; do
   rm -rf $dir
   gzip -dc $pkg | tar xf -
   cd $dir
-  ./configure CFLAGS=-O2 LDFLAGS=-s --prefix=$prefix > makerr-config 2>&1
-  $MAKE > makerr-build 2>&1
+  ./configure CFLAGS=-O2 LDFLAGS=-s --prefix=$prefix > makerr-config 2>&1 \
+    || die "configuring package $dir failed.  \
+check '$tmpdir/$dir/makerr-config' for possible details."
+  $MAKE > makerr-build 2>&1 \
+    || die "building package $dir failed.  \
+check '$tmpdir/$dir/makerr-build' for possible details."
   if test $make_check = yes; then
 case $pkg in
   # FIXME: these are out of date and very system-sensitive
-- 
1.7.7.4


From 49c577432325de449239ce5ed5e2b82e401eee14 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Thu, 21 Feb 2013 17:58:55 -0500
Subject: [PATCH 2/2] maint: add special config flags for pkg-config

* scripts/autotools-install: force pkg-config to use internal 'glib'
files when compiling from source.

Recent pkg-config has a cyclic requirement of glib, explained in the
pkg-config's README:
http://cgit.freedesktop.org/pkg-config/tree/README?id=pkg-config-0.27.1

  pkg-config depends on glib.  Note that glib build-depends on
  pkg-config, but you can just set the corresponding environment
  variables to build it.

  If this requirement is too cumbersome, a bundled copy of a recent
  glib stable release is included.
  Pass --with-internal-glib to configure to use this copy.
---
 scripts/autotools-install |4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/scripts/autotools-install b/scripts/autotools-install
index 419806d..2b626ff 100755
--- a/scripts/autotools-install
+++ b/scripts/autotools-install
@@ -144,11 +144,13 @@ pkgs=`get_sources`
 export PATH=$prefix/bin:$PATH
 for pkg in $pkgs; do
   echo building/installing $pkg...
+  extra=
+  case $pkg in pkg-config*) extra=--with-internal-glib;; esac
   dir=`basename $pkg .tar.gz`
   rm -rf $dir
   gzip -dc $pkg | tar xf -
   cd $dir
-  ./configure CFLAGS=-O2 LDFLAGS=-s --prefix=$prefix > makerr-config 2>&1 \
+  ./configure CFLAGS=-O2 LDFLAGS=-s --prefix=$prefix $extra > makerr-config 2>&1 \
     || die "configuring package $dir failed.  \
 check '$tmpdir/$dir/makerr-config' for possible details."
   $MAKE > makerr-build 2>&1 \
-- 
1.7.7.4



Re: bug#13786: pr command does not fold

2013-02-22 Thread Assaf Gordon
(Adding the list)

Doh Smith wrote, On 02/22/2013 05:14 AM:
 
 I could not get the pr command to fold the lines. Is this a bug?
 

I replied but forgot to CC the mailing list, answer available here:
  http://bugs.gnu.org/13786

This bug can likely be closed, if others agree.

-gordon



Re: coreutils FAQ link in manpages and/or --help output

2013-02-25 Thread Assaf Gordon
Hi,

Bernhard Voelker wrote, On 02/25/2013 10:58 AM:
 On 02/25/2013 03:53 PM, Ondrej Oprala wrote:
 to reduce the amount of questions about date, sort and anything 
 multibyte-related, I think
 it'd be a good idea to add a link to the coreutils FAQ to the man pages 
 (and/or --help output), maybe something like
 Report TOOL bugs to bug-coreut...@gnu.org but please make sure the 
 behaviour is not listed in FAQ: link
 What do you think?
 
 I'd not be surprised if this had already been discussed before.
 However, I like the idea, but it would be best if we had nice
 translations for that FAQ page.
 

If changing the general help message (I assume from emit_ancillary_info()), 
perhaps consider another change:

Add a line (preferably on top) that would point to coreutils@gnu.org in 
addition to bugs-coreut...@gnu.org,
saying something like:
 For common usage questions, see FAQ
and then
 Send usage questions to coreutils@gnu.org
and only then print the existing:
 Report sort bugs to bug-coreut...@gnu.org

This will (hopefully?) prevent those cases where people send general questions 
and open a new bug, forcing someone to respond with the boiler-plate answer 
("by sending an email to this mailing list, you've opened a bug-report, and 
I'm closing it").

There's already a link saying:
 General help using GNU software: http://www.gnu.org/gethelp/
But this page is not very helpful for a non-expert who just wants to get help 
about a specific GNU coreutils program...


Just my two cents,
 -gordon




Re: [PATCH]: uniq: add --group option

2013-02-28 Thread Assaf Gordon
Pádraig Brady wrote, On 02/27/2013 08:16 PM:
 On 02/21/2013 07:40 PM, Assaf Gordon wrote:
 Assaf Gordon wrote, On 02/21/2013 11:37 AM:

 You were planning on --group to mean explicitly output all input lines, 
 and add group-markers for unique groups (meaning -u/-d/-D and --group are 
 mutually exclusive).


 
 I'll push this tomorrow with the attached changes.
 I added NEWS, docs and refactored the
 default and --group core loops together as
 as they're essentially the same.
 

Thank you.

Once pushed, I'll send a rebased patch for the sort/join/uniq key-comparison 
feature.

-gordon






Re: [PATCH]: uniq: add tests for --ignore-case

2013-02-28 Thread Assaf Gordon
Pádraig Brady wrote, On 02/27/2013 10:38 PM:
 On 02/12/2013 03:44 PM, Assaf Gordon wrote:
 I noticed that by running the default test suite (make check SUBDIRS=.), 
 the majority of uniq tests are skipped:
   uniq: skipping this test -- no appropriate locale
   SKIP: tests/misc/uniq.pl
   PASS: tests/misc/uniq-perf.sh

 This is due to tests/misc/uniq.pl line 83:
  83 # I've only ever triggered the problem in a non-C locale.

  84 my $locale = $ENV{LOCALE_FR};

  85 ! defined $locale || $locale eq 'none'   

 86   and CuSkip::skip "$prog: skipping this test -- no appropriate locale\n";

 which skips the entire suite if there's no french locale defined, even 
 though only one test actually sets the locale.

 I can have a patch for it, if that's acceptable.
 
 Thanks for noticing that.
 A patch would be much appreciated.
 

Attached a patch to not-skip all uniq tests if french locale is missing.

-gordon

From 65e47a463e672eddf8f7ed0ca5a9886033e0ef69 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Thu, 28 Feb 2013 14:12:52 -0500
Subject: [PATCH] uniq: don't skip all tests when locale is missing

* tests/misc/uniq.pl: Previously, if LOCALE_FR was not defined, all
tests would be skipped. Modified to skip only the relevant test.
---
 tests/misc/uniq.pl |   41 ++---
 1 files changed, 26 insertions(+), 15 deletions(-)

diff --git a/tests/misc/uniq.pl b/tests/misc/uniq.pl
index e3873b5..4fe1357 100755
--- a/tests/misc/uniq.pl
+++ b/tests/misc/uniq.pl
@@ -80,23 +80,8 @@ sub add_z_variants($)
   return @new;
 }
 
-# I've only ever triggered the problem in a non-C locale.
-my $locale = $ENV{LOCALE_FR};
-! defined $locale || $locale eq 'none'
-  and CuSkip::skip "$prog: skipping this test -- no appropriate locale\n";
-
-# See if isblank returns true for nbsp.
-my $x = qx!env printf '\xa0'| LC_ALL=$locale tr '[:blank:]' x!;
-# If so, expect just one line of output in the schar test.
-# Otherwise, expect two.
-my $in = " y z\n\xa0 y z\n";
-my $schar_exp = $x eq 'x' ? " y z\n" : $in;
-
 my @Tests =
 (
-  # Test for a subtle, system-and-locale-dependent bug in uniq.
- ['schar', '-f1',  {IN => $in}, {OUT => $schar_exp},
-  {ENV => "LC_ALL=$locale"}],
 ['1', '', {IN=>''}, {OUT=>''}],
 ['2', '', {IN=>"a\na\n"}, {OUT=>"a\n"}],
 ['3', '', {IN=>"a\na"}, {OUT=>"a\n"}],
@@ -205,6 +190,32 @@ my @Tests =
 ['127', '--ignore-case', {IN=>"A\na\n"}, {OUT=>"A\n"}],
 );
 
+
+# Locale related tests
+
+my $locale = $ENV{LOCALE_FR};
+if ( defined $locale && $locale ne 'none' )
+  {
+# I've only ever triggered the problem in a non-C locale.
+
+# See if isblank returns true for nbsp.
+my $x = qx!env printf '\xa0'| LC_ALL=$locale tr '[:blank:]' x!;
+# If so, expect just one line of output in the schar test.
+# Otherwise, expect two.
+my $in = " y z\n\xa0 y z\n";
+my $schar_exp = $x eq 'x' ? " y z\n" : $in;
+
+my @Locale_Tests =
+(
+  # Test for a subtle, system-and-locale-dependent bug in uniq.
+  ['schar', '-f1',  {IN => $in}, {OUT => $schar_exp},
+    {ENV => "LC_ALL=$locale"}]
+);
+
+push @Tests, @Locale_Tests;
+  }
+
+
 # Set _POSIX2_VERSION=199209 in the environment of each obs-plus* test.
 foreach my $t (@Tests)
   {
-- 
1.7.7.4



Re: sort/uniq/join: key-comparison code consolidation

2013-02-28 Thread Assaf Gordon
Hello,

Assaf Gordon wrote, On 02/14/2013 06:07 PM:
 ( new thread for previous topic 
 http://lists.gnu.org/archive/html/coreutils/2013-02/msg00082.html ) .

Attached is the sort/uniq/join key-comparison patch, rebased against the latest 
revision.
This patch should also be cleaner and the commit comments more helpful.

comments are welcomed,
 -gordon


key-compare4.patch.xz
Description: application/xz


Re: coreutils FAQ link in manpages and/or --help output

2013-03-01 Thread Assaf Gordon
Pádraig Brady wrote, On 02/28/2013 08:12 AM:
 On 02/28/2013 08:40 AM, Ondrej Vasik wrote:
 On Thu, 2013-02-28 at 09:26 +0100, Bernhard Voelker wrote:
 On February 28, 2013 at 4:23 AM Pádraig Brady p...@draigbrady.com wrote:
 I've adjusted the above to only reference online resources,
 and ensure the links are at the end of each line.
 The result is now at http://www.gnu.org/software/coreutils/

 I like it.


Looks great.

Since you're updating the website to make it more approachable, may I suggest 
two more changes?

1. In the Downloads section, put a direct link to the official GIT repository 
( http://git.savannah.gnu.org/cgit/coreutils.git ) ?
The current text says:
  Coreutils source releases can be found at 
  Test source releases can be found at 
  The latest source code, along with a revision history, can be found in the 
Savannah repository

It's true that it mentions the Savannah repository, but it's not immediately 
clear what's going on.
And to actually see the Git page, one has to click on the Savannah repository 
link (and the Savannah page is a bit of an overloaded mess), then go to the 
Source Code drop-down menu, and click on Browse Source Code - not exactly 
intuitive.

If we could add a simple line below that says:
  View git source code repository: 
http://git.savannah.gnu.org/cgit/coreutils.git
It would be much more convenient.


2. In the Downloads section, mention which is the latest version, and provide 
a direct link to it.
This requires a bit of work every time a new release is made, but it's very 
helpful for someone who just wants to download the latest version without 
exploring the GNU FTP website.

-gordon




[PATCH] shuf: use reservoir-sampling when possible

2013-03-06 Thread Assaf Gordon
Hello,

Attached is a suggestion to implement reservoir-sampling in shuf:
When the expected output of lines is known, it will not load the entire file 
into memory - allowing shuffling very large inputs.

I've seen this mentioned once:
 http://lists.gnu.org/archive/html/coreutils/2012-11/msg00079.html
but no follow-up discussion.

There is no change in the usage of shuf (barring unexpected bugs...).

Example (with debug messages):
===
  $ seq 10000 | ./src/shuf ---debug -n 5
  --reservoir_sampling--
  filling reservoir, input line 1 of 5: '1'
  filling reservoir, input line 2 of 5: '2'
  filling reservoir, input line 3 of 5: '3'
  filling reservoir, input line 4 of 5: '4'
  filling reservoir, input line 5 of 5: '5'
  Replacing reservoir sample 4 with line 7 '7'
  Replacing reservoir sample 4 with line 8 '8'
  Replacing reservoir sample 3 with line 9 '9'
  Replacing reservoir sample 2 with line 10 '10'
  Replacing reservoir sample 4 with line 11 '11'
  Replacing reservoir sample 4 with line 16 '16'
  Replacing reservoir sample 4 with line 17 '17'
  Replacing reservoir sample 4 with line 20 '20'
  Replacing reservoir sample 2 with line 22 '22'
  Replacing reservoir sample 0 with line 31 '31'
  Replacing reservoir sample 1 with line 52 '52'
  Replacing reservoir sample 4 with line 55 '55'
  Replacing reservoir sample 3 with line 61 '61'
  Replacing reservoir sample 4 with line 76 '76'
  Replacing reservoir sample 2 with line 169 '169'
  Replacing reservoir sample 2 with line 187 '187'
  Replacing reservoir sample 0 with line 216 '216'
  Replacing reservoir sample 1 with line 340 '340'
  Replacing reservoir sample 4 with line 431 '431'
  Replacing reservoir sample 1 with line 524 '524'
  Replacing reservoir sample 2 with line 942 '942'
  Replacing reservoir sample 1 with line 1096 '1096'
  Replacing reservoir sample 2 with line 1627 '1627'
  Replacing reservoir sample 4 with line 1763 '1763'
  Replacing reservoir sample 2 with line 2679 '2679'
  Replacing reservoir sample 3 with line 4382 '4382'
  Replacing reservoir sample 2 with line 4439 '4439'
  Replacing reservoir sample 3 with line 7748 '7748'
  Replacing reservoir sample 2 with line 9902 '9902'
  -- reservoir lines (begin)--
  216
  1096
  9902
  7748
  1763
  -- reservoir lines (end)--
  216
  1763
  7748
  1096
  9902
===

The last 5 lines are the final output (the rest is STDERR debug messages).
After the input is read completely, the lines are still re-permuted (using the 
existing shuf code), to accommodate cases like:

===
  $ seq 6 | ./src/shuf ---debug -n 5
  --reservoir_sampling--
  filling reservoir, input line 1 of 5: '1'
  filling reservoir, input line 2 of 5: '2'
  filling reservoir, input line 3 of 5: '3'
  filling reservoir, input line 4 of 5: '4'
  filling reservoir, input line 5 of 5: '5'
  Replacing reservoir sample 2 with line 6 '6'
  -- reservoir lines (begin)--
  1
  2
  6
  4
  5
  -- reservoir lines (end)--
  4
  2
  1
  6
  5
===
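
The core idea (Algorithm R) can be sketched in a few lines of awk -- this is
an illustration of the technique, not the patch's C code, and the srand seed
is arbitrary: keep the first k lines, then replace a random reservoir slot
with probability k/NR for each subsequent line.

```shell
seq 100 | awk -v k=5 '
  BEGIN { srand(1) }
  NR <= k { r[NR] = $0; next }          # fill the reservoir
  { j = int(rand() * NR) + 1            # j uniform in 1..NR
    if (j <= k) r[j] = $0 }             # keep this line with prob k/NR
  END { for (i = 1; i <= k; i++) print r[i] }'
```

Each of the 100 input lines ends up in the output with equal probability,
while only k lines are ever held in memory.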


Comments are welcomed,
 -gordon
From b64d5063e26c0f3485d8342a2d5501f655f1063e Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Wed, 6 Mar 2013 18:25:49 -0500
Subject: [PATCH] shuf: use reservoir-sampling when possible

* src/shuf.c: Use reservoir-sampling when the number of output lines
is known (by using '-n X' parameter).
read_input_reservoir_sampling() - read lines from input file, and keep
only K lines in memory, replacing lines with decreasing probability.
prepare_shuf_lines() - convert reservoir lines to a usable structure.
main() - if the number of lines is known, use reservoir-sampling
instead of reading entire input file.
---
 src/shuf.c |  171 ++--
 1 files changed, 167 insertions(+), 4 deletions(-)

diff --git a/src/shuf.c b/src/shuf.c
index 71ac3e6..27982e5 100644
--- a/src/shuf.c
+++ b/src/shuf.c
@@ -25,6 +25,7 @@
 #include "error.h"
 #include "fadvise.h"
 #include "getopt.h"
+#include "linebuffer.h"
 #include "quote.h"
 #include "quotearg.h"
 #include "randint.h"
@@ -81,7 +82,8 @@ With no FILE, or when FILE is -, read standard input.\n\
non-character as a pseudo short option, starting with CHAR_MAX + 1.  */
 enum
 {
-  RANDOM_SOURCE_OPTION = CHAR_MAX + 1
+  RANDOM_SOURCE_OPTION = CHAR_MAX + 1,
+  DEV_DEBUG_OPTION
 };
 
 static struct option const long_opts[] =
@@ -92,11 +94,31 @@ static struct option const long_opts[] =
  {"output", required_argument, NULL, 'o'},
  {"random-source", required_argument, NULL, RANDOM_SOURCE_OPTION},
  {"zero-terminated", no_argument, NULL, 'z'},
+  {"-debug", no_argument, NULL, DEV_DEBUG_OPTION},
   {GETOPT_HELP_OPTION_DECL},
   {GETOPT_VERSION_OPTION_DECL},
   {0, 0, 0, 0},
 };
 
+/* debugging for developers.  Enables devmsg(). */
+static bool dev_debug = false;
+
+/* Like error(0, 0, ...), but without an implicit newline.
+   Also a noop unless the global DEV_DEBUG is set.
+   TODO: Replace with variadic macro in system.h or
+   move to a separate module.  */
+static inline void
+devmsg (char

Re: [PATCH] shuf: use reservoir-sampling when possible

2013-03-07 Thread Assaf Gordon
Hello,

Attached is an updated version.

Pádraig Brady wrote, On 03/06/2013 08:24 PM:
 On 03/06/2013 11:50 PM, Assaf Gordon wrote:
 Attached is a suggestion to implement reservoir-sampling in shuf:
 When the expected output of lines is known, it will not load the entire file 
 into memory - allowing shuffling very large inputs.


Regarding comments:

  {"-debug", no_argument, NULL, DEV_DEBUG_OPTION},
 no need to keep this, for final commit.

Yes, I'll remove this once the code is acceptable.


 prepare_shuf_lines (struct linebuffer *in_lines, size_t n, char ***out_lines,
 
 I've not looked into the details, but it would
 be nice to avoid the memcpy/conversion here
 

I've removed the conversion function, and instead added a new function to 
output the lines directly.

 static size_t
 read_input_reservoir_sampling (FILE *in, char eolbyte, char ***pline, size_t 
 k,
struct randint_source *s)
...
   struct linebuffer *rsrv = XCALLOC (k, struct linebuffer); /* init 
 reservoir*/
 
 Since this change is mainly about efficient mem usage we should probably 
 handle
 the case where we have small inputs but large k.  This will allocate (and 
 zero)
 memory up front. The zeroing will defeat any memory overcommit configured on 
 the
 system, but it's probably better to avoid the large initial commit and realloc
 as required (not per line, but per 1K lines maybe).
 

I'm not quite sure about this:
The reservoir-sampling path can only be used when the user explicitly asks to 
limit the number of output lines.
I would naively assume that if a user explicitly asked to limit the output to 
1,000,000 lines, he/she expects large input as well.
And so the (edge?) case of asking for a large number of output lines but 
supplying a very small number of input lines should be rare.
Wouldn't you agree? Or is there a different typical usage case?

Also, the code only allocates an array of struct linebuffer entries (24 bytes 
each on 64-bit systems).
So even asking for 1M lines will allocate just 24MB of RAM - not too much on 
modern machines.
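The arithmetic behind that sizing claim, assuming gnulib's struct linebuffer is three machine words on a 64-bit build (a buffer pointer plus two size_t fields - an assumption about the struct layout, not stated in the patch):

```python
WORD = 8                       # bytes per pointer / size_t on 64-bit
linebuffer_bytes = 3 * WORD    # char *buffer; size_t size; size_t length
k = 1000000                    # e.g. shuf -n 1000000
total = k * linebuffer_bytes
print(total, "bytes =", total / 10**6, "MB")  # 24000000 bytes = 24.0 MB
```

Note this counts only the reservoir's bookkeeping array; each slot's line buffer is allocated separately as lines arrive.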



The second attached patch is experimental - it tries to assess the randomness 
of 'shuf' output by running it 1,000 times and checking if the output is (very 
roughly) uniformly distributed.
I don't know if there were attempts in the past to unit-test randomness (that 
were later abandoned) - or if this was just never considered worthwhile (or 
deemed too error-prone).
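The rough uniformity check described above amounts to computing Pearson's chi-squared statistic of the observed category counts against a flat expectation. A small illustrative computation (not the shell/awk code from the attached patch):

```python
def chi_squared(counts, trials, categories):
    """Pearson's chi-squared statistic of observed category counts
    (a dict mapping category -> count) against a uniform
    distribution over `categories` outcomes."""
    expected = trials / categories
    observed = (counts.get(i, 0) for i in range(1, categories + 1))
    # sum of (observed - expected)^2 / expected over all categories
    return sum((o - expected) ** 2 / expected for o in observed)

# A perfectly flat histogram scores 0; the test only demands that the
# statistic stay below a very lenient threshold (200 for the assumed
# 99 degrees of freedom), so false failures should be rare.
print(chi_squared({i: 10 for i in range(1, 101)}, 1000, 100))
```

A wildly skewed histogram (all trials landing in one category) scores orders of magnitude above any sane threshold, which is what the test is really guarding against.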

Comments are welcomed,
 -gordon









From 1adfd08cd3a52c373932b0f1039755a240d2c0b8 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Thu, 7 Mar 2013 01:57:57 -0500
Subject: [PATCH 1/2] shuf: add (expensive) test for randomness

To run manually:
  make check TESTS=tests/misc/shuf-randomness.sh \
 SUBDIRS=. RUN_VERY_EXPENSIVE_TESTS=yes

* tests/misc/shuf-randomness.sh: run 'shuf' repeatedly, and check if the
output is uniformly distributed enough.
* tests/local.mk: add new test script.
---
 tests/local.mk|1 +
 tests/misc/shuf-randomness.sh |  186 +
 2 files changed, 187 insertions(+), 0 deletions(-)
 create mode 100755 tests/misc/shuf-randomness.sh

diff --git a/tests/local.mk b/tests/local.mk
index 607ddc4..d3923f8 100644
--- a/tests/local.mk
+++ b/tests/local.mk
@@ -313,6 +313,7 @@ all_tests =	\
   tests/misc/shred-passes.sh			\
   tests/misc/shred-remove.sh			\
   tests/misc/shuf.sh\
+  tests/misc/shuf-randomness.sh			\
   tests/misc/sort.pl\
   tests/misc/sort-benchmark-random.sh		\
   tests/misc/sort-compress.sh			\
diff --git a/tests/misc/shuf-randomness.sh b/tests/misc/shuf-randomness.sh
new file mode 100755
index 000..3e35cca
--- /dev/null
+++ b/tests/misc/shuf-randomness.sh
@@ -0,0 +1,186 @@
+#!/bin/sh
+# Test shuf for somewhat uniform randomness
+
+# Copyright (C) 2013 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see http://www.gnu.org/licenses/.
+
+. ${srcdir=.}/tests/init.sh; path_prepend_ ./src
+print_ver_ shuf
+getlimits_
+
+# Don't run these tests by default.
+very_expensive_
+
+# Number of trials
+T=1000
+
+# Number of categories
+N=100
+REQUIRED_CHI_SQUARED=200 # Be extremely lenient:
+                         # don't require great goodness of fit
+                         # even for our assumed 99 degrees of freedom
+
+# K - when testing reservoir-sampling, print K lines
+K=20
+REQUIRED_CHI_SQUARED_K=50

[PATCH] csplit: new option --suppress-matched

2013-03-07 Thread Assaf Gordon
Hello,

Attached is a new option for csplit, suppress-matched, as has been mentioned a 
few times before (e.g. 
http://lists.gnu.org/archive/html/coreutils/2013-02/msg00170.html ).

It works well for REGEXP patterns, but there's a bug with INTEGER patterns that 
I haven't been able to pinpoint yet (suggestions are welcomed).

Regards,
  -gordon
From 49f43214ebfa41fa1f67e7001d8467288ff34837 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Wed, 6 Mar 2013 15:53:16 -0500
Subject: [PATCH] csplit: new option, --suppress-matched

FIXME: Currently works only with REGEXP patterns.

With --suppress-matched, the lines that match the pattern will not be
printed in the output files.

* src/csplit.c: implement --suppress-matched.
process_regexp(),process_line_count(): skip the matched lines without
printing them. Since csplit always splits up to, but not including, the
matched lines, the first line (in the next group) is the matched line -
just skip it.
main(): handle new option.
usage(): mention new option.
* NEWS: mention new option.
* doc/coreutils.texi: mention new option, add examples.
* tests/misc/csplit-suppress-matched.sh: test new option.
* tests/local.mk: add new test script.
---
 NEWS  |3 +
 doc/coreutils.texi|   25 
 src/csplit.c  |   26 -
 tests/local.mk|1 +
 tests/misc/csplit-suppress-matched.sh |  233 +
 5 files changed, 287 insertions(+), 1 deletions(-)
 create mode 100755 tests/misc/csplit-suppress-matched.sh

diff --git a/NEWS b/NEWS
index 5b28c92..2385be7 100644
--- a/NEWS
+++ b/NEWS
@@ -18,6 +18,9 @@ GNU coreutils NEWS-*- outline -*-
   uniq accepts a new option: --group to print all items, while separating
   unique groups with empty lines.
 
+  csplit accepts a new option: --suppress-matched (-m). Lines matching
+  the specified patterns will not be printed.
+
 
 * Noteworthy changes in release 8.21 (2013-02-14) [stable]
 
diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index fe4c3ad..4f7da4c 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -3608,6 +3608,12 @@ long instead of the default 2.
 @opindex --keep-files
 Do not remove output files when errors are encountered.
 
+@item -m
+@itemx --suppress-matched
+@opindex -m
+@opindex --suppress-matched
+Do not output lines matching the specified @var{pattern}.
+
 @item -z
 @itemx --elide-empty-files
 @opindex -z
@@ -3684,6 +3690,25 @@ $ head xx*
 14
 @end example
 
+Example of splitting input by empty lines:
+
+@example
+$ csplit --suppress-matched @var{input.txt} '/^$/' '@{*@}'
+@end example
+
+@c
+@c TODO: uniq already supports --group.
+@c when it gets the --key option, uncomment this example.
+@c
+@c Example of splitting input file, based on the value of column 2:
+@c
+@c @example
+@c $ cat @var{input.txt} |
+@c   sort -k2,2 |
+@c   uniq --group -k2,2 |
+@c   csplit -m '/^$/' '@{*@}'
+@c @end example
+
 @node Summarizing files
 @chapter Summarizing files
 
diff --git a/src/csplit.c b/src/csplit.c
index 22f3ad4..664b567 100644
--- a/src/csplit.c
+++ b/src/csplit.c
@@ -166,6 +166,9 @@ static bool volatile remove_files;
 /* If true, remove all output files which have a zero length. */
 static bool elide_empty_files;
 
+/* If true, suppress the lines that match the PATTERN */
+static bool suppress_matched;
+
 /* The compiled pattern arguments, which determine how to split
the input file. */
 static struct control *controls;
@@ -185,6 +188,7 @@ static struct option const longopts[] =
  {"elide-empty-files", no_argument, NULL, 'z'},
  {"prefix", required_argument, NULL, 'f'},
  {"suffix-format", required_argument, NULL, 'b'},
+  {"suppress-matched", no_argument, NULL, 'm'},
   {GETOPT_HELP_OPTION_DECL},
   {GETOPT_VERSION_OPTION_DECL},
   {NULL, 0, NULL, 0}
@@ -721,6 +725,15 @@ process_line_count (const struct control *p, uintmax_t repetition)
 
   create_output_file ();
 
+#if 0
+  /* FIXME: this doesn't work when the last line is the matched line
+   * e.g.:
+   *   $ seq 1 6 | ./src/csplit -m - 2 4 6
+   */
+  if (suppress_matched)
+    line = remove_line ();
+#endif
+
   linenum = get_first_line_in_buffer ();
 
  while (linenum++ < last_line_to_save)
@@ -778,6 +791,9 @@ process_regexp (struct control *p, uintmax_t repetition)
   if (!ignore)
 create_output_file ();
 
+  if (suppress_matched && current_line > 0)
+    line = remove_line ();
+
   /* If there is no offset for the regular expression, or
  it is positive, then it is not necessary to buffer the lines. */
 
@@ -1324,9 +1340,10 @@ main (int argc, char **argv)
   control_used = 0;
   suppress_count = false;
   remove_files = true;
+  suppress_matched = false;
   prefix = DEFAULT_PREFIX;
 
-  while ((optc = getopt_long (argc, argv, "f:b:kn:sqz", longopts, NULL)) != -1)
+  while ((optc = getopt_long (argc, argv, "f:b:kmn:sqz", longopts, NULL)) != -1)
 switch (optc

[PATCH] tests: test sort,shuf with rngtest

2013-03-08 Thread Assaf Gordon
Hello,

Regarding comment:

Pádraig Brady wrote, On 03/07/2013 06:26 PM:
 On 03/07/2013 07:32 PM, Assaf Gordon wrote:
 The second attached patch is experimental - it tries to assess the
 randomness of 'shuf' output by running it 1,000 times and checking
 if the output is (very roughly) uniformly distributed. 
 
 Cool, I was considering testing with rngtest or something, so it'll
 be good to have something independent.
(  http://lists.gnu.org/archive/html/coreutils/2013-03/msg00030.html )

Using rngtest is probably much more reliable than the independent test - 
attached are tests for sort and shuf with rngtest.
They are marked 'expensive' as they require an external program and they run 
each test 10 times.

-gordon



From 15392de8f0ffa0746c9fd338ed14d15b614029a3 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Fri, 8 Mar 2013 15:54:24 -0500
Subject: [PATCH] tests: test sort,shuf with rngtest

rngtest checks the randomness of data using FIPS 140-2 tests.
http://sourceforge.net/projects/gkernel/

If rngtest is not installed (and available in the PATH),
the tests will be skipped.

These tests are marked 'expensive'. To run directly:

  $ make check TESTS=tests/misc/sort-rand-rngtest.sh \
   SUBDIRS=. RUN_EXPENSIVE_TESTS=yes
  $ make check TESTS=tests/misc/shuf-rand-rngtest.sh \
   SUBDIRS=. RUN_EXPENSIVE_TESTS=yes

* tests/misc/shuf-rand-rngtest.sh - test shuf with rngtest.
* tests/misc/sort-rand-rngtest.sh - test sort with rngtest.
* tests/local.mk - add above tests.
---
 tests/local.mk  |2 +
 tests/misc/shuf-rand-rngtest.sh |   78 +++
 tests/misc/sort-rand-rngtest.sh |   71 +++
 3 files changed, 151 insertions(+), 0 deletions(-)
 create mode 100755 tests/misc/shuf-rand-rngtest.sh
 create mode 100755 tests/misc/sort-rand-rngtest.sh

diff --git a/tests/local.mk b/tests/local.mk
index 607ddc4..21d347a 100644
--- a/tests/local.mk
+++ b/tests/local.mk
@@ -313,6 +313,7 @@ all_tests =	\
   tests/misc/shred-passes.sh			\
   tests/misc/shred-remove.sh			\
   tests/misc/shuf.sh\
+  tests/misc/shuf-rand-rngtest.sh		\
   tests/misc/sort.pl\
   tests/misc/sort-benchmark-random.sh		\
   tests/misc/sort-compress.sh			\
@@ -329,6 +330,7 @@ all_tests =	\
   tests/misc/sort-month.sh			\
   tests/misc/sort-exit-early.sh			\
   tests/misc/sort-rand.sh			\
+  tests/misc/sort-rand-rngtest.sh		\
   tests/misc/sort-spinlock-abuse.sh		\
   tests/misc/sort-stale-thread-mem.sh		\
   tests/misc/sort-unique.sh			\
diff --git a/tests/misc/shuf-rand-rngtest.sh b/tests/misc/shuf-rand-rngtest.sh
new file mode 100755
index 000..9ad2797
--- /dev/null
+++ b/tests/misc/shuf-rand-rngtest.sh
@@ -0,0 +1,78 @@
+#!/bin/sh
+# Test shuf's random output with rngtest
+#
+# NOTE:
+#  rngtest must be installed, or the test will be skipped.
+#  rngtest is available here: http://sourceforge.net/projects/gkernel/
+
+# Copyright (C) 2013 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see http://www.gnu.org/licenses/.
+
+. ${srcdir=.}/tests/init.sh; path_prepend_ ./src
+print_ver_ shuf
+expensive_
+
+if ! which rngtest > /dev/null ; then
+  skip_ "rngtest not found - skipping test."
+fi
+
+# Test for randomness several times.
+# On the rare occasion when the randomly sorted data doesn't pass rngtest,
+# it should be just one failure out of 10 rounds.
+# If more rounds fail in a single run - there's likely a real problem.
+ROUNDS=10
+
+( yes 1 | head -n 1 ; yes 0 | head -n 1 ) > in || framework_failure_
+
+# rngtest always reads the first 32 bits as bootstrap data
+printf "\x00\x00\x00\x00" > rngtest_header || framework_failure_
+
+
+# Sanity check:
+#  unsorted data should not be random
+cat in | tr -d '\n' | \
+   perl -npe '$_=pack("b*",$_)' > out_non_random || framework_failure_
+
+echo "Testing rngtest on non-random input:" 1>&2
+cat rngtest_header out_non_random | rngtest &&
+  { fail=1 ; echo "rngtest failed to detect non-random data." 1>&2 ; }
+
+#
+# Check randomness of shuf's output
+# (using the 'read-entire-file' code path)
+for i in $(seq $ROUNDS) ; do
+  cat in | shuf | tr -d '\n' | \
+   perl -npe '$_=pack("b*",$_)' > out_random$i || framework_failure_
+
+  echo "Testing rngtest on randomly-sorted input (round $i of $ROUNDS):" 1>&2
+  cat rngtest_header out_random$i | rngtest ||
+  { fail=1 ; echo shuf random

Re: [PATCH] shuf: use reservoir-sampling when possible

2013-03-11 Thread Assaf Gordon
Hello,

Pádraig Brady wrote, On 03/07/2013 06:26 PM:
 On 03/07/2013 07:32 PM, Assaf Gordon wrote:
 Pádraig Brady wrote, On 03/06/2013 08:24 PM:
 On 03/06/2013 11:50 PM, Assaf Gordon wrote:
 Attached is a suggestion to implement reservoir-sampling in shuf:
 When the expected output of lines is known, it will not load the entire 
 file into memory - allowing shuffling very large inputs.



 static size_t
 read_input_reservoir_sampling (FILE *in, char eolbyte, char ***pline, 
 size_t k,
struct randint_source *s)
 ...
   struct linebuffer *rsrv = XCALLOC (k, struct linebuffer); /* init 
 reservoir*/

 Since this change is mainly about efficient mem usage we should probably 
 handle
 the case where we have small inputs but large k.  This will allocate (and 
 zero)
 memory up front. The zeroing will defeat any memory overcommit configured 
 on the
 system, but it's probably better to avoid the large initial commit and 
 realloc
 as required (not per line, but per 1K lines maybe).



Attached is an updated version (mostly a re-write of the memory allocation 
part), as per the comment above.
Also includes a very_expensive valgrind test to exercise the new code.
(and the other patch is the uniform-distribution randomness test).

-gordon
From 0ff2403dde869af3f9a44dd7418aae3082d8c0aa Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Thu, 7 Mar 2013 01:57:57 -0500
Subject: [PATCH 1/2] shuf: add (expensive) test for randomness

To run manually:
  make check TESTS=tests/misc/shuf-randomness.sh \
 SUBDIRS=. RUN_VERY_EXPENSIVE_TESTS=yes

* tests/misc/shuf-randomness.sh: run 'shuf' repeatedly, and check if the
output is uniformly distributed enough.
* tests/local.mk: add new test script.
---
 tests/local.mk|1 +
 tests/misc/shuf-randomness.sh |  187 +
 2 files changed, 188 insertions(+), 0 deletions(-)
 create mode 100755 tests/misc/shuf-randomness.sh

diff --git a/tests/local.mk b/tests/local.mk
index 607ddc4..d3923f8 100644
--- a/tests/local.mk
+++ b/tests/local.mk
@@ -313,6 +313,7 @@ all_tests =	\
   tests/misc/shred-passes.sh			\
   tests/misc/shred-remove.sh			\
   tests/misc/shuf.sh\
+  tests/misc/shuf-randomness.sh			\
   tests/misc/sort.pl\
   tests/misc/sort-benchmark-random.sh		\
   tests/misc/sort-compress.sh			\
diff --git a/tests/misc/shuf-randomness.sh b/tests/misc/shuf-randomness.sh
new file mode 100755
index 000..c0b9e2e
--- /dev/null
+++ b/tests/misc/shuf-randomness.sh
@@ -0,0 +1,187 @@
+#!/bin/sh
+# Test shuf for somewhat uniform randomness
+
+# Copyright (C) 2013 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see http://www.gnu.org/licenses/.
+
+. ${srcdir=.}/tests/init.sh; path_prepend_ ./src
+print_ver_ shuf
+getlimits_
+
+# Don't run these tests by default.
+very_expensive_
+
+# Number of trials
+T=1000
+
+# Number of categories
+N=100
+REQUIRED_CHI_SQUARED=200 # Be extremely lenient:
+                         # don't require great goodness of fit
+                         # even for our assumed 99 degrees of freedom
+
+# K - when testing reservoir-sampling, print K lines
+K=20
+REQUIRED_CHI_SQUARED_K=50 # Be extremely lenient:
+                          # don't require great goodness of fit
+                          # even for our assumed 19 degrees of freedom
+
+
+
+# The input: many zeros followed by 1 one
+(yes 0 | head -n $((N-1)) ; echo 1 ) > in || framework_failure_
+
+
+is_uniform()
+{
+  # Input is assumed to be a string of $T space-separated values
+  # between 1 and $N
+  LINES="$1"
+
+  # Convert spaces to new-lines
+  LINES=$(echo "$LINES" | tr ' ' '\n' | sed '/^$/d') || framework_failure_
+
+  # Require exactly $T values
+  COUNT=$(echo "$LINES" | wc -l)
+  test "$COUNT" -eq "$T" || framework_failure_
+
+  # HIST is the histogram of counts per category
+  #  ( categories are between 1 and $N )
+  HIST=$(echo "$LINES" | sort -n | uniq -c)
+
+  #DEBUG
+  #echo "HIST=$HIST" 1>&2
+
+  ## Calculate Chi-Squared
+  CHI=$( echo "$HIST" |
+ awk -v n=$N -v t=$T '{ counts[$2] = $1 }
+  END {
+  exptd = ((1.0)*t)/n
+  chi = 0
+  for (i=1;i<=n;++i)
+  {
+if (i in counts

Re: [PATCH] shuf: use reservoir-sampling when possible

2013-03-25 Thread Assaf Gordon
Hello Pádraig,

Pádraig Brady wrote, On 03/24/2013 11:45 PM:
 On 03/06/2013 11:50 PM, Assaf Gordon wrote:
 Attached is a suggestion to implement reservoir-sampling in shuf:
 When the expected output of lines is known, it will not load the entire 
 file into memory - allowing shuffling very large inputs.
 
 I've attached 9 patches to adjust things a bit.
 

Looks great, thank you very much.

One minor improvement: the comment in the test file is wrong (in early stages 
of the patch I thought I could use a fixed random-source and pre-calculate the 
expected output).
Attached is a fix.

-gordon
From d01dd496c517e20ac92fcbbb6b34045303b1b514 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Mon, 25 Mar 2013 12:25:50 -0400
Subject: [PATCH] maint: adjust shuf reservoir sampling comments

* tests/misc/shuf-reservoir.sh: re-word comments.
---
 tests/misc/shuf-reservoir.sh |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/tests/misc/shuf-reservoir.sh b/tests/misc/shuf-reservoir.sh
index b695afc..6ba6e6e 100755
--- a/tests/misc/shuf-reservoir.sh
+++ b/tests/misc/shuf-reservoir.sh
@@ -26,7 +26,7 @@ require_valgrind_
 getlimits_
 
 # Run shuf with specific number of input lines and output lines
-# The output must match the expected (pre-calculated) output.
+# Check the output for expected number of lines.
 run_shuf_n()
 {
   INPUT_LINES=$1
-- 
1.7.7.4



Re: [PATCH] tests: test sort,shuf with rngtest

2013-03-26 Thread Assaf Gordon
Assaf Gordon wrote, On 03/08/2013 04:28 PM:
 Pádraig Brady wrote, On 03/07/2013 06:26 PM:

 Cool, I was considering testing with rngtest or something, so it'll
 be good to have something independent.
 (  http://lists.gnu.org/archive/html/coreutils/2013-03/msg00030.html )
 
 Using rngtest is probably much more reliable than the independent test - 
 attached are tests for sort and shuf with rngtest.
 They are marked 'expensive' as they require an external program and they run 
 each test 10 times.

Same patch, rebased with the latest shuf/reservoir-sampling,
and with require_rngtest_ added to init.cfg.

-gordon

From c4130abf2baf1f1484c9f72e0d2845b996d55210 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Fri, 8 Mar 2013 15:54:24 -0500
Subject: [PATCH] tests: test sort,shuf with rngtest

rngtest checks the randomness of data using FIPS 140-2 tests.
http://sourceforge.net/projects/gkernel/

If rngtest is not installed (and available in the PATH),
the tests will be skipped.

These tests are marked 'expensive'. To run directly:

  $ make check TESTS=tests/misc/sort-rand-rngtest.sh \
   SUBDIRS=. RUN_EXPENSIVE_TESTS=yes
  $ make check TESTS=tests/misc/shuf-rand-rngtest.sh \
   SUBDIRS=. RUN_EXPENSIVE_TESTS=yes

* tests/misc/shuf-rand-rngtest.sh - test shuf with rngtest.
* tests/misc/sort-rand-rngtest.sh - test sort with rngtest.
* tests/local.mk - add above tests.
* init.cfg - add 'require_rngtest_' function.
---
 init.cfg|7 
 tests/local.mk  |2 +
 tests/misc/shuf-rand-rngtest.sh |   75 +++
 tests/misc/sort-rand-rngtest.sh |   68 +++
 4 files changed, 152 insertions(+), 0 deletions(-)
 create mode 100755 tests/misc/shuf-rand-rngtest.sh
 create mode 100755 tests/misc/sort-rand-rngtest.sh

diff --git a/init.cfg b/init.cfg
index afee930..27d7627 100644
--- a/init.cfg
+++ b/init.cfg
@@ -169,6 +169,13 @@ require_valgrind_()
    skip_ 'requires a working valgrind'
 }
 
+# Skip the current test if rngtest doesn't work
+require_rngtest_()
+{
+  rngtest -V 2>/dev/null ||
+    skip_ 'requires a working rngtest'
+}
+
 require_setfacl_()
 {
   setfacl -m user::rwx . \
diff --git a/tests/local.mk b/tests/local.mk
index dc87ef4..a75cfa3 100644
--- a/tests/local.mk
+++ b/tests/local.mk
@@ -313,6 +313,7 @@ all_tests =	\
   tests/misc/shred-passes.sh			\
   tests/misc/shred-remove.sh			\
   tests/misc/shuf.sh\
+  tests/misc/shuf-rand-rngtest.sh		\
   tests/misc/shuf-reservoir.sh			\
   tests/misc/sort.pl\
   tests/misc/sort-benchmark-random.sh		\
@@ -330,6 +331,7 @@ all_tests =	\
   tests/misc/sort-month.sh			\
   tests/misc/sort-exit-early.sh			\
   tests/misc/sort-rand.sh			\
+  tests/misc/sort-rand-rngtest.sh		\
   tests/misc/sort-spinlock-abuse.sh		\
   tests/misc/sort-stale-thread-mem.sh		\
   tests/misc/sort-unique.sh			\
diff --git a/tests/misc/shuf-rand-rngtest.sh b/tests/misc/shuf-rand-rngtest.sh
new file mode 100755
index 000..934791f
--- /dev/null
+++ b/tests/misc/shuf-rand-rngtest.sh
@@ -0,0 +1,75 @@
+#!/bin/sh
+# Test shuf's random output with rngtest
+#
+# NOTE:
+#  rngtest must be installed, or the test will be skipped.
+#  rngtest is available here: http://sourceforge.net/projects/gkernel/
+
+# Copyright (C) 2013 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see http://www.gnu.org/licenses/.
+
+. ${srcdir=.}/tests/init.sh; path_prepend_ ./src
+print_ver_ shuf
+expensive_
+require_rngtest_
+
+# Test for randomness several times.
+# On the rare occasion when the randomly sorted data doesn't pass rngtest,
+# it should be just one failure out of 10 rounds.
+# If more rounds fail in a single run - there's likely a real problem.
+ROUNDS=10
+
+( yes 1 | head -n 1 ; yes 0 | head -n 1 ) > in || framework_failure_
+
+# rngtest always reads the first 32 bits as bootstrap data
+printf "\x00\x00\x00\x00" > rngtest_header || framework_failure_
+
+
+# Sanity check:
+#  unsorted data should not be random
+cat in | tr -d '\n' | \
+   perl -npe '$_=pack("b*",$_)' > out_non_random || framework_failure_
+
+echo "Testing rngtest on non-random input:" 1>&2
+cat rngtest_header out_non_random | rngtest &&
+  { fail=1 ; echo "rngtest failed to detect non-random data." 1>&2 ; }
+
+#
+# Check randomness of shuf's output
+# (using the 'read-entire-file' code path)
+for i in $(seq

Re: [PATCH] csplit: new option --suppress-matched

2013-03-28 Thread Assaf Gordon
Hello,


Assaf Gordon wrote, On 03/07/2013 05:39 PM:
 
 Attached is a new option for csplit, suppress-matched, as been mentioned few 
 times before (e.g. 
 http://lists.gnu.org/archive/html/coreutils/2013-02/msg00170.html ).
 

Attached updated version (works with both regexp and int patterns).
Also updated tests.

Comments are welcomed,
  -gordon

From eec5cf679824ed67c8b751ecb90565a22fc51719 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Wed, 6 Mar 2013 15:53:16 -0500
Subject: [PATCH] csplit: new option --suppress-matched

With --suppress-matched, the lines that match the pattern will not be
printed in the output files.

* src/csplit.c: implement --suppress-matched.
process_regexp(),process_line_count(): skip the matched lines without
printing them. Since csplit always splits up to, but not including, the
matched lines, the first line (in the next group) is the matched line -
just skip it.
main(): handle new option.
usage(): mention new option.
* NEWS: mention new option.
* doc/coreutils.texi: mention new option, add examples.
* tests/misc/csplit-suppress-matched.pl: test new option.
* tests/local.mk: add new test script.
---
 NEWS  |3 +
 doc/coreutils.texi|   25 
 src/csplit.c  |   29 -
 tests/local.mk|1 +
 tests/misc/csplit-suppress-matched.pl |  213 +
 5 files changed, 268 insertions(+), 3 deletions(-)
 create mode 100644 tests/misc/csplit-suppress-matched.pl

diff --git a/NEWS b/NEWS
index 0c2daad..896512d 100644
--- a/NEWS
+++ b/NEWS
@@ -18,6 +18,9 @@ GNU coreutils NEWS-*- outline -*-
   uniq accepts a new option: --group to print all items, while separating
   unique groups with empty lines.
 
+  csplit accepts a new option: --suppress-matched (-m). Lines matching
+  the specified patterns will not be printed.
+
 ** Improvements
 
   stat and tail work better with EFIVARFS, EXOFS, F2FS and UBIFS.
diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index dfa9b1c..7dfe724 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -3607,6 +3607,12 @@ long instead of the default 2.
 @opindex --keep-files
 Do not remove output files when errors are encountered.
 
+@item -m
+@itemx --suppress-matched
+@opindex -m
+@opindex --suppress-matched
+Do not output lines matching the specified @var{pattern}.
+
 @item -z
 @itemx --elide-empty-files
 @opindex -z
@@ -3683,6 +3689,25 @@ $ head xx*
 14
 @end example
 
+Example of splitting input by empty lines:
+
+@example
+$ csplit --suppress-matched @var{input.txt} '/^$/' '@{*@}'
+@end example
+
+@c
+@c TODO: uniq already supports --group.
+@c when it gets the --key option, uncomment this example.
+@c
+@c Example of splitting input file, based on the value of column 2:
+@c
+@c @example
+@c $ cat @var{input.txt} |
+@c   sort -k2,2 |
+@c   uniq --group -k2,2 |
+@c   csplit -m '/^$/' '@{*@}'
+@c @end example
+
 @node Summarizing files
 @chapter Summarizing files
 
diff --git a/src/csplit.c b/src/csplit.c
index 22f3ad4..4ae2de2 100644
--- a/src/csplit.c
+++ b/src/csplit.c
@@ -166,6 +166,9 @@ static bool volatile remove_files;
 /* If true, remove all output files which have a zero length. */
 static bool elide_empty_files;
 
+/* If true, suppress the lines that match the PATTERN */
+static bool suppress_matched;
+
 /* The compiled pattern arguments, which determine how to split
the input file. */
 static struct control *controls;
@@ -185,6 +188,7 @@ static struct option const longopts[] =
  {"elide-empty-files", no_argument, NULL, 'z'},
  {"prefix", required_argument, NULL, 'f'},
  {"suffix-format", required_argument, NULL, 'b'},
+  {"suppress-matched", no_argument, NULL, 'm'},
   {GETOPT_HELP_OPTION_DECL},
   {GETOPT_VERSION_OPTION_DECL},
   {NULL, 0, NULL, 0}
@@ -721,8 +725,13 @@ process_line_count (const struct control *p, uintmax_t repetition)
 
   create_output_file ();
 
-  linenum = get_first_line_in_buffer ();
+  /* Ensure that the line number specified is not 1 greater than
+     the number of lines in the file.
+     When suppressing matched lines, check before the loop.  */
+  if (no_more_lines () && suppress_matched)
+    handle_line_error (p, repetition);
 
+  linenum = get_first_line_in_buffer ();
  while (linenum++ < last_line_to_save)
 {
   line = remove_line ();
@@ -733,9 +742,12 @@ process_line_count (const struct control *p, uintmax_t repetition)
 
   close_output_file ();
 
+  if (suppress_matched)
+    line = remove_line ();
+
   /* Ensure that the line number specified is not 1 greater than
  the number of lines in the file. */
-  if (no_more_lines ())
+  if (no_more_lines () && !suppress_matched)
 handle_line_error (p, repetition);
 }
 
@@ -778,6 +790,9 @@ process_regexp (struct control *p, uintmax_t repetition)
   if (!ignore)
 create_output_file ();
 
+  if (suppress_matched && current_line > 0)
+line = remove_line

Re: [PATCH] csplit: new option --suppress-matched

2013-03-30 Thread Assaf Gordon

On 03/30/13 01:08, Pádraig Brady wrote:

On 03/28/2013 10:10 PM, Assaf Gordon wrote:

Attached is a new option for csplit, "suppress-matched", as has been mentioned a few 
times before (e.g. 
http://lists.gnu.org/archive/html/coreutils/2013-02/msg00170.html ).


The awkward case here is with integer boundaries and offsets.


...


# Adding in the offset, we currently consider the
# offset line as the one to suppress, rather than the matched pattern.


This was exactly my original understanding of "matched" - not just the line that 
matched the regular expression,
but the line that matched the specified pattern (i.e. regexp+offset or integer 
pattern) - and that's the line suppressed.


This could be confusing, but at least it's consistent.
So more accurately what we're doing is suppressing the boundary line.

So less confusingly and more accurately,
this option should probably be named/described as:

--suppress-boundary
   Suppress the boundary line from the start of the second and subsequent 
splits.


I'm fine with whichever name you decide. I find matched more natural, and not 
so confusing, but boundary is just as good.
I do think the description is a bit cumbersome (the "from the start of the second 
and subsequent splits" part) - it seems more confusing to me than just omitting 
it.
It's probably one of those cases that a single example of input+output is worth 
more than a whole paragraph of explanation...


Nice work on the tests BTW.


Thanks.
I found "CMP" by accident, after almost writing an equivalent mechanism 
from scratch.
It's not mentioned in tests/Coreutils.pm, perhaps I'll send a small patch for 
that.


I hope to apply this with the adjusted naming over the weekend.



Thanks again.




[PATCH] tests: document CMP/PRE/POST in unit test module

2013-04-01 Thread Assaf Gordon
Hello,

Attached is a small patch to document CMP/PRE/POST in tests/Coreutils.pm.
No code changes.

-gordon
From 229c94ebc0c4955a418f6e7348488d9ca28dc593 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Mon, 1 Apr 2013 17:44:27 -0400
Subject: [PATCH] tests: document CMP/PRE/POST in unit test module

* tests/Coreutils.pm: document CMP/PRE/POST keys.
---
 tests/Coreutils.pm |8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/tests/Coreutils.pm b/tests/Coreutils.pm
index 71b1516..fd4408a 100644
--- a/tests/Coreutils.pm
+++ b/tests/Coreutils.pm
@@ -54,7 +54,7 @@ defined $ENV{DJDIR}
 # I/O spec: a hash ref with the following properties
 # 
 # - one key/value pair
-# - the key must be one of these strings: IN, OUT, ERR, AUX, CMP, EXIT
+# - the key must be one of these strings: IN, OUT, ERR, AUX, CMP, EXIT, PRE,POST
 # - the value must be a file spec
 # {OUT => 'data'}put data in a temp file and compare it to stdout from cmd
 # {OUT => {'filename'=>undef}} compare contents of existing filename to
@@ -82,6 +82,12 @@ defined $ENV{DJDIR}
 # {ENV_DEL => 'VAR'}
 #   Remove VAR from the environment just before running the corresponding
 #   command, and restore any value just afterwards.
+# {CMP => [ 'data',{'filename'=>undef}}Compare the content of 'filename'
+#   to 'data' (a string scalar). The program under test is expected to create
+#   file 'filename'.
+# {PRE => sub{} }   Execute sub() before running the test.
+# {POST => sub{} }  Execute sub() after running the test.
+#   If the PRE/POST sub calls die, the test will be marked as failed.
 #
 # There may be many input file specs.  File names from the input specs
 # are concatenated in order on the command line.
-- 
1.7.7.4



Re: [PATCH] tests: document CMP/PRE/POST in unit test module

2013-04-02 Thread Assaf Gordon
Thanks for the quick reply.
Here's a better patch.

Bernhard Voelker wrote, On 04/02/2013 04:03 AM:

 s/PRE,POST/PRE, POST
 due to the line length it may be worth adding a line break.

Done. Also added IN_PIPE .

 Close square brackets, and move blank character to after the comma:

Done.

 2 notes:
 * According to the code, instead of a plain string, 'data' can also be a HASH.
 * If the file name is '@AUX@', then it is replaced.

I do not fully understand those uses, so I can't really explain them.
When are these useful?

 Furthermore, IN, AUX, and EXIT also do not seem to be documented yet.
 Do you like to document these, too?
 

I've added IN and IN_PIPE.
EXIT was already mentioned.
AUX - I do not know what it does...

-gordon



From 309fd6398558b6e85ae2b2fa1cee6b5e2f492dde Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Mon, 1 Apr 2013 17:44:27 -0400
Subject: [PATCH] tests: document more test keys in unit test module

* tests/Coreutils.pm: document IN/IN_PIPE/CMP/PRE/POST keys.
---
 tests/Coreutils.pm |   12 +++-
 1 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/tests/Coreutils.pm b/tests/Coreutils.pm
index 71b1516..661fce4 100644
--- a/tests/Coreutils.pm
+++ b/tests/Coreutils.pm
@@ -54,8 +54,12 @@ defined $ENV{DJDIR}
 # I/O spec: a hash ref with the following properties
 # 
 # - one key/value pair
-# - the key must be one of these strings: IN, OUT, ERR, AUX, CMP, EXIT
+# - the key must be one of these strings: IN, IN_PIPE, OUT, ERR, AUX, CMP,
+# EXIT, PRE, POST
 # - the value must be a file spec
+# {IN  => 'data'}Create file containing 'data'. The filename will be
+#appended as the last parameter on the command-line.
+# {IN_PIPE => 'data'} Send 'data' as input from stdin.
 # {OUT => 'data'}put data in a temp file and compare it to stdout from cmd
 # {OUT => {'filename'=>undef}} compare contents of existing filename to
 #   stdout from cmd
@@ -82,6 +86,12 @@ defined $ENV{DJDIR}
 # {ENV_DEL => 'VAR'}
 #   Remove VAR from the environment just before running the corresponding
 #   command, and restore any value just afterwards.
+# {CMP => ['data', {'filename'=>undef}]}Compare the content of 'filename'
+#   to 'data' (a string scalar). The program under test is expected to create
+#   file 'filename'.
+# {PRE => sub{}}   Execute sub() before running the test.
+# {POST => sub{}}  Execute sub() after running the test.
+#   If the PRE/POST sub calls die, the test will be marked as failed.
 #
 # There may be many input file specs.  File names from the input specs
 # are concatenated in order on the command line.
-- 
1.7.7.4



Re: Move Command Feature

2013-04-05 Thread Assaf Gordon
Hello Michael,

Michael Boldischar wrote, On 04/05/2013 01:56 PM:
 
 My first attempt with rsync resulted in the same problem I have when there 
 are errors using the mv command:
 $ mkdir a b
 $ touch a/1.txt a/2.txt
 $ chmod 000 a/2.txt
$ rsync -r --remove-source-files a/ b/
 rsync: send_files failed to open /tmp/test/a/2.txt: Permission denied (13)
 rsync error: some files/attrs were not transferred (see previous errors) 
 (code 23) at main.c(1070) [sender=3.0.8]
 $ ls b
 1.txt
 $ ls a
 2.txt
 
 The a directory was partially moved.  This is no big deal with a small set 
 of files, but a large set becomes a headache.

There's one advantage to rsync - it can continue copying files from where it 
left off.
That is - if something went wrong and it stopped, you can easily resume with 
exactly the same command line.

Example:
# Your scenario
$ mkdir a b
$ touch a/1.txt a/2.txt
$ chmod 000 a/2.txt
$ rsync -r --remove-source-files a/ b/
rsync: send_files failed to open /tmp/test/a/2.txt: Permission denied (13)
rsync error: some files/attrs were not transferred (see previous errors) 
(code 23) at main.c(1070) [sender=3.0.8]

# Rsync stopped, some files are moved to b, some are still in a.

# now, fix the problem, and re-run rsync
$ chmod 444 a/2.txt
$ rsync -r --remove-source-files a/ b/

# the result: all files moved from a to b.
$ ls a
$ ls b
1.txt 2.txt

Running rsync can be done repeatedly, until all files have been moved.
Large files or small files, many files or few files - rsync will handle them 
all just fine.


But regarding your question:
 
 On 04/05/2013 11:23 AM, Michael Boldischar wrote:
  Hello,
 
  This is a suggestion for a new feature in the mv command.  This 
 feature
  applies to moving directories.  If a user moves a directory with a lot 
 of
  files and encounters an error, it can often leave the source directory 
 in a
  partially moved state.  It makes it hard to redo the operation because 
 the
  source directory has changed.
 

There is a subtle difference between keeping the source directory intact until 
the move is complete and being able to resume/redo the move.

If you just want to be able to resume an interrupted move, rsync can do it.
You'll have to accept that until rsync complete successfully, some files are 
moved and some aren't (what you called partial state).
But when rsync is complete (perhaps after running it multiple times) - the 
move is complete and there's no partial state.

If you insist of keeping a full copy of the source directory until the entire 
move is complete, then something like:
  rsync -r a/ b/ && rm -r a/
would do the trick - a/ will not be modified until the copy is completed.

If you're moving files on the same filesystem and can use hardlinks to avoid 
unnecessary copies, then rsync has flags for that as well.

Hope this helps,
 -gordon








Re: [PATCH] csplit: new option --suppress-matched

2013-04-10 Thread Assaf Gordon
Hello,

Pádraig Brady wrote, On 04/10/2013 07:49 AM:
 On 03/28/2013 10:10 PM, Assaf Gordon wrote:
 Attached is a new option for csplit, "suppress-matched", as has been mentioned 
 a few times before (e.g. 
 http://lists.gnu.org/archive/html/coreutils/2013-02/msg00170.html ).

...

 Note I've removed the -m short option since we try to avoid them for new 
 stuff.
 Also it gives us the flexibility in future to add a param to 
 --suppress-matched
 to suppress X lines before/around/after the matched line, which could also be 
 useful.

Ok. good idea.

 
 Note I needed to fix array references in the perl test as follows:
 -push $new_ent, $cmp;
 +push @$new_ent, $cmp;
 

Sorry about that.
Seems like Perl 5.14 and later (which I use on my dev machine) allow unblessed 
references to functions that take arrays/hashes
( http://perldoc.perl.org/5.14.0/perldelta.html#Syntactical-Enhancements ).

I'll have to remember to avoid such backwards-incompatible syntax.

 
 Will push in a while...

Thanks!

-gordon




Re: sort/uniq/join: key-comparison code consolidation

2013-04-17 Thread Assaf Gordon
Assaf Gordon wrote, On 04/10/2013 01:49 PM:
 ( new thread for previous topic 
 http://lists.gnu.org/archive/html/coreutils/2013-02/msg00082.html ) .

Another update, rebased against the latest version.
 
comments are welcomed,
  -gordon
 



key-comapre.2013-04-17.patch.xz
Description: application/xz


Re: sort/uniq/join: key-comparison code consolidation

2013-07-04 Thread Assaf Gordon

Regarding previously discussed topic:

http://lists.gnu.org/archive/html/coreutils/2013-02/msg00082.html


Attached is another update, rebased against the latest version.

comments are welcomed,
  -gordon



key-comapre.2013-07-02.patch.xz
Description: application/xz


Generate random numbers with shuf

2013-07-04 Thread Assaf Gordon

Hello,

Regarding old discussion here:
http://lists.gnu.org/archive/html/coreutils/2011-02/msg00030.html

Attached is a patch which adds a --repetition option to shuf, enabling random 
number generation with repetitions.

Example:

to generate 50 values between 0 and 9:
  $ shuf --rep -i0-9 -n50
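
(For reference: the option eventually shipped in coreutils is spelled
-r/--repeat rather than --repetition, so with a released shuf the example
above reads:)

```shell
# 50 random values between 0 and 9, drawn independently (with repetition);
# released coreutils spell the option -r/--repeat.
shuf -r -n 50 -i 0-9
```

Each value is drawn uniformly from the range, so duplicates are expected.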

Comments are welcomed,
 -gordon

From 12ca3d6d5b8591e7bd424ff264b9f26cc2f31b90 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Thu, 4 Jul 2013 14:40:15 -0600
Subject: [PATCH 0/4] *** SUBJECT HERE ***

*** BLURB HERE ***

Assaf Gordon (4):
  shuf: add --repetition to generate random numbers
  shuf: add tests for --repetition option
  shuf: mention new --repetition option in NEWS
  shuf: document new --repetition option

 NEWS   |  3 +++
 doc/coreutils.texi | 23 +++
 src/shuf.c | 50 ++
 tests/misc/shuf.sh | 29 +
 4 files changed, 101 insertions(+), 4 deletions(-)

-- 
1.8.3.2

From 2c09d46ebeee61e2e46633dc8b9158edba1eaa8b Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Thu, 4 Jul 2013 13:26:45 -0600
Subject: [PATCH 1/4] shuf: add --repetition to generate random numbers

* src/shuf.c: new option (-r,--repetition), generate random numbers.
main(): process new option.
usage(): mention new option.
write_random_numbers(): generate random numbers.
---
 src/shuf.c | 50 ++
 1 file changed, 46 insertions(+), 4 deletions(-)

diff --git a/src/shuf.c b/src/shuf.c
index 0fabb0b..cdc3151 100644
--- a/src/shuf.c
+++ b/src/shuf.c
@@ -76,6 +76,9 @@ Write a random permutation of the input lines to standard output.\n\
   -n, --head-count=COUNToutput at most COUNT lines\n\
   -o, --output=FILE write result to FILE instead of standard output\n\
   --random-source=FILE  get random bytes from FILE\n\
+  -r, --repetition  used with -iLO-HI, output COUNT random numbers\n\
+between LO and HI, with repetitions.\n\
+count defaults to 1 if -n COUNT is not used.\n\
   -z, --zero-terminated end lines with 0 byte, not newline\n\
 ), stdout);
   fputs (HELP_OPTION_DESCRIPTION, stdout);
@@ -104,6 +107,7 @@ static struct option const long_opts[] =
   {"head-count", required_argument, NULL, 'n'},
   {"output", required_argument, NULL, 'o'},
   {"random-source", required_argument, NULL, RANDOM_SOURCE_OPTION},
+  {"repetition", no_argument, NULL, 'r'},
   {"zero-terminated", no_argument, NULL, 'z'},
   {GETOPT_HELP_OPTION_DECL},
   {GETOPT_VERSION_OPTION_DECL},
@@ -328,6 +332,23 @@ write_permuted_output (size_t n_lines, char *const *line, size_t lo_input,
   return 0;
 }
 
+static int
+write_random_numbers (struct randint_source *s, size_t count,
+  size_t lo_input, size_t hi_input, char eolbyte)
+{
+  size_t i;
+  const randint range = hi_input - lo_input + 1;
+
+  for (i = 0; i < count; i++)
+{
+  randint j = lo_input + randint_choose (s, range);
+  if (printf ("%lu%c", j, eolbyte) < 0)
+return -1;
+}
+
+  return 0;
+}
+
 int
 main (int argc, char **argv)
 {
@@ -340,6 +361,7 @@ main (int argc, char **argv)
   char eolbyte = '\n';
   char **input_lines = NULL;
   bool use_reservoir_sampling = false;
+  bool repetition = false;
 
   int optc;
   int n_operands;
@@ -348,7 +370,7 @@ main (int argc, char **argv)
   char **line = NULL;
   struct linebuffer *reservoir = NULL;
   struct randint_source *randint_source;
-  size_t *permutation;
+  size_t *permutation = NULL;
   int i;
 
   initialize_main (argc, argv);
@@ -424,6 +446,10 @@ main (int argc, char **argv)
 random_source = optarg;
 break;
 
+  case 'r':
+repetition = true;
+break;
+
   case 'z':
 eolbyte = '\0';
 break;
@@ -454,9 +480,19 @@ main (int argc, char **argv)
 }
   n_lines = hi_input - lo_input + 1;
   line = NULL;
+
+  /* When generating random numbers with repetitions,
+ the default count is one, unless specified by the user */
+  if (repetition && head_lines == SIZE_MAX)
+head_lines = 1;
 }
   else
 {
+  if (repetition)
+{
+  error (0, 0, _("--repetition requires --input-range"));
+  usage (EXIT_FAILURE);
+}
   switch (n_operands)
 {
 case 0:
@@ -488,10 +524,12 @@ main (int argc, char **argv)
 }
 }
 
-  head_lines = MIN (head_lines, n_lines);
+  if (!repetition)
+head_lines = MIN (head_lines, n_lines);
 
   randint_source = randint_all_new (random_source,
-use_reservoir_sampling ? SIZE_MAX :
+(use_reservoir_sampling || repetition) ?
+SIZE_MAX :
 randperm_bound (head_lines, n_lines));
   if (! randint_source)
 error (EXIT_FAILURE, errno, %s

Re: Generate random numbers with shuf

2013-07-05 Thread Assaf Gordon

Hello,

On 07/04/2013 05:40 PM, Pádraig Brady wrote:

On 07/04/2013 09:41 PM, Assaf Gordon wrote:


Regarding old discussion here:
http://lists.gnu.org/archive/html/coreutils/2011-02/msg00030.html

Attached is a patch with adds --repetition option to shuf, enabling random 
number generation with repetitions.



I like this.
--repetition seems to be a very good interface too,
since it aligns with standard math nomenclature in regard to permutations.

I'd prefer to generalize it though, to supporting stdin as well as -i.


Attached is an updated patch, supporting --repetitions with STDIN/FILE/-e 
(using the naive implementation ATM).
e.g.
  $ shuf --repetitions --head-count=100 --echo Head Tail
or
  $ shuf -r -n100 -e Head Tail


But the code is getting a bit messy, I guess from evolving features over time.
I'd like to re-organize it a bit, re-factor some functions and make the code 
clearer - what do you think?
it will make the code slightly more verbose (and slightly bigger), but 
shouldn't change the running performance.

-gordon



From 9e14bf963eb27faed847a979677fb5f344c27362 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Fri, 5 Jul 2013 11:58:16 -0600
Subject: [PATCH 0/7] *** SUBJECT HERE ***

*** BLURB HERE ***

Assaf Gordon (7):
  shuf: add --repetition to generate random numbers
  shuf: add tests for --repetition option
  shuf: mention new --repetition option in NEWS
  shuf: document new --repetition option
  shuf: enable --repetition on stdin/FILE/-e input
  shuf: add tests for --repetition with STDIN
  shuf: document new --repetitions option

 NEWS   |  3 +++
 doc/coreutils.texi | 37 ++
 src/shuf.c | 66 --
 tests/misc/shuf.sh | 63 +++
 4 files changed, 162 insertions(+), 7 deletions(-)

-- 
1.8.3.2

From c41160016ed36fe5b4e2b3d03cde34e0dcec84b6 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Thu, 4 Jul 2013 13:26:45 -0600
Subject: [PATCH 1/7] shuf: add --repetition to generate random numbers

* src/shuf.c: new option (-r,--repetition), generate random numbers.
main(): process new option.
usage(): mention new option.
write_random_numbers(): generate random numbers.
---
 src/shuf.c | 50 ++
 1 file changed, 46 insertions(+), 4 deletions(-)

diff --git a/src/shuf.c b/src/shuf.c
index 0fabb0b..cdc3151 100644
--- a/src/shuf.c
+++ b/src/shuf.c
@@ -76,6 +76,9 @@ Write a random permutation of the input lines to standard output.\n\
   -n, --head-count=COUNToutput at most COUNT lines\n\
   -o, --output=FILE write result to FILE instead of standard output\n\
   --random-source=FILE  get random bytes from FILE\n\
+  -r, --repetition  used with -iLO-HI, output COUNT random numbers\n\
+between LO and HI, with repetitions.\n\
+count defaults to 1 if -n COUNT is not used.\n\
   -z, --zero-terminated end lines with 0 byte, not newline\n\
 ), stdout);
   fputs (HELP_OPTION_DESCRIPTION, stdout);
@@ -104,6 +107,7 @@ static struct option const long_opts[] =
   {"head-count", required_argument, NULL, 'n'},
   {"output", required_argument, NULL, 'o'},
   {"random-source", required_argument, NULL, RANDOM_SOURCE_OPTION},
+  {"repetition", no_argument, NULL, 'r'},
   {"zero-terminated", no_argument, NULL, 'z'},
   {GETOPT_HELP_OPTION_DECL},
   {GETOPT_VERSION_OPTION_DECL},
@@ -328,6 +332,23 @@ write_permuted_output (size_t n_lines, char *const *line, size_t lo_input,
   return 0;
 }
 
+static int
+write_random_numbers (struct randint_source *s, size_t count,
+  size_t lo_input, size_t hi_input, char eolbyte)
+{
+  size_t i;
+  const randint range = hi_input - lo_input + 1;
+
+  for (i = 0; i < count; i++)
+{
+  randint j = lo_input + randint_choose (s, range);
+  if (printf ("%lu%c", j, eolbyte) < 0)
+return -1;
+}
+
+  return 0;
+}
+
 int
 main (int argc, char **argv)
 {
@@ -340,6 +361,7 @@ main (int argc, char **argv)
   char eolbyte = '\n';
   char **input_lines = NULL;
   bool use_reservoir_sampling = false;
+  bool repetition = false;
 
   int optc;
   int n_operands;
@@ -348,7 +370,7 @@ main (int argc, char **argv)
   char **line = NULL;
   struct linebuffer *reservoir = NULL;
   struct randint_source *randint_source;
-  size_t *permutation;
+  size_t *permutation = NULL;
   int i;
 
   initialize_main (argc, argv);
@@ -424,6 +446,10 @@ main (int argc, char **argv)
 random_source = optarg;
 break;
 
+  case 'r':
+repetition = true;
+break;
+
   case 'z':
 eolbyte = '\0';
 break;
@@ -454,9 +480,19 @@ main (int argc, char **argv)
 }
   n_lines = hi_input - lo_input + 1;
   line = NULL;
+
+  /* When generating random numbers with repetitions,
+ the default count is one, unless

Re: Generate random numbers with shuf

2013-07-05 Thread Assaf Gordon


On 07/05/2013 12:12 PM, Pádraig Brady wrote:

On 07/05/2013 07:04 PM, Assaf Gordon wrote:

Hello,



Regarding old discussion here:
http://lists.gnu.org/archive/html/coreutils/2011-02/msg00030.html

Attached is a patch with adds --repetition option to shuf, enabling random 
number generation with repetitions.



I like this.
--repetition seems to be a very good interface too,
since it aligns with standard math nomenclature in regard to permutations.

I'd prefer to generalize it though, to supporting stdin as well as -i.


Attached is an updated patch, supporting --repetitions with STDIN/FILE/-e 
(using the naive implementation ATM).
e.g.
   $ shuf --repetitions --head-count=100 --echo Head Tail
or
   $ shuf -r -n100 -e Head Tail


Excellent thanks.


But the code is getting a bit messy, I guess from evolving features over time.
I'd like to re-organize it a bit, re-factor some functions and make the code 
clearer - what do you think?
it will make the code slightly more verbose (and slightly bigger), but 
shouldn't change the running performance.


If you're getting your head around the code enough to refactor,
then it would be great if you could handle the TODO: item in shuf.c


Attached is an updated patch, with some code cleanups (not including said TODO 
item yet).

-gordon


From 5ba2828e72f6d276fc349f69824cd6cb626053a4 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Fri, 5 Jul 2013 15:41:17 -0600
Subject: [PATCH 00/14] *** SUBJECT HERE ***

*** BLURB HERE ***

Assaf Gordon (14):
  shuf: add --repetition to generate random numbers
  shuf: add tests for --repetition option
  shuf: mention new --repetition option in NEWS
  shuf: document new --repetition option
  shuf: enable --repetition on stdin/FILE/-e input
  shuf: add tests for --repetition with STDIN
  shuf: document new --repetitions option
  shuf: code-cleanup
  shuf: add more tests
  shuf: refactor --repetition with stdin
  shuf: refactor write_permuted_output()
  shuf: code cleanup
  shuf: code clean-up
  shuf: add tests for more erroneous usage

 NEWS   |   3 +
 doc/coreutils.texi |  37 +++
 src/shuf.c | 192 +
 tests/misc/shuf.sh |  92 +
 4 files changed, 268 insertions(+), 56 deletions(-)

-- 
1.8.3.2

From c41160016ed36fe5b4e2b3d03cde34e0dcec84b6 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Thu, 4 Jul 2013 13:26:45 -0600
Subject: [PATCH 01/14] shuf: add --repetition to generate random numbers

* src/shuf.c: new option (-r,--repetition), generate random numbers.
main(): process new option.
usage(): mention new option.
write_random_numbers(): generate random numbers.
---
 src/shuf.c | 50 ++
 1 file changed, 46 insertions(+), 4 deletions(-)

diff --git a/src/shuf.c b/src/shuf.c
index 0fabb0b..cdc3151 100644
--- a/src/shuf.c
+++ b/src/shuf.c
@@ -76,6 +76,9 @@ Write a random permutation of the input lines to standard output.\n\
   -n, --head-count=COUNToutput at most COUNT lines\n\
   -o, --output=FILE write result to FILE instead of standard output\n\
   --random-source=FILE  get random bytes from FILE\n\
+  -r, --repetition  used with -iLO-HI, output COUNT random numbers\n\
+between LO and HI, with repetitions.\n\
+count defaults to 1 if -n COUNT is not used.\n\
   -z, --zero-terminated end lines with 0 byte, not newline\n\
 ), stdout);
   fputs (HELP_OPTION_DESCRIPTION, stdout);
@@ -104,6 +107,7 @@ static struct option const long_opts[] =
   {"head-count", required_argument, NULL, 'n'},
   {"output", required_argument, NULL, 'o'},
   {"random-source", required_argument, NULL, RANDOM_SOURCE_OPTION},
+  {"repetition", no_argument, NULL, 'r'},
   {"zero-terminated", no_argument, NULL, 'z'},
   {GETOPT_HELP_OPTION_DECL},
   {GETOPT_VERSION_OPTION_DECL},
@@ -328,6 +332,23 @@ write_permuted_output (size_t n_lines, char *const *line, size_t lo_input,
   return 0;
 }
 
+static int
+write_random_numbers (struct randint_source *s, size_t count,
+  size_t lo_input, size_t hi_input, char eolbyte)
+{
+  size_t i;
+  const randint range = hi_input - lo_input + 1;
+
+  for (i = 0; i < count; i++)
+{
+  randint j = lo_input + randint_choose (s, range);
+  if (printf ("%lu%c", j, eolbyte) < 0)
+return -1;
+}
+
+  return 0;
+}
+
 int
 main (int argc, char **argv)
 {
@@ -340,6 +361,7 @@ main (int argc, char **argv)
   char eolbyte = '\n';
   char **input_lines = NULL;
   bool use_reservoir_sampling = false;
+  bool repetition = false;
 
   int optc;
   int n_operands;
@@ -348,7 +370,7 @@ main (int argc, char **argv)
   char **line = NULL;
   struct linebuffer *reservoir = NULL;
   struct randint_source *randint_source;
-  size_t *permutation;
+  size_t *permutation = NULL;
   int i;
 
   initialize_main (argc, argv);
@@ -424,6 +446,10 @@ main

Re: Generate random numbers with shuf

2013-07-10 Thread Assaf Gordon

On 07/10/2013 09:20 AM, Pádraig Brady wrote:


I've split to two patches.
1. Unrelated test improvements.
2. All the rest


...
 

Note in both patches I made adjustments to the tests [...]

...

I.E. avoid cat unless needed, and paste is more general than fmt in this usage.

...

Also I simplified the --help a little [...]


Indeed, looks more concise and much better. I keep on learning...



I'll push the 2 attached patches soon.



Thanks!
 -gordon





Re: bug#15077: Clarification

2013-08-12 Thread Assaf Gordon

(CC'ing the list so that others could comment)

Hello Federico,

On 08/12/2013 06:50 PM, CDR wrote:

How do I get the latest version, even a beta, of join, sort, etc.?


I would not recommend using beta or development versions of GNU coreutils 
for production code, just to be on the safe side.
The stable releases are available as source code here:
 http://ftp.gnu.org/gnu/coreutils/
With more details here:
 http://www.gnu.org/software/coreutils/


One thing that I suggest is to change sort, comm and join to use more
than one core. I had to use a commercial version of sort because the
regular version takes forever to sort a 15G file. The commercial
version is called nsort and it uses all the cores in the machine, and
also you may add a flag to give the program a huge memory block. It
works like ten times faster than the regular sort.


Starting with version 8.6, sort can use multiple cores to improve sorting speed (see 
the --parallel option).
Sort also supports the --buffer-size parameter to explicitly specify how much 
memory to use.
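For example (the file names and sizes here are illustrative, not from the
original message):

```shell
# Sort a large file with 8 threads and an 8 GiB in-memory buffer.
# LC_ALL=C selects plain byte comparison, which is typically much
# faster than locale-aware collation.
LC_ALL=C sort --parallel=8 --buffer-size=8G big.txt > big.sorted.txt
```

The right --parallel and --buffer-size values depend on the machine; sort
picks reasonable defaults if they are omitted.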

I'm not familiar with nsort and cannot comment on nsort vs GNU sort's speed,
but I believe that on modern hardware, sorting 15G should take a few minutes at most, not 
forever - though that depends on many factors (e.g. cores, memory, disk, etc.).

join operates on sorted input, and as such, requires very little CPU and 
memory.
I do not think much can be gained from making join multi-threaded.
I believe the same applies to comm.


I am using comm a lot for a business problem that involves comparing
daily files that have 550 MM records. I find it extremely slow. Do
you have any suggestions?



Others could perhaps comment on ways to improve performance when using GNU 
coreutils.

I'd assume it very much depends on the technical details of what you're comparing - 
perhaps there are ways to improve the workflow.
The first step is usually to isolate the real bottleneck (e.g. CPU, memory, disk 
speed, algorithm, etc.)
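
As one concrete sketch for the daily-comparison case (file names are
hypothetical): sort both files once in the C locale, then let comm emit the
differences.

```shell
# Byte-order sorting keeps both sort's and comm's comparisons cheap.
export LC_ALL=C
sort yesterday.txt > yesterday.sorted
sort today.txt > today.sorted
comm -13 yesterday.sorted today.sorted > added.txt    # lines only in today
comm -23 yesterday.sorted today.sorted > removed.txt  # lines only in yesterday
```

comm itself is a single sequential pass over both files, so once the inputs
are sorted, the runtime is usually dominated by disk throughput rather than CPU.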


regards,
 -gordon




Re: Shuf reservoir sampling

2014-01-23 Thread Assaf Gordon

Hello,

(reviving an old thread, sorry for the delayed response).

 On 12/28/2013 03:36 PM, Jyothis V wrote:
...

Hi, thanks for the reply. I understand why something like reservoir
sampling is needed. But in shuf.c, shuffling is done in two steps:
1) using reservoir sampling, an array of length head_length is obtained.
At this stage, the array is not completely shuffled because the
first head_length elements were copied verbatim. 2) This array is
then randomly permuted while writing. My question is whether these
two steps could be clubbed together, just as shown in the second
algorithm given on the Wikipedia page you mentioned.


I haven't had a look into the maths behind it yet, nor was I involved
during that last improvement. Further improvement is maybe possible,
and the best way to push this is providing code. Are you willing
to propose such a patch?



Regarding the shuffle correctness:
Yes, the data is first read into the array, and only later permuted.
I believe the implementation is correct (i.e. it does randomly shuffle 
the input), and if this is not the case, it's a bug and should be fixed.


Regarding the implementation:
In shuf.c there's an intricate interplay between reading the input and 
writing the output - notice that the input is closed explicitly half-way 
through main(), before any output is ever written.


The initial patch was written to maintain this behavior, and minimally 
disrupt the existing code flow:

http://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=20d7bce0f7e57d9a98f0ee811e31c757e9fedfff

This is not to say a better implementation is not possible, just that 
there are a few technical details to note before changing 'shuf'.
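
As a rough, self-contained illustration of the reservoir step (Algorithm R) -
not shuf's actual code - here is a one-pass sketch in awk. Note that the
reservoir's final order is not itself a uniform random permutation, which is
exactly why shuf permutes the sample in a second step:

```shell
# Keep a uniform 5-line sample of arbitrarily long input in one pass.
seq 1000 | awk -v k=5 '
  BEGIN { srand() }
  NR <= k { r[NR] = $0; next }
  { j = int(rand() * NR) + 1; if (j <= k) r[j] = $0 }
  END { for (i = 1; i <= k; i++) print r[i] }'
```

Each input line ends up in the sample with probability k/NR, but lines that
survive from the initial prefix tend to stay in their original slots - hence
the separate permutation pass.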


There are certainly ways to improve the code.

HTH,
 -Gordon




Re: sort/uniq/join: key-comparison code consolidation

2014-01-23 Thread Assaf Gordon

Hello,

If there is still interest, here's an updated patch, against the latest 
version, of adding these features to join+uniq:


http://lists.gnu.org/archive/html/coreutils/2013-02/msg00082.html

The patch has been re-created (not just rebased), because the old one 
caused a few conflicts. Functionally it is the same as before, and all 
tests pass.


Comments are welcomed,
 -gordon



key-compare-2014-01-23.patch.xz
Description: application/xz


script suggestion: 'check_program' to easily run multiple tests

2014-01-24 Thread Assaf Gordon

Hello,

Attached is a small script I've been using.
It helps running multiple tests for a given program.

example:
  ./scripts/check_program sort

Will find all sort-related tests (based on filename) and run them.
Adding -e or -v also runs expensive and very expensive tests:

example:
  ./scripts/check_program -v sort

is equivalent to:
   make check VERBOSE=yes SUBDIRS=. \
  RUN_EXPENSIVE_TESTS=yes \
  RUN_VERY_EXPENSIVE_TESTS=yes \
  TESTS=./tests/misc/sort-NaN-infloop.sh
 ./tests/misc/sort-benchmark-random.sh
 ./tests/misc/sort-compress-hang.sh
 ./tests/misc/sort-compress-proc.sh
 ./tests/misc/sort-compress.sh
 ./tests/misc/sort-continue.sh
 ./tests/misc/sort-debug-keys.sh
 ./tests/misc/sort-debug-warn.sh
 ./tests/misc/sort-discrim.sh
 ./tests/misc/sort-exit-early.sh
 ./tests/misc/sort-files0-from.pl
 ./tests/misc/sort-float.sh
 ./tests/misc/sort-merge-fdlimit.sh
 ./tests/misc/sort-merge.pl
 ./tests/misc/sort-month.sh
 ./tests/misc/sort-rand.sh
 ./tests/misc/sort-spinlock-abuse.sh
 ./tests/misc/sort-stale-thread-mem.sh
 ./tests/misc/sort-u-FMR.sh
 ./tests/misc/sort-unique-segv.sh
 ./tests/misc/sort-unique.sh
 ./tests/misc/sort-version.sh
 ./tests/misc/sort.pl


If others find it useful, you're welcome to add this.

-gordon


From 965a01bfaf129b4d1da8d0927a9149e4c4145ff3 Mon Sep 17 00:00:00 2001
From: A. Gordon assafgor...@gmail.com
Date: Fri, 24 Jan 2014 13:39:14 -0500
Subject: [PATCH] scripts: add check_program, to run tests easily

* scripts/check_program: New script, so you can easily run all tests
relating to a certain program. Takes less time than checking all
programs with 'make check', and quicker to type than
'make check TESTS=TEST1,TEST2,TEST3' for multiple tests.
---
 scripts/check_program | 70 +++
 1 file changed, 70 insertions(+)
 create mode 100755 scripts/check_program

diff --git a/scripts/check_program b/scripts/check_program
new file mode 100755
index 000..f38e410
--- /dev/null
+++ b/scripts/check_program
@@ -0,0 +1,70 @@
+#!/bin/sh
+# A small helper script to run multiple tests at once.
+# example:
+#   ./scripts/check_program sort
+# would run all 'sort' related tests under ./tests/
+
+# Written by Assaf Gordon
+
+# allow the user to override 'make'
+MAKE=${MAKE-make}
+
+VERSION='2014-01-24 00:37:51' # UTC
+
+prog_name=`basename "$0"`
+die () { echo "$prog_name: $*" >&2; exit 1; }
+
+usage() {
+  echo >&2 \
+"Usage: $0 [OPTION] PROGRAM
+Runs all tests for PROGRAM
+
+Options:
+ -e run EXPENSIVE tests
+ -v run EXPENSIVE and VERY_EXPENSIVE tests
+ -h display this help and exit
+
+Examples:
+To run all (non-expensive) tests for 'uniq':
+
+  $0 uniq
+
+To run all (including expensive and very expensive) tests for 'sort':
+
+  $0 -v sort
+"
+}
+
+RUN_EXPENSIVE_TESTS=no
+RUN_VERY_EXPENSIVE_TESTS=no
+while getopts :evh name
+do
+case $name in
+(v)RUN_VERY_EXPENSIVE_TESTS=yes;RUN_EXPENSIVE_TESTS=yes;;
+(e)RUN_EXPENSIVE_TESTS=yes;;
+(h)usage; exit 0 ;;
+(--)   shift ; break ;;
+(*)die "Unknown option '$OPTARG'" ;;
+esac
+shift
+done
+
+PROGRAM=$1
+[ -z "$PROGRAM" ] && die "missing PROGRAM name. See '-h' for help."
+
+
+[ -d ./tests ] || die "'tests/' directory not found." \
+"Please run this script from the" \
+"main directory of 'Coreutils'."
+
+TESTS=$(find ./tests/ \( -name '*.sh' -o -name '*.pl' \) -print | \
+	grep -w -- "$PROGRAM" | paste -s -d' ')
+[ -z "$TESTS" ] && die "no tests found for '$PROGRAM'"
+
+echo "Running the following tests for '$PROGRAM':"
+echo "$TESTS" | tr ' ' '\n' | sed 's/^/   /'
+
+$MAKE check TESTS="$TESTS" VERBOSE=yes SUBDIRS=. \
+	RUN_EXPENSIVE_TESTS="$RUN_EXPENSIVE_TESTS" \
+	RUN_VERY_EXPENSIVE_TESTS="$RUN_VERY_EXPENSIVE_TESTS"
-- 
1.8.4.3



stat: clarify mtime vs ctime [patch]

2014-04-21 Thread Assaf Gordon

Hello,

Would you be receptive to adding a tiny patch to 'stat' to clarify the 
difference between modification time and change time?

Currently, it simply says:
  %y   time of last modification, human-readable
  %Y   time of last modification, seconds since Epoch
  %z   time of last change, human-readable
  %Z   time of last change, seconds since Epoch

And for most non-Unix experts, "last modification" is (almost) a synonym for
"last change" (IMHO).

The patch changes:
  "modification" -> "data modification"
  "change" -> "status change"
And adds one clarification paragraph to the docs.

While this will not immediately resolve all questions, it will at least hint to
users which option they need (as "data" is different from "status").

The words "data" and "status" are also used (for mtime and ctime, respectively) 
in the POSIX pages of 'sys/stat.h':
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/sys/stat.h.html


Perhaps, in addition, add a new FAQ ?
Something like:

Q. What is the difference between access time, data modification time and
status change time?
A. Most UNIX systems keep track of several different times for each file.
Access Time keeps track of the last time a file was opened for reading.
Data Modification time keeps track of the last time the file's content was
modified.
Status Change time keeps track of the last time the file's status (e.g. mode,
owner, group, hard-links) was modified.

Configuration varies between filesystems - not all systems keep track of all 
three times.

To show Access time, use "ls -lu" or stat's %X and %x formats.
To show Data modification time, use "ls -l" or stat's %Y and %y formats.
To show Status change time, use "ls -lc" or stat's %Z and %z formats.

Example:
# Create a new file
$ echo hello > test.txt

# Show the file's time stamps
$ stat --printf "Access: %x\nModify: %y\nChange: %z\n" test.txt
Access: 2014-04-21 14:01:00.131648000 +0000
Modify: 2014-04-21 14:01:00.131648000 +0000
Change: 2014-04-21 14:01:00.131648000 +0000

# Wait 5 seconds, then update the file's content.
# NOTE: Status change time is also updated.
$ sleep 5 ; echo world >> test.txt
$ stat --printf "Access: %x\nModify: %y\nChange: %z\n" test.txt
Access: 2014-04-21 14:01:00.131648000 +0000
Modify: 2014-04-21 14:01:05.161657000 +0000
Change: 2014-04-21 14:01:05.161657000 +0000

# Wait 5 seconds, then update the file's status (but not content)
$ sleep 5 ; chmod o-rwx test.txt
$ stat --printf "Access: %x\nModify: %y\nChange: %z\n" test.txt
Access: 2014-04-21 14:01:00.131648000 +0000
Modify: 2014-04-21 14:01:05.161657000 +0000
Change: 2014-04-21 14:01:10.250232749 +0000

# Wait 5 seconds, then read (access) the file's content
$ sleep 5 ; wc test.txt > /dev/null
$ stat --printf "Access: %x\nModify: %y\nChange: %z\n" test.txt
Access: 2014-04-21 14:01:15.298241904 +0000
Modify: 2014-04-21 14:01:05.161657000 +0000
Change: 2014-04-21 14:01:10.250232749 +0000

# Show Data Modification time with 'ls -l'
$ ls --full-time -log test.txt
-rw-r----- 1 12 2014-04-21 14:01:05.161657000 +0000 test.txt

# Show Status Change time with 'ls -c'
$ ls --full-time -log -c test.txt
-rw-r----- 1 12 2014-04-21 14:01:10.250232749 +0000 test.txt

# Show Last Access time with 'ls -u'
$ ls --full-time -log -u test.txt
-rw-r----- 1 12 2014-04-21 14:01:15.298241904 +0000 test.txt



Regards,
 -gordon

From 4cf4784aafdf45fd3dec3855b9320d72dcd1a6ec Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Mon, 21 Apr 2014 14:31:23 -0400
Subject: [PATCH] stat: clarify mtime vs ctime in usage(), doc

Change modification time to data modification time,
change time to status change time.

* src/stat.c: improve usage()
* doc/coreutils.texi: add clarification paragraph
---
 doc/coreutils.texi | 19 +++
 src/stat.c |  8 
 2 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index 6c49385..e979c88 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -11829,10 +11829,10 @@ The valid @var{format} directives for files with @option{--format} and
 @item %W - Time of file birth as seconds since Epoch, or @samp{0}
 @item %x - Time of last access
 @item %X - Time of last access as seconds since Epoch
-@item %y - Time of last modification
-@item %Y - Time of last modification as seconds since Epoch
-@item %z - Time of last change
-@item %Z - Time of last change as seconds since Epoch
+@item %y - Time of last data modification
+@item %Y - Time of last data modification as seconds since Epoch
+@item %z - Time of last status change
+@item %Z - Time of last status change as seconds since Epoch
 @end itemize
 
 The @samp{%t} and @samp{%T} formats operate on the st_rdev member of
@@ -11864,6 +11864,17 @@ precision:
   [1288929712.114951834]
 @end example
 
+@emph{Access time} formats (@samp{%x},@samp{%X}) output the last time the
+file

Re: stat: clarify mtime vs ctime [patch]

2014-04-22 Thread Assaf Gordon

On 04/21/2014 03:57 PM, Pádraig Brady wrote:

On 04/21/2014 08:14 PM, Assaf Gordon wrote:

Would you be receptive to adding a tiny patch to 'stat' to clarify the 
difference between modification time and change time?


This clarification is worth making, thanks!


Perhaps, in addition, add a new FAQ ?


Let's avoid the FAQ for the moment.
Hopefully the improved docs will avoid the need.


...


but
if the file was just opened for reading, then access time isn't updated,
only if data is read. Also for performance reasons, modern Linux systems
only update atime if it's older than [cm]time.
I.E. with relatime enabled, it's really only an indicator
as to whether the file has been read since it was last updated.
So I think this whole block might add more ambiguity than
any additional clarification. OK to drop this block?


Attached are improved patches:
The first contains only the added words "status" and "data".

The second adds the paragraph to the docs, and can be included at your 
discretion.
I've reworded the Access Time sentence to make clear it depends on the 
operating system and file system configuration.
But I think at least the data modification time and status change time 
sentences are correct for all systems.

For both the FAQ and the additional paragraph, my reasoning is:
1. Expert users (who know by heart what mtime vs ctime mean) - don't need any 
of these.
2. Seasoned users - perhaps just need a reminder, in which case the words
"data" vs "status" are enough.
3. Other (most?) users - will still look for clarification after seeing "data
modification time" vs "status change time".

There are many answers to "what is the difference between modification time and
change time" on the internet, but I think it would be beneficial to have an
authoritative answer from a reliable source (i.e. from coreutils).

Regards,
 -gordon

From 611b2b12ec7f6ae4ee276adfe74efe41602d27d7 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Mon, 21 Apr 2014 14:31:23 -0400
Subject: [PATCH 1/2] doc: clarify stat's mtime vs ctime in usage(), doc

Change modification time to data modification time,
change time to status change time.

* src/stat.c: improve usage()
* doc/coreutils.texi: ditto
---
 doc/coreutils.texi | 8 
 src/stat.c | 8 
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index 6c49385..b21a4fd 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -11829,10 +11829,10 @@ The valid @var{format} directives for files with @option{--format} and
 @item %W - Time of file birth as seconds since Epoch, or @samp{0}
 @item %x - Time of last access
 @item %X - Time of last access as seconds since Epoch
-@item %y - Time of last modification
-@item %Y - Time of last modification as seconds since Epoch
-@item %z - Time of last change
-@item %Z - Time of last change as seconds since Epoch
+@item %y - Time of last data modification
+@item %Y - Time of last data modification as seconds since Epoch
+@item %z - Time of last status change
+@item %Z - Time of last status change as seconds since Epoch
 @end itemize
 
 The @samp{%t} and @samp{%T} formats operate on the st_rdev member of
diff --git a/src/stat.c b/src/stat.c
index fffebe3..7d43eb5 100644
--- a/src/stat.c
+++ b/src/stat.c
@@ -1457,10 +1457,10 @@ The valid format sequences for files (without --file-system):\n\
   %W   time of file birth, seconds since Epoch; 0 if unknown\n\
   %x   time of last access, human-readable\n\
   %X   time of last access, seconds since Epoch\n\
-  %y   time of last modification, human-readable\n\
-  %Y   time of last modification, seconds since Epoch\n\
-  %z   time of last change, human-readable\n\
-  %Z   time of last change, seconds since Epoch\n\
+  %y   time of last data modification, human-readable\n\
+  %Y   time of last data modification, seconds since Epoch\n\
+  %z   time of last status change, human-readable\n\
+  %Z   time of last status change, seconds since Epoch\n\
 \n\
 ), stdout);
 
-- 
1.9.1

From d7757509a9248a1b2ead45433741d2ec0d4ce7d2 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Tue, 22 Apr 2014 11:13:02 -0400
Subject: [PATCH 2/2] doc: add paragraph about stat's %x/%y/%z

doc/coreutils.texi: added paragraph.
---
 doc/coreutils.texi | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index b21a4fd..b505d1e 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -11864,6 +11864,16 @@ precision:
   [1288929712.114951834]
 @end example
 
+@emph{Access time} formats (@samp{%x},@samp{%X}) output the file's access time.
+Access time is also shown with @command{ls -lu}. The precise meaning of file
+access time depends on your operating system and file system configuration.
+@emph{Data modification} format (@samp{%y}, @samp{%Y})
+outputs the time the file's content was modified (e.g. by a program writing
+to the file). Data modification time is also

Re: sort/uniq/join: key-comparison code consolidation

2014-05-02 Thread Assaf Gordon

Hello,

On 01/23/2014 07:50 PM, Assaf Gordon wrote:


If there is still interest, here's an updated patch, against the latest 
version, of adding these features to join+uniq:

http://lists.gnu.org/archive/html/coreutils/2013-02/msg00082.html


Attached another rebase + minor fix for recent 'devmsg' change.


Comments are welcomed,
  -gordon




key-compare-2014-05-02.patch.xz


Work-around for bootstrap failure with gettext 0.18.3.1

2014-05-02 Thread Assaf Gordon

Hello,

Coreutils' bootstrap script fails (in a freshly cloned directory) with gettext 
0.18.3.1.

This has been discussed few times on the mailing list:
http://lists.gnu.org/archive/html/coreutils/2013-11/msg00038.html
http://lists.gnu.org/archive/html/bug-coreutils/2014-01/msg00058.html
http://lists.gnu.org/archive/html/bug-coreutils/2014-04/msg00106.html

And already resolved (with recommendation to upgrade to 0.18.3.2):
http://savannah.gnu.org/bugs/?40083
https://bugs.launchpad.net/ubuntu/+source/gettext/+bug/1311895

But version 0.18.3.1 is still out there and hasn't been upgraded in several 
distributions.

Would you be receptive to adding the following minor work-around for bootstrap?
It creates the two needed files, which allows autopoint to continue, then 
gnulib immediately overrides them with the correct versions.

Comments are welcomed,
 - gordon

P.S.
So far I have only tested it on Ubuntu 14.04 (with gettext 0.18.3.1) and Debian 
7 (with gettext 0.18.1.1-9).




From 3186927f477b12ad5ce3d184047336c382432226 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Fri, 2 May 2014 20:17:06 -0400
Subject: [PATCH] build: avoid bootstrap error with gettext 0.18.3.1

* bootstrap: Create critical bootstrap files for autopoint,
before gnulib re-generates them.
This avoids a bug in gettext/autopoint version 0.18.3.1 (which
is advertised as 0.18.3).
See:
http://lists.gnu.org/archive/html/coreutils/2013-11/msg00038.html
http://savannah.gnu.org/bugs/?40083
https://bugs.launchpad.net/ubuntu/+source/gettext/+bug/1311895
---
 bootstrap | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/bootstrap b/bootstrap
index ce90bc4..81b576d 100755
--- a/bootstrap
+++ b/bootstrap
@@ -807,6 +807,20 @@ version_controlled_file() {
   fi
 }
 
+
+# Work-around for gettext/autopoint bug in version 0.18.3.1:
+# Create dummy 'm4/cu-progs.m4' and 'build-aux/git-version-gen'
+# to avoid 'bootstrap' failure.
+# http://lists.gnu.org/archive/html/coreutils/2013-11/msg00038.html
+autopoint_version=$(get_version "$AUTOPOINT")
+if test "$autopoint_version" = "0.18.3" ; then
+  test -e 'm4/cu-progs.m4' || touch 'm4/cu-progs.m4'
+  if ! test -e 'build-aux/git-version-gen' ; then
+    printf '#!/bin/sh\n' > 'build-aux/git-version-gen'
+    chmod a+x 'build-aux/git-version-gen'
+  fi
+fi
+
 # NOTE: we have to be careful to run both autopoint and libtoolize
 # before gnulib-tool, since gnulib-tool is likely to provide newer
 # versions of files installed by these two programs.
-- 
1.9.1



Re: Work-around for bootstrap failure with gettext 0.18.3.1

2014-05-05 Thread Assaf Gordon

On 05/03/2014 05:52 AM, Pádraig Brady wrote:


We wouldn't be wanting the cu-progs.m4 in other projects though,
so we should probably conditionalize that to just $package = coreutils.
It wouldn't be worth adding new hooks for this to generalize.



To clarify: do you mean conditionalize just the cu-progs.m4 part, or the entire 
work-around with $package = coreutils ?




Re: Work-around for bootstrap failure with gettext 0.18.3.1

2014-05-12 Thread Assaf Gordon

On 05/05/2014 02:42 PM, Pádraig Brady wrote:

On 05/05/2014 06:34 PM, Assaf Gordon wrote:

On 05/03/2014 05:52 AM, Pádraig Brady wrote:


We wouldn't be wanting the cu-progs.m4 in other projects though,
so we should probably conditionalize that to just $package = coreutils.
It wouldn't be worth adding new hooks for this to generalize.



To clarify: do you mean conditionalize just the cu-progs.m4 part, or the entire 
work-around with $package = coreutils ?


Just the cu-progs.m4 bit



Attached is an updated patch.
Comments are welcomed.

-gordon

From ba30f3d9f5f217883fb13d06354f2c8478f598d6 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Mon, 12 May 2014 12:17:06 -0400
Subject: [PATCH] build: avoid bootstrap error with gettext 0.18.3.1

* bootstrap: Create critical bootstrap files for autopoint,
before gnulib re-generates them.
This avoids a bug in gettext/autopoint version 0.18.3.1 (which
is advertised as 0.18.3).
See:
http://lists.gnu.org/archive/html/coreutils/2013-11/msg00038.html
http://savannah.gnu.org/bugs/?40083
https://bugs.launchpad.net/ubuntu/+source/gettext/+bug/1311895
---
 bootstrap | 16 
 1 file changed, 16 insertions(+)

diff --git a/bootstrap b/bootstrap
index ce90bc4..9cd8024 100755
--- a/bootstrap
+++ b/bootstrap
@@ -807,6 +807,22 @@ version_controlled_file() {
   fi
 }
 
+
+# Work-around for gettext/autopoint bug in version 0.18.3.1:
+# Create dummy 'm4/cu-progs.m4' and 'build-aux/git-version-gen'
+# to avoid 'bootstrap' failure.
+# http://lists.gnu.org/archive/html/coreutils/2013-11/msg00038.html
+autopoint_version=$(get_version "$AUTOPOINT")
+if test "$autopoint_version" = "0.18.3" ; then
+  if test "$package" = "coreutils" ; then
+    test -e 'm4/cu-progs.m4' || touch 'm4/cu-progs.m4'
+  fi
+  if ! test -e 'build-aux/git-version-gen' ; then
+    printf '#!/bin/sh\n' > 'build-aux/git-version-gen'
+    chmod a+x 'build-aux/git-version-gen'
+  fi
+fi
+
 # NOTE: we have to be careful to run both autopoint and libtoolize
 # before gnulib-tool, since gnulib-tool is likely to provide newer
 # versions of files installed by these two programs.
-- 
1.9.1



sharing STDOUT in multiple sha256sum processes

2014-05-16 Thread Assaf Gordon

Hello,

I'd like to ask your advice, to verify that my command is correct.

I'm trying to calculate sha256 checksum on many files, in parallel.

A contrived example would be:
$ find /path/ -type f -print0 | xargs -0 -P5 -n1 stdbuf -oL sha256sum > 1.txt

This would run at most 5 processes of sha256sum,
with the output of all of them going to the file 1.txt.

Is it correct to assume that, because sha256sum prints one line per file and
"stdbuf -oL" makes its output line-buffered,
the content of 1.txt will be valid (i.e. no intermixed lines from
different processes)?
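
For what it's worth, here is the kind of empirical sanity check I've been using
(a sketch with made-up temporary files, not a proof of atomicity): hash a few
files in parallel, then count output lines that do not have the expected
"hash, two spaces, path" shape.

```shell
# Create a scratch directory with a few random files (illustration only).
dir=$(mktemp -d)
for i in 1 2 3 4 5 6 7 8 9 10; do
  head -c 4096 /dev/urandom > "$dir/f$i"
done

# Run up to 5 line-buffered sha256sum processes sharing one output file.
find "$dir" -type f -print0 | xargs -0 -P5 -n1 stdbuf -oL sha256sum > "$dir/sums.txt"

# Count lines NOT matching "<64 hex digits><two spaces><path>".
bad=$(grep -cvE '^[0-9a-f]{64}  ' "$dir/sums.txt")
echo "malformed lines: $bad"
rm -rf "$dir"
```

A clean run is only evidence, of course - the real question remains whether a
single one-line write() can interleave with another, which is exactly what the
question above is asking.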

Thanks,
 - gordon



Re: seq feature: print letters

2014-06-30 Thread Assaf Gordon

 On Jun 30, 2014, at 5:24, Pádraig Brady p...@draigbrady.com wrote:
 
 On 06/30/2014 11:23 AM, assafgor...@gmail.com wrote:
 I'd like to suggest a patch to allow seq to generate letter sequences.
 I notice about 45 copies of the A-Z alphabet, would it be worth introducing 
 aliases to avoid copies?

Yes, we can consolidate them.

 What about case. The current code only has upper case. case is a can of worms 
 I know, with not necessarily 1:1 mapping etc.

Once we leave the realm of Latin languages, upper/lower case indeed becomes very 
complicated, or even meaningless. I thought that 'tr [:upper:] [:lower:]' would 
handle it better (but I now realize tr doesn't support UTF-8 well, if I 
understand correctly).
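
To illustrate that limitation (assuming GNU tr, which operates on single
bytes): an ASCII letter is down-cased, while a multibyte UTF-8 letter passes
through case conversion unchanged.

```shell
# 'A' is converted, but the two-byte UTF-8 letter 'Ä' is not:
printf 'A Ä\n' | tr '[:upper:]' '[:lower:]'
```

(In a UTF-8 locale this typically prints "a Ä" rather than the expected "a ä".)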

I think that for the first step, we should not deal with upper/lower case 
issues.

 The data being leveraged is well defined and at present reasonable to include 
 directly in the seq binary (about 12K I'm guessing),
 though have you looked at whether libunistring contains the appropriate 
 data/logic for this?
 This might be more significant if case or more characters were considered for 
 example.

This first draft stores UTF-8 strings (with NUL) for each character.  I saw that
the libunistring code stores bit-fields for some of its functions, though I
haven't studied it yet.
I will try to improve the storage method in following patches.

 I had a quick look at the CLDR. Are you only considering the Index exemplar 
 chars here?
 http://www.unicode.org/cldr/charts/25/by_type/core_data.alphabetic_information.index.html

Exactly.

 Maybe it would be better to default to the standard exemplars?
 http://cldr.unicode.org/translation/characters#TOC-Exemplar-Characters

The reason I liked the index list is that it most directly answers the 
question "what is the alphabet in language X?" (as in, what are the letters 
that would be taught in schools as "the alphabet", or that a person on 
the street would list as the alphabet's letters).
It also lends itself to do:
   # How many letters are in the Arabic alphabet:
seq --alphabet=ar | wc -l
   # What is the eleventh letter in the Russian alphabet:
seq --alphabet=ru | awk 'NR==11'
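
For comparison, today those questions can only be answered for the ASCII
alphabet with existing tools - a rough stand-in, not an equivalent, since
'--alphabet' is the proposed (not yet existing) option:

```shell
# Print the 26 ASCII lowercase letters, one per line
# (analogous to the proposed "seq --alphabet=en"):
awk 'BEGIN { for (i = 97; i <= 122; i++) printf "%c\n", i }'

# How many letters are in the English alphabet:
awk 'BEGIN { for (i = 97; i <= 122; i++) printf "%c\n", i }' | wc -l
```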

Technically, the functionality of is_alpha() does not correspond 1:1 to the 
alphabet, which is part of the problem... In English, there are no 
complications, but in many other languages it becomes complicated.

Using other Unicode categories (e.g. the 'main' letters or even 'auxiliary' 
letters) answers a slightly different question, more akin to "what symbols are 
acceptable in language X?" - not a bad question, just different from the 
previous one.

For example in Hebrew, the index list contains 22 letters (which agrees with 
the question "how many letters are in the Hebrew alphabet"), but the 
main/standard list has 5 more symbols, for the 5 Hebrew letters that have a 
specific final form (used when those letters appear at the end of a word).
So using the main list would list 5 letters twice. I believe other languages 
such as Arabic would present similar issues.

From a technical point of view, it's easy to include both "index" and 
"standard" letters (with different command-line options); it's just a matter of 
adding more lists.

What do you think?

Thanks,
 -Gordon




Re: seq feature: print letters

2014-07-02 Thread Assaf Gordon

 On Jul 1, 2014, at 2:21, Bernhard Voelker m...@bernhard-voelker.de wrote:
 
 Hmm, what about just providing the standard A-Z alphabet,
 and instead leave it up to the user if she needs a different
 set (rolling over if needed)?

I like the idea of seq using a user-specified sequence of characters (though
this brings its own issues),
but my goal was to provide an easy way to generate letters in many languages - 
if the user has to type them explicitly, then seq is no better than printf 
'%s\n' followed by all the letters typed by the user...

What do you think?

I'm still working on an improved patch with much more efficient storage. Hope 
to have it in a week or so.

Regards,
  - gordon

