Re: [coreutils] join feature: auto-format
Pádraig Brady wrote, On 10/07/2010 06:22 AM:
> On 07/10/10 01:03, Pádraig Brady wrote:
>> On 06/10/10 21:41, Assaf Gordon wrote:
>>> The --auto-format feature simply builds the -o format line automatically,
>>> based on the number of columns from both input files.
>>
>> Thanks for persisting with this and presenting a concise example.
>> I agree that this is useful and can't think of a simple workaround.
>> Perhaps the interface would be better as: -o {all (default), padded, FORMAT}
>> where "padded" is the functionality you're suggesting?
>
> Thinking more about it, we mightn't need any new options at all.
> Currently -e is redundant if -o is not specified. So how about changing
> that, so that if -e is specified we operate as above by auto-inserting
> empty fields? Also, I wouldn't base it on the number of fields in the
> first line, but instead auto-pad to the biggest number of fields on the
> current lines under consideration.

My concern is the principle of least surprise: if there are existing scripts/programs that specify -e without -o (which doesn't make sense, but is still possible), this change will alter their behavior. Also, implying/forcing auto-format when -e is used without -o might be a bit confusing. I prefer to have the user explicitly ask for auto-format; at least he/she will know what the output will look like.

That being said, I can send a new patch with one of the new methods (implicit auto-format, or '-o padded') - which one is preferred?

Thanks,
-gordon
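To make the "padded" behavior under discussion concrete, here is a rough Python sketch (hypothetical function names, not join's actual implementation): each file's short lines are padded to that file's widest line, with the missing fields filled by the -e value, before the join output is produced.

```python
def pad_join(lines1, lines2, empty="NA"):
    """Join on field 1; pad each file's short lines to that file's
    widest line, filling missing fields with `empty` (the -e value).
    A sketch of one possible 'padded' semantics, not join itself."""
    def pad(lines):
        rows = [line.split() for line in lines]
        width = max(len(r) for r in rows)          # widest line in this file
        return {r[0]: r[1:] + [empty] * (width - len(r)) for r in rows}
    a, b = pad(lines1), pad(lines2)
    return [" ".join([k, *a[k], *b[k]]) for k in sorted(a) if k in b]

print(pad_join(["k1 x y", "k2 x"], ["k1 p", "k2 p q r"]))
# -> ['k1 x y p NA NA', 'k2 x NA p q r']
```

Note this pads per file over all lines, closer to a fixed per-file format than Pádraig's "current lines under consideration" variant, which would compute the width per joined pair instead.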
[coreutils] du/bigtime fail ( was: new snapshot available: coreutils-8.7.66-561f8)
Jim Meyering wrote, On 12/17/2010 05:07 AM:
> Here's a preview of what should soon appear as coreutils-8.8. [...]
> Any testing you can perform over the weekend would be most welcome.

On CentOS 5.4, du/bigtime fails (in a reproducible manner).

$ uname -a
Linux XX 2.6.18-164.11.1.el5 #1 SMP Wed Jan 20 07:32:21 EST 2010 x86_64 GNU/Linux

== GNU coreutils 8.7.66-561f8: tests/test-suite.log ==

1 of 1 test failed.

.. contents:: :depth: 2

FAIL: du/bigtime (exit: 1)
==
--- out 2010-12-17 15:29:06.0 +
+++ exp 2010-12-17 15:29:06.0 +
@@ -1 +1 @@
-4 9223372036854775807 future
+0 9223372036854775807 future

-gordon
Re: [coreutils] join feature: auto-format
Pádraig Brady wrote, On 01/11/2011 07:35 AM:
> Spending another few minutes on this, I realized that we should not be
> trying to homogenize the number of fields from each file, but rather the
> fields used for a particular file in each line. The only sensible basis
> for that is the first line as previously suggested. The interface would
> be a little different for that. I was thinking of:
>   -o 'header'   Infer the format from the first line of each file

I second the idea of using the first line as the basis for the auto-formatting, but have reservations about the wording: '-o header' somewhat implies that the first line has to be an actual header line (with column names or similar), while it can just be the first line of actual data if the file doesn't have a header line. Something like '-o auto' might be less confusing.

Just my 2 cents,
-gordon
bug#7961: sort
On a somewhat off-topic note, Francesco Bettella wrote, On 02/02/2011 07:42 AM:
> I'm issuing the following sort commands (see attached files):
> [prompt1] sort -k 1.4,1n asd1 > asd1.sorted
> [prompt2] sort -k 2.4,2n asd2 > asd2.sorted
> The first one works as I would expect; the second one doesn't.

When sorting chromosome names, the version-sort option (-V, introduced in coreutils 7.0) sorts as you would expect, saving you the need to skip three characters in the sort key, and also accommodating mixed letters and numbers. Example:

$ cat chrom.txt
chr1
chrUn_gl000232
chrY
chr2
chr13
chrM
chrUn_gl000218
chr6_hap
chr2R
chr16
chr10
chr6_dbb_hap3
chr4
chr3L
chr4_ctg9_hap1
chr3R
chr3
chrX

$ sort -k1,1V chrom.txt
chr1
chr2
chr2R
chr3
chr3L
chr3R
chr4
chr4_ctg9_hap1
chr6_dbb_hap3
chr6_hap
chr10
chr13
chr16
chrM
chrUn_gl000218
chrUn_gl000232
chrX
chrY

-gordon
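For readers unfamiliar with -V, a rough Python model of the idea (not GNU sort's actual filevercmp algorithm, which has extra rules for suffixes and leading zeros) is to split each name into text and digit runs, so digit runs compare numerically rather than lexically:

```python
import re

def version_key(s):
    """Rough model of version sort: split into text and digit runs so
    that digit runs compare as numbers (chr2 < chr13, unlike plain sort)."""
    return [int(p) if p.isdigit() else p for p in re.split(r"(\d+)", s)]

print(sorted(["chr1", "chr13", "chr2", "chr2R", "chrX"], key=version_key))
# -> ['chr1', 'chr2', 'chr2R', 'chr13', 'chrX']
```

Plain lexical sort would instead put chr13 before chr2, which is exactly the surprise -V avoids.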
sort parameters question: -V and -f
Hello,

I'm wondering if this is a bug (where -f is ignored when using version sort):

$ sort --debug -f -k2,2V
sort: using simple byte comparison
sort: leading blanks are significant in key 1; consider also specifying `b'
sort: option `-f' is ignored

My assumption is that using -f as a standalone parameter should have the same effect as using it in a specific key (for that key), e.g. the following two commands are equivalent:

sort -f -k1,1
sort -k1f,1

But the following two commands are not equivalent (because the standalone -f is ignored):

sort -f -k1V,1
sort -k1Vf,1

Example:

## This works
$ printf 'a\nB\nc\n' | sort -k1f,1
a
B
c
$ printf 'a\nB\nc\n' | sort -f -k1,1
a
B
c

## This doesn't work
$ printf 'a13\nA5\na1\n' | sort -k1Vf,1
a1
A5
a13
$ printf 'a13\nA5\na1\n' | sort -f -k1V,1
A5
a1
a13

I'm using coreutils 8.10.
-gordon
Re: sort parameters question: -V and -f
Eric Blake wrote, On 04/06/2011 06:36 PM:
> On 04/06/2011 04:04 PM, Pádraig Brady wrote:
>> On 06/04/11 22:26, Assaf Gordon wrote:
>>> I'm wondering if this is a bug (where -f is ignored when using version sort):
>> The same happens for any ordering option. If any is specified for the
>> key, then all global options are ignored. This is specified by POSIX,
>> and here it's demonstrated on solaris:
> Not only that, but --debug would have told you the same:

--debug did tell me that, but I thought it was a bug, not a feature. I assumed -f is cumulative, not overridden when specifying a per-key sort order - I should have read the docs more carefully.

Thanks for the quick and detailed response.
-gordon
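The rule Pádraig and Eric describe can be modeled in a couple of lines of Python (illustrative names only, not sort's internals): ordering flags given on a -k key replace the global flags entirely, rather than adding to them.

```python
def key_flags(global_flags, per_key_flags):
    """POSIX rule sketch: if a key carries ANY ordering flags, the
    global ordering flags are ignored for that key."""
    return per_key_flags if per_key_flags else global_flags

assert key_flags({"f"}, set()) == {"f"}   # sort -f -k1,1  : global -f applies
assert key_flags({"f"}, {"V"}) == {"V"}   # sort -f -k1V,1 : -f is ignored
```

So `sort -f -k1V,1` behaves like `sort -k1V,1`, and only `sort -k1Vf,1` folds case within the version-sorted key.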
Re: uniq --accumulate
Pádraig Brady wrote, On 02/07/2012 11:00 AM:
> On 02/07/2012 03:56 PM, Peng Yu wrote:
>> Suppose that I have a table of the following, where the last column is
>> a number. I'd like to accumulate the number of rows that are the same
>> for all the remaining columns.
> Thanks for the suggestion, but this is too specialized for coreutils I think.

Slightly off-topic for coreutils, but a package called BEDTools ( http://code.google.com/p/bedtools/ ) provides a program called groupBy, which does exactly that, and more. Akin to SQL's GROUP BY clause, the program can group a text file by a specified column, and perform operations (count, sum, mean, median, etc.) on another column.

-gordon
sort: new feature: use environment variable to set buffer size
Hello,

I'd like to suggest a new feature for sort: the ability to set the buffer size (-S/--buffer-size X) using an environment variable. In summary:

$ export SORT_BUFFER_SIZE=20G
$ someprogram | sort -k1,1 > output.txt
# sort will use 20G of RAM, as if --buffer-size 20G was specified.

The rationale: recent commits improved the guessed buffer size when sort is given an input file, but these don't apply if sort is used as part of a pipeline, with a pipe as input, e.g.:

some | program | sort | other | programs > file

(Tested with v8.19 on linux 2.6.32: sort consumes a few MBs of RAM, even though many GBs are available.) This results in many small temporary files being created. The script (which uses sort) is not under my direct control, but even if it was, I wouldn't want to hard-code the amount of memory used, to keep it portable to different servers.

AFAIK, there are four aspects of sort that affect performance:
1. Number of threads: changeable with --parallel=X and with the environment variable OMP_NUM_THREADS.
2. Temporary files location: changeable with --temporary-directory=DIR and with the environment variable TMPDIR.
3. Memory usage: changeable with --buffer-size=SIZE, but not with an environment variable.
4. Compression program: changeable with --compression-program=PROG, but not with an environment variable (at the moment, I do not address this aspect).

With the attached patch, sort will read an environment variable named SORT_BUFFER_SIZE, and will treat it as if --buffer-size was specified (but only if --buffer-size wasn't used on the command line). If this is conceptually acceptable, I'll prepare a proper patch (with NEWS, help, docs, etc.).

Regards,
-gordon

From db8f1c319d772c5b13df51894f279c3a7276416e Mon Sep 17 00:00:00 2001
From: A. Gordon gor...@cshl.edu
Date: Wed, 29 Aug 2012 16:42:31 -0400
Subject: [PATCH] sort: accept buffer size from environment variable.
---
 src/sort.c | 7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index 9dbfee1..1505a6d 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -4648,6 +4648,13 @@ main (int argc, char **argv)
       files = &minus;
     }

+  if (sort_size == 0)
+    {
+      char const *buffer_size = getenv ("SORT_BUFFER_SIZE");
+      if (buffer_size)
+        specify_sort_size (-1, 'S', buffer_size);
+    }
+
   /* Need to re-check that we meet the minimum requirement for
      memory usage with the final value for NMERGE.  */
   if (0 < sort_size)
--
1.7.9.1
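The lookup logic of the patch above can be sketched in Python (hypothetical helper names; in sort itself the suffix parsing is done by specify_sort_size/xstrtoumax, and per sort's documentation a bare number for -S is taken as KiB):

```python
import os

# sort -S size suffixes: b = bytes, K/M/G/T = powers of 1024 (a sketch;
# sort also accepts %, k, and more suffixes).
MULTIPLIERS = {"b": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def buffer_size_from_env(cli_size=None):
    """Proposed behavior: honor SORT_BUFFER_SIZE only when --buffer-size
    was not given on the command line (cli_size is None here)."""
    if cli_size is not None:             # --buffer-size wins, as in the patch
        return cli_size
    value = os.environ.get("SORT_BUFFER_SIZE")
    if value is None:
        return None
    if value[-1] in MULTIPLIERS:
        return int(value[:-1]) * MULTIPLIERS[value[-1]]
    return int(value) * 1024             # no suffix: K is sort's -S default

os.environ["SORT_BUFFER_SIZE"] = "20G"
print(buffer_size_from_env())            # 21474836480
```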
physmem: a new program to report memory information
Hello,

Related to the previous sort+memory envvar usage thread: http://thread.gmane.org/gmane.comp.gnu.coreutils.general/3028/focus=3090 .

Attached is a suggestion for a tiny command-line program, physmem, that, similarly to nproc, exposes the gnulib functions physmem_total() and physmem_available(). The code is closely modeled after nproc, and the recommended memory usage is calculated using sort's default_sort_size().

The program works like this:
===
$ ./src/physmem --help
Usage: ./src/physmem [OPTION]...
Prints information about physical memory.

  -t, --total           print the total physical memory.
  -a, --available       print the available physical memory.
  -r, --recommended     print a safe recommended amount of usable memory.
  -h, --human-readable  print sizes in human readable format (e.g., 1K 234M 2G)
      --si              like -h, but use powers of 1000 not 1024
      --help     display this help and exit
      --version  output version information and exit

Report physmem bugs to bug-coreut...@gnu.org
GNU coreutils home page: http://www.gnu.org/software/coreutils/
General help using GNU software: http://www.gnu.org/gethelp/
Report physmem translation bugs to http://translationproject.org/team/
For complete documentation, run: info coreutils 'physmem invocation'
===

The actual working code (at the bottom of physmem.c) is:
===
switch (memory_report_type)
  {
  case total:
    memory = physmem_total ();
    break;
  case available:
    memory = physmem_available ();
    break;
  case recommended:
    memory = default_sort_size ();
    break;
  }

char buf[LONGEST_HUMAN_READABLE + 1];
fputs (human_readable (memory, buf, human_output_opts, 1, 1), stdout);
fputs ("\n", stdout);
===

So it's very simple, and relies on existing coreutils code. Please let me know if this is something you'd be willing to include in coreutils.

Thanks,
-gordon

From 1eccf56a49bc0aa3f167a0fce1a65c91a92ed468 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Thu, 30 Aug 2012 11:21:57 -0400
Subject: [PATCH] physmem: A new program to report mem information.
---
 src/Makefile.am |   2 +
 src/physmem.c   | 215 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 217 insertions(+), 0 deletions(-)
 create mode 100644 src/physmem.c

diff --git a/src/Makefile.am b/src/Makefile.am
index 896c902..ae0c20c7 100644
--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -90,6 +90,7 @@ EXTRA_PROGRAMS = \
   od \
   paste \
   pathchk \
+  physmem \
   pr \
   printenv \
   printf \
@@ -198,6 +199,7 @@ chroot_LDADD = $(LDADD)
 cksum_LDADD = $(LDADD)
 comm_LDADD = $(LDADD)
 nproc_LDADD = $(LDADD)
+physmem_LDADD = $(LDADD)
 cp_LDADD = $(LDADD)
 csplit_LDADD = $(LDADD)
 cut_LDADD = $(LDADD)
diff --git a/src/physmem.c b/src/physmem.c
new file mode 100644
index 000..b990503
--- /dev/null
+++ b/src/physmem.c
@@ -0,0 +1,215 @@
+/* physmem - report the total/available/recommended memory
+   Copyright (C) 2009-2012 Free Software Foundation, Inc.
+
+   This program is free software: you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation, either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see http://www.gnu.org/licenses/.  */
+
+/* Written by Assaf Gordon.  */
+
+#include <config.h>
+#include <getopt.h>
+#include <stdio.h>
+#include <sys/types.h>
+
+#include "system.h"
+#include "error.h"
+#include "xstrtol.h"
+#include "physmem.h"
+#include "human.h"
+
+#ifndef RLIMIT_DATA
+struct rlimit { size_t rlim_cur; };
+# define getrlimit(Resource, Rlp) (-1)
+#endif
+
+/* The official name of this program (e.g., no 'g' prefix).  */
+#define PROGRAM_NAME "physmem"
+
+#define AUTHORS proper_name ("Assaf Gordon")
+
+/* Human-readable options for output.  */
+static int human_output_opts;
+
+enum memory_report_type
+  {
+    total, /* default */
+    available,
+    recommended
+  };
+
+static enum memory_report_type memory_report_type = total;
+
+/* For long options that have no equivalent short option, use a
+   non-character as a pseudo short option, starting with CHAR_MAX + 1.  */
+enum
+{
+  HUMAN_SI_OPTION = CHAR_MAX + 1
+};
+
+static struct option const longopts[] =
+{
+  {"total", no_argument, NULL, 't'},
+  {"available", no_argument, NULL, 'a'},
+  {"recommended", no_argument, NULL, 'r'},
+  {"human", no_argument, NULL, 'h'},
+  {"si", no_argument, NULL, HUMAN_SI_OPTION},
+  {GETOPT_HELP_OPTION_DECL},
+  {GETOPT_VERSION_OPTION_DECL},
+  {NULL, 0, NULL, 0}
+};
+
+/* Return the default sort size.
+   FIXME: this function was copied from
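As a rough illustration of what physmem_total() and physmem_available() report, here is an assumed Linux/POSIX-only Python analogue built on sysconf (the gnulib implementation covers many more platforms and fallbacks):

```python
import os

def physmem_total():
    """Rough analogue of gnulib's physmem_total(): page size times the
    number of physical pages (Linux/POSIX sysconf names assumed)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")

def physmem_available():
    """Rough analogue of physmem_available(): page size times the
    number of currently available physical pages."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_AVPHYS_PAGES")

print(physmem_total(), physmem_available())
```

The "recommended" figure in the proposal is not derivable from these two alone; it reuses sort's default_sort_size() heuristic, which also caps by resource limits.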
Command-line program to convert 'human' sizes?
Hello,

Is there a command-line program (or a recommended way) to expose coreutils' common functionality of converting raw sizes to 'human' sizes and vice versa? I'm referring to the -h parameter that du/df/sort accept, reporting human sizes, but also the reverse (where sort's -G accepts 40M as valid input).

I found two relevant threads, but no resolution:
http://lists.gnu.org/archive/html/coreutils/2011-08/msg00035.html
http://lists.gnu.org/archive/html/coreutils/2012-02/msg00088.html

Thanks,
-gordon
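For illustration, both directions can be sketched in a few lines of Python. This is only an approximation: coreutils' human_readable() rounds up to at most three significant digits, while this sketch uses ordinary rounding, and the suffix table here is deliberately abbreviated.

```python
def to_human(n, base=1024):
    """Sketch of -h style output (du/df/sort); close to, but not
    byte-for-byte identical with, coreutils' rounding rules."""
    if n < base:
        return str(n)
    for i, unit in enumerate("KMGTPE"):
        scaled = n / base ** (i + 1)
        if scaled < base:
            return (f"{scaled:.1f}" if scaled < 10 else str(round(scaled))) + unit

def from_human(s, base=1024):
    """Reverse direction, as sort's size arguments accept '40M'."""
    powers = {"K": 1, "M": 2, "G": 3, "T": 4, "P": 5}
    if s[-1:] in powers:
        return int(s[:-1]) * base ** powers[s[-1]]
    return int(s)

print(to_human(11505426432))  # 11G
print(from_human("40M"))      # 41943040
```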
Re: Command-line program to convert 'human' sizes?
Hello,

Pádraig Brady wrote, On 12/04/2012 11:30 AM:
> On 12/04/2012 04:25 PM, Assaf Gordon wrote:
> Nothing yet. The plan is to make a numfmt command available with this
> interface: http://lists.gnu.org/archive/html/coreutils/2012-02/msg00085.html

Attached is a stub for such a program (mostly command-line processing, no actual conversion yet). Please let me know if you're willing to eventually include this program (and I'll add more functionality, tests, docs, etc.). I tried to follow the existing code conventions in other programs, but all comments and suggestions are welcome.

-gordon

From bb5162a7521aee6b95c902acc65c1d3800ba4f30 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Tue, 4 Dec 2012 15:32:05 -0500
Subject: [PATCH] numfmt: stub code for new program

---
 build-aux/gen-lists-of-programs.sh |   1 +
 src/.gitignore                     |   1 +
 src/numfmt.c                       | 298 +++++++++++++++++++++++++++
 3 files changed, 300 insertions(+), 0 deletions(-)
 create mode 100644 src/numfmt.c

diff --git a/build-aux/gen-lists-of-programs.sh b/build-aux/gen-lists-of-programs.sh
index 212ce02..bf63ee3 100755
--- a/build-aux/gen-lists-of-programs.sh
+++ b/build-aux/gen-lists-of-programs.sh
@@ -85,6 +85,7 @@ normal_progs='
     nl
     nproc
     nohup
+    numfmt
     od
     paste
     pathchk
diff --git a/src/.gitignore b/src/.gitignore
index 181..25573df 100644
--- a/src/.gitignore
+++ b/src/.gitignore
@@ -59,6 +59,7 @@ nice
 nl
 nohup
 nproc
+numfmt
 od
 paste
 pathchk
diff --git a/src/numfmt.c b/src/numfmt.c
new file mode 100644
index 000..e513194
--- /dev/null
+++ b/src/numfmt.c
@@ -0,0 +1,298 @@
+/* Reformat numbers like 11505426432 to the more human-readable 11G
+   Copyright (C) 2012 Free Software Foundation, Inc.
+
+   This program is free software: you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation, either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see http://www.gnu.org/licenses/.  */
+
+#include <config.h>
+#include <getopt.h>
+#include <stdio.h>
+#include <sys/types.h>
+
+#include "argmatch.h"
+#include "error.h"
+#include "system.h"
+#include "xstrtol.h"
+
+/* The official name of this program (e.g., no 'g' prefix).  */
+#define PROGRAM_NAME "numfmt"
+
+#define AUTHORS proper_name ()
+
+#define BUFFER_SIZE (16 * 1024)
+
+enum
+{
+  FROM_OPTION = CHAR_MAX + 1,
+  FROM_UNIT_OPTION,
+  TO_OPTION,
+  TO_UNIT_OPTION,
+  ROUND_OPTION,
+  SUFFIX_OPTION
+};
+
+enum scale_type
+{
+  scale_none,   /* the default: no scaling */
+  scale_auto,   /* --from only */
+  scale_SI,
+  scale_IEC,
+  scale_custom  /* --to only, custom scale */
+};
+
+static char const *const scale_from_args[] =
+{
+  "auto", "SI", "IEC", NULL
+};
+static enum scale_type const scale_from_types[] =
+{
+  scale_auto, scale_SI, scale_IEC
+};
+
+static char const *const scale_to_args[] =
+{
+  "SI", "IEC", NULL
+};
+static enum scale_type const scale_to_types[] =
+{
+  scale_SI, scale_IEC
+};
+
+enum round_type
+{
+  round_ceiling,
+  round_floor,
+  round_nearest
+};
+
+static char const *const round_args[] =
+{
+  "ceiling", "floor", "nearest", NULL
+};
+
+static enum round_type const round_types[] =
+{
+  round_ceiling, round_floor, round_nearest
+};
+
+static struct option const longopts[] =
+{
+  {"from", required_argument, NULL, FROM_OPTION},
+  {"from-unit", required_argument, NULL, FROM_UNIT_OPTION},
+  {"to", required_argument, NULL, TO_OPTION},
+  {"to-unit", required_argument, NULL, TO_UNIT_OPTION},
+  {"round", required_argument, NULL, ROUND_OPTION},
+  {"format", required_argument, NULL, 'f'},
+  {"suffix", required_argument, NULL, SUFFIX_OPTION},
+  {GETOPT_HELP_OPTION_DECL},
+  {GETOPT_VERSION_OPTION_DECL},
+  {NULL, 0, NULL, 0}
+};
+
+enum scale_type scale_from = scale_none;
+enum scale_type scale_to = scale_none;
+enum round_type _round = round_ceiling;
+char const *format_str = NULL;
+const char *suffix = NULL;
+uintmax_t from_unit_size = 1;
+uintmax_t to_unit_size = 1;
+
+/* Convert a string of decimal digits, N_STRING, with an optional suffix
+   to an integral value.  Upon successful conversion, return that value.
+   If it cannot be converted, give a diagnostic and exit.  */
+static uintmax_t
+string_to_integer (const char *n_string)
+{
+  strtol_error s_err;
+  uintmax_t n;
+
+  s_err = xstrtoumax (n_string, NULL, 10, &n, "bkKmMGTPEZY0");
+
+  if (s_err == LONGINT_OVERFLOW)
+    {
+      error (EXIT_FAILURE, 0,
+             _("%s: unit size is so large that it is not representable"),
+             n_string);
+    }
+
+  if (s_err != LONGINT_OK
Re: Command-line program to convert 'human' sizes?
Pádraig Brady wrote, On 12/04/2012 06:11 PM:
> On 12/04/2012 10:55 PM, Assaf Gordon wrote:
>> Hello,
>> Pádraig Brady wrote, On 12/04/2012 11:30 AM:
>>> Nothing yet. The plan is to make a numfmt command available with this
>>> interface: http://lists.gnu.org/archive/html/coreutils/2012-02/msg00085.html
>> Attached is a patch, with a proof-of-concept working 'numfmt'.
> Thanks a lot for working on this. All I'll say at this stage is to take
> it as far as you can, as per the interface specified at the above URL,
> with a mind to reusing stuff from lib/human.c if possible. We'll review
> it then with a view to including it ASAP.

Thanks!

Input-wise, I had to copy and modify the xstrtol implementation, because the original function doesn't allow the caller to force SI or IEC or AUTO (it has internal logic to deduce it, based on parameters and user input). Output-wise, human_readable() from lib/human.c is called as-is (no code modification).

Regarding the advanced options:

1. I'm wondering what is the reason/need for --to=NUMBER? A base different from 1024/1000 would result in values like 4K that are very unintuitive (since they don't mean 4096/4000).

2. FORMAT: is the only use-case adding spaces before/after the number, and grouping? human_readable() already has support for grouping, and padding might be added with different parameters.

I'm asking about #1 and #2 because, if we forgo them, human_readable() could be used as-is. Otherwise, it will require copy-pasting and some modifications.

3. SUFFIX: is the purpose of this simply to print a string following the number? Or are there some more complications?

4. Should non-suffix characters following a parsed number cause errors, or be ignored? e.g. 4KQO
Re: Command-line program to convert 'human' sizes?
Pádraig Brady wrote, On 12/04/2012 07:31 PM:
> On 12/05/2012 12:19 AM, Jim Meyering wrote:
>> Pádraig Brady wrote:
>>> On 12/04/2012 11:35 PM, Assaf Gordon wrote:
>>>> Pádraig Brady wrote, On 12/04/2012 06:11 PM:
>>>>> On 12/04/2012 10:55 PM, Assaf Gordon wrote:
>>>>>> Pádraig Brady wrote, On 12/04/2012 11:30 AM:
[snip long discussion]

Would the following be acceptable:
1. Remove the --to=NUMBER option.
2. Surplus characters immediately following a converted number trigger a warning (error?), except if the following characters match exactly the suffix parameter.

Regarding --format: the implementation doesn't really use printf, so %d isn't directly usable. One option is to tell the user to use %s (instead of %d), and we'll simply put the result of human_readable() as the string parameter in vasnprintf - this will be flexible in terms of alignment. Another option is to remove the --format option, and replace it with --padding or similar.

Regarding grouping (thousands separator): this only has an effect when not using --to=SI or --to=IEC, right? Perhaps we can add a separate option --grouping, and simply turn on the human_grouping flag? (Easy to implement.)
Re: Command-line program to convert 'human' sizes?
Hello,

Pádraig Brady wrote, On 12/04/2012 11:30 AM:
> Nothing yet. The plan is to make a numfmt command available with this
> interface: http://lists.gnu.org/archive/html/coreutils/2012-02/msg00085.html

Attached is a patch, with a proof-of-concept working 'numfmt'.

Works: from=SI/IEC/AUTO, to=SI/IEC, from-units, to-units, suffix, round.
Doesn't work: format, to=NUMBER, field=N.

The code isn't clean and can be improved. Currently, either every (non-option) command-line parameter is expected to be a number, or every line on stdin is expected to start with a number.

Comments are welcomed,
-gordon

Examples:
===
$ ./src/numfmt --from=auto 2K
2000
$ ./src/numfmt --from=auto 2Ki
2048
$ ./src/numfmt --from=SI 2K
2000
$ ./src/numfmt --from=SI 2Ki
2000
$ ./src/numfmt --from=IEC 2Ki
2048
$ ./src/numfmt --from=SI --to=IEC 2Ki
2.0K
$ ./src/numfmt --from=IEC --to=SI 2K
2.1k
$ ./src/numfmt --from=IEC 1M
1048576
$ ./src/numfmt --from=IEC --to=SI 1M
1.1M
$ ./src/numfmt --from=IEC --to-unit=20 1M
52429
$ ./src/numfmt --from-unit=512 --to=IEC 4
2.0K
$ ./src/numfmt --round=ceiling --to=IEC 2000
2.0K
$ ./src/numfmt --round=floor --to=IEC 2000
1.9K
===

Help screen:
===
$ ./src/numfmt --help
Usage: ./src/numfmt [OPTIONS] [NUMBER]
Reformats NUMBER(s) to/from human-readable values.
Numbers can be processed either from stdin or command arguments.

  --from=UNIT      Auto-scale input numbers (auto, SI, IEC).
                   If not specified, input suffixes are ignored.
  --from-unit=N    Specify the input unit size (instead of the default 1).
  --to=UNIT        Auto-scale output numbers (SI, IEC, N). If not specified,
  --to-unit=N      Specify the output unit size (instead of the default 1).
  --round=METHOD   Round input numbers. METHOD can be:
                   ceiling (the default), floor, nearest
  -f, --format=FORMAT  use printf style output FORMAT.
                   Default output format is %d.
  --suffix=SUFFIX
      --help     display this help and exit
      --version  output version information and exit

UNIT options:
  auto ('--from' only):
    1K  = 1000
    1Ki = 1024
    1G  = 100
    1Gi = 1048576
  SI:
    1K* = 1000 (additional suffixes after K/G/T do not alter the scale)
  IEC:
    1K* = 1024 (additional suffixes after K/G/T do not alter the scale)
  N ('--to' only):
    Use number N as the scale.

Examples:
  ./src/numfmt --to=SI 1000          ->  1K
  echo 1K | ./src/numfmt --from=SI   ->  1000
  echo 1K | ./src/numfmt --from=IEC  ->  1024

Report numfmt bugs to bug-coreut...@gnu.org
GNU coreutils home page: http://www.gnu.org/software/coreutils/
General help using GNU software: http://www.gnu.org/gethelp/
Report numfmt translation bugs to http://translationproject.org/team/
For complete documentation, run: info coreutils 'numfmt invocation'
===

 build-aux/gen-lists-of-programs.sh |   1 +
 src/.gitignore                     |   1 +
 src/numfmt.c                       | 549 +++++++++++++++++++++++++++
 3 files changed, 551 insertions(+), 0 deletions(-)

diff --git a/build-aux/gen-lists-of-programs.sh b/build-aux/gen-lists-of-programs.sh
index 212ce02..bf63ee3 100755
--- a/build-aux/gen-lists-of-programs.sh
+++ b/build-aux/gen-lists-of-programs.sh
@@ -85,6 +85,7 @@ normal_progs='
     nl
     nproc
     nohup
+    numfmt
     od
     paste
     pathchk
diff --git a/src/.gitignore b/src/.gitignore
index 181..25573df 100644
--- a/src/.gitignore
+++ b/src/.gitignore
@@ -59,6 +59,7 @@ nice
 nl
 nohup
 nproc
+numfmt
 od
 paste
 pathchk
diff --git a/src/numfmt.c b/src/numfmt.c
new file mode 100644
index 000..99b1450
--- /dev/null
+++ b/src/numfmt.c
@@ -0,0 +1,549 @@
+/* Reformat numbers like 11505426432 to the more human-readable 11G
+   Copyright (C) 2012 Free Software Foundation, Inc.
+
+   This program is free software: you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation, either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see http://www.gnu.org/licenses/.  */
+
+#include <config.h>
+#include <getopt.h>
+#include <stdio.h>
+#include <sys/types.h>
+
+#include "argmatch.h"
+#include "error.h"
+#include "system.h"
+#include "xstrtol.h"
+#include "human.h"
+
+/* The official name of this program (e.g., no 'g' prefix).  */
+#define PROGRAM_NAME "numfmt"
+
+#define AUTHORS proper_name ()
+
+#define BUFFER_SIZE (16 * 1024)
+
+enum
+{
+  FROM_OPTION = CHAR_MAX + 1,
+  FROM_UNIT_OPTION,
+  TO_OPTION,
+  TO_UNIT_OPTION,
+  ROUND_OPTION,
+  SUFFIX_OPTION
+};
+
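The --round examples above (2000 with --to=IEC giving 2.0K under ceiling but 1.9K under floor) can be modeled with a short Python sketch. This only handles small values and keeps one decimal digit, as the draft numfmt output does; the function name is illustrative, not numfmt's.

```python
import math

ROUND = {"ceiling": math.ceil, "floor": math.floor, "nearest": round}

def to_iec(value, method="ceiling"):
    """Scale to a single IEC 'K' unit with one decimal, rounding at the
    0.1K granularity per METHOD (valid for values below ~10 KiB)."""
    tenths = ROUND[method](value / 1024 * 10)
    return f"{tenths / 10:.1f}K"

print(to_iec(2000, "ceiling"))  # 2.0K
print(to_iec(2000, "floor"))    # 1.9K
```

The key point the examples illustrate: 2000/1024 is 1.953125, so the choice of rounding method decides whether the user sees 2.0K or 1.9K.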
Re: Command-line program to convert 'human' sizes?
Hello,

Attached is a working version of numfmt. The following are implemented:
===
Usage: ./src/numfmt [OPTIONS] [NUMBER]
Reformats NUMBER(s) to/from human-readable values.
Numbers can be processed either from stdin or command arguments.

  --from=UNIT      Auto-scale input numbers to UNITs. Default is 'none'.
                   See UNIT below.
  --from-unit=N    Specify the input unit size (instead of the default 1).
  --to=UNIT        Auto-scale output numbers to UNITs. See UNIT below.
  --to-unit=N      Specify the output unit size (instead of the default 1).
  --round=METHOD   Round input numbers. METHOD can be:
                   ceiling (the default), floor, nearest
  --suffix=SUFFIX  Add SUFFIX to output numbers, and accept optional
                   SUFFIX in input numbers.
  --padding=N      Pad the output to N characters. Default is right-aligned.
                   Negative N will left-align. Note: if N is too small, the
                   output will be truncated, and a warning will be printed
                   to stderr.
  --grouping       Group digits together (e.g. 1,000,000). Uses the
                   locale-defined grouping (i.e. has no effect in C/POSIX
                   locales).
  --field N        Replace the number in input field N (default is 1)
  -d, --delimiter=X  use X instead of whitespace for field delimiter
===

Also included in the patch is a test file, testing all sorts of combinations of the parameters (hopefully catching most of the corner cases).

There's also an undocumented option --debug that will show what's going on:
===
$ ./src/numfmt --debug --field 2 --suffix=Foo --from=SI --to=IEC
Hello 70MFoo World
Extracting Fields:
  input: 'Hello 70MFoo World'
  field: 2
  prefix: 'Hello'
  number: '70MFoo'
  suffix: 'World'
Trimming suffix 'Foo'
Parsing number:
  input string: '70M'
  remaining characters: ''
  numeric value: 70000000
Formatting output:
  value: 70000000
  humanized: '67M'
Hello 67MFoo World
===

Comments are welcomed,
-gordon

numfmt3.patch.gz
Description: GNU Zip compressed data
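The "Extracting Fields" step shown in the --debug trace can be sketched in Python (illustrative only; whitespace handling is simplified here, while numfmt also honors a custom -d delimiter and preserves the original spacing):

```python
def split_field(line, field=1):
    """Return (prefix, the Nth whitespace-separated field, suffix), so
    that only that field gets reformatted and the rest passes through."""
    parts = line.split()
    prefix = " ".join(parts[:field - 1])
    suffix = " ".join(parts[field:])
    return prefix, parts[field - 1], suffix

print(split_field("Hello 70MFoo World", field=2))
# -> ('Hello', '70MFoo', 'World')
```

After this split, the trace shows the remaining pipeline: trim the user-supplied --suffix ('Foo'), parse the bare number with the --from scale, then re-emit it with the --to scale and re-attach prefix, suffix, and trimmed suffix.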
Re: Command-line program to convert 'human' sizes?
Assaf Gordon wrote, On 12/05/2012 06:13 PM:
> Attached is a working version of numfmt.

Somewhat related: how do I add a rule to build the man page (I'm working on passing 'make syntax-check')?

I've added the following line to 'man/local.mk':

man/numfmt.1:src/numfmt

But it doesn't get built (after bootstrap+configure+make).

Thanks,
-gordon
Re: Command-line program to convert 'human' sizes?
On 12/05/12 19:58, Jim Meyering wrote:
> Assaf Gordon wrote:
>> Somewhat related: how do I add a rule to build the man page (I'm
>> working on passing 'make syntax-check')? I've added the following line
>> to 'man/local.mk': man/numfmt.1:src/numfmt
>> But it doesn't get built (after bootstrap+configure+make).
> You'll have to add numfmt to the list of programs in
> build-aux/gen-lists-of-programs.sh
> Then, be sure to run "make syntax-check", and it'll cross-check a few
> other things that have to be synced with that list.

I already added it to gen-lists-of-programs.sh (under normal_progs) - and the compilation works fine. I've also added a line to tests/local.mk, and "make check" works fine. But the man page part seems to be ignored.

The strange thing is that make doesn't complain about the job; it simply ignores it:
===
$ grep numfmt man/local.mk
man/numfmt.1:src/numfmt
$ ls man/numfmt.1
ls: cannot access man/numfmt.1: No such file or directory
$ make man/numfmt.1
gordon@tango:~/projects/coreutils$ ls man/numfmt.1
ls: cannot access man/numfmt.1: No such file or directory
===
When I add -d to make, it ends with these messages:
===
$ make -d man/numfmt.1
[snip]
Prerequisite `src/numfmt.o' is older than target `src/numfmt'.
Prerequisite `src/libver.a' is older than target `src/numfmt'.
Prerequisite `lib/libcoreutils.a' is older than target `src/numfmt'.
Prerequisite `lib/libcoreutils.a' is older than target `src/numfmt'.
Prerequisite `src/.dirstamp' is older than target `src/numfmt'.
No need to remake target `src/numfmt'.
Finished prerequisites of target file `man/numfmt.1'.
Must remake target `man/numfmt.1'.
Successfully remade target file `man/numfmt.1'.
===
But the man file is not created.

Thanks,
-gordon
Re: Command-line program to convert 'human' sizes?
On 12/05/12 20:49, Assaf Gordon wrote:
> On 12/05/12 19:58, Jim Meyering wrote:
>> Assaf Gordon wrote:
>>> Somewhat related: how do I add a rule to build the man page (I'm
>>> working on passing 'make syntax-check').
>> You'll have to add numfmt to the list of programs in
>> build-aux/gen-lists-of-programs.sh
> I already added it to gen-lists-of-programs.sh (under normal_progs) -
> and the compilation works fine. I've also added a line to tests/local.mk
> and "make check" works fine. But the man page part seems to be ignored.

Never mind - figured it out: a stub man/numfmt.x file is required.
Re: Command-line program to convert 'human' sizes?
Hello,

With the attached patch, numfmt passes "make syntax-check" and almost passes "make check" and "make distcheck".

Regarding the checks: tests/misc/numfmt.pl passes all tests successfully. But:

1. When running "make check", tests/df/total-verify.sh fails, so the check isn't complete.
2. When running "make check TESTS=tests/misc/numfmt VERBOSE=yes", the test script passes, but the process later fails with this error:

$ make check TESTS=tests/misc/numfmt VERBOSE=yes
  GEN    public-submodule-commit
make check-recursive
make[1]: Entering directory `/home/gordon/projects/coreutils'
Making check in po
make[2]: Entering directory `/home/gordon/projects/coreutils/po'
make[2]: Leaving directory `/home/gordon/projects/coreutils/po'
Making check in .
make[2]: Entering directory `/home/gordon/projects/coreutils'
make check-TESTS check-local
make[3]: Entering directory `/home/gordon/projects/coreutils'
make[4]: Entering directory `/home/gordon/projects/coreutils'
PASS: tests/misc/numfmt.pl
=============
1 test passed
=============
make[4]: Leaving directory `/home/gordon/projects/coreutils'
  GEN    check-README
  GEN    check-duplicate-no-install
  GEN    sc-avoid-builtin
  GEN    sc-avoid-io
  GEN    sc-avoid-non-zero
  GEN    sc-avoid-path
  GEN    sc-avoid-timezone
  GEN    sc-avoid-zeroes
  GEN    sc-exponent-grouping
  GEN    sc-lower-case-var
  GEN    sc-use-small-caps-NUL
  GEN    check-texinfo
make[3]: Leaving directory `/home/gordon/projects/coreutils'
make[2]: Leaving directory `/home/gordon/projects/coreutils'
Making check in gnulib-tests
make[2]: Entering directory `/home/gordon/projects/coreutils/gnulib-tests'
make check-recursive
make[3]: Entering directory `/home/gordon/projects/coreutils/gnulib-tests'
[snip]
make[5]: Leaving directory `/home/gordon/projects/coreutils/gnulib-tests'
make check-TESTS
make[5]: Entering directory `/home/gordon/projects/coreutils/gnulib-tests'
make[6]: Entering directory `/home/gordon/projects/coreutils/gnulib-tests'
make[6]: *** No rule to make target `tests/misc/numfmt.log', needed by `test-suite.log'.  Stop.
make[6]: Leaving directory `/home/gordon/projects/coreutils/gnulib-tests'
make[5]: *** [check-TESTS] Error 2
make[5]: Leaving directory `/home/gordon/projects/coreutils/gnulib-tests'
make[4]: *** [check-am] Error 2
make[4]: Leaving directory `/home/gordon/projects/coreutils/gnulib-tests'
make[3]: *** [check-recursive] Error 1
make[3]: Leaving directory `/home/gordon/projects/coreutils/gnulib-tests'
make[2]: *** [check] Error 2
make[2]: Leaving directory `/home/gordon/projects/coreutils/gnulib-tests'
make[1]: *** [check-recursive] Error 1
make[1]: Leaving directory `/home/gordon/projects/coreutils'
make: *** [check] Error 2

## Strangely, the log file does exist:
$ ls -l tests/misc/numfmt.log
-rw-r--r-- 1 gordon gordon 1069 Dec  5 21:51 tests/misc/numfmt.log

Any advice is appreciated,
-gordon

numfmt4.patch.gz
Description: GNU Zip compressed data
Suggestion: update README/HACKING regarding tests
As per: http://lists.gnu.org/archive/html/coreutils/2012-09/msg00144.html , perhaps you'll agree to update README/HACKING about how to run individual tests:

===
diff --git a/HACKING b/HACKING
index de8cd7b..01e7605 100644
--- a/HACKING
+++ b/HACKING
@@ -438,9 +438,11 @@
 Nearly every significant change must be accompanied by a test suite
 addition that exercises it. If you fix a bug, add at least one test that
 fails without the patch, but that succeeds once your patch is applied.
 If you add a feature, add tests to exercise as much of the new code
-as possible. Note to run tests/misc/new-test in isolation you can do:
+as possible. If you add a new test file (as opposed to adding a test to an
+existing test file) add the new test file to 'tests/local.mk'.
+Note to run tests/misc/new-test in isolation you can do:

-  (cd tests && make check TESTS=misc/new-test VERBOSE=yes)
+  make TESTS=tests/misc/new-test SUBDIRS=. VERBOSE=yes

 Variables that are significant for tests with their default values are:
diff --git a/README b/README
index 21c9b03..15ed29b 100644
--- a/README
+++ b/README
@@ -176,7 +176,7 @@
 in verbose mode for each failing test. For example, if the
 test that fails is tests/misc/df, then you would run this command:

-  (cd tests && make check TESTS=misc/df VERBOSE=yes) >log 2>&1
+  make check TESTS=tests/misc/df SUBDIRS=. VERBOSE=yes >log 2>&1

 For some tests, you can get even more detail by adding DEBUG=yes.
 Then include the contents of the file 'log' in your bug report.
===

Regards, -gordon
Re: Command-line program to convert 'human' sizes?
Thank you for your feedback. I'm working on fixing those issues. Some comments/questions: Pádraig Brady wrote, On 12/06/2012 06:59 PM: I noticed This command will core dump: $ /bin/ls -l | src/numfmt --to-unit=1 --field=5 snip so I'm thinking `numfmt` should support --header too. I'll add --header. The following should essentially be a noop with this data, but notice how the original spacing wasn't taken into account, and thus the alignment is broken: $ /bin/ls -l | tail -n+2 | head -n3 | src/numfmt --to-unit=1 --field=5 -rw-rw-r--. 1 padraig padraig 93787 Aug 23 2011 ABOUT-NLS -rw-rw-r--. 1 padraig padraig 49630 Dec 6 22:32 aclocal.m4 -rw-rw-r--. 1 padraig padraig 3669 Dec 6 22:29 AUTHORS I'm a bit wary of adding automatic/heuristic kind of padding - could lead to some weird outputs, and also (when combined with header) will not produce proper output (because the header will be skipped, but the lines would re-padded?). Wouldn't it be better to either force the user to specify '--padding', or switch from 'white-space' to an explicit delimiter, and then let expand handle the expanding correctly? e.g. === $ cat white-space-data.txt | \ sed 's/ */\t/g' | \ numfmt --field=5 --delimiter=$'\t' --to=SI | \ expand output === A bit more convoluted, but more reliable? With this the alignment is broken as before, but I also notice the differing width output of each number. $ /bin/ls -l | tail -n+2 | head -n3 | src/numfmt --to=SI --field=5 -rw-rw-r--. 1 padraig padraig 94k Aug 23 2011 ABOUT-NLS -rw-rw-r--. 1 padraig padraig 50k Dec 6 22:32 aclocal.m4 -rw-rw-r--. 1 padraig padraig 3.7k Dec 6 22:29 AUTHORS Again this is the automatic padding issue - For example 94K vs 3.7K - should we always pad SI/IEC output to 5 characters (e.g. 94K) even if the user didn't specify padding? This would conflict with non-whitespace delimiters... 
e.g.: Hello:94000:world Would be converted to: Hello:space94K:world Which is not intuitive at all Or perhaps the whole 'auto' padding should be enabled IFF delimiter is not specified (and defaults to white-space) ? Notice in the above I've used capital K for SI. I think human() from gnulib may be using k for 1000 and K for 1024. That's non standard and ambiguous and I see no need to do that. So for IEC we'd have: $ /bin/ls -l | tail -n+2 | head -n3 | src/numfmt --to=IEC --field=5 -rw-rw-r--. 1 padraig padraig 3.6Ki Dec 6 22:29 AUTHORS I tried to use 'human_readable()' as-is, but I guess this is not sufficient. I'll duplicate the code, and modify it to avoid this issue (lower/upper case K, and the i suffix) Another thing I thought of there, was it would be good to be able to parse number formats that it can generate: Sounds like two separate (but related) issues: $ echo '1,234' | src/numfmt --from=auto src/numfmt: invalid suffix in input '1,234': ',234' 1. Is there already a gnulib function that can accept locale-grouped values? can the xstrtoXXX functions handle that? $ echo '3.7K' | src/numfmt --from=auto src/numfmt: invalid suffix in input '3.7K': '.7K' 2. Would you recommend switching internal representation to doubles (from the current uintmax_t), or just add special code to detect decimal point (which, as Bernhard mentioned, is also locale dependent). While I said before it would be better to error rather than warn on parse error, on consideration it's probably best to write a warning to stderr on parse error, and leave the original number in place. I'll change the code accordingly. Regarding Bernhard's comments (from a different email): Bernhard Voelker wrote, On 12/07/2012 03:25 AM: On 12/07/2012 12:59 AM, Pádraig Brady wrote: Therefore this is my first test: $ echo 11505426432 | src/numfmt 11505426432 Hmm, shouldn't it converting that to a human-readable number then? 
;-) From Pádraig's original specification ( http://lists.gnu.org/archive/html/coreutils/2012-02/msg00085.html ) I assumed that the default of both --from and --to is not to scale - So one needs to explicitly use --to or --from. But those defaults can be changed, if you prefer. Looking at scale_from_args: I'd favor lower-case arguments, i.e. si and iec instead of SI and IEC. WDYT? I'll change those. Regarding the help text and documentation: I copied many of the texts from previous emails (the Reformat numbers like 11505426432 to the more human-readable 11G comes verbatim from one of Jim Meyering's emails) - all of them would require better phrasing later. Thanks, -gordon
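As a concrete sketch of the delimiter-plus-expand idea discussed above (using the option spellings of the numfmt that later shipped in coreutils - --delimiter/--field/--to - with made-up file names, and tr standing in for the sed invocation):

```shell
# Collapse runs of spaces to single tabs, convert one field, then re-expand.
printf 'fileA 93787\nfileB 49630\n' |
  tr -s ' ' '\t' |
  numfmt --delimiter="$(printf '\t')" --field=2 --to=si |
  expand -t 12
```

This keeps numfmt's own logic delimiter-based and leaves the whitespace re-alignment entirely to expand.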
git format-patch question
Hello, (picking up from a different thread) Pádraig Brady wrote, On 12/06/2012 06:59 PM: Generally it's best to get git to send email or send around formats that git can apply directly, which includes commit messages and references new files etc. The handiest way to do that is: git format-patch --stdout -1 | gzip > numfmt.5.patch.gz While working on my development branch, I commit small, specific changes, as so: [PATCH 1/6] numfmt: a new command to format numbers [PATCH 2/6] numfmt: change SI/IEC parameters to lowercase. [PATCH 3/6] numfmt: separate debug/devdebug options. [PATCH 4/6] numfmt: fix segfault when no numbers are found. [PATCH 5/6] numfmt: improve --field, add more tests. [PATCH 6/6] numfmt: add --header option. Each commit can be just a few lines. When I send a patch to the mailing list, I want to send one 'nice', 'clean' patch with my changes, compared to the master branch. When I use the following command: git diff -p --stat master..HEAD > my.patch all the changes (multiple commits) I made on my branch compared to master are represented as one coherent change in my.patch - but this is not convenient for you to apply. However, when I use git format-patch --stdout -1 > my.patch only the last commit appears. The alternative: git format-patch --stdout master..HEAD > my.patch generates a file which will cause multiple commits when imported with 'git am'. What is the recommended way to generate a clean patch which will consolidate all my small commits into one? Or is there another way? Thanks, -gordon
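One answer (a sketch - the repository, branch names, and commit messages below are all made up for illustration) is to squash the whole feature branch into a single commit on a throwaway branch, and then run git format-patch on that one commit:

```shell
set -e
# Build a tiny throwaway repo with a 2-commit feature branch (illustration only).
dir=$(mktemp -d); cd "$dir"
git init -q repo && cd repo
git config user.email dev@example.com && git config user.name dev
echo base > f && git add f && git commit -qm 'base'
git branch -m trunk                      # normalize the initial branch name
git checkout -qb feature
echo one >> f && git commit -qam 'numfmt: step 1'
echo two >> f && git commit -qam 'numfmt: step 2'
# Collapse the whole branch into one staged change on top of trunk:
git checkout -q trunk
git checkout -qb consolidated
git merge -q --squash feature
git commit -qm 'numfmt: a new command to format numbers'
# One commit, one 'git am'-able patch:
git format-patch --stdout trunk..consolidated > my.patch
```

The same result can be had with an interactive rebase (`git rebase -i trunk`, marking all but the first commit as "squash"); the merge --squash route just avoids editing the todo list.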
Adding tests for non-C locales
Hello, I want to add tests for non-C locales (to check grouping in numfmt). My test script is written in Perl, based on tests/misc/wc.pl . It starts with: === @ENV{qw(LANGUAGE LANG LC_ALL)} = ('C') x 3; === Which is fine for most of the tests. How do I add tests for a non-C locale in a safe manner? (I need a locale whose thousands-grouping separator character I know, but I can't know in advance if it's installed on the testing machine.) Thanks, -gordon
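A common approach (sketched here in shell rather than in the test suite's Perl; fr_FR.utf8 is just an example locale name) is to probe for the locale at run time and skip the locale-dependent tests when it is missing, querying the separator instead of hardcoding it:

```shell
# Look for an installed UTF-8 French locale; skip grouping tests otherwise.
loc=$(locale -a 2>/dev/null | grep -iE '^fr_FR\.utf-?8$' | head -n1)
if [ -n "$loc" ]; then
  # Ask the locale itself for its separator rather than assuming one.
  sep=$(LC_ALL=$loc locale thousands_sep)
  status=found
else
  status=skip   # locale not installed on this machine
fi
echo "$status"
```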
numfmt: locale/grouping input issue
Hello, (Continuing a previously discussed issue - accepting input values with locale grouping separators) Pádraig Brady wrote, On 12/07/2012 01:09 PM: On 12/07/2012 03:07 PM, Assaf Gordon wrote: Another thing I thought of there, was it would be good to be able to parse number formats that it can generate: Sounds like two separate (but related) issues: $ echo '1,234' | src/numfmt --from=auto src/numfmt: invalid suffix in input '1,234': ',234' 1. Is there already a gnulib function that can accept locale-grouped values? can the xstrtoXXX functions handle that? I was thinking you would just strip out localeconv()->thousands_sep before parsing. I couldn't find an example of a coreutil program that readily accepts locale'd input. While dots and commas (US/DE locales) are relatively easy to handle, in the French locale the separator is a space - causing a conflict when assuming the default field separator is also white space. Another complication is that just stripping out the 'thousands_sep' character would treat text such as 1,3,4,5,6 as the valid number 13456. I would suggest at first not to accept locale'd input, or only offer partial support. WDYT? Thanks, -gordon A couple of examples:

# Output is OK
$ LC_ALL=fr_FR.utf8 ./src/printf "%'d\n" 1000
1 000

# Input is not valid
$ LC_ALL=fr_FR.utf8 ./src/printf "%'d\n" "1 000"
./src/printf: 1 000 : valeur non complètement convertie
1

# Sort can't handle locale'd input; it treats the white-space as a field
# separator, not as a thousands separator.
$ printf '1 123\n1 000\n' | LC_ALL=fr_FR.utf8 sort --debug -k1,1
sort: utilise les règles de tri « fr_FR.utf8 »
sort: leading blanks are significant in key 1; consider also specifying 'b'
1 000
_
_
1 123
_
_
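A stricter alternative to blindly stripping thousands_sep (a sketch - it hardcodes the comma, so it sidesteps rather than solves the French-space case) is to validate the grouping pattern before removing the separators, which rejects inputs like 1,3,4,5,6:

```shell
# Accept only well-formed groups of three digits, then strip the separators.
is_grouped() { printf '%s\n' "$1" | grep -qE '^[0-9]{1,3}(,[0-9]{3})*$'; }
degroup()    { printf '%s' "$1" | tr -d ','; }

for n in '1,234,567' '1,3,4,5,6'; do
  if is_grouped "$n"; then
    echo "$n -> $(degroup "$n")"     # 1,234,567 -> 1234567
  else
    echo "$n -> rejected"            # 1,3,4,5,6 -> rejected
  fi
done
```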
numfmt (=print 'human' sizes) updates
Hello, Attached is an updated version of 'numfmt'. (The patch should be compatible with git am). Most of the previously raised issues have been addressed, except handling locale'd grouping in the input numbers (locale'd decimal-point is handled correctly). Added support for header, auto-whitespace-padding, and floating-point input. Internally, all values are now stored as long double (instead of the previous uintmax_t) - this enables working with Yotta-scale values. The following should now 'just work':

df | ./src/numfmt --header --field 2 --to=si
ls -l | ./src/numfmt --header --field 5 --to=iec
ls -lh | ./src/numfmt --header --field 5 --from=iec --padding=10

The --debug option now behaves more like sort's --debug: it prints messages to STDERR about possible bad combinations and inputs (which are not fatal errors):

$ ./src/numfmt --debug 6
./src/numfmt: no conversion option specified
6

The --devdebug option can be used to show internal states (perhaps it will be removed once the program is finalized?). The test file 'tests/misc/numfmt.pl' contains many more tests and details about possible inputs/outputs. If the functionality is acceptable, the next steps are cleaner code and better documentation. Comments are welcome, -gordon numfmt.7.patch.gz Description: GNU Zip compressed data
[PATCH] two minor tweaks to HACKING
The first mentions 'git stash' in a relevant paragraph. The second changes parameters for 'lcov' example - the current parameters produce wrong output (the source files are not found, with LCOV version 1.9 ). -gordon From e1ece5ff278258a18a078cad1d8fbf65c7e4fe71 Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Thu, 13 Dec 2012 11:42:01 -0500 Subject: [PATCH 1/2] doc: mention git stash in HACKING --- HACKING |2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/HACKING b/HACKING index f3f961a..84e9707 100644 --- a/HACKING +++ b/HACKING @@ -120,6 +120,8 @@ Note 2: sometimes the checkout will fail, telling you that your local modifications conflict with changes required to switch branches. However, in any case, you will *not* lose your uncommitted changes. +Run git stash to temporarily hide uncommited changes in your +local directory, restoring a clean working directory. Anyhow, get back onto your just-created branch: -- 1.7.7.4 From 8cd8f40882daa165ced8091697c158c7afb479d6 Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Thu, 13 Dec 2012 14:20:47 -0500 Subject: [PATCH 2/2] doc: tweak 'lcov' in HACKING Use the correct -b (--base-directory) parameter. --- HACKING |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/HACKING b/HACKING index 84e9707..8e4243f 100644 --- a/HACKING +++ b/HACKING @@ -610,8 +610,8 @@ to generate HTML coverage reports. Follow these steps: # run whatever tests you want, i.e.: make check # run lcov - lcov -t coreutils -q -d lib -b lib -o lib.lcov -c - lcov -t coreutils -q -d src -b src -o src.lcov -c + lcov -t coreutils -q -d lib -b `pwd` -o lib.lcov -c + lcov -t coreutils -q -d src -b `pwd` -o src.lcov -c # generate HTML from the output genhtml -p `pwd` -t coreutils -q --output-directory lcov-html *.lcov -- 1.7.7.4
Re: numfmt (=print 'human' sizes) updates
Hello, Attached is a slightly improved patch - minor code changes, and many more tests. Line coverage is 98%, and branch coverage is now 93% , and most of the non-covered branches are simply unreachable (I'm checking the reachable ones). The comments below still apply. - gordon Assaf Gordon wrote, On 12/13/2012 01:02 AM: Most of the previously raised issues have been addressed, except handling locale'd grouping in the input numbers (locale'd decimal-point is handled correctly). Added support for header, auto-whitespace-padding, floating-point input . Internally, all values are now stored as long double (instead of previously uintmax_t) - enables working with Yotta-scale values. The following should now 'just work' : df | ./src/numfmt --header --field 2 --to=si ls -l | ./src/numfmt --header --field 5 --to=iec ls -lh | ./src/numfmt --header --field 5 --from=iec --padding=10 The --debug option now behaves more like sort's --debug: prints messages to STDERR about possible bad combinations and inputs (which are not fatal errors): $./src/numfmt --debug 6 ./src/numfmt: no conversion option specified 6 The --devdebug option can be used to show internal states (perhaps will be removed once the program is finalized?). The test file 'tests/misc/numfmt.pl' contains many more tests and details about possible inputs/outputs. If the functionality is acceptable, the next steps are cleaner code and better documentations. numfmt.8.patch.xz Description: application/xz
Re: [PATCH 2/2] doc: tweak 'lcov' in HACKING
Hello Bernhard, Bernhard Voelker wrote, On 12/14/2012 03:29 AM: splitting the discussion about the 2 patches ... On 12/13/2012 08:29 PM, Assaf Gordon wrote: [...] The second changes parameters for 'lcov' example - the current parameters produce wrong output (the source files are not found, with LCOV version 1.9 ). Thanks. [PATCH 2/2] doc: tweak 'lcov' in HACKING I also noticed the lcov issue recently, but didn't find the time to fix HACKING. Furthermore, I'm not sure if lcov-1.9 is the reason for the problem - I think it worked some time ago ... and according to 'rpm -q --changelog lcov', I already have 1.9 since about Jan or Feb 2011. I think the reason might be the new non-recursive build system. And furthermore, the second lcov call still fails here: $ lcov -t coreutils -q -d src -b `pwd` -o src.lcov -c built-in:cannot open source file geninfo: ERROR: cannot read built-in.gcov! Don't you get that, too? The following commands work on my system, generating a coverage report from a clean repository:

git clone git://git.sv.gnu.org/coreutils.git coreutils_test
cd coreutils_test/
./bootstrap
./configure CFLAGS="-g -fprofile-arcs -ftest-coverage"
make -j 4
make -j 4 check
lcov -t coreutils -q -d lib -b `pwd` -o lib.lcov -c
lcov -t coreutils -q -d src -b `pwd` -o src.lcov -c
genhtml -p `pwd` -t coreutils -q --output-directory lcov-html *.lcov

The two lcov invocations do produce some warnings, like so: $ lcov -t coreutils -q -d lib -b `pwd` -o lib.lcov -c geninfo: WARNING: no data found for /home/gordon/temp/coreutils_test/lib/mbuiter.h snip Cannot open source file parse-datetime.y Cannot open source file parse-datetime.c $ lcov -t coreutils -q -d src -b `pwd` -o src.lcov -c geninfo: WARNING: no data found for /home/gordon/temp/coreutils_test/lib/stat-time.h geninfo: WARNING: no data found for /home/gordon/temp/coreutils_test/lib/mbchar.h snip geninfo: WARNING: no data found for /home/gordon/temp/coreutils_test/lib/openat.h geninfo: WARNING: no data found for
/usr/include/gmp-x86_64.h geninfo: WARNING: no data found for /home/gordon/temp/coreutils_test/lib/stat-time.h geninfo: WARNING: no data found for /home/gordon/temp/coreutils_test/lib/timespec.h But I assume this is normal/acceptable (if those files weren't covered in the tests). If this flow doesn't work reliably on all systems, then the HACKING needs more tweaking... Thanks, -gordon
[PATCH] maint: ignore GCC coverage files
Hello, Related to the updated coverage documentation, perhaps update the .gitignore to ignore the generated coverage files? Another possible addition is ignoring src.lcov, lib.lcov and lcov-html/* - but those file names are not fixed, just the recommended file names in HACKING . -gordon From eb54c8adf123481f3231aeb40e1b4ff38288b9af Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Fri, 14 Dec 2012 13:27:26 -0500 Subject: [PATCH] maint: update gitignore entries * .gitignore: ignore GCC coverage data files. --- .gitignore |2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/.gitignore b/.gitignore index 5ce2361..15b77e9 100644 --- a/.gitignore +++ b/.gitignore @@ -170,3 +170,5 @@ Makefile.in TAGS THANKS THANKS-to-translators +*.gcno +*.gcda -- 1.7.7.4
Re: [PATCH] doc: mention git stash in HACKING
Bernhard Voelker wrote, On 12/14/2012 03:29 AM: On 12/13/2012 08:29 PM, Assaf Gordon wrote: [PATCH 1/2] doc: mention git stash in HACKING I tweaked the commit message a bit: even if the change is trivial and the subject is HACKING, it's good practice to mention it in a line below which describes the change in detail. Thanks! I keep on learning... -gordon
Re: enhancement suggestions for sort and text editor
Hello John, Eric Blake wrote, On 12/14/2012 04:19 PM: On 12/14/2012 02:02 PM, john wrote: In particular I wish to enter text into predefined (fixed location) fields in a record as opposed to variably delimited fields. In other words emulate the punched card record where card columns are assigned to particular data character columns. Those card columns then just become a text column range of a single record. If I could just set perhaps 10 arbitrary tab stops (in any simple editor), it would be sufficient for this purpose. The tab key would just advance to the next stop in succession, tho not necessarily regularly spaced. Sounds like you are talking in part about 'expand -t', which lets you re-expand tabs according to your choice of pre-defined stops. Beyond that, your question is out of scope for coreutils, and better directed to the editor of your choice (I'm quite sure that emacs is probably going to have something that does what you want, although I don't use arbitrary tab stops enough to be able to tell you off-hand how to get at that feature) (Slightly off-topic for coreutils, but for completeness:) Recent versions of GNU awk (gawk) support exactly this kind of processing:

$ printf 'xx3\n1234yy98765\n' | gawk -v FIELDWIDTHS='4 4 2 5' '{print $1,$2,$3,$4}'
xx 3 1234 yy 98765

Or with Tab-separated output:

$ printf 'xx3\n1234yy98765\n' | gawk -v FIELDWIDTHS='4 4 2 5' -v OFS='\t' '{print $1,$2,$3,$4}'
xx 3 1234yy 98765

More information here: http://www.gnu.org/software/gawk/manual/html_node/Constant-Size.html -gordon
Re: numfmt (=print 'human' sizes) updates
Hello, Attached is a first shot at documenting 'numfmt' . Comments are welcomed, -gordon numfmt_doc.patch.gz Description: GNU Zip compressed data
Re: numfmt (=print 'human' sizes) updates
Hello Pádraig, Thanks for the review and the feedback. Pádraig Brady wrote, On 12/21/2012 12:42 PM: It's looking like a really useful cohesive command. The attached patch addresses the following issues: 1. changed the ---devdebug option 2. incorporated most of 'indent -nut' recommendations. 3. improved the usage string, to generate a somewhat better man page (help2man is a bit finicky about formatting, so it's not optimal). 4. 'i' suffix with iec input/output: I understand the reason for always adding 'i', as it is The Right Thing to do. But I think some (most?) people still treat a single-letter suffix (K/M/G/T/etc.) as a valid suffix for both SI and IEC, and deduce the scale from the context of whatever they're working on. Forcing them to use 'Ki' might be too obtrusive. It could also mess up automated scripts, when the IEC scale is needed but only a single-letter suffix is produced (and there are many programs like that). As a compromise, I've added yet another scale option: 'ieci'. When used with --from=ieci, a two-letter suffix is required. When used with --to=ieci, 'i' will always be appended. Examples:

$ ./src/numfmt --to=iec 4096
4.0K
$ ./src/numfmt --to=ieci 4096
4.0Ki
$ ./src/numfmt --from=iec 4K
4096
$ ./src/numfmt --from=ieci 4Ki
4096
$ ./src/numfmt --from=auto 4Ki
4096
$ ./src/numfmt --from=auto 4K
4000
# 'ieci' requires the 'i':
$ ./src/numfmt --from=ieci 4K
./src/numfmt: missing 'i' suffix in input: '4K' (e.g. Ki/Mi/Gi)
# 'iec' does not accept 'i':
$ ./src/numfmt --from=iec 4Ki
./src/numfmt: invalid suffix in input '4Ki': 'i'

I hope this covers all the options, while maintaining consistent and expected behavior. (Optionally, we can change 'iec' to behave like 'ieci', and rename 'ieci' to something else). I will send --format and error message updates soon. Regards, -gordon numfmt.9.patch.xz Description: application/xz
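For what it's worth, the variant that was eventually merged into coreutils spells this mode iec-i rather than ieci (the help-text patch later in this thread already uses that spelling), so with an installed numfmt the same demonstration reads:

```shell
# Suffix handling in released GNU numfmt: 'iec' takes a bare letter,
# 'iec-i' expects/produces the trailing 'i', 'auto' treats one letter as SI.
numfmt --to=iec 4096      # 4.0K
numfmt --to=iec-i 4096    # 4.0Ki
numfmt --from=iec-i 4Ki   # 4096
numfmt --from=auto 4K     # 4000
```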
Re: numfmt (=print 'human' sizes) updates
Hello, Pádraig Brady wrote, On 12/21/2012 12:42 PM: I'm starting to think the original idea of having a --format option would be a more concise, well known and extendible interface rather than having --padding, --grouping, --base, ... It wouldn't have to be passed directly to printf, and could be parsed and preprocessed something like we do in seq(1). Regarding the 'format' option, there are some intricacies that are worth discussing: 1. Depending on the requested conversion, the output can be a string (e.g. 1.4Ki) or a long double (e.g. 140). 2. Internally, the program uses long doubles - so the real format is %Lf - regardless of what the user will give (e.g. %f). 3. printf accepts all sorts of options, some of which aren't relevant to numfmt, or only relevant when printing non-humanized values. e.g.:

$ LC_ALL=en_US.utf8 seq -f "%0'14.5f" 1000 1001
0001,000.0
0001,001.0

4. The assumption was that humanized numbers are always maximum 4 characters in SI/IEC (e.g. 1024 or 4.5M) or 5 characters with iec-i (e.g. 999Ti). With the new 'format', if given %'2.9f - should the output still be 4 characters (e.g. 4.5T), or respect the .9 format (e.g. 4.5T)? And does the suffix character count in the 2.9 format? My preference is to keep things simple, and accept just a limited subset of the format syntax: 1. grouping (the ' character) 2. padding (the number after '%' and before the 'f') 3. alignment (optional '-' after '%') 4. any prefix/suffix before/after the '%' option. 5. Accept just %f, but internally treat it as '%s' or '%Lf', depending on the output. All other options will be silently ignored, or trigger errors. Example:

$ numfmt --format "xx%20fxx" --to=si 5000
[[ internally, treats as --padding 20 ]]
xx5.0Kxx
$ numfmt --format "xx%'-10fxx" 5000
[[ internally, treats as --padding -10 --grouping ]]
xx5,000 xx
$ numfmt --format "xx%0#'+010llfxx" 5000
[[ reject as 'too complicated' / unsupported printf options ]]

WDYT? -gordon
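The limited subset described above is simple enough to parse with plain shell parameter expansion - a sketch with a made-up helper name, handling only the prefix/suffix, grouping, and padding pieces (a real implementation would also reject unsupported conversion characters):

```shell
# Extract prefix, grouping flag, padding width and suffix from a format
# like "xx%'-10fxx"; everything between '%' and 'f' is the spec.
parse_format() {
  fmt=$1
  prefix=${fmt%%%*}                    # text before the '%'
  suffix=${fmt#*f}                     # text after the 'f'
  spec=${fmt#*%}; spec=${spec%%f*}     # what sits between '%' and 'f'
  grouping=no
  case $spec in
    *"'"*) grouping=yes; spec=$(printf '%s' "$spec" | tr -d "'") ;;
  esac
  padding=${spec:-0}                   # empty spec means no padding
  echo "prefix=$prefix suffix=$suffix grouping=$grouping padding=$padding"
}

parse_format "xx%'-10fxx"   # prints: prefix=xx suffix=xx grouping=yes padding=-10
```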
Re: numfmt (=print 'human' sizes) updates
Hello, Assaf Gordon wrote, On 12/26/2012 05:40 PM: Attached is an updated numfmt, with the following two changes: 1. --format support 2. optionally ignoring input errors. The attached patch (incremental to the above full patch) adds a few more tests, and fixes 4 issues found with the clang static analyzer. There's no change in functionality. -gordon numfmt.11.patch.xz Description: application/xz
Re: numfmt (=print 'human' sizes) updates
Hello, The attached patch adds 'numfmt' to the coreutils documentation. Regards, -gordon numfmt.12.patch.xz Description: application/xz
Re: Sort with header/skip-lines support
Pádraig Brady wrote, On 01/10/2013 07:11 PM: On 01/10/2013 09:57 PM, Assaf Gordon wrote: I'd like to re-visit an old issue: adding header-line/skip-lines support to 'sort'. [...] [2] - no pipe support: http://lists.gnu.org/archive/html/bug-coreutils/2007-07/msg00215.html But recent sed can be used for this like: `sed -u 1q` http://git.sv.gnu.org/gitweb/?p=sed.git;a=commit;h=737ca5e Note that commit is 4 years old, but only the recently released sed 4.2.2 contains it. Thanks for the tip. The following indeed works with sed 4.2.2 (on linux 3.2):

$ ( echo 99 ; seq 10 ) | ( sed -u 1q ; sort -n )

But I'm wondering (as per the link above [2]) if this is posix compliant and stable (i.e. can this be trusted to work every time, even on non-linux machines?). [3] - Jim's patch: http://lists.gnu.org/archive/html/coreutils/2010-11/msg00091.html Thanks for collating the previous threads on this subject. I'm on the fence on how warranted this is TBH. We'd need stronger arguments for it I think. I'll collate the arguments as well :) If the sed method works reliably, that leaves error checking: how does one reliably check for an error in such a pipe (inside a posix shell script)? The closest code I found is this: https://github.com/cheusov/pipestatus which seems very long. So additional arguments are: 1. robust error checking 2. simplicity of use: if 'sort' had this option built-in, the following use cases would just work. With sed+sort, it will require different invocations (and probably different pitfalls): a. one input file b. one input pipe c. multiple input files (without resorting to a pipe, as this will cause 'sort' to use a different amount of memory) d. specifying an output file (with -o) Thanks, -gordon As a side note, I have a hackish Perl script that wraps sort and consumes the first line, and it's basically a works-for-me kind of script - but I just wish it wasn't necessary: https://github.com/agordon/bin_scripts/blob/master/scripts/sort-header.in
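For the pipe use case listed above, a small wrapper avoids sed -u altogether (a sketch; it relies on the shell reading a non-seekable stdin byte by byte, so read consumes exactly the header line and leaves the rest of the stream for sort):

```shell
# Pass the first line through untouched, sort everything after it.
sort_with_header() {
  IFS= read -r header || return 1
  printf '%s\n' "$header"
  sort "$@"
}

printf 'count\n10\n2\n33\n' | sort_with_header -n
```

Because sort is the last command in the function, the function's exit status is sort's, which also addresses part of the error-checking concern.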
Re: Sort with header/skip-lines support
Hello Pádraig, Your suggestions work for all the cases I needed, so essentially there's already a way to do sort+header - much appreciated! Pádraig Brady wrote, On 01/11/2013 01:13 PM: On 01/11/2013 04:10 PM, Assaf Gordon wrote: Pádraig Brady wrote, On 01/10/2013 07:11 PM: The following indeed works with sed 4.2.2 ( on linux 3.2 ): $ ( echo 99 ; seq 10 ) | ( sed -u 1q ; sort -n ) [2] - no pipe support: http://lists.gnu.org/archive/html/bug-coreutils/2007-07/msg00215.html But I'm wondering (as per the link above [2]) if this is posix compliant and stable (i.e. can this be trusted to work everytime, even on non-linux machines?). No `sed -u` with this functionality is not portable. Though it's more portable than `sort --header` given that it already exists :) Sorry for nitpicking, but just to verify: sed -u is a GNU extension, hence not portable by definition. But what I meant to ask: If I install GNU sed + GNU sort on any machine (e.g. MAC OSX), would it work in a reliable way? Eric Blake's email seemed to suggest this will never be guaranteed to work (even if it works in practice) due to sharing pipes between processes. For completeness, showing the current options for such cases... Thanks for taking the time to write these - very helpful. -gordon
Re: Enhancement suggestion for expand
Hello, Anoop Sharma wrote, On 01/14/2013 04:58 AM: On Mon, Jan 14, 2013 at 5:04 AM, Pádraig Brady p...@draigbrady.com mailto:p...@draigbrady.com wrote: On 09/18/2012 03:18 PM, CoreUtils subscribtion for PLC wrote: I often use expand to format scripts' output by manually setting tab stops. The idea would be to add an option to expand to be able to auto-set tab stops by analyzing the first n lines of text (0 for analyzing the whole stream) so that the TS would be set to the minimum number of spaces to obtain clean columns. This feature is already provided by a separate utility named column, dedicated to columnization, which is available under BSD license. What is not provided there is the ability to analyze only the first n lines. So unless it is about licensing, it may be better to enhance column instead of expand. If licensing is an issue then it may be better to add a utility dedicated to columnization to Coreutils, instead of enhancing expand. For a possible work-around, I'm using Perl+shell wrapper scripts that do exactly what you're asking for. 'detect_tab_stops' reads a single text file and prints a comma-separated list of tab stops, based on the first N lines (default 100). https://github.com/agordon/bin_scripts/blob/master/scripts/detect_tab_stops.in 'atexpand' uses 'detect_tab_stops' to run 'expand' with auto-tabbing. https://github.com/agordon/bin_scripts/blob/master/scripts/atexpand.in 'atless' uses 'detect_tab_stops' to run 'less -S -x' with proper auto-tabbing. https://github.com/agordon/bin_scripts/blob/master/scripts/atless.in The quickest way to install those is probably by taking the entire package ( https://github.com/agordon/bin_scripts ) and running configure, but they are AGPL and you are welcome to take them and modify them. -gordon
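The detection step is small enough to sketch inline with awk (assumptions: tab-separated input, and stops derived from the widest cell in each column over the first 100 lines - a simplification of what the detect_tab_stops script does):

```shell
# Print comma-separated cumulative tab stops for tab-separated input:
# each stop is the running sum of (widest cell in column) + 1.
detect_stops() {
  awk -F'\t' '
    NR <= 100 { for (i = 1; i < NF; i++)
                  if (length($i) > w[i]) w[i] = length($i) }
    END { stop = 0
          for (i = 1; i in w; i++) {
            stop += w[i] + 1
            printf "%s%d", sep, stop; sep = "," }
          print "" }'
}

printf 'ab\tc\td\na\tcc\tdd\n' | detect_stops   # prints: 3,6
```

The resulting list can then be handed to expand -t to get the auto-tabbing effect.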
Sort: optimal memory usage with multithreaded sort
Hello, Sort's memory usage (specifically, sort_buffer_size() ) has been discussed a few times before, but I couldn't find mention of the following issue: If given a regular input file, sort tries to guesstimate the optimal buffer size based on the file size. But this value is calculated for one thread (from before sort became multi-threaded). The default --parallel value is 8 (or less, if fewer cores are available) - which requires more memory. The result is that on a somewhat powerful machine (e.g. 128GB RAM, 32 cores - not uncommon for a computer cluster), sorting a big file (e.g. 10GB) will always allocate too little memory, and will always resort to saving temporary files in /tmp. The disk activity will result in slower sorting times than what could be achieved with an all-memory sort. Based on this: http://lists.gnu.org/archive/html/coreutils/2010-12/msg00084.html , perhaps it would be beneficial to consider the number of threads in the memory allocation? Regards, -gordon
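Until the estimate accounts for --parallel, the buffer can simply be forced by hand with -S (a sketch; the file names are made up and the sizes are tiny here so it runs anywhere - on the 128GB machine above one would pass something like -S 40G):

```shell
# Explicitly size sort's buffer instead of relying on the per-file guess
# (GNU sort options --parallel and -S).
cd "$(mktemp -d)"
seq 1000 | shuf > big-input.txt 2>/dev/null || seq 1000 > big-input.txt
sort -n --parallel=2 -S 1M -o sorted.txt big-input.txt
```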
[PATCH] numfmt: fix help section typo
Hello Pádraig, Thank you for all the recent work on numfmt! I noticed a typo in the help section, in my original program (the suffix says G but the values are mega). Also removes an extra space. Attached is a patch. Thanks! -gordon

From b5b9e3281298fa14d7752579a63bfe3956d982f4 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Tue, 5 Feb 2013 11:04:41 -0500
Subject: [PATCH] numfmt: fix help section typo

* src/numfmt.c: change erroneous G to M.
---
 src/numfmt.c | 10 +-
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/numfmt.c b/src/numfmt.c
index cccd1d1..b37724e 100644
--- a/src/numfmt.c
+++ b/src/numfmt.c
@@ -856,19 +856,19 @@
 UNIT options:\n\
   auto   Accept optional single-letter/two-letter suffix:\n\
            1K = 1000\n\
            1Ki = 1024\n\
-           1G = 1000000\n\
-           1Gi = 1048576\n\
+           1M = 1000000\n\
+           1Mi = 1048576\n\
   si     Accept optional single letter suffix:\n\
            1K = 1000\n\
-           1G = 1000000\n\
+           1M = 1000000\n\
            ...\n\
   iec    Accept optional single letter suffix:\n\
            1K = 1024\n\
-           1G = 1048576\n\
+           1M = 1048576\n\
            ...\n\
   iec-i  Accept optional two-letter suffix:\n\
            1Ki = 1024\n\
-           1Gi = 1048576\n\
+           1Mi = 1048576\n\
            ...\n\
 \n\
 ), stdout);
--
1.7.7.4
csplit - split by content of field
Hello, Attached is a patch that gives 'csplit' the ability to split files by the content of a field. A typical usage is: ## the @1 pattern means start a new file when field 1 changes $ printf 'A\nA\nB\nB\nB\nC\n' | csplit - '@1' '{*}' $ wc -l xx* 2 xx00 3 xx01 1 xx02 6 total $ head xx* ==> xx00 <== A A ==> xx01 <== B B B ==> xx02 <== C This is just a proof of concept, and the pattern specification can be changed (I think @N doesn't conflict with any existing pattern). The same can probably be achieved using other programs (awk comes to mind), but it won't be as simple and clean (with all of csplit's output features). Let me know if you're willing to consider such an addition. Thanks, -gordon From 074614c0764c278e8abd9d41af4ce626fefd6cfc Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Wed, 6 Feb 2013 16:40:00 -0500 Subject: [PATCH] csplit: split files by field-change src/csplit.c: create a new output file whenever field content changes. --- src/csplit.c | 237 -- 1 files changed, 230 insertions(+), 7 deletions(-) diff --git a/src/csplit.c b/src/csplit.c index 22f3ad4..ec725d2 100644 --- a/src/csplit.c +++ b/src/csplit.c @@ -44,6 +44,13 @@ /* The default prefix for output file names. */ #define DEFAULT_PREFIX "xx" +enum csplit_type + { +CSPLIT_LINE, +CSPLIT_REGEXPR, +CSPLIT_FIELD_CHANGE + }; + /* A compiled pattern arg. */ struct control { @@ -53,8 +60,9 @@ struct control int argnum; /* ARGV index. */ bool repeat_forever; /* True if '*' used as a repeat count. */ bool ignore; /* If true, produce no output (for regexp). */ - bool regexpr; /* True if regular expression was used. */ + enum csplit_type type; /* Split type: line/regex/field */ struct re_pattern_buffer re_compiled; /* Compiled regular expression. */ + uintmax_t field; /* Field to monitor for change */ }; /* Initial size of data area in buffers. */ @@ -176,6 +184,16 @@ static size_t control_used; /* The set of signals that are caught. 
*/ static sigset_t caught_signals; +/* If delimiter has this value, blanks separate fields. */ +enum { DELIMITER_DEFAULT = CHAR_MAX + 1 }; + +/* The delimiter to use for field extraction */ +static int delimiter = DELIMITER_DEFAULT; + +/* The content of the field from the last line, to be compared with the + * current line */ +static struct cstring last_field; + static struct option const longopts[] = { {"digits", required_argument, NULL, 'n'}, @@ -185,6 +203,7 @@ static struct option const longopts[] = {"elide-empty-files", no_argument, NULL, 'z'}, {"prefix", required_argument, NULL, 'f'}, {"suffix-format", required_argument, NULL, 'b'}, + {"delimiter", required_argument, NULL, 'd'}, {GETOPT_HELP_OPTION_DECL}, {GETOPT_VERSION_OPTION_DECL}, {NULL, 0, NULL, 0} @@ -867,6 +886,169 @@ process_regexp (struct control *p, uintmax_t repetition) current_line = break_line; } +/* Skip the requested number of fields in the input string. + Returns a pointer to the *delimiter* of the requested field, + or a pointer to NUL (if reached the end of the string). + + NOTE: buf is *not* expected to be a NULL-terminated string. 
+ The end of the string is determined by 'len' */ +static inline char * +__attribute ((pure)) +skip_fields (char *buf, int len, int fields) +{ + static char null_str[] = ""; + + char *ptr = buf; + if (delimiter != DELIMITER_DEFAULT) +{ + if (*ptr == delimiter) +fields--; + while (len && fields--) +{ + while (len && *ptr == delimiter) +{ + ++ptr; + --len; +} + while (len && *ptr != delimiter) +{ + ++ptr; + --len; +} +} +} + else +while (len && fields--) + { +while (len && isblank (*ptr)) + { +--len; +++ptr; + } +while (len && !isblank (*ptr)) + { +++ptr; +--len; + } + } + + if (len == 0) +return null_str; + + return ptr; +} + +static void +set_last_field (const char* str, size_t len) +{ + last_field.len = len; + last_field.str = xrealloc (last_field.str, len); + memcpy (last_field.str, str, len); +} + +static void +reset_last_field (void) +{ + last_field.len = 0; +} + +static void +free_last_field (void) +{ + last_field.len = 0; + free (last_field.str); + last_field.str = NULL; +} + +/* Prints the input line until a field changes its value */ +static void +process_field_change (struct control *p) +{ + struct cstring *line; /* From input file. */ + char *field_start = NULL; + char *field_end = NULL; + size_t field_len; + size_t line_len; + size_t eol_len; /* length from field_start to EOL */ + + create_output_file (); + + reset_last_field (); + + while (true) +{ + line
uniq - check specific fields
Hello, Attached is a proof-of-concept patch to add --check-fields=N to uniq, allowing uniq'ing by specific fields. (Trying a different approach at promoting csplit-by-field [1] :) ). It works just like 'check-chars' but on fields, and if not used, it does not affect the program flow. === # input file, every whole line is unique $ cat input.txt A 1 z A 1 y A 2 x B 2 w B 3 w C 3 w C 4 w # regular uniq $ uniq -c input.txt 1 A 1 z 1 A 1 y 1 A 2 x 1 B 2 w 1 B 3 w 1 C 3 w 1 C 4 w # Stop after 1 field $ uniq -c --check-fields 1 input.txt 3 A 1 z 2 B 2 w 2 C 3 w # Stop after 2 fields $ uniq -c --check-fields 2 input.txt 2 A 1 z 1 A 2 x 1 B 2 w 1 B 3 w 1 C 3 w 1 C 4 w # Skip the first field and check 1 field (effectively, uniq on field 2) $ uniq -c --skip-fields 1 --check-fields 1 input.txt 2 A 1 z 2 A 2 x 2 B 3 w 1 C 4 w # --field is a convenience shortcut for skip & check fields $ uniq -c --field 2 input.txt 2 A 1 z 2 A 2 x 2 B 3 w 1 C 4 w $ uniq -c --field 3 input.txt 1 A 1 z 1 A 1 y 1 A 2 x 4 B 2 w === What do you think? -gordon [1] http://lists.gnu.org/archive/html/coreutils/2013-02/msg00015.html From 08ee89a89d6912c5872a1785b9079d943ad71623 Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Thu, 7 Feb 2013 11:46:22 -0500 Subject: [PATCH] uniq: support uniq-by-field src/uniq.c: add --field and --check-fields=N support --- src/uniq.c | 68 +++- 1 files changed, 67 insertions(+), 1 deletions(-) diff --git a/src/uniq.c b/src/uniq.c index 5efdad7..b7c3dc8 100644 --- a/src/uniq.c +++ b/src/uniq.c @@ -63,6 +63,9 @@ static size_t skip_chars; /* Number of chars to compare. */ static size_t check_chars; +/* Number of fields to compare */ +static size_t check_fields; + enum countmode { count_occurrences, /* -c Print count before output lines. */ @@ -108,6 +111,13 @@ static enum delimit_method const delimit_method_map[] = /* Select whether/how to delimit groups of duplicate lines. 
*/ static enum delimit_method delimit_groups; +/* For long options that have no equivalent short option, use a + non-character as a pseudo short option, starting with CHAR_MAX + 1. */ +enum +{ + UNIQ_FIELD = CHAR_MAX + 1, +}; + static struct option const longopts[] = { {"count", no_argument, NULL, 'c'}, @@ -118,6 +128,8 @@ static struct option const longopts[] = {"skip-fields", required_argument, NULL, 'f'}, {"skip-chars", required_argument, NULL, 's'}, {"check-chars", required_argument, NULL, 'w'}, + {"check-fields", required_argument, NULL, 'y'}, + {"field", required_argument, NULL, UNIQ_FIELD}, {"zero-terminated", no_argument, NULL, 'z'}, {GETOPT_HELP_OPTION_DECL}, {GETOPT_VERSION_OPTION_DECL}, @@ -153,6 +165,8 @@ With no options, matching lines are merged to the first occurrence.\n\ delimit-method={none(default),prepend,separate}\n\ Delimiting is done with blank lines\n\ -f, --skip-fields=N avoid comparing the first N fields\n\ + --field=N check only field N.\n\ +equivalent to '-f (N-1) -y 1'\n\ -i, --ignore-case ignore differences in case when comparing\n\ -s, --skip-chars=N avoid comparing the first N characters\n\ -u, --unique only print unique lines\n\ @@ -160,6 +174,7 @@ With no options, matching lines are merged to the first occurrence.\n\ ), stdout); fputs (_(\ -w, --check-chars=N compare no more than N characters in lines\n\ + -y, --check-fields=N compare no more than N fields in lines\n\ ), stdout); fputs (HELP_OPTION_DESCRIPTION, stdout); fputs (VERSION_OPTION_DESCRIPTION, stdout); @@ -225,6 +240,34 @@ find_field (struct linebuffer const *line) return line->buffer + i; } +/* Given a string and maximum length, + * returns the position after skipping 'check_fields' fields, + * or maximum length (if not enough fields on the input string) */ +static size_t _GL_ATTRIBUTE_PURE +check_fields_length (const char* str, size_t maxlen) +{ + size_t count; + size_t i = 0; + +/* fputs ("check_fields_length(str='", stderr); + fwrite (str, sizeof (char), maxlen, stderr); + fprintf (stderr, "'
len=%zu, check_fields=%zu)\n", maxlen, check_fields); */ + for (count = 0; count < check_fields && i < maxlen; count++) +{ + while (i < maxlen && isblank (to_uchar (str[i]))) +i++; + while (i < maxlen && !isblank (to_uchar (str[i]))) +i++; +} + +/* fprintf (stderr, "result= '"); + fwrite (str, sizeof (char), i, stderr); + fputs ("'\n", stderr); */ + + return i; +} + /* Return false if two strings
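The scanning loop from check_fields_length() above can be exercised standalone. Here is a self-contained rendering of the same logic (renamed, with the debug output dropped, check_fields passed as a parameter instead of a global, and to_uchar replaced by a plain cast):

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Standalone version of the patch's field scan: return the offset just
   past the first CHECK_FIELDS blank-separated fields of STR, or MAXLEN
   if the string has fewer fields than requested. */
static size_t
fields_length (const char *str, size_t maxlen, size_t check_fields)
{
  size_t i = 0;
  for (size_t count = 0; count < check_fields && i < maxlen; count++)
    {
      while (i < maxlen && isblank ((unsigned char) str[i]))
        i++;                    /* skip leading blanks of the field */
      while (i < maxlen && !isblank ((unsigned char) str[i]))
        i++;                    /* skip the field itself */
    }
  return i;
}
```

For the input line "A 1 z" this yields offset 1 after one field, 3 after two, and the full length when more fields are requested than exist, matching the --check-fields examples above.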
Re: new snapshot available: coreutils-8.20.113-1f1f4
Hello Bernhard, Bernhard Voelker wrote, On 02/08/2013 09:53 AM: On February 7, 2013 at 8:57 PM Pádraig Brady p...@draigbrady.com wrote: coreutils snapshot: http://pixelbeat.org/cu/coreutils-8.20.113-1f1f4.tar.xz Hi Padraig, * SLES-10.4 (x86_64): gcc (GCC) 4.1.2 20070115 (SUSE Linux) FAIL: tests/misc/numfmt.pl Regarding the 'numfmt' failures - these are locale-related problems (in both cases). Perhaps I wrote the tests incorrectly. May I ask you to try the following on those systems, and send the output (or compare with this expected output): # The French locale is used for locale testing - if it doesn't exist, those tests should not run at all. $ locale -a | grep -i fr fr_FR.utf8 # First try without a locale (this is test 'lcl-grp-1', which succeeded) $ LC_ALL=C ./src/numfmt --debug --grouping --from=si 7M ./src/numfmt: grouping has no effect in this locale 7000000 # Try grouping; the expected output should have a space as the thousands separator # this is test 'lcl-grp-3', which failed; on your system the result was 7000000 $ LC_ALL=fr_FR.utf8 ./src/numfmt --debug --grouping --from=si 7M 7 000 000 Thanks! -gordon
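The separator numfmt reports comes from the locale's LC_NUMERIC data, and the same information can be probed directly from C with the POSIX apostrophe (grouping) flag to printf. A small sketch (format_grouped is a made-up helper name, not numfmt code):

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Format VALUE with the POSIX %'d grouping flag under LOCALE into BUF.
   Returns BUF, or NULL if the locale is not installed. In the C locale
   thousands_sep is empty, so no grouping appears - the same condition
   numfmt --grouping warns about above. */
static char *
format_grouped (const char *locale, int value, char *buf, size_t size)
{
  if (!setlocale (LC_NUMERIC, locale))
    return NULL;                 /* locale not installed on this system */
  snprintf (buf, size, "%'d", value);
  setlocale (LC_NUMERIC, "C");   /* restore a known state */
  return buf;
}
```

Calling it with "fr_FR.utf8" (where installed) should reproduce the separated "7 000 000" output, while "C" yields plain "7000000"; if the French locale's thousands_sep is wrong on a system, this probe shows it independently of numfmt.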
Re: new snapshot available: coreutils-8.20.113-1f1f4
Thanks for the quick fix. Bernhard Voelker wrote, On 02/08/2013 11:02 AM: On February 8, 2013 at 4:56 PM Pádraig Brady p...@draigbrady.com wrote: OK so we can't assume the locale will behave as we want. Therefore we can gate the test on the output of the independent printf like: PASS: tests/misc/numfmt.pl I don't know if this is a SUSE bug or not. The closest thing I've found is a Debian bug report from 2004: locales: Wrong thousands_sep value in fr_FR locale http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=248377 -gordon
Re: new snapshot available: coreutils-8.20.119-54cdb0
Follow-up: Assaf Gordon wrote, On 02/11/2013 12:27 PM: Strange failure with numfmt on an eccentric system (Mac OS X 10.6.8): some errors are not reported correctly. [ ... ] In the source code, it seems to be related to this part, in parse_format_string(), line 972: 970 i += strspn (fmt + i, " "); 971 errno = 0; 972 pad = strtol (fmt + i, &endptr, 10); 973 if (errno != 0) 974 error (EXIT_FAILURE, 0, 975 _("invalid format %s (width overflow)"), quote (fmt)); 976 On this system (Mac OS X): fmt = 'hello%' i = 6 fmt+i = '' And 'strtol' returns errno=EINVAL (22) instead of 0 - causing the incorrect error message. This is likely the reason; 'man strtol' has this to say (on this computer): === ERRORS [EINVAL] The value of base is not supported or no conversion could be performed (the last feature is not portable across all platforms). === Would it be better to explicitly check for this case, or replace with xstrtol? -gordon
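A portable alternative to relying on errno for the no-conversion case is to compare endptr with the start of the string: strtol leaves *endptr pointing at the input when no digits were consumed, and that behavior is guaranteed by the standard, unlike the EINVAL setting. A minimal sketch (starts_with_number is a hypothetical helper, not numfmt code):

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Return nonzero if STR begins with a number strtol can convert,
   storing it in *VALUE. Checking endptr == str is portable; checking
   errno == EINVAL for "no conversion" is not (it is optional). */
static int
starts_with_number (const char *str, long *value)
{
  char *endptr;
  errno = 0;
  long v = strtol (str, &endptr, 10);
  if (endptr == str)     /* no digits consumed: no conversion happened */
    return 0;
  if (errno == ERANGE)   /* genuine overflow/underflow */
    return 0;
  *value = v;
  return 1;
}
```

With this style of check, the empty string after 'hello%' would be classified as "no number" on both Mac OS X and Linux, without consulting errno at all.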
Re: new snapshot available: coreutils-8.20.119-54cdb0
Assaf Gordon wrote, On 02/11/2013 12:35 PM: Assaf Gordon wrote, On 02/11/2013 12:27 PM: Strange failure with numfmt on an eccentric system (Mac OS X 10.6.8): some errors are not reported correctly. [ ... ] And 'strtol' returns errno=EINVAL (22) instead of 0 - causing the incorrect error message. The attached patch fixes the problem (tested on Mac OS 10.6.8 and Debian/Linux 3.2). -gordon From 68ff89d497fcaffe054f0ca619fd747db8fb4574 Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Mon, 11 Feb 2013 15:39:42 -0500 Subject: [PATCH] numfmt: fix strtol() bug src/numfmt.c: on some systems, strtol() returns EINVAL if no conversion was performed. Ignore and continue if so. --- src/numfmt.c | 5 ++++- 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/src/numfmt.c b/src/numfmt.c index d87d8ef..6e7cf2f 100644 --- a/src/numfmt.c +++ b/src/numfmt.c @@ -970,7 +970,10 @@ parse_format_string (char const *fmt) i += strspn (fmt + i, " "); errno = 0; pad = strtol (fmt + i, &endptr, 10); - if (errno != 0) + /* EINVAL can happen if 'base' is invalid (hardcoded as 10, so can't happen), + or if no conversion was performed (on some platforms). Ignore & continue + if no conversion was performed. */ + if (errno != 0 && (errno != EINVAL)) error (EXIT_FAILURE, 0, _("invalid format %s (width overflow)"), quote (fmt)); -- 1.7.7.4
uniq with sort-like --key support
Hello, I'd like to offer a proof-of-concept patch for adding sort-like --key support for the 'uniq' program, as discussed here: http://lists.gnu.org/archive/html/bug-coreutils/2006-06/msg00211.html and in several other threads. The patch involves a few core changes: 1. All key-related functions were copied as-is from sort.c and put in a separate file (uniq_sort_common.h). In theory, those could be extracted later on to a file that will be used by both sort and uniq. At the moment, it's a hodge-podge of copy/paste, including code that's not relevant to uniq (like 'reverse'). 2. The function check_files was modified to convert 'struct linebuffer' (used by uniq) to 'struct line' (used by sort's functions). 3. The 'different' function was modified to call sort's keycompare function. 4. In main(), the key argument passing was copied from 'sort', and some code was added to adapt previous options (e.g. skip-fields/skip-chars/check-chars) to the internal 'struct keyfield'. The result is that uniq can now do: === $ printf 'A 1\nA 2\nB 2\n' | ./src/uniq -k1,1 A 1 B 2 $ printf 'A 1\nA 2\nB 2\n' | ./src/uniq -k2,2 A 1 A 2 === Most (but not all) of the existing tests pass. New tests to demonstrate the new possibilities have been added to 'tests/misc/uniq-key.pl'; try with: make check TESTS=tests/misc/uniq-key SUBDIRS=. I think that most of the key-comparison functions (like numeric/general-numeric/month/version/skip-blanks) should just work, though I haven't tested them thoroughly yet. Comments are welcomed, -gordon 0001-uniq-support-sort-like-key-specification.patch.xz Description: application/xz
[PATCH]: uniq: add tests for --ignore-case
Hello, Attached are three small tests for uniq with --ignore-case (they pass; the option was simply not tested before). Also, I noticed that by running the default test suite (make check SUBDIRS=.), the majority of uniq tests are skipped: uniq: skipping this test -- no appropriate locale SKIP: tests/misc/uniq.pl PASS: tests/misc/uniq-perf.sh This is due to tests/misc/uniq.pl line 83: 83 # I've only ever triggered the problem in a non-C locale. 84 my $locale = $ENV{LOCALE_FR}; 85 ! defined $locale || $locale eq 'none' 86 and CuSkip::skip "$prog: skipping this test -- no appropriate locale\n"; which skips the entire suite if there's no French locale defined, even though only one test actually sets the locale. I can have a patch for it, if that's acceptable. -gordon From c8cec42eee16f3824635a3ba93b9360b2e7b236d Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Tue, 12 Feb 2013 10:30:25 -0500 Subject: [PATCH] tests: add '--ignore-case' tests for uniq. * tests/misc/uniq.pl: add tests for --ignore-case. --- tests/misc/uniq.pl | 4 ++++ 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/tests/misc/uniq.pl b/tests/misc/uniq.pl index 140a49b..e3873b5 100755 --- a/tests/misc/uniq.pl +++ b/tests/misc/uniq.pl @@ -199,6 +199,10 @@ my @Tests = # Check that --zero-terminated is synonymous with -z. ['123', '--zero-terminated', {IN=>"a\na\nb"}, {OUT=>"a\na\nb\0"}], ['124', '--zero-terminated', {IN=>"a\0a\0b"}, {OUT=>"a\0b\0"}], + # Check ignore-case + ['125', '', {IN=>"A\na\n"}, {OUT=>"A\na\n"}], + ['126', '-i', {IN=>"A\na\n"}, {OUT=>"A\n"}], + ['127', '--ignore-case', {IN=>"A\na\n"}, {OUT=>"A\n"}], ); # Set _POSIX2_VERSION=199209 in the environment of each obs-plus* test. -- 1.7.7.4
Re: uniq with sort-like --key support
Pádraig Brady wrote, On 02/11/2013 08:50 PM: On 02/12/2013 01:31 AM, Assaf Gordon wrote: I'd like to offer a proof-of-concept patch for adding sort-like --key support for the 'uniq' program, as discussed here: http://lists.gnu.org/archive/html/bug-coreutils/2006-06/msg00211.html and in several other threads. I'm not going to look at it this week, but thank you! Consolidating the field processing in a central place is really good, and it can then be enhanced in future to support multibyte chars etc. I'll continue in the meantime - the attached version passes all tests, and includes many new ones. also supports --field-separator=SEP (like sort), multiple keys, and tested unique by month/fast-numeric/general-numeric/case-insensitive. -gordon uniq_keys1.patch.xz Description: application/xz
Re: uniq with sort-like --key support
On 02/12/2013 01:31 AM, Assaf Gordon wrote: I'd like to offer a proof-of-concept patch for adding sort-like --key support for the 'uniq' program, as discussed here: http://lists.gnu.org/archive/html/bug-coreutils/2006-06/msg00211.html and in several other threads. One more update with two changes: 1. re-arranged src/uniq_sort_common.h to have the functions in the same order as in src/sort.c, making diff src/uniq_sort_common.h src/sort.c much easier to view (and showing that the functions were not modified at all). 2. when specifying an explicit field separator and using -c, report the counts without space-padded, right-aligned numbers (and print the separator after the count). This might be controversial, but I always needed that :) (I used to wrap every uniq -c with sed 's/^ *// ; s/ /\t/' ) == ## Existing: $ printf 'a\tx\na\tx\nb\ty\n' | uniq -c 2 a x 1 b y ## New: $ printf 'a\tx\na\tx\nb\ty\n' | ./src/uniq -t $'\t' -c 2 a x 1 b y == Also, I'm wondering what exactly is the effect of the following statement ( from http://lists.gnu.org/archive/html/bug-coreutils/2006-06/msg00217.html ): This point was addressed in IEEE Std 1003.1-2001/Cor 1-2002, item XCU/TC1/D6/40, and it's why the current Posix spec says that the behavior of uniq depends on LC_COLLATE. And whether sort's keycompare functions fulfill this requirement, and whether the current 'uniq' tests check this situation? Otherwise my changes are not backwards-compatible. Thanks, -gordon
Re: uniq with sort-like --key support
Assaf Gordon wrote, On 02/13/2013 11:45 AM: On 02/12/2013 01:31 AM, Assaf Gordon wrote: I'd like to offer a proof-of-concept patch for adding sort-like --key support for the 'uniq' program, as discussed here: http://lists.gnu.org/archive/html/bug-coreutils/2006-06/msg00211.html and in several other threads. One more update with two changes: Sorry, forgot to attach the file in the previous email. -gordon uniq_key3.patch.xz Description: application/xz
Re: uniq with sort-like --key support
Hello Jim, Jim Meyering wrote, On 02/13/2013 12:05 PM: Assaf Gordon wrote: Assaf Gordon wrote, On 02/13/2013 11:45 AM: ... One more update with two changes: ... src/uniq_sort_common.h | 1096 Hi Gordon. Thanks a lot for working on this long-requested change. I don't have time to review it, but please change the name of that new header file. First, we use hyphens (not underscores) in file names. Did you consider any names that evoke key spec parsing? Then, the name would still be apropos if someday it's used by a program other than sort and uniq. This was just a proof-of-concept, so I wanted to have minimal changes that would just work. What would be the recommended way to compartmentalize this functionality? 1. put it in src/key-spec-parsing.h, and have each program (e.g. uniq.c) do #include "key-spec-parsing.h"? or 2. split it into src/key-spec-parsing.h and src/key-spec-parsing.c (with all the src/local.mk associated changes) - but removing the 'static' from all the variables/functions? or something else? -gordon
Re: uniq with sort-like --key support
Jim Meyering wrote, On 02/13/2013 12:05 PM: [...] but please change the name of that new header file. First, we use hyphens (not underscores) in file names. Did you consider any names that evoke key spec parsing? Then, the name would still be apropos if someday it's used by a program other than sort and uniq. What would be the recommended way to compartmentalize this functionality? 1. put it in src/key-spec-parsing.h, and have each program (e.g. uniq.c) do #include "key-spec-parsing.h"? or 2. split it into src/key-spec-parsing.h and src/key-spec-parsing.c (with all the src/local.mk associated changes) - but removing the 'static' from all the variables/functions? or something else? I'm leaning towards option #1 (just a header file) - this will allow including/removing functionality using #ifdefs (e.g. uniq doesn't need to support random/reverse/human/version key comparisons, and in the far future - perhaps 'join' will use it and wouldn't need them). src/system.h is already used in the same fashion (and has 'static' functions), although it's much smaller in scope. Thoughts? -gordon
Re: uniq with sort-like --key support
Pádraig Brady wrote, On 02/13/2013 12:54 PM: On 02/13/2013 05:34 PM, Assaf Gordon wrote: What would be the recommended way to compartmentalize this functionality? 1. put it in src/key-spec-parsing.h, and have each program (e.g. uniq.c) do #include "key-spec-parsing.h"? or 2. split it into src/key-spec-parsing.h and src/key-spec-parsing.c (with all the src/local.mk associated changes) - but removing the 'static' from all the variables/functions? 2 is more standard/flexible. Evidently, leaning towards option #1 was the wrong choice :) This update splits the code into the two files (src/key-spec-parsing.{c,h}), and adds conditional compilation of supported keys, using per-file CFLAGS in local.mk: src_uniq_SOURCES = src/uniq.c src/key-spec-parsing.c src_uniq_CPPFLAGS = $(AM_CPPFLAGS) Another program that needs all the keys might define: src_sort_SOURCES = src/sort.c src/key-spec-parsing.c src_sort_CPPFLAGS = -DKEY_SPEC_RANDOM -DKEY_SPEC_REVERSE -DKEY_SPEC_VERSION -DKEY_SPEC_HUMAN_NUMERIC $(AM_CPPFLAGS) These are explained in 'src/key-spec-parsing.c': /* define the following to enable extra key options: KEY_SPEC_RANDOM - sort by random order (-k1R,1) KEY_SPEC_REVERSE - reverse sort order (-k1r,1) KEY_SPEC_VERSION - version sort order (-k1V,1) KEY_SPEC_HUMAN_NUMERIC - human sizes order (-k1h,1) If these are not defined, specifying them will generate an error. See 'set_ordering()' and 'key_to_opts()' in this file, and src_uniq_CPPFLAGS in src/local.mk for usage examples. */ -gordon uniq_key5.patch.xz Description: application/xz
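The per-program gating described above could look roughly like this inside the shared file (illustrative only; the actual patch's set_ordering() is more elaborate): when a KEY_SPEC_* macro is not passed via the program's CPPFLAGS, the corresponding key modifier is simply rejected at compile time.

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch: report whether a sort-key modifier letter is compiled in.
   The KEY_SPEC_* macro names mirror the patch; the function itself is
   a made-up illustration of the #ifdef gating, not real patch code. */
static bool
ordering_supported (char modifier)
{
  switch (modifier)
    {
    case 'b': case 'd': case 'f': case 'g': case 'i': case 'n': case 'M':
      return true;              /* core modifiers, always available */
#ifdef KEY_SPEC_REVERSE
    case 'r':
      return true;              /* only with -DKEY_SPEC_REVERSE */
#endif
#ifdef KEY_SPEC_VERSION
    case 'V':
      return true;              /* only with -DKEY_SPEC_VERSION */
#endif
    default:
      return false;
    }
}
```

Built with uniq's plain `$(AM_CPPFLAGS)`, 'r' and 'V' are rejected; built with sort's `-DKEY_SPEC_REVERSE -DKEY_SPEC_VERSION ...`, they are accepted, with no runtime cost either way.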
[PATCH] join: Add -z option
Hello, This patch adds -z to join, supporting joining zero-terminated lines. The patch is heavily based on James Youngman's patch adding -z to uniq (commit e062524). -gordon P.S. This patch is independent of the key-comparison patches discussed recently, though I'm also adding it there. From 525eb72b150ed34d3bfcfe453d1494fe28a824b7 Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Thu, 14 Feb 2013 15:29:08 -0500 Subject: [PATCH] join: Add -z option * NEWS: Mention join's new option: --zero-terminated (-z). * src/join.c: Add new option, --zero-terminated (-z), to make join use the NUL byte as separator/delimiter rather than newline. (get_line): Use readlinebuffer_delim in place of readlinebuffer. (main): Handle the new option. (usage): Describe new option the same way sort does. * doc/coreutils.texi (join invocation): Describe the new option. * tests/misc/join.pl: add tests for -z option. --- NEWS | 6 ++++++ doc/coreutils.texi | 17 +++++++++++++++++ src/join.c | 19 +++++++++++++++---- tests/misc/join.pl | 20 ++++++++++++++++++++ 4 files changed, 58 insertions(+), 4 deletions(-) diff --git a/NEWS b/NEWS index 37bcdf7..618c1da 100644 --- a/NEWS +++ b/NEWS @@ -2,6 +2,12 @@ GNU coreutils NEWS -*- outline -*- * Noteworthy changes in release ?.? (????-??-??) [?] +** New features + + join accepts a new option: --zero-terminated (-z). As with the sort,uniq + option of the same name, this makes join consume and produce NUL-terminated + lines rather than newline-terminated lines. + * Noteworthy changes in release 8.21 (2013-02-14) [stable] diff --git a/doc/coreutils.texi b/doc/coreutils.texi index 2c16dc4..a72d9ce 100644 --- a/doc/coreutils.texi +++ b/doc/coreutils.texi @@ -6059,6 +6059,10 @@ available; the sort order can be any order that considers two fields to be equal if and only if the sort comparison described above considers them to be equal. 
For example: +Input and output lines are terminated with a newline character unless the +@option{--zero-terminated} (@option{-z}) option is used, in which case lines are +@sc{nul} terminated. + @example $ cat file1 a a1 @@ -6181,6 +6185,19 @@ character is used to delimit the fields. Print a line for each unpairable line in file @var{file-number} (either @samp{1} or @samp{2}), instead of the normal output. +@item -z +@itemx --zero-terminated +@opindex -z +@opindex --zero-terminated +@cindex join zero-terminated lines +Treat the input as a set of lines, each terminated by a null character +(ASCII @sc{nul}) instead of a line feed +(ASCII @sc{lf}). +This option can be useful in conjunction with @samp{sort -z}, @samp{uniq -z}, +@samp{perl -0} or @samp{find -print0} and @samp{xargs -0} which do the same in +order to reliably handle arbitrary file names (even those containing blanks +or other special characters). + @end table @exitstatus diff --git a/src/join.c b/src/join.c index 11e647c..1810ac2 100644 --- a/src/join.c +++ b/src/join.c @@ -161,6 +161,7 @@ static struct option const longopts[] = {"ignore-case", no_argument, NULL, 'i'}, {"check-order", no_argument, NULL, CHECK_ORDER_OPTION}, {"nocheck-order", no_argument, NULL, NOCHECK_ORDER_OPTION}, + {"zero-terminated", no_argument, NULL, 'z'}, {"header", no_argument, NULL, HEADER_LINE_OPTION}, {GETOPT_HELP_OPTION_DECL}, {GETOPT_VERSION_OPTION_DECL}, @@ -177,6 +178,9 @@ static bool ignore_case; join them without checking for ordering */ static bool join_header_lines; +/* The character marking end of line. Default to \n. */ +static char eolchar = '\n'; + void usage (int status) { @@ -213,6 +217,9 @@ by whitespace. 
When FILE1 or FILE2 (not both) is -, read standard input.\n\ --header treat the first line in each file as field headers,\n\ print them without trying to pair them\n\ ), stdout); + fputs (_(\ + -z, --zero-terminated end lines with 0 byte, not newline\n\ +), stdout); fputs (HELP_OPTION_DESCRIPTION, stdout); fputs (VERSION_OPTION_DESCRIPTION, stdout); fputs (_(\ @@ -445,7 +452,7 @@ get_line (FILE *fp, struct line **linep, int which) else line = init_linep (linep); - if (! readlinebuffer (&line->buf, fp)) + if (! readlinebuffer_delim (&line->buf, fp, eolchar)) { if (ferror (fp)) error (EXIT_FAILURE, errno, _("read error")); @@ -614,7 +621,7 @@ prjoin (struct line const *line1, struct line const *line2) break; putchar (output_separator); } - putchar ('\n'); + putchar (eolchar); } else { @@ -636,7 +643,7 @@ prjoin (struct line const *line1, struct line const *line2) prfields (line1, join_field_1, autocount_1); prfields (line2, join_field_2, autocount_2); - putchar ('\n'); + putchar (eolchar); } } @@ -1017,7 +1024,7 @@ main (int argc, char **argv) issued_disorder_warning[0] = issued_disorder_warning
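The essence of the change - swapping a newline reader for a delimiter-aware one - can be mimicked in portable C with getdelim(). Coreutils uses its own readlinebuffer_delim, so this is only an illustration of the same idea:

```c
#define _GNU_SOURCE 1   /* for getdelim/fmemopen on older glibc */
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

/* Count DELIM-terminated records in FP, the way join -z iterates over
   '\0'-terminated lines. Illustration only, not coreutils code. */
static size_t
count_records (FILE *fp, int delim)
{
  char *rec = NULL;
  size_t cap = 0;
  size_t n = 0;
  while (getdelim (&rec, &cap, delim, fp) != -1)
    n++;
  free (rec);
  return n;
}
```

Passing '\n' gives ordinary line iteration; passing '\0' handles records whose content may itself contain newlines, which is exactly why -z pairs well with find -print0 and sort -z.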
sort/uniq/join: key-comparison code consolidation
Hello, ( new thread for previous topic http://lists.gnu.org/archive/html/coreutils/2013-02/msg00082.html ) . The attached patch contains: 1. src/key-spec-parsing.{h,c} - key comparison code, previously in sort.c 2. uniq - now supports --key (multiple keys, too). Same as before, but rebased against 8.21. Supported orders: -k1,1 = ascii -k1b,1 = ignore-blanks -k1d,1 = dictionary -k1i,1 = non-printing -k1f,1 = ignore-case -k1n,1 = fast-numeric -k1g,1 = general-numeric -k1M,1 = month also supports user-specified delimiter (default: white-space). Related discussions: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=5832 http://debbugs.gnu.org/cgi/bugreport.cgi?bug=7068 http://lists.gnu.org/archive/html/bug-coreutils/2006-06/msg00211.html 3. sort - same functionality as before, but key-comparison code extracted to a different file. 4. join - internally uses the key-comparison code. Does not support the --key parameter (uses the standard -j/-1/-2), but accepts new arguments that affect joining order: -r --reverse -n --numeric-sort -d --dictionary-order -g --general-numeric Related discussions: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=6903 http://debbugs.gnu.org/cgi/bugreport.cgi?bug=6366 As an option, perhaps we can support new -k that will be like -j but allow specificity options (e.g. -k1nr will be equivalent to -j 1 --numeric --reverse). It'll be easy to add human-numeric-sort/version-sort to join/uniq, but I'm not sure if they make sense. Regards, -gordon key_compare7.patch.xz Description: application/xz
Re: [PATCH]: uniq: add --group option
Hello Pádraig, Pádraig Brady wrote, On 02/20/2013 08:47 PM: On 02/20/2013 06:44 PM, Assaf Gordon wrote: Hello, Attached is a suggestion for a --group option in uniq, as discussed here: http://lists.gnu.org/archive/html/coreutils/2011-03/msg0.html http://lists.gnu.org/archive/html/coreutils/2012-03/msg00052.html The patch adds two parameters: --group=[method] separate each unique line (whether duplicated or not) with a marker. method={none,separate(default),prepend,append,both} --group-separator=SEP with --group, separates groups using SEP (default: empty line) --group-sep is probably overkill. I'd just use \n or \0 if -z specified. OK. As for separation methods I'd just go with what we have for --all-repeated (but remove 'none' which wouldn't be useful with --group), as we've never had requests for anything else. so: --group={prepend, separate(default)} I'd like to have at least append or both, for the added convenience of downstream analysis. It's obviously a nice-to-have and not a must-have feature, and can be implemented in other ways, but knowing that there will always be a terminating marker *after* a group (even the last group) makes downstream processing code simpler. Typical example: $ cat INPUT | uniq --group=append | \ awk '$0 != "" { ## item in the group, collect it } $0 == "" { ## end of group, do something }' Without the final group marker, any downstream code will require two points of group processing: when a marker is found, and at EOF. Something like: $ cat INPUT | uniq --group=append | \ awk '$0 != "" { ## item in the group, collect it } $0 == "" { ## end of group, do something } END { ## end of last group, do something, duplicated code }' Similar reason for having both, as it ensures I can put any special initialization code in the group-marker case, and don't need to duplicate it in a separate 'BEGIN{}' clause (of course, this doesn't have to be awk - it can be perl/python/ruby/whatever doing the downstream processing). 
I realize it's not a make-or-break feature - but if we're trying to make text processing easier, I believe append/both makes it even easier. So on to operation... And it behaves as expected: === $ printf 'a\na\na\nb\nc\nc\n' | ./src/uniq --group-sep='--' --group=separate The above isn't that useful and could be done with sed. I assume you're specifically referring to the group-sep part - then OK. Supporting -u or -d with --group wouldn't be useful either, really. It's probably most consistent to just disallow those combinations. Just to be clear on the reasoning: because with -u and -d, each *line* is implicitly a separate group, there's no apparent utility for an end-of-group marker. I guess it's true from a technical POV - but again, for downstream-analysis convenience it's nice to have a fixed end-of-group marker. I could use the same downstream script (which expects end-of-group markers) with uniq, whether I used -d or -u or nothing at all. What do you think? -gordon
Re: [PATCH]: uniq: add --group option
Pádraig Brady wrote, On 02/21/2013 11:11 AM: On 02/21/2013 03:42 PM, Assaf Gordon wrote: Hello Pádraig, Pádraig Brady wrote, On 02/20/2013 08:47 PM: On 02/20/2013 06:44 PM, Assaf Gordon wrote: Hello, Attached is a suggestion for --group option in uniq, as discussed here: http://lists.gnu.org/archive/html/coreutils/2011-03/msg0.html [ ... ] So on to operation... And it behaves as expected: === $ printf a\na\na\nb\nc\nc\n | ./src/uniq --group-sep=-- --group=separate The above isn't that useful and could be done with sed. I assume you're specifically referring to the group-sep part - then OK. Actually I was referring to the fact that in your example --group didn't output all entries by default. If it only output unique entries then you can separate with: uniq | sed 'G' # (note sed also supports -z) uniq | sed '$q;G' So `uniq --group` should output all items by default I think. [ ... ] I guess it's true from a technical POV - but again, for downstream analysis convenience it's nice to have a fixed end-of-group marker. I could use the same downstream script (which expects end-of-group markers) with uniq, whether I used -d or -u or nothing at all. But what's the point in such processing if there is only ever going to be a single line in each group? I see now, I was thinking of --group as simply an output modifier (ie add group marker to whatever uniq is outputing), allowing combination of --group with -u/-d/-D or any other option (whether it made useful sense or not). You were planning on --group to mean explicitly output all input lines, and add group-markers for unique groups (meaning -u/-d/-D and --group are mutually exclusive). I can go on with your definition. I'll send update soon. -gordon
Re: [PATCH]: uniq: add --group option
Assaf Gordon wrote, On 02/21/2013 11:37 AM: You were planning on --group to mean explicitly output all input lines, and add group-markers for unique groups (meaning -u/-d/-D and --group are mutually exclusive). Attached is a version that behaves as previously discussed. --group can't be used with -c/-d/-D/-u. Since it's a completely separate behavior, I found it easier to create a whole new code path in check_file() for the special case of grouping. Comments are welcomed, -gordon From 072ffee0f45a67465607cde3d984e6fd7e37a1af Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Wed, 20 Feb 2013 13:31:22 -0500 Subject: [PATCH] uniq: add --group option * src/uniq.c: implement --group options. * tests/misc/uniq.pl: add tests. --- src/uniq.c | 125 +--- tests/misc/uniq.pl | 40 + 2 files changed, 159 insertions(+), 6 deletions(-) diff --git a/src/uniq.c b/src/uniq.c index 5efdad7..598c62d 100644 --- a/src/uniq.c +++ b/src/uniq.c @@ -108,11 +108,47 @@ static enum delimit_method const delimit_method_map[] = /* Select whether/how to delimit groups of duplicate lines. */ static enum delimit_method delimit_groups; +enum grouping_method +{ + /* No grouping, when --group isn't used */ + GM_NONE, + + /* Delimiter preceges all groups. --group=prepend */ + GM_PREPEND, + + /* Delimiter follows all groups. --group=append */ + GM_APPEND, + + /* Delimiter between groups.--group[=separate] */ + GM_SEPARATE, + + /* Delimiter before and after each group. 
--group=both */ + GM_BOTH +}; + +static char const *const grouping_method_string[] = +{ + prepend, append, separate, both, NULL +}; + +static enum grouping_method const grouping_method_map[] = +{ + GM_PREPEND, GM_APPEND, GM_SEPARATE, GM_BOTH +}; + +static enum grouping_method grouping = GM_NONE; + +enum +{ + GROUP_OPTION = CHAR_MAX + 1 +}; + static struct option const longopts[] = { {count, no_argument, NULL, 'c'}, {repeated, no_argument, NULL, 'd'}, {all-repeated, optional_argument, NULL, 'D'}, + {group, optional_argument, NULL, GROUP_OPTION}, {ignore-case, no_argument, NULL, 'i'}, {unique, no_argument, NULL, 'u'}, {skip-fields, required_argument, NULL, 'f'}, @@ -159,6 +195,11 @@ With no options, matching lines are merged to the first occurrence.\n\ -z, --zero-terminated end lines with 0 byte, not newline\n\ ), stdout); fputs (_(\ + --group=[method] separate each unique group (whether duplicated or not)\n\ +with an empty line.\n\ +method={separate(default),prepend,append,both)\n\ +), stdout); + fputs (_(\ -w, --check-chars=N compare no more than N characters in lines\n\ ), stdout); fputs (HELP_OPTION_DESCRIPTION, stdout); @@ -293,13 +334,57 @@ check_file (const char *infile, const char *outfile, char delimiter) initbuffer (prevline); /* The duplication in the following 'if' and 'else' blocks is an - optimization to distinguish the common case (in which none of - the following options has been specified: --count, -repeated, - --all-repeated, --unique) from the others. In the common case, - this optimization lets uniq output each different line right away, - without waiting to see if the next one is different. */ + optimization to distinguish several cases: - if (output_unique output_first_repeated countmode == count_none) + 1. grouping (--group=X) - all input lines are printed. +checking for unique/duplicated lines is used only for printing +group separators. + + 2. 
The common case - +In which none of the following options has been specified: + --count, --repeated, --all-repeated, --unique +In the common case, this optimization lets uniq output each different +line right away, without waiting to see if the next one is different. + + 3. All other cases. + */ + if (grouping != GM_NONE) +{ + char *prevfield IF_LINT ( = NULL); + size_t prevlen IF_LINT ( = 0); + bool first_group_printed = false; + + while (!feof (stdin)) +{ + char *thisfield; + size_t thislen; + bool new_group; + if (readlinebuffer_delim (thisline, stdin, delimiter) == 0) +break; + thisfield = find_field (thisline); + thislen = thisline-length - 1 - (thisfield - thisline-buffer); + + new_group = (prevline-length == 0 + || different (thisfield, prevfield, thislen, prevlen)); + + if (new_group + ( (grouping == GM_PREPEND) || (grouping == GM_BOTH) + || ( first_group_printed + +( grouping == GM_APPEND || grouping == GM_SEPARATE +putchar (delimiter); + + fwrite (thisline-buffer, sizeof (char), thisline-length, stdout); + SWAP_LINES
[PATCH] improve 'autotools-install'
Hello, Trying to use 'scripts/autotools-install' on a problematic system (Mac OS X 10.6.8, which already has a few other related bugs), building pkg-config fails. Two patches attached: 1. When ./configure or make fail, use die() to print an error, pointing the user to the error log file. This helps when troubleshooting errors, because the script uses set -e and simply exits on errors. 2. Recent pkg-config has a cyclic dependency on glib, explained in the README [1]: pkg-config depends on glib. Note that glib build-depends on pkg-config, but you can just set the corresponding environment variables (ZLIB_LIBS, ZLIB_CFLAGS are the only needed ones when this is written) to build it. If this requirement is too cumbersome, a bundled copy of a recent glib stable release is included. Pass --with-internal-glib to configure to use this copy. The second patch adds this --with-internal-glib flag when configuring pkg-config. Sadly, autotools-install still doesn't complete, because gettext 0.18.1 fails to compile with a stpncpy()-related problem (exactly as solved in coreutils [2]), but that's not a coreutils bug. -gordon [1] http://cgit.freedesktop.org/pkg-config/tree/README?id=pkg-config-0.27.1 [2] http://bugs.gnu.org/13495 From ba2c30e47e808c60bd5e899caca1207dae5aa95a Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Thu, 21 Feb 2013 17:50:28 -0500 Subject: [PATCH 1/2] maint: print errors when autotools-install fails * scripts/autotools-install: call die() when configure/make fail. Point the user to the relevant error log file. 
--- scripts/autotools-install |8 ++-- 1 files changed, 6 insertions(+), 2 deletions(-) diff --git a/scripts/autotools-install b/scripts/autotools-install index bd49664..419806d 100755 --- a/scripts/autotools-install +++ b/scripts/autotools-install @@ -148,8 +148,12 @@ for pkg in $pkgs; do rm -rf $dir gzip -dc $pkg | tar xf - cd $dir - ./configure CFLAGS=-O2 LDFLAGS=-s --prefix=$prefix makerr-config 21 - $MAKE makerr-build 21 + ./configure CFLAGS=-O2 LDFLAGS=-s --prefix=$prefix makerr-config 21 \ +|| die configuring package $dir failed. \ +check '$tmpdir/$dir/makerr-config' for possible details. + $MAKE makerr-build 21 \ +|| die building package $dir failed. \ +check '$tmpdir/$dir/makerr-build' for possible details. if test $make_check = yes; then case $pkg in # FIXME: these are out of date and very system-sensitive -- 1.7.7.4 From c3d135c51e20ceb72d5b453081bea1e1899f9ef1 Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Thu, 21 Feb 2013 17:58:55 -0500 Subject: [PATCH 2/2] maint: add special config flags for pkg-config * scripts/autotools-install: force pkg-config to use internal 'glib' files when compiling from source. --- scripts/autotools-install |4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) diff --git a/scripts/autotools-install b/scripts/autotools-install index 419806d..2b626ff 100755 --- a/scripts/autotools-install +++ b/scripts/autotools-install @@ -144,11 +144,13 @@ pkgs=`get_sources` export PATH=$prefix/bin:$PATH for pkg in $pkgs; do echo building/installing $pkg... + extra= + case $pkg in pkg-config*) extra=--with-internal-glib;; esac dir=`basename $pkg .tar.gz` rm -rf $dir gzip -dc $pkg | tar xf - cd $dir - ./configure CFLAGS=-O2 LDFLAGS=-s --prefix=$prefix makerr-config 21 \ + ./configure CFLAGS=-O2 LDFLAGS=-s --prefix=$prefix $extramakerr-config 21 \ || die configuring package $dir failed. \ check '$tmpdir/$dir/makerr-config' for possible details. $MAKE makerr-build 21 \ -- 1.7.7.4
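The failure-reporting pattern the first patch introduces can be exercised standalone; a sketch, where die(), run_logged, and the sample command are illustrative names rather than part of the actual script:

```shell
die() { printf '%s\n' "$*" >&2; exit 1; }

# Run a build step with its output captured to a log file, as
# autotools-install does with makerr-config / makerr-build;
# on failure, point the user at the log instead of silently exiting.
run_logged() {
  log=$1; shift
  "$@" > "$log" 2>&1 \
    || die "step failed. check '$PWD/$log' for possible details."
}

run_logged makerr-config sh -c 'echo "checking for glib... yes"'
```

Because the script runs under `set -e`, the `|| die ...` branch is what turns a silent exit into an actionable message.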
Re: [PATCH] improve 'autotools-install'
Hello Stefano, Stefano Lattarini wrote, On 02/22/2013 02:30 AM: On 02/22/2013 12:08 AM, Assaf Gordon wrote: I think this explanation should go in the commit message of the second patch, as it makes clear why such patch is needed. Good idea, attached an improved patch. Sadly, autotools-install still doesn't complete, because gettext0.18.1 fails to compile with stpncpy() related problem (exactly as solved in coreutils [2]) but that's is not a coreutil bug. Is the issue still present with the latest gettext version (1.18.2)? If not, you could update the '$tarballs' definition to point to that instead. No, 0.18.2 doesn't compile either. Eric Blake already found the fix for this, I'll just send the gettext people a bug report. Also, I see that the Automake version referenced by '$tarballs' is still 1.12.3; I think it should be updated to the latest available version (1.13.2 at the moment of writing). I can send a separate patch for that, but perhaps others would chime in as to whether this should be done? I assume changing version (1.12 vs 1.13) should be done when it's explicitly needed? -gordon From ba2c30e47e808c60bd5e899caca1207dae5aa95a Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Thu, 21 Feb 2013 17:50:28 -0500 Subject: [PATCH 1/2] maint: print errors with autotools-install fails * scripts/autotools-install: call die() when configure/make fail. Point the user to the relevant error log file. --- scripts/autotools-install |8 ++-- 1 files changed, 6 insertions(+), 2 deletions(-) diff --git a/scripts/autotools-install b/scripts/autotools-install index bd49664..419806d 100755 --- a/scripts/autotools-install +++ b/scripts/autotools-install @@ -148,8 +148,12 @@ for pkg in $pkgs; do rm -rf $dir gzip -dc $pkg | tar xf - cd $dir - ./configure CFLAGS=-O2 LDFLAGS=-s --prefix=$prefix makerr-config 21 - $MAKE makerr-build 21 + ./configure CFLAGS=-O2 LDFLAGS=-s --prefix=$prefix makerr-config 21 \ +|| die configuring package $dir failed. 
\ +check '$tmpdir/$dir/makerr-config' for possible details. + $MAKE makerr-build 21 \ +|| die building package $dir failed. \ +check '$tmpdir/$dir/makerr-build' for possible details. if test $make_check = yes; then case $pkg in # FIXME: these are out of date and very system-sensitive -- 1.7.7.4 From 49c577432325de449239ce5ed5e2b82e401eee14 Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Thu, 21 Feb 2013 17:58:55 -0500 Subject: [PATCH 2/2] maint: add special config flags for pkg-config * scripts/autotools-install: force pkg-config to use internal 'glib' files when compiling from source. Recent pkg-config has a cyclic requirement of glib, explained in the pkg-config's README: http://cgit.freedesktop.org/pkg-config/tree/README?id=pkg-config-0.27.1 pkg-config depends on glib. Note that glib build-depends on pkg-config, but you can just set the corresponding environment variables to build it. If this requirement is too cumbersome, a bundled copy of a recent glib stable release is included. Pass --with-internal-glib to configure to use this copy. --- scripts/autotools-install |4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) diff --git a/scripts/autotools-install b/scripts/autotools-install index 419806d..2b626ff 100755 --- a/scripts/autotools-install +++ b/scripts/autotools-install @@ -144,11 +144,13 @@ pkgs=`get_sources` export PATH=$prefix/bin:$PATH for pkg in $pkgs; do echo building/installing $pkg... + extra= + case $pkg in pkg-config*) extra=--with-internal-glib;; esac dir=`basename $pkg .tar.gz` rm -rf $dir gzip -dc $pkg | tar xf - cd $dir - ./configure CFLAGS=-O2 LDFLAGS=-s --prefix=$prefix makerr-config 21 \ + ./configure CFLAGS=-O2 LDFLAGS=-s --prefix=$prefix $extramakerr-config 21 \ || die configuring package $dir failed. \ check '$tmpdir/$dir/makerr-config' for possible details. $MAKE makerr-build 21 \ -- 1.7.7.4
Re: bug#13786: pr command does not fold
(Adding the list) Doh Smith wrote, On 02/22/2013 05:14 AM: I could not get the pr command to fold the lines. Is this a bug? I replied but forgot to CC the mailing list, answer available here: http://bugs.gnu.org/13786 This bug can likely be closed, if others agree. -gordon
Re: coreutils FAQ link in manpages and/or --help output
Hi, Bernhard Voelker wrote, On 02/25/2013 10:58 AM: On 02/25/2013 03:53 PM, Ondrej Oprala wrote: to reduce the amount of questions about date, sort and anything multibyte-related, I think it'd be a good idea to add a link to the coreutils FAQ to the man pages (and/or --help output), maybe something like Report TOOL bugs to bug-coreut...@gnu.org but please make sure the behaviour is not listed in FAQ: link What do you think? I'd not be surprised if this had already been discussed before. However, I like the idea, but it would be best if we had nice translations for that FAQ page. If changing the general help message (I assume from emit_ancillary_info()), perhaps consider another change: Add a line (preferably on top) that would point to coreutils@gnu.org in addition to bugs-coreut...@gnu.org, saying something like: For common usage questions, see FAQ and then Send usage questions to coreutils@gnu.org and only then print the existing: Report sort bugs to bug-coreut...@gnu.org This will (hopefully?) prevent those cases where people send general questions and open a new bug, forcing someone to respond with the boiler-plate answer by sending an email to this mailing list, you've opened a bug-report, and I'm closing it. There's already a link saying: General help using GNU software: http://www.gnu.org/gethelp/ But this page is not very helpful for a non-expert who just wants to get help about a specific GNU coreutils program... Just my two cents, -gordon
Re: [PATCH]: uniq: add --group option
Pádraig Brady wrote, On 02/27/2013 08:16 PM: On 02/21/2013 07:40 PM, Assaf Gordon wrote: Assaf Gordon wrote, On 02/21/2013 11:37 AM: You were planning on --group to mean explicitly output all input lines, and add group-markers for unique groups (meaning -u/-d/-D and --group are mutually exclusive). I'll push this tomorrow with the attached changes. I added NEWS, docs and refactored the default and --group core loops together, as they're essentially the same. Thank you. Once pushed, I'll send a rebased patch for the sort/join/uniq key-comparison feature. -gordon
Re: [PATCH]: uniq: add tests for --ignore-case
Pádraig Brady wrote, On 02/27/2013 10:38 PM: On 02/12/2013 03:44 PM, Assaf Gordon wrote: I noticed that by running the default test suite (make check SUBDIRS=.), the majority of uniq tests are skipped: uniq: skipping this test -- no appropriate locale SKIP: tests/misc/uniq.pl PASS: tests/misc/uniq-perf.sh This is due to tests/misc/uniq.pl line 83: 83 # I've only ever triggered the problem in a non-C locale. 84 my $locale = $ENV{LOCALE_FR}; 85 ! defined $locale || $locale eq 'none' 86 and CuSkip::skip $prog: skipping this test -- no appropriate locale\n; which skips the entire suite if there's no french locale defined, even though only one test actually sets the locale. I can have a patch for it, if that's acceptable. Thanks for noticing that. A patch would be much appreciated. Attached a patch to not-skip all uniq tests if french locale is missing. -gordon From 65e47a463e672eddf8f7ed0ca5a9886033e0ef69 Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Thu, 28 Feb 2013 14:12:52 -0500 Subject: [PATCH] uniq: don't skip all tests when locale is missing * tests/misc/uniq.pl: Previously, if LOCALE_FR was not defined, all tests would be skipped. Modified to skip only the relevant test. --- tests/misc/uniq.pl | 41 ++--- 1 files changed, 26 insertions(+), 15 deletions(-) diff --git a/tests/misc/uniq.pl b/tests/misc/uniq.pl index e3873b5..4fe1357 100755 --- a/tests/misc/uniq.pl +++ b/tests/misc/uniq.pl @@ -80,23 +80,8 @@ sub add_z_variants($) return @new; } -# I've only ever triggered the problem in a non-C locale. -my $locale = $ENV{LOCALE_FR}; -! defined $locale || $locale eq 'none' - and CuSkip::skip $prog: skipping this test -- no appropriate locale\n; - -# See if isblank returns true for nbsp. -my $x = qx!env printf '\xa0'| LC_ALL=$locale tr '[:blank:]' x!; -# If so, expect just one line of output in the schar test. -# Otherwise, expect two. -my $in = y z\n\xa0 y z\n; -my $schar_exp = $x eq 'x' ? 
y z\n : $in; - my @Tests = ( - # Test for a subtle, system-and-locale-dependent bug in uniq. - ['schar', '-f1', {IN = $in}, {OUT = $schar_exp}, - {ENV = LC_ALL=$locale}], ['1', '', {IN=''}, {OUT=''}], ['2', '', {IN=a\na\n}, {OUT=a\n}], ['3', '', {IN=a\na}, {OUT=a\n}], @@ -205,6 +190,32 @@ my @Tests = ['127', '--ignore-case', {IN=A\na\n}, {OUT=A\n}], ); + +# Locale related tests + +my $locale = $ENV{LOCALE_FR}; +if ( defined $locale $locale ne 'none' ) + { +# I've only ever triggered the problem in a non-C locale. + +# See if isblank returns true for nbsp. +my $x = qx!env printf '\xa0'| LC_ALL=$locale tr '[:blank:]' x!; +# If so, expect just one line of output in the schar test. +# Otherwise, expect two. +my $in = y z\n\xa0 y z\n; +my $schar_exp = $x eq 'x' ? y z\n : $in; + +my @Locale_Tests = +( + # Test for a subtle, system-and-locale-dependent bug in uniq. + ['schar', '-f1', {IN = $in}, {OUT = $schar_exp}, +{ENV = LC_ALL=$locale}] +); + +push @Tests, @Locale_Tests; + } + + # Set _POSIX2_VERSION=199209 in the environment of each obs-plus* test. foreach my $t (@Tests) { -- 1.7.7.4
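The skip-only-what-needs-the-locale logic of the patch above can be sketched as a shell guard (the variable names are illustrative; the real change is in the Perl test driver):

```shell
# Decide whether the locale-dependent 'schar' check can run;
# the plain (locale-independent) uniq tests run either way.
locale=${LOCALE_FR:-none}
if [ "$locale" = none ]; then
  run_locale_tests=no    # skip only the schar test, not the whole suite
else
  run_locale_tests=yes
fi
echo "locale-dependent tests: $run_locale_tests"
```

This mirrors the patch: previously the `none` branch skipped every test in uniq.pl; afterwards it merely omits the one test that sets LC_ALL.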
Re: sort/uniq/join: key-comparison code consolidation
Hello, Assaf Gordon wrote, On 02/14/2013 06:07 PM: ( new thread for previous topic http://lists.gnu.org/archive/html/coreutils/2013-02/msg00082.html ) . Attached is the sort/uniq/join key-comparison patch, rebased against the latest revision. This patch should also be cleaner and the commit comments more helpful. comments are welcomed, -gordon key-compare4.patch.xz Description: application/xz
Re: coreutils FAQ link in manpages and/or --help output
Pádraig Brady wrote, On 02/28/2013 08:12 AM: On 02/28/2013 08:40 AM, Ondrej Vasik wrote: On Thu, 2013-02-28 at 09:26 +0100, Bernhard Voelker wrote: On February 28, 2013 at 4:23 AM Pádraig Brady p...@draigbrady.com wrote: I've adjusted the above to only reference online resources, and ensure the links are at the end of each line. The result is now at http://www.gnu.org/software/coreutils/ I like it. Looks great. Since you're updating the website to make it more approachable, may I suggest two more changes? 1. In the Downloads section, put a direct link to the official GIT repository ( http://git.savannah.gnu.org/cgit/coreutils.git ) ? The current text says: Coreutils source releases can be found at Test source releases can be found at The latest source code, along with a revision history, can be found in the Savannah repository It's true that it mentions the Savannah repository, but it's not immediately clear what's going on. And to actually see the Git page, one has to click on the Savannah repository link (and the Savannah page is a bit of an overloaded mess), then go to the Source Code drop-down menu, and click on Browse Source Code - not exactly intuitive. If we could add a simple line below that says: View git source code repository: http://git.savannah.gnu.org/cgit/coreutils.git It would be much more convenient. 2. In the Downloads section, mention which is the latest version, and provide a direct link to it. This requires a bit of work every time a new release is made, but it's very helpful for someone who just wants to download the latest version without exploring the GNU FTP website. -gordon
[PATCH] shuf: use reservoir-sampling when possible
Hello, Attached is a suggestion to implement reservoir-sampling in shuf: When the expected output of lines is known, it will not load the entire file into memory - allowing shuffling very large inputs. I've seen this mentioned once: http://lists.gnu.org/archive/html/coreutils/2012-11/msg00079.html but no follow-up discussion. There is no change in the usage of shuf (barring unexpected bugs...). Example (with debug messages): === $ seq 1 | ./src/shuf ---debug -n 5 --reservoir_sampling-- filling reservoir, input line 1 of 5: '1' filling reservoir, input line 2 of 5: '2' filling reservoir, input line 3 of 5: '3' filling reservoir, input line 4 of 5: '4' filling reservoir, input line 5 of 5: '5' Replacing reservoir sample 4 with line 7 '7' Replacing reservoir sample 4 with line 8 '8' Replacing reservoir sample 3 with line 9 '9' Replacing reservoir sample 2 with line 10 '10' Replacing reservoir sample 4 with line 11 '11' Replacing reservoir sample 4 with line 16 '16' Replacing reservoir sample 4 with line 17 '17' Replacing reservoir sample 4 with line 20 '20' Replacing reservoir sample 2 with line 22 '22' Replacing reservoir sample 0 with line 31 '31' Replacing reservoir sample 1 with line 52 '52' Replacing reservoir sample 4 with line 55 '55' Replacing reservoir sample 3 with line 61 '61' Replacing reservoir sample 4 with line 76 '76' Replacing reservoir sample 2 with line 169 '169' Replacing reservoir sample 2 with line 187 '187' Replacing reservoir sample 0 with line 216 '216' Replacing reservoir sample 1 with line 340 '340' Replacing reservoir sample 4 with line 431 '431' Replacing reservoir sample 1 with line 524 '524' Replacing reservoir sample 2 with line 942 '942' Replacing reservoir sample 1 with line 1096 '1096' Replacing reservoir sample 2 with line 1627 '1627' Replacing reservoir sample 4 with line 1763 '1763' Replacing reservoir sample 2 with line 2679 '2679' Replacing reservoir sample 3 with line 4382 '4382' Replacing reservoir sample 2 with line 4439 
'4439' Replacing reservoir sample 3 with line 7748 '7748' Replacing reservoir sample 2 with line 9902 '9902' -- reservoir lines (begin)-- 216 1096 9902 7748 1763 -- reservoir lines (end)-- 216 1763 7748 1096 9902 === The last 5 lines are the final output (the rest is STDERR debug messages). After the input is read completely, the lines are still re-permuted (using the existing shuf code), to accommodate cases like: === $ seq 6 | ./src/shuf ---debug -n 5 --reservoir_sampling-- filling reservoir, input line 1 of 5: '1' filling reservoir, input line 2 of 5: '2' filling reservoir, input line 3 of 5: '3' filling reservoir, input line 4 of 5: '4' filling reservoir, input line 5 of 5: '5' Replacing reservoir sample 2 with line 6 '6' -- reservoir lines (begin)-- 1 2 6 4 5 -- reservoir lines (end)-- 4 2 1 6 5 === Comments are welcomed, -gordon From b64d5063e26c0f3485d8342a2d5501f655f1063e Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Wed, 6 Mar 2013 18:25:49 -0500 Subject: [PATCH] shuf: use reservoir-sampling when possible * src/shuf.c: Use reservoir-sampling when the number of output lines is known (by using '-n X' parameter). read_input_reservoir_sampling() - read lines from input file, and keep only K lines in memory, replacing lines with decreasing probability. prepare_shuf_lines() - convert reservoir lines to a usable structure. main() - if the number of lines is known, use reservoir-sampling instead of reading entire input file. --- src/shuf.c | 171 ++-- 1 files changed, 167 insertions(+), 4 deletions(-) diff --git a/src/shuf.c b/src/shuf.c index 71ac3e6..27982e5 100644 --- a/src/shuf.c +++ b/src/shuf.c @@ -25,6 +25,7 @@ #include error.h #include fadvise.h #include getopt.h +#include linebuffer.h #include quote.h #include quotearg.h #include randint.h @@ -81,7 +82,8 @@ With no FILE, or when FILE is -, read standard input.\n\ non-character as a pseudo short option, starting with CHAR_MAX + 1. 
*/ enum { - RANDOM_SOURCE_OPTION = CHAR_MAX + 1 + RANDOM_SOURCE_OPTION = CHAR_MAX + 1, + DEV_DEBUG_OPTION }; static struct option const long_opts[] = @@ -92,11 +94,31 @@ static struct option const long_opts[] = {output, required_argument, NULL, 'o'}, {random-source, required_argument, NULL, RANDOM_SOURCE_OPTION}, {zero-terminated, no_argument, NULL, 'z'}, + {-debug, no_argument, NULL, DEV_DEBUG_OPTION}, {GETOPT_HELP_OPTION_DECL}, {GETOPT_VERSION_OPTION_DECL}, {0, 0, 0, 0}, }; +/* debugging for developers. Enables devmsg(). */ +static bool dev_debug = false; + +/* Like error(0, 0, ...), but without an implicit newline. + Also a noop unless the global DEV_DEBUG is set. + TODO: Replace with variadic macro in system.h or + move to a separate module. */ +static inline void +devmsg (char
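The underlying algorithm (reservoir sampling, "Algorithm R", which the patch implements in C inside read_input_reservoir_sampling) is easy to sketch with awk; k=5, the fixed seed, and sample.txt are illustrative:

```shell
# Keep a uniform random sample of k lines while holding only k lines
# in memory, regardless of input size.
seq 10000 | awk -v k=5 'BEGIN { srand(42) }
  NR <= k { res[NR] = $0; next }   # fill the reservoir with the first k lines
  {
    j = int(rand() * NR) + 1       # uniform index in 1..NR
    if (j <= k) res[j] = $0        # replace: this line survives with prob. k/NR
  }
  END { for (i = 1; i <= k; i++) print res[i] }' > sample.txt
```

Each input line ends up in the sample with probability k/N, which is the property that lets shuf avoid loading the whole input when `-n` is given.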
Re: [PATCH] shuf: use reservoir-sampling when possible
Hello, Attached is an updated version. Pádraig Brady wrote, On 03/06/2013 08:24 PM: On 03/06/2013 11:50 PM, Assaf Gordon wrote: Attached is a suggestion to implement reservoir-sampling in shuf: When the expected output of lines is known, it will not load the entire file into memory - allowing shuffling very large inputs. Regarding comments: {-debug, no_argument, NULL, DEV_DEBUG_OPTION}, no need to keep this, for final commit. Yes, I'll remove this once the code is acceptable. prepare_shuf_lines (struct linebuffer *in_lines, size_t n, char ***out_lines, I've not looked into the details, but it would be nice to avoid the memcpy/conversion here I've removed the conversion function, and instead added a new function to output the lines directly. static size_t read_input_reservoir_sampling (FILE *in, char eolbyte, char ***pline, size_t k, struct randint_source *s) ... struct linebuffer *rsrv = XCALLOC (k, struct linebuffer); /* init reservoir*/ Since this change is mainly about efficient mem usage we should probably handle the case where we have small inputs but large k. This will allocate (and zero) memory up front. The zeroing will defeat any memory overcommit configured on the system, but it's probably better to avoid the large initial commit and realloc as required (not per line, but per 1K lines maybe). I'm not quite sure about this: The reservoir-sampling path can only be used when the user explicitly ask to limit output lines. I would naively assume that if a user explicitly asked to limit the output to 1,000,000 lines, he/she expects large input as well. And so the (edge?) case of asking for a large number of output lines, but supplying very small number of input lines is rare. Wouldn't you agree? or is there a different typical usage case? Also, the allocation only allocates an array of struct linebuffer (on 64bit systems, 24 bytes). So even asking for 1M lines will allocate 24MB of RAM - not too much on modern machines. 
The second attached patch is experimental - it tries to assess the randomness of 'shuf' output by running it 1,000 times and checking if the output is (very roughly) uniformly distributed. I don't know if there were attempts in the past to unit-test randomness (and then decided not to do it) - or if this was just never considered worth-while (or too error prone). Comments are welcomed, -gordon From 1adfd08cd3a52c373932b0f1039755a240d2c0b8 Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Thu, 7 Mar 2013 01:57:57 -0500 Subject: [PATCH 1/2] shuf: add (expensive) test for randomness To run manually: make check TESTS=tests/misc/shuf-nonrandomess.sh \ SUBDIRS=. RUN_VERY_EXPENSIVE_TESTS=yes * tests/misc/shuf-randomness.sh: run 'shuf' repeatedly, and check if the output is uniformly distributed enough. * tests/local.mk: add new test script. --- tests/local.mk|1 + tests/misc/shuf-randomness.sh | 186 + 2 files changed, 187 insertions(+), 0 deletions(-) create mode 100755 tests/misc/shuf-randomness.sh diff --git a/tests/local.mk b/tests/local.mk index 607ddc4..d3923f8 100644 --- a/tests/local.mk +++ b/tests/local.mk @@ -313,6 +313,7 @@ all_tests = \ tests/misc/shred-passes.sh \ tests/misc/shred-remove.sh \ tests/misc/shuf.sh\ + tests/misc/shuf-randomness.sh \ tests/misc/sort.pl\ tests/misc/sort-benchmark-random.sh \ tests/misc/sort-compress.sh \ diff --git a/tests/misc/shuf-randomness.sh b/tests/misc/shuf-randomness.sh new file mode 100755 index 000..3e35cca --- /dev/null +++ b/tests/misc/shuf-randomness.sh @@ -0,0 +1,186 @@ +#!/bin/sh +# Test shuf for somewhat uniform randomness + +# Copyright (C) 2013 Free Software Foundation, Inc. + +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. 
+ +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. + +# You should have received a copy of the GNU General Public License +# along with this program. If not, see http://www.gnu.org/licenses/. + +. ${srcdir=.}/tests/init.sh; path_prepend_ ./src +print_ver_ shuf +getlimits_ + +# Don't run these tests by default. +very_expensive_ + +# Number of trails +T=1000 + +# Number of categories +N=100 +REQUIRED_CHI_SQUARED=200 # Be extremely leniet: + # don't require great goodness of fit + # even for our assumed 99 degrees of freedom + +# K - when testing reservoir-sampling, print K lines +K=20 +REQUIRED_CHI_SQUARED_K=50
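The statistic the test script relies on can be shown with a deterministic example: chi-squared is zero for a perfectly uniform tally and grows as counts deviate from trials/N (n=100 and trials=1000 below mirror the script's N and T; the numbers are otherwise illustrative):

```shell
chi2=$(awk 'BEGIN {
  n = 100; trials = 1000; expected = trials / n
  for (i = 1; i <= n; i++) counts[i] = expected   # perfectly uniform tally
  for (i = 1; i <= n; i++) {
    d = counts[i] - expected
    sum += d * d / expected                       # sum of (O-E)^2 / E
  }
  print sum
}')
echo "chi-squared: $chi2"
```

A real run would fill counts[] by tallying shuf's output over many trials and then compare chi2 against the lenient threshold, rather than using a synthetic tally.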
[PATCH] csplit: new option --suppress-matched
Hello, Attached is a new option for csplit, suppress-matched, as has been mentioned a few times before (e.g. http://lists.gnu.org/archive/html/coreutils/2013-02/msg00170.html ). It works well for REGEXP patterns, but there's a bug with INTEGER patterns that I haven't been able to pinpoint yet (suggestions are welcome). Regards, -gordon From 49f43214ebfa41fa1f67e7001d8467288ff34837 Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Wed, 6 Mar 2013 15:53:16 -0500 Subject: [PATCH] csplit: new option, --suppress-matched FIXME: Currently works only with REGEXP patterns. With --suppress-matched, the lines that match the pattern will not be printed in the output files. * src/csplit.c: implement --suppress-matched. process_regexp(), process_line_count(): skip the matched lines without printing. Since csplit always outputs up to but not including the matched line, the first line (in the next group) is the matched line - just skip it. main(): handle new option. usage(): mention new option. * NEWS: mention new option. * doc/coreutils.texi: mention new option, add examples. * tests/misc/csplit-suppress-matched.sh: test new option. * tests/local.mk: add new test script. --- NEWS |3 + doc/coreutils.texi| 25 src/csplit.c | 26 - tests/local.mk|1 + tests/misc/csplit-suppress-matched.sh | 233 + 5 files changed, 287 insertions(+), 1 deletions(-) create mode 100755 tests/misc/csplit-suppress-matched.sh diff --git a/NEWS b/NEWS index 5b28c92..2385be7 100644 --- a/NEWS +++ b/NEWS @@ -18,6 +18,9 @@ GNU coreutils NEWS-*- outline -*- uniq accepts a new option: --group to print all items, while separating unique groups with empty lines. + csplit accepts a new option: --suppress-matched (-m). Lines matching + the specified patterns will not be printed. 
+ * Noteworthy changes in release 8.21 (2013-02-14) [stable] diff --git a/doc/coreutils.texi b/doc/coreutils.texi index fe4c3ad..4f7da4c 100644 --- a/doc/coreutils.texi +++ b/doc/coreutils.texi @@ -3608,6 +3608,12 @@ long instead of the default 2. @opindex --keep-files Do not remove output files when errors are encountered. +@item -m +@itemx --suppress-matched +@opindex -m +@opindex --suppress-matched +Do not output lines matching the specified @var{pattern}. + @item -z @itemx --elide-empty-files @opindex -z @@ -3684,6 +3690,25 @@ $ head xx* 14 @end example +Example of splitting input by empty lines: + +@example +$ csplit --suppress-matched @var{input.txt} '/^$/' '@{*@}' +@end example + +@c +@c TODO: uniq already supportes --group. +@cwhen it gets the --key option, uncomment this example. +@c +@c Example of splitting input file, based on the value of column 2: +@c +@c @example +@c $ cat @var{input.txt} | +@c sort -k2,2 | +@c uniq --group -k2,2 | +@c csplit -m '/^$/' '@{*@}' +@c @end example + @node Summarizing files @chapter Summarizing files diff --git a/src/csplit.c b/src/csplit.c index 22f3ad4..664b567 100644 --- a/src/csplit.c +++ b/src/csplit.c @@ -166,6 +166,9 @@ static bool volatile remove_files; /* If true, remove all output files which have a zero length. */ static bool elide_empty_files; +/* If true, supress the lines that match the PATTERN */ +static bool suppress_matched; + /* The compiled pattern arguments, which determine how to split the input file. 
*/ static struct control *controls; @@ -185,6 +188,7 @@ static struct option const longopts[] = {elide-empty-files, no_argument, NULL, 'z'}, {prefix, required_argument, NULL, 'f'}, {suffix-format, required_argument, NULL, 'b'}, + {suppress-matched, no_argument, NULL, 'm'}, {GETOPT_HELP_OPTION_DECL}, {GETOPT_VERSION_OPTION_DECL}, {NULL, 0, NULL, 0} @@ -721,6 +725,15 @@ process_line_count (const struct control *p, uintmax_t repetition) create_output_file (); +#if 0 + /* FIXME: this doesn't work when the last line is the matched line + * e.g.: + * $ seq 1 6 | ./src/csplit -m - 2 4 6 + */ + if (suppress_matched) +line = remove_line (); +#endif + linenum = get_first_line_in_buffer (); while (linenum++ last_line_to_save) @@ -778,6 +791,9 @@ process_regexp (struct control *p, uintmax_t repetition) if (!ignore) create_output_file (); + if (suppress_matched current_line 0) +line = remove_line (); + /* If there is no offset for the regular expression, or it is positive, then it is not necessary to buffer the lines. */ @@ -1324,9 +1340,10 @@ main (int argc, char **argv) control_used = 0; suppress_count = false; remove_files = true; + suppress_matched = false; prefix = DEFAULT_PREFIX; - while ((optc = getopt_long (argc, argv, f:b:kn:sqz, longopts, NULL)) != -1) + while ((optc = getopt_long (argc, argv, f:b:kmn:sqz, longopts, NULL)) != -1) switch (optc
[PATCH] tests: test sort,shuf with rngtest
Hello, Regarding comment: Pádraig Brady wrote, On 03/07/2013 06:26 PM: On 03/07/2013 07:32 PM, Assaf Gordon wrote: The second attached patch is experimental - it tries to assess the randomness of 'shuf' output by running it 1,000 times and checking if the output is (very roughly) uniformly distributed. Cool, I was considering testing with rngtest or something, so it'll be good to have something independent. ( http://lists.gnu.org/archive/html/coreutils/2013-03/msg00030.html ) Using rngtest is probably much more reliable than the independent test - attached are tests for sort and shuf with rngtest. They are marked 'expensive' as they require an external program and they run each test 10 times. -gordon From 15392de8f0ffa0746c9fd338ed14d15b614029a3 Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Fri, 8 Mar 2013 15:54:24 -0500 Subject: [PATCH] tests: test sort,shuf with rngtest rngtest check the randomness of data using FIPS 140-2 tests. http://sourceforge.net/projects/gkernel/ If rngtest is not installed (and available in the PATH), the tests will be skipped. These tests are marked 'expensive'. To run directly: $ make check TESTS=tests/misc/sort-rand-rngtest.sh \ SUBDIRS=. RUN_EXPENSIVE_TESTS=yes $ make check TESTS=tests/misc/shuf-rand-rngtest.sh \ SUBDIRS=. RUN_EXPENSIVE_TESTS=yes * tests/misc/shuf-rand-rngtest.sh - test shuf with rngtest. * tests/misc/sort-rand-rngtest.sh - test sort with rngtest. * tests/local.mk - add above tests. 
--- tests/local.mk |2 + tests/misc/shuf-rand-rngtest.sh | 78 +++ tests/misc/sort-rand-rngtest.sh | 71 +++ 3 files changed, 151 insertions(+), 0 deletions(-) create mode 100755 tests/misc/shuf-rand-rngtest.sh create mode 100755 tests/misc/sort-rand-rngtest.sh diff --git a/tests/local.mk b/tests/local.mk index 607ddc4..21d347a 100644 --- a/tests/local.mk +++ b/tests/local.mk @@ -313,6 +313,7 @@ all_tests = \ tests/misc/shred-passes.sh \ tests/misc/shred-remove.sh \ tests/misc/shuf.sh\ + tests/misc/shuf-rand-rngtest.sh \ tests/misc/sort.pl\ tests/misc/sort-benchmark-random.sh \ tests/misc/sort-compress.sh \ @@ -329,6 +330,7 @@ all_tests = \ tests/misc/sort-month.sh \ tests/misc/sort-exit-early.sh \ tests/misc/sort-rand.sh \ + tests/misc/sort-rand-rngtest.sh \ tests/misc/sort-spinlock-abuse.sh \ tests/misc/sort-stale-thread-mem.sh \ tests/misc/sort-unique.sh \ diff --git a/tests/misc/shuf-rand-rngtest.sh b/tests/misc/shuf-rand-rngtest.sh new file mode 100755 index 000..9ad2797 --- /dev/null +++ b/tests/misc/shuf-rand-rngtest.sh @@ -0,0 +1,78 @@ +#!/bin/sh +# Test shuf's random output with rngtest +# +# NOTE: +# rngtest must be installed, or the test will be skipped. +# rngtest is available here: http://sourceforge.net/projects/gkernel/ + +# Copyright (C) 2013 Free Software Foundation, Inc. + +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. + +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. + +# You should have received a copy of the GNU General Public License +# along with this program. If not, see http://www.gnu.org/licenses/. + +. 
${srcdir=.}/tests/init.sh; path_prepend_ ./src +print_ver_ shuf +expensive_ + +if ! which rngtest /dev/null ; then + skip_ rngtest not found - skipping test. +fi + +# Test for randomness several times. +# On the reare occasion when the randomly sorted data doesn't pass rngtest, +# it should be just one failure out of 10 rounds. +# If more rounds fail in a single run - there's likely a real problem. +ROUNDS=10 + +( yes 1 | head -n 1 ; yes 0 | head -n 1 ) in || framework_failure_ + +# rgntest always reads the first 32 bits as bootstrap data +printf \x00\x00\x00\x00 rngtest_header || framework_failure_ + + +# Sanity check: +# unsorted data should not be random +cat in | tr -d '\n' | \ + perl -npe '$_=pack(b*,$_)' out_non_random || framework_failure_ + +echo Testing rngtest on non-random input: 12 +cat rngtest_header out_non_random | rngtest + { fail=1 ; echo rngtest failed to detect non-random data. 12 ; } + +# +# Check randomness of shuf's output +# (using the 'read-entire-file' code path) +for i in $(seq $ROUNDS) ; do + cat in | shuf | tr -d '\n' | \ + perl -npe '$_=pack(b*,$_)' out_random$i || framework_failure_ + + echo Testing rngtest on randomly-sorted input (round $i of $ROUNDS): 12 + cat rngtest_header out_random$i | rngtest || + { fail=1 ; echo shuf random
Re: [PATCH] shuf: use reservoir-sampling when possible
Hello, Pádraig Brady wrote, On 03/07/2013 06:26 PM: On 03/07/2013 07:32 PM, Assaf Gordon wrote: Pádraig Brady wrote, On 03/06/2013 08:24 PM: On 03/06/2013 11:50 PM, Assaf Gordon wrote: Attached is a suggestion to implement reservoir-sampling in shuf: When the expected output of lines is known, it will not load the entire file into memory - allowing shuffling very large inputs. static size_t read_input_reservoir_sampling (FILE *in, char eolbyte, char ***pline, size_t k, struct randint_source *s) ... struct linebuffer *rsrv = XCALLOC (k, struct linebuffer); /* init reservoir*/ Since this change is mainly about efficient mem usage we should probably handle the case where we have small inputs but large k. This will allocate (and zero) memory up front. The zeroing will defeat any memory overcommit configured on the system, but it's probably better to avoid the large initial commit and realloc as required (not per line, but per 1K lines maybe). Attached is an updated version (mostly a re-write of the memory allocation part), as per the comment above. Also includes a very_expensive valgrind test to exercise the new code. (and the other patch is the uniform-distribution randomness test). -gordon From 0ff2403dde869af3f9a44dd7418aae3082d8c0aa Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Thu, 7 Mar 2013 01:57:57 -0500 Subject: [PATCH 1/2] shuf: add (expensive) test for randomness To run manually: make check TESTS=tests/misc/shuf-randomess.sh \ SUBDIRS=. RUN_VERY_EXPENSIVE_TESTS=yes * tests/misc/shuf-randomness.sh: run 'shuf' repeatedly, and check if the output is uniformly distributed enough. * tests/local.mk: add new test script. 
--- tests/local.mk|1 + tests/misc/shuf-randomness.sh | 187 + 2 files changed, 188 insertions(+), 0 deletions(-) create mode 100755 tests/misc/shuf-randomness.sh diff --git a/tests/local.mk b/tests/local.mk index 607ddc4..d3923f8 100644 --- a/tests/local.mk +++ b/tests/local.mk @@ -313,6 +313,7 @@ all_tests = \ tests/misc/shred-passes.sh \ tests/misc/shred-remove.sh \ tests/misc/shuf.sh\ + tests/misc/shuf-randomness.sh \ tests/misc/sort.pl\ tests/misc/sort-benchmark-random.sh \ tests/misc/sort-compress.sh \ diff --git a/tests/misc/shuf-randomness.sh b/tests/misc/shuf-randomness.sh new file mode 100755 index 000..c0b9e2e --- /dev/null +++ b/tests/misc/shuf-randomness.sh @@ -0,0 +1,187 @@ +#!/bin/sh +# Test shuf for somewhat uniform randomness + +# Copyright (C) 2013 Free Software Foundation, Inc. + +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. + +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. + +# You should have received a copy of the GNU General Public License +# along with this program. If not, see http://www.gnu.org/licenses/. + +. ${srcdir=.}/tests/init.sh; path_prepend_ ./src +print_ver_ shuf +getlimits_ + +# Don't run these tests by default. 
+very_expensive_ + +# Number of trails +T=1000 + +# Number of categories +N=100 +REQUIRED_CHI_SQUARED=200 # Be extremely leniet: + # don't require great goodness of fit + # even for our assumed 99 degrees of freedom + +# K - when testing reservoir-sampling, print K lines +K=20 +REQUIRED_CHI_SQUARED_K=50 # Be extremely leniet: + # don't require great goodness of fit + # even for our assumed 19 degrees of freedom + + + +# The input: many zeros followed by 1 one +(yes 0 | head -n $((N-1)) ; echo 1 ) in || framework_failure_ + + +is_uniform() +{ + # Input is assumed to be a string of $T spaces-separated-values + # between 1 and $N + LINES=$1 + + # Convert spaces to new-lines + LINES=$(echo $LINES | tr ' ' '\n' | sed '/^$/d') || framework_failure_ + + # Requre exactly $T values + COUNT=$(echo $LINES | wc -l) + test $COUNT -eq $T || framework_failure_ + + # HIST is the histogram of counts per categories + # ( categories are between 1 and $N ) + HIST=$(echo $LINES | sort -n | uniq -c) + + #DEBUG + #echo HIST=$HIST 12 + + ## Calculate Chi-Squared + CHI=$( echo $HIST | + awk -v n=$N -v t=$T '{ counts[$2] = $1 } + END { + exptd = ((1.0)*t)/n + chi = 0 + for (i=1;i=n;++i) + { +if (i in counts
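The reservoir-sampling scheme this thread adds to shuf can be sketched in a few lines of awk (Algorithm R: keep the first k lines, then replace a random reservoir slot with probability k/NR). This is only an illustration of the algorithm, not the C code from the patch:

```shell
# Pick k=5 uniformly random lines from a stream without holding it in memory.
seq 1 100000 | awk -v k=5 '
  BEGIN { srand() }
  NR <= k { r[NR] = $0; next }          # fill the reservoir
  { j = int(rand() * NR) + 1            # j is uniform on 1..NR
    if (j <= k) r[j] = $0 }             # replace a slot with probability k/NR
  END { for (i = 1; i <= k; i++) print r[i] }'
```

Every input line ends up in the output with the same probability k/N, which is the property the chi-squared test above checks, albeit very leniently.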
Re: [PATCH] shuf: use reservoir-sampling when possible
Hello Pádraig, Pádraig Brady wrote, On 03/24/2013 11:45 PM: On 03/06/2013 11:50 PM, Assaf Gordon wrote: Attached is a suggestion to implement reservoir-sampling in shuf: When the expected output of lines is known, it will not load the entire file into memory - allowing shuffling very large inputs. I've attached 9 patches to adjust things a bit. Looks great, thank you very much. One minor improvement: the comment in the test file is wrong (in early stages of the patch I thought I could use a fixed random-source and pre-calculate the expected output). Attached is a fix. -gordon From d01dd496c517e20ac92fcbbb6b34045303b1b514 Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Mon, 25 Mar 2013 12:25:50 -0400 Subject: [PATCH] maint: adjust shuf resevoir sampling comments * tests/misc/shuf-reservoir.sh: re-word comments. --- tests/misc/shuf-reservoir.sh |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/tests/misc/shuf-reservoir.sh b/tests/misc/shuf-reservoir.sh index b695afc..6ba6e6e 100755 --- a/tests/misc/shuf-reservoir.sh +++ b/tests/misc/shuf-reservoir.sh @@ -26,7 +26,7 @@ require_valgrind_ getlimits_ # Run shuf with specific number of input lines and output lines -# The output must match the expected (pre-calculated) output. +# Check the output for expected number of lines. run_shuf_n() { INPUT_LINES=$1 -- 1.7.7.4
Re: [PATCH] tests: test sort,shuf with rngtest
Assaf Gordon wrote, On 03/08/2013 04:28 PM: Pádraig Brady wrote, On 03/07/2013 06:26 PM: Cool, I was considering testing with rngtest or something, so it'll be good to have something independent. ( http://lists.gnu.org/archive/html/coreutils/2013-03/msg00030.html ) Using rngtest is probably much more reliable than the independent test - attached are tests for sort and shuf with rngtest. They are marked 'expensive' as they require an external program and they run each test 10 times. Same patch, rebased with the latest shuf/reservoir-sampling, and with require_rngtest_ added to init.cfg. -gordon From c4130abf2baf1f1484c9f72e0d2845b996d55210 Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Fri, 8 Mar 2013 15:54:24 -0500 Subject: [PATCH] tests: test sort,shuf with rngtest rngtest check the randomness of data using FIPS 140-2 tests. http://sourceforge.net/projects/gkernel/ If rngtest is not installed (and available in the PATH), the tests will be skipped. These tests are marked 'expensive'. To run directly: $ make check TESTS=tests/misc/sort-rand-rngtest.sh \ SUBDIRS=. RUN_EXPENSIVE_TESTS=yes $ make check TESTS=tests/misc/shuf-rand-rngtest.sh \ SUBDIRS=. RUN_EXPENSIVE_TESTS=yes * tests/misc/shuf-rand-rngtest.sh - test shuf with rngtest. * tests/misc/sort-rand-rngtest.sh - test sort with rngtest. * tests/local.mk - add above tests. * init.cfg - add 'require_rngtest_' function. 
--- init.cfg|7 tests/local.mk |2 + tests/misc/shuf-rand-rngtest.sh | 75 +++ tests/misc/sort-rand-rngtest.sh | 68 +++ 4 files changed, 152 insertions(+), 0 deletions(-) create mode 100755 tests/misc/shuf-rand-rngtest.sh create mode 100755 tests/misc/sort-rand-rngtest.sh diff --git a/init.cfg b/init.cfg index afee930..27d7627 100644 --- a/init.cfg +++ b/init.cfg @@ -169,6 +169,13 @@ require_valgrind_() skip_ requires a working valgrind } +# Skip the current test if rngtest doesn't work +require_rngtest_() +{ + rngtest -V 2/dev/null || +skip_ requires a working rngtest +} + require_setfacl_() { setfacl -m user::rwx . \ diff --git a/tests/local.mk b/tests/local.mk index dc87ef4..a75cfa3 100644 --- a/tests/local.mk +++ b/tests/local.mk @@ -313,6 +313,7 @@ all_tests = \ tests/misc/shred-passes.sh \ tests/misc/shred-remove.sh \ tests/misc/shuf.sh\ + tests/misc/shuf-rand-rngtest.sh \ tests/misc/shuf-reservoir.sh \ tests/misc/sort.pl\ tests/misc/sort-benchmark-random.sh \ @@ -330,6 +331,7 @@ all_tests = \ tests/misc/sort-month.sh \ tests/misc/sort-exit-early.sh \ tests/misc/sort-rand.sh \ + tests/misc/sort-rand-rngtest.sh \ tests/misc/sort-spinlock-abuse.sh \ tests/misc/sort-stale-thread-mem.sh \ tests/misc/sort-unique.sh \ diff --git a/tests/misc/shuf-rand-rngtest.sh b/tests/misc/shuf-rand-rngtest.sh new file mode 100755 index 000..934791f --- /dev/null +++ b/tests/misc/shuf-rand-rngtest.sh @@ -0,0 +1,75 @@ +#!/bin/sh +# Test shuf's random output with rngtest +# +# NOTE: +# rngtest must be installed, or the test will be skipped. +# rngtest is available here: http://sourceforge.net/projects/gkernel/ + +# Copyright (C) 2013 Free Software Foundation, Inc. + +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. 
+ +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. + +# You should have received a copy of the GNU General Public License +# along with this program. If not, see http://www.gnu.org/licenses/. + +. ${srcdir=.}/tests/init.sh; path_prepend_ ./src +print_ver_ shuf +expensive_ +require_rngtest_ + +# Test for randomness several times. +# On the reare occasion when the randomly sorted data doesn't pass rngtest, +# it should be just one failure out of 10 rounds. +# If more rounds fail in a single run - there's likely a real problem. +ROUNDS=10 + +( yes 1 | head -n 1 ; yes 0 | head -n 1 ) in || framework_failure_ + +# rgntest always reads the first 32 bits as bootstrap data +printf \x00\x00\x00\x00 rngtest_header || framework_failure_ + + +# Sanity check: +# unsorted data should not be random +cat in | tr -d '\n' | \ + perl -npe '$_=pack(b*,$_)' out_non_random || framework_failure_ + +echo Testing rngtest on non-random input: 12 +cat rngtest_header out_non_random | rngtest + { fail=1 ; echo rngtest failed to detect non-random data. 12 ; } + +# +# Check randomness of shuf's output +# (using the 'read-entire-file' code path) +for i in $(seq
Re: [PATCH] csplit: new option --suppress-matched
Hello, Assaf Gordon wrote, On 03/07/2013 05:39 PM: Attached is a new option for csplit, suppress-matched, as been mentioned few times before (e.g. http://lists.gnu.org/archive/html/coreutils/2013-02/msg00170.html ). Attached updated version (works with both regexp and int patterns). Also updated tests. Comments are welcomed, -gordon From eec5cf679824ed67c8b751ecb90565a22fc51719 Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Wed, 6 Mar 2013 15:53:16 -0500 Subject: [PATCH] csplit: new option --suppress-matched With --suppress-matched, the lines that match the pattern will not be printed in the output files. * src/csplit.c: implement --suppress-matched. process_regexp(),process_line_count(): skip the matched lined without printing. Since csplit always does up to but not including matched lines, the first line (in the next group) is the matched line - just skip it. main(): handle new option. usage(): mention new option. * NEWS: mention new option. * doc/coreutils.texi: mention new option, add examples. * tests/misc/csplit-supress-matched.pl: test new option. * tests/local.mk: add new test script. --- NEWS |3 + doc/coreutils.texi| 25 src/csplit.c | 29 - tests/local.mk|1 + tests/misc/csplit-suppress-matched.pl | 213 + 5 files changed, 268 insertions(+), 3 deletions(-) create mode 100644 tests/misc/csplit-suppress-matched.pl diff --git a/NEWS b/NEWS index 0c2daad..896512d 100644 --- a/NEWS +++ b/NEWS @@ -18,6 +18,9 @@ GNU coreutils NEWS-*- outline -*- uniq accepts a new option: --group to print all items, while separating unique groups with empty lines. + csplit accepts a new option: --suppressed-matched (-m). Lines matching + the specified patterns will not be printed. + ** Improvements stat and tail work better with EFIVARFS, EXOFS, F2FS and UBIFS. diff --git a/doc/coreutils.texi b/doc/coreutils.texi index dfa9b1c..7dfe724 100644 --- a/doc/coreutils.texi +++ b/doc/coreutils.texi @@ -3607,6 +3607,12 @@ long instead of the default 2. 
@opindex --keep-files Do not remove output files when errors are encountered. +@item -m +@itemx --suppress-matched +@opindex -m +@opindex --suppress-matched +Do not output lines matching the specified @var{pattern}. + @item -z @itemx --elide-empty-files @opindex -z @@ -3683,6 +3689,25 @@ $ head xx* 14 @end example +Example of splitting input by empty lines: + +@example +$ csplit --suppress-matched @var{input.txt} '/^$/' '@{*@}' +@end example + +@c +@c TODO: uniq already supportes --group. +@cwhen it gets the --key option, uncomment this example. +@c +@c Example of splitting input file, based on the value of column 2: +@c +@c @example +@c $ cat @var{input.txt} | +@c sort -k2,2 | +@c uniq --group -k2,2 | +@c csplit -m '/^$/' '@{*@}' +@c @end example + @node Summarizing files @chapter Summarizing files diff --git a/src/csplit.c b/src/csplit.c index 22f3ad4..4ae2de2 100644 --- a/src/csplit.c +++ b/src/csplit.c @@ -166,6 +166,9 @@ static bool volatile remove_files; /* If true, remove all output files which have a zero length. */ static bool elide_empty_files; +/* If true, suppress the lines that match the PATTERN */ +static bool suppress_matched; + /* The compiled pattern arguments, which determine how to split the input file. */ static struct control *controls; @@ -185,6 +188,7 @@ static struct option const longopts[] = {elide-empty-files, no_argument, NULL, 'z'}, {prefix, required_argument, NULL, 'f'}, {suffix-format, required_argument, NULL, 'b'}, + {suppress-matched, no_argument, NULL, 'm'}, {GETOPT_HELP_OPTION_DECL}, {GETOPT_VERSION_OPTION_DECL}, {NULL, 0, NULL, 0} @@ -721,8 +725,13 @@ process_line_count (const struct control *p, uintmax_t repetition) create_output_file (); - linenum = get_first_line_in_buffer (); + /* Ensure that the line number specified is not 1 greater than + the number of lines in the file. + When suppressing matched lines, check before the loop. 
*/ + if (no_more_lines () suppress_matched) +handle_line_error (p, repetition); + linenum = get_first_line_in_buffer (); while (linenum++ last_line_to_save) { line = remove_line (); @@ -733,9 +742,12 @@ process_line_count (const struct control *p, uintmax_t repetition) close_output_file (); + if (suppress_matched) +line = remove_line (); + /* Ensure that the line number specified is not 1 greater than the number of lines in the file. */ - if (no_more_lines ()) + if (no_more_lines () !suppress_matched) handle_line_error (p, repetition); } @@ -778,6 +790,9 @@ process_regexp (struct control *p, uintmax_t repetition) if (!ignore) create_output_file (); + if (suppress_matched current_line 0) +line = remove_line
Re: [PATCH] csplit: new option --suppress-matched
On 03/30/13 01:08, Pádraig Brady wrote: On 03/28/2013 10:10 PM, Assaf Gordon wrote: Attached is a new option for csplit, suppress-matched, as has been mentioned a few times before (e.g. http://lists.gnu.org/archive/html/coreutils/2013-02/msg00170.html ). The awkward case here is with integer boundaries and offsets. ... # Adding in the offset, we currently consider the # offset line as the one to suppress, rather than the matched pattern. This was exactly my original understanding of "matched" - not just the line that matched the regular expression, but the line that matched the specified pattern (i.e. regexp+offset or integer pattern) - and that's the line suppressed. This could be confusing, but at least it's consistent. So more accurately, what we're doing is suppressing the boundary line. So less confusingly and more accurately, this option should probably be named/described as: --suppress-boundary Suppress the boundary line from the start of the second and subsequent splits. I'm fine with whichever name you decide. I find "matched" more natural, and not so confusing, but "boundary" is just as good. I do think the description is a bit cumbersome (the "from the start of the second and subsequent splits" part) - it seems more confusing to me than just omitting it. It's probably one of those cases where a single example of input+output is worth more than a whole paragraph of explanation... Nice work on the tests BTW. Thanks. I found CMP by accident, after almost writing an equivalent mechanism from scratch. It's not mentioned in tests/Coreutils.pm; perhaps I'll send a small patch for that. I hope to apply this with the adjusted naming over the weekend. Thanks again.
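As the thread notes, a single input/output example says more than a paragraph of explanation. Here is a minimal sketch of the behavior under discussion, splitting on blank lines and suppressing the boundary line (requires a csplit that has the new option, i.e. GNU coreutils 8.22 or later):

```shell
printf 'a\nb\n\nc\nd\n' > input.txt
csplit -q --suppress-matched input.txt '/^$/' '{*}'
cat xx00    # first group: a, b
cat xx01    # second group: c, d; the blank boundary line appears in neither file
```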
[PATCH] tests: document CMP/PRE/POST in unit test module
Hello, Attached is a small patch to document CMP/PRE/POST in tests/Coreutils.pm. No code changes. -gordon From 229c94ebc0c4955a418f6e7348488d9ca28dc593 Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Mon, 1 Apr 2013 17:44:27 -0400 Subject: [PATCH] tests: document CMP/PRE/POST in unit test module *tests/Coreutils.pm: document CMP/PRE/POST keys. --- tests/Coreutils.pm |8 +++- 1 files changed, 7 insertions(+), 1 deletions(-) diff --git a/tests/Coreutils.pm b/tests/Coreutils.pm index 71b1516..fd4408a 100644 --- a/tests/Coreutils.pm +++ b/tests/Coreutils.pm @@ -54,7 +54,7 @@ defined $ENV{DJDIR} # I/O spec: a hash ref with the following properties # # - one key/value pair -# - the key must be one of these strings: IN, OUT, ERR, AUX, CMP, EXIT +# - the key must be one of these strings: IN, OUT, ERR, AUX, CMP, EXIT, PRE,POST # - the value must be a file spec # {OUT = 'data'}put data in a temp file and compare it to stdout from cmd # {OUT = {'filename'=undef}} compare contents of existing filename to @@ -82,6 +82,12 @@ defined $ENV{DJDIR} # {ENV_DEL = 'VAR'} # Remove VAR from the environment just before running the corresponding # command, and restore any value just afterwards. +# {CMP = [ 'data',{'filename'=undef}}Compare the content of 'filename' +# to 'data' (a string scalar). The program under test is expected to create +# file 'filename'. +# {PRE = sub{} } Execute sub() before running the test. +# {POST = sub{} } Execute sub() after running the test. +# If the PRE/POST sub calls die, the test will be marked as failed. # # There may be many input file specs. File names from the input specs # are concatenated in order on the command line. -- 1.7.7.4
Re: [PATCH] tests: document CMP/PRE/POST in unit test module
Thanks for the quick reply. Here's a better patch. Bernhard Voelker wrote, On 04/02/2013 04:03 AM: s/PRE,POST/PRE, POST due to the line length it may be worth adding a line break. Done. Also added IN_PIPE . Close square brackets, and move blank character to after the comma: Done. 2 notes: * According to the code, instead of a plain string, 'data' can also be a HASH. * If the file name is '@AUX@', then it is replaced. I do not fully understand those uses, so I can't really explain them. When are these useful? Furthermore, IN, AUX, and EXIT also do not seem to be documented yet. Do you like to document these, too? I've added IN and IN_PIPE. EXIT was already mentioned. AUX - I do not know what it does... -gordon From 309fd6398558b6e85ae2b2fa1cee6b5e2f492dde Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Mon, 1 Apr 2013 17:44:27 -0400 Subject: [PATCH] tests: document more test keys in unit test module * tests/Coreutils.pm: document IN/IN_PIPE/CMP/PRE/POST keys. --- tests/Coreutils.pm | 12 +++- 1 files changed, 11 insertions(+), 1 deletions(-) diff --git a/tests/Coreutils.pm b/tests/Coreutils.pm index 71b1516..661fce4 100644 --- a/tests/Coreutils.pm +++ b/tests/Coreutils.pm @@ -54,8 +54,12 @@ defined $ENV{DJDIR} # I/O spec: a hash ref with the following properties # # - one key/value pair -# - the key must be one of these strings: IN, OUT, ERR, AUX, CMP, EXIT +# - the key must be one of these strings: IN, IN_PIPE, OUT, ERR, AUX, CMP, +# EXIT, PRE, POST # - the value must be a file spec +# {IN = 'data'}Create file containing 'data'. The filename will be +#appended as the last parameter on the command-line. +# {IN_PIPE = 'data'} Send 'data' as input from stdin. 
# {OUT = 'data'}put data in a temp file and compare it to stdout from cmd # {OUT = {'filename'=undef}} compare contents of existing filename to # stdout from cmd @@ -82,6 +86,12 @@ defined $ENV{DJDIR} # {ENV_DEL = 'VAR'} # Remove VAR from the environment just before running the corresponding # command, and restore any value just afterwards. +# {CMP = ['data', {'filename'=undef}]}Compare the content of 'filename' +# to 'data' (a string scalar). The program under test is expected to create +# file 'filename'. +# {PRE = sub{}} Execute sub() before running the test. +# {POST = sub{}} Execute sub() after running the test. +# If the PRE/POST sub calls die, the test will be marked as failed. # # There may be many input file specs. File names from the input specs # are concatenated in order on the command line. -- 1.7.7.4
Re: Move Command Feature
Hello Michael, Michael Boldischar wrote, On 04/05/2013 01:56 PM: My first attempt with rsync resulted in the same problem I have when there are errors using the mv command: $ mkdir a b $ touch a/1.txt a/2.txt $ chmod 000 a/2.txt $rsync -r --remove-source-files a/ b/ rsync: send_files failed to open /tmp/test/a/2.txt: Permission denied (13) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1070) [sender=3.0.8] $ ls b 1.txt $ ls a 2.txt The a directory was partially moved. This is no big deal with a small set of files, but a large set becomes a headache. There's one advantage to rsync - it can continue copying files from where it left off. That is - if something went wrong and it stopped, you can easily resume with exactly the same command line. Example: # Your scenario $ mkdir a b $ touch a/1.txt a/2.txt $ chmod 000 a/2.txt $ rsync -r --remove-source-files a/ b/ rsync: send_files failed to open /tmp/test/a/2.txt: Permission denied (13) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1070) [sender=3.0.8] # Rsync stopped, some files are moved to b, some are still in a. # now, fix the problem, and re-run rsync $ chmod 444 a/2.txt $ rsync -r --remove-source-files a/ b/ # the result: all files moved from a to b. $ ls a $ ls b 1.txt 2.txt Running rsync can be done repeatedly, until all files have been moved. Large files or small files, many files or few files - rsync will handle them all just fine. But regarding your question: On 04/05/2013 11:23 AM, Michael Boldischar wrote: Hello, This is a suggestion for a new feature in the mv command. This feature applies to moving directories. If a user moves a directory with a lot of files and encounters an error, it can often leave the source directory in a partially moved state. It makes it hard to redo the operation because the source directory has changed. 
There is a subtle difference between keeping the source directory intact until the move is complete and being able to resume/redo the move. If you just want to be able to resume an interrupted move, rsync can do it. You'll have to accept that until rsync completes successfully, some files are moved and some aren't (what you called a partial state). But when rsync is complete (perhaps after running it multiple times) - the move is complete and there's no partial state. If you insist on keeping a full copy of the source directory until the entire move is complete, then something like: rsync -r a/ b/ && rm -r a/ would do the trick - a/ will not be modified until the copy is completed. If you're moving files on the same filesystem and can use hardlinks to avoid unnecessary copies, then rsync has flags for that as well. Hope this helps, -gordon
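When source and destination are on the same filesystem, the copy-everything-then-delete pattern need not duplicate any file data: hard links make the "copy" nearly free. A sketch using GNU cp (rsync's --link-dest flag achieves something similar):

```shell
mkdir -p a && touch a/1.txt a/2.txt
cp -rl a b      # -l creates hard links instead of copying file data
rm -rf a        # delete the source only once the link tree fully exists
```

Until the rm, every file is reachable under both names; afterwards b holds the only remaining link, so the source stayed intact for the whole "copy" and no data was duplicated.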
Re: [PATCH] csplit: new option --suppress-matched
Hello, Pádraig Brady wrote, On 04/10/2013 07:49 AM: On 03/28/2013 10:10 PM, Assaf Gordon wrote: Attached is a new option for csplit, suppress-matched, as has been mentioned a few times before (e.g. http://lists.gnu.org/archive/html/coreutils/2013-02/msg00170.html ). ... Note I've removed the -m short option since we try to avoid them for new stuff. Also it gives us the flexibility in the future to add a param to --suppress-matched to suppress X lines before/around/after the matched line, which could also be useful. OK, good idea. Note I needed to fix array references in the perl test as follows: -push $new_ent, $cmp; +push @$new_ent, $cmp; Sorry about that. It seems Perl 5.14 and later (which I use on my dev machine) allow unblessed references to functions that take arrays/hashes ( http://perldoc.perl.org/5.14.0/perldelta.html#Syntactical-Enhancements ). I'll have to remember to avoid such backwards-incompatible syntax. Will push in a while... Thanks! -gordon
Re: sort/uniq/join: key-comparison code consolidation
Assaf Gordon wrote, On 04/10/2013 01:49 PM: (new thread for the previous topic: http://lists.gnu.org/archive/html/coreutils/2013-02/msg00082.html ) Another update, rebased against the latest version. Comments are welcome, -gordon key-comapre.2013-04-17.patch.xz Description: application/xz
Re: sort/uniq/join: key-comparison code consolidation
Regarding the previously discussed topic: http://lists.gnu.org/archive/html/coreutils/2013-02/msg00082.html Attached is another update, rebased against the latest version. Comments are welcome, -gordon key-comapre.2013-07-02.patch.xz Description: application/xz
Generate random numbers with shuf
Hello, Regarding an old discussion here: http://lists.gnu.org/archive/html/coreutils/2011-02/msg00030.html Attached is a patch that adds a --repetition option to shuf, enabling random number generation with repetitions. Example: to generate 50 values between 0 and 9: $ shuf --rep -i0-9 -n50 Comments are welcomed, -gordon
From 12ca3d6d5b8591e7bd424ff264b9f26cc2f31b90 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Thu, 4 Jul 2013 14:40:15 -0600
Subject: [PATCH 0/4] *** SUBJECT HERE ***

*** BLURB HERE ***

Assaf Gordon (4):
  shuf: add --repetition to generate random numbers
  shuf: add tests for --repetition option
  shuf: mention new --repetition option in NEWS
  shuf: document new --repetition option

 NEWS               |  3 +++
 doc/coreutils.texi | 23 +++
 src/shuf.c         | 50 ++
 tests/misc/shuf.sh | 29 +
 4 files changed, 101 insertions(+), 4 deletions(-)

--
1.8.3.2

From 2c09d46ebeee61e2e46633dc8b9158edba1eaa8b Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Thu, 4 Jul 2013 13:26:45 -0600
Subject: [PATCH 1/4] shuf: add --repetition to generate random numbers

* src/shuf.c: new option (-r,--repetition), generate random numbers.
main(): process new option.
usage(): mention new option.
write_random_numbers(): generate random numbers.
---
 src/shuf.c | 50 ++
 1 file changed, 46 insertions(+), 4 deletions(-)

diff --git a/src/shuf.c b/src/shuf.c
index 0fabb0b..cdc3151 100644
--- a/src/shuf.c
+++ b/src/shuf.c
@@ -76,6 +76,9 @@ Write a random permutation of the input lines to standard output.\n\
   -n, --head-count=COUNT    output at most COUNT lines\n\
   -o, --output=FILE         write result to FILE instead of standard output\n\
       --random-source=FILE  get random bytes from FILE\n\
+  -r, --repetition          used with -iLO-HI, output COUNT random numbers\n\
+                            between LO and HI, with repetitions.\n\
+                            count defaults to 1 if -n COUNT is not used.\n\
   -z, --zero-terminated     end lines with 0 byte, not newline\n\
 "), stdout);
       fputs (HELP_OPTION_DESCRIPTION, stdout);
@@ -104,6 +107,7 @@ static struct option const long_opts[] =
   {"head-count", required_argument, NULL, 'n'},
   {"output", required_argument, NULL, 'o'},
   {"random-source", required_argument, NULL, RANDOM_SOURCE_OPTION},
+  {"repetition", no_argument, NULL, 'r'},
   {"zero-terminated", no_argument, NULL, 'z'},
   {GETOPT_HELP_OPTION_DECL},
   {GETOPT_VERSION_OPTION_DECL},
@@ -328,6 +332,23 @@ write_permuted_output (size_t n_lines, char *const *line, size_t lo_input,
   return 0;
 }

+static int
+write_random_numbers (struct randint_source *s, size_t count,
+                      size_t lo_input, size_t hi_input, char eolbyte)
+{
+  size_t i;
+  const randint range = hi_input - lo_input + 1;
+
+  for (i = 0; i < count; i++)
+    {
+      randint j = lo_input + randint_choose (s, range);
+      if (printf ("%lu%c", j, eolbyte) < 0)
+        return -1;
+    }
+
+  return 0;
+}
+
 int
 main (int argc, char **argv)
 {
@@ -340,6 +361,7 @@ main (int argc, char **argv)
   char eolbyte = '\n';
   char **input_lines = NULL;
   bool use_reservoir_sampling = false;
+  bool repetition = false;
   int optc;
   int n_operands;
@@ -348,7 +370,7 @@ main (int argc, char **argv)
   char **line = NULL;
   struct linebuffer *reservoir = NULL;
   struct randint_source *randint_source;
-  size_t *permutation;
+  size_t *permutation = NULL;
   int i;

   initialize_main (argc, argv);
@@ -424,6 +446,10 @@ main (int argc, char **argv)
       random_source = optarg;
       break;

+    case 'r':
+      repetition = true;
+      break;
+
     case 'z':
       eolbyte = '\0';
       break;
@@ -454,9 +480,19 @@ main (int argc, char **argv)
     }
       n_lines = hi_input - lo_input + 1;
       line = NULL;
+
+      /* When generating random numbers with repetitions,
+         the default count is one, unless specified by the user.  */
+      if (repetition && head_lines == SIZE_MAX)
+        head_lines = 1;
     }
   else
     {
+      if (repetition)
+        {
+          error (0, 0, _("--repetition requires --input-range"));
+          usage (EXIT_FAILURE);
+        }
       switch (n_operands)
         {
         case 0:
@@ -488,10 +524,12 @@ main (int argc, char **argv)
         }
     }

-  head_lines = MIN (head_lines, n_lines);
+  if (!repetition)
+    head_lines = MIN (head_lines, n_lines);
   randint_source = randint_all_new (random_source,
-                                    use_reservoir_sampling ? SIZE_MAX :
+                                    (use_reservoir_sampling || repetition)
+                                    ? SIZE_MAX :
                                     randperm_bound (head_lines, n_lines));
   if (! randint_source)
     error (EXIT_FAILURE, errno, "%s
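For readers trying this against a released coreutils: the option proposed in this thread eventually landed under the name -r/--repeat rather than --repetition, so the patch's example can be reproduced as follows (the renamed option is the only change from the thread's usage):

```shell
#!/bin/sh
# Generate 50 random values, with repetition, uniformly from 0..9.
# (-r/--repeat is the name the option received in released coreutils,
# since 8.22; the patch in this thread still calls it --repetition.)
shuf -r -i0-9 -n50

# Note: in the released version, -r without -n repeats indefinitely,
# unlike this patch, where the count defaults to 1.
```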
Re: Generate random numbers with shuf
Hello, On 07/04/2013 05:40 PM, Pádraig Brady wrote: On 07/04/2013 09:41 PM, Assaf Gordon wrote: Regarding an old discussion here: http://lists.gnu.org/archive/html/coreutils/2011-02/msg00030.html Attached is a patch that adds a --repetition option to shuf, enabling random number generation with repetitions. I like this. --repetition seems to be a very good interface too, since it aligns with standard math nomenclature in regard to permutations. I'd prefer to generalize it though, to supporting stdin as well as -i. Attached is an updated patch, supporting --repetitions with STDIN/FILE/-e (using the naive implementation ATM). e.g. $ shuf --repetitions --head-count=100 --echo Head Tail or $ shuf -r -n100 -e Head Tail But the code is getting a bit messy, I guess from evolving features over time. I'd like to re-organize it a bit, refactor some functions and make the code clearer. What do you think? It will make the code slightly more verbose (and slightly bigger), but shouldn't change the running performance. -gordon
From 9e14bf963eb27faed847a979677fb5f344c27362 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Fri, 5 Jul 2013 11:58:16 -0600
Subject: [PATCH 0/7] *** SUBJECT HERE ***

*** BLURB HERE ***

Assaf Gordon (7):
  shuf: add --repetition to generate random numbers
  shuf: add tests for --repetition option
  shuf: mention new --repetition option in NEWS
  shuf: document new --repetition option
  shuf: enable --repetition on stdin/FILE/-e input
  shuf: add tests for --repetition with STDIN
  shuf: document new --repetitions option

 NEWS               |  3 +++
 doc/coreutils.texi | 37 ++
 src/shuf.c         | 66 --
 tests/misc/shuf.sh | 63 +++
 4 files changed, 162 insertions(+), 7 deletions(-)

--
1.8.3.2

From c41160016ed36fe5b4e2b3d03cde34e0dcec84b6 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Thu, 4 Jul 2013 13:26:45 -0600
Subject: [PATCH 1/7] shuf: add --repetition to generate random numbers

* src/shuf.c: new option (-r,--repetition), generate random numbers.
main(): process new option.
usage(): mention new option.
write_random_numbers(): generate random numbers.
---
 src/shuf.c | 50 ++
 1 file changed, 46 insertions(+), 4 deletions(-)

diff --git a/src/shuf.c b/src/shuf.c
index 0fabb0b..cdc3151 100644
--- a/src/shuf.c
+++ b/src/shuf.c
@@ -76,6 +76,9 @@ Write a random permutation of the input lines to standard output.\n\
   -n, --head-count=COUNT    output at most COUNT lines\n\
   -o, --output=FILE         write result to FILE instead of standard output\n\
       --random-source=FILE  get random bytes from FILE\n\
+  -r, --repetition          used with -iLO-HI, output COUNT random numbers\n\
+                            between LO and HI, with repetitions.\n\
+                            count defaults to 1 if -n COUNT is not used.\n\
   -z, --zero-terminated     end lines with 0 byte, not newline\n\
 "), stdout);
       fputs (HELP_OPTION_DESCRIPTION, stdout);
@@ -104,6 +107,7 @@ static struct option const long_opts[] =
   {"head-count", required_argument, NULL, 'n'},
   {"output", required_argument, NULL, 'o'},
   {"random-source", required_argument, NULL, RANDOM_SOURCE_OPTION},
+  {"repetition", no_argument, NULL, 'r'},
   {"zero-terminated", no_argument, NULL, 'z'},
   {GETOPT_HELP_OPTION_DECL},
   {GETOPT_VERSION_OPTION_DECL},
@@ -328,6 +332,23 @@ write_permuted_output (size_t n_lines, char *const *line, size_t lo_input,
   return 0;
 }

+static int
+write_random_numbers (struct randint_source *s, size_t count,
+                      size_t lo_input, size_t hi_input, char eolbyte)
+{
+  size_t i;
+  const randint range = hi_input - lo_input + 1;
+
+  for (i = 0; i < count; i++)
+    {
+      randint j = lo_input + randint_choose (s, range);
+      if (printf ("%lu%c", j, eolbyte) < 0)
+        return -1;
+    }
+
+  return 0;
+}
+
 int
 main (int argc, char **argv)
 {
@@ -340,6 +361,7 @@ main (int argc, char **argv)
   char eolbyte = '\n';
   char **input_lines = NULL;
   bool use_reservoir_sampling = false;
+  bool repetition = false;
   int optc;
   int n_operands;
@@ -348,7 +370,7 @@ main (int argc, char **argv)
   char **line = NULL;
   struct linebuffer *reservoir = NULL;
   struct randint_source *randint_source;
-  size_t *permutation;
+  size_t *permutation = NULL;
   int i;

   initialize_main (argc, argv);
@@ -424,6 +446,10 @@ main (int argc, char **argv)
       random_source = optarg;
       break;

+    case 'r':
+      repetition = true;
+      break;
+
     case 'z':
       eolbyte = '\0';
       break;
@@ -454,9 +480,19 @@ main (int argc, char **argv)
     }
       n_lines = hi_input - lo_input + 1;
       line = NULL;
+
+      /* When generating random numbers with repetitions,
+         the default count is one, unless
Re: Generate random numbers with shuf
On 07/05/2013 12:12 PM, Pádraig Brady wrote: On 07/05/2013 07:04 PM, Assaf Gordon wrote: Hello, Regarding an old discussion here: http://lists.gnu.org/archive/html/coreutils/2011-02/msg00030.html Attached is a patch that adds a --repetition option to shuf, enabling random number generation with repetitions. I like this. --repetition seems to be a very good interface too, since it aligns with standard math nomenclature in regard to permutations. I'd prefer to generalize it though, to supporting stdin as well as -i. Attached is an updated patch, supporting --repetitions with STDIN/FILE/-e (using the naive implementation ATM). e.g. $ shuf --repetitions --head-count=100 --echo Head Tail or $ shuf -r -n100 -e Head Tail Excellent thanks. But the code is getting a bit messy, I guess from evolving features over time. I'd like to re-organize it a bit, refactor some functions and make the code clearer. What do you think? It will make the code slightly more verbose (and slightly bigger), but shouldn't change the running performance. If you're getting your head around the code enough to refactor, then it would be great if you could handle the TODO: item in shuf.c. Attached is an updated patch, with some code cleanups (not including said TODO item yet).
-gordon From 5ba2828e72f6d276fc349f69824cd6cb626053a4 Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Fri, 5 Jul 2013 15:41:17 -0600 Subject: [PATCH 00/14] *** SUBJECT HERE *** *** BLURB HERE *** Assaf Gordon (14): shuf: add --repetition to generate random numbers shuf: add tests for --repetition option shuf: mention new --repetition option in NEWS shuf: document new --repetition option shuf: enable --repetition on stdin/FILE/-e input shuf: add tests for --repetition with STDIN shuf: document new --repetitions option shuf: code-cleanup shuf: add more tests shuf: refactor --repetition with stdin shuf: refactor write_permuted_output() shuf: code cleanup shuf: code clean-up shuf: add tests for more erroneous usage NEWS | 3 + doc/coreutils.texi | 37 +++ src/shuf.c | 192 + tests/misc/shuf.sh | 92 + 4 files changed, 268 insertions(+), 56 deletions(-) -- 1.8.3.2 From c41160016ed36fe5b4e2b3d03cde34e0dcec84b6 Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Thu, 4 Jul 2013 13:26:45 -0600 Subject: [PATCH 01/14] shuf: add --repetition to generate random numbers * src/shuf.c: new option (-r,--repetition), generate random numbers. main(): process new option. usage(): mention new option. write_random_numbers(): generate random numbers. 
---
 src/shuf.c | 50 ++
 1 file changed, 46 insertions(+), 4 deletions(-)

diff --git a/src/shuf.c b/src/shuf.c
index 0fabb0b..cdc3151 100644
--- a/src/shuf.c
+++ b/src/shuf.c
@@ -76,6 +76,9 @@ Write a random permutation of the input lines to standard output.\n\
   -n, --head-count=COUNT    output at most COUNT lines\n\
   -o, --output=FILE         write result to FILE instead of standard output\n\
       --random-source=FILE  get random bytes from FILE\n\
+  -r, --repetition          used with -iLO-HI, output COUNT random numbers\n\
+                            between LO and HI, with repetitions.\n\
+                            count defaults to 1 if -n COUNT is not used.\n\
   -z, --zero-terminated     end lines with 0 byte, not newline\n\
 "), stdout);
       fputs (HELP_OPTION_DESCRIPTION, stdout);
@@ -104,6 +107,7 @@ static struct option const long_opts[] =
   {"head-count", required_argument, NULL, 'n'},
   {"output", required_argument, NULL, 'o'},
   {"random-source", required_argument, NULL, RANDOM_SOURCE_OPTION},
+  {"repetition", no_argument, NULL, 'r'},
   {"zero-terminated", no_argument, NULL, 'z'},
   {GETOPT_HELP_OPTION_DECL},
   {GETOPT_VERSION_OPTION_DECL},
@@ -328,6 +332,23 @@ write_permuted_output (size_t n_lines, char *const *line, size_t lo_input,
   return 0;
 }

+static int
+write_random_numbers (struct randint_source *s, size_t count,
+                      size_t lo_input, size_t hi_input, char eolbyte)
+{
+  size_t i;
+  const randint range = hi_input - lo_input + 1;
+
+  for (i = 0; i < count; i++)
+    {
+      randint j = lo_input + randint_choose (s, range);
+      if (printf ("%lu%c", j, eolbyte) < 0)
+        return -1;
+    }
+
+  return 0;
+}
+
 int
 main (int argc, char **argv)
 {
@@ -340,6 +361,7 @@ main (int argc, char **argv)
   char eolbyte = '\n';
   char **input_lines = NULL;
   bool use_reservoir_sampling = false;
+  bool repetition = false;
   int optc;
   int n_operands;
@@ -348,7 +370,7 @@ main (int argc, char **argv)
   char **line = NULL;
   struct linebuffer *reservoir = NULL;
   struct randint_source *randint_source;
-  size_t *permutation;
+  size_t *permutation = NULL;
   int i;

   initialize_main (argc, argv);
@@ -424,6 +446,10 @@ main
Re: Generate random numbers with shuf
On 07/10/2013 09:20 AM, Pádraig Brady wrote: I've split to two patches. 1. Unrelated test improvements. 2. All the rest ... Note in both patches I made adjustments to the tests [...] ... I.E. avoid cat unless needed, and paste is more general than fmt in this usage. ... Also I simplified the --help a little [...] Indeed, looks more concise and much better. I keep on learning... I'll push the 2 attached patches soon. Thanks! -gordon
Re: bug#15077: Clarification
(CC'ing the list so that others could comment) Hello Federico, On 08/12/2013 06:50 PM, CDR wrote: How do I get the latest version, even beta, of join, sort, etc.? I would not recommend using beta or development versions of GNU coreutils for production code, just to be on the safe side. The stable releases are available as source code here: http://ftp.gnu.org/gnu/coreutils/ With more details here: http://www.gnu.org/software/coreutils/ One thing that I suggest is to change sort, comm and join to use more than one core. I had to use a commercial version of sort because the regular version takes forever to sort a 15G file. The commercial version is called nsort and it uses all the cores in the machine, and you may also add a flag to give the program a huge memory block. It works like ten times faster than the regular sort. Starting with version 8.6, sort can use multiple cores to improve sorting speed (see the --parallel parameter). Sort also supports the --buffer-size parameter to explicitly specify how much memory to use. I'm not familiar with nsort and can not comment on nsort vs GNU sort's speed. I believe that on modern hardware, sorting 15G should take a few minutes at most, not forever - but that depends on many factors (e.g. cores, memory, disk, etc.). join operates on sorted input, and as such requires very little CPU and memory. I do not think much can be gained from making join multi-threaded. I believe the same applies to comm. I am using comm a lot for a business problem that involves comparing daily files that have 550 MM records. I find it extremely slow. Do you have any suggestions? Others could perhaps comment on ways to improve performance when using GNU coreutils. I'd assume it very much depends on the technical details of what you're comparing - perhaps there are ways to improve the workflow. The first step is usually to isolate the real bottleneck (e.g. CPU, memory, disk speed, algorithm, etc.). regards, -gordon
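The two sort options mentioned in the reply can be sketched on a scaled-down input; the file names and the particular thread/memory values are arbitrary, and --parallel requires GNU sort 8.6 or later:

```shell
#!/bin/sh
# Scaled-down stand-in for the 15G scenario: numeric sort using
# several cores and an explicit main-memory buffer.
seq 200000 | shuf > unsorted.txt

# --parallel=N     use up to N sorting threads (GNU sort >= 8.6)
# --buffer-size=S  memory to use before spilling to temporary files
sort --parallel=4 --buffer-size=64M -n unsorted.txt > sorted.txt

head -n1 sorted.txt   # smallest value: 1
```

For a real 15G file, a larger --buffer-size (and a fast temporary directory, via -T) would matter far more than it does here.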
Re: Shuf reservoir sampling
Hello, (reviving an old thread, sorry for the delayed response). On 12/28/2013 03:36 PM, Jyothis V wrote: ... Hi, thanks for the reply. I understand why something like reservoir sampling is needed. But in shuf.c, shuffling is done in two steps: 1) using reservoir sampling, an array of length head_length is obtained. At this stage, the array is not completely shuffled because the first head_length elements were copied verbatim. 2) This array is then randomly permuted while writing. My question is whether these two steps could be clubbed together, just as shown in the second algorithm given in the wikipedia page you mentioned. I haven't looked into the maths behind it yet, nor was I involved during that last improvement. Further improvement may be possible, and the best way to push this is by providing code. Are you willing to propose such a patch? Regarding the shuffle correctness: yes, the data is first read into the array, and only later permuted. I believe the implementation is correct (i.e. it does randomly shuffle the input); if this is not the case, it's a bug and should be fixed. Regarding the implementation: in shuf.c there's an intricate interplay between reading the input and writing the output - notice that the input is closed explicitly half-way through main(), before any output is ever written. The initial patch was written to maintain this behavior and minimally disrupt the existing code flow: http://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=20d7bce0f7e57d9a98f0ee811e31c757e9fedfff This is not to say a better implementation is not possible, just that there are a few technical details to note before changing 'shuf'. There are certainly ways to improve the code. HTH, -Gordon
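The behavior under discussion is easy to observe from the shell: with a head count on piped input, shuf keeps only K candidate lines (the reservoir) while the rest of the stream is read and discarded. The input size below is arbitrary; that shuf takes this code path for piped input with -n is based on the commit linked above:

```shell
#!/bin/sh
# shuf with -n K on a pipe uses reservoir sampling internally:
# only K=3 lines are held while a million input lines stream by.
# The 3 sampled lines are distinct (sampling without replacement).
seq 1000000 | shuf -n 3
```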
Re: sort/uniq/join: key-comparison code consolidation
Hello, If there is still interest, here's an updated patch, against the latest version, of adding these features to join+uniq: http://lists.gnu.org/archive/html/coreutils/2013-02/msg00082.html The patch has been re-created (not just rebased), because the old one caused a few conflicts. Functionally it is the same as before, and all tests pass. Comments are welcomed, -gordon key-compare-2014-01-23.patch.xz Description: application/xz
script suggestion: 'check_program' to easily run multiple tests
Hello, Attached is a small script I've been using. It helps run multiple tests for a given program. example: ./scripts/check_program sort will find all sort-related tests (based on filename) and run them. Adding -e or -v also runs expensive and very expensive tests. example: ./scripts/check_program -v sort is equivalent to: make check VERBOSE=yes SUBDIRS=. \ RUN_EXPENSIVE_TESTS=yes \ RUN_VERY_EXPENSIVE_TESTS=yes \ TESTS="./tests/misc/sort-NaN-infloop.sh ./tests/misc/sort-benchmark-random.sh ./tests/misc/sort-compress-hang.sh ./tests/misc/sort-compress-proc.sh ./tests/misc/sort-compress.sh ./tests/misc/sort-continue.sh ./tests/misc/sort-debug-keys.sh ./tests/misc/sort-debug-warn.sh ./tests/misc/sort-discrim.sh ./tests/misc/sort-exit-early.sh ./tests/misc/sort-files0-from.pl ./tests/misc/sort-float.sh ./tests/misc/sort-merge-fdlimit.sh ./tests/misc/sort-merge.pl ./tests/misc/sort-month.sh ./tests/misc/sort-rand.sh ./tests/misc/sort-spinlock-abuse.sh ./tests/misc/sort-stale-thread-mem.sh ./tests/misc/sort-u-FMR.sh ./tests/misc/sort-unique-segv.sh ./tests/misc/sort-unique.sh ./tests/misc/sort-version.sh ./tests/misc/sort.pl" If others find it useful, you're welcome to add it. -gordon
From 965a01bfaf129b4d1da8d0927a9149e4c4145ff3 Mon Sep 17 00:00:00 2001
From: A. Gordon assafgor...@gmail.com
Date: Fri, 24 Jan 2014 13:39:14 -0500
Subject: [PATCH] scripts: add check_program, to run tests easily

* scripts/check_program: New script, so you can easily run all
tests relating to a certain program. Takes less time than checking
all programs with 'make check', and quicker to type than
'make check TESTS="TEST1 TEST2 TEST3"' for multiple tests.
---
 scripts/check_program | 70 +++
 1 file changed, 70 insertions(+)
 create mode 100755 scripts/check_program

diff --git a/scripts/check_program b/scripts/check_program
new file mode 100755
index 000..f38e410
--- /dev/null
+++ b/scripts/check_program
@@ -0,0 +1,70 @@
+#!/bin/sh
+# A small helper script to run multiple tests at once.
+# example:
+#   ./scripts/check_program sort
+# would run all 'sort' related tests under ./tests/
+
+# Written by Assaf Gordon
+
+# allow the user to override 'make'
+MAKE=${MAKE-make}
+
+VERSION='2014-01-24 00:37:51' # UTC
+
+prog_name=`basename $0`
+die () { echo "$prog_name: $*" >&2; exit 1; }
+
+usage() {
+  echo >&2 "\
+Usage: $0 [OPTION] PROGRAM
+Runs all tests for PROGRAM
+
+Options:
+  -e   run EXPENSIVE tests
+  -v   run EXPENSIVE and VERY_EXPENSIVE tests
+  -h   display this help and exit
+
+Examples:
+To run all (non-expensive) tests for 'uniq':
+
+   $0 uniq
+
+To run all (including expensive and very expensive) tests for 'sort':
+
+   $0 -v sort
+
+"
+}
+
+RUN_EXPENSIVE_TESTS=no
+RUN_VERY_EXPENSIVE_TESTS=no
+while getopts :evh name
+do
+  case $name in
+  (v)  RUN_VERY_EXPENSIVE_TESTS=yes; RUN_EXPENSIVE_TESTS=yes;;
+  (e)  RUN_EXPENSIVE_TESTS=yes;;
+  (h)  usage; exit 0 ;;
+  (--) shift; break ;;
+  (*)  die "Unknown option '$OPTARG'" ;;
+  esac
+  shift
+done
+
+PROGRAM=$1
+[ -z "$PROGRAM" ] && die "missing PROGRAM name. See '-h' for help."
+
+[ -d ./tests ] || die "'tests/' directory not found. \
+Please run this script from the \
+main directory of 'Coreutils'."
+
+TESTS=$(find ./tests/ \( -name '*.sh' -o -name '*.pl' \) -print | \
+          grep -w -- "$PROGRAM" | paste -s -d' ')
+[ -z "$TESTS" ] && die "no tests found for '$PROGRAM'"
+
+echo "Running the following tests for '$PROGRAM':"
+echo "$TESTS" | tr ' ' '\n' | sed 's/^/  /'
+
+$MAKE check TESTS="$TESTS" VERBOSE=yes SUBDIRS=. \
+    RUN_EXPENSIVE_TESTS=$RUN_EXPENSIVE_TESTS \
+    RUN_VERY_EXPENSIVE_TESTS=$RUN_VERY_EXPENSIVE_TESTS
--
1.8.4.3
stat: clarify mtime vs ctime [patch]
Hello, Would you be receptive to adding a tiny patch to 'stat' to clarify the difference between modification time and change time? Currently, it simply says:
 %y time of last modification, human-readable
 %Y time of last modification, seconds since Epoch
 %z time of last change, human-readable
 %Z time of last change, seconds since Epoch
And for most non-unix experts, "last modification" is (almost) a synonym for "last change" (IMHO). The patch changes: "modification" -> "data modification", "change" -> "status change", and adds one clarification paragraph to the docs. While this will not immediately resolve all questions, it will at least hint to users which option they need (as "data" is different from "status"). The words "data" and "status" are also used (for mtime and ctime, respectively) in the POSIX pages of 'sys/stat.h': http://pubs.opengroup.org/onlinepubs/009695399/basedefs/sys/stat.h.html Perhaps, in addition, add a new FAQ? Something like: Q. What is the difference between access time, data modification time and status change time? A. Most UNIX systems keep track of different times for each file. Access time keeps track of the last time a file was opened for reading. Data modification time keeps track of the last time a file's content was modified. Status change time keeps track of the last time a file's status (e.g. mode, owner, group, hard-links) was modified. Configuration varies between filesystems - not all systems keep track of all three times. To show access time, use "ls -lu" or stat's %X and %x formats. To show data modification time, use "ls -l" or stat's %Y and %y formats. To show status change time, use "ls -lc" or stat's %Z and %z formats. Example:
# Create a new file
$ echo hello > test.txt
# Show the file's time stamps
$ stat --printf "Access: %x\nModify: %y\nChange: %z\n" test.txt
Access: 2014-04-21 14:01:00.131648000 +
Modify: 2014-04-21 14:01:00.131648000 +
Change: 2014-04-21 14:01:00.131648000 +
# Wait 5 seconds, then update the file's content.
# NOTE: Status change time is also updated.
$ sleep 5 ; echo world >> test.txt
$ stat --printf "Access: %x\nModify: %y\nChange: %z\n" test.txt
Access: 2014-04-21 14:01:00.131648000 +
Modify: 2014-04-21 14:01:05.161657000 +
Change: 2014-04-21 14:01:05.161657000 +
# Wait 5 seconds, then update the file's status (but not content)
$ sleep 5 ; chmod o-rwx test.txt
$ stat --printf "Access: %x\nModify: %y\nChange: %z\n" test.txt
Access: 2014-04-21 14:01:00.131648000 +
Modify: 2014-04-21 14:01:05.161657000 +
Change: 2014-04-21 14:01:10.250232749 +
# Wait 5 seconds, then read (access) the file's content
$ sleep 5 ; wc test.txt > /dev/null
$ stat --printf "Access: %x\nModify: %y\nChange: %z\n" test.txt
Access: 2014-04-21 14:01:15.298241904 +
Modify: 2014-04-21 14:01:05.161657000 +
Change: 2014-04-21 14:01:10.250232749 +
# Show data modification time with 'ls -l'
$ ls --full-time -log test.txt
-rw-r- 1 12 2014-04-21 14:01:05.161657000 + test.txt
# Show status change time with 'ls -c'
$ ls --full-time -log -c test.txt
-rw-r- 1 12 2014-04-21 14:01:10.250232749 + test.txt
# Show last access time with 'ls -u'
$ ls --full-time -log -u test.txt
-rw-r- 1 12 2014-04-21 14:01:15.298241904 + test.txt
Regards, -gordon
From 4cf4784aafdf45fd3dec3855b9320d72dcd1a6ec Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Mon, 21 Apr 2014 14:31:23 -0400
Subject: [PATCH] stat: clarify mtime vs ctime in usage(), doc

Change "modification time" to "data modification time",
"change time" to "status change time".
* src/stat.c: improve usage() * doc/coreutils.texi: add clarification paragraph --- doc/coreutils.texi | 19 +++ src/stat.c | 8 2 files changed, 19 insertions(+), 8 deletions(-) diff --git a/doc/coreutils.texi b/doc/coreutils.texi index 6c49385..e979c88 100644 --- a/doc/coreutils.texi +++ b/doc/coreutils.texi @@ -11829,10 +11829,10 @@ The valid @var{format} directives for files with @option{--format} and @item %W - Time of file birth as seconds since Epoch, or @samp{0} @item %x - Time of last access @item %X - Time of last access as seconds since Epoch -@item %y - Time of last modification -@item %Y - Time of last modification as seconds since Epoch -@item %z - Time of last change -@item %Z - Time of last change as seconds since Epoch +@item %y - Time of last data modification +@item %Y - Time of last data modification as seconds since Epoch +@item %z - Time of last status change +@item %Z - Time of last status change as seconds since Epoch @end itemize The @samp{%t} and @samp{%T} formats operate on the st_rdev member of @@ -11864,6 +11864,17 @@ precision: [1288929712.114951834] @end example +@emph{Access time} formats (@samp{%x},@samp{%X}) output the last time the +file
Re: stat: clarify mtime vs ctime [patch]
On 04/21/2014 03:57 PM, Pádraig Brady wrote: On 04/21/2014 08:14 PM, Assaf Gordon wrote: Would you be receptive to adding a tiny patch to 'stat' to clarify the difference between modification time and change time? This clarification is worth making, thanks! Perhaps, in addition, add a new FAQ ? Let's avoid the FAQ for the moment. Hopefully the improved docs will avoid the need. ... but if the file was just opened for reading, then access time isn't updated, only if data is read. Also for performance reasons, modern Linux systems only update atime if it's older than [cm]time. I.E. with relatime enabled, it's really only an indicator as to whether the file has been read since it was last updated. So I think this whole block might add more ambiguity than any additional clarification. OK to drop this block? Attached are improved patches: The first contains only the added words status and data. The second adds the paragraph to the docs, and can be included at your discretion. I've reworded the Access Time sentence to make clear it depends on the operating system and file system configuration. But I think at least the data modification time and status change time sentences are correct for all systems. For both the FAQ and the additional paragraph, my reasoning is: 1. Expert users (who know by heart what mtime vs ctime mean) - don't need any of these. 2. Seasoned users - perhaps just need a reminder, in which case the words data vs status are enough. 3. Other (most?) users - will still look for clarification after seeing data modification time vs status change time. There are many answers for what is the difference between modification time and change time found on the internet, but I think it would be beneficial if there's an authoritative answer, from a reliable source (i.e. by coreutils). 
Regards, -gordon From 611b2b12ec7f6ae4ee276adfe74efe41602d27d7 Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Mon, 21 Apr 2014 14:31:23 -0400 Subject: [PATCH 1/2] doc: clarify stat's mtime vs ctime in usage(), doc Change modification time to data modification time, change time to status change time. * src/stat.c: improve usage() * doc/coreutils.texi: ditto --- doc/coreutils.texi | 8 src/stat.c | 8 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/doc/coreutils.texi b/doc/coreutils.texi index 6c49385..b21a4fd 100644 --- a/doc/coreutils.texi +++ b/doc/coreutils.texi @@ -11829,10 +11829,10 @@ The valid @var{format} directives for files with @option{--format} and @item %W - Time of file birth as seconds since Epoch, or @samp{0} @item %x - Time of last access @item %X - Time of last access as seconds since Epoch -@item %y - Time of last modification -@item %Y - Time of last modification as seconds since Epoch -@item %z - Time of last change -@item %Z - Time of last change as seconds since Epoch +@item %y - Time of last data modification +@item %Y - Time of last data modification as seconds since Epoch +@item %z - Time of last status change +@item %Z - Time of last status change as seconds since Epoch @end itemize The @samp{%t} and @samp{%T} formats operate on the st_rdev member of diff --git a/src/stat.c b/src/stat.c index fffebe3..7d43eb5 100644 --- a/src/stat.c +++ b/src/stat.c @@ -1457,10 +1457,10 @@ The valid format sequences for files (without --file-system):\n\ %W time of file birth, seconds since Epoch; 0 if unknown\n\ %x time of last access, human-readable\n\ %X time of last access, seconds since Epoch\n\ - %y time of last modification, human-readable\n\ - %Y time of last modification, seconds since Epoch\n\ - %z time of last change, human-readable\n\ - %Z time of last change, seconds since Epoch\n\ + %y time of last data modification, human-readable\n\ + %Y time of last data modification, seconds since Epoch\n\ + %z time of 
last status change, human-readable\n\ + %Z time of last status change, seconds since Epoch\n\ \n\ ), stdout); -- 1.9.1 From d7757509a9248a1b2ead45433741d2ec0d4ce7d2 Mon Sep 17 00:00:00 2001 From: Assaf Gordon assafgor...@gmail.com Date: Tue, 22 Apr 2014 11:13:02 -0400 Subject: [PATCH 2/2] doc: add paragraph about stat's %x/%y/%z doc/coreutils.texi: added paragraph. --- doc/coreutils.texi | 10 ++ 1 file changed, 10 insertions(+) diff --git a/doc/coreutils.texi b/doc/coreutils.texi index b21a4fd..b505d1e 100644 --- a/doc/coreutils.texi +++ b/doc/coreutils.texi @@ -11864,6 +11864,16 @@ precision: [1288929712.114951834] @end example +@emph{Access time} formats (@samp{%x},@samp{%X}) output the file's access time. +Access time is also shown with @command{ls -lu}. The precise meaning of file +access time depends on your operating system and file system configuration. +@emph{Data modification} format (@samp{%y}, @samp{%Y}) +outputs the time the file's content was modified (e.g. by a program writing +to the file). Data modification time is also
Re: sort/uniq/join: key-comparison code consolidation
Hello, On 01/23/2014 07:50 PM, Assaf Gordon wrote: If there is still interest, here's an updated patch, against the latest version, of adding these features to join+uniq: http://lists.gnu.org/archive/html/coreutils/2013-02/msg00082.html Attached is another rebase, plus a minor fix for the recent 'devmsg' change. Comments are welcomed, -gordon key-compare-2014-05-02.patch.xz Description: application/xz
Work-around for bootstrap failure with gettext 0.18.3.1
Hello, Coreutils' bootstrap script fails (in a freshly cloned directory) with gettext 0.18.3.1. This has been discussed a few times on the mailing list: http://lists.gnu.org/archive/html/coreutils/2013-11/msg00038.html http://lists.gnu.org/archive/html/bug-coreutils/2014-01/msg00058.html http://lists.gnu.org/archive/html/bug-coreutils/2014-04/msg00106.html And already resolved (with a recommendation to upgrade to 0.18.3.2): http://savannah.gnu.org/bugs/?40083 https://bugs.launchpad.net/ubuntu/+source/gettext/+bug/1311895 But version 0.18.3.1 is still out there and hasn't been upgraded in several distributions. Would you be receptive to adding the following minor work-around to bootstrap? It creates the two needed files, which allows autopoint to continue; gnulib then immediately overrides them with the correct versions. Comments are welcomed, - gordon P.S. So far I have only tested it on Ubuntu 14.04 (with gettext 0.18.3.1) and Debian 7 (with gettext 0.18.1.1-9).
From 3186927f477b12ad5ce3d184047336c382432226 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Fri, 2 May 2014 20:17:06 -0400
Subject: [PATCH] build: avoid bootstrap error with gettext 0.18.3.1

* bootstrap: Create critical bootstrap files for autopoint,
before gnulib re-generates them. This avoids a bug in
gettext/autopoint version 0.18.3.1 (which is advertised as 0.18.3).
See:
http://lists.gnu.org/archive/html/coreutils/2013-11/msg00038.html
http://savannah.gnu.org/bugs/?40083
https://bugs.launchpad.net/ubuntu/+source/gettext/+bug/1311895
---
 bootstrap | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/bootstrap b/bootstrap
index ce90bc4..81b576d 100755
--- a/bootstrap
+++ b/bootstrap
@@ -807,6 +807,20 @@ version_controlled_file() {
   fi
 }

+
+# Work-around for gettext/autopoint bug in version 0.18.3.1:
+# Create dummy 'm4/cu-progs.m4' and 'build-aux/git-version-gen'
+# to avoid 'bootstrap' failure.
+# http://lists.gnu.org/archive/html/coreutils/2013-11/msg00038.html +autopoint_version=$(get_version $AUTOPOINT) +if test $autopoint_version = 0.18.3 ; then + test -e 'm4/cu-progs.m4' || touch 'm4/cu-progs.m4' + if ! test -e 'build-aux/git-version-gen' ; then +printf #!/bin/sh\n 'build-aux/git-version-gen' +chmod a+x 'build-aux/git-version-gen' + fi +fi + # NOTE: we have to be careful to run both autopoint and libtoolize # before gnulib-tool, since gnulib-tool is likely to provide newer # versions of files installed by these two programs. -- 1.9.1
Re: Work-around for bootstrap failure with gettext 0.18.3.1
On 05/03/2014 05:52 AM, Pádraig Brady wrote: We wouldn't be wanting the cu-progs.m4 in other projects though, so we should probably conditionalize that to just $package = coreutils. It wouldn't be worth adding new hooks for this to generalize. To clarify: do you mean conditionalize just the cu-progs.m4 part, or the entire work-around, with $package = coreutils?
Re: Work-around for bootstrap failure with gettext 0.18.3.1
On 05/05/2014 02:42 PM, Pádraig Brady wrote: On 05/05/2014 06:34 PM, Assaf Gordon wrote: On 05/03/2014 05:52 AM, Pádraig Brady wrote: We wouldn't be wanting the cu-progs.m4 in other projects though, so we should probably conditionalize that to just $package = coreutils. It wouldn't be worth adding new hooks for this to generalize. To clarify: do you mean conditionalize just the cu-progs.m4 part, or the entire work-around, with $package = coreutils? Just the cu-progs.m4 bit. Attached is an updated patch. Comments are welcome. -gordon

From ba30f3d9f5f217883fb13d06354f2c8478f598d6 Mon Sep 17 00:00:00 2001
From: Assaf Gordon assafgor...@gmail.com
Date: Mon, 12 May 2014 12:17:06 -0400
Subject: [PATCH] build: avoid bootstrap error with gettext 0.18.3.1

* bootstrap: Create critical bootstrap files for autopoint, before
gnulib re-generates them. This avoids a bug in gettext/autopoint
version 0.18.3.1 (which is advertised as 0.18.3). See:
http://lists.gnu.org/archive/html/coreutils/2013-11/msg00038.html
http://savannah.gnu.org/bugs/?40083
https://bugs.launchpad.net/ubuntu/+source/gettext/+bug/1311895
---
 bootstrap | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/bootstrap b/bootstrap
index ce90bc4..9cd8024 100755
--- a/bootstrap
+++ b/bootstrap
@@ -807,6 +807,22 @@ version_controlled_file() {
   fi
 }

+
+# Work-around for gettext/autopoint bug in version 0.18.3.1:
+# Create dummy 'm4/cu-progs.m4' and 'build-aux/git-version-gen'
+# to avoid 'bootstrap' failure.
+# http://lists.gnu.org/archive/html/coreutils/2013-11/msg00038.html
+autopoint_version=$(get_version $AUTOPOINT)
+if test "$autopoint_version" = '0.18.3' ; then
+  if test "$package" = 'coreutils' ; then
+    test -e 'm4/cu-progs.m4' || touch 'm4/cu-progs.m4'
+  fi
+  if ! test -e 'build-aux/git-version-gen' ; then
+    printf '#!/bin/sh\n' > 'build-aux/git-version-gen'
+    chmod a+x 'build-aux/git-version-gen'
+  fi
+fi
+
 # NOTE: we have to be careful to run both autopoint and libtoolize
 # before gnulib-tool, since gnulib-tool is likely to provide newer
 # versions of files installed by these two programs.
--
1.9.1
sharing STDOUT in multiple sha256sum processes
Hello, I'd like to ask your advice, to verify that my command is correct. I'm trying to calculate sha256 checksums for many files, in parallel. A contrived example would be: $ find /path/ -type f -print0 | xargs -0 -P5 -n1 stdbuf -oL sha256sum >> 1.txt This runs at most 5 processes of sha256sum, and the output of all of them goes to the file 1.txt. Is it correct to assume that, because sha256sum prints one line per file and stdbuf -oL makes it line-buffered, the content of 1.txt will be valid (i.e. no intermixed lines from different processes)? Thanks, - gordon
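A quick empirical check of the question above, as a sketch: create a corpus of small files, run the parallel pipeline, and verify that every output line is well-formed (64 hex digits, two spaces, a filename), i.e. that no lines were split or interleaved. The temporary directory and file count are made up for the demo; it assumes GNU coreutils (stdbuf, sha256sum) and GNU findutils.

```shell
# Build a throwaway corpus of 200 small files.
dir=$(mktemp -d)
out=$(mktemp)
for i in $(seq 200); do printf 'hello %s\n' "$i" > "$dir/f$i"; done

# Checksum them with up to 5 sha256sum processes appending to one file.
find "$dir" -type f -print0 | xargs -0 -P5 -n1 stdbuf -oL sha256sum >> "$out"

# Count lines matching the expected sha256sum format:
# 64 hex chars, two spaces, then the filename.
n=$(grep -c -E '^[0-9a-f]{64}  ' "$out")
echo "$n"

rm -rf "$dir"
```

If no lines were intermixed, the count equals the number of input files. Note this only demonstrates behavior on one system; the general guarantee rests on each line being emitted in a single write() to a file opened in append mode.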
Re: seq feature: print letters
On Jun 30, 2014, at 5:24, Pádraig Brady p...@draigbrady.com wrote: On 06/30/2014 11:23 AM, assafgor...@gmail.com wrote: I'd like to suggest a patch to allow seq to generate letter sequences. I notice about 45 copies of the A-Z alphabet; would it be worth introducing aliases to avoid the copies? Yes, we can consolidate them. What about case? The current code only has upper case. Case is a can of worms, I know, with not necessarily 1:1 mappings etc. Once we leave the realm of Latin languages, upper/lower case indeed becomes very complicated, or even meaningless. I thought that 'tr [:upper:] [:lower:]' would handle it better (but I now realize tr doesn't support UTF-8 well, if I understand correctly). I think that as a first step, we should not deal with upper/lower-case issues. The data being leveraged is well defined and at present reasonable to include directly in the seq binary (about 12K I'm guessing), though have you looked at whether libunistring contains the appropriate data/logic for this? This might be more significant if case or more characters were considered, for example. This first draft stores UTF-8 strings (NUL-terminated) for each character. I saw that the libunistring code stores bit-fields for some of its functions, though I haven't studied it yet. I will try to improve the storage method in follow-up patches. I had a quick look at the CLDR. Are you only considering the Index exemplar chars here? http://www.unicode.org/cldr/charts/25/by_type/core_data.alphabetic_information.index.html Exactly. Maybe it would be better to default to the standard exemplars? http://cldr.unicode.org/translation/characters#TOC-Exemplar-Characters The reason I liked the index list is that it most directly answers the question "what is the alphabet in language X?" (as in, the letters that would be taught in schools as the alphabet, or that a person on the street would list if asked for the alphabet letters).
It also lends itself to queries like: # How many letters are in the Arabic alphabet? seq --alphabet=ar | wc -l # What is the eleventh letter in the Russian alphabet? seq --alphabet=ru | awk 'NR==11' Technically, the functionality of is_alpha() does not correspond 1:1 to the alphabet, which is part of the problem. In English there are no complications, but in many other languages it becomes complicated. Using other Unicode categories (e.g. the 'main' letters or even the 'auxiliary' letters) answers a slightly different question, more akin to "what symbols are acceptable in language X?" - not a bad question, just different from the previous one. For example, in Hebrew the index list contains 22 letters (which agrees with the answer to "how many letters are in the Hebrew alphabet?"), but the main/standard list has 5 more symbols: the final forms of 5 Hebrew letters (used when those letters appear at the end of a word). So using the main list would effectively list 5 letters twice. I believe other languages such as Arabic would present similar issues. From a technical point of view, it's easy to include both index and standard letters (with different command-line options); it's just a matter of adding more lists. What do you think? Thanks, -Gordon
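For readers following along: the --alphabet option discussed above is only a proposal and is not in released seq. For the plain ASCII alphabet, the same counting/indexing queries can already be answered with standard tools, as a stand-in sketch:

```shell
# Emulate the proposed 'seq --alphabet' for ASCII by splitting a
# letter string into one character per line with fold(1).
alphabet='abcdefghijklmnopqrstuvwxyz'

# How many letters are in the alphabet?
count=$(printf '%s\n' "$alphabet" | fold -w1 | wc -l)

# What is the eleventh letter?
eleventh=$(printf '%s\n' "$alphabet" | fold -w1 | awk 'NR==11')

echo "$count $eleventh"   # prints: 26 k
```

This obviously covers only ASCII; the whole point of the proposed feature is to answer the same questions for non-Latin alphabets, where no such one-liner exists.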
Re: seq feature: print letters
On Jul 1, 2014, at 2:21, Bernhard Voelker m...@bernhard-voelker.de wrote: Hmm, what about just providing the standard A-Z alphabet, and instead leaving it up to the user if she needs a different set (rolling over if needed)? I like the idea of seq using a user-specified sequence of characters (though this brings its own issues), but my goal was to provide an easy way to generate letters in many languages - if the user has to type them explicitly, then seq is no better than printf '%s\n' followed by all the letters typed by the user... What do you think? I'm still working on an improved patch with much more efficient storage; I hope to have it in a week or so. Regards, - gordon
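To make Bernhard's "rolling over" idea concrete, here is a hypothetical sketch (not from any proposed patch) of generating labels from a user-supplied set and rolling over to two-character labels once the single characters run out, spreadsheet-column style. The set 'a b c' is made up for the demo:

```shell
# Hypothetical roll-over of a user-supplied letter set:
# first all single characters, then all two-character combinations.
set_letters='a b c'
labels=$(
  for x in $set_letters; do
    printf '%s\n' "$x"
  done
  for x in $set_letters; do
    for y in $set_letters; do
      printf '%s%s\n' "$x" "$y"
    done
  done
)
printf '%s\n' "$labels" | head -n 5   # prints: a b c aa ab (one per line)
```

With 3 letters this yields 3 + 9 = 12 labels. The point of the discussion stands, though: this only helps when the user can type the set, which is exactly what the --alphabet proposal tries to avoid.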