Re: new coreutil? shuffle - randomize file contents
> Also, each external symbol (function, macro, variable) should have a
> comment explaining what it does. Currently I'm at a bit of a loss
> trying to figure out what things do, so my comments will be limited.
>
> +#ifndef _CHECKSUM_H
> +#define _CHECKSUM_H 1
> +
> +#include <sys/types.h>
> +#include <config.h>
> +#include "sha1.h"
> +#include "md5.h"
>
> This seems overkill to me. sort is just using md5, right?

I wanted an easy way to switch between them. Perhaps I should have
gotten rid of this macro stuff from md5sum.c as well, but at least
what I added lets someone else take that on in the future:

#define DIGEST_TYPE_STRING(Alg) ((Alg) == ALG_MD5 ? "MD5" : "SHA1")
#define DIGEST_STREAM(Alg) ((Alg) == ALG_MD5 ? md5_stream : sha_stream)
#define DIGEST_BITS(Alg) ((Alg) == ALG_MD5 ? 128 : 160)
#define DIGEST_HEX_BYTES(Alg) (DIGEST_BITS (Alg) / 4)
#define DIGEST_BIN_BYTES(Alg) (DIGEST_BITS (Alg) / 8)

> + int len = strlen(str);
>
> There should be a space before the (. There are several other
> instances of that.
>
> + return (len*2) >= ops->bits;
>
> Please put spaces around the *. The parentheses aren't needed here.

It surprises me that this seems so important to you guys, but whatever.

> Also, we prefer <= to >=. This is a programming style rule that I
> learned from D. Val Schorre. It's an application of Leibniz's
> criterion for notation: textual order should reflect numeric order.

Interesting.

> + void *ctx = alloca(digops.ctx_size);
>
> What are the bounds on digops.ctx_size? If it can be large (more than
> a few kilobytes) then we shouldn't rely on alloca, due to
> stack-overflow detection issues.

It's only going to be a few bytes.

> + else if (key->random_hash)
> +   {
> +     int dig_bytes = digops.bits/8;
> +     char diga[dig_bytes];
> +     char digb[dig_bytes];
> +     get_hash(texta, lena, diga);
> +     get_hash(textb, lenb, digb);
> +     diff = memcmp(diga, digb, sizeof(diga));
> +   }
>
> It should be possible to combine -R with -b, -d, -f, -g, -i, -M, -n,
> and -r. For example, sort -nR should compute the same hash for 1.0
> and 01.0 that it computes for 1.00. Currently, though, the -R is
> silently ignored in this case. Conversely, sort -MR should compute
> the same hash for Jan that it does for JAN; but here, the -M is
> silently ignored.

I disagree that this is important.

> + -R, --random                sort by random hash of keys\n\
>
> I would prefer the long option name --random-hash, since it's not a
> truly random sort.

What, in the sense that it pays attention to the keys, or that the
numbers are pseudorandom? I might call something which pays attention
to keys a random sort, and something which doesn't a random shuffle.
I don't like your suggestion because I don't think the implementation
should show up as part of the API. If it's the pseudorandomness, I
think mentioning that is redundant, and the same thing I said about
not wanting implementation in the API applies - a good pseudorandom
number generator should be externally indistinguishable from an
actual random number generator.

> There is a key problem here: one of documentation. doc/coreutils.texi
> and NEWS need to be modified to say exactly what's going on, and why
> sorting based on a random hash is not the same as generating a random
> permutation, so that people who are doing security-relevant stuff
> won't rely on something that they shouldn't.

Yes, I mentioned that I haven't written documentation yet. I will keep
in mind that these files need to be modified.

> Also, I can't easily tell from the code how many random bits are
> currently acquired from the environment, and whether there are enough
> bits to actually satisfy the claim that we are using a random hash.

It should be 128 bits for MD5 and 160 for SHA1. Do you think I need
more?

Frederik
Re: new coreutil? shuffle - randomize file contents
On Sat, Jul 16, 2005 at 07:01:53AM -0700, Frederik Eaton wrote:

> If it's the pseudorandomness, I think mentioning that is redundant,
> and the same thing I said about not wanting implementation in the API
> applies - a good pseudorandom number generator should be externally
> indistinguishable from an actual random number generator.

Disagree. There's a useful distinction between the two. Generally,
someone using pseudorandom numbers expects them to happen quickly and
repeatably. Someone using random numbers expects them to be really
random.

David
Re: new coreutil? shuffle - randomize file contents
Frederik Eaton [EMAIL PROTECTED] wrote:

> Attached is a second patch, which contains a ChangeLog entry and some
> formatting changes as requested by Jim.

Can you update your patch to be relative to coreutils-CVS,
http://savannah.gnu.org/cvs/?group=coreutils rather than to the aging
5.2.1?

Also, formatting should adhere to the GNU coding standards. A good way
to start is by using GNU indent with the default settings.

shred also tries to obtain a random seed. It'd be nice (eventually) to
have both programs use the same mechanism.

Please use FIXME rather than XXX to mark bits of code that need more
attention.

Regarding this:

   error (SORT_FAILURE, 0, _("%s: invalid field specification `%s'"),
          _(msgid), spec);
-  abort ();
+  abort (); // XXX is this ever reached? need comment if it is

That code is never reached, because the preceding error call invokes
`exit (SORT_FAILURE)'. The abort is to pacify gcc -Wall.

Regarding this:

+  char *randseed = 0;

please use `NULL', instead:

   char *randseed = NULL;

I've just noticed that this would make sort use a file in $HOME
(~/.gnupg/entropy). That dependency on $HOME would be a first for
coreutils. I'm reluctant to pull in all of that EGD-related code just
to get a default seed in some environments. Do any of you have a feel
for how important (or how widely available) EGD support is in
practice?
Re: new coreutil? shuffle - randomize file contents
Thanks for working on this. You've gotten further than anyone else
has! Some quick comments:

Frederik Eaton [EMAIL PROTECTED] writes:

> Is there a script for making a patch with all the right files
> excluded by the way?

Not yet. That's on the list of things to do. The fix will be to remove
those files from the repository, and have a bootstrap script that
generates them. Then cvs diff will do the right thing.

>> shred also tries to obtain a random seed. It'd be nice (eventually)
>> to have both programs use the same mechanism.
>
> I did not realize this. Yes, perhaps it would.

That sounds right to me as well.

> I do not have any idea, but it would be good to have random seed
> functionality somewhere standard-ish. It would be nice if it were in
> libc but that's an area I don't want to tackle.

It's a reasonable thing to put in coreutils/lib. We can then put it
into gnulib.

> Also, I didn't personally think it was necessary to do anything but
> look at the time and process ID for a default seed - I only added the
> /dev/random stuff because Paul Eggert seems to think that security is
> very important here

If sort -R is used for anything security-relevant, then it will be
important, yes.

> - and once /dev/random is referenced then I figured compatibility
> with other kinds of systems would become an issue

No, the method used by shred is good enough. We don't need to worry
about EGD any more. It's an obsolete hack. You might be better off
starting with the code in shred.

> I would like not to spend much more time on this, by the way,

Alas, more work needs to be done. Perhaps we can recruit someone else
to look into it? I will put it on my list of things to do as well, but
it's a pretty long list.

> + * src/sort.c: add functions init_salt, get_hash; 'init_salt' is
> + used to provide seeding, and 'get_hash' calculates the hash value
> + by which a key is sorted

Please use the ChangeLog style recommended by the GNU coding standards
http://www.gnu.org/prep/standards/html_node/Change-Logs.html. For
example, the first line should start with "* src/sort.c (init_salt,
get_hash):". It's important to log every externally-visible identifier
whose meaning has changed.

Also, each external symbol (function, macro, variable) should have a
comment explaining what it does. Currently I'm at a bit of a loss
trying to figure out what things do, so my comments will be limited.

+#ifndef _CHECKSUM_H
+#define _CHECKSUM_H 1
+
+#include <sys/types.h>
+#include <config.h>
+#include "sha1.h"
+#include "md5.h"

This seems overkill to me. sort is just using md5, right?

+ int len = strlen(str);

There should be a space before the (. There are several other
instances of that.

+ return (len*2) >= ops->bits;

Please put spaces around the *. The parentheses aren't needed here.
Also, we prefer <= to >=. This is a programming style rule that I
learned from D. Val Schorre. It's an application of Leibniz's
criterion for notation: textual order should reflect numeric order.

+      --seed                use specified seed for random number generator\n\

--seed has an operand, so it should be mentioned here.

+ void *ctx = alloca(digops.ctx_size);

What are the bounds on digops.ctx_size? If it can be large (more than
a few kilobytes) then we shouldn't rely on alloca, due to
stack-overflow detection issues.

+ else if (key->random_hash)
+   {
+     int dig_bytes = digops.bits/8;
+     char diga[dig_bytes];
+     char digb[dig_bytes];
+     get_hash(texta, lena, diga);
+     get_hash(textb, lenb, digb);
+     diff = memcmp(diga, digb, sizeof(diga));
+   }

It should be possible to combine -R with -b, -d, -f, -g, -i, -M, -n,
and -r. For example, sort -nR should compute the same hash for 1.0 and
01.0 that it computes for 1.00. Currently, though, the -R is silently
ignored in this case. Conversely, sort -MR should compute the same
hash for Jan that it does for JAN; but here, the -M is silently
ignored.

       else if (key->month)
         diff = getmonth (texta, lena) - getmonth (textb, lenb);
       /* Sorting like this may become slow, so in a simple locale the user

@@ -1986,7 +2038,7 @@ badfieldspec (char const *spec, char con
 {
   error (SORT_FAILURE, 0, _("%s: invalid field specification %s"),
          _(msgid), quote (spec));
-  abort ();
+  abort (); // inserted to avoid a compiler warning
 }

Please don't use //-style comments, as we can't assume C99 yet. Also,
please prefer declarative sentences in comments, and put the comments
on separate lines before the code they describe, e.g.:

  /* Avoid a compiler warning.  */
  abort ();

+  char *randseed = 0;

This should be type char const *, not char *. Note that we prefer the
const after the char. Also, the 0 should be NULL.

+  randseed = strdupa(optarg);

We can't assume strdupa exists. But you don't need to copy optarg;
just use it without copying it.

+  -R, --random                sort by random hash of keys\n\

I would prefer the long option name --random-hash, since it's not a
truly random sort.
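(To make the -nR point above concrete, here is one hedged way the
combination could be honored: canonicalize the key before hashing, so
that numerically equal strings hash alike. numeric_canonicalize is a
hypothetical helper - nothing like it exists in the posted patch -
while get_hash is the function the patch defines. A minimal sketch
only, not proposed code:)

  /* get_hash as defined in the posted patch. */
  extern void get_hash (char *text, int len, void *resbuf);

  /* Assumed helper (not in the patch): write a normalized form of the
     numeric string TEXT/LEN into BUF (strip leading zeros, trailing
     fractional zeros, etc.), returning its length. */
  extern int numeric_canonicalize (const char *text, int len,
                                   char *buf, int bufsize);

  /* Hypothetical: for `sort -nR', hash a canonical form of the key so
     that "1.0", "01.0" and "1.00" all get the same random hash. */
  static void
  get_numeric_hash (const char *text, int len, void *resbuf)
  {
    char canon[64];         /* enough for any normalized number here */
    int canonlen = numeric_canonicalize (text, len, canon, sizeof canon);
    get_hash (canon, canonlen, resbuf);
  }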
Re: new coreutil? shuffle - randomize file contents
Hi,

Attached is a third patch.

Is there a script for making a patch with all the right files excluded
by the way? cvs diff produces a huge amount of unrelated output
because of files that are both in the repository and touched by
configure, and it doesn't list new files. And diff doesn't seem to
have an --include option to match its --exclude...

>> Attached is a second patch, which contains a ChangeLog entry and
>> some formatting changes as requested by Jim.
>
> Can you update your patch to be relative to coreutils-CVS,
> http://savannah.gnu.org/cvs/?group=coreutils rather than to the aging
> 5.2.1?

Done.

> Also, formatting should adhere to the GNU coding standards. A good
> way to start is by using GNU indent with the default settings.

I did this for some things. Maybe if you see other problems you could
run GNU indent before running 'cvs commit'?

> shred also tries to obtain a random seed. It'd be nice (eventually)
> to have both programs use the same mechanism.

I did not realize this. Yes, perhaps it would.

> Please use FIXME rather than XXX to mark bits of code that need more
> attention.
>
> Regarding this:
>
>    error (SORT_FAILURE, 0, _("%s: invalid field specification `%s'"),
>           _(msgid), spec);
> -  abort ();
> +  abort (); // XXX is this ever reached? need comment if it is
>
> That code is never reached, because the preceding error call invokes
> `exit (SORT_FAILURE)'. The abort is to pacify gcc -Wall.
>
> Regarding this:
>
> +  char *randseed = 0;
>
> please use `NULL', instead:
>
>    char *randseed = NULL;
>
> I've just noticed that this would make sort use a file in $HOME
> (~/.gnupg/entropy). That dependency on $HOME would be a first for
> coreutils. I'm reluctant to pull in all of that EGD-related code just
> to get a default seed in some environments. Do any of you have a feel
> for how important (or how widely available) EGD support is in
> practice?

I do not have any idea, but it would be good to have random seed
functionality somewhere standard-ish. It would be nice if it were in
libc but that's an area I don't want to tackle.

Also, I didn't personally think it was necessary to do anything but
look at the time and process ID for a default seed - I only added the
/dev/random stuff because Paul Eggert seems to think that security is
very important here - and once /dev/random is referenced then I
figured compatibility with other kinds of systems would become an
issue, so I just copied the whole bit from libgcrypt. If you want to
change that stuff, go ahead.

I would like not to spend much more time on this, by the way; it's
been about 7 months now. I would like for us to try to at least figure
out what should be in a preliminary version, make sure that the
command line interface is suitable, etc., somewhat soon. For instance,
if the random stuff stays, then I'll need some guidance on what is the
best way to update configure.ac (unless you want to do it). Also, I'm
not sure what kind of documentation will need to be updated. I was
hoping for more of this kind of feedback, so we can minimize the
number of email back-and-forths.
Thanks,

Frederik

diff --exclude CVS --exclude '*.in' --exclude configure --exclude '*.1' --exclude '*.info' --exclude '*~' --exclude '*.m4' --exclude 'autom4te*' --exclude config.hin --exclude getdate.c --exclude stamp-vti --exclude '*.texi' --exclude dircolors.h --exclude wheel.h --exclude tests --exclude wheel-size.h --exclude false.c -ruNp coreutils-cvs/ChangeLog coreutils-cvs-modified/ChangeLog
--- coreutils-cvs/ChangeLog	2005-07-14 01:03:08.0 +0100
+++ coreutils-cvs-modified/ChangeLog	2005-07-15 14:38:25.0 +0100
@@ -1,3 +1,17 @@
+2005-07-14  Frederik Eaton  [EMAIL PROTECTED]
+
+	Add --random, --seed options to 'sort' to get shuffle-like
+	functionality.
+
+	* src/sort.c: add functions init_salt, get_hash; 'init_salt' is
+	used to provide seeding, and 'get_hash' calculates the hash value
+	by which a key is sorted
+	* src/checksum.h: stuff to make it possible to switch between
+	different hash algorithms at runtime
+	* src/randseed.h:
+	* src/randseed.c: read a fixed-length random seed from entropy
+	devices. adapted from libgcrypt
+
 2005-07-13  Paul Eggert  [EMAIL PROTECTED]
 
 	* Version 5.3.1.

diff --exclude CVS --exclude '*.in' --exclude configure --exclude '*.1' --exclude '*.info' --exclude '*~' --exclude '*.m4' --exclude 'autom4te*' --exclude config.hin --exclude getdate.c --exclude stamp-vti --exclude '*.texi' --exclude dircolors.h --exclude wheel.h --exclude tests --exclude wheel-size.h --exclude false.c -ruNp coreutils-cvs/src/checksum.h coreutils-cvs-modified/src/checksum.h
--- coreutils-cvs/src/checksum.h	2004-09-22 21:11:10.0 +0100
+++ coreutils-cvs-modified/src/checksum.h	2005-07-15 15:17:44.0 +0100
@@ -1,3 +1,11 @@
+#ifndef _CHECKSUM_H
+#define _CHECKSUM_H 1
+
+#include <sys/types.h>
+#include <config.h>
+#include "sha1.h"
+#include
autoreconf vs bootstrap script (was: new coreutil? shuffle - randomize file contents)
Paul Eggert wrote:

> Frederik Eaton writes:
>
>> Is there a script for making a patch with all the right files
>> excluded by the way?
>
> Not yet. That's on the list of things to do. The fix will be to
> remove those files from the repository, and have a bootstrap script
> that generates them. Then cvs diff will do the right thing.

Doesn't 'autoreconf --install' do everything that one would want in a
bootstrap script these days? And if not, shouldn't it?

Bob
...who thinks that bootstrap scripts are obsolete now...
Re: new coreutil? shuffle - randomize file contents
Attached is a second patch, which contains a ChangeLog entry and some
formatting changes as requested by Jim.

On Tue, Jun 07, 2005 at 08:47:14AM +0200, Jim Meyering wrote:

> Frederik Eaton [EMAIL PROTECTED] wrote:
>
>> Here is a preliminary patch for basic shuffling functionality in
>> 'sort', with same-keys-sort-together behavior. It adds two options:
>> -R to compare based on a hash of key, and --seed to specify salt for
>> the hash. If --seed is not given then the default is to read from
>> /dev/random or /dev/urandom. I still need to add support to
>> configure.ac for some of the random device macros.
>
> Thanks. I haven't had time to give it serious consideration yet, but
> we should get started on the copyright-related paperwork. I'll send
> you the form separately.

diff -ruN --exclude '*.in' --exclude configure --exclude sort.1 --exclude '*~' coreutils-5.2.1/ChangeLog coreutils-5.2.1-patch-3/ChangeLog
--- coreutils-5.2.1/ChangeLog	2004-03-12 19:03:44.0 +0000
+++ coreutils-5.2.1-patch-3/ChangeLog	2005-07-14 15:27:38.0 +0100
@@ -1,3 +1,17 @@
+2005-07-14  Frederik Eaton  [EMAIL PROTECTED]
+
+	Add --random, --seed options to 'sort' to get shuffle-like
+	functionality.
+
+	* src/sort.c: add functions init_salt, get_hash; 'init_salt' is
+	used to provide seeding, and 'get_hash' calculates the hash value
+	by which a key is sorted
+	* src/checksum.h: stuff to make it possible to switch between
+	different hash algorithms at runtime
+	* src/randseed.h:
+	* src/randseed.c: read a fixed-length random seed from entropy
+	devices. adapted from libgcrypt
+
 2004-03-12  Jim Meyering  [EMAIL PROTECTED]
 
 	* Version 5.2.1.

diff -ruN --exclude '*.in' --exclude configure --exclude sort.1 --exclude '*~' coreutils-5.2.1/src/checksum.h coreutils-5.2.1-patch-3/src/checksum.h
--- coreutils-5.2.1/src/checksum.h	2003-04-11 10:07:21.0 +0100
+++ coreutils-5.2.1-patch-3/src/checksum.h	2005-07-14 15:27:38.0 +0100
@@ -1,8 +1,14 @@
+#ifndef _CHECKSUM_H
+#define _CHECKSUM_H 1
+
 #include <config.h>
 #include <sys/types.h>
 #include <stdio.h>
 
+#include "sha1.h"
+#include "md5.h"
+
 /* For long options that have no equivalent short option, use a
    non-character as a pseudo short option, starting with CHAR_MAX + 1. */
 enum
@@ -13,3 +19,39 @@
 };
 
 extern int algorithm;
+
+struct digest_ops {
+  int ctx_size;
+  int bits;
+  void (*init_ctx) (void *ctx);
+  void (*process_bytes) (const void *buffer, size_t len, void *ctx);
+  void *(*finish_ctx) (void *ctx, void *resbuf);
+  void *(*read_ctx) (void *ctx, void *resbuf);
+};
+
+static inline
+void get_digest_ops(int alg, struct digest_ops *res) {
+  switch(alg) {
+  case ALG_MD5:
+    res->ctx_size = sizeof(struct md5_ctx);
+    res->bits = 128;
+    res->init_ctx = (void (*) (void *))md5_init_ctx;
+    res->process_bytes = (void (*) (const void *, size_t, void *))md5_process_bytes;
+    res->finish_ctx = (void *(*) (void *, void *))md5_finish_ctx;
+    res->read_ctx = (void *(*) (void *, void *))md5_read_ctx;
+    break;
+  case ALG_SHA1:
+    res->ctx_size = sizeof(struct sha_ctx);
+    res->bits = 160;
+    res->init_ctx = (void (*) (void *))sha_init_ctx;
+    res->process_bytes = (void (*) (const void *, size_t, void *))sha_process_bytes;
+    res->finish_ctx = (void *(*) (void *, void *))sha_finish_ctx;
+    res->read_ctx = (void *(*) (void *, void *))sha_read_ctx;
+    break;
+  default:
+    /* abort? */
+    break;
+  }
+}
+
+#endif

diff -ruN --exclude '*.in' --exclude configure --exclude sort.1 --exclude '*~' coreutils-5.2.1/src/Makefile.am coreutils-5.2.1-patch-3/src/Makefile.am
--- coreutils-5.2.1/src/Makefile.am	2004-02-02 08:12:57.0 +0000
+++ coreutils-5.2.1-patch-3/src/Makefile.am	2005-07-14 15:45:22.0 +0100
@@ -147,6 +147,8 @@
 md5sum_SOURCES = md5sum.c md5.c
 sha1sum_SOURCES = md5sum.c sha1sum.c
 
+sort_SOURCES = sort.c md5.c randseed.c
+
 editpl = sed -e 's,@''PERL''@,$(PERL),g'
 
 localedir = $(datadir)/locale

diff -ruN --exclude '*.in' --exclude configure --exclude sort.1 --exclude '*~' coreutils-5.2.1/src/randseed.c coreutils-5.2.1-patch-3/src/randseed.c
--- coreutils-5.2.1/src/randseed.c	1970-01-01 01:00:00.0 +0100
+++ coreutils-5.2.1-patch-3/src/randseed.c	2005-07-14 16:01:50.0 +0100
@@ -0,0 +1,393 @@
+/* randseed.c
+
+Code for portably reading a random seed from entropy devices. Mostly
+adapted from libgcrypt.
+
+June 2005 Frederik Eaton
+*/
+
+#include <sys/types.h>
+#include <sys/time.h>
+#include <sys/times.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <time.h>
+#include <errno.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+
+#include "checksum.h"
+
+// XXX put these in autoconf
+#define NAME_OF_DEV_RANDOM "/dev/random"
+#define NAME_OF_DEV_URANDOM "/dev/urandom"
+#define USE_RNDLINUX 1
+#define USE_RNDEGD 1
+
+#define SORT_FAILURE 2
+#define FAIL_CODE SORT_FAILURE
+
+static int
Re: new coreutil? shuffle - randomize file contents
Frederik Eaton [EMAIL PROTECTED] wrote:

> Here is a preliminary patch for basic shuffling functionality in
> 'sort', with same-keys-sort-together behavior. It adds two options:
> -R to compare based on a hash of key, and --seed to specify salt for
> the hash. If --seed is not given then the default is to read from
> /dev/random or /dev/urandom. I still need to add support to
> configure.ac for some of the random device macros.

Thanks. I haven't had time to give it serious consideration yet, but
we should get started on the copyright-related paperwork. I'll send
you the form separately.
Re: new coreutil? shuffle - randomize file contents
Here is a preliminary patch for basic shuffling functionality in
'sort', with same-keys-sort-together behavior. It adds two options: -R
to compare based on a hash of key, and --seed to specify salt for the
hash. If --seed is not given then the default is to read from
/dev/random or /dev/urandom. I still need to add support to
configure.ac for some of the random device macros.

Feedback?

Frederik

--
http://ofb.net/~frederik/

diff -x '*~' -ur -N coreutils-5.2.1/src/sort.c coreutils-5.2.1-sort-rand/src/sort.c
--- coreutils-5.2.1/src/sort.c	2005-06-02 22:56:29.0 -0700
+++ coreutils-5.2.1-sort-rand/src/sort.c	2005-06-05 22:07:37.0 -0700
@@ -37,6 +37,9 @@
 #include "stdio-safer.h"
 #include "xmemcoll.h"
 #include "xstrtol.h"
+#include "checksum.h"
+#include "randseed.h"
+#include "md5.h"
 
 #if HAVE_SYS_RESOURCE_H
 # include <sys/resource.h>
@@ -153,6 +156,7 @@
   bool numeric;			/* Flag for numeric comparison.  Handle
				   strings of digits with optional decimal
				   point, but no exponential notation. */
+  bool random_hash;		/* Shuffle by sorting on random hash of key */
   bool general_numeric;		/* Flag for general, numeric comparison.
				   Handle numbers in exponential notation. */
   bool month;			/* Flag for comparison by month name. */
@@ -296,6 +300,7 @@
   -i, --ignore-nonprinting    consider only printable characters\n\
   -M, --month-sort            compare (unknown) < `JAN' < ... < `DEC'\n\
   -n, --numeric-sort          compare according to string numerical value\n\
+  -R, --random                sort by random hash of keys\n\
   -r, --reverse               reverse the result of comparisons\n\
\n\
"), stdout);
@@ -318,6 +323,7 @@
 "), DEFAULT_TMPDIR);
   fputs (_("\
   -z, --zero-terminated     end lines with 0 byte, not newline\n\
+      --seed                use specified seed for random number generator\n\
 "), stdout);
   fputs (HELP_OPTION_DESCRIPTION, stdout);
   fputs (VERSION_OPTION_DESCRIPTION, stdout);
@@ -346,7 +352,11 @@
   exit (status);
 }
 
-#define COMMON_SHORT_OPTIONS "-bcdfgik:mMno:rsS:t:T:uz"
+#define COMMON_SHORT_OPTIONS "-bcdfgik:mMnRo:rsS:t:T:uz"
+enum
+{
+  SEED_OPTION = CHAR_MAX + 1
+};
 
 static struct option const long_options[] =
 {
@@ -360,6 +370,7 @@
   {"merge", no_argument, NULL, 'm'},
   {"month-sort", no_argument, NULL, 'M'},
   {"numeric-sort", no_argument, NULL, 'n'},
+  {"random", no_argument, NULL, 'R'},
   {"output", required_argument, NULL, 'o'},
   {"reverse", no_argument, NULL, 'r'},
   {"stable", no_argument, NULL, 's'},
@@ -368,6 +379,7 @@
   {"temporary-directory", required_argument, NULL, 'T'},
   {"unique", no_argument, NULL, 'u'},
   {"zero-terminated", no_argument, NULL, 'z'},
+  {"seed", required_argument, NULL, SEED_OPTION},
   {GETOPT_HELP_OPTION_DECL},
   {GETOPT_VERSION_OPTION_DECL},
   {0, 0, 0, 0},
@@ -1332,6 +1344,26 @@
   return result;
 }
 
+struct digest_ops digops;
+void *salt_ctx;
+
+static void
+init_salt()
+{
+  get_digest_ops(ALG_MD5, &digops);
+  salt_ctx = xmalloc(digops.ctx_size);
+  digops.init_ctx(salt_ctx);
+}
+
+static void
+get_hash(char* text, int len, void *resbuf)
+{
+  void *ctx = alloca(digops.ctx_size);
+  memcpy(ctx, salt_ctx, digops.ctx_size);
+  digops.process_bytes(text, len, ctx);
+  digops.finish_ctx(ctx,resbuf);
+}
+
 /* Compare two lines A and B trying every key in sequence until there
    are no more keys or a difference is found. */
@@ -1374,6 +1406,15 @@
 	  (texta, textb));
 	  *lima = savea, *limb = saveb;
 	}
+      else if (key->random_hash)
+	{
+	  int dig_bytes = digops.bits/8;
+	  char diga[dig_bytes];
+	  char digb[dig_bytes];
+	  get_hash(texta, lena, diga);
+	  get_hash(textb, lenb, digb);
+	  diff = memcmp(diga, digb, sizeof(diga));
+	}
       else if (key->month)
	diff = getmonth (texta, lena) - getmonth (textb, lenb);
      /* Sorting like this may become slow, so in a simple locale the user
@@ -2083,7 +2124,7 @@
 {
   error (SORT_FAILURE, 0, _("%s: invalid field specification `%s'"),
	 _(msgid), spec);
-  abort ();
+  abort (); // XXX is this ever reached? need comment if it is
 }
 
 /* Parse the leading integer in STRING and store the resulting value
@@ -2189,6 +2230,9 @@
	case 'n':
	  key->numeric = true;
	  break;
+	case 'R':
+	  key->random_hash = true;
+	  break;
	case 'r':
	  key->reverse = true;
	  break;
@@ -2217,6 +2261,8 @@
   int c = 0;
   bool checkonly = false;
   bool mergeonly = false;
+  char *randseed = 0;
+  bool need_rand = false;
   int nfiles = 0;
   bool posixly_correct = (getenv ("POSIXLY_CORRECT") != NULL);
   bool obsolete_usage = (posix2_version () < 200112);
@@ -2300,7 +2346,7 @@
   gkey.sword = gkey.eword = SIZE_MAX;
   gkey.ignore = NULL;
   gkey.translate = NULL;
-
Re: new coreutil? shuffle - randomize file contents
So, the prototype runs a little slower than I had expected - it's
currently using md5 hashes, I could also look into CRC or something
faster (but less secure, for those concerned). Anyway here is a
sample:

$ time ./sort -R /usr/share/dict/words > /dev/null
./sort -R /usr/share/dict/words > /dev/null  1.67s user 0.01s system 99% cpu 1.684 total
$ time ./sort /usr/share/dict/words > /dev/null
./sort /usr/share/dict/words > /dev/null  0.45s user 0.01s system 99% cpu 0.462 total
$ time rl /usr/share/dict/words > /dev/null
rl /usr/share/dict/words > /dev/null  0.07s user 0.01s system 99% cpu 0.074 total
$ wc -l /usr/share/dict/words
96274 /usr/share/dict/words

'rl' is the executable from Arthur's randomize-lines. Of course, this
'sort'-based shuffle has the advantage of being independent of input
order

$ print -l g f e d c b a | ./sort -R | md5sum
dda0a6660319917afd6ed021f27fb452  -
$ print -l a b c d e f g | ./sort -R | md5sum
dda0a6660319917afd6ed021f27fb452  -

and being field-aware

$ print -l 'a 1' 'a 2' 'b 2' 'b 03' 'c 1' 'c 2' | ./sort -k 1,1R -k 2,2n
c 1
c 2
a 1
a 2
b 2
b 03

- whatever these are worth. I just wanted to make sure the slowness is
acceptable before proceeding.

Frederik

On Thu, Jun 02, 2005 at 02:16:10PM -0700, Frederik Eaton wrote:

James> I think the consensus is that the functionality belongs in sort.
James> Beyond that things are a bit less clear. However, Paul put forward a
James> proposed usage which adapts the current -k option (see
James> http://lists.gnu.org/archive/html/bug-coreutils/2005-05/msg00179.html).
James> Nobody made any comments suggesting that that wasn't a good way of
James> doing things.

I liked my proposal for co-opting -s:

http://lists.gnu.org/archive/html/bug-coreutils/2005-05/msg00220.html

In other words, rather than having a random column, you'd have a row
number column, which would turn into a random column under the hash
operation. The documentation for '-s' would be changed to suggest that
it was enabling a comparison based on input row number.

James> Last time this was discussed on this list there was some talk about
James> selecting permutations with a seed. This makes me uncomfortable,
James> though doubtless it would be convenient. The problem is that the
James> number of permutations of the records of most files is larger than
James> 2^64, and so representing the seed and making use of it would be very
James> tricky. I then had a look at Fascicle 2 of volume 4 of Knuth, but it
James> concentrates on generating all permutations, and so wasn't directly
James> much help.

This seems to be a common point of confusion which I think needs to be
cleared up. The mapping from random seed to sample space is rarely
surjective - in fact, whenever you sample more than a few numbers from
a random number generator, it isn't. It only matters that the number
of possible seeds is much larger than the number of samplings we would
ever choose to perform, and if not 2^64 then definitely 2^128 will be
sufficient for all eternity.

(Maybe some theoretical common ground is needed. A random variable is
a function X from a measurable space with probability measure P to
another measurable space such that X^{-1}(B) is measurable for every
measurable B in the range of X. We can define a probability measure on
the range of X by Q(B) = P(X^{-1}(B)), called the distribution of X.
Usually the domain of X is taken to be unspecified and uncountable,
but since in practice we only ever sample a finite subset of it - and
since we cannot calculate the inverse of X at arbitrary points - we
can replace the domain with a suitably large finite set with discrete
probability measure and still get good sample coverage, independently
of the size of the range of X. This new domain is the seed domain.)

Phil> Paul's statement:
Phil>
Phil> It's not as easy to implement as it sounds, I'm afraid; even if you
Phil> can assume /dev/random or /dev/urandom.
Phil>
Phil> is a slight concern, but I imagine the problems are applicable to *any*
Phil> shuffling utility. Is it that the app must guarantee all lines of a
Phil> non-seekable stdin must have an equal chance of any sort order?

See my comment to James above. I think one need not make this
guarantee, since only a tiny fraction of possible sort orders will be
able to be tried by the user. However, I think it would be true for my
proposal (and the one that James suggested to Davis) if one took the
random number generator or hash function to be variable or unspecified
during analysis of the algorithm. The fact that we ship with a
particular RNG or hash function (of all uncountable possibilities,
some unimplementable) hard-coded is not a problem since its resources
are never exhausted.

Jim> It looks like there are some desirable features that can be
Jim> provided only by a shuffle-enabled program that is key-aware.
Jim> Key specification and the comparison code are already part of sort.
Jim> Obviously, duplicating all of that in a separate program is not
Jim> an option. I don't relish the idea of factoring out sort's line-
Jim> and key-handling code either, but it might be feasible.
Jim>
Jim> However, I do like the idea of a new program that simply outputs
Jim> a random permutation of its input records, and that does it well,
Jim> and repeatably. The Unix tool philosophy certainly does encourage
Jim> the `perform one task and do it well' approach. Since doing it
Jim> well includes handling input larger than available virtual memory,
Jim> this is not trivial -- and it is well suited to the coreutils,
Jim> i.e., it's not easily implementable as a script.
Jim>
Jim> Initially, I was inclined to say that adding both the new program
Jim> (no key support) and related functionality to sort was desirable.
Jim> Thinking of the limits of robustness of such a new program, I
Jim> realized that if the input is sufficiently large and not seekable
Jim> (e.g., from a pipe), then the program will have to resort to writing
Jim> temporary files, much as sort already does. More duplicated effort,
Jim> determining how much memory to use (like sort's --buffer-size=SIZE
Jim> option), managing the temporary files, ensuring that they're removed
Jim> upon interrupt, etc. But maybe not prohibitive. The new program
Jim> would also have to have an option like
Re: new coreutil? shuffle - randomize file contents
>> $ print -l g f e d c b a | ./sort -R | md5sum
>> dda0a6660319917afd6ed021f27fb452  -
>> $ print -l a b c d e f g | ./sort -R | md5sum
>> dda0a6660319917afd6ed021f27fb452  -

By the way, this wouldn't actually be the default behavior; you'd have
to specify an explicit seed and have it be the same each time to be
reproducible...

Frederik
Re: new coreutil? shuffle - randomize file contents
On Saturday 04 June 2005 01:11, Philip Rowlands wrote:

> Extend sort.

In extending sort, would the O(n) shuffle algorithm be implemented? Or
would the existing O(n log n) mergesort logic be used via keys? Though
I am not a sort maintainer, and am probably the least qualified to
pass assumption or judgement, I would think that it might be harder to
maintain two sets of processing paths in sort. Or, if we only use
O(n log n) in sort, then that means there will always be a more
efficient O(n) implementation elsewhere. Is that what we really want?
It may be--I'll be the first to admit that for average cases less than
several 100MB in size, the differences on modern 2GHz machines are
almost unmeasurable.

As far as some administrative trivia, I have applied for a shuffle
project on savannah. I do not have a URL for source, an application
requirement, so I'm not sure who the POC would be or how long it will
take. shuffle itself has passed most unit tests. Things to do on the
horizon:

a) implement formal test procedures
b) profile for memory leaks
c) take an optimization pass

and then it should be ready for primetime outside of coreutils. Of
course, the final step,

d) implement finite memory space architecture,

is, due to the work involved, dependent on potential for coreutils
admission.

Thanks,
Davis
Re: new coreutil? shuffle - randomize file contents
If you intend on making shuffle part of coreutils someday, then you
could use the GNU womb repository on Savannah. You'd need to get
proper papers from [EMAIL PROTECTED] though, and if you add code that
was written by someone else we'd need papers from them too. But this
would make putting shuffle into coreutils easier when the day comes.
Re: new coreutil? shuffle - randomize file contents
Frederik Eaton [EMAIL PROTECTED] writes:

> How about this: Put an upper limit on the number of samples that your
> adversary will be able to try before the earth blows up.

But that's not how adversarial attacks work. They work by exploiting
flaws in your pseudorandom number generator. Thus, for example, it's
possible that by looking at the first half of the pseudorandom output,
the adversary would be able to deduce information about the seed that
you used, and thus they'll have extra information about what the
second half of the output will look like; this is information that
they wouldn't be able to guess if the output were truly random. The
application to high-stakes poker games should be obvious.

This may help to explain the results I already referred to in
http://www.maa.org/mathland/mathtrek_10_16_00.html. From first
principles, to randomize a deck of 52 cards, you need about lg (52!)
random bits, which is about 220 random bits. If each shuffle divides
the deck in half and then picks halves at random to interleave, then
it introduces about 50 bits of information, and you should need about
5 shuffles to randomize the deck. Actual shuffles aren't that good,
though, and there are a few other factors, so you'll need 6 or 7
shuffles to randomize a real deck, depending on whether you listen to
the guys from Stanford or the guys from Oxford. If you could get by
with a lot fewer shuffles, I'd expect those bright guys at Stanford
and Oxford would have figured it out by now.

Similarly, if you have a deck of 10 million cards, you'll need about
lg (1e7!) random bits, which is about 220 million random bits. Notice
that the problem does not scale linearly: here you need 22 times as
many random bits as records, even though to shuffle a 52-card deck you
need only about 5 times as many random bits as records. If you cut
this huge deck in half and shuffle it, you'll need more than 22
shuffles to randomize it.

(I agree that all this is overkill for non-adversarial applications.)
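(As a quick sanity check of these entropy figures: the lg (n!) values
can be computed with the standard C lgamma function. This little
program is purely illustrative and is not part of any patch; compile
with -lm.)

  #include <math.h>
  #include <stdio.h>

  /* Print the number of random bits needed to pick one of n!
     permutations uniformly: lg (n!) = lgamma (n + 1) / log (2). */
  int
  main (void)
  {
    double n[2] = { 52, 1e7 };
    int i;
    for (i = 0; i < 2; i++)
      printf ("lg (%.0f!) ~= %.3g bits\n", n[i],
              lgamma (n[i] + 1) / log (2));
    /* Prints roughly 226 bits for a 52-card deck and about 2.2e8
       bits for 10 million lines, consistent with the figures above. */
    return 0;
  }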
Re: new coreutil? shuffle - randomize file contents
On Sat, Jun 04, 2005 at 04:29:50PM -0700, Paul Eggert wrote:

> were truly random. The application to high-stakes poker games should
> be obvious.

[snip]

> (I agree that all this is overkill for non-adversarial applications.)

Aside from shuffling cards (which should rarely if ever involve more
than 520 cards), can anyone think of an application of shuffling in
which attack is likely?

David
Re: new coreutil? shuffle - randomize file contents
On Sat, Jun 04, 2005 at 04:29:50PM -0700, Paul Eggert wrote:

> Frederik Eaton [EMAIL PROTECTED] writes:
>
>> How about this: Put an upper limit on the number of samples that
>> your adversary will be able to try before the earth blows up.
>
> But that's not how adversarial attacks work. They work by exploiting
> flaws in your pseudorandom number generator.

Do you have a definition for adversarial attack?

> Thus, for example, it's possible that by looking at the first half of
> the pseudorandom output, the adversary would be able to deduce
> information about the seed that you used, and thus they'll have extra
> information about what the second half of the output will look like;
> this is information that they wouldn't be able to guess if the output
> were truly random. The application to high-stakes poker games should
> be obvious.
>
> This may help to explain the results I already referred to in
> http://www.maa.org/mathland/mathtrek_10_16_00.html. From first
> principles, to randomize a deck of 52 cards, you need about lg (52!)
> random bits, which is about 220 random bits. If each shuffle divides
> the deck in half and then picks halves at random to interleave, then
> it introduces about 50 bits of information, and you should need about
> 5 shuffles to randomize the deck. Actual shuffles aren't that good,
> though, and there are a few other factors, so you'll need 6 or 7
> shuffles to randomize a real deck, depending on whether you listen to
> the guys from Stanford or the guys from Oxford. If you could get by
> with a lot fewer shuffles, I'd expect those bright guys at Stanford
> and Oxford would have figured it out by now.
>
> Similarly, if you have a deck of 10 million cards, you'll need about
> lg (1e7!) random bits, which is about 220 million random bits. Notice
> that the problem does not scale linearly: here you need 22 times as
> many random bits as records, even though to shuffle a 52-card deck
> you need only about 5 times as many random bits as records. If you
> cut this huge deck in half and shuffle it, you'll need more than 22
> shuffles to randomize it.
>
> (I agree that all this is overkill for non-adversarial applications.)

Yes, this is all understood. I think existing standard cryptographic
hashes can be trusted for this use, though. Predicting all of the
output from some of it requires at least a preimage attack (or more,
since you don't get the values of the hash directly, only their sort
order). Recent results showing weakness in MD5 or SHA-1 (apparently
unpublished) deal only with collision attacks. I doubt that good a
preimage attack will ever be available for these.

Personally, I don't think that this is a productive area of
discussion, especially coming as it is before even minimal
functionality has been added. I'm planning to submit something along
the lines of my original proposal. If you think this is a really
important point then you can write your own version.

Frederik
Re: new coreutil? shuffle - randomize file contents
Philip Rowlands [EMAIL PROTECTED] writes:

> I'm still interested to read what Paul considers to be the
> difficulties of such an implementation?

Suppose you're randomizing an input file of 10 million lines. And
suppose you want to approximate a truly random key by using a 128-bit
random key for each input line. Then you'll need about 1.3 billion
random bits. On my desktop, /dev/random generates only about 200
random bits per second, and it'll take me about 2.5 months to
randomize the file.

If you're clever you can tune things so you need only about 0.22
billion random bits: this is because the log base 2 of (10 million
factorial) is about 220 million (I used Stirling's approximation). But
even with that optimization it'll still take me a couple of weeks to
randomize the file.

One way to be clever is to use the Knuth shuffle (a.k.a. the
Fisher-Yates shuffle). However, this doesn't easily generalize to an
algorithm that will work efficiently if the input is too large to fit
into main memory. There are other ways to be clever even for large
files, but I haven't had time to think them through. And don't forget:
even if we're clever, it still takes me a couple of weeks to randomize
a 10-million line file.

For more on this subject, please see:
http://en.wikipedia.org/wiki/Knuth_shuffle
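(For reference, the in-memory Knuth shuffle mentioned above is only a
few lines of C. A minimal sketch follows; random_index is a
placeholder for an unbiased random integer source - obtaining those
bits is exactly the problem being discussed.)

  #include <stddef.h>

  /* Placeholder: returns an unbiased random integer in [0, n).
     Supplying this - and seeding it - is the hard part. */
  extern size_t random_index (size_t n);

  /* Knuth (Fisher-Yates) shuffle: permutes lines[0..nlines-1]
     uniformly, assuming random_index is uniform.  One swap per
     element, O(n) time. */
  static void
  knuth_shuffle (char **lines, size_t nlines)
  {
    size_t i;
    for (i = nlines; i > 1; i--)
      {
        size_t j = random_index (i);	/* 0 <= j < i */
        char *tmp = lines[i - 1];
        lines[i - 1] = lines[j];
        lines[j] = tmp;
      }
  }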
Re: new coreutil? shuffle - randomize file contents
On Thu, Jun 02, 2005 at 11:31:26PM -0700, Paul Eggert wrote:

> Philip Rowlands [EMAIL PROTECTED] writes:
>
>> I'm still interested to read what Paul considers to be the
>> difficulties of such an implementation?
>
> Suppose you're randomizing an input file of 10 million lines. And
> suppose you want to approximate a truly random key by using a 128-bit
> random key for each input line. Then you'll need about 1.3 billion
> random bits.

No! Please see my penultimate email. This is what random seeds are
for.

Frederik
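(To spell out the seed idea: a short truly random seed can be
stretched into as many per-line key bits as desired by hashing the
seed together with the line number, along the lines of the row-number
proposal elsewhere in the thread. A minimal sketch using the md5
routines already in coreutils; the function make_key and its exact
interface are hypothetical, not part of any posted patch, and the raw
byte hashing of NUMBER is illustrative rather than portable.)

  #include <stddef.h>
  #include "md5.h"	/* coreutils' md5_init_ctx etc. */

  /* Hypothetical: derive a 128-bit sort key for line NUMBER from a
     short SEED, so one 128-bit seed suffices no matter how many lines
     there are.  RESBUF must point to 16 bytes. */
  static void
  make_key (const char *seed, size_t seedlen, unsigned long number,
            void *resbuf)
  {
    struct md5_ctx ctx;
    md5_init_ctx (&ctx);
    md5_process_bytes (seed, seedlen, &ctx);
    md5_process_bytes (&number, sizeof number, &ctx);
    md5_finish_ctx (&ctx, resbuf);
  }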
Re: new coreutil? shuffle - randomize file contents
On Wed, Jun 01, 2005 at 06:52:08PM -0700, Frederik Eaton wrote:

> So, what is the current state of things? Who is in charge of
> accepting patches?

The coreutils maintainers, who are all subscribed to this list I
think. So, you're asking in the right place.

> Are we decided that a 'shuffle' command but no 'sort -R' facility
> would be best

I think the consensus is that the functionality belongs in sort.
Beyond that things are a bit less clear. However, Paul put forward a
proposed usage which adapts the current -k option (see
http://lists.gnu.org/archive/html/bug-coreutils/2005-05/msg00179.html).
Nobody made any comments suggesting that that wasn't a good way of
doing things.

> or that it would be good to have both, or is it still in question
> whether either would be accepted?

Both is probably uncalled for, though I personally would put something
like this in /usr/local/bin/shuffle:

#! /bin/sh
exec sort -k R -- "$@"

Having said this, in the past people have expressed the sentiment that
perhaps a separate command would be a good idea (for example
http://lists.gnu.org/archive/html/bug-coreutils/2004-09/msg7.html).

Last time this was discussed on this list there was some talk about
selecting permutations with a seed. This makes me uncomfortable,
though doubtless it would be convenient. The problem is that the
number of permutations of the records of most files is larger than
2^64, and so representing the seed and making use of it would be very
tricky. I then had a look at Fascicle 2 of volume 4 of Knuth, but it
concentrates on generating all permutations, and so wasn't directly
much help.

Regards,
James.
Re: new coreutil? shuffle - randomize file contents
On Thu, 2 Jun 2005, James Youngman wrote:

> I think the consensus is that the functionality belongs in sort.
> Beyond that things are a bit less clear. However, Paul put forward a
> proposed usage which adapts the current -k option (see
> http://lists.gnu.org/archive/html/bug-coreutils/2005-05/msg00179.html).
> Nobody made any comments suggesting that that wasn't a good way of
> doing things.
>
>> or that it would be good to have both, or is it still in question
>> whether either would be accepted?
>
> Both is probably uncalled for, though I personally would put
> something like this in /usr/local/bin/shuffle:
>
> #! /bin/sh
> exec sort -k R -- "$@"

FWIW, I favour this solution for both sort and shuffle. Determining
behaviour via argv[0] (a la grep/egrep/fgrep) would also work.

Paul's statement:

> It's not as easy to implement as it sounds, I'm afraid; even if you
> can assume /dev/random or /dev/urandom.

is a slight concern, but I imagine the problems are applicable to
*any* shuffling utility. Is it that the app must guarantee all lines
of a non-seekable stdin must have an equal chance of any sort order?

Cheers,
Phil
Re: new coreutil? shuffle - randomize file contents
Frederik Eaton [EMAIL PROTECTED] wrote:

> So, what is the current state of things? Who is in charge of
> accepting patches? Are we decided that a 'shuffle' command but no
> 'sort -R' facility would be best, or that it would be good to have
> both, or is it still in question whether either would be accepted?

I am the official `maintainer', but Paul Eggert has been making most
of the changes recently.

It looks like there are some desirable features that can be provided
only by a shuffle-enabled program that is key-aware. Key specification
and the comparison code are already part of sort. Obviously,
duplicating all of that in a separate program is not an option. I
don't relish the idea of factoring out sort's line- and key-handling
code either, but it might be feasible.

However, I do like the idea of a new program that simply outputs a
random permutation of its input records, and that does it well, and
repeatably. The Unix tool philosophy certainly does encourage the
`perform one task and do it well' approach. Since doing it well
includes handling input larger than available virtual memory, this is
not trivial -- and it is well suited to the coreutils, i.e., it's not
easily implementable as a script.

Initially, I was inclined to say that adding both the new program (no
key support) and related functionality to sort was desirable. Thinking
of the limits of robustness of such a new program, I realized that if
the input is sufficiently large and not seekable (e.g., from a pipe),
then the program will have to resort to writing temporary files, much
as sort already does. More duplicated effort, determining how much
memory to use (like sort's --buffer-size=SIZE option), managing the
temporary files, ensuring that they're removed upon interrupt, etc.
But maybe not prohibitive. The new program would also have to have an
option like sort's -z, --zero-terminated option, and
--temporary-directory=DIR, and --output=FILE. In effect, it would need
all of sort's options that don't relate to sorting.

So implementing a robust shuffle program, even one without key
handling capabilities, would require much of the infrastructure
already present in sort.c. It sure sounds like shuffle and sort should
share a lot of code, one way or another, so why not have them share
the line- and key-handling code, too? I won't rule out adding a new
program, like shuffle, but I confess I'm less inclined now than when I
started typing this message.
Re: new coreutil? shuffle - randomize file contents
James> I think the consensus is that the functionality belongs in sort.
James> Beyond that things are a bit less clear. However, Paul put forward a
James> proposed usage which adapts the current -k option (see
James> http://lists.gnu.org/archive/html/bug-coreutils/2005-05/msg00179.html).
James> Nobody made any comments suggesting that that wasn't a good way of
James> doing things.

I liked my proposal for co-opting -s:

http://lists.gnu.org/archive/html/bug-coreutils/2005-05/msg00220.html

In other words, rather than having a random column, you'd have a row
number column, which would turn into a random column under the hash
operation. The documentation for '-s' would be changed to suggest that
it was enabling a comparison based on input row number.

James> Last time this was discussed on this list there was some talk about
James> selecting permutations with a seed. This makes me uncomfortable,
James> though doubtless it would be convenient. The problem is that the
James> number of permutations of the records of most files is larger than
James> 2^64, and so representing the seed and making use of it would be very
James> tricky. I then had a look at Fascicle 2 of volume 4 of Knuth, but it
James> concentrates on generating all permutations, and so wasn't directly
James> much help.

This seems to be a common point of confusion which I think needs to be
cleared up. The mapping from random seed to sample space is rarely
surjective - in fact, whenever you sample more than a few numbers from
a random number generator, it isn't. It only matters that the number
of possible seeds is much larger than the number of samplings we would
ever choose to perform, and if not 2^64 then definitely 2^128 will be
sufficient for all eternity.

(Maybe some theoretical common ground is needed. A random variable is
a function X from a measurable space with probability measure P to
another measurable space such that X^{-1}(B) is measurable for every
measurable B in the range of X. We can define a probability measure on
the range of X by Q(B) = P(X^{-1}(B)), called the distribution of X.
Usually the domain of X is taken to be unspecified and uncountable,
but since in practice we only ever sample a finite subset of it - and
since we cannot calculate the inverse of X at arbitrary points - we
can replace the domain with a suitably large finite set with discrete
probability measure and still get good sample coverage, independently
of the size of the range of X. This new domain is the seed domain.)

Phil> Paul's statement:
Phil>
Phil> It's not as easy to implement as it sounds, I'm afraid; even if you
Phil> can assume /dev/random or /dev/urandom.
Phil>
Phil> is a slight concern, but I imagine the problems are applicable to *any*
Phil> shuffling utility. Is it that the app must guarantee all lines of a
Phil> non-seekable stdin must have an equal chance of any sort order?

See my comment to James above. I think one need not make this
guarantee, since only a tiny fraction of possible sort orders will be
able to be tried by the user. However, I think it would be true for my
proposal (and the one that James suggested to Davis) if one took the
random number generator or hash function to be variable or unspecified
during analysis of the algorithm. The fact that we ship with a
particular RNG or hash function (of all uncountable possibilities,
some unimplementable) hard-coded is not a problem since its resources
are never exhausted.

Jim> It looks like there are some desirable features that can be
Jim> provided only by a shuffle-enabled program that is key-aware.
Jim> Key specification and the comparison code are already part of sort.
Jim> Obviously, duplicating all of that in a separate program is not
Jim> an option. I don't relish the idea of factoring out sort's line-
Jim> and key-handling code either, but it might be feasible.
Jim>
Jim> However, I do like the idea of a new program that simply outputs
Jim> a random permutation of its input records, and that does it well,
Jim> and repeatably. The Unix tool philosophy certainly does encourage
Jim> the `perform one task and do it well' approach. Since doing it
Jim> well includes handling input larger than available virtual memory,
Jim> this is not trivial -- and it is well suited to the coreutils,
Jim> i.e., it's not easily implementable as a script.
Jim>
Jim> Initially, I was inclined to say that adding both the new program
Jim> (no key support) and related functionality to sort was desirable.
Jim> Thinking of the limits of robustness of such a new program, I
Jim> realized that if the input is sufficiently large and not seekable
Jim> (e.g., from a pipe), then the program will have to resort to writing
Jim> temporary files, much as sort already does. More duplicated effort,
Jim> determining how much memory to use (like sort's --buffer-size=SIZE
Jim> option), managing the temporary files, ensuring that they're removed
Jim> upon interrupt, etc. But maybe not prohibitive. The new program
Jim> would also have to have an option like
Re: new coreutil? shuffle - randomize file contents
On Thu, 2 Jun 2005, Frederik Eaton wrote:

>> Phil> Is it that the app must guarantee all lines of a non-seekable
>> Phil> stdin must have an equal chance of any sort order?
>
> See my comment to James above. I think one need not make this
> guarantee, since only a tiny fraction of possible sort orders will be
> able to be tried by the user. However, I think it would be true for
> my proposal (and the one that James suggested to Davis) if one took
> the random number generator or hash function to be variable or
> unspecified during analysis of the algorithm.

Thinking further on this, I don't think it matters to the guts of sort
whether the ultimate random key is based on position hash or PRNG.

> One possibility for an efficient random permutation of a large file
> is a program which scans the file once and records the location of
> each line in an index, then applies a uniformly distributed random
> permutation to this line index by stepping through it in order and
> swapping the current entry with an entry chosen at random from the
> set of later entries, and finally steps through the index again and
> dereferences each line entry and prints out the corresponding line in
> the original file.

That approach is fine on seekable files, but the user may wish to
shuffle from stdin. sort already knows how to do this.

Here's a concrete example of Paul's suggestion as I understand it:

---INPUT---
a 3
b 2
c 1
d 1
---INPUT---

sort -R (or -k R) creates an imaginary field and sorts only on that:

---IMAGINARY INPUT---
a 3 0x293a
b 2 0xc9f4
c 1 0xa932
d 1 0x5ef5
---IMAGINARY INPUT---

---SORTED IMAGINARY OUTPUT---
a 3 0x293a
d 1 0x5ef5
c 1 0xa932
b 2 0xc9f4
---SORTED IMAGINARY OUTPUT---

---ACTUAL OUTPUT---
a 3
d 1
c 1
b 2
---ACTUAL OUTPUT---

Of course the random key should be long enough to ensure a negligible
chance of repetition (128 bits?). The user is free to make use of the
random key and other keys to create random subsets of an
overall-sorted list.

I'm still interested to read what Paul considers to be the
difficulties of such an implementation?

Cheers,
Phil
Re: new coreutil? shuffle - randomize file contents
There seems to be some sloppy thinking regarding efficiency and uniform randomness. Regarding uniform randomness, the infamous Oleg of comp.lang.{scheme,functional} writes: Furthermore, if we have a sequence of N elements and associate with each element a key -- a random number uniformly distributed within [0, M-1] (where N!M=N), we have the configurational space of size M^N (i.e., M^N ways to key the sequence). There are N! possible permutations of the sequence. Alas, for N2 and MN!, M^N is not evenly divisible by N!. Therefore, certain permutations are bound to be a bit more likely than the others. (see http://okmij.org/ftp/Haskell/perfect-shuffle.txt ) I think this is a potential problem with some of the suggestions for shuffling methods. I don't know how to combine sorting with shuffling properly, so I won't comment on that now. In the following, please note that I am not an experienced system programmer, so some of my thoughts could be totally wrong. Regarding efficiency, there are several cases: 1. The file is small In this case, the program will be fast unless someone does something super-stupid. The program speed here will probably be limited by the overhead of the system calls used to read and write files, and (if /dev/urandom is used) the time to generate random numbers. Tweaking performance in this case is probably not worth the effort, and certainly should not be done without careful profiling. 2. The file is large In this case, some care is in order, but I don't think a fancy algorithm will do any better than a simple one. The process speed will be limited by paging. Optimization should focus on preventing unnecessary page faults. In any case, the file must be read through from start to finish to identify line breaks, and then the lines must be read out. It's important to avoid adding a third pass by carelessness. Since there is no way to know how many lines there will be in the file before it's entirely processed, we will have to take a guess based on a reasonable line length and then correct it if necessary, probably using realloc. It would also be possible to use a temporary file for the line indices and then mmap it for use, but I suspect this will rarely be a win. Empty lines should be easy to detect, and since they are all the same, there is no sense in storing an index for each of them. Instead, just count how many there are in the file. On systems with mmap and madvise, the file should be mapped, the OS advised of sequential use, the file scanned for line breaks, the line break array shuffled, the OS advised of random use, and then the lines read to output through a buffer. The best thing about this method is that it should work well for both large and small files. If a file (probably a device) does not support seeking (and therefore presumably cannot be mapped either), then the entire file has to be read into memory, or into a temporary file. There are two tricky parts here: there is generally no way to know how many bites will be coming, so we will have to allocate buffer space dynamically. Probably the easiest way to do this, but not necessarily the best, is to copy the input to a temporary file, and then mmap the temporary file for readout. The good thing about this is that we don't have to grow the buffer ourselves. The other easy possibility is to use realloc, which may or may not be hideously inefficient. 
The last possibility I can think of is to use an array of arrays, but that seems on the messy side, especially when using CR/LF newlines, and also because it makes it harder to deal efficiently with the case where read only fills the buffer part way.

Everything gets messier, of course, on a system that does not support mmap. In particular, with large files it will be necessary to skip around the file with lseek and read. Not fun at all.

The last major problem is to choose the shuffling algorithm itself. For any reasonably sized file a really simple shuffle will be best: just fill an array with an increasing sequence 0, 1, 2, etc., shuffle it with a Knuth shuffle, and output the lines in that order, through a buffer. If the file is ridiculously huge, it would seem to be good to find an algorithm that would allow small parts to be shuffled separately. I have not yet seen such an algorithm that actually worked. I will think about it some more.

David Feuer

___ Bug-coreutils mailing list Bug-coreutils@gnu.org http://lists.gnu.org/mailman/listinfo/bug-coreutils
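A rough C sketch of the mmap/madvise flow and the Knuth shuffle of the line-break index described above, under the stated assumptions (POSIX mmap and madvise available, file seekable and non-empty). Error handling is minimal and rand() stands in for a real entropy source; a real version would draw from /dev/urandom.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main (int argc, char **argv)
{
  if (argc != 2)
    return 1;
  int fd = open (argv[1], O_RDONLY);
  struct stat st;
  if (fd < 0 || fstat (fd, &st) != 0 || st.st_size == 0)
    return 1;

  char *p = mmap (NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  if (p == MAP_FAILED)
    return 1;

  /* Pass 1: sequential scan to index the start of every line.  */
  madvise (p, st.st_size, MADV_SEQUENTIAL);
  size_t n = 0, cap = 1024;
  off_t *start = malloc (cap * sizeof *start);
  start[n++] = 0;
  for (off_t i = 0; i + 1 < st.st_size; i++)
    if (p[i] == '\n')
      {
        if (n == cap)
          start = realloc (start, (cap *= 2) * sizeof *start);
        start[n++] = i + 1;
      }

  /* Knuth shuffle of the index array, not of the text itself.  */
  srand (time (NULL));
  for (size_t i = n - 1; i > 0; i--)
    {
      size_t j = rand () % (i + 1);   /* stand-in RNG; slightly biased */
      off_t t = start[i]; start[i] = start[j]; start[j] = t;
    }

  /* Pass 2: the readout touches pages in random order.  */
  madvise (p, st.st_size, MADV_RANDOM);
  for (size_t i = 0; i < n; i++)
    {
      off_t e = start[i];
      while (e < st.st_size && p[e] != '\n')
        e++;
      fwrite (p + start[i], 1, e - start[i], stdout);
      putchar ('\n');
    }
  return 0;
}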
Re: new coreutil? shuffle - randomize file contents
> Phil> Is it that the app must guarantee all lines of a
> Phil> non-seekable stdin must have an equal chance of any sort order?
>
> See my comment to James above. I think one need not make this guarantee, since only a tiny fraction of the possible sort orders can ever be tried by the user. However, I think it would be true for my proposal (and the one that James suggested to Davis) if one took the random number generator or hash function to be variable or unspecified during analysis of the algorithm.
>
> Thinking further on this, I don't think it matters to the guts of sort whether the ultimate random key is based on a position hash or a PRNG.

No, it doesn't ... but I don't understand why you bring it up in this context.

> One possibility for an efficient random permutation of a large file is a program which scans the file once and records the location of each line in an index, then applies a uniformly distributed random permutation to this line index by stepping through it in order and swapping the current entry with an entry chosen at random from among itself and the later entries, and finally steps through the index again, dereferences each line entry, and prints out the corresponding line from the original file.
>
> That approach is fine on seekable files, but the user may wish to shuffle from stdin. sort already knows how to do this. Here's a concrete example of Paul's suggestion as I understand it:

I understand Paul's suggestion. I was throwing out the other algorithm since Davis was looking for something more efficient.

Regards, Frederik

___ Bug-coreutils mailing list Bug-coreutils@gnu.org http://lists.gnu.org/mailman/listinfo/bug-coreutils
Re: new coreutil? shuffle - randomize file contents
So, what is the current state of things? Who is in charge of accepting patches? Have we decided that a 'shuffle' command but no 'sort -R' facility would be best, or that it would be good to have both? Or is it still in question whether either would be accepted? Frederik -- http://ofb.net/~frederik/ ___ Bug-coreutils mailing list Bug-coreutils@gnu.org http://lists.gnu.org/mailman/listinfo/bug-coreutils
Re: new coreutil? shuffle - randomize file contents
On Monday 30 May 2005 23:02, Frederik Eaton wrote:
> I hope that you aren't proposing an algorithm which is similar to card-shuffling. That would be exactly like merge-sorting on a key hash - i.e. no more efficient.

Agreed! The algorithm implemented is a slight variation on Knuth's shuffle algorithm--instead of randomizing a list into another list, we randomize directly to output--and runs in linear O(n) time.

Thanks, Davis

___ Bug-coreutils mailing list Bug-coreutils@gnu.org http://lists.gnu.org/mailman/listinfo/bug-coreutils
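Davis's code isn't shown in the thread, but one plausible reading of "randomize directly to output" is the following: repeatedly pick a random survivor, print it, and move the last survivor into the hole, so no second list is ever built. A hypothetical sketch, distinct from the in-place index shuffle sketched earlier:

#include <stdio.h>
#include <stdlib.h>

/* Knuth-style shuffle emitted straight to output.  'line' holds the
   n lines already read into memory; rand() is a stand-in RNG.  */
static void shuffle_out (char **line, size_t n)
{
  while (n > 0)
    {
      size_t j = rand () % n;      /* pick one of the n survivors */
      fputs (line[j], stdout);     /* emit it immediately */
      line[j] = line[--n];         /* close the hole with the last one */
    }
}

Each line is touched once, so the whole pass is O(n), and (up to rand()'s modulo bias) every output order is equally likely.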
Re: new coreutil? shuffle - randomize file contents
On Wed, May 25, 2005 at 10:58:41AM +0100, James Youngman wrote:
> On Tue, May 24, 2005 at 09:55:35AM -0700, Paul Eggert wrote:
>> That way, you could use, e.g.: sort -k 2,2 -k R which would mean sort by the 2nd field, but if there are ties then sort the ties randomly. sort -R would be short for sort -k R.
>
> Perhaps this approach avoids the problems that were discussed earlier regarding expectations about lines with identical keys shuffling together.

I hope it is agreed that the conclusion that was reached earlier was that both behaviors - identical keys shuffling *together* vs. *apart* - would be useful in different situations. We came up with a number of situations in which one behavior or the other was necessary, and we didn't really come up with any other ideas for useful behaviors.

I think we have yet to consider other ways of getting these two behaviors, however. For instance, -s could be seen as an instruction to, last of all, sort by the input row number. But if we implement randomization as sorting by a hash of the keys - for a "together" shuffle - then including the input row number in this hash would get the contrasting behavior that Paul Eggert is suggesting - the "apart" shuffle. So with a rephrasing of the -s option description, it might make sense for -R to indicate the "together" behavior and -Rs to indicate the "apart" behavior. In this case -s wouldn't mean "stable" so much as "depends on input ordering". I don't know if this is sensible.

Anyway, here is the end of the last thread: http://lists.gnu.org/archive/html/bug-coreutils/2005-02/msg5.html

Frederik

___ Bug-coreutils mailing list Bug-coreutils@gnu.org http://lists.gnu.org/mailman/listinfo/bug-coreutils
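To illustrate the "together" vs. "apart" distinction in code: hashing the key alone gives equal keys equal sort keys, while mixing in the input row number scatters them. FNV-1a below is only a stand-in for the MD5/SHA1 digests in the actual patch, and a real version would also mix in a per-run random seed so that runs differ:

#include <stdint.h>
#include <stddef.h>

/* FNV-1a, a stand-in for the patch's MD5/SHA1 digests.  */
static uint64_t fnv1a (const void *buf, size_t len, uint64_t h)
{
  const unsigned char *p = buf;
  for (size_t i = 0; i < len; i++)
    h = (h ^ p[i]) * 0x100000001b3ULL;
  return h;
}

#define FNV_SEED 0xcbf29ce484222325ULL

/* "Together": lines with equal keys get equal sort keys,
   so they end up adjacent in the output.  */
static uint64_t hash_together (const char *key, size_t len)
{
  return fnv1a (key, len, FNV_SEED);
}

/* "Apart": the input row number is mixed in, so equal keys
   still scatter (one reading of what -Rs might mean).  */
static uint64_t hash_apart (const char *key, size_t len, uint64_t row)
{
  return fnv1a (&row, sizeof row, fnv1a (key, len, FNV_SEED));
}

sort would then compare these values wherever it now compares key text; with hash_together, a preprocessed key (e.g. the -n-normalized form) hashes identically for 1.0, 01.0, and 1.00.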
Re: new coreutil? shuffle - randomize file contents
I'm not following exactly - in part I think it is premature to discuss implementation details now.

And as for the idea to put shuffle functionality in a separate command, this and other issues were discussed at length in the previous thread, which starts here: http://lists.gnu.org/archive/html/bug-coreutils/2005-01/msg00145.html

Basically, sometimes you want to be able to sort on one key and shuffle on another, which means that your hypothetical 'shuffle' would have to have a superset of the 'sort' functionality. Not an ideal situation. Not only would the implementations be very similar, but, more importantly, the APIs would have a lot of overlap as well.

Also, just because some users might look for shuffle functionality first in a shuffle command doesn't mean that we should put it there. You only have to learn that the functionality is provided by sort once, and it doesn't make sense to sacrifice too much usability or maintainability to try to captivate the small minority of users who are first-time users (as commercial software vendors tend to do). (Now, it also doesn't make sense to have to work out a line-long awk script every time you need to shuffle something.)

Frederik

On Tue, May 24, 2005 at 09:46:13PM +0000, Davis Houlton wrote:
> On Tuesday 24 May 2005 15:33, Frederik Eaton wrote:
>> reason to expand the functionality of 'sort'. But in my opinion a more important reason is that the set of commands that one runs on a unix system comprise a language, which is a very important language from a user's perspective, and if people think that it should make sense intuitively to use a certain command to do a certain thing, then that command should be extended to do that thing, for the sake of the conciseness and clarity and usability of this language.
>
> I agree that the coreutils represent a toolbox for programming. Without them, /bin/sh would be an empty place indeed! So, in the same way we have echo / printf, expand / unexpand, head / tail, I like to think that there is also room for both sort and shuffle.
>
> In my mind, from a functional operator's perspective, I find it very intuitive that sort does as it says--sort lines ACB into ABC. Sorting by dates and numbers is a logical extension of the sort metaphor, and again, I intuitively look to sort to do these things. But when I went to look for a shuffle command--as a user--I did not even think to look at the sort man page. I'm not sure if that is typical -- sort is certainly a logical place to look--but that was my experience. A command like shuffle, in my mind, has a different set of functional requirements--to take lines ACB and produce one of ABC ACB BAC BCA CAB CBA, etc. As a user, not necessarily aware of technical implementation, sort and shuffle make sense as separate commands, in the same way I use different methods to sort and shuffle a deck of cards.
>
> Reading these notes has given me pause to challenge my current line of thinking regarding large files. Glossing over some gritty implementation details, my revised thought process is that for each shuffle execution, there are two paths: a computationally fast one involving manipulation of text buffers, and one that involves only delimiter locations and line sizes. Shuffle would pick the optimum execution path based on input set size; small, average cases use text buffers. Larger cases use delimiter locations paged to disk.
>
> That is a good point about RAND_MAX.
> For a case where we have more than RAND_MAX lines, we need to make multiple random calls until our net RAND_MAX is large enough to encapsulate the entire input set. For example, suppose we used a die to locate a random line. If we had 20 lines, we could use 4d6-3 to give us the range we need (1-21). For 30 lines, 6d6-5 (1-31), etc.
>
> I currently do not have time (tonight) to think through a scheme that uses delimiter locations paged out to disk, but I suspect it would be implemented differently from sort? Not sure yet--or if there would be any benefit. Some considerations--are file sizes larger than fseek's fpos_t something that needs to be considered? What is the maximum average footprint considered acceptable for a reduced-memory environment? 1MB? 4MB? 16k? 2k?
>
> Like Jay pointed out, the general algorithm will involve splitting delimiter locations into manageable pages of memory, and randomizing each page. Then we select a page at random, and walk down that page. The question is then--what is the maximum page size?
>
> Thanks, Davis

-- http://ofb.net/~frederik/ ___ Bug-coreutils mailing list Bug-coreutils@gnu.org http://lists.gnu.org/mailman/listinfo/bug-coreutils
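One caveat on the dice arithmetic above: the sum of several dice is not uniform (4d6-3 lands on 11 far more often than on 1), so a shuffle built on it would favor middle lines. A uniform way to index more than RAND_MAX lines is to treat each rand() call as one digit of a wider number and reject the biased tail; a sketch:

#include <stdlib.h>

/* Uniform value in [0, n-1] even when n > RAND_MAX; requires n >= 1.
   Each rand() call supplies one base-(RAND_MAX+1) digit; results past
   the largest multiple of n are rejected to avoid modulo bias.  */
static unsigned long long random_index (unsigned long long n)
{
  unsigned long long base = (unsigned long long) RAND_MAX + 1;
  unsigned long long span, r;

  do
    {
      span = 1;
      r = 0;
      while (span < n)               /* accumulate enough digits */
        {
          r = r * base + rand ();
          span *= base;
        }
    }
  while (r >= span - span % n);      /* reject the biased tail */

  return r % n;
}

The rejection loop runs less than twice on average, since the accepted region always covers more than half of the possible values.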
Re: new coreutil? shuffle - randomize file contents
Hi Frederik! I guess we're both a little confused :) My question is: why would I sort AND shuffle in the same command? Are we talking sort the whole data set and shuffle a subset? I guess I'm having a hard time thinking of why I would randomize via key--not saying that there aren't reasons, I'm just not sure what they are!

My premise is that shuffle is organized pretty differently than sort--the code I have (in addition to the code I imagine we'll need for large files) looks radically different than sort, if only because shuffling is vastly simpler. While we could graft a shuffle into sort--I must admit to having only taken a cursory glance at the sort source--I think we can gain greater efficiencies by keeping the logic paths separate. My assumption is thus that the shuffling code will be its own entity, whether it is in sort or shuffle.

Looking at it a different way, let's take a look at the usage of sort and shuffle as a card metaphor. The way I sort a deck of cards--and my rather simple method is far from optimum--is to first spread the cards face up on a table, look for some high cards of each suit, start a pile for each of the four suits, and then as I pull additional cards, place them in the proper order in each suit pile. When I'm done sometime later, I'm left with four stacks of cards, each suit in the proper order.

When I shuffle the resulting deck, however, I use a different process. Granted, I could spread all the cards on the table, mix them up domino style, and then place them randomly into one, or even four, stacks. That would be acceptable. But what I do (following the grand tradition of card shark wannabes everywhere) is split the deck in half. I take each deck, and attempt to randomly merge them together like we've all seen those Las Vegas dealers do on TV, and voila--I have now (in theory) randomized the deck. It's quicker and just as effective as the table spread method.

If we are willing to ignore the imperfections of the analogy--that Vegas dealers shuffle their cards 7 times, that I have a tendency to mangle cards with improper shuffling technique, etc.--my thinking is that it makes sense to have sort and shuffle remain separate on an intuitive level. And I admit, it is true, it is not hard to train a user in sort and shuffle commands. Had sort --random already existed, there would be no need to propose any separation. But if we accept as a given that the code will follow two different logic paths, I personally don't see maintenance gains from combining the two.

I must admit, with the American holidays and family I'm pretty pressed for time. I took a quick scan of the archive and it seemed like the conclusion was that it is a good idea to keep shuffle functionality separate? At any rate, I will add the delimiter, -o, and -z options sometime in the future and then check the code into Savannah for (I hope) scathing review. While my personal feeling is that it could be a good addition to coreutils, at least this way it can be available to those who have a use for a simple, quick shuffle.

On Monday 30 May 2005 05:27, Frederik Eaton wrote:
> I'm not following exactly - in part I think it is premature to discuss implementation details now.
> And as for the idea to put shuffle functionality in a separate command, this and other issues were discussed at length in the previous thread, which starts here: http://lists.gnu.org/archive/html/bug-coreutils/2005-01/msg00145.html
>
> Basically, sometimes you want to be able to sort on one key and shuffle on another, which means that your hypothetical 'shuffle' would have to have a superset of the 'sort' functionality. Not an ideal situation. Not only would the implementations be very similar, but, more importantly, the APIs would have a lot of overlap as well.
>
> Also, just because some users might look for shuffle functionality first in a shuffle command doesn't mean that we should put it there. You only have to learn that the functionality is provided by sort once, and it doesn't make sense to sacrifice too much usability or maintainability to try to captivate the small minority of users who are first-time users (as commercial software vendors tend to do). (Now, it also doesn't make sense to have to work out a line-long awk script every time you need to shuffle something.)
>
> Frederik

___ Bug-coreutils mailing list Bug-coreutils@gnu.org http://lists.gnu.org/mailman/listinfo/bug-coreutils
Re: new coreutil? shuffle - randomize file contents
On Mon, May 30, 2005 at 09:25:45AM +0000, Davis Houlton wrote:
> Hi Frederik! I guess we're both a little confused :) My question is: why would I sort AND shuffle in the same command? Are we talking sort the whole data set and shuffle a subset? I guess I'm having a hard time thinking of why I would randomize via key--not saying that there aren't reasons, I'm just not sure what they are!

This is covered in the previous thread. The canonical example is playing songs with albums shuffled, but with the songs on each album played together and in order.

> My premise is that shuffle is organized pretty differently than sort--the code I have (in addition to the code I imagine we'll need for large files) looks radically different than sort, if only because shuffling is vastly simpler. While we could graft a shuffle into sort--I must admit to having only taken a cursory glance at the sort source--I think we can gain greater efficiencies by keeping the logic paths separate. My assumption is thus that the shuffling code will be its own entity, whether it is in sort or shuffle.

It is true that shuffling can generally be done more efficiently than sorting. I don't know if efficiency is a primary concern - I think that the *ability* to handle multi-gigabyte files is important, but since they come up so rarely, especially when the task is to shuffle and not to sort, whether they are done in a minute or 30 minutes seems inconsequential. But if you are already writing something which will be able to handle large files well, I guess I personally don't see a problem with including it in coreutils. The only thing is that what you describe won't be able to handle all of the use cases that I had in mind. I would still like to see 'sort' have an option to sort based on a hash of keys, since this would cover those.

> Looking at it a different way, let's take a look at the usage of sort and shuffle as a card metaphor. The way I sort a deck of cards--and my rather simple method is far from optimum--is to first spread the cards face up on a table, look for some high cards of each suit, start a pile for each of the four suits, and then as I pull additional cards, place them in the proper order in each suit pile. When I'm done sometime later, I'm left with four stacks of cards, each suit in the proper order.
>
> When I shuffle the resulting deck, however, I use a different process. Granted, I could spread all the cards on the table, mix them up domino style, and then place them randomly into one, or even four, stacks. That would be acceptable. But what I do (following the grand tradition of card shark wannabes everywhere) is split the deck in half. I take each deck, and attempt to randomly merge them together like we've all seen those Las Vegas dealers do on tv, and voila--I have now (in theory) randomized the deck. It's quicker and just as effective as the table spread method.
>
> If we are willing to ignore the imperfections of the analogy--that Vegas dealers shuffle their cards 7 times, that I have a tendency to mangle cards with improper shuffling technique, etc.--my thinking is that it makes sense to have sort and shuffle remain separate on an intuitive level. And I admit, it is true, it is not hard to train a user in sort and shuffle commands. Had sort --random already existed, there would be no need to propose any separation. But if we accept as a given that the code will follow two different logic paths, I personally don't see maintenance gains from combining the two.
I hope that you aren't proposing an algorithm which is similar to card-shuffling. That would be exactly like merge-sorting on a key hash - i.e. no more efficient.

> I took a quick scan of the archive and it seemed like the conclusion was that it is a good idea to keep shuffle functionality separate?

I believe it was concluded that two functionalities were needed - I don't know what you mean by "separate".

Frederik

___ Bug-coreutils mailing list Bug-coreutils@gnu.org http://lists.gnu.org/mailman/listinfo/bug-coreutils
Re: new coreutil? shuffle - randomize file contents
On Tue, May 24, 2005 at 09:55:35AM -0700, Paul Eggert wrote:
> That way, you could use, e.g.: sort -k 2,2 -k R which would mean sort by the 2nd field, but if there are ties then sort the ties randomly. sort -R would be short for sort -k R.

Perhaps this approach avoids the problems that were discussed earlier regarding expectations about lines with identical keys shuffling together.

James.

___ Bug-coreutils mailing list Bug-coreutils@gnu.org http://lists.gnu.org/mailman/listinfo/bug-coreutils
Re: new coreutil? shuffle - randomize file contents
On Tue, May 24, 2005 at 11:25:48AM +0100, [EMAIL PROTECTED] wrote:
> James Youngman wrote:
>> Davis Houlton writes:
>>> I recently had to write a shuffle utility for a personal project and was wondering if it would make a candidate for the coreutils suite. It seems like the kind of utility the toolbox could use (maybe under section 3. Output of entire files).
>>
>> This behaviour was proposed a few months ago as a new option to sort, and there were objections around the ideas of keeping the shuffled sort stable (i.e. that lines with the same key should appear in groups in the shuffled output) and of repeatability (e.g. giving a 'random seed' to ensure output is reproducible[*]). Much discussion followed and ended up with many people agreeing that this behaviour properly belonged in a separate program.

I didn't get that impression. There was some resistance to the idea, but no conclusion, except that two variants would be needed if the functionality were added to 'sort'. See the thread starting on 2005/1/29. I think that since people keep asking for this feature, and offering to implement it, there is no reason not to make a commitment to adding it. If you don't want to use it, you can easily avoid using it.

>> So, I think that shuffle is a good idea.
>
> I don't agree. You just end up duplicating 99% of the sort logic. Logically the only difference from sort is the low-level ordering algorithm, so I vote for an extra arg to sort: --sort=random. Another arg to the --sort option could be 'version', which would sort files with version numbers in their name appropriately.

I don't know about the 'version' suggestion; at this point it seems that you would be better off using perl. However, it does seem that the 'sort' command-line API could be extended to allow a great variety of special sorting functions to be specified, with little burden on the documentation and option syntax, by specifying the function name as an argument to a general-purpose option, as with the --sort XX you propose.

As for the duplication of sort logic, yes, I think this is a good reason to expand the functionality of 'sort'. But in my opinion a more important reason is that the set of commands that one runs on a unix system comprise a language, which is a very important language from a user's perspective, and if people think that it should make sense intuitively to use a certain command to do a certain thing, then that command should be extended to do that thing, for the sake of the conciseness and clarity and usability of this language. The fact that something can be accomplished in another way means nothing, if that other way is cumbersome or unintuitive.

Think about the big picture: if somebody is arguing that there should be a standard well-known 'shuffle' command which is part of most distributions, then I would say, well, adding such a feature to 'sort' would be a much simpler solution - it would be on the whole easier to maintain, and on the whole easier to remember.

Frederik

___ Bug-coreutils mailing list Bug-coreutils@gnu.org http://lists.gnu.org/mailman/listinfo/bug-coreutils
Re: new coreutil? shuffle - randomize file contents
On Mon, May 23, 2005 at 08:02:19PM +0000, Davis Houlton wrote:
> On Monday 23 May 2005 16:35, you wrote:
>> So, I think that shuffle is a good idea.
>
> Great! As I wasn't sure if this was a good idea or not, right now the functionality is quite minimal. I agree that it needs to be expanded, and will do what I can to do so.

Of course, I'm not one of the coreutils maintainers, so my opinion has no weight.

>> Does it produce all possible outputs with equal probability?
>
> I would say so. The randomizing function is that old standby return 1+(rand()%line_count);

You have a problem where a file has more than RAND_MAX lines. See below.

> [...] redundancy--I will make those (right now I'm using a linked list to store the file in memory. This really should be a pointer of strings, of course). It's fast enough for me on 2GHz hardware, but might as well go the extra mile. [...]
>
> I also realized I need to add a caveat section--this utility is memory bound (currently). If you have 4GB of file to randomize and only 1GB of memory, well, thems the bricks. The obvious way around this is to first scan and create an index of delimiter locations, then fseek through a random walk of said delimiters--but I wasn't sure if it was really necessary at this point--can't imagine it would be very fast, and then there's the issue of stdin. Probably need to add this as a --use-index option.

The GNU project likes to avoid arbitrary limits. Take a look at the way that 'sort' handles the same situation. Rather than doing that many random reads, it is probably more efficient to do something like this:

1. Read N lines from the input file(s), where N is chosen such that the data fits into memory.
2. Shuffle each chunk of N lines, and write the shuffled data to a temporary file. Continue until all the input has been read and sent to a temporary file (if the input all fits into memory, skip to step 4).
3. Read the temporary files back, selecting the next line from a randomly selected temporary file, until all the temporary files have been completely read.
4. On the final merge pass (if any), send the data to the output file instead of a temporary file. Delete all the temporary files (you need to do this manually, since the resource limit on the number of temporary files prevents you using tmpfile()).

This is quite complex to implement, since you can only merge M temporary files in the merge pass, where M < RAND_MAX and M < the RLIMIT_NOFILE resource limit. Because of this complexity, you might want to get some feedback from the coreutils maintainers before you consider making that change, because it's quite significant.

Regards, James Youngman.

___ Bug-coreutils mailing list Bug-coreutils@gnu.org http://lists.gnu.org/mailman/listinfo/bug-coreutils
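A condensed sketch of this temporary-file scheme, with one deliberate change: the merge picks a temp file with probability proportional to its remaining lines rather than uniformly, which keeps the overall permutation uniform. The multi-pass merge and file-count limits James mentions are ignored here (tmpfile() is used for brevity), lines are assumed shorter than the buffer, and random_index() is assumed to return a uniform value in [0, n-1], built for instance from repeated rand() calls with rejection:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_LINES 100000         /* "N": tune so one chunk fits in memory */

extern unsigned long long random_index (unsigned long long n);

/* Shuffle one in-memory chunk straight into a fresh temp file.  */
static FILE *spill_chunk (char **line, size_t n)
{
  FILE *tmp = tmpfile ();          /* error handling elided */
  for (size_t k = n; k > 0; )
    {
      size_t j = random_index (k);
      fputs (line[j], tmp);
      free (line[j]);
      line[j] = line[--k];
    }
  rewind (tmp);
  return tmp;
}

int main (void)
{
  FILE *tmp[64];                   /* sketch limit; real code must honor RLIMIT_NOFILE */
  size_t lines[64], ntmp = 0, total = 0;
  char **line = malloc (CHUNK_LINES * sizeof *line);
  char buf[4096];
  size_t n = 0;

  /* Steps 1-2: read chunks, shuffle each, spill to temp files.  */
  while (fgets (buf, sizeof buf, stdin))
    {
      line[n++] = strdup (buf);
      if (n == CHUNK_LINES)
        {
          lines[ntmp] = n; total += n;
          tmp[ntmp++] = spill_chunk (line, n);
          n = 0;
        }
    }
  if (n)
    {
      lines[ntmp] = n; total += n;
      tmp[ntmp++] = spill_chunk (line, n);
    }

  /* Steps 3-4: repeatedly pick a temp file, weighted by how many
     lines it still holds, and emit its next line.  */
  while (total > 0)
    {
      unsigned long long pick = random_index (total);
      size_t f = 0;
      while (pick >= lines[f])
        pick -= lines[f++];
      fgets (buf, sizeof buf, tmp[f]);
      fputs (buf, stdout);
      lines[f]--; total--;
    }
  return 0;
}

The weighting matters: a uniformly shuffled chunk interleaved into the output with probability proportional to its remaining size yields a uniformly random interleaving, and hence a uniform shuffle overall; picking files uniformly would drain small files too fast and bias the result.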
Re: new coreutil? shuffle - randomize file contents
James Youngman wrote:
> Davis Houlton writes:
>> I recently had to write a shuffle utility for a personal project and was wondering if it would make a candidate for the coreutils suite. It seems like the kind of utility the toolbox could use (maybe under section 3. Output of entire files).
>
> This behaviour was proposed a few months ago as a new option to sort, and there were objections around the ideas of keeping the shuffled sort stable (i.e. that lines with the same key should appear in groups in the shuffled output) and of repeatability (e.g. giving a 'random seed' to ensure output is reproducible[*]). Much discussion followed and ended up with many people agreeing that this behaviour properly belonged in a separate program.
>
> So, I think that shuffle is a good idea.

I don't agree. You just end up duplicating 99% of the sort logic. Logically the only difference from sort is the low-level ordering algorithm, so I vote for an extra arg to sort: --sort=random. Another arg to the --sort option could be 'version', which would sort files with version numbers in their name appropriately.

Pádraig.

___ Bug-coreutils mailing list Bug-coreutils@gnu.org http://lists.gnu.org/mailman/listinfo/bug-coreutils
RE: new coreutil? shuffle - randomize file contents
I'm just a lurker, so my opinion doesn't count for much. Certainly I don't expect everyone to be a programmer in order to be able to shuffle their playlist, but perhaps an example needs to be added to the sort man page stating how easy this is to accomplish with tools that are likely already installed on your system (specifically, awk to add a pseudorandom key to a line and cut to remove it):

find /music -name '*ogg' -type f | awk 'BEGIN{srand()}{printf "%06d%s\n", rand()*1000000, $0}' | sort | cut -b 7- > /music/randomized_oggfiles.m3u

Remove srand() if you prefer more deterministic behavior. Standard disclaimers about using pseudorandom numbers for anything important apply.

As to adding a random function to sort that would keep like keys together, perhaps that can be achieved by abusing the locales and collate function that does unexpected things for many users anyway? I'm only half serious when I say that, but LANG=(something other than C) has caused so many users so much grief already that LANG=random couldn't be any worse.

Apologies for the noise on the list, but I hope you got a chuckle from LANG=random anyway. I'll go back to lurking now.

James Youngman wrote:
> I don't agree. You just end up duplicating 99% of the sort logic. Logically the only difference from sort is the low-level ordering algorithm, so I vote for an extra arg to sort: --sort=random. Another arg to the --sort option could be 'version', which would sort files with version numbers in their name appropriately.

___ Bug-coreutils mailing list Bug-coreutils@gnu.org http://lists.gnu.org/mailman/listinfo/bug-coreutils
Re: new coreutil? shuffle - randomize file contents
Lemley James - jlemle wrote:
> Certainly I don't expect everyone to be a programmer in order to be able to shuffle their playlist, but perhaps an example needs to be added to the sort man page stating how easy this is to accomplish with tools that are likely already installed on your system (specifically, awk to add a pseudorandom key to a line and cut to remove it):
>
> find /music -name '*ogg' -type f | awk 'BEGIN{srand()}{printf "%06d%s\n", rand()*1000000, $0}' | sort | cut -b 7- > /music/randomized_oggfiles.m3u

Your solution and a previously posted one are quite similar. http://lists.gnu.org/archive/html/bug-coreutils/2005-01/msg00148.html

Yours has the advantage of not using $RANDOM, a ksh/bash-specific feature not in POSIX.

Bob

___ Bug-coreutils mailing list Bug-coreutils@gnu.org http://lists.gnu.org/mailman/listinfo/bug-coreutils
Re: new coreutil? shuffle - randomize file contents
[EMAIL PROTECTED] writes:
> Logically the only difference from sort is the low-level ordering algorithm, so I vote for an extra arg to sort: --sort=random.

More generally, sort could pretend that every line had an extra field called R whose contents are random. That way, you could use, e.g.:

sort -k 2,2 -k R

which would mean sort by the 2nd field, but if there are ties then sort the ties randomly. sort -R would be short for sort -k R.

It's not as easy to implement as it sounds, I'm afraid, even if you can assume /dev/random or /dev/urandom.

___ Bug-coreutils mailing list Bug-coreutils@gnu.org http://lists.gnu.org/mailman/listinfo/bug-coreutils
Re: new coreutil? shuffle - randomize file contents
On Tuesday 24 May 2005 15:33, Frederik Eaton wrote:
> reason to expand the functionality of 'sort'. But in my opinion a more important reason is that the set of commands that one runs on a unix system comprise a language, which is a very important language from a user's perspective, and if people think that it should make sense intuitively to use a certain command to do a certain thing, then that command should be extended to do that thing, for the sake of the conciseness and clarity and usability of this language.

I agree that the coreutils represent a toolbox for programming. Without them, /bin/sh would be an empty place indeed! So, in the same way we have echo / printf, expand / unexpand, head / tail, I like to think that there is also room for both sort and shuffle.

In my mind, from a functional operator's perspective, I find it very intuitive that sort does as it says--sort lines ACB into ABC. Sorting by dates and numbers is a logical extension of the sort metaphor, and again, I intuitively look to sort to do these things. But when I went to look for a shuffle command--as a user--I did not even think to look at the sort man page. I'm not sure if that is typical -- sort is certainly a logical place to look--but that was my experience. A command like shuffle, in my mind, has a different set of functional requirements--to take lines ACB and produce one of ABC ACB BAC BCA CAB CBA, etc. As a user, not necessarily aware of technical implementation, sort and shuffle make sense as separate commands, in the same way I use different methods to sort and shuffle a deck of cards.

Reading these notes has given me pause to challenge my current line of thinking regarding large files. Glossing over some gritty implementation details, my revised thought process is that for each shuffle execution, there are two paths: a computationally fast one involving manipulation of text buffers, and one that involves only delimiter locations and line sizes. Shuffle would pick the optimum execution path based on input set size; small, average cases use text buffers. Larger cases use delimiter locations paged to disk.

That is a good point about RAND_MAX. For a case where we have more than RAND_MAX lines, we need to make multiple random calls until our net RAND_MAX is large enough to encapsulate the entire input set. For example, suppose we used a die to locate a random line. If we had 20 lines, we could use 4d6-3 to give us the range we need (1-21). For 30 lines, 6d6-5 (1-31), etc.

I currently do not have time (tonight) to think through a scheme that uses delimiter locations paged out to disk, but I suspect it would be implemented differently from sort? Not sure yet--or if there would be any benefit. Some considerations--are file sizes larger than fseek's fpos_t something that needs to be considered? What is the maximum average footprint considered acceptable for a reduced-memory environment? 1MB? 4MB? 16k? 2k?

Like Jay pointed out, the general algorithm will involve splitting delimiter locations into manageable pages of memory, and randomizing each page. Then we select a page at random, and walk down that page. The question is then--what is the maximum page size?

Thanks, Davis

___ Bug-coreutils mailing list Bug-coreutils@gnu.org http://lists.gnu.org/mailman/listinfo/bug-coreutils
Re: new coreutil? shuffle - randomize file contents
Davis Houlton writes:
> I recently had to write a shuffle utility for a personal project and was wondering if it would make a candidate for the coreutils suite. It seems like the kind of utility the toolbox could use (maybe under section 3. Output of entire files).

This behaviour was proposed a few months ago as a new option to sort, and there were objections around the ideas of keeping the shuffled sort stable (i.e. that lines with the same key should appear in groups in the shuffled output) and of repeatability (e.g. giving a 'random seed' to ensure output is reproducible[*]). Much discussion followed and ended up with many people agreeing that this behaviour properly belonged in a separate program. So, I think that shuffle is a good idea.

[*] I'm against the 'random seed' idea since the number of possible permutations of the output is so large for most reasonable inputs that numbers as low as 2^32 are way too small. That makes parsing and making use of the seed quite tricky and in all likelihood not worth the bother.

> It's a fairly simple program, and while there is room for optimization, works quickly enough on my hardware.

Does it produce all possible outputs with equal probability?

> SYNOPSIS
> shuffle [--help] [FILE]...

I would suggest the addition of the following options:

--null, -0, -z
Accept and produce records terminated by the NUL character (in ASCII, this is 0) instead of newline. This can be used in conjunction with find -print0 or with xargs -0.

-o file
Write the shuffled output data to the named file.

The reason for offering -z there as well is that sort has a -z option that does this.

> Create a random multimedia playlist based on directory contents:
> find /path/to/media -name '*.ogg' | shuffle > playlist.pls

Does that generate a playlist in the right format? Based on a look at http://gonze.com/playlists/playlist-format-survey.html#PLS it seems to me like it's better to indicate that the output file is 'playlist.m3u'.

> EXIT STATUS
> shuffle returns a zero upon successful completion; a one upon failing to read a file and a two upon failing to write a file.

I realise that many writing style guides suggest that numbers less than six or so should be represented as words rather than numerals, but I found the paragraph above would have been more useful if formatted a bit more like a table:

.SH EXIT STATUS
.B shuffle
exits with the following status:
.nf
0 if it succeeds
1 if it fails to open or read an input file
2 if it fails to write output data or fails to open the output file
3 for command-line usage errors
.fi

The program can also exit with numerically greater status values if other errors occur, but most of these are operating-system dependent.

> Please let me know if this would be considered an appropriate addition, and what the next step might be. I have used GNU utils for decades now, and am happy at the opportunity to finally have a chance to give back.

I would find this program useful.

Regards, James.

___ Bug-coreutils mailing list Bug-coreutils@gnu.org http://lists.gnu.org/mailman/listinfo/bug-coreutils
new coreutil? shuffle - randomize file contents
Hello,

I recently had to write a shuffle utility for a personal project and was wondering if it would make a candidate for the coreutils suite. It seems like the kind of utility the toolbox could use (maybe under section 3. Output of entire files). It's a fairly simple program, and while there is room for optimization, works quickly enough on my hardware. It is probably best explained by the quick man page I whipped up:

---
shuffle(1) USER COMMANDS shuffle(1)

NAME
shuffle - Randomize stdin/file contents

SYNOPSIS
shuffle [--help] [FILE]...

DESCRIPTION
Randomize the contents of all files specified on the command line. If no files are specified, shuffle will read from stdin, and write to stdout.

OPTIONS
--help Display short usage information

EXAMPLES
Randomize contents of a series of files:
shuffle file1.txt file2.txt file3.txt

Create a random multimedia playlist based on directory contents:
find /path/to/media -name '*.ogg' | shuffle > playlist.pls

Select a file at random:
ls | shuffle | head -n 1

EXIT STATUS
shuffle returns a zero upon successful completion; a one upon failing to read a file and a two upon failing to write a file.

AUTHOR
Written by Davis Houlton.

version 1.0 May 21, 2005 shuffle(1)
---

Please let me know if this would be considered an appropriate addition, and what the next step might be. I have used GNU utils for decades now, and am happy at the opportunity to finally have a chance to give back.

Thanks, Davis Houlton

___ Bug-coreutils mailing list Bug-coreutils@gnu.org http://lists.gnu.org/mailman/listinfo/bug-coreutils