Re: new coreutil? shuffle - randomize file contents

2005-07-16 Thread Frederik Eaton
 Also, each external symbol (function, macro, variable) should have a comment
 explaining what it does.  Currently I'm at a bit of a loss trying to
 figure out what things do, so my comments will be limited.
 
 +#ifndef _CHECKSUM_H
 +#define _CHECKSUM_H 1
 +
 +#include <sys/types.h>
 +#include <config.h>
 +#include "sha1.h"
 +#include "md5.h"
 
 This seems overkill to me.  sort is just using md5, right?

I wanted an easy way to switch between them. Perhaps I should have
gotten rid of this macro stuff from md5sum.c as well, but at least
what I added lets someone else take that on in the future:

#define DIGEST_TYPE_STRING(Alg) ((Alg) == ALG_MD5 ? "MD5" : "SHA1")
#define DIGEST_STREAM(Alg) ((Alg) == ALG_MD5 ? md5_stream : sha_stream)

#define DIGEST_BITS(Alg) ((Alg) == ALG_MD5 ? 128 : 160)
#define DIGEST_HEX_BYTES(Alg) (DIGEST_BITS (Alg) / 4)
#define DIGEST_BIN_BYTES(Alg) (DIGEST_BITS (Alg) / 8)
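
For illustration, a hedged sketch of how a caller might use these macros; it assumes the gnulib-style stream interface, int md5_stream/sha_stream (FILE *stream, void *resblock), and is not part of the patch:

  /* Hash stdin with whichever algorithm was selected at run time;
     the buffer is sized for the larger (SHA1) digest.  */
  int alg = ALG_MD5;
  unsigned char digest[DIGEST_BIN_BYTES (ALG_SHA1)];
  DIGEST_STREAM (alg) (stdin, digest);
  printf ("%s digest: %d hex digits\n",
          DIGEST_TYPE_STRING (alg), DIGEST_HEX_BYTES (alg));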

 +  int len = strlen(str);
 
 There should be a space before the (.  There are several other instances
 of that.

 +  return (len*2) >= ops->bits;
 
 Please put spaces around the *.  The parentheses aren't needed here.

It surprises me that this seems so important to you guys, but
whatever.

 Also, we prefer <= to >=.  This is a programming style rule that I
 learned from D. Val Schorre.  It's an application of Leibnitz's
Leibniz's
 criterion for notation: textual order should reflect numeric order.

Interesting. 

 +  void *ctx = alloca(digops.ctx_size);
 
 What are the bounds on digops.ctx_size?  If it can be large (more than
 a few kilobytes) then we shouldn't rely on alloca, due to stack-overflow
 detection issues.

It's only going to be a few bytes.

 +  else if (key->random_hash)
 + {
 +   int dig_bytes = digops.bits/8;
 +   char diga[dig_bytes];
 +   char digb[dig_bytes];
 +   get_hash(texta, lena, diga);
 +   get_hash(textb, lenb, digb);
 +   diff = memcmp(diga, digb, sizeof(diga));
 + }
 
 It should be possible to combine -R with -b, -d, -f, -g, -i, -M, -n,
 and -r.  For example, "sort -nR" should compute the same hash for "1.0"
 and "01.0" that it computes for "1.00".  Currently, though, the -R is
 silently ignored in this case.  Conversely, "sort -MR" should compute
 the same hash for " Jan" that it does for "Jan"; but here, the -M is
 silently ignored.

I disagree that this is important.

 +  -R, --random                sort by random hash of keys\n\
 
 I would prefer the long option name --random-hash, since it's not a
 truly random sort.

What, in the sense that it pays attention to the keys, or that the
numbers are pseudorandom? I might call something which pays attention
to keys a random sort, and something which doesn't a random shuffle. I
don't like your suggestion because I don't think the implementation
should show up as part of the API.

If it's the pseudorandomness, I think mentioning that is redundant,
and the same thing I said about not wanting implementation in the API
applies - a good pseudorandom number generator should be externally
indistinguishable from an actual random number generator.

 There is a key problem here: one of documentation.  doc/coreutils.texi
 and NEWS need to be modified to say exactly what's going on, and why
 sorting based on a random hash is not the same as generating a random
 permutation, so that people who are doing security-relevant stuff
 won't rely on something that they shouldn't.

Yes, I mentioned that I haven't written documentation yet. I will keep
in mind that these files need to be modified.

 Also, I can't easily tell from the code how many random bits are
 currently acquired from the environment, and whether there are enough
 bits to actually satisfy the claim that we are using a random hash.

It should be 128 bits for MD5 and 160 for SHA1. Do you think I need
more?

Frederik




Re: new coreutil? shuffle - randomize file contents

2005-07-16 Thread David Feuer
On Sat, Jul 16, 2005 at 07:01:53AM -0700, Frederik Eaton wrote:
 If it's the pseudorandomness, I think mentioning that is redundant,
 and the same thing I said about not wanting implementation in the API
 applies - a good pseudorandom number generator should be externally
 indistinguishable from an actual random number generator.

Disagree.  There's a useful distinction between the two.  Generally,
someone using pseudorandom numbers expects them to happen quickly and
repeatably.  Someone using random numbers expects them to be really
random.

David




Re: new coreutil? shuffle - randomize file contents

2005-07-15 Thread Jim Meyering
Frederik Eaton [EMAIL PROTECTED] wrote:
 Attached is a second patch, which contains a ChangeLog entry and some
 formatting changes as requested by Jim.

Can you update your patch to be relative to coreutils-CVS,
  http://savannah.gnu.org/cvs/?group=coreutils
rather than to the aging 5.2.1?

Also, formatting should adhere to the GNU coding standards.
A good way to start is by using GNU indent with the default
settings.

shred also tries to obtain a random seed.
It'd be nice (eventually) to have both programs
use the same mechanism.

Please use FIXME rather than XXX to mark bits of
code that need more attention.

Regarding this:
   error (SORT_FAILURE, 0, _("%s: invalid field specification `%s'"),
_(msgid), spec);
-  abort ();
+  abort (); // XXX is this ever reached? need comment if it is

That code is never reached, because the preceding error
call invokes `exit (SORT_FAILURE)'.  The abort is to pacify
gcc -Wall.

Regarding this:
+  char *randseed = 0;
please use `NULL', instead:
  char *randseed = NULL;

I've just noticed that this would make sort
use a file in $HOME (~/.gnupg/entropy).  That
dependency on $HOME would be a first for coreutils.
I'm reluctant to pull in all of that EGD-related code
just to get a default seed in some environments.
Do any of you have a feel for how important (or how
widely available) EGD support is in practice?




Re: new coreutil? shuffle - randomize file contents

2005-07-15 Thread Paul Eggert
Thanks for working on this.  You've gotten further than anyone else
has!  Some quick comments:

Frederik Eaton [EMAIL PROTECTED] writes:

 Is there a script for making a patch with all the right files excluded
 by the way?

Not yet.  That's on the list of things to do.  The fix will be to remove
those files from the repository, and have a bootstrap script that generates
them.  Then cvs diff will do the right thing.

 shred also tries to obtain a random seed.
 It'd be nice (eventually) to have both programs
 use the same mechanism.

 I did not realize this. Yes, perhaps it would.

That sounds right to me as well.

 I do not have any idea, but it would be good to have random seed
 functionality somewhere standard-ish. It would be nice if it were in
 libc but that's an area I don't want to tackle.

It's a reasonable thing to put in coreutils/lib.  We can then put it
into gnulib.

 Also, I didn't personally think it was necessary to do anything but
 look at the time and process ID for a default seed - I only added the
 /dev/random stuff because Paul Eggert seems to think that security
 is very important here

If sort -R is used for anything security-relevant, then it will be
important, yes.

 - and once /dev/random is referenced then I figured compatibility
 with other kinds of systems would become an issue

No, the method used by shred is good enough.  We don't need to worry
about EGD any more.  It's an obsolete hack.  You might be better off
starting with the code in shred.

 I would like not to spend much more time on this, by the way,

Alas, more work needs to be done.  Perhaps we can recruit someone else
to look into it?  I will put it on my list of things to do as well,
but it's a pretty long list

+   * src/sort.c: add functions init_salt, get_hash; 'init_salt' is
+   used to provide seeding, and 'get_hash' calculates the hash value
+   by which a key is sorted

Please use the ChangeLog style recommended by the GNU coding standards
http://www.gnu.org/prep/standards/html_node/Change-Logs.html.
For example, the first line should start with
"* src/sort.c (init_salt, get_hash):".
It's important to log every externally-visible identifier whose meaning
has changed.

Also, each external symbol (function, macro, variable) should have a comment
explaining what it does.  Currently I'm at a bit of a loss trying to
figure out what things do, so my comments will be limited.

+#ifndef _CHECKSUM_H
+#define _CHECKSUM_H 1
+
+#include <sys/types.h>
+#include <config.h>
+#include "sha1.h"
+#include "md5.h"

This seems overkill to me.  sort is just using md5, right?

+  int len = strlen(str);

There should be a space before the (.  There are several other instances
of that.

+  return (len*2) >= ops->bits;

Please put spaces around the *.  The parentheses aren't needed here.

Also, we prefer <= to >=.  This is a programming style rule that I
learned from D. Val Schorre.  It's an application of Leibnitz's
criterion for notation: textual order should reflect numeric order.

+      --seed                use specified seed for random number generator\n\

--seed has an operand, so it should be mentioned here.
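
One wording that mentions the operand, as a minimal sketch following the layout of the neighboring usage strings (the metavariable SEED is illustrative, not taken from the patch):

  fputs (_("\
      --seed=SEED           use SEED for the random number generator\n\
"), stdout);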

+  void *ctx = alloca(digops.ctx_size);

What are the bounds on digops.ctx_size?  If it can be large (more than
a few kilobytes) then we shouldn't rely on alloca, due to stack-overflow
detection issues.

+  else if (key->random_hash)
+   {
+ int dig_bytes = digops.bits/8;
+ char diga[dig_bytes];
+ char digb[dig_bytes];
+ get_hash(texta, lena, diga);
+ get_hash(textb, lenb, digb);
+ diff = memcmp(diga, digb, sizeof(diga));
+   }

It should be possible to combine -R with -b, -d, -f, -g, -i, -M, -n,
and -r.  For example, "sort -nR" should compute the same hash for "1.0"
and "01.0" that it computes for "1.00".  Currently, though, the -R is
silently ignored in this case.  Conversely, "sort -MR" should compute
the same hash for " Jan" that it does for "Jan"; but here, the -M is
silently ignored.


   else if (key->month)
diff = getmonth (texta, lena) - getmonth (textb, lenb);
   /* Sorting like this may become slow, so in a simple locale the user

@@ -1986,7 +2038,7 @@ badfieldspec (char const *spec, char con
 {
   error (SORT_FAILURE, 0, _("%s: invalid field specification %s"),
 _(msgid), quote (spec));
-  abort ();
+  abort (); // inserted to avoid a compiler warning
 }

Please don't use //-style comments, as we can't assume C99 yet.
Also, please prefer declarative sentences in comments, and put
the comments on separate lines before the code they describe,
e.g.:

  /* Avoid a compiler warning.  */
  abort ();


+  char *randseed = 0;

This should be type char const *, not char *.  Note that we prefer
the const after the char.  Also, the 0 should be NULL.
 
+ randseed = strdupa(optarg);

We can't assume strdupa exists.  But you don't need to copy optarg; just
use it without copying it.

+  -R, --random                sort by random hash of keys\n\

I 

Re: new coreutil? shuffle - randomize file contents

2005-07-15 Thread Frederik Eaton
Hi,

Attached is a third patch.

Is there a script for making a patch with all the right files excluded
by the way? cvs diff produces a huge amount of unrelated output
because of files that are both in the repository and touched by
configure, and it doesn't list new files. And diff doesn't seem to
have an --include option to match its --exclude...

  Attached is a second patch, which contains a ChangeLog entry and some
  formatting changes as requested by Jim.
 
 Can you update your patch to be relative to coreutils-CVS,
   http://savannah.gnu.org/cvs/?group=coreutils
 rather than to the aging 5.2.1?

Done.

 Also, formatting should adhere to the gnu coding standards.
 A good way to start is by using GNU indent with the default
 settings.

I did this for some things. Maybe if you see other problems you could
run GNU indent before running 'cvs commit'?

 shred also tries to obtain a random seed.
 It'd be nice (eventually) to have both programs
 use the same mechanism.

I did not realize this. Yes, perhaps it would.

 Please use FIXME rather than XXX to mark bits of
 code that need more attention.
 
 Regarding this:
   error (SORT_FAILURE, 0, _("%s: invalid field specification `%s'"),
 _(msgid), spec);
 -  abort ();
 +  abort (); // XXX is this ever reached? need comment if it is
 
 That code is never reached, because the preceding error
 call invokes `exit (SORT_FAILURE)'.  The abort is to pacify
 gcc -Wall.
 
 Regarding this:
 +  char *randseed = 0;
 please use `NULL', instead:
   char *randseed = NULL;
 
 I've just noticed that this would make sort
 use a file in $HOME (~/.gnupg/entropy).  That
 dependency on $HOME would be a first for coreutils.
 I'm reluctant to pull in all of that EGD-related code
 just to get a default seed in some environments.
 Do any of you have a feel for how important (or how
 widely available) EGD support is in practice?

I do not have any idea, but it would be good to have random seed
functionality somewhere standard-ish. It would be nice if it were in
libc but that's an area I don't want to tackle.

Also, I didn't personally think it was necessary to do anything but
look at the time and process ID for a default seed - I only added the
/dev/random stuff because Paul Eggert seems to think that security
is very important here - and once /dev/random is referenced then I
figured compatibility with other kinds of systems would become an
issue so I just copied the whole bit from libgcrypt. If you want to
change that stuff, go ahead.

I would like not to spend much more time on this, by the way, it's
been about 7 months now, I would like for us to try to at least figure
out what should be in a preliminary version, make sure that the
command line interface is suitable, etc., somewhat soon. For instance,
if the random stuff stays, then I'll need some guidance on what is the
best way to update configure.ac (unless you want to do it). Also, I'm
not sure what kind of documentation will need to be updated. I was
hoping for more of this kind of feedback, so we can minimize the
number of email back-and-forths.

Thanks,

Frederik
diff --exclude CVS --exclude '*.in' --exclude configure --exclude '*.1' 
--exclude '*.info' --exclude '*~' --exclude '*.m4' --exclude 'autom4te*' 
--exclude config.hin --exclude getdate.c --exclude stamp-vti --exclude '*.texi' 
--exclude dircolors.h --exclude wheel.h --exclude tests --exclude wheel-size.h 
--exclude false.c -ruNp coreutils-cvs/ChangeLog coreutils-cvs-modified/ChangeLog
--- coreutils-cvs/ChangeLog 2005-07-14 01:03:08.0 +0100
+++ coreutils-cvs-modified/ChangeLog2005-07-15 14:38:25.0 +0100
@@ -1,3 +1,17 @@
+2005-07-14  Frederik Eaton [EMAIL PROTECTED]
+
+   Add --random, --seed options to 'sort' to get shuffle-like
+   functionality.
+
+   * src/sort.c: add functions init_salt, get_hash; 'init_salt' is
+   used to provide seeding, and 'get_hash' calculates the hash value
+   by which a key is sorted
+   * src/checksum.h: stuff to make it possible to switch between
+   different hash algorithms at runtime
+   * src/randseed.h:
+   * src/randseed.c: read a fixed-length random seed from entropy
+   devices. adapted from libgcrypt
+
 2005-07-13  Paul Eggert  [EMAIL PROTECTED]
 
* Version 5.3.1.
diff --exclude CVS --exclude '*.in' --exclude configure --exclude '*.1' 
--exclude '*.info' --exclude '*~' --exclude '*.m4' --exclude 'autom4te*' 
--exclude config.hin --exclude getdate.c --exclude stamp-vti --exclude '*.texi' 
--exclude dircolors.h --exclude wheel.h --exclude tests --exclude wheel-size.h 
--exclude false.c -ruNp coreutils-cvs/src/checksum.h 
coreutils-cvs-modified/src/checksum.h
--- coreutils-cvs/src/checksum.h2004-09-22 21:11:10.0 +0100
+++ coreutils-cvs-modified/src/checksum.h   2005-07-15 15:17:44.0 
+0100
@@ -1,3 +1,11 @@
+#ifndef _CHECKSUM_H
+#define _CHECKSUM_H 1
+
+#include <sys/types.h>
+#include <config.h>
+#include "sha1.h"
+#include 

autoreconf vs bootstrap script (was: new coreutil? shuffle - randomize file contents)

2005-07-15 Thread Bob Proulx
Paul Eggert wrote:
 Frederik Eaton writes:
  Is there a script for making a patch with all the right files excluded
  by the way?
 
 Not yet.  That's on the list of things to do.  The fix will be to remove
 those files from the repository, and have a bootstrap script that generates
 them.  Then cvs diff will do the right thing.

Doesn't 'autoreconf --install' do everything that one would want in a
bootstrap script these days?  And if not shouldn't it?

Bob ...who thinks that bootstrap scripts are obsolete now...




Re: new coreutil? shuffle - randomize file contents

2005-07-14 Thread Frederik Eaton
Attached is a second patch, which contains a ChangeLog entry and some
formatting changes as requested by Jim.

On Tue, Jun 07, 2005 at 08:47:14AM +0200, Jim Meyering wrote:
 Frederik Eaton [EMAIL PROTECTED] wrote:
  Here is a preliminary patch for basic shuffling functionality in
  'sort', with same-keys-sort-together behavior. It adds two options:
  -R to compare based on a hash of key, and --seed to specify salt
  for the hash. If --seed is not given then the default is to read
  from /dev/random or /dev/urandom.
 
  I still need to add support to configure.ac for some of the random
  device macros.
 
 Thanks.
 I haven't had time to give it serious consideration yet,
 but we should get started on the copyright-related paperwork.
 I'll send you the form separately.
 
diff -ruN --exclude '*.in' --exclude configure --exclude sort.1 --exclude '*~' 
coreutils-5.2.1/ChangeLog coreutils-5.2.1-patch-3/ChangeLog
--- coreutils-5.2.1/ChangeLog   2004-03-12 19:03:44.0 +
+++ coreutils-5.2.1-patch-3/ChangeLog   2005-07-14 15:27:38.0 +0100
@@ -1,3 +1,17 @@
+2005-07-14  Frederik Eaton [EMAIL PROTECTED]
+
+   Add --random, --seed options to 'sort' to get shuffle-like
+   functionality.
+
+   * src/sort.c: add functions init_salt, get_hash; 'init_salt' is
+   used to provide seeding, and 'get_hash' calculates the hash value
+   by which a key is sorted
+   * src/checksum.h: stuff to make it possible to switch between
+   different hash algorithms at runtime
+   * src/randseed.h:
+   * src/randseed.c: read a fixed-length random seed from entropy
+   devices. adapted from libgcrypt
+
 2004-03-12  Jim Meyering  [EMAIL PROTECTED]
 
* Version 5.2.1.
diff -ruN --exclude '*.in' --exclude configure --exclude sort.1 --exclude '*~' 
coreutils-5.2.1/src/checksum.h coreutils-5.2.1-patch-3/src/checksum.h
--- coreutils-5.2.1/src/checksum.h  2003-04-11 10:07:21.0 +0100
+++ coreutils-5.2.1-patch-3/src/checksum.h  2005-07-14 15:27:38.0 
+0100
@@ -1,8 +1,14 @@
+#ifndef _CHECKSUM_H
+#define _CHECKSUM_H 1
+
 #include <config.h>
 
 #include <sys/types.h>
 #include <stdio.h>
 
+#include "sha1.h"
+#include "md5.h"
+
 /* For long options that have no equivalent short option, use a
non-character as a pseudo short option, starting with CHAR_MAX + 1.  */
 enum
@@ -13,3 +19,39 @@
 };
 
 extern int algorithm;
+
+struct digest_ops {
+  int ctx_size;
+  int bits;
+  void (*init_ctx) (void *ctx);
+  void (*process_bytes) (const void *buffer, size_t len, void *ctx);
+  void *(*finish_ctx) (void *ctx, void *resbuf);
+  void *(*read_ctx) (void *ctx, void *resbuf);
+};
+
+static inline
+void get_digest_ops(int alg, struct digest_ops *res) {
+  switch(alg) {
+  case ALG_MD5:
+res->ctx_size = sizeof(struct md5_ctx);
+res->bits = 128;
+res->init_ctx = (void (*) (void *))md5_init_ctx;
+res->process_bytes = (void (*) (const void *, size_t, void *))md5_process_bytes;
+res->finish_ctx = (void *(*) (void *, void *))md5_finish_ctx;
+res->read_ctx = (void *(*) (void *, void *))md5_read_ctx;
+break;
+  case ALG_SHA1:
+res->ctx_size = sizeof(struct sha_ctx);
+res->bits = 160;
+res->init_ctx = (void (*) (void *))sha_init_ctx;
+res->process_bytes = (void (*) (const void *, size_t, void *))sha_process_bytes;
+res->finish_ctx = (void *(*) (void *, void *))sha_finish_ctx;
+res->read_ctx = (void *(*) (void *, void *))sha_read_ctx;
+break;
+  default:
+/* abort? */
+break;
+  }
+}
+
+#endif
diff -ruN --exclude '*.in' --exclude configure --exclude sort.1 --exclude '*~' 
coreutils-5.2.1/src/Makefile.am coreutils-5.2.1-patch-3/src/Makefile.am
--- coreutils-5.2.1/src/Makefile.am 2004-02-02 08:12:57.0 +
+++ coreutils-5.2.1-patch-3/src/Makefile.am 2005-07-14 15:45:22.0 
+0100
@@ -147,6 +147,8 @@
 md5sum_SOURCES = md5sum.c md5.c
 sha1sum_SOURCES = md5sum.c sha1sum.c
 
+sort_SOURCES = sort.c md5.c randseed.c
+
 editpl = sed -e 's,@''PERL''@,$(PERL),g'
 
 localedir = $(datadir)/locale
diff -ruN --exclude '*.in' --exclude configure --exclude sort.1 --exclude '*~' 
coreutils-5.2.1/src/randseed.c coreutils-5.2.1-patch-3/src/randseed.c
--- coreutils-5.2.1/src/randseed.c  1970-01-01 01:00:00.0 +0100
+++ coreutils-5.2.1-patch-3/src/randseed.c  2005-07-14 16:01:50.0 
+0100
@@ -0,0 +1,393 @@
+/* randseed.c
+
+Code for portably reading a random seed from entropy devices. Mostly
+adapted from libgcrypt.
+
+June 2005 Frederik Eaton
+*/
+
+#include <sys/types.h>
+#include <sys/time.h>
+#include <sys/times.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <time.h>
+#include <errno.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+
+#include "checksum.h"
+
+// XXX put these in autoconf
+#define NAME_OF_DEV_RANDOM "/dev/random"
+#define NAME_OF_DEV_URANDOM "/dev/urandom"
+#define USE_RNDLINUX 1
+#define USE_RNDEGD 1
+
+#define SORT_FAILURE 2
+#define FAIL_CODE SORT_FAILURE
+
+static int 

Re: new coreutil? shuffle - randomize file contents

2005-06-07 Thread Jim Meyering
Frederik Eaton [EMAIL PROTECTED] wrote:
 Here is a preliminary patch for basic shuffling functionality in
 'sort', with same-keys-sort-together behavior. It adds two options:
 -R to compare based on a hash of key, and --seed to specify salt
 for the hash. If --seed is not given then the default is to read
 from /dev/random or /dev/urandom.

 I still need to add support to configure.ac for some of the random
 device macros.

Thanks.
I haven't had time to give it serious consideration yet,
but we should get started on the copyright-related paperwork.
I'll send you the form separately.




Re: new coreutil? shuffle - randomize file contents

2005-06-06 Thread Frederik Eaton
Here is a preliminary patch for basic shuffling functionality in
'sort', with same-keys-sort-together behavior. It adds two options:
-R to compare based on a hash of key, and --seed to specify salt
for the hash. If --seed is not given then the default is to read
from /dev/random or /dev/urandom.

I still need to add support to configure.ac for some of the random
device macros.

Feedback?

Frederik

-- 
http://ofb.net/~frederik/
diff -x '*~' -ur -N coreutils-5.2.1/src/sort.c 
coreutils-5.2.1-sort-rand/src/sort.c
--- coreutils-5.2.1/src/sort.c  2005-06-02 22:56:29.0 -0700
+++ coreutils-5.2.1-sort-rand/src/sort.c2005-06-05 22:07:37.0 
-0700
@@ -37,6 +37,9 @@
 #include "stdio-safer.h"
 #include "xmemcoll.h"
 #include "xstrtol.h"
+#include "checksum.h"
+#include "randseed.h"
+#include "md5.h"
 
 #if HAVE_SYS_RESOURCE_H
 # include <sys/resource.h>
@@ -153,6 +156,7 @@
   bool numeric;                /* Flag for numeric comparison.  Handle
                                   strings of digits with optional decimal
                                   point, but no exponential notation. */
+  bool random_hash;            /* Shuffle by sorting on random hash of key */
   bool general_numeric;        /* Flag for general, numeric comparison.
                                   Handle numbers in exponential notation. */
   bool month;                  /* Flag for comparison by month name. */
@@ -296,6 +300,7 @@
   -i, --ignore-nonprinting    consider only printable characters\n\
   -M, --month-sort            compare (unknown) < `JAN' < ... < `DEC'\n\
   -n, --numeric-sort          compare according to string numerical value\n\
+  -R, --random                sort by random hash of keys\n\
   -r, --reverse               reverse the result of comparisons\n\
 \n\
 ), stdout);
@@ -318,6 +323,7 @@
 ), DEFAULT_TMPDIR);
   fputs (_(\
   -z, --zero-terminated end lines with 0 byte, not newline\n\
+      --seed                use specified seed for random number generator\n\
 ), stdout);
   fputs (HELP_OPTION_DESCRIPTION, stdout);
   fputs (VERSION_OPTION_DESCRIPTION, stdout);
@@ -346,7 +352,11 @@
   exit (status);
 }
 
-#define COMMON_SHORT_OPTIONS "-bcdfgik:mMno:rsS:t:T:uz"
+#define COMMON_SHORT_OPTIONS "-bcdfgik:mMnRo:rsS:t:T:uz"
+enum
+{
+  SEED_OPTION = CHAR_MAX + 1
+};
 
 static struct option const long_options[] =
 {
@@ -360,6 +370,7 @@
   {"merge", no_argument, NULL, 'm'},
   {"month-sort", no_argument, NULL, 'M'},
   {"numeric-sort", no_argument, NULL, 'n'},
+  {"random", no_argument, NULL, 'R'},
   {"output", required_argument, NULL, 'o'},
   {"reverse", no_argument, NULL, 'r'},
   {"stable", no_argument, NULL, 's'},
@@ -368,6 +379,7 @@
   {"temporary-directory", required_argument, NULL, 'T'},
   {"unique", no_argument, NULL, 'u'},
   {"zero-terminated", no_argument, NULL, 'z'},
+  {"seed", required_argument, NULL, SEED_OPTION},
   {GETOPT_HELP_OPTION_DECL},
   {GETOPT_VERSION_OPTION_DECL},
   {0, 0, 0, 0},
@@ -1332,6 +1344,26 @@
   return result;
 }
 
+struct digest_ops digops;
+void *salt_ctx;
+
+static void
+init_salt()
+{
+  get_digest_ops(ALG_MD5, &digops);
+  salt_ctx = xmalloc(digops.ctx_size);
+  digops.init_ctx(salt_ctx);
+}
+
+static void
+get_hash(char* text, int len, void *resbuf)
+{
+  void *ctx = alloca(digops.ctx_size);
+  memcpy(ctx, salt_ctx, digops.ctx_size);
+  digops.process_bytes(text, len, ctx);
+  digops.finish_ctx(ctx,resbuf);
+}
+
 /* Compare two lines A and B trying every key in sequence until there
are no more keys or a difference is found. */
 
@@ -1374,6 +1406,15 @@
  (texta, textb));
  *lima = savea, *limb = saveb;
}
+  else if (key->random_hash)
+{
+  int dig_bytes = digops.bits/8;
+  char diga[dig_bytes];
+  char digb[dig_bytes];
+  get_hash(texta, lena, diga);
+  get_hash(textb, lenb, digb);
+  diff = memcmp(diga, digb, sizeof(diga));
+}
   else if (key->month)
diff = getmonth (texta, lena) - getmonth (textb, lenb);
   /* Sorting like this may become slow, so in a simple locale the user
@@ -2083,7 +2124,7 @@
 {
   error (SORT_FAILURE, 0, _("%s: invalid field specification `%s'"),
 _(msgid), spec);
-  abort ();
+  abort (); // XXX is this ever reached? need comment if it is
 }
 
 /* Parse the leading integer in STRING and store the resulting value
@@ -2189,6 +2230,9 @@
case 'n':
 key->numeric = true;
  break;
+   case 'R':
+ key->random_hash = true;
+ break;
case 'r':
 key->reverse = true;
  break;
@@ -2217,6 +2261,8 @@
   int c = 0;
   bool checkonly = false;
   bool mergeonly = false;
+  char *randseed = 0;
+  bool need_rand = false;
   int nfiles = 0;
   bool posixly_correct = (getenv ("POSIXLY_CORRECT") != NULL);
   bool obsolete_usage = (posix2_version () < 200112);
@@ -2300,7 +2346,7 @@
   gkey.sword = gkey.eword = SIZE_MAX;
   gkey.ignore = NULL;
   gkey.translate = NULL;
-  

Re: new coreutil? shuffle - randomize file contents

2005-06-05 Thread Frederik Eaton
So, the prototype runs a little slower than I had expected - it's
currently using md5 hashes, I could also look into CRC or something
faster (but less secure, for those concerned). Anyway here is a
sample:

$ time ./sort -R < /usr/share/dict/words > /dev/null
./sort -R < /usr/share/dict/words > /dev/null  1.67s user 0.01s system 99% cpu 1.684 total
$ time ./sort < /usr/share/dict/words > /dev/null
./sort < /usr/share/dict/words > /dev/null  0.45s user 0.01s system 99% cpu 0.462 total
$ time rl < /usr/share/dict/words > /dev/null
rl < /usr/share/dict/words > /dev/null  0.07s user 0.01s system 99% cpu 0.074 total
$ wc -l /usr/share/dict/words
96274 /usr/share/dict/words

'rl' is the executable from Arthur's randomize-lines. Of course, this
'sort'-based shuffle has the advantage of being independent of input
order

$ print -l g f e d c b a | ./sort -R | md5sum
dda0a6660319917afd6ed021f27fb452  -
$ print -l a b c d e f g | ./sort -R | md5sum
dda0a6660319917afd6ed021f27fb452  -

and being field-aware

$ print -l "a 1" "a 2" "b 2" "b 03" "c 1" "c 2" | ./sort -k 1,1R -k 2,2n
c 1
c 2
a 1
a 2
b 2
b 03

- whatever these are worth. I just wanted to make sure the slowness is
acceptable before proceeding.

Frederik

On Thu, Jun 02, 2005 at 02:16:10PM -0700, Frederik Eaton wrote:
 James I think the consensus is that the functionality belongs in sort.
 James Beyond that things are a bit less clear.  However, Paul put forward a
 James proposed usage which adapts the current -k option (see
 James http://lists.gnu.org/archive/html/bug-coreutils/2005-05/msg00179.html).
 James Nobody made any comments suggesting that that wasn't a good way of
 James doing things.
 
 I liked my proposal for co-opting -s:
 
 http://lists.gnu.org/archive/html/bug-coreutils/2005-05/msg00220.html
 
 In other words, rather than having a random column, you'd have a
 row number column, which would turn into a random column under the
 hash operation. The documentation for '-s' would be changed to suggest
 that it was enabling a comparison based on input row number.
 
 James Last time this was discussed on this list there was some talk about
 James selecting permutations with a seed.  This makes me uncomfortable since
 James though doubtless it would be convenient.  The problem is that the
 James number of permutations of the records of most files is larger than
 James 2^64, and so representing the seed and making use of it would be very
 James tricky.  I then had a look at Fascicle 2 of volume 4 of Knuth, but it
 James concentrates on generating all permutations, and so wasn't directly
 James much help.
 
 This seems to be a common point of confusion which I think needs to be
 cleared up. The mapping from random seed to sample space is rarely
 surjective - in fact, whenever you sample more than a few numbers from
 a random number generator, it isn't. It only matters that the number
 of possible seeds is much larger than the number of samplings we would
 ever choose to perform, and if not 2^64 then definitely 2^128 will be
 sufficient for all eternity.
 
 (Maybe some theoretical common ground is needed. A random variable is a
 function X from a measurable space with probability measure P to
 another measurable space such that X^{-1}(B) is measurable for every
 measurable B in the range of X. We can define a probability measure on
 the range of X by Q(B) = P(X^{-1}(B)), called the distribution of X.
 
 Usually the domain of X is taken to be unspecified and uncountable,
 but since in practice we only ever sample a finite subset of it - and
 since we cannot calculate the inverse of X at arbitrary points - we
 can replace the domain with a suitably large finite set with discrete
 probability measure and still get good sample coverage, independently
 of the size of the range of X. This new domain is the seed domain.)
 
 Phil Paul's statement:
 Phil  It's not as easy to implement as it sounds, I'm afraid; even if you
 Phil  can assume /dev/random or /dev/urandom.
 Phil 
 Phil is a slight concern, but I imagine the problems are applicable to *any*
 Phil shuffling utility. Is it that the app must guarantee all lines of a
 Phil non-seekable stdin must have an equal chance of any sort order?
 
 See my comment to James above. I think one need not make this
 guarantee, since only a tiny fraction of possible sort orders will be
 able to be tried by the user. However, I think it would be true for my
 proposal (and the one that James suggested to Davis) if one took the
 random number generator or hash function to be variable or unspecified
 during analysis of the algorithm. The fact that we ship with a
 particular RNG or hash function (of all uncountable possibilities,
 some unimplementable) hard-coded is not a problem since its resources
 are never exhausted.
 
 Jim It looks like there are some desirable features that can be
 Jim provided only by a shuffle-enabled program that is key-aware.
 Jim Key specification and the comparison code are already part of sort.
 Jim 

Re: new coreutil? shuffle - randomize file contents

2005-06-05 Thread Frederik Eaton
 $ print -l g f e d c b a | ./sort -R | md5sum
 dda0a6660319917afd6ed021f27fb452  -
 $ print -l a b c d e f g | ./sort -R | md5sum
 dda0a6660319917afd6ed021f27fb452  -

By the way, this wouldn't actually be the default behavior, you'd have
to specify an explicit seed and have it be the same each time to be
reproducible...

Frederik




Re: new coreutil? shuffle - randomize file contents

2005-06-04 Thread Davis Houlton
On Saturday 04 June 2005 01:11, Philip Rowlands wrote:
 Extend sort.

In extending sort, would the O(n) shuffle algorithm be implemented? Or would 
the existing O(n log n) mergesort logic be used via keys?

Though I am not a sort maintainer, and am probably the least qualified to pass 
assumption or judgement, I would think that it might be harder to maintain 
two sets of processing paths in sort.  Or, if we only use O(n log n) in sort, 
then that means there will always be a more efficient O(n) implementation 
elsewhere.  Is that what we really want? It may be--I'll be the first to 
admit that for average cases less than several hundred MB in size, the
differences on modern 2GHz machines are almost unmeasurable.

As far as some administrative trivia, I have applied for a shuffle project on 
Savannah.  I do not have a URL for source (an application requirement), so I'm
not sure who the POC would be or how long it will take.

shuffle itself has passed most unit tests.  Things to do on the horizon--
  a) implement formal test procedures 
  b) profile for memory leaks
  c) take an optimization pass
and then it should be ready for primetime outside of coreutils.

Of course, the final step
  d) Implement finite memory space architecture
due to the work involved, is dependent on potential for coreutil admission.

Thanks,
  Davis






Re: new coreutil? shuffle - randomize file contents

2005-06-04 Thread Alfred M. Szmidt
If you intend on making shuffle part of coreutils someday, then you
could use the GNU womb repository on Savannah.  You'd need to get
proper papers from [EMAIL PROTECTED] though, and if you add code that was
written by someone else we'd need papers from them too.  But this
would make putting shuffle into coreutils easier when the day comes.




Re: new coreutil? shuffle - randomize file contents

2005-06-04 Thread Paul Eggert
Frederik Eaton [EMAIL PROTECTED] writes:

 How about this: Put an upper limit on the number of samples that your
 adversary will be able to try before the earth blows up.

But that's not how adversarial attacks work.  They work by exploiting
flaws in your pseudorandom number generator.

Thus, for example, it's possible that by looking at the first half of
the pseudorandom output, the adversary would be able to deduce
information about the seed that you used, and thus they'll have extra
information about what the second half of the output will look like;
this is information that they wouldn't be able to guess if the output
were truly random.  The application to high-stakes poker games should
be obvious.

This may help to explain the results I already referred to in
http://www.maa.org/mathland/mathtrek_10_16_00.html.  From first
principles, to randomize a deck of 52 cards, you need about lg (52!)
random bits, which is about 220 random bits.  If each shuffle divides
the deck in half and then picks halves at random to interleave, then
it introduces about 50 bits of information, and you should need about
5 shuffles to randomize the deck.  Actual shuffles aren't that good,
though, and there are a few other factors, so you'll need 6 or 7
shuffles to randomize a real deck, depending on whether you listen to
the guys from Stanford or the guys from Oxford.  If you could get by
with a lot fewer shuffles, I'd expect those bright guys at Stanford
and Oxford would have figured it out by now.
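
(Put as information counting, using the figures above: each riffle contributes at most about 50 bits, and 220 / 50 = 4.4, so even perfectly random riffles need at least 5 passes before all lg (52!) bits can be present.)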

Similarly, if you have a deck of 10 million cards, you'll need about
lg (1e7!) random bits, which is about 220 million random bits.  Notice
that the problem does not scale linearly: here you need 22 times as
many random bits as records, even though to shuffle a 52-card deck you
need only about 5 times as many random bits as records.  If you cut
this huge deck in half and shuffle it, you'll need more than 22
shuffles to randomize it.

(I agree that all this is overkill for non-adversarial applications.)




Re: new coreutil? shuffle - randomize file contents

2005-06-04 Thread David Feuer
On Sat, Jun 04, 2005 at 04:29:50PM -0700, Paul Eggert wrote:
 were truly random.  The application to high-stakes poker games should
 be obvious.
<snip>
 (I agree that all this is overkill for non-adversarial applications.)

Aside from shuffling cards (which should rarely if ever involve more
than 520 cards), can anyone think of an application of shuffling in
which attack is likely?

David




Re: new coreutil? shuffle - randomize file contents

2005-06-04 Thread Frederik Eaton
On Sat, Jun 04, 2005 at 04:29:50PM -0700, Paul Eggert wrote:
 Frederik Eaton [EMAIL PROTECTED] writes:
 
  How about this: Put an upper limit on the number of samples that your
  adversary will be able to try before the earth blows up.
 
 But that's not how adversarial attacks work.  They work by exploiting
 flaws in your pseudorandom number generator.

Do you have a definition for adversarial attack?

 Thus, for example, it's possible that by looking at the first half of
 the pseudorandom output, the adversary would be able to deduce
 information about the seed that you used, and thus they'll have extra
 information about what the second half of the output will look like;
 this is information that they wouldn't be able to guess if the output
 were truly random.  The application to high-stakes poker games should
 be obvious.
 
 This may help to explain the results I already referred to in
 http://www.maa.org/mathland/mathtrek_10_16_00.html.  From first
 principles, to randomize a deck of 52 cards, you need about lg (52!)
 random bits, which is about 220 random bits.  If each shuffle divides
 the deck in half and then picks halves at random to interleave, then
 it introduces about 50 bits of information, and you should need about
 5 shuffles to randomize the deck.  Actual shuffles aren't that good,
 though, and there are a few other factors, so you'll need 6 or 7
 shuffles to randomize a real deck, depending on whether you listen to
 the guys from Stanford or the guys from Oxford.  If you could get by
 with a lot fewer shuffles, I'd expect those bright guys at Stanford
 and Oxford would have figured it out by now.
 
 Similarly, if you have a deck of 10 million cards, you'll need about
 lg (1e7!) random bits, which is about 220 million random bits.  Notice
 that the problem does not scale linearly: here you need 22 times as
 many random bits as records, even though to shuffle a 52-card deck you
 need only about 5 times as many random bits as records.  If you cut
 this huge deck in half and shuffle it, you'll need more than 22
 shuffles to randomize it.
 
 (I agree that all this is overkill for non-adversarial applications.)

Yes, this is all understood. I think existing standard cryptographic
hashes can be trusted for this use, though. Predicting all of the
output from some of it requires at least a preimage attack (or more,
since you don't get the values of the hash directly, only their sort
order). Recent results showing weakness in MD5 or SHA-1 (apparently
unpublished) deal only with collision attacks. I doubt that good a
preimage attack will ever be available for these. Personally, I don't
think that this is a productive area of discussion, especially coming
as it is before even minimal functionality has been added.

I'm planning to submit something along the lines of my original
proposal. If you think this is a really important point then you can
write your own version.

Frederik




Re: new coreutil? shuffle - randomize file contents

2005-06-03 Thread Paul Eggert
Philip Rowlands [EMAIL PROTECTED] writes:

 I'm still interested to read what Paul considers to be the
 difficulties of such an implementation?

Suppose you're randomizing an input file of 10 million lines.  And
suppose you want to approximate a truly random key by using a
128-bit random key for each input line.

Then you'll need about 1.3 billion random bits.  On my desktop,
/dev/random generates only about 200 random bits per second, and it'll
take me about 2.5 months to randomize the file.

If you're clever you can tune things so you need only about 0.22
billion random bits: this is because the log base 2 of (10 million
factorial) is about 220 million (I used Stirling's approximation).
But even with that optimization it'll still take me a couple of weeks
to randomize the file.
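
As a sanity check on such figures, lg (n!) can be computed exactly with the log-gamma function; a small stand-alone C sketch (mine, not from any patch; link with -lm):

#include <math.h>
#include <stdio.h>

int
main (void)
{
  /* lg (n!) = ln (Gamma (n + 1)) / ln 2.  */
  printf ("lg (52!)  ~= %.0f bits\n", lgamma (53.0) / log (2.0));
  printf ("lg (1e7!) ~= %.0f bits\n", lgamma (1e7 + 1) / log (2.0));
  return 0;
}

This prints roughly 226 and 218 million, consistent with the ballpark figures above.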

One way to be clever is to use the Knuth shuffle (a.k.a. the
Fisher-Yates shuffle).  However, this doesn't easily generalize to an
algorithm that will work efficiently if the input is too large to fit
into main memory.  There are other ways to be clever even for large
files, but I haven't had time to think them through.

And don't forget: even if we're clever, it still takes me a couple of
weeks to randomize a 10-million line file.

For more on this subject, please see:

http://en.wikipedia.org/wiki/Knuth_shuffle




Re: new coreutil? shuffle - randomize file contents

2005-06-03 Thread Frederik Eaton
On Thu, Jun 02, 2005 at 11:31:26PM -0700, Paul Eggert wrote:
 Philip Rowlands [EMAIL PROTECTED] writes:
 
  I'm still interested to read what Paul considers to be the
  difficulties of such an implementation?
 
 Suppose you're randomizing an input file of 10 million lines.  And
 suppose you want to approximate a truly random key by using a
 128-bit random key for each input line.
 
 Then you'll need about 1.3 billion random bits. 

No! Please see my penultimate email. This is what random seeds are
for.

Frederik




Re: new coreutil? shuffle - randomize file contents

2005-06-02 Thread James Youngman
On Wed, Jun 01, 2005 at 06:52:08PM -0700, Frederik Eaton wrote:
 So, what is the current state of things? Who is in charge of accepting
 patches? 

The coreutils maintainers, who are all subscribed to this list I
think.  So, you're asking in the right place.

 Are we decided that a 'shuffle' command but no 'sort -R' facility
 would be best

I think the consensus is that the functionality belongs in sort.
Beyond that things are a bit less clear.  However, Paul put forward a
proposed usage which adapts the current -k option (see
http://lists.gnu.org/archive/html/bug-coreutils/2005-05/msg00179.html).
Nobody made any comments suggesting that that wasn't a good way of
doing things.

 or that it would be good to have both, or is it still in question
 whether either would be accepted?

Both is probably uncalled for though I personally would put something
like this in /usr/local/bin/shuffle:

#! /bin/sh
exec sort -k R -- "$@"

Having said this, in the past people have expressed the sentiment that
perhaps a separate command would be a good idea (for example
http://lists.gnu.org/archive/html/bug-coreutils/2004-09/msg7.html).

Last time this was discussed on this list there was some talk about
selecting permutations with a seed.  This makes me uncomfortable, even
though doubtless it would be convenient.  The problem is that the
number of permutations of the records of most files is larger than
2^64, and so representing the seed and making use of it would be very
tricky.  I then had a look at Fascicle 2 of volume 4 of Knuth, but it
concentrates on generating all permutations, and so wasn't directly
much help.

Regards,
James.




Re: new coreutil? shuffle - randomize file contents

2005-06-02 Thread Philip Rowlands
On Thu, 2 Jun 2005, James Youngman wrote:

I think the consensus is that the functionality belongs in sort.
Beyond that things are a bit less clear.  However, Paul put forward a
proposed usage which adapts the current -k option (see
http://lists.gnu.org/archive/html/bug-coreutils/2005-05/msg00179.html).
Nobody made any comments suggesting that that wasn't a good way of
doing things.

 or that it would be good to have both, or is it still in question
 whether either would be accepted?

Both is probably uncalled for though I personally would put something
like this in /usr/local/bin/shuffle:

#! /bin/sh
exec sort -k R -- "$@"

FWIW, I favour this solution for both sort and shuffle. Determining
behaviour via argv[0] (a la grep/egrep/fgrep) would also work.

Paul's statement:
 It's not as easy to implement as it sounds, I'm afraid; even if you
 can assume /dev/random or /dev/urandom.

is a slight concern, but I imagine the problems are applicable to *any*
shuffling utility. Is it that the app must guarantee all lines of a
non-seekable stdin must have an equal chance of any sort order?


Cheers,
Phil




Re: new coreutil? shuffle - randomize file contents

2005-06-02 Thread Jim Meyering
Frederik Eaton [EMAIL PROTECTED] wrote:
 So, what is the current state of things? Who is in charge of accepting
 patches? Are we decided that a 'shuffle' command but no 'sort -R'
 facility would be best, or that it would be good to have both, or is
 it still in question whether either would be accepted?

I am the official `maintainer', but Paul Eggert has been making
most of the changes recently.

It looks like there are some desirable features that can be
provided only by a shuffle-enabled program that is key-aware.
Key specification and the comparison code are already part of sort.
Obviously, duplicating all of that in a separate program is not
an option.  I don't relish the idea of factoring out sort's line-
and key-handling code either, but it might be feasible.

However, I do like the idea of a new program that simply outputs
a random permutation of its input records, and that does it well,
and repeatably.  The Unix tool philosophy certainly does encourage
the `perform one task and do it well' approach.  Since doing it
well includes handling input larger than available virtual memory,
this is not trivial -- and it is well suited to the coreutils,
i.e., it's not easily implementable as a script.

Initially, I was inclined to say that adding both the new program
(no key support) and related functionality to sort was desirable.
Thinking of the limits of robustness of such a new program, I
realized that if the input is sufficiently large and not seekable
(e.g., from a pipe), then the program will have to resort to writing
temporary files, much as sort already does.  More duplicated effort,
determining how much memory to use (like sort's --buffer-size=SIZE
option), managing the temporary files, ensuring that they're removed
upon interrupt, etc.  But maybe not prohibitive.  The new program
would also have to have an option like sort's -z, --zero-terminated
option, and --temporary-directory=DIR, and --output=FILE.  In effect,
it would need all of sort's options that don't relate to sorting.

So implementing a robust shuffle program, even one without key
handling capabilities, would require much of the infrastructure
already present in sort.c.

It sure sounds like shuffle and sort should share a lot of code,
one way or another, so why not have them share the line- and key-
handling code, too?  I won't rule out adding a new program, like
shuffle, but I confess I'm less inclined now than when I started
typing this message.




Re: new coreutil? shuffle - randomize file contents

2005-06-02 Thread Frederik Eaton
James I think the consensus is that the functionality belongs in sort.
James Beyond that things are a bit less clear.  However, Paul put forward a
James proposed usage which adapts the current -k option (see
James http://lists.gnu.org/archive/html/bug-coreutils/2005-05/msg00179.html).
James Nobody made any comments suggesting that that wasn't a good way of
James doing things.

I liked my proposal for co-opting -s:

http://lists.gnu.org/archive/html/bug-coreutils/2005-05/msg00220.html

In other words, rather than having a random column, you'd have a
row number column, which would turn into a random column under the
hash operation. The documentation for '-s' would be changed to suggest
that it was enabling a comparison based on input row number.

James Last time this was discussed on this list there was some talk about
James selecting permutations with a seed.  This makes me uncomfortable, even
James though doubtless it would be convenient.  The problem is that the
James number of permutations of the records of most files is larger than
James 2^64, and so representing the seed and making use of it would be very
James tricky.  I then had a look at Fascicle 2 of volume 4 of Knuth, but it
James concentrates on generating all permutations, and so wasn't directly
James much help.

This seems to be a common point of confusion which I think needs to be
cleared up. The mapping from random seed to sample space is rarely
surjective - in fact, whenever you sample more than a few numbers from
a random number generator, it isn't. It only matters that the number
of possible seeds is much larger than the number of samplings we would
ever choose to perform, and if not 2^64 then definitely 2^128 will be
sufficient for all eternity.
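
(A worked instance of that bound: a billion users each drawing a billion samplings use at most 10^9 * 10^9 = 10^18 < 2^60 seeds, i.e. under a 2^-68 fraction of a 128-bit seed space.)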

(Maybe some theoretical common ground is needed. A random variable is a
function X from a measurable space with probability measure P to
another measurable space such that X^{-1}(B) is measurable for every
measurable B in the range of X. We can define a probability measure on
the range of X by Q(B) = P(X^{-1}(B)), called the distribution of X.

Usually the domain of X is taken to be unspecified and uncountable,
but since in practice we only ever sample a finite subset of it - and
since we cannot calculate the inverse of X at arbitrary points - we
can replace the domain with a suitably large finite set with discrete
probability measure and still get good sample coverage, independently
of the size of the range of X. This new domain is the seed domain.)

Phil Paul's statement:
Phil  It's not as easy to implement as it sounds, I'm afraid; even if you
Phil  can assume /dev/random or /dev/urandom.
Phil 
Phil is a slight concern, but I imagine the problems are applicable to *any*
Phil shuffling utility. Is it that the app must guarantee all lines of a
Phil non-seekable stdin must have an equal chance of any sort order?

See my comment to James above. I think one need not make this
guarantee, since only a tiny fraction of possible sort orders will be
able to be tried by the user. However, I think it would be true for my
proposal (and the one that James suggested to Davis) if one took the
random number generator or hash function to be variable or unspecified
during analysis of the algorithm. The fact that we ship with a
particular RNG or hash function (of all uncountable possibilities,
some unimplementable) hard-coded is not a problem since its resources
are never exhausted.

Jim It looks like there are some desirable features that can be
Jim provided only by a shuffle-enabled program that is key-aware.
Jim Key specification and the comparison code are already part of sort.
Jim Obviously, duplicating all of that in a separate program is not
Jim an option.  I don't relish the idea of factoring out sort's line-
Jim and key-handling code either, but it might be feasible.
Jim 
Jim However, I do like the idea of a new program that simply outputs
Jim a random permutation of its input records, and that does it well,
Jim and repeatably.  The Unix tool philosophy certainly does encourage
Jim the `perform one task and do it well' approach.  Since doing it
Jim well includes handling input larger than available virtual memory,
Jim this is not trivial -- and it is well suited to the coreutils,
Jim i.e., it's not easily implementable as a script.
Jim 
Jim Initially, I was inclined to say that adding both the new program
Jim (no key support) and related functionality to sort was desirable.
Jim Thinking of the limits of robustness of such a new program, I
Jim realized that if the input is sufficiently large and not seekable
Jim (e.g., from a pipe), then the program will have to resort to writing
Jim temporary files, much as sort already does.  More duplicated effort,
Jim determining how much memory to use (like sort's --buffer-size=SIZE
Jim option), managing the temporary files, ensuring that they're removed
Jim upon interrupt, etc.  But maybe not prohibitive.  The new program
Jim would also have to have an option like 

Re: new coreutil? shuffle - randomize file contents

2005-06-02 Thread Philip Rowlands
On Thu, 2 Jun 2005, Frederik Eaton wrote:

Phil Is it that the app must guarantee all lines of a
Phil non-seekable stdin must have an equal chance of any sort order?

See my comment to James above. I think one need not make this
guarantee, since only a tiny fraction of possible sort orders will be
able to be tried by the user. However, I think it would be true for my
proposal (and the one that James suggested to Davis) if one took the
random number generator or hash function to be variable or unspecified
during analysis of the algorithm.

Thinking further on this, I don't think it matters to the guts of sort
whether the ultimate random key is based on position hash or PRNG.

One possibility for an efficient random permutation of a large file is
a program which scans the file once and records the location of each
line in an index, then applies a uniformly distributed random
permutation to this line index by stepping through it in order and
swapping the current entry with an entry chosen at random from the set
of later entries, and finally steps through the index again and
dereferences each line entry and prints out the corresponding line in
the original file.
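
For what it's worth, a rough C sketch of that pass (building the offset index and all error handling omitted; rand () is only a placeholder for a real generator, and its modulo use is slightly biased):

#include <stdio.h>
#include <stdlib.h>

/* OFF[0..N-1] holds the start offset of each line of the seekable
   stream IN.  Permute the index, then print lines in that order.  */
static void
output_shuffled (FILE *in, long *off, size_t n)
{
  size_t i;
  /* Fisher-Yates: swap entry i with one chosen uniformly from
     entries i..n-1 (choosing only from strictly later entries
     would bias the result toward cyclic permutations).  */
  for (i = 0; i + 1 < n; i++)
    {
      size_t j = i + rand () % (n - i);
      long tmp = off[i]; off[i] = off[j]; off[j] = tmp;
    }
  for (i = 0; i < n; i++)
    {
      int c;
      fseek (in, off[i], SEEK_SET);
      while ((c = getc (in)) != EOF && c != '\n')
        putchar (c);
      putchar ('\n');
    }
}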

That approach is fine on seekable files, but the user may wish to
shuffle from stdin. sort already knows how to do this.

Here's a concrete example of Paul's suggestion as I understand it:
---INPUT---
a 3
b 2
c 1
d 1
---INPUT---

sort -R (or -k R) creates an imaginary field and sorts only on that:
---IMAGINARY INPUT---
a 3 0x293a
b 2 0xc9f4
c 1 0xa932
d 1 0x5ef5
---IMAGINARY INPUT---

---SORTED IMAGINARY OUTPUT---
a 3 0x293a
d 1 0x5ef5
c 1 0xa932
b 2 0xc9f4
---SORTED IMAGINARY OUTPUT---

---ACTUAL OUTPUT---
a 3
d 1
c 1
b 2
---ACTUAL OUTPUT---


Of course the random key should be long enough to ensure a negligible
chance of repetition (128 bits?)

The user is free to make use of the random key and other keys to create
random subsets of an overall-sorted list.

I'm still interested to read what Paul considers to be the
difficulties of such an implementation?


Cheers,
Phil




Re: new coreutil? shuffle - randomize file contents

2005-06-02 Thread David Feuer
There seems to be some sloppy thinking regarding efficiency and uniform
randomness.  Regarding uniform randomness, the infamous Oleg of
comp.lang.{scheme,functional} writes:

Furthermore, if we have a sequence of N elements and associate with
each element a key -- a random number uniformly distributed within [0,
M-1] (where N! > M >= N), we have the configurational space of size M^N
(i.e., M^N ways to key the sequence). There are N! possible
permutations of the sequence. Alas, for N > 2 and M < N!, M^N is not
evenly divisible by N!. Therefore, certain permutations are bound to
be a bit more likely than the others.

(see http://okmij.org/ftp/Haskell/perfect-shuffle.txt )
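
A worked instance in the same notation: take N = 3 and M = 4, so there are M^N = 64 keyings but N! = 6 orderings.  Since 64 = 10 * 6 + 4, the keyings cannot split evenly: by pigeonhole some ordering arises from at least 11 keyings while another arises from at most 10.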

I think this is a potential problem with some of the suggestions for
shuffling methods.  I don't know how to combine sorting with shuffling
properly, so I won't comment on that now.

In the following, please note that I am not an experienced system
programmer, so some of my thoughts could be totally wrong.


Regarding efficiency, there are several cases:
1.  The file is small
In this case, the program will be fast unless someone does something
super-stupid.  The program speed here will probably be limited by the
overhead of the system calls used to read and write files, and (if
/dev/urandom is used) the time to generate random numbers.  Tweaking
performance in this case is probably not worth the effort, and certainly
should not be done without careful profiling.

2.  The file is large

In this case, some care is in order, but I don't think a fancy algorithm
will do any better than a simple one.  The process speed will be limited
by paging.  Optimization should focus on preventing unnecessary page
faults.  In any case, the file must be read through from start to finish
to identify line breaks, and then the lines must be read out.  It's
important to avoid adding a third pass by carelessness.  Since there is
no way to know how many lines there will be in the file before it's
entirely processed, we will have to take a guess based on a reasonable
line length and then correct it if necessary, probably using realloc.
It would also be possible to use a temporary file for the line indices
and then mmap it for use, but I suspect this will rarely be a win.
Empty lines should be easy to detect, and since they are all the same,
there is no sense in storing an index for each of them.  Instead, just
count how many there are in the file.

On systems with mmap and madvise, the file should be mapped, the OS
advised of sequential use, the file scanned for line breaks, the line
break array shuffled, the OS advised of random use, and then the lines
read to output through a buffer.  The best thing about this method is
that it should work well for both large and small files.  If a file
(probably a device) does not support seeking (and therefore presumably
cannot be mapped either), then the entire file has to be read into
memory, or into a temporary file.  There are two tricky parts here:
there is generally no way to know how many bytes will be coming, so we
will have to allocate buffer space dynamically.  Probably the easiest
way to do this, but not necessarily the best, is to copy the input to a
temporary file, and then mmap the temporary file for readout.  The good
thing about this is that we don't have to grow the buffer ourselves.
The other easy possibility is to use realloc, which may or may not be
hideously inefficient.  The last possibility I can think of is to use an
array of arrays, but that seems on the messy side, especially when using
CR/LF newlines, and also because it makes it harder to deal efficiently
with the case where read fills the buffer only part way.
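
As a rough illustration of the mmap/madvise plan just described (an
editor's sketch, untested; error checks and the shuffle of the index
are elided, and the file is assumed to be non-empty):

/* Sketch: map the file, hint sequential access while indexing line
   starts, then hint random access for the shuffled readout. */
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main (int argc, char **argv)
{
  int fd = open (argv[1], O_RDONLY);
  struct stat st;
  char *map;
  off_t i, *start;
  size_t n = 0, cap = 1024, k;

  fstat (fd, &st);
  map = mmap (NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  start = malloc (cap * sizeof *start);

  madvise (map, st.st_size, MADV_SEQUENTIAL);  /* pass 1: linear scan */
  for (i = 0; i < st.st_size; i++)
    if (i == 0 || map[i - 1] == '\n')
      {
        if (n == cap)
          start = realloc (start, (cap *= 2) * sizeof *start);
        start[n++] = i;
      }

  /* ... Fisher-Yates shuffle of start[0..n-1] goes here ... */

  madvise (map, st.st_size, MADV_RANDOM);      /* pass 2: random reads */
  for (k = 0; k < n; k++)
    {
      for (i = start[k]; i < st.st_size && map[i] != '\n'; i++)
        putchar (map[i]);
      putchar ('\n');
    }
  munmap (map, st.st_size);
  free (start);
  close (fd);
  return 0;
}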

Everything gets messier of course on a system that does not support
mmap.  In particular, with large files it will be necessary to skip
around the file with lseek and read.  Not fun at all.

The last major problem is to choose the shuffling algorithm itself.  For
any reasonably sized file a really simple shuffle will be best.  Just
fill an array with an increasing sequence 0,1,2, etc., shuffle it with a
Knuth shuffle, and output the lines in that order, through a buffer.

If the file is ridiculously huge, it would seem to be good to find an
algorithm that would allow small parts to be shuffled separately.  I
have not yet seen such an algorithm that actually worked.  I will think
about it some more.

David Feuer


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: new coreutil? shuffle - randomize file contents

2005-06-02 Thread Frederik Eaton
 Phil Is it that the app must guarantee that all lines of a
 Phil non-seekable stdin have an equal chance of any sort order?
 
 See my comment to James above. I think one need not make this
 guarantee, since the user can try only a tiny fraction of the
 possible sort orders. However, I think it would be true for my
 proposal (and the one that James suggested to Davis) if one took the
 random number generator or hash function to be variable or unspecified
 during analysis of the algorithm.
 
 Thinking further on this, I don't think it matters to the guts of sort
 whether the ultimate random key is based on position hash or PRNG.

No, it doesn't ... but I don't understand why you bring it up in this
context.

 One possibility for an efficient random permutation of a large file is
 a program which scans the file once and records the location of each
 line in an index, then applies a uniformly distributed random
 permutation to this line index by stepping through it in order and
 swapping the current entry with an entry chosen at random from among
 the current and later entries, and finally steps through the index
 again, dereferences each line entry, and prints out the corresponding
 line in the original file.
 
 That approach is fine on seekable files, but the user may wish to
 shuffle from stdin. sort already knows how to do this.
 
 Here's a concrete example of Paul's suggestion as I understand it:

I understand Paul's suggestion. I was throwing out the other algorithm
since Davis was looking for something more efficient.

Regards,

Frederik


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: new coreutil? shuffle - randomize file contents

2005-06-01 Thread Frederik Eaton
So, what is the current state of things? Who is in charge of accepting
patches? Are we decided that a 'shuffle' command but no 'sort -R'
facility would be best, or that it would be good to have both, or is
it still in question whether either would be accepted?

Frederik

-- 
http://ofb.net/~frederik/


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: new coreutil? shuffle - randomize file contents

2005-05-31 Thread Davis Houlton
On Monday 30 May 2005 23:02, Frederik Eaton wrote:
 I hope that you aren't proposing an algorithm which is similar to
 card-shuffling. That would be exactly like merge-sorting on a key hash
 - i.e. no more efficient.

Agreed! The algorithm implemented is a slight variation on Knuth's shuffle 
algorithm--instead of randomizing a list into another list, we randomize 
directly to output--and is linear O(n).  

Thanks,
   Davis


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: new coreutil? shuffle - randomize file contents

2005-05-30 Thread Frederik Eaton
On Wed, May 25, 2005 at 10:58:41AM +0100, James Youngman wrote:
 On Tue, May 24, 2005 at 09:55:35AM -0700, Paul Eggert wrote:
 
  That way, you could use, e.g.:
  
sort -k 2,2 -k R
  
  which would mean sort by the 2nd field, but if there are ties then
  sort the ties randomly.  sort -R would be short for sort -k R.
 
 Perhaps this approach avoids the problems that were discussed earlier
 regarding expectations about lines with identical keys shuffling
 together.

I hope it is agreed that the conclusion that was reached earlier was
that both behaviors - identical keys shuffling *together* vs. *apart*
- would be useful in different situations. We came up with a number of
situations in which one behavior or the other was necessary, and we
didn't really come up with any other ideas for useful behaviors.

I think we have yet to consider other ways of getting these two
behaviors, however. For instance, -s could be seen as an instruction
to "last of all, sort by the input row number". But if we implement
randomization as "sort by hash of keys" - for a "together" shuffle -
then including the input row number in this hash would get the
contrasting behavior above that Paul Eggert is suggesting - the
"apart" shuffle. So with a rephrasing of the -s option description,
it might make sense for "-R" to indicate the "together" behavior and
"-Rs" to indicate the "apart" behavior. In this case -s wouldn't mean
"stable" so much as "depends on input ordering". I don't know if this
is sensible. Anyway, here is the end of the last thread:

http://lists.gnu.org/archive/html/bug-coreutils/2005-02/msg5.html
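
To illustrate the "together"/"apart" distinction in code, here is an
editor's toy sketch. A compact FNV-1a hash stands in for the MD5/SHA1
digests the actual patch uses; the names are purely illustrative.
Hashing the key alone gives equal keys equal sort values, so they
shuffle together; folding in the input row number scatters them.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* 64-bit FNV-1a, seeded with h; a stand-in for the real digest. */
static uint64_t
fnv1a (const void *p, size_t len, uint64_t h)
{
  const unsigned char *s = p;
  while (len--)
    {
      h ^= *s++;
      h *= 0x100000001b3ULL;
    }
  return h;
}

#define FNV_BASIS 0xcbf29ce484222325ULL

/* "together": equal keys hash alike, so they stay adjacent. */
static uint64_t
together_key (const char *key)
{
  return fnv1a (key, strlen (key), FNV_BASIS);
}

/* "apart": the input row number perturbs the hash, so equal keys scatter. */
static uint64_t
apart_key (const char *key, uint64_t row)
{
  return fnv1a (&row, sizeof row, fnv1a (key, strlen (key), FNV_BASIS));
}

int
main (void)
{
  printf ("%016llx %016llx\n",
          (unsigned long long) together_key ("Jan"),
          (unsigned long long) together_key ("Jan"));  /* identical */
  printf ("%016llx %016llx\n",
          (unsigned long long) apart_key ("Jan", 1),
          (unsigned long long) apart_key ("Jan", 2));  /* differ */
  return 0;
}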

Frederik


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: new coreutil? shuffle - randomize file contents

2005-05-30 Thread Frederik Eaton
I'm not following exactly - in part I think it is premature to discuss
implementation details now. And as for the idea to put shuffle
functionality in a separate command, this and other issues were
discussed at length in the previous thread which starts here:

http://lists.gnu.org/archive/html/bug-coreutils/2005-01/msg00145.html

Basically, sometimes you want to be able to sort on one key and
shuffle on another, which means that your hypothetical 'shuffle' would
have to have a superset of the 'sort' functionality. Not an ideal
situation. Not only would the implementations be very similar, but,
more importantly, the APIs would have a lot of overlap as well.

Also, just because some users might look for shuffle functionality
first in a shuffle command doesn't mean that we should put it there.
You only have to learn that the functionality is provided by sort
once, and it doesn't make sense to sacrifice too much usability or
maintainability to try to captivate the small minority of users who
are first-time users (as commercial software vendors tend to do).
(Now, it also doesn't make sense to have to work out a line-long awk
script every time you need to shuffle something.)

Frederik

On Tue, May 24, 2005 at 09:46:13PM +, Davis Houlton wrote:
 On Tuesday 24 May 2005 15:33, Frederik Eaton wrote:
  reason to expand the functionality of 'sort'. But in my opinion a more
  important reason is that the set of commands that one runs on a unix
  system comprise a language, which is a very important language from a
  user's perspective, and if people think that it should make sense
  intuitively to use a certain command to do a certain thing, then that
  command should be extended to do that thing, for the sake of the
  conciseness and clarity and usability of this language. 
 
 I agree that the coreutils represent a toolbox for programming.  Without 
 them, /bin/sh would be an empty place indeed!
 
 So, in the same way we have echo / printf, expand / unexpand,  head / tail, I 
 like to think that there is also room for both sort and shuffle.   
 
 In my mind, from a functional operator's perspective, I find it very 
 intuitive that sort does as it says--sort lines ACB into ABC.  Sorting by 
 dates and numbers is a logical extension of the sort metaphor, and again, I 
 intuitively look to sort to do these things. But when I went to look for a 
 shuffle command--as a user--I did not even think to look at the sort man 
 page. I'm not sure if that is typical -- sort is certainly a logical place to 
 look--but that was my experience.
 
 A command like shuffle, in my mind, has a different set of functional 
 requirements--to take lines ACB and produce one of ABC ACB BAC BCA CAB CBA, 
 etc.  As a user, not necessarily aware of technical implementation, sort and 
 shuffle make sense as separate commands, in the same way I use different 
 methods to sort and shuffle a deck of cards.
 
 Reading these notes has given me pause to challenge my current line of 
 thinking regarding large files. Glossing over some gritty implementation 
 details, my revised thought process is that for each shuffle execution, there 
 are two paths: a computationally fast one involving manipulation of text 
 buffers, and one that involves only delimiter locations & line sizes.  
 Shuffle would pick the optimum execution path based on input set size; small, 
 average cases use text buffers. Larger cases use delimiter locations paged to 
 disk.
 
 That is a good point about RAND_MAX. For a case where we have more than 
 RAND_MAX lines, we need to make multiple random calls until our net RAND_MAX 
 is large enough to encapsulate the entire input set.  For example, suppose we 
 used a die to locate a random line. If we had 20 lines, we could use 4d6-3 to 
 give us the range we need (1-21).  For 30 lines, 6d6-5 (1-31), etc.  
 
 I currently do not have time (tonight) to think through a scheme that uses 
 delimiter locations paged out to disk, but I suspect it would be implemented 
 differently from sort?  Not sure yet--or if there would be any benefit.
 
 Some considerations--are file sizes & fseek's fpos_t something that needs
 to be considered? 
 
 What is the maximum average footprint considered acceptable for a reduced 
 memory environment? 1MB? 4MB? 16k? 2k?  Like Jay pointed out, the general 
 algorithm will involve splitting delimiter locations into manageable pages of 
 memory, and randomizing each page. Then we select a page at random, and walk 
 down that page.  The question is then--what is the maximum page size?
 
 Thanks,
   Davis
 
 
 
 ___
 Bug-coreutils mailing list
 Bug-coreutils@gnu.org
 http://lists.gnu.org/mailman/listinfo/bug-coreutils
 

-- 
http://ofb.net/~frederik/


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: new coreutil? shuffle - randomize file contents

2005-05-30 Thread Davis Houlton
Hi Frederik! I guess we're both a little confused :) My question is why would 
I sort AND shuffle in the same command? Are we talking sort the whole data 
set and shuffle a subset? I guess I'm having a hard time thinking why I would 
randomize via key--not saying that there aren't reasons, I'm just not sure 
what they are! 

My premise is that shuffle is organized pretty differently from sort--the code 
I have (in addition to the code I imagine we'll need for large files) looks 
radically different from sort, if only because shuffling is vastly simpler. 

While we could graft a shuffle into sort--I must admit to having only taken a 
cursory glance at the sort source--I think we can gain greater efficiencies 
by keeping the logic paths separate.  My assumption is thus that the shuffling 
code will be its own entity, whether it is in sort or shuffle.  

Looking at it a different way, let's take a look at the usage of sort and 
shuffle as a card metaphor.  The way I sort a deck of cards--and my rather 
simple method is far from optimum--is to first spread the cards face up out 
on a table, look for some high cards of each suit, start a pile of the four 
suits, and then as I pull additional cards, place them in the proper order in 
each suit pile. When I'm done sometime later, I'm left with the four stacks 
of cards, each suit in the proper order.  

When I shuffle the resulting deck, however, I use a different process. 
Granted, I could spread all the cards on the table, mix them up domino 
style, and then place them randomly into one, or even four stacks.  That 
would be acceptable.  But what I do (following the grand tradition of card 
shark wannabes everywhere) is split the deck in half.  I take each deck, and 
attempt to randomly merge them together like we've all seen those Las Vegas 
dealers do on tv, and voila--I have now (in theory) randomized the deck. It's 
quicker and just as effective as the table spread method.

If we are willing to ignore the imperfections of the analogy--that Vegas 
dealers shuffle their cards 7 times, that I have a tendency to mangle cards 
with improper shuffling technique, etc--my thinking is that it makes sense to 
have sort and shuffle remain separate on an intuitive level.  And I admit, it 
is true, it is not hard to train a user in sort and shuffle commands.  Had 
sort --random already existed, there would be no need to propose any 
separation. But if we accept as a given that the code will follow two 
different logic paths, I personally don't see maintenance gains from 
combining the two.  

I must admit, with the American holidays and family I'm pretty pressed for 
time. I took a quick scan of the archive and it seemed like the conclusion 
was that it is a good idea to keep shuffle functionality separate? 

At any rate, I will add the delimiter, -o, -z options sometime in the future 
and then check the code into Savannah for (I hope) scathing review.  While my 
personal feeling is that it could be a good addition to coreutils, at least 
this way it can be available to those who have a use for a simple, quick 
shuffle.


On Monday 30 May 2005 05:27, Frederik Eaton wrote:
 I'm not following exactly - in part I think it is premature to discuss
 implementation details now. And as for the idea to put shuffle
 functionality in a separate command, this and other issues were
 discussed at length in the previous thread which starts here:

 http://lists.gnu.org/archive/html/bug-coreutils/2005-01/msg00145.html

 Basically, sometimes you want to be able to sort on one key and
 shuffle on another, which means that your hypothetical 'shuffle' would
 have to have a superset of the 'sort' functionality. Not an ideal
 situation. Not only would the implementations be very similar, but,
 more importantly, the APIs would have a lot of overlap as well.

 Also, just because some users might look for shuffle functionality
 first in a shuffle command doesn't mean that we should put it there.
 You only have to learn that the functionality is provided by sort
 once, and it doesn't make sense to sacrifice too much usability or
 maintainability to try to captivate the small minority of users who
 are first-time users (as commercial software vendors tend to do).
 (Now, it also doesn't make sense to have to work out a line-long awk
 script every time you need to shuffle something.)

 Frederik



___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: new coreutil? shuffle - randomize file contents

2005-05-30 Thread Frederik Eaton
On Mon, May 30, 2005 at 09:25:45AM +, Davis Houlton wrote:
 Hi Frederik! I guess we're both a little confused :) My question is why would 
 I sort AND shuffle in the same command? Are we talking sort the whole data 
 set and shuffle a subset? I guess I'm having a hard time thinking why I would 
 randomize via key--not saying that there aren't reasons, I'm just not sure 
 what they are! 

This is covered in the previous thread. The canonical example is
playing songs with albums shuffled, but with songs on each album
played together and in order.

 My premise is that shuffle is organized pretty differently from sort--the code 
 I have (in addition to the code I imagine we'll need for large files) looks 
 radically different from sort, if only because shuffling is vastly simpler. 
 
 While we could graft a shuffle into sort--I must admit to having only taken a 
 cursory glance at the sort source--I think we can gain greater efficiencies 
 by keeping the logic paths separate.  My assumption is thus that the shuffling 
 code will be its own entity, whether it is in sort or shuffle.  

It is true that shuffling can generally be done more efficiently than
sorting. I don't know if efficiency is a primary concern - I think
that the *ability* to handle multi-gigabyte files is important, but
since they come up so rarely, especially when the task is to shuffle
and not to sort, whether they are done in a minute or 30 minutes seems
inconsequential. But if you are already writing something which will
be able to handle large files well, I guess I personally don't see a
problem with including it in coreutils. The only thing is that what
you describe won't be able to handle all of the use cases that I had
in mind. I would still like to see 'sort' have an option to sort based
on a hash of keys since this would cover those.

 Looking at it a different way, let's take a look at the usage of sort and 
 shuffle as a card metaphor.  The way I sort a deck of cards--and my rather 
 simple method is far from optimum--is to first spread the cards face up out 
 on a table, look for some high cards of each suit, start a pile of the four 
 suits, and then as I pull additional cards, place them in the proper order in 
 each suit pile. When I'm done sometime later, I'm left with the four stacks 
 of cards, each suit in the proper order.  
 
 When I shuffle the resulting deck, however, I use a different process. 
 Granted, I could spread all the cards on the table, mix them up domino 
 style, and then place them randomly into one, or even four stacks.  That 
 would be acceptable.  But what I do (following the grand tradition of card 
 shark wannabes everywhere) is split the deck in half.  I take each deck, and 
 attempt to randomly merge them together like we've all seen those Las Vegas 
 dealers do on tv, and voila--I have now (in theory) randomized the deck. It's 
 quicker and just as effective as the table spread method.
 
 If we are willing to ignore the imperfections of the analogy--that Vegas 
 dealers shuffle their cards 7 times, that I have a tendency to mangle cards 
 with improper shuffling technique, etc--my thinking is that it makes sense to 
 have sort and shuffle remain separate on an intuitive level.  And I admit, it 
 is true, it is not hard to train a user in sort and shuffle commands.  Had 
 sort --random already existed, there would be no need to propose any 
 separation. But if we accept as a given that the code will follow two 
 different logic paths, I personally don't see maintenance gains from 
 combining the two.  

I hope that you aren't proposing an algorithm which is similar to
card-shuffling. That would be exactly like merge-sorting on a key hash
- i.e. no more efficient.

 I took a quick scan of the archive and it seemed like the conclusion
 was that it is a good idea to keep shuffle functionality separate?

I believe it was concluded that two functionalities were needed - I
don't know what you mean by separate.

Frederik


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: new coreutil? shuffle - randomize file contents

2005-05-25 Thread James Youngman
On Tue, May 24, 2005 at 09:55:35AM -0700, Paul Eggert wrote:

 That way, you could use, e.g.:
 
   sort -k 2,2 -k R
 
 which would mean sort by the 2nd field, but if there are ties then
 sort the ties randomly.  sort -R would be short for sort -k R.

Perhaps this approach avoids the problems that were discussed earlier
regarding expectations about lines with identical keys shuffling
together.


James.


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: new coreutil? shuffle - randomize file contents

2005-05-25 Thread Frederik Eaton
On Tue, May 24, 2005 at 11:25:48AM +0100, [EMAIL PROTECTED] wrote:
 James Youngman wrote:
 Davis Houlton writes:-
 
 
 
 I recently had to write a shuffle utility for a personal project and
 was wondering if it would make a candidate for the coreutils
 suite. It seems like the kind of utility the toolbox could use
 (maybe under section 3. Output of entire files).
 
 
 This behaviour was proposed a few months ago as a new option to
 sort, and there were objections around the ideas of keeping the
 shuffled sort stable (i.e. that lines with the same key should appear
 in groups in the shuffled output) and of repeatability (e.g. giving a
 'random seed' to ensure output is reproducible[*]).  Much discussion
 followed and ended up with many people agreeing that this behaviour
 properly belonged in a separate program.

I didn't get that impression. There was some resistance to the idea,
but no conclusion, except that two variants would be needed if the
functionality were added to 'sort'. See the thread starting on
2005/1/29.

I think that since people keep asking for this feature, and offering
to implement it, there is no reason not to make a commitment to adding
it. If you don't want to use it, you can easily avoid using it.

 So, I think that shuffle is a good idea.
 
 I don't agree. You just end up duplicating
 99% of the sort logic. Logically the only difference from sort
 is the low level ordering algorithm, so I vote for an extra arg to 
 sort: --sort=random. Another arg to the --sort option could be
 "version", which would sort files with version numbers in their name
 appropriately.

I don't know about the version suggestion; at this point it seems
that you would be better off using perl. However, it does seem that
the 'sort' command line API could be extended to allow a great variety
of special sorting functions to be specified, with little burden on
the documentation and option syntax, by specifying the function name
as an argument to a general purpose option as with the --sort XX you
propose.

As for the duplication of sort logic, yes, I think this is a good
reason to expand the functionality of 'sort'. But in my opinion a more
important reason is that the set of commands that one runs on a unix
system comprise a language, which is a very important language from a
user's perspective, and if people think that it should make sense
intuitively to use a certain command to do a certain thing, then that
command should be extended to do that thing, for the sake of the
conciseness and clarity and usability of this language. The fact that
something can be accomplished in another way means nothing, if that
other way is cumbersome or unintuitive. Think about the big picture:
if somebody is arguing that there should be a standard well-known
'shuffle' command which is part of most distributions, then I would
say, well, adding such a feature to 'sort' would be a much simpler
solution - it would be on the whole easier to maintain, and on the
whole easier to remember.

Frederik


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: new coreutil? shuffle - randomize file contents

2005-05-24 Thread James Youngman
On Mon, May 23, 2005 at 08:02:19PM +, Davis Houlton wrote:
 On Monday 23 May 2005 16:35, you wrote:
  So, I think that shuffle is a good idea.
 Great!  As I wasn't sure if this was a good idea or not, right now the 
 functionality is quite minimal.  I agree that it needs to be expanded, and 
 will do what I can to do so.

Of course, I'm not one of the coreutils maintainers, so my opinion has
no weight.

  Does it produce all possible outputs with equal probability?

 I would say so. The randomizing function is that old standby
  return 1+(rand()%line_count);

You have a problem where a file has more than RAND_MAX lines.  See
below.

[...]
 redundancy--I will make those (right now I'm using a linked list to
 store the file in memory. This really should be an array of string
 pointers, of course).  It's fast enough for me on 2GHz hardware, but
 might as well go the extra mile.
[...]
 I also realized I need to add a caveat section--this utility is
 memory bound (currently). If you have 4GB of file to randomize and
 only 1GB of memory, well, thems the bricks. The obvious way around
 this is to first scan and create an index of delimiter locations,
 then fseek through a random walk of said delimiters--but I wasn't
 sure if it was really necessary at this point--can't imagine it
 would be very fast, and then there's the issue of stdin. Probably
 need to add this as a --use-index option.

The GNU project likes to avoid arbitrary limits.  Take a look at the
way that 'sort' handles the same situation.

Rather than doing that many random reads, it is probably more
efficient to do something like this:

  1.  Read N lines from the input file(s), where N is chosen
      such that the data fits into memory.

  2.  Shuffle each chunk of N lines, and write the shuffled data to
      a temporary file.  Continue until all the input has been
      read and sent to a temporary file (if the input all fits
      into memory, skip to step 4).

  3.  Read the temporary files back, selecting the next
      line from a randomly selected temporary file until
      all the temporary files have been completely read.

  4.  On the final merge pass (if any) send the data to the
      output file instead of a temporary file.  Delete all the
      temporary files (you need to do this manually, since the
      resource limit on the number of temporary files prevents
      you using tmpfile()).

This is quite complex to implement since you can only merge M
temporary files in the merge pass, where (M < RAND_MAX) and (M < the
RLIMIT_NOFILE resource limit).  Because of this complexity, you might
want to get some feedback from the coreutils maintainers before you
consider making that change because it's quite significant.
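
Here is a rough editor's sketch of those four steps, with error
handling trimmed, a single merge pass, and tmpfile() used for
brevity (fine while the number of chunks stays small). One caveat:
picking the next line from a uniformly random temporary file, as in
step 3, is only approximately fair; an exact shuffle would weight
the choice by the number of lines each file still holds.

/* Two-pass external shuffle: shuffle chunks of N lines into temporary
   files, then interleave the temporary files in random order. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define CHUNK_LINES 100000  /* N: lines per in-memory chunk (assumed) */
#define MAX_LINE 4096       /* longer lines get split; fine for a sketch */
#define MAX_TMP 64          /* M: must stay below RLIMIT_NOFILE; a real
                               version needs multiple merge passes */

static void
shuffle_lines (char **line, size_t n)
{
  size_t i, j;
  char *t;
  for (i = 0; i + 1 < n; i++)
    {
      j = i + (size_t) (rand () / (RAND_MAX + 1.0) * (n - i));
      t = line[i]; line[i] = line[j]; line[j] = t;
    }
}

int
main (void)
{
  char buf[MAX_LINE];
  char **line = malloc (CHUNK_LINES * sizeof *line);
  FILE *tmp[MAX_TMP];
  size_t ntmp = 0, n = 0, i, k;
  char *got;

  srand ((unsigned) time (NULL));

  /* Steps 1 and 2: read chunks, shuffle each, spill to temp files. */
  do
    {
      got = fgets (buf, sizeof buf, stdin);
      if (got)
        line[n++] = strdup (buf);
      if ((n == CHUNK_LINES || !got) && n > 0)
        {
          FILE *t = tmpfile ();
          shuffle_lines (line, n);
          for (i = 0; i < n; i++)
            {
              fputs (line[i], t);
              free (line[i]);
            }
          rewind (t);
          tmp[ntmp++] = t;
          n = 0;
        }
    }
  while (got);

  /* Steps 3 and 4: drain a randomly chosen file, one line at a time. */
  while (ntmp > 0)
    {
      k = (size_t) (rand () / (RAND_MAX + 1.0) * ntmp);
      if (fgets (buf, sizeof buf, tmp[k]))
        fputs (buf, stdout);
      else
        {
          fclose (tmp[k]);
          tmp[k] = tmp[--ntmp];
        }
    }
  free (line);
  return 0;
}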

Regards,
James Youngman.


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: new coreutil? shuffle - randomize file contents

2005-05-24 Thread P

James Youngman wrote:

Davis Houlton writes:-




I recently had to write a shuffle utility for a personal project and
was wondering if it would make a candidate for the coreutils
suite. It seems like the kind of utility the toolbox could use
(maybe under section 3. Output of entire files).



This behaviour was proposed a few months ago as a new option to
sort, and there were objections around the ideas of keeping the
shuffled sort stable (i.e. that lines with the same key should appear
in groups in the shuffled output) and of repeatability (e.g. giving a
'random seed' to ensure output is reproducible[*]).  Much discussion
followed and ended up with many people agreeing that this behaviour
properly belonged in a separate program.

So, I think that shuffle is a good idea.


I don't agree. You just end up duplicating
99% of the sort logic. Logically the only difference from sort
is the low level ordering algorithm, so I vote for an extra arg to 
sort: --sort=random. Another arg to the --sort option could be
"version", which would sort files with version numbers in their name
appropriately.

Pádraig.


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


RE: new coreutil? shuffle - randomize file contents

2005-05-24 Thread Lemley James - jlemle
I'm just a lurker so my opinion doesn't count... for much.

Certainly I don't expect everyone to be a programmer in order to be able
to shuffle their playlist, but perhaps an example needs to be added to
the sort man-page stating how easy it is to accomplish with tools that are
likely already installed on your system (specifically, awk to add a
pseudorandom key to a line and cut to remove it): 

find /music -name '*ogg' -type f | 
awk 'BEGIN{srand()}{printf "%06d%s\n", rand()*1000000, $0}' | 
sort | 
cut -b 7- > /music/randomized_oggfiles.m3u

Remove srand() if you prefer more deterministic behavior.   Standard
disclaimers about using pseudorandom numbers for anything important
apply.

As to adding a random function to sort that would keep like-keys
together, perhaps that can be achieved by abusing the locales and
collate function that does unexpected things for many users anyway?  I'm
only half serious when I say that, but LANG=(something other than C) has
caused so many users so much grief already that LANG=random couldn't be
any worse. 

Apologies for the noise on the list, but I hope you got a chuckle from
LANG=random anyway.  I'll go back to lurking now. 


James Youngman wrote:

I don't agree. You just end up duplicating
99% of the sort logic. Logically the only difference from sort
is the low level ordering algorithm, so I vote for an extra arg to 
sort: --sort=random. Another arg to the --sort option could be
"version", which would sort files with version numbers in their name
appropriately.



___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: new coreutil? shuffle - randomize file contents

2005-05-24 Thread Bob Proulx
Lemley James - jlemle wrote:
 Certainly I don't expect everyone to be a programmer in order to be able
 to shuffle their playlist, but perhaps an example needs to be added to
 the sort man-page stating how easy it is to accomplish with tools that are
 likely already installed on your system (specifically, awk to add a
 pseudorandom key to a line and cut to remove it): 
 
 find /music -name '*ogg' -type f | 
 awk 'BEGIN{srand()}{printf "%06d%s\n", rand()*1000000, $0}' | 
 sort | 
 cut -b 7- > /music/randomized_oggfiles.m3u

Your solution and a previously posted one are quite similar.

  http://lists.gnu.org/archive/html/bug-coreutils/2005-01/msg00148.html

Yours has the advantage of not using $RANDOM, a ksh/bash specific
feature not in POSIX.

Bob


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: new coreutil? shuffle - randomize file contents

2005-05-24 Thread Paul Eggert
[EMAIL PROTECTED] writes:

 Logically the only difference from sort is the low level ordering
 algorithm. so I vote for and extra arg to sort:
 --sort=random.

More generally, sort could pretend that every line had an extra field
called R whose contents are random.  That way, you could use, e.g.:

  sort -k 2,2 -k R

which would mean sort by the 2nd field, but if there are ties then
sort the ties randomly.  sort -R would be short for sort -k R.

It's not as easy to implement as it sounds, I'm afraid, even if you
can assume /dev/random or /dev/urandom.


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: new coreutil? shuffle - randomize file contents

2005-05-24 Thread Davis Houlton
On Tuesday 24 May 2005 15:33, Frederik Eaton wrote:
 reason to expand the functionality of 'sort'. But in my opinion a more
 important reason is that the set of commands that one runs on a unix
 system comprise a language, which is a very important language from a
 user's perspective, and if people think that it should make sense
 intuitively to use a certain command to do a certain thing, then that
 command should be extended to do that thing, for the sake of the
 conciseness and clarity and usability of this language. 

I agree that the coreutils represent a toolbox for programming.  Without 
them, /bin/sh would be an empty place indeed!

So, in the same way we have echo / printf, expand / unexpand,  head / tail, I 
like to think that there is also room for both sort and shuffle.   

In my mind, from a functional operator's perspective, I find it very 
intuitive that sort does as it says--sort lines ACB into ABC.  Sorting by 
dates and numbers is a logical extension of the sort metaphor, and again, I 
intuitively look to sort to do these things. But when I went to look for a 
shuffle command--as a user--I did not even think to look at the sort man 
page. I'm not sure if that is typical -- sort is certainly a logical place to 
look--but that was my experience.

A command like shuffle, in my mind, has a different set of functional 
requirements--to take lines ACB and produce one of ABC ACB BAC BCA CAB CBA, 
etc.  As a user, not necessarily aware of technical implementation, sort and 
shuffle make sense as separate commands, in the same way I use different 
methods to sort and shuffle a deck of cards.

Reading these notes has given me pause to challenge my current line of 
thinking regarding large files. Glossing over some gritty implementation 
details, my revised thought process is that for each shuffle execution, there 
are two paths: a computationally fast one involving manipulation of text 
buffers, and one that involves only delimiter locations & line sizes.  
Shuffle would pick the optimum execution path based on input set size; small, 
average cases use text buffers. Larger cases use delimiter locations paged to 
disk.

That is a good point about RAND_MAX. For a case where we have more than 
RAND_MAX lines, we need to make multiple random calls until our net RAND_MAX 
is large enough to encapsulate the entire input set.  For example, suppose we 
used a die to locate a random line. If we had 20 lines, we could use 4d6-3 to 
give us the range we need (1-21).  For 30 lines, 6d6-5 (1-31), etc.  
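
One caution on combining draws (an editor's note): summing several
dice skews toward the middle values, so the combined index is no
longer uniform. The usual fix is to concatenate random bits and
reject the overhanging tail. A hypothetical helper, not taken from
any posted code:

/* Build a wide random value from 15-bit rand() slices (C guarantees
   RAND_MAX >= 32767), then use rejection to get a uniform index. */
#include <stdlib.h>

static unsigned long long
rand_bits60 (void)
{
  unsigned long long r = 0;
  int i;
  for (i = 0; i < 4; i++)
    r = (r << 15) | (unsigned long long) (rand () & 0x7fff);
  return r;  /* 60 uniformly distributed bits */
}

/* Uniform in [0, n); n must be between 1 and 2^60. */
static unsigned long long
uniform_index (unsigned long long n)
{
  unsigned long long limit = ((1ULL << 60) / n) * n;
  unsigned long long r;
  do
    r = rand_bits60 ();
  while (r >= limit);  /* reject the biased tail, then reduce */
  return r % n;
}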

I currently do not have time (tonight) to think through a scheme that uses 
delimiter locations paged out to disk, but I suspect it would be implemented 
differently from sort?  Not sure yet--or if there would be any benefit.

Some considerations--are file sizes & fseek's fpos_t something that needs to be 
considered? 

What is the maximum average footprint considered acceptable for a reduced 
memory environment? 1MB? 4MB? 16k? 2k?  Like Jay pointed out, the general 
algorithm will involve splitting delimiter locations into manageable pages of 
memory, and randomizing each page. Then we select a page at random, and walk 
down that page.  The question is then--what is the maximum page size?

Thanks,
  Davis



___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: new coreutil? shuffle - randomize file contents

2005-05-23 Thread James Youngman
Davis Houlton writes:-


 I recently had to write a shuffle utility for a personal project and
 was wondering if it would make a candidate for the coreutils
 suite. It seems like the kind of utility the toolbox could use
 (maybe under section 3. Output of entire files).

This behaviour was proposed a few months ago as a new option to
sort, and there were objections around the ideas of keeping the
shuffled sort stable (i.e. that lines with the same key should appear
in groups in the shuffled output) and of repeatability (e.g. giving a
'random seed' to ensure output is reproducible[*]).  Much discussion
followed and ended up with many people agreeing that this behaviour
properly belonged in a separate program.

So, I think that shuffle is a good idea.

[*] I'm against the 'random seed' idea since the number of
possible permutations of the output is so large for most reasonable
inputs that numbers as low as 2^32 are way too small.  That makes
parsing and making use of the seed quite tricky and in all likelihood
not worth the bother.


 It's a fairly simple program, and while there is room for
 optimization, works quickly enough on my hardware.

Does it produce all possible outputs with equal probability?

 SYNOPSIS
shuffle [--help] [FILE]...

I would suggest the addition of the following options: 

--null, -0, -z
Accept and produce records terminated by the NUL
character (in ASCII, this is 0) instead of newline.
This can be used in conjunction with find -print0
or with xargs -0.   

-o file
Write the shuffled output data to the named file.

The reason for offering -z there as well is that sort has a -z
option that does this.   


Create a random multimedia playlist based on directory contents:
   find /path/to/media -name '*.ogg' | shuffle > playlist.pls

Does that generate a playlist in the right format?  Based on a look at
http://gonze.com/playlists/playlist-format-survey.html#PLS it seems to
me like it's better to indicate that the output file is 'playlist.m3u'.


 EXIT STATUS
shuffle returns a zero upon successful completion; a one upon failing to
read a file and a two upon failing to write a file.

I realise that many writing style guides suggest that numbers less
than six or so should be represented as words rather than numerals,
but I found the paragraph above would have been more useful if
formatted a bit more like a table:-

   .SH EXIT STATUS
   .B shuffle
   exits with the following status:
   .nf
   0 if it succeeds
   1 if it fails to open or read an input file
   2 if it fails to write output data or fails to open the output file
   3 for command-line usage errors
   .fi
   The program can also exit with numerically greater status values if
   other errors occur, but most of these are operating-system
   dependent.


 Please let me know if this would be considered an appropriate
 addition, and what the next step might be.  I have used GNU utils
 for decades now, and am happy at the opportunity to finally have a
 chance to give back.

I would find this program useful.

Regards,
James.


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


new coreutil? shuffle - randomize file contents

2005-05-21 Thread Davis Houlton
Hello,

I recently had to write a shuffle utility for a personal project and was 
wondering if it would make a candidate for the coreutils suite. It seems like 
the kind of utility the toolbox could use (maybe under section 3. Output of 
entire files).

It's a fairly simple program, and while there is room for optimization, works 
quickly enough on my hardware.

It is probably best explained by the quick man page I whipped up:
---
shuffle(1)   USER COMMANDS  shuffle(1)



NAME
   shuffle - Randomize stdin/file contents

SYNOPSIS
   shuffle [--help] [FILE]...

DESCRIPTION
   Randomize  the  contents of all files specified on the command line. If
   no files are specified, shuffle will read from stdin, and write to
   stdout.

OPTIONS
   --help Display short usage information

EXAMPLES
   Randomize contents of a series of files:
  shuffle file1.txt file2.txt file3.txt


   Create a random multimedia playlist based on directory contents:
  find /path/to/media -name '*.ogg' | shuffle > playlist.pls


   Select a file at random:
  ls | shuffle | head -n 1


EXIT STATUS
   shuffle returns a zero upon successful completion; a one upon failing to
   read a file and a two upon failing to write a file.

AUTHOR
   Written by Davis Houlton.



version 1.0  May 21, 2005   shuffle(1)
---
Please let me know if this would be considered an appropriate addition, and 
what the next step might be.  I have used GNU utils for decades now, and am 
happy at the opportunity to finally have a chance to give back. Thanks,
  Davis Houlton



___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils