Bug#961638: proposal: stocat - probabilistic cat

2021-03-18 Thread Nicolas Schier
Dear Joey,

Stefano suggested including his 'stocat' tool (see below) in
moreutils.  I think the simple script is a great idea and I would like
to see it in the moreutils collection.  What do you think?  Would you
consider adopting it?

Kind regards,
Nicolas


On Wed, Mar 10, 2021 at 02:38:55PM +0100, Stefano Zacchiroli wrote:
> On Tue, May 26, 2020 at 11:39:42PM +0200, Stefano Zacchiroli wrote:
> > I'm hereby proposing the inclusion of the attached "stocat" utility in
> > moreutils. It's like cat, but outputs lines with a given probability,
> > defaulting to 10%. It's very useful for random sampling (and *much*
> > more efficient at that than using "shuf", which is unwieldy on very
> > large inputs) and, while it can be implemented instead with awk/perl
> > one-liners, those one-liners aren't very mnemonic and are error-prone.
> 
> Heya, as I haven't heard back about this, but others have asked me about
> how to best use stocat, I've now released it as an independent tool
> here:
> 
>   https://gitlab.com/zacchiro/stocat
> 
> I'm happy to reconsider if/when it gets integrated into moreutils.
> 
> Cheers
> -- 
> Stefano Zacchiroli . z...@upsilon.cc . upsilon.cc/zack . . o . . . o . o
> Computer Science Professor . CTO Software Heritage . . . . . o . . . o o
> Former Debian Project Leader & OSI Board Director  . . . o o o . . . o .
> « the first rule of tautology club is the first rule of tautology club »

-- 
epost: nico...@fjasle.eu   irc://oftc.net/nsc
↳ gpg: 18ed 52db e34f 860e e9fb  c82b 7d97 0932 55a0 ce7f
 -- frykten for herren er opphav til kunnskap --




Bug#961638: proposal: stocat - probabilistic cat

2021-03-10 Thread Stefano Zacchiroli
On Tue, May 26, 2020 at 11:39:42PM +0200, Stefano Zacchiroli wrote:
> I'm hereby proposing the inclusion of the attached "stocat" utility in
> moreutils. It's like cat, but outputs lines with a given probability,
> defaulting to 10%. It's very useful for random sampling (and *much*
> more efficient at that than using "shuf", which is unwieldy on very
> large inputs) and, while it can be implemented instead with awk/perl
> one-liners, those one-liners aren't very mnemonic and are error-prone.

Heya, as I haven't heard back about this, but others have asked me about
how to best use stocat, I've now released it as an independent tool
here:

  https://gitlab.com/zacchiro/stocat

I'm happy to reconsider if/when it gets integrated into moreutils.

Cheers
-- 
Stefano Zacchiroli . z...@upsilon.cc . upsilon.cc/zack . . o . . . o . o
Computer Science Professor . CTO Software Heritage . . . . . o . . . o o
Former Debian Project Leader & OSI Board Director  . . . o o o . . . o .
« the first rule of tautology club is the first rule of tautology club »



Bug#961638: proposal: stocat - probabilistic cat

2020-05-26 Thread Stefano Zacchiroli
Package: moreutils
Version: 0.63-1+b1
Severity: wishlist
Tags: patch upstream

I'm hereby proposing the inclusion of the attached "stocat" utility in
moreutils. It's like cat, but outputs lines with a given probability, defaulting
to 10%. It's very useful for random sampling (and *much* more efficient at that
than using "shuf", which is unwieldy on very large inputs) and, while it can be
implemented instead with awk/perl one-liners, those one-liners aren't very
mnemonic and are error-prone.

If desired, it could be extended by adding a reservoir sampling option, to
guarantee a selection of exactly K items.
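
To illustrate that possible extension, here is a minimal sketch of reservoir
sampling (Algorithm R) in Perl. It is not part of the attached script: the
function name reservoir_sample and the in-memory demo input are assumptions
made for this example only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of stocat: return a uniform random sample of
# exactly $k lines read from $fh (or every line, if fewer than $k are read),
# using reservoir sampling (Algorithm R).
sub reservoir_sample {
    my ($k, $fh) = @_;
    my @reservoir;
    my $seen = 0;
    while (my $line = <$fh>) {
        $seen++;
        if (@reservoir < $k) {
            # Fill the reservoir with the first $k lines.
            push @reservoir, $line;
        } else {
            # Keep the new line with probability $k / $seen by overwriting
            # a uniformly chosen slot.
            my $j = int(rand($seen));
            $reservoir[$j] = $line if $j < $k;
        }
    }
    return @reservoir;
}

# Demo on in-memory input so the sketch runs without external files.
my $input = join "", map { "line $_\n" } 1 .. 1000;
open(my $in, '<', \$input) or die "cannot open in-memory input: $!";
print reservoir_sample(5, $in);
close($in);
```

Passing \*STDIN instead of the in-memory handle would sample standard input.
Unlike the per-line probability test, this keeps K lines in memory and does
not preserve input order, since later lines overwrite arbitrary slots.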

Thanks a lot for moreutils!

Cheers

-- System Information:
Debian Release: bullseye/sid
  APT prefers testing
  APT policy: (500, 'testing'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 5.6.0-1-amd64 (SMP w/8 CPU cores)
Kernel taint flags: TAINT_WARN
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), 
LANGUAGE=en_US:en (charmap=UTF-8)
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages moreutils depends on:
ii  libc6                  2.30-8
ii  libipc-run-perl        20200505.0-1
ii  libtime-duration-perl  1.21-1
ii  libtimedate-perl       2.3200-1
ii  perl                   5.30.2-1

moreutils recommends no packages.

moreutils suggests no packages.

-- no debconf information
#!/usr/bin/perl

=head1 NAME

stocat - stochastic cat, selecting lines with uniform probability


=head1 SYNOPSIS

=over

=item B<stocat> [B<-p>|B<--probability> PROBABILITY] [I<FILE>|B<->]...

=back


=head1 DESCRIPTION

Concatenate FILE(s) to standard output, but print each input line only with a
given probability, defaulting to 0.1 (i.e., 10%).

With no FILE or when FILE is B<->, read standard input.


=head1 OPTIONS

=over 4

=item -p, --probability

Output lines with the given probability, specified as a number between 0 (0%
probability) and 1 (100% probability). Default: 0.1 (i.e., 10% probability).

=back


=head1 SEE ALSO

L


=head1 AUTHOR

Copyright 2020 by Stefano Zacchiroli 

Licensed under the GNU GPL.

=cut

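use strict;
use warnings;
use Getopt::Long;

# Print a usage message and exit with a non-zero status.
sub die_usage() {
    die "Usage: $0 [--probability|-p PROBABILITY] [FILE|-]...\n";
}

my $probability = 0.1;
# "=f" requires a real-number argument; GetOptions returns false on bad input.
if (! GetOptions("probability|p=f" => \$probability)) {
    die_usage();
}
if ($probability < 0 || $probability > 1) {
    die_usage();
}

# rand() returns a value in [0, 1), so a strict "<" never prints a line at
# probability 0 and always prints it at probability 1.
while (<>) {
    print $_ if rand() < $probability;
}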