bug#10287: [wishlist] uniq can remove non adjacent lines

2011-12-14 Thread Stéphane Blondon
2011/12/13 Bob Proulx b...@proulx.com:
 Davide Brini wrote:
 Bob Proulx wrote:
    perl -lne 'print $_ if ! defined $a{$_}; $a{$_}=$_;'

 While we're at it, this is the typical awk way to do that:

 awk '!a[$0]++'

Many thanks to you and Davide for providing a one-liner
solution! I've modified the awk version so that it works as an alias.
I'm sending it in case someone asks the same question:

Copy and paste the next line into ~/.bash_aliases:
alias uniqall='awk '\''! a[$0]++'\'''

Then you can filter like this:
cat file | ... | uniqall | ...


(tested with bash, version 4.2.20(1)-release under Debian Wheezy)
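
In scripts, where bash does not expand aliases unless
`shopt -s expand_aliases` is set, a shell function may be a simpler
alternative that also avoids the nested quoting. This is only a sketch:
it reuses the name uniqall from the alias above and assumes any POSIX awk.

```shell
# uniqall: print each distinct input line once, keeping first occurrences
# in their original order. Reads stdin, or the files given as arguments.
uniqall() {
    awk '!seen[$0]++' "$@"
}

# Example: the second "b" and the second "a" are dropped,
# leaving b, a, c in that order.
printf '%s\n' b a b c a | uniqall
```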

Thanks and good bye,
-- 
Stéphane





bug#10287: [wishlist] uniq can remove non adjacent lines

2011-12-13 Thread Davide Brini
On Mon, 12 Dec 2011 21:20:18 -0700, Bob Proulx b...@proulx.com wrote:

 Stéphane Blondon wrote:
  I think `uniq` should have an additional option (for example -a,
  --all) to remove duplicate lines even when they are not adjacent.
  
  The man page explains a workaround based on `sort` but it can be
  complex to use. A few weeks ago, I had to `uniq`-ize random numbers and
  the sort couldn't really work. Fortunately, the order was not
  important so using `sort | uniq | sort --random-sort` was an
  acceptable solution. I imagine cases based on other tools like `top`
  could be a problem too.
 
 If you want to print only the first occurrence of each line then this
 perl one-liner will do it.
 
   perl -lne 'print $_ if ! defined $a{$_}; $a{$_}=$_;'

While we're at it, this is the typical awk way to do that:

awk '!a[$0]++'
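
For readers less used to awk, the idiom can be spelled out. The sketch
below is meant to behave the same way as the one-liner; the array name
seen and the sample input are arbitrary.

```shell
# a[$0]++ yields the count *before* incrementing, so it is 0 (false) the
# first time a line is seen; the leading ! turns that into true, and a true
# pattern with no action prints the line.
printf '%s\n' x y x z y | awk '!a[$0]++'

# The same logic written out longhand.
printf '%s\n' x y x z y | awk '{
    if (seen[$0] == 0)   # unset array elements compare equal to 0
        print            # first occurrence: print the line
    seen[$0]++           # remember that this line has been seen
}'
```

Both commands print x, y and z, one per line.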


-- 
D.





bug#10287: [wishlist] uniq can remove non adjacent lines

2011-12-13 Thread Pádraig Brady
On 12/12/2011 10:54 PM, Stéphane Blondon wrote:
 Tool: uniq
 Priority: wishlist
 
 Hello,
 
 I think `uniq` should have an additional option (for example -a,
 --all) to remove duplicate lines even when they are not adjacent.
 
 The man page explains a workaround based on `sort` but it can be
 complex to use. A few weeks ago, I had to `uniq`-ize random numbers and
 the sort couldn't really work. Fortunately, the order was not
 important so using `sort | uniq | sort --random-sort` was an
 acceptable solution. I imagine cases based on other tools like `top`
 could be a problem too.
 
 If you are interested, I could try to provide a patch. (I have learnt
 C but I don't use it today.)
 
 I don't think the increase in memory use is a problem today, so a
 warning in the manpage should be enough.

Well, that would increase the complexity of `uniq` a _lot_:
http://lists.gnu.org/archive/html/coreutils/2011-11/msg00018.html
For that reason I would be against adding such a feature.
Note that improving the field selection of `uniq` is appropriate,
and would make DSU (decorate-sort-undecorate) solutions using sort
easier to implement.
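
For reference, one way to spell the DSU idea with the existing tools is
sketched below. The function name dsu_uniq is made up for illustration,
and the sketch assumes `cat -n` (as in GNU coreutils) and input without
NUL bytes.

```shell
# dsu_uniq (hypothetical name): remove non-adjacent duplicates with
# decorate-sort-undecorate, preserving the order of first occurrences.
dsu_uniq() {
    tab=$(printf '\t')
    cat -n |                               # decorate: "<line no>TAB<line>"
    LC_ALL=C sort -t "$tab" -k2 -k1,1n |   # group equal lines, earliest first
    LC_ALL=C uniq -f1 |                    # keep first of each group, skipping the number
    sort -n |                              # back to original line order
    cut -f2-                               # undecorate: drop the number
}

printf '%s\n' b a b c a | dsu_uniq
```

Unlike a hash-based filter, this has to sort the whole input before it
can print anything, so it does not work on an endless stream.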

cheers,
Pádraig.





bug#10287: [wishlist] uniq can remove non adjacent lines

2011-12-13 Thread Jim Meyering
Bob Proulx wrote:

 Stéphane Blondon wrote:
 I think `uniq` should have an additional option (for example -a,
 --all) to remove duplicate lines even when they are not adjacent.

 The man page explains a workaround based on `sort` but it can be
 complex to use. A few weeks ago, I had to `uniq`-ize random numbers and
 the sort couldn't really work. Fortunately, the order was not
 important so using `sort | uniq | sort --random-sort` was an
 acceptable solution. I imagine cases based on other tools like `top`
 could be a problem too.

 If you want to print only the first occurrence of each line then this
 perl one-liner will do it.

   perl -lne 'print $_ if ! defined $a{$_}; $a{$_}=$_;'

Thanks, but with large files, isn't it better to store not
the full line, but rather a constant?

  perl -lne 'print $_ if ! defined $seen{$_}; $seen{$_}=1'

(actually, using 1 could be seen as misleading, since 0 would also
work, and so would undef if the test used exists instead of defined)

I think you can drop the -l.
I have a slight preference for this:

  perl -ne 'defined $seen{$_} or print; $seen{$_}=1'
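
One case where -l still makes a difference is input whose last line has
no trailing newline. Without -l the newline is part of the hash key, so
a final "a" is not recognised as a repeat of an earlier "a\n". A small
demonstration, assuming perl is installed:

```shell
# Input: "a", "b", then a final "a" with no trailing newline.

# With -l each line is chomped before use, so the last "a" is a duplicate.
printf 'a\nb\na' | perl -lne 'defined $seen{$_} or print; $seen{$_}=1'
# prints: a, b

# Without -l the keys are "a\n", "b\n" and "a"; the last one is new,
# so it is printed again.
printf 'a\nb\na' | perl -ne 'defined $seen{$_} or print; $seen{$_}=1'
# prints: a, b, a
```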





bug#10287: [wishlist] uniq can remove non adjacent lines

2011-12-13 Thread Bob Proulx
Jim Meyering wrote:
 Bob Proulx wrote:
  If you want to print only the first occurrence of each line then this
  perl one-liner will do it.
 
    perl -lne 'print $_ if ! defined $a{$_}; $a{$_}=$_;'
 
 Thanks, but with large files, isn't it better to store not
 the full line, but rather a constant?
 
   perl -lne 'print $_ if ! defined $seen{$_}; $seen{$_}=1'

Good point!  I hadn't given it much thought since it usually runs so
quickly in my usage that I never worried about it.

 (actually, using 1 could be seen as misleading, since 0 would also
 work, and so would undef if the test used exists instead of defined)
 
 I think you can drop the -l.
 I have a slight preference for this:
 
   perl -ne 'defined $seen{$_} or print; $seen{$_}=1'

Referring to print vs. print $_ here: I have never liked the implicit
use of $_ and so I tend to avoid it.  At one time there was a push in
the perl community to make all uses explicit.  As for whether to use
the 'if (expr) { stmt }', 'stmt if expr' or 'expr or stmt' form, that
is a matter of taste.  We might as well discuss the one true
indentation and brace styles.  :-)  For one-liners I do tend to use
short variable names to keep the line length down.  In order to compact
a line I also sacrifice whitespace when required.

But you have me thinking about conserving memory.  If the file was
large due to long lines then memory use would be proportionately large,
because every distinct line is stored as a hash key.  This could be
reduced by using a digest of the line as the storage key instead of the
entire line.  But the savings depend on the average line size.  If the
average line is shorter than the digest (16 bytes for a binary MD5)
then this would increase memory use.

  perl -MDigest::MD5=md5 -lne '$m=md5($_); print $_ if ! defined $a{$m}; $a{$m}=1'

If you are ever going to debug and print out the md5 value then
substitute md5_hex for md5 to get a printable result.
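
A quick way to check that the digest variant selects the same lines is
to run both forms on the same input. This is only a sketch with made-up
sample data; it assumes perl with the core Digest::MD5 module. Note that
two different lines with the same MD5 digest would be treated as
duplicates, which is extremely unlikely by accident but not impossible.

```shell
# Made-up sample input with two repeated lines.
sample() { printf '%s\n' alpha beta alpha gamma beta; }

# Full line as the hash key.
sample | perl -lne 'print $_ if ! defined $a{$_}; $a{$_}=1'

# 16-byte binary MD5 digest of the line as the hash key.
sample | perl -MDigest::MD5=md5 -lne '$m=md5($_); print $_ if ! defined $a{$m}; $a{$m}=1'

# Both commands print: alpha, beta, gamma
```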

Bob





bug#10287: [wishlist] uniq can remove non adjacent lines

2011-12-13 Thread Bob Proulx
Davide Brini wrote:
 Bob Proulx wrote:
    perl -lne 'print $_ if ! defined $a{$_}; $a{$_}=$_;'
 
 While we're at it, this is the typical awk way to do that:
 
 awk '!a[$0]++'

I like it!  I will definitely be using that awk idiom in the future.
It is simple and concise.

Bob





bug#10287: [wishlist] uniq can remove non adjacent lines

2011-12-12 Thread Stéphane Blondon
Tool: uniq
Priority: wishlist

Hello,

I think `uniq` should have an additional option (for example -a,
--all) to remove duplicate lines even when they are not adjacent.

The man page explains a workaround based on `sort` but it can be
complex to use. A few weeks ago, I had to `uniq`-ize random numbers and
the sort couldn't really work. Fortunately, the order was not
important so using `sort | uniq | sort --random-sort` was an
acceptable solution. I imagine cases based on other tools like `top`
could be a problem too.
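
To make the limitation concrete, here is a small illustration with
made-up input (the helper name input is arbitrary). Plain `uniq` only
collapses adjacent repeats, and going through `sort` loses the original
order:

```shell
# Made-up sample: six numbers, one per line.
input() { printf '%s\n' 3 1 3 3 2 1; }

input | uniq            # adjacent repeats only: 3 1 3 2 1
input | sort | uniq     # all repeats removed, but order is now 1 2 3
```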

If you are interested, I could try to provide a patch. (I have learnt
C but I don't use it today.)

I don't think the increase in memory use is a problem today, so a
warning in the manpage should be enough.


Thanks for everything,
-- 
Stéphane





bug#10287: [wishlist] uniq can remove non adjacent lines

2011-12-12 Thread Bob Proulx
Stéphane Blondon wrote:
 I think `uniq` should have an additional option (for example -a,
 --all) to remove duplicate lines even when they are not adjacent.
 
 The man page explains a workaround based on `sort` but it can be
 complex to use. A few weeks ago, I had to `uniq`-ize random numbers and
 the sort couldn't really work. Fortunately, the order was not
 important so using `sort | uniq | sort --random-sort` was an
 acceptable solution. I imagine cases based on other tools like `top`
 could be a problem too.

If you want to print only the first occurrence of each line then this
perl one-liner will do it.

  perl -lne 'print $_ if ! defined $a{$_}; $a{$_}=$_;'

Bob