bug#10287: [wishlist] uniq can remove non adjacent lines
2011/12/13 Bob Proulx b...@proulx.com:
> Davide Brini wrote:
>> Bob Proulx wrote:
>>>     perl -lne 'print $_ if ! defined $a{$_}; $a{$_}=$_;'
>>
>> While we're at it, this is the typical awk way to do that:
>>
>>     awk '!a[$0]++'

Many thanks to you and Davide for providing a one-liner solution!

I've modified the awk version so that it works as an alias. I'm sending it in case someone asks the same question.

Copy and paste the following line into ~/.bash_aliases:

    alias uniqall='awk '\''! a[$0]++'\'

Then you can filter like this:

    cat file | ... | uniqall | ...

(Tested with bash, version 4.2.20(1)-release, under Debian Wheezy.)

Thanks and goodbye,
-- Stéphane
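A shell function is one way to sidestep the nested quoting that the alias needs. The following is only a sketch, assuming a POSIX-style shell with awk on the PATH; the name `uniqall` is reused from the alias above, and `seen` is an arbitrary array name chosen for illustration.

```shell
# Hypothetical function equivalent of the uniqall alias.
# Prints each distinct line once, at its first occurrence, preserving order.
uniqall() {
    awk '!seen[$0]++' "$@"
}

# Example: reads stdin when no file arguments are given.
printf 'b\na\nb\nc\na\n' | uniqall
```

Use either the alias or the function, not both: in an interactive bash session an existing alias of the same name is expanded while the function definition is being read, so run `unalias uniqall` first if the alias is already defined.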
bug#10287: [wishlist] uniq can remove non adjacent lines
On Mon, 12 Dec 2011 21:20:18 -0700, Bob Proulx b...@proulx.com wrote:
> Stéphane Blondon wrote:
>> I think `uniq` should have an additional option (for example -a, --all) to remove identical lines that are not adjacent. The man page explains a workaround based on `sort`, but it can be complex to use.
>>
>> A few weeks ago, I had to `uniq`-ize random numbers and `sort` couldn't really work. Fortunately, the order was not important, so using `sort | uniq | sort --random-sort` was an acceptable solution. I imagine cases based on other tools like `top` could be a problem too.
>
> If you want to print only the first occurrence of each line, this perl one-liner will do it.
>
>     perl -lne 'print $_ if ! defined $a{$_}; $a{$_}=$_;'

While we're at it, this is the typical awk way to do that:

    awk '!a[$0]++'

-- D.
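For readers unfamiliar with the idiom, here is a spelled-out sketch of what `awk '!a[$0]++'` does. The function name `first_seen` and the array name `seen` are made up for illustration; the behavior is meant to match the one-liner, nothing more.

```shell
# Long-hand equivalent of: awk '!a[$0]++'
first_seen() {
    awk '{
        # An unseen line has an empty (zero) counter, so print it.
        if (seen[$0] == 0)
            print
        # Count every occurrence, so later copies are skipped.
        seen[$0]++
    }' "$@"
}

# Example: duplicates are dropped even when they are not adjacent.
printf '3\n1\n3\n2\n1\n' | first_seen
```

The one-liner folds both steps into a single expression: `a[$0]++` yields the old count and then increments it, `!` turns a zero count into true, and a true pattern with no action makes awk print the line.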
bug#10287: [wishlist] uniq can remove non adjacent lines
On 12/12/2011 10:54 PM, Stéphane Blondon wrote:
> Tool: uniq
> Priority: wishlist
>
> Hello,
>
> I think `uniq` should have an additional option (for example -a, --all) to remove identical lines that are not adjacent. The man page explains a workaround based on `sort`, but it can be complex to use.
>
> A few weeks ago, I had to `uniq`-ize random numbers and `sort` couldn't really work. Fortunately, the order was not important, so using `sort | uniq | sort --random-sort` was an acceptable solution. I imagine cases based on other tools like `top` could be a problem too.
>
> If you are interested, I could try to provide a patch. (I have learnt C but I don't use it today.) I don't think the increase in memory use is a problem today, so a warning in the manpage should be enough.

Well, that would increase the complexity of `uniq` a _lot_:
http://lists.gnu.org/archive/html/coreutils/2011-11/msg00018.html

For that reason I would be against adding such a feature.

Note that improving the field selection of `uniq` is appropriate, and would make DSU (decorate-sort-undecorate) solutions using sort easier to implement.

cheers,
Pádraig.
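As an illustration of the decorate-sort-undecorate (DSU) approach mentioned above, here is one possible pipeline that drops non-adjacent duplicates while keeping the original order. It is a sketch, not something proposed in this thread: the function name `dedup_dsu` is hypothetical, and it assumes text input plus `sort -t/-k`, `uniq -f`, and `cut -f` behaving as POSIX describes them.

```shell
# Hypothetical DSU pipeline: keep the first occurrence of each line, in order.
#   1. Decorate: prefix each line with its line number and a tab.
#   2. Sort by content, then by line number, so the earliest copy comes first.
#   3. uniq -f 1 skips the number field and keeps the first line of each group.
#   4. Sort numerically to restore the original order.
#   5. Undecorate: cut away the number field.
# LC_ALL=C forces byte-wise comparison so sort and uniq agree on equality.
dedup_dsu() {
    tab=$(printf '\t')   # not local: plain POSIX sh has no local variables
    awk '{ printf "%d\t%s\n", NR, $0 }' "$@" |
        LC_ALL=C sort -t "$tab" -k2 -k1,1n |
        LC_ALL=C uniq -f 1 |
        LC_ALL=C sort -n |
        cut -f 2-
}

# Example: same result as the awk idiom, at the cost of two sorts.
printf 'b\na\nb\na\nc\n' | dedup_dsu
```

The trade-off against the in-memory hash approach is the usual one: this version needs two sorts, but sort implementations can typically spill to temporary files instead of holding every distinct line in memory.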
bug#10287: [wishlist] uniq can remove non adjacent lines
Bob Proulx wrote:
> Stéphane Blondon wrote:
>> I think `uniq` should have an additional option (for example -a, --all) to remove identical lines that are not adjacent. The man page explains a workaround based on `sort`, but it can be complex to use.
>>
>> A few weeks ago, I had to `uniq`-ize random numbers and `sort` couldn't really work. Fortunately, the order was not important, so using `sort | uniq | sort --random-sort` was an acceptable solution. I imagine cases based on other tools like `top` could be a problem too.
>
> If you want to print only the first occurrence of each line, this perl one-liner will do it.
>
>     perl -lne 'print $_ if ! defined $a{$_}; $a{$_}=$_;'

Thanks, but with large files, isn't it better to store a constant rather than the full line?

    perl -lne 'print $_ if ! defined $seen{$_}; $seen{$_}=1'

(Actually, using 1 could be seen as misleading, since 0 would also work, as would undef if the test used `exists` rather than `defined`.)

I think you can drop the -l. I have a slight preference for this:

    perl -ne 'defined $seen{$_} or print; $seen{$_}=1'
bug#10287: [wishlist] uniq can remove non adjacent lines
Jim Meyering wrote:
> Bob Proulx wrote:
>> If you want to print only the first occurrence of each line, this perl one-liner will do it.
>>
>>     perl -lne 'print $_ if ! defined $a{$_}; $a{$_}=$_;'
>
> Thanks, but with large files, isn't it better to store a constant rather than the full line?
>
>     perl -lne 'print $_ if ! defined $seen{$_}; $seen{$_}=1'

Good point! I hadn't given it much thought, since it usually runs so quickly in my usage that I never worried about it.

> (Actually, using 1 could be seen as misleading, since 0 would also work, as would undef if the test used `exists` rather than `defined`.)
>
> I think you can drop the -l. I have a slight preference for this:
>
>     perl -ne 'defined $seen{$_} or print; $seen{$_}=1'

Referring to print vs. print $_ here: I have never liked implicit use of $_ and so I tend to avoid it. At one time there was a push in the perl community to make all uses explicit. And whether to use the 'if (expr) { stmt }', 'stmt if expr', or 'expr or stmt' form is a matter of taste. Might as well discuss the one true indention and brace styles. :-)

For one-liners I do tend to use short variable names to keep the line length minimized. In order to compact a line I also sacrifice whitespace when required.

But you have me thinking about conserving memory. If the file is large because its lines are long, then memory use will be proportionately large, because each distinct line is stored as a hash key. This could be reduced by using a digest of the line as the key instead of the entire line. But the savings depend on the average line size: if the average line is shorter than the digest (16 bytes for a raw MD5 digest), this would increase memory use.

    perl -MDigest::MD5=md5 -lne '$m=md5($_); print $_ if ! defined $a{$m}; $a{$m}=1'

If you are ever going to debug and print out the md5 value, substitute md5_hex for md5 to get a printable result.

Bob
bug#10287: [wishlist] uniq can remove non adjacent lines
Davide Brini wrote:
> Bob Proulx wrote:
>>     perl -lne 'print $_ if ! defined $a{$_}; $a{$_}=$_;'
>
> While we're at it, this is the typical awk way to do that:
>
>     awk '!a[$0]++'

I like it! I will definitely be using that awk idiom in the future. It is simple and concise.

Bob
bug#10287: [wishlist] uniq can remove non adjacent lines
Tool: uniq
Priority: wishlist

Hello,

I think `uniq` should have an additional option (for example -a, --all) to remove identical lines that are not adjacent. The man page explains a workaround based on `sort`, but it can be complex to use.

A few weeks ago, I had to `uniq`-ize random numbers and `sort` couldn't really work. Fortunately, the order was not important, so using `sort | uniq | sort --random-sort` was an acceptable solution. I imagine cases based on other tools like `top` could be a problem too.

If you are interested, I could try to provide a patch. (I have learnt C but I don't use it today.) I don't think the increase in memory use is a problem today, so a warning in the manpage should be enough.

Thanks for everything,
-- Stéphane
bug#10287: [wishlist] uniq can remove non adjacent lines
Stéphane Blondon wrote:
> I think `uniq` should have an additional option (for example -a, --all) to remove identical lines that are not adjacent. The man page explains a workaround based on `sort`, but it can be complex to use.
>
> A few weeks ago, I had to `uniq`-ize random numbers and `sort` couldn't really work. Fortunately, the order was not important, so using `sort | uniq | sort --random-sort` was an acceptable solution. I imagine cases based on other tools like `top` could be a problem too.

If you want to print only the first occurrence of each line, this perl one-liner will do it.

    perl -lne 'print $_ if ! defined $a{$_}; $a{$_}=$_;'

Bob