Re: uniq --accumulate

2012-02-07 Thread Pádraig Brady
On 02/07/2012 03:56 PM, Peng Yu wrote:
 Hi,
 
 Suppose that I have a table of the following, where the last column is
 a number. I'd like to accumulate the number of rows that are the same
 for all the remaining columns.
 
 A 1
 A 3
 X 2
 X 3
 Y 3
 
 The result will be the following. Although this is can be easily done
 by awk, etc, I'm wondering if it is worthwhile to add an additional
 option '--accumulate' to uniq. I'd recommend adding such an option if
 anybody in coreutils develop team also think this is useful.
 
 A 4
 X 5
 Y 3
 

Thanks for the suggestion,
but this is too specialized for coreutils I think.

cheers,
Pádraig.



Re: uniq --accumulate

2012-02-07 Thread Assaf Gordon
Pádraig Brady wrote, On 02/07/2012 11:00 AM:
 On 02/07/2012 03:56 PM, Peng Yu wrote:

 Suppose that I have a table of the following, where the last column is
 a number. I'd like to accumulate the number of rows that are the same
 for all the remaining columns.

 
 Thanks for the suggestion,
 but this is too specialized for coreutils I think.

Slightly off-topic for coreutils,

but a package called BEDTools ( http://code.google.com/p/bedtools/ ) provides 
a program called groupBy, which does exactly that, and more.

Akin to SQL's group by command, the program can group a text file by a 
specified column, and perform operations (count,sum,mean,median,etc.) on 
another column.

-gordon




Re: uniq --accumulate

2012-02-07 Thread Peng Yu
On Tue, Feb 7, 2012 at 10:21 AM, Assaf Gordon assafgor...@gmail.com wrote:
 Pádraig Brady wrote, On 02/07/2012 11:00 AM:
 On 02/07/2012 03:56 PM, Peng Yu wrote:

 Suppose that I have a table of the following, where the last column is
 a number. I'd like to accumulate the number of rows that are the same
 for all the remaining columns.


 Thanks for the suggestion,
 but this is too specialized for coreutils I think.

 Slightly off-topic for coreutils,

 but a package called BEDTools ( http://code.google.com/p/bedtools/ ) 
 provides a program called groupBy, which does exactly that, and more.

 Akin to SQL's group by command, the program can group a text file by a 
 specified column, and perform operations (count,sum,mean,median,etc.) on 
 another column.

OK. This is what I'm looking for. The latest groupBy is from filo. I
think that the field separator is not documented. My guess is that it
is '\t'. But users can not change it. Would you please document it and
add an option so that users can change it? Thanks!

~$ groupBy -h

Program: groupBy (v1.1.0)
Authors: Aaron Quinlan (aaronquin...@gmail.com)
 Assaf Gordon
Summary: Summarizes a dataset column based upon
 common column groupings. Akin to the SQL group by command.

Usage:   groupBy -i [FILE] -g [group_column(s)] -c [op_column(s)] -o [ops]
 cat [FILE] | groupBy -g [group_column(s)] -c [op_column(s)] -o [ops]

Options:
-i  Input file. Assumes stdin if omitted.

-g -grp Specify the columns (1-based) for the grouping.
The columns must be comma separated.
- Default: 1,2,3

-c -opCols  Specify the column (1-based) that should be summarized.
- Required.

-o -ops Specify the operation that should be applied to opCol.
Valid operations:
sum, count, min, max,
mean, median, mode, antimode,
stdev, sstdev (sample standard dev.),
collapse (i.e., print a comma separated list 
(duplicates allowed)),
distinct (i.e., print a comma separated list (NO 
duplicates allowed)),
concat   (i.e., merge values into a single, 
non-delimited string),
freqdesc (i.e., print desc. list of values:freq)
freqasc (i.e., print asc. list of values:freq)
- Default: sum

-full   Print all columns from input file.
Default: print only grouped columns.

-inheader   Input file has a header line - the first line will be 
ignored.

-outheader  Print header line in the output, detailing the column 
names.
If the input file has headers (-inheader), the output 
file
will use the input's column names.
If the input file has no headers, the output file
will use col_1, col_2, etc. as the column names.

-header same as '-inheader -outheader'

-ignorecase Group values regardless of upper/lower case.

Examples:
$ cat ex1.out
chr1 10  20  A   chr115  25  B.1 1000ATAT
chr1 10  20  A   chr125  35  B.2 1   CGCG

$ groupBy -i ex1.out -g 1,2,3,4 -c 9 -o sum
chr1 10  20  A   11000

$ groupBy -i ex1.out -grp 1,2,3,4 -opCols 9,9 -ops sum,max
chr1 10  20  A   11000   1

$ groupBy -i ex1.out -g 1,2,3,4 -c 8,9 -o collapse,mean
chr1 10  20  A   B.1,B.2,5500

$ cat ex1.out | groupBy -g 1,2,3,4 -c 8,9 -o collapse,mean
chr1 10  20  A   B.1,B.2,5500

$ cat ex1.out | groupBy -g 1,2,3,4 -c 10 -o concat
chr1 10  20  A   ATATCGCG

Notes:
(1)  The input file/stream should be sorted/grouped by the -grp. columns
(2)  If -i is unspecified, input is assumed to come from stdin.


-- 
Regards,
Peng



Re: uniq --accumulate

2012-02-07 Thread Peng Yu
        (1)  The input file/stream should be sorted/grouped by the -grp. 
 columns

Do you actually mean that the rows are treated just like uniq does?
I.e, only adjacent rows are subjected to grouping? If so, the document
should be reworded to reflect this meaning.

-- 
Regards,
Peng



Re: uniq --accumulate

2012-02-07 Thread Peng Yu
        -o -ops         Specify the operation that should be applied to opCol.
                        Valid operations:
                            sum, count, min, max,
                            mean, median, mode, antimode,
                            stdev, sstdev (sample standard dev.),
                            collapse (i.e., print a comma separated list 
 (duplicates allowed)),
                            distinct (i.e., print a comma separated list (NO 
 duplicates allowed)),
                            concat   (i.e., merge values into a single, 
 non-delimited string),
                            freqdesc (i.e., print desc. list of values:freq)
                            freqasc (i.e., print asc. list of values:freq)
                        - Default: sum

I'm not sure that I understand this option. I assume that if I don't
specify it I should by default have the sum option. By the following
test case demonstrates that it is not the case.


~$ ./main.sh
groupBy -g 1,2,3 -c 4 -i input1.txt 2/dev/null
groupBy -g 1,2,3 -c 4 -o sum -i input1.txt
chr1-   1   3
chr1+   1   2

~$ cat input1.txt
chr1-   1   1
chr1-   1   2
chr1+   1   2
~$ cat main.sh
#!/usr/bin/env bash

set -v
groupBy -g 1,2,3 -c 4 -i input1.txt 2/dev/null
groupBy -g 1,2,3 -c 4 -o sum -i input1.txt


-- 
Regards,
Peng