Re: [galaxy-dev] Select first/last N rows from grouped tabular files (e.g. top BLAST hits)

2011-05-19 Thread Peter Cock
On Thu, May 19, 2011 at 7:33 PM, madduri gal...@ci.uchicago.edu wrote:
> I wonder if somebody can give me more context around this issue..


On 3rd May I emailed IBI about their Galaxy install and one of
the (in house) tools mentioned on the workflow image here:
https://ibi.uchicago.edu/resources/galaxy/index.html

I recognised the NCBI BLAST+ tools but the Filter Top Blast
Results tool was new to me, so I asked what it did and whether it or
any of the other IBI tools would be available at the Galaxy Tool Shed:
http://community.g2.bx.psu.edu/

I had a reply from Alex Rodriguez (iBi/CI University of Chicago)
saying that they haven't put any of the wrappers on the Galaxy Tool
Shed yet as they are still being worked on. The IBI system
assigned the number [Galaxy #13918].

This thread, "Select first/last N rows from grouped tabular
files (e.g. top BLAST hits)", could have similarities to the
IBI Filter Top Blast Results tool, so I forwarded the email
to the IBI Galaxy email address to encourage you (e.g. Alex)
to comment on the thread. The IBI system assigned the
number [Galaxy #14246].

Peter
___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-dev] Select first/last N rows from grouped tabular files (e.g. top BLAST hits)

2011-05-18 Thread Peter Cock
On Tue, May 17, 2011 at 5:30 PM, Peter Cock p.j.a.c...@googlemail.com wrote:
> Hi all,
>
> I'm wondering if the following task can be done in Galaxy with the
> standard tools. The specific example is selecting the top (e.g. 3)
> match sequences for each blast query, but I see this problem as much
> more general than a "Select top BLAST hits" tool.
>
> ...
>
> Does this make sense? Does it seem like a useful tool to write if
> there isn't anything like this already present? Or might it be simpler
> to just write a "Select top BLAST hits" tool?

While I still think the above task could be useful in general, I am
now considering a general BLAST filter tool to offer this and some
other commonly used filters like a minimum coverage threshold
(which is possible with a filter on the extended tabular output, but
not trivial).
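
For illustration, a minimum coverage filter of the kind mentioned above
might look like the following Python sketch. It assumes per-HSP query
coverage, defined here as (|qend - qstart| + 1) / qlen; that definition,
the function name and the column-index parameters are all hypothetical
choices for this example, not taken from any existing Galaxy tool. In the
standard 12-column tabular output qstart and qend are columns 7 and 8
(counting from 1); the position of the query length column depends on
which extra columns were requested, so it is passed in explicitly.

```python
def filter_by_query_coverage(lines, min_coverage, qstart_col, qend_col, qlen_col):
    """Yield tab-separated BLAST lines whose HSP covers enough of the query.

    Column indices are 0-based and must be supplied by the caller, because
    the layout of the extended tabular output is an assumption here.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        qstart = int(fields[qstart_col])
        qend = int(fields[qend_col])
        qlen = int(fields[qlen_col])
        # Per-HSP coverage: fraction of the query spanned by this alignment.
        coverage = (abs(qend - qstart) + 1) / qlen
        if coverage >= min_coverage:
            yield line


# Hypothetical 5-column layout: query ID, match ID, qstart, qend, qlen.
example_lines = [
    "queryA\tmatch1\t1\t90\t100",   # covers 90% of the query
    "queryA\tmatch2\t11\t40\t100",  # covers 30% of the query
]
kept_lines = list(
    filter_by_query_coverage(
        example_lines, min_coverage=0.5, qstart_col=2, qend_col=3, qlen_col=4
    )
)
```

With a threshold of 0.5 only the first example line is kept. A
per-query or per-match coverage (merging several HSPs) would need the
grouping idea from the original message as well.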

Peter



[galaxy-dev] Select first/last N rows from grouped tabular files (e.g. top BLAST hits)

2011-05-17 Thread Peter Cock
Hi all,

I'm wondering if the following task can be done in Galaxy with the
standard tools. The specific example is selecting the top (e.g. 3)
match sequences for each blast query, but I see this problem as much
more general than a "Select top BLAST hits" tool.

I want to select the first few (e.g. 1) rows of each group in a
tabular file, where the grouping criterion is having certain columns
equal (e.g. the first 2).

For example, tabular BLAST output has columns for query ID, match ID, etc.:

queryA match1 ...
queryA match2 ...
queryA match2 ...
queryA match3 ...
queryA match4 ...
queryA match4 ...
queryA match4 ...
queryB match5 ...
queryB match5 ...
queryC match6 ...
queryC match7 ...

In this example, some of my queries have more than one HSP per match
(more than one line with the same first two columns). If I group on
the first two columns, the groups are:


queryA match1 ...

queryA match2 ...
queryA match2 ...

queryA match3 ...

queryA match4 ...
queryA match4 ...
queryA match4 ...

queryB match5 ...
queryB match5 ...

queryC match6 ...

queryC match7 ...


If I then take the first row in each group, that gives me just the
first HSP for each query+match combination.

queryA match1 ...
queryA match2 ...
queryA match3 ...
queryA match4 ...
queryB match5 ...
queryC match6 ...
queryC match7 ...
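
The step above can be sketched in Python as follows. This is only a
minimal illustration of the proposed behaviour, not an existing Galaxy
tool: the function name and its parameters are made up for the example,
and it assumes tab-separated input in which the rows of each group are
already adjacent (as they are in tabular BLAST output).

```python
def first_rows_per_group(lines, key_columns, n=1):
    """Keep the first n lines of each run of lines sharing the key columns.

    key_columns holds 0-based column indices. Rows of a group are assumed
    to be adjacent; the input is not sorted here.
    """
    kept = []
    current_key = None
    count = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        key = tuple(fields[i] for i in key_columns)
        if key != current_key:
            # A new group starts: reset the per-group row counter.
            current_key = key
            count = 0
        if count < n:
            kept.append(line)
        count += 1
    return kept


hits = [
    "queryA\tmatch1\t...",
    "queryA\tmatch2\t...",
    "queryA\tmatch2\t...",
    "queryA\tmatch3\t...",
    "queryA\tmatch4\t...",
    "queryA\tmatch4\t...",
    "queryA\tmatch4\t...",
    "queryB\tmatch5\t...",
    "queryB\tmatch5\t...",
    "queryC\tmatch6\t...",
    "queryC\tmatch7\t...",
]

# Group on the first two columns and keep one row (the first HSP) per group.
first_hsps = first_rows_per_group(hits, key_columns=(0, 1), n=1)
```

Here `first_hsps` holds the seven lines listed above, one per
query+match combination.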

If for example I wanted only the top 3 matches for each query, I could
run the proposed tool a second time on this output with different
settings, this time grouping on the first column only and keeping the
first 3 rows of each group:

queryA match1 ...
queryA match2 ...
queryA match3 ...
queryB match5 ...
queryC match6 ...
queryC match7 ...
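
Chaining the two passes could be sketched like this in Python, using
itertools.groupby and itertools.islice from the standard library. The
function name `head_per_group` is hypothetical, rows are represented as
tuples of already-split columns for brevity, and as before the rows of a
group are assumed to be adjacent.

```python
from itertools import groupby, islice


def head_per_group(rows, key_columns, n):
    """Lazily yield the first n rows of each run of adjacent rows whose
    key columns (0-based indices) are equal."""
    def key(row):
        return tuple(row[i] for i in key_columns)

    for _, group in groupby(rows, key=key):
        # islice stops after n rows; groupby skips the rest of the group.
        yield from islice(group, n)


rows = [
    ("queryA", "match1"),
    ("queryA", "match2"),
    ("queryA", "match2"),
    ("queryA", "match3"),
    ("queryA", "match4"),
    ("queryA", "match4"),
    ("queryA", "match4"),
    ("queryB", "match5"),
    ("queryB", "match5"),
    ("queryC", "match6"),
    ("queryC", "match7"),
]

# Pass 1: one row per query+match pair. Pass 2: at most 3 rows per query.
one_hsp_per_match = head_per_group(rows, key_columns=(0, 1), n=1)
top_matches = list(head_per_group(one_hsp_per_match, key_columns=(0,), n=3))
```

Because both passes are generators, the second pass consumes the first
one row at a time, so the intermediate result is never held in memory.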

I hope I've conveyed the idea here. The existing tools "Select first
lines from a dataset" and "Select last lines from a dataset" are
related, but do this at the file level.
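
A per-group analogue of the "last lines" case could be sketched along
the same lines. The following Python is only an illustration under the
same assumptions (adjacent group rows, a made-up function name); it uses
a bounded collections.deque so that only the last n rows of each group
are retained.

```python
from collections import deque
from itertools import groupby


def last_rows_per_group(rows, key_columns, n=1):
    """Return the last n rows of each run of adjacent rows whose key
    columns (0-based indices) are equal."""
    def key(row):
        return tuple(row[i] for i in key_columns)

    result = []
    for _, group in groupby(rows, key=key):
        # A deque with maxlen=n discards older rows as newer ones arrive.
        result.extend(deque(group, maxlen=n))
    return result


example_rows = [
    ("queryA", "match1"),
    ("queryA", "match2"),
    ("queryA", "match3"),
    ("queryB", "match5"),
]
last_per_query = last_rows_per_group(example_rows, key_columns=(0,), n=1)
```

Here `last_per_query` contains the final row for queryA and the single
row for queryB.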

Does this make sense? Does it seem like a useful tool to write if
there isn't anything like this already present? Or might it be simpler
to just write a "Select top BLAST hits" tool?

Peter