Hello Bert,

The problem you had with the other workflow most likely came from using BED12 format instead of BED3 or BED6. A BED12 line can represent one or more regions (blocks), whereas a BED3 to BED6 line represents a single region. This distinction matters for the "Operate on Genomic Intervals" tools, which require a single region per line.
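
To make the BED12 versus BED6 difference concrete, here is a minimal Python sketch (not something Galaxy or UCSC runs) that expands the BED12 line from your message into one BED6-style line per block. The function name and the "_block_N" naming are hypothetical, chosen only for illustration; UCSC's own exon output uses its own naming.

```python
def bed12_to_bed6(line):
    """Expand one BED12 line into one BED6-style tuple per block (exon)."""
    fields = line.split()
    chrom, chrom_start = fields[0], int(fields[1])
    name, score, strand = fields[3], fields[4], fields[5]
    # Columns 11 and 12 hold comma-separated block sizes and block starts;
    # block starts are offsets relative to chromStart.
    sizes = [int(s) for s in fields[10].split(",") if s]
    offsets = [int(s) for s in fields[11].split(",") if s]
    rows = []
    for i, (size, offset) in enumerate(zip(sizes, offsets)):
        begin = chrom_start + offset
        # "_block_N" is a hypothetical label, not UCSC's exon naming.
        rows.append((chrom, begin, begin + size, f"{name}_block_{i}", score, strand))
    return rows


# The BED12 line quoted in the original message, rejoined onto one line.
bed12 = (
    "chr1 176432306 176811970 NM_020318 0 + 176525458 176811590 0 23 "
    "248,1835,1072,146,294,193,122,490,129,92,194,147,136,217,172,178,"
    "214,169,136,110,72,99,455, "
    "0,92236,131353,207799,226966,228955,232567,235929,239436,243188,"
    "246812,248664,276455,276809,302495,306436,307796,326638,328176,"
    "330389,336890,377002,379209,"
)

rows = bed12_to_bed6(bed12)
for row in rows[:3]:
    print(*row, sep="\t")
```

Each resulting line describes exactly one interval, which is the shape the interval tools expect.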

There are two basic ways to do this, and both require re-running the query so that the RefSeq output is in BED6, that is, a single region (interval) per line. The second is more direct, as the filtering is done only once, in Galaxy, but I will break down both and you can choose. Other manipulations to isolate the name or do counts once the data is all in one file will be similar to the workflow from James.

You do not need to run these steps as a workflow the first time, while you are sorting out the parameters. Instead, run the steps, evaluate and tune as needed, then create a workflow from the history for future queries. BED and Interval formats are very similar; a description of Interval is here:
https://wiki.galaxyproject.org/Learn/Datatypes#Interval

Method 1:
1. Re-run the query as you have already performed it, but select "Exons" as output instead of "Whole gene". This will result in one line of output for each match, and possibly multiple lines per RefSeq if more than one input query coordinate region overlaps it. The "name" field will be annotated with the RefSeq identifier and the exon name (which you can break up/simplify later using tools from the group "Text Manipulation").
2. With both datasets in Galaxy (the query bed file and the output from #1), double check that metadata assignments are correct by clicking on the pencil icon. (chrom, start, end, name, strand)
https://wiki.galaxyproject.org/Learn/Managing%20Datasets#Dataset_Icons_.26_Text
3. Now run the "Operate on Genomic Intervals -> Join" tool - most likely with "overlap=1" and "inner join" settings, but review the options and decide.
4. This places all of your data in a single file, both intervals side-by-side. From here you can cut out columns, do counts (tool "Group"), etc.
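
As a rough illustration of what an inner join on coordinate overlap produces, here is a small Python sketch. It is not the Galaxy tool's implementation: it compares every pair of intervals, ignores strand, and assumes BED-style zero-based, half-open coordinates. The exon records and their names are made-up placeholders; only the query line comes from your message.

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals (0 if none)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def join_intervals(left, right, min_overlap=1):
    """Inner join: emit left+right side by side for every overlapping pair."""
    joined = []
    for a in left:
        for b in right:
            same_chrom = a[0] == b[0]
            if same_chrom and overlap_bp(a[1], a[2], b[1], b[2]) >= min_overlap:
                joined.append(a + b)
    return joined


# Query line taken from the original message.
query = [("chr2", 2723752, 2723777, "seqid6354405", "0", "-")]

# Hypothetical exon intervals, for illustration only.
exons = [
    ("chr2", 2723700, 2723800, "EXAMPLE_exon_A", "0", "-"),
    ("chr2", 2730000, 2730100, "EXAMPLE_exon_B", "0", "-"),
    ("chr1", 2723700, 2723800, "EXAMPLE_exon_C", "0", "+"),
]

joined = join_intervals(query, exons)
for row in joined:
    print(*row, sep="\t")
```

Because each output row carries the query name and the exon name together, the question of which input line produced which output line is answered by the row itself.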

Method 2:
1. Instead of running the initial query at UCSC with the first bed file as a filter for the RefSeq dataset, run the query without a filter and just extract all RefSeq exons into Galaxy.
2. Make sure both datasets are loaded and double check the metadata.
3. Run the "Join" tool again to merge the two datasets based on coordinate overlap as above.
4. Rearrange/continue as wanted. This includes isolating the RefSeq name and merging it back with any other dataset that includes that same RefSeq name, with the other "Join" tool in the group "Join, Subtract and Group".
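
The follow-up manipulations (isolating the RefSeq name, counting, and joining on a shared key) can be sketched in Python as below. This is only an illustration of the logic, not what the Galaxy tools do internally. It assumes the exon names begin with the RefSeq accession followed by an "_exon_..." suffix; check your own output before relying on that pattern. All identifiers and annotation values in the sample data are invented.

```python
import re
from collections import Counter

# Assumption: the name field starts with an accession such as NM_123456.
ACCESSION = re.compile(r"^([NX][MR]_\d+)")


def refseq_accession(exon_name):
    """Return the leading RefSeq accession, or the name unchanged if absent."""
    match = ACCESSION.match(exon_name)
    return match.group(1) if match else exon_name


# Hypothetical (query name, exon name) pairs cut from a joined dataset.
joined = [
    ("seqidA", "NM_000001_exon_0_0_chr1_100_f"),
    ("seqidB", "NM_000001_exon_1_0_chr1_500_f"),
    ("seqidC", "NR_000002_exon_0_0_chr2_50_r"),
]

# Isolate the RefSeq name (comparable to a "Text Manipulation" step).
pairs = [(query, refseq_accession(name)) for query, name in joined]

# Count query regions per RefSeq (comparable to the "Group" tool).
counts = Counter(accession for _, accession in pairs)

# Join on a shared key rather than on coordinates.
annotations = {"NM_000001": "hypothetical annotation"}
keyed = [(q, acc, annotations[acc]) for q, acc in pairs if acc in annotations]

print(counts)
print(keyed)
```

Note that the last step matches on the name only, which is the key-based kind of join, as opposed to the coordinate-based join used in step 3.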

When running the query, be sure to use the correct "Join" tool at each step. One matches on common keys (a "name") and the other on overlapping coordinates. Be sure to use the one in the group "Operate on Genomic Intervals" for the first part of your query. We have a couple of tutorials that demonstrate how these tools can be used, along with how to extract a workflow.
Galaxy101: https://usegalaxy.org/u/aun1/p/galaxy101
UsingGalaxy, Protocol1: https://usegalaxy.org/u/galaxyproject/p/using-galaxy-2012

Hopefully this helps,

Jen
Galaxy team

On 12/23/13 6:41 PM, Gold, Bert (NIH/NCI) [E] wrote:
Hi!

Having provided a name (field 4) in a UCSC bed file ( 
http://www.genome.ucsc.edu/FAQ/FAQformat.html#format1 ) and sought a RefSeq 
name using the UCSC Table Browser ( http://www.genome.ucsc.edu/cgi-bin/hgTables 
), I would now like to recover which line of the bed file delivered which line 
of the output file... However, I am told I need Galaxy to provide a workflow to 
do this.  Can anyone explain how?  eg, one line of my bedfile looks like:
chr2    2723752 2723777 seqid6354405    0       -
and one line of my intersected table browser output looks like:
chr1    176432306       176811970       NM_020318       0       +       176525458       176811590       0       23      248,1835,1072,146,294,193,122,490,129,92,194,147,136,217,172,178,214,169,136,110,72,99,455,     0,92236,131353,207799,226966,228955,232567,235929,239436,243188,246812,248664,276455,276809,302495,306436,307796,326638,328176,330389,336890,377002,379209,

Clearly the first line of my bed doesn't correspond to the first line of my 
intersection output, but as my bed is long, what reference can I use to 
unambiguously identify which line of output the first line of my intersection 
corresponds to?  How do I do this in Galaxy?

PS - I tried this workflow earlier today without success, aiming to achieve a 
similar objective:  v

PPS- I also note similar issues were raised in this discussion, with Galaxy 
promoted as the solution, but with no real details about how to achieve the 
desired results:
http://redmine.soe.ucsc.edu/forum/index.php?t=msg&goto=10615&S=0d1b303e6dfdceaf3b240804fd0f52aa

Bert Gold, Ph.D., FACMG
Staff Scientist
NCI-Frederick
Frederick, MD 21702
VOICE: 301-846-5098
EMAIL: go...@mail.nih.gov
___________________________________________________________
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using "reply all" in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

   http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

   http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:

   http://galaxyproject.org/search/mailinglists/


--
Jennifer Hillman-Jackson
http://galaxyproject.org
