[galaxy-user] Identifying Genes

2013-11-25 Thread Loupe, Jacob M.
I am very new to Galaxy. We have performed a comparative analysis of the 
transcriptomes of different samples using tools in Galaxy (Tophat, CuffDiff, 
etc.). My PI has compiled a list of all the genes differentially expressed 
between each set, each in a separate Excel sheet. So what I have is an Excel 
spreadsheet with a list (usually around 300 rows) of test id, gene id, and 
locus (ChrX:1-222). Until now, we have been identifying each gene 
individually, one at a time, by pasting the locus into the UCSC browser. This 
works, but is incredibly tedious. There has to be a better way in Galaxy. I 
have tried making BED files out of the loci, but so far I have been unable to 
identify genes using Galaxy.

Can someone please explain how I can take my long list of loci and get gene 
names, IDs, function, and possibly some downstream comparative ontologies to 
begin analyzing?
Like I said, I am very new to Galaxy and genomics.

Thanks very much

___
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using reply all in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

  http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:

  http://galaxyproject.org/search/mailinglists/

Re: [galaxy-user] Identifying Genes

2013-11-25 Thread Jennifer Jackson

Hi Jacob,

Using the tool Get Data - UCSC Main table browser, data can be 
retrieved directly using either gene symbols or locus positions.


A good track to query is UCSC Genes, if available for your genome. 
RefSeq Genes is another good choice. But really any track in the group 
Gene and Gene Prediction Tracks is worth a look to see if it fits what 
you are interested in, as the content can vary between genomes and even 
builds. The specifics can be reviewed at UCSC by clicking the describe 
table schema button (next to the table selection; start with the 
default table).


To search multiple gene symbols, enter the list in the form under 
identifiers. To search multiple loci, enter the list under region 
(define regions). Both accept a text file. Either cut the relevant 
column out of the original file in Galaxy and download it as text 
(tabular), or export it as text from the Excel spreadsheet, formatted 
the way the UCSC form states. 300 entries should be fine at once; I 
believe the limits are around 1000 per query for each of these.
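
As a rough illustration of preparing that text, here is a minimal Python 
sketch that checks each locus and groups the lines into batches for 
pasting. It assumes the loci look like the chrX:1-222 example in this 
thread and that the form accepts positions in that same form; the limit 
of 1000 is only the approximate figure mentioned above, and the function 
names are made up for this example.

```python
import re

# Pattern for loci written like "chrX:1-222" (the format quoted in this thread).
LOCUS_RE = re.compile(r"^(\S+):(\d+)-(\d+)$")


def parse_locus(locus):
    """Split 'chrX:1-222' into ('chrX', 1, 222); raise ValueError otherwise."""
    # Strip surrounding whitespace and thousands separators such as 1,000.
    cleaned = locus.strip().replace(",", "")
    match = LOCUS_RE.match(cleaned)
    if match is None:
        raise ValueError("unrecognized locus: %r" % locus)
    chrom, start, end = match.groups()
    return chrom, int(start), int(end)


def batches(items, size=1000):
    """Yield successive slices holding at most `size` items each."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def region_batches(loci, size=1000):
    """Return lists of 'chrom:start-end' lines, one list per paste."""
    lines = []
    for locus in loci:
        chrom, start, end = parse_locus(locus)
        lines.append("%s:%d-%d" % (chrom, start, end))
    return list(batches(lines, size))
```

Each inner list returned by region_batches can be joined with newlines 
and pasted or uploaded as one query. A malformed locus raises an error 
instead of being passed through silently.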


At this point in the query, the extract would pull only basic data from 
the single primary table. To also pull out related information, change 
the output file type to "selected fields from primary and related 
tables" and then click on "get output".


The next form is where you can link in additional tables of data. The 
general idea is to add the table, then select the specific fields that 
you want to include. Again, any of these can be reviewed before the 
final query is made using the first main form and then the describe 
table schema button, or once in that describe view, by clicking on 
related tables to navigate. When doing the query this way, the Table 
browser takes care of the relational joins for you, just as an SQL query 
would.
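
To make the idea of a relational join concrete, here is a small Python 
sketch using an in-memory SQLite database. The table and column names 
are modeled on the UCSC knownGene and kgXref tables, but treat them as 
assumptions and confirm them with describe table schema for your genome. 
The rows are placeholders, not real annotations, and the overlap test 
glosses over UCSC coordinate conventions.

```python
import sqlite3

# In-memory database holding two toy tables for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE knownGene (name TEXT, chrom TEXT, txStart INTEGER, txEnd INTEGER);
CREATE TABLE kgXref (kgID TEXT, geneSymbol TEXT, description TEXT);
-- Placeholder rows for illustration only, not real annotations.
INSERT INTO knownGene VALUES ('tx0001', 'chrX', 100, 500);
INSERT INTO knownGene VALUES ('tx0002', 'chrX', 900, 1200);
INSERT INTO kgXref VALUES ('tx0001', 'GENE_A', 'placeholder description A');
INSERT INTO kgXref VALUES ('tx0002', 'GENE_B', 'placeholder description B');
""")


def genes_in_region(conn, chrom, start, end):
    """Return (id, symbol, description) for transcripts overlapping a region."""
    query = """
        SELECT g.name, x.geneSymbol, x.description
        FROM knownGene AS g
        JOIN kgXref AS x ON x.kgID = g.name
        WHERE g.chrom = ? AND g.txStart < ? AND g.txEnd > ?
        ORDER BY g.txStart
    """
    # A transcript overlaps when it starts before the region ends
    # and ends after the region starts.
    return conn.execute(query, (chrom, end, start)).fetchall()
```

Calling genes_in_region(conn, "chrX", 1, 222) joins the primary table to 
the related one on the shared identifier, which is the kind of linking 
the Table browser performs when related tables are added to the output.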


For more help with the UCSC Table browser, the links below are good 
places to start. For detailed questions about a specific piece of data 
that you cannot locate, the support team for the browser can almost 
certainly help. The Table browser is not your only option (flat text 
files and a MySQL database are also available), but it is a web-based 
access point to the information, and its output is easily imported into 
Galaxy or downloaded for further analysis. Other types of queries are 
possible, at UCSC and in Galaxy; this is just the most direct one I 
know of for your question and original data:

https://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html
https://genome.ucsc.edu/FAQ/FAQmaillist.html

One note: the locus position in your email has a chromosome identifier 
in the format ChrX. I am not sure if this was intentional or not, but 
you will need to format the identifiers to match those in the target 
reference genome, just as they were in the original analysis. In 
general, this means the format would be chrX instead (case matters). 
So, check and adjust the case and format to avoid problems; these 
really do have to be an exact match. The same is true for gene 
names/symbols - if something is missing, you can always search in the 
browser to see what the format is and adjust. Also make sure that Excel 
does not output any hidden characters (line wraps) - stick with plain 
text cells for best results if you plan to output/use the data with 
external tools. You probably know most of this, but I wanted to point 
out where the gotchas could be, just in case. Even if you use gene 
names for this query, you may want to use the positions later on, so 
having identifiers in the correct format from the start is a good idea.
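
A cleanup along these lines could be sketched in Python as below. It is 
only an illustration: the function names are invented here, the list of 
hidden characters is a guess at what a spreadsheet export might leave 
behind, and only the leading "chr" prefix is changed, since the rest of 
the name has to match whatever the reference genome uses.

```python
import re


def clean_cell(text):
    """Remove characters that a spreadsheet export can leave inside a cell."""
    # Drop embedded line breaks and tabs, such as those from wrapped cells.
    text = re.sub(r"[\r\n\t]+", "", text)
    # Drop non-breaking spaces, zero-width spaces, and byte-order marks.
    text = re.sub("[\u00a0\u200b\ufeff]", "", text)
    return text.strip()


def fix_chrom_prefix(locus):
    """Rewrite a leading 'Chr' or 'CHR' as 'chr'; leave the rest unchanged."""
    return re.sub(r"^chr", "chr", clean_cell(locus), flags=re.IGNORECASE)
```

For example, fix_chrom_prefix("ChrX:1-222") gives "chrX:1-222". The 
result should still be checked against the chromosome names of the 
genome build used in the original analysis.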


Hopefully this gets you started!

Jen
Galaxy team
