Hi Jacob,
Using the tool Get Data - UCSC Main table browser, you can retrieve
data directly using either gene symbols or locus positions.
A good track to query is UCSC Genes, if it is available for your
genome. RefSeq Genes is another good choice. Really, any track in
the group Gene and Gene Prediction Tracks is worth a look to see whether
it fits what you are interested in, as the content can vary between
genomes and even between builds. The specifics can be reviewed at UCSC
by clicking the describe table schema button (next to the table
selection; start with the default table).
To search multiple gene symbols, enter the list in the form under
identifiers. To search multiple loci, enter the list under region
(define regions). Both forms accept a text file. In Galaxy, cut the
relevant column out of the original file, format it the way the UCSC
form specifies, and download it as text (tabular). Or export it as
text from the Excel spreadsheet. 300 entries at once should be fine; I
believe the limit is around 1000 per query for each of these.
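If you would rather script this step than cut the columns by hand, here
is a rough Python sketch. It assumes the spreadsheet was exported as
tab-delimited text with a header row. The column names gene_id and locus
are my guesses, so adjust them to match your file. The helper names
(read_columns, batches) are made up for illustration, and the batch size
of 1000 simply mirrors the limit mentioned above, which you should
confirm on the UCSC form.

```python
import csv
import io


def read_columns(handle, gene_col="gene_id", locus_col="locus"):
    """Collect non-empty gene ids and loci from a tab-delimited export.

    The column names are assumptions; change them to match your header row.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    genes, loci = [], []
    for row in reader:
        gene = (row.get(gene_col) or "").strip()
        locus = (row.get(locus_col) or "").strip()
        if gene:
            genes.append(gene)
        if locus:
            loci.append(locus)
    return genes, loci


def batches(items, size=1000):
    """Split a list into chunks of at most `size` entries (assumed limit)."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Tiny made-up example in place of a real export file.
example = io.StringIO(
    "test_id\tgene_id\tlocus\n"
    "T1\tGeneA\tchr1:100-200\n"
    "T2\t\tchrX:1-222\n"
)
example_genes, example_loci = read_columns(example)
```

Each batch can then be written out one entry per line as plain text and
pasted or uploaded into the matching UCSC form.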
At this point in the query, the extract would pull only basic data from
the single primary table. To also pull out related information, change
the output format to selected fields from primary and related
tables and then click get output.
The next form is where you can link in additional tables of data. The
general idea is to add the table, then select the specific fields that
you want to include. Again, any of these tables can be reviewed before
the final query is made: go back to the first main form and click the
describe table schema button, or, once in that schema view, click
through the related tables to navigate. When you run the query this way,
the Table browser takes care of the relational joins for you, just as an
SQL query would.
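To give a feel for what such a join does behind the scenes, here is a toy
Python sketch using the built-in sqlite3 module. The table names (genes,
xref), the columns, and the rows are all placeholders invented for
illustration. They are not the actual UCSC schema, which you can check
with the describe table schema button.

```python
import sqlite3

# In-memory database with two made-up tables: a primary table holding
# positions and a related table holding symbols and descriptions.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE genes (name TEXT, chrom TEXT, txStart INTEGER, txEnd INTEGER);
    CREATE TABLE xref (id TEXT, symbol TEXT, description TEXT);
    INSERT INTO genes VALUES ('tx001', 'chr1', 100, 200),
                             ('tx002', 'chrX', 1, 222);
    INSERT INTO xref VALUES ('tx001', 'GeneA', 'placeholder description A'),
                            ('tx002', 'GeneB', 'placeholder description B');
    """
)


def select_fields(connection):
    """Return selected fields from the primary and the related table."""
    query = """
        SELECT g.name, g.chrom, g.txStart, g.txEnd, x.symbol, x.description
        FROM genes AS g
        JOIN xref AS x ON x.id = g.name
        ORDER BY g.name
    """
    return connection.execute(query).fetchall()


rows = select_fields(conn)
```

Picking fields from a linked table in the Table browser form plays the
same role as listing the x.symbol and x.description columns here.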
The Table browser is not your only option (flat text files and a MySQL
database are also available), but it is a web-based access point to the
information, and the results are easily imported into Galaxy or
downloaded for further analysis. Other types of queries are also
possible, at UCSC and in Galaxy; this is just the most direct one I know
of for your question and original data. For detailed questions about a
specific piece of data that you cannot locate, the support team for the
browser can almost certainly help. For more help with the UCSC Table
browser, these links are good places to start:
https://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html
https://genome.ucsc.edu/FAQ/FAQmaillist.html
One note: in your email, the locus position has a chromosome identifier
in the format Chr1. I am not sure whether this was intentional, but you
will need to format the identifiers to match those in the target
reference genome, just as they were in the original analysis. In
general, this means the format would be chrX instead (case matters). So
check and adjust the case and format to avoid problems; these really do
have to be an exact match. The same is true for gene names/symbols: if
something is missing, you can always search in the browser to see what
the format is and adjust. Also make sure that Excel does not output any
hidden characters (such as line wraps). Stick with plain text cells for
best results if you plan to use the data with external tools. You
probably know most of this, but just in case, I wanted to point out
where the gotchas could be. Even if you use gene names for this, you may
want to use the positions later on, so having identifiers in the correct
format from the start is a good idea.
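As a rough illustration of that cleanup, here is a small Python sketch.
The function name normalize_locus and the optional known_chroms check
are my own inventions. It assumes the target genome uses a lowercase
chr prefix and leaves the rest of the chromosome name untouched, so
please verify the result against the names actually used by your
reference genome.

```python
import re

# Optional "chr" prefix in any case, then the name, then start-end.
_LOCUS = re.compile(r"^(chr)?([^:]+):([0-9,]+)-([0-9,]+)$", re.IGNORECASE)


def normalize_locus(raw, known_chroms=None):
    """Return a locus as chrNAME:start-end with stray whitespace removed.

    Assumption: the target genome names chromosomes with a lowercase
    "chr" prefix. Only the prefix is changed; the rest of the name
    (for example the X in chrX) is kept exactly as given.
    """
    # Drop all whitespace, including line breaks hidden inside a cell.
    cleaned = "".join(raw.split())
    match = _LOCUS.match(cleaned)
    if match is None:
        raise ValueError(f"unrecognized locus: {raw!r}")
    _prefix, name, start, end = match.groups()
    chrom = "chr" + name
    if known_chroms is not None and chrom not in known_chroms:
        raise ValueError(f"{chrom} is not in the reference genome list")
    # Remove thousands separators that a spreadsheet may have added.
    start_pos = int(start.replace(",", ""))
    end_pos = int(end.replace(",", ""))
    return f"{chrom}:{start_pos}-{end_pos}"


cleaned_example = normalize_locus(" Chr1:1-222\r\n")
```

Passing the set of chromosome names from your reference genome as
known_chroms turns a silent mismatch into an error you can see early.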
Hopefully this gets you started!
Jen
Galaxy team
On 11/25/13 8:40 AM, Loupe, Jacob M. wrote:
I am very new to Galaxy. We have performed a comparative analysis
between the transcriptomes of different samples. We performed the
analysis using Galaxy software (TopHat, Cuffdiff, etc.). What my PI has
done is compile a list of all the genes differentially expressed
between each set, each in a separate Excel sheet. So what I have is an
Excel spreadsheet with a list (usually around 300) of test id, gene
id, and locus (ChrX:1-222). Initially, we have been
identifying each gene individually, one at a time, by pasting the
locus into the UCSC browser. This works, but is incredibly tedious.
There has to be a better way in Galaxy. I have tried making BED files
out of the loci, but so far I have been unable to identify genes using
Galaxy.
Can someone please explain how I can take my long list of loci and get
gene names, IDs, functions, and possibly some downstream comparative
ontologies to begin analyzing?
Like I said, very new to Galaxy and genomics.
Thanks very much