Re: [BioMart Users] Problem with downloading coding sequences

Rhoda Kinsella Fri, 04 May 2012 01:00:20 -0700

Hi George

Unfortunately it is not possible to select the canonical transcript from the Ensembl marts at present. It is possible to do this using the Ensembl Perl API and if you send a ticket to [email protected] we will be able to help you retrieve the information you need.

Regards
Rhoda
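
For readers who already have the downloaded FASTA file, the "one sequence per gene" request raised in this thread can be approximated by post-processing. The following is a minimal Python sketch, not an Ensembl or BioMart method: it keeps the longest coding sequence for each gene, which is only a heuristic and may not match the transcript Ensembl marks as canonical. It assumes the header fields are separated by "|" and that the gene ID is the third field (gene name, description, gene ID); the function names, the gene_field parameter, and the GENE_A/GENE_B identifiers are hypothetical.

```python
def parse_fasta(fasta_text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_per_gene(fasta_text, gene_field=2, sep="|"):
    """Keep only the longest sequence seen for each gene ID.

    gene_field and sep are assumptions about the header layout;
    adjust them to match the header attributes that were exported.
    """
    best = {}
    for header, seq in parse_fasta(fasta_text):
        fields = header.split(sep)
        # Fall back to the whole header if the expected field is missing.
        gene_id = fields[gene_field].strip() if gene_field < len(fields) else header
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (header, seq)
    return best


# Made-up records for illustration only.
sample = (
    ">CYP26B1|example description|GENE_A\n"
    "ATGAAA\n"
    "TTT\n"
    ">CYP26B1|example description|GENE_A\n"
    "ATG\n"
    ">OTHER|another description|GENE_B\n"
    "ATGCCC\n"
)
result = longest_per_gene(sample)
for gene_id, (header, seq) in result.items():
    print(gene_id, len(seq))
```

Ties are resolved in favour of the first record encountered. Selecting the true canonical transcript still requires the Ensembl Perl API route described above.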



On 4 May 2012, at 00:46, George Gutman wrote:

Arek,

Thanks, that makes sense, I had imagined that the choice of "Unique
results" would have avoided this duplication. But this creates a problem
for me, since there will be a large amount of redundancy in the sequences
I recover, and I require a collection of unique sequences even at the cost
of the database being less than complete. Is there a way for me to avoid
this redundancy by requesting only a single transcript for each gene,
perhaps the longest or the most abundant?

I am requesting the email help you suggest.

Thanks,
    George

On Thu, 3 May 2012, Arek Kasprzyk wrote:
Hi George,
the count "21405/56478" refers to the number of genes that have protein coding
transcripts. A large proportion of these genes will have more than one
transcript thanks to the alternative splicing model that Ensembl supports.
Hence you will be getting far more transcripts than genes. If you need to
know more about how Ensembl predicts genes and corresponding transcripts
please drop an email to [email protected]

a
On Thu, May 3, 2012 at 1:59 PM, George Gutman <[email protected]> wrote:
I'm trying to collect all protein coding sequences from various species.
I start here: http://www.biomart.org/biomart/martview/

My selections are as follows:
Database: Ensembl Genes 66 (Sanger UK)
Dataset:  Homo sapiens genes (GRCh37.p6)
Filters: Gene Type: protein coding
Attributes: Sequences/Coding Sequences
Header: Associated Gene Name, Description, Ensembl Gene ID
When I click on "Count" I get 21405/56478, which I interpret as 21405
coding sequences that I should be recovering out of a total of 56478
entries.
I click "Results", "Unique Results Only", "Export" to "Compressed Web file
(notify by email)", I enter my email address, then "Go".
The resulting file I download is 111,827 Kb in size and it contains 98,024
entries, many more than the 21,405 I expected. (I determined this by
doing a "search and replace" for ">").  The first entry is gene name
"CYP26B1", and I find a total of five separate entries with this name and
with identical gene descriptions and Ensembl Gene IDs.  The five
sequences, however, are different, although three of them are identical
through the first few lines.
So what am I doing wrong? When I went through the same steps for E. coli
K12 I recovered a file with 4258 sequences, the number I expected based
on the output of "Count".

Thanks,
   George Gutman
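
The mismatch described in this message (many more FASTA entries than counted genes) can be checked directly by counting header lines and distinct gene IDs. The following is a minimal Python sketch under stated assumptions: headers use "|" as the field separator and the gene ID is the third field, matching the header attributes listed above (gene name, description, gene ID). The function name, the gene_field parameter, and the GENE_A/GENE_B identifiers are hypothetical.

```python
from collections import Counter


def count_fasta_entries(fasta_text, gene_field=2, sep="|"):
    """Return (total records, Counter of records per gene ID).

    gene_field and sep are assumptions about the header layout;
    change them if the exported header attributes differ.
    """
    per_gene = Counter()
    records = 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            records += 1
            fields = line[1:].split(sep)
            # Fall back to the whole header if the expected field is missing.
            if gene_field < len(fields):
                gene_id = fields[gene_field].strip()
            else:
                gene_id = line[1:].strip()
            per_gene[gene_id] += 1
    return records, per_gene


# Made-up records for illustration only.
example = (
    ">CYP26B1|example description|GENE_A\n"
    "ATGCTC\n"
    ">CYP26B1|example description|GENE_A\n"
    "ATGCCC\n"
    ">OTHER|another description|GENE_B\n"
    "ATGAAA\n"
)
records, per_gene = count_fasta_entries(example)
print("records:", records)
print("distinct genes:", len(per_gene))
print("genes with more than one entry:",
      [g for g, n in per_gene.items() if n > 1])
```

On a real export, a record count well above the distinct gene count indicates that several transcripts per gene are being returned.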

Cheers,

   George Gutman
*************************************************************************
* George A. Gutman, Professor Emeritus *
* Department of Microbiology and Molecular Genetics *
* University of California, Irvine *
* Office: (949)824-6593 B250, MedSci *
* (714)552-1242 (cell) e-mail: [email protected] *
* http://www.ucihs.uci.edu/microbio/facultyResearch/faculty/gutman.html *
*************************************************************************

On Thu, 3 May 2012, Arek Kasprzyk wrote:
Date: Thu, 3 May 2012 12:55:22 -0400
From: Arek Kasprzyk <[email protected]>
To: George Gutman <[email protected]>
Cc: [email protected]
Subject: Re: [BioMart Users] Where to address questions

Dear George,
please feel free to post here and we'll try to help you. Please give us
more details on your problem, the portal that you are using etc


a
On Wed, May 2, 2012 at 2:30 AM, George Gutman <[email protected]> wrote:
I'm trying to download collections of protein coding regions from various
genomes and am having trouble. Where can I address my questions? Do I
need to be a member of this list to post here?

   George Gutman
_______________________________________________
Users mailing list
[email protected]
https://lists.biomart.org/mailman/listinfo/users
--
Arek Kasprzyk, MD, MSc, PhD
BioMart Project Lead
www.biomart.org


Rhoda Kinsella Ph.D.
Ensembl Production Project Leader,
European Bioinformatics Institute (EMBL-EBI),
Wellcome Trust Genome Campus,
Hinxton
Cambridge CB10 1SD,
UK.

