Hi George,
the count "21405/56478" refers to the number of genes that have
protein-coding transcripts, not to the number of sequences. A large
proportion of these genes have more than one transcript because Ensembl
models alternative splicing, and each coding transcript is exported as a
separate sequence. Hence you get far more sequences than genes. If you need
to know more about how Ensembl predicts genes and their transcripts,
please drop an email to [email protected]
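
To check this on the downloaded file, a minimal sketch along these lines may help. It counts FASTA entries and distinct gene IDs from the header lines. It assumes every header contains an Ensembl gene ID of the form ENSG followed by 11 digits. The function name and the sample headers are made up for illustration and are not real BioMart output.

```python
import re
from collections import Counter

# Assumption: each FASTA header carries an Ensembl gene ID such as
# "ENSG" + 11 digits somewhere in it; other header fields are ignored.
GENE_ID = re.compile(r"ENS[A-Z]*G\d{11}")


def count_entries_per_gene(lines):
    """Return (number of FASTA entries, Counter of entries per gene ID)."""
    entries = 0
    per_gene = Counter()
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        entries += 1
        match = GENE_ID.search(line)
        if match:
            per_gene[match.group(0)] += 1
    return entries, per_gene


# Made-up headers: two entries share one gene ID, as with several
# transcripts of the same gene.
sample = [
    ">GENE_A|example description|ENSG00000000001",
    "ATGGCC",
    ">GENE_A|example description|ENSG00000000001",
    "ATGGCA",
    ">GENE_B|example description|ENSG00000000002",
    "ATGAAA",
]
entries, per_gene = count_entries_per_gene(sample)
print(entries, len(per_gene))
```

For the real file, passing an open file handle instead of `sample` should work the same way. The number of distinct gene IDs is the figure to compare with "Count", while the number of entries reflects transcripts.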


a




On Thu, May 3, 2012 at 1:59 PM, George Gutman <[email protected]> wrote:

> I'm trying to collect all protein coding sequences from various species.
>
> I start here: http://www.biomart.org/biomart/martview/
>
> My selections are as follows:
>  Database: Ensembl Genes 66 (Sanger UK)
>  Dataset:  Homo sapiens genes (GRCh37.p6)
>  Filters: Gene Type: protein coding
>  Attributes: Sequences/Coding Sequences
>  Header: Associated Gene Name, Description, Ensembl Gene ID
>
> When I click on "Count" I get 21405/56478, which I interpret as 21405
> coding sequences that I should be recovering out of a total of 56478
> entries.
>
> I click "Results", "Unique Results Only", "Export" to "Compressed Web file
> (notify by email)", I enter my email address, then "Go".
>
> The resulting file I download is 111,827 Kb in size and it contains 98,024
> entries, many more than the 21,405 I expected.  (I determined this by
> doing a "search and replace" for ">").  The first entry is gene name
> "CYP26B1", and I find a total of five separate entries with this name and
> with identical gene descriptions and Ensembl Gene IDs.  The five
> sequences, however, are different, although three of them are identical
> through the first few lines.
>
> So what am I doing wrong?  When I went through the same steps for E. coli
> K12 I recovered a file with 4258 sequences, the number I expected based
> on the output of "Count".
>
> Thanks,
>     George Gutman
>
> Cheers,
>
>     George Gutman
> *************************************************************************
> *    George A. Gutman, Professor Emeritus                               *
> *    Department of Microbiology  and Molecular Genetics                 *
> *    University of California, Irvine                                   *
> *    Office:  (949)824-6593             B250, Med Sci                   *
> *    (714)552-1242 (cell)               e-mail: [email protected]        *
> * http://www.ucihs.uci.edu/microbio/facultyResearch/faculty/gutman.html *
> *************************************************************************
>
> On Thu, 3 May 2012, Arek Kasprzyk wrote:
>
> > Date: Thu, 3 May 2012 12:55:22 -0400
> > From: Arek Kasprzyk <[email protected]>
> > To: George Gutman <[email protected]>
> > Cc: [email protected]
> > Subject: Re: [BioMart Users] Where to address questions
> >
> > Dear George,
> > please feel free to post here and we'll try to help you. Please give us
> > more details on your problem, the portal that you are using etc
> >
> >
> > a
> >
> > On Wed, May 2, 2012 at 2:30 AM, George Gutman <[email protected]> wrote:
> >
> > > I'm trying to download collections of protein coding regions from
> > > various genomes and am having trouble.  Where can I address my
> > > questions?  Do I need to be a member of this list to post here?
> > >
> > >     George Gutman
> > > _______________________________________________
> > > Users mailing list
> > > [email protected]
> > > https://lists.biomart.org/mailman/listinfo/users
> > >
> >
>



-- 
Arek Kasprzyk, MD, MSc, PhD
BioMart Project Lead
www.biomart.org
