Arek,

Thanks, that makes sense.  I had assumed that choosing "Unique Results
Only" would avoid this duplication.  It does create a problem for me,
though: the sequences I recover will be highly redundant, and I need a
collection of unique sequences even at the cost of the database being
less than complete.  Is there a way to avoid this redundancy by
requesting only a single transcript for each gene, perhaps the longest
or the most abundant?
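
If BioMart cannot do this directly, one fallback would be to filter the
downloaded FASTA file locally.  The following is only a minimal sketch in
Python, not a BioMart feature.  It assumes that the header fields are
separated by "|" and that the Ensembl Gene ID is the last field; both are
assumptions about the export format, and the names read_fasta and
longest_per_gene and the IDs in the example are made up for illustration.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new entry starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_per_gene(
    records: Iterable[Tuple[str, str]],
    id_field: int = -1,  # assumption: the gene ID is the last header field
    sep: str = "|",      # assumption: header fields are pipe-separated
) -> Dict[str, Tuple[str, str]]:
    """Keep the longest sequence per gene ID; the first one wins on ties."""
    best: Dict[str, Tuple[str, str]] = {}
    for header, seq in records:
        gene_id = header.split(sep)[id_field].strip()
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (header, seq)
    return best


# Tiny made-up example: two entries share one gene ID, a third has its own.
example = """>CYP26B1|some description|ENSG_EXAMPLE_1
ATGCTC
TTT
>CYP26B1|some description|ENSG_EXAMPLE_1
ATG
>OTHER|another description|ENSG_EXAMPLE_2
ATGGAG
""".splitlines()

unique = longest_per_gene(read_fasta(example))
for header, seq in unique.values():
    print(">" + header)
    print(seq)
```

This keeps the longest coding sequence per gene; choosing the most abundant
transcript instead would need expression data that the FASTA file does not
contain.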

I am also writing to the help address you suggested.

Thanks,
     George

On Thu, 3 May 2012, Arek Kasprzyk wrote:

> Hi George,
> the count "21405/56478" refers to the number of genes that have
> protein-coding transcripts. A large proportion of these genes will have
> more than one transcript, because Ensembl models alternative splicing.
> Hence you will be getting far more transcripts than genes. If you need to
> know more about how Ensembl predicts genes and their corresponding
> transcripts, please drop an email to [email protected]
>
> a
>
> On Thu, May 3, 2012 at 1:59 PM, George Gutman <[email protected]> wrote:
>
> > I'm trying to collect all protein coding sequences from various species.
> >
> > I start here: http://www.biomart.org/biomart/martview/
> >
> > My selections are as follows:
> >  Database: Ensembl Genes 66 (Sanger UK)
> >  Dataset:  Homo sapiens genes (GRCh37.p6)
> >  Filters: Gene Type: protein coding
> >  Attributes: Sequences/Coding Sequences
> >  Header: Associated Gene Name, Description, Ensembl Gene ID
> >
> > When I click on "Count" I get 21405/56478, which I interpret as 21405
> > coding sequences that I should be recovering out of a total of 56478
> > entries.
> >
> > I click "Results", "Unique Results Only", "Export" to "Compressed Web
> > file (notify by email)", enter my email address, then click "Go".
> >
> > The resulting file I download is 111,827 KB in size and contains 98,024
> > entries, many more than the 21,405 I expected.  (I determined this by
> > doing a "search and replace" for ">".)  The first entry is gene name
> > "CYP26B1", and I find a total of five separate entries with this name and
> > with identical gene descriptions and Ensembl Gene IDs.  The five
> > sequences, however, are different, although three of them are identical
> > through the first few lines.
> >
> > So what am I doing wrong?  When I went through the same steps for E. coli
> > K12 I recovered a file with 4258 sequences, the number I expected based
> > on the output of "Count".
> >
> > Thanks,
> >     George Gutman
> >
> > Cheers,
> >
> >     George Gutman
> > *************************************************************************
> > *    George A. Gutman, Professor Emeritus                               *
> > *    Department of Microbiology  and Molecular Genetics                 *
> > *    University of California, Irvine                                   *
> > *    Office:  (949)824-6593             B250, Med Sci                   *
> > *    (714)552-1242 (cell)               e-mail: [email protected]        *
> > * http://www.ucihs.uci.edu/microbio/facultyResearch/faculty/gutman.html *
> > *************************************************************************
> >
> > On Thu, 3 May 2012, Arek Kasprzyk wrote:
> >
> > > Date: Thu, 3 May 2012 12:55:22 -0400
> > > From: Arek Kasprzyk <[email protected]>
> > > To: George Gutman <[email protected]>
> > > Cc: [email protected]
> > > Subject: Re: [BioMart Users] Where to address questions
> > >
> > > Dear George,
> > > please feel free to post here and we'll try to help you. Please give us
> > > more details on your problem, the portal that you are using etc
> > >
> > >
> > > a
> > >
> > > On Wed, May 2, 2012 at 2:30 AM, George Gutman <[email protected]> wrote:
> > >
> > > > I'm trying to download collections of protein coding regions from various
> > > > genomes and am having trouble.  Where can I address my questions?  Do I
> > > > need to be a member of this list to post here?
> > > >
> > > >     George Gutman
> > > > _______________________________________________
> > > > Users mailing list
> > > > [email protected]
> > > > https://lists.biomart.org/mailman/listinfo/users
> > > >
> > >
> >
>
>
>
> --
> Arek Kasprzyk, MD, MSc, PhD
> BioMart Project Lead
> www.biomart.org
>
