Hi Jaime

If you take a look at the gene on the Ensembl website (see here: http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000000003;r=X:99883667-99894988) you will see that this gene has four transcripts: two are protein coding and two are processed transcripts. The processed transcripts have no protein product, which would explain the two "Sequence unavailable" rows in your results. I agree with Arek that you should filter by protein coding and add Ensembl transcript ID to your list of attributes.

Hope that helps

Regards
Rhoda
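
Until the query itself is filtered, the empty rows could also be dropped on the client side. The following is only a minimal sketch in Python, assuming the output is FASTA with `gene_id|protein_id` headers and a literal "Sequence unavailable" line for missing proteins, as in the example in your message. The function names `parse_fasta` and `keep_protein_records` are hypothetical and not part of BioMart.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_protein_records(text):
    """Keep records that have a protein ID and a sequence, once per header.

    Assumes headers of the form 'gene_id|protein_id' and the placeholder
    text 'Sequence unavailable' for transcripts without a protein.
    """
    seen = set()
    kept = []
    for header, seq in parse_fasta(text):
        _gene_id, _, protein_id = header.partition("|")
        if not protein_id or seq == "Sequence unavailable":
            continue  # non-coding transcript: no protein to report
        if header in seen:
            continue  # same gene|protein pair already kept
        seen.add(header)
        kept.append((header, seq))
    return kept


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python filter_fasta.py < results.txt
    for header, seq in keep_protein_records(sys.stdin.read()):
        print(f">{header}\n{seq}")
```

This only cleans up the downloaded file; it does not tell you whether the download itself was complete.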
> Hi Jaime,
>
> thanks for your input.
>
> Re: 'repeated rows'
> I'd suggest that you include transcript id in your result as well. The
> seemingly repeating genes are usually a sign of alternative splice
> variants (different transcripts). It is also worth including gene type,
> as there may be some genes that have transcripts but no proteins. So
> there is a chance that what you are seeing is actually genuinely unique.
> Someone from Ensembl could probably provide better insight. (cc'ing Rhoda)
>
> Re: speed.
> I remember that the first implementation by Jonathan was very slow, but
> Jack optimized it quite a bit, and I remember Junjun testing it and being
> satisfied with the speed, so I am not sure whether this is a server issue
> now or something has changed since then. (cc'ing Junjun so he can comment)
>
> Re: gz option.
> This 0.8 service is still in development. You are right. We'll definitely
> need the 'gz' by email option as well.
>
>
> a
>
>
> On Tue, Nov 22, 2011 at 11:11 AM, Jaime Tovar <[email protected]> wrote:
>
>> Hello,
>>
>> I'm trying the new interface for biomart and I have a couple of
>> comments.
>>
>> I'm trying to download protein sequences for genes in homo sapiens for
>> GRCh37.p3
>>
>> In the results I find something like this for multiple genes:
>>
>> >ENSG00000000003|ENSP00000362111
>> MASPSRRLQTKPVITCFKSVLLIYTFIFWITGVILLAVGIWGKVSLENYFSLLNEKATNV
>> PFVLIATGTVIILLGTFGCFATCRASAWMLKLYAMFLTLVFLVELVAAIVGFVFRHEIKN
>> SFKNNYEKALKQYNSTGDYRSHAVDKIQNTLHCCGVTDYRDWTDTNYYSEKGFPKSCCKL
>> EDCTPQRDADKVNNEGCFIKVMTIIESEMGVVAGISFGVACFQLIGIFLAYCLSRAITNN
>> QYEIV*
>>
>> >ENSG00000000003|ENSP00000409517
>> MASPSRRLQTKPVITCFKSVLLIYTFIFWITGVILLAVGIWGKVSLENYFSLLNEKATFG
>> CFATCRASAWMLKLYAMFLTLVFLVELVAAIVGFVFRHEIKNSFKNNYEKALKQYNSTGD
>> YRSHAVDKIQNTLHCCGVTDYRDWTDTNYYSEKGFPKSCCKLEDCTPQRDADKVNNEGCF
>> IKVMTIIESEMGVVAGISFGVACFQLIGIFLAYCLSRAITNNQYEIV*
>>
>> >ENSG00000000003|
>> Sequence unavailable
>>
>> >ENSG00000000003|
>> Sequence unavailable
>>
>> It makes me think there is a problem when joining the tables, so that
>> some empty protein ids are resulting in extra rows in the result.
>>
>> Also I would like to know if there is a need to use the unique option
>> present in the previous version to show only one result per gene
>> id|protein id.
>>
>> Also these queries tend to be quite long. As the application works now I
>> can download the file as txt, but it is extremely slow and I guess for
>> some files it may result in incomplete files with no way of knowing
>> whether the file is actually complete or not. I mean a timeout may
>> result in an incomplete file which looks good, and there will be no way
>> to prove the contrary just from the data. I think the possibility of
>> getting the file as a gz compressed file is highly desirable, either
>> directly from the service or by a link in email.
>>
>> Best regards,
>>
>> J
>>
>> _______________________________________________
>> Users mailing list
>> [email protected]
>> https://lists.biomart.org/mailman/listinfo/users
