Hi Jaime, thanks for your input.

Re: 'repeated rows'. I'd suggest including the transcript ID in your results as well. Seemingly repeated genes are usually a sign of alternative splice variants (different transcripts). It is also worth including the gene type, since there may be some genes that have transcripts but no proteins. So there is a chance that what you are seeing is actually genuinely unique. Someone from Ensembl could probably provide better insight (cc'ing Rhoda).

Re: speed. I remember that the first implementation by Jonathan was very slow, but Jack optimized it quite a bit, and I remember Junjun testing it and being satisfied with the speed, so I am not sure whether this is a server issue now or whether something has changed since then (cc'ing Junjun so he can comment).

Re: gz option. This 0.8 service is still in development. You are right. We'll definitely need the 'gz' by email option as well a

On Tue, Nov 22, 2011 at 11:11 AM, Jaime Tovar <[email protected]> wrote:
> Hello,
>
> I'm trying the new interface for biomart and I have a couple of comments.
>
> I'm trying to download protein sequences for genes in homo sapiens for
> GRCh37.p3
>
> In the results I find something like this for multiple genes:
>
> >ENSG00000000003|ENSP00000362111
> MASPSRRLQTKPVITCFKSVLLIYTFIFWITGVILLAVGIWGKVSLENYFSLLNEKATNV
> PFVLIATGTVIILLGTFGCFATCRASAWMLKLYAMFLTLVFLVELVAAIVGFVFRHEIKN
> SFKNNYEKALKQYNSTGDYRSHAVDKIQNTLHCCGVTDYRDWTDTNYYSEKGFPKSCCKL
> EDCTPQRDADKVNNEGCFIKVMTIIESEMGVVAGISFGVACFQLIGIFLAYCLSRAITNN
> QYEIV*
>
> >ENSG00000000003|ENSP00000409517
> MASPSRRLQTKPVITCFKSVLLIYTFIFWITGVILLAVGIWGKVSLENYFSLLNEKATFG
> CFATCRASAWMLKLYAMFLTLVFLVELVAAIVGFVFRHEIKNSFKNNYEKALKQYNSTGD
> YRSHAVDKIQNTLHCCGVTDYRDWTDTNYYSEKGFPKSCCKLEDCTPQRDADKVNNEGCF
> IKVMTIIESEMGVVAGISFGVACFQLIGIFLAYCLSRAITNNQYEIV*
>
> >ENSG00000000003|
> Sequence unavailable
>
> >ENSG00000000003|
> Sequence unavailable
>
> It makes me think there is a problem when joining the tables and so some
> empty protein ids are resulting in extra rows for the result.
>
> Also I would like to know if there is a need to use the unique option
> present in the previous version to show only one result per gene id|protein
> id
>
> Also these queries tend to be quite long. As the application works now I
> can download the file as txt, but it is extremely slow and I guess for some
> files it may result in incomplete files without a way of knowing if the file
> is actually complete or not. I mean a timeout may result in an incomplete
> file which looks good and there will be no way to prove the contrary just
> from the data. I think the possibility of getting the file as a gz
> compressed file is highly desirable either directly from the service or by
> a link in email.
>
> Best regards,
>
> J
>
> _______________________________________________
> Users mailing list
> [email protected]
> https://lists.biomart.org/mailman/listinfo/users
>
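
P.S. On the 'repeated rows' point: until the cause is confirmed, one possible client-side workaround is to filter the downloaded file. The following is only a minimal sketch, assuming the headers have the form gene_id|protein_id as in the output quoted above and that missing sequences appear literally as "Sequence unavailable". The function names (parse_fasta, clean_records) and the shortened sequences in SAMPLE are made up for illustration and are not part of BioMart.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def clean_records(records):
    """Keep one record per (gene id, protein id); skip rows that have
    an empty protein id or no sequence."""
    seen = set()
    for header, seq in records:
        gene_id, _, protein_id = header.partition("|")
        if not protein_id or seq == "Sequence unavailable":
            continue
        key = (gene_id, protein_id)
        if key in seen:
            continue
        seen.add(key)
        yield gene_id, protein_id, seq


# Toy input: real IDs from the mail, sequences shortened for illustration.
SAMPLE = """\
>ENSG00000000003|ENSP00000362111
MASPSRRL
QTKPV

>ENSG00000000003|ENSP00000409517
MASPSRRL

>ENSG00000000003|
Sequence unavailable

>ENSG00000000003|
Sequence unavailable
"""

if __name__ == "__main__":
    for gene_id, protein_id, seq in clean_records(parse_fasta(SAMPLE)):
        print(gene_id, protein_id, len(seq))
```

Note that this only hides the empty rows. If they correspond to non-coding transcripts, adding the transcript ID and gene type to the query, as suggested above, is the better way to see what they are.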
_______________________________________________
Users mailing list
[email protected]
https://lists.biomart.org/mailman/listinfo/users
