Hi Jaime,

The new sequence retrieval tool is slower overall than the old one, so getting sequences for the whole genome will take some time. As I suggested to another user, for a genome-wide query it is better to request the data for one chromosome per query. Running queries in the background on the server would be a better option, but this is not supported yet.

Hope this helps,
Junjun

Sent from my BBerry
From: Arek Kasprzyk [mailto:[email protected]]
Sent: Tuesday, November 22, 2011 12:45 PM
To: Jaime Tovar <[email protected]>
Cc: [email protected] <[email protected]>; Rhoda Kinsella <[email protected]>; Junjun Zhang
Subject: Re: [BioMart Users] General query

Hi Jaime,

thanks for your input.

Re: 'repeated rows'. I'd suggest that you include the transcript id in your result as well. The seemingly repeated genes are usually a sign of alternative splice variants (different transcripts). It is also worth including the gene type, as there may be some genes that have transcripts but no proteins. So there is a chance that what you are seeing is actually genuinely unique. Someone from Ensembl could probably provide better insight (cc'ing Rhoda).

Re: speed. I remember that the first implementation by Jonathan was very slow, but Jack optimized it quite a bit, and I remember Junjun testing it and being satisfied with the speed, so I am not sure whether this is a server issue now or something has changed since then (cc'ing Junjun so he can comment).

Re: gz option. This 0.8 service is still in development. You are right. We'll definitely need the 'gz' by email option as well a

On Tue, Nov 22, 2011 at 11:11 AM, Jaime Tovar <[email protected]> wrote:

Hello,

I'm trying the new interface for biomart and I have a couple of comments.
I'm trying to download protein sequences for genes in Homo sapiens for GRCh37.p3. In the results I find something like this for multiple genes:

>ENSG00000000003|ENSP00000362111
MASPSRRLQTKPVITCFKSVLLIYTFIFWITGVILLAVGIWGKVSLENYFSLLNEKATNV
PFVLIATGTVIILLGTFGCFATCRASAWMLKLYAMFLTLVFLVELVAAIVGFVFRHEIKN
SFKNNYEKALKQYNSTGDYRSHAVDKIQNTLHCCGVTDYRDWTDTNYYSEKGFPKSCCKL
EDCTPQRDADKVNNEGCFIKVMTIIESEMGVVAGISFGVACFQLIGIFLAYCLSRAITNN
QYEIV*
>ENSG00000000003|ENSP00000409517
MASPSRRLQTKPVITCFKSVLLIYTFIFWITGVILLAVGIWGKVSLENYFSLLNEKATFG
CFATCRASAWMLKLYAMFLTLVFLVELVAAIVGFVFRHEIKNSFKNNYEKALKQYNSTGD
YRSHAVDKIQNTLHCCGVTDYRDWTDTNYYSEKGFPKSCCKLEDCTPQRDADKVNNEGCF
IKVMTIIESEMGVVAGISFGVACFQLIGIFLAYCLSRAITNNQYEIV*
>ENSG00000000003|
Sequence unavailable
>ENSG00000000003|
Sequence unavailable

It makes me think there is a problem when joining the tables, so that some empty protein ids result in extra rows in the result. I would also like to know whether there is a need to use the unique option present in the previous version to show only one result per gene id|protein id.

Also, these queries tend to be quite long. As the application works now, I can download the file as txt, but it is extremely slow, and I guess that for some files it may result in an incomplete file with no way of knowing whether the file is actually complete. I mean that a timeout may produce an incomplete file which looks fine, and there will be no way to prove the contrary just from the data. I think the possibility of getting the file as a gz-compressed file is highly desirable, either directly from the service or via a link sent by email.

Best regards,
J

_______________________________________________
Users mailing list
[email protected]
https://lists.biomart.org/mailman/listinfo/users
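Until the service offers something like the old unique option, the repeated "Sequence unavailable" rows in Jaime's sample output can be dropped on the client side. The following is only a minimal sketch of that idea, not part of BioMart: the function names `parse_fasta` and `clean_records` are made up for illustration, and it assumes the downloaded file has the same layout as the sample (a `>gene id|protein id` header line followed by sequence lines, with the literal text `Sequence unavailable` as a placeholder).

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# Placeholder text as it appears in the sample output (assumed to be stable).
UNAVAILABLE = "Sequence unavailable"


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; the header excludes the leading '>'."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def clean_records(records: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Skip placeholder records and keep only the first record per header."""
    seen = set()
    for header, sequence in records:
        if sequence == UNAVAILABLE or header in seen:
            continue
        seen.add(header)
        yield header, sequence


if __name__ == "__main__":
    # Shortened sequences, for illustration only.
    sample = """>ENSG00000000003|ENSP00000362111
MASPS
QYEIV*
>ENSG00000000003|
Sequence unavailable
>ENSG00000000003|
Sequence unavailable
"""
    for header, sequence in clean_records(parse_fasta(sample.splitlines())):
        print(">" + header)
        print(sequence)
```

This only tidies the output. It does not explain why the empty protein ids appear, and it cannot tell whether a download was cut short between two records.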
