Hi Jaime,
The new sequence retrieval tool is slower overall than the old one, so getting 
sequences for the whole genome will take some time. As I suggested to another 
user, for a genome-wide query it is better to fetch the data for one chromosome 
per query.
Running queries in the background on the server should be a better option, 
although this is not supported yet.
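
The per-chromosome approach suggested above can be sketched roughly as 
follows. This is only an illustration: `fetch` is a hypothetical callable 
supplied by the caller (it is not part of BioMart) that takes a chromosome 
name and returns the FASTA text for that chromosome, and the function name 
`fetch_by_chromosome` is likewise made up for this example.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Human chromosome names: autosomes 1-22 plus X, Y and mitochondrial (MT).
HUMAN_CHROMOSOMES: List[str] = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]


def fetch_by_chromosome(
    fetch: Callable[[str], str],
    chromosomes: Iterable[str] = HUMAN_CHROMOSOMES,
) -> Tuple[Dict[str, str], List[str]]:
    """Run one query per chromosome instead of one genome-wide query.

    `fetch` is a hypothetical, caller-supplied function that returns the
    FASTA text for a single chromosome. Returns the per-chromosome results
    and the list of chromosomes whose query raised an error, so that only
    those need to be retried.
    """
    results: Dict[str, str] = {}
    failed: List[str] = []
    for name in chromosomes:
        try:
            results[name] = fetch(name)
        except Exception:
            # Remember the failure and carry on with the next chromosome.
            failed.append(name)
    return results, failed
```

A failed or timed-out chromosome can then be re-run on its own without 
repeating the whole genome.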
Hope this helps,
Junjun

From: Arek Kasprzyk [mailto:[email protected]]
Sent: Tuesday, November 22, 2011 12:45 PM
To: Jaime Tovar <[email protected]>
Cc: [email protected] <[email protected]>; Rhoda Kinsella <[email protected]>; 
Junjun Zhang
Subject: Re: [BioMart Users] General query

Hi Jaime,

thanks for your input.

Re: 'repeated rows'
I'd suggest that you include the transcript ID in your results as well. The 
seemingly repeated genes are usually a sign of alternative splice variants 
(different transcripts). It is also worth including the gene type, as there may 
be some genes that have transcripts but no proteins. So there is a chance that 
what you are seeing is actually genuinely unique. Someone from Ensembl could 
probably provide better insight. (cc'ing Rhoda)

Re: speed.
I remember that the first implementation by Jonathan was very slow, but Jack 
optimized it quite a bit, and I remember Junjun testing it and being satisfied 
with the speed, so I am not sure whether this is a server issue now or whether 
something has changed since then. (cc'ing Junjun so he can comment)

Re: gz option.
This 0.8 service is still in development. You are right: we'll definitely need 
the 'gz' by email option as well.


a


On Tue, Nov 22, 2011 at 11:11 AM, Jaime Tovar 
<[email protected]<mailto:[email protected]>> wrote:
Hello,

I'm trying the new interface for BioMart and I have a couple of comments.

I'm trying to download protein sequences for genes in Homo sapiens for GRCh37.p3.

In the results I find something like this for multiple genes:


>ENSG00000000003|ENSP00000362111
MASPSRRLQTKPVITCFKSVLLIYTFIFWITGVILLAVGIWGKVSLENYFSLLNEKATNV
PFVLIATGTVIILLGTFGCFATCRASAWMLKLYAMFLTLVFLVELVAAIVGFVFRHEIKN
SFKNNYEKALKQYNSTGDYRSHAVDKIQNTLHCCGVTDYRDWTDTNYYSEKGFPKSCCKL
EDCTPQRDADKVNNEGCFIKVMTIIESEMGVVAGISFGVACFQLIGIFLAYCLSRAITNN
QYEIV*

>ENSG00000000003|ENSP00000409517
MASPSRRLQTKPVITCFKSVLLIYTFIFWITGVILLAVGIWGKVSLENYFSLLNEKATFG
CFATCRASAWMLKLYAMFLTLVFLVELVAAIVGFVFRHEIKNSFKNNYEKALKQYNSTGD
YRSHAVDKIQNTLHCCGVTDYRDWTDTNYYSEKGFPKSCCKLEDCTPQRDADKVNNEGCF
IKVMTIIESEMGVVAGISFGVACFQLIGIFLAYCLSRAITNNQYEIV*

>ENSG00000000003|
Sequence unavailable

>ENSG00000000003|
Sequence unavailable

It makes me think there is a problem when joining the tables, so that some 
empty protein IDs result in extra rows in the output.

Also, I would like to know whether there is a need to use the unique option 
present in the previous version to show only one result per gene id|protein id.
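
Until that is clarified, the output can be cleaned up on the client side. 
The sketch below is only a workaround under the assumption that the file 
looks like the sample above; the function names `parse_fasta` and 
`unique_records` are made up for this example. It keeps one record per 
header and sequence pair and drops the "Sequence unavailable" placeholders.

```python
from typing import Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA text, skipping blank lines."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def unique_records(text: str) -> List[Tuple[str, str]]:
    """Keep the first copy of each record and drop unavailable sequences."""
    seen = set()
    kept: List[Tuple[str, str]] = []
    for header, sequence in parse_fasta(text):
        # Placeholder rows as seen in the sample output.
        if sequence == "Sequence unavailable":
            continue
        if (header, sequence) in seen:
            continue
        seen.add((header, sequence))
        kept.append((header, sequence))
    return kept
```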

Also, these queries tend to be quite long. As the application works now, I can 
download the file as txt, but it is extremely slow, and I guess for some files 
it may result in an incomplete file with no way of knowing whether the file is 
actually complete. I mean, a timeout may produce an incomplete file that looks 
fine, and there will be no way to prove otherwise just from the data. I think 
the possibility of getting the file as a gz-compressed file is highly desirable, 
either directly from the service or via a link sent by email.
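
To illustrate why gz would help with the completeness problem: a gzip stream 
ends with a checksum and length trailer, so a download that was cut short 
fails to decompress instead of silently looking fine. A minimal sketch using 
only the Python standard library (the function name `is_complete_gzip` is 
made up for this example):

```python
import gzip
import zlib


def is_complete_gzip(data: bytes) -> bool:
    """Return True if `data` decompresses as a complete gzip stream."""
    if not data:
        # An empty download is treated as incomplete.
        return False
    try:
        gzip.decompress(data)
    except (EOFError, OSError, zlib.error):
        # Truncated or corrupted stream: the end-of-stream trailer is
        # missing or the CRC check failed.
        return False
    return True
```

The same check on a plain txt download is not possible, because a truncated 
text file is still a valid text file.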

Best regards,

J

_______________________________________________
Users mailing list
[email protected]<mailto:[email protected]>
https://lists.biomart.org/mailman/listinfo/users

