Hi Jaime,

Thanks for your input.

Re: 'repeated rows'
I'd suggest including the transcript ID in your results as well. Seemingly
repeated genes are usually a sign of alternative splice variants
(different transcripts). It is also worth including the gene type, since
there may be some genes that have transcripts but no proteins. So there is
a chance that the rows you are seeing are actually genuinely unique. Someone
from Ensembl could probably provide better insight. (cc'ing Rhoda)
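
In the meantime, the rows without a protein can be filtered out on the
client side. The following is only a rough sketch, not part of BioMart: it
assumes the header layout `gene_id|protein_id` and the literal text
"Sequence unavailable" exactly as they appear in the output you quoted, and
the function names (parse_fasta, unique_proteins) are made up for
illustration.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def unique_proteins(text):
    """Return one (gene_id, protein_id, sequence) tuple per gene/protein pair.

    Assumes headers look like 'gene_id|protein_id'. Records with an empty
    protein id or the placeholder 'Sequence unavailable' are dropped.
    """
    seen = set()
    result = []
    for header, seq in parse_fasta(text):
        fields = header.split("|")
        gene_id = fields[0]
        protein_id = fields[1] if len(fields) > 1 else ""
        if not protein_id or seq == "Sequence unavailable":
            continue
        key = (gene_id, protein_id)
        if key in seen:
            continue
        seen.add(key)
        result.append((gene_id, protein_id, seq))
    return result
```

Note that this hides the empty rows rather than explaining them; adding the
transcript ID and gene type to the query is still the way to find out what
they are.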

Re: speed.
I remember that the first implementation by Jonathan was very slow, but Jack
optimized it quite a bit, and I remember Junjun testing it and being
satisfied with the speed. So I am not sure whether this is a server issue
now or something has changed since then. (cc'ing Junjun so he can comment)

Re: gz option.
This 0.8 service is still in development. You are right: we'll definitely
need the 'gz' by email option as well.
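
One side benefit of gzip for the incomplete-file worry: every gzip stream
ends with a CRC-32 and length trailer, so a truncated download can be
detected on the client. A minimal sketch using Python's standard gzip
module follows; the function name gzip_is_complete is hypothetical, and how
the 0.8 service would actually deliver the file is not decided here.

```python
import gzip
import zlib


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the gzip file at 'path' decompresses to the end.

    Reading the whole stream forces the gzip module to check the CRC-32
    and length trailer, so a truncated or corrupted file is reported.
    """
    try:
        with gzip.open(path, "rb") as handle:
            # Read and discard the data in chunks to keep memory use low.
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        return False
    return True
```

This only shows that the compressed stream is intact. It cannot tell
whether the server stopped producing results before compressing them.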


a


On Tue, Nov 22, 2011 at 11:11 AM, Jaime Tovar <[email protected]> wrote:

>  Hello,
>
> I'm trying the new interface for biomart and I have a couple of comments.
>
> I'm trying to download protein sequences for genes in homo sapiens for
> GRCh37.p3
>
> In the results I find something like this for multiple genes:
>
>  >ENSG00000000003|ENSP00000362111
> MASPSRRLQTKPVITCFKSVLLIYTFIFWITGVILLAVGIWGKVSLENYFSLLNEKATNV
> PFVLIATGTVIILLGTFGCFATCRASAWMLKLYAMFLTLVFLVELVAAIVGFVFRHEIKN
> SFKNNYEKALKQYNSTGDYRSHAVDKIQNTLHCCGVTDYRDWTDTNYYSEKGFPKSCCKL
> EDCTPQRDADKVNNEGCFIKVMTIIESEMGVVAGISFGVACFQLIGIFLAYCLSRAITNN
> QYEIV*
>
> >ENSG00000000003|ENSP00000409517
> MASPSRRLQTKPVITCFKSVLLIYTFIFWITGVILLAVGIWGKVSLENYFSLLNEKATFG
> CFATCRASAWMLKLYAMFLTLVFLVELVAAIVGFVFRHEIKNSFKNNYEKALKQYNSTGD
> YRSHAVDKIQNTLHCCGVTDYRDWTDTNYYSEKGFPKSCCKLEDCTPQRDADKVNNEGCF
> IKVMTIIESEMGVVAGISFGVACFQLIGIFLAYCLSRAITNNQYEIV*
>
> >ENSG00000000003|
> Sequence unavailable
>
> >ENSG00000000003|
> Sequence unavailable
>
> It makes me think there is a problem when joining the tables and so some
> empty protein ids are resulting in extra rows for the result.
>
> Also I would like to know if there is a need to use the unique option
> present in the previous version to show only one result per gene id|protein
> id.
>
> Also these queries tend to be quite long. As the application works now I
> can download the file as txt, but it is extremely slow, and I guess for some
> files it may result in incomplete files with no way of knowing if the file
> is actually complete or not. I mean a timeout may result in an incomplete
> file which looks good, and there will be no way to prove the contrary just
> from the data. I think the possibility of getting the file as a gz
> compressed file is highly desirable, either directly from the service or by
> a link in email.
>
> Best regards,
>
> J
>
> _______________________________________________
> Users mailing list
> [email protected]
> https://lists.biomart.org/mailman/listinfo/users
>
>
