Hi Jaime
If you take a look at the gene on the Ensembl website (see
http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000000003;r=X:99883667-99894988),
you will see that this gene has four transcripts: two are protein coding
and two are processed transcripts, which have no protein product. That is
why two of your four rows show "Sequence unavailable". I agree with Arek
that you should filter by protein coding and add the Ensembl transcript ID
to your list of attributes.
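
If you already have a downloaded file and just want to drop the rows without
a protein, something like the following could work as a stopgap. This is only
a rough Python sketch, not part of BioMart: the function names are made up
for illustration, it assumes headers of the form gene_id|protein_id as in
your output, and the sample sequences are cut to their first line.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_protein_records(text):
    """Keep only records with a non-empty protein ID and a real sequence."""
    kept = []
    for header, seq in parse_fasta(text):
        fields = header.split("|")
        # Assumption: the protein ID is the second |-separated field.
        protein_id = fields[1] if len(fields) > 1 else ""
        if protein_id and seq != "Sequence unavailable":
            kept.append((header, seq))
    return kept


# Headers taken from the output quoted below; sequences truncated to one line.
SAMPLE = """\
>ENSG00000000003|ENSP00000362111
MASPSRRLQTKPVITCFKSVLLIYTFIFWITGVILLAVGIWGKVSLENYFSLLNEKATNV
>ENSG00000000003|ENSP00000409517
MASPSRRLQTKPVITCFKSVLLIYTFIFWITGVILLAVGIWGKVSLENYFSLLNEKATFG
>ENSG00000000003|
Sequence unavailable
>ENSG00000000003|
Sequence unavailable
"""

for header, seq in keep_protein_records(SAMPLE):
    print(header, len(seq))
```

Filtering by protein coding in the query itself is still the better fix,
since it avoids fetching the extra rows in the first place.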
Hope that helps
Regards
Rhoda


> Hi Jaime,
>
> thanks for your input.
>
> Re: 'repeated rows'
> I'd suggest that you include the transcript ID in your results as well. The
> seemingly repeated genes are usually a sign of alternative splice variants
> (different transcripts). It is also worth including the gene type, as there
> may be some genes that have transcripts but no proteins. So there is a
> chance that what you are seeing is actually genuinely unique. Someone from
> Ensembl could probably provide better insight. (cc'ing Rhoda)
>
> Re: speed.
> I remember that the first implementation by Jonathan was very slow, but
> Jack optimized it quite a bit, and I remember Junjun testing it and being
> satisfied with the speed, so I am not sure whether this is a server issue
> now or something has changed since then. (cc'ing Junjun so he can comment)
>
> Re: gz option.
> This 0.8 service is still in development. You are right: we'll definitely
> need the 'gz' by email option as well.
>
>
> a
>
>
> On Tue, Nov 22, 2011 at 11:11 AM, Jaime Tovar <[email protected]> wrote:
>
>>  Hello,
>>
>> I'm trying the new interface for BioMart and I have a couple of
>> comments.
>>
>> I'm trying to download protein sequences for genes in Homo sapiens for
>> GRCh37.p3.
>>
>> In the results I find something like this for multiple genes:
>>
>> >ENSG00000000003|ENSP00000362111
>> MASPSRRLQTKPVITCFKSVLLIYTFIFWITGVILLAVGIWGKVSLENYFSLLNEKATNV
>> PFVLIATGTVIILLGTFGCFATCRASAWMLKLYAMFLTLVFLVELVAAIVGFVFRHEIKN
>> SFKNNYEKALKQYNSTGDYRSHAVDKIQNTLHCCGVTDYRDWTDTNYYSEKGFPKSCCKL
>> EDCTPQRDADKVNNEGCFIKVMTIIESEMGVVAGISFGVACFQLIGIFLAYCLSRAITNN
>> QYEIV*
>>
>> >ENSG00000000003|ENSP00000409517
>> MASPSRRLQTKPVITCFKSVLLIYTFIFWITGVILLAVGIWGKVSLENYFSLLNEKATFG
>> CFATCRASAWMLKLYAMFLTLVFLVELVAAIVGFVFRHEIKNSFKNNYEKALKQYNSTGD
>> YRSHAVDKIQNTLHCCGVTDYRDWTDTNYYSEKGFPKSCCKLEDCTPQRDADKVNNEGCF
>> IKVMTIIESEMGVVAGISFGVACFQLIGIFLAYCLSRAITNNQYEIV*
>>
>> >ENSG00000000003|
>> Sequence unavailable
>>
>> >ENSG00000000003|
>> Sequence unavailable
>>
>> It makes me think there is a problem when joining the tables, so that some
>> empty protein IDs are producing extra rows in the result.
>>
>> Also, I would like to know whether there is a need to use the unique option
>> present in the previous version to show only one result per gene
>> ID|protein ID.
>>
>> Also, these queries tend to be quite long. As the application works now, I
>> can download the file as txt, but it is extremely slow, and I guess that for
>> some files it may produce incomplete files with no way of knowing whether
>> the file is actually complete. I mean that a timeout may result in an
>> incomplete file which looks fine, and there will be no way to prove
>> otherwise just from the data. I think the possibility of getting the file
>> as a gz compressed file is highly desirable, either directly from the
>> service or by a link in an email.
>>
>> Best regards,
>>
>> J
>>
>> _______________________________________________
>> Users mailing list
>> [email protected]
>> https://lists.biomart.org/mailman/listinfo/users
>>
>>
>
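
On the completeness concern in the original message: a gzip stream ends with
a CRC-32 and length trailer, so a download cut short by a timeout fails to
decompress instead of looking like a valid file. A minimal Python sketch of
such a check follows; the helper name is_complete_gzip is hypothetical, and
the payload is a made-up stand-in for a real download.

```python
import gzip
import zlib


def is_complete_gzip(data: bytes) -> bool:
    """Return True if data decompresses as a complete gzip stream."""
    try:
        gzip.decompress(data)
        return True
    except (EOFError, OSError, zlib.error):
        # Truncated or corrupt streams fail here instead of passing silently.
        return False


# Stand-in payload: a repeated FASTA-like record, not real query output.
payload = (">ENSG00000000003|ENSP00000362111\n" + "MASPSRRLQTKPVITCFKSV\n") * 50
full = gzip.compress(payload.encode("ascii"))
truncated = full[: len(full) // 2]  # simulate a download cut off midway

print(is_complete_gzip(full))
print(is_complete_gzip(truncated))
```

A plain-text download cut at the same point would simply end early, which is
the case described above as impossible to detect from the data alone.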


