Re: [BioMart Users] FW: trouble with sequence download

David M. Goodstein Tue, 26 Jul 2011 19:58:25 -0700

Thanks Junjun. I think we're seeing something a bit more dramatic than simply duplicating records. In fact, by dropping the batch size down to different values (as small as 5), we've been able to reproduce the sequence splitting problem (e.g., the full sequence gets split into pieces, and each piece appears in the fasta file with the same header), and it always occurs for the record at the batch boundary. So it's happening every time, in all situations.
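A minimal sketch of how a split at the batch boundary could arise, assuming (this is a guess at the mechanism, not a confirmed description of the rc6 code) that exon rows are assembled into sequence records one batch at a time. The row layout and all function names are hypothetical.

```python
from itertools import groupby

# Hypothetical exon rows, already sorted by transcript:
# (transcript_id, exon_rank, sequence_fragment)
ROWS = [
    ("T1", 1, "ATG"), ("T1", 2, "CCC"),
    ("T2", 1, "ATG"), ("T2", 2, "GGG"), ("T2", 3, "TAA"),
    ("T3", 1, "ATG"),
]

def batches(rows, size):
    """Yield consecutive slices of `rows` with at most `size` rows each."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

def records_per_batch(rows, size):
    """Assemble records inside each batch independently (the suspected flaw)."""
    out = []
    for batch in batches(rows, size):
        for tid, group in groupby(batch, key=lambda r: r[0]):
            out.append((tid, "".join(frag for _, _, frag in group)))
    return out

def records_with_carry_over(rows, size):
    """Hold the last, possibly incomplete, transcript until the next batch."""
    out = []
    pending_id, pending_seq = None, ""
    for batch in batches(rows, size):
        for tid, _, frag in batch:
            if pending_id is not None and tid != pending_id:
                out.append((pending_id, pending_seq))
                pending_seq = ""
            pending_id = tid
            pending_seq += frag
    if pending_id is not None:
        out.append((pending_id, pending_seq))
    return out
```

With a batch size of 3, the per-batch version emits two records for T2, the transcript that straddles the boundary, while the carry-over version emits one record per transcript.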

We will poke around rc6 a little more, but it sounds like we might be better off trying to merge our modifications (made to handle multiple genomes in a single mart) into rc8. BTW, is that capability of interest to the larger Biomart community? If so, we'd be happy for the multi-organism capability to be pulled into the main source.


thx,
-David

On Jul 26, 2011, at 6:30 PM, Junjun Zhang wrote:

Hi David,
Batching is implemented in a very different way in BioMart 0.8 compared to 0.7. The new batching does not rely on the SQL LIMIT clause at all. Under certain situations, batching via LIMIT is known to lead to duplicated and/or missing data records, and this may be what you were experiencing in the sequence mart. Batching is used in data retrieval for two major reasons: 1. quicker response, since the first batch of results reaches users almost instantly; 2. reduced memory usage for any subsequent in-memory joins.
In 0.8, these two goals are achieved in a different way. The new data retrieval leverages the streaming capability of the RDBMS. When a SQL query is executed, results start to stream back instantly. In cases where no further data sources are involved in the query, the query output stream is sent directly to the client. In cases where a further join with data from another dataset is needed, batching kicks in. A stream buffer collects intermediate results from the first output stream; when it has received a certain number of data rows, a second query is fired with a list of keys (collected from the first query's results) included in its WHERE clause. Once the query against the second dataset is fired, the stream buffer is emptied and continues to collect results from the output stream of the first query in preparation for the next batch. The results from the second query and the first query are then joined in memory via common keys, and the final result is streamed out to the client.
As you can see, the new batching implementation does not rely on SQL LIMIT, so the problem of duplicated/missing records will not happen in 0.8.
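The buffered join described above could be sketched roughly as follows. This is only an illustration of the idea, not BioMart 0.8 code: `batched_join`, `query_second`, and the in-memory `SECOND` table standing in for the second dataset are all hypothetical names.

```python
def batched_join(first_stream, query_second, batch_size):
    """Join a streamed first result set with a second dataset, batch by batch.

    first_stream : iterable of (key, value) rows from the first query
    query_second : hypothetical callable that takes a list of keys and returns
                   (key, value) rows, standing in for a second query with the
                   keys in its WHERE clause
    """
    buffer = []
    for row in first_stream:
        buffer.append(row)
        if len(buffer) >= batch_size:
            yield from _join(buffer, query_second)
            buffer = []  # empty the buffer before collecting the next batch
    if buffer:  # flush the final, partial batch
        yield from _join(buffer, query_second)

def _join(buffer, query_second):
    """Fire the second query for the buffered keys and join in memory."""
    keys = list(dict.fromkeys(key for key, _ in buffer))  # de-duplicate, keep order
    second = {}
    for key, value in query_second(keys):
        second.setdefault(key, []).append(value)
    for key, value in buffer:
        for other in second.get(key, []):
            yield (key, value, other)

# Stand-in for the second dataset (assumption for the example only).
SECOND = {"g1": ["x"], "g2": ["y", "z"]}

def fake_second_query(keys):
    return [(k, v) for k in keys for v in SECOND.get(k, [])]

FIRST = [("g1", "a"), ("g2", "b"), ("g3", "c")]
joined = list(batched_join(iter(FIRST), fake_second_query, batch_size=2))
```

Because the first query is consumed as a single stream and only the buffer is cut into batches, no LIMIT clause is involved and each first-query row is seen exactly once.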
Hope this helps, let me know if you have any further questions.

Best wishes!

Junjun


From: "David M. Goodstein" <[email protected]>
Date: Mon, 25 Jul 2011 18:26:15 -0400
To: jzhang <[email protected]>
Cc: Joni Fazo <[email protected]>, "[email protected]"<[email protected]>
Subject: Re: [BioMart Users] FW: trouble with sequence download
Junjun,
We have modified our configuration to more closely match the Ensembl gene dataset config, but this hasn't changed our original problem (splitting of sequence into multiple entries in results files). We did manage to (finally) completely disable batch queuing, and that does make the problem go away.
Our question is, was there a known bug with batch queuing in rc6? We are still using rc6 because we had to make some modifications to allow multiple genomes to be accessible in a single mart.
thanks,
-David Goodstein
 JGI


On Jul 14, 2011, at 9:10 AM, Junjun Zhang wrote:
For the 'coding' sequence type, the exportable should have orderBy set to 'transcript_id,exon_rank', similar to what the Ensembl gene dataset shows.
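To illustrate why the ordering matters, here is a small sketch with hypothetical exon rows (not taken from the phytozome mart). It assumes that rows of one transcript must arrive next to each other for the sequence to be assembled as a single entry.

```python
# Hypothetical exon rows: (transcript_id, exon_rank)
rows = [("T2", 1), ("T1", 2), ("T1", 1), ("T2", 2)]

def contiguous(rows):
    """Return True if every transcript's rows sit next to each other."""
    seen, previous = set(), None
    for tid, _ in rows:
        if tid != previous:
            if tid in seen:  # transcript reappears after another one
                return False
            seen.add(tid)
        previous = tid
    return True

# Ordering by exon_rank alone interleaves the transcripts.
by_rank = sorted(rows, key=lambda r: r[1])

# Ordering by transcript_id, then exon_rank, keeps each transcript together
# with its exons in rank order.
by_transcript_then_rank = sorted(rows, key=lambda r: (r[0], r[1]))
```

Here `contiguous(by_rank)` is false, while `contiguous(by_transcript_then_rank)` is true.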
From: jzhang <[email protected]>
Date: Thu, 14 Jul 2011 12:05:04 -0400
To: "[email protected]" <[email protected]>
Subject: [BioMart Users] FW: trouble with sequence download
Forgot to send this to the list.


From: jzhang <[email protected]>
Date: Thu, 14 Jul 2011 12:01:37 -0400
To: Joni Fazo <[email protected]>
Subject: Re: [BioMart Users] trouble with sequence download
Hi Joni,
After looking at the configuration file for the phytozome dataset at: http://www.phytozome.net/biomart/martservice?type=configuration&dataset=phytozome
It seems to me there might be some problem with the 'Exportable' setting in the phytozome mart. orderBy="exon_rank" may not be correct; it should be ordered by transcript ID. You can look at how this is set up in the Ensembl gene mart at: http://www.biomart.org/biomart/martservice?type=configuration&dataset=hsapiens_gene_ensembl (in the page, search for: internalName="cdna" linkName="cdna")
You can also connect to the Ensembl mart db using MartEditor to examine the settings more closely:
Hope this helps!

Junjun



From: Joni Fazo <[email protected]>
Date: Wed, 13 Jul 2011 13:15:28 -0400
To: jzhang <[email protected]>
Subject: Re: [BioMart Users] trouble with sequence download
Hi Junjun,

Please go to: http://www.phytozome.net/biomart/martview

To download all the CDS for one genome please follow these steps:

1) Select the dataset "Phytozome 7.0 Genomes"
2) For Filters select the Organism "Arabidopsis thaliana"
3) For Attributes select "Sequences"
4) Select the radio button "Coding Sequences"
5) As well as the default check boxes, also select "Exon CDS Start" and "Exon CDS End"
6) Click the Results button
7) Then select "Export all results to compressed web file"
The resulting file will have the CDS sequence split for the following 5 transcripts (so 10 entries total):
AT1G31930.2, AT3G02530.1, AT3G54470.1, AT4G24270.2, AT5G61910.4
All other transcripts in the file will just have one entry.
If you follow the above steps but add the additional filter of the transcript names, the resulting file will have the CDS for the transcripts in just 5 entries, which I believe is correct.
Thanks in advance for your help,
Joni
On Wed, Jul 13, 2011 at 7:13 AM, Junjun Zhang <[email protected]> wrote:
Hi Joni,
Is it possible to provide us the URL where we can reproduce and test the problem below?
Thanks,
Junjun


From: Joni Fazo <[email protected]>
Date: Tue, 12 Jul 2011 16:23:50 -0400
To: "[email protected]" <[email protected]>
Subject: [BioMart Users] trouble with sequence download
Hello,
My name is Joni Fazo and I am trying to troubleshoot an error in our Biomart configuration for http://phytozome.net.
Our issue is that the sequence data for some transcripts are split into multiple entries when all CDS for a given genome is requested.
If the user requests the CDS for the individual transcripts, the sequence is presented as one entry. Listed below is an example of the FASTA headers for one such transcript (AT1G31930.2):
Split entries (generated when all CDS for the genome is requested from Biomart):
>11466849;11465832|11466986;11466755|AT1G31930|AT1G31930.2
>11468117;11467266;11467524;11468367;11467780;11467074|11468289;11467445;11467698;11468961;11468036;11467178|AT1G31930|AT1G31930.2
Single entry (generated when just CDS for AT1G31930.2 is requested from Biomart):
>11468117;11467266;11467524;11468367;11467780;11466849;11467074;11465832|11468289;11467445;11467698;11468961;11468036;11466986;11467178;11466755|AT1G31930|AT1G31930.2
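One way to find affected transcripts in a downloaded file is to count FASTA entries per transcript name. The sketch below assumes the transcript name is the last '|'-separated field of the header, as in the example headers above; other attribute selections may produce a different layout. The function names are hypothetical.

```python
from collections import Counter

def transcript_of(header):
    """Return the last '|'-separated field of a FASTA header line."""
    return header.lstrip(">").strip().split("|")[-1]

def split_transcripts(lines):
    """Map transcript name -> entry count, for names with more than one entry."""
    counts = Counter(transcript_of(line) for line in lines if line.startswith(">"))
    return {name: n for name, n in counts.items() if n > 1}

# The two split headers quoted in the message above.
SPLIT_HEADERS = [
    ">11466849;11465832|11466986;11466755|AT1G31930|AT1G31930.2",
    ">11468117;11467266;11467524;11468367;11467780;11467074"
    "|11468289;11467445;11467698;11468961;11468036;11467178"
    "|AT1G31930|AT1G31930.2",
]

report = split_transcripts(SPLIT_HEADERS)
```

For a real file, the same function can be fed the open file object, e.g. `split_transcripts(open(path))`, where `path` is the location of the exported FASTA file.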
Has anyone encountered this issue or something similar?

Best regards,
Joni Fazo
Joint Genome Institute / Lawrence Berkeley National Lab
http://www.phytozome.net
_______________________________________________
Users mailing list
[email protected]
https://lists.biomart.org/mailman/listinfo/users
David M. Goodstein, Ph.D.
Phytozome Group Lead
Plant and Computational Genomics Group
Joint Genome Institute - U.S. Dept. of Energy
Center for Integrative Genomics - UC Berkeley

