Re: [BioMart Users] FW: trouble with sequence download

Junjun Zhang Wed, 27 Jul 2011 07:06:25 -0700

Hi David,

It seems there has been some misunderstanding about the BioMart versions. The 
latest stable release is 0.7; this version and all older versions were 
developed in Perl. BioMart 0.8 is what we have been working on, and it is 
written in Java. There is no stable BioMart 0.8 release yet, although a few 
0.8 release candidates are available. The latest 0.8 release candidate is rc6.

I just wanted to confirm with you that the sequence retrieval problem you 
described is in an old BioMart release: 0.7 or earlier. It's not 0.8 rc6, right?

Given the differences in the batching implementation between 0.8 and earlier 
versions that I described in my previous email, the problem of gene sequences 
being split into small pieces can be avoided in 0.8 by not batching. We are 
working on a new sequence retrieval tool in 0.8; it does not involve batching, 
so it does not have the problem you are experiencing.

Regarding the 'multi-organism capability', I am not sure I understand it 
correctly. Ensembl gene sequence retrieval already deals with multiple species, 
so why is it necessary to modify the code to support it? I would appreciate it 
if you could elaborate on this.

Best,

Junjun
Sent from my BBerry

From: David M. Goodstein [mailto:[email protected]]
Sent: Tuesday, July 26, 2011 10:58 PM
To: Junjun Zhang
Cc: Joni Fazo <[email protected]>; [email protected] <[email protected]>; Richard 
Hayes <[email protected]>
Subject: Re: [BioMart Users] FW: trouble with sequence download

Thanks Junjun.  I think we're seeing something a bit more dramatic than simply 
duplicating records.  In fact, by dropping the batch size down to different 
values (as small as 5), we've been able to reproduce the sequence splitting 
problem (e.g., the full sequence gets split into pieces, and each piece appears 
in the fasta file with the same header) and it always occurs for the record at 
the batch boundary.  So it's happening every time, in all situations.
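
The symptom described above can be reproduced in miniature. The following is 
only a sketch of one plausible mechanism, not the BioMart code: it assumes 
exon rows arrive sorted by transcript and that each fixed-size batch of rows is 
formatted into FASTA records on its own. The row layout, the function names 
and the data are all made up for illustration.

```python
from itertools import groupby

# Hypothetical exon rows (transcript_id, exon_rank, sequence), already
# sorted by transcript and rank. All values are invented.
ROWS = [
    ("T1", 1, "ATG"), ("T1", 2, "CCC"),
    ("T2", 1, "ATG"), ("T2", 2, "GGG"), ("T2", 3, "TAA"),
    ("T3", 1, "ATG"), ("T3", 2, "TGA"),
]


def to_fasta(rows):
    """Concatenate the exons of each transcript into one (header, sequence)."""
    records = []
    for transcript_id, exons in groupby(rows, key=lambda r: r[0]):
        records.append((transcript_id, "".join(seq for _, _, seq in exons)))
    return records


def export_in_batches(rows, batch_size):
    """Format each fixed-size batch of rows independently of the others."""
    records = []
    for start in range(0, len(rows), batch_size):
        records.extend(to_fasta(rows[start:start + batch_size]))
    return records


unbatched = to_fasta(ROWS)                      # one record per transcript
batched = export_in_batches(ROWS, batch_size=4)  # T2 straddles the boundary

for header, sequence in batched:
    print(f">{header}\n{sequence}")
```

With a batch size of 4, the rows of T2 fall on both sides of the batch 
boundary, so T2 is written twice under the same header, each time with only 
part of its sequence, while T1 and T3 come out whole.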

We will poke around rc6 a little more, but it sounds like we might be better 
off trying to merge our modifications (made to handle multiple genomes in a 
single mart) into rc8.  BTW, is that capability of interest to the larger 
BioMart community?  If so, we'd be happy for the multi-organism capability to 
be pulled into the main source.

thx,
-David

On Jul 26, 2011, at 6:30 PM, Junjun Zhang wrote:

Hi David,

Batching is implemented very differently in BioMart 0.8 than in 0.7. The new 
batching does not rely on the SQL LIMIT clause at all. Under certain 
situations, batching via LIMIT is known to duplicate and/or drop data records, 
and this may be what you were experiencing in the sequence mart. Batching is 
used in data retrieval for two major reasons: 1. quicker response, since the 
first batch of results reaches users almost instantly; 2. reduced memory usage 
in any subsequent steps that perform in-memory joins.

In 0.8, these two goals are achieved in a different way. The new data 
retrieval leverages the streaming capability of the RDBMS. When a SQL query is 
executed, results start streaming back instantly. If no further data sources 
are involved in the query, the query output stream is sent directly to the 
client. If a further join with data from another dataset is needed, batching 
kicks in. A stream buffer collects intermediate results from the first output 
stream; when it has received a certain number of data rows, a second query is 
fired with a list of keys (collected from the first query's results) included 
in its WHERE clause. Once the query against the second dataset is fired, the 
stream buffer is emptied and continues to collect results from the output 
stream of the first query in preparation for the next batch. The results from 
the second query and the first query are then joined in memory via common keys, 
and the final result is streamed out to the client.
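
The buffered join described above can be sketched roughly as follows. This is 
an illustration in Python, not the BioMart 0.8 implementation (which is 
written in Java). The names batched_join, query_second and buffer_size are 
hypothetical, the second query is modelled as a function returning a dict, and 
an inner join with at most one matching row per key is assumed.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Row = Tuple[str, str]  # (join key, value) from the first query


def batched_join(
    first_stream: Iterable[Row],
    query_second: Callable[[List[str]], Dict[str, str]],
    buffer_size: int,
) -> Iterator[Tuple[str, str, str]]:
    """Join a streamed first result with a second dataset, buffer by buffer."""
    buffer: List[Row] = []

    def flush() -> Iterator[Tuple[str, str, str]]:
        keys = [key for key, _ in buffer]
        second = query_second(keys)   # stands in for "WHERE key IN (...)"
        for key, value in buffer:
            if key in second:         # assumed inner join on the common key
                yield key, value, second[key]
        buffer.clear()                # empty the buffer for the next batch

    for row in first_stream:
        buffer.append(row)
        if len(buffer) >= buffer_size:
            yield from flush()
    if buffer:                        # last, partially filled buffer
        yield from flush()


# Made-up data standing in for the second dataset.
SECOND = {"g1": "chr1", "g2": "chr2", "g4": "chr4"}
calls: List[List[str]] = []


def lookup(keys: List[str]) -> Dict[str, str]:
    calls.append(list(keys))          # record each simulated second query
    return {k: SECOND[k] for k in keys if k in SECOND}


first = iter([("g1", "a"), ("g2", "b"), ("g3", "c"), ("g4", "d"), ("g5", "e")])
result = list(batched_join(first, lookup, buffer_size=2))
print(result)
print(calls)
```

Because the function is a generator, joined rows are handed to the caller as 
soon as each buffer is processed, and at most buffer_size rows of the first 
result are held in memory at a time. No LIMIT or OFFSET is involved; the batch 
boundaries come from the buffer filling up.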

As you can see, the new batching implementation does not rely on SQL LIMIT, so 
the problem of duplicated/missing records will not happen in 0.8.

Hope this helps, let me know if you have any further questions.

Best wishes!

Junjun

From: "David M. Goodstein" <[email protected]<mailto:[email protected]>>
Date: Mon, 25 Jul 2011 18:26:15 -0400
To: jzhang <[email protected]<mailto:[email protected]>>
Cc: Joni Fazo <[email protected]<mailto:[email protected]>>, 
"[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Subject: Re: [BioMart Users] FW: trouble with sequence download

Junjun,

  We have modified our configuration to more closely match the Ensembl gene 
dataset config, but this hasn't changed our original problem (splitting of 
sequence into multiple entries in results files).  We did manage to (finally) 
completely disable batch queuing, and that does make the problem go away.

  Our question is: was there a known bug with batch queuing in rc6?  We are 
still using rc6 because we had to make some modifications to allow multiple 
genomes to be accessible in a single mart.

thanks,
-David Goodstein
 JGI

On Jul 14, 2011, at 9:10 AM, Junjun Zhang wrote:

For the 'coding' sequence type, the exportable should have orderBy set to 
'transcript_id,exon_rank', similar to what the Ensembl gene dataset shows.
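
To illustrate why the ordering key matters, here is a small sketch using 
plain Python sorting rather than anything from BioMart. The rows and the 
transcript names T1 and T2 are made up. Ordering by exon rank alone 
interleaves exons of different transcripts, whereas ordering by transcript 
first and then rank keeps each transcript's exons together and in order.

```python
# Hypothetical (transcript_id, exon_rank) rows in arbitrary order.
rows = [("T2", 1), ("T1", 2), ("T2", 2), ("T1", 1)]

# Equivalent of ordering by exon_rank only: transcripts are interleaved.
by_rank = sorted(rows, key=lambda r: r[1])

# Equivalent of ordering by transcript_id, then exon_rank.
by_transcript_then_rank = sorted(rows, key=lambda r: (r[0], r[1]))

print(by_rank)
print(by_transcript_then_rank)
```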

From: jzhang <[email protected]<mailto:[email protected]>>
Date: Thu, 14 Jul 2011 12:05:04 -0400
To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Subject: [BioMart Users] FW: trouble with sequence download

Forgot to send to the list.

From: jzhang <[email protected]<mailto:[email protected]>>
Date: Thu, 14 Jul 2011 12:01:37 -0400
To: Joni Fazo <[email protected]<mailto:[email protected]>>
Subject: Re: [BioMart Users] trouble with sequence download

Hi Joni,

I looked at the configuration file for the phytozome dataset at: 
http://www.phytozome.net/biomart/martservice?type=configuration&dataset=phytozome

It seems to me there might be a problem with the 'Exportable' setting in the 
phytozome mart. orderBy="exon_rank" may not be correct; it should be ordered by 
transcript ID. You can look at how this is set up in the Ensembl gene mart at: 
http://www.biomart.org/biomart/martservice?type=configuration&dataset=hsapiens_gene_ensembl
 (in the page search for: internalName="cdna" linkName="cdna")

You can also connect to the Ensembl mart db using MartEditor to examine the 
settings more closely:

Hope this helps!

Junjun

From: Joni Fazo <[email protected]<mailto:[email protected]>>
Date: Wed, 13 Jul 2011 13:15:28 -0400
To: jzhang <[email protected]<mailto:[email protected]>>
Subject: Re: [BioMart Users] trouble with sequence download

Hi Junjun,

Please go to: http://www.phytozome.net/biomart/martview

To download all the CDS for one genome please follow these steps:

1) Select the dataset "Phytozome 7.0 Genomes"
2) For Filters select the Organism "Arabidopsis thaliana"
3) For Attributes select "Sequences"
4) Select the radio button "Coding Sequences"
5) As well as the default check boxes, also select "Exon CDS Start" and "Exon 
CDS End"
6) Click the Results button
7) Then select "Export all results to compressed web file"

The resulting file will have the CDS sequence split for the following 5 
transcripts (so 10 entries total):
AT1G31930.2, AT3G02530.1, AT3G54470.1, AT4G24270.2, AT5G61910.4
All other transcripts in the file will just have one entry.

If you follow the above steps but add the additional filter of the transcript 
names, the resulting file will have the CDS for the transcripts in just 5 
entries, which I believe is correct.

Thanks in advance for your help,
Joni

On Wed, Jul 13, 2011 at 7:13 AM, Junjun Zhang 
<[email protected]<mailto:[email protected]>> wrote:
Hi Joni,

Is it possible to provide us with the URL where we can reproduce and test the 
problem below?

Thanks,
Junjun

From: Joni Fazo <[email protected]<mailto:[email protected]>>
Date: Tue, 12 Jul 2011 16:23:50 -0400
To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Subject: [BioMart Users] trouble with sequence download

Hello,

My name is Joni Fazo and I am trying to troubleshoot an error in our BioMart 
configuration for http://phytozome.net.

Our issue is that the sequence data for some transcripts is split into 
multiple entries when all CDS for a given genome are requested.

If the user requests the CDS for the individual transcripts, the sequence is 
presented as one entry.  Listed below is an example of the FASTA headers for 
one such transcript (AT1G31930.2):

Split entries (generated when all CDS for the genome is requested from Biomart):
>11466849;11465832|11466986;11466755|AT1G31930|AT1G31930.2
>11468117;11467266;11467524;11468367;11467780;11467074|11468289;11467445;11467698;11468961;11468036;11467178|AT1G31930|AT1G31930.2

Single entry (generated when just CDS for AT1G31930.2 is requested from 
Biomart):
>11468117;11467266;11467524;11468367;11467780;11466849;11467074;11465832|11468289;11467445;11467698;11468961;11468036;11466986;11467178;11466755|AT1G31930|AT1G31930.2
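
One way to find such split entries in a downloaded file is to count how often 
each transcript name appears in the FASTA headers. The sketch below assumes 
the header layout shown above, with the transcript name as the last 
'|'-separated field. The function name split_transcripts is made up, and the 
sequence lines in the example are placeholders, not real data.

```python
from collections import Counter


def split_transcripts(fasta_lines):
    """Return transcript names that occur in more than one FASTA header."""
    names = Counter(
        line.rstrip("\n").rsplit("|", 1)[-1]   # last field = transcript name
        for line in fasta_lines
        if line.startswith(">")
    )
    return sorted(name for name, count in names.items() if count > 1)


# The two split headers quoted above; "NNN" lines are placeholder sequence.
example = [
    ">11466849;11465832|11466986;11466755|AT1G31930|AT1G31930.2\n",
    "NNN\n",
    ">11468117;11467266;11467524;11468367;11467780;11467074"
    "|11468289;11467445;11467698;11468961;11468036;11467178"
    "|AT1G31930|AT1G31930.2\n",
    "NNN\n",
]

print(split_transcripts(example))
```

For a real download, the open file object can be passed in directly, since 
the function only iterates over lines.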

Has anyone encountered this issue or something similar?

Best regards,
Joni Fazo
Joint Genome Institute / Lawrence Berkeley National Lab
http://www.phytozome.net<http://www.phytozome.net/>

_______________________________________________
Users mailing list
[email protected]<mailto:[email protected]>
https://lists.biomart.org/mailman/listinfo/users

David M. Goodstein, Ph.D.
Phytozome Group Lead
Plant and Computational Genomics Group
Joint Genome Institute - U.S. Dept. of Energy
Center for Integrative Genomics - UC Berkeley
