Hi all,

Our group previously wrote to the list after noticing a bug in BioMart 0.6
affecting sequence queries that involve exons, such as CDS FASTA retrieval.
When requesting an entire transcriptome, proteome, etc., a handful of
sequences were returned either with exons missing entirely or split across
multiple FASTA entries.

With the help of Junjun Zhang, we were able to trace this to a problem with
how BioMart versions 0.6 and 0.7 combine the results of batched SQL queries.
(The problem does not occur at all if batching is completely disabled, but
that unfortunately introduces a significant performance hit.) I have found a
fix for the DatasetI.pm module. Essentially, hash keys were not sorted in the
step that merges dataset attributes from the previous and current batches,
so transcript data were handled improperly whenever a transcript's exons
happened to be split between SQL query batches.
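
To illustrate the failure mode, here is a simplified Python sketch. It is not
the Perl code in DatasetI.pm and not the patch itself; the row layout
(transcript_id_key, exon_rank, sequence) and the function names are made up
for illustration, and it assumes rows arrive grouped by ascending transcript
key. The first function assembles each batch in isolation; the second walks
the keys in sorted order and holds back the highest key in each batch, since
that transcript may continue in the next batch.

```python
def _join(rows):
    """Concatenate exon sequences in exon_rank order."""
    return "".join(seq for _, seq in sorted(rows))


def assemble_per_batch(batches):
    """Naive: emit entries per batch, so a transcript spanning two batches is split."""
    entries = []
    for batch in batches:
        by_transcript = {}
        for tid, rank, seq in batch:
            by_transcript.setdefault(tid, []).append((rank, seq))
        for tid in by_transcript:
            entries.append((tid, _join(by_transcript[tid])))
    return entries


def assemble_with_carry(batches):
    """Merge the previous batch's trailing transcript into the current batch."""
    entries = []
    carry_tid, carry_rows = None, []
    for batch in batches:
        by_transcript = {}
        if carry_tid is not None:
            by_transcript[carry_tid] = list(carry_rows)
        for tid, rank, seq in batch:
            by_transcript.setdefault(tid, []).append((rank, seq))
        keys = sorted(by_transcript)  # deterministic order, unlike raw hash order
        if not keys:
            continue
        *finished, last = keys
        for tid in finished:
            entries.append((tid, _join(by_transcript[tid])))
        # The highest key may have more exons in the next batch: hold it back.
        carry_tid, carry_rows = last, by_transcript[last]
    if carry_tid is not None:
        entries.append((carry_tid, _join(carry_rows)))
    return entries


# Transcript 2 has exons in both batches.
batches = [
    [(1, 1, "ATG"), (1, 2, "GCC"), (2, 1, "ATG")],
    [(2, 2, "TTT"), (2, 3, "TAA"), (3, 1, "ATGTGA")],
]
split_result = assemble_per_batch(batches)
merged_result = assemble_with_carry(batches)
```

With this toy input, the per-batch version yields two separate entries for
transcript 2, while the carry-over version yields a single entry containing
all three of its exons.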

The attached patch file has been tested successfully on both version 0.6 and
version 0.7. With it applied, I have been able to return correct, complete
FASTA data files without any special filter/attribute orderBy constraints.
That we do not need such constraints may be a quirk of our database, in which
exons for each transcript are batch-loaded in the correct exon_rank order. If
that is not the case for your data, you should, in addition to applying this
patch, use orderBy constraints of "transcript_id_key, exon_rank" (similar to
the Ensembl gene dataset configuration) on the "coding" and "peptide"
structure exportables in your configuration.
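
As a rough illustration of why that ordering matters (again a Python sketch
with made-up rows, not BioMart configuration or code): if exon rows come back
out of rank order, concatenating them as they arrive produces a scrambled
sequence, whereas ordering by transcript_id_key and then exon_rank restores
the intended one.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical exon rows for one transcript, returned out of rank order.
rows = [
    {"transcript_id_key": 7, "exon_rank": 2, "seq": "GCC"},
    {"transcript_id_key": 7, "exon_rank": 1, "seq": "ATG"},
    {"transcript_id_key": 7, "exon_rank": 3, "seq": "TAA"},
]


def concat_in_order(rows):
    """Sort by (transcript_id_key, exon_rank), then join exons per transcript."""
    ordered = sorted(rows, key=itemgetter("transcript_id_key", "exon_rank"))
    return {
        tid: "".join(r["seq"] for r in group)
        for tid, group in groupby(ordered, key=itemgetter("transcript_id_key"))
    }


arrival_order = "".join(r["seq"] for r in rows)
rank_order = concat_in_order(rows)
```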

Best regards,

-- 
Richard D. Hayes, Ph.D.
Joint Genome Institute / Lawrence Berkeley National Lab
http://www.phytozome.net

Attachment: DatasetI.pm.0.7.sqlBatchingFix.patch
Description: Binary data

