Hi Mike,
(I CC'ed this to the mailing list).
Ray can classify k-mers in a taxonomy. To do so, Ray needs a taxonomy,
and any taxonomy will work. At our center, we use Greengenes and NCBI.
See these documents for general documentation about the graph coloring and
taxonomic profiling features (called Ray Communities):
- Documentation/Taxonomy.txt
- Documentation/BiologicalAbundances.txt
To download the NCBI taxonomy and generate the required files, first
get a copy of Ray:
git clone git://github.com/sebhtml/ray.git
Add the scripts directory to your PATH (this assumes the clone is in ~/git-clones/ray):
export PATH=~/git-clones/ray/scripts/NCBI-Taxonomy/:$PATH
Then, run this:
CreateRayInputStructures.sh
This will generate these files:
- NCBI-taxonomy/NCBI-Finished-Bacterial-Genomes
- NCBI-taxonomy/Genome-to-Taxon.tsv
- NCBI-taxonomy/TreeOfLife-Edges.tsv
- NCBI-taxonomy/Taxon-Names.tsv
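
Before launching a long MPI job, it can be worth checking that the script
produced everything Ray will be asked to read. A minimal sketch, assuming the
outputs land in ./NCBI-taxonomy as listed above; the function name
check_ray_inputs is made up for illustration and is not part of Ray:

```shell
#!/bin/sh
# Hypothetical helper (not part of Ray): verify that the outputs listed
# above exist under a given directory (default: NCBI-taxonomy).
check_ray_inputs() {
    dir="${1:-NCBI-taxonomy}"
    missing=0
    for name in NCBI-Finished-Bacterial-Genomes Genome-to-Taxon.tsv \
                TreeOfLife-Edges.tsv Taxon-Names.tsv; do
        # -e is true for files, directories, and symbolic links that resolve
        if [ ! -e "$dir/$name" ]; then
            echo "missing: $dir/$name" >&2
            missing=1
        fi
    done
    # Exit status 0 when everything is present, 1 otherwise
    return "$missing"
}
```

For example, "check_ray_inputs NCBI-taxonomy && echo ok" prints ok only when
all four entries are present.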
Now, you can run Ray as usual (including Ray Méta plugins), but with
additional options to run Ray Communities plugins as well:
mpiexec -n 96 \
Ray \
-k 31 -o Ray-Communities \
-p SeqA_1.fastq SeqA_2.fastq \
-p SeqB_1.fastq SeqB_2.fastq \
-search NCBI-taxonomy/NCBI-Finished-Bacterial-Genomes \
-with-taxonomy NCBI-taxonomy/Genome-to-Taxon.tsv \
NCBI-taxonomy/TreeOfLife-Edges.tsv NCBI-taxonomy/Taxon-Names.tsv
As usual, you can also put all the arguments in a configuration file like this:
mpiexec -n 96 Ray Ray.conf
where Ray.conf contains
-k 31 -o Ray-Communities
-p SeqA_1.fastq SeqA_2.fastq
-p SeqB_1.fastq SeqB_2.fastq
-search NCBI-taxonomy/NCBI-Finished-Bacterial-Genomes
-with-taxonomy NCBI-taxonomy/Genome-to-Taxon.tsv
NCBI-taxonomy/TreeOfLife-Edges.tsv NCBI-taxonomy/Taxon-Names.tsv
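
One way to create that file from a shell is a here-document. This is only a
sketch: the function name write_ray_conf is hypothetical, and the options and
file names are simply the ones from the example above.

```shell
#!/bin/sh
# Hypothetical helper: write the example options to a configuration file
# (default name: Ray.conf). The quoted 'EOF' keeps the shell from expanding
# anything inside the here-document.
write_ray_conf() {
    cat > "${1:-Ray.conf}" <<'EOF'
-k 31 -o Ray-Communities
-p SeqA_1.fastq SeqA_2.fastq
-p SeqB_1.fastq SeqB_2.fastq
-search NCBI-taxonomy/NCBI-Finished-Bacterial-Genomes
-with-taxonomy NCBI-taxonomy/Genome-to-Taxon.tsv
NCBI-taxonomy/TreeOfLife-Edges.tsv NCBI-taxonomy/Taxon-Names.tsv
EOF
}
```

After calling "write_ray_conf Ray.conf", the mpiexec line above can be used
unchanged.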
So basically, the whole thing builds a distributed de Bruijn graph really
fast (plugins for the distributed storage engine), assembles the data
de novo by distributed graph traversals (Ray Méta; plugin SeedExtender),
colors the graph with the reference genomes provided with the -search
option (Ray Communities, plugin Searcher), and computes taxonomic profiles
using the provided taxonomy (Ray Communities, -with-taxonomy, plugin
PhylogenyViewer).
All of this is heavily distributed: each Ray process has 32768 user-space
threads (workers), and you can launch as many Ray processes as you want.
If you are running Ray on a buggy network (we had problems with Mellanox
Infiniband MT26428, revision a0), you can turn on virtual communications too.
Cheers,
Sébastien
On 19/09/12 08:23 PM, Mike Peabody wrote:
> Thanks Sébastien!
>
> -Mike
>
> - Original Message -
> From: "Sébastien Boisvert"
> To: "Mike Peabody"
> Sent: Wednesday, September 19, 2012 6:46:19 AM
> Subject: Re: RE : MetaRay inquiry
>
> Hi,
>
> I should be done today I guess.
>
> On Monday, we had a deadline for the Genome Canada bioinformatics competition.
>
> Basically, the script will fetch all the finished bacterial genomes
> and all the draft bacterial genomes and create a bunch of symbolic links.
>
> Each of these FASTA files will already contain a >gi|something header to classify
> it in the NCBI taxonomy.
>
> For the NCBI taxonomy, there will be 3 files:
>
> -with-taxonomy Genome-to-Taxon.tsv TreeOfLife-Edges.tsv Taxon-Names.tsv
>
>
> I added the script in
> https://github.com/sebhtml/ray/tree/master/scripts/NCBI-Taxonomy
>
> You can get it with "git clone git://github.com/sebhtml/ray.git"
>
> The documentation is in Documentation/NCBI-Taxonomy.txt
>
> It is not complete yet though. I need to add some code to format the tree and
> taxon names.
>
> I will let you know once I have finished and tested everything.
>
>
> On 19/09/12 01:50 AM, Mike Peabody wrote:
>> Hi Sébastien,
>>
>> Just wanted to see how the script was going.
>>
>> Cheers,
>> Mike
>>
>> - Original Message -
>> From: "Sébastien Boisvert"
>> To: "Mike Peabody"
>> Sent: Thursday, September 13, 2012 6:27:28 PM
>> Subject: Re: RE : MetaRay inquiry
>>
>> I will write you a script that downloads the required files and
>> converts them.
>>
>> I should get back to you by next Tuesday.
>>
>>
>> On 12/09/12 09:23 AM, Mike Peabody wrote:
>>> Hi Sébastien,
>>>
>>> Maybe you can upload the files to filedropper or another similar website?
>>> http://www.filedropper.com/
>>>
>>> Thanks!
>>> Mike
>>>
>>> - Original Message -
>>> From: "Sébastien Boisvert"
>>> To: "Mike Peabody"
>>> Sent: Wednesday, September 12, 2012 4:51:46 AM
>>> Subject: Re: RE : MetaRay inquiry
>>>
>>> Hi Mike,
>>>
>>> The 3 files required for taxonomic profiling (plus reference genomes) are:
>>>
>>> -with-taxonomy \
>>> Genome-to-Taxon.tsv \
>>> TreeOfLife-Edges.tsv \
>>> Taxons.tsv
>>>
>>>
>>> There is documentation at Documentation/Taxonomy.txt, but it seems that
>>> since I wrote the initial version, NCBI has changed (once again!)
>>> the file formats on their FTP.
>>>
>>>
>>> The file ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip use