Re: [galaxy-user] Metagenomic filtering
Jing,

All good thoughts, and if I remember correctly, custom software can indeed be incorporated into Galaxy through the Tool Shed. I'll check into this with Jennifer.

Thanks,
Scott

Scott Tighe
Senior Core Laboratory Research Staff
Advanced Genome Technologies Core
University of Vermont
Vermont Cancer Center
149 Beaumont Ave
Health Science Research Facility 303/305
Burlington Vermont 05405
802-656-2557
Re: [galaxy-user] Metagenomic filtering
Dear Scott,

I think what you propose is doable. You may:

1. Use a 16S or gyrase DNA sequence as the query to BLAST against your data to get the relevant sequences.
2. Then use those sequences as queries to BLAST against your nucleotide database with appropriate filters.

There are several ways to carry out these steps. For example, you may already have the 16S sequence from an assembly against a reference genome. For step 2, if you are not blasting thousands of times a day, and you trust the recent stability of NCBI, then a simple web_blast script will do the trick. Otherwise, since the local BLAST+ toolkit doesn't provide the equivalent organism filters, you'll have to work a wee bit on it:

- Make a nucleotide database for prokaryotes.
- Search txid561[ORGN] on http://www.ncbi.nlm.nih.gov/nuccore (this is for Escherichia, as an example), then choose Send to 'File', Format 'GI List'.
- When you BLAST, use this GI list as the value of the -negative_gilist argument.
- Then parse the BLAST result.

Most of this can be automated with some code, but I don't know how to incorporate it into Galaxy.

Regards,
Jing

On 4 Oct 2013, at 23:52, Scott Tighe scott.ti...@uvm.edu wrote:

Dear Jing,

What you have outlined below is perfect. I wonder how hard it would be to design a few filters that only look at certain genes and/or filter model organisms out of the dataset. For example, say you want only data for 16S or only gyrase, but no E. coli and no Pseudomonas aeruginosa.

Scott
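As a rough illustration of the GI-list route Jing describes, the sketch below builds a local megablast command line and parses tabular output. The -negative_gilist flag is the one named in the thread; the other flags are standard blastn options as I understand them, and all file and database names are hypothetical. Newer BLAST+ releases may prefer other ID-based filters, so check `blastn -help` for your installed version before relying on this.

```python
from typing import Dict, List

# Default column order of BLAST+ tabular output (-outfmt 6).
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def build_blastn_command(query_fasta: str, database: str,
                         negative_gi_file: str, out_path: str) -> List[str]:
    """Return an argument list for a megablast run that skips the listed GIs."""
    return [
        "blastn",
        "-task", "megablast",
        "-query", query_fasta,
        "-db", database,
        # GI list saved from the NCBI nuccore search (Send to 'File', GI List).
        "-negative_gilist", negative_gi_file,
        "-outfmt", "6",  # tab-separated, one hit per line
        "-out", out_path,
    ]


def parse_tabular_hits(text: str) -> List[Dict[str, str]]:
    """Turn tab-separated BLAST output into one dict per hit (values stay strings)."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hits.append(dict(zip(OUTFMT6_COLUMNS, line.split("\t"))))
    return hits
```

The command list can be handed to `subprocess.run(cmd, check=True)` on a machine with BLAST+ installed; the parser works on the contents of the resulting output file.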
Re: [galaxy-user] Metagenomic filtering
Hi Scott,

My first thought is:

1. Remove rDNA sequences (and/or other well-known, highly conserved sequences, to reduce the workload in step 2).
2. BLAST, then remove sequences with a (say 99%) match to (say 5) genera. (Optional if step 1 is already good enough.)

For step 1: Build a FASTA file of the chosen highly conserved sequences, and use it as the query to BLAST against your MiSeq result. Remove positive hits.

For step 2: BLAST the remaining MiSeq sequences against the NCBI (or whatever) database. Remove a sequence if it hits more than n genera.

Jing

___
The Galaxy User list should be used for the discussion of Galaxy analysis and other features on the public server at usegalaxy.org. Please keep all replies on the list by using "reply all" in your mail client. For discussion of local Galaxy instances and the Galaxy source code, please use the Galaxy Development list: http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
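Step 2 of Jing's proposal (drop any read that matches more than n genera at high identity) can be sketched as below. This is only an illustration under assumptions: the input tuples, the function name, and the way a genus is attached to each hit are hypothetical, and the thread does not say how the genus lookup would be done.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple

# Hypothetical hit record: (read_id, genus_of_matched_sequence, percent_identity)
Hit = Tuple[str, str, float]


def reads_hitting_many_genera(hits: Iterable[Hit],
                              min_identity: float = 99.0,
                              n_genera: int = 5) -> Set[str]:
    """Return IDs of reads whose high-identity hits span more than n_genera genera."""
    genera_per_read = defaultdict(set)
    for read_id, genus, identity in hits:
        if identity >= min_identity:
            genera_per_read[read_id].add(genus)
    # "More than n" follows the wording of step 2 in the message.
    return {read for read, genera in genera_per_read.items()
            if len(genera) > n_genera}
```

If reads matching exactly n genera should also be discarded, change `>` to `>=`. The 99% and 5 defaults are the "say" values from the message, not tuned thresholds.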
Re: [galaxy-user] Metagenomic filtering
Jing et al.,

Thank you for the offer to write some code to help advance the metagenomics arena. It is certainly needed.

The problem with megablast and shotgun metagenomics is well known: without proper understanding and correct software, it will yield very misleading and in many cases incorrect data. For those of us who wish NOT to move to a protein level of comparison for specific reasons, we are stuck.

*The problem:* If I megablast 50 million sequences from a HiSeq run, millions of rRNA sequences will have a 99% match to the rRNA GenBank deposits of all microbes. Not surprising, since rRNA is highly conserved. The difference between E. coli and Shigella is 1 to 2 bases over the full 1540 bp 16S. So 16S is not useful at the genus level, and certainly not at the species level.

*So what happens:* The returned matches will have many hits to whatever model organism is in GenBank. For example, E. coli has 13000 entries for rRNA and Sphaerotilus has 3 entries for rRNA. If the blasted sequence matches both, the results will mislead the investigator into thinking they have 13000 hits to E. coli, EVEN if the microbe is Sphaerotilus.

*The cure?:* What if there were a way to filter/remove all such hits? Let's say, for example, that a result has a first match (say E. coli) at 99%, a second match (say Pseudomonas) at 99%, and third, fourth and fifth matches at 99% to three other organisms. This sequence _must_ be discarded because it is a conserved sequence. Basically, conserved sequence is the enemy and invalidates the entire result.

*Another problem:* If you have a reference sample with 19 non-model microbes, and you run that by HiSeq shotgun for metagenomics and then megablast, what do you think you get? If E. coli is not in the reference sample, how many hits do you think you get? Yes, tens of thousands. So without removing conserved sequences, your data is wrong and you are much better served by culturing and running a Biolog metabolic panel and comparing to the sequence result.

So where do we start? I have some shotgun metagenomics data from the reference sample which included the 19 microbes. That was data from a MiSeq.

Scott

On 9/20/2013 9:17 PM, Jing Yu wrote:

Hi Scott,

I can do some Perl programming, such as local/remote blasting. Can you specify your problem a little more clearly, so that maybe I can write a program to do just that?

Regards,
Jing
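To make the database-size bias in Scott's example concrete (13000 E. coli rRNA entries versus 3 for Sphaerotilus), the sketch below contrasts counting every database hit with counting each read once per organism. The function names and the (read, organism) input format are assumptions for illustration, not anything Galaxy or megablast provides.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical hit record: (read_id, organism_of_matched_database_entry)
ReadHit = Tuple[str, str]


def raw_hit_counts(hits: Iterable[ReadHit]) -> Counter:
    """Count every hit, so organisms with many database entries dominate."""
    return Counter(organism for _read, organism in hits)


def per_read_counts(hits: Iterable[ReadHit]) -> Counter:
    """Count each read at most once per organism, whatever the number of entries."""
    return Counter(organism for _read, organism in set(hits))
```

For a single conserved read that matches all 13000 E. coli entries and all 3 Sphaerotilus entries, `raw_hit_counts` reports 13000 versus 3, while `per_read_counts` reports 1 versus 1. The second view shows that the read is equally compatible with both organisms and says nothing about which one is present.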
Re: [galaxy-user] Metagenomic filtering
Hi all,

Not to derail the conversation, but I wanted to point out some Galaxy resources that may help when considering how to approach a solution. These may already be known, but I thought I'd put them out there just in case. See below.

Best!
Jen
Galaxy team

There are at least three public Galaxy instances that focus heavily on metagenomics. Maybe worth a look?
http://wiki.galaxyproject.org/PublicGalaxyServers
Just do a browser search for "metagenomics" to find them on the page. There may be others, but these are the top 3.

The Tool Shed may or may not contain specialized tools from these servers. Asking to have those tools made available via the Tool Shed can be done through direct contact. Other repos may have tools that fit or could be tuned. Tool authors own their tools, so changes could potentially be incorporated through direct contact. Or, as the tools are open source, they could be used as a baseline, with attribution, if that doesn't work out.
http://toolshed.g2.bx.psu.edu/

Making a Galaxy Trello ticket for new tools and discussing new tool development on the galaxy-...@bx.psu.edu list may help you find other Galaxy community developers working on similar projects. Tickets are not just for the Galaxy core team, and even though the issue to solve is scientific, a technical implementation seems to be where this is going (a new tool, or tuning of an existing tool).
http://wiki.galaxyproject.org/Issues - Inbox is where this would go.

The final home is almost certainly the Tool Shed (same for all tools), but the possibility of also including it on the Galaxy Main server exists once there is a valid repo and it is determined to be a good fit (resources, etc.).
Re: [galaxy-user] Metagenomic filtering
Scott,

Agreed, 16S is not accurate if you only have partial sequences. I would make the Galaxy button(s) more specific, saying "remove all rRNA and tRNA genes from bacteria/archaea/eukaryotes". That would leave the user with protein-coding regions and intergenic regions. Ideally, one would then add an option "compare to gene collection", which would then give options for a collection of gyrase, etc. As the gyrase collection is no longer available, one would have to rebuild it from the sequenced genomes. That's far from perfect in terms of coverage, but at least the quality of the published genomes is generally good (rRNA gene sequences are often not very good, which is another problem with the rRNA approach). Currently, I don't know of such a program.

Gerald
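The "remove all rRNA and tRNA genes" button Gerald describes comes down to a read filter once the offending reads have been identified, for example by a BLAST search against a FASTA file of conserved sequences. A minimal sketch follows, assuming the flagged read IDs are already collected in a set; the function names are hypothetical and the FASTA parsing is deliberately simple.

```python
from typing import Iterable, Iterator, Set, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        yield header, "".join(chunks)


def drop_flagged_reads(lines: Iterable[str],
                       flagged_ids: Set[str]) -> Iterator[Tuple[str, str]]:
    """Yield only the records whose ID (first word of the header) is not flagged."""
    for header, sequence in read_fasta(lines):
        read_id = header.split()[0] if header.strip() else ""
        if read_id not in flagged_ids:
            yield header, sequence
```

Passing an open file handle as `lines` streams the reads, which matters for HiSeq-sized inputs. What remains after filtering would be the protein-coding and intergenic reads Gerald mentions, to the extent the flagged set is complete.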
Re: [galaxy-user] Metagenomic filtering
Gerald,

16S is basically useless for identification to genus. Since I started sequencing 16S in 1992, I have come to realize that without sequencing the full 1540 bases it is generally misleading, and even then it is not accurate enough to nail the genus in more than half the cases. However, what is your feeling on ITS and gyrase? They seem to be far more discriminating, but those databases were decommissioned some time ago.

The desirable thing would be for Galaxy or NCBI to add a "filter conserved genes" option [i.e. any hit with a second choice greater than 3% distance]. Something like that. If you (or others) are aware of such a thing, I'd love to hear about it.

Sincerely,
Scott

On 9/18/2013 2:05 PM, Gerald Bothe wrote:

Removing model organisms may not be enough; you may have the same problem with, say, a Clostridium cluster IV anaerobe. I think a solution would be to:

first: compare to a collection of genes, e.g. get all the hits for 16S rRNA genes and RNA polymerases (conserved to quite conserved), and e.g. for ion channels and cell surface proteins.

then: once a read or contig is identified as belonging to a gene family, gene, or protein domain, check within that group for species identities. Then you compare apples to apples in terms of gene conservation level.

Does anybody know a program that would do this efficiently from metagenomic data?

Gerald Bothe
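One way to combine Gerald's two-stage idea (assign a read to a gene family first, then compare species within that family) with Scott's suggested 3% criterion is sketched below. It assumes the gene-family assignment has already been done upstream, and it reads the 3% rule as "call a species only when the runner-up species is more than 3 percentage points of identity behind the best one". That reading, the record layout, and the function name are my assumptions, not something stated in the thread.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical record: (read_id, gene_family, species, percent_identity)
AnnotatedHit = Tuple[str, str, str, float]


def call_species(hits: Iterable[AnnotatedHit],
                 min_gap: float = 3.0) -> Dict[Tuple[str, str], Optional[str]]:
    """For each (read, gene family), return the best species or None if ambiguous."""
    best = defaultdict(dict)  # (read_id, family) -> {species: best identity seen}
    for read_id, family, species, identity in hits:
        per_species = best[(read_id, family)]
        per_species[species] = max(identity, per_species.get(species, 0.0))

    calls = {}
    for key, per_species in best.items():
        ranked = sorted(per_species.items(), key=lambda item: item[1], reverse=True)
        top_species, top_identity = ranked[0]
        # A lone candidate, or a clear lead over the runner-up, gives a call.
        if len(ranked) == 1 or top_identity - ranked[1][1] > min_gap:
            calls[key] = top_species
        else:
            calls[key] = None  # conserved region: cannot discriminate
    return calls
```

Because the gap is evaluated within one gene family at a time, a highly conserved family such as 16S will mostly yield `None`, while a more variable family such as gyrase has a chance of producing a call. That is the "apples to apples" comparison Gerald asks for.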
Re: [galaxy-user] Metagenomic filtering
Dear Galaxy,

When running a HiSeq shotgun metagenomics sample from the environment through megablast and taxonomic representation, how do I filter/remove all the 16S and other conserved sequences? The problem is that when blasting a single organism that has a fraction of conserved sequence, the results will align with E. coli 10,000 times more often than with the possible target organism. This data would be wrong and misleading. For example, a 100 mg sample that was negative for E. coli by the MUG test gives thousands of hits with Galaxy.

1) Is there a "filter conserved sequences" setting?
2) Is there a "remove model organisms" setting?

Scott Tighe
--
Core Laboratory Research Staff, Advanced Genome Technologies Core, Deep Sequencing (MPS) Facility, Vermont Cancer Center, 149 Beaumont Ave, University of Vermont HSRF 303, Burlington, Vermont, USA 05405, 802-656-AGTC, 802-999- (cell)

Quoting Jennifer Jackson j...@bx.psu.edu:

Hello Elwood, Are you still having connection issues today? Or is this resolved? Best, Jen, Galaxy team

On 9/13/13 11:36 AM, Elwood Linney wrote:

A message I sent earlier this week indicated that I could not connect to Galaxy via Fetch to download data. A reply indicated that a glitch was fixed. I could then connect with Fetch, and I tried to transfer 4 x 16 GB files, but the connection dropped about 4 times. Now, once again, I cannot connect with Galaxy online to transfer data. Is this a problem that can be solved, either at my end or at Galaxy? Elwood Linney

-- Jennifer Hillman-Jackson http://galaxyproject.org
Re: [galaxy-user] Metagenomic filtering
Hi Scott,

The tool "Metagenomic analyses - Find diagnostic hits" can be used to isolate the conserved sequences. Then use the tool "Join, Subtract and Group - Compare" with the option to find "Non Matching rows of 1st dataset" to filter out anything that you think is spurious for your analysis (put the original file first and the output of Find diagnostic hits second) before moving forward with the other summary tools.

You will probably want to run the Find diagnostic hits tool more than once. The choice is yours whether to run Compare after each run, or to use "Text Manipulation - Concatenate" to join all the results together first and then run Compare once. The first might work faster; it depends on the size of your datasets (how much filtering occurred before this step, etc.). The Compare tool sorts and holds data in memory. Even if you need to break the data up and run it in smaller chunks, the results should be the same in the end. None of these jobs require the data to be in one lump.

Others are welcome to add their own strategies; I am sure there are other ways to do this. Some of the public servers specializing in metagenomics may also have tools or options for this, and some of them may have donated tools to the Tool Shed for local or cloud use. May be worth a look: http://wiki.galaxyproject.org/PublicGalaxyServers

Good question!

Jen
Galaxy team

On 9/18/13 7:03 AM, Scott W. Tighe wrote:

Dear Galaxy,

When running a HiSeq shotgun metagenomics sample from the environment through megablast and taxonomic representation, how do I filter/remove all the 16S and other conserved sequences? The problem is that when blasting a single organism that has a fraction of conserved sequence, the results will align with E. coli 10,000 times more often than with the possible target organism. This data would be wrong and misleading. For example, a 100 mg sample that was negative for E. coli by the MUG test gives thousands of hits with Galaxy.

1) Is there a "filter conserved sequences" setting?
2) Is there a "remove model organisms" setting?

Scott Tighe

-- Jennifer Hillman-Jackson http://galaxyproject.org
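For anyone who wants to see the logic of the Compare step outside Galaxy, here is a minimal sketch in Python. It is not Galaxy's implementation, only an illustration of "Non Matching rows of 1st dataset" on tab-separated data. The function names, the toy read IDs and hit labels, and the choice of the first column as the comparison key are all assumptions made for the example. It also checks Jen's point that comparing after each Find diagnostic hits run gives the same result as concatenating the runs first.

```python
import csv
import io


def parse_tabular(text):
    """Parse tab-separated text into a list of rows, skipping empty lines."""
    return [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]


def non_matching_rows(first, second, column=0):
    """Return rows of `first` whose key column value never appears in `second`.

    `column` is 0-based; using the first column (read ID) is an assumption.
    """
    second_keys = {row[column] for row in second}
    return [row for row in first if row[column] not in second_keys]


# Hypothetical toy data: read ID in column 1, hit label in column 2.
original_hits = parse_tabular(
    "read1\thitA\nread2\thitB\nread3\thitC\nread4\thitD\n"
)
diagnostic_run_1 = parse_tabular("read2\thitB\n")  # e.g. conserved rRNA reads
diagnostic_run_2 = parse_tabular("read4\thitD\n")  # e.g. another conserved set

# Strategy 1: concatenate all diagnostic results, then compare once.
concat_then_compare = non_matching_rows(
    original_hits, diagnostic_run_1 + diagnostic_run_2
)

# Strategy 2: compare after each diagnostic run.
compare_after_each = non_matching_rows(
    non_matching_rows(original_hits, diagnostic_run_1), diagnostic_run_2
)

for row in concat_then_compare:
    print("\t".join(row))
```

Both strategies leave only the rows for read1 and read3 here. Only the key set of the second dataset has to be held in memory in this sketch, so the first dataset could be processed in chunks and the pieces joined afterwards without changing the result.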