Re: [galaxy-user] Metagenomic filtering

2013-10-08 Thread Scott Tighe

Jing

All good thoughts, and if I remember correctly, custom software can 
indeed be incorporated into Galaxy through use of the Tool Shed. I'll 
check into this with Jennifer.


Thanks

Scott


Scott Tighe
Senior Core Laboratory Research Staff
Advanced Genome Technologies Core
University of Vermont
Vermont Cancer Center
149 Beaumont ave
Health Science Research Facility 303/305
Burlington Vermont 05405
802-656-2557


Re: [galaxy-user] Metagenomic filtering

2013-10-06 Thread Jing Yu
Dear Scott,

I think what you propose is doable. 

You may 
1. use a 16S or gyrase DNA sequence as a query to BLAST against your data to get 
the relevant sequences, 
2. and then use those sequences as queries to BLAST against your nucleotide 
database with appropriate filters.

There are several ways to carry out these steps. For example, you may already have the 
16S sequence from assembly against a reference genome.
And for Step 2, if you are not blasting thousands of times a day, and believe 
in the recent stability of NCBI, then a simple web_blast script will do the 
trick. Otherwise, since the local BLAST+ toolkit doesn't provide the equivalent 
organism filters, you'll have to work a wee bit on it:

1. Make a nucleotide database for prokaryotes.
2. Search txid561[ORGN] on http://www.ncbi.nlm.nih.gov/nuccore (this is for 
Escherichia as an example).
3. Send to 'File' - Format - GI List.
4. When running BLAST, use this GI list as the value of this argument: -negative_gilist
5. Then parse the BLAST result.

Most of these can be automated with some code, but I don't know how to 
incorporate it into Galaxy. 
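
As a rough sketch of how the local BLAST step above might be scripted, the Python below only assembles a blastn command line that passes a saved GI list to -negative_gilist and asks for tabular output. The file names, the database name and the helper names are made up for illustration, and -negative_gilist depends on the BLAST+ version and database format in use, so please check it against your own installation.

```python
import subprocess
from typing import List


def build_blastn_command(query_fasta: str, database: str,
                         negative_gi_file: str, out_path: str,
                         perc_identity: float = 99.0) -> List[str]:
    """Assemble a blastn call that skips every subject listed in the GI file."""
    return [
        "blastn",
        "-task", "megablast",
        "-query", query_fasta,
        "-db", database,
        # GI list saved from the nuccore search (e.g. txid561[ORGN])
        "-negative_gilist", negative_gi_file,
        "-perc_identity", str(perc_identity),
        # Tabular output, one hit per line, easy to parse afterwards
        "-outfmt", "6",
        "-out", out_path,
    ]


def run_blastn(command: List[str]) -> None:
    """Run the assembled command; requires BLAST+ to be installed locally."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical file and database names, for illustration only.
    cmd = build_blastn_command("reads.fasta", "prokaryote_nt",
                               "escherichia.gi", "hits.tsv")
    print(" ".join(cmd))
```

Building the argument list separately from running it makes it easy to inspect the command before spending time on a large search.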

Regards,
Jing
On 4 Oct 2013, at 23:52, Scott Tighe scott.ti...@uvm.edu wrote:

 Dear Jing
 
 What you have outlined below is perfect. 
 
 I wonder how hard it would be to design a few filters that only look at 
 certain genes and/or filter model organisms out of the dataset. 
 
 For example, say you want only data for 16S or only gyrase, but no E. coli and 
 no Pseudomonas aeruginosa.
 
 Scott
 Scott Tighe
 Senior Core Laboratory Research Staff
 Advanced Genome Technologies Core
 University of Vermont
 Vermont Cancer Center
 149 Beaumont ave
 Health Science Research Facility 303/305
 Burlington Vermont 05405
 802-656-2557

Re: [galaxy-user] Metagenomic filtering

2013-09-25 Thread Jing Yu
Hi Scott,

My first thought is:

1. Remove rDNA sequences (and/or other well-known highly conserved sequences) to 
reduce the workload in step 2.
2. BLAST, then remove sequences with a high (say 99%) match to several (say 5) genera. 
(Optional if step 1 is already good enough)


For step 1:
Build a FASTA file of the chosen highly conserved sequences, and use it as 
a query to BLAST against your MiSeq result.
Remove positive hits.
For step 2:
BLAST the remaining MiSeq sequences against the NCBI (or whatever) database.
Remove a sequence if it hits more than n genera.
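
A minimal sketch of step 2, assuming the BLAST result is in tabular form (query ID, subject ID and percent identity in the first three columns) and assuming you already have some lookup from subject ID to genus. That lookup, the function name and the thresholds are placeholders of mine, not part of any existing tool.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def reads_to_discard(blast_rows: Iterable[str],
                     subject_to_genus: Dict[str, str],
                     min_identity: float = 99.0,
                     max_genera: int = 5) -> Set[str]:
    """Return reads whose high-identity hits span more than max_genera genera."""
    genera_per_read = defaultdict(set)
    for row in blast_rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        read_id, subject_id, identity = fields[0], fields[1], float(fields[2])
        genus = subject_to_genus.get(subject_id)
        # Only hits at or above the identity cutoff count toward the tally
        if genus is not None and identity >= min_identity:
            genera_per_read[read_id].add(genus)
    return {read_id for read_id, genera in genera_per_read.items()
            if len(genera) > max_genera}
```

Reads returned by this function would be dropped before any taxonomic counting; everything else is kept.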

Jing

___
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using reply all in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

  http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:

  http://galaxyproject.org/search/mailinglists/

Re: [galaxy-user] Metagenomic filtering

2013-09-24 Thread Scott Tighe

Jing et al

Thank you for the offer to write some code to help advance the 
metagenomics arena. It is certainly needed.


The problem with megablast and shotgun metagenomics is well known: 
without proper understanding and the correct software, it will yield very 
misleading and in many cases incorrect data. Those of us who, for specific 
reasons, do NOT wish to move to a protein-level comparison are 
stuck.


*The Problem:*

If I megablast 50 million sequences from a HiSeq run, millions of rRNA 
sequences will have a 99% match to all microbial rRNA GenBank deposits. 
Not surprising, since rRNA is highly conserved. The difference 
between E. coli and Shigella is 1 to 2 bases over the full 1540 bp 16S. 
So 16S is not useful at the genus level, and certainly not at the species level.


*So what happens:*

The returned matches will have many hits to whatever model organism is 
in GenBank. For example, E. coli has 13000 entries for rRNA and 
Sphaerotilus has 3 entries for rRNA. If the blasted sequence matches 
both, the results will mislead the investigator to think they have 13000 
hits to E. coli, EVEN if the microbe is Sphaerotilus.
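
To make the database-size effect concrete, here is a small sketch that contrasts counting every returned hit with counting each organism only once per read. The read ID is invented and the hit list is synthetic, built from the 13000 and 3 entry counts mentioned above.

```python
from collections import Counter
from typing import Iterable, Tuple

Hit = Tuple[str, str]  # (read_id, organism)


def raw_hit_counts(hits: Iterable[Hit]) -> Counter:
    """Count every hit, so organisms with many database entries dominate."""
    return Counter(organism for _read_id, organism in hits)


def per_read_counts(hits: Iterable[Hit]) -> Counter:
    """Count each organism at most once per read, whatever its entry count."""
    unique_pairs = set(hits)
    return Counter(organism for _read_id, organism in unique_pairs)


if __name__ == "__main__":
    # One conserved read matching every rRNA entry of both organisms.
    hits = [("read1", "E. coli")] * 13000 + [("read1", "Sphaerotilus")] * 3
    print(raw_hit_counts(hits))
    print(per_read_counts(hits))
```

Collapsing to one vote per read removes the bias from the number of deposits, but the read is still ambiguous between the two organisms, which is why it would need to be filtered out as well.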


*The cure?:*

What if there was a way to filter/remove all such hits? Let's say, for example, 
that a result has a first match (say E. coli) at 99%, a second match 
(say Pseudomonas) at 99%, and a third, fourth and fifth match at 99% for 
three other organisms. This sequence _must_ be discarded because it is a 
conserved sequence.


Basically conserved sequence is the enemy and invalidates the entire 
result.

*Another problem:*

If you have a reference sample with 19 non-model microbes, and you run 
that by HiSeq shotgun for metagenomics and then megablast, what do you 
think you get? If E. coli is not in the reference sample, how many hits 
do you think you get? Yes, tens of thousands. So without removing 
conserved sequences, your data is wrong and you are much better served 
by culturing and running a Biolog metabolic panel and comparing to the 
sequence result.


So where do we start? I have some shotgun metagenomics data from the 
reference sample which included the 19 microbes. That was data from a MiSeq.


Scott

Scott Tighe
Senior Core Laboratory Research Staff
Advanced Genome Technologies Core
University of Vermont
Vermont Cancer Center
149 Beaumont ave
Health Science Research Facility 303/305
Burlington Vermont 05405
802-656-2557

On 9/20/2013 9:17 PM, Jing Yu wrote:

Hi Scott,

I can do some Perl programming, such as local/remote blasting. Can you 
specify your problem a little more clearly, so that maybe I can write a 
program to do just that?


Regards,
Jing





Re: [galaxy-user] Metagenomic filtering

2013-09-24 Thread Jennifer Jackson

Hi all -
Not to derail the conversation, but I wanted to point out some Galaxy 
resources that may help when considering how to approach a solution. These 
may already be known, but I thought I'd put them out there just in case. See below.

Best!
Jen
Galaxy team

There are at least three public Galaxy instances that focus heavily on 
Metagenomics. Maybe worth a look?

http://wiki.galaxyproject.org/PublicGalaxyServers
Just do a browser search for "metagenomics" to find them on the page. There may be 
others, but these are the top 3.


The Tool Shed may or may not contain specialized tools from these 
servers. Asking to have those tools made available via the Tool Shed 
can be done through direct contact. Other repos may have tools that fit or 
could be tuned. Tool authors own their tools, so changes could potentially be 
incorporated through direct contact. Or, since the tools are open source, they 
can be used as a baseline with attribution if that doesn't work out.

http://toolshed.g2.bx.psu.edu/

Making a Galaxy Trello ticket for new tools and discussing new tool 
development on the galaxy-...@bx.psu.edu list may help you find other 
Galaxy community developers working on similar projects. Tickets are not 
just for the Galaxy core team, and even though the issue to solve is 
scientific, a technical implementation seems to be where this is going 
(a new tool or tuning of an existing tool).
http://wiki.galaxyproject.org/Issues - the Inbox is where this would go. 
The final home is almost certainly the Tool Shed (same for all tools), but 
inclusion on the Galaxy Main server is also possible once 
there is a valid repo and it is determined to be a good fit (resources, 
etc.).



Re: [galaxy-user] Metagenomic filtering

2013-09-20 Thread Gerald Bothe
Scott,
agreed, 16S is not accurate if you only have partial sequences. I would make 
the Galaxy button(s) more specific, saying "remove all rRNA and tRNA genes from 
bacteria/archaea/eukaryotes". That would leave the user with protein-coding 
regions and intergenic regions. Ideally, one would then add an option "compare 
to gene collection", which would then give options for a collection of gyrase 
etc. As the gyrase collection is no longer available, one would have to rebuild 
this from the sequenced genomes. That's far from perfect in terms of coverage, 
but at least the quality of the published genomes is generally good (rRNA gene 
sequences are often not very good, another problem with the rRNA approach). 
Currently, I don't know of such a program.
 
Gerald




Re: [galaxy-user] Metagenomic filtering

2013-09-19 Thread Scott Tighe

Gerald

16S is basically useless for identification to genus. Since I started 
sequencing 16S in 1992, I have come to realize that without sequencing 
the full 1540 bases, it is generally misleading, and even then, it is 
not accurate enough to nail the genus in more than half the cases. However, 
what is your feeling on ITS and gyrase? They seem to be far more 
discriminating, but those databases were decommissioned some time ago.


The desirable thing would be for Galaxy or NCBI to add a "filter 
conserved genes" option [i.e. any hit with a second choice greater than 3% 
distance]. Something like that.
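
One possible reading of that rule, sketched in Python: keep a read only when its best-matching organism beats the next-best organism by more than 3 percentage points of identity. The function name, the input layout and the exact meaning given to "distance" are assumptions for illustration, not an existing Galaxy or NCBI setting.

```python
from typing import Iterable, Tuple


def is_discriminating(hits: Iterable[Tuple[str, float]],
                      min_gap: float = 3.0) -> bool:
    """hits holds (organism, percent_identity) pairs for a single read."""
    best_by_organism = {}
    for organism, identity in hits:
        # Keep only the strongest hit seen for each organism
        if identity > best_by_organism.get(organism, float("-inf")):
            best_by_organism[organism] = identity
    ranked = sorted(best_by_organism.values(), reverse=True)
    if not ranked:
        return False  # no hits at all, nothing to identify
    if len(ranked) == 1:
        return True   # only one organism matched, so no competing choice
    return ranked[0] - ranked[1] > min_gap
```

A read matching E. coli at 99.5% and Shigella at 99.4% would be rejected under this reading, while a read matching one organism at 99% and the runner-up at 94% would be kept.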


If you (or others) are aware of such a thing, I'd love to hear about it.

Sincerely
Scott


Scott Tighe
Senior Core Laboratory Research Staff
Advanced Genome Technologies Core
University of Vermont
Vermont Cancer Center
149 Beaumont ave
Health Science Research Facility 303/305
Burlington Vermont 05405
802-656-2557

On 9/18/2013 2:05 PM, Gerald Bothe wrote:
Removing model organisms may not be enough, you may have the same 
problem with, say, a Clostridium cluster IV anaerobe. I think a 
solution would be to

first: compare to a collection of genes, e.g. get all the hits for 16S 
rRNA genes, RNA polymerases (conserved to quite conserved), and to 
e.g. ion channels and cell surface proteins.

then: once a read or contig is identified as belonging to a gene 
family, gene, or protein domain, check within that group for species 
identities. Then you compare apples to apples in terms of gene 
conservation level.

Does anybody know a program that would do this efficiently from 
metagenomic data?
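
A rough sketch of the two-stage idea above, assuming each hit has already been annotated with a gene family and a species. The tuple layout and function name are hypothetical. Each read is assigned to the gene family of its best hit, and species are then tallied separately within each family so that comparisons stay at one conservation level.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Hit = Tuple[str, str, str, float]  # (read_id, gene_family, species, identity)


def species_by_gene_family(hits: Iterable[Hit]) -> Dict[str, Counter]:
    """Assign each read to its best hit, then tally species per gene family."""
    best = {}
    for read_id, family, species, identity in hits:
        # Stage 1: the strongest hit decides which gene family the read joins
        if read_id not in best or identity > best[read_id][2]:
            best[read_id] = (family, species, identity)
    tallies = defaultdict(Counter)
    # Stage 2: species are only counted against others in the same family
    for family, species, _identity in best.values():
        tallies[family][species] += 1
    return dict(tallies)
```

Taking the single best hit per read is a simplification; a fuller version would probably combine this with a check for ambiguous reads before counting.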

Gerald Bothe

*From:* Scott W. Tighe scott.ti...@uvm.edu
*To:* galaxy-user@lists.bx.psu.edu
*Sent:* Wednesday, September 18, 2013 10:03 AM
*Subject:* Re: [galaxy-user] Metagenomic filtering

Dear Galaxy

When running HiSeq shotgun metagenomics samples from the environment
against megablast and taxonomic representation, how do I
filter/remove all the 16S and other conserved sequences?

The problem is that when blasting a single organism that has a fraction of
conserved sequence, the results will align with E. coli 10,000
times more than with the possible target organism. This data would be
wrong and misleading. For example, a 100 mg sample that was negative
for E. coli using the MUG test gave thousands of hits with Galaxy.

1) Is there a "filter conserved sequences" setting?

2) Is there a "remove model organisms" setting?


Scott Tighe
--Core Laboratory Research Staff
Advanced Genome Technologies Core
Deep Sequencing (MPS) Facility
Vermont Cancer Center
149 Beaumont Ave
University of Vermont HSRF 303
Burlington Vermont  USA 05045
802-656-AGTC
802-999- (cell)



Quoting Jennifer Jackson j...@bx.psu.edu:

 Hello Elwood,

 Are you still having connection issues today? Or is this resolved?

 Best,

 Jen
 Galaxy team

 On 9/13/13 11:36 AM, Elwood Linney wrote:
 A message sent earlier this week by me indicated that I could
not connect to Galaxy via Fetch to download data.

 A reply indicated a glitch was fixed.

 I then could connect with Fetch, and I tried to transfer 4 x 16 GB
 files, but the connection dropped about 4 times.

 Now, once again, I cannot connect with Galaxy online to
transfer data.

 Is this a problem that can be solved, either at my end or at Galaxy?

 Elwood Linney


 ___
 The Galaxy User list should be used for the discussion of
 Galaxy analysis and other features on the public server
 at usegalaxy.org.  Please keep all replies on the list by
 using reply all in your mail client.  For discussion of
 local Galaxy instances and the Galaxy source code, please
 use the Galaxy Development list:

  http://lists.bx.psu.edu/listinfo/galaxy-dev

 To manage your subscriptions to this and other Galaxy lists,
 please use the interface at:

  http://lists.bx.psu.edu/

 To search Galaxy mailing lists use the unified search at:

 http://galaxyproject.org/search/mailinglists/

 --
 Jennifer Hillman-Jackson
 http://galaxyproject.org




Re: [galaxy-user] Metagenomic filtering

2013-09-18 Thread Scott W. Tighe

Dear Galaxy

When running a HiSeq shotgun metagenomics sample from the environment  
against megablast and taxonomic representation, how do I filter/remove  
all the 16S and other conserved sequences?

The problem is that when blasting a single organism that has a fraction  
of conserved sequence, the results will align with E. coli 10,000 times  
more than with the possible target organism. This data would be wrong  
and misleading. For example, a 100 mg sample that was negative for  
E. coli using the MUG test gives thousands of hits with Galaxy.

1) Is there a "filter conserved sequences" setting?

2) Is there a "remove model organisms" setting?


Scott Tighe
--
Core Laboratory Research Staff
Advanced Genome Technologies Core
Deep Sequencing (MPS) Facility
Vermont Cancer Center
149 Beaumont Ave
University of Vermont HSRF 303
Burlington Vermont  USA 05045
802-656-AGTC
802-999- (cell)



Quoting Jennifer Jackson j...@bx.psu.edu:


Hello Elwood,

Are you still having connection issues today? Or is this resolved?

Best,

Jen
Galaxy team

On 9/13/13 11:36 AM, Elwood Linney wrote:
A message sent earlier this week by me indicated that I could not  
connect to Galaxy via Fetch to download data.


A reply indicated a glitch was fixed.

I then could connect with Fetch, and I tried to transfer 4 x 16 GB  
files, but the connection dropped about 4 times.


Now, once again, I cannot connect with Galaxy online to transfer data.

Is this a problem that can be solved, either at my end or at Galaxy?

Elwood Linney




--
Jennifer Hillman-Jackson
http://galaxyproject.org






Re: [galaxy-user] Metagenomic filtering

2013-09-18 Thread Jennifer Jackson

Hi Scott,

The tool "Metagenomic analyses - Find diagnostic hits" can be used to 
isolate the conserved sequences. Then use the tool "Join, Subtract 
and Group - Compare" with the option "Non Matching rows of 1st dataset" 
to filter out anything that you think is spurious for your analysis 
(put the original file first and the output of diagnostic hits second) 
before moving forward with the other summary tools.
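
As an illustration of what the "Non Matching rows of 1st dataset" step does conceptually, here is a minimal sketch in plain Python. It is not Galaxy's implementation; the function name, the tab-separated layout, and the choice of the first column as the comparison key are assumptions made for the example.

```python
def non_matching_rows(first_rows, second_rows, key_col=0):
    """Keep rows of the first dataset whose key is absent from the second."""
    keys_to_drop = {row.split("\t")[key_col] for row in second_rows}
    return [
        row for row in first_rows
        if row.split("\t")[key_col] not in keys_to_drop
    ]


# Hypothetical tabular data: read ID, then the taxon of the hit
original = [
    "read1\tEscherichia",
    "read2\tClostridium",
    "read3\tPseudomonas",
]
diagnostic_hits = ["read1\tEscherichia"]  # rows judged spurious

kept = non_matching_rows(original, diagnostic_hits)
```

Here the original file plays the role of the first dataset and the diagnostic hits the second, matching the order given above.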


You will probably want to run the "Find diagnostic hits" tool more than 
once. The choice is yours whether to do the Compare after each run, or 
to use "Text Manipulation - Concatenate" to combine all the results 
first and then Compare once. The first might work faster; it just 
depends on the size of your datasets (how much filtering occurred 
before this step, etc.).
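
The two orders should give the same filtered set, which a small sketch can show. This assumes the comparison is a simple key-based subtraction on the first tab-separated column; the helper and the data are hypothetical, not Galaxy code.

```python
def subtract(rows, hits, key_col=0):
    """Drop every row whose key column value also appears in hits."""
    drop = {hit.split("\t")[key_col] for hit in hits}
    return [row for row in rows if row.split("\t")[key_col] not in drop]


original = ["read1\tA", "read2\tB", "read3\tC", "read4\tD"]
run_a = ["read1\tA"]             # hypothetical output of a first run
run_b = ["read3\tC", "read1\tA"]  # hypothetical output of a second run

# Option 1: Compare after each run
sequential = subtract(subtract(original, run_a), run_b)

# Option 2: Concatenate all results first, then Compare once
concatenated = subtract(original, run_a + run_b)
```

Both variables end up holding the same rows; only the number of passes and the size of the intermediate data differ.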


The Compare tool sorts and holds data in memory. Even if you need to 
break the data up and run in smaller chunks, the results should be the 
same in the end. None of these jobs require the data to be in one lump.
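
A minimal sketch of why chunking does not change the outcome, again assuming a key-based subtraction on the first tab-separated column (hypothetical helper, not the Compare tool itself). Each row is tested independently against the set of keys to remove, so the chunk size does not matter. Unlike the Compare tool, this sketch does no sorting and simply preserves input order.

```python
def subtract_in_chunks(rows, hits, chunk_size, key_col=0):
    """Filter rows chunk by chunk against the keys found in hits."""
    drop = {hit.split("\t")[key_col] for hit in hits}
    kept = []
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        kept.extend(
            row for row in chunk
            if row.split("\t")[key_col] not in drop
        )
    return kept


rows = [f"read{i}\tx" for i in range(10)]
hits = ["read2\tx", "read7\tx"]

whole = subtract_in_chunks(rows, hits, chunk_size=len(rows))
chunked = subtract_in_chunks(rows, hits, chunk_size=3)
```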


Others are welcome to add to this with their own strategies; I am sure 
there are other ways to do this. Some of the public servers 
specializing in metagenomics may also have tools or options for this, 
and some of those may have been donated to the Tool Shed for local or 
cloud use. It may be worth a look.

http://wiki.galaxyproject.org/PublicGalaxyServers

Good question!

Jen
Galaxy team



On 9/18/13 7:03 AM, Scott W. Tighe wrote:

Dear Galaxy

When running a HiSeq shotgun metagenomics sample from the environment 
against megablast and taxonomic representation, how do I filter/remove 
all the 16S and other conserved sequences?

The problem is that when blasting a single organism that has a fraction 
of conserved sequence, the results will align with E. coli 10,000 times 
more than with the possible target organism. This data would be wrong 
and misleading. For example, a 100 mg sample that was negative for 
E. coli using the MUG test gives thousands of hits with Galaxy.

1) Is there a "filter conserved sequences" setting?

2) Is there a "remove model organisms" setting?


Scott Tighe


--
Jennifer Hillman-Jackson
http://galaxyproject.org
