Re: [galaxy-user] 1. cuffcompare or cuffmerge (???)
Hi JiWen,

I thought about this before. Here is the answer from Cole Trapnell on the SEQanswers website:

I can shed some light on this. We have an upcoming protocol paper that describes our recommended workflow for TopHat and Cufflinks and discusses some of these issues. As turnersd outlined, there are three strategies:

1) merge the BAMs and assemble in a single run of Cufflinks
2) assemble each BAM and run Cuffcompare on the results to get a combined.gtf
3) assemble each BAM and run Cuffmerge on the results to get a merged.gtf

All three options work a little differently depending on whether you're also trying to integrate reference transcripts from UCSC or another annotation source.

#1 is quite different from #2 and #3, so I'll discuss its pros and cons first. The advantage here is simplicity of workflow: it's one Cufflinks run, so there's no need to worry about the details of the other programs. As turnersd mentions, you might also think this maximizes the accuracy of the resulting assembly. That might be the case, but it also might not (for technical reasons that I don't want to get into right now). The disadvantage of this approach is that your computer might not be powerful enough to run it: more data and more isoforms mean substantially more memory and running time. I haven't actually tried this on something like the Human BodyMap data, but I would be very impressed and surprised if Cufflinks could deal with all of that on a machine owned by mere mortals.

#2 and #3 are very similar: both are designed to gracefully merge full-length and partial transcript assemblies without ever merging transfrags that disagree on splicing structure. Consider two transfrags, A and B, each with a couple of exons. If A and B overlap, and they don't disagree on splicing structure, we can (and, according to Cufflinks' assembly philosophy, we should) merge them. The difference between Cuffcompare and Cuffmerge is that Cuffcompare will only merge them if A is contained in B, or vice versa; that is, only if one of the transfrags is essentially redundant. Otherwise, they both get included. Cuffmerge, on the other hand, will merge them if they overlap, agree on splicing, and are in the same orientation. As turnersd noted, this is done by converting the transfrags into SAM alignments and running Cufflinks on them.

The other thing that distinguishes these two options is how they deal with a reference annotation. You can read on our website how the Cufflinks Reference Annotation Based Transcript (RABT) assembler works. Cuffcompare doesn't do any RABT assembly; it just includes the reference annotation in the combined.gtf and discards partial transfrags that are contained in and compatible with the reference. Cuffmerge actually runs RABT assembly when you provide a reference, and this happens during the step where transfrags are converted into SAM alignments and assembled. We do this to improve quantification accuracy and reduce errors downstream. I should also say that Cuffmerge runs Cuffcompare in order to annotate the merged assembly with certain features that are helpful later on.

So we recommend #3, for a number of reasons: it is the closest in spirit to #1 while still being reasonably fast. For reasons that I don't want to get into here (pretty arcane details about the Cufflinks assembler), I also feel that option #3 is actually the most accurate in most experimental settings.

Hope this helps.

Wei Liao
Research Scientist, Brentwood Biomedical Research Institute
16111 Plummer St. Bldg 7, Rm D-122
North Hills, CA 91343
818-891-7711 ext 7645

-- Forwarded message --
From: 杨继文 jiwenyang0...@126.com
To: galaxy-user@lists.bx.psu.edu
Date: Mon, 23 Apr 2012 20:32:28 +0800 (CST)
Subject: [galaxy-user] cuffcompare or cuffmerge

Hi all,

I read the paper "Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks."
They say the procedure for RNA-Seq analysis is TopHat, Cufflinks, Cuffmerge, Cuffdiff, but what I normally do in Galaxy is TopHat, Cufflinks, Cuffcompare, Cuffdiff. I have six samples, which means I will generate six assembled transcript files with Cufflinks. Then I run Cuffcompare using all six assembled transcript files as input, and the resulting combined transcripts file is the input for Cuffdiff.

I don't know why I should use Cuffmerge; actually, I don't understand the function of Cuffmerge. Did I miss something? Please let me know your opinions.

Jiwen
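The contained-versus-overlapping distinction between Cuffcompare and Cuffmerge described above can be sketched with a toy model. This is an illustration only, not Cufflinks code: the transfrag representation and the simplified splicing-compatibility test are assumptions made for the example.

```python
# Toy model of the two merge policies. Transfrags are (strand, exon_list)
# pairs, where exon_list is an ordered list of (start, end) half-open
# genomic intervals.

def introns(exons):
    """Introns implied by an ordered exon chain."""
    return [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]

def spans_overlap(a_exons, b_exons):
    """Do the genomic spans of two transfrags overlap at all?"""
    return a_exons[0][0] < b_exons[-1][1] and b_exons[0][0] < a_exons[-1][1]

def splice_compatible(a_exons, b_exons):
    """Simplified splicing-agreement test: no intron of one transfrag
    may overlap an exon of the other."""
    def clash(intron_list, exon_list):
        return any(i[0] < e[1] and e[0] < i[1]
                   for i in intron_list for e in exon_list)
    return not (clash(introns(a_exons), b_exons) or
                clash(introns(b_exons), a_exons))

def contained(a_exons, b_exons):
    """a is redundant with respect to b: the two agree on splicing, a's
    span lies within b's span, and a's introns are all introns of b."""
    return (splice_compatible(a_exons, b_exons)
            and b_exons[0][0] <= a_exons[0][0]
            and a_exons[-1][1] <= b_exons[-1][1]
            and all(i in introns(b_exons) for i in introns(a_exons)))

def cuffcompare_merges(a, b):
    """Cuffcompare-style rule: merge only if one transfrag is contained
    in the other (one is essentially redundant)."""
    return a[0] == b[0] and (contained(a[1], b[1]) or contained(b[1], a[1]))

def cuffmerge_merges(a, b):
    """Cuffmerge-style rule: merge if the transfrags overlap, agree on
    splicing, and are on the same strand."""
    return (a[0] == b[0] and spans_overlap(a[1], b[1])
            and splice_compatible(a[1], b[1]))

# A is contained in B, so both policies merge them.
A = ('+', [(100, 200), (300, 400)])
B = ('+', [(100, 200), (300, 400), (500, 600)])

# D overlaps A and agrees on splicing, but neither contains the other:
# a Cuffmerge-style rule merges them, a Cuffcompare-style rule keeps both.
D = ('+', [(300, 400), (500, 600)])

# C has an exon spanning A's intron, so neither policy merges it with A.
C = ('+', [(150, 350)])
```

The real tools handle many more cases (single-exon transfrags, strand inference, intron-chain end trimming); the point here is only how containment is stricter than splice-compatible overlap.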
Re: [galaxy-user] Data upload...
Hi Greg,

Upload your files to a Galaxy data library using a combination of "Upload files from filesystem paths" without copying data into Galaxy's default data store. See the following wiki for all the details:

http://wiki.g2.bx.psu.edu/Admin/Data%20Libraries/Uploading%20Library%20Files

For all of the details about data libraries, see:

http://wiki.g2.bx.psu.edu/Admin/Data%20Libraries

Greg Von Kuster

On Apr 23, 2012, at 11:26 AM, Gregory Miles wrote:

We have large files that cannot be uploaded using the file upload command and would instead need to be uploaded using a URL. Unfortunately, we are using a local install on a non-local machine, so setting up an FTP server on this machine is a security issue. The files are located on this computer already anyhow, and Galaxy would simply be copying from one folder to another in order to perform the "get data" step. Is there a simple way to have a pointer of some sort such that Galaxy knows where this file is and: 1) would not have to copy it and could simply refer to the file location; 2) could perform data analysis steps on this file and push the output to the usual location (not the location of the data files)? Any help would be greatly appreciated. Thanks.

Dr. Gregory Miles
Bioinformatics Specialist
Cancer Institute of New Jersey @ UMDNJ
Office: (732) 235 8817

CONFIDENTIALITY NOTICE: This email communication may contain private, confidential, or legally privileged information intended for the sole use of the designated and/or duly authorized recipient(s). If you are not the intended recipient or have received this email in error, please notify the sender immediately by email and permanently delete all copies of this email including all attachments without reading them. If you are the intended recipient, secure the contents in a manner that conforms to all applicable state and/or federal requirements related to privacy and confidentiality of such information.
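The "link to the file instead of copying it" idea behind uploading from filesystem paths can be sketched in miniature. This is an illustration of the concept only, not Galaxy's implementation, and the file names are made up:

```python
import os
import shutil
import tempfile

def import_dataset(src, library_dir, link=True):
    """Place src in library_dir either as a symlink (no data copied)
    or as a full copy; return the path inside the library."""
    dst = os.path.join(library_dir, os.path.basename(src))
    if link:
        os.symlink(os.path.abspath(src), dst)  # instant, no extra disk used
    else:
        shutil.copy2(src, dst)  # duplicates the data on disk
    return dst

# Demo with a throwaway file standing in for a large dataset.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "reads.fastq")
    with open(src, "w") as handle:
        handle.write("@read1\nACGT\n+\nIIII\n")
    library = os.path.join(tmp, "library")
    os.makedirs(library)
    linked = import_dataset(src, library, link=True)
    print(os.path.islink(linked))  # → True
```

Either way the dataset appears inside the library directory; with `link=True` only a pointer is created, which is why linking a 160 GB file should be near-instant apart from the metadata step discussed below.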
___ The Galaxy User list should be used for the discussion of Galaxy analysis and other features on the public server at usegalaxy.org. Please keep all replies on the list by using reply all in your mail client. For discussion of local Galaxy instances and the Galaxy source code, please use the Galaxy Development list: http://lists.bx.psu.edu/listinfo/galaxy-dev To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
[galaxy-user] MegaBLAST output
I am having trouble finding information on the MegaBLAST output columns. What is each column for? I can't seem to figure this out by comparing info in the columns to NCBI directly because the GI#'s don't match with the correct entry on NCBI. I've seen that others have posted about that problem, so I'm also waiting on details on that question, but for now, I'd just like to know what to make of the output...

best,
Sarah
Re: [galaxy-user] Data upload...
Thank you very much for your help with this; we got that settled. One other question: we are importing sorted, indexed BAM files into a Galaxy data library, and we are not having Galaxy copy over the files (they are large) but rather just setting Galaxy up so that it points to the relevant directory. We noticed that one file (160 GB in size) is taking a long time to import, considering that all it should be doing is creating a link. When we examined the running processes, we noticed that samtools is running. From searching around a bit, it seems that Galaxy does this in order to groom the BAM file (sort/index) and ensure that it is in the format necessary for Galaxy to be able to interpret it. Is there any way around this? We did the sorting and indexing prior to import, and it's taking quite a while to perform an unnecessary function. Thanks.

Greg

Dr. Gregory Miles
Bioinformatics Specialist
Cancer Institute of New Jersey @ UMDNJ
Office: (732) 235 8817
Re: [galaxy-user] Data upload...
Hi Greg,

Even though you are not copying the data into Galaxy's default data store, Galaxy determines and stores certain metadata for each of the data files to which you are linking. One of the types of metadata defined for the Bam datatype is its index, which is created by a call to samtools. Unfortunately, there is really no way around this, because Galaxy requires the index file to be in a correct state, and I believe the test to determine correctness is at least as intensive as generating the index in the first place.

It's been a while since I was involved in this (specifically, setting metadata for BAM files using samtools), so perhaps samtools has recently been improved in this regard. If so, I'll look to others to let me know that my understanding of this is now outdated. If we need to update the samtools used by the Galaxy code to take advantage of newer features, we can certainly do so.

Greg Von Kuster
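Greg's underlying question ("I already sorted and indexed; why redo it?") amounts to asking whether a valid, up-to-date index already sits beside the BAM. A minimal stand-in for such a check is below; the mtime-based freshness rule is an assumption for illustration, not Galaxy's actual validation logic:

```python
import os
import tempfile
import time

def has_fresh_index(bam_path):
    """True if a .bai index exists beside bam_path and is at least as
    new as the BAM itself (a cheap staleness heuristic; it does not
    prove the index contents are actually valid)."""
    candidates = (bam_path + ".bai", os.path.splitext(bam_path)[0] + ".bai")
    return any(os.path.exists(bai)
               and os.path.getmtime(bai) >= os.path.getmtime(bam_path)
               for bai in candidates)

# Demo with empty placeholder files in place of real BAM/BAI data.
with tempfile.TemporaryDirectory() as tmp:
    bam = os.path.join(tmp, "sample.bam")
    open(bam, "w").close()
    print(has_fresh_index(bam))  # → False: no index yet
    bai = bam + ".bai"
    open(bai, "w").close()
    os.utime(bai, (time.time() + 10,) * 2)  # force the index to look newer
    print(has_fresh_index(bam))  # → True
```

As the reply above notes, Galaxy regenerates the index itself rather than trusting a check like this, because properly verifying an arbitrary existing index is about as expensive as rebuilding it.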
Re: [galaxy-user] MegaBLAST output
Hi Sarah,

Peter defined the columns (thanks), but I can provide some information about the GenBank identifiers. The megablast databases on the public server are roughly a year old, and there have been updates at NCBI since that time. As I understand it, this manifests as occasional mismatches between hits at Galaxy vs. GenBank when comparing certain IDs linked to updated records. We are working to update these three databases, but there are some complicating factors around this processing, specifically related to the public instance and the metagenomics workflow, that have yet to be resolved. Please know that getting updated is a priority for us, and we apologize for the inconvenience.

To use the most current databases, the recommendation is a local or (better) cloud instance with either the regular or BLAST+ version of the tool and a database of your choice. Instructions to get started are at:

getgalaxy.org
getgalaxy.org/cloud

Hopefully this explains the data mismatch. This question has come up before, but I think you are correct that the final conclusion was never posted back to the galaxy-user list (for various reasons). So, thank you for asking, so that we could send out a clear reply for everyone using the tool.

Best,
Jen
Galaxy team

--
Jennifer Jackson
http://galaxyproject.org
Re: [galaxy-user] Data upload...
Thanks again for the feedback. One final (hopefully) thing: as I mentioned in my first e-mail, we are trying to add a large (~170 GB) BAM file to a library with just a link to the file (no copying). After at least an hour of working, I get the error message "Unable to finish job, tool error". Any thoughts as to how I can fix this? Thanks.

Greg
Re: [galaxy-user] Data upload...
Is there something helpful in your paster log about the cause?
Re: [galaxy-user] MegaBLAST output
Thanks so much for the prompt reply. I don't mind using last years GenBank, as long as I am getting accurate hits. I just have a couple more questions to confirm I am safe using the Galaxy pipline for this... So if I continue to work within the the 1 year old database, can I trust the output as accurate matches? Specifics about my project: I have environmental samples that were sequenced for fungal ITS. I have clustered these into OTUs, and chosen a representative sequence for each. If I retrieve hits for this representative sequence file in my sample, can I trust the hits as being the correct hits as of last year? I'm just worried about what that one person said who thought there was some column arrangement problems, because I'm finding that I'm getting hits from different phylum for the same sequence using default parameters in megablast... Can I also assume, then, that I should NOT identify my representative sequence file to updated GI numbers using another pipeline, and then bring the file of GI numbers to Galaxy to fetch taxonomic assignments? (which I would do because of the nice neat columns for each taxonomic level Galaxy puts out) Sarah On Mon, Apr 23, 2012 at 2:26 PM, Jennifer Jackson j...@bx.psu.edu wrote: Hi Sarah, Peter defined the columns (thanks) but I can provide some information about the GenBank identifiers. The megablast database on the public server are roughly a year old and there have been updates at NCBI since that time. As I understand it, this manifests as occasional mismatches between hits at Galaxy vs Genbank when comparing certain IDs linked to updated records. We are working to update these three databases, but there are some complicating factors around this processing specifically related to the public instance and the metagenomics workflow that have yet to be resolved. Please know that getting updated is a priority for us and we apologize for the inconvenience. 
To use the most current databases, a local or (better) cloud instance with either the regular or BLAST+ version of the tool and a database your choice is the recommendation. Instructions to get started are at: getgalaxy.org getgalaxy.org/cloud Hopefully this explains the data mismatch. This question has come up before, but I think you are correct in that the final conclusion never was posted back to the galaxy-user list (for different reasons). So, thank you for asking so we that could send out a clear reply for everyone using the tool. Best, Jen Galaxy team On 4/23/12 9:56 AM, Sarah Hicks wrote: I am having trouble finding information on the MegaBLAST output columns. What is each column for? I can't seem to figure this out by comparing info in the columns to NCBI directly because the GI#'s don't match with the correct entry on NCBI. I've seen that others have posted about that problem, so I'm also waiting on details on that question, but for now, I'd just like to know what to make of the output... best, Sarah ___ The Galaxy User list should be used for the discussion of Galaxy analysis and other features on the public server at usegalaxy.org. Please keep all replies on the list by using reply all in your mail client. For discussion of local Galaxy instances and the Galaxy source code, please use the Galaxy Development list: http://lists.bx.psu.edu/listinfo/galaxy-dev To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/ -- Jennifer Jackson http://galaxyproject.org ___ The Galaxy User list should be used for the discussion of Galaxy analysis and other features on the public server at usegalaxy.org. Please keep all replies on the list by using reply all in your mail client. 
Re: [galaxy-user] MegaBLAST output
Peter, you requested an example; here are the first five hits for my first query sequence (OTU#0):

0  324034994  527   93.23  266  13  5  1  265  22  283  7e-102  379.0
0  56181650   513   93.26  267  10  8  1  265  25  285  7e-102  379.0
0  314913953  582   91.79  268  13  9  1  265  24  285  2e-92   347.0
0  305670062  281   92.52  254  14  5  4  256  32  281  2e-92   347.0
0  310814066  1180  91.73  266  14  7  1  265  24  282  9e-92   345.0

You will notice there are 13 columns, one in addition to the 12 column titles you explained. This is because there is a column between sseqid and pident. In the metagenomic tutorial the first 4 columns are explained, and column 3 is described as the length of the sequence in the database (i.e., the length of the subject sequence). This is the problem column: the length of only one of the subject GI numbers above matches the subject length in NCBI. This has caused me to wonder if I can trust the hit info. In all cases that I've checked, when this happens the correct match is the listed GI value minus 1 (i.e., in NCBI, gi|324034994 is not 527 nt long, but gi|324034993 IS 527 nt long).

On Mon, Apr 23, 2012 at 11:05 AM, Peter Cock p.j.a.c...@googlemail.com wrote:

On Mon, Apr 23, 2012 at 5:56 PM, Sarah Hicks garlicsc...@gmail.com wrote:

I am having trouble finding information on the MegaBLAST output columns. What is each column for? I can't seem to figure this out by comparing info in the columns to NCBI directly because the GI#'s don't match the correct entries on NCBI. I've seen that others have posted about that problem, so I'm also waiting on details on that question, but for now I'd just like to know what to make of the output... best, Sarah

I've not tried to track down this reported possible bug in GI numbers, or whether it also affects BLAST+ as well as the legacy NCBI BLAST (which has now been discontinued). Do you have a specific example?
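[Editor's note: for readers following along, the 13-column layout described above can be sketched in a few lines of Python. The placement of a subject-length ("slen") column between sseqid and pident is an assumption based on Sarah's description, not documented Galaxy behaviour, and "slen" is a name chosen here for illustration.]

```python
# Hypothetical 13-column layout: the standard 12 BLAST tabular columns
# plus a subject-length ("slen") column between sseqid and pident, as
# the output above suggests. Assumption, not documented behaviour.
COLUMNS_13 = [
    "qseqid", "sseqid", "slen", "pident", "length", "mismatch",
    "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

# First hit from the output above (whitespace-separated here for clarity).
row = "0 324034994 527 93.23 266 13 5 1 265 22 283 7e-102 379.0".split()
hit = dict(zip(COLUMNS_13, row))
print(hit["sseqid"], hit["slen"])  # 324034994 527
```

Under this reading, the GI off-by-one check Sarah describes amounts to comparing `hit["slen"]` against the NCBI record length for `hit["sseqid"]` and, on mismatch, for the GI one lower.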
As to the 12 columns, they are standard BLAST tabular output and should match the defaults in BLAST+ tabular output, which are:

Column  NCBI name  Description
 1  qseqid    Query Seq-id (ID of your sequence)
 2  sseqid    Subject Seq-id (ID of the database hit)
 3  pident    Percentage of identical matches
 4  length    Alignment length
 5  mismatch  Number of mismatches
 6  gapopen   Number of gap openings
 7  qstart    Start of alignment in query
 8  qend      End of alignment in query
 9  sstart    Start of alignment in subject (database hit)
10  send      End of alignment in subject (database hit)
11  evalue    Expectation value (E-value)
12  bitscore  Bit score

Peter
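[Editor's note: as a quick sanity check against Peter's table, one line of this standard 12-column output can be parsed with a short Python sketch. The sample line, query ID, and GI number below are made up for illustration only.]

```python
# Sketch: parse one line of standard 12-column BLAST tabular output
# (the BLAST+ -outfmt 6 default) into a dict keyed by NCBI column name.
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

def parse_blast_hit(line):
    """Split one tab-separated hit line and attach the column names."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    hit = dict(zip(COLUMNS, fields))
    hit["pident"] = float(hit["pident"])      # numeric conversions
    hit["bitscore"] = float(hit["bitscore"])  # for convenience
    return hit

# Illustrative line only -- the IDs and values here are invented.
line = "OTU0\tgi|99999999\t93.23\t266\t13\t5\t1\t265\t22\t283\t7e-102\t379.0"
hit = parse_blast_hit(line)
print(hit["sseqid"], hit["pident"])  # gi|99999999 93.23
```

A parser like this raises immediately on a 13-field line, which would make the extra column Sarah reports easy to detect programmatically.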