Re: [galaxy-user] operate on genomic intervals
Hi Jose,

That's great news! A tool tip in the wiki or UI would probably be helpful; your question was a good one. Meanwhile, I'll post your results back to the list, as they may help others who are also working to optimize.

Glad that it worked out so well,

Jen
Galaxy team

On 5/8/12 2:57 PM, Xianrong Wong wrote:

Thank you Jennifer! I swapped the order of the input datasets, which reduced the running time from 2 hrs to 10 min!

Jose

On Mon, May 7, 2012 at 12:03 PM, Jennifer Jackson j...@bx.psu.edu wrote:

Hi Jose,

Very glad to know that you have this working. Your question is difficult to address with specificity, as these are completely different algorithms. But in general, any alignment algorithm (Bowtie included) has some sort of indexing strategy (some are better than others) to minimize what is held in memory and to process bulk data through. See the Bowtie documentation for how this is achieved.

The interval operations tools also have an indexing strategy: specifically, the second input file is the portion loaded into memory, and the first file is processed against it. So, if you want to use an extremely large dataset (or just want the job to run quicker), try to use it as the first input file if possible. These tools are designed to be used together and with other tools to create workflows, so there should pretty much always be some way to break jobs up (as you did) to get them through the tools, even on modest systems:

http://wiki.g2.bx.psu.edu/Learn/Interval%20Operations

Take care,

Jen
Galaxy team

On 5/7/12 8:30 AM, Xianrong Wong wrote:

Hello Jennifer,

Thanks for the advice. It worked when I did it a chromosome at a time. Is there a reason why this is so much more computationally heavy compared to bowtie? (Mapping 190 million reads took only 3-4 hours for me.)

Jose

On Fri, May 4, 2012 at 7:55 PM, Jennifer Jackson j...@bx.psu.edu wrote:

Hello Jose,

It sounds as if the job is running out of memory. Since you are already working on a cloud, I am going to assume that you have explored the server options with high-capacity memory there. But if not, that is one place to start, in particular your EC2 instance type, as described on this wiki:

http://wiki.g2.bx.psu.edu/Admin/Cloud/CapacityPlanning

However, even if that is an option, you may want to consider running your data through in another way: by running smaller jobs, then merging the results, to avoid the large jobs. For example, in the last step where you join to the full BamHI-delimited bin file, instead join to groups of bins in that file (perhaps grouped by chromosome), then combine the results to produce the full output.

Hopefully this helps provide some options,

Jen
Galaxy team

On 5/4/12 2:18 PM, Xianrong Wong wrote:

Hello,

I have binned the mouse genome into fragments based on restriction enzyme cut sites, so each bin is a fragment flanked by, say, BamHI. The output file is in interval format: chr#, start and end coordinates of each bin. I want to count how many reads align to each bin.

I mapped my reads using bowtie and generated a dataset (interval format) for the aligned reads. I then used Join in Operate on Genomic Intervals and asked it to return intervals that inner-join the bin file. The subsequent steps involve grouping and counting, and then joining back to the 1st dataset (BamHI-delimited bins).

I have tried this workflow on small datasets and it worked. However, when I submit my full alignment file and full BamHI-delimited bin file, the tool fails. I am doing this on a cloud. Any advice would be appreciated!

Jose
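
To illustrate why input order matters in the exchange above, here is a minimal Python sketch of the general pattern Jen describes: one dataset is indexed in memory and the other is streamed against it one record at a time. This is not Galaxy's actual implementation. The function names, the tuple layout (chrom, start, end), half-open coordinates, and the assumption that bins do not overlap each other are all made up for illustration.

```python
import bisect


def build_bin_index(bins):
    """Load the smaller dataset (the bins) into memory, grouped by chromosome."""
    by_chrom = {}
    for chrom, start, end in bins:
        by_chrom.setdefault(chrom, []).append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        starts = [s for s, _ in intervals]
        ends = [e for _, e in intervals]
        index[chrom] = (starts, ends)
    return index


def count_reads_per_bin(reads, index):
    """Stream the larger dataset (the reads) once, counting overlaps per bin."""
    counts = {}
    for chrom, r_start, r_end in reads:
        if chrom not in index:
            continue
        starts, ends = index[chrom]
        # Last bin starting at or before the read start; earlier bins cannot
        # overlap because the bins are assumed not to overlap each other.
        j = max(bisect.bisect_right(starts, r_start) - 1, 0)
        while j < len(starts) and starts[j] < r_end:
            if ends[j] > r_start:
                key = (chrom, starts[j], ends[j])
                counts[key] = counts.get(key, 0) + 1
            j += 1
    return counts


# Hypothetical toy data: the reads can be any iterator, so they never
# need to be held in memory all at once.
example_bins = [("chr1", 0, 100), ("chr1", 100, 250), ("chr2", 0, 50)]
example_reads = iter([("chr1", 10, 40), ("chr1", 90, 110), ("chr1", 240, 260)])
example_counts = count_reads_per_bin(example_reads, build_bin_index(example_bins))
```

In this sketch the memory footprint is driven by whichever dataset is indexed, while the streamed dataset only costs time. That is consistent with Jose's observation that swapping the inputs changed the run time so much.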
Re: [galaxy-user] operate on genomic intervals
Hello Jose,

It sounds as if the job is running out of memory. Since you are already working on a cloud, I am going to assume that you have explored the server options with high-capacity memory there. But if not, that is one place to start, in particular your EC2 instance type, as described on this wiki:

http://wiki.g2.bx.psu.edu/Admin/Cloud/CapacityPlanning

However, even if that is an option, you may want to consider running your data through in another way: by running smaller jobs, then merging the results, to avoid the large jobs. For example, in the last step where you join to the full BamHI-delimited bin file, instead join to groups of bins in that file (perhaps grouped by chromosome), then combine the results to produce the full output.

Hopefully this helps provide some options,

Jen
Galaxy team

On 5/4/12 2:18 PM, Xianrong Wong wrote:

Hello,

I have binned the mouse genome into fragments based on restriction enzyme cut sites, so each bin is a fragment flanked by, say, BamHI. The output file is in interval format: chr#, start and end coordinates of each bin. I want to count how many reads align to each bin.

I mapped my reads using bowtie and generated a dataset (interval format) for the aligned reads. I then used Join in Operate on Genomic Intervals and asked it to return intervals that inner-join the bin file. The subsequent steps involve grouping and counting, and then joining back to the 1st dataset (BamHI-delimited bins).

I have tried this workflow on small datasets and it worked. However, when I submit my full alignment file and full BamHI-delimited bin file, the tool fails. I am doing this on a cloud. Any advice would be appreciated!

Jose

___
The Galaxy User list should be used for the discussion of Galaxy analysis and other features on the public server at usegalaxy.org. Please keep all replies on the list by using reply all in your mail client. For discussion of local Galaxy instances and the Galaxy source code, please use the Galaxy Development list:

http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists, please use the interface at:

http://lists.bx.psu.edu/

--
Jennifer Jackson
http://galaxyproject.org
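
The "smaller jobs, then merge" suggestion in this message can be sketched in Python as follows. It is only an illustration of the batching idea, not a description of how any Galaxy tool works. The helper names, the (chrom, start, end) tuple layout, and the deliberately naive join function are assumptions made for the example.

```python
from collections import defaultdict


def split_by_chrom(rows):
    """Group interval rows by their chromosome column (first field)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return dict(groups)


def naive_overlap_join(reads, bins):
    """Pair every read with every bin it overlaps (half-open coordinates)."""
    return [
        (r, b)
        for r in reads
        for b in bins
        if r[0] == b[0] and r[1] < b[2] and b[1] < r[2]
    ]


def join_in_batches(reads, bins, join_fn=naive_overlap_join):
    """Run the join one chromosome at a time, then concatenate the results."""
    read_groups = split_by_chrom(reads)
    bin_groups = split_by_chrom(bins)
    merged = []
    for chrom in sorted(bin_groups):
        # Each call only sees one chromosome's worth of data.
        merged.extend(join_fn(read_groups.get(chrom, []), bin_groups[chrom]))
    return merged
```

Because an interval on one chromosome can never overlap an interval on another, joining per chromosome and concatenating should give the same pairs as one large join, while each individual job handles far less data.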
Re: [galaxy-user] Operate on genomic intervals
Hi Sarah,

If you just want to compare rows between two files without interpreting genome positional information (similar to the unix comm command), then please see the tools under Join, Subtract and Group. In particular, see Subtract Whole Dataset, but perhaps also Compare two Datasets.

You will likely need to convert the datatype to tabular before using the Compare function. Do this by using the Edit Attributes - Change data type form, reached by clicking on the pencil icon in the top right corner of a dataset's box in the history panel. The instructions in the Compare tool form state to use Text Manipulation-Convert, but if your file is already in interval format, which is a specific type of tabular data, then this is not an appropriate (or needed) step.

Hopefully this helps!

Best,

Jen
Galaxy team

On 10/28/11 8:19 AM, Sarah wrote:

Hi All,

I have two files containing multiple scaffolds and their genomic intervals. Now I want to subtract the intervals of one file from the other, but apparently this tool is only available if all the data in the chromosome column have the same name. Does anybody know a way to handle files with different names in the chromosome column?

Thanks a lot in advance,

Sarah

--
Jennifer Jackson
http://usegalaxy.org
http://galaxyproject.org/wiki/Support
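
For readers who want to see what a row-wise comparison "without interpreting genome positional information" amounts to, here is a minimal Python sketch. It is an illustration under stated assumptions, not the code behind Subtract Whole Dataset or Compare two Datasets: the function name is hypothetical, and rows are treated as plain strings that either match exactly or not.

```python
def subtract_rows(first, second):
    """Return the rows of `first` that do not appear verbatim in `second`.

    Rows are compared as whole strings, so scaffold names and coordinates
    are never interpreted; order of `first` is preserved.
    """
    seen = set(second)
    return [row for row in first if row not in seen]


# Hypothetical tab-separated interval rows with differing scaffold names.
file_a = ["scaffold_1\t10\t50", "scaffold_2\t5\t20", "scaffold_7\t100\t200"]
file_b = ["scaffold_2\t5\t20", "scaffold_1\t10\t40"]
remaining = subtract_rows(file_a, file_b)
```

Note that in this sketch a row is removed only on an exact match: an interval that merely overlaps one in the second file (scaffold_1 above) is kept, which is the key difference from a coordinate-aware interval subtraction.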