[galaxy-dev] metadata in parallelization
Hello,

I am writing some code to enable parallelization for some tool wrappers. I first did it for a simple bwa wrapper, and now I am modifying toolshed.g2.bx.psu.edu/repos/devteam/bwa/c71dd035971e/bwa/bwa-mem.xml to check whether the code also works with this wrapper. I wrote the code that I think is necessary to merge BAM files, and I added the parallelism tag (in bold) to the config file:

    <tool id="bwa_mem" name="BWA-MEM" version="0.1">
      <macros>
        <import>bwa_macros.xml</import>
      </macros>
      <requirements>
        <requirement type="package" version="0.7.10.039ea20639">bwa</requirement>
        <requirement type="package" version="1.1">samtools</requirement>
      </requirements>
      <description>- map medium and long reads (&gt; 100 bp) against reference genome</description>
      *<parallelism method="multi" split_size="3" shared_inputs="ref_file" split_mode="number_of_parts" merge_outputs="bam_output" split_inputs="fastq_input1,fastq_input2"></parallelism>*
      <command> ...

Everything works well, and the resulting BAM is the same with and without parallelization, but the Galaxy log shows an error regarding metadata. It says something like this:

    galaxy.jobs.splitters.multi DEBUG 2015-04-17 09:54:58,335 merge finished: /home/ralonso/galaxy/database/files/000/dataset_198.dat
    galaxy.jobs.runners.tasks DEBUG 2015-04-17 09:54:58,473 executing external set_meta script for job 200: python /home/ralonso/galaxy/database/tmp/set_metadata_E5fGIE.py /home/ralonso/galaxy/database/tmp/tmpHS8Byo /home/ralonso/galaxy/database/job_working_directory/000/200/galaxy.json /home/ralonso/galaxy/database/tmp/metadata_in_HistoryDatasetAssociation_198_yOGiQG,/home/ralonso/galaxy/database/tmp/metadata_kwds_HistoryDatasetAssociation_198_nAsQoq,/home/ralonso/galaxy/database/tmp/metadata_out_HistoryDatasetAssociation_198_I_cLs4,/home/ralonso/galaxy/database/tmp/metadata_results_HistoryDatasetAssociation_198_qhjzoV,/home/ralonso/galaxy/database/files/000/dataset_198.dat,/home/ralonso/galaxy/database/tmp/metadata_override_HistoryDatasetAssociation_198_ScKLqH
    Traceback (most recent call last):
      File "/home/ralonso/galaxy/database/tmp/set_metadata_E5fGIE.py", line 1, in <module>
        from galaxy_ext.metadata.set_metadata import set_metadata; set_metadata()
    ImportError: No module named galaxy_ext.metadata.set_metadata
    galaxy.jobs.runners.tasks DEBUG 2015-04-17 09:54:58,624 execution of external set_meta finished for job 200
    *galaxy.datatypes.metadata DEBUG 2015-04-17 09:54:58,714 setting metadata externally failed for HistoryDatasetAssociation 198: External set_meta() not called*

Without parallelization there is no problem, because Galaxy does not go through this part of the code at all. I see that Galaxy has to do something with the metadata attributes, but what is it trying to do? Is there any way to solve this?

Thank you very much
Regards,
Roberto

___
Please keep all replies on the list by using reply all in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
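The ImportError in the traceback above says that the Python process running the generated set_metadata script could not locate the galaxy_ext package. One way to check whether a module can be found from a given directory is sketched below. This is only a diagnostic sketch: the helper name can_import is hypothetical and not part of Galaxy, and which directory has to be on the path for galaxy_ext is an assumption the reader would need to verify against their own checkout.

```python
import importlib.util
import sys


def can_import(module_name, extra_path=None):
    """Return True if module_name can be located (hypothetical helper).

    If extra_path is given, it is added to sys.path only for the
    duration of the check.
    """
    added = False
    if extra_path is not None and extra_path not in sys.path:
        sys.path.insert(0, extra_path)
        added = True
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec raises when a parent package of a dotted name is missing
        return False
    finally:
        if added:
            sys.path.remove(extra_path)
```

Calling can_import("galaxy_ext.metadata.set_metadata", some_directory) with and without the candidate directory would show whether the failure is a path problem in the task runner's environment.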
[galaxy-dev] join bam results in one file
Hello,

I am playing with the Galaxy splitter capabilities. After the earlier cases that you helped me solve, I am facing a new issue. It may be due to my tool configuration file, but in any case here is what I have done. What I would like to do is split paired fastq files, map them, and then join the results. This is my configuration file:

    <tool id="bwa_mio" name="map with bwa">
      <description>map with bwa</description>
      <parallelism method="basic" split_size="3" split_mode="number_of_parts" merge_outputs="output"></parallelism>
      <command>
        bwa mem -R '@RG\tID:foo\tSM:bar' /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa $input &gt; temporary_bam_file.sam 2&gt;/dev/null ;
        samtools view -Sb temporary_bam_file.sam &gt; temporary_bam_file.bam ;
        samtools sort temporary_bam_file.bam $output ;
      </command>
      <inputs>
        <param format="fastqsanger" name="input" type="data" label="fastq"/>
      </inputs>
      <outputs>
        <data format="bam" name="output" />
      </outputs>
      <help>
        bwa
      </help>
    </tool>

My problem with this configuration is that it generates an empty file. Looking at the code, I found that when Galaxy tries to join the several BAM files it goes up to the first parent, *class Data( object )*, and calls its merge method: *def merge( split_files, output_file)*. I may be wrong, but I think the binary.bam class should override this method. Is this right? If so, I would like to implement it. I have a couple of basic ideas, such as merging the files with samtools. What do you think?

On the other hand, is this related to the last email from John Chilton and the 15.03 release?

Best Regards

--
Roberto Alonso
Functional Genomics Unit
Bioinformatics and Genomics Department
Prince Felipe Research Center (CIPF)
C./Eduardo Primo Yúfera (Científic), nº 3 (junto Oceanografico)
46012 Valencia, Spain
Tel: +34 963289680 Ext. 1021
Fax: +34 963289574
E-Mail: ralo...@cipf.es
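The idea of merging the per-task BAM files with samtools could be sketched as follows. This is a minimal sketch and not Galaxy's implementation: the function names are hypothetical, it assumes a samtools executable on the PATH, and it relies on the `samtools merge [-f] <out.bam> <in1.bam> <in2.bam> ...` form of the command, where -f overwrites an output file that already exists.

```python
import subprocess


def build_bam_merge_command(split_files, output_file):
    """Build the argument list for merging several BAM files into one.

    Hypothetical helper: the signature mirrors merge(split_files,
    output_file) from the message above, nothing more.
    """
    if not split_files:
        raise ValueError("no BAM files to merge")
    # -f: overwrite output_file if it is already present
    return ["samtools", "merge", "-f", output_file] + list(split_files)


def merge_bam(split_files, output_file):
    """Run the merge; raises CalledProcessError if samtools fails."""
    subprocess.check_call(build_bam_merge_command(split_files, output_file))
```

Keeping the command construction separate from the execution makes it possible to inspect or log the command without needing samtools installed.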
Re: [galaxy-dev] problems splitting
Hello again,

First of all, thanks for your help, it is proving very useful. What I have done so far is copy this method into the class Sequence:

    def get_split_commands_sequential(is_compressed, input_name, output_name, start_sequence, sequence_count):
        """
        Does a brain-dead sequential scan & extract of certain sequences
        >>> Sequence.get_split_commands_sequential(True, './input.gz', './output.gz', start_sequence=0, sequence_count=10)
        ['zcat ./input.gz | ( tail -n +1 2> /dev/null) | head -40 | gzip -c > ./output.gz']
        >>> Sequence.get_split_commands_sequential(False, './input.fastq', './output.fastq', start_sequence=10, sequence_count=10)
        ['tail -n +41 ./input.fastq 2> /dev/null | head -40 > ./output.fastq']
        """
        start_line = start_sequence * 4
        line_count = sequence_count * 4
        # TODO: verify that tail can handle 64-bit numbers
        if is_compressed:
            cmd = 'zcat %s | ( tail -n +%s 2> /dev/null) | head -%s | gzip -c' % (input_name, start_line+1, line_count)
        else:
            cmd = 'tail -n +%s %s 2> /dev/null | head -%s' % (start_line+1, input_name, line_count)
        cmd += ' > %s' % output_name
        return [cmd]
    get_split_commands_sequential = staticmethod(get_split_commands_sequential)

This is something that you suggested. When I run the tool with this configuration:

    <tool id="bwa_mio" name="map with bwa">
      <description>map with bwa</description>
      <parallelism method="basic" split_size="3" split_mode="number_of_parts"></parallelism>
      <command>bwa mem /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa $input > $output 2>/dev/null</command>
      <inputs>
        <param format="fastqsanger" name="input" type="data" label="fastq"/>
      </inputs>
      <outputs>
        <data format="sam" name="output" />
      </outputs>
      <help>
        bwa
      </help>
    </tool>

Everything finishes OK, but when I check the SAM file, I see that the alignment lines are prefixed with the path of the file, e.g. in example_split.sam:

    /home/ralonso/galaxy-dist/database/job_working_directory/000/90/task_2/dataset_91.dat:SRR098409.1113446 4 * 0 0 * * 0 0 TCTGGGTGAGGGAGTAGTGGGTGAGGGTGTGTGAGGATGTGTAAGTGGATGGAAGTAGATTGAATGTT AS:i:0 XS:i:0

Do you know what may be going on? If I don't split the file, everything works correctly.

Best regards

On 13 February 2015 at 13:39, Peter Cock p.j.a.c...@googlemail.com wrote:
> On Fri, Feb 13, 2015 at 11:38 AM, Nicola Soranzo nsora...@tiscali.it wrote:
>> On 13.02.2015 03:17 Peter Cock wrote:
>>> Hi Roberto,
>>>
>>> It looks like this is a known issue with FASTQ splitting, https://trello.com/c/qRHLFSzd/1522-issues-with-tasked-jobs-parallelism
>>>
>>> I originally broke it during a refactor, but it looks like the discussion died about what that method was meant to do (e.g. FQTOC = FASTQ table of contents?): https://bitbucket.org/galaxy/galaxy-central/commits/76277761807306ec2be3f1e4059dd7cde6fd2dc6#comment-820648
>>>
>>> I'm away from the office so can't try this, but probably all that is needed is to copy and paste the old method get_split_commands_sequential and the old method get_split_commands_with_toc (removed from the base Sequence class in the above commit) into the base Fastq class instead.
>>>
>>> Nicola - did you fix this locally after noticing the problem last year?
>>
>> No, sorry, we disabled Galaxy parallelism because it was using too many cluster nodes.
>>
>> Nicola
>
> I had similar comments from some of the cluster users after getting it working here - but on balance a well used cluster helps justify future investment in maintaining it.
>
> Sorry about not following up on this - I think I might have assumed you would take care of it. Unfortunately I won't be able to test the obvious fix until at least a week later...
>
> Peter

--
Roberto Alonso
Functional Genomics Unit
Bioinformatics and Genomics Department
Prince Felipe Research Center (CIPF)
C./Eduardo Primo Yúfera (Científic), nº 3 (junto Oceanografico)
46012 Valencia, Spain
Tel: +34 963289680 Ext. 1021
Fax: +34 963289574
E-Mail: ralo...@cipf.es
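The line arithmetic behind get_split_commands_sequential can be illustrated without the shell. The sketch below is only an illustration under the same assumption the method makes, namely that every FASTQ record occupies exactly four lines; the helper names split_window and extract_records are hypothetical and not part of Galaxy.

```python
from itertools import islice

LINES_PER_RECORD = 4  # assumes unwrapped FASTQ: header, sequence, '+', quality


def split_window(start_sequence, sequence_count):
    """Return (first_line, line_count) for a block of records.

    first_line is 1-based, which is the value 'tail -n +N' expects.
    """
    start_line = start_sequence * LINES_PER_RECORD
    line_count = sequence_count * LINES_PER_RECORD
    return start_line + 1, line_count


def extract_records(lines, start_sequence, sequence_count):
    """Pure-Python equivalent of 'tail -n +N | head -M' on an iterable of lines."""
    first_line, line_count = split_window(start_sequence, sequence_count)
    begin = first_line - 1  # islice is 0-based
    return list(islice(lines, begin, begin + line_count))
```

For start_sequence=10 and sequence_count=10 the window is line 41 for 40 lines, which matches the `tail -n +41 ... | head -40` command in the docstring.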
Re: [galaxy-dev] problems splitting
Hello again,

There is something that I consider important. When I look at the log I see this output:

    galaxy.jobs.runners.tasks DEBUG 2015-02-25 11:33:30,989 execution finished - *beginning merge: bwa mem* /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa /home/ralonso/galaxy-dist/database/files/000/dataset_8.dat > /home/ralonso/galaxy-dist/database/files/000/dataset_94.dat 2> /dev/null

I think the merge should be done with samtools. I don't know how this is programmed in Galaxy, but I did not indicate the path to samtools anywhere. Could the problem be related to this?

Thanks a lot,
Regards

--
Roberto Alonso
Functional Genomics Unit
Bioinformatics and Genomics Department
Prince Felipe Research Center (CIPF)
C./Eduardo Primo Yúfera (Científic), nº 3 (junto Oceanografico)
46012 Valencia, Spain
Tel: +34 963289680 Ext. 1021
Fax: +34 963289574
E-Mail: ralo...@cipf.es
To manage your subscriptions to this and other Galaxy lists, please use the interface at: https://lists.galaxyproject.org/ To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
Re: [galaxy-dev] problems splitting
Hello,

I just switched to the CDATA format, but the problem remains. When I split into 2 parts there is no problem, but when I split into 3, the problem described before appears. Here is the link to the sam/bam file: https://dl.dropboxusercontent.com/u/1669701/ejemplo_split.bam

Best regards

On 24 February 2015 at 17:49, Peter Cock p.j.a.c...@googlemail.com wrote:
> On Tue, Feb 24, 2015 at 4:43 PM, Roberto Alonso CIPF ralo...@cipf.es wrote:
>> Hello again, first of all thanks for your help, it is being very useful. What I have done up to now is to copy this method to the class Sequence
>> def get_split_commands_sequential(is_compressed, input_name, output_name, start_sequence, sequence_count): ... return [cmd]
>> get_split_commands_sequential = staticmethod(get_split_commands_sequential)
>> This is something that you suggested.
>
> Good.
>
>> When I run the tool with this configuration: [...]
>
> One minor improvement would be to escape the > as &gt; in your XML, or use the CDATA approach documented here: https://wiki.galaxyproject.org/Tools/BestPractices
>
>> Everything ends ok, but when I go to check how is the sam, I see that in the alignments it is the path of the file [...] If I don't split the file, everything goes correctly.
>
> This sounds to me like there may be a problem with SAM merging? Could you share the entire example_split.sam file (e.g. as a gist on GitHub, or via dropbox)?
>
> Peter

--
Roberto Alonso
Functional Genomics Unit
Bioinformatics and Genomics Department
Prince Felipe Research Center (CIPF)
C./Eduardo Primo Yúfera (Científic), nº 3 (junto Oceanografico)
46012 Valencia, Spain
Tel: +34 963289680 Ext. 1021
Fax: +34 963289574
E-Mail: ralo...@cipf.es
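One possible explanation for the path prefix, offered as a guess that is not confirmed anywhere in this thread: grep-style tools print the file name in front of each matching line when they are given more than one input file, unless told otherwise (for example with grep -h). If a merge step copies the first part and then filters the headers out of the remaining parts in a single grep call, two parts would leave one file for grep (no prefix) while three parts would leave two files (prefix on every line). A merge that avoids the issue altogether can be sketched in plain Python; the function name merge_sam is hypothetical and this is not Galaxy's implementation.

```python
def merge_sam(split_files, output_file):
    """Concatenate SAM files, keeping header lines from the first file only.

    Assumes all parts share the same header, so the '@' lines of every
    part after the first can be dropped.
    """
    with open(output_file, "w") as out:
        for index, path in enumerate(split_files):
            with open(path) as handle:
                for line in handle:
                    # SAM header lines start with '@'; skip repeats
                    if index > 0 and line.startswith("@"):
                        continue
                    out.write(line)
```

Because each part is read on its own, no file name is ever attached to the alignment lines, whatever the number of parts.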
Re: [galaxy-dev] problems splitting
OK, I think I understand the line:

    beginning merge: bwa mem /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa /home/ralonso/galaxy-dist/database/files/000/dataset_8.dat > /home/ralonso/galaxy-dist/database/files/000/dataset_94.dat 2> /dev/null

It refers to the original command, so everything is fine with this line. The other problem still remains.

Regards, and sorry for the confusion

--
Roberto Alonso
Functional Genomics Unit
Bioinformatics and Genomics Department
Prince Felipe Research Center (CIPF)
C./Eduardo Primo Yúfera (Científic), nº 3 (junto Oceanografico)
46012 Valencia, Spain
Tel: +34 963289680 Ext. 1021
Fax: +34 963289574
E-Mail: ralo...@cipf.es
[galaxy-dev] question about splitting bams
Hello,

I am trying to write some code that makes it possible to parallelize some tasks. I am now working on the problem of splitting a BAM file into several parts, and for this I created this simple tool:

    <parallelism method="multi" split_size="3" split_mode="number_of_parts" merge_outputs="output" split_inputs="input"></parallelism>
    <command>
      java -jar /home/ralonso/software/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /home/ralonso/BiB/Galaxy/data/chr_19_hg19_ucsc.fa -I $input -o $output 2&gt; /dev/null;
    </command>
    <inputs>
      <param format="bam" name="input" type="data" label="bam"/>
    </inputs>
    <outputs>
      <data format="vcf" name="output" />
    </outputs>

But I have one problem. When I execute the tool it goes through this part of the code (I am working in the dev branch):

*$galaxy/lib/galaxy/jobs/splitters/multi.py, line 75:*

    for input in parent_job.input_datasets:
        if input.name in split_inputs:
            this_input_files = job_wrapper.get_input_dataset_fnames(input.dataset)
            if len(this_input_files) > 1:
                log_error = "The input '%s' is composed of multiple files - splitting is not allowed" % str(input.name)
                log.error(log_error)
                raise Exception(log_error)
            input_datasets.append(input.dataset)

It raises the exception because this_input_files contains 2 entries, namely:

    ['/home/ralonso/galaxy/database/files/000/dataset_171.dat', '/home/ralonso/galaxy/database/files/_metadata_files/000/metadata_13.dat']

I guess that:

*dataset_171.dat*: is the bam file.
*metadata_13.dat*: is the bai file.

So Galaxy can't move on, and I don't know what the best solution would be. Maybe change the *if* to check only non-metadata files? I think I should use both files in order to create the BAM sub-files, but this would be inside the Bam class, in the *binary.py* file.

Could you please guide me before I mess things up?

Thanks so much

--
Roberto Alonso
Functional Genomics Unit
Bioinformatics and Genomics Department
Prince Felipe Research Center (CIPF)
C./Eduardo Primo Yúfera (Científic), nº 3 (junto Oceanografico)
46012 Valencia, Spain
Tel: +34 963289680 Ext. 1021
Fax: +34 963289574
E-Mail: ralo...@cipf.es
Re: [galaxy-dev] question about splitting bams
Regarding my previous mail, I found this thread: http://www.bytebucket.org/galaxy/galaxy-central/pull-request/175/parameter-based-bam-file-parallelization/diff

Is it still alive? Is it maybe the best choice for BAM parallelization?

Thanks!
Best regards

--
Roberto Alonso
Functional Genomics Unit
Bioinformatics and Genomics Department
Prince Felipe Research Center (CIPF)
C./Eduardo Primo Yúfera (Científic), nº 3 (junto Oceanografico)
46012 Valencia, Spain
Tel: +34 963289680 Ext. 1021
Fax: +34 963289574
E-Mail: ralo...@cipf.es
Re: [galaxy-dev] question about splitting bams
Hello,

I have been reading those different threads and I have some doubts that maybe you can clarify. In the thread you mention the "ability to write tools that split up a single input into a collection." I think this is aimed at workflows, but in any case, could we use it to split BAM files?

Another comment is the following: "These common pipelines where you split up a BAM files, run a bunch of steps, and then merge the results will be executable in the near future (though 15.03 won't have workflow editor support for it - I will try to get to this by the following release - and you can manually build up workflows to do this -"

I was trying to write something that does exactly this, and I guess someone is already working on it. Do you think it is worth continuing, or should I switch to something else? Do you know the road-map of this feature?

Thanks a lot,
Roberto

On 23 April 2015 at 20:09, John Chilton jmchil...@gmail.com wrote:
> I am a pragmatist - I have no problem just splitting the inputs and skipping the metadata files. I would just convert the error into a log.info() and warn that the tool cannot use metadata files. If the underlying tool needs an index it can recreate it instead I think.
>
> One can imagine a more intricate solution that would recreate metadata files as needed - but that would be a lot of work I think. Does that make sense?
>
> About BB PR 175 there were some recent discussions about that approach - I would check out http://dev.list.galaxyproject.org/Parallelism-using-metadata-td4666763.html .
>
> -John

--
Roberto Alonso
Functional Genomics Unit
Bioinformatics and Genomics Department
Prince Felipe Research Center (CIPF)
C./Eduardo Primo Yúfera (Científic), nº 3 (junto Oceanografico)
46012 Valencia, Spain
Tel: +34 963289680 Ext. 1021
Fax: +34 963289574
E-Mail: ralo...@cipf.es
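John's suggestion of downgrading the error to a log.info() and splitting only the data file could look roughly like the sketch below. It is a standalone illustration rather than a patch to multi.py: the function name pick_splittable_file is hypothetical, and it assumes that the first file name in the list is the primary dataset and the rest are metadata files, as in the listing Roberto reported.

```python
import logging

log = logging.getLogger(__name__)


def pick_splittable_file(input_name, filenames):
    """Return the file to split for one input, ignoring metadata files.

    Assumption: filenames[0] is the dataset itself and any further
    entries are metadata files such as a BAM index.
    """
    if not filenames:
        raise ValueError("The input '%s' has no files" % input_name)
    if len(filenames) > 1:
        # Warn instead of raising: the split parts will not carry metadata,
        # so a tool that needs an index has to recreate it
        log.info(
            "The input '%s' is composed of multiple files - only the first "
            "will be split and the tool cannot use metadata files",
            input_name,
        )
    return filenames[0]
```

With the two paths from the original message, the function would return the dataset_171.dat path and log one informational message about the skipped metadata file.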
Re: [galaxy-dev] PR 149
OK, no problem ;)

It is another PR. I think it is useful without the other PR, for example when you map with BWA. The last PR would be the next step, for example when you split a BAM file to do some calling or whatever. I think both PRs can live independently and together... I don't know if I explained myself well :)

Regards

On 29 April 2015 at 17:14, John Chilton jmchil...@gmail.com wrote:
> No it just slipped through the cracks - sorry about that. I have commented on it now. There was a time when a couple weeks before a first response was the norm :).
>
> Does it belong with the bam splitting pull request - is merging useful on its own without the other piece you are working on?
>
> -John

--
Roberto Alonso
Functional Genomics Unit
Bioinformatics and Genomics Department
Prince Felipe Research Center (CIPF)
C./Eduardo Primo Yúfera (Científic), nº 3 (junto Oceanografico)
46012 Valencia, Spain
Tel: +34 963289680 Ext. 1021
Fax: +34 963289574
E-Mail: ralo...@cipf.es
[galaxy-dev] splitting bams bai
is in a metadata table, but I don't know how to get it. Could you please help me with this? In any case, if you find that I am doing something wrong, or you have a better idea for implementing this, please don't hesitate to contact me.

Best regards
Roberto Alonso
[galaxy-dev] PR 149
Hello,

I created a PR (https://github.com/galaxyproject/galaxy/pull/149) some days ago, but I haven't had any feedback yet. Is there any problem with it? Is it not of interest for the current Galaxy? Did the Galaxy authors not notice it? It would be nice to have some feedback, even if the PR is not suitable.

Thanks so much,
Best regards
Roberto Alonso
[galaxy-dev] bam split and gatk calling
Hello,

I have been working on the Galaxy parallelization module and I would like to ask some questions about how to approach one problem. I have made a pull request about splitting BAMs: https://github.com/galaxyproject/galaxy/pull/184

I think it is useful, but it would be more useful if the tool could somehow access the interval. Let me explain with an example. Suppose I define a simple tool like this, with the parallelism tag enabled:

    <tool id="gatk" name="call with gatk">
      <description>gatk</description>
      <parallelism method="multi" split_mode="by_interval" split_size="1" merge_outputs="output" split_inputs="input"></parallelism>
      <command>
        ## by_rname
        ln -s $input input.bam;
        samtools index input.bam;
        UnifiedGenotyper -R /home/ralonso/BiB/Galaxy/data/hg19_ucsc.fa -I input.bam -o $output -L REGION ;
      </command>
      <inputs>
        <param format="bam" name="input" type="data" label="bam"/>
      </inputs>
      <outputs>
        <data format="vcf" name="output" />
      </outputs>
      <help>
        bwa
      </help>
    </tool>

The region is based on the split_size field; this is explained in more detail in the PR.

How does the code from the PR work? It goes through the BAM file and does something like samtools view REGION -o bam_splitted.bam, so GATK then does the calling on this small BAM. But what is the problem? In GATK, if you don't pass the region as a command-line argument, it goes through the whole genome, so it is very slow.

So, how would you recommend passing this information to GATK? I was thinking of creating a file region.bed at the same time the BAM is split, and using it in the tool definition XML, so the command would look like this:

    <command>
      ...
      UnifiedGenotyper -R /home/ralonso/BiB/Galaxy/data/hg19_ucsc.fa -I input.bam -o $output -L region.bed;
    </command>

This solution does not fully convince me, because it is a bit intrusive in the tool definition and because you have to trust that the region.bed file exists.

Do you have any opinions or suggestions? Thanks a lot!

Best regards
Roberto Alonso
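To make the REGION idea above concrete, here is a minimal Python sketch of how a splitter might tile a reference into region strings that could be handed to a tool through -L. It is only an illustration: the function name tile_regions and the fixed window size in bases are assumptions, and this is not how the PR maps split_size to intervals (that mapping is described in the PR, not here).

```python
def tile_regions(seq_lengths, window):
    """Tile each reference sequence into consecutive windows.

    seq_lengths: list of (name, length) pairs, e.g. taken from a BAM header.
    window: window size in bases (hypothetical parameter, not Galaxy's split_size).
    Returns region strings of the form "name:start-end", 1-based and inclusive.
    """
    if window <= 0:
        raise ValueError("window must be a positive number of bases")
    regions = []
    for name, length in seq_lengths:
        start = 1
        while start <= length:
            # The last window of a sequence is clipped to the sequence end.
            end = min(start + window - 1, length)
            regions.append("%s:%d-%d" % (name, start, end))
            start = end + 1
    return regions


if __name__ == "__main__":
    # Toy lengths, chosen only to keep the output short.
    for region in tile_regions([("chr1", 2500), ("chrM", 900)], 1000):
        print(region)
```

Each child job would then receive one of these strings in place of REGION, so the tool definition would not need to rely on a side file being present.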
Re: [galaxy-dev] bam split and gatk calling
Hello,

I agree, what you say fits GATK perfectly, but I wanted to write more generic code, so I did it this way (also because I am a newbie in the Galaxy code and didn't know very well how to implement this). What about a tool that doesn't accept a region, just a BAM? Maybe we could add another parameter to the parallelism tag that forces the BAM to be split. In most cases, though, just creating a BED file would be better, right? What do you think?

Regards
Roberto Alonso

On 6 May 2015 at 12:23, Peter Cock p.j.a.c...@googlemail.com wrote:
> Hi Roberto,
>
> Given the way BAM indexing works, I see no reason to actually split the BAM file at all - it seems like wasted disk IO. Instead, can you split a BED file into sub-regions? This way each child GATK job would look at the full BAM file, but only for a small region described in the split BED region file.
>
> Peter
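The BED-splitting suggestion can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not Galaxy's splitter API: the helper names (parse_bed, split_intervals, to_bed) are hypothetical, only the first three BED columns are used, and the balancing rule (largest interval first, into the currently smallest part) is just one simple choice.

```python
def parse_bed(lines):
    """Parse BED lines into (chrom, start, end) tuples (0-based, half-open)."""
    intervals = []
    for line in lines:
        line = line.strip()
        # Skip blanks, comments, and UCSC track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        intervals.append((fields[0], int(fields[1]), int(fields[2])))
    return intervals


def split_intervals(intervals, n_parts):
    """Distribute intervals over n_parts so the covered bases are roughly balanced."""
    parts = [[] for _ in range(n_parts)]
    totals = [0] * n_parts
    by_length = sorted(intervals, key=lambda iv: iv[2] - iv[1], reverse=True)
    for chrom, start, end in by_length:
        idx = totals.index(min(totals))  # part with the fewest bases so far
        parts[idx].append((chrom, start, end))
        totals[idx] += end - start
    return [sorted(part) for part in parts]


def to_bed(intervals):
    """Serialize intervals back to three-column BED text."""
    return "".join("%s\t%d\t%d\n" % iv for iv in intervals)


if __name__ == "__main__":
    bed = ["chr1\t0\t100", "chr1\t100\t400", "chr2\t0\t300"]
    for i, part in enumerate(split_intervals(parse_bed(bed), 2)):
        print("part %d:" % i)
        print(to_bed(part), end="")
```

Each part would be written to its own BED file and passed to one child job, while every job reads the same full, indexed BAM.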
Re: [galaxy-dev] bam split and gatk calling
I agree, I prefer your solution and will focus on it, thanks! However, some software that is more or less widely used in the community, such as Delly (https://github.com/tobiasrausch/delly) and Breakdancer (http://gmt.genome.wustl.edu/packages/breakdancer/documentation.html), doesn't use BED files, so the only way to parallelize its execution is through smaller BAMs.

Regards
Roberto Alonso

On 6 May 2015 at 15:00, Peter Cock p.j.a.c...@googlemail.com wrote:
> Maybe you're right - BAM splitting might be useful for some tools (any examples?), even though BED splitting is a much more elegant solution.
>
> Peter
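For tools that only take a BAM, the splitting step could look roughly like the sketch below, which only builds samtools view command lines, one per region, without running them. The function name and the output naming scheme are hypothetical, and the input BAM is assumed to be coordinate-sorted and indexed, since region queries need the index. Whether per-region pieces are adequate input for Delly or Breakdancer is not something this thread establishes; reads overlapping a region boundary would appear in more than one piece.

```python
def samtools_split_commands(bam_path, regions, out_prefix):
    """Build one 'samtools view' command per region (commands are not executed).

    bam_path: coordinate-sorted, indexed BAM file.
    regions: region strings such as "chr1:1-1000000".
    out_prefix: prefix for the per-region output files (hypothetical naming scheme).
    """
    commands = []
    for i, region in enumerate(regions):
        out_path = "%s_%03d.bam" % (out_prefix, i)
        # -b writes BAM output, -o names the output file.
        commands.append(["samtools", "view", "-b", "-o", out_path, bam_path, region])
    return commands


if __name__ == "__main__":
    for cmd in samtools_split_commands("input.bam", ["chr1", "chr2:1-5000"], "part"):
        print(" ".join(cmd))
```

Returning argument lists rather than shell strings means they can be passed directly to subprocess without quoting problems.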
[galaxy-dev] error executing test
Hello,

I am designing some tests and I have a problem: the tool works in the Galaxy web environment, but it doesn't work when I try to run it as a test case. Other tests I have tried fail as well.

My test, ./run_tests.sh -framework -id parallelism_bam_filter_reads, reports the following:

    ERROR: filter reads ( parallelism_bam_filter_reads ) Test-1
    Traceback (most recent call last):
      File "/home/ralonso/galaxy/test/functional/test_toolbox.py", line 268, in test_tool
        self.do_it( td )
      File "/home/ralonso/galaxy/test/functional/test_toolbox.py", line 55, in do_it
        raise e
    RunToolException: Error creating a job for these tool inputs - {u'type': u'error', u'data': {u'input': u'History does not include a dataset of the required format / build'}}

And the other test, ./run_tests.sh -framework -id compare_bam_as_sam:

    ERROR: compare_bam_as_sam ( compare_bam_as_sam ) Test-1
    Traceback (most recent call last):
      File "/home/ralonso/galaxy/test/functional/test_toolbox.py", line 268, in test_tool
        self.do_it( td )
      File "/home/ralonso/galaxy/test/functional/test_toolbox.py", line 37, in do_it
        stage_data_in_history( galaxy_interactor, testdef.test_data(), test_history, shed_tool_id )
      File "/home/ralonso/galaxy/test/base/interactor.py", line 38, in stage_data_in_history
        upload_wait()
      File "/home/ralonso/galaxy/test/base/interactor.py", line 279, in wait
        while not self.__history_ready( history_id ):
      File "/home/ralonso/galaxy/test/base/interactor.py", line 297, in __history_ready
        return self._state_ready( state, error_msg="History in error state." )
      File "/home/ralonso/galaxy/test/base/interactor.py", line 356, in _state_ready
        raise Exception( error_msg )
    Exception: History in error state.
    begin captured logging

Besides that, it tries to migrate the database each time I run a test case, which takes too long. I have seen that you can use --db postgres, but it doesn't work; I think this option is meant to be used with --dockerize (which is not my case).

Would you have any idea what is going on?

Best regards
Roberto Alonso
[galaxy-dev] metadata in parallelization
Hello,

I am writing some code to enable parallelization for some tool wrappers. First I did it for a simple bwa wrapper, and now I am modifying toolshed.g2.bx.psu.edu/repos/devteam/bwa/c71dd035971e/bwa/bwa-mem.xml to check whether the code works with this wrapper. So I wrote some code that I think was necessary to merge the BAM files, and I added the parallelism tag to the config file:

    <tool id="bwa_mem" name="BWA-MEM" version="0.1">
      <macros>
        <import>bwa_macros.xml</import>
      </macros>
      <requirements>
        <requirement type="package" version="0.7.10.039ea20639">bwa</requirement>
        <requirement type="package" version="1.1">samtools</requirement>
      </requirements>
      <description>- map medium and long reads (&gt; 100 bp) against reference genome</description>
      <parallelism method="multi" split_size="3" shared_inputs="ref_file" split_mode="number_of_parts" merge_outputs="bam_output" split_inputs="fastq_input1,fastq_input2"></parallelism>
      <command>
      ...

Everything works well, and the resulting BAM is the same with and without parallelization, but the Galaxy log shows an error regarding metadata. It says something like this:

    galaxy.jobs.splitters.multi DEBUG 2015-04-17 09:54:58,335 merge finished: /home/ralonso/galaxy/database/files/000/dataset_198.dat
    galaxy.jobs.runners.tasks DEBUG 2015-04-17 09:54:58,473 executing external set_meta script for job 200: python /home/ralonso/galaxy/database/tmp/set_metadata_E5fGIE.py /home/ralonso/galaxy/database/tmp/tmpHS8Byo /home/ralonso/galaxy/database/job_working_directory/000/200/galaxy.json /home/ralonso/galaxy/database/tmp/metadata_in_HistoryDatasetAssociation_198_yOGiQG,/home/ralonso/galaxy/database/tmp/metadata_kwds_HistoryDatasetAssociation_198_nAsQoq,/home/ralonso/galaxy/database/tmp/metadata_out_HistoryDatasetAssociation_198_I_cLs4,/home/ralonso/galaxy/database/tmp/metadata_results_HistoryDatasetAssociation_198_qhjzoV,/home/ralonso/galaxy/database/files/000/dataset_198.dat,/home/ralonso/galaxy/database/tmp/metadata_override_HistoryDatasetAssociation_198_ScKLqH
    Traceback (most recent call last):
      File "/home/ralonso/galaxy/database/tmp/set_metadata_E5fGIE.py", line 1, in <module>
        from galaxy_ext.metadata.set_metadata import set_metadata; set_metadata()
    ImportError: No module named galaxy_ext.metadata.set_metadata
    galaxy.jobs.runners.tasks DEBUG 2015-04-17 09:54:58,624 execution of external set_meta finished for job 200
    galaxy.datatypes.metadata DEBUG 2015-04-17 09:54:58,714 setting metadata externally failed for HistoryDatasetAssociation 198: External set_meta() not called

Without parallelization there is no problem, because Galaxy doesn't execute this part of the code at all. I see that Galaxy has to do something with the metadata attributes, but what is it trying to do? Is there any way to solve this?

Thank you very much
Regards,
Roberto Alonso
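One possible reading of the ImportError above, offered as an assumption rather than a diagnosis, is that the external set_meta script was started without the directory containing the galaxy_ext package on its module search path. A quick way to test that kind of hypothesis is to try the import in a fresh interpreter with and without an extra PYTHONPATH entry. The sketch below uses Python 3 (Galaxy at the time ran on Python 2), the helper name importable is hypothetical, and the /home/ralonso/galaxy/lib path in the demo is only a guess based on the paths in the log.

```python
import os
import subprocess
import sys


def importable(module, extra_path=None):
    """Return True if `module` can be imported by a fresh Python interpreter.

    extra_path, if given, is prepended to PYTHONPATH for the child process only.
    """
    env = dict(os.environ)
    if extra_path:
        existing = env.get("PYTHONPATH")
        env["PYTHONPATH"] = (
            extra_path if not existing else extra_path + os.pathsep + existing
        )
    result = subprocess.run(
        [sys.executable, "-c", "import " + module],
        env=env,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    module = "galaxy_ext.metadata.set_metadata"
    # Hypothetical location of Galaxy's lib directory; adjust to the real checkout.
    guess = "/home/ralonso/galaxy/lib"
    print("default path:", importable(module))
    print("with", guess + ":", importable(module, extra_path=guess))
```

If the import succeeds only with the extra path, the environment in which the task runner launches the script would be the place to look.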