Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Peter Cock
On Wed, Feb 15, 2012 at 6:07 PM, Dannon Baker dannonba...@me.com wrote:

 Main still runs these jobs in the standard non-split fashion, and as a
 resource that is occasionally saturated (and thus doesn't necessarily have
 extra resources to parallelize to) will probably continue doing so as long
 as there's significant overhead involved in splitting the files.  Fancy
 scheduling could minimize the issue, but as it is during heavy load you
 would actually have lower total throughput due to the splitting overhead.


Because the splitting (currently) happens on the main server?

 Regarding the merging of the output, I see there is a default merge
 method in lib/galaxy/datatypes/data.py which just concatenates
 the files. I am surprised at that - it seems like a very bad idea in
 general - consider many binary files, or XML. Why not make this
 the default for text and subclasses thereof?

 I can't think of a better reasonable default behavior for Data, though
 you're obviously right that each datatype subclass will need to define
 particular behaviors for merging files.

The default should raise an error (and better yet, refuse to do the
split in the first place). Zen of Python: In the face of ambiguity,
refuse the temptation to guess.
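
Something along these lines, i.e. a minimal sketch with hypothetical
method signatures (not the actual Galaxy code):

    class Data(object):
        @staticmethod
        def merge(split_files, output_file):
            # No safe generic merge exists for an arbitrary (possibly
            # binary or XML) datatype, so refuse rather than guess.
            raise NotImplementedError("Merging is not supported for this datatype")

    class Text(Data):
        @staticmethod
        def merge(split_files, output_file):
            # Simple concatenation is a reasonable default for text only.
            out = open(output_file, "w")
            for filename in split_files:
                handle = open(filename)
                out.write(handle.read())
                handle.close()
            out.close()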

Peter



Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Dannon Baker

On Feb 16, 2012, at 5:15 AM, Peter Cock wrote:

 On Wed, Feb 15, 2012 at 6:07 PM, Dannon Baker dannonba...@me.com wrote:
 
 Main still runs these jobs in the standard non-split fashion, and as a
 resource that is occasionally saturated (and thus doesn't necessarily have
 extra resources to parallelize to) will probably continue doing so as long
 as there's significant overhead involved in splitting the files.  Fancy
 scheduling could minimize the issue, but as it is during heavy load you
 would actually have lower total throughput due to the splitting overhead.
 
 
 Because the splitting (currently) happens on the main server?

No, because the splitting process is work that has to happen somewhere.
Ignoring possible benefits from things that haven't been implemented yet, in a
situation where your cluster is saturated with work you are unable to take
advantage of the parallelism, and splitting files apart only adds more
work, reducing total job throughput.  That the splitting always happens on the
head node is not ideal, and needs to be configurable.  I have a fork somewhere
that attempts to address this, but it needs work.


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Peter Cock
On Wed, Feb 15, 2012 at 6:07 PM, Dannon Baker dannonba...@me.com wrote:
 Good luck, let me know how it goes, and again - contributions are certainly
 welcome :)

I think I found the first bug: the split method in lib/galaxy/datatypes/sequence.py
for class Sequence assumes four lines per sequence. This would make
sense as the split method of the Fastq class (after grooming to remove
any line wrapping) but is a very bad idea for most sequence file formats
(e.g. FASTA).

It looks like a little refactoring is needed: define a Sequence split method
which raises NotImplementedError, move the current code to the Fastq
class, then write something similar but allowing multiple lines per record
for the Fasta class.
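
In code terms, something like this (class names as in
lib/galaxy/datatypes/sequence.py, but the signatures and bodies here are
just an illustrative sketch, not the real code):

    from galaxy.datatypes import data

    class Sequence(data.Text):
        @classmethod
        def split(cls, input_datasets, subdir_generator_function, split_params):
            # No record structure can be assumed for a generic sequence
            # file, so refuse to split rather than corrupt the data.
            raise NotImplementedError("Can't split generic sequence files")

    class Fastq(Sequence):
        @classmethod
        def split(cls, input_datasets, subdir_generator_function, split_params):
            # Safe here: after grooming, every FASTQ record is exactly
            # four lines, so the current line-based code moves in here.
            pass  # (existing Sequence.split body, unchanged)

    class Fasta(Sequence):
        @classmethod
        def split(cls, input_datasets, subdir_generator_function, split_params):
            # Records start with ">" and sequences may be line wrapped,
            # so splits must fall on record boundaries, not line counts.
            pass  # (new code, scanning for ">" to find boundaries)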

Does that sound reasonable? I'll do this on a new branch for review...

Peter


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Peter Cock
On Thu, Feb 16, 2012 at 10:47 AM, Peter Cock p.j.a.c...@googlemail.com wrote:
 On Wed, Feb 15, 2012 at 6:07 PM, Dannon Baker dannonba...@me.com wrote:
 Good luck, let me know how it goes, and again - contributions are certainly
 welcome :)

 I think I found the first bug: the split method in
 lib/galaxy/datatypes/sequence.py
 for class Sequence assumes four lines per sequence. This would make
 sense as the split method of the Fastq class (after grooming to remove
 any line wrapping) but is a very bad idea for most sequence file formats
 (e.g. FASTA).

 It looks like a little refactoring is needed: define a Sequence split method
 which raises NotImplementedError, move the current code to the Fastq
 class, then write something similar but allowing multiple lines per record
 for the Fasta class.

 Does that sound reasonable? I'll do this on a new branch for review...

Refactoring lib/galaxy/datatypes/sequence.py split method here,
https://bitbucket.org/peterjc/galaxy-central/changeset/762777618073

This is part of a work-in-progress split_blast branch to try splitting
BLAST jobs, for which I will need to split FASTA files as inputs, and
also merge BLAST XML output:
https://bitbucket.org/peterjc/galaxy-central/src/split_blast

Peter


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Fields, Christopher J
On Feb 16, 2012, at 4:47 AM, Peter Cock wrote:

 On Wed, Feb 15, 2012 at 6:07 PM, Dannon Baker dannonba...@me.com wrote:
 Good luck, let me know how it goes, and again - contributions are certainly
 welcome :)
 
 I think I found the first bug: the split method in
 lib/galaxy/datatypes/sequence.py
 for class Sequence assumes four lines per sequence. This would make
 sense as the split method of the Fastq class (after grooming to remove
 any line wrapping) but is a very bad idea for most sequence file formats
 (e.g. FASTA).
 
 It looks like a little refactoring is needed: define a Sequence split method
 which raises NotImplementedError, move the current code to the Fastq
 class, then write something similar but allowing multiple lines per record
 for the Fasta class.
 
 Does that sound reasonable? I'll do this on a new branch for review...
 
 Peter

Makes sense from my perspective; splits have to be defined based on data type.  
It could be as low-level as defining a simple iterator per record, then a 
wrapper that allows a specific chunk-size.  The split file creation could 
almost be abstracted completely away into a common method.
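
For example (a Python sketch with made-up helper names):

    def fasta_records(handle):
        # Per-format part: yield one FASTA record (a header line plus
        # any number of sequence lines) at a time.
        record = []
        for line in handle:
            if line.startswith(">") and record:
                yield record
                record = []
            record.append(line)
        if record:
            yield record

    def in_chunks(records, chunk_size):
        # Format-agnostic part: group any record iterator into lists of
        # at most chunk_size records, ready to write out as part files.
        chunk = []
        for record in records:
            chunk.append(record)
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk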

As Peter implies, maybe a simple API for defining a split method would be all
that is needed.  It might also be useful for any merge step; 'cat'-like merges
won't work for every format, but would be a suitable default.

chris


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Peter Cock
On Thu, Feb 16, 2012 at 1:53 PM, Fields, Christopher J
cjfie...@illinois.edu wrote:

 Makes sense from my perspective; splits have to be defined based on
 data type.  It could be as low-level as defining a simple iterator per
 record, then a wrapper that allows a specific chunk-size.  The split
 file creation could almost be abstracted completely away into a
 common method.

I'm trying to understand exactly how the current code creates the
splits, but yes - something like that is what I would expect.

 As Peter implies, maybe a simple API for defining a split method
 would be all that is needed.  It might also be useful for any merge
 step; 'cat'-like merges won't work for every format, but would be
 a suitable default.

Yes, for a lot of file types concatenation is fine. Again, like the
splitting, this has to be (and is) defined at the data type level (which
is a hierarchy of classes in Galaxy).

Peter



[galaxy-dev] Tool that outputs html report

2012-02-16 Thread Daniel Sobral
Hello,

I want to develop a Galaxy wrapper for a tool that outputs an HTML
report (with images inside specific subfolders).

Do I need to do what the FastQC Galaxy wrapper does, reconstructing the
HTML so that all the files end up in the same folder?
Or is there a way to circumvent this so that I can keep the original
HTML structure?

One way of cheating might be to zip the output and provide the zip
file as data?

Thanks,
Daniel Sobral


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Peter Cock
Hi Dan,

I think I need a little more advice - what is the role of the script
scripts/extract_dataset_part.py and the JSON files created
when splitting FASTQ files in lib/galaxy/datatypes/sequence.py,
and then used by the class' process_split_file method?

Why is there no JSON file created by the base data class in
lib/galaxy/datatypes/data.py and no method process_split_file?

Is the JSON thing part of a partial and unfinished rewrite of the
splitter code?

On the assumption that not all splitters bother with the JSON,
I am trying a little hack to scripts/extract_dataset_part.py to
abort silently if there is no JSON file:
https://bitbucket.org/peterjc/galaxy-central/changeset/ebe94a2c25c3
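
The change is essentially just a guard at the top of the script, something
like this (paraphrased; see the changeset for the actual diff):

    import os
    import sys

    json_filename = sys.argv[1]
    if not os.path.isfile(json_filename):
        # Not every splitter writes a JSON instructions file; if there
        # is none, there is nothing for this script to do.
        sys.exit(0)
    # ...otherwise read the JSON and extract the requested part...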

This seems to be working with my current attempt at a FASTA
splitter (not checked in yet; only partly implemented and tested).

Peter


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Peter Cock
On Thu, Feb 16, 2012 at 4:28 PM, Peter Cock p.j.a.c...@googlemail.com wrote:
 Hi Dan,

 I think I need a little more advice - what is the role of the script
 scripts/extract_dataset_part.py and the JSON files created
 when splitting FASTQ files in lib/galaxy/datatypes/sequence.py,
 and then used by the class' process_split_file method?

 Why is there no JSON file created by the base data class in
 lib/galaxy/datatypes/data.py and no method process_split_file?

 Is the JSON thing part of a partial and unfinished rewrite of the
 splitter code?

 On the assumption that not all splitters bother with the JSON,
 I am trying a little hack to scripts/extract_dataset_part.py to
 abort silently if there is no JSON file:
 https://bitbucket.org/peterjc/galaxy-central/changeset/ebe94a2c25c3

 This seems to be working with my current attempt at a FASTA
 splitter (not checked in yet; only partly implemented and tested).

I've checked in my FASTA splitting, which now seems to be
working OK with my BLAST tests. So far it only supports splitting
into chunks of a requested number of sequences, rather than
splitting the whole file into a given number of pieces.
https://bitbucket.org/peterjc/galaxy-central/changeset/416c961c0da9

I also need to look at merging multiple BLAST XML outputs, but
this is looking promising.
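
Presumably the XML merge will mean keeping the header from the first file
and splicing together the <Iteration> elements, roughly like this (an
untested sketch which ignores details such as renumbering the iterations):

    def merge_blast_xml(split_files, output_file):
        out = open(output_file, "w")
        for index, filename in enumerate(split_files):
            text = open(filename).read()
            start = text.find("<Iteration>")
            end = text.rfind("</Iteration>") + len("</Iteration>")
            if index == 0:
                # Header (everything before the first iteration), once.
                out.write(text[:start])
            # All of this part's <Iteration> elements.
            out.write(text[start:end])
        # Close the tags which each part's own footer would have closed.
        out.write("\n</BlastOutput_iterations>\n</BlastOutput>\n")
        out.close()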

Peter


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Dannon Baker
Very cool, I'll check it out!  The addition of the JSON files is indeed very 
new and was likely unfinished with respect to the base splitter.

-Dannon

On Feb 16, 2012, at 1:24 PM, Peter Cock wrote:

 On Thu, Feb 16, 2012 at 4:28 PM, Peter Cock p.j.a.c...@googlemail.com wrote:
 Hi Dan,
 
 I think I need a little more advice - what is the role of the script
 scripts/extract_dataset_part.py and the JSON files created
 when splitting FASTQ files in lib/galaxy/datatypes/sequence.py,
 and then used by the class' process_split_file method?
 
 Why is there no JSON file created by the base data class in
 lib/galaxy/datatypes/data.py and no method process_split_file?
 
 Is the JSON thing part of a partial and unfinished rewrite of the
 splitter code?
 
 On the assumption that not all splitters bother with the JSON,
 I am trying a little hack to scripts/extract_dataset_part.py to
 abort silently if there is no JSON file:
 https://bitbucket.org/peterjc/galaxy-central/changeset/ebe94a2c25c3
 
 This seems to be working with my current attempt at a FASTA
 splitter (not checked in yet; only partly implemented and tested).
 
 I've checked in my FASTA splitting, which now seems to be
 working OK with my BLAST tests. So far it only supports splitting
 into chunks of a requested number of sequences, rather than
 splitting the whole file into a given number of pieces.
 https://bitbucket.org/peterjc/galaxy-central/changeset/416c961c0da9
 
 I also need to look at merging multiple BLAST XML outputs, but
 this is looking promising.
 
 Peter



[galaxy-dev] Expanding galaxyTool volume?

2012-02-16 Thread Dave Lin
Hi All,

What is the recommended process for expanding the galaxyTool volume for an
existing galaxy instance (using EC2/cloudman)?

I tried the following, but it didn't work for me.

0) Terminate cluster.

1) Amazon EC2- create snapshot of current galaxyTools volume
2) Amazon EC2- create volume from step 1 + specify desired volume size.
3) Amazon EC2- create new snapshot from Step 2.
4) Amazon S3- identify S3 bucket for this cluster. Modify
persistent_data.yaml.  Modify size and snap_id to correspond with step #3
5) Amazon EC2-  Start new instance-- using same AmazonID + ClusterName

I was expecting the new instance to start up and create a galaxyTools volume
based on the snapshot identified in the persistent_data.yaml file, but that
didn't seem to work.

Thanks in advance for any pointers.
Dave

Re: [galaxy-dev] Expanding galaxyTool volume?

2012-02-16 Thread Dave Lin
Hi Enis,

I installed a new test cluster earlier today and did notice that the new
clusters magically now have galaxyTool volumes with 10GB. That is a good
change.

However, you are correct. I have an existing cluster (that had the old 2 GB
volume size) that I'm trying to expand. With additional tools and log
files, that volume keeps getting full.

Can you help guide me through this process?

Thanks again,
Dave
On Thu, Feb 16, 2012 at 2:36 PM, Enis Afgan eaf...@emory.edu wrote:

 Hi Dave,
 Are you trying to modify the size of the tools volume for a cluster that's
 been around for a while and that you've already customized, or could this be
 a new cluster?
 The reason I'm asking is that as of Tuesday (3 days ago), the default
 tools volume for any new cluster will be 10GB (vs 2GB previously), with only
 1.7GB used. I would hope that gives plenty of storage space for the
 majority of needs.

 Let me know if you need to modify an existing cluster and I'll guide you
 through the process then.

 Enis

 On Thu, Feb 16, 2012 at 9:31 PM, Dave Lin d...@verdematics.com wrote:

 Hi All,

 What is the recommended process for expanding the galaxyTool volume for an
 existing galaxy instance (using EC2/cloudman)?

 I tried the following, but it didn't work for me.

 0) Terminate cluster.

 1) Amazon EC2- create snapshot of current galaxyTools volume
 2) Amazon EC2- create volume from step 1 + specify desired volume size.
 3) Amazon EC2- create new snapshot from Step 2.
 4) Amazon S3- identify S3 bucket for this cluster. Modify
 persistent_data.yaml.  Modify size and snap_id to correspond with step #3
 5) Amazon EC2-  Start new instance-- using same AmazonID + ClusterName

 I was expecting the new instance to start up and create a galaxyTools
 volume based on the snapshot identified in the persistent_data.yaml file,
 but that didn't seem to work.

 Thanks in advance for any pointers.
  Dave



Re: [galaxy-dev] Expanding galaxyTool volume?

2012-02-16 Thread Enis Afgan
No problem; this should get you there:

1. With CloudMan running, go to the Admin console and stop the Galaxy and
PostgreSQL services (in that order)
2. From the instance's CLI: sudo umount /mnt/galaxyTools
3. From the AWS console, detach the tools volume (but remember which device
it was attached as)
4. From the AWS console, create a snapshot of the detached volume
5. From the AWS console, create a new volume of the desired size from the
newly created snapshot
6. From the AWS console, attach the new larger volume to the running
instance (attach it as a different device than the original volume)
7. From the instance's CLI: sudo mount <device> /mnt/galaxyTools
8. From the instance's CLI: sudo xfs_growfs /mnt/galaxyTools
9. From the instance's CLI: sudo umount /mnt/galaxyTools
10. From the AWS console, detach the volume and create a snapshot of it
11. From the AWS console, attach the original volume as the same device it
was attached as before
12. From the instance's CLI: sudo mount <device> /mnt/galaxyTools
13. From the CloudMan Admin console, the 'File systems' service should now
be running. If so, start the PostgreSQL and Galaxy services (in that order)
14. From CloudMan, Terminate the cluster
15. From the AWS S3 console, in the cluster's bucket, edit
'persistent_data.yaml' so the galaxyTools file system points to the new
snapshot (from step 10) and its size is set correctly
16. Start the cluster back up using the same user data. You should now have
the new, larger file system there, and any changes you want to make can be
done from the CLI. Then persist the file system changes from the CloudMan
Admin console to keep them around after you restart the cluster.

I have not actually tried this and am speaking from memory, so some steps
may not work exactly as written, but the general concept is there. Good luck,
and let us know how it goes.
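
For anyone wanting to script the AWS side of this, the same snapshot and
volume juggling can be done with boto; a rough sketch (the IDs, zone and
size below are placeholders):

    import boto

    ec2 = boto.connect_ec2()  # credentials from the environment

    # Steps 4-5: snapshot the detached volume, then make a larger
    # volume from that snapshot.
    snap = ec2.create_snapshot("vol-OLDTOOLS", "galaxyTools resize")
    # (poll snap.update() until its status is "completed")
    new_volume = ec2.create_volume(20, "us-east-1a", snapshot=snap.id)

    # Step 6: attach the new volume to the running instance as a
    # spare device, ready for the mount and xfs_growfs steps.
    ec2.attach_volume(new_volume.id, "i-INSTANCE", "/dev/sdg")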

Enis



On Fri, Feb 17, 2012 at 12:34 AM, Dave Lin d...@verdematics.com wrote:

 Hi Enis,

 I installed a new test cluster earlier today and did notice that the new
 clusters magically now have galaxyTool volumes with 10GB. That is a good
 change.

 However, you are correct. I have an existing cluster (that had the old 2
 GB volume size) that I'm trying to expand. With additional tools and log
 files, that volume keeps getting full.

 Can you help guide me through this process?

 Thanks again,
 Dave
 On Thu, Feb 16, 2012 at 2:36 PM, Enis Afgan eaf...@emory.edu wrote:

 Hi Dave,
 Are you trying to modify the size of the tools volume for a cluster
 that's been around for a while and that you've already customized, or could
 this be a new cluster?
 The reason I'm asking is that as of Tuesday (3 days ago), the default
 tools volume for any new cluster will be 10GB (vs 2GB previously), with only
 1.7GB used. I would hope that gives plenty of storage space for the
 majority of needs.

 Let me know if you need to modify an existing cluster and I'll guide you
 through the process then.

 Enis

 On Thu, Feb 16, 2012 at 9:31 PM, Dave Lin d...@verdematics.com wrote:

 Hi All,

 What is the recommended process for expanding the galaxyTool volume for an
 existing galaxy instance (using EC2/cloudman)?

 I tried the following, but it didn't work for me.

 0) Terminate cluster.

 1) Amazon EC2- create snapshot of current galaxyTools volume
 2) Amazon EC2- create volume from step 1 + specify desired volume size.
 3) Amazon EC2- create new snapshot from Step 2.
 4) Amazon S3- identify S3 bucket for this cluster. Modify
 persistent_data.yaml.  Modify size and snap_id to correspond with step #3
 5) Amazon EC2-  Start new instance-- using same AmazonID + ClusterName

 I was expecting the new instance to start up and create a galaxyTools
 volume based on the snapshot identified in the persistent_data.yaml file,
 but that didn't seem to work.

 Thanks in advance for any pointers.
  Dave



[galaxy-dev] transferring sample datasets stuck in 'In queue' status

2012-02-16 Thread Luobin Yang
Hi all,

I configured the Sample Tracking System in Galaxy to transfer datasets from
a sequencer to data libraries. However, after I selected the datasets to be
transferred on the sequencer and clicked the Transfer button, the transfer
status has remained 'In queue' ever since.

I didn't find any error message in galaxy_listener.log; however, I did find
the following error message in data_transfer.log:

2012-02-16 19:39:01,221 - datatx_3623 - Error. <!DOCTYPE HTML PUBLIC
"-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>405 Method Not Allowed</title>
</head><body>
<h1>Method Not Allowed</h1>
<p>The requested method PUT is not allowed for the URL
/api/samples/1e8ab44153008be8.</p>
<hr>
<address>Apache/2.2.14 (Ubuntu) Server at xxx.xxx.xxx.xxx Port 80</address>
</body></html>

Any idea what went wrong?
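
The 405 page is coming from Apache itself (note the address line), which
suggests the proxy in front of Galaxy is rejecting the PUT before Galaxy
ever sees it. A quick way to check from outside Galaxy (a sketch; the host
and request body are placeholders):

    import httplib

    connection = httplib.HTTPConnection("xxx.xxx.xxx.xxx", 80)
    connection.request("PUT", "/api/samples/1e8ab44153008be8", "{}")
    response = connection.getresponse()
    # Expect "405 Method Not Allowed" if the proxy is blocking PUT.
    print response.status, response.reason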

Thanks,
Luobin

[galaxy-dev] producing HTML output with images

2012-02-16 Thread Nikhil Joshi
Hi all,

I am having trouble producing HTML output with images.  In the past I
have been able to produce HTML files with no images and it seemed to
work fine.  However, now I am writing a script that produces
diagnostic images, and I want to display all of the images on one page
using HTML.  I am using the files_path variable to create the plots
in the working directory, and then I am using the extra_files_path
variable to access the final plot from the HTML file.  I have looked at
the resulting HTML file, and it points to the proper full path of the
plot, and the plot DOES exist... but when I click to view the output,
the plot doesn't render.  It just shows an empty box with a broken-image
icon; the text, however, does render.  If I copy the plot file, I can
view it fine by itself.  If I copy the plot and the HTML, I can view
the page just fine offline, but Galaxy doesn't want to render the
image for some reason.  What am I doing wrong?

- Nik.

-- 
Nikhil Joshi
Bioinformatics Programmer
UC Davis Genome Center
Davis, CA


Re: [galaxy-dev] producing HTML output with images

2012-02-16 Thread Nikhil Joshi
Well, I guess I figured it out... it turns out that Galaxy does some
internal magic to automatically add the path of the output directory
to the filename when it renders the HTML.  So what I needed to do
was to create the file in the temporary job directory and then just
use the file name, without any path information, in the HTML, and
Galaxy does the rest.  If anybody has any other advice or pitfalls to
avoid, please feel free to let me know.
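
For the record, the pattern that works is something like this (a sketch;
the argument names are made up):

    import os
    import sys

    # Galaxy passes these on the tool's command line, e.g.
    #   mytool.py output.html output.files_path   (names hypothetical)
    html_filename = sys.argv[1]
    files_path = sys.argv[2]

    if not os.path.exists(files_path):
        os.makedirs(files_path)

    plot_name = "plot1.png"
    # (save the figure to os.path.join(files_path, plot_name) here)

    # Reference the image by bare filename only; Galaxy adds the path
    # of the output directory when it renders the HTML.
    handle = open(html_filename, "w")
    handle.write("<html><body>\n")
    handle.write('<img src="%s"/>\n' % plot_name)
    handle.write("</body></html>\n")
    handle.close()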

- Nik.

On Thu, Feb 16, 2012 at 8:15 PM, Nikhil Joshi najo...@ucdavis.edu wrote:
 Hi all,

 I am having trouble producing HTML output with images.  In the past I
 have been able to produce HTML files with no images and it seemed to
 work fine.  However, now I am writing a script that produces
 diagnostic images, and I want to display all of the images on one page
 using HTML.  I am using the files_path variable to create the plots
 in the working directory, and then I am using the extra_files_path
 variable to access the final plot from the HTML file.  I have looked at
 the resulting HTML file, and it points to the proper full path of the
 plot, and the plot DOES exist... but when I click to view the output,
 the plot doesn't render.  It just shows an empty box with a broken-image
 icon; the text, however, does render.  If I copy the plot file, I can
 view it fine by itself.  If I copy the plot and the HTML, I can view
 the page just fine offline, but Galaxy doesn't want to render the
 image for some reason.  What am I doing wrong?

 - Nik.

 --
 Nikhil Joshi
 Bioinformatics Programmer
 UC Davis Genome Center
 Davis, CA



-- 
Nikhil Joshi
Bioinformatics Programmer
UC Davis Genome Center
Davis, CA
