Re: [galaxy-dev] [galaxy-user] operating on, and getting at, large files in galaxy...

2011-02-21 Thread Nate Coraor
Nick Schurch wrote:
 Hi all,
 
 I've recently encountered a few problems when trying to use Galaxy which are
 really driving me away from using it as a bioinformatics platform for NGS. I
 was wondering if there are any simple solutions that I've missed...

Hi Nick,

We've had some internal discussion and proposed some solutions which
would hopefully make Galaxy more useful for your environment.

 Firstly, it seems that while there are a few solutions for getting large
 files (a few GB) into a local install of Galaxy without going through HTTP,
 many tools that operate on these files produce multiple, uncompressed large
 files which quickly eat up the disk allocation. This is particularly
 significant in a workflow that has multiple processing steps which each
 leave behind a large file. With no way to compress or archive files produced
 by intermediate steps in a workflow, and no desire to delete them since I
 may need to go back to them and they can take hours to re-run, the only
 remaining option seems to be to save them locally and then delete them.

We've dealt with this locally by implementing compression in the
underlying filesystem (ZFS), but this requires a fileserver that runs
Solaris (or a derivative) or FreeBSD.  Btrfs also supports compression,
but I would be a bit more wary of losing my data with Btrfs, since it is
less mature and can't yet recover corrupted filesystems.  FuseCompress
would also be an option.
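
For example, on a ZFS fileserver this is a one-line property change (a
sketch; the pool/dataset name here is hypothetical):

# Enable transparent compression on the filesystem holding Galaxy's
# datasets ("tank/galaxy" is a hypothetical pool/dataset name).
zfs set compression=gzip tank/galaxy

# Later, check how well the data compresses:
zfs get compressratio tank/galaxy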

We would strongly recommend performing regular backups regardless of any
filesystem-level choice.

Unfortunately this is a tricky problem to solve within Galaxy itself.
While some tools can operate on compressed files directly, many cannot
and so compressing all outputs could prove to be very CPU intensive and
a waste of time if the next step will have to decompress the file.
There has been some discussion of how to implement transparent
compression and other complex underlying data management directly in
Galaxy, but work on it is unlikely to commence soon.

 And this brings me to the second problem. Getting large files out of Galaxy.
 The only way to save large files from Galaxy (that I can see) is the save
 icon, which downloads the file via http. This takes *ages* for a large file
 and also causes big headaches for my firefox browser. I've taken a quick
 peek at the Galaxy file system to see if I could just copy a file, but it's
 almost completely indecipherable if you want to find out what file in the
 file system corresponds to a file saved from a tool. Is there some way to
 get the location of a particular file on the Galaxy file system, that I can
 just copy?

This is certainly something we can implement and will be working on
fairly soon.  There have been quite a few requests to integrate more
tightly with environments where Galaxy users exist as system users.

There's an issue in our tracker which you can follow here:

  https://bitbucket.org/galaxy/galaxy-central/issue/106/

--nate

 
 -- 
 Cheers,
 
 Nick Schurch
 
 Data Analysis Group (The Barton Group),
 School of Life Sciences,
 University of Dundee,
 Dow St,
 Dundee,
 DD1 5EH,
 Scotland,
 UK
 
 Tel: +44 1382 388707
 Fax: +44 1382 345 893


___
To manage your subscriptions to this and other Galaxy lists, please use the 
interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-dev] [galaxy-user] operating on, and getting at, large files in galaxy...

2011-02-21 Thread Assaf Gordon
Hi Nick,

If you're running your own local instance, then nothing is impossible - it's 
just a bit ugly...

Nate Coraor wrote, On 02/21/2011 12:36 PM:
 [...] many tools that operate on these files produce multiple,
 uncompressed large files which quickly eat up the disk allocation.
 [...] With no way to compress or archive files produced by
 intermediate steps in a workflow, [...]

Here's a tool that compresses an input Galaxy dataset and then deletes the
input file.
Deleting the input dataset from underneath Galaxy's feet obviously goes against
everything Galaxy stands for,
and I'm sure the Galaxy team does not endorse such solutions. It will also
make your database slightly out-of-sync with the real files on the disk.
But hey - desperate times call for desperate measures :)


<tool id="cshl_compress_input" name="Compress Input File">
  <description>for advanced users only!</description>
  <command>gzip -c '$input' &gt; '$output' &amp;&amp; rm '$input'</command>
  <inputs>
    <param format="data" name="input" type="data" label="Dataset to Compress" />
    <param format="data" name="waitforinput" type="data" label="Tool to wait for" />
  </inputs>
  <outputs>
    <data format="gzip" name="output" />
  </outputs>
  <help>
**What it does**
DO NOT USE THIS TOOL UNLESS YOU KNOW WHAT YOU'RE DOING.
  </help>
</tool>


The second input parameter is there only to force this tool to run after
another tool (one which needs the uncompressed input file) -
you should connect this tool carefully in your workflow.

Making the output format gzip ensures the new compressed files can't be used 
with any regular tool.
Then create a similar uncompress tool that does the opposite.
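
A minimal sketch of such an uncompress tool, following the same pattern
(untested; the id, name, and the placeholder output format are assumptions):

<tool id="cshl_uncompress_input" name="Uncompress Input File">
  <description>for advanced users only!</description>
  <!-- Decompress, then remove the gzipped dataset. -->
  <command>gzip -dc '$input' &gt; '$output' &amp;&amp; rm '$input'</command>
  <inputs>
    <param format="gzip" name="input" type="data" label="Dataset to Uncompress" />
  </inputs>
  <outputs>
    <!-- "data" is a placeholder; set this to the real datatype if known. -->
    <data format="data" name="output" />
  </outputs>
  <help>
**What it does**
DO NOT USE THIS TOOL UNLESS YOU KNOW WHAT YOU'RE DOING.
  </help>
</tool>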

 And this brings me to the second problem. Getting large files out
 of Galaxy. The only way to save large files from Galaxy (that I can
 see) is the save icon, which downloads the file via http. This takes
 *ages* for a large file and also causes big headaches for my
 firefox browser. 

Here are three solutions (in varying levels of ugliness) to get files out of
Galaxy:

1. This simple tool will tell you the full path of your dataset:
=
<tool id="cshl_get_dataset_full_path" name="Get dataset full path">
  <description>for advanced users only!</description>
  <command>readlink -f '$input' &gt; '$output'</command>
  <inputs>
    <param format="data" name="input" type="data" label="Show full path of dataset" />
  </inputs>
  <outputs>
    <data format="txt" name="output" />
  </outputs>
  <help>
**What it does**
DO NOT USE THIS TOOL UNLESS YOU KNOW WHAT YOU'RE DOING.
  </help>
</tool>
=

Run it on any input dataset; the output will contain the full path on your
local filesystem.
It goes without saying that this is a security hazard: only use this tool
if you know what you're doing, and you trust your users.
Once you have the full path, just access the file directly, outside of Galaxy.
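
For example, with the reported path in hand you can copy the dataset
straight off the server (both paths below are hypothetical):

# Copy the dataset the tool reported to somewhere convenient
# (hypothetical source and destination paths).
cp /galaxy/database/files/000/dataset_123.dat ~/results/sample1.fastq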


2. The following tool allows the user to export a dataset into a hard-coded
directory (/tmp/galaxy_export in this example).
This is just a proof of concept; for a production environment you'll need
to add validators to the description variable to prevent users from entering
unwanted characters (a sketch of such a validator follows the tool).
But it works - once the tool is run, the selected dataset will appear under
/tmp/galaxy_export/$USER/ .
=
<tool id="cshl_export_to_local" name="Export to local file">
  <description>for advanced users only!</description>
  <command>
    mkdir -p /tmp/galaxy_export/$userEmail &amp;&amp;
    ln -s '$input' '/tmp/galaxy_export/$userEmail/${input.hid}_${description}.${input.extension}'
  </command>
  <inputs>
    <param format="data" name="input" type="data" label="Dataset to Export" />
    <param name="description" type="text" size="30" label="File name" />
  </inputs>
  <outputs>
    <data format="txt" name="output" />
  </outputs>
  <help>
**What it does**

DO NOT USE THIS TOOL UNLESS YOU KNOW WHAT YOU'RE DOING.
  </help>
</tool>
=
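
For example, the description parameter could be restricted with Galaxy's
regex validator (a sketch; the exact pattern is an assumption):

<param name="description" type="text" size="30" label="File name">
  <!-- Reject anything but letters, digits, '.', '-' and '_'
       (this pattern is an assumption; tighten as needed). -->
  <validator type="regex" message="Use only letters, digits, '.', '-' and '_'">^[\w.-]+$</validator>
</param>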



3. Last but not least, if you have access to the database, getting the dataset
path is easy if you know the dataset number or the dataset hash-id (and you have
them as links on the galaxy web page). This solution is not for the faint of
heart, but if you want I can show examples of how to get from one to the other.
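
For the curious, a sketch of that lookup (assuming a stock Galaxy schema;
table and column names can vary between releases, and the on-disk layout
described below is an assumption):

-- Map history item number (hid) 7 in history 42 to its dataset id:
SELECT d.id
FROM history_dataset_association AS hda
JOIN dataset AS d ON d.id = hda.dataset_id
WHERE hda.history_id = 42
  AND hda.hid = 7;

-- Galaxy then stores the file under its file_path as roughly:
--   <file_path>/<id / 1000, zero-padded to 3 digits>/dataset_<id>.dat
-- e.g. dataset id 123 would live at database/files/000/dataset_123.dat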


-gordon


Re: [galaxy-dev] [galaxy-user] operating on, and getting at, large files in galaxy...

2011-02-17 Thread Edward Kirton
Hi Nick,

Yes, these nextgen reads files are huge and getting bigger every quarter!
But there will be storage issues no matter whether you use Galaxy or not.
In fact, I think users are more likely to clean up files and histories in
Galaxy than they are to clean up NFS folders -- out of sight, out of mind!

Firstly, I think unnecessary intermediate files are more of a problem than
whether or not the file is compressed.  Indeed, just transferring
these files back and forth from the cluster takes a while, not to mention
the delay in waiting to be rescheduled for each step.  And so I created a
tool which does the job of fastq groomer, end-trimmer, process pairs,
and a few other simple tasks -- all in one shot.  I haven't uploaded it to
the toolshed yet but I will.  I hate to duplicate existing tools, but I have
a lot of seq data.  I will also create a fastqilluminabz2 datatype
and include it with the tool.
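
For reference, such a datatype would be registered in datatypes_conf.xml
along these lines (a sketch; the extension name and Python class are
assumptions):

<!-- Register a bzip2-compressed fastqillumina datatype
     (extension name and class are assumptions). -->
<datatype extension="fastqilluminabz2" type="galaxy.datatypes.binary:Binary"
          mimetype="application/octet-stream" display_in_upload="true"/>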

For getting files into Galaxy, I created a simple tool which allows
staff to enter NFS paths, with the option to either copy or symlink if the
location is considered stable.  I allowed only certain folders (e.g. /home,
/storage) and added a password, for security.  Similarly, for getting a file
out, all you need is a dinky tool for users to provide a destination path
(a sketch of such a tool follows below).
Since I've got Galaxy running as a special galaxy user in a special galaxy
group, file access is restricted (as it should be), so I tell users to
create a dropbox folder in their homedir (and chmod 777).  By creating a
tool like this, you don't need to care how Galaxy names the files.  I
deliberately try not to mess around under the hood.  I can upload these to
the Galaxy toolshed, but like I said, there isn't much to them.
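
A minimal sketch of such a dinky export tool (untested; the id, labels, and
the echoed log output are all assumptions, and it relies on the destination
being writable by the galaxy user):

<tool id="export_to_dropbox" name="Export dataset to a dropbox folder">
  <description>copy a dataset to a user-supplied, galaxy-writable path</description>
  <!-- Copy the dataset out, then record where it went as the tool's output. -->
  <command>
    cp '$input' '$destination' &amp;&amp;
    echo "Exported to $destination" &gt; '$output'
  </command>
  <inputs>
    <param format="data" name="input" type="data" label="Dataset to export" />
    <param name="destination" type="text" size="60"
           label="Destination path (e.g. a chmod-777 dropbox folder in your homedir)" />
  </inputs>
  <outputs>
    <data format="txt" name="output" />
  </outputs>
</tool>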

Ed

On Wed, Feb 9, 2011 at 4:17 AM, Nick Schurch n.schu...@dundee.ac.uk wrote:


 Hi all,

 I've recently encountered a few problems when trying to use Galaxy which
 are really driving me away from using it as a bioinformatics platform for
 NGS. I was wondering if there are any simple solutions that I've missed...

 Firstly, it seems that while there are a few solutions for getting large
 files (a few GB) into a local install of Galaxy without going through HTTP,
 many tools that operate on these files produce multiple, uncompressed large
 files which quickly eat up the disk allocation. This is particularly
 significant in a workflow that has multiple processing steps which each
 leave behind a large file. With no way to compress or archive files produced
 by intermediate steps in a workflow, and no desire to delete them since I
 may need to go back to them and they can take hours to re-run, the only
 remaining option seems to be to save them locally and then delete them.

 And this brings me to the second problem. Getting large files out of
 Galaxy. The only way to save large files from Galaxy (that I can see) is the
 save icon, which downloads the file via http. This takes *ages* for a large
 file and also causes big headaches for my firefox browser. I've taken a
 quick peek at the Galaxy file system to see if I could just copy a file, but
 it's almost completely indecipherable if you want to find out what file in
 the file system corresponds to a file saved from a tool. Is there some way
 to get the location of a particular file on the Galaxy file system, that I
 can just copy?

 --
 Cheers,

 Nick Schurch

 Data Analysis Group (The Barton Group),
 School of Life Sciences,
 University of Dundee,
 Dow St,
 Dundee,
 DD1 5EH,
 Scotland,
 UK

 Tel: +44 1382 388707
 Fax: +44 1382 345 893





