Re: [galaxy-dev] user data upload directory structure

2012-05-16 Thread Josh Nielsen
Ah, yes. This is what I was just requesting yesterday in the email
that I sent, although it was much more long-winded. I didn't see this
email chain from the day before. Having a user-representative
directory structure would be beneficial in my mind.

I followed/understood your suggested directory structure up until the
arrows. Are those supposed to be symlinks? If so, what do you have in
mind? I was thinking that just having those subdirectories by user id
under files/ would be enough (although I could see how you could
symlink them to some other arbitrary location if you so desired).

My desired application was so that I could set up an FTP share to the
files/ directory so that our users could copy their (processed) files
off of the Galaxy server to other servers in our environment as well
as one of our other clusters. Having the datasets segregated into the
user's/owner's subdirectories would make it easier to identify and
copy them off for that purpose.
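
A minimal sketch of that copy-off step, assuming the per-user layout under
files/ existed (it is only a proposal in this thread, not something Galaxy
provides here). The function name, the dataset_*.dat pattern, and the
destination directory are illustrative assumptions, not Galaxy APIs.

```python
import shutil
from pathlib import Path


def copy_user_datasets(files_root, user_id, dest_dir):
    """Copy every dataset_*.dat file in files_root/<user_id>/ to dest_dir.

    Assumes the hypothetical per-user layout discussed in this thread.
    Returns the copied destination paths, sorted by file name.
    """
    src = Path(files_root) / str(user_id)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for dataset in sorted(src.glob("dataset_*.dat")):
        # copy2 copies file contents and preserves modification times
        copied.append(Path(shutil.copy2(dataset, dest / dataset.name)))
    return copied
```

With such a layout, an FTP share or a cron job would only need to know the
user's id to find everything that user created.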

-Josh

>Nate-
>I do know about the disk accounting/quota features of Galaxy.
>As I alluded to in my previous email, it actually goes beyond accounting. I
>wanted to be able to implement something like:
>~/galaxy-dist/database/files/user_id_000 -> /one_data_pool_set/id_000
>~/galaxy-dist/database/files/user_id_001 -> /another_data_pool_set/id_001
>which would match the usual data placement from a scheduler perspective too.
>I'll look at  galaxy-dist/lib/galaxy/objectstore/__init__.py
>Thanks a lot
>JC
___
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/

Re: [galaxy-dev] user data upload directory structure

2012-05-15 Thread Jean-Christophe Ducom

Nate-
I do know about the disk accounting/quota features of Galaxy.
As I alluded to in my previous email, it actually goes beyond accounting. I
wanted to be able to implement something like:

~/galaxy-dist/database/files/user_id_000 -> /one_data_pool_set/id_000
~/galaxy-dist/database/files/user_id_001 -> /another_data_pool_set/id_001
which would match the usual data placement from a scheduler perspective too.
I'll look at  galaxy-dist/lib/galaxy/objectstore/__init__.py
Thanks a lot
JC
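
The symlink arrangement above could be set up along these lines. This is
only a sketch under stated assumptions: the function name, the mapping of
user ids to pool directories, and the user_id_<id> / id_<id> naming are taken
from the example paths or invented for illustration, and nothing here makes
Galaxy itself write datasets into those directories.

```python
import os
from pathlib import Path


def link_user_dirs(files_root, pool_for_user):
    """Create files_root/user_id_<id> symlinks pointing into storage pools.

    pool_for_user maps a user id string (e.g. "000") to the pool directory
    that should hold that user's data. All names here are hypothetical.
    """
    files_root = Path(files_root)
    files_root.mkdir(parents=True, exist_ok=True)
    links = {}
    for user_id, pool in pool_for_user.items():
        # Real directory inside the chosen pool, e.g. /one_data_pool_set/id_000
        target = Path(pool) / f"id_{user_id}"
        target.mkdir(parents=True, exist_ok=True)
        link = files_root / f"user_id_{user_id}"
        # Leave existing links or directories untouched so reruns are safe
        if not link.is_symlink() and not link.exists():
            os.symlink(target, link, target_is_directory=True)
        links[user_id] = link
    return links
```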




On 05/15/2012 07:26 AM, Nate Coraor wrote:

On May 15, 2012, at 10:15 AM, Jean-Christophe Ducom wrote:


Thank you for your email Peter.
We have implemented Galaxy to interface with our HPC cluster via PBS/Torque.
Thanks to DRMAA (not PBS python), all user CPU usage can be accounted for. The
motivation is indeed what you describe, in addition to managing cost and disk
performance on a per-user/per-project basis, as we have tiered storage.
Our filesystem is GPFS which, as you might know, has (amongst many others) a
nice feature called a fileset: it is basically a data bucket that reports usage
regardless of Unix ownership. It works great for project-type directories.
The file name length concern is indeed a legitimate one given command line
limits (GPFS has the same name length limit as ext3/4). The current filename
can remain unmodified: the requested scheme would only introduce the user
database ID (usually 3-4 digits) in the path, e.g.
~/galaxy-dist/database/files/000/dataset_0001.dat
~/galaxy-dist/database/files/001/dataset_0002.dat
~/galaxy-dist/database/files/002/dataset_0003.dat
~/galaxy-dist/database/files/000/dataset_0004.dat
~/galaxy-dist/database/files/000/dataset_0005.dat
~/galaxy-dist/database/files/000/dataset_0006.dat
~/galaxy-dist/database/files/002/dataset_0007.dat
[...]
Any thoughts?
Thanks again
JC

Hi JC,

As Peter mentions, there's no clear way to determine ownership when data is 
shared.  The best you could do is identify the user that originally created a 
dataset.  If you wanted to go this route, the best place to start would be an 
enhancement of the Object Store framework, at 
galaxy-dist/lib/galaxy/objectstore/__init__.py

If you're not aware, Galaxy does have internal disk accounting and quota 
features:

 http://wiki.g2.bx.psu.edu/Admin/Disk%20Quotas

--nate



From: Peter Cock [p.j.a.c...@googlemail.com]
Sent: Tuesday, May 15, 2012 1:38 AM
To: Jean-Christophe Ducom
Cc: galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] user data upload directory structure

On Mon, May 14, 2012 at 10:22 PM, Jean-Christophe Ducom wrote:

All-
Is there a way to change the upload default directory structure
(/database/files) to organize files per user_id instead?
something along the following lines
~galaxy-dist/database/files/postgresql_user_id0
~galaxy-dist/database/files/postgresql_user_id1

Thank you
JC

I can see that being useful for a quick way to look at per user
disk usage - although you'd have problems counting with
shared data. Is that your motivation?

Another concern would be overly long filenames, which have
a direct impact on the command line lengths used to call the
tools - there are OS limits on this.

Peter





Re: [galaxy-dev] user data upload directory structure

2012-05-15 Thread Nate Coraor
On May 15, 2012, at 10:15 AM, Jean-Christophe Ducom wrote:

> Thank you for your email Peter.
> We have implemented Galaxy to interface with our HPC cluster via PBS/Torque.
> Thanks to DRMAA (not PBS python), all user CPU usage can be accounted for. The
> motivation is indeed what you describe, in addition to managing cost and disk
> performance on a per-user/per-project basis, as we have tiered storage.
> Our filesystem is GPFS which, as you might know, has (amongst many others) a
> nice feature called a fileset: it is basically a data bucket that reports usage
> regardless of Unix ownership. It works great for project-type directories.
> The file name length concern is indeed a legitimate one given command line
> limits (GPFS has the same name length limit as ext3/4). The current filename
> can remain unmodified: the requested scheme would only introduce the user
> database ID (usually 3-4 digits) in the path, e.g.
> ~/galaxy-dist/database/files/000/dataset_0001.dat
> ~/galaxy-dist/database/files/001/dataset_0002.dat
> ~/galaxy-dist/database/files/002/dataset_0003.dat
> ~/galaxy-dist/database/files/000/dataset_0004.dat
> ~/galaxy-dist/database/files/000/dataset_0005.dat
> ~/galaxy-dist/database/files/000/dataset_0006.dat
> ~/galaxy-dist/database/files/002/dataset_0007.dat
> [...]
> Any thoughts?
> Thanks again
> JC

Hi JC,

As Peter mentions, there's no clear way to determine ownership when data is 
shared.  The best you could do is identify the user that originally created a 
dataset.  If you wanted to go this route, the best place to start would be an 
enhancement of the Object Store framework, at 
galaxy-dist/lib/galaxy/objectstore/__init__.py
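
To make the ownership point concrete, here is a small sketch of accounting
by original creator only. The record type and its fields are hypothetical
stand-ins, not Galaxy's data model or the Object Store interface; the point
is simply that a shared dataset is charged once, to whoever created it, no
matter how many users it is shared with.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical record: who created a dataset and who it is shared with."""
    dataset_id: int
    creator_id: int
    size_bytes: int
    shared_with: set = field(default_factory=set)


def usage_by_creator(records):
    """Sum dataset sizes per creating user, ignoring any sharing."""
    totals = defaultdict(int)
    for rec in records:
        # Sharing does not change where the bytes are charged
        totals[rec.creator_id] += rec.size_bytes
    return dict(totals)
```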

If you're not aware, Galaxy does have internal disk accounting and quota 
features:

http://wiki.g2.bx.psu.edu/Admin/Disk%20Quotas

--nate

> 
> 
> From: Peter Cock [p.j.a.c...@googlemail.com]
> Sent: Tuesday, May 15, 2012 1:38 AM
> To: Jean-Christophe Ducom
> Cc: galaxy-dev@lists.bx.psu.edu
> Subject: Re: [galaxy-dev] user data upload directory structure
> 
> On Mon, May 14, 2012 at 10:22 PM, Jean-Christophe Ducom wrote:
>> All-
>> Is there a way to change the upload default directory structure
>> (/database/files) to organize files per user_id instead?
>> something along the following lines
>> ~galaxy-dist/database/files/postgresql_user_id0
>> ~galaxy-dist/database/files/postgresql_user_id1
>> 
>> Thank you
>> JC
> 
> I can see that being useful for a quick way to look at per user
> disk usage - although you'd have problems counting with
> shared data. Is that your motivation?
> 
> Another concern would be overly long filenames, which have
> a direct impact on the command line lengths used to call the
> tools - there are OS limits on this.
> 
> Peter
> 




Re: [galaxy-dev] user data upload directory structure

2012-05-15 Thread Jean-Christophe Ducom
Thank you for your email Peter.
We have implemented Galaxy to interface with our HPC cluster via PBS/Torque.
Thanks to DRMAA (not PBS python), all user CPU usage can be accounted for. The
motivation is indeed what you describe, in addition to managing cost and disk
performance on a per-user/per-project basis, as we have tiered storage.
Our filesystem is GPFS which, as you might know, has (amongst many others) a
nice feature called a fileset: it is basically a data bucket that reports usage
regardless of Unix ownership. It works great for project-type directories.
The file name length concern is indeed a legitimate one given command line
limits (GPFS has the same name length limit as ext3/4). The current filename
can remain unmodified: the requested scheme would only introduce the user
database ID (usually 3-4 digits) in the path, e.g.
~/galaxy-dist/database/files/000/dataset_0001.dat
~/galaxy-dist/database/files/001/dataset_0002.dat
~/galaxy-dist/database/files/002/dataset_0003.dat
~/galaxy-dist/database/files/000/dataset_0004.dat
~/galaxy-dist/database/files/000/dataset_0005.dat
~/galaxy-dist/database/files/000/dataset_0006.dat
~/galaxy-dist/database/files/002/dataset_0007.dat
[...]
Any thoughts?
Thanks again
JC
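
The requested scheme can be written down as a small path-building function.
This is a sketch that mirrors the example paths above, not Galaxy's actual
dataset naming: the function name and the zero-padding widths (three digits
for the user id, four for the dataset id) are assumptions read off those
examples.

```python
from pathlib import PurePosixPath


def dataset_path(files_root, user_id, dataset_id):
    """Build files_root/<user id>/dataset_<dataset id>.dat.

    user_id and dataset_id are integers; the padding widths follow the
    example paths in this thread and are otherwise arbitrary.
    """
    user_dir = f"{user_id:03d}"
    file_name = f"dataset_{dataset_id:04d}.dat"
    return PurePosixPath(files_root) / user_dir / file_name


# Dataset 4 created by the user with database id 0
print(dataset_path("~/galaxy-dist/database/files", 0, 4))
```

Note that the file name itself is unchanged; only one short directory
component is added to the path.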


From: Peter Cock [p.j.a.c...@googlemail.com]
Sent: Tuesday, May 15, 2012 1:38 AM
To: Jean-Christophe Ducom
Cc: galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] user data upload directory structure

On Mon, May 14, 2012 at 10:22 PM, Jean-Christophe Ducom wrote:
> All-
> Is there a way to change the upload default directory structure
> (/database/files) to organize files per user_id instead?
> something along the following lines
> ~galaxy-dist/database/files/postgresql_user_id0
> ~galaxy-dist/database/files/postgresql_user_id1
>
> Thank you
> JC

I can see that being useful for a quick way to look at per user
disk usage - although you'd have problems counting with
shared data. Is that your motivation?

Another concern would be overly long filenames, which have
a direct impact on the command line lengths used to call the
tools - there are OS limits on this.

Peter



Re: [galaxy-dev] user data upload directory structure

2012-05-15 Thread Peter Cock
On Mon, May 14, 2012 at 10:22 PM, Jean-Christophe Ducom wrote:
> All-
> Is there a way to change the upload default directory structure
> (/database/files) to organize files per user_id instead?
> something along the following lines
> ~galaxy-dist/database/files/postgresql_user_id0
> ~galaxy-dist/database/files/postgresql_user_id1
>
> Thank you
> JC

I can see that being useful for a quick way to look at per user
disk usage - although you'd have problems counting with
shared data. Is that your motivation?

Another concern would be overly long filenames, which have
a direct impact on the command line lengths used to call the
tools - there are OS limits on this.

Peter
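
One rough way to gauge the command line concern is to compare the bytes a
tool invocation needs against the limit the OS reports. The sketch below is
an approximation under stated assumptions: the helper names are made up, the
kernel also counts the environment and some per-argument overhead against
the same limit, and os.sysconf is only available on Unix-like systems.

```python
import os


def command_line_bytes(argv):
    """Approximate bytes used by argv: each argument plus a trailing NUL."""
    return sum(len(os.fsencode(arg)) + 1 for arg in argv)


def arg_max():
    """Return the OS limit on argument and environment bytes, or None."""
    try:
        limit = os.sysconf("SC_ARG_MAX")
    except (AttributeError, ValueError, OSError):
        # Not a Unix-like platform, or the limit is not exposed
        return None
    return limit if limit > 0 else None


def fits(argv, safety_margin=4096):
    """True if argv is comfortably under the limit, or the limit is unknown."""
    limit = arg_max()
    return limit is None or command_line_bytes(argv) + safety_margin < limit
```

Adding a three-digit user directory costs four extra bytes per dataset path
on the command line, which only matters for tools called with very many
input files.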