Re: [galaxy-dev] user data upload directory structure
Ah, yes. This is what I was requesting yesterday in the email that I sent, although mine was much more long-winded. I didn't see this email chain from the day before. Having a directory structure organized by user would be beneficial, in my mind.

I followed your suggested directory structure up until the arrows. Are those supposed to be symlinks? If so, what do you have in mind? I was thinking that just having those subdirectories by user id under files/ would be enough (although I could see how you could symlink them to some other arbitrary location if you so desired).

My intended application is to set up an FTP share to the files/ directory so that our users can copy their (processed) files off the Galaxy server to other servers in our environment, as well as to one of our other clusters. Having the datasets segregated into each owner's subdirectory would make it easier to identify and copy them for that purpose.

-Josh

>Nate-
>I do know about the disk accounting/quota features of Galaxy.
>As I alluded to in my previous email, it actually goes beyond accounting. I
>wanted to be able to implement something like:
>~/galaxy-dist/database/files/user_id_000 -> /one_data_pool_set/id_000
>~/galaxy-dist/database/files/user_id_001 -> /another_data_pool_set/id_001
>which would match the usual data placement from a scheduler perspective too.
>I'll look at galaxy-dist/lib/galaxy/objectstore/__init__.py
>Thanks a lot
>JC

___
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:

http://lists.bx.psu.edu/
Re: [galaxy-dev] user data upload directory structure
Nate-
I do know about the disk accounting/quota features of Galaxy.
As I alluded to in my previous email, it actually goes beyond accounting. I wanted to be able to implement something like:

~/galaxy-dist/database/files/user_id_000 -> /one_data_pool_set/id_000
~/galaxy-dist/database/files/user_id_001 -> /another_data_pool_set/id_001

which would match the usual data placement from a scheduler perspective too.
I'll look at galaxy-dist/lib/galaxy/objectstore/__init__.py
Thanks a lot
JC

On 05/15/2012 07:26 AM, Nate Coraor wrote:
> Hi JC,
>
> As Peter mentions, there's no clear way to determine ownership when data is
> shared. The best you could do is identify the user that originally created a
> dataset.
>
> If you wanted to go this route, the best place to start would be an
> enhancement of the Object Store framework, at
> galaxy-dist/lib/galaxy/objectstore/__init__.py
>
> If you're not aware, Galaxy does have internal disk accounting and quota
> features:
> http://wiki.g2.bx.psu.edu/Admin/Disk%20Quotas
>
> --nate
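The symlink layout JC describes above could be set up roughly as follows. This is only a sketch: the function name `link_user_dirs` and the user-to-pool mapping are hypothetical, nothing here is part of Galaxy, and Galaxy itself would still need an object store change before it wrote datasets into these per-user directories.

```python
import os
import tempfile


def link_user_dirs(files_root, pool_for_user):
    """Create files_root/user_id_NNN symlinks that point into storage pools.

    pool_for_user maps an integer user ID to the pool root that should hold
    that user's data (a hypothetical mapping, maintained by the admin).
    """
    created = {}
    for user_id, pool_root in sorted(pool_for_user.items()):
        target = os.path.join(pool_root, "id_%03d" % user_id)
        link = os.path.join(files_root, "user_id_%03d" % user_id)
        os.makedirs(target, exist_ok=True)
        # Existing symlinks are left alone; a regular file or directory with
        # the same name makes os.symlink raise FileExistsError.
        if not os.path.islink(link):
            os.symlink(target, link)
        created[link] = target
    return created


def demo():
    # Temporary directories stand in for database/files and the two pools.
    base = tempfile.mkdtemp()
    files_root = os.path.join(base, "files")
    os.makedirs(files_root)
    pools = {
        0: os.path.join(base, "one_data_pool_set"),
        1: os.path.join(base, "another_data_pool_set"),
    }
    return files_root, link_user_dirs(files_root, pools)
```

Keeping the mapping outside the function means a user can be moved to another pool by changing one entry and recreating that user's link.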
Re: [galaxy-dev] user data upload directory structure
On May 15, 2012, at 10:15 AM, Jean-Christophe Ducom wrote:

> Thank you for your email Peter.
> We have implemented Galaxy to interface with our HPC cluster via PBS/Torque.
> Thanks to DRMAA (not PBS python), all user CPU usage can be accounted for. The
> motivation is indeed what you describe, in addition to managing cost and disk
> performance on a per-user/per-project basis, as we have tiered storage.
> Our filesystem is GPFS, which as you might know has a nice feature (one among
> many) called a fileset: it's basically a data bucket that reports usage
> regardless of Unix ownership. It works great for project-type directories.
> The file name length concern is indeed legitimate given command-line limits
> (GPFS has the same name length limit as ext3/4). The current filename can remain
> unmodified: the requested scheme would only introduce the user database ID
> (usually 3-4 digits) into the path, e.g.
> ~/galaxy-dist/database/files/000/dataset_0001.dat
> ~/galaxy-dist/database/files/001/dataset_0002.dat
> ~/galaxy-dist/database/files/002/dataset_0003.dat
> ~/galaxy-dist/database/files/000/dataset_0004.dat
> ~/galaxy-dist/database/files/000/dataset_0005.dat
> ~/galaxy-dist/database/files/000/dataset_0006.dat
> ~/galaxy-dist/database/files/002/dataset_0007.dat
> [...]
> Any thoughts?
> Thanks again
> JC

Hi JC,

As Peter mentions, there's no clear way to determine ownership when data is shared. The best you could do is identify the user that originally created a dataset.

If you wanted to go this route, the best place to start would be an enhancement of the Object Store framework, at
galaxy-dist/lib/galaxy/objectstore/__init__.py

If you're not aware, Galaxy does have internal disk accounting and quota features:
http://wiki.g2.bx.psu.edu/Admin/Disk%20Quotas

--nate
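To illustrate what a per-user directory layout would and would not tell you about disk usage, here is a small sketch that totals file sizes under each subdirectory of a files directory. The function name `usage_by_user_dir` is made up for this example and it is not Galaxy's quota code. As noted above, a shared dataset is counted only against the directory it lives in, i.e. the user who created it.

```python
import os
import tempfile


def usage_by_user_dir(files_root):
    """Return {subdirectory name: total bytes of the files beneath it}."""
    usage = {}
    for entry in sorted(os.listdir(files_root)):
        user_dir = os.path.join(files_root, entry)
        if not os.path.isdir(user_dir):
            continue  # skip stray files sitting directly in files_root
        total = 0
        for dirpath, _dirnames, filenames in os.walk(user_dir):
            for name in filenames:
                total += os.path.getsize(os.path.join(dirpath, name))
        usage[entry] = total
    return usage


def demo():
    # Build a throwaway tree that mimics files/<user id>/dataset_NNNN.dat.
    files_root = tempfile.mkdtemp()
    samples = [
        ("000", "dataset_0001.dat", 10),
        ("000", "dataset_0004.dat", 5),
        ("001", "dataset_0002.dat", 7),
    ]
    for user_dir, name, size in samples:
        os.makedirs(os.path.join(files_root, user_dir), exist_ok=True)
        with open(os.path.join(files_root, user_dir, name), "wb") as handle:
            handle.write(b"x" * size)
    return usage_by_user_dir(files_root)
```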
Re: [galaxy-dev] user data upload directory structure
Thank you for your email Peter.

We have implemented Galaxy to interface with our HPC cluster via PBS/Torque. Thanks to DRMAA (not PBS python), all user CPU usage can be accounted for. The motivation is indeed what you describe, in addition to managing cost and disk performance on a per-user/per-project basis, as we have tiered storage.

Our filesystem is GPFS, which as you might know has a nice feature (one among many) called a fileset: it's basically a data bucket that reports usage regardless of Unix ownership. It works great for project-type directories.

The file name length concern is indeed legitimate given command-line limits (GPFS has the same name length limit as ext3/4). The current filename can remain unmodified: the requested scheme would only introduce the user database ID (usually 3-4 digits) into the path, e.g.

~/galaxy-dist/database/files/000/dataset_0001.dat
~/galaxy-dist/database/files/001/dataset_0002.dat
~/galaxy-dist/database/files/002/dataset_0003.dat
~/galaxy-dist/database/files/000/dataset_0004.dat
~/galaxy-dist/database/files/000/dataset_0005.dat
~/galaxy-dist/database/files/000/dataset_0006.dat
~/galaxy-dist/database/files/002/dataset_0007.dat
[...]

Any thoughts?
Thanks again
JC

From: Peter Cock [p.j.a.c...@googlemail.com]
Sent: Tuesday, May 15, 2012 1:38 AM
To: Jean-Christophe Ducom
Cc: galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] user data upload directory structure

On Mon, May 14, 2012 at 10:22 PM, Jean-Christophe Ducom wrote:
> All-
> Is there a way to change the upload default directory structure
> (/database/files) to organize files per user_id instead?
> Something along the following lines:
> ~/galaxy-dist/database/files/postgresql_user_id0
> ~/galaxy-dist/database/files/postgresql_user_id1
>
> Thank you
> JC

I can see that being useful for a quick way to look at per-user disk usage, although you'd have problems counting with shared data. Is that your motivation?

Another concern would be overly long filenames, which have a direct impact on the command line lengths used to call the tools; there are OS limits on this.

Peter
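The path scheme proposed above can be written down in a few lines. This is a sketch only: `dataset_path` is a hypothetical helper, not a Galaxy function, and the zero-padded widths (three digits for the user ID, four for the dataset ID) are assumptions copied from the example paths in the message.

```python
import posixpath


def dataset_path(files_root, user_id, dataset_id):
    """Build files_root/<user id>/dataset_<dataset id>.dat.

    The file name itself is unchanged; only one extra path component,
    the user's database ID, is inserted in front of it.
    """
    user_dir = "%03d" % user_id
    filename = "dataset_%04d.dat" % dataset_id
    return posixpath.join(files_root, user_dir, filename)


def added_length(user_id):
    """Characters added to every path: the ID digits plus one separator."""
    return len("%03d" % user_id) + 1
```

For a three-digit ID the path grows by four characters, which is the basis for the claim that the command line length impact is small.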
Re: [galaxy-dev] user data upload directory structure
On Mon, May 14, 2012 at 10:22 PM, Jean-Christophe Ducom wrote:
> All-
> Is there a way to change the upload default directory structure
> (/database/files) to organize files per user_id instead?
> Something along the following lines:
> ~/galaxy-dist/database/files/postgresql_user_id0
> ~/galaxy-dist/database/files/postgresql_user_id1
>
> Thank you
> JC

I can see that being useful for a quick way to look at per-user disk usage, although you'd have problems counting with shared data. Is that your motivation?

Another concern would be overly long filenames, which have a direct impact on the command line lengths used to call the tools; there are OS limits on this.

Peter
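The OS limit mentioned above can be queried on Unix-like systems through `os.sysconf("SC_ARG_MAX")`. The sketch below makes a rough check of whether a list of arguments fits; `fits_on_command_line` is a made-up helper name, and the estimate is deliberately simplified, since the real limit also covers the environment variables and per-argument pointer overhead.

```python
import os


def fits_on_command_line(argv, limit=None):
    """Roughly check whether argv fits within the exec argument size limit.

    If limit is None, the system value is read via os.sysconf (Unix only).
    Only the argument bytes are counted, so treat the result as optimistic.
    """
    if limit is None:
        limit = os.sysconf("SC_ARG_MAX")
    # Each argument is handed to the new process as a NUL-terminated string.
    needed = sum(len(os.fsencode(arg)) + 1 for arg in argv)
    return needed <= limit
```

A tool invoked with many dataset paths would be the case to watch, since each extra path component is paid once per file on the command line.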
[galaxy-dev] user data upload directory structure
All-
Is there a way to change the upload default directory structure (/database/files) to organize files per user_id instead?
Something along the following lines:

~/galaxy-dist/database/files/postgresql_user_id0
~/galaxy-dist/database/files/postgresql_user_id1

Thank you
JC