Re: [galaxy-dev] Improving Administrative Data Clean Up (pgcleanup.py vs cleanup_datasets.py)

2013-04-18 Thread Lance Parsons
Just an update on the cleanup script.  I have implemented a basic script 
to perform administrative dataset deletion and email notification.  
Right now it's limited to the history_dataset_association update_time and 
an optional tool_id string.


I have pushed it to a galaxy-central fork and issued a pull request to 
galaxy-central 
(https://bitbucket.org/galaxy/galaxy-central/pull-request/158/basic-administrative-dataset-cleanup/diff).  
I'm open to comments or suggestions; it could certainly be extended.  
Hopefully people find this useful.



admin_cleanup_datasets.py Documentation
---------------------------------------

Mark datasets as deleted that are older than a specified cutoff
and (optionally) have a tool_id that matches a specified search
string.

This script is useful for administrators to clean up after users who
leave many old datasets around.  It was modeled after the
cleanup_datasets.py script originally distributed with Galaxy.

Basic Usage:
admin_cleanup_datasets.py universe_wsgi.ini -d 60 \
--template=email_template.txt
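
A couple of hypothetical invocations showing the dry-run options
documented below (paths and day counts are illustrative):

    # report what would be deleted, without emailing or deleting anything
    admin_cleanup_datasets.py universe_wsgi.ini -d 90 -i

    # email users about pending deletion, but don't delete anything yet
    admin_cleanup_datasets.py universe_wsgi.ini -d 90 -e \
        --template=email_template.txt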

Required Arguments:
config_file - the Galaxy configuration file (universe_wsgi.ini)

Optional Arguments:
-d --days - number of days old the dataset must be (default: 60)
--tool_id - string to search for in dataset tool_id
--template - Mako template file to use for email notification
-i --info_only - Print results, but don't email or delete anything
-e --email_only - Email notifications, but don't delete anything
Useful for notifying users of pending deletion

--smtp - Specify SMTP server
If not specified, use the SMTP settings from the config file
--fromaddr - Specify the from address
If not specified, use error_email_to from the config file

Email Template Variables:
   cutoff - the cutoff in days
   email - the user's email address
   datasets - a list of tuples containing 'dataset' and 'history' names
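
For reference, here is a hypothetical email_template.txt using the
variables above (Mako syntax; the wording is illustrative, not the
template shipped with the script):

    The following datasets have not been updated in ${cutoff} days
    and have been marked deleted on this Galaxy server:

    % for dataset, history in datasets:
        ${dataset} (in history: ${history})
    % endfor

    If you would like to keep any of these datasets, please undelete
    them before they are permanently removed.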


Lance Parsons wrote:

Nate Coraor wrote:

On Mar 22, 2013, at 11:56 AM, Lance Parsons wrote:


I have been running a Galaxy server for our sequencing researchers for a while 
now and it's become increasingly successful. The biggest resource challenge for 
us has been, and continues to be, disk space.  As such, I'd like to implement 
some additional cleanup scripts. I thought I'd run a few questions by this list 
before I got too far into things.

In general, I'm wondering how to implement updates/additions to the cleanup system that 
will be in line with the direction that the Galaxy project is headed.  The pgcleanup.py 
script is the newest piece of code in this area (and even adds cleanup of exported 
histories, which are absent from the older cleanup scripts). Also, the pgcleanup.py 
script uses a cleanup_event table that I don't believe is used by the older 
cleanup_datasets.py script. However, the new pgcleanup.py script only works for Postgres, 
and worse, only for version 9.1+.  I run my system on RedHat (CentOS) and thus we use 
version 8.4 of Postgres.  Are there plans to support other databases or older versions of 
Postgres?


Hi Lance,

pgcleanup.py makes extensive use of Writable CTEs, so there is not really a way 
to port it to older versions.  For 8.4 or MySQL, you can still use the older 
cleanup_datasets.py.
After looking at it a bit more, I see what you mean.  Are there plans 
to implement any additional cleanup scripts for non-Postgres-9.1 
users?  Just curious so I don't reinvent the wheel; I'd be happy to 
help with existing efforts.

I'd like to implement a script to delete (set the deleted flag) for certain 
datasets (e.g. raw data imported from our archive, for old, inactive users, 
etc.).  I'm wondering if it would make sense to try and extend pgcleanup.py or 
cleanup_datasets.py.  Or perhaps it would be best to just implement a separate 
script, though that seems like I'd have to re-implement a lot of boilerplate 
code for configuration reading, connections, logging, etc.   Any tips on 
generally acceptable (supported) procedures for marking a dataset as deleted?


You could probably reuse a lot of the code from either of the cleanup scripts 
for this.
Right.  It seems to make sense to me to focus on cleanup_datasets.py, 
since that will work for everyone.  I would like to essentially mimic 
the user deleting a dataset.  I'd then email them to let them know that 
some old data had been marked for deletion and let the rest of the 
scripts proceed as normal, cleaning that up if they don't undelete it.


It looks like I would want to mark the HistoryDatasetAssociations as 
deleted?  Is that correct?  Would I need to do anything else to 
simulate the user deleting the dataset?


Thanks for the help,
Lance

Thanks,
--nate



--
Lance Parsons - Scientific Programmer
134 Carl C. Icahn Laboratory
Lewis-Sigler Institute for Integrative Genomics
Princeton University



--
Lance Parsons - Scientific Programmer
134 Carl C. Icahn Laboratory
Lewis-Sigler Institute for Integrative Genomics
Princeton University

Re: [galaxy-dev] Improving Administrative Data Clean Up (pgcleanup.py vs cleanup_datasets.py)

2013-04-17 Thread Lance Parsons
I have implemented a basic script to perform administrative dataset 
deletion and email notification.  Right now it's limited to the 
history_dataset_association update_time and an optional tool_id string.


I have pushed it to a galaxy-central fork and will issue a pull request 
once I've tested it a bit more. I'm open to comments or suggestions; it 
could certainly be extended.  Hopefully people find this useful.


See 
https://bitbucket.org/lance_parsons/galaxy-central-dataset-cleanup/commits/aa938fc707ae314ceeecd1e13631663f7596b0f4


admin_cleanup_datasets.py Documentation
---------------------------------------

Mark datasets as deleted that are older than a specified cutoff
and (optionally) have a tool_id that matches a specified search
string.

This script is useful for administrators to clean up after users who
leave many old datasets around.  It was modeled after the
cleanup_datasets.py script originally distributed with Galaxy.

Basic Usage:
admin_cleanup_datasets.py universe_wsgi.ini -d 60 \
--template=email_template.txt

Required Arguments:
config_file - the Galaxy configuration file (universe_wsgi.ini)

Optional Arguments:
-d --days - number of days old the dataset must be (default: 60)
--tool_id - string to search for in dataset tool_id
--template - Mako template file to use for email notification
-i --info_only - Print results, but don't email or delete anything
-e --email_only - Email notifications, but don't delete anything
Useful for notifying users of pending deletion

--smtp - Specify SMTP server
If not specified, use the SMTP settings from the config file
--fromaddr - Specify the from address
If not specified, use error_email_to from the config file

Email Template Variables:
   cutoff - the cutoff in days
   email - the user's email address
   datasets - a list of tuples containing 'dataset' and 'history' names


xiandongm...@lbl.gov wrote:

Hi Lance,

I have the same questions as yours about cleaning up data in Galaxy.

We maintain a local instance of the Galaxy system.  I am thinking of a way 
to delete datasets that have not been accessed or updated for a certain 
period of time, regardless of whether users have deleted them, and to send 
an email to users before deleting their datasets.  It looks like the current 
cleanup scripts only purge histories/datasets that are already deleted.  I 
tried to find a way to use API functions to delete old files that are not 
deleted, but was not successful.  I also found that update_time in the 
dataset table is not an access time: a reference file may be used frequently 
even though its update_time is quite old, which would be a problem if files 
were deleted by update_time.  I think files should be deleted by access 
time.  Is it enough to address this by checking the update_time of the 
HistoryDatasetAssociation table?

Thanks for your help,

X. Meng
Joint Genome Institute, LBNL

Lance Parsons wrote:
Nate Coraor wrote:

On Mar 22, 2013, at 11:56 AM, Lance Parsons wrote:


I have been running a Galaxy server for our sequencing researchers for a
while now and it's become increasingly successful. The biggest resource
challenge for us has been, and continues to be, disk space.  As such, I'd
like to implement some additional cleanup scripts. I thought I'd run a few
questions by this list before I got too far into things.

In general, I'm wondering how to implement updates/additions to the
cleanup system that will be in line with the direction that the Galaxy
project is headed.  The pgcleanup.py script is the newest piece of code
in this area (and even adds cleanup of exported histories, which are
absent from the older cleanup scripts). Also, the pgcleanup.py script
uses a cleanup_event table that I don't believe is used by the older
cleanup_datasets.py script. However, the new pgcleanup.py script only
works for Postgres, and worse, only for version 9.1+.  I run my system on
RedHat (CentOS) and thus we use version 8.4 of Postgres.  Are there plans
to support other databases or older versions of Postgres?

Hi Lance,

pgcleanup.py makes extensive use of Writable CTEs, so there is not really
a way to port it to older versions.  For 8.4 or MySQL, you can still use
the older cleanup_datasets.py.

After looking at it a bit more, I see what you mean.  Are there plans to
implement any additional cleanup scripts for non-Postgres-9.1 users?
Just curious so I don't reinvent the wheel; I'd be happy to help with
existing efforts.

I'd like to implement a script to delete (set the deleted flag) for
certain datasets (e.g. raw data imported from our archive, for old,
inactive users, etc.).  I'm wondering if it would make sense to try and
extend pgcleanup.py or cleanup_datasets.py.  Or perhaps it would be best
to just implement a separate script, though that seems like I'd have to
re-implement a lot of boilerplate code for configuration reading,
connections, logging, etc.  Any tips on generally acceptable (supported)
procedures for marking a dataset as deleted?

Re: [galaxy-dev] Improving Administrative Data Clean Up (pgcleanup.py vs cleanup_datasets.py)

2013-03-28 Thread Nate Coraor
On Mar 22, 2013, at 4:57 PM, Lance Parsons wrote:

 Nate Coraor wrote:
 
 On Mar 22, 2013, at 11:56 AM, Lance Parsons wrote:
 
 I have been running a Galaxy server for our sequencing researchers for a 
 while now and it's become increasingly successful. The biggest resource 
 challenge for us has been, and continues to be, disk space.  As such, I'd 
 like to implement some additional cleanup scripts. I thought I'd run a few 
 questions by this list before I got too far into things.
 
 In general, I'm wondering how to implement updates/additions to the cleanup 
 system that will be in line with the direction that the Galaxy project is 
 headed.  The pgcleanup.py script is the newest piece of code in this area 
 (and even adds cleanup of exported histories, which are absent from the 
 older cleanup scripts). Also, the pgcleanup.py script uses a 
 cleanup_event table that I don't believe is used by the older 
 cleanup_datasets.py script. However, the new pgcleanup.py script only works 
 for Postgres, and worse, only for version 9.1+.  I run my system on RedHat 
 (CentOS) and thus we use version 8.4 of Postgres.  Are there plans to 
 support other databases or older versions of Postgres?
 
 Hi Lance,
 
 pgcleanup.py makes extensive use of Writable CTEs, so there is not really a 
 way to port it to older versions.  For 8.4 or MySQL, you can still use the 
 older cleanup_datasets.py.
 After looking at it a bit more, I see what you mean.  Are there plans to 
 implement any additional cleanup scripts for non-Postgres-9.1 users?  Just 
 curious so I don't reinvent the wheel; I'd be happy to help with existing 
 efforts.

No, there aren't any plans as long as the alternative (cleanup_datasets.py) 
still works for other versions.

 I'd like to implement a script to delete (set the deleted flag) for certain 
 datasets (e.g. raw data imported from our archive, for old, inactive users, 
 etc.).  I'm wondering if it would make sense to try and extend pgcleanup.py 
 or cleanup_datasets.py.  Or perhaps it would be best to just implement a 
 separate script, though that seems like I'd have to re-implement a lot of 
 boilerplate code for configuration reading, connections, logging, etc.   
 Any tips on generally acceptable (supported) procedures for marking a 
 dataset as deleted?
 
 You could probably reuse a lot of the code from either of the cleanup 
 scripts for this.
 Right.  It seems to make sense to me to focus on cleanup_datasets.py, 
 since that will work for everyone.  I would like to essentially mimic the 
 user deleting a dataset.  I'd then email them to let them know that some old 
 data had been marked for deletion and let the rest of the scripts proceed as 
 normal, cleaning that up if they don't undelete it.
 
 It looks like I would want to mark the HistoryDatasetAssociations as deleted? 
  Is that correct?  Would I need to do anything else to simulate the user 
 deleting the dataset?  

That's correct.

--nate
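
For anyone following along, a minimal sketch of what that could look
like against Galaxy's model layer (this assumes an existing SQLAlchemy
session named sa_session and galaxy.model imported as model; the actual
admin script may differ in detail):

    from datetime import datetime, timedelta

    # cutoff mirroring the script's -d/--days option (60 is arbitrary here)
    cutoff = datetime.utcnow() - timedelta(days=60)

    # find live HDAs older than the cutoff
    hdas = sa_session.query(model.HistoryDatasetAssociation) \
        .filter(model.HistoryDatasetAssociation.table.c.update_time < cutoff) \
        .filter(model.HistoryDatasetAssociation.table.c.deleted == False)

    for hda in hdas:
        hda.deleted = True   # the same flag the UI sets on user deletion
        sa_session.add(hda)
    sa_session.flush()       # the regular cleanup scripts take it from here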

 
 Thanks for the help,
 Lance
 Thanks,
 --nate
 
 
 -- 
 Lance Parsons - Scientific Programmer
 134 Carl C. Icahn Laboratory
 Lewis-Sigler Institute for Integrative Genomics
 Princeton University
 




Re: [galaxy-dev] Improving Administrative Data Clean Up (pgcleanup.py vs cleanup_datasets.py)

2013-03-22 Thread Nate Coraor
On Mar 22, 2013, at 11:56 AM, Lance Parsons wrote:

 I have been running a Galaxy server for our sequencing researchers for a 
 while now and it's become increasingly successful. The biggest resource 
 challenge for us has been, and continues to be, disk space.  As such, I'd like 
 to implement some additional cleanup scripts. I thought I'd run a few questions 
 by this list before I got too far into things.
 
 In general, I'm wondering how to implement updates/additions to the cleanup 
 system that will be in line with the direction that the Galaxy project is 
 headed.  The pgcleanup.py script is the newest piece of code in this area 
 (and even adds cleanup of exported histories, which are absent from the older 
 cleanup scripts). Also, the pgcleanup.py script uses a cleanup_event table 
 that I don't believe is used by the older cleanup_datasets.py script. 
 However, the new pgcleanup.py script only works for Postgres, and worse, only 
 for version 9.1+.  I run my system on RedHat (CentOS) and thus we use version 
 8.4 of Postgres.  Are there plans to support other databases or older 
 versions of Postgres?

Hi Lance,

pgcleanup.py makes extensive use of Writable CTEs, so there is not really a way 
to port it to older versions.  For 8.4 or MySQL, you can still use the older 
cleanup_datasets.py.
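
(For context: a writable CTE chains a data-modifying statement into a
follow-up query in a single statement.  A minimal illustration of the
9.1+ syntax, with simplified stand-in table/column names rather than
anything copied from pgcleanup.py:

    WITH marked AS (
        UPDATE dataset
           SET deleted = true
         WHERE update_time < NOW() - INTERVAL '60 days'
     RETURNING id
    )
    SELECT count(*) FROM marked;

On Postgres 8.4 or MySQL, the UPDATE inside WITH is a syntax error,
which is why the script can't simply be ported.)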

 I'd like to implement a script to delete (set the deleted flag) for certain 
 datasets (e.g. raw data imported from our archive, for old, inactive users, 
 etc.).  I'm wondering if it would make sense to try and extend pgcleanup.py 
 or cleanup_datasets.py.  Or perhaps it would be best to just implement a 
 separate script, though that seems like I'd have to re-implement a lot of 
 boilerplate code for configuration reading, connections, logging, etc.   Any 
 tips on generally acceptable (supported) procedures for marking a dataset as 
 deleted?

You could probably reuse a lot of the code from either of the cleanup scripts 
for this.

Thanks,
--nate

 
 Of course, I'll make any of the enhancements available (and would be happy to 
 submit pull requests if there is interest).
 
 -- 
 Lance Parsons - Scientific Programmer
 134 Carl C. Icahn Laboratory
 Lewis-Sigler Institute for Integrative Genomics
 Princeton University
 




Re: [galaxy-dev] Improving Administrative Data Clean Up (pgcleanup.py vs cleanup_datasets.py)

2013-03-22 Thread Lance Parsons

Nate Coraor wrote:

On Mar 22, 2013, at 11:56 AM, Lance Parsons wrote:


I have been running a Galaxy server for our sequencing researchers for a while 
now and it's become increasingly successful. The biggest resource challenge for 
us has been, and continues to be, disk space.  As such, I'd like to implement 
some additional cleanup scripts. I thought I'd run a few questions by this list 
before I got too far into things.

In general, I'm wondering how to implement updates/additions to the cleanup system that 
will be in line with the direction that the Galaxy project is headed.  The pgcleanup.py 
script is the newest piece of code in this area (and even adds cleanup of exported 
histories, which are absent from the older cleanup scripts). Also, the pgcleanup.py 
script uses a cleanup_event table that I don't believe is used by the older 
cleanup_datasets.py script. However, the new pgcleanup.py script only works for Postgres, 
and worse, only for version 9.1+.  I run my system on RedHat (CentOS) and thus we use 
version 8.4 of Postgres.  Are there plans to support other databases or older versions of 
Postgres?


Hi Lance,

pgcleanup.py makes extensive use of Writable CTEs, so there is not really a way 
to port it to older versions.  For 8.4 or MySQL, you can still use the older 
cleanup_datasets.py.
After looking at it a bit more, I see what you mean.  Are there plans to 
implement any additional cleanup scripts for non-Postgres-9.1 users?  
Just curious so I don't reinvent the wheel; I'd be happy to help with 
existing efforts.

I'd like to implement a script to delete (set the deleted flag) for certain 
datasets (e.g. raw data imported from our archive, for old, inactive users, 
etc.).  I'm wondering if it would make sense to try and extend pgcleanup.py or 
cleanup_datasets.py.  Or perhaps it would be best to just implement a separate 
script, though that seems like I'd have to re-implement a lot of boilerplate 
code for configuration reading, connections, logging, etc.   Any tips on 
generally acceptable (supported) procedures for marking a dataset as deleted?


You could probably reuse a lot of the code from either of the cleanup scripts 
for this.
Right.  It seems to make sense to me to focus on cleanup_datasets.py, 
since that will work for everyone.  I would like to essentially mimic 
the user deleting a dataset.  I'd then email them to let them know that 
some old data had been marked for deletion and let the rest of the 
scripts proceed as normal, cleaning that up if they don't undelete it.


It looks like I would want to mark the HistoryDatasetAssociations as 
deleted?  Is that correct?  Would I need to do anything else to simulate 
the user deleting the dataset?


Thanks for the help,
Lance

Thanks,
--nate



--
Lance Parsons - Scientific Programmer
134 Carl C. Icahn Laboratory
Lewis-Sigler Institute for Integrative Genomics
Princeton University

___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/