Re: [Beowulf] non-stop computing

2016-10-27 Thread Christopher Samuel
On 28/10/16 00:57, Michael Di Domenico wrote: > i was intrigued by Joe's suggestion of snapshot'ing kvm instances. i > might look into that as an academic exercise. i knew you could > pause/snapshot/resume an instance, but i've never tried to resume a > saved off snapshot, only restart one. if

Re: [Beowulf] non-stop computing

2016-10-27 Thread Luc Vereecken
Hi Michael, Keep us informed if you pull that off... I'm interested in that functionality as well, for similar reasons. For what it's worth, on the torque mailing list I remember that somebody had a script for instantiating and destroying a VM on job start/end. Can't remember who or what,

Re: [Beowulf] non-stop computing

2016-10-27 Thread Guy Coates
BLCR or DMTCP should both be able to checkpoint a single node job (single or multi threaded) straight out of the box; you won't need to recompile any of your binaries. DMTCP does not require any kernel modules, and so you might find that easier going if you are on a more recent kernel than BLCR

Re: [Beowulf] non-stop computing

2016-10-27 Thread Justin Y. Shi
Snapshot restart would only work for you if your application leaves restarting points on the disk. Otherwise restarting the snapshot is the same as restarting the program. Justin On Thu, Oct 27, 2016 at 9:57 AM, Michael Di Domenico wrote: > thanks for the insights.

Re: [Beowulf] non-stop computing

2016-10-27 Thread Michael Di Domenico
thanks for the insights. comedic levity included... :) running the job twice is likely going to be our solution. it's painful when you have multiple people running multiple jobs, in that it wastes resources, but such is life. i was intrigued by Joe's suggestion of snapshot'ing kvm instances.

Re: [Beowulf] non-stop computing

2016-10-26 Thread Prentice Bisbal
On 10/26/2016 10:22 AM, Joe Landman wrote: On 10/26/2016 10:20 AM, Prentice Bisbal wrote: How so? By only having a single seat or node-locked license? Either ... for licensed code this is a non-starter. Which is a shame that we still are talking about node locked/single seat in 2016.

Re: [Beowulf] non-stop computing

2016-10-26 Thread Joe Landman
On 10/26/2016 10:20 AM, Prentice Bisbal wrote: How so? By only having a single seat or node-locked license? Either ... for licensed code this is a non-starter. Which is a shame that we still are talking about node locked/single seat in 2016. -- Joseph Landman, Ph.D Founder and CEO

Re: [Beowulf] non-stop computing

2016-10-26 Thread Prentice Bisbal
How so? By only having a single seat or node-locked license? Prentice Bisbal Lead Software Engineer Princeton Plasma Physics Laboratory http://www.pppl.gov On 10/26/2016 09:52 AM, Joe Landman wrote: Licensing might impede this ... Usually does. On 10/26/2016 09:50 AM, Prentice Bisbal wrote:

Re: [Beowulf] non-stop computing

2016-10-26 Thread John Hearns
On 26 October 2016 14:51, Prentice Bisbal wrote: There is an amazing beauty in this simplicity. Prentice On 10/25/2016 02:46 PM, Gavin W. Burris wrote: > Hi, Michael. > > What if the sam

Re: [Beowulf] non-stop computing

2016-10-26 Thread Joe Landman
Licensing might impede this ... Usually does. On 10/26/2016 09:50 AM, Prentice Bisbal wrote: There is an amazing beauty in this simplicity. Prentice On 10/25/2016 02:46 PM, Gavin W. Burris wrote: Hi, Michael. What if the same job ran on two separate nodes, with IO to local scratch? What

Re: [Beowulf] non-stop computing

2016-10-26 Thread Prentice Bisbal
I would be laughing if this wasn't so true. The sad thing is, the person who took on this convoluted, BS-heavy approach would probably get promoted for managing a "large, complicated project with many moving parts" while the guy who took Gavin's approach would continue to toil away in his

Re: [Beowulf] non-stop computing

2016-10-26 Thread Prentice Bisbal
There is an amazing beauty in this simplicity. Prentice On 10/25/2016 02:46 PM, Gavin W. Burris wrote: Hi, Michael. What if the same job ran on two separate nodes, with IO to local scratch? What are the odds both nodes would fail in that three-week period? No special hardware / software

Re: [Beowulf] non-stop computing

2016-10-26 Thread Justin Y. Shi
John's post is really funny! But I would only endorse Gavin's recommendation for it solves the problem statistically (and correctly). Justin On Wed, Oct 26, 2016 at 12:07 AM, Christopher Samuel wrote: > On 26/10/16 14:45, John Hanks wrote: > > > I'd suggest making NFS

Re: [Beowulf] non-stop computing

2016-10-25 Thread Christopher Samuel
On 26/10/16 14:45, John Hanks wrote: > I'd suggest making NFS mounts hard, so processes can recover from an NFS > server reboot. ...plus set the NFS fsid for each export server side so they come back reproducibly each time... PS: I endorse what John said (now I've finished laughing), I'd

Re: [Beowulf] non-stop computing

2016-10-25 Thread John Hanks
We routinely run jobs that last for months; some are codes that have an endpoint, others are processes that provide some service (SOLR, ElasticSearch, etc.) which have no defined endpoint. Unless you have some seriously flaky hardware or ongoing power/cooling issues there is nothing special

Re: [Beowulf] non-stop computing

2016-10-25 Thread Skylar Thompson
Assuming you can contain a run on a single node, you could use containers and the freezer controller (plus maybe LVM snapshots) to do checkpoint/restart. Skylar On 10/25/2016 11:24 AM, Michael Di Domenico wrote: > here's an interesting thought exercise and a real problem i have to tackle. > > i

Re: [Beowulf] non-stop computing

2016-10-25 Thread Paul McIntosh
Hi Michael, You could try BLCR for checkpointing - I have only had a brief test of it and it checkpointed OpenFOAM ok on one node (though I think a single-threaded run) http://crd.lbl.gov/departments/computer-science/CLaSS/research/BLCR/ So it would be likely to work on magma. There is also

Re: [Beowulf] non-stop computing

2016-10-25 Thread Gavin W. Burris
Hi, Michael. What if the same job ran on two separate nodes, with IO to local scratch? What are the odds both nodes would fail in that three-week period? No special hardware / software required. Simple. Done. Cheers. On Tue 10/25/16 02:24PM EDT, Michael Di Domenico wrote: > here's an
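To put rough numbers on the "what are the odds" question, here is a minimal sketch. It is not from the thread: it assumes node failures are independent and follow an exponential model, and the 365-day MTBF is a made-up illustrative figure. The function names are hypothetical.

```python
import math


def failure_probability(mtbf_days: float, window_days: float) -> float:
    """Chance that one node fails at least once during the window.

    Assumes an exponential failure model: P = 1 - exp(-t / MTBF).
    """
    if mtbf_days <= 0 or window_days < 0:
        raise ValueError("mtbf_days must be > 0 and window_days >= 0")
    return 1.0 - math.exp(-window_days / mtbf_days)


def all_replicas_fail(p_node: float, replicas: int = 2) -> float:
    """Chance that every replica of the job is lost.

    Assumes failures are independent, so the probabilities multiply.
    """
    if not 0.0 <= p_node <= 1.0 or replicas < 1:
        raise ValueError("p_node must be in [0, 1] and replicas >= 1")
    return p_node ** replicas


if __name__ == "__main__":
    # Hypothetical inputs: 365-day per-node MTBF, 21-day (three-week) run.
    p = failure_probability(mtbf_days=365.0, window_days=21.0)
    print(f"one node fails during the run:   {p:.4f}")
    print(f"both nodes fail during the run:  {all_replicas_fail(p):.6f}")
```

The independence assumption is the weak point: a shared power or cooling outage, or a shared NFS server, would take out both copies at once, which is why the suggestion pairs the duplicate run with IO to local scratch.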

Re: [Beowulf] non-stop computing

2016-10-25 Thread Joe Landman
On 10/25/2016 02:24 PM, Michael Di Domenico wrote: here's an interesting thought exercise and a real problem i have to tackle. i have researchers that want to run magma codes for three weeks or so at a time. the process is unfortunately sequential in nature and magma doesn't support check