On 28/10/16 00:57, Michael Di Domenico wrote:
> i was intrigued by Joe's suggestion of snapshot'ing kvm instances. i
> might look into that as an academic exercise. i knew you could
> pause/snapshot/resume an instance, but i've never tried to resume a
> saved off snapshot, only restart one. if
Hi Michael,
Keep us informed if you pull that off... I'm interested in that
functionality as well, for similar reasons.
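For anyone who wants to experiment with that save/resume cycle, a minimal sketch with libvirt looks like this (the guest name `job-vm` and the save path are hypothetical, and this assumes `virsh` is available on the host):

```shell
# Save the guest's full RAM + device state to disk; this stops the guest
virsh save job-vm /var/lib/libvirt/save/job-vm.sav

# ... node maintenance, reboot, etc. ...

# Resume the guest exactly where it left off
virsh restore /var/lib/libvirt/save/job-vm.sav
```

Note this is whole-VM state, so the application inside never knows it was interrupted.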
For what it's worth, on the torque mailing list I remember that somebody
had a script for instantiating and destroying a VM on job start/end.
Can't remember who or what, though.
BLCR or DMTCP should both be able to checkpoint a single node job (single
or multi threaded) straight out of the box; you won't need to recompile any
of your binaries.
DMTCP does not require any kernel modules, so you might find it easier
going than BLCR if you are on a more recent kernel.
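A minimal DMTCP session might look like this (the binary name `my_solver` is hypothetical; no recompile or kernel module is needed):

```shell
# Start the application under DMTCP (a coordinator is launched automatically)
dmtcp_launch ./my_solver input.dat

# From another shell: write a checkpoint image of all tracked processes
dmtcp_command --checkpoint

# After a crash or reboot, resume from the generated restart script
./dmtcp_restart_script.sh
```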
Snapshot restart would only work for you if your application leaves
restart points on disk. Otherwise restarting the snapshot is the
same as restarting the program.
Justin
On Thu, Oct 27, 2016 at 9:57 AM, Michael Di Domenico wrote:
> thanks for the insights.
thanks for the insights. comedic levity included... :)
running the job twice is likely going to be our solution. it's
painful when you have multiple people running multiple jobs, in that
it wastes resources, but such is life.
i was intrigued by Joe's suggestion of snapshot'ing kvm instances.
On 10/26/2016 10:22 AM, Joe Landman wrote:
On 10/26/2016 10:20 AM, Prentice Bisbal wrote:
How so? By only having a single seat or node-locked license?
Either ... for licensed code this is a non-starter. Which is a shame
that we still are talking about node locked/single seat in 2016.
--
Joseph Landman, Ph.D
Founder and CEO
How so? By only having a single seat or node-locked license?
Prentice Bisbal
Lead Software Engineer
Princeton Plasma Physics Laboratory
http://www.pppl.gov
On 10/26/2016 09:52 AM, Joe Landman wrote:
Licensing might impede this ... Usually does.
On 10/26/2016 09:50 AM, Prentice Bisbal wrote:
[mailto:beowulf-boun...@beowulf.org] On Behalf Of Prentice Bisbal
Sent: 26 October 2016 14:51
To: beowulf@beowulf.org
Subject: Re: [Beowulf] non-stop computing
There is an amazing beauty in this simplicity.
Prentice
On 10/25/2016 02:46 PM, Gavin W. Burris wrote:
> Hi, Michael.
>
> What if the sam
I would be laughing if this weren't so true.
The sad thing is, the person who took on this convoluted, BS-heavy
approach would probably get promoted for managing a "large, complicated
project with many moving parts" while the guy who took Gavin's approach
would continue to toil away in his
John's post is really funny! But I would endorse only Gavin's
recommendation, for it solves the problem statistically (and correctly).
Justin
On Wed, Oct 26, 2016 at 12:07 AM, Christopher Samuel
wrote:
> On 26/10/16 14:45, John Hanks wrote:
>
> > I'd suggest making NFS
On 26/10/16 14:45, John Hanks wrote:
> I'd suggest making NFS mounts hard, so processes can recover from an NFS
> server reboot.
...plus set the NFS fsid for each export server side so they come back
reproducibly each time...
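For reference, a sketch of what that server-side fsid pinning plus a hard client mount might look like (server name, export path, and subnet are all hypothetical):

```shell
# /etc/exports on the NFS server: pin fsid so filehandles come back
# identically after a server reboot
/export/scratch  10.0.0.0/24(rw,sync,no_subtree_check,fsid=1)

# Client /etc/fstab: 'hard' makes processes block and retry across a
# server outage instead of seeing I/O errors
nfsserver:/export/scratch  /scratch  nfs  hard  0 0
```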
PS: I endorse what John said (now I've finished laughing), I'd
We routinely run jobs that last for months; some are codes that have an
endpoint, others are processes that provide some service (SOLR,
ElasticSearch, etc.) which have no defined endpoint. Unless you have
some seriously flaky hardware or ongoing power/cooling issues there is
nothing special
Assuming you can contain a run on a single node, you could use
containers and the freezer controller (plus maybe LVM snapshots) to do
checkpoint/restart.
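A rough sketch of that freeze/snapshot/thaw cycle, assuming a cgroup-v1 freezer controller (the cgroup `job42` and volume `vg0/scratch` are hypothetical):

```shell
# Freeze every task in the job's cgroup so its state stops changing
echo FROZEN > /sys/fs/cgroup/freezer/job42/freezer.state

# Snapshot the scratch volume while the processes are quiescent
lvcreate --snapshot --name job42-snap --size 10G /dev/vg0/scratch

# Thaw the job and let it continue running
echo THAWED > /sys/fs/cgroup/freezer/job42/freezer.state
```

This only captures on-disk state, so the job would still need to restart from whatever it last wrote, unlike a full VM save.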
Skylar
On 10/25/2016 11:24 AM, Michael Di Domenico wrote:
> here's an interesting thought exercise and a real problem i have to tackle.
>
> i
Hi Michael,
You could try BLCR for checkpointing - I have only had a brief test of it
and it checkpointed OpenFOAM OK on one node (though I think a
single-threaded run)
http://crd.lbl.gov/departments/computer-science/CLaSS/research/BLCR/
So it would be likely to work on magma.
There is also
Hi, Michael.
What if the same job ran on two separate nodes, with IO to local scratch? What
are the odds both nodes would fail in that three week period. No special
hardware / software required. Simple. Done.
Cheers.
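The statistics behind that are easy to check: if each node independently fails with probability p over the run, both copies fail with probability p squared. For example, with a generous 5% per-node failure rate:

```shell
# Two independent copies both fail with probability p * p
# (5% per node -> 0.25% for the pair)
awk 'BEGIN { p = 0.05; printf "%.4f\n", p * p }'
```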
On Tue 10/25/16 02:24PM EDT, Michael Di Domenico wrote:
> here's an
On 10/25/2016 02:24 PM, Michael Di Domenico wrote:
here's an interesting thought exercise and a real problem i have to tackle.
i have researchers that want to run magma codes for three weeks or
so at a time. the process is unfortunately sequential in nature and
magma doesn't support check