Package: dmtcp
Version: 1.2.4-1
Severity: normal

[ This is just for the record to avoid hiding the issue in private email.
  Upstream has already received this bug report and fixed half of the
  issue within a few hours -- pretty cool. ]

The attached script demonstrates the bug. It prints each of a few
letters ten times (sleeping 2 seconds after each letter), verifies the
output, and exits. When run with an integer argument larger than 1, it
distributes the letter printing over the requested number of processes
(using pprocess, capped by the actual number of CPUs on the machine).

Snapshot and restart work beautifully in single-process mode, but
fail with pprocess. Below are the logs from dmtcp_coordinator and
dmtcp_checkpoint, followed by the output of dmtcp_restart_script.sh.
Checkpoints were requested manually.



dmtcp_coordinator
=================

michael@meiner ~/debian/dmtcptest % dmtcp_coordinator
dmtcp_coordinator starting...
    Port: 7779
    Checkpoint Interval: disabled (checkpoint manually instead)
    Exit on last client: 0
Type '?' for help.

[15518] NOTE at dmtcp_coordinator.cpp:1020 in onConnect; REASON='worker connected'
     hello_remote.from = 1313a2c6-15537-4f563b29(-1)
[15518] NOTE at dmtcp_coordinator.cpp:1026 in onConnect; REASON='CheckpointInterval Updated'
     oldInterval = 0
     theCheckpointInterval = 0
[15518] NOTE at dmtcp_coordinator.cpp:1020 in onConnect; REASON='worker connected'
     hello_remote.from = 1313a2c6-15537-4f563b29(-1)
[15518] NOTE at dmtcp_coordinator.cpp:1026 in onConnect; REASON='CheckpointInterval Updated'
     oldInterval = 0
     theCheckpointInterval = 0
[15518] NOTE at dmtcp_coordinator.cpp:880 in onData; REASON='Updating process Information after fork()'
     hostname = meiner
     progname = python2.7_(forked)
     msg.from.pid() = 1313a2c6-15544-4f563b29
     client->identity() = 1313a2c6-15537-4f563b29
[15518] NOTE at dmtcp_coordinator.cpp:1020 in onConnect; REASON='worker connected'
     hello_remote.from = 1313a2c6-15537-4f563b29(-1)
[15518] NOTE at dmtcp_coordinator.cpp:1026 in onConnect; REASON='CheckpointInterval Updated'
     oldInterval = 0
     theCheckpointInterval = 0
[15518] NOTE at dmtcp_coordinator.cpp:880 in onData; REASON='Updating process Information after fork()'
     hostname = meiner
     progname = python2.7_(forked)
     msg.from.pid() = 1313a2c6-15546-4f563b29
     client->identity() = 1313a2c6-15537-4f563b29
c
[15518] NOTE at dmtcp_coordinator.cpp:1294 in startCheckpoint; REASON='starting checkpoint, suspending all nodes'
     s.numPeers = 3
[15518] NOTE at dmtcp_coordinator.cpp:1296 in startCheckpoint; REASON='Incremented Generation'
     UniquePid::ComputationId().generation() = 1
[15518] NOTE at dmtcp_coordinator.cpp:630 in onData; REASON='locking all nodes'
[15518] NOTE at dmtcp_coordinator.cpp:665 in onData; REASON='draining all nodes'
[15518] NOTE at dmtcp_coordinator.cpp:671 in onData; REASON='checkpointing all nodes'
[15518] NOTE at dmtcp_coordinator.cpp:681 in onData; REASON='building name service database'
[15518] NOTE at dmtcp_coordinator.cpp:700 in onData; REASON='entertaining queries now'
[15518] NOTE at dmtcp_coordinator.cpp:705 in onData; REASON='refilling all nodes'
[15518] NOTE at dmtcp_coordinator.cpp:734 in onData; REASON='restarting all nodes'
[15518] NOTE at dmtcp_coordinator.cpp:905 in onDisconnect; REASON='client disconnected'
     client.identity() = 1313a2c6-15537-4f563b29
[15518] NOTE at dmtcp_coordinator.cpp:905 in onDisconnect; REASON='client disconnected'
     client.identity() = 1313a2c6-15546-4f563b29
[15518] NOTE at dmtcp_coordinator.cpp:905 in onDisconnect; REASON='client disconnected'
     client.identity() = 1313a2c6-15544-4f563b29
^C[15518] NOTE at dmtcp_coordinator.cpp:522 in handleUserCommand; REASON='killing all connected peers and quitting ...'
DMTCP coordinator exiting... (per request)


dmtcp_checkpoint
================

michael@meiner ~/debian/dmtcptest % dmtcp_checkpoint python pproc_runner.py 2
dmtcp_checkpoint (DMTCP + MTCP) 1.2.4
Copyright (C) 2006-2011  Jason Ansel, Michael Rieker, Kapil Arya, and
                                                       Gene Cooperman
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see COPYING file for details.
(Use flag "-q" to hide this message.)

A
B
A
B
A
B
A
B
A
B
Traceback (most recent call last):
  File "pproc_runner.py", line 25, in <module>
    results = np.hstack(p_results)
  File "/usr/lib/pymodules/python2.7/numpy/core/shape_base.py", line 258, in hstack
    return _nx.concatenate(map(atleast_1d,tup),1)
  File "/usr/lib/pymodules/python2.7/pprocess.py", line 757, in next
    self.store()
  File "/usr/lib/pymodules/python2.7/pprocess.py", line 396, in store
    for channel in self.ready(timeout):
  File "/usr/lib/pymodules/python2.7/pprocess.py", line 304, in ready
    fds = self.poller.poll(timeout)
select.error: (4, 'Interrupted system call')



dmtcp_restart_script.sh
=======================


michael@meiner ~/debian/dmtcptest % ./dmtcp_restart_script.sh
dmtcp_checkpoint (DMTCP + MTCP) 1.2.4
Copyright (C) 2006-2011  Jason Ansel, Michael Rieker, Kapil Arya, and
                                                       Gene Cooperman
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see COPYING file for details.
(Use flag "-q" to hide this message.)

dmtcp_coordinator starting...
    Port: 7779
    Checkpoint Interval: disabled (checkpoint manually instead)
    Exit on last client: 1
Backgrounding...
B
Traceback (most recent call last):
  File "pproc_runner.py", line 25, in <module>
    results = np.hstack(p_results)
  File "/usr/lib/pymodules/python2.7/numpy/core/shape_base.py", line 258, in hstack
    return _nx.concatenate(map(atleast_1d,tup),1)
  File "/usr/lib/pymodules/python2.7/pprocess.py", line 757, in next
    self.store()
  File "/usr/lib/pymodules/python2.7/pprocess.py", line 396, in store
    for channel in self.ready(timeout):
  File "/usr/lib/pymodules/python2.7/pprocess.py", line 304, in ready
A
    fds = self.poller.poll(timeout)
select.error: (4, 'Interrupted system call')
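
[ For illustration only: both tracebacks end in poller.poll() failing
  with EINTR, i.e. the call was interrupted by a signal, presumably one
  delivered around checkpoint or restart. Below is a minimal sketch of
  how a caller could retry such a call. The name poll_retrying is
  hypothetical; this is neither the upstream fix nor actual pprocess
  code, and it assumes that simply repeating the poll is acceptable. ]

```python
import errno
import select


def poll_retrying(poller, timeout=None):
    """Call poller.poll(timeout), retrying while it fails with EINTR.

    Hypothetical helper, not part of pprocess or DMTCP.
    """
    while True:
        try:
            return poller.poll(timeout)
        except (select.error, OSError) as exc:
            # Both select.error (Python 2) and OSError (Python 3)
            # carry the errno value as the first element of args.
            if not exc.args or exc.args[0] != errno.EINTR:
                raise
            # EINTR: a signal interrupted the call, so try again.
```

Note that each retry starts over with the full timeout, so an
interrupted call may wait longer in total than the timeout requested.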


-- System Information:
Debian Release: wheezy/sid
  APT prefers testing
  APT policy: (500, 'testing'), (1, 'experimental')
Architecture: i386 (i686)

Kernel: Linux 3.1.0-1-686-pae (SMP w/2 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages dmtcp depends on:
ii  libc6       2.13-26
ii  libgcc1     1:4.6.1-4
ii  libmtcp1    1.2.4-1
ii  libstdc++6  4.6.1-4

dmtcp recommends no packages.

dmtcp suggests no packages.

-- no debconf information


pproc_runner.py (attached script)
=================================

import pprocess
import numpy as np
import time
import sys

def dummy(printme):
    results = []
    for p in printme:
        print p
        results.append(p)
        time.sleep(2)
    return results

# number of times each letter is printed
nelement = 10
# one block of identical letters per task
blocks = ('A', 'B', 'C', 'D')

# get number of processes from arg
nproc = int(sys.argv[1])
if nproc > 1:
    p_results = pprocess.Map(limit=nproc)
    compute = p_results.manage(
                pprocess.MakeParallel(dummy))
    for block in blocks:
        compute(np.repeat(block, nelement))
    results = np.hstack(p_results)
else:
    results = np.hstack([dummy(np.repeat(block, nelement)) for block in blocks])

# verify that every letter appears nelement times in the results
for block in blocks:
    assert np.sum(results == block) == nelement
