Dear OMPI Users,

 

I’m now using BLCR-0.8.2 and OpenMPI-1.5rc5. The problem is that it takes a
very long time to checkpoint.

 

BLCR configuration:

./configure --prefix=/opt/blcr --enable-static

OpenMPI configuration:

./configure --prefix=/opt/ompi --with-ft=cr --with-blcr=/opt/blcr \
    --enable-static --enable-ft-thread --enable-mpi-threads

 

Our blades use NFS. $HOME and /opt are shared.

 

In $HOME/.openmpi/mca-params.conf:

crs_base_snapshot_dir=/tmp/

snapc_base_global_snapshot_dir=/home/chenwh

snapc_base_store_in_place=0

 

 

Now I run the NPB CG benchmark (NPROCS=16, CLASS=C) on two nodes (blade02, blade04).

Without a checkpoint, 'Time in seconds' is about 100 s, which is normal.

But when I take a single checkpoint, 'Time in seconds' rises to about 300 s. The
overhead is over 200%! Why? How can I reduce it?
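
For reference, this is roughly how I get the overhead figure. It is only a sketch that plugs in the rounded timings quoted above (100 s and 300 s), so it comes out at about 200%; the variable names are just for illustration.

```shell
# Rounded 'Time in seconds' values from the two runs (approximate figures).
baseline_s=100      # run without a checkpoint
checkpointed_s=300  # run with a single checkpoint

# Overhead relative to the baseline run, as an integer percentage.
overhead_pct=$(( (checkpointed_s - baseline_s) * 100 / baseline_s ))
echo "checkpoint overhead: ${overhead_pct}%"
```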

 

blade02:~> ompi-checkpoint --status 27115

[blade02:27130] [  0.00 /   0.25]                 Requested - ...

[blade02:27130] [  0.00 /   0.25]                   Pending - ...

[blade02:27130] [  0.21 /   0.46]                   Running - ...

[blade02:27130] [221.25 / 221.71]                  Finished - ompi_global_snapshot_27115.ckpt

Snapshot Ref.:   0 ompi_global_snapshot_27115.ckpt

 

As you can see, the checkpoint takes 200+ seconds. By the way, what do the
first and second numbers in [ , ] represent?

 

Regards

 

Whchen
