Dear OMPI Users,
I'm now using BLCR-0.8.2 and OpenMPI-1.5rc5. The problem is that checkpointing takes a very long time.

BLCR configuration:
  ./configure --prefix=/opt/blcr --enable-static

Open MPI configuration:
  ./configure --prefix=/opt/ompi --with-ft=cr --with-blcr=/opt/blcr --enable-static --enable-ft-thread --enable-mpi-threads

Our blades use NFS; $HOME and /opt are shared.

In $HOME/.openmpi/mca-params.conf:
  crs_base_snapshot_dir=/tmp/
  snapc_base_global_snapshot_dir=/home/chenwh
  snapc_base_store_in_place=0

Now I run NPB CG (NPROCS=16, CLASS=C) on two nodes (blade02, blade04). With no checkpoint, 'Time in seconds' is about 100 s, which is normal. But when I take a single checkpoint, 'Time in seconds' rises to 300 s. The overhead ratio is over 200%! Why? How can I reduce it?

  blade02:~> ompi-checkpoint --status 27115
  [blade02:27130] [  0.00 /   0.25] Requested - ...
  [blade02:27130] [  0.00 /   0.25] Pending - ...
  [blade02:27130] [  0.21 /   0.46] Running - ...
  [blade02:27130] [221.25 / 221.71] Finished - ompi_global_snapshot_27115.ckpt
  Snapshot Ref.: 0 ompi_global_snapshot_27115.ckpt

As you can see, the checkpoint takes more than 200 seconds.

By the way, what do the former and latter numbers in [ , ] represent?

Regards,
Whchen
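For reference, the overhead ratio quoted above is the relative increase in NPB's reported 'Time in seconds'. A minimal sketch of that arithmetic, using the approximate 100 s and 300 s figures from this run (the function name is hypothetical, chosen only for illustration):

```python
def overhead_percent(baseline_s: float, with_ckpt_s: float) -> float:
    """Percentage increase in run time relative to the no-checkpoint baseline."""
    return (with_ckpt_s - baseline_s) / baseline_s * 100.0


# Approximate figures from the NPB CG run (CLASS=C, 16 processes, 2 nodes).
baseline = 100.0   # 'Time in seconds' with no checkpoint
with_ckpt = 300.0  # 'Time in seconds' with one checkpoint taken
print(f"overhead: {overhead_percent(baseline, with_ckpt):.0f}%")
```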