Thanks, Ian. I got h5check (not part of the default HDF5 installation) and ran it. Each of the troublesome checkpoints does, indeed, contain at least one or two "non-compliant" files. Irritating, but I suppose it answers my question.
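(For reference, a minimal sketch of scripting that check over all the chunk files of one checkpoint iteration. It assumes h5check is on the PATH and exits nonzero for a non-compliant file; the directory layout and file-name pattern are examples only, not taken from this run.)

#!/usr/bin/env python
# Minimal sketch: run h5check over every HDF5 chunk file of one checkpoint
# iteration.  Assumes h5check is on PATH and exits nonzero for a
# non-compliant file; the directory and file pattern are examples only.
import glob
import subprocess
import sys

checkpoint_dir = sys.argv[1] if len(sys.argv) > 1 else "."
pattern = "checkpoint.chkpt.it_124000*.h5"   # adjust to the iteration in question

bad = 0
for path in sorted(glob.glob(checkpoint_dir + "/" + pattern)):
    result = subprocess.run(["h5check", path],
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    if result.returncode != 0:
        bad += 1
        print("NON-COMPLIANT: " + path)
    else:
        print("ok: " + path)

print("%d non-compliant file(s) found" % bad)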
Roland, thanks for the abort_on_io_errors suggestion (though it might not have helped here, given the lack of warnings). I guess I'll be starting from the older checkpoint, then.

Bernard

On 2 February 2016 at 04:30, Ian Hinder <[email protected]> wrote:
>
> On 1 Feb 2016, at 21:39, Bernard Kelly <[email protected]> wrote:
>
> Hi all. I'm having checkpoint/recovery issues with a particular simulation:
>
> An initial short run stopped some time after iteration 32000, leaving me with checkpoints at it 30000 & 32000. I found I couldn't recover from the later of these, but as the earlier one *did* allow recovery, I didn't worry too much about it.
>
> Now the recovered run went until some time after it 124000. I again have two sets of checkpoint data, from it 122000 and 124000. *Neither* of these work. I could imagine the later one being corrupted somehow because of disk space issues, but both?
>
> In each case, the error output in the STDERR consists of multiple instances of the message below.
>
> * Is this likely due to file corruption?
>
> * What's the best way to check CarpetIOHDF5 files for corruption?
>
> Hi Bernard,
>
> There is a tool called h5check (Google for "What's the best way to check HDF5 files for corruption"):
>
> • h5check: A tool to check the validity of an HDF5 file.
>
> The HDF5 Format Checker, h5check, is a validation tool for verifying that an HDF5 file is encoded according to the HDF5 File Format Specification. Its purpose is to ensure data model integrity and long-term compatibility between evolving versions of the HDF5 library.
>
> Note that h5check is designed and implemented without any use of the HDF5 Library.
>
> Given a file, h5check scans through the encoded content, verifying it against the defined library format. If it finds any non-compliance, h5check prints the error and the reason behind the non-compliance; if possible, it continues the scanning. If h5check does not find any non-compliance, it prints an approval statement upon completion.
>
> By default, the file is verified against the latest version of the file format, but the format version can be specified.
>
> I have used this successfully in the past.
>
> * Can I do anything about this particular run, apart from start (again) from the "good" 30000 checkpoint?
>
> If the file is corrupt, then I doubt it. You might be able to add debugging code to work out which dataset is corrupt, and if it is not an important one, you might be able to create a new HDF5 file with a corrected version. But this is a lot of work, and if there is more than one corrupt dataset, it's unlikely to be practical. It's probably much more realistic to just repeat the run. However, if you got corruption twice already, I suspect you will get it again. It's probably a good idea to checkpoint more frequently, as you can probably expect more corruption.
>
> The only legitimate reason for the files being corrupted is if you ran out of disk space during the write (and this is only legitimate because HDF5 does not support journaled writing, which is disappointing in 2016). If that happened, I would expect to see evidence of it in stdout/stderr, which you didn't see. If you don't have abort_on_io_errors set, then Cactus would have happily continued on after the HDF5 disk write failed, but I think it would have crashed if it couldn't write the error message to stdout/stderr, so I don't think you ran out of disk space.
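(On the "work out which dataset is corrupt" point above: one way to attempt this outside Cactus is a small h5py script that tries to read every dataset and copies the readable ones into a new file. This is only a sketch under those assumptions — the file names are examples, it requires that h5py can open the file at all, and there is no guarantee the salvaged file is something CarpetIOHDF5 could actually recover from.)

# Minimal sketch: try to read every dataset in a (possibly corrupt) checkpoint
# file and copy the readable ones, with their attributes, into a new HDF5 file.
# Assumes h5py can open the file at all; whether Cactus/Carpet can recover from
# the salvaged file depends on which datasets were lost.  File names are examples.
import h5py

src_name = "checkpoint.chkpt.it_124000.file_0.h5"   # example name only
dst_name = "salvaged.h5"

with h5py.File(src_name, "r") as src, h5py.File(dst_name, "w") as dst:
    def visit(name, obj):
        if isinstance(obj, h5py.Dataset):
            try:
                data = obj[()]                        # force a full read
                out = dst.create_dataset(name, data=data)
                for key, val in obj.attrs.items():
                    out.attrs[key] = val
            except Exception as err:
                print("unreadable dataset: %s (%s)" % (name, err))
    src.visititems(visit)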
> If the files are corrupt, it would either be a problem with the filesystem, or a bug in Cactus.
>
> You might want to run some filesystem-checking program to see if this can be reproduced in a test case, or ask the system admins to do so.
>
> --
> Ian Hinder
> http://members.aei.mpg.de/ianhin
