Hi Zakelly!

I was able to rewrite the SST file with the missing block and patch my checkpoint/savepoint but, as you would probably expect, another check failed:
*Caused by: org.rocksdb.RocksDBException: file is too short (268848910 bytes) to be an sstable/....../*

The problem is the bad sst file is in our checkpoints and savepoints. We are likely faced with a rather complicated re-bootstrap of this operator, which I assume should drop this SST file entirely when we drop the bad state.

Darin

On Wed, Feb 4, 2026 at 10:57 AM Zakelly Lan <[email protected]> wrote:

> Hi Darin,
>
> I'm afraid it's very difficult to fix the corruption, the only way is to
> rewrite the whole rocksdb's MANIFEST and remove that file, or rewrite the
> SST file. Either way there will be some data loss. Or if you have enabled
> the local recovery, you may find a local copy of that checkpoint file,
> which can be used to replace the corresponding file on DFS. Or perhaps your
> corrupted file itself comes from the local copy, then disabling local
> recovery may help.
>
> It is rare, and I guess it is caused by some DFS failure or disk
> corruption. You can keep an eye on that.
>
> Best,
> Zakelly
>
> On Wed, Feb 4, 2026 at 12:03 PM Darin Amos via user <[email protected]>
> wrote:
>
>> Hi!
>>
>> I have a problem where my incremental checkpoint has a corrupt SST file
>> that was created weeks ago, meaning going back in time to replay the data
>> to fix the corruption is not possible, and re-bootstrapping the job is
>> extremely difficult.
>>
>> Is there a way to patch the corrupt SST file to fix my job? In this
>> particular case some data loss is acceptable in favour of system health.
>>
>> Thanks!
>>
>> Darin
>>
>> % $(brew --prefix rocksdb)/bin/rocksdb_sst_dump \
>>     --file=./checkpoint_verification/sst_files/06240ecd-9154-409b-8a32-3a0ebd8e64de.sst \
>>     --command=verify --verify_checksum
>> options.env is 0x600003f638e0
>> Process ./checkpoint_verification/sst_files/06240ecd-9154-409b-8a32-3a0ebd8e64de.sst
>> Sst file format: block-based
>> ./checkpoint_verification/sst_files/06240ecd-9154-409b-8a32-3a0ebd8e64de.sst
>> is corrupted: Corruption: block checksum mismatch: stored = 3954219857,
>> computed = 4054404265, type = 1 in
>> ./checkpoint_verification/sst_files/06240ecd-9154-409b-8a32-3a0ebd8e64de.sst
>> offset 84885876 size 11204
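
For anyone trying the local-recovery suggestion above: before replacing the file on DFS, it may help to check whether the local copy actually differs from the corrupted one. The following is only a minimal sketch, assuming both copies have already been downloaded to local disk. The function names `sha256_of` and `compare_copies` are hypothetical helpers, not part of Flink or RocksDB. A differing copy is only a candidate and would still need to pass `rocksdb_sst_dump --command=verify --verify_checksum`.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read fixed-size chunks so large SST files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_copies(dfs_copy: Path, local_copy: Path) -> str:
    """Classify a local-recovery copy relative to the copy fetched from DFS.

    Hypothetical helper: it only says whether the bytes differ,
    not whether either file is a valid SST file.
    """
    if dfs_copy.stat().st_size != local_copy.stat().st_size:
        return "different-size"
    if sha256_of(dfs_copy) == sha256_of(local_copy):
        return "identical"
    return "different-content"
```

If the result is "identical", the local copy carries the same corruption and replacing the DFS file with it would not help. That would also be consistent with the corrupted file having come from the local copy in the first place.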
