Re: Triage recovery of damaged Subversion repo

Daniel Shahaf Fri, 11 Nov 2022 06:09:26 -0800

Michael K wrote on Fri, Oct 28, 2022 at 17:25:19 -0500:
> I am working on an important Subversion repository that was hit by a
> targeted ransomware attack. Apparently the backups were deleted securely as
> well, though there is a backup from a few years back that was unaffected in
> different storage. In brief, the ransomware encrypted and overwrote (up to)
> the first 4 KB of data and also added some encrypted data and zero-padding
> to the end of every file. Since Subversion has many small files, the data
> has been slashed up badly and some is gone forever. But files larger than 4
> KB have original data remaining.
> 
> My goal is to build a working repository with as much of the original data
> that is remaining as I can, like a triage operation. I have a backup that
> was not affected, but it does not contain the last few years of data. I
> need to utilize the data that is affected by ransomware encryption.
> 
> Eventually I plan to write a program that will work over all the affected
> revs and revprops files required and output new files. I'm coming at this
> without previous knowledge of the inner workings of Subversion, but I am
> comfortable working in a hex editor and writing programs that process raw
> data. So for now, I have been learning about Subversion from reading the
> documentation and while working hands-on with the raw data of these files
> in a hex editor. I've learned a bit about the "representations" within the
> revs files. That will probably be helpful since those provide units that
> each revs file can be broken down into. I can use that knowledge to try
> keeping full "representations" and discard partial ones.
>


Yes, rev files have quite a bit of internal structure: reps, node-rev
headers, changed-paths, P2L/L2P, final line.  These are generally easy
to parse out of surrounding contexts (revprop files use counted-length
strings, reps have their header and "ENDREP" trailer, L2P-INDEX and
P2L-INDEX know their own length and have ASCII before and after them,
and everything else is ASCII in specific formats).
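
A minimal sketch of carving out whole reps by their header and trailer.
The header forms ("PLAIN", "DELTA", "DELTA <rev> <offset> <length>") are
assumptions to check against subversion/libsvn_fs_fs/structure, and
find_complete_reps() is a hypothetical helper, not part of Subversion.
A compressed payload can contain "ENDREP\n" by chance, so treat the
ranges as candidates and confirm them against the lengths recorded in
the node-rev headers where those survive.

```python
import re

# Assumed header forms: "PLAIN\n", "DELTA\n" (no base), or
# "DELTA <rev> <offset> <length>\n" (delta against another rep).
REP_HEADER = re.compile(rb"(?:PLAIN|DELTA(?: \d+ \d+ \d+)?)\n")
REP_TRAILER = b"ENDREP\n"


def find_complete_reps(data: bytes):
    """Yield (start, end) byte ranges that look like whole reps."""
    pos = 0
    while True:
        header = REP_HEADER.search(data, pos)
        if header is None:
            return
        trailer = data.find(REP_TRAILER, header.end())
        if trailer == -1:
            return  # header without trailer: partial rep, give up here
        end = trailer + len(REP_TRAILER)
        yield header.start(), end
        pos = end
```

A rep whose header fell inside the overwritten 4 KB will not be found
this way; its surviving tail still ends in "ENDREP".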

Similarly, it should be easy to recognize where the appended cryptogram
and padding start, since the part from L2P-INDEX to the last line is
distinctive and self-checksummed.
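
A sketch of locating the original end of a rev file, under the
assumption (to be verified against the structure file) that the footer
reads "<l2p offset> <l2p md5> <p2l offset> <p2l md5>" and is followed by
a single byte holding the footer's length.  find_footer() is
a hypothetical helper name.

```python
import re

# Assumed footer layout: two decimal offsets, each followed by an MD5 in hex.
FOOTER = re.compile(rb"(\d+) ([0-9a-f]{32}) (\d+) ([0-9a-f]{32})")


def find_footer(data: bytes):
    """Return (l2p_offset, p2l_offset, original_length) or None."""
    for m in FOOTER.finditer(data):
        end = m.end()
        if end >= len(data):
            continue
        # Assumed terminal byte: length of the footer that precedes it.
        start = end - data[end]
        if start < m.start():
            continue
        # Re-match from the start the terminal byte implies, so a stray
        # digit at the end of the P2L index is not taken for footer text.
        fm = FOOTER.fullmatch(data, start, end)
        if fm is None:
            continue
        l2p, p2l = int(fm.group(1)), int(fm.group(3))
        if (data[l2p:l2p + 10] == b"L2P-INDEX\n"
                and data[p2l:p2l + 10] == b"P2L-INDEX\n"):
            return l2p, p2l, end + 1
    return None
```

Everything from original_length onwards would then be the appended
cryptogram and padding.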

I don't know by heart what elements will be serialized into the first
4KB of a rev file in logical addressing mode.  (By the way, it's worth
looking up in the implementation what physical order it writes the items
to the file in.  Chances are this wasn't left to chance.)  What you
might find there is:

- File reps.

  A rep is a compressed [see fsfs.conf, no relation to "self-compressed"
  in the sense of having no base] svndelta [see notes/svndiff], whose
  base, if there is one, might or might not be the preceding revision of
  the node [see notes/skip-deltas and fsfs.conf] [note: this means it's
  possible for rN+M of a file to be recoverable even if rN's rep is lost].

  In principle, you can even dive down this rabbit hole of abstractions to
  recover data from the surviving tail ends of partially-overwritten reps.

- Dir reps.  These are like file reps but the content of the file is
  an svn_hash_write2() hash mapping basenames to node-rev id's.  IIRC,
  the hashes are dumped in sorted order and the node-rev id's are also
  fairly predictable, and in any case they are repeated in the node-rev
  headers of the directory entries.  It might even be possible to
  reconstruct an overwritten dir rep from the remainder of the rev file.

- Node-rev headers.  Parts of these are predictable (e.g., the "pred:"
  value), or can be regenerated (e.g., the checksums), or inferred from
  other parts of the rev file (e.g., "type: dir" can easily be guessed
  if you still have the rep itself).

- Changed-paths.  That's just an index/cache, IIRC, of information
  derivable from the remainder of the file.
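
A minimal sketch of reading the hash dump format written by
svn_hash_write2(): "K <len>", key, "V <len>", value, repeated, then
"END".  It applies to the fulltext of a dir rep (so a DELTA rep has to
be undeltified first) and, as far as I can tell, to non-packed revprop
files too.  parse_hash_dump() is a hypothetical helper, and the entry
values in the usage line are placeholders rather than real node-rev
id's.

```python
def parse_hash_dump(data: bytes, pos: int = 0):
    """Parse a K/V hash dump starting at pos.

    Returns (entries, end_pos).  Raises ValueError on malformed or
    truncated input.
    """
    entries = {}
    while True:
        eol = data.index(b"\n", pos)
        line = data[pos:eol]
        pos = eol + 1
        if line == b"END":
            return entries, pos
        if not line.startswith(b"K "):
            raise ValueError("expected 'K <len>' or 'END', got %r" % line)
        klen = int(line[2:])
        key = data[pos:pos + klen]
        pos += klen + 1  # skip the key and its trailing newline

        eol = data.index(b"\n", pos)
        line = data[pos:eol]
        pos = eol + 1
        if not line.startswith(b"V "):
            raise ValueError("expected 'V <len>', got %r" % line)
        vlen = int(line[2:])
        value = data[pos:pos + vlen]
        pos += vlen + 1  # skip the value and its trailing newline
        entries[key] = value


entries, end = parse_hash_dump(b"K 4\niota\nV 8\nfile abc\nEND\n")
```

Because the lengths are counted, a dump whose head was overwritten can
still be resynchronised at the next "K <len>" line that parses cleanly.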

> Currently, I am trying to add a single new empty revision that Subversion
> will accept after testing with the "svnadmin verify" and "svn info"
> commands. I fabricated data for a revprops file on this new revision, I
> adjusted the "current" file to the new revision number, and I'm working on
> the revs file. If I can achieve that, I'll move on to adding a new revision
> that contains some original data.
> 

I assume you mean this:

[[[
echo Hello world > foo
svnadmin create r
svnmucc -U file://$(pwd)/r -mm put foo iota   # 'svn import' would do the trick too
xxd < r/db/revs/1
]]]

Why would you need to /manually/ create a rev file with original data?
You can use 'svn commit' to create rev files (on top of the old, good
backup).  I'd have thought you'd focus on trying to extract data from
the partially-corrupted rev files (e.g., reconstruct the fulltexts of
reps where it's possible to do so).

Anyway, regarding creating rev files:

The rev files you get by default have bells and whistles turned on.  For
instance, they use DELTA and self-DELTA reps even though it's a lot
easier to fabricate a PLAIN rep, and you can use PLAIN anywhere you can
use DELTA.
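
A sketch of fabricating a PLAIN rep, assuming the on-disk form is the
"PLAIN" header line, the fulltext verbatim, and the "ENDREP" trailer.
make_plain_rep() is a hypothetical helper.  It also returns the
fulltext's length and MD5, which the node-rev header's "text:" field
would need; the exact field layout should be taken from the structure
file, not from this sketch.

```python
import hashlib


def make_plain_rep(content: bytes):
    """Return (rep_bytes, content_length, md5_hex) for a PLAIN rep."""
    rep = b"PLAIN\n" + content + b"ENDREP\n"
    return rep, len(content), hashlib.md5(content).hexdigest()


rep, size, digest = make_plain_rep(b"Hello world\n")
```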

For this reason, I'd recommend trying to create a 1.1-era rev file
first.  Pass «--compatible-version=1.1 --fs-type=fsfs» to «svnadmin
create» above.  (Subversion 1.1's FSFS is the oldest FSFS there is; see
`svnadmin info`.)

Word of warning: when you test things, do NOT test with the r0 rev file.
The C code hard-codes the assumption that r0 is empty.

> I've learned about the footer of the revs files as I've come across errors
> when trying those commands. I know how the L2P_OFFSET and P2L_OFFSET work
> and I have remedied the errors when those offsets are incorrect. I also
> discovered some kind of item indexes from logical addressing (I think, not
> sure what they are called) which occur right after both "L2P_OFFSET" and
> "P2L_OFFSET" in the revs files.

Do you mean "L2P-INDEX" and "P2L-INDEX"?

>                                 By looking at many files, I figured out how
> to calculate the binary representation for that based on the rev number
> (strange calculation).

The checksum in the final line is just MD5.

>                        That got me past the error such as - "svn: E160054:
> Index rev / pack file revision numbers do not match" - from the svn info
> command.
> 
> And now I'm trying to get past the "L2P index checksum mismatch" error. I
> don't know yet how the "actual" checksum value is calculated. Thankfully
> Subversion's error message shows both the "expected" and "actual"
> checksums. So I've tried taking an MD5 hash on byte ranges of the L2P-INDEX
> area (and variations), but haven't gotten a match to that "actual" value
> yet.
> 
> If you could provide insight to where these 2 checksums come from, I'd be
> really grateful.

I think you're looking for the modified FNV-1A in structure-indexes
(which I suspect is svn_checksum_fnv1a_32), but anyway, try setting the
checksum's value to all-zeroes: by convention, such a checksum is
considered equal to everything in checksum comparisons.  You might even
be able to use «svnfsfs load-index» for that (after removing the
appended data or adjusting svnfsfs's source).
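
A sketch of the all-zeroes trick applied to the footer, assuming the
footer text is "<l2p offset> <md5> <p2l offset> <md5>" with the
checksums in hex.  zero_footer_checksums() is a hypothetical helper.
The footer keeps its length, so a length byte after it (if there is
one) needs no adjustment.  Whether every verification path honours the
all-zeroes convention is something to test, not something this sketch
guarantees.

```python
import re

# Assumed footer layout: two decimal offsets, each followed by an MD5 in hex.
FOOTER = re.compile(rb"(\d+) ([0-9a-f]{32}) (\d+) ([0-9a-f]{32})")


def zero_footer_checksums(footer: bytes) -> bytes:
    """Return the footer with both checksums replaced by 32 zeroes."""
    m = FOOTER.fullmatch(footer)
    if m is None:
        raise ValueError("unrecognised footer: %r" % footer)
    zeros = b"0" * 32
    return b" ".join([m.group(1), zeros, m.group(3), zeros])
```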

The on-disk format is documented in subversion/libsvn_fs_fs/structure
(grep for "logical").

You can sidestep the entire L2P/P2L fabrication step by using physical
addressing.  The C code as it stands makes use_log_addressing
a per-fs-instance knob rather than a per-rev-file one, but for your
purposes you can patch the C sources to pretend ffd->use_log_addressing
were FALSE for a specific fs instance and revnum range (the revnums
whose rev files you'll be fabricating).  svn_fs_fs__item_offset() seems
a relevant callsite.

> Also, any other general thoughts on this project would be appreciated.

Enable post-commit email notifications with diffs?

Cheers,

Daniel
