Re: [TYPES] Type theory of file formats

Kalani Thielen Mon, 22 Aug 2022 06:15:52 -0700

[ The Types Forum, http://lists.seas.upenn.edu/mailman/listinfo/types-list ]

Hi Jimmy,

I’ve used a similar observation in this PL/compiler that I made for Morgan 
Stanley a few years back:
https://urldefense.com/v3/__https://github.com/morganstanley/hobbes__;!!IBzWLUs!Xn2Ulm5-C0403pcAOa7dzz_fxK-N1RaBDiDdLrQw0Gu9jezqnv25nAd6bchH-X7ZWeQD3y-fuvItutbVnbCfS5s6VVK1$

… although I didn’t require strict field placement but just convertibility as 
determined by structural subtyping.  So e.g. ‘{x:int,y:int,z:int}’ is a subtype 
of ‘{z:long,x:double}’ as witnessed by an inferred function that drops “y”, 
swaps “x” and “z” while widening them into their respective types.  As a 
special case it optimizes for ’trivial convertibility’ (where the supertype is 
a prefix of the subtype) and the identity function is inferred.  There’s a 
flipped rule for sum types, where it’s always safe to convert into a large 
subsuming sum (hence the type 0 as the subtype for all types and 1 as the 
supertype for all types).

In this PL, I defined those rules as a system of type classes (ala Haskell), 
with the additional feature that constraints on qualified types are always 
eliminated at compile-time (since where this is used at Morgan Stanley, runtime 
performance penalties have to be minimized).

As well as being used to decide compatibility between file formats, this method 
is also useful to decide compatibility between independent message passing 
processes.  Previously at Morgan Stanley, we had a lot of software that kept 
“version numbers” and it could be very difficult to figure out if two 
“versions” of a message format could be used together (or on a larger scale, if 
a whole system was even coherent and ready to trade the next day).  Structural 
subtyping makes that problem much simpler, and also gives you a straightforward 
explanation for why e.g. two programs can’t communicate.

Also on the subject of file formats, beyond just “flat headers”, I found 
dependent sum types very useful to make “file references” as e.g. “T@f” is an 
offset in the file “f” (dynamically determined) where a “T” value can be found 
(which, if you mmap a large file, you still need to resolve outside of the file 
contents since different programs will map at different base addresses).  With 
this it’s easy to build regular recursive types and give type assignments to 
e.g. btrees and skip lists, decide subtyping on indexes this way, etc.

I also added probability distributions to structured types, so again requiring 
inference of a subtyping witness between convertible structures with different 
expectations, and used those expectations to encode structures down to their 
entropy limit — so that we can minimally store billions of ticks split out by 
symbol in a skip list and efficiently read them out of mapped files 
concurrently, automatically converting into a few different types for reporting 
tools and ad-hoc queries.

Hope this helps!

> On Aug 20, 2022, at 12:07 AM, Jimmy Koppel <[email protected]> wrote:
> 
> [ The Types Forum, http://lists.seas.upenn.edu/mailman/listinfo/types-list ]
> 
> There’s a beautiful fact, not often discussed, that the theory of subtyping
> explains forwards compatibility between file formats.
> 
> 
> For example here’s a simple file format for images:
> 
> 
> type header_length_v1 = {x : uint32 | x >= 2}
> 
> type width = uint32
> 
> type height = uint32
> 
> type header_v1(n) =  width * height * uint32[n-2]
> 
> type image_data = uint32[]
> 
> 
> type image_file_v1 = {n: header_length_v1} * header_v1(n) * image_data
> 
> 
> 
> This type allows for arbitrary extensions to the header, so that any
> extension to this format which adds more header fields is a subtype. A
> reader for this file format is thus forwards-compatible with any format
> extension which adds more header fields.
> 
> 
> However, even though I wrote it them above as products, the components of a
> binary file format are not products. In particular, they do not allow
> nesting, e.g.: (a*b*c)*d is not compatible with (a*b)*d.  Compatibility
> between two binary formats is determined  by the same data being in “the
> same place.”  Even though, classically, uint32 * uint8 is a subtype of
> uint32, one could not replace  width in the type above with (uint32 * uint8),
> as that would move the location of the height from bytes 8-11 to bytes
> 9-12, breaking existing code.
> 
> 
> For my class, I made up a “flat sequence type” to describe file formats,
> which effectively functions as a product type which must be
> right-associated. I would instead write the header(n) type as width ++
> height ++ uint32[n-2]. I have not worked out all the theory, but I’ve
> noticed fewer students making mistakes about binary compatibly since
> introducing this idea.
> 
> 
> Can anyone point me to related work?
> 
> 
> Yours,
> 
> Jimmy Koppel
> 
> MIT CSAIL

Re: [TYPES] Type theory of file formats

Reply via email to