[swift-dev] Reducing the size of Swift binaries by shortening symbols

Nadav Rotem via swift-dev Fri, 18 Dec 2015 15:43:09 -0800

Hi Swift-Dev, 

We care deeply about the size of binaries that are generated by the Swift 
compiler and make an effort to optimize and shrink the generated binaries. One 
of the problems that we have today is that swift symbols are mangled into 
extremely long strings. This is especially a problem for libraries, and almost 
half of the size of libswiftCore.dylib (the swift runtime library on x86_64 
OSX) is string tables. On MacOSX you can use the command ’size -m file.dylib’ 
to read the size of the string table.  C++ also suffers from the problem of 
long names, but since we control the Swift ABI we can do better than C++.

Here are two names that I found in our libraries:

__TIF14StdlibUnittest13checkSequenceu0_Rxs14CollectionType_s12SequenceTypeWx9Generator7Element_zW_9GeneratorS3__rFTxq_KT_SS9showFrameSb10stackTraceVS_14SourceLocStack4fileSS4lineSu16resiliencyChecksVS_32CollectionMisuseResiliencyChecks9sameValueFTWxS2_S3__WxS2_S3___Sb_T_A6_

__TTSg5VSS13CharacterViewS_s14CollectionTypes_GVs17IndexingGeneratorS__GS1_S__s13GeneratorTypes_VS_5IndexS3_s16ForwardIndexTypes_SiSis18_SignedIntegerTypes_SiSis33_BuiltinIntegerLiteralConvertibles_Vs20_DisabledRangeIndex__S_S_s9IndexablesS_s12SequenceTypes_GS1_S__GS1_S__S2_s_Vs9Character_S3_S3_S4_s_SiSiS5_s_SiSiS6_s_S7__S__S10__S10____TFFSS12replaceRangeuRxs14CollectionTypeWx9Generator7Element_zVs9CharacterrFTGVs5RangeVVSS13CharacterView5Index_4withx_T_U_FRS4_T_

One way to solve this problem is to implement a better string table format that
will include some kind of compression or tree encoding into MachO, ELF and PE.
The work on MachO, ELF and PE is beyond the scope of the Swift project and I’d
like to focus on improving things on the Swift side.

I see two independent tasks that we can do to shorten the length of string
symbols. First, we can improve the existing mangling. We can do things like use
non-decimal length specifiers, etc. The second task, which is the subject of
the rest of this email, is the implementation of a compression layer on top of
the existing mangling scheme.

Compressing long symbols

The Swift compiler is free to encode Swift names in whatever way it likes, but
we need to follow a few basic rules. First, the character encoding space needs
to be [_a-zA-Z0-9]. We can’t use non-ascii characters to encode Swift names
because this is the character space that ELF supports. Second, names need to be
unique and independent. We can’t just compress the whole string table or create
names that are a running counter because we need to be able to inspect each
name independently. We need to be able to decode the name independently and
extract the information that’s stored in the name. We also need to be able to
decode the name to be able to present something useful in the debugger. This
means that a simple one directional hashing of the name is not an option.

With these restrictions in mind, I believe that the problem that we need to
solve is independent compression of names. I suggest that we add a
post-processing string compression pass. The input to the compress method would
be a string mangled name, like the two names I showed above, and the output
would be a shorter string. We would also need a decompression method that would
return the original string.

Obviously, gzipping each string independently won’t be effective. I believe
that we need to implement our own compression algorithm here that will work
with the restrictions that I mentioned above (being textual) and with the prior
knowledge we have.

Example of a basic compression algorithm

In this section I describe a trivial compression algorithm that can reduce the
size of the string tables by 30%. This algorithm is not optimal but it is a
good starting point for our discussion. One example of "prior knowledge” that I
mentioned above is that we know what are the common sub-strings that are used
in Swift names. Some of the most common substrings in Swift mangled names are:

1. "S_S_S_S"
2. "ollectio”
3. “Type”
4. “Index”
5. “erator”
6. “7Element"
7. “able"

We can use this prior knowledge about names of Swift types to compress our
mangling! Consider the following encoding:

A string like this:

__TTSg5VSS13CharacterView

Would be translated into something like:

__TTSg5<reference to word #67>13<reference to word #43>View

Commonly used strings can be encoded as references to global tables. We need to
have some escape character (that needs to be ascii-printable), and use that
character to encode the reference.
One interesting bit of information is that the character ‘Y’ is only used 4
times in the entire standard library! The letter J, and a few other letters are
also not used very frequently. We can use these letters as escape characters.

The basic encoding that I experimented with used the following rules:

1. Use two escape characters that are not frequently used in names (Y and Z).
These characters are escape character and can’t be used without escaping. ‘Y’
will be encoded as ‘YY’, and ‘Z’ would be encoded as ‘YZ’.

2. The most commonly used sub-strings (calculated as length of substring times
number of occurrences) would be encoded with a single escape character and a
single index character. The table of highly repetitive substrings can only
contain 61 entries (a-zA-Z0-9_, minus two escape characters).

A reference to the very frequent substrings would be encoded as “Yx”, where x
is the character that’s translated into a numeric index. This is two chars per
substring.

3. The less commonly used substrings are encoded as three-character sequence.
“Zxx”, where xx is the numeric index in the large table (that can hold 63*63
substrings).

It is obvious how to reverse this compression using the same string table used
to compress the names.

With this encoding scheme the name
“__TwxxV14StdlibUnittest24MinimalForwardCollection” becomes
“Y5wxxJ0UJ1HJ3_Jyn4J5VJ6SdY9on” and the name “__TMPVs15ContiguousArray” becomes
“Y5MPJ3r5J5EJ82”.

I benchmarked this basic technique by using 5 dylibs as the training set
(SwiftCore, Simd, unittests, etc) and compressed the string tables of these
binaries. The result was a ~31% reduction in size.

On the Swift dylib itself (training set and input to the compressor) the wins
are about ~25%.

This encoding makes the symbol name a lot less intuitive for humans, but at the
SIL-level we already print human readable strings above function names as
comments. I believe that the debugger would just work as-is since the internal
APIs won’t change.

One of the disadvantages of using a constant string table is that once we
rename the APIs in the standard library the string table would be less
effective in compressing it and code that uses it.

What’s next?

The small experiment I described above showed that compressing the names in the
string table has a huge potential for reducing the size of swift binaries. I’d
like for us (swift-developers) to talk about the implications of this change
and start working on the two tasks of tightening our existing mangling format
and on implementing a new compression layer on top.

Nadav

_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev

[swift-dev] Reducing the size of Swift binaries by shortening symbols

Reply via email to