Hi Swift-Dev, 

We care deeply about the size of binaries that are generated by the Swift 
compiler and make an effort to optimize and shrink the generated binaries. One 
of the problems that we have today is that swift symbols are mangled into 
extremely long strings. This is especially a problem for libraries, and almost 
half of the size of libswiftCore.dylib (the swift runtime library on x86_64 
OSX) is string tables. On MacOSX you can use the command ’size -m file.dylib’ 
to read the size of the string table.  C++ also suffers from the problem of 
long names, but since we control the Swift ABI we can do better than C++.

Here are two names that I found in our libraries:

 
__TIF14StdlibUnittest13checkSequenceu0_Rxs14CollectionType_s12SequenceTypeWx9Generator7Element_zW_9GeneratorS3__rFTxq_KT_SS9showFrameSb10stackTraceVS_14SourceLocStack4fileSS4lineSu16resiliencyChecksVS_32CollectionMisuseResiliencyChecks9sameValueFTWxS2_S3__WxS2_S3___Sb_T_A6_

 
__TTSg5VSS13CharacterViewS_s14CollectionTypes_GVs17IndexingGeneratorS__GS1_S__s13GeneratorTypes_VS_5IndexS3_s16ForwardIndexTypes_SiSis18_SignedIntegerTypes_SiSis33_BuiltinIntegerLiteralConvertibles_Vs20_DisabledRangeIndex__S_S_s9IndexablesS_s12SequenceTypes_GS1_S__GS1_S__S2_s_Vs9Character_S3_S3_S4_s_SiSiS5_s_SiSiS6_s_S7__S__S10__S10____TFFSS12replaceRangeuRxs14CollectionTypeWx9Generator7Element_zVs9CharacterrFTGVs5RangeVVSS13CharacterView5Index_4withx_T_U_FRS4_T_

One way to solve this problem is to implement a better string table format that 
will include some kind of compression or tree encoding into MachO, ELF and PE. 
The work on MachO, ELF and PE is beyond the scope of the Swift project and I’d 
like to focus on improving things on the Swift side. 

I see two independent tasks that we can do to shorten the length of string 
symbols. First, we can improve the existing mangling. We can do things like use 
non-decimal length specifiers, etc. The second task, which is the subject of 
the rest of this email, is the implementation of a compression layer on top of 
the existing mangling scheme. 

Compressing long symbols
 
The Swift compiler is free to encode Swift names in whatever way it likes, but 
we need to follow a few basic rules. First, the character encoding space needs 
to be [_a-zA-Z0-9]. We can’t use non-ascii characters to encode Swift names 
because this is the character space that ELF supports. Second, names need to be 
unique and independent. We can’t just compress the whole string table or create 
names that are a running counter because we need to be able to inspect each 
name independently. We need to be able to decode the name independently and 
extract the information that’s stored in the name.  We also need to be able to 
decode the name to be able to present something useful in the debugger. This 
means that a simple one directional hashing of the name is not an option.

With these restrictions in mind, I believe that the problem that we need to 
solve is independent compression of names. I suggest that we add a 
post-processing string compression pass. The input to the compress method would 
be a string mangled name, like the two names I showed above, and the output 
would be a shorter string. We would also need a decompression method that would 
return the original string. 

Obviously, gzipping each string independently won’t be effective. I believe 
that we need to implement our own compression algorithm here that will work 
with the restrictions that I mentioned above (being textual) and with the prior 
knowledge we have. 


Example of a basic compression algorithm 

In this section I describe a trivial compression algorithm that can reduce the 
size of the string tables by 30%. This algorithm is not optimal but it is a 
good starting point for our discussion. One example of "prior knowledge” that I 
mentioned above is that we know what are the common sub-strings that are used 
in Swift names.  Some of the most common substrings in Swift mangled names are:

1. "S_S_S_S"
2. "ollectio”  
3. “Type”
4. “Index”
5. “erator”
6. “7Element"
7. “able"

We can use this prior knowledge about names of Swift types to compress our 
mangling!  Consider the following encoding:

A string like this:

__TTSg5VSS13CharacterView

Would be translated into something like:

__TTSg5<reference to word #67>13<reference to word #43>View

Commonly used strings can be encoded as references to global tables. We need to 
have some escape character (that needs to be ascii-printable), and use that 
character to encode the reference. 
One interesting bit of information is that the character ‘Y’ is only used 4 
times in the entire standard library! The letter J, and a few other letters are 
also not used very frequently. We can use these letters as escape characters. 

The basic encoding that I experimented with used the following rules:

1. Use two escape characters that are not frequently used in names (Y and Z). 
These characters are escape character and can’t be used without escaping. ‘Y’ 
will be encoded as ‘YY’, and ‘Z’ would be encoded as ‘YZ’. 

2. The most commonly used sub-strings (calculated as length of substring times 
number of occurrences) would be encoded with a single escape character and a 
single index character. The table of highly repetitive substrings can only 
contain 61 entries (a-zA-Z0-9_, minus two escape characters).

A reference to the very frequent substrings would be encoded as “Yx”, where x 
is the character that’s translated into a numeric index. This is two chars per 
substring. 

3. The less commonly used substrings are encoded as three-character sequence. 
“Zxx”, where xx is the numeric index in the large table (that can hold 63*63 
substrings).

It is obvious how to reverse this compression using the same string table used 
to compress the names. 


With this encoding scheme the name 
“__TwxxV14StdlibUnittest24MinimalForwardCollection” becomes 
“Y5wxxJ0UJ1HJ3_Jyn4J5VJ6SdY9on” and the name “__TMPVs15ContiguousArray” becomes 
“Y5MPJ3r5J5EJ82”.

I benchmarked this basic technique by using 5 dylibs as the training set 
(SwiftCore, Simd, unittests, etc) and compressed the string tables of these 
binaries. The result was a ~31% reduction in size.




On the Swift dylib itself (training set and input to the compressor) the wins 
are about ~25%.

This encoding makes the symbol name a lot less intuitive for humans, but at the 
SIL-level we already print human readable strings above function names as 
comments. I believe that the debugger would just work as-is since the internal 
APIs won’t change. 

One of the disadvantages of using a constant string table is that once we 
rename the APIs in the standard library the string table would be less 
effective in compressing it and code that uses it. 

What’s next?

The small experiment I described above showed that compressing the names in the 
string table has a huge potential for reducing the size of swift binaries. I’d 
like for us (swift-developers) to talk about the implications of this change 
and start working on the two tasks of tightening our existing mangling format 
and on implementing a new compression layer on top. 


Nadav
_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev

Reply via email to