Optimizing for Size
Many modern C compilers offer the option of optimizing the code you are compiling. That's great. You write your code and the compiler takes care of making it run fast for you. But sometimes you don't need your compiler to produce a program that runs as fast as possible, but to generate the smallest binary size. Compression of some sort, if you apply that term liberally.
Compressing with cc?
Crazy idea: If the compiler is good at making code smaller, can it also do it with arbitrary data? For that we have to make compileable code out of that data first.
Let's write a script to do that!
#!/usr/bin/env python3
import sys
with open(sys.argv[1], 'rb') as f:
data = f.read()
print(f'''
#include <stdio.h>
int
main(void)
{{
static char buf[{len(data)}];
''')
for i, c in enumerate(data):
print(f' buf[{i}] = {c};')
print('''
fwrite(buf, 1, sizeof(buf), stdout);
return 0;
}
''')
It generates code like this:
#include <stdio.h>
int
main(void)
{
static char buf[431888];
buf[0] = 239;
buf[1] = 187;
buf[2] = 191;
buf[3] = 84;
[...]
fwrite(buf, 1, sizeof(buf), stdout);
return 0;
}
Let's try it
Let's try this out on a book I downloaded from Project Gutenberg.
Both clang and gcc offer an option to optimize for size (-Os, -Oz) and I already have them installed on my machine, so that's what I'm going to use here.
gcc
$ gcc --version
gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ time gcc -Os -o a.out.gcc pg2162.c
real 5m59.957s
user 5m57.199s
sys 0m0.939s
clang
$ clang --version
clang version 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
$ time clang -Os -o a.out.clang.-Os pg2162,c
pg2162.c:9:14: warning: implicit conversion from 'int' to 'char' changes value from 239 to -17 [-Wconstant-conversion]
buf[0] = 239;
~ ^~~
pg2162.c:10:14: warning: implicit conversion from 'int' to 'char' changes value from 187 to -69 [-Wconstant-conversion]
buf[1] = 187;
~ ^~~
pg2162.c:11:14: warning: implicit conversion from 'int' to 'char' changes value from 191 to -65 [-Wconstant-conversion]
buf[2] = 191;
~ ^~~
3 warnings generated.
real 4m17.739s
user 4m15.006s
sys 0m1.274s
Interestingly clang emits warnings, even though no options have been provided. I think that's valuable, as it allows to find problems with your code early on. Also, chars are signed. Sometimes.
Conclusion
Misusing a compiler for compression does not seem to be viable. A lot of effort has gone into making compilers generate optimal code, but those optimizations are designed to work in the context of source/machine code, not as a general compression algorithm.
808K a.out.clang.Os # clang -Os
808K a.out.clang.Oz # clang -OZ
908K a.out.gcc # gcc -Os
424K pg2162.txt # The original text file
164K pg2162.txt.gz # gzip compressed (default options)
Pity for us, but not entirely surprising.