[ccache] BSDiff for cache objects
I just did a quick search and couldn't find any discussion of caching compiled objects as binary diffs against other existing objects. Basically, before writing a new object file, ccache could find a similar object in the cache (based on object-code or source-code hashes, for example) and store the delta (using bsdiff, xdelta, ...) instead of the complete file. For most minor source code changes, the savings should be worth the extra effort.

Alternatively, a compact operation could be run periodically that compresses the cache using the same approach.

My question is whether ccache's real-world use would benefit from a feature like this. I can put together a test that looks through people's .ccache and reports how many good bsdiff candidates there are, and what the savings would be. Any opinions?

___
ccache mailing list
ccache@lists.samba.org
https://lists.samba.org/mailman/listinfo/ccache
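[Editorial sketch: the store/restore cycle the proposal implies can be tried with the stock bsdiff/bspatch command-line tools. The two "object files" below are synthesized placeholders, not real cache entries, and the script skips gracefully if bsdiff isn't installed.]

```shell
#!/bin/sh
# Sketch of the proposed store/restore cycle using bsdiff/bspatch.
# base.o stands for an existing similar cache object, new.o for the
# object about to be cached; both are synthesized here for illustration.
head -c 65536 /dev/zero | tr '\0' 'A' > base.o
{ head -c 60000 /dev/zero | tr '\0' 'A'; printf 'CHANGED'
  head -c 5000 /dev/zero | tr '\0' 'A'; } > new.o

if command -v bsdiff >/dev/null 2>&1 && command -v bspatch >/dev/null 2>&1; then
    # Store: keep a small (bzip2-compressed) patch instead of the file.
    bsdiff base.o new.o new.o.bsdiff
    # Restore on a cache hit: one bspatch, which is mostly a bunzip2 pass.
    bspatch base.o restored.o new.o.bsdiff
    cmp -s new.o restored.o && echo "roundtrip OK"
else
    echo "bsdiff not installed; skipping"
fi
```

For near-identical objects the patch is typically orders of magnitude smaller than the object itself, which is the whole premise of the proposal.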
Re: [ccache] BSDiff for cache objects
On Monday, 12.11.2012 at 13:49 +0200, Bogdan Harjoc wrote:
> Basically, before writing a new object file, ccache could find a
> similar object in the cache (based on object-code or source-code
> hashes for example)

The main goal of most hashes is to give very distinct results even for small changes in the input data, which is why there is no real algorithm for comparing two files' similarity based on their hashes. Similarity of two files would have to be calculated from something that currently isn't available - AFAICT.

The savings in size are probably less important than the expected performance loss from building deltas of source and/or object files.

Juergen
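[Editorial note: the avalanche behaviour Juergen describes is easy to see with any cryptographic digest; a one-character change yields a completely unrelated hash, so whole-file hashes can only answer "identical or not", never "how similar":]

```shell
#!/bin/sh
# A one-character change in the input yields a completely unrelated
# MD5 digest, so whole-file hashes cannot be used to rank similarity.
printf 'int x = 1;\n' > a.c
printf 'int x = 2;\n' > b.c
md5sum a.c | cut -d' ' -f1 > a.md5
md5sum b.c | cut -d' ' -f1 > b.md5
cat a.md5 b.md5   # two 32-hex-digit digests with nothing in common
```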
Re: [ccache] BSDiff for cache objects
On Mon, Nov 12, 2012 at 2:30 PM, Jürgen Buchmüller pullm...@t-online.de wrote:
> Am Montag, den 12.11.2012, 13:49 +0200 schrieb Bogdan Harjoc:
> > Basically, before writing a new object file, ccache could find a
> > similar object in the cache (based on object-code or source-code
> > hashes for example)
>
> The main goal of most hashes is to give very distinct results even for
> small changes in the input data, which is why there is not really an
> algorithm to compare two files' similarity based on hashes.

I should have been more specific: I meant block hashes, as used by rsync and bsdiff: http://www.samba.org/~tridge/phd_thesis.pdf

> The savings in size are probably less important than the expectable
> performance loss for building deltas of source and/or object files.

My concern as well. But an offline ccache-compact that runs every 24h or so, possibly creating the 100 hashes only once for each new file, should be pretty fast. And applying a bspatch requires a bunzip2 plus a pass through a list of INSERT/ADD instructions, so its cost can be approximated as just the bunzip2. There is also xdelta, which is faster.
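[Editorial sketch: a crude fixed-block variant of block hashing. Real rsync uses a rolling weak checksum plus a strong hash (per the thesis linked above) so matches survive insertions; this fixed-grid version only catches aligned changes, but it shows how block hashes expose similarity where whole-file hashes cannot. File names are illustrative.]

```shell
#!/bin/sh
# Estimate similarity of two files by hashing fixed 1 KiB blocks and
# counting how many block hashes the files share.
block_hashes() {
    split -b 1024 "$1" "$1.part."
    md5sum "$1".part.* | cut -d' ' -f1 | sort
    rm -f "$1".part.*
}

# f1: 8 KiB of 'A'; f2: first half identical, second half changed.
head -c 8192 /dev/zero | tr '\0' 'A' > f1
{ head -c 4096 /dev/zero | tr '\0' 'A'
  head -c 4096 /dev/zero | tr '\0' 'B'; } > f2

block_hashes f1 > f1.hashes
block_hashes f2 > f2.hashes
common=$(comm -12 f1.hashes f2.hashes | wc -l)
echo "shared blocks: $common of 8"
```

Here the shared-block count directly reflects the unchanged half of the file, which is exactly the signal a delta-candidate search would need.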
Re: [ccache] BSDiff for cache objects
On Mon, Nov 12, 2012 at 3:39 PM, Andrew Stubbs a...@codesourcery.com wrote:
> On 12/11/12 11:49, Bogdan Harjoc wrote:
> > Alternatively, a compact operation could be run periodically, that
> > compresses the cache using the same approach.
>
> Is cache size/capacity a very big issue for you?

No, but there is room for improvement. This could be optional, like a CCACHE_COMPRESS that saves 99% instead of 40% when I routinely recompile 20 kernel branches, for example (v2.6.x, 3.0.x, 3.4.x, -git, -next, -ubuntu, etc).

> Or how about a larger cache limit? If you don't have much disk space
> then combining that with CCACHE_HARDLINK might provide a useful saving?
> (Although compression must be disabled and your cache must be on the
> same file-system as your build.)
>
> My opinion is that ccache is a space-speed tradeoff that moves the
> balance toward speed savings and away from space savings. For the most
> part, users are fine with that, indeed want that, and modifications
> that move that balance back toward space saving aren't interesting.
> Adjusting cache limits to available space works, of course.

This is the kind of reply I was asking for -- what size/speed constraints do ccache users typically face.

> This idea is nice, but seems like it will reduce performance on both
> cache-hit and cache-miss, like regular compression, but even more so,
> especially on cache-miss.

The offline approach won't affect cache-miss at all. In cache-hit cases, it will add the cost of applying the patch.

> That said, if we can have our cake and eat it too, then bring it on.
> (It's a shame the hard-link feature isn't completely safe, or it would
> do just that.)

I'll run some tests on my .ccache dir and post results once I have them.

Cheers,
Bogdan
Re: [ccache] BSDiff for cache objects
On 12/11/12 14:08, Bogdan Harjoc wrote:
> No but there is room for improvement. This could be optional, like a
> CCACHE_COMPRESS that saves 99% instead of 40% when I routinely
> recompile 20 kernel branches, for example (v2.6.x, 3.0.x, 3.4.x, -git,
> -next, -ubuntu, etc).

I realise that the more diverged the branches are, the fewer exact cache hits you will get, but there should still be a lot of overlap here. However, if you're building the branches in different directories and using absolute paths, then you might be missing out on potential cache sharing (without any binary differences). Are you familiar with CCACHE_BASEDIR?

Andrew
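[Editorial note: CCACHE_BASEDIR makes ccache rewrite absolute paths under the given directory into relative ones before hashing, so identical sources checked out in different directories can share cache entries. A typical setup, with hypothetical paths, is just a config fragment:]

```shell
# Hypothetical layout: kernel branches checked out side by side, e.g.
#   /home/user/src/linux-3.0/  and  /home/user/src/linux-3.4/
# With CCACHE_BASEDIR set, absolute paths below it (source files,
# -I include paths) are rewritten to relative paths before hashing,
# so identical files in sibling trees can hit the same cache entry.
export CCACHE_BASEDIR=/home/user/src
export CCACHE_DIR=/home/user/.ccache   # the default cache location
```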
Re: [ccache] BSDiff for cache objects
Initial results from a small .ccache (3.0) dir:

- 6476 objects
- 300MB
- probably about 500-1000 compiles/recompiles of around 100 small to large projects

The test was:

1. Find the candidates for compression, based on: objdump -t | grep g (defined symbols). If two objects had at least 4 symbols defined, and 85% of them were identical, the files were selected for compression.

2. Run bsdiff on the selected pairs of files, and collect the total raw size and the resulting compressed size.

The results are:

4459 out of 6476 files compressed (6099674 -> 629795 bytes)

So roughly a 90% compression rate, for a random .ccache folder. I attached sources for the test (first run ./get-symbols.sh, then ./find-similar). Will post results for a more favorable scenario (multiple builds of different versions of the same project) if time permits.

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <string.h>
#include <stdlib.h>
#include <limits.h>
#include <ctype.h>

const char *BSDIFF = "~/build/bsdiff-4.3/bsdiff";

struct Func {
    Func *next;
    char *name;
};

struct Obj {
    Obj *next;
    char *name;
    char *path;
    int nfuncs;
    Func *funcs;
    Obj *match;
    int common;
};

const int HASH_SIZE = 19031;
Obj *objs[HASH_SIZE] = {};
#define NAME_HASH(name) ((*(int *)(name)) % HASH_SIZE)

Obj *read_obj(const char *base, int a, int b, const char *namesuff)
{
    Obj *o = new Obj;
    o->next = 0;
    o->name = new char[64];
    o->path = new char[128];
    o->nfuncs = 0;
    o->funcs = 0;
    o->common = 0;
    o->match = 0;
    int nc = snprintf(o->name, 64, "%x%x%s", a, b, namesuff);
    if (nc <= 0 || nc >= 64)
        abort();
    snprintf(o->path, 128, "%s/%s", base, namesuff);
    FILE *f = fopen(o->path, "rb");
    for (;;) {
        static char line[PATH_MAX];
        if (!fgets(line, PATH_MAX, f))
            break;
        for (int l = strlen(line); l && isspace(line[l-1]); line[--l] = 0);
        Func **h = &o->funcs;
        Func *fn = new Func;
        fn->name = strdup(line);
        fn->next = *h;
        *h = fn;
        o->nfuncs++;
    }
    fclose(f);
    return o;
}

void compare(Obj *o1, Obj *o2)
{
    /* if (strcmp(o1->name, "978911edd002afb6ecb9611659327e3e-3475537") ||
           strcmp(o2->name, "4794bbcacf3bc42cdf1a44cf89523949-3429991"))
        return; */
    if (o1->nfuncs < 4 || o2->nfuncs < 4)
        return;
    int r = 100 * o1->nfuncs / o2->nfuncs;
    if (r < 80 || r > 125)
        return;
    if (!strcmp(o1->name, o2->name))
        return;
    Func *f1 = o2->funcs;
    Func *f2 = o1->funcs;
    int m = 0;
    while (f1) {
        int r = 0;
        while (f2 && (r = strcmp(f1->name, f2->name)) > 0) // reversed since we pushed fns at the head
            f2 = f2->next;
        if (f2 && r == 0) // match
            m++;
        f1 = f1->next;
    }
    if (m > o1->common) {
        o1->common = m;
        o1->match = o2;
    }
}

int main()
{
    int nobjs = 0;
    for (int a = 0; a < 0x10; a++)
        for (int b = 0; b < 0x10; b++) {
            char base[] = "strings/.ccache/1/2";
            sprintf(base, "strings/.ccache/%x/%x", a, b);
            DIR *dh = opendir(base);
            if (!dh)
                continue;
            for (;;) {
                struct dirent *de = readdir(dh);
                if (!de)
                    break;
                if (strchr(de->d_name, '.'))
                    continue;
                Obj *o = read_obj(base, a, b, de->d_name);
                Obj **h = &objs[NAME_HASH(o->name)];
                o->next = *h;
                *h = o;
                nobjs++;
            }
            closedir(dh);
        }
    int total_raw = 0;
    int total_bsd = 0;
    int nbsd = 0;
    for (int h1 = 0; h1 < HASH_SIZE; h1++) {
        printf("\r%d%%", h1 * 100 / HASH_SIZE);
        for (Obj *o1 = objs[h1]; o1; o1 = o1->next) {
            // Look for a match in all the objects that are after it in our hash.
            // If we get a match, it means we can store o1 as a delta from o2.
            for (int h2 = h1; h2 < HASH_SIZE; h2++) {
                Obj *o2 = (h2 == h1) ? o1 : objs[h2];
                for (; o2; o2 = o2->next)
                    compare(o1, o2);
            }
            if (o1->nfuncs && 100 * o1->common / o1->nfuncs >= 85) {
                char cmd[PATH_MAX];
                snprintf(cmd, PATH_MAX, "%s %s %s tmp.bsd", BSDIFF, o1->path, o1->match->path);
                if (system(cmd)) {
                    printf("%s\n", cmd);
                    abort();
                }
                struct stat st;
                if (stat(o1->path, &st)) {
                    printf("%s\n", o1->path);
                    abort();
                }
                total_raw += st.st_size;
                if (stat("tmp.bsd", &st)) {
                    printf("%s (tmp.bsd)\n", o1->path);
                    abort();
                }
                total_bsd += st.st_size;
                nbsd++;
            }
        }
    }
    remove("tmp.bsd");
    printf("\n%d files compressed (%d -> %d bytes)\n", nbsd, total_raw, total_bsd);
    return 0;
}

[Attachment: get-symbols.sh - Bourne shell script]
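[Editorial sketch: the attached get-symbols.sh was not quoted in the archive; step 1 ("objdump -t | grep g") presumably amounts to something like the pipeline below. A canned objdump -t listing stands in for a real object file so the text processing itself can be exercised; the symbol names are made up.]

```shell
#!/bin/sh
# Extract defined global symbols from an object, one name per line,
# sorted - the per-object symbol list that find-similar compares.
# A canned objdump -t listing stands in for a real object file here.
cat > listing.txt <<'EOF'
foo.o:     file format elf64-x86-64

SYMBOL TABLE:
0000000000000000 l    df *ABS*  0000000000000000 foo.c
0000000000000000 g     F .text  0000000000000012 foo_init
0000000000000020 g     F .text  0000000000000034 foo_run
0000000000000000         *UND*  0000000000000000 printf
EOF
# Keep global ('g') symbols that are actually defined (not *UND*),
# and emit just the trailing name column.
grep ' g ' listing.txt | grep -v '\*UND\*' | awk '{print $NF}' | sort > symbols.txt
cat symbols.txt
```

Against a real cache, the same pipeline would run as `objdump -t $obj` per cached object instead of the heredoc.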