Re: [ccache] BSDiff for cache objects
On Mon, Nov 12, 2012 at 8:09 PM, Bogdan Harjoc wrote:
> Initial results from a small .ccache (3.0) dir:

The previous results were bogus (I was diffing .gz compressed objects). I tried the bsdiff approach on a new .ccache folder obtained from compiling 8 Linux kernels (various v3.x versions) configured as allnoconfig. Stats:

- 22MB
- 4030 objects (gzipped)
- 503 cache hits

After applying diffs:

- 8.4MB (1862 files) were selected for diffs
- 1.5MB (17%) of compressed patches resulted
- applying the 1862 diffs took 1.21 sec (so 0.65ms per patch)

I got the 0.65ms figure using a script that runs gunzip -c and bspatch for each patched file in the cache. To find out and subtract the overhead of creating the gunzip and bspatch processes, I ran the loop with just "gunzip; bspatch --help > null" instead of the normal "gunzip; bspatch". So the result should cover just the actual patch application (ccache does the gunzip anyway). The test was done on a 3.4GHz i7-2600.

My conclusions are:

- the performance impact should be negligible (when compressing files offline -- e.g. when make aborts with "no space left on device")
- the savings in size vary (7MB out of 22MB in this case)
- the patches will need to have the "old filename" embedded

If people post results from their cache folders and there is interest, I can work on an implementation. Attached are the updated sources for the test. I normally run them in a "testccache" directory, since they create two more folders with data.
(I posted this mail privately just to Andrew by mistake 3 days ago, reposting it to the ccache list now)

Cheers,
Bogdan

bspatch-all.sh
Description: Bourne shell script

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <dirent.h>
#include <errno.h>
#include <limits.h>
#include <sys/types.h>
#include <sys/stat.h>

const char *BSDIFF = "bsdiff";
//const char *BSDIFF = "~/build/bsdiff-4.3/bsdiff";

struct Func {
    Func *next;
    char *name;
};

struct Obj {
    Obj *next;
    char *name;
    char *expanded;
    int nfuncs;
    Func *funcs;
    Obj *match;
    int score;
};

const int HASH_SIZE = 19031;
Obj *objs[HASH_SIZE] = {};

#define NAME_HASH(name) ((*(int *)(name)) % HASH_SIZE)

Obj *read_obj(const char *strbase, int a, int b, const char *namesuff)
{
    Obj *o = new Obj;
    o->next = 0;
    o->expanded = new char[128];
    o->nfuncs = 0;
    o->funcs = 0;
    o->match = 0;
    o->score = 0;
    o->name = strdup(namesuff);
    int nc = snprintf(o->expanded, 128, "%x/%x/%s", a, b, namesuff);
    if (nc <= 0 || nc >= 128)
        abort();
    char strpath[PATH_MAX];
    sprintf(strpath, "%s/%s", strbase, namesuff);
    FILE *f = fopen(strpath, "rb");
    if (!f) {
        printf("open %s\n", namesuff);
        abort();
    }
    for (;;) {
        static char line[PATH_MAX];
        if (!fgets(line, PATH_MAX, f))
            break;
        for (int l = strlen(line); l && isspace(line[l-1]); line[--l] = 0);
        Func **h = &o->funcs;
        Func *fn = new Func;
        fn->name = strdup(line);
        fn->next = *h;
        *h = fn;
        o->nfuncs++;
    }
    fclose(f);
    return o;
}

#define max(a,b) (((a) > (b)) ? (a) : (b))

void compare(Obj *o1, Obj *o2)
{
    /*
    if (strcmp(o1->name, "978911edd002afb6ecb9611659327e3e-3475537") ||
        strcmp(o2->name, "4794bbcacf3bc42cdf1a44cf89523949-3429991"))
        return;
    */
    if (o1->nfuncs < 4 || o2->nfuncs < 4)
        return;
    int r = 100*o1->nfuncs/o2->nfuncs;
    if (r < 80 || r > 125)
        return;
    if (!strcmp(o1->name, o2->name))
        return;
    Func *f1 = o2->funcs;
    Func *f2 = o1->funcs;
    int m = 0;
    while (f1) {
        int r = 0;
        while (f2 && (r = strcmp(f1->name, f2->name)) < 0) // reversed since we pushed fns at the head
            f2 = f2->next;
        if (f2 && r == 0) // match
            m++;
        f1 = f1->next;
    }
    int score = 1000*m/max(o1->nfuncs, o2->nfuncs);
    if (score > 800 && score > o1->score) {
        o1->score = score;
        o1->match = o2;
    }
}

int mk_file_dirs(char *file)
{
    char *ptr = file;
    for (;;) {
        ptr = strchr(ptr, '/');
        if (!ptr)
            return 0;
        *ptr = 0;
        int ret = mkdir(file, 0750);
        *ptr = '/';
        if (ret < 0 && errno != EEXIST) {
            fprintf(stderr, "%s: errno=%s (%d)\n", file, strerror(errno), errno);
            return -1;
        }
        ptr++;
    }
}

int main()
{
    int nobjs = 0;
    for (int a = 0; a < 0x10; a++)
    for (int b = 0; b < 0x10; b++) {
        char strbase[PATH_MAX];
        sprintf(strbase, "strings/.ccache/%x/%x", a, b);
        DIR *dh = opendir(strbase);
        if (!dh)
            continue;
        for (;;) {
            struct dirent *de = readdir(dh);
            if (!de)
                break;
            if (strchr(de->d_name, '.'))
                continue;
            Obj *o = read_obj(strbase, a, b, de->d_name);
            Obj **h = &objs[NAME_HASH(o->name)];
            o->next = *h;
            *h = o;
            nobjs++;
        }
        closedir(dh);
    }
    long long total_raw = 0;
    long long total_bsd = 0;
    int nbsd = 0;
    for (int h1 = 0; h1 < HASH_SIZE; h1++) {
        for (Obj *o1 = objs[h1]; o1; o1 = o1->next) {
            // Look for a match in all the objects that are after it in our hash.
            // If we get a match, it means we can store o1 as a delta from o2.
            for (int h2 = h1; h2 < HASH_SIZE; h2++) {
                for (Obj *o2 = objs[h2]; o2; o2 = o2->next)
                    compare(o1, o2);
            }
            if (o1->match) {
                char bin1[PATH_MAX];
                char bin2[PATH_MAX];
                sprintf(bin1, "%s/.ccache/%s", getenv("HOME"), o1->expanded);
                sprintf(bin2, "%s/.ccache
Re: [ccache] BSDiff for cache objects
Initial results from a small .ccache (3.0) dir:

- 6476 objects
- 300MB
- probably about 500-1000 compiles/recompiles of around 100 small to large projects

The test was:

1. Find the candidates for compression, based on: objdump -t | grep " g " (defined symbols). If two objects had at least 4 symbols defined, and 85% of them were identical, the files were selected for compression.
2. Run bsdiff on the selected pairs of files, and collect the total raw size and the resulting compressed size.

The results are: 4459 out of 6476 files compressed (6099674 -> 629795 bytes). So roughly 90% compression rate, for a random .ccache folder.

I attached sources for the test (first run "./get-symbols.sh", then "./find-similar"). Will post results for a more favorable scenario (multiple builds of different versions of the same project) if time permits.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <dirent.h>
#include <limits.h>
#include <sys/types.h>
#include <sys/stat.h>

const char *BSDIFF = "~/build/bsdiff-4.3/bsdiff";

struct Func {
    Func *next;
    char *name;
};

struct Obj {
    Obj *next;
    char *name;
    char *path;
    int nfuncs;
    Func *funcs;
    Obj *match;
    int common;
};

const int HASH_SIZE = 19031;
Obj *objs[HASH_SIZE] = {};

#define NAME_HASH(name) ((*(int *)(name)) % HASH_SIZE)

Obj *read_obj(const char *base, int a, int b, const char *namesuff)
{
    Obj *o = new Obj;
    o->next = 0;
    o->name = new char[64];
    o->path = new char[128];
    o->nfuncs = 0;
    o->funcs = 0;
    o->common = 0;
    o->match = 0;
    int nc = snprintf(o->name, 64, "%x%x%s", a, b, namesuff);
    if (nc <= 0 || nc >= 64)
        abort();
    snprintf(o->path, 128, "%s/%s", base, namesuff);
    FILE *f = fopen(o->path, "rb");
    if (!f)
        abort();
    for (;;) {
        static char line[PATH_MAX];
        if (!fgets(line, PATH_MAX, f))
            break;
        for (int l = strlen(line); l && isspace(line[l-1]); line[--l] = 0);
        Func **h = &o->funcs;
        Func *fn = new Func;
        fn->name = strdup(line);
        fn->next = *h;
        *h = fn;
        o->nfuncs++;
    }
    fclose(f);
    return o;
}

void compare(Obj *o1, Obj *o2)
{
    /*
    if (strcmp(o1->name, "978911edd002afb6ecb9611659327e3e-3475537") ||
        strcmp(o2->name, "4794bbcacf3bc42cdf1a44cf89523949-3429991"))
        return;
    */
    if (o1->nfuncs < 4 || o2->nfuncs < 4)
        return;
    int r = 100*o1->nfuncs/o2->nfuncs;
    if (r < 80 || r > 125)
        return;
    if (!strcmp(o1->name, o2->name))
        return;
    Func *f1 = o2->funcs;
    Func *f2 = o1->funcs;
    int m = 0;
    while (f1) {
        int r = 0;
        while (f2 && (r = strcmp(f1->name, f2->name)) < 0) // reversed since we pushed fns at the head
            f2 = f2->next;
        if (f2 && r == 0) // match
            m++;
        f1 = f1->next;
    }
    if (m > o1->common) {
        o1->common = m;
        o1->match = o2;
    }
}

int main()
{
    int nobjs = 0;
    for (int a = 0; a < 0x10; a++)
    for (int b = 0; b < 0x10; b++) {
        char base[] = "strings/.ccache/1/2";
        sprintf(base, "strings/.ccache/%x/%x", a, b);
        DIR *dh = opendir(base);
        if (!dh)
            continue;
        for (;;) {
            struct dirent *de = readdir(dh);
            if (!de)
                break;
            if (strchr(de->d_name, '.'))
                continue;
            Obj *o = read_obj(base, a, b, de->d_name);
            Obj **h = &objs[NAME_HASH(o->name)];
            o->next = *h;
            *h = o;
            nobjs++;
        }
        closedir(dh);
    }
    int total_raw = 0;
    int total_bsd = 0;
    int nbsd = 0;
    for (int h1 = 0; h1 < HASH_SIZE; h1++) {
        for (Obj *o1 = objs[h1]; o1; o1 = o1->next) {
            // Look for a match in all the objects that are after it in our hash.
            // If we get a match, it means we can store o1 as a delta from o2.
            for (int h2 = h1; h2 < HASH_SIZE; h2++) {
                for (Obj *o2 = objs[h2]; o2; o2 = o2->next)
                    compare(o1, o2);
            }
            if (o1->nfuncs && 100*o1->common/o1->nfuncs >= 85) {
                char cmd[PATH_MAX];
                snprintf(cmd, PATH_MAX, "%s %s %s tmp.bsd", BSDIFF, o1->path, o1->match->path);
                if (system(cmd)) {
                    printf("%s\n", cmd);
                    abort();
                }
                struct stat st;
                if (stat(o1->path, &st)) {
                    printf("%s\n", o1->path);
                    abort();
                }
                total_raw += st.st_size;
                if (stat("tmp.bsd", &st)) {
                    printf("%s (tmp.bsd)\n", o1->path);
                    abort();
                }
                total_bsd += st.st_size;
                nbsd++;
            }
        }
    }
    remove("tmp.bsd");
    printf("\n%d files compressed (%d -> %d bytes)\n", nbsd, total_raw, total_bsd);
    return 0;
}

get-symbols.sh
Description: Bourne shell script

___ ccache mailing list ccache@lists.samba.org https://lists.samba.org/mailman/listinfo/ccache
Re: [ccache] BSDiff for cache objects
On 12/11/12 14:08, Bogdan Harjoc wrote:
> No but there is room for improvement. This could be optional, like a
> CCACHE_COMPRESS that saves 99% instead of 40% when I routinely recompile
> 20 kernel branches, for example (v2.6.x, 3.0.x, 3.4.x, -git, -next,
> -ubuntu, etc).

I realise that the more diverged the branches are, the fewer exact cache hits you will get, but there should still be a lot of overlap here.

However, if you're building the branches in different directories and using absolute paths then you might be missing out on potential cache sharing (without any binary differences). Are you familiar with CCACHE_BASEDIR?

Andrew
Re: [ccache] BSDiff for cache objects
On Mon, Nov 12, 2012 at 3:39 PM, Andrew Stubbs wrote:
> On 12/11/12 11:49, Bogdan Harjoc wrote:
>> Alternatively, a "compact" operation could be run periodically, that
>> compresses the cache using the same approach.
>
> Is cache size/capacity a very big issue for you?

No, but there is room for improvement. This could be optional, like a CCACHE_COMPRESS that saves 99% instead of 40% when I routinely recompile 20 kernel branches, for example (v2.6.x, 3.0.x, 3.4.x, -git, -next, -ubuntu, etc).

> Or how about a larger cache limit? If you don't have much disk space then
> combining that with CCACHE_HARDLINK might provide a useful saving?
> (Although compression must be disabled and your cache must be on the same
> file-system as your build.)
>
> My opinion is that ccache is a space-speed tradeoff that moves the savings
> balance toward speed-savings and away from space-savings. For the most
> part, users are fine with that, indeed want that, and modifications that
> move that balance back toward space-saving aren't interesting.

Adjusting cache limits to available space works, of course. This is the kind of reply I was asking for -- what size/speed constraints ccache users typically face.

> This idea is nice, but seems like it will reduce performance on both
> cache-hit and cache-miss, like regular compression, but even more so,
> especially on cache-miss.

The offline approach won't affect cache-miss at all. In cache-hit cases, it will add the cost of applying the patch.

> That said, if we can have our cake and eat it then bring it on. (It's a
> shame the hard-link feature isn't completely safe, or it would do just
> that.)

I'll run some tests on my .ccache dir and post results once I have them.

Cheers,
Bogdan
Re: [ccache] BSDiff for cache objects
On 12/11/12 11:49, Bogdan Harjoc wrote:
> Alternatively, a "compact" operation could be run periodically, that
> compresses the cache using the same approach.

Is cache size/capacity a very big issue for you?

Have you tried the CCACHE_COMPRESS feature? That simply gzips the binaries in the cache. You can control the compression level with CCACHE_COMPRESSLEVEL.

Or how about a larger cache limit? If you don't have much disk space then combining that with CCACHE_HARDLINK might provide a useful saving? (Although compression must be disabled and your cache must be on the same file-system as your build.)

> My question is whether ccache's real-world use would benefit from a
> feature like this. I can put together a test that looks through people's
> .ccache and reports how many "good bsdiff candidates" there are, and what
> the savings would be. Any opinions?

My opinion is that ccache is a space-speed tradeoff that moves the savings balance toward speed-savings and away from space-savings. For the most part, users are fine with that, indeed want that, and modifications that move that balance back toward space-saving aren't interesting.

This idea is nice, but seems like it will reduce performance on both cache-hit and cache-miss, like regular compression, but even more so, especially on cache-miss.

That said, if we can have our cake and eat it then bring it on. (It's a shame the hard-link feature isn't completely safe, or it would do just that.)

Andrew
Re: [ccache] BSDiff for cache objects
On Mon, Nov 12, 2012 at 2:30 PM, Jürgen Buchmüller wrote:
> On Monday, 12.11.2012 at 13:49 +0200, Bogdan Harjoc wrote:
>> Basically, before writing a new object file, ccache could find a similar
>> object in the cache (based on object-code or source-code hashes for
>> example)
>
> The main goal of most hashes is to give very distinct results even for
> small changes in the input data, which is why there is not really an
> algorithm to compare two files' similarity based on hashes.

I should have been more specific. I meant block hashes, like rsync and bsdiff use: http://www.samba.org/~tridge/phd_thesis.pdf

> The savings in size are probably less important than the expected
> performance loss for building deltas of source and/or object files.

My concern as well. But an offline "ccache-compact" that runs every 24h or so, possibly only creating the block hashes once for every new file, should be pretty fast. And applying a bspatch requires a bunzip2 plus going through a list of INSERT/ADD instructions, so it can probably be approximated to just "bunzip2". There is also xdelta, which is faster.
Re: [ccache] BSDiff for cache objects
On Monday, 12.11.2012 at 13:49 +0200, Bogdan Harjoc wrote:
> Basically, before writing a new object file, ccache could find a similar
> object in the cache (based on object-code or source-code hashes for
> example)

The main goal of most hashes is to give very distinct results even for small changes in the input data, which is why there is not really an algorithm to compare two files' similarity based on hashes. Similarity of two files would have to be calculated based on something that currently isn't available, AFAICT.

The savings in size are probably less important than the expected performance loss for building deltas of source and/or object files.

Juergen