Re: [ccache] BSDiff for cache objects

2012-11-16 Thread Bogdan Harjoc
On Mon, Nov 12, 2012 at 8:09 PM, Bogdan Harjoc  wrote:

> Initial results from a small .ccache (3.0) dir:
>

The previous results were bogus (I was diffing .gz compressed objects).

I tried the bsdiff approach on a new .ccache folder obtained from compiling
8 linux kernels (various v3.x versions) configured as allnoconfig.

Stats:
- 22MB
- 4030 objects (gzipped)
- 503 cache hits

After applying diffs:
- 8.4MB (1862 files) were selected for diffs
- 1.5MB (17%) compressed patches resulted
- applying the 1862 diffs took 1.21 sec (so 0.65ms per patch)

I got the 0.65ms figure using a script that runs gunzip -c and bspatch for
each patched file in the cache. To measure and subtract the overhead of
spawning the gunzip and bspatch processes, I ran the loop with just
"gunzip; bspatch --help > null" instead of the normal "gunzip; bspatch". So
the result should cover just the actual patch application (ccache does the
gunzip anyway).
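The subtraction above boils down to simple arithmetic; a sketch of it as a shell helper (the function name and the zero-overhead demo value are illustrative, not from the attached script):

```shell
# per_patch_ms TOTAL_MS OVERHEAD_MS COUNT
# Subtracts the process-spawn overhead from the measured loop time and
# reports the average per-patch cost in milliseconds.
per_patch_ms() {
    awk -v t="$1" -v o="$2" -v n="$3" 'BEGIN { printf "%.2f\n", (t - o) / n }'
}

# With the figures from this mail (1210 ms for 1862 patches), and a
# hypothetical overhead of zero:
per_patch_ms 1210 0 1862   # prints 0.65
```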

The test was done on a 3.4 GHz i7-2600.

My conclusions are:
- performance impact should be negligible (when compressing files offline
-- e.g. when make aborts with "no space left on device")
- savings in size vary (7MB out of 22MB in this case)
- the patches will need to have the "old filename" embedded
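For the last point, one possible layout (purely hypothetical, not what any ccache version does) is a one-line text header naming the base object in front of the patch payload; a real format would want a binary, length-prefixed header since bsdiff output is binary:

```shell
# Prepend the old object's cache name as a header line, then the payload.
write_patch() {  # write_patch OLDNAME PAYLOADFILE OUTFILE
    printf '%s\n' "$1" > "$3"
    cat "$2" >> "$3"
}
read_patch_base()    { head -n 1 "$1"; }   # the embedded old-object name
read_patch_payload() { tail -n +2 "$1"; }  # the patch bytes that follow

# Demo with a stand-in payload instead of real bsdiff output:
printf 'payload-bytes' > /tmp/fake.bsd
write_patch 'a/b/978911edd002afb6ecb9611659327e3e-3475537' /tmp/fake.bsd /tmp/obj.patch
read_patch_base /tmp/obj.patch
```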

If people post results from their cache folders and there is interest, I
can work on an implementation.

Attached are the updated sources for the test. I normally run them in a
"testccache" directory, since they create two more folders with data.

(I posted this mail privately just to Andrew by mistake 3 days ago,
reposting it to ccache-list now)

Cheers,
Bogdan


bspatch-all.sh
Description: Bourne shell script
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <dirent.h>
#include <errno.h>
#include <limits.h>
#include <sys/types.h>
#include <sys/stat.h>

const char *BSDIFF = "bsdiff";
//const char *BSDIFF = "~/build/bsdiff-4.3/bsdiff";

struct Func {
	Func *next;

	char *name;
};

struct Obj {
	Obj *next;

	char *name;
	char *expanded;
	int nfuncs;
	Func *funcs;
	
	Obj *match;
	int score;
};

const int HASH_SIZE = 19031;
Obj *objs[HASH_SIZE] = {};
#define NAME_HASH(name) ((*(int *)(name)) % HASH_SIZE)

Obj *read_obj(const char *strbase, int a, int b, const char *namesuff)
{
	Obj *o = new Obj;
	o->next = 0;
	o->name = strdup(namesuff);
	o->expanded = new char[128];
	o->nfuncs = 0;
	o->funcs = 0;
	o->match = 0;
	o->score = 0;

	int nc = snprintf(o->expanded, 128, "%x/%x/%s", a, b, namesuff);
	if (nc <= 0 || nc >= 128) abort();
	
	char strpath[PATH_MAX];
	sprintf(strpath, "%s/%s", strbase, namesuff);

	FILE *f = fopen(strpath, "rb");
	if (!f) {
		printf("open %s\n", namesuff);
		abort();
	}

	for (;;) {
		static char line[PATH_MAX];
		if (!fgets(line, PATH_MAX, f))
			break;

		for (int l=strlen(line); l && isspace(line[l-1]); line[--l]=0);


		Func **h = &o->funcs;
		
		Func *fn = new Func;
		fn->name = strdup(line);
		
		fn->next = *h;
		*h = fn;

		o->nfuncs++;
	}

	fclose(f);

	return o;
}

#define max(a,b) (((a) > (b)) ? (a) : (b))

void compare(Obj *o1, Obj *o2)
{
	/*
	if (strcmp(o1->name, "978911edd002afb6ecb9611659327e3e-3475537") ||
		strcmp(o2->name, "4794bbcacf3bc42cdf1a44cf89523949-3429991"))
		return;
	*/

	if (o1->nfuncs < 4 || o2->nfuncs < 4) return;
	int r = 100*o1->nfuncs/o2->nfuncs;
	if (r < 80 || r > 125) return;
	if (!strcmp(o1->name, o2->name)) return;

	Func *f1 = o2->funcs;
	Func *f2 = o1->funcs;
	
	int m=0;
	while (f1) {
		int r = 0;
		while (f2 && (r = strcmp(f1->name, f2->name)) < 0) // reversed since we pushed fns at the head
			f2 = f2->next;

		if (f2 && r==0) // match
			m++;

		f1 = f1->next;
	}
	
	int score = 1000*m/max(o1->nfuncs, o2->nfuncs);
	if (score > 800 && score > o1->score) {
		o1->score = score;
		o1->match = o2;
	}
}

int mk_file_dirs(char *file)
{
	char *ptr = file;

	for (;;) {
		ptr = strchr(ptr, '/');
		if (!ptr) return 0;

		*ptr = 0;
		int ret = mkdir(file, 0750);
		*ptr = '/';

		if (ret < 0 && errno != EEXIST) {
			fprintf(stderr, "%s: errno=%s (%d)\n", file, strerror(errno), errno);
			return -1;
		}

		ptr++;
	}
}

int main()
{
	int nobjs=0;

	for (int a=0; a<0x10; a++)
		for (int b=0; b<0x10; b++) {
			char strbase[PATH_MAX];
			sprintf(strbase, "strings/.ccache/%x/%x", a, b);

			DIR *dh = opendir(strbase);
			if (!dh) continue;

			for (;;) {
				struct dirent *de = readdir(dh);
				if (!de)
					break;
				if (strchr(de->d_name, '.'))
					continue;

				Obj *o = read_obj(strbase, a, b, de->d_name);

				Obj **h = &objs[NAME_HASH(o->name)];
				o->next = *h;
				*h = o;

				nobjs++;
			}

			closedir(dh);
		}
	
	long long total_raw=0;
	long long total_bsd=0;
	int nbsd=0;

	for (int h1=0; h1<HASH_SIZE; h1++) {
		for (Obj *o1 = objs[h1]; o1; o1 = o1->next) {
			// Look for a match in all the objects that are after it in our hash.
			// If we get a match, it means we can store o1 as a delta from o2.

			for (int h2=h1; h2<HASH_SIZE; h2++) {
				for (Obj *o2 = (h2 == h1) ? o1->next : objs[h2]; o2; o2 = o2->next)
					compare(o1, o2);
			}

			if (o1->match) {
				char bin1[PATH_MAX];
				char bin2[PATH_MAX];
				sprintf(bin1, "%s/.ccache/%s", getenv("HOME"), o1->expanded);
				sprintf(bin2, "%s/.ccache

Re: [ccache] BSDiff for cache objects

2012-11-12 Thread Bogdan Harjoc
Initial results from a small .ccache (3.0) dir:

- 6476 objects
- 300MB
- probably about 500-1000 compiles/recompiles of around 100 small to large
projects

The test was:
1. Find the candidates for compression, based on: objdump -t | grep " g "
(defined symbols). If two objects had at least 4 symbols defined, and 85%
of them were identical, the files were selected for compression.
2. Run bsdiff on the selected pairs of files, and collect the total raw
size, and the resulting compressed size.
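Step 1's symbol extraction can be sketched roughly like this (a reconstruction of what get-symbols.sh presumably does; the canned objdump -t fragment below is made up for the demo):

```shell
# Keep only globally defined symbols: objdump -t marks them with a "g"
# flag column; the symbol name is the last field on the line.
defined_syms() {
    grep ' g ' | awk '{ print $NF }' | sort
}

# Demo on a canned fragment of objdump -t output;
# it prints "helper" then "main".
defined_syms <<'EOF'
0000000000000000 l    df *ABS*  0000000000000000 foo.c
0000000000000000 g     F .text  000000000000001a main
0000000000000010 g     F .text  0000000000000008 helper
0000000000000000 l     F .text  0000000000000004 static_fn
EOF
```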

The results are:
4459 out of 6476 files compressed (6099674 -> 629795 bytes)

So roughly 90% compression rate, for a random .ccache folder.

I attached sources for the test (first run "./get-symbols.sh", then
"./find-similar").

Will post results for a more favorable scenario (multiple builds of
different versions of the same project) if time permits.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <dirent.h>
#include <limits.h>
#include <sys/types.h>
#include <sys/stat.h>

const char *BSDIFF = "~/build/bsdiff-4.3/bsdiff";

struct Func {
	Func *next;

	char *name;
};

struct Obj {
	Obj *next;

	char *name;
	char *path;
	int nfuncs;
	Func *funcs;
	
	Obj *match;
	int common;
};

const int HASH_SIZE = 19031;
Obj *objs[HASH_SIZE] = {};
#define NAME_HASH(name) ((*(int *)(name)) % HASH_SIZE)

Obj *read_obj(const char *base, int a, int b, const char *namesuff)
{
	Obj *o = new Obj;
	o->next = 0;
	o->name = new char[64];
	o->path = new char[128];
	o->nfuncs = 0;
	o->funcs = 0;
	o->common = 0;
	o->match = 0;
	
	int nc = snprintf(o->name, 64, "%x%x%s", a, b, namesuff);
	if (nc <= 0 || nc >= 64) abort();

	snprintf(o->path, 128, "%s/%s", base, namesuff);
	FILE *f = fopen(o->path, "rb");
	if (!f) {
		fprintf(stderr, "open %s failed\n", o->path);
		abort();
	}
	for (;;) {
		static char line[PATH_MAX];
		if (!fgets(line, PATH_MAX, f))
			break;

		for (int l=strlen(line); l && isspace(line[l-1]); line[--l]=0);


		Func **h = &o->funcs;
		
		Func *fn = new Func;
		fn->name = strdup(line);
		
		fn->next = *h;
		*h = fn;

		o->nfuncs++;
	}

	fclose(f);

	return o;
}

void compare(Obj *o1, Obj *o2)
{
	/*
	if (strcmp(o1->name, "978911edd002afb6ecb9611659327e3e-3475537") ||
		strcmp(o2->name, "4794bbcacf3bc42cdf1a44cf89523949-3429991"))
		return;
	*/

	if (o1->nfuncs < 4 || o2->nfuncs < 4) return;
	int r = 100*o1->nfuncs/o2->nfuncs;
	if (r < 80 || r > 125) return;
	if (!strcmp(o1->name, o2->name)) return;

	Func *f1 = o2->funcs;
	Func *f2 = o1->funcs;
	
	int m=0;
	while (f1) {
		int r = 0;
		while (f2 && (r = strcmp(f1->name, f2->name)) < 0) // reversed since we pushed fns at the head
			f2 = f2->next;

		if (f2 && r==0) // match
			m++;

		f1 = f1->next;
	}
	
	if (m > o1->common) {
		o1->common = m;
		o1->match = o2;
	}
}

int main()
{
	int nobjs=0;

	for (int a=0; a<0x10; a++)
		for (int b=0; b<0x10; b++) {
			char base[PATH_MAX];
			snprintf(base, PATH_MAX, "strings/.ccache/%x/%x", a, b);
			DIR *dh = opendir(base);
			if (!dh) continue;

			for (;;) {
				struct dirent *de = readdir(dh);
				if (!de)
					break;
				if (strchr(de->d_name, '.'))
					continue;

				Obj *o = read_obj(base, a, b, de->d_name);

				Obj **h = &objs[NAME_HASH(o->name)];
				o->next = *h;
				*h = o;

				nobjs++;
			}

			closedir(dh);
		}
	
	int total_raw=0;
	int total_bsd=0;
	int nbsd=0;

	for (int h1=0; h1<HASH_SIZE; h1++) {
		for (Obj *o1 = objs[h1]; o1; o1 = o1->next) {
			// Look for a match in all the objects that are after it in our hash.
			// If we get a match, it means we can store o1 as a delta from o2.

			for (int h2=h1; h2<HASH_SIZE; h2++) {
				for (Obj *o2 = (h2 == h1) ? o1->next : objs[h2]; o2; o2 = o2->next)
					compare(o1, o2);
			}

			if (o1->nfuncs && 100*o1->common/o1->nfuncs >= 85) {
				char cmd[PATH_MAX];
				snprintf(cmd, PATH_MAX, "%s %s %s tmp.bsd", BSDIFF, o1->path, o1->match->path);
				if (system(cmd)) {
					printf("%s\n", cmd);
					abort();
				}

				struct stat st;
				if (stat(o1->path, &st)) {
					printf("%s\n", o1->path);
					abort();
				}
				total_raw += st.st_size;

				if (stat("tmp.bsd", &st)) {
					printf("%s (tmp.bsd)\n", o1->path);
					abort();
				}
				total_bsd += st.st_size;

				nbsd++;
			}
		}
	}

	remove("tmp.bsd");

	printf("\n%d files compressed (%d -> %d bytes)\n", nbsd, total_raw, total_bsd);
	
	return 0;
}


get-symbols.sh
Description: Bourne shell script
___
ccache mailing list
ccache@lists.samba.org
https://lists.samba.org/mailman/listinfo/ccache


Re: [ccache] BSDiff for cache objects

2012-11-12 Thread Andrew Stubbs

On 12/11/12 14:08, Bogdan Harjoc wrote:

> No but there is room for improvement. This could be optional, like a
> CCACHE_COMPRESS that saves 99% instead of 40% when I routinely recompile 20
> kernel branches, for example (v2.6.x, 3.0.x, 3.4.x, -git, -next, -ubuntu,
> etc).


I realise that the more diverged the branches are, the fewer exact cache 
hits you will get, but there still should be a lot of overlap here. 
However, if you're building the branches in different directories and 
using absolute paths then you might be missing out on potential cache 
sharing (without any binary differences).
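(Context for CCACHE_BASEDIR, mentioned below: it makes ccache rewrite absolute paths under the given directory to relative ones before hashing, so checkouts in different directories can share entries. A sketch with hypothetical paths; the compile lines are commented since they need gcc and ccache installed:)

```shell
# Hypothetical: the same tree checked out twice under $HOME/src.
# Without a base dir, -I$HOME/src/build-a/include and .../build-b/include
# hash differently and the two builds never share cache entries.
export CCACHE_BASEDIR="$HOME/src"

# cd "$HOME/src/build-a" && ccache gcc -I"$PWD/include" -c foo.c
# cd "$HOME/src/build-b" && ccache gcc -I"$PWD/include" -c foo.c  # can now hit
```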


Are you familiar with CCACHE_BASEDIR?

Andrew



Re: [ccache] BSDiff for cache objects

2012-11-12 Thread Bogdan Harjoc
On Mon, Nov 12, 2012 at 3:39 PM, Andrew Stubbs  wrote:

> On 12/11/12 11:49, Bogdan Harjoc wrote:
>
>> Alternatively, a "compact" operation could be run periodically, that
>> compresses the cache using the same approach.
>>
>
> Is cache size/capacity a very big issue for you?
>

No but there is room for improvement. This could be optional, like a
CCACHE_COMPRESS that saves 99% instead of 40% when I routinely recompile 20
kernel branches, for example (v2.6.x, 3.0.x, 3.4.x, -git, -next, -ubuntu,
etc).

> Or how about a larger cache limit? If you don't have much disk space then
> combining that with CCACHE_HARDLINK might provide a useful saving?
> (Although compression must be disabled and your cache must be on the same
> file-system as your build.)


> My opinion is that ccache is a space-speed tradeoff that moves the savings
> balance toward speed-savings and away from space-savings. For the most
> part, users are fine with that, indeed want that, and modifications that
> move that balance back toward space-saving aren't interesting.
>

Adjusting cache limits to available space works, of course. This is the
kind of reply I was asking for -- what size/speed constraints do ccache
users typically face.

> This idea is nice, but seems like it will reduce performance on both
> cache-hit and cache-miss, like regular compression, but even more so,
> especially on cache-miss.
>

The offline approach won't affect cache-miss at all. In cache-hit cases, it
will add the cost of applying the patch.

> That said, if we can have our cake and eat it then bring it on. (It's a
> shame the hard-link feature isn't completely safe, or it would do just
> that.)
>

I'll run some tests on my .ccache dir and post results once I have them.

Cheers,
Bogdan


Re: [ccache] BSDiff for cache objects

2012-11-12 Thread Andrew Stubbs

On 12/11/12 11:49, Bogdan Harjoc wrote:

> Alternatively, a "compact" operation could be run periodically, that
> compresses the cache using the same approach.


Is cache size/capacity a very big issue for you?

Have you tried the CCACHE_COMPRESS feature? That simply gzips the 
binaries in the cache. You can control the compression level with 
CCACHE_COMPRESSLEVEL.


Or how about a larger cache limit? If you don't have much disk space 
then combining that with CCACHE_HARDLINK might provide a useful saving? 
(Although compression must be disabled and your cache must be on the 
same file-system as your build.)



> My question is whether ccache's real-world use would benefit from a feature
> like this. I can put together a test that looks through people's .ccache
> and reports how many "good bsdiff candidates" there are, and what the
> savings would be.
>
> Any opinions?


My opinion is that ccache is a space-speed tradeoff that moves the 
savings balance toward speed-savings and away from space-savings. For 
the most part, users are fine with that, indeed want that, and 
modifications that move that balance back toward space-saving aren't 
interesting.


This idea is nice, but seems like it will reduce performance on both 
cache-hit and cache-miss, like regular compression, but even more so, 
especially on cache-miss.


That said, if we can have our cake and eat it then bring it on. (It's a 
shame the hard-link feature isn't completely safe, or it would do just 
that.)


Andrew


Re: [ccache] BSDiff for cache objects

2012-11-12 Thread Bogdan Harjoc
On Mon, Nov 12, 2012 at 2:30 PM, Jürgen Buchmüller wrote:

> Am Montag, den 12.11.2012, 13:49 +0200 schrieb Bogdan Harjoc:
> > Basically, before writing a new object file, ccache could find a similar
> > object in the cache (based on object-code or source-code hashes for
> > example)
>
> The main goal of most hashes is to give very distinct results even for
> small changes in the input data, which is why there is not really
> an algorithm to compare two files' similarity based on hashes.
>

I should have been more specific. I meant block-hashes, like rsync and
bsdiff do:
http://www.samba.org/~tridge/phd_thesis.pdf
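A crude fixed-block illustration of the idea (real rsync/bsdiff use rolling checksums over sliding windows, not fixed offsets; this sketch just counts shared 512-byte blocks and assumes GNU coreutils for split/md5sum/comm):

```shell
# Hash every 512-byte block of a file and emit the sorted hash list.
block_hashes() {
    d=$(mktemp -d)
    split -b 512 "$1" "$d/blk."
    md5sum "$d"/blk.* | awk '{ print $1 }' | sort
    rm -rf "$d"
}

# Count block hashes two files have in common (comm needs sorted input).
shared_blocks() {
    block_hashes "$1" > /tmp/h1.txt
    block_hashes "$2" > /tmp/h2.txt
    comm -12 /tmp/h1.txt /tmp/h2.txt | wc -l | tr -d ' '
}

# Two files sharing their first 1024 bytes but differing in the tail:
head -c 1024 /dev/zero > /tmp/a.bin; printf 'tail-A' >> /tmp/a.bin
head -c 1024 /dev/zero > /tmp/b.bin; printf 'tail-B' >> /tmp/b.bin
shared_blocks /tmp/a.bin /tmp/b.bin   # prints 2
```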

> The savings in size are
> probably less important than the expected performance loss from
> building deltas of source and/or object files.
>

My concern as well. But an offline "ccache-compact" that runs every 24h or
so, possibly only creating the "100 hashes" once for every new file, should
be pretty fast. And applying a bspatch requires a bunzip2 and going through
a list of INSERT/ADD instructions, so it can probably be approximated to just
"bunzip2". There is also xdelta, which is faster.
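The keep-the-delta-only-if-it-wins decision such a compact pass would make can be sketched as follows (function names are hypothetical; bsdiff is only invoked if it is actually installed):

```shell
filesize() { wc -c < "$1" | tr -d ' '; }

# Try to store OBJ as a delta against BASE; keep the delta only if it is
# actually smaller than the object it would replace.
compact_one() {  # compact_one BASE OBJ
    command -v bsdiff >/dev/null 2>&1 || return 0   # tool missing: no-op
    bsdiff "$1" "$2" "$2.bsd" || return 1
    if [ "$(filesize "$2.bsd")" -lt "$(filesize "$2")" ]; then
        echo "kept delta for $2"
    else
        rm -f "$2.bsd"                              # delta lost: keep original
    fi
}
```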


Re: [ccache] BSDiff for cache objects

2012-11-12 Thread Jürgen Buchmüller
Am Montag, den 12.11.2012, 13:49 +0200 schrieb Bogdan Harjoc:
> Basically, before writing a new object file, ccache could find a similar
> object in the cache (based on object-code or source-code hashes for
> example)

The main goal of most hashes is to give very distinct results even for
small changes in the input data, which is why there is not really
an algorithm to compare two files' similarity based on hashes.

Similarity of two files would have to be calculated based on something
that currently isn't available - AFAICT. The savings in size are
probably less important than the expected performance loss from
building deltas of source and/or object files.

Juergen

