Re: [ccache] Corrupt objects from three colliding compiles

2010-05-09 Thread Joel Rosdahl
On 2010-05-06 22:22, Wilson Snyder wrote:
> 
> Two followups.
> 
> First, though not the problem I'm seeing, as I understand
> it, mkstemp isn't guaranteed to work on NFS v2 mounted
> drives.  Thus I suggest including the hostname; there's
> already tmp_string() which is perfect.  Patch below.

Applied.

> Second, I'm still not sure of the cause, but if I count
> bytes copied, and fail if zero bytes are moved this works
> around the issue.

Hmm. Have you tried checking gzeof/gzerror after gzread? I guess it
wouldn't hurt to act on read errors detected by zlib:

--- a/util.c
+++ b/util.c
@@ -190,6 +190,18 @@ int copy_file(const char *src, const char *dest,
int compress_dest)
return -1;
}
}
+   if (n == 0 && !gzeof(gz_in)) {
+   int errnum;
+   cc_log("gzread error: %s", gzerror(gz_in, &errnum));
+   gzclose(gz_in);
+   if (gz_out) {
+   gzclose(gz_out);
+   }
+   close(fd_out);
+   unlink(tmp_name);
+   free(tmp_name);
+   return -1;
+   }

gzclose(gz_in);
if (gz_out) {

> [...]
> One thing I wonder is the unlink() before the rename.  I
> presume that is needed to prevent permission issues, but I
> wonder if it contributes to the race involving
> host1:unlink(), host1:rename(), host2:unlink(),
> host2:rename(), host3:open().

I'm not sure why there's an unlink before the rename -- it seems
redundant to me. If you remove it, do you see any improvement?

-- Joel
___
ccache mailing list
ccache@lists.samba.org
https://lists.samba.org/mailman/listinfo/ccache


Re: [ccache] Corrupt objects from three colliding compiles

2010-05-09 Thread Joel Rosdahl
On 2010-05-06 19:14, Wilson Snyder wrote:

> [...]
> [1273161404.432174,2424 ] === 2010-05-06 11:56:44 ===
> [1273161404.432636,2424 ] Hostname: maserati
> [1273161404.432643,2424 ] Cwd: /nfs/compiling/dir02
> [1273161404.432646,2424 ] Source file: INV.c
> [1273161404.432648,2424 ] Object file: INV.o
> [1273161404.432657,2424 ] Trying direct lookup
> [1273161404.439730,2424 ] Looking for object file hash in 
> /nfs/ccache-dir/0/3/8/f19fe0a63fc088a9c570845a0a4eb-584.manifest
> [1273161404.440126,2424 ] Did not find object file hash in manifest
> [1273161404.440134,2424 ] Running preprocessor
> [1273161404.469188,2084 ] === 2010-05-06 11:56:44 ===
> [1273161404.469515,2084 ] Hostname: bird07
> [1273161404.469519,2084 ] Cwd: /nfs/compiling/dir03
> [1273161404.469521,2084 ] Source file: INV.c
> [1273161404.469523,2084 ] Object file: INV.o
> [1273161404.469532,2084 ] Trying direct lookup
> [1273161404.471093,2084 ] Looking for object file hash in 
> /nfs/ccache-dir/0/3/8/f19fe0a63fc088a9c570845a0a4eb-584.manifest
> [1273161404.475427,2084 ] Did not find object file hash in manifest
> [1273161404.475434,2084 ] Running preprocessor
> [1273161404.487509,2424 ] Got object file hash from preprocessor
> [1273161404.492494,2424 ] Object file 
> /nfs/ccache-dir/2/b/2/eedf0591498af8e5f3a27fcc89c4f-69975.o not in cache

Job 2424 doesn't see the object file in the cache...

> [1273161404.492522,2424 ] Running real compiler

...and decides to create it.

> [1273161404.511123,21936] === 2010-05-06 11:56:44 ===
> [1273161404.511506,21936] Hostname: bat05
> [1273161404.511510,21936] Cwd: /nfs/compiling/dir04
> [1273161404.511513,21936] Source file: INV.c
> [1273161404.511515,21936] Object file: INV.o
> [1273161404.511523,21936] Trying direct lookup
> [1273161404.514058,21936] Looking for object file hash in 
> /nfs/ccache-dir/0/3/8/f19fe0a63fc088a9c570845a0a4eb-584.manifest
> [1273161404.517648,21936] Did not find object file hash in manifest
> [1273161404.517654,21936] Running preprocessor
> [1273161404.523658,2084 ] Got object file hash from preprocessor
> [1273161404.542730,2084 ] Created INV.o from 
> /nfs/ccache-dir/2/b/2/eedf0591498af8e5f3a27fcc89c4f-69975.o

But here, job 2084 sees the object file in the cache ~34 ms before the
object file was created...

> [1273161404.564688,21936] Got object file hash from preprocessor
> [1273161404.569955,2084 ] Added object file hash to 
> /nfs/ccache-dir/0/3/8/f19fe0a63fc088a9c570845a0a4eb-584.manifest
> [1273161404.570387,2084 ] Succeded getting cached result
> [1273161404.576412,2424 ] Stored in cache: 
> /nfs/ccache-dir/2/b/2/eedf0591498af8e5f3a27fcc89c4f-69975.o

...by job 2424 here. Maybe the clocks are more than 12 ms apart?

> [1273161404.588776,21936] Created INV.o from 
> /nfs/ccache-dir/2/b/2/eedf0591498af8e5f3a27fcc89c4f-69975.o   ***BAD OBJECT***
> [1273161404.596098,2424 ] Created INV.o from 
> /nfs/ccache-dir/2/b/2/eedf0591498af8e5f3a27fcc89c4f-69975.o
> [1273161404.630619,21936] Added object file hash to 
> /nfs/ccache-dir/0/3/8/f19fe0a63fc088a9c570845a0a4eb-584.manifest
> [1273161404.634864,21936] Succeded getting cached result
> [1273161404.638443,2424 ] Added object file hash to 
> /nfs/ccache-dir/0/3/8/f19fe0a63fc088a9c570845a0a4eb-584.manifest
> 
> It looks like the third job's read of the cache is stuck in
> between the first's write and the second's.  Note that times
> may be off by >8ms, this would put the bad object read right
> on top of the object update by the second compile when it
> hits the NFS server.

Since renames should be atomic on NFS, I don't see any problem here, but
I'm by no means an NFS expert.

-- Joel
___
ccache mailing list
ccache@lists.samba.org
https://lists.samba.org/mailman/listinfo/ccache


Re: [ccache] Corrupt objects from three colliding compiles

2010-05-06 Thread Wilson Snyder

Two followups.

First, though not the problem I'm seeing, as I understand
it, mkstemp isn't guaranteed to work on NFS v2 mounted
drives.  Thus I suggest including the hostname; there's
already tmp_string() which is perfect.  Patch below.

Second, I'm still not sure of the cause, but if I count
bytes copied, and fail if zero bytes are moved this works
around the issue.  Patch below, but this not be the right
solution.  I thought of calling stat() after the copy, and
fail if the size doesn't match what stat says, alas
compression makes that difficult.

One thing I wonder is the unlink() before the rename.  I
presume that is needed to prevent permission issues, but I
wonder if it contributes to the race involving
host1:unlink(), host1:rename(), host2:unlink(),
host2:rename(), host3:open().

-Wilson


diff --git a/util.c b/util.c
index ae48cf0..df522f1 100644
--- a/util.c
+++ b/util.c
@@ -128,7 +128,7 @@ int copy_file(const char *src, const char *dest, int 
compress_dest)
   mode_t mask;
   struct stat st;
 
-   x_asprintf(&tmp_name, "%s.XX", dest);
+   x_asprintf(&tmp_name, "%s.%s.XX", dest, tmp_string());
 
/* open source file */
fd_in = open(src, O_RDONLY);


diff --git a/util.c b/util.c
index ae48cf0..df522f1 100644
--- a/util.c
+++ b/util.c
@@ -177,7 +177,9 @@ int copy_file(const char *src, const char *dest, int 
compress_dest)
  }
  }
 
+   int bytecnt = 0;
while ((n = gzread(gz_in, buf, sizeof(buf))) > 0) {
+ bytecnt += n;
  if (compress_dest) {
 ret = gzwrite(gz_out, buf, n);
 } else {
@@ -212,14 +214,25 @@ int copy_file(const char *src, const char *dest, int 
compress_dest)
   return -1;
   }
 
+   /* Empty objects are bad - created by race in NFS between 2 writers and 
1+ readers */
+   if (bytecnt == 0) {
+   unlink(tmp_name);
+   free(tmp_name);
+   cc_log("Refusing to copy zero sized file %s", src);
+   return -1;
+   }
+
unlink(dest);
 
if (rename(tmp_name, dest) == -1) {
   unlink(tmp_name);
free(tmp_name);
+   cc_log("Rename failed");
   return -1;
   }
 
+   cc_log("Copied %d bytes from %s via %s to %s", bytecnt, src, tmp_name, 
dest);
+
free(tmp_name);
 
return 0;
___
ccache mailing list
ccache@lists.samba.org
https://lists.samba.org/mailman/listinfo/ccache


[ccache] Corrupt objects from three colliding compiles

2010-05-06 Thread Wilson Snyder

I'm hitting a NFS issue causing bad caching.  (And yes, we have
no_subtree_check.)

This is a bug present in 2.4, but it is also present in
trunk b964c173cf493a1c0c0749a9c6a39bc07936507b which is used
for the below diagnosis.

The issue seems to occur only when there are three compiles
on different systems all compiling the same file at the same
time.  Here's the C files:

$ find . -name INV.c -exec ls -la --time-style=full-iso {} \;
-rw-rw-r-- 1 wsnyder users 405 2010-05-06 11:38:39.0 -0400 ./dir02/INV.c
-rw-rw-r-- 1 wsnyder users 405 2010-05-06 11:48:57.0 -0400 ./dir03/INV.c
-rw-rw-r-- 1 wsnyder users 405 2010-05-06 11:43:08.0 -0400 ./dir04/INV.c

They got compiled, but one got truncated

$ find . -name INV.o -exec ls -la --time-style=full-iso {} \;
-rw-rw-r-- 1 wsnyder users 916 2010-05-06 11:56:44.0 -0400 ./dir02/INV.o
-rw-rw-r-- 1 wsnyder users 916 2010-05-06 11:56:44.0 -0400 ./dir03/INV.o
-rw-rw-r-- 1 wsnyder users   0 2010-05-06 11:56:44.0 -0400 ./dir04/INV.o

Note the timestamps (unfortunately this isn't a file system
with sub-second resolution!)

The cache files:

$ ls -la --time-style=full-iso $CCACHE_DIR/2/b/2/eedf0591498af8e5f3a*
-rw-rw-r-- 1 wsnyder users 916 2010-05-06 11:56:44.0 -0400 
$CCACHE_DIR/2/b/2/eedf0591498af8e5f3a27fcc89c4f-69975.o

$ ls -la --time-style=full-iso $CCACHE_DIR/0/3/8/f19fe0a63fc088a9c57*
-rw-rw-r-- 1 wsnyder users 874 2010-05-06 11:56:44.0 -0400 
$CCACHE_DIR/0/3/8/f19fe0a63fc088a9c570845a0a4eb-584.manifest

To debug this further I added hostname and timestamps to the
log files (earlier email sent this patch), otherwise I
couldn't know what the ordering was.  This gave three
reports:

[1273161404.432174,2424 ] === 2010-05-06 11:56:44 ===
[1273161404.432636,2424 ] Hostname: maserati
[1273161404.432643,2424 ] Cwd: /nfs/compiling/dir02
[1273161404.432646,2424 ] Source file: INV.c
[1273161404.432648,2424 ] Object file: INV.o
[1273161404.432657,2424 ] Trying direct lookup
[1273161404.439730,2424 ] Looking for object file hash in 
/nfs/ccache-dir/0/3/8/f19fe0a63fc088a9c570845a0a4eb-584.manifest
[1273161404.440126,2424 ] Did not find object file hash in manifest
[1273161404.440134,2424 ] Running preprocessor
[1273161404.487509,2424 ] Got object file hash from preprocessor
[1273161404.492494,2424 ] Object file 
/nfs/ccache-dir/2/b/2/eedf0591498af8e5f3a27fcc89c4f-69975.o not in cache
[1273161404.492522,2424 ] Running real compiler
[1273161404.576412,2424 ] Stored in cache: 
/nfs/ccache-dir/2/b/2/eedf0591498af8e5f3a27fcc89c4f-69975.o
[1273161404.596098,2424 ] Created INV.o from 
/nfs/ccache-dir/2/b/2/eedf0591498af8e5f3a27fcc89c4f-69975.o
[1273161404.638443,2424 ] Added object file hash to 
/nfs/ccache-dir/0/3/8/f19fe0a63fc088a9c570845a0a4eb-584.manifest

[1273161404.469188,2084 ] === 2010-05-06 11:56:44 ===
[1273161404.469515,2084 ] Hostname: bird07
[1273161404.469519,2084 ] Cwd: /nfs/compiling/dir03
[1273161404.469521,2084 ] Source file: INV.c
[1273161404.469523,2084 ] Object file: INV.o
[1273161404.469532,2084 ] Trying direct lookup
[1273161404.471093,2084 ] Looking for object file hash in 
/nfs/ccache-dir/0/3/8/f19fe0a63fc088a9c570845a0a4eb-584.manifest
[1273161404.475427,2084 ] Did not find object file hash in manifest
[1273161404.475434,2084 ] Running preprocessor
[1273161404.523658,2084 ] Got object file hash from preprocessor
[1273161404.542730,2084 ] Created INV.o from 
/nfs/ccache-dir/2/b/2/eedf0591498af8e5f3a27fcc89c4f-69975.o
[1273161404.569955,2084 ] Added object file hash to 
/nfs/ccache-dir/0/3/8/f19fe0a63fc088a9c570845a0a4eb-584.manifest
[1273161404.570387,2084 ] Succeded getting cached result

[1273161404.511123,21936] === 2010-05-06 11:56:44 ===
[1273161404.511506,21936] Hostname: bat05
[1273161404.511510,21936] Cwd: /nfs/compiling/dir04
[1273161404.511513,21936] Source file: INV.c
[1273161404.511515,21936] Object file: INV.o
[1273161404.511523,21936] Trying direct lookup
[1273161404.514058,21936] Looking for object file hash in 
/nfs/ccache-dir/0/3/8/f19fe0a63fc088a9c570845a0a4eb-584.manifest
[1273161404.517648,21936] Did not find object file hash in manifest
[1273161404.517654,21936] Running preprocessor
[1273161404.564688,21936] Got object file hash from preprocessor
[1273161404.588776,21936] Created INV.o from 
/nfs/ccache-dir/2/b/2/eedf0591498af8e5f3a27fcc89c4f-69975.o   ***BAD OBJECT***
[1273161404.630619,21936] Added object file hash to 
/nfs/ccache-dir/0/3/8/f19fe0a63fc088a9c570845a0a4eb-584.manifest
[1273161404.634864,21936] Succeded getting cached result

The next step assumes our times are close across systems; they should be within 
about 12ms.  Here's the sorted view

[1273161404.432174,2424 ] === 2010-05-06 11:56:44 ===
[1273161404.432636,2424 ] Hostname: maserati
[1273161404.432643,2424 ] Cwd: /nfs/compiling/dir02
[1273161404.432646,2424 ] Source file: INV.c
[1273161404.432648,2424 ] Object file: INV.o
[1273161404.432657,2424 ] Trying direct lookup
[1273161404.439730,2424 ] Looking fo