Git does not handle changing inode numbers well

2012-08-08 Thread Matthijs Kooijman
(Please CC me, I'm not on the list)

Hi folks,

I've spent some time debugging an issue and I'd like to share the
results. The conclusion of my debugging is that git does not currently
handle changing inode numbers on files well.

I have a custom Fuse filesystem, and fuse dynamically allocates inode
numbers to paths, but keeps a limited cache of inode - name mappings,
causing the inodes to change over time.

Now of course, you'll probably say, it's the filesystem's fault, git
can't be expected to cope with that. You'll be right of course, but
since I already spent the time digging into this and figuring out what
goes on inside git in this case, I thought I might as well share the
analysis, just in case someone sees an easy fix in here, or in case
someone else stumbles upon this problem as well.

So, the actual problem I was seeing is that running git status showed
all symlinks as modified, even though they really were identical
between the working copy, index and HEAD. Interestingly enough this only
happened when running git status without further arguments, when
running on a subdirectory, it would show no changes as expected.

I compared the output of stat to a hexdump of the index file and found
that everything matched, except for the inode numbers. I originally
thought I was misinterpreting what I saw, but gdb confirmed that it were
indeed the inode numbers that git observed as different.

Now, I could have stopped here and started trying to fix my filesystem
instead. But it was still weird that this problem only existed for
symlinks and that normal files acted as expected. So I dug in a bit
deeper, hoping to find some way to make this work for symlinks as well.

So, here's what happens (IIUC):
 - cmd_status calls refresh_index, which calls refresh_cache_ent for
   every entry in the index.
 - refresh_cache_ent notices that the inode number has changed (for both
   symlinks and regular files) and compares the file / symlink contents.
 - refresh_cache_ent sees the content hasn't changed, so it calls
   fill_stat_cache_info to update the stat info.
 - fill_stat_cache_info sets the EC_UPTODATE flag on the entry, but only
   if it is a regular file.
 - cmd_status calls wt_status_collect which calls
   wt_status_collect_changes_worktree which calls run_diff_files.
 - run_diff_files skips regular files, because of the EC_UPTODATE flag.
   For symlinks, however, it checks the stat info and notices that the
   inode number has changed (again). It does not do a content check at
   this point, but instead just outputs the file as modified.


It turned out that the reason running git status on a subdirectory did
appear to work, was that the number of files in the subdir wasn't big
enough to overflow the inode number cache fuse keeps, so that numbers
didn't change in this case (the problem _did_ occur when trying a bigger
subdirectory).

So, it seems that git just doesn't cope well with changing inode numbers
because it checks the content in a first pass in refresh_index, but only
checks the stat info in the second pass in run_diff_files. The reason it
does work for regular files is EC_UPTODATE optimization introduced in
eadb5831: Avoid running lstat(2) on the same cache entry.

So, let's see if I can fix my filesystem now ;-)

Gr.

Matthijs


signature.asc
Description: Digital signature


Re: Git does not handle changing inode numbers well

2012-08-08 Thread Junio C Hamano
Matthijs Kooijman matth...@stdin.nl writes:

 So, it seems that git just doesn't cope well with changing inode numbers
 because it checks the content in a first pass in refresh_index, but only
 checks the stat info in the second pass in run_diff_files. The reason it
 does work for regular files is EC_UPTODATE optimization introduced in
 eadb5831: Avoid running lstat(2) on the same cache entry.

 So, let's see if I can fix my filesystem now ;-)

True.  We have knobs to cope with filesystems whose st_dev or
st_ctime are not stable, but there is no such knob to tweak for
st_ino.  Shouldn't be too hard to add such, though.  One approach is
to do something like the attached patch, and declare, define,
initialize, and set trust_inum in a way similar to how we handle
trust_ctime in the existing code.

 read-cache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/read-cache.c b/read-cache.c
index 2f8159f..6da99af 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -210,7 +210,7 @@ static int ce_match_stat_basic(struct cache_entry *ce, 
struct stat *st)
if (ce-ce_uid != (unsigned int) st-st_uid ||
ce-ce_gid != (unsigned int) st-st_gid)
changed |= OWNER_CHANGED;
-   if (ce-ce_ino != (unsigned int) st-st_ino)
+   if (trust_inum  ce-ce_ino != (unsigned int) st-st_ino)
changed |= INODE_CHANGED;
 
 #ifdef USE_STDEV
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html