Re: [PATCH/RFC v3 04/13] Add documentation of the index-v5 file format

2012-08-09 Thread Junio C Hamano
Thomas Gummerer t.gumme...@gmail.com writes:

 +GIT index format
 +
 +
 +== The git index file format
 +
 +   The git index file (.git/index) documents the status of the files
 + in the git staging area.
 +
 +   The staging area is used for preparing commits, merging, etc.

The above two paragraphs are not about the index file format.  They
are an explanation of what the index is.

 +   All binary numbers are in network byte order. Version 5 is described
 + here.

I had to read between these two lines something like 

The index file consists of various sections; the sections
appear in the following order in the file.

to make sense of the document.

 +   - A 20-byte header consisting of
 +
 + sig (32-bits): Signature:
 +   The signature is { 'D', 'I', 'R', 'C' } (stands for dircache)
 +
 + vnr (32-bits): Version number:
 +   The current supported versions are 2, 3, 4 and 5.
 +
 + ndir (32-bits): number of directories in the index.
 +
 + nfile (32-bits): number of file entries in the index.
 +
 + fblockoffset (32-bits): offset to the file block, relative to the
 +   beginning of the file.

Ok.
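
For readers following along, the 20-byte header quoted above can be decoded as in the minimal sketch below.  It assumes only what the quoted text states (five 32-bit fields in network byte order, signature "DIRC"); the function name parse_header and the dictionary layout are made up for this illustration and do not come from the patch.

```python
import struct

# Five 32-bit fields in network (big-endian) byte order: 20 bytes in total.
HEADER = struct.Struct(">4sIIII")

def parse_header(buf):
    """Decode the fixed 20-byte header from the start of buf (hypothetical helper)."""
    if len(buf) < HEADER.size:
        raise ValueError("buffer shorter than the 20-byte header")
    sig, vnr, ndir, nfile, fblockoffset = HEADER.unpack_from(buf, 0)
    if sig != b"DIRC":
        raise ValueError("bad signature: %r" % sig)
    return {
        "sig": sig,
        "vnr": vnr,
        "ndir": ndir,
        "nfile": nfile,
        "fblockoffset": fblockoffset,
    }
```

The sketch stops at fblockoffset; the extension offsets and headercrc that follow are not decoded here.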

 +   - Offset to the extensions.

 + nextensions (32-bits): number of extensions.
 +
 + extoffset (32-bits): offset to the extension. (Possibly none, as
 +   many as indicated in the 4-byte number of extensions)

OK.

 + headercrc (32-bits): crc checksum for the header and extension
 +   offsets

This may need its own "-" section title at the same level as
"A 20-byte header" and "Offset to the ext"; as it stands, it looks
as if it is part of "Offset to the ext", which consists of 12 bytes.

 +   - diroffsets (ndir * directory offsets): A directory offset for each
 +   of the ndir directories in the index, sorted by pathname (of the
 +   directory it's pointing to) (see below). The diroffsets are relative
 +   to the beginning of the direntries block. [1]

"ndir * diroffsets" confused me.  I think you meant to say that this
diroffsets section consists of ndir entries of something, and that
each of those entries is a directory offset.  It is unclear how a
directory offset is represented, except that it is relative to the
beginning of the direntries block (and it is unclear what and where
the direntries block is from the information given up to this point)
and the reader can guess it is in network byte order (assuming it is
a binary number).  Perhaps

diroffsets (ndir entries of directory offset): A 4-byte
offset relative to the beginning of the direntries block
(see below) for each of the ...

and drop the last sentence?

Other tables may want to be adjusted in a similar fashion.
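
Under the wording suggested above (a table of 4-byte offsets in network byte order, each relative to the beginning of the direntries block), reading such a table could look like the sketch below.  The names read_offsets and to_absolute, and the idea of passing the table and block positions in as arguments, are assumptions made for this illustration; the patch does not say how a reader locates either.

```python
import struct

def read_offsets(buf, table_start, count):
    """Decode count consecutive 4-byte big-endian offsets starting at table_start."""
    return struct.unpack_from(">%dI" % count, buf, table_start)

def to_absolute(offsets, block_start):
    """Turn block-relative offsets into positions relative to the start of buf."""
    return [block_start + off for off in offsets]
```

For example, read_offsets(buf, table_start, ndir) followed by to_absolute(offsets, direntries_start) would give the position of every directory entry, where table_start and direntries_start are whatever the reader has already computed.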

 +== Directory offsets (diroffsets)
 +
 +  diroffset (32-bits): offset to the directory relative to the beginning
 +of the index file. There are ndir + 1 offsets in the diroffset table,
 +the last is pointing to the end of the last direntry. With this last
 +entry, we can replace the strlen when reading each filename, by
 +calculating its length with the offsets.

The mention of strlen looks very out of place.  The reader may be
able to guess that you want to say that the nth string lies between
diroffset[n] and diroffset[n+1], and that these strings are densely
packed, so strlen(diroffset[n]) and diroffset[n+1]-diroffset[n] are
either the same thing or differ by a fixed amount (if each string
is accompanied by some fixed-length data).  But it is unclear what
these strings represent, especially because the name of the table
implies that you are talking about directories while the strlen
remark talks about filenames.
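
One possible reading of the quoted paragraph, sketched under stated assumptions: with one extra offset at the end of the table, entry n occupies the bytes from offset[n] up to offset[n+1], so its length falls out of a subtraction, and the sorted table can be bisected.  The sketch assumes that offsets are relative to the start of the entry block, that every entry begins with a NUL-terminated pathname, and that entries are sorted bytewise by that pathname; the helper names are hypothetical.

```python
def entry_bytes(block, offsets, n):
    """Raw bytes of entry n, delimited by offsets[n] and offsets[n + 1]."""
    return block[offsets[n]:offsets[n + 1]]

def entry_name(block, offsets, n):
    """Pathname of entry n: everything before the first NUL byte."""
    return entry_bytes(block, offsets, n).split(b"\0", 1)[0]

def find_entry(block, offsets, name):
    """Binary search for name; return its index, or -1 if it is absent."""
    count = len(offsets) - 1  # the last offset only marks the end of the block
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        if entry_name(block, offsets, mid) < name:
            lo = mid + 1
        else:
            hi = mid
    if lo < count and entry_name(block, offsets, lo) == name:
        return lo
    return -1
```

Whether the subtraction yields the length of the pathname alone or of the whole entry depends on what else is stored per entry, which is the ambiguity pointed out above.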

 +== Design explanations
 + ...
 +[3] The data of the cache-tree extension and the resolve undo
 +extension is now part of the index itself, but if other extensions
 +come up in the future, there is no need to change the index, they
 +can simply be added at the end.

Interesting.  When we added extensions, we said that "there is no
need to change the index to add new features, they can simply be
added at the end".  Perhaps the file offset table can be added as an
extension to v2 to give us the same bisectability, allowing in-place
replacement of a single entry, without defining an entirely
different format?


[PATCH/RFC v3 04/13] Add documentation of the index-v5 file format

2012-08-08 Thread Thomas Gummerer
Add documentation of the index file format version 5 to
Documentation/technical.

Helped-by: Michael Haggerty mhag...@alum.mit.edu
Helped-by: Junio C Hamano gits...@pobox.com
Helped-by: Thomas Rast tr...@student.ethz.ch
Helped-by: Nguyen Thai Ngoc Duy pclo...@gmail.com
Helped-by: Robin Rosenberg robin.rosenb...@dewire.com
Signed-off-by: Thomas Gummerer t.gumme...@gmail.com
---
 Documentation/technical/index-file-format-v5.txt |  285 ++
 1 file changed, 285 insertions(+)
 create mode 100644 Documentation/technical/index-file-format-v5.txt

diff --git a/Documentation/technical/index-file-format-v5.txt b/Documentation/technical/index-file-format-v5.txt
new file mode 100644
index 000..6707f06
--- /dev/null
+++ b/Documentation/technical/index-file-format-v5.txt
@@ -0,0 +1,285 @@
+GIT index format
+
+
+== The git index file format
+
+   The git index file (.git/index) documents the status of the files
+ in the git staging area.
+
+   The staging area is used for preparing commits, merging, etc.
+
+   All binary numbers are in network byte order. Version 5 is described
+ here.
+
+   - A 20-byte header consisting of
+
+ sig (32-bits): Signature:
+   The signature is { 'D', 'I', 'R', 'C' } (stands for dircache)
+
+ vnr (32-bits): Version number:
+   The current supported versions are 2, 3, 4 and 5.
+
+ ndir (32-bits): number of directories in the index.
+
+ nfile (32-bits): number of file entries in the index.
+
+ fblockoffset (32-bits): offset to the file block, relative to the
+   beginning of the file.
+
+   - Offset to the extensions.
+
+ nextensions (32-bits): number of extensions.
+
+ extoffset (32-bits): offset to the extension. (Possibly none, as
+   many as indicated in the 4-byte number of extensions)
+
+ headercrc (32-bits): crc checksum for the header and extension
+   offsets
+
+   - diroffsets (ndir * directory offsets): A directory offset for each
+   of the ndir directories in the index, sorted by pathname (of the
+   directory it's pointing to) (see below). The diroffsets are relative
+   to the beginning of the direntries block. [1]
+
+   - direntries (ndir * directory entries): A directory entry for each
+   of the ndir directories in the index, sorted by pathname (see
+   below). [2]
+
+   - fileoffsets (nfile * file offsets): A file offset for each of the
+   nfile files in the index (see below). The file offsets are relative
+   to the beginning of the fileentries block. [1]
+
+   - fileentries (nfile * file entries): A file entry for each of the
+   nfile files in the index (see below).
+
+   - crdata: A number of entries for conflicted data/resolved conflicts
+   (see below).
+
+   - Extensions (Currently none, see below in the future)
+
+ Extensions are identified by signature. Optional extensions can
+ be ignored if GIT does not understand them.
+
+ GIT supports an arbitrary number of extensions, but currently none
+ is implemented. [3]
+
+ extsig (32-bits): extension signature. If the first byte is 'A'..'Z'
+ the extension is optional and can be ignored.
+
+ extsize (32-bits): size of the extension, excluding the header
+   (extsig, extsize, extchecksum).
+
+ extchecksum (32-bits): crc32 checksum of the extension signature
+   and size.
+
+- Extension data.
+
+
+== Directory offsets (diroffsets)
+
+  diroffset (32-bits): offset to the directory relative to the beginning
+of the index file. There are ndir + 1 offsets in the diroffset table,
+the last is pointing to the end of the last direntry. With this last
+entry, we can replace the strlen when reading each filename, by
+calculating its length with the offsets.
+
+  This part is needed for making the directory entries bisectable and
+thus allowing a binary search.
+
+== Directory entry (direntries)
+  
+  Directory entries are sorted in lexicographic order by the name 
+of their path starting with the root.
+  
+  pathname (variable length, nul terminated): relative to top level
+directory (without the leading slash). '/' is used as path
+separator. A string of length 0 ('') indicates the root directory.
+The special path components ., and .. (without quotes) are
+disallowed. The path also includes a trailing slash. [9]
+
+  foffset (32-bits): offset to the lexicographically first file in 
+the file offsets (fileoffsets), relative to the beginning of
+the fileoffset block.
+
+  cr (32-bits): offset to conflicted/resolved data at the end of the
+index. 0 if there is no such data. [4]
+
+  ncr (32-bits): number of conflicted/resolved data entries at the
+end of the index if the offset is non 0. If cr is 0, ncr is
+also 0.
+
+  nsubtrees (32-bits): number of subtrees this tree has in the index.
+
+  nfiles (32-bits): number of files in the directory, that are in
+the index.
+
+  nentries