Re: [PATCH v4 01/13] commit-graph: add format document

2018-04-02 Thread Jakub Narebski
Derrick Stolee  writes:

> On 3/30/2018 9:25 AM, Jakub Narebski wrote:
>> Derrick Stolee  writes:
>>
>>> +== graph-*.graph files have the following format:
>> What is this '*' here?
>
> No longer necessary. It used to be a placeholder for a hash value, but
> now the graph is stored in objects/info/commit-graph.

All right.

Excuse me replying to v4 instead of v6 of the patch series, where it
would be answered or rather made moot already.

>>
>> [...]
>>> +  The remaining data in the body is described one chunk at a time, and
>>> +  these chunks may be given in any order. Chunks are required unless
>>> +  otherwise specified.
>> Does Git need to understand all chunks, or could there be optional
>> chunks that can be safely ignored (like in PNG format)?  Though this may
>> be overkill, and could be left for later revision of the format if
>> deemed necessary.
>
> In v6, the format and design documents are edited to make clear the
> use of optional chunks, specifically for future extension without
> increasing the version number.

That's good.

>>> +CHUNK DATA:
>>> +
>>> +  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
>>> +  The ith entry, F[i], stores the number of OIDs with first
>>> +  byte at most i. Thus F[255] stores the total
>>> +  number of commits (N).
>> All right, it is small enough that can be required even for a very small
>> number of commits.
>>
>>> +
>>> +  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
>>> +  The OIDs for all commits in the graph, sorted in ascending order.
>>> +
>>> +  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
>> Do commits need to be put here in the ascending order of OIDs?
>
> Yes.
>
>> If so, this would mean that it is not possible to add information about
>> new commits by only appending data and maybe overwriting some fields, I
>> think.  You would need to do full rewrite to insert new commit in
>> appropriate place.
>
> That is the idea. This file is not updated with every new commit, but
> instead will be updated on some scheduled cleanup events. The
> commit-graph file is designed in a way to be non-critical, and not
> tied to the packfile layout. This allows flexibility for when to do
> the write.
>
> For example, in GVFS, we will write a new commit-graph when there are
> new daily prefetch packs.
>
> This could also integrate with 'gc' and 'repack' so whenever they are
> triggered the commit-graph is written as well.

I wonder if it would be possible to use existing hooks...

> Commits that do not exist in the commit-graph file will load from the
> object database as normal (after a failed lookup in the commit-graph
> file).

Ah. I thought wrongly that it would (or at least could) be something
that can be kept up to date, and extended when adding any new commit.

>>> +* The first H bytes are for the OID of the root tree.
>>> +* The next 8 bytes are for the int-ids of the first two parents
>>> +  of the ith commit. Stores value 0x if no parent in that
>>> +  position. If there are more than two parents, the second value
>>> +  has its most-significant bit on and the other bits store an array
>>> +  position into the Large Edge List chunk.
>>> +* The next 8 bytes store the generation number of the commit and
>>> +  the commit time in seconds since EPOCH. The generation number
>>> +  uses the higher 30 bits of the first 4 bytes, while the commit
>>> +  time uses the 32 bits of the second 4 bytes, along with the lowest
>>> +  2 bits of the lowest byte, storing the 33rd and 34th bit of the
>>> +  commit time.
>>> +
>>> +  Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
>>> +  This list of 4-byte values store the second through nth parents for
>>> +  all octopus merges. The second parent value in the commit data stores
>>> +  an array position within this list along with the most-significant 
>>> bit
>>> +  on. Starting at that array position, iterate through this list of 
>>> int-ids
>>> +  for the parents until reaching a value with the most-significant bit 
>>> on.
>>> +  The other bits correspond to the int-id of the last parent.
>>
>> All right, that is one chunk that cannot use fixed-length records; this
>> shouldn't matter much, as we iterate only up to the number of parents
>> less two.
>
> Less one: the second "parent" column of the commit data chunk is used
> to point into this list, so (P-1) parents are in this chunk for a
> commit with P parents.

Right.

>> A question: what happens to the last list of parents?  Is there a
>> guardian value of 0x at last place?
>
> The termination condition is in the position of the last parent, since
> the most-significant bit is on. The other 31 bits contain the int-id
> of the parent.

Ah. I have misunderstood the format: I thought that first entry is
marked with most-significant bit set to 1, and all the rest to 0, while
it is last entry (last 

Re: [PATCH v4 01/13] commit-graph: add format document

2018-04-02 Thread Derrick Stolee

On 3/30/2018 9:25 AM, Jakub Narebski wrote:

Derrick Stolee  writes:


+== graph-*.graph files have the following format:

What is this '*' here?


No longer necessary. It used to be a placeholder for a hash value, but 
now the graph is stored in objects/info/commit-graph.




[...]

+  The remaining data in the body is described one chunk at a time, and
+  these chunks may be given in any order. Chunks are required unless
+  otherwise specified.

Does Git need to understand all chunks, or could there be optional
chunks that can be safely ignored (like in PNG format)?  Though this may
be overkill, and could be left for later revision of the format if
deemed necessary.


In v6, the format and design documents are edited to make clear the use 
of optional chunks, specifically for future extension without increasing 
the version number.





+
+CHUNK DATA:
+
+  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+  The ith entry, F[i], stores the number of OIDs with first
+  byte at most i. Thus F[255] stores the total
+  number of commits (N).

All right, it is small enough that can be required even for a very small
number of commits.


+
+  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+  The OIDs for all commits in the graph, sorted in ascending order.
+
+  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)

Do commits need to be put here in the ascending order of OIDs?


Yes.


If so, this would mean that it is not possible to add information about
new commits by only appending data and maybe overwriting some fields, I
think.  You would need to do full rewrite to insert new commit in
appropriate place.


That is the idea. This file is not updated with every new commit, but 
instead will be updated on some scheduled cleanup events. The 
commit-graph file is designed in a way to be non-critical, and not tied 
to the packfile layout. This allows flexibility for when to do the write.


For example, in GVFS, we will write a new commit-graph when there are 
new daily prefetch packs.


This could also integrate with 'gc' and 'repack' so whenever they are 
triggered the commit-graph is written as well.


Commits that do not exist in the commit-graph file will load from the 
object database as normal (after a failed lookup in the commit-graph file).



+* The first H bytes are for the OID of the root tree.
+* The next 8 bytes are for the int-ids of the first two parents
+  of the ith commit. Stores value 0x if no parent in that
+  position. If there are more than two parents, the second value
+  has its most-significant bit on and the other bits store an array
+  position into the Large Edge List chunk.
+* The next 8 bytes store the generation number of the commit and
+  the commit time in seconds since EPOCH. The generation number
+  uses the higher 30 bits of the first 4 bytes, while the commit
+  time uses the 32 bits of the second 4 bytes, along with the lowest
+  2 bits of the lowest byte, storing the 33rd and 34th bit of the
+  commit time.
+
+  Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
+  This list of 4-byte values store the second through nth parents for
+  all octopus merges. The second parent value in the commit data stores
+  an array position within this list along with the most-significant bit
+  on. Starting at that array position, iterate through this list of int-ids
+  for the parents until reaching a value with the most-significant bit on.
+  The other bits correspond to the int-id of the last parent.

All right, that is one chunk that cannot use fixed-length records; this
shouldn't matter much, as we iterate only up to the number of parents
less two.


Less one: the second "parent" column of the commit data chunk is used to 
point into this list, so (P-1) parents are in this chunk for a commit 
with P parents.



A question: what happens to the last list of parents?  Is there a
guardian value of 0x at last place?


The termination condition is in the position of the last parent, since 
the most-significant bit is on. The other 31 bits contain the int-id of 
the parent.


Thanks,
-Stolee



Re: [PATCH v4 01/13] commit-graph: add format document

2018-03-30 Thread Jakub Narebski
Derrick Stolee  writes:

> +== graph-*.graph files have the following format:

What is this '*' here?

[...]
> +  The remaining data in the body is described one chunk at a time, and
> +  these chunks may be given in any order. Chunks are required unless
> +  otherwise specified.

Does Git need to understand all chunks, or could there be optional
chunks that can be safely ignored (like in PNG format)?  Though this may
be overkill, and could be left for later revision of the format if
deemed necessary.

> +
> +CHUNK DATA:
> +
> +  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
> +  The ith entry, F[i], stores the number of OIDs with first
> +  byte at most i. Thus F[255] stores the total
> +  number of commits (N).

All right, it is small enough that can be required even for a very small
number of commits.

> +
> +  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
> +  The OIDs for all commits in the graph, sorted in ascending order.
> +
> +  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)

Do commits need to be put here in the ascending order of OIDs?

If so, this would mean that it is not possible to add information about
new commits by only appending data and maybe overwriting some fields, I
think.  You would need to do full rewrite to insert new commit in
appropriate place.

> +* The first H bytes are for the OID of the root tree.
> +* The next 8 bytes are for the int-ids of the first two parents
> +  of the ith commit. Stores value 0x if no parent in that
> +  position. If there are more than two parents, the second value
> +  has its most-significant bit on and the other bits store an array
> +  position into the Large Edge List chunk.
> +* The next 8 bytes store the generation number of the commit and
> +  the commit time in seconds since EPOCH. The generation number
> +  uses the higher 30 bits of the first 4 bytes, while the commit
> +  time uses the 32 bits of the second 4 bytes, along with the lowest
> +  2 bits of the lowest byte, storing the 33rd and 34th bit of the
> +  commit time.
> +
> +  Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
> +  This list of 4-byte values store the second through nth parents for
> +  all octopus merges. The second parent value in the commit data stores
> +  an array position within this list along with the most-significant bit
> +  on. Starting at that array position, iterate through this list of 
> int-ids
> +  for the parents until reaching a value with the most-significant bit 
> on.
> +  The other bits correspond to the int-id of the last parent.

All right, that is one chunk that cannot use fixed-length records; this
shouldn't matter much, as we iterate only up to the number of parents
less two.

A question: what happens to the last list of parents?  Is there a
guardian value of 0x at last place?

> +
> +TRAILER:
> +
> + H-byte HASH-checksum of all of the above.
> +

Best,
--
Jakub Narębski


Re: [PATCH v4 01/13] commit-graph: add format document

2018-02-21 Thread Stefan Beller
>>
>> [ so in small repos, where there are fewer than 256 objects,
>> F[i] == F[i+1], for all i'th where there is no object starting with i
>> byte]
>
>
> Correct. I'm not sure this additional information is valuable for the
> document, though.

It is not, I was just making sure I'd understand correctly.


Re: [PATCH v4 01/13] commit-graph: add format document

2018-02-21 Thread Derrick Stolee

On 2/21/2018 2:23 PM, Stefan Beller wrote:

On Mon, Feb 19, 2018 at 10:53 AM, Derrick Stolee  wrote:

+In order to allow extensions that add extra data to the graph, we organize
+the body into "chunks" and provide a binary lookup table at the beginning
+of the body. The header includes certain values, such as number of chunks,
+hash lengths and types.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+  4-byte signature:
+  The signature is: {'C', 'G', 'P', 'H'}
+
+  1-byte version number:
+  Currently, the only valid version is 1.
+
+  1-byte Object Id Version (1 = SHA-1)
+
+  1-byte number (C) of "chunks"
+
+  1-byte (reserved for later use)

What should clients of today do with it?
* ignore it completely [as they have no idea what it is] or
* throw hands up in the air if it is anything other than 0 ?
   [because clearly we will increment the version
or have new information in a new chunk instead of just sneaking
in information here?]


They should ignore it completely, which will allow using the value for 
something meaningful later without causing a version change (which we DO 
die() for). A user could downgrade from a version that uses this byte 
for something meaningful and not require a new commit-graph file.


The "commit-graph read" subcommand does output this byte, so we can 
verify that the "write" subcommand places a 0 in this position.





+CHUNK LOOKUP:
+
+  (C + 1) * 12 bytes listing the table of contents for the chunks:
+  First 4 bytes describe chunk id. Value 0 is a terminating label.
+  Other 8 bytes provide offset in current file for chunk to start.

offset [in bytes? I could imagine having a larger granularity here,
because chunks don't sound small.]


It is good to specify "offset in bytes".




+  (Chunks are ordered contiguously in the file, so you can infer
+  the length using the next chunk position if necessary.)
+
+  The remaining data in the body is described one chunk at a time, and
+  these chunks may be given in any order. Chunks are required unless
+  otherwise specified.
+
+CHUNK DATA:
+
+  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+  The ith entry, F[i], stores the number of OIDs with first
+  byte at most i. Thus F[255] stores the total
+  number of commits (N).

[ so in small repos, where there are fewer than 256 objects,
F[i] == F[i+1], for all i'th where there is no object starting with i byte]


Correct. I'm not sure this additional information is valuable for the 
document, though.





+  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+  The OIDs for all commits in the graph, sorted in ascending order.
+
+  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
+* The first H bytes are for the OID of the root tree.
+* The next 8 bytes are for the int-ids of the first two parents
+  of the ith commit. Stores value 0x if no parent in that
+  position. If there are more than two parents, the second value
+  has its most-significant bit on and the other bits store an array
+  position into the Large Edge List chunk.
+* The next 8 bytes store the generation number of the commit and
+  the commit time in seconds since EPOCH. The generation number
+  uses the higher 30 bits of the first 4 bytes, while the commit
+  time uses the 32 bits of the second 4 bytes, along with the lowest
+  2 bits of the lowest byte, storing the 33rd and 34th bit of the
+  commit time.
+
+  Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
+  This list of 4-byte values store the second through nth parents for
+  all octopus merges. The second parent value in the commit data stores
+  an array position within this list along with the most-significant bit
+  on. Starting at that array position, iterate through this list of int-ids
+  for the parents until reaching a value with the most-significant bit on.
+  The other bits correspond to the int-id of the last parent.
+
+TRAILER:
+
+   H-byte HASH-checksum of all of the above.
+
--
2.7.4

Makes sense so far, I'll read on.
I agree with Junio, that I could read this documentation without
the urge to point out nits. :)

Thanks,
Stefan




Re: [PATCH v4 01/13] commit-graph: add format document

2018-02-21 Thread Stefan Beller
On Mon, Feb 19, 2018 at 10:53 AM, Derrick Stolee  wrote:
> Add document specifying the binary format for commit graphs. This
> format allows for:
>
> * New versions.
> * New hash functions and hash lengths.
> * Optional extensions.
>
> Basic header information is followed by a binary table of contents
> into "chunks" that include:
>
> * An ordered list of commit object IDs.
> * A 256-entry fanout into that list of OIDs.
> * A list of metadata for the commits.
> * A list of "large edges" to enable octopus merges.
>
> The format automatically includes two parent positions for every
> commit. This favors speed over space, since using only one position
> per commit would cause an extra level of indirection for every merge
> commit. (Octopus merges suffer from this indirection, but they are
> very rare.)
>
> Signed-off-by: Derrick Stolee 
> ---
>  Documentation/technical/commit-graph-format.txt | 90 
> +
>  1 file changed, 90 insertions(+)
>  create mode 100644 Documentation/technical/commit-graph-format.txt
>
> diff --git a/Documentation/technical/commit-graph-format.txt 
> b/Documentation/technical/commit-graph-format.txt
> new file mode 100644
> index 000..11b18b5
> --- /dev/null
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -0,0 +1,90 @@
> +Git commit graph format
> +===
> +
> +The Git commit graph stores a list of commit OIDs and some associated
> +metadata, including:
> +
> +- The generation number of the commit. Commits with no parents have
> +  generation number 1; commits with parents have generation number
> +  one more than the maximum generation number of its parents. We
> +  reserve zero as special, and can be used to mark a generation
> +  number invalid or as "not computed".
> +
> +- The root tree OID.
> +
> +- The commit date.
> +
> +- The parents of the commit, stored using positional references within
> +  the graph file.
> +
> +== graph-*.graph files have the following format:
> +
> +In order to allow extensions that add extra data to the graph, we organize
> +the body into "chunks" and provide a binary lookup table at the beginning
> +of the body. The header includes certain values, such as number of chunks,
> +hash lengths and types.
> +
> +All 4-byte numbers are in network order.
> +
> +HEADER:
> +
> +  4-byte signature:
> +  The signature is: {'C', 'G', 'P', 'H'}
> +
> +  1-byte version number:
> +  Currently, the only valid version is 1.
> +
> +  1-byte Object Id Version (1 = SHA-1)
> +
> +  1-byte number (C) of "chunks"
> +
> +  1-byte (reserved for later use)

What should clients of today do with it?
* ignore it completely [as they have no idea what it is] or
* throw hands up in the air if it is anything other than 0 ?
  [because clearly we will increment the version
   or have new information in a new chunk instead of just sneaking
   in information here?]

> +CHUNK LOOKUP:
> +
> +  (C + 1) * 12 bytes listing the table of contents for the chunks:
> +  First 4 bytes describe chunk id. Value 0 is a terminating label.
> +  Other 8 bytes provide offset in current file for chunk to start.

offset [in bytes? I could imagine having a larger granularity here,
because chunks don't sound small.]

> +  (Chunks are ordered contiguously in the file, so you can infer
> +  the length using the next chunk position if necessary.)
> +
> +  The remaining data in the body is described one chunk at a time, and
> +  these chunks may be given in any order. Chunks are required unless
> +  otherwise specified.
> +
> +CHUNK DATA:
> +
> +  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
> +  The ith entry, F[i], stores the number of OIDs with first
> +  byte at most i. Thus F[255] stores the total
> +  number of commits (N).

[ so in small repos, where there are fewer than 256 objects,
F[i] == F[i+1], for all i'th where there is no object starting with i byte]

> +  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
> +  The OIDs for all commits in the graph, sorted in ascending order.
> +
> +  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
> +* The first H bytes are for the OID of the root tree.
> +* The next 8 bytes are for the int-ids of the first two parents
> +  of the ith commit. Stores value 0x if no parent in that
> +  position. If there are more than two parents, the second value
> +  has its most-significant bit on and the other bits store an array
> +  position into the Large Edge List chunk.
> +* The next 8 bytes store the generation number of the commit and
> +  the commit time in seconds since EPOCH. The generation number
> +  uses the higher 30 bits of the first 4 bytes, while the commit
> +  time uses the 32 bits of the second 4 bytes, along with the lowest
> +  2 bits of the lowest byte, storing the 33rd and 34th bit of the
> +  commit time.
> +
> +  Large Edge List (ID: {'E', 

Re: [PATCH v4 01/13] commit-graph: add format document

2018-02-20 Thread Junio C Hamano
Derrick Stolee  writes:

>  Documentation/technical/commit-graph-format.txt | 90 
> +
>  1 file changed, 90 insertions(+)
>  create mode 100644 Documentation/technical/commit-graph-format.txt

Hopefully just a few remaining nits.  Overall I find this written
really clearly.

> +== graph-*.graph files have the following format:
> +
> +In order to allow extensions that add extra data to the graph, we organize
> +the body into "chunks" and provide a binary lookup table at the beginning
> +of the body. The header includes certain values, such as number of chunks,
> +hash lengths and types.

We no longer have lengths stored.

> + ...
> +  The remaining data in the body is described one chunk at a time, and
> +  these chunks may be given in any order. Chunks are required unless
> +  otherwise specified.

It is good that this explicitly says chunks can come in any order,
and which ones are required.  It should also say which chunk can
appear multiple times.  I think all four chunk types we currently
define can have at most one instance in a file.

> +CHUNK DATA:
> +
> +  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
> +  The ith entry, F[i], stores the number of OIDs with first
> +  byte at most i. Thus F[255] stores the total
> +  number of commits (N).
> +
> +  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
> +  The OIDs for all commits in the graph, sorted in ascending order.

Somewhere in this document, we probably would want to say that this
format allows at most (1<<31)-1 commits recorded in the file (as
CGET and EDGE uses 31-bit uint to index into this table, using MSB
for other purposes, and the all-1-bit pattern is also special), and
when we refer to "int-ids" of a commit, it is this 31-bit number
that is an index into this table.

> +  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
> +* The first H bytes are for the OID of the root tree.
> +* The next 8 bytes are for the int-ids of the first two parents
> +  of the ith commit. Stores value 0x if no parent in that
> +  position. If there are more than two parents, the second value
> +  has its most-significant bit on and the other bits store an array
> +  position into the Large Edge List chunk.
> +* The next 8 bytes store the generation number of the commit and
> +  the commit time in seconds since EPOCH. The generation number
> +  uses the higher 30 bits of the first 4 bytes, while the commit
> +  time uses the 32 bits of the second 4 bytes, along with the lowest
> +  2 bits of the lowest byte, storing the 33rd and 34th bit of the
> +  commit time.
> +
> +  Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
> +  This list of 4-byte values store the second through nth parents for
> +  all octopus merges. The second parent value in the commit data stores
> +  an array position within this list along with the most-significant bit
> +  on. Starting at that array position, iterate through this list of 
> int-ids
> +  for the parents until reaching a value with the most-significant bit 
> on.
> +  The other bits correspond to the int-id of the last parent.
> +
> +TRAILER:
> +
> + H-byte HASH-checksum of all of the above.
> +