Re: Understanding pack format

2018-11-06 Thread Duy Nguyen
On Tue, Nov 6, 2018 at 3:23 AM Farhan Khan wrote:
> To follow up from the other day, I have been reading the code that
> retrieves the pack entry for the past 3 days without much success.
> There are quite a few abstractions, and I get lost halfway down
> the line.

Jeff already gave you some pointers. This is just a side note.

I think it's easier to run the code under a debugger and see what it
does than to just read it. You can create a repo with just one blob to
have better control over it (small packs also make it possible to
examine them with a hex editor in parallel), e.g.

git init foo
cd foo
echo hello >file
git add file
git repack -ad
gdb --args git show :./file

then set a breakpoint on some interesting functions (perhaps one of
those Jeff pointed out).
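
If you want to cross-check what the hex editor shows at the very start
of the pack, here is a minimal Python sketch. It is not git's own code,
and parse_pack_header is a made-up name. It assumes the layout described
in pack-format.txt: the 4-byte PACK signature, then a 4-byte version and
a 4-byte object count, both in network byte order.

```python
import struct


def parse_pack_header(data: bytes):
    """Return (version, object_count) from the first 12 bytes of a pack."""
    if len(data) < 12:
        raise ValueError("a pack header is 12 bytes long")
    # ">4sII": big-endian, 4-byte signature, two unsigned 32-bit integers.
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"PACK":
        raise ValueError("bad signature, not a packfile")
    return version, count


# Hand-built header for illustration: "PACK", version 2, one object.
header = b"PACK" + struct.pack(">II", 2, 1)
print(parse_pack_header(header))  # (2, 1)
```

To try it on the repo above, read the first 12 bytes of the .pack file
under .git/objects/pack/ and pass them in.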
-- 
Duy


Re: Understanding pack format

2018-11-05 Thread Jeff King
On Mon, Nov 05, 2018 at 09:23:45PM -0500, Farhan Khan wrote:

> I am trying to identify where the content from a pack comes from. I
> traced it back to sha1-file.c:read_object(), which will return the
> 'content'. I want to know where the 'content' comes from, which seems
> to come from sha1-file.c:oid_object_info_extended(). This goes into
> packfile.c:find_pack_entry(), but from here I get lost. I do not
> understand what is happening.
> 
> How does it retrieve the pack content? I am lost here. Please assist.
> This is in the technical git documentation, but it was not clear.

After find_pack_entry() tells us the object is in a pack, we end up in
packed_object_info(). Depending on what the caller is asking for, there
are a couple of different strategies (because we try to avoid loading
the whole object if we don't need it).

Probably the one you're interested in is just grabbing the content,
which happens via cache_or_unpack_entry(). The cached case is less
interesting, so try unpack_entry(), which is what actually reads the
bytes out of the packfile.
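
For an entry that is not a delta, the bytes being read there can be
sketched roughly as follows. This is a Python illustration of the entry
layout in pack-format.txt, not a transcription of the C code; the
function names are made up. It assumes the entry starts with a
variable-length header (3-bit type and 4 low size bits in the first
byte, 7 more size bits in each continuation byte) followed by
zlib-deflated content.

```python
import zlib

# Object type numbers as listed in pack-format.txt.
OBJ_TYPES = {1: "commit", 2: "tree", 3: "blob", 4: "tag",
             6: "ofs_delta", 7: "ref_delta"}


def read_entry_header(buf: bytes, pos: int):
    """Decode the type/size header; return (type, size, next_pos)."""
    byte = buf[pos]
    pos += 1
    obj_type = (byte >> 4) & 0x7
    size = byte & 0x0F
    shift = 4
    while byte & 0x80:  # MSB set: another size byte follows
        byte = buf[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    return obj_type, size, pos


def read_plain_entry(buf: bytes, pos: int):
    """Return (type_name, content) for a non-delta entry at pos."""
    obj_type, size, pos = read_entry_header(buf, pos)
    if obj_type not in (1, 2, 3, 4):
        raise NotImplementedError("delta entries need their base object")
    # The deflated stream ends on its own; trailing bytes are ignored.
    content = zlib.decompressobj().decompress(buf[pos:])
    if len(content) != size:
        raise ValueError("inflated size does not match the header")
    return OBJ_TYPES[obj_type], content


# Hand-built entry: a 6-byte blob, so the header is the single byte 0x36.
entry = bytes([0x36]) + zlib.compress(b"hello\n")
print(read_plain_entry(entry, 0))  # ('blob', b'hello\n')
```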

-Peff


Re: Understanding pack format

2018-11-05 Thread Farhan Khan
On Fri, Nov 2, 2018 at 12:00 PM Duy Nguyen wrote:
>
> On Fri, Nov 2, 2018 at 7:19 AM Junio C Hamano wrote:
> >
> > Farhan Khan writes:
> >
> > > ...Where is this in the git code? That might
> > > serve as a good guide.
> >
> > There are two major codepaths.  One is used at runtime, giving us
> > random access into the packfile with the help of the .idx file.  The
> > other is used when receiving a new packstream to create an .idx
> > file.
>
> The third path is copying/reusing objects in
> builtin/pack-objects.c::write_reuse_object(). Since it's mostly
> encoding the header of new objects in the pack, it could also be a good
> starting point. Then you can move to write_no_reuse_object() and see
> how the data is encoded, deltified or not (yeah not parsed, but I
> think it's more or less the same thing conceptually).
> --
> Duy

Hi all,

To follow up from the other day, I have been reading the code that
retrieves the pack entry for the past 3 days without much success.
There are quite a few abstractions, and I get lost halfway down
the line.

I am trying to identify where the content from a pack comes from. I
traced it back to sha1-file.c:read_object(), which will return the
'content'. I want to know where the 'content' comes from, which seems
to come from sha1-file.c:oid_object_info_extended(). This goes into
packfile.c:find_pack_entry(), but from here I get lost. I do not
understand what is happening.

How does it retrieve the pack content? I am lost here. Please assist.
This is in the technical git documentation, but it was not clear.

Thank you,

--
Farhan Khan
PGP Fingerprint: B28D 2726 E2BC A97E 3854 5ABE 9A9F 00BC D525 16EE


Re: Understanding pack format

2018-11-02 Thread Duy Nguyen
On Fri, Nov 2, 2018 at 7:19 AM Junio C Hamano wrote:
>
> Farhan Khan writes:
>
> > ...Where is this in the git code? That might
> > serve as a good guide.
>
> There are two major codepaths.  One is used at runtime, giving us
> random access into the packfile with the help of the .idx file.  The
> other is used when receiving a new packstream to create an .idx
> file.

The third path is copying/reusing objects in
builtin/pack-objects.c::write_reuse_object(). Since it's mostly
encoding the header of new objects in the pack, it could also be a good
starting point. Then you can move to write_no_reuse_object() and see
how the data is encoded, deltified or not (yeah not parsed, but I
think it's more or less the same thing conceptually).
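
To make the "encoding the header" part concrete, here is a small Python
sketch of the per-entry header encoding as pack-format.txt describes it.
It is an illustration only, not what write_reuse_object() literally
does, and encode_entry_header is a made-up name. The assumption is that
the first byte carries the 3-bit type and the low 4 bits of the size,
and each further byte carries 7 more size bits, with the MSB marking
"more bytes follow".

```python
def encode_entry_header(obj_type: int, size: int) -> bytes:
    """Encode an object type (1..7) and inflated size as a pack entry header."""
    if not 1 <= obj_type <= 7 or size < 0:
        raise ValueError("type must be 1..7 and size non-negative")
    byte = (obj_type << 4) | (size & 0x0F)
    size >>= 4
    out = bytearray()
    while size:
        out.append(byte | 0x80)  # set MSB: another size byte follows
        byte = size & 0x7F
        size >>= 7
    out.append(byte)
    return bytes(out)


# Type 3 is a blob in pack-format.txt.
print(encode_entry_header(3, 6).hex())    # 36
print(encode_entry_header(3, 300).hex())  # bc12
```

Reading an entry is the same procedure run in reverse, which is why the
writing side can serve as a guide to the parsing side.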
-- 
Duy


Re: Understanding pack format

2018-11-02 Thread Duy Nguyen
On Fri, Nov 2, 2018 at 6:26 AM Farhan Khan wrote:
>
> Hi all,
>
> I am trying to understand the pack file format and have been reading
> the documentation, specifically https://git-scm.com/docs/pack-format
> (which is in git's own git repository as
> "Documentation/technical/pack-format.txt"). I see that the file starts
> with the "PACK" signature, followed by the 4-byte version and 4-byte
> number of objects. After this, the documentation speaks about
> Undeltified and Deltified representations. I understand conceptually
> what each is, but do not know specifically how git parses it out.

If by "it" you mean the deltified representations, I think it's
actually documented in pack-format.txt. If you prefer C over English,
look at patch-delta.c.
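
For those who prefer neither, a Python sketch of applying a delta,
written from the description in pack-format.txt rather than copied from
patch-delta.c (the function names are made up). It assumes the delta
starts with two variable-length sizes (base and result), followed by
instructions: a byte with the MSB set copies a range from the base, with
the low 4 bits selecting which offset bytes are present and the next 3
bits which size bytes; any other non-zero byte inserts that many literal
bytes.

```python
def read_size(delta: bytes, pos: int):
    """Read a little-endian base-128 size; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = delta[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos


def apply_delta(base: bytes, delta: bytes) -> bytes:
    base_size, pos = read_size(delta, 0)
    result_size, pos = read_size(delta, pos)
    if base_size != len(base):
        raise ValueError("delta was made against a different base")
    out = bytearray()
    while pos < len(delta):
        cmd = delta[pos]
        pos += 1
        if cmd & 0x80:  # copy from base
            offset = size = 0
            for i in range(4):  # bits 0..3: offset bytes present
                if cmd & (1 << i):
                    offset |= delta[pos] << (8 * i)
                    pos += 1
            for i in range(3):  # bits 4..6: size bytes present
                if cmd & (0x10 << i):
                    size |= delta[pos] << (8 * i)
                    pos += 1
            if size == 0:
                size = 0x10000  # a size of zero means 64 KiB
            if offset + size > len(base):
                raise ValueError("copy runs past the end of the base")
            out += base[offset:offset + size]
        elif cmd:  # insert the next `cmd` literal bytes
            out += delta[pos:pos + cmd]
            pos += cmd
        else:
            raise ValueError("instruction byte 0 is reserved")
    if len(out) != result_size:
        raise ValueError("result size does not match the delta header")
    return bytes(out)


# Hand-built delta: copy 6 bytes from offset 0, then insert "git\n".
delta = bytes([12, 10, 0x90, 6, 4]) + b"git\n"
print(apply_delta(b"hello world\n", delta))  # b'hello git\n'
```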

-- 
Duy


Re: Understanding pack format

2018-11-02 Thread Junio C Hamano
Farhan Khan writes:

> ...Where is this in the git code? That might
> serve as a good guide.

There are two major codepaths.  One is used at runtime, giving us
random access into the packfile with the help of the .idx file.  The
other is used when receiving a new packstream to create an .idx
file.

Personally I find the latter a bit too dense for those who are new
to the codebase, and the former would probably be easier to grok.

Start from sha1-file.c::read_object(), which will eventually lead
you to oid_object_info_extended() that essentially boils down to

 - a call to find_pack_entry() with the object name, and then

 - a call to packed_object_info() with the pack entry found earlier.

Following packfile.c::packed_object_info() will lead you to
cache_or_unpack_entry(); the unpack_entry() function is where all
the action takes place: it reads one object's worth of data from the
packstream and reconstructs the object out of its deltified
representation.
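
As a rough picture of the random-access half, finding where an object
lives in the pack, here is a Python sketch of a lookup in a version 2
.idx file. It is written from the idx description in pack-format.txt,
not from find_pack_entry() itself; find_offset is a made-up name, and
20-byte SHA-1 object names are assumed. The layout assumed is: an 8-byte
header, a 256-entry fan-out table of cumulative counts keyed by the
first byte of the name, the sorted names, a CRC32 table, a 4-byte offset
table, and an 8-byte offset table for large packs.

```python
import struct


def find_offset(idx: bytes, oid: bytes):
    """Return the pack offset of the 20-byte name oid, or None if absent."""
    if idx[:8] != b"\xfftOc" + struct.pack(">I", 2):
        raise ValueError("only version 2 .idx files are handled here")
    fanout = struct.unpack_from(">256I", idx, 8)
    total = fanout[255]
    # fanout[b] counts the names whose first byte is <= b, so names
    # starting with oid[0] occupy the slots [lo, hi) in the sorted table.
    lo = fanout[oid[0] - 1] if oid[0] else 0
    hi = fanout[oid[0]]
    names = 8 + 256 * 4
    offsets = names + total * 20 + total * 4  # skip names and CRC32s
    large = offsets + total * 4
    while lo < hi:  # plain binary search over the sorted names
        mid = (lo + hi) // 2
        name = idx[names + mid * 20:names + (mid + 1) * 20]
        if name < oid:
            lo = mid + 1
        elif name > oid:
            hi = mid
        else:
            (off,) = struct.unpack_from(">I", idx, offsets + mid * 4)
            if off & 0x80000000:  # MSB set: index into the 8-byte table
                (off,) = struct.unpack_from(
                    ">Q", idx, large + (off & 0x7FFFFFFF) * 8)
            return off
    return None
```

The returned offset is where the object's entry starts in the .pack
file, which is the position the reading code then works from.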


Understanding pack format

2018-11-01 Thread Farhan Khan
Hi all,

I am trying to understand the pack file format and have been reading
the documentation, specifically https://git-scm.com/docs/pack-format
(which is in git's own git repository as
"Documentation/technical/pack-format.txt"). I see that the file starts
with the "PACK" signature, followed by the 4-byte version and 4-byte
number of objects. After this, the documentation speaks about
Undeltified and Deltified representations. I understand conceptually
what each is, but do not know specifically how git parses it out.

Can someone please explain this to me? Is there any sample code of how
to interpret each entry? Where is this in the git code? That might
serve as a good guide.

I see a few references to "PACK_SIGNATURE", but not certain which
actually reads the data.

Thanks!
--
Farhan Khan
PGP Fingerprint: B28D 2726 E2BC A97E 3854 5ABE 9A9F 00BC D525 16EE