Re: [PATCH 5.5/22] Add documentation for the index api

2013-07-11 Thread Duy Nguyen
On Thu, Jul 11, 2013 at 6:42 PM, Duy Nguyen  wrote:
> On Thu, Jul 11, 2013 at 6:30 PM, Thomas Gummerer  wrote:
>> Duy Nguyen  writes:
>> Hrm, I played around a bit with this idea, but I couldn't figure out how
>> to make it work.  For it to work we would still have to load some
>> entries in a directory at least?  Or is there a way to match the
>> directories, which I just haven't figured out yet?
>
> Yes you have to load some entries first. Even if a directory does not
> match, we only know until at least the first file in the directory. OK
> there might be problems because tree_entry_interesting expects all
> entries in a directory to be memcmp sorted, without trailing slash for
> subdirectories. I need to check again if v5 sort order is compatible..

Not gonna work (at least not simple) because we have to mix
directories and files again. The way directory entries are ordered
makes it hard (or less efficient) to get the list of immediate subdirs
of a dir. I think I understand now why you need adjusted_pathspec..
--
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5.5/22] Add documentation for the index api

2013-07-11 Thread Duy Nguyen
On Thu, Jul 11, 2013 at 6:30 PM, Thomas Gummerer  wrote:
> Duy Nguyen  writes:
>
>> On Wed, Jul 10, 2013 at 3:10 AM, Thomas Gummerer  
>> wrote:
 If you happen to know that certain entries match the given pathspec,
 you could help the caller avoid match_pathspec'ing again by set a bit
 in ce_flags.
>>>
>>> I currently don't know which entries do match the pathspec from just
>>> reading the index file, additional calls would be needed.  I don't think
>>> that would be worth the overhead.
>>
>> Yeah I now see that you select what to load in v5 with the adjusted
>> pathspec, not the input pathspec. Originally I thought you match the
>> input pathspec against every file entry in the index :P Your adjusted
>> pathspec looks like what common_prefix is for. It's cheaper than
>> creating adjusted_pathspec from match_pathspec and reduces loading in
>> major cases, where glob is not used.
>>
>> Still, creating an adjusted pathspec this way looks iffy. You need to
>> understand pathspec in order to strip the filename part out to match
>> the directory match only. An alternative is use
>> tree_entry_interesting. It goes along well with tree traversal and can
>> be used to match directories with original pathspec. Once you see it
>> matches an entry in a directory, you could skip matching the rest of
>> the files and load the whole directory. read_index_filtered_v5 and
>> read_entries may need some tweaking though. I'll try it and post a
>> patch later if I succeed.
>
> Hrm, I played around a bit with this idea, but I couldn't figure out how
> to make it work.  For it to work we would still have to load some
> entries in a directory at least?  Or is there a way to match the
> directories, which I just haven't figured out yet?

Yes you have to load some entries first. Even if a directory does not
match, we only know until at least the first file in the directory. OK
there might be problems because tree_entry_interesting expects all
entries in a directory to be memcmp sorted, without trailing slash for
subdirectories. I need to check again if v5 sort order is compatible..
--
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5.5/22] Add documentation for the index api

2013-07-11 Thread Thomas Gummerer
Duy Nguyen  writes:

> On Wed, Jul 10, 2013 at 3:10 AM, Thomas Gummerer  wrote:
>>> If you happen to know that certain entries match the given pathspec,
>>> you could help the caller avoid match_pathspec'ing again by set a bit
>>> in ce_flags.
>>
>> I currently don't know which entries do match the pathspec from just
>> reading the index file, additional calls would be needed.  I don't think
>> that would be worth the overhead.
>
> Yeah I now see that you select what to load in v5 with the adjusted
> pathspec, not the input pathspec. Originally I thought you match the
> input pathspec against every file entry in the index :P Your adjusted
> pathspec looks like what common_prefix is for. It's cheaper than
> creating adjusted_pathspec from match_pathspec and reduces loading in
> major cases, where glob is not used.
>
> Still, creating an adjusted pathspec this way looks iffy. You need to
> understand pathspec in order to strip the filename part out to match
> the directory match only. An alternative is use
> tree_entry_interesting. It goes along well with tree traversal and can
> be used to match directories with original pathspec. Once you see it
> matches an entry in a directory, you could skip matching the rest of
> the files and load the whole directory. read_index_filtered_v5 and
> read_entries may need some tweaking though. I'll try it and post a
> patch later if I succeed.

Hrm, I played around a bit with this idea, but I couldn't figure out how
to make it work.  For it to work we would still have to load some
entries in a directory at least?  Or is there a way to match the
directories, which I just haven't figured out yet?

>>> To know which entry exists in the index and which is
>>> new, use another flag. Most reader code won't change if we do it this
>>> way, all match_pathspec() remain where they are.
>>
>> Hrm you mean to know which cache entries are added (or changed) in the
>> in-memory index and will have to be written later?  I'm not sure I
>> understand correctly what you mean here.
>
> Oh.. The "to know.." sentence was nonsense. We probably don't need to
> know. We may track changed entries for partial writing, but let's
> leave that out for now.

Ok, makes sense.

 +`index_change_filter_opts(opts)`::
 +   This function again has a slightly different functionality for
 +   index-v2 and index-v5.
 +
 +   For index-v2 it simply changes the filter_opts, so
 +   for_each_index_entry uses the changed index_opts, to iterate
 +   over a different set of cache entries.
 +
 +   For index-v5 it refreshes the index if the filter_opts have
 +   changed and sets the new filter_opts in the index state, again
 +   to iterate over a different set of cache entries as with
 +   index-v2.
 +
 +   This has some optimization potential, in the case that the
 +   opts get stricter (less of the index should be read) it
 +   doesn't have to reload anything, but currently does.
>>>
>>> The only use case I see so far is converting a partial index_state
>>> back to a full one. Apart from doing so in order to write the new
>>> index, I think some operation (like rename tracking in diff or
>>> unpack-trees) may expect full index. I think we should support that. I
>>> doubt we need to change pathspec to something different than the one
>>> we used to load the index. When a user passes a pathspec to a command,
>>> the user expects the command to operate on that set only, not outside.
>>
>> One application was in ls-files, where we strip the trailing slash from
>> the pathspecs for submodules.  But when we let the caller filter the
>> rest out it's not needed anymore.  We load all entries without the
>> trailing slash anyway.
>
> That submodule trailing slash stripping code will be moved away soon
> (I've been working on it for some time now). There's similar code in
> pathspec.c. I hope by the time this series becomes a candidate for
> 'next', those pathspec manipulation is already gone. For
> strip_trailing_slash_from_submodules, peeking in index file for a few
> entries is probably ok. For check_path_for_gitlink, full index is
> loaded until we figure out a clever way.

Ah great, for now I'll just not use the for_each_index_entry function in
ls-files, and then change the code later once the stripping code is
moved away.

>>> Some thoughts about the writing api.
>>>
>>> In think we should avoid automatically converting partial index into a
>>> full one before writing. Push that back to the caller and die() when
>>> asked to update partial index. They know at what point the index may
>>> be updated and even what part of it may be updated. I think all
>>> commands fall into two categories, tree-wide updates (merge,
>>> checkout...) and limited by the user-given pathspec. "what part to be
>>> updated" is not so hard to determine.
>>
>> Hrm this is only true if index entries are added or removed, not if they
>> are

Re: [PATCH 5.5/22] Add documentation for the index api

2013-07-09 Thread Duy Nguyen
On Wed, Jul 10, 2013 at 3:10 AM, Thomas Gummerer  wrote:
>> If you happen to know that certain entries match the given pathspec,
>> you could help the caller avoid match_pathspec'ing again by set a bit
>> in ce_flags.
>
> I currently don't know which entries do match the pathspec from just
> reading the index file, additional calls would be needed.  I don't think
> that would be worth the overhead.

Yeah I now see that you select what to load in v5 with the adjusted
pathspec, not the input pathspec. Originally I thought you match the
input pathspec against every file entry in the index :P Your adjusted
pathspec looks like what common_prefix is for. It's cheaper than
creating adjusted_pathspec from match_pathspec and reduces loading in
major cases, where glob is not used.

Still, creating an adjusted pathspec this way looks iffy. You need to
understand pathspec in order to strip the filename part out to match
the directory match only. An alternative is use
tree_entry_interesting. It goes along well with tree traversal and can
be used to match directories with original pathspec. Once you see it
matches an entry in a directory, you could skip matching the rest of
the files and load the whole directory. read_index_filtered_v5 and
read_entries may need some tweaking though. I'll try it and post a
patch later if I succeed.

>> To know which entry exists in the index and which is
>> new, use another flag. Most reader code won't change if we do it this
>> way, all match_pathspec() remain where they are.
>
> Hrm you mean to know which cache entries are added (or changed) in the
> in-memory index and will have to be written later?  I'm not sure I
> understand correctly what you mean here.

Oh.. The "to know.." sentence was nonsense. We probably don't need to
know. We may track changed entries for partial writing, but let's
leave that out for now.

>>> +`index_change_filter_opts(opts)`::
>>> +   This function again has a slightly different functionality for
>>> +   index-v2 and index-v5.
>>> +
>>> +   For index-v2 it simply changes the filter_opts, so
>>> +   for_each_index_entry uses the changed index_opts, to iterate
>>> +   over a different set of cache entries.
>>> +
>>> +   For index-v5 it refreshes the index if the filter_opts have
>>> +   changed and sets the new filter_opts in the index state, again
>>> +   to iterate over a different set of cache entries as with
>>> +   index-v2.
>>> +
>>> +   This has some optimization potential, in the case that the
>>> +   opts get stricter (less of the index should be read) it
>>> +   doesn't have to reload anything, but currently does.
>>
>> The only use case I see so far is converting a partial index_state
>> back to a full one. Apart from doing so in order to write the new
>> index, I think some operation (like rename tracking in diff or
>> unpack-trees) may expect full index. I think we should support that. I
>> doubt we need to change pathspec to something different than the one
>> we used to load the index. When a user passes a pathspec to a command,
>> the user expects the command to operate on that set only, not outside.
>
> One application was in ls-files, where we strip the trailing slash from
> the pathspecs for submodules.  But when we let the caller filter the
> rest out it's not needed anymore.  We load all entries without the
> trailing slash anyway.

That submodule trailing slash stripping code will be moved away soon
(I've been working on it for some time now). There's similar code in
pathspec.c. I hope by the time this series becomes a candidate for
'next', those pathspec manipulation is already gone. For
strip_trailing_slash_from_submodules, peeking in index file for a few
entries is probably ok. For check_path_for_gitlink, full index is
loaded until we figure out a clever way.

>> Some thoughts about the writing api.
>>
>> In think we should avoid automatically converting partial index into a
>> full one before writing. Push that back to the caller and die() when
>> asked to update partial index. They know at what point the index may
>> be updated and even what part of it may be updated. I think all
>> commands fall into two categories, tree-wide updates (merge,
>> checkout...) and limited by the user-given pathspec. "what part to be
>> updated" is not so hard to determine.
>
> Hrm this is only true if index entries are added or removed, not if they
> are only changed.  If they are only changed we can write a partially
> read index once we have partial writing.

Yep. We can detect if changes are updates only, no additions nor
removals. If so do partial write, else full write. These little
details are hidden from the user, as long as they keep their promise
about read/write regions.

> For now it would make sense to just die() though, until we have that in place.

Agreed.
--
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More maj

Re: [PATCH 5.5/22] Add documentation for the index api

2013-07-09 Thread Thomas Gummerer
Duy Nguyen  writes:

> On Tue, Jul 9, 2013 at 3:54 AM, Thomas Gummerer  wrote:
>> As promised, a draft for a documentation for the index api as it is in
>> this series.
>
> First of all, it may be a good idea to acknowledge
> index_state->cache[] as part of the API for now. Not hiding it
> simplifies a few things (no need for new next_ce field, no worries
> about rewinding in unpack-trees..). Supporting partial loading (and
> maybe partial update in some cases) with this API and
> index_state->cache[] part of the API are good enough for me. We can do
> another tree-based API or something update later when it's formed (I
> looked at your index-v5api branch but I don't think a tree-based api
> was there, my concern is how much extra head pre-v5 has to pay to use
> tree-based api).

Yes, I think you're right, that simplifies everything a lot, while we
still can have partial loading.  Hiding index_state->cache[] was mainly
thought for future changes for the in-memory format, but I think that
will not be happening for a while.

>> +`read_index_filtered(opts)`::
>> +   This method behaves differently for index-v2 and index-v5.
>> +
>> +   For index-v2 it simply reads the whole index as read_index()
>> +   does, so we are sure we don't have to reload anything if the
>> +   user wants a different filter.  It also sets the filter_opts
>> +   in the index_state, which is used to limit the results when
>> +   iterating over the index with for_each_index_entry().
>> +
>> +   The whole index is read to avoid the need to eventually
>> +   re-read the index later, because the performance is no
>> +   different when reading it partially.
>> +
>> +   For index-v5 it creates an adjusted_pathspec to filter the
>> +   reading.  First all the directory entries are read and then
>> +   the cache_entries in the directories that match the adjusted
>> +   pathspec are read.  The filter_opts in the index_state are set
>> +   to filter out the rest of the cache_entries that are matched
>> +   by the adjusted pathspec but not by the pathspec given.  The
>> +   rest of the index entries are filtered out when iterating over
>> +   the cache with for_each_index_entries.
>
> You can state in the API that the input pathspec is used as a hint to
> load only a portion of the index. read_index_filtered may load _more_
> than necessary. It's the caller's responsibility to verify again which
> is matched and which is not. That's how read_directory is done. I
> think it gives you more liberty in loading strategy. It's already true
> for v2 because full index is loaded regardless of the given pathspec.
> In the end, we have a linear list (from public view) of cache entries,
> accessible via index_state->cache[].

Yes, and it's also partly true for index-v5, as the full content of a
directory is loaded even if only some files it it match the pathspec
that's given.

> If you happen to know that certain entries match the given pathspec,
> you could help the caller avoid match_pathspec'ing again by set a bit
> in ce_flags.

I currently don't know which entries do match the pathspec from just
reading the index file, additional calls would be needed.  I don't think
that would be worth the overhead.

> To know which entry exists in the index and which is
> new, use another flag. Most reader code won't change if we do it this
> way, all match_pathspec() remain where they are.

Hrm you mean to know which cache entries are added (or changed) in the
in-memory index and will have to be written later?  I'm not sure I
understand correctly what you mean here.

>> +`for_each_index_entry(fn, cb_data)`::
>> +   Iterates over all cache_entries in the index filtered by
>> +   filter_opts in the index_stat.  For each cache entry fn is
>> +   executed with cb_data as callback data.  From within the loop
>> +   do `return 0` to continue, or `return 1` to break the loop.
>
> Because we don't attempt to hide index_state->cache[], this one may be
> for convenience, the user is not required to convert to it. Actually I
> think this may be slower because of the cost of calling function
> pointer.

Yes right, I think you're right.  In fact I just tested it, and it's
slightly slower.

I still think it would make sense to keep it around, for the callers
that want the cache filtered exactly by the filter_opts, for convenience
as you said.

>> +`next_index_entry(ce)`::
>> +   Returns the cache_entry that follows after ce
>
> next_ce field and this method may be gone too, just access 
> index_state->cache[]

Yes, this makes no sense when we're not hiding index_state->cache[].
The same goes for the get_index_entry_by_name function, which is
essentially the same as using index_name_pos and then getting the cache
entry from index_state->cache[].

>> +`index_change_filter_opts(opts)`::
>> +   This function again has a slightly different functionality for
>> +   index-v2 and index-v5.
>> +

Re: [PATCH 5.5/22] Add documentation for the index api

2013-07-09 Thread Duy Nguyen
On Tue, Jul 9, 2013 at 3:54 AM, Thomas Gummerer  wrote:
> As promised, a draft for a documentation for the index api as it is in
> this series.

First of all, it may be a good idea to acknowledge
index_state->cache[] as part of the API for now. Not hiding it
simplifies a few things (no need for new next_ce field, no worries
about rewinding in unpack-trees..). Supporting partial loading (and
maybe partial update in some cases) with this API and
index_state->cache[] part of the API are good enough for me. We can do
another tree-based API or something update later when it's formed (I
looked at your index-v5api branch but I don't think a tree-based api
was there, my concern is how much extra head pre-v5 has to pay to use
tree-based api).

> +`read_index_filtered(opts)`::
> +   This method behaves differently for index-v2 and index-v5.
> +
> +   For index-v2 it simply reads the whole index as read_index()
> +   does, so we are sure we don't have to reload anything if the
> +   user wants a different filter.  It also sets the filter_opts
> +   in the index_state, which is used to limit the results when
> +   iterating over the index with for_each_index_entry().
> +
> +   The whole index is read to avoid the need to eventually
> +   re-read the index later, because the performance is no
> +   different when reading it partially.
> +
> +   For index-v5 it creates an adjusted_pathspec to filter the
> +   reading.  First all the directory entries are read and then
> +   the cache_entries in the directories that match the adjusted
> +   pathspec are read.  The filter_opts in the index_state are set
> +   to filter out the rest of the cache_entries that are matched
> +   by the adjusted pathspec but not by the pathspec given.  The
> +   rest of the index entries are filtered out when iterating over
> +   the cache with for_each_index_entries.

You can state in the API that the input pathspec is used as a hint to
load only a portion of the index. read_index_filtered may load _more_
than necessary. It's the caller's responsibility to verify again which
is matched and which is not. That's how read_directory is done. I
think it gives you more liberty in loading strategy. It's already true
for v2 because full index is loaded regardless of the given pathspec.
In the end, we have a linear list (from public view) of cache entries,
accessible via index_state->cache[].

If you happen to know that certain entries match the given pathspec,
you could help the caller avoid match_pathspec'ing again by set a bit
in ce_flags.  To know which entry exists in the index and which is
new, use another flag. Most reader code won't change if we do it this
way, all match_pathspec() remain where they are.

> +`for_each_index_entry(fn, cb_data)`::
> +   Iterates over all cache_entries in the index filtered by
> +   filter_opts in the index_stat.  For each cache entry fn is
> +   executed with cb_data as callback data.  From within the loop
> +   do `return 0` to continue, or `return 1` to break the loop.

Because we don't attempt to hide index_state->cache[], this one may be
for convenience, the user is not required to convert to it. Actually I
think this may be slower because of the cost of calling function
pointer.

> +`next_index_entry(ce)`::
> +   Returns the cache_entry that follows after ce

next_ce field and this method may be gone too, just access index_state->cache[]

> +`index_change_filter_opts(opts)`::
> +   This function again has a slightly different functionality for
> +   index-v2 and index-v5.
> +
> +   For index-v2 it simply changes the filter_opts, so
> +   for_each_index_entry uses the changed index_opts, to iterate
> +   over a different set of cache entries.
> +
> +   For index-v5 it refreshes the index if the filter_opts have
> +   changed and sets the new filter_opts in the index state, again
> +   to iterate over a different set of cache entries as with
> +   index-v2.
> +
> +   This has some optimization potential, in the case that the
> +   opts get stricter (less of the index should be read) it
> +   doesn't have to reload anything, but currently does.

The only use case I see so far is converting a partial index_state
back to a full one. Apart from doing so in order to write the new
index, I think some operation (like rename tracking in diff or
unpack-trees) may expect full index. I think we should support that. I
doubt we need to change pathspec to something different than the one
we used to load the index. When a user passes a pathspec to a command,
the user expects the command to operate on that set only, not outside.

If you take the input pathspec at loading just as a hint, you could
load all related directory blocks and all files in those blocks, so
that expanding to full index is simply adding more files from missing
directory blocks (and their files). An advantage of not stri