Re: [PATCH] object: measure time needed for resolving hash collisions

2016-09-14 Thread Jeff King
On Wed, Sep 14, 2016 at 07:01:41PM -0700, Stefan Beller wrote:

>  According to Jeff, sending patches that don't get accepted is the new hype!

It is what all the cool kids are doing. Unfortunately, it does not save
you from nitpicky reviews...;)

>   first = i = hash_obj(sha1, obj_hash_size);
> + clock_gettime(CLOCK_MONOTONIC, &time1);
>   while ((obj = obj_hash[i]) != NULL) {
>   if (!hashcmp(sha1, obj->oid.hash))
>   break;
> @@ -98,6 +131,9 @@ struct object *lookup_object(const unsigned char *sha1)
>   if (i == obj_hash_size)
>   i = 0;
>   }
> + clock_gettime(CLOCK_MONOTONIC, &time2);
> + diff(&time1, &time2, &t_diff);
> + add_time_to(&caching, &t_diff);
>   if (obj && i != first) {

I don't think this is actually measuring the time spent on collisions.
It's measuring the time we spend in hashcmp(), but that includes the
non-collision case where we find it in the first hashcmp.

Measuring _just_ the collisions is more like the patch below. In my
measurements it's more like 30ms, compared to 10s for all of the
hashcmps.

So we really aren't dealing with collisions, but rather just verifying
that our hash landed at the right spot. And _any_ data structure is
going to have to do that. If you want to make it faster, I'd try
optimizing hashcmp (and you can see why the critbit tree was slower; if
we spend so much time just on hashcmp() to make sure we're at the right
key, then making that slower with a bunch of branching is not going to
help).

I notice we still open-code hashcmp. I get a slight speedup by switching
it to memcmp(). About 2.5%, which is similar to what I showed in

  http://public-inbox.org/git/20130318073229.ga5...@sigill.intra.peff.net/

a few years ago (though it's more pronounced as a portion of the whole
now, because we've optimized some of the other bits).

The main driver there was memcmp() improvements that went into glibc
2.13 several years ago. It might be time to start assuming that memcmp()
beats our open-coded loop.

It may also be possible to really micro-optimize it on some platforms,
because we know the size in advance (I'd kind of expect the compiler to
do that, but if we're ending up in glibc memcmp then it sounds like it
is not the case).

-Peff


Re: [PATCH 2/2] use zstd zlib wrapper

2016-09-14 Thread Jeff King
On Wed, Sep 14, 2016 at 06:22:17PM -0700, Stefan Beller wrote:

> > Disappointingly, the answer seems to be "no".
> 
> After having looked at the data, I disagree with the conclusion.
> And for that I think we need to reason about the frequency
> of the operations happening.

I definitely agree that reads outnumber writes, and it's OK to have an
asymmetric tradeoff between the two. zstd5 isn't _too_ bad in that
respect. I guess I was just disappointed that the pack size was still
bigger, as I was really hoping to see some speed tradeoff without
getting a worse pack.

The other thing to weigh against is "if we were designing it today"
versus "is it worth the compatibility headaches now". A 6% improvement
in "rev-list --objects" is not that amazing for a data format change.
Bitmaps were an _easier_ data format change and are more like a 99%
speedup.

They do not apply to every operations, but we may be able to do similar
space/time tradeoffs that are easier to handle in terms of backwards
compatibility, and which yield bigger results.

-Peff


Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields

2016-09-14 Thread Kevin Daudt
On Wed, Sep 14, 2016 at 12:38:20PM -0700, Jeff King wrote:
> On Wed, Sep 14, 2016 at 12:30:06PM -0700, Junio C Hamano wrote:
> 
> > Another small thing I am not sure about is if the \ quoting can hide
> > an embedded newline in the author name.  Would we end up turning
> > 
> > From: "Jeff \
> > King" 
> > 
> > or somesuch into
> > 
> > Author: Jeff
> > King
> > Email: p...@peff.net
> > 
> > ;-)
> 
> Heh, yeah. That is another reason to clean up and sanitize as much as
> possible before stuffing it into another text format that will be
> parsed.

A quoted string cannot contain newlines according to the RFC, so I think
we don't need to care about that.

> 
> > So let's roll the \" -> " into mailinfo.
> > 
> > I am not sure if we also should remove the surrounding "", i.e. we
> > currently do not turn this
> > 
> > From: "Jeff King" 
> > 
> > into this:
> > 
> > Author: Jeff King
> > Email: p...@peff.net
> > 
> > I think we probably should, and remove the one that does so from the
> > reader.
> 
> I think you have to, or else you cannot tell the difference between
> surrounding quotes that need to be stripped, and ones that were
> backslash-escaped. Like:
> 
>   From: "Jeff King" 
>   From: \"Jeff King\" 
> 
> which would both become:
> 
>   Author: "Jeff King"
>   Email: p...@peff.net
> 
> though I am not sure the latter one is actually valid; you might need to
> be inside syntactic quotes in order to include backslashed quotes. I
> haven't read rfc2822 carefully recently enough to know.
> 
> Anyway, I think that:
> 
>   From: One "Two \"Three\" Four" Five
> 
> may also be valid. So the quote-stripping in the reader is not just "at
> the outside", but may need to handle interior syntactic quotes, too. So
> it really makes sense for me to clean and sanitize as much as possible
> in one step, and then make the parser of mailinfo as dumb as possible.
> 

Makes sense, the current itteration of my patch already strips exterior
quotes, no matter where they happen.

I will send a patch soon.


git add --intent-to-add silently creates empty commits

2016-09-14 Thread Aviv Eyal
Hello,

I think this is an unintended behavior, but I might be wrong:

Using `git add -N` allows creating of empty commits:


git init test && cd test
echo text > file
git add --intent-to-add file
git commit -m 'Empty commit'
echo $?# prints 0
git log -1 -p --format='' HEAD | wc -l # prints 0
git status -s  # prints `AM file`


I'd expect `git commit` to error out instead of producing an empty commit.

I've seen this with git 2.8.1 and 2.10.0.129.g35f6318

Thanks,
Aviv


[PATCH] object: measure time needed for resolving hash collisions

2016-09-14 Thread Stefan Beller
$ time ./git rev-list HEAD --count
44271
print_time_spent_object: measure time needed for resolving hash collisions

This shows that we roughly spent a third of the time in resolving
hash collisions:

I am using linux.git to measure:

$ time git rev-list --objects --count HEAD >/dev/null
print_time_spent_resolving_hash_collisions 9:445591664

real0m31.733s
user0m31.220s
sys 0m0.463s

For fun I reverted 9a414486d9f0:

$ time git rev-list --objects --count HEAD >/dev/null
print_time_spent_resolving_hash_collisions 14:338605504

real0m37.008s
user0m36.524s
sys 0m0.423s

The 17% that Jeff measured in there, equals to 1-31.733/37.008 = 0.14
in these measurements.
The time spent in collision resolving went down by 1/3 on itself
(14.33s to 9.44s), so there is still room for improvement.

Signed-off-by: Stefan Beller 
---

 According to Jeff, sending patches that don't get accepted is the new hype!

 builtin/rev-list.c |  2 ++
 object.c   | 36 
 object.h   |  2 ++
 3 files changed, 40 insertions(+)

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 8479f6e..d0e5922 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -412,5 +412,7 @@ int cmd_rev_list(int argc, const char **argv, const char 
*prefix)
printf("%d\n", revs.count_left + revs.count_right);
}
 
+   print_time_spent_resolving_hash_collisions();
+
return 0;
 }
diff --git a/object.c b/object.c
index 67d9a9e..e9e73e0 100644
--- a/object.c
+++ b/object.c
@@ -5,9 +5,40 @@
 #include "commit.h"
 #include "tag.h"
 
+#include 
+
 static struct object **obj_hash;
 static int nr_objs, obj_hash_size;
 
+struct timespec caching = {0, 0};
+
+void diff(struct timespec *start, struct timespec *end, struct timespec *dst)
+{
+   if ((end->tv_nsec-start->tv_nsec)<0) {
+   dst->tv_sec = end->tv_sec-start->tv_sec-1;
+   dst->tv_nsec = 10+end->tv_nsec-start->tv_nsec;
+   } else {
+   dst->tv_sec = end->tv_sec-start->tv_sec;
+   dst->tv_nsec = end->tv_nsec-start->tv_nsec;
+   }
+}
+
+void add_time_to(struct timespec *dst, struct timespec *addend)
+{
+   dst->tv_sec += addend->tv_sec;
+   dst->tv_nsec += addend->tv_nsec;
+   while (dst->tv_nsec > 10) {
+   dst->tv_nsec -= 10;
+   dst->tv_sec ++;
+   }
+}
+
+void print_time_spent_resolving_hash_collisions(void)
+{
+   fprintf(stderr, "print_time_spent_resolving_hash_collisions %ld:%9ld\n",
+   (long)caching.tv_sec, (long)caching.tv_nsec);
+}
+
 unsigned int get_max_object_index(void)
 {
return obj_hash_size;
@@ -86,11 +117,13 @@ struct object *lookup_object(const unsigned char *sha1)
 {
unsigned int i, first;
struct object *obj;
+   struct timespec time1, time2, t_diff;
 
if (!obj_hash)
return NULL;
 
first = i = hash_obj(sha1, obj_hash_size);
+   clock_gettime(CLOCK_MONOTONIC, &time1);
while ((obj = obj_hash[i]) != NULL) {
if (!hashcmp(sha1, obj->oid.hash))
break;
@@ -98,6 +131,9 @@ struct object *lookup_object(const unsigned char *sha1)
if (i == obj_hash_size)
i = 0;
}
+   clock_gettime(CLOCK_MONOTONIC, &time2);
+   diff(&time1, &time2, &t_diff);
+   add_time_to(&caching, &t_diff);
if (obj && i != first) {
/*
 * Move object to where we started to look for it so
diff --git a/object.h b/object.h
index f8b6442..ee6ab3a 100644
--- a/object.h
+++ b/object.h
@@ -56,6 +56,8 @@ extern const char *typename(unsigned int type);
 extern int type_from_string_gently(const char *str, ssize_t, int gentle);
 #define type_from_string(str) type_from_string_gently(str, -1, 0)
 
+void print_time_spent_resolving_hash_collisions(void);
+
 /*
  * Return the current number of buckets in the object hashmap.
  */
-- 
2.10.0.130.gef2bcd7.dirty



Re: [PATCH 2/2] use zstd zlib wrapper

2016-09-14 Thread Stefan Beller
On Wed, Sep 14, 2016 at 4:58 PM, Jeff King  wrote:
> There's a fancy new compression algorithm called "zstd". The
> idea is that it's supposed to get similar compression ratios
> to zlib, but with much faster compression and decompression
> times. And on top of that, a nice sliding scale to trade off
> size versus time on the compression side.
>
> The zstd site at https://facebook.github.io/zstd/ claims
> close to 3x speedup for both compression and decompression
> versus zlib, with similar compression ratios. There are
> other fast algorithms (like lz4), but they usually compress
> much worse (follow the link above for a nice table of
> results).
>
> Since any git operations that have to access objects need to
> do a zlib inflate, in theory we can speed up everything by
> using zstd. And then on the packing side, use higher
> compression levels when making on-disk packfiles (which will
> be accessed many times) and lower ones when making loose
> objects, or deflating packed objects on the fly when serving
> fetches.
>
> The catch, of course, is that it's a new incompatible
> format. This would be a pretty huge change and totally break
> backwards compatibility for git, not just on disk but
> on-the-wire as well. So my goal here was not a finished
> product but just a quick experiment to see if it did indeed
> bring the promise speedups.
>
> Disappointingly, the answer seems to be "no".

After having looked at the data, I disagree with the conclusion.
And for that I think we need to reason about the frequency
of the operations happening.

* As an enduser, happily hacking away at one repository,
  I probably do not care about the pack size on disk as much
  as I care about timing of the local operations. And I assume
  that for each repack we have about 1000 reads (log/rev-list)
  The 1000 is a wild speculation without any data to back it up.
  So as an end user I'd be happy about [zstd, ~5]
  For the end user LZ4 seems to be the best solution if it were available.

* As a service provider, I know we have a lot more reads than
  writes, and repacking is annoying. Also at that scale the disk
  isn't negligible cheap. So we need to weigh the numbers differently,
  but how? I suspect depending on the weighting it could still be
  considered beneficial to go with zstd5. (No hard numbers here)


Re: [PATCH 1/2] obj_hash: convert to a critbit tree

2016-09-14 Thread Jeff King
On Wed, Sep 14, 2016 at 05:52:06PM -0700, Stefan Beller wrote:

> > So finding "1011" involves traversing the trie: down the "1"
> > side, then the "0" side, and then check that the rest
> > matches "11".
> 
> So we stop building a tree as soon as we hit a unique data
> element (i.e. if we stick to the idea of encoding the hash along the way,
> we would have needed another node each for "11" as well as "00"
> that points to NULL and the ending data respectively.
> 
> So we stop short as soon as we have a unique.
> 
> That makes insertion very easy, because as soon as we hit
> a unique, we'd just introduce a node and add the two uniques
> left and right. (Well what happens if we were to insert
> 101010111 and 101010101 ? Both have a long prefix,
> I suppose we'd have 7 nodes and then the distiguishing
> node for those 2.)

I think it actually can represent the interior sequence as a single
node. That's what the "critical" part of critbit is for; the critical
bit is the one where we diverge. But I'm not 100% sure that it's
implemented that way.

In practice, though, sha1 would give us a pretty full tree near the
front, so I'd expect each bit to be "critical".

> > It's possible that (2) would be better if instead of a
> > critbit tree, we used a "critbyte" tree. That would involve
> > fewer node traversals, at the cost of making each node
> > larger (right now the internal nodes store 2 pointer slots;
> > they'd have to store 256 to handle a full byte). I didn't
> > try it, but I suspect it would still be slower for the same
> > reasons.
> 
> I would expect to go for a crit-"variable-length" tree instead.
> 
> The reason for this is that a higher fan out seems to be more
> beneficial in the earlier stages, e.g. we could use critbyte trees
> for the first 1-3 layers in the tree as that will have good memory
> efficiency (all 256 slots filled), but will be faster than the critbit trees
> (one lookup instead of 8 conditional jumps).

Yeah, I suspect a hybrid approach may be a better tradeoff. I encourage
you to try and measure. :)

> I guess when trying to improve the hashsets, someone tried trees
> as a collision resolver?

I don't think so. hashmap.c uses a linked list, but obj_hash just does
linear probing. Both are linear, but the latter is more memory efficient
and especially has a more compact cache footprint when resolving
collisions. The downside of open probing like that is that it's very
hard to delete an entry, but we don't care about that for obj_hash.

I don't think that improving collision resolution would help that much
for obj_hash, though. The fact that quadratic probing and cuckoo hashing
didn't yield big benefits implies that we don't spend most of our time
on collisions in the first place (OTOH, my 9a414486d9f0 that you found
does help _only_ when there are collisions, so maybe I'm wrong).

> So maybe we could have some software sided cache for hot entries?
> (I imagine a data structure consisting of 2 hash sets here, one
> hashset containing
> the complete data set, and the other hashset is a very small hashset with e.g.
> just 256(?) entries that are an LRU cache for the cache entries.
> Though this sounds like we'd be trying to outsmart the hardware... Not sure
> I'd expect gains from that)

Yeah. Basically what we want is the reverse of a bloom filter: it's OK
to be wrong, but it most be wrong to say "it's not here" when it is, not
the other way around. And so that's basically...a cache of a smaller
data-structure in front of the larger one.

But given that the hash table is already O(1)-ish, it's hard to beat
that (because remember when you have a cache miss, you then have to do
an extra lookup in the full table anyway).

I did play around with stuff like that when I was coming up with
9a414486d9f0, but was never able to make it faster. Patches welcome, of
course.

> I guess we rather want to split up the data sets on the application
> side: i.e. instead of
> having so many objects, have hash sets for e.g. blobs, trees, commits 
> separately
> and then use slightly different strategies there (different load factors?)

My gut is that would not make a big difference (and sometimes it would
be worse, because we don't know what the object type is, or we want to
make _sure_ that we don't have the object known as a different type).

> Unrelated note about hashmaps:
> I wonder if we rather want to give good initial estimates of the size
> as one improvement

My recollection is that the growth isn't really relevant. At least for
obj_hash, we do _way_ more lookups than insertions, so the only thing
that really matters is lookup speed.

But as with everything, if you can come up with improved numbers, I'm
happy to look at the patches. :)

-Peff


Re: Tracking down a segfault in delta_base_cache

2016-09-14 Thread Jeff King
On Wed, Sep 14, 2016 at 05:42:29PM -0700, Jonathon Mah wrote:

> Hi git, I've been seeing git segfault over the past few days. I'm on Mac OS X 
> 10.12, 64-bit, compiling with clang (Apple LLVM version 8.0.0 
> (clang-800.0.40)).
> [...]
> Bisect says:
> 
> 8261e1f139db3f8aa6f9fd7d98c876cbeb0f927c is the first bad commit
> commit 8261e1f139db3f8aa6f9fd7d98c876cbeb0f927c
> Author: Jeff King 
> Date:   Mon Aug 22 18:00:07 2016 -0400
> 
> delta_base_cache: use hashmap.h

Have you tried with the patch in:

  
http://public-inbox.org/git/20160912164616.vg33kldazuthf...@sigill.intra.peff.net/

?

> $ lldb /Users/jmah/Documents/Streams/git/git-log -- -u
> (lldb) target create "/Users/jmah/Documents/Streams/git/git-log"
> Current executable set to '/Users/jmah/Documents/Streams/git/git-log' 
> (x86_64).
> (lldb) settings set -- target.run-args  "-u"
> (lldb) process launch -o /dev/null
> Process 92815 launched: '/Users/jmah/Documents/Streams/git/git-log' (x86_64)
> Process 92815 stopped
> * thread #1: tid = 0x1c30677, 0x0001001bba80 
> git-log`release_delta_base_cache(ent=0xffd0) + 16 at 
> sha1_file.c:2171, queue = 'com.apple.main-thread', stop reason = 
> EXC_BAD_ACCESS (code=1, address=0x10)
> frame #0: 0x0001001bba80 
> git-log`release_delta_base_cache(ent=0xffd0) + 16 at 
> sha1_file.c:2171
>2168   
>2169   static inline void release_delta_base_cache(struct 
> delta_base_cache_entry *ent)
>2170   {
> -> 2171   free(ent->data);
>2172   detach_delta_base_cache_entry(ent);

The problems I saw with valgrind weren't here, but would explain this.
We free() the previous node, then walk forward from its "next" pointer.
On my Linux box, that happens to work, but we could be feeding total
junk to the list pointer, which would meant ent->data is junk, and
free() notices.

-Peff


Re: [PATCH 1/2] obj_hash: convert to a critbit tree

2016-09-14 Thread Stefan Beller
On Wed, Sep 14, 2016 at 4:56 PM, Jeff King  wrote:
> For operations that traverse the whole reachability graph,
> like "rev-list --objects", the obj_hash in object.c shows up
> as a hotspot. We basically have to do "nr_commits *
> size_of_tree" hash lookups, because each tree we look at, we
> need to say "have we seen this sha1 yet?" (it's a little
> more complicated because we avoid digging into sub-trees we
> know we've already seen, but it's a reasonable
> approximation).  So graph traversal operations like that are
> very sensitive to improvements in lookup time.
>
> For keys with length "m", a hash table is generally O(m) to
> compute the hash (and verify that you've found the correct
> key), and O(1) in finding the correct slot. The latter is
> subject to collisions and probing strategy, but we keep the
> load factor on the obj_hash table below 50%, which is quite
> low. And we have a good hash (we reuse the sha1s themselves,
> which are effectively random). So we'd expect relatively few
> collisions, and past experiments to tweak the probing
> strategy haven't yielded any benefit (we use linear probing;
> I tried quadratic probing at one point, and Junio tried
> cuckoo hashing).
>
> Another data structure with similar properties is sometimes
> known as a "critbit tree" (see http://cr.yp.to/critbit.html
> for discussion and history). The basic idea is a binary trie
> where each node is either an internal node, representing a
> single bit of a stored key, or a leaf node, representing one
> or more "remainder" bits. So if you were storing two bit
> sequences 1011 and "1100", you'd have three nodes (besides
> the root):
>
> (root)
> /\
>0  1
>   /\
> NULL(internal)
>  /\
> 0  1
>/\
> "11"   "00"
>
> So finding "1011" involves traversing the trie: down the "1"
> side, then the "0" side, and then check that the rest
> matches "11".

So we stop building a tree as soon as we hit a unique data
element (i.e. if we stick to the idea of encoding the hash along the way,
we would have needed another node each for "11" as well as "00"
that points to NULL and the ending data respectively.

So we stop short as soon as we have a unique.

That makes insertion very easy, because as soon as we hit
a unique, we'd just introduce a node and add the two uniques
left and right. (Well what happens if we were to insert
101010111 and 101010101 ? Both have a long prefix,
I suppose we'd have 7 nodes and then the distiguishing
node for those 2.)

>
> Looking up a key is O(m), and there's no issue with
> collisions or probing. It can use less memory than a hash
> table, because there's no load factor to care about.
>
> That's the good news. The bad news is that big-O analysis
> is not the whole story. You'll notice that we have to do a
> lot of conditional pointer-jumping to walk the trie. Whereas
> a hash table can jump right to a proposed key and do a
> memcmp() to see if we got it.
>
> So I wasn't overly optimistic that this would be any faster.
> And indeed it's not. It's about three times slower (about
> 4.5s versus 1.5s running "rev-list --objects --all" on
> git.git).
>
> The reason is, I think, a combination of:
>
>   0. We care much more about performance for this hash than
>  memory efficiency. So we keep the load factor
>  quite low, and thus reduce the number of collisions.
>
>   1. For many hash tables, computing the hash is expensive.
>  Not so here, because we are storing sha1s. So it is
>  literally just casting the first 4 bytes of the sha1 to
>  an int; we don't even look at the other bytes until the
>  collision check (and because of point 0, we don't
>  generally have to do very many collision checks during
>  our probe).
>
>   2. The collision check _does_ have to look at all 20 bytes
>  of the sha1. And we may have to do it multiple times as
>  we linearly probe the collisions. _But_ it can be done
>  with memcmp(), which is optimized to compare 32 or 64
>  bits at a time. So we our O(m) has a very nice constant
>  factor.
>
>  Whereas in the critbit tree, we pay an instruction for
>  _each_ bit we look at.
>
> It's possible that (2) would be better if instead of a
> critbit tree, we used a "critbyte" tree. That would involve
> fewer node traversals, at the cost of making each node
> larger (right now the internal nodes store 2 pointer slots;
> they'd have to store 256 to handle a full byte). I didn't
> try it, but I suspect it would still be slower for the same
> reasons.

I would expect to go for a crit-"variable-length" tree instead.

The reason for this is that a higher fan out seems to be more
beneficial in the earlier stages, e.g. we could use critbyte trees
for the first 1-3 layers in the tree as that will have good memory
efficiency (all 256 slots filled), but will be faster than the critbit trees
(one lookup instead of 8 conditi

Tracking down a segfault in delta_base_cache

2016-09-14 Thread Jonathon Mah
Hi git, I've been seeing git segfault over the past few days. I'm on Mac OS X 
10.12, 64-bit, compiling with clang (Apple LLVM version 8.0.0 (clang-800.0.40)).

I first noticed it during a checkout, then also during `log -u`. I'm still 
debugging, but wanted to give a heads-up in case anyone else is seeing this.

~/D/S/A/HLT $ git-log -u -n1000 >/dev/null
fish: 'git-log' terminated by signal SIGSEGV (Address boundary error)

~/D/S/A/HLT $ git fsck
Checking object directories: 100% (256/256), done.
fish: 'git fsck' terminated by signal SIGSEGV (Address boundary error)

~/D/S/A/HLT $ git --version
git version 2.10.0.129.g35f6318

Running git-fsck from 2.9.2 validates the repository data.

Bisect says:

8261e1f139db3f8aa6f9fd7d98c876cbeb0f927c is the first bad commit
commit 8261e1f139db3f8aa6f9fd7d98c876cbeb0f927c
Author: Jeff King 
Date:   Mon Aug 22 18:00:07 2016 -0400

delta_base_cache: use hashmap.h


Backtrace for the `log -u` case is below. I'll follow up with my progress.
-Jonathon

$ lldb /Users/jmah/Documents/Streams/git/git-log -- -u
(lldb) target create "/Users/jmah/Documents/Streams/git/git-log"
Current executable set to '/Users/jmah/Documents/Streams/git/git-log' (x86_64).
(lldb) settings set -- target.run-args  "-u"
(lldb) process launch -o /dev/null
Process 92815 launched: '/Users/jmah/Documents/Streams/git/git-log' (x86_64)
Process 92815 stopped
* thread #1: tid = 0x1c30677, 0x0001001bba80 
git-log`release_delta_base_cache(ent=0xffd0) + 16 at 
sha1_file.c:2171, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
(code=1, address=0x10)
frame #0: 0x0001001bba80 
git-log`release_delta_base_cache(ent=0xffd0) + 16 at 
sha1_file.c:2171
   2168 
   2169 static inline void release_delta_base_cache(struct 
delta_base_cache_entry *ent)
   2170 {
-> 2171 free(ent->data);
   2172 detach_delta_base_cache_entry(ent);
   2173 }
   2174 
(lldb) bt
warning: could not load any Objective-C class information. This will 
significantly reduce the quality of type information available.
* thread #1: tid = 0x1c30677, 0x0001001bba80 
git-log`release_delta_base_cache(ent=0xffd0) + 16 at 
sha1_file.c:2171, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
(code=1, address=0x10)
  * frame #0: 0x0001001bba80 
git-log`release_delta_base_cache(ent=0xffd0) + 16 at 
sha1_file.c:2171
frame #1: 0x0001001bcadf 
git-log`add_delta_base_cache(p=0x0001006062f0, base_offset=1792781, 
base=0x00015749a000, base_size=1617761, type=OBJ_BLOB) + 143 at 
sha1_file.c:2199
frame #2: 0x0001001bc0d6 git-log`unpack_entry(p=0x0001006062f0, 
obj_offset=1792781, final_type=0x7fff5fbfe7fc, 
final_size=0x00010185a5a0) + 1590 at sha1_file.c:2345
frame #3: 0x0001001c2209 
git-log`cache_or_unpack_entry(p=0x0001006062f0, base_offset=2692554, 
base_size=0x00010185a5a0, type=0x7fff5fbfe7fc) + 73 at sha1_file.c:2162
frame #4: 0x0001001bed8d 
git-log`read_packed_sha1(sha1="?c?}\x0e'\x81҄MH;yP?, 
type=0x7fff5fbfe7fc, size=0x00010185a5a0) + 93 at sha1_file.c:2765
frame #5: 0x0001001bcc17 
git-log`read_object(sha1="?c?}\x0e'\x81҄MH;yP?, type=0x7fff5fbfe7fc, 
size=0x00010185a5a0) + 119 at sha1_file.c:2813
frame #6: 0x0001001be013 
git-log`read_sha1_file_extended(sha1="?c?}\x0e'\x81҄MH;yP?, 
type=0x7fff5fbfe7fc, size=0x00010185a5a0, flag=1) + 67 at 
sha1_file.c:2841
frame #7: 0x0001001073ba 
git-log`read_sha1_file(sha1="?c?}\x0e'\x81҄MH;yP?, type=0x7fff5fbfe7fc, 
size=0x00010185a5a0) + 42 at cache.h:1056
frame #8: 0x000100106ce6 
git-log`diff_populate_filespec(s=0x00010185a570, flags=2) + 1334 at 
diff.c:2845
frame #9: 0x000100106670 
git-log`diff_filespec_is_binary(one=0x00010185a570) + 160 at diff.c:2248
frame #10: 0x0001001124bc 
git-log`builtin_diff(name_a="Applications/IDE/PlugIns/IDEPlugIns/IDEPlugIns.xcodeproj/project.pbxproj",
 
name_b="Applications/IDE/PlugIns/IDEPlugIns/IDEPlugIns.xcodeproj/project.pbxproj",
 one=0x00010185a570, two=0x000101878310, xfrm_msg="index 
e063d6f..288f95f 100644\n", must_show_header=0, o=0x7fff5fbff4b8, 
complete_rewrite=0) + 1852 at diff.c:2383
frame #11: 0x0001001116ce git-log`run_diff_cmd(pgm=0x, 
name="Applications/IDE/PlugIns/IDEPlugIns/IDEPlugIns.xcodeproj/project.pbxproj",
 other=0x, 
attr_path="Applications/IDE/PlugIns/IDEPlugIns/IDEPlugIns.xcodeproj/project.pbxproj",
 one=0x00010185a570, two=0x000101878310, msg=0x7fff5fbfed18, 
o=0x7fff5fbff4b8, p=0x00010186c130) + 734 at diff.c:3134
frame #12: 0x000100111350 git-log`run_diff(p=0x00010186c130, 
o=0x7fff5fbff4b8) + 720 at diff.c:3222
frame #13: 0x00010010d75d 
git-log`diff_flush_patch(p=0x00010186c130, o=0x7fff5fbff4b8) + 157 at 
diff.c:4202
frame #14: 0x00010010b9bc 
git-log`diff_flus

[RFC] extending pathspec support to submodules

2016-09-14 Thread Brandon Williams
---
I've been trying to think through how we could potentially add pathspec support
for --recurse-submodule options (for builtins like ls-files or grep down the
line).  This is something that could be useful if the user supply's a pathspec
that could match to a file in a submodule.  We could match the submodule to the
pathspec and then fork the process to recursively run the command on the
submodule which can be passed a modified pathspec.

For example with a pathspec 'sub/dir/a', where sub is a submodule in the root
directory of the supermodule's repo, we could match 'sub' to that spec and then
recursively call the git command with a pathspec of 'dir/a'.  The child process
would then have the responsibility of matching 'dir/a' to files in its repo.

Does this seem like a reasonable feature to add? And if so are how is my
initial approach at solving the problem?

One idea I had was to add a submodule match flag in order to perform special
matching just in the --recurse-submodules cases since we'll want somethings to
match here that wouldn't normally match.

@@ -283,6 +284,29 @@ static int match_pathspec_item(const struct pathspec_item 
*item, int prefix,
 item->nowildcard_len - prefix))
return MATCHED_FNMATCH;
 
+   /*
+* Preform some checks to see if "name" is a super set of the pathspec
+*/
+   if (flags & DO_MATCH_SUBMODULE) {
+   struct strbuf buf = STRBUF_INIT;
+   strbuf_addstr(&buf, name);
+   strbuf_addch(&buf, '/');
+   /*
+* Check if the name is a prefix of the pathspec
+*/
+   if ((item->match[namelen] == '/') &&
+   !ps_strncmp(item, match, name, namelen))
+   return MATCHED_RECURSIVELY;
+   /*
+* Check if the name wildmatches to the pathspec
+*/
+   if (!wildmatch(item->match, buf.buf,
+  WM_PREFIX |
+  (item->magic & PATHSPEC_ICASE ? WM_CASEFOLD : 0),
+  NULL));
+   return MATCHED_FNMATCH;
+   }
+
return 0;
 }
 
One of the main difficulties I was having is figuring out how wildmatching
should be applied in this case.  What I believe we want is the ability for the
whole name of the submodule to match a prefix of the pathspec pattern.  To do
this I was thinking of adding a flag to do prefix matching to the wildmatch
function like so: 


diff --git a/wildmatch.c b/wildmatch.c
index 57c8765..f1e1725 100644
--- a/wildmatch.c
+++ b/wildmatch.c
@@ -60,8 +60,12 @@ static int dowild(const uchar *p, const uchar *text, 
unsigned int flags)
for ( ; (p_ch = *p) != '\0'; text++, p++) {
int matched, match_slash, negated;
uchar t_ch, prev_ch;
-   if ((t_ch = *text) == '\0' && p_ch != '*')
-   return WM_ABORT_ALL;
+   if ((t_ch = *text) == '\0' && p_ch != '*') {
+   if ((flags & WM_PREFIX) && (*(p-1) == '/'))
+   return WM_MATCH;
+   else
+   return WM_ABORT_ALL;
+   }
if ((flags & WM_CASEFOLD) && ISUPPER(t_ch))
t_ch = tolower(t_ch);
if ((flags & WM_CASEFOLD) && ISUPPER(p_ch))
diff --git a/wildmatch.h b/wildmatch.h
index 4090c8f..490db51 100644
--- a/wildmatch.h
+++ b/wildmatch.h
@@ -3,6 +3,7 @@
 
 #define WM_CASEFOLD 1
 #define WM_PATHNAME 2
+#define WM_PREFIX 4
 
 #define WM_ABORT_MALFORMED 2
 #define WM_NOMATCH 1
-- 

Any comments or thoughts on this would be appreciated.

Thanks,
Brandon


[PATCH 2/2] use zstd zlib wrapper

2016-09-14 Thread Jeff King
There's a fancy new compression algorithm called "zstd". The
idea is that it's supposed to get similar compression ratios
to zlib, but with much faster compression and decompression
times. And on top of that, a nice sliding scale to trade off
size versus time on the compression side.

The zstd site at https://facebook.github.io/zstd/ claims
close to 3x speedup for both compression and decompression
versus zlib, with similar compression ratios. There are
other fast algorithms (like lz4), but they usually compress
much worse (follow the link above for a nice table of
results).

Since any git operations that have to access objects need to
do a zlib inflate, in theory we can speed up everything by
using zstd. And then on the packing side, use higher
compression levels when making on-disk packfiles (which will
be accessed many times) and lower ones when making loose
objects, or deflating packed objects on the fly when serving
fetches.

The catch, of course, is that it's a new incompatible
format. This would be a pretty huge change and totally break
backwards compatibility for git, not just on disk but
on-the-wire as well. So my goal here was not a finished
product but just a quick experiment to see if it did indeed
bring the promise speedups.

Disappointingly, the answer seems to be "no".

The patch is relatively straightforward. zstd ships with a
zlib-compatible wrapper that we can just plug in. For
compression, it writes with zlib by default, but you can
flip it to zstd mode with a run-time flag. For
decompression, it auto-detects and runs the correct choice
between zlib and zstd. So this patch _could_ serve as the
basis for a real solution. We'd bundle zstd and its wrapper
with the runtime switch off, and then after several years
people could enable zstd writing to cut off older clients.
In the patch below, that switch is just setting the
pack.compression level to "zstd1", "zstd2", etc.

I ran a series of tests where I would:

  1. Repack with "-F" and a particular pack.compression
 setting, measuring both the time to repack and the size
 of the resulting pack.

  2. Time "git rev-list --objects --all" to see the effect
 on commit/tree traversal.

  3. Run "log -Sfoo --raw" to see the additional effect on
 accessing the blobs.

All the timings below are against linux.git. The rev-list
timings are best-of-five, but I didn't do multiple timings
for the repack or "log -S" tests, because they were so
painfully slow (and because the effect sizes seemed to be
larger than run-to-run noise).

I did these on top of my delta-cache and pack-mru patches,
since that would amplify any effect we see (i.e.,
those speed up other things so decompression time is a
relatively larger share of the pie).

Here's a baseline run using stock zlib, with the default
compression settings:

 [zlib]
 - pack
   1122845190
 - repack
   real4m43.817s
   user17m32.396s
   sys 0m5.964s
 - rev-list --objects --all
   real0m27.378s
   user0m27.132s
   sys 0m0.240s
 - log -Sfoo --raw
   real5m3.490s
   user5m0.040s
   sys 0m3.424s

You can see that we're pretty much CPU bound. You can also
see that there's quite a bit of parallelism going on in the
repack (this is a quad-core with hyperthreading).

Here's the same thing built against the zstd wrapper, but
still using zlib:

 [zstd, disabled]
 - pack
   1123089975
   (+0.02%)
 - repack
   real4m58.263s
   user17m50.596s
   sys 0m7.200s
   (+5%)
 - rev-list --objects --all
   real0m28.143s
   user0m27.864s
   sys 0m0.272s
   (+3%)
 - log -Sfoo --raw
   real5m12.742s
   user5m9.804s
   sys 0m2.912s
   (+3%)

The percentages show the difference against the baseline run
above (wall-clock times for the timing results). The pack
size is similar, which we'd expect (it's not identical
because the threaded delta search is slightly
non-deterministic).

Sadly we lose 3-5% just to the wrapper sitting in front of
zlib. So that's not a great start. But it also means we
might be able to recover that through code optimizations of
the wrapper, the way we call it, etc.

Now here it is with zstd turned onto its lowest setting:

 [zstd, 1]
 - pack
   1199180502
   (+7%)
 - repack
   real4m11.740s
   user16m36.510s
   sys 0m7.700s
   (-11%)
 - rev-list --objects --all
   real0m25.870s
   user0m25.632s
   sys 0m0.232s
   (-5.5%)
 - log -Sfoo --raw
   real4m10.899s
   user4m7.788s
   sys 0m3.056s
   (-17%)

We've definitely traded off some space here. But the
compression is faster, _and_ so is the decompression.

So let's bump up the compression level a bit:

 [zstd, 3]
 - pack
   1162416903
   (+3.5%)
 - repack
   real5m1.477s
   user19m34.476s
   sys 0m17.720s
   (+6.2%)
 - rev-list --objects --all
   real0m25.716s
   user0m25.468s
   sys 0m0.244s
   (-6%)
 - log -Sfoo --raw
   real4m10.547s
   user4m6.500s
   sys 0m3.884s
   (-17%)

OK, the size is improving. 

[PATCH 1/2] obj_hash: convert to a critbit tree

2016-09-14 Thread Jeff King
For operations that traverse the whole reachability graph,
like "rev-list --objects", the obj_hash in object.c shows up
as a hotspot. We basically have to do "nr_commits *
size_of_tree" hash lookups, because each tree we look at, we
need to say "have we seen this sha1 yet?" (it's a little
more complicated because we avoid digging into sub-trees we
know we've already seen, but it's a reasonable
approximation).  So graph traversal operations like that are
very sensitive to improvements in lookup time.

For keys with length "m", a hash table is generally O(m) to
compute the hash (and verify that you've found the correct
key), and O(1) in finding the correct slot. The latter is
subject to collisions and probing strategy, but we keep the
load factor on the obj_hash table below 50%, which is quite
low. And we have a good hash (we reuse the sha1s themselves,
which are effectively random). So we'd expect relatively few
collisions, and past experiments to tweak the probing
strategy haven't yielded any benefit (we use linear probing;
I tried quadratic probing at one point, and Junio tried
cuckoo hashing).

Another data structure with similar properties is sometimes
known as a "critbit tree" (see http://cr.yp.to/critbit.html
for discussion and history). The basic idea is a binary trie
where each node is either an internal node, representing a
single bit of a stored key, or a leaf node, representing one
or more "remainder" bits. So if you were storing two bit
sequences 1011 and "1100", you'd have three nodes (besides
the root):

(root)
/\
   0  1
  /\
NULL(internal)
 /\
0  1
   /\
"11"   "00"

So finding "1011" involves traversing the trie: down the "1"
side, then the "0" side, and then check that the rest
matches "11".

Looking up a key is O(m), and there's no issue with
collisions or probing. It can use less memory than a hash
table, because there's no load factor to care about.

That's the good news. The bad news is that big-O analysis
is not the whole story. You'll notice that we have to do a
lot of conditional pointer-jumping to walk the trie. Whereas
a hash table can jump right to a proposed key and do a
memcmp() to see if we got it.

So I wasn't overly optimistic that this would be any faster.
And indeed it's not. It's about three times slower (about
4.5s versus 1.5s running "rev-list --objects --all" on
git.git).

The reason is, I think, a combination of:

  0. We care much more about performance for this hash than
 memory efficiency. So we keep the load factor
 quite low, and thus reduce the number of collisions.

  1. For many hash tables, computing the hash is expensive.
 Not so here, because we are storing sha1s. So it is
 literally just casting the first 4 bytes of the sha1 to
 an int; we don't even look at the other bytes until the
 collision check (and because of point 0, we don't
 generally have to do very many collision checks during
 our probe).

  2. The collision check _does_ have to look at all 20 bytes
 of the sha1. And we may have to do it multiple times as
 we linearly probe the collisions. _But_ it can be done
 with memcmp(), which is optimized to compare 32 or 64
 bits at a time. So we our O(m) has a very nice constant
 factor.

 Whereas in the critbit tree, we pay an instruction for
 _each_ bit we look at.

It's possible that (2) would be better if instead of a
critbit tree, we used a "critbyte" tree. That would involve
fewer node traversals, at the cost of making each node
larger (right now the internal nodes store 2 pointer slots;
they'd have to store 256 to handle a full byte). I didn't
try it, but I suspect it would still be slower for the same
reasons.

The code is below for reference. I pulled the critbit
implementation from:

  https://github.com/ennorehling/critbit

but made a few tweaks:

  - the critbit does not store the keys themselves, as we
already have them in the "struct object", and do not
want to waste space. As a result, I baked in some
assumptions about the 20-byte key-length (which is fine
for our purposes here).

  - rather than malloc() each node, I used a pool allocator
(to try to keep the nodes closer together in cache)

Signed-off-by: Jeff King 
---
 Makefile  |   1 +
 critbit.c | 350 ++
 critbit.h |  60 +++
 object.c  |  85 ++-
 4 files changed, 417 insertions(+), 79 deletions(-)
 create mode 100644 critbit.c
 create mode 100644 critbit.h

diff --git a/Makefile b/Makefile
index 7f18492..99e0eb2 100644
--- a/Makefile
+++ b/Makefile
@@ -720,6 +720,7 @@ LIB_OBJS += connected.o
 LIB_OBJS += convert.o
 LIB_OBJS += copy.o
 LIB_OBJS += credential.o
+LIB_OBJS += critbit.o
 LIB_OBJS += csum-file.o
 LIB_OBJS += ctype.o
 LIB_OBJS += date.o
diff --git a/critbit.c b/critbit.c
new file mode 100644
index 000..735

Journal of Failed Git Experiments, Volume 1

2016-09-14 Thread Jeff King
I try a lot of different experiments with git performance, some of them
more hare-brained than others. The ones that succeed end up as real
patches. But I hate for the ones that fail to die a quiet death. Then
nobody learns what _doesn't_ work, and nobody has the opportunity to
point out the spot where I made a stupid mistake that invalidates the
whole result.

Here are two performance experiments that did _not_ turn out. They
are in patch form, because I want to document exactly what I did
(especially because it helps in the "stupid mistake" case, or if
somebody wants to try building on my results).

So without further ado:

  [1/2]: obj_hash: convert to a critbit tree
  [2/2]: use zstd zlib wrapper

-Peff


Re: [PATCH] Move format-patch base commit and prerequisites before email signature

2016-09-14 Thread Josh Triplett
On Wed, Sep 14, 2016 at 03:57:36PM -0700, Junio C Hamano wrote:
> Junio C Hamano  writes:
> 
> > I do not mind doing it myself, but I am already in today's
> > integration cycle (which will merge a handful of topics to
> > 'master'), so I won't get around to it for some time.  If you are
> > inclined to, please be my guest ;-)
> 
> I queued this on top for now; I think it can be just squashed into
> your patch.  Please say "I agree" and I'll make it happen, or say
> "that's wrong" followed by a replacement patch ;-).

"I agree". :)

I'd suggest squashing in an *additional* patch to the testsuite to
ensure the presence of the blank line:

diff --git a/t/t4014-format-patch.sh b/t/t4014-format-patch.sh
index 535857e..8d90a6e 100755
--- a/t/t4014-format-patch.sh
+++ b/t/t4014-format-patch.sh
@@ -1515,8 +1515,9 @@ test_expect_success 'format-patch -o overrides 
format.outputDirectory' '
 
 test_expect_success 'format-patch --base' '
git checkout side &&
-   git format-patch --stdout --base=HEAD~3 -1 | tail -n 6 >actual &&
-   echo "base-commit: $(git rev-parse HEAD~3)" >expected &&
+   git format-patch --stdout --base=HEAD~3 -1 | tail -n 7 >actual &&
+   echo >expected &&
+   echo "base-commit: $(git rev-parse HEAD~3)" >>expected &&
echo "prerequisite-patch-id: $(git show --patch HEAD~2 | git patch-id 
--stable | awk "{print \$1}")" >>expected &&
echo "prerequisite-patch-id: $(git show --patch HEAD~1 | git patch-id 
--stable | awk "{print \$1}")" >>expected &&
signature >> expected &&


What's cooking in git.git (Sep 2016, #04; Wed, 14)

2016-09-14 Thread Junio C Hamano
Here are the topics that have been cooking.  Commits prefixed with
'-' are only in 'pu' (proposed updates) while commits prefixed with
'+' are in 'next'.  The ones marked with '.' do not appear in any of
the integration branches, but I am still holding onto them.

You can find the changes described here in the integration branches
of the repositories listed at

http://git-blame.blogspot.com/p/git-public-repositories.html

--
[Graduated to "master"]

* ep/use-git-trace-curl-in-tests (2016-09-07) 4 commits
  (merged to 'next' on 2016-09-08 at 04372de)
 + t5551-http-fetch-smart.sh: use the GIT_TRACE_CURL environment var
 + t5550-http-fetch-dumb.sh: use the GIT_TRACE_CURL environment var
 + test-lib.sh: preserve GIT_TRACE_CURL from the environment
 + t5541-http-push-smart.sh: use the GIT_TRACE_CURL environment var

 Update a few tests that used to use GIT_CURL_VERBOSE to use the
 newer GIT_TRACE_CURL.


* jc/am-read-author-file (2016-08-30) 1 commit
  (merged to 'next' on 2016-09-08 at d2db42f)
 + am: refactor read_author_script()

 Extract a small helper out of the function that reads the authors
 script file "git am" internally uses.
 This by itself is not useful until a second caller appears in the
 future for "rebase -i" helper.


* jc/forbid-symbolic-ref-d-HEAD (2016-09-02) 1 commit
  (merged to 'next' on 2016-09-08 at cd8c1b3)
 + symbolic-ref -d: do not allow removal of HEAD

 "git symbolic-ref -d HEAD" happily removes the symbolic ref, but
 the resulting repository becomes an invalid one.  Teach the command
 to forbid removal of HEAD.


* jc/submodule-anchor-git-dir (2016-09-01) 1 commit
  (merged to 'next' on 2016-09-08 at b6f20cf)
 + submodule: avoid auto-discovery in prepare_submodule_repo_env()

 Having a submodule whose ".git" repository is somehow corrupt
 caused a few commands that recurse into submodules loop forever.


* jk/diff-submodule-diff-inline (2016-08-31) 8 commits
  (merged to 'next' on 2016-09-02 at 734e42c)
 + diff: teach diff to display submodule difference with an inline diff
 + submodule: refactor show_submodule_summary with helper function
 + submodule: convert show_submodule_summary to use struct object_id *
 + allow do_submodule_path to work even if submodule isn't checked out
 + diff: prepare for additional submodule formats
 + graph: add support for --line-prefix on all graph-aware output
 + diff.c: remove output_prefix_length field
 + cache: add empty_tree_oid object and helper function

 The "git diff --submodule={short,log}" mechanism has been enhanced
 to allow "--submodule=diff" to show the patch between the submodule
 commits bound to the superproject.


* jk/squelch-false-warning-from-gcc-o3 (2016-08-31) 2 commits
  (merged to 'next' on 2016-09-08 at c9a2af6)
 + color_parse_mem: initialize "struct color" temporary
 + error_errno: use constant return similar to error()

 Compilation fix.


* jk/test-lib-drop-pid-from-results (2016-08-30) 1 commit
  (merged to 'next' on 2016-09-08 at 0967b0b)
 + test-lib: drop PID from test-results/*.count

 The test framework left the number of tests and success/failure
 count in the t/test-results directory, keyed by the name of the
 test script plus the process ID.  The latter however turned out not
 to serve any useful purpose.  The process ID part of the filename
 has been removed.


* js/t6026-clean-up (2016-09-07) 1 commit
  (merged to 'next' on 2016-09-08 at 5ad2fc1)
 + t6026-merge-attr: clean up background process at end of test case

 A test spawned a short-lived background process, which sometimes
 prevented the test directory from getting removed at the end of the
 script on some platforms.


* js/t9903-chaining (2016-09-07) 1 commit
  (merged to 'next' on 2016-09-08 at 162a3c9)
 + t9903: fix broken && chain

 Test fix.


* rs/compat-strdup (2016-09-07) 1 commit
  (merged to 'next' on 2016-09-08 at 46acfdf)
 + compat: move strdup(3) replacement to its own file

 Code cleanup.


* rs/hex2chr (2016-09-07) 1 commit
  (merged to 'next' on 2016-09-08 at 7266d5b)
 + introduce hex2chr() for converting two hexadecimal digits to a character

 Code cleanup.


* sb/transport-report-missing-submodule-on-stderr (2016-09-08) 1 commit
  (merged to 'next' on 2016-09-08 at 3550831)
 + transport: report missing submodule pushes consistently on stderr

 Message cleanup.


* sb/xdiff-remove-unused-static-decl (2016-09-07) 1 commit
  (merged to 'next' on 2016-09-08 at 39e41dd)
 + xdiff: remove unneeded declarations

 Code cleanup.

--
[New Topics]

* et/add-chmod-x (2016-09-12) 1 commit
 - add: document the chmod option
 (this branch is used by tg/add-chmod+x-fix.)

 "git add --chmod=+x" added recently lacked documentation, which has
 been corrected.

 Will merge to 'next'.


* js/libify-require-clean-work-tree (2016-09-12) 5 commits
 - wt-status: teach has_{unstaged,uncommitted}_changes() about submodules
 - Export also the has_un{staged,committed

Re: Left with empty files after "git stash pop" when system hung

2016-09-14 Thread Jeff King
On Tue, Sep 13, 2016 at 11:39:56PM +0200, Daniel Hahler wrote:

> I have used "git stash --include-untracked", checked out another branch,
> went back, and "git stash pop"ed the changes.
> Then my system crashed/hung (music that was playing was repeated in a
> loop).  I have waited for some minutes, and then turned it off.
> 
> Afterwards, the repository in question was in a state where all files
> contained in the stash were empty.
> "git status" looked good on first sight: all the untracked and modified
> files were listed there; but they were empty.
> 
>   % git fsck --lost-found
>   error: object file .git/objects/04/1e659b5dbfd3f0be351a782b54743692875aec 
> is empty
>   error: object file .git/objects/04/1e659b5dbfd3f0be351a782b54743692875aec 
> is empty
>   fatal: loose object 041e659b5dbfd3f0be351a782b54743692875aec (stored in 
> .git/objects/04/1e659b5dbfd3f0be351a782b54743692875aec) is corrupt
>   % find .git/objects -size 0|wc -l
>   12
>
> [...]
> The filesystem in question is ext4, and I am using Arch Linux.

Is your filesystem mounted with data=writeback? Git should never write
an empty object file; it writes the content to a temporary file, and
then hardlinks it into place. If your filesystem does not order data and
metadata writes (i.e., the hardlink may get journaled and picked
up, even though the data did not hit the disk), then you can end up with
empty files. If you set core.fsyncobjectfiles in your config file, then
Git will fsync each object write (at the cost of some performance).

> I would have assumed that the "stash pop" operation would be "atomic",
> i.e. it should not remove the stash object before other objects have
> been written successfully.

Stash does not remove any objects at all; it should only be updating the
stash reflog to delete the entry (which also happens via write to a
tempfile and rename, though I don't think we ever fsync it, even with
core.fsyncobjectfiles).

The empty object you found is probably the result of a write too close
to the crash. In general I wouldn't expect "stash pop" to write, but I
suspect it may in order to populate the index.

> I have removed all empty files in .git/objects and tried to find the
> previous stash with `gitk --all $( git fsck | awk '{print $3}' )` then,
> but it appears to have disappeared.

fsck won't mention the object as dangling if it's reachable from a
reflog. Did you try "git stash list" (or just "git log -g refs/stash")?

-Peff


Re: [PATCH] Move format-patch base commit and prerequisites before email signature

2016-09-14 Thread Junio C Hamano
Junio C Hamano  writes:

> I do not mind doing it myself, but I am already in today's
> integration cycle (which will merge a handful of topics to
> 'master'), so I won't get around to it for some time.  If you are
> inclined to, please be my guest ;-)

I queued this on top for now; I think it can be just squashed into
your patch.  Please say "I agree" and I'll make it happen, or say
"that's wrong" followed by a replacement patch ;-).

Thanks.

 builtin/log.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/log.c b/builtin/log.c
index d69d5e6..cd9c4a4 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -1360,7 +1360,7 @@ static void print_bases(struct base_tree_info *bases, 
FILE *file)
return;
 
/* Show the base commit */
-   fprintf(file, "base-commit: %s\n", oid_to_hex(&bases->base_commit));
+   fprintf(file, "\nbase-commit: %s\n", oid_to_hex(&bases->base_commit));
 
/* Show the prerequisite patches */
for (i = bases->nr_patch_id - 1; i >= 0; i--)
-- 
2.10.0-458-g8cce42d



Re: [PATCH v4 3/4] read-cache: introduce chmod_index_entry

2016-09-14 Thread Junio C Hamano
I've queued this trivial SQUASH??? on top, which I think should be
squashed into 3/4.

Thanks.


 read-cache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/read-cache.c b/read-cache.c
index 2445e30..c2b2e97 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -779,7 +779,7 @@ int chmod_index_entry(struct index_state *istate, struct 
cache_entry *ce,
default:
return -2;
}
-   cache_tree_invalidate_path(&the_index, ce->name);
+   cache_tree_invalidate_path(istate, ce->name);
ce->ce_flags |= CE_UPDATE_IN_BASE;
istate->cache_changed |= CE_ENTRY_CHANGED;
 
-- 
2.10.0-458-g8cce42d



Re: [PATCH 1/2] serialize collection of changed submodules

2016-09-14 Thread Junio C Hamano
Heiko Voigt  writes:

> Sorry about the late reply. I was not able to process emails until now.
> Here are two patches that should help to improve the situation and batch
> up some processing. This one is for repositories with submodules, so
> that they do not iterate over the same submodule twice with the same
> hash.
>
> The second one will be the one people without submodules are interested
> in.

Thanks.  Will take a look at later as I'm already deep in today's
integration cycle.  Very much appreciated.


Re: Bug

2016-09-14 Thread Dennis Kaarsemaker
On Tue, 2016-09-13 at 13:18 -0400, Mike Hawes wrote:
> To whom this may concern,
>
> I found a bug in git while trying to push my website.
> I redid the process and it happened again.
> I also tried it on another computer and it happened again.
> I was wondering how to claim a bug?

Hi Mike,

When you think git does not behave as you expect, please do not stop
your bug report with just "git does not work".  "I used git in this
way, but it did not work" is not much better, neither is "I used git
in this way, and X happend, which is broken".  It often is that git is
correct to cause X happen in such a case, and it is your expectation
that is broken. People would not know what other result Y you expected
to see instead of X, if you left it unsaid.

Please remember to always state

 - what you wanted to achieve;

 - what you did (the version of git and the command sequence to reproduce
   the behavior);

 - what you saw happen (X above);

 - what you expected to see (Y above); and

 - how the last two are different.

See http://www.chiark.greenend.org.uk/~sgtatham/bugs.html for further
hints.

(The above was shamelessly copied from the "A note from the maintainer" mails)

D.


Re: [PATCH v4 4/4] add: modify already added files when --chmod is given

2016-09-14 Thread Junio C Hamano
Thomas Gummerer  writes:

> When the chmod option was added to git add, it was hooked up to the diff
> machinery, meaning that it only works when the version in the index
> differs from the version on disk.
>
> As the option was supposed to mirror the chmod option in update-index,
> which always changes the mode in the index, regardless of the status of
> the file, make sure the option behaves the same way in git add.
>
> Signed-off-by: Thomas Gummerer 
> ---

This change essentially reverts most of what 4e55ed32 ("add: add
--chmod=+x / --chmod=-x options", 2016-05-31) did, except that it
keeps the command line option and adjusts its validation, and adds a
separate phase to "fix-up" the executable bits for all paths that
match the given pathspec, whether they were new or modified or
unchanged.

The patch makes sense to me.  Thanks.


Re: [PATCH v4 3/4] read-cache: introduce chmod_index_entry

2016-09-14 Thread Junio C Hamano
Thomas Gummerer  writes:

> As there are chmod options for both add and update-index, introduce a
> new chmod_index_entry function to do the work.  Use it in update-index,
> while it will be used in add in the next patch.
>
> Signed-off-by: Thomas Gummerer 
> ---
>  builtin/update-index.c | 16 ++--
>  cache.h|  2 ++
>  read-cache.c   | 29 +
>  3 files changed, 33 insertions(+), 14 deletions(-)
>
> diff --git a/builtin/update-index.c b/builtin/update-index.c
> index bbdf0d9..9e9e040 100644
> --- a/builtin/update-index.c
> +++ b/builtin/update-index.c
> @@ -423,26 +423,14 @@ static void chmod_path(char flip, const char *path)
>  {
> ...
> - mode = ce->ce_mode;
> - if (!S_ISREG(mode))
> - goto fail;
> - switch (flip) {
> - case '+':
> - ce->ce_mode |= 0111; break;
> - case '-':
> - ce->ce_mode &= ~0111; break;
> - default:
> + if (chmod_cache_entry(ce, flip) < 0)
>   goto fail;
> - }
> - cache_tree_invalidate_path(&the_index, path);

This used to always work on the default index, hence the_index
reference is here, but ...

> +int chmod_index_entry(struct index_state *istate, struct cache_entry *ce,
> +   char flip)
> +{
> + if (!S_ISREG(ce->ce_mode))
> + return -1;
> + switch (flip) {
> + case '+':
> + ce->ce_mode |= 0111;
> + break;
> + case '-':
> + ce->ce_mode &= ~0111;
> + break;
> + default:
> + return -2;
> + }
> + cache_tree_invalidate_path(&the_index, ce->name);

... this one takes istate, so you need to use it, instead of the
hard-coded the_index reference.

> + ce->ce_flags |= CE_UPDATE_IN_BASE;
> + istate->cache_changed |= CE_ENTRY_CHANGED;
> +
> + return 0;
> +}
> +
>  int ce_same_name(const struct cache_entry *a, const struct cache_entry *b)
>  {
>   int len = ce_namelen(a);

Other than that, this looks good to me.


[PATCH v4 3/4] read-cache: introduce chmod_index_entry

2016-09-14 Thread Thomas Gummerer
As there are chmod options for both add and update-index, introduce a
new chmod_index_entry function to do the work.  Use it in update-index,
while it will be used in add in the next patch.

Signed-off-by: Thomas Gummerer 
---
 builtin/update-index.c | 16 ++--
 cache.h|  2 ++
 read-cache.c   | 29 +
 3 files changed, 33 insertions(+), 14 deletions(-)

diff --git a/builtin/update-index.c b/builtin/update-index.c
index bbdf0d9..9e9e040 100644
--- a/builtin/update-index.c
+++ b/builtin/update-index.c
@@ -423,26 +423,14 @@ static void chmod_path(char flip, const char *path)
 {
int pos;
struct cache_entry *ce;
-   unsigned int mode;
 
pos = cache_name_pos(path, strlen(path));
if (pos < 0)
goto fail;
ce = active_cache[pos];
-   mode = ce->ce_mode;
-   if (!S_ISREG(mode))
-   goto fail;
-   switch (flip) {
-   case '+':
-   ce->ce_mode |= 0111; break;
-   case '-':
-   ce->ce_mode &= ~0111; break;
-   default:
+   if (chmod_cache_entry(ce, flip) < 0)
goto fail;
-   }
-   cache_tree_invalidate_path(&the_index, path);
-   ce->ce_flags |= CE_UPDATE_IN_BASE;
-   active_cache_changed |= CE_ENTRY_CHANGED;
+
report("chmod %cx '%s'", flip, path);
return;
  fail:
diff --git a/cache.h b/cache.h
index 6738050..35c8d1c 100644
--- a/cache.h
+++ b/cache.h
@@ -369,6 +369,7 @@ extern void free_name_hash(struct index_state *istate);
 #define remove_file_from_cache(path) remove_file_from_index(&the_index, (path))
 #define add_to_cache(path, st, flags) add_to_index(&the_index, (path), (st), 
(flags), 0)
 #define add_file_to_cache(path, flags) add_file_to_index(&the_index, (path), 
(flags), 0)
+#define chmod_cache_entry(ce, flip) chmod_index_entry(&the_index, (ce), (flip))
 #define refresh_cache(flags) refresh_index(&the_index, (flags), NULL, NULL, 
NULL)
 #define ce_match_stat(ce, st, options) ie_match_stat(&the_index, (ce), (st), 
(options))
 #define ce_modified(ce, st, options) ie_modified(&the_index, (ce), (st), 
(options))
@@ -584,6 +585,7 @@ extern int remove_file_from_index(struct index_state *, 
const char *path);
 extern int add_to_index(struct index_state *, const char *path, struct stat *, 
int flags, int force_mode);
 extern int add_file_to_index(struct index_state *, const char *path, int 
flags, int force_mode);
 extern struct cache_entry *make_cache_entry(unsigned int mode, const unsigned 
char *sha1, const char *path, int stage, unsigned int refresh_options);
+extern int chmod_index_entry(struct index_state *, struct cache_entry *ce, 
char flip);
 extern int ce_same_name(const struct cache_entry *a, const struct cache_entry 
*b);
 extern void set_object_name_for_intent_to_add_entry(struct cache_entry *ce);
 extern int index_name_is_other(const struct index_state *, const char *, int);
diff --git a/read-cache.c b/read-cache.c
index 491e52d..8924f2e 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -756,6 +756,35 @@ struct cache_entry *make_cache_entry(unsigned int mode,
return ret;
 }
 
+/*
+ * Chmod an index entry with either +x or -x.
+ *
+ * Returns -1 if the chmod for the particular cache entry failed (if it's
+ * not a regular file), -2 if an invalid flip argument is passed in, 0
+ * otherwise.
+ */
+int chmod_index_entry(struct index_state *istate, struct cache_entry *ce,
+ char flip)
+{
+   if (!S_ISREG(ce->ce_mode))
+   return -1;
+   switch (flip) {
+   case '+':
+   ce->ce_mode |= 0111;
+   break;
+   case '-':
+   ce->ce_mode &= ~0111;
+   break;
+   default:
+   return -2;
+   }
+   cache_tree_invalidate_path(&the_index, ce->name);
+   ce->ce_flags |= CE_UPDATE_IN_BASE;
+   istate->cache_changed |= CE_ENTRY_CHANGED;
+
+   return 0;
+}
+
 int ce_same_name(const struct cache_entry *a, const struct cache_entry *b)
 {
int len = ce_namelen(a);
-- 
2.10.0.304.gf2ff484



[PATCH v4 2/4] update-index: add test for chmod flags

2016-09-14 Thread Thomas Gummerer
Currently there is no test checking the expected behaviour when multiple
chmod flags with different arguments are passed.  As argument handling
is not in line with other git commands it's easy to miss and
accidentally change the current behaviour.

While there, fix the argument type of chmod_path, which takes an int,
but had a char passed in.

Signed-off-by: Thomas Gummerer 
---
 builtin/update-index.c|  2 +-
 t/t2107-update-index-basic.sh | 13 +
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/builtin/update-index.c b/builtin/update-index.c
index ba04b19..bbdf0d9 100644
--- a/builtin/update-index.c
+++ b/builtin/update-index.c
@@ -419,7 +419,7 @@ static int add_cacheinfo(unsigned int mode, const unsigned 
char *sha1,
return 0;
 }
 
-static void chmod_path(int flip, const char *path)
+static void chmod_path(char flip, const char *path)
 {
int pos;
struct cache_entry *ce;
diff --git a/t/t2107-update-index-basic.sh b/t/t2107-update-index-basic.sh
index dfe02f4..32ac6e0 100755
--- a/t/t2107-update-index-basic.sh
+++ b/t/t2107-update-index-basic.sh
@@ -80,4 +80,17 @@ test_expect_success '.lock files cleaned up' '
)
 '
 
+test_expect_success '--chmod=+x and chmod=-x in the same argument list' '
+   >A &&
+   >B &&
+   git add A B &&
+   git update-index --chmod=+x A --chmod=-x B &&
+   cat >expect <<-\EOF &&
+   100755 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0   A
+   100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0   B
+   EOF
+   git ls-files --stage A B >actual &&
+   test_cmp expect actual
+'
+
 test_done
-- 
2.10.0.304.gf2ff484



[PATCH v4 4/4] add: modify already added files when --chmod is given

2016-09-14 Thread Thomas Gummerer
When the chmod option was added to git add, it was hooked up to the diff
machinery, meaning that it only works when the version in the index
differs from the version on disk.

As the option was supposed to mirror the chmod option in update-index,
which always changes the mode in the index, regardless of the status of
the file, make sure the option behaves the same way in git add.

Signed-off-by: Thomas Gummerer 
---
 builtin/add.c  | 47 ---
 builtin/checkout.c |  2 +-
 builtin/commit.c   |  2 +-
 cache.h| 10 +-
 read-cache.c   | 14 ++
 t/t3700-add.sh | 50 ++
 6 files changed, 91 insertions(+), 34 deletions(-)

diff --git a/builtin/add.c b/builtin/add.c
index b1dddb4..595a0b2 100644
--- a/builtin/add.c
+++ b/builtin/add.c
@@ -26,10 +26,25 @@ static int patch_interactive, add_interactive, 
edit_interactive;
 static int take_worktree_changes;
 
 struct update_callback_data {
-   int flags, force_mode;
+   int flags;
int add_errors;
 };
 
+static void chmod_pathspec(struct pathspec *pathspec, int force_mode)
+{
+   int i;
+   
+   for (i = 0; i < active_nr; i++) {
+   struct cache_entry *ce = active_cache[i];
+
+   if (pathspec && !ce_path_match(ce, pathspec, NULL))
+   continue;
+
+   if (chmod_cache_entry(ce, force_mode) < 0)
+   fprintf(stderr, "cannot chmod '%s'", ce->name);
+   }
+}
+
 static int fix_unmerged_status(struct diff_filepair *p,
   struct update_callback_data *data)
 {
@@ -65,8 +80,7 @@ static void update_callback(struct diff_queue_struct *q,
die(_("unexpected diff status %c"), p->status);
case DIFF_STATUS_MODIFIED:
case DIFF_STATUS_TYPE_CHANGED:
-   if (add_file_to_index(&the_index, path,
-   data->flags, data->force_mode)) {
+   if (add_file_to_index(&the_index, path, data->flags)) {
if (!(data->flags & ADD_CACHE_IGNORE_ERRORS))
die(_("updating files failed"));
data->add_errors++;
@@ -84,15 +98,14 @@ static void update_callback(struct diff_queue_struct *q,
}
 }
 
-int add_files_to_cache(const char *prefix, const struct pathspec *pathspec,
-   int flags, int force_mode)
+int add_files_to_cache(const char *prefix,
+  const struct pathspec *pathspec, int flags)
 {
struct update_callback_data data;
struct rev_info rev;
 
memset(&data, 0, sizeof(data));
data.flags = flags;
-   data.force_mode = force_mode;
 
init_revisions(&rev, prefix);
setup_revisions(0, NULL, &rev, NULL);
@@ -281,7 +294,7 @@ static int add_config(const char *var, const char *value, 
void *cb)
return git_default_config(var, value, cb);
 }
 
-static int add_files(struct dir_struct *dir, int flags, int force_mode)
+static int add_files(struct dir_struct *dir, int flags)
 {
int i, exit_status = 0;
 
@@ -294,8 +307,7 @@ static int add_files(struct dir_struct *dir, int flags, int 
force_mode)
}
 
for (i = 0; i < dir->nr; i++)
-   if (add_file_to_index(&the_index, dir->entries[i]->name,
-   flags, force_mode)) {
+   if (add_file_to_index(&the_index, dir->entries[i]->name, 
flags)) {
if (!ignore_add_errors)
die(_("adding files failed"));
exit_status = 1;
@@ -308,7 +320,7 @@ int cmd_add(int argc, const char **argv, const char *prefix)
int exit_status = 0;
struct pathspec pathspec;
struct dir_struct dir;
-   int flags, force_mode;
+   int flags;
int add_new_files;
int require_pathspec;
char *seen = NULL;
@@ -342,13 +354,8 @@ int cmd_add(int argc, const char **argv, const char 
*prefix)
if (!show_only && ignore_missing)
die(_("Option --ignore-missing can only be used together with 
--dry-run"));
 
-   if (!chmod_arg)
-   force_mode = 0;
-   else if (!strcmp(chmod_arg, "-x"))
-   force_mode = 0666;
-   else if (!strcmp(chmod_arg, "+x"))
-   force_mode = 0777;
-   else
+   if (chmod_arg && ((chmod_arg[0] != '-' && chmod_arg[0] != '+') ||
+ chmod_arg[1] != 'x' || chmod_arg[2]))
die(_("--chmod param '%s' must be either -x or +x"), chmod_arg);
 
add_new_files = !take_worktree_changes && !refresh_only;
@@ -441,11 +448,13 @@ int cmd_add(int argc, const char **argv, const char 
*prefix)
 
plug_bulk_checkin();
 
-   exit_status |= add_files_to_cache(prefix, &pathspec, flags, force_mode);
+   exit_status |= ad

[PATCH v4 0/4] git add --chmod: always change the file

2016-09-14 Thread Thomas Gummerer
Thanks Junio for the review of my last round.

Changes since then:
[1/4]: patch unchanged
[2/4]: Only adds a test now, and corrects the type of the argument of
   chmod_path, but leaves the rest of the patch unchanged.
[3/4]: chmod_index_entry now takes a char as argument which can either
   be + or -, and changes the mode based on that, instead of using
   the 0777 or 0666 mode that was passed in from the outside.
[4/4]: Adapted to the different behaviour of chmod_index_entry and
   added tests as suggested by Junio.

Thomas Gummerer (4):
  add: document the chmod option
  update-index: add test for chmod flags
  read-cache: introduce chmod_index_entry
  add: modify already added files when --chmod is given

 Documentation/git-add.txt |  7 +-
 builtin/add.c | 47 
 builtin/checkout.c|  2 +-
 builtin/commit.c  |  2 +-
 builtin/update-index.c| 18 +++-
 cache.h   | 12 ++-
 read-cache.c  | 43 ++---
 t/t2107-update-index-basic.sh | 13 +++
 t/t3700-add.sh| 50 +++
 9 files changed, 144 insertions(+), 50 deletions(-)

-- 
2.10.0.304.gf2ff484



[PATCH v4 1/4] add: document the chmod option

2016-09-14 Thread Thomas Gummerer
The git add --chmod option was introduced in 4e55ed3 ("add: add
--chmod=+x / --chmod=-x options", 2016-05-31), but was never
documented.  Document the feature.

Signed-off-by: Thomas Gummerer 
---
 Documentation/git-add.txt | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-add.txt b/Documentation/git-add.txt
index 6a96a66..7ed63dc 100644
--- a/Documentation/git-add.txt
+++ b/Documentation/git-add.txt
@@ -11,7 +11,7 @@ SYNOPSIS
 'git add' [--verbose | -v] [--dry-run | -n] [--force | -f] [--interactive | 
-i] [--patch | -p]
  [--edit | -e] [--[no-]all | --[no-]ignore-removal | [--update | -u]]
  [--intent-to-add | -N] [--refresh] [--ignore-errors] 
[--ignore-missing]
- [--] [...]
+ [--chmod=(+|-)x] [--] [...]
 
 DESCRIPTION
 ---
@@ -165,6 +165,11 @@ for "git add --no-all ...", i.e. ignored removed 
files.
be ignored, no matter if they are already present in the work
tree or not.
 
+--chmod=(+|-)x::
+   Override the executable bit of the added files.  The executable
+   bit is only changed in the index, the files on disk are left
+   unchanged.
+
 \--::
This option can be used to separate command-line options from
the list of files, (useful when filenames might be mistaken
-- 
2.10.0.304.gf2ff484



Re: Bug Report: "git submodule deinit" fails right after a clone

2016-09-14 Thread Heiko Voigt
On Tue, Aug 30, 2016 at 01:45:56PM +0200, Thomas Bétous wrote:
> Are you able to reproduce this problem?

No. I just did a clone and an immediate deinit afterwards and no error.
Maybe you can provide a script to reproduce? Which System was this on?

Cheers Heiko


Re: git submodule add spits unrelated to actual problem error msg about .gitignore

2016-09-14 Thread Yaroslav Halchenko
On September 14, 2016 3:32:11 PM EDT, Stefan Beller  wrote:
!
>I think we could chop off "2>&1" as that would have exposed the
>underlying error.
>
>Another way to go would be to use verbose git-add and grep for
>the string "add '$sm_path'".
>
> if test -z "$force" && ! git add --verbose --dry-run
>--ignore-missing "$sm_path" |grep "add $sm_path"
>
>git-add already gives the correct (the same error message) for  the
>ignored files, so maybe we'd just do:
>
># no need for a if, but this single line will do:
>test -z "$force" && git add --dry-run git.o >/dev/null || exit 1

FWIW Imho exposing error is good but not sufficient alone, since custom 
gitignore message would still be confusing.
-- 
Sent from a phone which beats iPhone.


Re: [PATCH 2/2] serialize collection of refs that contain submodule changes

2016-09-14 Thread Stefan Beller
On Wed, Sep 14, 2016 at 12:46 PM, Heiko Voigt  wrote:
> On Wed, Sep 14, 2016 at 07:51:30PM +0200, Heiko Voigt wrote:
>> Here are some numbers (using the my development clone of git
>> itself) from my local machine:
>>
>> rm -rf  && mkdir  &&
>> (cd  && git init) &&
>> time git push --mirror 
>>
>>real   0m16.056s
>>user   0m24.424s
>>sys0m1.380s
>>
>>real   0m15.885s
>>user   0m24.204s
>>sys0m1.296s
>>
>>real   0m16.731s
>>user   0m24.176s
>>sys0m1.244s
>>
>> rm -rf  && mkdir  &&
>> (cd  && git init) &&
>> time git push --mirror --recurse-submodules=check 
>>
>>real   0m21.441s
>>user   0m29.560s
>>sys0m1.480s
>>
>>real   0m21.319s
>>user   0m29.484s
>>sys0m1.464s
>>
>>real   0m21.261s
>>user   0m29.252s
>>sys0m1.592s
>>
>> Without my patches and --recurse-submodules=check the numbers are
>> basically the same. I stopped the test with --recurse-submodules=check
>> after ~ 9 minutes.
>
> Fun fact, I let the push without my patch and with
> --recurse-submodules=check finish:

Thanks for the numbers, one of the major push backs for
origin/sb/push-make-submodule-check-the-default
was that it introduced slowness; this patch might help a bit there.


Re: [PATCH 2/2] serialize collection of refs that contain submodule changes

2016-09-14 Thread Heiko Voigt
On Wed, Sep 14, 2016 at 07:51:30PM +0200, Heiko Voigt wrote:
> Here are some numbers (using the my development clone of git
> itself) from my local machine:
> 
> rm -rf  && mkdir  &&
> (cd  && git init) &&
> time git push --mirror 
> 
>real   0m16.056s
>user   0m24.424s
>sys0m1.380s
> 
>real   0m15.885s
>user   0m24.204s
>sys0m1.296s
> 
>real   0m16.731s
>user   0m24.176s
>sys0m1.244s
> 
> rm -rf  && mkdir  &&
> (cd  && git init) &&
> time git push --mirror --recurse-submodules=check 
> 
>real   0m21.441s
>user   0m29.560s
>sys0m1.480s
> 
>real   0m21.319s
>user   0m29.484s
>sys0m1.464s
> 
>real   0m21.261s
>user   0m29.252s
>sys0m1.592s
> 
> Without my patches and --recurse-submodules=check the numbers are
> basically the same. I stopped the test with --recurse-submodules=check
> after ~ 9 minutes.

Fun fact, I let the push without my patch and with
--recurse-submodules=check finish:

real27m7.962s
user27m15.568s
sys 0m2.016s

Thats quite some time...

Cheers Heiko


Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields

2016-09-14 Thread Jeff King
On Wed, Sep 14, 2016 at 12:30:06PM -0700, Junio C Hamano wrote:

> Another small thing I am not sure about is if the \ quoting can hide
> an embedded newline in the author name.  Would we end up turning
> 
>   From: "Jeff \
> King" 
> 
> or somesuch into
> 
>   Author: Jeff
> King
> Email: p...@peff.net
> 
> ;-)

Heh, yeah. That is another reason to clean up and sanitize as much as
possible before stuffing it into another text format that will be
parsed.

> So let's roll the \" -> " into mailinfo.
> 
> I am not sure if we also should remove the surrounding "", i.e. we
> currently do not turn this
> 
>   From: "Jeff King" 
> 
> into this:
> 
>   Author: Jeff King
> Email: p...@peff.net
> 
> I think we probably should, and remove the one that does so from the
> reader.

I think you have to, or else you cannot tell the difference between
surrounding quotes that need to be stripped, and ones that were
backslash-escaped. Like:

  From: "Jeff King" 
  From: \"Jeff King\" 

which would both become:

  Author: "Jeff King"
  Email: p...@peff.net

though I am not sure the latter one is actually valid; you might need to
be inside syntactic quotes in order to include backslashed quotes. I
haven't read rfc2822 carefully recently enough to know.

Anyway, I think that:

  From: One "Two \"Three\" Four" Five

may also be valid. So the quote-stripping in the reader is not just "at
the outside", but may need to handle interior syntactic quotes, too. So
it really makes sense for me to clean and sanitize as much as possible
in one step, and then make the parser of mailinfo as dumb as possible.

-Peff


Re: git submodule add spits unrelated to actual problem error msg about .gitignore

2016-09-14 Thread Stefan Beller
On Wed, Sep 14, 2016 at 7:03 AM, Yaroslav Halchenko  wrote:
> I have spent some time chasing the wild goose (well - the .gitignore
> file) after getting:
>
> $> git-submodule add --name fcx-1 ./fcx-1/ ./fcx-1/
> The following path is ignored by one of your .gitignore files:
> fcx-1
> Use -f if you really want to add it.
>
> long story short -- the culprit is this piece of code in git-submodule:
>
> if test -z "$force" && ! git add --dry-run --ignore-missing "$sm_path" > 
> /dev/null 2>&1
> then
> eval_gettextln "The following path is ignored by one of your 
> .gitignore files:
> \$sm_path
> Use -f if you really want to add it." >&2
> exit 1
> fi
>
>
> so if anything goes wrong in git add, it just reports  this error
> message.

Thanks for the bug report!
I think we could chop off "2>&1" as that would have exposed the
underlying error.

Another way to go would be to use verbose git-add and grep for
the string "add '$sm_path'".

 if test -z "$force" && ! git add --verbose --dry-run
--ignore-missing "$sm_path" |grep "add $sm_path"

git-add already gives the correct (the same error message) for  the
ignored files, so maybe we'd just do:

# no need for a if, but this single line will do:
test -z "$force" && git add --dry-run git.o >/dev/null || exit 1


Re: [PATCH] pathspec: removed unnecessary function prototypes

2016-09-14 Thread Brandon Williams
On Wed, Sep 14, 2016 at 12:23 PM, Jeff King  wrote:

> On Tue, Sep 13, 2016 at 11:15:52AM -0700, Jeff King wrote:
> I should have done a better job of not just providing the answer, but
> showing how. The easiest tool here is "git log -S":
>
>   git log -1 -p -Scheck_path_for_gitlink
>
> (and then you can see that the whole function went away there).
>
> -Peff

Perfect thanks! There's still a lot of little features like this that
I'm unaware
of so I really appreciate the pointer.

-Brandon


Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields

2016-09-14 Thread Junio C Hamano
Jeff King  writes:

> On Wed, Sep 14, 2016 at 10:43:18AM -0700, Junio C Hamano wrote:
>
>> I think we can go either way and it does not matter all that much if
>> "mailinfo" changes its output or the reader of "mailinfo" output
>> changes its input--we will either be munging data read from "From:"
>> when producing the "Author:" line, or taking the "Author:" output by
>> mailinfo and removing the quotes.
>
> Yeah, that was the part I was wondering about in my original response.
> What is the output of mailinfo _supposed_ to be, and do we consider that
> at all public (i.e., are there are other tools besides "git am" that
> build on mailinfo)?
>
> At least "am" already does some quote-stripping, so any de-quoting added
> in mailinfo is potentially a regression (if we indeed care about keeping
> the output stable).

Another small thing I am not sure about is if the \ quoting can hide
an embedded newline in the author name.  Would we end up turning

From: "Jeff \
King" 

or somesuch into

Author: Jeff
King
Email: p...@peff.net

;-)

> But if we are OK with that, it seems to me that mailinfo is the best
> place to do the de-quoting, because then its output is well-defined:
> everything after "Author:" up to the newline is the name.

There are other things mailinfo does, like turning this

From: p...@peff.net (Jeff King)

into

Author: Jeff King
Email: p...@peff.net

and

From: Uh "foo" Bar p...@peff.net (Jeff King)

into

Author: Uh "foo" Bar (Jeff King)
Email: p...@peff.net

So let's roll the \" -> " into mailinfo.

I am not sure if we also should remove the surrounding "", i.e. we
currently do not turn this

From: "Jeff King" 

into this:

Author: Jeff King
Email: p...@peff.net

I think we probably should, and remove the one that does so from the
reader.



Re: [PATCH] pathspec: removed unnecessary function prototypes

2016-09-14 Thread Jeff King
On Tue, Sep 13, 2016 at 11:15:52AM -0700, Jeff King wrote:

> On Tue, Sep 13, 2016 at 09:52:51AM -0700, Brandon Williams wrote:
> 
> > removed function prototypes from pathspec.h which don't have a
> > corresponding implementation.
> 
> I'm always curious of the "why" in cases like this. Did we forget to add
> them? Did they get renamed? Did they go away?
> 
> Looks like the latter; 5a76aff (add: convert to use parse_pathspec,
> 2013-07-14) just forgot to remove them.

I should have done a better job of not just providing the answer, but
showing how. The easiest tool here is "git log -S":

  git log -1 -p -Scheck_path_for_gitlink

(and then you can see that the whole function went away there).

-Peff


Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields

2016-09-14 Thread Jeff King
On Wed, Sep 14, 2016 at 10:43:18AM -0700, Junio C Hamano wrote:

> I think we can go either way and it does not matter all that much if
> "mailinfo" changes its output or the reader of "mailinfo" output
> changes its input--we will either be munging data read from "From:"
> when producing the "Author:" line, or taking the "Author:" output by
> mailinfo and removing the quotes.

Yeah, that was the part I was wondering about in my original response.
What is the output of mailinfo _supposed_ to be, and do we consider that
at all public (i.e., are there are other tools besides "git am" that
build on mailinfo)?

At least "am" already does some quote-stripping, so any de-quoting added
in mailinfo is potentially a regression (if we indeed care about keeping
the output stable).

But if we are OK with that, it seems to me that mailinfo is the best
place to do the de-quoting, because then its output is well-defined:
everything after "Author:" up to the newline is the name. Whereas if the
cleanup of the value is split across mailinfo and its reader, then it is
hard to know which side is responsible for which part. mailinfo handles
whitespace unfolding, certainly. What other rfc2822 things does it
handle? What are the rules for dequoting its output?

I'll admit I don't care _too_ much. This is a remote corner of the code
that I hope never to have to look at. I'm mostly just describing how the
problem space makes sense to _me_, and how I would write it if starting
from scratch.

-Peff


re

2016-09-14 Thread Ali Saeed



Did you get my message?



Re: [PATCH] vcs-svn/fast_export: fix timestamp fmt specifiers

2016-09-14 Thread Jeff King
On Wed, Sep 14, 2016 at 06:40:57AM +, Mike Ralphson wrote:

> Two instances of %ld being used for unsigned longs

Obviously this is an improvement, but I'm kind of surprised that
compiler warnings didn't flag this. I couldn't find a "-W" combination
that noticed, though (at least not with gcc 6).

-Peff


RE: E-mail User:-

2016-09-14 Thread Hopfauf, Sandee


From: Hopfauf, Sandee
Sent: Wednesday, September 14, 2016 10:36 AM
Subject: E-mail User:-

Dear E-mail User:-

Take note of this important update that our new web mail has been improved with 
a new messaging system from Owa/outlook which also include faster usage on 
email, shared calendar,web-documents and the New 2016 Anti-Spam Version.

Please use the link below to complete your update for our new Owa/outlook 
improved web mail.
Log on to Outlook Web Access to update your 
mailbox.

ITS help desk
ADMIN TEAM
©Copyright 2016 Microsoft Outlook
All Right Reserved.


[PATCH 2/2] serialize collection of refs that contain submodule changes

2016-09-14 Thread Heiko Voigt
We are iterating over each pushed ref and want to check whether it
contains changes to submodules. Instead of immediately checking each ref
lets first collect them and then do the check for all of them in one
revision walk.

Signed-off-by: Heiko Voigt 
---

Sorry this was not catched earlier. This was implemented as part of
summer of code and it seems we never tested with --mirror.

This is the one which does only one revision walk instead of one for
each ref. Here are some numbers (using the my development clone of git
itself) from my local machine:

rm -rf  && mkdir  &&
(cd  && git init) &&
time git push --mirror 

   real 0m16.056s
   user 0m24.424s
   sys  0m1.380s

   real 0m15.885s
   user 0m24.204s
   sys  0m1.296s

   real 0m16.731s
   user 0m24.176s
   sys  0m1.244s

rm -rf  && mkdir  &&
(cd  && git init) &&
time git push --mirror --recurse-submodules=check 

   real 0m21.441s
   user 0m29.560s
   sys  0m1.480s

   real 0m21.319s
   user 0m29.484s
   sys  0m1.464s

   real 0m21.261s
   user 0m29.252s
   sys  0m1.592s

Without my patches and --recurse-submodules=check the numbers are
basically the same. I stopped the test with --recurse-submodules=check
after ~ 9 minutes.

Cheers Heiko

 submodule.c | 36 +---
 submodule.h |  5 +++--
 transport.c | 22 ++
 3 files changed, 38 insertions(+), 25 deletions(-)

diff --git a/submodule.c b/submodule.c
index b04c066..a15e346 100644
--- a/submodule.c
+++ b/submodule.c
@@ -627,24 +627,31 @@ static void free_submodules_sha1s(struct string_list 
*submodules)
string_list_clear(submodules, 1);
 }
 
-int find_unpushed_submodules(unsigned char new_sha1[20],
+static void append_hash_to_argv(const unsigned char sha1[20],
+   void *data)
+{
+   struct argv_array *argv = (struct argv_array *) data;
+   argv_array_push(argv, sha1_to_hex(sha1));
+}
+
+int find_unpushed_submodules(struct sha1_array *hashes,
const char *remotes_name, struct string_list *needs_pushing)
 {
struct rev_info rev;
struct commit *commit;
-   const char *argv[] = {NULL, NULL, "--not", "NULL", NULL};
-   int argc = ARRAY_SIZE(argv) - 1, i;
-   char *sha1_copy;
+   int i;
struct string_list submodules = STRING_LIST_INIT_DUP;
+   struct argv_array argv = ARGV_ARRAY_INIT;
 
-   struct strbuf remotes_arg = STRBUF_INIT;
-
-   strbuf_addf(&remotes_arg, "--remotes=%s", remotes_name);
init_revisions(&rev, NULL);
-   sha1_copy = xstrdup(sha1_to_hex(new_sha1));
-   argv[1] = sha1_copy;
-   argv[3] = remotes_arg.buf;
-   setup_revisions(argc, argv, &rev, NULL);
+
+   /* argv.argv[0] will be ignored by setup_revisions */
+   argv_array_push(&argv, "find_unpushed_submodules");
+   sha1_array_for_each_unique(hashes, append_hash_to_argv, &argv);
+   argv_array_push(&argv, "--not");
+   argv_array_pushf(&argv, "--remotes=%s", remotes_name);
+
+   setup_revisions(argv.argc, argv.argv, &rev, NULL);
if (prepare_revision_walk(&rev))
die("revision walk setup failed");
 
@@ -652,8 +659,7 @@ int find_unpushed_submodules(unsigned char new_sha1[20],
find_unpushed_submodule_commits(commit, &submodules);
 
reset_revision_walk();
-   free(sha1_copy);
-   strbuf_release(&remotes_arg);
+   argv_array_clear(&argv);
 
for (i = 0; i < submodules.nr; i++) {
struct string_list_item *item = &submodules.items[i];
@@ -691,12 +697,12 @@ static int push_submodule(const char *path)
return 1;
 }
 
-int push_unpushed_submodules(unsigned char new_sha1[20], const char 
*remotes_name)
+int push_unpushed_submodules(struct sha1_array *hashes, const char 
*remotes_name)
 {
int i, ret = 1;
struct string_list needs_pushing = STRING_LIST_INIT_DUP;
 
-   if (!find_unpushed_submodules(new_sha1, remotes_name, &needs_pushing))
+   if (!find_unpushed_submodules(hashes, remotes_name, &needs_pushing))
return 1;
 
for (i = 0; i < needs_pushing.nr; i++) {
diff --git a/submodule.h b/submodule.h
index d9e197a..065b2f0 100644
--- a/submodule.h
+++ b/submodule.h
@@ -3,6 +3,7 @@
 
 struct diff_options;
 struct argv_array;
+struct sha1_array;
 
 enum {
RECURSE_SUBMODULES_CHECK = -4,
@@ -62,9 +63,9 @@ int submodule_uses_gitfile(const char *path);
 int ok_to_remove_submodule(const char *path);
 int merge_submodule(unsigned char result[20], const char *path, const unsigned 
char base[20],
const unsigned char a[20], const unsigned char b[20], int 
search);
-int find_unpushed_submodules(unsigned char new_sha1[20], const char 
*remotes_name,
+int find_unpushed_submodules(struct sha1_array *hashes, const char 
*remotes_name,
struct string_list *needs_pushing);
-int push_unpushed_submodules(unsigned char new_sha1[20], const char 
*remotes_name);
+int push_unpushed_submodules(struct sha1_array *hashe

Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields

2016-09-14 Thread Junio C Hamano
Kevin Daudt  writes:

> When applied the the author of this patch shows up as:
>
> Author: A U Thor" (test) 
>
> So I agree with Jeff[1] where he states that the surrounding quotes
> should be removed, if that's not a problem for git.
>
> [1]:https://public-inbox.org/git/20160914051305.vphknpsikyxi3...@sigill.intra.peff.net/

I think we can go either way and it does not matter all that much if
"mailinfo" changes its output or the reader of "mailinfo" output
changes its input--we will either be munging data read from "From:"
when producing the "Author:" line, or taking the "Author:" output by
mailinfo and removing the quotes.

As an output from mailinfo that looks like this:

Author: "A U Thor"
Email: a...@thor.com

is made into a commit object that has this:

author A U Thor 

we know that the reader of mailinfo output _already_ has some logic
to strip the surrounding double quotes.  That is the only reason why
I think it is a better approach to not dequote in the "mailinfo" but
in the reader to turn

Author: "A \"U\" Thor"
Email: a...@thor.com

into a commit object that has this:

author A "U" Thor 

than updating mailinfo to produce

Author: A "U" Thor
Email: a...@thor.com

and then create the same result.


[PATCH 1/2] serialize collection of changed submodules

2016-09-14 Thread Heiko Voigt
To check whether a submodule needs to be pushed we need to collect all
changed submodules. Lets collect them first and then execute the
possibly expensive test whether certain revisions are already pushed
only once per submodule.

There is further potential for optimization since we can assemble one
command and only issued that instead of one call for each remote ref in
the submodule.

Signed-off-by: Heiko Voigt 
---
Sorry about the late reply. I was not able to process emails until now.
Here are two patches that should help to improve the situation and batch
up some processing. This one is for repositories with submodules, so
that they do not iterate over the same submodule twice with the same
hash.

The second one will be the one people without submodules are interested
in.

Cheers Heiko

 submodule.c | 67 -
 1 file changed, 62 insertions(+), 5 deletions(-)

diff --git a/submodule.c b/submodule.c
index 0ef2ff4..b04c066 100644
--- a/submodule.c
+++ b/submodule.c
@@ -554,19 +554,38 @@ static int submodule_needs_pushing(const char *path, 
const unsigned char sha1[20
return 0;
 }
 
+static struct sha1_array *get_sha1s_from_list(struct string_list *submodules,
+   const char *path)
+{
+   struct string_list_item *item;
+   struct sha1_array *hashes;
+
+   item = string_list_insert(submodules, path);
+   if (item->util)
+   return (struct sha1_array *) item->util;
+
+   hashes = (struct sha1_array *) xmalloc(sizeof(struct sha1_array));
+   /* NEEDSWORK: should we add an initializer function for
+* sha1_array ? */
+   memset(hashes, 0, sizeof(struct sha1_array));
+   item->util = hashes;
+   return hashes;
+}
+
 static void collect_submodules_from_diff(struct diff_queue_struct *q,
 struct diff_options *options,
 void *data)
 {
int i;
-   struct string_list *needs_pushing = data;
+   struct string_list *submodules = data;
 
for (i = 0; i < q->nr; i++) {
struct diff_filepair *p = q->queue[i];
+   struct sha1_array *hashes;
if (!S_ISGITLINK(p->two->mode))
continue;
-   if (submodule_needs_pushing(p->two->path, p->two->oid.hash))
-   string_list_insert(needs_pushing, p->two->path);
+   hashes = get_sha1s_from_list(submodules, p->two->path);
+   sha1_array_append(hashes, p->two->oid.hash);
}
 }
 
@@ -582,14 +601,41 @@ static void find_unpushed_submodule_commits(struct commit 
*commit,
diff_tree_combined_merge(commit, 1, &rev);
 }
 
+struct collect_submodule_from_sha1s_data {
+   char *submodule_path;
+   struct string_list *needs_pushing;
+};
+
+static void collect_submodules_from_sha1s(const unsigned char sha1[20],
+   void *data)
+{
+   struct collect_submodule_from_sha1s_data *me =
+   (struct collect_submodule_from_sha1s_data *) data;
+
+   if (submodule_needs_pushing(me->submodule_path, sha1))
+   string_list_insert(me->needs_pushing, me->submodule_path);
+}
+
+static void free_submodules_sha1s(struct string_list *submodules)
+{
+   int i;
+   for (i = 0; i < submodules->nr; i++) {
+   struct string_list_item *item = &submodules->items[i];
+   struct sha1_array *hashes = (struct sha1_array *) item->util;
+   sha1_array_clear(hashes);
+   }
+   string_list_clear(submodules, 1);
+}
+
 int find_unpushed_submodules(unsigned char new_sha1[20],
const char *remotes_name, struct string_list *needs_pushing)
 {
struct rev_info rev;
struct commit *commit;
const char *argv[] = {NULL, NULL, "--not", "NULL", NULL};
-   int argc = ARRAY_SIZE(argv) - 1;
+   int argc = ARRAY_SIZE(argv) - 1, i;
char *sha1_copy;
+   struct string_list submodules = STRING_LIST_INIT_DUP;
 
struct strbuf remotes_arg = STRBUF_INIT;
 
@@ -603,12 +649,23 @@ int find_unpushed_submodules(unsigned char new_sha1[20],
die("revision walk setup failed");
 
while ((commit = get_revision(&rev)) != NULL)
-   find_unpushed_submodule_commits(commit, needs_pushing);
+   find_unpushed_submodule_commits(commit, &submodules);
 
reset_revision_walk();
free(sha1_copy);
strbuf_release(&remotes_arg);
 
+   for (i = 0; i < submodules.nr; i++) {
+   struct string_list_item *item = &submodules.items[i];
+   struct collect_submodule_from_sha1s_data data;
+   data.submodule_path = item->string;
+   data.needs_pushing = needs_pushing;
+   sha1_array_for_each_unique((struct sha1_array *) item->util,
+   collect_submodules_from_sha1s,
+   &data);
+  

Re: Git Miniconference at Plumbers

2016-09-14 Thread Junio C Hamano
Lars Schneider  writes:

> Some applications have test data, image assets, and other data sets that
> need to be versioned along with the source code.
>
> How would you deal with these kind of "huge objects" _today_?

When you know that you'd find the answer to that question totally
uninteresting, why do you even bother to ask? ;-)

I don't, and if I had to, I would deal with them just like any other
objects.

A more interesting pair of questions to ask would be what the
fundamental requirement for an acceptable solution is, and what
solution within the constraint I would envision, if I were given a
group competent Git hackers and enough time to realize it.

The most important constraint is that any acceptable solution should
preserve the object identity.

And starting from a "I don't but if I had to..." repository that is
created in a dumb way, a solution that satisifies the constraint may
work like this, requiring enhancements to various parts of the
system:

 - The "upload-pack" protocol would allow the owner of a such
   repository and the party that "git clone"'s from there to
   negotiate:

. what it means for a object to be "huge" (e.g. the owner may
  implicitly show the preference by marking a packfile as
  containing such "huge" objects, may configure that blobs that
  appear at paths that match certain glob pattern are "huge", or
  the sender and the receiver may say objects that are larger
  than X MB are "huge", etc.); and

. what to do with "huge" objects (e.g. the receiver may ask for
  a full clone, or the receiver may ask to omit "huge" ones from
  the initial transfer)

 - The "upload-pack" protocol would give, in addition to the normal
   pack stream that conveys only non-"huge" objects, for each of
   "huge" objects that are not transferred, what its object name is
   and how it can later be retrieved.

 - Just like packing objects in packfiles was added as a different
   implementation to store objects in the object database that is
   better than storing them individually as loose object files,
   there will be a third way to store such "huge" object _in_ the
   object database, which may actually not _store_ them locally at
   all.  The local object store may merely have placeholders for
   them, in which instructions for how it can be acquired when
   necessary are stored.  The extra information sent over the
   "upload-pack" protocol for "huge" objects with the previous
   bullet-point are used to store these objects in this "third" way.

 - A new mechanism would allow such objects that are stored in this
   "third" way to be retrieved lazily or on-demand.

There are other enhancements whose necessity will fall naturally out
of such a lazy scheme outlined above.  E.g. "fsck" needs to learn
that the objects stored in the third way are considered to "exist"
but their actual contents is not expected to be verifiable until
they are retrieved. "send-pack" (i.e. running "git push" from a
repository cloned with the procedure outlined above) needs to treat
the objects stored in the third way differently (most likely, it
will fail a request for full-clone and send "not here, but you can
get it this way" for them).  Local operations that need more than
object names need to learn reasonable fallback behaviours to work
when the actual object contents are not yet available (e.g. all of
them may offer "this is not yet available; do you want to get
on-demand?" or there may even be "object.ondemand" configuration
option to skip the end-user interaction.  When on-demand retrieving
is not done, "git archive" may place a placeholder file in its
output that says "no data (yet) here", "git log --raw" may show the
object name but "git log -p" may say "insufficient data to produce a
patch", etc.) [*1*].

Because we start from the "object identity should not change", you
do not have to make a decision upfront when preparing the ultimate
source of the truth.  When you take a clone-network of a single
project as a whole, somebody needs to hold the entire set of objects
somewhere, and many of the repository in the clone-network may have
"huge" objects in the third "not here yet, here is how to get it"
form.  As the system improves, and as the networking and storage
technology changes, the definition of "huge" WILL change over time
and those repositories can turn the ones that used to be "huge" into
normal objects.

If you use approaches taken by various clean/smudge based current
crop of solutions [*2*], on the other hand, once you decide a blob
object is "huge" and needs to be replaced with a surrogate (to be
instantiated via the "clean" filter), the "huge" object _has_ to
stay in the surrogate form in the containing tree and you cannot
change the division between "huge" and "normal" ever without
rewriting the history.


[Footnote]

*1* Astute readers would realize that the utility of such a "third
way" object storage mechanism is not limited to "keep and
transfer

Re: Git Miniconference at Plumbers

2016-09-14 Thread Christian Couder
On Tue, Sep 13, 2016 at 1:14 AM, Lars Schneider
 wrote:
>
>> On 12 Sep 2016, at 21:11, Junio C Hamano  wrote:
>>
>> [..]
>> properly; supporting "huge objects" better in the object layer,
>> without having to resort to ugly hacks like GitLFS that will never
>> be part of the core Git. [...]
>
> I agree with you that GitLFS is an ugly hack.
>
> Some applications have test data, image assets, and other data sets that
> need to be versioned along with the source code.
>
> How would you deal with these kind of "huge objects" _today_?

I think that Junio was saying that this problem and other problems
like this one are indeed itches for some people, but maybe not for
kernel community.

About this specific problem, as you probably know, I started working
on adding support for external object databases, on top of some
previous work that Peff had started some years ago:

https://public-inbox.org/git/20160628181933.24620-1-chrisc...@tuxfamily.org/

So if you want to better deal with huge objects in the near future,
you are welcome to help on this.


[PATCH] xdiff: fix merging of hunks with -W context and -u context

2016-09-14 Thread René Scharfe
If the function context for a hunk (with -W) reaches the beginning of
the next hunk then we need to merge these two -- otherwise we'd show
some lines twice, which looks strange and even confuses git apply.  We
already do this checking and merging in xdl_emit_diff(), but forget to
consider regular context (with -u or -U).

Fix that by merging hunks already if function context of the first one
touches or overlaps regular context of the second one.

Signed-off-by: Rene Scharfe 
---
 t/t4051-diff-function-context.sh | 25 +
 xdiff/xemit.c|  2 +-
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/t/t4051-diff-function-context.sh b/t/t4051-diff-function-context.sh
index b79b877..6154acb 100755
--- a/t/t4051-diff-function-context.sh
+++ b/t/t4051-diff-function-context.sh
@@ -67,6 +67,15 @@ test_expect_success 'setup' '
commit_and_tag long_common_tail file.c &&
 
git checkout initial &&
+   cat "$dir/hello.c" "$dir/dummy.c" >file.c &&
+   commit_and_tag hello_dummy file.c &&
+
+   # overlap function context of 1st change and -u context of 2nd change
+   grep -v "delete me from hello" <"$dir/hello.c" >file.c &&
+   sed 2p <"$dir/dummy.c" >>file.c &&
+   commit_and_tag changed_hello_dummy file.c &&
+
+   git checkout initial &&
grep -v "delete me from hello" file.c.new &&
mv file.c.new file.c &&
cat "$dir/appended1.c" >>file.c &&
@@ -179,4 +188,20 @@ test_expect_success ' context does not include other 
functions' '
test $(grep -c "^[ +-].*Begin" changed_hello_appended.diff) -le 2
 '
 
+check_diff changed_hello_dummy 'changed two consecutive functions'
+
+test_expect_success ' context includes begin' '
+   grep "^ .*Begin of hello" changed_hello_dummy.diff &&
+   grep "^ .*Begin of dummy" changed_hello_dummy.diff
+'
+
+test_expect_success ' context includes end' '
+   grep "^ .*End of hello" changed_hello_dummy.diff &&
+   grep "^ .*End of dummy" changed_hello_dummy.diff
+'
+
+test_expect_success ' overlapping hunks are merged' '
+   test $(grep -c "^@@" changed_hello_dummy.diff) -eq 1
+'
+
 test_done
diff --git a/xdiff/xemit.c b/xdiff/xemit.c
index b52b4b9..7389ce4 100644
--- a/xdiff/xemit.c
+++ b/xdiff/xemit.c
@@ -239,7 +239,7 @@ int xdl_emit_diff(xdfenv_t *xe, xdchange_t *xscr, 
xdemitcb_t *ecb,
if (xche->next) {
long l = XDL_MIN(xche->next->i1,
 xe->xdf1.nrec - 1);
-   if (l <= e1 ||
+   if (l - xecfg->ctxlen <= e1 ||
get_func_line(xe, xecfg, NULL, l, e1) < 0) {
xche = xche->next;
goto post_context_calculation;
-- 
2.10.0



Re: [RFC 0/1] mailinfo: de-quote quoted-pair in header fields

2016-09-14 Thread Kevin Daudt
On Tue, Sep 13, 2016 at 10:54:47PM -0700, Junio C Hamano wrote:
> Jeff King  writes:
> 
> > It has been a while since I looked at rfc2822, but aren't the quoting
> > and syntax rules different for addresses versus other headers? We would
> > not want to dequote a Subject header, I think.
> 
> You're absolutely right.  RFC2822 does not quite _want_ to dequote
> anything.  As you pointed out in a separate message, we are the one
> who want to strip out "" quoting when mailinfo says
> 
>   Author: "Jeff King"
> 
> to its standard output (aka "info"), and turn it into
> 
>   GIT_AUTHOR_NAME='Jeff King'
> 
> and do so ONLY for the author name.
> 
> So I would think it is the responsibility of the one that reads the
> "info" file that is produced by mailinfo to dequote the backslash
> thing if the mailinfo gave us
> 
>   Author: "Jeff \"Peff\" King"
> 

The RFC makes a distinction between structured fields and unstructured
fields. Quoting would not even be necessary for unstructured fields
(like Subject), so yes, that those fields should be left alone.

Unstructures fields are subject, comments, keywords and optional fields,
the rest is considered structured.

Because the only field where this is a problem is the From field, I
think it would be safe to limit the unquoting just to that field.

My reasoning to do the unquoting here is because it's the RFC requires
the quoting in the first place.

I already noticed a bug in the current unquoting of the author when
adding a comment to the From: field.

From: "A U Thor"  (test)

When applied the the author of this patch shows up as:

Author: A U Thor" (test) 

So I agree with Jeff[1] where he states that the surrounding quotes
should be removed, if that's not a problem for git.

[1]:https://public-inbox.org/git/20160914051305.vphknpsikyxi3...@sigill.intra.peff.net/


Re: [PATCH] vcs-svn/fast_export: fix timestamp fmt specifiers

2016-09-14 Thread Junio C Hamano
Mike Ralphson  writes:

> Two instances of %ld being used for unsigned longs.
>
> Signed-off-by: Mike Ralphson 
> ---

Good eyes.  Thanks for spotting.

>  vcs-svn/fast_export.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/vcs-svn/fast_export.c b/vcs-svn/fast_export.c
> index bd0f2c2..97cba39 100644
> --- a/vcs-svn/fast_export.c
> +++ b/vcs-svn/fast_export.c
> @@ -73,7 +73,7 @@ void fast_export_begin_note(uint32_t revision, const char 
> *author,
>   static int firstnote = 1;
>   size_t loglen = strlen(log);
>   printf("commit %s\n", note_ref);
> - printf("committer %s <%s@%s> %ld +\n", author, author, "local", 
> timestamp);
> + printf("committer %s <%s@%s> %lu +\n", author, author, "local", 
> timestamp);
>   printf("data %"PRIuMAX"\n", (uintmax_t)loglen);
>   fwrite(log, loglen, 1, stdout);
>   if (firstnote) {
> @@ -107,7 +107,7 @@ void fast_export_begin_commit(uint32_t revision, const 
> char *author,
>   }
>   printf("commit %s\n", local_ref);
>   printf("mark :%"PRIu32"\n", revision);
> - printf("committer %s <%s@%s> %ld +\n",
> + printf("committer %s <%s@%s> %lu +\n",
>  *author ? author : "nobody",
>  *author ? author : "nobody",
>  *uuid ? uuid : "local", timestamp);
>
> --
> https://github.com/git/git/pull/293


Re: [PATCH v3] checkout: eliminate unnecessary merge for trivial checkout

2016-09-14 Thread Junio C Hamano
Oleg Taranenko  writes:

> Sorry for bothering, why not introduce a brand new option like git
> checkout -b foo --skip-worktree-merge for such rare optimization use
> case?

I am not sure what problem such a new option solves.  How would you
describe and explain what "--skip-worktree-merge" option to the end
user?


Re: [PATCH 0/16] fix config-reading in non-repos

2016-09-14 Thread Jeff King
On Wed, Sep 14, 2016 at 12:55:41PM +0200, Dennis Kaarsemaker wrote:

> >   [14/16]: config: only read .git/config from configured repos
> >   [15/16]: init: expand comments explaining config trickery
> >   [16/16]: init: reset cached config when entering new repo
> 
> Couldn't find anything to comment on, and I've tested that this does
> indeed fix the symptoms we saw.
> 
> Reviewed-by: Dennis Kaarsemaker 

Thanks. Based on our conversations earlier, I tried to dig around for
other cases that might be broken, especially with the "reinit" cases.
The result is the tests in patch 16. But let me know if you can think of
anything else that might be broken.

-Peff


git submodule add spits unrelated to actual problem error msg about .gitignore

2016-09-14 Thread Yaroslav Halchenko
I have spent some time chasing the wild goose (well - the .gitignore
file) after getting:

$> git-submodule add --name fcx-1 ./fcx-1/ ./fcx-1/
The following path is ignored by one of your .gitignore files:
fcx-1
Use -f if you really want to add it.

long story short -- the culprit is this piece of code in git-submodule:

if test -z "$force" && ! git add --dry-run --ignore-missing "$sm_path" > 
/dev/null 2>&1
then
eval_gettextln "The following path is ignored by one of your .gitignore 
files:
\$sm_path
Use -f if you really want to add it." >&2
exit 1
fi


so if anything goes wrong in git add, it just reports  this error
message.

FTR -- actual problem in my case was:

$> git add --dry-run --ignore-missing fcx-1
fatal: Unable to create 
'/mnt/datasets/datalad/crawl/crcns/.git/index.lock': File exists.

Another git process seems to be running in this repository, e.g.
an editor opened by 'git commit'. Please make sure all processes
are terminated then try again. If it still fails, a git process
may have crashed in this repository earlier:
remove the file manually to continue.

;-)

Cheers!
P.S. Please CC me in replies
-- 
Yaroslav O. Halchenko
Center for Open Neuroscience http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834   Fax: +1 (603) 646-1419
WWW:   http://www.linkedin.com/in/yarik


[PATCH] pkt-line: mark a file-local symbol static

2016-09-14 Thread Ramsay Jones

Signed-off-by: Ramsay Jones 
---

Hi Lars,

If you need to re-roll your 'ls/filter-process' branch, could you
please squash this into the relevant patch; commit 2afd9b22
("pkt-line: add packet_write_gently()", 08-09-2016).

[If you think the symbol should be public (I don't), then add a
suitable declaration to pkt-line.h instead.]

Thanks!

ATB,
Ramsay Jones

 pkt-line.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pkt-line.c b/pkt-line.c
index 538e35f..4900fc0 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -181,7 +181,7 @@ int packet_write_fmt_gently(int fd, const char *fmt, ...)
return status;
 }
 
-int packet_write_gently(const int fd_out, const char *buf, size_t size)
+static int packet_write_gently(const int fd_out, const char *buf, size_t size)
 {
static char packet_write_buffer[LARGE_PACKET_MAX];
 
-- 
2.10.0


[ANNOUNCE] Git Rev News edition 19

2016-09-14 Thread Christian Couder
Hi everyone,

I'm happy announce that the 19th edition of Git Rev News is now published:

https://git.github.io/rev_news/2016/09/14/edition-19/

Thanks a lot to all the contributors and helpers, especially Brian,
Jakub, Lars and Josh!

Enjoy,
Christian and Thomas.


Re: [RFC 0/3] http: avoid repeatedly adding curl easy to curlm

2016-09-14 Thread Yaroslav Halchenko

On Tue, 13 Sep 2016, Eric Wong wrote:
> What is unclear to me is how only Yaroslav's repository seems to
> trigger this bug after all these years...

Thank you Eric very much for tracking down this issue!  Since issue is
intermittent, I guess people just didn't bother going through reporting
if they got an error once in a while, and/or could have attributed
to somehow misran update-server-info.

> However, I am fairly sure this fixes the bug Yaroslav
> encountered.  This patch series is also needed for 2.9.3 and
> perhaps older maintenance tracks for distros.

FWIW I have tested your branch locally - and do not observe that bug any
longer.  Thanks again!

-- 
Yaroslav O. Halchenko
Center for Open Neuroscience http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834   Fax: +1 (603) 646-1419
WWW:   http://www.linkedin.com/in/yarik


Re: Git Ignore Exception bug

2016-09-14 Thread Dennis Kaarsemaker
On vr, 2016-09-09 at 15:39 -0600, Nathan Williams wrote:
> it ignore doesn't seem to be working properly when adding exceptions.

>8 -- snip testcase

> Expected results
> % git st
> On branch master
> Untracked files:
>   (use "git add ..." to include in what will be committed)
> 
> foo/bar/

That expectation is wrong, it should show foo/. And indeed it does
(tested with 2.9.0 and 2.10.0-rc1)

$ sh -x testscript
+ rm -rf repo
+ mkdir repo
+ cd repo
+ git init
Initialized empty Git repository in /home/dennis/code/git/repo/.git/
+ echo foo/*
+ echo !foo/bar
+ git add .gitignore
+ git commit -m Ignore file with exceptions
[master (root-commit) 7e1b82a] Ignore file with exceptions
 1 file changed, 2 insertions(+)
 create mode 100644 .gitignore
+ mkdir foo
+ mkdir foo/bar
+ touch foo/1
+ touch foo/2
+ touch foo/bar/a
+ touch foo/bar/b
+ git status
On branch master
Untracked files:
  (use "git add ..." to include in what will be committed)

foo/

nothing added to commit but untracked files present (use "git add" to track)
-- 
Dennis Kaarsemaker
http://www.kaarsemaker.net




Re: [PATCH 0/16] fix config-reading in non-repos

2016-09-14 Thread Dennis Kaarsemaker
On ma, 2016-09-12 at 23:22 -0400, Jeff King wrote:

> The motivation for this series is to fix the regression in v2.9 where
> core.logallrefupdates is sometimes not set properly in a newly
> initialized repository, as described in this thread:
> 
>   http://public-inbox.org/git/c46d36ef-3c2e-374f-0f2e-ffe31104e...@gmx.de/T/#u
> 
> The root of the problem is that we are overly eager to read and use
> config from ".git/config", even when we have not established that it is
> part of a repository. This is especially bad for git-init, which would
> not want to read anything until we've created the new repo.
> 
> So the two interesting parts of the fix are:
> 
>   1. We stop blindly reading ".git/config" when we don't know there's an
>  actual git directory. This is in patch 14, and is actually enough
>  to fix the v2.9 regression.
> 
>   2. We are more thorough about dropping any cached config values when
>  we move into the new repository in git-init (patch 16).
> 
>  I didn't dig into when this was broken, but it was probably when we
>  switched git_config() to use cached values in the v2.2.0
>  time-frame.
> 
> Doing (1) required fixing up some builtins that depended on the blind
> .git/config thing, as the tests demonstrated. But I think this is a sign
> that we are moving in the right direction, because each one of those
> programs could easily be demonstrated to be broken in scenarios only
> slightly more exotic than the test scripts (e.g., see patch 3 for one of
> the simplest cases).
> 
> So I think notwithstanding their use as prep for patch 14, patches 1-13
> fix useful bugs.
> 
> I won't be surprised if there are other fallouts that were not caught by
> the test suite (i.e., programs that expect to read config, don't do
> RUN_SETUP, but aren't covered well by tests). I poked around the list of
> builtins in git.c that do not use RUN_SETUP, and they seem to correctly
> end up in setup_git_directory_gently() before reading config. But it's
> possible I missed a case.
> 
> So this is definitely a bit larger than I'd hope for a regression-fix to
> maint. But anything that doesn't address this issue at the config layer
> is going to end up as a bit of a hack, and I'd rather not pile up hacks
> if we can avoid it.

Agreed with all of the above, this is much better than just fixing the
symptom on the mailinglist thread that started this.

> I've cc'd Dennis, who helped investigate solutions in the thread
> mentioned above, and Duy, because historically he has been the one most
> willing and able to battle the dragon of our setup code. :)
> 
>   [01/16]: t1007: factor out repeated setup
>   [02/16]: hash-object: always try to set up the git repository
>   [03/16]: patch-id: use RUN_SETUP_GENTLY
>   [04/16]: diff: skip implicit no-index check when given --no-index
>   [05/16]: diff: handle --no-index prefixes consistently
>   [06/16]: diff: always try to set up the repository
>   [07/16]: pager: remove obsolete comment
>   [08/16]: pager: stop loading git_default_config()
>   [09/16]: pager: make pager_program a file-local static
>   [10/16]: pager: use callbacks instead of configset
>   [11/16]: pager: handle early config
>   [12/16]: t1302: use "git -C"
>   [13/16]: test-config: setup git directory
>   [14/16]: config: only read .git/config from configured repos
>   [15/16]: init: expand comments explaining config trickery
>   [16/16]: init: reset cached config when entering new repo

Couldn't find anything to comment on, and I've tested that this does
indeed fix the symptoms we saw.

Reviewed-by: Dennis Kaarsemaker 

-- 
Dennis Kaarsemaker
http://www.kaarsemaker.net