Re: [PATCH v2 2/3] fast-export: improve speed by skipping blobs

2013-05-06 Thread Jeff King
On Sun, May 05, 2013 at 05:38:53PM -0500, Felipe Contreras wrote:

 We don't care about blobs, or any object other than commits, but in
 order to find the type of object, we are parsing the whole thing, which
 is slow, specially in big repositories with lots of big files.

I did a double-take on reading this subject line and first paragraph,
thinking surely fast-export needs to actually output blobs?.

Reading the patch, I see that this is only about not bothering to load
blob marks from --import-marks. It might be nice to mention that in the
commit message, which is otherwise quite confusing.

I'm also not sure why your claim we don't care about blobs is true,
because naively we would want future runs of fast-export to avoid having
to write out the whole blob content when mentioning the blob again. I
think one argument could be if we write a mark for blob X, we will also
have written a mark for commit Y which contains it; on subsequent runs,
we will just show the mark for Y in the first place, and not even care
about showing X (as a part of Y) either way. We would only refer to the
mark for X if it appears as part of a different commit, but that is a
rare case not worth worrying about.

Does that match your reasoning?

 Before this, loading the objects of a fresh emacs import, with 260598
 blobs took 14 minutes, after this patch, it takes 3 seconds.

Presumably most of that speed improvement comes from not parsing the
blob objects. I wonder if you could get similar speedups by applying the
do not bother parsing rule from your patch 3. You would still incur
some cost to create a struct blob, but it may or may not be
measurable.  That would mean we get the case not worth worrying about
from above for free. I doubt it would make that big a difference,
though, given the rarity of it. So I am OK with it either way.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/3] fast-export: improve speed by skipping blobs

2013-05-06 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 On Sun, May 05, 2013 at 05:38:53PM -0500, Felipe Contreras wrote:

 We don't care about blobs, or any object other than commits, but in
 order to find the type of object, we are parsing the whole thing, which
 is slow, specially in big repositories with lots of big files.

 I did a double-take on reading this subject line and first paragraph,
 thinking surely fast-export needs to actually output blobs?.

 Reading the patch, I see that this is only about not bothering to load
 blob marks from --import-marks. It might be nice to mention that in the
 commit message, which is otherwise quite confusing.

I had the same reaction first, but not writing the blob _objects_
out to the output stream would not make any sense, so it was fairly
easy to guess what the author wanted to say ;-).

 I'm also not sure why your claim we don't care about blobs is true,
 because naively we would want future runs of fast-export to avoid having
 to write out the whole blob content when mentioning the blob again.

The existing documentation is fairly clear that marks for objects
other than commits are not exported, and the import-marks codepath
discards anything but commits, so there is no mechanism for the
existing fast-export users to leave blob marks in the marks file for
later runs of fast-export to take advantage of.  The second
invocation cannot refer to such a blob in the first place.

The story is different on the fast-import side, where we do say we
dump the full table and a later run can depend on these marks.

By discarding marks on blobs, we may be robbing some optimization
possibilities, and by discarding marks on tags, we may be robbing
some features, from users of fast-export; we might want to add an
option --use-object-marks={blob,commit,tag} or something to both
fast-export and fast-import, so that the former can optionally write
marks for non-commits out, and the latter can omit non commit marks
if the user do not need them. But that is a separate issue.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/3] fast-export: improve speed by skipping blobs

2013-05-06 Thread Junio C Hamano
Junio C Hamano gits...@pobox.com writes:

 By discarding marks on blobs, we may be robbing some optimization
 possibilities, and by discarding marks on tags, we may be robbing
 some features, from users of fast-export; we might want to add an
 option --use-object-marks={blob,commit,tag} or something to both
 fast-export and fast-import, so that the former can optionally write
 marks for non-commits out, and the latter can omit non commit marks
 if the user do not need them. But that is a separate issue.

s/--use-object-marks/--persistent-object-marks/; I think that would
express the issue better.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/3] fast-export: improve speed by skipping blobs

2013-05-06 Thread Jeff King
On Mon, May 06, 2013 at 08:08:45AM -0700, Junio C Hamano wrote:

  I'm also not sure why your claim we don't care about blobs is true,
  because naively we would want future runs of fast-export to avoid having
  to write out the whole blob content when mentioning the blob again.
 
 The existing documentation is fairly clear that marks for objects
 other than commits are not exported, and the import-marks codepath
 discards anything but commits, so there is no mechanism for the
 existing fast-export users to leave blob marks in the marks file for
 later runs of fast-export to take advantage of.  The second
 invocation cannot refer to such a blob in the first place.

OK. If the argument is we do not write them, so do not bother reading
them back in, I think that is reasonable. It could hurt anybody trying
to run fast-export against a marks file created by somebody else, but
that is also the same case that is being helped here (since otherwise,
we would not be seeing blob entries at all).

I do not offhand know enough about the internals of import/export-style
remote-helpers to say whether the hurt case even exists, let alone how
common it is.

 By discarding marks on blobs, we may be robbing some optimization
 possibilities, and by discarding marks on tags, we may be robbing
 some features, from users of fast-export; we might want to add an
 option --use-object-marks={blob,commit,tag} or something to both
 fast-export and fast-import, so that the former can optionally write
 marks for non-commits out, and the latter can omit non commit marks
 if the user do not need them. But that is a separate issue.

Yeah, that would allow the old behavior (and more) if anybody is hurt by
this. It is nice if the order of implementation is more features, then
flip the default because it provides an immediate escape hatch for
anybody who is hurt by the change in default. But again, I do not know
enough to say whether such hurt cases even exist.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/3] fast-export: improve speed by skipping blobs

2013-05-06 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 On Mon, May 06, 2013 at 08:08:45AM -0700, Junio C Hamano wrote:

  I'm also not sure why your claim we don't care about blobs is true,
  because naively we would want future runs of fast-export to avoid having
  to write out the whole blob content when mentioning the blob again.
 
 The existing documentation is fairly clear that marks for objects
 other than commits are not exported, and the import-marks codepath
 discards anything but commits, so there is no mechanism for the
 existing fast-export users to leave blob marks in the marks file for
 later runs of fast-export to take advantage of.  The second
 invocation cannot refer to such a blob in the first place.

 OK. If the argument is we do not write them, so do not bother reading
 them back in, I think that is reasonable.

The way I read builtin/fast-export.c::import_marks() is that it is
more like we do not write them, and we do not read them back in
either IN THE CURRENT CODE.

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/3] fast-export: improve speed by skipping blobs

2013-05-06 Thread Jeff King
On Mon, May 06, 2013 at 09:32:41AM -0700, Junio C Hamano wrote:

  OK. If the argument is we do not write them, so do not bother reading
  them back in, I think that is reasonable.
 
 The way I read builtin/fast-export.c::import_marks() is that it is
 more like we do not write them, and we do not read them back in
 either IN THE CURRENT CODE.

Ahh...I see now. It is not about skipping the blobs as a new behavior,
but rather about skipping them _earlier_, before we have loaded the
object contents from disk.

I took the we don't care about as the general use of fast-export does
not care about, but it is we will literally just drop them a few lines
later.

So yes, I think this is an obviously correct optimization. Thanks for
clarifying, and sorry to be so slow.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/3] fast-export: improve speed by skipping blobs

2013-05-06 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 So yes, I think this is an obviously correct optimization. Thanks for
 clarifying, and sorry to be so slow.

No need to be sorry.  It just shows that the log message could have
been more helpful.

Here is what I tentatively queued.

commit 83582e91d22c66413b291d4d6d45bbeafddc2af9
Author: Felipe Contreras felipe.contre...@gmail.com
Date:   Sun May 5 17:38:53 2013 -0500

fast-export: do not parse non-commit objects while reading marks file

We read from the marks file and keep only marked commits, but in
order to find the type of object, we are parsing the whole thing,
which is slow, specially in big repositories with lots of big files.

There's no need for that, we can query the object information with
sha1_object_info().

Before this, loading the objects of a fresh emacs import, with 260598
blobs took 14 minutes, after this patch, it takes 3 seconds.

This is the way fast-import does it. Also die if the object is not
found (like fast-import).

Signed-off-by: Felipe Contreras felipe.contre...@gmail.com
Signed-off-by: Junio C Hamano gits...@pobox.com
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/3] fast-export: improve speed by skipping blobs

2013-05-06 Thread Jeff King
On Mon, May 06, 2013 at 10:17:41AM -0700, Junio C Hamano wrote:

 Jeff King p...@peff.net writes:
 
  So yes, I think this is an obviously correct optimization. Thanks for
  clarifying, and sorry to be so slow.
 
 No need to be sorry.  It just shows that the log message could have
 been more helpful.
 
 Here is what I tentatively queued.
 [...]

Yeah, that is much for to understand (to me, at least).

Thanks.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/3] fast-export: improve speed by skipping blobs

2013-05-06 Thread Jeff King
On Mon, May 06, 2013 at 01:19:35PM -0400, Jeff King wrote:

  Here is what I tentatively queued.
  [...]
 
 Yeah, that is much for to understand (to me, at least).

Ugh. That was supposed to be much easier to understand. Perhaps I will
learn to type one day.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/3] fast-export: improve speed by skipping blobs

2013-05-06 Thread Felipe Contreras
On Mon, May 6, 2013 at 7:31 AM, Jeff King p...@peff.net wrote:
 On Sun, May 05, 2013 at 05:38:53PM -0500, Felipe Contreras wrote:

 We don't care about blobs, or any object other than commits, but in
 order to find the type of object, we are parsing the whole thing, which
 is slow, specially in big repositories with lots of big files.

 I did a double-take on reading this subject line and first paragraph,
 thinking surely fast-export needs to actually output blobs?.

If you think that, then you are not familiar with the code.

--export-marks=file::
Dumps the internal marks table to file when complete.
Marks are written one per line as `:markid SHA-1`. Only marks
for revisions are dumped; marks for blobs are ignored.

if (deco-base  deco-base-type == 1) {
mark = ptr_to_mark(deco-decoration);
if (fprintf(f, :%PRIu32 %s\n, mark,
sha1_to_hex(deco-base-sha1))  0) {
e = 1;
break;
}
}

 Reading the patch, I see that this is only about not bothering to load
 blob marks from --import-marks. It might be nice to mention that in the
 commit message, which is otherwise quite confusing.

The commit message says it exactly like it is: we don't care about blobs.

If an object is not a commit, we *already* skip it. But as the commit
message already says, we do so by parsing the whole thing.

 I'm also not sure why your claim we don't care about blobs is true,
 because naively we would want future runs of fast-export to avoid having
 to write out the whole blob content when mentioning the blob again.

Because it's pointless to have hundreds and thousands of blob marks
that are *never* going to be used, only for an extremely tiny minority
that would.

 Does that match your reasoning?

It doesn't matter, it has been that way since --export-marks was introduced.

 Before this, loading the objects of a fresh emacs import, with 260598
 blobs took 14 minutes, after this patch, it takes 3 seconds.

 Presumably most of that speed improvement comes from not parsing the
 blob objects. I wonder if you could get similar speedups by applying the
 do not bother parsing rule from your patch 3. You would still incur
 some cost to create a struct blob, but it may or may not be
 measurable.  That would mean we get the case not worth worrying about
 from above for free. I doubt it would make that big a difference,
 though, given the rarity of it. So I am OK with it either way.

How would I know if it's a blob or a commit, if not by the code this
patch introduces?

-- 
Felipe Contreras
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/3] fast-export: improve speed by skipping blobs

2013-05-06 Thread Felipe Contreras
On Mon, May 6, 2013 at 10:08 AM, Junio C Hamano gits...@pobox.com wrote:
 Jeff King p...@peff.net writes:

 On Sun, May 05, 2013 at 05:38:53PM -0500, Felipe Contreras wrote:

 We don't care about blobs, or any object other than commits, but in
 order to find the type of object, we are parsing the whole thing, which
 is slow, specially in big repositories with lots of big files.

 I did a double-take on reading this subject line and first paragraph,
 thinking surely fast-export needs to actually output blobs?.

 Reading the patch, I see that this is only about not bothering to load
 blob marks from --import-marks. It might be nice to mention that in the
 commit message, which is otherwise quite confusing.

 I had the same reaction first, but not writing the blob _objects_
 out to the output stream would not make any sense, so it was fairly
 easy to guess what the author wanted to say ;-).

That's how fast-export has worked since --export-marks was introduced.

 I'm also not sure why your claim we don't care about blobs is true,
 because naively we would want future runs of fast-export to avoid having
 to write out the whole blob content when mentioning the blob again.

 The existing documentation is fairly clear that marks for objects
 other than commits are not exported, and the import-marks codepath
 discards anything but commits, so there is no mechanism for the
 existing fast-export users to leave blob marks in the marks file for
 later runs of fast-export to take advantage of.  The second
 invocation cannot refer to such a blob in the first place.

 The story is different on the fast-import side, where we do say we
 dump the full table and a later run can depend on these marks.

Yes, and gaining nothing but increased disk-space.

 By discarding marks on blobs, we may be robbing some optimization
 possibilities, and by discarding marks on tags, we may be robbing
 some features, from users of fast-export; we might want to add an
 option --use-object-marks={blob,commit,tag} or something to both
 fast-export and fast-import, so that the former can optionally write
 marks for non-commits out, and the latter can omit non commit marks
 if the user do not need them. But that is a separate issue.

How? The only way we might rob optimizations is if there's an obscene
amount files, otherwise the number of blob marks that we are
*actually* going to use ever again is extremely tiny.

-- 
Felipe Contreras
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/3] fast-export: improve speed by skipping blobs

2013-05-06 Thread Jeff King
On Mon, May 06, 2013 at 02:02:13PM -0500, Felipe Contreras wrote:

  I did a double-take on reading this subject line and first paragraph,
  thinking surely fast-export needs to actually output blobs?.
 
 If you think that, then you are not familiar with the code.
 
 --export-marks=file::
 [...]

My point was that nothing in the subject line nor that first paragraph
(nor, for that matter, the entire commit message) says that we are
talking about marks here.

  Reading the patch, I see that this is only about not bothering to load
  blob marks from --import-marks. It might be nice to mention that in the
  commit message, which is otherwise quite confusing.
 
 The commit message says it exactly like it is: we don't care about blobs.

If you guess that we means the marks code and not all of fast-export,
then yes. But I do not have any desire to get into another debate trying
to convince you that there is value to having a clear commit message.
Junio has already proposed a much more readable one.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/3] fast-export: improve speed by skipping blobs

2013-05-06 Thread Felipe Contreras
On Mon, May 6, 2013 at 11:20 AM, Jeff King p...@peff.net wrote:
 On Mon, May 06, 2013 at 08:08:45AM -0700, Junio C Hamano wrote:

  I'm also not sure why your claim we don't care about blobs is true,
  because naively we would want future runs of fast-export to avoid having
  to write out the whole blob content when mentioning the blob again.

 The existing documentation is fairly clear that marks for objects
 other than commits are not exported, and the import-marks codepath
 discards anything but commits, so there is no mechanism for the
 existing fast-export users to leave blob marks in the marks file for
 later runs of fast-export to take advantage of.  The second
 invocation cannot refer to such a blob in the first place.

 OK. If the argument is we do not write them, so do not bother reading
 them back in, I think that is reasonable.

We already do that:

5d3698f fast-export: avoid importing blob marks

 It could hurt anybody trying
 to run fast-export against a marks file created by somebody else, but
 that is also the same case that is being helped here (since otherwise,
 we would not be seeing blob entries at all).

 I do not offhand know enough about the internals of import/export-style
 remote-helpers to say whether the hurt case even exists, let alone how
 common it is.

 By discarding marks on blobs, we may be robbing some optimization
 possibilities, and by discarding marks on tags, we may be robbing
 some features, from users of fast-export; we might want to add an
 option --use-object-marks={blob,commit,tag} or something to both
 fast-export and fast-import, so that the former can optionally write
 marks for non-commits out, and the latter can omit non commit marks
 if the user do not need them. But that is a separate issue.

 Yeah, that would allow the old behavior (and more) if anybody is hurt by
 this.

There is no behavior change in this patch. We do *exactly* the same as before.

-- 
Felipe Contreras
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/3] fast-export: improve speed by skipping blobs

2013-05-06 Thread Felipe Contreras
On Mon, May 6, 2013 at 2:11 PM, Jeff King p...@peff.net wrote:
 On Mon, May 06, 2013 at 02:02:13PM -0500, Felipe Contreras wrote:

  I did a double-take on reading this subject line and first paragraph,
  thinking surely fast-export needs to actually output blobs?.

 If you think that, then you are not familiar with the code.

 --export-marks=file::
 [...]

 My point was that nothing in the subject line nor that first paragraph
 (nor, for that matter, the entire commit message) says that we are
 talking about marks here.

s/$/ while loading marks/. Fixed.

-- 
Felipe Contreras
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/3] fast-export: improve speed by skipping blobs

2013-05-06 Thread Junio C Hamano
Felipe Contreras felipe.contre...@gmail.com writes:

 The story is different on the fast-import side, where we do say we
 dump the full table and a later run can depend on these marks.

 Yes, and gaining nothing but increased disk-space.

I thought that the gaining nothing has already been refuted by the
discussion several hours ago...

cf. http://thread.gmane.org/gmane.comp.version-control.git/223275/focus=223440

Puzzled...

 By discarding marks on blobs, we may be robbing some optimization
 possibilities, and by discarding marks on tags, we may be robbing
 some features, from users of fast-export; we might want to add an
 option --use-object-marks={blob,commit,tag} or something to both
 fast-export and fast-import, so that the former can optionally write
 marks for non-commits out, and the latter can omit non commit marks
 if the user do not need them. But that is a separate issue.

 How?

 * if we teach fast-import to optionally not write marks for blobs
   and trees out, your remote-bzr can take advantage of it, because
   it does not reuse marks for non-commits in later runs, right?
   Existing users like cvs2git that do not ask to skip marks for
   non-commits will not be hurt and keep referring to blobs an
   earlier run wrote out.

 * if we teach fast-export to optionally write marks for blobs and
   trees out, the users of fast-export could reuse marks for blobs
   and trees in later runs (perhaps they can drive fast-export from
   the output of git log --raw, noticing blob object names they
   already saw).  Existing users that do not ask for such a feature
   will not be hurt.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/3] fast-export: improve speed by skipping blobs

2013-05-06 Thread Junio C Hamano
Felipe Contreras felipe.contre...@gmail.com writes:

 How?

  * if we teach fast-import to optionally not write marks for blobs
and trees out, your remote-bzr can take advantage of it,

 I already said remote-bzr is irrelevant. *Everybody* benefits.

Everybody who does _not_ need to look at marks for non-commits from
previous run does.  What about the others who do?

Surely, some lucky ones may get the benefit of a new optimization
for free if you turn it on uncondtionally without an escape hatch,
but that goes against our goal of not to knowingly introduce any
regression.  Michael's cvs2git might have a way to work the breakage
around (I will let him comment on your suggested workaround), but as
long as he has been coding it using the documented feature, why
should he even change anything for no gain at all in the first
place?  Even if you have a workaround, that does not change the fact
that a removal of a feature that somebody has been depending on is a
regression.

What's so difficult to understand that by default the responsibility
of making sure an optimization applies safely to a program that uses
a new optmization lies on that program, in other words, a new
feature is by default an opt-in?

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/3] fast-export: improve speed by skipping blobs

2013-05-06 Thread Felipe Contreras
On Mon, May 6, 2013 at 8:59 PM, Junio C Hamano gits...@pobox.com wrote:
 Felipe Contreras felipe.contre...@gmail.com writes:

 How?

  * if we teach fast-import to optionally not write marks for blobs
and trees out, your remote-bzr can take advantage of it,

 I already said remote-bzr is irrelevant. *Everybody* benefits.

 Everybody who does _not_ need to look at marks for non-commits from
 previous run does.

IOW; everyone.

 What about the others who do?

Like who?

 Surely, some lucky ones may get the benefit of a new optimization
 for free if you turn it on uncondtionally without an escape hatch,
 but that goes against our goal of not to knowingly introduce any
 regression.

That's different. One thing is to turn it on unconditionally, and
another thing is to turn it on by *default*.

 Michael's cvs2git might have a way to work the breakage
 around (I will let him comment on your suggested workaround), but as
 long as he has been coding it using the documented feature, why
 should he even change anything for no gain at all in the first
 place?  Even if you have a workaround, that does not change the fact
 that a removal of a feature that somebody has been depending on is a
 regression.

Who is depending on it? Michael didn't say that he used that feature,
merely that it was documented in cvs2git, because Windows doesn't have
'cat'. He claimed that other people *might* be using that feature,
but we don't *know*.

Is a couple of commands somebody wrote in some documentation which can
be easily fixed, reason enough to punish everyone else?

 What's so difficult to understand that by default the responsibility
 of making sure an optimization applies safely to a program that uses
 a new optmization lies on that program, in other words, a new
 feature is by default an opt-in?

Is that written in some git bible descended from some god that I
missed? If not, everything are guidelines, and guidelines are there
for a reason, and those reasons can be challenged, and so can the
guidelines.

Sometimes it makes sense to make a new feature opt-in, sometimes it
doesn't, there are no absolutes, there should be no dogmas.

-- 
Felipe Contreras
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html