Re: cvsps, parsecvs, svn2git and the CVS exporter mess

2013-01-06 Thread Michael Haggerty
On 01/05/2013 04:11 PM, Eric S. Raymond wrote:
> Perhaps I was unclear.  I consider the interface design error to
> be not in the fact that all the blobs are written first or detached,
> but rather that the implementation detail of the two separate journal
> files is ever exposed.
> 
> I understand why the storage of intermediate results was done this
> way, in order to decrease the tool's working set during the run, but
> finishing by automatically concatenating the results and streaming
> them to stdout would surely have been the right thing here.

cvs2svn/cvs2git is built to be able to handle very large CVS
repositories, not only those that can fit in RAM.  This goal influences
a lot of its design, including the pass-by-pass structure with
intermediate databases and the resumability of passes.

The blobfile necessarily contains every version of every file, with no
delta-encoding and no compression.  Its size can be a large multiple of
the on-disk size of the original CVS repository.  If the "save to
tempfiles then cat tempfiles at end of run" behavior were hard-coded
into cvs2git, then there would be no way to avoid requiring enough
temporary space to hold the whole blobfile.

Writing the blobfile into a separate file, on the other hand, means that
for example the blobfile could be written into a named pipe connected to
the standard input of "git fast-import" [1].  "git fast-import" could
even be run on a remote server.
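
The named-pipe idea can be tried out in isolation with a minimal Python sketch. It assumes a POSIX system and a writer that only ever appends; footnote [1] points out that generate_blobs.py currently sometimes seeks back in the blob file, so this shows the mechanism and is not a drop-in recipe for cvs2git. The function name and the FIFO file name are made up for the example, and the reader thread stands in for a consumer such as "git fast-import".

```python
import os
import tempfile
import threading


def stream_through_fifo(chunks):
    """Write chunks into a FIFO while a reader thread consumes them.

    Nothing is stored on disk: the data only passes through the pipe.
    Returns the bytes that the reader saw.
    """
    received = []
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "blobfile.fifo")  # hypothetical name
    os.mkfifo(fifo_path)

    def reader():
        # Stand-in for the consumer process reading its standard input.
        with open(fifo_path, "rb") as src:
            for piece in iter(lambda: src.read(65536), b""):
                received.append(piece)

    consumer = threading.Thread(target=reader)
    consumer.start()
    try:
        # Opening for writing blocks until the reader has opened the FIFO.
        with open(fifo_path, "wb") as dst:
            for chunk in chunks:
                dst.write(chunk)
    finally:
        consumer.join()
        os.remove(fifo_path)
        os.rmdir(workdir)
    return b"".join(received)
```

The writer never needs temporary space for the whole stream, which is the point made above about very large blobfiles.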

I consider these bigger advantages than the ability to pipe the output
of cvs2git directly into another command.

> The downstream cost of letting the journalling implementation be
> exposed, instead, can be seen in this snippet from the new git-cvsimport
> I've been working on:
> 
>     def command(self):
>         "Emit the command implied by all previous options."
>         return ("(cvs2git --username=git-cvsimport --quiet --quiet "
>                 "--blobfile={0} --dumpfile={1} {2} {3} && cat {0} {1} && rm {0} {1})"
>                 ).format(tempfile.mkstemp()[1], tempfile.mkstemp()[1],
>                          self.opts, self.modulepath)
> 
> According to the documentation, every caller of cvs2git must go
> through analogous contortions!  This is not the Unix way; if Unix
> design principles had been minimally applied, that second line would
> just read like this:
> 
>  return "cvs2git --username=git-cvsimport --quiet --quiet"

Never in my worst nightmares did I imagine that my terrible design taste
would force you to type an extra two lines of code.  Oh the humanity!

By the way, patches are welcome.  And you don't need to trumpet their
imminent arrival [2] or malign the existing code beforehand.  Moreover,
it would be adequate if you just demonstrate working code and *then* ask
for "sign-in", rather than the other way around.

> If Unix design principles had been thoroughly applied, the "--quiet
> --quiet" part would be unnecessary too - well-behaved Unix commands
> *default* to being completely quiet unless either (a) they have an
> exceptional condition to report, or (b) their expected running time is
> so long that tasteful silence would leave users in doubt that they're
> working.

cvs2git is not a command that one uses 100 times a day.  It is a tool
for one-shot conversions of CVS repositories to git.  These conversions
can take hours or even days of processing time (not to mention the time
for configuring the conversion and changing the rest of a project's
infrastructure from CVS to git).  So yes, I think we would like to
appeal to (b) and humbly ask for your permission to give the user some
feedback during the conversion.

> (And yes, I do think violating these principles is a lapse of taste when
> git tools do it, too.)
> 
> Michael Haggerty wants me to trust that cvs2git's analysis stage has
> been fixed, but I must say that is a more difficult leap of faith when
> two of the most visible things about it are still (a) a conspicuous
> instance of interface misdesign, and (b) documentation that is careless and
> incomplete.

The cvs2git documentation is lacking; I admit it (as opposed to the
cvs2svn documentation, which I think is quite complete).  And the
program itself also has a lot of rough edges, for example its inability
to convert .cvsignore files into .gitignore files.  Patches are welcome.
I haven't used cvs2svn for my own purposes in many years and I've
*never* once had a need to use cvs2git; I maintain these programs purely
as a service to the community.  Most of the community seems satisfied
with the programs as they are, and if not they usually submit courteous
and concrete bug reports or submit patches.

I request that you follow their example.  I especially ask that you
refrain from spreading public FUD about imagined problems based on
speculation.  Please do your tests and *then* report any problems that
you find.

Yours,
Michael

[1] In fact, the current implementation of generate_blobs.py sometimes
seeks back to earlier parts of the blob file when it needs the fulltext
of a revision that has already be

Re: cvsps, parsecvs, svn2git and the CVS exporter mess

2013-01-05 Thread Jonathan Nieder
Eric S. Raymond wrote:

> Michael Haggerty wants me to trust that cvs2git's analysis stage has
> been fixed, but I must say that is a more difficult leap of faith when
> two of the most visible things about it are still (a) a conspicuous
> instance of interface misdesign, and (b) documentation that is careless and
> incomplete.

For what it's worth, I use cvs2git quite often.  I've found it to work
well and its code to be clear and its developers responsive.  But I
don't mind if we disagree, and having multiple implementations to explore
the design space of importers doesn't seem like a terrible outcome.

Thanks for your work,
Jonathan
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: cvsps, parsecvs, svn2git and the CVS exporter mess

2013-01-05 Thread Eric S. Raymond
Bart Massey:
> I don't know what Eric Raymond "officially end-of-life"-ing parsecvs means?

You and Keith handed me the maintainer's baton.  If I were to EOL it,
that would be the successor you two designated judging in public that
the code is unsalvageable or has become pointless.  If you wanted to
exclude the possibility that a successor would make that call, you
shouldn't have handed it off in a state so broken that I can't even
test it properly.

But I don't in fact think the parsecvs code is pointless. The fact that it
only needs the ,v files is nifty and means it could be used as an RCS
exporter too.  The parsing and topo-analysis stages look like really
good work, very crisp and elegant (which is no less than I'd expect
from Keith, actually).

Alas, after wrestling with it I'm beginning to wonder whether the
codebase is salvageable by anyone but Keith himself.  The tight coupling
to the git cache mechanism is the biggest problem.  So far, I can't
figure out what tree.c is actually doing in enough detail to fix it or pry
it loose - the code is opaque and internal documentation is lacking.

More generally, interfacing to the unstable API of libgit was clearly
a serious mistake, leading directly to the current brokenness.  The
tool should have emitted an import stream to begin with.  I'm trying
to fix that, but success is looking doubtful.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>


Re: cvsps, parsecvs, svn2git and the CVS exporter mess

2013-01-05 Thread Eric S. Raymond
Max Horn:
> Hm, you snipped this part of Michael's mail:
> 
> >> However, if that is a
> >> problem, it is possible to configure cvs2git to write the blobs inline
> >> with the rest of the dumpfile (this mode is supported because "hg
> >> fast-import" doesn't support detached blobs).
> 
> I would call "hg fast-import" a main potential customer, given that
> "cvs2hg" is another part of the cvs2svn suite. So I can't quite see how you
> can come to your conclusion above...

Perhaps I was unclear.  I consider the interface design error to
be not in the fact that all the blobs are written first or detached,
but rather that the implementation detail of the two separate journal
files is ever exposed.

I understand why the storage of intermediate results was done this
way, in order to decrease the tool's working set during the run, but
finishing by automatically concatenating the results and streaming
them to stdout would surely have been the right thing here.
 
The downstream cost of letting the journalling implementation be
exposed, instead, can be seen in this snippet from the new git-cvsimport
I've been working on:

    def command(self):
        "Emit the command implied by all previous options."
        return ("(cvs2git --username=git-cvsimport --quiet --quiet "
                "--blobfile={0} --dumpfile={1} {2} {3} && cat {0} {1} && rm {0} {1})"
                ).format(tempfile.mkstemp()[1], tempfile.mkstemp()[1],
                         self.opts, self.modulepath)

According to the documentation, every caller of cvs2git must go
through analogous contortions!  This is not the Unix way; if Unix
design principles had been minimally applied, that second line would
just read like this:

 return "cvs2git --username=git-cvsimport --quiet --quiet"
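
For anyone adapting the longer form of the command, here is a hedged sketch of the same temp-file dance written as a standalone function. It is not the git-cvsimport code: the function name is hypothetical, and it assumes `opts` is already a shell-ready option string. The only changes from the snippet above are that the descriptors returned by tempfile.mkstemp() are closed rather than left open, and that the paths are shell-quoted.

```python
import os
import shlex
import tempfile


def build_command(opts, modulepath):
    """Return a shell command that runs cvs2git and streams both outputs.

    opts is assumed to be an already shell-formatted option string.
    """
    paths = []
    for _ in range(2):
        fd, path = tempfile.mkstemp()
        os.close(fd)  # mkstemp() hands back an open descriptor; close it
        paths.append(shlex.quote(path))
    blobfile, dumpfile = paths
    return ("(cvs2git --username=git-cvsimport --quiet --quiet "
            "--blobfile={0} --dumpfile={1} {2} {3} && cat {0} {1} && rm {0} {1})"
            ).format(blobfile, dumpfile, opts, shlex.quote(modulepath))
```

As in the original, the blobfile is emitted before the dumpfile, and the temporary files are removed only if the earlier steps succeed.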

If Unix design principles had been thoroughly applied, the "--quiet
--quiet" part would be unnecessary too - well-behaved Unix commands
*default* to being completely quiet unless either (a) they have an
exceptional condition to report, or (b) their expected running time is
so long that tasteful silence would leave users in doubt that they're
working.

(And yes, I do think violating these principles is a lapse of taste when
git tools do it, too.)

Michael Haggerty wants me to trust that cvs2git's analysis stage has
been fixed, but I must say that is a more difficult leap of faith when
two of the most visible things about it are still (a) a conspicuous
instance of interface misdesign, and (b) documentation that is careless and
incomplete.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>


Re: cvsps, parsecvs, svn2git and the CVS exporter mess

2013-01-05 Thread Max Horn

On 03.01.2013, at 21:53, Eric S. Raymond wrote:

> Michael Haggerty:
>> There are two good reasons that the output is written to two separate files:
> 
> Those are good reasons to write to a pair of tempfiles, and I was able
> to deduce in advance most of what your explanation would be from the
> bare fact that you did it that way.
> 
> They are *not* good reasons for having an interface that exposes this
> implementation detail to the caller - that choice I consider a failure
> of interface-design judgment.  But I know how to fix this in a simple and
> backward-compatible way, and will do so when I have time to write you
> a patch.  Next week or the week after, most likely.
> 
> Also, the cvs2git manual page is still rather half-baked and careless,
> with several fossil references to cvs2svn that shouldn't be there and
> obviously incomplete feature coverage. Fixing these bugs is also on my
> to-do list for sometime this month.
> 
> I'd be willing to put in this work anyway, but it is still in the back of
> my mind that if cvs2git wins the test-suite competition I might
> officially end-of-life both cvsps and parsecvs.  One of the features
> of the new git-cvsimport is direct support for using cvs2git as a
> conversion engine.
> 
>> A potentially bigger problem is that if you want to handle such
>> blob/dump output, you have to deal with git-fast-import format's "blob"
>> command as opposed to only handling inline blobs.
> 
> Not a problem.  All of the main potential consumers for this output,
> including reposurgeon, handle the blob command just fine.

Hm, you snipped this part of Michael's mail:

>> However, if that is a
>> problem, it is possible to configure cvs2git to write the blobs inline
>> with the rest of the dumpfile (this mode is supported because "hg
>> fast-import" doesn't support detached blobs).

I would call "hg fast-import" a main potential customer, given that
"cvs2hg" is another part of the cvs2svn suite. So I can't quite see how you can
come to your conclusion above...



Cheers,
Max


Re: cvsps, parsecvs, svn2git and the CVS exporter mess

2013-01-03 Thread Eric S. Raymond
Michael Haggerty:
> There are two good reasons that the output is written to two separate files:

Those are good reasons to write to a pair of tempfiles, and I was able
to deduce in advance most of what your explanation would be from the
bare fact that you did it that way.

They are *not* good reasons for having an interface that exposes this
implementation detail to the caller - that choice I consider a failure
of interface-design judgment.  But I know how to fix this in a simple and
backward-compatible way, and will do so when I have time to write you
a patch.  Next week or the week after, most likely.

Also, the cvs2git manual page is still rather half-baked and careless,
with several fossil references to cvs2svn that shouldn't be there and
obviously incomplete feature coverage. Fixing these bugs is also on my
to-do list for sometime this month.

I'd be willing to put in this work anyway, but it is still in the back of
my mind that if cvs2git wins the test-suite competition I might
officially end-of-life both cvsps and parsecvs.  One of the features
of the new git-cvsimport is direct support for using cvs2git as a
conversion engine.
 
> A potentially bigger problem is that if you want to handle such
> blob/dump output, you have to deal with git-fast-import format's "blob"
> command as opposed to only handling inline blobs.

Not a problem.  All of the main potential consumers for this output,
including reposurgeon, handle the blob command just fine.

> cvs2git does not currently support incremental conversions; therefore, a
> cvsps-based option (if it would actually work, that is) would have at
> least one advantage over cvs2git.

Yes. The reason I didn't ship the replacement patch Junio was
expecting yesterday is that I don't have test coverage for the
incremental case.  I'm working on that now.

> cvs2svn has an extensive test suite which includes tests derived from
> bug reports that we have received over the years.  I adapted a few of
> its test repositories to create the git test suite additions that I made
> in Feb 2009, but there are many more in our project.

I've merged those into my tree.

> I think it would be great to have a way to test across tools, though
> please realize that the inference of the most plausible "true" CVS
> history is partly objective but also often a matter of heuristics and
> taste.  Moreover, the choice of how to represent the inferred history in
> git, which has rather a different model than CVS/Subversion, is also
> non-obvious and somewhat controversial.  I expect that there will be a
> number of simple CVS repositories for which we can all agree about the
> correct git output, but not far away will be a vast number for which the
> "correct" answer is unclear.  Many of the interesting tests would fall
> into the latter category.

I'm aware of the problem.  One of the interesting questions is how much
further into the weird cases everybody can agree on what correct 
translation looks like.  We won't know until we push it.
 
> It's not clear what you want me to sign off on.

If you're not willing to use the new suite, my spending the effort 
required to genericize it gets much less interesting.  I needed 
Junio's agreement because I wanted to move the old git-cvsimport
tests from the git tree to the new test suite; they're not really
tests of the wrapper script at all but of the conversion engines.

>   I guess you want to
> replace (or augment?) the cvs2svn test suite with one based on your
> framework? 

Augment, not replace - and just as importantly, commit to writing 
new tests into the new generic framework when they don't involve a 
tool-specific option.  It would be silly and duplicative for us *not*
to be sharing as many tests as we can.

> * We definitely want to continue testing the Subversion output of
> cvs2svn.  A test suite that only tests the git output could at best be
> an addition to the current test suite, not a replacement for it.  (That
> being said, the addition of good tests of the 2git output would be great.)

Agreed.

> * A test suite that tests only the easy cases wouldn't really be
> interesting, because the difficult cases are where the potential
> problems lie.

Yes, I know.  I'm arguing that we should be doing that exploration
jointly rather than separately.

> * It would be unfortunate if the cvs2svn test suite would grow another
> run-time dependency or if we would have to invest a lot of time
> synchronizing with another project, though if the gain were big enough
> we could consider it.

I know how to keep the friction cost low.  You'll see more about this when
I split off the test suite and announce it.

> * The licenses obviously have to be compatible to the extent required by
> the level of coupling.

I don't think this will be a problem.  You own the copyright on your tests and
I own it on mine, so we can relicense under whatever common license we choose.
I'm not fussy about what we use; AS

Re: cvsps, parsecvs, svn2git and the CVS exporter mess

2013-01-03 Thread Martin Langhoff
On Sat, Dec 22, 2012 at 12:36 PM, Eric S. Raymond wrote:
> It is pure accident that I now maintain two of these.

Maintainership is always temporary.

> Having three different tools for this job seems to me duplicative and
> pointless; two of them should probably be let die an honorable death.

Perhaps just maintain the code that serves your goals. That way, you
don't need long trolly emails nor approval from anyone.




m
--
 martin.langh...@gmail.com
 mar...@laptop.org -- Software Architect - OLPC
 - ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff


Re: cvsps, parsecvs, svn2git and the CVS exporter mess

2013-01-03 Thread Michael Haggerty
On 12/22/2012 06:36 PM, Eric S. Raymond wrote:
> * One is Michael Haggerty's cvs2git.  I had bad experiences with the
> cvs2svn code it's derived from in the past, but Michael believes those
> problems have been fixed and I will accept that - at least until I can
> test for myself.  Its documented interface is not quite good enough
> yet; as the documentation says, "The data that should be fed to git
> fast-import are written to two files, which have to be loaded into git
> fast-import manually."

There are two good reasons that the output is written to two separate files:

1. The files are generated during different passes of cvs2git, and since
the cvs2git conversion is restartable pass-by-pass, the first file might
only need to be generated once even while the user is iterating on
adjustments to other conversion options.

2. The first ("blobfile") contains blob definitions for file revisions,
which are read out of the RCS files in the order they are held in the
RCS file.  This is vastly faster than reading the file revisions in the
order that they are needed for git commits because (1) all revisions for
a file can be computed from one serial read of the RCS file; (2) there
is no need to jump around from rcsfile to rcsfile.  The second
("dumpfile") stitches the blobs together into git commits by referring
to the blobs that are needed.  This file is smaller because it doesn't
contain the actual file contents.  Another advantage of this approach is
that a blob need only appear once in the blobfile even if it is used
multiple times in the git history.

Anyway, surely cat'ing two output files together is not such a difficult
problem?
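
For a caller that prefers to do the concatenation in Python rather than with cat, a minimal sketch could look like the following. The function name and its `out` parameter are hypothetical; the only thing taken from the description above is the order, blobfile first and dumpfile second.

```python
import shutil
import sys


def emit_stream(blobfile, dumpfile, out=None):
    """Copy the blobfile, then the dumpfile, to a binary output stream.

    With out=None the data goes to standard output, ready to be piped
    into a consumer such as "git fast-import".
    """
    out = sys.stdout.buffer if out is None else out
    for path in (blobfile, dumpfile):
        with open(path, "rb") as src:
            # copyfileobj works in fixed-size chunks, so neither file is
            # ever held in memory as a whole.
            shutil.copyfileobj(src, out)
```

Because the copy is chunked, the memory needed does not depend on the size of the blobfile.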

A potentially bigger problem is that if you want to handle such
blob/dump output, you have to deal with git-fast-import format's "blob"
command as opposed to only handling inline blobs.  However, if that is a
problem, it is possible to configure cvs2git to write the blobs inline
with the rest of the dumpfile (this mode is supported because "hg
fast-import" doesn't support detached blobs).  You would have to create
an options file that uses GitRevisionInlineWriter, similar to what is
done in cvs2hg-example.options.
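
To make the difference between detached and inline blobs concrete, here is a small sketch that builds the relevant fast-import commands as bytes. The helper names are made up for illustration and this is not cvs2git's writer code; the command syntax follows the git-fast-import documentation.

```python
def data_cmd(payload: bytes) -> bytes:
    """Format a fast-import 'data' command with an exact byte count."""
    return b"data %d\n" % len(payload) + payload + b"\n"


def detached_blob(mark: int, content: bytes) -> bytes:
    """A standalone 'blob' command, as would appear in the blobfile."""
    return b"blob\nmark :%d\n" % mark + data_cmd(content)


def modify_by_mark(mark: int, path: str) -> bytes:
    """A filemodify line in a commit that refers to a blob by its mark."""
    return b"M 100644 :%d %s\n" % (mark, path.encode())


def modify_inline(path: str, content: bytes) -> bytes:
    """A filemodify line that carries the file content inline."""
    return b"M 100644 inline %s\n" % path.encode() + data_cmd(content)
```

With detached blobs, any number of commits can reuse the same mark, so the content is written once; with inline blobs, the content is repeated in every commit that uses it.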

> [...]
> Having three different tools for this job seems to me duplicative and
> pointless; two of them should probably be let die an honorable death.
> I don't actually care which of the three survives - and, in
> particular, if I determine that cvs2git is doing the best job of the
> three I am quite willing to declare end-of-life for cvsps and
> parsecvs.  It's not like I don't have plenty of other projects to work
> on.

cvs2git does not currently support incremental conversions; therefore, a
cvsps-based option (if it would actually work, that is) would have at
least one advantage over cvs2git.

> I presently know of three test suites other than mine. One was built
> by Heiko to test cvsps, another lives in the git t/ directory, and the
> third is cvs2git's. I haven't looked at cvs2git's yet, but the others
> are not in their present form suited to where I am taking cvsps and
> parsecvs.  Heiko's relies on the default human-readable cvsps format,
> which I consider obsolete and uninteresting.  The git tests are
> dependent on details of porcelain behavior.  I think it would be
> better to test import-stream output.

cvs2svn has an extensive test suite which includes tests derived from
bug reports that we have received over the years.  I adapted a few of
its test repositories to create the git test suite additions that I made
in Feb 2009, but there are many more in our project.

A lot of our test suite deals with additional conversion features, like:

* Re-encoding filenames, usernames, and log messages from whatever
happens to have been used in the CVS repository into UTF-8

* Fixing CVS branches, tags, and mixed branch/tag messes according to
user wishes; renaming branches and tags

* Allowing the user to influence the choice of which branch should serve
as the source for another branch/tag (CVS records this information very
ambiguously)

* Fixing binary vs. text files, expanding/contracting CVS keywords, etc.

* Removing lots of synthetic revisions and other cruft generated by CVS
to fit within the RCS file format

* Dealing with vendor branches in a sensible way, especially considering
that very many users misuse vendor branches for initial imports

* Dealing with various common types of CVS repository corruption

See our list of features [1] for more details.  Presumably many of these
features would not be covered by your test framework, and are not
supported by the other conversion tools.

Unfortunately, our tests are mostly based on cvs2svn (i.e., not 2git);
that is, the conversion is done with cvs2svn and checked by verifying
the contents of the resulting Subversion repository.

The script contrib/verify-cvs2svn.py is another kind of test; it checks
every branch and tag out of CVS and the destination repository and
verifies that their contents are identical.

Re: cvsps, parsecvs, svn2git and the CVS exporter mess

2012-12-23 Thread Eric S. Raymond
Heiko Voigt:
> Please share so we can have a look. BTW, where can I find your cvsps
> code?

https://gitorious.org/cvsps

Developments of the last 48 hours:

1. Andreas Schwab sent me a patch that uses commitids wherever the history
   has them - this makes all the time-skew problems go away.  I added code
   to warn if commitids aren't present, so users will get a clear indication
   of when time-skew problems might bite them versus when that is happily
   impossible.

2. I've scrapped a lot of obsolete code and options.  The repo head
   version uses what used to be called cvs-direct mode all the time
   now; it works, and the effect on performance is major.  This also
   means that cvsps doesn't need to use any local CVS commands or even
   have CVS installed where it runs.
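
The commitid approach in item 1 can be pictured with a short Python sketch. This is not the cvsps code (cvsps is written in C); the function name, the tuple layout, and the warning text are assumptions made for the example. Revisions that share a commitid are grouped into one changeset without looking at timestamps, and revisions without a commitid trigger a warning.

```python
import warnings
from collections import defaultdict


def group_by_commitid(revisions):
    """Group (path, revision, commitid) tuples into changesets.

    commitid may be None for history written by older CVS versions.
    Returns (changesets, unidentified).
    """
    changesets = defaultdict(list)
    unidentified = []
    for path, rev, commitid in revisions:
        if commitid:
            changesets[commitid].append((path, rev))
        else:
            unidentified.append((path, rev))
    if unidentified:
        warnings.warn(
            "%d revisions lack a commitid; time-skew problems are possible"
            % len(unidentified)
        )
    return dict(changesets), unidentified
```

How the unidentified revisions are then grouped is left open here; the sketch only separates them out and warns.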

> From my past cvs conversion experiences my personal guess is that
> cvs2svn will win this competition.

That could be.  But right now cvsps has one significant advantage over
cvs2git (which parsecvs might share) - it's *blazingly* fast.  So fast
that I scrapped all the local-caching logic; there seems no point to it at
today's network speeds, and that's one less layer of complications to
go wrong.

I've removed a couple hundred lines of code and the program works
better and faster than it did before.  That's having a good day!
-- 
Eric S. Raymond <http://www.catb.org/~esr/>


Re: cvsps, parsecvs, svn2git and the CVS exporter mess

2012-12-23 Thread Heiko Voigt
Hi,

On Sat, Dec 22, 2012 at 12:36:48PM -0500, Eric S. Raymond wrote:
> If we can agree on this, I'll start a public repo, and contribute my
> Python framework - it's more capable than any of the shell harnesses
> out there because it can easily drive interleaved operations on multiple 
> checkout directories.

Please share so we can have a look. BTW, where can I find your cvsps
code?

> Anybody who is still interested in this problem should contribute
> tests.  Heiko Voigt, I'd particularly like you in on this.

If it does not take too much effort I could port my tests to the new
framework. Since I am currently not in active need of cvs conversions
it's not of big interest to me anymore. But if it does not take too much
time I am happy to help.

From my past cvs conversion experiences my personal guess is that
cvs2svn will win this competition.

Cheers Heiko