Re: [Monotone-devel] cvs
Hendrik,

On 07/15/2013 07:21 PM, Hendrik Boom wrote:
> Just wondering what the current status of monotone's CVS support is; in particular, cvs_import, cvs_pull, cvs_sync, cvs_takeover, and I've heard there's even a cvs_push.

Standard monotone's cvs_import works fine for simple cases. I didn't do any work on the cvs_import branch and I don't think it's in a usable state. You might want to check whether cvs2svn can be of help: it has a nice git export function and its CVS sanitizer code is field proven. I'm not sure you can get that into monotone, though.

> Is the conversion a one-time event, or can it keep up with further revisions on the cvs site without having to start over?

These are all one-time conversion options, which need access to the RCS files on the CVS server. tailor may be an option if you want a continuous mirror; it certainly has a monotone plugin. I'm not sure what the status of cvs_sync is, but it's intended to provide continuous synchronization as well.

Regards

Markus Wanner
[Monotone-devel] cvs
Just wondering what the current status of monotone's CVS support is; in particular, cvs_import, cvs_pull, cvs_sync, cvs_takeover, and I've heard there's even a cvs_push. Do any of these work well? Do they come close to importing most of the history in an intelligible way? Is the conversion a one-time event, or can it keep up with further revisions on the cvs site without having to start over?

-- hendrik
[Monotone-devel] cvs-import
I just looked again at the documentation page http://monotone.ca/docs/RCS.html#RCS

Perhaps I remember wrong, but I thought a year or so ago cvs_import was hedged with limitations and warnings -- things like it would only import one branch, and the like. I had been considering modifying cvs2svn to turn it into a cvs2mtn.

Now the documentation seems to indicate that "mtn cvs_import pathname" does the whole job? Have things changed since then? Does this mean that I no longer have to build cvs2mtn? If so, thanks, especially since I haven't had any real time to work on that in the past year, so no work is wasted.

-- hendrik
RE: [Monotone-devel] cvs and monotone
Somehow I missed Markus' reply, and just found it in the archives, so:

Markus Schiltknecht wrote on Sun, 27 May 2007 11:55:04 -0700:
> Hi,
>
> Kelly F. Hickel wrote:
>> Is reality really as dark as it seems at the moment?
>
> If you really need connected branches, I fear the answer is yes.

Yes, we pretty much do. We could choose a small subset of branches, but being unable to do it at all seems like a pretty big roadblock.

> I'm still struggling with the cvsimport-branch-reconstruction branch of monotone. But CVS is so wicked and brain damaged that it's very hard to get usable information from it.
>
> Is your CVS repository publicly available?

No, it's not.

> Regards
>
> Markus

--
Kelly F. Hickel
Senior Software Architect
MQSoftware, Inc
952.345.8677
[EMAIL PROTECTED]

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Kelly F. Hickel
Sent: Friday, May 25, 2007 11:25 AM
To: monotone-devel@nongnu.org
Subject: [Monotone-devel] cvs and monotone

Was about to do another test import of our cvs repo to play around with the new mtn. Ran into the fact that the cvs_import command has a new required argument (or at least one I don't remember using) named --branch. What's that for??

While googling for the answer to that, I ran into this page http://www.venge.net/mtn-wiki/MonotoneAndCVS which contains the following terribly disturbing nugget:

> There is an important limitation, though. This method doesn't presently try to attach branches to their parents, either on the mainline or on other branches; instead each CVS branch gets its own separate linear history in the resulting monotone db.

That's pretty amazingly disturbing, especially since I don't believe I'd ever seen that statement before. Assuming that's actually true, it seems to be a pretty big problem. We'd discussed just skimming the currently active branches from cvs into monotone, but even that is problematical if it's not going back to the most recent common ancestor of the branches.

I also ran across this: http://www.venge.net/mtn-wiki/CvsImport which at least gives me hope.

Is reality really as dark as it seems at the moment?

Thanks,

--
Kelly F. Hickel
Senior Software Architect
MQSoftware, Inc
952.345.8677
[EMAIL PROTECTED]
Re: [Monotone-devel] cvs and monotone
Hi,

Kelly F. Hickel wrote:
> Is reality really as dark as it seems at the moment?

If you really need connected branches, I fear the answer is yes.

I'm still struggling with the cvsimport-branch-reconstruction branch of monotone. But CVS is so wicked and brain damaged that it's very hard to get usable information from it.

Is your CVS repository publicly available?

Regards

Markus
Re: [Monotone-devel] cvs import
Hi,

Nathaniel Smith wrote:
>>    .- A --.
>>   /        \
>> --x          x-- C
>>   \        /
>>    '- B --'
>
> You can't do this, unless you want to do some sort of inexact inverse patching -- you would need to know what file-1 looks like with only A, and what file-1 looks like with only B, but you don't.

That's where I've been heading. I don't know if it's doable, but the reasoning behind it was something like: if CVS is able to commit to A and B twice, no matter in which order, those changes probably didn't conflict. Thus we could extract them and apply them separately. Would a star merge with the previous commit and A and B tell us more? Or a reverse look at it with 'ancestor' C and A and B? But that looks like micro-optimization anyway.

> You could fork into one A/B revision and one B/A revision, but that doesn't seem helpful.
>
>> Or even merge A and B into one single revision (since you can't determine exactly what belongs to A and what to B), thus:
>>
>> AB -> C
>
> Door A seems somewhat better than this, at least you get to preserve all commit messages.

Hm.. you're right. The changelog could be put together, but we can't simply concatenate the authors...

Regards

Markus
Re: [Monotone-devel] cvs import
Dear diary, on Thu, Sep 14, 2006 at 03:53:24AM CEST, I got a letter where Daniel Carosone [EMAIL PROTECTED] said that...
> On Wed, Sep 13, 2006 at 08:57:33PM -0400, Jon Smirl wrote:
>> Mozilla is 120,000 files. The complexity comes from 10 years worth of history. A few of the files have around 1,700 revisions. There are about 1,600 branches and 1,000 tags. The branch number is inflated because cvs2svn is generating extra branches; the real number is around 700. The CVS repo takes 4.2GB of disk space. cvs2svn turns this into 250,000 commits over about 1M unique revisions.
>
> Those numbers are pretty close to those in the NetBSD repository, and between them these probably represent just about the most extensive public CVS test data available.

Don't forget OpenOffice. It's just a shame that the OpenOffice CVS tree is not available for cloning.

http://wiki.services.openoffice.org/wiki/SVNMigration

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Snow falling on Perl. White noise covering line noise. Hides all the bugs too. -- J. Putnam
Re: [Monotone-devel] cvs import
Hi,

Nathaniel Smith wrote:
> Regarding the basic dependency-based algorithm, the approach of throwing everything into blobs and then trying to tease them apart again seems backwards. What I'm thinking is, first we go through and build the history graph for each file. Now, advance a frontier across all of these graphs simultaneously. Your frontier is basically a map filename -> CVS revision, that represents a tree snapshot.

Hm.. weren't you the one saying we should profit from the experience of cvs2svn? Another question I'm asking myself: if it would have been that easy to write a sane CVS importer, why didn't cvs2svn do something like that?

Anyway, I didn't want to go into discussing more algorithms here. And the discussion is already way too noisy for my feeling. I want to write code, not emails :-)

> Regarding storing things on disk vs. in memory: we always used to stress-test monotone's cvs importer with the gcc history; just a few weeks ago someone did a test import of NetBSD's src repo (~180k commits) on a desktop with 2 gigs of RAM. It takes a pretty big history to really require disk (and for that matter, people with histories that big likely have a big enough organization that they can get access to some big iron to run the conversion on -- and probably will want to anyway, to make it run in reasonable time).

Full ack.

> Probably the biggest technical advantage of having the converter built into monotone is that it makes it easy to import the file contents. Since this data is huge (100x the repo size, maybe?), and the naive algorithm for reconstructing takes time that is quadratic in the depth of history, this is very valuable. I'm not sure what sort of dump format one could come up with that would avoid making this step very expensive.

I can imagine a dump format that is only loosely coupled to the file data and deltas. But it seems like a lot of work to write a generic format which performs well for all VCSes.

> I also suspect that SVN's dump format is suboptimal at the metadata level -- we would essentially have to run a lot of branch/tag inferencing logic _again_ to go from SVN-style one giant tree with branches described as copies, and multiple copies allowed for branches/tags that are built up over time, to monotone-style DAG of tree snapshots. This would be substantially less annoying inferencing logic than that needed to decipher CVS in the first place, granted, and it's stuff we want to write at some point anyway to allow SVN importing, but it adds another step where information could be lost. I may be biased because I grok monotone better, but I suspect it would be much easier to losslessly convert a monotone-style history to an svn-style history than vice versa; possibly a generic dumping tool would want to generate output that looks more like monotone's model?

Yeah, and the GIT people want the generic dump to look more like GIT. And then there are darcs, mercurial, etc...

> Even if we _do_ end up writing two implementations of the algorithm, we should share a test suite.

Sure, but as cvs2svn has another license, I can't just copy them over :-( I will write some tests, but if I write them in our monotone-lua testsuite, I'm sure nobody else is going to use them.

Regards

Markus
Re: [Monotone-devel] cvs import
Hi,

the algorithm Nathaniel described looks simple, clean and logical to me. What were the reasons for the more complex algorithms cvs2svn uses? In which way is the proposed dependency-based one better?

Regards

Markus

Nathaniel Smith wrote:
> I just read over the thread on the cvs2svn list about this -- I have a few random thoughts. Take them with a grain of salt, since I haven't actually tried writing a CVS importer myself...
>
> Regarding the basic dependency-based algorithm, the approach of throwing everything into blobs and then trying to tease them apart again seems backwards. What I'm thinking is, first we go through and build the history graph for each file. Now, advance a frontier across all of these graphs simultaneously. Your frontier is basically a map filename -> CVS revision, that represents a tree snapshot. The basic loop is:
>
> 1) pick some subset of files to advance to their next revision
> 2) slide the frontier one CVS revision forward on each of those files
> 3) snapshot the new frontier (write it to the target VCS as a new tree commit)
> 4) go to step 1
>
> Obviously, this will produce a target VCS history that respects the CVS dependency graph, so that's good; it puts a strict limit on how badly whatever heuristics we use can screw us over if they guess wrong about things. Also, it makes the problem much simpler -- all the heuristics are now in step 1, where we are given a bunch of possible edits, and we have to pick some subset of them to accept next.
Re: [Monotone-devel] cvs import
Nathaniel Smith writes:
> I just read over the thread on the cvs2svn list about this -- I have a few random thoughts. Take them with a grain of salt, since I haven't actually tried writing a CVS importer myself...
>
> Regarding the basic dependency-based algorithm, the approach of throwing everything into blobs and then trying to tease them apart again seems backwards. What I'm thinking is, first we go through and build the history graph for each file. Now, advance a frontier across all of these graphs simultaneously. Your frontier is basically a map filename -> CVS revision, that represents a tree snapshot. The basic loop is:
>
> 1) pick some subset of files to advance to their next revision
> 2) slide the frontier one CVS revision forward on each of those files
> 3) snapshot the new frontier (write it to the target VCS as a new tree commit)
> 4) go to step 1
>
> Obviously, this will produce a target VCS history that respects the CVS dependency graph, so that's good; it puts a strict limit on how badly whatever heuristics we use can screw us over if they guess wrong about things. Also, it makes the problem much simpler -- all the heuristics are now in step 1, where we are given a bunch of possible edits, and we have to pick some subset of them to accept next. This isn't a trivial problem. I think the main thing you want to avoid is:
>
>   1  2  3  4
>   |  |  |  |
> --o--o--o--o-   <-- current frontier
>   |  |  |  |
>   A  B  A  C
>   |
>   A
>
> say you have four files named 1, 2, 3, and 4. We want to slide the frontier down, and the next edits were originally created by one of three commits, A, B, or C. In this situation, we can take commit B, or we can take commit C, but we don't want to take commit A until _after_ we have taken commit B -- because otherwise we will end up splitting A up into two different commits, A1, B, A2.

The main problem with converting CVS repositories is its unreliable timestamps. Sometimes they are off by a few minutes; that would be no problem for your algorithm. But they might be off by hours (maybe a timezone was set incorrectly), and it is not unusual to have a server with a bad battery that resets its time to Jan 1 1970 after each reboot for a while before somebody notices it. Timestamps that are too far in the future are probably rarer, but also occur. CVS timestamps are simply not to be trusted.

The best hope of correcting timestamp problems is pooling information across files. For example, you might have the following case:

  1   2
  |   |
  A   Z
  |
  B
  :
  Y
  |
  Z

where A..Y have correct timestamps but Z has an incorrect timestamp far in the past. It is clear from the dependency graph that Z was committed after Y, and by implication revision Z of file 2 was committed at the same time. But your algorithm would grab revision Z of file 2 first, even before revision A of file 1.

The point of the blob method that I proposed is that timestamps are secondary in deciding what constitutes a changeset. Any changeset consistent with the dependency graph (subject maybe to some timestamp heuristics [*]) is accepted.

[*] Typically, clock inaccuracies will affect all CVS revisions that made up a change set. Therefore the suggestion to split blobs that have more than (say) a 5 minute time gap within them.

> There are a lot of approaches one could take here, on up to pulling out a full-on optimal constraint satisfaction system (if we can route chips, we should be able to pick a good ordering for accepting CVS edits, after all). A really simple heuristic, though, would be to just pick the file whose next commit has the earliest timestamp, then group in all the other next commits with the same commit message, and (maybe) a similar timestamp. I have a suspicion that this heuristic will work really, really, well in practice. Also, it's cheap to apply, and worst case you accidentally split up a commit that already had wacky timestamps, and we already know that we _have_ to do that in some cases.
>
> Handling file additions could potentially be slightly tricky in this model. I guess it is not so bad, if you model added files as being present all along (so you never have to add whole new entries to the frontier), with each file starting out in a pre-birth state, and then addition of the file is the first edit performed on top of that, and you treat these edits like any other edits when considering how to advance the frontier.
>
> I have no particular idea on how to handle tags and branches here; I've never actually wrapped my head around CVS's model for those :-). I'm not seeing any obvious problem with handling them, though.

Tags and branches do not have any timestamps at all in CVS. (You can sometimes put bounds on the timestamps: a branch must have been created after the version from which it sprouts, and before the first commit on the branch (if there ever was a commit on the branch).) And it is not possible to distinguish whether two branches/tags sprouted from the same revision of a file or whether one sprouted from the other. So a date-based method has to work hard to get tags and branches correct.
Re: [Monotone-devel] cvs import
Hi,

Michael Haggerty wrote:
> The main problem with converting CVS repositories is its unreliable timestamps. Sometimes they are off by a few minutes; that would be no problem for your algorithm. But they might be off by hours (maybe a timezone was set incorrectly), and it is not unusual to have a server with a bad battery that resets its time to Jan 1 1970 after each reboot for a while before somebody notices it. Timestamps that are too far in the future are probably rarer, but also occur. CVS timestamps are simply not to be trusted.
>
> The best hope of correcting timestamp problems is pooling information across files. For example, you might have the following case:
>
>   1   2
>   |   |
>   A   Z
>   |
>   B
>   :
>   Y
>   |
>   Z
>
> where A..Y have correct timestamps but Z has an incorrect timestamp far in the past. It is clear from the dependency graph that Z was committed after Y, and by implication revision Z of file 2 was committed at the same time. But your algorithm would grab revision Z of file 2 first, even before revision A of file 1.

But you could use another method to determine what to commit first, one which takes only the dependency graph into account. The simplest variant would be:

1. Randomly choose a commit (or take the one with the lowest timestamp, for a mostly good starter).

2. Collect the other files' commits which seem to belong to the same revision. (For me, a revision is a set of files, as in monotone. I don't know what terms you use here; probably we should define a set of terms to discuss such issues and avoid confusion.)

3. Check if any of those file commits conflict in the dependency graph. I.e. in your example above, file 1 would also find a commit Z, but it conflicts with A, B, ... and Y. If there are conflicts, take the first one in your graph (A) and repeat from step 2 with that commit. Otherwise continue.

4. You now have the 'next' revision to commit (next in the dependency graph sense).

With such an algorithm, you won't rely on the timestamps, but only on the dependencies. Thus, what other advantages would the blob method have?

> Tags and branches do not have any timestamps at all in CVS. (You can sometimes put bounds on the timestamps: a branch must have been created after the version from which it sprouts, and before the first commit on the branch (if there ever was a commit on the branch).) And it is not possible to distinguish whether two branches/tags sprouted from the same revision of a file or whether one sprouted from the other. So a date-based method has to work hard to get tags and branches correct.

But in the above way, none of it would be timestamp based. You could, as you do in your blob method, insert tag and branch 'events', which would be dependent on a commit event of a certain file. You would then not get a 'revision' in step 4 above, but a branch or tag.

(Don't get me wrong, I think the blob method is better, because I suspect importing a CVS repository can't be that simple. But I'm missing proof of that.)

Regards

Markus
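To make the four steps above concrete, here is a minimal Python sketch of that selection loop. It assumes each file's history is a plain list of commits in RCS order with no branches, and groups commits purely by identical changelog text; all names and the data layout are invented for illustration and this is not monotone or cvs2svn code.

from collections import namedtuple

FileCommit = namedtuple("FileCommit", "filename rev changelog timestamp")

def next_changeset(histories):
    """histories: {filename: [FileCommit, ...]}; index 0 is the file's next unemitted commit."""
    frontier = {f: commits[0] for f, commits in histories.items() if commits}
    if not frontier:
        return None
    # Step 1: seed with a commit; the lowest timestamp is only a starting guess.
    seed = min(frontier.values(), key=lambda c: c.timestamp)
    tried = set()
    while True:
        tried.add((seed.filename, seed.rev))
        blocking = None
        for f, commits in histories.items():
            # Step 2: find this file's commit that seems to belong to the same
            # revision (here: identical changelog text).
            match = next((c for c in commits if c.changelog == seed.changelog), None)
            # Step 3: it conflicts if it is not the file's *next* commit, i.e.
            # an unrelated commit sits between the frontier and it.
            if match is not None and match is not commits[0]:
                blocking = commits[0]
                break
        if blocking is None:
            # Step 4: every matching commit sits directly on the frontier.
            return [c for c in frontier.values() if c.changelog == seed.changelog]
        if (blocking.filename, blocking.rev) in tried:
            # Dependency cycle (the A/B vs. B/A case raised in the replies):
            # give up and emit the seed's file commit alone, splitting the
            # logical commit.
            return [frontier[seed.filename]]
        seed = blocking     # restart from the commit that has to go first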
Re: [Monotone-devel] cvs import
Markus Schiltknecht wrote:
> Michael Haggerty wrote:
>> The main problem with converting CVS repositories is its unreliable timestamps. Sometimes they are off by a few minutes; that would be no problem for your algorithm. But they might be off by hours (maybe a timezone was set incorrectly), and it is not unusual to have a server with a bad battery that resets its time to Jan 1 1970 after each reboot for a while before somebody notices it. Timestamps that are too far in the future are probably rarer, but also occur. CVS timestamps are simply not to be trusted.
>>
>> The best hope of correcting timestamp problems is pooling information across files. For example, you might have the following case:
>>
>>   1   2
>>   |   |
>>   A   Z
>>   |
>>   B
>>   :
>>   Y
>>   |
>>   Z
>>
>> where A..Y have correct timestamps but Z has an incorrect timestamp far in the past. It is clear from the dependency graph that Z was committed after Y, and by implication revision Z of file 2 was committed at the same time. But your algorithm would grab revision Z of file 2 first, even before revision A of file 1.
>
> But you could use another method to determine what to commit first, one which takes only the dependency graph into account. The simplest variant would be:
>
> 1. Randomly choose a commit (or take the one with the lowest timestamp, for a mostly good starter).
> 2. Collect the other files' commits which seem to belong to the same revision. (For me, a revision is a set of files, as in monotone. I don't know what terms you use here; probably we should define a set of terms to discuss such issues and avoid confusion.)
> 3. Check if any of those file commits conflict in the dependency graph. I.e. in your example above, file 1 would also find a commit Z, but it conflicts with A, B, ... and Y. If there are conflicts, take the first one in your graph (A) and repeat from step 2 with that commit. Otherwise continue.
> 4. You now have the 'next' revision to commit (next in the dependency graph sense).
>
> With such an algorithm, you won't rely on the timestamps, but only on the dependencies. Thus, what other advantages would the blob method have?

Step 2 is essentially the creation of a blob, isn't it?

And steps 2 and 3 could be an infinite loop, because of

  1   2
  |   |
  A   B
  |   |
  B   A

This can arise if two (nonatomic, remember) CVS commits are going on at the same time, even without clock errors. Of course more complicated loops can also arise.

>> Tags and branches do not have any timestamps at all in CVS. (You can sometimes put bounds on the timestamps: a branch must have been created after the version from which it sprouts, and before the first commit on the branch (if there ever was a commit on the branch).) And it is not possible to distinguish whether two branches/tags sprouted from the same revision of a file or whether one sprouted from the other. So a date-based method has to work hard to get tags and branches correct.
>
> But in the above way, none of it would be timestamp based. You could, as you do in your blob method, insert tag and branch 'events', which would be dependent on a commit event of a certain file. You would then not get a 'revision' in step 4 above, but a branch or tag.
>
> (Don't get me wrong, I think the blob method is better, because I suspect importing a CVS repository can't be that simple. But I'm missing proof of that.)

Yes, but branches and especially tags are very slippery. They don't even have to be created (chronologically) before a succeeding commit on the same file. So you'll have branch/tag events rising to the top of the frontier and you need some way to decide when to process them.

Not that this part is much easier in the blob scheme, except that from early on you have a global picture of the topology of branches/tags, so I think it should be easier to design the heuristics that will be needed.

Michael
Re: [Monotone-devel] cvs import
Hi,

Michael Haggerty wrote:
> Markus Schiltknecht wrote:
>> With such an algorithm, you won't rely on the timestamps, but only on the dependencies. Thus, what other advantages would the blob method have?
>
> Step 2 is essentially the creation of a blob, isn't it?

Sure. Except that you won't have inter-blob dependencies to resolve.

> And steps 2 and 3 could be an infinite loop, because of
>
>   1   2
>   |   |
>   A   B
>   |   |
>   B   A

True, but you could easily check for that. Just remember what you've already tried and don't try again. To me the question is: what to do then? Split A into two commits around B:

A1 -> B -> A2 -> C

Or (for monotone or git): try to separate into individual commits (not always possible) and create two heads, which then merge later on. I.e.:

   .- A --.
  /        \
--x          x-- C
  \        /
   '- B --'

Or even merge A and B into one single revision (since you can't determine exactly what belongs to A and what to B), thus:

AB -> C

> This can arise if two (nonatomic, remember) CVS commits are going on at the same time, even without clock errors. Of course more complicated loops can also arise.

Yes, but the problem stays the same for Nathaniel's continuous algorithm and for your blob method.

> Yes, but branches and especially tags are very slippery. They don't even have to be created (chronologically) before a succeeding commit on the same file. So you'll have branch/tag events rising to the top of the frontier and you need some way to decide when to process them.

If you apply exactly the same algorithm for 'commit', 'tag' and 'branch' events, I don't see a problem there. Except that the 'loop' resolution will work differently if your loop consists of not only commits.

> Not that this part is much easier in the blob scheme, except that from early on you have a global picture of the topology of branches/tags, so I think it should be easier to design the heuristics that will be needed.

Ah, that's a difference. What do we gain with the 'global picture'?

Regards

Markus
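As a toy illustration of the first option (splitting A around B), the following sketch walks the two file histories from the example and emits A1, B, A2; the function and the way a "logical commit" is chosen (alphabetically here, rather than by changelog and timestamp) are invented purely for the demonstration.

def split_interleaved(histories):
    """histories: {filename: [changeset_label, ...]} in per-file RCS order."""
    emitted = []
    cursors = {f: 0 for f in histories}
    while any(cursors[f] < len(histories[f]) for f in histories):
        heads = {f: histories[f][cursors[f]]
                 for f in histories if cursors[f] < len(histories[f])}
        # Pick one logical commit from the frontier; a real importer would use
        # changelogs/timestamps here, sorting alphabetically is just for the demo.
        label = sorted(heads.values())[0]
        members = {f: c for f, c in heads.items() if c == label}
        emitted.append((label, members))
        for f in members:
            cursors[f] += 1
    return emitted

print(split_interleaved({"file1": ["A", "B"], "file2": ["B", "A"]}))
# -> [('A', {'file1': 'A'}), ('B', {'file1': 'B', 'file2': 'B'}), ('A', {'file2': 'A'})]
# i.e. A gets split into A1 and A2 around B, and both changelogs survive.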
Re: [Monotone-devel] cvs import
On Thu, Sep 14, 2006 at 10:05:42AM +0200, Markus Schiltknecht wrote:
> Hi,
>
> Nathaniel Smith wrote:
>> Regarding the basic dependency-based algorithm, the approach of throwing everything into blobs and then trying to tease them apart again seems backwards. What I'm thinking is, first we go through and build the history graph for each file. Now, advance a frontier across all of these graphs simultaneously. Your frontier is basically a map filename -> CVS revision, that represents a tree snapshot.
>
> Hm.. weren't you the one saying we should profit from the experience of cvs2svn?

Yes, and apparently their experience is saying that their algorithm could be improved :-).

> Another question I'm asking myself: if it would have been that easy to write a sane CVS importer, why didn't cvs2svn do something like that?

I don't know, that's why I asked them :-).

>> I also suspect that SVN's dump format is suboptimal at the metadata level -- we would essentially have to run a lot of branch/tag inferencing logic _again_ to go from SVN-style one giant tree with branches described as copies, and multiple copies allowed for branches/tags that are built up over time, to monotone-style DAG of tree snapshots. This would be substantially less annoying inferencing logic than that needed to decipher CVS in the first place, granted, and it's stuff we want to write at some point anyway to allow SVN importing, but it adds another step where information could be lost. I may be biased because I grok monotone better, but I suspect it would be much easier to losslessly convert a monotone-style history to an svn-style history than vice versa; possibly a generic dumping tool would want to generate output that looks more like monotone's model?
>
> Yeah, and the GIT people want the generic dump to look more like GIT. And then there are darcs, mercurial, etc...

Well, monotone, git, and mercurial at least all share a design heritage, and would want pretty much the same format... :-)

>> Even if we _do_ end up writing two implementations of the algorithm, we should share a test suite.
>
> Sure, but as cvs2svn has another license, I can't just copy them over :-( I will write some tests, but if I write them in our monotone-lua testsuite, I'm sure nobody else is going to use them.

Duh, I forgot about the license thing :-(. Tests could be written in a somewhat standardized way, and then we could just have a harness to run them in our testsuite, and others could have harnesses to run them in their testsuites, while keeping the actual test data shared.

-- Nathaniel

--
Eternity is very long, especially towards the end. -- Woody Allen
Re: [Monotone-devel] cvs import
On Thu, Sep 14, 2006 at 01:14:11PM +0200, Markus Schiltknecht wrote:
> Hi,
>
> Michael Haggerty wrote:
>> Markus Schiltknecht wrote:
>>> With such an algorithm, you won't rely on the timestamps, but only on the dependencies. Thus, what other advantages would the blob method have?
>>
>> Step 2 is essentially the creation of a blob, isn't it?
>
> Sure. Except that you won't have inter-blob dependencies to resolve.
>
>> And steps 2 and 3 could be an infinite loop, because of
>>
>>   1   2
>>   |   |
>>   A   B
>>   |   |
>>   B   A
>
> True, but you could easily check for that. Just remember what you've already tried and don't try again. To me the question is: what to do then? Split A into two commits around B:
>
> A1 -> B -> A2 -> C
>
> Or (for monotone or git): try to separate into individual commits (not always possible) and create two heads, which then merge later on. I.e.:
>
>    .- A --.
>   /        \
> --x          x-- C
>   \        /
>    '- B --'

You can't do this, unless you want to do some sort of inexact inverse patching -- you would need to know what file-1 looks like with only A, and what file-1 looks like with only B, but you don't. You could fork into one A/B revision and one B/A revision, but that doesn't seem helpful.

> Or even merge A and B into one single revision (since you can't determine exactly what belongs to A and what to B), thus:
>
> AB -> C

Door A seems somewhat better than this; at least you get to preserve all commit messages.

-- Nathaniel

--
When the flush of a new-born sun fell first on Eden's green and gold,
Our father Adam sat under the Tree and scratched with a stick in the mould;
And the first rude sketch that the world had seen was joy to his mighty heart,
Till the Devil whispered behind the leaves, "It's pretty, but is it Art?"
-- The Conundrum of the Workshops, Rudyard Kipling
Re: [Monotone-devel] cvs import
Petr Baudis [EMAIL PROTECTED] wrote:
> Don't forget OpenOffice. It's just a shame that the OpenOffice CVS tree is not available for cloning.
>
> http://wiki.services.openoffice.org/wiki/SVNMigration

Hmm, the KDE repo is even larger than Mozilla: 19 GB in CVS and 499,367 revisions. Question is, are those distinct file revisions or SVN revisions? And just what machine did they use that completed that conversion in 38 hours?

--
Shawn.
[Monotone-devel] cvs import
Hi,

I've been trying to understand the cvsimport algorithm used by monotone and wanted to adjust it to be more like the one in cvs2svn. I've had some problems with cvs2svn itself and began to question the algorithm used there. It turned out that the cvs2svn people have discussed an improved algorithm and are about to write a cvs2svn 2.0.

The main problem with the current algorithm is that it depends on the timestamp information stored in the CVS repository. Instead, it would be much better to just take the dependencies of the revisions into account, considering the timestamp an irrelevant (for the import) attribute of the revision.

Now, that can be used to convert from CVS to about anything else. Obviously we were discussing subversion, but then there was git, too. And monotone. I'm beginning to wonder if one could come up with a generally useful cleaned-and-sane-CVS-changeset-dump-format, which could then be used by importers to all sorts of VCSes. This would make monotone's cvsimport function dependent on cvs2svn (and therefore python), but the general try-to-get-something-useful-from-an-insane-CVS-repository algorithm would only have to be written once.

On the other hand, I see that lots of the cvsimport functionality for monotone has already been written (rcs file parsing, stuffing files, file deltas and complete revisions into the monotone database, etc.). Changing it to a better algorithm does not seem to be _that_ much work anymore. Plus, the hard part seems to be to come up with a good algorithm, not implementing it. And we could still exchange our experience with the general algorithm with the cvs2svn people. Plus, the guy who mentioned git pointed out that git needs quite a different dump format than subversion to do an efficient conversion. I think coming up with a generally usable dump format would not be that easy.

So you see, I'm slightly favoring the second implementation approach, with a C++ implementation inside monotone.

Thoughts or comments?

Markus
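For what it's worth, here is a purely hypothetical sketch of what one changeset record in such a cleaned-up dump format could contain. No such format exists in monotone or cvs2svn; every field name below is invented for illustration.

changeset = {
    "id": 42,                                 # sequential changeset number
    "branch": "HEAD",                         # CVS branch the changeset lives on
    "author": "markus",
    "changelog": "fix off-by-one in parser",
    "timestamp": "2006-09-13T19:46:40Z",      # advisory only; ordering comes from deps
    "parents": [41],                          # changesets this one depends on
    "files": [                                # per-file RCS revisions making it up
        {"path": "src/parser.c", "rev": "1.17",  "dead": False},
        {"path": "ChangeLog",    "rev": "1.102", "dead": False},
    ],
    "tags": [],                               # symbolic names attached at this point
}

The key point is that such a record carries only metadata and dependency links; file contents and deltas would stay decoupled, which is exactly where the efficiency questions discussed later in the thread come in.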
Re: [Monotone-devel] cvs import
Sorry, I forgot to mention some pointers.

Here is the thread where I've started the discussion about the cvs2svn algorithm:
http://cvs2svn.tigris.org/servlets/ReadMsg?list=dev&msgNo=1599

And this is a proposal for an algorithm to do cvs imports independent of the timestamp:
http://cvs2svn.tigris.org/servlets/ReadMsg?list=dev&msgNo=1451

Regards

Markus
Re: [Monotone-devel] cvs import
On Wed, Sep 13, 2006 at 07:46:40PM +0200, Markus Schiltknecht wrote:
> Hi,
>
> I've been trying to understand the cvsimport algorithm used by monotone and wanted to adjust it to be more like the one in cvs2svn. I've had some problems with cvs2svn itself and began to question the algorithm used there. It turned out that the cvs2svn people have discussed an improved algorithm and are about to write a cvs2svn 2.0.
>
> The main problem with the current algorithm is that it depends on the timestamp information stored in the CVS repository. Instead, it would be much better to just take the dependencies of the revisions into account, considering the timestamp an irrelevant (for the import) attribute of the revision.

I just read over the thread on the cvs2svn list about this -- I have a few random thoughts. Take them with a grain of salt, since I haven't actually tried writing a CVS importer myself...

Regarding the basic dependency-based algorithm, the approach of throwing everything into blobs and then trying to tease them apart again seems backwards. What I'm thinking is, first we go through and build the history graph for each file. Now, advance a frontier across all of these graphs simultaneously. Your frontier is basically a map filename -> CVS revision, that represents a tree snapshot. The basic loop is:

1) pick some subset of files to advance to their next revision
2) slide the frontier one CVS revision forward on each of those files
3) snapshot the new frontier (write it to the target VCS as a new tree commit)
4) go to step 1

Obviously, this will produce a target VCS history that respects the CVS dependency graph, so that's good; it puts a strict limit on how badly whatever heuristics we use can screw us over if they guess wrong about things. Also, it makes the problem much simpler -- all the heuristics are now in step 1, where we are given a bunch of possible edits, and we have to pick some subset of them to accept next.

This isn't a trivial problem. I think the main thing you want to avoid is:

  1  2  3  4
  |  |  |  |
--o--o--o--o-   <-- current frontier
  |  |  |  |
  A  B  A  C
  |
  A

say you have four files named 1, 2, 3, and 4. We want to slide the frontier down, and the next edits were originally created by one of three commits, A, B, or C. In this situation, we can take commit B, or we can take commit C, but we don't want to take commit A until _after_ we have taken commit B -- because otherwise we will end up splitting A up into two different commits, A1, B, A2.

There are a lot of approaches one could take here, on up to pulling out a full-on optimal constraint satisfaction system (if we can route chips, we should be able to pick a good ordering for accepting CVS edits, after all). A really simple heuristic, though, would be to just pick the file whose next commit has the earliest timestamp, then group in all the other next commits with the same commit message, and (maybe) a similar timestamp. I have a suspicion that this heuristic will work really, really, well in practice. Also, it's cheap to apply, and worst case you accidentally split up a commit that already had wacky timestamps, and we already know that we _have_ to do that in some cases.

Handling file additions could potentially be slightly tricky in this model. I guess it is not so bad, if you model added files as being present all along (so you never have to add whole new entries to the frontier), with each file starting out in a pre-birth state, and then addition of the file is the first edit performed on top of that, and you treat these edits like any other edits when considering how to advance the frontier.

I have no particular idea on how to handle tags and branches here; I've never actually wrapped my head around CVS's model for those :-). I'm not seeing any obvious problem with handling them, though.

In this approach, incremental conversion is cheap, easy, and robust -- simply remember what frontier corresponded to the final revision imported, and restart the process directly at that frontier.

Regarding storing things on disk vs. in memory: we always used to stress-test monotone's cvs importer with the gcc history; just a few weeks ago someone did a test import of NetBSD's src repo (~180k commits) on a desktop with 2 gigs of RAM. It takes a pretty big history to really require disk (and for that matter, people with histories that big likely have a big enough organization that they can get access to some big iron to run the conversion on -- and probably will want to anyway, to make it run in reasonable time).

> Now, that can be used to convert from CVS to about anything else. Obviously we were discussing subversion, but then there was git, too. And monotone. I'm beginning to wonder if one could come up with a generally useful cleaned-and-sane-CVS-changeset-dump-format, which could then be used by importers to all sorts of VCSes.
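A compact Python sketch of the frontier sweep described above, using the earliest-timestamp-plus-same-changelog heuristic from step 1. It deliberately leaves out the "don't take A before B" ordering constraint, the similar-timestamp check, and branches/tags; the data layout and names are assumptions for illustration, not monotone code.

def sweep(histories, emit_snapshot):
    """histories: {filename: [(timestamp, changelog, rev), ...]} in RCS order."""
    frontier = {f: None for f in histories}   # None = pre-birth state
    cursors = {f: 0 for f in histories}       # index of the next unconsumed commit
    while True:
        candidates = {f: histories[f][i]
                      for f, i in cursors.items() if i < len(histories[f])}
        if not candidates:
            break
        # Step 1 (the heuristic): seed with the earliest pending timestamp, then
        # pull in every file whose *next* commit carries the same changelog.
        seed_file = min(candidates, key=lambda f: candidates[f][0])
        changelog = candidates[seed_file][1]
        group = [f for f, c in candidates.items() if c[1] == changelog]
        # Step 2: slide the frontier one CVS revision forward on those files.
        for f in group:
            frontier[f] = candidates[f][2]
            cursors[f] += 1
        # Step 3: the new frontier is a tree snapshot; hand it to the target VCS.
        emit_snapshot(dict(frontier), changelog)
        # Step 4: loop.

sweep({"a.c": [(10, "first", "1.1"), (30, "second", "1.2")],
       "b.c": [(11, "first", "1.1")]},
      lambda snapshot, msg: print(msg, snapshot))
# -> first {'a.c': '1.1', 'b.c': '1.1'}
# -> second {'a.c': '1.2', 'b.c': '1.1'}

Incremental conversion falls out naturally: persist the last emitted frontier and the cursors, and resume the loop from there.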
Re: [Monotone-devel] cvs import
On Wed, Sep 13, 2006 at 03:52:00PM -0700, Nathaniel Smith wrote:
> This isn't a trivial problem. I think the main thing you want to avoid is:
>
>   1  2  3  4
>   |  |  |  |
> --o--o--o--o-   <-- current frontier
>   |  |  |  |
>   A  B  A  C
>   |
>   A
>
> There are a lot of approaches one could take here, on up to pulling out a full-on optimal constraint satisfaction system (if we can route chips, we should be able to pick a good ordering for accepting CVS edits, after all). A really simple heuristic, though, would be to just pick the file whose next commit has the earliest timestamp, then group in all the other next commits with the same commit message, and (maybe) a similar timestamp.

Pick the earliest first, or more generally: take all the file commits immediately below the frontier. Find revs further below the frontier (up to some small depth or time limit) on other files that might match them, based on changelog etc. (the same grouping you describe, and we do now). Eliminate any of those that are not entirely on the frontier (i.e. have some other revision in the way, as with file 2). Commit the remaining set in time order. [*]

If you wind up with an empty set, then you need to split revs, but at this point you have only conflicting revs on the frontier (i.e. you've already committed all the other revs you can that might have avoided this need, whereas we currently might be doing this too often).

[*] For time order, you could look at each rev as having a time window, from the first to last commit matching. If the rev windows are non-overlapping, commit them in order. If the rev windows overlap, at this point we already know the file changes don't overlap -- we *could* commit these as parallel heads and merge them, to better model the original developer's overlapping commits.

> Handling file additions could potentially be slightly tricky in this model. I guess it is not so bad, if you model added files as being present all along (so you never have to add whole new entries to the frontier), with each file starting out in a pre-birth state, and then addition of the file is the first edit performed on top of that, and you treat these edits like any other edits when considering how to advance the frontier.

CVS allows resurrections too..

> I have no particular idea on how to handle tags and branches here; I've never actually wrapped my head around CVS's model for those :-). I'm not seeing any obvious problem with handling them, though.

Tags could be modelled as another 'event' in the file graph, like a commit. If your frontier advances through both revisions and a 'tag this revision' event, the same sequencing as above would work. If tags had been moved, this would wind up with a sequence whereby commits interceded with tagging, and we'd need to split the commits such that we could end up with a revision matching the tagged content.

> In this approach, incremental conversion is cheap, easy, and robust -- simply remember what frontier corresponded to the final revision imported, and restart the process directly at that frontier.

Hm. Except for the tagging idea above, because tags can be applied behind a live cvs frontier.

-- Dan.
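The footnoted time-window idea can be sketched in a few lines of Python; the data layout (a changeset represented only by the timestamps of its member file commits) and the function names are assumed purely for illustration.

def window(changeset_timestamps):
    """A changeset's window runs from its first to its last member file commit."""
    return min(changeset_timestamps), max(changeset_timestamps)

def schedule(a_times, b_times):
    """Decide how two non-conflicting frontier changesets should be emitted."""
    a_start, a_end = window(a_times)
    b_start, b_end = window(b_times)
    if a_end < b_start:
        return "A then B"                # disjoint windows: keep wall-clock order
    if b_end < a_start:
        return "B then A"
    # Overlapping windows: the file changes are already known not to conflict,
    # so they could become parallel heads that are merged afterwards.
    return "parallel heads, then merge"

print(schedule([10, 15, 20], [30, 31]))  # -> A then B
print(schedule([10, 40], [30, 31]))      # -> parallel heads, then merge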
Re: [Monotone-devel] cvs import
On Thu, Sep 14, 2006 at 09:21:39AM +1000, Daniel Carosone wrote:
>> I have no particular idea on how to handle tags and branches here; I've never actually wrapped my head around CVS's model for those :-). I'm not seeing any obvious problem with handling them, though.
>
> Tags could be modelled as another 'event' in the file graph, like a commit. If your frontier advances through both revisions and a 'tag this revision' event, the same sequencing as above would work.

Likewise, if we had "file branched" events in the file lifeline (based on the rcs ids), then we would be sure to always have a monotone revision that corresponded to the branching event, where we could attach the revisions in the branch.

Because we can't split tags, and can't split branch events, we will end up splitting file commits (down to individual commits per file) in order to arrive at the revisions we need for those. Because tags and branches can be across subsets of the tree, we gain some scheduling flexibility about where in the reconstructed sequence they can come.

Many well-managed CVS repositories will use good practices, such as having a branch base tag. If they do, then they will help this algorithm produce correct results. Once we have a branch with a base starting revision, we can pretty much treat it independently from there: make a whole new set of file lifelines along the RCS branches and a new frontier for it.

-- Dan.
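A tiny, hypothetical sketch of what such a per-file lifeline with commit, tag and branch events might look like as a data structure; the field names are invented for illustration and are not monotone's.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Event:
    kind: str                    # "commit", "tag" or "branch"
    rev: str                     # RCS revision the event is attached to
    name: Optional[str] = None   # tag or branch name; None for plain commits

@dataclass
class Lifeline:
    filename: str
    events: list = field(default_factory=list)

line = Lifeline("src/main.c", [
    Event("commit", "1.1"),
    Event("commit", "1.2"),
    Event("tag",    "1.2", "release-1_0-base"),  # branch base tag
    Event("branch", "1.2", "release-1_0"),       # branch sprouts here
    Event("commit", "1.3"),                      # trunk continues
])
# A branch event reaching the frontier means the importer must have (or must
# synthesize, by splitting commits) a revision exactly matching the branched
# state before opening a new set of lifelines and a new frontier for the branch.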
Re: [Monotone-devel] cvs import
On 9/13/06, Nathaniel Smith [EMAIL PROTECTED] wrote:
> On Wed, Sep 13, 2006 at 04:42:01PM -0700, Keith Packard wrote:
>> However, this means that parsecvs must hold the entire tree state in memory, which turned out to be its downfall with large repositories. Worked great for all of X.org, not so good with Mozilla.
>
> Does anyone know how big Mozilla (or other humongous repos, like KDE) are, in terms of number of files?

Mozilla is 120,000 files. The complexity comes from 10 years worth of history. A few of the files have around 1,700 revisions. There are about 1,600 branches and 1,000 tags. The branch number is inflated because cvs2svn is generating extra branches; the real number is around 700. The CVS repo takes 4.2GB of disk space. cvs2svn turns this into 250,000 commits over about 1M unique revisions.

> A few numbers for repositories I had lying around:
>
> Linux kernel -- ~21,000
> gcc -- ~42,000
> NetBSD src repo -- ~100,000
> uClinux distro -- ~110,000
>
> These don't seem very intimidating... even if it takes an entire kilobyte per CVS revision to store the information about it that we need to make decisions about how to move the frontier... that's only 110 megabytes for the largest of these repos.
>
> The frontier sweeping algorithm only _needs_ to have available the current frontier, and the current frontier+1. Storing information on every version of every file in memory might be worse; but since the algorithm accesses this data in a linear way, it'd be easy enough to stick those in a lookaside table on disk if really necessary, like a bdb or sqlite file or something.
>
> (Again, in practice storing all the metadata for the entire 180k revisions of the 100k files in the netbsd repo was possible on a desktop. Monotone's cvs_import does try somewhat to be frugal about memory, though, interning strings and suchlike.)
>
> -- Nathaniel

--
Jon Smirl
[EMAIL PROTECTED]
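The "lookaside table" remark could look roughly like this with sqlite3 from the Python standard library; the schema and function names are assumptions for illustration, not anything monotone actually does.

import sqlite3

db = sqlite3.connect("cvs_metadata.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS file_revision (
        filename  TEXT NOT NULL,
        rev       TEXT NOT NULL,        -- RCS revision number, e.g. '1.42'
        author    TEXT,
        changelog TEXT,
        timestamp INTEGER,              -- seconds since the epoch, advisory only
        PRIMARY KEY (filename, rev)
    )
""")

def remember(filename, rev, author, changelog, timestamp):
    db.execute("INSERT OR REPLACE INTO file_revision VALUES (?, ?, ?, ?, ?)",
               (filename, rev, author, changelog, timestamp))
    db.commit()   # a real importer would batch these

def metadata_for(frontier):
    """Fetch the stored metadata for the commits sitting on the frontier."""
    for filename, rev in frontier.items():
        yield db.execute("SELECT * FROM file_revision WHERE filename=? AND rev=?",
                         (filename, rev)).fetchone()

Since the sweep touches this data in a roughly linear order, the working set in RAM stays limited to the current frontier and the commits just below it.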
Re: [Monotone-devel] cvs import
On Wed, Sep 13, 2006 at 08:57:33PM -0400, Jon Smirl wrote:
> Mozilla is 120,000 files. The complexity comes from 10 years worth of history. A few of the files have around 1,700 revisions. There are about 1,600 branches and 1,000 tags. The branch number is inflated because cvs2svn is generating extra branches; the real number is around 700. The CVS repo takes 4.2GB of disk space. cvs2svn turns this into 250,000 commits over about 1M unique revisions.

Those numbers are pretty close to those in the NetBSD repository, and between them these probably represent just about the most extensive public CVS test data available.

I've only done imports of individual top-level dirs (what used to be modules), like src and pkgsrc, because they're used independently and don't really overlap. src had about 180k commits over 1M versions of 120k files, 1000 tags and 260 branches. pkgsrc had 110k commits over about half as many files and versions thereof. We too have a few hot files; one had 13,625 revisions. xsrc adds a bunch more files and content, but not many versions; that's mostly vendor branches and only some local changes. Between them the cvs ,v files take up 4.7G covering about 13 years of history.

One thing that was interesting was that src used to be several different modules, but we rearranged the repository at one point to match the checkout structure these modules produced (combining them all under the src dir). This doesn't seem to have upset the import at all. Just about every other form of CVS evil has been perpetrated in this repository at some stage or other too, but always very carefully.

-- Dan.
Re: [Monotone-devel] cvs import
Daniel Carosone [EMAIL PROTECTED] wrote:
> On Wed, Sep 13, 2006 at 08:57:33PM -0400, Jon Smirl wrote:
>> Mozilla is 120,000 files. The complexity comes from 10 years worth of history. A few of the files have around 1,700 revisions. There are about 1,600 branches and 1,000 tags. The branch number is inflated because cvs2svn is generating extra branches; the real number is around 700. The CVS repo takes 4.2GB of disk space. cvs2svn turns this into 250,000 commits over about 1M unique revisions.
>
> Those numbers are pretty close to those in the NetBSD repository, and between them these probably represent just about the most extensive public CVS test data available.

I don't know exactly how big it is, but the Gentoo CVS repository is also considered to be very large (about the size of the Mozilla repository) and just as difficult to import. It's either crashed or taken about a month to process with the current Git CVS->Git tools.

Since I know that the bulk of the Gentoo CVS repository is the portage tree, I did a quick find|wc -l in my /usr/portage; it's about 124,500 files. It's interesting that Gentoo has almost as large a repository given that it's such a young project, compared to NetBSD and Mozilla. :-)

--
Shawn.
Re: [Monotone-devel] cvs import
Keith Packard [EMAIL PROTECTED] wrote:
> On Wed, 2006-09-13 at 15:52 -0700, Nathaniel Smith wrote:
>> Regarding the basic dependency-based algorithm, the approach of throwing everything into blobs and then trying to tease them apart again seems backwards. What I'm thinking is, first we go through and build the history graph for each file. Now, advance a frontier across all of these graphs simultaneously. Your frontier is basically a map filename -> CVS revision, that represents a tree snapshot.
>
> Parsecvs does this, except backwards from now into the past; I found it easier to identify merge points than branch points ("Oh, look, these two branches are the same now, they must have merged").

Why not let Git do that? If two branches are the same in CVS then shouldn't they have the same tree SHA1 in Git? Surely comparing 20 bytes of SHA1 is faster than almost any other comparison...

> However, this means that parsecvs must hold the entire tree state in memory, which turned out to be its downfall with large repositories. Worked great for all of X.org, not so good with Mozilla.

Any chance that can be paged in on demand from some sort of work file? git-fast-import hangs onto a configurable number of tree states (default of 5) but keeps them in an LRU chain and dumps the ones that aren't current.

--
Shawn.
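The tree-comparison idea, sketched very roughly: hash each tree snapshot in a content-addressed way and compare digests, so two branches whose states hash identically are candidate merge points. This is a simplification for illustration only, not Git's actual tree-object format.

import hashlib

def tree_hash(snapshot):
    """snapshot: {path: content_hash}; order-independent digest of a tree state."""
    h = hashlib.sha1()
    for path in sorted(snapshot):
        h.update(path.encode() + b"\0" + snapshot[path].encode() + b"\n")
    return h.hexdigest()

trunk  = {"a.c": "1111", "b.c": "2222"}
branch = {"a.c": "1111", "b.c": "2222"}
print(tree_hash(trunk) == tree_hash(branch))   # True: a candidate merge point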
Re: [Monotone-devel] cvs import
On Wed, Sep 13, 2006 at 10:30:17PM -0400, Shawn Pearce wrote:
> I don't know exactly how big it is, but the Gentoo CVS repository is also considered to be very large (about the size of the Mozilla repository) and just as difficult to import. It's either crashed or taken about a month to process with the current Git CVS->Git tools.

Ah, thanks for the tip.

> Since I know that the bulk of the Gentoo CVS repository is the portage tree, I did a quick find|wc -l in my /usr/portage; it's about 124,500 files. It's interesting that Gentoo has almost as large a repository given that it's such a young project, compared to NetBSD and Mozilla. :-)

Portage uses files, and thus CVS, very differently, though. Each ebuild for each package revision of each version of a third-party package (like, say, monotone 0.28 and 0.29, and -r1, -r2 pkg bumps of those if they were needed) is its own file that's added, maybe edited a couple of times, and then deleted again later as new versions are added and older ones retired. These are copies and renames in the workspace, but are invisible to CVS. This uses up lots more files than a single long-lived build that gets edited each time; the Attic dirs must have huge numbers of files, way beyond the number that are live now.

This lets portage keep builds around in a HEAD checkout for multiple versions at once, tagged internally with different statuses. Effectively, these tags take the place of VCS-based branches and releases, and are more flexible for end users tracking their favourite applications while keeping the rest of their system stable. If they had a VCS that supported file cloning and/or renaming, and used that to follow history between these ebuild files, things would be very different. There are some interesting use cases for VCS tools in supporting this behaviour nicely, too.

-- Dan.
Re: [Monotone-devel] CVS import errors
Måns Rullgård [EMAIL PROTECTED] writes:
> I'm trying to import a CVS repository into monotone. All goes seemingly well, in that there are no warnings or error messages. However, when I check it out, I notice that a lot of the files are old versions, and some are missing altogether. The set is not consistent with any point in the past, either. If I import only a subset of the repository (a few files), I get different versions, sometimes even the latest.
>
> I reported this to the bug tracker a week ago, but it appears to have gone unnoticed there. For reference, the report there is at URL https://savannah.nongnu.org/bugs/?func=detailitem&item_id=14151, where I also attached the failing repo.

Please, could someone at least comment on this? Or should I be looking for a replacement for monotone?

--
Måns Rullgård
[EMAIL PROTECTED]
Re: [Monotone-devel] CVS import errors
Nathaniel Smith [EMAIL PROTECTED] writes:
> On Fri, Aug 26, 2005 at 08:36:01AM +0100, Måns Rullgård wrote:
>> Måns Rullgård [EMAIL PROTECTED] writes:
>>> I'm trying to import a CVS repository into monotone. All goes seemingly well, in that there are no warnings or error messages. However, when I check it out, I notice that a lot of the files are old versions, and some are missing altogether. The set is not consistent with any point in the past, either. If I import only a subset of the repository (a few files), I get different versions, sometimes even the latest.
>>>
>>> I reported this to the bug tracker a week ago, but it appears to have gone unnoticed there. For reference, the report there is at URL https://savannah.nongnu.org/bugs/?func=detailitem&item_id=14151, where I also attached the failing repo.
>>
>> Please, could someone at least comment on this? Or should I be looking for a replacement for monotone?
>
> Sorry about that. Unfortunately, the answer is yes, it seems to be broken; but, as you've seen, no-one seems to have time to look at it ATM :-/.
>
> Other options are to use Tailor: http://www.darcs.net/DarcsWiki/Tailor or to check out and build the net.venge.monotone.cvssync branch, which is a version of monotone with a different, incremental CVS importer built in.
>
> Given that this repo seems to have been converted from BK (and I'm suspicious that this might be related to our problems importing it, CVS files have ill-defined structure in some ways and it's possible that bkcvs is generating something that CVS can read but would never itself produce), you might have some luck writing a script based on tridge's sourcepuller program. In principle, this could preserve the full merge history graph, rather than the degraded linearization bkcvs produces. The Xaraya folks might have some insight into good ways to go straight BK->monotone.

I didn't know it was possible to go from BK directly to anything else. Thanks for the pointers.

> As for other systems, your best bet is probably SVN; cvs2svn is the only CVS converter that can do better than the above options (except, possibly, for some unreleased software that Canonical uses).

I don't like SVN, being all centralized and that.

--
Måns Rullgård
[EMAIL PROTECTED]