Re: another perspective on renames.
On Thu, 14 Apr 2005, Paul Jackson wrote:

> To me, rename is a special case of the more general case of a
> big chunk of code (a portion of a file) that was in one place
> either being moved or copied to another place.
>
> I wonder if there might be some way to use the tools that biologists
> use to analyze DNA sequences, to track the evolution of source code,
> identifying things like common chunks of code that differ in just
> a few mutations, and presenting the history of the evolution, at
> selectable levels of detail.

The rsync algorithm
(http://samba.anu.edu.au/rsync/tech_report/node2.html) is probably a
good place to start, although it is relatively sensitive to mutations.
It will be able to efficiently detect identical blocks larger than some
block size N (512 bytes or so for rsync). You might well consider
smaller blocks to be irrelevant. The data can be made considerably more
useful to developers by canonicalizing before searching (i.e.
compressing whitespace to ' ', etc)[*]. Note that the identical regions
do *not* have to line up on block boundaries; see the rsync algorithm
for more detail.

I think Linus has made a persuasive case that the 'developer-friendly'
features of an SCM (i.e. annotate, log, and friends) can be built *on
top* of GIT. This is a perfect example. Since the computation is
non-trivial (although linear in the number of lines of code involved in
the history of a file; i.e. it doesn't depend on the size of the rest
of the archive), it might make sense for the front-end SCM to maintain
its own caches --- for example, of the block and rolling checksums for
each file required by the rsync algorithm. The key point is that these
are just *caches*, not essential history information, and can always be
wiped and regenerated.

The nice 'feature' of this system (some may disagree, I guess) is that
it does *not* depend on extensive programmer annotation of file changes
(i.e. chunk A in file B came from lines C-D of file D, or file E was
once named F, etc).
By inferring history from content-similar files and blocks, such a
system should be better able to generate useful results after importing
third-party sources, which may come in distinct 'releases' but lack
explicit history annotations.
 --scott

[*] In general, I will be *glad* to see source-management move away
from CVS' line-oriented style; there's no good reason we should still
be worrying about whitespace changes, etc. When we build
'developer-friendly' tools we should make every effort to auto-detect
source code, image formats, etc, and automatically perform appropriate
canonicalization and beautification of diffs, because this can
be/should be/is entirely separate from git's underlying storage
representation.

Mk 48 PANCHO ZPSECANT MKDELTA SCRANTON D5 SLBM JMTRAX Delta Force MI6
SGUAT Khaddafi SMOTH interception mail drop SECANT PBSUCCESS Cocaine
                         ( http://cscott.net/ )
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
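
A minimal sketch of the block matching described in the message above, written in Python as an assumption about how a front-end might do it. The names `canonicalize`, `weak_sum` and `find_shared_blocks` are made up for illustration, the 16-byte default block size only keeps the example small, and SHA-1 stands in for whatever strong checksum a real tool would use. It illustrates the rolling-checksum idea from the rsync technical report and is not rsync's implementation.

```python
import hashlib

MOD = 1 << 16  # the weak checksum works modulo 2^16


def canonicalize(text):
    """Collapse every run of whitespace to one space (one possible canonical form)."""
    return " ".join(text.split())


def weak_sum(block):
    """rsync-style weak checksum of a bytes block, returned as the pair (a, b)."""
    n = len(block)
    a = sum(block) % MOD
    b = sum((n - i) * x for i, x in enumerate(block)) % MOD
    return a, b


def find_shared_blocks(old, new, n=16):
    """Yield (old_offset, new_offset) for each n-byte block of `old` found anywhere in `new`."""
    # Index the non-overlapping blocks of `old` by weak checksum.
    # This table is the kind of data a front-end could keep as a cache.
    table = {}
    for off in range(0, len(old) - n + 1, n):
        block = old[off:off + n]
        entry = (hashlib.sha1(block).digest(), off)
        table.setdefault(weak_sum(block), []).append(entry)
    if len(new) < n:
        return
    a, b = weak_sum(new[:n])
    pos = 0
    while True:
        # Confirm weak-checksum hits with the strong hash.
        for strong, off in table.get((a, b), ()):
            if hashlib.sha1(new[pos:pos + n]).digest() == strong:
                yield off, pos
        if pos + n >= len(new):
            break
        # Slide the window one byte: O(1) update instead of re-summing.
        out, inc = new[pos], new[pos + n]
        a = (a - out + inc) % MOD
        b = (b - n * out + a) % MOD
        pos += 1
```

Because the window over `new` advances one byte at a time, a block is found even when it does not sit on a block boundary in the new file. Any single changed byte inside a block still defeats the match for that block, which is the sensitivity to mutations mentioned above.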
Re: another perspective on renames.
* Paul Jackson <[EMAIL PROTECTED]> wrote:

> Scott wrote:
> > Anyway, maybe it's worth thinking a little about an SCM in which this is a
> > feature, instead of (or in addition to) automatically assuming this is a
> > bug we need to add infrastructure to work around.
>
> Agreed.
>
> To me, the main purpose in tracking renames is to obtain a deeper
> history of the line-by-line changes in a file.
>
>   ==> But that doesn't seem relevant here.
>
> Last I looked, git has no such history.  The contents of a given file
> are the indivisible atom of the git world, with no fine structure.
>
> This is quite unlike classic SCMs, built on file formats that track
> source lines, not files, as the atomic unit.

i believe the fundamental thing to think about is not file or line or
namespace, but 'tracking developer intent'.

While keeping in mind that GIT is not an SCM, all SCMs boil down to this
single thing: being able to track what the developer did and why he did
it - to be a useful tool later on. (SCMs are for humans with bad
limitations, who have this fundamental design bug and keep forgetting
things.)

the basic question is how much to track. The most extreme form of
tracking (just for the sake of visualizing it) would be to have
eye-tracking software attached to a webcam looking at the developer,
mapping exactly what he did, how long he looked at one particular line
of code and exactly what he typed while doing that. [ Perhaps also a
thought-reader module in addition, once one is available. (combined
with another module that removes all the swearing) ]

but i think Linus is on the right track to suggest that "the file names
don't matter all that much, it's all about the content". Global diffs
might track most types of plain renames, and if they get one wrong - do
we care? Misdetection of renames can happen, but realistically only with
small files and trivial code, which won't have a lot of history.
The only serious type of misdetection would be if two large modules in
two different places in the namespace happen to have exactly the same
content but have a different history (because e.g. they were merged in
via two separate trees, one from each), and the developer renamed both
of them in the same commit: in such a case the global diff would have no
way to figure out what the proper thread of history is. But is this a
realistic scenario? If the two files are nontrivial and have the same
content, why weren't they merged in the namespace in the first place?

the moment we allow 'namespace' into the picture, things get complex and
ugly. Directory recursion is already a complexity that would have been
nice to avoid.

	Ingo
Re: another perspective on renames.
Scott wrote:
> Anyway, maybe it's worth thinking a little about an SCM in which this is a
> feature, instead of (or in addition to) automatically assuming this is a
> bug we need to add infrastructure to work around.

Agreed.

To me, the main purpose in tracking renames is to obtain a deeper
history of the line-by-line changes in a file.

  ==> But that doesn't seem relevant here.

Last I looked, git has no such history.  The contents of a given file
are the indivisible atom of the git world, with no fine structure.

This is quite unlike classic SCMs, built on file formats that track
source lines, not files, as the atomic unit.

To me, rename is a special case of the more general case of a
big chunk of code (a portion of a file) that was in one place
either being moved or copied to another place.

I wonder if there might be some way to use the tools that biologists
use to analyze DNA sequences, to track the evolution of source code,
identifying things like common chunks of code that differ in just
a few mutations, and presenting the history of the evolution, at
selectable levels of detail.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
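
As a rough illustration of "common chunks of code that differ in just a few mutations", the sketch below uses `difflib.SequenceMatcher` from the Python standard library. This is a swapped-in technique, a general longest-matching-block comparison and not a bioinformatics sequence aligner, and the helper names `shared_chunks` and `similarity` are hypothetical.

```python
from difflib import SequenceMatcher


def shared_chunks(old_lines, new_lines, min_lines=3):
    """Return (old_start, new_start, length) for each run of identical
    lines that is at least `min_lines` long."""
    matcher = SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    return [
        (m.a, m.b, m.size)
        for m in matcher.get_matching_blocks()
        if m.size >= min_lines  # also drops the zero-length sentinel block
    ]


def similarity(old_lines, new_lines):
    """Ratio in [0, 1]: 2 * matched lines / total lines in both inputs."""
    return SequenceMatcher(None, old_lines, new_lines, autojunk=False).ratio()
```

The `min_lines` threshold is one crude way to get a selectable level of detail. Note that `get_matching_blocks()` only reports blocks in increasing order in both inputs, so a chunk that was moved past another matching chunk is not reported; a tool that tracks moves would need something more, such as the rolling-checksum approach.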
another perspective on renames.
Perhaps our thinking is being clouded by 'how other SCMs do things' ---
do we *really* need extra rename metadata? As Linus pointed out, as
long as a commit is done immediately after a rename (i.e. before the
renamed file is changed) the tree object contains all the information
one needs: you can notice that a given object's content-hash is named
'foo' in the first version and 'bar' in the second version.

Ingo thought that this was insufficient because two *different* objects
(i.e. having different revision histories) might be mutated to a point
where they had the *same* contents (and then would be condensed into a
single blob). But isn't that a feature of the git-fs history generally
(i.e. not a renaming-specific issue)?

One solution would be to invent a new 'file-revision-history' annotation
on top of git-fs in order to keep these derivation paths separate...

...but perhaps we might think of this as a 'feature' of our SCM instead?
The 'history' of a file may have join points where a single 'content'
may have been derived by two or more completely different paths.
Explicit guidance to the front-end tools is required to 'unmerge' these
files after this occurs (i.e. updating the directory cache for one, but
not the others). This makes sense for include/arch/{foo,bar}/baz.h, but
maybe not so much for (say) the empty file.

Anyway, maybe it's worth thinking a little about an SCM in which this is
a feature, instead of (or in addition to) automatically assuming this is
a bug we need to add infrastructure to work around.
 --scott

PBFORTUNE Soviet cryptographic D5 SLBM MI5 CIA postcard WASHTUB
[Hello to all my fans in domestic surveillance] explosion Sigint Bush
ODEARL FJHOPEFUL assassination Uzi Hussein Nader
                         ( http://cscott.net/ )
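
A small sketch of the inference described in the message above: finding a rename purely from two tree snapshots, with no rename metadata. It assumes a flat `{path: content-hash}` dictionary as a stand-in for a tree object and plain SHA-1 of the bytes as a stand-in for git's object hash; `blob_id`, `snapshot` and `detect_renames` are hypothetical names.

```python
import hashlib
from collections import defaultdict


def blob_id(content):
    """Stand-in content hash; git's real object hashing differs in detail."""
    return hashlib.sha1(content).hexdigest()


def snapshot(files):
    """Map each path to the hash of its content, loosely like a flat tree object."""
    return {path: blob_id(data) for path, data in files.items()}


def detect_renames(before, after):
    """Compare two {path: hash} snapshots and return (renames, ambiguous)."""
    gone = defaultdict(list)      # hash -> paths that disappeared
    appeared = defaultdict(list)  # hash -> paths that are new
    for path, h in before.items():
        if path not in after:
            gone[h].append(path)
    for path, h in after.items():
        if path not in before:
            appeared[h].append(path)
    renames, ambiguous = [], []
    for h, old_paths in gone.items():
        new_paths = appeared.get(h, [])
        if len(old_paths) == 1 and len(new_paths) == 1:
            # Same content, one name gone, one name new: a plain rename.
            renames.append((old_paths[0], new_paths[0]))
        elif new_paths:
            # Several paths share one blob (a join point): content alone
            # cannot say which old name became which new name.
            ambiguous.append((sorted(old_paths), sorted(new_paths)))
    return renames, ambiguous
```

For example, with `before = snapshot({"foo": b"hello\n"})` and `after = snapshot({"bar": b"hello\n"})`, `detect_renames(before, after)` reports `("foo", "bar")` as a rename. When two files with identical content are renamed in the same commit they land in the `ambiguous` list, which is where a front-end would need the explicit guidance mentioned above.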