Re: another perspective on renames.

2005-04-15 Thread C. Scott Ananian
On Thu, 14 Apr 2005, Paul Jackson wrote:
> To me, rename is a special case of the more general case of a
> big chunk of code (a portion of a file) that was in one place
> either being moved or copied to another place.
>
> I wonder if there might be some way to use the tools that biologists use
> to analyze DNA sequences, to track the evolution of source code,
> identifying things like common chunks of code that differ in just a few
> mutations, and presenting the history of the evolution, at selectable
> levels of detail.

The rsync algorithm (http://samba.anu.edu.au/rsync/tech_report/node2.html) 
is probably a good place to start, although it is relatively sensitive to 
mutations.  It will be able to efficiently detect identical blocks larger 
than some block size N (512 bytes or so for rsync).  You might well 
consider smaller blocks to be irrelevant.  The data can be made 
considerably more useful to developers by canonicalizing before searching 
(ie, compressing whitespace to ' ', etc)[*].  Note that the identical 
regions do *not* have to line up on block boundaries; see the rsync 
algorithm for more detail.
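
A rough sketch of the block matching described above, for illustration only. It is not rsync's actual code: the block size is shrunk to 16 bytes so the demo fits in a few lines, SHA-1 stands in for rsync's strong checksum, and the names canonicalize, weak_sum, roll and find_shared_blocks are made up for this example. The weak checksum is the two-part rolling sum from the tech report, which is what lets a match start at any byte offset.

```python
import hashlib
import re

BLOCK = 16        # demo block size; the post suggests ~512 bytes for real use
MOD = 1 << 16


def canonicalize(text):
    """Collapse every run of whitespace to a single space before matching."""
    return re.sub(r"\s+", " ", text).strip().encode()


def weak_sum(block):
    """Two-part weak checksum (a, b) in the style of the rsync tech report."""
    n = len(block)
    a = sum(block) % MOD
    b = sum((n - i) * byte for i, byte in enumerate(block)) % MOD
    return a, b


def roll(a, b, out_byte, in_byte, n):
    """Slide an n-byte window one byte to the right in O(1)."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - n * out_byte + a) % MOD
    return a, b


def find_shared_blocks(old, new, n=BLOCK):
    """Return (old_offset, new_offset) for each block of `old` found in `new`.

    Blocks of `old` are taken at multiples of n; `new` is scanned at every
    byte offset, so matches need not line up on block boundaries.
    """
    table = {}
    for off in range(0, len(old) - n + 1, n):
        block = old[off:off + n]
        strong = hashlib.sha1(block).digest()   # stand-in strong checksum
        table.setdefault(weak_sum(block), []).append((strong, off))

    matches = []
    if len(new) < n:
        return matches
    a, b = weak_sum(new[:n])
    pos = 0
    while True:
        for strong, off in table.get((a, b), ()):
            # Weak sums can collide, so confirm with the strong checksum.
            if hashlib.sha1(new[pos:pos + n]).digest() == strong:
                matches.append((off, pos))
        if pos + n >= len(new):
            break
        a, b = roll(a, b, new[pos], new[pos + n], n)
        pos += 1
    return matches


old = canonicalize("int main(void)\n{\n    return compute_answer(42);\n}\n")
new = canonicalize("/* moved */  int   main(void) {\treturn compute_answer(42); }")
print(find_shared_blocks(old, new))
```

Without the canonicalize step the two inputs share no 16-byte block at all, which is the point of normalizing whitespace before searching.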

I think Linus has made a persuasive case that the 'developer-friendly' 
features of an SCM (ie annotate, log, and friends) can be built *on top* 
of GIT.   This is a perfect example.  Since the computation is non-trivial 
(although linear in the number of lines of code involved in the history of 
a file; ie doesn't depend on the unrelated size of the archive), it might 
make sense for the front-end SCM to maintain its own caches --- for 
example, of the block and rolling checksums for each file required by the 
rsync algorithm.  The key point being that these are just *caches*, not 
essential history information, and can always be wiped and regenerated.
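
To make the "just a cache" point concrete, here is a minimal sketch under my own assumptions: a hypothetical ChecksumCache class, kept in memory, keyed by a SHA-1 of the file contents and storing per-block checksums. Nothing here is part of git; the only property that matters is that wiping the cache loses nothing that cannot be recomputed from the contents.

```python
import hashlib


class ChecksumCache:
    """Hypothetical front-end cache: content hash -> per-block checksums."""

    def __init__(self, block_size=512):
        self.block_size = block_size
        self._store = {}
        self.misses = 0   # counts how often the checksums had to be recomputed

    def _compute(self, content):
        n = self.block_size
        return [hashlib.sha1(content[i:i + n]).hexdigest()
                for i in range(0, len(content), n)]

    def block_sums(self, content):
        key = hashlib.sha1(content).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._compute(content)
        return self._store[key]

    def wipe(self):
        """Throw everything away; the next lookup simply regenerates it."""
        self._store.clear()


cache = ChecksumCache(block_size=4)
first = cache.block_sums(b"abcdefgh")
cache.block_sums(b"abcdefgh")          # served from the cache
cache.wipe()
assert cache.block_sums(b"abcdefgh") == first
```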

The nice 'feature' of this system (some may disagree, I guess) is that it 
does *not* depend on extensive programmer annotation of file changes (ie, 
chunk A in file B came from lines C-D of file D, or file E was once named 
F, etc).  By inferring history from content-similar files and blocks, it 
seems that it would be more able to generate useful results after 
importing third-party sources, which may come in distinct 'releases' but 
lack explicit history annotations.
  --scott

[*] In general, I will be *glad* to see source-management move away from 
CVS' line-oriented style; there's no good reason we should still be worrying
about whitespace changes, etc.  When we build 'developer-friendly' tools 
we should make every effort to auto-detect source code, image formats, 
etc, and automatically perform appropriate canonicalization and 
beautification of diffs, because this can be/should be/is entirely 
separate from git's underlying storage representation.

 ( http://cscott.net/ )
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: another perspective on renames.

2005-04-15 Thread Ingo Molnar

* Paul Jackson <[EMAIL PROTECTED]> wrote:

> Scott wrote:
> > Anyway, maybe it's worth thinking a little about an SCM in which this is a 
> > feature, instead of (or in addition to) automatically assuming this is a 
> > bug we need to add infrastructure to work around.
> 
> Agreed.
> 
> To me, the main purpose in tracking renames is to obtain a deeper
> history of the line-by-line changes in a file.
> 
>   ==> But that doesn't seem relevant here.
> 
> Last I looked, git has no such history.  A given file contents is the 
> indivisible atom of the git world, with no fine structure.
> 
> This is quite unlike classic SCM's, built on file formats that track 
> source lines, not files, as the atomic unit.

i believe the fundamental thing to think about is not file or line or 
namespace, but 'tracking developer intent'. While keeping in mind that 
GIT is not an SCM, all SCMs boil down to this single thing: being able 
to track what the developer did and why he did it - to be a useful tool 
later on. (SCMs are for humans with bad limitations, who have this 
fundamental design bug and keep forgetting things.)

the basic question is, how much to track. The most extreme form of 
tracking (just for the sake of visualizing it) would be to have an 
eye-position recognizing software attached to a webcam looking at the 
developer, and then exactly mapping what he did, how long he looked at 
one particular line of code and exactly what he typed while doing 
so. [ Perhaps also a thought-reader module in addition, once one is 
available. (combined with another module that removes all the swearing)]

but i think Linus is on the right track to suggest that "the file names 
don't matter all that much, it's all about the content". Global diffs 
might track most types of plain renames, and if they get it wrong - do we 
care? Misdetection of renames can happen, but realistically only with 
small files and trivial code, which won't have a lot of history.

The only serious type of misdetection would be if two large modules in 
two different places in the namespace happen to have exactly the same 
content but have a different history (because e.g. they were merged in 
via two separate trees, one came from one tree, the other from the other 
tree), and the developer renamed both of them in the same commit: in 
such a case the global diff would have no way to figure out what the 
proper thread of history is. But is this a realistic scenario?  If the 
two files are nontrivial and have the same content, why weren't they 
merged in the namespace in the first place?
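
The ambiguous case above can be shown with a small sketch. The assumptions are mine: a tree is modeled as a plain dict of path -> content hash, the hash values are placeholder strings, and rename_candidates and is_ambiguous are hypothetical helper names. When two identical files are both renamed in one commit, content alone yields a two-by-two set of candidates and no way to say which old path became which new path.

```python
from collections import defaultdict


def rename_candidates(old_tree, new_tree):
    """Group vanished and newly appeared paths by content hash.

    Returns {hash: (old_paths, new_paths)} for every hash that disappeared
    from at least one path and appeared under at least one new path.
    """
    gone = defaultdict(list)
    came = defaultdict(list)
    for path, h in old_tree.items():
        if path not in new_tree:
            gone[h].append(path)
    for path, h in new_tree.items():
        if path not in old_tree:
            came[h].append(path)
    return {h: (sorted(gone[h]), sorted(came[h])) for h in gone if h in came}


def is_ambiguous(old_paths, new_paths):
    """More than one path on either side means content cannot pick a pairing."""
    return len(old_paths) > 1 or len(new_paths) > 1


# Two modules with identical content but separate histories, both renamed
# in the same commit ("X" is a placeholder for their shared content hash).
before = {"arch/a/mod.c": "X", "arch/b/mod.c": "X"}
after = {"drivers/one.c": "X", "drivers/two.c": "X"}

for h, (olds, news) in rename_candidates(before, after).items():
    print(h, olds, "->", news, "ambiguous" if is_ambiguous(olds, news) else "ok")
```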

the moment we allow 'namespace' into the picture, things get complex and 
ugly. Directory recursion is already a complexity that would have been 
nice to avoid.

Ingo


Re: another perspective on renames.

2005-04-14 Thread Paul Jackson
Scott wrote:
> Anyway, maybe it's worth thinking a little about an SCM in which this is a 
> feature, instead of (or in addition to) automatically assuming this is a 
> bug we need to add infrastructure to work around.

Agreed.

To me, the main purpose in tracking renames is to obtain a deeper
history of the line-by-line changes in a file.

  ==> But that doesn't seem relevant here.

Last I looked, git has no such history.  A given file contents
is the indivisible atom of the git world, with no fine structure.

This is quite unlike classic SCM's, built on file formats that
track source lines, not files, as the atomic unit.

To me, rename is a special case of the more general case of a
big chunk of code (a portion of a file) that was in one place
either being moved or copied to another place.

I wonder if there might be some way to use the tools that biologists use
to analyze DNA sequences, to track the evolution of source code,
identifying things like common chunks of code that differ in just a few
mutations, and presenting the history of the evolution, at selectable
levels of detail.
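
One way to picture "chunks that differ in just a few mutations" is the sketch below. It does not use any bioinformatics tool; it leans on Python's standard difflib.SequenceMatcher as a stand-in similarity measure, compares every window of lines against every other (quadratic, so only suitable as a toy), and the function name similar_chunks plus the size and threshold parameters are assumptions for the example.

```python
import difflib


def similar_chunks(old_lines, new_lines, size=4, threshold=0.8):
    """Return (old_index, new_index, ratio) for near-identical line windows.

    Every `size`-line window of old_lines is compared with every window of
    new_lines; pairs whose similarity ratio reaches `threshold` are kept.
    """
    hits = []
    for i in range(len(old_lines) - size + 1):
        a = "\n".join(old_lines[i:i + size])
        for j in range(len(new_lines) - size + 1):
            b = "\n".join(new_lines[j:j + size])
            ratio = difflib.SequenceMatcher(None, a, b).ratio()
            if ratio >= threshold:
                hits.append((i, j, round(ratio, 2)))
    return hits


old_file = ["static int area(int w, int h)", "{", "    return w * h;", "}"]
new_file = ["/* copied from shapes.c */",
            "static int area(int w, int d)", "{", "    return w * d;", "}"]
print(similar_chunks(old_file, new_file))
```

Lowering the threshold shows more distant relatives of a chunk, which is one crude reading of "selectable levels of detail".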

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401


another perspective on renames.

2005-04-14 Thread C. Scott Ananian
Perhaps our thinking is being clouded by 'how other SCMs do things' ---
do we *really* need extra rename metadata?  As Linus pointed out, as long 
as a commit is done immediately after a rename (ie before the renamed file 
is changed) the tree object contains all the information one needs: you 
can notice that a given object's content-hash is named 'foo' in the first 
version and 'bar' in the second version.
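
A minimal sketch of that observation, under stated assumptions: each tree is modeled as a plain dict of name -> content hash, content_hash is just a SHA-1 of the bytes (a stand-in, not git's object format), and exact_renames is a hypothetical helper. It also shows the caveat: once the renamed file is edited before the commit, the hashes no longer match and nothing is detected.

```python
import hashlib


def content_hash(data):
    # Stand-in for an object's content-hash; not git's actual object format.
    return hashlib.sha1(data).hexdigest()


def exact_renames(old_tree, new_tree):
    """Pair names that vanished with names that appeared under the same hash.

    Only unambiguous one-to-one pairs are reported.
    """
    gone, came = {}, {}
    for name, h in old_tree.items():
        if name not in new_tree:
            gone.setdefault(h, []).append(name)
    for name, h in new_tree.items():
        if name not in old_tree:
            came.setdefault(h, []).append(name)
    return sorted(
        (gone[h][0], came[h][0])
        for h in gone
        if h in came and len(gone[h]) == 1 and len(came[h]) == 1
    )


v1 = {"foo": content_hash(b"int x;\n"), "Makefile": content_hash(b"all:\n")}
v2 = {"bar": content_hash(b"int x;\n"), "Makefile": content_hash(b"all:\n")}
v3 = {"bar": content_hash(b"int x, y;\n"), "Makefile": content_hash(b"all:\n")}

print(exact_renames(v1, v2))   # rename committed before any edit
print(exact_renames(v1, v3))   # renamed and edited in the same commit
```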

Ingo thought that this was insufficient because two *different* objects 
(ie having different revision histories) might be mutated to a point where 
they had the *same* contents (and then would be condensed into a single 
blob).  But isn't that a feature of the git-fs history generally (ie not a 
renaming-specific issue)?

One solution would be to invent a new 'file-revision-history' annotation 
on top of git-fs in order to keep these derivation paths separate...

...but perhaps we might think of this as a 'feature' of our SCM instead?
The 'history' of a file may have join points where a single 'content' may 
have been derived by two or more completely different paths.  Explicit 
guidance to the front-end tools is required to 'unmerge' these files after 
this occurs (ie updating the directory cache for one, but not the others). 
This makes sense for include/arch/{foo,bar}/baz.h, but maybe not so much 
for (say) the empty file.

Anyway, maybe it's worth thinking a little about an SCM in which this is a 
feature, instead of (or in addition to) automatically assuming this is a 
bug we need to add infrastructure to work around.
 --scott

 ( http://cscott.net/ )