Re: [git-users] SHA-1 checksum
Hello All,

I just wanted to say a big thank you. Some of the examples here have really helped me get the hang of some fundamentals.

--
You received this message because you are subscribed to the Google Groups "Git for human beings" group.
To unsubscribe from this group and stop receiving emails from it, send an email to git-users+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Re: [git-users] SHA-1 checksum
Sharan Basappa writes:
> The other question is, when it is time for Git to pick up the file
> associated with 100644 blob 0215040f90f133f999bac86eede7565c6d09b93d then
> it starts computing checksum of all the objects?

The point is that it doesn't have to *search* for the contents of the file, because those contents are stored in

    .git/objects/02/15040f90f133f999bac86eede7565c6d09b93d

The hash of an object tells Git where the object is stored. This is why a *cryptographic* hash must be used, so that no two different objects have the same hash, which would require that they both be stored in the same file.

There is the complication that a file's contents are stored compressed, so you can't directly read the file, which is why you need to use a Git command to get the proper file contents.

There is also the complication that "pack files" can be made that contain many objects. Each pack file has a corresponding index listing all the hashes of the objects in the pack file. Clearly, the indexes are arranged in some way that allows Git to quickly find what objects are in which pack file, but I do not know the details.

Dale
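A minimal Python sketch of the "hash tells Git where the object is stored" point, for loose objects only. The function names are made up for illustration, and pack files are ignored entirely; the "type size NUL content" header and the zlib compression follow the loose-object layout described in Pro Git.

```python
import hashlib
import os
import zlib


def loose_object_path(git_dir, object_id):
    """Map a 40-hex-digit object id to the path of its loose object file."""
    return os.path.join(git_dir, "objects", object_id[:2], object_id[2:])


def write_loose_blob(git_dir, content):
    """Store content as a loose blob and return its id (simplified sketch)."""
    data = b"blob %d\x00" % len(content) + content
    object_id = hashlib.sha1(data).hexdigest()
    path = loose_object_path(git_dir, object_id)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(data))  # stored compressed, hence not readable as text
    return object_id


def read_loose_object(git_dir, object_id):
    """Read an object back. No searching: the id alone gives the file name."""
    with open(loose_object_path(git_dir, object_id), "rb") as f:
        data = zlib.decompress(f.read())
    header, _, body = data.partition(b"\x00")
    kind = header.split(b" ")[0].decode()
    return kind, body
```

Retrieval never computes a checksum over other objects; it only splits the id into a two-character directory name and a 38-character file name.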
Re: [git-users] SHA-1 checksum
On Mon, 8 Aug 2016 20:21:47 -0700 (PDT) Sharan Basappa wrote:

> > Well, there are exactly three types of objects in Git repos: blobs,
> > trees and commits. Files are stored as blobs. Blobs have no "file
> > names" attached to them; in fact, they keep no associated metadata
> > at all. Since humans routinely manipulate data kept in files using
> > hierarchical file systems, Git mirrors this approach by using tree
> > objects. A tree object serves the same purpose a directory does on
> > a file system: it maps human-defined names of the files to their
> > contents. So a tree object contains a set of entries -- each
> > representing a single file or a subdirectory. Each entry has three
> > "fields": a (simplified) file mode, the hash value of the entry's
> > contents (its address, that is) and the human-friendly name --
> > taken from the source filesystem. Subdirectory entries refer to
> > other tree objects and file entries refer to blobs.
[...]
> So, all the 3 objects types are referenced by SHA hash
> values and searched using these values.
> This includes blobs, trees & commit objects.

Yes, this is correct. Git never uses names of files and directories as found in the work tree to look up bits of data it stores. Such lookups *do* happen -- say, when you run something like

    git log -- path/to/some/file

but they happen like this:

1) ... Fetch the next commit object;
2) Fetch the root tree object it references, parse it to find an entry named "path", get its SHA-1 name.
3) Fetch a tree object figured out on step (2), parse it to find an entry named "to", ...

...and so on, so in the end the actual data is always looked up in the object store using its SHA-1 name.
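The lookup steps above can be sketched in Python with a toy in-memory object store. Everything here is hypothetical: the ids are readable labels rather than real SHA-1 names, and the dictionaries stand in for parsed commit and tree objects.

```python
# Toy object store keyed by made-up ids (real Git uses SHA-1 names).
# A tree maps each entry name to (type, id), like "100644 blob <hash> <name>".
OBJECTS = {
    "commit-1": {"tree": "tree-root"},
    "tree-root": {"path": ("tree", "tree-path"), "README": ("blob", "blob-readme")},
    "tree-path": {"to": ("tree", "tree-to")},
    "tree-to": {"some": ("tree", "tree-some")},
    "tree-some": {"file": ("blob", "blob-file")},
    "blob-readme": b"read me\n",
    "blob-file": b"file contents\n",
}


def resolve_path(objects, commit_id, path):
    """Walk tree objects one path component at a time; return (type, id)."""
    kind, object_id = "tree", objects[commit_id]["tree"]
    for name in path.split("/"):
        if kind != "tree":
            raise KeyError(f"{name!r}: parent is not a directory")
        # Names are only used *inside* a tree; the next fetch is by id.
        kind, object_id = objects[object_id][name]
    return kind, object_id
```

Calling `resolve_path(OBJECTS, "commit-1", "path/to/some/file")` fetches four trees by id, and only the final blob id is used to get at the data.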
Re: [git-users] SHA-1 checksum
> Well, there are exactly three types of objects in Git repos: blobs,
> trees and commits. Files are stored as blobs. Blobs have no "file
> names" attached to them; in fact, they keep no associated metadata at
> all. Since humans routinely manipulate data kept in files using
> hierarchical file systems, Git mirrors this approach by using tree
> objects. A tree object serves the same purpose a directory does on a
> file system: it maps human-defined names of the files to their contents.
> So a tree object contains a set of entries -- each representing a
> single file or a subdirectory. Each entry has three "fields": a
> (simplified) file mode, the hash value of the entry's contents (its
> address, that is) and the human-friendly name -- taken from the source
> filesystem. Subdirectory entries refer to other tree objects and file
> entries refer to blobs.

Dear Konstantin,

Thanks a lot. So, all the 3 objects types are referenced by SHA hash values and searched using these values. This includes blobs, trees & commit objects.
Re: [git-users] SHA-1 checksum
On Mon, 8 Aug 2016 09:00:06 -0700 (PDT) Sharan Basappa wrote:

[...]
> > The contents of file "-NOTES" is in
> > .git/objects/02/15040f90f133f999bac86eede7565c6d09b93d. In this
> > case, that object is in one of the "pack" files. git-cat-file has
> > to read through the indexes of the pack files to find that.
> >
> > The critical ideas are that files are stored by their *contents*
> > not their *names*. Any particular blob of content has an eternally
> > unique name (its hash), which will be the same in any repository
> > containing a blob with the same bytes. "tree" objects are used to
> > catalog the names of files and their contents.
[...]
> To clarify,
>
> 100644 blob 0215040f90f133f999bac86eede7565c6d09b93d    -NOTES
>
> Instead of storing reference to actual file, Git stores reference to
> the content rather (in the form of checksum
> 0215040f90f133f999bac86eede7565c6d09b93d)?
> Is -NOTES a reference stored by Git. I am thinking where does Git get
> the file name if it does not store it in someplace originally?

Well, there are three types of objects in Git repos that matter here: blobs, trees and commits (the fourth, annotated tags, can be ignored for now). Files are stored as blobs. Blobs have no "file names" attached to them; in fact, they keep no associated metadata at all.

Since humans routinely manipulate data kept in files using hierarchical file systems, Git mirrors this approach by using tree objects. A tree object serves the same purpose a directory does on a file system: it maps human-defined names of the files to their contents. So a tree object contains a set of entries -- each representing a single file or a subdirectory. Each entry has three "fields": a (simplified) file mode, the hash value of the entry's contents (its address, that is) and the human-friendly name -- taken from the source filesystem. Subdirectory entries refer to other tree objects and file entries refer to blobs.

Each commit object refers to exactly one tree object representing the root of the project.
Conceptually, a commit is created by starting from the project's root directory and going all the way down -- into subdirectories, considering all the tracked files on each level and creating appropriate tree and blob entries for everything found. Of course, the real implementation is much more complex, so that it can run as fast as possible.

I think you should read the famous (and old) "Git from the bottom up" document [1]. It takes an unusual approach at explaining Git by actually dealing with its data model -- rather than commands to manipulate the repository.

1. https://jwiegley.github.io/git-from-the-bottom-up/
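The "start at the root and go all the way down" idea can be sketched in Python as follows. This is a conceptual sketch, not how Git itself is implemented: the project is assumed to be a nested dict (names mapping to bytes for files or to dicts for subdirectories), every file is given mode 100644, and the object store is a plain dict. The binary entry layout (mode, name, NUL, 20 raw hash bytes) follows the tree object format.

```python
import hashlib


def store_object(store, kind, body):
    """Hash "<kind> <size>NUL<body>" and file it in the store under that id."""
    data = kind + b" %d\x00" % len(body) + body
    object_id = hashlib.sha1(data).hexdigest()
    store[object_id] = data  # same content -> same id -> stored once
    return object_id


def write_tree(store, directory):
    """Create blobs and trees bottom-up; return the id of this directory's tree."""
    entries = []
    for name, value in directory.items():
        if isinstance(value, dict):  # subdirectory -> another tree object
            entries.append((name + "/", b"40000", name, write_tree(store, value)))
        else:                        # file -> blob object (mode assumed 100644)
            entries.append((name, b"100644", name, store_object(store, b"blob", value)))
    body = b""
    # Entries are ordered by name, with directories compared as "name/".
    for _key, mode, name, object_id in sorted(entries):
        body += mode + b" " + name.encode() + b"\x00" + bytes.fromhex(object_id)
    return store_object(store, b"tree", body)
```

A commit object would then simply record the id returned for the root directory, plus parents, author and message.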
Re: [git-users] SHA-1 checksum
> 2) At its very bottom, Git implements the so-called
> "content-addressable filesystem". Its chief principle is that every
> unique piece of data is stored exactly once, and these pieces are
> identified by their contents. Since using the contents "as is" is
> unwieldy, they are addressed using -- again -- the cryptographic hashes
> calculated over those contents. This is what makes Git effectively
> implement its paradigm where each commit refers to a complete state of
> all the project's files: even though like 99.9% of the content of each
> commit in a typical big project is the same as its parent commit, each
> unique chunk of information -- a file or a tree referring to a set of
> files -- is stored in the repository exactly once.

Content addressable filesystem. Nicely put. So, it is a sort of content addressable memory (CAM) where contents are unique.

Thanks a lot,
Re: [git-users] SHA-1 checksum
> Consider one of my Git repositories. The file .git/HEAD contains
>
>     ref: refs/heads/hobgoblin
>
> That points to the file .git/refs/heads/hobgoblin, which contains the
> hash of the commit which is the tip of the "hobgoblin" branch:
>
>     92f8f718eb9b19f921f20283e55c56e8dc66ed10
>
> That points to the file
> .git/objects/92/f8f718eb9b19f921f20283e55c56e8dc66ed10. That file's
> contents aren't in ASCII, so you have to use "git cat-file -p
> 92f8f718eb9b19f921f20283e55c56e8dc66ed10" to read its contents:
>
>     tree d5d1ad293f8fdd4a4a4e0e9a73c5c3c851126c22
>     parent 39c83b086e141bb00d32737a4e2aae675d795f44
>     author Dale R. Worley 1470669963 -0400
>     committer Dale R. Worley 1470669963 -0400
>     ...
>
> So the hash of the tree object is
> d5d1ad293f8fdd4a4a4e0e9a73c5c3c851126c22 and the hash of the one parent
> commit is 39c83b086e141bb00d32737a4e2aae675d795f44. The tree object is
> in .git/objects/d5/d1ad293f8fdd4a4a4e0e9a73c5c3c851126c22, but again,
> you have to use git-cat-file to read it:
>
>     100644 blob 0215040f90f133f999bac86eede7565c6d09b93d    -NOTES
>     100644 blob ef62bfd5a8e81c8ca13372b2436bccf1c0698185    -NOTES.MYOB
>     100644 blob 65dda34dadf753dbfc791b5811f3cd437a666cac    -NOTES.XA.recovery
>     100644 blob 88182ec16035fd4d77c0c1312ce1510f2f8da4b2    -NOTES.XB.recovery
>     100644 blob 73415b6e2ebcd6a384874c0ab40ec70a5112db18    -NOTES.freeze
>     100644 blob 3a4fb8ec6e7c0219c4d7ab002eaaa84abae2c72d    -NOTES.gleaning
>     040000 tree c21923c2647ecec7d627a49e51b4e8b5d19344b4    .a68g
>     100644 blob f9a4c46f50234a11f9ad283973ed2f11a4758f2f    .aspell.en.prepl
>     100644 blob 182c2739a5cc69a322a41723d4423ed1d8a6266e    .aspell.en.pws
>     ...
>
> The contents of file "-NOTES" is in
> .git/objects/02/15040f90f133f999bac86eede7565c6d09b93d. In this case,
> that object is in one of the "pack" files. git-cat-file has to read
> through the indexes of the pack files to find that.
>
> The critical ideas are that files are stored by their *contents* not
> their *names*. Any particular blob of content has an eternally unique
> name (its hash), which will be the same in any repository containing a
> blob with the same bytes. "tree" objects are used to catalog the names
> of files and their contents.

Dear Philip, Dale,

Thanks. I think this example helps me a lot.

To clarify,

    100644 blob 0215040f90f133f999bac86eede7565c6d09b93d    -NOTES

Instead of storing a reference to the actual file, Git stores a reference to the content (in the form of the checksum 0215040f90f133f999bac86eede7565c6d09b93d)? Is -NOTES a reference stored by Git? I am wondering where Git gets the file name from if it does not store it somewhere originally.

The other question is, when it is time for Git to pick up the file associated with 100644 blob 0215040f90f133f999bac86eede7565c6d09b93d, does it start computing the checksum of all the objects?

Similarly, referring to tree object d5d1ad293f8fdd4a4a4e0e9a73c5c3c851126c22, does one again have to calculate the checksum of all tree objects in order to get the following contents:

    100644 blob 0215040f90f133f999bac86eede7565c6d09b93d    -NOTES
    100644 blob ef62bfd5a8e81c8ca13372b2436bccf1c0698185    -NOTES.MYOB
    100644 blob 65dda34dadf753dbfc791b5811f3cd437a666cac    -NOTES.XA.recovery
    100644 blob 88182ec16035fd4d77c0c1312ce1510f2f8da4b2    -NOTES.XB.recovery
    100644 blob 73415b6e2ebcd6a384874c0ab40ec70a5112db18    -NOTES.freeze
    100644 blob 3a4fb8ec6e7c0219c4d7ab002eaaa84abae2c72d    -NOTES.gleaning
    040000 tree c21923c2647ecec7d627a49e51b4e8b5d19344b4    .a68g
    100644 blob f9a4c46f50234a11f9ad283973ed2f11a4758f2f    .aspell.en.prepl
    100644 blob 182c2739a5cc69a322a41723d4423ed1d8a6266e    .aspell.en.pws

Thanks a lot
Re: [git-users] SHA-1 checksum
On Sun, 7 Aug 2016 09:26:30 -0700 (PDT) Sharan Basappa wrote:

> I would like to know why GIT calculates checksum of a file.
> Typically, checksum is used for the purpose of integrity.

Well, Git does this for two reasons:

1) It's what makes "D" in the "DVCS" ("Distributed Version Control System") possible. When two Git instances exchange histories from their repositories over the wire, they need to have a way to figure out what parts of them they share.

Now suppose that the user of the first repository created a file containing the string "Hello world" and named that file "foo.txt". The user of the second repository created a file with identical contents but named it "bar.txt" and placed it in a directory named "stuff". If we look at file names only, these files are clearly different. But they have identical contents, and that is what DVCSes exchange with each other.

Enter cryptographic hashes. They have two major properties:

* Identical sets of data "compress" to identical hash values.
* No two different sets of data compress to identical hash values (well, in fact it's theoretically possible for real-world hash functions to fail keeping this invariant, and it's called "a collision", but such an event is quite improbable for real-world applications).

So cryptographic hashes serve neatly as short "handles" to chunks of data of arbitrary size: for my toy example of the data string "Hello world" it is not quite obvious, but a cryptographic hash is perfectly able to uniquely identify the contents of a multi-megabyte file as well.

2) At its very bottom, Git implements the so-called "content-addressable filesystem". Its chief principle is that every unique piece of data is stored exactly once, and these pieces are identified by their contents. Since using the contents "as is" is unwieldy, they are addressed using -- again -- the cryptographic hashes calculated over those contents.

This is what makes Git effectively implement its paradigm where each commit refers to a complete state of all the project's files: even though like 99.9% of the content of each commit in a typical big project is the same as its parent commit, each unique chunk of information -- a file or a tree referring to a set of files -- is stored in the repository exactly once.
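To make the "foo.txt" versus "stuff/bar.txt" example concrete, here is a small Python sketch. The names `git_blob_id` and `ContentStore` are invented for illustration; the "blob <size> NUL" header is the one Git prepends before hashing a blob, and the rest is a toy model of a content-addressable store, not Git's actual code.

```python
import hashlib


def git_blob_id(content):
    """SHA-1 over "blob <size>NUL" + content; the file name plays no part."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


class ContentStore:
    """Toy content-addressable store: each unique piece of data is kept once."""

    def __init__(self):
        self.objects = {}

    def put(self, content):
        object_id = git_blob_id(content)
        self.objects.setdefault(object_id, content)  # no-op if already present
        return object_id

    def get(self, object_id):
        return self.objects[object_id]


store = ContentStore()
id_foo = store.put(b"Hello world")  # first user's foo.txt
id_bar = store.put(b"Hello world")  # second user's stuff/bar.txt
```

Here `id_foo` and `id_bar` are equal and the store holds a single object, which is how two repositories can tell that they already share this content without comparing names.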
Re: [git-users] SHA-1 checksum
Sharan Basappa writes:
> So, if Git stores files using just their checksums then
>
> a) how does it look up (or retrieve) a specific file in the database?
> For example, if it wants to find a file in the data base then it takes
> checksum and starts computing checking of every file in its database &
> compare?
> This looks pretty costly & rather unnecessary to me.
>
> b) how does it get keep track file names that are required when it gives us
> a working copy?

Consider one of my Git repositories. The file .git/HEAD contains

    ref: refs/heads/hobgoblin

That points to the file .git/refs/heads/hobgoblin, which contains the hash of the commit which is the tip of the "hobgoblin" branch:

    92f8f718eb9b19f921f20283e55c56e8dc66ed10

That points to the file .git/objects/92/f8f718eb9b19f921f20283e55c56e8dc66ed10. That file's contents aren't in ASCII, so you have to use "git cat-file -p 92f8f718eb9b19f921f20283e55c56e8dc66ed10" to read its contents:

    tree d5d1ad293f8fdd4a4a4e0e9a73c5c3c851126c22
    parent 39c83b086e141bb00d32737a4e2aae675d795f44
    author Dale R. Worley 1470669963 -0400
    committer Dale R. Worley 1470669963 -0400
    ...

So the hash of the tree object is d5d1ad293f8fdd4a4a4e0e9a73c5c3c851126c22 and the hash of the one parent commit is 39c83b086e141bb00d32737a4e2aae675d795f44.

The tree object is in .git/objects/d5/d1ad293f8fdd4a4a4e0e9a73c5c3c851126c22, but again, you have to use git-cat-file to read it:

    100644 blob 0215040f90f133f999bac86eede7565c6d09b93d    -NOTES
    100644 blob ef62bfd5a8e81c8ca13372b2436bccf1c0698185    -NOTES.MYOB
    100644 blob 65dda34dadf753dbfc791b5811f3cd437a666cac    -NOTES.XA.recovery
    100644 blob 88182ec16035fd4d77c0c1312ce1510f2f8da4b2    -NOTES.XB.recovery
    100644 blob 73415b6e2ebcd6a384874c0ab40ec70a5112db18    -NOTES.freeze
    100644 blob 3a4fb8ec6e7c0219c4d7ab002eaaa84abae2c72d    -NOTES.gleaning
    040000 tree c21923c2647ecec7d627a49e51b4e8b5d19344b4    .a68g
    100644 blob f9a4c46f50234a11f9ad283973ed2f11a4758f2f    .aspell.en.prepl
    100644 blob 182c2739a5cc69a322a41723d4423ed1d8a6266e    .aspell.en.pws
    ...

The contents of file "-NOTES" is in .git/objects/02/15040f90f133f999bac86eede7565c6d09b93d. In this case, that object is in one of the "pack" files. git-cat-file has to read through the indexes of the pack files to find that.

The critical ideas are that files are stored by their *contents* not their *names*. Any particular blob of content has an eternally unique name (its hash), which will be the same in any repository containing a blob with the same bytes. "tree" objects are used to catalog the names of files and their contents.

Dale
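The first two hops of this walk (HEAD to branch file to commit hash) can be sketched in Python. This is a simplified illustration: `resolve_head` is a made-up helper, and it assumes the branch is stored as a plain file under refs/, so refs that have been moved into the packed-refs file are not handled.

```python
import os


def resolve_head(git_dir):
    """Follow .git/HEAD to the commit hash it ultimately names."""
    with open(os.path.join(git_dir, "HEAD")) as f:
        head = f.read().strip()
    if not head.startswith("ref: "):
        return head  # detached HEAD: the file holds a commit hash directly
    ref_name = head[len("ref: "):]  # e.g. "refs/heads/hobgoblin"
    # Assumption: the ref exists as a loose file, not only in packed-refs.
    with open(os.path.join(git_dir, *ref_name.split("/"))) as f:
        return f.read().strip()
```

From the returned hash onward, everything is an object lookup: commit, then its tree, then the blobs and subtrees that the tree lists.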
Re: [git-users] SHA-1 checksum
- Original Message -
From: Sharan Basappa

>> Philip Oakley wrote:
>> You have it in one.
>> Yes that is the reason that git computes the sha1 of the file's
>> contents - it provides integrity, veracity and non-repudiation (the last
>> one is still true though cryptanalysis is getting better, so sha1 is no
>> longer recommended, and Git is looking at how to progress to newer
>> crypto-hashes).
>> Once Git has the sha1's of the files in a directory, it does the same
>> again for the 'file' that lists the file names, mode bits and their
>> content's sha1s, and ever onwards up the trees to the commit, which
>> lists the sha1s of its parents.
>> So if you have the sha1 of the tip of a branch, such as master, and you
>> have a repo that holds that sha1, then you have the full crypto
>> integrity that your copy (with all its history) is identical to that of
>> the originators - your own Dali, Rembrandt, Gauguin, hanging in your
>> hall... and it isn't even a replica, it's the real thing!

> Dear Philip, Michael,
> Thanks. It's true that checksums like SHA give a unique signature of any file.
> But where things start getting confusing (to me) is when I read
> "In fact, Git stores everything in its database not by file name but by
> the hash value of its contents.".

Correct, in the .git/objects folder you will see those new objects stored as ab/cdef01234 etc.

> This is from book Pro-Git.
> So, if Git stores files using just their checksums then
> a) how does it look up (or retrieve) a specific file in the database?
> For example, if it wants to find a file in the data base then it takes
> checksum and starts computing checking of every file in its database &
> compare?

You will see in my reply that there is a 'next level' file which has the lists of names to associate with the sha1 hash it needs. These are the ones called 'tree' objects.

> This looks pretty costly & rather unnecessary to me.

You will be looking at this from the wrong side. It's about speed of reconstruction when you are getting a specific revision back from the store. Don't forget that Git normally works on the revision of the complete project, not just some little file.

> b) how does it get keep track file names that are required when it gives
> us a working copy?

Starting at the commit sha1, it looks for that sha1 file, which lists the top-level tree sha1. Expand that as the top-level directory names, with sha1s for each next-level directory or file. It's almost identical to how a file system works! (I think Linus, who wrote git, wrote a little OS, nothing big, once ;-)

Once you have all that nicely fixed in your head, you can then look (if you are interested in the next layer of digging) at pack files, which are Git's way of compressing all those sha1 files which have lots of repetition because nothing much changes from one rev to the next (or at least it should not, because the changes within a commit should be small! - it's part of what makes Git work).

> Thanks again ...

No problems
Re: [git-users] SHA-1 checksum
> You have it in one.
>
> Yes that is the reason that git computes the sha1 of the file's contents -
> it provides integrity, veracity and non-repudiation (the last one is still
> true though cryptanalysis is getting better, so sha1 is no longer
> recommended, and Git is looking at how to progress to newer crypto-hashes).
> Once Git has the sha1's of the files in a directory, it does the same
> again for the 'file' that lists the file names, mode bits and their
> content's sha1s, and ever onwards up the trees to the commit, which lists
> the sha1s of its parents.
>
> So if you have the sha1 of the tip of a branch, such as master, and you
> have a repo that holds that sha1, then you have the full crypto integrity
> that your copy (with all its history) is identical to that of the
> originators - your own Dali, Rembrandt, Gauguin, hanging in your hall... and
> it isn't even a replica, it's the real thing!

Dear Philip, Michael,

Thanks. It's true that checksums like SHA give a unique signature of any file. But where things start getting confusing (to me) is when I read "In fact, Git stores everything in its database not by file name but by the hash value of its contents.". This is from the book Pro Git.

So, if Git stores files using just their checksums then

a) how does it look up (or retrieve) a specific file in the database? For example, if it wants to find a file in the database, does it take the checksum and start computing the checksum of every file in its database to compare? This looks pretty costly & rather unnecessary to me.

b) how does it keep track of the file names that are required when it gives us a working copy?

Thanks again ...
Re: [git-users] SHA-1 checksum
On 2016-08-07, at 9:26 AM, Sharan Basappa wrote:

> Hi,
>
> I would like to know why GIT calculates checksum of a file.
> Typically, checksum is used for the purpose of integrity.
>
> An example would really help.

An example? Ok. Back when something else was using a simple CRC, someone tried to replace a file with another, bypassing the normal history system. The CRC was not good enough to detect it; so, something was needed that was good enough to detect/stop this.

But more importantly: The hash is the filename of the file. It is critical that the hash be good enough that you won't get duplicate filenames. CRC doesn't do that. SHA-1 does.

The checksum has to be good enough to make a unique filename in normal use. It does not have to be good enough to guarantee non-alteration, but that's a really good secondary; it does have to be good enough to detect accidental damage (such as memory/disk/network/driver/etc corruption).

Now, a secondary benefit of the whole "layer upon layer" approach: The hash of the last commit is only valid if every file and commit to date is accurate. If you know the hash of your last commit (20 bytes, I think), and you can validate all the hashes in the past, then you know that nothing has altered any file outside of the git mechanism.

---
Entertaining minecraft videos
http://YouTube.com/keybounce
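The "layer upon layer" property can be shown with a toy Python model. The object bodies here are deliberately simplified and are not Git's real tree or commit formats (no modes, authors or dates), and all function names are invented; the point is only that each layer hashes the ids of the layer below, so the tip id depends on every byte of history.

```python
import hashlib


def object_id(kind, body):
    """Hash a typed object the way Git frames it: "<kind> <size>NUL<body>"."""
    return hashlib.sha1(kind + b" %d\x00" % len(body) + body).hexdigest()


def toy_tree(files):
    """Toy tree: one "name blob-id" line per file (not Git's binary format)."""
    body = "".join(
        f"{name} {object_id(b'blob', data)}\n" for name, data in sorted(files.items())
    )
    return object_id(b"tree", body.encode())


def toy_commit(tree_id, parent_id, message):
    """Toy commit: names its tree and (optionally) its parent commit."""
    lines = ["tree " + tree_id]
    if parent_id is not None:
        lines.append("parent " + parent_id)
    lines += ["", message]
    return object_id(b"commit", "\n".join(lines).encode())


def tip_of_history(snapshots):
    """Commit each snapshot on top of the previous one; return the last id."""
    parent = None
    for files in snapshots:
        parent = toy_commit(toy_tree(files), parent, "snapshot")
    return parent


original = [{"a.txt": b"one"}, {"a.txt": b"two"}]
tampered = [{"a.txt": b"ONE"}, {"a.txt": b"two"}]  # only the *old* snapshot differs
```

`tip_of_history(original)` and `tip_of_history(tampered)` differ even though the latest snapshot is identical in both, which is why knowing the tip hash is enough to detect an alteration anywhere in the past.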
Re: [git-users] SHA-1 checksum
Sharan,

You have it in one.

Yes that is the reason that git computes the sha1 of the file's contents - it provides integrity, veracity and non-repudiation (the last one is still true though cryptanalysis is getting better, so sha1 is no longer recommended, and Git is looking at how to progress to newer crypto-hashes).

Once Git has the sha1's of the files in a directory, it does the same again for the 'file' that lists the file names, mode bits and their content's sha1s, and ever onwards up the trees to the commit, which lists the sha1s of its parents.

So if you have the sha1 of the tip of a branch, such as master, and you have a repo that holds that sha1, then you have the full crypto integrity that your copy (with all its history) is identical to that of the originators - your own Dali, Rembrandt, Gauguin, hanging in your hall... and it isn't even a replica, it's the real thing!

Philip

It's turtles all the way down.

- Original Message -
From: Sharan Basappa
To: Git for human beings
Sent: Sunday, August 07, 2016 5:26 PM
Subject: [git-users] SHA-1 checksum

Hi,

I would like to know why GIT calculates checksum of a file. Typically, checksum is used for the purpose of integrity.

An example would really help.

Regards,