Re: [git-users] multiple clients on one workarea possible?

2013-11-29 Thread Konstantin Khomoutov
On Thu, 28 Nov 2013 02:10:42 -0800 (PST)
Mario Wohlwender mario.wohlwen...@torqeedo.com wrote:

 Hi git users, 
  
 I want to use a git repository to handle a folder with a lot of
 documents (.pdf...) needed for manufacturing of my company's products
 (technical drawings, Gerber files for PCB manufacturing and so on).
  
 Because the data is mainly binary (.pdf) and the size of the folder
 is very large (1 GB), I thought of a workflow git is probably not made
 for (to avoid that everybody who wants to change something has to
 clone the repository). 
 The basic idea is to have a work area with a git repository on the 
 company's network. To this work area (directory) all employees have
 read access and only the development department has write access. To
 have a history of who changed what and when, I would like to use git
 on this folder. 
 Is it a problem when more than one client (developer) makes commits
 to this one workarea/repository? The access to the repository is by
 file protocol, so the git-commands would be executed on the local
 computer of the developer.
 Is the git repository locked while one command is active? Or can the 
 database get corrupted when two developers start commands
 simultaneously on the repository?
  
 Would be nice to hear your opinion about this workflow.

I'm afraid it's not going to work, and here's why.  While Git locks
access to its own database (the .git directory inside the repository's
work tree) using a lock file, this only prevents other Git instances
from messing with the repository while one of them is active.  In
contrast, accessing files in the work tree is inherently racy: I don't
know offhand whether Git flock()'s the files in the work tree while
reading them, but even if it does, some file operations are inherently
non-atomic, and, further, flock() is only advisory.
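To see what "advisory" means in practice: only processes that also call
flock() are constrained by the lock; any other process reads the file
regardless.  A small sketch using util-linux's flock(1) (file names and
contents made up for the demo):

```shell
# An advisory lock does not stop a reader that never takes the lock.
set -e
d=$(mktemp -d)
cd "$d"
echo 'precious data' > report.pdf

# Hold an exclusive flock() lock on the file in the background.
flock -x report.pdf -c 'sleep 3' &

sleep 1                    # give the locker time to start
cp report.pdf copy.pdf     # succeeds immediately, lock or no lock
cat copy.pdf               # prints: precious data
wait
```

Nothing in the kernel prevents the `cp`; the lock only matters to
cooperating processes that check it.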
Hence a situation like this one might well occur:
1) Developer A updates a file foo.pdf and decides to `git add` it.
2) In the meantime developer B decides to update that same foo.pdf,
   so B's file-copying program and A's Git access the file
   simultaneously.  What happens then, I don't know.
   A good file manager would first upload the new file under a
   temporary name, fsync() its contents, then delete the old file and
   rename the new one to have the old one's name.  This leaves a tiny
   window in which `git add` misses the old file, which would prevent
   `git add` from doing its work, but that's mostly OK because the
   error is visible.  A bad file manager, however, might upload the
   new contents by truncating the old file and writing the new data
   over it.  This obviously gives `git add` a non-zero chance of
   reading an incomplete file.
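For illustration, here is the write-temp-then-rename pattern the "good
file manager" would follow, sketched in shell with made-up file names.
rename(2) within one filesystem is atomic, so a concurrent reader sees
either the complete old file or the complete new one, never a torn
write:

```shell
# Safely replace foo.pdf: write a temp file in the SAME directory
# (so the rename stays on one filesystem), flush it, rename it over.
set -e
d=$(mktemp -d)
cd "$d"
echo 'old drawing' > foo.pdf       # the currently published file
echo 'new drawing' > incoming.pdf  # the replacement, made up for the demo

tmp=$(mktemp foo.pdf.XXXXXX)       # temp file next to foo.pdf
cp incoming.pdf "$tmp"
sync "$tmp" 2>/dev/null || sync    # flush; GNU sync accepts a file argument
mv "$tmp" foo.pdf                  # atomic replace
cat foo.pdf                        # prints: new drawing
```

The truncate-and-overwrite approach has no such atomic step, which is
exactly what exposes `git add` to half-written files.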

What I mean is that under the normal course of operation (where each
developer has their own private repository) Git assumes the developers
themselves make sure their repositories are not subject to races like
the one outlined above, and you're likely to break this assumption.

Git itself does not support branch or file locks, and its design
explicitly argues against them, so you can't go that route either.

With this in mind, I might propose these solutions:

* Make several Git repositories, grouping the files they manage by
  type or some other criterion.  This would allow the normal workflow,
  where each developer has a set of personal clones, and the disk
  space requirements should then be tolerable.  You'll then have to
  invent a scheme for updating a read-only file share whenever a push
  to the appropriate central repository happens.
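  As a sketch of such a scheme (all paths invented for the demo): a
  post-receive hook in the bare central repository can check the
  pushed branch out into the shared directory, so the share always
  mirrors the latest push:

```shell
# Demo: a bare central repo whose post-receive hook updates a share.
set -e
top=$(mktemp -d)
cd "$top"

git init -q --bare central.git   # the central repository
mkdir share                      # stands in for the read-only network share

# The hook checks the pushed branch out into the share directory.
cat > central.git/hooks/post-receive <<EOF
#!/bin/sh
GIT_WORK_TREE=$top/share git checkout -f master
EOF
chmod +x central.git/hooks/post-receive

# A developer commits a document and pushes it...
git init -q work
cd work
echo 'drawing rev 2' > drawing.txt
git add drawing.txt
git -c user.name=dev -c user.email=dev@example.com commit -q -m 'update drawing'
git push -q ../central.git HEAD:refs/heads/master

# ...and the share now contains the new file.
cat "$top/share/drawing.txt"     # prints: drawing rev 2
```

  In a real setup the share would be the network directory everybody
  has read access to, and the checkout path would be fixed rather
  than generated.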

  One variation on this is to have a shared (centralized) huge
  repository which contains everything, with each kind of data kept
  on its own branch, plus a set of slim repositories each of which
  hosts just one of those branches.  Each developer might then clone
  only a single-branch repo when needed, but they would have to push
  their updated branch to both repos.  Alternatively, post-update
  hooks in those slim repos could do that themselves.  While common
  sense might tell you this is not going to work, Git only cares
  about commits, trees and blobs, and it's perfectly OK to have the
  same graph of commits in different repos.
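  A minimal sketch of that hook arrangement, with invented paths and
  an invented branch name ('gerber'):

```shell
# Demo: a slim single-branch repo whose post-update hook mirrors its
# branch into the big everything-repo.
set -e
top=$(mktemp -d)
cd "$top"

git init -q --bare big.git    # the huge shared repository
git init -q --bare slim.git   # hosts just the 'gerber' branch

# After every push into slim.git, forward the branch to big.git.
cat > slim.git/hooks/post-update <<EOF
#!/bin/sh
git push $top/big.git gerber:gerber
EOF
chmod +x slim.git/hooks/post-update

# A developer pushes to the slim repo only...
git init -q work
cd work
echo 'board layout' > board.txt
git add board.txt
git -c user.name=dev -c user.email=dev@example.com commit -q -m 'add layout'
git push -q ../slim.git HEAD:refs/heads/gerber

# ...and the same commit is now reachable in big.git as well.
git --git-dir="$top/big.git" rev-parse --verify gerber >/dev/null && echo mirrored
```

  The commit objects are byte-identical in both repositories, which is
  why sharing the same history graph across repos just works.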

* Use Subversion: it allows you to organize your data as a set of
  directories and check out only a single directory to update it.
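  For example (repository layout and URLs invented for the demo; a
  real setup would point at your company's server instead of file://):

```shell
# Demo: with Subversion a single directory can be checked out on its
# own, so nobody has to copy the whole 1 GB tree.
set -e
d=$(mktemp -d)
cd "$d"
svnadmin create repo
svn mkdir -q -m 'layout' file://"$d"/repo/gerber file://"$d"/repo/drawings

# Check out just the gerber directory, not the whole repository:
svn checkout -q file://"$d"/repo/gerber gerber
svn info gerber | grep ^URL
```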

I don't know your setup precisely, but I think Subversion fits your
model better than Git: it appears you don't need Git's killer features
like cheap branching, rebasing, stashing, per-hunk staging, resetting
and so on -- it looks like what you need is just something which is
linearly updated with ready binary data (PDFs and all).  So maybe the
way to go is to use a powerful DVCS (Git) to manage *sources* for
those PDFs and other stuff, using different repositories for different
projects, and Subversion to keep a set of everything ready to be
consumed by downstream users (non-developers).

Re: [git-users] multiple clients on one workarea possible?

2013-11-29 Thread Mario Wohlwender

On Friday, 29 November 2013 at 09:26:06 UTC+1, Konstantin Khomoutov wrote:
