On Thu, 28 Nov 2013 02:10:42 -0800 (PST)
Mario Wohlwender mario.wohlwen...@torqeedo.com wrote:
> Hi git users,
>
> I want to use a Git repository to handle a folder with lots of
> documents (.pdf, ...) needed for manufacturing of my company's
> products (technical drawings, Gerber files for PCB manufacturing and
> so on). Because the data is mainly binary (.pdf) and the folder is
> very large (1 GB), I thought of a workflow Git is probably not made
> for (so that not everybody who wants to change something has to
> clone the repository).
>
> The basic idea is to have a work area with a Git repository on the
> company's network. All employees have read access to this work area
> (directory), and only the development department has write access.
> To have a history of who changed what and when, I would like to use
> Git on this folder.
>
> Is it a problem when more than one client (developer) makes commits
> to this one work area/repository? The access to the repository is by
> the file protocol, so the Git commands would be executed on the
> local computer of the developer. Is the Git repository locked while
> one command is active? Or can the database get corrupted when two
> developers start commands simultaneously on the repository?
>
> Would be nice to hear your opinion about this workflow.

I'm afraid it's not gonna work, and here's why. While Git locks access
to its own database (the .git directory inside the repository's work
tree) using a lock file, that only prevents other Git instances from
messing with the repository while one of them is active. By contrast,
access to files in the work tree is inherently racy: I don't know
offhand whether Git flock()'s the files in the work tree while it's
reading them, but even if it does, some file operations are simply
non-atomic, and, further, flock() is only advisory.
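If you want to see Git's own lock in action, something like the
following, run in a scratch repository, shows what a second Git
process runs into; the exact wording of the message varies between
Git versions:

  # simulate another Git process holding the index lock
  touch .git/index.lock
  git add foo.pdf
  # fatal: Unable to create '.../.git/index.lock': File exists.
  rm -f .git/index.lock   # remove the simulated lock again

The files in the work tree, however, get no such protection.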
Hence a situation like this one might well occur:
1) Developer A updates a file foo.pdf and decides to `git add` it.
2) In the meantime developer B decides to update that same foo.pdf,
so B's file copying program and A's Git access the file at the same
time. What happens then, I don't know.

A good file copying program would first write the new contents to a
temporary file, fsync() it, then delete the old file and rename the
temporary one to the old name. That leaves only a tiny window in
which `git add` can miss the file; that would prevent `git add` from
doing its work, but it's mostly OK because the error is visible.
Still, a bad program might write the new contents by truncating the
old file and writing the new data over it, which obviously gives
`git add` a non-zero chance of reading an incomplete file.
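To make the difference concrete, here is a rough shell sketch of the
two copying strategies (the file names are made up):

  # The careful way: write a complete temporary file, then rename it
  # into place. rename() within one filesystem is atomic, so a reader
  # sees either the old file or the new one, never a half-written mix.
  cp new-revision.pdf foo.pdf.tmp
  sync
  mv foo.pdf.tmp foo.pdf

  # The careless way: cp truncates foo.pdf and rewrites it in place,
  # so a `git add foo.pdf` that starts in the middle of the copy may
  # hash a partial file.
  cp new-revision.pdf foo.pdf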
What I mean is that in the normal course of operation (where each
developer has their own private repository) Git assumes the
developers themselves make sure their repositories are not subject to
races like the one outlined above, and you're likely to break this
assumption.
Git itself does not support branch or file locks, and its design
explicitly goes against them, so you can't lock your way out of these
races either.
With this in mind, I might propose these solutions:
* Make several Git repositories, grouping the files managed by them
by type or some other criterion. This would allow the normal
workflow, where each developer has a set of personal clones, and the
disk space requirements should hopefully be tolerable. You'll then
have to invent a scheme for updating a read-only file share whenever
a push to the appropriate central repository happens (see the hook
sketch after this list).
One modification to this is to have a shared (centralized) huge
repository which contains everything, but with different kinds of
data kept on different branches, and then a set of repositories each
of which hosts just one branch of that data. Each developer might
then clone only a single-branch repo when needed, but they would have
to push their updated branch to both repos. Or, alternatively,
post-update hooks in those slim repos could do that themselves.
While common sense might tell you this is not gonna work, Git only
cares about commits, trees and blobs, and it's perfectly OK to have
the same graph of commits in different repos.
* Use Subversion: it allows you to organize your data as a set of
directories and check out only a single directory to update it.
I don't know your setup precisely, but I think Subversion fits your
model better than Git: it appears you don't need Git's killer
features like cheap branching, rebasing, stashing, per-hunk staging
and resetting, etc. -- it looks like what you need is just something
which is linearly updated with ready binary data (PDFs and all). So
maybe the way to go is to use a powerful DVCS (Git) to manage
*sources* for those PDFs and other stuff, using different
repositories for different projects, and Subversion to keep a set of
everything ready to be consumed by downstream users (non-developers).
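To make the first option a bit more concrete, here is a rough sketch
of the plumbing; all repository paths, branch names and the share
location below are made up, so adjust to taste:

  # hooks/post-receive in the hypothetical central bare repository
  # /srv/git/drawings.git: refresh the read-only file share whenever
  # somebody pushes (the hook must be executable, #!/bin/sh and all).
  GIT_WORK_TREE=/srv/share/drawings git checkout -f master

  # For the "slim repo per branch" variant: a developer clones one of
  # the small single-branch repositories...
  git clone /srv/git/gerber.git

  # ...and hooks/post-update in that slim repository forwards its
  # master branch to the "gerber" branch of the big shared repository,
  # so nobody has to remember to push twice.
  git push /srv/git/everything.git master:gerber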
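As for the Subversion route, checking out and updating a single
directory is the everyday workflow there; roughly (the URL and file
names are made up):

  # fetch only the directory you care about
  svn checkout https://svn.example.com/manufacturing/trunk/gerber
  cd gerber
  # ...replace or add the ready-to-consume files...
  svn add new-board-rev3.pdf
  svn commit -m "Add board revision 3 drawings"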