RE: EXT :Re: GIT and large files

2014-05-20 Thread Stewart, Louis (IS)
The files in question would be in a directory containing many files, some small,
others huge (for example: text files, docs, and jpgs are MBs, but executables
and OVA images are GBs, etc.).

Lou

From: Gary Fixler [mailto:gfix...@gmail.com] 
Sent: Tuesday, May 20, 2014 12:09 PM
To: Stewart, Louis (IS)
Cc: git@vger.kernel.org
Subject: EXT :Re: GIT and large files

Technically yes, but from a practical standpoint, not really. Facebook recently
revealed that they have a 54GB git repo[1], but I doubt it has 20+GB files in
it. I've put 18GB of photos into a git repo, but everything about the process
was fairly painful, and I don't plan to do it again.

Are your files non-mergeable binaries (e.g. videos)? The biggest problem here
is with branching and merging. Conflict resolution with non-mergeable assets
ends up as an us-vs-them fight, and I don't understand all of the particulars
of that. From git's standpoint it's simple - you just have to choose one or the
other. From a workflow standpoint, you end up causing trouble if two people
have changed an asset, and both people consider their change important.
Centralized systems get around this problem with locks.

Git could do this, and I've thought about it quite a bit. I work in games - we
have code, but also a lot of binaries that I'd like to keep in sync with the
code. For a while I considered suggesting some ideas to this group, but I'm
pretty sure the locking issue makes it a non-starter. The basic idea - skipping
locking for the moment - would be to allow setting git attributes by file type,
file size threshold, folder, etc., to let git know that some files are
considered bigfiles. These could be placed into the objects folder, but I'd
actually prefer they go into a .git/bigfile folder. They'd still be saved as
contents under their hash, but a normal git transfer wouldn't send them. They'd
be in the tree as 'big' or 'bigfile' (instead of 'blob', 'tree', or 'commit'
(for submodules)).
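
To be clear, none of this exists in git today; this is an invented sketch of
what marking bigfiles via attributes might look like (the 'bigfile' attribute
name and the size-threshold syntax are made up for illustration):
$cat .gitattributes
*.ova           bigfile
*.exe           bigfile
assets/**       bigfile
# or by size rather than pattern (again, invented syntax):
# *             bigfile=over:1G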

Git would warn you on push that there were bigfiles to send, and you could add,
say, --with-big to also send them, or send them later with, say, `git push
--big`. They'd simply be zipped up and sent over, without any packfile
fanciness. When you clone, you wouldn't get the bigfiles unless you specified
--with-big; it would warn you that there are also bigfiles, and tell you what
command to run to also get them (`git fetch --big`, perhaps). Git status would
always let you know if you were missing bigfiles. I think hopping around
between commits would follow the same strategy: you'd always have to, e.g.,
`git checkout foo --with-big`, or `git checkout foo` and then `git update big`
(or whatever - I'm not married to any of these names).
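
Spelled out as a hypothetical session (again, none of these flags exist; they
are only the proposal above):
$git clone http://example.com/game.git    # bigfiles skipped, with a warning
$git fetch --big                          # fetch the missing bigfiles separately
$git checkout topic --with-big            # switch branches, bigfiles included
$git push --with-big                      # push commits and their bigfiles together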

Resolving conflicts on merge would simply have to be up to you. It would be
documented clearly that you're entering weird territory, and that your team has
to deal with bigfiles somehow, perhaps with some suggested strategies ("pass
the conch"?). I could imagine some strategies for this. Maybe bigfiles would
require connecting to a blessed repo to grab the right to make a commit on
them. That has many problems, of course, and now I can feel everyone reading
this shifting uneasily in their seats :)
-g

[1] https://twitter.com/feross/status/459259593630433280

On Tue, May 20, 2014 at 8:37 AM, Stewart, Louis (IS) louis.stew...@ngc.com 
wrote:
Can GIT handle versioning of large 20+ GB files in a directory?

Lou Stewart
AOCWS Software Configuration Management
757-269-2388



RE: EXT :Re: GIT and large files

2014-05-20 Thread Stewart, Louis (IS)
Thanks for the reply.  I just read the intro to GIT and I am concerned about
the part that it will copy the whole repository to the developers' work area.
They really just need the one directory and files under that one directory. The
history has TBs of data.

Lou

-Original Message-
From: Junio C Hamano [mailto:gits...@pobox.com] 
Sent: Tuesday, May 20, 2014 1:18 PM
To: Stewart, Louis (IS)
Cc: git@vger.kernel.org
Subject: EXT :Re: GIT and large files

Stewart, Louis (IS) louis.stew...@ngc.com writes:

 Can GIT handle versioning of large 20+ GB files in a directory?

I think you can git add such files, push/fetch histories that contain such 
files over the wire, and git checkout such files, but naturally reading, 
processing and writing 20+GB would take some time.  In order to run operations 
that need to see the changes, e.g. git log -p, a real content-level merge, 
etc., you would also need sufficient memory because we do things in-core.
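
In concrete terms, a minimal sketch of the operations described above might
look like this (repository and file names are made up; each step will
naturally be slow at 20+GB):
$git init big-project && cd big-project
$cp /data/image.ova .              # a 20+GB file
$git add image.ova                 # reads, hashes and compresses the whole file
$git commit -m "add disk image"
$git push origin master            # assuming a remote named origin is set up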




Re: EXT :Re: GIT and large files

2014-05-20 Thread Junio C Hamano
Stewart, Louis (IS) louis.stew...@ngc.com writes:

 Thanks for the reply.  I just read the intro to GIT and I am
 concerned about the part that it will copy the whole repository to
 the developers' work area.  They really just need the one directory
 and files under that one directory. The history has TBs of data.

Then you will spend time reading, processing and writing TBs of data
when you clone, unless your developers do something to limit the
history they fetch, e.g. by shallowly cloning.
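
For instance, a shallow clone keeps a developer from downloading the whole
history (the URL is hypothetical; --depth 1 fetches only the most recent
commit):
$git clone --depth 1 http://example.com/big-project.git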


 Lou

 -Original Message-
 From: Junio C Hamano [mailto:gits...@pobox.com] 
 Sent: Tuesday, May 20, 2014 1:18 PM
 To: Stewart, Louis (IS)
 Cc: git@vger.kernel.org
 Subject: EXT :Re: GIT and large files

 Stewart, Louis (IS) louis.stew...@ngc.com writes:

 Can GIT handle versioning of large 20+ GB files in a directory?

 I think you can git add such files, push/fetch histories that contain such 
 files over the wire, and git checkout such files, but naturally reading, 
 processing and writing 20+GB would take some time.  In order to run 
 operations that need to see the changes, e.g. git log -p, a real 
 content-level merge, etc., you would also need sufficient memory because we 
 do things in-core.


RE: EXT :Re: GIT and large files

2014-05-20 Thread Stewart, Louis (IS)
From your response, then, there is a method to obtain only the Project,
Directory and Files (which could hold 80 GBs of data) and not the rest of the
Repository that contains the full overall Projects?

-Original Message-
From: Junio C Hamano [mailto:gits...@pobox.com] 
Sent: Tuesday, May 20, 2014 2:15 PM
To: Stewart, Louis (IS)
Cc: git@vger.kernel.org
Subject: Re: EXT :Re: GIT and large files

Stewart, Louis (IS) louis.stew...@ngc.com writes:

 Thanks for the reply.  I just read the intro to GIT and I am concerned 
 about the part that it will copy the whole repository to the 
 developers' work area.  They really just need the one directory and 
 files under that one directory. The history has TBs of data.

Then you will spend time reading, processing and writing TBs of data when you 
clone, unless your developers do something to limit the history they fetch, 
e.g. by shallowly cloning.


 Lou

 -Original Message-
 From: Junio C Hamano [mailto:gits...@pobox.com]
 Sent: Tuesday, May 20, 2014 1:18 PM
 To: Stewart, Louis (IS)
 Cc: git@vger.kernel.org
 Subject: EXT :Re: GIT and large files

 Stewart, Louis (IS) louis.stew...@ngc.com writes:

 Can GIT handle versioning of large 20+ GB files in a directory?

 I think you can git add such files, push/fetch histories that contain such 
 files over the wire, and git checkout such files, but naturally reading, 
 processing and writing 20+GB would take some time.  In order to run 
 operations that need to see the changes, e.g. git log -p, a real 
 content-level merge, etc., you would also need sufficient memory because we 
 do things in-core.


Re: EXT :Re: GIT and large files

2014-05-20 Thread Thomas Braun
On Tuesday, 2014-05-20 at 17:24, Stewart, Louis (IS) wrote:
 Thanks for the reply.  I just read the intro to GIT and I am concerned
 about the part that it will copy the whole repository to the developers'
 work area.  They really just need the one directory and files under
 that one directory. The history has TBs of data.
 
 Lou
 
 -Original Message-
 From: Junio C Hamano [mailto:gits...@pobox.com] 
 Sent: Tuesday, May 20, 2014 1:18 PM
 To: Stewart, Louis (IS)
 Cc: git@vger.kernel.org
 Subject: EXT :Re: GIT and large files
 
 Stewart, Louis (IS) louis.stew...@ngc.com writes:
 
  Can GIT handle versioning of large 20+ GB files in a directory?
 
 I think you can git add such files, push/fetch histories that
 contain such files over the wire, and git checkout such files, but
 naturally reading, processing and writing 20+GB would take some time. 
 In order to run operations that need to see the changes, e.g. git log
 -p, a real content-level merge, etc., you would also need sufficient
 memory because we do things in-core.

Preventing a clone from fetching the whole history can be done with the
--depth option of git clone.

The question is: what do you want to do with these 20G files?
Just store them in the repo and *very* occasionally change them?
For that you need a 64-bit build of git with enough RAM; 32G
does the trick here. Everything below was done with git 1.9.1.

Doing some tests on my machine with a normal hard disk gives:
$time git add file.dat; time git commit -m "add file"; time git status

real    16m17.913s
user    13m3.965s
sys     0m22.461s
[master 15fa953] add file
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 file.dat

real    15m36.666s
user    13m26.962s
sys     0m16.185s
# On branch master
nothing to commit, working directory clean

real    11m58.936s
user    11m50.300s
sys     0m5.468s

$ls -lh
-rw-r--r-- 1 thomas thomas 20G May 20 19:01 file.dat

So this works but isn't fast.

Playing some tricks with --assume-unchanged helps here:
$git update-index --assume-unchanged file.dat
$time git status
# On branch master
nothing to commit, working directory clean

real0m0.003s
user0m0.000s
sys 0m0.000s

This trick is only safe if you *know* that file.dat does not change.
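
If the file does change later, the bit can be cleared again before staging it
(same file name as above):
$git update-index --no-assume-unchanged file.dat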

And btw I also set
$cat .gitattributes
*.dat -delta
as delta compression should be skipped in any case.

Pushing and pulling these files to and from a server needs some tweaking
on the server side, otherwise the occasional git gc might kill the box.
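
The kind of server-side tweaking meant here is mostly about keeping repack/gc
memory bounded; a sketch with guessed values (the config keys are standard git
options, the numbers depend on the box):
$git config pack.windowMemory 256m   # cap memory per delta search window
$git config pack.packSizeLimit 2g    # split packs instead of building one huge one
$git config pack.threads 1           # each thread gets its own window memory
$git config gc.auto 0                # disable auto-gc; repack manually at quiet times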
 
Btw, I happily have files of 1.5GB in my git repositories and also
change them, and I also work with Git for Windows. So in this region of
file sizes things work quite well.



Re: EXT :Re: GIT and large files

2014-05-20 Thread Konstantin Khomoutov
On Tue, 20 May 2014 18:18:08
Stewart, Louis (IS) louis.stew...@ngc.com wrote:

 From your response, then, there is a method to obtain only the Project,
 Directory and Files (which could hold 80 GBs of data) and not the
 rest of the Repository that contains the full overall Projects?

Please google the phrase "Git shallow cloning".

I would also recommend reading up on git-annex [1].
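
For a feel of what git-annex looks like in practice, a minimal session might
be (repository and file names are made up):
$git init bigrepo && cd bigrepo
$git annex init "workstation"
$git annex add disk-image.ova   # content goes under .git/annex, a symlink is staged
$git commit -m "add disk image"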

You might also consider using Subversion, as it seems you do not need
most of the benefits Git has over it and want certain benefits Subversion has
over Git:
* You don't need a distributed VCS (as you don't want each developer to
  have a full clone).
* You only need a single slice of the repository history at any given
  revision on a developer's machine, and this is *almost* what
  Subversion does: it will keep the so-called base (or pristine)
  versions of files comprising the revision you will check out, plus
  the checked out files themselves.  So, twice the space of the files
  comprising a revision.
* Subversion allows you to check out only a single folder out of the
  entire revision.
* IIRC, Subversion supports locks, whereby a developer can tell the
  server they're editing a file, and this will prevent other devs from
  locking the same file.  This might be used to serialize edits of huge
  and/or unmergeable files (see the sketch after this list).  Git can't
  do that (without non-standard tools deployed on the side or a
  centralized meeting point repository).
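
A rough sketch of that Subversion workflow (server URL, folder and file names
are invented):
$svn checkout http://svnserver/repo/trunk/bigassets   # check out just this folder
$cd bigassets
$svn lock image.ova -m "editing the disk image"       # other devs cannot lock it now
# ... edit image.ova ...
$svn commit -m "update disk image"                    # committing releases the lock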

My point is that while Git is fantastic for managing source code
projects and projects of similar types with regard to their contents,
it seems your requirements are mainly not suitable for the use case
Git is best tailored for.  Your apparent lack of familiarity with Git
might well bite you later should you pick it right now.  At least
please consider reading a book or some other introduction-level
material on Git to get a feeling for the typical workflows used with it.


1. https://git-annex.branchable.com/