Tackling Git Limitations with Singular Large Line-separated Plaintext Files

2014-06-27 Thread Jarrad Hope
Hello,

As a software developer I've used git for years and have found it the
perfect solution for source control.

Lately I have found myself using git in a unique use case: modifying
DNA/RNA sequences and storing them in git. These are essentially
software/source code for cells/life. For bacteria and viruses the
repos are very small (<10 MB) and compress nicely.

However, on the extreme end of the spectrum, a human genome can run to
around 50 GB, or roughly 1 GB per file/chromosome.

Now, this is not the binary problem, and it is not the same as storing
media inside git. I have reviewed the solutions that exist for the
binary problem, such as git-annex, git-media and bup, but they don't
provide the feature set of git, and the data I'm storing is more like
plaintext source code with relatively small edits per commit.

I have googled and asked in #git, where the discussion mostly revolved
around these tools.

The only project that holds interest is a 2009 project, git-bigfiles.
However, it is a bit dated and the author is not interested in reviving
it, referring me instead to git-annex, unfortunately.

With that background:
I wanted to discuss the problems with git and how I can contribute to
the core project to best solve them.

From my understanding, the largest problem revolves around git's delta
discovery method, which holds two whole files in memory at once. Is
there a reason this could not be adapted to page/chunk the data in a
sliding-window fashion?
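To make the sliding-window question concrete, here is a toy Python sketch of content-defined chunking: it only ever holds a small buffer in memory, and small edits only perturb the chunks near the edit. This is not git's actual xdiff/delta code, and the hash, window, and mask values are purely illustrative:

```python
import hashlib
import io

def chunks(stream, window=48, mask=(1 << 13) - 1):
    """Yield content-defined chunks, reading the stream incrementally."""
    buf = bytearray()
    h = 0
    while True:
        b = stream.read(1)
        if not b:
            break
        buf += b
        # toy hash over the buffer so far; a real implementation would
        # use a proper rolling hash over just the last `window` bytes
        h = (h * 31 + b[0]) & 0xFFFFFFFF
        # cut a chunk when the hash hits the boundary pattern
        if len(buf) >= window and (h & mask) == mask:
            yield bytes(buf)
            buf.clear()
            h = 0
    if buf:
        yield bytes(buf)

def delta(old, new):
    """Describe `new` as copies of old chunks plus literal inserts."""
    old_index = {hashlib.sha1(c).digest(): i
                 for i, c in enumerate(chunks(old))}
    ops = []
    for c in chunks(new):
        key = hashlib.sha1(c).digest()
        if key in old_index:
            ops.append(("copy", old_index[key]))
        else:
            ops.append(("insert", c))
    return ops
```

Because chunk boundaries depend only on local content, an insertion in one chromosome-sized file would shift only nearby chunk boundaries rather than forcing both whole files into memory at once.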

Are there any other issues I should know about? Is anyone else working
on making git more capable of handling large source files that I could
collaborate with?

Thanks for your time,
Jarrad
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Tackling Git Limitations with Singular Large Line-separated Plaintext Files

2014-06-27 Thread Jarrad Hope
Thank you all for replying.

It's just as Jason suggests: GenBank, FASTA and EMBL are more or less
the de facto standards. I suspect FASTA will be phased out because (to
my knowledge) it does not support gene annotation; nevertheless, they
are all text based.

These formats usually insert line breaks around 80 characters (a
human-readability relic of whatever terminal width was common at the
time).
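For anyone unfamiliar with the format: a FASTA record is just a ">" header line followed by wrapped sequence lines, and that wrapping is part of what hurts git, because inserting a single base re-flows every subsequent line of the record. A minimal Python sketch (illustrative only, not a full FASTA parser; the toy record is invented):

```python
import textwrap

def read_fasta(text):
    """Parse a single-record FASTA string into (header, sequence)."""
    lines = text.strip().splitlines()
    header = lines[0].lstrip(">")
    seq = "".join(lines[1:])  # join the wrapped sequence lines
    return header, seq

def write_fasta(header, seq, width=80):
    """Re-wrap a sequence at `width` columns, FASTA-style."""
    return ">" + header + "\n" + "\n".join(textwrap.wrap(seq, width)) + "\n"

record = ">toy example\nACGTACGTAC\nGGTT\n"
header, seq = read_fasta(record)
# inserting one base near the start shifts every wrapped line after it,
# so a one-character edit shows up as a rewrite of the whole record
mutated = seq[:2] + "A" + seq[2:]
```

Storing sequences at a fixed wrap width means git's line-based diff sees a one-base insertion as a change to every following line, which is why breaking lines at semantic boundaries instead is attractive.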

I tried to find a penguin genome sequence for you; the best I can find
is the complete penguin mitochondrial DNA, which, as you can see, is
fairly small:
http://www.ncbi.nlm.nih.gov/nuccore/558603183?report=fasta
http://www.ncbi.nlm.nih.gov/nuccore/558603183?report=genbank

If you would like to check out the source for a human, please see
ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/
Don't ask me for a Makefile :) In the near future you'll be able to
print sequences of this length; today we're limited to small sequences
(such as bacteria/viruses) at ~30 cents per base pair.

Each chromosome packs quite well: ~80 MB packed, ~240 MB unpacked.
However, these formats allow you to represent multiple sequences in one file:
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
<- ~850 MB packed

Side note: humans aren't particularly more complicated than rice
(in terms of genome size):
http://www.plantgdb.org/XGDB/phplib/download.php?GDB=Os

Other animal sequences - http://www.ensembl.org/index.html

Git is already being used very successfully for SBML, the Systems
Biology Markup Language, an XML dialect for cell modelling.

I would show an example git repo of some open-source cancer treatments
(various oncolytic viruses) I've been working on, but unfortunately it's
not finished yet. You can imagine something the size of the penguin
mitochondrial DNA with essentially just text being deleted (gene
deletions) as commits.

I hope that helps. With the advancement of Synthetic and Systems
Biology, I really see these sequences benefiting from git.


On Sat, Jun 28, 2014 at 3:13 AM, Linus Torvalds
 wrote:
> On Fri, Jun 27, 2014 at 12:55 PM, Jason Pyeron  wrote:
>>
>> The issue will be that changes other than same-length substitutions
>> (e.g. Down's Syndrome, which involves an insertion of code) would
>> require one code per line for the diffs to work nicely.
>
> Not my area of expertise, but depending on what you are interested in
> - like protein encoding etc, I really think you don't need to do
> things character-per-character. You might want to break at interesting
> sequences (TATA box, and/or known long repeating sequences).
>
> So you could basically turn the "one long line" representation into
> multiple lines, by just looking for particular known interesting (or
> known particularly *UN*interesting) patterns, and whenever you see the
> pattern you create a new line, describing the pattern ("TATAAA" or
> "run of 128 U"), and then continue on the next line.
>
> Then you diff those "semantically enriched" streams instead of the raw data.
>
> But it probably depends on what you are looking for and at. Sometimes
> you might be looking at individual base pairs. And sometimes maybe you
> want to look at the codons, and consider codons that transcribe to
> the same amino acid to be the same, and not show up as a difference.
> So I could well imagine that you might want to have multiple different
> ways to generate these diffs. No?
>
>Linus
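Linus's line-breaking idea could be prototyped as a pre-diff filter (e.g. wired up through a diff textconv attribute in .gitattributes, so the stored file is untouched and only the diff sees the enriched form). A rough Python sketch, assuming only the TATAAA motif and the long-run rule from his mail; a real tool would use a curated motif list:

```python
import re

# Break before known motifs and collapse long single-base runs, so that
# diffs align on biologically meaningful boundaries instead of columns.
MOTIFS = ["TATAAA"]  # e.g. the TATA box; extend with other known patterns

def enrich(seq, run_len=8):
    """Turn one long sequence into motif-aligned lines."""
    out = []
    i = 0
    while i < len(seq):
        motif = next((p for p in MOTIFS if seq.startswith(p, i)), None)
        if motif:
            out.append(motif)  # a known motif gets its own line
            i += len(motif)
            continue
        run = re.match(r"(.)\1{%d,}" % (run_len - 1), seq[i:])
        if run:
            # describe the run instead of printing it, e.g. "run of 10 C"
            out.append("run of %d %s" % (len(run.group()), run.group(1)))
            i += len(run.group())
            continue
        # otherwise accumulate ordinary bases up to the next break point
        j = i + 1
        while j < len(seq):
            if any(seq.startswith(p, j) for p in MOTIFS):
                break
            if re.match(r"(.)\1{%d}" % (run_len - 1), seq[j:]):
                break
            j += 1
        out.append(seq[i:j])
        i = j
    return "\n".join(out)
```

The same filter stage could host the codon-level normalisation Linus mentions (mapping synonymous codons to one canonical form before diffing), which supports his point that several different diff generators may be wanted.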