Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files

2014-08-10 Thread Øyvind A . Holm
On 30 June 2014 14:56, Jakub Narębski jna...@gmail.com wrote:
 Linus Torvalds wrote:
  .. even there, there's another issue. With enough memory, the diff
  itself should be fairly reasonable to do, but we do not have any
  sane *format* for diffing those kinds of things.
 
  The regular textual diff is line-based, and is not amenable to
  comparing two long lines. You'll just get a diff that says the two
  really long lines are different.
 
  The binary diff option should work, but it is a horrible output
  format, and not very helpful. It contains all the relevant data
  (copy this chunk from here to here), but it's then shown in a
  binary encoding that isn't really all that useful if you want to say
  what are the differences between these two chromosomes.

 There is also --word-diff[=mode] word-based textual diff, and I
 think one can abuse --word-diff-regex=regex for character-based
 diff... or maybe not, as regex specifies word characters, not words
 or word separators.

Yes, I have this alias defined:

  dww = diff --word-diff --word-diff-regex=.

It creates nice diffs on a character level. Sometimes specifying
--patience to this helps.

-- Øyvind
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files

2014-06-30 Thread Jakub Narębski

Linus Torvalds wrote:

On Fri, Jun 27, 2014 at 10:48 AM, Junio C Hamano gits...@pobox.com wrote:


Even though the original question mentioned delta discovery, I
think what was being asked is not delta in the Git sense (which
your answer is about) but is can we diff two long sequences of text
(that happens to consist of only 4-letter alphabet but that is a
irrelevant detail) without holding both in-core in their entirety?,
which is a more relevant question/desire from the application point
of view.


.. even there, there's another issue. With enough memory, the diff
itself should be fairly reasonable to do, but we do not have any sane
*format* for diffing those kinds of things.

The regular textual diff is line-based, and is not amenable to
comparing two long lines. You'll just get a diff that says the two
really long lines are different.

The binary diff option should work, but it is a horrible output
format, and not very helpful. It contains all the relevant data (copy
this chunk from here to here), but it's then shown in a binary
encoding that isn't really all that useful if you want to say what
are the differences between these two chromosomes.


There is also --word-diff[=mode] word-based textual diff,
and I think one can abuse --word-diff-regex=regex for
character-based diff... or maybe not, as regex specifies
word characters, not words or word separators.

--
Jakub Narębski

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files

2014-06-28 Thread Jarrad Hope
Thank-you all for replying,

It's just as Jason suggests - Genbank, FASTA  EMBL are more or less
the defacto standards, I suspect FASTA will be phased out because (to
my knowledge) it does not support gene annotation, nevertheless, they
are all text based.

These formats usually insert linebreaks around 80 characters (a
cultural/human readability relic, whatever terminal output they had
the time)

I tried to find a Penguin genome sequence for you, The best I can find
is the complete penguin mitochrondrian dna, as you can see, fairly
small.
http://www.ncbi.nlm.nih.gov/nuccore/558603183?report=fasta
http://www.ncbi.nlm.nih.gov/nuccore/558603183?report=genbank

If you would like to checkout the source for a Human, please see
ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/
Don't ask me for a Makefile :) in near future you'll be able to print
sequences of this length, today we're limited to small sequences (such
as bacteria/virus) at ~30cents per basepair

Each chromosome packs quite well ~80MB packed, ~240MB unpacked
However these formats allow you to repesent multiple sequences in one file
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
- ~850MB packed

Sidenote, Humans aren't that particulary more complicated than Rice
(in terms of genome size)
http://www.plantgdb.org/XGDB/phplib/download.php?GDB=Os

Other animal sequences - http://www.ensembl.org/index.html

Git is already being used very successfully for SBML, Synthetic
Biology Markup Language, an XML dialect for cell modelling.

I would show an example git repo of some open source cancer treatments
(various oncolytic viruses) I've been working on, unfortunately it's
not finished yet, but you can imagine something the size of penguin
mitochrondrial dna with essentially just text being deleted (gene
deletions) as commits.

I hope that helps - With the advancement of Synthetic and Systems
Biology, I really see these sequences benefiting from git.


On Sat, Jun 28, 2014 at 3:13 AM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Fri, Jun 27, 2014 at 12:55 PM, Jason Pyeron jpye...@pdinc.us wrote:

 The issue will be, if we talk about changes other than same length 
 substitutions
 (e.g. Down's Syndrome where it has an insertion of code) would require one 
 code
 per line for the diffs to work nicely.

 Not my area of expertise, but depending on what you are interested in
 - like protein encoding etc, I really think you don't need to do
 things character-per-character. You might want to break at interesting
 sequences (TATA box, and/or known long repeating sequences).

 So you could basically turn the one long line representation into
 multiple lines, by just looking for particular known interesting (or
 known particularly *UN*interesting) patterns, and whenever you see the
 pattern you create a new line, describing the pattern (TATAAA or
 run of 128 U), and then continue on the next line.

 Then you diff those semantically enriched streams instead of the raw data.

 But it probably depends on what you are looking for and at. Sometimes
 you might be looking at individual base pairs. And sometimes maybe you
 want to look at the codons, and consider condons that transcribe to
 the same amino acid to be the same, and not show up as a difference.
 So I could well imagine that you might want to have multiple different
 ways to generate these diffs. No?

Linus
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files

2014-06-27 Thread Shawn Pearce
On Fri, Jun 27, 2014 at 1:45 AM, Jarrad Hope m...@jarradhope.com wrote:
 As a software developer I've used git for years and have found it the
 perfect solution for source control.

 Lately I have found myself using git in a unique use-case - modifying
 DNA/RNA sequences and storing them in git, which are essentially
 software/source code for cells/life. For Bacteria and Viruses the
 repo's are very small 10mb  compress nicely.

 However on the extreme end of the spectrum a human genome can run in
 at 50gb or say ~1gb per file/chromosome.

Interesting. Unfortunately not everything is used like source code. :)

Git does source code well. I don't know enough to judge if DNA/RNA
sequence storage is similar enough to source code to benefit from
things like `git log -p` showing deltas over time, or if some other
algorithm would be more effective.

 From my understanding the largest problem revolves around git's delta
 discovery method, holding 2 files in memory at once - is there a
 reason this could not be adapted to page/chunk the data in a sliding
 window fashion ?

During delta discovery Git holds like 11 files in memory at once. One
T is the target file that you are trying to delta compress. The other
10 are in a window and Git compares T to each one of them in turn,
selecting the file that produces the smallest delta instruction
sequence to recreate T.

Because T is compared to 10ish other files (the window size is
tuneable), Git needs a full copy of T in memory for the entire compare
step. For any single compare, T is scanned through only once. If you
were doing a single compare (window size of 1), T could be on disk
and paged through sequentially.

The files in the window need to be held entirely in memory, along with
a matching index. The actual delta compression algorithm is a
Rabin-Karp sliding window hash function. Copies can be made from any
part of the source file with no regard to ordering. This makes
paging/chunking the source file at both compression and decompression
time nearly impossible. Git jumps around the source file many times,
but it allows for efficient storage for movement of long sequences
within a file (e.g. move function foo() later in the file).

Maybe if you limited the window to 1 and limited the hash function to
avoid backing up in the source file so it could be paged, you can get
somewhere.


But you mentioned the files are O(1 GiB). Just buy more RAM? Modern
workstations have pretty good memory capacity.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files

2014-06-27 Thread Junio C Hamano
Shawn Pearce spea...@spearce.org writes:

 Git does source code well. I don't know enough to judge if DNA/RNA
 sequence storage is similar enough to source code to benefit from
 things like `git log -p` showing deltas over time, or if some other
 algorithm would be more effective.

 From my understanding the largest problem revolves around git's delta
 discovery method, holding 2 files in memory at once - is there a
 reason this could not be adapted to page/chunk the data in a sliding
 window fashion ?

 During delta discovery Git holds like 11 files in memory at once

Even though the original question mentioned delta discovery, I
think what was being asked is not delta in the Git sense (which
your answer is about) but is can we diff two long sequences of text
(that happens to consist of only 4-letter alphabet but that is a
irrelevant detail) without holding both in-core in their entirety?,
which is a more relevant question/desire from the application point
of view.

Is there a reason this could not be adapted?  No, there is no
particular reason why this could not.  I think that the only
reason we only do in-core diff is because adapting to page/chunk
hasn't been anybody's high priority list of itches to scratch.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files

2014-06-27 Thread Linus Torvalds
On Fri, Jun 27, 2014 at 10:48 AM, Junio C Hamano gits...@pobox.com wrote:

 Even though the original question mentioned delta discovery, I
 think what was being asked is not delta in the Git sense (which
 your answer is about) but is can we diff two long sequences of text
 (that happens to consist of only 4-letter alphabet but that is a
 irrelevant detail) without holding both in-core in their entirety?,
 which is a more relevant question/desire from the application point
 of view.

.. even there, there's another issue. With enough memory, the diff
itself should be fairly reasonable to do, but we do not have any sane
*format* for diffing those kinds of things.

The regular textual diff is line-based, and is not amenable to
comparing two long lines. You'll just get a diff that says the two
really long lines are different.

The binary diff option should work, but it is a horrible output
format, and not very helpful. It contains all the relevant data (copy
this chunk from here to here), but it's then shown in a binary
encoding that isn't really all that useful if you want to say what
are the differences between these two chromosomes.

I think it might be possible to just specify a special diff algorithm
(git already supports that, obviously), and just introduce a new use
binary diffs with a textual representation model.

But it also sounds like there might be some actual performance problem
with these 1GB file delta-calculations. Which I wouldn't be surprised
about at all.

Jarrad - is there any public data you could give as an example and for
people to play with?

Linus
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files

2014-06-27 Thread Linus Torvalds
On Fri, Jun 27, 2014 at 12:38 PM, Linus Torvalds
torva...@linux-foundation.org wrote:

 I think it might be possible to just specify a special diff algorithm
 (git already supports that, obviously), and just introduce a new use
 binary diffs with a textual representation model.

Another model would be to just insert newlines in the data, and use
the regular textual diff on that preprocessed format.

The problem of *where* to insert the newlines is somewhat interesting,
since the stupid approaches (chunk it up in 64-byte lines) don't
work with data insertion/deletion (all the lines will now be different
just because the data is offset), but there are algorithms that handle
that reasonably well, like breaking lines at certain well-defined
patterns (the patterns can then be defined either explicitly or
algorithmically - like calculating a hash/crc over the last rolling N
characters and breaking if the result  matches some modulo
calculation).

Linus
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Tackling Git Limitations with Singular Large Line-seperated Plaintext files

2014-06-27 Thread Jason Pyeron
 -Original Message-
 From: Linus Torvalds
 Sent: Friday, June 27, 2014 15:39
 
 On Fri, Jun 27, 2014 at 10:48 AM, Junio C Hamano 
 gits...@pobox.com wrote:
 
  Even though the original question mentioned delta discovery, I
  think what was being asked is not delta in the Git sense (which
  your answer is about) but is can we diff two long sequences of text
  (that happens to consist of only 4-letter alphabet but that is a
  irrelevant detail) without holding both in-core in their entirety?,
  which is a more relevant question/desire from the application point
  of view.
 
 .. even there, there's another issue. With enough memory, the diff
 itself should be fairly reasonable to do, but we do not have any sane
 *format* for diffing those kinds of things.
 
 The regular textual diff is line-based, and is not amenable to
 comparing two long lines. You'll just get a diff that says the two
 really long lines are different.
 
 The binary diff option should work, but it is a horrible output
 format, and not very helpful. It contains all the relevant data (copy
 this chunk from here to here), but it's then shown in a binary
 encoding that isn't really all that useful if you want to say what
 are the differences between these two chromosomes.
 
 I think it might be possible to just specify a special diff algorithm
 (git already supports that, obviously), and just introduce a new use
 binary diffs with a textual representation model.
 
 But it also sounds like there might be some actual performance problem
 with these 1GB file delta-calculations. Which I wouldn't be surprised
 about at all.
 
 Jarrad - is there any public data you could give as an example and for
 people to play with?

Until Jarrad replies see sample here:
http://www.genomatix.de/online_help/help/sequence_formats.html

The issue will be, if we talk about changes other than same length substitutions
(e.g. Down's Syndrome where it has an insertion of code) would require one code
per line for the diffs to work nicely.

-Jason

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
-   -
- Jason Pyeron  PD Inc. http://www.pdinc.us -
- Principal Consultant  10 West 24th Street #100-
- +1 (443) 269-1555 x333Baltimore, Maryland 21218   -
-   -
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
This message is copyright PD Inc, subject to license 20080407P00.

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files

2014-06-27 Thread Linus Torvalds
On Fri, Jun 27, 2014 at 12:55 PM, Jason Pyeron jpye...@pdinc.us wrote:

 The issue will be, if we talk about changes other than same length 
 substitutions
 (e.g. Down's Syndrome where it has an insertion of code) would require one 
 code
 per line for the diffs to work nicely.

Not my area of expertise, but depending on what you are interested in
- like protein encoding etc, I really think you don't need to do
things character-per-character. You might want to break at interesting
sequences (TATA box, and/or known long repeating sequences).

So you could basically turn the one long line representation into
multiple lines, by just looking for particular known interesting (or
known particularly *UN*interesting) patterns, and whenever you see the
pattern you create a new line, describing the pattern (TATAAA or
run of 128 U), and then continue on the next line.

Then you diff those semantically enriched streams instead of the raw data.

But it probably depends on what you are looking for and at. Sometimes
you might be looking at individual base pairs. And sometimes maybe you
want to look at the codons, and consider condons that transcribe to
the same amino acid to be the same, and not show up as a difference.
So I could well imagine that you might want to have multiple different
ways to generate these diffs. No?

   Linus
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html