Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files
On 30 June 2014 14:56, Jakub Narębski jna...@gmail.com wrote:
> Linus Torvalds wrote:
>> .. even there, there's another issue. With enough memory, the diff itself should be fairly reasonable to do, but we do not have any sane *format* for diffing those kinds of things.
>>
>> The regular textual diff is line-based, and is not amenable to comparing two long lines. You'll just get a diff that says the two really long lines are different.
>>
>> The binary diff option should work, but it is a horrible output format, and not very helpful. It contains all the relevant data (copy this chunk from here to here), but it's then shown in a binary encoding that isn't really all that useful if you want to say what are the differences between these two chromosomes.
>
> There is also --word-diff[=mode] word-based textual diff, and I think one can abuse --word-diff-regex=regex for character-based diff... or maybe not, as regex specifies word characters, not words or word separators.

Yes, I have this alias defined:

    dww = diff --word-diff --word-diff-regex=.

It creates nice diffs on a character level. Sometimes specifying --patience to this helps.

--
Øyvind
--
To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
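For readers who want to see what a character-level word diff amounts to, here is a minimal sketch in Python. It is not how git implements --word-diff; it uses the standard library's difflib as a stand-in and only borrows git's plain word-diff markers, [-removed-] and {+added+}. The function name char_word_diff is made up for this illustration.

```python
from difflib import SequenceMatcher


def char_word_diff(old: str, new: str) -> str:
    """Render a character-level diff using [-removed-] and {+added+} markers."""
    out = []
    # autojunk=False disables difflib's popularity heuristic, which would
    # otherwise ignore very frequent characters on long inputs. With a
    # 4-letter alphabet every character is "very frequent".
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append(old[i1:i2])
            continue
        if i1 != i2:  # characters present only in the old sequence
            out.append("[-" + old[i1:i2] + "-]")
        if j1 != j2:  # characters present only in the new sequence
            out.append("{+" + new[j1:j2] + "+}")
    return "".join(out)
```

Calling char_word_diff("GATTACA", "GATCACA") marks the single substituted base inline instead of reporting the whole line as changed. difflib is quadratic in the worst case, so this is only meant for short sequences.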
Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files
Linus Torvalds wrote:
> On Fri, Jun 27, 2014 at 10:48 AM, Junio C Hamano gits...@pobox.com wrote:
>> Even though the original question mentioned delta discovery, I think what was being asked is not delta in the Git sense (which your answer is about) but is "can we diff two long sequences of text (that happens to consist of only 4-letter alphabet but that is an irrelevant detail) without holding both in-core in their entirety?", which is a more relevant question/desire from the application point of view.
>
> .. even there, there's another issue. With enough memory, the diff itself should be fairly reasonable to do, but we do not have any sane *format* for diffing those kinds of things.
>
> The regular textual diff is line-based, and is not amenable to comparing two long lines. You'll just get a diff that says the two really long lines are different.
>
> The binary diff option should work, but it is a horrible output format, and not very helpful. It contains all the relevant data (copy this chunk from here to here), but it's then shown in a binary encoding that isn't really all that useful if you want to say what are the differences between these two chromosomes.

There is also --word-diff[=mode] word-based textual diff, and I think one can abuse --word-diff-regex=regex for character-based diff... or maybe not, as regex specifies word characters, not words or word separators.

--
Jakub Narębski
Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files
Thank you all for replying,

It's just as Jason suggests: GenBank, FASTA and EMBL are more or less the de facto standards. I suspect FASTA will be phased out because (to my knowledge) it does not support gene annotation; nevertheless, they are all text based.

These formats usually insert line breaks around 80 characters (a cultural/human-readability relic of whatever terminal output they had at the time).

I tried to find a penguin genome sequence for you. The best I can find is the complete penguin mitochondrial DNA; as you can see, it is fairly small.
http://www.ncbi.nlm.nih.gov/nuccore/558603183?report=fasta
http://www.ncbi.nlm.nih.gov/nuccore/558603183?report=genbank

If you would like to check out the source for a human, please see
ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/
Don't ask me for a Makefile :) In the near future you'll be able to print sequences of this length; today we're limited to small sequences (such as bacteria/viruses) at ~30 cents per base pair.

Each chromosome packs quite well: ~80MB packed, ~240MB unpacked.

However, these formats allow you to represent multiple sequences in one file:
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz - ~850MB packed

Sidenote: humans aren't particularly more complicated than rice (in terms of genome size)
http://www.plantgdb.org/XGDB/phplib/download.php?GDB=Os

Other animal sequences - http://www.ensembl.org/index.html

Git is already being used very successfully for SBML, the Systems Biology Markup Language, an XML dialect for cell modelling.

I would show an example git repo of some open source cancer treatments (various oncolytic viruses) I've been working on. Unfortunately it's not finished yet, but you can imagine something the size of penguin mitochondrial DNA with essentially just text being deleted (gene deletions) as commits.

I hope that helps. With the advancement of Synthetic and Systems Biology, I really see these sequences benefiting from git.
Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files
On Fri, Jun 27, 2014 at 1:45 AM, Jarrad Hope m...@jarradhope.com wrote:
> As a software developer I've used git for years and have found it the perfect solution for source control.
>
> Lately I have found myself using git in a unique use-case - modifying DNA/RNA sequences and storing them in git, which are essentially software/source code for cells/life. For Bacteria and Viruses the repo's are very small 10mb compress nicely. However on the extreme end of the spectrum a human genome can run in at 50gb or say ~1gb per file/chromosome.

Interesting. Unfortunately not everything is used like source code. :)

Git does source code well. I don't know enough to judge if DNA/RNA sequence storage is similar enough to source code to benefit from things like `git log -p` showing deltas over time, or if some other algorithm would be more effective.

> From my understanding the largest problem revolves around git's delta discovery method, holding 2 files in memory at once - is there a reason this could not be adapted to page/chunk the data in a sliding window fashion ?

During delta discovery Git holds like 11 files in memory at once. One T is the target file that you are trying to delta compress. The other 10 are in a window and Git compares T to each one of them in turn, selecting the file that produces the smallest delta instruction sequence to recreate T.

Because T is compared to 10ish other files (the window size is tuneable), Git needs a full copy of T in memory for the entire compare step. For any single compare, T is scanned through only once. If you were doing a single compare (window size of 1), T could be on disk and paged through sequentially.

The files in the window need to be held entirely in memory, along with a matching index. The actual delta compression algorithm is a Rabin-Karp sliding window hash function. Copies can be made from any part of the source file with no regard to ordering. This makes paging/chunking the source file at both compression and decompression time nearly impossible. Git jumps around the source file many times, but it allows for efficient storage for movement of long sequences within a file (e.g. move function foo() later in the file).

Maybe if you limited the window to 1 and limited the hash function to avoid backing up in the source file so it could be paged, you can get somewhere. But you mentioned the files are O(1 GiB). Just buy more RAM? Modern workstations have pretty good memory capacity.
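To make the copy-from-anywhere point above concrete, here is a heavily simplified sketch of a copy/insert delta encoder in Python. It is not git's delta code: it indexes fixed, aligned blocks of the source in a plain dict instead of using a Rabin-Karp rolling hash, and the names make_delta and apply_delta and the block size are made up for the illustration. What it does show is why the source must be randomly accessible: a copy instruction may point at any offset, in any order.

```python
def make_delta(source: bytes, target: bytes, block: int = 4):
    """Encode target as ("copy", offset, length) / ("insert", data) ops."""
    # Index each aligned block of the source by content (first offset wins).
    index = {}
    for off in range(0, len(source) - block + 1, block):
        index.setdefault(source[off:off + block], off)

    ops = []
    literal = bytearray()  # bytes with no match in the source so far
    pos = 0
    while pos < len(target):
        src = index.get(target[pos:pos + block])
        if src is None:
            literal.append(target[pos])
            pos += 1
            continue
        # Found a matching block: extend the match as far as it goes.
        length = block
        while (src + length < len(source) and pos + length < len(target)
               and source[src + length] == target[pos + length]):
            length += 1
        if literal:
            ops.append(("insert", bytes(literal)))
            literal.clear()
        ops.append(("copy", src, length))
        pos += length
    if literal:
        ops.append(("insert", bytes(literal)))
    return ops


def apply_delta(source: bytes, ops) -> bytes:
    """Rebuild the target; note the random access into source."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, off, length = op
            out += source[off:off + length]
        else:
            out += op[1]
    return bytes(out)
```

With source b"AAAACCCCGGGGTTTT" and a target that swaps the two halves, the delta is two copies whose offsets go backwards through the source, which is exactly the access pattern that defeats sequential paging.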
Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files
Shawn Pearce spea...@spearce.org writes:

> Git does source code well. I don't know enough to judge if DNA/RNA sequence storage is similar enough to source code to benefit from things like `git log -p` showing deltas over time, or if some other algorithm would be more effective.
>
>> From my understanding the largest problem revolves around git's delta discovery method, holding 2 files in memory at once - is there a reason this could not be adapted to page/chunk the data in a sliding window fashion ?
>
> During delta discovery Git holds like 11 files in memory at once

Even though the original question mentioned delta discovery, I think what was being asked is not delta in the Git sense (which your answer is about) but is "can we diff two long sequences of text (that happens to consist of only 4-letter alphabet but that is an irrelevant detail) without holding both in-core in their entirety?", which is a more relevant question/desire from the application point of view.

Is there a reason this could not be adapted? No, there is no particular reason why this could not. I think that the only reason we only do in-core diff is because adapting to page/chunk hasn't been on anybody's high-priority list of itches to scratch.
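As a rough illustration of comparing two long sequences without holding both in core, here is a Python sketch of the easy special case only: same-length substitutions, found by reading both streams chunk by chunk. It is not a general diff and it is not anything git does; insertions and deletions shift everything after them and need a real diff algorithm. The function name substitutions and the chunk size are assumptions for the example, and the streams are assumed to return full chunks until end of file (true for regular files opened in binary mode and for io.BytesIO).

```python
import io


def substitutions(old, new, chunk_size=1 << 20):
    """Yield (offset, old_char, new_char) for each differing position.

    Only one chunk of each stream is held in memory at a time.
    """
    offset = 0
    while True:
        a = old.read(chunk_size)
        b = new.read(chunk_size)
        if not a and not b:
            return
        if len(a) != len(b):
            # Different lengths mean an insertion or deletion somewhere,
            # which this positional comparison cannot describe.
            raise ValueError("streams differ in length")
        if a != b:
            for i, (x, y) in enumerate(zip(a, b)):
                if x != y:
                    yield offset + i, chr(x), chr(y)
        offset += len(a)
```

For example, list(substitutions(io.BytesIO(b"GATTACA"), io.BytesIO(b"GATCACA"), chunk_size=3)) reports one substitution at offset 3.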
Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files
On Fri, Jun 27, 2014 at 10:48 AM, Junio C Hamano gits...@pobox.com wrote:
> Even though the original question mentioned delta discovery, I think what was being asked is not delta in the Git sense (which your answer is about) but is "can we diff two long sequences of text (that happens to consist of only 4-letter alphabet but that is an irrelevant detail) without holding both in-core in their entirety?", which is a more relevant question/desire from the application point of view.

.. even there, there's another issue. With enough memory, the diff itself should be fairly reasonable to do, but we do not have any sane *format* for diffing those kinds of things.

The regular textual diff is line-based, and is not amenable to comparing two long lines. You'll just get a diff that says the two really long lines are different.

The binary diff option should work, but it is a horrible output format, and not very helpful. It contains all the relevant data (copy this chunk from here to here), but it's then shown in a binary encoding that isn't really all that useful if you want to say what are the differences between these two chromosomes.

I think it might be possible to just specify a special diff algorithm (git already supports that, obviously), and just introduce a new "use binary diffs with a textual representation" model.

But it also sounds like there might be some actual performance problem with these 1GB file delta-calculations. Which I wouldn't be surprised about at all.

Jarrad - is there any public data you could give as an example and for people to play with?

Linus
Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files
On Fri, Jun 27, 2014 at 12:38 PM, Linus Torvalds torva...@linux-foundation.org wrote:
> I think it might be possible to just specify a special diff algorithm (git already supports that, obviously), and just introduce a new "use binary diffs with a textual representation" model.

Another model would be to just insert newlines in the data, and use the regular textual diff on that preprocessed format.

The problem of *where* to insert the newlines is somewhat interesting, since the stupid approaches (chunk it up in 64-byte lines) don't work with data insertion/deletion (all the lines will now be different just because the data is offset), but there are algorithms that handle that reasonably well, like breaking lines at certain well-defined patterns (the patterns can then be defined either explicitly or algorithmically - like calculating a hash/crc over the last rolling N characters and breaking if the result matches some modulo calculation).

Linus
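The rolling-hash variant described above can be sketched in a few lines of Python. This is only an illustration of the idea, not code from git: the function name split_lines, the polynomial hash, and the default window and modulus values are all assumptions chosen for the example. A line break is placed after any position where the hash of the last `window` characters is 0 modulo `modulus`, so break points depend only on nearby content and not on absolute offsets.

```python
def split_lines(data: str, window: int = 8, modulus: int = 16) -> list:
    """Split data into lines at content-defined break points."""
    base, mod = 257, (1 << 61) - 1
    top = pow(base, window - 1, mod)  # weight of the character leaving the window
    lines, start, h = [], 0, 0
    for i, ch in enumerate(data):
        if i >= window:
            # Drop the oldest character so h covers exactly `window` chars.
            h = (h - ord(data[i - window]) * top) % mod
        h = (h * base + ord(ch)) % mod
        # Break once the window is full and the hash hits the chosen residue.
        if i + 1 >= window and h % modulus == 0:
            lines.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        lines.append(data[start:])  # trailing partial line
    return lines
```

Because each break decision only looks at the last `window` characters, inserting or deleting a base disturbs at most the handful of lines around the edit; everything before it, and everything after the next break point, comes out identical, so a regular line-based diff of the two outputs stays small. The average line length is roughly `modulus` characters, so a larger modulus gives longer lines.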
RE: Tackling Git Limitations with Singular Large Line-seperated Plaintext files
> -Original Message-
> From: Linus Torvalds
> Sent: Friday, June 27, 2014 15:39
>
> On Fri, Jun 27, 2014 at 10:48 AM, Junio C Hamano gits...@pobox.com wrote:
> > Even though the original question mentioned delta discovery, I think what was being asked is not delta in the Git sense (which your answer is about) but is "can we diff two long sequences of text (that happens to consist of only 4-letter alphabet but that is an irrelevant detail) without holding both in-core in their entirety?", which is a more relevant question/desire from the application point of view.
>
> .. even there, there's another issue. With enough memory, the diff itself should be fairly reasonable to do, but we do not have any sane *format* for diffing those kinds of things.
>
> The regular textual diff is line-based, and is not amenable to comparing two long lines. You'll just get a diff that says the two really long lines are different.
>
> The binary diff option should work, but it is a horrible output format, and not very helpful. It contains all the relevant data (copy this chunk from here to here), but it's then shown in a binary encoding that isn't really all that useful if you want to say what are the differences between these two chromosomes.
>
> I think it might be possible to just specify a special diff algorithm (git already supports that, obviously), and just introduce a new "use binary diffs with a textual representation" model.
>
> But it also sounds like there might be some actual performance problem with these 1GB file delta-calculations. Which I wouldn't be surprised about at all.
>
> Jarrad - is there any public data you could give as an example and for people to play with?

Until Jarrad replies see sample here: http://www.genomatix.de/online_help/help/sequence_formats.html

The issue will be that changes other than same-length substitutions (e.g. Down's Syndrome, where there is an insertion of code) would require one code per line for the diffs to work nicely.

-Jason

--
Jason Pyeron, Principal Consultant
PD Inc. http://www.pdinc.us
10 West 24th Street #100, Baltimore, Maryland 21218
+1 (443) 269-1555 x333
This message is copyright PD Inc, subject to license 20080407P00.
Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files
On Fri, Jun 27, 2014 at 12:55 PM, Jason Pyeron jpye...@pdinc.us wrote:
> The issue will be that changes other than same-length substitutions (e.g. Down's Syndrome, where there is an insertion of code) would require one code per line for the diffs to work nicely.

Not my area of expertise, but depending on what you are interested in - like protein encoding etc, I really think you don't need to do things character-per-character. You might want to break at interesting sequences (TATA box, and/or known long repeating sequences).

So you could basically turn the one long line representation into multiple lines, by just looking for particular known interesting (or known particularly *UN*interesting) patterns, and whenever you see the pattern you create a new line, describing the pattern ("TATAAA" or "run of 128 U"), and then continue on the next line.

Then you diff those semantically enriched streams instead of the raw data.

But it probably depends on what you are looking for and at. Sometimes you might be looking at individual base pairs. And sometimes maybe you want to look at the codons, and consider codons that transcribe to the same amino acid to be the same, and not show up as a difference. So I could well imagine that you might want to have multiple different ways to generate these diffs. No?

Linus
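One of the "multiple different ways" mentioned above, comparing at the codon level and treating codons for the same amino acid as equal, can be sketched as follows. This is an illustration under stated assumptions, not something proposed in the thread in code form: it uses the standard genetic code, DNA letters (T rather than U), a reading frame starting at position 0, and two sequences of equal length. The names translate and codon_changes are made up for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; '*' is stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate DNA to one-letter amino acids; unknown codons become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, usable, 3))


def codon_changes(old: str, new: str):
    """List (codon_index, old_codon, new_codon, kind) for differing codons."""
    if len(old) != len(new):
        raise ValueError("this sketch only handles equal-length sequences")
    old, new = old.upper(), new.upper()
    changes = []
    for i in range(0, len(old) - len(old) % 3, 3):
        a, b = old[i:i + 3], new[i:i + 3]
        if a == b:
            continue
        same = CODON_TABLE.get(a, "X") == CODON_TABLE.get(b, "X")
        changes.append((i // 3, a, b,
                        "synonymous" if same else "nonsynonymous"))
    return changes
```

A diff driver built on this could drop the "synonymous" entries, or simply write translate() output one amino acid per line and feed that to the regular textual diff.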