Re: [notmuch] Mail in git
also sprach Stewart Smith stew...@flamingspork.com [2010.02.17.1107 +0100]:
> On Wed, 17 Feb 2010 11:21:51 +1100, Stewart Smith stew...@flamingspork.com wrote:
> > Using fast-import is interesting. Does it update the working tree? The big thing I wanted to avoid was creating a working tree (another million inodes being created is not ever what I need)
> >
> > Also interesting is the mention of creating packs on the fly... this could save the time in first writing the object and then packing it (as my script does).
> >
> > I'm going to play with this
>
> and I did.

Has anyone worked on this since?

-- 
martin | http://madduck.net/ | http://two.sentenc.es/

one should never allow one's mind and one's foot to wander at the same time.
-- edward perkins (yes, the librarian)

spamtraps: madduck.bo...@madduck.net
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch
Re: [notmuch] Mail in git
On Sat, 21 May 2011 09:05:54 +0200, martin f krafft madd...@madduck.net wrote:
> Has anyone worked on this since?

No, I haven't had the cycles... and an SSD helped a bit to delay the urgency.

-- 
Stewart Smith
Re: [notmuch] Mail in git
On Wed, 17 Feb 2010 11:21:51 +1100, Stewart Smith stew...@flamingspork.com wrote:
> Using fast-import is interesting. Does it update the working tree? The big thing I wanted to avoid was creating a working tree (another million inodes being created is not ever what I need)
>
> Also interesting is the mention of creating packs on the fly... this could save the time in first writing the object and then packing it (as my script does).
>
> I'm going to play with this

and I did.

Good news... on my mailstore (which, as I've previously mentioned, takes about 10 minutes to run 'du' over, about the same time as 'notmuch new' takes), using the (attached) evenless.pl to create a single commit with everything in it:

$ du -sh .git
3.4G    .git

Down from a whopping 14-15GB!!!

My previous effort (git hash-object -w for each message, create a pack every 1000 messages, rinse, repeat) took all night and got to 3.7GB.

This took only 108 minutes. In both cases, I was creating the repository on another spindle (a USB 2.0 disk attached to my laptop).

git-ls-tree and git-cat-file both work for listing and getting objects.

The next thing to think about is adding objects as they come in... creating a new commit with just an added file should be pretty simple and easy... but this means we get to keep a revision history of the mailstore, which is *possibly* not ideal in terms of storage efficiency (I'll do a trial with mine of doing one message at a time and seeing what the end size is).

However... a commit per added mail (or mails) does give us the advantage of a really well documented and tested backup system :)

Deleting could be hard... if we actually want the objects to go away in a permanent way (not just no longer be referenced).
for the stats nerds:

$ time perl /home/stewart/evenless/evenless.pl /home/stewart/Maildir/INBOX
git-fast-import statistics:
---------------------------------------------------------------------
Alloc'd objects:     785000
Total objects:       781813 (     79023 duplicates                  )
      blobs  :       781363 (     79023 duplicates     708627 deltas)
      trees  :          449 (         0 duplicates          0 deltas)
      commits:            1 (         0 duplicates          0 deltas)
      tags   :            0 (         0 duplicates          0 deltas)
Total branches:           1 (         1 loads     )
      marks:        1048576 (    860386 unique    )
      atoms:         860557
Memory total:        182780 KiB
       pools:        152116 KiB
     objects:         30664 KiB
---------------------------------------------------------------------
pack_report: getpagesize()            =       4096
pack_report: core.packedGitWindowSize = 1073741824
pack_report: core.packedGitLimit      = 8589934592
pack_report: pack_used_ctr            =          1
pack_report: pack_mmap_calls          =          1
pack_report: pack_open_windows        =          1 /          1
pack_report: pack_mapped              =  388496447 /  388496447
---------------------------------------------------------------------

real    107m43.130s
user    45m25.430s
sys     2m49.440s

#!/usr/bin/perl -w
# evenless.pl: import a Maildir tree into git as one commit through
# git fast-import, without ever creating a working tree.

use strict;

my $FILES = '';             # "M <mode> :<mark> <path>" lines for the commit
my $mark = 1;               # next fast-import mark to hand out
my $stripdir = $ARGV[0];    # prefix stripped from the paths stored in git

sub fastimport_blobs ($);

# Recursively emit one blob per file found under $dirname.
sub fastimport_blobs ($)
{
    my $dirname = shift @_;
    opendir(my $dirhandle, $dirname) or die "opendir $dirname: $!";
    foreach (readdir $dirhandle)
    {
        next if /^\.\.?$/;
        # skip mail client index and cache files
        next if /\.cmeta$/;
        next if /\.ibex\.index$/;
        next if /\.ibex\.index\.data$/;
        next if /\.ev-summary$/;
        next if /\.ev-summary-meta$/;
        next if /\.notmuch$/;

        if (-d "$dirname/$_")
        {
            print STDERR "Recursing into $_/ ";
            fastimport_blobs("$dirname/$_");
            print STDERR "\n";
        }
        else
        {
            # Slurp the whole file first so that the byte count in the
            # "data" header always matches what is actually written.
            open(my $filein, '<', "$dirname/$_") or die "open $dirname/$_: $!";
            binmode $filein;
            my $content = do { local $/; <$filein> };
            $content = '' unless defined $content;
            close $filein;

            print FASTIMPORT "blob\n";
            print FASTIMPORT "mark :$mark\n";
            print FASTIMPORT "data " . length($content) . "\n";
            print FASTIMPORT $content;

            # path inside the repository, relative to the import root
            my $storedir = "$dirname/$_";
            $storedir =~ s/^\Q$stripdir\E//;
            $storedir =~ s/^\///;
            $FILES .= "M 0644 :$mark $storedir\n";
            $mark++;
        }
    }
    closedir $dirhandle;
}

open(FASTIMPORT, '|-', 'git fast-import --date-format=rfc2822')
    or die "cannot run git fast-import: $!";
binmode FASTIMPORT;

fastimport_blobs($ARGV[0]);

# One commit that references every blob emitted above.
print FASTIMPORT "commit refs/heads/master\n";
# (committer address as elided by the list archive)
print FASTIMPORT "committer EvenLess <evenle...\@evenless> " . `date -R`;
print FASTIMPORT "data 11\n";
print FASTIMPORT "mail commit\n";
print FASTIMPORT $FILES;
print FASTIMPORT "\n";

close FASTIMPORT;

-- 
Stewart Smith
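For readers who want to see the shape of the stream that evenless.pl feeds to git fast-import, here is a minimal Python sketch. It is not Stewart's script: the function name `fast_import_stream`, the placeholder committer address and the fixed example timestamp are all assumptions made for illustration, and paths that need quoting in fast-import (for example ones starting with a double quote) are not handled.

```python
import io


def fast_import_stream(messages,
                       committer=b"EvenLess <evenless@example.invalid>",  # placeholder identity
                       when=b"1266364800 +0000",  # example timestamp in fast-import's raw format
                       ref=b"refs/heads/master",
                       log=b"mail commit\n"):
    """Build a git fast-import stream: one blob per message, then one commit.

    messages is an iterable of (path, content) pairs, both bytes.
    """
    out = io.BytesIO()
    file_lines = []
    for mark, (path, content) in enumerate(messages, start=1):
        # Each blob gets a mark so the commit can refer to it later.
        out.write(b"blob\nmark :%d\ndata %d\n" % (mark, len(content)))
        out.write(content)
        out.write(b"\n")
        file_lines.append(b"M 100644 :%d %s\n" % (mark, path))
    # A single commit that places every marked blob at its path.
    out.write(b"commit %s\n" % ref)
    out.write(b"committer %s %s\n" % (committer, when))
    out.write(b"data %d\n" % len(log))
    out.write(log)
    out.writelines(file_lines)
    out.write(b"\n")
    return out.getvalue()


if __name__ == "__main__":
    import sys
    demo = [(b"cur/1", b"Subject: hi\n\nhello\n")]
    sys.stdout.buffer.write(fast_import_stream(demo))
```

Piping the output into `git fast-import` inside an empty repository should create the commit without touching a working tree, which is the property discussed above.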
Re: [notmuch] Mail in git
On Wed, 17 Feb 2010 10:03:36 -0500, Ben Gamari bgam...@gmail.com wrote:
> > notmuch would then only search and provide the hash ID(s); tags would be a function of storage.
> >
> > Is it possible to find out all trees that reference a given object with Git in constant or sub-linear time?
>
> I don't believe so. I think this is one of the reasons why git gc is so expensive.

But if we have notmuch as a cache of the tags, then don't we already know the tree objects that need updating? Yes, we would probably need some consistency checks for when things don't work as planned, but in the common case we ought to always know.

Perhaps I'm misunderstanding these tree objects, and you're suggesting that we don't even tell notmuch about them.

-Mark

Just poking my nose where it don't belong, since 1984.
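Mark's point, that a tag cache already tells us which trees reference a message, can be illustrated with a small reverse index. This is only a sketch under assumptions: the class `TagCache` and its methods are hypothetical names, not part of notmuch or git, and each tag is assumed to map to exactly one tree.

```python
from collections import defaultdict


class TagCache:
    """In-memory stand-in for a tag cache kept outside git (hypothetical)."""

    def __init__(self):
        self._tags_of = defaultdict(set)   # blob id -> names of tag trees
        self._blobs_of = defaultdict(set)  # tag tree name -> blob ids

    def link(self, blob_id, tag):
        self._tags_of[blob_id].add(tag)
        self._blobs_of[tag].add(blob_id)

    def unlink(self, blob_id, tag):
        self._tags_of[blob_id].discard(tag)
        self._blobs_of[tag].discard(blob_id)

    def trees_for(self, blob_id):
        # One dictionary lookup, independent of the number of messages
        # stored, instead of scanning every tree in the repository.
        return sorted(self._tags_of.get(blob_id, ()))


cache = TagCache()
cache.link("abc123", "inbox")
cache.link("abc123", "unread")
print(cache.trees_for("abc123"))
```

The consistency checks Mark mentions would amount to comparing this cache against the trees actually stored in git; that part is not shown.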
Re: [notmuch] Mail in git
also sprach Ben Gamari bgam...@gmail.com [2010.02.18.0834 +1300]:
> Excerpts from Mark Anderson's message of Wed Feb 17 14:23:48 -0500 2010:
> > But if we have notmuch as a cache of the tags, then don't we already know the tree objects that need updating? Yes, we would probably need some consistency checks for when things don't work as planned, but in the common case we ought to always know.
>
> Cached or not, rewriting would still be an incredibly (e.g. prohibitively or close to it) expensive operation for a large mailstore.

Why? Well, it would involve creating n objects and unlinking n objects for n tags, but it would be constant in the number of messages, no?

> > Perhaps I'm misunderstanding these tree objects, and you're suggesting that we don't even tell notmuch about them.
>
> I think it would be unwise to teach notmuch anything about the underlying store. That would be leaking way too many implementation details into

I agree. Also, it would introduce redundancy.

-- 
martin | http://madduck.net/ | http://two.sentenc.es/

twenty-four hour room-service must be one of the premiere achievements of modern civilization.
-- special agent dale cooper

spamtraps: madduck.bo...@madduck.net
Re: [notmuch] Mail in git
On Wed, 17 Feb 2010 14:21:01 +1300, martin f krafft madd...@madduck.net wrote:
> What I am wondering is if (explicit) tags couldn't be represented as tree-objects with this.
>
>   evenless-link    link a message object with a tree object
>   evenless-unlink  unlink a message object from a tree object [replaces evenless-delete]

I think it could get expensive for tags with lots of messages. With my fast-import script, doing the commit (that referenced... umm... 800,000+ objects) took a *very* long time.

As far as I understand it, the tree object is stored in full and space is only reclaimed during repack (due to delta compression).

So if you, say, had the entire history of a high-volume list such as linux-kernel, adding messages could get rather expensive if you auto-tagged (or auto-tagged messages with patches or whatever).

> messages would then be deleted whenever using git-gc. No idea how this would sync if we don't keep ancestry. Otoh, it would probably not be very expensive to do just that.

If we keep ancestry though, we are reusing existing working code for backup (git-pull :)

Keep in mind that with my tests, the Maildir in git is about a quarter to a fifth of the size of it in Maildir... so a bit of extra usage per message isn't as dramatic as it may sound.

> Is it possible to find out all trees that reference a given object with Git in constant or sub-linear time?

I don't think so, but I'm not sure.

-- 
Stewart Smith
Re: [notmuch] Mail in git
Excerpts from martin f krafft's message of Wed Feb 17 18:52:11 -0500 2010:
> also sprach Ben Gamari bgam...@gmail.com [2010.02.18.0834 +1300]:
> > Excerpts from Mark Anderson's message of Wed Feb 17 14:23:48 -0500 2010:
> > > But if we have notmuch as a cache of the tags, then don't we already know the tree objects that need updating? Yes, we would probably need some consistency checks for when things don't work as planned, but in the common case we ought to always know.
> >
> > Cached or not, rewriting would still be an incredibly (e.g. prohibitively or close to it) expensive operation for a large mailstore.
>
> Why? Well, it would involve creating n objects and unlinking n objects for n tags, but it would be constant in the number of messages, no?

Yes, it would be linear in the number of tags. I suppose if messages weren't stored in the top-level tree nodes, then it would still be linear, although with a slope equal to the reciprocal of the fan-out. This has the potential to be very reasonable performance-wise.

- Ben
Re: [notmuch] Mail in git
also sprach Ben Gamari bgam...@gmail.com [2010.02.18.1401 +1300]:
> > If we keep ancestry though, we are reusing existing working code for backup (git-pull :)
>
> This is one of the reasons I feel it's important we keep it. And as is stated below, the storage overhead is minimal.

Absolutely; Stewart mentioned at LCA to forego the porcelain and harness the power of the plumbing, and I knew back then that this would be among the first things of which to convince him once he had the basic idea out. ;)

-- 
martin | http://madduck.net/ | http://two.sentenc.es/

DISCLAIMER: this entire message is privileged communication, intended for the sole use of its recipients only. If you read it even though you know you aren't supposed to, you're a poopy-head.

spamtraps: madduck.bo...@madduck.net
Re: [notmuch] Mail in git
Excerpts from martin f krafft's message of Wed Feb 17 20:58:47 -0500 2010:
> also sprach Ben Gamari bgam...@gmail.com [2010.02.18.1339 +1300]:
> > Yes, it would be linear in the number of tags. I suppose if messages weren't stored in the top-level tree nodes, then it would still be linear, although with a slope equal to the reciprocal of the fan-out. This has the potential to be very reasonable performance-wise.
>
> Messages are never stored in tree nodes; all these do is store references to objects (blobs) holding messages. I bet you know this, but I just wanted to make it explicit.

Yep, I'm aware.

> So retagging is really just writing a new tree with a modified list of references.

Certainly; however, if you have a large tag (100,000 messages), this list of references could easily be tens of megabytes. For this reason, it seems like the added overhead of nesting trees would be well worth it.

- Ben
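The cost argument can be made concrete with some back-of-the-envelope arithmetic. The sketch below relies on one fact about git (each tree entry is stored as mode, space, name, NUL byte, then a 20-byte raw SHA-1) and on several assumptions that are not from the thread: entry names of 40 characters in the flat case, a 256-way fan-out with 2-character subtree names and 38-character leaf names, and uncompressed sizes. The function names are illustrative.

```python
SHA1_RAW = 20  # raw SHA-1 bytes stored in every tree entry


def entry_size(name_len, mode=b"100644"):
    # Tree entry layout: "<mode> <name>\0" followed by the raw SHA-1.
    return len(mode) + 1 + name_len + 1 + SHA1_RAW


def flat_tree_bytes(n_messages, name_len=40):
    """Uncompressed size of one flat tag tree listing every message."""
    return n_messages * entry_size(name_len)


def fanout_rewrite_bytes(n_messages, name_len=38, fanout=256):
    """Bytes rewritten when one message is added or removed with fan-out:
    a single subtree plus the top-level tree that points at the subtrees."""
    per_subtree = -(-n_messages // fanout)  # ceiling division
    subtree = per_subtree * entry_size(name_len)
    top = fanout * entry_size(2, mode=b"40000")  # subtrees use mode 40000
    return subtree + top


print(flat_tree_bytes(100_000))       # bytes rewritten per retag, flat tree
print(fanout_rewrite_bytes(100_000))  # bytes rewritten per retag, 256-way fan-out
```

Under these assumptions the flat tree comes to several megabytes per rewrite, while the nested layout rewrites a few tens of kilobytes; longer Maildir file names would push the flat figure higher.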
Re: [notmuch] Mail in git
Hi Stewart,

On Mon, 15 Feb 2010 11:29:14 +1100, Stewart Smith stew...@flamingspork.com wrote:
> Which goes from a 15GB Maildir to a 3.7GB git repo.

That's quite an interesting ratio. I tried a plain git add and git gc on my mail store, and the resulting repo was approximately 50% of the mail store's size. Do you think this difference might be caused by the way you created the packs?

> The algorithm of evenless.pl is basically:
> 1 get next directory entry
> 2 if is directory, recurse into it
> 3 write item to git (git hash-object -w)
> 4 add item to tree object
> 5 if number of items written = 1000
> 5.1 make pack of last 1000 items
> 6 goto 1

So it seems that you have all your mails in a single tree. How long does it take to calculate the difference between two trees (git diff-tree --name-status)? notmuch new will need this operation to determine which files/blobs to index.

I suppose it would be better if mail blobs were stored in subtrees. If a subtree has not changed, git doesn't need to descend into it because it has the same sha1. I think that storing mails in a structure similar to .git/objects (i.e. 256 subdirectories named after the first sha1 byte, with file names made of the remaining 38 hex digits) would be reasonable.

> Next step? Make notmuch be able to read mail out of it and add it to an index (oh, and some kind of verification and error checking about creating the git repo).

Besides using git to compact the size of the mail store, another feature that comes with git for free is synchronization. For this to work, you only need to store tags in the repo. What might work is to store tags in files named <mail-name>.tags. The tags would be stored in the files alphabetically, one tag per line. I guess that this makes it easy to merge tags during synchronization, even without writing a custom git merge driver.

Another point that must be solved if we would like to use git with notmuch is the license problem. As Carl pointed out in another thread, Git is licensed under GPLv2 only whereas notmuch is under GPLv3, and these licences are incompatible. So I think we will need some kind of hooks in notmuch from which external programs (git) will be called.

Cheers,
Michal
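A minimal sketch of the two layouts suggested above: fan-out paths modelled on .git/objects, and a sorted one-tag-per-line file whose merge reduces to set arithmetic. All function names are illustrative assumptions, and the three-way merge is a hand-written stand-in showing why the format merges cleanly, not what git's own text merge does.

```python
def fanout_path(blob_id):
    """Map a 40-digit hex SHA-1 to 'xx/<remaining 38 digits>'."""
    return blob_id[:2] + "/" + blob_id[2:]


def format_tags(tags):
    """Serialize tags as sorted, de-duplicated lines, one tag per line."""
    return "".join(tag + "\n" for tag in sorted(set(tags)))


def parse_tags(text):
    return {line for line in text.splitlines() if line}


def merge_tags(base, ours, theirs):
    """Three-way merge of tag files: apply both sides' additions and removals."""
    b, o, t = parse_tags(base), parse_tags(ours), parse_tags(theirs)
    added = (o - b) | (t - b)
    removed = (b - o) | (b - t)
    return format_tags((b | added) - removed)


# One side adds "list", the other removes "unread".
print(merge_tags("inbox\nunread\n", "inbox\nlist\nunread\n", "inbox\n"), end="")
```

Because the files are sorted and contain no duplicates, two replicas that apply the same changes end up with byte-identical tag files, so git sees no difference between them.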
Re: [notmuch] Mail in git
also sprach Stewart Smith stew...@flamingspork.com [2010.02.15.1329 +1300]:
> What about adding more mail to the archive?
>
> So the way I think is that you use a Maildir for day to day mail (e.g. delivery) and every so often you run some magic command that takes old mail out of the Maildir and stores it in the git repo.

Either that, or the other idea we had (which I prefer), which would basically be:

  evenless-submit  add a new mail (and return a hash ID) and invoke a hook, e.g. to let notmuch know
  evenless-cat     print the full mail with headers to stdout, given its ID
  evenless-delete  unlink a mail identified by hash ID and invoke a hook, e.g. to let notmuch know

If we expose the submit and delete functionality at the notmuch level, then we don't need the hooks, for then evenless would be plumbing.

Anything to avoid a cronjob would be good, I think. Then we need a notmuch backend for mutt etc. For those who still want to use a regular Maildir, let them use the worktree.

What I am wondering is if (explicit) tags couldn't be represented as tree-objects with this.

  evenless-link    link a message object with a tree object
  evenless-unlink  unlink a message object from a tree object [replaces evenless-delete]

Messages would then be deleted whenever using git-gc. No idea how this would sync if we don't keep ancestry. Otoh, it would probably not be very expensive to do just that.

notmuch would then only search and provide the hash ID(s); tags would be a function of storage.

Is it possible to find out all trees that reference a given object with Git in constant or sub-linear time?

-- 
martin | http://madduck.net/ | http://two.sentenc.es/

the question of whether computers can think is like the question of whether submarines can swim.
-- edsger w. dijkstra

spamtraps: madduck.bo...@madduck.net
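The submit/cat/delete interface proposed above can be sketched as a tiny content-addressed store. This is an in-memory stand-in, not an implementation of evenless: the class `MailStore` and its hook parameters are hypothetical. The one git fact it relies on is that a blob's ID is the SHA-1 of the header "blob <length>\0" followed by the content.

```python
import hashlib


def git_blob_id(content):
    """ID git assigns to a blob: SHA-1 over 'blob <len>\\0' plus the content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


class MailStore:
    """In-memory stand-in for evenless-submit / -cat / -delete (hypothetical)."""

    def __init__(self, on_submit=None, on_delete=None):
        self._blobs = {}
        # Hooks let an indexer such as notmuch learn about changes.
        self._on_submit = on_submit or (lambda blob_id: None)
        self._on_delete = on_delete or (lambda blob_id: None)

    def submit(self, mail):
        blob_id = git_blob_id(mail)
        self._blobs[blob_id] = mail
        self._on_submit(blob_id)
        return blob_id

    def cat(self, blob_id):
        return self._blobs[blob_id]

    def delete(self, blob_id):
        del self._blobs[blob_id]
        self._on_delete(blob_id)


events = []
store = MailStore(on_submit=lambda i: events.append(("new", i)),
                  on_delete=lambda i: events.append(("gone", i)))
mail_id = store.submit(b"Subject: test\n\nbody\n")
assert store.cat(mail_id) == b"Subject: test\n\nbody\n"
store.delete(mail_id)
```

Because the ID is derived from the content, submitting the same mail twice yields the same ID, which matches the duplicate counts that fast-import reports earlier in the thread.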