Re: [notmuch] Mail in git

2011-05-21 Thread martin f krafft
also sprach Stewart Smith stew...@flamingspork.com [2010.02.17.1107 +0100]:
> On Wed, 17 Feb 2010 11:21:51 +1100, Stewart Smith stew...@flamingspork.com wrote:
> > Using fast-import is interesting. Does it update the working tree? The
> > big thing I wanted to avoid was creating a working tree (another million
> > inodes being created is not ever what I need)
> >
> > Also interesting is the mention of creating packs on the fly... this
> > could save the time in first writing the object and then packing it (as
> > my script does).
> >
> > I'm going to play with this
>
> and I did.

Has anyone worked on this since?

-- 
martin | http://madduck.net/ | http://two.sentenc.es/
 
one should never allow one's mind
 and one's foot to wander at the same time.
-- edward perkins (yes, the librarian)
 
spamtraps: madduck.bo...@madduck.net


digital_signature_gpg.asc
Description: Digital signature (see http://martin-krafft.net/gpg/sig-policy/999bbcc4/current)
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


Re: [notmuch] Mail in git

2011-05-21 Thread Stewart Smith
On Sat, 21 May 2011 09:05:54 +0200, martin f krafft madd...@madduck.net wrote:
> Has anyone worked on this since?

No, I haven't had the cycles... and an SSD helped reduce the urgency a bit.

-- 
Stewart Smith


Re: [notmuch] Mail in git

2010-02-17 Thread Stewart Smith
On Wed, 17 Feb 2010 11:21:51 +1100, Stewart Smith stew...@flamingspork.com wrote:
> Using fast-import is interesting. Does it update the working tree? The
> big thing I wanted to avoid was creating a working tree (another million
> inodes being created is not ever what I need)
>
> Also interesting is the mention of creating packs on the fly... this
> could save the time in first writing the object and then packing it (as
> my script does).
>
> I'm going to play with this

and I did.

good news... on my mailstore (which, as I've previously mentioned, takes
about 10 minutes to run 'du' over, about the same time as 'notmuch new'
takes):

using the (attached) evenless.pl to create a single commit with
everything in it:

$ du -sh .git
3.4G    .git

Down from a whopping 14-15GB!!!

My previous effort (git hash-object -w, create a pack every 1000 messages,
rinse, repeat) took all night and got to 3.7GB.

This took only 108 minutes.

In both cases, i was creating the repository on another spindle (USB2.0
disk attached to my laptop).

git-ls-tree and git-cat-file both work for listing and getting objects.

The next thing to think about is adding objects as they come
in... creating a new commit with just an added file should be pretty
simple and easy... but this means we get to keep a revision history of
the mailstore, which is *possibly* not ideal in terms of storage
efficiency (i'll do a trial with mine of doing one message at a time and
seeing what the end size is).
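
A minimal sketch of what one such incremental commit could look like as a
git fast-import stream. It is written in Python rather than the Perl of
evenless.pl, and the function name, committer identity and path are made
up for illustration; only the stream commands themselves (commit,
committer, data, from, M ... inline) come from git fast-import.

```python
def incremental_commit_stream(path, content, when,
                              branch="refs/heads/master",
                              committer="EvenLess <evenless@example.invalid>"):
    """Build a fast-import stream that adds one mail on top of `branch`.

    `when` is a raw git date such as "1266400000 +1100" (the default
    --date-format=raw). Every name here is illustrative, not from evenless.pl.
    """
    message = b"add one mail\n"
    out = []
    out.append(b"commit " + branch.encode() + b"\n")
    out.append(b"committer " + committer.encode() + b" " + when.encode() + b"\n")
    out.append(b"data %d\n" % len(message) + message)
    # Start from the current branch tip so the existing mail stays in the tree.
    out.append(b"from " + branch.encode() + b"^0\n")
    # "inline" means the blob data follows the filemodify command directly.
    out.append(b"M 644 inline " + path.encode() + b"\n")
    out.append(b"data %d\n" % len(content) + content + b"\n")
    out.append(b"\n")
    return b"".join(out)
```

Piping the returned bytes into `git fast-import` inside the repository
should add one commit per call. How much the repository grows per commit
is exactly the open question above.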

however... commit per added mail (or mails) does give us the advantage
of a really well documented and tested backup system :)

Deleting could be hard.. if we actually want the objects to go away in a
permanent way (not just no longer be referenced).

for the stats nerds:

$ time perl /home/stewart/evenless/evenless.pl /home/stewart/Maildir/INBOX

git-fast-import statistics:
---------------------------------------------------------------------
Alloc'd objects:     785000
Total objects:       781813 (     79023 duplicates                  )
      blobs  :       781363 (     79023 duplicates     708627 deltas)
      trees  :          449 (         0 duplicates          0 deltas)
      commits:            1 (         0 duplicates          0 deltas)
      tags   :            0 (         0 duplicates          0 deltas)
Total branches:           1 (         1 loads     )
      marks:        1048576 (    860386 unique    )
      atoms:         860557
Memory total:        182780 KiB
       pools:        152116 KiB
     objects:         30664 KiB
---------------------------------------------------------------------
pack_report: getpagesize()            =       4096
pack_report: core.packedGitWindowSize = 1073741824
pack_report: core.packedGitLimit      = 8589934592
pack_report: pack_used_ctr            =          1
pack_report: pack_mmap_calls          =          1
pack_report: pack_open_windows        =          1 /          1
pack_report: pack_mapped              =  388496447 /  388496447
---------------------------------------------------------------------


real    107m43.130s
user    45m25.430s
sys     2m49.440s


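#!/usr/bin/perl -w

# evenless.pl: stream every file below a Maildir to git fast-import as
# blobs, then reference them all from a single commit (no working tree).

use strict;

my $tree= "";

use IPC::Open2;

use File::stat;

my $FILES= "";    # "M 0644 :<mark> <path>" lines for the final commit

my $mark= 1;      # next fast-import mark to assign

my $stripdir= $ARGV[0];

sub fastimport_blobs ($);
sub fastimport_blobs ($)
{
    my $dirname= shift @_;

    opendir (my $dirhandle, $dirname) or die "Cannot open $dirname: $!\n";
    foreach (readdir $dirhandle)
    {
        next if /^\.\.?$/;
        # skip mail client index/metadata files and the notmuch database
        next if /\.cmeta$/;
        next if /\.ibex\.index$/;
        next if /\.ibex\.index\.data$/;
        next if /\.ev-summary$/;
        next if /\.ev-summary-meta$/;
        next if /\.notmuch$/;

        if (-d $dirname.'/'.$_)
        {
            print STDERR "Recursing into $_/ ";
            fastimport_blobs($dirname.'/'.$_);
            print STDERR "\n";
        }
        else
        {
            my $sb= stat("$dirname/$_");
            open (FILEIN, '<', "$dirname/$_") or die "Cannot read $dirname/$_: $!\n";
            binmode FILEIN;
            my $content= "";
            sysread FILEIN, $content, $sb->size;
            close FILEIN;
            # announce the number of bytes actually read, not the stat size
            print FASTIMPORT "blob\n";
            print FASTIMPORT "mark :$mark\n";
            print FASTIMPORT "data ".length($content)."\n";
            print FASTIMPORT $content;
            my $storedir= "$dirname/$_";
            $storedir=~ s/^\Q$stripdir\E//;   # \Q..\E: match the path literally
            $storedir=~ s/^\///;
            $FILES.="M 0644 :$mark $storedir\n";
            $mark++;
        }
    }
    closedir $dirhandle;
}

open (FASTIMPORT, "| git fast-import --date-format=rfc2822")
    or die "Cannot start git fast-import: $!\n";
binmode FASTIMPORT;

fastimport_blobs($ARGV[0]);

# one commit that references every blob imported above
print FASTIMPORT "commit refs/heads/master\n";
print FASTIMPORT "committer EvenLess <evenle...\@evenless> ".`date -R`;
print FASTIMPORT "data 11\n";
print FASTIMPORT "mail commit\n";
print FASTIMPORT $FILES;
print FASTIMPORT "\n";

close FASTIMPORT;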




-- 
Stewart Smith


Re: [notmuch] Mail in git

2010-02-17 Thread Mark Anderson
On Wed, 17 Feb 2010 10:03:36 -0500, Ben Gamari bgam...@gmail.com wrote:
> > notmuch would then only search and provide the hash ID(s); tags
> > would be a function of storage.
> >
> > Is it possible to find out all trees that reference a given object
> > with Git in constant or sub-linear time?
> >
> I don't believe so. I think this is one of the reasons why git gc is so
> expensive.

But if we have notmuch as a cache of the tags, then don't we already
know the tree objects that need updating?  Yes, we would probably need
some consistency checks for when things don't work as planned, but in
the common case we ought to always know.
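
A rough sketch of that idea, in Python and with made-up names: if the
tag-to-message mapping is already cached somewhere, a reverse index can be
built in one linear pass, after which "which trees reference this
message?" is an average constant-time lookup. The dictionary below stands
in for whatever notmuch would actually store; nothing here is notmuch or
Git API.

```python
from collections import defaultdict

def build_reverse_index(trees):
    """Map each message ID to the set of tag trees that reference it.

    `trees` is a hypothetical cache: {tree name: set of message IDs}.
    """
    index = defaultdict(set)
    for tree_name, message_ids in trees.items():
        for message_id in message_ids:
            index[message_id].add(tree_name)
    return index
```

The index is only as trustworthy as the cache it was built from, which is
where the consistency checks would come in.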

Perhaps I'm misunderstanding these tree objects, and you're suggesting
that we don't even tell notmuch about them.

-Mark

Just poking my nose where it don't belong, since 1984.



Re: [notmuch] Mail in git

2010-02-17 Thread martin f krafft
also sprach Ben Gamari bgam...@gmail.com [2010.02.18.0834 +1300]:
> Excerpts from Mark Anderson's message of Wed Feb 17 14:23:48 -0500 2010:
> > But if we have notmuch as a cache of the tags, then don't we
> > already know the tree objects that need updating?  Yes, we would
> > probably need some consistency checks for when things don't work
> > as planned, but in the common case we ought to always know.
> >
> Cached or not, rewriting would still be an incredibly (e.g.
> prohibitively or close to it) expensive operation for a large
> mailstore.

Why? Well, it would involve creating n objects and unlinking n objects
for n tags, but it would be constant in the number of messages, no?

> > Perhaps I'm misunderstanding these tree objects, and you're
> > suggesting that we don't even tell notmuch about them.
> >
> I think it would be unwise to teach notmuch anything about the
> underlying store. That would be leaking way too many
> implementation details into

I agree. Also, it would introduce redundancy.

-- 
martin | http://madduck.net/ | http://two.sentenc.es/
 
twenty-four hour room-service must be one of the
 premiere achievements of modern civilization.
  -- special agent dale cooper
 
spamtraps: madduck.bo...@madduck.net




Re: [notmuch] Mail in git

2010-02-17 Thread Stewart Smith
On Wed, 17 Feb 2010 14:21:01 +1300, martin f krafft madd...@madduck.net wrote:
> What I am wondering is if (explicit) tags couldn't be represented as
> tree-objects with this.
>
>   evenless-link   - link a message object with a tree object
>   evenless-unlink - unlink a message object from a tree object
>                     [replaces evenless-delete]

I think it could get expensive for tags with lots of messages.

With my fast-import script, doing the commit (that
referenced... umm.. 800,000+ objects) took a *very* long time.

As far as I understand it, the tree object is stored in full and space
is only reclaimed during repack (due to delta compression).

So if you, say, had the entire history of a high-volume list such as
linux-kernel, adding messages could get rather expensive if you
auto-tagged them (e.g. tagged every message containing a patch, or whatever).

> messages would then be deleted whenever using git-gc.
>
> No idea how this would sync if we don't keep ancestry. Otoh, it
> would probably not be very expensive to do just that.

If we keep ancestry though, we are reusing existing working code for
backup (git-pull :)

Keep in mind that in my tests, the mail stored in git is about a quarter
to a fifth of its size as a Maildir... so a bit of extra usage per
message isn't as dramatic as it may sound.

> Is it possible to find out all trees that reference a given object
> with Git in constant or sub-linear time?

I don't think so but I'm not sure.

-- 
Stewart Smith


Re: [notmuch] Mail in git

2010-02-17 Thread Ben Gamari
Excerpts from martin f krafft's message of Wed Feb 17 18:52:11 -0500 2010:
> also sprach Ben Gamari bgam...@gmail.com [2010.02.18.0834 +1300]:
> > Excerpts from Mark Anderson's message of Wed Feb 17 14:23:48 -0500 2010:
> > > But if we have notmuch as a cache of the tags, then don't we
> > > already know the tree objects that need updating?  Yes, we would
> > > probably need some consistency checks for when things don't work
> > > as planned, but in the common case we ought to always know.
> > >
> > Cached or not, rewriting would still be an incredibly (e.g.
> > prohibitively or close to it) expensive operation for a large
> > mailstore.
>
> Why? Well, would involve creating n objects and unlinking n objects
> for n tags, but it would be constant in the number of messages, no?

Yes, it would be linear in number of tags. I suppose if messages
weren't stored in the top-level tree nodes, then it would still be
linear, although with a slope equal to the reciprocal of the fan-out.
This has the potential to be very reasonable performance-wise.

- Ben


Re: [notmuch] Mail in git

2010-02-17 Thread martin f krafft
also sprach Ben Gamari bgam...@gmail.com [2010.02.18.1401 +1300]:
> > If we keep ancestry though, we are reusing existing working code for
> > backup (git-pull :)
>
> This is one of the reasons I feel it's important we keep it. And as is
> stated below, the storage overhead is minimal.

Absolutely; Stewart mentioned at LCA to forego the porcelain and
harness the power of the plumbing, and I knew back then that this
would be among the first things of which to convince him once he had
the basic idea out. ;)

-- 
martin | http://madduck.net/ | http://two.sentenc.es/
 
DISCLAIMER: this entire message is privileged communication, intended
for the sole use of its recipients only. If you read it even though
you know you aren't supposed to, you're a poopy-head.
 
spamtraps: madduck.bo...@madduck.net




Re: [notmuch] Mail in git

2010-02-17 Thread Ben Gamari
Excerpts from martin f krafft's message of Wed Feb 17 20:58:47 -0500 2010:
> also sprach Ben Gamari bgam...@gmail.com [2010.02.18.1339 +1300]:
> > Yes, it would be linear in number of tags. I suppose if messages
> > weren't stored in the top-level tree nodes, then it would still be
> > linear, although with a slope equal to the reciprocal of the fan-out.
> > This has the potential to be very reasonable performance-wise.
>
> Messages are never stored in tree nodes; all these do are store
> references to objects (blobs) holding messages. I bet you know this,
> but I just wanted to make it explicit.

Yep, I'm aware.

> So retagging is really just writing a new tree with a modified list
> of references.

Certainly, however if you have a large tag (100,000 messages), this
list of references could easily be tens of megabytes. For this reason, it
seems like the added overhead of nesting trees would be well worth it.
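
To put rough numbers on this, here is a back-of-the-envelope sketch in
Python. It relies on the layout of a Git tree entry (ASCII mode, space,
name, NUL byte, 20-byte binary SHA-1); the function names, the 40-character
file name length and the 256-way fan-out are assumptions picked for
illustration, and the sizes are uncompressed.

```python
def tree_entry_size(name_len, mode="100644"):
    """Bytes for one tree entry: '<mode> <name>', a NUL, a 20-byte SHA-1."""
    return len(mode) + 1 + name_len + 1 + 20

def flat_tree_size(n_messages, name_len):
    """Size of a single tree listing every message of a tag."""
    return n_messages * tree_entry_size(name_len)

def nested_rewrite_size(n_messages, name_len, fanout=256):
    """Bytes rewritten when one message changes in a two-level layout."""
    per_subtree = -(-n_messages // fanout)        # ceiling division
    subtree = per_subtree * tree_entry_size(name_len)
    # the root lists `fanout` subtrees with two-character names, mode 40000
    root = fanout * tree_entry_size(2, mode="40000")
    return subtree + root

print(flat_tree_size(100_000, 40))       # 6800000
print(nested_rewrite_size(100_000, 40))  # 34012
```

Under these assumptions a flat tree for a 100,000-message tag is several
megabytes per rewrite (more with longer Maildir file names), while the
nested layout rewrites a few tens of kilobytes.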

- Ben


Re: [notmuch] Mail in git

2010-02-16 Thread Michal Sojka
Hi Stewart,

On Mon, 15 Feb 2010 11:29:14 +1100, Stewart Smith stew...@flamingspork.com wrote:
> Which goes from a 15GB Maildir to a 3.7GB git repo.

That's quite an interesting ratio. I've tried a plain git add and git gc on
my mail store and the result was a repo of approximately 50% of the mail
store size. Do you think that this difference might be caused by the way
you created the packs?

 
> The algorithm of evenless.pl is basically:
> 1 get next directory entry
> 2 if is directory, recurse into it
> 3 write item to git (git hash-object -w)
> 4 add item to tree object
> 5 if number of items written = 1000
>   5.1 make pack of last 1000 items
> 6 goto 1
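
Step 3 in the list above stores each mail under an ID derived purely from
its content. As a small illustration (Python for brevity, not part of
evenless.pl; the function name is made up), this is how the ID printed by
git hash-object is computed for a blob:

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with this content."""
    # Git hashes a short header ("blob <size>" plus a NUL byte) and the data.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_id(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

Two byte-identical mails therefore map to the same object and are stored
once, whatever their file names are.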

So it seems that you have all your mails in a single tree. How long does it
take to calculate the difference between two trees (git diff-tree
--name-status)? This operation will be needed by notmuch new to determine
which files/blobs to index. I suppose it would be better if mail blobs were
stored in subtrees. If a subtree is unchanged, git doesn't need to
descend into it because it has the same sha1.

I think that storing mails in a structure similar to .git/objects
(i.e. 256 subdirectories named after the first sha1 byte, with file names
made of the remaining 38 hex characters) would be reasonable.
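
For illustration, a minimal Python sketch of such a path scheme. Git's own
loose-object directory uses the first two hex digits of the 40-digit SHA-1
as the directory name and the remaining 38 as the file name; the function
name below is made up.

```python
def fanout_path(object_id: str) -> str:
    """Split a 40-character hex SHA-1 into 'xx/<remaining 38 characters>'."""
    if len(object_id) != 40:
        raise ValueError("expected a 40-character hex SHA-1")
    return object_id[:2] + "/" + object_id[2:]
```

With this layout, adding one mail changes a single subtree plus the root
tree, so a tree diff can skip the other 255 subtrees by comparing their
sha1s.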

> Next step?
>
> Make notmuch be able to read mail out of it and add it to an index
> (oh, and some kind of verification and error checking about creating
> the git repo).

Besides using git to compact the size of the mail store, another feature
that comes with git for free is synchronization. For this to work, you only
need to store tags in the repo. What might work is to store tags in
files named <mail-name>.tags. The tags would be stored in the files
alphabetically, one tag per line. I guess that this makes it easy
to merge tags during synchronization even without writing a custom git
merge driver.
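
A small sketch of that file format in Python, together with the result a
three-way merge of two tag files would ideally produce. The function names
are made up, and the merge is plain set arithmetic. It shows the intended
outcome, not what git's default line-based merge is guaranteed to do (that
can still report a conflict when neighbouring lines change on both sides).

```python
def format_tags(tags):
    """Serialize tags alphabetically, one tag per line."""
    return "".join(tag + "\n" for tag in sorted(set(tags)))

def parse_tags(text):
    """Read a tags file back into a set, ignoring empty lines."""
    return {line for line in text.splitlines() if line}

def merge_tags(base, ours, theirs):
    """Three-way merge: keep additions and removals made on either side."""
    base, ours, theirs = parse_tags(base), parse_tags(ours), parse_tags(theirs)
    removed = (base - ours) | (base - theirs)
    return format_tags((ours | theirs) - removed)
```

For example, if one side adds "linux" and the other removes "unread" from
a file containing "inbox" and "unread", the merged file lists "inbox" and
"linux".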

Another point that must be solved if we would like to use git with
notmuch is the license problem. As Carl pointed out in another
thread, Git is licensed under GPLv2 only whereas notmuch is under GPLv3,
and these licenses are incompatible. So I think we will need some kind of
hooks in notmuch from which external programs (git) will be called.

Cheers,
 Michal


Re: [notmuch] Mail in git

2010-02-16 Thread martin f krafft
also sprach Stewart Smith stew...@flamingspork.com [2010.02.15.1329 +1300]:
> What about adding more mail to the archive?
>
> So the way I think is that you use a Maildir for day to day mail
> (e.g. delivery) and every so often you run some magic command that
> takes old mail out of the Maildir and stores it in the git repo.

Either that, or the other idea we had (which I prefer), which would
basically be:

  evenless-submit - add a new mail (and return a hash ID)
                    and invoke a hook, e.g. to let notmuch know
  evenless-cat    - print the full mail given ID with headers to stdout
  evenless-delete - unlink a mail identified by hash ID
                    and invoke a hook, e.g. to let notmuch know

If we expose the submit and delete functionality at the notmuch
level, then we don't need the hooks, because evenless would then be
plumbing.

Anything to avoid a cronjob would be good, I think.

Then we need a notmuch backend for mutt etc.. For those who still
want to use a regular Maildir, let them use the worktree.

What I am wondering is if (explicit) tags couldn't be represented as
tree-objects with this.

  evenless-link   - link a message object with a tree object
  evenless-unlink - unlink a message object from a tree object
                    [replaces evenless-delete]

messages would then be deleted whenever using git-gc.
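
A toy model of that idea in Python, with made-up names and plain
dictionaries standing in for Git objects: tags are sets of message IDs,
link and unlink edit those sets, and a collection pass drops every message
no tag references any more. Real git gc is more conservative (by default
it keeps unreachable objects for a grace period, for example), so this
only illustrates the reachability rule.

```python
def link(tags, tag, message_id):
    """Reference a message from a tag tree, creating the tag if needed."""
    tags.setdefault(tag, set()).add(message_id)

def unlink(tags, tag, message_id):
    """Remove a message reference from a tag tree, if present."""
    tags.get(tag, set()).discard(message_id)

def collect_garbage(objects, tags):
    """Return only the stored messages still referenced by some tag."""
    live = set()
    for message_ids in tags.values():
        live |= message_ids
    return {oid: data for oid, data in objects.items() if oid in live}
```

Under this rule a message goes away once it is unlinked from its last tag,
so an untagged mail would need some catch-all tree to survive.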

No idea how this would sync if we don't keep ancestry. Otoh, it
would probably not be very expensive to do just that.

notmuch would then only search and provide the hash ID(s); tags
would be a function of storage.

Is it possible to find out all trees that reference a given object
with Git in constant or sub-linear time?

-- 
martin | http://madduck.net/ | http://two.sentenc.es/
 
the question of whether computers can think
 is like the question of whether submarines can swim.
 -- edsgar w. dijkstra
 
spamtraps: madduck.bo...@madduck.net

