[notmuch] Mail in git

2011-05-21 Thread Stewart Smith
On Sat, 21 May 2011 09:05:54 +0200, martin f krafft wrote:
> Has anyone worked on this since?

No, haven't had the cycles... and SSD helped a bit to delay urgency.

-- 
Stewart Smith


[notmuch] Mail in git

2011-05-21 Thread martin f krafft
also sprach Stewart Smith [2010.02.17.1107 +0100]:
> On Wed, 17 Feb 2010 11:21:51 +1100, Stewart Smith wrote:
> > Using fast-import is interesting. Does it update the working tree? The
> > big thing I wanted to avoid was creating a working tree (another million
> > inodes being created is not ever what I need)
> > 
> > Also interesting is the mention of creating packs on the fly... this
> > could save the time in first writing the object and then packing it (as
> > my script does).
> > 
> > I'm going to play with this
> 
> and I did.

Has anyone worked on this since?

-- 
martin | http://madduck.net/ | http://two.sentenc.es/

"one should never allow one's mind
 and one's foot to wander at the same time."
-- edward perkins (yes, the librarian)

spamtraps: madduck.bogus at madduck.net



___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


[notmuch] Mail in git

2010-02-18 Thread martin f krafft
also sprach Ben Gamari [2010.02.18.1401 +1300]:
> > If we keep ancestry though, we are reusing existing working code for
> > backup (git-pull :)
> 
> This is one of the reasons I feel it's important we keep it. And as is
> stated below, the storage overhead is minimal.

Absolutely; Stewart mentioned at LCA to forgo the porcelain and
harness the power of the plumbing, and I knew back then that this
would be among the first things of which to convince him once he had
the basic idea out. ;)

-- 
martin | http://madduck.net/ | http://two.sentenc.es/

DISCLAIMER: this entire message is privileged communication, intended
for the sole use of its recipients only. If you read it even though
you know you aren't supposed to, you're a poopy-head.

spamtraps: madduck.bogus at madduck.net



[notmuch] Mail in git

2010-02-18 Thread martin f krafft
also sprach Ben Gamari [2010.02.18.1339 +1300]:
> Yes, it would be linear in number of tags. I suppose if messages
> weren't stored in the top-level tree nodes, then it would still be
> linear, although with a slope equal to the reciprocal of the fan-out.
> This has the potential to be very reasonable performance-wise.

Messages are never stored in tree nodes; all these do is store
references to objects (blobs) holding messages. I bet you know this,
but I just wanted to make it explicit.

So retagging is really just writing a new tree with a modified list
of references.
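
To make the "new tree with a modified list of references" concrete, here is a minimal Python sketch of how git serializes and names a tree object. It assumes a flat tree of plain-file entries (mode 100644) and only computes object IDs in memory; the helper names are made up for illustration and are not part of evenless or notmuch.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    """SHA-1 object ID as git computes it: hash of '<kind> <len>\\0' + body."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries: dict) -> bytes:
    """Serialize a flat tree; entries maps file name -> 40-char hex blob ID."""
    body = b""
    # Git keeps tree entries sorted by name (byte order for plain files).
    for name in sorted(entries, key=lambda n: n.encode()):
        body += b"100644 " + name.encode() + b"\0" + bytes.fromhex(entries[name])
    return body


def tag_tree_id(entries: dict) -> str:
    """Object ID of the tree that would represent one tag (hypothetical helper)."""
    return git_object_id("tree", tree_body(entries))


if __name__ == "__main__":
    msgs = {
        "msg1": git_object_id("blob", b"first mail\n"),
        "msg2": git_object_id("blob", b"second mail\n"),
    }
    before = tag_tree_id(msgs)
    # "Untagging" msg2 only drops one reference; the blob itself is untouched.
    after = tag_tree_id({"msg1": msgs["msg1"]})
    print(before, after)
```

The blobs keep their IDs across a retag; only the tree (the list of references) gets a new ID, and its whole body has to be written again.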

-- 
martin | http://madduck.net/ | http://two.sentenc.es/

"no survivors? then where do the stories come from I wonder?"
   -- captain jack sparrow

spamtraps: madduck.bogus at madduck.net



[notmuch] Mail in git

2010-02-18 Thread martin f krafft
also sprach Ben Gamari [2010.02.18.0834 +1300]:
> Excerpts from Mark Anderson's message of Wed Feb 17 14:23:48 -0500
> 2010:
> > But if we have notmuch as a cache of the tags, then don't we
> > already know the tree objects that need updating?  Yes, we would
> > probably need some consistency checks for when things don't work
> > as planned, but in the common case we ought to always know.
> > 
> Cached or not, rewriting would still be an incredibly (e.g.
> prohibitively or close to it) expensive operation for a large
> mailstore.

Why? Well, it would involve creating n objects and unlinking n objects
for n tags, but it would be constant in the number of messages, no?

> > Perhaps I'm misunderstanding these tree objects, and you're
> > suggesting that we don't even tell notmuch about them.
> > 
> I think it would be unwise to teach notmuch anything about the
> underlying store. That would be leaking way too many
> implementation details into

I agree. Also, it would introduce redundancy.

-- 
martin | http://madduck.net/ | http://two.sentenc.es/

"twenty-four hour room-service must be one of the
 premiere achievements of modern civilization."
  -- special agent dale cooper

spamtraps: madduck.bogus at madduck.net



[notmuch] Mail in git

2010-02-18 Thread Stewart Smith
On Wed, 17 Feb 2010 14:21:01 +1300, martin f krafft wrote:
> What I am wondering is if (explicit) tags couldn't be represented as
> tree-objects with this.
> 
>   evenless-link   - link a message object with a tree object
>   evenless-unlink - unlink a message object from tree object
> [replaces evenless-unlink]

I think it could get expensive for tags with lots of messages.

With my fast-import script, doing the commit (that
referenced... umm.. 800,000+ objects) took a *very* long time.

As far as I understand it, the tree object is stored in full and space
is only reclaimed during repack (due to delta compression).

So if you, say, had the entire history of a high volume list such as
linux-kernel, adding messages could get rather expensive if you
auto-tagged (or autotagged messages with patches or whatever).

> messages would then be deleted whenever using git-gc.
> 
> No idea how this would sync if we don't keep ancestry. Otoh, it
> would probably not be very expensive to do just that.

If we keep ancestry though, we are reusing existing working code for
backup (git-pull :)

Keep in mind that in my tests, the mail stored in git is about a quarter
to a fifth of its size as a Maildir... so a bit of extra usage per
message isn't as dramatic as it may sound.

> Is it possible to find out all trees that reference a given object
> with Git in constant or sub-linear time?

I don't think so but I'm not sure.

-- 
Stewart Smith


[notmuch] Mail in git

2010-02-17 Thread Ben Gamari
Excerpts from martin f krafft's message of Wed Feb 17 20:58:47 -0500 2010:
> also sprach Ben Gamari  [2010.02.18.1339 +1300]:
> > Yes, it would be linear in number of tags. I suppose if messages
> > weren't stored in the top-level tree nodes, then it would still be
> > linear, although with a slope equal to the reciprocal of the fan-out.
> > This has the potential to be very reasonable performance-wise.
> 
> Messages are never stored in tree nodes; all these do is store
> references to objects (blobs) holding messages. I bet you know this,
> but I just wanted to make it explicit.

Yep, I'm aware.
> 
> So retagging is really just writing a new tree with a modified list
> of references.
> 
Certainly, however if you have a large tag (>100,000 messages), this
list of references could easily be tens of megabytes. For this reason, it
seems like the added overhead of nesting trees would be well worth it.
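
A back-of-the-envelope sketch of that size estimate, in Python. It relies only on the layout of an entry in an uncompressed git tree object ("<mode> <name>\0" followed by a 20-byte binary SHA-1); the 60-character file name length is an assumption picked to resemble Maildir names, not a measured value.

```python
def tree_entry_size(name: str, mode: str = "100644") -> int:
    """Bytes one entry occupies in an uncompressed git tree object."""
    # Layout: "<mode> <name>\0" followed by the 20-byte binary SHA-1.
    return len(mode) + 1 + len(name.encode("utf-8")) + 1 + 20


def flat_tree_size(n_messages: int, name_len: int) -> int:
    """Size of a single flat tree referencing n_messages blobs."""
    return n_messages * tree_entry_size("x" * name_len)


if __name__ == "__main__":
    for n in (100_000, 800_000):
        size = flat_tree_size(n, name_len=60)  # assumed name length
        print(f"{n} entries: {size / 1e6:.1f} MB per rewritten tree")
```

Under these assumptions a flat tag tree costs about 88 bytes per message before zlib compression, so the cost of rewriting it grows linearly with the size of the tag.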

- Ben


[notmuch] Mail in git

2010-02-17 Thread Stewart Smith
On Wed, 17 Feb 2010 11:21:51 +1100, Stewart Smith wrote:
> Using fast-import is interesting. Does it update the working tree? The
> big thing I wanted to avoid was creating a working tree (another million
> inodes being created is not ever what I need)
> 
> Also interesting is the mention of creating packs on the fly... this
> could save the time in first writing the object and then packing it (as
> my script does).
> 
> I'm going to play with this

and I did.

good news... on my mailstore (which, as I've previously mentioned, takes
about 10 minutes to run 'du' over, about the same time as 'notmuch new'
takes):

using the (attached) evenless.pl to create a single commit with
everything in it:

$ du -sh .git
3.4G    .git

Down from a whopping 14-15GB!!!

My previous effort (git-write-object, create pack every 1000 messages,
rinse, repeat) took all night and got to 3.7GB.

This took only 108 minutes.

In both cases, i was creating the repository on another spindle (USB2.0
disk attached to my laptop).

git-ls-tree and git-cat-file both work for listing and getting objects.

The next thing to think about is adding objects as they come
in... creating a new commit with just an added file should be pretty
simple and easy... but this means we get to keep a "revision history" of
the mailstore, which is *possibly* not ideal in terms of storage
efficiency (i'll do a trial with mine of doing one message at a time and
seeing what the end size is).

however... commit per added mail (or mails) does give us the advantage
of a really well documented and tested backup system :)
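
As a rough sketch of what "a new commit with just an added file" could look like as a fast-import stream, here is a small Python generator. It only builds the byte stream (blob, mark, data, commit, M commands); the committer address and the function name are placeholders, and paths containing newlines or starting with a double quote would need quoting that this sketch does not do.

```python
import time


def fast_import_stream(messages, ref="refs/heads/master", parent=None, when=None):
    """Build a git fast-import stream adding `messages` in one commit.

    messages: dict mapping repository path -> raw message bytes.
    parent:   e.g. "refs/heads/master^0" to append to an existing branch.
    """
    when = int(time.time()) if when is None else when
    out, file_lines = [], []
    for mark, (path, raw) in enumerate(sorted(messages.items()), start=1):
        # One blob per message, remembered under a numeric mark.
        out.append(b"blob\nmark :%d\ndata %d\n" % (mark, len(raw)))
        out.append(raw)
        out.append(b"\n")  # optional LF after the raw data
        file_lines.append(b"M 100644 :%d %s\n" % (mark, path.encode("utf-8")))
    log = b"mail commit"
    out.append(b"commit %s\n" % ref.encode())
    # Placeholder identity; the date uses fast-import's default "raw" format.
    out.append(b"committer EvenLess <evenless@example.invalid> %d +0000\n" % when)
    out.append(b"data %d\n%s\n" % (len(log), log))
    if parent is not None:
        out.append(b"from %s\n" % parent.encode())
    out.extend(file_lines)
    out.append(b"\n")
    return b"".join(out)


if __name__ == "__main__":
    import sys
    stream = fast_import_stream({"cur/example": b"Subject: test\n\nhello\n"})
    sys.stdout.buffer.write(stream)
```

The output is meant to be piped into `git fast-import` inside a repository. Passing `parent="refs/heads/master^0"` makes the commit build on the current branch tip, which is what yields one commit per delivered mail.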

Deleting could be hard.. if we actually want the objects to go away in a
"permanent" way (not just no longer be referenced).

for the stats nerds:

$ time perl /home/stewart/evenless/evenless.pl /home/stewart/Maildir/INBOX

git-fast-import statistics:
---------------------------------------------------------------------
Alloc'd objects:     785000
Total objects:       781813 (     79023 duplicates                  )
      blobs  :       781363 (     79023 duplicates     708627 deltas)
      trees  :          449 (         0 duplicates          0 deltas)
      commits:            1 (         0 duplicates          0 deltas)
      tags   :            0 (         0 duplicates          0 deltas)
Total branches:           1 (         1 loads     )
      marks:        1048576 (    860386 unique    )
      atoms:         860557
Memory total:        182780 KiB
       pools:        152116 KiB
     objects:         30664 KiB
---------------------------------------------------------------------
pack_report: getpagesize()            =       4096
pack_report: core.packedGitWindowSize = 1073741824
pack_report: core.packedGitLimit      = 8589934592
pack_report: pack_used_ctr            =          1
pack_report: pack_mmap_calls          =          1
pack_report: pack_open_windows        =          1 /          1
pack_report: pack_mapped              =  388496447 /  388496447
---------------------------------------------------------------------

real    107m43.130s
user    45m25.430s
sys     2m49.440s


-- next part --
A non-text attachment was scrubbed...
Name: evenless.pl
Type: text/x-perl
Size: 1413 bytes
Desc: evenless.pl: maildir to git using fast-import

-- 
Stewart Smith


[notmuch] Mail in git

2010-02-17 Thread Ben Gamari
Excerpts from Stewart Smith's message of Wed Feb 17 18:56:53 -0500 2010:
> On Wed, 17 Feb 2010 14:21:01 +1300, martin f krafft wrote:
> > What I am wondering is if (explicit) tags couldn't be represented as
> > tree-objects with this.
> 
> I think it could get expensive for tags with lots of messages.
> 
> As far as I understand it, the tree object is stored in full and space
> is only reclaimed during repack (due to delta compression).
> 
> So if you, say, had the entire history of a high volume list such as
> linux-kernel, adding messages could get rather expensive if you
> auto-tagged (or autotagged messages with patches or whatever).
> 

Well, it's tough to say, but I don't think it's as bad as you think. I
proposed that we could use a tree structure like the following,

                     +--msg1
      +--tagA.list1--+--msg2
      |              +--msg3
      |
      |              +--msg4
      +--tagA.list2--+--msg5
      |              +--msg6
tagA--+
      |              +--msg7
      +--tagA.list3--+--msg8
      |              +--msg9
      |
      |              +--msg10
      +--tagA.list4--+--msg11
                     +--msg12

This way, adding a message to, say list3, would only require rewriting
list3 and tagA, which seems pretty reasonable to me. Moreover, we could
make the tree structure as deep as necessary, although we
would need to rewrite a node at every level of the tree, so it's tough
to say how many levels is too many. It could simply be adaptive (e.g.
bisect any nodes with more than N children).

This certainly isn't as simple as the naive approach, but I think it's
the only reasonable approach performance-wise and I don't believe it
should be too tricky.
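
To put rough numbers on the trade-off, here is a small Python model of the bytes rewritten when one message is added to a tag. It assumes every node holds at most `fan_out` entries and that an entry costs 68 bytes (mode, space, a 40-character name, NUL, 20-byte SHA-1); both figures are assumptions for illustration, and the result is an upper bound for full nodes.

```python
ENTRY_BYTES = 68  # assumed: "100644" + space + 40-char name + NUL + 20-byte SHA-1


def levels_needed(n_messages: int, fan_out: int) -> int:
    """Smallest number of tree levels so that fan_out**levels >= n_messages."""
    if fan_out < 2:
        raise ValueError("fan_out must be at least 2")
    levels, capacity = 1, fan_out
    while capacity < n_messages:
        capacity *= fan_out
        levels += 1
    return levels


def bytes_rewritten(n_messages: int, fan_out=None) -> int:
    """Upper bound on tree bytes rewritten when one message is added to a tag."""
    if fan_out is None:
        # Flat layout: the single tree lists every message and is rewritten whole.
        return n_messages * ENTRY_BYTES
    # Nested layout: one node per level is rewritten, each with <= fan_out entries.
    return levels_needed(n_messages, fan_out) * fan_out * ENTRY_BYTES


if __name__ == "__main__":
    n = 100_000
    print("flat:  ", bytes_rewritten(n))
    print("nested:", bytes_rewritten(n, fan_out=256))
```

With these assumptions, 100,000 messages at a fan-out of 256 need three levels, so an update rewrites at most 3 x 256 x 68 = 52,224 bytes instead of 6,800,000 for the flat tree.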

> > messages would then be deleted whenever using git-gc.
> > 
> > No idea how this would sync if we don't keep ancestry. Otoh, it
> > would probably not be very expensive to do just that.
> 
> If we keep ancestry though, we are reusing existing working code for
> backup (git-pull :)

This is one of the reasons I feel it's important we keep it. And as is
stated below, the storage overhead is minimal.
> 
> Keep in mind that with my tests, the Maildir in git is about a quarter
> to a fifth of the size of it in Maildir... so a bit of extra usage per
> message isn't as dramatic as it may sound.
> 


[notmuch] Mail in git

2010-02-17 Thread Ben Gamari
Excerpts from martin f krafft's message of Wed Feb 17 18:52:11 -0500 2010:
> also sprach Ben Gamari  [2010.02.18.0834 +1300]:
> > Excerpts from Mark Anderson's message of Wed Feb 17 14:23:48 -0500
> > 2010:
> > > But if we have notmuch as a cache of the tags, then don't we
> > > already know the tree objects that need updating?  Yes, we would
> > > probably need some consistency checks for when things don't work
> > > as planned, but in the common case we ought to always know.
> > > 
> > Cached or not, rewriting would still be an incredibly (e.g.
> > prohibitively or close to it) expensive operation for a large
> > mailstore.
> 
> Why? Well, it would involve creating n objects and unlinking n objects
> for n tags, but it would be constant in the number of messages, no?

Yes, it would be linear in number of tags. I suppose if messages
weren't stored in the top-level tree nodes, then it would still be
linear, although with a slope equal to the reciprocal of the fan-out.
This has the potential to be very reasonable performance-wise.

- Ben


[notmuch] Mail in git

2010-02-17 Thread Ben Gamari
Excerpts from Mark Anderson's message of Wed Feb 17 14:23:48 -0500 2010:
> But if we have notmuch as a cache of the tags, then don't we already
> know the tree objects that need updating?  Yes, we would probably need
> some consistency checks for when things don't work as planned, but in
> the common case we ought to always know.
> 
Cached or not, rewriting would still be an incredibly (e.g.
prohibitively or close to it) expensive operation for a large mailstore.

> Perhaps I'm misunderstanding these tree objects, and you're suggesting
> that we don't even tell notmuch about them.
> 
I think it would be unwise to teach notmuch anything about the
underlying store. That would be leaking way too many implementation
details into 

- Ben


[notmuch] Mail in git

2010-02-17 Thread martin f krafft
also sprach Stewart Smith [2010.02.15.1329 +1300]:
> What about adding more mail to the archive?
> 
> So the way I think is that you use a Maildir for day to day mail
> (e.g. delivery) and every so often you run some magic command that
> takes old mail out of the Maildir and stores it in the git repo.

Either that, or the other idea we had (which I prefer), which would
basically be:

  evenless-submit - add a new mail (and return a hash ID)
                    and invoke a hook, e.g. to let notmuch know
  evenless-cat    - print the full mail given ID with headers to stdout
  evenless-delete - unlink a mail identified by hash ID
                    and invoke a hook, e.g. to let notmuch know

If we expose the submit and delete functionality at the notmuch
level, then we don't need the hooks, for then evenless would be
plumbing.
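
The three commands above could be modelled roughly like this in Python. This is an in-memory sketch only: it assumes the "hash ID" is the git blob ID of the raw message, and the class, method, and hook names are invented for illustration rather than taken from any existing evenless code.

```python
import hashlib
from typing import Callable, Dict, Optional


def git_blob_id(data: bytes) -> str:
    """ID git assigns to a blob: SHA-1 of 'blob <len>\\0' followed by the data."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


class MailStore:
    """Toy stand-in for evenless-submit / evenless-cat / evenless-delete."""

    def __init__(self, hook: Optional[Callable[[str, str], None]] = None):
        self._blobs: Dict[str, bytes] = {}
        self._hook = hook  # e.g. a callback that tells notmuch what changed

    def submit(self, raw: bytes) -> str:
        oid = git_blob_id(raw)
        self._blobs[oid] = raw  # identical mails collapse to one object
        if self._hook:
            self._hook("submit", oid)
        return oid

    def cat(self, oid: str) -> bytes:
        return self._blobs[oid]

    def delete(self, oid: str) -> None:
        del self._blobs[oid]
        if self._hook:
            self._hook("delete", oid)


if __name__ == "__main__":
    store = MailStore(hook=lambda event, oid: print(event, oid))
    oid = store.submit(b"Subject: hi\n\nbody\n")
    print(store.cat(oid).decode())
    store.delete(oid)
```

Because the ID is derived from the content, submitting the same mail twice yields the same ID, which fits the duplicate counts git fast-import reports elsewhere in this thread.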

Anything to avoid a cronjob would be good, I think.

Then we need a notmuch backend for mutt etc. For those who still
want to use a regular Maildir, let them use the worktree.

What I am wondering is if (explicit) tags couldn't be represented as
tree-objects with this.

  evenless-link   - link a message object with a tree object
  evenless-unlink - unlink a message object from tree object
[replaces evenless-unlink]

messages would then be deleted whenever using git-gc.

No idea how this would sync if we don't keep ancestry. Otoh, it
would probably not be very expensive to do just that.

notmuch would then only search and provide the hash ID(s); tags
would be a function of storage.

Is it possible to find out all trees that reference a given object
with Git in constant or sub-linear time?

-- 
martin | http://madduck.net/ | http://two.sentenc.es/

"the question of whether computers can think
 is like the question of whether submarines can swim."
 -- edsger w. dijkstra

spamtraps: madduck.bogus at madduck.net



[notmuch] Mail in git

2010-02-17 Thread Mark Anderson
On Wed, 17 Feb 2010 10:03:36 -0500, Ben Gamari wrote:
> > notmuch would then only search and provide the hash ID(s); tags
> > would be a function of storage.
> > 
> > Is it possible to find out all trees that reference a given object
> > with Git in constant or sub-linear time?
> > 
> I don't believe so. I think this is one of the reasons why git gc is so
> expensive.

But if we have notmuch as a cache of the tags, then don't we already
know the tree objects that need updating?  Yes, we would probably need
some consistency checks for when things don't work as planned, but in
the common case we ought to always know.
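
One way to picture "notmuch as a cache of the tags" is a reverse index kept beside the store, so the trees that reference a message can be looked up without walking git objects. The Python sketch below is purely illustrative: the class and method names are hypothetical, and it says nothing about how notmuch actually stores tags.

```python
from collections import defaultdict


class TagIndex:
    """Reverse index: message ID -> tags (tree objects) that reference it."""

    def __init__(self):
        self._tags_of = defaultdict(set)  # message ID -> set of tag names
        self._members = defaultdict(set)  # tag name -> set of message IDs

    def link(self, tag: str, msg_id: str) -> None:
        self._tags_of[msg_id].add(tag)
        self._members[tag].add(msg_id)

    def unlink(self, tag: str, msg_id: str) -> None:
        self._tags_of[msg_id].discard(tag)
        self._members[tag].discard(msg_id)

    def tags_referencing(self, msg_id: str) -> set:
        """Average-case constant-time lookup of the trees needing an update."""
        return set(self._tags_of.get(msg_id, ()))


if __name__ == "__main__":
    index = TagIndex()
    index.link("inbox", "abc123")
    index.link("patches", "abc123")
    print(sorted(index.tags_referencing("abc123")))
```

The consistency checks mentioned above would amount to rebuilding this index from the trees and comparing the two.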

Perhaps I'm misunderstanding these tree objects, and you're suggesting
that we don't even tell notmuch about them.

-Mark

Just poking my nose where it don't belong, since 1984.



[notmuch] Mail in git

2010-02-17 Thread Stewart Smith
On Tue, 16 Feb 2010 14:06:29 -0500, Ben Gamari wrote:
> Excerpts from Stewart Smith's message of Sun Feb 14 19:29:14 -0500 2010:
> > So... I sketched this out in my head at LCA... and it's taken a bit of
> > time to actually properly try it.
> > 
> In case anyone wanted to play around with this, I've written up my own
> little implementation[1] of a git mail import script. It's quite simple,
> but I felt it might be nice to have some public code to play around
> with. I get around 80 messages/second on my laptop and things are
> definitely quite IO bound. You get 1 commit per message, although I'm
> not entirely sure if this is the correct way to do things.
> 
> [1] git://goldnerlab.physics.umass.edu/git-mail

Using fast-import is interesting. Does it update the working tree? The
big thing I wanted to avoid was creating a working tree (another million
inodes being created is not ever what I need)

Also interesting is the mention of creating packs on the fly... this
could save the time in first writing the object and then packing it (as
my script does).

I'm going to play with this
-- 
Stewart Smith


[notmuch] Mail in git

2010-02-17 Thread Ben Gamari
Excerpts from martin f krafft's message of Tue Feb 16 20:21:01 -0500 2010:
> What I am wondering is if (explicit) tags couldn't be represented as
> tree-objects with this.
> 
>   evenless-link   - link a message object with a tree object
>   evenless-unlink - unlink a message object from tree object
> [replaces evenless-unlink]

I was actually wondering this very thing. I'd just be worried about tags
with large numbers of messages (presumably we would need an All tag,
that would contain a reference to every known message). It seems like
the simple act of adding a message to the repository could turn into an
extremely expensive operation.

Moreover, deleting a message could also be quite expensive as this will
require rewriting all of the tags that reference it. Surely, we would
need to batch these sort of operations to avoid disastrous performance.

However, even with batching, it seems we would face some pretty serious
scalability issues. I think if we were to implement tag storage in
trees, we'd need to use a multi-level tree. This way we could avoid
rewriting a tree object containing all of the tag's messages on every
change. I apologize if this was already obvious to everyone but me.

> 
> messages would then be deleted whenever using git-gc.
> 
> No idea how this would sync if we don't keep ancestry. Otoh, it
> would probably not be very expensive to do just that.

I think that keeping the ancestry would be quite important and would
come with relatively low overhead given the correct dereferencing of
data structures.

> 
> notmuch would then only search and provide the hash ID(s); tags
> would be a function of storage.
> 
> Is it possible to find out all trees that reference a given object
> with Git in constant or sub-linear time?
> 
I don't believe so. I think this is one of the reasons why git gc is so
expensive.

- Ben


Re: [notmuch] Mail in git

2010-02-17 Thread Stewart Smith

#!/usr/bin/perl -w
# evenless.pl: maildir to git using fast-import

use strict;

my $tree= "";    # unused

use IPC::Open2;

use File::stat;

my $FILES;       # accumulated "M <mode> :<mark> <path>" lines for the commit

my $mark= 1;     # next fast-import mark number

my $stripdir= $ARGV[0];

sub fastimport_blobs ($);
sub fastimport_blobs ($)
{
    my $dirname= shift @_;

    opendir (my $dirhandle, $dirname) or die "opendir $dirname: $!";
    foreach (readdir $dirhandle)
    {
        # Skip . and .. as well as mail client index files.
        next if /^\.\.?$/;
        next if /\.cmeta$/;
        next if /\.ibex.index$/;
        next if /\.ibex.index.data$/;
        next if /\.ev-summary$/;
        next if /\.ev-summary-meta$/;
        next if /\.notmuch$/;

        if (-d $dirname.'/'.$_)
        {
            print STDERR "Recursing into $_/ ";
            fastimport_blobs($dirname.'/'.$_);
            print STDERR "\n";
        }
        else
        {
            # Emit one blob per file and remember its mark.
            my $sb= stat("$dirname/$_");
            print FASTIMPORT "blob\n";
            print FASTIMPORT "mark :$mark\n";
            print FASTIMPORT "data ".($sb->size)."\n";
            open FILEIN, "<$dirname/$_" or die "open $dirname/$_: $!";
            my $content;
            sysread FILEIN, $content, $sb->size;
            close FILEIN;
            print FASTIMPORT $content;
            # Store the path relative to the directory given on the command line.
            my $storedir= "$dirname/$_";
            $storedir=~ s/^\Q$stripdir\E//;
            $storedir=~ s/^\///;
            $FILES.="M 0644 :$mark $storedir\n";
            $mark++;
        }
    }
}

open FASTIMPORT, "| git fast-import --date-format=rfc2822"
    or die "cannot run git fast-import: $!";

fastimport_blobs($ARGV[0]);

# A single commit that references every blob written above.
# (The committer address is shown as obfuscated by the list archive.)
print FASTIMPORT "commit refs/heads/master\n";
print FASTIMPORT 'committer EvenLess <evenle...@evenless> '.`date -R`;
print FASTIMPORT "data 11\n";
print FASTIMPORT "mail commit\n";
print FASTIMPORT $FILES;
print FASTIMPORT "\n";

close FASTIMPORT;




-- 
Stewart Smith


Re: [notmuch] Mail in git

2010-02-17 Thread Ben Gamari
Excerpts from martin f krafft's message of Wed Feb 17 20:58:47 -0500 2010:
> also sprach Ben Gamari bgam...@gmail.com [2010.02.18.1339 +1300]:
> > Yes, it would be linear in number of tags. I suppose if messages
> > weren't stored in the top-level tree nodes, then it would still be
> > linear, although with a slope equal to the reciprocal of the fan-out.
> > This has the potential to be very reasonable performance-wise.
> 
> Messages are never stored in tree nodes; all these do is store
> references to objects (blobs) holding messages. I bet you know this,
> but I just wanted to make it explicit.

Yep, I'm aware.

> So retagging is really just writing a new tree with a modified list
> of references.

Certainly; however, if you have a large tag (100,000 messages), this
list of references could easily be tens of megabytes. For this reason, it
seems like the added overhead of nesting trees would be well worth it.
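
A back-of-the-envelope sketch of the sizes involved, not a measurement. It assumes git's tree entry layout (ASCII mode, a space, the file name, a NUL byte, then the 20-byte binary SHA-1), sizes before compression, and hypothetical file name lengths. The function names are made up for this illustration.

```python
# Rough size of an uncompressed git tree object that lists n messages.
# Each entry: mode (e.g. "100644"), space, file name, NUL, 20-byte SHA-1.
SHA1_BYTES = 20


def tree_entry_size(name_len: int, mode: str = "100644") -> int:
    """Bytes used by one tree entry with a file name of name_len bytes."""
    return len(mode) + 1 + name_len + 1 + SHA1_BYTES


def flat_tree_size(n_messages: int, name_len: int) -> int:
    """Bytes rewritten when one entry of a single flat tree changes."""
    return n_messages * tree_entry_size(name_len)


def nested_rewrite_size(n_messages: int, name_len: int, fanout: int = 256) -> int:
    """Bytes rewritten with one level of fan-out: the top-level tree
    plus the one subtree that contains the changed entry."""
    top = fanout * tree_entry_size(2, mode="40000")  # two-character subtree names
    per_subtree = -(-n_messages // fanout)           # ceiling division
    return top + per_subtree * tree_entry_size(name_len)


n = 100_000
flat = flat_tree_size(n, 40)         # assumed: 40-character names
nested = nested_rewrite_size(n, 40)
```

Under these assumptions a flat tree for 100,000 messages lands in the megabyte range per retag, while a single level of 256-way fan-out rewrites only tens of kilobytes, which is the trade-off argued for above.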

- Ben


[notmuch] Mail in git

2010-02-16 Thread Ben Gamari
Excerpts from Stewart Smith's message of Sun Feb 14 19:29:14 -0500 2010:
> So... I sketched this out in my head at LCA... and it's taken a bit of
> time to actually properly try it.
> 
In case anyone wanted to play around with this, I've written up my own
little implementation[1] of a git mail import script. It's quite simple,
but I felt it might be nice to have some public code to play around
with. I get around 80 messages/second on my laptop and things are
definitely quite IO bound. You get 1 commit per message, although I'm
not entirely sure if this is the correct way to do things.

- Ben


[1] git://goldnerlab.physics.umass.edu/git-mail


[notmuch] Mail in git

2010-02-16 Thread Michal Sojka
Hi Stewart,

On Mon, 15 Feb 2010 11:29:14 +1100, Stewart Smith wrote:
> Which goes from a 15GB Maildir to a 3.7GB git repo.

That's quite an interesting ratio. I've tried a plain git add and git gc on
my mail store and the result was a repo of approximately 50% of mail
store size. Do you think that this difference might be caused by the way
you created the packs?

> 
> The algorithm of evenless.pl is basically:
> 1 get next directory entry
> 2 if is directory, recurse into it
> 3 write item to git (git hash-object -w)
> 4 add item to tree object
> 5 if number of items written = 1000
>   5.1 make pack of last 1000 items
> 6 goto 1

So it seems that you have all your mails in a single tree. How long does it
take to calculate the difference of two trees (git diff-tree --name-status)?
This operation will be needed by "notmuch new" to determine which
files/blobs to index. I suppose it will be better if mail blobs are
stored in subtrees. If a subtree is not changed git doesn't need to
descend to it because it has the same sha1.
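
To illustrate why subtrees help, here is a toy sketch of a tree diff that skips any subtree whose id is unchanged. It is not git's implementation: trees are plain Python dicts, the ids are made-up strings, and it assumes an entry never changes between blob and subtree.

```python
from typing import Dict, Iterator, Optional, Tuple

# A tree maps name -> (object_id, subtree or None for a blob).
Tree = Dict[str, Tuple[str, Optional[dict]]]


def diff_trees(old: Tree, new: Tree, prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (status, path) pairs: A(dded), D(eleted), M(odified)."""
    for name in sorted(old.keys() | new.keys()):
        path = prefix + name
        o, n = old.get(name), new.get(name)
        if o is not None and n is not None and o[0] == n[0]:
            continue  # same id: identical blob or whole subtree, no descent
        o_sub = o[1] if o is not None else None
        n_sub = n[1] if n is not None else None
        if o_sub is not None or n_sub is not None:
            # At least one side is a subtree: compare its entries.
            yield from diff_trees(o_sub or {}, n_sub or {}, path + "/")
        elif o is None:
            yield ("A", path)
        elif n is None:
            yield ("D", path)
        else:
            yield ("M", path)
```

With 256 subtrees and one new mail, only the top-level tree and one subtree are compared entry by entry; the other 255 are skipped on the id check alone.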

I think that storing mails in a similar structure as in .git/objects
(i.e. 256 subdirectories based on the first sha1 byte and file names
based on the remaining 38 hex digits of the sha1) would be reasonable.
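
A minimal sketch of that layout, assuming the mail is addressed by the id git gives a blob (the SHA-1 of the header "blob <size>\0" followed by the content). The function names are hypothetical.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Id git assigns to a blob: SHA-1 over 'blob <size>\\0' + content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def fanout_path(blob_id: str) -> str:
    """Split a 40-digit hex id into '<first 2 digits>/<remaining 38>'."""
    return f"{blob_id[:2]}/{blob_id[2:]}"


mail = b"From: someone@example.org\n\nhello\n"  # made-up message
path = fanout_path(git_blob_id(mail))
```

The two leading hex digits give at most 256 directories, so each subtree holds roughly 1/256 of the messages.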

> Next step?
> 
> Make notmuch be able to read mail out of it and add it to an index
> (oh, and some kind of verification and error checking about creating
> the git repo).

Besides using git to compact the size of the mail store, another feature that
comes with git for free is synchronization. For this to work, you only
need to store tags in the repo. What might work is to store tags in
files named <mail-name>.tags. The tags would be stored in the files
alphabetically, one tag per line. I guess that this makes it easy
to merge tags during synchronization even without writing a custom git
merge driver.
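
As a sketch of what merging such files amounts to, the snippet below treats each .tags file as a set of lines and does a three-way set merge. This is an assumption about the desired semantics, not something git does by itself (git's default merge is line-based), and the function names are hypothetical.

```python
def parse_tags(text: str) -> set:
    """One tag per line; blank lines are ignored."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def format_tags(tags: set) -> str:
    """Write tags alphabetically, one per line."""
    return "".join(tag + "\n" for tag in sorted(tags))


def merge_tags(base: str, ours: str, theirs: str) -> str:
    """Keep tags added on either side, drop tags removed on either side."""
    b, o, t = parse_tags(base), parse_tags(ours), parse_tags(theirs)
    removed = (b - o) | (b - t)
    return format_tags((o | t) - removed)
```

Because the files are sorted with one tag per line, additions and removals on different lines are usually independent hunks, which is why a plain line-based merge often gives the same result.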

Another point that must be solved if we would like to use git with
notmuch is the license problem. As was pointed out by Carl in another
thread, Git is licensed under GPLv2 only whereas notmuch is under GPLv3,
and these licences are incompatible. So I think we will need some kind of
hooks in notmuch from which external programs (git) will be called.

Cheers,
 Michal


Re: [notmuch] Mail in git

2010-02-16 Thread martin f krafft
also sprach Stewart Smith stew...@flamingspork.com [2010.02.15.1329 +1300]:
> What about adding more mail to the archive?
> 
> So the way I think is that you use a Maildir for day to day mail
> (e.g. delivery) and every so often you run some magic command that
> takes old mail out of the Maildir and stores it in the git repo.

Either that, or the other idea we had (which I prefer), which would
basically be:

  evenless-submit - add a new mail (and return a hash ID)
                    and invoke a hook, e.g. to let notmuch know
  evenless-cat    - print the full mail given ID with headers to stdout
  evenless-delete - unlink a mail identified by hash ID
                    and invoke a hook, e.g. to let notmuch know

If we expose the submit and delete functionality at the notmuch
level, then we don't need the hooks, for then evenless would be
plumbing.
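
A minimal sketch of what such an interface could look like, purely as an illustration. The class name `MailStore`, the hook signature, the in-memory dict, and the use of a plain SHA-1 of the content as the ID are all assumptions; a real evenless would sit on top of git's object store.

```python
import hashlib
from typing import Callable, Dict, List

Hook = Callable[[str, str], None]  # called as hook(event, mail_id)


class MailStore:
    """Toy stand-in for evenless-submit / evenless-cat / evenless-delete."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}
        self._hooks: List[Hook] = []

    def add_hook(self, hook: Hook) -> None:
        """Register a callback, e.g. one that tells notmuch to (un)index."""
        self._hooks.append(hook)

    def _notify(self, event: str, mail_id: str) -> None:
        for hook in self._hooks:
            hook(event, mail_id)

    def submit(self, mail: bytes) -> str:
        """Store a mail, run the hooks, and return its hash ID."""
        mail_id = hashlib.sha1(mail).hexdigest()
        self._objects[mail_id] = mail
        self._notify("submit", mail_id)
        return mail_id

    def cat(self, mail_id: str) -> bytes:
        """Return the full mail, headers included."""
        return self._objects[mail_id]

    def delete(self, mail_id: str) -> None:
        """Drop a mail by hash ID and run the hooks."""
        del self._objects[mail_id]
        self._notify("delete", mail_id)
```

If notmuch itself called `submit` and `delete`, the hook list would be unnecessary, which is the plumbing arrangement described above.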

Anything to avoid a cronjob would be good, I think.

Then we need a notmuch backend for mutt etc.. For those who still
want to use a regular Maildir, let them use the worktree.

What I am wondering is if (explicit) tags couldn't be represented as
tree-objects with this.

  evenless-link   - link a message object with a tree object
  evenless-unlink - unlink a message object from tree object
                    [replaces evenless-unlink]

Messages no longer linked from any tree would then be deleted whenever
git-gc runs.

No idea how this would sync if we don't keep ancestry. Otoh, it
would probably not be very expensive to do just that.

notmuch would then only search and provide the hash ID(s); tags
would be a function of storage.

Is it possible to find out all trees that reference a given object
with Git in constant or sub-linear time?

-- 
martin | http://madduck.net/ | http://two.sentenc.es/
 
the question of whether computers can think
 is like the question of whether submarines can swim.
 -- edsger w. dijkstra
 
spamtraps: madduck.bo...@madduck.net


[notmuch] Mail in git

2010-02-15 Thread Stewart Smith
So... I sketched this out in my head at LCA... and it's taken a bit of
time to actually properly try it.

The problem is:
A simple 'find ~/Maildir' takes 10 minutes, and if you write the
output to a file, it's 88MB+.

There are "only" about 900,000 entries there. But this means 900,000
files, which is a non-trivial amount. Some mail folders are quite
large too.

Some of this problem could just be solved by using notmuch a bit
differently (folder per month for example).

However... this is a one-way change and going back would be very
tricky.

There's also the backup problem. Iterating through ~1 million inodes
takes a *LONG* time. Restoring it takes even longer (think about
writing all that data to the file system journal).

Historically, if I was running a backup, I couldn't really use my
laptop; it'd be saturated with disk IO performing the file system
dump. It would also take many hours.

Restoring from backup? about 8hrs.

An observation is that mail never changes. It may be reclassified (and
that's what notmuch is for), but it never changes.

We really just want a way to store and access many many many small
blobs of data that never change.

It turns out git is pretty good at that. Underneath, we could just use
it as an object store (a simple git-hash-object and git-cat-file test
confirmed this to be pretty simple to do). Even better, since a lot
of mail is fairly similar, we can use delta compression between mail
messages to reduce the storage space. Git is pretty good at that too.

A few giant git packs will be much quicker to back up and restore than
1 million files.

So... I wrote a script to test it

$ time perl /home/stewart/evenless.pl /home/stewart/Maildir/

real    841m41.491s
user    491m3.200s
sys     261m58.080s

Which goes from a 15GB Maildir to a 3.7GB git repo.

The algorithm of evenless.pl is basically:
1 get next directory entry
2 if is directory, recurse into it
3 write item to git (git hash-object -w)
4 add item to tree object
5 if number of items written = 1000
  5.1 make pack of last 1000 items
6 goto 1
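
The steps above can be sketched roughly as follows. This is not evenless.pl: `write_object` is a placeholder for `git hash-object -w` that only computes an id, `make_pack` is a caller-supplied callback standing in for the packing step, and all names are hypothetical.

```python
import hashlib
import os
from typing import Callable, List, Optional, Tuple


def stand_in_write_object(data: bytes) -> str:
    """Placeholder for `git hash-object -w`: returns an id, stores nothing."""
    return hashlib.sha1(data).hexdigest()


def import_maildir(
    root: str,
    batch_size: int = 1000,
    write_object: Callable[[bytes], str] = stand_in_write_object,
    make_pack: Optional[Callable[[List[str]], None]] = None,
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Walk root and return (tree entries, ids not yet packed)."""
    tree: List[Tuple[str, str]] = []  # (object id, path relative to root)
    pending: List[str] = []           # ids written since the last pack
    for dirpath, dirnames, filenames in os.walk(root):  # steps 1 and 2
        dirnames.sort()                                 # deterministic order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                oid = write_object(fh.read())            # step 3
            tree.append((oid, os.path.relpath(path, root)))  # step 4
            pending.append(oid)
            if len(pending) == batch_size:               # step 5
                if make_pack is not None:
                    make_pack(pending)                   # step 5.1
                pending = []
    return tree, pending
```

As in the listed algorithm, items left over after the last full batch are returned unpacked; the caller decides what to do with them.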

$ git count-objects -v
count: 479
size: 27680
in-pack: 873109
packs: 1084
size-pack: 3746219
prune-packable: 0
garbage: 0
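
For reading these numbers, a small sketch that parses the output above. It relies on `git count-objects -v` reporting `size` and `size-pack` in KiB; the function name is made up.

```python
def parse_count_objects(text: str) -> dict:
    """Turn 'key: value' lines from `git count-objects -v` into ints."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            stats[key.strip()] = int(value)
    return stats


OUTPUT = """\
count: 479
size: 27680
in-pack: 873109
packs: 1084
size-pack: 3746219
prune-packable: 0
garbage: 0
"""

stats = parse_count_objects(OUTPUT)
total_objects = stats["count"] + stats["in-pack"]  # loose + packed objects
pack_gib = stats["size-pack"] / 1024 ** 2          # KiB -> GiB
```

Read this way, the packs hold about 3.6 GiB across 1084 pack files, with 479 objects still loose.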

If I did a "git checkout", about 8 hours later I'd have a directory
tree exactly the same as my Maildir.

Why didn't I just git-add everything? I didn't exactly feel like
creating another giant copy of my mail (that also takes a long time).

What about adding more mail to the archive?

So the way I think is that you use a Maildir for day to day mail (e.g.
delivery) and every so often you run some magic command that takes old
mail out of the Maildir and stores it in the git repo.

Next step?

Make notmuch be able to read mail out of it and add it to an index
(oh, and some kind of verification and error checking about creating
the git repo).
-- 
Stewart Smith

