[notmuch] nested tag trees (was: Mail in git)

2010-02-19 Thread martin f krafft
also sprach Ben Gamari  [2010.02.18.1810 +1300]:
> Yeah, this is a bit of a bummer. This is really a stretch, but I wonder
> if the git folks would accept patches/minor database semantics changes
> in the name of making git more flexible as a general purpose object
> database. I really doubt it, but you never know.

I am pretty sure they won't. Git is a content tracker, not a general
purpose filesystem. It's a bit of a shame.

> > Instead of nested subtrees, think of 16 subtrees forming
> > a level-1 hash table, or 256 for level-2, which really *ought*
> > to be enough.
> > 
> > Anyway, rewriting a tree object is pretty much exactly the same
> > as removing a line (e.g. a message ID) from a file (e.g. a tag),
> > as that file would have to be fully rewritten.
> > 
> This is very true, but exactly do you mean by this statement?

That any form of tag-to-message mapping will be expensive when you
have a million messages referenced. If you used symlinks like mairix
does, any manipulation would require changes to the directory index,
which ? curiously ? functions much like the subtree approach you
proposed.

-- 
martin | http://madduck.net/ | http://two.sentenc.es/

"the faster i go, the behinder i get."
-- lewis carroll

spamtraps: madduck.bogus at madduck.net
-- next part --
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: Digital signature (see http://martin-krafft.net/gpg/)
URL: 



[notmuch] nested tag trees (was: Mail in git)

2010-02-19 Thread Michal Sojka
On Fri, 19 Feb 2010 13:31:15 +1300, martin f krafft  
wrote:
> also sprach Ben Gamari  [2010.02.18.1810 +1300]:
> > > Instead of nested subtrees, think of 16 subtrees forming
> > > a level-1 hash table, or 256 for level-2, which really *ought*
> > > to be enough.
> > > 
> > > Anyway, rewriting a tree object is pretty much exactly the same
> > > as removing a line (e.g. a message ID) from a file (e.g. a tag),
> > > as that file would have to be fully rewritten.
> > > 
> > This is very true, but exactly do you mean by this statement?
> 
> That any form of tag-to-message mapping will be expensive when you
> have a million messages referenced. If you used symlinks like mairix
> does, any manipulation would require changes to the directory index,
> which ? curiously ? functions much like the subtree approach you
> proposed.

Why do you want to store tag-to-message mapping in git? This is IMHO
perfectly solved by Xapian so storing message-to-tag mapping would be
sufficient, wouldn't it?

Michal


[notmuch] nested tag trees (was: Mail in git)

2010-02-19 Thread Ben Gamari
Excerpts from Michal Sojka's message of Fri Feb 19 04:52:18 -0500 2010:
> Why do you want to store tag-to-message mapping in git? This is IMHO
> perfectly solved by Xapian so storing message-to-tag mapping would be
> sufficient, wouldn't it?
> 
In my case, I would like to keep the entire state of my mail store
synchronized between multiple machines. This includes both messages and
metadata alike. It seems clear that Xapian would still be necessary for
querying in reaonable time, but I feel like tag storage itself should
have support beyond just the indexer.

- Ben


Re: [notmuch] nested tag trees (was: Mail in git)

2010-02-19 Thread Michal Sojka
On Fri, 19 Feb 2010 13:31:15 +1300, martin f krafft madd...@madduck.net wrote:
 also sprach Ben Gamari bgam...@gmail.com [2010.02.18.1810 +1300]:
   Instead of nested subtrees, think of 16 subtrees forming
   a level-1 hash table, or 256 for level-2, which really *ought*
   to be enough.
   
   Anyway, rewriting a tree object is pretty much exactly the same
   as removing a line (e.g. a message ID) from a file (e.g. a tag),
   as that file would have to be fully rewritten.
   
  This is very true, but exactly do you mean by this statement?
 
 That any form of tag-to-message mapping will be expensive when you
 have a million messages referenced. If you used symlinks like mairix
 does, any manipulation would require changes to the directory index,
 which — curiously — functions much like the subtree approach you
 proposed.

Why do you want to store tag-to-message mapping in git? This is IMHO
perfectly solved by Xapian so storing message-to-tag mapping would be
sufficient, wouldn't it?

Michal
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


Re: [notmuch] nested tag trees (was: Mail in git)

2010-02-19 Thread Ben Gamari
Excerpts from Michal Sojka's message of Fri Feb 19 04:52:18 -0500 2010:
 Why do you want to store tag-to-message mapping in git? This is IMHO
 perfectly solved by Xapian so storing message-to-tag mapping would be
 sufficient, wouldn't it?
 
In my case, I would like to keep the entire state of my mail store
synchronized between multiple machines. This includes both messages and
metadata alike. It seems clear that Xapian would still be necessary for
querying in reaonable time, but I feel like tag storage itself should
have support beyond just the indexer.

- Ben
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


[notmuch] nested tag trees (was: Mail in git)

2010-02-18 Thread martin f krafft
also sprach Ben Gamari  [2010.02.18.1744 +1300]:
> I believe you would. The problem isn't the messages (well, that's
> a problem too), it's the fact that the tree (e.g. tab) objects
> which reference the messages are immutable (I believe). This
> presents us with the difficult circumstance of being unable to
> modify a tag after it has been created. Therefore, as far as I can
> tell, we need to rewrite the tag's tree object whenever we add or
> remove a message. This was the reason I suggested nesting tag
> trees, although this only partially solves the issue.

You are absolutely right, and I think nesting tag trees is an
interesting idea to pursue. It *would* make it impossible to ever
check out the metatree into the filesystem, or rather result in
subdirectories that the user shouldn't need to worry about.

Instead of nested subtrees, think of 16 subtrees forming a level-1
hash table, or 256 for level-2, which really *ought* to be enough.

Anyway, rewriting a tree object is pretty much exactly the same as
removing a line (e.g. a message ID) from a file (e.g. a tag), as
that file would have to be fully rewritten.

> > This can probably be further optimised, but still: it's not
> > quite as nice as enumerating all parents of a message in O(1)
> > time (which would still result in O(m?n)).
> > 
> Yeah, I'm not sure how well this would scale on truly massive mail
> stores.

The more I think about this, the more I want to implement this
between evenless and Git, i.e. as a porcelain layer, since then
I could also use it for vcs-home[0]. In fact, maybe one day we can
store ~ and mail all in one Git repo, with different porcelains for
different use-cases, and notmuch indexing it all anyway. ;)

0. http://vcs-home.madduck.net

Let's continue the technical discussion on the Git list, okay?

http://marc.info/?l=git=126646636824600=2
id:20100218041240.GA4127 at lapse.rw.madduck.net

-- 
martin | http://madduck.net/ | http://two.sentenc.es/

"i hate vulgar realism in literature. the man who could call a spade
 a spade should be compelled to use one. it is the only thing he is
 fit for."
-- oscar wilde

spamtraps: madduck.bogus at madduck.net
-- next part --
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: Digital signature (see http://martin-krafft.net/gpg/)
URL: 



[notmuch] nested tag trees (was: Mail in git)

2010-02-18 Thread martin f krafft
[Taking a private message back to the list with permission]

also sprach Ben Gamari  [2010.02.18.1620 +1300]:
> This is a very good point. From what I've read about the database
> format, I can't think of any way that reverse dependencies could be
> easily found, unfortunately. If there really is no way to do this, then
> we could have a problem. I'm not sure rewriting tens of megabytes
> everytime you receive a mail message is acceptable.

You would not need to do that, since the messages don't change, and
thus their blobs remain the same.

However, for every manipulation of a message, you would need to
iterate *all* tag trees (O(n)) and update the ones referencing the
message (also O(n)).

The entire process will still be O(n) per message, and O(m?n) for
all:

  messages=[list of messages]
  add_tags=[list of tags to add]
  remove_tags=[list of tags to remove]
  tagtrees=[all tag trees]
  trees_to_update=[]

  for t in remove_tags:
if intersection(t.tree.children, messages):
  T = new_tree(t.name)
  write_tree(T, t.tree.children - messages)
  write_tree(t.tree, [])
  t.tree = T

  for t in add_tags:
t.tree = new_tree(t.name)
rewrite_tree(t.tree, messages)

This can probably be further optimised, but still: it's not quite as
nice as enumerating all parents of a message in O(1) time (which
would still result in O(m?n)).

-- 
martin | http://madduck.net/ | http://two.sentenc.es/

"... (ethik und ?sthetik sind eins.)"
   -- wittgenstein

spamtraps: madduck.bogus at madduck.net
-- next part --
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: Digital signature (see http://martin-krafft.net/gpg/)
URL: 



[notmuch] nested tag trees (was: Mail in git)

2010-02-18 Thread martin f krafft
also sprach martin f krafft  [2010.02.18.1548 +1300]:
> True ? iff we find a way to enumerate trees referencing a given
> blob or tree so that we can walk up the hierarchy. I could look
> right now, but I am about to cross half of the globe tomorrow, so
> I have other things I should rather be doing. Sorry.

http://marc.info/?l=git=126646636824600=2
id:20100218041240.GA4127 at lapse.rw.madduck.net

-- 
martin | http://madduck.net/ | http://two.sentenc.es/

"although occasionally there is something to be said for solitude."
  -- special agent dale cooper

spamtraps: madduck.bogus at madduck.net
-- next part --
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: Digital signature (see http://martin-krafft.net/gpg/)
URL: 



[notmuch] nested tag trees (was: Mail in git)

2010-02-18 Thread martin f krafft
also sprach Ben Gamari  [2010.02.18.1519 +1300]:
> > So retagging is really just writing a new tree with a modified list
> > of references.
> > 
> Certainly, however if you have a large tag (>100,000 messages), this
> list of reference could easily be tens of megabytes. For this reason, it
> seems like the added overhead of nesting trees would be well worth it.

True ? iff we find a way to enumerate trees referencing a given
blob or tree so that we can walk up the hierarchy. I could look
right now, but I am about to cross half of the globe tomorrow, so
I have other things I should rather be doing. Sorry.

-- 
martin | http://madduck.net/ | http://two.sentenc.es/

"men always want to be a woman's first love.
 women have a more subtle instinct:
 what they like is to be a man's last romance."
-- oscar wilde

spamtraps: madduck.bogus at madduck.net
-- next part --
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: Digital signature (see http://martin-krafft.net/gpg/)
URL: 



[notmuch] nested tag trees (was: Mail in git)

2010-02-18 Thread Ben Gamari
Excerpts from martin f krafft's message of Wed Feb 17 23:59:43 -0500 2010:
> also sprach Ben Gamari  [2010.02.18.1744 +1300]:
> > I believe you would. The problem isn't the messages (well, that's
> > a problem too), it's the fact that the tree (e.g. tab) objects
> > which reference the messages are immutable (I believe). This
> > presents us with the difficult circumstance of being unable to
> > modify a tag after it has been created. Therefore, as far as I can
> > tell, we need to rewrite the tag's tree object whenever we add or
> > remove a message. This was the reason I suggested nesting tag
> > trees, although this only partially solves the issue.
> 
> You are absolutely right, and I think nesting tag trees is an
> interesting idea to pursue. It *would* make it impossible to ever
> check out the metatree into the filesystem, or rather result in
> subdirectories that the user shouldn't need to worry about.
> 
Yeah, this is a bit of a bummer. This is really a stretch, but I wonder
if the git folks would accept patches/minor database semantics changes
in the name of making git more flexible as a general purpose object
database. I really doubt it, but you never know.

> Instead of nested subtrees, think of 16 subtrees forming a level-1
> hash table, or 256 for level-2, which really *ought* to be enough.
> 
> Anyway, rewriting a tree object is pretty much exactly the same as
> removing a line (e.g. a message ID) from a file (e.g. a tag), as
> that file would have to be fully rewritten.
> 
This is very true, but exactly do you mean by this statement?

> > Yeah, I'm not sure how well this would scale on truly massive mail
> > stores.
> 
> The more I think about this, the more I want to implement this
> between evenless and Git, i.e. as a porcelain layer, since then
> I could also use it for vcs-home[0]. In fact, maybe one day we can
> store ~ and mail all in one Git repo, with different porcelains for
> different use-cases, and notmuch indexing it all anyway. ;)

It would be nice if git just didn't attach so many semantics to its
object types and left more up to the porcelain. Git is a fantastic
database, unfortunately it seems you need to work around a lot of VCS
behavior in order to make use of it in a non-VCS application. Attaching
less meaning to database objects would make things substantially easier.

> Let's continue the technical discussion on the Git list, okay?
> 

Yep. As soon as Majordomo sends me my confirmation.

Cheers,

- Ben


[notmuch] nested tag trees (was: Mail in git)

2010-02-17 Thread Ben Gamari
Excerpts from martin f krafft's message of Wed Feb 17 22:46:13 -0500 2010:
> You ought to have sent to the list, and I want to send mine there
> too, so please give permission.
> 
Oops! Sorry about that. Damn you sup.

> also sprach Ben Gamari  [2010.02.18.1620 +1300]:
> > This is a very good point. From what I've read about the database
> > format, I can't think of any way that reverse dependencies could be
> > easily found, unfortunately. If there really is no way to do this, then
> > we could have a problem. I'm not sure rewriting tens of megabytes
> > everytime you receive a mail message is acceptable.
> 
> You would not need to do that, since the messages don't change, and
> thus their blobs remain the same.

I believe you would. The problem isn't the messages (well, that's a
problem too), it's the fact that
the tree (e.g. tab) objects which reference the messages are immutable
(I believe). This presents us with the difficult
circumstance of being unable to modify a tag after it has been created.
Therefore, as far as I can tell, we need to rewrite the tag's tree
object whenever we add or remove a message. This was the reason I
suggested nesting tag trees, although this only partially solves the
issue.

(Please correct me if I'm wrong about any/all of the above)

> 
> However, for every manipulation of a message, you would need to
> iterate *all* tag trees (O(n)) and update the ones referencing the
> message (also O(n)).
> 
This is definitely an issue.

> This can probably be further optimised, but still: it's not quite as
> nice as enumerating all parents of a message in O(1) time (which
> would still result in O(m?n)).
> 
Yeah, I'm not sure how well this would scale on truly massive
mail stores.

Cheers,

- Ben


Re: [notmuch] nested tag trees (was: Mail in git)

2010-02-17 Thread martin f krafft
[Taking a private message back to the list with permission]

also sprach Ben Gamari bgam...@gmail.com [2010.02.18.1620 +1300]:
 This is a very good point. From what I've read about the database
 format, I can't think of any way that reverse dependencies could be
 easily found, unfortunately. If there really is no way to do this, then
 we could have a problem. I'm not sure rewriting tens of megabytes
 everytime you receive a mail message is acceptable.

You would not need to do that, since the messages don't change, and
thus their blobs remain the same.

However, for every manipulation of a message, you would need to
iterate *all* tag trees (O(n)) and update the ones referencing the
message (also O(n)).

The entire process will still be O(n) per message, and O(m×n) for
all:

  messages=[list of messages]
  add_tags=[list of tags to add]
  remove_tags=[list of tags to remove]
  tagtrees=[all tag trees]
  trees_to_update=[]

  for t in remove_tags:
if intersection(t.tree.children, messages):
  T = new_tree(t.name)
  write_tree(T, t.tree.children - messages)
  write_tree(t.tree, [])
  t.tree = T

  for t in add_tags:
t.tree = new_tree(t.name)
rewrite_tree(t.tree, messages)

This can probably be further optimised, but still: it's not quite as
nice as enumerating all parents of a message in O(1) time (which
would still result in O(m×n)).

-- 
martin | http://madduck.net/ | http://two.sentenc.es/
 
... (ethik und ästhetik sind eins.)
   -- wittgenstein
 
spamtraps: madduck.bo...@madduck.net


digital_signature_gpg.asc
Description: Digital signature (see http://martin-krafft.net/gpg/)
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


Re: [notmuch] nested tag trees (was: Mail in git)

2010-02-17 Thread martin f krafft
also sprach Ben Gamari bgam...@gmail.com [2010.02.18.1744 +1300]:
 I believe you would. The problem isn't the messages (well, that's
 a problem too), it's the fact that the tree (e.g. tab) objects
 which reference the messages are immutable (I believe). This
 presents us with the difficult circumstance of being unable to
 modify a tag after it has been created. Therefore, as far as I can
 tell, we need to rewrite the tag's tree object whenever we add or
 remove a message. This was the reason I suggested nesting tag
 trees, although this only partially solves the issue.

You are absolutely right, and I think nesting tag trees is an
interesting idea to pursue. It *would* make it impossible to ever
check out the metatree into the filesystem, or rather result in
subdirectories that the user shouldn't need to worry about.

Instead of nested subtrees, think of 16 subtrees forming a level-1
hash table, or 256 for level-2, which really *ought* to be enough.

Anyway, rewriting a tree object is pretty much exactly the same as
removing a line (e.g. a message ID) from a file (e.g. a tag), as
that file would have to be fully rewritten.

  This can probably be further optimised, but still: it's not
  quite as nice as enumerating all parents of a message in O(1)
  time (which would still result in O(m×n)).
  
 Yeah, I'm not sure how well this would scale on truly massive mail
 stores.

The more I think about this, the more I want to implement this
between evenless and Git, i.e. as a porcelain layer, since then
I could also use it for vcs-home[0]. In fact, maybe one day we can
store ~ and mail all in one Git repo, with different porcelains for
different use-cases, and notmuch indexing it all anyway. ;)

0. http://vcs-home.madduck.net

Let's continue the technical discussion on the Git list, okay?

http://marc.info/?l=gitm=126646636824600w=2
id:20100218041240.ga4...@lapse.rw.madduck.net

-- 
martin | http://madduck.net/ | http://two.sentenc.es/
 
i hate vulgar realism in literature. the man who could call a spade
 a spade should be compelled to use one. it is the only thing he is
 fit for.
-- oscar wilde
 
spamtraps: madduck.bo...@madduck.net


digital_signature_gpg.asc
Description: Digital signature (see http://martin-krafft.net/gpg/)
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


Re: [notmuch] nested tag trees (was: Mail in git)

2010-02-17 Thread Ben Gamari
Excerpts from martin f krafft's message of Wed Feb 17 23:59:43 -0500 2010:
 also sprach Ben Gamari bgam...@gmail.com [2010.02.18.1744 +1300]:
  I believe you would. The problem isn't the messages (well, that's
  a problem too), it's the fact that the tree (e.g. tab) objects
  which reference the messages are immutable (I believe). This
  presents us with the difficult circumstance of being unable to
  modify a tag after it has been created. Therefore, as far as I can
  tell, we need to rewrite the tag's tree object whenever we add or
  remove a message. This was the reason I suggested nesting tag
  trees, although this only partially solves the issue.
 
 You are absolutely right, and I think nesting tag trees is an
 interesting idea to pursue. It *would* make it impossible to ever
 check out the metatree into the filesystem, or rather result in
 subdirectories that the user shouldn't need to worry about.
 
Yeah, this is a bit of a bummer. This is really a stretch, but I wonder
if the git folks would accept patches/minor database semantics changes
in the name of making git more flexible as a general purpose object
database. I really doubt it, but you never know.

 Instead of nested subtrees, think of 16 subtrees forming a level-1
 hash table, or 256 for level-2, which really *ought* to be enough.
 
 Anyway, rewriting a tree object is pretty much exactly the same as
 removing a line (e.g. a message ID) from a file (e.g. a tag), as
 that file would have to be fully rewritten.
 
This is very true, but exactly do you mean by this statement?

  Yeah, I'm not sure how well this would scale on truly massive mail
  stores.
 
 The more I think about this, the more I want to implement this
 between evenless and Git, i.e. as a porcelain layer, since then
 I could also use it for vcs-home[0]. In fact, maybe one day we can
 store ~ and mail all in one Git repo, with different porcelains for
 different use-cases, and notmuch indexing it all anyway. ;)

It would be nice if git just didn't attach so many semantics to its
object types and left more up to the porcelain. Git is a fantastic
database, unfortunately it seems you need to work around a lot of VCS
behavior in order to make use of it in a non-VCS application. Attaching
less meaning to database objects would make things substantially easier.

 Let's continue the technical discussion on the Git list, okay?
 

Yep. As soon as Majordomo sends me my confirmation.

Cheers,

- Ben
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch