Re: Reimagining notmuch-git/nmbug
Felipe Contreras writes:

> On Tue, Apr 4, 2023 at 12:54 PM David Bremner wrote:
>>
>> This sounds right. Can we use the detection of missing messages in
>> wr_export to reset the appropriate counters? It looks like yes, given
>> the call to store_lastmod.

[snip]

> I would rather go for a solution that is less hacky, and has less
> chance of leaving the user in an unrecoverable state.

Fair enough. Certainly notmuch-git has too many accumulated performance
hacks.
___
notmuch mailing list -- notmuch@notmuchmail.org
To unsubscribe send an email to notmuch-le...@notmuchmail.org
Re: Reimagining notmuch-git/nmbug
On Tue, Apr 4, 2023 at 12:54 PM David Bremner wrote:
>
> Felipe Contreras writes:
>
> > On Mon, Apr 3, 2023 at 6:37 PM David Bremner wrote:
>
> > Or we could say that after jumping a certain threshold of lastmod we
> > delete all the messages and start from scratch, perhaps every 1000
> > revisions.
> >
> > Or maybe the query could generate a virtual tag if a message was
> > deleted since the previous lastmod (e.g. "nm::deleted"). Then it would
> > be trivial for the remote helper to tell that to git.
>
> A complication here is that tags have to be attached to mail message
> documents in the database, so we would need to generate a so-called
> "ghost message", and clean those up somehow.

I thought a little bit more about how I would use git-notmuch, and I
don't see the point in tracking messages that have no tags.

In my view the whole point of the tool is to back up the tags, and the
whole point of a backup is to eventually be able to restore it. But if
there's nothing to restore for a specific message, it might as well not
exist.

So instead of a `nm::deleted` tag, just no tags. I think from the point
of view of git-notmuch it shouldn't make a difference.

> > I lean towards the threshold, because that way the user doesn't need
> > to do anything, and there's no modifications needed in libnotmuch.
>
> This sounds right. Can we use the detection of missing messages in
> wr_export to reset the appropriate counters? It looks like yes, given
> the call to store_lastmod.

We would need to store them and use that information in the next fetch.
Although doable, it seems hacky, and in the past such things have led
to problems that are hard to solve due to inconsistent states.

For example, what happens if in the next fetch we tell git that some
files have been removed, but we crash in the middle of it? On the fetch
after that we'll tell git again that those files were removed, but git
might think they don't exist and fail.

I think git was fixed for that particular problem, so that it doesn't
update the files unless the program exits successfully, but I don't
know.

I would rather go for a solution that is less hacky, and has less
chance of leaving the user in an unrecoverable state.

Cheers.

-- 
Felipe Contreras
Re: Reimagining notmuch-git/nmbug
Felipe Contreras writes:

> On Mon, Apr 3, 2023 at 6:37 PM David Bremner wrote:

> Or we could say that after jumping a certain threshold of lastmod we
> delete all the messages and start from scratch, perhaps every 1000
> revisions.
>
> Or maybe the query could generate a virtual tag if a message was
> deleted since the previous lastmod (e.g. "nm::deleted"). Then it would
> be trivial for the remote helper to tell that to git.

A complication here is that tags have to be attached to mail message
documents in the database, so we would need to generate a so-called
"ghost message", and clean those up somehow.

> I lean towards the threshold, because that way the user doesn't need
> to do anything, and there's no modifications needed in libnotmuch.

This sounds right. Can we use the detection of missing messages in
wr_export to reset the appropriate counters? It looks like yes, given
the call to store_lastmod.
Re: Reimagining notmuch-git/nmbug
Felipe Contreras writes:

>
> I'm not familiar with git-annex, I would need to see an example of
> such merging happening.

I was confused: git-annex is using the builtin merge strategy "union",
which does not eliminate duplicates or sort, so it is probably not
applicable here. I still have to try some merges between different
machines to see what kind of conflicts can arise.

> One advantage of using the fast-import format is that it's easy to
> change it, or support multiple formats.
>
> In fact, the format could be specified in the URL, like
> `nm::1:$HOME/mail` for the current notmuch-git format, and
> `nm::2:$HOME/mail` for the new.

This might also be a way to handle the "prefix" setting that nmbug /
notmuch-git needs to only sync certain (e.g. notmuch::*) tags.
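The merge behaviour being asked for here (a union that also removes
duplicates and sorts) can be sketched in a few lines of Ruby. This is
only an illustration under stated assumptions: `union_tags` is a
hypothetical helper, and the one-tag-per-line file layout is assumed
rather than taken from git-remote-nm.

```ruby
# Hypothetical helper: merge two versions of a per-message tags file
# into a sorted, duplicate-free union.
# Assumption: each file holds one tag per line.
def union_tags(ours, theirs)
  tags = (ours.lines + theirs.lines)
         .map(&:strip)       # drop newlines and stray whitespace
         .reject(&:empty?)   # ignore blank lines
         .uniq               # eliminate duplicates
         .sort               # keep the output stable across machines
  tags.map { |t| "#{t}\n" }.join
end

puts union_tags("inbox\nunread\n", "unread\nnotmuch::patch\n")
```

A custom git merge driver built around such a function would be
configured with `%O %A %B` placeholders and would leave its result in
the `%A` file. Because a plain union ignores the common ancestor, it can
never propagate the removal of a tag; a three-way variant would be
needed for that.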
Re: Reimagining notmuch-git/nmbug
On Mon, Apr 3, 2023 at 5:46 AM David Bremner wrote:
>
> David Bremner writes:
>
> >
> > I'm intrigued (and indeed I hadn't really thought about the degree to
> > which we were re-inventing git-fast-import and friends); however so far
> > my experiments did not get far enough to say anything conclusive.
> >
>
> I did manage to finish, about 70 minutes elapsed.
>
> Although you're probably right that a file of tags is the right
> representation (it is what git-annex uses also), I think we'd need to
> define a custom merge driver to take unions of lists in the same way
> that git-annex does. Otherwise merging will be less automagic than it is
> now.

I'm not familiar with git-annex, I would need to see an example of such
merging happening.

One advantage of using the fast-import format is that it's easy to
change it, or support multiple formats.

In fact, the format could be specified in the URL, like
`nm::1:$HOME/mail` for the current notmuch-git format, and
`nm::2:$HOME/mail` for the new.

-- 
Felipe Contreras
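The versioned-URL idea could be handled with a small parser on the
helper side. Git passes a remote helper the part of the URL after
`nm::`, so the helper would see something like `2:/home/user/mail`. The
sketch below is hypothetical: `parse_address` is not part of
git-remote-nm, and defaulting to format 1 when no prefix is given is an
assumption made for the example.

```ruby
# Hypothetical parser for the address part of an "nm::" URL.
# Returns [format_version, database_path].
# Assumption: an address without a numeric prefix means format 1.
def parse_address(address)
  if (m = address.match(/\A(\d+):(.+)\z/))
    [m[1].to_i, m[2]]
  else
    [1, address]
  end
end

p parse_address('2:/home/user/mail')  # => [2, "/home/user/mail"]
p parse_address('/home/user/mail')    # => [1, "/home/user/mail"]
```

The same prefix position could later carry other options, but any such
syntax would have to be agreed on first; nothing in the thread fixes it.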
Re: Reimagining notmuch-git/nmbug
On Mon, Apr 3, 2023 at 6:37 PM David Bremner wrote:
>
> Felipe Contreras writes:
>
> >
> > That should work to update existing tags, but how are we going to
> > detect if a message has disappeared? Or is that not a thing?
>
> Indeed the same thought had occurred to me not long ago. I remembered
> (belatedly) that I'd been through some similar thought process with
> nmbug. Messages can and do disappear. So I guess that optimization is
> not OK, at least not without some complications.
>
> > Does "lastmod:0.." get all the revisions? If so, it might make sense
> > to set $lastmod to 0 initially.
> >
> > Then we could unconditionally do:
> >
> >   $db.query('lastmod:%d..' % $lastmod, sort: Notmuch::SORT_UNSORTED)
>
> That would work, but as you point out, we'd need to deal with deletions
> somehow. It occurs to me that wr_export also needs to be able to handle
> disappearing message-ids. I suppose like notmuch-restore it can just
> complain and skip any missing ones. It's tempting to try to do some kind
> of lazy cleanup at that point, but I don't really see how that fits with
> the remote-helper protocol.

We could have an external tool, something like `git-notmuch-fsck`, that
the user has to execute regularly, as `git fsck` was in the past.

Or we could say that after jumping a certain threshold of lastmod we
delete all the messages and start from scratch, perhaps every 1000
revisions.

Or maybe the query could generate a virtual tag if a message was
deleted since the previous lastmod (e.g. "nm::deleted"). Then it would
be trivial for the remote helper to tell that to git.

I lean towards the threshold, because that way the user doesn't need to
do anything, and there's no modifications needed in libnotmuch.

Cheers.

-- 
Felipe Contreras
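The threshold option can be sketched as a small decision function. This
is a rough illustration, not code from git-remote-nm: `fetch_plan`,
`RESYNC_THRESHOLD` and the idea of remembering the revision of the last
full rebuild separately from the last fetch are all assumptions made
for the example.

```ruby
# Hypothetical threshold, following "perhaps every 1000 revisions".
RESYNC_THRESHOLD = 1000

# Decide how the next fetch should query the database.
#   last_full:  revision recorded at the last full rebuild (nil if none)
#   last_fetch: revision recorded at the previous fetch (nil on first clone)
#   current:    the database's current lastmod revision
# A full rebuild would emit 'deleteall' followed by every message, which
# drops any message that has disappeared from the database.
def fetch_plan(last_full, last_fetch, current, threshold: RESYNC_THRESHOLD)
  if last_full.nil? || last_fetch.nil? || current - last_full >= threshold
    { mode: :full, query: '' }
  else
    { mode: :incremental, query: format('lastmod:%d..', last_fetch) }
  end
end

p fetch_plan(nil, nil, 10)   # first clone: full
p fetch_plan(0, 40, 50)      # small jump: incremental
p fetch_plan(0, 900, 1000)   # threshold reached: full
```

Tracking the last full rebuild separately matters because a user who
fetches often would otherwise never see a jump of 1000 revisions between
two consecutive fetches.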
Re: Reimagining notmuch-git/nmbug
Felipe Contreras writes:

>
> That should work to update existing tags, but how are we going to
> detect if a message has disappeared? Or is that not a thing?

Indeed the same thought had occurred to me not long ago. I remembered
(belatedly) that I'd been through some similar thought process with
nmbug. Messages can and do disappear. So I guess that optimization is
not OK, at least not without some complications.

> Does "lastmod:0.." get all the revisions? If so, it might make sense
> to set $lastmod to 0 initially.
>
> Then we could unconditionally do:
>
>   $db.query('lastmod:%d..' % $lastmod, sort: Notmuch::SORT_UNSORTED)

That would work, but as you point out, we'd need to deal with deletions
somehow. It occurs to me that wr_export also needs to be able to handle
disappearing message-ids. I suppose like notmuch-restore it can just
complain and skip any missing ones. It's tempting to try to do some kind
of lazy cleanup at that point, but I don't really see how that fits with
the remote-helper protocol.

d
Re: Reimagining notmuch-git/nmbug
On Mon, Apr 3, 2023 at 2:40 PM David Bremner wrote:
>
> David Bremner writes:
>
> > Indeed that speeds up the initial clone on this machine from 39 minutes
> > (I switched machines) to 30s. I will play with it a bit more, and report
> > back.
>
> It's not a showstopper, but "git pull" takes about 1/2 the wall time
> (about 2/3 of the CPU time) of the original clone, even if there is only
> one tag changed.

Yes, every fetch should take as much time as the original clone.

> Two potential improvements I can think of.
>
> - notmuch-dump.c calls notmuch_query_set_sort (query,
>   NOTMUCH_SORT_UNSORTED). I think I managed to do this (diff below),
>   but performance gain was negligible.

OK.

> - Since you cache the lastmod value, you should be able to use it in a
>   query. This does make a big difference in my experiments. I had to
>   remove the 'deleteall' (otherwise only the changed messages are left
>   in the git repo). I'm not 100% sure this is correct, hopefully you see
>   quicker than I. In any case the lastmod query is what notmuch-git
>   uses.

That should work to update existing tags, but how are we going to
detect if a message has disappeared? Or is that not a thing?

> diff --git a/git-remote-nm b/git-remote-nm
> index c668b38..cabea26 100755
> --- a/git-remote-nm
> +++ b/git-remote-nm
> @@ -148,9 +148,11 @@ def wr_import(ref)
>    wr_data("lastmod: %d\n" % ($lastmod || 0))
>    wr_l 'from refs/notmuch/master^0' if $lastmod
>
> -  wr_l 'deleteall'
> +#  wr_l 'deleteall'
>
> -  $db.query('').search_messages.each do |msg|
> +  $query=$db.query("lastmod:%d.." % ($lastmod || 0) )

Does "lastmod:0.." get all the revisions? If so, it might make sense to
set $lastmod to 0 initially.

Then we could unconditionally do:

  $db.query('lastmod:%d..' % $lastmod, sort: Notmuch::SORT_UNSORTED)

-- 
Felipe Contreras
Re: Reimagining notmuch-git/nmbug
David Bremner writes:

> Indeed that speeds up the initial clone on this machine from 39 minutes
> (I switched machines) to 30s. I will play with it a bit more, and report
> back.

It's not a showstopper, but "git pull" takes about 1/2 the wall time
(about 2/3 of the CPU time) of the original clone, even if there is only
one tag changed.

Two potential improvements I can think of.

- notmuch-dump.c calls notmuch_query_set_sort (query,
  NOTMUCH_SORT_UNSORTED). I think I managed to do this (diff below),
  but performance gain was negligible.

- Since you cache the lastmod value, you should be able to use it in a
  query. This does make a big difference in my experiments. I had to
  remove the 'deleteall' (otherwise only the changed messages are left
  in the git repo). I'm not 100% sure this is correct, hopefully you see
  quicker than I. In any case the lastmod query is what notmuch-git
  uses.

diff --git a/git-remote-nm b/git-remote-nm
index c668b38..cabea26 100755
--- a/git-remote-nm
+++ b/git-remote-nm
@@ -148,9 +148,11 @@ def wr_import(ref)
   wr_data("lastmod: %d\n" % ($lastmod || 0))
   wr_l 'from refs/notmuch/master^0' if $lastmod
 
-  wr_l 'deleteall'
+#  wr_l 'deleteall'
 
-  $db.query('').search_messages.each do |msg|
+  $query=$db.query("lastmod:%d.." % ($lastmod || 0) )
+  $query.sort=Notmuch::SORT_UNSORTED
+  $query.search_messages.each do |msg|
     hash = Blake2b.hex(msg.message_id, Blake2b::Key.none, 2)
     dir1, dir2 = hash[..1], hash[2..]
     wr_l 'M 644 inline %s/%s/%s/tags' % [dir1, dir2, encode_filename(msg.message_id)]
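Why 'deleteall' has to go once the query is incremental can be seen from
the shape of the fast-import stream. The sketch below builds only the
file-level commands of one commit; `file_commands` is a hypothetical
function, the input hash of path to tags stands in for the notmuch query
results, and the one-tag-per-line blob layout is an assumption.

```ruby
# Build the file-level commands of a git fast-import commit.
#   messages: hash of already-encoded path => list of tags (assumed shape)
#   full:     true when `messages` contains every message in the database
def file_commands(messages, full:)
  stream = +''
  # 'deleteall' empties the tree first, so it is only safe when every
  # message is about to be written again.
  stream << "deleteall\n" if full
  messages.each do |path, tags|
    data = tags.map { |t| "#{t}\n" }.join
    stream << "M 644 inline #{path}\n"
    stream << "data #{data.bytesize}\n" << data
  end
  stream
end

changed = { 'ab/cd/msg1@example.org/tags' => %w[inbox unread] }
print file_commands(changed, full: false)
```

With `full: false` the untouched paths simply carry over from the parent
commit, which is also why a message that disappeared from the database
stays in the tree until something removes it explicitly.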
Re: Reimagining notmuch-git/nmbug
Felipe Contreras writes:

> By distributing the files in multiple directories like notmuch-git
> does using BLAKE2b, the operation is much faster.
>
> I've pushed the changes, now there's a dependency, but you can just
> `gem install blake2b`.
>
> I'm able to clone the database of the performance corpus in 5 seconds:
>
>   % git clone --bare nm::$PWD/mail mail.git

Indeed that speeds up the initial clone on this machine from 39 minutes
(I switched machines) to 30s. I will play with it a bit more, and report
back.

I had just finished a pretty graph showing nonlinear growth of the old
version, but I guess nobody cares now ;)

d
Re: Reimagining notmuch-git/nmbug
On Mon, Apr 3, 2023 at 4:49 AM David Bremner wrote:

> Performance-wise the initial clone seems pretty slow. For my 600k
> messages I have been waiting a while now. htop tells me that
> git-fast-import has about 45 minutes of CPU time at this point. This
> machine is not that fast, but for comparison an initial (i.e. fresh
> repo, no caching) "notmuch git commit" takes about 15-20s.

I found the problem. If all the files are in the same directory, `git
fast-import` spends a lot of time comparing all the paths.

By distributing the files in multiple directories like notmuch-git does
using BLAKE2b, the operation is much faster.

I've pushed the changes, now there's a dependency, but you can just
`gem install blake2b`.

I'm able to clone the database of the performance corpus in 5 seconds:

  % git clone --bare nm::$PWD/mail mail.git

Cheers.

-- 
Felipe Contreras
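The directory fan-out described above can be sketched with nothing but
the Ruby standard library. Two things here are stand-ins and not what
git-remote-nm does: SHA-256 from `digest` replaces the BLAKE2b gem (so
the directory names will differ from the real ones), and the percent
escaping replaces the script's own `encode_filename`. `shard_path` is a
hypothetical name.

```ruby
require 'digest'

# Map a message-id to a two-level sharded path, "xx/yy/<name>/tags".
# Stand-in hash: the first two bytes (four hex digits) of SHA-256,
# mirroring the two-byte BLAKE2b digest used in the thread.
def shard_path(message_id)
  hash = Digest::SHA256.hexdigest(message_id)[0, 4]
  dir1 = hash[0, 2]
  dir2 = hash[2, 2]
  # Hypothetical escaping so the message-id is safe as a single path
  # component; the real script uses its own encode_filename.
  name = message_id.gsub(%r{[^A-Za-z0-9@._+-]}) do |c|
    c.bytes.map { |b| format('%%%02x', b) }.join
  end
  "#{dir1}/#{dir2}/#{name}/tags"
end

puts shard_path('20230403.1234@example.org')
```

Two hex bytes give at most 65536 leaf directories, so a database of
600k messages ends up with roughly ten entries per directory instead of
one directory holding all of them.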
Re: Reimagining notmuch-git/nmbug
On Mon, Apr 3, 2023 at 4:49 AM David Bremner wrote:
>
> Felipe Contreras writes:
>
> > Hi,
> >
> > I noticed you promoted notmuch-git as a user tool to toy around with it.
> >
> > Very quickly I realized that most of what it does is something I've
> > been working on for at least 10 years: making git work with other
> > tools.
> >
> > I presume you haven't heard of git remote-helpers [1], because they do
> > precisely what notmuch-git is trying to do.
> >
> > As a proof of concept I created a remote helper for notmuch [2]. If
> > you have this script (`git-remote-nm`) anywhere in your path, git will
> > interpret URLs prefixed with "nm::" as notmuch transports, and you can
> > do:
> >
> >   git clone nm::$HOME/mail
>
> I'm intrigued (and indeed I hadn't really thought about the degree to
> which we were re-inventing git-fast-import and friends); however so far
> my experiments did not get far enough to say anything conclusive.
>
> I tried your script with the bindings from master (554690) but it does
> not seem to like my split configuration, where the database lives in
> ~/.local/share/share/notmuch/default/xapian.

Just clone the xapian database instead of the Maildir:

  % git clone nm::$HOME/.local/share/share/notmuch/default/

> Performance-wise the initial clone seems pretty slow. For my 600k
> messages I have been waiting a while now. htop tells me that
> git-fast-import has about 45 minutes of CPU time at this point. This
> machine is not that fast, but for comparison an initial (i.e. fresh
> repo, no caching) "notmuch git commit" takes about 15-20s.

That's weird. In my tests generating the fast-export output is almost
instantaneous, which means `git fast-import` is the one that is slow.
And it seems it starts to get slow after a certain point, so perhaps
it's not optimized to receive many files in one go.

> If you need a larger corpus of messages to play with, the notmuch
> performance suite includes about 400k messages, and running T00-new.sh
> will build a notmuch database that you can clone.

I tried that, the database has 194562 messages, and it takes 1:43
minutes to clone on my machine. It's weird it takes so long on your
machine.

Can you try to hardcode a search query to limit the number of messages?
Just put something in here:

  $db.query('').search_messages.each

Cheers.

-- 
Felipe Contreras
Re: Reimagining notmuch-git/nmbug
David Bremner writes:

>
> I'm intrigued (and indeed I hadn't really thought about the degree to
> which we were re-inventing git-fast-import and friends); however so far
> my experiments did not get far enough to say anything conclusive.
>

I did manage to finish, about 70 minutes elapsed.

Although you're probably right that a file of tags is the right
representation (it is what git-annex uses also), I think we'd need to
define a custom merge driver to take unions of lists in the same way
that git-annex does. Otherwise merging will be less automagic than it is
now.
Re: Reimagining notmuch-git/nmbug
Felipe Contreras writes:

> Hi,
>
> I noticed you promoted notmuch-git as a user tool to toy around with it.
>
> Very quickly I realized that most of what it does is something I've
> been working on for at least 10 years: making git work with other
> tools.
>
> I presume you haven't heard of git remote-helpers [1], because they do
> precisely what notmuch-git is trying to do.
>
> As a proof of concept I created a remote helper for notmuch [2]. If
> you have this script (`git-remote-nm`) anywhere in your path, git will
> interpret URLs prefixed with "nm::" as notmuch transports, and you can
> do:
>
>   git clone nm::$HOME/mail

I'm intrigued (and indeed I hadn't really thought about the degree to
which we were re-inventing git-fast-import and friends); however so far
my experiments did not get far enough to say anything conclusive.

I tried your script with the bindings from master (554690) but it does
not seem to like my split configuration, where the database lives in
~/.local/share/share/notmuch/default/xapian.

$ git clone nm::/home/bremner/Maildir
Cloning into 'Maildir'...
/home/bremner/.config/scripts/git-remote-nm:164:in `initialize': failed to read/write file (Notmuch::FileError)
        from /home/bremner/.config/scripts/git-remote-nm:164:in `new'
        from /home/bremner/.config/scripts/git-remote-nm:164:in `'

If I make a fake .notmuch directory, then it seems to work. I'm not sure
if this is an issue with the bindings or with the script. Conceptually
there is also the question of how to handle split configurations as a
URL.

Performance-wise the initial clone seems pretty slow. For my 600k
messages I have been waiting a while now. htop tells me that
git-fast-import has about 45 minutes of CPU time at this point. This
machine is not that fast, but for comparison an initial (i.e. fresh
repo, no caching) "notmuch git commit" takes about 15-20s.

If you need a larger corpus of messages to play with, the notmuch
performance suite includes about 400k messages, and running T00-new.sh
will build a notmuch database that you can clone.
Re: Reimagining notmuch-git/nmbug
On Wed, Mar 29, 2023 at 3:50 AM Michael J Gruber wrote:
>
> On Wed, Mar 29, 2023 at 10:41 AM Felipe Contreras wrote:
> >
> > Hi,
> >
> > I noticed you promoted notmuch-git as a user tool to toy around with it.
> >
> > Very quickly I realized that most of what it does is something I've
> > been working on for at least 10 years: making git work with other
> > tools.
> >
> > I presume you haven't heard of git remote-helpers [1], because they do
> > precisely what notmuch-git is trying to do.
> >
>
> Hi Felipe
>
> that's an interesting idea for sure. When I came across `notmuch-git`
> first I wondered whether it rather should be `git-notmuch`, i.e. a
> subcommand to `git`. I admit that - given its preexistence as nmbug -
> I was never quite sure what to use it for. Maybe sync tags for mail
> stores whose content you sync otherwise? `public-inbox` came to my
> mind in this context, too. (I wondered about an nm backend for that,
> i.e. a public-inbox backed mailstore for notmuch, without multiple
> checkouts.)

Yes, I also thought of a public-inbox backend for notmuch, but for that
some notion of virtual files should probably be introduced, and I think
at the moment the current code of notmuch relies on real files.

> So, if we consider the notmuch database (more precisely: the dump
> output) as a "remote", then what is the history? I understand that we
> can transfer and transform its content in the form of blobs as
> specific paths encoding mid etc. Is the history stored by current
> `notmuch-git` something secondary (say, like the history of notes refs
> in git) which can be discarded?

The history is arbitrarily created.

Say you have two `git-remote-nm` repositories keeping track of the same
notmuch database, except one does a daily `git fetch`, and the other
does it once a month. The former is going to have many more commits, and
thus a more granular history.

Think of a `git fetch` as just being a simpler version of some custom
`notmuch dump | convert-script | git commit`.

> Note that I haven't looked at your code thoroughly yet (I'm not a
> rubyist),

You don't need to be a rubyist, just copy the script anywhere in your
path, and clone your mail database. As long as you never do `git push`,
the operations are going to be read-only, but if you want to be extra
safe, remove " mode: Notmuch::MODE_READ_WRITE" from the code, and/or
copy the mail database somewhere temporary.

Do `git fetch` regularly, and you'll see how a history of
"origin/master" is being created.

> and I'm all for using git tools to do gittish things and
> more; I'm just wondering whether fast-import/export cover what current
> `notmuch-git` intends to do. They are probably the best tool for
> "cloning" an existing nm-db into a git repo of mid-tag associations.
> And if all you want is a gittish transport for nm tags then that's
> probably perfect!
>
> `notmuch-git` seems to be about handling both updates (commit etc)

You can do the same with `git-notmuch`: just do `git commit`. I do that
in the tests to add a tag [1].

> and queries (log etc),

Ditto: just do `git log`. If you look at the code of `notmuch-git`, it's
just a wrapper for `git log --name-status --no-renames`.

> In summary, I think a notmuch-git repo is more than a conversion of
> notmuch-dump output (it adds history and commit messages; we have a
> "one-sided inverse" only), and the notmuch-git command is more than a
> converter between the respective data stores.

So is `git-notmuch`: every time you do `git fetch` a commit is created.
The history is all there.

Cheers.

[1] https://github.com/felipec/git-notmuch/blob/cdb2954abf3eb9f2f04f71fd2385a34653f758f5/t/basic.t#L87

-- 
Felipe Contreras
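The `convert-script` step in that pipeline is not spelled out in the
thread. A minimal sketch of its parsing half is given below, assuming
the batch-tag output of `notmuch dump`, where each message line looks
like `+inbox +unread -- id:msg@example.org` and header or metadata lines
start with `#`. `parse_dump_line` is a hypothetical name, and the
hex-decoding and quoting that the real format applies to unusual
characters are deliberately left out.

```ruby
# Parse one line of `notmuch dump` output (batch-tag format assumed).
# Returns [message_id, tags], or nil for comment, header or blank lines.
# Simplification: hex-encoded characters and quoted ids are not decoded.
def parse_dump_line(line)
  line = line.chomp
  return nil if line.empty? || line.start_with?('#')

  tag_part, query = line.split(' -- ', 2)
  return nil unless query&.start_with?('id:')

  tags = tag_part.split.map { |t| t.delete_prefix('+') }
  [query.delete_prefix('id:'), tags]
end

p parse_dump_line("+inbox +unread -- id:msg@example.org\n")
# => ["msg@example.org", ["inbox", "unread"]]
```

The other half of such a script would turn each pair into a file under
the repository and commit the result, which is what the remote helper
does in one step through fast-import.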
Re: Reimagining notmuch-git/nmbug
On Wed, Mar 29, 2023 at 10:41 AM Felipe Contreras wrote:
>
> Hi,
>
> I noticed you promoted notmuch-git as a user tool to toy around with it.
>
> Very quickly I realized that most of what it does is something I've
> been working on for at least 10 years: making git work with other
> tools.
>
> I presume you haven't heard of git remote-helpers [1], because they do
> precisely what notmuch-git is trying to do.
>

Hi Felipe

that's an interesting idea for sure. When I came across `notmuch-git`
first I wondered whether it rather should be `git-notmuch`, i.e. a
subcommand to `git`. I admit that - given its preexistence as nmbug - I
was never quite sure what to use it for. Maybe sync tags for mail stores
whose content you sync otherwise? `public-inbox` came to my mind in this
context, too. (I wondered about an nm backend for that, i.e. a
public-inbox backed mailstore for notmuch, without multiple checkouts.)

So, if we consider the notmuch database (more precisely: the dump
output) as a "remote", then what is the history? I understand that we
can transfer and transform its content in the form of blobs as specific
paths encoding mid etc. Is the history stored by current `notmuch-git`
something secondary (say, like the history of notes refs in git) which
can be discarded?

Note that I haven't looked at your code thoroughly yet (I'm not a
rubyist), and I'm all for using git tools to do gittish things and more;
I'm just wondering whether fast-import/export cover what current
`notmuch-git` intends to do. They are probably the best tool for
"cloning" an existing nm-db into a git repo of mid-tag associations. And
if all you want is a gittish transport for nm tags then that's probably
perfect!

`notmuch-git` seems to be about handling both updates (commit etc) and
queries (log etc), too, as a wrapper to git commands. Those may be
candidates for other git tools, such as aliases, diff helpers, textconv
and such.

In summary, I think a notmuch-git repo is more than a conversion of
notmuch-dump output (it adds history and commit messages; we have a
"one-sided inverse" only), and the notmuch-git command is more than a
converter between the respective data stores. It smells more like
`git-lfs` or other filter-based approaches, storing the real objects
outside of the git repo. But I feel I know too little about
`notmuch-git`'s purpose so far.

Cheers
Michael