Re: Tips re syncing notmuch database and maildir via iCloud?

2024-01-15 Thread Olly Betts
On Mon, Jan 15, 2024 at 08:35:37PM -0400, David Bremner wrote:
> HGV  writes:
> 
> > Does anyone have experience syncing the notmuch database or an entire 
> > maildir directory via iCloud? I keep most of my email archive offline 
> > but since iCloud added end-to-end encryption, I've considered syncing my 
> > archived mail and notmuch database via iCloud. But I'm worried that 
> > iCloud might introduce issues either directly with the notmuch database 
> > or by slightly altering the mail files/directories. Any tips about 
> > syncing the notmuch database or a maildir directory (whether iCloud or 
> > via another service) would be appreciated.
> 
> Since you didn't get any answer on the notmuch list, I'm forwarding your
> mail to the Xapian list. I guess if Xapian syncs fine in general to
> iCloud, notmuch should be likely (but not guaranteed) OK.

I know nothing about iCloud, but assuming it syncs the filenames and
contents correctly then the potential concerns I can see if you're
syncing a Xapian database are:

* On the machine being synced from you need to ensure that indexing
  doesn't happen during syncing or else the synced version may be
  corrupted - if you run "notmuch new" in a mail delivery hook or
  from cron or similar that's problematic

* On the side being synced to searches may fail or work incorrectly
  during syncing

Xapian's replication feature knows how to sync changes safely avoiding
these potential problems, but I've no idea if you could make it work
with iCloud.
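
If the sync step is something you can run as a command (rather than a background service), one way to avoid indexing during syncing is to make both jobs take the same lock. A minimal Python sketch, assuming a hypothetical lock path and helper names (`LOCK_PATH`, `exclusive`, `run_exclusively`); none of this is part of notmuch or Xapian:

```python
import fcntl
import os
import subprocess
import tempfile
from contextlib import contextmanager

# Hypothetical lock file shared by the indexing job and the sync job.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "mail-index-sync.lock")


@contextmanager
def exclusive(path=LOCK_PATH):
    """Hold an exclusive flock() on path for the duration of the block."""
    with open(path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until the other job finishes
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def run_exclusively(command, path=LOCK_PATH):
    """Run command (an argv list) while holding the shared lock."""
    with exclusive(path):
        return subprocess.run(command, check=False).returncode
```

The delivery hook or cron job would call `run_exclusively(["notmuch", "new"])`, and the sync job would wrap its own command the same way. Whether iCloud's syncing can be made to honour such a lock is a separate question that this sketch doesn't answer.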

Cheers,
Olly
___
notmuch mailing list -- notmuch@notmuchmail.org
To unsubscribe send an email to notmuch-le...@notmuchmail.org


Re: Advanced search with wildcard using notmuch for mutt

2023-12-07 Thread Olly Betts
On Wed, Dec 06, 2023 at 03:07:08PM +0800, io wrote:
> I'd like to know how to use 'FLAG_FUZZY' if one needs to use 'quest'
> to do the search query.

E.g. this will match a document indexed by term "fuzzy":

quest --flags=default,fuzzy --db=/path/to/db 'phuzzy~'

Default edit distance is 2, but you can also specify it explicitly, e.g.
phuzzi~3 sets it to 3 and phuzzi~0.5 sets it to half the length (so also
3 here).
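
To make the numbers above concrete, here is a small Python sketch of the arithmetic. It assumes plain Levenshtein distance (Xapian's fuzzy matching may count edits differently) and assumes that a suffix below 1 is a fraction of the term length, truncated to an integer. The function names are made up for illustration:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def distance_limit(term: str, suffix: float) -> int:
    """Interpret the number after '~' (assumed rule, see the text above)."""
    if suffix < 1:
        return int(len(term) * suffix)  # fraction of the term length
    return int(suffix)                  # explicit edit distance
```

With these definitions `edit_distance("phuzzy", "fuzzy")` is 2, which is why the default distance is enough for that example, and `distance_limit("phuzzi", 0.5)` and `distance_limit("phuzzi", 3)` both give 3.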

As Bremner indicated, this is new in git master.

Cheers,
Olly


Re: Advanced search with wildcard using notmuch for mutt

2023-12-04 Thread Olly Betts
On Mon, Dec 04, 2023 at 06:39:43AM -0500, David Bremner wrote:
> I guess the restriction is based on what is easy to do efficiently with
> the Xapian database (find prefixes).  If I remember correctly there was
> some work in progress to support leading wildcards in Xapian. I can't
> find relevant discussion now, but I CC'ed the Xapian list in case
> someone remembers that.

The development version of Xapian supports both `*` and `?` glob-style
wildcards in any position.

You can enable them for the QueryParser using FLAG_WILDCARD_MULTI,
FLAG_WILDCARD_SINGLE or FLAG_WILDCARD_GLOB (the last one is just the
first two combined).
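
As an illustration of the matching semantics only (not of how Xapian implements them), the following Python sketch expands a pattern against a list of terms. The flag values are hypothetical stand-ins for the real QueryParser constants, and it assumes `*` may match an empty run of characters:

```python
import re

# Hypothetical values for illustration; the real constants are defined
# by Xapian::QueryParser.
WILDCARD_MULTI = 1    # '*' matches any run of characters
WILDCARD_SINGLE = 2   # '?' matches exactly one character
WILDCARD_GLOB = WILDCARD_MULTI | WILDCARD_SINGLE


def pattern_to_regex(pattern, flags):
    """Translate a glob-style pattern, honouring only the enabled wildcards."""
    parts = []
    for ch in pattern:
        if ch == "*" and flags & WILDCARD_MULTI:
            parts.append(".*")
        elif ch == "?" and flags & WILDCARD_SINGLE:
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts), re.DOTALL)


def expand(pattern, terms, flags=WILDCARD_GLOB):
    """Return the terms that the whole pattern matches."""
    regex = pattern_to_regex(pattern, flags)
    return [term for term in terms if regex.fullmatch(term)]
```

For example, with the terms `mail`, `email`, `gmail`, `mall` and `main`, the pattern `*mail` expands to the first three and `ma?l` to `mail` and `mall`. With only `WILDCARD_MULTI` enabled the `?` is treated as a literal character, so `ma?l` matches nothing.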

Cheers,
Olly


Re: Internal error: Message without type term

2023-07-03 Thread Olly Betts
On Mon, Jul 03, 2023 at 02:26:03PM +0200, David Bremner wrote:
> "Peter P."  writes:
> 
> > I ran xapian-check on ~/.notmuch/xapian and include its messages
> > below at the end of this mail. Everyone please forgive me for
> > pasting 1121 there. :)
> 
> H'mm. It doesn't look familiar to me, but I will check with xapian
> experts to see if the failure mode is known/fixable. I'd guess probably
> not fixable.

Currently we don't have a database fixing tool for glass databases (the
"fix" mode in xapian-check can recreate base files for the older chert
database format, but glass doesn't have these base files which
eradicated the failure mode of them sometimes getting truncated to zero
size on power failure or OS crash).

Some of the problems reported have an obvious fix, but we don't have
existing code to fix them, and some look like they are probably due to
data being overwritten so fixing everything to be consistent probably
wouldn't actually give a database that entirely matches your email
anyway.

Was this database originally created by Xapian < 1.4.22?  It looks
like it could be the result of the bug fixed in 1.4.22 with handling
commit() failure on disk full.

> >> 2)  Move the database out of the way, re-run notmuch new,
> >> and restore your state using "notmuch restore < notmuch-db.txt"
> >  
> > I'd be fine regenerating the entire database without a backup dump even,
> > I don't think there is anything in there that can't be regenerated,
> > no?
> 
> The main thing that would be lost is tags that are not synched to
> maildir flags. In the "standard" workflow "inbox" is such a tag.

If there's tag data in the database which isn't backed up or synced to
maildir flags, you may be able to rescue it using:

https://git.xapian.org/?p=xapian;a=blob;f=README.notmuch;hb=refs/heads/notmuch-tag-rescue-hack

This creates a file with the tag data in the format `notmuch restore`
expects.  I'd expect this would work for your database as the termlist
table is mostly OK.

Cheers,
Olly


Re: How to recover from this permanent fatal error?

2021-06-06 Thread Olly Betts
On Sun, Jun 06, 2021 at 07:48:39AM -0500, Felipe Contreras wrote:
> On Sun, Jun 6, 2021 at 5:08 AM Olly Betts  wrote:
> 
> > You could try commenting out the body of GlassTable::set_overwritten()
> > in xapian-core/backends/glass/glass_table.cc so it keeps going instead
> > of throwing this exception, which might allow it to usefully recover
> > some or all tags.  If you (or anyone) try that and it works let me know
> > and I can patch the branch to emit a warning message and continue there.
> 
> Now I get this:
> 
> termlist:
> blocksize=8K items=687440 firstunused=152676 revision=2 levels=2 root=749
> /home/felipec/contrib/xapian/xapian-core/bin/.libs/lt-xapian-check:
> DatabaseError: Block 152676: used more than once in the Btree

I've pushed a change to skip the low level table consistency checking on
the branch since that's where this report is from.  The whole point of
this branch is to rescue tags from a broken database, so the user
presumably already ran the real xapian-check and it's not useful to be
repeating those checks here.  Hopefully that'll get us to actually
rescuing some tags!

Cheers,
Olly


Re: How to recover from this permanent fatal error?

2021-06-06 Thread Olly Betts
On Sat, Jun 05, 2021 at 11:40:58PM -0500, Felipe Contreras wrote:
> % xapian-core/bin/xapian-check ~/mail/.notmuch/xapian/termlist.glass
> termlist:
> xapian-core/bin/.libs/lt-xapian-check: DatabaseCorruptError: Db block
> overwritten - are there multiple writers?

Ah - this tool currently requires the termlist table to be undamaged
enough to at least scan through.

You could try commenting out the body of GlassTable::set_overwritten()
in xapian-core/backends/glass/glass_table.cc so it keeps going instead
of throwing this exception, which might allow it to usefully recover
some or all tags.  If you (or anyone) try that and it works let me know
and I can patch the branch to emit a warning message and continue there.

If the postlist table is readable it'd be possible to rescue the tag
data from there instead, but that's more complicated to do because
the tags would need collating for each message.

Cheers,
Olly


Re: How to recover from this permanent fatal error?

2021-06-05 Thread Olly Betts
On Sat, Jun 05, 2021 at 09:39:28AM -0500, Felipe Contreras wrote:
> On Fri, Jun 4, 2021 at 9:43 PM Olly Betts  wrote:
> > I'd suggest trying this simple tool I wrote that can probably rescue the
> > tags from a broken notmuch database (the tags are the part notmuch can't
> > just recreate by reindexing):
> >
> > https://git.xapian.org/?p=xapian;a=blob;f=README.notmuch;hb=refs/heads/notmuch-tag-rescue-hack
> 
> I can't seem to build it:
[...]
> ./backends/documentinternal.h:339:29: error: ‘numeric_limits’ was not
> declared in this scope
>   339 | wdf_delta = numeric_limits::max();
>   | ^~

Oh, that's a missing header include which older compilers seemed to
not complain about - it was fixed on master a few months ago, and I've
just merged master to the branch to pick up the fix so it should build
now.

Cheers,
Olly


Re: How to recover from this permanent fatal error?

2021-06-04 Thread Olly Betts
On Fri, Jun 04, 2021 at 08:40:56PM -0500, Felipe Contreras wrote:
> On Fri, Jun 4, 2021 at 8:37 PM David Bremner  wrote:
> > Felipe Contreras  writes:
> 
> > > I can't use notmuch anymore, I get this error:
> > >
> > > A Xapian exception occurred opening database: The revision being read
> > > has been discarded - you should call Xapian::Database::reopen() and
> > > retry the operation
> > >
> > > Context. In order to investigate a bug about mbsync I moved away the
> > > folder ~/mail/.notmuch. I have a timer that calls notmuch new after
> > > mbsync, so I paused that timer.
> > >
> > > Initially I used notmuch, only to see everything empty. Then I
> > > recalled what I did, removed all the files, and moved back the .notmuch
> > > directory.

Perhaps a process had the database or the empty replacement open for
writing over the moving aside or the moving back?  That could result
in a broken database.

> `xapian-check ~/mail/.notmuch/xapian F` doesn't seem to change anything.

With some filing systems and older format (chert) Xapian databases, a
system crash or power failure could truncate to zero size the files
which tracked which blocks were in use and where the root of a
particular revision of the tree was; xapian-check's "fix" mode was added
to recreate those files by scanning the whole database to work out what
they should contain.

In newer format databases (glass) we eliminated these files and
currently the "fix" mode doesn't actually do anything for glass.

The plan was to teach xapian-check how to recreate the `iamglass` file,
but that file doesn't seem to suffer from the truncation problem, so this
hasn't actually been implemented yet and "F" currently does nothing
for glass databases.

> > > IIRC I was able to use notmuch without problems once, and then I got
> > > the issue.
> >
> > Maybe the Xapian folk will have a more concrete suggestion, but I would
> > start by running xapian-check on the database. In your case I guess that
> > should be "xapian-check ~/mail/.notmuch".

I'd suggest trying this simple tool I wrote that can probably rescue the
tags from a broken notmuch database (the tags are the part notmuch can't
just recreate by reindexing):

https://git.xapian.org/?p=xapian;a=blob;f=README.notmuch;hb=refs/heads/notmuch-tag-rescue-hack

Once you have those, you can reindex your mail and then restore the
tags.

Cheers,
Olly


Re: out of memory on idle machine

2021-02-11 Thread Olly Betts
On Thu, Feb 11, 2021 at 06:53:27AM -0400, David Bremner wrote:
> At this point I don't really have any good ideas, so I'm waiting for
> results from the 1.4.18 trial.

I've uploaded a backport, but it's the first backport of xapian-core to
buster so it'll need manual approval.  Hopefully that'll happen over the
weekend, but it could take longer.

Cheers,
Olly


Re: out of memory on idle machine

2021-02-08 Thread Olly Betts
On Wed, Feb 03, 2021 at 07:59:43AM -0400, David Bremner wrote:
> Gregor Zattler  writes:
> > A Xapian exception occurred finding message: Db block overwritten - are 
> > there multiple writers?.
> 
> I have included the Xapian list in copy in case that message rings a
> bell.

There was a bug fixed in 1.4.7 which incorrectly resulted in this error
message, but it seems from the quoted text you're using 1.4.11.

> I guess you know there are not multiple writers in your setup.

There's a lock file locked by fcntl() which protects against multiple
writers, so someone/something would have needed to delete that
behind Xapian's back, or else there's a bug somewhere in the locking
code stack.

(Aside from that bug, probably the most common case here over time has
been that someone deleted the lock file thinking it's "stale", but it's
not the mere presence of the file that means the lock is held.  It's
not at all frequent, but perhaps we should adjust this message to better
reflect that.)
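
The point about presence versus lock can be demonstrated with a short Python sketch using the standard `fcntl` module. The lock file path and the helper names (`PROBE`, `probe`, `demo`) are hypothetical, and this only illustrates how fcntl() locks behave in general, not Xapian's actual locking code:

```python
import fcntl
import os
import subprocess
import sys
import tempfile

# Child program: try to take the write lock without blocking and report.
PROBE = """
import fcntl, sys
with open(sys.argv[1], "r+") as handle:
    try:
        fcntl.lockf(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        print("acquired")
    except OSError:
        print("busy")
"""


def probe(path):
    """Ask a separate process whether it can lock path right now."""
    result = subprocess.run([sys.executable, "-c", PROBE, path],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


def demo():
    path = os.path.join(tempfile.mkdtemp(), "lockfile")  # hypothetical file
    with open(path, "w") as handle:
        fcntl.lockf(handle, fcntl.LOCK_EX)   # this process is "the writer"
        while_held = probe(path)
        fcntl.lockf(handle, fcntl.LOCK_UN)
        after_release = probe(path)          # the file still exists here
    return while_held, after_release, os.path.exists(path)
```

`demo()` reports that a second process is refused while the lock is held and succeeds once it is released, even though the file exists throughout. Conversely, deleting the file while the lock is held doesn't release anything; it just lets another process create and lock a fresh file, which is how two writers can end up active at once.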

Have you tried xapian-check on this database?

> Olly Betts mentioned in a different thread that he will build a version
> of xapian 1.4.18 for buster backports, so trying with that is probably a
> good step when it is available.

Yes - 1.4.18 packages are now in Debian testing, so hopefully I can get
this done soon.

> % xapian-delve -1 -A XDIRECTORY ~/Mail/.notmuch/xapian | sort -u > delve.txt

FWIW, the output should be sorted and unique already (sorted by byte
order, so equivalent to `LC_ALL=C sort`).
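
For anyone unsure what "byte order" means in practice, here is a tiny Python sketch (the helper names are made up) that sorts the way `LC_ALL=C sort` does, by comparing the UTF-8 bytes of each line:

```python
def byte_order(terms):
    """Sort strings by their UTF-8 bytes, like `LC_ALL=C sort`."""
    return sorted(terms, key=lambda term: term.encode("utf-8"))


def sort_unique(terms):
    """Equivalent of `sort -u` in the C locale."""
    return byte_order(set(terms))
```

In this order all ASCII upper-case letters come before lower-case ones, and non-ASCII characters come last, so `byte_order(["b", "é", "a", "B"])` gives `["B", "a", "b", "é"]`. A locale-aware `sort` may order the same lines differently, which is the only reason the extra `sort -u` could change the output.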

Cheers,
Olly


Re: performance problems with notmuch new

2020-04-22 Thread Olly Betts
On Mon, Apr 20, 2020 at 11:36:36AM -0300, David Bremner wrote:
> Franz Fellner  writes:
> 
> > I also suffer from bad performance of notmuch new.  I used notmuch
> > some years ago and notmuch new always felt instantaneous.  Had to stop
> > using it because internet was too slow to sync my mails :/ Now (with
> > better internet and a completely new setup using mbsync) indexing one
> > mail takes at least 10 seconds, sometimes even more.  It can go into
> > minutes when I get lots of mail (~30...).

First question: what version of Xapian are you using?

And second thing to check, are you committing each message separately?

The commit operation tries to ensure that the data has actually been
written out to disk, so the time to index one message by itself isn't
indicative as it'll often mostly just be waiting for fdatasync() or
similar to return.

If you index 30 messages but commit each separately (i.e. run "notmuch
new" 30 times picking up one new message each time) that'll probably
scale something like linearly, but indexing a batch of 30 messages
should be much quicker per message.
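
The difference between the two patterns can be sketched with a toy append-only store in Python. This is not how Xapian stores data; `Journal` and the two functions are invented for illustration, and `os.fsync()` stands in for the fdatasync()-style call a real commit makes:

```python
import os


class Journal:
    """Toy append-only store that counts how often it forces data to disk."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.syncs = 0

    def add(self, message: bytes):
        os.write(self.fd, message + b"\n")

    def commit(self):
        os.fsync(self.fd)  # wait until the kernel reports the data is on disk
        self.syncs += 1

    def close(self):
        os.close(self.fd)


def index_one_at_a_time(path, messages):
    journal = Journal(path)
    for message in messages:
        journal.add(message)
        journal.commit()   # like running "notmuch new" once per message
    journal.close()
    return journal.syncs


def index_as_batch(path, messages):
    journal = Journal(path)
    for message in messages:
        journal.add(message)
    journal.commit()       # one durable commit for the whole batch
    journal.close()
    return journal.syncs
```

For 30 messages the first function waits for the disk 30 times and the second only once. On a spinning drive each of those waits can dominate the time spent on the actual indexing work.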

> > When I run it after a
> > reboot I can have breakfast while notmuch starts up...  This is all on
> > spinning rust. I thought of getting an SSD but not in the near future.

After reboot the disk cache won't have any of the database in, so the
first operation will typically be slower, especially with a spinning
drive where seeks are relatively slow.

> > What I observe during that time: notmuch doesn't really need much CPU.
> > iotop shows constant read and write with extremely low rates, under
> > 1MB/sec.  So I think it might be an issue in xapian?
> 
> Just in case one of the xapian experts can suggest some kind of test for
> why you might be seeing this behaviour, I've included the xapian list in
> CC.

It sounds like you're seek-limited in this "cold cache" phase.  That is
not necessarily related to the slow indexing, but it could be.

I'd check the SMART diagnostics for the drive first (e.g. with
smartctl).  It's not the most likely cause, but it's quick to check and
if the drive is starting to fail it's better to find out sooner rather
than later.

Then I'd try compacting the database (I think there's a "notmuch
compact" subcommand to do this).
 
If that doesn't help, profiling the I/O would probably be my next
suggestion - there are some tools in the xapian git repo to help with
this (in xapian-maintainer-tools/profiling).  Under Linux I'd suggest
the strace ones (there's also an LD_PRELOAD library but it may need
tweaking for 32 vs 64 bit).

Cheers,
Olly
___
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch


Re: crash after running notmuch new

2020-04-07 Thread Olly Betts
On Tue, Apr 07, 2020 at 05:21:47PM -0300, David Bremner wrote:
> Matt  writes:
[...]
> > termlist:
> > blocksize=8K items=186136 firstunused=62058 revision=421 levels=2 root=12260
> > B-tree checked okay
> > termlist table structure checked OK
> >
> > postlist:
> > blocksize=8K items=2598971 firstunused=61412 revision=421 levels=2 root=49814
> > xapian-check: DatabaseCorruptError: Db block overwritten - are there
> > multiple writers?
> > ===
> > suggests there is an error but I couldn't find a fix for it. Should I
> > just remove the xapian folder and rerun `notmuch new` ?
> 
> If you have a backup of your tags from notmuch-dump, then yes that's
> probably a good way forward.

If you don't have a current dump, you may be able to rescue a dump of
tags from a broken notmuch database using:

https://git.xapian.org/?p=xapian;a=blob;f=README.notmuch;hb=refs/heads/notmuch-tag-rescue-hack

That should work if the termlist table is undamaged (as the above
appears to show), and may work even if it's damaged.

> I've put the xapian developers in copy in
> case they are interested in trying to debug this corruption. For those
> just joining us, this is notmuch 0.29.3 linked against xapian 1.4.15

Was the database created with 1.4.15 too?

If it's reproducible, I'm definitely interested.

If it isn't reproducible (and/or the data is sensitive) it's much more
difficult to usefully investigate.  And it may also be due to a
non-Xapian issue (a bug in something else or a hardware problem).

Cheers,
Olly


Re: Transitioning notmuch/Xapian from 32-bit to 64-bit system

2019-07-09 Thread Olly Betts
On Tue, Jul 09, 2019 at 01:28:51PM -0300, David Bremner wrote:
> Thomas Schwinge  writes:
> 
> > sizes?).  Doing some light (read-only!) testing, it seems to work fine to
> > access the old 32-bit built Xapian "chert" database with 64-bit
> > notmuch/Xapian.  Is that (a) generally safe and expected to work fine,
> > (b) there may be issues, or (c) don't do that, or don't know?
> 
> IIRC, Olly previously confirmed that the Xapian database is architecture
> independent.

Yes.

[...]
> 
> > (Of course, I'll eventually want to rebuild the database, to take
> > advantage of several new features both on the notmuch and Xapian sides,
> > but I'd prefer to do that later.)
> >
> 
> It would be wise to make a backup with "notmuch dump" before going
> much further.

You can convert the database at the Xapian level:

https://getting-started-with-xapian.readthedocs.io/en/latest/advanced/admin_notes.html#converting-a-chert-database-to-a-glass-database

However that probably doesn't allow taking advantage of all the new
notmuch features, so it's probably better to dump the tags, reindex
then restore the tags.

Cheers,
Olly


Re: Feature request: Limit output of notmuch show

2019-03-25 Thread Olly Betts
On Fri, Mar 22, 2019 at 12:41:12PM -0300, David Bremner wrote:
> jo...@joergvolbers.de (Jörg Volbers) writes:
> > Is it difficult to implement that? Would anyone do that? I am
> > not able to write anything in C, so I'm out of it.
> 
> I don't have any good ideas how to implement that in a Xapian query.
> Maybe Olly does.

It's not possible as part of the Query object, but the second parameter
to get_mset() specifies the limit you seem to want.

No need to write anything in C - you can use C++!

Cheers,
Olly


Re: xapian parser bug?

2018-09-30 Thread Olly Betts
On Sun, Sep 30, 2018 at 09:05:25AM -0300, David Bremner wrote:
>   if (str.find (' ') != std::string::npos)
>       query_str = '"' + str + '"';
>   else
>       query_str = str;
> 
>   return parser.parse_query (query_str, NOTMUCH_QUERY_PARSER_FLAGS, term_prefix);

I wouldn't recommend trying to generate strings to feed to QueryParser
like this code seems to be doing.  QueryParser aims to parse input from
humans not machines.

As well as the case where str is an operation name, the code above looks
like it will mishandle cases where str contains a tab or double quotes.
There are likely other problem cases too.
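
To see the problem cases concretely, here is a Python transliteration of the quoting logic from the quoted code (the function names are mine), together with the strings it would hand to the parser:

```python
def naive_query_string(value: str) -> str:
    """Quote the value only if it contains a space, as the quoted code does."""
    if " " in value:
        return '"' + value + '"'
    return value


def problem_cases():
    """Inputs for which the generated string is not a safe phrase query."""
    return {
        "operator name": naive_query_string("and"),
        "tab": naive_query_string("foo\tbar"),
        "embedded quotes": naive_query_string('say "hi" now'),
    }
```

The first case yields a bare `and`, the second passes the tab through unquoted because only a space triggers quoting, and the third yields `"say "hi" now"`, where the inner quotes end the phrase early. None of these is what the caller meant, which is the general hazard of generating strings for a parser designed for human input.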

Cheers,
Olly


Re: xapian parser bug?

2018-09-30 Thread Olly Betts
On Sun, Sep 30, 2018 at 09:50:30AM +0100, James Aylett wrote:
> Note that I'm using 1.4.7, and from your output I believe you're not
> (the * in the query description I believe doesn't happen in those
> situations any more).

1.4.4 and later eliminate redundant 0 scaling factors, but this one
isn't actually redundant:

> > Query(((Tmail AND 0 * XSUBJECTnot@1) AND_NOT (((Kspam OR Kdeleted) OR Kmuted) OR Kbad-address)))

If it was on the right-hand side of AND_NOT it would be eliminated
(because the right-hand side doesn't contribute any weight anyway).

FWIW, I also couldn't reproduce this (I tried with quest and 1.4.7):

$ quest -psubject:S -fdefault,boolean_any_case 'subject:"and"'
Parsed Query: Query(Sand@1)

Cheers,
Olly


Re: Notmuch DB Problems

2018-09-10 Thread Olly Betts
On Mon, Sep 10, 2018 at 08:01:06AM -0300, David Bremner wrote:
> Mueen Nawaz  writes:
> > Now killing all those jobs did not fix the database. It was still
> > broken. And as we saw the second time round, it was /really/ broken - it
> > would not even open in read-only mode.
> 
> That seems like something the Xapian devs (in copy) might be interested
> in fixing, if you could come up with a simple reproducer.

I'm certainly happy to investigate if someone can provide a way for
me to make it happen on demand.

It doesn't make much sense to me that holding the lock alone could be
causing any sort of corruption - that's just an fcntl() lock.

I would suggest making sure you're running Xapian 1.4.7, as that fixed a
cursor handling bug which affected notmuch.  I didn't find a way to make
it corrupt on-disk data, but it's hard to be completely certain that it
couldn't ever do that, so ruling out that as a cause would be good.

> notmuch could be cleverer about timing out on trying to acquire a
> lock. I suspect it's a bit delicate to get that right, and I've been
> hoping the underlying primitives would get a bit more flexible
> w.r.t. locking.

You mean in Xapian?  If so, a wishlist bug saying what you're hoping
for might help it happen.

Cheers,
Olly


Re: Database corruption after clean rebuild

2018-04-07 Thread Olly Betts
On Sat, Apr 07, 2018 at 12:17:39PM -0300, David Bremner wrote:
> Javier Garcia  writes:
> 
> > The following is a solid workaround I've stumbled upon. Afew no longer
> > complains and database corruption is gone.
> >
> > $ notmuch compact
> > $ xapian-check ~/.mail/.notmuch/xapian
> >    
> >    No errors found
> 
> Right, I should have thought of compaction, that's a workaround Olly
> mentioned before. That strongly suggests that you are hitting the known
> Xapian bug.

Yes - the error exactly matches that too.

Cheers,
Olly


Re: bug: "no top level messages" crash on Zen email loops

2018-03-29 Thread Olly Betts
On Thu, Mar 29, 2018 at 08:50:22AM -0400, Antoine Beaupré wrote:
> On 2018-03-29 04:17:21, Olly Betts wrote:
> > If changes to a new database which didn't modify the termlist table were
> > committed, then a disk block which had been allocated to be the root
> > block in the termlist table was leaked (not used but not on the
> > freelist of blocks the table can recycle).  This was largely harmless,
> > except that it was detected by Database::check() and caused an error.
> 
> Hmm... but if I understand correctly, that's one part of the story: I
> could get that error and not have the problem with `notmuch show`. Does
> that *also* resolve the issue with email loops?

Yes, from what bremner said on IRC there's still a notmuch bug here.

My reply was really just in the context of Xapian to note what the bug
actually was and when the fix would appear (since bremner sent his
message to both the notmuch and Xapian lists).

Cheers,
Olly


Re: bug: "no top level messages" crash on Zen email loops

2018-03-28 Thread Olly Betts
On Mon, Mar 19, 2018 at 05:03:21PM -0300, David Bremner wrote:
> I can confirm this reproduces both the xapian-check and the notmuch-show
> error. Olly agrees that whatever notmuch is doing wrong, it shouldn't
> lead to a corrupted database

There was a Xapian bug here, which I fixed on master last week and will
be fixed in 1.4.6.

If changes to a new database which didn't modify the termlist table were
committed, then a disk block which had been allocated to be the root
block in the termlist table was leaked (not used but not on the
freelist of blocks the table can recycle).  This was largely harmless,
except that it was detected by Database::check() and caused an error.

Cheers,
Olly


Re: emacs-notmuch: A Xapian exception occurred parsing query

2018-02-15 Thread Olly Betts
On 2018-02-07, David Bremner wrote:
> The underlying issue is that * is parsed (simplistically) by notmuch
> before passing to Xapian, so only works if it is the entire query.
>
> For cases like you report, where the user has not entered '*', but
> rather it is contained in some generated query string, we could fix the
> problem by adding a prefix like "special:*".

If you're generating the query string, you could presumably just
generate « tag:flagged » for this case.

Though it's generally better not to try to generate a string to parse,
but instead to parse any part(s) the user actually wrote and combine
the resulting Xapian::Query objects with directly constructed objects
for other filters, etc.

> This would allow Xapian to parse it, but only for Xapian versions >=
> 3.5.  How many users of older systems do we think this would affect?
> E.g. users of Debian oldstable (jessie) would have to compile Xapian
> in order to use the newest notmuch.

(That should be >= 1.3.5 I think - certainly 3.5 is wrong).

For Debian oldstable users, there's a backport of 1.4.3:

https://packages.debian.org/source/oldstable-backports/xapian-core

Cheers,
Olly


Re: Inconsistent query results

2017-03-09 Thread Olly Betts
On Wed, Mar 08, 2017 at 10:32:56PM -0400, David Bremner wrote:
> "Kirill A. Shutemov"  writes:
> > I found that on particular queries notmuch returns different results if I
> > run the query a few times. Re-initializing the query or db doesn't help.
> 
> Thanks for the report. I don't yet understand where the bug is, but I
> think it's safe to say it's not in your code. I made a somewhat simpler
> test case that displays the same problem (at the end).

It's a bug in Xapian - I've committed a fix to master (commit
fa12a83957e97349aa6e2a6c0896faf210dfe4b4) which I'll backport for 1.4.4 and
1.2.25 (it also affects 1.2.x).

To trigger it you need an AND operator which is unweighted (e.g. being on the
right side of AND_NOT here) with a subquery which uses the passed max weight
value (the obvious case is another operator, OR in this case).

(David already confirmed on IRC that the fix applied solves this for him.)

Cheers,
Olly


Re: [PATCH v4 16/16] add "notmuch reindex" subcommand

2016-08-14 Thread Olly Betts
On Mon, Aug 15, 2016 at 07:42:39AM +0900, David Bremner wrote:
> Daniel Kahn Gillmor  writes:
> > +Supported options for **reindex** include
> > +
> > +``--try-decrypt``
> > +
> > +For each message, if it is encrypted, try to decrypt it while
> > +indexing.  If decryption is successful, index the cleartext
> > +itself.  Be aware that the index is likely sufficient to
> > +reconstruct the cleartext of the message itself, so please
> > +ensure that the notmuch message index is adequately
> > +protected. DO NOT USE THIS FLAG without considering the
> > +security of your index.
> 
> What can we say about re-indexing without the flag, when the user has
> previously indexed cleartext? I guess this is at least partly a question
> for Olly: if we delete terms from a xapian document, how recoverable are
> those terms and  positions? I suppose it might depend on backend, but
> does deleting terms provide at least same level of security as deleting
> files in modern file systems

That seems a fair assessment.  Probably the main extra security you'd
get is that there are less likely to be existing tools to get at the
data, and that it's spread over more places so it's harder to locate it
all so you can reconstruct the plain text (whereas if a deleted file
contained the plain text, it would be fairly easy to locate if you can
guess part of it, or at least write a bit of code to recognise likely
candidates).

> (i.e. not much against determined state level actors, but good enough
> to defeat most older brothers)

"Good enough against big brother, but not Big Brother"

Cheers,
Olly


Re: slowdown in notmuch perf suite with xapian 1.3.5

2016-04-07 Thread Olly Betts
On Thu, Apr 07, 2016 at 09:40:59PM -0300, David Bremner wrote:
> Olly Betts <o...@survex.com> writes:
> 
> >
> > So the T00-new.sh numbers make sense - there's more work to do, and
> > we need to read existing positional data more to insert the new stuff,
> > so the increased reads and writes make sense.
> >
> > But guessing at what the other two tests do, I wouldn't expect them to
> > be affected by this.
> 
> The non-optimized-away cases of T02-tag are just adding and deleting terms
> to each document with term Tmail

That should short-cut to just only changing the data for Tmail.  Perhaps that's
not working correctly - I'll take a look at this, but probably after 1.4.0 is
out.

> > I'm also a bit puzzled by how glass can manage not to read any data
> > for "dump *", and several tests seem to not read or write anything
> > for either backend.  What exactly are the "In/Out" numbers?
> 
> that's just the output from /usr/bin/time -f '%e\t%U\t%S\t%M\t%I/%O'
> 
> The manual describes them as "number of file system
> inputs/outputs". From looking at the source, they correspond to
> ru_inblock and ru_oublock fields from the getrusage call. AFAIU, that
> means the number of non-cached read/writes.

Non-cached reads/writes are arguably the most useful sort to measure, but the
reads at least will be sensitive to OS caching, which means a repeat run will
generally show lower numbers of reads, e.g.:

$ /usr/bin/time -f '%I/%O' wc randomfile 
  240  2908 96780 randomfile
192/0
$ /usr/bin/time -f '%I/%O' wc randomfile 
  240  2908 96780 randomfile
0/0

So those numbers may not be entirely comparable, depending what order your
tests were done in, and whether you'd run the tests (or cloned the repo or some
other operation which read or wrote the files used) recently enough that their
data might still be cached.
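
To see the same counters from inside a process rather than via /usr/bin/time,
here is a minimal Python sketch using the standard resource module.  The
helper names are made up for illustration, and it reads RUSAGE_SELF (the
current process), whereas /usr/bin/time reports on the child it ran.

```python
import resource


def block_io_counts():
    """Return (ru_inblock, ru_oublock) for the current process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_inblock, usage.ru_oublock


def blocks_read_while(func):
    """Run func() and return how many block input operations it added."""
    before, _ = block_io_counts()
    func()
    after, _ = block_io_counts()
    return after - before


def read_whole_file(path):
    """Read a file to the end in 64 KiB chunks, discarding the data."""
    with open(path, "rb") as handle:
        while handle.read(1 << 16):
            pass
```

Calling blocks_read_while(lambda: read_whole_file("randomfile")) twice in a
row would be expected to report fewer reads the second time if the file is
still in the OS cache, which is the effect shown with wc above.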

Cheers,
Olly


Re: slowdown in notmuch perf suite with xapian 1.3.5

2016-04-07 Thread Olly Betts
On Thu, Apr 07, 2016 at 08:56:46AM -0300, David Bremner wrote:
> I hadn't noticed any interactive slowdown, but when I got around to
> running the notmuch performance suite, there seems to be some noticeable
> slowdown with the glass backend (default in Xapian 1.3.5) compared to
> chert (using xapian 1.2.22)

Some of this is pretty much expected, though other parts I don't
entirely understand.

One of the big changes in glass is how the position table is structured.
In chert, it is ordered by (document,term) but in glass that has been
changed to (term,document).

This change makes a huge difference to phrase searches in cases where
a lot of phrase data is needed, but it has an indexing time cost -
adding a new document can no longer just append a load of entries to
the position table, but instead we need to buffer up the changes, and
then merge the entries within the existing table.
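
A toy illustration of why the key order matters for indexing, in Python.
This is only a model using sorted lists of tuples, not how either backend
actually stores its tables, and the function names are invented for the
example.

```python
def position_keys(doc_terms, order):
    """Build the sorted position-table keys for a {docid: [terms]} mapping."""
    if order == "doc,term":      # chert-style key order
        keys = [(doc, term) for doc, terms in doc_terms.items() for term in terms]
    elif order == "term,doc":    # glass-style key order
        keys = [(term, doc) for doc, terms in doc_terms.items() for term in terms]
    else:
        raise ValueError("unknown order: " + order)
    return sorted(keys)


def is_pure_append(existing, new):
    """True if every new key sorts after all of the existing keys."""
    return not existing or not new or min(new) > max(existing)


existing_docs = {1: ["cat", "dog"], 2: ["ant", "dog"]}
new_doc = {3: ["bee", "dog"]}

# Keyed by (document, term), a new highest docid only ever appends.
chert_like = is_pure_append(position_keys(existing_docs, "doc,term"),
                            position_keys(new_doc, "doc,term"))

# Keyed by (term, document), the new entries land between existing ones.
glass_like = is_pure_append(position_keys(existing_docs, "term,doc"),
                            position_keys(new_doc, "term,doc"))
```

Here chert_like is True and glass_like is False: with (term,document) keys
the entries for "bee" and "dog" in document 3 have to be merged into the
middle of the table.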

The trade-off isn't ideal for everyone, but the cases of slow phrase
searches were a real pain point that needed addressing.  The plan is
to optimise indexing speed in other ways to regain this loss - some
of that has been done but there's a lot more to do still.

So the T00-new.sh numbers make sense - there's more work to do, and
we need to read existing positional data more to insert the new stuff,
so the increased reads and writes make sense.

But guessing at what the other two tests do, I wouldn't expect them to
be affected by this.

I'm also a bit puzzled by how glass can manage not to read any data
for "dump *", and several tests seem to not read or write anything
for either backend.  What exactly are the "In/Out" numbers?

Cheers,
Olly


segfault with xapian 1.3.1

2013-09-06 Thread Olly Betts
Olly wrote on IRC:
> bremner: ok, 1.2 explicitly no-oped skip_to() on an iterator at_end.
> on trunk that code has been rewritten without that explicit check, and
> the iterator internals are NULL then.
> i think restoring the check is reasonable, though I'm not sure if we
> actually promise that's defined behaviour.
> if you want to work with 1.3.1, then checking against the end iterator
> before calling skip_to() will work for any version.

I've put a NULL check in there, and also for the other iterator classes
which use NULL internals to signify being at the end - this only worked
for TermIterator in 1.2.x, but it seems reasonable to make it work in
general and a NULL pointer check isn't a big overhead.

On Thu, Sep 05, 2013 at 10:22:42PM -0300, David Bremner wrote:
> So, now we know what to fix.

This should work again in 1.3.2 (once it is out), but if you want to
support 1.3.1 then you need the check.

Cheers,
Olly


[PATCH] Actually close the xapian database in notmuch_database_close

2012-03-01 Thread Olly Betts
On Thu, Mar 01, 2012 at 07:59:30AM +0100, Justus Winter wrote:
> Olly wrote:
> >It is hard to say if calling close() is actually useful here from just
> >seeing the patch.
> 
> Huh? I provided a test case...

I only saw the part of the patch Austin quoted in the mail he cc-ed to
me.

> Quoting Austin Clements (2012-02-29 23:17:54)
> >Also, since close could throw an exception, it should get wrapped in a
> >try/catch like flush currently is.
> 
> My interpretation of [0] was that Xapian::Database::close() does not
> throw any exceptions.

Sadly there's no full documentation of which exceptions can be thrown
by a particular method.

Cheers,
Olly


[PATCH] Actually close the xapian database in notmuch_database_close

2012-02-29 Thread Olly Betts
On Wed, Feb 29, 2012 at 10:48:33AM -0500, Austin Clements wrote:
> Quoth Justus Winter on Feb 29 at 10:19 am:
> > Formerly the xapian database object was deleted and closed in its
> > destructor once the object was garbage collected. Explicitly call
> > close() so that the database and the associated lock is released
> > immediately.
> 
> Interesting.  Is this a bug in Xapian?  According to the docs,
> ~Database is supposed to close the database (if there are no other
> copies, which there shouldn't be), so this should be redundant with
> the delete notmuch->xapian_db a few lines down, but your experience
> obviously suggests that it isn't and I can't find the code path in
> Xapian that would close it in the destructor.

Most Xapian API classes (including Database and WritableDatabase) just
hold a reference-counted pointer, and so it's the destructor of the
reference-counted object which closes the database.  If "PIMPL" means
anything to you, that's what we have here.

Some other API class objects (such as PostingIterator) internally hold
a reference to the database they are using, so calling close()
explicitly is useful if you don't want to have to worry about such
objects still existing and holding onto references which keep the
database open.

The main motivation for adding close() was the bindings though - e.g. in
Python the wrapped Database object gets destroyed when the GC gets run,
which is at some essentially arbitrary time after you remove the last
reference to it.
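
A toy model of the situation, in Python.  ToyDatabase and ToyIterator are
invented for this sketch and are not Xapian's classes; the point is only
that a thin handle plus a second object holding a reference means dropping
the handle doesn't close anything, while an explicit close() does.

```python
class _Internal:
    """Shared state: stands in for the open files and the write lock."""

    def __init__(self):
        self.is_open = True


class ToyIterator:
    """Keeps its own reference to the shared state, like an iterator might."""

    def __init__(self, internal):
        self._internal = internal

    @property
    def database_open(self):
        return self._internal.is_open


class ToyDatabase:
    """Thin API object which only holds a reference to the shared state."""

    def __init__(self):
        self._internal = _Internal()

    def iterator(self):
        return ToyIterator(self._internal)

    def close(self):
        # Release the resources now, whoever else still holds a reference.
        self._internal.is_open = False


def drop_without_close():
    db = ToyDatabase()
    it = db.iterator()
    del db                    # the iterator still keeps the state alive
    return it.database_open


def explicit_close():
    db = ToyDatabase()
    it = db.iterator()
    db.close()                # closed immediately despite the iterator
    return it.database_open
```

In this model drop_without_close() returns True (still open) and
explicit_close() returns False.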

It is hard to say if calling close() is actually useful here from just
seeing the patch.

Cheers,
Olly


doc: notmuch help search-terms, boolean operators

2011-04-27 Thread Olly Betts
On Tue, Apr 26, 2011 at 04:01:17PM -0700, Carl Worth wrote:
> On Wed, 27 Apr 2011 00:24:38 +0200, Florian Friesdorf wrote:
> > Through playing with `notmuch tag` and `notmuch search
> > --output=messages` I found:
> > 
> > Complete list of boolean operators in order of precedence:
> > - NOT
> > - AND
> > - XOR
> > - OR
> > 
> > Is this correct? If yes, I would extend the manpage accordingly.
> 
> Currently, notmuch doesn't implement this behavior but relies on
> Xapian's query parser to do so. As such, I'd really prefer to see
> Xapian's documentation augmented here before we start documenting
> any particular behavior in notmuch.
> 
> Olly, is the above list of operators complete and in the correct order
> of precedence?

Close, but actually AND and NOT have equal precedence.

Also, NEAR and ADJ bind tighter than AND or NOT.

And '+' and '-' tightest of all.
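
To make the grouping concrete, here is a small Python sketch which fully
parenthesises a flat query using the precedence levels described above.  It
is not the QueryParser's actual grammar: it handles only binary operators
(so '+' and '-' are left out), assumes well-formed input, and the
left-to-right grouping of equal-precedence operators is my assumption for
the sketch.

```python
# Higher number binds tighter; AND and NOT share a level, as do NEAR and ADJ.
PRECEDENCE = {"OR": 1, "XOR": 2, "AND": 3, "NOT": 3, "NEAR": 4, "ADJ": 4}


def group(tokens):
    """Fully parenthesise a flat list of terms and binary operators."""
    pos = 0

    def parse(min_prec):
        nonlocal pos
        left = tokens[pos]
        pos += 1
        # Absorb operators which bind at least as tightly as min_prec.
        while pos < len(tokens) and PRECEDENCE[tokens[pos]] >= min_prec:
            op = tokens[pos]
            pos += 1
            right = parse(PRECEDENCE[op] + 1)
            left = "(" + left + " " + op + " " + right + ")"
        return left

    return parse(1)
```

For example, group("a OR b AND c".split()) gives "(a OR (b AND c))", and
group("a AND b NOT c OR d".split()) gives "(((a AND b) NOT c) OR d)".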

I'll try to make sure this gets documented.

Cheers,
Olly


Some Xapian tips and thoughts on rebuilding

2010-07-28 Thread Olly Betts
Kan-Ru Chen writes: 
> Seems this still does not work as expected.
> 
> With the new `notmuch count' command:
> 
>  % notmuch count
>  131720
> 
> After xapian-compact-1.1
> 
>  % notmuch count
>  1001
> 
> And a subsequent `notmuch new'
> 
>  % notmuch new
>  Processed 6 total files in almost no time.
>  Added 6 new messages to the database.
> 
>  % notmuch count
>  A Xapian exception occurred: Value in posting list too large.
>  Query string was: 
>  0

Bear in mind Xapian 1.1.x were development versions, and it sounds like you
are using a revision somewhere between 1.1.3 and 1.1.4.

If you can reproduce this with 1.2.x (or 1.0.x), I'm happy to investigate.
I'm not sure it's worthwhile to try to isolate a bug in an old development
version - a lot has changed in the last 6+ months.

Cheers,
Olly



Re: [PATCH 2/5] Add quotes around id:"message-id" queries.

2010-07-05 Thread Olly Betts
On Fri, Jul 02, 2010 at 05:04:46PM +0400, Dmitry Kurochkin wrote:
> On Fri, 2 Jul 2010 04:41:43 +0000 (UTC), Olly Betts <o...@survex.com> wrote:
> > On 2010-07-01, Dmitry Kurochkin wrote:
> > > -  (concat "id:" (notmuch-show-get-prop :id props)))
> > > +  (concat "id:\"" (notmuch-show-get-prop :id props) "\""))
> > 
> > This is probably a good idea (the ".." example is arguably a Xapian bug so
> > that should be fixed soon, but you find all sorts of junk in message-ids).
> 
> If I comment out add_valuerangeprocessor call in
> notmuch_database_open(), ids with ".." are matched fine with no quotes.

Yes, the code which checks for ranges is disabled if there are no possible
ranges to find.

> So it seems that xapian uses the ValueRangeProcessor for all terms
> while it should be used for one value parsing only. Is this correct?

The issue is that if there's a ".." in there you have to ask the VRPs to
find out if it is a range they understand or not, so they have to be called
first in such cases (otherwise the same prefix couldn't be made to work for
ranges and single term filters).  There needs to be some sort of fallback
to considering boolean filters if there isn't a valid range though.

> Is there a xapian bug for this?

I couldn't find a ticket for it, but I was aware of the issue.

I've committed a fix to Xapian now (r14790 on trunk), which should be in
Xapian 1.2.3 when it gets released.

> I have found a xapian bug #128 Allow queryparser to treat some prefixes
> as literal text. Seems to be just what we need here. Perhaps instead of
> quoting in emacs client, we can wait for the value range parsing fix
> (can be fixed in minor release?) and use #128 when it is available. IMHO
> should be good enough in most cases. What do you think?

The main problem at the moment is with "..", which is now fixed on trunk.
So any Xapian version with #128 fully addressed will handle ".." in
message-ids fine anyway.

With current trunk, message-ids with whitespace or ')' in will still
misbehave unless you quote them.  If the FieldProcessor idea in #128 were
implemented, you could arrange for whitespace and ')' to be included, but
then it would be impossible to end a message-id term - it would span to the
end of the query string, which I think would surprise most users.

The ability to quote terms discussed in #128 is already implemented (that
is what you've been using!) and I think that using this selectively is
probably the best way to deal with this.

If you only try to quote message-ids which either:

* contain whitespace, "..", or ')'
* start with '"'

Then the only cases which break with older Xapian will be those which
wouldn't work there anyway, plus message-ids which start with a '"' (which
seem rare - I couldn't find any in my mail folders).
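
The selective rule above could be sketched like this in Python (the notmuch
emacs client would do this in elisp; needs_quoting is a hypothetical helper
name, and the leading character tested is a double quote, which is my
reading of the rule):

```python
def needs_quoting(message_id):
    """Decide whether a message-id should be wrapped in double quotes."""
    return (
        any(ch.isspace() for ch in message_id)  # whitespace would end the term
        or ".." in message_id                   # could be taken for a range
        or ")" in message_id                    # would close a bracket group
        or message_id.startswith('"')           # would look like a quoted term
    )
```

With this, an ordinary id such as "20100701.1234@example.org" is passed
through unquoted, so it keeps working with Xapian versions which lack the
quoting feature.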

Cheers,
Olly


[PATCH 2/5] Add quotes around id:"message-id" queries.

2010-07-02 Thread Olly Betts
On 2010-07-01, Dmitry Kurochkin wrote:
> -  (concat "id:" (notmuch-show-get-prop :id props)))
> +  (concat "id:\"" (notmuch-show-get-prop :id props) "\""))

This is probably a good idea (the ".." example is arguably a Xapian bug so
that should be fixed soon, but you find all sorts of junk in message-ids).

However, the quoting feature this relies on was added in Xapian 1.0.18 (and
1.1.4), and with older versions this will break for *all* message-ids (even
those which currently work).

Also, if you're going to quote the message-id, you should escape any "
characters in the message-id (by doubling them).
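
For illustration, the quoting with doubled " characters could look like
this in Python (the patch itself is elisp; quoted_id_query is a name made
up for this sketch):

```python
def quoted_id_query(message_id):
    """Build an id: query with the message-id quoted for the query parser."""
    # Escape embedded double quotes by doubling them, then wrap in quotes.
    escaped = message_id.replace('"', '""')
    return 'id:"' + escaped + '"'
```

So a message-id of we"ird@host would be sent as id:"we""ird@host".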

Cheers,
Olly



Failing test cases

2010-04-30 Thread Olly Betts
On 2010-04-28, Jason White wrote:
> It seems to be repeatable here, even after running git clean -d -f -x,
> followed by make && make test
>
> Is anyone else seeing this?

Yes, I got the same failure trying to rebuild the notmuch 0.3.1 Debian package
with Xapian 1.2.0 (building in a clean Debian sid chroot on x86-64).

I'm hoping to get 1.2.x in for Debian's next stable release, so it would be
good to work out what's going on.

Cheers,
Olly



[PATCH] allow to not sort the search results

2010-04-16 Thread Olly Betts
On Fri, Apr 16, 2010 at 08:37:04AM +0200, Sebastian Spaeth wrote:
> On 2010-04-15, Olly Betts wrote:
> > Also, sorting by relevance requires more calculations and may require
> > fetching additional data (document length for example).
> > 
> > So I think it would make sense for --sort=relevance and --sort=unsorted to
> > be separate options.
> 
> Now I am a bit confused. The API docs state that sort_by_relevance is
> the default. So by skipping any sort_by_value() will that incur the additional
> calculations (with our BoolWeight set?). All I want is the fastest way
> to return a searched set of docs :-).

Yes, sort_by_relevance() is the default.  But if you set BoolWeight as the
weighting scheme then the relevance is simply zero, and Xapian doesn't have
to fetch any statistics and calculate a score from them.  When documents
have exactly equal relevance weight, then the docid order is used.  So
although sort_by_relevance() is technically still on with BoolWeight, by
"sorting by relevance" I wasn't talking about this case.

So --sort=unsorted and --sort=relevance would only differ in code by the former
setting BoolWeight and the latter not.
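
A toy model of that ordering rule in Python, just to show the effect.  The
rank function and the weights are invented for the example; it isn't how
the matcher is implemented.

```python
def rank(matches):
    """Order (docid, weight) pairs by descending weight, ties by docid."""
    return sorted(matches, key=lambda match: (-match[1], match[0]))


docids = [42, 7, 19]

# With BoolWeight every match gets weight 0, so only the docid tie-break
# has any effect and the results come back in docid order.
bool_weighted = rank([(docid, 0.0) for docid in docids])

# With a real weighting scheme the (made up) scores decide the order.
scored = rank([(42, 1.5), (7, 0.2), (19, 3.1)])
```

Here bool_weighted is [(7, 0.0), (19, 0.0), (42, 0.0)] and scored is
[(19, 3.1), (42, 1.5), (7, 0.2)].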

Cheers,
Olly


[PATCH] allow to not sort the search results

2010-04-15 Thread Olly Betts
Sebastian Spaeth writes:
> On 2010-04-14, Jason White wrote:
> > > Also add a --sort=unsorted command line option to notmuch search to test
> > > this.
> > 
> > Does this provide relevance-ranked search results? I think relevance ranking
> > is the Xapian default if a sort order isn't specified. 
> 
> Yes, by default it is using sort_by_relevance, so "unsorted" implies
> just that. (in fact, a previous incarnation of this patch called it
> --sort=relevance)

Except notmuch (at least in the code I've looked at) sets the weighting scheme
to BoolWeight, so the ordering is actually just the raw docid ordering
(BoolWeight gives all matching docs a weight of 0).

> I would be happy to have it called --sort=relevance too, the unsorted
> points out potential performance improvements a bit better, IMHO
> (although they seem to be really small with a warm cache).

When using the results of a search to add/remove tags, there's likely to be
an additional win from --sort=unsorted as documents will now be processed
in docid order which will tend to have a more cache friendly locality of
access.

Also, sorting by relevance requires more calculations and may require fetching
additional data (document length for example).

So I think it would make sense for --sort=relevance and --sort=unsorted to be
separate options.

Cheers,
Olly



[notmuch] Notmuch performance (literally, in my case)

2010-03-16 Thread Olly Betts
On Mon, Mar 15, 2010 at 10:29:36AM -0700, Ben Gamari wrote:
> On Mon, 15 Mar 2010 09:29:35 +0000 (UTC), Olly Betts <o...@survex.com> wrote:
> > http://oligarchy.co.uk/xapian/patches/xapian-1.0.18-flint-group-fsyncs.patch
> > 
> > What this does is to at least pair up the calls to fdatasync().  It's
> > possible to move them all together, but requires more effort, so it'd be
> > nice to know if this is actually going to help.
> 
> This does seem to help. Of course, latency is a difficult thing to measure,
> but notmuch does _feel_ faster. That being said, iostat still only shows
> 700kByte/second read and 300kByte/second write, so things haven't changed in
> the throughput side of things.

For the issue of a background task interfering with interactive use, the feel
arguably matters more than the throughput.

I'll probably put that patch in 1.0.19, and look at moving all the fdatasync()
calls together.  This is http://trac.xapian.org/ticket/426 BTW.

The kernel should be able to handle this workload better though, so I would
say it was worthwhile to bring up on LKML if you have the energy.  It certainly
isn't just you, as apt-xapian-index seems to trigger it for some Ubuntu users,
and madduck mentioned it on #notmuch a week or so ago.

Cheers,
Olly


[notmuch] Notmuch performance (literally, in my case)

2010-03-15 Thread Olly Betts
On 2010-03-15, Hans Dieter Pearcey wrote:
> On Sun, 14 Mar 2010 22:59:28 -0700 (PDT), Ben Gamari wrote:
>> Notmuch is using xapian 1.08-1.99karmic from the Xapian backports PPA, which
>> I believe includes the recent database update optimizations.
>
> As far as I know, it doesn't.  1.0.18 is the stable version in which it was
> fixed.

1.0.18 is also the version that's in the PPA - 1.08 has to be a typo as the
PPA currently tracks releases closely, and 1.0.8 is 18 months old.

I've seen a similar issue reported with apt-xapian-index in Ubuntu (it uses
Xapian to maintain a database of packages).  But I've never seen anything
like this myself, despite running Ubuntu on my laptop and spending a lot of
my time building Xapian databases.

Xapian's commit operation currently writes data and then calls fdatasync(),
on several files one after another.  That sounds a lot like a bad case in
one of the mails you linked to.

Can you try this patch (you'll need to rebuild Xapian from source, and
depending where you install it, perhaps set LD_LIBRARY_PATH to ensure the new
build gets used):

http://oligarchy.co.uk/xapian/patches/xapian-1.0.18-flint-group-fsyncs.patch

What this does is to at least pair up the calls to fdatasync().  It's
possible to move them all together, but requires more effort, so it'd be
nice to know if this is actually going to help.
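
The difference in I/O pattern can be sketched in Python like this.  It is
not the patch itself (that is C++ inside the flint backend and only pairs
the calls up); the grouped version below is the "all together" variant, the
function names are invented, and os.fsync is used as a fallback on
platforms where Python doesn't provide os.fdatasync.

```python
import os

# os.fdatasync isn't available on every platform; fall back to os.fsync.
_sync = getattr(os, "fdatasync", os.fsync)


def commit_interleaved(fds, payload):
    """Write and sync each file in turn: one stall per file."""
    for fd in fds:
        os.write(fd, payload)
        _sync(fd)


def commit_grouped(fds, payload):
    """Write to every file first, then issue all the syncs together."""
    for fd in fds:
        os.write(fd, payload)
    for fd in fds:
        _sync(fd)
```

Both leave the same data on disk; the grouped form just gives the kernel
all the dirty data before the first sync has to wait on anything.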

Cheers,
Olly



[notmuch] Backport of Xapian term update optimisation

2010-02-15 Thread Olly Betts
Thanks for all the testing feedback folks.

I've made a new Xapian release (1.0.18) and packaged it for Debian.  If you
are using unstable, and not on mips or hurd, it's now built:

https://buildd.debian.org/status/package.php?p=xapian-core

I haven't updated the xapian.org download page, etc yet - I wanted to get the
Debian package uploaded promptly so there's more chance it'll get into the
next Ubuntu release - but you can find tarballs in the usual place:

http://oligarchy.co.uk/xapian/1.0.18/

Cheers,
Olly


[notmuch] Notmuch performance problems on OSX

2010-02-09 Thread Olly Betts
On 2010-02-09, Oliver Charles wrote:
> I just upgraded to xapian-core HEAD and notmuch master tip today, in
> desperation to get away from GMail. Sadly it's still taking at least
> 0.7s to tag a single thread (with one message). I'm really eager to
> solve this, could anyone give me any pointers on how I could go about
> profiling it or finding the cause of this problem?

The first thing to try is disabling use of F_FULLFSYNC.  You'll need to
run this command in the xapian-core source tree to comment out the F_FULLFSYNC
code:

perl -pi -e 's/^#ifdef F_FULLFSYNC/#if 0/' backends/*/*_io.h

Then run "make" and "make install".

This makes you a bit more vulnerable to power failures, but no worse than
a typical Unix system.  There's some background here:

http://lists.apple.com/archives/Darwin-dev/2005/Feb/msg00072.html

Assuming that helps, then (a) you have a workaround, and (b) we'll know for
sure it is F_FULLFSYNC to blame.

I've created a ticket for a change to Xapian which should help here, but
not had a chance to work on it yet:

http://trac.xapian.org/ticket/426

Cheers,
Olly



[notmuch] strange behavior of indexing of and searching for strings containing '[]'

2010-02-05 Thread Olly Betts
On 2010-02-05, Jameson Rollins wrote:
> Hey, folks.  I've been noticing some strange behavior of notmuch search
> results for strings containing '[]'.  Here are some searches for some
> exact strings in messages subjects:

The '[]' is a red herring.  Xapian's TermGenerator and QueryParser classes
treat these two characters pretty much as if they were spaces.

> servo:~ 0$ notmuch search subject:'emacs paned UI'

Note that the '' is quoting for the shell only here.  So Xapian sees:

subject:emacs paned UI

Assuming you are defaulting to an AND search, that's `emacs in the subject'
AND `paned anywhere in the indexed text' AND `UI anywhere in the indexed text'.

To specify a quoted phrase you want "" anyway (not ''), so the command
matching what I think you intended to search for is:

notmuch search 'subject:"emacs paned UI"'

> servo:~ 0$ notmuch search subject:'[notmuch] emacs paned UI'

notmuch search 'subject:"[notmuch] emacs paned UI"'

Which should return identical results to:

notmuch search 'subject:"notmuch emacs paned UI"'

> thread:5f2cb4b108773a39161b33c86e54f7fd  4 mins. ago [1/1] Jameson Rollins; [notmuch] loss of duplicate messages (inbox)
> servo:~ 0$
>
> Not only did it not turn up the message that *does* match that exact
> string in its subject line, it actually turns up a completely different
> message that doesn't match the search term at all!

It matches the notmuch in the subject, and presumably emacs, paned, and UI
in the body.

> [snip the rest - the same explanations apply]

Cheers,
Olly



[notmuch] Backport of Xapian term update optimisation

2010-02-04 Thread Olly Betts
On Thu, Feb 04, 2010 at 11:55:44AM -0500, micah anderson wrote:
> Once this is available in unstable, the notmuch package should be
> re-uploaded with a build-dependency on this version so that the package
> users see the speed improvement. As it is now, the debian package is
> pretty slow.

There's no need to rebuild notmuch to benefit from this improvement.  Just
install the updated libxapian15 package.

> What is the expected time-frame for moving this out of an experimental
> stage into unstable?

Probably a week or two.  It would be good to get this into Ubuntu Lucid (though 
oddly notmuch doesn't seem to be there yet) which means pretty soon.

If I get a lot of positive feedback on the snapshot version, that'll encourage
me to push ahead with it.  I've also announced it on the sup mailing list (as
they have similar issues with adding tags) and will on the Xapian list later
today.

> Presumably this is a more proper release by Xapian?

I would indeed prefer to put it in a new Xapian release and then package
that - if the code isn't "good enough" for an upstream release, then I'd
find it hard to justify it as "good enough" for Debian.

Cheers,
Olly


[notmuch] Backport of Xapian term update optimisation

2010-02-04 Thread Olly Betts
On Wed, Feb 03, 2010 at 02:35:14PM -0500, Jameson Rollins wrote:
> On Thu, 28 Jan 2010 00:06:59 +0000 (UTC), Olly Betts wrote:
> > I've backported the term update optimisation patches
> > <http://trac.xapian.org/ticket/250> to Xapian's 1.0 branch, and you can
> > find snapshot tarballs including these changes here:
> > 
> > http://oligarchy.co.uk/xapian/branches/1.0/
> > 
> > Xapian's testsuite passes (including the additional test coverage which I
> > also backported), and I looked over each change carefully, but I would be
> > interested to see some real world testing, particularly in the situation
> > which these changes are intended to improve (i.e. speed of tagging in
> > notmuch).
> 
> Hey, Olly.  Thanks so much for backporting this patch and uploading a
> patched package to Debian experimental (which is now available):

It hasn't built for all Debian architectures yet, but is available for at
least amd64 and x86, which are probably the most popular two.

If you aren't sure how to pull in packages from experimental, see:

http://wiki.debian.org/DebianExperimental

I've also put it in a Launchpad PPA for all currently supported Ubuntu
releases, which has built for all of them already:

https://launchpad.net/~ojwb/+archive/experimental/

> I just installed this new version from a Debian experimental repo,
> rebuilt notmuch against the new installation, and everything seems to be
> working great.  I'll report back any issues to the BTS.  Thanks again.

Thanks.  Are you seeing the expected speed improvement?

Cheers,
Olly


[notmuch] Internal error no thread ID

2010-01-28 Thread Olly Betts
On 2010-01-27, Matthias Teege wrote:
> Internal error: Message with document ID of 62004 has no thread ID.
>  (lib/message.cc:353).
>
> Is it possible to get the filename for the document 62004?

You can get it from the database using Xapian's delve utility:

delve -d -r62004 /PATH/TO/XAPIAN-DATABASE

Cheers,
Olly



[notmuch] Backport of Xapian term update optimisation

2010-01-28 Thread Olly Betts
I've backported the term update optimisation patches
<http://trac.xapian.org/ticket/250> to Xapian's 1.0 branch, and you can
find snapshot tarballs including these changes here:

http://oligarchy.co.uk/xapian/branches/1.0/

Xapian's testsuite passes (including the additional test coverage which I
also backported), and I looked over each change carefully, but I would be
interested to see some real world testing, particularly in the situation
which these changes are intended to improve (i.e. speed of tagging in
notmuch).

So if you're so inclined, try it and report how you got on (on this list
is fine).

Cheers,
Olly



[notmuch] indexing mail?

2010-01-15 Thread Olly Betts
On 2010-01-15, Dirk-Jan C  Binnema wrote:
>Olly>   Other than Linux, the d_type field is available mainly only on BSD
>Olly>   systems.
>
> Yes, my patch could be generalized a bit more, just like your patch could not
> hardcode the '/'-separator :)

Well, '/' works as a directory separator for all Unix systems and also
for Microsoft Windows at this level.  Is there a system which doesn't
accept '/' in this place which is still relevant?

Personally I don't see the point in aiming for portability to systems like
Mac OS 9, RISC OS, and VMS in 2010...

> In practice though, what Unices in use today do not support d_type?

Solaris 10 doesn't for starters.  I don't have ready access to the other
non-Linux, non-BSD Unix flavours to check those right now.

Cheers,
Olly



[notmuch] indexing mail?

2010-01-15 Thread Olly Betts
On 2010-01-15, Dirk-Jan C  Binnema wrote:
>>>>>> "Olly" == Olly Betts  writes:
>Olly> Not a full patch, but I already posted what this code should look like
>Olly> to handle both systems without d_type, and those which return DT_UNKNOWN:
>
>Olly> http://article.gmane.org/gmane.mail.notmuch.general/1044

> static gboolean
> _set_dtype (const char* path, struct dirent *entry)

Underscore prefixed identifiers are reserved by ISO C at file-scope; using them
yourself is undefined behaviour...

>   /* we only care about dirs, regular files and links */
>   if (S_ISREG (statbuf.st_mode))
>   entry->d_type = DT_REG;
>   else if (S_ISDIR (statbuf.st_mode))
>   entry->d_type = DT_DIR;
>   else if (S_ISLNK (statbuf.st_mode))
>   entry->d_type = DT_LNK;

This addresses the case where the FS returns DT_UNKNOWN for d_type, but doesn't
deal with the case of platforms where struct dirent has no d_type member - from
the Linux readdir man page:

  The only fields in the dirent structure that are mandated by POSIX.1 are:
  d_name[], of unspecified size, with at most NAME_MAX characters preceding
  the terminating null byte; and (as an XSI extension) d_ino.  The other fields
  are unstandardized, and not present on all systems; see NOTES below for some
  further details.

And in NOTES:

  Other than Linux, the d_type field is available mainly only on BSD systems.
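A minimal sketch of a check that copes with both cases discussed above: platforms where struct dirent has no d_type member, and filesystems which return DT_UNKNOWN. The helper name entry_is_directory and its signature are assumptions made for illustration; they are not taken from the notmuch source or from the patch under review.

```cpp
#include <dirent.h>
#include <sys/stat.h>
#include <string>

// Hypothetical helper (not notmuch code): does `entry`, read from the
// directory `dir_path`, name a directory?
bool entry_is_directory(const std::string& dir_path, const struct dirent* entry)
{
#ifdef DT_DIR
    // d_type exists on this platform; trust it unless the filesystem
    // declined to fill it in.
    if (entry->d_type != DT_UNKNOWN)
        return entry->d_type == DT_DIR;
#endif
    // Either there is no d_type member or it was DT_UNKNOWN: ask stat().
    struct stat st;
    const std::string full_path = dir_path + "/" + entry->d_name;
    return stat(full_path.c_str(), &st) == 0 && S_ISDIR(st.st_mode);
}
```

Testing for the DT_DIR macro rather than for a particular platform keeps the d_type access out of the build wherever the macro is missing. One difference between the two branches is worth knowing about: a symlink to a directory is reported as DT_LNK by d_type and so is rejected, whereas the stat() branch follows the link and accepts it.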

Cheers,
Olly



[notmuch] Notmuch performance problems on OSX

2010-01-15 Thread Olly Betts
On 2010-01-14, Oliver Charles wrote:
> I've installed the latest notmuch from Git at this time of writing,
> along with Xapian from SVN head. However, just tagging a single thread
> with only one message seems to take too long:

One difference between OS X and other systems is that OS X supports the
F_FULLFSYNC fcntl, which other systems don't (currently, at least AFAIK).
Xapian uses it when available to ensure that changes have actually made
it to disk:

http://trac.xapian.org/ticket/288

On other systems, it uses fdatasync() or fsync(), which typically just
ensure that the data has left the OS - it can sit in disk controller or
drive caches for potentially seconds longer.  This call happens once
per table for every (explicit or implicit) flush on a database.

I can see an issue here which is that currently Xapian writes the base
file for the table, then syncs it, then does the next table.  I bet it
would be more efficient to write them all and then sync them all,
especially with F_FULLFSYNC.

I'll take a look at doing that, and have created a ticket for it:

http://trac.xapian.org/ticket/426
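The "write them all, then sync them all" idea can be sketched roughly as below. This is an illustration only, not Xapian's code: PendingWrite, sync_fd and write_all_then_sync_all are made-up names, each "table" is reduced to a file descriptor plus the bytes to write, and only fsync() is used as the fallback (no fdatasync()).

```cpp
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>
#include <string>
#include <vector>

// Hypothetical stand-in for one table's base file.
struct PendingWrite {
    int fd;
    std::string data;
};

// Flush one descriptor as durably as the platform allows.
bool sync_fd(int fd)
{
#ifdef F_FULLFSYNC
    // Where available, ask the drive to flush its own cache too.
    if (fcntl(fd, F_FULLFSYNC, 0) == 0)
        return true;
    // Not every filesystem supports it, so fall through to fsync().
#endif
    return fsync(fd) == 0;
}

// Issue every write first, and only then pay for the syncs.
bool write_all_then_sync_all(const std::vector<PendingWrite>& tables)
{
    for (const PendingWrite& t : tables) {
        std::string::size_type done = 0;
        while (done < t.data.size()) {
            ssize_t n = write(t.fd, t.data.data() + done, t.data.size() - done);
            if (n < 0) {
                if (errno == EINTR) continue;  // interrupted, retry
                return false;
            }
            done += static_cast<std::string::size_type>(n);
        }
    }
    for (const PendingWrite& t : tables) {
        if (!sync_fd(t.fd))
            return false;
    }
    return true;
}
```

Whether this ordering is actually faster is the conjecture made above, not something the sketch demonstrates; it would need measuring on the affected system.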

If after that this is still causing problems, it should probably be made
configurable what (if any) flushing is done.  If you're on a UPS-backed
server, you probably don't need such paranoia.

Cheers,
Olly



[notmuch] indexing mail?

2010-01-15 Thread Olly Betts
On 2010-01-14, Carl Worth wrote:
> On Thu, 14 Jan 2010 18:38:54 +0100, Adrian Perez de Castro wrote:
>> I am using XFS, which always returns DT_UNKNOWN. Taking into account that
>> there is a good deal of people using filesystems other than the ones you
>> mention, and that other non-linux filesystems may also return DT_UNKNOWN,
>> in my opinion there should be a fall-back. I will try to post a patch
>> Anytime Soon™.
>
> We definitely want the fallback. I can attempt to code it, but I don't
> have ready access to an afflicted filesystem, so I'd need help testing
> anyway.
>
> I'd love to see a patch for this bug soon. Be sure to CC me when the
> patch is sent and that will help me commit it sooner.

Not a full patch, but I already posted what this code should look like
to handle both systems without d_type, and those which return DT_UNKNOWN:

http://article.gmane.org/gmane.mail.notmuch.general/1044

Cheers,
Olly



[notmuch] indexing encrypted messages (was: OpenPGP support)

2010-01-14 Thread Olly Betts
On 2010-01-08, James Westby wrote:
> That would leave an open question over whether future notmuch show
> invocations would return the plaintext or ciphertext. If it is the
> latter then it requires decrypting every time you want to view it, but
> it does mean that there is less information leakage (you could find out
> whether an encrypted message contained a particular term, but not read
> the whole message directly).

You can actually use the term position information to reconstruct the
original message text pretty well.  It misses capitalisation, punctuation,
and distinctions between whitespace, but is generally enough to allow
the message to be understood:

http://article.gmane.org/gmane.comp.search.xapian.general/2187
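To make the point concrete, here is a small sketch of the reconstruction, assuming the positional data has already been read out of the index into a map from each term to the positions at which it occurs. The PositionIndex type and the reconstruct_text function are illustrative names, not part of the Xapian API.

```cpp
#include <map>
#include <string>
#include <vector>

// term -> word positions at which that term occurs in one document.
using PositionIndex = std::map<std::string, std::vector<unsigned>>;

// Invert the mapping (position -> term), then read the terms back in
// position order, separated by single spaces.
std::string reconstruct_text(const PositionIndex& index)
{
    std::map<unsigned, std::string> by_position;
    for (const auto& entry : index) {
        for (unsigned pos : entry.second)
            by_position[pos] = entry.first;
    }

    std::string text;
    for (const auto& slot : by_position) {
        if (!text.empty())
            text += ' ';
        text += slot.second;
    }
    return text;
}
```

For example, an index holding "meet" at position 1, "me" at 2, "at" at 3 and "noon" at 4 reads back as "meet me at noon". Capitalisation, punctuation and the original whitespace are lost, which matches the limitations described above.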

Cheers,
Olly



[notmuch] Rather simple optimization for notmuch tag

2009-12-24 Thread Olly Betts
Mark Anderson writes:
> On Wed, 23 Dec 2009 03:45:14 +0000, Olly Betts wrote:
> > Handling a combination of removals and additions is trickier, but probably
> > possible, although the more tags you are dealing with, the less profitable
> > the filtering is likely to be (as the filter is likely to cull fewer
> > documents yet be more expensive to evaluate).
> 
> But the transform is pretty simple, I think that any combination of
> additions and removals could be transformed according to the following
> formula.
> 
> notmuch tag +a1 +a2 +a3 -d1 -d2 -d3 <search-terms>
> 
> would transform to something like:
> 
> <search-terms> and ( not(a1) or not(a2) or not(a3) or d1 or d2 or d3)

Note that Xapian doesn't really have a "not" operator.  That's a consequence
of how it works (for each term it stores the list of documents which that
term indexes) rather than because nobody's implemented it, so it isn't quite
as simple as the above.

There is a posting list for "all documents" (which is very efficient if
the document ids form a contiguous range; if they don't, it's as efficient
as a term which matches all those documents for the chert backend, but not
so great for the default flint backend in 1.0.x), and you can combine this
with the "AND_NOT" operator to give the equivalent of a "NOT" operator.

So I think the example above is probably best expressed as:

<search-terms> AND ( ( ALL AND_NOT (a1 AND a2 AND a3) ) OR d1 OR d2 OR d3 )

But my point wasn't that I doubted it could be handled, but that it becomes
less worthwhile as the number of tags increases (and at some point will
become slower).
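As a rough illustration of how such a filter could be assembled for any mix of additions and removals, the sketch below builds the expression as text in the notation used above (ALL, AND_NOT, AND, OR). It deliberately stops short of constructing real Xapian query objects, and the names join_terms and affected_filter are hypothetical.

```cpp
#include <string>
#include <vector>

// Join terms with an operator, e.g. {"a1", "a2"} and "AND" -> "a1 AND a2".
static std::string join_terms(const std::vector<std::string>& terms,
                              const std::string& op)
{
    std::string out;
    for (std::string::size_type i = 0; i < terms.size(); ++i) {
        if (i != 0)
            out += " " + op + " ";
        out += terms[i];
    }
    return out;
}

// Describe the documents which tagging with +adds / -removes could change:
// those missing at least one added tag, or carrying at least one removed
// tag.  The result is meant to be AND-ed with the user's search.  An empty
// string means there is nothing to filter on.
std::string affected_filter(const std::vector<std::string>& adds,
                            const std::vector<std::string>& removes)
{
    std::string missing_an_add;
    if (!adds.empty())
        missing_an_add = "(ALL AND_NOT (" + join_terms(adds, "AND") + "))";

    const std::string has_a_remove = join_terms(removes, "OR");

    if (missing_an_add.empty())
        return has_a_remove;
    if (has_a_remove.empty())
        return missing_an_add;
    return missing_an_add + " OR " + has_a_remove;
}
```

With adds a1, a2, a3 and removes d1, d2, d3 this yields "(ALL AND_NOT (a1 AND a2 AND a3)) OR d1 OR d2 OR d3". The filter grows with the number of tags, which is the cost being weighed above against the documents it culls.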

> There certainly may be much more optimal ways to do it depending on
> the specific corpus of the database, considering if the tags a1 and a2
> and a3 are usually added as one tag, or if the addition is done
> individually, because if I know that a3 implies a1 and a2, the first 3
> terms could be combined to not(a1 and a2 and a3), or I could just
> exclude a3 tagged messages for nearly the same effect, with expected
> performance improvements.

I think you always can combine them like that.  The documents that don't
need looking at are precisely those which already have all three tags
(i.e. a1 AND a2 AND a3), so those that do are "NOT" that expression.

Cheers,
Olly



[notmuch] [PATCH] Add post-add and post-tag hooks

2009-12-23 Thread Olly Betts
[Sorry, I seemed to manage to attach my reply to the wrong thread...]

On Wed, Dec 23, 2009 at 07:57:21AM +0100, Tomas Carnecky wrote:
> On 12/23/09 12:02 AM, Olly Betts wrote:
>> Rather than a platform-specific check, it would be better to check if DT_DIR
>> is defined.
>>
>> Beware that even on Linux (where the d_type field is present), it may always
>> contain DT_UNKNOWN for some filesystems, so you really should check for that
>> case and fall back to using stat() instead.
>
> Currently configure is a simple shell script and not some autoconf  
> magic. And I don't know how eager Carl is to use autoconf, scons, cmake  
> or similar.

No autoconf magic required (or desirable here that I can see) - here's what
I'm suggesting (untested as written, but Xapian's omega indexer uses an
approach much like this):

#ifdef DT_UNKNOWN
    /* If d_type is available and supported by the FS, avoid a call to stat. */
    if (entries[i]->d_type == DT_UNKNOWN) {
        /* Fall back to calling stat. */
#endif
        {
            char pbuf[PATH_MAX];
            snprintf(pbuf, PATH_MAX, "%s/%s", path, entries[i]->d_name);

            struct stat buf;
            if (stat(pbuf, &buf) == -1 || !S_ISDIR(buf.st_mode))
                continue;
        }
#ifdef DT_UNKNOWN
    } else if (entries[i]->d_type != DT_DIR) continue;
#endif


Cheers,
Olly


[notmuch] Rather simple optimization for notmuch tag

2009-12-23 Thread Olly Betts
Carl Worth writes:
> On Fri, 18 Dec 2009 00:49:00 -0700, Mark Anderson wrote:
> > I was updating my poll script that tags messages, and a common idiom is
> > to put
> >  tag +mytag <search-terms> and not tag:mytag
> > 
> > I don't know anything about efficiency, but for the simple single-tag
> > case, couldn't we imply the "and not tag:mytag" from the +mytag action
> > list for the tag command?
> 
> On one level, it really shouldn't be a performance issue to tag messages
> that already have a particular tag. (And in fact, the recently proposed
> patches to fix Xapian defect 250 even address this I think.)

Applying a filter up-front like this is likely to still help I think as it
avoids Xapian having to reverse-engineer this information internally.

> One potential snag with both ideas is that the "notmuch tag"
> command-line as currently implemented allows for multiple tag additions
> and removals with a single search. So the optimization here couldn't be
> used unless there was just a single tag action.

Actually, you could do this with multiple tags - you just need to build
a filter for documents which might be affected.

So if you're adding tags a1 and a2, you want: <search-terms> AND_NOT (a1 AND a2)
since documents which already have tags a1 and a2 can be ignored.

If you're removing d1 and d2, then the filter is: <search-terms> AND (d1 OR d2)
since documents have to be tagged d1 or d2 in order for the removal to
do anything.

Handling a combination of removals and additions is trickier, but probably
possible, although the more tags you are dealing with, the less profitable
the filtering is likely to be (as the filter is likely to cull fewer
documents yet be more expensive to evaluate).

Cheers,
Olly



[notmuch] [PATCH] Add post-add and post-tag hooks

2009-12-22 Thread Olly Betts
Tomas Carnecky writes:
> #if defined(__sun__)
>   ... sprintf, stat etc
> #else
>   (void) path;
>   return dirent->d_type == DT_DIR;
> #endif

Rather than a platform-specific check, it would be better to check if DT_DIR
is defined.

Beware that even on Linux (where the d_type field is present), it may always
contain DT_UNKNOWN for some filesystems, so you really should check for that
case and fall back to using stat() instead.

Cheers,
Olly



[notmuch] Missing messages breaking threads

2009-12-22 Thread Olly Betts
Carl Worth writes:
> We don't have any concept of versioning yet, but it would obviously be
> easy to have a new version document with an increasing integer.

Adding a magic document for this isn't ideal as you have to make sure
it can't appear in search results, etc.

This is just the sort of thing which Xapian's "user metadata" is there
for.  It's essentially a key/value store which is versioned along with
the rest of the Xapian database.  So to set it:

  database.set_metadata("version", "1");

And to read (and default if not set):

  string version = database.get_metadata("version");
  if (version.empty()) version = "0";

Cheers,
   Olly



[notmuch] Notmuch's search view sucks

2009-12-04 Thread Olly Betts
Karl Wiberg writes:
> On Fri, Dec 4, 2009 at 1:29 AM, Carl Worth wrote:
> > And a step beyond that would support different languages for
> > different emails, but that sounds like something "hard" to identify.
> 
> But probably not as hard as identifying spam. It could probably be
> done with a simple Bayesian filter counting word frequencies---but
> it'd be much better if somebody else had already solved the problem,
> since this smells suspiciously like something that ought to be a
> separate project and put in a library ... does anyone know if such a
> project already exists?

There's TextCat:

http://www.let.rug.nl/vannoord/TextCat/

It looks at n-gram frequencies, and can guess pretty reliably from
even a fairly small amount of text.

TextCat is in Perl.  I don't know if there's a C or C++ implementation
but it isn't a huge piece of code - finding a good technique was the
clever part of it.
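For a feel of how little code the n-gram approach needs, here is a sketch of ranked n-gram profiles compared with an "out-of-place" distance, which is my understanding of the general technique TextCat is based on. It is not a port of TextCat: the function names, the n-gram lengths (1 to 3) and the profile size of 300 are assumptions, and it works on bytes rather than decoded characters.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Rank the most frequent byte n-grams (n = 1..3) of `text`, most frequent
// first, breaking ties alphabetically so the result is deterministic.
std::vector<std::string> ngram_profile(const std::string& text,
                                       std::size_t max_size = 300)
{
    std::map<std::string, unsigned> counts;
    for (std::size_t n = 1; n <= 3; ++n) {
        for (std::size_t i = 0; i + n <= text.size(); ++i)
            ++counts[text.substr(i, n)];
    }

    typedef std::pair<std::string, unsigned> Entry;
    std::vector<Entry> ranked(counts.begin(), counts.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const Entry& a, const Entry& b) {
                  return a.second != b.second ? a.second > b.second
                                              : a.first < b.first;
              });
    if (ranked.size() > max_size)
        ranked.resize(max_size);

    std::vector<std::string> profile;
    for (const Entry& e : ranked)
        profile.push_back(e.first);
    return profile;
}

// Sum, over the document's n-grams, of how far each one's rank is from its
// rank in the language profile.  An n-gram absent from the language profile
// costs the maximum penalty.  Lower totals mean a closer match.
std::size_t out_of_place(const std::vector<std::string>& doc,
                         const std::vector<std::string>& lang)
{
    std::map<std::string, std::size_t> rank;
    for (std::size_t i = 0; i < lang.size(); ++i)
        rank[lang[i]] = i;

    std::size_t total = 0;
    for (std::size_t i = 0; i < doc.size(); ++i) {
        std::map<std::string, std::size_t>::const_iterator it = rank.find(doc[i]);
        if (it == rank.end())
            total += lang.size();
        else
            total += it->second > i ? it->second - i : i - it->second;
    }
    return total;
}
```

To guess a language, build one profile per language from sample text, build a profile for the message, and pick the language whose profile gives the smallest out_of_place() value.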

Cheers,
Olly


