Re: Prepending Xapian Tiers

2019-05-28 Thread Dilyan Palauzov

Hello,

so the .conversations database does, apart from what is described at
https://www.cyrusimap.org/imap/concepts/deployment/databases.html#conversations-userid-conversations, also store, per user, a G record for each message, listing the mailboxes where the message is located, and a Xapian search returns G
records.


Are a G record, GUID and a conversation ID the same thing?

When a message is expunged, are its records from  
.conversations removed?


When a message is unexpunged, is it inserted again into
.conversations and logged to the squatter channel listed in
sync_log_channels?


squatter has the modes indexer, search, rolling, synclog, compact,
indexfrom (deprecated) and audit. Is search_batchsize used only in
indexer mode? In particular, is it unused when squatter -t … -z -X is
called (compact and reindex simultaneously)?


What is the use case for squatter -X (Reindex all messages before
compacting. This mode reads all the lists of messages indexed by the
listed tiers, and re-indexes them into a temporary database before
compacting that into place)?


Does it index messages that were not yet indexed for some reason, or
does it delete the database, scan each message again and create a
compact Xapian database?


In the case I described (a mailbox receiving reports whose index grew
very fast), the cause was a mail loop: a lot of emails arrived in a
short time. Once the loop stopped, the index no longer expanded faster
than those of other mailboxes.


So by default, for now, unless some extra setup is performed, only
words in text/plain and text/html parts get indexed, possibly along
with headers, and attachments are ignored?


Regards
  Дилян



Re: Prepending Xapian Tiers

2019-05-28 Thread Bron Gondwana
On Sat, May 25, 2019, at 22:19, Dilyan Palauzov wrote:
> Hello Bron,
> 
> For me it is still not absolutely clear how things work with the 
> Xapian search backend.
> 
> Has search_batchsize any impact during compacting? Does this setting 
> say how many new messages have to arrive, before indexing them 
> together in Xapian?

No, search_batchsize just means that when you're indexing a brand new mailbox 
with millions of emails, it will put that many emails at a time into a single 
transactional write to the search index. During compacting, this value is not 
used.
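
As a rough illustration of what "that many emails at a time" means, here is a
minimal Python sketch of batched writes. It is not Cyrus code: the names
batches and index_mailbox are hypothetical, and a list of batch sizes stands in
for the real transactional writes to the search index.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batches(messages: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most batch_size messages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(messages)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


def index_mailbox(messages: Iterable[str], batch_size: int) -> List[int]:
    """Return the size of each simulated commit (hypothetical stand-in)."""
    commits = []
    for chunk in batches(messages, batch_size):
        # A real indexer would add every message in the chunk to the index
        # and then commit once; here we only record how big the commit was.
        commits.append(len(chunk))
    return commits
```

With ten messages and a batch size of four, this model performs three commits
of sizes 4, 4 and 2.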

> What are the use-cases to call "squatter -F" (In compact mode, filter 
> the resulting database to only include messages which are not expunged 
> in mailboxes with existing name/uidvalidity.) and "squatter -X" 
> (Reindex all messages before compacting. This mode reads all the 
> lists of messages indexed by the listed tiers, and re-indexes them 
> into a temporary database before compacting that into place)?

-F is useful to run occasionally so that your search indexes don't grow 
forever. When emails are expunged, their matching terms aren't removed from the 
xapian indexes, so the database will be bigger than necessary and when you 
search for a term which is in deleted emails, it will cause extra IO and 
conversations DB lookups on the document id.
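
A toy model of the filtering idea, in Python. This is only a sketch under
simplifying assumptions: the index is a plain dict from term to a set of
document ids, and filter_index and live_docs are hypothetical names, not
anything squatter exposes.

```python
from typing import Dict, Set


def filter_index(index: Dict[str, Set[int]], live_docs: Set[int]) -> Dict[str, Set[int]]:
    """Return a copy of the index that only references documents in live_docs."""
    filtered = {}
    for term, docs in index.items():
        kept = docs & live_docs
        if kept:
            # Terms that only occurred in expunged messages disappear entirely.
            filtered[term] = kept
    return filtered
```

Without such a filter, a search for a term that only occurs in expunged
messages still finds postings, and each hit then has to be looked up and
discarded.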

> Why shall one keep index of deleted and expunged messages and how to 
> delete references from messages that are both expunged and expired 
> (after cyr_expire -X0, so removed from the hard disk), but keep the 
> index to messages that are still on the hard disk, but the user 
> expunged (double-deleted) them.

I'm not sure I understand your question here. Deleting from xapian databases is 
slow, and particularly with the compacted form, it's designed to be efficient 
if you don't write to it. Finally, since we're de-duplicating by GUID, you 
would need to do a conversations db lookup for every deleted email to check the 
refcount before cleaning up the associated record.
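
To make the refcount point concrete, here is a small Python sketch. It assumes
a simplified model in which each copy of a message, in whatever mailbox, adds
one reference to its GUID; GuidRefs is a hypothetical class for illustration
and not the conversations db format.

```python
from collections import Counter


class GuidRefs:
    """Count how many copies of each GUID still exist (toy model)."""

    def __init__(self) -> None:
        self.refcount = Counter()

    def add(self, guid: str) -> None:
        """Record one more copy of the message with this GUID."""
        self.refcount[guid] += 1

    def expunge(self, guid: str) -> bool:
        """Drop one copy; return True only when no copy is left."""
        if self.refcount[guid] == 0:
            raise KeyError(guid)
        self.refcount[guid] -= 1
        return self.refcount[guid] == 0
```

Only when expunge returns True would it be safe to drop the document from the
search index, which is why every deletion would need its own lookup.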

> How does re-compacting (as in 
> https://fastmail.blog/2014/12/01/email-search-system/) differ from 
> re-indexing (as in the manual page of master/squatter)?

"re-compacting" - just means combining multiple databases together into a 
single compacted database - so the terms in all the source databases are 
compacted together into a destination database. I used "re-compacting" because 
the databases are already all compacted, so it's just combining them rather 
than gaining the initial space saving of the first compact.

"re-indexing" involves parsing the email again and creating terms from the 
source document. When you "reindex" a set of xapian directories, the squatter 
reads the cyrus.indexed.db for each of the source directories to know which 
emails it claims to cover, and reads each of those emails in order to index 
them again.
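
The difference can be sketched in Python with a toy index that maps each term
to a set of document ids. Everything here is an assumption made for
illustration: recompact, reindex and tokenize are hypothetical names, and
whitespace splitting stands in for real email parsing.

```python
from typing import Dict, Iterable, Set

Index = Dict[str, Set[int]]


def tokenize(text: str) -> Set[str]:
    """Very naive term extraction: lowercase and split on whitespace."""
    return set(text.lower().split())


def recompact(databases: Iterable[Index]) -> Index:
    """Merge existing postings; the original messages are never read."""
    merged: Index = {}
    for db in databases:
        for term, docs in db.items():
            merged.setdefault(term, set()).update(docs)
    return merged


def reindex(covered_ids: Iterable[int], messages: Dict[int, str]) -> Index:
    """Parse every covered message again and rebuild the terms from scratch."""
    fresh: Index = {}
    for doc_id in covered_ids:
        for term in tokenize(messages[doc_id]):
            fresh.setdefault(term, set()).add(doc_id)
    return fresh
```

In this model both routes give the same result as long as the term extraction
has not changed; reindexing only produces something different when the way
terms are generated is different from when the sources were built.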

> What gets indexed? For a mailbox receiving only reports (dkim, dmarc, 
> mta-sts, arf, mtatls), some of which are archived (zip, gzip) the 
> Xapian index increases very fast.

This would be because these emails often contain unique identifiers, which do 
indeed take a lot of space. We have had lots of debates over what exactly 
should be indexed - for example should you index sha1 values (e.g. git commit 
identifiers)? They're completely random, and hence all 40 characters need to be 
indexed each time! But - it's very handy to be able to search your email for a 
known identifier and see where it was referenced... so we decided to include 
them.
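
A quick way to see the effect is to count distinct terms in a toy corpus with
and without a unique identifier per message. This Python sketch uses naive
whitespace tokenisation and made-up message text; distinct_terms and
make_reports are hypothetical helpers, and real index growth depends on much
more than the term count.

```python
import hashlib
from typing import Iterable, List


def distinct_terms(messages: Iterable[str]) -> int:
    """Count distinct whitespace-separated, lowercased terms."""
    terms = set()
    for text in messages:
        terms.update(text.lower().split())
    return len(terms)


def make_reports(count: int, with_ids: bool) -> List[str]:
    """Build identical report bodies, optionally with a unique SHA-1 each."""
    reports = []
    for i in range(count):
        body = "daily report status ok"
        if with_ids:
            # hexdigest() is 40 hex characters, like a git commit identifier.
            body += " " + hashlib.sha1(str(i).encode()).hexdigest()
        reports.append(body)
    return reports
```

A hundred identical reports contribute four distinct terms in this model; the
same hundred reports with one identifier each contribute 104.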

We try not to index GPG parts or other opaque blobs where nobody will be 
interested in searching for the phrase. Likewise we don't index MIME 
boundaries, because they're substructure, not something a user would know to 
search for.

We have a work in progress on the master branch to index attachments using an 
external tool to extract text from the attachment where possible, which will 
increase index sizes even more if enabled!

> How can I remove a tier, that contains no data, but is mentioned in 
> the .xapianactive files?

If you run a compact which includes that tier as a source and not as a 
destination, then it should remove that tier from every .xapianactive file, at 
which point you can remove it from your imapd.conf.

> How can I rename a tier?

The whole point of tier names not being paths on disk is so you can change the 
disk path without having to rename the tier. Tier names are IDs, so you're not 
supposed to rename them.

Having said that, you could add a new tier, compact everything across to that 
tier, then remove the old tier.

> How can I efficiently prepend a new tier in the .xapianactive file? 
> “squatter -t X -z Y -o” does add the defaultsearchtier to the 
> .xapianactive files, but first has to duplicate all existing files 
> with rsync. This is not efficient, as big files have to be copied.

I'm afraid that's what we have right now. Again, tiers are supposed to be set 
up at the start and not fiddled with afterwards, so