Re: Prepending Xapian Tiers
Hello,

so the .conversations database does, apart from what is described at https://www.cyrusimap.org/imap/concepts/deployment/databases.html#conversations-userid-conversations, also store per user a G record for each message, mapping the mailboxes where the message is located, and the results from a Xapian search return G records. Are a G record, a GUID and a conversation ID the same thing?

When a message is expunged, are its records removed from .conversations? When a message is unexpunged, is it inserted in .conversations again and referenced in the sync_log_channels: squatter?

squatter has the modes: indexer, search, rolling, synclog, compact, indexfrom (deprecated) and audit. Is search_batchsize used only in the indexer mode? In particular, is it not used when squatter -t … -z -X is called (compact and reindex simultaneously)?

What is the application for squatter -X (Reindex all messages before compacting. This mode reads all the lists of messages indexed by the listed tiers, and re-indexes them into a temporary database before compacting that into place)? Does it index messages that were not indexed yet for any reason, or does it delete the database, scan each message again and create a compact Xapian database?

In the case I described, a mailbox receiving reports and having an index that grows very fast, the cause was a mail loop - a lot of emails arriving in a short time. Once the loop stopped, the index no longer expands faster than those of other mailboxes.

So by default for now, unless some extra setup is performed, only words in text/plain and text/html get indexed, possibly with headers, and attachments are ignored?

Regards
  Дилян

- Message from Bron Gondwana -
  Date: Tue, 28 May 2019 18:20:32 +1000
  From: Bron Gondwana
  Subject: Re: Prepending Xapian Tiers
  To: Cyrus Devel

On Sat, May 25, 2019, at 22:19, Dilyan Palauzov wrote:
> Hello Bron,
>
> For me it is still not absolutely clear how things work with the
> Xapian search backend.
>
> Has search_batchsize any impact during compacting?
> Does this setting say how many new messages have to arrive, before
> indexing them together in Xapian?

No, search_batchsize just means that when you're indexing a brand new mailbox with millions of emails, it will put that many emails at a time into a single transactional write to the search index. During compacting, this value is not used.

> What are the use-cases to call "squatter -F" (In compact mode, filter
> the resulting database to only include messages which are not expunged
> in mailboxes with existing name/uidvalidity.) and "squatter -X"
> (Reindex all messages before compacting. This mode reads all the
> lists of messages indexed by the listed tiers, and re-indexes them
> into a temporary database before compacting that into place)?

-F is useful to run occasionally so that your search indexes don't grow forever. When emails are expunged, their matching terms aren't removed from the xapian indexes, so the database will be bigger than necessary and when you search for a term which is in deleted emails, it will cause extra IO and conversations DB lookups on the document id.

> Why shall one keep index of deleted and expunged messages and how to
> delete references from messages that are both expunged and expired
> (after cyr_expire -X0, so removed from the hard disk), but keep the
> index to messages that are still on the hard disk, but the user
> expunged (double-deleted) them.

I'm not sure I understand your question here. Deleting from xapian databases is slow, and particularly with the compacted form, it's designed to be efficient if you don't write to it. Finally, since we're de-duplicating by GUID, you would need to do a conversations db lookup for every deleted email to check the refcount before cleaning up the associated record.

> How does re-compacting (as in
> https://fastmail.blog/2014/12/01/email-search-system/) differ from
> re-indexing (as in the manual page of master/squatter)?

"re-compacting" - just means combining multiple databases together into a single compacted database - so the terms in all the source databases are compacted together into a destination database. I used "re-compacting" because the databases are already all compacted, so it's just combining them rather than gaining the initial space saving of the first compact.

"re-indexing" involves parsing the email again and creating terms from the source document. When you "reindex" a set of xapian directories, the squatter reads the cyrus.indexed.db for each of the source directories to know which emails it claims to cover, and reads each of those emails in order to index them again.

> What gets indexed? For a mailbox receiving only reports (dkim, dmarc,
> mta-sts, arf, mtatls), some of which are archived (zip, gzip) the
> Xapian index increases very fast.

This would be because these emails often contain unique identifiers, which do indeed take a lot of
Re: Prepending Xapian Tiers
On Sat, May 25, 2019, at 22:19, Dilyan Palauzov wrote:
> Hello Bron,
>
> For me it is still not absolutely clear how things work with the
> Xapian search backend.
>
> Has search_batchsize any impact during compacting? Does this setting
> say how many new messages have to arrive, before indexing them
> together in Xapian?

No, search_batchsize just means that when you're indexing a brand new mailbox with millions of emails, it will put that many emails at a time into a single transactional write to the search index. During compacting, this value is not used.

> What are the use-cases to call "squatter -F" (In compact mode, filter
> the resulting database to only include messages which are not expunged
> in mailboxes with existing name/uidvalidity.) and "squatter -X"
> (Reindex all messages before compacting. This mode reads all the
> lists of messages indexed by the listed tiers, and re-indexes them
> into a temporary database before compacting that into place)?

-F is useful to run occasionally so that your search indexes don't grow forever. When emails are expunged, their matching terms aren't removed from the xapian indexes, so the database will be bigger than necessary and when you search for a term which is in deleted emails, it will cause extra IO and conversations DB lookups on the document id.

> Why shall one keep index of deleted and expunged messages and how to
> delete references from messages that are both expunged and expired
> (after cyr_expire -X0, so removed from the hard disk), but keep the
> index to messages that are still on the hard disk, but the user
> expunged (double-deleted) them.

I'm not sure I understand your question here. Deleting from xapian databases is slow, and particularly with the compacted form, it's designed to be efficient if you don't write to it. Finally, since we're de-duplicating by GUID, you would need to do a conversations db lookup for every deleted email to check the refcount before cleaning up the associated record.
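
To make the search_batchsize answer above concrete, here is a minimal sketch of batched indexing in Python. It is not Cyrus code: the names `batched`, `index_in_batches` and the `commit` callback are hypothetical, and the only behaviour taken from the thread is that up to a fixed number of emails go into one transactional write.

```python
from typing import Callable, Iterable, Iterator, List


def batched(uids: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive groups of at most batch_size message UIDs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[int] = []
    for uid in uids:
        batch.append(uid)
        if len(batch) == batch_size:
            yield batch
            batch = []  # start a fresh list so earlier batches stay intact
    if batch:
        yield batch  # final, possibly shorter, batch


def index_in_batches(
    uids: Iterable[int],
    batch_size: int,
    commit: Callable[[List[int]], None],
) -> int:
    """Call commit once per batch (one 'transaction') and return the
    number of commits. The commit callback stands in for whatever
    writes a batch to the search index; it is an assumption here."""
    commits = 0
    for batch in batched(uids, batch_size):
        commit(batch)
        commits += 1
    return commits
```

With ten messages and a batch size of four, `index_in_batches(range(1, 11), 4, print)` commits three times: two full batches and one batch holding the last two UIDs. Nothing in this sketch applies to compacting, which per the answer above does not use the setting.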
> How does re-compacting (as in
> https://fastmail.blog/2014/12/01/email-search-system/) differ from
> re-indexing (as in the manual page of master/squatter)?

"re-compacting" - just means combining multiple databases together into a single compacted database - so the terms in all the source databases are compacted together into a destination database. I used "re-compacting" because the databases are already all compacted, so it's just combining them rather than gaining the initial space saving of the first compact.

"re-indexing" involves parsing the email again and creating terms from the source document. When you "reindex" a set of xapian directories, the squatter reads the cyrus.indexed.db for each of the source directories to know which emails it claims to cover, and reads each of those emails in order to index them again.

> What gets indexed? For a mailbox receiving only reports (dkim, dmarc,
> mta-sts, arf, mtatls), some of which are archived (zip, gzip) the
> Xapian index increases very fast.

This would be because these emails often contain unique identifiers, which do indeed take a lot of space. We have had lots of debates over what exactly should be indexed - for example should you index sha1 values (e.g. git commit identifiers)? They're completely random, and hence all 40 characters need to be indexed each time! But - it's very handy to be able to search your email for a known identifier and see where it was referenced... so we decided to include them.

We try not to index GPG parts or other opaque blobs where nobody will be interested in searching for the phrase. Likewise we don't index MIME boundaries, because they're substructure, not something a user would know to search for.

We have a work in progress on the master branch to index attachments using an external tool to extract text from the attachment where possible, which will increase index sizes even more if enabled!
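
The point about unique identifiers taking a lot of space can be illustrated with a small Python sketch. It is an illustration only: the tokenizer is a crude regular expression and not Xapian's term generator, and the report text and helper names (`unique_terms`, `make_reports`) are made up for the example.

```python
import hashlib
import re
from typing import Iterable, List, Set

# Crude stand-in for a tokenizer: runs of lowercase letters and digits.
TOKEN = re.compile(r"[0-9a-z]+")


def unique_terms(bodies: Iterable[str]) -> Set[str]:
    """Collect the distinct terms seen across all message bodies."""
    terms: Set[str] = set()
    for body in bodies:
        terms.update(TOKEN.findall(body.lower()))
    return terms


def make_reports(count: int, with_identifier: bool) -> List[str]:
    """Build synthetic report bodies; optionally append a 40-character
    sha1 hex digest that is different for every message."""
    bodies = []
    for i in range(count):
        body = "aggregate report for example org"
        if with_identifier:
            body += " id " + hashlib.sha1(str(i).encode()).hexdigest()
        bodies.append(body)
    return bodies
```

For 1000 synthetic reports, `len(unique_terms(make_reports(1000, False)))` stays at 5, while the variant with identifiers reaches 1006: the five shared words, the word "id", and one new 40-character term per message. Repeated words add nothing to the vocabulary, whereas every random identifier adds a term that appears nowhere else.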
> How can I remove a tier, that contains no data, but is mentioned in
> the .xapianactive files?

If you run a compact which includes that tier as a source and not as a destination, then it should remove that tier from every .xapianactive file, at which point you can remove it from your imapd.conf.

> How can I rename a tier?

The whole point of tier names not being paths on disk is so you can change the disk path without having to rename the tier. Tier names are IDs, so you're not supposed to rename them.

Having said that, you could add a new tier, compact everything across to that tier, then remove the old tier.

> How can I efficiently prepend a new tier in the .xapianactive file?
> “squatter -t X -z Y -o” does add to the .xapianactive files the
> defaultsearchtier, but first has to duplicate with rsync all existing
> files. This is not efficient, as big files have to be copied.

I'm afraid that's what we have right now. Again, tiers are supposed to be set up at the start and not fiddled with afterwards, so