This might be to do with how an FTS index works under the hood, involving various levels of "b-tree" that grow as entries are added, but aren't always shrunk when entries are deleted.
There were a bunch of emails on the list around 4th to the 13th May 2014: sample below from Dan Kennedy (one of the SQLite devs) which includes the original problem report and his summary of "why". I've a feeling SOME changes MIGHT have been made to how the various levels of b-tree were merged when entries were deleted, but I'm not seeing anything linked to the original thread... perhaps this email will jog a memory... Graham START ----- Mail from Dan Kennedy, 4 May 2014 ----------------- START On 05/01/2014 03:30 PM, andrewmo wrote: > We are using the FTS3 extension to sqlite to store large numbers of short > (~300 byte) documents. This is working very well and providing us with very > fast text search, but the behaviour around deletion of documents has me > confused. > > Our system must control the the size of the database and will delete the > oldest documents when the database size breaches a certain limit. I now > understand from comments on this mailing list and elsewhere that this is not > an optimal pattern for the FTS extension as doclists for the oldest > documents are the least likely to be 'merged'. > > My question is, does this actually work at all? If I delete a row from my > FTS4 table (resulting in a new empty doclist being added to the index), then > I subsequently add many (1000s) new documents and call the 'merge' function > several times (automerge is also enabled), is there any gaurentee that the > empty doclist and the populated doclist that it superseded will ever be > removed? My testing suggests this isn't the case. > > I have a 1GB database with 6million documents. If I keep adding new > documents at around 1 per second and deleting documents when the size of the > data goes beyond 1GB, the size of the index seems to grow and the number of > documents I can store in the 1GB file seems decrease in a linear manner. > > Calling the 'optimize' function seems to solve this issue (removing all the > dead doclists), but that isn't practical for our software, as it implies > some downtime for our high availablity service due to the long execution > time of the optimize function (Could be minutes for a 1GB file). > > I have seen this > (http://sqlite.1065341.n5.nabble.com/fts3-database-grows-td42069.html) post > from 2008. However, it predates the 'automerge' and manual merge features, > and from the documentation I assumed these new features would delete all the > data related to deleted documents. Am I incorrect in my assumption? > > Thanks for any clarification you can offer. Normally, when you write to an FTS index (either to add new doclists or to add delete markers) the new entries are accumulated in-memory for a while and then flushed to a new "level-0" b-tree. A level-0 b-tree is often roughly 1MB in size. Once there are 16 level-0 b-trees, they are merged and written to a single level-1 b-tree. Once there are 16 level-1 b-trees... And so on. So when an entry is deleted from the FTS index, a delete marker is added. But the original doclists are not actually deleted until the delete marker and the doclists are merged into the same b-tree. Delete markers are discarded when they are merged into the oldest b-tree in the index. At first glance it seems (to me) that this means the index might grow to anything up to 16 times its "optimized" size. But I think it's actually worse than that. Say your entire database fits into a single level-N b-tree. You keep adding data (and delete markers) until there are 15 level-N b-trees and almost enough data to create the 16th in lower levels. So at this point the FTS index is 16 times its optimal size. If you then add even more data so that the 16th level-N b-tree is created, everything gets merged together and we're back in the optimal state - everything in a single b-tree. However - this b-tree is deemed to be a level-N+1 b-tree. Meaning that this time, much more data will have to be added before everything is merged together again. So I'm thinking a solution might be: * Fix FTS so that it picks this case - when a merge includes so many delete markers that the output is small enough to be deemed a level-N b-tree, not a level-N+1 b-tree, and * Instead of using the default 16-way merges, the app could organize to periodically invoke the "merge=X,Y" command with a smaller Y value (say 2) to limit the maximum size of the index to Y times its optimal size (instead of 16 times). It is an interesting problem. And the above is just guesswork... It would be good to verify experimentally that the index really does grow indefinitely with this kind of input before trying to "fix" anything. Dan. ENDS ----- Mail from Dan Kennedy, 4 May 2014 --------------------- ENDS Tuesday, February 25, 2020, 11:52:59 AM, Matt Kloss <matt.kl...@yandex.com> wrote: > Dear sqlite users, > I noticed that when you delete lines from a FTS virtual table, somehow there > is some data remaining in the sqlite db, so that's it does not shrink much in > size. > $ sqlite3 test.sql "CREATE VIRTUAL TABLE tab USING fts5(x)" > $ curl -s https://www.wikipedia.org | tr -cd '[:alnum:][:space:]' > wikipedia > $ ls -lh wikipedia test.sql > -rw-r--r-- 1 mkloss mkloss 24K feb. 25 06:55 test.sql > -rw-r--r-- 1 mkloss mkloss 54K feb. 25 06:56 wikipedia > $ sqlite3 test.sql ".import wikipedia tab" && ls -lh test.sql > -rw-r--r-- 1 mkloss mkloss 148K feb. 25 06:56 test.sql > $ sqlite3 test.sql "delete from tab" && ls -lh test.sql > -rw-r--r-- 1 mkloss mkloss 148K feb. 25 06:56 test.sql > $ sqlite3 test.sql "VACUUM" && ls -lh test.sql > -rw-r--r-- 1 mkloss mkloss 124K feb. 25 06:56 test.sql > I would expect the db size to be 24K (not 124K), as it was when the table > "tab" was empty. > I noticed that some data remains in the tab_XXX tables, but less than 700 > bytes. That's nowhere near the 100K of added cruft: > $ for t in tab_{config,content,data,docsize,idx}; do echo "select * from > $t;"; done | sqlite3 test.sql | wc -c > 682 > So here is my questions: > (1) How do you really cleanup a db with FTS tables after deleting some lines? > (2) If there is no way to remove the cruft, does that mean that adding and > deleting lines will constantly inflate the db size? > $ sqlite3 --version > 3.31.1 2020-01-27 19:55:54 > 3bfa9cc97da10598521b342961df8f5f68c7388fa117345eeb516eaa837balt1 > Thank you for your help, > Regards, > Matthew > _______________________________________________ > sqlite-users mailing list > sqlite-users@mailinglists.sqlite.org > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users Regards, Graham Holden _______________________________________________ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users