On 05/01/2014 03:30 PM, andrewmo wrote:
We are using the FTS3 extension to sqlite to store large numbers of short
(~300 byte) documents. This is working very well and providing us with very
fast text search, but the behaviour around deletion of documents has me
confused.

Our system must control the the size of the database and will delete the
oldest documents when the database size breaches a certain limit. I now
understand from comments on this mailing list and elsewhere that this is not
an optimal pattern for the FTS extension as doclists for the oldest
documents are the least likely to be 'merged'.

My question is, does this actually work at all? If I delete a row from my
FTS4 table (resulting in a new empty doclist being added to the index), then
I subsequently add many (1000s) new documents and call the 'merge' function
several times (automerge is also enabled), is there any gaurentee that the
empty doclist and the populated doclist that it superseded will ever be
removed? My testing suggests this isn't the case.

I have a 1GB database with 6million documents. If I keep adding new
documents at around 1 per second and deleting documents when the size of the
data goes beyond 1GB, the size of the index seems to grow and the number of
documents I can store in the 1GB file seems decrease in a linear manner.

Calling the 'optimize' function seems to solve this issue (removing all the
dead doclists), but that isn't practical for our software, as it implies
some downtime for our high availablity service due to the long execution
time of the optimize function (Could be minutes for a 1GB file).

I have seen this
(http://sqlite.1065341.n5.nabble.com/fts3-database-grows-td42069.html) post
from 2008. However, it predates the 'automerge' and manual merge features,
and from the documentation I assumed these new features would delete all the
data related to deleted documents. Am I incorrect in my assumption?

Thanks for any clarification you can offer.

Normally, when you write to an FTS index (either to add new doclists or to
add delete markers) the new entries are accumulated in-memory for a while
and then flushed to a new "level-0" b-tree. A level-0 b-tree is often
roughly 1MB in size. Once there are 16 level-0 b-trees, they are merged
and written to a single level-1 b-tree. Once there are 16 level-1 b-trees...
And so on.

So when an entry is deleted from the FTS index, a delete marker is added.
But the original doclists are not actually deleted until the delete marker
and the doclists are merged into the same b-tree. Delete markers are
discarded when they are merged into the oldest b-tree in the index.

At first glance it seems (to me) that this means the index might grow to
anything up to 16 times its "optimized" size. But I think it's actually
worse than that.

Say your entire database fits into a single level-N b-tree. You keep adding
data (and delete markers) until there are 15 level-N b-trees and almost
enough data to create the 16th in lower levels. So at this point the FTS
index is 16 times its optimal size. If you then add even more data so that
the 16th level-N b-tree is created, everything gets merged together and
we're back in the optimal state - everything in a single b-tree. However -
this b-tree is deemed to be a level-N+1 b-tree. Meaning that this time,
much more data will have to be added before everything is merged together
again.

So I'm thinking a solution might be:

  * Fix FTS so that it picks this case - when a merge includes so many
    delete markers that the output is small enough to be deemed a level-N
    b-tree, not a level-N+1 b-tree, and

  * Instead of using the default 16-way merges, the app could organize
    to periodically invoke the "merge=X,Y" command with a smaller Y value
    (say 2) to limit the maximum size of the index to Y times its optimal
    size (instead of 16 times).

It is an interesting problem. And the above is just guesswork... It would
be good to verify experimentally that the index really does grow indefinitely
with this kind of input before trying to "fix" anything.

Dan.


_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to