Dan Kennedy <danielk1977 at gmail.com> wrote:

> On 04/30/2015 07:41 AM, Scott Robison wrote:
>>
>> I wrote up some test code today that definitely forces a SQLITE_NOMEM
>> condition, and there are no leaks (though there are a couple of really
>> large outstanding allocations to force an error without needing to
>> insert gigabytes of data).
>>
>> Thinking about what Dan wrote last night (many short word matches), I
>> decided that maybe my supposition was wrong and it wasn't the number
>> of relatively unique words in my old data set. Using the FTS4AUX
>> virtual table, I recall looking at the info in the DB at the time and
>> seeing a huge number of words with one or very few matches, and I
>> assumed that was the problem in that data set. However, given the
>> random nature of that data set, and the fact that there are only 26
>> single-letter "words" and 676 double-letter "words" (and so on), I
>> could easily have missed the relatively few rows of data that had very
>> large numbers of docids / occurrences in the aux table output.
>>
>> My test app goes to the other extreme. It inserts as many rows as
>> possible, each consisting of the single-letter word "a" repeated 256
>> times; in my case it fails after 1,052,641 rows were inserted
>> (1,048,576 rows were committed).
>>
>> In any case, my memory of the "merging" of things was correct, though
>> the precise location may not have been.
>>
>> Notes:
>>
>> 1. Depending on how much memory pressure I put on the system by
>> pre-allocating even more big chunks of memory, the NOMEM error moves
>> around a bit. I've seen it happen in a malloc as well.
>>
>> 2. The reality is that FTS was designed around certain assumptions,
>> and these extra-large data sets don't fit them. In my case, the only
>> time I've seen errors was with synthetic / unrealistic test data. The
>> exponential growth involved in segment-directory merges seems to
>> dictate that, after enough inserts / updates, the data structures will
>> eventually get quite large.
>
> Thanks for doing this. I'll look at it properly later on today to see
> if the results suggest anything we can do.
>
> I'm thinking there's another problem, though. At some point soon, we
> run into this:
>
> https://www.sqlite.org/limits.html#max_length
>
> The default limit is 10^9 bytes.
>
> The fully-merged doclist generated for the term "a" in the above is
> probably around 256 MiB in size. So if we could successfully merge it
> in memory, it could be inserted into the database. However, once you
> get up over a billion records, there might be doclists for common terms
> that exceed this limit.
>
> Dan.
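For anyone who wants to inspect their own data for this pattern, here is
a minimal sketch of the kind of fts4aux query Scott describes above. It
assumes a one-column FTS4 table; the names "ft" and "ft_terms" are
hypothetical:

    -- Hypothetical one-column FTS4 table.
    CREATE VIRTUAL TABLE ft USING fts4(body);

    -- fts4aux exposes per-term index statistics for an FTS4 table.
    CREATE VIRTUAL TABLE ft_terms USING fts4aux(ft);

    -- Rows with col = '*' aggregate over all columns. Terms with the
    -- highest occurrence counts are the ones whose doclists are likely
    -- to be largest after a full merge.
    SELECT term, documents, occurrences
      FROM ft_terms
     WHERE col = '*'
     ORDER BY occurrences DESC
     LIMIT 10;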
What about FTS5? I understand that it will use less memory than
FTS3/FTS4. Will it solve this problem?

Regards,
Dominique
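For anyone experimenting with that question, here is a minimal sketch,
assuming an SQLite build with FTS5 compiled in; the table and column
names are hypothetical, and whether FTS5 actually avoids the large
in-memory merge is exactly the question being asked:

    -- Requires a build with the FTS5 extension enabled.
    CREATE VIRTUAL TABLE ft5 USING fts5(body);

    -- The pathological input from the test above: many rows, each
    -- containing the single-letter word "a" repeated many times.
    INSERT INTO ft5(body) VALUES('a a a a a a a a');

    -- Standard FTS5 query syntax; whether the index for "a" can be
    -- merged without hitting memory or max_length limits is the open
    -- question.
    SELECT rowid FROM ft5 WHERE ft5 MATCH 'a';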