On 04/30/2015 02:08 PM, Dominique Pellé wrote:
> Dan Kennedy <danielk1977 at gmail.com> wrote:
>
>> On 04/30/2015 07:41 AM, Scott Robison wrote:
>>>
>>> I wrote up some test code today that definitely forces a SQLITE_NOMEM
>>> condition, and there are no leaks (though there are a couple really large
>>> outstanding allocations to force an error without needing to insert
>>> gigabytes of data).
>>>
>>> In thinking about what Dan wrote last night (many short word matches) I
>>> decided that maybe my supposition was wrong and it wasn't the number of
>>> relatively unique words in my old data set. Using the FTS4AUX virtual
>>> table, I recall looking at the info in the DB at the time and seeing a
>>> huge
>>> number of words with 1 or very few matches, and assumed that was the
>>> problem in that data set. However, given the random nature of that data
>>> set, and the fact that there are only 26 single letter "words" and 676
>>> double letter "words" (and so on), I could have easily missed the
>>> relatively few rows of data that had very large numbers of docids /
>>> occurrences in the aux table output.
>>>
>>> My test app goes to the other extreme. It inserts as many rows as possible,
>>> each consisting of the single-letter word "a" repeated 256 times, and in my
>>> case it failed after 1,052,641 rows had been inserted (1,048,576 rows were
>>> committed).
>>>
>>> In any case, my memory of the "merging" of things was correct, though the
>>> precise location may not have been.
>>>
>>> Notes:
>>>
>>> 1. Depending on how much memory pressure I put on the system by
>>> pre-allocating even more big chunks of memory, the NOMEM error moves
>>> around
>>> a bit. I've seen it happen in a malloc as well.
>>>
>>> 2. The reality is that FTS was designed around certain assumptions, and
>>> these extra-large data sets fall outside them. In my case, the only time
>>> I've seen errors was with synthetic / unrealistic test data.
>>> The exponential growth related to segment directory merges seems to
>>> dictate
>>> that eventually, after a lot of inserts / updates, the data structures are
>>> going to get quite large.
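
A minimal sketch of the kind of stress test Scott describes above, assuming
an SQLite build with FTS4 enabled. The database name, table name and the
2^20-row commit interval are illustrative guesses, not taken from his actual
test app:

/*
** Keep inserting a 256-word document ("a a a ... a") into an FTS4 table
** until an insert fails, committing every 2^20 rows.
*/
#include <stdio.h>
#include "sqlite3.h"

int main(void){
  sqlite3 *db = 0;
  sqlite3_stmt *pStmt = 0;
  char zDoc[512];                    /* "a a a ... a", 256 words */
  sqlite3_int64 nRow = 0;
  int i, rc;

  for(i=0; i<256; i++){ zDoc[2*i] = 'a'; zDoc[2*i+1] = ' '; }
  zDoc[511] = '\0';

  sqlite3_open("fts_nomem_test.db", &db);
  sqlite3_exec(db, "CREATE VIRTUAL TABLE t USING fts4(content)", 0, 0, 0);
  sqlite3_exec(db, "BEGIN", 0, 0, 0);
  sqlite3_prepare_v2(db, "INSERT INTO t(content) VALUES(?)", -1, &pStmt, 0);
  sqlite3_bind_text(pStmt, 1, zDoc, -1, SQLITE_STATIC);

  while( (rc = sqlite3_step(pStmt))==SQLITE_DONE ){
    sqlite3_reset(pStmt);
    if( (++nRow % 1048576)==0 ){
      sqlite3_exec(db, "COMMIT; BEGIN", 0, 0, 0);  /* commit every 2^20 rows */
    }
  }

  fprintf(stderr, "insert %lld failed: %s (%d)\n",
          (long long)(nRow+1), sqlite3_errmsg(db), rc);
  sqlite3_finalize(pStmt);
  sqlite3_close(db);
  return 0;
}

As in Scott's note 1, pre-allocating a few large blocks of memory before
running it is an easy way to provoke the SQLITE_NOMEM without having to
insert gigabytes of data.
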
>>
>> Thanks for doing this. I'll look at it properly later on today to see if the
>> results suggest anything we can do.
>>
>> I'm thinking there's another problem though. At some point soon, we run into
>> this:
>>
>>    https://www.sqlite.org/limits.html#max_length
>>
>> The default limit is 10^9 bytes.
>>
>> The fully-merged doclist generated for the term "a" in the above is probably
>> around 256MiB in size. So if we could successfully merge it in memory, it
>> could be inserted into the database. However, once you get up over a billion
>> records there might be doclists for common terms that exceed this limit.
>>
>> Dan.
>
> What about FTS5? I understand that it will use less memory than
> FTS3/FTS4. Will it solve this problem?

In theory, yes. But I haven't actually tried it yet.

Dan.
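
For a sense of scale on the max_length point quoted above, a back-of-envelope
estimate, assuming the merged doclist costs roughly one byte per stored token
position plus a few bytes of per-document overhead:

  1,048,576 rows x 256 positions per row ~= 268 million bytes ~= 256 MiB

so a data set only a few times larger than Scott's could push the doclist for
a common term past the default 10^9-byte SQLITE_MAX_LENGTH limit.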
