On 04/30/2015 02:08 PM, Dominique Pellé wrote:
> Dan Kennedy <danielk1977 at gmail.com> wrote:
>
>> On 04/30/2015 07:41 AM, Scott Robison wrote:
>>>
>>> I wrote up some test code today that definitely forces a SQLITE_NOMEM
>>> condition, and there are no leaks (though there are a couple really
>>> large outstanding allocations to force an error without needing to
>>> insert gigabytes of data).
>>>
>>> In thinking about what Dan wrote last night (many short word matches)
>>> I decided that maybe my supposition was wrong and it wasn't the number
>>> of relatively unique words in my old data set. Using the FTS4AUX
>>> virtual table, I recall looking at the info in the DB at the time and
>>> seeing a huge number of words with 1 or very few matches, and assumed
>>> that was the problem in that data set. However, given the random
>>> nature of that data set, and the fact that there are only 26 single
>>> letter "words" and 676 double letter "words" (and so on), I could have
>>> easily missed the relatively few rows of data that had very large
>>> numbers of docids / occurrences in the aux table output.
>>>
>>> My test app goes to the other extreme. It inserts as many rows as
>>> possible consisting of the single letter word "a" 256 times, and in
>>> my case, it fails after 1,052,641 rows were inserted (1,048,576 rows
>>> were committed).
>>>
>>> In any case, my memory of the "merging" of things was correct, though
>>> the precise location may not have been.
>>>
>>> Notes:
>>>
>>> 1. Depending on how much memory pressure I put on the system by
>>> pre-allocating even more big chunks of memory, the NOMEM error moves
>>> around a bit. I've seen it happen in a malloc as well.
>>>
>>> 2. The reality is that FTS was designed around certain assumptions,
>>> and these extra large data sets don't fit into those cases. In my
>>> case, the only time I've seen errors was due to synthetic /
>>> unrealistic test data. The exponential growth related to segment
>>> directory merges seems to dictate that eventually, after a lot of
>>> inserts / updates, the data structures are going to get quite large.
>>
>> Thanks for doing this. I'll look at it properly later on today to see
>> if the results suggest anything we can do.
>>
>> I'm thinking there's another problem though. At some point soon, we
>> run into this:
>>
>> https://www.sqlite.org/limits.html#max_length
>>
>> The default limit is 10^9 bytes.
>>
>> The fully-merged doclist generated for the term "a" in the above is
>> probably around 256MiB in size. So if we could successfully merge it
>> in memory, it could be inserted into the database. However, once you
>> get up over a billion records there might be doclists for common terms
>> that exceed this limit.
>>
>> Dan.
>
> What about FTS5? I understand that it will use less memory than
> FTS3/FTS4. Will it solve this problem?
In theory, yes. But I haven't actually tried it yet.

Dan.
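[For reference, below is a minimal sketch of the kind of stress test
described above. It is not Scott's actual test program: the database
file name, table name and commit interval are assumptions. It fills an
FTS4 table with rows containing the single word "a" 256 times each and
keeps going until an insert fails, e.g. with SQLITE_NOMEM. Roughly
speaking, assuming about one byte per stored position in the doclist,
1,048,576 rows x 256 occurrences works out to around 256MiB for the
fully-merged doclist of "a", which is consistent with the estimate
quoted above.]

/*
** Sketch only, not the original test app. Inserts documents of the
** form "a a a ... a" (256 words) into an FTS4 table until an insert
** fails, then reports the row number and error code.
*/
#include <stdio.h>
#include "sqlite3.h"

int main(void){
  sqlite3 *db;
  sqlite3_stmt *pStmt;
  char zDoc[2*256];                /* "a a a ... a" - 256 words */
  sqlite3_int64 nRow = 0;
  int rc, i;

  /* Build the document text: 256 copies of the word "a". */
  for(i=0; i<256; i++){ zDoc[2*i] = 'a'; zDoc[2*i+1] = ' '; }
  zDoc[2*256-1] = '\0';

  rc = sqlite3_open("fts_stress.db", &db);   /* file name is arbitrary */
  if( rc!=SQLITE_OK ) return 1;
  sqlite3_exec(db, "CREATE VIRTUAL TABLE t USING fts4(content)", 0, 0, 0);
  sqlite3_exec(db, "BEGIN", 0, 0, 0);
  sqlite3_prepare_v2(db, "INSERT INTO t(content) VALUES(?)", -1, &pStmt, 0);
  sqlite3_bind_text(pStmt, 1, zDoc, -1, SQLITE_STATIC);

  /* Keep inserting until something goes wrong. Commit every 2^20 rows
  ** so that most of the work is flushed to disk, matching the figures
  ** quoted above (1,048,576 rows committed). */
  while( (rc = sqlite3_step(pStmt))==SQLITE_DONE ){
    sqlite3_reset(pStmt);
    if( (++nRow % 1048576)==0 ){
      sqlite3_exec(db, "COMMIT; BEGIN", 0, 0, 0);
    }
  }
  printf("insert %lld failed with error code %d (%s)\n",
         (long long)(nRow+1), rc, sqlite3_errmsg(db));

  sqlite3_finalize(pStmt);
  sqlite3_close(db);
  return 0;
}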