[sqlite] Segfault during FTS index creation from huge data

2015-05-19 Thread Dan Kennedy
On 05/19/2015 05:54 PM, Artem wrote:
>> On 05/19/2015 03:35 PM, Artem wrote:
>>> Hi!
>>>
>>> And what about the result of our conversation?
>>> Can the developers raise this limitation so that all the memory the user
>>> has can be used?
>> Hi Artem,
>> The conclusion was that although the first problem encountered is the
>> massive allocation that FTS tries to make, fixing that won't actually
>> improve things much. SQLite is hard-coded to limit the size of blobs to
>> a bit under 1GiB. So even if we could merge these big doclists in
>> memory, we're pretty close to the limit of what can be stored in the
>> database anyway.
>> The 1GiB limit can be raised to 2GiB by recompiling SQLite with
>> different options. But any further than that would require a redesign of
>> the file-format.
>> So to fix this, we would need to change the format FTS4 uses to store
>> the index within the database. Some sort of extension to allow it to
>> spread the largest doclists across two or more database records. And
>> unfortunately I suspect that will remain a low priority for the
>> foreseeable future.
>> I've tested FTS5 with really large databases, and it seems to work. It's
>> not actually released yet though.
>> Regards,
>> Dan.
> Thank you very much, Dan. Now I understand the situation.
> Can you please tell me when FTS5 will be released? Maybe I can test
> it on my data before the official release?

Hi Artem,

I'm not sure exactly when fts5 will be released. Testing with a large 
data set would be extremely helpful though. Grab the "fts5" branch from 
fossil and run [make sqlite3.c] to create an amalgamation file that 
includes fts5. Build with -DSQLITE_ENABLE_FTS5.

For large data sets, I think it will help to run a command like:

   INSERT INTO tbl(tbl, rank) VALUES('pgsz', 4040);

after creating the FTS5 table (where "tbl" is the name of the new 
table). More here:

   https://sqlite.org/fts5.html#section_5_7
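
For anyone trying this, here is a minimal sketch of the steps above in C++
(the table name "docs" and the file name are just examples; it assumes an
amalgamation built from the fts5 branch with -DSQLITE_ENABLE_FTS5):

// Sketch only: create an FTS5 table and set 'pgsz' before loading any data.
#include "sqlite3.h"

int main()
{
    sqlite3* db = nullptr;
    if (sqlite3_open("fts5test.db", &db) != SQLITE_OK) return 1;

    // Create the FTS5 table, then apply the 'pgsz' option suggested above.
    sqlite3_exec(db, "CREATE VIRTUAL TABLE docs USING fts5(content);",
                 nullptr, nullptr, nullptr);
    sqlite3_exec(db, "INSERT INTO docs(docs, rank) VALUES('pgsz', 4040);",
                 nullptr, nullptr, nullptr);

    // ... bulk INSERTs into docs(content) go here ...

    sqlite3_close(db);
    return 0;
}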

Thanks,
Dan.





[sqlite] Segfault during FTS index creation from huge data

2015-05-19 Thread Dan Kennedy
On 05/19/2015 03:35 PM, Artem wrote:
> Hi!
>
> And what about the result of our conversation?
> Can the developers raise this limitation so that all the memory the user
> has can be used?

Hi Artem,

The conclusion was that although the first problem encountered is the 
massive allocation that FTS tries to make, fixing that won't actually 
improve things much. SQLite is hard-coded to limit the size of blobs to 
a bit under 1GiB. So even if we could merge these big doclists in 
memory, we're pretty close to the limit of what can be stored in the 
database anyway.

The 1GiB limit can be raised to 2GiB by recompiling SQLite with 
different options. But any further than that would require a redesign of 
the file-format.
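
(For reference, a minimal sketch of where that limit lives, assuming a
standard amalgamation build. The compile-time ceiling is SQLITE_MAX_LENGTH -
default 1,000,000,000 bytes, absolute maximum 2147483647 - and sqlite3_limit()
can inspect or lower the per-connection limit but never raise it above the
compile-time value:)

// Sketch only: raising the ceiling requires recompiling, e.g.
//   cc -DSQLITE_MAX_LENGTH=2147483647 ... sqlite3.c
// At run time the current per-connection limit can be read back:
#include <cstdio>
#include "sqlite3.h"

int main()
{
    sqlite3* db = nullptr;
    sqlite3_open(":memory:", &db);

    // Passing -1 leaves the limit unchanged and returns its current value.
    int cur = sqlite3_limit(db, SQLITE_LIMIT_LENGTH, -1);
    std::printf("SQLITE_LIMIT_LENGTH is currently %d bytes\n", cur);

    sqlite3_close(db);
    return 0;
}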

So to fix this, we would need to change the format FTS4 uses to store 
the index within the database. Some sort of extension to allow it to 
spread the largest doclists across two or more database records. And 
unfortunately I suspect that will remain a low priority for the 
foreseeable future.

I've tested FTS5 with really large databases, and it seems to work. It's 
not actually released yet though.

Regards,
Dan.





>
>> One, you should remove sqlite-users at sqlite.org from your To list. I keep
>> bouncing email when I reply to you. Not a big deal, just an FYI.
>> Two:
>> On Sun, May 3, 2015 at 2:13 PM, James K. Lowden 
>> wrote:
>>> On Thu, 30 Apr 2015 12:47:57 -0600
>>> Scott Robison  wrote:
>>>
 Perhaps you are correct and "sigsegv" is not the literal signal that
 is triggered in this case. I don't care, really. The fact is that an
 apparently valid pointer was returned from a memory allocation
 function yet can result in an invalid access for whatever reason (out
 of memory, in this case). The Linux OOM killer may kill the offending
 process (which is what one would expect, but one would also expect
 malloc to return null, so we already know not to expect the
 expected). Or it may kill some other process which has done nothing
 wrong! Sure, the OS is protecting the two processes' address spaces
 from one another, but it seems to me that if one process can kill
 another process, there is a problem.
>>> I have no argument with you, Scott.  It's neither the first nor last
>>> thing Linux implemented in the name of efficiency that undid what
>>> previously were guarantees.  My only angels-on-the-head-of-a-pin
>>> argument is that the OOM-killer doesn't invalidate malloc-returned
>>> pointers in particular.  It sweeps with a much broader brush, you might
>>> say.   ;-)
>>>
>> Okay, I think I see what you're saying, though there seem to be a number of
>> anecdotes online about people who do get a sigsegv from some simple memory
>> allocation strategies (designed to allocate all available memory). I would
>> not discount the possibility of a sigsegv, but agree that it is probably
>> not supposed to happen given the way optimistic memory allocation is
>> claimed to work.
>
>>> SIGSEGV *is* significant to the OP because it doesn't signify heap
>>> exhaustion.  If that signal was triggered in the heap, it indicates
>>> heap corruption.  If it was triggered in the stack, it suggests the
>>> stack might have been exhausted, perhaps before a pure OOM condition was
>>> reached.
>>>
>> The code I was referring to in earlier posts performed a realloc. I wonder
>> if perhaps there is a corner case in there. Growing a block potentially
>> means moving a block, so if the library was copying from a previously
>> allocated block to a new block, maybe that could result in a segfault?
>> Or maybe another explanation could be that some other library or other piece
>> of code replaced the default malloc-family functions.
>
>
>



[sqlite] Segfault during FTS index creation from huge data

2015-05-19 Thread Artem
> On 05/19/2015 03:35 PM, Artem wrote:
>> Hi!
>>
>> And what about the result of our conversation?
>> Can the developers raise this limitation so that all the memory the user
>> has can be used?

> Hi Artem,

> The conclusion was that although the first problem encountered is the 
> massive allocation that FTS tries to make, fixing that won't actually 
> improve things much. SQLite is hard-coded to limit the size of blobs to
> a bit under 1GiB. So even if we could merge these big doclists in 
> memory, we're pretty close to the limit of what can be stored in the 
> database anyway.

> The 1GiB limit can be raised to 2GiB by recompiling SQLite with 
> different options. But any further than that would require a redesign of
> the file-format.

> So to fix this, we would need to change the format FTS4 uses to store 
> the index within the database. Some sort of extension to allow it to 
> spread the largest doclists across two or more database records. And 
> unfortunately I suspect that will remain a low priority for the 
> foreseeable future.

> I've tested FTS5 with really large databases, and it seems to work. It's
> not actually released yet though.

> Regards,
> Dan.

Thank you very much, Dan. Now I understand the situation.
Can you please tell me when FTS5 will be released? Maybe I can test
it on my data before the official release?

>>
>>> One, you should remove sqlite-users at sqlite.org from your To list. I keep
>>> bouncing email when I reply to you. Not a big deal, just an FYI.
>>> Two:
>>> On Sun, May 3, 2015 at 2:13 PM, James K. Lowden
>>> wrote:
 On Thu, 30 Apr 2015 12:47:57 -0600
 Scott Robison  wrote:

> Perhaps you are correct and "sigsegv" is not the literal signal that
> is triggered in this case. I don't care, really. The fact is that an
> apparently valid pointer was returned from a memory allocation
> function yet can result in an invalid access for whatever reason (out
> of memory, in this case). The Linux OOM killer may kill the offending
> process (which is what one would expect, but one would also expect
> malloc to return null, so we already know not to expect the
> expected). Or it may kill some other process which has done nothing
> wrong! Sure, the OS is protecting the two processes' address spaces
> from one another, but it seems to me that if one process can kill
> another process, there is a problem.
 I have no argument with you, Scott.  It's neither the first nor last
 thing Linux implemented in the name of efficiency that undid what
 previously were guarantees.  My only angels-on-the-head-of-a-pin
 argument is that the OOM-killer doesn't invalidate malloc-returned
 pointers in particular.  It sweeps with a much broader brush, you might
 say.   ;-)

>>> Okay, I think I see what you're saying, though there seem to be a number of
>>> anecdotes online about people who do get a sigsegv from some simple memory
>>> allocation strategies (designed to allocate all available memory). I would
>>> not discount the possibility of a sigsegv, but agree that it is probably
>>> not supposed to happen given the way optimistic memory allocation is
>>> claimed to work.
>>
 SIGSEGV *is* significant to the OP because it doesn't signify heap
 exhaustion.  If that signal was triggered in the heap, it indicates
 heap corruption.  If it was triggered in the stack, it suggests the
 stack might have been exhausted, perhaps before a pure OOM condition was
 reached.

>>> The code I was referring to in earlier posts performed a realloc. I wonder
>>> if perhaps there is a corner case in there. Growing a block potentially
>>> means moving a block, so if the library was copying from a previously
>>> allocated block to a new block, maybe that could result in a segfault?
>>> Or maybe another explanation could be that some other library or other piece
>>> of code replaced the default malloc-family functions.
>>
>>
>>





[sqlite] Segfault during FTS index creation from huge data

2015-05-19 Thread Artem

> On 19 May 2015, at 9:35am, Artem  wrote:

>> Can the developers raise this limitation so that all the memory the user
>> has can be used?

> The answer to this depends on which operating system you are
> running and how it is configured.  The answers cannot be changed by changing 
> a setting in SQLite.

Please read my previous messages. I have run my FTS queries on both
operating systems - Linux and Windows - on both 32-bit and 64-bit
architectures, and with the System.Data.SQLite .NET provider built for
both x86 and x64. I always get the same result: 500 million records
fit into FTS, but 1 billion do not.

> Simon.




[sqlite] Segfault during FTS index creation from huge data

2015-05-19 Thread Artem
Hi!

And what about the result of our conversation?
Can the developers raise this limitation so that all the memory the user
has can be used?

> One, you should remove sqlite-users at sqlite.org from your To list. I keep
> bouncing email when I reply to you. Not a big deal, just an FYI.

> Two:

> On Sun, May 3, 2015 at 2:13 PM, James K. Lowden 
> wrote:

>> On Thu, 30 Apr 2015 12:47:57 -0600
>> Scott Robison  wrote:
>>
>> > Perhaps you are correct and "sigsegv" is not the literal signal that
>> > is triggered in this case. I don't care, really. The fact is that an
>> > apparently valid pointer was returned from a memory allocation
>> > function yet can result in an invalid access for whatever reason (out
>> > of memory, in this case). The Linux OOM killer may kill the offending
>> > process (which is what one would expect, but one would also expect
>> > malloc to return null, so we already know not to expect the
>> > expected). Or it may kill some other process which has done nothing
>> > wrong! Sure, the OS is protecting the two processes' address spaces
>> > from one another, but it seems to me that if one process can kill
>> > another process, there is a problem.
>>
>> I have no argument with you, Scott.  It's neither the first nor last
>> thing Linux implemented in the name of efficiency that undid what
>> previously were guarantees.  My only angels-on-the-head-of-a-pin
>> argument is that the OOM-killer doesn't invalidate malloc-returned
>> pointers in particular.  It sweeps with a much broader brush, you might
>> say.   ;-)
>>

> Okay, I think I see what you're saying, though there seem to be a number of
> anecdotes online about people who do get a sigsegv from some simple memory
> allocation strategies (designed to allocate all available memory). I would
> not discount the possibility of a sigsegv, but agree that it is probably
> not supposed to happen given the way optimistic memory allocation is
> claimed to work.


>> SIGSEGV *is* significant to the OP because it doesn't signify heap
>> exhaustion.  If that signal was triggered in the heap, it indicates
>> heap corruption.  If it was triggered in the stack, it suggests the
>> stack might have been exhausted, perhaps before a pure OOM condition was
>> reached.
>>

> The code I was referring to in earlier posts performed a realloc. I wonder
> if perhaps there is a corner case in there. Growing a block potentially
> means moving a block, so if the library was copying from a previously
> allocated block to a new block, maybe that could result in a segfault?

> Or maybe another explanation could be that some other library or other piece
> of code replaced the default malloc-family functions.




-- 
Best regards,
 Artem  mailto:devspec at yandex.ru



[sqlite] Segfault during FTS index creation from huge data

2015-05-19 Thread Simon Slavin

On 19 May 2015, at 9:35am, Artem  wrote:

> Can the developers raise this limitation so that all the memory the user
> has can be used?

The answer to this depends on which operating system you are running and how it 
is configured.  The answers cannot be changed by changing a setting in SQLite.

Simon.


[sqlite] Segfault during FTS index creation from huge data

2015-05-03 Thread James K. Lowden
On Thu, 30 Apr 2015 12:47:57 -0600
Scott Robison  wrote:

> Perhaps you are correct and "sigsegv" is not the literal signal that
> is triggered in this case. I don't care, really. The fact is that an
> apparently valid pointer was returned from a memory allocation
> function yet can result in an invalid access for whatever reason (out
> of memory, in this case). The Linux OOM killer may kill the offending
> process (which is what one would expect, but one would also expect
> malloc to return null, so we already know not to expect the
> expected). Or it may kill some other process which has done nothing
> wrong! Sure, the OS is protecting the two processes' address spaces
> from one another, but it seems to me that if one process can kill
> another process, there is a problem.

I have no argument with you, Scott.  It's neither the first nor last
thing Linux implemented in the name of efficiency that undid what
previously were guarantees.  My only angels-on-the-head-of-a-pin
argument is that the OOM-killer doesn't invalidate malloc-returned
pointers in particular.  It sweeps with a much broader brush, you might
say.   ;-)  

SIGSEGV *is* significant to the OP because it doesn't signify heap
exhaustion.  If that signal was triggered in the heap, it indicates
heap corruption.  If it was triggered in the stack, it suggests the
stack might have been exhausted, perhaps before a pure OOM condition was
reached.  

--jkl


[sqlite] Segfault during FTS index creation from huge data

2015-05-03 Thread Scott Robison
One, you should remove sqlite-users at sqlite.org from your To list. I keep
bouncing email when I reply to you. Not a big deal, just an FYI.

Two:

On Sun, May 3, 2015 at 2:13 PM, James K. Lowden 
wrote:

> On Thu, 30 Apr 2015 12:47:57 -0600
> Scott Robison  wrote:
>
> > Perhaps you are correct and "sigsegv" is not the literal signal that
> > is triggered in this case. I don't care, really. The fact is that an
> > apparently valid pointer was returned from a memory allocation
> > function yet can result in an invalid access for whatever reason (out
> > of memory, in this case). The Linux OOM killer may kill the offending
> > process (which is what one would expect, but one would also expect
> > malloc to return null, so we already know not to expect the
> > expected). Or it may kill some other process which has done nothing
> > wrong! Sure, the OS is protecting the two processes' address spaces
> > from one another, but it seems to me that if one process can kill
> > another process, there is a problem.
>
> I have no argument with you, Scott.  It's neither the first nor last
> thing Linux implemented in the name of efficiency that undid what
> previously were guarantees.  My only angels-on-the-head-of-a-pin
> argument is that the OOM-killer doesn't invalidate malloc-returned
> pointers in particular.  It sweeps with a much broader brush, you might
> say.   ;-)
>

Okay, I think I see what you're saying, though there seem to be a number of
anecdotes online about people who do get a sigsegv from some simple memory
allocation strategies (designed to allocate all available memory). I would
not discount the possibility of a sigsegv, but agree that it is probably
not supposed to happen given the way optimistic memory allocation is
claimed to work.


> SIGSEGV *is* significant to the OP because it doesn't signify heap
> exhaustion.  If that signal was triggered in the heap, it indicates
> heap corruption.  If it was triggered in the stack, it suggests the
> stack might been exhausted, perhaps before a pure OOM condition was
> reached.
>

The code I was referring to in earlier posts performed a realloc. I wonder
if perhaps there is a corner case in there. Growing a block potentially
means moving a block, so if the library was copying from a previously
allocated block to a new block, maybe that could result in a segfault?

Or maybe another explanation could be that some other library or other piece
of code replaced the default malloc-family functions.

-- 
Scott Robison


[sqlite] Segfault during FTS index creation from huge data

2015-04-30 Thread Dan Kennedy
On 04/30/2015 02:08 PM, Dominique Pellé wrote:
> Dan Kennedy  wrote:
>
>> On 04/30/2015 07:41 AM, Scott Robison wrote:
>>>
>>> I wrote up some test code today that definitely forces a SQLITE_NOMEM
>>> condition, and there are no leaks (though there are a couple really large
>>> outstanding allocations to force an error without needing to insert
>>> gigabytes of data).
>>>
>>> In thinking about what Dan wrote last night (many short word matches) I
>>> decided that maybe my supposition was wrong and it wasn't the number of
>>> relatively unique words in my old data set. Using the FTS4AUX virtual
>>> table, I recall looking at the info in the DB at the time and seeing a
>>> huge
>>> number of words with 1 or very few matches, and assumed that was the
>>> problem in that data set. However, given the random nature of that data
>>> set, and the fact that there are only 26 single letter "words" and 676
>>> double letter "words" (and so on), I could have easily missed the
>>> relatively few rows of data that had very large numbers of docids /
>>> occurrences in the aux table output.
>>>
>>> My test app goes to the other extreme. It inserts as many rows as possible
>>> consisting of the single letter word "a" 256 times, and in my case, it
>>> fails after 1,052,641 rows were inserted (1,048,576 rows were committed).
>>>
>>> In any case, my memory of the "merging" of things was correct, though the
>>> precise location may not have been.
>>>
>>> Notes:
>>>
>>> 1. Depending on how much memory pressure I put on the system by
>>> pre-allocating even more big chunks of memory, the NOMEM error moves
>>> around
>>> a bit. I've seen it happen in a malloc as well.
>>>
>>> 2. The reality is that FTS was designed around certain assumptions, and
>>> these extra large data sets don't fit into those cases. In my case, the
>>> only time I've seen errors was due to synthetic / unrealistic test data.
>>> The exponential growth related to segment directory merges seems to
>>> dictate
>>> that eventually, after a lot of inserts / updates, the data structures are
>>> going to get quite large.
>>
>> Thanks for doing this. I'll look at it properly later on today to see if the
>> results suggest anything we can do.
>>
>> I'm thinking there's another problem though. At some point soon, we run into
>> this:
>>
>>https://www.sqlite.org/limits.html#max_length
>>
>> The default limit is 10^9 bytes.
>>
>> The fully-merged doclist generated for the term "a" in the above is probably
>> around 256MiB in size. So if we could successfully merge it in memory, it
>> could be inserted into the database. However, once you get up over a billion
>> records there might be doclists for common terms that exceed this limit.
>>
>> Dan.
>
> What about FTS5? I understand that it will use less memory than
> FTS3/FTS4. Will it solve this problem?

In theory, yes. But I haven't actually tried it yet.

Dan.



[sqlite] Segfault during FTS index creation from huge data

2015-04-30 Thread James K. Lowden
On Wed, 29 Apr 2015 20:29:07 -0600
Scott Robison  wrote:

> > That code can fail on a system configured to overcommit memory. By
> > that standard, the pointer is invalid.
> >
> 
> Accidentally sent before I was finished. In any case, by "invalid
> pointer" I did not mean to imply "it returns a bit pattern that could
> never represent a valid pointer". I mean "if you dereference a
> pointer returned by malloc that is not null or some implementation
> defined value, it should not result in an invalid memory access".

Agreed.  And I don't think that will happen with malloc.  It might, and
I have a plausible scenario, but I don't think that's what happened.  

In the bizarre context of the Linux OOM killer, the OS may promise more
memory than it can supply.  This promise is made by malloc and
materialized by writes to memory allocated through the returned
pointer, because at the time of writing the OS must actually (and may
fail to) allocate the memory from RAM or swap.  

Exhaustion of overcommitted memory does *not* result in SIGSEGV,
however.  The OOM killer selects a process for SIGKILL, and the
straw-on-the-camel's-back process that triggered the OOM condition is
not necessarily the one that is selected.  

As far as "invalid" goes, I don't see how we can single out pointers
from malloc.  In the presence of overcommitted memory, *all* addresses,
including that of the program text, are invalid in the sense that they
are undependable.  The process may be killed through no fault of its
own by virtue of a heuristic.  I think it's fair to say it makes the
machine nondeterministic, or at least adds to the machine's
nondeterminism.  

Can writing through a pointer returned by malloc (within the
allocated range) ever result in SIGSEGV?  Maybe.   I have a plausible
scenario in the context of sparse files and mmap, which malloc uses.  

Let us say that you have two processes on a 64-bit machine, and a 1 TB
filesystem.  Each process opens a new file, seeks to the position 1 TB -
1, and writes 1 byte.  Each process now owns a file whose "size" is 1 TB
and whose block count is 1.  Most of the filesystem is empty, yet the
two files have allocated 200% of the available space.  These are known
as "sparse" files; the unwritten locations are called "holes".  

Now each process calls mmap(2) on its file for the entire 1 TB.  Still
OK.  mmap will not fail.  The holes in the files return 0 when read.
When written to, the OS allocates a block from filesystem and maps it
to a page of memory.  As each process begins writing 1's sequentially
to its memory, successive blocks are allocated.  Soon enough the last
block is allocated and the filesystem will be really and truly full.  

At the next allocation, no block can be allocated and no page mapped.
What to do?  When calling write(2) on a full filesystem we expect
ENOSPC, but there's nowhere to return an error condition when writing
to memory.  Consequently the OS has no choice but to signal the
process.  That signal will be, yes, SIGSEGV.  
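
A minimal sketch of the sparse-file scenario described above (the path is
hypothetical, and whether the eventual failure arrives as SIGSEGV or SIGBUS
can vary by platform):

// Sketch only: create a huge sparse file, map it, then write through the map.
// mmap() succeeds even though the filesystem cannot back every page; the
// trouble only starts when writes force blocks to actually be allocated.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main()
{
    const long long kSize = 1024LL * 1024 * 1024 * 1024;   // 1 TB "size", 1 block used

    int fd = open("/tmp/sparse.bin", O_RDWR | O_CREAT, 0600);
    if (fd < 0) return 1;

    // Seek to the last byte and write one byte: the file is now sparse.
    pwrite(fd, "", 1, kSize - 1);

    void* m = mmap(nullptr, kSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (m == MAP_FAILED) return 1;
    char* p = static_cast<char*>(m);

    // Each write may force a filesystem block to be allocated. Once the
    // filesystem is full there is nowhere to return an error, so the
    // process is signalled instead.
    for (long long i = 0; i < kSize; i += 4096) p[i] = 1;

    munmap(m, kSize);
    close(fd);
    return 0;
}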

What does that have to do with malloc?  GNU malloc uses mmap for large
allocations; the pointer it returns is supplied by mmap for an
anonymous mapping to blocks in the swap partition.  If malloc creates
sparse files, writes through malloc'd pointers could result in SIGSEGV.
However, I do not know that that's what malloc does.  

I do not think that's what's happening in the OP's case.  I suspect the
OP's process sailed past any memory-allocation constraints because of
the overcommitted memory configuration, and eventually ran aground when
the stack was exhausted.  Others have already suggested fixing the
overcommit setting as a first step.  Others might be:

1.  Examine the core dump to determine if the SIGSEGV was triggered by
a write to heap or stack memory.  Or not, as the case may be.  ;-)  

2.  Investigate the malloc algorithm and/or replace it with one that
does not use sparse files.  

3.  Increase the stack space allocated to the process (see the sketch below).
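
A minimal sketch of raising the limit from inside the process (Linux-specific;
the 64 MiB figure is an arbitrary example, and the soft limit can only be
raised up to the hard limit):

// Sketch only: raise the soft RLIMIT_STACK so the main stack can grow further.
#include <sys/resource.h>
#include <cstdio>

int main()
{
    rlimit rl{};
    if (getrlimit(RLIMIT_STACK, &rl) != 0) return 1;

    rlim_t want = 64 * 1024 * 1024;                // 64 MiB, arbitrary example
    if (rl.rlim_max != RLIM_INFINITY && want > rl.rlim_max)
        want = rl.rlim_max;                        // cannot exceed the hard limit
    rl.rlim_cur = want;

    if (setrlimit(RLIMIT_STACK, &rl) != 0) return 1;
    std::printf("stack soft limit is now %llu bytes\n",
                static_cast<unsigned long long>(rl.rlim_cur));
    return 0;
}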

It's an interesting problem.  I hope we learn the answer.   

--jkl


[sqlite] Segfault during FTS index creation from huge data

2015-04-30 Thread Dan Kennedy
On 04/30/2015 07:41 AM, Scott Robison wrote:
>
> I wrote up some test code today that definitely forces a SQLITE_NOMEM
> condition, and there are no leaks (though there are a couple really large
> outstanding allocations to force an error without needing to insert
> gigabytes of data).
>
> In thinking about what Dan wrote last night (many short word matches) I
> decided that maybe my supposition was wrong and it wasn't the number of
> relatively unique words in my old data set. Using the FTS4AUX virtual
> table, I recall looking at the info in the DB at the time and seeing a huge
> number of words with 1 or very few matches, and assumed that was the
> problem in that data set. However, given the random nature of that data
> set, and the fact that there are only 26 single letter "words" and 676
> double letter "words" (and so on), I could have easily missed the
> relatively few rows of data that had very large numbers of docids /
> occurrences in the aux table output.
>
> My test app goes to the other extreme. It inserts as many rows as possible
> consisting of the single letter word "a" 256 times, and in my case, it
> fails after 1,052,641 rows were inserted (1,048,576 rows were committed).
>
> In any case, my memory of the "merging" of things was correct, though the
> precise location may not have been.
>
> Notes:
>
> 1. Depending on how much memory pressure I put on the system by
> pre-allocating even more big chunks of memory, the NOMEM error moves around
> a bit. I've seen it happen in a malloc as well.
>
> 2. The reality is that FTS was designed around certain assumptions, and
> these extra large data sets don't fit into those cases. In my case, the
> only time I've seen errors was due to synthetic / unrealistic test data.
> The exponential growth related to segment directory merges seems to dictate
> that eventually, after a lot of inserts / updates, the data structures are
> going to get quite large.

Thanks for doing this. I'll look at it properly later on today to see if 
the results suggest anything we can do.

I'm thinking there's another problem though. At some point soon, we run 
into this:

   https://www.sqlite.org/limits.html#max_length

The default limit is 10^9 bytes.

The fully-merged doclist generated for the term "a" in the above is 
probably around 256MiB in size. So if we could successfully merge it in 
memory, it could be inserted into the database. However, once you get up 
over a billion records there might be doclists for common terms that 
exceed this limit.

Dan.








>
> 3. One possible change that might have an impact for Artem: right now the
> FTS_MERGE_COUNT is set to 16. 16 also seems to be hard coded in FTS in a
> few places, though I could be mistaken and the instances of "16" as a magic
> number could be coincidence. Regardless, I wonder if a different value of
> FTS_MERGE_COUNT might tweak the system so that it takes a lot longer for
> Artem to encounter the problem.
>
> Finally, the details:
>
> In this exact case, the call stack looks as follows at the time of the
> SQLITE_NOMEM error code:
>
> sqlite3MemRealloc, line 17109, sqlite3-1.c
> sqlite3Realloc, line 20996, sqlite3-1.c
> sqlite3_realloc, line 21022, sqlite3-1.c
> sqlite3Fts3SegReaderStep, line 2946, sqlite3-6.c (attempting to realloc to
> 135,265,788 bytes)
> fts3SegmentMerge, line 3214, sqlite3-6.c
> fts3AllocateSegdirIdx, line 1150, sqlite3-6.c
> fts3SegmentMerge, line 3199, sqlite3-6.c
> fts3AllocateSegdirIdx, line 1150, sqlite3-6.c
> fts3SegmentMerge, line 3199, sqlite3-6.c
> sqlite3Fts3PendingTermsFlush, line 3252, sqlite3-6.c
> fts3PendingTermsDocid, line 878, sqlite3-6.c
> sqlite3Fts3UpdateMethod, line 5619, sqlite3-6.c
> fts3UpdateMethod, line 21701, sqlite3-5.c
> sqlite3VdbeExec, line 24064, sqlite3-3.c
> sqlite3Step, line 16367, sqlite3-3.c
> sqlite3_step, line 16434, sqlite3-3.c
> main, line 68: if (chk(sqlite3_step(stmt)) != SQLITE_DONE)
>
> Here is my test code (C++, compiled with Visual C++ 2010, though I'm quite
> certain that is not the problem):
>
> #include <iostream>
> #include <sstream>
> #include <string>
> #include <vector>
>
> #include "sqlite3.h"
>
> sqlite3* db = nullptr;
>
> int chk(int errcode)
> {
> if ((errcode > 0) && (errcode < 100))
> {
> int ext_errcode = sqlite3_extended_errcode(db);
> const char* p = sqlite3_errmsg(db);
> if (!p || !*p)
> p = "{missing errmsg}";
> std::ostringstream oss;
> oss << ext_errcode << '-' << p;
> throw oss.str();
> }
>
> return errcode;
> }
>
> int main()
> {
> unsigned long long ull = 0;
>
> // allocate a bunch of memory to put pressure on malloc
> std::vector<char*> allocs;
> allocs.push_back(new char[1024*1024*1024]);
> allocs.push_back(new char[256*1024*1024]);
> //allocs.push_back(new char[128*1024*1024]);
> //allocs.push_back(new char[64*1024*1024]);
>
> try
> {
> chk(sqlite3_open("test.db", &db));
> chk(sqlite3_exec(db, "BEGIN", nullptr, nullptr, nullptr));
> chk(sqlite3_exec(db, "CREATE VIRTUAL TABLE IF NOT EXISTS data USING fts4();", nullptr, nullptr, nullptr));
>
> 

[sqlite] Segfault during FTS index creation from huge data

2015-04-30 Thread Scott Robison
On Thu, Apr 30, 2015 at 11:42 AM, James K. Lowden 
wrote:

> On Wed, 29 Apr 2015 20:29:07 -0600
> Scott Robison  wrote:
>
> > > That code can fail on a system configured to overcommit memory. By
> > > that standard, the pointer is invalid.
> > >
> >
> > Accidentally sent before I was finished. In any case, by "invalid
> > pointer" I did not mean to imply "it returns a bit pattern that could
> > never represent a valid pointer". I mean "if you dereference a
> > pointer returned by malloc that is not null or some implementation
> > defined value, it should not result in an invalid memory access".
>
> Agreed.  And I don't think that will happen with malloc.  It might, and
> I have a plausible scenario, but I don't think that's what happened.
>

The Linux man page for malloc documents that the pointer returned may not
be usable in the case of optimistic memory allocations, as the eventual use
of the pointer may trigger the need to commit a page of memory to the
address space and that a page of memory may not be available at that point
in time. Thus malloc, on Linux, makes no guarantee as to the viability of
using the returned pointer.

Perhaps you are correct and "sigsegv" is not the literal signal that is
triggered in this case. I don't care, really. The fact is that an
apparently valid pointer was returned from a memory allocation function yet
can result in an invalid access for whatever reason (out of memory, in this
case). The Linux OOM killer may kill the offending process (which is what
one would expect, but one would also expect malloc to return null, so we
already know not to expect the expected). Or it may kill some other process
which has done nothing wrong! Sure, the OS is protecting the two processes'
address spaces from one another, but it seems to me that if one process can
kill another process, there is a problem.

I can see the utility of a memory allocation strategy like this. It should
in no way be the *default* memory allocation strategy, especially for a
system that tries to be posix compliant, because this is in direct
violation of posix compliance (by my reading) from
http://pubs.opengroup.org/onlinepubs/009695399/functions/malloc.html

Upon successful completion with *size* not equal to 0, *malloc*() shall
> return a pointer to the allocated space.
>

Or maybe posix just needs a better definition for "allocated space". Sure,
an address was allocated in the process's address space, but actual memory
was not allocated.

The decades old interface contract was "if you call malloc with a non-zero
size, you can depend on malloc to return a null pointer or a pointer to the
first byte of an uninitialized allocation". Thus your application could
decide what to do if the memory was not available: abort, exit, select an
alternative code path that can get the job done with less or no memory,
ignore the return value and let sigsegv handle it later.

Now with optimistic memory allocation, you do not have a choice. If your
malloc call results in an overcommit, your process can be killed later when
it tries to access the memory. Or some other innocent process might be
killed.

I really wonder how many man hours have been wasted trying to debug
problems with processes just to find out that the killed process did
nothing wrong, it was some other process overcommitting memory. Or worse,
how many man hours were wasted and no good reason was ever learned.

I came across this last night while learning more about OOM:
https://lwn.net/Articles/104179/ -- particular, the analogy, which I think
is spot on.

I realize that there is no one right answer to how an OS should handle
memory exhaustion. There are various tradeoffs. However, C is not an
operating system, it is a language, and the standards tell you how you can
expect it to behave. In this case, the C API is broken on Linux by default,
so it becomes impossible to write fault tolerant applications in the face
of this feature.

-- 
Scott Robison


[sqlite] Segfault during FTS index creation from huge data

2015-04-30 Thread Eduardo Morras
On Fri, 3 Apr 2015 18:16:18 +0300
Artem  wrote:

> Hi!
> 
> The situation is like that. There's a SQLite database with around 3
> billion records. Each record consists of a certain CHAR field and
> several other additional fields with different types. The file size
> is approx. 340 GB. The maximum content length in the doc field is 256
> characters, the content is in Russian.


You can extend the fts3/4 tokenizers to recognize Russian stop words[1] and
exclude them from the FTS index. I don't know Russian, but in English, examples
of stop words are: 'a', 'the', 'of', etc.

See https://www.sqlite.org/fts3.html#section_8_1 for how to implement your own
tokenizer or extend the unicode one to exclude your stop words. A quick hack
would be to add code at the end of the icuNext[2] function (file
ext/fts3/fts3_icu.c) that checks whether the token is in your stop word list
and, if so, skips it [3] (a pointer to the current string), something like this:

233 *piEndOffset = pCsr->aOffset[iEnd];
234 *piPosition = pCsr->iToken++;
235 if ( token_is_stop_word(ppToken,nByte) ){
236   *ppToken = Pointer_to_empty_string;
237   *pnBytes = 0;
238   *piStartOffset = pCsr->aOffset[iStart+nByte];
239   *piEndOffset = pCsr->aOffset[iEnd+nByte];
240   *piPosition = pCsr->iToken--;
241 }
242 return SQLITE_OK;

N.B. It's a quick hack; I haven't compiled it, run it, or checked it against
the full SQLite3 documentation.

There are lists of stop words available[4][5] on the internet.

[1] https://en.wikipedia.org/wiki/Stop_words
[2] 
http://www.sqlite.org/src/info/e319e108661147bcca8dd511cd562f33a1ba81b5?txt=1=177
[3] 
http://www.sqlite.org/src/info/e319e108661147bcca8dd511cd562f33a1ba81b5?txt=1=235
[4] https://code.google.com/p/stop-words/ (Warning!! GPLv3 code)
[5] http://www.ranks.nl/stopwords/russian (Warning!! Unknown licence)

> Thank you.

HTH

---   ---
Eduardo Morras 


[sqlite] Segfault during FTS index creation from huge data

2015-04-30 Thread Scott Doctor

This makes my head spin. It is ridiculous that overcommit even exists. Any
slight performance gain is totally nullified by a crashing program.

-
Scott Doctor
scott at scottdoctor.com
-

On 4/30/2015 11:47 AM, Scott Robison wrote:
> On Thu, Apr 30, 2015 at 11:42 AM, James K. Lowden
> wrote:
>
>> On Wed, 29 Apr 2015 20:29:07 -0600
>> Scott Robison  wrote:
>>
 That code can fail on a system configured to overcommit memory. By
 that standard, the pointer is invalid.

>>> Accidentally sent before I was finished. In any case, by "invalid
>>> pointer" I did not mean to imply "it returns a bit pattern that could
>>> never represent a valid pointer". I mean "if you dereference a
>>> pointer returned by malloc that is not null or some implementation
>>> defined value, it should not result in an invalid memory access".
>> Agreed.  And I don't think that will happen with malloc.  It might, and
>> I have a plausible scenario, but I don't think that's what happened.
>>
> The Linux man page for malloc documents that the pointer returned may not
> be usable in the case of optimistic memory allocations, as the eventual use
> of the pointer may trigger the need to commit a page of memory to the
> address space and that a page of memory may not be available at that point
> in time. Thus malloc, on Linux, makes no guarantee as to the viability of
> using the returned pointer.
>
> Perhaps you are correct and "sigsegv" is not the literal signal that is
> triggered in this case. I don't care, really. The fact is that an
> apparently valid pointer was returned from a memory allocation function yet
> can result in an invalid access for whatever reason (out of memory, in this
> case). The Linux OOM killer may kill the offending process (which is what
> one would expect, but one would also expect malloc to return null, so we
> already know not to expect the expected). Or it may kill some other process
> which has done nothing wrong! Sure, the OS is protecting the two processes'
> address spaces from one another, but it seems to me that if one process can
> kill another process, there is a problem.
>
> I can see the utility of a memory allocation strategy like this. It should
> in no way be the *default* memory allocation strategy, especially for a
> system that tries to be posix compliant, because this is in direct
> violation of posix compliance (by my reading) from
> http://pubs.opengroup.org/onlinepubs/009695399/functions/malloc.html
>
> Upon successful completion with *size* not equal to 0, *malloc*() shall
>> return a pointer to the allocated space.
>>
> Or maybe posix just needs a better definition for "allocated space". Sure,
> an address was allocated in the process's address space, but actual memory
> was not allocated.
>
> The decades old interface contract was "if you call malloc with a non-zero
> size, you can depend on malloc to return a null pointer or a pointer to the
> first byte of an uninitialized allocation". Thus your application could
> decide what to do if the memory was not available: abort, exit, select an
> alternative code path that can get the job done with less or no memory,
> ignore the return value and let sigsegv handle it later.
>
> Now with optimistic memory allocation, you do not have a choice. If your
> malloc call results in an overcommit, your process can be killed later when
> it tries to access the memory. Or some other innocent process might be
> killed.
>
> I really wonder how many man hours have been wasted trying to debug
> problems with processes just to find out that the killed process did
> nothing wrong, it was some other process overcommitting memory. Or worse,
> how many man hours were wasted and no good reason was ever learned.
>
> I came across this last night while learning more about OOM:
> https://lwn.net/Articles/104179/ -- particular, the analogy, which I think
> is spot on.
>
> I realize that there is no one right answer to how an OS should handle
> memory exhaustion. There are various tradeoffs. However, C is not an
> operating system, it is a language, and the standards tell you how you can
> expect it to behave. In this case, the C API is broken on Linux by default,
> so it becomes impossible to write fault tolerant applications in the face
> of this feature.
>



[sqlite] Segfault during FTS index creation from huge data

2015-04-30 Thread Scott Robison
On Apr 29, 2015 11:50 PM, "Dan Kennedy"  wrote:
>
> On 04/30/2015 07:41 AM, Scott Robison wrote:
>>
>>
>> I wrote up some test code today that definitely forces a SQLITE_NOMEM
>> condition, and there are no leaks (though there are a couple really large
>> outstanding allocations to force an error without needing to insert
>> gigabytes of data).
>>
>> In thinking about what Dan wrote last night (many short word matches) I
>> decided that maybe my supposition was wrong and it wasn't the number of
>> relatively unique words in my old data set. Using the FTS4AUX virtual
>> table, I recall looking at the info in the DB at the time and seeing a
huge
>> number of words with 1 or very few matches, and assumed that was the
>> problem in that data set. However, given the random nature of that data
>> set, and the fact that there are only 26 single letter "words" and 676
>> double letter "words" (and so on), I could have easily missed the
>> relatively few rows of data that had very large numbers of docids /
>> occurrences in the aux table output.
>>
>> My test app goes to the other extreme. It inserts as many rows as
possible
>> consisting of the single letter word "a" 256 times, and in my case, it
>> fails after 1,052,641 rows were inserted (1,048,576 rows were committed).
>>
>> In any case, my memory of the "merging" of things was correct, though the
>> precise location may not have been.
>>
>> Notes:
>>
>> 1. Depending on how much memory pressure I put on the system by
>> pre-allocating even more big chunks of memory, the NOMEM error moves
around
>> a bit. I've seen it happen in a malloc as well.
>>
>> 2. The reality is that FTS was designed around certain assumptions, and
>> these extra large data sets don't fit into those cases. In my case, the
>> only time I've seen errors was due to synthetic / unrealistic test data.
>> The exponential growth related to segment directory merges seems to
dictate
>> that eventually, after a lot of inserts / updates, the data structures
are
>> going to get quite large.
>
>
> Thanks for doing this. I'll look at it properly later on today to see if
the results suggest anything we can do.
>
> I'm thinking there's another problem though. At some point soon, we run
into this:
>
>   https://www.sqlite.org/limits.html#max_length
>
> The default limit is 10^9 bytes.
>
> The fully-merged doclist generated for the term "a" in the above is
probably around 256MiB in size. So if we could successfully merge it in
memory, it could be inserted into the database. However, once you get up
over a billion records there might be doclists for common terms that exceed
this limit.

Excellent points. It all comes back to "FTS3/4 was not designed with
datasets this large in mind, and nothing simple is going to make things
better."

The stop word list seems like it could help, if it is viable in this use
case, or partitioning as I described earlier. Neither is necessarily
optimal though.

Just as there are use cases where SQLite is not recommended as a general
database solution, it seems we've identified one where FTS is not either.
Might be able to force it, but you aren't going to want to run your Google
killer with it. :)

>
> Dan.
>
>
>
>
>
>
>
>
>>
>> 3. One possible change that might have an impact for Artem: right now the
>> FTS_MERGE_COUNT is set to 16. 16 also seems to be hard coded in FTS in a
>> few places, though I could be mistaken and the instances of "16" as a
magic
>> number could be coincidence. Regardless, I wonder if a different value of
>> FTS_MERGE_COUNT might tweak the system so that it takes a lot longer for
>> Artem to encounter the problem.
>>
>> Finally, the details:
>>
>> In this exact case, the call stack looks as follows at the time of the
>> SQLITE_NOMEM error code:
>>
>> sqlite3MemRealloc, line 17109, sqlite3-1.c
>> sqlite3Realloc, line 20996, sqlite3-1.c
>> sqlite3_realloc, line 21022, sqlite3-1.c
>> sqlite3Fts3SegReaderStep, line 2946, sqlite3-6.c (attempting to realloc to
>> 135,265,788 bytes)
>> fts3SegmentMerge, line 3214, sqlite3-6.c
>> fts3AllocateSegdirIdx, line 1150, sqlite3-6.c
>> fts3SegmentMerge, line 3199, sqlite3-6.c
>> fts3AllocateSegdirIdx, line 1150, sqlite3-6.c
>> fts3SegmentMerge, line 3199, sqlite3-6.c
>> sqlite3Fts3PendingTermsFlush, line 3252, sqlite3-6.c
>> fts3PendingTermsDocid, line 878, sqlite3-6.c
>> sqlite3Fts3UpdateMethod, line 5619, sqlite3-6.c
>> fts3UpdateMethod, line 21701, sqlite3-5.c
>> sqlite3VdbeExec, line 24064, sqlite3-3.c
>> sqlite3Step, line 16367, sqlite3-3.c
>> sqlite3_step, line 16434, sqlite3-3.c
>> main, line 68: if (chk(sqlite3_step(stmt)) != SQLITE_DONE)
>>
>> Here is my test code (C++, compiled with Visual C++ 2010, though I'm
quite
>> certain that is not the problem):
>>
>> #include <iostream>
>> #include <sstream>
>> #include <string>
>> #include <vector>
>>
>> #include "sqlite3.h"
>>
>> sqlite3* db = nullptr;
>>
>> int chk(int errcode)
>> {
>> if ((errcode > 0) && (errcode < 100))
>> {
>> int ext_errcode = sqlite3_extended_errcode(db);
>> const char* p = 

[sqlite] Segfault during FTS index creation from huge data

2015-04-30 Thread Dominique Pellé
Dan Kennedy  wrote:

> On 04/30/2015 07:41 AM, Scott Robison wrote:
>>
>>
>> I wrote up some test code today that definitely forces a SQLITE_NOMEM
>> condition, and there are no leaks (though there are a couple really large
>> outstanding allocations to force an error without needing to insert
>> gigabytes of data).
>>
>> In thinking about what Dan wrote last night (many short word matches) I
>> decided that maybe my supposition was wrong and it wasn't the number of
>> relatively unique words in my old data set. Using the FTS4AUX virtual
>> table, I recall looking at the info in the DB at the time and seeing a
>> huge
>> number of words with 1 or very few matches, and assumed that was the
>> problem in that data set. However, given the random nature of that data
>> set, and the fact that there are only 26 single letter "words" and 676
>> double letter "words" (and so on), I could have easily missed the
>> relatively few rows of data that had very large numbers of docids /
>> occurrences in the aux table output.
>>
>> My test app goes to the other extreme. It inserts as many rows as possible
>> consisting of the single letter word "a" 256 times, and in my case, it
>> fails after 1,052,641 rows were inserted (1,048,576 rows were committed).
>>
>> In any case, my memory of the "merging" of things was correct, though the
>> precise location may not have been.
>>
>> Notes:
>>
>> 1. Depending on how much memory pressure I put on the system by
>> pre-allocating even more big chunks of memory, the NOMEM error moves
>> around
>> a bit. I've seen it happen in a malloc as well.
>>
>> 2. The reality is that FTS was designed around certain assumptions, and
>> these extra large data sets don't fit into those cases. In my case, the
>> only time I've seen errors was due to synthetic / unrealistic test data.
>> The exponential growth related to segment directory merges seems to
>> dictate
>> that eventually, after a lot of inserts / updates, the data structures are
>> going to get quite large.
>
>
> Thanks for doing this. I'll look at it properly later on today to see if the
> results suggest anything we can do.
>
> I'm thinking there's another problem though. At some point soon, we run into
> this:
>
>   https://www.sqlite.org/limits.html#max_length
>
> The default limit is 10^9 bytes.
>
> The fully-merged doclist generated for the term "a" in the above is probably
> around 256MiB in size. So if we could successfully merge it in memory, it
> could be inserted into the database. However, once you get up over a billion
> records there might be doclists for common terms that exceed this limit.
>
> Dan.


What about FTS5? I understand that it will use less memory than
FTS3/FTS4. Will it solve this problem?

Regards
Dominique


[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread James K. Lowden
On Wed, 29 Apr 2015 01:27:03 +0300
Artem  wrote:

> >> Error: near line 1: out of memory
> 
> > That's not a segfault, though, is it.
> 
> When I did the same in the Linux version of SQLite - I saw
> the "Segmentation Fault" error.

The above message indicates that SQLite trapped an errno of ENOMEM
after malloc returned NULL.  Otherwise it would not report "near line
1", something the kernel could never do.  

One possible way you could see SIGSEGV on Linux -- even if every call
to malloc is handled correctly and no out-of-bounds reference is ever
made -- is if the process runs out of stack space.[1]  That seems within
the realm of possiblity if we're talking about giant allocations.  

It's also plausible that Windows would report ENOMEM while Linux does
not.  If, as is commonly done, Linux is configured to overcommit
memory, malloc will seldom or never (depending
on /proc/sys/vm/overcommit_memory) return NULL.  Given the same RAM and
swap space, Windows would report ENOMEM when Linux would not.  In the
Linux case, the process could plausibly continue (and not be the OOM
victim) up until the point that it exhausts its stack.  

You could check the value of your overcommit_memory and set it to 2 if
that's not already the case.  That might shed some light.  
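
For reference, a minimal sketch of checking the current setting from a program
(it is a single value under /proc; changing it to 2 is normally done as root,
for example with sysctl):

// Sketch only: read the current vm.overcommit_memory policy on Linux.
// 0 = heuristic overcommit (the default), 1 = always overcommit, 2 = strict.
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream f("/proc/sys/vm/overcommit_memory");
    std::string value;
    if (!(f >> value))
    {
        std::cerr << "could not read /proc/sys/vm/overcommit_memory\n";
        return 1;
    }
    std::cout << "vm.overcommit_memory = " << value << "\n";
    return 0;
}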

Developers: What value of overcommit_memory is used in testing?  

---jkl

[1] http://www.win.tue.nl/~aeb/linux/lk/lk-9.html


[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread James K. Lowden
On Wed, 29 Apr 2015 02:39:50 -0600
Scott Robison  wrote:

> On linux, malloc may return a non null yet invalid pointer and only
> fail when the memory is accessed because it wasn't really available.

Citation needed.  I believe SIGSEGV is possible with anonymous mmap and
overcommitted memory.  ISTR I've read as much.  But I was unable to
find a definitive reference saying as much when I looked for one in
answering this message.  

It's not quite accurate to say the pointer is "invalid".  It's valid;
it just refers to memory the system may not be able to supply when
committed_memory > memory.  

--jkl



[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread Scott Robison
On Wed, Apr 29, 2015 at 8:23 PM, Scott Robison 
wrote:

>
> 1. From http://linux.die.net/man/3/malloc
>
> By default, Linux follows an optimistic memory allocation strategy. This
>> means that when *malloc*() returns non-NULL there is no guarantee that
>> the memory really is available. In case it turns out that the system is out
>> of memory, one or more processes will be killed by the OOM killer. For more
>> information, see the description of */proc/sys/vm/overcommit_memory* and
>> */proc/sys/vm/oom_adj* in *proc*(5), and the Linux kernel source file
>> *Documentation/vm/overcommit-accounting*.
>
>
> 2. From ISO/IEC 9899 1990 7.10.3.3
>
> Synopsis
>
>> #include <stdlib.h>
>> void *malloc(size_t size);
>
> Description
>> The malloc function allocates space for an object whose size is specified
>> by size and
>> whose value is indeterminate.
>> Returns
>> The malloc  function returns either a null pointer or a pointer to the
>> allocated space.
>
>
> 3. The following code should never cause an error per the relevant
> standards:
>
> char* p = malloc(1);
>
> if (p) *p = 'x';
>
>
> That code can fail on a system configured to overcommit memory. By that
> standard, the pointer is invalid.
>

Accidentally sent before I was finished. In any case, by "invalid pointer"
I did not mean to imply "it returns a bit pattern that could never
represent a valid pointer". I mean "if you dereference a pointer returned
by malloc that is not null or some implementation defined value, it should
not result in an invalid memory access".

-- 
Scott Robison


[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread Scott Robison
On Wed, Apr 29, 2015 at 8:04 PM, James K. Lowden 
wrote:

> On Wed, 29 Apr 2015 02:39:50 -0600
> Scott Robison  wrote:
>
> > On linux, malloc may return a non null yet invalid pointer and only
> > fail when the memory is accessed because it wasn't really available.
>
> Citation needed.  I believe SIGSEGV is possible with anonymous mmap and
> overcommitted memory.  ISTR I've read as much.  But I was unable to
> find a definitive reference saying as much when I looked for one in
> answering this message.
>
> It's not quite accurate to say the pointer is "invalid".  It's valid;
> it just refers to memory the system may not be able to supply when
> committed_memory > memory.
>

1. From http://linux.die.net/man/3/malloc

By default, Linux follows an optimistic memory allocation strategy. This
> means that when *malloc*() returns non-NULL there is no guarantee that
> the memory really is available. In case it turns out that the system is out
> of memory, one or more processes will be killed by the OOM killer. For more
> information, see the description of */proc/sys/vm/overcommit_memory* and
> */proc/sys/vm/oom_adj* in *proc*(5), and the Linux kernel source file
> *Documentation/vm/overcommit-accounting*.


2. From ISO/IEC 9899 1990 7.10.3.3

Synopsis
> #include <stdlib.h>
> void *malloc(size_t size);

Description
> The malloc function allocates space for an object whose size is specified
> by size and
> whose value is indeterminate.
> Returns
> The malloc  function returns either a null pointer or a pointer to the
> allocated space.


3. The following code should never cause an error per the relevant
standards:

char* p = malloc(1);

if (p) *p = 'x';


That code can fail on a system configured to overcommit memory. By that
standard, the pointer is invalid.
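
A slightly longer sketch of the same idea (assuming a Linux machine with
heuristic or always-on overcommit; the 64 GiB figure is just an example larger
than available RAM plus swap):

// Sketch only: with overcommit enabled this malloc typically succeeds even
// when the machine cannot back the allocation; the failure arrives later,
// while the pages are being touched, and may take down this or another process.
#include <cstdlib>
#include <cstdio>

int main()
{
    const size_t kBytes = 64ULL * 1024 * 1024 * 1024;      // 64 GiB, example figure
    char* p = static_cast<char*>(std::malloc(kBytes));
    if (p == nullptr)
    {
        std::puts("malloc returned NULL");                 // strict-accounting path
        return 0;
    }
    for (size_t i = 0; i < kBytes; i += 4096) p[i] = 'x';  // commit pages one by one
    std::puts("touched every page");
    std::free(p);
    return 0;
}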

-- 
Scott Robison


[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread Scott Robison
On Wed, Apr 29, 2015 at 2:55 AM, Dan Kennedy  wrote:
>
> That we're getting a segfault instead of SQLITE_NOMEM is not an SQLite
bug.
>
> But that SQLite is requesting a ridiculously large allocation (assuming
that is what is happening) is less than ideal as well though. Even if it's
not technically a "bug".
>
> It's just that working around the large allocation in question (if we're
right about which one it is) is tricky to do. And tricky to test too.

I wrote up some test code today that definitely forces a SQLITE_NOMEM
condition, and there are no leaks (though there are a couple really large
outstanding allocations to force an error without needing to insert
gigabytes of data).

In thinking about what Dan wrote last night (many short word matches) I
decided that maybe my supposition was wrong and it wasn't the number of
relatively unique words in my old data set. Using the FTS4AUX virtual
table, I recall looking at the info in the DB at the time and seeing a huge
number of words with 1 or very few matches, and assumed that was the
problem in that data set. However, given the random nature of that data
set, and the fact that there are only 26 single letter "words" and 676
double letter "words" (and so on), I could have easily missed the
relatively few rows of data that had very large numbers of docids /
occurrences in the aux table output.

My test app goes to the other extreme. It inserts as many rows as possible
consisting of the single letter word "a" 256 times, and in my case, it
fails after 1,052,641 rows were inserted (1,048,576 rows were committed).

In any case, my memory of the "merging" of things was correct, though the
precise location may not have been.

Notes:

1. Depending on how much memory pressure I put on the system by
pre-allocating even more big chunks of memory, the NOMEM error moves around
a bit. I've seen it happen in a malloc as well.

2. The reality is that FTS was designed around certain assumptions, and
these extra large data sets don't fit into those cases. In my case, the
only time I've seen errors was due to synthetic / unrealistic test data.
The exponential growth related to segment directory merges seems to dictate
that eventually, after a lot of inserts / updates, the data structures are
going to get quite large.

3. One possible change that might have an impact for Artem: right now the
FTS_MERGE_COUNT is set to 16. 16 also seems to be hard coded in FTS in a
few places, though I could be mistaken and the instances of "16" as a magic
number could be coincidence. Regardless, I wonder if a different value of
FTS_MERGE_COUNT might tweak the system so that it takes a lot longer for
Artem to encounter the problem.

Finally, the details:

In this exact case, the call stack looks as follows at the time of the
SQLITE_NOMEM error code:

sqlite3MemRealloc, line 17109, sqlite3-1.c
sqlite3Realloc, line 20996, sqlite3-1.c
sqlite3_realloc, line 21022, sqlite3-1.c
sqlite3Fts3SegReaderStep, line 2946, sqlite3-6.c (attempting to realloc to
135,265,788 bytes)
fts3SegmentMerge, line 3214, sqlite3-6.c
fts3AllocateSegdirIdx, line 1150, sqlite3-6.c
fts3SegmentMerge, line 3199, sqlite3-6.c
fts3AllocateSegdirIdx, line 1150, sqlite3-6.c
fts3SegmentMerge, line 3199, sqlite3-6.c
sqlite3Fts3PendingTermsFlush, line 3252, sqlite3-6.c
fts3PendingTermsDocid, line 878, sqlite3-6.c
sqlite3Fts3UpdateMethod, line 5619, sqlite3-6.c
fts3UpdateMethod, line 21701, sqlite3-5.c
sqlite3VdbeExec, line 24064, sqlite3-3.c
sqlite3Step, line 16367, sqlite3-3.c
sqlite3_step, line 16434, sqlite3-3.c
main, line 68: if (chk(sqlite3_step(stmt)) != SQLITE_DONE)

Here is my test code (C++, compiled with Visual C++ 2010, though I'm quite
certain that is not the problem):

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

#include "sqlite3.h"

sqlite3* db = nullptr;

int chk(int errcode)
{
if ((errcode > 0) && (errcode < 100))
{
int ext_errcode = sqlite3_extended_errcode(db);
const char* p = sqlite3_errmsg(db);
if (!p || !*p)
p = "{missing errmsg}";
std::ostringstream oss;
oss << ext_errcode << '-' << p;
throw oss.str();
}

return errcode;
}

int main()
{
unsigned long long ull = 0;

// allocate a bunch of memory to put pressure on malloc
std::vector<char*> allocs;
allocs.push_back(new char[1024*1024*1024]);
allocs.push_back(new char[256*1024*1024]);
//allocs.push_back(new char[128*1024*1024]);
//allocs.push_back(new char[64*1024*1024]);

try
{
chk(sqlite3_open("test.db", &db));
chk(sqlite3_exec(db, "BEGIN", nullptr, nullptr, nullptr));
chk(sqlite3_exec(db, "CREATE VIRTUAL TABLE IF NOT EXISTS data USING fts4();",
    nullptr, nullptr, nullptr));

sqlite3_stmt* stmt;
chk(sqlite3_prepare_v2(db, "INSERT INTO data VALUES (?1)", -1, &stmt,
    nullptr));

static const char p[] =
"a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a "
"a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a "
"a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a "
"a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a "
"a a a a a a a a 

[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread Tim Streater
On 29 Apr 2015 at 00:06, Simon Slavin  wrote: 

> On 28 Apr 2015, at 11:49pm, Scott Robison  wrote:
>
>> I never saw a segfault in my case, though I never tried anything on any
>> posix style system. It was strictly Windows.
>
> Windows doesn't call it segfault.  It will report an access violation or (very
> occasionally) a page fault.

I'm simply quoting what the OP said in his OP:

"I?m trying to create a full-text index, but it results in a Segmentation 
Fault error."

--
Cheers  --  Tim


[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread Dan Kennedy
On 04/29/2015 03:39 PM, Scott Robison wrote:
> On windows, malloc returns null if the allocation fails. Sqlite detects
> this and returns an error.
>
> On linux, malloc may return a non null yet invalid pointer and only fail
> when the memory is accessed because it wasn't really available.

That we're getting a segfault instead of SQLITE_NOMEM is not an SQLite bug.

But that SQLite is requesting a ridiculously large allocation (assuming 
that is what is happening) is less than ideal as well though. Even if 
it's not technically a "bug".

It's just that working around the large allocation in question (if we're 
right about which one it is) is tricky to do. And tricky to test too.

Dan.




>
> If Sqlite is not at fault when posix APIs lie about file locking, I don't
> think Sqlite is responsible when malloc lies about what should be
> considered a failed memory allocation which should return null.
> On Apr 29, 2015 2:28 AM, "Simon Slavin"  wrote:
>
>> On 29 Apr 2015, at 9:21am, Scott Robison  wrote:
>>
>>> Personally I don't see it as a bug. A limitation, yes. A different
>>> algorithm that requires less ram would remove / change the limit.
>> Anything inside the SQLite code which causes a Segfault is a bug.
>> Assuming that you're not hacking your OS or messing with SQLite 'black box'
>> variables like connections and statements.
>>
>> Just my humble opinion.  I'm not on the development team.
>>
>> A minimal (either small, or simple to reproduce) repeatable example would
>> be welcome.  It can even include huge data as long as you can describe a
>> way to create crashing data like "Generate a trillion identical rows in
>> this table.".
>>
>> Simon.
>> ___
>> sqlite-users mailing list
>> sqlite-users at mailinglists.sqlite.org
>> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>>
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread Eduardo Morras
On Wed, 29 Apr 2015 12:00:02 +0200
Dominique Pellé wrote:

> I'd suggest to try reproducing the problem on Linux after disabling
> memory overcommit  (at least temporarily for the experimentation):
> 
> * If it returns SQLITE_NOMEM without segfault, then it's not a bug
>   but a limitation of SQLite which uses lots of memory for FTS, or
>   a leak in the application.
> * If it still segfaults, then there is a bug in SQLite or in the
> application.

A quick check of the code in malloc.c shows this comment inside the
void *sqlite3Malloc(u64 n) function. I don't know whether FTS uses
sqlite3_malloc() directly or this one.

/* A memory allocation of a number of bytes which is near the maximum
** signed integer value might cause an integer overflow inside of the
** xMalloc().  Hence we limit the maximum size to 0x7fffff00, giving
** 255 bytes of overhead.  SQLite itself will never use anything near
** this amount.  The only way to reach the limit is with sqlite3_malloc() */



> 
> Regards
> Dominique


---   ---
Eduardo Morras 


[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread Dan Kennedy
On 04/29/2015 05:27 AM, Artem wrote:
> Hello, Tim.
>
> You wrote on 29 April 2015 at 1:21:00:
>
>> On 28 Apr 2015 at 23:14, Artem  wrote:
 How about trying the sqlite3.exe command line utility. put your
 sql for that operation in a text file, launch the program, open
 the database, then read in the sql file with the .read command.
 If the error occurs, then possibly sqlite3. if not then it is
 probably something else.
>>> I tried it and failed.
>>>
>>> Console log:
>>>
>>> f:\Suggests\test>sqlite3.exe single.db
>>> SQLite version 3.8.9 2015-04-08 12:16:33
>>> Enter ".help" for usage hints.
>>> sqlite> .read test.sql
>>> Error: near line 1: out of memory
>> That's not a segfault, though, is it.
> When I did the same in linux version of SQLite - I saw
> the "Segmentation Fault" error.

Maybe something to do with the optimistic allocation strategy Linux 
uses. Perhaps malloc() returned non-NULL but then a segfault occurred 
when it first tried to access the pages. From the man-page:

By default, Linux follows an optimistic memory allocation
strategy. This means that when malloc() returns non-NULL
there is no guarantee that the memory really is available.

Or maybe the OOM killer took out the process. Or something.

Dan.




[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread Artem
> On Apr 29, 2015 2:55 AM, "Dan Kennedy"  wrote:
>>
>> On 04/29/2015 03:39 PM, Scott Robison wrote:
>>>
>>> On windows, malloc returns null if the allocation fails. Sqlite detects
>>> this and returns an error.
>>>
>>> On linux, malloc may return a non null yet invalid pointer and only fail
>>> when the memory is accessed because it wasn't really available.
>>
>>
>> That we're getting a segfault instead of SQLITE_NOMEM is not an SQLite
> bug.
>>
>> But that SQLite is requesting a ridiculously large allocation (assuming
> that is what is happening) is less than ideal as well though. Even if it's
> not technically a "bug".

> Agreed. My point is only that it is not a bug that can be fixed by checking
> for null or some such. The algorithm fails on all 32 bit platforms for
> really large data sets. Windows & BSD (at least) appear to return a null
> pointer allowing the failure to be detected and handled gracefully. Linux
> by default makes that impossible with optimistic allocations.

I also tested it on a 64-bit platform (with sqlite.net) - and it causes the same
OOM error.
I have 32 GB of memory I could use for this, but I can't because of the
library's limitations. IMHO that's a little bit wrong.
Maybe a simple solution exists, like changing int32 to int64 somewhere?

>> It's just that working around the large allocation in question (if we're
> right about which one it is) is tricky to do. And tricky to test too.

> I'll get you some more data to verify my memory.

>>
>> Dan.
>>
>>
>>
>>
>>
>>>
>>> If Sqlite is not at fault when posix APIs lie about file locking, I don't
>>> think Sqlite is responsible when malloc lies about what should be
>>> considered a failed memory allocation which should return null.

> SDR
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users




-- 
Best regards,
 Artem  mailto:devspec at yandex.ru



[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread Dominique Pellé
Simon Slavin  wrote:

> On 29 Apr 2015, at 9:39am, Scott Robison  wrote:
>
>> On windows, malloc returns null if the allocation fails. Sqlite detects
>> this and returns an error.
>>
>> On linux, malloc may return a non null yet invalid pointer and only fail
>> when the memory is accessed because it wasn't really available.
>>
>> If Sqlite is not at fault when posix APIs lie about file locking, I don't
>> think Sqlite is responsible when malloc lies about what should be
>> considered a failed memory allocation which should return null.
>
> Okay, I see your distinction.  Yes, Linux is lying about malloc having 
> succeeded.
> And there's no further way to check the block it handed back apart from 
> trying to
> use it, which triggers the error.  A program can't do anything about a lying 
> OS,
> so SQLite is not at fault.

I'd suggest to try reproducing the problem on Linux after disabling
memory overcommit  (at least temporarily for the experimentation):

* If it returns SQLITE_NOMEM without segfault, then it's not a bug
  but a limitation of SQLite which uses lots of memory for FTS, or
  a leak in the application.
* If it still segfaults, then there is a bug in SQLite or in the application.

Regards
Dominique


[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread Artem
> Personally I don't see it as a bug. A limitation, yes. A different
> algorithm that requires less ram would remove / change the limit.

> I'll be trying some sample data tomorrow  (if time permits) to see if I can
> come up with any ideas.

> I described an approach in a previous email where I divided my FTS index
> into 53 partitions. It means running more queries, but maybe something like
> that could serve as a temporary solution.

What queries (partitions) do you mean? I tried to build the FTS index with a
LIMIT clause, like:
INSERT INTO pages_fts(title, body) SELECT title, body FROM pages LIMIT 0,100
INSERT INTO pages_fts(title, body) SELECT title, body FROM pages LIMIT 100,200
etc.

It was OK until 600 mln rows; after 600 mln rows it failed with a Segmentation
Fault error (on Ubuntu).

> On Apr 29, 2015 1:56 AM, "Artem"  wrote:

>> So, is that a real bug of SQLIte and how to fix it?
>> I really need to create huge FTS-indexes like that,
>> I have 32GB of memory for that.
>>
>> Many thanks for your attention.
>>
>> > On 04/29/2015 05:27 AM, Artem wrote:
>> >> Hello, Tim.
>> >>
>> >> You wrote on 29 April 2015 at 1:21:00:
>> >>
>> >>> On 28 Apr 2015 at 23:14, Artem  wrote:
>> > How about trying the sqlite3.exe command line utility. put your
>> > sql for that operation in a text file, launch the program, open
>> > the database, then read in the sql file with the .read command.
>> > If the error occurs, then possibly sqlite3. if not then it is
>> > probably something else.
>>  I tried it and failed.
>> 
>>  Console log:
>> 
>>  f:\Suggests\test>sqlite3.exe single.db
>>  SQLite version 3.8.9 2015-04-08 12:16:33
>>  Enter ".help" for usage hints.
>>  sqlite> .read test.sql
>>  Error: near line 1: out of memory
>> >>> That's not a segfault, though, is it.
>> >> When I did the same in linux version of SQLite - I saw
>> >> the "Segmentation Fault" error.
>>
>> > Maybe something to do with the optimistic allocation strategy Linux
>> > uses. Perhaps malloc() returned non-NULL but then a segfault occurred
>> > when it first tried to access the pages. From the man-page:
>>
>> > By default, Linux follows an optimistic memory allocation
>> > strategy. This means that when malloc() returns non-NULL
>> > there is no guarantee that the memory really is available.
>>
>> > Or maybe the OOM killer took out the process. Or something.
>>
>> > Dan.
>>
>>
>> > ___
>> > sqlite-users mailing list
>> > sqlite-users at mailinglists.sqlite.org
>> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>>
>>
>>
>> --
>> Best regards,
>>  Artem  mailto:devspec at yandex.ru
>>
>> ___
>> sqlite-users mailing list
>> sqlite-users at mailinglists.sqlite.org
>> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>>
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread Artem
So, is that a real bug in SQLite, and how can it be fixed?
I really need to create huge FTS indexes like that,
and I have 32 GB of memory for it.

Many thanks for your attention.

> On 04/29/2015 05:27 AM, Artem wrote:
>> Hello, Tim.
>>
>> You wrote on 29 April 2015 at 1:21:00:
>>
>>> On 28 Apr 2015 at 23:14, Artem  wrote:
> How about trying the sqlite3.exe command line utility. put your
> sql for that operation in a text file, launch the program, open
> the database, then read in the sql file with the .read command.
> If the error occurs, then possibly sqlite3. if not then it is
> probably something else.
 I tried it and failed.

 Console log:

 f:\Suggests\test>sqlite3.exe single.db
 SQLite version 3.8.9 2015-04-08 12:16:33
 Enter ".help" for usage hints.
 sqlite> .read test.sql
 Error: near line 1: out of memory
>>> That's not a segfault, though, is it.
>> When I did the same in linux version of SQLite - I saw
>> the "Segmentation Fault" error.

> Maybe something to do with the optimistic allocation strategy Linux 
> uses. Perhaps malloc() returned non-NULL but then a segfault occurred 
> when it first tried to access the pages. From the man-page:

> By default, Linux follows an optimistic memory allocation
> strategy. This means that when malloc() returns non-NULL
> there is no guarantee that the memory really is available.

> Or maybe the OOM killer took out the process. Or something.

> Dan.


> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



-- 
Best regards,
 Artem  mailto:devspec at yandex.ru



[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread Simon Slavin

On 29 Apr 2015, at 9:39am, Scott Robison  wrote:

> On windows, malloc returns null if the allocation fails. Sqlite detects
> this and returns an error.
> 
> On linux, malloc may return a non null yet invalid pointer and only fail
> when the memory is accessed because it wasn't really available.
> 
> If Sqlite is not at fault when posix APIs lie about file locking, I don't
> think Sqlite is responsible when malloc lies about what should be
> considered a failed memory allocation which should return null.

Okay, I see your distinction.  Yes, Linux is lying about malloc having 
succeeded.  And there's no further way to check the block it handed back apart 
from trying to use it, which triggers the error.  A program can't do anything 
about a lying OS, so SQLite is not at fault.

Simon.


[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread Eduardo Morras
On Tue, 28 Apr 2015 16:49:46 -0600
Scott Robison  wrote:

> On Tue, Apr 28, 2015 at 4:27 PM, Artem  wrote:
> 
> > > That's not a segfault, though, is it.
> >
> > When I did the same in linux version of SQLite - I saw
> > the "Segmentation Fault" error.
> >
> 
> I never saw a segfault in my case, though I never tried anything on
> any posix style system. It was strictly Windows.

I don't know if it has been asked yet, but are you using a 64-bit or a 32-bit version of Windows?

> -- 
> Scott Robison

---   ---
Eduardo Morras 


[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread Simon Slavin

On 29 Apr 2015, at 9:21am, Scott Robison  wrote:

> Personally I don't see it as a bug. A limitation, yes. A different
> algorithm that requires less ram would remove / change the limit.

Anything inside the SQLite code which causes a Segfault is a bug.  Assuming 
that you're not hacking your OS or messing with SQLite 'black box' variables 
like connections and statements.

Just my humble opinion.  I'm not on the development team.

A minimal (either small, or simple to reproduce) repeatable example would be 
welcome.  It can even include huge data as long as you can describe a way to 
create crashing data like "Generate a trillion identical rows in this table.".

Simon.


[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread Scott Robison
On Apr 29, 2015 2:55 AM, "Dan Kennedy"  wrote:
>
> On 04/29/2015 03:39 PM, Scott Robison wrote:
>>
>> On windows, malloc returns null if the allocation fails. Sqlite detects
>> this and returns an error.
>>
>> On linux, malloc may return a non null yet invalid pointer and only fail
>> when the memory is accessed because it wasn't really available.
>
>
> That we're getting a segfault instead of SQLITE_NOMEM is not an SQLite
bug.
>
> But that SQLite is requesting a ridiculously large allocation (assuming
that is what is happening) is less than ideal as well though. Even if it's
not technically a "bug".

Agreed. My point is only that it is not a bug that can be fixed by checking
for null or some such. The algorithm fails on all 32 bit platforms for
really large data sets. Windows & BSD (at least) appear to return a null
pointer allowing the failure to be detected and handled gracefully. Linux
by default makes that impossible with optimistic allocations.

> It's just that working around the large allocation in question (if we're
right about which one it is) is tricky to do. And tricky to test too.

I'll get you some more data to verify my memory.

>
> Dan.
>
>
>
>
>
>>
>> If Sqlite is not at fault when posix APIs lie about file locking, I don't
>> think Sqlite is responsible when malloc lies about what should be
>> considered a failed memory allocation which should return null.

SDR


[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread Scott Robison
I created multiple FTS virtual tables and inserted data into specific
tables based on ID. The FTS tables used the content option to specify the
text was stored elsewhere. Something like "insert into FTSBUCKET# (...)
..." where # was computed from a rowid to spread the load around.

This may not be an ideal or workable solution for you, I only suggest it as
an idea. It actually worked out quite well for my use case as I often only
needed a predictable subset of data, but YMMV.
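
For illustration, a minimal sketch of that bucketing idea follows. The table
names, the 53-way split and the rowid-based routing are placeholders of mine,
not Scott's actual schema; it assumes an external content table
docs(id INTEGER PRIMARY KEY, doc TEXT):

#include <string>
#include "sqlite3.h"

static const int kBuckets = 53;

// Hypothetical sketch: build kBuckets external-content FTS4 tables and
// spread the rows across them by rowid, so that no single FTS b-tree
// (and no single merge allocation) grows to the problematic sizes.
int build_partitioned_index(sqlite3 *db)
{
    for (int i = 0; i < kBuckets; i++)
    {
        std::string n = std::to_string(i);

        // Only the full-text index is stored here; the text stays in "docs".
        std::string ddl = "CREATE VIRTUAL TABLE IF NOT EXISTS fts_bucket_" + n
                        + " USING fts4(content='docs', doc);";
        int rc = sqlite3_exec(db, ddl.c_str(), nullptr, nullptr, nullptr);
        if (rc != SQLITE_OK) return rc;

        // Route each row to its bucket.
        std::string ins = "INSERT INTO fts_bucket_" + n
                        + "(docid, doc) SELECT id, doc FROM docs WHERE (id % "
                        + std::to_string(kBuckets) + ") = " + n + ";";
        rc = sqlite3_exec(db, ins.c_str(), nullptr, nullptr, nullptr);
        if (rc != SQLITE_OK) return rc;
    }
    return SQLITE_OK;
}

A search then has to be run against every bucket (for example a UNION ALL of
MATCH queries over fts_bucket_0 through fts_bucket_52) and the results
combined, which is the extra-queries cost mentioned above.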
On Apr 29, 2015 2:39 AM, "Artem"  wrote:

> > Personally I don't see it as a bug. A limitation, yes. A different
> > algorithm that requires less ram would remove / change the limit.
>
> > I'll be trying some sample data tomorrow  (if time permits) to see if I
> can
> > come up with any ideas.
>
> > I described an approach in a previous email where I divided my FTS index
> > into 53 partitions. It means running more queries, but maybe something
> like
> > that could serve as a temporary solution.
>
> What queries (partitions) do you mean? I tried to create FTS with LIMIT
> clause
> like:
> INSERT INTO pages_fts(title, body) SELECT title, body FROM pages LIMIT
> 0,100
> INSERT INTO pages_fts(title, body) SELECT title, body FROM pages LIMIT
> 100,200
> etc
>
> It was OK until 600 mln rows, after 600 mln rows it failed with
> Segmentation Fault error (on Ubuntu).
>
> > On Apr 29, 2015 1:56 AM, "Artem"  wrote:
>
> >> So, is that a real bug of SQLIte and how to fix it?
> >> I really need to create huge FTS-indexes like that,
> >> I have 32GB of memory for that.
> >>
> >> Many thanks for your attention.
> >>
> >> > On 04/29/2015 05:27 AM, Artem wrote:
> >> >> Hello, Tim.
> >> >>
> >> >> You wrote on 29 April 2015 at 1:21:00:
> >> >>
> >> >>> On 28 Apr 2015 at 23:14, Artem  wrote:
> >> > How about trying the sqlite3.exe command line utility. put your
> >> > sql for that operation in a text file, launch the program, open
> >> > the database, then read in the sql file with the .read command.
> >> > If the error occurs, then possibly sqlite3. if not then it is
> >> > probably something else.
> >>  I tried it and failed.
> >> 
> >>  Console log:
> >> 
> >>  f:\Suggests\test>sqlite3.exe single.db
> >>  SQLite version 3.8.9 2015-04-08 12:16:33
> >>  Enter ".help" for usage hints.
> >>  sqlite> .read test.sql
> >>  Error: near line 1: out of memory
> >> >>> That's not a segfault, though, is it.
> >> >> When I did the same in linux version of SQLite - I saw
> >> >> the "Segmentation Fault" error.
> >>
> >> > Maybe something to do with the optimistic allocation strategy Linux
> >> > uses. Perhaps malloc() returned non-NULL but then a segfault occurred
> >> > when it first tried to access the pages. From the man-page:
> >>
> >> > By default, Linux follows an optimistic memory allocation
> >> > strategy. This means that when malloc() returns non-NULL
> >> > there is no guarantee that the memory really is available.
> >>
> >> > Or maybe the OOM killer took out the process. Or something.
> >>
> >> > Dan.
> >>
> >>
> >> > ___
> >> > sqlite-users mailing list
> >> > sqlite-users at mailinglists.sqlite.org
> >> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> >>
> >>
> >>
> >> --
> >> ? ?,
> >>  Artem  mailto:devspec at yandex.ru
> >>
> >> ___
> >> sqlite-users mailing list
> >> sqlite-users at mailinglists.sqlite.org
> >> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> >>
> > ___
> > sqlite-users mailing list
> > sqlite-users at mailinglists.sqlite.org
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>


[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread Scott Robison
On windows, malloc returns null if the allocation fails. Sqlite detects
this and returns an error.

On linux, malloc may return a non null yet invalid pointer and only fail
when the memory is accessed because it wasn't really available.

If Sqlite is not at fault when posix APIs lie about file locking, I don't
think Sqlite is responsible when malloc lies about what should be
considered a failed memory allocation which should return null.
On Apr 29, 2015 2:28 AM, "Simon Slavin"  wrote:

>
> On 29 Apr 2015, at 9:21am, Scott Robison  wrote:
>
> > Personally I don't see it as a bug. A limitation, yes. A different
> > algorithm that requires less ram would remove / change the limit.
>
> Anything inside the SQLite code which causes a Segfault is a bug.
> Assuming that you're not hacking your OS or messing with SQLite 'black box'
> variables like connections and statements.
>
> Just my humble opinion.  I'm not on the development team.
>
> A minimal (either small, or simple to reproduce) repeatable example would
> be welcome.  It can even include huge data as long as you can describe a
> way to create crashing data like "Generate a trillion identical rows in
> this table.".
>
> Simon.
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>


[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread Scott Robison
Personally I don't see it as a bug. A limitation, yes. A different
algorithm that requires less ram would remove / change the limit.

I'll be trying some sample data tomorrow  (if time permits) to see if I can
come up with any ideas.

I described an approach in a previous email where I divided my FTS index
into 53 partitions. It means running more queries, but maybe something like
that could serve as a temporary solution.
On Apr 29, 2015 1:56 AM, "Artem"  wrote:

> So, is that a real bug of SQLIte and how to fix it?
> I really need to create huge FTS-indexes like that,
> I have 32GB of memory for that.
>
> Many thanks for your attention.
>
> > On 04/29/2015 05:27 AM, Artem wrote:
> >> Hello, Tim.
> >>
> >> You wrote on 29 April 2015 at 1:21:00:
> >>
> >>> On 28 Apr 2015 at 23:14, Artem  wrote:
> > How about trying the sqlite3.exe command line utility. put your
> > sql for that operation in a text file, launch the program, open
> > the database, then read in the sql file with the .read command.
> > If the error occurs, then possibly sqlite3. if not then it is
> > probably something else.
>  I tried it and failed.
> 
>  Console log:
> 
>  f:\Suggests\test>sqlite3.exe single.db
>  SQLite version 3.8.9 2015-04-08 12:16:33
>  Enter ".help" for usage hints.
>  sqlite> .read test.sql
>  Error: near line 1: out of memory
> >>> That's not a segfault, though, is it.
> >> When I did the same in linux version of SQLite - I saw
> >> the "Segmentation Fault" error.
>
> > Maybe something to do with the optimistic allocation strategy Linux
> > uses. Perhaps malloc() returned non-NULL but then a segfault occurred
> > when it first tried to access the pages. From the man-page:
>
> > By default, Linux follows an optimistic memory allocation
> > strategy. This means that when malloc() returns non-NULL
> > there is no guarantee that the memory really is available.
>
> > Or maybe the OOM killer took out the process. Or something.
>
> > Dan.
>
>
> > ___
> > sqlite-users mailing list
> > sqlite-users at mailinglists.sqlite.org
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
>
>
> --
> Best regards,
>  Artem  mailto:devspec at yandex.ru
>
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>


[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread Dan Kennedy
On 04/29/2015 02:03 AM, Dan Kennedy wrote:
> On 04/29/2015 01:55 AM, Scott Robison wrote:
>> On Tue, Apr 28, 2015 at 7:08 AM, Hick Gunter  wrote:
>>
>>> Getting "NoMem" sounds very much like a memory leak somewhere, with the
>>> most likely place being your own application, followed by the 
>>> wrapper you
>>> are using, the FTS code and lastly the SQLite core. Lastly because the
>>> SQLite core is extensively tested with an explicit emphasis on not 
>>> leaking
>>> memory (or other resources) in the first place and secondly recovering
>>> gracefully from memory allocation failures.
>>>
>> I've seen the same thing from the plain old amalgamation (not 
>> sqlite.net).
>> It only happens on *HUGE* (multiples of gigabytes) data sets. At 
>> least in
>> my case, it was not a memory leak.
>>
>> It's been a couple of years since I encountered it, and I worked 
>> around it
>> from the presumption that the data set used to stress test FTS was 
>> atypical
>> and wouldn't be encountered in the wild. Here are the details as best 
>> as I
>> can remember them:
>>
>> While inserting records into the FTS table, multiple FTS b-tree 
>> structures
>> are created. These are not the same b-trees used in plain vanilla 
>> SQLite.
>> Periodically as multiple b-trees are created and grow to some size, the
>> multiple b-trees are merged into a single b-tree.
>>
>> This merge operation allocates chunks of memory proportionate to the 
>> size
>> of the b-trees being merged. Using a contrived example that is not 
>> exact,
>> just illustrative:
>>
>> Set of inserts until two b-trees of one megabyte each are present. Merge
>> them into a two megabyte b-tree.
>>
>> Merge 2 2MiB trees into 1 4MiB tree.
>>
>> 2 x 4 MiB = 8 MiB.
>>
>> lather rinse repeat.
>>
>> 2 x 1 GiB = 2 GiB but probably fails due to overhead; if not...
>>
>> 2 x 2 GiB = 4 GiB but almost certainly fails due to overhead; if not...
>>
>> 2 x 4 GiB = 8 GiB definitely fails on a 32 bit system.
>>
>> In reality I never got to the point of allocating chunks of memory that
>> large. The failure happened well under 2 GiB (somewhere within a few
>> hundred MiB of the 1 GiB limit) due to other allocations and OS 
>> overhead.
>>
>> I just took a quick glance at the FTS code. As I said, it has been a 
>> couple
>> years, but this looks like the malloc that was failing for me at the 
>> time:
>> http://www.sqlite.org/cgi/src/artifact/81f9ed55ad586148?ln=2473
>
> That one is allocating enough space for the doclist associated with a 
> single term. Doclists are between say 2 and 12 bytes in size for each 
> instance of a term in the document set. So they can get quite large 
> for common terms ("a", "the" etc.). And the OP does have over a 
> billion documents. So I guess if there is a fairly common term in 
> there, that allocation could be too large for the OS to satisfy.

Or, really, 32-bit overflow resulting in a negative value being passed 
to sqlite3_malloc() causing the OOM report. Huh.
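
As a rough illustration of both points (my numbers, not from the thread): even
the low end of that 2-12 bytes-per-entry range gives a multi-gigabyte doclist
for a term that occurs once in every document, and that byte count no longer
fits in a signed 32-bit integer:

#include <cstdint>
#include <cstdio>

int main()
{
    /* Illustrative figures: ~1.35 billion documents (the smaller database
    ** mentioned in the thread), one occurrence of a very common term per
    ** document, 2 bytes per doclist entry. */
    uint64_t nDoc = 1350000000ULL;
    uint64_t doclistBytes = nDoc * 2;                    /* ~2.7e9 bytes */
    printf("doclist size ~= %llu bytes\n",
           (unsigned long long)doclistBytes);

    /* The same count squeezed into a signed 32-bit variable wraps negative,
    ** so a later allocation request looks like an impossible size and gets
    ** reported as out-of-memory rather than being attempted. */
    int32_t n32 = (int32_t)(uint32_t)doclistBytes;
    printf("as a signed 32-bit value: %d\n", (int)n32);  /* negative */
    return 0;
}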






[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread Dan Kennedy
On 04/29/2015 01:55 AM, Scott Robison wrote:
> On Tue, Apr 28, 2015 at 7:08 AM, Hick Gunter  wrote:
>
>> Getting "NoMem" sounds very much like a memory leak somewhere, with the
>> most likely place being your own application, followed by the wrapper you
>> are using, the FTS code and lastly the SQLite core. Lastly because the
>> SQLite core is extensively tested with an explicit emphasis on not leaking
>> memory (or other resources) in the first place and secondly recovering
>> gracefully from memory allocation failures.
>>
> I've seen the same thing from the plain old amalgamation (not sqlite.net).
> It only happens on *HUGE* (multiples of gigabytes) data sets. At least in
> my case, it was not a memory leak.
>
> It's been a couple of years since I encountered it, and I worked around it
> from the presumption that the data set used to stress test FTS was atypical
> and wouldn't be encountered in the wild. Here are the details as best as I
> can remember them:
>
> While inserting records into the FTS table, multiple FTS b-tree structures
> are created. These are not the same b-trees used in plain vanilla SQLite.
> Periodically as multiple b-trees are created and grow to some size, the
> multiple b-trees are merged into a single b-tree.
>
> This merge operation allocates chunks of memory proportionate to the size
> of the b-trees being merged. Using a contrived example that is not exact,
> just illustrative:
>
> Set of inserts until two b-trees of one megabyte each are present. Merge
> them into a two megabyte b-tree.
>
> Merge 2 2MiB trees into 1 4MiB tree.
>
> 2 x 4 MiB = 8 MiB.
>
> lather rinse repeat.
>
> 2 x 1 GiB = 2 GiB but probably fails due to overhead; if not...
>
> 2 x 2 GiB = 4 GiB but almost certainly fails due to overhead; if not...
>
> 2 x 4 GiB = 8 GiB definitely fails on a 32 bit system.
>
> In reality I never got to the point of allocating chunks of memory that
> large. The failure happened well under 2 GiB (somewhere within a few
> hundred MiB of the 1 GiB limit) due to other allocations and OS overhead.
>
> I just took a quick glance at the FTS code. As I said, it has been a couple
> years, but this looks like the malloc that was failing for me at the time:
> http://www.sqlite.org/cgi/src/artifact/81f9ed55ad586148?ln=2473

That one is allocating enough space for the doclist associated with a 
single term. Doclists are between say 2 and 12 bytes in size for each 
instance of a term in the document set. So they can get quite large for 
common terms ("a", "the" etc.). And the OP does have over a billion 
documents. So I guess if there is a fairly common term in there, that 
allocation could be too large for the OS to satisfy.










>
> Note: The data set I was using to stress test had been created by other
> team members and consisted of completely random text. Not random words, but
> random collections of letters up to dozens or maybe hundreds of letters in
> length (though there were certainly many shorter "words" in the data set).
> This resulted in a "worst case scenario" for FTS because there were
> millions of terms that were only used one or at most very few times. Very
> little doclist delta compression was possible, so the trees grew more
> quickly than they otherwise would have. Even so, it took hours of
> processing (like overnight test runs) to generate the NOMEM error. Given
> the nonsensical nature of the fake data set, I didn't report it as a
> problem at the time (I don't think; if I did, I didn't dwell on it long).
>
> What I wound up doing to support even that huge random data set was to
> split my FTS index into 53 FTS index partitions. The worst case scenario
> for me was that I might have to do up to 53 queries to gather my data and
> combine it after the fact. FTS is fast enough that I was able to do this
> without appreciable overhead. Splitting it into 53 "buckets" (a nice prime
> number that kept things relatively balanced) kept the largest FTS b-trees
> to a reasonable size so that merging wouldn't need such large allocations.
> This might not be an acceptable solution for everyone, but it worked for me.
>
> Given how merging works in FTS 3 & 4, I don't think a "simple" solution is
> available. The only one that comes to mind might be to stop merging once
> data structures grow to a certain size. Otherwise a more complicated merge
> algorithm would be necessary.
>
> If you want or need more info, I can probably reproduce a data set that
> would result in the failure. Feel free to email me, on or off list, if I
> can be of assistance.
>
> OH! One final point. I don't think I ever had a segfault because of this.
> Just a failure that prevented FTS creation progress.
>
> SDR
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread Scott Robison
Win 7 64 bit system, building & testing 32 & 64 bit versions of our app.
On Apr 29, 2015 1:34 AM, "Eduardo Morras"  wrote:

> On Tue, 28 Apr 2015 16:49:46 -0600
> Scott Robison  wrote:
>
> > On Tue, Apr 28, 2015 at 4:27 PM, Artem  wrote:
> >
> > > > That's not a segfault, though, is it.
> > >
> > > When I did the same in linux version of SQLite - I saw
> > > the "Segmentation Fault" error.
> > >
> >
> > I never saw a segfault in my case, though I never tried anything on
> > any posix style system. It was strictly Windows.
>
> Don't know if it's asked yet, but do you use a 64bit Windows or a 32bit
> version?
>
> > --
> > Scott Robison
>
> ---   ---
> Eduardo Morras 
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>


[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread Artem
Hello, Tim.

You wrote on 29 April 2015 at 1:21:00:

> On 28 Apr 2015 at 23:14, Artem  wrote: 

>>> How about trying the sqlite3.exe command line utility. put your 
>>> sql for that operation in a text file, launch the program, open 
>>> the database, then read in the sql file with the .read command. 
>>> If the error occurs, then possibly sqlite3. if not then it is 
>>> probably something else.
>>
>> I tried it and failed.
>>
>> Console log:
>>
>> f:\Suggests\test>sqlite3.exe single.db
>> SQLite version 3.8.9 2015-04-08 12:16:33
>> Enter ".help" for usage hints.
>> sqlite> .read test.sql
>> Error: near line 1: out of memory

> That's not a segfault, though, is it.

When I did the same in linux version of SQLite - I saw
the "Segmentation Fault" error.

> --
> Cheers  --  Tim



-- 
Best regards,
 Artem  mailto:devspec at yandex.ru



[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread Artem
One more observation about this situation: when I run a query like
INSERT INTO test_fts(test_fts) VALUES('rebuild');
on a table with 100 000 000 rows - it works fine!
With 1 000 000 000 rows - out of memory.
So the table and the query themselves are correct.

> How about trying the sqlite3.exe command line utility. put your 
> sql for that operation in a text file, launch the program, open 
> the database, then read in the sql file with the .read command. 
> If the error occurs, then possibly sqlite3. if not then it is 
> probably something else.

I tried it and failed.

Console log:

f:\Suggests\test>sqlite3.exe single.db
SQLite version 3.8.9 2015-04-08 12:16:33
Enter ".help" for usage hints.
sqlite> .read test.sql
Error: near line 1: out of memory

sqlite>

Content of test.sql:

insert into phrases_fts(phrases_fts) values('rebuild');

DDL:

CREATE TABLE phrases ([id] INTEGER PRIMARY KEY AUTOINCREMENT, [phrase] CHAR, 
[denial] DOUBLE, [visits] INTEGER, [page_views] INTEGER, [visit_time] INTEGER, 
[depth] DOUBLE, [se] CHAR, [date_add] INTEGER, [metrika_id] INTEGER, [period] 
CHAR);
CREATE VIRTUAL TABLE phrases_fts using fts4 (content='phrases', phrase, denial, 
visits, page_views, visit_time, depth, se, date_add, metrika_id, period);

The database contains about 3.2 billion rows; the maximum [phrase] length is
256 characters.

sqlite3.exe from official site: 
https://www.sqlite.org/2015/sqlite-shell-win32-x86-3080900.zip

> 
> Scott Doctor
> scott at scottdoctor.com
> --

> On 4/28/2015 6:29 AM, Artem wrote:
>> No, I'm sure that is not a problem in my software, it exactly
>> error of the SQLite library. My software is very simple - it creates
>> simple connection to the database with connection string like
>> "Data Source={0};New=false;Journal 
>> Mode=Off;Synchronous=Off;FailIfMissing=True"
>> and executes a query like
>> INSERT INTO test_fts(test_fts) VALUES('rebuild');
>> and that is all.
>> I'm pretty sure because I got exactly the same error in SQLite Expert 
>> Professional -
>> popular  desktop  sqlite-management  software, that uses another
>> sqlite driver.
>>
>> P.S. Source code of my function:
>>
>> Using conn As New SQLiteConnection(String.Format("Data 
>> Source={0};New=false;Journal Mode=Off;Synchronous=Off;FailIfMissing=True", 
>> "f:\Suggests\suggests.db"))
>>
>>  conn.Open()
>>
>>  Using command = conn.CreateCommand
>>  command.CommandText = "INSERT INTO suggests_fts(suggests_fts) 
>> VALUES('rebuild');"
>>  command.ExecuteNonQuery()
>>  command.Dispose()
>>  End Using
>>
>>  conn.Close()
>>
>> End Using
>>
>> P.S. I can send the database to someone who can try, 19 GB in
>> rar-archive.
>>
>>> Getting "NoMem" sounds very much like a memory leak somewhere, with
>>> the most likely place being your own application, followed by the
>>> wrapper you are using, the FTS code and lastly the SQLite core.
>>> Lastly because the SQLite core is extensively tested with an
>>> explicit emphasis on not leaking memory (or other resources) in the
>>> first place and secondly recovering gracefully from memory allocation 
>>> failures.
>>> Also, since you have swapped out SQLite versions and even operating
>>> systems without eliminating the problem, it seems rational to look
>>> into the parts that have remained the same.
>>> Maybe you could run a test on Linux under the control of valgrind
>>> and have its memcheck tool take a look.
>>> -Urspr?ngliche Nachricht-
>>> Von: Artem [mailto:devspec at yandex.ru]
>>> Gesendet: Dienstag, 28. April 2015 14:36
>>> An: General Discussion of SQLite Database
>>> Betreff: Re: [sqlite] Segfault during FTS index creation from huge data
>>>> On 04/03/2015 10:16 PM, Artem wrote:
>>>>> Hi!
>>>>>
>>>>> The situation is like that. There?s a SQLite database with around 3 
>>>>> billion records. Each record consists of a certain CHAR field and several 
>>>>> other additional fields with different types. The file size is approx. 
>>>>> 340 gb. The maximum content length in the doc field is 256 symbols, the 
>>>>> content is in Russian.
>>>>>
>>>>> I?m trying to create a full-text index, but it results in a Segmentation 
>>>>> Fault error. I?ve been trying to create it in different possible ways, 
>>>>> both under Windows (with SQLite Expert and my own .NET software, 
>>>>> including on

[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread Simon Slavin

On 28 Apr 2015, at 11:49pm, Scott Robison  wrote:

> I never saw a segfault in my case, though I never tried anything on any
> posix style system. It was strictly Windows.

Windows doesn't call it segfault.  It will report an access violation or (very 
occasionally) a page fault.

Simon.


[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread Tim Streater
On 28 Apr 2015 at 23:14, Artem  wrote: 

>> How about trying the sqlite3.exe command line utility. put your 
>> sql for that operation in a text file, launch the program, open 
>> the database, then read in the sql file with the .read command. 
>> If the error occurs, then possibly sqlite3. if not then it is 
>> probably something else.
>
> I tried it and failed.
>
> Console log:
>
> f:\Suggests\test>sqlite3.exe single.db
> SQLite version 3.8.9 2015-04-08 12:16:33
> Enter ".help" for usage hints.
> sqlite> .read test.sql
> Error: near line 1: out of memory

That's not a segfault, though, is it.

--
Cheers  --  Tim


[sqlite] Segfault during FTS index creation from huge data

2015-04-28 Thread SQLite
I don't know if it actually causes a problem, but isn't the "command.Dispose()"
unnecessary? Doesn't the "using" take care of disposing?

Graham.


Sent from Samsung Mobile

 Original message 
From: Artem <devs...@yandex.ru> 
Date: 28/04/2015  14:29  (GMT+00:00) 
To: General Discussion of SQLite Database  
Subject: Re: [sqlite] Segfault during FTS index creation from huge data 

No, I'm sure that is not a problem in my software, it exactly
error of the SQLite library. My software is very simple - it creates
simple connection to the database with connection string like
"Data Source={0};New=false;Journal Mode=Off;Synchronous=Off;FailIfMissing=True"
and executes a query like
INSERT INTO test_fts(test_fts) VALUES('rebuild');
and that is all.
I'm pretty sure because I got exactly the same error in SQLite Expert 
Professional -
popular desktop sqlite-management software, that uses another
sqlite driver.

P.S. Source code of my function:

Using conn As New SQLiteConnection(String.Format("Data 
Source={0};New=false;Journal Mode=Off;Synchronous=Off;FailIfMissing=True", 
"f:\Suggests\suggests.db"))

    conn.Open()

    Using command = conn.CreateCommand
        command.CommandText = "INSERT INTO suggests_fts(suggests_fts) VALUES('rebuild');"
        command.ExecuteNonQuery()
        command.Dispose()
    End Using

    conn.Close()

End Using

P.S. I can send the database to someone who can try, 19 GB in
rar-archive.

> Getting "NoMem" sounds very much like a memory leak somewhere, with
> the most likely place being your own application, followed by the
> wrapper you are using, the FTS code and lastly the SQLite core.
> Lastly because the SQLite core is extensively tested with an
> explicit emphasis on not leaking memory (or other resources) in the
> first place and secondly recovering gracefully from memory allocation 
> failures.

> Also, since you have swapped out SQLite versions and even operating
> systems without eliminating the problem, it seems rational to look
> into the parts that have remained the same.

> Maybe you could run a test on Linux under the control of valgrind
> and have its memcheck tool take a look.

> -Urspr?ngliche Nachricht-
> Von: Artem [mailto:devspec at yandex.ru]
> Gesendet: Dienstag, 28. April 2015 14:36
> An: General Discussion of SQLite Database
> Betreff: Re: [sqlite] Segfault during FTS index creation from huge data

>> On 04/03/2015 10:16 PM, Artem wrote:
>>> Hi!
>>>
>>> The situation is like that. There?s a SQLite database with around 3 billion 
>>> records. Each record consists of a certain CHAR field and several other 
>>> additional fields with different types. The file size is approx. 340 gb. 
>>> The maximum content length in the doc field is 256 symbols, the content is 
>>> in Russian.
>>>
>>> I?m trying to create a full-text index, but it results in a Segmentation 
>>> Fault error. I?ve been trying to create it in different possible ways, both 
>>> under Windows (with SQLite Expert and my own .NET software, including one 
>>> with x64 architecture) and Linux (both Ubuntu and Centos). I?ve even 
>>> compiled sqlite from the sources, having included necessary flags for FTS3 
>>> and FTS4, but every time I get one and the same error.

>> This does sound like a real problem, but one that might be difficult
>> to track down.

>> Are you able to get us a stack trace of the crash? Ideally one from a
>> build with compiler options "-g -DSQLITE_DEBUG" set.

>> Thanks,
>> Dan.

> Hi, Dan. Now I can't to do this because I haven't Linux on my PC.
> But I tried to create FTS table again (now it was another database with
> 1 350 000 000 rows, smaller than before). And I got the same error (out of 
> memory) on function:

> internal override SQLiteErrorCode Reset(SQLiteStatement stmt)

> in file SQLite3.cs

> It returns System.Data.SQLite.SQLiteErrorCode.NoMem.
> I hope it helps.

> P.S. It is the latest version of SQLite.Net, compiled in Visual Studio 2012.

>>>
>>> I?ve tried two options:
>>> - creating a contentless FTS4, when content is stored in a regular
>>> table, and FTS-table contains only index (create virtual table
>>> docs_fts using fts4(content='docs'... )
>>> - creating a full-fledged FTS table from a regular one (insert into
>>> docs_fts select doc... from docs;)
>>>
>>> SQLite is functioning for about 4 hours, after which Segmentation Fault 
>>> error occurs inevitably.
>>> There?re no NULL fields in the database.
>>>
>>> I?ve worked with 3 different SQLite versions, including the lates

[sqlite] Segfault during FTS index creation from huge data

2015-04-28 Thread SQLite
I don't know if it will actually cause problems, but is the "command.Dispose()"
needed? Doesn't the "using" handle disposing?

Graham


Sent from Samsung Mobile

 Original message 
From: Artem <devs...@yandex.ru> 
Date: 28/04/2015  14:29  (GMT+00:00) 
To: General Discussion of SQLite Database  
Subject: Re: [sqlite] Segfault during FTS index creation from huge data 

No, I'm sure that is not a problem in my software, it exactly
error of the SQLite library. My software is very simple - it creates
simple connection to the database with connection string like
"Data Source={0};New=false;Journal Mode=Off;Synchronous=Off;FailIfMissing=True"
and executes a query like
INSERT INTO test_fts(test_fts) VALUES('rebuild');
and that is all.
I'm pretty sure because I got exactly the same error in SQLite Expert 
Professional -
popular desktop sqlite-management software, that uses another
sqlite driver.

P.S. Source code of my function:

Using conn As New SQLiteConnection(String.Format("Data 
Source={0};New=false;Journal Mode=Off;Synchronous=Off;FailIfMissing=True", 
"f:\Suggests\suggests.db"))

    conn.Open()

    Using command = conn.CreateCommand
        command.CommandText = "INSERT INTO suggests_fts(suggests_fts) VALUES('rebuild');"
        command.ExecuteNonQuery()
        command.Dispose()
    End Using

    conn.Close()

End Using

P.S. I can send the database to someone who can try, 19 GB in
rar-archive.

> Getting "NoMem" sounds very much like a memory leak somewhere, with
> the most likely place being your own application, followed by the
> wrapper you are using, the FTS code and lastly the SQLite core.
> Lastly because the SQLite core is extensively tested with an
> explicit emphasis on not leaking memory (or other resources) in the
> first place and secondly recovering gracefully from memory allocation 
> failures.

> Also, since you have swapped out SQLite versions and even operating
> systems without eliminating the problem, it seems rational to look
> into the parts that have remained the same.

> Maybe you could run a test on Linux under the control of valgrind
> and have its memcheck tool take a look.

> -Urspr?ngliche Nachricht-
> Von: Artem [mailto:devspec at yandex.ru]
> Gesendet: Dienstag, 28. April 2015 14:36
> An: General Discussion of SQLite Database
> Betreff: Re: [sqlite] Segfault during FTS index creation from huge data

>> On 04/03/2015 10:16 PM, Artem wrote:
>>> Hi!
>>>
>>> The situation is like that. There?s a SQLite database with around 3 billion 
>>> records. Each record consists of a certain CHAR field and several other 
>>> additional fields with different types. The file size is approx. 340 gb. 
>>> The maximum content length in the doc field is 256 symbols, the content is 
>>> in Russian.
>>>
>>> I?m trying to create a full-text index, but it results in a Segmentation 
>>> Fault error. I?ve been trying to create it in different possible ways, both 
>>> under Windows (with SQLite Expert and my own .NET software, including one 
>>> with x64 architecture) and Linux (both Ubuntu and Centos). I?ve even 
>>> compiled sqlite from the sources, having included necessary flags for FTS3 
>>> and FTS4, but every time I get one and the same error.

>> This does sound like a real problem, but one that might be difficult
>> to track down.

>> Are you able to get us a stack trace of the crash? Ideally one from a
>> build with compiler options "-g -DSQLITE_DEBUG" set.

>> Thanks,
>> Dan.

> Hi, Dan. Now I can't to do this because I haven't Linux on my PC.
> But I tried to create FTS table again (now it was another database with
> 1 350 000 000 rows, smaller than before). And I got the same error (out of 
> memory) on function:

> internal override SQLiteErrorCode Reset(SQLiteStatement stmt)

> in file SQLite3.cs

> It returns System.Data.SQLite.SQLiteErrorCode.NoMem.
> I hope it helps.

> P.S. It is the latest version of SQLite.Net, compiled in Visual Studio 2012.

>>>
>>> I?ve tried two options:
>>> - creating a contentless FTS4, when content is stored in a regular
>>> table, and FTS-table contains only index (create virtual table
>>> docs_fts using fts4(content='docs'... )
>>> - creating a full-fledged FTS table from a regular one (insert into
>>> docs_fts select doc... from docs;)
>>>
>>> SQLite is functioning for about 4 hours, after which Segmentation Fault 
>>> error occurs inevitably.
>>> There?re no NULL fields in the database.
>>>
>>> I?ve worked with 3 different SQLite versions, including the latest one, 
>>> av

[sqlite] Segfault during FTS index creation from huge data

2015-04-28 Thread Scott Robison
On Tue, Apr 28, 2015 at 5:06 PM, Simon Slavin  wrote:

>
> On 28 Apr 2015, at 11:49pm, Scott Robison  wrote:
>
> > I never saw a segfault in my case, though I never tried anything on any
> > posix style system. It was strictly Windows.
>
> Windows doesn't call it segfault.  It will report an access violation or
> (very occasionally) a page fault.
>

I am aware of this. My point was that an illegal memory access never
occurred in my use on Windows. Perhaps because I checked all error codes
and did not continue trying to process data in an error state.

Regardless of what they're called, they are the same thing. A rose is a
rose.
-- 
Scott Robison


[sqlite] Segfault during FTS index creation from huge data

2015-04-28 Thread Scott Robison
On Tue, Apr 28, 2015 at 4:27 PM, Artem  wrote:

> > That's not a segfault, though, is it.
>
> When I did the same in linux version of SQLite - I saw
> the "Segmentation Fault" error.
>

I never saw a segfault in my case, though I never tried anything on any
posix style system. It was strictly Windows.
-- 
Scott Robison


[sqlite] Segfault during FTS index creation from huge data

2015-04-28 Thread Artem
No, I'm sure the problem is not in my software; it is exactly an
error of the SQLite library. My software is very simple - it creates
simple connection to the database with connection string like
"Data Source={0};New=false;Journal Mode=Off;Synchronous=Off;FailIfMissing=True"
and executes a query like
INSERT INTO test_fts(test_fts) VALUES('rebuild');
and that is all.
I'm pretty sure because I got exactly the same error in SQLite Expert 
Professional -
popular  desktop  sqlite-management  software, that uses another
sqlite driver.

P.S. Source code of my function:

Using conn As New SQLiteConnection(String.Format("Data 
Source={0};New=false;Journal Mode=Off;Synchronous=Off;FailIfMissing=True", 
"f:\Suggests\suggests.db"))

conn.Open()

Using command = conn.CreateCommand
command.CommandText = "INSERT INTO suggests_fts(suggests_fts) 
VALUES('rebuild');"
command.ExecuteNonQuery()
command.Dispose()
End Using

conn.Close()

End Using

P.S. I can send the database to someone who can try, 19 GB in
rar-archive.

> Getting "NoMem" sounds very much like a memory leak somewhere, with
> the most likely place being your own application, followed by the
> wrapper you are using, the FTS code and lastly the SQLite core.
> Lastly because the SQLite core is extensively tested with an
> explicit emphasis on not leaking memory (or other resources) in the
> first place and secondly recovering gracefully from memory allocation 
> failures.

> Also, since you have swapped out SQLite versions and even operating
> systems without eliminating the problem, it seems rational to look
> into the parts that have remained the same.

> Maybe you could run a test on Linux under the control of valgrind
> and have its memcheck tool take a look.

> -Urspr?ngliche Nachricht-
> Von: Artem [mailto:devspec at yandex.ru]
> Gesendet: Dienstag, 28. April 2015 14:36
> An: General Discussion of SQLite Database
> Betreff: Re: [sqlite] Segfault during FTS index creation from huge data

>> On 04/03/2015 10:16 PM, Artem wrote:
>>> Hi!
>>>
>>> The situation is like that. There?s a SQLite database with around 3 billion 
>>> records. Each record consists of a certain CHAR field and several other 
>>> additional fields with different types. The file size is approx. 340 gb. 
>>> The maximum content length in the doc field is 256 symbols, the content is 
>>> in Russian.
>>>
>>> I?m trying to create a full-text index, but it results in a Segmentation 
>>> Fault error. I?ve been trying to create it in different possible ways, both 
>>> under Windows (with SQLite Expert and my own .NET software, including one 
>>> with x64 architecture) and Linux (both Ubuntu and Centos). I?ve even 
>>> compiled sqlite from the sources, having included necessary flags for FTS3 
>>> and FTS4, but every time I get one and the same error.

>> This does sound like a real problem, but one that might be difficult
>> to track down.

>> Are you able to get us a stack trace of the crash? Ideally one from a
>> build with compiler options "-g -DSQLITE_DEBUG" set.

>> Thanks,
>> Dan.

> Hi, Dan. Now I can't to do this because I haven't Linux on my PC.
> But I tried to create FTS table again (now it was another database with
> 1 350 000 000 rows, smaller than before). And I got the same error (out of 
> memory) on function:

> internal override SQLiteErrorCode Reset(SQLiteStatement stmt)

> in file SQLite3.cs

> It returns System.Data.SQLite.SQLiteErrorCode.NoMem.
> I home it helps.

> P.S.  It  is  latest  version  of SQLite.Net compiled in Visual Studio 2012.

>>>
>>> I've tried two options:
>>> - creating a contentless FTS4, where the content is stored in a regular
>>> table and the FTS table contains only the index (create virtual table
>>> docs_fts using fts4(content='docs'... )
>>> - creating a full-fledged FTS table from a regular one (insert into
>>> docs_fts select doc... from docs;)
>>>
>>> SQLite runs for about 4 hours, after which the Segmentation Fault
>>> error inevitably occurs.
>>> There are no NULL fields in the database.
>>>
>>> I've worked with 3 different SQLite versions, including the latest one
>>> available on the website. I started trying to create the full-text index of
>>> the database shortly after it was created and filled; no other activity, apart
>>> from filling the database with data, was carried out. It has only one docs
>>> table, which for sure doesn't contain any NULL values.
>>> I also had an idea that Reindex and Vacuum might have done something wrong, 
>>

[sqlite] Segfault during FTS index creation from huge data

2015-04-28 Thread Artem
> On 04/03/2015 10:16 PM, Artem wrote:
>> Hi!
>>
>> The situation is like that. There's a SQLite database with around 3 billion 
>> records. Each record consists of a certain CHAR field and several other 
>> additional fields with different types. The file size is approx. 340 GB. The 
>> maximum content length in the doc field is 256 characters, and the content is 
>> in Russian.
>>
>> I'm trying to create a full-text index, but it results in a Segmentation 
>> Fault error. I've been trying to create it in different possible ways, both 
>> under Windows (with SQLite Expert and my own .NET software, including one 
>> with x64 architecture) and Linux (both Ubuntu and CentOS). I've even 
>> compiled SQLite from the sources, having included the necessary flags for FTS3 
>> and FTS4, but every time I get one and the same error.

> This does sound like a real problem, but one that might be difficult to
> track down.

> Are you able to get us a stack trace of the crash? Ideally one from a 
> build with compiler options "-g -DSQLITE_DEBUG" set.

> Thanks,
> Dan.

Hi, Dan. I can't do this right now because I don't have Linux on my PC.
But I tried to create the FTS table again (this time on another database with
1 350 000 000 rows, smaller than before). And I got the same error
(out of memory) in the function:

internal override SQLiteErrorCode Reset(SQLiteStatement stmt)

in file SQLite3.cs

It returns System.Data.SQLite.SQLiteErrorCode.NoMem.
I hope it helps.

P.S. It is the latest version of SQLite.Net compiled in Visual Studio 2012.

>>
>> I've tried two options:
>> - creating a contentless FTS4, where the content is stored in a regular table 
>> and the FTS table contains only the index (create virtual table docs_fts using 
>> fts4(content='docs'... )
>> - creating a full-fledged FTS table from a regular one (insert into docs_fts 
>> select doc... from docs;)
>>
>> SQLite runs for about 4 hours, after which the Segmentation Fault 
>> error inevitably occurs.
>> There are no NULL fields in the database.
>>
>> I've worked with 3 different SQLite versions, including the latest one 
>> available on the website. I started trying to create the full-text index of 
>> the database shortly after it was created and filled; no other activity, apart 
>> from filling the database with data, was carried out. It has only one docs 
>> table, which for sure doesn't contain any NULL values.
>> I also had an idea that Reindex and Vacuum might have done something wrong; 
>> however, Reindex happens instantly and Vacuum works rather slowly, but 
>> successfully.
>>
>> In short, my actions are:
>>
>> CREATE TABLE docs ([id] INTEGER PRIMARY KEY AUTOINCREMENT, [doc] CHAR... 
>> (other fields here)
>> (here's the process of filling the database with the data, which is downloaded 
>> by means of my own script from text files.)
>>
>> CREATE VIRTUAL TABLE docs_fts using fts4 (content='docs', doc... (other 
>> fields here)
>> (here's an attempt at INSERT INTO docs_fts(docs_fts) VALUES ('rebuild') to 
>> rebuild the contentless FTS index)
>>
>> or
>>
>> CREATE VIRTUAL TABLE docs_fts using fts4 (doc... (other fields here)
>> (here's an attempt at INSERT INTO docs_fts SELECT doc... from docs;)
>>
>> or
>>
>> CREATE VIRTUAL TABLE docs_fts using fts3 (doc... (other fields here)
>> (here's an attempt at INSERT INTO docs_fts SELECT doc... from docs;)
>>
>> For each attempt I've been making a new copy of the source file, because I 
>> suspected that the database could have been corrupted after the Segmentation 
>> Fault. I even changed the RAM cards, in case memory was the problem.
>>
>> But every time I get one and the same result - Segmentation Fault error.
>>
>> So, could you please look into this problem and fix it ASAP?
>>
>> I can send you a file if you need.
>>
>> Thank you.
>>



[sqlite] Segfault during FTS index creation from huge data

2015-04-28 Thread Scott Robison
On Tue, Apr 28, 2015 at 1:05 PM, Dan Kennedy  wrote:

> I just took a quick glance at the FTS code. As I said, it has been a couple
>>> years, but this looks like the malloc that was failing for me at the
>>> time:
>>> http://www.sqlite.org/cgi/src/artifact/81f9ed55ad586148?ln=2473
>>>
>>
>> That one is allocating enough space for the doclist associated with a
>> single term. Doclists are between say 2 and 12 bytes in size for each
>> instance of a term in the document set. So they can get quite large for
>> common terms ("a", "the" etc.). And the OP does have over a billion
>> documents. So I guess if there is a fairly common term in there, that
>> allocation could be too large for the OS to satisfy.
>>
>
> Or, really, 32-bit overflow resulting in a negative value being passed to
> sqlite3_malloc() causing the OOM report. Huh.


In looking at the code some more, it could have been in segment merging
code in fts3_write.c @
http://www.sqlite.org/cgi/src/artifact/4f005f78592a1447 ... I'll try to
generate some test data and see if I can get you some real information vs
hazy recollections.

-- 
Scott Robison


[sqlite] Segfault during FTS index creation from huge data

2015-04-28 Thread Hick Gunter
Getting "NoMem" sounds very much like a memory leak somewhere, with the most 
likely place being your own application, followed by the wrapper you are using, 
the FTS code and lastly the SQLite core. Lastly because the SQLite core is 
extensively tested with an explicit emphasis on not leaking memory (or other 
resources) in the first place and secondly recovering gracefully from memory 
allocation failures.

Also, since you have swapped out SQLite versions and even operating systems 
without eliminating the problem, it seems rational to look into the parts that 
have remained the same.

Maybe you could run a test on Linux under the control of valgrind and have its 
memcheck tool take a look.

-----Original Message-----
From: Artem [mailto:devspec at yandex.ru]
Sent: Tuesday, April 28, 2015 14:36
To: General Discussion of SQLite Database
Subject: Re: [sqlite] Segfault during FTS index creation from huge data

> On 04/03/2015 10:16 PM, Artem wrote:
>> Hi!
>>
>> The situation is like that. There's a SQLite database with around 3 billion 
>> records. Each record consists of a certain CHAR field and several other 
>> additional fields with different types. The file size is approx. 340 GB. The 
>> maximum content length in the doc field is 256 characters, and the content is 
>> in Russian.
>>
>> I'm trying to create a full-text index, but it results in a Segmentation 
>> Fault error. I've been trying to create it in different possible ways, both 
>> under Windows (with SQLite Expert and my own .NET software, including one 
>> with x64 architecture) and Linux (both Ubuntu and CentOS). I've even 
>> compiled SQLite from the sources, having included the necessary flags for FTS3 
>> and FTS4, but every time I get one and the same error.

> This does sound like a real problem, but one that might be difficult
> to track down.

> Are you able to get us a stack trace of the crash? Ideally one from a
> build with compiler options "-g -DSQLITE_DEBUG" set.

> Thanks,
> Dan.

Hi, Dan. I can't do this right now because I don't have Linux on my PC.
But I tried to create the FTS table again (this time on another database with
1 350 000 000 rows, smaller than before). And I got the same error (out of
memory) in the function:

internal override SQLiteErrorCode Reset(SQLiteStatement stmt)

in file SQLite3.cs

It returns System.Data.SQLite.SQLiteErrorCode.NoMem.
I hope it helps.

P.S. It is the latest version of SQLite.Net compiled in Visual Studio 2012.

>>
>> I've tried two options:
>> - creating a contentless FTS4, where the content is stored in a regular
>> table and the FTS table contains only the index (create virtual table
>> docs_fts using fts4(content='docs'... )
>> - creating a full-fledged FTS table from a regular one (insert into
>> docs_fts select doc... from docs;)
>>
>> SQLite runs for about 4 hours, after which the Segmentation Fault 
>> error inevitably occurs.
>> There are no NULL fields in the database.
>>
>> I've worked with 3 different SQLite versions, including the latest one 
>> available on the website. I started trying to create the full-text index of 
>> the database shortly after it was created and filled; no other activity, apart 
>> from filling the database with data, was carried out. It has only one docs 
>> table, which for sure doesn't contain any NULL values.
>> I also had an idea that Reindex and Vacuum might have done something wrong; 
>> however, Reindex happens instantly and Vacuum works rather slowly, but 
>> successfully.
>>
>> In short, my actions are:
>>
>> CREATE TABLE docs ([id] INTEGER PRIMARY KEY AUTOINCREMENT, [doc]
>> CHAR... (other fields here) (here's the process of filling the database
>> with the data, which is downloaded by means of my own script from
>> text files.)
>>
>> CREATE VIRTUAL TABLE docs_fts using fts4 (content='docs', doc...
>> (other fields here) (here's an attempt at INSERT INTO
>> docs_fts(docs_fts) VALUES ('rebuild') to rebuild the contentless FTS
>> index)
>>
>> or
>>
>> CREATE VIRTUAL TABLE docs_fts using fts4 (doc... (other fields here)
>> (here's an attempt at INSERT INTO docs_fts SELECT doc... from docs;)
>>
>> or
>>
>> CREATE VIRTUAL TABLE docs_fts using fts3 (doc... (other fields here)
>> (here's an attempt at INSERT INTO docs_fts SELECT doc... from docs;)
>>
>> For each attempt I've been making a new copy of the source file, because I 
>> suspected that the database could have been corrupted after the Segmentation 
>> Fault. I even changed the RAM cards, in case memory was the problem.
>>
>> But every time I get one and the same result - Segmentation Fault error.
>>
>

[sqlite] Segfault during FTS index creation from huge data

2015-04-28 Thread Scott Robison
On Tue, Apr 28, 2015 at 1:03 PM, Dan Kennedy  wrote:

> On 04/29/2015 01:55 AM, Scott Robison wrote:
>
>> I just took a quick glance at the FTS code. As I said, it has been a
>> couple
>> years, but this looks like the malloc that was failing for me at the time:
>> http://www.sqlite.org/cgi/src/artifact/81f9ed55ad586148?ln=2473
>>
>
> That one is allocating enough space for the doclist associated with a
> single term. Doclists are between say 2 and 12 bytes in size for each
> instance of a term in the document set. So they can get quite large for
> common terms ("a", "the" etc.). And the OP does have over a billion
> documents. So I guess if there is a fairly common term in there, that
> allocation could be too large for the OS to satisfy.
>

Perhaps ... it has been years, and I could be mistaken about the exact
allocation, but that one looked right.

In any case, hope it helps someone. :)

SDR


[sqlite] Segfault during FTS index creation from huge data

2015-04-28 Thread Scott Robison
On Tue, Apr 28, 2015 at 7:08 AM, Hick Gunter  wrote:

> Getting "NoMem" sounds very much like a memory leak somewhere, with the
> most likely place being your own application, followed by the wrapper you
> are using, the FTS code and lastly the SQLite core. Lastly because the
> SQLite core is extensively tested with an explicit emphasis on not leaking
> memory (or other resources) in the first place and secondly recovering
> gracefully from memory allocation failures.
>

I've seen the same thing from the plain old amalgamation (not sqlite.net).
It only happens on *HUGE* (multiples of gigabytes) data sets. At least in
my case, it was not a memory leak.

It's been a couple of years since I encountered it, and I worked around it
from the presumption that the data set used to stress test FTS was atypical
and wouldn't be encountered in the wild. Here are the details as best as I
can remember them:

While inserting records into the FTS table, multiple FTS b-tree structures
are created. These are not the same b-trees used in plain vanilla SQLite.
Periodically as multiple b-trees are created and grow to some size, the
multiple b-trees are merged into a single b-tree.

This merge operation allocates chunks of memory proportionate to the size
of the b-trees being merged. Using a contrived example that is not exact,
just illustrative:

Set of inserts until two b-trees of one megabyte each are present. Merge
them into a two megabyte b-tree.

Merge 2 2MiB trees into 1 4MiB tree.

2 x 4 MiB = 8 MiB.

lather rinse repeat.

2 x 1 GiB = 2 GiB but probably fails due to overhead; if not...

2 x 2 GiB = 4 GiB but almost certainly fails due to overhead; if not...

2 x 4 GiB = 8 GiB definitely fails on a 32 bit system.

In reality I never got to the point of allocating chunks of memory that
large. The failure happened well under 2 GiB (somewhere within a few
hundred MiB of the 1 GiB limit) due to other allocations and OS overhead.

I just took a quick glance at the FTS code. As I said, it has been a couple
years, but this looks like the malloc that was failing for me at the time:
http://www.sqlite.org/cgi/src/artifact/81f9ed55ad586148?ln=2473

Note: The data set I was using to stress test had been created by other
team members and consisted of completely random text. Not random words, but
random collections of letters up to dozens or maybe hundreds of letters in
length (though there were certainly many shorter "words" in the data set).
This resulted in a "worst case scenario" for FTS because there were
millions of terms that were only used one or at most very few times. Very
little doclist delta compression was possible, so the trees grew more
quickly than they otherwise would have. Even so, it took hours of
processing (like overnight test runs) to generate the NOMEM error. Given
the nonsensical nature of the fake data set, I didn't report it as a
problem at the time (I don't think; if I did, I didn't dwell on it long).

What I wound up doing to support even that huge random data set was to
split my FTS index into 53 FTS index partitions. The worst case scenario
for me was that I might have to do up to 53 queries to gather my data and
combine it after the fact. FTS is fast enough that I was able to do this
without appreciable overhead. Splitting it into 53 "buckets" (a nice prime
number that kept things relatively balanced) kept the largest FTS b-trees
to a reasonable size so that merging wouldn't need such large allocations.
This might not be an acceptable solution for everyone, but it worked for me.
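
To make that concrete, here is a rough sketch of the partitioning idea, with
hypothetical table names and only the first couple of the 53 partitions spelled
out; it illustrates the approach rather than reproducing the code I actually
used:

   -- Route each document into one of 53 smaller FTS4 tables keyed on its id.
   CREATE VIRTUAL TABLE docs_fts_00 USING fts4(doc);
   CREATE VIRTUAL TABLE docs_fts_01 USING fts4(doc);
   -- ... docs_fts_02 through docs_fts_52 ...

   INSERT INTO docs_fts_00(docid, doc) SELECT id, doc FROM docs WHERE id % 53 = 0;
   INSERT INTO docs_fts_01(docid, doc) SELECT id, doc FROM docs WHERE id % 53 = 1;
   -- ... and so on; each partition's segment b-trees stay roughly 1/53rd the size ...

   -- A query has to consult every partition and combine the results:
   SELECT docid FROM docs_fts_00 WHERE docs_fts_00 MATCH 'term'
   UNION ALL
   SELECT docid FROM docs_fts_01 WHERE docs_fts_01 MATCH 'term';
   -- ... extended (or generated) to cover the remaining partitions ...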

Given how merging works in FTS 3 & 4, I don't think a "simple" solution is
available. The only one that comes to mind might be to stop merging once
data structures grow to a certain size. Otherwise a more complicated merge
algorithm would be necessary.
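
One partial mitigation worth mentioning: recent FTS4 versions do expose the
documented 'automerge' setting and the incremental 'merge' command (see the
special-commands section of https://sqlite.org/fts3.html). They spread the merge
work out and reduce how many segments are merged at once, rather than capping
the size of any single allocation, so they may or may not help here, but they
are cheap to experiment with. A sketch, assuming an FTS4 table named docs_fts:

   -- Merge fewer segments per automatic merge pass
   -- (0 disables automatic merging; small values merge fewer segments at a time).
   INSERT INTO docs_fts(docs_fts) VALUES('automerge=2');

   -- Perform a limited amount of incremental merge work, e.g. from a
   -- maintenance job (see the FTS documentation for the exact X,Y semantics).
   INSERT INTO docs_fts(docs_fts) VALUES('merge=500,8');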

If you want or need more info, I can probably reproduce a data set that
would result in the failure. Feel free to email me, on or off list, if I
can be of assistance.

OH! One final point. I don't think I ever had a segfault because of this.
Just a failure that prevented FTS creation progress.

SDR


[sqlite] Segfault during FTS index creation from huge data

2015-04-28 Thread Scott Doctor

How about trying the sqlite3.exe command-line utility? Put the 
SQL for that operation in a text file, launch the program, open 
the database, then read in the SQL file with the .read command. 
If the error occurs, the problem is possibly in sqlite3; if not, it is 
probably something else.
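
For example, a small script file could contain nothing more than the statements
below (the file and table names are hypothetical, matching the ones Artem
posted), and be fed to sqlite3.exe with .read after opening the database. If
sqlite3.exe fails the same way, the .NET wrapper is off the hook.

   -- rebuild_fts.sql: mirrors what the posted .NET code does.
   PRAGMA journal_mode = OFF;
   PRAGMA synchronous = OFF;
   INSERT INTO suggests_fts(suggests_fts) VALUES('rebuild');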


Scott Doctor
scott at scottdoctor.com
--

On 4/28/2015 6:29 AM, Artem wrote:
> No, I'm sure the problem is not in my software; it is exactly an error
> of the SQLite library. My software is very simple - it creates a
> simple connection to the database with a connection string like
> "Data Source={0};New=false;Journal 
> Mode=Off;Synchronous=Off;FailIfMissing=True"
> and executes a query like
> INSERT INTO test_fts(test_fts) VALUES('rebuild');
> and that is all.
> I'm pretty sure because I got exactly the same error in SQLite Expert 
> Professional -
> a popular desktop SQLite management tool that uses another
> SQLite driver.
>
> P.S. Source code of my function:
>
> Using conn As New SQLiteConnection(String.Format("Data 
> Source={0};New=false;Journal Mode=Off;Synchronous=Off;FailIfMissing=True", 
> "f:\Suggests\suggests.db"))
>
>  conn.Open()
>
>  Using command = conn.CreateCommand
>  command.CommandText = "INSERT INTO suggests_fts(suggests_fts) 
> VALUES('rebuild');"
>  command.ExecuteNonQuery()
>  command.Dispose()
>  End Using
>
>  conn.Close()
>
> End Using
>
> P.S. I can send the database to someone who can try it - 19 GB in a
> rar archive.
>
>> Getting "NoMem" sounds very much like a memory leak somewhere, with
>> the most likely place being your own application, followed by the
>> wrapper you are using, the FTS code and lastly the SQLite core.
>> Lastly because the SQLite core is extensively tested with an
>> explicit emphasis on not leaking memory (or other resources) in the
>> first place and secondly recovering gracefully from memory allocation 
>> failures.
>> Also, since you have swapped out SQLite versions and even operating
>> systems without eliminating the problem, it seems rational to look
>> into the parts that have remained the same.
>> Maybe you could run a test on Linux under the control of valgrind
>> and have its memcheck tool take a look.
>> -----Original Message-----
>> From: Artem [mailto:devspec at yandex.ru]
>> Sent: Tuesday, April 28, 2015 14:36
>> To: General Discussion of SQLite Database
>> Subject: Re: [sqlite] Segfault during FTS index creation from huge data
>>> On 04/03/2015 10:16 PM, Artem wrote:
>>>> Hi!
>>>>
>>>> The situation is like that. There's a SQLite database with around 3 
>>>> billion records. Each record consists of a certain CHAR field and several 
>>>> other additional fields with different types. The file size is approx. 340 
>>>> GB. The maximum content length in the doc field is 256 characters, and the 
>>>> content is in Russian.
>>>>
>>>> I'm trying to create a full-text index, but it results in a Segmentation 
>>>> Fault error. I've been trying to create it in different possible ways, 
>>>> both under Windows (with SQLite Expert and my own .NET software, including 
>>>> one with x64 architecture) and Linux (both Ubuntu and CentOS). I've even 
>>>> compiled SQLite from the sources, having included the necessary flags for FTS3 
>>>> and FTS4, but every time I get one and the same error.
>>> This does sound like a real problem, but one that might be difficult
>>> to track down.
>>> Are you able to get us a stack trace of the crash? Ideally one from a
>>> build with compiler options "-g -DSQLITE_DEBUG" set.
>>> Thanks,
>>> Dan.
>> Hi, Dan. I can't do this right now because I don't have Linux on my PC.
>> But I tried to create the FTS table again (this time on another database with
>> 1 350 000 000 rows, smaller than before). And I got the same error (out of 
>> memory) in the function:
>> internal override SQLiteErrorCode Reset(SQLiteStatement stmt)
>> in file SQLite3.cs
>> It returns System.Data.SQLite.SQLiteErrorCode.NoMem.
>> I hope it helps.
>> P.S. It is the latest version of SQLite.Net compiled in Visual Studio 2012.
>>>> I've tried two options:
>>>> - creating a contentless FTS4, where the content is stored in a regular
>>>> table and the FTS table contains only the index (create virtual table
>>>> docs_fts using fts4(content='docs'... )
>>>> - creating a full-fledged FTS table from a regular one (insert into
>>>> do

[sqlite] Segfault during FTS index creation from huge data

2015-04-04 Thread Dan Kennedy
On 04/03/2015 10:16 PM, Artem wrote:
> Hi!
>
> The situation is like that. There's a SQLite database with around 3 billion 
> records. Each record consists of a certain CHAR field and several other 
> additional fields with different types. The file size is approx. 340 GB. The 
> maximum content length in the doc field is 256 characters, and the content is 
> in Russian.
>
> I'm trying to create a full-text index, but it results in a Segmentation 
> Fault error. I've been trying to create it in different possible ways, both 
> under Windows (with SQLite Expert and my own .NET software, including one 
> with x64 architecture) and Linux (both Ubuntu and CentOS). I've even compiled 
> SQLite from the sources, having included the necessary flags for FTS3 and FTS4, 
> but every time I get one and the same error.

This does sound like a real problem, but one that might be difficult to 
track down.

Are you able to get us a stack trace of the crash? Ideally one from a 
build with compiler options "-g -DSQLITE_DEBUG" set.

Thanks,
Dan.




>
> I've tried two options:
> - creating a contentless FTS4, where the content is stored in a regular table, and 
> the FTS table contains only the index (create virtual table docs_fts using 
> fts4(content='docs'... )
> - creating a full-fledged FTS table from a regular one (insert into docs_fts 
> select doc... from docs;)
>
> SQLite runs for about 4 hours, after which the Segmentation Fault error 
> inevitably occurs.
> There are no NULL fields in the database.
>
> I've worked with 3 different SQLite versions, including the latest one 
> available on the website. I started trying to create the full-text index of 
> the database shortly after it was created and filled; no other activity, apart 
> from filling the database with data, was carried out. It has only one docs table, 
> which for sure doesn't contain any NULL values.
> I also had an idea that Reindex and Vacuum might have done something wrong; 
> however, Reindex happens instantly and Vacuum works rather slowly, but 
> successfully.
>
> In short, my actions are:
>
> CREATE TABLE docs ([id] INTEGER PRIMARY KEY AUTOINCREMENT, [doc] CHAR... 
> (other fields here)
> (here's the process of filling the database with the data, which is downloaded 
> by means of my own script from text files.)
>
> CREATE VIRTUAL TABLE docs_fts using fts4 (content='docs', doc... (other 
> fields here)
> (here's an attempt at INSERT INTO docs_fts(docs_fts) VALUES ('rebuild') to 
> rebuild the contentless FTS index)
>
> or
>
> CREATE VIRTUAL TABLE docs_fts using fts4 (doc... (other fields here)
> (here's an attempt at INSERT INTO docs_fts SELECT doc... from docs;)
>
> or
>
> CREATE VIRTUAL TABLE docs_fts using fts3 (doc... (other fields here)
> (here's an attempt at INSERT INTO docs_fts SELECT doc... from docs;)
>
> For each attempt I've been making a new copy of the source file, because I 
> suspected that the database could have been corrupted after the Segmentation 
> Fault. I even changed the RAM cards, in case memory was the problem.
>
> But every time I get one and the same result - Segmentation Fault error.
>
> So, could you please look into this problem and fix it ASAP?
>
> I can send you a file if you need.
>
> Thank you.
>



[sqlite] Segfault during FTS index creation from huge data

2015-04-03 Thread Artem
Hi!

The situation is like that. There's a SQLite database with around 3 billion 
records. Each record consists of a certain CHAR field and several other 
additional fields with different types. The file size is approx. 340 GB. The 
maximum content length in the doc field is 256 characters, and the content is in 
Russian.

I'm trying to create a full-text index, but it results in a Segmentation Fault 
error. I've been trying to create it in different possible ways, both under 
Windows (with SQLite Expert and my own .NET software, including one with x64 
architecture) and Linux (both Ubuntu and CentOS). I've even compiled SQLite 
from the sources, having included the necessary flags for FTS3 and FTS4, but every 
time I get one and the same error.

I've tried two options:
- creating a contentless FTS4, where the content is stored in a regular table, and 
the FTS table contains only the index (create virtual table docs_fts using 
fts4(content='docs'... )
- creating a full-fledged FTS table from a regular one (insert into docs_fts 
select doc... from docs;)

SQLite runs for about 4 hours, after which the Segmentation Fault error 
inevitably occurs.
There are no NULL fields in the database.

I've worked with 3 different SQLite versions, including the latest one 
available on the website. I started trying to create the full-text index of the 
database shortly after it was created and filled; no other activity, apart from 
filling the database with data, was carried out. It has only one docs table, which 
for sure doesn't contain any NULL values.
I also had an idea that Reindex and Vacuum might have done something wrong; 
however, Reindex happens instantly and Vacuum works rather slowly, but 
successfully.

In short, my actions are:

CREATE TABLE docs ([id] INTEGER PRIMARY KEY AUTOINCREMENT, [doc] CHAR... (other 
fields here)
(here's the process of filling the database with the data, which is downloaded by 
means of my own script from text files.)

CREATE VIRTUAL TABLE docs_fts using fts4 (content='docs', doc... (other fields 
here)
(here's an attempt at INSERT INTO docs_fts(docs_fts) VALUES ('rebuild') to 
rebuild the contentless FTS index)

or

CREATE VIRTUAL TABLE docs_fts using fts4 (doc... (other fields here)
(here's an attempt at INSERT INTO docs_fts SELECT doc... from docs;)

or

CREATE VIRTUAL TABLE docs_fts using fts3 (doc... (other fields here)
(here's an attempt at INSERT INTO docs_fts SELECT doc... from docs;)
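
Spelled out end to end, the first variant (the content='docs' one) looks roughly
like the following; "extra" is just a stand-in for the other fields:

   CREATE TABLE docs(
     id    INTEGER PRIMARY KEY AUTOINCREMENT,
     doc   CHAR,
     extra TEXT              -- placeholder for the other fields
   );

   -- ... bulk-load the ~3 billion rows into docs here ...

   -- content='docs' FTS4 table: the full-text index lives in docs_fts,
   -- the document text stays in docs.
   CREATE VIRTUAL TABLE docs_fts USING fts4(content='docs', doc, extra);

   -- Build the index from the content table; this is the statement that runs
   -- for about 4 hours and then crashes.
   INSERT INTO docs_fts(docs_fts) VALUES('rebuild');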

For each attempt I've been making a new copy of the source file, because I 
suspected that the database could have been corrupted after the Segmentation 
Fault. I even changed the RAM cards, in case memory was the problem.

But every time I get one and the same result - Segmentation Fault error.

So, could you please look into this problem and fix it ASAP?

I can send you a file if you need.

Thank you.



[sqlite] Segfault during FTS index creation from huge data

2015-04-03 Thread Dominique Pellé
Artem  wrote:

...snip...
> SQLite is functioning for about 4 hours, after which Segmentation Fault error 
> occurs inevitably.
...snip...

You did not provide any stack dump, so I don't think that SQLite developers
can do anything to help you. Anyway, it's more likely that the bug is in
your application.

Since you use Linux, you should try to recompile and link with
-fsanitize=address -g -O0 -fno-omit-frame-pointer

-fsanitize option is available in recent versions of gcc or clang.
It will likely tell you where the bug is.

Regards
Dominique