I've done some further digging here, and using git bisect, it seems like the problem was this change: https://sqlite.org/src/info/c736c40aab071a69. I can happily send a patch that fixes this for me, but I would like help producing a minimal test case to add to the test suite.
In the database I have as an example, there is one section of thread_search_segdir that has a negative size (indicative of being part-way through a rebuild?). level|idx|start_block|leaves_end_block|end_block|hex(root) 8|0|491439|492126|530222 -782471|02A7921E04363233660005633436623800046A396B67 Decoding the root, the left_child_id is 493863, and in fact there are four height=1 nodes (493863, 493864, 493865, and 493866). Node 493866 is interesting, because it is an interior node with no terms: blockid|hex(block) 493866|01DE841E # left-child 492126 When we try and insert a new row into the table, it triggers an incremental merge. As part of that merge, we loop down the tree looking for nodes to merge together. We first hit node 496287 at height=2, then node 493866 at height=1 and then 492126 (a leaf node). Before this commit, everything works ok; but after this commit, when we run `nodeReaderInit` on node 492166 it segfaults because we pass in `pNode->block.a` which is 0, and then try and read the memory it points to. The reason that `pNode->block.a` is 0 on the leaf node is because we have (erroneously) skipped the loop body on the height=1 node and the code assumes that the previous iteration will have initialized pNode. There are a few ways to fix the crash while keeping the test-suite passing, but I think the correct thing to do is to move the new check earlier: In ./ext/fts3/fts3_write.c:4926 - rc = nodeReaderInit(&reader, pNode->block.a, pNode->block.n); - if( reader.aNode ){ + if( pNode->block.a ){ + rc = nodeReaderInit(&reader, pNode->block.a, pNode->block.n); Although it might seem like this change is equivalent, `pNode->block.a` will exist more often than `reader.aNode`. This is because `nodeReaderInit` calls `nodeReaderNext`, which has the side-effect of setting `reader.aNode = 0` when there are no more terms to be read. In the case of node 493866 (above) pNode->block.a exists (and it contains a pointer to the left-mode-child) but calling nodeReaderNext sets reader.aNode to 0 because it contains no terms. This change allows the loop to finish iterating correctly in the case above, where there is no corruption but there is an interior node with no terms. This change will also fix a potential crash on the leaf node in the case of an actually corrupt interior node as we will skip the initialization of `aNodeWriter[i-1]` later in the loop and so pNode->block.a will still be 0 from when it was first created. Please let me know if I can be further helpful in getting this fixed, Conrad P.S. Superhuman is hiring — referral bonus for Full Stack Engineers ( https://superhuman.com/roles?gh_jid=260350 ) : $1,947. On Sat, Sep 28, 2019 at 10:35 PM, Conrad Irwin < con...@superhuman.com > wrote: > > Hi SQLite, > > > > I'd like to report a bug I've encountered using SQLite v3.29, where it > reliably segfaults when trying to insert a row into a fts3 table in a > database created with a previous version of SQLite. > > > > Please let me know if I can provide more information than what's below, as > I'm keen to get this resolved (and/or find a work around) urgently — this > is affecting hundreds of our users in production. > > > The SQL is `INSERT INTO thread_search(subject) VALUES ('a')`, but it seems > that any insert to this table will cause a problem. The table was created > using `CREATE VIRTUAL TABLE thread_search USING fts3(thread_id, subject, > content, from, to, cc, bcc, replyto, deliveredto, attachments, labels, > list, rfc822msgid, meta, tokenize=porter)` on a previous version of > SQLite (though I'm not sure which one). > > > > I am happy to share a broken database privately (we have reproduced this > once internally, and are seeing it in the wild for a large number of > customers, but have not yet been able to create a shareable database that > reproduces the problem). If it makes a difference, one potentially unusual > thing we do when writing to the table in normal operations is setting the > `rowid` to a value we control (so that search results are returned in the > correct order without a post-sort). > > > > I'm using SQLite as part of Chromium, and the stacktrace I get from > Chromium looks like this: > > 5 libsystem_platform.dylib 0x00007fff70be7b5d _sigtramp + 29 > > 6 libchromium_sqlite3.dylib 0x000000016e5b2fe8 sqlite3DbMallocRawNN + 56 > > 7 libchromium_sqlite3.dylib 0x000000016e67053c fts3IncrmergeLoad + 1388 > > 8 libchromium_sqlite3.dylib 0x000000016e66f6d7 sqlite3Fts3Incrmerge + 1031 > > > 9 libchromium_sqlite3.dylib 0x000000016e660b12 fts3SyncMethod + 210 > > 10 libchromium_sqlite3.dylib 0x000000016e5d1e70 sqlite3VtabSync + 176 > > 11 libchromium_sqlite3.dylib 0x000000016e5d0813 vdbeCommit + 67 > > 12 libchromium_sqlite3.dylib 0x000000016e5cff8d sqlite3VdbeHalt + 701 > > 13 libchromium_sqlite3.dylib 0x000000016e5d8c0b sqlite3VdbeExec + 16459 > > 14 libchromium_sqlite3.dylib 0x000000016e5a57f1 sqlite3Step + 433 > > 15 libchromium_sqlite3.dylib 0x000000016e5a54bd chrome_sqlite3_step + 125 > > > > That said, when I run the commandline version sqlite3 (built after running > fossil checkout branch-3.29 --force) inside lldb, I get a slightly > different-looking crash: > > * thread #1, queue = ' com. apple. main-thread ( > http://com.apple.main-thread/ ) ', stop reason = EXC_BAD_ACCESS (code=1, > address=0x0) frame #0: 0x000000010a2b8098 sqlite3`sqlite3Fts3Incrmerge > [inlined] nodeReaderInit(aNode=0x0000000000000000, nNode=0) at > sqlite3.c:174812:7 [opt] 174809 p->nNode = nNode; 174810 174811 /* Figure > out if this is a leaf or an internal node. */ -> 174812 if( p->aNode[0] ){ > 174813 /* An internal node. */ 174814 p->iOff = 1 + > sqlite3Fts3GetVarint(&p->aNode[1], &p->iChild); 174815 }else{ Target 0: > (sqlite3) stopped. (lldb) bt * thread #1, queue = ' com. apple. main-thread > ( http://com.apple.main-thread/ ) ', stop reason = EXC_BAD_ACCESS (code=1, > address=0x0) * frame #0: 0x000000010a2b8098 sqlite3`sqlite3Fts3Incrmerge > [inlined] nodeReaderInit(aNode=0x0000000000000000, nNode=0) at > sqlite3.c:174812:7 [opt] frame #1: 0x000000010a2b806b > sqlite3`sqlite3Fts3Incrmerge at sqlite3.c:175310 [opt] frame #2: > 0x000000010a2b7e6d sqlite3`sqlite3Fts3Incrmerge(p=<unavailable>, > nMerge=<unavailable>, nMin=<unavailable>) at sqlite3.c:175967 [opt] frame > #3: 0x000000010a2a6bf3 sqlite3`fts3SyncMethod(pVtab=0x00007fcc74004ba0) at > sqlite3.c:164365:33 [opt] frame #4: 0x000000010a1dbe15 > sqlite3`sqlite3VdbeHalt at sqlite3.c:135756:12 [opt] frame #5: > 0x000000010a1dbd8c sqlite3`sqlite3VdbeHalt [inlined] > vdbeCommit(db=<unavailable>, p=<unavailable>) at sqlite3.c:78914 [opt] > frame #6: 0x000000010a1dbd8c sqlite3`sqlite3VdbeHalt(p=<unavailable>) at > sqlite3.c:79379 [opt] frame #7: 0x000000010a1ee6ff > sqlite3`sqlite3VdbeExec(p=<unavailable>) at sqlite3.c:84884:8 [opt] frame > #8: 0x000000010a1aa02e sqlite3`sqlite3_step [inlined] > sqlite3Step(p=<unavailable>) at sqlite3.c:82160:10 [opt] frame #9: > 0x000000010a1a9e98 sqlite3`sqlite3_step(pStmt=<unavailable>) at > sqlite3.c:82225 [opt] > > I can also confirm that with sqlite v3.28, the INSERT works without > failing: > > $ ~/Downloads/sqlite-tools-osx-x86-3280000/sqlite3 > ~/debug/gabes-database-25-copy 'INSERT INTO thread_search (subject) > VALUES("a")' $ > > Interestingly, if I first run the insert first using v3.28; then it works > fine with v3.29 — I assume that this is because I no longer hitting the > incremental merge path in the newer version of the code. > > > > It also seems to succeed in v3.29 if I first run `INSERT INTO > thread_search(thread_search) VALUES ('optimize')` before running the > insert. I don't yet know whether this is a permanent fix or whether if I > continue inserting until sqlite3Fts3Incmerge runs again, it will crash > again. > > > > Conrad > > > > P.S. I'm happy to run more debugging here, in particular I'm not sure how > to get more information out when this breaks. > > > > > P.S. Superhuman is hiring — referral bonus for Full Stack Engineers ( > https://superhuman.com/roles?gh_jid=260350 ) : $1,947. > _______________________________________________ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users