Thanks for at least confirming that I was on the right track. I had done some basic inspection of the delim array before and after I changed it, and it seemed to be doing what I wanted. I haven't yet looked closely at the simpleNext function, which is clearly my next port of call.
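For reference, that kind of inspection can be reproduced outside SQLite. The following is a minimal standalone sketch, not the FTS3 source: fill_delim, count_delims and TABLE_SIZE are hypothetical names, and the char element type is an assumption about how the table is declared. It normalizes each predicate result to 0 or 1, because the C standard only promises that isspace() and its relatives return some nonzero value for a match; a large flag value stored directly into a narrow element could be truncated, possibly to 0.

```c
#include <ctype.h>

enum { TABLE_SIZE = 0x80 };  /* hypothetical name; covers 7-bit ASCII */

/*
 * Fill delim[] in the same shape as the loops quoted in this thread:
 * entry 0 stays 0 and entries 1..0x7f come from a <ctype.h> predicate.
 * If negate is nonzero the result is inverted, as in !isalnum(i).
 * The "!= 0" forces the stored value to be exactly 0 or 1.
 */
static void fill_delim(char delim[TABLE_SIZE], int (*pred)(int), int negate)
{
    int i;
    delim[0] = 0;
    for (i = 1; i < TABLE_SIZE; i++) {
        int matches = pred(i) != 0;
        delim[i] = (char)(negate ? !matches : matches);
    }
}

/* Count how many entries are marked as delimiters. */
static int count_delims(const char delim[TABLE_SIZE])
{
    int i, n = 0;
    for (i = 0; i < TABLE_SIZE; i++) {
        if (delim[i]) n++;
    }
    return n;
}
```

Calling fill_delim(table, isspace, 0), fill_delim(table, isalnum, 1) or fill_delim(table, isgraph, 1) and then printing the entries for characters such as '-' and '*' shows which of them each variant still treats as a delimiter.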
It does feel as though I'm missing something, though. My use case is in fact pretty simple. The text column I'm indexing doesn't contain lots of free-flowing text; the values are mainly one- or two-word items. I'm using FTS3 because MATCH performance for prefix searches is so much better than standard LIKE 'string%' searches.

As another test, I tried simply setting each element in the delim array to 0 so that the text isn't broken up at all, even on spaces. But as soon as I alter that line of code, my MATCH queries stop returning any results, even though in theory they should. That's a surprising result which leaves me scratching my head a tad.

So I definitely need to investigate more closely how the tokens are being split up as a starting point, but perhaps the issue is elsewhere, for example in the execution of the MATCH query itself, although I've not located that code yet.

I will keep this thread up to date if and when I make any progress.

Thanks,
Andy

Scott Hess wrote:
>
> My suggestion would be to spin up a debugger and set a breakpoint on
> that loop and check to see if it's really doing what you think it is.
> If it is, break in the code doing the actual tokenization and see if
> it's being called as expected. Or scatter some printf() calls in
> there. It's embarrassing the number of times one writes really simple
> code which doesn't work, and it turns out that it's something silly,
> like you didn't actually compile what you thought you did, or you
> modified code in the wrong file, or something like that.
>
> I don't think you probably want isspace() here, though, it would let
> through non-whitespace control characters. Most of your inputs
> probably won't have those, but it's easy to plan ahead, you might
> consider isgraph() instead, used the same way as isalnum() was being
> used before.
>
> -scott
>
>
> On Mon, Apr 6, 2009 at 8:51 AM, Andy Roberts <the_...@hotmail.com> wrote:
>>
>> Hi,
>>
>> I downloaded the amalgamation sources in order to create a build of
>> sqlite with FTS3 enabled. The problem for me is that the default
>> "simple" tokenizer is not behaving precisely how I want. In fact, I'd
>> prefer if it wouldn't count punctuation as a delimiter, and stuck
>> purely to whitespace.
>>
>> In the simpleCreate() function there's some code that initializes an
>> array that records which characters are delimiters or not:
>>
>> for(i=1; i<0x80; i++){
>>   t->delim[i] = !isalnum(i);
>> }
>>
>> I thought that if I made a simple edit to use the isspace() function
>> then I'd achieve what I was after, i.e.,
>>
>> for(i=1; i<0x80; i++){
>>   t->delim[i] = isspace(i);
>> }
>>
>> However, when I build this version, create my fts virtual tables and
>> then query them I get zero results. When I revert back to !isalnum I
>> get results, but I'm seeing words that are being split where I don't
>> want them to be.
>>
>> I must admit my C experience isn't great, but I've been trying for
>> far too many hours now with little gain. I'd really appreciate some
>> pointers!
>>
>> Thanks in advance,
>> Andy
>> --
>> View this message in context:
>> http://www.nabble.com/Simple-Tokenizer-in-FTS3-tp22911635p22911635.html
>> Sent from the SQLite mailing list archive at Nabble.com.
>>

--
View this message in context: http://www.nabble.com/Simple-Tokenizer-in-FTS3-tp22911635p22924173.html
Sent from the SQLite mailing list archive at Nabble.com.
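To see how the predicates discussed in this thread change where text is split, the scanning step can be imitated in isolation. This is a hedged sketch rather than the real simpleNext(): is_delim, next_token, count_tokens and the two table builders are hypothetical names, the char table type is an assumption, and the sketch ignores anything else the real tokenizer may do to a token, such as case folding.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical delimiter test: bytes >= 0x80 are never delimiters,
   lower byte values are looked up in the table. */
static int is_delim(const char delim[0x80], unsigned char c)
{
    return c < 0x80 && delim[c];
}

/* Whitespace-only table; "!= 0" keeps each stored value at 0 or 1. */
static void delims_whitespace_only(char delim[0x80])
{
    int i;
    delim[0] = 0;
    for (i = 1; i < 0x80; i++) delim[i] = (char)(isspace(i) != 0);
}

/* Table in which every non-alphanumeric character is a delimiter. */
static void delims_non_alnum(char delim[0x80])
{
    int i;
    delim[0] = 0;
    for (i = 1; i < 0x80; i++) delim[i] = (char)(isalnum(i) == 0);
}

/*
 * Find the next token in text[0..n) starting at *pos.
 * Returns 1 and sets *start and *len when a token is found, 0 at the
 * end of input. *pos is advanced so the call can be repeated.
 */
static int next_token(const char *text, size_t n, size_t *pos,
                      const char delim[0x80], size_t *start, size_t *len)
{
    size_t i = *pos;
    while (i < n && is_delim(delim, (unsigned char)text[i])) i++;   /* skip delimiters */
    if (i >= n) {
        *pos = i;
        return 0;
    }
    *start = i;
    while (i < n && !is_delim(delim, (unsigned char)text[i])) i++;  /* consume token */
    *len = i - *start;
    *pos = i;
    return 1;
}

/* Count the tokens a NUL-terminated string produces with a given table. */
static size_t count_tokens(const char *text, const char delim[0x80])
{
    size_t pos = 0, start = 0, len = 0, count = 0;
    size_t n = strlen(text);
    while (next_token(text, n, &pos, delim, &start, &len)) count++;
    return count;
}
```

With the whitespace-only table, the string "rock-n-roll*" comes back as a single 12-character token with the asterisk inside it, whereas the non-alphanumeric table yields rock, n and roll. If the MATCH expression is run through the same tokenizer (an assumption worth verifying in the query-parsing code), a trailing * that is no longer a delimiter might end up inside the search term instead of marking a prefix query, which would be one possible explanation for empty results.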
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users