On Fri, 3 Apr 2015 18:16:18 +0300 Artem <devspec at yandex.ru> wrote:
> Hi!
>
> The situation is like that. There's a SQLite database with around
> 3 billion records. Each record consists of a certain CHAR field and
> several other additional fields with different types. The file size
> is approx. 340 GB. The maximum content length in the doc field is
> 256 symbols, and the content is in Russian.

<snip>

You can extend the FTS3/FTS4 tokenizers to recognize Russian stop
words [1] and exclude them from the FTS index. I don't know Russian,
but in English, examples of stop words are 'a', 'the', 'of', etc.

See https://www.sqlite.org/fts3.html#section_8_1 on how to implement
your own tokenizer, or how to extend the unicode one to exclude your
stop words.

A quick hack would be to add code at the end of the icuNext() function
[2] (file ext/fts3/fts3_icu.c) that checks whether the token just
produced is in your stop-word list and, if so, emits an empty token in
its place [3] (*ppToken is a pointer to the current token string).
Something like this, where the numbers are line numbers in the linked
source file:

233   *piEndOffset = pCsr->aOffset[iEnd];
234   *piPosition = pCsr->iToken++;
235   if( token_is_stop_word(*ppToken, *pnBytes) ){
236     *ppToken = "";      /* replace the stop word with an empty token */
237     *pnBytes = 0;
238     /* leave *piStartOffset and *piEndOffset as already set above;
239        the token span is simply empty */
240     pCsr->iToken--;     /* undo the position increment from line 234 */
241   }
242   return SQLITE_OK;

N.B. It's a quick hack; I haven't compiled it, run it, or checked it
against the full SQLite3 documentation. Lists of stop words are
available on the internet [4][5].

[1] https://en.wikipedia.org/wiki/Stop_words
[2] http://www.sqlite.org/src/info/e319e108661147bcca8dd511cd562f33a1ba81b5?txt=1&ln=177
[3] http://www.sqlite.org/src/info/e319e108661147bcca8dd511cd562f33a1ba81b5?txt=1&ln=235
[4] https://code.google.com/p/stop-words/ (Warning!! GPLv3 code)
[5] http://www.ranks.nl/stopwords/russian (Warning!! Unknown licence)

> Thank you.

HTH

---
Eduardo Morras <emorrasg at yahoo.es>
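P.S. The token_is_stop_word() helper used above is hypothetical; it is
not part of SQLite, so you would have to write it yourself. A minimal
sketch could be a plain linear scan over a byte-string list (the list
here is a tiny English placeholder; a real one would hold the Russian
stop words, UTF-8 encoded as ICU hands them to icuNext()):

```c
#include <string.h>

/* Tiny illustrative stop-word list. A real Russian list would be much
** longer; a sorted array with bsearch() or a hash table would be
** faster, but for short lists a linear scan is fine. */
static const char *aStopWords[] = { "a", "an", "the", "of", "and" };

/* Return 1 if the nByte-long token zToken matches a stop word,
** 0 otherwise. The token from the tokenizer is NOT nul-terminated,
** hence the explicit length comparison before memcmp(). */
int token_is_stop_word(const char *zToken, int nByte){
  unsigned int i;
  for(i=0; i<sizeof(aStopWords)/sizeof(aStopWords[0]); i++){
    if( (int)strlen(aStopWords[i])==nByte
     && memcmp(aStopWords[i], zToken, (size_t)nByte)==0 ){
      return 1;
    }
  }
  return 0;
}
```

Note the length check matters: the tokenizer passes a pointer into a
buffer plus a byte count, so "theory" must not match "the".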