On Fri, 3 Apr 2015 18:16:18 +0300
Artem <devspec at yandex.ru> wrote:

> Hi!
> 
> The situation is like that. There's a SQLite database with around 3
> billion records. Each record consists of a certain CHAR field and
> several other additional fields with different types. The file size
> is approx. 340 gb. The maximum content length in the doc field is 256
> symbols, the content is in Russian.
<snip>

You can extend the fts3/4 tokenizers to recognize Russian stop words[1] and 
exclude them from the FTS index. I don't know Russian, but in English, examples 
of stop words are 'a', 'the', 'of', etc.

See https://www.sqlite.org/fts3.html#section_8_1 for how to implement your own 
tokenizer, or extend the unicode one to exclude your stop words. A quick hack 
would be to add code at the end of the icuNext()[2] function (file 
ext/fts3/fts3_icu.c): check whether the token (*ppToken[3] points to the 
current token text) is in your stop word list and, if so, skip it. Something 
like this:

  *piEndOffset = pCsr->aOffset[iEnd];
  *piPosition = pCsr->iToken++;
  if( token_is_stop_word(*ppToken, *pnBytes) ){
    /* Report a zero-length token so it contributes nothing to the
    ** index, and undo the position increment so token positions
    ** stay contiguous. */
    *pnBytes = 0;
    *piPosition = --pCsr->iToken;
  }
  return SQLITE_OK;
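
Note that token_is_stop_word() doesn't exist anywhere in SQLite; you'd have to 
write it yourself. A minimal sketch might look like the following (the 
stop-word list and the function name are illustrative only; tokens coming out 
of icuNext() are UTF-8 and are not nul-terminated, hence the explicit byte 
length):

```c
#include <string.h>

/* Hypothetical helper, not part of SQLite. A tiny illustrative list
** of Russian stop words in UTF-8; a real list would be much longer
** and probably loaded from a file or a generated table. */
static const char *azStopWord[] = { "\xd0\xb8",         /* "и"  */
                                    "\xd0\xb2",         /* "в"  */
                                    "\xd0\xbd\xd0\xb5", /* "не" */
                                    "\xd0\xbd\xd0\xb0"  /* "на" */ };

/* Return 1 if the nByte-byte UTF-8 token zToken is a stop word. */
static int token_is_stop_word(const char *zToken, int nByte){
  size_t i;
  for(i=0; i<sizeof(azStopWord)/sizeof(azStopWord[0]); i++){
    size_t n = strlen(azStopWord[i]);
    if( n==(size_t)nByte && memcmp(zToken, azStopWord[i], n)==0 ){
      return 1;
    }
  }
  return 0;
}
```

A linear scan is fine for a short list; with a full list (a few hundred 
entries) you'd want a sorted array with bsearch(), or a hash table, since this 
runs once per token over 3 billion records.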

N.B. It's a quick hack: I haven't compiled it, run it, or checked it against 
the full Sqlite3 documentation.

There are lists of stop words available[4][5] on the internet.

[1] https://en.wikipedia.org/wiki/Stop_words
[2] 
http://www.sqlite.org/src/info/e319e108661147bcca8dd511cd562f33a1ba81b5?txt=1&ln=177
[3] 
http://www.sqlite.org/src/info/e319e108661147bcca8dd511cd562f33a1ba81b5?txt=1&ln=235
[4] https://code.google.com/p/stop-words/ (Warning!! GPLv3 code)
[5] http://www.ranks.nl/stopwords/russian (Warning!! Unknown licence)

> Thank you.

HTH

---   ---
Eduardo Morras <emorrasg at yahoo.es>
