[ 
https://issues.apache.org/jira/browse/LUCENE-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447625#comment-17447625
 ] 

Michael Sokolov edited comment on LUCENE-10247 at 11/22/21, 8:52 PM:
---------------------------------------------------------------------

As far as testing goes, https://issues.apache.org/jira/browse/LUCENE-8781 has 
some pointers. IIRC that one relied on tests in luceneutil plus Suggester unit 
tests (they do a lot of FST lookups), and there is also a torture test 
(commented out) in TestJapaneseTokenizer using JA wikipedia. That one is 
interesting because the character distribution is quite different from 
euro-centric texts.

 

I'll also echo that speed (search, mainly, but to some extent compilation speed 
matters)  is generally the main factor, but of course size reduction is welcome 
if it doesn't cost too much!


was (Author: sokolov):
As far as testing goes, https://issues.apache.org/jira/browse/LUCENE-8781 has 
some pointers. IIRC that one relied on tests in luceneutil plus Suggester unit 
tests (they do a lot of FST lookups), and there is also a torture test 
(commented out) in TestJapaneseTokenizer using JA wikipedia. That one is 
interesting because the character distribution is quite different from 
euro-centric texts.

> Reduce FST size by using absolute and relative coding for target pointers
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-10247
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10247
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Hendrik Muhs
>            Priority: Major
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> FST's use various tricks to reduce size. One more trick that can be added is 
> using relative coding for the target pointers that connect nodes.
> Currently if a node has a target and it's not optimized using the `next` 
> flag, it writes a VLong which contains the address of the child. This is an 
> absolute address. A relative address can be shorter, relative address means 
> taking the address of the node and expressing the result of child node as 
> diff between the parent and the child. E.g. if the child is _10099_ and the 
> parent {_}10000{_}, an absolute pointer requires 2 bytes, but the relative 
> pointer ({_}10099 - 10000{_}) requires only 1 byte. Of course that's not 
> always true, the absolute pointer could be smaller. Therefore the idea is to 
> switch between the two, dependent on what's smaller.
> (FWIW: From a theory standpoint this works nicely, because the FST is 
> constructed from the leafs to the root, child node addresses are always 
> smaller than parent ones and due to good locality, connected nodes are 
> usually stored close to each other )
> I have a prototype for this, it will be pushed as a draft PR and added as 
> link here. It introduces a new bit in _flags_ and changes the compiler to use 
> relative coding if possible. The lookup side needs the current node address 
> and checks one more bit flag. Both additions seem reasonable cheap and do not 
> add a lot of runtime.
> In my measurements I was able to reduce the size of a 4.2 million key 
> dictionary from 92MB ot 87MB, so saved 5MB, which is roughly 5.5%. If you 
> have better reference datasets, let me know. It obviously works better the 
> more data.
> The POC implementation only supports 1 output type (positive int output) and 
> I just made it compile. For this 1 type of dictionary it's possible to create 
> and dump dictionaries correctly.
> Before spending more time on this I like to hear feedback and get agreement 
> on whether this is something you like to have. It seems to me that a size 
> reduction of FST's is always welcome and as pointed out, the runtime 
> additions are reasonable small. If you wonder about the implementation, I 
> happily discuss the nitty gritty details on gh, it had some interesting 
> challenges.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to