Liya Fan created ARROW-6933:
-------------------------------

             Summary: [Java] Suppor linear dictionary encoder
                 Key: ARROW-6933
                 URL: https://issues.apache.org/jira/browse/ARROW-6933
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Java
            Reporter: Liya Fan
            Assignee: Liya Fan


For many scenarios, the distribution of dictionary entries is highly skewed. In 
other words, a few dictionary entries occurs much more frequently than others. 
If we can sort the dictionary by the non-increasing order of entry frequencies, 
and compare each value to encode from the beginning of the dictionary, we get 
the following benefits:

1)      We need no extra memory space or data structure.
2)      The search is extremely efficient, as we are likely to find a match in 
the first few entries of the dictionary.

This is the basic idea behind the linear dictionary encoder. When the scenario 
is right (highly skewed dictionary distribution), it outperforms both search 
based encoder and hash table based encoders. 




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to