There are two trie-based dictionaries in Kylin: TrieDictionary and 
AppendTrieDictionary.

In TrieDictionary, the dict id is determined by the position of the value in 
the trie. That means a dict id may change when more values are added to the 
trie, which is unacceptable for GlobalDict. That’s why we introduced the second 
one, AppendTrieDictionary.

AppendTrieDictionary stores the dict id in the serialized bytes instead of 
deriving it from position. The dict id does not change even when new values 
are added.
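
A minimal sketch of the difference, in Python rather than Kylin's Java, and not Kylin's actual implementation. `position_ids` and `AppendIds` are hypothetical names, and a sorted list stands in for the trie. It only illustrates why an id derived from position shifts when values are added, while an id stored with the value does not.

```python
def position_ids(values):
    """Id = rank of the value in sorted order (a stand-in for trie position)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


class AppendIds:
    """Id is assigned once, on first insert, and stored with the value."""

    def __init__(self):
        self._ids = {}

    def add(self, value):
        # Existing values keep their id; a new value takes the next free id.
        return self._ids.setdefault(value, len(self._ids))

    def id_of(self, value):
        return self._ids[value]


before = position_ids(["banana", "cherry"])
after = position_ids(["banana", "cherry", "apple"])
# "banana" moves from position 0 to position 1 once "apple" is added.
shifted = before["banana"] != after["banana"]

d = AppendIds()
banana_id = d.add("banana")
d.add("cherry")
d.add("apple")
# The stored id for "banana" is unaffected by the later inserts.
stable = d.id_of("banana") == banana_id
```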

A trie is indeed an effective way to store and compress data. In our case, one 
value in AppendTrieDictionary costs about 10 bytes, which means 200,000,000 
values cost about 2 GB. That is acceptable for us. We also split 
AppendTrieDictionary into many sub-trees, called Slices. Every Slice can be 
read and written out independently, and we added an LRU cache over them. That 
is how we control the memory cost.
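
The idea can be sketched roughly as follows. This is an illustration in Python under stated assumptions, not Kylin's code: `SliceCache`, `load_slice`, and the slice keys are hypothetical, and each slice is modelled as a plain dict from value to id. The 10 bytes per value is the rough figure quoted above.

```python
from collections import OrderedDict

BYTES_PER_VALUE = 10  # rough per-value cost quoted above


def estimated_bytes(num_values):
    """Back-of-the-envelope size of the whole dictionary."""
    return num_values * BYTES_PER_VALUE


class SliceCache:
    """Keeps at most max_slices slices in memory, evicting the least recently used."""

    def __init__(self, load_slice, max_slices):
        self._load = load_slice      # callable: slice key -> dict of value -> id
        self._max = max_slices       # assumed to be at least 1
        self._cache = OrderedDict()
        self.loads = 0               # number of times a slice had to be loaded

    def get_id(self, slice_key, value):
        if slice_key in self._cache:
            self._cache.move_to_end(slice_key)    # mark as most recently used
        else:
            self._cache[slice_key] = self._load(slice_key)
            self.loads += 1
            if len(self._cache) > self._max:
                self._cache.popitem(last=False)   # drop the least recently used
        return self._cache[slice_key][value]


# Three tiny slices standing in for serialized sub-trees on disk.
store = {"a": {"apple": 0}, "b": {"banana": 1}, "c": {"cherry": 2}}
cache = SliceCache(store.__getitem__, max_slices=2)
cache.get_id("a", "apple")
cache.get_id("b", "banana")
cache.get_id("c", "cherry")   # cache is full, so slice "a" is evicted
cache.get_id("a", "apple")    # slice "a" has to be loaded again
```

Only `max_slices` slices are resident at any time, so memory is bounded by the cache size rather than by the full 2 GB.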

> On Jul 21, 2016, at 15:12, big data <[email protected]> wrote:
> 
> Thanks. 
> I've browsed the source code for the global dict, and the trie structure seems 
> an effective way to store strings or other types of fields and assign them 
> seq nos. I randomly generated 10,000,000 GUIDs and built a TrieDictionary, 
> but it always fails with OutOfMemoryError. 
> How does Kylin control the memory size of this type of object when the 
> field's cardinality is very large?
> 
> 
> 
> On 16/7/20 at 1:30 PM, hongbin ma wrote:
>> the original JIRA for global dict is 
>> https://issues.apache.org/jira/browse/KYLIN-1705, now it's pending on GUI 
>> part https://issues.apache.org/jira/browse/KYLIN-1904
>> 
>> On Tue, Jul 19, 2016 at 2:01 PM, big data <[email protected]> wrote:
>> Thank you, Sun. I'm still downloading the code, so I first browsed the
>> articles about the Kylin dictionary. I still have some open questions:
>> 
>> 1. This article
>> (http://kylin.apache.org/blog/2015/08/13/kylin-dictionary/)
>> describes the trie structure for the dictionary, but I didn't catch how the
>> seq no. is generated in the trie example. How does the dictionary generate
>> the seq no. for each incoming string?
>> 
>> 2. If the string field is a user id or device id with millions (or even
>> billions) of UUIDs, the trie will have a fixed height (the length of a
>> UUID, such as 32 bytes), so the dictionary will be huge. Does Kylin
>> still calculate the exact cardinality, or an approximate value? And
>> how can Kylin keep up query performance for such a huge dictionary?
>> 
>> Thanks.
>> 
>> 
>> 
>> On 16/7/19 at 11:01 AM, Yerui Sun wrote:
>> > Generally speaking, we use a dictionary to encode non-integer values and 
>> > map the dict ids into a bitmap to count.
>> >
>> > In more detail, the original dictionary in Kylin is at segment level, which 
>> > means that the same value in different segments may have different dict 
>> > ids, which makes the result wrong when counting values across segments.
>> > We’ve introduced GlobalDictionary to solve this problem. The Global Dict is 
>> > at cube level, making sure each value has one stable dict id, no matter 
>> > which or how many segments the value shows up in. The Global Dict is 
>> > appendable, to support incremental cube building, and it’s also 
>> > splittable with an LRU cache, to reduce the memory cost and support huge 
>> > datasets, such as 500M.
>> >
>> > The code has been merged into the master branch and will be released in 
>> > v1.5.3; you can check it out.
>> >
>> > Any comment or discussion is welcome.
>> >
>> > Thanks.
>> >
>> >> On Jul 18, 2016, at 15:41, big data <[email protected]> wrote:
>> >>
>> >> I heard that Kylin supports non-integer fields by using a bitmap index.
>> >>
>> >> I just want to know how Kylin indexes a string field and maps each
>> >> item to the bitmap.
>> >>
>> >> Thanks.
>> > .
>> >
>> 
>> 
>> 
>> 
>> -- 
>> Regards,
>> 
>> Bin Mahone | 马洪宾
> 
