Hi Xiaoqiao He,
You can go ahead with the DAT implementation, based on these results.
I look forward to your PR.
Please let me know if you need any support :).
-Regards
Kumar Vishal
On Fri, Nov 25, 2016 at 11:22 PM, Xiaoqiao He wrote:
> Hi Liang, Kumar Vishal,
>
> I have run a standard benchmark of multiple data structures for
> Dictionary, following your suggestions. Based on the test results, I think
> DAT may be the best choice for CarbonData.
>
> *1. Here are the results of 2 test runs:*
> ---
> Benchmark about {HashMap,DAT,RadixTree,TrieDict} Structures for Dictionary
> HashMap : java.util.HashMap
> DAT (Double Array Trie):
> https://github.com/komiya-atsushi/darts-java
> RadixTree:
> https://github.com/npgall/concurrent-trees
> TrieDict (Dictionary in Kylin):
> http://kylin.apache.org/blog/2015/08/13/kylin-dictionary
> Dictionary Source (Traditional Chinese):
> https://raw.githubusercontent.com/fxsjy/jieba/master/extra_dict/dict.txt.big
> Test Result 1
> a. Dictionary Size: 584429
>
> b. Build Time (ms):
> DAT : 5714
> HashMap : 110
> RadixTree : 22044
> TrieDict : 855
>
> c. Memory footprint in 64-bit JVM (bytes):
> DAT : 16779752
> HashMap : 32196592
> RadixTree : 46130584
> TrieDict : 10443608
>
> d. Retrieval Performance for 9935293 queries (ms):
> DAT : 585
> HashMap : 1010
> RadixTree : 417639
> TrieDict : 8664
>
> Test Result 2
> a. Dictionary Size: 584429
>
> b. Build Time (ms):
> DAT : 5867
> HashMap : 100
> RadixTree : 22082
> TrieDict : 840
>
> c. Memory footprint in 64-bit JVM (bytes):
> DAT : 16779752
> HashMap : 32196592
> RadixTree : 46130584
> TrieDict : 10443608
>
> d. Retrieval Performance for 9935293 queries (ms):
> DAT : 593
> HashMap : 821
> RadixTree : 422297
> TrieDict : 8752
>
> *2. Conclusion:*
> a. TrieDict builds quickly and has the lowest memory footprint, but
> relatively poor retrieval performance.
> b. DAT offers a good tradeoff between memory footprint and retrieval
> performance.
> c. RadixTree has the worst performance across all measured aspects.
>
> *3. Result Analysis:*
> a. With a Trie, the memory footprint of the TrieDict mapping is much
> smaller than HashMap's; to improve performance, a cache layer is overlaid
> on top of the Trie.
> b. For HashMap, because of the large number of duplicate prefixes in the
> data, the total memory footprint exceeds the Trie's; meanwhile, I think
> computing string hash codes for Traditional Chinese adds considerable time
> overhead, so its retrieval performance is not the best.
> c. DAT is the better tradeoff.
> d. I have no idea why RadixTree performs worst in terms of memory,
> retrieval, and build time.
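Memory-footprint figures of this kind are often approximated by differencing JVM heap usage before and after building the structure (a rough sketch with hypothetical names; accurate measurements would normally use an instrumentation agent or a layout tool such as JOL instead):

```java
// Rough heap-delta measurement: approximates the retained size of a
// freshly built structure by sampling used heap before and after.
public class HeapDeltaSketch {

    // Sample used heap, GC'ing twice first to let the heap settle.
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        System.gc();
        System.gc();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long before = usedHeap();
        java.util.Map<String, Integer> dict = new java.util.HashMap<>();
        for (int i = 0; i < 100_000; i++) {
            dict.put("key-" + i, i);
        }
        long after = usedHeap();
        if (dict.size() != 100_000) throw new AssertionError();
        System.out.println("approx bytes: " + (after - before));
    }
}
```

The delta is only an estimate, since other allocations and GC activity can shift the numbers between samples.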
>
>
> On Fri, Nov 25, 2016 at 11:28 AM, Liang Chen
> wrote:
>
> > Hi xiaoqiao
> >
> > OK, I look forward to seeing your test results.
> > Can you take on this improvement task? Please let me know if you need
> > any support :)
> >
> > Regards
> > Liang
> >
> >
> > hexiaoqiao wrote
> > > Hi Kumar Vishal,
> > >
> > > Thanks for your suggestions. As you said, choosing a Trie to replace
> > > HashMap can give us a better memory footprint as well as good
> > > performance. Of course, DAT is not the only choice, and I will run
> > > tests on DAT vs Radix Trie and release the results as soon as
> > > possible. Thanks again for your suggestions.
> > >
> > > Regards,
> > > Xiaoqiao
> > >
> > > On Thu, Nov 24, 2016 at 4:48 PM, Kumar Vishal
> >
> > > kumarvishal1802@
> >
> > >
> > > wrote:
> > >
> > >> Hi Xiaoqiao He,
> > >> +1,
> > >> For the forward dictionary case this will be a very good
> > >> optimisation, as our case is very specific: storing a byte array to
> > >> int mapping [data to surrogate key mapping]. I think we will get a
> > >> much better memory footprint, and performance should also be good
> > >> (2x). We can also try a radix tree (radix trie); it is more
> > >> optimised for storage.
> > >>
> > >> -Regards
> > >> Kumar Vishal
> > >>
> > >> On Thu, Nov 24, 2016 at 12:12 PM, Liang Chen
> >
> > > chenliang6136@
> >
> > >
> > >> wrote:
> > >>
> > >> > Hi xiaoqiao
> > >> >
> > >> > For the below example, 600K dictionary entries:
> > >> > Is it to say that using "DAT" can save 36M of memory against
> > >> > "ConcurrentHashMap", while losing only a little performance
> > >> > (1718ms)?
> > >> >
> > >> > One more question: if the dictionary data size increases, what are
> > >> > the comparison results for "ConcurrentHashMap" vs "DAT"?
> > >> >
> > >> > Regards
> > >> > Liang
> > >> >
> > >> >