Re: [Feature] Design Document for Update/Delete support in CarbonData

2016-11-26 Thread Jacky Li
Hi Aniket,

Yes, a background monitor process is preferred in the future. There are
other places that already need this process, like refreshing the caches in
the driver and executors. Currently, dictionary caches and index caches are
refreshed by checking the timestamp on every query, which introduces
unnecessary overhead in the query flow and impacts the NameNode in
concurrent query scenarios.
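
As a rough illustration, a minimal sketch of the kind of background
refresher this suggests, moving the timestamp check out of the query path;
all names are illustrative, not CarbonData's actual classes:

import java.util.concurrent.{Executors, TimeUnit}

class CacheRefresher(checkTimestamp: () => Long, reload: () => Unit) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
  @volatile private var lastSeen = checkTimestamp()

  // One timestamp check per interval instead of one per query, so the
  // query flow and the NameNode see no per-query overhead.
  def start(intervalSeconds: Long): Unit =
    scheduler.scheduleWithFixedDelay(new Runnable {
      override def run(): Unit = {
        val ts = checkTimestamp()
        if (ts != lastSeen) { lastSeen = ts; reload() }
      }
    }, intervalSeconds, intervalSeconds, TimeUnit.SECONDS)

  def stop(): Unit = scheduler.shutdownNow()
}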

Regards,
Jacky





[Feature Proposal] Spark 2 integration with CarbonData

2016-11-26 Thread Jacky Li
Hi all,

Currently CarbonData only works with Spark 1.5 and Spark 1.6. As the Apache
Spark community moves to 2.1, more and more users will deploy Spark 2.x in
production environments. To make CarbonData even more popular, I think now
is a good time to start considering Spark 2.x integration with CarbonData.

Moreover, we can take this as a chance to refactor CarbonData to make it
both easier to use and more performant.

Usability:
Instead of using CarbonContext, in the Spark 2 integration users should be
able to:
1. use the native SparkSession in a Spark application to create and query
tables backed by CarbonData files, with full feature support including
indexes and late decode optimization (a usage sketch follows this list).

2. use CarbonData's API and tools to accomplish carbon-specific tasks, like
compaction, delete segment, etc.
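
A minimal sketch of what point 1 could look like, assuming a "carbondata"
datasource is registered with Spark; the table name and columns are made up
for illustration:

import org.apache.spark.sql.SparkSession

object CarbonSessionExample {
  def main(args: Array[String]): Unit = {
    // Plain SparkSession, no CarbonContext required.
    val spark = SparkSession.builder()
      .appName("CarbonWithNativeSession")
      .getOrCreate()

    spark.sql(
      "CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE) USING carbondata")
    spark.sql("SELECT id, sum(amount) FROM sales GROUP BY id").show()
  }
}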

Performance:
1. deep integration with the Datasource API, leveraging Spark 2's
whole-stage codegen feature.

2. an implementation of a vectorized record reader to improve scan
performance (a shape sketch follows this list).
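
To make point 2 concrete, a minimal sketch of the vectorized-reading idea:
fill a reusable batch of column vectors per call instead of materializing
one row at a time. The shapes below are illustrative, not the actual Spark
or CarbonData interfaces:

// One array per projected column; numRows marks how much is filled.
final class ColumnBatch(val capacity: Int) {
  val ids = new Array[Int](capacity)
  val amounts = new Array[Double](capacity)
  var numRows = 0
}

trait VectorizedRecordReader {
  // Fills `batch` with up to `batch.capacity` rows in one call;
  // returns false once the underlying file is exhausted.
  def nextBatch(batch: ColumnBatch): Boolean
}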

Since Spark 2 changed a lot compared to Spark 1.6, it may take some time to
complete all of these features. With the help of contributors and
committers, I hope we can have the basic features working in the next
CarbonData release.

What do you think about this idea? All contributions and suggestions are
welcome.

Regards,
Jacky Li






Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

2016-11-26 Thread Kumar Vishal
Hi Xiaoqiao He,

You can go ahead with the DAT implementation, based on the results.
I look forward to your PR.

Please let me know if you need any support :).

-Regards
Kumar Vishal

On Fri, Nov 25, 2016 at 11:22 PM, Xiaoqiao He wrote:

> Hi Liang, Kumar Vishal,
>
> I have done a standard benchmark of multiple data structures for the
> Dictionary, following your suggestions. Based on the test results, I
> think DAT may be the best choice for CarbonData.
>
> *1. Here are 2 test results:*
> ---
> Benchmark of {HashMap, DAT, RadixTree, TrieDict} structures for Dictionary
>   HashMap : java.util.HashMap
>   DAT (Double Array Trie): https://github.com/komiya-atsushi/darts-java
>   RadixTree: https://github.com/npgall/concurrent-trees
>   TrieDict (Dictionary in Kylin):
> http://kylin.apache.org/blog/2015/08/13/kylin-dictionary
> Dictionary Source (Traditional Chinese):
> https://raw.githubusercontent.com/fxsjy/jieba/master/extra_dict/dict.txt.big
>
> Test Run 1
> a. Dictionary Size: 584429
> b. Build Time (ms):
>    DAT       : 5714
>    HashMap   : 110
>    RadixTree : 22044
>    TrieDict  : 855
> c. Memory footprint in 64-bit JVM (bytes):
>    DAT       : 16779752
>    HashMap   : 32196592
>    RadixTree : 46130584
>    TrieDict  : 10443608
> d. Retrieval Performance for 9935293 query times (ms):
>    DAT       : 585
>    HashMap   : 1010
>    RadixTree : 417639
>    TrieDict  : 8664
>
> Test Run 2
> a. Dictionary Size: 584429
> b. Build Time (ms):
>    DAT       : 5867
>    HashMap   : 100
>    RadixTree : 22082
>    TrieDict  : 840
> c. Memory footprint in 64-bit JVM (bytes):
>    DAT       : 16779752
>    HashMap   : 32196592
>    RadixTree : 46130584
>    TrieDict  : 10443608
> d. Retrieval Performance for 9935293 query times (ms):
>    DAT       : 593
>    HashMap   : 821
>    RadixTree : 422297
>    TrieDict  : 8752
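
The harness itself is not included in the thread, so the following is only
a sketch of one common way to obtain build-time and footprint numbers like
these on a 64-bit JVM, shown for the HashMap case; the file name and the
parsing are assumptions based on the linked jieba dictionary format:

import java.util.{HashMap => JHashMap}

object DictionaryBenchSketch {
  def usedBytes(): Long = {
    System.gc() // encourage a collection so the before/after delta is meaningful
    val rt = Runtime.getRuntime
    rt.totalMemory() - rt.freeMemory()
  }

  def main(args: Array[String]): Unit = {
    // dict.txt.big lines look like "word frequency tag"; keep the word.
    val words = scala.io.Source.fromFile("dict.txt.big", "UTF-8")
      .getLines().map(_.split("\\s+")(0)).toVector

    val before = usedBytes()
    val t0 = System.nanoTime()
    val dict = new JHashMap[String, Integer](words.size * 2)
    var i = 0
    while (i < words.size) { dict.put(words(i), i); i += 1 }
    println(s"build: ${(System.nanoTime() - t0) / 1000000} ms, " +
      s"approx footprint: ${usedBytes() - before} bytes, entries: ${dict.size()}")
  }
}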
>
> *2. Conclusion:*
> a. TrieDict is fast to build and has the smallest memory footprint, but
> poor retrieval performance;
> b. DAT is a good tradeoff between memory footprint and retrieval
> performance;
> c. RadixTree has the worst results in every aspect.
>
> *3. Result Analysis:*
> a. With a Trie, the memory footprint of the TrieDict mapping is close to
> minimal compared to HashMap; to improve performance, a cache layer is
> overlaid on top of the Trie.
> b. For HashMap: because of the large amount of duplicated prefix data,
> its total memory footprint is larger than the trie's; meanwhile, I think
> computing the hash codes of Traditional Chinese strings consumes
> considerable time, so its performance is not the best.
> c. DAT is a better tradeoff.
> d. I have no idea why RadixTree has the worst results in terms of memory,
> retrieval, and build time.
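
For reference, a minimal sketch of the DAT lookup being proposed, assuming
darts-java's API as shown in its README (build over lexicographically
sorted keys, exactMatchSearch returning the key's index or -1); the keys
here are made up:

import darts.DoubleArrayTrie

object DatLookupSketch {
  def main(args: Array[String]): Unit = {
    // Keys must be sorted; the match index then doubles as the surrogate key.
    val sortedKeys = java.util.Arrays.asList("apple", "banana", "cherry")
    val dat = new DoubleArrayTrie()
    dat.build(sortedKeys)
    val surrogate = dat.exactMatchSearch("banana") // index in sortedKeys, or -1
    println(s"surrogate key of banana: $surrogate")
  }
}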
>
>
> On Fri, Nov 25, 2016 at 11:28 AM, Liang Chen wrote:
>
> > Hi xiaoqiao
> >
> > OK, looking forward to seeing your test results.
> > Can you take on this improvement task? Please let me know if you need
> > any support :)
> >
> > Regards
> > Liang
> >
> >
> > hexiaoqiao wrote
> > > Hi Kumar Vishal,
> > >
> > > Thanks for your suggestions. As you said, by choosing a Trie to
> > > replace HashMap we can get a better memory footprint and also good
> > > performance. Of course, DAT is not the only choice; I will run a test
> > > of DAT vs Radix Trie and release the results as soon as possible.
> > > Thanks again for your suggestions.
> > >
> > > Regards,
> > > Xiaoqiao
> > >
> > > On Thu, Nov 24, 2016 at 4:48 PM, Kumar Vishal <kumarvishal1802@...>
> > > wrote:
> > >
> > >> Hi Xiaoqiao He,
> > >> +1.
> > >> For the forward dictionary case it will be a very good optimisation,
> > >> as our case is quite specific: storing a byte-array-to-int mapping
> > >> [data to surrogate key mapping]. I think we will get a much better
> > >> memory footprint, and performance should also be good (2x). We can
> > >> also try a radix tree (radix trie), which is more optimised for
> > >> storage.
> > >>
> > >> -Regards
> > >> Kumar Vishal
> > >>
> > >> On Thu, Nov 24, 2016 at 12:12 PM, Liang Chen <chenliang6136@...>
> > >> wrote:
> > >>
> > >> > Hi xiaoqiao
> > >> >
> > >> > For the example below, with 600K dictionary entries:
> > >> > That is to say, using "DAT" can save 36M of memory against
> > >> > "ConcurrentHashMap", while losing only a little performance
> > >> > (1718ms)?
> > >> >
> > >> > One more question: if the dictionary data size increases, what are
> > >> > the comparison results for "ConcurrentHashMap" vs "DAT"?
> > >> >
> > >> > Regards
> > >> > Liang
> > >> > 
> > >> > 

[jira] [Created] (CARBONDATA-451) Can not run query on windows now

2016-11-26 Thread zhangshunyu (JIRA)
zhangshunyu created CARBONDATA-451:
--

 Summary: Can not run query on windows now
 Key: CARBONDATA-451
 URL: https://issues.apache.org/jira/browse/CARBONDATA-451
 Project: CarbonData
  Issue Type: Bug
  Components: core
Reporter: zhangshunyu
Assignee: zhangshunyu
 Fix For: 0.2.0-incubating


As tablePath on Windows contains '/' that is not replaced when taking the
substring, it throws an error when executing a query.
I have fixed this and will raise a PR.
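
For illustration only, a hypothetical sketch of the kind of separator
normalization such a fix usually involves (assuming the mismatch is between
Windows '\' and '/' separators; this is not the actual patch):

object PathUtil {
  // Unify separators before any substring logic on the table path.
  def normalizeTablePath(tablePath: String): String =
    tablePath.replace('\\', '/')
}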


