[jira] [Created] (CARBONDATA-456) Select count(*) from table is slower.

2016-11-27 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-456:
--

 Summary: Select count(*) from table is slower.
 Key: CARBONDATA-456
 URL: https://issues.apache.org/jira/browse/CARBONDATA-456
 Project: CarbonData
  Issue Type: Bug
Affects Versions: 0.3.0-incubating
Reporter: Ravindra Pesala
Assignee: Ravindra Pesala
Priority: Minor


Select count(*) is slower in the current master branch compared to previous versions.





One Pass Load Design Document

2016-11-27 Thread Lion.X
Please review the One Pass Load Design Document and give your suggestions.


https://docs.google.com/document/d/1m6rY7vJMu604FagIJmrOhhy_RiUoK53-LPO6qE8jeNU/edit?usp=sharing





Re: [Feature Proposal] Spark 2 integration with CarbonData

2016-11-27 Thread Venkata Gollamudi
Hi All,

+1
I agree with Jacky; it is important for the CarbonData community to work on
Spark 2.x, as Spark 2.x has major design and interface changes. It is also a
challenge to support both Spark 2.x and Spark 1.x. We can start creating
sub-tasks under issue CARBONDATA-322.
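
For illustration, the kind of native-SparkSession usage described in point 1
of the proposal below might look roughly like this sketch (the datasource
format name "carbondata" and the table path are assumptions, not a finalized
API):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CarbonSpark2Sketch {
  public static void main(String[] args) {
    // plain Spark 2 session, no Carbon-specific context class required
    SparkSession spark = SparkSession.builder()
        .appName("carbon-spark2-sketch")
        .master("local[*]")
        .getOrCreate();

    // hypothetical datasource name and path: the proposal is that Carbon
    // tables become reachable through the standard Datasource API
    Dataset<Row> df = spark.read()
        .format("carbondata")              // assumed format name
        .load("/path/to/carbon/table");    // placeholder path
    df.filter("user_id > 100").show();

    spark.stop();
  }
}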

Regards,
Ramana

On Sun, Nov 27, 2016 at 9:39 AM, Liang Chen  wrote:

> Hi
>
> Very excited to see that CarbonData will integrate with Spark 2.x; looking
> forward to further performance improvements and enhanced usability.
>
> Regards
> Liang
>
>
> Jacky Li wrote
> > Hi all,
> >
> > Currently CarbonData only works with Spark 1.5 and 1.6. As the Apache
> > Spark community is moving to 2.1, more and more users will deploy Spark
> > 2.x in production environments. In order to make CarbonData even more
> > popular, I think now is a good time to start considering Spark 2.x
> > integration with CarbonData.
> >
> > Moreover, we can take this as a chance to refactor CarbonData to make it
> > both easier to use and more performant.
> >
> > Usability:
> > Instead of using CarbonContext, in the Spark 2 integration users should
> > be able to
> > 1. use the native SparkSession in their Spark applications to create and
> > query tables backed by CarbonData files with full feature support,
> > including index and late decode optimization.
> >
> > 2. use CarbonData's API and tools to accomplish carbon-specific tasks,
> > like compaction, deleting segments, etc.
> >
> > Performance:
> > 1. deep integration with the Datasource API, leveraging Spark 2's
> > whole-stage codegen feature.
> >
> > 2. provide an implementation of a vectorized record reader to improve
> > scanning performance.
> >
> > Since Spark 2 changes a lot compared to Spark 1.6, it may take some time
> > to complete all these features. With the help of contributors and
> > committers, I hope we can have the basic features working in the next
> > CarbonData release.
> >
> > What do you think about this idea? All kinds of contributions and
> > suggestions are welcome.
> >
> > Regards,
> > Jacky Li
>
>
>
>
>
>


Re: [improvement] Support unsafe in-memory sort in carbondata

2016-11-27 Thread Venkata Gollamudi
This proposal looks good; it should improve performance and reduce GC issues
during data load. Please create an issue in JIRA. We can create the unsafe
functions in the common module (just like Spark) so they can be used across
modules/components, and also check whether we can reuse anything from Spark's
unsafe code.

On Sun, Nov 27, 2016 at 11:40 PM, Ravindra Pesala 
wrote:

> Hi All,
>
> In the current CarbonData system, loading performance is not encouraging
> because we need to sort the data at the executor level during data loading.
> CarbonData collects a batch of data, sorts it before dumping it to temporary
> files, and finally does a merge sort over those temporary files to finish
> sorting. Here we face two major issues: one is disk IO and the second is GC.
> Even though we dump to files, CarbonData still faces a lot of GC pressure
> since each batch is sorted in memory before being dumped to the temporary
> files.
>
> To solve the above problems we can introduce Unsafe Storage and Unsafe
> sort.
> Unsafe Storage: the user can configure a memory limit for the amount of data
> kept in memory. We keep all the data in a contiguous memory region, either
> off-heap or on-heap, using Unsafe. Once the configured limit is exceeded,
> the remaining data is spilled to disk.
> Unsafe Sort: the data stored in memory via Unsafe can be sorted using an
> Unsafe-based sort.
>
> We can take inspiration from Spark to implement Unsafe effectively.
>
> --
> Thanks & Regards,
> Ravindra
>


Re: [New Feature] Adding bucketed table feature to Carbondata

2016-11-27 Thread Ravindra Pesala
Hi Raghu,

In Hive's or Spark's terminology, partitioning and bucketing are different.
Partitioning divides a large amount of data into a number of folders based on
the values of table columns. The number of partitions created depends on the
cardinality of the partitioned column, so it is very ineffective when the
cardinality is high.

On the other hand, bucketing divides the data into a user-configurable number
of equal parts based on the hash of that column, so it is useful for
high-cardinality columns.
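
To make the hashing concrete, below is a minimal sketch of hash-based bucket
assignment (illustrative only; the class and method names are made up and
this is not CarbonData's actual implementation):

public class BucketAssigner {
  private final int numBuckets;

  public BucketAssigner(int numBuckets) {
    this.numBuckets = numBuckets;
  }

  // maps a bucket-column value to a bucket id in [0, numBuckets)
  public int bucketFor(Object bucketColumnValue) {
    // floorMod keeps the result non-negative even for negative hash codes
    return Math.floorMod(bucketColumnValue.hashCode(), numBuckets);
  }

  public static void main(String[] args) {
    BucketAssigner assigner = new BucketAssigner(32);
    for (long userId : new long[] {1L, 42L, 1L, 7L}) {
      // rows with the same user_id always land in the same bucket, which is
      // what makes shuffle-free (map-side) joins possible
      System.out.println("user_id=" + userId + " -> bucket "
          + assigner.bucketFor(userId));
    }
  }
}

Note that both sides of a join must be bucketed the same way (same column,
same bucket count) for the shuffle to be avoidable.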


Regards,
Ravindra

On 27 November 2016 at 23:24, Raghunandan S <
carbondatacontributi...@gmail.com> wrote:

> How is this different from partitioning?
> On Sun, 27 Nov 2016 at 11:21 PM, Ravindra Pesala 
> wrote:
>
> > Hi All,
> >
> > The bucketing concept is based on hash-partitioning the bucketed column
> > into the configured number of buckets. Records with the same bucketed
> > column value always go to the same bucket. Physically, each bucket is one
> > or more files in the table directory.
> > Advantages:
> > A bucketed table is useful for map-side joins and avoids shuffling of
> > data.
> > CarbonData can do driver-level pruning on the bucketed column to improve
> > query performance.
> >
> > A user can create a bucketed table as follows:
> >
> > CREATE TABLE test(user_id BIGINT, firstname STRING, lastname STRING)
> > CLUSTERED BY(user_id) INTO 32 BUCKETS;
> >
> > In the above example, column user_id is hash-partitioned, creating 32
> > bucket files in CarbonData. So when joining with another table on the
> > bucketed column, the matching buckets can be selected and joined without
> > shuffling.
> >
> > Carbon currently creates the following folder structure, since Carbon
> > already supports partitioning in its file format:
> >
> > dbName -> tableName -> Fact ->
> >     Part0 -> Segment_id -> carbondata files
> >     Part1 -> Segment_id -> carbondata files
> >
> > We could also move the partitionId into the file metadata, but doing so
> > would complicate backward compatibility.
> > --
> > Thanks & Regards,
> > Ravindra
> >
>



-- 
Thanks & Regards,
Ravi


[improvement] Support unsafe in-memory sort in carbondata

2016-11-27 Thread Ravindra Pesala
Hi All,

In the current CarbonData system, loading performance is not encouraging
because we need to sort the data at the executor level during data loading.
CarbonData collects a batch of data, sorts it before dumping it to temporary
files, and finally does a merge sort over those temporary files to finish
sorting. Here we face two major issues: one is disk IO and the second is GC.
Even though we dump to files, CarbonData still faces a lot of GC pressure
since each batch is sorted in memory before being dumped to the temporary
files.

To solve the above problems we can introduce Unsafe Storage and Unsafe sort.
Unsafe Storage: the user can configure a memory limit for the amount of data
kept in memory. We keep all the data in a contiguous memory region, either
off-heap or on-heap, using Unsafe. Once the configured limit is exceeded, the
remaining data is spilled to disk.
Unsafe Sort: the data stored in memory via Unsafe can be sorted using an
Unsafe-based sort.

We can take inspiration from Spark to implement Unsafe effectively.
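
As a rough sketch of the idea, the following stores a batch of longs in one
contiguous off-heap region via sun.misc.Unsafe and sorts it in place. It is
illustrative only: a real implementation would also enforce the configured
memory limit and spill sorted runs to disk for the final merge sort.

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class UnsafeSortSketch {
  private static final Unsafe UNSAFE = getUnsafe();

  private static Unsafe getUnsafe() {
    try {
      // sun.misc.Unsafe is not directly constructible; grab the singleton
      Field f = Unsafe.class.getDeclaredField("theUnsafe");
      f.setAccessible(true);
      return (Unsafe) f.get(null);
    } catch (ReflectiveOperationException e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    long[] batch = {42L, 7L, 99L, 1L, 63L};
    // store the batch in one contiguous off-heap region
    long base = UNSAFE.allocateMemory(batch.length * 8L);
    try {
      for (int i = 0; i < batch.length; i++) {
        UNSAFE.putLong(base + i * 8L, batch[i]);
      }
      insertionSort(base, batch.length);
      for (int i = 0; i < batch.length; i++) {
        System.out.println(UNSAFE.getLong(base + i * 8L));
      }
    } finally {
      UNSAFE.freeMemory(base);  // off-heap memory is invisible to the GC
    }
  }

  // in-place insertion sort over longs living outside the Java heap
  private static void insertionSort(long base, int n) {
    for (int i = 1; i < n; i++) {
      long key = UNSAFE.getLong(base + i * 8L);
      int j = i - 1;
      while (j >= 0 && UNSAFE.getLong(base + j * 8L) > key) {
        UNSAFE.putLong(base + (j + 1) * 8L, UNSAFE.getLong(base + j * 8L));
        j--;
      }
      UNSAFE.putLong(base + (j + 1) * 8L, key);
    }
  }
}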

-- 
Thanks & Regards,
Ravindra


Re: [New Feature] Adding bucketed table feature to Carbondata

2016-11-27 Thread Raghunandan S
How is this different from partitioning?
On Sun, 27 Nov 2016 at 11:21 PM, Ravindra Pesala 
wrote:

> Hi All,
>
> The bucketing concept is based on hash-partitioning the bucketed column
> into the configured number of buckets. Records with the same bucketed
> column value always go to the same bucket. Physically, each bucket is one
> or more files in the table directory.
> Advantages:
> A bucketed table is useful for map-side joins and avoids shuffling of
> data.
> CarbonData can do driver-level pruning on the bucketed column to improve
> query performance.
>
> A user can create a bucketed table as follows:
>
> CREATE TABLE test(user_id BIGINT, firstname STRING, lastname STRING)
> CLUSTERED BY(user_id) INTO 32 BUCKETS;
>
> In the above example, column user_id is hash-partitioned, creating 32
> bucket files in CarbonData. So when joining with another table on the
> bucketed column, the matching buckets can be selected and joined without
> shuffling.
>
> Carbon currently creates the following folder structure, since Carbon
> already supports partitioning in its file format:
>
> dbName -> tableName -> Fact ->
>     Part0 -> Segment_id -> carbondata files
>     Part1 -> Segment_id -> carbondata files
>
> We could also move the partitionId into the file metadata, but doing so
> would complicate backward compatibility.
> --
> Thanks & Regards,
> Ravindra
>


[New Feature] Adding bucketed table feature to Carbondata

2016-11-27 Thread Ravindra Pesala
Hi All,

The bucketing concept is based on hash-partitioning the bucketed column into
the configured number of buckets. Records with the same bucketed column value
always go to the same bucket. Physically, each bucket is one or more files in
the table directory.
Advantages:
A bucketed table is useful for map-side joins and avoids shuffling of data.
CarbonData can do driver-level pruning on the bucketed column to improve
query performance.

A user can create a bucketed table as follows:

CREATE TABLE test(user_id BIGINT, firstname STRING, lastname STRING)
CLUSTERED BY(user_id) INTO 32 BUCKETS;

In the above example, column user_id is hash-partitioned, creating 32 bucket
files in CarbonData. So when joining with another table on the bucketed
column, the matching buckets can be selected and joined without shuffling.

Carbon currently creates the following folder structure, since Carbon
already supports partitioning in its file format:

dbName -> tableName -> Fact ->
    Part0 -> Segment_id -> carbondata files
    Part1 -> Segment_id -> carbondata files

We could also move the partitionId into the file metadata, but doing so would
complicate backward compatibility.
-- 
Thanks & Regards,
Ravindra


Re: CarbonData propose major version number increment for next version (to 1.0.0)

2016-11-27 Thread Vimal Das Kammath
+1
-vimal
On Nov 23, 2016 9:39 PM, "Venkata Gollamudi"  wrote:

> Hi All,
>
> CarbonData 0.2.0 has been a solid and stable release, with a lot of
> defects fixed and a number of performance improvements.
> https://issues.apache.org/jira/browse/CARBONDATA-320?jql=project%20%3D%20CARBONDATA%20AND%20fixVersion%20%3D%200.2.0-incubating%20ORDER%20BY%20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC
>
> Many major, value-adding features are planned for the next version, taking
> CarbonData's capability to the next level, such as:
> - IUD (Insert-Update-Delete) support,
> - a complete rewrite of the data load flow without Kettle,
> - Spark 2.x support,
> - standardized CarbonInputFormat and CarbonOutputFormat,
> - Alluxio (Tachyon) file system support,
> - Carbon thrift format optimization for fast queries,
> - data loading performance improvements and in-memory off-heap sorting,
> - query performance improvements using off-heap memory,
> - vectorized batch reader support.
>
> https://issues.apache.org/jira/browse/CARBONDATA-301?jql=project%20%3D%20CARBONDATA%20AND%20fixVersion%20%3D%200.3.0-incubating%20ORDER%20BY%20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC
>
> I think it makes sense to bump CarbonData's major version to 1.0.0 in the
> next release.
> Please comment and vote on this.
>
> Thanks,
> Ramana
>


[jira] [Created] (CARBONDATA-455) Benchmark for HashMap and DAT

2016-11-27 Thread He Xiaoqiao (JIRA)
He Xiaoqiao created CARBONDATA-455:
--

 Summary: Benchmark for HashMap and DAT
 Key: CARBONDATA-455
 URL: https://issues.apache.org/jira/browse/CARBONDATA-455
 Project: CarbonData
  Issue Type: Sub-task
  Components: core
Reporter: He Xiaoqiao


Evaluate the performance and memory footprint of HashMap and DAT.





[jira] [Created] (CARBONDATA-454) Add new unit test for DAT

2016-11-27 Thread He Xiaoqiao (JIRA)
He Xiaoqiao created CARBONDATA-454:
--

 Summary: Add new unit test for DAT
 Key: CARBONDATA-454
 URL: https://issues.apache.org/jira/browse/CARBONDATA-454
 Project: CarbonData
  Issue Type: Sub-task
  Components: core
Reporter: He Xiaoqiao
Assignee: He Xiaoqiao


Add new unit tests for DAT.





[jira] [Created] (CARBONDATA-453) Implement DAT(Double Array Trie) for Dictionary

2016-11-27 Thread He Xiaoqiao (JIRA)
He Xiaoqiao created CARBONDATA-453:
--

 Summary: Implement DAT(Double Array Trie) for Dictionary 
 Key: CARBONDATA-453
 URL: https://issues.apache.org/jira/browse/CARBONDATA-453
 Project: CarbonData
  Issue Type: Sub-task
  Components: core
Reporter: He Xiaoqiao
Assignee: He Xiaoqiao
Priority: Blocker


Implement a DAT structure for the Dictionary in order to reduce the memory
footprint and improve performance.





[jira] [Created] (CARBONDATA-452) Optimize structure of Dictionary use Trie in place of HashMap

2016-11-27 Thread He Xiaoqiao (JIRA)
He Xiaoqiao created CARBONDATA-452:
--

 Summary: Optimize structure of Dictionary use Trie in place of 
HashMap
 Key: CARBONDATA-452
 URL: https://issues.apache.org/jira/browse/CARBONDATA-452
 Project: CarbonData
  Issue Type: Improvement
  Components: core
Affects Versions: 0.2.0-incubating
Reporter: He Xiaoqiao
Assignee: He Xiaoqiao
Priority: Critical


CarbonData currently uses a ConcurrentHashMap to maintain the Dictionary, and
the memory footprint is a considerable overhead because the whole Dictionary
must be loaded to decode actual data values, especially when the column
cardinality is large.
Replace the HashMap with a Trie to reduce the memory footprint and improve
retrieval performance.
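
For intuition, a minimal, hypothetical sketch of a trie-backed forward
dictionary is shown below. A naive pointer-based trie like this already
stores shared prefixes only once; a DAT additionally flattens the nodes into
two int arrays, which is where the real memory savings come from.

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class TrieDictionary {
  private static final class Node {
    final Map<Byte, Node> children = new HashMap<>();
    int surrogate = -1;  // -1 means no dictionary value ends at this node
  }

  private final Node root = new Node();

  // register a dictionary value -> surrogate key mapping
  public void put(byte[] value, int surrogate) {
    Node n = root;
    for (byte b : value) {
      n = n.children.computeIfAbsent(b, k -> new Node());
    }
    n.surrogate = surrogate;
  }

  // look up the surrogate key for a value, or -1 if absent
  public int get(byte[] value) {
    Node n = root;
    for (byte b : value) {
      n = n.children.get(b);
      if (n == null) return -1;
    }
    return n.surrogate;
  }

  public static void main(String[] args) {
    TrieDictionary dict = new TrieDictionary();
    dict.put("shanghai".getBytes(StandardCharsets.UTF_8), 1);
    dict.put("shenzhen".getBytes(StandardCharsets.UTF_8), 2);
    // the shared prefix "sh" is stored only once in the trie
    System.out.println(dict.get("shenzhen".getBytes(StandardCharsets.UTF_8)));
  }
}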





Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

2016-11-27 Thread Xiaoqiao He
Hi Kumar Vishal,

I'll create a task to track this issue.
Thanks for your suggestions.

Regards,
He Xiaoqiao
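
For reference, a minimal harness in the spirit of the benchmark quoted below
could look like the following sketch. Only the HashMap side is shown; the
dictionary file path is a placeholder, and this is not the actual test code
behind the numbers below.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DictBenchmark {
  public static void main(String[] args) throws IOException {
    // placeholder path: a newline-separated list of dictionary entries
    List<String> words = Files.readAllLines(
        Paths.get("dict.txt"), StandardCharsets.UTF_8);

    // build: map each entry to a surrogate key, timing the construction
    long t0 = System.nanoTime();
    Map<String, Integer> dict = new HashMap<>(words.size() * 2);
    for (int i = 0; i < words.size(); i++) {
      dict.put(words.get(i), i);
    }
    long buildMs = (System.nanoTime() - t0) / 1_000_000;

    // retrieval: repeated lookups over the whole key set
    long t1 = System.nanoTime();
    long checksum = 0;
    for (int round = 0; round < 10; round++) {
      for (String w : words) {
        checksum += dict.get(w);  // keeps the JIT from eliding the loop
      }
    }
    long lookupMs = (System.nanoTime() - t1) / 1_000_000;

    System.out.println("entries=" + dict.size()
        + " buildMs=" + buildMs + " lookupMs=" + lookupMs
        + " checksum=" + checksum);
  }
}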


On Sun, Nov 27, 2016 at 1:41 AM, Kumar Vishal 
wrote:

> Hi Xiaoqiao He,
>
> You can go ahead with DAT implementation, based on the result.
> I will look forward for you PR.
>
> Please let me know if you need any support:).
>
> -Regards
> KUmar Vishal
>
> On Fri, Nov 25, 2016 at 11:22 PM, Xiaoqiao He  wrote:
>
> > Hi Liang, Kumar Vishal,
> >
> > I have benchmarked multiple data structures for the Dictionary,
> > following your suggestions. Based on the test results, I think DAT may
> > be the best choice for CarbonData.
> >
> > *1. Here are 2 test results:*
> > ---
> > Benchmark of {HashMap, DAT, RadixTree, TrieDict} structures for Dictionary
> >   HashMap :   java.util.HashMap
> >   DAT (Double Array Trie):
> > https://github.com/komiya-atsushi/darts-java
> >   RadixTree:
> > https://github.com/npgall/concurrent-trees
> >   TrieDict (Dictionary in Kylin):
> > http://kylin.apache.org/blog/2015/08/13/kylin-dictionary
> > Dictionary Source (Traditional Chinese):
> > https://raw.githubusercontent.com/fxsjy/jieba/master/extra_dict/dict.txt.big
> > Test Result (run 1):
> > a. Dictionary Size:584429
> > 
> > b. Build Time (ms) :
> >DAT   : 5714
> >HashMap   : 110
> >RadixTree : 22044
> >TrieDict  : 855
> > 
> > c. Memory footprint in 64-bit JVM (bytes) :
> >DAT   : 16779752
> >HashMap   : 32196592
> >RadixTree : 46130584
> >TrieDict  : 10443608
> > 
> > d. Retrieval Performance for 9935293 query times (ms) :
> >DAT   : 585
> >HashMap   : 1010
> >RadixTree : 417639
> >TrieDict  : 8664
> >
> > Test Result (run 2):
> > a. Dictionary Size:584429
> > 
> > b. Build Time (ms) :
> >DAT   : 5867
> >HashMap   : 100
> >RadixTree : 22082
> >TrieDict  : 840
> > 
> > c. Memory footprint in 64-bit JVM (bytes) :
> >DAT   : 16779752
> >HashMap   : 32196592
> >RadixTree : 46130584
> >TrieDict  : 10443608
> > 
> > d. Retrieval Performance for 9935293 query times (ms) :
> >DAT   : 593
> >HashMap   : 821
> >RadixTree : 422297
> >TrieDict  : 8752
> >
> > *2. Conclusion:*
> > a. TrieDict is fast to build and has the smallest memory footprint, but
> > poor retrieval performance;
> > b. DAT is a good tradeoff between memory footprint and retrieval
> > performance;
> > c. RadixTree has the worst performance in every aspect measured.
> >
> > *3. Result Analysis:*
> > a. With a trie, the memory footprint of the TrieDict mapping is minimized
> > compared to HashMap; to improve performance, a cache layer is overlaid on
> > top of the trie.
> > b. For HashMap, because of the large amount of duplicated prefix data, the
> > total memory footprint is larger than the trie's; meanwhile, I think
> > computing string hash codes for Traditional Chinese consumes considerable
> > time, so its performance is not the best.
> > c. DAT is the better tradeoff.
> > d. I have no idea why RadixTree has the worst performance in terms of
> > memory, retrieval, and build time.
> >
> >
> > On Fri, Nov 25, 2016 at 11:28 AM, Liang Chen 
> > wrote:
> >
> > > Hi xiaoqiao
> > >
> > > OK, I look forward to seeing your test results.
> > > Can you take on this task for this improvement? Please let me know if
> > > you need any support :)
> > >
> > > Regards
> > > Liang
> > >
> > >
> > > hexiaoqiao wrote
> > > > Hi Kumar Vishal,
> > > >
> > > > Thanks for your suggestions. As you said, by choosing a trie to
> > > > replace the HashMap we can get a better memory footprint and also
> > > > good performance. Of course, DAT is not the only choice; I will run
> > > > tests of DAT vs. radix trie and release the test results as soon as
> > > > possible. Thanks again for your suggestions.
> > > >
> > > > Regards,
> > > > Xiaoqiao
> > > >
> > > > On Thu, Nov 24, 2016 at 4:48 PM, Kumar Vishal <kumarvishal1802@> wrote:
> > > >
> > > >> Hi Xiaoqiao He,
> > > >> +1.
> > > >> For the forward dictionary case it will be a very good optimisation,
> > > >> as our case is very specific: storing a byte array to int mapping
> > > >> [data to surrogate key mapping]. I think we will get a much better
> > > >> memory footprint, and performance should also be good (2x). We can
> > > >> also try a radix tree (radix trie); it is more optimised for storage.
> > > >>
> > > >> -Regards
> > > >> Kumar Vishal
> > > >>
> > > >> On Thu, Nov 24, 2016 at 12:12 PM, Liang Chen <chenliang6136@> wrote:
> > > >>
> > > >> > Hi xiaoqiao
> > > >> >
> > > >> > For the below example, 600K dictionary data:
> > > >> > It i