[jira] [Created] (CARBONDATA-456) Select count(*) from table is slower.
Ravindra Pesala created CARBONDATA-456:
Summary: Select count(*) from table is slower.
Key: CARBONDATA-456
URL: https://issues.apache.org/jira/browse/CARBONDATA-456
Project: CarbonData
Issue Type: Bug
Affects Versions: 0.3.0-incubating
Reporter: Ravindra Pesala
Assignee: Ravindra Pesala
Priority: Minor

Select count(*) is slower in the current master branch compared to previous versions.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
One Pass Load Design Document
One Pass Load Design Document — please review and give your suggestions.
https://docs.google.com/document/d/1m6rY7vJMu604FagIJmrOhhy_RiUoK53-LPO6qE8jeNU/edit?usp=sharing

--
View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/One-Pass-Load-Design-Document-tp3268.html
Sent from the Apache CarbonData Mailing List archive at Nabble.com.
Re: [Feature Proposal] Spark 2 integration with CarbonData
Hi All,

+1. I agree with Jacky: it is important for the CarbonData community to work on Spark 2.x, as Spark 2.x has major design and interface changes. It is also a challenge to support both Spark 2.x and Spark 1.x. We can start creating sub-tasks under issue CARBONDATA-322.

Regards,
Ramana

On Sun, Nov 27, 2016 at 9:39 AM, Liang Chen wrote:
> Hi
>
> Very excited to see that CarbonData will integrate with Spark 2.x; looking
> forward to getting performance improved further and usability enhanced.
>
> Regards
> Liang
>
> Jacky Li wrote
> > Hi all,
> >
> > Currently CarbonData only works with Spark 1.5 and 1.6. As the Apache
> > Spark community is moving to 2.1, more and more users will deploy Spark
> > 2.x in production environments. To make CarbonData even more popular,
> > I think now is a good time to start considering Spark 2.x integration
> > with CarbonData.
> >
> > Moreover, we can take this as a chance to refactor CarbonData to make it
> > both easier to use and higher performance.
> >
> > Usability:
> > Instead of using CarbonContext, in the Spark 2 integration the user
> > should be able to
> > 1. use the native SparkSession in a Spark application to create and
> > query tables backed by CarbonData files with full feature support,
> > including index and late decode optimization.
> >
> > 2. use CarbonData's API and tools to accomplish carbon-specific tasks,
> > like compaction, delete segment, etc.
> >
> > Performance:
> > 1. deep integration with the Datasource API, leveraging Spark 2's
> > whole-stage codegen feature.
> >
> > 2. provide an implementation of a vectorized record reader to improve
> > scanning performance.
> >
> > Since Spark 2 changes a lot compared to Spark 1.6, it may take some time
> > to complete all these features. With the help of contributors and
> > committers, I hope we can have the basic features working in the next
> > CarbonData release.
> >
> > What do you think about this idea? All kinds of contributions and
> > suggestions are welcome.
> > Regards,
> > Jacky Li
Re: [improvement] Support unsafe in-memory sort in carbondata
This proposal looks good; it should improve performance and reduce GC pressure during data load. Please create an issue in Jira. We can put the unsafe functions in the common module (just as Spark does) to allow them to be used across modules/components, and we can also check whether any of Spark's unsafe code can be reused.

On Sun, Nov 27, 2016 at 11:40 PM, Ravindra Pesala wrote:
> Hi All,
>
> In the current CarbonData system loading performance is not encouraging,
> since we need to sort the data at the executor level during data loading.
> CarbonData collects a batch of data and sorts it before dumping it to
> temporary files, and finally does a merge sort over those temporary files
> to finish sorting. Here we face two major issues: one is disk IO and the
> second is GC. Even though we dump to files, CarbonData still faces a lot
> of GC pressure, since we sort each batch in memory before dumping it to
> the temporary files.
>
> To solve the above problems we can introduce Unsafe Storage and Unsafe
> Sort.
> Unsafe Storage: the user can configure a memory limit for the amount of
> data to keep in memory. We keep all the data in contiguous memory, either
> off-heap or on-heap, using Unsafe. Once the configured limit is exceeded,
> the remaining data is spilled to disk.
> Unsafe Sort: the data stored in memory using Unsafe can be sorted with an
> Unsafe sort.
>
> We can take inspiration from Spark to do the Unsafe implementations
> effectively.
>
> --
> Thanks & Regards,
> Ravindra
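The "unsafe functions in a common module" suggestion above is the pattern Spark follows with its Platform class: one small utility that wraps sun.misc.Unsafe so the rest of the code never touches it directly. A minimal sketch of such a helper follows; the class and method names are illustrative, not CarbonData's actual API.

```java
import java.lang.reflect.Field;

// Minimal sketch of a shared unsafe-memory helper, similar in spirit to
// Spark's Platform class. Names are illustrative, not CarbonData's API.
public final class UnsafeUtil {
    private static final sun.misc.Unsafe UNSAFE;

    static {
        try {
            // sun.misc.Unsafe cannot be constructed directly; fetch the
            // singleton instance via reflection.
            Field f = sun.misc.Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (sun.misc.Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private UnsafeUtil() {}

    public static long allocate(long bytes) {
        return UNSAFE.allocateMemory(bytes);   // off-heap, not GC-managed
    }

    public static void free(long address) {
        UNSAFE.freeMemory(address);
    }

    public static void putLong(long address, long value) {
        UNSAFE.putLong(address, value);
    }

    public static long getLong(long address) {
        return UNSAFE.getLong(address);
    }

    public static void main(String[] args) {
        long addr = allocate(8 * 4);           // room for four longs
        for (int i = 0; i < 4; i++) {
            putLong(addr + 8L * i, i * 10L);
        }
        long sum = 0;
        for (int i = 0; i < 4; i++) {
            sum += getLong(addr + 8L * i);
        }
        free(addr);
        System.out.println(sum);               // 0 + 10 + 20 + 30
    }
}
```

Because this memory lives outside the heap, it never contributes to GC pressure, which is exactly the benefit the proposal is after; the cost is that every allocation must be freed manually.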
Re: [New Feature] Adding bucketed table feature to Carbondata
Hi Raghu,

In Hive's or Spark's terminology, partitioning and bucketing are different. Partitioning divides a large amount of data into a number of folders based on a table column's value; the number of partitions created depends on the cardinality of the partitioned column, so it is very ineffective when that cardinality is high. Bucketing, on the other hand, divides the data into a user-configured number of equal parts based on a hash of the column, so it is useful for high-cardinality columns.

Regards,
Ravindra

On 27 November 2016 at 23:24, Raghunandan S < carbondatacontributi...@gmail.com> wrote:
> How is this different from partitioning?
> On Sun, 27 Nov 2016 at 11:21 PM, Ravindra Pesala wrote:
> >
> > Hi All,
> >
> > Bucketing is based on hash-partitioning the bucketed column into a
> > configured number of buckets. Records with the same bucketed-column
> > value always go to the same bucket. Physically, each bucket is one or
> > more files in the table directory.
> > Advantages:
> > A bucketed table is useful for map-side joins and avoids shuffling of
> > data.
> > CarbonData can do driver-level pruning on the bucketed column to
> > improve query performance.
> >
> > A user can create a bucketed table as follows:
> >
> > CREATE TABLE test(user_id BIGINT, firstname STRING, lastname STRING)
> > CLUSTERED BY(user_id) INTO 32 BUCKETS;
> >
> > In the above example the column user_id is hash-partitioned, creating
> > 32 bucket/partition files in CarbonData. So when joining with another
> > table on the bucketed column, it can select the matching buckets and do
> > the join without shuffling.
> >
> > Carbon currently creates the following folder structure, since the
> > carbon file format already supports partitioning:
> >
> > dbName -> tableName -> Fact ->
> >     Part0 -> Segment_id -> carbondata files
> >     Part1 -> Segment_id -> carbondata files
> >
> > We could also move the partition id into the file metadata, but if we
> > move the partitionId to metadata there would be complications with
> > backward compatibility.
> > --
> > Thanks & Regards,
> > Ravindra

--
Thanks & Regards,
Ravi
[improvement] Support unsafe in-memory sort in carbondata
Hi All,

In the current CarbonData system loading performance is not encouraging, since we need to sort the data at the executor level during data loading. CarbonData collects a batch of data and sorts it before dumping it to temporary files, and finally does a merge sort over those temporary files to finish sorting. Here we face two major issues: one is disk IO and the second is GC. Even though we dump to files, CarbonData still faces a lot of GC pressure, since we sort each batch in memory before dumping it to the temporary files.

To solve the above problems we can introduce Unsafe Storage and Unsafe Sort.

Unsafe Storage: the user can configure a memory limit for the amount of data to keep in memory. We keep all the data in contiguous memory, either off-heap or on-heap, using Unsafe. Once the configured limit is exceeded, the remaining data is spilled to disk.

Unsafe Sort: the data stored in memory using Unsafe can be sorted with an Unsafe sort.

We can take inspiration from Spark to do the Unsafe implementations effectively.

--
Thanks & Regards,
Ravindra
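The load flow described above (sort batches in memory, spill sorted runs to temporary files, then merge-sort the runs) can be sketched as follows. This is a simplified on-heap illustration of the spill-and-merge pattern with a configurable memory limit, not CarbonData's actual code; all class and method names are made up for the example.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of sort-with-spill: buffer rows up to a configured limit, spill
// each sorted batch to a temp file, then k-way merge the spilled runs.
public class SpillSort {
    private final int memoryLimit;                 // max rows held in memory
    private final List<Long> buffer = new ArrayList<>();
    private final List<Path> spills = new ArrayList<>();

    public SpillSort(int memoryLimit) { this.memoryLimit = memoryLimit; }

    public void add(long row) throws IOException {
        buffer.add(row);
        if (buffer.size() >= memoryLimit) spill();  // limit exceeded: spill
    }

    private void spill() throws IOException {
        Collections.sort(buffer);                   // sort the batch in memory
        Path p = Files.createTempFile("sortrun", ".tmp");
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(p)))) {
            for (long v : buffer) out.writeLong(v); // dump sorted run to disk
        }
        spills.add(p);
        buffer.clear();
    }

    // Final merge sort over all spilled runs, smallest value first.
    public List<Long> sorted() throws IOException {
        if (!buffer.isEmpty()) spill();
        PriorityQueue<long[]> heap =                // entries: {value, runIndex}
                new PriorityQueue<>((x, y) -> Long.compare(x[0], y[0]));
        List<DataInputStream> runs = new ArrayList<>();
        for (int i = 0; i < spills.size(); i++) {
            DataInputStream in = new DataInputStream(new BufferedInputStream(
                    Files.newInputStream(spills.get(i))));
            runs.add(in);
            heap.add(new long[]{in.readLong(), i});
        }
        List<Long> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            long[] top = heap.poll();
            result.add(top[0]);
            DataInputStream in = runs.get((int) top[1]);
            if (in.available() >= 8) heap.add(new long[]{in.readLong(), top[1]});
        }
        for (DataInputStream in : runs) in.close();
        for (Path p : spills) Files.deleteIfExists(p);
        return result;
    }

    public static void main(String[] args) throws IOException {
        SpillSort s = new SpillSort(3);             // tiny limit to force spills
        for (long v : new long[]{5, 1, 4, 2, 9, 3, 7}) s.add(v);
        System.out.println(s.sorted());             // values in ascending order
    }
}
```

The Unsafe variant in the proposal keeps the buffer in off-heap memory instead of an ArrayList, so the batch sort stops generating GC work; the spill and merge structure stays the same.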
Re: [New Feature] Adding bucketed table feature to Carbondata
How is this different from partitioning?

On Sun, 27 Nov 2016 at 11:21 PM, Ravindra Pesala wrote:
> Hi All,
>
> Bucketing is based on hash-partitioning the bucketed column into a
> configured number of buckets. Records with the same bucketed-column value
> always go to the same bucket. Physically, each bucket is one or more
> files in the table directory.
> Advantages:
> A bucketed table is useful for map-side joins and avoids shuffling of
> data.
> CarbonData can do driver-level pruning on the bucketed column to improve
> query performance.
>
> A user can create a bucketed table as follows:
>
> CREATE TABLE test(user_id BIGINT, firstname STRING, lastname STRING)
> CLUSTERED BY(user_id) INTO 32 BUCKETS;
>
> In the above example the column user_id is hash-partitioned, creating 32
> bucket/partition files in CarbonData. So when joining with another table
> on the bucketed column, it can select the matching buckets and do the
> join without shuffling.
>
> Carbon currently creates the following folder structure, since the carbon
> file format already supports partitioning:
>
> dbName -> tableName -> Fact ->
>     Part0 -> Segment_id -> carbondata files
>     Part1 -> Segment_id -> carbondata files
>
> We could also move the partition id into the file metadata, but if we
> move the partitionId to metadata there would be complications with
> backward compatibility.
> --
> Thanks & Regards,
> Ravindra
[New Feature] Adding bucketed table feature to Carbondata
Hi All,

Bucketing is based on hash-partitioning the bucketed column into a configured number of buckets. Records with the same bucketed-column value always go to the same bucket. Physically, each bucket is one or more files in the table directory.

Advantages:
- A bucketed table is useful for map-side joins and avoids shuffling of data.
- CarbonData can do driver-level pruning on the bucketed column to improve query performance.

A user can create a bucketed table as follows:

CREATE TABLE test(user_id BIGINT, firstname STRING, lastname STRING)
CLUSTERED BY(user_id) INTO 32 BUCKETS;

In the above example the column user_id is hash-partitioned, creating 32 bucket/partition files in CarbonData. So when joining with another table on the bucketed column, it can select the matching buckets and do the join without shuffling.

Carbon currently creates the following folder structure, since the carbon file format already supports partitioning:

dbName -> tableName -> Fact ->
    Part0 -> Segment_id -> carbondata files
    Part1 -> Segment_id -> carbondata files

We could also move the partition id into the file metadata, but if we move the partitionId to metadata there would be complications with backward compatibility.

--
Thanks & Regards,
Ravindra
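The bucket-assignment rule described above ("records with the same bucketed-column value always go to the same bucket") can be sketched in a few lines. The hash function and class name below are illustrative, not CarbonData's actual implementation.

```java
// Sketch of bucket assignment: rows with the same bucketed-column value
// always hash to the same bucket, so two tables bucketed the same way on
// the join column can be joined bucket-by-bucket without a shuffle.
public class BucketAssigner {
    private final int numBuckets;

    public BucketAssigner(int numBuckets) { this.numBuckets = numBuckets; }

    public int bucketFor(Object columnValue) {
        // Math.floorMod keeps the result in [0, numBuckets) even when the
        // hash code is negative.
        return Math.floorMod(columnValue.hashCode(), numBuckets);
    }

    public static void main(String[] args) {
        // Matches CLUSTERED BY(user_id) INTO 32 BUCKETS from the example.
        BucketAssigner assigner = new BucketAssigner(32);
        long userId = 123456789L;
        // The same user_id lands in the same bucket in every table that
        // uses the same bucketing scheme.
        System.out.println(assigner.bucketFor(userId) == assigner.bucketFor(userId));  // true
    }
}
```

This determinism is what enables both the map-side join (matching bucket files are joined pairwise) and the driver-level pruning (an equality filter on the bucketed column only needs to read one bucket).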
Re: CarbonData propose major version number increment for next version (to 1.0.0)
+1
-vimal

On Nov 23, 2016 9:39 PM, "Venkata Gollamudi" wrote:
> Hi All,
>
> CarbonData 0.2.0 has been a good, stable release, with many defects fixed
> and a number of performance improvements.
> https://issues.apache.org/jira/browse/CARBONDATA-320?jql=project%20%3D%20CARBONDATA%20AND%20fixVersion%20%3D%200.2.0-incubating%20ORDER%20BY%20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC
>
> Many major new value-added features are planned for the next version,
> taking CarbonData's capability to the next level, like:
> - IUD (Insert-Update-Delete) support,
> - complete rewrite of the data load flow without Kettle,
> - Spark 2.x support,
> - standardized CarbonInputFormat and CarbonOutputFormat,
> - Alluxio (Tachyon) file system support,
> - Carbon thrift format optimization for fast queries,
> - data loading performance improvements and in-memory off-heap sorting,
> - query performance improvements using off-heap memory,
> - support for a vectorized batch reader.
>
> https://issues.apache.org/jira/browse/CARBONDATA-301?jql=project%20%3D%20CARBONDATA%20AND%20fixVersion%20%3D%200.3.0-incubating%20ORDER%20BY%20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC
>
> I think it makes sense to change CarbonData's major version to 1.0.0 for
> the next release.
> Please comment and vote on this.
>
> Thanks,
> Ramana
[jira] [Created] (CARBONDATA-455) Benchmark for HashMap and DAT
He Xiaoqiao created CARBONDATA-455:
Summary: Benchmark for HashMap and DAT
Key: CARBONDATA-455
URL: https://issues.apache.org/jira/browse/CARBONDATA-455
Project: CarbonData
Issue Type: Sub-task
Components: core
Reporter: He Xiaoqiao

Evaluate performance and memory footprint about HashMap and DAT.
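The kind of build-time/lookup-time comparison this sub-task asks for could start from a harness like the one below. The dictionary here is synthetic and only the HashMap side is shown, since the DAT implementation is the subject of CARBONDATA-453; all names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a micro-benchmark for dictionary build and lookup cost.
// Synthetic keys stand in for real dictionary values; a DAT candidate
// would be plugged in alongside the HashMap for a like-for-like run.
public class DictBenchmark {
    public static void main(String[] args) {
        int size = 100_000;
        String[] keys = new String[size];
        for (int i = 0; i < size; i++) keys[i] = "value-" + i;

        // Build phase: value -> surrogate key.
        long t0 = System.nanoTime();
        Map<String, Integer> dict = new HashMap<>(size * 2);
        for (int i = 0; i < size; i++) dict.put(keys[i], i);
        long buildNs = System.nanoTime() - t0;

        // Lookup phase: repeated retrieval, checksum defeats dead-code
        // elimination by the JIT.
        long t1 = System.nanoTime();
        long checksum = 0;
        for (int round = 0; round < 10; round++) {
            for (String k : keys) checksum += dict.get(k);
        }
        long lookupNs = System.nanoTime() - t1;

        System.out.println("build(ms)=" + buildNs / 1_000_000
                + " lookup(ms)=" + lookupNs / 1_000_000
                + " checksum=" + checksum);
    }
}
```

A single timed pass like this is only indicative; JIT warm-up and GC effects make a JMH-style harness preferable for publishable numbers such as the ones later in this thread.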
[jira] [Created] (CARBONDATA-454) Add new unit test for DAT
He Xiaoqiao created CARBONDATA-454:
Summary: Add new unit test for DAT
Key: CARBONDATA-454
URL: https://issues.apache.org/jira/browse/CARBONDATA-454
Project: CarbonData
Issue Type: Sub-task
Components: core
Reporter: He Xiaoqiao
Assignee: He Xiaoqiao

Add new unit test for DAT.
[jira] [Created] (CARBONDATA-453) Implement DAT(Double Array Trie) for Dictionary
He Xiaoqiao created CARBONDATA-453:
Summary: Implement DAT(Double Array Trie) for Dictionary
Key: CARBONDATA-453
URL: https://issues.apache.org/jira/browse/CARBONDATA-453
Project: CarbonData
Issue Type: Sub-task
Components: core
Reporter: He Xiaoqiao
Assignee: He Xiaoqiao
Priority: Blocker

Implement a DAT structure for the Dictionary in order to reduce memory footprint and improve performance.
[jira] [Created] (CARBONDATA-452) Optimize structure of Dictionary use Trie in place of HashMap
He Xiaoqiao created CARBONDATA-452:
Summary: Optimize structure of Dictionary use Trie in place of HashMap
Key: CARBONDATA-452
URL: https://issues.apache.org/jira/browse/CARBONDATA-452
Project: CarbonData
Issue Type: Improvement
Components: core
Affects Versions: 0.2.0-incubating
Reporter: He Xiaoqiao
Assignee: He Xiaoqiao
Priority: Critical

CarbonData currently uses a ConcurrentHashMap to maintain the Dictionary, and the memory footprint is a considerable overhead because the whole Dictionary has to be loaded to decode actual data values, especially when column cardinality is large. Update the Dictionary to use a Trie in place of the HashMap to reduce memory footprint and improve retrieval performance.
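The memory saving the issue describes comes from shared prefixes being stored once in a trie, instead of every value being a separate String key in a HashMap. A minimal trie-backed dictionary (value to surrogate key) is sketched below; for clarity this is a plain pointer-based trie, not the double-array trie (DAT) the sub-tasks propose, and all names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a trie mapping dictionary values to surrogate keys. Shared
// prefixes are stored once, which is where the memory saving over a flat
// HashMap<String, Integer> comes from on high-cardinality columns. A real
// implementation would use a double-array trie for compact node storage;
// this version only demonstrates the interface and behavior.
public class TrieDictionary {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int surrogate = -1;                    // -1: no value ends at this node
    }

    private final Node root = new Node();
    private int nextSurrogate = 1;             // surrogate keys start at 1

    // Dictionary build path: assign a surrogate key to a new value,
    // or return the existing one.
    public int getOrAssign(String value) {
        Node node = root;
        for (int i = 0; i < value.length(); i++) {
            node = node.children.computeIfAbsent(value.charAt(i), c -> new Node());
        }
        if (node.surrogate == -1) node.surrogate = nextSurrogate++;
        return node.surrogate;
    }

    // Query path: value -> surrogate key, or -1 if absent.
    public int lookup(String value) {
        Node node = root;
        for (int i = 0; i < value.length(); i++) {
            node = node.children.get(value.charAt(i));
            if (node == null) return -1;
        }
        return node.surrogate;
    }

    public static void main(String[] args) {
        TrieDictionary dict = new TrieDictionary();
        int a = dict.getOrAssign("shenzhen");
        int b = dict.getOrAssign("shanghai");  // shares the "sh" prefix nodes
        System.out.println(a + " " + b + " " + dict.lookup("shenzhen"));  // 1 2 1
    }
}
```

Decoding (surrogate key back to value) would need a reverse structure on top of this, which is part of why the memory accounting in the benchmark thread compares whole mappings rather than just forward lookup tables.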
Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary
Hi Kumar Vishal,

I'll create a task to track this issue. Thanks for your suggestions.

Regards,
He Xiaoqiao

On Sun, Nov 27, 2016 at 1:41 AM, Kumar Vishal wrote:
> Hi Xiaoqiao He,
>
> You can go ahead with the DAT implementation, based on the result.
> I will look forward to your PR.
>
> Please let me know if you need any support :).
>
> -Regards
> Kumar Vishal
>
> On Fri, Nov 25, 2016 at 11:22 PM, Xiaoqiao He wrote:
>
> > Hi Liang, Kumar Vishal,
> >
> > I have done a standard benchmark of multiple data structures for the
> > Dictionary, following your suggestions. Based on the test results, I
> > think DAT may be the best choice for CarbonData.
> >
> > *1. Here are 2 test results:*
> > ---
> > Benchmark of {HashMap, DAT, RadixTree, TrieDict} structures for
> > Dictionary:
> > HashMap: java.util.HashMap
> > DAT (Double Array Trie): https://github.com/komiya-atsushi/darts-java
> > RadixTree: https://github.com/npgall/concurrent-trees
> > TrieDict (Dictionary in Kylin): http://kylin.apache.org/blog/2015/08/13/kylin-dictionary
> > Dictionary source (Traditional Chinese): https://raw.githubusercontent.com/fxsjy/jieba/master/extra_dict/dict.txt.big
> >
> > Test Result 1
> > a. Dictionary size: 584429
> > b. Build time (ms):
> >    DAT: 5714, HashMap: 110, RadixTree: 22044, TrieDict: 855
> > c. Memory footprint in 64-bit JVM (bytes):
> >    DAT: 16779752, HashMap: 32196592, RadixTree: 46130584, TrieDict: 10443608
> > d. Retrieval performance for 9935293 queries (ms):
> >    DAT: 585, HashMap: 1010, RadixTree: 417639, TrieDict: 8664
> >
> > Test Result 2
> > a. Dictionary size: 584429
> > b. Build time (ms):
> >    DAT: 5867, HashMap: 100, RadixTree: 22082, TrieDict: 840
> > c. Memory footprint in 64-bit JVM (bytes):
> >    DAT: 16779752, HashMap: 32196592, RadixTree: 46130584, TrieDict: 10443608
> > d. Retrieval performance for 9935293 queries (ms):
> >    DAT: 593, HashMap: 821, RadixTree: 422297, TrieDict: 8752
> >
> > *2. Conclusion:*
> > a. TrieDict is good for build time and has the smallest memory
> > footprint, but poor retrieval performance.
> > b. DAT is a good tradeoff between memory footprint and retrieval
> > performance.
> > c. RadixTree performs worst across all aspects.
> >
> > *3. Result Analysis:*
> > a. With the Trie, the memory footprint of the TrieDict mapping is
> > minimized compared to HashMap; to improve performance there is a cache
> > layer on top of the Trie.
> > b. Because of the large number of duplicate prefixes, HashMap's total
> > memory footprint is more than the trie's; meanwhile, I think computing
> > the string hash code of Traditional Chinese takes considerable time, so
> > its performance is not the best.
> > c. DAT is the better tradeoff.
> > d. I have no idea why RadixTree performs worst in terms of memory,
> > retrieval, and tree building.
> >
> > On Fri, Nov 25, 2016 at 11:28 AM, Liang Chen wrote:
> >
> > > Hi xiaoqiao
> > >
> > > OK, looking forward to seeing your test result.
> > > Can you take this task for this improvement? Please let me know if
> > > you need any support :)
> > >
> > > Regards
> > > Liang
> > >
> > > hexiaoqiao wrote
> > > > Hi Kumar Vishal,
> > > >
> > > > Thanks for your suggestions. As you said, if we choose a Trie to
> > > > replace the HashMap we can get a better memory footprint and also
> > > > good performance. Of course, DAT is not the only choice, and I will
> > > > do a test of DAT vs Radix Trie and release the test result as soon
> > > > as possible. Thanks for your suggestions again.
> > > > Regards,
> > > > Xiaoqiao
> > > >
> > > > On Thu, Nov 24, 2016 at 4:48 PM, Kumar Vishal <
> > > > kumarvishal1802@
> > > > > wrote:
> > > >
> > > >> Hi Xiaoqiao He,
> > > >> +1,
> > > >> For the forward dictionary case it will be a very good
> > > >> optimisation, as our case is very specific: storing a byte array to
> > > >> int mapping [data to surrogate key mapping]. I think we will get a
> > > >> much better memory footprint and performance will also be good
> > > >> (2x). We can also try a radix tree (radix trie); it is more
> > > >> optimised for storage.
> > > >>
> > > >> -Regards
> > > >> Kumar Vishal
> > > >>
> > > >> On Thu, Nov 24, 2016 at 12:12 PM, Liang Chen <
> > > > chenliang6136@
> > > >> wrote:
> > > >>
> > > >> > Hi xiaoqiao
> > > >> >
> > > >> > For the below example, 600K dictionary data:
> > > >> > It i