[jira] [Created] (CARBONDATA-465) Spark streaming dataframe support

2016-11-28 Thread WilliamZhu (JIRA)
WilliamZhu created CARBONDATA-465:
-

 Summary: Spark streaming dataframe support
 Key: CARBONDATA-465
 URL: https://issues.apache.org/jira/browse/CARBONDATA-465
 Project: CarbonData
  Issue Type: Improvement
  Components: data-load
Affects Versions: 0.3.0-incubating
Reporter: WilliamZhu
Assignee: WilliamZhu
Priority: Minor
 Fix For: 0.3.0-incubating


CarbonData 0.3.0 supports loading data with the Spark DataFrame API. There is a
limitation: Kettle is still required, since DataFrameLoaderRDD still depends on
Kettle. We provide NewDataFrameLoaderRDD to load data with the new flow.

Also, we discovered some bugs:

1. CarbonMetastoreCatalog.createTableFromThrift

```
/**
 * If schemaFilePath starts with file://, the meta files will not be created
 * successfully, while thriftWriter raises no complaint.
 * This causes weird errors, e.g. "No table found".
 */
val thriftWriter = new ThriftWriter(schemaFilePath, false)
thriftWriter.open()
thriftWriter.write(thriftTableInfo)
thriftWriter.close()
``` 
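
A minimal workaround sketch, assuming the file:// scheme prefix is the culprit
(the normalization below is illustrative, not the committed fix):

```
// Strip the URI scheme before handing the path to ThriftWriter, since
// plain local paths are written correctly (illustrative workaround only).
val normalizedSchemaFilePath =
  if (schemaFilePath.startsWith("file://")) schemaFilePath.stripPrefix("file://")
  else schemaFilePath
val thriftWriter = new ThriftWriter(normalizedSchemaFilePath, false)
```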

2. Some exceptions are raised even when useKettle is set to false.





[jira] [Created] (CARBONDATA-464) Too many times GC occurs in query if we increase the blocklet size

2016-11-28 Thread suo tong (JIRA)
suo tong created CARBONDATA-464:
---

 Summary: Too many times GC occurs in query if we increase the 
blocklet size
 Key: CARBONDATA-464
 URL: https://issues.apache.org/jira/browse/CARBONDATA-464
 Project: CarbonData
  Issue Type: Sub-task
Reporter: suo tong








Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

2016-11-28 Thread Xiaoqiao He
Hi Jihong,

Thanks for your attention and reply.
1. Actually I have done benchmarks with English/Chinese dictionary sizes of
{100K, 200K, 300K, 400K, 500K, 600K} separately, and the test results are
basically the same as mentioned earlier in this mail thread. I will publish the
benchmark code and dictionary source on GitHub as soon as possible.
2. I have also noticed the license of DAT, and I think it is necessary to
re-implement another DAT following this paper:
https://linux.thai.net/~thep/datrie/datrie.html.

All kinds of suggestions are welcome.

Regards,
He Xiaoqiao


On Tue, Nov 29, 2016 at 5:17 AM, Jihong Ma  wrote:

> Thank you Xiaoqiao for looking into this issue and sharing your result!
>
> Have you tried varied dictionary sizes for comparison among all the
> alternatives?
>
> And please pay closer attention to the license of the DAT implementation, as
> it is under LGPL; generally speaking, it is not legally allowed to be
> included.
>
> Jihong
>
> -Original Message-
> From: Xiaoqiao He [mailto:xq.he2...@gmail.com]
> Sent: Friday, November 25, 2016 9:52 AM
> To: dev@carbondata.incubator.apache.org
> Subject: Re: [Improvement] Use Trie in place of HashMap to reduce memory
> footprint of Dictionary
>
> Hi Liang, Kumar Vishal,
>
> I have done a standard benchmark of multiple data structures for the
> Dictionary, following your suggestions. Based on the test results, I think
> DAT may be the best choice for CarbonData.
>
> *1. Here are 2 test results:*
> ---
> Benchmark about {HashMap,DAT,RadixTree,TrieDict} Structures for Dictionary
>   HashMap :   java.util.HashMap
>   DAT (Double Array Trie):
> https://github.com/komiya-atsushi/darts-java
>   RadixTree:
> https://github.com/npgall/concurrent-trees
>   TrieDict (Dictionary in Kylin):
> http://kylin.apache.org/blog/2015/08/13/kylin-dictionary
> Dictionary Source (Traditional Chinese):
> https://raw.githubusercontent.com/fxsjy/jieba/master/extra_dict/dict.txt.big
> Test Result (run 1):
> a. Dictionary Size:584429
> 
> b. Build Time (ms) :
>DAT   : 5714
>HashMap   : 110
>RadixTree : 22044
>TrieDict  : 855
> 
> c. Memory footprint in 64-bit JVM (bytes) :
>DAT   : 16779752
>HashMap   : 32196592
>RadixTree : 46130584
>TrieDict  : 10443608
> 
> d. Retrieval Performance for 9935293 query times (ms) :
>DAT   : 585
>HashMap   : 1010
>RadixTree : 417639
>TrieDict  : 8664
>
> Test Result (run 2):
> a. Dictionary Size:584429
> 
> b. Build Time (ms) :
>DAT   : 5867
>HashMap   : 100
>RadixTree : 22082
>TrieDict  : 840
> 
> c. Memory footprint in 64-bit JVM (bytes) :
>DAT   : 16779752
>HashMap   : 32196592
>RadixTree : 46130584
>TrieDict  : 10443608
> 
> d. Retrieval Performance for 9935293 query times (ms) :
>DAT   : 593
>HashMap   : 821
>RadixTree : 422297
>TrieDict  : 8752
>
> *2. Conclusion:*
> a. TrieDict is good for build time and has the smallest memory footprint,
> but comparatively poor retrieval performance;
> b. DAT is a good tradeoff between memory footprint and retrieval
> performance;
> c. RadixTree has the worst performance in all aspects.
>
> *3. Result Analysis:*
> a. With a Trie, the memory footprint of the TrieDict mapping is more or less
> minimized compared to HashMap; to improve performance, a cache layer is
> overlaid on top of the Trie.
> b. For HashMap: because of a large amount of duplicated prefix data, the
> total memory footprint is larger than the tries'; meanwhile, I think
> computing the string hash codes of Traditional Chinese words consumes
> considerable time, so its performance is not the best.
> c. DAT is a better tradeoff.
> d. I have no idea why RadixTree has the worst performance in terms of
> memory, retrieval, and tree building.
>
>
> On Fri, Nov 25, 2016 at 11:28 AM, Liang Chen 
> wrote:
>
> > Hi xiaoqiao
> >
> > ok, look forward to seeing your test result.
> > Can you take this task for this improvement? Please let me know if you
> need
> > any support :)
> >
> > Regards
> > Liang
> >
> >
> > hexiaoqiao wrote
> > > Hi Kumar Vishal,
> > >
> > > Thanks for your suggestions. As you said, if we choose a Trie to replace
> > > HashMap, we can get a better memory footprint and also good performance.
> > > Of course, DAT is not the only choice, and I will test DAT vs Radix Trie
> > > and release the test results as soon as possible. Thanks again for your
> > > suggestions.
> > >
> > > Regards,
> > > Xiaoqiao
> > >
> > > On Thu, Nov 24, 2016 at 4:48 PM, Kumar Vishal kumarvishal1802@ wrote:
> > >
> > >> Hi Xiaoqiao

RE: [Feature Proposal] Spark 2 integration with CarbonData

2016-11-28 Thread Jihong Ma

Integration with Spark 2.x is a great feature for CarbonData, as Spark 2.x is 
gradually gaining momentum. This is a big effort ahead; let's take into 
consideration all the complexity involved due to the dramatic API-level 
changes. Realizing it in phases is a good idea.

Regards.

Jihong

-Original Message-
From: Jacky Li [mailto:jacky.li...@qq.com] 
Sent: Saturday, November 26, 2016 10:08 AM
To: dev@carbondata.incubator.apache.org
Subject: [Feature Proposal] Spark 2 integration with CarbonData

Hi all,

Currently CarbonData only works with Spark 1.5 and 1.6. As the Apache Spark
community is moving to 2.1, more and more users will deploy Spark 2.x in
production environments. In order to make CarbonData even more popular, I
think now is a good time to start considering Spark 2.x integration with
CarbonData.

Moreover, we can take this as a chance to refactor CarbonData to make it
both easier to use and more performant.

Usability:
Instead of using CarbonContext, in the spark2 integration users should be able
to:
1. use the native SparkSession in the Spark application to create and query
tables backed by CarbonData files with full feature support, including index
and late decode optimization (a sketch follows below).

2. use CarbonData's API and tools to accomplish carbon-specific tasks, like
compaction, delete segment, etc.
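
For illustration, a hedged sketch of the intended user experience under this
proposal (the `carbondata` datasource name in USING and the example table are
assumptions, not a committed API):

```
import org.apache.spark.sql.SparkSession

// A plain SparkSession: no CarbonData-specific context class is needed.
val spark = SparkSession.builder()
  .appName("carbon-on-spark2")
  .getOrCreate()

// Create and query a table backed by CarbonData files through the
// standard Datasource API.
spark.sql(
  """CREATE TABLE IF NOT EXISTS sales (id INT, city STRING, amount DOUBLE)
    |USING carbondata""".stripMargin)
spark.sql("SELECT city, sum(amount) FROM sales GROUP BY city").show()
```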

Performance:
1. deep integration with the Datasource API, leveraging spark2's whole-stage
codegen feature.

2. provide a vectorized record reader implementation, to improve scanning
performance.

Since spark2 changes a lot compared to spark 1.6, it may take some time to
complete all these features. With the help of contributors and committers, I
hope we can have the basic features working in the next CarbonData release.

What do you think about this idea? All kinds of contributions and suggestions
are welcome.

Regards,
Jacky Li






RE: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

2016-11-28 Thread Jihong Ma
Thank you Xiaoqiao for looking into this issue and sharing your result!

Have you tried varied dictionary sizes for comparison among all the
alternatives?

And please pay closer attention to the license of the DAT implementation, as it
is under LGPL; generally speaking, it is not legally allowed to be included.

Jihong

-Original Message-
From: Xiaoqiao He [mailto:xq.he2...@gmail.com] 
Sent: Friday, November 25, 2016 9:52 AM
To: dev@carbondata.incubator.apache.org
Subject: Re: [Improvement] Use Trie in place of HashMap to reduce memory 
footprint of Dictionary

Hi Liang, Kumar Vishal,

I have done a standard benchmark of multiple data structures for the
Dictionary, following your suggestions. Based on the test results, I think
DAT may be the best choice for CarbonData.

*1. Here are 2 test results:*
---
Benchmark about {HashMap,DAT,RadixTree,TrieDict} Structures for Dictionary
  HashMap :   java.util.HashMap
  DAT (Double Array Trie):
https://github.com/komiya-atsushi/darts-java
  RadixTree:
https://github.com/npgall/concurrent-trees
  TrieDict (Dictionary in Kylin):
http://kylin.apache.org/blog/2015/08/13/kylin-dictionary
Dictionary Source (Traditional Chinese):
https://raw.githubusercontent.com/fxsjy/jieba/master/extra_dict/dict.txt.big
Test Result (run 1):
a. Dictionary Size:584429

b. Build Time (ms) :
   DAT   : 5714
   HashMap   : 110
   RadixTree : 22044
   TrieDict  : 855

c. Memory footprint in 64-bit JVM (bytes) :
   DAT   : 16779752
   HashMap   : 32196592
   RadixTree : 46130584
   TrieDict  : 10443608

d. Retrieval Performance for 9935293 query times (ms) :
   DAT   : 585
   HashMap   : 1010
   RadixTree : 417639
   TrieDict  : 8664

Test Result (run 2):
a. Dictionary Size:584429

b. Build Time (ms) :
   DAT   : 5867
   HashMap   : 100
   RadixTree : 22082
   TrieDict  : 840

c. Memory footprint in 64-bit JVM (bytes) :
   DAT   : 16779752
   HashMap   : 32196592
   RadixTree : 46130584
   TrieDict  : 10443608

d. Retrieval Performance for 9935293 query times (ms) :
   DAT   : 593
   HashMap   : 821
   RadixTree : 422297
   TrieDict  : 8752

*2. Conclusion:*
a. TrieDict is good for build time and has the smallest memory footprint,
but comparatively poor retrieval performance;
b. DAT is a good tradeoff between memory footprint and retrieval
performance;
c. RadixTree has the worst performance in all aspects.

*3. Result Analysis:*
a. With a Trie, the memory footprint of the TrieDict mapping is more or less
minimized compared to HashMap; to improve performance, a cache layer is
overlaid on top of the Trie.
b. For HashMap: because of a large amount of duplicated prefix data, the total
memory footprint is larger than the tries'; meanwhile, I think computing the
string hash codes of Traditional Chinese words consumes considerable time, so
its performance is not the best.
c. DAT is a better tradeoff.
d. I have no idea why RadixTree has the worst performance in terms of memory,
retrieval, and tree building.
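
To make the harness shape concrete, here is a minimal, hedged sketch of this
kind of dictionary benchmark (HashMap baseline only; a DAT/RadixTree/TrieDict
variant would be built and probed at the same two timed points). The
word-per-line input format and the 17 lookup passes (17 × 584429 ≈ 9.9M
queries) are assumptions inferred from the numbers above, not the author's
actual code:

```
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

object DictBenchmark {
  def main(args: Array[String]): Unit = {
    // Assumption: one entry per line, word in the first column,
    // as in jieba's dict.txt.big.
    val words = Files.readAllLines(Paths.get(args(0)), StandardCharsets.UTF_8)
      .asScala.map(_.split("\\s+")(0)).distinct

    // Build phase: word -> surrogate-key mapping (HashMap baseline).
    val t0 = System.currentTimeMillis()
    val dict = new java.util.HashMap[String, Integer](words.size * 2)
    words.zipWithIndex.foreach { case (w, i) => dict.put(w, i) }
    val buildMs = System.currentTimeMillis() - t0

    // Retrieval phase: repeated lookups over the whole key set.
    val t1 = System.currentTimeMillis()
    var hits = 0L
    for (_ <- 1 to 17; w <- words) if (dict.get(w) != null) hits += 1
    val lookupMs = System.currentTimeMillis() - t1

    println(s"entries=${words.size} build=${buildMs}ms " +
      s"$hits lookups in ${lookupMs}ms")
  }
}
```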


On Fri, Nov 25, 2016 at 11:28 AM, Liang Chen 
wrote:

> Hi xiaoqiao
>
> ok, look forward to seeing your test result.
> Can you take this task for this improvement? Please let me know if you need
> any support :)
>
> Regards
> Liang
>
>
> hexiaoqiao wrote
> > Hi Kumar Vishal,
> >
> > Thanks for your suggestions. As you said, if we choose a Trie to replace
> > HashMap, we can get a better memory footprint and also good performance.
> > Of course, DAT is not the only choice, and I will test DAT vs Radix Trie
> > and release the test results as soon as possible. Thanks again for your
> > suggestions.
> >
> > Regards,
> > Xiaoqiao
> >
> > On Thu, Nov 24, 2016 at 4:48 PM, Kumar Vishal kumarvishal1802@ wrote:
> >
> >> Hi Xiaoqiao He,
> >> +1,
> >> For the forward dictionary case this will be a very good optimisation, as
> >> our case is very specific: storing a byte-array-to-int mapping [data to
> >> surrogate key mapping]. I think we will get a much better memory
> >> footprint, and performance will also be good (2x). We can also try a
> >> radix tree (radix trie), which is more optimised for storage.
> >>
> >> -Regards
> >> Kumar Vishal
> >>
> >> On Thu, Nov 24, 2016 at 12:12 PM, Liang Chen chenliang6136@ wrote:
> >>
> >> > Hi xiaoqiao
> >> >
> >> > For the example below, with 600K dictionary entries:
> >> > is it to say that using "DAT" can save 36M of memory against
> >> > "ConcurrentHashMap", while losing only a little performance (1718 ms)?
> >> >
> >> > One more question: if the dictionary data size increases, what are the
> >> > comparison results for "ConcurrentHashMap" vs "DAT"?
> >> >
> >> > Regards
> >> > Liang
> >> > 
> >> > 

RE: CarbonData propose major version number increment for next version (to 1.0.0)

2016-11-28 Thread Jihong Ma
+1

A rich set of features is planned for the next release, and more importantly 
there will be external API changes introduced as we integrate with Spark 2.x. 
CarbonData deserves a major version jump as it gets mature/production-ready and 
powerful in terms of rich functionality and better performance compared to 
other file-format alternatives in the Hadoop ecosystem.

Regards.

Jihong

-Original Message-
From: Liang Chen [mailto:chenliang6...@gmail.com] 
Sent: Saturday, November 26, 2016 8:19 PM
To: dev@carbondata.incubator.apache.org
Subject: Re: CarbonData propose major version number increment for next version 
(to 1.0.0)

Hi

Agree with JB's point.

The next version will use the native SparkSession instead of CarbonContext, so
it would be better if the next version number could be 1.0.0, to mark the
major API changes.

Please find the detail discussion of CarbonData integration with Spark 2.x :
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Feature-Proposal-Spark-2-integration-with-CarbonData-td3236.html#a3238

Regards
Liang

Jean-Baptiste Onofré wrote
> +1
> 
> Good idea.
> 
> Generally speaking, a minor version is for bug fixes; a major version is for
> breaking API and command changes.
> 
> Regards
> JB
> 
> On Nov 25, 2016, at 10:00, sujith chacko sujithchacko.2010@ wrote:
>>+1
>>
>>Thanks,
>>Sujith
>>
>>On Nov 24, 2016 10:37 PM, "manish gupta" tomanishgupta18@ wrote:
>>
>>> +1
>>>
>>> Regards
>>> Manish Gupta
>>>
>>> On Thu, Nov 24, 2016 at 7:30 PM, Kumar Vishal kumarvishal1802@ wrote:
>>>
>>> > +1
>>> >
>>> > -Regards
>>> > Kumar Vishal
>>> >
>>> > On Thu, Nov 24, 2016 at 2:41 PM, Raghunandan S carbondatacontributions@ wrote:
>>> >
>>> > > +1
>>> > > On Thu, 24 Nov 2016 at 2:30 PM, Liang Chen chenliang6136@ wrote:
>>> > >
>>> > > > Hi
>>> > > >
>>> > > > Ya, good proposal.
>>> > > > CarbonData 0.x versions integrate with Spark 1.x, and the data load
>>> > > > solution of the 0.x versions uses Kettle.
>>> > > > CarbonData 1.x versions integrate with Spark 2.x, and the data load
>>> > > > solution of the 1.x versions will not use Kettle.
>>> > > >
>>> > > > That would help reduce maintenance cost by clearly distinguishing
>>> > > > the major versions.
>>> > > >
>>> > > > +1 for the proposal.
>>> > > >
>>> > > > Regards
>>> > > > Liang
>>> > > >
>>> > > >
>>> > > > Venkata Gollamudi wrote
>>> > > > > Hi All,
>>> > > > >
>>> > > > > CarbonData 0.2.0 has been good work and a stable release, with
>>> > > > > lots of defects fixed and a number of performance improvements.
>>> > > > >
>>> > > > https://issues.apache.org/jira/browse/CARBONDATA-320?jql=project%20%3D%20CARBONDATA%20AND%20fixVersion%20%3D%200.2.0-incubating%20ORDER%20BY%20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC
>>> > > > >
>>> > > > > Many major new value-added features are planned for the next
>>> > > > > version, taking CarbonData's capability to the next level.
>>> > > > > Like
>>> > > > > - IUD(Insert-Update-Delete) support,
>>> > > > > - complete rewrite of the data load flow without Kettle,
>>> > > > > - Spark 2.x support,
>>> > > > > - Standardize CarbonInputFormat and CarbonOutputFormat,
>>> > > > > - alluxio(tachyon) file system support,
>>> > > > > - Carbon thrift format optimization for fast query,
>>> > > > > - Data loading performance improvement and In memory off heap
>>> > sorting,
>>> > > > > - Query performance improvement using off heap,
>>> > > > > - Support Vectorized batch reader.
>>> > > > >
>>> > > > >
>>> > > > https://issues.apache.org/jira/browse/CARBONDATA-301?jql=project%20%3D%20CARBONDATA%20AND%20fixVersion%20%3D%200.3.0-incubating%20ORDER%20BY%20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC
>>> > > > >
>>> > > > > I think it makes sense to change CarbonData's major version to
>>> > > > > 1.0.0 in the next release.
>>> > > > > Please comment and vote on this.
>>> > > > >
>>> > > > > Thanks,
>>> > > > > Ramana
>>> > > >
>>> > > >
>>> > > >
>>> > > >
>>> > > >
>>> > > >
>>> > >
>>> >
>>>







Re: [Feature Proposal] Spark 2 integration with CarbonData

2016-11-28 Thread QiangCai
+1
I think I can take on some of the tasks; please assign some to me.





Re: carbon data

2016-11-28 Thread Liang Chen
Hi Lionel

You don't need to create the table first; please find the example code in
ExampleUtils.scala:

df.write
.format("carbondata")
.option("tableName", tableName)
.option("compress", "true")
.option("useKettle", "false")
.mode(mode)
.save()
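
For completeness, a self-contained variant of that snippet might look like the
following (assuming a DataFrame `df` is already in scope; SaveMode.Overwrite is
just an example choice):

```
import org.apache.spark.sql.SaveMode

// Writing an existing DataFrame `df` as a new CarbonData table;
// the save itself creates the table, so no prior CREATE TABLE is needed.
df.write
  .format("carbondata")
  .option("tableName", "carbontable")
  .option("compress", "true")
  .option("useKettle", "false")
  .mode(SaveMode.Overwrite)
  .save()
```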

Preparing API docs is in progress.

Regards
Liang
2016-11-28 20:24 GMT+08:00 Lu Cao :

> Hi team,
> I'm trying to save spark dataframe to carbondata file. I see the example in
> your wiki
> option("tableName", "carbontable"). Does that mean I have to create a
> carbondata table first and then save data into the table? Can I save it
> directly without creating the carbondata table?
>
> the code is
> df.write.format("carbondata").mode(SaveMode.Append).save("
> hdfs:///user//data.carbon")
>
> BTW, do you have the formal api doc?
>
> Thanks,
> Lionel
>



-- 
Regards
Liang


carbon data

2016-11-28 Thread Lu Cao
Hi team,
I'm trying to save spark dataframe to carbondata file. I see the example in
your wiki
option("tableName", "carbontable"). Does that mean I have to create a
carbondata table first and then save data into the table? Can I save it
directly without creating the carbondata table?

the code is
df.write.format("carbondata").mode(SaveMode.Append).save("hdfs:///user//data.carbon")

BTW, do you have the formal api doc?

Thanks,
Lionel


[jira] [Created] (CARBONDATA-463) Extract spark-common module

2016-11-28 Thread Jacky Li (JIRA)
Jacky Li created CARBONDATA-463:
---

 Summary: Extract spark-common module
 Key: CARBONDATA-463
 URL: https://issues.apache.org/jira/browse/CARBONDATA-463
 Project: CarbonData
  Issue Type: Sub-task
Affects Versions: 0.2.0-incubating
Reporter: Jacky Li
Assignee: Jacky Li
 Fix For: 0.3.0-incubating


Extract the spark module code into a spark-common module, to make it reusable
in the spark2 integration.





[jira] [Created] (CARBONDATA-462) Clean up code before moving to spark-common package

2016-11-28 Thread Jacky Li (JIRA)
Jacky Li created CARBONDATA-462:
---

 Summary: Clean up code before moving to spark-common package
 Key: CARBONDATA-462
 URL: https://issues.apache.org/jira/browse/CARBONDATA-462
 Project: CarbonData
  Issue Type: Sub-task
Affects Versions: 0.2.0-incubating
Reporter: Jacky Li
 Fix For: 0.3.0-incubating


Clean up code to prepare for moving it to the spark-common package.





[jira] [Created] (CARBONDATA-461) Clean partitioner in RDD package

2016-11-28 Thread Jacky Li (JIRA)
Jacky Li created CARBONDATA-461:
---

 Summary: Clean partitioner in RDD package
 Key: CARBONDATA-461
 URL: https://issues.apache.org/jira/browse/CARBONDATA-461
 Project: CarbonData
  Issue Type: Sub-task
Affects Versions: 0.2.0-incubating
Reporter: Jacky Li
 Fix For: 0.3.0-incubating


To make carbon RDDs reusable in the spark2 integration, the partitioner needs
to be removed from the RDDs.





[jira] [Created] (CARBONDATA-460) Add Unit Tests For core.writer.sortindex package

2016-11-28 Thread SWATI RAO (JIRA)
SWATI RAO created CARBONDATA-460:


 Summary: Add Unit Tests For core.writer.sortindex package 
 Key: CARBONDATA-460
 URL: https://issues.apache.org/jira/browse/CARBONDATA-460
 Project: CarbonData
  Issue Type: Test
Reporter: SWATI RAO
Priority: Trivial








[jira] [Created] (CARBONDATA-459) Block distribution is wrong in case of dynamic allocation=true

2016-11-28 Thread Manish Gupta (JIRA)
Manish Gupta created CARBONDATA-459:
---

 Summary: Block distribution is wrong in case of dynamic 
allocation=true
 Key: CARBONDATA-459
 URL: https://issues.apache.org/jira/browse/CARBONDATA-459
 Project: CarbonData
  Issue Type: Bug
Reporter: Manish Gupta
Assignee: Manish Gupta
Priority: Minor
 Fix For: 0.2.0-incubating


When dynamic allocation is true and the configured max executors exceed the
initial executors, carbon is not able to request the configured maximum number
of executors. Because of this, resources are under-utilized; when the number of
blocks increases, block distribution is limited to the number of nodes and
fewer tasks are launched. This leads to under-utilization of resources and
hence impacts query and load performance.
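
For reference, a minimal sketch of the kind of Spark configuration that
produces this scenario (the values are illustrative only):

```
import org.apache.spark.SparkConf

// Dynamic allocation enabled, with max executors well above the initial
// executors; the external shuffle service is required for dynamic allocation.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.initialExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "50")
```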






Re: [Feature Proposal] Spark 2 integration with CarbonData

2016-11-28 Thread Jacky Li
Hi Ramana,

Sure, I can work out a subtask list and put it under CARBONDATA-322.

Regards,
Jacky





[jira] [Created] (CARBONDATA-458) Improving carbon first time query performance

2016-11-28 Thread kumar vishal (JIRA)
kumar vishal created CARBONDATA-458:
---

 Summary:  Improving carbon first time query performance
 Key: CARBONDATA-458
 URL: https://issues.apache.org/jira/browse/CARBONDATA-458
 Project: CarbonData
  Issue Type: Improvement
  Components: core, data-load, data-query
Reporter: kumar vishal
Assignee: kumar vishal


Improving carbon first time query performance

Reasons:
1. As the file system cache is cleared, reading the files again is slower until
they are re-cached.
2. For a first-time query, carbon has to read the footer from the data file to
form the btree.
3. Carbon reads more footer data than is required (data chunk).
4. Lots of random seeks happen in carbon, as column data (data page, RLE,
inverted index) are not stored together.

Solutions:
1. Improve block loading time. This can be done by removing the data chunk from
blockletInfo and storing only the offset and length of the data chunk.
2. Compress the presence-meta bitset stored for null values of measure columns
using snappy (see the sketch after this list).
3. Store the metadata and data of a column together and read them together;
this reduces random seeks and improves IO.
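
A minimal sketch of what the compression step in solution 2 might look like,
assuming snappy-java (org.xerial.snappy) as the compressor; the bitset
convention and how it is wired into the chunk metadata are illustrative, not
the actual CarbonData code:

```
import java.util.BitSet
import org.xerial.snappy.Snappy

val rowCount = 1000000                      // illustrative value
val presence = new BitSet(rowCount)
// ... presence.set(i) for each row i, per the null/presence convention ...

// Snappy-compress the bitset bytes before storing them in the
// data chunk metadata.
val raw: Array[Byte] = presence.toByteArray
val compressed: Array[Byte] = Snappy.compress(raw)

// At read time, restore the bitset from the compressed bytes.
val restored = BitSet.valueOf(Snappy.uncompress(compressed))
```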






[jira] [Created] (CARBONDATA-457) Add Unit Tests For core.writer package

2016-11-28 Thread SWATI RAO (JIRA)
SWATI RAO created CARBONDATA-457:


 Summary: Add Unit Tests For core.writer package 
 Key: CARBONDATA-457
 URL: https://issues.apache.org/jira/browse/CARBONDATA-457
 Project: CarbonData
  Issue Type: Test
Reporter: SWATI RAO
Priority: Trivial





