[jira] [Created] (CARBONDATA-955) CacheProvider test fails

2017-04-18 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-955:
--

 Summary: CacheProvider test fails 
 Key: CARBONDATA-955
 URL: https://issues.apache.org/jira/browse/CARBONDATA-955
 Project: CarbonData
  Issue Type: Bug
Reporter: Ravindra Pesala
Priority: Trivial


CacheProvider test fails in core package.





[VOTE] Apache CarbonData 1.1.0-incubating (RC2) release

2017-04-18 Thread Ravindra Pesala
Hi PPMC,

I submit the Apache CarbonData 1.1.0-incubating (RC2) release for your vote.

Release Notes:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12338987

The key features of this release are highlighted below.

   -  Introduced a new data format called V3 to improve scan performance (~20
   to 50%).
   -  Alter table support in CarbonData (only for Spark 2.1).
   -  Supported Batch Sort to improve data loading performance.
   -  Improved single-pass load by upgrading to the latest Netty framework and
   launching a dictionary client for each load.
   -  Supported range filters to combine between filters into one filter to
   improve filter performance.
   -  Improved performance as per TPC-H; a formal TPC-H report will be provided.


Staging Repository:
https://repository.apache.org/content/repositories/orgapachecarbondata-1012

Git Tag:
apache-carbondata-1.1.0-incubating-rc2

Please vote to approve this release:
[ ] +1 Approve the release
[ ] -1 Don't approve the release (please provide specific comments)

This vote will be open for at least 72 hours. If this vote passes (we need
at least 3 binding votes, meaning three votes from the PPMC), I will
forward it to gene...@incubator.apache.org for the IPMC votes.

-- 
Thanks & Regards,
Ravindra Pesala.


[jira] [Created] (CARBONDATA-953) Add validations to Unsafe dataload. And control the data added to threads

2017-04-18 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-953:
--

 Summary: Add validations to Unsafe dataload. And control the data 
added to threads
 Key: CARBONDATA-953
 URL: https://issues.apache.org/jira/browse/CARBONDATA-953
 Project: CarbonData
  Issue Type: Bug
Reporter: Ravindra Pesala


Add validations to the unsafe data load. Currently there is no validation of how large the 
chunk size can be configured or how much memory the working threads use. There is also 
no control over how much data is added to the sort threads, so it may lead to out-of-memory 
errors.
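
A rough sketch of the kind of validation described above (the sizing rule and names here are illustrative only, not actual CarbonData configuration keys or classes): clamp the configured chunk size so that the chunks handed to the sort threads stay within the working memory.

{code}
object UnsafeLoadValidationSketch {
  // Clamp the requested chunk size (MB) so that all loading threads together
  // stay within the memory reserved for the unsafe sort.
  def validatedChunkSizeMB(requestedMB: Int, workingMemoryMB: Int, loadingThreads: Int): Int = {
    val maxPerThread = math.max(1, workingMemoryMB / loadingThreads)
    math.min(requestedMB, maxPerThread)
  }

  def main(args: Array[String]): Unit = {
    // e.g. 512 MB requested per chunk, 1024 MB working memory, 10 loading threads
    println(validatedChunkSizeMB(512, 1024, 10)) // prints 102
  }
}
{code}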





[jira] [Created] (CARBONDATA-915) Call getAll dictionary from codegen of dictionary decoder to improve dictionary load performance

2017-04-12 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-915:
--

 Summary: Call getAll dictionary from codegen of dictionary decoder 
to improve dictionary load performance
 Key: CARBONDATA-915
 URL: https://issues.apache.org/jira/browse/CARBONDATA-915
 Project: CarbonData
  Issue Type: Bug
Reporter: Ravindra Pesala


Currently the dictionary decoder gets each dictionary individually from the cache, which is 
not effective because the dictionaries are not loaded in parallel. It is also not thread safe 
to call the single-dictionary lookup instead of getAll.
Call getAll dictionary from the codegen of the dictionary decoder to improve dictionary 
load performance.





[jira] [Created] (CARBONDATA-893) MR testcase hangs in Hadoop 2.7.2 version profile

2017-04-10 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-893:
--

 Summary: MR testcase hangs in Hadoop 2.7.2 version profile
 Key: CARBONDATA-893
 URL: https://issues.apache.org/jira/browse/CARBONDATA-893
 Project: CarbonData
  Issue Type: Bug
Reporter: Ravindra Pesala


MR testcase hangs in Hadoop 2.7.2 version profile





[jira] [Created] (CARBONDATA-874) select * from table order by limit query is failing

2017-04-05 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-874:
--

 Summary: select * from table order by limit query is failing
 Key: CARBONDATA-874
 URL: https://issues.apache.org/jira/browse/CARBONDATA-874
 Project: CarbonData
  Issue Type: Bug
Reporter: Ravindra Pesala


Queries like the one below fail in Carbon with Spark 2.1:
select * from alldatatypestablesort order by empname limit 10





[VOTE] Apache CarbonData 1.1.0-incubating (RC1) release

2017-04-05 Thread Ravindra Pesala
Hi PPMC,

I submit the Apache CarbonData 1.1.0-incubating (RC1) release for your vote.

Release Notes:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12338987

The key features of this release are highlighted below.

   -  Introduced a new data format called V3 to improve scan performance (~20
   to 50%).
   -  Alter table support in CarbonData (only for Spark 2.1).
   -  Supported Batch Sort to improve data loading performance.
   -  Improved single-pass load by upgrading to the latest Netty framework and
   launching a dictionary client for each load.
   -  Supported range filters to combine between filters into one filter to
   improve filter performance.
   -  Improved performance as per TPC-H; a formal TPC-H report will be provided.


Staging Repository:
https://repository.apache.org/content/repositories/orgapachecarbondata-1011

Git Tag:
apache-carbondata-1.1.0-incubating-rc1

Please vote to approve this release:
[ ] +1 Approve the release
[ ] -1 Don't approve the release (please provide specific comments)

This vote will be open for at least 72 hours. If this vote passes (we need
at least 3 binding votes, meaning three votes from the PPMC), I will
forward it to gene...@incubator.apache.org for the IPMC votes.

-- 
Thanks & Regards,
Ravindra


[jira] [Created] (CARBONDATA-861) Improvements in query processing.

2017-04-05 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-861:
--

 Summary: Improvements in query processing.
 Key: CARBONDATA-861
 URL: https://issues.apache.org/jira/browse/CARBONDATA-861
 Project: CarbonData
  Issue Type: Improvement
Reporter: Ravindra Pesala
Priority: Minor


Following is the list of improvements done:

- Remove multiple creation of arrays and copies of them in the dimension and measure 
chunk readers.
- Simplify the logic of finding offsets of no-dictionary keys in the class 
SafeVariableLengthDimensionDataChunkStore.
- Avoid byte array creation and copy for no-dictionary columns in the case of the 
vectorized reader; instead, directly send the length and offset to the vector.
- Remove unnecessary decoder plan additions to the optimized plan. It can optimize 
the codegen flow.
- Update CompareTest to take the table blocksize and keep it as 32 MB in order to 
make use of small sorts when doing takeOrdered in Spark.





Re: Re: Re: Optimize Order By + Limit Query

2017-03-29 Thread Ravindra Pesala
Hi,

This approach comes with many limitations:
1. It cannot work for dictionary columns, as there is no guarantee that
dictionary allocation is in sorted order.
2. It cannot work for columns without an inverted index.
3. It cannot work for measures.

Moreover, you mentioned that it can reduce IO, but I don't think we can
reduce any IO, since we need to read all blocklets to do the merge sort. I am
also not sure how we can keep all the data in memory until we do the merge sort.
I still believe that this work belongs to the execution engine, not the file
format. This type of specific improvement may give good performance for some
specific types of queries, but it will cause long-term complications
in maintainability.


Regards,
Ravindra.

On 30 March 2017 at 08:23, 马云  wrote:

> Hi Ravindran,
>
> yes, use carbon do the sorting if the order by column is not first column.
>
> But its sorting efficiency is very high since the dimension data in a blocklet is
> stored sorted.
>
> So in carbon can use  merge sort  + topN to get N data from each block.
>
> In addition, the biggest difference is that it can reduce disk IO, since limit n
> can be used to reduce the required blocklets.
>
> If you only apply Spark's top-N, I don't think you can achieve the performance
> shown below.
>
> That's impossible without reducing disk IO.
>
>
>
>
>
>
>
>
> At 2017-03-30 03:12:54, "Ravindra Pesala"  wrote:
> >Hi,
> >
> >You mean Carbon do the sorting if the order by column is not first column
> >and provide only limit values to spark. But the same job spark is also
> >doing it just sorts the partition and gets the top values out of it. You
> >can reduce the table_blocksize to get the better sort performance as spark
> >try to do sorting inside memory.
> >
> >I can see we can do some optimizations in integration layer itself with out
> >pushing down any logic to carbon like if the order by column is first
> >column then we can just get limit values with out sorting any data.
> >
> >Regards,
> >Ravindra.
> >
> >On 29 March 2017 at 08:58, 马云  wrote:
> >
> >> Hi Ravindran,
> >> Thanks for your quick response. please see my answer as below
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>  What if the order by column is not the first column? It needs to scan all
> >> blocklets to get the data out of it if the order by column is not first
> >> column of mdk
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> Answer :  if step2 doesn't filter any blocklet, you are right,It needs to
> >> scan all blocklets to get the data out of it if the order by column is not
> >> first column of mdk
> >> but it just scan all the order by column's data, for
> >> others columns data,  use the lazy-load strategy and  it can reduce scan
> >> accordingly to  limit value.
> >> Hence you can see the performance is much better now
> >> after  my optimization. Currently the carbondata order by + limit
> >> performance is very bad since it scans all data.
> >>in my test there are  20,000,000 data, it takes more than
> >> 10s, if data is much more huge,  I think it is hard for user to stand such
> >> bad performance when they do order by + limit  query?
> >>
> >>
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>  We used to have multiple push down optimizations from spark to carbon
> >> like aggregation, limit, topn etc. But later it was removed because it is
> >> very hard to maintain for version to version. I feel it is better that
> >> execution engine like spark can do these type of operations.
> >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >> Answer : In my opinion, I don't think "hard to maintain for version to
> >> version" is a good reason to give up the order by  + limit optimization.
> >> I think it can create new class to extends current and try to reduce the
> >> 

Re: Load data into carbondata executors distributed unevenly

2017-03-29 Thread Ravindra Pesala
Hi,

It seems the attachments are missing. Can you attach them again?

Regards,
Ravindra.

On 30 March 2017 at 08:02, a  wrote:

> Hello!
>
> *Test result:*
> When I load CSV data into the carbondata table 3 times, the executors are
> distributed unevenly. My purpose is one node one task, but the result is that
> some nodes have 2 tasks and some nodes have no task.
> See load data 1.png, data 2.png and data 3.png.
> The carbondata data.PNG is the data structure in Hadoop.
>
> I load 4   records into the carbondata table and it takes 2629 seconds, which is
> too long.
>
> *Question:*
> How can I make the executors distribute evenly?
>
> The environment:
> Spark 2.1 + CarbonData 1.1; there are 7 datanodes.
>
> ./bin/spark-shell \
> --master yarn \
> --deploy-mode client \
> --num-executors n \  (the first time is 7 (result in load data 1.png), the
> second time is 6 (result in load data 2.png), the third time is 8 (result in
> load data 3.png))
> --executor-cores 10 \
> --executor-memory 40G \
> --driver-memory 8G
>
> carbon.properties
>  DataLoading Configuration 
> carbon.sort.file.buffer.size=20
> carbon.graph.rowset.size=1
> carbon.number.of.cores.while.loading=10
> carbon.sort.size=5
> carbon.number.of.cores.while.compacting=10
> carbon.number.of.cores=10
>
> Best regards!
>
>
>
>
>
>
>



-- 
Thanks & Regards,
Ravi


Re: Re: Optimize Order By + Limit Query

2017-03-29 Thread Ravindra Pesala
Hi,

You mean Carbon does the sorting if the order by column is not the first column,
and provides only the limit values to Spark. But Spark is already doing the same
job: it just sorts the partition and gets the top values out of it. You can
reduce the table_blocksize to get better sort performance, as Spark tries to do
the sorting in memory.

I can see that we can do some optimizations in the integration layer itself without
pushing down any logic to Carbon; for example, if the order by column is the first
column, then we can just get the limit values without sorting any data.

Regards,
Ravindra.

On 29 March 2017 at 08:58, 马云  wrote:

> Hi Ravindran,
> Thanks for your quick response. please see my answer as below
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>  What if the order by column is not the first column? It needs to scan all
> blocklets to get the data out of it if the order by column is not first
> column of mdk
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Answer: if step 2 doesn't filter any blocklets, you are right, it needs to
> scan all blocklets to get the data out if the order by column is not
> the first column of the MDK,
> but it only scans all of the order by column's data; for
> the other columns' data, it uses the lazy-load strategy, which can reduce the scan
> according to the limit value.
> Hence you can see the performance is much better now
> after my optimization. Currently the carbondata order by + limit
> performance is very bad since it scans all data.
> In my test there are 20,000,000 rows, and it takes more than
> 10s; if the data is much larger, I think it is hard for users to accept such
> bad performance when they do an order by + limit query.
>
>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>  We used to have multiple push down optimizations from spark to carbon
> like aggregation, limit, topn etc. But later it was removed because it is
> very hard to maintain for version to version. I feel it is better that
> execution engine like spark can do these type of operations.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Answer: In my opinion, I don't think "hard to maintain from version to
> version" is a good reason to give up the order by + limit optimization.
> We can create a new class that extends the current one and try to reduce the
> impact on the current code. Maybe that can make it easy to maintain.
> Maybe I am wrong.
>
>
>
>
> At 2017-03-29 02:21:58, "Ravindra Pesala"  wrote:
>
>
> Hi Jarck Ma,
>
> It is great to try optimizing Carbondata.
> I think this solution comes up with many limitations. What if the order by
> column is not the first column? It needs to scan all blocklets to get the
> data out of it if the order by column is not first column of mdk.
>
> We used to have multiple push down optimizations from spark to carbon like
> aggregation, limit, topn etc. But later it was removed because it is very
> hard to maintain for version to version. I feel it is better that execution
> engine like spark can do these type of operations.
>
>
> Regards,
> Ravindra.
>
>
>
> On Tue, Mar 28, 2017, 14:28 马云  wrote:
>
>
> Hi Carbon Dev,
>
> Currently I have done optimization for ordering by 1 dimension.
>
> my local performance test as below. Please give your suggestion.
>
>
>
>
> | data count | test sql | limit value in sql | optimized code (ms) | original code (ms) |
> | 20,000,000 | SELECT name, serialname, country, salary, id, date FROM t3 ORDER BY country limit 1000 | 1000 | 677 | 10906 |
> | 20,000,000 | SELECT name, serialname, country, salary, id, date FROM t3 ORDER BY serialname limit 1 | 1 | 1897 | 12108 |
> | 20,000,000 | SELECT name, serialname, country, salary, id, date FROM t3 ORDER BY serialname limit 5 | 5 | 2814 | 14279 |
>
> my optimization solution for order by 1 dimension + limit is as below:
>
> mainly filter some unnecessary blocklets and leverage the dimension's
> sorted storage to get sorted data in each partition;
>
> at last use TakeOrderedAndProject to merge the sorted data from the partitions
>
> ste

Re: Optimize Order By + Limit Query

2017-03-28 Thread Ravindra Pesala
Hi Jarck Ma,

It is great to try optimizing CarbonData.
I think this solution comes with many limitations. What if the order by
column is not the first column? It needs to scan all blocklets to get the
data out if the order by column is not the first column of the MDK.

We used to have multiple push-down optimizations from Spark to Carbon, like
aggregation, limit, top-N etc. But later they were removed because they are very
hard to maintain from version to version. I feel it is better that an execution
engine like Spark does these types of operations.

Regards,
Ravindra.

On Tue, Mar 28, 2017, 14:28 马云  wrote:

> Hi Carbon Dev,
>
> Currently I have done optimization for ordering by 1 dimension.
>
> my local performance test as below. Please give your suggestion.
>
>
> | data count | test sql | limit value in sql | optimized code (ms) | original code (ms) |
> | 20,000,000 | SELECT name, serialname, country, salary, id, date FROM t3 ORDER BY country limit 1000 | 1000 | 677 | 10906 |
> | 20,000,000 | SELECT name, serialname, country, salary, id, date FROM t3 ORDER BY serialname limit 1 | 1 | 1897 | 12108 |
> | 20,000,000 | SELECT name, serialname, country, salary, id, date FROM t3 ORDER BY serialname limit 5 | 5 | 2814 | 14279 |
>
> My optimization solution for order by 1 dimension + limit is as below.
>
> It mainly filters some unnecessary blocklets and leverages the dimension's
> sorted storage to get sorted data in each partition.
>
> At last, TakeOrderedAndProject is used to merge the sorted data from the partitions.
>
> *step1*. change logical plan and push down the order by and limit
> information to carbon scan
>
> and change sort physical plan to TakeOrderedAndProject  since
> data will be get and sorted in each partition
>
> *step2*. in each partition apply the limit number, blocklet's min_max
> index to filter blocklet.
>
>   it can reduce scan data if some blocklets were filtered
>
>  for example,  SELECT name, serialname, country, salary, id, date
> FROM t3 ORDER BY serialname limit 1
>
>  supposing there are 2 blocklets , each has 32000 data, serial name  is
> between serialname1 to serialname2 in the first blocklet
>
> and between  serialname2 to serialname3 in the second blocklet. Actually
> we only need to scan the first blocklet
>
> since 32000 > 100 and first blocklet's serial name <= second
> blocklet's serial name
>
>
>
> *step3*.  load the order by dimension data to scanResult.  put all
> scanResults to a TreeSet for sorting
>
>   Other columns' data will be lazy-loaded in step4.
>
> *step4.* according to the limit value, use a iterator to get the topN
> sorted data from the TreeSet. In the same time to load other columns data
> if needed.
>
>in this step  it tries to reduce scanning non-sort dimension
>  data.
>
>  for example, SELECT name, serialname, country, salary, id, date
> FROM t3 ORDER BY serialname limit 1
>
>  supposing there are 3 blocklets ,  in the first 2 blocklets, serial name
>  is between serialname1 to serialname100 and each has 2500 serialname1
> and serialname2.
>
> In the third blocklet, serial name
>  is between serialname2 to serialnam100, but no serialname1 in it.
>
> load serial name data for the 3 blocklets and put all to a treeset
> sorting by the min serialname.
>
> apparently use iterator to get the top 1 sorted data, it only need to
> care the first 2 blocklets(5000 serialname1 + 5000 serialname2).
>
> In others words, it  loads serial name data for the 3 blocklets.But only
> "load name, country, salary, id, date"'s data for the first 2 blocklets
>
>
>
> *step5.* TakeOrderedAndProject physical plan will be used to merge sorted
> data from partitions
>
>
>
> the below items also can be optimized in future
>
>
>
> •   *leverage *mdk keys' order feature to optimize the SQL who order by
> prefix dimension columns of MDK
>
> •   use the dimension order feature in blocklet lever and dimensions'
> inverted index to optimize SQL who order by multi-dimensions
>
>
>
>
>
>
>
>
>
>
>
> Jarck Ma
>
>
>
>
>
>
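
The following is a minimal, self-contained Scala sketch of the idea in the quoted steps above: prune blocklets using the min/max metadata of the order-by column together with the limit, then take the top N from the surviving, already sorted blocklet data. The Blocklet class and the pruning rule shown here are illustrative stand-ins, not CarbonData internals.

{code}
// Toy model of the proposal: each "blocklet" stores the order-by column
// already sorted and exposes min/max metadata.
case class Blocklet(sortedKeys: Vector[String]) {
  def min: String = sortedKeys.head
  def max: String = sortedKeys.last
}

object OrderByLimitSketch {
  // Step 2 of the proposal: a blocklet can be skipped when the blocklets that
  // wholly precede it (their max <= its min) already hold at least `limit` rows.
  def prune(blocklets: Seq[Blocklet], limit: Int): Seq[Blocklet] =
    blocklets.filterNot { b =>
      val precedingRows = blocklets
        .filter(other => !(other eq b) && other.max <= b.min)
        .map(_.sortedKeys.size)
        .sum
      precedingRows >= limit
    }

  // Steps 3-4: merge the surviving sorted blocklets and keep only the top N
  // (a full sort here stands in for a real k-way merge over sorted inputs).
  def topN(blocklets: Seq[Blocklet], limit: Int): Seq[String] =
    blocklets.flatMap(_.sortedKeys).sorted.take(limit)

  def main(args: Array[String]): Unit = {
    val b1 = Blocklet(Vector("serialname1", "serialname1", "serialname2"))
    val b2 = Blocklet(Vector("serialname2", "serialname3", "serialname3"))
    val candidates = prune(Seq(b2, b1), limit = 2)   // b2 is skipped
    println(topN(candidates, limit = 2))             // Vector(serialname1, serialname1)
  }
}
{code}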


Re: Re:Re:Re:Re: insert into carbon table failed

2017-03-26 Thread Ravindra Pesala
rkSubmit$.submit(
> SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.
> scala:121)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> At 2017-03-27 00:42:28, "a"  wrote:
>
>
>
>  Container log: error executor.CoarseGrainedExecutorBackend: RECEIVED
> SIGNAL 15: SIGTERM.
>  spark log: 17/03/26 23:40:30 ERROR YarnScheduler: Lost executor 2 on
> hd25: Container killed by YARN for exceeding memory limits. 49.0 GB of 49
> GB physical memory used. Consider boosting spark.yarn.executor.
> memoryOverhead.
> The test sql
>
>
>
>
>
>
>
> At 2017-03-26 23:34:36, "a"  wrote:
> >
> >
> >I have set the parameters as follow:
> >1、fs.hdfs.impl.disable.cache=true
> >2、dfs.socket.timeout=180  (Exception: Caused by: java.io.IOException:
> Filesystem closed)
> >3、dfs.datanode.socket.write.timeout=360
> >4、set carbondata property enable.unsafe.sort=true
> >5、remove BUCKETCOLUMNS property from the create table sql
> >6、set spark job parameter executor-memory=48G (from 20G to 48G)
> >
> >
> >But it still failed; the error is "executor.CoarseGrainedExecutorBackend:
> RECEIVED SIGNAL 15: SIGTERM."
> >
> >
> >Then I tried to insert 4  records into the carbondata table, and it worked
> successfully.
> >
> >
> >How can I insert 20   records into carbondata?
> >Should I set executor-memory big enough? Or should I generate the csv
> file from the hive table first, then load the csv file into the carbon table?
> >Can anybody give me some help?
> >
> >
> >Regards
> >fish
> >
> >
> >
> >
> >
> >
> >
> >At 2017-03-26 00:34:18, "a"  wrote:
> >>Thank you  Ravindra!
> >>Version:
> >>My carbondata version is 1.0,spark version is 1.6.3,hadoop version is
> 2.7.1,hive version is 1.1.0
> >>one of the containers log:
> >>17/03/25 22:07:09 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED
> SIGNAL 15: SIGTERM
> >>17/03/25 22:07:09 INFO storage.DiskBlockManager: Shutdown hook called
> >>17/03/25 22:07:09 INFO util.ShutdownHookManager: Shutdown hook called
> >>17/03/25 22:07:09 INFO util.ShutdownHookManager: Deleting directory
> /data1/hadoop/hd_space/tmp/nm-local-dir/usercache/storm/
> appcache/application_1490340325187_0042/spark-84b305f9-af7b-4f58-a809-
> 700345a84109
> >>17/03/25 22:07:10 ERROR impl.ParallelReadMergeSorterImpl:
> pool-23-thread-2
> >>java.io.IOException: Error reading file: hdfs://_table_tmp/dt=2017-
> 01-01/pt=ios/06_0
> >>at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(
> RecordReaderImpl.java:1046)
> >>at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$
> OriginalReaderPair.next(OrcRawRecordMerger.java:263)
> >>at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.next(
> OrcRawRecordMerger.java:547)
> >>at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$1.next(
> OrcInputFormat.java:1234)
> >>at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$1.next(
> OrcInputFormat.java:1218)
> >>at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$
> NullKeyRecordReader.next(OrcInputFormat.java:1150)
> >>at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$
> NullKeyRecordReader.next(OrcInputFormat.java:1136)
> >>at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(
> HadoopRDD.scala:249)
> >>at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(
> HadoopRDD.scala:211)
> >>at org.apache.spark.util.NextIterator.hasNext(
> NextIterator.scala:73)
> >>at org.apache.spark.InterruptibleIterator.hasNext(
> InterruptibleIterator.scala:39)
> >>at scala.collection.Iterator$$anon$11.hasNext(Iterator.
> scala:327)
> >>at scala.collection.Iterator$$anon$11.hasNext(Iterator.
> scala:327)
> >>at scala.collection.Iterator$$anon$11.hasNext(Iterator.
> scala:327)
> >>at org.apache.carbondata.spark.rdd.NewRddIterator.hasNext(
> NewCarbonDataLoadRDD.scala:412)
> >>    at org.apache.carbondata.processing.newflow.steps.
> InputProcessorStepImpl$InputProcessorIterator.internalHasNext(
> InputProcessorStepImpl.java:163)
> >>at org.apache.carbondata.processing.newflow.steps.
> InputProcessorStepImpl$InputProcessorIterator.getBatch(
> InputProcessorStepImpl.java:221)
> >>at org.apache.carbondata.processing.newflow.steps.
> InputProcessorStepImpl$InputProcessorIterator.next(
> InputProcessorStepImpl.java:

Re: Re:Re:Re: insert into carbon table failed

2017-03-26 Thread Ravindra Pesala
StepImpl$InputProcessorIterator.getBatch(InputProcessorStepImpl.java:221)
> >>at 
> >> org.apache.carbondata.processing.newflow.steps.InputProcessorStepImpl$InputProcessorIterator.next(InputProcessorStepImpl.java:183)
> >>at 
> >> org.apache.carbondata.processing.newflow.steps.InputProcessorStepImpl$InputProcessorIterator.next(InputProcessorStepImpl.java:117)
> >>at 
> >> org.apache.carbondata.processing.newflow.steps.DataConverterProcessorStepImpl$1.next(DataConverterProcessorStepImpl.java:80)
> >>at 
> >> org.apache.carbondata.processing.newflow.steps.DataConverterProcessorStepImpl$1.next(DataConverterProcessorStepImpl.java:73)
> >>at 
> >> org.apache.carbondata.processing.newflow.sort.impl.ParallelReadMergeSorterImpl$SortIteratorThread.call(ParallelReadMergeSorterImpl.java:196)
> >>at 
> >> org.apache.carbondata.processing.newflow.sort.impl.ParallelReadMergeSorterImpl$SortIteratorThread.call(ParallelReadMergeSorterImpl.java:177)
> >>at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >>at 
> >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >>at 
> >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >>at java.lang.Thread.run(Thread.java:745)
> >>Caused by: java.io.IOException: Filesystem closed
> >>at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:808)
> >>at 
> >> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:868)
> >>at 
> >> org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
> >>at java.io.DataInputStream.readFully(DataInputStream.java:195)
> >>at 
> >> org.apache.hadoop.hive.ql.io.orc.MetadataReader.readStripeFooter(MetadataReader.java:112)
> >>at 
> >> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:228)
> >>at 
> >> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.beginReadStripe(RecordReaderImpl.java:805)
> >>at 
> >> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:776)
> >>at 
> >> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:986)
> >>at 
> >> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1019)
> >>at 
> >> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:1042)
> >>... 26 more
> >>I will try to set enable.unsafe.sort=true and remove BUCKETCOLUMNS property 
> >>,and try again.
> >>
> >>
> >>At 2017-03-25 20:55:03, "Ravindra Pesala"  wrote:
> >>>Hi,
> >>>
> >>>Carbodata launches one job per each node to sort the data at node level and
> >>>avoid shuffling. Internally it uses threads to use parallel load. Please
> >>>use carbon.number.of.cores.while.loading property in carbon.properties file
> >>>and set the number of cores it should use per machine while loading.
> >>>Carbondata sorts the data  at each node level to maintain the Btree for
> >>>each node per segment. It improves the query performance by filtering
> >>>faster if we have Btree at node level instead of each block level.
> >>>
> >>>1.Which version of Carbondata are you using?
> >>>2.There are memory issues in Carbondata-1.0 version and are fixed current
> >>>master.
> >>>3.And you can improve the performance by enabling enable.unsafe.sort=true 
> >>>in
> >>>carbon.properties file. But it is not supported if bucketing of columns are
> >>>enabled. We are planning to support unsafe sort load for bucketing also in
> >>>next version.
> >>>
> >>>Please send the executor log to know about the error you are facing.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>Regards,
> >>>Ravindra
> >>>
> >>>On 25 March 2017 at 16:18, ww...@163.com  wrote:
> >>>
> >>>> Hello!
> >>>>
> >>>> *0、The failure*
> >>>> When i insert into carbon table,i encounter failure。The failure is  as
> >>>> follow:
> >>>> Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, 
> >>>> most
> >>&g

[jira] [Created] (CARBONDATA-822) Add unsafe sort for bucketing feature

2017-03-26 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-822:
--

 Summary: Add unsafe sort for bucketing feature
 Key: CARBONDATA-822
 URL: https://issues.apache.org/jira/browse/CARBONDATA-822
 Project: CarbonData
  Issue Type: Improvement
Reporter: Ravindra Pesala


Currently there is no unsafe sort when bucketing is enabled. To improve the 
bucketing load performance, enable unsafe sort for bucketing as well.





[jira] [Created] (CARBONDATA-821) Remove Kettle related code and flow from carbon.

2017-03-26 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-821:
--

 Summary: Remove Kettle related code and flow from carbon.
 Key: CARBONDATA-821
 URL: https://issues.apache.org/jira/browse/CARBONDATA-821
 Project: CarbonData
  Issue Type: Bug
Reporter: Ravindra Pesala


Remove the Kettle-related code and flow from Carbon. It is becoming difficult for 
developers to handle all bugs and features in both flows.





[DISCUSSION] Initiating Apache CarbonData-1.1.0 incubating Release

2017-03-25 Thread Ravindra Pesala
Hi All,

As planned, we are going to release Apache CarbonData 1.1.0. Please discuss
and vote for it to initiate the 1.1.0 release; I will start to prepare the
release after 3 days of discussion. It will have the following features.

 1. Introduced a new data format called V3 (version 3).

  Improves sequential IO by keeping larger blocklets, so larger chunks of data
are read into memory at once.
  Introduced pages of 32000 rows each for every column inside a
blocklet, and min/max is maintained for each page to improve filter
queries.
  Improved compression/decompression of row pages.
Overall performance is improved by 50% compared to the old format, as per TPC-H
benchmark results.


2. Alter table support in CarbonData. (Only for Spark 2.1)

   Support renaming of an existing table.
   Support adding a new column.
   Support removing an existing column.
   Support up-casting of datatypes (e.g., from smallint to int).


3. Supported Batch Sort to improve data loading performance.

   It makes the sort step non-blocking; it is capable of sorting a whole
batch in memory and converting it to a carbondata file.


4. Improved single-pass load by upgrading to the latest Netty framework and
launching a dictionary client for each load.

5. Supported range filters to combine between filters into one filter to
improve the filter performance.

6. Apart from these features, many bug fixes and improvements are included in this
release.

-- 
Thanks & Regards,
Ravindra


Re: insert into carbon table failed

2017-03-25 Thread Ravindra Pesala
Hi,

CarbonData launches one job per node to sort the data at node level and
avoid shuffling. Internally it uses threads to load in parallel. Please
use the carbon.number.of.cores.while.loading property in the carbon.properties file
to set the number of cores it should use per machine while loading.
CarbonData sorts the data at each node level to maintain the BTree for
each node per segment. This improves the query performance, since filtering is
faster if we have a BTree at node level instead of at each block level.

1. Which version of CarbonData are you using?
2. There are memory issues in the CarbonData 1.0 version which are fixed in the
current master.
3. You can improve the performance by enabling enable.unsafe.sort=true in the
carbon.properties file, but it is not supported if bucketing of columns is
enabled. We are planning to support unsafe sort load for bucketing as well in the
next version.
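
For reference, a small spark-shell style snippet showing how the properties mentioned above can also be set programmatically (the values are illustrative only; this assumes the CarbonProperties utility from the carbon-core module is on the classpath, and the same keys can instead be placed in carbon.properties):

{code}
import org.apache.carbondata.core.util.CarbonProperties

// Illustrative values only -- tune per machine.
CarbonProperties.getInstance()
  .addProperty("carbon.number.of.cores.while.loading", "10")
  .addProperty("enable.unsafe.sort", "true")
{code}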

Please send the executor log to know about the error you are facing.






Regards,
Ravindra

On 25 March 2017 at 16:18, ww...@163.com  wrote:

> Hello!
>
> *0、The failure*
> When I insert into the carbon table, I encounter a failure. The failure is as
> follows:
> Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most
> recent failure: Lost task 0.3 in stage 2.0 (TID 1007, hd26):
> ExecutorLostFailure (executor 1 exited caused by one of the running tasks)
> Reason: Slave lost+details
>
> Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 2.0 (TID 1007, hd26): 
> ExecutorLostFailure (executor 1 exited caused by one of the running tasks) 
> Reason: Slave lost
> Driver stacktrace:
>
> the stage:
>
> *Step:*
> *1、start spark-shell*
> ./bin/spark-shell \
> --master yarn-client \
> --num-executors 5 \  (I tried to set this parameter range from 10 to
> 20,but the second job has only 5 tasks)
> --executor-cores 5 \
> --executor-memory 20G \
> --driver-memory 8G \
> --queue root.default \
> --jars /xxx.jar
>
> //spark-default.conf spark.default.parallelism=320
>
> import org.apache.spark.sql.CarbonContext
> val cc = new CarbonContext(sc, "hdfs:///carbonData/CarbonStore")
>
> *2、create table*
> cc.sql("CREATE TABLE IF NOT EXISTS _table (dt String,pt String,lst
> String,plat String,sty String,is_pay String,is_vip String,is_mpack
> String,scene String,status String,nw String,isc String,area String,spttag
> String,province String,isp String,city String,tv String,hwm String,pip
> String,fo String,sh String,mid String,user_id String,play_pv Int,spt_cnt
> Int,prg_spt_cnt Int) row format delimited fields terminated by '|' STORED
> BY 'carbondata' TBLPROPERTIES ('DICTIONARY_EXCLUDE'='pip,sh,
> mid,fo,user_id','DICTIONARY_INCLUDE'='dt,pt,lst,plat,sty,
> is_pay,is_vip,is_mpack,scene,status,nw,isc,area,spttag,
> province,isp,city,tv,hwm','NO_INVERTED_INDEX'='lst,plat,hwm,
> pip,sh,mid','BUCKETNUMBER'='10','BUCKETCOLUMNS'='fo')")
>
> //notes,set "fo" column BUCKETCOLUMNS is to join another table
> //the column distinct values are as follows:
>
>
> *3、insert into table* (_table_tmp is a hive external orc table, has 20
>   records)
> cc.sql("insert into _table select dt,pt,lst,plat,sty,is_pay,is_
> vip,is_mpack,scene,status,nw,isc,area,spttag,province,isp,
> city,tv,hwm,pip,fo,sh,mid,user_id ,play_pv,spt_cnt,prg_spt_cnt from
> _table_tmp where dt='2017-01-01'")
>
> *4、spark split sql into two jobs,the first finished succeeded, but the
> second failed:*
>
>
> *5、The second job stage:*
>
>
>
> *Question:*
> 1、Why does the second job have only five tasks, but the first job has 994? (
> note: my hadoop cluster has 5 datanodes)
>   I guess this caused the failure
> 2、In the sources,i find DataLoadPartitionCoalescer.class,is it means that
> "one datanode has only one partition ,and then the task is only one on the
> datanode"?
> 3、In the ExampleUtils class,"carbon.table.split.partition.enable" is set
> as follow,but i can not find "carbon.table.split.partition.enable" in
> other parts of the project。
>  I set "carbon.table.split.partition.enable" to true, but the second
> job has only five jobs.How to use this property?
>  ExampleUtils :
> // whether use table split partition
> // true -> use table split partition, support multiple partition
> loading
> // false -> use node split partition, support data load by host
> partition
> 
> CarbonProperties.getInstance().addProperty("carbon.table.split.partition.enable",
> "false")
> 4、Insert into carbon table takes 3 hours ,but eventually failed 。How can
> i speed it.
> 5、in the spark-shell  ,I tried to set this parameter range from 10 to
> 20,but the second job has only 5 tasks
>  the other parameter executor-memory = 20G is enough?
>
> I need your help!Thank you very much!
>
> ww...@163.com
>
> --
> ww...@163.com
>



-- 
Thanks & Regards,
Ravi


[jira] [Created] (CARBONDATA-809) Union with alias is returning wrong result.

2017-03-22 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-809:
--

 Summary: Union with alias is returning wrong result.
 Key: CARBONDATA-809
 URL: https://issues.apache.org/jira/browse/CARBONDATA-809
 Project: CarbonData
  Issue Type: Bug
Reporter: Ravindra Pesala


Union with alias is returning wrong result.

Testcase 
{code}
SELECT t.c1 a FROM (select c1 from  carbon_table1 union all  select c1 from  
carbon_table2) t
{code}

The above query returns the data from only one table and also duplicated.





Re: [PROPOSAL] Update on the Jenkins CarbonData job

2017-03-19 Thread Ravindra Pesala
+1

Regards,
Ravindra.

On 19 March 2017 at 11:14, Liang Chen  wrote:

> +1
> Thanks, JB.
>
> Regards
> Liang
>
> 2017-03-17 22:48 GMT+08:00 Jean-Baptiste Onofré :
>
> > Hi guys,
> >
> > Tomorrow I plan to update our jobs on Apache Jenkins as the following:
> >
> > - carbondata-master-spark-1.5 building master branch with Spark 1.5
> profile
> > - carbondata-master-spark-1.6 building master branch with Spark 1.6
> profile
> > - carbondata-master-spark-2.1 building master branch with Spark 2.1
> profile
> > - carbondata-pr-spark-1.5 building PR with Spark 1.5 profile
> > - carbondata-pr-spark-1.6 building PR with Spark 1.6 profile
> > - carbondata-pr-spark-2.1 building PR with Spark 2.1 profile
> >
> > I will run some builds to identify eventual issues.
> >
> > No objection ?
> >
> > Thanks,
> > Regards
> > JB
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
>
>
> --
> Regards
> Liang
>



-- 
Thanks & Regards,
Ravi


[jira] [Created] (CARBONDATA-793) Count with null values is giving wrong result.

2017-03-18 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-793:
--

 Summary: Count with null values is giving wrong result.
 Key: CARBONDATA-793
 URL: https://issues.apache.org/jira/browse/CARBONDATA-793
 Project: CarbonData
  Issue Type: Bug
Reporter: Ravindra Pesala
Priority: Minor


If the data has null values, then count should not include those values. But it is 
counting them now.





[jira] [Created] (CARBONDATA-791) Exists queries of TPC-DS are failing in carbon

2017-03-17 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-791:
--

 Summary: Exists queries of TPC-DS are failing in carbon
 Key: CARBONDATA-791
 URL: https://issues.apache.org/jira/browse/CARBONDATA-791
 Project: CarbonData
  Issue Type: Bug
Reporter: Ravindra Pesala


Exists queries are failing in carbon.
These are required in TPC-DS test.

Testcase to reproduce.

{code}
val df = sqlContext.sparkContext.parallelize(1 to 1000).map(x => (x+"", 
(x+100)+"")).toDF("c1", "c2")
df.write
  .format("carbondata")
  .mode(SaveMode.Overwrite)
  .option("tableName", "carbon")
  .save()
sql("select * from carbon where c1='200' and exists(select * from carbon)")
{code}

It fails in carbon.
 





[jira] [Created] (CARBONDATA-786) Data mismatch if the data data is loaded across blocklet groups

2017-03-16 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-786:
--

 Summary: Data mismatch if the data data is loaded across blocklet 
groups
 Key: CARBONDATA-786
 URL: https://issues.apache.org/jira/browse/CARBONDATA-786
 Project: CarbonData
  Issue Type: Bug
Reporter: Ravindra Pesala


Data mismatch occurs if the data is loaded across blocklet groups and a filter is 
applied on the second column onwards.

Testcase to reproduce:

{code} 

CarbonProperties.getInstance()
  .addProperty("carbon.blockletgroup.size.in.mb", "16")
  .addProperty("carbon.enable.vector.reader", "true")
  .addProperty("enable.unsafe.sort", "true")

val rdd = sqlContext.sparkContext
  .parallelize(1 to 120, 4)
  .map { x =>
("city" + x % 8, "country" + x % 1103, "planet" + x % 10007, x.toString,
  (x % 16).toShort, x / 2, (x << 1).toLong, x.toDouble / 13, x.toDouble 
/ 11)
  }.map { x =>
  Row(x._1, x._2, x._3, x._4, x._5, x._6, x._7, x._8, x._9)
}

val schema = StructType(
  Seq(
StructField("city", StringType, nullable = false),
StructField("country", StringType, nullable = false),
StructField("planet", StringType, nullable = false),
StructField("id", StringType, nullable = false),
StructField("m1", ShortType, nullable = false),
StructField("m2", IntegerType, nullable = false),
StructField("m3", LongType, nullable = false),
StructField("m4", DoubleType, nullable = false),
StructField("m5", DoubleType, nullable = false)
  )
)

val input = sqlContext.createDataFrame(rdd, schema)
sql(s"drop table if exists testBigData")
input.write
  .format("carbondata")
  .option("tableName", "testBigData")
  .option("tempCSV", "false")
  .option("single_pass", "true")
  .option("dictionary_exclude", "id") // id is high cardinality column
  .mode(SaveMode.Overwrite)
  .save()
sql(s"select city, sum(m1) from testBigData " +
  s"where country='country12' group by city order by city").show()
{code}

The above code is supposed to return the following data, but it does not.
{code}
+-+---+
| city|sum(m1)|
+-+---+
|city0|544|
|city1|680|
|city2|816|
|city3|952|
|city4|   1088|
|city5|   1224|
|city6|   1360|
|city7|   1496|
+-+---+
{code}





Re: 【DISCUSS】add more index for sort columns

2017-03-14 Thread Ravindra Pesala
Hi Bill,
Min/max for measure columns is already added in the V3 format. Filters on measure
columns are being added now, so block and blocklet pruning is done
based on min/max to reduce IO and processing.

As per your suggestion, the column needs to be sorted and multiple
ranges maintained in the metadata. But if the data is sorted, we can do a binary
search and find the data directly, so we may not need to maintain ranges in this
case.
If the data is not sorted, then maintaining more min/max values may give some
benefit. We can take this approach as an alternative to inverted indexes.
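
A small plain-Scala sketch of the two cases above (not CarbonData internals): when the block's column data is sorted, a binary search finds the qualifying rows directly; when it is not sorted, per-range min/max metadata can still skip ranges that cannot match the filter. The sample values loosely follow the score example from the quoted mail below; the ranges are made up.

{code}
object MinMaxVsBinarySearch {
  // Sorted data: index of the first value greater than the threshold.
  def firstGreaterThan(sorted: Array[Int], threshold: Int): Int = {
    var lo = 0
    var hi = sorted.length
    while (lo < hi) {
      val mid = (lo + hi) >>> 1
      if (sorted(mid) <= threshold) lo = mid + 1 else hi = mid
    }
    lo
  }

  // Unsorted data: keep only ranges whose [min, max] can satisfy "value > threshold".
  case class RangeStats(min: Int, max: Int)
  def prune(ranges: Seq[RangeStats], threshold: Int): Seq[RangeStats] =
    ranges.filter(_.max > threshold)

  def main(args: Array[String]): Unit = {
    val sortedScores =
      Array(1, 76, 79, 79, 83, 84, 84, 87, 88, 89, 89, 90, 90, 90, 93, 95, 96, 96, 97, 100)
    println(firstGreaterThan(sortedScores, 90))      // 14: scan only rows 14..19
    val ranges = Seq(RangeStats(1, 84), RangeStats(83, 90), RangeStats(88, 100))
    println(prune(ranges, 90))                       // List(RangeStats(88,100))
  }
}
{code}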

Regards,
Ravindra.

On Tue, Mar 14, 2017, 20:40 bill.zhou  wrote:

> hi all
>
>   Carbon will add a min/max index for sort columns, which is used for better
> filter queries. So can we add more indexes for the sort column to make filtering
> faster?
>
>   This is one idea which I got from another database design.
>For example, there is a student table, and the column "score" in the student
> table will be a sorted column. The score range is from 1 to 100.
> The table is as follows:
>
> id  namescore
> 1   bill001 83
> 2   bill002 84
> 3   bill003 90
> 4   bill004 89
> 5   bill005 93
> 6   bill006 76
> 7   bill007 87
> 8   bill008 90
> 9   bill009 89
> 10  bill010 96
> 11  bill011 96
> 12  bill012 100
> 13  bill013 84
> 14  bill014 90
> 15  bill015 79
> 16  bill016 1
> 17  bill017 97
> 18  bill018 79
> 19  bill019 88
> 20  bill068 95
>
>  After loading the data into Carbon, the score column will be sorted as follows:
> 1   76  79  79  83  84  84  87  88
> 89  89  90  90  90  93  95  96  96  97
> 100
>
> The min/max index is 1/100.
> So queries like the following will read all the block data:
> query1: select sum(score) from student where score > 90
> query2: select sum(score) from student where score > 60 and score < 70.
>
> Following are two suggestions to reduce the block scan.
> Suggestion 1: according to the score range, divide it into multiple smaller ranges,
> for example 4:
> <
> http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n8891/index1.png
> >
> 0: means this block does not have values in the score range
> 1: means this block has values in the score range
> If we add this index, query1 only needs to scan 1/4 of the block data and
> query2 does not need to scan any data; it can directly skip this block.
>
> Suggestion 2: record more min/max values for the score, for example one min/max
> for every 5 rows:
> <
> http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n8891/index2.png
> >
> If we add this index, query1 only needs to scan 1/2 of the block data and
> query2 only needs to scan 1/4 of the block data.
>
> This is the raw idea. Please Jacky, Ravindra and Liang correct it and say whether
> we can add this feature. Thanks.
>
> Regards
> Bill
>
>
>
>
> --
> View this message in context:
> http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-add-more-index-for-sort-columns-tp8891.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>


[jira] [Created] (CARBONDATA-771) Dataloading fails in V3 format for TPC-DS data.

2017-03-14 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-771:
--

 Summary: Dataloading fails in V3 format for TPC-DS data.
 Key: CARBONDATA-771
 URL: https://issues.apache.org/jira/browse/CARBONDATA-771
 Project: CarbonData
  Issue Type: Bug
Reporter: Ravindra Pesala
Priority: Minor


Dataloading fails in V3 format for TPC-DS data.





[jira] [Created] (CARBONDATA-769) Support Codegen in CarbonDictionaryDecoder

2017-03-14 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-769:
--

 Summary: Support Codegen in CarbonDictionaryDecoder
 Key: CARBONDATA-769
 URL: https://issues.apache.org/jira/browse/CARBONDATA-769
 Project: CarbonData
  Issue Type: Improvement
Reporter: Ravindra Pesala
Assignee: Ravindra Pesala


Support codegen in CarbonDictionaryDecoder to leverage the whole-stage codegen 
performance of Spark 2.1.





Re: column auto mapping when loading data from csv file

2017-03-12 Thread Ravindra Pesala
Hi Yinwei,

Even I feel it is a little cumbersome to force the user to add the header to
the CSV file or to the loading script.

But what Manish said is also true. I think we should come up with some new
option in the loading script to accept auto-mapping of DDL columns and CSV
columns. If the user knows that the DDL columns and the CSV file columns are in the
same order, then he may mention it like below:
 LOAD DATA INPATH INTO TABLE OPTIONS('AUTOFILEHEADER'='true')
 When the user mentions this, it can take all the DDL columns as the file header.
Maybe we can have more discussion on this option. Please comment on
it.

Regards,
Ravindra.

On 13 March 2017 at 10:36, manish gupta  wrote:

> Hi Yinwei,
>
> Thanks for this suggestion. From my opinion providing first 2 options
> ensures that user is aware about the data he is going to load and column
> data mapping.
>
> For the 3rd option suggested by you I think it will be something that we
> are taking the decision without intimating the user and we cannot be sure
> that this is exactly how user wanted to load the data. So from my opinion
> we should let user decide this behavior.
>
> Regards
> Manish Gupta
>
> On Mon, Mar 13, 2017 at 7:48 AM, Yinwei Li <251469...@qq.com> wrote:
>
> > Hi all,
> >
> >
> >   when loading data from a csv file to carbondata table, we have 2
> choices
> > to mapping the columns from csv file to carbondata table:
> >
> >
> >   1. add columns' names at the start of the csv file
> >   2. declare the column mapping at the data loading script
> >
> >
> >   shall we add a feature which make an auto mapping in the order of the
> > columns at the csv file and the carbondata table at default, so that
> users
> > don't have to do the above jobs any more under most of the circumstance.
>



-- 
Thanks & Regards,
Ravi


Re: Removing of kettle code from Carbondata

2017-03-12 Thread Ravindra Pesala
Hi David,

Thank you for your suggestion.
All known and major flows are tested, and the new flow is already the default in
the current version.
Please let us know when you finish testing the new flow completely; after
that, we can initiate removing the Kettle flow again.

Regards,
Ravindra.

On 13 March 2017 at 09:42, QiangCai  wrote:

> +1
>
> To avoid redundancy code,  better to do it after testing the new flow
> fully.
>
> Regards
> David QiangCai
>
>
>
> --
> View this message in context: http://apache-carbondata-
> mailing-list-archive.1130556.n5.nabble.com/Removing-of-
> kettle-code-from-Carbondata-tp8649p8724.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>



-- 
Thanks & Regards,
Ravi


Removing of kettle code from Carbondata

2017-03-10 Thread Ravindra Pesala
Hi All,

I guess it is time to remove the Kettle flow from CarbonData loading. Now
there are two flows to load the data, and it becomes difficult to maintain the
code. Bug fixes and feature implementations need to be done in both
places, so it becomes difficult for developers to implement and test.

Please comment and vote on it.

-- 
Thanks & Regards,
Ravindra


Re: CarbonDictionaryDecoder should support codegen

2017-03-10 Thread Ravindra Pesala
Thanks Bill for pointing this out.
Yes, it is a long-pending task and we should do it. I am not sure about
the performance benefit we will get, but we should definitely try. I will try it
out and see.

Regards,
Ravindra.

On 10 March 2017 at 17:18, bill.zhou  wrote:

> hi All
>Now the carbon scan supports codegen, but CarbonDictionaryDecoder
> doesn't support codegen; I think it should.
>For example, today I did one test and the query plan is as shown on the
> left below;
> if CarbonDictionaryDecoder supported codegen, the plan would change to the one on
> the right. I think that will improve the performance.
>
>  n5.nabble.com/file/n8600/3.png>
>
> Please Ravindra,Jacky,Liang correct it.  thank you.
>
> Regards
> Bill
>
>
>
> --
> View this message in context: http://apache-carbondata-
> mailing-list-archive.1130556.n5.nabble.com/CarbonDictionaryDecoder-
> should-support-codegen-tp8600.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>



-- 
Thanks & Regards,
Ravi


[jira] [Created] (CARBONDATA-757) Big decimal optimization in store and processing

2017-03-10 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-757:
--

 Summary:  Big decimal optimization in store and processing
 Key: CARBONDATA-757
 URL: https://issues.apache.org/jira/browse/CARBONDATA-757
 Project: CarbonData
  Issue Type: Improvement
Reporter: Ravindra Pesala
Assignee: Ravindra Pesala


Currently a Decimal is converted to bytes and written to the store using the LV 
(length + value) format. While reading it back, the bytes are read in LV format and 
converted back to a BigDecimal.

We can do the following to improve storage and processing:
1. If the decimal precision is less than 9, then we can fit it in an int (4 bytes).
2. If the decimal precision is less than 18, then we can fit it in a long (8 bytes).
3. If the decimal precision is more than 18, then we can fit it in fixed-length 
bytes (the number of bytes can vary depending on the precision, but it is always a 
fixed length).
In this approach we do not need to store the BigDecimal in LV format; we can store 
it in a fixed format. It reduces the memory.
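
A rough sketch of the encoding choice described above (not the actual CarbonData writer code; the width formula for the > 18 case is illustrative):

{code}
import java.math.BigDecimal

object DecimalStorageSketch {
  // Encode the unscaled value with a fixed width chosen from the column precision
  // (scale and precision themselves live in the schema, so no length prefix is needed).
  def encode(value: BigDecimal, precision: Int): Array[Byte] = {
    val unscaled = value.unscaledValue()
    if (precision <= 9) {
      java.nio.ByteBuffer.allocate(4).putInt(unscaled.intValueExact()).array()    // int, 4 bytes
    } else if (precision <= 18) {
      java.nio.ByteBuffer.allocate(8).putLong(unscaled.longValueExact()).array()  // long, 8 bytes
    } else {
      // fixed width derived from the precision, so every value of the column
      // occupies the same number of bytes
      val width = (precision * math.log(10) / math.log(2) + 1).toInt / 8 + 1
      val twosComplement = unscaled.toByteArray
      val out = new Array[Byte](width)
      // sign-extend into the fixed-width buffer
      java.util.Arrays.fill(out, if (unscaled.signum() < 0) 0xFF.toByte else 0x00.toByte)
      System.arraycopy(twosComplement, 0, out, width - twosComplement.length, twosComplement.length)
      out
    }
  }

  def main(args: Array[String]): Unit = {
    println(encode(new BigDecimal("12345.67"), precision = 7).length)                   // 4
    println(encode(new BigDecimal("123456789012.345"), precision = 15).length)          // 8
    println(encode(new BigDecimal("-1234567890123456789012.3"), precision = 23).length) // 10
  }
}
{code}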





Re: Question related to lazy decoding optimzation

2017-03-08 Thread Ravindra Pesala
Hi Yong Zhang,

Thank you for analyzing CarbonData.
Yes, lazy decoding is only possible if the dictionaries are global.
At the time of loading the data, CarbonData generates global dictionary values.
There are 2 ways to generate global dictionary values:
1. Launch a job to read all the input data, find the distinct values of
each column and assign dictionary values to them. Then start the actual
loading job; it just encodes the data with the already generated dictionary
values and writes it down in carbondata format.
2. Launch a dictionary server/client to generate the global dictionary during the
load job. The load consults the dictionary server to get the global dictionary for
the fields.

Yes, compared to a local dictionary it is a little more expensive, but with this
approach we can have better compression and better performance through lazy
decoding.
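
A toy plain-Scala illustration of why a global dictionary makes lazy decoding safe (this is not Spark or CarbonData code; the dictionary and rows are made up): because every file encodes c3 with the same dictionary, the group-by can run entirely on the surrogate integers, and only the final, much smaller set of group keys needs to be decoded.

{code}
object LazyDecodingSketch {
  def main(args: Array[String]): Unit = {
    // Global dictionary: identical on every node and in every file.
    val dictionary = Map(1 -> "city1", 2 -> "city2", 3 -> "city3")
    // (c3 surrogate, c2) rows as they might arrive from several files/hosts.
    val rows = Seq((1, 10L), (2, 5L), (1, 7L), (3, 2L), (2, 1L))

    // Aggregate on the encoded value -- no string is materialised per row.
    val encodedAgg: Map[Int, Long] =
      rows.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

    // Decode only once per group, after the aggregation is finished.
    val result = encodedAgg.map { case (k, sum) => dictionary(k) -> sum }
    println(result.toSeq.sortBy(_._1))  // List((city1,17), (city2,6), (city3,2))
  }
}
{code}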



Regards,
Ravindra.

On 9 March 2017 at 00:01, Yong Zhang  wrote:

> Hi,
>
>
> I watched one session of "Apache Carbondata" in Spark Submit 2017. The
> video is here: https://www.youtube.com/watch?v=lhsAg2H_GXc.
>
> Apache Carbondata: An Indexed Columnar File Format for Interactive Query
> by Jacky Li/Jihong Ma
>
>
>
>
> Starting from 23:10, the speaker talks about the lazy decoding optimization,
> and the example given in the talk is the following:
>
> "select c3, sum(c2) from t1 group by c3"; he explains that c3 can be
> aggregated directly on the encoded value (maybe an integer, if, let's say, a
> String-typed c3 is encoded as an int). I assume this is in fact done
> within the Spark executor engine, as the speaker described.
>
>
> But I am really not sure that I understand how this is possible, especially in
> Spark. If Carbondata were the storage format for a framework on one box,
> I could imagine that and understand the value it brings. But for a distributed
> execution engine, like Spark, the data will come from multiple hosts. Spark
> has to deserialize the data for grouping/aggregating (c3 in this case).
> Even if Spark delegates this to the underlying storage engine
> somehow, how will Carbondata make sure that all the values are encoded
> the same way globally? Won't it just encode consistently per file? Global encoding
> is just too expensive. But without it, I don't know how this lazy decoding
> can work.
>
>
> I am just starting to research this project, so maybe there is something
> underlying that I don't understand.
>
>
> Thanks
>
>
> Yong
>



-- 
Thanks & Regards,
Ravi


Re: question about dimColumnExecuterInfo.getFilterKeys()

2017-03-08 Thread Ravindra Pesala
Hi,

The filter values which we get from the query will be converted to their respective
surrogates and sorted on the surrogate values before the filter is applied.


Regards,
Ravindra

On 8 March 2017 at 09:55, 马云  wrote:

> Hi  Dev,
>
>
> When doing a filter query, I can see a filtered byte array.
> Does filterValues always have an order based on the dictionary value?
> If not, in which case does it have no order? Thanks.
>
>
>
>  byte[][] filterValues = dimColumnExecuterInfo.getFilterKeys();
>
>
>
>
>


-- 
Thanks & Regards,
Ravi


[jira] [Created] (CARBONDATA-743) Remove the abundant class CarbonFilters.scala

2017-03-02 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-743:
--

 Summary: Remove the abundant class CarbonFilters.scala
 Key: CARBONDATA-743
 URL: https://issues.apache.org/jira/browse/CARBONDATA-743
 Project: CarbonData
  Issue Type: Bug
Reporter: Ravindra Pesala
Priority: Trivial


Remove the redundant class CarbonFilters.scala from the spark2 package.

Right now there are two classes with the name CarbonFilters in carbondata.
1. Delete the CarbonFilters scala file from the spark-common package.
2. Move the CarbonFilters scala file from the spark2 package to the spark-common package.
 





[jira] [Created] (CARBONDATA-742) Add batch sort to improve the loading performance

2017-03-02 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-742:
--

 Summary: Add batch sort to improve the loading performance
 Key: CARBONDATA-742
 URL: https://issues.apache.org/jira/browse/CARBONDATA-742
 Project: CarbonData
  Issue Type: Improvement
Reporter: Ravindra Pesala


Hi,
Current problem:
The sort step is a major issue as it is a blocking step. It needs to receive all the 
data and write the sort temp files to disk; only after that can the data writer step 
start.

Solution: 
Make the sort step non-blocking so that the data writer step does not have to wait.
Process the data in the sort step in batches sized to the in-memory capability of 
the machine. For example, if the machine can allocate 4 GB to process data 
in-memory, then the sort step can sort the data with a batch size of 2 GB and give 
it to the data writer step. By the time the data writer step consumes the data, the 
sort step receives and sorts the next batch. So all steps are continuously 
working and there is absolutely no disk IO in the sort step.

So the data writer step never waits for the sort step; as and 
when the sort step sorts a batch in memory, the data writer can start writing it.
This can significantly improve the performance.

Advantages:
Increases the loading performance as there is no intermediate IO and no 
blocking of the sort step.
There is no extra effort for compaction; the current flow can handle it.

Disadvantages:
The number of driver-side BTrees will increase. So the memory might increase, but it 
can be controlled by the current LRU cache implementation.
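
A toy producer/consumer sketch of the batch-sort idea above (plain Scala threads, not the real data-load pipeline; the batch size and data are made up): the sort step sorts one in-memory batch at a time and hands it to the writer through a small queue, so neither step blocks on the whole load and no sort temp files are written to disk.

{code}
import java.util.concurrent.{ArrayBlockingQueue, Executors}

object BatchSortSketch {
  def main(args: Array[String]): Unit = {
    val batchSize = 4                                   // stands in for the in-memory capacity
    val input = Seq.fill(20)(scala.util.Random.nextInt(100))
    val queue = new ArrayBlockingQueue[Seq[Int]](2)     // small hand-off buffer
    val writerPool = Executors.newSingleThreadExecutor()

    // "Data writer step": consumes sorted batches as soon as they are ready.
    val writerTask = writerPool.submit(new Runnable {
      def run(): Unit = {
        var batch = queue.take()
        while (batch.nonEmpty) {
          println(s"writing sorted batch: $batch")      // real code would write a carbondata file
          batch = queue.take()
        }
      }
    })

    // "Sort step": sorts each batch in memory and hands it over immediately.
    input.grouped(batchSize).foreach(b => queue.put(b.sorted))
    queue.put(Seq.empty)                                // empty batch signals end of load
    writerTask.get()
    writerPool.shutdown()
  }
}
{code}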





[jira] [Created] (CARBONDATA-741) Remove the unnecessary classes from carbondata

2017-03-02 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-741:
--

 Summary: Remove the unnecessary classes from carbondata
 Key: CARBONDATA-741
 URL: https://issues.apache.org/jira/browse/CARBONDATA-741
 Project: CarbonData
  Issue Type: Bug
Reporter: Ravindra Pesala
Priority: Trivial


Please remove the following classes as they are not used now.

VectorChunkRowIterator
CarbonColumnVectorImpl



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (CARBONDATA-740) Add logger for rows processed while closing in AbstractDataLoadProcessorStep

2017-03-02 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-740:
--

 Summary: Add logger for rows processed while closing in 
AbstractDataLoadProcessorStep
 Key: CARBONDATA-740
 URL: https://issues.apache.org/jira/browse/CARBONDATA-740
 Project: CarbonData
  Issue Type: Bug
Reporter: Ravindra Pesala
Priority: Trivial


Add logger for rows processed while closing in AbstractDataLoadProcessorStep.
It is good to print the total number of records processed when the step is closed, 
so please log the number of rows processed in AbstractDataLoadProcessorStep.
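
A hedged sketch of the requested change (the class shape and the slf4j logger
below are assumptions, not the actual AbstractDataLoadProcessorStep code):

import org.slf4j.LoggerFactory

// Hypothetical step base class, for illustration only: the row counter is
// incremented as rows flow through and the total is logged when the step closes.
abstract class ProcessorStepSketch(stepName: String) {
  private val log = LoggerFactory.getLogger(getClass)
  protected var rowCounter: Long = 0L

  def close(): Unit = {
    log.info(s"Total rows processed in step [$stepName]: $rowCounter")
  }
}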



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (CARBONDATA-739) Avoid creating multiple instances of DirectDictionary in DictionaryBasedResultCollector

2017-03-02 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-739:
--

 Summary: Avoid creating multiple instances of DirectDictionary in 
DictionaryBasedResultCollector
 Key: CARBONDATA-739
 URL: https://issues.apache.org/jira/browse/CARBONDATA-739
 Project: CarbonData
  Issue Type: Bug
  Components: core
Reporter: Ravindra Pesala
Priority: Minor


Avoid creating multiple instances of DirectDictionary in 
DictionaryBasedResultCollector.

For every row, a DirectDictionary instance is created inside the 
DictionaryBasedResultCollector.collectData method.

Please create a single instance per column and reuse it.
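
A generic sketch of the suggested fix (the generator and collector classes below
are hypothetical stand-ins, not the real CarbonData classes): build the
per-column generator once and reuse it across rows.

// Hypothetical generator, for illustration only.
class DirectDictionaryGeneratorSketch(columnName: String) {
  def getValueFromSurrogate(surrogate: Int): Long = surrogate.toLong
}

class ResultCollectorSketch(dimensionColumns: Seq[String]) {
  // Built once when the collector is created, then reused for every row.
  private val generators: Map[String, DirectDictionaryGeneratorSketch] =
    dimensionColumns.map(c => c -> new DirectDictionaryGeneratorSketch(c)).toMap

  def collectRow(row: Map[String, Int]): Map[String, Long] =
    row.map { case (column, surrogate) =>
      column -> generators(column).getValueFromSurrogate(surrogate)
    }
}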



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: Improving Non-dictionary storage & performance.

2017-03-02 Thread Ravindra Pesala
Hi Likun,

Yes Likun, we had better keep dictionary as the default until we optimize
no-dictionary columns.
As you mentioned, we can suggest 2-pass for the first load, and subsequent
loads can use single-pass to improve the performance.
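
For reference, a hedged sketch of what that loading pattern could look like from
the user side; the SINGLE_PASS load option name is an assumption here, so check
it against the DDL documentation of the release you use.

// First load: two-pass, so the global dictionary is generated from a full scan.
cc.sql("LOAD DATA INPATH 'hdfs://hacluster/user/hadoop-data/sample1.csv' " +
  "INTO TABLE sample")

// Subsequent loads: single-pass, reusing and extending the existing dictionary.
// 'SINGLE_PASS' is assumed; verify the exact option name for your version.
cc.sql("LOAD DATA INPATH 'hdfs://hacluster/user/hadoop-data/sample2.csv' " +
  "INTO TABLE sample OPTIONS('SINGLE_PASS'='TRUE')")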

Regards,
Ravindra.

On 2 March 2017 at 06:48, Jacky Li  wrote:

> Hi Ravindra & Vishal,
>
> Yes, I think these works need to be done before switching no-dictionary as
> default. So as of now, we should use dictionary as default.
> I think we can suggest user to do loading as:
> 1. First load: use 2-pass mode to load, the first scan should discover the
> cardinality, and check with user specified option. We should define rules
> to pass or fail the validation, and finalize the load option for subsequent
> load.
> 2. Subsequent load: use single-pass mode to load, use the options defined
> by first load
>
> What is your idea?
>
> Regards,
> Jacky
>
> > 在 2017年3月1日,下午11:31,Ravindra Pesala  写道:
> >
> > Hi Vishal,
> >
> > You are right, thats why we can do no-dictionary only for String
> datatype.
> > Please look at my first point. we can always use direct dictionary for
> > possible data types like short, int, long, double & float for
> sort_columns.
> >
> > Regards,
> > Ravindra.
> >
> > On 1 March 2017 at 18:18, Kumar Vishal 
> wrote:
> >
> >> Hi Ravi,
> >> Sorting of data for no dictionary should be based on data type + same
> for
> >> filter . Please add this point.
> >>
> >> -Regards
> >> Kumar Vishal
> >>
> >> On Wed, Mar 1, 2017 at 8:34 PM, Ravindra Pesala 
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> In order to make non-dictionary columns storage and performance more
> >>> efficient, I am suggesting following improvements.
> >>>
> >>> 1. Make always SHORT, INT, BIGINT, DOUBLE & FLOAT always  direct
> >>> dictionary.
> >>>   Right now only date and timestamp are direct dictionary columns. We
> >> can
> >>> make SHORT, INT, BIGINT, DOUBLE & FLOAT Direct dictionary if these
> >> columns
> >>> are included in SORT_COLUMNS
> >>>
> >>> 2. Consider delta/value compression while storing direct dictionary
> >> values.
> >>> Right now it always uses INT datatype to store direct dictionary
> values.
> >> So
> >>> we can consider value/Delta compression to compact the storage.
> >>>
> >>> 3. Use the Separator instead of LV format to store String value in
> >>> no-dictionary format.
> >>> Currently String datatypes for non-dictionary colums are stored as
> >>> LV(length value) format, here we are using Short(2 bytes) as length
> >> always.
> >>> In order to keep storage compact we can use separator (0 byte as
> >> separator)
> >>> it just takes single byte. And while reading we can traverse through
> data
> >>> and get the offsets like we are doing now.
> >>>
> >>> 4. Add Range filters for no-dictionary columns.
> >>> Currently range filters like greater/ less than filters are not
> >> implemented
> >>> for no-dictionary columns. So we should implement them to avoid row
> level
> >>> filter and improve the performance.
> >>>
> >>> Regards,
> >>> Ravindra.
> >>>
> >>
> >
> >
> > --
> > Thanks & Regards,
> > Ravi
>
>
>
>


-- 
Thanks & Regards,
Ravi


Re: [DISCUSS] Graduation to a TLP (Top Level Project)

2017-03-01 Thread Ravindra Pesala
+1

It is exciting to see the CarbonData project going for graduation to a TLP. Our
hard work is going to pay off soon. Thanks JB for taking this initiative.

Regards,
Ravindra.

On 1 March 2017 at 15:50, Jean-Baptiste Onofré  wrote:

> Hi Liang,
>
> We are now good. I will update pull requests and Jira issues count.
>
> I'm updating the maturity doc on the wiki and then I will start the
> discussion on the mailing list.
>
> Regards
> JB
>
>
> On 03/01/2017 04:12 AM, Liang Chen wrote:
>
>> Hi JB
>>
>> Great, thank you drove it.
>> One gentle reminder : The number of pull request and JIRA issue can be
>> updated to the latest.
>>
>> Regards
>> Liang
>>
>> 2017-02-28 21:46 GMT+08:00 Jean-Baptiste Onofré :
>>
>> Hi guys,
>>>
>>> I created a pull request to add a complete release guide:
>>>
>>> https://github.com/apache/incubator-carbondata/pull/617
>>>
>>> I also updated the maturity self-assessment doc:
>>>
>>> https://docs.google.com/document/d/12hifkDCfbyramBba1uRHYjwa
>>> KEcxAyWMxS9iwJ1_etY/edit?usp=sharing
>>>
>>> I would like just a quick update about CARBONDATA-722 to address the last
>>> TODO:
>>>
>>> "TODO. Add release guide, mailing lists, source repositories, release
>>> notes and issue tracker links."
>>>
>>> Once the release guide is merged and CARBONDATA-722 is addressed, I would
>>> like to start the formal discussion about CarbonData graduation.
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>
>>> On 02/20/2017 05:28 PM, Jean-Baptiste Onofré wrote:
>>>
>>> Hi all,

 Regarding all work and progress we made so far in Apache CarbonData, I
 think it's time to start the discussion about graduation as a new TLP
 (Top Level Project) at the Apache Software Foundation.

 Graduation means we are a self-sustaining and self-governing community,
 and ready to be a full participant in the Apache Software Foundation. Of
 course, it doesn't imply that our community growth is complete or that a
 particular level of technical maturity has been reached, rather that we
 are on a solid trajectory in those areas. After graduation, we will
 still periodically report to the ASF Board to ensure continued growth of
 a healthy community.

 Graduation is an important milestone for the project and for the users
 community.

 A way to think about graduation readiness is through the Apache Maturity
 Model [1]. I think we satisfy most of the requirements [2].
 There are some TODOs to address. I will tackle in the coming days
 (release guide, security link, ...).

 Regarding the process, graduation consists of drafting a board
 resolution, which needs to identify the full Project Management
 Committee, and getting it approved by the community, the Incubator, and
 the Board. Within the CarbonData community, most of these discussions
 and votes have to be on the private@ mailing list.

 I would like to summarize here from points arguing in favor of
 graduation:
 * Project's maturity self-assessment [2]
 * 600 pull requests in incubation
 * 5 releases (including RC) performed by two different release manager
 * 65 contributors
 * 4 new committers
 * 713 Jira created, 593 resolved or closed

 Thoughts ? If you agree, I would like to share the maturity
 self-assessment on the website.

 If you want to help me on some TODO tasks, please, ping me by e-mail,
 Skype, hangout or whatever, to sync together.

 Thanks !
 Regards
 JB

 [1] http://community.apache.org/apache-way/apache-project-
 maturity-model.html
 [2]
 https://docs.google.com/document/d/12hifkDCfbyramBba1uRHYjwa
 KEcxAyWMxS9iwJ1_etY/edit?usp=sharing


 --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>>
>>
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>



-- 
Thanks & Regards,
Ravi


Re: Improving Non-dictionary storage & performance.

2017-03-01 Thread Ravindra Pesala
Hi Vishal,

You are right, that's why we can do no-dictionary only for the String datatype.
Please look at my first point: we can always use direct dictionary for the
possible data types like short, int, long, double & float for sort_columns.

Regards,
Ravindra.

On 1 March 2017 at 18:18, Kumar Vishal  wrote:

> Hi Ravi,
> Sorting of data for no dictionary should be based on data type + same for
> filter . Please add this point.
>
> -Regards
> Kumar Vishal
>
> On Wed, Mar 1, 2017 at 8:34 PM, Ravindra Pesala 
> wrote:
>
> > Hi,
> >
> > In order to make non-dictionary columns storage and performance more
> > efficient, I am suggesting following improvements.
> >
> > 1. Make always SHORT, INT, BIGINT, DOUBLE & FLOAT always  direct
> > dictionary.
> >Right now only date and timestamp are direct dictionary columns. We
> can
> > make SHORT, INT, BIGINT, DOUBLE & FLOAT Direct dictionary if these
> columns
> > are included in SORT_COLUMNS
> >
> > 2. Consider delta/value compression while storing direct dictionary
> values.
> > Right now it always uses INT datatype to store direct dictionary values.
> So
> > we can consider value/Delta compression to compact the storage.
> >
> > 3. Use the Separator instead of LV format to store String value in
> > no-dictionary format.
> > Currently String datatypes for non-dictionary colums are stored as
> > LV(length value) format, here we are using Short(2 bytes) as length
> always.
> > In order to keep storage compact we can use separator (0 byte as
> separator)
> > it just takes single byte. And while reading we can traverse through data
> > and get the offsets like we are doing now.
> >
> > 4. Add Range filters for no-dictionary columns.
> > Currently range filters like greater/ less than filters are not
> implemented
> > for no-dictionary columns. So we should implement them to avoid row level
> > filter and improve the performance.
> >
> > Regards,
> > Ravindra.
> >
>



-- 
Thanks & Regards,
Ravi


Re: [DISCUSS] For the dimension default should be no dictionary

2017-03-01 Thread Ravindra Pesala
Hi All,

In order to make no-dictionary columns the default we should improve the
storage and performance of these columns. I have sent another mail to
discuss the improvement points. Please comment on it.

Regards,
Ravindra

On 1 March 2017 at 10:12, Ravindra Pesala  wrote:

> Hi Likun,
>
> It would be same case if we use all non dictionary columns by default, it
> will increase the store size and decrease the performance so it is also
> does not encourage more users if performance is poor.
>
> If we need to make no-dictionary columns as default then we should first
> focus on reducing the store size and improve the filter queries on
> non-dictionary columns.Even memory usage is higher while querying the
> non-dictionary columns.
>
> Regards,
> Ravindra.
>
> On 1 March 2017 at 06:00, Jacky Li  wrote:
>
>> Yes, I agree to your point. The only concern I have is for loading, I
>> have seen many users accidentally put high cardinality column into
>> dictionary column then the loading failed because out of memory or loading
>> very slow. I guess they just do not know to use DICTIONARY_EXCLUDE for
>> these columns, or they do not have a easy way to identify the high card
>> columns. I feel preventing such misusage is important in order to encourage
>> more users to use carbondata.
>>
>> Any suggestion on solving this issue?
>>
>>
>> Regards,
>> Likun
>>
>>
>> > 在 2017年2月28日,下午10:20,Ravindra Pesala  写道:
>> >
>> > Hi Likun,
>> >
>> > You mentioned that if user does not specify dictionary columns then by
>> > default those are chosen as no dictionary columns.
>> > But we have many disadvantages as I mentioned in above mail if you keep
>> no
>> > dictionary as default. We have initially introduced no dictionary
>> columns
>> > to handle high cardinality dimensions, but now making every thing as no
>> > dictionary columns by default looses our unique feature compare to
>> parquet.
>> > Dictionary columns are introduced not only for aggregation queries, it
>> is
>> > for better compression and better filter queries as well. With out
>> > dictionary store size will be increased a lot.
>> >
>> > Regards,
>> > Ravindra.
>> >
>> > On 28 February 2017 at 18:05, Liang Chen 
>> wrote:
>> >
>> >> Hi
>> >>
>> >> A couple of questions:
>> >>
>> >> 1) For SORT_KEY option: only build "MDK index, inverted index, minmax
>> >> index" for these columns which be specified into the option(SORT_KEY)
>> ?
>> >>
>> >> 2) If users don't specify TABLE_DICTIONARY,  then all columns don't
>> make
>> >> dictionary encoding, and all shuffle operations are based on fact
>> value, is
>> >> my understanding right ?
>> >> 
>> >> ---
>> >> If this option is not specified by user, means all columns encoding
>> without
>> >> global dictionary support. Normal shuffle on decoded value will be
>> applied
>> >> when doing group by operation.
>> >>
>> >> 3) After introducing the two options "SORT_KEY and TABLE_DICTIONARY",
>> >> supposed  if "C2" be specified into SORT_KEY, but not be specified into
>> >> TABLE_DICTIONARY, then system how to handle this case ?
>> >> 
>> >> ---
>> >> For example, SORT_COLUMNS=“C1,C2,C3”, means C1,C2,C3 is MDK and
>> encoded as
>> >> Inverted Index and with Minmax Index
>> >>
>> >> Regards
>> >> Liang
>> >>
>> >> 2017-02-28 19:35 GMT+08:00 Jacky Li :
>> >>
>> >>> Yes, first we should simplify the DDL options. I propose following
>> >> options,
>> >>> please check weather it miss some scenario.
>> >>>
>> >>> 1. SORT_COLUMNS, or SORT_KEY
>> >>> This indicates three things:
>> >>> 1) All columns specified in options will be used to construct
>> >>> Multi-Dimensional Key, which will be sorted along this key
>> >>> 2) They will be encoded as Inverted Index and thus again sorted within
>> >>> column chunk in one blocklet
>> >>> 3) Minmax index will also be created for these columns
>> >>>
>> >>

Improving Non-dictionary storage & performance.

2017-03-01 Thread Ravindra Pesala
Hi,

In order to make non-dictionary column storage and performance more
efficient, I am suggesting the following improvements.

1. Always make SHORT, INT, BIGINT, DOUBLE & FLOAT direct dictionary.
   Right now only date and timestamp are direct dictionary columns. We can
make SHORT, INT, BIGINT, DOUBLE & FLOAT direct dictionary if these columns
are included in SORT_COLUMNS.

2. Consider delta/value compression while storing direct dictionary values.
Right now it always uses the INT datatype to store direct dictionary values, so
we can consider value/delta compression to compact the storage.

3. Use a separator instead of the LV format to store String values in
no-dictionary format.
Currently String datatypes for non-dictionary columns are stored in
LV (length-value) format, where we always use a Short (2 bytes) as the length.
To keep the storage compact we can use a separator (a 0 byte), which takes just
a single byte. While reading we can traverse the data and get the offsets as we
do now (a rough sketch of the two layouts follows this list).

4. Add range filters for no-dictionary columns.
Currently range filters such as greater-than/less-than are not implemented
for no-dictionary columns, so we should implement them to avoid row-level
filtering and improve the performance.
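
To make point 3 concrete, here is a rough, illustrative sketch of the two
layouts (not the actual writer code): the LV layout prefixes every value with a
2-byte length, while the separator layout joins the values with a single 0 byte,
assuming the values themselves never contain a 0 byte.

import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets.UTF_8

def lvEncode(values: Seq[String]): Array[Byte] = {
  val out = new ByteArrayOutputStream()
  values.foreach { v =>
    val bytes = v.getBytes(UTF_8)
    out.write((bytes.length >> 8) & 0xFF)   // high byte of the 2-byte (short) length
    out.write(bytes.length & 0xFF)          // low byte
    out.write(bytes)
  }
  out.toByteArray
}

def separatorEncode(values: Seq[String]): Array[Byte] = {
  val out = new ByteArrayOutputStream()
  values.foreach { v =>
    out.write(v.getBytes(UTF_8))
    out.write(0)                            // single 0 byte as the separator
  }
  out.toByteArray
}

// For a 32000-value page the separator layout saves roughly 32000 bytes,
// i.e. one byte per value, compared to the LV layout.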

Regards,
Ravindra.


Re: [DISCUSS] For the dimension default should be no dictionary

2017-02-28 Thread Ravindra Pesala
Hi Likun,

It would be the same case if we used all non-dictionary columns by default: it
would increase the store size and decrease the performance, so it also does not
encourage more users if the performance is poor.

If we want to make no-dictionary columns the default then we should first
focus on reducing the store size and improving the filter queries on
non-dictionary columns. Even memory usage is higher while querying
non-dictionary columns.

Regards,
Ravindra.

On 1 March 2017 at 06:00, Jacky Li  wrote:

> Yes, I agree to your point. The only concern I have is for loading, I have
> seen many users accidentally put high cardinality column into dictionary
> column then the loading failed because out of memory or loading very slow.
> I guess they just do not know to use DICTIONARY_EXCLUDE for these columns,
> or they do not have a easy way to identify the high card columns. I feel
> preventing such misusage is important in order to encourage more users to
> use carbondata.
>
> Any suggestion on solving this issue?
>
>
> Regards,
> Likun
>
>
> > 在 2017年2月28日,下午10:20,Ravindra Pesala  写道:
> >
> > Hi Likun,
> >
> > You mentioned that if user does not specify dictionary columns then by
> > default those are chosen as no dictionary columns.
> > But we have many disadvantages as I mentioned in above mail if you keep
> no
> > dictionary as default. We have initially introduced no dictionary columns
> > to handle high cardinality dimensions, but now making every thing as no
> > dictionary columns by default looses our unique feature compare to
> parquet.
> > Dictionary columns are introduced not only for aggregation queries, it is
> > for better compression and better filter queries as well. With out
> > dictionary store size will be increased a lot.
> >
> > Regards,
> > Ravindra.
> >
> > On 28 February 2017 at 18:05, Liang Chen 
> wrote:
> >
> >> Hi
> >>
> >> A couple of questions:
> >>
> >> 1) For SORT_KEY option: only build "MDK index, inverted index, minmax
> >> index" for these columns which be specified into the option(SORT_KEY)  ?
> >>
> >> 2) If users don't specify TABLE_DICTIONARY,  then all columns don't make
> >> dictionary encoding, and all shuffle operations are based on fact
> value, is
> >> my understanding right ?
> >> 
> >> ---
> >> If this option is not specified by user, means all columns encoding
> without
> >> global dictionary support. Normal shuffle on decoded value will be
> applied
> >> when doing group by operation.
> >>
> >> 3) After introducing the two options "SORT_KEY and TABLE_DICTIONARY",
> >> supposed  if "C2" be specified into SORT_KEY, but not be specified into
> >> TABLE_DICTIONARY, then system how to handle this case ?
> >> 
> >> ---
> >> For example, SORT_COLUMNS=“C1,C2,C3”, means C1,C2,C3 is MDK and encoded
> as
> >> Inverted Index and with Minmax Index
> >>
> >> Regards
> >> Liang
> >>
> >> 2017-02-28 19:35 GMT+08:00 Jacky Li :
> >>
> >>> Yes, first we should simplify the DDL options. I propose following
> >> options,
> >>> please check weather it miss some scenario.
> >>>
> >>> 1. SORT_COLUMNS, or SORT_KEY
> >>> This indicates three things:
> >>> 1) All columns specified in options will be used to construct
> >>> Multi-Dimensional Key, which will be sorted along this key
> >>> 2) They will be encoded as Inverted Index and thus again sorted within
> >>> column chunk in one blocklet
> >>> 3) Minmax index will also be created for these columns
> >>>
> >>> When to use: This option is designed for accelerating filter query, so
> >> put
> >>> all filter columns into this option. The order of it can be:
> >>> 1) From low cardinality to high cardinality, this will make most
> >>> compression
> >>> and fit for scenario that does not have frequent filter on high card
> >> column
> >>> 2) Put high cardinality column first, then put others. This fits for
> >>> frequent filter on high card column
> >>>
> >>> For example, SORT_COLUMNS=“C1,C2,C3”, means C1,C2,C3 is MDK and encoded
> >> as
> >>> Inverted Index and with Minmax Index

Re: [DISCUSS] For the dimension default should be no dictionary

2017-02-28 Thread Ravindra Pesala
Hi Likun,

You mentioned that if the user does not specify dictionary columns then by
default those are chosen as no-dictionary columns.
But we have many disadvantages, as I mentioned in the above mail, if you keep no
dictionary as the default. We initially introduced no-dictionary columns
to handle high-cardinality dimensions, but making everything a no-dictionary
column by default loses our unique feature compared to Parquet.
Dictionary columns were introduced not only for aggregation queries but also
for better compression and better filter queries. Without a dictionary the
store size will increase a lot.

Regards,
Ravindra.

On 28 February 2017 at 18:05, Liang Chen  wrote:

> Hi
>
> A couple of questions:
>
> 1) For SORT_KEY option: only build "MDK index, inverted index, minmax
> index" for these columns which be specified into the option(SORT_KEY)  ?
>
> 2) If users don't specify TABLE_DICTIONARY,  then all columns don't make
> dictionary encoding, and all shuffle operations are based on fact value, is
> my understanding right ?
> 
> ---
> If this option is not specified by user, means all columns encoding without
> global dictionary support. Normal shuffle on decoded value will be applied
> when doing group by operation.
>
> 3) After introducing the two options "SORT_KEY and TABLE_DICTIONARY",
> supposed  if "C2" be specified into SORT_KEY, but not be specified into
> TABLE_DICTIONARY, then system how to handle this case ?
> 
> ---
> For example, SORT_COLUMNS=“C1,C2,C3”, means C1,C2,C3 is MDK and encoded as
> Inverted Index and with Minmax Index
>
> Regards
> Liang
>
> 2017-02-28 19:35 GMT+08:00 Jacky Li :
>
> > Yes, first we should simplify the DDL options. I propose following
> options,
> > please check weather it miss some scenario.
> >
> > 1. SORT_COLUMNS, or SORT_KEY
> > This indicates three things:
> > 1) All columns specified in options will be used to construct
> > Multi-Dimensional Key, which will be sorted along this key
> > 2) They will be encoded as Inverted Index and thus again sorted within
> > column chunk in one blocklet
> > 3) Minmax index will also be created for these columns
> >
> > When to use: This option is designed for accelerating filter query, so
> put
> > all filter columns into this option. The order of it can be:
> > 1) From low cardinality to high cardinality, this will make most
> > compression
> > and fit for scenario that does not have frequent filter on high card
> column
> > 2) Put high cardinality column first, then put others. This fits for
> > frequent filter on high card column
> >
> > For example, SORT_COLUMNS=“C1,C2,C3”, means C1,C2,C3 is MDK and encoded
> as
> > Inverted Index and with Minmax Index
> > Note that while C1,C2,C3 can be dimension but they also can be measure.
> So
> > if user need to filter on measure column, it can be put in SORT_COLUMNS
> > option.
> >
> > If this option is not specified by user, carbon will pick MDK as it is
> now.
> >
> > 2. TABLE_DICTIONARY
> > This is to specify the table level dictionary columns. Will create global
> > dictionary for all columns in this option for every data load.
> >
> > When to use: The option is designed for accelerating aggregate query, so
> > put
> > group by columns into this option
> >
> > For example. TABLE_DICTIONARY=“C2,C3,C5”
> >
> > If this option is not specified by user, means all columns encoding
> without
> > global dictionary support. Normal shuffle on decoded value will be
> applied
> > when doing group by operation.
> >
> > I think these two options should be the basic option for normal user, the
> > goal of them is to satisfy the most scenario without deep tuning of the
> > table
> > For advanced user who want to do deep tuning, we can debate to add more
> > options. But we need to identify what scenario is not satisfied by using
> > these two options first.
> >
> > Regards,
> > Jacky
> >
> >
> >
> > --
> > View this message in context: http://apache-carbondata-
> > mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-
> > dimension-default-should-be-no-dictionary-tp8010p8081.html
> > Sent from the Apache CarbonData Mailing List archive mailing list archive
> > at Nabble.com.
> >
>
>
>
> --
> Regards
> Liang
>



-- 
Thanks & Regards,
Ravi


Re: Block B-tree loading failed

2017-02-28 Thread Ravindra Pesala
Hi,

Have you loaded the data freshly and tried to execute the query, or are you
querying an old store that you loaded earlier?

Regards,
Ravindra.

On 28 February 2017 at 17:20, ericzgy <1987zhangguang...@163.com> wrote:

> Now when I load data into CarbonData table using spark1.6.2 and
> carbondata1.0.0,The problem details are as follows:
>
> WARN  28-02 15:15:33,154 - Lost task 15.0 in stage 5.0 (TID 139, halu062):
> org.apache.carbondata.core.datastore.exception.IndexBuilderException:
> Block
> B-tree loading failed
> at
> org.apache.carbondata.core.datastore.BlockIndexStore.fillLoadedBlocks(
> BlockIndexStore.java:264)
> at
> org.apache.carbondata.core.datastore.BlockIndexStore.
> getAll(BlockIndexStore.java:189)
> at
> org.apache.carbondata.core.scan.executor.impl.AbstractQueryExecutor.
> initQuery(AbstractQueryExecutor.java:130)
> at
> org.apache.carbondata.core.scan.executor.impl.AbstractQueryExecutor.
> getBlockExecutionInfos(AbstractQueryExecutor.java:219)
> at
> org.apache.carbondata.core.scan.executor.impl.DetailQueryExecutor.execute(
> DetailQueryExecutor.java:39)
> at
> org.apache.carbondata.hadoop.CarbonRecordReader.initialize(
> CarbonRecordReader.java:79)
> at
> org.apache.carbondata.spark.rdd.CarbonScanRDD.compute(
> CarbonScanRDD.scala:192)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.
> scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(
> Executor.scala:227)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.concurrent.ExecutionException: java.io.IOException:
> org.apache.thrift.protocol.TProtocolException: don't know what type: 14
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> at
> org.apache.carbondata.core.datastore.BlockIndexStore.fillLoadedBlocks(
> BlockIndexStore.java:254)
> ... 35 more
> Caused by: java.io.IOException:
> org.apache.thrift.protocol.TProtocolException: don't know what type: 14
> at
> org.apache.carbondata.core.reader.ThriftReader.read(ThriftReader.java:108)
> at
> org.apache.carbondata.core.reader.CarbonFooterReader.
> readFooter(CarbonFooterReader.java:54)
> at
> org.apache.carbondata.core.util.DataFileFooterConverter2.
> readDataFileFooter(DataFileFooterConverter2.java:47)
> at
> org.apache.carbondata.core.util.CarbonUtil.readMetadatFile(CarbonUtil.
> java:848)
> at
> org.apache.carbondata.core.datastore.AbstractBlockIndexStoreCache.
> checkAndLoadTableBlocks(AbstractBlockIndexStoreCache.java:98)
> at
> org.apache.carbondata.core.datastore.BlockIndexStore.
> loadBlock(BlockIndexStore.java:304)
> at
> org.apache.carbondata.core.datastore.BlockIndexStore.get(
> BlockIndexStore.java:109)
> at
> org.apache.carbondata.core.datastore.BlockIndexStore$
> BlockLoaderThread.call(BlockIndexStore.java:294)
> a

Re: [DISCUSS] For the dimension default should be no dictionary

2017-02-27 Thread Ravindra Pesala
Hi Bill,

I got your point, but making no-dictionary the default may not be a perfect
solution. Basically no-dictionary columns are only meant for high-cardinality
dimensions, so the usage may change from user to user or scenario to scenario.
This is basically a DDL usability issue, so please focus first on simplifying
the DDL usability.

For example, if we have 6 columns, we could express the DDL as below.
case 1:
SORT_COLUMNS="C1,C2,C3"
NON_SORT_COLUMNS="C4,C5,C6"
In the above case C1, C2, C3 are sort columns and part of the MDK key, and
C4, C5, C6 become non-sort columns (measure/complex).

DICTIONARY_EXCLUDE='ALL'
DICTIONARY_INCLUDE='C3'
In the above case all sort columns (C1,C2,C3) are no-dictionary columns
except C3, which is a dictionary column.

case 2:
SORT_COLUMNS="ALL"
NON_SORT_COLUMNS="C6"
In this case all columns are sort columns except C6.

DICTIONARY_EXCLUDE='C2'
DICTIONARY_INCLUDE='ALL'
In the above case all sort columns (C1,C2,C3,C4,C5) are dictionary columns
except C2, which is a no-dictionary column.

The above is just my idea of how to simplify the DDL to handle all
scenarios. We can have more discussion on how to simplify the DDL.
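
For reference, a hedged sketch of how "case 1" could be approximated with the
DDL options that exist today (DICTIONARY_EXCLUDE); the SORT_COLUMNS,
NON_SORT_COLUMNS and 'ALL' shorthands above are only proposals at this point,
not existing syntax.

// Illustrative only: with today's defaults string dimensions get a dictionary,
// so excluding c1 and c2 leaves only c3 as a dictionary column among the
// filter dimensions (roughly "case 1" above).
cc.sql("CREATE TABLE IF NOT EXISTS t1 (" +
  "c1 STRING, c2 STRING, c3 STRING, c4 INT, c5 DOUBLE, c6 TIMESTAMP) " +
  "STORED BY 'carbondata' " +
  "TBLPROPERTIES ('DICTIONARY_EXCLUDE'='c1,c2')")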

Regards,
Ravindra.

On 27 February 2017 at 12:38, bill.zhou  wrote:

> Dear Vishal & Ravindra
>
>   Thanks for you reply,  I think I didn't describe it clearly so that you
> don't get full idea.
> 1. dictionary is important feature in CarbonData, for every new customer we
> will introduce this feature to him. So for new customer will know it
> clearly, will set the dictionary column when create table.
> 2. For all customer like bank customer, telecom customer and traffic
> customer have a same scenario is: have more column but only set few column
> as dictionary.
> like telecom customer, 300 column only set 5 column dictionary, other
> dim don't set dictionary.
> like bank customer, 100 column only set about 5 column dictionary,
> other
> dim don't set dictionary.
> *For currently customer actually user scenario, they only set the dim which
> used for filter and group by related column as dictionary*
> 3. mys suggestion is that: dim column default as no dictionary is only for
> the dim which not put into the dictionary_include properties, not for all
> dim column. If customer always used 5 columns add into dictionary_include
> and others column no dictionary, this will not impact the query
> performance.
>
> So that I suggestion the dim column default set as no dictionary which not
> added in to dictionary_include properties.
>
> Regards
> Bill
>
>
>
> kumarvishal09 wrote
> > Hi,
> > I completely agree with Ravindra's points, more number of no
> > dictionary
> > column will impact the IO reading+writing both as in case of no
> dictionary
> > data size will increase. Late decoding is one of main advantage, no
> > dictionary column aggregation will be slower. Filter query will suffer as
> > in case of dictionary column we are comparing on byte pack value, in case
> > of no dictionary it will be on actual value.
> >
> > -Regards
> > Kumar Vishal
> >
> > On Mon, Feb 27, 2017 at 12:34 AM, Ravindra Pesala <
>
> > ravi.pesala@
>
> > >
> > wrote:
> >
> >> Hi,
> >>
> >> I feel there are more disadvantages than advantages in this approach. In
> >> your current scenario you want to set dictionary only for columns which
> >> are
> >> used as filters, but the usage of dictionary is not only limited for
> >> filters, it can reduce the store size and improve the aggregation
> >> queries.
> >> I think you should set no_inverted_index false on non filtered columns
> to
> >> reduce the store size and improve the performance.
> >>
> >> If we make no dictionary as default then user no need set them in DDL
> but
> >> user needs to set the dictionary columns. If user wants to set more
> >> dictionary columns then the same problem what you mentioned arises again
> >> so
> >> it does not solve the problem. I feel we should give more flexibility in
> >> our DDL to simplify these scenarios and we should have more discussion
> on
> >> it.
> >>
> >> Pros & Cons of your suggestion.
> >> Advantages :
> >> 1. Decoding/Encoding of dictionary could be avoided.
> >>
> >> Disadvantages :
> >> 1. Store size will increase drastically.
> >> 2. IO will increase so query performance will come down.
> >> 3. Aggregation queries performance will suffer.
> >>
> >>
> >>
> &

Re: [DISCUSS] For the dimension default should be no dictionary

2017-02-26 Thread Ravindra Pesala
Hi,

I feel there are more disadvantages than advantages in this approach. In
your current scenario you want to set dictionary only for the columns that are
used as filters, but the use of a dictionary is not limited to filters; it can
also reduce the store size and improve aggregation queries.
I think you should set no_inverted_index to false on non-filtered columns to
reduce the store size and improve the performance.

If we make no-dictionary the default then the user does not need to set those
columns in the DDL, but the user still needs to set the dictionary columns. If
the user wants to set more dictionary columns then the same problem you
mentioned arises again, so it does not solve the problem. I feel we should give
more flexibility in our DDL to simplify these scenarios, and we should have
more discussion on it.

Pros & Cons of your suggestion.
Advantages :
1. Decoding/Encoding of dictionary could be avoided.

Disadvantages :
1. Store size will increase drastically.
2. IO will increase so query performance will come down.
3. Aggregation queries performance will suffer.



Regards,
Ravindra.

On 26 February 2017 at 20:04, bill.zhou  wrote:

> hi All
> Now when create the CarbonData table,if  the dimension don't add into
> the dictionary_exclude properties, the dimension will be consider as
> dictionary default. I think default should be no dictionary.
>
> For example when I do the POC for one customer, it has 300 columns and
> 200 dimensions, but only 5 columns is used for filter, so he only need set
> this 5 columns to dictionary and leave other 195 columns to no dictionary.
> But now he need specify for the 195 columns to dictionary_exclude
> properties
> the will waste time and make the create table command huge, also will
> impact
> the load performance.
>
> So I suggestion dimension default should be no dictionary and this can
> also help customer easy to know the dictionary column which is useful.
>
>
>
> --
> View this message in context: http://apache-carbondata-
> mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-
> dimension-default-should-be-no-dictionary-tp8010.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>



-- 
Thanks & Regards,
Ravi


[jira] [Created] (CARBONDATA-726) Update with V3 format for better IO and processing optimization.

2017-02-22 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-726:
--

 Summary: Update with V3 format for better IO and processing 
optimization.
 Key: CARBONDATA-726
 URL: https://issues.apache.org/jira/browse/CARBONDATA-726
 Project: CarbonData
  Issue Type: Improvement
Reporter: Ravindra Pesala


Problems in the current format:
1. IO read is slower since it needs to do multiple seeks on the file to read the
column blocklets. The current blocklet size is 12, so it needs to read from the
file multiple times to scan the data of a column. Alternatively we could increase
the blocklet size, but then filter queries suffer because they get a big blocklet
to filter.
2. Decompression is slower in the current format. We use an inverted index for
faster filter queries and use NumberCompressor to compress the inverted index
with bit-wise packing, which is slow, so we should avoid the number compressor.
One alternative is to keep the blocklet size within 32000 so that the inverted
index can be written with shorts, but then IO read suffers a lot.

To overcome the above 2 issues we are introducing the new V3 format.
Here each blocklet has multiple pages of size 32000, and the number of pages per
blocklet is configurable. Since we keep each page within the short limit there is
no need to compress the inverted index.
We also maintain the max/min for each page to further prune filter queries.
The blocklet is read with all its pages at once and kept in off-heap memory.
During filtering we first check the max/min range, and only if it is valid do we
decompress the page to filter further.
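
A simplified, hypothetical sketch of the page-level pruning described above
(integer min/max instead of encoded byte ranges, and decompress is a stand-in
for the real decompression), just to show the control flow:

// Illustrative only: one blocklet holds several 32000-row pages, each carrying
// its own min/max; pages whose range cannot contain the filter value are skipped
// without being decompressed.
case class PageSketch(min: Int, max: Int, compressedData: Array[Byte])

def scanBlockletForEquals(pages: Seq[PageSketch], filterValue: Int,
                          decompress: Array[Byte] => Seq[Int]): Seq[Int] = {
  pages.flatMap { page =>
    if (filterValue >= page.min && filterValue <= page.max) {
      decompress(page.compressedData).filter(_ == filterValue) // row-level filter
    } else {
      Seq.empty   // pruned by the page min/max, no decompression needed
    }
  }
}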




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: carbondata performance test under benchmark tpc-ds

2017-02-21 Thread Ravindra Pesala
Hi,

We are working on the TPC-H performance report now and have improved the
performance with the new format. We have already raised PRs (584 and 586) for
this; they are still under review and will be merged soon. Once these
PRs are merged we will start verifying the TPC-DS performance as well.

Regards,
Ravindra.

On 21 February 2017 at 13:48, Yinwei Li <251469...@qq.com> wrote:

> up↑
>
>
> haha~~~
>
>
>
>
> -- Original --
> From:  "ﻬ.贝壳里的海";<251469...@qq.com>;
> Date:  Mon, Feb 20, 2017 09:52 AM
> To:  "dev";
>
> Subject:  carbondata performance test under benchmark tpc-ds
>
>
>
> Hi all,
>
>
>   I've made a simple performance test under benchmark tpc-ds using
> spark2.1.0+carbondata1.0.0, well the result seems unsatisfactory. The
> details are as follows:
>
>
>   About Env:
> Hadoop 2.7.2 + Spark 2.1.0 + CarbonData 1.0.0
> Cluster: 5 nodes, 32G mem per node
>   About TPC-DS:
> Data size: 1G (test data generation script: ./dsdgen -scale 1 -suffix
> '.csv' -dir /data/tpc-ds/data/)
> Max records num of the tables: table name - inventory, record num -
> 11,745,000
>   About Performance Tuning:
> Spark:
>   SPARK_WORKER_MEMORY=4g
>   SPARK_WORKER_INSTANCES=4
> Carbondata:
>   Leaving Default to avoid configuration difference.
>   About Performance Test Result:
> SQL that can execute without modify: 70% (using sql template netezza)
> Max duration: 39.00s
> Min duration: 2.18s
> Average duration: 9.99s
>
>
>   Well, I want to raise a discussion about the following topics:
> 1. Is the hardware of the cluster reasonable? (what's the common
> hardware configuration about a spark/carbondata cluster [per node?])
> 2. Is the result of the performance test resonable & explicable?
> 3. Under interactive query circumstance, Is spark + carbondata an
> acceptable solution?
> 4. Under interactive query circumstance, what's other solution may
> work well.(maybe the average query duration should less then 5s or even
> less)
>
>
>   Thx very much ~
>



-- 
Thanks & Regards,
Ravi


Re: Exception throws when I load data using carbondata-1.0.0

2017-02-21 Thread Ravindra Pesala
Hi,

Please create the carbon context as follows.

val cc = new CarbonContext(sc, storeLocation)

Here storeLocation is hdfs://hacluster/tmp/carbondata/carbon.store in your case.
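
For completeness, a minimal sketch assuming the spark-shell already provides
`sc` and that the CarbonData 1.0 / Spark 1.6 integration jar is on the classpath
(the CarbonContext import path is taken from that integration):

import org.apache.spark.sql.CarbonContext

// Must be the same value as carbon.storelocation in carbon.properties.
val storeLocation = "hdfs://hacluster/tmp/carbondata/carbon.store"
val cc = new CarbonContext(sc, storeLocation)

cc.sql("SHOW TABLES").show()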


Regards,
Ravindra

On 21 February 2017 at 08:30, Ravindra Pesala  wrote:

> Hi,
>
> How did you create CarbonContext?
> Can you check whether you have provided same store path in
> carbon.properties and the CarbonContext.
>
> Regards,
> Ravindra.
>
> On 20 February 2017 at 12:26, Xiaoqiao He  wrote:
>
>> Hi Ravindra,
>>
>> Thanks for your suggestions. But another problem met when I create table
>> and load data.
>>
>> 1. I follow README to compile and build CarbonData actually, via
>> https://github.com/apache/incubator-carbondata/blob/master/
>> build/README.md :
>>
>> > mvn -DskipTests -Pspark-1.6 -Dspark.version=1.6.2 clean package
>>
>>
>> 2. I think the exceptions mentioned above (ClassNotFoundException/'exists
>> and does not match'), is related to configuration item of
>> 'spark.executor.extraClassPath'. Since when i trace executor logs, i
>> found
>> it tries to load Class from the same path as spark.executor.extraClassPath
>> config and it can not found local (this local path is valid only for
>> driver), and throw exception. When I remove this item in configuration and
>> run the same command with --jar parameter, then not throw this exception
>> again.
>>
>> 3. but when i create table following quick-start as below:
>>
>> > scala> cc.sql("CREATE TABLE IF NOT EXISTS sample (id string, name
>> string,
>> > city string, age Int) STORED BY 'carbondata'")
>>
>>
>> there is some info logs such as:
>>
>> > INFO  20-02 12:00:35,690 - main Query [CREATE TABLE TEST.SAMPLE USING
>> > CARBONDATA OPTIONS (TABLENAME "TEST.SAMPLE", TABLEPATH
>> > "/HOME/PATH/HEXIAOQIAO/CARBON.STORE/TEST/SAMPLE") ]
>>
>> and* TABLEPATH looks not the proper path (I have no idea why this path is
>> not HDFS path)*, and then load data as blow but another exception throws.
>>
>> > scala> cc.sql("LOAD DATA INPATH
>> > 'hdfs://hacluster/user/hadoop-data/sample.csv' INTO TABLE sample")
>>
>>
>> there is some info logs such as:
>>
>> > INFO  20-02 12:01:27,608 - main HDFS lock
>> > path:hdfs://hacluster/home/path/hexiaoqiao/carbon.store/test
>> /sample/meta.lock
>>
>> *this lock path is not the expected hdfs path, it looks [hdfs
>> scheme://authority] + local setup path of carbondata. (is storelocation
>> not
>> active?)*
>> and throw exception:
>>
>> > INFO  20-02 12:01:42,668 - Table MetaData Unlocked Successfully after
>> data
>> > load
>> > java.lang.RuntimeException: Table is locked for updation. Please try
>> after
>> > some time
>> > at scala.sys.package$.error(package.scala:27)
>> > at
>> > org.apache.spark.sql.execution.command.LoadTable.run(
>> carbonTableSchema.scala:360)
>> > at
>> > org.apache.spark.sql.execution.ExecutedCommand.sideEffectRes
>> ult$lzycompute(commands.scala:58)
>> > at
>> > org.apache.spark.sql.execution.ExecutedCommand.sideEffectRes
>> ult(commands.scala:56)
>> > at
>> > org.apache.spark.sql.execution.ExecutedCommand.doExecute(
>> commands.scala:70)
>>
>>  ..
>>
>>
>> CarbonData Configuration:
>> carbon.storelocation=hdfs://hacluster/tmp/carbondata/carbon.store
>> carbon.lock.type=HDFSLOCK
>> FYI.
>>
>> Regards,
>> Hexiaoqiao
>>
>>
>> On Sat, Feb 18, 2017 at 3:26 PM, Ravindra Pesala 
>> wrote:
>>
>> > Hi Xiaoqiao,
>> >
>> > Is the problem still exists?
>> > Can you try with clean build  with  "mvn clean -DskipTests -Pspark-1.6
>> > package" command.
>> >
>> > Regards,
>> > Ravindra.
>> >
>> > On 16 February 2017 at 08:36, Xiaoqiao He  wrote:
>> >
>> > > hi Liang Chen,
>> > >
>> > > Thank for your help. It is true that i install and configure
>> carbondata
>> > on
>> > > "spark on yarn" cluster following installation guide (
>> > > https://github.com/apache/incubator-carbondata/blob/
>> > > master/docs/installation-guide.md#installing-and-
>> > > configuring-carbondata-on-spark-on-yarn-cluster
>> > > ).
>> > >
>> > > Best Regards,
>> > > Heixaoq

Re: [ANNOUNCE] Hexiaoqiao as new Apache CarbonData committer

2017-02-20 Thread Ravindra Pesala
Congratulations Hexiaoqiao.

Regards,
Ravindra.

On 21 February 2017 at 10:15, Xiaoqiao He  wrote:

> Hi PPMC, Liang,
>
> It is my honor that receive the invitation, and very happy to have chance
> that participate to build CarbonData community also. I will keep
> contributing to Apache CarbonData and continue to promoting the practical
> application on CarbonData.
>
> Thank you again and hope CarbonData have a better development in the
> future.
>
> Best Regards.
> Hexiaoqiao
>
>
> On Tue, Feb 21, 2017 at 9:26 AM, Liang Chen 
> wrote:
>
> > Hi all
> >
> > We are pleased to announce that the PPMC has invited Hexiaoqiao as new
> > Apache CarbonData committer, and the invite has been accepted !
> >
> > Congrats to Hexiaoqiao and welcome aboard.
> >
> > Regards
> > Liang
> >
>



-- 
Thanks & Regards,
Ravi


[jira] [Created] (CARBONDATA-715) Optimize Single pass data load

2017-02-20 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-715:
--

 Summary: Optimize Single pass data load
 Key: CARBONDATA-715
 URL: https://issues.apache.org/jira/browse/CARBONDATA-715
 Project: CarbonData
  Issue Type: Improvement
Reporter: Ravindra Pesala


1. Upgrade to the latest netty-4.1.8.
2. Optimize the serialization of the key passed over the network.
3. Launch an individual dictionary client for each loading thread.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: Exception throws when I load data using carbondata-1.0.0

2017-02-20 Thread Ravindra Pesala
Hi,

How did you create CarbonContext?
Can you check whether you have provided the same store path in
carbon.properties and in the CarbonContext?

Regards,
Ravindra.

On 20 February 2017 at 12:26, Xiaoqiao He  wrote:

> Hi Ravindra,
>
> Thanks for your suggestions. But another problem met when I create table
> and load data.
>
> 1. I follow README to compile and build CarbonData actually, via
> https://github.com/apache/incubator-carbondata/blob/master/build/README.md
> :
>
> > mvn -DskipTests -Pspark-1.6 -Dspark.version=1.6.2 clean package
>
>
> 2. I think the exceptions mentioned above (ClassNotFoundException/'exists
> and does not match'), is related to configuration item of
> 'spark.executor.extraClassPath'. Since when i trace executor logs, i found
> it tries to load Class from the same path as spark.executor.extraClassPath
> config and it can not found local (this local path is valid only for
> driver), and throw exception. When I remove this item in configuration and
> run the same command with --jar parameter, then not throw this exception
> again.
>
> 3. but when i create table following quick-start as below:
>
> > scala> cc.sql("CREATE TABLE IF NOT EXISTS sample (id string, name string,
> > city string, age Int) STORED BY 'carbondata'")
>
>
> there is some info logs such as:
>
> > INFO  20-02 12:00:35,690 - main Query [CREATE TABLE TEST.SAMPLE USING
> > CARBONDATA OPTIONS (TABLENAME "TEST.SAMPLE", TABLEPATH
> > "/HOME/PATH/HEXIAOQIAO/CARBON.STORE/TEST/SAMPLE") ]
>
> and* TABLEPATH looks not the proper path (I have no idea why this path is
> not HDFS path)*, and then load data as blow but another exception throws.
>
> > scala> cc.sql("LOAD DATA INPATH
> > 'hdfs://hacluster/user/hadoop-data/sample.csv' INTO TABLE sample")
>
>
> there is some info logs such as:
>
> > INFO  20-02 12:01:27,608 - main HDFS lock
> > path:hdfs://hacluster/home/path/hexiaoqiao/carbon.store/
> test/sample/meta.lock
>
> *this lock path is not the expected hdfs path, it looks [hdfs
> scheme://authority] + local setup path of carbondata. (is storelocation not
> active?)*
> and throw exception:
>
> > INFO  20-02 12:01:42,668 - Table MetaData Unlocked Successfully after
> data
> > load
> > java.lang.RuntimeException: Table is locked for updation. Please try
> after
> > some time
> > at scala.sys.package$.error(package.scala:27)
> > at
> > org.apache.spark.sql.execution.command.LoadTable.
> run(carbonTableSchema.scala:360)
> > at
> > org.apache.spark.sql.execution.ExecutedCommand.
> sideEffectResult$lzycompute(commands.scala:58)
> > at
> > org.apache.spark.sql.execution.ExecutedCommand.
> sideEffectResult(commands.scala:56)
> > at
> > org.apache.spark.sql.execution.ExecutedCommand.
> doExecute(commands.scala:70)
>
>  ..
>
>
> CarbonData Configuration:
> carbon.storelocation=hdfs://hacluster/tmp/carbondata/carbon.store
> carbon.lock.type=HDFSLOCK
> FYI.
>
> Regards,
> Hexiaoqiao
>
>
> On Sat, Feb 18, 2017 at 3:26 PM, Ravindra Pesala 
> wrote:
>
> > Hi Xiaoqiao,
> >
> > Is the problem still exists?
> > Can you try with clean build  with  "mvn clean -DskipTests -Pspark-1.6
> > package" command.
> >
> > Regards,
> > Ravindra.
> >
> > On 16 February 2017 at 08:36, Xiaoqiao He  wrote:
> >
> > > hi Liang Chen,
> > >
> > > Thank for your help. It is true that i install and configure carbondata
> > on
> > > "spark on yarn" cluster following installation guide (
> > > https://github.com/apache/incubator-carbondata/blob/
> > > master/docs/installation-guide.md#installing-and-
> > > configuring-carbondata-on-spark-on-yarn-cluster
> > > ).
> > >
> > > Best Regards,
> > > Heixaoqiao
> > >
> > >
> > > On Thu, Feb 16, 2017 at 7:47 AM, Liang Chen 
> > > wrote:
> > >
> > > > Hi He xiaoqiao
> > > >
> > > > Quick start is local model spark.
> > > > Your case is yarn cluster , please check :
> > > > https://github.com/apache/incubator-carbondata/blob/
> > > > master/docs/installation-guide.md
> > > >
> > > > Regards
> > > > Liang
> > > >
> > > > 2017-02-15 3:29 GMT-08:00 Xiaoqiao He :
> > > >
> > > > > hi Manish Gupta,
> > > > >
> > > > > Thanks for you focus, actually i try to load data following
> > > > > https

Re: Exception throws when I load data using carbondata-1.0.0

2017-02-17 Thread Ravindra Pesala
Hi Xiaoqiao,

Does the problem still exist?
Can you try a clean build with the "mvn clean -DskipTests -Pspark-1.6
package" command?

Regards,
Ravindra.

On 16 February 2017 at 08:36, Xiaoqiao He  wrote:

> hi Liang Chen,
>
> Thank for your help. It is true that i install and configure carbondata on
> "spark on yarn" cluster following installation guide (
> https://github.com/apache/incubator-carbondata/blob/
> master/docs/installation-guide.md#installing-and-
> configuring-carbondata-on-spark-on-yarn-cluster
> ).
>
> Best Regards,
> Heixaoqiao
>
>
> On Thu, Feb 16, 2017 at 7:47 AM, Liang Chen 
> wrote:
>
> > Hi He xiaoqiao
> >
> > Quick start is local model spark.
> > Your case is yarn cluster , please check :
> > https://github.com/apache/incubator-carbondata/blob/
> > master/docs/installation-guide.md
> >
> > Regards
> > Liang
> >
> > 2017-02-15 3:29 GMT-08:00 Xiaoqiao He :
> >
> > > hi Manish Gupta,
> > >
> > > Thanks for you focus, actually i try to load data following
> > > https://github.com/apache/incubator-carbondata/blob/
> > > master/docs/quick-start-guide.md
> > > for deploying carbondata-1.0.0.
> > >
> > > 1.when i execute carbondata by `bin/spark-shell`, it throws as above.
> > > 2.when i execute carbondata by `bin/spark-shell --jars
> > > carbonlib/carbondata_2.10-1.0.0-incubating-shade-hadoop2.7.1.jar`, it
> > > throws another exception as below:
> > >
> > > org.apache.spark.SparkException: Job aborted due to stage failure:
> Task
> > 0
> > > > in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in
> > stage
> > > > 0.0 (TID 3, [task hostname]): org.apache.spark.SparkException: File
> > > > ./carbondata_2.10-1.0.0-incubating-shade-hadoop2.7.1.jar exists and
> > does
> > > > not match contents of
> > > > http://master:50843/jars/carbondata_2.10-1.0.0-
> > > incubating-shade-hadoop2.7.1.jar
> > >
> > >
> > > I check the assembly jar and CarbonBlockDistinctValuesCombineRDD is
> > > present
> > > actually.
> > >
> > > anyone who meet the same problem?
> > >
> > > Best Regards,
> > > Hexiaoqiao
> > >
> > >
> > > On Wed, Feb 15, 2017 at 12:56 AM, manish gupta <
> > tomanishgupt...@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I think the carbon jar is compiled properly. Can you use any
> decompiler
> > > and
> > > > decompile carbondata-spark-common-1.1.0-incubating-SNAPSHOT.jar
> > present
> > > in
> > > > spark-common module target folder and check whether the required
> class
> > > file
> > > > org.apache.carbondata.spark.rdd.CarbonBlockDistinctValuesCombineRDD
> is
> > > > present or not.
> > > >
> > > > If you are using only the assembly jar then decompile and check in
> > > assembly
> > > > jar.
> > > >
> > > > Regards
> > > > Manish Gupta
> > > >
> > > > On Tue, Feb 14, 2017 at 11:19 AM, Xiaoqiao He 
> > > wrote:
> > > >
> > > > >  hi, dev,
> > > > >
> > > > > The latest release version apache-carbondata-1.0.0-incubating-rc2
> > > which
> > > > > takes Spark-1.6.2 to build throws exception `
> > > > > java.lang.ClassNotFoundException:
> > > > > org.apache.carbondata.spark.rdd.CarbonBlockDistinctValuesCombi
> neRDD`
> > > > when
> > > > > i
> > > > > load data following Quick Start Guide.
> > > > >
> > > > > Env:
> > > > > a. CarbonData-1.0.0-incubating-rc2
> > > > > b. Spark-1.6.2
> > > > > c. Hadoop-2.7.1
> > > > > d. CarbonData on "Spark on YARN" Cluster and run yarn-client mode.
> > > > >
> > > > > any suggestions? Thank you.
> > > > >
> > > > > The exception stack trace as below:
> > > > >
> > > > > 
> > > > > ERROR 14-02 12:21:02,005 - main generate global dictionary failed
> > > > > org.apache.spark.SparkException: Job aborted due to stage failure:
> > > Task
> > > > 0
> > > > > in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in
> > > stage
> > > > > 0.0 (TID 3, nodemanger): java.lang.ClassNotFoundException:
> > > > > org.apache.carbondata.spark.rdd.CarbonBlockDistinctValuesCombi
> neRDD
> > > > >  at
> > > > > org.apache.spark.repl.ExecutorClassLoader.findClass(
> > > > > ExecutorClassLoader.scala:84)
> > > > >
> > > > >  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> > > > >  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> > > > >  at java.lang.Class.forName0(Native Method)
> > > > >  at java.lang.Class.forName(Class.java:274)
> > > > >  at
> > > > > org.apache.spark.serializer.JavaDeserializationStream$$
> > > > > anon$1.resolveClass(JavaSerializer.scala:68)
> > > > >
> > > > >  at
> > > > > java.io.ObjectInputStream.readNonProxyDesc(
> > > ObjectInputStream.java:1612)
> > > > >  at
> > > > > java.io.ObjectInputStream.readClassDesc(
> ObjectInputStream.java:1517)
> > > > >  at
> > > > > java.io.ObjectInputStream.readOrdinaryObject(
> > > > ObjectInputStream.java:1771)
> > > > >  at java.io.ObjectInputStream.readObject0(ObjectInputStream.
> > > > java:1350)
> > > > >  at
> > > > > java.io.ObjectInputStream.defaultReadFields(
> > > ObjectInputStream.java:1990)
> > > > > 

Re: question about the order between original values and its encoded values

2017-02-16 Thread Ravindra Pesala
Hi,
 Yes, it works because we sort the column values before assigning dictionary
values to them. So it works only if you have loaded the data once (that is,
there is no incremental load). If you do an incremental load and more
dictionary values are added to the store, then there is no guarantee that you
get a sorted result on the encoded data.
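
A small, purely illustrative example of why the guarantee breaks with
incremental loads (the surrogate values below are made up):

// Load 1 sorts the distinct values before assigning surrogates,
// so surrogate order matches value order.
val afterLoad1 = Map("apple" -> 1, "mango" -> 2, "zebra" -> 3)

// Load 2 brings a new value; it simply gets the next free surrogate.
val afterLoad2 = afterLoad1 + ("banana" -> 4)

// Sorting by surrogate now gives apple, mango, zebra, banana,
// which is no longer the dictionary (lexicographic) order of the values.
val bySurrogate = afterLoad2.toSeq.sortBy(_._2).map(_._1)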

Regards,
Ravindra.

On 16 February 2017 at 15:46, Ma Yun 马云  wrote:

> Hi dev team,
>
> One question about the dictionary encode,
> As you know, the original values of a dimension column will be encoded as
> integer and stored to carbon file ordered by the encoded values.
> I have done some test of order by dimension query in my local machine. I
> changed some code to use the encoded values to sort first, then decode to
> original values.
> The query results are correct. It seems the encoded values has the same
> order of the original values.
> But in the carbondata it always decode to original value first, then
> order by the  original values.
>
> Could you help to tell me which scenarios has the different order between
> the original values and the encoded values?
> BTW is there any document to explain the dictionary encode algorithm?
>
> Thanks
>
> Ma, yun
>



-- 
Thanks & Regards,
Ravi


Re: whether carbondata can be used in hive on spark?

2017-02-16 Thread Ravindra Pesala
Hi,

We have so far integrated only with Spark, not yet with Hive. So
carbondata cannot be used in Hive on Spark at this moment.

Regards,
Ravindra.

On 16 February 2017 at 14:35, wangzheng <18031...@qq.com> wrote:

> we use cdh5.7, it remove the thriftserver of spark, so sparksql is not
> suitable for us.
>
>
>
> --
> View this message in context: http://apache-carbondata-
> mailing-list-archive.1130556.n5.nabble.com/whether-
> carbondata-can-be-used-in-hive-on-spark-tp7661.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>



-- 
Thanks & Regards,
Ravi


Re: Reply: data lost when loading data from csv file to carbon table

2017-02-16 Thread Ravindra Pesala
Hi QiangCai,

The PR594 fix does not solve the data loss issue; it fixes the data mismatch in
some cases.

Regards,
Ravindra.

On 16 February 2017 at 09:35, QiangCai  wrote:

> Maybe you can check PR594, it will fix a bug which will impact the result
> of
> loading.
>
>
>
> --
> View this message in context: http://apache-carbondata-
> mailing-list-archive.1130556.n5.nabble.com/data-lost-when-
> loading-data-from-csv-file-to-carbon-table-tp7554p7639.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>



-- 
Thanks & Regards,
Ravi


Re: data lost when loading data from csv file to carbon table

2017-02-15 Thread Ravindra Pesala
2),
> sr_return_ship_cost decimal(7,2), sr_refunded_cash decimal(7,2),
> sr_reversed_charge decimal(7,2), sr_store_credit decimal(7,2), sr_net_loss
> decimal(7,2)) STORED BY 'carbondata' TBLPROPERTIES
> ('DICTIONARY_INCLUDE'='sr_returned_date_sk, sr_return_time_sk,
> sr_item_sk, sr_customer_sk, sr_cdemo_sk, sr_hdemo_sk, sr_addr_sk,
> sr_store_sk, sr_reason_sk, sr_ticket_number')");
>
>
>
>
> carbon.sql("create table if not exists _1g.web_sales(ws_sold_date_sk
> integer, ws_sold_time_sk integer, ws_ship_date_sk integer, ws_item_sk
> integer, ws_bill_customer_sk integer, ws_bill_cdemo_sk integer,
> ws_bill_hdemo_sk integer, ws_bill_addr_sk integer, ws_ship_customer_sk
> integer, ws_ship_cdemo_sk integer, ws_ship_hdemo_sk integer,
> ws_ship_addr_sk integer, ws_web_page_sk integer, ws_web_site_sk integer,
> ws_ship_mode_sk integer, ws_warehouse_sk integer, ws_promo_sk integer,
> ws_order_number integer, ws_quantity integer, ws_wholesale_cost
> decimal(7,2), ws_list_price decimal(7,2), ws_sales_price decimal(7,2),
> ws_ext_discount_amt decimal(7,2), ws_ext_sales_price decimal(7,2),
> ws_ext_wholesale_cost decimal(7,2), ws_ext_list_price decimal(7,2),
> ws_ext_tax decimal(7,2), ws_coupon_amt decimal(7,2), ws_ext_ship_cost
> decimal(7,2), ws_net_paid decimal(7,2), ws_net_paid_inc_tax decimal(7,2),
> ws_net_paid_inc_ship decimal(7,2), ws_net_paid_inc_ship_tax decimal(7,2),
> ws_net_profit decimal(7,2)) STORED BY 'carbondata' TBLPROPERTIES
> ('DICTIONARY_INCLUDE'='ws_sold_date_sk, ws_sold_time_sk, ws_ship_date_sk,
> ws_item_sk, ws_bill_customer_sk, ws_bill_cdemo_sk, ws_bill_hdemo_sk,
> ws_bill_addr_sk, ws_ship_customer_sk, ws_ship_cdemo_sk, ws_ship_hdemo_sk,
> ws_ship_addr_sk, ws_web_page_sk, ws_web_site_sk, ws_ship_mode_sk,
> ws_warehouse_sk, ws_promo_sk, ws_order_number')");
>
>
>
> and here is my script for generate tpc-ds data:
> [hadoop@master tools]$ ./dsdgen -scale 1 -suffix '.csv' -dir
> /data/tpc-ds/data/
>
>
>
>
>
>
>
>
> -- Original Message --
> From: "Ravindra Pesala";;
> Date: 16 February 2017 (Thursday) 3:15 PM
> To: "dev";
>
> Subject: Re: Reply: data lost when loading data from csv file to carbon table
>
>
>
> Hi Yinwei,
>
> Can you provide create table scripts for both the tables store_returns and
> web_sales.
>
> Regards,
> Ravindra.
>
> On 16 February 2017 at 10:07, Ravindra Pesala 
> wrote:
>
> > Hi Yinwei,
> >
> > Thank you for pointing out the issue, I will check with TPC-DS data and
> > verify the data load with new flow.
> >
> > Regards,
> > Ravindra.
> >
> > On 16 February 2017 at 09:35, QiangCai  wrote:
> >
> >> Maybe you can check PR594, it will fix a bug which will impact the
> result
> >> of
> >> loading.
> >>
> >>
> >>
> >> --
> >> View this message in context: http://apache-carbondata-maili
> >> ng-list-archive.1130556.n5.nabble.com/data-lost-when-load
> >> ing-data-from-csv-file-to-carbon-table-tp7554p7639.html
> >> Sent from the Apache CarbonData Mailing List archive mailing list
> archive
> >> at Nabble.com.
> >>
> >
> >
> >
> > --
> > Thanks & Regards,
> > Ravi
> >
>
>
>
> --
> Thanks & Regards,
> Ravi
>



-- 
Thanks & Regards,
Ravi


Re: Reply: data lost when loading data from csv file to carbon table

2017-02-15 Thread Ravindra Pesala
Hi Yinwei,

Can you provide create table scripts for both the tables store_returns and
web_sales.

Regards,
Ravindra.

On 16 February 2017 at 10:07, Ravindra Pesala  wrote:

> Hi Yinwei,
>
> Thank you for pointing out the issue, I will check with TPC-DS data and
> verify the data load with new flow.
>
> Regards,
> Ravindra.
>
> On 16 February 2017 at 09:35, QiangCai  wrote:
>
>> Maybe you can check PR594, it will fix a bug which will impact the result
>> of
>> loading.
>>
>>
>>
>> --
>> View this message in context: http://apache-carbondata-maili
>> ng-list-archive.1130556.n5.nabble.com/data-lost-when-load
>> ing-data-from-csv-file-to-carbon-table-tp7554p7639.html
>> Sent from the Apache CarbonData Mailing List archive mailing list archive
>> at Nabble.com.
>>
>
>
>
> --
> Thanks & Regards,
> Ravi
>



-- 
Thanks & Regards,
Ravi


Re: Reply: data lost when loading data from csv file to carbon table

2017-02-15 Thread Ravindra Pesala
Hi Yinwei,

Thank you for pointing out the issue, I will check with TPC-DS data and
verify the data load with new flow.

Regards,
Ravindra.

On 16 February 2017 at 09:35, QiangCai  wrote:

> Maybe you can check PR594, it will fix a bug which will impact the result
> of
> loading.
>
>
>
> --
> View this message in context: http://apache-carbondata-
> mailing-list-archive.1130556.n5.nabble.com/data-lost-when-
> loading-data-from-csv-file-to-carbon-table-tp7554p7639.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>



-- 
Thanks & Regards,
Ravi


Re: Introducing V3 format.

2017-02-15 Thread Ravindra Pesala
Hi Liang,

Backward compatibility is already handled in the 1.0.0 version: to read an old
store it uses the V1/V2 format readers. So backward compatibility works even
though we jump to the V3 format.

Regards,
Ravindra.

On 16 February 2017 at 04:18, Liang Chen  wrote:

> Hi Ravi
>
> Thank you bringing the discussion to mailing list, i have one question: how
> to ensure backward-compatible after introducing the new format.
>
> Regards
> Liang
>
> Jean-Baptiste Onofré wrote
> > Agree.
> >
> > +1
> >
> > Regards
> > JB
> >
> > On Feb 15, 2017, 09:09, at 09:09, Kumar Vishal <
>
> > kumarvishal1802@
>
> > > wrote:
> >>+1
> >>This will improve the IO bottleneck. Page level min max will improve
> >>the
> >>block pruning and less number of false positive blocks will improve the
> >>filter query performance. Separating uncompression of data from reader
> >>layer will improve the overall query performance.
> >>
> >>-Regards
> >>Kumar Vishal
> >>
> >>On Wed, Feb 15, 2017 at 7:50 PM, Ravindra Pesala
> >><
>
> > ravi.pesala@
>
> > >
> >>wrote:
> >>
> >>> Please find the thrift file in below location.
> >>> https://drive.google.com/open?id=0B4TWTVbFSTnqZEdDRHRncVItQ242b
> >>> 1NqSTU2b2g4dkhkVDRj
> >>>
> >>> On 15 February 2017 at 17:14, Ravindra Pesala <
>
> > ravi.pesala@
>
> > >
> >>> wrote:
> >>>
> >>> > Problems in current format.
> >>> > 1. IO read is slower since it needs to go for multiple seeks on the
> >>file
> >>> > to read column blocklets. Current size of blocklet is 12, so it
> >>needs
> >>> > to read multiple times from file to scan the data on that column.
> >>> > Alternatively we can increase the blocklet size but it suffers for
> >>filter
> >>> > queries as it gets big blocklet to filter.
> >>> > 2. Decompression is slower in current format, we are using inverted
> >>index
> >>> > for faster filter queries and using NumberCompressor to compress
> >>the
> >>> > inverted index in bit wise packing. It becomes slower so we should
> >>avoid
> >>> > number compressor. One alternative is to keep blocklet size with in
> >>32000
> >>> > so that inverted index can be written with short, but IO read
> >>suffers a
> >>> lot.
> >>> >
> >>> > To overcome from above 2 issues we are introducing new format V3.
> >>> > Here each blocklet has multiple pages with size 32000, number of
> >>pages in
> >>> > blocklet is configurable. Since we keep the page with in short
> >>limit so
> >>> no
> >>> > need compress the inverted index here.
> >>> > And maintain the max/min for each page to further prune the filter
> >>> queries.
> >>> > Read the blocklet with pages at once and keep in offheap memory.
> >>> > During filter first check the max/min range and if it is valid then
> >>go
> >>> for
> >>> > decompressing the page to filter further.
> >>> >
> >>> > Please find the attached V3 format thrift file.
> >>> >
> >>> > --
> >>> > Thanks & Regards,
> >>> > Ravi
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> Thanks & Regards,
> >>> Ravi
> >>>
>
>
>
>
>
> --
> View this message in context: http://apache-carbondata-
> mailing-list-archive.1130556.n5.nabble.com/Introducing-V3-
> format-tp7609p7622.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>



-- 
Thanks & Regards,
Ravi


Re: data lost when loading data from csv file to carbon table

2017-02-15 Thread Ravindra Pesala
Please set 'use_kettle'='false' and try to run the load again.
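
For example, adapting the load command from the quoted message below:

carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true', 'use_kettle'='false')")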

Regards,
Ravindra

On 16 February 2017 at 08:44, Yinwei Li <251469...@qq.com> wrote:

> thx Ravindra.
>
>
> I've run the script as:
>
>
> scala> import org.apache.carbondata.core.util.CarbonProperties
> scala> CarbonProperties.getInstance().addProperty("carbon.
> badRecords.location","hdfs://master:9000/data/carbondata/badrecords/")
> scala> val carbon = SparkSession.builder().config(sc.getConf).
> getOrCreateCarbonSession("hdfs://master:9000/opt/carbonStore")
> scala> carbon.sql(s"load data inpath '$src/web_sales.csv' into table
> _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true',
> 'use_kettle'='true')")
>
>
>
> but it occured an Exception: java.lang.RuntimeException:
> carbon.kettle.home is not set
>
>
> the configuration in my carbon.properties is:
> carbon.kettle.home=/opt/spark-2.1.0/carbonlib/carbonplugins, but it seems
> not work.
>
>
> how can I solve this problem.
>
>
> --
>
>
> Hi Liang Chen,
>
>
> would you add a more detail document about the badRecord shows us how
> to use it, thx~~
>
>
>
>
>
>
>
>
>
>
> -- Original Message --
> From: "Ravindra Pesala";;
> Date: 15 February 2017 (Wednesday) 11:36 AM
> To: "dev";
>
> Subject: Re: data lost when loading data from csv file to carbon table
>
>
>
> Hi,
>
> I guess you are using spark-shell, so better set bad record location to
> CarbonProperties class before creating carbon session like below.
>
> CarbonProperties.getInstance().addProperty("carbon.
> badRecords.location"," record location>").
>
>
> 1. And while loading data you need to enable bad record logging as below.
>
> carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales
> OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true', 'use_kettle
> '='true')").
>
> Please check the bad records which are added to that bad record location.
>
>
> 2. You can alternatively verify by ignoring the bad records by using
> following command
> carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales
> OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true',
> 'bad_records_action'='ignore')").
>
> Regards,
> Ravindra.
>
> On 15 February 2017 at 07:37, Yinwei Li <251469...@qq.com> wrote:
>
> > Hi,
> >
> >
> > I've set the properties as:
> >
> >
> > carbon.badRecords.location=hdfs://localhost:9000/data/
> > carbondata/badrecords
> >
> >
> > and add 'bad_records_action'='force' when loading data as:
> >
> >
> > carbon.sql(s"load data inpath '$src/web_sales.csv' into table
> > _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_action'='force')")
> >
> >
> > but the configurations seems not work as there are no path or file
> > created under the path hdfs://localhost:9000/data/carbondata/badrecords.
> >
> >
> > here are the way I created carbonContext:
> >
> >
> > import org.apache.spark.sql.SparkSession
> > import org.apache.spark.sql.CarbonSession._
> > import org.apache.spark.sql.catalyst.util._
> > val carbon = SparkSession.builder().config(sc.getConf).
> > getOrCreateCarbonSession("hdfs://master:9000/opt/carbonStore")
> >
> >
> >
> >
> > and the following are bad record logs:
> >
> >
> > INFO  15-02 09:43:24,393 - [Executor task launch
> > worker-0][partitionID:_1g_web_sales_d59af854-773c-429c-b7e6-
> 031d602fe2be]
> > Total copy time (ms) to copy file /tmp/1039730591739247/0/_1g/
> > web_sales/Fact/Part0/Segment_0/0/0-0-1487122995007.carbonindex is 65
> > ERROR 15-02 09:43:24,393 - [Executor task launch
> > worker-0][partitionID:_1g_web_sales_d59af854-773c-429c-b7e6-
> 031d602fe2be]
> > Data Load is partially success for table web_sales
> > INFO  15-02 09:43:24,393 - Bad Record Found
> >
> >
> >
> >
> > -- Original Message --
> > From: "Ravindra Pesala";;
> > Date: 14 February 2017 (Tuesday) 10:41 PM
> > To: "dev";
> >
> > Subject: Re: data lost when loading data from csv file to

Re: Introducing V3 format.

2017-02-15 Thread Ravindra Pesala
Please find the thrift file in below location.
https://drive.google.com/open?id=0B4TWTVbFSTnqZEdDRHRncVItQ242b1NqSTU2b2g4dkhkVDRj

On 15 February 2017 at 17:14, Ravindra Pesala  wrote:

> Problems in current format.
> 1. IO read is slower since it needs to go for multiple seeks on the file
> to read column blocklets. Current size of blocklet is 12, so it needs
> to read multiple times from file to scan the data on that column.
> Alternatively we can increase the blocklet size but it suffers for filter
> queries as it gets big blocklet to filter.
> 2. Decompression is slower in current format, we are using inverted index
> for faster filter queries and using NumberCompressor to compress the
> inverted index in bit wise packing. It becomes slower so we should avoid
> number compressor. One alternative is to keep blocklet size with in 32000
> so that inverted index can be written with short, but IO read suffers a lot.
>
> To overcome from above 2 issues we are introducing new format V3.
> Here each blocklet has multiple pages with size 32000, number of pages in
> blocklet is configurable. Since we keep the page with in short limit so no
> need compress the inverted index here.
> And maintain the max/min for each page to further prune the filter queries.
> Read the blocklet with pages at once and keep in offheap memory.
> During filter first check the max/min range and if it is valid then go for
> decompressing the page to filter further.
>
> Please find the attached V3 format thrift file.
>
> --
> Thanks & Regards,
> Ravi
>



-- 
Thanks & Regards,
Ravi


Introducing V3 format.

2017-02-15 Thread Ravindra Pesala
Problems in the current format:
1. IO read is slower since it needs to do multiple seeks on the file to
read column blocklets. The current blocklet size is 12, so it needs to
read multiple times from the file to scan the data of that column.
Alternatively we can increase the blocklet size, but then filter
queries suffer as they get a big blocklet to filter.
2. Decompression is slower in the current format: we use an inverted index
for faster filter queries and use NumberCompressor to compress the
inverted index with bit-wise packing. This becomes slow, so we should avoid
the number compressor. One alternative is to keep the blocklet size within 32000
so that the inverted index can be written with shorts, but then IO read suffers a lot.

To overcome the above 2 issues we are introducing the new format V3.
Here each blocklet has multiple pages of size 32000; the number of pages in a
blocklet is configurable. Since we keep a page within the short limit, there is
no need to compress the inverted index here.
We also maintain the max/min for each page to further prune filter queries.
The reader reads a blocklet with its pages at once and keeps them in offheap memory.
During filtering we first check the max/min range, and only if the page is valid do we
decompress the page to filter further.

Please find the attached V3 format thrift file.
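
A rough sketch (illustrative only; the real structures come from the V3 thrift
definitions, and the names here are hypothetical) of how the page-level min/max
is meant to be used before decompressing a page:

case class PageMeta(min: Array[Byte], max: Array[Byte])

// Unsigned byte-wise comparison, the order used for min/max values.
def compareBytes(a: Array[Byte], b: Array[Byte]): Int = {
  val len = math.min(a.length, b.length)
  var i = 0
  var result = 0
  while (i < len && result == 0) {
    result = (a(i) & 0xFF) - (b(i) & 0xFF)
    i += 1
  }
  if (result != 0) result else a.length - b.length
}

// A page can match an equality filter only if the value lies in [min, max];
// only such pages are decompressed and scanned further.
def pagesToScan(pages: Seq[PageMeta], filterValue: Array[Byte]): Seq[Int] =
  pages.zipWithIndex.collect {
    case (p, i) if compareBytes(filterValue, p.min) >= 0 &&
                   compareBytes(filterValue, p.max) <= 0 => i
  }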

-- 
Thanks & Regards,
Ravi


[jira] [Created] (CARBONDATA-705) Make the partition distribution as configurable and keep spark distribution as default

2017-02-15 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-705:
--

 Summary: Make the partition distribution as configurable and keep 
spark distribution as default
 Key: CARBONDATA-705
 URL: https://issues.apache.org/jira/browse/CARBONDATA-705
 Project: CarbonData
  Issue Type: Bug
Reporter: Ravindra Pesala


Make the partition distribution as configurable and keep spark distribution as 
default.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: data lost when loading data from csv file to carbon table

2017-02-14 Thread Ravindra Pesala
Hi,

I guess you are using spark-shell, so it is better to set the bad record location
on the CarbonProperties class before creating the carbon session, like below.

CarbonProperties.getInstance().addProperty("carbon.badRecords.location", "<bad record location>").


1. And while loading data you need to enable bad record logging as below.

carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales
OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true', 'use_kettle
'='true')").

Please check the bad records which are added to that bad record location.


2. You can alternatively verify by ignoring the bad records by using
following command
carbon.sql(s"load data inpath '$src/web_sales.csv' into table _1g.web_sales
OPTIONS('DELIMITER'='|','bad_records_logger_enable'='true',
'bad_records_action'='ignore')").

Regards,
Ravindra.

On 15 February 2017 at 07:37, Yinwei Li <251469...@qq.com> wrote:

> Hi,
>
>
> I've set the properties as:
>
>
> carbon.badRecords.location=hdfs://localhost:9000/data/
> carbondata/badrecords
>
>
> and add 'bad_records_action'='force' when loading data as:
>
>
> carbon.sql(s"load data inpath '$src/web_sales.csv' into table
> _1g.web_sales OPTIONS('DELIMITER'='|','bad_records_action'='force')")
>
>
> but the configurations seems not work as there are no path or file
> created under the path hdfs://localhost:9000/data/carbondata/badrecords.
>
>
> here are the way I created carbonContext:
>
>
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.CarbonSession._
> import org.apache.spark.sql.catalyst.util._
> val carbon = SparkSession.builder().config(sc.getConf).
> getOrCreateCarbonSession("hdfs://master:9000/opt/carbonStore")
>
>
>
>
> and the following are bad record logs:
>
>
> INFO  15-02 09:43:24,393 - [Executor task launch
> worker-0][partitionID:_1g_web_sales_d59af854-773c-429c-b7e6-031d602fe2be]
> Total copy time (ms) to copy file /tmp/1039730591739247/0/_1g/
> web_sales/Fact/Part0/Segment_0/0/0-0-1487122995007.carbonindex is 65
> ERROR 15-02 09:43:24,393 - [Executor task launch
> worker-0][partitionID:_1g_web_sales_d59af854-773c-429c-b7e6-031d602fe2be]
> Data Load is partially success for table web_sales
> INFO  15-02 09:43:24,393 - Bad Record Found
>
>
>
>
> -- Original Message --
> From: "Ravindra Pesala";;
> Date: 14 February 2017 (Tuesday) 10:41 PM
> To: "dev";
>
> Subject: Re: data lost when loading data from csv file to carbon table
>
>
>
> Hi,
>
> Please set carbon.badRecords.location in carbon.properties and check any
> bad records are added to that location.
>
>
> Regards,
> Ravindra.
>
> On 14 February 2017 at 15:24, Yinwei Li <251469...@qq.com> wrote:
>
> > Hi all,
> >
> >
> >   I met an data lost problem when loading data from csv file to carbon
> > table, here are some details:
> >
> >
> >   Env: Spark 2.1.0 + Hadoop 2.7.2 + CarbonData 1.0.0
> >   Total Records:719,384
> >   Loaded Records:606,305 (SQL: select count(1) from table)
> >
> >
> >   My Attemps:
> >
> >
> > Attemp1: Add option bad_records_action='force' when loading data. It
> > also doesn't work, it's count equals to 606,305;
> > Attemp2: Cut line 1 to 300,000 into a csv file and load, the result
> is
> > right, which equals to 300,000;
> > Attemp3: Cut line 1 to 350,000 into a csv file and load, the result
> is
> > wrong, it equals to 305,631;
> > Attemp4: Cut line 300,000 to 350,000 into a csv file and load, the
> > result is right, it equals to 50,000;
> > Attemp5: Count the separator '|' of my csv file, it equals to lines *
> > columns,  so the source data may in the correct format;
> >
> >
> > In spark log, each attemp logs out : "Bad Record Found".
> >
> >
> > Anyone have any ideas?
>
>
>
>
> --
> Thanks & Regards,
> Ravi
>



-- 
Thanks & Regards,
Ravi


Re: data lost when loading data from csv file to carbon table

2017-02-14 Thread Ravindra Pesala
Hi,

Please set carbon.badRecords.location in carbon.properties and check whether any
bad records are added to that location.


Regards,
Ravindra.

On 14 February 2017 at 15:24, Yinwei Li <251469...@qq.com> wrote:

> Hi all,
>
>
>   I met an data lost problem when loading data from csv file to carbon
> table, here are some details:
>
>
>   Env: Spark 2.1.0 + Hadoop 2.7.2 + CarbonData 1.0.0
>   Total Records:719,384
>   Loaded Records:606,305 (SQL: select count(1) from table)
>
>
>   My Attemps:
>
>
> Attemp1: Add option bad_records_action='force' when loading data. It
> also doesn't work, it's count equals to 606,305;
> Attemp2: Cut line 1 to 300,000 into a csv file and load, the result is
> right, which equals to 300,000;
> Attemp3: Cut line 1 to 350,000 into a csv file and load, the result is
> wrong, it equals to 305,631;
> Attemp4: Cut line 300,000 to 350,000 into a csv file and load, the
> result is right, it equals to 50,000;
> Attemp5: Count the separator '|' of my csv file, it equals to lines *
> columns,  so the source data may in the correct format;
>
>
> In spark log, each attemp logs out : "Bad Record Found".
>
>
> Anyone have any ideas?




-- 
Thanks & Regards,
Ravi


[jira] [Created] (CARBONDATA-702) Create carbondata repository to keep format jar

2017-02-11 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-702:
--

 Summary: Create carbondata repository to keep format jar
 Key: CARBONDATA-702
 URL: https://issues.apache.org/jira/browse/CARBONDATA-702
 Project: CarbonData
  Issue Type: Bug
Reporter: Ravindra Pesala
Priority: Minor


Create a carbondata repository to keep the format jar. At the time of IPMC voting, 
the format jar will be downloaded from this repository, so the IPMC does not need to 
install Thrift to build carbondata.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: Discussion about getting excution duration about a query when using sparkshell+carbondata

2017-02-08 Thread Ravindra Pesala
Hi Libis,

spark-sql CLI is not supported by carbondata.
Why don't you use the carbon thrift server and beeline? It is similar to the
spark-sql CLI and it gives the execution time for each query.

Start carbondata thrift server script.
bin/spark-submit --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer <carbondata assembly jar> <carbon store path>

beeline script
bin/beeline -u jdbc:hive2://localhost:1

Regards,
Ravindra

On 9 February 2017 at 07:55, 范范欣欣  wrote:

> Hi
>
> Now i can use carbondata 1.0.0 with spark-shell(spark 2.1) as:
>
> ./bin/spark-shell --jars 
>
> but it's inconvenient to get the query time , so i try to use
> ./bin/spark-sql --jars  ,but i found some
> errors when create table :
>
> spark-sql> create table if not exists test_table(id string, name string,
> city string, age int) stored by 'carbondata';
> Error in query:
> Operation not allowed:STORED BY(line 1, pos 87)
>
> it seems that the carbondata jar is not load successfully. How can i use
> ./bin/spark-sql?
>
> Regards
>
> Libis
>
>
>
> 2017-02-07 13:16 GMT+08:00 Liang Chen :
>
> > Hi
> >
> > I used the below method in spark shell for DEMO, for your reference:
> >
> > import org.apache.spark.sql.catalyst.util._
> >
> > benchmark { carbondf.filter($"name" === "Allen" and $"gender" === "Male"
> > and $"province" === "NB" and $"singler" === "false").count }
> >
> >
> > Regards
> >
> > Liang
> >
> > 2017-02-06 22:07 GMT-05:00 Yinwei Li <251469...@qq.com>:
> >
> > > Hi all,
> > >
> > >
> > >   When we are using sparkshell + carbondata to send a query, how can we
> > > get the excution duration? Some topics are thrown as follows:
> > >
> > >
> > >   1. One query can produce one or more jobs, and some of the jobs may
> > have
> > > DAG dependence, thus we can't get the excution duration by sum up all
> the
> > > jobs' duration or get the max duration of the jobs roughly.
> > >
> > >
> > >   2. In the spark shell console or spark application web ui, we can get
> > > each job's duration, but we can't get the carbondata-query directly, if
> > > some improvement would take by carbondata in the near future.
> > >
> > >
> > >   3. Maybe we can use the following command to get a approximate
> result:
> > >
> > >
> > > scala > val begin = new Date();cc.sql("$SQL_COMMAND").show;val
> end =
> > > new Date();
> > >
> > >
> > >   Any other opinions?
> >
> >
> >
> >
> > --
> > Regards
> > Liang
> >
>



-- 
Thanks & Regards,
Ravi


Re: query exception: Path is not a file when carbon 1.0.0

2017-02-08 Thread Ravindra Pesala
Hi,

This exception is actually ignored in the class SegmentUpdateStatusManager at line
696, so it does not create any problem. Usually this exception won't be printed
in any server logs as we are ignoring it. Maybe in spark-shell it is printed; we
will look into it.

Regards,
Ravindra.

On 8 February 2017 at 08:43, Li Peng  wrote:

> Hi,
>When use carbon 1.0.0 to query,  the sql is "select count(*) from
> store1.sale"
>I can get the query result , but also get the below error log:
>
>
> Exception while invoking
> ClientNamenodeProtocolTranslatorPB.getBlockLocations over
> dpnode02/192.168.50.2:8020. Not retrying because try once and fail.
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): Path
> is not a file: /carbondata/carbonstore/store1/sale/Metadata
> at
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INo
> deFile.java:75)
> at
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INo
> deFile.java:61)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlock
> LocationsInt(FSNamesystem.java:1860)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlock
> Locations(FSNamesystem.java:1831)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlock
> Locations(FSNamesystem.java:1744)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.get
> BlockLocations(NameNodeRpcServer.java:693)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServ
> erSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolS
> erverSideTranslatorPB.java:373)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocol
> Protos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNam
> enodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcIn
> voker.call(ProtobufRpcEngine.java:640)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGro
> upInformation.java:1724)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)
>
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1552)
> at org.apache.hadoop.ipc.Client.call(Client.java:1496)
> at org.apache.hadoop.ipc.Client.call(Client.java:1396)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(
> ProtobufRpcEngine.java:233)
> at com.sun.proxy.$Proxy31.getBlockLocations(Unknown Source)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTran
> slatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:270)
> at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
> thodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMeth
> od(RetryInvocationHandler.java:278)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(Ret
> ryInvocationHandler.java:194)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(Ret
> ryInvocationHandler.java:176)
> at com.sun.proxy.$Proxy32.getBlockLocations(Unknown Source)
> at
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSCl
> ient.java:1236)
> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.
> java:1223)
> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.
> java:1211)
> at
> org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndG
> etLastBlockLength(DFSInputStream.java:309)
> at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStrea
> m.java:274)
> at org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.
> java:266)
> at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1536)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(
> DistributedFileSystem.java:330)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(
> DistributedFileSystem.java:326)
> at
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSyst
> emLinkResolver.java:81)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.open(Distribute
> dFileSystem.java:326)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:782)
> at
> org.apache.carbondata.core.datastore.impl.FileFactory.getDat
> aInputStream(FileFactory.java:130)
> at
> org.apache.carbondata.core.datastore.impl.FileFactory.getDat
> aInputStream(FileFactory.java:104)
> at
> org.apache.carbondata.core.fileoperations.Atom

Re: Aggregate performace

2017-02-08 Thread Ravindra Pesala
Hi,

The performance depends on the query plan. When you submit a query
like [Select attributeA, count(*) from tableB group by attributeA], in the
case of Spark it asks carbon to give only the attributeA column. So Carbon
reads only the attributeA column from all files and sends the result to Spark to
aggregate the data.

In a test on my laptop with 4 cores, Spark 2 with carbon on a store of
100 million records could get the result in 11 seconds for a query like the above. On
better machines this may be much faster.
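
For example, a minimal spark-shell sketch to time such a query (assuming a
CarbonSession named carbon and a table named tableB with a column attributeA):

// Only attributeA is read from the carbon files for this plan.
val start = System.currentTimeMillis()
carbon.sql("select attributeA, count(*) from tableB group by attributeA").show()
println(s"group by took ${System.currentTimeMillis() - start} ms")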

Regards,
Ravindra.

On 8 February 2017 at 11:49, ffpeng90  wrote:

> Hi,all:
>Recently, I create two tables as ORC and Carbondata.  All of them
> contain
> one hundred million records.
> Then I submit aggregate querys to presto like : [Select  count(*)  from
> tableB where attributeA = 'xxx'],
> carbon performs better than orc.
>
> However,  when i submit querys like: [Select attributeA , count(*)  from
> tableB group by attributeA],  the performace of carbon is bad. Obviously
> this query will result-in a full scan,  so QueryModel need to rebuild all
> records with columns related. This step need a lot of time.
>
> So i want to know is there any optimize techniques for this kind of
> problems
> in spark?
>
>
>
> --
> View this message in context: http://apache-carbondata-
> mailing-list-archive.1130556.n5.nabble.com/Aggregate-
> performace-tp7440.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>



-- 
Thanks & Regards,
Ravi


Re: store location can't be found

2017-02-03 Thread Ravindra Pesala
Hi Mars,

Please try creating the CarbonSession with the store path as follows.

val carbon = SparkSession.builder().config(sc.getConf).
getOrCreateCarbonSession("hdfs://localhost:9000/carbon/store")


Regards,
Ravindra.

On 4 February 2017 at 08:12, Mars Xu  wrote:

> Hello All,
> I met a problem of file not exist. it looks like the store
> location can’t be found. I have already set 
> carbon.store.location=hdfs://localhost:9000/carbon/store
>  in $SPARK_HOME/conf/carbon.properties,
> but when I start up spark-shell by following command and run some commands
> ,the error is coming
> spark-shell --master spark://localhost:7077 --jars
> ~/carbonlib/carbondata_2.11-1.0.0-incubating-shade-hadoop2.7.2.jar --conf
> spark.carbon.storepath=hdfs://localhost:9000/carbon/store
> 
>
> scala> import org.apache.spark.sql.SparkSession
> scala> import org.apache.spark.sql.CarbonSession._
> scala> val carbon = SparkSession.builder().config(sc.getConf).
> getOrCreateCarbonSession()
> scala> carbon.sql("CREATE TABLE IF NOT EXISTS test_table(id string, name
> string, city string, age Int) STORED BY 'carbondata’")
> scala> carbon.sql("load data inpath 
> 'hdfs://localhost:9000/resources/sample.csv'
> into table test_table”)
>
> scala> carbon.sql("select * from test_table").show()
> java.io.FileNotFoundException: File /private/var/carbon.store/
> default/test_table/Fact/Part0/Segment_0 does not exist.
>   at org.apache.hadoop.hdfs.DistributedFileSystem$
> DirListingIterator.(DistributedFileSystem.java:948)
>   at org.apache.hadoop.hdfs.DistributedFileSystem$
> DirListingIterator.(DistributedFileSystem.java:927)
>
> My carbonate version is 1.0 and spark version is spark 2.1.




-- 
Thanks & Regards,
Ravi


[jira] [Created] (CARBONDATA-692) Support scalar subquery in carbon

2017-02-01 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-692:
--

 Summary: Support scalar subquery in carbon
 Key: CARBONDATA-692
 URL: https://issues.apache.org/jira/browse/CARBONDATA-692
 Project: CarbonData
  Issue Type: Bug
  Components: spark-integration
Reporter: Ravindra Pesala


Carbon cannot run scalar sub queries like below

{code}
select sum(salary) from scalarsubquery t1
where ID < (select sum(ID) from scalarsubquery t2 where t1.name = t2.name
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (CARBONDATA-680) Add stats like rows processed in each step. And also fix unsafe sort enable issue.

2017-01-25 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-680:
--

 Summary: Add stats like rows processed in each step. And also fix 
unsafe sort enable issue.
 Key: CARBONDATA-680
 URL: https://issues.apache.org/jira/browse/CARBONDATA-680
 Project: CarbonData
  Issue Type: Bug
Reporter: Ravindra Pesala
Priority: Minor


Currently stats like the number of rows processed in each step are not added in the 
no-kettle flow. Please add the same.
Also, unsafe sort is not enabled even though the user enables it in the 
property file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [VOTE] Apache CarbonData 1.0.0-incubating release (RC2)

2017-01-20 Thread Ravindra Pesala
+1
Done sanity for all major features, it is fine.

Regards,
Ravindra.

On Sat, Jan 21, 2017, 07:51 Liang Chen  wrote:

> +1(binding)
>
> I checked:
> - name contains incubating
> - disclaimer exists
> - signatures and hash correct
> - NOTICE good
> - LICENSE is good
> - Source files have ASF headers
> - No unexpected binary files
> - Can compile from source with "mvn clean -DskipTests -Pbuild-with-format
> -Pspark-1.6 install"
>
> Regards
> Liang
>
> 2017-01-21 9:38 GMT+08:00 Jacky Li :
>
> > Please find the build guide as following:
> >
> > Build guide (need install Apache Thrift 0.9.3, can use command: mvn clean
> > -DskipTests -Pbuild-with-format -Pspark-1.6 install), please find the
> > detail:
> > https://github.com/apache/incubator-carbondata/tree/master/build
> >
> >
> > > 在 2017年1月21日,上午9:36,Jacky Li  写道:
> > >
> > > Hi all,
> > >
> > > Please vote on releasing the following candidate as Apache
> > CarbonData(incubating)
> > > version 1.0.0.
> > >
> > > Release Notes:
> > > https://issues.apache.org/jira/secure/ReleaseNote.jspa?proje
> > ctId=12320220&version=12338020
> > >
> > > Staging Repository:
> > > https://repository.apache.org/content/repositories/orgapache
> > carbondata-1009
> > >
> > > Git Tag, apache-carbondata-1.0.0-incubating-rc2 at :
> > > https://git-wip-us.apache.org/repos/asf?p=incubator-carbonda
> > ta.git;a=commit;h=39efa332be094772daed05976b29241593da309f
> > >
> > > Please vote to approve this release:
> > >
> > > [ ] +1 Approve the release
> > > [ ] -1 Don't approve the release (please provide specific comments)
> > >
> > > This vote will be open for at least 72 hours. If this vote passes (we
> > need at least 3 binding votes, meaning three  votes from the PPMC), I
> will
> > forward to gene...@incubator.apache.org  > he.org> for  the IPMC votes.
> > >
> > > Regards,
> > > Jacky
> >
> >
>
>
> --
> Regards
> Liang
>


Re: Re: Failed to APPEND_FILE, hadoop.hdfs.protocol.AlreadyBeingCreatedException

2017-01-20 Thread Ravindra Pesala
Hi,

Please use
"mvn clean -DskipTests -Pspark-1.5 -Dspark.version=1.5.2 -Phadoop-2.7.2 package"

Regards,
Ravindra


On 20 January 2017 at 15:42, manish gupta  wrote:

> Can you try compiling with hadoop-2.7.2 version and use it and let us know
> if the issue still persists.
>
> "mvn package -DskipTests -Pspark-1.5.2 -Phadoop-2.7.2 -DskipTests"
>
> Regards
> Manish Gupta
>
> On Fri, Jan 20, 2017 at 1:30 PM, 彭  wrote:
>
> > I build the jar with hadoop2.6,  like  "mvn package -DskipTests
> > -Pspark-1.5.2 -Phadoop-2.6.0 -DskipTests"
> > My Spark version is "spark-1.5.2-bin-hadoop2.6"
> > However my hadoop environment is hadoop-2.7.2
> >
> >
> >
> > At 2017-01-20 15:05:56, "manish gupta" 
> wrote:
> > >Hi,
> > >
> > >Which version of hadoop you are using while compiling the carbondata
> jar?
> > >
> > >If you are using hadoop-2.2.0, then please go through the below link
> which
> > >says that there is some issue with hadoop-2.2.0 while writing a file in
> > >append mode.
> > >
> > >http://stackoverflow.com/questions/21655634/hadoop2-2-
> > 0-append-file-occur-alreadybeingcreatedexception
> > >
> > >Regards
> > >Manish Gupta
> > >
> > >On Fri, Jan 20, 2017 at 8:10 AM, ffpeng90  wrote:
> > >
> > >> I have met the same problem.
> > >> I load data for three times and this exception always throws at the
> > third
> > >> time.
> > >> I use the branch-1.0 version from git.
> > >>
> > >> Table :
> > >> cc.sql(s"create table if not exists flightdb15(ID Int, date string,
> > country
> > >> string, name string, phonetype string, serialname string, salary Int)
> > ROW
> > >> FORMAT SERDE 'org.apache.hadoop.hive.serde2.
> > MetadataTypedColumnsetSerDe'
> > >> STORED BY 'org.apache.carbondata.format'  TBLPROPERTIES
> > >> ('table_blocksize'='256 MB')")
> > >>
> > >> Exception:
> > >>
> > >>  > >> n5.nabble.com/file/n6843/bug1.png>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> --
> > >> View this message in context: http://apache-carbondata-
> > >> mailing-list-archive.1130556.n5.nabble.com/Failed-to-
> > >> APPEND-FILE-hadoop-hdfs-protocol-AlreadyBeingCreatedException-
> > >> tp5433p6843.html
> > >> Sent from the Apache CarbonData Mailing List archive mailing list
> > archive
> > >> at Nabble.com.
> > >>
> >
>



-- 
Thanks & Regards,
Ravi


[jira] [Created] (CARBONDATA-656) Simplify the carbon session creation

2017-01-17 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-656:
--

 Summary: Simplify the carbon session creation 
 Key: CARBONDATA-656
 URL: https://issues.apache.org/jira/browse/CARBONDATA-656
 Project: CarbonData
  Issue Type: Improvement
Reporter: Ravindra Pesala
Priority: Minor


Now it is cumbersome to create CarbonSession through spark shell. We should 
minimize steps to give more usability for first time users.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CARBONDATA-655) Make nokettle dataload flow as default in carbon

2017-01-17 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-655:
--

 Summary: Make nokettle dataload flow as default in carbon
 Key: CARBONDATA-655
 URL: https://issues.apache.org/jira/browse/CARBONDATA-655
 Project: CarbonData
  Issue Type: Improvement
Reporter: Ravindra Pesala
Assignee: Ravindra Pesala
Priority: Minor


Make nokettle dataload flow as default in carbon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Unable to Assign Jira to me

2017-01-13 Thread Ravindra Pesala
Please provide Jira user name and mail id. We will add you as a contributor
so that you can assign issues to yourself.

On Fri, Jan 13, 2017, 16:49 Anurag Srivastava  wrote:

> Hello Team,
>
> I am working on JIRA [CARBONDATA-542] and want to assign this JIRA to me.
> But I am not able to assign it to me.
>
> How can I assign it to me ?
>
> --
> *Thanks®ards*
>
>
> *Anurag Srivastava**Software Consultant*
> *Knoldus Software LLP*
>
> *India - US - Canada*
> *Twitter  | FB
>  | LinkedIn
> *
>


[jira] [Created] (CARBONDATA-628) Issue when measure selection with out table order gives wrong result with vectorized reader enabled

2017-01-11 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-628:
--

 Summary: Issue when measure selection with out table order gives 
wrong result with vectorized reader enabled
 Key: CARBONDATA-628
 URL: https://issues.apache.org/jira/browse/CARBONDATA-628
 Project: CarbonData
  Issue Type: Bug
Reporter: Ravindra Pesala
Assignee: Ravindra Pesala
Priority: Minor


If the table is created with measure order like m1, m2 and user selects the 
measures m2, m1 then it returns wrong result with vectorized reader enabled



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: TestCase failed

2017-01-10 Thread Ravindra Pesala
Hi,

Please make sure the store path of "flightdb2" is given properly inside the
CarbonInputMapperTest class.
Please provide the complete stack trace of the error.

On 10 January 2017 at 17:54, 彭  wrote:

> Hi,all:
> Recently, i meet a failed TestCase,   Is there anyone know it?
> http://apache-carbondata-mailing-list-archive.1130556.
> n5.nabble.com/jira-TestCase-in-carbondata-hadoop-failed-td5896.html
>
>


-- 
Thanks & Regards,
Ravi


[jira] [Created] (CARBONDATA-618) Add new profile to build all modules for release purpose

2017-01-10 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-618:
--

 Summary: Add new profile to build all modules for release purpose
 Key: CARBONDATA-618
 URL: https://issues.apache.org/jira/browse/CARBONDATA-618
 Project: CarbonData
  Issue Type: Bug
Reporter: Ravindra Pesala


Add new profile to build all modules for release purpose



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CARBONDATA-611) mvn clean -Pbuild-with-format package does not work

2017-01-09 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-611:
--

 Summary: mvn clean -Pbuild-with-format package does not work
 Key: CARBONDATA-611
 URL: https://issues.apache.org/jira/browse/CARBONDATA-611
 Project: CarbonData
  Issue Type: Bug
Reporter: Ravindra Pesala


mvn clean -Pbuild-with-format package does not work



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Select query is not working.

2017-01-05 Thread Ravindra Pesala
Hi,

Its an issue, we are working on the fix.

On 5 January 2017 at 17:26, Anurag Srivastava  wrote:

> Hello,
>
> I have taken latest code at today (5/01/2017) and build code with spark
> 1.6. After that I put the latest jar in carbonlib in spark and start thrift
> server.
>
> When I have started running query I was able to run the create and load
> query but for the "*select*"
> query it is giving me error :
>
> *org.apache.carbondata.core.carbon.datastore.exception.
> IndexBuilderException:
> Block B-tree loading failed*
>
> I have raised JIRA ISSUE for the same. Please look at there for further
> information and stack trace. Here is the link :
>
> https://issues.apache.org/jira/browse/CARBONDATA-597
>
>
> --
> *Thanks®ards*
>
>
> *Anurag Srivastava**Software Consultant*
> *Knoldus Software LLP*
>
> *India - US - Canada*
> * Twitter  | FB
>  | LinkedIn
> *
>



-- 
Thanks & Regards,
Ravi


Re: why there is a table name option in carbon source format?

2017-01-03 Thread Ravindra Pesala
You can directly use the other SQL create table command, like in 1.6.

CREATE TABLE IF NOT EXISTS t3
(ID Int, date Timestamp, country String,
name String, phonetype String, serialname char(10), salary Int)
STORED BY 'carbondata'


On 4 January 2017 at 10:02, Anubhav Tarar  wrote:

> exactly my point if table name in create table statement and table name in
> carbon source option is different consider this example
>
> 0: jdbc:hive2://localhost:1> CREATE TABLE testing2(String string)USING
> org.apache.spark.sql.CarbonSource OPTIONS("bucketnumber"="1",
> "bucketcolumns"="String",tableName=" testing1");
>
> then the table which get created in hdfs is testing1 not testing testing2
> it is quite confusing from user side
>
> On Wed, Jan 4, 2017 at 8:33 AM, QiangCai  wrote:
>
> > For Spark 2,  when using SparkSession to create carbon table, need
> > tableName
> > option to create carbon schema in store location folder. Better to use
> > CarbonSession to create carbon table now.
> >
> >
> >
> > --
> > View this message in context: http://apache-carbondata-
> > mailing-list-archive.1130556.n5.nabble.com/why-there-is-a-
> > table-name-option-in-carbon-source-format-tp5385p5420.html
> > Sent from the Apache CarbonData Mailing List archive mailing list archive
> > at Nabble.com.
> >
>
>
>
> --
> Thanks and Regards
>
> *   Anubhav Tarar *
>
>
> * Software Consultant*
>   *Knoldus Software LLP    *
>LinkedIn  Twitter
> fb 
>   mob : 8588915184
>



-- 
Thanks & Regards,
Ravi


Re: carbon thrift server for spark 2.0 showing unusual behaviour

2017-01-03 Thread Ravindra Pesala
Hi,

I did not understand the issue. What error does it throw?

On 4 January 2017 at 10:03, Anubhav Tarar  wrote:

> here is the script ./bin/spark-submit --conf
> spark.sql.hive.thriftServer.singleSession=true --class
> org.apache.carbondata.spark.thriftserver.CarbonThriftServer
> /opt/spark-2.0.0-bin-hadoop2.7/carbonlib/carbondata_2.11-1.
> 0.0-incubating-SNAPSHOT-shade-hadoop2.2.0.jar
> hdfs://localhost:54310/opt/carbonStore
>
> it works fine only 1 out of 10 times
>
> On Wed, Jan 4, 2017 at 8:28 AM, QiangCai  wrote:
>
> > Can you show the JDBCServer startup script?
> >
> >
> >
> > --
> > View this message in context: http://apache-carbondata-
> > mailing-list-archive.1130556.n5.nabble.com/carbon-thrift-
> > server-for-spark-2-0-showing-unusual-behaviour-tp5384p5419.html
> > Sent from the Apache CarbonData Mailing List archive mailing list archive
> > at Nabble.com.
> >
>
>
>
> --
> Thanks and Regards
>
> *   Anubhav Tarar *
>
>
> * Software Consultant*
>   *Knoldus Software LLP    *
>LinkedIn  Twitter
> fb 
>   mob : 8588915184
>



-- 
Thanks & Regards,
Ravi


Re: carbon shell is not working with spark 2.0 version

2017-01-03 Thread Ravindra Pesala
Yes, it is not working because the support is not yet added. Right now it
is a low-priority task, as users can directly use spark-shell to create a
CarbonSession and execute the queries.
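
For example, a minimal spark-shell sketch (assuming the carbondata assembly jar
is passed via --jars, and the store path below is just an example):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._
// Build a CarbonSession directly from spark-shell and run a query.
val carbon = SparkSession.builder().config(sc.getConf).
  getOrCreateCarbonSession("hdfs://localhost:9000/carbon/store")
carbon.sql("show tables").show()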

On 4 January 2017 at 12:40, anubhavtarar  wrote:

> carbon shell is not working with spark 2.0 version
> here are the logs
>
> java.lang.ClassNotFoundException: org.apache.spark.repl.carbon.Main
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:225)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$
> deploy$SparkSubmit$$runMain(SparkSubmit.scala:686)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.
> scala:124)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
>
>
> --
> View this message in context: http://apache-carbondata-
> mailing-list-archive.1130556.n5.nabble.com/carbon-shell-is-
> not-working-with-spark-2-0-version-tp5436.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>



-- 
Thanks & Regards,
Ravi


[jira] [Created] (CARBONDATA-580) Support Spark 2.1 in Carbon

2016-12-30 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-580:
--

 Summary: Support Spark 2.1 in Carbon
 Key: CARBONDATA-580
 URL: https://issues.apache.org/jira/browse/CARBONDATA-580
 Project: CarbonData
  Issue Type: Improvement
  Components: spark-integration
Reporter: Ravindra Pesala
Assignee: Ravindra Pesala


Support latest Spark 2.1 in Carbon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CARBONDATA-577) Carbon session is not working in spark shell.

2016-12-28 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-577:
--

 Summary: Carbon session is not working in spark shell.
 Key: CARBONDATA-577
 URL: https://issues.apache.org/jira/browse/CARBONDATA-577
 Project: CarbonData
  Issue Type: Bug
  Components: spark-integration
Affects Versions: 1.0.0-incubating
Reporter: Ravindra Pesala


Currently the user cannot create a CarbonSession from the spark shell; it always creates 
a SparkSession only, so carbon queries cannot be executed in the spark shell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CARBONDATA-574) Add thrift server support to Spark 2.0 carbon integration

2016-12-27 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-574:
--

 Summary: Add thrift server support to Spark 2.0 carbon integration
 Key: CARBONDATA-574
 URL: https://issues.apache.org/jira/browse/CARBONDATA-574
 Project: CarbonData
  Issue Type: Bug
Reporter: Ravindra Pesala


Add thrift server support to Spark 2.0 carbon integration



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: CatalystAnalysy

2016-12-27 Thread Ravindra Pesala
Have you used 'mvn clean'?

On 28 December 2016 at 07:18, rahulforallp  wrote:

> hey QiangCai,
> thank you for your reply . i have spark 1.6.2. and also tried with
> -Dspark.version=1.6.2 . But result is same . Still i am getting same
> exception.
>
> Is this exception possibe if i have different scala version?
>
>
>
> --
> View this message in context: http://apache-carbondata-
> mailing-list-archive.1130556.n5.nabble.com/CatalystAnalysy-
> tp5129p5137.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>



-- 
Thanks & Regards,
Ravi


Re: Float Data Type Support in carbondata Querry

2016-12-27 Thread Ravindra Pesala
Hi,

Carbon is supposed to return float data when you use the float data type.
Please check whether you are converting the data to float or not
in the ScannedResultCollector implementation classes.

Regards,
Ravindra

On 27 December 2016 at 20:23, Rahul Kumar  wrote:

> Hello Ravindra,
>
> I am working on *CARBONDATA-390*(to support Float data type).
> I have made following changes *and its working fine with Spark 1.6*.
>
> https://github.com/apache/incubator-carbondata/compare/maste
> r...rahulforallp:CARBONDATA-390?expand=1
>
> But when i have tested it for Spark 2.0 .
> It works fine if i execute create and load(insert) command .
> Select query also works fine and gives the object of DataSet.
> Schema of Table and dataset is correct with float datatype.
> But when i try dataSet.show() , it gives me following error:
>
> *Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent
> failure: Lost task 0.0 in stage 4.0 (TID 5, localhost):
> java.lang.ClassCastException: java.lang.Double cannot be cast to
> java.lang.Float*
>
>
> *My observation is  Dataset is strongly typed collection . So when we
> selects record from table it gives correct result , but somewhere float has
> been converted to Double and during the dataSet.show() gives cast exception
> because dataset schema has float datatype.*
>
>
>   Thanks and Regards
>
> *   Rahul Kumar  *
>
> *Software Consultant*   *Knoldus Software LLP
>   *
>
> *[image: https://www.linkedin.com/in/rahulforallp]
> [image:
> https://twitter.com/RahulKu71223673] 
>
> ** : 8800897566*
>



-- 
Thanks & Regards,
Ravi


Re: Dictionary file is locked for updation

2016-12-27 Thread Ravindra Pesala
Hi,

It seems the store path is taking the default location. Did you set
the store location properly? Which Spark version are you using?

Regards,
Ravindra

On Tue, Dec 27, 2016, 1:38 PM 251469031 <251469...@qq.com> wrote:

> Hi Kumar,
>
>
>   thx to your repley, the full logs is as follows:
>
>
> 16/12/27 12:30:17 INFO locks.HdfsFileLock: Executor task launch worker-0
> HDFS lock
> path:hdfs://master:9000../carbon.store/default/test_table/2e9b7efa-2934-463a-9280-ff50c5129268.lock
> 16/12/27 12:30:17 INFO storage.ShuffleBlockFetcherIterator: Getting 1
> non-empty blocks out of 1 blocks
> 16/12/27 12:30:17 INFO storage.ShuffleBlockFetcherIterator: Started 1
> remote fetches in 1 ms
> 16/12/27 12:30:32 ERROR rdd.CarbonGlobalDictionaryGenerateRDD: Executor
> task launch worker-0
> java.lang.RuntimeException: Dictionary file name is locked for updation.
> Please try after some time
> at scala.sys.package$.error(package.scala:27)
> at
> org.apache.carbondata.spark.rdd.CarbonGlobalDictionaryGenerateRDD$$anon$1.(CarbonGlobalDictionaryRDD.scala:364)
> at
> org.apache.carbondata.spark.rdd.CarbonGlobalDictionaryGenerateRDD.compute(CarbonGlobalDictionaryRDD.scala:302)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
>
>
> as u see, the lock file path
> is:hdfs://master:9000../carbon.store/default/test_table/2e9b7efa-2934-463a-9280-ff50c5129268.lock
>
>
>
>
> -- Original Message --
> From: "Kumar Vishal";;
> Date: 27 December 2016 (Tuesday) 3:25 PM
> To: "dev";
>
> Subject: Re: Dictionary file is locked for updation
>
>
>
> Hi,
> can you please find *"HDFS lock path"* string in executor log and let me
> know the complete log message.
>
> -Regards
> Kumar Vishal
>
> On Tue, Dec 27, 2016 at 12:45 PM, 251469031 <251469...@qq.com> wrote:
>
> > Hi all,
> >
> >
> > when I run the following script:
> > scala> cc.sql(s"load data inpath
> 'hdfs://master:9000/carbondata/sample.csv'
> > into table test_table")
> >
> >
> > it turns out that:
> > WARN  27-12 12:37:58,044 - Lost task 1.3 in stage 2.0 (TID 13, slave1):
> > java.lang.RuntimeException: Dictionary file name is locked for updation.
> > Please try after some time
> >
> >
> > what I have done are:
> > 1.in carbon.properties, set carbon.lock.type=HDFSLOCK
> > 2.send carbon.properties & spark-defaults.conf to all nodes of the
> clusters
> >
> >
> > if any of you have any idea, looking forward to your replay, thx~


[jira] [Created] (CARBONDATA-547) Add CarbonSession and enabled parser to use all carbon commands

2016-12-21 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-547:
--

 Summary: Add CarbonSession and enabled parser to use all carbon 
commands
 Key: CARBONDATA-547
 URL: https://issues.apache.org/jira/browse/CARBONDATA-547
 Project: CarbonData
  Issue Type: Improvement
Reporter: Ravindra Pesala
Assignee: Ravindra Pesala


Currently DDL commands like CREATE, LOAD, ALTER, DROP, DESCRIBE, SHOW LOADS, 
DELETE SEGMENTS etc. are not working in the Spark 2.0 integration.
So please add CarbonSession and override the SQL parser to make all these 
commands work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSSION] CarbonData loading solution discussion

2016-12-15 Thread Ravindra Pesala
+1 for having separate output formats; now users have the flexibility to choose
as per the scenario.

On Fri, Dec 16, 2016, 2:47 AM Jihong Ma  wrote:

>
> It is great idea to have separate OutputFormat for regular Carbon data
> files, index files as well as meta data files, For instance: dictionary
> file, schema file, global index file etc.. for writing Carbon generated
> files laid out HDFS, and it is orthogonal to the actual data load process.
>
> Regards.
>
> Jihong
>
> -Original Message-
> From: Jacky Li [mailto:jacky.li...@qq.com]
> Sent: Thursday, December 15, 2016 12:55 AM
> To: dev@carbondata.incubator.apache.org
> Subject: [DISCUSSION] CarbonData loading solution discussion
>
>
> Hi community,
>
> Since CarbonData has a global dictionary feature, loading data into
> CarbonData currently requires two scans of the input data: the first scan
> generates the dictionary, and the second does the actual data encoding and
> writes the carbon files. This approach is simple, but it has at least two
> problems:
> 1. it involves unnecessary IO reads.
> 2. a MapReduce application needs two jobs to write carbon files.
>
> To solve this, we need a single-pass data loading solution, as discussed
> earlier, and the community is now developing it (CARBONDATA-401, PR310).
>
> In this post, I want to discuss the OutputFormat part. I think there will
> be two OutputFormats for CarbonData:
> 1. DictionaryOutputFormat, which is used for the global dictionary
> generation. (This should be extracted from CarbonColumnDictGeneratRDD)
> 2. TableOutputFormat, which is used for writing CarbonData files.
>
> When carbon has these output formats, it is easier to integrate with
> compute frameworks like Spark, Hive and MapReduce.
> And to make data loading faster, the user can choose a different
> solution based on the scenario, as follows:
> Scenario 1: the first load is small (cannot cover most of the dictionary)
> - run two jobs that use DictionaryOutputFormat and TableOutputFormat
>   accordingly for the first few loads
> - after some loads it becomes like Scenario 2: run one job that uses
>   TableOutputFormat with single-pass
>
> Scenario 2: the first load is big (can cover most of the dictionary)
> - for the first load:
>   - if the biggest column cardinality > 10K, run two jobs using the two
>     output formats
>   - otherwise, run one job that uses TableOutputFormat with single-pass
> - for subsequent loads, run one job that uses TableOutputFormat with
>   single-pass
>
> What do you think of this idea?
>
> Regards,
> Jacky
>
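
To make the scenario selection above concrete, here is a minimal sketch of the
decision logic (the 10K cardinality threshold and the scenario terms come from
the proposal; the type and function names are illustrative only):

  sealed trait LoadPlan
  // Job 1 with DictionaryOutputFormat, then job 2 with TableOutputFormat.
  case object TwoJobs extends LoadPlan
  // One TableOutputFormat job with single-pass dictionary generation.
  case object SinglePass extends LoadPlan

  def chooseLoadPlan(isFirstLoad: Boolean,
                     firstLoadCoversDictionary: Boolean,
                     maxColumnCardinality: Long): LoadPlan =
    if (isFirstLoad) {
      if (!firstLoadCoversDictionary) TwoJobs          // Scenario 1: small first load(s)
      else if (maxColumnCardinality > 10000) TwoJobs   // Scenario 2, high cardinality
      else SinglePass                                  // Scenario 2, low cardinality
    } else {
      SinglePass                                       // subsequent loads
    }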


[jira] [Created] (CARBONDATA-519) Enable vector reader in Carbon-Spark 2.0 integration and Carbon layer

2016-12-10 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-519:
--

 Summary: Enable vector reader in Carbon-Spark 2.0 integration and 
Carbon layer
 Key: CARBONDATA-519
 URL: https://issues.apache.org/jira/browse/CARBONDATA-519
 Project: CarbonData
  Issue Type: Improvement
Reporter: Ravindra Pesala
Assignee: Ravindra Pesala


Spark 2.0 supports a batch (vectorized) reader and uses whole-stage codegen to
improve performance, so carbon can also implement a vector reader and leverage
these Spark 2.0 features.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
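
If the vector reader ends up behind a configuration switch, enabling it could
look like the sketch below (the carbon.enable.vector.reader property name is an
assumption here, not confirmed by this issue):

  import org.apache.carbondata.core.util.CarbonProperties

  // Assumed switch name: turn the vectorized (batch) reader on for Spark 2.x queries.
  CarbonProperties.getInstance()
    .addProperty("carbon.enable.vector.reader", "true")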


Re: [Discussion] Some confused properties

2016-12-08 Thread Ravindra Pesala
Hi,

Carbon takes the store location from CarbonContext and sets it in
CarbonProperties as carbon.storelocation, so it is not required to add the
store location to the properties file. And carbon.ddl.base.hdfs.url is not a
mandatory property; it is only used when a relative load path is provided, in
which case this configured prefix is added in front of it.
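
For example, a minimal sketch (assuming the two-argument CarbonContext
constructor that takes the store path):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.CarbonContext

  val sc = new SparkContext(new SparkConf().setAppName("carbon-example").setMaster("local[*]"))
  // The second argument is picked up as carbon.storelocation internally,
  // so the properties file does not need to set it.
  val cc = new CarbonContext(sc, "hdfs://hacluster/Opt/CarbonStore")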

Regards,
Ravi

On 8 December 2016 at 15:03, Sea <261810...@qq.com> wrote:

> Hi, all:
> I am trying to use carbon,  but I am confused about the properties as
> blow:
>
>
> carbon.storelocation=hdfs://hacluster/Opt/CarbonStore
> #Base directory for Data files
> carbon.ddl.base.hdfs.url=hdfs://hacluster/opt/data
> #Path where the bad records are stored
> carbon.badRecords.location=/opt/Carbon/Spark/badrecords
>
>
>
>
>
> Why do I need to set carbon.storelocation and  carbon.ddl.base.hdfs.url
> before I create carbon table?
>
>
> Best Regards
> yuhai




-- 
Thanks & Regards,
Ravi


Re: select return error when filter string column in where clause

2016-12-05 Thread Ravindra Pesala
Hi,

Please provide the table schema, load command and sample data to reproduce
this issue; you may create a JIRA for it.
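
For example, something along these lines would help reproduce it (the schema
and csv path below are placeholders):

  cc.sql("CREATE TABLE IF NOT EXISTS carbontest_001 (id STRING, data_date TIMESTAMP) STORED BY 'carbondata'")
  cc.sql("LOAD DATA INPATH 'hdfs://<namenode>/path/to/sample.csv' INTO TABLE carbontest_001")
  cc.sql("SELECT to_date(data_date), count(*) FROM carbontest_001 " +
    "WHERE id = 'LSJW26762FS044062' GROUP BY to_date(data_date)").show()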

Regards,
Ravi

On 6 December 2016 at 07:05, Lu Cao  wrote:

> Hi Dev team,
> I have loaded some data into carbondata table. But when I put the id
> column(String type) in where clause it always return error as below:
>
> cc.sql("select to_date(data_date),count(*) from default.carbontest_001
> where id='LSJW26762FS044062' group by to_date(data_date)").show
>
>
>
> ===
> WARN  06-12 09:02:13,763 - Lost task 5.0 in stage 44.0 (TID 687,
> .com): java.lang.RuntimeException: Exception occurred in query
> execution.Please check logs.
> at scala.sys.package$.error(package.scala:27)
> at org.apache.carbondata.spark.rdd.CarbonScanRDD$$anon$1.<
> init>(CarbonScanRDD.scala:226)
> at org.apache.carbondata.spark.rdd.CarbonScanRDD.compute(
> CarbonScanRDD.scala:192)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(
> ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(
> ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> ERROR 06-12 09:02:14,091 - Task 1 in stage 44.0 failed 4 times; aborting
> job
> org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 1 in stage 44.0 failed 4 times, most recent failure: Lost task
> 1.3 in stage 44.0 (TID 694, scsp00258.saicdt.com):
> java.lang.RuntimeException: Exception occurred in query
> execution.Please check logs.
> at scala.sys.package$.error(package.scala:27)
> at org.apache.carbondata.spark.rdd.CarbonScanRDD$$anon$1.<
> init>(CarbonScanRDD.scala:226)
> at org.apache.carbondata.spark.rdd.CarbonScanRDD.compute(
> CarbonScanRDD.scala:192)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(
> ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(
> ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$
> scheduler$DAGScheduler$$failJobAndIndependent

Re: About hive integration

2016-12-04 Thread Ravindra Pesala
Hi,

Yes, we have plans to integrate carbondata with the hive engine, but it is not
high-priority work for us right now, so we will take up this task gradually.
Any contributions towards it are welcome.

Regards,
Ravi

On 4 December 2016 at 12:30, Sea <261810...@qq.com> wrote:

> Hi, all:
> Carbondata is currently not working in hive, which is the most widely used
> query engine. In my company, if I want to use carbon, I need to query
> carbondata tables from hive.
> I think we should implement the following features in hive:
> 1. DDL create/drop/alter carbondata table
> 2. DML insert(overwrite) /select
>
>
> What do you think?




-- 
Thanks & Regards,
Ravi

