[RESULT][VOTE] Apache CarbonData 0.1.1-incubating release

2016-10-08 Thread Liang Big data
Hi all

Finally, this vote passed with the following result:

+1 (binding): Justin Mclean, Henry Saputra, Julian Hyde, Uma Gangumalla

Thanks all for your vote.


Regards
Liang


[jira] [Created] (CARBONDATA-289) Support MB/M for table block size and update the doc about this new feature.

2016-10-08 Thread zhangshunyu (JIRA)
zhangshunyu created CARBONDATA-289:
--

 Summary: Support MB/M for table block size and update the doc 
about this new feature. 
 Key: CARBONDATA-289
 URL: https://issues.apache.org/jira/browse/CARBONDATA-289
 Project: CarbonData
  Issue Type: Bug
  Components: spark-integration
Affects Versions: 0.1.0-incubating
Reporter: zhangshunyu
Assignee: zhangshunyu
Priority: Minor
 Fix For: 0.2.0-incubating


Support MB/M for table block size and update the doc about this new feature. 





[GitHub] incubator-carbondata pull request #219: [WIP] [CARBONDATA-37]Support differe...

2016-10-08 Thread lion-x
GitHub user lion-x opened a pull request:

https://github.com/apache/incubator-carbondata/pull/219

[WIP] [CARBONDATA-37]Support different time format input style

# Why raise this PR?
Support different time format input styles. In some scenarios, different time 
dimensions may use different time formats, so we should support this requirement.
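
A minimal sketch of the idea: try several timestamp patterns in order until one parses (longest patterns first, so a date-only pattern cannot swallow a datetime). The pattern list and class name are assumptions for illustration, not the PR's actual implementation:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public final class MultiFormatTimestampParser {
  // Candidate patterns tried in order; a real implementation would make
  // this configurable per time dimension.
  private static final String[] PATTERNS = {
      "yyyy-MM-dd HH:mm:ss", "yyyy/MM/dd HH:mm:ss", "yyyy-MM-dd"
  };

  public static Date parse(String value) throws ParseException {
    ParseException last = null;
    for (String pattern : PATTERNS) {
      SimpleDateFormat fmt = new SimpleDateFormat(pattern);
      fmt.setLenient(false);  // reject inputs that do not match strictly
      try {
        return fmt.parse(value);
      } catch (ParseException e) {
        last = e;
      }
    }
    throw last;
  }
}
```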



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lion-x/incubator-carbondata timeformat

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-carbondata/pull/219.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #219


commit 15d1e640f96b15411fb39eaf7208b3b656ebda0a
Author: X-Lion 
Date:   2016-09-29T14:33:18Z

Lionx0929






[GitHub] incubator-carbondata pull request #218: [CARBONDATA-288] In hdfs bad record ...

2016-10-08 Thread mohammadshahidkhan
GitHub user mohammadshahidkhan opened a pull request:

https://github.com/apache/incubator-carbondata/pull/218

[CARBONDATA-288] In hdfs bad record logger is failing in writing the bad 
records in log file

**Problem**
For the HDFS file system:
`CarbonFile logFile = FileFactory.getCarbonFile(filePath, FileType.HDFS);`
If `filePath` does not exist, then calling `CarbonFile.getPath()` throws a 
`NullPointerException`.
**Solution:**
If the file does not exist, it must be created before it is accessed.
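
A minimal sketch of the fix, assuming `FileFactory` exposes existence-check and creation helpers (`isFileExist`/`createNewFile` here are assumptions based on the description, not verified against the patch):

```java
// Sketch: ensure the log file exists before touching it, so that
// logFile.getPath() cannot hit a NullPointerException.
if (!FileFactory.isFileExist(filePath, FileType.HDFS)) {
  FileFactory.createNewFile(filePath, FileType.HDFS);
}
CarbonFile logFile = FileFactory.getCarbonFile(filePath, FileType.HDFS);
String path = logFile.getPath();
```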

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mohammadshahidkhan/incubator-carbondata 
badrecord_log_file_writting_fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-carbondata/pull/218.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #218


commit 6a9ca5a78bc128f1e4f8c37164fc81dec6b70894
Author: mohammadshahidkhan 
Date:   2016-10-08T20:54:57Z

[CARBONDATA-288] In hdfs bad record logger is failing in writting the bad 
records






[jira] [Created] (CARBONDATA-288) In hdfs bad record logger is failing in writing the bad records

2016-10-08 Thread Mohammad Shahid Khan (JIRA)
Mohammad Shahid Khan created CARBONDATA-288:
---

 Summary: In hdfs bad record logger is failing in writing the bad 
records
 Key: CARBONDATA-288
 URL: https://issues.apache.org/jira/browse/CARBONDATA-288
 Project: CarbonData
  Issue Type: Bug
Reporter: Mohammad Shahid Khan
Assignee: Mohammad Shahid Khan


For the HDFS file system:
CarbonFile logFile = FileFactory.getCarbonFile(filePath, FileType.HDFS);
If filePath does not exist, then calling CarbonFile.getPath() throws a 
NullPointerException.
Solution:
If the file does not exist, it must be created before it is accessed.








[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284][WIP] Abstracting in...

2016-10-08 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/208#discussion_r82505613
  
--- Diff: 
hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonInputFormatBase.java
 ---
@@ -0,0 +1,69 @@
+/*
--- End diff --

Yes, it works exactly like that.




[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284][WIP] Abstracting in...

2016-10-08 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/208#discussion_r82505582
  
--- Diff: 
hadoop/src/main/java/org/apache/carbondata/hadoop/internal/segment/StreamingSegment.java
 ---
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.carbondata.hadoop.internal.segment;
+
+import java.io.IOException;
+import java.util.List;
+
+import org.apache.carbondata.hadoop.api.CarbonInputFormatBase;
+import org.apache.carbondata.scan.model.QueryModel;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.JobContext;
+
+public class StreamingSegment extends Segment {
--- End diff --

If I understand the comment correctly, the answer is that all segments are 
handled uniformly in `CarbonInputFormatBase`; however, the internal read 
implementation of this segment differs from `IndexedSegment`: it uses the row 
input format to read. Is this the question?




[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284][WIP] Abstracting in...

2016-10-08 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/208#discussion_r82505527
  
--- Diff: 
hadoop/src/main/java/org/apache/carbondata/hadoop/internal/segment/IndexedSegment.java
 ---
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.carbondata.hadoop.internal.segment;
+
+import java.io.IOException;
+import java.util.LinkedList;
+import java.util.List;
+
+import org.apache.carbondata.hadoop.api.CarbonInputFormatBase;
+import org.apache.carbondata.hadoop.internal.index.Index;
+import org.apache.carbondata.hadoop.internal.index.IndexLoader;
+import org.apache.carbondata.scan.filter.resolver.FilterResolverIntf;
+import org.apache.carbondata.scan.model.QueryModel;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.JobContext;
+
+public class IndexedSegment extends Segment {
--- End diff --

Do you mean a non-indexed and non-streaming segment? Yes, I think we can 
have a segment like that; reading such a segment would then be a pure scan 
without any index. 
But what is the benefit of doing that? 

I think the real issue is how to let the optimizer decide when to use the 
index and when not to. I think I will create another `Estimator` interface 
to expose cost information to the optimizer. What do you think?
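
For concreteness, a hypothetical sketch of what such an `Estimator` interface could expose; the method set below is an illustrative assumption, not an actual CarbonData API:

```java
import org.apache.carbondata.scan.filter.resolver.FilterResolverIntf;

// Hypothetical cost interface a segment could implement for the optimizer.
public interface Estimator {

  /** Estimated number of rows this segment would return for the given filter. */
  long estimateRowCount(FilterResolverIntf filter);

  /** Estimated cost (e.g. bytes read) of an index-assisted scan. */
  long estimateIndexedScanCost(FilterResolverIntf filter);

  /** Estimated cost of a full scan that ignores the index. */
  long estimateFullScanCost();
}
```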




[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284][WIP] Abstracting in...

2016-10-08 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/208#discussion_r82505442
  
--- Diff: 
hadoop/src/main/java/org/apache/carbondata/hadoop/api/row/CarbonRowInputFormat.java
 ---
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.carbondata.hadoop.api.row;
+
+import java.io.IOException;
+
+import org.apache.carbondata.hadoop.api.CarbonInputFormatBase;
+import org.apache.carbondata.hadoop.internal.segment.SegmentManager;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+public class CarbonRowInputFormat extends CarbonInputFormatBase {
--- End diff --

Row format is for streaming ingest. I can remove it for now and submit it 
with the future streaming-ingest feature.




[GitHub] incubator-carbondata pull request #198: [CARBONDATA-273]Using carbon common ...

2016-10-08 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/incubator-carbondata/pull/198




Re: Discussion about using multi local directorys to improve dataloading perfomance

2016-10-08 Thread Jacky Li
Yes, I think it is a good feature to have. Please feel free to create a JIRA 
issue and a pull request. 

Regards,
Jacky

> On October 9, 2016, at 12:04 AM, caiqiang wrote:
> 
> Hi All,
>  For each data load, we write the sorted temp files into only one local 
> directory. I think this is a bottleneck of data loading. It is 
> necessary to use multiple local directories on multiple disks for each 
> data load to improve loading performance.





[GitHub] incubator-carbondata pull request #216: Update DOC about table level blocksi...

2016-10-08 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/216#discussion_r82504992
  
--- Diff: docs/DDL-Operations-on-Carbon.md ---
@@ -67,6 +67,14 @@ Here, DICTIONARY_EXCLUDE will exclude dictionary 
creation. This is applicable fo
   ```ruby
   TBLPROPERTIES 
("COLUMN_GROUPS"="(column1,column3),(Column4,Column5,Column6)") 
   ```
+ - **Table Block Size Configuration**
+
+   The block size of a table's files on HDFS can be defined using an integer 
value in MB. The range is from 1 MB to 2048 MB, and the default value is 
1024 MB; if the user does not define this value in the DDL, the default value 
is used.
+
+  ```ruby
+  TBLPROPERTIES ("TABLE_BLOCKSIZE"="512")
--- End diff --

I think it is better to have the MB inside the value; it is clearer to 
the user. For example:
`TBLPROPERTIES ("TABLE_BLOCKSIZE"="512MB")`
We can also add support for GB in the future, if required. 
Can you modify the code accordingly?
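
For illustration, a minimal sketch of parsing such a value with an optional `MB`/`M` suffix and validating the documented 1–2048 MB range; the method name and error handling are assumptions, not the actual CarbonData code:

```java
// Sketch: parse "512", "512M", or "512MB" into a size in MB.
static int parseTableBlockSizeMb(String value) {
  String v = value.trim().toUpperCase();
  if (v.endsWith("MB")) {
    v = v.substring(0, v.length() - 2).trim();
  } else if (v.endsWith("M")) {
    v = v.substring(0, v.length() - 1).trim();
  }
  int sizeMb = Integer.parseInt(v);
  if (sizeMb < 1 || sizeMb > 2048) {  // documented range for TABLE_BLOCKSIZE
    throw new IllegalArgumentException(
        "TABLE_BLOCKSIZE must be between 1 and 2048 MB, got: " + value);
  }
  return sizeMb;
}
```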




Discussion about using multiple local directories to improve data loading performance

2016-10-08 Thread caiqiang
Hi All,
  For each data load, we write the sorted temp files into only one local 
directory. I think this is a bottleneck of data loading. It is necessary 
to use multiple local directories on multiple disks for each data load to 
improve loading performance.

[GitHub] incubator-carbondata pull request #217: [CARBONDATA-287]Using multi local di...

2016-10-08 Thread QiangCai
GitHub user QiangCai opened a pull request:

https://github.com/apache/incubator-carbondata/pull/217

[CARBONDATA-287] Using multiple local directories to improve data loading 
performance

**1 Analysis**

Now, for each data load, we use only one local directory to save the sorted 
temp files. I think it is necessary to use multiple local directories on 
multiple disks for each data load to improve loading performance.

**2 Solution**

Modify the SortDataRows class to distribute the sorted temp files across 
different local directories on different disks. 
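
For illustration, a minimal sketch of the round-robin idea, cycling new sorted temp files across the configured local directories; the class and method names are assumptions, not the actual SortDataRows change:

```java
import java.io.File;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: picks a directory for each new sort temp file in round-robin
// order, so the write load spreads across disks.
class SortTempFileLocator {
  private final String[] localDirs;  // ideally one directory per physical disk
  private final AtomicInteger next = new AtomicInteger(0);

  SortTempFileLocator(String[] localDirs) {
    this.localDirs = localDirs;
  }

  File newTempFile(String fileName) {
    int idx = Math.floorMod(next.getAndIncrement(), localDirs.length);
    return new File(localDirs[idx], fileName);
  }
}
```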


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/QiangCai/incubator-carbondata 
localDirsWorkLoadBalance

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-carbondata/pull/217.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #217


commit 4d0a276bb7283dc0683efe9f3684fdff4cadf729
Author: c00318382 
Date:   2016-10-08T12:54:32Z

localdirs

commit 41c027833c19c3f3331470f48759f469159eb59c
Author: QiangCai 
Date:   2016-10-08T15:26:39Z

clean temp folder






[jira] [Created] (CARBONDATA-287) Save the sorted temp files to multiple local dirs to improve data loading performance

2016-10-08 Thread QiangCai (JIRA)
QiangCai created CARBONDATA-287:
---

 Summary: Save the sorted temp files to multiple local dirs to improve 
data loading performance 
 Key: CARBONDATA-287
 URL: https://issues.apache.org/jira/browse/CARBONDATA-287
 Project: CarbonData
  Issue Type: Improvement
  Components: data-load
Affects Versions: 0.2.0-incubating
Reporter: QiangCai
Assignee: QiangCai
Priority: Minor
 Fix For: 0.2.0-incubating


Now, for each data load, we use only one local dir to save the sorted 
temp files. I think it is necessary to use multiple local dirs for each 
data load to improve loading performance.





Re: Discussion regarding design of data load after kettle removal.

2016-10-08 Thread Kumar Vishal
Hi Ravi,
We can move the mdkey generation step before sorting; this will compress the
dictionary data and reduce the IO.
-Regards
Kumar Vishal



Discussion regarding design of data load after kettle removal.

2016-10-08 Thread Ravindra Pesala
Hi All,


Removing Kettle from CarbonData is necessary, as this legacy Kettle
framework has become an overhead for CarbonData. This discussion is regarding
the design of the carbon load without Kettle.

The main interface for data loading here is DataLoadProcessorStep.

/**
 * This is the base interface for data loading. It can do transformation jobs as
 * per the implementation.
 */
public interface DataLoadProcessorStep {

  /**
   * The output meta for this step. The data returned from this step is as
   * per this meta.
   * @return
   */
  DataField[] getOutput();

  /**
   * Initialization process for this step.
   * @param configuration
   * @param child
   * @throws CarbonDataLoadingException
   */
  void initialize(CarbonDataLoadConfiguration configuration, DataLoadProcessorStep child)
      throws CarbonDataLoadingException;

  /**
   * Transform the data as per the implementation.
   * @return Iterator of data
   * @throws CarbonDataLoadingException
   */
  Iterator execute() throws CarbonDataLoadingException;

  /**
   * Any closing of resources after step execution can be done here.
   */
  void finish();
}

The implementation classes for DataLoadProcessorStep are
InputProcessorStep, EncoderProcessorStep, SortProcessorStep and
DataWriterProcessorStep.

The following picture depicts the loading process with implementation
classes.

[image: Inline images 2]

*InputProcessorStep*: It does two jobs: 1. it reads data from the
RecordReader of the InputFormat; 2. it parses each column field as per its data
type.
*EncoderProcessorStep*: It encodes each field with a dictionary if
required, and combines all no-dictionary columns into a single byte array.
*SortProcessorStep*: It sorts the data on the dimension columns and writes it
to intermediate files.
*DataWriterProcessorStep*: It merge-sorts the data from the intermediate temp
files, generates the mdk key, and writes the data in CarbonData format to the
store.
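
As a rough illustration (not code from the PR), the four steps could be wired
and driven like this, with each step pulling rows from the child passed to
initialize(); the variable names are assumptions for the sketch:

// Hypothetical wiring of the proposed steps.
DataLoadProcessorStep input = new InputProcessorStep();
DataLoadProcessorStep encoder = new EncoderProcessorStep();
DataLoadProcessorStep sorter = new SortProcessorStep();
DataLoadProcessorStep writer = new DataWriterProcessorStep();

input.initialize(configuration, null);       // input has no child step
encoder.initialize(configuration, input);    // encoder pulls from input
sorter.initialize(configuration, encoder);   // sorter pulls from encoder
writer.initialize(configuration, sorter);    // writer pulls from sorter

Iterator rows = writer.execute();            // drives the whole chain
writer.finish();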



The following is the interface for dictionary generation.

/**
 * Generates the dictionary for a column. The implementation classes can be
 * pre-defined, local, or global dictionary generators.
 */
public interface ColumnDictionaryGenerator {

  /**
   * Generates the dictionary value for the column data.
   * @param data
   * @return dictionary value
   */
  int generateDictionaryValue(Object data);

  /**
   * Returns the actual value associated with a dictionary value.
   * @param dictionary
   * @return actual value
   */
  Object getValueFromDictionary(int dictionary);

  /**
   * Returns the maximum value among the dictionary values. It is used for
   * generating the mdk key.
   * @return max dictionary value
   */
  int getMaxDictionaryValue();
}

This ColumnDictionaryGenerator interface can have three implementations: 1.
PreGeneratedColumnDictionaryGenerator, 2. GlobalColumnDictionaryGenerator, 3.
LocalColumnDictionaryGenerator.

[image: Inline images 3]

*PreGeneratedColumnDictionaryGenerator*: It gets the dictionary values
from an already generated and loaded dictionary.
*GlobalColumnDictionaryGenerator*: It generates a global dictionary online
by using a KV store or a distributed map.
*LocalColumnDictionaryGenerator*: It generates a local dictionary only for
that executor.
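
As a rough illustration, a local (per-executor) generator might implement the
interface like this; the backing data structures are an assumption for the
sketch, not the proposed implementation:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: assigns dense dictionary values per executor.
public class LocalColumnDictionaryGenerator implements ColumnDictionaryGenerator {
  private final Map<Object, Integer> valueToDict = new HashMap<>();
  private final List<Object> dictToValue = new ArrayList<>();

  @Override
  public synchronized int generateDictionaryValue(Object data) {
    Integer existing = valueToDict.get(data);
    if (existing != null) {
      return existing;
    }
    int dict = dictToValue.size();  // next unused dictionary value
    valueToDict.put(data, dict);
    dictToValue.add(data);
    return dict;
  }

  @Override
  public synchronized Object getValueFromDictionary(int dictionary) {
    return dictToValue.get(dictionary);
  }

  @Override
  public synchronized int getMaxDictionaryValue() {
    return dictToValue.size() - 1;
  }
}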


For more information on the loading, please check the PR:
https://github.com/apache/incubator-carbondata/pull/215

Please let me know if any changes are required in these interfaces.

-- 
Thanks & Regards,
Ravi