[GitHub] carbondata pull request #1953: [CARBONDATA-2091][DataLoad] Support specifyin...

2018-02-25 Thread xuchuanyin
Github user xuchuanyin closed the pull request at:

https://github.com/apache/carbondata/pull/1953


---


[GitHub] carbondata pull request #1953: [CARBONDATA-2091][DataLoad] Support specifyin...

2018-02-22 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1953#discussion_r170003740
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/loading/partition/impl/RangePartitionerImpl.java
 ---
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.processing.loading.partition.impl;
+
+import java.util.Arrays;
+import java.util.Comparator;
+
+import org.apache.carbondata.common.logging.LogService;
+import org.apache.carbondata.common.logging.LogServiceFactory;
+import org.apache.carbondata.core.datastore.row.CarbonRow;
+import org.apache.carbondata.processing.loading.partition.Partitioner;
+
+public class RangePartitionerImpl implements Partitioner {
+  private static final LogService LOGGER =
+  
LogServiceFactory.getLogService(RangePartitionerImpl.class.getName());
+  private CarbonRow[] rangeBounds;
+  private Comparator comparator;
+
+  public RangePartitionerImpl(CarbonRow[] rangeBounds, 
Comparator comparator) {
+this.rangeBounds = rangeBounds;
+LOGGER.info("Use range partitioner to distribute data to "
++ (rangeBounds.length + 1) + " ranges.");
+this.comparator = comparator;
+  }
+
+  /**
+   * learned from spark org.apache.spark.RangePartitioner
+   *
+   * @param key key
+   * @return partitionId
+   */
+  @Override public int getPartition(CarbonRow key) {
--- End diff --

put all @Override to previous line


---


[GitHub] carbondata pull request #1953: [CARBONDATA-2091][DataLoad] Support specifyin...

2018-02-22 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1953#discussion_r170003522
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/loading/DataLoadProcessBuilder.java
 ---
@@ -231,4 +238,71 @@ public static CarbonDataLoadConfiguration 
createConfiguration(CarbonLoadModel lo
 return configuration;
   }
 
+  /**
+   * set sort column info in configuration
+   * @param carbonTable carbon table
+   * @param loadModel load model
+   * @param configuration configuration
+   */
+  private static void setSortColumnInfo(CarbonTable carbonTable, 
CarbonLoadModel loadModel,
+  CarbonDataLoadConfiguration configuration) {
+List sortCols = 
carbonTable.getSortColumns(carbonTable.getTableName());
+SortScopeOptions.SortScope sortScope = 
SortScopeOptions.getSortScope(loadModel.getSortScope());
+if (!SortScopeOptions.SortScope.LOCAL_SORT.equals(sortScope)
+|| sortCols.size() == 0
+|| StringUtils.isBlank(loadModel.getSortColumnsBoundsStr())) {
+  if (!StringUtils.isBlank(loadModel.getSortColumnsBoundsStr())) {
+LOGGER.warn("sort column bounds will be ignored");
+  }
+
+  configuration.setSortColumnRangeInfo(null);
+  return;
+}
+// column index for sort columns
+int[] sortColIndex = new int[sortCols.size()];
+boolean[] isSortColNoDict = new boolean[sortCols.size()];
+
+DataField[] outFields = configuration.getDataFields();
+int j = 0;
+boolean columnExist;
+for (String sortCol : sortCols) {
+  columnExist = false;
+
+  for (int i = 0; !columnExist && i < outFields.length; i++) {
+if 
(outFields[i].getColumn().getColName().equalsIgnoreCase(sortCol)) {
+  columnExist = true;
+
+  sortColIndex[j] = i;
+  isSortColNoDict[j] = !outFields[i].hasDictionaryEncoding();
+  j++;
+}
+  }
+
+  if (!columnExist) {
+throw new RuntimeException("Field " + sortCol + " does not 
exist.");
--- End diff --

It  is better to use DataLoadingException


---


[GitHub] carbondata pull request #1953: [CARBONDATA-2091][DataLoad] Support specifyin...

2018-02-22 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1953#discussion_r170003203
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/loading/CarbonDataLoadConfiguration.java
 ---
@@ -107,6 +108,7 @@
*/
   private short writingCoresCount;
 
+  private SortColumnRangeInfo sortColumnRangeInfo;
   public CarbonDataLoadConfiguration() {
--- End diff --

add one empty line


---


[GitHub] carbondata pull request #1953: [CARBONDATA-2091][DataLoad] Support specifyin...

2018-02-22 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1953#discussion_r170003007
  
--- Diff: 
core/src/main/java/org/apache/carbondata/core/metadata/schema/SortColumnRangeInfo.java
 ---
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.core.metadata.schema;
+
+import java.io.Serializable;
+import java.util.Arrays;
+
+/**
+ * column ranges specified by sort column bounds
+ */
+public class SortColumnRangeInfo implements ColumnRangeInfo, Serializable {
--- End diff --

For all public class, please annotate with `@InterfaceAudience`, in this 
PR, all newly added public class should be `@InterfaceAudience.Internal`


---


[GitHub] carbondata pull request #1953: [CARBONDATA-2091][DataLoad] Support specifyin...

2018-02-22 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1953#discussion_r170002493
  
--- Diff: docs/data-management-on-carbondata.md ---
@@ -370,6 +370,17 @@ This tutorial is going to introduce all commands and 
data operations on CarbonDa
 ```
 NOTE: Date formats are specified by date pattern strings. The date 
pattern letters in CarbonData are same as in JAVA. Refer to 
[SimpleDateFormat](http://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html).
 
+  - **SORT COLUMN BOUNDS:** Range bounds for sort columns.
+
+```
+OPTIONS('SORT_COLUMN_BOUNDS'='v11,v21,v31;v12,v22,v32;v12,v23,v33')
--- End diff --

typo, last value range is `v13,v23,v33`


---


[GitHub] carbondata pull request #1953: [CARBONDATA-2091][DataLoad] Support specifyin...

2018-02-22 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1953#discussion_r169990579
  
--- Diff: 
core/src/main/java/org/apache/carbondata/core/metadata/schema/ColumnRangeInfo.java
 ---
@@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.core.metadata.schema;
+
+/**
+ * interface for column range information. Currently we treat bucket and 
sort_column_range as
+ * column ranges.
+ */
+public interface ColumnRangeInfo {
+  int getNumOfRanges();
+}
--- End diff --

add one new line at the end of file, otherwise it breaks code style


---


[GitHub] carbondata pull request #1953: [CARBONDATA-2091][DataLoad] Support specifyin...

2018-02-22 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1953#discussion_r169989543
  
--- Diff: 
core/src/main/java/org/apache/carbondata/core/metadata/schema/BucketingInfo.java
 ---
@@ -30,34 +30,32 @@
 /**
  * Bucketing information
  */
-public class BucketingInfo implements Serializable, Writable {
-
+public class BucketingInfo implements ColumnRangeInfo, Serializable, 
Writable {
   private static final long serialVersionUID = -0L;
-
   private List listOfColumns;
-
-  private int numberOfBuckets;
+  // number of column ranges
--- End diff --

Why is it called "column ranges"? Isn't it "value ranges"?


---


[GitHub] carbondata pull request #1953: [CARBONDATA-2091][DataLoad] Support specifyin...

2018-02-08 Thread xuchuanyin
GitHub user xuchuanyin opened a pull request:

https://github.com/apache/carbondata/pull/1953

[CARBONDATA-2091][DataLoad] Support specifying sort column bounds in data 
loading

Enhance data loading performance by specifying sort column bounds
1. Add row range number during convert-process-step
2. Dispatch rows to each sorter by range number
3. Sort/Write process step can be done concurrently in each range

Tests added and docs updated

After implementing this feature, the data load performance has gained about 
25% enhancement (80MB/s/Node -> 102MB/s/Node) in my scenario with only 1 bounds 
provided. 

Be sure to do all of the following checklist to help us incorporate 
your contribution quickly and easily:

 - [x] Any interfaces changed?
 `Only internal used interfaces are changed`
 - [x] Any backward compatibility impacted?
 `No`
 - [x] Document update required?
`Yes, added the usage of this feature to documents`
 - [x] Testing done
Please provide details on 
- Whether new unit test cases have been added or why no new tests 
are required?
`Yes`
- How it is tested? Please attach test report.
`Tested in 3-node cluster and local machine`
- Is it a performance related change? Please attach the performance 
test report.
`Yes. After implementing this feature, the data load performance has gained 
about 25% enhancement (80MB/s/Node -> 102MB/s/Node) in my scenario with only 1 
bounds provided. `
- Any additional information to help reviewers in testing this 
change.
   `I refactored the bucket related feature and treated the range and 
bucket as the similar logic`
 - [x] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA. 
`Not related`


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xuchuanyin/carbondata 
0208_support_specifying_sort_column_bounds

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/carbondata/pull/1953.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1953


commit 11463dd22db17f2e1858e0a1f3ebfeb07e3ec0e9
Author: xuchuanyin 
Date:   2018-02-08T08:30:09Z

Support specifying sort column bounds in data loading

Enhance data loading performance by specifying sort column bounds
1. Add row range number during convert-process-step
2. Dispatch rows to each sorter by range number
3. Sort/Write process step can be done concurrently in each range

Tests added and docs updated




---