[GitHub] carbondata pull request #1953: [CARBONDATA-2091][DataLoad] Support specifyin...
Github user xuchuanyin closed the pull request at:

    https://github.com/apache/carbondata/pull/1953

---
[GitHub] carbondata pull request #1953: [CARBONDATA-2091][DataLoad] Support specifyin...
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1953#discussion_r170003740

    --- Diff: processing/src/main/java/org/apache/carbondata/processing/loading/partition/impl/RangePartitionerImpl.java ---
    @@ -0,0 +1,68 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.processing.loading.partition.impl;
    +
    +import java.util.Arrays;
    +import java.util.Comparator;
    +
    +import org.apache.carbondata.common.logging.LogService;
    +import org.apache.carbondata.common.logging.LogServiceFactory;
    +import org.apache.carbondata.core.datastore.row.CarbonRow;
    +import org.apache.carbondata.processing.loading.partition.Partitioner;
    +
    +public class RangePartitionerImpl implements Partitioner {
    +  private static final LogService LOGGER =
    +      LogServiceFactory.getLogService(RangePartitionerImpl.class.getName());
    +  private CarbonRow[] rangeBounds;
    +  private Comparator comparator;
    +
    +  public RangePartitionerImpl(CarbonRow[] rangeBounds, Comparator comparator) {
    +    this.rangeBounds = rangeBounds;
    +    LOGGER.info("Use range partitioner to distribute data to "
    +        + (rangeBounds.length + 1) + " ranges.");
    +    this.comparator = comparator;
    +  }
    +
    +  /**
    +   * learned from spark org.apache.spark.RangePartitioner
    +   *
    +   * @param key key
    +   * @return partitionId
    +   */
    +  @Override public int getPartition(CarbonRow key) {
    --- End diff --

    Please move every @Override to the line before the method declaration.

---
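The quoted diff stops before the body of `getPartition`, so the following is only a minimal sketch of the range lookup that the Javadoc attributes to Spark's `RangePartitioner`. It assumes a linear scan over the sorted bounds, which is what Spark does when the number of bounds is small. The class name `RangePartitionSketch` and the generic key type are hypothetical stand-ins for `RangePartitionerImpl` and `CarbonRow`; this is not the code from the pull request.

```java
import java.util.Comparator;

// Hypothetical stand-in for RangePartitionerImpl; the key type is generic
// here instead of CarbonRow so that the sketch is self-contained.
final class RangePartitionSketch<T> {
  private final T[] rangeBounds;
  private final Comparator<? super T> comparator;

  RangePartitionSketch(T[] rangeBounds, Comparator<? super T> comparator) {
    this.rangeBounds = rangeBounds;
    this.comparator = comparator;
  }

  // N bounds split the key space into N + 1 ranges.
  int getNumOfRanges() {
    return rangeBounds.length + 1;
  }

  // Linear scan: advance while the key is strictly greater than the current
  // bound, so a key equal to a bound stays in the lower range.
  int getPartition(T key) {
    int partition = 0;
    while (partition < rangeBounds.length
        && comparator.compare(key, rangeBounds[partition]) > 0) {
      partition++;
    }
    return partition;
  }
}
```

With bounds `{"g", "p"}` and natural string ordering, this sketch sends `"a"` and `"g"` to range 0, `"h"` to range 1, and `"z"` to range 2. A binary search over the bounds would be a reasonable alternative when many bounds are configured.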
[GitHub] carbondata pull request #1953: [CARBONDATA-2091][DataLoad] Support specifyin...
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1953#discussion_r170003522

    --- Diff: processing/src/main/java/org/apache/carbondata/processing/loading/DataLoadProcessBuilder.java ---
    @@ -231,4 +238,71 @@ public static CarbonDataLoadConfiguration createConfiguration(CarbonLoadModel lo
         return configuration;
       }

    +  /**
    +   * set sort column info in configuration
    +   * @param carbonTable carbon table
    +   * @param loadModel load model
    +   * @param configuration configuration
    +   */
    +  private static void setSortColumnInfo(CarbonTable carbonTable, CarbonLoadModel loadModel,
    +      CarbonDataLoadConfiguration configuration) {
    +    List sortCols = carbonTable.getSortColumns(carbonTable.getTableName());
    +    SortScopeOptions.SortScope sortScope = SortScopeOptions.getSortScope(loadModel.getSortScope());
    +    if (!SortScopeOptions.SortScope.LOCAL_SORT.equals(sortScope)
    +        || sortCols.size() == 0
    +        || StringUtils.isBlank(loadModel.getSortColumnsBoundsStr())) {
    +      if (!StringUtils.isBlank(loadModel.getSortColumnsBoundsStr())) {
    +        LOGGER.warn("sort column bounds will be ignored");
    +      }
    +
    +      configuration.setSortColumnRangeInfo(null);
    +      return;
    +    }
    +    // column index for sort columns
    +    int[] sortColIndex = new int[sortCols.size()];
    +    boolean[] isSortColNoDict = new boolean[sortCols.size()];
    +
    +    DataField[] outFields = configuration.getDataFields();
    +    int j = 0;
    +    boolean columnExist;
    +    for (String sortCol : sortCols) {
    +      columnExist = false;
    +
    +      for (int i = 0; !columnExist && i < outFields.length; i++) {
    +        if (outFields[i].getColumn().getColName().equalsIgnoreCase(sortCol)) {
    +          columnExist = true;
    +
    +          sortColIndex[j] = i;
    +          isSortColNoDict[j] = !outFields[i].hasDictionaryEncoding();
    +          j++;
    +        }
    +      }
    +
    +      if (!columnExist) {
    +        throw new RuntimeException("Field " + sortCol + " does not exist.");
    --- End diff --

    It is better to use DataLoadingException

---
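As a rough illustration of the review suggestion above, the sketch below resolves each sort column to its index among the output fields and reports a missing column through a load-specific exception rather than a bare `RuntimeException`. Everything here is hypothetical: `SortColumnIndexSketch` is an invented name, plain field-name strings replace `DataField`, and the nested `DataLoadingException` is a local stand-in, not the project's own exception class.

```java
import java.util.List;

// Hypothetical, simplified version of the lookup loop in setSortColumnInfo.
final class SortColumnIndexSketch {

  // Local stand-in for a load-specific exception type (an assumption; the
  // real project class and its hierarchy are not shown in the quoted diff).
  static final class DataLoadingException extends RuntimeException {
    DataLoadingException(String message) {
      super(message);
    }
  }

  // Returns, for every sort column, the index of the matching field name.
  // Matching is case-insensitive, as in the quoted diff.
  static int[] resolveSortColumnIndexes(List<String> sortCols, String[] fieldNames) {
    int[] sortColIndex = new int[sortCols.size()];
    int j = 0;
    for (String sortCol : sortCols) {
      int found = -1;
      for (int i = 0; found < 0 && i < fieldNames.length; i++) {
        if (fieldNames[i].equalsIgnoreCase(sortCol)) {
          found = i;
        }
      }
      if (found < 0) {
        throw new DataLoadingException("Field " + sortCol + " does not exist.");
      }
      sortColIndex[j++] = found;
    }
    return sortColIndex;
  }
}
```

A dedicated exception type lets callers of the load path tell a bad load configuration apart from an unrelated runtime failure.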
[GitHub] carbondata pull request #1953: [CARBONDATA-2091][DataLoad] Support specifyin...
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1953#discussion_r170003203

    --- Diff: processing/src/main/java/org/apache/carbondata/processing/loading/CarbonDataLoadConfiguration.java ---
    @@ -107,6 +108,7 @@
        */
       private short writingCoresCount;
    +  private SortColumnRangeInfo sortColumnRangeInfo;
       public CarbonDataLoadConfiguration() {
    --- End diff --

    add one empty line

---
[GitHub] carbondata pull request #1953: [CARBONDATA-2091][DataLoad] Support specifyin...
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1953#discussion_r170003007

    --- Diff: core/src/main/java/org/apache/carbondata/core/metadata/schema/SortColumnRangeInfo.java ---
    @@ -0,0 +1,78 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.core.metadata.schema;
    +
    +import java.io.Serializable;
    +import java.util.Arrays;
    +
    +/**
    + * column ranges specified by sort column bounds
    + */
    +public class SortColumnRangeInfo implements ColumnRangeInfo, Serializable {
    --- End diff --

    Please annotate every public class with `@InterfaceAudience`. In this PR, all newly added public classes should be `@InterfaceAudience.Internal`.

---
[GitHub] carbondata pull request #1953: [CARBONDATA-2091][DataLoad] Support specifyin...
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1953#discussion_r170002493

    --- Diff: docs/data-management-on-carbondata.md ---
    @@ -370,6 +370,17 @@ This tutorial is going to introduce all commands and data operations on CarbonDa
         ```
         NOTE: Date formats are specified by date pattern strings. The date pattern letters in CarbonData are same as in JAVA. Refer to [SimpleDateFormat](http://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html).

    +  - **SORT COLUMN BOUNDS:** Range bounds for sort columns.
    +
    +    ```
    +    OPTIONS('SORT_COLUMN_BOUNDS'='v11,v21,v31;v12,v22,v32;v12,v23,v33')
    --- End diff --

    Typo: the last bound should be `v13,v23,v33`.

---
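To make the option format concrete, here is a small sketch of how a `SORT_COLUMN_BOUNDS` string could be split into bounds. It assumes, based only on the documented example, that `;` separates bounds and `,` separates the per-column values inside one bound. The class and method names are hypothetical, and the parser in the pull request may handle escaping, quoting, or empty values differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for a SORT_COLUMN_BOUNDS value such as
// "v11,v21,v31;v12,v22,v32;v13,v23,v33".
final class SortColumnBoundsSketch {

  // Splits the option value into bounds, one String[] per bound, and checks
  // that every bound carries one value per sort column.
  static List<String[]> parse(String boundsStr, int sortColumnCount) {
    List<String[]> bounds = new ArrayList<>();
    // A limit of -1 keeps trailing empty strings so empty values are not lost.
    for (String bound : boundsStr.split(";", -1)) {
      String[] values = bound.split(",", -1);
      if (values.length != sortColumnCount) {
        throw new IllegalArgumentException("Bound '" + bound + "' has "
            + values.length + " values, expected " + sortColumnCount);
      }
      bounds.add(values);
    }
    return bounds;
  }
}
```

Parsing the corrected example with three sort columns yields three bounds, which corresponds to four ranges, since N bounds produce N + 1 ranges.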
[GitHub] carbondata pull request #1953: [CARBONDATA-2091][DataLoad] Support specifyin...
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1953#discussion_r169990579

    --- Diff: core/src/main/java/org/apache/carbondata/core/metadata/schema/ColumnRangeInfo.java ---
    @@ -0,0 +1,26 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.core.metadata.schema;
    +
    +/**
    + * interface for column range information. Currently we treat bucket and sort_column_range as
    + * column ranges.
    + */
    +public interface ColumnRangeInfo {
    +  int getNumOfRanges();
    +}
    --- End diff --

    Add a newline at the end of the file; otherwise it breaks the code style.

---
[GitHub] carbondata pull request #1953: [CARBONDATA-2091][DataLoad] Support specifyin...
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1953#discussion_r169989543

    --- Diff: core/src/main/java/org/apache/carbondata/core/metadata/schema/BucketingInfo.java ---
    @@ -30,34 +30,32 @@
     /**
      * Bucketing information
      */
    -public class BucketingInfo implements Serializable, Writable {
    -
    +public class BucketingInfo implements ColumnRangeInfo, Serializable, Writable {
       private static final long serialVersionUID = -0L;
    -
       private List listOfColumns;
    -
    -  private int numberOfBuckets;
    +  // number of column ranges
    --- End diff --

    Why is it called "column ranges"? Isn't it "value ranges"?

---
[GitHub] carbondata pull request #1953: [CARBONDATA-2091][DataLoad] Support specifyin...
GitHub user xuchuanyin opened a pull request:

    https://github.com/apache/carbondata/pull/1953

    [CARBONDATA-2091][DataLoad] Support specifying sort column bounds in data loading

    Enhance data loading performance by specifying sort column bounds
    1. Add row range number during convert-process-step
    2. Dispatch rows to each sorter by range number
    3. Sort/Write process step can be done concurrently in each range

    Tests added and docs updated

    After implementing this feature, data load performance improved by about 25% (80MB/s/Node -> 102MB/s/Node) in my scenario with only 1 bound provided.

    Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily:

     - [x] Any interfaces changed?
       `Only internally used interfaces are changed`
     - [x] Any backward compatibility impacted?
       `No`
     - [x] Document update required?
       `Yes, added the usage of this feature to documents`
     - [x] Testing done
           Please provide details on
           - Whether new unit test cases have been added or why no new tests are required?
             `Yes`
           - How it is tested? Please attach test report.
             `Tested in 3-node cluster and local machine`
           - Is it a performance related change? Please attach the performance test report.
             `Yes. After implementing this feature, data load performance improved by about 25% (80MB/s/Node -> 102MB/s/Node) in my scenario with only 1 bound provided.`
           - Any additional information to help reviewers in testing this change.
             `I refactored the bucket-related feature so that range and bucket are handled by similar logic`
     - [x] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
       `Not related`

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xuchuanyin/carbondata 0208_support_specifying_sort_column_bounds

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/1953.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1953

commit 11463dd22db17f2e1858e0a1f3ebfeb07e3ec0e9
Author: xuchuanyin
Date: 2018-02-08T08:30:09Z

    Support specifying sort column bounds in data loading

    Enhance data loading performance by specifying sort column bounds
    1. Add row range number during convert-process-step
    2. Dispatch rows to each sorter by range number
    3. Sort/Write process step can be done concurrently in each range

    Tests added and docs updated

---
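The three steps in the pull request description (assign a range number, dispatch rows by range, sort each range concurrently) can be sketched as below. This is an illustrative toy under stated assumptions, not the CarbonData implementation: rows are plain strings, the bounds are assumed to be sorted, each range is sorted in memory by its own thread, and the names `RangeSortPipelineSketch`, `rangeOf`, and `sortByRange` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy model of the load pipeline described in the pull request.
final class RangeSortPipelineSketch {

  // Step 1: compute the range number of a row from the sorted bounds.
  static int rangeOf(String row, String[] bounds) {
    int range = 0;
    while (range < bounds.length && row.compareTo(bounds[range]) > 0) {
      range++;
    }
    return range;
  }

  static List<List<String>> sortByRange(List<String> rows, String[] bounds) {
    int numRanges = bounds.length + 1;
    List<List<String>> ranges = new ArrayList<>();
    for (int i = 0; i < numRanges; i++) {
      ranges.add(new ArrayList<String>());
    }
    // Step 2: dispatch every row to the list that belongs to its range.
    for (String row : rows) {
      ranges.get(rangeOf(row, bounds)).add(row);
    }
    // Step 3: sort each range on its own thread; the ranges do not overlap,
    // so no coordination between the sorters is needed.
    ExecutorService pool = Executors.newFixedThreadPool(numRanges);
    try {
      List<Future<?>> pending = new ArrayList<>();
      for (List<String> range : ranges) {
        pending.add(pool.submit(() -> Collections.sort(range)));
      }
      for (Future<?> task : pending) {
        task.get();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while sorting", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("Sorting failed", e.getCause());
    } finally {
      pool.shutdown();
    }
    return ranges;
  }
}
```

Because the ranges are ordered and disjoint, reading them back in range order gives a globally sorted sequence without a final merge across ranges, which is one plausible reason the sort and write steps can run independently per range.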