[GitHub] carbondata issue #2778: [CARBONDATA-2980][BloomDataMap] Fix bug in clearing ...
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2778 LGTM ---
[GitHub] carbondata issue #2778: [CARBONDATA-2980][BloomDataMap] Fix bug in clearing ...
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2778 @xuchuanyin. The `CarbonLRUCache` instance is one per JVM, but the cache implementation differs per purpose; for example, we have separate cache provider implementations for blockDataMap and dictionary. You can check the `CacheProvider` class to get some more details on it. `Cache` is a standard interface which has very generic methods. Because we store all the entries in the LRU cache to control memory through LRU eviction, all the keys are kept in the LRU cache map. BloomDataMap should also be aware of its keys and then use the invalidate API to clean all the required keys. Check the `clear` method implementation of `BlockletDataMapFactory` ---
[GitHub] carbondata issue #2778: [CARBONDATA-2980][BloomDataMap] Fix bug in clearing ...
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2778 @xuchuanyin ...I think this PR's changes are not correct...you are calling `lruCache.clear()`, which will clear the complete LRU cache map; that means all the entries for all the tables will be cleared, because the LRUCache instance is only 1 per JVM. I believe through this PR you are only trying to clear the stale entry for the table being recreated. You can check the flow of drop table to understand how the dataMap cache gets cleared on dropping the table, and try to incorporate the Bloom dataMap cache clearing changes accordingly. You can use the existing API `void invalidate(K key)` and try not to introduce a new API `void invalidateAll()` ---
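The per-table cleanup suggested above can be sketched as follows. This is a hypothetical illustration, not CarbonData's actual API: `TableScopedEviction`, `lruCacheMap`, and the key-prefix convention are all stand-ins; the point is to remove only the recreated table's keys via per-key invalidation instead of calling `clear()` on the whole cache.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: clear only one table's cache entries instead of the
// whole JVM-wide LRU cache. Names and the "table/segment" key layout are
// assumptions for this example.
public class TableScopedEviction {
  static final Map<String, Object> lruCacheMap = new ConcurrentHashMap<>();

  // Remove only the entries whose key is prefixed with the given table's
  // unique name; each removal stands in for cache.invalidate(key).
  static List<String> invalidateTableEntries(String tableUniqueName) {
    List<String> evicted = new ArrayList<>();
    for (String key : new ArrayList<>(lruCacheMap.keySet())) {
      if (key.startsWith(tableUniqueName + "/")) {
        lruCacheMap.remove(key);
        evicted.add(key);
      }
    }
    return evicted;
  }
}
```

Entries belonging to other tables survive, which is the behavior `lruCache.clear()` would break.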
[GitHub] carbondata pull request #2774: [CARBONDATA-2979] select count fails when car...
GitHub user manishgupta88 opened a pull request: https://github.com/apache/carbondata/pull/2774 [CARBONDATA-2979] select count fails when carbondata file is written through SDK and read through sparkfileformat for complex datatype map(struct->array->map) **Problem** A select query fails for map type when data is loaded using the Avro SDK and an external table using the carbon file format is used to query the data **Analysis** When data is loaded through the Avro SDK with a schema of type struct>, the fieldName was hard coded to "val", because of which the schema written in the file footer and the schema inferred for the external table had a mismatch, which led to failure. **Solution** Instead of hard coding the field value as "val", use the given field name in the schema - [ ] Any interfaces changed? No - [ ] Any backward compatibility impacted? No - [ ] Document update required? No - [ ] Testing done UT Added - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. NA You can merge this pull request into a Git repository by running: $ git pull https://github.com/manishgupta88/carbondata fix_map_type_issues Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2774.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2774 commit 1e1eda1dcfcfdf424c9fa7748555d4400c6c14c3 Author: manishgupta88 Date: 2018-09-27T12:32:34Z Fixed query failure issue for map type when data is loaded using avro SDK and external table using carbon file format is used to query the data ---
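The fix described above can be sketched minimally. `ChildFieldNaming` and `childFieldName` are illustrative names, not the PR's actual code; the sketch only shows the schema-naming change: derive the child field's name from the Avro schema instead of hard-coding "val", so the footer schema and the inferred external-table schema agree.

```java
// Hypothetical sketch of the naming fix. The dotted "parent.child" layout is
// an assumption for illustration.
public class ChildFieldNaming {
  static String childFieldName(String parentName, String avroFieldName) {
    // before the fix the child name was effectively: parentName + ".val"
    return parentName + "." + avroFieldName;
  }
}
```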
[GitHub] carbondata issue #2758: [CARBONDATA-2972] Debug Logs and function added for ...
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2758 LGTM ---
[GitHub] carbondata issue #2766: [CARBONDATA-2973] Added documentation for fallback c...
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2766 LGTM ---
[GitHub] carbondata pull request #2758: [CARBONDATA-2972] Debug Logs and function add...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2758#discussion_r220531471 --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/page/encoding/ColumnPageEncoder.java --- @@ -78,6 +78,13 @@ public DataType getTargetDataType(ColumnPage inputPage) { } } + public Encoding getTypeOfEncoding() { +if (CarbonUtil.isEncodedWithMeta(getEncodingList())) { + return getEncodingList().get(0); --- End diff -- 1. `getEncodingList()` is called 2 times. Hold its value in one variable and use at both the places 2. Change the method name from `getTypeOfEncoding()` to `getEncodingType()` ---
[GitHub] carbondata pull request #2759: [HOTFIX] Fix NPE in LRU cache when entry from...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2759#discussion_r220505034 --- Diff: core/src/main/java/org/apache/carbondata/core/memory/UnsafeMemoryManager.java --- @@ -182,10 +224,15 @@ public long getUsableMemory() { */ public static MemoryBlock allocateMemoryWithRetry(long taskId, long size) throws MemoryException { +return allocateMemoryWithRetry(taskId, size, false); + } + + public static MemoryBlock allocateMemoryWithRetry(long taskId, long size, boolean isDriver) --- End diff -- ok done ---
[GitHub] carbondata pull request #2759: [HOTFIX] Fix NPE in LRU cache when entry from...
GitHub user manishgupta88 opened a pull request: https://github.com/apache/carbondata/pull/2759 [HOTFIX] Fix NPE in LRU cache when entry from the same table is getting evicted to load another entry from same table **Problem** When the driver LRU cache size is configured to a small value, then on running concurrent queries, sometimes while loading the block dataMap in the LRU cache one of the dataMap entries from the same table gets deleted because of shortage of space. Due to this, in the flow after loading the dataMap cache, an NPE is thrown. This is because when a cacheable entry is removed from the LRU cache, invalidate is called on that cacheable entry to clear the unsafe memory used by that entry. The invalidate method makes the references null and clears the unsafe memory, which leads to an NPE when accessed again. **Solution** Currently the dataMap cache uses unsafe offheap memory for dataMap caching. To avoid this, the code is modified to use unsafe with onheap so that the JVM itself takes care of clearing the memory when required. We do not need to explicitly set the references to null. - [ ] Any interfaces changed? No - [ ] Any backward compatibility impacted? No - [ ] Document update required? No - [ ] Testing done Yes verified manually - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. NA You can merge this pull request into a Git repository by running: $ git pull https://github.com/manishgupta88/carbondata lru_cache_NPE Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2759.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2759 commit 598fceb3e7ee4f5797ef75ef04cd3b50c426984b Author: manishgupta88 Date: 2018-09-25T13:51:08Z Fix NPE in LRU cache when entry from the same table is getting evicted to load another entry ---
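The on-heap versus off-heap trade-off behind this hotfix can be illustrated with plain `ByteBuffer`s (CarbonData's `UnsafeMemoryManager` is not shown; `AllocatorChoice` is a stand-in for this sketch): on-heap allocations are reclaimed by the garbage collector once unreachable, so an evicted cache entry never needs its references nulled explicitly, whereas direct (off-heap) memory must be released deliberately.

```java
import java.nio.ByteBuffer;

// Illustrative sketch of the allocation choice described in the PR text.
public class AllocatorChoice {
  static ByteBuffer allocate(int size, boolean onHeap) {
    // on-heap: GC-managed, no explicit invalidation needed;
    // off-heap (direct): must be freed explicitly, which is what caused
    // the NPE when a still-referenced entry was invalidated.
    return onHeap ? ByteBuffer.allocate(size) : ByteBuffer.allocateDirect(size);
  }
}
```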
[GitHub] carbondata issue #2654: [CARBONDATA-2896] Adaptive Encoding for Primitive da...
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2654 LGTM ---
[GitHub] carbondata issue #2725: [CARBONDATA-2942] Add read and write support for wri...
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2725 @xuchuanyin ...please check my reply on the mailing list http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/DISCUSSION-Optimizing-the-writing-of-min-max-for-a-column-td62515.html ---
[GitHub] carbondata pull request #2725: [CARBONDATA-2942] Add read and write support ...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2725#discussion_r218298762 --- Diff: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java --- @@ -1831,6 +1831,16 @@ public static final short LOCAL_DICT_ENCODED_BYTEARRAY_SIZE = 3; + /** + * property to be used for specifying the max character limit for string/varchar data type till --- End diff -- ok ---
[GitHub] carbondata pull request #2725: [CARBONDATA-2942] Add read and write support ...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2725#discussion_r218298776 --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/DataRefNode.java --- @@ -125,4 +125,12 @@ DimensionRawColumnChunk readDimensionChunk(FileReader fileReader, int columnInde * @return */ BitSetGroup getIndexedData(); + + /** + * Return the array which contains the flag for each column whether the min max for that column + * is written in metadata or not + * + * @return --- End diff -- ok ---
[GitHub] carbondata pull request #2725: [CARBONDATA-2942] Add read and write support ...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2725#discussion_r218298749 --- Diff: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/BlockletDataMapRowIndexes.java --- @@ -40,12 +40,14 @@ int BLOCK_LENGTH = 8; + int BLOCK_MIN_MAX_FLAG = 9; + // below variables are specific for blockletDataMap - int BLOCKLET_INFO_INDEX = 9; + int BLOCKLET_INFO_INDEX = 10; --- End diff -- No, it will not have any impact, as this code change is in the query part ---
[GitHub] carbondata issue #2725: [CARBONDATA-2942] Add read and write support for wri...
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2725 @ravipesala ..Fixed review comments. Kindly review ---
[GitHub] carbondata pull request #2725: [CARBONDATA-2942] Add read and write support ...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2725#discussion_r218139529 --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/DataRefNode.java --- @@ -125,4 +125,12 @@ DimensionRawColumnChunk readDimensionChunk(FileReader fileReader, int columnInde * @return */ BitSetGroup getIndexedData(); + + /** + * Return the array which contains the flag for each column whether the min max for that column + * is written in metadata or not + * + * @return + */ + boolean[] isMinMaxSet(); --- End diff -- ok ---
[GitHub] carbondata pull request #2725: [CARBONDATA-2942] Add read and write support ...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2725#discussion_r218139409 --- Diff: core/src/main/java/org/apache/carbondata/core/scan/filter/executer/RowLevelRangeGrtThanFiterExecuterImpl.java --- @@ -194,9 +198,12 @@ public BitSetGroup applyFilter(RawBlockletColumnChunks rawBlockletColumnChunks, BitSetGroup bitSetGroup = new BitSetGroup(rawColumnChunk.getPagesCount()); FilterExecuter filterExecuter = null; boolean isExclude = false; + boolean isMinMaxSetForFilterDimension = + rawBlockletColumnChunks.getDataBlock().isMinMaxSet()[chunkIndex]; for (int i = 0; i < rawColumnChunk.getPagesCount(); i++) { if (rawColumnChunk.getMaxValues() != null) { - if (isScanRequired(rawColumnChunk.getMaxValues()[i], this.filterRangeValues)) { + if (!isMinMaxSetForFilterDimension || isScanRequired(rawColumnChunk.getMaxValues()[i], --- End diff -- ok ---
[GitHub] carbondata pull request #2725: [CARBONDATA-2942] Add read and write support ...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2725#discussion_r218139494 --- Diff: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/BlockletDataRefNode.java --- @@ -133,6 +134,21 @@ return null; } + @Override + public boolean[] isMinMaxSet() { +BlockletIndex blockletIndex = + blockInfos.get(index).getDetailInfo().getBlockletInfo().getBlockletIndex(); --- End diff -- ok ---
[GitHub] carbondata pull request #2725: [CARBONDATA-2942] Add read and write support ...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2725#discussion_r218139378 --- Diff: core/src/main/java/org/apache/carbondata/core/scan/scanner/impl/BlockletFilterScanner.java --- @@ -122,11 +122,11 @@ public boolean isScanRequired(DataRefNode dataBlock) { bitSet = ((ImplicitColumnFilterExecutor) filterExecuter) .isFilterValuesPresentInBlockOrBlocklet( dataBlock.getColumnsMaxValue(), -dataBlock.getColumnsMinValue(), blockletId); +dataBlock.getColumnsMinValue(), blockletId, dataBlock.isMinMaxSet()); } else { bitSet = this.filterExecuter .isScanRequired(dataBlock.getColumnsMaxValue(), -dataBlock.getColumnsMinValue()); +dataBlock.getColumnsMinValue(), dataBlock.isMinMaxSet()); --- End diff -- For blocklet min max comparison, the footer is read in the executor and the same min max flag information is used. The information serialized from the driver is not used for blocklet min max comparison in the executor ---
[GitHub] carbondata pull request #2725: [CARBONDATA-2942] Add read and write support ...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2725#discussion_r218139444 --- Diff: core/src/main/java/org/apache/carbondata/core/scan/filter/executer/IncludeFilterExecuterImpl.java --- @@ -108,10 +108,13 @@ public BitSetGroup applyFilter(RawBlockletColumnChunks rawBlockletColumnChunks, BitSetGroup bitSetGroup = new BitSetGroup(dimensionRawColumnChunk.getPagesCount()); filterValues = dimColumnExecuterInfo.getFilterKeys(); boolean isDecoded = false; + boolean isMinMaxSetForFilterDimension = + rawBlockletColumnChunks.getDataBlock().isMinMaxSet()[chunkIndex]; for (int i = 0; i < dimensionRawColumnChunk.getPagesCount(); i++) { if (dimensionRawColumnChunk.getMaxValues() != null) { - if (isScanRequired(dimensionRawColumnChunk.getMaxValues()[i], - dimensionRawColumnChunk.getMinValues()[i], dimColumnExecuterInfo.getFilterKeys())) { + if (!isMinMaxSetForFilterDimension || isScanRequired( --- End diff -- ok ---
[GitHub] carbondata pull request #2725: [CARBONDATA-2942] Add read and write support ...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2725#discussion_r218139467 --- Diff: core/src/main/java/org/apache/carbondata/core/metadata/blocklet/index/BlockletIndex.java --- @@ -75,4 +80,12 @@ public void setMinMaxIndex(BlockletMinMaxIndex minMaxIndex) { this.minMaxIndex = minMaxIndex; } + @Override public void write(DataOutput out) throws IOException { --- End diff -- ok ---
[GitHub] carbondata pull request #2725: [CARBONDATA-2942] Add read and write support ...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2725#discussion_r218138883 --- Diff: core/src/main/java/org/apache/carbondata/core/util/CarbonMetadataUtil.java --- @@ -250,6 +275,42 @@ public static BlockletIndex getBlockletIndex(EncodedBlocklet encodedBlocklet, } /** + * This method will combine the writeMinMax flag from all the pages. If any page for a given + * dimension has writeMinMax flag set to false then min max for that dimension will not be + * written in any of the pages or in the metadata + * + * @param blockletMinMaxIndex + * @param encodedBlocklet + */ + private static List mergeWriteMinMaxFlagForAllPages( + BlockletMinMaxIndex blockletMinMaxIndex, EncodedBlocklet encodedBlocklet) { +Boolean[] mergedWriteMinMaxFlag = +new Boolean[encodedBlocklet.getNumberOfDimension() + encodedBlocklet.getNumberOfMeasure()]; +// set writeMinMax flag to true for all the columns by default and then update if stats object +// has this flag set to false +Arrays.fill(mergedWriteMinMaxFlag, true); +for (int i = 0; i < encodedBlocklet.getNumberOfDimension(); i++) { + for (int pageIndex = 0; pageIndex < encodedBlocklet.getNumberOfPages(); pageIndex++) { +EncodedColumnPage encodedColumnPage = + encodedBlocklet.getEncodedDimensionColumnPages().get(i).getEncodedColumnPageList() +.get(pageIndex); +SimpleStatsResult stats = encodedColumnPage.getStats(); +if (!stats.writeMinMax()) { + mergedWriteMinMaxFlag[i] = stats.writeMinMax(); + String columnName = encodedColumnPage.getActualPage().getColumnSpec().getFieldName(); + LOGGER.info("Min Max writing ignored for column " + columnName + " from page 0 to " --- End diff -- ok ---
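The merge rule quoted in the diff above reduces to a logical AND across pages: a column's merged flag stays true only if every page agreed to write min/max. A minimal sketch, assuming a page-major `boolean[][]` layout in place of CarbonData's `EncodedBlocklet` structures:

```java
import java.util.Arrays;

// Illustrative sketch of the per-column AND-merge of writeMinMax flags.
public class MinMaxFlagMerge {
  // flagsPerPage[page][column] -> merged[column] = AND over all pages
  static boolean[] merge(boolean[][] flagsPerPage, int columnCount) {
    boolean[] merged = new boolean[columnCount];
    // default true, then flip to false if any page opted out
    Arrays.fill(merged, true);
    for (boolean[] page : flagsPerPage) {
      for (int col = 0; col < columnCount; col++) {
        merged[col] = merged[col] && page[col];
      }
    }
    return merged;
  }
}
```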
[GitHub] carbondata pull request #2725: [WIP] Added code to support storing min max f...
GitHub user manishgupta88 opened a pull request: https://github.com/apache/carbondata/pull/2725 [WIP] Added code to support storing min max for string columns based on number of characters - [ ] Any interfaces changed? - [ ] Any backward compatibility impacted? - [ ] Document update required? - [ ] Testing done Please provide details on - Whether new unit test cases have been added or why no new tests are required? - How it is tested? Please attach test report. - Is it a performance related change? Please attach the performance test report. - Any additional information to help reviewers in testing this change. - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. You can merge this pull request into a Git repository by running: $ git pull https://github.com/manishgupta88/carbondata string_min_max_decision Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2725.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2725 commit 6ef6d880e68aa3bfc1c704659c916b384e5d545a Author: manishgupta88 Date: 2018-09-14T05:13:20Z Modified code to support writing the min max flag for all the columns in the metadata. This will help in deciding whether min max for a column is written or not commit 4ed24f3972cb4c03133912ba8a1c9d35b3122e7a Author: manishgupta88 Date: 2018-09-15T06:20:03Z Modified code to support filter query using min max flag for a column ---
[GitHub] carbondata issue #2716: [HOTFIX] Fixed 2.3 CI
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2716 LGTM ---
[GitHub] carbondata pull request #2702: [WIP] Fixed multiple complex type issue
GitHub user manishgupta88 opened a pull request: https://github.com/apache/carbondata/pull/2702 [WIP] Fixed multiple complex type issue **This PR contains** 1. Fix to support nested complex types up to integer length. When the data length in one row for a nested complex type crossed the short limit, an exception was being thrown. Now lengths up to the integer limit are supported. 2. Fixed error message when map type was given as sort column in SDK 3. Fixed map type parsing issue for nested structures like struct> or array> - [ ] Any interfaces changed? No - [ ] Any backward compatibility impacted? No - [ ] Document update required? No - [ ] Testing done Added test cases - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. NA You can merge this pull request into a Git repository by running: $ git pull https://github.com/manishgupta88/carbondata fix_map_type_issues Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2702.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2702 commit f9e76162043206a006cf68822121c995f12b3c6b Author: manishgupta88 Date: 2018-09-10T07:12:33Z Fixed multiple complex type issue and added test cases for validation ---
[GitHub] carbondata issue #2687: [CARBONDATA-2876]Fix Avro decimal datatype with prec...
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2687 LGTM ---
[GitHub] carbondata issue #2698: [HOTFIX] Fixed LRU cache bug to invalidate the cache...
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2698 @ravipesala ..handled review comment. Kindly review and merge ---
[GitHub] carbondata pull request #2698: [HOTFIX] Fixed LRU cache bug to invalidate th...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2698#discussion_r215984873 --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCacheKeyValue.java --- @@ -100,6 +100,10 @@ public long getMemorySize() { return size; } +@Override public void invalidate() { --- End diff -- Made bloomFilters object to null ---
[GitHub] carbondata issue #2685: [CARBONDATA-2910] Support backward compatability in ...
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2685 LGTM ---
[GitHub] carbondata issue #2694: [CARBONDATA-2876]AVRO datatype support through SDK
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2694 LGTM ---
[GitHub] carbondata pull request #2698: [HOTFIX] Fixed LRU cache bug to invalidate th...
GitHub user manishgupta88 opened a pull request: https://github.com/apache/carbondata/pull/2698 [HOTFIX] Fixed LRU cache bug to invalidate the cacheable object to clean up the resources This PR contains the fix for the LRU cache bug to invalidate the Cacheable object while removing it from the LRU cache. This will help in clearing the unsafe memory for cacheable objects like BlockDataMaps Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily: - [ ] Any interfaces changed? - [ ] Any backward compatibility impacted? - [ ] Document update required? - [ ] Testing done Please provide details on - Whether new unit test cases have been added or why no new tests are required? - How it is tested? Please attach test report. - Is it a performance related change? Please attach the performance test report. - Any additional information to help reviewers in testing this change. - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. You can merge this pull request into a Git repository by running: $ git pull https://github.com/manishgupta88/carbondata lru_cache_bug Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2698.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2698 commit ba4221cc18700b1c22866dbd86055065aefdb1d4 Author: manishgupta88 Date: 2018-09-06T17:26:49Z Fixed LRU cache bug to invalidate the Cacheable object while removing it from LRU cache. This will help in clearing the unsafe memory for cacheable objects like BlockDataMaps ---
[GitHub] carbondata pull request #2687: [CARBONDATA-2876]Fix Avro decimal datatype wi...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2687#discussion_r214790148 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/AvroCarbonWriter.java --- @@ -407,6 +413,19 @@ private Object avroFieldToObjectForUnionType(Schema avroField, Object fieldValue out = null; } break; + case BYTES: +// DECIMAL type is defined in Avro as a BYTE type with the logicalType property +// set to "decimal" and a specified precision and scale +// As binary type is not supported yet,value will be null +if (logicalType instanceof LogicalTypes.Decimal) { + BigDecimal decimalValue = new BigDecimal(new String(((ByteBuffer) fieldValue).array(), + CarbonCommonConstants.DEFAULT_CHARSET_CLASS)); + out = (decimalValue.round( + new MathContext(((LogicalTypes.Decimal) avroField.getLogicalType()).getPrecision( + .setScale(((LogicalTypes.Decimal) avroField.getLogicalType()).getScale(), + RoundingMode.HALF_UP); --- End diff -- replace the BigDecimal conversion with the below code at all places in the code `BigDecimal bigDecimal = new BigDecimal(new String(((ByteBuffer) fieldValue).array(), CarbonCommonConstants.DEFAULT_CHARSET_CLASS)).setScale(dimension.getColumnSchema().getScale(), RoundingMode.HALF_UP);` ---
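The suggested replacement can be sketched as a self-contained method. `AvroDecimalConversion` and `toDecimal` are illustrative names, and `StandardCharsets.UTF_8` stands in for `CarbonCommonConstants.DEFAULT_CHARSET_CLASS`; as in the review comment, the bytes are assumed to carry the decimal's string form, and the column's scale is applied once with HALF_UP rounding instead of rounding via `MathContext` first.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative sketch of the reviewer's suggested BigDecimal conversion.
public class AvroDecimalConversion {
  static BigDecimal toDecimal(ByteBuffer fieldValue, int scale) {
    // decode the textual decimal value, then apply the target scale once
    String text = new String(fieldValue.array(), StandardCharsets.UTF_8);
    return new BigDecimal(text).setScale(scale, RoundingMode.HALF_UP);
  }
}
```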
[GitHub] carbondata pull request #2663: [CARBONDATA-2894] Add support for complex map...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2663#discussion_r214717788 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/Field.java --- @@ -213,4 +218,58 @@ public String getColumnComment() { public void setColumnComment(String columnComment) { this.columnComment = columnComment; } + + private void initComplexTypeChildren() { +if (getDataType().isComplexType()) { + StructField subFields = prepareSubFields(getFieldName(), getDataType()); + if (DataTypes.isArrayType(getDataType()) || DataTypes.isMapType(getDataType())) { +children = subFields.getChildren(); + } else if (DataTypes.isStructType(getDataType())) { +children = ((StructType) subFields.getDataType()).getFields(); + } +} + } + + /** + * prepare sub fields for complex types + * + * @param fieldName + * @param dType --- End diff -- ok ---
[GitHub] carbondata issue #2663: [CARBONDATA-2894] Add support for complex map type t...
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2663 @ravipesala ...handled review comments. Kindly review and merge ---
[GitHub] carbondata pull request #2663: [CARBONDATA-2894] Add support for complex map...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2663#discussion_r214643424 --- Diff: integration/spark-datasource/src/main/scala/org/apache/spark/sql/carbondata/execution/datasources/SparkCarbonFileFormat.scala --- @@ -214,6 +216,30 @@ class SparkCarbonFileFormat extends FileFormat data } +private def extractMapData(data: AnyRef, mapType: MapType): ArrayObject = { + val mapData = data.asInstanceOf[MapData] + val keys: ArrayData = mapData.keyArray() + val values: ArrayData = mapData.valueArray() + var keyValueHolder = scala.collection.mutable.ArrayBuffer[AnyRef]() --- End diff -- ok done ---
[GitHub] carbondata pull request #2663: [CARBONDATA-2894] Add support for complex map...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2663#discussion_r214643373 --- Diff: integration/spark-datasource/src/main/scala/org/apache/spark/sql/carbondata/execution/datasources/CarbonSparkDataSourceUtil.scala --- @@ -198,7 +249,20 @@ object CarbonSparkDataSourceUtil { val dataType = convertSparkToCarbonDataType(field.dataType) dataType match { case s: CarbonStructType => - new Field(field.name, s, s.getFields) + val subFields = prepareSubFields(field.name, s) --- End diff -- moved the preparation to Field class ---
[GitHub] carbondata issue #2654: [CARBONDATA-2896] Adaptive Encoding for Primitive da...
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2654 @dhatchayani You can raise one more PR to improve the code in some places: 1. Unify the isScanRequired code in all the filter classes using an ENUM and a flag based on min max comparison 2. Create a new page wrapper that extends ColumnPageWrapper and sends the actual data for no dictionary primitive type columns ---
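The enum-based unification suggested in point 1 could look roughly like this. `ScanDecision` and its members are hypothetical names for the sketch, not CarbonData's code; the idea is that every filter executer shares one decision type instead of duplicating the "min/max absent means we must scan" boolean logic.

```java
// Illustrative sketch of a shared scan-decision enum for filter executers.
public class ScanDecision {
  enum Decision { REQUIRED, NOT_REQUIRED, MIN_MAX_ABSENT }

  static Decision decide(boolean minMaxSet, boolean filterInRange) {
    if (!minMaxSet) {
      // without min/max stats the page cannot be pruned, so scanning is forced
      return Decision.MIN_MAX_ABSENT;
    }
    return filterInRange ? Decision.REQUIRED : Decision.NOT_REQUIRED;
  }
}
```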
[GitHub] carbondata pull request #2654: [CARBONDATA-2896] Adaptive Encoding for Primi...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2654#discussion_r214371546 --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMap.java --- @@ -331,8 +332,18 @@ private BloomQueryModel buildQueryModelInternal(CarbonColumn carbonColumn, // for dictionary/date columns, convert the surrogate key to bytes internalFilterValue = CarbonUtil.getValueAsBytes(DataTypes.INT, convertedValue); } else { - // for non dictionary dimensions, is already bytes, - internalFilterValue = (byte[]) convertedValue; + // for non dictionary dimensions, numeric columns will be of original data, + // so convert the data to bytes + if (DataTypeUtil.isPrimitiveColumn(carbonColumn.getDataType())) { +if (convertedValue == null) { --- End diff -- if possible initialize and store the flag in constructor and remove the check `DataTypeUtil.isPrimitiveColumn` wherever applicable in the below code ---
[GitHub] carbondata pull request #2654: [CARBONDATA-2896] Adaptive Encoding for Primi...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2654#discussion_r214361007 --- Diff: core/src/main/java/org/apache/carbondata/core/scan/filter/executer/IncludeFilterExecuterImpl.java --- @@ -110,8 +112,19 @@ public BitSetGroup applyFilter(RawBlockletColumnChunks rawBlockletColumnChunks, boolean isDecoded = false; for (int i = 0; i < dimensionRawColumnChunk.getPagesCount(); i++) { if (dimensionRawColumnChunk.getMaxValues() != null) { - if (isScanRequired(dimensionRawColumnChunk.getMaxValues()[i], - dimensionRawColumnChunk.getMinValues()[i], dimColumnExecuterInfo.getFilterKeys())) { + boolean scanRequired; + // for no dictionary measure column comparison can be done + // on the original data as like measure column + if (DataTypeUtil.isPrimitiveColumn(dimColumnEvaluatorInfo.getDimension().getDataType()) + && !dimColumnEvaluatorInfo.getDimension().hasEncoding(Encoding.DICTIONARY)) { +scanRequired = isScanRequired(dimensionRawColumnChunk.getMaxValues()[i], --- End diff -- You can create a `isPrimitiveNoDictionaryColumn` flag and check `DataTypeUtil.isPrimitiveColum` in the constructor. This will avoid the check for every page. Do this for all the filters ---
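The constructor-flag refactor suggested above can be sketched minimally. `FilterExecuterSketch` is an illustrative stand-in for the real filter executer classes; the point is that the primitive/no-dictionary check is evaluated once at construction rather than on every page iteration.

```java
// Illustrative sketch of caching the "primitive no-dictionary column" check
// in the constructor, per the review comment.
public class FilterExecuterSketch {
  private final boolean isPrimitiveNoDictionaryColumn;

  FilterExecuterSketch(boolean isPrimitiveType, boolean hasDictionaryEncoding) {
    // computed once here instead of inside the per-page loop of applyFilter()
    this.isPrimitiveNoDictionaryColumn = isPrimitiveType && !hasDictionaryEncoding;
  }

  boolean useOriginalDataComparison() {
    return isPrimitiveNoDictionaryColumn;
  }
}
```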
[GitHub] carbondata pull request #2654: [CARBONDATA-2896] Adaptive Encoding for Primi...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2654#discussion_r214356896 --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/reader/dimension/v3/CompressedDimensionChunkFileBasedReaderV3.java --- @@ -239,12 +239,25 @@ private boolean isEncodedWithMeta(DataChunk2 pageMetadata) { protected DimensionColumnPage decodeDimension(DimensionRawColumnChunk rawColumnPage, ByteBuffer pageData, DataChunk2 pageMetadata, int offset) throws IOException, MemoryException { +List encodings = pageMetadata.getEncoders(); if (isEncodedWithMeta(pageMetadata)) { ColumnPage decodedPage = decodeDimensionByMeta(pageMetadata, pageData, offset, null != rawColumnPage.getLocalDictionary()); decodedPage.setNullBits(QueryUtil.getNullBitSet(pageMetadata.presence)); - return new ColumnPageWrapper(decodedPage, rawColumnPage.getLocalDictionary(), - isEncodedWithAdaptiveMeta(pageMetadata)); + int[] invertedIndexes = new int[0]; --- End diff -- add a comment to explain that this scenario is to handle no dictionary primitive type columns where inverted index can be created on row id's during data load ---
[GitHub] carbondata pull request #2654: [CARBONDATA-2896] Adaptive Encoding for Primi...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2654#discussion_r214354541 --- Diff: processing/src/main/java/org/apache/carbondata/processing/store/TablePage.java --- @@ -363,7 +398,16 @@ public EncodedTablePage getEncodedTablePage() { columnPageEncoder = encodingFactory.createEncoder( spec, noDictDimensionPages[noDictIndex]); - encodedPage = columnPageEncoder.encode(noDictDimensionPages[noDictIndex++]); + encodedPage = columnPageEncoder.encode(noDictDimensionPages[noDictIndex]); + DataType targetDataType = + columnPageEncoder.getTargetDataType(noDictDimensionPages[noDictIndex]); + if (null != targetDataType) { +LOGGER.info("Encoder result ---> Source data type: " + noDictDimensionPages[noDictIndex] --- End diff -- make this logger debug and check for debugenabled ---
[GitHub] carbondata pull request #2654: [CARBONDATA-2896] Adaptive Encoding for Primi...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2654#discussion_r214352965 --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/page/encoding/DefaultEncodingFactory.java --- @@ -346,12 +371,21 @@ static ColumnPageCodec selectCodecByAlgorithmForDecimal(SimpleStatsResult stats, // no effect to use adaptive or delta, use compression only return new DirectCompressCodec(stats.getDataType()); } +boolean isSort = false; +boolean isInvertedIndex = false; +if (columnSpec instanceof TableSpec.DimensionSpec +&& columnSpec.getColumnType() != ColumnType.COMPLEX_PRIMITIVE) { + isSort = ((TableSpec.DimensionSpec) columnSpec).isInSortColumns(); + isInvertedIndex = isSort && ((TableSpec.DimensionSpec) columnSpec).isDoInvertedIndex(); +} --- End diff -- Put the above changes in one method as the same code is used in above places also and then call this method while creating the encoding type ---
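The shape of the extraction the reviewer asks for can be sketched as below. The nested `DimensionSpec` and its two boolean fields are simplified stand-ins for `TableSpec.DimensionSpec`, not the real class; the point is that every call site derives the (isSort, isInvertedIndex) pair through one shared helper instead of repeating the instanceof-and-cast block:

```java
public class SortFlags {
  // Illustrative stand-in for TableSpec.DimensionSpec.
  static class DimensionSpec {
    final boolean inSortColumns;
    final boolean doInvertedIndex;
    DimensionSpec(boolean inSortColumns, boolean doInvertedIndex) {
      this.inSortColumns = inSortColumns;
      this.doInvertedIndex = doInvertedIndex;
    }
  }

  // The duplicated flag computation pulled into one method: returns
  // {isSort, isInvertedIndex} for the given column spec.
  static boolean[] sortAndInvertedIndexFlags(Object columnSpec, boolean isComplexPrimitive) {
    boolean isSort = false;
    boolean isInvertedIndex = false;
    if (columnSpec instanceof DimensionSpec && !isComplexPrimitive) {
      DimensionSpec dim = (DimensionSpec) columnSpec;
      isSort = dim.inSortColumns;
      isInvertedIndex = isSort && dim.doInvertedIndex;
    }
    return new boolean[] {isSort, isInvertedIndex};
  }

  public static void main(String[] args) {
    boolean[] flags = sortAndInvertedIndexFlags(new DimensionSpec(true, true), false);
    System.out.println(flags[0] + " " + flags[1]);
  }
}
```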
[GitHub] carbondata pull request #2654: [CARBONDATA-2896] Adaptive Encoding for Primi...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2654#discussion_r214351650 --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/TableSpec.java --- @@ -91,6 +92,30 @@ private void addMeasures(List measures) { } } + /** + * No dictionary and complex dimensions of the table + * + * @return + */ + public DimensionSpec[] getNoDictAndComplexDimensions() { +List noDicOrCompIndexes = new ArrayList<>(dimensionSpec.length); +int noDicCount = 0; +for (int i = 0; i < dimensionSpec.length; i++) { + if (dimensionSpec[i].getColumnType() == ColumnType.PLAIN_VALUE + || dimensionSpec[i].getColumnType() == ColumnType.COMPLEX_PRIMITIVE + || dimensionSpec[i].getColumnType() == ColumnType.COMPLEX) { +noDicOrCompIndexes.add(i); +noDicCount++; + } +} + +DimensionSpec[] dims = new DimensionSpec[noDicCount]; +for (int i = 0; i < dims.length; i++) { + dims[i] = dimensionSpec[noDicOrCompIndexes.get(i)]; +} +return dims; --- End diff -- Avoid the below for loop in this method ---
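The second loop can be avoided by collecting the matching specs themselves (rather than their indexes) and converting with `List.toArray` in one pass. A sketch with strings standing in for `DimensionSpec` and its column types:

```java
import java.util.ArrayList;
import java.util.List;

public class NoDictFilter {
  // Single pass: add the matching elements directly, then toArray,
  // instead of recording indexes and copying in a second loop.
  static String[] filterPlainOrComplex(String[] columnTypes) {
    List<String> matches = new ArrayList<>(columnTypes.length);
    for (String type : columnTypes) {
      if (type.equals("PLAIN_VALUE") || type.equals("COMPLEX_PRIMITIVE")
          || type.equals("COMPLEX")) {
        matches.add(type);
      }
    }
    return matches.toArray(new String[0]);
  }

  public static void main(String[] args) {
    String[] out = filterPlainOrComplex(
        new String[] {"PLAIN_VALUE", "DICTIONARY", "COMPLEX"});
    System.out.println(out.length);
  }
}
```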
[GitHub] carbondata pull request #2654: [CARBONDATA-2896] Adaptive Encoding for Primi...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2654#discussion_r214338168 --- Diff: processing/src/main/java/org/apache/carbondata/processing/loading/sort/SortStepRowHandler.java --- @@ -375,6 +454,47 @@ public void writeRawRowAsIntermediateSortTempRowToOutputStream(Object[] row, outputStream.write(rowBuffer.array(), 0, packSize); } + /** + * Write the data to stream + * + * @param data + * @param outputStream + * @param idx + * @throws IOException + */ + private void writeDataToStream(Object data, DataOutputStream outputStream, int idx) + throws IOException { +DataType dataType = noDicSortDataTypes[idx]; +if (null == data) { + outputStream.writeBoolean(false); + return; --- End diff -- do not use return statement instead use the if else block wisely ---
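The restructuring the reviewer suggests can be sketched as follows. The method is a simplified stand-in for `writeDataToStream` (the value branch handles only `int` here for illustration); the point is that the null marker and the value write read as one if/else unit instead of an early return:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class NullAwareWriter {
  // One if/else instead of `if (data == null) { ...; return; }`.
  static byte[] write(Integer data) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(buf);
    if (data == null) {
      out.writeBoolean(false);   // null marker only, nothing else written
    } else {
      out.writeBoolean(true);    // present marker ...
      out.writeInt(data);        // ... followed by the typed value
    }
    return buf.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    // 1 byte for the null marker vs 1 + 4 bytes for marker plus int value
    System.out.println(write(null).length + " " + write(42).length);
  }
}
```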
[GitHub] carbondata pull request #2654: [CARBONDATA-2896] Adaptive Encoding for Primi...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2654#discussion_r214336180 --- Diff: processing/src/main/java/org/apache/carbondata/processing/loading/sort/SortStepRowHandler.java --- @@ -224,10 +237,15 @@ public IntermediateSortTempRow readWithNoSortFieldConvert( // read no-dict & sort data for (int idx = 0; idx < this.noDictSortDimCnt; idx++) { - short len = inputStream.readShort(); - byte[] bytes = new byte[len]; - inputStream.readFully(bytes); - noDictSortDims[idx] = bytes; + // for no dict measure column get the original data + if (DataTypeUtil.isPrimitiveColumn(noDicSortDataTypes[idx])) { +noDictSortDims[idx] = readDataFromStream(inputStream, idx); + } else { +short len = inputStream.readShort(); +byte[] bytes = new byte[len]; +inputStream.readFully(bytes); +noDictSortDims[idx] = bytes; + } --- End diff -- There is similar code above as well. Refactor it into one method and call it from both places ---
[GitHub] carbondata pull request #2654: [CARBONDATA-2896] Adaptive Encoding for Primi...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2654#discussion_r214341633 --- Diff: processing/src/main/java/org/apache/carbondata/processing/loading/partition/impl/RawRowComparator.java --- @@ -30,24 +33,39 @@ public class RawRowComparator implements Comparator { private int[] sortColumnIndices; private boolean[] isSortColumnNoDict; + private DataType[] noDicDataTypes; - public RawRowComparator(int[] sortColumnIndices, boolean[] isSortColumnNoDict) { + public RawRowComparator(int[] sortColumnIndices, boolean[] isSortColumnNoDict, + DataType[] noDicDataTypes) { this.sortColumnIndices = sortColumnIndices; this.isSortColumnNoDict = isSortColumnNoDict; +this.noDicDataTypes = noDicDataTypes; } @Override public int compare(CarbonRow o1, CarbonRow o2) { int diff = 0; int i = 0; +int noDicIdx = 0; for (int colIdx : sortColumnIndices) { if (isSortColumnNoDict[i]) { -byte[] colA = (byte[]) o1.getObject(colIdx); -byte[] colB = (byte[]) o2.getObject(colIdx); -diff = UnsafeComparer.INSTANCE.compareTo(colA, colB); -if (diff != 0) { - return diff; +if (DataTypeUtil.isPrimitiveColumn(noDicDataTypes[noDicIdx])) { + // for no dictionary numeric column get comparator based on the data type + SerializableComparator comparator = org.apache.carbondata.core.util.comparator.Comparator --- End diff -- increment `noDicIdx` in if block and remove from method end ---
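The recurring suggestion here (increment the index in the if block, not at the end) can be sketched as below. The loop and array names are illustrative, not the real comparator code; the running index into the no-dictionary arrays is advanced only in the branch that consumes it:

```java
public class BranchLocalIndex {
  // noDicIdx indexes the no-dictionary data-type array, so it is
  // incremented inside the no-dictionary branch where it is used,
  // not unconditionally at the end of the loop body.
  static int countNoDictComparisons(boolean[] isNoDict) {
    int noDicIdx = 0;
    for (boolean noDict : isNoDict) {
      if (noDict) {
        // ... compare using noDicDataTypes[noDicIdx] ...
        noDicIdx++;   // advanced only when actually consumed
      }
      // no noDicIdx++ here: dictionary columns must not advance it
    }
    return noDicIdx;
  }

  public static void main(String[] args) {
    System.out.println(countNoDictComparisons(new boolean[] {true, false, true}));
  }
}
```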
[GitHub] carbondata pull request #2654: [CARBONDATA-2896] Adaptive Encoding for Primi...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2654#discussion_r214341135 --- Diff: processing/src/main/java/org/apache/carbondata/processing/sort/sortdata/IntermediateSortTempRowComparator.java --- @@ -45,18 +52,31 @@ public int compare(IntermediateSortTempRow rowA, IntermediateSortTempRow rowB) { int diff = 0; int dictIndex = 0; int nonDictIndex = 0; +int noDicTypeIdx = 0; for (boolean isNoDictionary : isSortColumnNoDictionary) { if (isNoDictionary) { -byte[] byteArr1 = rowA.getNoDictSortDims()[nonDictIndex]; -byte[] byteArr2 = rowB.getNoDictSortDims()[nonDictIndex]; -nonDictIndex++; +if (DataTypeUtil.isPrimitiveColumn(noDicSortDataTypes[noDicTypeIdx])) { + // use data types based comparator for the no dictionary measure columns + SerializableComparator comparator = org.apache.carbondata.core.util.comparator.Comparator --- End diff -- Increment the no dictionary type index here `noDicTypeIdx` in if block and not at the end ---
[GitHub] carbondata pull request #2654: [CARBONDATA-2896] Adaptive Encoding for Primi...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2654#discussion_r214341442 --- Diff: processing/src/main/java/org/apache/carbondata/processing/sort/sortdata/NewRowComparator.java --- @@ -43,15 +53,31 @@ public NewRowComparator(boolean[] noDictionarySortColumnMaping) { public int compare(Object[] rowA, Object[] rowB) { int diff = 0; int index = 0; +int dataTypeIdx = 0; +int noDicSortIdx = 0; -for (boolean isNoDictionary : noDictionarySortColumnMaping) { - if (isNoDictionary) { -byte[] byteArr1 = (byte[]) rowA[index]; -byte[] byteArr2 = (byte[]) rowB[index]; +for (int i = 0; i < noDicDimColMapping.length; i++) { + if (noDicDimColMapping[i]) { +if (noDicSortColumnMapping[noDicSortIdx++]) { + if (DataTypeUtil.isPrimitiveColumn(noDicDataTypes[dataTypeIdx])) { +// use data types based comparator for the no dictionary measure columns +SerializableComparator comparator = --- End diff -- increment `dataTypeIdx` in if block and remove from method end ---
[GitHub] carbondata pull request #2654: [CARBONDATA-2896] Adaptive Encoding for Primi...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2654#discussion_r214337384 --- Diff: processing/src/main/java/org/apache/carbondata/processing/loading/sort/SortStepRowHandler.java --- @@ -359,9 +433,14 @@ public void writeRawRowAsIntermediateSortTempRowToOutputStream(Object[] row, // write no-dict & sort for (int idx = 0; idx < this.noDictSortDimCnt; idx++) { - byte[] bytes = (byte[]) row[this.noDictSortDimIdx[idx]]; - outputStream.writeShort(bytes.length); - outputStream.write(bytes); + if (DataTypeUtil.isPrimitiveColumn(noDicSortDataTypes[idx])) { --- End diff -- I can see that at multiple places for every row DataTypeUtil.isPrimitiveColumn is getting used. Please check the load performance impact of this ---
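One way to address the per-row cost flagged above is to evaluate the primitive-column check once per column up front and index into a boolean array on the per-row path. A sketch under that assumption; `isPrimitive` here is a simplified stand-in for `DataTypeUtil.isPrimitiveColumn`:

```java
public class PrimitiveFlagCache {
  // Stand-in for DataTypeUtil.isPrimitiveColumn, reduced to a few types.
  static boolean isPrimitive(String dataType) {
    return dataType.equals("BOOLEAN") || dataType.equals("INT")
        || dataType.equals("LONG") || dataType.equals("DOUBLE");
  }

  // Paid once per column at setup time; the per-row write/read loop then
  // only does flags[idx] instead of re-running the type check per row.
  static boolean[] precomputeFlags(String[] columnTypes) {
    boolean[] flags = new boolean[columnTypes.length];
    for (int i = 0; i < columnTypes.length; i++) {
      flags[i] = isPrimitive(columnTypes[i]);
    }
    return flags;
  }

  public static void main(String[] args) {
    boolean[] flags = precomputeFlags(new String[] {"INT", "STRING", "LONG"});
    System.out.println(flags[0] + " " + flags[1] + " " + flags[2]);
  }
}
```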
[GitHub] carbondata pull request #2654: [CARBONDATA-2896] Adaptive Encoding for Primi...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2654#discussion_r214341815 --- Diff: processing/src/main/java/org/apache/carbondata/processing/loading/sort/unsafe/comparator/UnsafeRowComparator.java --- @@ -60,26 +64,50 @@ public int compare(UnsafeCarbonRow rowL, Object baseObjectL, UnsafeCarbonRow row if (isNoDictionary) { short lengthA = CarbonUnsafe.getUnsafe().getShort(baseObjectL, rowA + dictSizeInMemory + sizeInNonDictPartA); -byte[] byteArr1 = new byte[lengthA]; sizeInNonDictPartA += 2; -CarbonUnsafe.getUnsafe() -.copyMemory(baseObjectL, rowA + dictSizeInMemory + sizeInNonDictPartA, -byteArr1, CarbonUnsafe.BYTE_ARRAY_OFFSET, lengthA); -sizeInNonDictPartA += lengthA; - short lengthB = CarbonUnsafe.getUnsafe().getShort(baseObjectR, rowB + dictSizeInMemory + sizeInNonDictPartB); -byte[] byteArr2 = new byte[lengthB]; sizeInNonDictPartB += 2; -CarbonUnsafe.getUnsafe() -.copyMemory(baseObjectR, rowB + dictSizeInMemory + sizeInNonDictPartB, -byteArr2, CarbonUnsafe.BYTE_ARRAY_OFFSET, lengthB); -sizeInNonDictPartB += lengthB; +DataType dataType = tableFieldStat.getNoDicSortDataType()[noDicSortIdx]; +if (DataTypeUtil.isPrimitiveColumn(dataType)) { + Object data1 = null; --- End diff -- increment `noDicSortIdx` in if block and remove from method end ---
[GitHub] carbondata issue #2644: [CARBONDATA-2853] Implement file-level min/max index...
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2644 @QiangCai In general, I can see that you have put empty lines in many places in the code. Please remove those empty lines everywhere and add some code comments for better understanding ---
[GitHub] carbondata pull request #2644: [CARBONDATA-2853] Implement file-level min/ma...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2644#discussion_r214310465 --- Diff: core/src/main/java/org/apache/carbondata/core/util/CarbonMetadataUtil.java --- @@ -96,14 +96,35 @@ private static FileFooter3 getFileFooter3(List infoList, return footer; } - public static BlockletIndex getBlockletIndex( - org.apache.carbondata.core.metadata.blocklet.index.BlockletIndex info) { + public static org.apache.carbondata.core.metadata.blocklet.index.BlockletMinMaxIndex + convertExternalMinMaxIndex(BlockletMinMaxIndex minMaxIndex) { --- End diff -- please add a method comment to explain what is meaning of convertExternalMinMaxIndex ---
[GitHub] carbondata pull request #2644: [CARBONDATA-2853] Implement file-level min/ma...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2644#discussion_r214303472 --- Diff: core/src/main/java/org/apache/carbondata/core/datamap/StreamDataMap.java --- @@ -0,0 +1,162 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.carbondata.core.datamap; + +import java.io.File; +import java.io.IOException; +import java.util.ArrayList; +import java.util.BitSet; +import java.util.List; +import java.util.Map; + +import org.apache.carbondata.common.annotations.InterfaceAudience; +import org.apache.carbondata.core.datastore.block.SegmentProperties; +import org.apache.carbondata.core.datastore.impl.FileFactory; +import org.apache.carbondata.core.indexstore.blockletindex.SegmentIndexFileStore; +import org.apache.carbondata.core.metadata.AbsoluteTableIdentifier; +import org.apache.carbondata.core.metadata.schema.table.CarbonTable; +import org.apache.carbondata.core.metadata.schema.table.column.CarbonColumn; +import org.apache.carbondata.core.metadata.schema.table.column.CarbonDimension; +import org.apache.carbondata.core.metadata.schema.table.column.ColumnSchema; +import org.apache.carbondata.core.reader.CarbonIndexFileReader; +import org.apache.carbondata.core.scan.filter.FilterUtil; +import org.apache.carbondata.core.scan.filter.executer.FilterExecuter; +import org.apache.carbondata.core.scan.filter.executer.ImplicitColumnFilterExecutor; +import org.apache.carbondata.core.scan.filter.resolver.FilterResolverIntf; +import org.apache.carbondata.core.util.CarbonMetadataUtil; +import org.apache.carbondata.core.util.path.CarbonTablePath; +import org.apache.carbondata.format.BlockIndex; + +@InterfaceAudience.Internal +public class StreamDataMap { + + private CarbonTable carbonTable; + + private AbsoluteTableIdentifier identifier; --- End diff -- If carbonTable is getting stored then no need to store identifier...you can get it from carbontable ---
[GitHub] carbondata pull request #2644: [CARBONDATA-2853] Implement file-level min/ma...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2644#discussion_r214307411 --- Diff: core/src/main/java/org/apache/carbondata/core/datamap/StreamDataMap.java --- @@ -0,0 +1,162 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.carbondata.core.datamap; + +import java.io.File; +import java.io.IOException; +import java.util.ArrayList; +import java.util.BitSet; +import java.util.List; +import java.util.Map; + +import org.apache.carbondata.common.annotations.InterfaceAudience; +import org.apache.carbondata.core.datastore.block.SegmentProperties; +import org.apache.carbondata.core.datastore.impl.FileFactory; +import org.apache.carbondata.core.indexstore.blockletindex.SegmentIndexFileStore; +import org.apache.carbondata.core.metadata.AbsoluteTableIdentifier; +import org.apache.carbondata.core.metadata.schema.table.CarbonTable; +import org.apache.carbondata.core.metadata.schema.table.column.CarbonColumn; +import org.apache.carbondata.core.metadata.schema.table.column.CarbonDimension; +import org.apache.carbondata.core.metadata.schema.table.column.ColumnSchema; +import org.apache.carbondata.core.reader.CarbonIndexFileReader; +import org.apache.carbondata.core.scan.filter.FilterUtil; +import org.apache.carbondata.core.scan.filter.executer.FilterExecuter; +import org.apache.carbondata.core.scan.filter.executer.ImplicitColumnFilterExecutor; +import org.apache.carbondata.core.scan.filter.resolver.FilterResolverIntf; +import org.apache.carbondata.core.util.CarbonMetadataUtil; +import org.apache.carbondata.core.util.path.CarbonTablePath; +import org.apache.carbondata.format.BlockIndex; + +@InterfaceAudience.Internal +public class StreamDataMap { + + private CarbonTable carbonTable; + + private AbsoluteTableIdentifier identifier; + + private FilterExecuter filterExecuter; + + public StreamDataMap(CarbonTable carbonTable) { +this.carbonTable = carbonTable; +this.identifier = carbonTable.getAbsoluteTableIdentifier(); + } + + public void init(FilterResolverIntf filterExp) { +if (filterExp != null) { + + List minMaxCacheColumns = new ArrayList<>(); + for (CarbonDimension dimension : carbonTable.getDimensions()) { +if (!dimension.isComplex()) { + minMaxCacheColumns.add(dimension); 
+} + } + minMaxCacheColumns.addAll(carbonTable.getMeasures()); + + List listOfColumns = + carbonTable.getTableInfo().getFactTable().getListOfColumns(); + int[] columnCardinality = new int[listOfColumns.size()]; + for (int index = 0; index < columnCardinality.length; index++) { +columnCardinality[index] = Integer.MAX_VALUE; + } + + SegmentProperties segmentProperties = + new SegmentProperties(listOfColumns, columnCardinality); + + filterExecuter = FilterUtil.getFilterExecuterTree( + filterExp, segmentProperties, null, minMaxCacheColumns); +} + } + + public List prune(List segments) throws IOException { +if (filterExecuter == null) { + return listAllStreamFiles(segments, false); +} else { + List streamFileList = new ArrayList<>(); + for (StreamFile streamFile : listAllStreamFiles(segments, true)) { +if (isScanRequire(streamFile)) { + streamFileList.add(streamFile); + streamFile.setMinMaxIndex(null); +} + } + return streamFileList; +} + } + + private boolean isScanRequire(StreamFile streamFile) { +// backward compatibility, old stream file without min/max index +if (streamFile.getMinMaxIndex() == null) { + return true; +} + +byte[][] maxValue = streamFile.getMinMaxIndex().getMaxValues(); +byte[][] m
[GitHub] carbondata pull request #2644: [CARBONDATA-2853] Implement file-level min/ma...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2644#discussion_r214313170 --- Diff: integration/spark-common/src/main/scala/org/apache/carbondata/spark/rdd/StreamHandoffRDD.scala --- @@ -205,8 +205,9 @@ class StreamHandoffRDD[K, V]( segmentList.add(Segment.toSegment(handOffSegmentId, null)) val splits = inputFormat.getSplitsOfStreaming( job, - carbonLoadModel.getCarbonDataLoadSchema.getCarbonTable.getAbsoluteTableIdentifier, - segmentList + segmentList, + carbonLoadModel.getCarbonDataLoadSchema.getCarbonTable, + null --- End diff -- Once you add the overloaded method as explained in above comment you can call the method with 3 arguments from here ---
[GitHub] carbondata pull request #2644: [CARBONDATA-2853] Implement file-level min/ma...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2644#discussion_r214311953 --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java --- @@ -342,60 +341,52 @@ public void refreshSegmentCacheIfRequired(JobContext job, CarbonTable carbonTabl /** * use file list in .carbonindex file to get the split of streaming. */ - public List getSplitsOfStreaming(JobContext job, AbsoluteTableIdentifier identifier, - List streamSegments) throws IOException { + public List getSplitsOfStreaming(JobContext job, List streamSegments, + CarbonTable carbonTable, FilterResolverIntf filterResolverIntf) throws IOException { List splits = new ArrayList(); if (streamSegments != null && !streamSegments.isEmpty()) { numStreamSegments = streamSegments.size(); long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job)); long maxSize = getMaxSplitSize(job); - for (Segment segment : streamSegments) { -String segmentDir = -CarbonTablePath.getSegmentPath(identifier.getTablePath(), segment.getSegmentNo()); -FileFactory.FileType fileType = FileFactory.getFileType(segmentDir); -if (FileFactory.isFileExist(segmentDir, fileType)) { - SegmentIndexFileStore segmentIndexFileStore = new SegmentIndexFileStore(); - segmentIndexFileStore.readAllIIndexOfSegment(segmentDir); - Map carbonIndexMap = segmentIndexFileStore.getCarbonIndexMap(); - CarbonIndexFileReader indexReader = new CarbonIndexFileReader(); - for (byte[] fileData : carbonIndexMap.values()) { -indexReader.openThriftReader(fileData); -try { - // map block index - while (indexReader.hasNext()) { -BlockIndex blockIndex = indexReader.readBlockIndexInfo(); -String filePath = segmentDir + File.separator + blockIndex.getFile_name(); -Path path = new Path(filePath); -long length = blockIndex.getFile_size(); -if (length != 0) { - BlockLocation[] blkLocations; - FileSystem fs = FileFactory.getFileSystem(path); - FileStatus file = fs.getFileStatus(path); - 
blkLocations = fs.getFileBlockLocations(path, 0, length); - long blockSize = file.getBlockSize(); - long splitSize = computeSplitSize(blockSize, minSize, maxSize); - long bytesRemaining = length; - while (((double) bytesRemaining) / splitSize > 1.1) { -int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining); -splits.add(makeSplit(segment.getSegmentNo(), path, length - bytesRemaining, -splitSize, blkLocations[blkIndex].getHosts(), -blkLocations[blkIndex].getCachedHosts(), FileFormat.ROW_V1)); -bytesRemaining -= splitSize; - } - if (bytesRemaining != 0) { -int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining); -splits.add(makeSplit(segment.getSegmentNo(), path, length - bytesRemaining, -bytesRemaining, blkLocations[blkIndex].getHosts(), -blkLocations[blkIndex].getCachedHosts(), FileFormat.ROW_V1)); - } -} else { - //Create empty hosts array for zero length files - splits.add(makeSplit(segment.getSegmentNo(), path, 0, length, new String[0], - FileFormat.ROW_V1)); -} - } -} finally { - indexReader.closeThriftReader(); + + if (filterResolverIntf == null) { +if (carbonTable != null) { + Expression filter = getFilterPredicates(job.getConfiguration()); + if (filter != null) { +carbonTable.processFilterExpression(filter, null, null); +filterResolverIntf = carbonTable.resolveFilter(filter); + } +} + } + StreamDataMap streamDataMap = + DataMapStoreManager.getInstance().getStreamDataMap(carbonTable); + streamDataMap.init(filterResolverIntf); + List streamFiles = streamDataMap.prune(streamSegments); + for (StreamFile streamFile : streamFiles) { +if (FileFactory.isFileExist(streamFile.getFilePath())) { + Path path = new Path(streamFile.getFilePath()); + long length = streamFile.
[GitHub] carbondata pull request #2644: [CARBONDATA-2853] Implement file-level min/ma...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2644#discussion_r214311329 --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java --- @@ -342,60 +341,52 @@ public void refreshSegmentCacheIfRequired(JobContext job, CarbonTable carbonTabl /** * use file list in .carbonindex file to get the split of streaming. */ - public List getSplitsOfStreaming(JobContext job, AbsoluteTableIdentifier identifier, - List streamSegments) throws IOException { + public List getSplitsOfStreaming(JobContext job, List streamSegments, + CarbonTable carbonTable, FilterResolverIntf filterResolverIntf) throws IOException { --- End diff -- You can write an overloaded method for getSplitsOfStreaming. One which accepts 3 parameters and one with 4 parameters. 1. getSplitsOfStreaming(JobContext job, AbsoluteTableIdentifier identifier, List streamSegments) -- From this method you can call the other method and pass null as the 4th argument. This will avoid passing null at all places above. 2. getSplitsOfStreaming(JobContext job, List streamSegments, CarbonTable carbonTable, FilterResolverIntf filterResolverIntf) ---
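The delegation shape proposed above can be sketched as follows. The parameter and return types are simplified placeholders (strings and string lists instead of `JobContext`, `Segment`, and `InputSplit`); the point is the 3-argument overload delegating to the 4-argument one so no caller has to pass null itself:

```java
import java.util.Collections;
import java.util.List;

public class SplitsApi {
  // Convenience overload: callers without a filter never see the nulls.
  static List<String> getSplitsOfStreaming(String job, List<String> streamSegments) {
    return getSplitsOfStreaming(job, streamSegments, null, null);
  }

  // Full variant: carbonTable and filterResolver may be null.
  static List<String> getSplitsOfStreaming(String job, List<String> streamSegments,
      Object carbonTable, Object filterResolver) {
    if (streamSegments == null || streamSegments.isEmpty()) {
      return Collections.emptyList();
    }
    // ... real index reading, pruning, and split computation would go here ...
    return streamSegments;
  }

  public static void main(String[] args) {
    System.out.println(
        getSplitsOfStreaming("job", Collections.singletonList("segment-0")).size());
  }
}
```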
[GitHub] carbondata pull request #2644: [CARBONDATA-2853] Implement file-level min/ma...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2644#discussion_r214305126 --- Diff: core/src/main/java/org/apache/carbondata/core/datamap/StreamDataMap.java --- @@ -0,0 +1,162 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.carbondata.core.datamap; + +import java.io.File; +import java.io.IOException; +import java.util.ArrayList; +import java.util.BitSet; +import java.util.List; +import java.util.Map; + +import org.apache.carbondata.common.annotations.InterfaceAudience; +import org.apache.carbondata.core.datastore.block.SegmentProperties; +import org.apache.carbondata.core.datastore.impl.FileFactory; +import org.apache.carbondata.core.indexstore.blockletindex.SegmentIndexFileStore; +import org.apache.carbondata.core.metadata.AbsoluteTableIdentifier; +import org.apache.carbondata.core.metadata.schema.table.CarbonTable; +import org.apache.carbondata.core.metadata.schema.table.column.CarbonColumn; +import org.apache.carbondata.core.metadata.schema.table.column.CarbonDimension; +import org.apache.carbondata.core.metadata.schema.table.column.ColumnSchema; +import org.apache.carbondata.core.reader.CarbonIndexFileReader; +import org.apache.carbondata.core.scan.filter.FilterUtil; +import org.apache.carbondata.core.scan.filter.executer.FilterExecuter; +import org.apache.carbondata.core.scan.filter.executer.ImplicitColumnFilterExecutor; +import org.apache.carbondata.core.scan.filter.resolver.FilterResolverIntf; +import org.apache.carbondata.core.util.CarbonMetadataUtil; +import org.apache.carbondata.core.util.path.CarbonTablePath; +import org.apache.carbondata.format.BlockIndex; + +@InterfaceAudience.Internal +public class StreamDataMap { --- End diff -- Please check the feasibility if we can extend DataMap interface and implement all its method to keep it similar like BlockDataMap. I think it should be feasible ---
[GitHub] carbondata pull request #2644: [CARBONDATA-2853] Implement file-level min/ma...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2644#discussion_r214316607 --- Diff: streaming/src/main/java/org/apache/carbondata/streaming/CarbonStreamRecordWriter.java --- @@ -212,9 +271,13 @@ private void initializeAtFirstRow() throws IOException, InterruptedException { byte[] col = (byte[]) columnValue; output.writeShort(col.length); output.writeBytes(col); +dimensionStatsCollectors[dimCount].update(col); } else { output.writeInt((int) columnValue); + dimensionStatsCollectors[dimCount].update(ByteUtil.toBytes((int) columnValue)); --- End diff -- For min/max comparison you are converting from Int to byte array for all the rows. This can impact the writing performance. Instead you can typecast into Int and do the comparison. After all the data is loaded then at the end you can convert all the values into byte array based on datatype. At that time it will be only one conversion for the final min/max values ---
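The optimization suggested above, sketched for a single int column: compare as ints while rows stream in (one comparison per row, no allocation), and serialize the final min/max to bytes exactly once after the load ends. The class and method names are illustrative, not the project's stats-collector API:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class IntMinMaxCollector {
  private int min = Integer.MAX_VALUE;
  private int max = Integer.MIN_VALUE;

  // Hot path: plain int comparisons, no byte[] conversion per row.
  void update(int value) {
    if (value < min) min = value;
    if (value > max) max = value;
  }

  // Cold path, run once after loading: big-endian int-to-bytes, the same
  // byte order a utility like ByteUtil.toBytes(int) would produce.
  byte[] minBytes() { return ByteBuffer.allocate(4).putInt(min).array(); }
  byte[] maxBytes() { return ByteBuffer.allocate(4).putInt(max).array(); }

  public static void main(String[] args) {
    IntMinMaxCollector c = new IntMinMaxCollector();
    for (int v : new int[] {7, -3, 42}) {
      c.update(v);
    }
    System.out.println(Arrays.toString(c.maxBytes()));
  }
}
```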
[GitHub] carbondata issue #2671: [CARBONDATA-2876]AVRO datatype support through SDK
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2671 LGTM...can be merged once build passes ---
[GitHub] carbondata pull request #2671: [CARBONDATA-2876]AVRO datatype support throug...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2671#discussion_r214244301 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/AvroCarbonWriter.java --- @@ -213,6 +181,124 @@ private Object avroFieldToObject(Schema.Field avroField, Object fieldValue) { } out = new ArrayObject(arrayChildObjects); break; + case UNION: +// Union type will be internally stored as Struct +// Fill data object only if fieldvalue is instance of datatype +// For other field datatypes, fill value as Null +List unionFields = avroField.schema().getTypes(); +int notNullUnionFieldsCount = 0; +for (Schema unionField : unionFields) { + if (!unionField.getType().equals(Schema.Type.NULL)) { +notNullUnionFieldsCount++; + } +} +Object[] values = new Object[notNullUnionFieldsCount]; +int j = 0; +for (Schema unionField : unionFields) { + if (!unionField.getType().equals(Schema.Type.NULL) && checkFieldValueType( + unionField.getType(), fieldValue)) { +values[j] = avroFieldToObjectForUnionType(unionField, fieldValue, avroField); +break; + } + if (notNullUnionFieldsCount != 1) { +j++; + } +} --- End diff -- Modify the code as below:

    int j = 0;
    for (Schema unionField : unionFields) {
      if (unionField.getType().equals(Schema.Type.NULL)) {
        continue;
      }
      if (checkFieldValueType(unionField.getType(), fieldValue)) {
        values[j] = avroFieldToObjectForUnionType(unionField, fieldValue, avroField);
        break;
      }
      j++;
    }

---
[GitHub] carbondata pull request #2654: [CARBONDATA-2896] Adaptive Encoding for Primi...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2654#discussion_r214047186 --- Diff: processing/src/main/java/org/apache/carbondata/processing/sort/sortdata/SortParameters.java --- @@ -88,6 +88,12 @@ private DataType[] measureDataType; + private DataType[] noDicDataType; + + private DataType[] noDicSortDataType; + + private DataType[] noDicNoSortDataType; --- End diff -- write the usage of each type to get a better understanding ---
[GitHub] carbondata pull request #2654: [CARBONDATA-2896] Adaptive Encoding for Primi...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2654#discussion_r214044420 --- Diff: core/src/main/java/org/apache/carbondata/core/util/DataTypeUtil.java --- @@ -976,4 +978,122 @@ public static long getDataBasedOnRestructuredDataType(byte[] data, DataType rest return value; } + /** + * Check if the column is a no dictionary primitive column + * + * @param dataType + * @return + */ + public static boolean isPrimitiveColumn(DataType dataType) { +if (dataType == DataTypes.BOOLEAN || dataType == DataTypes.BYTE || dataType == DataTypes.SHORT +|| dataType == DataTypes.INT || dataType == DataTypes.LONG || DataTypes.isDecimal(dataType) +|| dataType == DataTypes.FLOAT || dataType == DataTypes.DOUBLE +|| dataType == DataTypes.BYTE_ARRAY) { + return true; +} +return false; + } + + /** + * Put the data to unsafe memory + * + * @param dataType + * @param data + * @param baseObject + * @param address + * @param size + * @param sizeInBytes + */ + public static void putDataToUnsafe(DataType dataType, Object data, Object baseObject, + long address, int size, int sizeInBytes) { +dataType = DataTypeUtil.valueOf(dataType.getName()); +if (dataType == DataTypes.BOOLEAN) { + CarbonUnsafe.getUnsafe().putBoolean(baseObject, address + size, (boolean) data); +} else if (dataType == DataTypes.BYTE) { + CarbonUnsafe.getUnsafe().putByte(baseObject, address + size, (byte) data); +} else if (dataType == DataTypes.SHORT) { + CarbonUnsafe.getUnsafe().putShort(baseObject, address + size, (short) data); +} else if (dataType == DataTypes.INT) { + CarbonUnsafe.getUnsafe().putInt(baseObject, address + size, (int) data); +} else if (dataType == DataTypes.LONG) { + CarbonUnsafe.getUnsafe().putLong(baseObject, address + size, (long) data); +} else if (DataTypes.isDecimal(dataType) || dataType == DataTypes.DOUBLE) { + CarbonUnsafe.getUnsafe().putDouble(baseObject, address + size, (double) data); +} else if (dataType == DataTypes.FLOAT) { + 
CarbonUnsafe.getUnsafe().putFloat(baseObject, address + size, (float) data); +} else if (dataType == DataTypes.BYTE_ARRAY) { + CarbonUnsafe.getUnsafe() + .copyMemory(data, CarbonUnsafe.BYTE_ARRAY_OFFSET, baseObject, address + size, + sizeInBytes); +} + } + + /** + * Retrieve/Get the data from unsafe memory + * + * @param dataType + * @param baseObject + * @param address + * @param size + * @param sizeInBytes + * @return + */ + public static Object getDataFromUnsafe(DataType dataType, Object baseObject, long address, + int size, int sizeInBytes) { +dataType = DataTypeUtil.valueOf(dataType.getName()); +Object data = new Object(); +if (dataType == DataTypes.BOOLEAN) { + data = CarbonUnsafe.getUnsafe().getBoolean(baseObject, address + size); +} else if (dataType == DataTypes.BYTE) { + data = CarbonUnsafe.getUnsafe().getByte(baseObject, address + size); +} else if (dataType == DataTypes.SHORT) { + data = CarbonUnsafe.getUnsafe().getShort(baseObject, address + size); +} else if (dataType == DataTypes.INT) { + data = CarbonUnsafe.getUnsafe().getInt(baseObject, address + size); +} else if (dataType == DataTypes.LONG) { + data = CarbonUnsafe.getUnsafe().getLong(baseObject, address + size); +} else if (DataTypes.isDecimal(dataType) || dataType == DataTypes.DOUBLE) { + data = CarbonUnsafe.getUnsafe().getDouble(baseObject, address + size); +} else if (dataType == DataTypes.FLOAT) { + data = CarbonUnsafe.getUnsafe().getDouble(baseObject, address + size); +} else if (dataType == DataTypes.BYTE_ARRAY) { + CarbonUnsafe.getUnsafe() + .copyMemory(baseObject, address + size, data, CarbonUnsafe.BYTE_ARRAY_OFFSET, + sizeInBytes); +} +return data; + } + + /** + * Put the data to vector + * + * @param vector + * @param value + * @param vectorRow + * @param length + */ + public static void putDataToVector(CarbonColumnVector vector, byte[] value, int vectorRow, --- End diff -- Move this method to DimensionVectorDataProcessor.java ---
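The thread above asks to move the unsafe put/get dispatch out of DataTypeUtil into a new CarbonUnsafeUtil class. Below is a minimal sketch of the symmetric dispatch under that suggestion; it uses java.nio.ByteBuffer as a stand-in for CarbonUnsafe so it runs without sun.misc.Unsafe, and the class and method names are illustrative, not CarbonData APIs. Note that the quoted diff appears to read the FLOAT case back with getDouble, which would break the put/get symmetry shown here.

```java
import java.nio.ByteBuffer;

// Illustrative stand-in for the CarbonUnsafeUtil class suggested in the
// review. ByteBuffer replaces CarbonUnsafe so the sketch is self-contained.
// The key property is that put and get dispatch on the same type
// symmetrically for every branch.
public class UnsafeDispatchSketch {
  public static void put(ByteBuffer buf, int offset, Object data) {
    if (data instanceof Integer) {
      buf.putInt(offset, (Integer) data);
    } else if (data instanceof Long) {
      buf.putLong(offset, (Long) data);
    } else if (data instanceof Float) {
      buf.putFloat(offset, (Float) data);   // read side must use getFloat
    } else if (data instanceof Double) {
      buf.putDouble(offset, (Double) data);
    }
  }

  public static Object get(ByteBuffer buf, int offset, Class<?> type) {
    if (type == Integer.class) return buf.getInt(offset);
    if (type == Long.class) return buf.getLong(offset);
    if (type == Float.class) return buf.getFloat(offset);   // not getDouble
    if (type == Double.class) return buf.getDouble(offset);
    return null;
  }

  // round-trip helper demonstrating the symmetry
  public static float roundTripFloat(float v) {
    ByteBuffer buf = ByteBuffer.allocate(8);
    put(buf, 0, v);
    return (Float) get(buf, 0, Float.class);
  }
}
```

A float written with putFloat only survives the round trip when read back with getFloat, which is why the review pushes the dispatch into one shared utility.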
[GitHub] carbondata pull request #2654: [CARBONDATA-2896] Adaptive Encoding for Primi...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2654#discussion_r214051061 --- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonDataProcessorUtil.java --- @@ -450,6 +450,114 @@ public static boolean isHeaderValid(String tableName, String[] csvHeader, return type; } + /** + * Get the no dictionary data types on the table + * + * @param databaseName + * @param tableName + * @return + */ + public static DataType[] getNoDicDataTypes(String databaseName, String tableName) { +CarbonTable carbonTable = CarbonMetadata.getInstance().getCarbonTable(databaseName, tableName); +List dimensions = carbonTable.getDimensionByTableName(tableName); +int noDicCount = 0; +for (int i = 0; i < dimensions.size(); i++) { + if (!dimensions.get(i).hasEncoding(Encoding.DICTIONARY)) { +noDicCount++; + } +} +DataType[] type = new DataType[noDicCount]; +noDicCount = 0; +for (int i = 0; i < dimensions.size(); i++) { + if (!dimensions.get(i).hasEncoding(Encoding.DICTIONARY)) { +type[noDicCount++] = dimensions.get(i).getDataType(); + } +} +return type; + } + + /** + * Get the no dictionary sort column mapping of the table + * + * @param databaseName + * @param tableName + * @return + */ + public static boolean[] getNoDicSortColMapping(String databaseName, String tableName) { +CarbonTable carbonTable = CarbonMetadata.getInstance().getCarbonTable(databaseName, tableName); +List dimensions = carbonTable.getDimensionByTableName(tableName); +List noDicIndexes = new ArrayList<>(dimensions.size()); +int noDicCount = 0; +for (int i = 0; i < dimensions.size(); i++) { + if (!dimensions.get(i).hasEncoding(Encoding.DICTIONARY)) { +noDicIndexes.add(i); +noDicCount++; + } +} + +boolean[] noDicSortColMapping = new boolean[noDicCount]; +for (int i = 0; i < noDicSortColMapping.length; i++) { + if (dimensions.get(noDicIndexes.get(i)).isSortColumn()) { +noDicSortColMapping[i] = true; + } +} +return noDicSortColMapping; + } + + /** 
+ * Get the data types of the no dictionary sort columns + * + * @param databaseName + * @param tableName + * @return + */ + public static DataType[] getNoDicSortDataTypes(String databaseName, String tableName) { +CarbonTable carbonTable = CarbonMetadata.getInstance().getCarbonTable(databaseName, tableName); +List dimensions = carbonTable.getDimensionByTableName(tableName); +int noDicSortCount = 0; +for (int i = 0; i < dimensions.size(); i++) { + if (!dimensions.get(i).hasEncoding(Encoding.DICTIONARY) && dimensions.get(i).isSortColumn()) { +noDicSortCount++; + } +} +DataType[] type = new DataType[noDicSortCount]; +noDicSortCount = 0; +for (int i = 0; i < dimensions.size(); i++) { + if (!dimensions.get(i).hasEncoding(Encoding.DICTIONARY) && dimensions.get(i).isSortColumn()) { +type[noDicSortCount++] = dimensions.get(i).getDataType(); + } +} +return type; + } + + /** + * Get the data types of the no dictionary no sort columns + * + * @param databaseName + * @param tableName + * @return + */ + public static DataType[] getNoDicNoSortDataTypes(String databaseName, String tableName) { +CarbonTable carbonTable = CarbonMetadata.getInstance().getCarbonTable(databaseName, tableName); +List dimensions = carbonTable.getDimensionByTableName(tableName); +int noDicNoSortCount = 0; +for (int i = 0; i < dimensions.size(); i++) { + if (!dimensions.get(i).hasEncoding(Encoding.DICTIONARY) && !dimensions.get(i) + .isSortColumn()) { +noDicNoSortCount++; + } +} +DataType[] type = new DataType[noDicNoSortCount]; +noDicNoSortCount = 0; +for (int i = 0; i < dimensions.size(); i++) { + if (!dimensions.get(i).hasEncoding(Encoding.DICTIONARY) && !dimensions.get(i) + .isSortColumn()) { +type[noDicNoSortCount++] = dimensions.get(i).getDataType(); + } +} +return type; --- End diff -- same comment as above ---
[GitHub] carbondata pull request #2654: [CARBONDATA-2896] Adaptive Encoding for Primi...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2654#discussion_r214049718 --- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonDataProcessorUtil.java --- @@ -450,6 +450,114 @@ public static boolean isHeaderValid(String tableName, String[] csvHeader, return type; } + /** + * Get the no dictionary data types on the table + * + * @param databaseName + * @param tableName + * @return + */ + public static DataType[] getNoDicDataTypes(String databaseName, String tableName) { +CarbonTable carbonTable = CarbonMetadata.getInstance().getCarbonTable(databaseName, tableName); +List dimensions = carbonTable.getDimensionByTableName(tableName); +int noDicCount = 0; +for (int i = 0; i < dimensions.size(); i++) { + if (!dimensions.get(i).hasEncoding(Encoding.DICTIONARY)) { +noDicCount++; + } +} +DataType[] type = new DataType[noDicCount]; +noDicCount = 0; +for (int i = 0; i < dimensions.size(); i++) { + if (!dimensions.get(i).hasEncoding(Encoding.DICTIONARY)) { +type[noDicCount++] = dimensions.get(i).getDataType(); + } +} +return type; + } + + /** + * Get the no dictionary sort column mapping of the table + * + * @param databaseName + * @param tableName + * @return + */ + public static boolean[] getNoDicSortColMapping(String databaseName, String tableName) { +CarbonTable carbonTable = CarbonMetadata.getInstance().getCarbonTable(databaseName, tableName); +List dimensions = carbonTable.getDimensionByTableName(tableName); +List noDicIndexes = new ArrayList<>(dimensions.size()); +int noDicCount = 0; +for (int i = 0; i < dimensions.size(); i++) { + if (!dimensions.get(i).hasEncoding(Encoding.DICTIONARY)) { +noDicIndexes.add(i); +noDicCount++; + } +} + +boolean[] noDicSortColMapping = new boolean[noDicCount]; +for (int i = 0; i < noDicSortColMapping.length; i++) { + if (dimensions.get(noDicIndexes.get(i)).isSortColumn()) { +noDicSortColMapping[i] = true; + } +} +return noDicSortColMapping; + } + + /** 
+ * Get the data types of the no dictionary sort columns + * + * @param databaseName + * @param tableName + * @return + */ + public static DataType[] getNoDicSortDataTypes(String databaseName, String tableName) { +CarbonTable carbonTable = CarbonMetadata.getInstance().getCarbonTable(databaseName, tableName); +List dimensions = carbonTable.getDimensionByTableName(tableName); +int noDicSortCount = 0; +for (int i = 0; i < dimensions.size(); i++) { + if (!dimensions.get(i).hasEncoding(Encoding.DICTIONARY) && dimensions.get(i).isSortColumn()) { +noDicSortCount++; + } +} +DataType[] type = new DataType[noDicSortCount]; +noDicSortCount = 0; +for (int i = 0; i < dimensions.size(); i++) { + if (!dimensions.get(i).hasEncoding(Encoding.DICTIONARY) && dimensions.get(i).isSortColumn()) { +type[noDicSortCount++] = dimensions.get(i).getDataType(); + } --- End diff -- same comment as above. Check on the callers and merge all 3 methods into one if possible ---
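The reviewer asks to merge the three near-identical helpers (no-dictionary, no-dictionary sort, no-dictionary no-sort data types) into one and to fill the result in a single pass. A hedged sketch of that refactor is below: one predicate-parameterized method replaces all three, collecting into an ArrayList instead of pre-counting. `Dim` is an invented stand-in for CarbonDimension to keep the sketch self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch of merging getNoDicDataTypes / getNoDicSortDataTypes /
// getNoDicNoSortDataTypes into one method, as the review suggests.
// Dim models only the two flags the helpers test.
public class MergedHelperSketch {
  static class Dim {
    final String type;
    final boolean dictionary;
    final boolean sortColumn;
    Dim(String type, boolean dictionary, boolean sortColumn) {
      this.type = type; this.dictionary = dictionary; this.sortColumn = sortColumn;
    }
  }

  // single pass, no counting pre-loop: collect then convert to array
  public static String[] typesMatching(List<Dim> dims, Predicate<Dim> filter) {
    List<String> types = new ArrayList<>();
    for (Dim dim : dims) {
      if (filter.test(dim)) {
        types.add(dim.type);
      }
    }
    return types.toArray(new String[0]);
  }

  // the three original helpers reduce to three predicates
  public static String[] noDic(List<Dim> dims) {
    return typesMatching(dims, d -> !d.dictionary);
  }
  public static String[] noDicSort(List<Dim> dims) {
    return typesMatching(dims, d -> !d.dictionary && d.sortColumn);
  }
  public static String[] noDicNoSort(List<Dim> dims) {
    return typesMatching(dims, d -> !d.dictionary && !d.sortColumn);
  }
}
```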
[GitHub] carbondata pull request #2654: [CARBONDATA-2896] Adaptive Encoding for Primi...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2654#discussion_r214055038 --- Diff: processing/src/main/java/org/apache/carbondata/processing/store/CarbonFactDataHandlerModel.java --- @@ -320,12 +325,22 @@ public static CarbonFactDataHandlerModel getCarbonFactDataHandlerModel(CarbonLoa List allDimensions = carbonTable.getDimensions(); int dictDimCount = allDimensions.size() - segmentProperties.getNumberOfNoDictionaryDimension() - segmentProperties.getComplexDimensions().size(); +CarbonColumn[] noDicAndComplexColumns = +new CarbonColumn[segmentProperties.getNumberOfNoDictionaryDimension() + segmentProperties +.getComplexDimensions().size()]; +int noDic = 0; --- End diff -- Remove this unused variable ---
[GitHub] carbondata pull request #2654: [CARBONDATA-2896] Adaptive Encoding for Primi...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2654#discussion_r214038515 --- Diff: core/src/main/java/org/apache/carbondata/core/util/DataTypeUtil.java --- @@ -976,4 +978,122 @@ public static long getDataBasedOnRestructuredDataType(byte[] data, DataType rest return value; } + /** + * Check if the column is a no dictionary primitive column + * + * @param dataType + * @return + */ + public static boolean isPrimitiveColumn(DataType dataType) { +if (dataType == DataTypes.BOOLEAN || dataType == DataTypes.BYTE || dataType == DataTypes.SHORT +|| dataType == DataTypes.INT || dataType == DataTypes.LONG || DataTypes.isDecimal(dataType) +|| dataType == DataTypes.FLOAT || dataType == DataTypes.DOUBLE +|| dataType == DataTypes.BYTE_ARRAY) { + return true; +} +return false; + } + + /** + * Put the data to unsafe memory + * + * @param dataType + * @param data + * @param baseObject + * @param address + * @param size + * @param sizeInBytes + */ + public static void putDataToUnsafe(DataType dataType, Object data, Object baseObject, --- End diff -- Create a new class CarbonUnsafeUtil and move this method there ---
[GitHub] carbondata pull request #2654: [CARBONDATA-2896] Adaptive Encoding for Primi...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2654#discussion_r214038624 --- Diff: core/src/main/java/org/apache/carbondata/core/util/DataTypeUtil.java --- @@ -976,4 +978,122 @@ public static long getDataBasedOnRestructuredDataType(byte[] data, DataType rest return value; } + /** + * Check if the column is a no dictionary primitive column + * + * @param dataType + * @return + */ + public static boolean isPrimitiveColumn(DataType dataType) { +if (dataType == DataTypes.BOOLEAN || dataType == DataTypes.BYTE || dataType == DataTypes.SHORT +|| dataType == DataTypes.INT || dataType == DataTypes.LONG || DataTypes.isDecimal(dataType) +|| dataType == DataTypes.FLOAT || dataType == DataTypes.DOUBLE +|| dataType == DataTypes.BYTE_ARRAY) { + return true; +} +return false; + } + + /** + * Put the data to unsafe memory + * + * @param dataType + * @param data + * @param baseObject + * @param address + * @param size + * @param sizeInBytes + */ + public static void putDataToUnsafe(DataType dataType, Object data, Object baseObject, + long address, int size, int sizeInBytes) { +dataType = DataTypeUtil.valueOf(dataType.getName()); +if (dataType == DataTypes.BOOLEAN) { + CarbonUnsafe.getUnsafe().putBoolean(baseObject, address + size, (boolean) data); +} else if (dataType == DataTypes.BYTE) { + CarbonUnsafe.getUnsafe().putByte(baseObject, address + size, (byte) data); +} else if (dataType == DataTypes.SHORT) { + CarbonUnsafe.getUnsafe().putShort(baseObject, address + size, (short) data); +} else if (dataType == DataTypes.INT) { + CarbonUnsafe.getUnsafe().putInt(baseObject, address + size, (int) data); +} else if (dataType == DataTypes.LONG) { + CarbonUnsafe.getUnsafe().putLong(baseObject, address + size, (long) data); +} else if (DataTypes.isDecimal(dataType) || dataType == DataTypes.DOUBLE) { + CarbonUnsafe.getUnsafe().putDouble(baseObject, address + size, (double) data); +} else if (dataType == DataTypes.FLOAT) { + 
CarbonUnsafe.getUnsafe().putFloat(baseObject, address + size, (float) data); +} else if (dataType == DataTypes.BYTE_ARRAY) { + CarbonUnsafe.getUnsafe() + .copyMemory(data, CarbonUnsafe.BYTE_ARRAY_OFFSET, baseObject, address + size, + sizeInBytes); +} + } + + /** + * Retrieve/Get the data from unsafe memory + * + * @param dataType + * @param baseObject + * @param address + * @param size + * @param sizeInBytes + * @return + */ + public static Object getDataFromUnsafe(DataType dataType, Object baseObject, long address, --- End diff -- same comment as above ---
[GitHub] carbondata pull request #2654: [CARBONDATA-2896] Adaptive Encoding for Primi...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2654#discussion_r214049002 --- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonDataProcessorUtil.java --- @@ -450,6 +450,114 @@ public static boolean isHeaderValid(String tableName, String[] csvHeader, return type; } + /** + * Get the no dictionary data types on the table + * + * @param databaseName + * @param tableName + * @return + */ + public static DataType[] getNoDicDataTypes(String databaseName, String tableName) { +CarbonTable carbonTable = CarbonMetadata.getInstance().getCarbonTable(databaseName, tableName); +List dimensions = carbonTable.getDimensionByTableName(tableName); +int noDicCount = 0; +for (int i = 0; i < dimensions.size(); i++) { + if (!dimensions.get(i).hasEncoding(Encoding.DICTIONARY)) { +noDicCount++; + } +} +DataType[] type = new DataType[noDicCount]; +noDicCount = 0; +for (int i = 0; i < dimensions.size(); i++) { + if (!dimensions.get(i).hasEncoding(Encoding.DICTIONARY)) { +type[noDicCount++] = dimensions.get(i).getDataType(); + } +} --- End diff -- The below for loop can be removed and the filling of data type can be done in single iteration by taking an arraylist and while sending back you can convert back to array ---
[GitHub] carbondata pull request #2654: [CARBONDATA-2896] Adaptive Encoding for Primi...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2654#discussion_r214049631 --- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonDataProcessorUtil.java --- @@ -450,6 +450,114 @@ public static boolean isHeaderValid(String tableName, String[] csvHeader, return type; } + /** + * Get the no dictionary data types on the table + * + * @param databaseName + * @param tableName + * @return + */ + public static DataType[] getNoDicDataTypes(String databaseName, String tableName) { +CarbonTable carbonTable = CarbonMetadata.getInstance().getCarbonTable(databaseName, tableName); +List dimensions = carbonTable.getDimensionByTableName(tableName); +int noDicCount = 0; +for (int i = 0; i < dimensions.size(); i++) { + if (!dimensions.get(i).hasEncoding(Encoding.DICTIONARY)) { +noDicCount++; + } +} +DataType[] type = new DataType[noDicCount]; +noDicCount = 0; +for (int i = 0; i < dimensions.size(); i++) { + if (!dimensions.get(i).hasEncoding(Encoding.DICTIONARY)) { +type[noDicCount++] = dimensions.get(i).getDataType(); + } +} +return type; + } + + /** + * Get the no dictionary sort column mapping of the table + * + * @param databaseName + * @param tableName + * @return + */ + public static boolean[] getNoDicSortColMapping(String databaseName, String tableName) { +CarbonTable carbonTable = CarbonMetadata.getInstance().getCarbonTable(databaseName, tableName); +List dimensions = carbonTable.getDimensionByTableName(tableName); +List noDicIndexes = new ArrayList<>(dimensions.size()); +int noDicCount = 0; +for (int i = 0; i < dimensions.size(); i++) { + if (!dimensions.get(i).hasEncoding(Encoding.DICTIONARY)) { +noDicIndexes.add(i); +noDicCount++; + } +} + +boolean[] noDicSortColMapping = new boolean[noDicCount]; +for (int i = 0; i < noDicSortColMapping.length; i++) { + if (dimensions.get(noDicIndexes.get(i)).isSortColumn()) { +noDicSortColMapping[i] = true; + } --- End diff -- same comment as above ---
[GitHub] carbondata pull request #2654: [CARBONDATA-2896] Adaptive Encoding for Primi...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2654#discussion_r214055654 --- Diff: processing/src/main/java/org/apache/carbondata/processing/store/CarbonFactDataHandlerModel.java --- @@ -320,12 +325,22 @@ public static CarbonFactDataHandlerModel getCarbonFactDataHandlerModel(CarbonLoa List allDimensions = carbonTable.getDimensions(); int dictDimCount = allDimensions.size() - segmentProperties.getNumberOfNoDictionaryDimension() - segmentProperties.getComplexDimensions().size(); +CarbonColumn[] noDicAndComplexColumns = +new CarbonColumn[segmentProperties.getNumberOfNoDictionaryDimension() + segmentProperties +.getComplexDimensions().size()]; +int noDic = 0; +int noDicAndComp = 0; for (CarbonDimension dim : allDimensions) { if (!dim.isComplex() && !dim.hasEncoding(Encoding.DICTIONARY) && dim.getDataType() == DataTypes.VARCHAR) { // ordinal is set in CarbonTable.fillDimensionsAndMeasuresForTables() varcharDimIdxInNoDict.add(dim.getOrdinal() - dictDimCount); } + if (!dim.hasEncoding(Encoding.DICTIONARY) || dim.isComplex() || null != dim + .getComplexParentDimension()) { +noDicAndComplexColumns[noDicAndComp++] = --- End diff -- Modify the if condition to check only for dictionary encoding: if (!dim.hasEncoding(Encoding.DICTIONARY)) ---
[GitHub] carbondata pull request #2671: [CARBONDATA-2876]AVRO datatype support throug...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2671#discussion_r214024187 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/AvroCarbonWriter.java --- @@ -310,6 +503,31 @@ private static Field prepareFields(Schema.Field avroField) { } else { return null; } + case UNION: +int i = 0; +// Get union types and store as Struct +ArrayList unionFields = new ArrayList<>(); +for (Schema avroSubField : avroField.schema().getTypes()) { + StructField unionField = prepareSubFields(avroField.name() + i++, avroSubField); --- End diff -- check for NULL schema here ---
[GitHub] carbondata pull request #2671: [CARBONDATA-2876]AVRO datatype support throug...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2671#discussion_r214023293 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/AvroCarbonWriter.java --- @@ -213,6 +181,126 @@ private Object avroFieldToObject(Schema.Field avroField, Object fieldValue) { } out = new ArrayObject(arrayChildObjects); break; + case UNION: +// Union type will be internally stored as Struct +// Fill data object only if fieldvalue is instance of datatype +// For other field datatypes, fill value as Null +List unionFields = avroField.schema().getTypes(); +int notNullUnionFieldsCount = 0; +for (Schema unionField : unionFields) { + if (!unionField.getType().equals(Schema.Type.NULL)) { +notNullUnionFieldsCount++; + } +} +Object[] values = new Object[notNullUnionFieldsCount]; +int j = 0; +for (Schema unionField : unionFields) { + if (!unionField.getType().equals(Schema.Type.NULL)) { +if (checkFieldValueType(unionField.getType(), fieldValue)) { + values[j] = avroFieldToObjectForUnionType(unionField, fieldValue, avroField); +} else { + values[j] = null; +} +j++; + } +} +out = new StructObject(values); +break; + default: +out = avroPrimitiveFieldToObject(type, logicalType, fieldValue); +} +return out; + } + + /** + * For Union type, fill data if Schema.Type is instance of fieldValue + * and return result + * + * @param type + * @param fieldValue + * @return + */ + private boolean checkFieldValueType(Schema.Type type, Object fieldValue) { +switch (type) { + case INT: +return (fieldValue instanceof Integer); + case BOOLEAN: +return (fieldValue instanceof Boolean); + case LONG: +return (fieldValue instanceof Long); + case DOUBLE: +return (fieldValue instanceof Double); + case STRING: +return (fieldValue instanceof Utf8 || fieldValue instanceof String); + case FLOAT: +return (fieldValue instanceof Float); + case RECORD: +return (fieldValue instanceof GenericData.Record); + case ARRAY: +return (fieldValue instanceof 
GenericData.Array || fieldValue instanceof ArrayList); + case BYTES: +return (fieldValue instanceof ByteBuffer); + case MAP: +return (fieldValue instanceof HashMap); + case ENUM: +return (fieldValue instanceof GenericData.EnumSymbol); + default: +return false; +} + } + + private Object avroPrimitiveFieldToObject(Schema.Type type, LogicalType logicalType, + Object fieldValue) { +Object out; +switch (type) { + case INT: +if (logicalType != null) { + if (logicalType instanceof LogicalTypes.Date) { +int dateIntValue = (int) fieldValue; +out = dateIntValue * DateDirectDictionaryGenerator.MILLIS_PER_DAY; + } else { +LOGGER.warn("Actual type: INT, Logical Type: " + logicalType.getName()); +out = fieldValue; + } +} else { + out = fieldValue; +} +break; + case BOOLEAN: + case LONG: +if (logicalType != null && !(logicalType instanceof LogicalTypes.TimestampMillis)) { + if (logicalType instanceof LogicalTypes.TimestampMicros) { +long dateIntValue = (long) fieldValue; +out = dateIntValue / 1000L; + } else { +LOGGER.warn("Actual type: INT, Logical Type: " + logicalType.getName()); +out = fieldValue; + } +} else { + out = fieldValue; +} +break; + case DOUBLE: + case STRING: + case ENUM: +out = fieldValue; +break; + case FLOAT: +// direct conversion will change precision. So parse from string. +// also carbon internally needs float as double +out = Double.parseDouble(fieldValue.toString()); +break; + case BYTES: +// DECIMAL type is defined in Avro as a BYTE type with the logicalType property +// set to "decimal" and a specified precision and scale +if (logicalType instanceof LogicalTypes.Decimal) { + out = new BigDecimal(new S
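The avroPrimitiveFieldToObject logic quoted above performs two logical-type conversions: an Avro DATE (an int counting days since epoch) is scaled to milliseconds, and a TIMESTAMP-MICROS long is divided down to milliseconds. A small sketch of those two conversions is below; the 86400000 value is an assumption standing in for DateDirectDictionaryGenerator.MILLIS_PER_DAY, and the class name is invented.

```java
// Sketch of the two logical-type conversions in the quoted diff.
// Assumption: MILLIS_PER_DAY matches the constant used by
// DateDirectDictionaryGenerator (24 * 60 * 60 * 1000).
public class AvroLogicalTypeSketch {
  static final long MILLIS_PER_DAY = 86_400_000L;

  // Avro DATE: int days since epoch -> epoch millis
  public static long dateToMillis(int daysSinceEpoch) {
    return daysSinceEpoch * MILLIS_PER_DAY;
  }

  // Avro TIMESTAMP-MICROS: long micros -> epoch millis
  public static long timestampMicrosToMillis(long micros) {
    return micros / 1000L;
  }
}
```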
[GitHub] carbondata pull request #2671: [CARBONDATA-2876]AVRO datatype support throug...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2671#discussion_r214024597 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/AvroCarbonWriter.java --- @@ -310,6 +503,31 @@ private static Field prepareFields(Schema.Field avroField) { } else { return null; } + case UNION: +int i = 0; +// Get union types and store as Struct +ArrayList unionFields = new ArrayList<>(); +for (Schema avroSubField : avroField.schema().getTypes()) { + StructField unionField = prepareSubFields(avroField.name() + i++, avroSubField); + if (unionField != null) { +unionFields.add(unionField); + } +} +if (unionFields.size() != 0) { --- End diff -- replace 'unionFields.size() != 0' with '!unionFields.isEmpty()' ---
[GitHub] carbondata pull request #2671: [CARBONDATA-2876]AVRO datatype support throug...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2671#discussion_r214022747 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/AvroCarbonWriter.java --- @@ -213,6 +181,126 @@ private Object avroFieldToObject(Schema.Field avroField, Object fieldValue) { } out = new ArrayObject(arrayChildObjects); break; + case UNION: +// Union type will be internally stored as Struct +// Fill data object only if fieldvalue is instance of datatype +// For other field datatypes, fill value as Null +List unionFields = avroField.schema().getTypes(); +int notNullUnionFieldsCount = 0; +for (Schema unionField : unionFields) { + if (!unionField.getType().equals(Schema.Type.NULL)) { +notNullUnionFieldsCount++; + } +} +Object[] values = new Object[notNullUnionFieldsCount]; +int j = 0; +for (Schema unionField : unionFields) { + if (!unionField.getType().equals(Schema.Type.NULL)) { +if (checkFieldValueType(unionField.getType(), fieldValue)) { + values[j] = avroFieldToObjectForUnionType(unionField, fieldValue, avroField); +} else { + values[j] = null; +} --- End diff -- 1. Remove else block 2. Combine above 2 if conditions into 1 using && operator 3. break the loop once if check is success ---
[GitHub] carbondata issue #2663: [CARBONDATA-2894] Add support for complex map type t...
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2663 retest this please ---
[GitHub] carbondata issue #2671: [CARBONDATA-2876]AVRO datatype support through SDK
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2671 @Indhumathi27 ...please modify the code as per the comments; then we can continue with further review of the code ---
[GitHub] carbondata pull request #2671: [CARBONDATA-2876]AVRO datatype support throug...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2671#discussion_r213914935 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/AvroCarbonWriter.java --- @@ -213,6 +218,225 @@ private Object avroFieldToObject(Schema.Field avroField, Object fieldValue) { } out = new ArrayObject(arrayChildObjects); break; + case BYTES: +// DECIMAL type is defined in Avro as a BYTE type with the logicalType property +// set to "decimal" and a specified precision and scale +if (logicalType instanceof LogicalTypes.Decimal) { + out = new BigDecimal(new String(((ByteBuffer) fieldValue).array(), + CarbonCommonConstants.DEFAULT_CHARSET_CLASS)); + ; --- End diff -- Remove this semi-colon ---
[GitHub] carbondata pull request #2671: [CARBONDATA-2876]AVRO datatype support throug...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2671#discussion_r213916136 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/AvroCarbonWriter.java --- @@ -213,6 +218,225 @@ private Object avroFieldToObject(Schema.Field avroField, Object fieldValue) { } out = new ArrayObject(arrayChildObjects); break; + case BYTES: +// DECIMAL type is defined in Avro as a BYTE type with the logicalType property +// set to "decimal" and a specified precision and scale +if (logicalType instanceof LogicalTypes.Decimal) { + out = new BigDecimal(new String(((ByteBuffer) fieldValue).array(), + CarbonCommonConstants.DEFAULT_CHARSET_CLASS)); + ; +} else { + out = fieldValue; +} +break; + case UNION: +List fieldsUnion = avroField.schema().getTypes(); +int countIfNotNull = 0; +for (Schema unionField : fieldsUnion) { + if (!unionField.getType().equals(Schema.Type.NULL)) { +countIfNotNull++; + } +} +Object[] values; +values = new Object[countIfNotNull]; --- End diff -- 1. Rename countIfNotNull to notNullUnionFieldsCount 2. merge above 2 lines 'Object[] values = new Object[countIfNotNull];' 3. Check union behavior for only NULL type and handle if any special handling is required ---
[GitHub] carbondata pull request #2671: [CARBONDATA-2876]AVRO datatype support throug...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2671#discussion_r213915304 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/AvroCarbonWriter.java --- @@ -213,6 +218,225 @@ private Object avroFieldToObject(Schema.Field avroField, Object fieldValue) { } out = new ArrayObject(arrayChildObjects); break; + case BYTES: +// DECIMAL type is defined in Avro as a BYTE type with the logicalType property +// set to "decimal" and a specified precision and scale +if (logicalType instanceof LogicalTypes.Decimal) { + out = new BigDecimal(new String(((ByteBuffer) fieldValue).array(), + CarbonCommonConstants.DEFAULT_CHARSET_CLASS)); + ; +} else { + out = fieldValue; --- End diff -- else case handling is not required as the Binary data type is not supported...so write a proper comment to say that only the decimal logical type is supported for the Avro BYTES data type. Once the binary data type is supported, we can add the else block ---
[GitHub] carbondata pull request #2671: [CARBONDATA-2876]AVRO datatype support throug...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2671#discussion_r213922678 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/AvroCarbonWriter.java --- @@ -213,6 +218,225 @@ private Object avroFieldToObject(Schema.Field avroField, Object fieldValue) { } out = new ArrayObject(arrayChildObjects); break; + case BYTES: +// DECIMAL type is defined in Avro as a BYTE type with the logicalType property +// set to "decimal" and a specified precision and scale +if (logicalType instanceof LogicalTypes.Decimal) { + out = new BigDecimal(new String(((ByteBuffer) fieldValue).array(), + CarbonCommonConstants.DEFAULT_CHARSET_CLASS)); + ; +} else { + out = fieldValue; +} +break; + case UNION: +List fieldsUnion = avroField.schema().getTypes(); +int countIfNotNull = 0; +for (Schema unionField : fieldsUnion) { + if (!unionField.getType().equals(Schema.Type.NULL)) { +countIfNotNull++; + } +} +Object[] values; +values = new Object[countIfNotNull]; +int j = 0; +for (Schema unionField : fieldsUnion) { + if (!unionField.getType().equals(Schema.Type.NULL)) { +values[j] = avroFieldToObjectForUnionType(unionField, fieldValue, avroField); --- End diff -- 1. Try to reuse the code as much as possible. You can extract the primitives types computation in a separate method and override only complex types as different methods for union and rest of the data types 2. Write a function for union type to identify the schema for which the computation need to be done. Do not call the computation for all the union types as at one time only one type of value will exists ---
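The review above suggests two things for avroFieldToObject: share one primitive-conversion routine between the union and non-union paths, and for UNION first identify the single branch whose type matches the runtime value instead of converting against every branch. A hedged sketch of that structure follows; plain Strings stand in for Avro Schema.Type, NULL handling and array sizing are simplified, and all names are invented for illustration.

```java
import java.util.List;

// Sketch of the suggested union dispatch: find the one matching branch,
// convert only it (reusing the shared primitive converter), and break,
// since at most one union branch can hold the value.
public class UnionDispatchSketch {
  // shared primitive conversion, reused by union and non-union paths
  static Object convertPrimitive(String type, Object value) {
    if (type.equals("FLOAT")) {
      // mirrors the quoted diff: carbon internally needs float as double,
      // so parse from string to avoid precision change
      return Double.parseDouble(value.toString());
    }
    return value;
  }

  static boolean matches(String type, Object value) {
    switch (type) {
      case "INT":    return value instanceof Integer;
      case "LONG":   return value instanceof Long;
      case "FLOAT":  return value instanceof Float;
      case "STRING": return value instanceof String;
      default:       return false;
    }
  }

  // non-matching slots stay null, as in the struct representation of union
  public static Object[] convertUnion(List<String> branches, Object value) {
    Object[] out = new Object[branches.size()];
    for (int i = 0; i < branches.size(); i++) {
      if (!branches.get(i).equals("NULL") && matches(branches.get(i), value)) {
        out[i] = convertPrimitive(branches.get(i), value);
        break; // only one branch can match; stop here
      }
    }
    return out;
  }
}
```

This also folds in the earlier comments on the same loop: the else branch assigning null is gone (array slots default to null), the NULL-type check and the match check are combined with &&, and the loop breaks on the first successful match.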
[GitHub] carbondata issue #2663: [CARBONDATA-2894] Add support for complex map type t...
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2663 retest this please ---
[GitHub] carbondata issue #2649: [CARBONDATA-2869] Add support for Avro Map data type...
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2649 @ravipesala ...fixed the review comments. Kindly review and merge ---
[GitHub] carbondata pull request #2663: [CARBONDATA-2894] Add support for complex map...
GitHub user manishgupta88 opened a pull request: https://github.com/apache/carbondata/pull/2663 [CARBONDATA-2894] Add support for complex map type through spark carbon file format API This PR supports loading querying complex map type through spark carbon file format API. **Note: This PR is dependent on PR #2649** - [ ] Any interfaces changed? No - [ ] Any backward compatibility impacted? No - [ ] Document update required? No - [ ] Testing done Added test cases - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. NA You can merge this pull request into a Git repository by running: $ git pull https://github.com/manishgupta88/carbondata map_spark_carbon_file_support Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2663.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2663 commit e67bd0cb485c4bed15ce8ac3ef3be9b3a4f3798e Author: manishgupta88 Date: 2018-08-20T04:59:29Z Added support for Avro Map type using SDK commit 6db7f2a0d7c02406e0ecc9aa7ac69e2ec2e540a6 Author: manishgupta88 Date: 2018-08-27T13:47:21Z Add support for complex map type using spark carbon file format API ---
[GitHub] carbondata issue #2651: [HOTFIX] Support TableProperties Map API for SDK
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2651 LGTM...can be merged once build is passed ---
[GitHub] carbondata pull request #2651: [HOTFIX] Support TableProperties Map API for ...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2651#discussion_r212582796 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonWriterBuilder.java --- @@ -272,6 +272,56 @@ public CarbonWriterBuilder withLoadOptions(Map options) { return this; }
+  /**
+   * To support the table properties for sdk writer
+   *
+   * @param options key,value pair of create table properties.
+   * supported keys values are
+   * a. blocksize -- [1-2048] values in MB. Default value is 1024
+   * b. blockletsize -- values in MB. Default value is 64 MB
+   * c. localDictionaryThreshold -- positive value, default is 1
+   * d. enableLocalDictionary -- true / false. Default is false
+   * e. sortcolumns -- comma separated column. "c1,c2". Default all dimensions are sorted.
+   *
+   * @return updated CarbonWriterBuilder
+   */
+  public CarbonWriterBuilder withTableProperties(Map options) {
+    Objects.requireNonNull(options, "Table properties should not be null");
+    // validate the options.
+    if (options.size() > 5) {
+      throw new IllegalArgumentException("Supports only 5 options now. "
+          + "Refer method header or documentation");
+    }
+
+    for (String option : options.keySet()) {
+      if (!option.equalsIgnoreCase("blocksize") &&
+          !option.equalsIgnoreCase("blockletsize") &&
+          !option.equalsIgnoreCase("localDictionaryThreshold") &&
+          !option.equalsIgnoreCase("enableLocalDictionary") &&
+          !option.equalsIgnoreCase("sortcolumns")) {
+        throw new IllegalArgumentException("Unsupported options. "
--- End diff -- Better to put all the allowed options in a Set and then check each property using the Set.contains API, which is faster and cleaner code ---
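The Set-based validation the reviewer suggests could look like the following sketch. The option keys mirror the diff, but the class and method names are illustrative, not the actual CarbonWriterBuilder implementation:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Sketch of the reviewer's suggestion: keep the allowed table-property keys in
// a Set and validate each incoming option with Set.contains instead of a chain
// of equalsIgnoreCase checks.
public class TablePropertyValidator {

  // keys stored lower-cased so the lookup stays case-insensitive
  private static final Set<String> ALLOWED_OPTIONS = new HashSet<>(Arrays.asList(
      "blocksize", "blockletsize", "localdictionarythreshold",
      "enablelocaldictionary", "sortcolumns"));

  static void validate(Map<String, String> options) {
    for (String option : options.keySet()) {
      if (!ALLOWED_OPTIONS.contains(option.toLowerCase(Locale.ROOT))) {
        throw new IllegalArgumentException("Unsupported option: " + option);
      }
    }
  }

  public static void main(String[] args) {
    validate(Collections.singletonMap("blockSize", "512"));
    System.out.println("valid");
    try {
      validate(Collections.singletonMap("compression", "zstd"));
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```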
[GitHub] carbondata issue #2649: [WIP] Add support for Avro Map data type support for...
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2649 @jackylk 1. This PR is only for SDK support 2. Yes map is implemented as an array of struct. For more details you can check the design document https://docs.google.com/document/d/1HHe2fdkIh3Jyz1y3494_2kGRSc4muTWuilAmwg5lpVw/edit?usp=sharing ---
[GitHub] carbondata pull request #2649: [WIP] Add support for Avro Map data type supp...
GitHub user manishgupta88 opened a pull request: https://github.com/apache/carbondata/pull/2649 [WIP] Add support for Avro Map data type support for SDK Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily: - [ ] Any interfaces changed? - [ ] Any backward compatibility impacted? - [ ] Document update required? - [ ] Testing done Please provide details on - Whether new unit test cases have been added or why no new tests are required? - How it is tested? Please attach test report. - Is it a performance related change? Please attach the performance test report. - Any additional information to help reviewers in testing this change. - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. You can merge this pull request into a Git repository by running: $ git pull https://github.com/manishgupta88/carbondata sdk_map_support Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2649.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2649 commit 40c678fe7b5892ccb6fe30b5b484fe612c1fda54 Author: manishgupta88 Date: 2018-08-20T04:59:29Z changes for supporting map complex type for SDK ---
[GitHub] carbondata pull request #2613: [HOTFIX] Modified code to fix the degrade in ...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2613#discussion_r208182587 --- Diff: core/src/main/java/org/apache/carbondata/core/scan/result/iterator/RawResultIterator.java --- @@ -53,153 +39,124 @@
   */
  private CarbonIterator detailRawQueryResultIterator;
- private boolean prefetchEnabled;
- private List currentBuffer;
- private List backupBuffer;
- private int currentIdxInBuffer;
- private ExecutorService executorService;
- private Future fetchFuture;
- private Object[] currentRawRow = null;
- private boolean isBackupFilled = false;
+ /**
+  * Counter to maintain the row counter.
+  */
+ private int counter = 0;
+
+ private Object[] currentConveretedRawRow = null;
+
+ /**
+  * LOGGER
+  */
+ private static final LogService LOGGER =
+     LogServiceFactory.getLogService(RawResultIterator.class.getName());
+
+ /**
+  * batch of the result.
+  */
+ private RowBatch batch;

  public RawResultIterator(CarbonIterator detailRawQueryResultIterator,
-     SegmentProperties sourceSegProperties, SegmentProperties destinationSegProperties,
-     boolean isStreamingHandOff) {
+     SegmentProperties sourceSegProperties, SegmentProperties destinationSegProperties) {
    this.detailRawQueryResultIterator = detailRawQueryResultIterator;
    this.sourceSegProperties = sourceSegProperties;
    this.destinationSegProperties = destinationSegProperties;
-   this.executorService = Executors.newFixedThreadPool(1);
-
-   if (!isStreamingHandOff) {
-     init();
-   }
  }

- private void init() {
-   this.prefetchEnabled = CarbonProperties.getInstance().getProperty(
-       CarbonCommonConstants.CARBON_COMPACTION_PREFETCH_ENABLE,
-       CarbonCommonConstants.CARBON_COMPACTION_PREFETCH_ENABLE_DEFAULT).equalsIgnoreCase("true");
-   try {
-     new RowsFetcher(false).call();
-     if (prefetchEnabled) {
-       this.fetchFuture = executorService.submit(new RowsFetcher(true));
-     }
-   } catch (Exception e) {
-     LOGGER.error(e, "Error occurs while fetching records");
-     throw new RuntimeException(e);
-   }
- }
+ @Override
  public boolean hasNext() {
- /**
-  * fetch rows
-  */
- private final class RowsFetcher implements Callable {
-   private boolean isBackupFilling;
-
-   private RowsFetcher(boolean isBackupFilling) {
-     this.isBackupFilling = isBackupFilling;
-   }
-
-   @Override
-   public Void call() throws Exception {
-     if (isBackupFilling) {
-       backupBuffer = fetchRows();
-       isBackupFilled = true;
+   if (null == batch || checkIfBatchIsProcessedCompletely(batch)) {
+     if (detailRawQueryResultIterator.hasNext()) {
+       batch = null;
+       batch = detailRawQueryResultIterator.next();
+       counter = 0; // batch changed so reset the counter.
      } else {
-       currentBuffer = fetchRows();
+       return false;
      }
-     return null;
    }
- }

- private List fetchRows() {
-   if (detailRawQueryResultIterator.hasNext()) {
-     return detailRawQueryResultIterator.next().getRows();
+   if (!checkIfBatchIsProcessedCompletely(batch)) {
+     return true;
    } else {
-     return new ArrayList<>();
+     return false;
    }
  }

- private void fillDataFromPrefetch() {
-   try {
-     if (currentIdxInBuffer >= currentBuffer.size() && 0 != currentIdxInBuffer) {
-       if (prefetchEnabled) {
-         if (!isBackupFilled) {
-           fetchFuture.get();
-         }
-         // copy backup buffer to current buffer and fill backup buffer asyn
-         currentIdxInBuffer = 0;
-         currentBuffer = backupBuffer;
-         isBackupFilled = false;
-         fetchFuture = executorService.submit(new RowsFetcher(true));
-       } else {
-         currentIdxInBuffer = 0;
-         new RowsFetcher(false).call();
+ @Override public Object[] next() {
--- End diff -- ok ---
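The batch-based iteration in the diff above can be shown as a stand-alone sketch. This is an illustrative simplification with hypothetical types, not the actual RawResultIterator: `hasNext` pulls the next batch only when the current one is fully consumed, and a counter tracks the position inside the batch:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Simplified stand-in for the batch-based iterator pattern: each call to the
// source iterator yields one batch of rows; hasNext refills the batch lazily.
public class BatchIterator implements Iterator<Object[]> {

  private final Iterator<List<Object[]>> source; // yields one batch per call
  private List<Object[]> batch;
  private int counter; // position inside the current batch

  BatchIterator(Iterator<List<Object[]>> source) {
    this.source = source;
  }

  @Override public boolean hasNext() {
    if (batch == null || counter >= batch.size()) {
      if (!source.hasNext()) {
        return false;
      }
      batch = source.next();
      counter = 0; // batch changed, so reset the counter
    }
    return counter < batch.size();
  }

  @Override public Object[] next() {
    return batch.get(counter++);
  }

  public static void main(String[] args) {
    List<List<Object[]>> batches = Arrays.asList(
        Arrays.asList(new Object[]{1}, new Object[]{2}),
        Arrays.asList(new Object[]{3}));
    BatchIterator it = new BatchIterator(batches.iterator());
    int rows = 0;
    while (it.hasNext()) {
      it.next();
      rows++;
    }
    System.out.println(rows); // 3 rows across 2 batches
  }
}
```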
[GitHub] carbondata pull request #2613: [HOTFIX] Modified code to fix the degrade in ...
GitHub user manishgupta88 opened a pull request: https://github.com/apache/carbondata/pull/2613 [HOTFIX] Modified code to fix the degrade in compaction performance **Problem** Compaction performance for 3.5 billion records degraded by 16-20% **Analysis** A code modification in RawResultIterator.java caused the problem, wherein a few extra checks are performed as compared to the previous code **Fix** Revert the changes in the RawResultIterator class - [ ] Any interfaces changed? No - [ ] Any backward compatibility impacted? NA - [ ] Document update required? No - [ ] Testing done Verified in cluster testing - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. NA You can merge this pull request into a Git repository by running: $ git pull https://github.com/manishgupta88/carbondata revert_pr_2133 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2613.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2613 commit 1481759abb7e8728bca95a557ce610f6116a2652 Author: m00258959 Date: 2018-08-07T05:24:16Z Problem Compaction performance for 3.5 billion records degraded by 16-20% Analysis: Code modification in RawResultIterator.java caused the problem wherein a few extra checks are performed as compared to previous code Fix Revert the changes in RawResultIterator class ---
[GitHub] carbondata issue #2608: [CARBONDATA-2829][CARBONDATA-2832] Fix creating merg...
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2608 LGTM ---
[GitHub] carbondata issue #2601: [CARBONDATA-2804][DataMap] fix the bug when bloom fi...
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2601 LGTM ---
[GitHub] carbondata pull request #2601: [CARBONDATA-2804][DataMap] fix the bug when b...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2601#discussion_r207224591 --- Diff: core/src/main/java/org/apache/carbondata/core/util/CarbonUtil.java --- @@ -3212,28 +3213,27 @@ public static ColumnarFormatVersion getFormatVersion(CarbonTable carbonTable)
    }
    storePath = carbonTable.getSegmentPath(validSegments.get(0).getSegmentNo());
  }
-
- CarbonFile[] carbonFiles = FileFactory
-     .getCarbonFile(storePath)
-     .listFiles(new CarbonFileFilter() {
-       @Override
-       public boolean accept(CarbonFile file) {
-         if (file == null) {
-           return false;
-         }
-         return file.getName().endsWith("carbondata");
-       }
-     });
- if (carbonFiles == null || carbonFiles.length < 1) {
-   return CarbonProperties.getInstance().getFormatVersion();
+ // get the carbon index file header
+ FileFactory.FileType fileType = FileFactory.getFileType(storePath);
+ ColumnarFormatVersion version = null;
+ if (FileFactory.isFileExist(storePath, fileType)) {
+   SegmentIndexFileStore fileStore = new SegmentIndexFileStore();
+   fileStore.readAllIIndexOfSegment(storePath);
+   Map carbonIndexMap = fileStore.getCarbonIndexMap();
+   if (carbonIndexMap.size() == 0) {
+     version = CarbonProperties.getInstance().getFormatVersion();
+   }
+   CarbonIndexFileReader indexReader = new CarbonIndexFileReader();
+   for (byte[] fileData : carbonIndexMap.values()) {
+     indexReader.openThriftReader(fileData);
+     IndexHeader indexHeader = indexReader.readIndexHeader();
+     version = ColumnarFormatVersion.valueOf((short) indexHeader.getVersion());
+     break;
--- End diff -- Once reading is complete, close indexReader in a try/finally block by calling indexReader.closeThriftReader() ---
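The try/finally shape the reviewer asks for can be sketched as follows. `FakeIndexReader` is a hypothetical stand-in for CarbonIndexFileReader so the sketch is self-contained; only the open/read/close call pattern is the point:

```java
// Sketch of the review fix: guarantee the thrift reader is closed even when
// reading the header throws. FakeIndexReader is an illustrative stand-in for
// CarbonIndexFileReader, not the real class.
public class ReaderCleanup {

  static class FakeIndexReader {
    boolean open;
    void openThriftReader(byte[] data) { open = true; }
    short readIndexHeaderVersion() { return 3; }
    void closeThriftReader() { open = false; }
  }

  public static void main(String[] args) {
    FakeIndexReader indexReader = new FakeIndexReader();
    indexReader.openThriftReader(new byte[0]);
    short version;
    try {
      version = indexReader.readIndexHeaderVersion();
    } finally {
      // close in finally so an exception while reading cannot leak the reader
      indexReader.closeThriftReader();
    }
    System.out.println(version + " closed=" + !indexReader.open);
  }
}
```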
[GitHub] carbondata pull request #2601: [CARBONDATA-2804][DataMap] fix the bug when b...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2601#discussion_r207223659 --- Diff: core/src/main/java/org/apache/carbondata/core/util/CarbonUtil.java --- @@ -3212,28 +3213,27 @@ public static ColumnarFormatVersion getFormatVersion(CarbonTable carbonTable)
    }
    storePath = carbonTable.getSegmentPath(validSegments.get(0).getSegmentNo());
  }
-
- CarbonFile[] carbonFiles = FileFactory
-     .getCarbonFile(storePath)
-     .listFiles(new CarbonFileFilter() {
-       @Override
-       public boolean accept(CarbonFile file) {
-         if (file == null) {
-           return false;
-         }
-         return file.getName().endsWith("carbondata");
-       }
-     });
- if (carbonFiles == null || carbonFiles.length < 1) {
-   return CarbonProperties.getInstance().getFormatVersion();
+ // get the carbon index file header
+ FileFactory.FileType fileType = FileFactory.getFileType(storePath);
+ ColumnarFormatVersion version = null;
+ if (FileFactory.isFileExist(storePath, fileType)) {
--- End diff -- Rename storePath to segmentPath ---
[GitHub] carbondata pull request #2601: [CARBONDATA-2804][DataMap] fix the bug when b...
Github user manishgupta88 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2601#discussion_r207225748 --- Diff: core/src/main/java/org/apache/carbondata/core/util/CarbonUtil.java --- @@ -3212,28 +3213,27 @@ public static ColumnarFormatVersion getFormatVersion(CarbonTable carbonTable)
    }
    storePath = carbonTable.getSegmentPath(validSegments.get(0).getSegmentNo());
  }
-
- CarbonFile[] carbonFiles = FileFactory
-     .getCarbonFile(storePath)
-     .listFiles(new CarbonFileFilter() {
-       @Override
-       public boolean accept(CarbonFile file) {
-         if (file == null) {
-           return false;
-         }
-         return file.getName().endsWith("carbondata");
-       }
-     });
- if (carbonFiles == null || carbonFiles.length < 1) {
-   return CarbonProperties.getInstance().getFormatVersion();
+ // get the carbon index file header
+ FileFactory.FileType fileType = FileFactory.getFileType(storePath);
+ ColumnarFormatVersion version = null;
+ if (FileFactory.isFileExist(storePath, fileType)) {
+   SegmentIndexFileStore fileStore = new SegmentIndexFileStore();
+   fileStore.readAllIIndexOfSegment(storePath);
+   Map carbonIndexMap = fileStore.getCarbonIndexMap();
+   if (carbonIndexMap.size() == 0) {
+     version = CarbonProperties.getInstance().getFormatVersion();
+   }
+   CarbonIndexFileReader indexReader = new CarbonIndexFileReader();
+   for (byte[] fileData : carbonIndexMap.values()) {
+     indexReader.openThriftReader(fileData);
+     IndexHeader indexHeader = indexReader.readIndexHeader();
+     version = ColumnarFormatVersion.valueOf((short) indexHeader.getVersion());
+     break;
+   }
+ } else {
+   version = CarbonProperties.getInstance().getFormatVersion();
--- End diff -- 1. This else condition is not required, as for a valid segment its path will exist in the file system. 2. If the path does not exist at all, then read the version info from the next valid segment and log a warning that the valid segment path does not exist in the system ---
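The fallback the reviewer describes in point 2 could be sketched like this. All names here are hypothetical (this is not CarbonUtil code): walk the valid segments, warn on each missing path, and let the caller fall back to the configured default only when no segment path exists:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

// Illustrative sketch of the suggested fallback: try each valid segment in
// order instead of silently defaulting when the first path is missing.
public class VersionFallback {

  static Optional<Short> readVersion(List<String> segmentPaths,
      Predicate<String> pathExists,
      Function<String, Short> readHeaderVersion) {
    for (String path : segmentPaths) {
      if (!pathExists.test(path)) {
        // log a warning and try the next valid segment
        System.out.println("WARN: valid segment path does not exist: " + path);
        continue;
      }
      return Optional.of(readHeaderVersion.apply(path));
    }
    return Optional.empty(); // caller falls back to the configured default
  }

  public static void main(String[] args) {
    // /seg0 is missing, /seg1 exists and reports format version 3
    Optional<Short> v = readVersion(Arrays.asList("/seg0", "/seg1"),
        path -> path.equals("/seg1"), path -> (short) 3);
    System.out.println("version=" + v.orElse((short) -1));
  }
}
```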
[GitHub] carbondata issue #2600: [CARBONDATA-2813] Fixed code to get data size from L...
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2600 LGTM ---
[GitHub] carbondata issue #2595: [Documentation] [Unsafe Configuration] Added carbon....
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2595 @xuchuanyin Usually in production scenarios the driver memory is less than the executor memory. Now we are using unsafe memory for caching the block/blocklet dataMap in the driver. Currently the unsafe memory configured for the executor is also getting used for the driver, which is not a good idea. Therefore it is required to separate the driver and executor unsafe memory. You can observe the same in the Spark configuration: Spark provides different parameters, spark.yarn.driver.memoryOverhead and spark.yarn.executor.memoryOverhead, for configuring driver and executor memory overhead to control unsafe memory usage ---
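The split described above might be configured with a carbon.properties fragment like the following. The executor-side key `carbon.unsafe.working.memory.in.mb` is the pre-existing property and the driver-side key is the one this PR adds; the values are illustrative assumptions, not recommendations:

```properties
# Illustrative carbon.properties fragment (values are assumptions)
# executor-side unsafe working memory, used during query/load processing
carbon.unsafe.working.memory.in.mb=2048
# driver-side unsafe memory introduced by this PR, used for the
# block/blocklet dataMap cache in the driver
carbon.unsafe.driver.working.memory.in.mb=512
```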
[GitHub] carbondata pull request #2595: [Documentation] [Unsafe Configuration] Added ...
GitHub user manishgupta88 opened a pull request: https://github.com/apache/carbondata/pull/2595 [Documentation] [Unsafe Configuration] Added carbon.unsafe.driver.working.memory.in.mb parameter to differentiate between driver and executor unsafe memory Added carbon.unsafe.driver.working.memory.in.mb parameter to differentiate between driver and executor unsafe memory - [ ] Any interfaces changed? No - [ ] Any backward compatibility impacted? No - [ ] Document update required? Yes. Updated - [ ] Testing done Verified manually - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. NA You can merge this pull request into a Git repository by running: $ git pull https://github.com/manishgupta88/carbondata unsafe_driver_property Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2595.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2595 commit 8196be0a50d9d0f38452cfdf5fbe0b0485973e87 Author: manishgupta88 Date: 2018-08-01T14:08:30Z Added carbon.unsafe.driver.working.memory.in.mb parameter to differentiate between driver and executor unsafe memory ---
[GitHub] carbondata pull request #2593: [CARBONDATA-2753][Compatibility] Merge Index ...
GitHub user manishgupta88 opened a pull request: https://github.com/apache/carbondata/pull/2593 [CARBONDATA-2753][Compatibility] Merge Index file not getting created with blocklet information for old store **Problem** Merge Index file not getting created with blocklet information for old store **Analysis** In legacy store (store <= 1.1 version), blocklet information is not written in the carbon Index files. When merge Index is created using the Alter DDL command on old store then merge Index file should be created with blocklet information which is as per the new store. This is not happening because the flag to read the carbondata file footer is not passed as true from Alter DDL command flow. **Fix** Pass the flag to read carbondataFileFooter as true while creating the merge Index file using Alter DDL command - [ ] Any interfaces changed? No - [ ] Any backward compatibility impacted? No - [ ] Document update required? No - [ ] Testing done Manually verified - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. NA You can merge this pull request into a Git repository by running: $ git pull https://github.com/manishgupta88/carbondata merge_index_compatibility Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/2593.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2593 commit 4af316a7cebd0f05a08c4f4519302961eccbcb1f Author: manishgupta88 Date: 2018-08-01T08:54:52Z Problem Merge Index file not getting created with blocklet information for old store Analysis In legacy store (store <= 1.1 version), blocklet information is not written in the carbon Index files. When merge Index is created using the Alter DDL command on old store then merge Index file should be created with blocklet information which is as per the new store. 
This is not happening because the flag to read the carbondata file footer is not passed as true from Alter DDL command flow. Fix Pass the flag to read carbondataFileFooter as true while creating the merge Index file using Alter DDL command ---
[GitHub] carbondata issue #2585: [CARBONDATA-2805]fix the ordering mismatch of segmen...
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2585 LGTM ---