[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #2814: [CARBONDATA-3001] configurable page size in MB
ajantha-bhat commented on a change in pull request #2814: [CARBONDATA-3001] configurable page size in MB
URL: https://github.com/apache/carbondata/pull/2814#discussion_r275275354

File path: docs/ddl-of-carbondata.md

@@ -283,6 +284,18 @@ CarbonData DDL statements are documented here, which includes:
   TBLPROPERTIES ('TABLE_BLOCKLET_SIZE'='8')
   ```
+ - # Table Page Size Configuration
+
+   This property sets the page size in the CarbonData file; the default value is 1 MB.
+   It supports a range of 1 MB to 1755 MB.
+   If the page size crosses this value before 32000 rows, the page will be cut at that many rows.

Review comment: done

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
With regards, Apache Git Services
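The 1 MB to 1755 MB range quoted in the doc snippet ties back to the snappy limit discussed later in this thread: snappy's worst-case output estimate is `32 + source_len + source_len/6`, and the result must fit in a `byte[]` indexed by an int. A quick sketch of that arithmetic (the class and method names here are illustrative, not CarbonData APIs):

```java
public class PageSizeBound {
  // Snappy's worst-case compressed-size estimate, as quoted in the PR discussion.
  static long maxCompressedLength(long sourceLen) {
    return 32 + sourceLen + sourceLen / 6;
  }

  public static void main(String[] args) {
    // Largest source length whose estimate still fits in an int-indexed byte[]:
    long maxSource = (long) (Integer.MAX_VALUE - 32) * 6 / 7;
    System.out.println(maxSource);                      // 1840700241 bytes
    System.out.println(maxSource / (1024 * 1024));      // 1755 MB, matching the documented cap
    System.out.println(maxCompressedLength(maxSource)); // 2147483646, still <= Integer.MAX_VALUE
  }
}
```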
ajantha-bhat commented on a change in pull request #2814: [CARBONDATA-3001] configurable page size in MB
URL: https://github.com/apache/carbondata/pull/2814#discussion_r275270478

File path: format/src/main/thrift/carbondata.thrift

@@ -180,6 +180,7 @@ struct BlockletInfo3{
 4: required i64 dimension_offsets;
 5: required i64 measure_offsets;
 6: required i32 number_number_of_pages; // This is required for alter table: when a filter is selected only on a newly added column, this will help
+7: optional list<i32> row_count_in_page; // This will contain the row count in each page.

Review comment: done
ajantha-bhat commented on a change in pull request #2814: [CARBONDATA-3001] configurable page size in MB
URL: https://github.com/apache/carbondata/pull/2814#discussion_r275268381

File path: processing/src/main/java/org/apache/carbondata/processing/store/CarbonFactDataHandlerColumnar.java

@@ -227,50 +268,120 @@ public void addDataToStore(CarbonRow row) throws CarbonDataWriterException {
   /**
    * Check if column page can be added more rows after adding this row to page.
+   * Only a few no-dictionary dimension columns (string, varchar,
+   * complex columns) can grow huge in size.
    *
-   * A varchar column page uses SafeVarLengthColumnPage/UnsafeVarLengthColumnPage to store data
-   * and encoded using HighCardDictDimensionIndexCodec which will call getByteArrayPage() from
-   * column page and flatten into byte[] for compression.
-   * Limited by the index of array, we can only put number of Integer.MAX_VALUE bytes in a page.
-   *
-   * Another limitation is from Compressor. Currently we use snappy as default compressor,
-   * and it will call MaxCompressedLength method to estimate the result size for preparing output.
-   * For safety, the estimate result is oversize: `32 + source_len + source_len/6`.
-   * So the maximum bytes to compress by snappy is (2GB-32)*6/7 ≈ 1.71GB.
-   *
-   * Size of a row does not exceed 2MB since UnsafeSortDataRows uses 2MB byte[] as rowBuffer.
-   * Such that we can stop adding more row here if any long string column reach this limit.
-   *
-   * If use unsafe column page, please ensure the memory configured is enough.
-   * @param row
-   * @return false if any varchar column page cannot add one more value(2MB)
+   * @param row carbonRow
+   * @return false if next rows can be added to same page,
+   *         true if next rows cannot be added to same page
    */
-  private boolean isVarcharColumnFull(CarbonRow row) {
-    //TODO: test and remove this as now UnsafeSortDataRows can exceed 2MB
-    if (model.getVarcharDimIdxInNoDict().size() > 0) {
+  private boolean needToCutThePage(CarbonRow row) {
+    List<DataType> noDictDataTypesList = model.getNoDictDataTypesList();
+    int totalNoDictPageCount = noDictDataTypesList.size() + model.getNoDictAllComplexColumnDepth();
+    if (totalNoDictPageCount > 0) {
+      int currentElementLength;
+      int bucketCounter = 0;
+      if (configuredPageSizeInBytes == 0) {
+        // no need to cut the page
+        // use default value
+        /*configuredPageSizeInBytes =
+            CarbonCommonConstants.TABLE_PAGE_SIZE_INMB_DEFAULT * 1024 * 1024;*/
+        return false;
+      }
       Object[] nonDictArray = WriteStepRowUtil.getNoDictAndComplexDimension(row);
-      for (int i = 0; i < model.getVarcharDimIdxInNoDict().size(); i++) {
-        if (DataTypeUtil
-            .isPrimitiveColumn(model.getNoDictAndComplexColumns()[i].getDataType())) {
-          // get the size from the data type
-          varcharColumnSizeInByte[i] +=
-              model.getNoDictAndComplexColumns()[i].getDataType().getSizeInBytes();
-        } else {
-          varcharColumnSizeInByte[i] +=
-              ((byte[]) nonDictArray[model.getVarcharDimIdxInNoDict().get(i)]).length;
-        }
-        if (SnappyCompressor.MAX_BYTE_TO_COMPRESS -
-            (varcharColumnSizeInByte[i] + dataRows.size() * 4) < (2 << 20)) {
-          LOGGER.debug("Limited by varchar column, page size is " + dataRows.size());
-          // re-init for next page
-          varcharColumnSizeInByte = new int[model.getVarcharDimIdxInNoDict().size()];
-          return true;
+      for (int i = 0; i < noDictDataTypesList.size(); i++) {
+        DataType columnType = noDictDataTypesList.get(i);
+        if ((columnType == DataTypes.STRING) || (columnType == DataTypes.VARCHAR)) {
+          currentElementLength = ((byte[]) nonDictArray[i]).length;
+          noDictColumnPageSize[bucketCounter] += currentElementLength;
+          canSnappyHandleThisRow(noDictColumnPageSize[bucketCounter]);
+          // If current page size is more than configured page size, cut the page here.
+          if (noDictColumnPageSize[bucketCounter] + dataRows.size() * 4
+              >= configuredPageSizeInBytes) {
+            if (LOGGER.isDebugEnabled()) {
+              LOGGER.debug("cutting the page. Rows count in this page: " + dataRows.size());
+            }
+            // re-init for next page
+            noDictColumnPageSize = new int[totalNoDictPageCount];
+            return true;
+          }
+          bucketCounter++;
+        } else if (columnType.isComplexType()) {
+          // this is for depth of each complex column, model is having only total depth.
+          GenericDataType genericDataType = complexIndexMapCopy
+              .get(i - model.getNoDictionaryCount() + model.getPrimitiveDimLens().length);
+          int depth = genericDataType.getDepth();
+          List<ArrayList<byte[]>> flatComplexColumnList = (List<ArrayList<byte[]>>) nonDictArray[i];
+
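The `needToCutThePage` logic quoted above accumulates the byte length of each variable-length value per column page and cuts the page once that total, plus roughly 4 bytes of offset overhead per row, reaches the configured limit. A simplified, self-contained sketch of that bookkeeping (all names here are illustrative; this is not the actual CarbonData code):

```java
import java.util.ArrayList;
import java.util.List;

public class PageCutSketch {
  static final int CONFIGURED_PAGE_SIZE_BYTES = 1024 * 1024; // 1 MB, the default
  static long pageSizeInBytes = 0;
  static List<byte[]> rowsInPage = new ArrayList<>();

  // Returns true when the page must be cut after adding this value.
  static boolean addAndCheck(byte[] value) {
    rowsInPage.add(value);
    pageSizeInBytes += value.length;
    // 4 bytes per row approximates the offsets stored alongside the data.
    if (pageSizeInBytes + rowsInPage.size() * 4L >= CONFIGURED_PAGE_SIZE_BYTES) {
      pageSizeInBytes = 0; // re-init for the next page
      rowsInPage.clear();
      return true;
    }
    return false;
  }

  public static void main(String[] args) {
    byte[] value = new byte[100 * 1024]; // 100 KB values
    int rows = 0;
    while (!addAndCheck(value)) {
      rows++;
    }
    System.out.println("page cut after " + (rows + 1) + " rows"); // 11 rows of 100 KB cross 1 MB
  }
}
```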
ajantha-bhat commented on a change in pull request #2814: [CARBONDATA-3001] configurable page size in MB
URL: https://github.com/apache/carbondata/pull/2814#discussion_r275251624

File path: processing/src/main/java/org/apache/carbondata/processing/datatypes/ArrayDataType.java

@@ -322,4 +325,16 @@ public void getComplexColumnInfo(List columnInfoList) {
       name, false));
   children.getComplexColumnInfo(columnInfoList);
 }
+
+@Override
+public int getDepth() {
+  if (depth == 0) {
+    // calculate only one time
+    List<ComplexColumnInfo> complexColumnInfoList = new ArrayList<>();

Review comment: For the first row only (depth == 0), it calls `getComplexColumnInfo`, which recursively calls the children as well. I am reusing the existing function instead of writing a new one for depth.
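The caching pattern described in this comment (compute the subtree's column count once via the recursive `getComplexColumnInfo` walk, then reuse it as the depth) can be sketched as follows; `NestedType` and its methods are illustrative stand-ins, not CarbonData classes:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for CarbonData's GenericDataType hierarchy.
class NestedType {
  private final List<NestedType> children = new ArrayList<>();
  private int depth = 0; // 0 means "not yet computed"

  NestedType(NestedType... childTypes) {
    for (NestedType c : childTypes) {
      children.add(c);
    }
  }

  // Mirrors getComplexColumnInfo: appends one entry per column in the subtree.
  void collectColumnInfo(List<String> columnInfoList) {
    columnInfoList.add("column");
    for (NestedType child : children) {
      child.collectColumnInfo(columnInfoList);
    }
  }

  // Mirrors getDepth: computed once from the recursive walk, then cached.
  int getDepth() {
    if (depth == 0) {
      List<String> columnInfoList = new ArrayList<>();
      collectColumnInfo(columnInfoList);
      depth = columnInfoList.size();
    }
    return depth;
  }
}

public class DepthSketch {
  public static void main(String[] args) {
    // array<array<int>> flattens to 3 column pages: outer array, inner array, leaf.
    NestedType leaf = new NestedType();
    NestedType inner = new NestedType(leaf);
    NestedType outer = new NestedType(inner);
    System.out.println(outer.getDepth()); // 3
  }
}
```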
ajantha-bhat commented on a change in pull request #2814: [CARBONDATA-3001] configurable page size in MB
URL: https://github.com/apache/carbondata/pull/2814#discussion_r273890107

File path: processing/src/main/java/org/apache/carbondata/processing/store/CarbonFactDataHandlerColumnar.java

@@ -227,50 +260,135 @@ public void addDataToStore(CarbonRow row) throws CarbonDataWriterException {
   /**
    * Check if column page can be added more rows after adding this row to page.
+   * Only a few no-dictionary dimension columns (string, varchar,
+   * complex columns) can grow huge in size.
    *
-   * A varchar column page uses SafeVarLengthColumnPage/UnsafeVarLengthColumnPage to store data
-   * and encoded using HighCardDictDimensionIndexCodec which will call getByteArrayPage() from
-   * column page and flatten into byte[] for compression.
-   * Limited by the index of array, we can only put number of Integer.MAX_VALUE bytes in a page.
-   *
-   * Another limitation is from Compressor. Currently we use snappy as default compressor,
-   * and it will call MaxCompressedLength method to estimate the result size for preparing output.
-   * For safety, the estimate result is oversize: `32 + source_len + source_len/6`.
-   * So the maximum bytes to compress by snappy is (2GB-32)*6/7 ≈ 1.71GB.
-   *
-   * Size of a row does not exceed 2MB since UnsafeSortDataRows uses 2MB byte[] as rowBuffer.
-   * Such that we can stop adding more row here if any long string column reach this limit.
-   *
-   * If use unsafe column page, please ensure the memory configured is enough.
-   * @param row
-   * @return false if any varchar column page cannot add one more value(2MB)
+   * @param row carbonRow
+   * @return false if next rows can be added to same page,
+   *         true if next rows cannot be added to same page
    */
-  private boolean isVarcharColumnFull(CarbonRow row) {
-    //TODO: test and remove this as now UnsafeSortDataRows can exceed 2MB
-    if (model.getVarcharDimIdxInNoDict().size() > 0) {
+  private boolean needToCutThePage(CarbonRow row) {
+    List<DataType> noDictDataTypesList = model.getNoDictDataTypesList();
+    int totalNoDictPageCount = noDictDataTypesList.size() + model.getNoDictAllComplexColumnDepth();
+    if (totalNoDictPageCount > 0) {
+      int currentElementLength;
+      int bucketCounter = 0;
+      int configuredPageSizeInBytes;
+      String configuredPageSizeStrInBytes =
+          model.getTableSpec().getCarbonTable().getTableInfo().getFactTable().getTableProperties()
+              .get(CarbonCommonConstants.TABLE_PAGE_SIZE_INMB);
+      if (configuredPageSizeStrInBytes != null) {
+        configuredPageSizeInBytes = Integer.parseInt(configuredPageSizeStrInBytes) * 1024 * 1024;
+      } else {
+        // Set the default 1 MB page size if not configured from 1.6 version.
+        // If set now, it will impact forward compatibility between 1.5.x versions.
+        // use default value
+        /*configuredPageSizeInBytes =
+            CarbonCommonConstants.TABLE_PAGE_SIZE_INMB_DEFAULT * 1024 * 1024;*/
+        return false;

Review comment: If it was already restricted in the old version, there is no need to restrict it now, right? Also, that logic in the old version was buggy.
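The property lookup under discussion reads `TABLE_PAGE_SIZE_INMB` from the table properties and converts MB to bytes, treating an absent value as "no limit" so that files written by 1.5.x readers stay compatible. A minimal sketch of that fallback (the map stands in for the real table-properties lookup; the key string is an assumption):

```java
import java.util.HashMap;
import java.util.Map;

public class PagePropertySketch {
  // Assumed value of CarbonCommonConstants.TABLE_PAGE_SIZE_INMB (table properties are lowercase).
  static final String TABLE_PAGE_SIZE_INMB = "table_page_size_inmb";

  // Returns the configured page size in bytes, or 0 when the property is absent
  // (the PR treats "absent" as "never cut early" to preserve 1.5.x compatibility).
  static int configuredPageSizeInBytes(Map<String, String> tableProperties) {
    String valueInMb = tableProperties.get(TABLE_PAGE_SIZE_INMB);
    if (valueInMb == null) {
      return 0;
    }
    return Integer.parseInt(valueInMb) * 1024 * 1024;
  }

  public static void main(String[] args) {
    Map<String, String> props = new HashMap<>();
    System.out.println(configuredPageSizeInBytes(props)); // 0 -> page is never cut early
    props.put(TABLE_PAGE_SIZE_INMB, "2");
    System.out.println(configuredPageSizeInBytes(props)); // 2097152
  }
}
```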
ajantha-bhat commented on a change in pull request #2814: [CARBONDATA-3001] configurable page size in MB
URL: https://github.com/apache/carbondata/pull/2814#discussion_r274261617

File path: processing/src/main/java/org/apache/carbondata/processing/store/CarbonFactDataHandlerColumnar.java

@@ -227,50 +260,135 @@ public void addDataToStore(CarbonRow row) throws CarbonDataWriterException {
   /**
    * Check if column page can be added more rows after adding this row to page.
+   * Only a few no-dictionary dimension columns (string, varchar,
+   * complex columns) can grow huge in size.
    *
-   * A varchar column page uses SafeVarLengthColumnPage/UnsafeVarLengthColumnPage to store data
-   * and encoded using HighCardDictDimensionIndexCodec which will call getByteArrayPage() from
-   * column page and flatten into byte[] for compression.
-   * Limited by the index of array, we can only put number of Integer.MAX_VALUE bytes in a page.
-   *
-   * Another limitation is from Compressor. Currently we use snappy as default compressor,
-   * and it will call MaxCompressedLength method to estimate the result size for preparing output.
-   * For safety, the estimate result is oversize: `32 + source_len + source_len/6`.
-   * So the maximum bytes to compress by snappy is (2GB-32)*6/7 ≈ 1.71GB.
-   *
-   * Size of a row does not exceed 2MB since UnsafeSortDataRows uses 2MB byte[] as rowBuffer.
-   * Such that we can stop adding more row here if any long string column reach this limit.
-   *
-   * If use unsafe column page, please ensure the memory configured is enough.
-   * @param row
-   * @return false if any varchar column page cannot add one more value(2MB)
+   * @param row carbonRow
+   * @return false if next rows can be added to same page,
+   *         true if next rows cannot be added to same page
    */
-  private boolean isVarcharColumnFull(CarbonRow row) {
-    //TODO: test and remove this as now UnsafeSortDataRows can exceed 2MB
-    if (model.getVarcharDimIdxInNoDict().size() > 0) {
+  private boolean needToCutThePage(CarbonRow row) {
+    List<DataType> noDictDataTypesList = model.getNoDictDataTypesList();
+    int totalNoDictPageCount = noDictDataTypesList.size() + model.getNoDictAllComplexColumnDepth();
+    if (totalNoDictPageCount > 0) {
+      int currentElementLength;
+      int bucketCounter = 0;
+      int configuredPageSizeInBytes;
+      String configuredPageSizeStrInBytes =
+          model.getTableSpec().getCarbonTable().getTableInfo().getFactTable().getTableProperties()
+              .get(CarbonCommonConstants.TABLE_PAGE_SIZE_INMB);
+      if (configuredPageSizeStrInBytes != null) {
+        configuredPageSizeInBytes = Integer.parseInt(configuredPageSizeStrInBytes) * 1024 * 1024;
+      } else {
+        // Set the default 1 MB page size if not configured from 1.6 version.
+        // If set now, it will impact forward compatibility between 1.5.x versions.
+        // use default value
+        /*configuredPageSizeInBytes =
+            CarbonCommonConstants.TABLE_PAGE_SIZE_INMB_DEFAULT * 1024 * 1024;*/
+        return false;
+      }
       Object[] nonDictArray = WriteStepRowUtil.getNoDictAndComplexDimension(row);
-      for (int i = 0; i < model.getVarcharDimIdxInNoDict().size(); i++) {
-        if (DataTypeUtil
-            .isPrimitiveColumn(model.getNoDictAndComplexColumns()[i].getDataType())) {
-          // get the size from the data type
-          varcharColumnSizeInByte[i] +=
-              model.getNoDictAndComplexColumns()[i].getDataType().getSizeInBytes();
-        } else {
-          varcharColumnSizeInByte[i] +=
-              ((byte[]) nonDictArray[model.getVarcharDimIdxInNoDict().get(i)]).length;
-        }
-        if (SnappyCompressor.MAX_BYTE_TO_COMPRESS -
-            (varcharColumnSizeInByte[i] + dataRows.size() * 4) < (2 << 20)) {
-          LOGGER.debug("Limited by varchar column, page size is " + dataRows.size());
-          // re-init for next page
-          varcharColumnSizeInByte = new int[model.getVarcharDimIdxInNoDict().size()];
-          return true;
+      for (int i = 0; i < noDictDataTypesList.size(); i++) {
+        DataType columnType = noDictDataTypesList.get(i);
+        if ((columnType == DataTypes.STRING) || (columnType == DataTypes.VARCHAR)) {
+          currentElementLength = ((byte[]) nonDictArray[i]).length;
+          noDictColumnPageSize[bucketCounter] += currentElementLength;
+          canSnappyHandleThisRow(noDictColumnPageSize[bucketCounter]);
+          // If current page size is more than configured page size, cut the page here.
+          if (noDictColumnPageSize[bucketCounter] + dataRows.size() * 4
+              >= configuredPageSizeInBytes) {
+            LOGGER.debug("cutting the page. Rows count in this page: " + dataRows.size());
+            // re-init for next page
+            noDictColumnPageSize = new int[totalNoDictPageCount];
+            return true;
+          }
+
ajantha-bhat commented on a change in pull request #2814: [CARBONDATA-3001] configurable page size in MB
URL: https://github.com/apache/carbondata/pull/2814#discussion_r274261557

File path: processing/src/main/java/org/apache/carbondata/processing/store/CarbonFactDataHandlerColumnar.java

@@ -227,50 +260,135 @@ public void addDataToStore(CarbonRow row) throws CarbonDataWriterException {
   /**
    * Check if column page can be added more rows after adding this row to page.
+   * Only a few no-dictionary dimension columns (string, varchar,
+   * complex columns) can grow huge in size.
    *
-   * A varchar column page uses SafeVarLengthColumnPage/UnsafeVarLengthColumnPage to store data
-   * and encoded using HighCardDictDimensionIndexCodec which will call getByteArrayPage() from
-   * column page and flatten into byte[] for compression.
-   * Limited by the index of array, we can only put number of Integer.MAX_VALUE bytes in a page.
-   *
-   * Another limitation is from Compressor. Currently we use snappy as default compressor,
-   * and it will call MaxCompressedLength method to estimate the result size for preparing output.
-   * For safety, the estimate result is oversize: `32 + source_len + source_len/6`.
-   * So the maximum bytes to compress by snappy is (2GB-32)*6/7 ≈ 1.71GB.
-   *
-   * Size of a row does not exceed 2MB since UnsafeSortDataRows uses 2MB byte[] as rowBuffer.
-   * Such that we can stop adding more row here if any long string column reach this limit.
-   *
-   * If use unsafe column page, please ensure the memory configured is enough.
-   * @param row
-   * @return false if any varchar column page cannot add one more value(2MB)
+   * @param row carbonRow
+   * @return false if next rows can be added to same page,
+   *         true if next rows cannot be added to same page
    */
-  private boolean isVarcharColumnFull(CarbonRow row) {
-    //TODO: test and remove this as now UnsafeSortDataRows can exceed 2MB
-    if (model.getVarcharDimIdxInNoDict().size() > 0) {
+  private boolean needToCutThePage(CarbonRow row) {
+    List<DataType> noDictDataTypesList = model.getNoDictDataTypesList();
+    int totalNoDictPageCount = noDictDataTypesList.size() + model.getNoDictAllComplexColumnDepth();
+    if (totalNoDictPageCount > 0) {
+      int currentElementLength;
+      int bucketCounter = 0;
+      int configuredPageSizeInBytes;
+      String configuredPageSizeStrInBytes =
+          model.getTableSpec().getCarbonTable().getTableInfo().getFactTable().getTableProperties()
+              .get(CarbonCommonConstants.TABLE_PAGE_SIZE_INMB);
+      if (configuredPageSizeStrInBytes != null) {
+        configuredPageSizeInBytes = Integer.parseInt(configuredPageSizeStrInBytes) * 1024 * 1024;
+      } else {
+        // Set the default 1 MB page size if not configured from 1.6 version.
+        // If set now, it will impact forward compatibility between 1.5.x versions.
+        // use default value
+        /*configuredPageSizeInBytes =
+            CarbonCommonConstants.TABLE_PAGE_SIZE_INMB_DEFAULT * 1024 * 1024;*/
+        return false;
+      }
       Object[] nonDictArray = WriteStepRowUtil.getNoDictAndComplexDimension(row);
-      for (int i = 0; i < model.getVarcharDimIdxInNoDict().size(); i++) {
-        if (DataTypeUtil
-            .isPrimitiveColumn(model.getNoDictAndComplexColumns()[i].getDataType())) {
-          // get the size from the data type
-          varcharColumnSizeInByte[i] +=
-              model.getNoDictAndComplexColumns()[i].getDataType().getSizeInBytes();
-        } else {
-          varcharColumnSizeInByte[i] +=
-              ((byte[]) nonDictArray[model.getVarcharDimIdxInNoDict().get(i)]).length;
-        }
-        if (SnappyCompressor.MAX_BYTE_TO_COMPRESS -
-            (varcharColumnSizeInByte[i] + dataRows.size() * 4) < (2 << 20)) {
-          LOGGER.debug("Limited by varchar column, page size is " + dataRows.size());
-          // re-init for next page
-          varcharColumnSizeInByte = new int[model.getVarcharDimIdxInNoDict().size()];
-          return true;
+      for (int i = 0; i < noDictDataTypesList.size(); i++) {
+        DataType columnType = noDictDataTypesList.get(i);
+        if ((columnType == DataTypes.STRING) || (columnType == DataTypes.VARCHAR)) {
+          currentElementLength = ((byte[]) nonDictArray[i]).length;
+          noDictColumnPageSize[bucketCounter] += currentElementLength;
+          canSnappyHandleThisRow(noDictColumnPageSize[bucketCounter]);
+          // If current page size is more than configured page size, cut the page here.
+          if (noDictColumnPageSize[bucketCounter] + dataRows.size() * 4
+              >= configuredPageSizeInBytes) {
+            LOGGER.debug("cutting the page. Rows count in this page: " + dataRows.size());
+            // re-init for next page
+            noDictColumnPageSize = new int[totalNoDictPageCount];
+            return true;
+          }
+
ajantha-bhat commented on a change in pull request #2814: [CARBONDATA-3001] configurable page size in MB
URL: https://github.com/apache/carbondata/pull/2814#discussion_r274261483

File path: core/src/main/java/org/apache/carbondata/core/datastore/row/CarbonRow.java

@@ -31,6 +34,9 @@
   private short rangeId;

+  /* key is complex column name, val is its flat byte array */
+  private Map<String, List<ArrayList<byte[]>>> complexFlatByteArrayMap;

Review comment: yes, handled
ajantha-bhat commented on a change in pull request #2814: [CARBONDATA-3001] configurable page size in MB
URL: https://github.com/apache/carbondata/pull/2814#discussion_r273797879

File path: core/src/main/java/org/apache/carbondata/core/datastore/row/CarbonRow.java

@@ -31,6 +34,9 @@
   private short rangeId;

+  /* key is complex column name, val is its flat byte array */
+  private Map<String, List<ArrayList<byte[]>>> complexFlatByteArrayMap;

Review comment: This is data extracted from `data` of `CarbonRow`. Previously in `TablePage.java`, `encodedComplexColumnar` was extracted from `data` and used, but the actual `data` was never modified. So I have used the same logic. Also, modifying `data[]` with flat complex data may lead to many code changes wherever `data[]` is used.
ajantha-bhat commented on a change in pull request #2814: [CARBONDATA-3001] configurable page size in MB
URL: https://github.com/apache/carbondata/pull/2814#discussion_r273895645

File path: processing/src/main/java/org/apache/carbondata/processing/store/CarbonFactDataHandlerModel.java

@@ -316,31 +326,19 @@ public static CarbonFactDataHandlerModel getCarbonFactDataHandlerModel(CarbonLoa
     String[] tempStoreLocation, String carbonDataDirectoryPath) {

   // for dynamic page size in write step if varchar columns exist
-  List<Integer> varcharDimIdxInNoDict = new ArrayList<>();

Review comment: It will be used for compaction also: getCarbonFactDataHandlerModel() is called in the compaction flow, and I have handled filling the data type list in that method.
ajantha-bhat commented on a change in pull request #2814: [CARBONDATA-3001] configurable page size in MB
URL: https://github.com/apache/carbondata/pull/2814#discussion_r273890739

File path: processing/src/main/java/org/apache/carbondata/processing/store/CarbonFactDataHandlerColumnar.java

@@ -227,50 +260,135 @@ public void addDataToStore(CarbonRow row) throws CarbonDataWriterException {
   /**
    * Check if column page can be added more rows after adding this row to page.
+   * Only a few no-dictionary dimension columns (string, varchar,
+   * complex columns) can grow huge in size.
    *
-   * A varchar column page uses SafeVarLengthColumnPage/UnsafeVarLengthColumnPage to store data
-   * and encoded using HighCardDictDimensionIndexCodec which will call getByteArrayPage() from
-   * column page and flatten into byte[] for compression.
-   * Limited by the index of array, we can only put number of Integer.MAX_VALUE bytes in a page.
-   *
-   * Another limitation is from Compressor. Currently we use snappy as default compressor,
-   * and it will call MaxCompressedLength method to estimate the result size for preparing output.
-   * For safety, the estimate result is oversize: `32 + source_len + source_len/6`.
-   * So the maximum bytes to compress by snappy is (2GB-32)*6/7 ≈ 1.71GB.
-   *
-   * Size of a row does not exceed 2MB since UnsafeSortDataRows uses 2MB byte[] as rowBuffer.
-   * Such that we can stop adding more row here if any long string column reach this limit.
-   *
-   * If use unsafe column page, please ensure the memory configured is enough.
-   * @param row
-   * @return false if any varchar column page cannot add one more value(2MB)
+   * @param row carbonRow
+   * @return false if next rows can be added to same page,
+   *         true if next rows cannot be added to same page
    */
-  private boolean isVarcharColumnFull(CarbonRow row) {
-    //TODO: test and remove this as now UnsafeSortDataRows can exceed 2MB
-    if (model.getVarcharDimIdxInNoDict().size() > 0) {
+  private boolean needToCutThePage(CarbonRow row) {
+    List<DataType> noDictDataTypesList = model.getNoDictDataTypesList();
+    int totalNoDictPageCount = noDictDataTypesList.size() + model.getNoDictAllComplexColumnDepth();
+    if (totalNoDictPageCount > 0) {
+      int currentElementLength;
+      int bucketCounter = 0;
+      int configuredPageSizeInBytes;
+      String configuredPageSizeStrInBytes =
+          model.getTableSpec().getCarbonTable().getTableInfo().getFactTable().getTableProperties()
+              .get(CarbonCommonConstants.TABLE_PAGE_SIZE_INMB);
+      if (configuredPageSizeStrInBytes != null) {
+        configuredPageSizeInBytes = Integer.parseInt(configuredPageSizeStrInBytes) * 1024 * 1024;
+      } else {
+        // Set the default 1 MB page size if not configured from 1.6 version.
+        // If set now, it will impact forward compatibility between 1.5.x versions.
+        // use default value
+        /*configuredPageSizeInBytes =
+            CarbonCommonConstants.TABLE_PAGE_SIZE_INMB_DEFAULT * 1024 * 1024;*/
+        return false;
+      }
       Object[] nonDictArray = WriteStepRowUtil.getNoDictAndComplexDimension(row);
-      for (int i = 0; i < model.getVarcharDimIdxInNoDict().size(); i++) {
-        if (DataTypeUtil
-            .isPrimitiveColumn(model.getNoDictAndComplexColumns()[i].getDataType())) {
-          // get the size from the data type
-          varcharColumnSizeInByte[i] +=
-              model.getNoDictAndComplexColumns()[i].getDataType().getSizeInBytes();
-        } else {
-          varcharColumnSizeInByte[i] +=
-              ((byte[]) nonDictArray[model.getVarcharDimIdxInNoDict().get(i)]).length;
-        }
-        if (SnappyCompressor.MAX_BYTE_TO_COMPRESS -
-            (varcharColumnSizeInByte[i] + dataRows.size() * 4) < (2 << 20)) {
-          LOGGER.debug("Limited by varchar column, page size is " + dataRows.size());
-          // re-init for next page
-          varcharColumnSizeInByte = new int[model.getVarcharDimIdxInNoDict().size()];
-          return true;
+      for (int i = 0; i < noDictDataTypesList.size(); i++) {
+        DataType columnType = noDictDataTypesList.get(i);
+        if ((columnType == DataTypes.STRING) || (columnType == DataTypes.VARCHAR)) {
+          currentElementLength = ((byte[]) nonDictArray[i]).length;
+          noDictColumnPageSize[bucketCounter] += currentElementLength;
+          canSnappyHandleThisRow(noDictColumnPageSize[bucketCounter]);
+          // If current page size is more than configured page size, cut the page here.
+          if (noDictColumnPageSize[bucketCounter] + dataRows.size() * 4
+              >= configuredPageSizeInBytes) {
+            LOGGER.debug("cutting the page. Rows count in this page: " + dataRows.size());

Review comment: ok. done
ajantha-bhat commented on a change in pull request #2814: [CARBONDATA-3001] configurable page size in MB
URL: https://github.com/apache/carbondata/pull/2814#discussion_r273889787

File path: docs/ddl-of-carbondata.md

@@ -283,6 +284,18 @@ CarbonData DDL statements are documented here, which includes:
   TBLPROPERTIES ('TABLE_BLOCKLET_SIZE'='8')
   ```
+ - # Table Page Size Configuration
+
+   This property sets the page size in the CarbonData file; the default value is 1 MB.
+   It supports a range of 1 MB to 1755 MB.
+   If the page size crosses this value before 32000 rows, the page will be cut at that many rows.

Review comment: ok. done
[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #2814: [CARBONDATA-3001] configurable page size in MB
ajantha-bhat commented on a change in pull request #2814: [CARBONDATA-3001] configurable page size in MB
URL: https://github.com/apache/carbondata/pull/2814#discussion_r273889773

## File path: processing/src/main/java/org/apache/carbondata/processing/store/CarbonFactDataHandlerColumnar.java

@@ -227,50 +260,135 @@ public void addDataToStore(CarbonRow row) throws CarbonDataWriterException {
   /**
    * Check if column page can be added more rows after adding this row to page.
+  * only few no-dictionary dimensions columns (string, varchar,
+  * complex columns) can grow huge in size.
-  *
-  * A varchar column page uses SafeVarLengthColumnPage/UnsafeVarLengthColumnPage to store data
-  * and encoded using HighCardDictDimensionIndexCodec which will call getByteArrayPage() from
-  * column page and flatten into byte[] for compression.
-  * Limited by the index of array, we can only put number of Integer.MAX_VALUE bytes in a page.
-  *
-  * Another limitation is from Compressor. Currently we use snappy as default compressor,
-  * and it will call MaxCompressedLength method to estimate the result size for preparing output.
-  * For safety, the estimate result is oversize: `32 + source_len + source_len/6`.
-  * So the maximum bytes to compress by snappy is (2GB-32)*6/7≈1.71GB.
-  *
-  * Size of a row does not exceed 2MB since UnsafeSortDataRows uses 2MB byte[] as rowBuffer.
-  * Such that we can stop adding more row here if any long string column reach this limit.
   *
-  * If use unsafe column page, please ensure the memory configured is enough.
-  * @param row
-  * @return false if any varchar column page cannot add one more value(2MB)
+  * @param row carbonRow
+  * @return false if next rows can be added to same page.
+  * true if next rows cannot be added to same page
   */
-  private boolean isVarcharColumnFull(CarbonRow row) {
-    //TODO: test and remove this as now UnsafeSortDataRows can exceed 2MB
-    if (model.getVarcharDimIdxInNoDict().size() > 0) {
+  private boolean needToCutThePage(CarbonRow row) {
+    List<DataType> noDictDataTypesList = model.getNoDictDataTypesList();
+    int totalNoDictPageCount = noDictDataTypesList.size() + model.getNoDictAllComplexColumnDepth();
+    if (totalNoDictPageCount > 0) {
+      int currentElementLength;
+      int bucketCounter = 0;
+      int configuredPageSizeInBytes;
+      String configuredPageSizeStrInBytes =
+          model.getTableSpec().getCarbonTable().getTableInfo().getFactTable().getTableProperties()
+              .get(CarbonCommonConstants.TABLE_PAGE_SIZE_INMB);

Review comment: ok. done
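The javadoc being removed in the diff above states that snappy's worst-case output estimate is `32 + source_len + source_len/6`, giving a maximum compressible input of (2GB-32)*6/7 ≈ 1.71 GB, which is where the 1755 MB upper bound in the DDL documentation comes from. This small sketch just checks that arithmetic; it is a derivation, not CarbonData code.

```java
// Derive the 1755 MB page-size cap from snappy's worst-case estimate:
// the estimate 32 + len + len/6 must stay within a 2 GB output buffer,
// so the largest safe input length is (2GB - 32) * 6 / 7.
public class SnappyBoundSketch {

  // Largest compressible input, expressed in whole megabytes.
  static long maxPageSizeInMb() {
    long twoGb = 2L * 1024 * 1024 * 1024;      // snappy's 2 GB output limit
    long maxSourceLen = (twoGb - 32) * 6 / 7;  // largest len with 32 + len + len/6 <= 2 GB
    return maxSourceLen / (1024 * 1024);       // bytes -> MB
  }

  public static void main(String[] args) {
    System.out.println(maxPageSizeInMb()); // 1755, matching the documented cap
  }
}
```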
[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #2814: [CARBONDATA-3001] configurable page size in MB
ajantha-bhat commented on a change in pull request #2814: [CARBONDATA-3001] configurable page size in MB
URL: https://github.com/apache/carbondata/pull/2814#discussion_r273797879

## File path: core/src/main/java/org/apache/carbondata/core/datastore/row/CarbonRow.java

@@ -31,6 +34,9 @@
   private short rangeId;
+
+  /* key is complex column name, val is its flat byte array */
+  private Map<String, List<List<byte[]>>> complexFlatByteArrayMap;

Review comment: This is data extracted from the `data` field of `CarbonRow`. Previously, in `TablePage.java`, `List<List<byte[]>> encodedComplexColumnar` was extracted from `data` and used without ever modifying the actual `data`, so I have used the same logic here. Also, modifying `data[]` with flat complex data may lead to many code changes wherever `data[]` is used.