[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3751: [CARBONDATA-3803] Mark CarbonSession as deprecated since 2.0
ajantha-bhat commented on a change in pull request #3751: URL: https://github.com/apache/carbondata/pull/3751#discussion_r421945890 ## File path: integration/spark/src/main/scala/org/apache/spark/sql/CarbonSession.scala ## @@ -40,12 +40,18 @@ import org.apache.carbondata.streaming.CarbonStreamingQueryListener * Session implementation for {org.apache.spark.sql.SparkSession} * Implemented this class only to use our own SQL DDL commands. * User needs to use {CarbonSession.getOrCreateCarbon} to create Carbon session. + * + * @deprecated Since 2.0, only use for backward compatibility, + * please switch to use {@link CarbonExtensions}. */ +@Deprecated class CarbonSession(@transient val sc: SparkContext, Review comment: Do we also need to deprecate the "**stored by**" syntax and update the documentation for it? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
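For readers following this deprecation: the migration path the javadoc points to is a plain SparkSession with CarbonExtensions installed. A minimal sketch is below; the master URL, app name, and table schema are placeholder assumptions, while the `spark.sql.extensions` key and the old `getOrCreateCarbonSession` entry point come from the CarbonData docs and the diff above.

```scala
import org.apache.spark.sql.SparkSession

// Deprecated 1.x entry point (what this PR marks @Deprecated):
//   import org.apache.spark.sql.CarbonSession._
//   val carbon = SparkSession.builder()
//     .master("local[2]")                    // placeholder
//     .getOrCreateCarbonSession("/tmp/carbonstore")

// 2.0 replacement: a regular SparkSession with CarbonExtensions.
val spark = SparkSession.builder()
  .master("local[2]")                        // placeholder
  .appName("CarbonExtensionsExample")        // placeholder
  .config("spark.sql.extensions", "org.apache.spark.sql.CarbonExtensions")
  .getOrCreate()

// Carbon DDL now runs through the standard SQL entry point.
spark.sql("CREATE TABLE IF NOT EXISTS demo (id INT, name STRING) STORED AS carbondata")
```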
[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3754: [CARBONDATA-3808] Added documentation for cdc and scd scenarios
ajantha-bhat commented on a change in pull request #3754: URL: https://github.com/apache/carbondata/pull/3754#discussion_r421944153 ## File path: docs/scd-and-cdc-guide.md ## @@ -0,0 +1,118 @@ + + +# Upsert into a Carbon DataSet using Merge + +## SCD and CDC Scenarios +Change Data Capture (CDC), is to apply all data changes generated from an external data set +into a target dataset. In other words, a set of changes (update/delete/insert) applied to an external +table needs to be applied to a target table. + +Slowly Changing Dimensions (SCD), are the dimensions in which the data changes slowly, rather +than changing regularly on a time basis. + +SCD and CDC data changes can be merged to a carbon dataset online using the data frame level `MERGE` API. + + MERGE API + +Below API merges the datasets online and applies the actions as per the conditions. + +``` + targetDS.merge(sourceDS, ) Review comment: Need to explain what each of these APIs does and when to use it, because they are newly added by Carbon. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
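To make this review point concrete for readers of the thread, here is a rough sketch of a filled-in call chain. It is assembled only from the call names quoted in the hunk above; the SparkSession named spark, the table names, the join condition, and the element types of the expression maps are assumptions, to be verified against MergeTestCase.scala.

```scala
import org.apache.spark.sql.functions.col

// Hypothetical datasets: 'target' is the carbon table, 'source' carries the changes.
// Assumes a SparkSession named spark with the carbon merge API in scope.
val target = spark.table("target_table").as("A")
val source = spark.table("change_table").as("B")

// Expression maps consumed by updateExpr/insertExpr (assumed shapes).
val updateMap = Map(col("value") -> col("B.value"))
val insertMap = Map(col("id") -> col("B.id"), col("value") -> col("B.value"))

// Mirrors the chain in the quoted documentation: update matched rows,
// insert unmatched source rows, then run the merge online.
target.merge(source, col("A.id") === col("B.id"))
  .whenMatched()
  .updateExpr(updateMap)
  .whenNotMatched()
  .insertExpr(insertMap)
  .execute()
```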
[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3754: [CARBONDATA-3808] Added documentation for cdc and scd scenarios
ajantha-bhat commented on a change in pull request #3754: URL: https://github.com/apache/carbondata/pull/3754#discussion_r421943365 ## File path: docs/scd-and-cdc-guide.md ## @@ -0,0 +1,118 @@ + + +# Upsert into a Carbon DataSet using Merge + +## SCD and CDC Scenarios +Change Data Capture (CDC), is to apply all data changes generated from an external data set +into a target dataset. In other words, a set of changes (update/delete/insert) applied to an external +table needs to be applied to a target table. + +Slowly Changing Dimensions (SCD), are the dimensions in which the data changes slowly, rather +than changing regularly on a time basis. + +SCD and CDC data changes can be merged to a carbon dataset online using the data frame level `MERGE` API. + + MERGE API + +Below API merges the datasets online and applies the actions as per the conditions. + +``` + targetDS.merge(sourceDS, ) + .whenMatched() + .updateExpr(updateMap) + .insertExpr(insertMap_u) + .whenNotMatched() + .insertExpr(insertMap) + .whenNotMatchedAndExistsOnlyOnTarget() + .delete() + .insertHistoryTableExpr(insertMap_d, ) + .execute() ``` +**NOTE:** SQL syntax for merge is not yet supported. + +# Example code sample to implement ccd scenario Review comment: Just give a link to `MergeTestCase.scala` and the test case name. No need to have the same thing here. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
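Building on the quoted chain, the delete branch with the history-table hook might look like the sketch below; as in the previous example, the datasets, join condition, and maps are hypothetical, and the second argument of `insertHistoryTableExpr` is elided in the quoted doc, so it is not filled in here.

```scala
import org.apache.spark.sql.functions.col

// Hypothetical datasets, assuming a SparkSession named spark and the
// carbon merge API in scope (see MergeTestCase.scala for real usage).
val target = spark.table("target_table").as("A")
val source = spark.table("change_table").as("B")

// Rows that exist only on the target (i.e. deleted upstream) are removed.
// The quoted doc additionally chains .insertHistoryTableExpr(insertMap_d, ...)
// after .delete() to audit deletions into a history table; its second
// argument is elided in the hunk above, so it is omitted here.
target.merge(source, col("A.id") === col("B.id"))
  .whenNotMatchedAndExistsOnlyOnTarget()
  .delete()
  .execute()
```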
[GitHub] [carbondata] niuge01 commented on a change in pull request #3752: [CARBONDATA-3804] Provide end-to-end flink integration guide
niuge01 commented on a change in pull request #3752: URL: https://github.com/apache/carbondata/pull/3752#discussion_r421921906 ## File path: docs/flink-integration-guide.md ## @@ -0,0 +1,193 @@ +##Usage scenarios + + A typical scenario is that the data is cleaned and preprocessed by Flink, and then written to Carbon, + for subsequent analysis and queries. + + The CarbonData flink integration module is used connect Flink and Carbon in the above scenario. + + The CarbonData flink integration module provides a set of Flink BulkWriter implementations + (CarbonLocalWriter and CarbonS3Writer). The data is processed by the Flink, and finally written into + the stage directory of the target table by the CarbonXXXWriter. + + By default, those data in table stage directory, can not be immediately queried, those data can be queried + after the "INSERT INTO $tableName STAGE" command is executed. Review comment: OK This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
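Since the quoted guide makes staged data visible only after the stage-insert command runs, a minimal scheduling sketch is shown below, assuming a SparkSession named spark; the table name and the 5-minute period are placeholders, while the SQL text is the command quoted above.

```scala
import java.util.concurrent.{Executors, TimeUnit}

val tableName = "test"  // placeholder target table
val scheduler = Executors.newSingleThreadScheduledExecutor()

// Fold staged flink output into the table periodically so it becomes queryable.
// Per the guide, shorten the period when visibility requirements are high or
// flink traffic is large.
scheduler.scheduleAtFixedRate(new Runnable {
  override def run(): Unit = spark.sql(s"INSERT INTO $tableName STAGE")
}, 0, 5, TimeUnit.MINUTES)
```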
[GitHub] [carbondata] niuge01 commented on a change in pull request #3752: [CARBONDATA-3804] Provide end-to-end flink integration guide
niuge01 commented on a change in pull request #3752: URL: https://github.com/apache/carbondata/pull/3752#discussion_r421921948 ## File path: docs/flink-integration-guide.md ## @@ -0,0 +1,193 @@ +##Usage scenarios + + A typical scenario is that the data is cleaned and preprocessed by Flink, and then written to Carbon, + for subsequent analysis and queries. + + The CarbonData flink integration module is used connect Flink and Carbon in the above scenario. + + The CarbonData flink integration module provides a set of Flink BulkWriter implementations + (CarbonLocalWriter and CarbonS3Writer). The data is processed by the Flink, and finally written into + the stage directory of the target table by the CarbonXXXWriter. + + By default, those data in table stage directory, can not be immediately queried, those data can be queried + after the "INSERT INTO $tableName STAGE" command is executed. + + Since the flink data written to carbon is endless, in order to ensure the visibility of data + and the controllable amount of data processed during the execution of each insert form stage command, + the user should execute the insert from stage command in a timely manner. + + The execution interval of the insert form stage command should take the data visibility requirements + of the actual business and the flink data traffic. When the data visibility requirements are high + or the data traffic is large, the execution interval should be appropriately shortened. + +##Usage description + +###Writing process + + Typical flink stream: Source -> Process -> Output(Carbon Writer Sink) + + Pseudo code and description: + + ```scala +// Import dependencies. +import java.util.Properties +import org.apache.carbon.flink.CarbonWriterFactory +import org.apache.carbon.flink.ProxyFileSystem +import org.apache.carbondata.core.constants.CarbonCommonConstants +import org.apache.flink.api.common.restartstrategy.RestartStrategies +import org.apache.flink.core.fs.Path +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment +import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink + +// Specify database name. +val databaseName = "default" + +// Specify target table name. +val tableName = "test" +// Table path of the target table. +val tablePath = "/data/warehouse/test" +// Specify local temporary path. +val dataTempPath = "/data/temp/" + +val tableProperties = new Properties +// Set the table properties here. + +val writerProperties = new Properties +// Set the writer properties here, such as temp path, commit threshold, access key, secret key, endpoint, etc. + +val carbonProperties = new Properties +// Set the carbon properties here, such as date format, store location, etc. + +// Create carbon bulk writer factory. Two writer types are supported: 'Local' and 'S3'. +val writerFactory = CarbonWriterFactory.builder("Local").build( + databaseName, + tableName, + tablePath, + tableProperties, + writerProperties, + carbonProperties +) + +// Build a flink stream and run it. +// 1. New a flink execution environment. +val environment = StreamExecutionEnvironment.getExecutionEnvironment +// Set flink environment configuration here, such as parallelism, checkpointing, restart strategy, etc. + +// 2. Create flink data source, may be a kafka source, custom source, or others. +// The data type of source should be Array[AnyRef]. +// Array length should equals to table column count, and values order in array should matches table column order. +val source = ... +// 3. Create flink stream and set source.
+val stream = environment.addSource(source) +// 4. Add other flink operators here. +// ... +// 5. Set flink stream target (write data to carbon with a write sink). +stream.addSink(StreamingFileSink.forBulkFormat(new Path(ProxyFileSystem.DEFAULT_URI), writerFactory).build) +// 6. Run flink stream. +try { + environment.execute +} catch { + case exception: Exception => +// Handle execute exception here. +} + ``` + +###Writer properties + +Local Writer + + | Property | Name | Description | + |--|--|-| + | CarbonLocalProperty.DATA_TEMP_PATH | carbon.writer.local.data.temp.path | Usually is a local path, data will write to temp path first, and mv to target data path finally.| Review comment: OK
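To tie the property table at the end of this hunk back to the writerProperties stub in the pseudo code, a sketch like the following could be used; the package of CarbonLocalProperty is assumed from the hunk's other imports, and both values are placeholders.

```scala
import java.util.Properties
import org.apache.carbon.flink.CarbonLocalProperty  // assumed package, per the hunk's imports

val writerProperties = new Properties
// "carbon.writer.local.data.temp.path": data is written here first,
// then moved to the target data path, as the table above describes.
writerProperties.setProperty(CarbonLocalProperty.DATA_TEMP_PATH, "/data/temp/")
// "carbon.writer.local.commit.threshold": placeholder value; the row is
// truncated in the quote, so check the guide for its exact semantics.
writerProperties.setProperty(CarbonLocalProperty.COMMIT_THRESHOLD, "1024")
```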
[GitHub] [carbondata] niuge01 commented on a change in pull request #3752: [CARBONDATA-3804] Provide end-to-end flink integration guide
niuge01 commented on a change in pull request #3752: URL: https://github.com/apache/carbondata/pull/3752#discussion_r421921602 ## File path: docs/flink-integration-guide.md ## @@ -0,0 +1,193 @@ +##Usage scenarios + + A typical scenario is that the data is cleaned and preprocessed by Flink, and then written to Carbon, + for subsequent analysis and queries. + + The CarbonData flink integration module is used connect Flink and Carbon in the above scenario. + + The CarbonData flink integration module provides a set of Flink BulkWriter implementations + (CarbonLocalWriter and CarbonS3Writer). The data is processed by the Flink, and finally written into + the stage directory of the target table by the CarbonXXXWriter. + + By default, those data in table stage directory, can not be immediately queried, those data can be queried + after the "INSERT INTO $tableName STAGE" command is executed. + + Since the flink data written to carbon is endless, in order to ensure the visibility of data + and the controllable amount of data processed during the execution of each insert form stage command, + the user should execute the insert from stage command in a timely manner. + + The execution interval of the insert form stage command should take the data visibility requirements + of the actual business and the flink data traffic. When the data visibility requirements are high + or the data traffic is large, the execution interval should be appropriately shortened. + +##Usage description + +###Writing process + + Typical flink stream: Source -> Process -> Output(Carbon Writer Sink) + + Pseudo code and description: + + ```scala +// Import dependencies. +import java.util.Properties +import org.apache.carbon.flink.CarbonWriterFactory +import org.apache.carbon.flink.ProxyFileSystem +import org.apache.carbondata.core.constants.CarbonCommonConstants +import org.apache.flink.api.common.restartstrategy.RestartStrategies +import org.apache.flink.core.fs.Path +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment +import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink + +// Specify database name. +val databaseName = "default" + +// Specify target table name. +val tableName = "test" +// Table path of the target table. +val tablePath = "/data/warehouse/test" +// Specify local temporary path. +val dataTempPath = "/data/temp/" + +val tableProperties = new Properties +// Set the table properties here. + +val writerProperties = new Properties +// Set the writer properties here, such as temp path, commit threshold, access key, secret key, endpoint, etc. + +val carbonProperties = new Properties +// Set the carbon properties here, such as date format, store location, etc. + +// Create carbon bulk writer factory. Two writer types are supported: 'Local' and 'S3'. +val writerFactory = CarbonWriterFactory.builder("Local").build( + databaseName, + tableName, + tablePath, + tableProperties, + writerProperties, + carbonProperties +) + +// Build a flink stream and run it. +// 1. New a flink execution environment. Review comment: OK This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [carbondata] niuge01 commented on a change in pull request #3752: [CARBONDATA-3804] Provide end-to-end flink integration guide
niuge01 commented on a change in pull request #3752: URL: https://github.com/apache/carbondata/pull/3752#discussion_r421921350 ## File path: docs/flink-integration-guide.md ## @@ -0,0 +1,193 @@ +##Usage scenarios + + A typical scenario is that the data is cleaned and preprocessed by Flink, and then written to Carbon, + for subsequent analysis and queries. + + The CarbonData flink integration module is used connect Flink and Carbon in the above scenario. + + The CarbonData flink integration module provides a set of Flink BulkWriter implementations + (CarbonLocalWriter and CarbonS3Writer). The data is processed by the Flink, and finally written into + the stage directory of the target table by the CarbonXXXWriter. + + By default, those data in table stage directory, can not be immediately queried, those data can be queried + after the "INSERT INTO $tableName STAGE" command is executed. + + Since the flink data written to carbon is endless, in order to ensure the visibility of data + and the controllable amount of data processed during the execution of each insert form stage command, + the user should execute the insert from stage command in a timely manner. + + The execution interval of the insert form stage command should take the data visibility requirements + of the actual business and the flink data traffic. When the data visibility requirements are high + or the data traffic is large, the execution interval should be appropriately shortened. + +##Usage description + +###Writing process + + Typical flink stream: Source -> Process -> Output(Carbon Writer Sink) + + Pseudo code and description: + + ```scala +// Import dependencies. +import java.util.Properties +import org.apache.carbon.flink.CarbonWriterFactory +import org.apache.carbon.flink.ProxyFileSystem +import org.apache.carbondata.core.constants.CarbonCommonConstants +import org.apache.flink.api.common.restartstrategy.RestartStrategies +import org.apache.flink.core.fs.Path +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment +import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink + +// Specify database name. +val databaseName = "default" + +// Specify target table name. +val tableName = "test" +// Table path of the target table. +val tablePath = "/data/warehouse/test" +// Specify local temporary path. +val dataTempPath = "/data/temp/" + +val tableProperties = new Properties +// Set the table properties here. + +val writerProperties = new Properties +// Set the writer properties here, such as temp path, commit threshold, access key, secret key, endpoint, etc. + +val carbonProperties = new Properties +// Set the carbon properties here, such as date format, store location, etc. + +// Create carbon bulk writer factory. Two writer types are supported: 'Local' and 'S3'. +val writerFactory = CarbonWriterFactory.builder("Local").build( + databaseName, + tableName, + tablePath, + tableProperties, + writerProperties, + carbonProperties +) + +// Build a flink stream and run it. +// 1. New a flink execution environment. +val environment = StreamExecutionEnvironment.getExecutionEnvironment +// Set flink environment configuration here, such as parallelism, checkpointing, restart strategy, etc. + +// 2. Create flink data source, may be a kafka source, custom source, or others. +// The data type of source should be Array[AnyRef]. +// Array length should equals to table column count, and values order in array should matches table column order. +val source = ... +// 3. Create flink stream and set source.
+val stream = environment.addSource(source) +// 4. Add other flink operators here. +// ... +// 5. Set flink stream target (write data to carbon with a write sink). +stream.addSink(StreamingFileSink.forBulkFormat(new Path(ProxyFileSystem.DEFAULT_URI), writerFactory).build) +// 6. Run flink stream. +try { + environment.execute +} catch { + case exception: Exception => +// Handle execute exception here. +} + ``` + +###Writer properties + +Local Writer + + | Property | Name | Description | + |--|--|-| + | CarbonLocalProperty.DATA_TEMP_PATH | carbon.writer.local.data.temp.path | Usually is a local path, data will write to temp path first, and mv to target data path finally.| + | CarbonLocalProperty.COMMIT_THRESHOLD | carbon.writer.local.commit.threshold
[GitHub] [carbondata] niuge01 commented on a change in pull request #3752: [CARBONDATA-3804] Provide end-to-end flink integration guide
niuge01 commented on a change in pull request #3752: URL: https://github.com/apache/carbondata/pull/3752#discussion_r421921411 ## File path: docs/flink-integration-guide.md ## @@ -0,0 +1,193 @@ +##Usage scenarios + + A typical scenario is that the data is cleaned and preprocessed by Flink, and then written to Carbon, + for subsequent analysis and queries. + + The CarbonData flink integration module is used connect Flink and Carbon in the above scenario. Review comment: OK This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [carbondata] niuge01 commented on a change in pull request #3752: [CARBONDATA-3804] Provide end-to-end flink integration guide
niuge01 commented on a change in pull request #3752: URL: https://github.com/apache/carbondata/pull/3752#discussion_r421890991 ## File path: docs/flink-integration-guide.md ## @@ -0,0 +1,193 @@ +##Usage scenarios + + A typical scenario is that the data is cleaned and preprocessed by Flink, and then written to Carbon, + for subsequent analysis and queries. + + The CarbonData flink integration module is used connect Flink and Carbon in the above scenario. + + The CarbonData flink integration module provides a set of Flink BulkWriter implementations + (CarbonLocalWriter and CarbonS3Writer). The data is processed by the Flink, and finally written into + the stage directory of the target table by the CarbonXXXWriter. + + By default, those data in table stage directory, can not be immediately queried, those data can be queried + after the "INSERT INTO $tableName STAGE" command is executed. + + Since the flink data written to carbon is endless, in order to ensure the visibility of data + and the controllable amount of data processed during the execution of each insert form stage command, + the user should execute the insert from stage command in a timely manner. + + The execution interval of the insert form stage command should take the data visibility requirements + of the actual business and the flink data traffic. When the data visibility requirements are high + or the data traffic is large, the execution interval should be appropriately shortened. + +##Usage description + +###Writing process + + Typical flink stream: Source -> Process -> Output(Carbon Writer Sink) + + Pseudo code and description: + + ```scala +// Import dependencies. +import java.util.Properties +import org.apache.carbon.flink.CarbonWriterFactory +import org.apache.carbon.flink.ProxyFileSystem +import org.apache.carbondata.core.constants.CarbonCommonConstants +import org.apache.flink.api.common.restartstrategy.RestartStrategies +import org.apache.flink.core.fs.Path +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment +import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink + +// Specify database name. +val databaseName = "default" + +// Specify target table name. +val tableName = "test" +// Table path of the target table. +val tablePath = "/data/warehouse/test" +// Specify local temporary path. +val dataTempPath = "/data/temp/" + +val tableProperties = new Properties +// Set the table properties here. + +val writerProperties = new Properties +// Set the writer properties here, such as temp path, commit threshold, access key, secret key, endpoint, etc. + +val carbonProperties = new Properties +// Set the carbon properties here, such as date format, store location, etc. + +// Create carbon bulk writer factory. Two writer types are supported: 'Local' and 'S3'. +val writerFactory = CarbonWriterFactory.builder("Local").build( + databaseName, + tableName, + tablePath, + tableProperties, + writerProperties, + carbonProperties +) + +// Build a flink stream and run it. +// 1. New a flink execution environment. +val environment = StreamExecutionEnvironment.getExecutionEnvironment +// Set flink environment configuration here, such as parallelism, checkpointing, restart strategy, etc. + +// 2. Create flink data source, may be a kafka source, custom source, or others. +// The data type of source should be Array[AnyRef]. +// Array length should equals to table column count, and values order in array should matches table column order. +val source = ... +// 3. Create flink stream and set source.
+val stream = environment.addSource(source) +// 4. Add other flink operators here. +// ... +// 5. Set flink stream target (write data to carbon with a write sink). +stream.addSink(StreamingFileSink.forBulkFormat(new Path(ProxyFileSystem.DEFAULT_URI), writerFactory).build) +// 6. Run flink stream. +try { + environment.execute +} catch { + case exception: Exception => +// Handle execute exception here. +} + ``` + +###Writer properties + +Local Writer + + | Property | Name | Description | + |--|--|-| + | CarbonLocalProperty.DATA_TEMP_PATH | carbon.writer.local.data.temp.path | Usually is a local path, data will write to temp path first, and mv to target data path finally.| + | CarbonLocalProperty.COMMIT_THRESHOLD | carbon.writer.local.commit.threshold
[GitHub] [carbondata] niuge01 commented on a change in pull request #3752: [CARBONDATA-3804] Provide end-to-end flink integration guide
niuge01 commented on a change in pull request #3752: URL: https://github.com/apache/carbondata/pull/3752#discussion_r421890931 ## File path: docs/flink-integration-guide.md ## @@ -0,0 +1,193 @@ +##Usage scenarios + + A typical scenario is that the data is cleaned and preprocessed by Flink, and then written to Carbon, + for subsequent analysis and queries. + + The CarbonData flink integration module is used connect Flink and Carbon in the above scenario. + + The CarbonData flink integration module provides a set of Flink BulkWriter implementations + (CarbonLocalWriter and CarbonS3Writer). The data is processed by the Flink, and finally written into + the stage directory of the target table by the CarbonXXXWriter. + + By default, those data in table stage directory, can not be immediately queried, those data can be queried + after the "INSERT INTO $tableName STAGE" command is executed. + + Since the flink data written to carbon is endless, in order to ensure the visibility of data + and the controllable amount of data processed during the execution of each insert form stage command, + the user should execute the insert from stage command in a timely manner. + + The execution interval of the insert form stage command should take the data visibility requirements + of the actual business and the flink data traffic. When the data visibility requirements are high + or the data traffic is large, the execution interval should be appropriately shortened. + +##Usage description + +###Writing process + + Typical flink stream: Source -> Process -> Output(Carbon Writer Sink) + + Pseudo code and description: + + ```scala +// Import dependencies. +import java.util.Properties +import org.apache.carbon.flink.CarbonWriterFactory +import org.apache.carbon.flink.ProxyFileSystem +import org.apache.carbondata.core.constants.CarbonCommonConstants +import org.apache.flink.api.common.restartstrategy.RestartStrategies +import org.apache.flink.core.fs.Path +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment +import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink + +// Specify database name. +val databaseName = "default" + +// Specify target table name. +val tableName = "test" +// Table path of the target table. +val tablePath = "/data/warehouse/test" +// Specify local temporary path. +val dataTempPath = "/data/temp/" + +val tableProperties = new Properties +// Set the table properties here. + +val writerProperties = new Properties +// Set the writer properties here, such as temp path, commit threshold, access key, secret key, endpoint, etc. + +val carbonProperties = new Properties +// Set the carbon properties here, such as date format, store location, etc. + +// Create carbon bulk writer factory. Two writer types are supported: 'Local' and 'S3'. +val writerFactory = CarbonWriterFactory.builder("Local").build( + databaseName, + tableName, + tablePath, + tableProperties, + writerProperties, + carbonProperties +) + +// Build a flink stream and run it. +// 1. New a flink execution environment. +val environment = StreamExecutionEnvironment.getExecutionEnvironment +// Set flink environment configuration here, such as parallelism, checkpointing, restart strategy, etc. + +// 2. Create flink data source, may be a kafka source, custom source, or others. +// The data type of source should be Array[AnyRef]. +// Array length should equals to table column count, and values order in array should matches table column order. +val source = ... +// 3. Create flink stream and set source.
+val stream = environment.addSource(source) +// 4. Add other flink operators here. +// ... +// 5. Set flink stream target (write data to carbon with a write sink). +stream.addSink(StreamingFileSink.forBulkFormat(new Path(ProxyFileSystem.DEFAULT_URI), writerFactory).build) +// 6. Run flink stream. +try { + environment.execute +} catch { + case exception: Exception => +// Handle execute exception here. +} + ``` + +###Writer properties + +Local Writer Review comment: OK This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [carbondata] niuge01 commented on a change in pull request #3752: [CARBONDATA-3804] Provide end-to-end flink integration guide
niuge01 commented on a change in pull request #3752: URL: https://github.com/apache/carbondata/pull/3752#discussion_r421890603 ## File path: docs/flink-integration-guide.md ## @@ -0,0 +1,193 @@ +##Usage scenarios + + A typical scenario is that the data is cleaned and preprocessed by Flink, and then written to Carbon, + for subsequent analysis and queries. + + The CarbonData flink integration module is used connect Flink and Carbon in the above scenario. + + The CarbonData flink integration module provides a set of Flink BulkWriter implementations + (CarbonLocalWriter and CarbonS3Writer). The data is processed by the Flink, and finally written into + the stage directory of the target table by the CarbonXXXWriter. + + By default, those data in table stage directory, can not be immediately queried, those data can be queried + after the "INSERT INTO $tableName STAGE" command is executed. + + Since the flink data written to carbon is endless, in order to ensure the visibility of data + and the controllable amount of data processed during the execution of each insert form stage command, + the user should execute the insert from stage command in a timely manner. + + The execution interval of the insert form stage command should take the data visibility requirements + of the actual business and the flink data traffic. When the data visibility requirements are high + or the data traffic is large, the execution interval should be appropriately shortened. + +##Usage description + +###Writing process + + Typical flink stream: Source -> Process -> Output(Carbon Writer Sink) + + Pseudo code and description: + + ```scala +// Import dependencies. +import java.util.Properties +import org.apache.carbon.flink.CarbonWriterFactory +import org.apache.carbon.flink.ProxyFileSystem +import org.apache.carbondata.core.constants.CarbonCommonConstants +import org.apache.flink.api.common.restartstrategy.RestartStrategies +import org.apache.flink.core.fs.Path +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment +import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink + +// Specify database name. +val databaseName = "default" + +// Specify target table name. +val tableName = "test" +// Table path of the target table. +val tablePath = "/data/warehouse/test" +// Specify local temporary path. +val dataTempPath = "/data/temp/" + +val tableProperties = new Properties +// Set the table properties here. + +val writerProperties = new Properties +// Set the writer properties here, such as temp path, commit threshold, access key, secret key, endpoint, etc. + +val carbonProperties = new Properties +// Set the carbon properties here, such as date format, store location, etc. + +// Create carbon bulk writer factory. Two writer types are supported: 'Local' and 'S3'. +val writerFactory = CarbonWriterFactory.builder("Local").build( + databaseName, + tableName, + tablePath, + tableProperties, + writerProperties, + carbonProperties +) + +// Build a flink stream and run it. +// 1. New a flink execution environment. +val environment = StreamExecutionEnvironment.getExecutionEnvironment +// Set flink environment configuration here, such as parallelism, checkpointing, restart strategy, etc. + +// 2. Create flink data source, may be a kafka source, custom source, or others. +// The data type of source should be Array[AnyRef]. +// Array length should equals to table column count, and values order in array should matches table column order. +val source = ... +// 3. Create flink stream and set source.
+val stream = environment.addSource(source) +// 4. Add other flink operators here. +// ... +// 5. Set flink stream target (write data to carbon with a write sink). +stream.addSink(StreamingFileSink.forBulkFormat(new Path(ProxyFileSystem.DEFAULT_URI), writerFactory).build) +// 6. Run flink stream. +try { + environment.execute +} catch { + case exception: Exception => +// Handle execute exception here. +} + ``` + +###Writer properties Review comment: OK This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [carbondata] niuge01 commented on a change in pull request #3752: [CARBONDATA-3804] Provide end-to-end flink integration guide
niuge01 commented on a change in pull request #3752: URL: https://github.com/apache/carbondata/pull/3752#discussion_r421890531 ## File path: docs/flink-integration-guide.md ## @@ -0,0 +1,193 @@ +##Usage scenarios + + A typical scenario is that the data is cleaned and preprocessed by Flink, and then written to Carbon, + for subsequent analysis and queries. + + The CarbonData flink integration module is used connect Flink and Carbon in the above scenario. + + The CarbonData flink integration module provides a set of Flink BulkWriter implementations + (CarbonLocalWriter and CarbonS3Writer). The data is processed by the Flink, and finally written into + the stage directory of the target table by the CarbonXXXWriter. + + By default, those data in table stage directory, can not be immediately queried, those data can be queried + after the "INSERT INTO $tableName STAGE" command is executed. + + Since the flink data written to carbon is endless, in order to ensure the visibility of data + and the controllable amount of data processed during the execution of each insert form stage command, + the user should execute the insert from stage command in a timely manner. + + The execution interval of the insert form stage command should take the data visibility requirements + of the actual business and the flink data traffic. When the data visibility requirements are high + or the data traffic is large, the execution interval should be appropriately shortened. + +##Usage description Review comment: OK ## File path: docs/flink-integration-guide.md ## @@ -0,0 +1,193 @@ +##Usage scenarios + + A typical scenario is that the data is cleaned and preprocessed by Flink, and then written to Carbon, + for subsequent analysis and queries. + + The CarbonData flink integration module is used connect Flink and Carbon in the above scenario. + + The CarbonData flink integration module provides a set of Flink BulkWriter implementations + (CarbonLocalWriter and CarbonS3Writer). The data is processed by the Flink, and finally written into + the stage directory of the target table by the CarbonXXXWriter. + + By default, those data in table stage directory, can not be immediately queried, those data can be queried + after the "INSERT INTO $tableName STAGE" command is executed. + + Since the flink data written to carbon is endless, in order to ensure the visibility of data + and the controllable amount of data processed during the execution of each insert form stage command, + the user should execute the insert from stage command in a timely manner. + + The execution interval of the insert form stage command should take the data visibility requirements + of the actual business and the flink data traffic. When the data visibility requirements are high + or the data traffic is large, the execution interval should be appropriately shortened. + +##Usage description + +###Writing process Review comment: OK This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [carbondata] niuge01 commented on a change in pull request #3752: [CARBONDATA-3804] Provide end-to-end flink integration guide
niuge01 commented on a change in pull request #3752: URL: https://github.com/apache/carbondata/pull/3752#discussion_r421890386 ## File path: docs/flink-integration-guide.md ## @@ -0,0 +1,193 @@ +##Usage scenarios Review comment: OK ## File path: docs/flink-integration-guide.md ## @@ -0,0 +1,193 @@ +##Usage scenarios Review comment: OK This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3755: [WIP]refactor MV events
CarbonDataQA1 commented on pull request #3755: URL: https://github.com/apache/carbondata/pull/3755#issuecomment-62519 Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1254/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3755: [WIP]refactor MV events
CarbonDataQA1 commented on pull request #3755: URL: https://github.com/apache/carbondata/pull/3755#issuecomment-625443989 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2972/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (CARBONDATA-3809) Refresh index command fails for secondary index as per syntax mentioned in https://github.com/apache/carbondata/blob/master/docs/index/secondary-index-guide.md
Chetan Bhat created CARBONDATA-3809:
---
Summary: Refresh index command fails for secondary index as per syntax mentioned in https://github.com/apache/carbondata/blob/master/docs/index/secondary-index-guide.md
Key: CARBONDATA-3809
URL: https://issues.apache.org/jira/browse/CARBONDATA-3809
Project: CarbonData
Issue Type: Bug
Components: data-query
Affects Versions: 2.0.0
Environment: Spark 2.3.2, Spark 2.4.5
Reporter: Chetan Bhat

Refresh index command fails for secondary index as per syntax mentioned in [https://github.com/apache/carbondata/blob/master/docs/index/secondary-index-guide.md]

0: jdbc:hive2://10.20.255.171:23040/default> create table brinjal (imei string,AMSize string,channelsId string,ActiveCountry string, Activecity string,gamePointId double,deviceInformationId double,productionDate Timestamp,deliveryDate timestamp,deliverycharge double) STORED as carbondata TBLPROPERTIES('table_blocksize'='1');
+-+--+
| Result |
+-+--+
+-+--+
No rows selected (0.218 seconds)
0: jdbc:hive2://10.20.255.171:23040/default> LOAD DATA INPATH 'hdfs://hacluster/chetan/vardhandaterestruct.csv' INTO TABLE brinjal OPTIONS('DELIMITER'=',', 'QUOTECHAR'= '"','BAD_RECORDS_ACTION'='FORCE','FILEHEADER'= 'imei,deviceInformationId,AMSize,channelsId,ActiveCountry,Activecity,gamePointId,productionDate,deliveryDate,deliverycharge');
+-+--+
| Result |
+-+--+
+-+--+
No rows selected (2.31 seconds)
0: jdbc:hive2://10.20.255.171:23040/default> CREATE INDEX indextable2 ON TABLE brinjal (AMSize) AS 'carbondata';
+-+--+
| Result |
+-+--+
+-+--+
No rows selected (2.379 seconds)
0: jdbc:hive2://10.20.255.171:23040/default> refresh index indextable2;
Error: org.apache.spark.sql.AnalysisException:
== Parser1: org.apache.spark.sql.parser.CarbonExtensionSpark2SqlParser ==
[1.26] failure: end of input
refresh index indextable2
^;
== Parser2: org.apache.spark.sql.execution.SparkSqlParser ==
REFRESH statements cannot contain ' ', '\n', '\r', '\t' inside unquoted resource paths(line 1, pos 0)
== SQL ==
refresh index indextable2
^^^;
(state=,code=0)
0: jdbc:hive2://10.20.255.171:23040/default> REFRESH INDEX indextable2 WHERE SEGMENT.ID IN(0);
Error: org.apache.spark.sql.AnalysisException:
== Parser1: org.apache.spark.sql.parser.CarbonExtensionSpark2SqlParser ==
[1.27] failure: identifier matching regex (?i)ON expected
REFRESH INDEX indextable2 WHERE SEGMENT.ID IN(0)
^;
== Parser2: org.apache.spark.sql.execution.SparkSqlParser ==
REFRESH statements cannot contain ' ', '\n', '\r', '\t' inside unquoted resource paths(line 1, pos 0)
== SQL ==
REFRESH INDEX indextable2 WHERE SEGMENT.ID IN(0)
^^^;
(state=,code=0)

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
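The Parser1 failure above ("identifier matching regex (?i)ON expected") hints that the grammar expects an ON <table> clause after the index name, which both reported commands omit. Under that reading alone, and without verifying against the guide, the shape the parser appears to expect would be:

```scala
// Inferred only from the parser error in this report; unverified.
spark.sql("REFRESH INDEX indextable2 ON TABLE brinjal")
```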
[GitHub] [carbondata] akashrn5 opened a new pull request #3755: [WIP]refactor MV events
akashrn5 opened a new pull request #3755: URL: https://github.com/apache/carbondata/pull/3755 ### Why is this PR needed? ### What changes were proposed in this PR? ### Does this PR introduce any user interface change? - No - Yes. (please explain the change and update document) ### Is any new testcase added? - No - Yes This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3754: [CARBONDATA-3808] Added documentation for cdc and scd scenarios
CarbonDataQA1 commented on pull request #3754: URL: https://github.com/apache/carbondata/pull/3754#issuecomment-625357707 Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1253/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3754: [CARBONDATA-3808] Added documentation for cdc and scd scenarios
CarbonDataQA1 commented on pull request #3754: URL: https://github.com/apache/carbondata/pull/3754#issuecomment-625350618 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2971/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (CARBONDATA-3805) Drop index on bloom and lucene index fails
[ https://issues.apache.org/jira/browse/CARBONDATA-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chetan Bhat updated CARBONDATA-3805: Description: Drop index on bloom and lucene index fails 0: jdbc:hive2://10.20.255.35:23040/default> create table brinjal_bloom (imei string,AMSize string,channelsId string,ActiveCountry string, Activecity string,gamePointId double,deviceInformationId double,productionDate Timestamp,deliveryDate timestamp,deliverycharge double) STORED as carbondata TBLPROPERTIES('table_blocksize'='1'); +---++ |Result| +---++ +---++ No rows selected (0.261 seconds) 0: jdbc:hive2://10.20.255.35:23040/default> LOAD DATA INPATH 'hdfs://hacluster/chetan/vardhandaterestruct.csv' INTO TABLE brinjal_bloom OPTIONS('DELIMITER'=',', 'QUOTECHAR'= '"','BAD_RECORDS_ACTION'='FORCE','FILEHEADER'= 'imei,deviceInformationId,AMSize,channelsId,ActiveCountry,Activecity,gamePointId,productionDate,deliveryDate,deliverycharge'); +---++ |Result| +---++ +---++ No rows selected (2.196 seconds) 0: jdbc:hive2://10.20.255.35:23040/default> CREATE INDEX dm_brinjal ON TABLE brinjal_bloom(AMSize) as 'bloomfilter' PROPERTIES ('BLOOM_SIZE'='64', 'BLOOM_FPP'='0.1'); +---++ |Result| +---++ +---++ No rows selected (1.039 seconds) 0: jdbc:hive2://10.20.255.35:23040/default> drop index dm_brinjal on TABLE brinjal_bloom; *Error: org.apache.carbondata.core.exception.CarbonFileException: Error while setting modified time: (state=,code=0)* 0: jdbc:hive2://10.20.255.171:23040/default> CREATE TABLE uniqdata_lucene(CUST_ID int,CUST_NAME String,ACTIVE_EMUI_VERSION string, DOB timestamp, DOJ timestamp, BIGINT_COLUMN1 bigint,BIGINT_COLUMN2 bigint,DECIMAL_COLUMN1 decimal(30,10), DECIMAL_COLUMN2 decimal(36,10),Double_COLUMN1 double, Double_COLUMN2 double,INTEGER_COLUMN1 int) STORED as carbondata; +-+--+ | Result | +-+--+ +-+--+ No rows selected (0.632 seconds) 0: jdbc:hive2://10.20.255.171:23040/default> LOAD DATA INPATH 'hdfs://hacluster/chetan/2000_UniqData.csv' into table uniqdata_lucene OPTIONS('DELIMITER'=',', 'QUOTECHAR'='"','BAD_RECORDS_ACTION'='FORCE','FILEHEADER'='CUST_ID,CUST_NAME,ACTIVE_EMUI_VERSION,DOB,DOJ,BIGINT_COLUMN1,BIGINT_COLUMN2,DECIMAL_COLUMN1,DECIMAL_COLUMN2,Double_COLUMN1,Double_COLUMN2,INTEGER_COLUMN1'); +-+--+ | Result | +-+--+ +-+--+ No rows selected (3.894 seconds) 0: jdbc:hive2://10.20.255.171:23040/default> CREATE INDEX dm4 ON TABLE uniqdata_lucene (CUST_NAME) AS 'lucene'; +-+--+ | Result | +-+--+ +-+--+ No rows selected (2.518 seconds) 0: jdbc:hive2://10.20.255.171:23040/default> drop index dm4 on table uniqdata_lucene; *Error: org.apache.carbondata.core.exception.CarbonFileException: Error while setting modified time: (state=,code=0)* *Exception -* 2020-05-07 20:10:13,865 | ERROR | [HiveServer2-Background-Pool: Thread-358] | Error executing query, currentState RUNNING, | org.apache.spark.internal.Logging$class.logError(Logging.scala:91)2020-05-07 20:10:13,865 | ERROR | [HiveServer2-Background-Pool: Thread-358] | Error executing query, currentState RUNNING, | org.apache.spark.internal.Logging$class.logError(Logging.scala:91)org.apache.carbondata.core.exception.CarbonFileException: Error while setting modified time: at org.apache.carbondata.core.datastore.filesystem.AbstractDFSCarbonFile.setLastModifiedTime(AbstractDFSCarbonFile.java:192) at org.apache.spark.sql.secondaryindex.util.FileInternalUtil$.touchStoreTimeStamp(FileInternalUtil.scala:53) at org.apache.spark.sql.hive.CarbonHiveIndexMetadataUtil$.removeIndexInfoFromParentTable(CarbonHiveIndexMetadataUtil.scala:111)
at org.apache.spark.sql.execution.command.index.DropIndexCommand.removeIndexInfoFromParentTable(DropIndexCommand.scala:261) at org.apache.spark.sql.execution.command.index.DropIndexCommand.dropIndex(DropIndexCommand.scala:179) at org.apache.spark.sql.execution.command.index.DropIndexCommand.run(DropIndexCommand.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194) at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194) at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370) at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75) at
[jira] [Updated] (CARBONDATA-3805) Drop index on bloom index fails
[ https://issues.apache.org/jira/browse/CARBONDATA-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chetan Bhat updated CARBONDATA-3805: Description: Drop index on bloom index fails 0: jdbc:hive2://10.20.255.35:23040/default> create table brinjal_bloom (imei string,AMSize string,channelsId string,ActiveCountry string, Activecity string,gamePointId double,deviceInformationId double,productionDate Timestamp,deliveryDate timestamp,deliverycharge double) STORED as carbondata TBLPROPERTIES('table_blocksize'='1'); +--+-+ |Result| +--+-+ +--+-+ No rows selected (0.261 seconds) 0: jdbc:hive2://10.20.255.35:23040/default> LOAD DATA INPATH 'hdfs://hacluster/chetan/vardhandaterestruct.csv' INTO TABLE brinjal_bloom OPTIONS('DELIMITER'=',', 'QUOTECHAR'= '"','BAD_RECORDS_ACTION'='FORCE','FILEHEADER'= 'imei,deviceInformationId,AMSize,channelsId,ActiveCountry,Activecity,gamePointId,productionDate,deliveryDate,deliverycharge'); +--+-+ |Result| +--+-+ +--+-+ No rows selected (2.196 seconds) 0: jdbc:hive2://10.20.255.35:23040/default> CREATE INDEX dm_brinjal ON TABLE brinjal_bloom(AMSize) as 'bloomfilter' PROPERTIES ('BLOOM_SIZE'='64', 'BLOOM_FPP'='0.1'); +--+-+ |Result| +--+-+ +--+-+ No rows selected (1.039 seconds) 0: jdbc:hive2://10.20.255.35:23040/default> drop index dm_brinjal on TABLE brinjal_bloom; Error: org.apache.carbondata.core.exception.CarbonFileException: Error while setting modified time: (state=,code=0) *Exception -* 2020-05-07 20:10:13,865 | ERROR | [HiveServer2-Background-Pool: Thread-358] | Error executing query, currentState RUNNING, | org.apache.spark.internal.Logging$class.logError(Logging.scala:91)2020-05-07 20:10:13,865 | ERROR | [HiveServer2-Background-Pool: Thread-358] | Error executing query, currentState RUNNING, | org.apache.spark.internal.Logging$class.logError(Logging.scala:91)org.apache.carbondata.core.exception.CarbonFileException: Error while setting modified time: at org.apache.carbondata.core.datastore.filesystem.AbstractDFSCarbonFile.setLastModifiedTime(AbstractDFSCarbonFile.java:192) at org.apache.spark.sql.secondaryindex.util.FileInternalUtil$.touchStoreTimeStamp(FileInternalUtil.scala:53) at org.apache.spark.sql.hive.CarbonHiveIndexMetadataUtil$.removeIndexInfoFromParentTable(CarbonHiveIndexMetadataUtil.scala:111) at org.apache.spark.sql.execution.command.index.DropIndexCommand.removeIndexInfoFromParentTable(DropIndexCommand.scala:261) at org.apache.spark.sql.execution.command.index.DropIndexCommand.dropIndex(DropIndexCommand.scala:179) at org.apache.spark.sql.execution.command.index.DropIndexCommand.run(DropIndexCommand.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194) at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194) at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370) at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369) at org.apache.spark.sql.Dataset.(Dataset.scala:194) at 
org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:232) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:175) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:185) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at
[jira] [Updated] (CARBONDATA-3805) Drop index on bloom and lucene index fails
[ https://issues.apache.org/jira/browse/CARBONDATA-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chetan Bhat updated CARBONDATA-3805: Summary: Drop index on bloom and lucene index fails (was: Drop index on bloom index fails) > Drop index on bloom and lucene index fails > -- > > Key: CARBONDATA-3805 > URL: https://issues.apache.org/jira/browse/CARBONDATA-3805 > Project: CarbonData > Issue Type: Bug > Components: data-query >Affects Versions: 2.0.0 > Environment: Spark 2.3.2, Spark 2.4.5 >Reporter: Chetan Bhat >Priority: Major > > Drop index on bloom index fails > 0: jdbc:hive2://10.20.255.35:23040/default> create table brinjal_bloom (imei > string,AMSize string,channelsId string,ActiveCountry string, Activecity > string,gamePointId double,deviceInformationId double,productionDate > Timestamp,deliveryDate timestamp,deliverycharge double) STORED as carbondata > TBLPROPERTIES('table_blocksize'='1'); > +--+-+ > |Result| > +--+-+ > +--+-+ > No rows selected (0.261 seconds) > 0: jdbc:hive2://10.20.255.35:23040/default> LOAD DATA INPATH > 'hdfs://hacluster/chetan/vardhandaterestruct.csv' INTO TABLE brinjal_bloom > OPTIONS('DELIMITER'=',', 'QUOTECHAR'= > '"','BAD_RECORDS_ACTION'='FORCE','FILEHEADER'= > 'imei,deviceInformationId,AMSize,channelsId,ActiveCountry,Activecity,gamePointId,productionDate,deliveryDate,deliverycharge'); > +--+-+ > |Result| > +--+-+ > +--+-+ > No rows selected (2.196 seconds) > 0: jdbc:hive2://10.20.255.35:23040/default> CREATE INDEX dm_brinjal ON TABLE > brinjal_bloom(AMSize) as 'bloomfilter' PROPERTIES ('BLOOM_SIZE'='64', > 'BLOOM_FPP'='0.1'); > +--+-+ > |Result| > +--+-+ > +--+-+ > No rows selected (1.039 seconds) > 0: jdbc:hive2://10.20.255.35:23040/default> drop index dm_brinjal on TABLE > brinjal_bloom; > Error: org.apache.carbondata.core.exception.CarbonFileException: Error while > setting modified time: (state=,code=0) > *Exception -* > 2020-05-07 20:10:13,865 | ERROR | [HiveServer2-Background-Pool: Thread-358] | > Error executing query, currentState RUNNING, | > org.apache.spark.internal.Logging$class.logError(Logging.scala:91)2020-05-07 > 20:10:13,865 | ERROR | [HiveServer2-Background-Pool: Thread-358] | Error > executing query, currentState RUNNING, | > org.apache.spark.internal.Logging$class.logError(Logging.scala:91)org.apache.carbondata.core.exception.CarbonFileException: > Error while setting modified time: at > org.apache.carbondata.core.datastore.filesystem.AbstractDFSCarbonFile.setLastModifiedTime(AbstractDFSCarbonFile.java:192) > at > org.apache.spark.sql.secondaryindex.util.FileInternalUtil$.touchStoreTimeStamp(FileInternalUtil.scala:53) > at > org.apache.spark.sql.hive.CarbonHiveIndexMetadataUtil$.removeIndexInfoFromParentTable(CarbonHiveIndexMetadataUtil.scala:111) > at > org.apache.spark.sql.execution.command.index.DropIndexCommand.removeIndexInfoFromParentTable(DropIndexCommand.scala:261) > at > org.apache.spark.sql.execution.command.index.DropIndexCommand.dropIndex(DropIndexCommand.scala:179) > at > org.apache.spark.sql.execution.command.index.DropIndexCommand.run(DropIndexCommand.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194) at >
org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194) at > org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370) at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369) at > org.apache.spark.sql.Dataset.(Dataset.scala:194) at > org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79) at > org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642) at > org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694) at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:232) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:175) > at >
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3748: [CARBONDATA-3801] Query on partition table with SI having multiple partition columns gives empty results
CarbonDataQA1 commented on pull request #3748: URL: https://github.com/apache/carbondata/pull/3748#issuecomment-625301427 Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1252/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3748: [CARBONDATA-3801] Query on partition table with SI having multiple partition columns gives empty results
CarbonDataQA1 commented on pull request #3748:
URL: https://github.com/apache/carbondata/pull/3748#issuecomment-625300724

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2970/
[GitHub] [carbondata] Indhumathi27 opened a new pull request #3754: [CARBONDATA-3808] Added documentation for cdc and scd scenarios
Indhumathi27 opened a new pull request #3754:
URL: https://github.com/apache/carbondata/pull/3754

### Why is this PR needed?
Added documentation for cdc and scd scenarios.

### What changes were proposed in this PR?
Added documentation for cdc and scd scenarios.

### Does this PR introduce any user interface change?
- No

### Is any new testcase added?
- No
[GitHub] [carbondata] Indhumathi27 commented on pull request #3752: [CARBONDATA-3804] Provide end-to-end flink integration guide
Indhumathi27 commented on pull request #3752:
URL: https://github.com/apache/carbondata/pull/3752#issuecomment-625278288

@niuge01 Please add a link in readme.md file
[jira] [Created] (CARBONDATA-3808) Add documentation for cdc and scd scenarios
Indhumathi Muthumurugesh created CARBONDATA-3808:
-------------------------------------------------

             Summary: Add documentation for cdc and scd scenarios
                 Key: CARBONDATA-3808
                 URL: https://issues.apache.org/jira/browse/CARBONDATA-3808
             Project: CarbonData
          Issue Type: Improvement
            Reporter: Indhumathi Muthumurugesh
[jira] [Created] (CARBONDATA-3807) Filter queries and projection queries with bloom columns are not hitting the bloom datamap.
Prasanna Ravichandran created CARBONDATA-3807:
----------------------------------------------

             Summary: Filter queries and projection queries with bloom columns are not hitting the bloom datamap.
                 Key: CARBONDATA-3807
                 URL: https://issues.apache.org/jira/browse/CARBONDATA-3807
             Project: CarbonData
          Issue Type: Bug
         Environment: Ant cluster - opensource
            Reporter: Prasanna Ravichandran
         Attachments: bloom-filtercolumn-plan.png, bloom-show index.png

Filter queries and projection queries with bloom columns are not hitting the bloom datamap.

Test queries:

drop table if exists uniqdata;
CREATE TABLE uniqdata (cust_id int, cust_name String, active_emui_version string, dob timestamp, doj timestamp, bigint_column1 bigint, bigint_column2 bigint, decimal_column1 decimal(30,10), decimal_column2 decimal(36,36), double_column1 double, double_column2 double, integer_column1 int) stored as carbondata;
load data inpath 'hdfs://hacluster/user/prasanna/2000_UniqData.csv' into table uniqdata options('fileheader'='cust_id,cust_name,active_emui_version,dob,doj,bigint_column1,bigint_column2,decimal_column1,decimal_column2,double_column1,double_column2,integer_column1','bad_records_action'='force');
create datamap datamapuniq_b1 on table uniqdata(cust_name) as 'bloomfilter' PROPERTIES ('BLOOM_SIZE'='64', 'BLOOM_FPP'='0.1');
show indexes on uniqdata;
explain select count(*) from uniqdata where cust_name="CUST_NAME_0"; -- not hitting
explain select cust_name from uniqdata; -- not hitting
[jira] [Created] (CARBONDATA-3806) Create bloom datamap fails with null pointer exception
Chetan Bhat created CARBONDATA-3806:
------------------------------------

             Summary: Create bloom datamap fails with null pointer exception
                 Key: CARBONDATA-3806
                 URL: https://issues.apache.org/jira/browse/CARBONDATA-3806
             Project: CarbonData
          Issue Type: Bug
          Components: data-query
    Affects Versions: 1.6.1
         Environment: Spark 2.3.2
            Reporter: Chetan Bhat

Create bloom datamap fails with null pointer exception.

create table brinjal_bloom (imei string, AMSize string, channelsId string, ActiveCountry string, Activecity string, gamePointId double, deviceInformationId double, productionDate Timestamp, deliveryDate timestamp, deliverycharge double) STORED BY 'carbondata' TBLPROPERTIES('table_blocksize'='1');

LOAD DATA INPATH 'hdfs://hacluster/chetan/vardhandaterestruct.csv' INTO TABLE brinjal_bloom OPTIONS('DELIMITER'=',', 'QUOTECHAR'='"', 'BAD_RECORDS_ACTION'='FORCE', 'FILEHEADER'='imei,deviceInformationId,AMSize,channelsId,ActiveCountry,Activecity,gamePointId,productionDate,deliveryDate,deliverycharge');

0: jdbc:hive2://10.20.255.171:23040/default> CREATE DATAMAP dm_brinjal4 ON TABLE brinjal_bloom USING 'bloomfilter' DMPROPERTIES ('INDEX_COLUMNS' = 'AMSize', 'BLOOM_SIZE'='64', 'BLOOM_FPP'='0.1');
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 210.0 failed 4 times, most recent failure: Lost task 0.3 in stage 210.0 (TID 1477, vm2, executor 2): java.lang.NullPointerException
    at org.apache.carbondata.core.datamap.Segment.getCommittedIndexFile(Segment.java:150)
    at org.apache.carbondata.core.util.BlockletDataMapUtil.getTableBlockUniqueIdentifiers(BlockletDataMapUtil.java:198)
    at org.apache.carbondata.core.indexstore.blockletindex.BlockletDataMapFactory.getTableBlockIndexUniqueIdentifiers(BlockletDataMapFactory.java:176)
    at org.apache.carbondata.core.indexstore.blockletindex.BlockletDataMapFactory.getDataMaps(BlockletDataMapFactory.java:154)
    at org.apache.carbondata.core.indexstore.blockletindex.BlockletDataMapFactory.getSegmentProperties(BlockletDataMapFactory.java:425)
    at org.apache.carbondata.datamap.IndexDataMapRebuildRDD.internalCompute(IndexDataMapRebuildRDD.scala:359)
    at org.apache.carbondata.spark.rdd.CarbonRDD.compute(CarbonRDD.scala:84)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace: (state=,code=0)
[jira] [Resolved] (CARBONDATA-3792) remove the system folder location property as it is not used and refactor the getSystemFolderLocation code
[ https://issues.apache.org/jira/browse/CARBONDATA-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kunal Kapoor resolved CARBONDATA-3792.
--------------------------------------
       Fix Version/s: 2.0.0
          Resolution: Fixed

> remove the system folder location property as it is not used and refactor the getSystemFolderLocation code
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: CARBONDATA-3792
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-3792
>             Project: CarbonData
>          Issue Type: Bug
>            Reporter: Akash R Nilugal
>            Assignee: Akash R Nilugal
>            Priority: Minor
>             Fix For: 2.0.0
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> remove the system folder location property as it is not used and refactor the getSystemFolderLocation code
[GitHub] [carbondata] kunal642 commented on pull request #3743: [CARBONDATA-3792]Refactor system folder location and removed unwanted property
kunal642 commented on pull request #3743:
URL: https://github.com/apache/carbondata/pull/3743#issuecomment-625249259

LGTM
[GitHub] [carbondata] kunal642 edited a comment on pull request #3751: [CARBONDATA-3803] Mark CarbonSession as deprecated since 2.0
kunal642 edited a comment on pull request #3751:
URL: https://github.com/apache/carbondata/pull/3751#issuecomment-625229890

Please print a warn log in CarbonSession to indicate to the user that this is deprecated
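A minimal sketch of the kind of warning being requested (the logger wiring and message text below are assumptions for illustration, not the actual patch):

```
import org.apache.log4j.Logger

object CarbonSessionDeprecationWarning {
  // Hypothetical logger name; CarbonData's own logging utility may differ.
  private val LOGGER: Logger = Logger.getLogger("org.apache.spark.sql.CarbonSession")

  def main(args: Array[String]): Unit = {
    // The idea is a one-time warning emitted when a CarbonSession is created,
    // so users notice the deprecation even if they never read the Javadoc.
    LOGGER.warn("CarbonSession is deprecated since 2.0 and is kept only for " +
      "backward compatibility; please switch to CarbonExtensions.")
  }
}
```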
[GitHub] [carbondata] kunal642 commented on pull request #3751: [CARBONDATA-3803] Mark CarbonSession as deprecated since 2.0
kunal642 commented on pull request #3751:
URL: https://github.com/apache/carbondata/pull/3751#issuecomment-625229890

Please print a log in CarbonSession to indicate to the user that this is deprecated
[GitHub] [carbondata] kunal642 commented on pull request #3739: [CARBONDATA-3791] Index server doc changes
kunal642 commented on pull request #3739:
URL: https://github.com/apache/carbondata/pull/3739#issuecomment-625228432

LGTM
[GitHub] [carbondata] kunal642 commented on pull request #3746: [CARBONDATA-3799] Fix inverted index cannot work with adaptive encoding
kunal642 commented on pull request #3746:
URL: https://github.com/apache/carbondata/pull/3746#issuecomment-625228185

LGTM.. @jackylk Please review
[jira] [Created] (CARBONDATA-3805) Drop index on bloom index fails
Chetan Bhat created CARBONDATA-3805:
------------------------------------

             Summary: Drop index on bloom index fails
                 Key: CARBONDATA-3805
                 URL: https://issues.apache.org/jira/browse/CARBONDATA-3805
             Project: CarbonData
          Issue Type: Bug
          Components: data-query
    Affects Versions: 2.0.0
         Environment: Spark 2.3.2, Spark 2.4.5
            Reporter: Chetan Bhat

Drop index on bloom index fails.

0: jdbc:hive2://10.20.255.35:23040/default> create table brinjal_bloom (imei string, AMSize string, channelsId string, ActiveCountry string, Activecity string, gamePointId double, deviceInformationId double, productionDate Timestamp, deliveryDate timestamp, deliverycharge double) STORED as carbondata TBLPROPERTIES('table_blocksize'='1');
+---------+
| Result  |
+---------+
+---------+
No rows selected (0.261 seconds)

0: jdbc:hive2://10.20.255.35:23040/default> LOAD DATA INPATH 'hdfs://hacluster/chetan/vardhandaterestruct.csv' INTO TABLE brinjal_bloom OPTIONS('DELIMITER'=',', 'QUOTECHAR'='"', 'BAD_RECORDS_ACTION'='FORCE', 'FILEHEADER'='imei,deviceInformationId,AMSize,channelsId,ActiveCountry,Activecity,gamePointId,productionDate,deliveryDate,deliverycharge');
+---------+
| Result  |
+---------+
+---------+
No rows selected (2.196 seconds)

0: jdbc:hive2://10.20.255.35:23040/default> CREATE INDEX dm_brinjal ON TABLE brinjal_bloom(AMSize) as 'bloomfilter' PROPERTIES ('BLOOM_SIZE'='64', 'BLOOM_FPP'='0.1');
+---------+
| Result  |
+---------+
+---------+
No rows selected (1.039 seconds)

0: jdbc:hive2://10.20.255.35:23040/default> drop index dm_brinjal on TABLE brinjal_bloom;

*Exception -*
2020-05-07 20:10:13,865 | ERROR | [HiveServer2-Background-Pool: Thread-358] | Error executing query, currentState RUNNING, | org.apache.spark.internal.Logging$class.logError(Logging.scala:91)
2020-05-07 20:10:13,865 | ERROR | [HiveServer2-Background-Pool: Thread-358] | Error executing query, currentState RUNNING, | org.apache.spark.internal.Logging$class.logError(Logging.scala:91)
org.apache.carbondata.core.exception.CarbonFileException: Error while setting modified time:
    at org.apache.carbondata.core.datastore.filesystem.AbstractDFSCarbonFile.setLastModifiedTime(AbstractDFSCarbonFile.java:192)
    at org.apache.spark.sql.secondaryindex.util.FileInternalUtil$.touchStoreTimeStamp(FileInternalUtil.scala:53)
    at org.apache.spark.sql.hive.CarbonHiveIndexMetadataUtil$.removeIndexInfoFromParentTable(CarbonHiveIndexMetadataUtil.scala:111)
    at org.apache.spark.sql.execution.command.index.DropIndexCommand.removeIndexInfoFromParentTable(DropIndexCommand.scala:261)
    at org.apache.spark.sql.execution.command.index.DropIndexCommand.dropIndex(DropIndexCommand.scala:179)
    at org.apache.spark.sql.execution.command.index.DropIndexCommand.run(DropIndexCommand.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
    at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
    at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
    at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:194)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
    at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:232)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:175)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:185)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3753: [WIP] Partition column name should be case insensitive
CarbonDataQA1 commented on pull request #3753:
URL: https://github.com/apache/carbondata/pull/3753#issuecomment-625184095

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2969/
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3753: [WIP] Partition column name should be case insensitive
CarbonDataQA1 commented on pull request #3753:
URL: https://github.com/apache/carbondata/pull/3753#issuecomment-625177043

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1251/
[GitHub] [carbondata] akashrn5 commented on a change in pull request #3745: [CARBONDATA-3793]Data load with partition columns fail with InvalidLoadOptionException when load option 'header' is set to '
akashrn5 commented on a change in pull request #3745:
URL: https://github.com/apache/carbondata/pull/3745#discussion_r421406251

## File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/datasources/SparkCarbonTableFormat.scala
## @@ -126,6 +126,7 @@ with Serializable {
     optionsFinal.put(
       "fileheader",
       dataSchema.fields.map(_.name.toLowerCase).mkString(",") + "," + partitionStr)
+    optionsFinal.put("header", "false")

Review comment:
   I think you can remove it from where optionsLocal is filled.
[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #3752: [CARBONDATA-3804] Provide end-to-end flink integration guide
Indhumathi27 commented on a change in pull request #3752:
URL: https://github.com/apache/carbondata/pull/3752#discussion_r421233356

## File path: docs/flink-integration-guide.md
## @@ -0,0 +1,193 @@
+##Usage scenarios

Review comment:
   Please add license header

## File path: docs/flink-integration-guide.md
## @@ -0,0 +1,193 @@
+##Usage scenarios

Review comment:
   ```suggestion
   ## Usage scenarios
   ```

## File path: docs/flink-integration-guide.md
## @@ -0,0 +1,193 @@
+##Usage scenarios
+
+  A typical scenario is that the data is cleaned and preprocessed by Flink, and then written to Carbon,
+  for subsequent analysis and queries.
+
+  The CarbonData flink integration module is used to connect Flink and Carbon in the above scenario.
+
+  The CarbonData flink integration module provides a set of Flink BulkWriter implementations
+  (CarbonLocalWriter and CarbonS3Writer). The data is processed by Flink, and finally written into
+  the stage directory of the target table by the CarbonXXXWriter.
+
+  By default, the data in the table stage directory cannot be queried immediately; it becomes queryable
+  after the "INSERT INTO $tableName STAGE" command is executed.
+
+  Since the flink data written to carbon is endless, in order to ensure the visibility of data
+  and a controllable amount of data processed during each execution of the insert from stage command,
+  the user should execute the insert from stage command in a timely manner.
+
+  The execution interval of the insert from stage command should take into account the data visibility
+  requirements of the actual business and the flink data traffic. When the data visibility requirements
+  are high or the data traffic is large, the execution interval should be appropriately shortened.
+
+##Usage description

Review comment:
   ```suggestion
   ## Usage description
   ```

## File path: docs/flink-integration-guide.md
## @@ -0,0 +1,193 @@
+##Usage description
+
+###Writing process

Review comment:
   ```suggestion
   ### Writing process
   ```

## File path: docs/flink-integration-guide.md
## @@ -0,0 +1,193 @@
+###Writing process
+
+  Typical flink stream: Source -> Process -> Output(Carbon Writer Sink)
+
+  Pseudo code and
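The quoted hunk above (cut off mid-sentence in this archive) recommends running the insert-from-stage command periodically so that staged Flink output becomes queryable. As a rough illustration of that operational step, and not the guide's own pseudo code, a driver-side scheduler might look like the following; the table name, five-minute interval, and SparkSession wiring are all assumptions:

```
import java.util.concurrent.{Executors, TimeUnit}
import org.apache.spark.sql.SparkSession

object InsertFromStageScheduler {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("insert-from-stage").getOrCreate()
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    // Fold staged Flink output into the target table on a fixed interval;
    // shorten the interval when data-visibility requirements are high or
    // the Flink traffic is heavy, as the guide advises.
    scheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = spark.sql("INSERT INTO sink_table STAGE")
    }, 0, 5, TimeUnit.MINUTES)
  }
}
```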
[GitHub] [carbondata] akashrn5 commented on a change in pull request #3745: [CARBONDATA-3793]Data load with partition columns fail with InvalidLoadOptionException when load option 'header' is set to '
akashrn5 commented on a change in pull request #3745:
URL: https://github.com/apache/carbondata/pull/3745#discussion_r421350945

## File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/datasources/SparkCarbonTableFormat.scala
## @@ -126,6 +126,7 @@ with Serializable {
     optionsFinal.put(
       "fileheader",
       dataSchema.fields.map(_.name.toLowerCase).mkString(",") + "," + partitionStr)
+    optionsFinal.put("header", "false")

Review comment:
   Please refer to line 77 in org.apache.carbondata.processing.loading.model.CarbonLoadModelBuilder#build(java.util.Map, long, java.lang.String); this can be handled like that.
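For context, a rough sketch of the pattern the reviewer is pointing at: default the 'header' option once while building the load model rather than hard-coding it in SparkCarbonTableFormat. The helper name and method shape are illustrative assumptions, not the actual CarbonLoadModelBuilder code:

```
import java.util.{HashMap => JHashMap, Map => JMap}

object HeaderOptionDefaulting {
  // When an explicit 'fileheader' is supplied (as partition loads do), the CSV
  // itself carries no header row, so 'header' can safely default to false.
  def applyHeaderDefault(options: JMap[String, String]): Unit = {
    if (!options.containsKey("header") && options.containsKey("fileheader")) {
      options.put("header", "false")
    }
  }

  def main(args: Array[String]): Unit = {
    val options = new JHashMap[String, String]()
    options.put("fileheader", "imei,amsize")
    applyHeaderDefault(options)
    println(options) // {fileheader=imei,amsize, header=false}
  }
}
```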
[GitHub] [carbondata] QiangCai opened a new pull request #3753: [WIP] Partition column name should be case insensitive
QiangCai opened a new pull request #3753:
URL: https://github.com/apache/carbondata/pull/3753

### Why is this PR needed?
When inserting into a static partition, the partition column name is currently case sensitive.

### What changes were proposed in this PR?
The partition column name should be case insensitive: convert all partition column names to lower case.

### Does this PR introduce any user interface change?
- No

### Is any new testcase added?
- Yes
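A minimal sketch of the normalization this PR describes; the partition spec literal below is illustrative, not CarbonData's internal representation:

```
object PartitionSpecNormalization {
  def main(args: Array[String]): Unit = {
    // e.g. INSERT INTO t PARTITION (Region='asia') should match column 'region'.
    val staticPartitionSpec = Map("Region" -> "asia", "City" -> "shenzhen")
    // Convert all partition column names to lower case before matching them
    // against the table's (lower-cased) partition columns.
    val normalized = staticPartitionSpec.map { case (col, value) =>
      col.toLowerCase -> value
    }
    println(normalized) // Map(region -> asia, city -> shenzhen)
  }
}
```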
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3739: [CARBONDATA-3791] Index server doc changes
CarbonDataQA1 commented on pull request #3739:
URL: https://github.com/apache/carbondata/pull/3739#issuecomment-625082581

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1250/
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3750: [CARBONDATA-3548]Renamed index_handler to spatial_index property and indexColumn is changed to spatialColumn
CarbonDataQA1 commented on pull request #3750:
URL: https://github.com/apache/carbondata/pull/3750#issuecomment-625061586

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2967/
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3752: [CARBONDATA-3804] Provide end-to-end flink integration guide
CarbonDataQA1 commented on pull request #3752:
URL: https://github.com/apache/carbondata/pull/3752#issuecomment-625052127

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2966/
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3752: [CARBONDATA-3804] Provide end-to-end flink integration guide
CarbonDataQA1 commented on pull request #3752:
URL: https://github.com/apache/carbondata/pull/3752#issuecomment-625050735

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1248/