[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder
CarbonDataQA1 commented on pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#issuecomment-719163064

Build Failed with Spark 2.4.5. Please check CI: http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2975/
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder
CarbonDataQA1 commented on pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#issuecomment-719160498

Build Failed with Spark 2.3.4. Please check CI: http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4734/
[GitHub] [carbondata] QiangCai commented on a change in pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically
QiangCai commented on a change in pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#discussion_r514748658

## File path: integration/spark/src/main/scala/org/apache/carbondata/spark/load/DataLoadProcessBuilderOnSpark.scala

@@ -227,9 +232,15 @@ object DataLoadProcessBuilderOnSpark {
     // 2. sort
     var numPartitions = CarbonDataProcessorUtil.getGlobalSortPartitions(
       configuration.getDataLoadProperty(CarbonCommonConstants.LOAD_GLOBAL_SORT_PARTITIONS))
+
+    // if numPartitions user does not specify and not specified in config then dynamically calculate
     if (numPartitions <= 0) {
-      numPartitions = originRDD.partitions.length
+      val defaultMaxSplitBytes = sessionState(sparkSession).conf.filesMaxPartitionBytes
+      val dynamicPartitionNum = Math.ceil(SizeEstimator.estimate(originRDD).toDouble /

Review comment:
    does SizeEstimator.estimate work for an RDD?
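For readers following the diff, here is a self-contained sketch of the dynamic calculation under review, together with the caveat behind the review question. It is an illustration only: `sparkSession`, `originRDD`, `numPartitions` and the `sessionState(...)` helper are assumed from the surrounding method, and the final merged code may differ.

```scala
import org.apache.spark.util.SizeEstimator

// Proposed fallback when the user specifies no global sort partition count.
// Caveat raised in the review: SizeEstimator.estimate walks the object graph
// of the RDD handle on the driver; it does NOT measure the distributed data,
// so the estimate can be far from the real input size.
if (numPartitions <= 0) {
  val defaultMaxSplitBytes = sessionState(sparkSession).conf.filesMaxPartitionBytes
  val dynamicPartitionNum =
    Math.ceil(SizeEstimator.estimate(originRDD).toDouble / defaultMaxSplitBytes).toInt
  numPartitions = Math.max(dynamicPartitionNum, 1)
}
```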
[GitHub] [carbondata] QiangCai commented on pull request #3988: [CARBONDATA-4037] Improve the table status and segment file writing
QiangCai commented on pull request #3988:
URL: https://github.com/apache/carbondata/pull/3988#issuecomment-719134288

In the description, after point 2 is implemented, it should immediately delete the index file. So point 1 is not needed, right?
[GitHub] [carbondata] marchpure commented on a change in pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder
marchpure commented on a change in pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#discussion_r514720835

## File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/carbonTableSchemaCommon.scala

@@ -121,7 +121,7 @@ case class UpdateTableModel(
     updatedTimeStamp: Long,
     var executorErrors: ExecutionErrors,
     deletedSegments: Seq[Segment],
-    loadAsNewSegment: Boolean = false)
+    loadAsNewSegment: Boolean = true)

Review comment:
    I have removed all code about loadAsNewSegment.
[GitHub] [carbondata] marchpure commented on a change in pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder
marchpure commented on a change in pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#discussion_r514719866

## File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonDataRDDFactory.scala

@@ -342,7 +342,8 @@ object CarbonDataRDDFactory {
     try {
       if (!carbonLoadModel.isCarbonTransactionalTable || segmentLock.lockWithRetries()) {
-        if (updateModel.isDefined && !updateModel.get.loadAsNewSegment) {
+        if (updateModel.isDefined && (!updateModel.get.loadAsNewSegment

Review comment:
    I have modified the code according to your suggestion. When `updateModel.isDefined && dataframe.isEmpty` is true, it means the set of rows to update is empty, so we avoid triggering the loading process for an empty dataset.
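A minimal sketch of the guard described above, assuming `updateModel: Option[UpdateTableModel]` and `dataFrame: Option[DataFrame]` as in `CarbonDataRDDFactory.loadCarbonData`; the exact return handling in the merged code may differ.

```scala
// If this load was triggered by an UPDATE but the update produced no rows
// (dataFrame is None), skip the whole loading flow for the empty dataset.
if (updateModel.isDefined && dataFrame.isEmpty) {
  return
}
```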
[GitHub] [carbondata] marchpure commented on a change in pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder
marchpure commented on a change in pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#discussion_r514685949

## File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/allqueries/TestPruneUsingSegmentMinMax.scala

@@ -103,7 +103,7 @@ class TestPruneUsingSegmentMinMax extends QueryTest with BeforeAndAfterAll {
     sql("update carbon set(a)=(10) where a=1").collect()
     checkAnswer(sql("select count(*) from carbon where a=10"), Seq(Row(3)))
     showCache = sql("show metacache on table carbon").collect()
-    assert(showCache(0).get(2).toString.equalsIgnoreCase("6/8 index files cached"))
+    assert(showCache(0).get(2).toString.equalsIgnoreCase("1/6 index files cached"))

Review comment:
    1. In this test case there are 5 inserts and 1 update. If the update writes into a new segment, there will be 6 segments in the table, so 6 index files in total in the table store location.
    2. If the update wrote into the existing segment folders, the data for a = 10 would exist in segments 0/3/4. But since the update now writes into only one new segment folder, the data for a = 10 exists in segment 5.

    The data in the 6 segments is shown below.

    Segment - 0:
    +---+---+----+---+-------------------+
    |  a|  b|   c|  d|                  e|
    +---+---+----+---+-------------------+
    |  2| aa|23.6|  8|2017-09-02 00:00:00|
    +---+---+----+---+-------------------+

    Segment - 1:
    +---+---+----+---+-------------------+
    |  a|  b|   c|  d|                  e|
    +---+---+----+---+-------------------+
    |  3| ab|23.4|  5|2017-09-01 00:00:00|
    |  4| aa|23.6|  8|2017-09-02 00:00:00|
    +---+---+----+---+-------------------+

    Segment - 2:
    +---+---+----+---+-------------------+
    |  a|  b|   c|  d|                  e|
    +---+---+----+---+-------------------+
    |  5| ab|23.4|  5|2017-09-01 00:00:00|
    |  6| aa|23.6|  8|2017-09-02 00:00:00|
    +---+---+----+---+-------------------+

    Segment - 3:
    +---+---+----+---+-------------------+
    |  a|  b|   c|  d|                  e|
    +---+---+----+---+-------------------+
    |  2| aa|23.6|  8|2017-09-02 00:00:00|
    +---+---+----+---+-------------------+

    Segment - 4:
    +---+---+----+---+-------------------+
    |  a|  b|   c|  d|                  e|
    +---+---+----+---+-------------------+
    |  2| aa|23.6|  8|2017-09-02 00:00:00|
    +---+---+----+---+-------------------+

    Segment - 5:
    +---+---+----+---+-------------------+
    |  a|  b|   c|  d|                  e|
    +---+---+----+---+-------------------+
    | 10| ab|23.4|  5|2017-09-01 00:00:00|
    | 10| ab|23.4|  5|2017-09-01 00:00:00|
    | 10| ab|23.4|  5|2017-09-01 00:00:00|
    +---+---+----+---+-------------------+
[GitHub] [carbondata] QiangCai commented on a change in pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata
QiangCai commented on a change in pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#discussion_r514677476

## File path: integration/spark/src/main/scala/org/apache/carbondata/cleanfiles/CleanFilesUtil.scala

@@ -0,0 +1,400 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.cleanfiles
+
+import java.util
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ListBuffer
+
+import org.apache.spark.sql.{AnalysisException, CarbonEnv, Row, SparkSession}
+import org.apache.spark.sql.index.CarbonIndexUtil
+
+import org.apache.carbondata.common.logging.LogServiceFactory
+import org.apache.carbondata.core.constants.CarbonCommonConstants
+import org.apache.carbondata.core.datastore.filesystem.CarbonFile
+import org.apache.carbondata.core.datastore.impl.FileFactory
+import org.apache.carbondata.core.exception.ConcurrentOperationException
+import org.apache.carbondata.core.indexstore.PartitionSpec
+import org.apache.carbondata.core.locks.{CarbonLockFactory, CarbonLockUtil, ICarbonLock, LockUsage}
+import org.apache.carbondata.core.metadata.{AbsoluteTableIdentifier, CarbonMetadata, SegmentFileStore}
+import org.apache.carbondata.core.metadata.schema.table.CarbonTable
+import org.apache.carbondata.core.mutate.CarbonUpdateUtil
+import org.apache.carbondata.core.statusmanager.{LoadMetadataDetails, SegmentStatus, SegmentStatusManager}
+import org.apache.carbondata.core.util.{CarbonProperties, CarbonUtil}
+import org.apache.carbondata.core.util.path.{CarbonTablePath, TrashUtil}
+import org.apache.carbondata.processing.loading.TableProcessingOperations
+
+object CleanFilesUtil {
+  private val LOGGER = LogServiceFactory.getLogService(this.getClass.getCanonicalName)
+
+  /**
+   * The method deletes all data if forceTableClean and clean garbage segment
+   * (MARKED_FOR_DELETE state) if forceTableClean
+   *
+   * @param dbName : Database name
+   * @param tableName : Table name
+   * @param tablePath : Table path
+   * @param carbonTable : CarbonTable Object in case of force clean
+   * @param forceTableClean : for force clean it will delete all data
+   *                          it will clean garbage segment (MARKED_FOR_DELETE state)
+   * @param currentTablePartitions : Hive Partitions details
+   */
+  def cleanFiles(
+      dbName: String,
+      tableName: String,
+      tablePath: String,
+      carbonTable: CarbonTable,
+      forceTableClean: Boolean,
+      currentTablePartitions: Option[Seq[PartitionSpec]] = None,
+      truncateTable: Boolean = false): Unit = {
+    var carbonCleanFilesLock: ICarbonLock = null
+    val absoluteTableIdentifier = if (forceTableClean) {
+      AbsoluteTableIdentifier.from(tablePath, dbName, tableName, tableName)
+    } else {
+      carbonTable.getAbsoluteTableIdentifier
+    }
+    try {
+      val errorMsg = "Clean files request is failed for " +
+        s"$dbName.$tableName" +
+        ". Not able to acquire the clean files lock due to another clean files " +
+        "operation is running in the background."
+      // in case of force clean the lock is not required
+      if (forceTableClean) {
+        FileFactory.deleteAllCarbonFilesOfDir(
+          FileFactory.getCarbonFile(absoluteTableIdentifier.getTablePath))
+      } else {
+        carbonCleanFilesLock =
+          CarbonLockUtil
+            .getLockObject(absoluteTableIdentifier, LockUsage.CLEAN_FILES_LOCK, errorMsg)
+        if (truncateTable) {
+          SegmentStatusManager.truncateTable(carbonTable)
+        }
+        SegmentStatusManager.deleteLoadsAndUpdateMetadata(
+          carbonTable, true, currentTablePartitions.map(_.asJava).orNull)
+        CarbonUpdateUtil.cleanUpDeltaFiles(carbonTable, true)

Review comment:
    But I don't find any lock for update/delete when the clean files flow tries to cleanUpDeltaFiles. When clean files and an update run concurrently, this method of clean files will remove the files generated by the concurrent update.
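The race described here can be illustrated with a lock around the delta-file cleanup. This is a hedged sketch, not the merged fix: it reuses identifiers from the quoted hunk (`carbonTable`, `LOGGER`) and assumes the `LockUsage.UPDATE_LOCK` name used elsewhere in CarbonData's lock handling.

```scala
// Take the update lock before cleaning delta files so that a concurrent
// update cannot have its freshly written delta files removed underneath it.
val updateLock = CarbonLockFactory.getCarbonLockObj(
  carbonTable.getAbsoluteTableIdentifier, LockUsage.UPDATE_LOCK)
try {
  if (updateLock.lockWithRetries()) {
    CarbonUpdateUtil.cleanUpDeltaFiles(carbonTable, true)
  } else {
    LOGGER.error("Clean files could not acquire the update lock; skipping delta file cleanup")
  }
} finally {
  updateLock.unlock()
}
```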
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata
CarbonDataQA1 commented on pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#issuecomment-718951982

Build Failed with Spark 2.4.5. Please check CI: http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2974/
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata
CarbonDataQA1 commented on pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#issuecomment-718948233

Build Success with Spark 2.3.4. Please check CI: http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4733/
[GitHub] [carbondata] brijoobopanna commented on a change in pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata
brijoobopanna commented on a change in pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#discussion_r514436587

## File path: docs/cleanfiles.md

@@ -0,0 +1,78 @@
+## CLEAN FILES
+
+Clean files command is used to remove the Compacted, Marked For Delete, stale In Progress and Partial (segments which are missing from the table status file but whose data is present) segments from the store.
+
+  Clean Files Command
+  ```
+  CLEAN FILES FOR TABLE TABLE_NAME
+  ```
+
+### TRASH FOLDER
+
+  Carbondata supports a Trash Folder which is used as a redundant folder where all the unnecessary files and folders are moved to during the clean files operation.
+  This trash folder is maintained inside the table path. It is a hidden folder (.Trash). The segments that are moved to the trash folder are maintained under a timestamp
+  subfolder (the timestamp at which the clean files operation was called). This helps the user to list segments by timestamp. By default, every timestamp sub-directory has an expiration
+  time of 3 days from that timestamp, and it can be configured by the user using the following carbon property:
+  ```
+  carbon.trash.expiration.time = "Number of days"
+  ```
+  Once a timestamp subdirectory has expired as per the configured expiration value, it is deleted from the trash folder in the subsequent clean files command.
+
+### DRY RUN
+  Support for dry run is provided before the actual clean files operation. This dry run operation will list all the segments which are going to be manipulated during
+  the clean files operation. The dry run result will show the current location of the segment (it can be in the FACT folder, partition folder or trash folder) and where that segment
+  will be moved (to the trash folder, or deleted from the store) once the actual operation is called.
+
+  ```
+  CLEAN FILES FOR TABLE TABLE_NAME options('dry_run'='true')
+  ```
+
+### FORCE DELETE TRASH
+The force option with the clean files command deletes all the files and folders from the trash folder.
+
+  ```
+  CLEAN FILES FOR TABLE TABLE_NAME options('force'='true')
+  ```
+
+### DATA RECOVERY FROM THE TRASH FOLDER
+
+The segments can be recovered from the trash folder by creating an external table from the desired segment location
+in the trash folder and inserting into the original table from the external table. This will create a new segment in the original table.

Review comment:
    It would be good to also explain the behavior with SI or MV — how will they behave in these cases?
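The recovery flow in the quoted document can be expressed in a few statements. A minimal sketch, assuming a table `t`, a trash timestamp directory, and segment 2 as the one to recover; the path layout follows the trash-folder description above and is illustrative only.

```scala
// 1. Expose the trashed segment as an external carbon table (path is an assumption).
sql("""CREATE EXTERNAL TABLE recovered_segment STORED AS carbondata
       LOCATION '/user/warehouse/t/.Trash/1604000000000/Segment_2'""")
// 2. Insert back into the original table; this writes a brand new segment in t.
sql("INSERT INTO t SELECT * FROM recovered_segment")
// 3. Drop the helper table; the data now lives in the new segment of t.
sql("DROP TABLE recovered_segment")
```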
[GitHub] [carbondata] brijoobopanna commented on a change in pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata
brijoobopanna commented on a change in pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#discussion_r514434545

## File path: docs/cleanfiles.md

@@ -0,0 +1,78 @@
+### DRY RUN
+  Support for dry run is provided before the actual clean files operation. This dry run operation will list all the segments which are going to be manipulated during
+  the clean files operation. The dry run result will show the current location of the segment (it can be in the FACT folder, partition folder or trash folder) and where that segment

Review comment:
    What about also showing the time-threshold-based segments in the trash folder?
[GitHub] [carbondata] brijoobopanna commented on a change in pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata
brijoobopanna commented on a change in pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#discussion_r514432464

## File path: docs/cleanfiles.md

@@ -0,0 +1,78 @@
+### TRASH FOLDER
+
+  Carbondata supports a Trash Folder which is used as a redundant folder where all the unnecessary files and folders are moved to during the clean files operation.

Review comment:
    Can we please rewrite this part, instead of describing them as "unnecessary files and folders"?
[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata
vikramahuja1001 commented on a change in pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#discussion_r514429233

## File path: integration/spark/src/main/scala/org/apache/carbondata/cleanfiles/CleanFilesUtil.scala

@@ -0,0 +1,400 @@
+        SegmentStatusManager.deleteLoadsAndUpdateMetadata(
+          carbonTable, true, currentTablePartitions.map(_.asJava).orNull)
+        CarbonUpdateUtil.cleanUpDeltaFiles(carbonTable, true)

Review comment:
    No, we are copying the complete segment to the trash folder, so there are no issues with delta files. I have also added a test case with delete delta.
[jira] [Reopened] (CARBONDATA-4029) Getting Number format exception while querying on date columns in SDK carbon table.
[ https://issues.apache.org/jira/browse/CARBONDATA-4029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanna Ravichandran reopened CARBONDATA-4029:
-----------------------------------------------

> Getting Number format exception while querying on date columns in SDK carbon
> table.
> -----------------------------------------------------------------------------
>
>                 Key: CARBONDATA-4029
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-4029
>             Project: CarbonData
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>         Environment: 3 node FI cluster
>            Reporter: Prasanna Ravichandran
>            Priority: Minor
>         Attachments: Primitive.rar
>
> We are getting a Number format exception while querying on the date columns.
> Attached the SDK files also.
>
> Test queries:
> --SDK compaction;
> drop table if exists external_primitive;
> create table external_primitive (id int, name string, rank smallint, salary double, active boolean, dob date, doj timestamp, city string, dept string) stored as carbondata;
> alter table external_primitive add segment options('path'='hdfs://hacluster/sdkfiles/primitive','format'='carbon');
> alter table external_primitive add segment options('path'='hdfs://hacluster/sdkfiles/primitive2','format'='carbon');
> alter table external_primitive add segment options('path'='hdfs://hacluster/sdkfiles/primitive3','format'='carbon');
> alter table external_primitive add segment options('path'='hdfs://hacluster/sdkfiles/primitive4','format'='carbon');
> alter table external_primitive add segment options('path'='hdfs://hacluster/sdkfiles/primitive5','format'='carbon');
>
> alter table external_primitive compact 'minor'; --working fine, pass;
> select count(*) from external_primitive; --working fine, pass;
> show segments for table external_primitive;
> select * from external_primitive limit 13; --working fine, pass;
> select * from external_primitive limit 14; --failed, getting number format exception;
> select min(dob) from external_primitive; --failed, getting number format exception;
> select max(dob) from external_primitive; --working;
> select dob from external_primitive; --failed, getting number format exception;
>
> Console:
> *0: /> show segments for table external_primitive;*
> +-----+-----------+--------------------------+-----------------+-----------+-----------+------------+-------------+
> | ID  | Status    | Load Start Time          | Load Time Taken | Partition | Data Size | Index Size | File Format |
> +-----+-----------+--------------------------+-----------------+-----------+-----------+------------+-------------+
> | 4   | Success   | 2020-10-13 11:52:04.012  | 0.511S          | {}        | 1.88KB    | 655.0B     | columnar_v3 |
> | 3   | Compacted | 2020-10-13 11:52:00.587  | 0.828S          | {}        | 1.88KB    | 655.0B     | columnar_v3 |
> | 2   | Compacted | 2020-10-13 11:51:57.767  | 0.775S          | {}        | 1.88KB    | 655.0B     | columnar_v3 |
> | 1   | Compacted | 2020-10-13 11:51:54.678  | 1.024S          | {}        | 1.88KB    | 655.0B     | columnar_v3 |
> | 0.1 | Success   | 2020-10-13 11:52:05.986  | 5.785S          | {}        | 9.62KB    | 5.01KB     | columnar_v3 |
> | 0   | Compacted | 2020-10-13 11:51:51.072  | 1.125S          | {}        | 8.55KB    | 4.25KB     | columnar_v3 |
> +-----+-----------+--------------------------+-----------------+-----------+-----------+------------+-------------+
> 6 rows selected (0.45 seconds)
>
> *0: /> select * from external_primitive limit 13;* --working fine, pass;
> INFO : Execution ID: 95
> +----+------+------+-----------------+--------+------------+-----------------------+-----------+-------+
> | id | name | rank | salary          | active | dob        | doj                   | city      | dept  |
> +----+------+------+-----------------+--------+------------+-----------------------+-----------+-------+
> | 1  | AAA  | 3    | 3444345.66      | true   | 1979-12-09 | 2011-02-09 22:30:20.0 | Pune      | IT    |
> | 2  | BBB  | 2    | 543124.66       | false  | 1987-02-19 | 2017-01-01 09:30:20.0 | Bangalore | DATA  |
> | 3  | CCC  | 1    | 787878.888      | false  | 1982-05-12 | 2015-11-30 23:50:20.0 | Pune      | DATA  |
> | 4  | DDD  | 1    | 9.24            | true   | 1981-04-09 | 2000-01-15 04:30:20.0 | Delhi     | MAINS |
> | 5  | EEE  | 3    | 545656.99       | true   | 1987-12-09 | 2017-11-25 01:30:20.0 | Delhi     | IT    |
> | 6  | FFF  | 2    | 768678.0        | false  | 1987-12-20 | 2017-01-10 02:30:20.0 | Bangalore | DATA  |
> | 7  | GGG  | 3    | 765665.0        | true   | 1983-06-12 | 2016-12-31 23:30:20.0 | Pune      | IT    |
> | 8  | HHH  | 2    | 567567.66       | false  | 1979-01-12 | 1995-01-01 09:30:20.0 | Bangalore | DATA  |
> | 9  | III  | 2    | 787878.767      | true   | 1985-02-19 | 2005-08-14 22:30:20.0 | Pune      | DATA  |
> | 10 | JJJ  | 3    | 887877.14       | true   | 2000-05-19 | 2016-10-10 09:30:20.0 | Bangalore | MAINS |
> | 18 |      | 3    | 7.86786786787E9 | true   | 1980-10-05 | 1995-10-07 19:30:20.0 | Bangalore | IT    |
> | 19 |      | 2    | 5464545.33      | true   | 1986-06-06 | 2008-08-14 22:30:20.0 | Delhi     | DATA  |
> | 20 |
[GitHub] [carbondata] marchpure commented on a change in pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder
marchpure commented on a change in pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#discussion_r514279897

## File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/alterTable/TestAlterTableSortColumnsProperty.scala

@@ -739,14 +739,14 @@ class TestAlterTableSortColumnsProperty extends QueryTest with BeforeAndAfterAll
     val table = CarbonEnv.getCarbonTable(Option("default"), tableName)(sqlContext.sparkSession)
     val tablePath = table.getTablePath
-    (0 to 2).foreach { segmentId =>
+    (0 to 3).foreach { segmentId =>

Review comment:
    Now the update writes into a new segment 3; before, the update wrote only into the old segment 2. So the test case has to change from (0 to 2) to (0 to 3).
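To see what the changed loop exercises, here is a hedged sketch of the assertion pattern; `CarbonTablePath.getSegmentPath` and `FileFactory.isFileExist` exist in CarbonData core, but the assertions in the real test differ and are elided here.

```scala
// With update-as-new-segment, the update adds segment 3 instead of rewriting
// segment 2, so the test now has to walk four segment folders.
(0 to 3).foreach { segmentId =>
  val segmentPath = CarbonTablePath.getSegmentPath(tablePath, segmentId.toString)
  assert(FileFactory.isFileExist(segmentPath))
}
```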
[jira] [Closed] (CARBONDATA-4029) Getting Number format exception while querying on date columns in SDK carbon table.
[ https://issues.apache.org/jira/browse/CARBONDATA-4029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanna Ravichandran closed CARBONDATA-4029.
---------------------------------------------
    Resolution: Won't Fix

> Getting Number format exception while querying on date columns in SDK carbon
> table.
> -----------------------------------------------------------------------------
[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata
vikramahuja1001 commented on a change in pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#discussion_r514271886

## File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java

@@ -1136,7 +1137,8 @@
     if (updateCompletionStatus) {
       DeleteLoadFolders
           .physicalFactAndMeasureMetadataDeletion(carbonTable, newAddedLoadHistoryList,
-              isForceDeletion, partitionSpecs);
+              isForceDeletion, partitionSpecs, String.valueOf(new Timestamp(System

Review comment:
    Yes, moved.
[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder
ajantha-bhat commented on a change in pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#discussion_r514266931

## File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/CarbonProjectForUpdateCommand.scala

@@ -340,7 +340,8 @@ private[sql] case class CarbonProjectForUpdateCommand(
       case _ => sys.error("")
     }
-    val updateTableModel = UpdateTableModel(true, currentTime, executorErrors, deletedSegments)
+    val updateTableModel = UpdateTableModel(true, currentTime, executorErrors, deletedSegments,
+      !carbonRelation.carbonTable.isHivePartitionTable)

Review comment:
    @marchpure: Also, please reply to my other comments and questions where they are handled.
[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder
ajantha-bhat commented on a change in pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#discussion_r514263181

## File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/CarbonProjectForUpdateCommand.scala

@@ -340,7 +340,8 @@ private[sql] case class CarbonProjectForUpdateCommand(
       case _ => sys.error("")
     }
-    val updateTableModel = UpdateTableModel(true, currentTime, executorErrors, deletedSegments)
+    val updateTableModel = UpdateTableModel(true, currentTime, executorErrors, deletedSegments,
+      !carbonRelation.carbonTable.isHivePartitionTable)

Review comment:
    Nice, I will review it again. @QiangCai or others can also review this PR once.
[GitHub] [carbondata] marchpure commented on a change in pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder
marchpure commented on a change in pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#discussion_r514260343

## File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/CarbonProjectForUpdateCommand.scala

@@ -340,7 +340,8 @@ private[sql] case class CarbonProjectForUpdateCommand(
       case _ => sys.error("")
     }
-    val updateTableModel = UpdateTableModel(true, currentTime, executorErrors, deletedSegments)
+    val updateTableModel = UpdateTableModel(true, currentTime, executorErrors, deletedSegments,
+      !carbonRelation.carbonTable.isHivePartitionTable)

Review comment:
    I have modified the code according to your suggestion. Now, for partition tables, update will write as a new segment.
[jira] [Closed] (CARBONDATA-3971) Session level dynamic properties for repair(carbon.load.si.repair and carbon.si.repair.limit) are not updated in https://github.com/apache/carbondata/blob/master/docs
[ https://issues.apache.org/jira/browse/CARBONDATA-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chetan Bhat closed CARBONDATA-3971.
-----------------------------------
    Fix Version/s: 2.1.0
       Resolution: Fixed

Issue is fixed.

> Session level dynamic properties for repair (carbon.load.si.repair and
> carbon.si.repair.limit) are not updated in
> https://github.com/apache/carbondata/blob/master/docs/configuration-parameters.md
> -----------------------------------------------------------------------------------------
>
>                 Key: CARBONDATA-3971
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-3971
>             Project: CarbonData
>          Issue Type: Bug
>          Components: docs
>    Affects Versions: 2.1.0
>            Reporter: Chetan Bhat
>            Priority: Minor
>             Fix For: 2.1.0
>
> Session level dynamic properties for repair (carbon.load.si.repair and
> carbon.si.repair.limit) are not mentioned in the github link
> https://github.com/apache/carbondata/blob/master/docs/configuration-parameters.md
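For reference, the two properties named in the issue are session-level dynamic properties, so they can be set from a running session. What each value controls is inferred from the property names and the issue title; treat the example values as assumptions, and see configuration-parameters.md for the authoritative description.

```scala
// Session-level toggles for secondary index repair (illustrative values).
sql("SET carbon.load.si.repair = true")  // repair SI segments during load
sql("SET carbon.si.repair.limit = 2")    // cap the number of segments repaired
```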
[jira] [Closed] (CARBONDATA-3937) Insert into select from another carbon /parquet table is not working on Hive Beeline on a newly create Hive write format - carbon table. We are getting “Database is n
[ https://issues.apache.org/jira/browse/CARBONDATA-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanna Ravichandran closed CARBONDATA-3937.
---------------------------------------------

> Insert into select from another carbon/parquet table is not working on Hive
> Beeline on a newly created Hive write format carbon table. We are getting a
> "Database is not set" error.
> -----------------------------------------------------------------------------
>
>                 Key: CARBONDATA-3937
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-3937
>             Project: CarbonData
>          Issue Type: Bug
>          Components: hive-integration
>    Affects Versions: 2.0.0
>            Reporter: Prasanna Ravichandran
>            Priority: Major
>
> Insert into select from another carbon or parquet table into a carbon table is
> not working on Hive Beeline on a newly created Hive write format carbon table.
> We are getting a "Database is not set" error.
>
> Test queries:
> drop table if exists hive_carbon;
> create table hive_carbon(id int, name string, scale decimal, country string, salary double) stored by 'org.apache.carbondata.hive.CarbonStorageHandler';
> insert into hive_carbon select 1,"Ram","2.3","India",3500;
> insert into hive_carbon select 2,"Raju","2.4","Russia",3600;
> insert into hive_carbon select 3,"Raghu","2.5","China",3700;
> insert into hive_carbon select 4,"Ravi","2.6","Australia",3800;
>
> drop table if exists hive_carbon2;
> create table hive_carbon2(id int, name string, scale decimal, country string, salary double) stored by 'org.apache.carbondata.hive.CarbonStorageHandler';
> insert into hive_carbon2 select * from hive_carbon;
> select * from hive_carbon;
> select * from hive_carbon2;
>
> --execute below queries in spark-beeline;
> create table hive_table(id int, name string, scale decimal, country string, salary double);
> create table parquet_table(id int, name string, scale decimal, country string, salary double) stored as parquet;
> insert into hive_table select 1,"Ram","2.3","India",3500;
> select * from hive_table;
> insert into parquet_table select 1,"Ram","2.3","India",3500;
> select * from parquet_table;
>
> --execute the below query in hive beeline;
> insert into hive_carbon select * from parquet_table;
>
> Attached the logs for your reference. But the insert into select from the
> parquet and hive table into a carbon table is working fine in spark; only
> insert into select from a hive table to a carbon table is working here.
>
> Error details in the MR job which runs through the hive query:
> Error: java.io.IOException: java.io.IOException: Database name is not set.
> at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
> at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
> at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:414)
> at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:843)
> at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.(MapTask.java:175)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:444)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
> at org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:175)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1737)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
> Caused by: java.io.IOException: Database name is not set.
> at org.apache.carbondata.hadoop.api.CarbonInputFormat.getDatabaseName(CarbonInputFormat.java:841)
> at org.apache.carbondata.hive.MapredCarbonInputFormat.getCarbonTable(MapredCarbonInputFormat.java:80)
> at org.apache.carbondata.hive.MapredCarbonInputFormat.getQueryModel(MapredCarbonInputFormat.java:215)
> at org.apache.carbondata.hive.MapredCarbonInputFormat.getRecordReader(MapredCarbonInputFormat.java:205)
> at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:411)
> ... 9 more
[jira] [Closed] (CARBONDATA-3852) CCD Merge with Partition Table is giving different results in different spark deploy modes
[ https://issues.apache.org/jira/browse/CARBONDATA-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sachin Ramachandra Setty closed CARBONDATA-3852.
------------------------------------------------

This PR resolved this issue: https://github.com/apache/carbondata/pull/3835

> CCD Merge with Partition Table is giving different results in different spark
> deploy modes
> -----------------------------------------------------------------------------
>
>                 Key: CARBONDATA-3852
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-3852
>             Project: CarbonData
>          Issue Type: Bug
>          Components: spark-integration
>    Affects Versions: 2.0.0
>            Reporter: Sachin Ramachandra Setty
>            Priority: Major
>             Fix For: 2.1.0
>
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> The result sets are different when the sql queries are run in spark-shell
> --master local and spark-shell --master yarn (two different Spark deploy modes).
>
> {code}
> import scala.collection.JavaConverters._
> import java.sql.Date
> import org.apache.spark.sql._
> import org.apache.spark.sql.CarbonSession._
> import org.apache.spark.sql.catalyst.TableIdentifier
> import org.apache.spark.sql.execution.command.mutation.merge.{CarbonMergeDataSetCommand, DeleteAction, InsertAction, InsertInHistoryTableAction, MergeDataSetMatches, MergeMatch, UpdateAction, WhenMatched, WhenNotMatched, WhenNotMatchedAndExistsOnlyOnTarget}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.test.util.QueryTest
> import org.apache.spark.sql.types.{BooleanType, DateType, IntegerType, StringType, StructField, StructType}
> import spark.implicits._
>
> val df1 = sc.parallelize(1 to 10, 4).map{ x => ("id"+x, s"order$x",s"customer$x", x*10, x*75, 1)}.toDF("id", "name", "c_name", "quantity", "price", "state")
> df1.write.format("carbondata").option("tableName", "order").mode(SaveMode.Overwrite).save()
> val dwframe = spark.read.format("carbondata").option("tableName", "order").load()
> val dwSelframe = dwframe.as("A")
>
> val ds1 = sc.parallelize(3 to 10, 4)
>   .map {x =>
>     if (x <= 4) {
>       ("id"+x, s"order$x",s"customer$x", x*10, x*75, 2)
>     } else {
>       ("id"+x, s"order$x",s"customer$x", x*10, x*75, 1)
>     }
>   }.toDF("id", "name", "c_name", "quantity", "price", "state")
>
> val ds2 = sc.parallelize(1 to 2, 4).map {x => ("newid"+x, s"order$x",s"customer$x", x*10, x*75, 1)}.toDS().toDF()
> val ds3 = ds1.union(ds2)
> val odsframe = ds3.as("B")
>
> sql("drop table if exists target").show()
> val initframe = spark.createDataFrame(Seq(
>   Row("a", "0"),
>   Row("b", "1"),
>   Row("c", "2"),
>   Row("d", "3")
> ).asJava, StructType(Seq(StructField("key", StringType), StructField("value", StringType))))
> initframe.write
>   .format("carbondata")
>   .option("tableName", "target")
>   .option("partitionColumns", "value")
>   .mode(SaveMode.Overwrite)
>   .save()
>
> val target = spark.read.format("carbondata").option("tableName", "target").load()
> var ccd =
>   spark.createDataFrame(Seq(
>     Row("a", "10", false, 0),
>     Row("a", null, true, 1),
>     Row("b", null, true, 2),
>     Row("c", null, true, 3),
>     Row("c", "20", false, 4),
>     Row("c", "200", false, 5),
>     Row("e", "100", false, 6)
>   ).asJava,
>   StructType(Seq(StructField("key", StringType),
>     StructField("newValue", StringType),
>     StructField("deleted", BooleanType), StructField("time", IntegerType))))
>
> ccd.createOrReplaceTempView("changes")
> ccd = sql("SELECT key, latest.newValue as newValue, latest.deleted as deleted FROM ( SELECT key, max(struct(time, newValue, deleted)) as latest FROM changes GROUP BY key)")
> val updateMap = Map("key" -> "B.key", "value" -> "B.newValue").asInstanceOf[Map[Any, Any]]
> val insertMap = Map("key" -> "B.key", "value" -> "B.newValue").asInstanceOf[Map[Any, Any]]
> target.as("A").merge(ccd.as("B"), "A.key=B.key").
>   whenMatched("B.deleted=false").
>   updateExpr(updateMap).
>   whenNotMatched("B.deleted=false").
>   insertExpr(insertMap).
>   whenMatched("B.deleted=true").
>   delete().execute()
> {code}
>
> SQL queries to run:
> {code}
> sql("select count(*) from target").show()
> sql("select * from target order by key").show()
> {code}
>
> Results in spark-shell --master yarn:
> {code}
> scala> sql("select count(*) from target").show()
> +--------+
> |count(1)|
> +--------+
> |       4|
> +--------+
>
> scala> sql("select * from target order by key").show()
> +---+-----+
> |key|value|
> +---+-----+
> |  a|    0|
> |  b|    1|
> |  c|    2|
> |  d|    3|
> +---+-----+
> {code}
>
> Results in spark-shell --master local:
> {code}
> scala> sql("select count(*) from target").show()
> +--------+
> |count(1)|
> +--------+
> |       3|
> +--------+
> scala>
[jira] [Closed] (CARBONDATA-3851) Merge Update and Insert with Partition Table is giving different results in different spark deploy modes
[ https://issues.apache.org/jira/browse/CARBONDATA-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sachin Ramachandra Setty closed CARBONDATA-3851. This PR resolved this issue: https://github.com/apache/carbondata/pull/3835

> Merge Update and Insert with Partition Table is giving different results in different spark deploy modes
> ---------------------------------------------------------------------------------------------------------
>
> Key: CARBONDATA-3851
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3851
> Project: CarbonData
> Issue Type: Bug
> Components: spark-integration
> Affects Versions: 2.0.0
> Reporter: Sachin Ramachandra Setty
> Priority: Major
> Fix For: 2.1.0
>
> The result sets differ when the same queries are run in spark-shell --master local and spark-shell --master yarn (two different Spark deploy modes).
> Steps to reproduce the issue:
> {code}
> import scala.collection.JavaConverters._
> import java.sql.Date
> import org.apache.spark.sql._
> import org.apache.spark.sql.CarbonSession._
> import org.apache.spark.sql.catalyst.TableIdentifier
> import org.apache.spark.sql.execution.command.mutation.merge.{CarbonMergeDataSetCommand, DeleteAction, InsertAction, InsertInHistoryTableAction, MergeDataSetMatches, MergeMatch, UpdateAction, WhenMatched, WhenNotMatched, WhenNotMatchedAndExistsOnlyOnTarget}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.test.util.QueryTest
> import org.apache.spark.sql.types.{BooleanType, DateType, IntegerType, StringType, StructField, StructType}
> import spark.implicits._
>
> sql("drop table if exists order").show()
> sql("drop table if exists order_hist").show()
> sql("create table order_hist(id string, name string, quantity int, price int, state int) PARTITIONED BY (c_name String) STORED AS carbondata").show()
>
> val initframe = sc.parallelize(1 to 10, 4).map { x => ("id"+x, s"order$x", s"customer$x", x*10, x*75, 1) }.toDF("id", "name", "c_name", "quantity", "price", "state")
> initframe.write
>   .format("carbondata")
>   .option("tableName", "order")
>   .option("partitionColumns", "c_name")
>   .mode(SaveMode.Overwrite)
>   .save()
> val dwframe = spark.read.format("carbondata").option("tableName", "order").load()
> val dwSelframe = dwframe.as("A")
>
> val ds1 = sc.parallelize(3 to 10, 4)
>   .map { x =>
>     if (x <= 4) {
>       ("id"+x, s"order$x", s"customer$x", x*10, x*75, 2)
>     } else {
>       ("id"+x, s"order$x", s"customer$x", x*10, x*75, 1)
>     }
>   }.toDF("id", "name", "c_name", "quantity", "price", "state")
> ds1.show()
> val ds2 = sc.parallelize(1 to 2, 4).map { x => ("newid"+x, s"order$x", s"customer$x", x*10, x*75, 1) }.toDS().toDF()
> ds2.show()
> val ds3 = ds1.union(ds2)
> ds3.show()
> val odsframe = ds3.as("B")
>
> var matches = Seq.empty[MergeMatch]
> val updateMap = Map(col("id") -> col("A.id"),
>   col("price") -> expr("B.price + 1"),
>   col("state") -> col("B.state"))
> val insertMap = Map(col("id") -> col("B.id"),
>   col("name") -> col("B.name"),
>   col("c_name") -> col("B.c_name"),
>   col("quantity") -> col("B.quantity"),
>   col("price") -> expr("B.price * 100"),
>   col("state") -> col("B.state"))
> val insertMap_u = Map(col("id") -> col("id"),
>   col("name") -> col("name"),
>   col("c_name") -> lit("insert"),
>   col("quantity") -> col("quantity"),
>   col("price") -> expr("price"),
>   col("state") -> col("state"))
> val insertMap_d = Map(col("id") -> col("id"),
>   col("name") -> col("name"),
>   col("c_name") -> lit("delete"),
>   col("quantity") -> col("quantity"),
>   col("price") -> expr("price"),
>   col("state") -> col("state"))
>
> matches ++= Seq(WhenMatched(Some(col("A.state") =!= col("B.state"))).addAction(UpdateAction(updateMap)).addAction(InsertInHistoryTableAction(insertMap_u, TableIdentifier("order_hist"))))
> matches ++= Seq(WhenNotMatched().addAction(InsertAction(insertMap)))
> matches ++= Seq(WhenNotMatchedAndExistsOnlyOnTarget().addAction(DeleteAction()).addAction(InsertInHistoryTableAction(insertMap_d, TableIdentifier("order_hist"))))
> {code}
>
> SQL queries:
> {code}
> sql("select count(*) from order").show()
> sql("select count(*) from order where state = 2").show()
> sql("select price from order where id = 'newid1'").show()
> sql("select count(*) from order_hist where c_name = 'delete'").show()
> sql("select count(*) from order_hist where c_name = 'insert'").show()
> {code}
> Results in spark-shell --master yarn:
> {code}
> scala> sql("select count(*) from order").show()
> +--------+
> |count(1)|
> +--------+
> |      10|
> +--------+
>
> scala> sql("select count(*) from order where state = 2").show()
> +--------+
> |count(1)|
> +--------+
> |       0|
> +--------+
>
> scala> sql("select price from order where id = 'newid1'").show()
> +-----+
> |price|
> +-----+
> +-----+
>
> scala> sql("select
[jira] [Updated] (CARBONDATA-4034) Improve the time-consuming of Horizontal Compaction for update
[ https://issues.apache.org/jira/browse/CARBONDATA-4034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiayu Shen updated CARBONDATA-4034:
-----------------------------------
Description:
In the update flow, horizontal compaction becomes significantly slower when the update touches many segments (or many blocks). Here is a case whose cost is shown in the log:
{code:java}
2020-10-10 09:38:10,466 | INFO | [OperationManager-Background-Pool-28] | Horizontal Update Compaction operation started for [ods_oms.oms_wh_outbound_order]
2020-10-10 09:50:25,718 | INFO | [OperationManager-Background-Pool-28] | Horizontal Update Compaction operation completed for [ods_oms.oms_wh_outbound_order].
2020-10-10 10:15:44,302 | INFO | [OperationManager-Background-Pool-28] | Horizontal Delete Compaction operation started for [ods_oms.oms_wh_outbound_order]
2020-10-10 10:15:54,874 | INFO | [OperationManager-Background-Pool-28] | Horizontal Delete Compaction operation completed for [ods_oms.oms_wh_outbound_order].{code}
In this PR, we optimize the process between the second and third lines of the log by optimizing the method _performDeleteDeltaCompaction_ in the horizontal compaction flow.

was: (the same description, previously without the {code:java} formatting around the log)

> Improve the time-consuming of Horizontal Compaction for update
> ---------------------------------------------------------------
>
> Key: CARBONDATA-4034
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4034
> Project: CarbonData
> Issue Type: Bug
> Reporter: Jiayu Shen
> Priority: Minor
> Time Spent: 17h 10m
> Remaining Estimate: 0h
>
> In the update flow, horizontal compaction becomes significantly slower when the update touches many segments (or many blocks). Here is a case whose cost is shown in the log:
> {code:java}
> 2020-10-10 09:38:10,466 | INFO | [OperationManager-Background-Pool-28] | Horizontal Update Compaction operation started for [ods_oms.oms_wh_outbound_order]
> 2020-10-10 09:50:25,718 | INFO | [OperationManager-Background-Pool-28] | Horizontal Update Compaction operation completed for [ods_oms.oms_wh_outbound_order].
> 2020-10-10 10:15:44,302 | INFO | [OperationManager-Background-Pool-28] | Horizontal Delete Compaction operation started for [ods_oms.oms_wh_outbound_order]
> 2020-10-10 10:15:54,874 | INFO | [OperationManager-Background-Pool-28] | Horizontal Delete Compaction operation completed for [ods_oms.oms_wh_outbound_order].{code}
> In this PR, we optimize the process between the second and third lines of the log by optimizing the method _performDeleteDeltaCompaction_ in the horizontal compaction flow.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
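The gap between the second and third log lines (roughly 09:50 to 10:15) is where the time goes. Below is a minimal, hypothetical sketch of the kind of restructuring the description suggests, assuming the slow part is a serial per-block pass over delete-delta files: group the delta files by block and compact the groups concurrently. None of these names are CarbonData's actual _performDeleteDeltaCompaction_ internals.
{code}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object DeleteDeltaCompactionSketch {
  // illustrative stand-in for one delete-delta file belonging to one block
  case class DeltaFile(blockId: String, path: String)

  // placeholder for the real I/O that merges several delta files into one
  def mergeDeltaFiles(blockId: String, paths: Seq[String]): Unit =
    println(s"compacting ${paths.size} delta files for block $blockId")

  def compactDeleteDeltas(deltas: Seq[DeltaFile], threads: Int = 8): Unit = {
    val pool = Executors.newFixedThreadPool(threads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      // one task per block: each block's deltas are merged exactly once,
      // and blocks are processed in parallel instead of one after another
      val tasks = deltas.groupBy(_.blockId).toSeq.map { case (blockId, files) =>
        Future(mergeDeltaFiles(blockId, files.map(_.path)))
      }
      Await.result(Future.sequence(tasks), Duration.Inf)
    } finally {
      pool.shutdown()
    }
  }
}
{code}
Under that assumption, the wall-clock cost drops from the number of blocks times the per-block latency to roughly that total divided by the thread count, which is the scale of improvement the log gap calls for.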
[GitHub] [carbondata] asfgit closed pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
asfgit closed pull request #3986: URL: https://github.com/apache/carbondata/pull/3986 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically
CarbonDataQA1 commented on pull request #3912: URL: https://github.com/apache/carbondata/pull/3912#issuecomment-718730475 Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2972/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically
CarbonDataQA1 commented on pull request #3912: URL: https://github.com/apache/carbondata/pull/3912#issuecomment-718729637 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4731/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [carbondata] QiangCai commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
QiangCai commented on pull request #3986: URL: https://github.com/apache/carbondata/pull/3986#issuecomment-718729381 LGTM, great job. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder
CarbonDataQA1 commented on pull request #3999: URL: https://github.com/apache/carbondata/pull/3999#issuecomment-718701205 Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2971/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder
CarbonDataQA1 commented on pull request #3999: URL: https://github.com/apache/carbondata/pull/3999#issuecomment-718697037 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4730/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Resolved] (CARBONDATA-4042) Insert into select and CTAS launches fewer tasks(task count limited to number of nodes in cluster) even when target table is of no_sort
[ https://issues.apache.org/jira/browse/CARBONDATA-4042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akash R Nilugal resolved CARBONDATA-4042.
------------------------------------------
Fix Version/s: 2.1.0
Resolution: Fixed

> Insert into select and CTAS launches fewer tasks(task count limited to number of nodes in cluster) even when target table is of no_sort
> ----------------------------------------------------------------------------------------------------------------------------------------
>
> Key: CARBONDATA-4042
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4042
> Project: CarbonData
> Issue Type: Improvement
> Components: data-load, spark-integration
> Reporter: Venugopal Reddy K
> Priority: Major
> Fix For: 2.1.0
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> *Issue:*
> At present, when we do insert into table select from, or create table as select from, we launch a single task per node. Whereas when we do a simple select * from table query, the tasks launched are equal to the number of carbondata files (CARBON_TASK_DISTRIBUTION defaults to CARBON_TASK_DISTRIBUTION_BLOCK). This slows down the load performance of the insert-into-select and CTAS cases. Refer to the [community discussion regarding task launch|http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Query-Regarding-Task-launch-mechanism-for-data-load-operations-tt98711.html]
>
> *Suggestion:*
> Launch the same number of tasks as the select query does for insert-into-select and CTAS cases when the target table is no-sort.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
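A hedged sketch of the suggestion above, with every name hypothetical (totalBytes, maxSplitBytes and nodeCount stand in for whatever CarbonData actually derives them from): for a no-sort target table, derive the insert task count from the data size, the way a block-distributed select does, instead of pinning it to the node count.
{code}
object InsertTaskCountSketch {
  def insertTaskCount(totalBytes: Long, maxSplitBytes: Long,
      nodeCount: Int, sortScope: String): Int = {
    if (sortScope.equalsIgnoreCase("no_sort")) {
      // roughly one task per split, as a select over the same data would launch
      math.max(1, math.ceil(totalBytes.toDouble / maxSplitBytes).toInt)
    } else {
      // previous behavior: one task per node
      math.max(1, nodeCount)
    }
  }
}
{code}
For example, insertTaskCount(10L << 30, 128L << 20, 3, "no_sort") yields 80 tasks for 10 GiB of input at 128 MiB per split, instead of 3 tasks on a 3-node cluster.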
[GitHub] [carbondata] asfgit closed pull request #3972: [CARBONDATA-4042]Launch same number of task as select query for insert into select and ctas cases when target table is of no_sort
asfgit closed pull request #3972: URL: https://github.com/apache/carbondata/pull/3972 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3972: [CARBONDATA-4042]Launch same number of task as select query for insert into select and ctas cases when target table is of no_sort
CarbonDataQA1 commented on pull request #3972: URL: https://github.com/apache/carbondata/pull/3972#issuecomment-718612012 Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2970/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (CARBONDATA-3819) Fileformat column details is not present in the show segments DDL for heterogenous segments table.
[ https://issues.apache.org/jira/browse/CARBONDATA-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prasanna Ravichandran closed CARBONDATA-3819.
----------------------------------------------
Fixed and verified.

> Fileformat column details is not present in the show segments DDL for heterogenous segments table.
> ---------------------------------------------------------------------------------------------------
>
> Key: CARBONDATA-3819
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3819
> Project: CarbonData
> Issue Type: Bug
> Environment: Opensource ANT cluster
> Reporter: Prasanna Ravichandran
> Priority: Minor
> Attachments: fileformat_notworking_actualresult.PNG, fileformat_working_expected.PNG
>
> The FileFormat column is not present in the SHOW SEGMENTS output for a table with heterogeneous segments.
> Test steps:
> # Create a heterogeneous table with added parquet and carbon segments.
> # Do SHOW SEGMENTS.
> Expected result:
> The FileFormat column should appear in the SHOW SEGMENTS output.
> Actual result:
> The FileFormat column is not shown in the SHOW SEGMENTS output.
> See the attached screenshots for more details.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (CARBONDATA-3819) Fileformat column details is not present in the show segments DDL for heterogenous segments table.
[ https://issues.apache.org/jira/browse/CARBONDATA-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prasanna Ravichandran resolved CARBONDATA-3819.
------------------------------------------------
Resolution: Fixed

This issue is fixed in the latest Carbon jars (2.0.0).

> Fileformat column details is not present in the show segments DDL for heterogenous segments table.
> ---------------------------------------------------------------------------------------------------
>
> Key: CARBONDATA-3819
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3819
> Project: CarbonData
> Issue Type: Bug
> Environment: Opensource ANT cluster
> Reporter: Prasanna Ravichandran
> Priority: Minor
> Attachments: fileformat_notworking_actualresult.PNG, fileformat_working_expected.PNG
>
> The FileFormat column is not present in the SHOW SEGMENTS output for a table with heterogeneous segments.
> Test steps:
> # Create a heterogeneous table with added parquet and carbon segments.
> # Do SHOW SEGMENTS.
> Expected result:
> The FileFormat column should appear in the SHOW SEGMENTS output.
> Actual result:
> The FileFormat column is not shown in the SHOW SEGMENTS output.
> See the attached screenshots for more details.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
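A hedged spark-shell repro sketch for this issue. It assumes CarbonData's ALTER TABLE ... ADD SEGMENT syntax for attaching a foreign-format segment; the parquet path is a placeholder, not a real location.
{code}
sql("create table mixed_tbl (id int, name string) stored as carbondata")
sql("insert into mixed_tbl select 1, 'a'")
// attach an existing parquet folder as a segment (path is a placeholder)
sql("alter table mixed_tbl add segment " +
  "options ('path'='hdfs://.../parquet_dir', 'format'='parquet')")
// expected: the output contains a FileFormat column distinguishing the
// native carbondata segment from the added parquet segment
sql("show segments for table mixed_tbl").show(false)
{code}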
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3972: [CARBONDATA-4042]Launch same number of task as select query for insert into select and ctas cases when target table is of no_sort
CarbonDataQA1 commented on pull request #3972: URL: https://github.com/apache/carbondata/pull/3972#issuecomment-718593609 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4729/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder
CarbonDataQA1 commented on pull request #3999: URL: https://github.com/apache/carbondata/pull/3999#issuecomment-718541543 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4728/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder
CarbonDataQA1 commented on pull request #3999: URL: https://github.com/apache/carbondata/pull/3999#issuecomment-718523188 Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2969/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3995: [CARBONDATA-4043] Fix data load failure issue for columns added in legacy store
CarbonDataQA1 commented on pull request #3995: URL: https://github.com/apache/carbondata/pull/3995#issuecomment-718505838 Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2968/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3995: [CARBONDATA-4043] Fix data load failure issue for columns added in legacy store
CarbonDataQA1 commented on pull request #3995: URL: https://github.com/apache/carbondata/pull/3995#issuecomment-718503918 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4727/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically
CarbonDataQA1 commented on pull request #3912: URL: https://github.com/apache/carbondata/pull/3912#issuecomment-718493643 Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4725/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically
CarbonDataQA1 commented on pull request #3912: URL: https://github.com/apache/carbondata/pull/3912#issuecomment-718491747 Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2966/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [carbondata] QiangCai commented on pull request #3972: [CARBONDATA-4042]Launch same number of task as select query for insert into select and ctas cases when target table is of no_sort
QiangCai commented on pull request #3972: URL: https://github.com/apache/carbondata/pull/3972#issuecomment-718478244 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [carbondata] QiangCai commented on a change in pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically
QiangCai commented on a change in pull request #3912: URL: https://github.com/apache/carbondata/pull/3912#discussion_r514046333

## File path: integration/spark/src/main/scala/org/apache/carbondata/spark/load/DataLoadProcessBuilderOnSpark.scala

## @@ -143,10 +145,16 @@ object DataLoadProcessBuilderOnSpark {
     var numPartitions = CarbonDataProcessorUtil.getGlobalSortPartitions(
       configuration.getDataLoadProperty(CarbonCommonConstants.LOAD_GLOBAL_SORT_PARTITIONS))
+
+    // if numPartitions user does not specify and not specified in config then dynamically calculate
+    if (numPartitions == 0) {
+      numPartitions = Math.ceil(model.getTotalSize.toDouble / defaultMaxSplitBytes).toInt
+    }
+
+    // after calculation based on size if still zero then take the partition number
     if (numPartitions <= 0) {
       numPartitions = convertRDD.partitions.length

Review comment: numPartitions = Math.min(convertRDD.partitions.length, dynamic partition number)

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [carbondata] QiangCai commented on a change in pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically
QiangCai commented on a change in pull request #3912: URL: https://github.com/apache/carbondata/pull/3912#discussion_r514043536

## File path: integration/spark/src/main/scala/org/apache/carbondata/spark/load/DataLoadProcessBuilderOnSpark.scala

## @@ -93,6 +94,7 @@ object DataLoadProcessBuilderOnSpark {
     val convertStepRowCounter = sc.longAccumulator("Convert Processor Accumulator")
     val sortStepRowCounter = sc.longAccumulator("Sort Processor Accumulator")
     val writeStepRowCounter = sc.longAccumulator("Write Processor Accumulator")
+    val defaultMaxSplitBytes = sessionState(sparkSession).conf.filesMaxPartitionBytes

Review comment: move to line 156

## @@ -227,9 +236,17 @@ object DataLoadProcessBuilderOnSpark {
     // 2. sort
     var numPartitions = CarbonDataProcessorUtil.getGlobalSortPartitions(
       configuration.getDataLoadProperty(CarbonCommonConstants.LOAD_GLOBAL_SORT_PARTITIONS))
+
+    // if numPartitions user does not specify and not specified in config then dynamically calculate
+    if (numPartitions <= 0) {
+      numPartitions = Math.ceil(SizeEstimator.estimate(originRDD) / defaultMaxSplitBytes).toInt

Review comment: SizeEstimator.estimate(originRDD).toDouble and move to line 247

## @@ -143,10 +145,16 @@ object DataLoadProcessBuilderOnSpark {
     var numPartitions = CarbonDataProcessorUtil.getGlobalSortPartitions(
       configuration.getDataLoadProperty(CarbonCommonConstants.LOAD_GLOBAL_SORT_PARTITIONS))
+
+    // if numPartitions user does not specify and not specified in config then dynamically calculate
+    if (numPartitions == 0) {
+      numPartitions = Math.ceil(model.getTotalSize.toDouble / defaultMaxSplitBytes).toInt
+    }
+
+    // after calculation based on size if still zero then take the partition number
     if (numPartitions <= 0) {
       numPartitions = convertRDD.partitions.length

Review comment: Math.min(convertRDD.partitions.length, dynamic partition number)

## @@ -227,9 +236,17 @@ object DataLoadProcessBuilderOnSpark {
     // 2. sort
     var numPartitions = CarbonDataProcessorUtil.getGlobalSortPartitions(
       configuration.getDataLoadProperty(CarbonCommonConstants.LOAD_GLOBAL_SORT_PARTITIONS))
+
+    // if numPartitions user does not specify and not specified in config then dynamically calculate
+    if (numPartitions <= 0) {
+      numPartitions = Math.ceil(SizeEstimator.estimate(originRDD) / defaultMaxSplitBytes).toInt
+    }
+
+    // after calculation based on size if still zero then take the partition number
     if (numPartitions <= 0) {
       numPartitions = originRDD.partitions.length

Review comment: numPartitions = Math.min(originRDD.partitions.length, dynamic partition number)

## @@ -202,6 +210,7 @@ object DataLoadProcessBuilderOnSpark {
     val partialSuccessAccum = sc.longAccumulator("Partial Success Accumulator")
     val sortStepRowCounter = sc.longAccumulator("Sort Processor Accumulator")
     val writeStepRowCounter = sc.longAccumulator("Write Processor Accumulator")
+    val defaultMaxSplitBytes = sessionState(sparkSession).conf.filesMaxPartitionBytes

Review comment: move to line 247

## @@ -143,10 +145,16 @@ object DataLoadProcessBuilderOnSpark {
     var numPartitions = CarbonDataProcessorUtil.getGlobalSortPartitions(
       configuration.getDataLoadProperty(CarbonCommonConstants.LOAD_GLOBAL_SORT_PARTITIONS))
+
+    // if numPartitions user does not specify and not specified in config then dynamically calculate
+    if (numPartitions == 0) {
+      numPartitions = Math.ceil(model.getTotalSize.toDouble / defaultMaxSplitBytes).toInt

Review comment: move to 156

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
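Pulling these review comments together, here is a minimal runnable sketch of the suggested flow; totalSize, maxSplitBytes and rddPartitions stand in for model.getTotalSize (or SizeEstimator.estimate(originRDD)), filesMaxPartitionBytes and convertRDD.partitions.length respectively.
{code}
object GlobalSortPartitionsSketch {
  def globalSortPartitions(configured: Int, totalSize: Long,
      maxSplitBytes: Long, rddPartitions: Int): Int = {
    if (configured > 0) {
      configured // the user-specified or table-property value always wins
    } else {
      // derive the partition count from the data size; toDouble avoids the
      // integer division the review comment flags
      val dynamic = math.ceil(totalSize.toDouble / maxSplitBytes).toInt
      // cap by the RDD's existing partition count, per the Math.min suggestion
      math.max(1, math.min(rddPartitions, dynamic))
    }
  }
}
{code}
Note that without .toDouble, Math.ceil(SizeEstimator.estimate(originRDD) / defaultMaxSplitBytes) floors the quotient before the ceil is applied, which is exactly what the second review comment corrects.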
[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #3995: [CARBONDATA-4043] Fix data load failure issue for columns added in legacy store
Indhumathi27 commented on a change in pull request #3995: URL: https://github.com/apache/carbondata/pull/3995#discussion_r514042919

## File path: processing/src/main/java/org/apache/carbondata/processing/util/CarbonDataProcessorUtil.java

## @@ -424,6 +440,38 @@
     return noDicSortColMapping;
   }

+  /**
+   * Get the sort/no_sort column map based on schema order.
+   * This will be used in the final sort step to find the index of sort column, to compare the
+   * intermediate row data based on schema.
+   */
+  public static Map> getSortColSchemaOrderMapping(CarbonTable carbonTable) {

Review comment: added

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
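A simplified, hypothetical Scala analogue of the javadoc above (the real method is Java, and its generic type parameters are not visible in this diff): map each sort column to its position in schema order so the final sort step can locate and compare sort-column values in intermediate rows.
{code}
object SortColMappingSketch {
  case class Column(name: String, isSortColumn: Boolean)

  // index of every sort column, keyed by name, in schema order
  def sortColSchemaOrderMapping(schema: Seq[Column]): Map[String, Int] =
    schema.zipWithIndex.collect {
      case (c, idx) if c.isSortColumn => c.name -> idx
    }.toMap
}
{code}
For instance, sortColSchemaOrderMapping(Seq(Column("id", true), Column("name", false), Column("city", true))) yields Map("id" -> 0, "city" -> 2).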
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically
CarbonDataQA1 commented on pull request #3912: URL: https://github.com/apache/carbondata/pull/3912#issuecomment-718399151 Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4721/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically
CarbonDataQA1 commented on pull request #3912: URL: https://github.com/apache/carbondata/pull/3912#issuecomment-718398674 Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2963/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org