[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-29 Thread GitBox


CarbonDataQA1 commented on pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#issuecomment-719163064


   Build Failed  with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2975/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-29 Thread GitBox


CarbonDataQA1 commented on pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#issuecomment-719160498


   Build Failed  with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4734/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] QiangCai commented on a change in pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

2020-10-29 Thread GitBox


QiangCai commented on a change in pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#discussion_r514748658



##
File path: 
integration/spark/src/main/scala/org/apache/carbondata/spark/load/DataLoadProcessBuilderOnSpark.scala
##
@@ -227,9 +232,15 @@ object DataLoadProcessBuilderOnSpark {
 // 2. sort
 var numPartitions = CarbonDataProcessorUtil.getGlobalSortPartitions(
   configuration.getDataLoadProperty(CarbonCommonConstants.LOAD_GLOBAL_SORT_PARTITIONS))
+
+// if numPartitions user does not specify and not specified in config then dynamically calculate
 if (numPartitions <= 0) {
-  numPartitions = originRDD.partitions.length
+  val defaultMaxSplitBytes = sessionState(sparkSession).conf.filesMaxPartitionBytes
+  val dynamicPartitionNum = Math.ceil(SizeEstimator.estimate(originRDD).toDouble /

Review comment:
   does SizeEstimator.estimate work for RDD?
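   
   For context, SizeEstimator.estimate measures the in-memory size of an object graph on
   the driver, so calling it on the RDD handle itself does not reflect the size of the
   distributed data. A minimal sketch of an alternative, assuming a small driver-side
   sample is acceptable (the helper name and sample size are illustrative, not
   CarbonData's API):

   ```
   import org.apache.spark.rdd.RDD
   import org.apache.spark.util.SizeEstimator

   // Estimate a global-sort partition count from a sampled average row size instead of
   // calling SizeEstimator on the RDD handle. Illustrative sketch only.
   def dynamicGlobalSortPartitions[T <: AnyRef](rdd: RDD[T], maxPartitionBytes: Long): Int = {
     val sample = rdd.take(100)                        // small driver-side sample
     if (sample.isEmpty) {
       1
     } else {
       val avgRowBytes = sample.map(SizeEstimator.estimate(_)).sum / sample.length
       val totalBytes = avgRowBytes * rdd.count()      // extrapolate; count() scans the data
       Math.max(1, Math.ceil(totalBytes.toDouble / maxPartitionBytes).toInt)
     }
   }
   ```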





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] QiangCai commented on pull request #3988: [CARBONDATA-4037] Improve the table status and segment file writing

2020-10-29 Thread GitBox


QiangCai commented on pull request #3988:
URL: https://github.com/apache/carbondata/pull/3988#issuecomment-719134288


   In the description, after point 2 is implemented, it should immediately delete the 
index file. So point 1 is not needed, right?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] marchpure commented on a change in pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-29 Thread GitBox


marchpure commented on a change in pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#discussion_r514720835



##
File path: 
integration/spark/src/main/scala/org/apache/spark/sql/execution/command/carbonTableSchemaCommon.scala
##
@@ -121,7 +121,7 @@ case class UpdateTableModel(
 updatedTimeStamp: Long,
 var executorErrors: ExecutionErrors,
 deletedSegments: Seq[Segment],
-loadAsNewSegment: Boolean = false)
+loadAsNewSegment: Boolean = true)

Review comment:
   I have removed all the code related to loadAsNewSegment.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] marchpure commented on a change in pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-29 Thread GitBox


marchpure commented on a change in pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#discussion_r514719866



##
File path: 
integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonDataRDDFactory.scala
##
@@ -342,7 +342,8 @@ object CarbonDataRDDFactory {
 
 try {
   if (!carbonLoadModel.isCarbonTransactionalTable || 
segmentLock.lockWithRetries()) {
-if (updateModel.isDefined && !updateModel.get.loadAsNewSegment) {
+if (updateModel.isDefined && (!updateModel.get.loadAsNewSegment

Review comment:
   I have modified the code according to your suggestion.
   When (updateModel.isDefined && dataframe.isEmpty) is true,
   it means the rows to be updated are empty, so we avoid triggering the loading process 
for an empty dataset.
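   
   A minimal sketch of this guard, using the UpdateTableModel from
   carbonTableSchemaCommon.scala above (the helper name is hypothetical, not the PR's
   exact code):

   ```
   import org.apache.spark.sql.DataFrame
   import org.apache.spark.sql.execution.command.UpdateTableModel

   // Trigger the loading job only when there is something to write: an UPDATE whose
   // projected dataframe is empty is skipped, so no empty segment or dirty index file
   // is produced.
   def shouldTriggerLoad(updateModel: Option[UpdateTableModel],
       dataframe: Option[DataFrame]): Boolean = {
     !(updateModel.isDefined && dataframe.isEmpty)
   }
   ```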





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] marchpure commented on a change in pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-29 Thread GitBox


marchpure commented on a change in pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#discussion_r514685949



##
File path: 
integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/allqueries/TestPruneUsingSegmentMinMax.scala
##
@@ -103,7 +103,7 @@ class TestPruneUsingSegmentMinMax extends QueryTest with 
BeforeAndAfterAll {
 sql("update carbon set(a)=(10) where a=1").collect()
 checkAnswer(sql("select count(*) from carbon where a=10"), Seq(Row(3)))
 showCache = sql("show metacache on table carbon").collect()
-assert(showCache(0).get(2).toString.equalsIgnoreCase("6/8 index files 
cached"))
+assert(showCache(0).get(2).toString.equalsIgnoreCase("1/6 index files 
cached"))

Review comment:
   1. In this test case, there are 5 inserts and 1 update. If the update writes into a 
new segment, there will be 6 segments in the table, so 6 index files in total in the 
table store location.
   2. If the update wrote into the existing segment folders, the data for a = 10 would 
exist in segments 0/3/4.
   But since the update writes into a single new segment folder, the data for a = 10 
exists only in segment 5.
   
   Now, the data in the 6 segments is shown below.
   
   Segment - 0 :
   +---+---+----+---+-------------------+
   |  a|  b|   c|  d|                  e|
   +---+---+----+---+-------------------+
   |  2| aa|23.6|  8|2017-09-02 00:00:00|
   +---+---+----+---+-------------------+
   
   Segment - 1 :
   +---+---+----+---+-------------------+
   |  a|  b|   c|  d|                  e|
   +---+---+----+---+-------------------+
   |  3| ab|23.4|  5|2017-09-01 00:00:00|
   |  4| aa|23.6|  8|2017-09-02 00:00:00|
   +---+---+----+---+-------------------+
   
   Segment - 2 :
   +---+---+----+---+-------------------+
   |  a|  b|   c|  d|                  e|
   +---+---+----+---+-------------------+
   |  5| ab|23.4|  5|2017-09-01 00:00:00|
   |  6| aa|23.6|  8|2017-09-02 00:00:00|
   +---+---+----+---+-------------------+
   
   Segment - 3 :
   +---+---+----+---+-------------------+
   |  a|  b|   c|  d|                  e|
   +---+---+----+---+-------------------+
   |  2| aa|23.6|  8|2017-09-02 00:00:00|
   +---+---+----+---+-------------------+
   
   Segment - 4 :
   +---+---+----+---+-------------------+
   |  a|  b|   c|  d|                  e|
   +---+---+----+---+-------------------+
   |  2| aa|23.6|  8|2017-09-02 00:00:00|
   +---+---+----+---+-------------------+
   
   Segment - 5 :
   +---+---+----+---+-------------------+
   |  a|  b|   c|  d|                  e|
   +---+---+----+---+-------------------+
   | 10| ab|23.4|  5|2017-09-01 00:00:00|
   | 10| ab|23.4|  5|2017-09-01 00:00:00|
   | 10| ab|23.4|  5|2017-09-01 00:00:00|
   +---+---+----+---+-------------------+
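   
   A short sketch of the scenario above (table name, filter, and commands taken from the
   test; column types are assumed for illustration): five inserts create segments 0-4,
   the update now writes its result into a new segment 5, so only one of the six index
   files has to be cached to answer the filter.

   ```
   sql("create table carbon (a int, b string, c double, d int, e timestamp) stored as carbondata")
   // ... five insert statements create segments 0 to 4 ...
   sql("update carbon set(a)=(10) where a=1").collect()
   sql("select count(*) from carbon where a=10").show()   // 3 matching rows
   sql("show metacache on table carbon").show()           // expected: "1/6 index files cached"
   ```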





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] QiangCai commented on a change in pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata

2020-10-29 Thread GitBox


QiangCai commented on a change in pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#discussion_r514677476



##
File path: 
integration/spark/src/main/scala/org/apache/carbondata/cleanfiles/CleanFilesUtil.scala
##
@@ -0,0 +1,400 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.cleanfiles
+
+import java.util
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ListBuffer
+
+import org.apache.spark.sql.{AnalysisException, CarbonEnv, Row, SparkSession}
+import org.apache.spark.sql.index.CarbonIndexUtil
+
+import org.apache.carbondata.common.logging.LogServiceFactory
+import org.apache.carbondata.core.constants.CarbonCommonConstants
+import org.apache.carbondata.core.datastore.filesystem.CarbonFile
+import org.apache.carbondata.core.datastore.impl.FileFactory
+import org.apache.carbondata.core.exception.ConcurrentOperationException
+import org.apache.carbondata.core.indexstore.PartitionSpec
+import org.apache.carbondata.core.locks.{CarbonLockFactory, CarbonLockUtil, 
ICarbonLock, LockUsage}
+import org.apache.carbondata.core.metadata.{AbsoluteTableIdentifier, 
CarbonMetadata, SegmentFileStore}
+import org.apache.carbondata.core.metadata.schema.table.CarbonTable
+import org.apache.carbondata.core.mutate.CarbonUpdateUtil
+import org.apache.carbondata.core.statusmanager.{LoadMetadataDetails, 
SegmentStatus, SegmentStatusManager}
+import org.apache.carbondata.core.util.{CarbonProperties, CarbonUtil}
+import org.apache.carbondata.core.util.path.{CarbonTablePath, TrashUtil}
+import org.apache.carbondata.processing.loading.TableProcessingOperations
+
+object CleanFilesUtil {
+  private val LOGGER = 
LogServiceFactory.getLogService(this.getClass.getCanonicalName)
+
+  /**
+   * The method deletes all data if forceTableClean  and clean garbage 
segment
+   * (MARKED_FOR_DELETE state) if forceTableClean 
+   *
+   * @param dbName : Database name
+   * @param tableName  : Table name
+   * @param tablePath  : Table path
+   * @param carbonTable: CarbonTable Object  in case of 
force clean
+   * @param forceTableClean:  for force clean it will delete all 
data
+   *it will clean garbage segment 
(MARKED_FOR_DELETE state)
+   * @param currentTablePartitions : Hive Partitions  details
+   */
+  def cleanFiles(
+dbName: String,
+tableName: String,
+tablePath: String,
+carbonTable: CarbonTable,
+forceTableClean: Boolean,
+currentTablePartitions: Option[Seq[PartitionSpec]] = None,
+truncateTable: Boolean = false): Unit = {
+var carbonCleanFilesLock: ICarbonLock = null
+val absoluteTableIdentifier = if (forceTableClean) {
+  AbsoluteTableIdentifier.from(tablePath, dbName, tableName, tableName)
+} else {
+  carbonTable.getAbsoluteTableIdentifier
+}
+try {
+  val errorMsg = "Clean files request is failed for " +
+s"$dbName.$tableName" +
+". Not able to acquire the clean files lock due to another clean files 
" +
+"operation is running in the background."
+  // in case of force clean the lock is not required
+  if (forceTableClean) {
+FileFactory.deleteAllCarbonFilesOfDir(
+  FileFactory.getCarbonFile(absoluteTableIdentifier.getTablePath))
+  } else {
+carbonCleanFilesLock =
+  CarbonLockUtil
+.getLockObject(absoluteTableIdentifier, 
LockUsage.CLEAN_FILES_LOCK, errorMsg)
+if (truncateTable) {
+  SegmentStatusManager.truncateTable(carbonTable)
+}
+SegmentStatusManager.deleteLoadsAndUpdateMetadata(
+  carbonTable, true, currentTablePartitions.map(_.asJava).orNull)
+CarbonUpdateUtil.cleanUpDeltaFiles(carbonTable, true)

Review comment:
   But I don't find any lock for update/delete when it tries to cleanUpDeltaFiles in the 
clean files flow. When clean files and an update run concurrently, this method of clean 
files will remove the files generated by the concurrent update.
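   
   A minimal sketch of the kind of guard being suggested, reusing the lock utilities
   already imported in this file (assuming LockUsage.UPDATE_LOCK as used by the
   update/delete flow; illustrative only, not this PR's code):

   ```
   // Take the table's update lock before touching delta files so that a concurrent
   // update/delete cannot have its freshly written files removed by clean files.
   val updateLock: ICarbonLock = CarbonLockFactory.getCarbonLockObj(
     carbonTable.getAbsoluteTableIdentifier, LockUsage.UPDATE_LOCK)
   try {
     if (updateLock.lockWithRetries()) {
       CarbonUpdateUtil.cleanUpDeltaFiles(carbonTable, true)
     } else {
       LOGGER.warn("Skipping delta file cleanup: a concurrent update/delete is in progress")
     }
   } finally {
     CarbonLockUtil.fileUnlock(updateLock, LockUsage.UPDATE_LOCK)
   }
   ```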





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata

2020-10-29 Thread GitBox


CarbonDataQA1 commented on pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#issuecomment-718951982


   Build Failed  with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2974/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata

2020-10-29 Thread GitBox


CarbonDataQA1 commented on pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#issuecomment-718948233


   Build Success with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4733/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] brijoobopanna commented on a change in pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata

2020-10-29 Thread GitBox


brijoobopanna commented on a change in pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#discussion_r514436587



##
File path: docs/cleanfiles.md
##
@@ -0,0 +1,78 @@
+
+
+
+## CLEAN FILES
+
+Clean files command is used to remove the Compacted, Marked For Delete ,In 
Progress which are stale and Partial(Segments which are missing from the table 
status file but their data is present)
+ segments from the store.
+ 
+ Clean Files Command
+   ```
+   CLEAN FILES FOR TABLE TABLE_NAME
+   ```
+
+
+### TRASH FOLDER
+
+  Carbondata supports a Trash Folder which is used as a redundant folder where 
all the unnecessary files and folders are moved to during clean files operation.
+  This trash folder is mantained inside the table path. It is a hidden 
folder(.Trash). The segments that are moved to the trash folder are mantained 
under a timestamp 
+  subfolder(timestamp at which clean files operation is called). This helps 
the user to list down segments by timestamp.  By default all the timestamp 
sub-directory have an expiration
+  time of (3 days since that timestamp) and it can be configured by the user 
using the following carbon property
+   ```
+   carbon.trash.expiration.time = "Number of days"
+   ``` 
+  Once the timestamp subdirectory is expired as per the configured expiration 
day value, the subdirectory is deleted from the trash folder in the subsequent 
clean files command.
+  
+
+
+
+### DRY RUN
+  Support for dry run is provided before the actual clean files operation. 
This dry run operation will list down all the segments which are going to be 
manipulated during
+  the clean files operation. The dry run result will show the current location 
of the segment(it can be in FACT folder, Partition folder or trash folder) and 
where that segment
+  will be moved(to the trash folder or deleted from store) once the actual 
operation will be called. 
+  
+
+  ```
+  CLEAN FILES FOR TABLE TABLE_NAME options('dry_run'='true')
+  ```
+
+### FORCE DELETE TRASH
+The force option with clean files command deletes all the files and folders 
from the trash folder.
+
+  ```
+  CLEAN FILES FOR TABLE TABLE_NAME options('force'='true')
+  ```
+
+### DATA RECOVERY FROM THE TRASH FOLDER
+
+The segments can be recovered from the trash folder by creating an external 
table from the desired segment location
+in the trash folder and inserting into the original table from the external 
table. It will create a new segment in the original table.

Review comment:
   It would be good to also explain how SI and MV tables behave in these cases.
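   
   For reference (in addition to documenting the SI/MV behavior), a sketch of the
   recovery flow described in this file; the trash layout
   <table_path>/.Trash/<timestamp>/<segment_dir> and the external-table syntax are
   assumptions for illustration, not verified against this PR:

   ```
   // Expose the segment kept in the trash folder as an external table, then copy its
   // rows back; the insert creates a brand new segment in the original table.
   sql("""create external table recovered_seg stored as carbondata
          location '<table_path>/.Trash/<timestamp>/Segment_2'""")
   sql("insert into original_table select * from recovered_seg")
   sql("drop table recovered_seg")
   ```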





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] brijoobopanna commented on a change in pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata

2020-10-29 Thread GitBox


brijoobopanna commented on a change in pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#discussion_r514434545



##
File path: docs/cleanfiles.md
##
@@ -0,0 +1,78 @@
+
+
+
+## CLEAN FILES
+
+Clean files command is used to remove the Compacted, Marked For Delete ,In 
Progress which are stale and Partial(Segments which are missing from the table 
status file but their data is present)
+ segments from the store.
+ 
+ Clean Files Command
+   ```
+   CLEAN FILES FOR TABLE TABLE_NAME
+   ```
+
+
+### TRASH FOLDER
+
+  Carbondata supports a Trash Folder which is used as a redundant folder where 
all the unnecessary files and folders are moved to during clean files operation.
+  This trash folder is mantained inside the table path. It is a hidden 
folder(.Trash). The segments that are moved to the trash folder are mantained 
under a timestamp 
+  subfolder(timestamp at which clean files operation is called). This helps 
the user to list down segments by timestamp.  By default all the timestamp 
sub-directory have an expiration
+  time of (3 days since that timestamp) and it can be configured by the user 
using the following carbon property
+   ```
+   carbon.trash.expiration.time = "Number of days"
+   ``` 
+  Once the timestamp subdirectory is expired as per the configured expiration 
day value, the subdirectory is deleted from the trash folder in the subsequent 
clean files command.
+  
+
+
+
+### DRY RUN
+  Support for dry run is provided before the actual clean files operation. 
This dry run operation will list down all the segments which are going to be 
manipulated during
+  the clean files operation. The dry run result will show the current location 
of the segment(it can be in FACT folder, Partition folder or trash folder) and 
where that segment

Review comment:
   What about also showing the time-threshold-based segments in the trash folder?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] brijoobopanna commented on a change in pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata

2020-10-29 Thread GitBox


brijoobopanna commented on a change in pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#discussion_r514432464



##
File path: docs/cleanfiles.md
##
@@ -0,0 +1,78 @@
+
+
+
+## CLEAN FILES
+
+Clean files command is used to remove the Compacted, Marked For Delete ,In 
Progress which are stale and Partial(Segments which are missing from the table 
status file but their data is present)
+ segments from the store.
+ 
+ Clean Files Command
+   ```
+   CLEAN FILES FOR TABLE TABLE_NAME
+   ```
+
+
+### TRASH FOLDER
+
+  Carbondata supports a Trash Folder which is used as a redundant folder where 
all the unnecessary files and folders are moved to during clean files operation.

Review comment:
   Can we please rewrite this part instead of saying "unnecessary files and folders"?
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata

2020-10-29 Thread GitBox


vikramahuja1001 commented on a change in pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#discussion_r514429233



##
File path: 
integration/spark/src/main/scala/org/apache/carbondata/cleanfiles/CleanFilesUtil.scala
##
@@ -0,0 +1,400 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.cleanfiles
+
+import java.util
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ListBuffer
+
+import org.apache.spark.sql.{AnalysisException, CarbonEnv, Row, SparkSession}
+import org.apache.spark.sql.index.CarbonIndexUtil
+
+import org.apache.carbondata.common.logging.LogServiceFactory
+import org.apache.carbondata.core.constants.CarbonCommonConstants
+import org.apache.carbondata.core.datastore.filesystem.CarbonFile
+import org.apache.carbondata.core.datastore.impl.FileFactory
+import org.apache.carbondata.core.exception.ConcurrentOperationException
+import org.apache.carbondata.core.indexstore.PartitionSpec
+import org.apache.carbondata.core.locks.{CarbonLockFactory, CarbonLockUtil, 
ICarbonLock, LockUsage}
+import org.apache.carbondata.core.metadata.{AbsoluteTableIdentifier, 
CarbonMetadata, SegmentFileStore}
+import org.apache.carbondata.core.metadata.schema.table.CarbonTable
+import org.apache.carbondata.core.mutate.CarbonUpdateUtil
+import org.apache.carbondata.core.statusmanager.{LoadMetadataDetails, 
SegmentStatus, SegmentStatusManager}
+import org.apache.carbondata.core.util.{CarbonProperties, CarbonUtil}
+import org.apache.carbondata.core.util.path.{CarbonTablePath, TrashUtil}
+import org.apache.carbondata.processing.loading.TableProcessingOperations
+
+object CleanFilesUtil {
+  private val LOGGER = 
LogServiceFactory.getLogService(this.getClass.getCanonicalName)
+
+  /**
+   * The method deletes all data if forceTableClean  and clean garbage 
segment
+   * (MARKED_FOR_DELETE state) if forceTableClean 
+   *
+   * @param dbName : Database name
+   * @param tableName  : Table name
+   * @param tablePath  : Table path
+   * @param carbonTable: CarbonTable Object  in case of 
force clean
+   * @param forceTableClean:  for force clean it will delete all 
data
+   *it will clean garbage segment 
(MARKED_FOR_DELETE state)
+   * @param currentTablePartitions : Hive Partitions  details
+   */
+  def cleanFiles(
+dbName: String,
+tableName: String,
+tablePath: String,
+carbonTable: CarbonTable,
+forceTableClean: Boolean,
+currentTablePartitions: Option[Seq[PartitionSpec]] = None,
+truncateTable: Boolean = false): Unit = {
+var carbonCleanFilesLock: ICarbonLock = null
+val absoluteTableIdentifier = if (forceTableClean) {
+  AbsoluteTableIdentifier.from(tablePath, dbName, tableName, tableName)
+} else {
+  carbonTable.getAbsoluteTableIdentifier
+}
+try {
+  val errorMsg = "Clean files request is failed for " +
+s"$dbName.$tableName" +
+". Not able to acquire the clean files lock due to another clean files 
" +
+"operation is running in the background."
+  // in case of force clean the lock is not required
+  if (forceTableClean) {
+FileFactory.deleteAllCarbonFilesOfDir(
+  FileFactory.getCarbonFile(absoluteTableIdentifier.getTablePath))
+  } else {
+carbonCleanFilesLock =
+  CarbonLockUtil
+.getLockObject(absoluteTableIdentifier, 
LockUsage.CLEAN_FILES_LOCK, errorMsg)
+if (truncateTable) {
+  SegmentStatusManager.truncateTable(carbonTable)
+}
+SegmentStatusManager.deleteLoadsAndUpdateMetadata(
+  carbonTable, true, currentTablePartitions.map(_.asJava).orNull)
+CarbonUpdateUtil.cleanUpDeltaFiles(carbonTable, true)

Review comment:
   No, we are copying the complete segment to the trash folder, so there are no issues 
with delta files. I have also added a test case with delete delta.
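   
   A minimal sketch of that idea, using Hadoop's FileUtil for illustration (the PR's
   actual TrashUtil API and the Fact/Part0 segment layout are assumptions here):

   ```
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.{FileUtil, Path}

   // Copy the whole segment directory (data, index and delete-delta files together)
   // into a timestamped subfolder of the hidden .Trash directory before it is removed
   // from the store, so nothing belonging to the segment is lost.
   def copySegmentToTrash(tablePath: String, segmentDir: String, timestamp: Long): Unit = {
     val conf = new Configuration()
     val src = new Path(s"$tablePath/Fact/Part0/$segmentDir")
     val dst = new Path(s"$tablePath/.Trash/$timestamp/$segmentDir")
     // copy (not move): the source is deleted only after the copy has succeeded
     FileUtil.copy(src.getFileSystem(conf), src, dst.getFileSystem(conf), dst, false, conf)
   }
   ```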





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Reopened] (CARBONDATA-4029) Getting Number format exception while querying on date columns in SDK carbon table.

2020-10-29 Thread Prasanna Ravichandran (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-4029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanna Ravichandran reopened CARBONDATA-4029:
---

> Getting Number format exception while querying on date columns in SDK carbon 
> table.
> ---
>
> Key: CARBONDATA-4029
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4029
> Project: CarbonData
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: 3 node FI cluster
>Reporter: Prasanna Ravichandran
>Priority: Minor
> Attachments: Primitive.rar
>
>
> We are getting Number format exception while querying on the date columns. 
> Attached the SDK files also.
> Test queries:
> --SDK compaction;
>  drop table if exists external_primitive;
>  create table external_primitive (id int, name string, rank smallint, salary 
> double, active boolean, dob date, doj timestamp, city string, dept string) 
> stored as carbondata;
>  alter table external_primitive add segment 
> options('path'='hdfs://hacluster/sdkfiles/primitive','format'='carbon');
>  alter table external_primitive add segment 
> options('path'='hdfs://hacluster/sdkfiles/primitive2','format'='carbon');
>  alter table external_primitive add segment 
> options('path'='hdfs://hacluster/sdkfiles/primitive3','format'='carbon');
>  alter table external_primitive add segment 
> options('path'='hdfs://hacluster/sdkfiles/primitive4','format'='carbon');
>  alter table external_primitive add segment 
> options('path'='hdfs://hacluster/sdkfiles/primitive5','format'='carbon');
>  
>  alter table external_primitive compact 'minor'; --working fine pass;
>  select count(*) from external_primitive;--working fine pass;
> show segments for table external_primitive;
>  select * from external_primitive limit 13; --working fine pass;
>  select * from external_primitive limit 14; --failed getting number format 
> exception;
> select min(dob) from external_primitive; --failed getting number format 
> exception;
> select max(dob) from external_primitive; --working;
> select dob from external_primitive; --failed getting number format exception;
> Console:
> *0: /> show segments for table external_primitive;*
> +--++--+--+++-+--+
> | ID | Status | Load Start Time | Load Time Taken | Partition | Data Size | 
> Index Size | File Format |
> +--++--+--+++-+--+
> | 4 | Success | 2020-10-13 11:52:04.012 | 0.511S | {} | 1.88KB | 655.0B | 
> columnar_v3 |
> | 3 | Compacted | 2020-10-13 11:52:00.587 | 0.828S | {} | 1.88KB | 655.0B | 
> columnar_v3 |
> | 2 | Compacted | 2020-10-13 11:51:57.767 | 0.775S | {} | 1.88KB | 655.0B | 
> columnar_v3 |
> | 1 | Compacted | 2020-10-13 11:51:54.678 | 1.024S | {} | 1.88KB | 655.0B | 
> columnar_v3 |
> | 0.1 | Success | 2020-10-13 11:52:05.986 | 5.785S | {} | 9.62KB | 5.01KB | 
> columnar_v3 |
> | 0 | Compacted | 2020-10-13 11:51:51.072 | 1.125S | {} | 8.55KB | 4.25KB | 
> columnar_v3 |
> +--++--+--+++-+--+
> 6 rows selected (0.45 seconds)
> *0: /> select * from external_primitive limit 13;* --working fine pass;
> INFO : Execution ID: 95
> +-+---+---+--+-+-++++
> | id | name | rank | salary | active | dob | doj | city | dept |
> +-+---+---+--+-+-++++
> | 1 | AAA | 3 | 3444345.66 | true | 1979-12-09 | 2011-02-09 22:30:20.0 | Pune 
> | IT |
> | 2 | BBB | 2 | 543124.66 | false | 1987-02-19 | 2017-01-01 09:30:20.0 | 
> Bangalore | DATA |
> | 3 | CCC | 1 | 787878.888 | false | 1982-05-12 | 2015-11-30 23:50:20.0 | 
> Pune | DATA |
> | 4 | DDD | 1 | 9.24 | true | 1981-04-09 | 2000-01-15 04:30:20.0 | Delhi 
> | MAINS |
> | 5 | EEE | 3 | 545656.99 | true | 1987-12-09 | 2017-11-25 01:30:20.0 | Delhi 
> | IT |
> | 6 | FFF | 2 | 768678.0 | false | 1987-12-20 | 2017-01-10 02:30:20.0 | 
> Bangalore | DATA |
> | 7 | GGG | 3 | 765665.0 | true | 1983-06-12 | 2016-12-31 23:30:20.0 | Pune | 
> IT |
> | 8 | HHH | 2 | 567567.66 | false | 1979-01-12 | 1995-01-01 09:30:20.0 | 
> Bangalore | DATA |
> | 9 | III | 2 | 787878.767 | true | 1985-02-19 | 2005-08-14 22:30:20.0 | Pune 
> | DATA |
> | 10 | JJJ | 3 | 887877.14 | true | 2000-05-19 | 2016-10-10 09:30:20.0 | 
> Bangalore | MAINS |
> | 18 | | 3 | 7.86786786787E9 | true | 1980-10-05 | 1995-10-07 19:30:20.0 | 
> Bangalore | IT |
> | 19 | | 2 | 5464545.33 | true | 1986-06-06 | 2008-08-14 22:30:20.0 | Delhi | 
> DATA |
> | 20 | 

[GitHub] [carbondata] marchpure commented on a change in pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-29 Thread GitBox


marchpure commented on a change in pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#discussion_r514279897



##
File path: 
integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/alterTable/TestAlterTableSortColumnsProperty.scala
##
@@ -739,14 +739,14 @@ class TestAlterTableSortColumnsProperty extends QueryTest 
with BeforeAndAfterAll
 
 val table = CarbonEnv.getCarbonTable(Option("default"), 
tableName)(sqlContext.sparkSession)
 val tablePath = table.getTablePath
-(0 to 2).foreach { segmentId =>
+(0 to 3).foreach { segmentId =>

Review comment:
   Now, the update will write into a new segment 3.
   Before, the update only wrote to the old segment 2, so the test case must change from 
(0 to 2) to (0 to 3).





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Closed] (CARBONDATA-4029) Getting Number format exception while querying on date columns in SDK carbon table.

2020-10-29 Thread Prasanna Ravichandran (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-4029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanna Ravichandran closed CARBONDATA-4029.
-
Resolution: Won't Fix

> Getting Number format exception while querying on date columns in SDK carbon 
> table.
> ---
>
> Key: CARBONDATA-4029
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4029
> Project: CarbonData
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: 3 node FI cluster
>Reporter: Prasanna Ravichandran
>Priority: Minor
> Attachments: Primitive.rar
>
>
> We are getting Number format exception while querying on the date columns. 
> Attached the SDK files also.
> Test queries:
> --SDK compaction;
>  drop table if exists external_primitive;
>  create table external_primitive (id int, name string, rank smallint, salary 
> double, active boolean, dob date, doj timestamp, city string, dept string) 
> stored as carbondata;
>  alter table external_primitive add segment 
> options('path'='hdfs://hacluster/sdkfiles/primitive','format'='carbon');
>  alter table external_primitive add segment 
> options('path'='hdfs://hacluster/sdkfiles/primitive2','format'='carbon');
>  alter table external_primitive add segment 
> options('path'='hdfs://hacluster/sdkfiles/primitive3','format'='carbon');
>  alter table external_primitive add segment 
> options('path'='hdfs://hacluster/sdkfiles/primitive4','format'='carbon');
>  alter table external_primitive add segment 
> options('path'='hdfs://hacluster/sdkfiles/primitive5','format'='carbon');
>  
>  alter table external_primitive compact 'minor'; --working fine pass;
>  select count(*) from external_primitive;--working fine pass;
> show segments for table external_primitive;
>  select * from external_primitive limit 13; --working fine pass;
>  select * from external_primitive limit 14; --failed getting number format 
> exception;
> select min(dob) from external_primitive; --failed getting number format 
> exception;
> select max(dob) from external_primitive; --working;
> select dob from external_primitive; --failed getting number format exception;
> Console:
> *0: /> show segments for table external_primitive;*
> +--++--+--+++-+--+
> | ID | Status | Load Start Time | Load Time Taken | Partition | Data Size | 
> Index Size | File Format |
> +--++--+--+++-+--+
> | 4 | Success | 2020-10-13 11:52:04.012 | 0.511S | {} | 1.88KB | 655.0B | 
> columnar_v3 |
> | 3 | Compacted | 2020-10-13 11:52:00.587 | 0.828S | {} | 1.88KB | 655.0B | 
> columnar_v3 |
> | 2 | Compacted | 2020-10-13 11:51:57.767 | 0.775S | {} | 1.88KB | 655.0B | 
> columnar_v3 |
> | 1 | Compacted | 2020-10-13 11:51:54.678 | 1.024S | {} | 1.88KB | 655.0B | 
> columnar_v3 |
> | 0.1 | Success | 2020-10-13 11:52:05.986 | 5.785S | {} | 9.62KB | 5.01KB | 
> columnar_v3 |
> | 0 | Compacted | 2020-10-13 11:51:51.072 | 1.125S | {} | 8.55KB | 4.25KB | 
> columnar_v3 |
> +--++--+--+++-+--+
> 6 rows selected (0.45 seconds)
> *0: /> select * from external_primitive limit 13;* --working fine pass;
> INFO : Execution ID: 95
> +-+---+---+--+-+-++++
> | id | name | rank | salary | active | dob | doj | city | dept |
> +-+---+---+--+-+-++++
> | 1 | AAA | 3 | 3444345.66 | true | 1979-12-09 | 2011-02-09 22:30:20.0 | Pune 
> | IT |
> | 2 | BBB | 2 | 543124.66 | false | 1987-02-19 | 2017-01-01 09:30:20.0 | 
> Bangalore | DATA |
> | 3 | CCC | 1 | 787878.888 | false | 1982-05-12 | 2015-11-30 23:50:20.0 | 
> Pune | DATA |
> | 4 | DDD | 1 | 9.24 | true | 1981-04-09 | 2000-01-15 04:30:20.0 | Delhi 
> | MAINS |
> | 5 | EEE | 3 | 545656.99 | true | 1987-12-09 | 2017-11-25 01:30:20.0 | Delhi 
> | IT |
> | 6 | FFF | 2 | 768678.0 | false | 1987-12-20 | 2017-01-10 02:30:20.0 | 
> Bangalore | DATA |
> | 7 | GGG | 3 | 765665.0 | true | 1983-06-12 | 2016-12-31 23:30:20.0 | Pune | 
> IT |
> | 8 | HHH | 2 | 567567.66 | false | 1979-01-12 | 1995-01-01 09:30:20.0 | 
> Bangalore | DATA |
> | 9 | III | 2 | 787878.767 | true | 1985-02-19 | 2005-08-14 22:30:20.0 | Pune 
> | DATA |
> | 10 | JJJ | 3 | 887877.14 | true | 2000-05-19 | 2016-10-10 09:30:20.0 | 
> Bangalore | MAINS |
> | 18 | | 3 | 7.86786786787E9 | true | 1980-10-05 | 1995-10-07 19:30:20.0 | 
> Bangalore | IT |
> | 19 | | 2 | 5464545.33 | true | 1986-06-06 | 2008-08-14 22:30:20.0 | Delhi 

[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata

2020-10-29 Thread GitBox


vikramahuja1001 commented on a change in pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#discussion_r514271886



##
File path: 
core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java
##
@@ -1136,7 +1137,8 @@ public static void 
deleteLoadsAndUpdateMetadata(CarbonTable carbonTable, boolean
   if (updateCompletionStatus) {
 DeleteLoadFolders
 .physicalFactAndMeasureMetadataDeletion(carbonTable, 
newAddedLoadHistoryList,
-isForceDeletion, partitionSpecs);
+isForceDeletion, partitionSpecs, String.valueOf(new 
Timestamp(System

Review comment:
   yes, moved





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-29 Thread GitBox


ajantha-bhat commented on a change in pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#discussion_r514266931



##
File path: 
integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/CarbonProjectForUpdateCommand.scala
##
@@ -340,7 +340,8 @@ private[sql] case class CarbonProjectForUpdateCommand(
   case _ => sys.error("")
 }
 
-val updateTableModel = UpdateTableModel(true, currentTime, executorErrors, 
deletedSegments)
+val updateTableModel = UpdateTableModel(true, currentTime, executorErrors, 
deletedSegments,
+  !carbonRelation.carbonTable.isHivePartitionTable)

Review comment:
   @marchpure : Also, please reply to my other comments and questions once they are 
handled.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-29 Thread GitBox


ajantha-bhat commented on a change in pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#discussion_r514263181



##
File path: 
integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/CarbonProjectForUpdateCommand.scala
##
@@ -340,7 +340,8 @@ private[sql] case class CarbonProjectForUpdateCommand(
   case _ => sys.error("")
 }
 
-val updateTableModel = UpdateTableModel(true, currentTime, executorErrors, 
deletedSegments)
+val updateTableModel = UpdateTableModel(true, currentTime, executorErrors, 
deletedSegments,
+  !carbonRelation.carbonTable.isHivePartitionTable)

Review comment:
   Nice, I will review it again. @QiangCai or others can also review this PR once.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] marchpure commented on a change in pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-29 Thread GitBox


marchpure commented on a change in pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#discussion_r514260343



##
File path: 
integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/CarbonProjectForUpdateCommand.scala
##
@@ -340,7 +340,8 @@ private[sql] case class CarbonProjectForUpdateCommand(
   case _ => sys.error("")
 }
 
-val updateTableModel = UpdateTableModel(true, currentTime, executorErrors, 
deletedSegments)
+val updateTableModel = UpdateTableModel(true, currentTime, executorErrors, 
deletedSegments,
+  !carbonRelation.carbonTable.isHivePartitionTable)

Review comment:
   I have modified the code according to your suggestion. Now, for partition tables, the 
update will write as a new segment.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Closed] (CARBONDATA-3971) Session level dynamic properties for repair(carbon.load.si.repair and carbon.si.repair.limit) are not updated in https://github.com/apache/carbondata/blob/master/docs

2020-10-29 Thread Chetan Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chetan Bhat closed CARBONDATA-3971.
---
Fix Version/s: 2.1.0
   Resolution: Fixed

Issue is fixed.

> Session level dynamic properties for repair(carbon.load.si.repair and 
> carbon.si.repair.limit) are not updated in 
> https://github.com/apache/carbondata/blob/master/docs/configuration-parameters.md
> --
>
> Key: CARBONDATA-3971
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3971
> Project: CarbonData
>  Issue Type: Bug
>  Components: docs
>Affects Versions: 2.1.0
>Reporter: Chetan Bhat
>Priority: Minor
> Fix For: 2.1.0
>
>
> Session level dynamic properties for repair(carbon.load.si.repair and 
> carbon.si.repair.limit) are not mentioned in  github link - 
> https://github.com/apache/carbondata/blob/master/docs/configuration-parameters.md



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (CARBONDATA-3937) Insert into select from another carbon /parquet table is not working on Hive Beeline on a newly create Hive write format - carbon table. We are getting “Database is n

2020-10-29 Thread Prasanna Ravichandran (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanna Ravichandran closed CARBONDATA-3937.
-

> Insert into select from another carbon /parquet table is not working on Hive 
> Beeline on a newly create Hive write format - carbon table. We are getting 
> “Database is not set" error.
> 
>
> Key: CARBONDATA-3937
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3937
> Project: CarbonData
>  Issue Type: Bug
>  Components: hive-integration
>Affects Versions: 2.0.0
>Reporter: Prasanna Ravichandran
>Priority: Major
>
> Insert into select from another carbon or parquet table to a carbon table is 
> not working on Hive Beeline on a newly create Hive write format carbon table. 
> We are getting “Database is not set” error.
>  
> Test queries:
>  drop table if exists hive_carbon;
> create table hive_carbon(id int, name string, scale decimal, country string, 
> salary double) stored by 'org.apache.carbondata.hive.CarbonStorageHandler';
> insert into hive_carbon select 1,"Ram","2.3","India",3500;
> insert into hive_carbon select 2,"Raju","2.4","Russia",3600;
> insert into hive_carbon select 3,"Raghu","2.5","China",3700;
> insert into hive_carbon select 4,"Ravi","2.6","Australia",3800;
>  
> drop table if exists hive_carbon2;
> create table hive_carbon2(id int, name string, scale decimal, country string, 
> salary double) stored by 'org.apache.carbondata.hive.CarbonStorageHandler';
> insert into hive_carbon2 select * from hive_carbon;
> select * from hive_carbon;
> select * from hive_carbon2;
>  
>  --execute below queries in spark-beeline;
> create table hive_table(id int, name string, scale decimal, country string, 
> salary double);
>  create table parquet_table(id int, name string, scale decimal, country 
> string, salary double) stored as parquet;
>  insert into hive_table select 1,"Ram","2.3","India",3500;
>  select * from hive_table;
>  insert into parquet_table select 1,"Ram","2.3","India",3500;
>  select * from parquet_table;
> --execute the below query in hive beeline;
> insert into hive_carbon select * from parquet_table;
> Attached the logs for your reference. But the insert into select from the 
> parquet and hive table into carbon table is working fine.
>  
> Only insert into select from hive table to carbon table is only working.
> Error details in MR job which run through hive query:
> Error: java.io.IOException: java.io.IOException: Database name is not set. at 
> org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
>  at 
> org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
>  at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:414)
>  at 
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:843)
>  at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.(MapTask.java:175) 
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:444) at 
> org.apache.hadoop.mapred.MapTask.run(MapTask.java:349) at 
> org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:175) at 
> java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:422) at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1737)
>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169) Caused by: 
> java.io.IOException: Database name is not set. at 
> org.apache.carbondata.hadoop.api.CarbonInputFormat.getDatabaseName(CarbonInputFormat.java:841)
>  at 
> org.apache.carbondata.hive.MapredCarbonInputFormat.getCarbonTable(MapredCarbonInputFormat.java:80)
>  at 
> org.apache.carbondata.hive.MapredCarbonInputFormat.getQueryModel(MapredCarbonInputFormat.java:215)
>  at 
> org.apache.carbondata.hive.MapredCarbonInputFormat.getRecordReader(MapredCarbonInputFormat.java:205)
>  at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:411)
>  ... 9 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (CARBONDATA-3852) CCD Merge with Partition Table is giving different results in different spark deploy modes

2020-10-29 Thread Sachin Ramachandra Setty (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sachin Ramachandra Setty closed CARBONDATA-3852.


This PR resolved this issue.

https://github.com/apache/carbondata/pull/3835

> CCD Merge with Partition Table is giving different results in different spark 
> deploy modes
> --
>
> Key: CARBONDATA-3852
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3852
> Project: CarbonData
>  Issue Type: Bug
>  Components: spark-integration
>Affects Versions: 2.0.0
>Reporter: Sachin Ramachandra Setty
>Priority: Major
> Fix For: 2.1.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> The result sets are different when running the SQL queries in spark-shell 
> --master local and spark-shell --master yarn (two different Spark deploy 
> modes).
> {code}
> import scala.collection.JavaConverters._
> import java.sql.Date
> import org.apache.spark.sql._
> import org.apache.spark.sql.CarbonSession._
> import org.apache.spark.sql.catalyst.TableIdentifier
> import 
> org.apache.spark.sql.execution.command.mutation.merge.{CarbonMergeDataSetCommand,
>  DeleteAction, InsertAction, InsertInHistoryTableAction, MergeDataSetMatches, 
> MergeMatch, UpdateAction, WhenMatched, WhenNotMatched, 
> WhenNotMatchedAndExistsOnlyOnTarget}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.test.util.QueryTest
> import org.apache.spark.sql.types.{BooleanType, DateType, IntegerType, 
> StringType, StructField, StructType}
> import spark.implicits._
> val df1 = sc.parallelize(1 to 10, 4).map{ x => ("id"+x, 
> s"order$x",s"customer$x", x*10, x*75, 1)}.toDF("id", "name", "c_name", 
> "quantity", "price", "state")
> df1.write.format("carbondata").option("tableName", 
> "order").mode(SaveMode.Overwrite).save()
> val dwframe = spark.read.format("carbondata").option("tableName", 
> "order").load()
> val dwSelframe = dwframe.as("A")
> val ds1 = sc.parallelize(3 to 10, 4)
>   .map {x =>
> if (x <= 4) {
>   ("id"+x, s"order$x",s"customer$x", x*10, x*75, 2)
> } else {
>   ("id"+x, s"order$x",s"customer$x", x*10, x*75, 1)
> }
>   }.toDF("id", "name", "c_name", "quantity", "price", "state")
> 
> val ds2 = sc.parallelize(1 to 2, 4).map {x => ("newid"+x, 
> s"order$x",s"customer$x", x*10, x*75, 1)}.toDS().toDF()
> val ds3 = ds1.union(ds2)  
> val odsframe = ds3.as("B")
>   
> sql("drop table if exists target").show()
> val initframe = spark.createDataFrame(Seq(
>   Row("a", "0"),
>   Row("b", "1"),
>   Row("c", "2"),
>   Row("d", "3")
> ).asJava, StructType(Seq(StructField("key", StringType), StructField("value", 
> StringType))))
> initframe.write
>   .format("carbondata")
>   .option("tableName", "target")
>   .option("partitionColumns", "value")
>   .mode(SaveMode.Overwrite)
>   .save()
>   
> val target = spark.read.format("carbondata").option("tableName", 
> "target").load()
> var ccd =
>   spark.createDataFrame(Seq(
> Row("a", "10", false,  0),
> Row("a", null, true, 1),   
> Row("b", null, true, 2),   
> Row("c", null, true, 3),   
> Row("c", "20", false, 4),
> Row("c", "200", false, 5),
> Row("e", "100", false, 6) 
>   ).asJava,
> StructType(Seq(StructField("key", StringType),
>   StructField("newValue", StringType),
>   StructField("deleted", BooleanType), StructField("time", IntegerType
> 
> ccd.createOrReplaceTempView("changes")
> ccd = sql("SELECT key, latest.newValue as newValue, latest.deleted as deleted 
> FROM ( SELECT key, max(struct(time, newValue, deleted)) as latest FROM 
> changes GROUP BY key)")
> val updateMap = Map("key" -> "B.key", "value" -> 
> "B.newValue").asInstanceOf[Map[Any, Any]]
> val insertMap = Map("key" -> "B.key", "value" -> 
> "B.newValue").asInstanceOf[Map[Any, Any]]
> target.as("A").merge(ccd.as("B"), "A.key=B.key").
>   whenMatched("B.deleted=false").
>   updateExpr(updateMap).
>   whenNotMatched("B.deleted=false").
>   insertExpr(insertMap).
>   whenMatched("B.deleted=true").
>   delete().execute()
>   
> {code}
> SQL Queries to run :
> {code} 
> sql("select count(*) from target").show()
> sql("select * from target order by key").show()
> {code}
> Results in spark-shell --master yarn
> {code}
> scala> sql("select count(*) from target").show()
> +--------+
> |count(1)|
> +--------+
> |       4|
> +--------+
> scala> sql("select * from target order by key").show()
> +---+-----+
> |key|value|
> +---+-----+
> |  a|    0|
> |  b|    1|
> |  c|    2|
> |  d|    3|
> +---+-----+
> {code}
> Results in spark-shell --master local
> {code}
> scala> sql("select count(*) from target").show()
> +--------+
> |count(1)|
> +--------+
> |       3|
> +--------+
> scala> 

[jira] [Closed] (CARBONDATA-3851) Merge Update and Insert with Partition Table is giving different results in different spark deploy modes

2020-10-29 Thread Sachin Ramachandra Setty (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sachin Ramachandra Setty closed CARBONDATA-3851.


This PR resolved this issue.

https://github.com/apache/carbondata/pull/3835

> Merge Update and Insert with Partition Table is giving different results in 
> different spark deploy modes
> 
>
> Key: CARBONDATA-3851
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3851
> Project: CarbonData
>  Issue Type: Bug
>  Components: spark-integration
>Affects Versions: 2.0.0
>Reporter: Sachin Ramachandra Setty
>Priority: Major
> Fix For: 2.1.0
>
>
> The result sets are different when running the queries in spark-shell --master 
> local and spark-shell --master yarn (two different Spark deploy modes).
> Steps to Reproduce Issue :
> {code}
> import scala.collection.JavaConverters._
> import java.sql.Date
> import org.apache.spark.sql._
> import org.apache.spark.sql.CarbonSession._
> import org.apache.spark.sql.catalyst.TableIdentifier
> import org.apache.spark.sql.execution.command.mutation.merge.
> {CarbonMergeDataSetCommand, DeleteAction, InsertAction, 
> InsertInHistoryTableAction, MergeDataSetMatches, MergeMatch, UpdateAction, 
> WhenMatched, WhenNotMatched, WhenNotMatchedAndExistsOnlyOnTarget}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.test.util.QueryTest
> import org.apache.spark.sql.types.
> {BooleanType, DateType, IntegerType, StringType, StructField, StructType}
> import spark.implicits._
> sql("drop table if exists order").show()
> sql("drop table if exists order_hist").show()
> sql("create table order_hist(id string, name string, quantity int, price int, 
> state int) PARTITIONED BY (c_name String) STORED AS carbondata").show()
> val initframe = sc.parallelize(1 to 10, 4).map
> { x => ("id"+x, s"order$x",s"customer$x", x*10, x*75, 1)}
> .toDF("id", "name", "c_name", "quantity", "price", "state")
> initframe.write
>  .format("carbondata")
>  .option("tableName", "order")
>  .option("partitionColumns", "c_name")
>  .mode(SaveMode.Overwrite)
>  .save()
> val dwframe = spark.read.format("carbondata").option("tableName", 
> "order").load()
> val dwSelframe = dwframe.as("A")
> val ds1 = sc.parallelize(3 to 10, 4)
>  .map {x =>
>  if (x <= 4)
> { ("id"+x, s"order$x",s"customer$x", x*10, x*75, 2) }
> else
> { ("id"+x, s"order$x",s"customer$x", x*10, x*75, 1) }
> }.toDF("id", "name", "c_name", "quantity", "price", "state")
> ds1.show()
>  val ds2 = sc.parallelize(1 to 2, 4)
>  .map
> {x => ("newid"+x, s"order$x",s"customer$x", x*10, x*75, 1) }
> .toDS().toDF()
>  ds2.show()
>  val ds3 = ds1.union(ds2)
>  ds3.show()
> val odsframe = ds3.as("B")
> var matches = Seq.empty[MergeMatch]
>  val updateMap = Map(col("id") -> col("A.id"),
>  col("price") -> expr("B.price + 1"),
>  col("state") -> col("B.state"))
> val insertMap = Map(col("id") -> col("B.id"),
>  col("name") -> col("B.name"),
>  col("c_name") -> col("B.c_name"),
>  col("quantity") -> col("B.quantity"),
>  col("price") -> expr("B.price * 100"),
>  col("state") -> col("B.state"))
> val insertMap_u = Map(col("id") -> col("id"),
>  col("name") -> col("name"),
>  col("c_name") -> lit("insert"),
>  col("quantity") -> col("quantity"),
>  col("price") -> expr("price"),
>  col("state") -> col("state"))
> val insertMap_d = Map(col("id") -> col("id"),
>  col("name") -> col("name"),
>  col("c_name") -> lit("delete"),
>  col("quantity") -> col("quantity"),
>  col("price") -> expr("price"),
>  col("state") -> col("state"))
> matches ++= Seq(WhenMatched(Some(col("A.state") =!= 
> col("B.state"))).addAction(UpdateAction(updateMap)).addAction(InsertInHistoryTableAction(insertMap_u,
>  TableIdentifier("order_hist"
>  matches ++= Seq(WhenNotMatched().addAction(InsertAction(insertMap)))
>  matches ++= 
> Seq(WhenNotMatchedAndExistsOnlyOnTarget().addAction(DeleteAction()).addAction(InsertInHistoryTableAction(insertMap_d,
>  TableIdentifier("order_hist"
> {code}
>  
> SQL Queries :
> {code}
> sql("select count(*) from order").show()
>  sql("select count(*) from order where state = 2").show()
>  sql("select price from order where id = 'newid1'").show()
>  sql("select count(*) from order_hist where c_name = 'delete'").show()
>  sql("select count(*) from order_hist where c_name = 'insert'").show()
> {code}
> Results in spark-shell --master yarn
> {code}
>  scala> sql("select count(*) from order").show()
> +--------+
> |count(1)|
> +--------+
> |      10|
> +--------+
> scala> sql("select count(*) from order where state = 2").show()
> +--------+
> |count(1)|
> +--------+
> |       0|
> +--------+
> scala> sql("select price from order where id = 'newid1'").show()
> +-----+
> |price|
> +-----+
> +-----+
> scala> sql("select 

[jira] [Updated] (CARBONDATA-4034) Improve the time-consuming of Horizontal Compaction for update

2020-10-29 Thread Jiayu Shen (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-4034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiayu Shen updated CARBONDATA-4034:
---
Description: 
In the update flow, horizontal compaction becomes significantly slower when 
updating a table with many segments (or many blocks). The cost of one such case 
is shown in the log below.
{code:java}
2020-10-10 09:38:10,466 | INFO | [OperationManager-Background-Pool-28] | 
Horizontal Update Compaction operation started for 
[ods_oms.oms_wh_outbound_order] 
 2020-10-10 09:50:25,718 | INFO | [OperationManager-Background-Pool-28] | 
Horizontal Update Compaction operation completed for 
[ods_oms.oms_wh_outbound_order]. 
 2020-10-10 10:15:44,302 | INFO | [OperationManager-Background-Pool-28] | 
Horizontal Delete Compaction operation started for 
[ods_oms.oms_wh_outbound_order] 
 2020-10-10 10:15:54,874 | INFO | [OperationManager-Background-Pool-28] | 
Horizontal Delete Compaction operation completed for 
[ods_oms.oms_wh_outbound_order].{code}
In this PR, we optimize the process between the second and third rows of the log 
by optimizing the method _performDeleteDeltaCompaction_ in the horizontal 
compaction flow.

 

  was:
In the update flow, horizontal compaction will be significantly slower when 
updating with a lot of segments(or a lot of blocks). There is a case whose 
costing is as shown in the log.

2020-10-10 09:38:10,466 | INFO | [OperationManager-Background-Pool-28] | 
Horizontal Update Compaction operation started for 
[ods_oms.oms_wh_outbound_order] 
 2020-10-10 09:50:25,718 | INFO | [OperationManager-Background-Pool-28] | 
Horizontal Update Compaction operation completed for 
[ods_oms.oms_wh_outbound_order]. 
 2020-10-10 10:15:44,302 | INFO | [OperationManager-Background-Pool-28] | 
Horizontal Delete Compaction operation started for 
[ods_oms.oms_wh_outbound_order] 
 2020-10-10 10:15:54,874 | INFO | [OperationManager-Background-Pool-28] | 
Horizontal Delete Compaction operation completed for 
[ods_oms.oms_wh_outbound_order].

In this PR, we optimize the process between second and third row of the log, by 
optimizing the method _performDeleteDeltaCompaction_ in horizontal compaction 
flow.

 


> Improve the time-consuming of Horizontal Compaction for update
> --
>
> Key: CARBONDATA-4034
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4034
> Project: CarbonData
>  Issue Type: Bug
>Reporter: Jiayu Shen
>Priority: Minor
>  Time Spent: 17h 10m
>  Remaining Estimate: 0h
>
> In the update flow, horizontal compaction becomes significantly slower when 
> updating a table with many segments (or many blocks). The cost of one such case 
> is shown in the log below.
> {code:java}
> 2020-10-10 09:38:10,466 | INFO | [OperationManager-Background-Pool-28] | 
> Horizontal Update Compaction operation started for 
> [ods_oms.oms_wh_outbound_order] 
>  2020-10-10 09:50:25,718 | INFO | [OperationManager-Background-Pool-28] | 
> Horizontal Update Compaction operation completed for 
> [ods_oms.oms_wh_outbound_order]. 
>  2020-10-10 10:15:44,302 | INFO | [OperationManager-Background-Pool-28] | 
> Horizontal Delete Compaction operation started for 
> [ods_oms.oms_wh_outbound_order] 
>  2020-10-10 10:15:54,874 | INFO | [OperationManager-Background-Pool-28] | 
> Horizontal Delete Compaction operation completed for 
> [ods_oms.oms_wh_outbound_order].{code}
> In this PR, we optimize the process between the second and third rows of the log 
> by optimizing the method _performDeleteDeltaCompaction_ in the horizontal 
> compaction flow.
>  
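
The description above does not spell out the exact change, but the pattern it points at is avoiding serial, repeated per-segment work inside _performDeleteDeltaCompaction_. The sketch below is purely illustrative and is not the PR's actual code: it only shows how independent per-segment delete-delta compaction tasks can be run concurrently; the `Segment` type parameter and `compactSegment` function are hypothetical placeholders, not CarbonData APIs.

{code}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Illustrative only: run independent per-segment delete-delta compaction work
// concurrently instead of one segment after another.
def compactDeleteDeltasInParallel[Segment](
    segments: Seq[Segment],
    compactSegment: Segment => Unit,   // hypothetical per-segment compaction step
    threads: Int = 4): Unit = {
  val pool = Executors.newFixedThreadPool(threads)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try {
    // one future per segment; the segments do not share state, so order is irrelevant
    val futures = segments.map(segment => Future(compactSegment(segment)))
    futures.foreach(f => Await.result(f, Duration.Inf))
  } finally {
    pool.shutdown()
  }
}
{code}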



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [carbondata] asfgit closed pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

2020-10-29 Thread GitBox


asfgit closed pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

2020-10-29 Thread GitBox


CarbonDataQA1 commented on pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#issuecomment-718730475


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2972/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

2020-10-29 Thread GitBox


CarbonDataQA1 commented on pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#issuecomment-718729637


   Build Success with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4731/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] QiangCai commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

2020-10-29 Thread GitBox


QiangCai commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-718729381


   LGTM, great job



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-29 Thread GitBox


CarbonDataQA1 commented on pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#issuecomment-718701205


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2971/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-29 Thread GitBox


CarbonDataQA1 commented on pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#issuecomment-718697037


   Build Success with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4730/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Resolved] (CARBONDATA-4042) Insert into select and CTAS launches fewer tasks(task count limited to number of nodes in cluster) even when target table is of no_sort

2020-10-29 Thread Akash R Nilugal (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-4042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akash R Nilugal resolved CARBONDATA-4042.
-
Fix Version/s: 2.1.0
   Resolution: Fixed

> Insert into select and CTAS launches fewer tasks(task count limited to number 
> of nodes in cluster) even when target table is of no_sort
> ---
>
> Key: CARBONDATA-4042
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4042
> Project: CarbonData
>  Issue Type: Improvement
>  Components: data-load, spark-integration
>Reporter: Venugopal Reddy K
>Priority: Major
> Fix For: 2.1.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> *Issue:*
> At present, when we do insert into table select from or create table as 
> select from, we launch a single task per node. Whereas when we do a simple 
> select * from table query, the tasks launched are equal to the number of carbondata 
> files (CARBON_TASK_DISTRIBUTION defaults to CARBON_TASK_DISTRIBUTION_BLOCK). 
> This slows down the load performance of the insert into select and ctas cases.
> Refer to the [Community discussion regarding task 
> launch|http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Query-Regarding-Task-launch-mechanism-for-data-load-operations-tt98711.html]
>  
> *Suggestion:*
> Launch the same number of tasks as a select query for the insert into select and 
> ctas cases when the target table is no-sort.
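
A minimal sketch of the suggested behaviour, purely illustrative and not CarbonData's actual code: the parallelism of an insert-into-select or CTAS load mirrors the source scan's split count when the target table is no-sort, and keeps the current one-task-per-node behaviour otherwise. The helper name and parameters are hypothetical.

{code}
// Illustrative only: how the suggested task-count choice could look.
def insertTaskCount(
    targetIsNoSort: Boolean,
    sourceSplitCount: Int,  // splits a plain "select *" over the source would scan
    nodeCount: Int): Int = {
  if (targetIsNoSort) {
    // no-sort target: keep the same parallelism as the select side
    math.max(1, sourceSplitCount)
  } else {
    // sorted target: existing behaviour, one task per node
    math.max(1, nodeCount)
  }
}
{code}

For example, with 200 carbondata files in the source and a 3-node cluster, a no-sort target would get 200 tasks instead of 3.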



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [carbondata] asfgit closed pull request #3972: [CARBONDATA-4042]Launch same number of task as select query for insert into select and ctas cases when target table is of no_sort

2020-10-29 Thread GitBox


asfgit closed pull request #3972:
URL: https://github.com/apache/carbondata/pull/3972


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3972: [CARBONDATA-4042]Launch same number of task as select query for insert into select and ctas cases when target table is of no_sort

2020-10-29 Thread GitBox


CarbonDataQA1 commented on pull request #3972:
URL: https://github.com/apache/carbondata/pull/3972#issuecomment-718612012


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2970/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Closed] (CARBONDATA-3819) Fileformat column details is not present in the show segments DDL for heterogenous segments table.

2020-10-29 Thread Prasanna Ravichandran (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanna Ravichandran closed CARBONDATA-3819.
-

Fixed and verified.

> Fileformat column details are not present in the show segments DDL for 
> heterogeneous segments table.
> --
>
> Key: CARBONDATA-3819
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3819
> Project: CarbonData
>  Issue Type: Bug
> Environment: Opensource ANT cluster
>Reporter: Prasanna Ravichandran
>Priority: Minor
> Attachments: fileformat_notworking_actualresult.PNG, 
> fileformat_working_expected.PNG
>
>
> Fileformat column details are not present in the show segments DDL for a 
> heterogeneous segments table.
> Test steps: 
>  # Create a heterogeneous table with added parquet and carbon segments.
>  # Do show segments. 
> Expected results:
> It should show the "FileFormat" column details in the show segments DDL.
> Actual result: 
> It is not showing the FileFormat column details in the show segments DDL.
> See the attached screenshots for more details.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (CARBONDATA-3819) Fileformat column details is not present in the show segments DDL for heterogenous segments table.

2020-10-29 Thread Prasanna Ravichandran (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanna Ravichandran resolved CARBONDATA-3819.
---
Resolution: Fixed

This issue is fixed in the latest Carbon jars - 2.0.0.

> Fileformat column details are not present in the show segments DDL for 
> heterogeneous segments table.
> --
>
> Key: CARBONDATA-3819
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3819
> Project: CarbonData
>  Issue Type: Bug
> Environment: Opensource ANT cluster
>Reporter: Prasanna Ravichandran
>Priority: Minor
> Attachments: fileformat_notworking_actualresult.PNG, 
> fileformat_working_expected.PNG
>
>
> Fileformat column details are not present in the show segments DDL for a 
> heterogeneous segments table.
> Test steps: 
>  # Create a heterogeneous table with added parquet and carbon segments.
>  # Do show segments. 
> Expected results:
> It should show the "FileFormat" column details in the show segments DDL.
> Actual result: 
> It is not showing the FileFormat column details in the show segments DDL.
> See the attached screenshots for more details.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3972: [CARBONDATA-4042]Launch same number of task as select query for insert into select and ctas cases when target table is of no_sort

2020-10-29 Thread GitBox


CarbonDataQA1 commented on pull request #3972:
URL: https://github.com/apache/carbondata/pull/3972#issuecomment-718593609


   Build Success with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4729/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-29 Thread GitBox


CarbonDataQA1 commented on pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#issuecomment-718541543


   Build Success with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4728/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-29 Thread GitBox


CarbonDataQA1 commented on pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#issuecomment-718523188


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2969/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3995: [CARBONDATA-4043] Fix data load failure issue for columns added in legacy store

2020-10-29 Thread GitBox


CarbonDataQA1 commented on pull request #3995:
URL: https://github.com/apache/carbondata/pull/3995#issuecomment-718505838


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2968/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3995: [CARBONDATA-4043] Fix data load failure issue for columns added in legacy store

2020-10-29 Thread GitBox


CarbonDataQA1 commented on pull request #3995:
URL: https://github.com/apache/carbondata/pull/3995#issuecomment-718503918


   Build Success with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4727/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

2020-10-29 Thread GitBox


CarbonDataQA1 commented on pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#issuecomment-718493643


   Build Failed  with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4725/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

2020-10-29 Thread GitBox


CarbonDataQA1 commented on pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#issuecomment-718491747


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2966/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] QiangCai commented on pull request #3972: [CARBONDATA-4042]Launch same number of task as select query for insert into select and ctas cases when target table is of no_sort

2020-10-29 Thread GitBox


QiangCai commented on pull request #3972:
URL: https://github.com/apache/carbondata/pull/3972#issuecomment-718478244







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] QiangCai commented on a change in pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

2020-10-29 Thread GitBox


QiangCai commented on a change in pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#discussion_r514046333



##
File path: 
integration/spark/src/main/scala/org/apache/carbondata/spark/load/DataLoadProcessBuilderOnSpark.scala
##
@@ -143,10 +145,16 @@ object DataLoadProcessBuilderOnSpark {
 
 var numPartitions = CarbonDataProcessorUtil.getGlobalSortPartitions(
   
configuration.getDataLoadProperty(CarbonCommonConstants.LOAD_GLOBAL_SORT_PARTITIONS))
+
+// if numPartitions user does not specify and not specified in config then 
dynamically calculate
+if (numPartitions == 0) {
+  numPartitions = Math.ceil(model.getTotalSize.toDouble / 
defaultMaxSplitBytes).toInt
+}
+
+// after calculation based on size if still zero then take the partition 
number
 if (numPartitions <= 0) {
   numPartitions = convertRDD.partitions.length

Review comment:
   numPartitions = Math.min(convertRDD.partitions.length, dynamic partition 
number)
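
   Pulling the reviewer's suggestions together, a minimal self-contained sketch of the intended calculation (assuming the user-configured global sort partitions, the total load size, `spark.sql.files.maxPartitionBytes` and the RDD's partition count are available, as in the surrounding diff; this is not the PR's final code):

   ```scala
   // Illustrative sketch of the review suggestion, not the PR's final code.
   def dynamicGlobalSortPartitions(
       userPartitions: Int,    // from LOAD_GLOBAL_SORT_PARTITIONS, <= 0 when unset
       totalSizeInBytes: Long, // e.g. model.getTotalSize or SizeEstimator.estimate(rdd)
       maxSplitBytes: Long,    // e.g. spark.sql.files.maxPartitionBytes
       rddPartitions: Int): Int = {
     if (userPartitions > 0) {
       userPartitions
     } else {
       // size-based estimate, capped by the RDD's own partition count
       val bySize = math.ceil(totalSizeInBytes.toDouble / maxSplitBytes).toInt
       math.max(1, math.min(rddPartitions, bySize))
     }
   }
   ```

   For example, `dynamicGlobalSortPartitions(0, 10L << 30, 128L << 20, 200)` yields 80, while an explicit user value is always honoured.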





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] QiangCai commented on a change in pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

2020-10-29 Thread GitBox


QiangCai commented on a change in pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#discussion_r514043536



##
File path: 
integration/spark/src/main/scala/org/apache/carbondata/spark/load/DataLoadProcessBuilderOnSpark.scala
##
@@ -93,6 +94,7 @@ object DataLoadProcessBuilderOnSpark {
 val convertStepRowCounter = sc.longAccumulator("Convert Processor 
Accumulator")
 val sortStepRowCounter = sc.longAccumulator("Sort Processor Accumulator")
 val writeStepRowCounter = sc.longAccumulator("Write Processor Accumulator")
+val defaultMaxSplitBytes = 
sessionState(sparkSession).conf.filesMaxPartitionBytes

Review comment:
   move to line 156

##
File path: 
integration/spark/src/main/scala/org/apache/carbondata/spark/load/DataLoadProcessBuilderOnSpark.scala
##
@@ -227,9 +236,17 @@ object DataLoadProcessBuilderOnSpark {
 // 2. sort
 var numPartitions = CarbonDataProcessorUtil.getGlobalSortPartitions(
   
configuration.getDataLoadProperty(CarbonCommonConstants.LOAD_GLOBAL_SORT_PARTITIONS))
+
+// if numPartitions user does not specify and not specified in config then 
dynamically calculate
+if (numPartitions <= 0) {
+  numPartitions = Math.ceil(SizeEstimator.estimate(originRDD) / 
defaultMaxSplitBytes).toInt

Review comment:
   SizeEstimator.estimate(originRDD).toDouble
   
   and move to line 247

##
File path: 
integration/spark/src/main/scala/org/apache/carbondata/spark/load/DataLoadProcessBuilderOnSpark.scala
##
@@ -143,10 +145,16 @@ object DataLoadProcessBuilderOnSpark {
 
 var numPartitions = CarbonDataProcessorUtil.getGlobalSortPartitions(
   
configuration.getDataLoadProperty(CarbonCommonConstants.LOAD_GLOBAL_SORT_PARTITIONS))
+
+// if numPartitions user does not specify and not specified in config then 
dynamically calculate
+if (numPartitions == 0) {
+  numPartitions = Math.ceil(model.getTotalSize.toDouble / 
defaultMaxSplitBytes).toInt
+}
+
+// after calculation based on size if still zero then take the partition 
number
 if (numPartitions <= 0) {
   numPartitions = convertRDD.partitions.length

Review comment:
   Math.min(convertRDD.partitions.length, dynamic partition number)

##
File path: 
integration/spark/src/main/scala/org/apache/carbondata/spark/load/DataLoadProcessBuilderOnSpark.scala
##
@@ -227,9 +236,17 @@ object DataLoadProcessBuilderOnSpark {
 // 2. sort
 var numPartitions = CarbonDataProcessorUtil.getGlobalSortPartitions(
   
configuration.getDataLoadProperty(CarbonCommonConstants.LOAD_GLOBAL_SORT_PARTITIONS))
+
+// if numPartitions user does not specify and not specified in config then 
dynamically calculate
+if (numPartitions <= 0) {
+  numPartitions = Math.ceil(SizeEstimator.estimate(originRDD) / 
defaultMaxSplitBytes).toInt
+}
+
+// after calculation based on size if still zero then take the partition 
number
 if (numPartitions <= 0) {
   numPartitions = originRDD.partitions.length

Review comment:
   numPartitions = Math.min(originRDD.partitions.length, dynamic partition 
number)

##
File path: 
integration/spark/src/main/scala/org/apache/carbondata/spark/load/DataLoadProcessBuilderOnSpark.scala
##
@@ -202,6 +210,7 @@ object DataLoadProcessBuilderOnSpark {
 val partialSuccessAccum = sc.longAccumulator("Partial Success Accumulator")
 val sortStepRowCounter = sc.longAccumulator("Sort Processor Accumulator")
 val writeStepRowCounter = sc.longAccumulator("Write Processor Accumulator")
+val defaultMaxSplitBytes = 
sessionState(sparkSession).conf.filesMaxPartitionBytes

Review comment:
   move to line 247

##
File path: 
integration/spark/src/main/scala/org/apache/carbondata/spark/load/DataLoadProcessBuilderOnSpark.scala
##
@@ -143,10 +145,16 @@ object DataLoadProcessBuilderOnSpark {
 
 var numPartitions = CarbonDataProcessorUtil.getGlobalSortPartitions(
   
configuration.getDataLoadProperty(CarbonCommonConstants.LOAD_GLOBAL_SORT_PARTITIONS))
+
+// if numPartitions user does not specify and not specified in config then 
dynamically calculate
+if (numPartitions == 0) {
+  numPartitions = Math.ceil(model.getTotalSize.toDouble / 
defaultMaxSplitBytes).toInt

Review comment:
   move to 156





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #3995: [CARBONDATA-4043] Fix data load failure issue for columns added in legacy store

2020-10-29 Thread GitBox


Indhumathi27 commented on a change in pull request #3995:
URL: https://github.com/apache/carbondata/pull/3995#discussion_r514042919



##
File path: 
processing/src/main/java/org/apache/carbondata/processing/util/CarbonDataProcessorUtil.java
##
@@ -424,6 +440,38 @@ public static boolean isHeaderValid(String tableName, 
String[] csvHeader,
 return noDicSortColMapping;
   }
 
+  /**
+   * Get the sort/no_sort column map based on schema order.
+   * This will be used in the final sort step to find the index of sort 
column, to compare the
+   * intermediate row data based on schema.
+   */
+  public static Map> 
getSortColSchemaOrderMapping(CarbonTable carbonTable) {

Review comment:
   added





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

2020-10-29 Thread GitBox


CarbonDataQA1 commented on pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#issuecomment-718399151


   Build Failed  with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4721/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

2020-10-29 Thread GitBox


CarbonDataQA1 commented on pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#issuecomment-718398674


   Build Failed  with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2963/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org