[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4028: [HOTFIX] Fix Random CI Failures of HiveCarbonTest

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4028:
URL: https://github.com/apache/carbondata/pull/4028#issuecomment-734699847


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3190/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] nihal0107 commented on a change in pull request #4000: [CARBONDATA-4020] Fixed drop index when multiple index exists

2020-11-26 Thread GitBox


nihal0107 commented on a change in pull request #4000:
URL: https://github.com/apache/carbondata/pull/4000#discussion_r531434575



##
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/index/DropIndexCommand.scala
##
@@ -184,10 +184,12 @@ private[sql] case class DropIndexCommand(
     parentCarbonTable = getRefreshedParentTable(sparkSession, dbName)
     val indexMetadata = parentCarbonTable.getIndexMetadata
     if (null != indexMetadata && null != indexMetadata.getIndexesMap) {
-      val hasCgFgIndexes =
-        !(indexMetadata.getIndexesMap.size() == 1 &&
-          indexMetadata.getIndexesMap.containsKey(IndexType.SI.getIndexProviderName))
-      if (hasCgFgIndexes) {
+      // check if any CG or FG index exists. If not exists,

Review comment:
   OK, handled the scenario where no CG or FG index exists by setting the `indexExists` property to false. Earlier this case was not handled: when we dropped all indexes, we did not set the `indexExists` property to false.









[GitHub] [carbondata] shenjiayu17 edited a comment on pull request #4012: [CARBONDATA-4051] Geo spatial index algorithm improvement and UDFs enhancement

2020-11-26 Thread GitBox


shenjiayu17 edited a comment on pull request #4012:
URL: https://github.com/apache/carbondata/pull/4012#issuecomment-734690792


   > @shenjiayu17 : please update `/docs/spatial-index-guide.md` about what new UDF is supported for query and what functionality changed
   
   spatial-index-guide.md has been updated
   







[GitHub] [carbondata] shenjiayu17 commented on pull request #4012: [CARBONDATA-4051] Geo spatial index algorithm improvement and UDFs enhancement

2020-11-26 Thread GitBox


shenjiayu17 commented on pull request #4012:
URL: https://github.com/apache/carbondata/pull/4012#issuecomment-734690792


   > @shenjiayu17 : please update `/docs/spatial-index-guide.md` about what new UDF is supported for query and what functionality changed
   
   spatial-index-guide.md has been updated
   







[GitHub] [carbondata] VenuReddy2103 commented on a change in pull request #4000: [CARBONDATA-4020] Fixed drop index when multiple index exists

2020-11-26 Thread GitBox


VenuReddy2103 commented on a change in pull request #4000:
URL: https://github.com/apache/carbondata/pull/4000#discussion_r531416935



##
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/index/DropIndexCommand.scala
##
@@ -184,10 +184,12 @@ private[sql] case class DropIndexCommand(
     parentCarbonTable = getRefreshedParentTable(sparkSession, dbName)
     val indexMetadata = parentCarbonTable.getIndexMetadata
     if (null != indexMetadata && null != indexMetadata.getIndexesMap) {
-      val hasCgFgIndexes =
-        !(indexMetadata.getIndexesMap.size() == 1 &&
-          indexMetadata.getIndexesMap.containsKey(IndexType.SI.getIndexProviderName))
-      if (hasCgFgIndexes) {
+      // check if any CG or FG index exists. If not exists,
+      // then set indexExists as false to return empty index list for next query.
+      val hasCgFgIndexes = indexMetadata.getIndexesMap.size() != 0 &&

Review comment:
   I understand that `indexMetadata` will have SI indexes as well. But what I meant was that `indexMetadata.getIndexesMap.size() != 0` always evaluates to true at line 189.









[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4022: [CARBONDATA-4056] Added global sort for data files merge operation in SI segments.

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4022:
URL: https://github.com/apache/carbondata/pull/4022#issuecomment-734683112


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3189/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4022: [CARBONDATA-4056] Added global sort for data files merge operation in SI segments.

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4022:
URL: https://github.com/apache/carbondata/pull/4022#issuecomment-734682627


   Build Success with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4944/
   







[GitHub] [carbondata] VenuReddy2103 commented on a change in pull request #4000: [CARBONDATA-4020] Fixed drop index when multiple index exists

2020-11-26 Thread GitBox


VenuReddy2103 commented on a change in pull request #4000:
URL: https://github.com/apache/carbondata/pull/4000#discussion_r531416935



##
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/index/DropIndexCommand.scala
##
@@ -184,10 +184,12 @@ private[sql] case class DropIndexCommand(
     parentCarbonTable = getRefreshedParentTable(sparkSession, dbName)
     val indexMetadata = parentCarbonTable.getIndexMetadata
     if (null != indexMetadata && null != indexMetadata.getIndexesMap) {
-      val hasCgFgIndexes =
-        !(indexMetadata.getIndexesMap.size() == 1 &&
-          indexMetadata.getIndexesMap.containsKey(IndexType.SI.getIndexProviderName))
-      if (hasCgFgIndexes) {
+      // check if any CG or FG index exists. If not exists,
+      // then set indexExists as false to return empty index list for next query.
+      val hasCgFgIndexes = indexMetadata.getIndexesMap.size() != 0 &&

Review comment:
   Got it.









[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4005: [CARBONDATA-3978] Trash Folder support in carbondata

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4005:
URL: https://github.com/apache/carbondata/pull/4005#issuecomment-734677407


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3188/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4005: [CARBONDATA-3978] Trash Folder support in carbondata

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4005:
URL: https://github.com/apache/carbondata/pull/4005#issuecomment-734675548


   Build Success with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4943/
   







[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #4000: [CARBONDATA-4020] Fixed drop index when multiple index exists

2020-11-26 Thread GitBox


Indhumathi27 commented on a change in pull request #4000:
URL: https://github.com/apache/carbondata/pull/4000#discussion_r531409958



##
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/index/DropIndexCommand.scala
##
@@ -184,10 +184,12 @@ private[sql] case class DropIndexCommand(
     parentCarbonTable = getRefreshedParentTable(sparkSession, dbName)
     val indexMetadata = parentCarbonTable.getIndexMetadata
     if (null != indexMetadata && null != indexMetadata.getIndexesMap) {
-      val hasCgFgIndexes =
-        !(indexMetadata.getIndexesMap.size() == 1 &&
-          indexMetadata.getIndexesMap.containsKey(IndexType.SI.getIndexProviderName))
-      if (hasCgFgIndexes) {
+      // check if any CG or FG index exists. If not exists,
+      // then set indexExists as false to return empty index list for next query.
+      val hasCgFgIndexes = indexMetadata.getIndexesMap.size() != 0 &&

Review comment:
   `indexMetadata` will also have SI indexes. The `indexExists` property is only for CG or FG indexes.
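The check being discussed can be sketched in isolation: a CG or FG index exists only when the indexes map contains something other than the secondary-index provider entry. A minimal sketch, with a stand-in provider key (`SI_PROVIDER` is an illustrative placeholder for `IndexType.SI.getIndexProviderName`, not CarbonData's real value):

```java
import java.util.HashMap;
import java.util.Map;

public class IndexCheckSketch {
  // Stand-in key for the secondary-index provider (assumption, not the real constant).
  static final String SI_PROVIDER = "si";

  // A CG or FG index exists when the indexes map holds any key other than
  // the secondary-index provider, i.e. the map is not empty and not SI-only.
  static boolean hasCgFgIndexes(Map<String, String> indexesMap) {
    return !indexesMap.isEmpty()
        && !(indexesMap.size() == 1 && indexesMap.containsKey(SI_PROVIDER));
  }

  public static void main(String[] args) {
    Map<String, String> siOnly = new HashMap<>();
    siOnly.put(SI_PROVIDER, "si_index_1");
    Map<String, String> withBloom = new HashMap<>(siOnly);
    withBloom.put("bloomfilter", "bloom_index_1");

    System.out.println(hasCgFgIndexes(siOnly));          // false: only SI remains
    System.out.println(hasCgFgIndexes(withBloom));       // true: a CG index exists
    System.out.println(hasCgFgIndexes(new HashMap<>())); // false: nothing left
  }
}
```

The empty-map case is the one debated above: if `indexMetadata` is guaranteed null when the map is empty, the `!indexesMap.isEmpty()` half of the check never fires.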









[GitHub] [carbondata] VenuReddy2103 commented on a change in pull request #4000: [CARBONDATA-4020] Fixed drop index when multiple index exists

2020-11-26 Thread GitBox


VenuReddy2103 commented on a change in pull request #4000:
URL: https://github.com/apache/carbondata/pull/4000#discussion_r531409321



##
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/index/DropIndexCommand.scala
##
@@ -184,10 +184,12 @@ private[sql] case class DropIndexCommand(
     parentCarbonTable = getRefreshedParentTable(sparkSession, dbName)
     val indexMetadata = parentCarbonTable.getIndexMetadata
     if (null != indexMetadata && null != indexMetadata.getIndexesMap) {
-      val hasCgFgIndexes =
-        !(indexMetadata.getIndexesMap.size() == 1 &&
-          indexMetadata.getIndexesMap.containsKey(IndexType.SI.getIndexProviderName))
-      if (hasCgFgIndexes) {
+      // check if any CG or FG index exists. If not exists,

Review comment:
   For example, create 2 bloom indexes, then drop both of them. On the last index drop, `indexMetadata` will be null at this point (line 186). We do not seem to set the `indexExists` property to `false` in that case.









[GitHub] [carbondata] akashrn5 commented on a change in pull request #4005: [CARBONDATA-3978] Trash Folder support in carbondata

2020-11-26 Thread GitBox


akashrn5 commented on a change in pull request #4005:
URL: https://github.com/apache/carbondata/pull/4005#discussion_r531365242



##
File path: core/src/main/java/org/apache/carbondata/core/util/CleanFilesUtil.java
##
@@ -0,0 +1,179 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.core.util;
+
+import java.io.IOException;
+import java.util.*;
+
+import org.apache.carbondata.common.logging.LogServiceFactory;
+import org.apache.carbondata.core.constants.CarbonCommonConstants;
+import org.apache.carbondata.core.datastore.filesystem.CarbonFile;
+import org.apache.carbondata.core.datastore.impl.FileFactory;
+import org.apache.carbondata.core.metadata.SegmentFileStore;
+import org.apache.carbondata.core.metadata.schema.table.CarbonTable;
+import org.apache.carbondata.core.statusmanager.LoadMetadataDetails;
+import org.apache.carbondata.core.statusmanager.SegmentStatus;
+import org.apache.carbondata.core.statusmanager.SegmentStatusManager;
+import org.apache.carbondata.core.util.path.CarbonTablePath;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.Logger;
+
+/**
+ * Maintains the clean files command in carbondata. This class has methods for clean files
+ * operation.
+ */
+public class CleanFilesUtil {
+
+  private static final Logger LOGGER =
+      LogServiceFactory.getLogService(CleanFilesUtil.class.getName());
+
+  /**
+   * This method will clean all the stale segments for a table, delete the source folder after
+   * copying the data to the trash and also remove the .segment files of the stale segments
+   */
+  public static void cleanStaleSegments(CarbonTable carbonTable)
+      throws IOException {
+    long timeStampForTrashFolder = System.currentTimeMillis();
+    List<String> staleSegments = getStaleSegments(carbonTable);
+    if (staleSegments.size() > 0) {
+      for (String staleSegment : staleSegments) {
+        String segmentNumber = staleSegment.split(CarbonCommonConstants.UNDERSCORE)[0];
+        SegmentFileStore fileStore = new SegmentFileStore(carbonTable.getTablePath(),
+            staleSegment);
+        Map<String, SegmentFileStore.FolderDetails> locationMap = fileStore.getSegmentFile()
+            .getLocationMap();
+        if (locationMap != null) {
+          CarbonFile segmentLocation = FileFactory.getCarbonFile(carbonTable.getTablePath() +
+              CarbonCommonConstants.FILE_SEPARATOR + fileStore.getSegmentFile().getLocationMap()
+              .entrySet().iterator().next().getKey());
+          // copy the complete segment to the trash folder
+          TrashUtil.copySegmentToTrash(segmentLocation, CarbonTablePath.getTrashFolderPath(
+              carbonTable.getTablePath()) + CarbonCommonConstants.FILE_SEPARATOR +
+              timeStampForTrashFolder + CarbonCommonConstants.FILE_SEPARATOR + CarbonTablePath
+              .SEGMENT_PREFIX + segmentNumber);
+          // Deleting the stale Segment folders.
+          try {
+            CarbonUtil.deleteFoldersAndFiles(segmentLocation);
+          } catch (IOException | InterruptedException e) {
+            LOGGER.error("Unable to delete the segment: " + segmentNumber + " after moving" +
+                " it to the trash folder : " + e.getMessage(), e);
+          }
+          // delete the segment file as well
+          FileFactory.deleteFile(CarbonTablePath.getSegmentFilePath(carbonTable.getTablePath(),
+              staleSegment));
+        }
+      }
+      staleSegments.clear();
+    }
+  }
+
+  /**
+   * This method will clean all the stale segments for partition table, delete the source folders
+   * after copying the data to the trash and also remove the .segment files of the stale segments
+   */
+  public static void cleanStaleSegmentsForPartitionTable(CarbonTable carbonTable)
+      throws IOException {
+    long timeStampForTrashFolder = System.currentTimeMillis();
+    List<String> staleSegments = getStaleSegments(carbonTable);
+    if (staleSegments.size() > 0) {
+      for (String staleSegment : staleSegments) {
+        String segmentNumber = staleSegment.split(CarbonCommonConstants.UNDERSCORE)[0];
+        // for each segment we get the indexfile first, then we get the carbondata file. Move both
+        // of those
[GitHub] [carbondata] VenuReddy2103 commented on a change in pull request #4000: [CARBONDATA-4020] Fixed drop index when multiple index exists

2020-11-26 Thread GitBox


VenuReddy2103 commented on a change in pull request #4000:
URL: https://github.com/apache/carbondata/pull/4000#discussion_r531399137



##
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/index/DropIndexCommand.scala
##
@@ -184,10 +184,12 @@ private[sql] case class DropIndexCommand(
     parentCarbonTable = getRefreshedParentTable(sparkSession, dbName)
     val indexMetadata = parentCarbonTable.getIndexMetadata
     if (null != indexMetadata && null != indexMetadata.getIndexesMap) {
-      val hasCgFgIndexes =
-        !(indexMetadata.getIndexesMap.size() == 1 &&
-          indexMetadata.getIndexesMap.containsKey(IndexType.SI.getIndexProviderName))
-      if (hasCgFgIndexes) {
+      // check if any CG or FG index exists. If not exists,
+      // then set indexExists as false to return empty index list for next query.
+      val hasCgFgIndexes = indexMetadata.getIndexesMap.size() != 0 &&

Review comment:
   `indexMetadata.getIndexesMap.size() != 0` would always be true at this point; `indexMetadata` will be null if empty. It is a redundant check.









[GitHub] [carbondata] VenuReddy2103 commented on a change in pull request #4000: [CARBONDATA-4020] Fixed drop index when multiple index exists

2020-11-26 Thread GitBox


VenuReddy2103 commented on a change in pull request #4000:
URL: https://github.com/apache/carbondata/pull/4000#discussion_r531399137



##
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/index/DropIndexCommand.scala
##
@@ -184,10 +184,12 @@ private[sql] case class DropIndexCommand(
     parentCarbonTable = getRefreshedParentTable(sparkSession, dbName)
     val indexMetadata = parentCarbonTable.getIndexMetadata
     if (null != indexMetadata && null != indexMetadata.getIndexesMap) {
-      val hasCgFgIndexes =
-        !(indexMetadata.getIndexesMap.size() == 1 &&
-          indexMetadata.getIndexesMap.containsKey(IndexType.SI.getIndexProviderName))
-      if (hasCgFgIndexes) {
+      // check if any CG or FG index exists. If not exists,
+      // then set indexExists as false to return empty index list for next query.
+      val hasCgFgIndexes = indexMetadata.getIndexesMap.size() != 0 &&

Review comment:
   `indexMetadata.getIndexesMap.size() != 0` would always be true at this point; `indexMetadata` will be null if empty.









[GitHub] [carbondata] Karan980 commented on pull request #4022: [CARBONDATA-4056] Added global sort for data files merge operation in SI segments.

2020-11-26 Thread GitBox


Karan980 commented on pull request #4022:
URL: https://github.com/apache/carbondata/pull/4022#issuecomment-734650646


   retest this please







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4028: [WIP] Fix hivetest random failure

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4028:
URL: https://github.com/apache/carbondata/pull/4028#issuecomment-734650115


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3186/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4028: [WIP] Fix hivetest random failure

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4028:
URL: https://github.com/apache/carbondata/pull/4028#issuecomment-734647802


   Build Success with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4941/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4029: refact carbon util

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4029:
URL: https://github.com/apache/carbondata/pull/4029#issuecomment-734647130


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3185/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4022: [CARBONDATA-4056] Added global sort for data files merge operation in SI segments.

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4022:
URL: https://github.com/apache/carbondata/pull/4022#issuecomment-734646663


   Build Failed  with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4940/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4022: [CARBONDATA-4056] Added global sort for data files merge operation in SI segments.

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4022:
URL: https://github.com/apache/carbondata/pull/4022#issuecomment-734646180


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3184/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4029: refact carbon util

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4029:
URL: https://github.com/apache/carbondata/pull/4029#issuecomment-734645757


   Build Success with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4939/
   







[GitHub] [carbondata] akashrn5 commented on a change in pull request #4005: [CARBONDATA-3978] Trash Folder support in carbondata

2020-11-26 Thread GitBox


akashrn5 commented on a change in pull request #4005:
URL: https://github.com/apache/carbondata/pull/4005#discussion_r531089568



##
File path: core/src/main/java/org/apache/carbondata/core/util/CarbonProperties.java
##
@@ -2086,6 +2087,34 @@ public int getMaxSIRepairLimit(String dbName, String tableName) {
     return Math.abs(Integer.parseInt(thresholdValue));
   }
 
+  /**
+   * The below method returns the time (in milliseconds) for which timestamp folder retention in
+   * trash folder will take place.
+   */
+  public long getTrashFolderRetentionTime() {
+    String propertyValue = getProperty(CarbonCommonConstants.CARBON_TRASH_RETENTION_DAYS);

Review comment:
   instead of this, just call `getProperty` with default value also, then 
all these null checks are not needed
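The suggestion above, passing a default to `getProperty` so the null check disappears, can be sketched with the stdlib `java.util.Properties`. The property name and the 7-day default below are illustrative placeholders, not CarbonData's actual constants:

```java
import java.util.Properties;
import java.util.concurrent.TimeUnit;

public class RetentionSketch {
  // Illustrative property name and default; the real ones live in CarbonCommonConstants.
  static final String TRASH_RETENTION_DAYS = "carbon.trash.retention.days";
  static final String DEFAULT_RETENTION_DAYS = "7";

  // With a default passed to getProperty, an absent key never yields null,
  // so the value can be parsed directly with no null check.
  static long trashRetentionMillis(Properties props) {
    String days = props.getProperty(TRASH_RETENTION_DAYS, DEFAULT_RETENTION_DAYS);
    return TimeUnit.DAYS.toMillis(Long.parseLong(days));
  }

  public static void main(String[] args) {
    Properties props = new Properties();             // key absent: default kicks in
    System.out.println(trashRetentionMillis(props)); // 604800000 (7 days in ms)
    props.setProperty(TRASH_RETENTION_DAYS, "1");
    System.out.println(trashRetentionMillis(props)); // 86400000 (1 day in ms)
  }
}
```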

##
File path: core/src/main/java/org/apache/carbondata/core/util/CleanFilesUtil.java
##
@@ -0,0 +1,179 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.core.util;
+
+import java.io.IOException;
+import java.util.*;
+
+import org.apache.carbondata.common.logging.LogServiceFactory;
+import org.apache.carbondata.core.constants.CarbonCommonConstants;
+import org.apache.carbondata.core.datastore.filesystem.CarbonFile;
+import org.apache.carbondata.core.datastore.impl.FileFactory;
+import org.apache.carbondata.core.metadata.SegmentFileStore;
+import org.apache.carbondata.core.metadata.schema.table.CarbonTable;
+import org.apache.carbondata.core.statusmanager.LoadMetadataDetails;
+import org.apache.carbondata.core.statusmanager.SegmentStatus;
+import org.apache.carbondata.core.statusmanager.SegmentStatusManager;
+import org.apache.carbondata.core.util.path.CarbonTablePath;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.Logger;
+
+/**
+ * Maintains the clean files command in carbondata. This class has methods for clean files
+ * operation.
+ */
+public class CleanFilesUtil {
+
+  private static final Logger LOGGER =
+      LogServiceFactory.getLogService(CleanFilesUtil.class.getName());
+
+  /**
+   * This method will clean all the stale segments for a table, delete the source folder after
+   * copying the data to the trash and also remove the .segment files of the stale segments
+   */
+  public static void cleanStaleSegments(CarbonTable carbonTable)
+      throws IOException {
+    long timeStampForTrashFolder = System.currentTimeMillis();
+    List<String> staleSegments = getStaleSegments(carbonTable);
+    if (staleSegments.size() > 0) {
+      for (String staleSegment : staleSegments) {
+        String segmentNumber = staleSegment.split(CarbonCommonConstants.UNDERSCORE)[0];
+        SegmentFileStore fileStore = new SegmentFileStore(carbonTable.getTablePath(),
+            staleSegment);
+        Map<String, SegmentFileStore.FolderDetails> locationMap = fileStore.getSegmentFile()
+            .getLocationMap();
+        if (locationMap != null) {
+          CarbonFile segmentLocation = FileFactory.getCarbonFile(carbonTable.getTablePath() +
+              CarbonCommonConstants.FILE_SEPARATOR + fileStore.getSegmentFile().getLocationMap()
+              .entrySet().iterator().next().getKey());
+          // copy the complete segment to the trash folder
+          TrashUtil.copySegmentToTrash(segmentLocation, CarbonTablePath.getTrashFolderPath(
+              carbonTable.getTablePath()) + CarbonCommonConstants.FILE_SEPARATOR +
+              timeStampForTrashFolder + CarbonCommonConstants.FILE_SEPARATOR + CarbonTablePath
+              .SEGMENT_PREFIX + segmentNumber);
+          // Deleting the stale Segment folders.
+          try {
+            CarbonUtil.deleteFoldersAndFiles(segmentLocation);
+          } catch (IOException | InterruptedException e) {
+            LOGGER.error("Unable to delete the segment: " + segmentNumber + " after moving" +
+                " it to the trash folder : " + e.getMessage(), e);
+          }
+          // delete the segment file as well
+          FileFactory.deleteFile(CarbonTablePath.getSegmentFilePath(carbonTable.getTablePath(),
+              staleSegment));
+        }
+      }
+      staleSegments.clear();
+    }
+  }
+
+  /**
+   * This method will clean all the stale segments for

[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

2020-11-26 Thread GitBox


Zhangshunyu commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r531372539



##
File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
##
@@ -736,11 +738,22 @@ private CarbonCommonConstants() {
   @CarbonProperty(dynamicConfigurable = true)
   public static final String CARBON_MAJOR_COMPACTION_SIZE = "carbon.major.compaction.size";
 
+  /**
+   * Size of Minor Compaction in MBs
+   */
+  @CarbonProperty(dynamicConfigurable = true)
+  public static final String CARBON_MINOR_COMPACTION_SIZE = "carbon.minor.compaction.size";
+
   /**
    * By default size of major compaction in MBs.
    */
   public static final String DEFAULT_CARBON_MAJOR_COMPACTION_SIZE = "1024";
 
+  /**
+   * By default size of minor compaction in MBs.
+   */
+  public static final String DEFAULT_CARBON_MINOR_COMPACTION_SIZE = "1048576";

Review comment:
   @ajantha-bhat Yes, 1TB is not proper here. Will remove it.
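The size concern above is unit arithmetic: the proposed default of 1048576 MB works out to a full terabyte, far larger than a sensible minor-compaction cutoff. A quick sketch of the conversion (class and method names are illustrative):

```java
public class CompactionSizeSketch {
  // Convert a size expressed in MB to whole TB (1 TB = 1024 * 1024 MB).
  static long mbToTb(long mb) {
    return mb / (1024L * 1024L);
  }

  public static void main(String[] args) {
    long defaultMinorCompactionSizeMb = 1_048_576L; // the proposed default, in MB
    System.out.println(mbToTb(defaultMinorCompactionSizeMb) + " TB"); // prints "1 TB"
  }
}
```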









[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

2020-11-26 Thread GitBox


Zhangshunyu commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r531372454



##
File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
##
@@ -736,11 +738,22 @@ private CarbonCommonConstants() {
   @CarbonProperty(dynamicConfigurable = true)
   public static final String CARBON_MAJOR_COMPACTION_SIZE = "carbon.major.compaction.size";
 
+  /**
+   * Size of Minor Compaction in MBs
+   */
+  @CarbonProperty(dynamicConfigurable = true)

Review comment:
   @ajantha-bhat Some users need a system-level value to control all tables when they use the same setting, like the major compaction system property. I think by default we don't use it; the user can specify it per table, and if they want to set it for all tables, they can use the system-level parameter.









[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

2020-11-26 Thread GitBox


ajantha-bhat commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r531371146



##
File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/datacompaction/MajorCompactionIgnoreInMinorTest.scala
##
@@ -186,6 +187,78 @@ class MajorCompactionIgnoreInMinorTest extends QueryTest with BeforeAndAfterAll
 
   }
 
+  def generateData(numOrders: Int = 10): DataFrame = {
+    import sqlContext.implicits._
+    sqlContext.sparkContext.parallelize(1 to numOrders, 4)
+      .map { x => ("country" + x, x, "07/23/2015", "name" + x, "phonetype" + x,
+        "serialname" + x, x + 1)
+      }.toDF("country", "ID", "date", "name", "phonetype", "serialname", "salary")
+  }
+
+  test("test skip segment whose data size exceed threshold in minor compaction") {

Review comment:
   @Zhangshunyu: Agreed, I got confused by the test case; we need more than 
1 MB of data to test it. 









[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

2020-11-26 Thread GitBox


ajantha-bhat commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r531371146



##
File path: 
integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/datacompaction/MajorCompactionIgnoreInMinorTest.scala
##
@@ -186,6 +187,78 @@ class MajorCompactionIgnoreInMinorTest extends QueryTest 
with BeforeAndAfterAll
 
   }
 
+  def generateData(numOrders: Int = 10): DataFrame = {
+import sqlContext.implicits._
+sqlContext.sparkContext.parallelize(1 to numOrders, 4)
+  .map { x => ("country" + x, x, "07/23/2015", "name" + x, "phonetype" + x,
+"serialname" + x, x + 1)
+  }.toDF("country", "ID", "date", "name", "phonetype", "serialname", 
"salary")
+  }
+
+  test("test skip segment whose data size exceed threshold in minor 
compaction") {

Review comment:
   @Zhangshunyu: Agreed, I got confused. 









[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

2020-11-26 Thread GitBox


Zhangshunyu commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r531370782



##
File path: 
integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/datacompaction/MajorCompactionIgnoreInMinorTest.scala
##
@@ -186,6 +187,78 @@ class MajorCompactionIgnoreInMinorTest extends QueryTest 
with BeforeAndAfterAll
 
   }
 
+  def generateData(numOrders: Int = 10): DataFrame = {
+import sqlContext.implicits._
+sqlContext.sparkContext.parallelize(1 to numOrders, 4)
+  .map { x => ("country" + x, x, "07/23/2015", "name" + x, "phonetype" + x,
+"serialname" + x, x + 1)
+  }.toDF("country", "ID", "date", "name", "phonetype", "serialname", 
"salary")
+  }
+
+  test("test skip segment whose data size exceed threshold in minor 
compaction") {

Review comment:
   @ajantha-bhat Setting the threshold to 1 MB means that a segment whose size 
exceeds 1 MB should be ignored in the compaction flow. How can 4 rows of data 
reach 1 MB? With only 4 rows, both segments would be compacted. Here we use a 
large amount of data to exceed 1 MB so the test verifies that the huge segment 
is ignored.
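   The skip behavior this test exercises amounts to a size filter over the candidate segments. A minimal standalone sketch (sizes in MB, method name made up, not real CarbonData code):

```java
import java.util.List;
import java.util.stream.Collectors;

public class MinorCompactionFilter {
    // Illustrative only: a segment larger than the threshold is skipped,
    // everything else remains a merge candidate.
    static List<Long> candidatesForMerge(List<Long> segmentSizesMb, long thresholdMb) {
        return segmentSizesMb.stream()
            .filter(size -> size <= thresholdMb)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The 2 MB segment exceeds a 1 MB threshold and is left out.
        System.out.println(candidatesForMerge(List.of(2L, 1L, 1L, 1L), 1L)); // [1, 1, 1]
    }
}
```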









[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4026: [WIP] blockid code clean

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4026:
URL: https://github.com/apache/carbondata/pull/4026#issuecomment-734632195


   Build Failed  with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4942/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4026: [WIP] blockid code clean

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4026:
URL: https://github.com/apache/carbondata/pull/4026#issuecomment-734632043


   Build Failed  with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3187/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4026: [WIP] blockid code clean

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4026:
URL: https://github.com/apache/carbondata/pull/4026#issuecomment-734630191


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3183/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4026: [WIP] blockid code clean

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4026:
URL: https://github.com/apache/carbondata/pull/4026#issuecomment-734629906


   Build Success with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4938/
   







[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

2020-11-26 Thread GitBox


ajantha-bhat commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r531367494



##
File path: 
processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##
@@ -311,10 +311,8 @@ public static String getLoadNumberFromLoadName(String 
loadName) {
   listOfSegmentsToBeMerged = 
identifySegmentsToBeMergedBasedOnSize(compactionSize,
   listOfSegmentsLoadedInSameDateInterval, carbonLoadModel);
 } else {
-
-  listOfSegmentsToBeMerged =
-  
identifySegmentsToBeMergedBasedOnSegCount(listOfSegmentsLoadedInSameDateInterval,
-  tableLevelProperties);
+  listOfSegmentsToBeMerged = 
identifySegmentsToBeMergedBasedOnSegCount(compactionSize,

Review comment:
   Maybe you can calculate the compaction size inside this method, since the 
table property is already available there. Then there is no need to modify 
this method's signature. 









[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

2020-11-26 Thread GitBox


ajantha-bhat commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r531367024



##
File path: 
integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/datacompaction/MajorCompactionIgnoreInMinorTest.scala
##
@@ -186,6 +187,78 @@ class MajorCompactionIgnoreInMinorTest extends QueryTest 
with BeforeAndAfterAll
 
   }
 
+  def generateData(numOrders: Int = 10): DataFrame = {
+import sqlContext.implicits._
+sqlContext.sparkContext.parallelize(1 to numOrders, 4)
+  .map { x => ("country" + x, x, "07/23/2015", "name" + x, "phonetype" + x,
+"serialname" + x, x + 1)
+  }.toDF("country", "ID", "date", "name", "phonetype", "serialname", 
"salary")
+  }
+
+  test("test skip segment whose data size exceed threshold in minor 
compaction") {

Review comment:
   Also test for both partition and non partition flow









[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

2020-11-26 Thread GitBox


ajantha-bhat commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r531366602



##
File path: 
integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/datacompaction/MajorCompactionIgnoreInMinorTest.scala
##
@@ -186,6 +187,78 @@ class MajorCompactionIgnoreInMinorTest extends QueryTest 
with BeforeAndAfterAll
 
   }
 
+  def generateData(numOrders: Int = 10): DataFrame = {
+import sqlContext.implicits._
+sqlContext.sparkContext.parallelize(1 to numOrders, 4)
+  .map { x => ("country" + x, x, "07/23/2015", "name" + x, "phonetype" + x,
+"serialname" + x, x + 1)
+  }.toDF("country", "ID", "date", "name", "phonetype", "serialname", 
"salary")
+  }
+
+  test("test skip segment whose data size exceed threshold in minor 
compaction") {

Review comment:
   We can simplify the test case (no need for huge rows): just insert 1 row 
4 times and run minor compaction; it should compact. Then set the table 
property to 1 MB, insert 1 row 4 times, and run minor compaction; it 
shouldn't compact.









[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

2020-11-26 Thread GitBox


ajantha-bhat commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r531365650



##
File path: 
core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
##
@@ -736,11 +738,22 @@ private CarbonCommonConstants() {
   @CarbonProperty(dynamicConfigurable = true)
   public static final String CARBON_MAJOR_COMPACTION_SIZE = 
"carbon.major.compaction.size";
 
+  /**
+   * Size of Minor Compaction in MBs
+   */
+  @CarbonProperty(dynamicConfigurable = true)
+  public static final String CARBON_MINOR_COMPACTION_SIZE = 
"carbon.minor.compaction.size";
+
   /**
* By default size of major compaction in MBs.
*/
   public static final String DEFAULT_CARBON_MAJOR_COMPACTION_SIZE = "1024";
 
+  /**
+   * By default size of minor compaction in MBs.
+   */
+  public static final String DEFAULT_CARBON_MINOR_COMPACTION_SIZE = "1048576";

Review comment:
   Don't keep a default value; I have seen many users with segments larger 
than 1 TB, so for them auto compaction would not work by default. I suggest 
that if the table property is configured, then consider segment size for 
minor compaction; otherwise keep the base behavior of considering all 
segments based on segment count. We can also support changing the table 
property via ALTER TABLE. 
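   The suggested fallback can be sketched as follows. This is a standalone illustration with made-up names; in the actual code the count-based path is `identifySegmentsToBeMergedBasedOnSegCount` as quoted above.

```java
import java.util.List;
import java.util.stream.Collectors;

public class MinorCompactionSelection {
    // Sketch of the suggestion: when a size threshold is configured via the
    // table property, drop oversized segments; when it is not configured
    // (null), keep every segment and let the existing count-based selection
    // decide downstream.
    static List<Long> selectSegments(List<Long> segmentSizesMb, Long thresholdMbOrNull) {
        if (thresholdMbOrNull == null) {
            return segmentSizesMb; // base behavior: size is not considered
        }
        long threshold = thresholdMbOrNull;
        return segmentSizesMb.stream()
            .filter(size -> size <= threshold)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(selectSegments(List.of(2L, 1L, 1L), null)); // [2, 1, 1]
        System.out.println(selectSegments(List.of(2L, 1L, 1L), 1L));   // [1, 1]
    }
}
```

   With no configured threshold nothing changes for existing users, which is the compatibility point being made here.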









[GitHub] [carbondata] marchpure commented on pull request #4028: [WIP] Fix hivetest random failure

2020-11-26 Thread GitBox


marchpure commented on pull request #4028:
URL: https://github.com/apache/carbondata/pull/4028#issuecomment-734627533


   retest this please







[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

2020-11-26 Thread GitBox


ajantha-bhat commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r531365161



##
File path: 
core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
##
@@ -736,11 +738,22 @@ private CarbonCommonConstants() {
   @CarbonProperty(dynamicConfigurable = true)
   public static final String CARBON_MAJOR_COMPACTION_SIZE = 
"carbon.major.compaction.size";
 
+  /**
+   * Size of Minor Compaction in MBs
+   */
+  @CarbonProperty(dynamicConfigurable = true)

Review comment:
   Isn't the table property alone enough? Why is a carbon property needed as 
well?









[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #4022: [CARBONDATA-4056] Added global sort for data files merge operation in SI segments.

2020-11-26 Thread GitBox


ajantha-bhat commented on a change in pull request #4022:
URL: https://github.com/apache/carbondata/pull/4022#discussion_r531364008



##
File path: 
integration/spark/src/main/scala/org/apache/spark/sql/secondaryindex/util/SecondaryIndexUtil.scala
##
@@ -617,4 +637,157 @@ object SecondaryIndexUtil {
 identifiedSegments
   }
 
+  /**
+   * This method deletes the old carbondata files.
+   */
+  private def deleteOldCarbonDataFiles(factTimeStamp: Long,
+  validSegments: util.List[Segment],
+  indexCarbonTable: CarbonTable): Unit = {
+validSegments.asScala.foreach { segment =>
+  val segmentPath = 
CarbonTablePath.getSegmentPath(indexCarbonTable.getTablePath,
+segment.getSegmentNo)
+  val dataFiles = FileFactory.getCarbonFile(segmentPath).listFiles(new 
CarbonFileFilter {
+override def accept(file: CarbonFile): Boolean = {
+  file.getName.endsWith(CarbonTablePath.CARBON_DATA_EXT)
+}})
+  dataFiles.foreach(dataFile =>
+  if 
(DataFileUtil.getTimeStampFromFileName(dataFile.getAbsolutePath).toLong < 
factTimeStamp) {
+dataFile.delete()
+  })
+}
+  }
+
+  def mergeSISegmentDataFiles(sparkSession: SparkSession,
+  carbonLoadModel: CarbonLoadModel,
+  carbonMergerMapping: CarbonMergerMapping): Array[((String, Boolean), 
String)] = {
+val validSegments = carbonMergerMapping.validSegments.toList
+val indexCarbonTable = 
carbonLoadModel.getCarbonDataLoadSchema.getCarbonTable
+val absoluteTableIdentifier = indexCarbonTable.getAbsoluteTableIdentifier
+val jobConf: JobConf = new JobConf(FileFactory.getConfiguration)
+SparkHadoopUtil.get.addCredentials(jobConf)
+val job: Job = new Job(jobConf)
+val format = 
CarbonInputFormatUtil.createCarbonInputFormat(absoluteTableIdentifier, job)
+CarbonInputFormat.setTableInfo(job.getConfiguration, 
indexCarbonTable.getTableInfo)
+val proj = indexCarbonTable.getCreateOrderColumn
+  .asScala
+  .map(_.getColName)
+  
.filterNot(_.equalsIgnoreCase(CarbonCommonConstants.POSITION_REFERENCE)).toSet
+var mergeStatus = ArrayBuffer[((String, Boolean), String)]()
+val mergeSize = getTableBlockSizeInMb(indexCarbonTable)(sparkSession) * 
1024 * 1024
+val header = 
indexCarbonTable.getCreateOrderColumn.asScala.map(_.getColName).toArray
+val outputModel = getLoadModelForGlobalSort(sparkSession, indexCarbonTable)
+CarbonIndexUtil.initializeSILoadModel(outputModel, header)
+outputModel.setFactTimeStamp(carbonLoadModel.getFactTimeStamp)
+val segmentMetaDataAccumulator = sparkSession.sqlContext
+  .sparkContext
+  .collectionAccumulator[Map[String, SegmentMetaDataInfo]]
+validSegments.foreach { segment =>

Review comment:
   This can be a Spark job over multiple segments; handling them sequentially 
is bad for performance.
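   As a rough illustration of the point: independent per-segment work can be fanned out instead of looped over. In the PR this would be a Spark job; here a parallel stream merely shows the shape, and the segment names and `mergeOne` placeholder are made up.

```java
import java.util.List;
import java.util.stream.Collectors;

public class ParallelSegmentMerge {
    // Fan out the per-segment merge instead of a sequential foreach.
    // collect() preserves encounter order even for a parallel stream.
    static List<String> mergeAll(List<String> segmentNos) {
        return segmentNos.parallelStream()
            .map(ParallelSegmentMerge::mergeOne)
            .collect(Collectors.toList());
    }

    static String mergeOne(String segmentNo) {
        // Placeholder for the real data-file merge of one SI segment.
        return "merged-" + segmentNo;
    }

    public static void main(String[] args) {
        System.out.println(mergeAll(List.of("0", "1", "2"))); // [merged-0, merged-1, merged-2]
    }
}
```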









[GitHub] [carbondata] Zhangshunyu opened a new pull request #4029: refact carbon util

2020-11-26 Thread GitBox


Zhangshunyu opened a new pull request #4029:
URL: https://github.com/apache/carbondata/pull/4029


### Why is this PR needed?
Currently, we have several Carbon{$FUNCTION_NAME}Util classes as well as 
CarbonUtil/CarbonUtils, and CarbonUtil contains a mix of unrelated functions; 
we should clean up the code.

### What changes were proposed in this PR?
   Refactor the code to clean it up.
   
### Does this PR introduce any user interface change?
- No
   
### Is any new testcase added?
- No
   
   
   







[GitHub] [carbondata] ajantha-bhat commented on pull request #4022: [CARBONDATA-4056] Added global sort for data files merge operation in SI segments.

2020-11-26 Thread GitBox


ajantha-bhat commented on pull request #4022:
URL: https://github.com/apache/carbondata/pull/4022#issuecomment-734625127


   retest this please







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4028: [WIP] Fix hivetest random failure

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4028:
URL: https://github.com/apache/carbondata/pull/4028#issuecomment-734619484


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3182/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4028: [WIP] Fix hivetest random failure

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4028:
URL: https://github.com/apache/carbondata/pull/4028#issuecomment-734612095


   Build Success with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4937/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4025: [WIP] Make TableStatus/UpdateTableStatus/SegmentFile Smaller

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4025:
URL: https://github.com/apache/carbondata/pull/4025#issuecomment-734528399


   Build Failed  with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3181/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4025: [WIP] Make TableStatus/UpdateTableStatus/SegmentFile Smaller

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4025:
URL: https://github.com/apache/carbondata/pull/4025#issuecomment-734526668


   Build Failed  with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4935/
   







[GitHub] [carbondata] marchpure commented on pull request #4028: [WIP] Fix hivetest random failure

2020-11-26 Thread GitBox


marchpure commented on pull request #4028:
URL: https://github.com/apache/carbondata/pull/4028#issuecomment-734525504


   retest this please







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4026: [WIP] blockid code clean

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4026:
URL: https://github.com/apache/carbondata/pull/4026#issuecomment-734523383


   Build Failed  with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3180/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4026: [WIP] blockid code clean

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4026:
URL: https://github.com/apache/carbondata/pull/4026#issuecomment-734523043


   Build Failed  with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4934/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4028: [WIP] Fix hivetest random failure

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4028:
URL: https://github.com/apache/carbondata/pull/4028#issuecomment-734522833


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3179/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4028: [WIP] Fix hivetest random failure

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4028:
URL: https://github.com/apache/carbondata/pull/4028#issuecomment-734521243


   Build Success with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4933/
   







[GitHub] [carbondata] marchpure commented on pull request #4028: [WIP] Fix hivetest random failure

2020-11-26 Thread GitBox


marchpure commented on pull request #4028:
URL: https://github.com/apache/carbondata/pull/4028#issuecomment-734504948


   retest this please







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4028: [WIP] Fix hivetest random failure

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4028:
URL: https://github.com/apache/carbondata/pull/4028#issuecomment-734446125


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3178/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4028: [WIP] Fix hivetest random failure

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4028:
URL: https://github.com/apache/carbondata/pull/4028#issuecomment-73453


   Build Success with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4932/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4026: [WIP] blockid code clean

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4026:
URL: https://github.com/apache/carbondata/pull/4026#issuecomment-734440008


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3177/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4026: [WIP] blockid code clean

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4026:
URL: https://github.com/apache/carbondata/pull/4026#issuecomment-734439716


   Build Success with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4931/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4025: [WIP] Make TableStatus/UpdateTableStatus/SegmentFile Smaller

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4025:
URL: https://github.com/apache/carbondata/pull/4025#issuecomment-734431498











[GitHub] [carbondata] marchpure commented on pull request #4028: [WIP] Fix hivetest random failure

2020-11-26 Thread GitBox


marchpure commented on pull request #4028:
URL: https://github.com/apache/carbondata/pull/4028#issuecomment-734412704


   retest this please







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4028: [WIP] Fix hivetest random failure

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4028:
URL: https://github.com/apache/carbondata/pull/4028#issuecomment-734410043


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3174/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4028: [WIP] Fix hivetest random failure

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4028:
URL: https://github.com/apache/carbondata/pull/4028#issuecomment-734409813


   Build Success with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4928/
   







[GitHub] [carbondata] marchpure commented on pull request #4025: [WIP] Make TableStatus/UpdateTableStatus/SegmentFile Smaller

2020-11-26 Thread GitBox


marchpure commented on pull request #4025:
URL: https://github.com/apache/carbondata/pull/4025#issuecomment-734383891


   retest this please







[GitHub] [carbondata] marchpure commented on pull request #4028: [WIP] Fix hivetest random failure

2020-11-26 Thread GitBox


marchpure commented on pull request #4028:
URL: https://github.com/apache/carbondata/pull/4028#issuecomment-734359374


   retest this please







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4027: [WIP]added compression and range column based FT for SI

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4027:
URL: https://github.com/apache/carbondata/pull/4027#issuecomment-734348605


   Build Failed  with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4926/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4027: [WIP]added compression and range column based FT for SI

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4027:
URL: https://github.com/apache/carbondata/pull/4027#issuecomment-734345215


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3172/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4012: [CARBONDATA-4051] Geo spatial index algorithm improvement and UDFs enhancement

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4012:
URL: https://github.com/apache/carbondata/pull/4012#issuecomment-734341612


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3171/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4025: [WIP] Make TableStatus/UpdateTableStatus/SegmentFile Smaller

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4025:
URL: https://github.com/apache/carbondata/pull/4025#issuecomment-734339926


   Build Failed  with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4927/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4025: [WIP] Make TableStatus/UpdateTableStatus/SegmentFile Smaller

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4025:
URL: https://github.com/apache/carbondata/pull/4025#issuecomment-734339351


   Build Failed  with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3173/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4012: [CARBONDATA-4051] Geo spatial index algorithm improvement and UDFs enhancement

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4012:
URL: https://github.com/apache/carbondata/pull/4012#issuecomment-734337379


   Build Success with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4925/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4028: [WIP] Fix hivetest random failure

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4028:
URL: https://github.com/apache/carbondata/pull/4028#issuecomment-734328797


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3170/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4028: [WIP] Fix hivetest random failure

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4028:
URL: https://github.com/apache/carbondata/pull/4028#issuecomment-734323366


   Build Success with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4924/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4005: [CARBONDATA-3978] Trash Folder support in carbondata

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4005:
URL: https://github.com/apache/carbondata/pull/4005#issuecomment-734312587


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3168/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4005: [CARBONDATA-3978] Trash Folder support in carbondata

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4005:
URL: https://github.com/apache/carbondata/pull/4005#issuecomment-734308995


   Build Success with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4923/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4026: [WIP] blockid code clean

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4026:
URL: https://github.com/apache/carbondata/pull/4026#issuecomment-734304033


   Build Failed  with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3165/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4026: [WIP] blockid code clean

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4026:
URL: https://github.com/apache/carbondata/pull/4026#issuecomment-734301494


   Build Failed  with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4920/
   







[GitHub] [carbondata] marchpure commented on pull request #4012: [CARBONDATA-4051] Geo spatial index algorithm improvement and UDFs enhancement

2020-11-26 Thread GitBox


marchpure commented on pull request #4012:
URL: https://github.com/apache/carbondata/pull/4012#issuecomment-734285029


   retest this please







[GitHub] [carbondata] marchpure opened a new pull request #4028: [WIP] Fix hivetest random failure

2020-11-26 Thread GitBox


marchpure opened a new pull request #4028:
URL: https://github.com/apache/carbondata/pull/4028


### Why is this PR needed?


### What changes were proposed in this PR?
   
   
### Does this PR introduce any user interface change?
- No
- Yes. (please explain the change and update document)
   
### Is any new testcase added?
- No
- Yes
   
   
   







[GitHub] [carbondata] nihal0107 opened a new pull request #4027: [WIP]added compression testcase for SI

2020-11-26 Thread GitBox


nihal0107 opened a new pull request #4027:
URL: https://github.com/apache/carbondata/pull/4027


### Why is this PR needed?


### What changes were proposed in this PR?
   
   
### Does this PR introduce any user interface change?
- No
- Yes. (please explain the change and update document)
   
### Is any new testcase added?
- No
- Yes
   
   
   







[GitHub] [carbondata] marchpure opened a new pull request #4026: [WIP] blockid code clean

2020-11-26 Thread GitBox


marchpure opened a new pull request #4026:
URL: https://github.com/apache/carbondata/pull/4026


### Why is this PR needed?


### What changes were proposed in this PR?
   
   
### Does this PR introduce any user interface change?
- No
- Yes. (please explain the change and update document)
   
### Is any new testcase added?
- No
- Yes
   
   
   







[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #4005: [CARBONDATA-3978] Trash Folder support in carbondata

2020-11-26 Thread GitBox


vikramahuja1001 commented on a change in pull request #4005:
URL: https://github.com/apache/carbondata/pull/4005#discussion_r530973967



##
File path: 
integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/cleanfiles/TestCleanFileCommand.scala
##
@@ -0,0 +1,372 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.spark.testsuite.cleanfiles
+
+import java.io.{File, PrintWriter}
+
+import scala.io.Source
+
+import org.apache.spark.sql.{CarbonEnv, Row}
+import org.apache.spark.sql.test.util.QueryTest
+import org.scalatest.BeforeAndAfterAll
+
+import org.apache.carbondata.core.constants.CarbonCommonConstants
+import org.apache.carbondata.core.datastore.impl.FileFactory
+import org.apache.carbondata.core.util.CarbonProperties
+import org.apache.carbondata.core.util.path.CarbonTablePath
+
+class TestCleanFileCommand extends QueryTest with BeforeAndAfterAll {
+
+  var count = 0
+
+  test("clean up table and test trash folder with IN PROGRESS segments") {
+// do not send the segment folders to trash
+createTable()
+loadData()
+val path = CarbonEnv.getCarbonTable(Some("default"), 
"cleantest")(sqlContext.sparkSession)
+  .getTablePath
+val trashFolderPath = path + CarbonCommonConstants.FILE_SEPARATOR + 
CarbonTablePath.TRASH_DIR
+editTableStatusFile(path)
+assert(!FileFactory.isFileExist(trashFolderPath))
+
+val segmentNumber1 = sql(s"""show segments for table cleantest""").count()
+assert(segmentNumber1 == 4)
+sql(s"CLEAN FILES FOR TABLE cleantest").show
+val segmentNumber2 = sql(s"""show segments for table cleantest""").count()
+assert(0 == segmentNumber2)
+assert(!FileFactory.isFileExist(trashFolderPath))
+count = 0

Review comment:
   yes, removed.









[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #4005: [CARBONDATA-3978] Trash Folder support in carbondata

2020-11-26 Thread GitBox


vikramahuja1001 commented on a change in pull request #4005:
URL: https://github.com/apache/carbondata/pull/4005#discussion_r530973423



##
File path: 
integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/cleanfiles/TestCleanFilesCommandPartitionTable.scala
##
@@ -0,0 +1,412 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.spark.testsuite.cleanfiles
+
+import java.io.{File, PrintWriter}
+
+import scala.io.Source
+
+import org.apache.spark.sql.{CarbonEnv, Row}
+import org.apache.spark.sql.test.util.QueryTest
+import org.scalatest.BeforeAndAfterAll
+
+import org.apache.carbondata.core.constants.CarbonCommonConstants
+import org.apache.carbondata.core.datastore.impl.FileFactory
+import org.apache.carbondata.core.util.CarbonProperties
+import org.apache.carbondata.core.util.path.CarbonTablePath
+
+class TestCleanFilesCommandPartitionTable extends QueryTest with 
BeforeAndAfterAll {
+
+  var count = 0
+
+  test("clean up table and test trash folder with IN PROGRESS segments") {
+// do not send the segment folders to trash
+createParitionTable()
+loadData()
+val path = CarbonEnv.getCarbonTable(Some("default"), 
"cleantest")(sqlContext.sparkSession)
+  .getTablePath
+val trashFolderPath = path + CarbonCommonConstants.FILE_SEPARATOR + 
CarbonTablePath.TRASH_DIR
+editTableStatusFile(path)
+assert(!FileFactory.isFileExist(trashFolderPath))
+val segmentNumber1 = sql(s"""show segments for table cleantest""").count()
+assert(segmentNumber1 == 4)
+sql(s"CLEAN FILES FOR TABLE cleantest").show
+val segmentNumber2 = sql(s"""show segments for table cleantest""").count()
+assert(0 == segmentNumber2)
+assert(!FileFactory.isFileExist(trashFolderPath))
+count = 0
+var list = getFileCountInTrashFolder(trashFolderPath)
+// no carbondata file is added to the trash
+assert(list == 0)
+sql("""DROP TABLE IF EXISTS CLEANTEST""")
+  }
+
+  test("clean up table and test trash folder with Marked For Delete segments") 
{
+// do not send MFD folders to trash
+createParitionTable()
+loadData()
+val path = CarbonEnv.getCarbonTable(Some("default"), 
"cleantest")(sqlContext.sparkSession)
+  .getTablePath
+val trashFolderPath = path + CarbonCommonConstants.FILE_SEPARATOR + 
CarbonTablePath.TRASH_DIR
+assert(!FileFactory.isFileExist(trashFolderPath))
+sql(s"""Delete from table cleantest where segment.id in(1)""")
+val segmentNumber1 = sql(s"""show segments for table cleantest""").count()
+sql(s"CLEAN FILES FOR TABLE cleantest").show
+val segmentNumber2 = sql(s"""show segments for table cleantest""").count()
+assert(segmentNumber1 == segmentNumber2 + 1)
+assert(!FileFactory.isFileExist(trashFolderPath))
+count = 0
+var list = getFileCountInTrashFolder(trashFolderPath)
+// no carbondata file is added to the trash
+assert(list == 0)
+sql("""DROP TABLE IF EXISTS CLEANTEST""")
+  }
+
+  test("clean up table and test trash folder with compaction") {
+// do not send compacted folders to trash
+createParitionTable()
+loadData()
+sql(s"""ALTER TABLE CLEANTEST COMPACT "MINOR" """)
+
+val path = CarbonEnv.getCarbonTable(Some("default"), 
"cleantest")(sqlContext.sparkSession)
+  .getTablePath
+val trashFolderPath = path + CarbonCommonConstants.FILE_SEPARATOR + 
CarbonTablePath.TRASH_DIR
+assert(!FileFactory.isFileExist(trashFolderPath))
+
+val segmentNumber1 = sql(s"""show segments for table cleantest""").count()
+sql(s"CLEAN FILES FOR TABLE cleantest").show
+val segmentNumber2 = sql(s"""show segments for table cleantest""").count()
+assert(segmentNumber1 == segmentNumber2 + 4)
+assert(!FileFactory.isFileExist(trashFolderPath))
+count = 0
+val list = getFileCountInTrashFolder(trashFolderPath)
+// no carbondata file is added to the trash
+assert(list == 0)
+sql("""DROP TABLE IF EXISTS CLEANTEST""")
+  }
+
+
+
+  test("test trash folder with 2 segments with same segment number") {
+createParitionTable()
+sql(s"""INSERT INTO CLEANTEST SELECT 1, 2,"hello","abc)
+
+val path = 

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4025: [WIP] Make TableStatus/UpdateTableStatus/SegmentFile Smaller

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4025:
URL: https://github.com/apache/carbondata/pull/4025#issuecomment-734250541


   Build Failed  with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3164/
   







[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #4005: [CARBONDATA-3978] Trash Folder support in carbondata

2020-11-26 Thread GitBox


vikramahuja1001 commented on a change in pull request #4005:
URL: https://github.com/apache/carbondata/pull/4005#discussion_r530970382



##
File path: docs/clean-files.md
##
@@ -0,0 +1,56 @@
+
+
+
+## CLEAN FILES
+
+Clean files command is used to remove Compacted, Marked For Delete, stale In
+Progress, and partial segments (segments which are missing from the table
+status file but whose data is present) from the store.
+
+ Clean Files Command
+   ```
+   CLEAN FILES FOR TABLE TABLE_NAME
+   ```
+
+
+### TRASH FOLDER
+
+  Carbondata supports a Trash Folder which is used as a redundant folder where
+  all stale (segments whose entry is not in the tablestatus file) carbondata
+  segments are moved to during the clean files operation.
+  This trash folder is maintained inside the table path and is a hidden
+  folder (.Trash). The segments that are moved to the trash folder are
+  maintained under a timestamp subfolder (each clean files operation is
+  represented by a timestamp). This helps the user to list segments in the
+  trash folder by timestamp. By default, every timestamp sub-directory has an
+  expiration time of 7 days (since the timestamp it was created at), which can
+  be configured by the user using the following carbon property. The supported
+  values are between 0 and 365 (both included).
+   ```
+   carbon.trash.retention.days = "Number of days"
+   ```
+  Once a timestamp subdirectory has expired as per the configured retention
+  value, that subdirectory is deleted from the trash folder by the subsequent
+  clean files command.
+
+### FORCE DELETE TRASH
+The force option with clean files command deletes all the files and folders 
from the trash folder.
+
+  ```
+  CLEAN FILES FOR TABLE TABLE_NAME options('force'='true')
+  ```
+
+### DATA RECOVERY FROM THE TRASH FOLDER
+
+The segments can be recovered from the trash folder by creating a table from
+the desired segment location.

Review comment:
   done
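
For reference, the retention rule the quoted clean-files.md describes (each timestamp subfolder in .Trash expires `carbon.trash.retention.days` days after its creation timestamp, default 7) can be sketched as below. This is only an illustration under stated assumptions: `TrashRetention`, `isExpired`, and `DEFAULT_RETENTION_DAYS` are hypothetical names, not CarbonData's actual implementation.

```java
import java.util.concurrent.TimeUnit;

public class TrashRetention {
    // Mirrors the documented default of CARBON_TRASH_RETENTION_DAYS_DEFAULT = "7"
    static final int DEFAULT_RETENTION_DAYS = 7;

    /**
     * Returns true when a trash timestamp subdirectory has outlived the
     * configured retention and should be removed by the next clean files run.
     */
    static boolean isExpired(long subDirTimestampMillis, int retentionDays) {
        long ageMillis = System.currentTimeMillis() - subDirTimestampMillis;
        return ageMillis > TimeUnit.DAYS.toMillis(retentionDays);
    }

    public static void main(String[] args) {
        long eightDaysAgo = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(8);
        long oneDayAgo = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(1);
        System.out.println(isExpired(eightDaysAgo, DEFAULT_RETENTION_DAYS)); // true
        System.out.println(isExpired(oneDayAgo, DEFAULT_RETENTION_DAYS));    // false
    }
}
```

A clean files run would apply such a check to each timestamp subdirectory name under the .Trash folder before deleting it.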









[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #4005: [CARBONDATA-3978] Trash Folder support in carbondata

2020-11-26 Thread GitBox


vikramahuja1001 commented on a change in pull request #4005:
URL: https://github.com/apache/carbondata/pull/4005#discussion_r530969904



##
File path: 
core/src/main/java/org/apache/carbondata/core/util/CleanFilesUtil.java
##
@@ -0,0 +1,147 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.core.util;
+
+import java.io.IOException;
+import java.util.*;
+
+import org.apache.carbondata.common.logging.LogServiceFactory;
+import org.apache.carbondata.core.constants.CarbonCommonConstants;
+import org.apache.carbondata.core.datastore.filesystem.CarbonFile;
+import org.apache.carbondata.core.datastore.impl.FileFactory;
+import org.apache.carbondata.core.metadata.SegmentFileStore;
+import org.apache.carbondata.core.metadata.schema.table.CarbonTable;
+import org.apache.carbondata.core.statusmanager.LoadMetadataDetails;
+import org.apache.carbondata.core.statusmanager.SegmentStatus;
+import org.apache.carbondata.core.statusmanager.SegmentStatusManager;
+import org.apache.carbondata.core.util.path.CarbonTablePath;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.Logger;
+
+/**
+ * Maintains the clean files command in carbondata. This class has methods
+ * for the clean files operation.
+ */
+public class CleanFilesUtil {
+
+  private static final Logger LOGGER =
+  LogServiceFactory.getLogService(CleanFilesUtil.class.getName());
+
+  /**
+   * This method will clean all the stale segments for the given table. In this
+   * method, we first get the stale segments (segments whose entry is not in
+   * the table status file but which are present in the metadata folder),
+   * including the case when the table status file is deleted. To identify the
+   * stale segments we compare the segment files in the metadata folder with
+   * the table status file, if it exists. The identified stale segments are
+   * then copied to the trash folder and their .segment files are deleted from
+   * the metadata folder. We only compare with the tablestatus file here, not
+   * with the tablestatus history file.
+   */
+  public static void cleanStaleSegments(CarbonTable carbonTable)
+throws IOException {
+String metaDataLocation = carbonTable.getMetadataPath();
+long timeStampForTrashFolder = System.currentTimeMillis();
+String segmentFilesLocation =
+CarbonTablePath.getSegmentFilesLocation(carbonTable.getTablePath());
+CarbonFile[] segmentFilesList = 
FileFactory.getCarbonFile(segmentFilesLocation).listFiles();
+// there are no segments present in the Metadata folder. Can return here
+if (segmentFilesList.length == 0) {
+  return;
+}
+LoadMetadataDetails[] details = 
SegmentStatusManager.readLoadMetadata(metaDataLocation);
+List<String> staleSegments = getStaleSegments(details, segmentFilesList);

Review comment:
   Changed. separated flow for normal table and partition table. In case of 
normal table, getting the segment path from the .segment file location map and 
moving complete segment. In case of partition table flow, moving it file by 
file.
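
The stale-segment identification described in the quoted javadoc (a segment file in the Metadata folder is stale when the table status file does not reference it) can be sketched roughly as follows. The names `StaleSegmentFinder` and `findStaleSegments` are illustrative assumptions, not CarbonData's API.

```java
import java.util.*;

public class StaleSegmentFinder {
    /**
     * Returns the segment file names present in the Metadata folder but
     * missing from the table status file, i.e. the candidates for the trash.
     */
    static List<String> findStaleSegments(Collection<String> segmentFilesInMetadata,
                                          Set<String> segmentsInTableStatus) {
        List<String> stale = new ArrayList<>();
        for (String segmentFile : segmentFilesInMetadata) {
            if (!segmentsInTableStatus.contains(segmentFile)) {
                stale.add(segmentFile);
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        List<String> metadata = Arrays.asList("0.segment", "1.segment", "2.segment");
        Set<String> tableStatus = new HashSet<>(Arrays.asList("0.segment", "2.segment"));
        System.out.println(findStaleSegments(metadata, tableStatus)); // [1.segment]
    }
}
```

Per the review comment, the flow then diverges: for a normal table the whole segment folder is moved to the trash, while for a partition table files are moved one by one.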









[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #4005: [CARBONDATA-3978] Trash Folder support in carbondata

2020-11-26 Thread GitBox


vikramahuja1001 commented on a change in pull request #4005:
URL: https://github.com/apache/carbondata/pull/4005#discussion_r530964261



##
File path: 
core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
##
@@ -1414,6 +1414,23 @@ private CarbonCommonConstants() {
 
   public static final String BITSET_PIPE_LINE_DEFAULT = "true";
 
+  /**
+   * this is the user defined time(in days), timestamp subfolders in trash 
directory will take
+   * this value as retention time. They are deleted after this time.
+   */
+  @CarbonProperty
+  public static final String CARBON_TRASH_RETENTION_DAYS = 
"carbon.trash.retention.days";
+
+  /**
+   * Default retention time of a subdirectory in trash folder is 7 days.
+   */
+  public static final String CARBON_TRASH_RETENTION_DAYS_DEFAULT = "7";

Review comment:
   done









[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4025: [WIP] Make TableStatus/UpdateTableStatus/SegmentFile Smaller

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4025:
URL: https://github.com/apache/carbondata/pull/4025#issuecomment-734242575


   Build Success with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4919/
   







[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #4005: [CARBONDATA-3978] Trash Folder support in carbondata

2020-11-26 Thread GitBox


vikramahuja1001 commented on a change in pull request #4005:
URL: https://github.com/apache/carbondata/pull/4005#discussion_r530957279



##
File path: 
integration/spark/src/main/scala/org/apache/spark/sql/secondaryindex/events/CleanFilesPostEventListener.scala
##
@@ -54,6 +52,12 @@ class CleanFilesPostEventListener extends 
OperationEventListener with Logging {
 val indexTables = CarbonIndexUtil
   .getIndexCarbonTables(carbonTable, cleanFilesPostEvent.sparkSession)
 indexTables.foreach { indexTable =>
+  if (cleanFilesPostEvent.force) {
+TrashUtil.emptyTrash(indexTable.getTablePath)
+  } else {
+TrashUtil.deleteExpiredDataFromTrash(indexTable.getTablePath)
+  }
+  CleanFilesUtil.cleanStaleSegments(indexTable)

Review comment:
   removed trash code flow from CleanFilesPostEventListener

##
File path: 
integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/cleanfiles/TestCleanFileCommand.scala
##
@@ -0,0 +1,372 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.spark.testsuite.cleanfiles
+
+import java.io.{File, PrintWriter}
+
+import scala.io.Source
+
+import org.apache.spark.sql.{CarbonEnv, Row}
+import org.apache.spark.sql.test.util.QueryTest
+import org.scalatest.BeforeAndAfterAll
+
+import org.apache.carbondata.core.constants.CarbonCommonConstants
+import org.apache.carbondata.core.datastore.impl.FileFactory
+import org.apache.carbondata.core.util.CarbonProperties
+import org.apache.carbondata.core.util.path.CarbonTablePath
+
+class TestCleanFileCommand extends QueryTest with BeforeAndAfterAll {
+
+  var count = 0
+
+  test("clean up table and test trash folder with IN PROGRESS segments") {
+// do not send the segment folders to trash
+createTable()
+loadData()
+val path = CarbonEnv.getCarbonTable(Some("default"), 
"cleantest")(sqlContext.sparkSession)
+  .getTablePath
+val trashFolderPath = path + CarbonCommonConstants.FILE_SEPARATOR + 
CarbonTablePath.TRASH_DIR

Review comment:
   done









[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #4005: [CARBONDATA-3978] Trash Folder support in carbondata

2020-11-26 Thread GitBox


vikramahuja1001 commented on a change in pull request #4005:
URL: https://github.com/apache/carbondata/pull/4005#discussion_r530956951



##
File path: core/src/main/java/org/apache/carbondata/core/util/TrashUtil.java
##
@@ -0,0 +1,162 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.core.util;
+
+import java.io.DataInputStream;
+import java.io.DataOutputStream;
+import java.io.IOException;
+import java.util.List;
+
+import org.apache.carbondata.common.logging.LogServiceFactory;
+import org.apache.carbondata.core.constants.CarbonCommonConstants;
+import org.apache.carbondata.core.datastore.filesystem.CarbonFile;
+import org.apache.carbondata.core.datastore.impl.FileFactory;
+import org.apache.carbondata.core.util.path.CarbonTablePath;
+
+import org.apache.hadoop.io.IOUtils;
+import org.apache.log4j.Logger;
+
+/**
+ * Maintains the trash folder in carbondata. This class has methods to copy
+ * data to the trash and remove data from the trash.
+ */
+public final class TrashUtil {
+
+  private static final Logger LOGGER =
+  LogServiceFactory.getLogService(TrashUtil.class.getName());
+
+  /**
+   * Base method to copy the data to the trash folder.
+   *
+   * @param fromPath the path from which to copy the file
+   * @param toPath  the path where the file will be copied
+   * @return
+   */
+  private static void copyToTrashFolder(String fromPath, String toPath) throws 
IOException {
+DataOutputStream dataOutputStream = null;
+DataInputStream dataInputStream = null;
+try {
+  dataOutputStream = FileFactory.getDataOutputStream(toPath);
+  dataInputStream = FileFactory.getDataInputStream(fromPath);
+  IOUtils.copyBytes(dataInputStream, dataOutputStream, 
CarbonCommonConstants.BYTEBUFFER_SIZE);
+} catch (IOException exception) {
+  LOGGER.error("Unable to copy " + fromPath + " to the trash folder", 
exception);
+  throw exception;
+} finally {
+  CarbonUtil.closeStreams(dataInputStream, dataOutputStream);
+}
+  }
+
+  /**
+   * The below method copies a complete file to the trash folder.
+   *
+   * @param filePathToCopy the files which are to be moved to the trash folder
+   * @param trashFolderWithTimestamp timestamp, partition folder (if any) and
+   * segment number
+   * @return
+   */
+  public static void copyFileToTrashFolder(String filePathToCopy,
+  String trashFolderWithTimestamp) throws IOException {
+CarbonFile carbonFileToCopy = FileFactory.getCarbonFile(filePathToCopy);
+try {
+  if (carbonFileToCopy.exists()) {
+if (!FileFactory.isFileExist(trashFolderWithTimestamp)) {
+  FileFactory.mkdirs(trashFolderWithTimestamp);
+}
+if (!FileFactory.isFileExist(trashFolderWithTimestamp + 
CarbonCommonConstants
+.FILE_SEPARATOR + carbonFileToCopy.getName())) {
+  copyToTrashFolder(filePathToCopy, trashFolderWithTimestamp + 
CarbonCommonConstants
+  .FILE_SEPARATOR + carbonFileToCopy.getName());
+}
+  }
+} catch (IOException e) {
+  LOGGER.error("Error while creating trash folder or copying data to the 
trash folder", e);
+  throw e;
+}
+  }
+
+  /**
+   * The below method copies the complete segment folder to the trash folder.
+   * Here, the data files in the segment are listed and copied one by one to
+   * the trash folder.
+   *
+   * @param segmentPath the folder which is to be moved to the trash folder
+   * @param trashFolderWithTimestamp trash folder path with complete timestamp
+   *  and segment number
+   */
+  public static void copySegmentToTrash(CarbonFile segmentPath,
+      String trashFolderWithTimestamp) throws IOException {
+    try {
+      List<CarbonFile> dataFiles = FileFactory.getFolderList(segmentPath.getAbsolutePath());
+      for (CarbonFile carbonFile : dataFiles) {
+        copyFileToTrashFolder(carbonFile.getAbsolutePath(), trashFolderWithTimestamp);
+      }
+      LOGGER.info("Segment: " + segmentPath.getAbsolutePath() + " has been copied to"
+          + " the trash folder successfully");
+    } catch (IOException e) {
+      LOGGER.error("Error while getting folder list for the segment", e);
+      throw e;
+    }
+  }
+
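The copy flow above (create the trash folder if missing, skip files already present, otherwise stream-copy) can be sketched outside CarbonData's FileFactory abstraction with plain java.nio. This is only an illustrative sketch: TrashCopySketch, its paths, and the timestamped trash layout are assumptions, not CarbonData API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Minimal sketch of the trash-copy flow, assuming a timestamped trash layout.
public class TrashCopySketch {

  // Copies 'source' into 'trashDir', creating the folder if needed and
  // skipping the copy when a file of the same name is already in the trash.
  public static Path copyToTrash(Path source, Path trashDir) throws IOException {
    Files.createDirectories(trashDir);          // FileFactory.mkdirs equivalent
    Path target = trashDir.resolve(source.getFileName());
    if (!Files.exists(target)) {                // skip if already trashed
      Files.copy(source, target);               // stream copy
    }
    return target;
  }

  public static void main(String[] args) throws IOException {
    Path segment = Files.createTempDirectory("segment");
    Path dataFile = Files.write(segment.resolve("part-0.carbondata"), "rows".getBytes());
    // hypothetical layout: <tablePath>/.Trash/<timestamp>/Segment_0
    Path trash = Files.createTempDirectory("table").resolve(".Trash")
        .resolve(String.valueOf(System.currentTimeMillis())).resolve("Segment_0");
    Path trashed = copyToTrash(dataFile, trash);
    if (!new String(Files.readAllBytes(trashed)).equals("rows")) {
      throw new AssertionError("trashed copy does not match source");
    }
    System.out.println("copied to " + trashed);
  }
}
```

Copying a file twice is deliberately a no-op here, mirroring the isFileExist check in copyFileToTrashFolder above.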
+

[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #4005: [CARBONDATA-3978] Trash Folder support in carbondata

2020-11-26 Thread GitBox


vikramahuja1001 commented on a change in pull request #4005:
URL: https://github.com/apache/carbondata/pull/4005#discussion_r530955940



##
File path: core/src/main/java/org/apache/carbondata/core/util/TrashUtil.java
##
@@ -0,0 +1,162 @@
+  /**
+   * The below method copies the complete segment folder to the trash folder.
+   * Here, the data files in the segment are listed and copied one by one to
+   * the trash folder.
+   *
+   * @param segmentPath the folder which is to be moved to the trash folder
+   * @param trashFolderWithTimestamp trash folder path with complete timestamp
+   *  and segment number
+   */
+  public static void copySegmentToTrash(CarbonFile segmentPath,

Review comment:
   It is now being used for the normal table clean stale segments flow.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #4005: [CARBONDATA-3978] Trash Folder support in carbondata

2020-11-26 Thread GitBox


vikramahuja1001 commented on a change in pull request #4005:
URL: https://github.com/apache/carbondata/pull/4005#discussion_r530955721



##
File path: 
core/src/main/java/org/apache/carbondata/core/util/CleanFilesUtil.java
##
@@ -0,0 +1,147 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.core.util;
+
+import java.io.IOException;
+import java.util.*;
+
+import org.apache.carbondata.common.logging.LogServiceFactory;
+import org.apache.carbondata.core.constants.CarbonCommonConstants;
+import org.apache.carbondata.core.datastore.filesystem.CarbonFile;
+import org.apache.carbondata.core.datastore.impl.FileFactory;
+import org.apache.carbondata.core.metadata.SegmentFileStore;
+import org.apache.carbondata.core.metadata.schema.table.CarbonTable;
+import org.apache.carbondata.core.statusmanager.LoadMetadataDetails;
+import org.apache.carbondata.core.statusmanager.SegmentStatus;
+import org.apache.carbondata.core.statusmanager.SegmentStatusManager;
+import org.apache.carbondata.core.util.path.CarbonTablePath;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.Logger;
+
+/**
+ * Maintains the clean files command in carbondata. This class has methods for the
+ * clean files operation.
+ */
+public class CleanFilesUtil {
+
+  private static final Logger LOGGER =
+      LogServiceFactory.getLogService(CleanFilesUtil.class.getName());
+
+  /**
+   * This method will clean all the stale segments for the given table. In this method, we
+   * first get the stale segments (segments whose entry is not in the table status, but which
+   * are present in the metadata folder), or the case when the table status is deleted. To
+   * identify the stale segments we compare the segment files in the metadata folder with the
+   * table status file, if it exists. The identified stale segments are then copied to the
+   * trash folder and their .segment files are also deleted from the metadata folder. We only
+   * compare with the tablestatus file here, not with the tablestatus history file.
+   */
+  public static void cleanStaleSegments(CarbonTable carbonTable)
+      throws IOException {
+    String metaDataLocation = carbonTable.getMetadataPath();
+    long timeStampForTrashFolder = System.currentTimeMillis();
+    String segmentFilesLocation =
+        CarbonTablePath.getSegmentFilesLocation(carbonTable.getTablePath());
+    CarbonFile[] segmentFilesList =
+        FileFactory.getCarbonFile(segmentFilesLocation).listFiles();
+    // there are no segments present in the Metadata folder. Can return here
+    if (segmentFilesList.length == 0) {
+      return;
+    }
+    LoadMetadataDetails[] details =
+        SegmentStatusManager.readLoadMetadata(metaDataLocation);
+    List<String> staleSegments = getStaleSegments(details, segmentFilesList);
+
+    if (staleSegments.size() > 0) {
+      for (String staleSegment : staleSegments) {
+        String segmentNumber =
+            staleSegment.split(CarbonCommonConstants.UNDERSCORE)[0];
+        // for each segment we get the index file first, then we get the carbondata
+        // file. Move both of those to the trash folder
+        List<CarbonFile> filesToDelete = new ArrayList<>();
+        SegmentFileStore fileStore = new SegmentFileStore(carbonTable.getTablePath(),
+            staleSegment);
+        List<String> indexOrMergeFiles = fileStore.readIndexFiles(SegmentStatus.SUCCESS, true,
+            FileFactory.getConfiguration());
+        for (String file : indexOrMergeFiles) {
+          // copy the index or merge file to the trash folder
+          TrashUtil.copyFileToTrashFolder(file, CarbonTablePath.getTrashFolderPath(carbonTable
+              .getTablePath()) + CarbonCommonConstants.FILE_SEPARATOR + timeStampForTrashFolder +
+              CarbonCommonConstants.FILE_SEPARATOR + CarbonTablePath.SEGMENT_PREFIX +
+              segmentNumber);
+          filesToDelete.add(FileFactory.getCarbonFile(file));
+        }
+        // get carbondata files from here
+        Map<String, List<String>> indexFilesMap = fileStore.getIndexFilesMap();
+        for (Map.Entry<String, List<String>> entry : indexFilesMap.entrySet()) {
+          for (String file : entry.getValue()) {
+            // copy the carbondata file to trash
+            TrashUtil.copyFileToTrashFolder(file,
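The stale-segment comparison described in the javadoc above (segment files present in the Metadata folder but missing from table status) can be illustrated with a minimal sketch, assuming segment file names of the form `<segmentNo>_<timestamp>.segment`; StaleSegmentSketch is an illustrative name, not CarbonData's actual getStaleSegments helper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of stale-segment detection: a segment file such as
// "2_1605000000000.segment" is stale when its segment number ("2")
// has no entry in the table status.
public class StaleSegmentSketch {

  public static List<String> getStaleSegments(Set<String> segmentNosInTableStatus,
      List<String> segmentFilesInMetadata) {
    List<String> stale = new ArrayList<>();
    for (String segmentFile : segmentFilesInMetadata) {
      // segment file names are "<segmentNo>_<timestamp>.segment"
      String segmentNo = segmentFile.split("_")[0];
      if (!segmentNosInTableStatus.contains(segmentNo)) {
        stale.add(segmentFile);
      }
    }
    return stale;
  }

  public static void main(String[] args) {
    Set<String> tableStatus = new HashSet<>(Arrays.asList("0", "1"));
    List<String> metadata = Arrays.asList(
        "0_1605000000000.segment", "1_1605000001000.segment", "2_1605000002000.segment");
    // segment 2 has no table status entry, so it is stale
    System.out.println(getStaleSegments(tableStatus, metadata)); // [2_1605000002000.segment]
  }
}
```

When the table status file itself is missing, the status set is empty and every segment file in the metadata folder is reported as stale, which matches the "table status is deleted" case in the javadoc.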

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4012: [CARBONDATA-4051] Geo spatial index algorithm improvement and UDFs enhancement

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4012:
URL: https://github.com/apache/carbondata/pull/4012#issuecomment-734227969


   Build Success with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4917/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4012: [CARBONDATA-4051] Geo spatial index algorithm improvement and UDFs enhancement

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4012:
URL: https://github.com/apache/carbondata/pull/4012#issuecomment-734190805


   Build Failed  with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3161/
   







[GitHub] [carbondata] marchpure commented on pull request #4025: [WIP] Make TableStatus/UpdateTableStatus/SegmentFile Smaller

2020-11-26 Thread GitBox


marchpure commented on pull request #4025:
URL: https://github.com/apache/carbondata/pull/4025#issuecomment-734189785


   retest this please







[GitHub] [carbondata] ajantha-bhat commented on pull request #4012: [CARBONDATA-4051] Geo spatial index algorithm improvement and UDFs enhancement

2020-11-26 Thread GitBox


ajantha-bhat commented on pull request #4012:
URL: https://github.com/apache/carbondata/pull/4012#issuecomment-734184258


   @shenjiayu17 : And how is the performance after changing the algorithm?







[GitHub] [carbondata] ajantha-bhat commented on pull request #4012: [CARBONDATA-4051] Geo spatial index algorithm improvement and UDFs enhancement

2020-11-26 Thread GitBox


ajantha-bhat commented on pull request #4012:
URL: https://github.com/apache/carbondata/pull/4012#issuecomment-734182769


   @shenjiayu17 : please update `/docs/spatial-index-guide.md` about what new 
UDF is supported for query and what functionality changed







[jira] [Updated] (CARBONDATA-4051) Geo spatial index algorithm improvement and UDFs enhancement

2020-11-26 Thread Jiayu Shen (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiayu Shen updated CARBONDATA-4051:
---
Description: 
The requirement is from SEQ, related algorithms are provided by the Discovery Team.

1. Replace geohash encoded algorithm, and reduce required properties of CREATE 
TABLE. For example,
{code:java}
CREATE TABLE geoTable(
 timevalue BIGINT,
 longitude LONG,
 latitude LONG) COMMENT "This is a GeoTable"
 STORED AS carbondata
 TBLPROPERTIES ($customProperties 'SPATIAL_INDEX'='mygeohash',
 'SPATIAL_INDEX.mygeohash.type'='geohash',
 'SPATIAL_INDEX.mygeohash.sourcecolumns'='longitude, latitude',
 'SPATIAL_INDEX.mygeohash.originLatitude'='39.832277',
 'SPATIAL_INDEX.mygeohash.gridSize'='50',
 'SPATIAL_INDEX.mygeohash.conversionRatio'='100'){code}
2. Add geo query UDFs

query filter UDFs :
 * _*InPolygonList (List polygonList, OperationType opType)*_
 * _*InPolylineList (List polylineList, Float bufferInMeter)*_
 * _*InPolygonRangeList (List RangeList, **OperationType opType**)*_

*operation only support :*
 * *"OR", means calculating union of two polygons*
 * *"AND", means calculating intersection of two polygons*

geo util UDFs :
 * _*GeoIdToGridXy(Long geoId) :* *Pair*_
 * _*LatLngToGeoId(**Long* *latitude, Long* *longitude) : Long*_
 * _*GeoIdToLatLng(Long geoId) : Pair*_
 * _*ToUpperLayerGeoId(Long geoId) : Long*_
 * _*ToRangeList (String polygon) : List*_
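The grid geoId idea behind LatLngToGeoId, GeoIdToLatLng and ToUpperLayerGeoId can be illustrated with a generic Z-order (bit-interleaving) sketch. This is only an illustration of the concept, not CarbonData's actual algorithm, which additionally uses originLatitude, gridSize and conversionRatio; GeoIdSketch and its pre-quantized grid inputs are assumptions.

```java
// Generic Z-order (Morton) sketch of the geoId idea: interleave the bits of
// the two grid coordinates so that nearby cells get numerically close ids.
public class GeoIdSketch {

  // Interleave two 32-bit grid coordinates into one 64-bit geoId.
  public static long latLngToGeoId(int latGrid, int lngGrid) {
    long geoId = 0;
    for (int i = 0; i < 32; i++) {
      geoId |= ((long) (latGrid >> i) & 1L) << (2 * i + 1);  // odd bits: latitude
      geoId |= ((long) (lngGrid >> i) & 1L) << (2 * i);      // even bits: longitude
    }
    return geoId;
  }

  // Inverse: de-interleave the geoId back into {latGrid, lngGrid}.
  public static int[] geoIdToLatLng(long geoId) {
    int lat = 0, lng = 0;
    for (int i = 0; i < 32; i++) {
      lat |= (int) ((geoId >> (2 * i + 1)) & 1L) << i;
      lng |= (int) ((geoId >> (2 * i)) & 1L) << i;
    }
    return new int[]{lat, lng};
  }

  // One layer up = one coarser grid level = drop the two lowest bits.
  public static long toUpperLayerGeoId(long geoId) {
    return geoId >>> 2;
  }

  public static void main(String[] args) {
    long id = latLngToGeoId(3, 5);  // lat=0b11, lng=0b101
    int[] back = geoIdToLatLng(id);
    System.out.println(id + " -> lat=" + back[0] + ", lng=" + back[1]);
  }
}
```

Dropping two bits per layer is what lets an upper-layer geoId cover a 2x2 block of lower-layer cells, which is the property range-based filters like InPolygonRangeList rely on.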

3. Currently GeoID is a column created internally for spatial tables, this PR 
will support GeoID column to be customized during LOAD/INSERT INTO. For 
example, 
{code:java}
INSERT INTO geoTable SELECT 0,157542840,116285807,40084087;

It used to be as below, '855280799612' is generated internally,
+------------+---------+---------+--------+
|mygeohash   |timevalue|longitude|latitude|
+------------+---------+---------+--------+
|855280799612|157542840|116285807|40084087|
+------------+---------+---------+--------+
but now is
+---------+---------+---------+--------+
|mygeohash|timevalue|longitude|latitude|
+---------+---------+---------+--------+
|0        |157542840|116285807|40084087|
+---------+---------+---------+--------+{code}
 

  was:
The requirement is from SEQ, related algorithms are provided by group Discovery.

1. Replace geohash encoded algorithm, and reduce required properties of CREATE 
TABLE. For example,
{code:java}
CREATE TABLE geoTable(
 timevalue BIGINT,
 longitude LONG,
 latitude LONG) COMMENT "This is a GeoTable"
 STORED AS carbondata
 TBLPROPERTIES ($customProperties 'SPATIAL_INDEX'='mygeohash',
 'SPATIAL_INDEX.mygeohash.type'='geohash',
 'SPATIAL_INDEX.mygeohash.sourcecolumns'='longitude, latitude',
 'SPATIAL_INDEX.mygeohash.originLatitude'='39.832277',
 'SPATIAL_INDEX.mygeohash.gridSize'='50',
 'SPATIAL_INDEX.mygeohash.conversionRatio'='100'){code}
2. Add geo query UDFs

query filter UDFs :
 * _*InPolygonList (List polygonList, OperationType opType)*_
 * _*InPolylineList (List polylineList, Float bufferInMeter)*_
 * _*InPolygonRangeList (List RangeList, **OperationType opType**)*_

*operation only support :*
 * *"OR", means calculating union of two polygons*
 * *"AND", means calculating intersection of two polygons*

geo util UDFs :
 * _*GeoIdToGridXy(Long geoId) :* *Pair*_
 * _*LatLngToGeoId(**Long* *latitude, Long* *longitude) : Long*_
 * _*GeoIdToLatLng(Long geoId) : Pair*_
 * _*ToUpperLayerGeoId(Long geoId) : Long*_
 * _*ToRangeList (String polygon) : List*_

3. Currently GeoID is a column created internally for spatial tables, this PR 
will support GeoID column to be customized during LOAD/INSERT INTO. For 
example, 
{code:java}
INSERT INTO geoTable SELECT 0,157542840,116285807,40084087;

It used to be as below, '855280799612' is generated internally,
+------------+---------+---------+--------+
|mygeohash   |timevalue|longitude|latitude|
+------------+---------+---------+--------+
|855280799612|157542840|116285807|40084087|
+------------+---------+---------+--------+
but now is
+---------+---------+---------+--------+
|mygeohash|timevalue|longitude|latitude|
+---------+---------+---------+--------+
|0        |157542840|116285807|40084087|
+---------+---------+---------+--------+{code}
 


> Geo spatial index algorithm improvement and UDFs enhancement
> 
>
> Key: CARBONDATA-4051
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4051
> Project: CarbonData
>  Issue Type: New Feature
>Reporter: Jiayu Shen
>Priority: Minor
> Attachments: CarbonData Spatial Index Design Doc v2.docx
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> The requirement is from SEQ, related algorithms are provided by the Discovery Team.
> 1. Replace geohash encoded algorithm, 

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4025: [WIP] Make TableStatus/UpdateTableStatus/SegmentFile Smaller

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4025:
URL: https://github.com/apache/carbondata/pull/4025#issuecomment-734177874


   Build Failed  with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4918/
   







[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4025: [WIP] Make TableStatus/UpdateTableStatus/SegmentFile Smaller

2020-11-26 Thread GitBox


CarbonDataQA2 commented on pull request #4025:
URL: https://github.com/apache/carbondata/pull/4025#issuecomment-734177288


   Build Failed  with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3162/
   







[jira] [Updated] (CARBONDATA-4051) Geo spatial index algorithm improvement and UDFs enhancement

2020-11-26 Thread Jiayu Shen (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiayu Shen updated CARBONDATA-4051:
---
Attachment: (was: Genex Cloud Carbon Spatial Index 
Specification.docx)

> Geo spatial index algorithm improvement and UDFs enhancement
> 
>
> Key: CARBONDATA-4051
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4051
> Project: CarbonData
>  Issue Type: New Feature
>Reporter: Jiayu Shen
>Priority: Minor
> Attachments: CarbonData Spatial Index Design Doc v2.docx
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> The requirement is from SEQ, related algorithms are provided by group
> Discovery.
> 1. Replace geohash encoded algorithm, and reduce required properties of 
> CREATE TABLE. For example,
> {code:java}
> CREATE TABLE geoTable(
>  timevalue BIGINT,
>  longitude LONG,
>  latitude LONG) COMMENT "This is a GeoTable"
>  STORED AS carbondata
>  TBLPROPERTIES ($customProperties 'SPATIAL_INDEX'='mygeohash',
>  'SPATIAL_INDEX.mygeohash.type'='geohash',
>  'SPATIAL_INDEX.mygeohash.sourcecolumns'='longitude, latitude',
>  'SPATIAL_INDEX.mygeohash.originLatitude'='39.832277',
>  'SPATIAL_INDEX.mygeohash.gridSize'='50',
>  'SPATIAL_INDEX.mygeohash.conversionRatio'='100'){code}
> 2. Add geo query UDFs
> query filter UDFs :
>  * _*InPolygonList (List polygonList, OperationType opType)*_
>  * _*InPolylineList (List polylineList, Float bufferInMeter)*_
>  * _*InPolygonRangeList (List RangeList, **OperationType opType**)*_
> *operation only support :*
>  * *"OR", means calculating union of two polygons*
>  * *"AND", means calculating intersection of two polygons*
> geo util UDFs :
>  * _*GeoIdToGridXy(Long geoId) :* *Pair*_
>  * _*LatLngToGeoId(**Long* *latitude, Long* *longitude) : Long*_
>  * _*GeoIdToLatLng(Long geoId) : Pair*_
>  * _*ToUpperLayerGeoId(Long geoId) : Long*_
>  * _*ToRangeList (String polygon) : List*_
> 3. Currently GeoID is a column created internally for spatial tables, this PR 
> will support GeoID column to be customized during LOAD/INSERT INTO. For 
> example, 
> {code:java}
> INSERT INTO geoTable SELECT 0,157542840,116285807,40084087;
> It used to be as below, '855280799612' is generated internally,
> +------------+---------+---------+--------+
> |mygeohash   |timevalue|longitude|latitude|
> +------------+---------+---------+--------+
> |855280799612|157542840|116285807|40084087|
> +------------+---------+---------+--------+
> but now is
> +---------+---------+---------+--------+
> |mygeohash|timevalue|longitude|latitude|
> +---------+---------+---------+--------+
> |0        |157542840|116285807|40084087|
> +---------+---------+---------+--------+{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [carbondata] marchpure commented on pull request #4025: [WIP] Make TableStatus/UpdateTableStatus/SegmentFile Smaller

2020-11-26 Thread GitBox


marchpure commented on pull request #4025:
URL: https://github.com/apache/carbondata/pull/4025#issuecomment-734175206


   retest this please







[jira] [Created] (CARBONDATA-4059) Block compaction on SI table.

2020-11-26 Thread Nihal kumar ojha (Jira)
Nihal kumar ojha created CARBONDATA-4059:


 Summary: Block compaction on SI table.
 Key: CARBONDATA-4059
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4059
 Project: CarbonData
  Issue Type: Bug
Reporter: Nihal kumar ojha


Currently compaction is allowed on the SI table. Because of this, if only the SI table is
compacted, then running a filter query on the main table causes more data to be scanned from
the SI table, which degrades performance.





[GitHub] [carbondata] marchpure commented on pull request #4025: [WIP] Make TableStatus/UpdateTableStatus/SegmentFile Smaller

2020-11-26 Thread GitBox


marchpure commented on pull request #4025:
URL: https://github.com/apache/carbondata/pull/4025#issuecomment-734161643


   retest this please






