[GitHub] [carbondata] akashrn5 commented on a change in pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

2021-03-08 Thread GitBox


akashrn5 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r590016320



##
File path: integration/spark/src/main/scala/org/apache/carbondata/trash/DataTrashManager.scala
##
@@ -87,13 +106,53 @@ object DataTrashManager {
     }
   }
 
-  private def checkAndCleanTrashFolder(carbonTable: CarbonTable, isForceDelete: Boolean): Unit = {
+  /**
+   * Checks the size of the segment files as well as datafiles, this method is used before and after
+   * clean files operation to check how much space is actually freed, during the operation.
+   */
+  def getSizeSnapshot(carbonTable: CarbonTable): Long = {
+    val metadataDetails = SegmentStatusManager.readLoadMetadata(carbonTable.getMetadataPath)
+    var size: Long = 0
+    val segmentFileLocation = CarbonTablePath.getSegmentFilesLocation(carbonTable.getTablePath)
+    if (FileFactory.isFileExist(segmentFileLocation)) {
+      size += FileFactory.getDirectorySize(segmentFileLocation)
+    }
+    metadataDetails.foreach(oneLoad =>
+      if (oneLoad.getVisibility.toBoolean) {
+        size += calculateSegmentSizeForOneLoad(carbonTable, oneLoad, metadataDetails)
+      }
+    )
+    size
+  }
+
+  /**
+   * Method to handle the Clean files dry run operation
+   */
+  def cleanFilesDryRunOperation (
+      carbonTable: CarbonTable,
+      isForceDelete: Boolean,
+      cleanStaleInProgress: Boolean,
+      showStats: Boolean): (Long, Long) = {
+    // get size freed from the trash folder
+    val trashFolderSizeStats = checkAndCleanTrashFolder(carbonTable, isForceDelete,
+      isDryRun = true, showStats)
+    // get size that will be deleted (MFD, COmpacted, Inprogress segments)
+    val expiredSegmentsSizeStats = dryRunOnExpiredSegments(carbonTable, isForceDelete,
+      cleanStaleInProgress)
+    (trashFolderSizeStats._1 + expiredSegmentsSizeStats._1, trashFolderSizeStats._2 +
+      expiredSegmentsSizeStats._2)
+  }
+
+  private def checkAndCleanTrashFolder(carbonTable: CarbonTable, isForceDelete: Boolean,
+      isDryRun: Boolean, showStats: Boolean): (Long, Long) = {
     if (isForceDelete) {
       // empty the trash folder
-      TrashUtil.emptyTrash(carbonTable.getTablePath)
+      val a = TrashUtil.emptyTrash(carbonTable.getTablePath, isDryRun, showStats)
+      (a.head, a(1))
     } else {
       // clear trash based on timestamp
-      TrashUtil.deleteExpiredDataFromTrash(carbonTable.getTablePath)
+      val a = TrashUtil.deleteExpiredDataFromTrash(carbonTable.getTablePath, isDryRun, showStats)

Review comment:
   same as above









[GitHub] [carbondata] akashrn5 commented on a change in pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

2021-03-08 Thread GitBox


akashrn5 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r590016222



##
File path: integration/spark/src/main/scala/org/apache/carbondata/trash/DataTrashManager.scala
##
@@ -87,13 +106,53 @@ object DataTrashManager {
     }
   }
 
-  private def checkAndCleanTrashFolder(carbonTable: CarbonTable, isForceDelete: Boolean): Unit = {
+  /**
+   * Checks the size of the segment files as well as datafiles, this method is used before and after
+   * clean files operation to check how much space is actually freed, during the operation.
+   */
+  def getSizeSnapshot(carbonTable: CarbonTable): Long = {
+    val metadataDetails = SegmentStatusManager.readLoadMetadata(carbonTable.getMetadataPath)
+    var size: Long = 0
+    val segmentFileLocation = CarbonTablePath.getSegmentFilesLocation(carbonTable.getTablePath)
+    if (FileFactory.isFileExist(segmentFileLocation)) {
+      size += FileFactory.getDirectorySize(segmentFileLocation)
+    }
+    metadataDetails.foreach(oneLoad =>
+      if (oneLoad.getVisibility.toBoolean) {
+        size += calculateSegmentSizeForOneLoad(carbonTable, oneLoad, metadataDetails)
+      }
+    )
+    size
+  }
+
+  /**
+   * Method to handle the Clean files dry run operation
+   */
+  def cleanFilesDryRunOperation (
+      carbonTable: CarbonTable,
+      isForceDelete: Boolean,
+      cleanStaleInProgress: Boolean,
+      showStats: Boolean): (Long, Long) = {
+    // get size freed from the trash folder
+    val trashFolderSizeStats = checkAndCleanTrashFolder(carbonTable, isForceDelete,
+      isDryRun = true, showStats)
+    // get size that will be deleted (MFD, COmpacted, Inprogress segments)
+    val expiredSegmentsSizeStats = dryRunOnExpiredSegments(carbonTable, isForceDelete,
+      cleanStaleInProgress)
+    (trashFolderSizeStats._1 + expiredSegmentsSizeStats._1, trashFolderSizeStats._2 +
+      expiredSegmentsSizeStats._2)
+  }
+
+  private def checkAndCleanTrashFolder(carbonTable: CarbonTable, isForceDelete: Boolean,
+      isDryRun: Boolean, showStats: Boolean): (Long, Long) = {
     if (isForceDelete) {
       // empty the trash folder
-      TrashUtil.emptyTrash(carbonTable.getTablePath)
+      val a = TrashUtil.emptyTrash(carbonTable.getTablePath, isDryRun, showStats)

Review comment:
   give a proper variable name here









[GitHub] [carbondata] akashrn5 commented on a change in pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

2021-03-08 Thread GitBox


akashrn5 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r589969086



##
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/management/CarbonCleanFilesCommand.scala
##
@@ -41,6 +43,26 @@ case class CarbonCleanFilesCommand(
   extends DataCommand {
 
   val LOGGER: Logger = LogServiceFactory.getLogService(this.getClass.getCanonicalName)
+  val isDryRun: Boolean = options.getOrElse("dryrun", "false").toBoolean
+  var showStats: Boolean = options.getOrElse("statistics", "true").toBoolean
+  if (isInternalCleanCall) {
+    showStats = false
+  }
+
+  override def output: Seq[AttributeReference] = {
+    if (isDryRun) {
+      // dry run operation
+      Seq(
+        AttributeReference("Size Freed", StringType, nullable = false)(),

Review comment:
   Agree with @ajantha-bhat, change the titles accordingly. And @ajantha-bhat, for clean files just the total cleaned size is fine, right? Anyway, we have the sizes separated in dry run. What do you think?









[GitHub] [carbondata] akashrn5 commented on a change in pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

2021-02-19 Thread GitBox


akashrn5 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r579089632



##
File path: core/src/main/java/org/apache/carbondata/core/util/CleanFilesUtil.java
##
@@ -74,13 +74,19 @@ public static void cleanStaleSegments(CarbonTable carbonTable)
       // delete the segment file as well
       FileFactory.deleteFile(CarbonTablePath.getSegmentFilePath(carbonTable.getTablePath(),
           staleSegmentFile));
+      StringBuilder deletedFiles = new StringBuilder();

Review comment:
   same as above

##
File path: core/src/main/java/org/apache/carbondata/core/metadata/SegmentFileStore.java
##
@@ -1133,17 +1133,23 @@ public static void deleteSegment(String tablePath, Segment segment,
     List<String> indexOrMergeFiles = fileStore.readIndexFiles(SegmentStatus.SUCCESS, true,
         FileFactory.getConfiguration());
     Map<String, List<String>> indexFilesMap = fileStore.getIndexFilesMap();
+    StringBuilder deletedFiles = new StringBuilder();

Review comment:
   do not use StringBuilder, use Java functional APIs, like below:
   
   `String deletedFiles = files
       .stream()
       .map(file -> file.getPath())
       .collect(Collectors.joining(","));`
   
   try this
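   For reference, a minimal self-contained sketch of that joining approach (the `files` list and its paths are made up for illustration):
   
   ```java
   import java.util.List;
   import java.util.stream.Collectors;
   
   public final class JoiningExample {
     public static void main(String[] args) {
       // stand-in paths for the files deleted during the clean files operation
       List<String> files = List.of(
           "/store/table/Fact/Part0/Segment_0/part-0-0.carbondata",
           "/store/table/Fact/Part0/Segment_0/0_0.carbonindex");
       // build the comma-separated list without a StringBuilder
       String deletedFiles = files.stream().collect(Collectors.joining(","));
       System.out.println(deletedFiles);
     }
   }
   ```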

##
File path: 
core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
##
@@ -1895,6 +1895,11 @@ private CarbonCommonConstants() {
*/
   public static final String COMMA = ",";
 
+  /**
+   * SINGLE SPACE
+   */
+  public static final String SPACE = " ";

Review comment:
   revert this









[GitHub] [carbondata] akashrn5 commented on a change in pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

2021-02-19 Thread GitBox


akashrn5 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r579051553



##
File path: docs/clean-files.md
##
@@ -64,4 +64,41 @@ The stale_inprogress option with force option will delete Marked for delete, Com
 
   ```
   CLEAN FILES FOR TABLE TABLE_NAME options('stale_inprogress'='true', 'force'='true')
-  ```
\ No newline at end of file
+  ```
+### DRY RUN OPTION
+Clean files also support a dry run option which will let the user know how much space fill we freed 
+during the actual clean files operation. The dry run operation will not delete any data but will just give
+size based statistics on the data which will be cleaned in clean files. Dry run operation will return two columns where the first will 
+show how much space will be freed by that clean files operation and the second column will show the 
+remaining stale data(data which can be deleted but has not yet expired as per the ```max.query.execution.time``` and ``` carbon.trash.retention.days``` values
+).  By default the value of ```dryrun``` option is ```false```.
+
+Dry Run Operation is supported with four types of commands:
+  ```
+  CLEAN FILES FOR TABLE TABLE_NAME options('dryrun'='true')
+  ```
+  ```
+  CLEAN FILES FOR TABLE TABLE_NAME options('force'='true', 'dryrun'='true')
+  ```
+  ```
+  CLEAN FILES FOR TABLE TABLE_NAME options('stale_inprogress'='true','dryrun'='true')
+  ```
+
+  ```
+  CLEAN FILES FOR TABLE TABLE_NAME options('stale_inprogress'='true', 'force'='true','dryrun'='true')
+  ```
+
+**NOTE**:
+  * Since the dry run operation will calculate size and will access File level API's, the operation can
+  be a costly and a time consuming operation in case of tables with large number of segments.
+  * When dry run is true, the statistics option will not matter.
+
+### SHOW STATISTICS
+Clean files operation tells how much size is freed during that operation to the user.  By default, the clean files operation
+will show the size freed statistics. Since calculating and showing statistics can be a costly operation and reduce the performance of the
+clean files operation, the user can disable that option by using ```statistics = false``` in the clean files options.
+
+   ```
+   CLEAN FILES FOR TABLE TABLE_NAME options('statistics`='false')

Review comment:
   ```suggestion
   CLEAN FILES FOR TABLE TABLE_NAME options('statistics'='false')
   ```

##
File path: integration/spark/src/main/scala/org/apache/carbondata/trash/DataTrashManager.scala
##
@@ -87,13 +104,48 @@ object DataTrashManager {
     }
   }
 
-  private def checkAndCleanTrashFolder(carbonTable: CarbonTable, isForceDelete: Boolean): Unit = {
+  /**
+   * Checks the size of the segment files as well as datafiles, this method is used before and after
+   * clean files operation to check how much space is actually freed, during the operation.
+   */
+  def getSizeScreenshot(carbonTable: CarbonTable): Long = {
+    val metadataDetails = SegmentStatusManager.readLoadMetadata(carbonTable.getMetadataPath)
+    var size: Long = FileFactory.getDirectorySize(CarbonTablePath.getSegmentFilesLocation(

Review comment:
   ```suggestion
    var segmentFilesLocationSize: Long = FileFactory.getDirectorySize(CarbonTablePath.getSegmentFilesLocation(
   ```

##
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/management/CarbonCleanFilesCommand.scala
##
@@ -41,6 +42,26 @@ case class CarbonCleanFilesCommand(
   extends DataCommand {
 
   val LOGGER: Logger = LogServiceFactory.getLogService(this.getClass.getCanonicalName)
+  val isDryRun: Boolean = options.getOrElse("dryrun", "false").toBoolean
+  var showStats: Boolean = options.getOrElse("statistics", "true").toBoolean
+  if (isInternalCleanCall) {
+    showStats = false
+  }
+
+  override def output: Seq[AttributeReference] = {
+    if (isDryRun) {
+      // dry run operation
+      Seq(
+        AttributeReference("Size Freed", LongType, nullable = false)(),
+        AttributeReference("Trash Data Remaining", LongType, nullable = false)())

Review comment:
   please return the size values in a readable format; if giving them in MB, have the unit in the column header, as (MB)
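   For illustration, a minimal sketch of such a formatter (a hypothetical helper, not an existing CarbonData API) that picks a unit so the freed size stays readable:
   
   ```java
   public final class SizeFormatter {
     private static final String[] UNITS = {"B", "KB", "MB", "GB", "TB"};
   
     // Converts a raw byte count into a human-readable string, e.g. 1536 -> "1.50 KB".
     public static String formatSize(long bytes) {
       double size = bytes;
       int unit = 0;
       while (size >= 1024 && unit < UNITS.length - 1) {
         size /= 1024;
         unit++;
       }
       return String.format("%.2f %s", size, UNITS[unit]);
     }
   }
   ```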

##
File path: docs/clean-files.md
##
@@ -64,4 +64,41 @@ The stale_inprogress option with force option will delete Marked for delete, Com
 
   ```
   CLEAN FILES FOR TABLE TABLE_NAME options('stale_inprogress'='true', 'force'='true')
-  ```
\ No newline at end of file
+  ```
+### DRY RUN OPTION
+Clean files also support a dry run option which will let the user know how much space fill we freed 

Review comment:
   ```suggestion
   Clean files also support a dry run option which will let the user know how much space will be freed 
   ```






[GitHub] [carbondata] akashrn5 commented on a change in pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

2021-02-18 Thread GitBox


akashrn5 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r578921513



##
File path: core/src/main/java/org/apache/carbondata/core/metadata/SegmentFileStore.java
##
@@ -1135,13 +1135,16 @@ public static void deleteSegment(String tablePath, Segment segment,
     Map<String, List<String>> indexFilesMap = fileStore.getIndexFilesMap();
     for (Map.Entry<String, List<String>> entry : indexFilesMap.entrySet()) {
       FileFactory.deleteFile(entry.getKey());
+      LOGGER.info("File deleted after clean files operation: " + entry.getKey());
       for (String file : entry.getValue()) {
         String[] deltaFilePaths =
             updateStatusManager.getDeleteDeltaFilePath(file, segment.getSegmentNo());
         for (String deltaFilePath : deltaFilePaths) {
           FileFactory.deleteFile(deltaFilePath);
+          LOGGER.info("File deleted after clean files operation: " + deltaFilePath);

Review comment:
   Instead of logging each file name for every delete, which will increase these logs when many delta files are there, you can log once for all files after the loop completes (at line 1147), along with the actual block path. Or else you can say the block file was deleted (print the block file name) along with the corresponding delta files, as the timestamp will be the same, I guess. Just check once and add.
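   A minimal self-contained sketch of that collect-then-log-once pattern (names and paths are illustrative; the real code would call FileFactory.deleteFile and LOGGER.info instead of the stand-ins here):
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   
   public final class LogOnceExample {
     public static void main(String[] args) {
       // stand-in delta file paths
       List<String> deltaFilePaths = List.of(
           "/t/Fact/Part0/Segment_0/a.deletedelta",
           "/t/Fact/Part0/Segment_0/b.deletedelta");
       List<String> deletedPaths = new ArrayList<>();
       for (String deltaFilePath : deltaFilePaths) {
         // delete the file here, then record its path instead of logging per file
         deletedPaths.add(deltaFilePath);
       }
       // one log line for the whole loop, instead of one line per delta file
       System.out.println("Files deleted after clean files operation: "
           + String.join(",", deletedPaths));
     }
   }
   ```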

##
File path: core/src/main/java/org/apache/carbondata/core/metadata/SegmentFileStore.java
##
@@ -1135,13 +1135,16 @@ public static void deleteSegment(String tablePath, Segment segment,
     Map<String, List<String>> indexFilesMap = fileStore.getIndexFilesMap();
     for (Map.Entry<String, List<String>> entry : indexFilesMap.entrySet()) {
       FileFactory.deleteFile(entry.getKey());
+      LOGGER.info("File deleted after clean files operation: " + entry.getKey());

Review comment:
   ```suggestion
+      LOGGER.info("Deleted file: " + entry.getKey() + ", on clean files");
   ```
   
   can do the same for all

##
File path: core/src/main/java/org/apache/carbondata/core/util/TrashUtil.java
##
@@ -152,46 +153,77 @@ public static void copyFilesToTrash(List filesToCopy,
    * The below method deletes timestamp subdirectories in the trash folder which have expired as
    * per the user defined retention time
    */
-  public static void deleteExpiredDataFromTrash(String tablePath) {
+  public static long[] deleteExpiredDataFromTrash(String tablePath, Boolean isDryRun) {

Review comment:
   update the method comment based on the method signature changes and return values; follow the same for the others also, if changed

##
File path: docs/clean-files.md
##
@@ -64,4 +64,40 @@ The stale_inprogress option with force option will delete Marked for delete, Com
 
   ```
  CLEAN FILES FOR TABLE TABLE_NAME options('stale_inprogress'='true', 'force'='true')
-  ```
\ No newline at end of file
+  ```
+### DRY RUN OPTION
+Clean files also support a dry run option which will let the user know how much space fill we freed 
+during the actual clean files operation. The dry run operation will not delete any data but will just give
+size bases statistics to the data. Dry run operation will return two columns where the first will 
+show how much space will be freed by that clean files operation and the second column will show the 
+remaining stale data(data which can be deleted but has not yet expired as per the ```max.query.execution.time``` and ``` carbon.trash.retention.days``` values
+).  By default the value of ```dryrun``` option is ```false```.
+
+Dry Run Operation is supported with four types of commands:
+  ```
+  CLEAN FILES FOR TABLE TABLE_NAME options('dryrun'='true')
+  ```
+  ```
+  CLEAN FILES FOR TABLE TABLE_NAME options('force'='true', 'dryrun'='true')
+  ```
+  ```
+  CLEAN FILES FOR TABLE TABLE_NAME options('stale_inprogress'='true','dryrun'='true')
+  ```
+
+  ```
+  CLEAN FILES FOR TABLE TABLE_NAME options('stale_inprogress'='true', 'force'='true','dryrun'='true')
+  ```
+
+**NOTE**:
+  * Since the dry run operation will calculate size and will access File level API's, the operation can
+  be a costly and a time consuming operation in case of tables with large number of segments.

Review comment:
   here, better to add a point that when dry run is true, other options don't matter except force = true, and I hope you have handled this in code

##
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java
##
@@ -983,7 +983,28 @@ public static boolean isLoadInProgress(AbsoluteTableIdentifier absoluteTableIden
     }
   }
 
-  private static boolean isLoadDeletionRequired(LoadMetadataDetails[] details) {
+  public static boolean isExpiredSegment(LoadMetadataDetails oneLoad, AbsoluteTableIdentifier
+      absoluteTableIdentifier) {
+    boolean result = false;

Review comment:
   rename to `isExpiredSegment`


[GitHub] [carbondata] akashrn5 commented on a change in pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

2021-02-17 Thread GitBox


akashrn5 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r577533482



##
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/management/CarbonCleanFilesCommand.scala
##
@@ -37,11 +38,24 @@ case class CarbonCleanFilesCommand(
     databaseNameOp: Option[String],
     tableName: String,
     options: Map[String, String] = Map.empty,
+    dryRun: Boolean,
     isInternalCleanCall: Boolean = false)
   extends DataCommand {
 
   val LOGGER: Logger = LogServiceFactory.getLogService(this.getClass.getCanonicalName)
 
+  override def output: Seq[AttributeReference] = {
+    if (dryRun) {
+      Seq(
+        AttributeReference("Size Freed", LongType, nullable = false)(),
+        AttributeReference("Trash Data Remaining", LongType, nullable = false)())
+    } else {
+      Seq(
+        AttributeReference("Size Freed", LongType, nullable = false)(),
+        AttributeReference("Trash Data Remaining", LongType, nullable = false)())
+    }

Review comment:
   the if and else blocks are both the same? I think it is better to give these rows only in the case of dry run

##
File path: integration/spark/src/main/scala/org/apache/carbondata/trash/DataTrashManager.scala
##
@@ -87,13 +101,28 @@ object DataTrashManager {
     }
   }
 
-  private def checkAndCleanTrashFolder(carbonTable: CarbonTable, isForceDelete: Boolean): Unit = {
+  def cleanFilesDryRunOperation (
+      carbonTable: CarbonTable,
+      isForceDelete: Boolean,
+      cleanStaleInProgress: Boolean,
+      partitionSpecs: Option[Seq[PartitionSpec]] = None): Seq[Long] = {
+    // get size freed from the trash folder
+    val trashFolderSizeStats = checkAndCleanTrashFolder(carbonTable, isForceDelete, isDryRun = true)
+    // get size that will be deleted (MFD, COmpacted, Inprogress segments)
+    val expiredSegmentsSizeStats = dryRunOnExpiredSegments(carbonTable, isForceDelete,
+      cleanStaleInProgress, partitionSpecs)
+    Seq(trashFolderSizeStats.head + expiredSegmentsSizeStats.head, trashFolderSizeStats(1) +
+      expiredSegmentsSizeStats(1))
+  }
+
+  private def checkAndCleanTrashFolder(carbonTable: CarbonTable, isForceDelete: Boolean,
+      isDryRun: Boolean): Seq[Long] = {

Review comment:
   I think we are mixing the dry run option along with force delete, and making this complex with code and combination handling. What I think is, when the user says dry run, it should be clear that it doesn't take any other options and just tells the user in return how much and what it is going to clean, that's all; we will not delete or clear any files when dry run. So it will be easier to code and cleaner, maybe a new class or a new method in the clean files command class. What do you guys think @ajantha-bhat @QiangCai









[GitHub] [carbondata] akashrn5 commented on a change in pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

2021-02-17 Thread GitBox


akashrn5 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r577527003



##
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java
##
@@ -1297,4 +1359,37 @@ public static TableStatusReturnTuple separateVisibleAndInvisibleSegments(
       return new HashMap<>(0);
     }
   }
+
+  public static long partitionTableSegmentSize(CarbonTable carbonTable, LoadMetadataDetails

Review comment:
   Yes, better not to mix the logic of dry run size calculation and the actual clean files; keep them separate, so that the user will know for sure that when he/she runs the dry run it might take some time, as it will do the size calculation.









[GitHub] [carbondata] akashrn5 commented on a change in pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

2021-02-17 Thread GitBox


akashrn5 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r577523404



##
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java
##
@@ -1072,7 +1097,22 @@ public static void deleteLoadsAndUpdateMetadata(CarbonTable carbonTable, boolean
         isUpdateRequired(isForceDeletion, carbonTable,
             identifier, details, cleanStaleInprogress);
     if (!tuple2.isUpdateRequired) {
-      return;
+      try {
+        for (LoadMetadataDetails oneLoad : details) {
+          if (isExpiredSegment(oneLoad, carbonTable.getAbsoluteTableIdentifier())) {
+            if (!carbonTable.isHivePartitionTable()) {
+              trashSizeRemaining += FileFactory.getDirectorySize(CarbonTablePath
+                  .getSegmentPath(carbonTable.getTablePath(), oneLoad.getLoadName()));
+            } else {
+              trashSizeRemaining += partitionTableSegmentSize(carbonTable, oneLoad,
+                  details, partitionSpecs);
+            }
+          }
+        }
+      } catch (Exception e) {
+        LOG.error("Unable to calculate size of garbage data", e);
+      }
+      return new long[]{sizeFreed, trashSizeRemaining};

Review comment:
   Yes, agree with @ajantha-bhat, I meant the same.









[GitHub] [carbondata] akashrn5 commented on a change in pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

2021-02-15 Thread GitBox


akashrn5 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r570107627



##
File path: core/src/main/java/org/apache/carbondata/core/datastore/impl/FileFactory.java
##
@@ -297,6 +297,10 @@ public static boolean deleteFile(String filePath) throws IOException {
     return getCarbonFile(filePath).deleteFile();
   }
 
+  public static boolean deleteFile(CarbonFile carbonFile) throws IOException {

Review comment:
   a delete API is already there, please use the same

##
File path: core/src/main/java/org/apache/carbondata/core/metadata/SegmentFileStore.java
##
@@ -1126,25 +1130,36 @@ public static void deleteSegmentFile(String tablePath, Segment segment) throws E
    * @param partitionSpecs
    * @throws IOException
    */
-  public static void deleteSegment(String tablePath, Segment segment,
+  public static long deleteSegment(String tablePath, Segment segment,
       List<PartitionSpec> partitionSpecs,
       SegmentUpdateStatusManager updateStatusManager) throws Exception {
     SegmentFileStore fileStore = new SegmentFileStore(tablePath, segment.getSegmentFileName());
     List<String> indexOrMergeFiles = fileStore.readIndexFiles(SegmentStatus.SUCCESS, true,
         FileFactory.getConfiguration());
+    long sizeFreed = 0;
     Map<String, List<String>> indexFilesMap = fileStore.getIndexFilesMap();
     for (Map.Entry<String, List<String>> entry : indexFilesMap.entrySet()) {
-      FileFactory.deleteFile(entry.getKey());
+      CarbonFile entryCarbonFile = FileFactory.getCarbonFile(entry.getKey());
+      sizeFreed += entryCarbonFile.getSize();
+      FileFactory.deleteFile(entryCarbonFile);
+      LOGGER.info("File deleted after clean files operation: " + entry.getKey());
       for (String file : entry.getValue()) {
         String[] deltaFilePaths =
            updateStatusManager.getDeleteDeltaFilePath(file, segment.getSegmentNo());
        for (String deltaFilePath : deltaFilePaths) {
-          FileFactory.deleteFile(deltaFilePath);
+          CarbonFile deltaCarbonFile = FileFactory.getCarbonFile(deltaFilePath);
+          sizeFreed += deltaCarbonFile.getSize();
+          FileFactory.deleteFile(deltaCarbonFile);
+          LOGGER.info("File deleted after clean files operation: " + deltaFilePath);
        }
-        FileFactory.deleteFile(file);
+        CarbonFile deleteCarbonFile = FileFactory.getCarbonFile(file);
+        sizeFreed += deleteCarbonFile.getSize();
+        FileFactory.deleteFile(deleteCarbonFile);

Review comment:
   I can see a lot of the same lines of code here; maybe you can refactor them into a method like `getSizeAndDelete()`
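   A sketch of what such a helper could look like (the method itself is hypothetical; it only reuses the CarbonFile/FileFactory/LOGGER calls already present in the quoted diff, and each call site would shrink to `sizeFreed += getSizeAndDelete(path);`):
   
   ```java
   // Captures the file size, deletes the file, and returns the freed bytes,
   // replacing the three repeated lines at every call site.
   private static long getSizeAndDelete(String filePath) throws IOException {
     CarbonFile carbonFile = FileFactory.getCarbonFile(filePath);
     long size = carbonFile.getSize();
     FileFactory.deleteFile(filePath);
     LOGGER.info("File deleted after clean files operation: " + filePath);
     return size;
   }
   ```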

##
File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/cleanfiles/TestCleanFilesCommandPartitionTable.scala
##
@@ -65,14 +65,19 @@ class TestCleanFilesCommandPartitionTable extends QueryTest with BeforeAndAfterA
     loadData()
     sql(s"""ALTER TABLE CLEANTEST COMPACT "MINOR" """)
     loadData()
+    sql(s"CLEAN FILES FOR TABLE cleantest DRYRUN").show()
+    sql(s"CLEAN FILES FOR TABLE cleantest").show()

Review comment:
   `.show` is a wasted call in FTs, please have proper asserts for all

##
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java
##
@@ -1297,4 +1356,37 @@ public static TableStatusReturnTuple separateVisibleAndInvisibleSegments(
       return new HashMap<>(0);
     }
   }
+
+  public static long partitionTableSegmentSize(CarbonTable carbonTable, LoadMetadataDetails
+      oneLoad, LoadMetadataDetails[] loadMetadataDetails, List<PartitionSpec>
+      partitionSpecs) throws Exception {
+    long size = 0;
+    SegmentFileStore fileStore = new SegmentFileStore(carbonTable.getTablePath(), oneLoad
+        .getSegmentFile());
+    List<String> indexOrMergeFiles = fileStore.readIndexFiles(SegmentStatus.SUCCESS, true,
+        FileFactory.getConfiguration());
+    Map<String, List<String>> indexFilesMap = fileStore.getIndexFilesMap();

Review comment:
   Can't we calculate the size together for all segments? Because here it is calling readIndexFiles and the File APIs to calculate sizes every time, so in the case of OBS it will be very slow; better to optimize it to calculate in one pass, I feel.
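   To illustrate the one-pass idea only (java.nio stands in for the store's file API here; this is not CarbonData code): walk the table directory once and aggregate file sizes per top-level subdirectory, instead of issuing listing calls per segment:
   
   ```java
   import java.io.IOException;
   import java.nio.file.Files;
   import java.nio.file.Path;
   import java.nio.file.Paths;
   import java.util.Map;
   import java.util.stream.Collectors;
   import java.util.stream.Stream;
   
   public final class OnePassSizeExample {
     // Returns total bytes per first-level subdirectory of tablePath in a single walk.
     public static Map<Path, Long> sizesByTopLevelDir(String tablePath) throws IOException {
       Path root = Paths.get(tablePath);
       try (Stream<Path> files = Files.walk(root)) {
         return files.filter(Files::isRegularFile)
             .collect(Collectors.groupingBy(
                 p -> root.relativize(p).getName(0),
                 Collectors.summingLong(p -> p.toFile().length())));
       }
     }
   }
   ```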

##
File path: core/src/main/java/org/apache/carbondata/core/metadata/SegmentFileStore.java
##
@@ -1126,25 +1130,36 @@ public static void deleteSegmentFile(String tablePath, Segment segment) throws E
    * @param partitionSpecs
    * @throws IOException
    */
-  public static void deleteSegment(String tablePath, Segment segment,
+  public static long deleteSegment(String tablePath, Segment segment,
       List<PartitionSpec> partitionSpecs,
       SegmentUpdateStatusManager updateStatusManager) throws Exception {
     SegmentFileStore fileStore = new SegmentFileStore(tablePath, segment.getSegmentFileName());
     List<String> indexOrMergeFiles = fileStore.readIndexFiles(SegmentStatus.SUCCESS, true,
         FileFactory.getConfiguration());
+