[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata

2020-10-27 Thread GitBox


vikramahuja1001 commented on a change in pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#discussion_r513197605



##
File path: core/src/main/java/org/apache/carbondata/core/metadata/SegmentFileStore.java
##
@@ -1143,28 +1148,62 @@ public static void cleanSegments(CarbonTable table,
* @throws IOException
*/
   public static void deleteSegment(String tablePath, Segment segment,
-  List<PartitionSpec> partitionSpecs,
-  SegmentUpdateStatusManager updateStatusManager) throws Exception {
+  List<PartitionSpec> partitionSpecs, SegmentUpdateStatusManager updateStatusManager,
+  SegmentStatus segmentStatus, Boolean isPartitionTable, String timeStampForTrashFolder)
+  throws Exception {
 SegmentFileStore fileStore = new SegmentFileStore(tablePath, segment.getSegmentFileName());
 List<String> indexOrMergeFiles = fileStore.readIndexFiles(SegmentStatus.SUCCESS, true,
 FileFactory.getConfiguration());
+List<String> filesToDelete = new ArrayList<>();
 Map<String, List<String>> indexFilesMap = fileStore.getIndexFilesMap();
 for (Map.Entry<String, List<String>> entry : indexFilesMap.entrySet()) {
-  FileFactory.deleteFile(entry.getKey());
+  // Move the file to the trash folder in case the segment status is insert in progress
+  if (segmentStatus == SegmentStatus.INSERT_IN_PROGRESS && isPartitionTable) {
+TrashUtil.copyDataToTrashFolderByFile(tablePath, entry.getKey(), timeStampForTrashFolder +
Review comment:
   changed





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-27 Thread GitBox


CarbonDataQA1 commented on pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#issuecomment-717713614


   Build Failed  with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2955/
   







[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-27 Thread GitBox


CarbonDataQA1 commented on pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#issuecomment-717711741


   Build Failed  with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4712/
   







[GitHub] [carbondata] Indhumathi27 commented on pull request #3952: [CARBONDATA-4006] Fix for currentUser as NULL in getcount method during index server fallback mode

2020-10-27 Thread GitBox


Indhumathi27 commented on pull request #3952:
URL: https://github.com/apache/carbondata/pull/3952#issuecomment-717703722


   retest this please







[GitHub] [carbondata] akashrn5 commented on a change in pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata

2020-10-27 Thread GitBox


akashrn5 commented on a change in pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#discussion_r513171515



##
File path: core/src/main/java/org/apache/carbondata/core/util/CarbonProperties.java
##
@@ -2116,6 +2087,26 @@ public int getMaxSIRepairLimit(String dbName, String tableName) {
 return Math.abs(Integer.parseInt(thresholdValue));
   }
 
+  /**
+   * The below method returns the microseconds after which the trash folder will expire
+   */
+  public long getTrashFolderExpirationTime() {
+String configuredValue = getProperty(CarbonCommonConstants.CARBON_TRASH_EXPIRATION_DAYS,
+CarbonCommonConstants.CARBON_TRASH_EXPIRATION_DAYS_DEFAULT);
+Integer result = 0;
+try {
+  result = Integer.parseInt(configuredValue);
+  if (result < 0) {
+LOGGER.warn("Value of carbon.trash.expiration.days is negative, taking default value");
+result = Integer.parseInt(CARBON_TRASH_EXPIRATION_DAYS_DEFAULT);
+  }
+} catch (NumberFormatException e) {
+  LOGGER.error("Error happened while parsing", e);

Review comment:
   ```suggestion
 LOGGER.error("Invalid value configured for " + "CarbonCommonConstants.CARBON_TRASH_EXPIRATION_DAYS, considering the default value", e);
   ```
   
   Please refactor it: in case of failure it should also take the default value (right now it ends up as 0), and the common message can be moved to a variable to reduce the code and keep it clean.
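The refactor the reviewer asks for can be sketched as below. This is a minimal, illustrative sketch only: `parseDaysOrDefault` and `DEFAULT_TRASH_EXPIRATION_DAYS` (assumed here to be "7") are hypothetical names, not CarbonData API. Both a negative value and an unparsable value fall back to the same default instead of 0, and the shared message lives in one variable.

```java
class TrashExpirationParse {
  // Hypothetical default; the real value comes from CarbonCommonConstants.
  static final String DEFAULT_TRASH_EXPIRATION_DAYS = "7";

  static int parseDaysOrDefault(String configuredValue) {
    // One shared message for both failure paths, as the reviewer suggests.
    String fallbackMsg = "Invalid value configured for carbon.trash.expiration.days,"
        + " considering the default value";
    int fallback = Integer.parseInt(DEFAULT_TRASH_EXPIRATION_DAYS);
    try {
      int result = Integer.parseInt(configuredValue);
      if (result < 0) {
        System.err.println(fallbackMsg);  // LOGGER.warn in the real class
        return fallback;
      }
      return result;
    } catch (NumberFormatException e) {
      System.err.println(fallbackMsg);    // LOGGER.error in the real class
      return fallback;                     // parse failure also takes the default, not 0
    }
  }

  public static void main(String[] args) {
    System.out.println(parseDaysOrDefault("3"));   // 3
    System.out.println(parseDaysOrDefault("-1"));  // 7 (negative -> default)
    System.out.println(parseDaysOrDefault("abc")); // 7 (unparsable -> default)
  }
}
```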

##
File path: core/src/main/java/org/apache/carbondata/core/util/DeleteLoadFolders.java
##
@@ -113,12 +116,24 @@ private static void physicalFactAndMeasureMetadataDeletion(CarbonTable carbonTab
 SegmentUpdateStatusManager updateStatusManager =
 new SegmentUpdateStatusManager(carbonTable, currLoadDetails);
 for (final LoadMetadataDetails oneLoad : loadDetails) {
-  if (checkIfLoadCanBeDeletedPhysically(oneLoad, isForceDelete)) {
+  if (checkIfLoadCanBeDeletedPhysically(oneLoad, isForceDelete, carbonTable
+  .getAbsoluteTableIdentifier())) {
 try {
+  // if insert in progress, then move it to trash
+  if (oneLoad.getSegmentStatus() == SegmentStatus.INSERT_IN_PROGRESS && !carbonTable
+  .isHivePartitionTable()) {
+// move this segment to trash
+TrashUtil.copyDataToTrashBySegment(FileFactory.getCarbonFile(CarbonTablePath
+.getFactDir(carbonTable.getTablePath()) + "/Part0/Segment_" + oneLoad

Review comment:
   do not hardcode

##
File path: core/src/main/java/org/apache/carbondata/core/util/path/CarbonTablePath.java
##
@@ -47,6 +47,7 @@
   public static final String BATCH_PREFIX = "_batchno";
   private static final String LOCK_DIR = "LockFiles";
 
+  public static final String SEGMENTS_METADATA_FOLDER = "segments";

Review comment:
   also change `getSegmentFilesLocation` and `getSegmentFilePath` in this class, and check once for the other places that use the `"segments"` string

##
File path: core/src/main/java/org/apache/carbondata/core/metadata/SegmentFileStore.java
##
@@ -1143,28 +1148,62 @@ public static void cleanSegments(CarbonTable table,
* @throws IOException
*/
   public static void deleteSegment(String tablePath, Segment segment,
-  List<PartitionSpec> partitionSpecs,
-  SegmentUpdateStatusManager updateStatusManager) throws Exception {
+  List<PartitionSpec> partitionSpecs, SegmentUpdateStatusManager updateStatusManager,
+  SegmentStatus segmentStatus, Boolean isPartitionTable, String timeStampForTrashFolder)
+  throws Exception {
 SegmentFileStore fileStore = new SegmentFileStore(tablePath, segment.getSegmentFileName());
 List<String> indexOrMergeFiles = fileStore.readIndexFiles(SegmentStatus.SUCCESS, true,
 FileFactory.getConfiguration());
+List<String> filesToDelete = new ArrayList<>();
 Map<String, List<String>> indexFilesMap = fileStore.getIndexFilesMap();
 for (Map.Entry<String, List<String>> entry : indexFilesMap.entrySet()) {
-  FileFactory.deleteFile(entry.getKey());
+  // Move the file to the trash folder in case the segment status is insert in progress
+  if (segmentStatus == SegmentStatus.INSERT_IN_PROGRESS && isPartitionTable) {
+TrashUtil.copyDataToTrashFolderByFile(tablePath, entry.getKey(), timeStampForTrashFolder +

Review comment:
   you are passing this many arguments in many places; can we refactor it to reduce the changes?
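One way to address this comment (purely illustrative; `TrashMoveContext` and its fields are hypothetical names, not part of the PR) is to bundle the trash-related arguments into a small carrier object, so each call site passes one context instead of several loose parameters:

```java
// Immutable carrier for the arguments that travel together into the trash-copy
// calls; a deleteSegment-style method would then take one context parameter.
final class TrashMoveContext {
  final String tablePath;
  final boolean isPartitionTable;
  final String timeStampForTrashFolder;

  TrashMoveContext(String tablePath, boolean isPartitionTable, String timeStampForTrashFolder) {
    this.tablePath = tablePath;
    this.isPartitionTable = isPartitionTable;
    this.timeStampForTrashFolder = timeStampForTrashFolder;
  }
}
```

Call sites would then look like `deleteSegment(segment, updateStatusManager, segmentStatus, context)`, so adding a further trash parameter later touches one class instead of every caller.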

##
File path: core/src/main/java/org/apache/carbondata/core/util/path/TrashUtil.java
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You 

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-27 Thread GitBox


CarbonDataQA1 commented on pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#issuecomment-717678439


   Build Failed  with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4711/
   







[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-27 Thread GitBox


CarbonDataQA1 commented on pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#issuecomment-717678491


   Build Failed  with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2954/
   







[GitHub] [carbondata] QiangCai commented on a change in pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-27 Thread GitBox


QiangCai commented on a change in pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#discussion_r513152963



##
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/carbonTableSchemaCommon.scala
##
@@ -121,7 +121,7 @@ case class UpdateTableModel(
 updatedTimeStamp: Long,
 var executorErrors: ExecutionErrors,
 deletedSegments: Seq[Segment],
-loadAsNewSegment: Boolean = false)
+loadAsNewSegment: Boolean = true)

Review comment:
   remove it, always update into new segment









[GitHub] [carbondata] QiangCai commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

2020-10-27 Thread GitBox


QiangCai commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r513147277



##
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##
@@ -415,44 +415,41 @@ public boolean accept(CarbonFile pathName) {
   }
 
   /**
-   * Return all delta file for a block.
-   * @param segmentId
-   * @param blockName
-   * @return
+   * Get all delete delta files of the block of specified segment.
+   * Actually, delete delta file name is generated from each SegmentUpdateDetails.
+   *
+   * @param seg the segment which is to find block and its delete delta files
+   * @param blockName the specified block of the segment
+   * @return delete delta file list of the block
*/
-  public CarbonFile[] getDeleteDeltaFilesList(final Segment segmentId, final String blockName) {
+  public List<String> getDeleteDeltaFilesList(final Segment seg, final String blockName) {
+List<String> deleteDeltaFileList = new ArrayList<>();
 String segmentPath = CarbonTablePath.getSegmentPath(
-identifier.getTablePath(), segmentId.getSegmentNo());
-CarbonFile segDir =
-FileFactory.getCarbonFile(segmentPath);
+identifier.getTablePath(), seg.getSegmentNo());
+
 for (SegmentUpdateDetails block : updateDetails) {
   if ((block.getBlockName().equalsIgnoreCase(blockName)) &&
-  (block.getSegmentName().equalsIgnoreCase(segmentId.getSegmentNo()))
-  && !CarbonUpdateUtil.isBlockInvalid(block.getSegmentStatus())) {
-final long deltaStartTimestamp =
-getStartTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
-final long deltaEndTimeStamp =
-getEndTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
-
-return segDir.listFiles(new CarbonFileFilter() {
-
-  @Override
-  public boolean accept(CarbonFile pathName) {
-String fileName = pathName.getName();
-if (pathName.getSize() > 0
-&& fileName.endsWith(CarbonCommonConstants.DELETE_DELTA_FILE_EXT)) {
-  String blkName = fileName.substring(0, fileName.lastIndexOf("-"));
-  long timestamp =
-  Long.parseLong(CarbonTablePath.DataFileUtil.getTimeStampFromFileName(fileName));
-  return blockName.equals(blkName) && timestamp <= deltaEndTimeStamp
-  && timestamp >= deltaStartTimestamp;
-}
-return false;
-  }
-});
+  (block.getSegmentName().equalsIgnoreCase(seg.getSegmentNo())) &&
+  !CarbonUpdateUtil.isBlockInvalid(block.getSegmentStatus())) {
+Set<String> deltaFileTimestamps = block.getDeltaFileStamps();
+String deleteDeltaFilePrefix = segmentPath + CarbonCommonConstants.FILE_SEPARATOR +
+blockName + CarbonCommonConstants.HYPHEN;
+if (deltaFileTimestamps != null && deltaFileTimestamps.size() > 0) {

Review comment:
   please add test case:
   1. set CarbonCommonConstants.CARBON_HORIZONTAL_COMPACTION_ENABLE to false
   2. update a different row in the same file three times (it should generate three delete delta files for the same blockname)
   3. set CarbonCommonConstants.CARBON_HORIZONTAL_COMPACTION_ENABLE to true
   4. update a different row in the same file again
   5. query the table to check the result
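The diff above replaces a directory listing with name construction: delete delta file names are rebuilt from the timestamps stored in each SegmentUpdateDetails. A minimal sketch of that construction path follows; the `.deletedelta` extension and the `<segmentPath>/<blockName>-<timestamp>` layout are taken from the diff and should be treated as assumptions about the on-disk naming.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class DeleteDeltaNames {
  // Rebuild delete delta file names from recorded timestamps instead of
  // listing and filtering the segment directory.
  static List<String> buildDeltaFileNames(String segmentPath, String blockName,
      Set<String> deltaFileTimestamps) {
    String prefix = segmentPath + "/" + blockName + "-";  // FILE_SEPARATOR + HYPHEN in the diff
    List<String> files = new ArrayList<>();
    for (String timestamp : deltaFileTimestamps) {
      files.add(prefix + timestamp + ".deletedelta");
    }
    return files;
  }

  public static void main(String[] args) {
    Set<String> stamps = new LinkedHashSet<>(Arrays.asList("100", "200"));
    System.out.println(buildDeltaFileNames("/store/t1/Fact/Part0/Segment_0", "part-0-0", stamps));
  }
}
```

This avoids touching the file system per block, which is why the reviewer's test case above matters: files written while horizontal compaction was disabled must still be reachable from the recorded timestamps.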

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-27 Thread GitBox


CarbonDataQA1 commented on pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#issuecomment-717496054


   Build Failed  with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2953/
   







[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-27 Thread GitBox


CarbonDataQA1 commented on pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#issuecomment-717492977


   Build Failed  with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4710/
   







[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-27 Thread GitBox


ajantha-bhat commented on a change in pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#discussion_r512886166



##
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/carbonTableSchemaCommon.scala
##
@@ -121,7 +121,7 @@ case class UpdateTableModel(
 updatedTimeStamp: Long,
 var executorErrors: ExecutionErrors,
 deletedSegments: Seq[Segment],
-loadAsNewSegment: Boolean = false)
+loadAsNewSegment: Boolean = true)

Review comment:
   Is there no use of the horizontal compaction feature after this change?
   @QiangCai, @marchpure: do we need to mark it as deprecated?









[GitHub] [carbondata] akashrn5 commented on pull request #3972: [CARBONDATA-4042]Launch same number of task as select query for insert into select and ctas cases when target table is of no_sort

2020-10-27 Thread GitBox


akashrn5 commented on pull request #3972:
URL: https://github.com/apache/carbondata/pull/3972#issuecomment-717401461


   LGTM







[GitHub] [carbondata] akashrn5 commented on pull request #3981: [CARBONDATA-4031] Incorrect query result after Update/Delete and Inse…

2020-10-27 Thread GitBox


akashrn5 commented on pull request #3981:
URL: https://github.com/apache/carbondata/pull/3981#issuecomment-717398581


   @marchpure as from the description, I understood that the table status file after insert overwrite does not contain the `updatetablestatusfile` name. So why do we need to add a boolean variable and pass this boolean from many places? This may lead to confusion later. So why don't we take care of it just while updating the table status in case of insert overwrite?
   
   I mean to say, while updating the status after insert overwrite, we can first copy the existing meta and then prepare the new entry, and the update case can be handled there; this way, it will lead to fewer changes and less confusion later.







[GitHub] [carbondata] akashrn5 commented on pull request #3981: [CARBONDATA-4031] Incorrect query result after Update/Delete and Inse…

2020-10-27 Thread GitBox


akashrn5 commented on pull request #3981:
URL: https://github.com/apache/carbondata/pull/3981#issuecomment-717398850


   > > @marchpure please correct the PR title and please give detailed 
explanation of the issue and the fix in the PR description.
   > 
   > Yes. I have modified the PR desc as your suggestion
   
   PR title is still not formatted; please check







[GitHub] [carbondata] akashrn5 commented on a change in pull request #3981: [CARBONDATA-4031] Incorrect query result after Update/Delete and Inse…

2020-10-27 Thread GitBox


akashrn5 commented on a change in pull request #3981:
URL: https://github.com/apache/carbondata/pull/3981#discussion_r512875192



##
File path: core/src/main/java/org/apache/carbondata/core/mutate/CarbonUpdateUtil.java
##
@@ -276,14 +278,13 @@ public static boolean updateTableMetadataStatus(Set<Segment> updatedSegmentsList
 SegmentStatusManager.readLoadMetadata(metaDataFilepath);
 
 for (LoadMetadataDetails loadMetadata : listOfLoadFolderDetailsArray) {
+  if (isUpdateStatusFileUpdateRequired &&
+  loadMetadata.getLoadName().equalsIgnoreCase("0")) {
+loadMetadata.setUpdateStatusFileName(
+CarbonUpdateUtil.getUpdateStatusFileName(updatedTimeStamp));
+  }
 
   if (isTimestampUpdateRequired) {
-// we are storing the link between the 2 status files in the segment 0 only.

Review comment:
   why is this condition moved up? And why not `isTimestampUpdateRequired`?









[GitHub] [carbondata] vikramahuja1001 commented on pull request #3952: [CARBONDATA-4006] Fix for currentUser as NULL in getcount method during index server fallback mode

2020-10-27 Thread GitBox


vikramahuja1001 commented on pull request #3952:
URL: https://github.com/apache/carbondata/pull/3952#issuecomment-717378005


   @kunal642 @akashrn5 please merge this







[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3981: [CARBONDATA-4031] Incorrect query result after Update/Delete and Inse…

2020-10-27 Thread GitBox


CarbonDataQA1 commented on pull request #3981:
URL: https://github.com/apache/carbondata/pull/3981#issuecomment-717333719


   Build Success with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4709/
   







[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-27 Thread GitBox


CarbonDataQA1 commented on pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#issuecomment-717329144


   Build Failed  with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2951/
   







[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3981: [CARBONDATA-4031] Incorrect query result after Update/Delete and Inse…

2020-10-27 Thread GitBox


CarbonDataQA1 commented on pull request #3981:
URL: https://github.com/apache/carbondata/pull/3981#issuecomment-717328681


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2952/
   







[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-27 Thread GitBox


CarbonDataQA1 commented on pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#issuecomment-717325746


   Build Failed  with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4708/
   







[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-27 Thread GitBox


ajantha-bhat commented on a change in pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#discussion_r512760255



##
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonDataRDDFactory.scala
##
@@ -342,7 +342,8 @@ object CarbonDataRDDFactory {
 
 try {
   if (!carbonLoadModel.isCarbonTransactionalTable || segmentLock.lockWithRetries()) {
-if (updateModel.isDefined && !updateModel.get.loadAsNewSegment) {
+if (updateModel.isDefined && (!updateModel.get.loadAsNewSegment

Review comment:
   Is just setting updateModel.get.loadAsNewSegment = true in the update flow not enough? Please explain why this change is required.

##
File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/alterTable/TestAlterTableSortColumnsProperty.scala
##
@@ -739,14 +739,14 @@ class TestAlterTableSortColumnsProperty extends QueryTest with BeforeAndAfterAll
 
 val table = CarbonEnv.getCarbonTable(Option("default"), 
tableName)(sqlContext.sparkSession)
 val tablePath = table.getTablePath
-(0 to 2).foreach { segmentId =>
+(0 to 3).foreach { segmentId =>

Review comment:
   why is this change required? I think you can revert this

##
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/CarbonProjectForUpdateCommand.scala
##
@@ -340,7 +340,8 @@ private[sql] case class CarbonProjectForUpdateCommand(
   case _ => sys.error("")
 }
 
-val updateTableModel = UpdateTableModel(true, currentTime, executorErrors, deletedSegments)
+val updateTableModel = UpdateTableModel(true, currentTime, executorErrors, deletedSegments,
+  !carbonRelation.carbonTable.isHivePartitionTable)

Review comment:
   so, for partition tables, update will not write as a new segment? How is the dirty data issue handled for the partition table update scenario?

##
File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/allqueries/TestPruneUsingSegmentMinMax.scala
##
@@ -103,7 +103,7 @@ class TestPruneUsingSegmentMinMax extends QueryTest with BeforeAndAfterAll {
 sql("update carbon set(a)=(10) where a=1").collect()
 checkAnswer(sql("select count(*) from carbon where a=10"), Seq(Row(3)))
 showCache = sql("show metacache on table carbon").collect()
-assert(showCache(0).get(2).toString.equalsIgnoreCase("6/8 index files cached"))
+assert(showCache(0).get(2).toString.equalsIgnoreCase("1/6 index files cached"))

Review comment:
   how did 8 index files become 6 (were the other two stale?), and out of 6 why is only 1 cached now?









[GitHub] [carbondata] marchpure closed pull request #3934: [WIP] Support Global Unique Id for SegmentNo

2020-10-27 Thread GitBox


marchpure closed pull request #3934:
URL: https://github.com/apache/carbondata/pull/3934


   







[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3988: [CARBONDATA-4037] Improve the table status and segment file writing

2020-10-27 Thread GitBox


CarbonDataQA1 commented on pull request #3988:
URL: https://github.com/apache/carbondata/pull/3988#issuecomment-717268180


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2950/
   







[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3988: [CARBONDATA-4037] Improve the table status and segment file writing

2020-10-27 Thread GitBox


CarbonDataQA1 commented on pull request #3988:
URL: https://github.com/apache/carbondata/pull/3988#issuecomment-717267646


   Build Failed  with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4707/
   







[jira] [Closed] (CARBONDATA-3903) Documentation Issue in Github Docs Link https://github.com/apache/carbondata/tree/master/docs

2020-10-27 Thread PURUJIT CHAUGULE (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PURUJIT CHAUGULE closed CARBONDATA-3903.


Documentation update is resolved and verified.

> Documentation Issue in Github  Docs Link 
> https://github.com/apache/carbondata/tree/master/docs
> --
>
> Key: CARBONDATA-3903
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3903
> Project: CarbonData
>  Issue Type: Bug
>  Components: docs
>Affects Versions: 2.0.1
> Environment: https://github.com/apache/carbondata/tree/master/docs
>Reporter: PURUJIT CHAUGULE
>Priority: Minor
> Fix For: 2.1.0
>
>
> dml-of-carbondata.md
> LOAD DATA:
>  * Mention Each Load is considered as a Segment.
>  * Give all possible options for SORT_SCOPE like 
> GLOBAL_SORT/LOCAL_SORT/NO_SORT (with explanation of difference between each 
> type).
>  * Add Example Of complete Load query with/without use of OPTIONS.
> INSERT DATA:
>  * Mention each insert is a Segment.
> LOAD Using Static/Dynamic Partitioning:
>  * Can give a hyperlink to Static/Dynamic partitioning.
> UPDATE/DELETE:
>  * Mention about delta files concept in update and delete.
> DELETE:
>  * Add example for deletion of all records from a table (delete from 
> tablename).
> COMPACTION:
>  * Can mention Minor compaction of two types Auto and Manual( 
> carbon.auto.load.merge =true/false), and that if 
> carbon.auto.load.merge=false, trigger should be done manually.
>  * Hyperlink to Configurable properties of Compaction.
>  * Mention that compacted segments do not get cleaned automatically and 
> should be triggered manually using clean files.
>  
> flink-integration-guide.md
>  * Mention what stages are and how they are used.
>  * Process of insertion, deletion of stages in carbontable. (How is it stored 
> in carbontable).
>  
> language-manual.md
>  * Mention Compaction Hyperlink in DML section.
>  
> spatial-index-guide.md
>  * Mention the TBLPROPERTIES supported / not supported for Geo table.
>  * Mention Spatial Index does not make a new column.
>  * CTAS from one geo table to another does not create another Geo table can 
> be mentioned.
>  * Mention that a certain combination of Spatial Index table properties need 
> to be added in create table, without which a geo table does not get created.
>  * Mention that we cannot alter columns (change datatype, change name, drop) 
> mentioned in spatial_index.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [carbondata] marchpure commented on pull request #3981: [CARBONDATA-4031] Incorrect query result after Update/Delete and Inse…

2020-10-27 Thread GitBox


marchpure commented on pull request #3981:
URL: https://github.com/apache/carbondata/pull/3981#issuecomment-717259421


   retest this please







[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-27 Thread GitBox


CarbonDataQA1 commented on pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#issuecomment-717258126


   Build Failed  with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2949/
   







[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3999: [CARBONDATA-4044] Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-27 Thread GitBox


CarbonDataQA1 commented on pull request #3999:
URL: https://github.com/apache/carbondata/pull/3999#issuecomment-717257966


   Build Failed  with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4706/
   







[jira] [Closed] (CARBONDATA-4010) "Alter table set tblproperties should support long string columns" and bad record handling of long string data for string columns need to be updated in https://github

2020-10-27 Thread Chetan Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chetan Bhat closed CARBONDATA-4010.
---

Issue is fixed in Carbon 2.1 version.

> "Alter table set tblproperties should support long string columns" and bad 
> record handling of long string data for string columns need to be updated in 
> https://github.com/apache/carbondata/blob/master/docs
> -
>
> Key: CARBONDATA-4010
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4010
> Project: CarbonData
>  Issue Type: Bug
>  Components: docs
>Affects Versions: 2.1.0
> Environment: https://github.com/apache/carbondata/blob/master/docs
>Reporter: Chetan Bhat
>Priority: Minor
> Fix For: 2.1.0
>
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> "Alter table set tblproperties should support long string columns" and bad 
> record handling of long string data for string columns need to be updated in 
> https://github.com/apache/carbondata/blob/master/docs





[jira] [Closed] (CARBONDATA-3932) need to change discovery.uri and add hive.metastore.uri,hive.config.resources in https://github.com/apache/carbondata/blob/master/docs/prestosql-guide.md#presto-mul

2020-10-27 Thread Chetan Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chetan Bhat closed CARBONDATA-3932.
---

Issue is fixed in Carbon 2.1 version.

> need to change discovery.uri and add  
> hive.metastore.uri,hive.config.resources  in 
> https://github.com/apache/carbondata/blob/master/docs/prestosql-guide.md#presto-multinode-cluster-setup-for-carbondata
> -
>
> Key: CARBONDATA-3932
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3932
> Project: CarbonData
>  Issue Type: Bug
>  Components: docs, presto-integration
>Affects Versions: 2.0.1
> Environment: Documentation
>Reporter: Chetan Bhat
>Priority: Minor
> Fix For: 2.1.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Need to change discovery.uri=:8086 to 
> discovery.uri=http://:8086 in 
> [https://github.com/apache/carbondata/blob/master/docs/prestosql-guide.md#presto-multinode-cluster-setup-for-carbondata]
> Need to add these configurations as well in carbondata.properties and to be 
> updated in carbondata-presto opensource doc .
> 1.hive.metastore.uri
> 2.hive.config.resources
> Ex : -
> connector.name=carbondata
> hive.metastore.uri=thrift://10.21.18.106:9083
> hive.config.resources=/opt/HA/C10/install/hadoop/datanode/etc/hadoop/core-site.xml,/opt/HA/C10/install/hadoop/datanode/etc/hadoop/hdfs-site.xml
>  





[jira] [Closed] (CARBONDATA-3901) Documentation issues in https://github.com/apache/carbondata/tree/master/docs

2020-10-27 Thread Chetan Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chetan Bhat closed CARBONDATA-3901.
---

Issue is fixed in Carbon 2.1 version

> Documentation issues in https://github.com/apache/carbondata/tree/master/docs
> -
>
> Key: CARBONDATA-3901
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3901
> Project: CarbonData
>  Issue Type: Bug
>  Components: docs
>Affects Versions: 2.0.1
> Environment: https://github.com/apache/carbondata/tree/master/docs
>Reporter: Chetan Bhat
>Priority: Minor
> Fix For: 2.1.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> *Issue 1  -* 
> [https://github.com/apache/carbondata/blob/master/docs/ddl-of-carbondata.mdSORT_SCOPE]
> Sort scope of the load. Options include no sort, local sort, batch sort and 
> global sort --> Batch sort to be removed as it's not supported.
> *Issue 2 -* 
> [https://github.com/apache/carbondata/blob/master/docs/streaming-guide.md#close-stream]
>    CLOSE STREAM link is not working.
>  





[jira] [Closed] (CARBONDATA-3824) Error when Secondary index tried to be created on table that does not exist is not correct.

2020-10-27 Thread Chetan Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chetan Bhat closed CARBONDATA-3824.
---

Issue is fixed in Carbon 2.1 version.

> Error when Secondary index tried to be created on table that does not exist 
> is not correct.
> ---
>
> Key: CARBONDATA-3824
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3824
> Project: CarbonData
>  Issue Type: Bug
>  Components: data-query
>Affects Versions: 2.0.0
> Environment: Spark 2.3.2, Spark 2.4.5
>Reporter: Chetan Bhat
>Priority: Minor
> Fix For: 2.1.0
>
>
> *Issue :-*
> Table uniqdata_double does not exist.
> Secondary index tried to be created on table. Error message is incorrect.
> CREATE INDEX indextable2 ON TABLE uniqdata_double (DOB) AS 'carbondata' 
> PROPERTIES('carbon.column.compressor'='zstd');
> *Error: java.lang.RuntimeException: Operation not allowed on non-carbon table 
> (state=,code=0)*
>  
> *Expected :-*
> *Error: java.lang.RuntimeException: Table does not exist (state=,code=0)*





[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata

2020-10-27 Thread GitBox


CarbonDataQA1 commented on pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#issuecomment-717190630


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2948/
   







[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata

2020-10-27 Thread GitBox


CarbonDataQA1 commented on pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#issuecomment-717189509


   Build Success with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4705/
   







[GitHub] [carbondata] maheshrajus commented on a change in pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

2020-10-27 Thread GitBox


maheshrajus commented on a change in pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#discussion_r512579906



##
File path: 
integration/spark/src/main/scala/org/apache/carbondata/spark/load/DataLoadProcessBuilderOnSpark.scala
##
@@ -143,10 +143,18 @@ object DataLoadProcessBuilderOnSpark {
 
 var numPartitions = CarbonDataProcessorUtil.getGlobalSortPartitions(
   
configuration.getDataLoadProperty(CarbonCommonConstants.LOAD_GLOBAL_SORT_PARTITIONS))
-if (numPartitions <= 0) {
-  numPartitions = convertRDD.partitions.length
+
+// if numPartitions user does not specify and not specified in config then 
dynamically calculate
+if (numPartitions == 0) {
+  // get the size in bytes and convert to size in MB
+  val sizeOfDataFrame = SizeEstimator.estimate(inputRDD)/100
+  // data frame size can not be more than Int size
+  numPartitions = sizeOfDataFrame.toInt/inputRDD.getNumPartitions

Review comment:
   @akashrn5 you reviewed the old code; I have now pushed the new code again 
and testing is in progress









[GitHub] [carbondata] akashrn5 commented on a change in pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

2020-10-27 Thread GitBox


akashrn5 commented on a change in pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#discussion_r503187128



##
File path: 
integration/spark/src/main/scala/org/apache/carbondata/spark/load/DataLoadProcessBuilderOnSpark.scala
##
@@ -143,10 +143,18 @@ object DataLoadProcessBuilderOnSpark {
 
 var numPartitions = CarbonDataProcessorUtil.getGlobalSortPartitions(
   
configuration.getDataLoadProperty(CarbonCommonConstants.LOAD_GLOBAL_SORT_PARTITIONS))
-if (numPartitions <= 0) {
-  numPartitions = convertRDD.partitions.length
+
+// if numPartitions user does not specify and not specified in config then 
dynamically calculate
+if (numPartitions == 0) {
+  // get the size in bytes and convert to size in MB
+  val sizeOfDataFrame = SizeEstimator.estimate(inputRDD)/100
+  // data frame size can not be more than Int size
+  numPartitions = sizeOfDataFrame.toInt/inputRDD.getNumPartitions

Review comment:
   I think here you should try to get the partitions based on the total 
load size and the partition size, not the partition number. Please check 
again and handle correctly.
   
   @QiangCai please have a look once
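   As a rough illustration of the suggestion above (deriving the partition count from the total load size and a target per-partition size rather than from the input partition count), a minimal sketch follows. The class name, method name, and the idea of a fixed MB-based target are illustrative assumptions, not CarbonData's actual code:

   ```java
   // Hypothetical sketch: estimate global-sort partitions from the total load
   // size and a target per-partition size. Names and the size unit (MB) are
   // assumptions for illustration only.
   public final class GlobalSortPartitionEstimator {

     // Returns at least 1; rounds up so a trailing partial chunk
     // still gets its own partition.
     public static int dynamicPartitions(long totalLoadSizeMB, long targetPartitionSizeMB) {
       if (targetPartitionSizeMB <= 0) {
         throw new IllegalArgumentException("target partition size must be positive");
       }
       // ceiling division without floating point
       long partitions = (totalLoadSizeMB + targetPartitionSizeMB - 1) / targetPartitionSizeMB;
       return (int) Math.max(1L, partitions);
     }
   }
   ```

   For example, a 1024 MB load with a 512 MB target yields 2 partitions, and a load smaller than the target still yields 1.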









[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

2020-10-27 Thread GitBox


CarbonDataQA1 commented on pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#issuecomment-717143281


   Build Failed  with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4704/
   







[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3912: [CARBONDATA-3977] Global sort partitions should be determined dynamically

2020-10-27 Thread GitBox


CarbonDataQA1 commented on pull request #3912:
URL: https://github.com/apache/carbondata/pull/3912#issuecomment-717141636


   Build Failed  with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2947/
   







[GitHub] [carbondata] vikramahuja1001 commented on pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata

2020-10-27 Thread GitBox


vikramahuja1001 commented on pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#issuecomment-717131900


   retest this please







[jira] [Resolved] (CARBONDATA-4007) ArrayIndexOutofBoundsException when IUD operations performed using SDK

2020-10-27 Thread Kunal Kapoor (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-4007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Kapoor resolved CARBONDATA-4007.
--
Fix Version/s: 2.1.0
   Resolution: Fixed

> ArrayIndexOutofBoundsException when IUD operations performed using SDK
> --
>
> Key: CARBONDATA-4007
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4007
> Project: CarbonData
>  Issue Type: Bug
>  Components: data-load
>Affects Versions: 2.1.0
> Environment: Spark 2.4.5 jars used for compilation of SDK 
>Reporter: Chetan Bhat
>Priority: Major
> Fix For: 2.1.0
>
>  Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> Issue -
> ArrayIndexOutofBoundsException when IUD operations performed using SDK.
> Exception -
> java.lang.ArrayIndexOutOfBoundsException: 1
>  at 
> org.apache.carbondata.hadoop.api.CarbonTableOutputFormat$1.close(CarbonTableOutputFormat.java:579)
>  at org.apache.carbondata.sdk.file.CarbonIUD.delete(CarbonIUD.java:110)
>  at 
> org.apache.carbondata.sdk.file.CarbonIUD.deleteExecution(CarbonIUD.java:238)
>  at org.apache.carbondata.sdk.file.CarbonIUD.closeDelete(CarbonIUD.java:123)
>  at org.apache.carbondata.sdk.file.CarbonIUD.commit(CarbonIUD.java:221)
>  at com.apache.spark.SdkIUD_Test.testDelete(SdkIUD_Test.java:130)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>  at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>  at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>  at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>  at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>  at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>  at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>  at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>  at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>  at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>  at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
>  at 
> com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:33)
>  at 
> com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:230)
>  at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:58)





[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3916: [CARBONDATA-3935]Support partition table transactional write in presto

2020-10-27 Thread GitBox


CarbonDataQA1 commented on pull request #3916:
URL: https://github.com/apache/carbondata/pull/3916#issuecomment-717126913


   Build Failed  with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2943/
   







[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3916: [CARBONDATA-3935]Support partition table transactional write in presto

2020-10-27 Thread GitBox


CarbonDataQA1 commented on pull request #3916:
URL: https://github.com/apache/carbondata/pull/3916#issuecomment-717126479


   Build Failed  with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4700/
   







[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata

2020-10-27 Thread GitBox


vikramahuja1001 commented on a change in pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#discussion_r512553653



##
File path: docs/dml-of-carbondata.md
##
@@ -562,3 +563,50 @@ CarbonData DML statements are documented here,which 
includes:
   ```
   CLEAN FILES FOR TABLE carbon_table
   ```
+
+## CLEAN FILES
+
+  Clean files command is used to remove the Compacted and Marked
+  For Delete Segments from the store. Carbondata also supports Trash
+  Folder where all the stale data is moved to after clean files
+  is called
+
+  There are several types of compaction
+
+  ```
+  CLEAN FILES ON TABLE TableName
+  ```
+
+  - **Minor Compaction**

Review comment:
   removed

##
File path: docs/dml-of-carbondata.md
##
@@ -562,3 +563,50 @@ CarbonData DML statements are documented here,which 
includes:
   ```
   CLEAN FILES FOR TABLE carbon_table
   ```
+
+## CLEAN FILES
+
+  Clean files command is used to remove the Compacted and Marked

Review comment:
   linked

##
File path: 
integration/spark/src/main/scala/org/apache/carbondata/cleanfiles/CleanFilesUtil.scala
##
@@ -0,0 +1,409 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.cleanfiles
+
+import java.util
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ListBuffer
+
+import org.apache.spark.sql.{AnalysisException, CarbonEnv, Row, SparkSession}
+import org.apache.spark.sql.index.CarbonIndexUtil
+
+import org.apache.carbondata.common.logging.LogServiceFactory
+import org.apache.carbondata.core.constants.CarbonCommonConstants
+import org.apache.carbondata.core.datastore.filesystem.CarbonFile
+import org.apache.carbondata.core.datastore.impl.FileFactory
+import org.apache.carbondata.core.exception.ConcurrentOperationException
+import org.apache.carbondata.core.indexstore.PartitionSpec
+import org.apache.carbondata.core.locks.{CarbonLockFactory, CarbonLockUtil, 
ICarbonLock, LockUsage}
+import org.apache.carbondata.core.metadata.{AbsoluteTableIdentifier, 
CarbonMetadata, SegmentFileStore}
+import org.apache.carbondata.core.metadata.schema.table.CarbonTable
+import org.apache.carbondata.core.mutate.CarbonUpdateUtil
+import org.apache.carbondata.core.statusmanager.{LoadMetadataDetails, 
SegmentStatus, SegmentStatusManager}
+import org.apache.carbondata.core.util.{CarbonProperties, CarbonUtil}
+import org.apache.carbondata.core.util.path.{CarbonTablePath, TrashUtil}
+import org.apache.carbondata.processing.loading.TableProcessingOperations
+import org.apache.carbondata.processing.loading.model.CarbonLoadModel
+
+object CleanFilesUtil {
+  private val LOGGER = 
LogServiceFactory.getLogService(this.getClass.getCanonicalName)
+
+  /**
+   * The method deletes all data if forceTableClean  and clean garbage 
segment
+   * (MARKED_FOR_DELETE state) if forceTableClean 
+   *
+   * @param dbName : Database name
+   * @param tableName  : Table name
+   * @param tablePath  : Table path
+   * @param carbonTable: CarbonTable Object  in case of 
force clean
+   * @param forceTableClean:  for force clean it will delete all 
data
+   *it will clean garbage segment 
(MARKED_FOR_DELETE state)
+   * @param currentTablePartitions : Hive Partitions  details
+   */
+  def cleanFiles(
+dbName: String,
+tableName: String,
+tablePath: String,
+timeStamp: String,
+carbonTable: CarbonTable,
+forceTableClean: Boolean,
+currentTablePartitions: Option[Seq[PartitionSpec]] = None,
+truncateTable: Boolean = false): Unit = {
+var carbonCleanFilesLock: ICarbonLock = null
+val absoluteTableIdentifier = if (forceTableClean) {
+  AbsoluteTableIdentifier.from(tablePath, dbName, tableName, tableName)
+} else {
+  carbonTable.getAbsoluteTableIdentifier
+}
+try {
+  val errorMsg = "Clean files request is failed for " +
+s"$dbName.$tableName" +
+". Not able to acquire the clean files lock due to another clean files 
" +
+"operation is running in the background."
+  // in case of force clean the lock is not required
+  if 

[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata

2020-10-27 Thread GitBox


vikramahuja1001 commented on a change in pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#discussion_r512553372



##
File path: 
core/src/main/java/org/apache/carbondata/core/util/path/TrashUtil.java
##
@@ -0,0 +1,162 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.core.util.path;
+
+import java.io.File;
+import java.io.IOException;
+import java.sql.Timestamp;
+import java.util.ArrayList;
+import java.util.List;
+
+import org.apache.carbondata.common.logging.LogServiceFactory;
+import org.apache.carbondata.core.constants.CarbonCommonConstants;
+import org.apache.carbondata.core.datastore.filesystem.CarbonFile;
+import org.apache.carbondata.core.datastore.impl.FileFactory;
+import org.apache.carbondata.core.exception.CarbonFileException;
+import org.apache.carbondata.core.util.CarbonUtil;
+
+import org.apache.commons.io.FileUtils;
+
+import org.apache.log4j.Logger;
+
+public final class TrashUtil {
+
+  private static final Logger LOGGER =
+  LogServiceFactory.getLogService(CarbonUtil.class.getName());
+
+  /**
+   * The below method copies a complete file to the trash folder. Provide 
necessary
+   * timestamp and the segment number in the suffixToAdd  variable, so that 
the proper folder is
+   * created in the trash folder.
+   */
+  public static void copyDataToTrashFolderByFile(String carbonTablePath, 
String pathOfFileToCopy,
+  String suffixToAdd) {
+String trashFolderPath = CarbonTablePath.getTrashFolder(carbonTablePath) +
+CarbonCommonConstants.FILE_SEPARATOR + suffixToAdd;
+try {
+  if (new File(pathOfFileToCopy).exists()) {
+FileUtils.copyFileToDirectory(new File(pathOfFileToCopy), new 
File(trashFolderPath));
+LOGGER.info("File: " + pathOfFileToCopy + " successfully copied to the 
trash folder: "
++ trashFolderPath);
+  }
+} catch (IOException e) {
+  LOGGER.error("Unable to copy " + pathOfFileToCopy + " to the trash 
folder", e);
+}
+  }
+
+  /**
+   * The below method copies the complete segment folder to the trash folder. 
Provide necessary
+   * timestamp and the segment number in the suffixToAdd  variable, so that 
the proper folder is
+   * created in the trash folder.
+   */
+  public static void copyDataToTrashBySegment(CarbonFile path, String 
carbonTablePath,
+  String suffixToAdd) {
+String trashFolderPath = CarbonTablePath.getTrashFolder(carbonTablePath) +
+CarbonCommonConstants.FILE_SEPARATOR + suffixToAdd;
+try {
+  FileUtils.copyDirectory(new File(path.getAbsolutePath()), new 
File(trashFolderPath));
+  LOGGER.info("Segment: " + path.getAbsolutePath() + " has been copied to 
the trash folder" +
+  " successfully");
+} catch (IOException e) {
+  LOGGER.error("Unable to create the trash folder and copy data to it", e);
+}
+  }
+
+  /**
+   * The below method deletes timestamp subdirectories in the trash folder 
which have expired as
+   * per the user defined expiration time
+   */
+  public static void deleteAllDataFromTrashFolderByTimeStamp(String 
carbonTablePath, Long timeStamp)
+  throws IOException {
+String pathOfTrashFolder = CarbonTablePath.getTrashFolder(carbonTablePath);
+// Deleting the timestamp based subdirectories in the trashfolder by the 
given timestamp.
+if (FileFactory.isFileExist(pathOfTrashFolder)) {
+  try {
+List carbonFileList = 
FileFactory.getFolderList(pathOfTrashFolder);
+for (CarbonFile carbonFile : carbonFileList) {
+  String[] aB = 
carbonFile.getAbsolutePath().split(CarbonCommonConstants.FILE_SEPARATOR);

Review comment:
   different names for partition tables and normal tables, changed the 
variable name though

##
File path: 
core/src/main/java/org/apache/carbondata/core/util/path/TrashUtil.java
##
@@ -0,0 +1,162 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the 

[GitHub] [carbondata] asfgit closed pull request #3970: [CARBONDATA-4007] Fix multiple issues in SDK

2020-10-27 Thread GitBox


asfgit closed pull request #3970:
URL: https://github.com/apache/carbondata/pull/3970


   







[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata

2020-10-27 Thread GitBox


vikramahuja1001 commented on a change in pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#discussion_r512552552



##
File path: 
core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java
##
@@ -1049,7 +1049,7 @@ private static ReturnTuple isUpdateRequired(boolean 
isForceDeletion, CarbonTable
   }
 
   public static void deleteLoadsAndUpdateMetadata(CarbonTable carbonTable, 
boolean isForceDeletion,
-  List partitionSpecs) throws IOException {
+  List partitionSpecs, String timeStamp) throws IOException 
{

Review comment:
   I have changed this behaviour; after this change, even one clean files 
command can create multiple timestamp subdirectories. The user can use the 
tree command to list the files and use the timestamp subfolder as desired.

##
File path: 
core/src/main/java/org/apache/carbondata/core/util/CarbonProperties.java
##
@@ -2116,6 +2086,20 @@ public int getMaxSIRepairLimit(String dbName, String 
tableName) {
 return Math.abs(Integer.parseInt(thresholdValue));
   }
 
+  /**
+   * The below method returns the microseconds after which the trash folder 
will expire
+   */
+  public long getTrashFolderExpirationTime() {
+String configuredValue = 
getProperty(CarbonCommonConstants.TRASH_EXPIRATION_DAYS,
+CarbonCommonConstants.TRASH_EXPIRATION_DAYS_DEFAULT);
+int result = Integer.parseInt(configuredValue);
+if (result < 0) {
+  result = Integer.parseInt(TRASH_EXPIRATION_DAYS_DEFAULT);

Review comment:
   done
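   As a hedged sketch of the clamping behaviour discussed above (a negative configured retention falls back to the default), the expiry check for a timestamp-named trash subdirectory could look like the following. The class name, the 7-day default, and the millisecond arithmetic are illustrative assumptions, not the actual CarbonData implementation:

   ```java
   // Hypothetical sketch: decide whether a timestamp-named trash subdirectory
   // has outlived the retention window. The default value and names are
   // assumptions for illustration only.
   public final class TrashExpiryCheck {

     static final int DEFAULT_RETENTION_DAYS = 7;  // assumed default

     public static boolean isExpired(long folderTimestampMs, long nowMs, int retentionDays) {
       // negative configured values fall back to the default,
       // mirroring the clamping in getTrashFolderExpirationTime
       int days = retentionDays < 0 ? DEFAULT_RETENTION_DAYS : retentionDays;
       long retentionMs = days * 24L * 60L * 60L * 1000L;
       return nowMs - folderTimestampMs > retentionMs;
     }
   }
   ```

   A folder 6 days old with a 7-day retention is kept; an 8-day-old folder is expired, whether the retention is 7 or a negative value clamped to the default.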









[GitHub] [carbondata] ajantha-bhat commented on pull request #3916: [CARBONDATA-3935]Support partition table transactional write in presto

2020-10-27 Thread GitBox


ajantha-bhat commented on pull request #3916:
URL: https://github.com/apache/carbondata/pull/3916#issuecomment-717124467


   @akashrn5 
   a) Please update the PR description about what all problems were there and 
what changes done to support it. Now it is not clear.
   b) Please confirm here whether the cluster test is passed. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata

2020-10-27 Thread GitBox


vikramahuja1001 commented on a change in pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#discussion_r512552754



##
File path: 
core/src/main/java/org/apache/carbondata/core/util/path/CarbonTablePath.java
##
@@ -47,6 +47,7 @@
   public static final String BATCH_PREFIX = "_batchno";
   private static final String LOCK_DIR = "LockFiles";
 
+  public static final String SEGMENTS_FOLDER = "segments";

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3916: [CARBONDATA-3935]Support partition table transactional write in presto

2020-10-27 Thread GitBox


ajantha-bhat commented on a change in pull request #3916:
URL: https://github.com/apache/carbondata/pull/3916#discussion_r512505051



##
File path: 
core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
##
@@ -369,7 +369,8 @@ private CarbonCommonConstants() {
   public static final String CARBON_MERGE_INDEX_IN_SEGMENT =
   "carbon.merge.index.in.segment";
 
-  public static final String CARBON_MERGE_INDEX_IN_SEGMENT_DEFAULT = "true";
+  // TODO: revert this after proper fix in this PR

Review comment:
   please revert this

##
File path: 
integration/hive/src/main/java/org/apache/carbondata/hive/MapredCarbonOutputCommitter.java
##
@@ -123,13 +132,46 @@ public void commitJob(JobContext jobContext) throws IOException {
     try {
       Configuration configuration = jobContext.getConfiguration();
       CarbonLoadModel carbonLoadModel = MapredCarbonOutputFormat.getLoadModel(configuration);
-      ThreadLocalSessionInfo.unsetAll();
-      SegmentFileStore.writeSegmentFile(carbonLoadModel.getCarbonDataLoadSchema().getCarbonTable(),
-          carbonLoadModel.getSegmentId(), String.valueOf(carbonLoadModel.getFactTimeStamp()));
-      SegmentFileStore
-          .mergeIndexAndWriteSegmentFile(carbonLoadModel.getCarbonDataLoadSchema().getCarbonTable(),
-          carbonLoadModel.getSegmentId(), String.valueOf(carbonLoadModel.getFactTimeStamp()));
-      CarbonTableOutputFormat.setLoadModel(configuration, carbonLoadModel);
+      if (!carbonLoadModel.getCarbonDataLoadSchema().getCarbonTable().isHivePartitionTable()) {
+        ThreadLocalSessionInfo.unsetAll();
+        SegmentFileStore
+            .writeSegmentFile(carbonLoadModel.getCarbonDataLoadSchema().getCarbonTable(),
+                carbonLoadModel.getSegmentId(), String.valueOf(carbonLoadModel.getFactTimeStamp()));
+        SegmentFileStore.mergeIndexAndWriteSegmentFile(
+            carbonLoadModel.getCarbonDataLoadSchema().getCarbonTable(),
+            carbonLoadModel.getSegmentId(), String.valueOf(carbonLoadModel.getFactTimeStamp()));
+        CarbonTableOutputFormat.setLoadModel(configuration, carbonLoadModel);
+      } else {
+        String tableFactLocation =
+            carbonLoadModel.getCarbonDataLoadSchema().getCarbonTable().getTablePath();
+        List<CarbonFile> carbonFiles =
+            FileFactory.getCarbonFile(tableFactLocation).listFiles(true, new CarbonFileFilter() {

Review comment:
   Listing files of the whole table can be very slow when multiple segments are 
present; we also list previous load segments here. I think we need to keep the 
list of index/merge index files created by the current load in memory and write 
it into the segment file here.
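
The suggestion above could be sketched roughly like this (a hypothetical class, not a CarbonData API): writer tasks register each index file they create, and commit reads only that in-memory list instead of listing the whole table path:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch of the reviewer's suggestion: remember only the index/merge-index
// files produced by the current load, instead of listing the table path at
// commit time. Class and method names are illustrative only.
public class CurrentLoadIndexTracker {

  // insertion-ordered, de-duplicated list of index files for this load
  private final Set<String> indexFiles = new LinkedHashSet<>();

  // called by each writer task as it produces an index file
  void register(String indexFilePath) {
    indexFiles.add(indexFilePath);
  }

  // consumed once at commit time when writing the segment file
  Set<String> indexFilesForSegmentFile() {
    return indexFiles;
  }

  public static void main(String[] args) {
    CurrentLoadIndexTracker tracker = new CurrentLoadIndexTracker();
    tracker.register("/t1/part=1/0_1603775234.carbonindex");
    tracker.register("/t1/part=2/0_1603775234.carbonindex");
    System.out.println(tracker.indexFilesForSegmentFile().size());
  }
}
```

This keeps commit cost proportional to the current load rather than to the table's history.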

##
File path: 
integration/hive/src/main/java/org/apache/carbondata/hive/MapredCarbonOutputFormat.java
##
@@ -115,9 +115,20 @@ public void checkOutputSpecs(FileSystem fileSystem, JobConf jobConf) throws IOException
         carbonLoadModel.getCarbonDataLoadSchema().getCarbonTable().getPartitionInfo();
     final int partitionColumn =
         partitionInfo != null ? partitionInfo.getColumnSchemaList().size() : 0;
+    final String finalOutputPath;
     if (carbonLoadModel.getCarbonDataLoadSchema().getCarbonTable().isHivePartitionTable()) {
-      carbonLoadModel.getMetrics().addToPartitionPath(finalOutPath.toString());
-      context.getConfiguration().set("carbon.outputformat.writepath", finalOutPath.toString());
+      String[] outputPathSplits = finalOutPath.toString().split("/");
+      StringBuilder partitionDirs = new StringBuilder();
+      for (int i = partitionColumn; i > 0; i--) {

Review comment:
   So, how does carbondata-hive partition write work?

##
File path: 
integration/presto/src/test/scala/org/apache/carbondata/presto/integrationtest/PrestoInsertIntoPartitionTableTestCase.scala
##
@@ -0,0 +1,184 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.presto.integrationtest
+
+import java.io.File
+import java.util
+import java.util.concurrent.atomic.AtomicInteger
+
+import scala.collection.JavaConverters._
+
+import org.scalatest.{BeforeAndAfterAll, BeforeAndAfterEach, FunSuiteLike}
+

[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata

2020-10-27 Thread GitBox


vikramahuja1001 commented on a change in pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#discussion_r512551259



##
File path: 
core/src/main/java/org/apache/carbondata/core/util/DeleteLoadFolders.java
##
@@ -192,11 +208,17 @@ private static boolean checkIfLoadCanBeDeleted(LoadMetadataDetails oneLoad,
   }
 
   private static boolean checkIfLoadCanBeDeletedPhysically(LoadMetadataDetails oneLoad,
-      boolean isForceDelete) {
+      boolean isForceDelete, AbsoluteTableIdentifier absoluteTableIdentifier) {
     // Check if the segment is added externally and path is set then do not delete it
     if ((SegmentStatus.MARKED_FOR_DELETE == oneLoad.getSegmentStatus()
-        || SegmentStatus.COMPACTED == oneLoad.getSegmentStatus()) && (oneLoad.getPath() == null
+        || SegmentStatus.COMPACTED == oneLoad.getSegmentStatus() || SegmentStatus
+        .INSERT_IN_PROGRESS == oneLoad.getSegmentStatus()) && (oneLoad.getPath() == null

Review comment:
   i am not sure about this. maybe we can discuss with  @ajantha-bhat or 
@akashrn5 once?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata

2020-10-27 Thread GitBox


vikramahuja1001 commented on a change in pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#discussion_r512551385



##
File path: 
core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
##
@@ -1427,6 +1428,25 @@ private CarbonCommonConstants() {
 
   public static final String BITSET_PIPE_LINE_DEFAULT = "true";
 
+  public static final long MILLIS_SECONDS_IN_A_DAY = TimeUnit.DAYS.toMillis(1);

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata

2020-10-27 Thread GitBox


vikramahuja1001 commented on a change in pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#discussion_r512551780



##
File path: 
core/src/main/java/org/apache/carbondata/core/metadata/SegmentFileStore.java
##
@@ -1105,28 +1109,79 @@ public static void cleanSegments(CarbonTable table, List<PartitionSpec> partitio
    * @throws IOException
    */
   public static void deleteSegment(String tablePath, Segment segment,
-      List<PartitionSpec> partitionSpecs,
-      SegmentUpdateStatusManager updateStatusManager) throws Exception {
+      List<PartitionSpec> partitionSpecs, SegmentUpdateStatusManager updateStatusManager,
+      SegmentStatus segmentStatus, Boolean isPartitionTable, String timeStamp)
+      throws Exception {
     SegmentFileStore fileStore = new SegmentFileStore(tablePath, segment.getSegmentFileName());
     List<String> indexOrMergeFiles = fileStore.readIndexFiles(SegmentStatus.SUCCESS, true,
         FileFactory.getConfiguration());
+    List<String> filesToDelete = new ArrayList<>();
     Map<String, List<String>> indexFilesMap = fileStore.getIndexFilesMap();
     for (Map.Entry<String, List<String>> entry : indexFilesMap.entrySet()) {
-      FileFactory.deleteFile(entry.getKey());
+      // Move the file to the trash folder in case the segment status is insert in progress
+      if (segmentStatus == SegmentStatus.INSERT_IN_PROGRESS) {
+        if (!isPartitionTable) {
+          TrashUtil.copyDataToTrashFolderByFile(tablePath, entry.getKey(), timeStamp +

Review comment:
   Copying the whole segment at once for normal tables, but in the case of a 
partition table, doing it at file level.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata

2020-10-27 Thread GitBox


vikramahuja1001 commented on a change in pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#discussion_r512550744



##
File path: 
core/src/main/java/org/apache/carbondata/core/util/DeleteLoadFolders.java
##
@@ -138,8 +143,19 @@ public boolean accept(CarbonFile file) {
           if (filesToBeDeleted.length == 0) {
             status = true;
           } else {
-
             for (CarbonFile eachFile : filesToBeDeleted) {
+              // If the file to be deleted is a carbondata file, index file, index merge file
+              // or a delta file, copy that file to the trash folder.
+              if ((eachFile.getName().endsWith(CarbonCommonConstants.FACT_FILE_EXT) ||

Review comment:
   Copying segment-wise in the case of a normal table; in the partition flow, 
I have kept it file by file.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata

2020-10-27 Thread GitBox


vikramahuja1001 commented on a change in pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#discussion_r512550325



##
File path: 
core/src/main/java/org/apache/carbondata/core/metadata/SegmentFileStore.java
##
@@ -1105,28 +1109,79 @@ public static void cleanSegments(CarbonTable table, List<PartitionSpec> partitio
    * @throws IOException
    */
   public static void deleteSegment(String tablePath, Segment segment,
-      List<PartitionSpec> partitionSpecs,
-      SegmentUpdateStatusManager updateStatusManager) throws Exception {
+      List<PartitionSpec> partitionSpecs, SegmentUpdateStatusManager updateStatusManager,
+      SegmentStatus segmentStatus, Boolean isPartitionTable, String timeStamp)
+      throws Exception {
     SegmentFileStore fileStore = new SegmentFileStore(tablePath, segment.getSegmentFileName());
     List<String> indexOrMergeFiles = fileStore.readIndexFiles(SegmentStatus.SUCCESS, true,
         FileFactory.getConfiguration());
+    List<String> filesToDelete = new ArrayList<>();
     Map<String, List<String>> indexFilesMap = fileStore.getIndexFilesMap();
     for (Map.Entry<String, List<String>> entry : indexFilesMap.entrySet()) {
-      FileFactory.deleteFile(entry.getKey());
+      // Move the file to the trash folder in case the segment status is insert in progress
+      if (segmentStatus == SegmentStatus.INSERT_IN_PROGRESS) {

Review comment:
   For the normal table flow, I have changed it to copy to trash by segment; but 
in the case of a partition table, copying to trash file by file, because 
otherwise I would have to read the segment file to get the desired carbondata 
and index files per segment, which would increase the IO time.
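
As an illustration of the file-level copy described above, here is a plain-java sketch (CarbonData's real TrashUtil works through its FileFactory abstraction, so treat the paths and helper names here as assumptions):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Illustrative "copy one data file into a timestamped trash subfolder";
// the ".Trash" layout is taken from this thread, the helper is hypothetical.
public class CopyToTrashSketch {

  static Path copyToTrash(Path tablePath, Path dataFile, String timestamp) throws IOException {
    Path target = tablePath.resolve(".Trash").resolve(timestamp)
        .resolve(dataFile.getFileName());
    Files.createDirectories(target.getParent());  // create the timestamp subfolder lazily
    return Files.copy(dataFile, target, StandardCopyOption.REPLACE_EXISTING);
  }

  public static void main(String[] args) throws IOException {
    Path table = Files.createTempDirectory("t1");
    Path data = Files.createTempFile(table, "part-0-0", ".carbondata");
    Path inTrash = copyToTrash(table, data, "1603763776");
    System.out.println(Files.exists(inTrash));
  }
}
```

Doing this per file avoids re-reading the segment file, at the cost of one copy call per data/index file.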





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata

2020-10-27 Thread GitBox


vikramahuja1001 commented on a change in pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#discussion_r512549332



##
File path: 
integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/cleanfiles/TestCleanFileCommand.scala
##
@@ -0,0 +1,484 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ *
+ */
+
+package org.apache.carbondata.spark.testsuite.cleanfiles
+
+import java.io.{File, PrintWriter}
+import java.util
+import java.util.List
+
+import org.apache.carbondata.cleanfiles.CleanFilesUtil
+import org.apache.carbondata.core.constants.CarbonCommonConstants
+import org.apache.carbondata.core.datastore.filesystem.CarbonFile
+import org.apache.carbondata.core.datastore.impl.FileFactory
+import org.apache.carbondata.core.util.CarbonUtil
+import org.apache.spark.sql.{CarbonEnv, Row}
+import org.apache.spark.sql.test.util.QueryTest
+import org.scalatest.BeforeAndAfterAll
+
+import scala.io.Source
+
+class TestCleanFileCommand extends QueryTest with BeforeAndAfterAll {
+
+  var count = 0
+
+  test("clean up table and test trash folder with In Progress segments") {
+sql("""DROP TABLE IF EXISTS CLEANTEST""")
+sql("""DROP TABLE IF EXISTS CLEANTEST1""")
+sql(
+  """
+| CREATE TABLE cleantest (name String, id Int)
+| STORED AS carbondata
+  """.stripMargin)
+sql(s"""INSERT INTO CLEANTEST SELECT "abc", 1""")
+sql(s"""INSERT INTO CLEANTEST SELECT "abc", 1""")
+sql(s"""INSERT INTO CLEANTEST SELECT "abc", 1""")
+// run a select query before deletion
+checkAnswer(sql(s"""select count(*) from cleantest"""),
+  Seq(Row(3)))
+
+    val path = CarbonEnv.getCarbonTable(Some("default"), "cleantest")(sqlContext.sparkSession)
+      .getTablePath
+    val tableStatusFilePath = path + CarbonCommonConstants.FILE_SEPARATOR + "Metadata" +
+      CarbonCommonConstants.FILE_SEPARATOR + "tableStatus"
+editTableStatusFile(path)
+val trashFolderPath = path + CarbonCommonConstants.FILE_SEPARATOR +
+  CarbonCommonConstants.CARBON_TRASH_FOLDER_NAME
+
+assert(!FileFactory.isFileExist(trashFolderPath))
+    val dryRun = sql(s"CLEAN FILES FOR TABLE cleantest OPTIONS('isDryRun'='true')").count()
+// dry run shows 3 segments to move to trash
+assert(dryRun == 3)
+
+sql(s"CLEAN FILES FOR TABLE cleantest").show
+
+checkAnswer(sql(s"""select count(*) from cleantest"""),
+  Seq(Row(0)))
+assert(FileFactory.isFileExist(trashFolderPath))
+var list = getFileCountInTrashFolder(trashFolderPath)
+assert(list == 6)
+
+    val dryRun1 = sql(s"CLEAN FILES FOR TABLE cleantest OPTIONS('isDryRun'='true')").count()
+sql(s"CLEAN FILES FOR TABLE cleantest").show
+
+count = 0
+list = getFileCountInTrashFolder(trashFolderPath)
+// no carbondata file is added to the trash
+assert(list == 6)
+
+
+val timeStamp = getTimestampFolderName(trashFolderPath)
+
+// recovering data from trash folder
+sql(
+  """
+| CREATE TABLE cleantest1 (name String, id Int)
+| STORED AS carbondata
+  """.stripMargin)
+
+    val segment0Path = trashFolderPath + CarbonCommonConstants.FILE_SEPARATOR + timeStamp +
+      CarbonCommonConstants.FILE_SEPARATOR + CarbonCommonConstants.LOAD_FOLDER + '0'
+    val segment1Path = trashFolderPath + CarbonCommonConstants.FILE_SEPARATOR + timeStamp +
+      CarbonCommonConstants.FILE_SEPARATOR + CarbonCommonConstants.LOAD_FOLDER + '1'
+    val segment2Path = trashFolderPath + CarbonCommonConstants.FILE_SEPARATOR + timeStamp +
+      CarbonCommonConstants.FILE_SEPARATOR + CarbonCommonConstants.LOAD_FOLDER + '2'
+
+sql(s"alter table cleantest1 add segment options('path'='$segment0Path'," +
+  s"'format'='carbon')").show()
+sql(s"alter table cleantest1 add segment options('path'='$segment1Path'," +
+  s"'format'='carbon')").show()
+sql(s"alter table cleantest1 add segment options('path'='$segment2Path'," +
+  s"'format'='carbon')").show()
+sql(s"""INSERT INTO CLEANTEST SELECT * from cleantest1""")
+
+// test after recovering data from trash
+checkAnswer(sql(s"""select count(*) from cleantest"""),
+  Seq(Row(3)))
+
+sql(s"CLEAN FILES FOR 

[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata

2020-10-27 Thread GitBox


vikramahuja1001 commented on a change in pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#discussion_r512548768



##
File path: docs/cleanfiles.md
##
@@ -0,0 +1,78 @@
+
+
+
+## CLEAN FILES
+
+Clean files command is used to remove the Compacted, Marked For Delete, stale In Progress and
+Partial (segments which are missing from the table status file but whose data is present)
+segments from the store.
+ 
+ Clean Files Command
+   ```
+   CLEAN FILES ON TABLE TABLE_NAME
+   ```
+
+
+### TRASH FOLDER
+
+  Carbondata supports a Trash Folder, a redundant folder to which all the unnecessary files and folders are moved during the clean files operation.
+  This trash folder is maintained inside the table path. It is a hidden folder (.Trash). The segments that are moved to the trash folder are maintained under a timestamp
+  subfolder (the timestamp at which the clean files operation was called). This helps the user to list down segments by timestamp. By default, every timestamp sub-directory has an expiration
+  time of 3 days (since that timestamp), which can be configured by the user using the following carbon property:
+   ```
+   carbon.trash.expiration.time = "Number of days"
+   ``` 
+  Once the timestamp subdirectory has expired as per the configured expiration day value, the subdirectory is deleted from the trash folder in the subsequent clean files command.
+  
+
+
+
+### DRY RUN
+  Support for dry run is provided before the actual clean files operation. This dry run operation will list down all the segments which are going to be manipulated during
+  the clean files operation. The dry run result will show the current location of the segment (it can be in the FACT folder, partition folder or trash folder) and where that segment
+  will be moved (to the trash folder, or deleted from the store) once the actual operation is called.
+  
+
+  ```
+  CLEAN FILES ON TABLE TABLE_NAME options('dry_run'='true')
+  ```
+
+### FORCE DELETE TRASH
+The force option with clean files command deletes all the files and folders 
from the trash folder.
+
+  ```
+  CLEAN FILES ON TABLE TABLE_NAME options('force'='true')
+  ```
+
+### DATA RECOVERY FROM THE TRASH FOLDER
+
+The segments can be recovered from the trash folder by creating an external table from the desired segment location.

Review comment:
   changed

##
File path: docs/dml-of-carbondata.md
##
@@ -552,3 +553,50 @@ CarbonData DML statements are documented here,which 
includes:
   ```
   CLEAN FILES FOR TABLE carbon_table
   ```
+
+## CLEAN FILES
+
+  Clean files command is used to remove the Compacted and Marked
+  For Delete segments from the store. Carbondata also supports a Trash
+  Folder to which all the stale data is moved after clean files
+  is called.
+
+  There are several types of compaction
+
+  ```
+  CLEAN FILES ON TABLE TableName
+  ```

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #3917: [CARBONDATA-3978] Clean Files Refactor and support for trash folder in carbondata

2020-10-27 Thread GitBox


vikramahuja1001 commented on a change in pull request #3917:
URL: https://github.com/apache/carbondata/pull/3917#discussion_r512545772



##
File path: docs/cleanfiles.md
##
@@ -0,0 +1,78 @@
+
+
+
+## CLEAN FILES
+
+Clean files command is used to remove the Compacted, Marked For Delete, stale In Progress and
+Partial (segments which are missing from the table status file but whose data is present)
+segments from the store.
+ 
+ Clean Files Command
+   ```
+   CLEAN FILES ON TABLE TABLE_NAME

Review comment:
   changed





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3988: [CARBONDATA-4037] Improve the table status and segment file writing

2020-10-27 Thread GitBox


CarbonDataQA1 commented on pull request #3988:
URL: https://github.com/apache/carbondata/pull/3988#issuecomment-717104688


   Build Failed  with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2942/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3988: [CARBONDATA-4037] Improve the table status and segment file writing

2020-10-27 Thread GitBox


CarbonDataQA1 commented on pull request #3988:
URL: https://github.com/apache/carbondata/pull/3988#issuecomment-717095810


   Build Failed  with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4699/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] akashrn5 edited a comment on pull request #3916: [CARBONDATA-3935]Support partition table transactional write in presto

2020-10-27 Thread GitBox


akashrn5 edited a comment on pull request #3916:
URL: https://github.com/apache/carbondata/pull/3916#issuecomment-717074706


   @ajantha-bhat @QiangCai i have rebased, please have a look.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] akashrn5 commented on pull request #3916: [CARBONDATA-3935]Support partition table transactional write in presto

2020-10-27 Thread GitBox


akashrn5 commented on pull request #3916:
URL: https://github.com/apache/carbondata/pull/3916#issuecomment-717074706


   @ajantha-bhat i have rebased, please have a look.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (CARBONDATA-4044) Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-27 Thread Yahui Liu (Jira)


[ 
https://issues.apache.org/jira/browse/CARBONDATA-4044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221222#comment-17221222
 ] 

Yahui Liu commented on CARBONDATA-4044:
---

I have faced this issue and tried to solve it. Currently we call 
CarbonLoaderUtil.checkAndCreateCarbonDataLocation to check and create the 
Segment_XXX folder (if it does not exist), but we don't check whether stale data 
exists in the segment folder when the Segment_XXX folder already exists. My idea 
is to always remove the Segment_XXX folder before creating it again. That will 
make sure there is no stale data. Is this solution valid for all cases? Please 
share your ideas with me. Thanks. 
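
The delete-before-create idea could be sketched as follows in plain java.nio (purely illustrative; the real fix would go through CarbonLoaderUtil/FileFactory, and whether this is safe under concurrent loads is exactly the open question above):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Sketch: always drop a pre-existing Segment_XXX folder before (re)creating
// it, so no stale index/data files can leak into the new load.
public class FreshSegmentFolderSketch {

  static Path recreateSegmentFolder(Path segmentPath) throws IOException {
    if (Files.exists(segmentPath)) {
      try (Stream<Path> walk = Files.walk(segmentPath)) {
        // delete children before parents
        walk.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
      }
    }
    return Files.createDirectories(segmentPath);  // guaranteed empty now
  }
}
```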

> Fix dirty data in indexfile while IUD with stale data in segment folder
> ---
>
> Key: CARBONDATA-4044
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4044
> Project: CarbonData
>  Issue Type: Bug
>Reporter: Xingjun Hao
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> XX.mergecarbonindex and XX.segment record the index file list of a segment. 
> Now, we generate xx.mergeindexfile and xx.segment based on filtering out all 
> index files (including carbonindex and mergecarbonindex), which will lead to 
> dirty data when there is stale data in the segment folder.
> For example, there is a stale index file in the segment_0 folder, 
> "0_1603763776.carbonindex".
> While loading, a new carbonindex "0_16037752342.carbonindex" is written; when 
> merging carbonindex files, we expect to merge only 0_16037752342.carbonindex. 
> But if we filter out all carbonindex files in the segment folder, both 
> "0_1603763776.carbonindex" and 0_16037752342.carbonindex will be merged and 
> recorded into the segment file.
>  
> While updating, there is the same problem. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [carbondata] akashrn5 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

2020-10-27 Thread GitBox


akashrn5 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-717054658


   > We tested with 10 segments (10G in total) and ran UPDATE more than 20 times to see 
   > the cost; the updated row count of each run is 176973.
   > The result and comparison before and after the improvement are shown below; we 
   > can see that the time consumed by UPDATE reduces by about 30% after improving 
   > horizontal compaction.
   > 
   > | UPDATE run | Time before improving (s) | Time after improving (s) |
   > | --- | --- | --- |
   > | Average (2nd to 21st) | **99.19** | **67.55** |
   > | 1st | 38.397 | 71.91 |
   > | 2nd | 74.918 | 67.386 |
   > | 3rd | 43.851 | 46.171 |
   > | 4th | 52.133 | 82.974 |
   > | 5th | 86.021 | 44.499 |
   > | 6th | 62.992 | 39.973 |
   > | 7th | 99.077 | 69.292 |
   > | 8th | 77.815 | 60.416 |
   > | 9th | 74.949 | 50.187 |
   > | 10th | 85.874 | 50.973 |
   > | 11th | 103.902 | 58.005 |
   > | 12th | 77.534 | 86.087 |
   > | 13th | 79.154 | 76.283 |
   > | 14th | 116.117 | 72.029 |
   > | 15th | 112.846 | 74.182 |
   > | 16th | 124.707 | 69.282 |
   > | 17th | 124.677 | 67.546 |
   > | 18th | 147.15 | 66.652 |
   > | 19th | 135.127 | 109.385 |
   > | 20th | 133.94 | 80.849 |
   > | 21st | 171.112 | 78.92 |
   
   Cool



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3970: [CARBONDATA-4007] Fix multiple issues in SDK

2020-10-27 Thread GitBox


CarbonDataQA1 commented on pull request #3970:
URL: https://github.com/apache/carbondata/pull/3970#issuecomment-717049704


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2941/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3970: [CARBONDATA-4007] Fix multiple issues in SDK

2020-10-27 Thread GitBox


CarbonDataQA1 commented on pull request #3970:
URL: https://github.com/apache/carbondata/pull/3970#issuecomment-717049604


   Build Success with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4698/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] marchpure commented on pull request #3997: [CARBONDATA-4045] Add TPCDS TestCase for Spark on CarbonData Integration Test

2020-10-27 Thread GitBox


marchpure commented on pull request #3997:
URL: https://github.com/apache/carbondata/pull/3997#issuecomment-717045718


   > @marchpure : The reason why we don't have TPCH and TPCDS in UT is we need 
a huge data set, loading huge data takes time.
   > 
   > What is the reason behind adding this? we can anyways have separate TPCH 
or TPCDS machines that can have automation script to give performance benchmark 
on every release
   > 
   > Also no need to run TPCH and TPCDS on every PR builder. Running once per 
release is enough.
   > 
   > @QiangCai , @kunal642 : What's your opinion on this?
   
   1. The TPCDS dataset in this PR is really small (totally 33KB), so it won't 
take much time to load and query. It may help to avoid possible issues with 
acceptable overhead. 
   2. It also helps us to debug TPCDS: explain plan / analyse in a local 
environment. 
   
   The inspiration to add the TPCDS test case is CARBONDATA-4008, whose issue is 
that Spark on CarbonData will fail in TPCDS Query 83. This issue seems to have 
been there for a long time, which implies that our UT is not enough. 
   I believe that we can add a profile to turn on/off the TPCDS test in the 
future if the automatic TPCDS machine is ready.
   
   Maybe we can have a module named 'carbondata-integration-test'?











[GitHub] [carbondata] dependabot[bot] commented on pull request #3456: Bump solr.version from 6.3.0 to 8.3.0 in /datamap/lucene

2020-10-27 Thread GitBox


dependabot[bot] commented on pull request #3456:
URL: https://github.com/apache/carbondata/pull/3456#issuecomment-717039451


   Dependabot tried to update this pull request, but something went wrong. 
We're looking into it, but in the meantime you can retry the update by 
commenting `@dependabot rebase`.







[GitHub] [carbondata] dependabot[bot] commented on pull request #3447: Bump dep.jackson.version from 2.6.5 to 2.10.1 in /store/sdk

2020-10-27 Thread GitBox


dependabot[bot] commented on pull request #3447:
URL: https://github.com/apache/carbondata/pull/3447#issuecomment-717038691


   Dependabot tried to update this pull request, but something went wrong. 
We're looking into it, but in the meantime you can retry the update by 
commenting `@dependabot rebase`.







[GitHub] [carbondata] ajantha-bhat commented on pull request #3997: [CARBONDATA-4045] Add TPCDS TestCase for Spark on CarbonData Integration Test

2020-10-27 Thread GitBox


ajantha-bhat commented on pull request #3997:
URL: https://github.com/apache/carbondata/pull/3997#issuecomment-717031195


   @marchpure : The reason why we don't have TPCH and TPCDS in UT is that we need a huge data set, and loading huge data takes time.
   
   What is the reason behind adding this? we can anyways have separate TPCH or 
TPCDS machines that can have automation script to give performance benchmark on 
every release
   
   Also no need to run TPCH and TPCDS on every PR builder.  Running once per 
release is enough.
   
   @QiangCai , @kunal642 : What's your opinion on this?







[GitHub] [carbondata] Zhangshunyu commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

2020-10-27 Thread GitBox


Zhangshunyu commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-717027548


   LGTM







[GitHub] [carbondata] Zhangshunyu commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

2020-10-27 Thread GitBox


Zhangshunyu commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-717025630


   > We tested with 10 segments total 10G and update more than 20 times to see 
the cost, update row count of each time is 176973.
   > The result and comparison before and after improving is shown below, we 
can see that time-consuming of UPDATE reduces by about 30% after improving 
horizontal compaction.
   > 
   > |   | Time Before Improving | Time after Improving |
   > | --- | --- | --- |
   > | Average time from 2nd to 21st UPDATE | **99.19** | **67.55** |
   > | UPDATE | | |
   > | 1st | 38.397 | 71.91 |
   > | 2nd | 74.918 | 67.386 |
   > | 3rd | 43.851 | 46.171 |
   > | 4th | 52.133 | 82.974 |
   > | 5th | 86.021 | 44.499 |
   > | 6th | 62.992 | 39.973 |
   > | 7th | 99.077 | 69.292 |
   > | 8th | 77.815 | 60.416 |
   > | 9th | 74.949 | 50.187 |
   > | 10th | 85.874 | 50.973 |
   > | 11th | 103.902 | 58.005 |
   > | 12th | 77.534 | 86.087 |
   > | 13th | 79.154 | 76.283 |
   > | 14th | 116.117 | 72.029 |
   > | 15th | 112.846 | 74.182 |
   > | 16th | 124.707 | 69.282 |
   > | 17th | 124.677 | 67.546 |
   > | 18th | 147.15 | 66.652 |
   > | 19th | 135.127 | 109.385 |
   > | 20th | 133.94 | 80.849 |
   > | 21st | 171.112 | 78.92 |
   
   great!







[GitHub] [carbondata] shenjiayu17 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update

2020-10-27 Thread GitBox


shenjiayu17 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-717024597


   We tested with 10 segments (10 GB in total) and ran UPDATE more than 20 times to measure the cost; the updated row count of each run is 176973.
   The results before and after the improvement are shown below; we can see that the UPDATE time is reduced by about 30% after improving horizontal compaction.
   
   |   | Time Before Improving | Time after Improving | 
   | --- | - | - | 
   | Average time  from 2nd to 21st UPDATE | **99.19** | **67.55** |
   | UPDATE  | | |
   | 1st | 38.397  |  71.91   |
   | 2nd | 74.918  | 67.386   |
   | 3rd | 43.851  | 46.171   |
   | 4th | 52.133  | 82.974   |
   | 5th | 86.021  | 44.499   |
   | 6th | 62.992  | 39.973   |
   | 7th | 99.077  | 69.292   |
   | 8th | 77.815  | 60.416   |
   | 9th | 74.949  | 50.187   |
   | 10th | 85.874  | 50.973   |
   | 11th | 103.902  | 58.005   |
   | 12th | 77.534  | 86.087   |
   | 13th | 79.154  | 76.283   |
   | 14th | 116.117  | 72.029   |
   | 15th | 112.846  | 74.182   |
   | 16th | 124.707  | 69.282   |
   | 17th | 124.677  | 67.546   |
   | 18th | 147.15   | 66.652   |
   | 19th | 135.127  | 109.385   |
   | 20th | 133.94   | 80.849   |
   | 21st | 171.112  | 78.92  |
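   For reference, the averages and the ~30% claim can be recomputed directly from the table above (a quick Python check; the numbers are copied verbatim from the table):

```python
# Recompute the 2nd-21st UPDATE averages and the percentage reduction
# reported for the horizontal-compaction improvement.
before = [38.397, 74.918, 43.851, 52.133, 86.021, 62.992, 99.077, 77.815,
          74.949, 85.874, 103.902, 77.534, 79.154, 116.117, 112.846,
          124.707, 124.677, 147.15, 135.127, 133.94, 171.112]
after = [71.91, 67.386, 46.171, 82.974, 44.499, 39.973, 69.292, 60.416,
         50.187, 50.973, 58.005, 86.087, 76.283, 72.029, 74.182, 69.282,
         67.546, 66.652, 109.385, 80.849, 78.92]

# The 1st UPDATE is excluded from the average, as in the table.
avg_before = sum(before[1:]) / 20
avg_after = sum(after[1:]) / 20
reduction = 1 - avg_after / avg_before

print(round(avg_before, 2), round(avg_after, 2), round(reduction * 100, 1))
# → 99.19 67.55 31.9
```

   So the precise figure is a ~31.9% reduction, consistent with the "about 30%" stated above.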
   







[GitHub] [carbondata] ajantha-bhat commented on pull request #3994: [CARBONDATA-4040] Fix data mismatch incase of compaction failure and retry success

2020-10-27 Thread GitBox


ajantha-bhat commented on pull request #3994:
URL: https://github.com/apache/carbondata/pull/3994#issuecomment-717017303


   @marchpure : yes, if update writes into a new segment, it can fix the issue. If you handle this in your PR, then I will close mine.







[GitHub] [carbondata] ajantha-bhat closed pull request #3994: [CARBONDATA-4040] Fix data mismatch incase of compaction failure and retry success

2020-10-27 Thread GitBox


ajantha-bhat closed pull request #3994:
URL: https://github.com/apache/carbondata/pull/3994


   







[GitHub] [carbondata] ajantha-bhat commented on pull request #3994: [CARBONDATA-4040] Fix data mismatch incase of compaction failure and retry success

2020-10-27 Thread GitBox


ajantha-bhat commented on pull request #3994:
URL: https://github.com/apache/carbondata/pull/3994#issuecomment-717017602


   Handled in #3999; I will close this.







[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3997: [CARBONDATA-4045] Add TPCDS TestCase

2020-10-27 Thread GitBox


CarbonDataQA1 commented on pull request #3997:
URL: https://github.com/apache/carbondata/pull/3997#issuecomment-717017353


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2940/
   







[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3997: [CARBONDATA-4045] Add TPCDS TestCase

2020-10-27 Thread GitBox


CarbonDataQA1 commented on pull request #3997:
URL: https://github.com/apache/carbondata/pull/3997#issuecomment-717016610


   Build Success with Spark 2.3.4, Please check CI 
http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4697/
   







[GitHub] [carbondata] marchpure edited a comment on pull request #3994: [CARBONDATA-4040] Fix data mismatch incase of compaction failure and retry success

2020-10-27 Thread GitBox


marchpure edited a comment on pull request #3994:
URL: https://github.com/apache/carbondata/pull/3994#issuecomment-717015851


   > @marchpure : If you dont assign UUID for all update case or compaction 
after update case, all testcases will pass. But the original data mismatch will 
still happen for segment that has stale files and went for update and 
compaction (that scenario you can test)
   
   Yes, I have tested it. In PR #3999, update will generate a new segment (like merge into), which can avoid recreating the segment file.
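   The approach described here (the update registering only its own output files in a brand-new segment, instead of re-listing the old segment folder) can be sketched as a toy model. This is illustrative Python only, not CarbonData code; all file names are made up:

```python
# Toy model: why writing an UPDATE into a NEW segment avoids picking up
# stale data files left behind by a failed job.

def recreate_in_place(segment_dir):
    """Strategy A: rebuild the segment file by listing the segment folder.
    Any stale file left by a failed job is wrongly re-registered."""
    return sorted(segment_dir)  # lists valid AND stale files

def write_new_segment(update_output):
    """Strategy B: register only the files the update job itself wrote,
    in a fresh segment; stale files in old folders are never listed."""
    return sorted(update_output)

segment_0 = {"part-0-100.carbondata",      # valid data file
             "part-0-999.carbondata"}      # stale file from a failed job
update_output = {"part-0-200.carbondata"}  # files written by this update

# Strategy A picks up the stale file (dirty index); Strategy B does not.
assert "part-0-999.carbondata" in recreate_in_place(segment_0)
assert "part-0-999.carbondata" not in write_new_segment(update_output)
```

   The design choice is that the new segment's file list is built from the job's own commit information, so stale files in older folders can never leak into it.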











[GitHub] [carbondata] ajantha-bhat commented on pull request #3994: [CARBONDATA-4040] Fix data mismatch incase of compaction failure and retry success

2020-10-27 Thread GitBox


ajantha-bhat commented on pull request #3994:
URL: https://github.com/apache/carbondata/pull/3994#issuecomment-717014461


   @marchpure : If you don't assign a UUID for all update cases, or for compaction-after-update cases, all test cases will pass. But the original data mismatch will still happen for a segment that has stale files and went through update and compaction (you can test that scenario).







[GitHub] [carbondata] marchpure commented on pull request #3994: [CARBONDATA-4040] Fix data mismatch incase of compaction failure and retry success

2020-10-27 Thread GitBox


marchpure commented on pull request #3994:
URL: https://github.com/apache/carbondata/pull/3994#issuecomment-717010368


   > > Hi, Ajantha. I have fixed this in #3999 , please have a check.
   > 
   > @marchpure: I think your PR is not handling all the scenarios, so update 
scenario and test case will fail (even mine also)
   
   Yes, the update test cases all pass.
   I am handling the other 10 test failures.







[GitHub] [carbondata] asfgit closed pull request #3996: [DOC] Adjust document for partition table

2020-10-27 Thread GitBox


asfgit closed pull request #3996:
URL: https://github.com/apache/carbondata/pull/3996


   


